Automatic Feature Selection for Improved Interpretability on Whole Slide Imaging

Abstract: Deep learning methods are widely used for medical applications to assist medical doctors in their daily routine. While performances reach expert level, interpretability (highlighting how and what a trained model learned and why it makes a specific decision) is the next important challenge that deep learning methods need to answer to be fully integrated in the medical field. In this paper, we address the question of interpretability in the context of whole slide image (WSI) classification by formalizing the design of WSI classification architectures and proposing a piece-wise interpretability approach relying on gradient-based methods, feature visualization and the multiple instance learning context. After training two WSI classification architectures on the Camelyon-16 WSI dataset, highlighting the discriminative features learned, and validating our approach with pathologists, we propose a novel manner of computing interpretability slide-level heat-maps, based on the extracted features, that improves tile-level classification performances. We measure the improvement using the tile-level AUC, which we call the localization AUC, and show an improvement of more than 0.2. We also validate our results with a RemOve And Retrain (ROAR) measure. Then, after studying the impact of the number of features used for heat-map computation, we propose a corrective approach, relying on the activation colocalization of selected features, that improves the performances and the stability of our proposed method.


Introduction
Since their successful application to image classification [1] on ImageNet [2], deep learning methods, especially Convolutional Neural Networks (CNNs), have been extensively used and adapted to efficiently tackle a wide range of health issues [3,4].
Along with these new methods, the recent emergence of Whole Slide Imaging (WSI), i.e., microscopy slides digitized at high resolution, represents a real opportunity for the development of efficient Computer-Aided Diagnosis (CAD) tools to assist pathologists in their work. Indeed, over the last three years, notably thanks to publicly available WSI datasets such as Camelyon-16 [5] and TCGA [6], and in spite of the very large size of these images (generally around 10 gigapixels per slide), deep learning architectures for WSI classification have been developed and proved to be highly effective.
In this work, we are interested in WSI classification architectures that use only the global label (e.g., the diagnosis) for training and require no intermediate information such as cell labeling or tissue segmentation (which are time-consuming annotations). Training is regularized by introducing prior knowledge by design in the architectures, which, in addition, makes the result interpretable. However, interpretability beyond this architectural design remains shallow.
However, interpretability (the ability to provide explanations that are relevant and interpretable by experts in the field) is critical for medical applications in many ways. (i) For routine tools where useful features are well known and subject to a consensus among experts, it is important to show that the same features are used by the trained model in order to gain the confidence of practitioners. (ii) Good explainability would enable us to get the most out of architectural interpretability and thus assist medical doctors more efficiently in their slide reviews. (iii) The ability to train using only slide-level supervision opens a new field we call discovery, which consists in predicting, based on easier-access (e.g., less intrusive) data, outputs that generally require heavy processes or waiting, such as surgery (e.g., prognosis, treatment response). In order to guide experts towards new discoveries, the need for reliable interpretability is obviously high.
This work extends our previous publication in the Workshop on Interpretability of Machine Intelligence in Medical Image Computing (iMIMIC) at the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2020) [7]. The new additional contributions consist of:

1. Further validation of heat-map improvement using a ROAR approach adapted to the Multiple Instance Learning (MIL) context;
2. A study of the impact of the number of features selected to compute heat-maps;
3. A method relying on feature activation colocalization on slides, which enables us to filter out outlier selected features, thus improving results.

Related Work and Motivations
Most common WSI classification architectures deal with these very large images by cutting them into tiles, which is close to the workflow of pathologists, who generally analyze these slides at magnification levels between 5× and 40×.
Recently, as explained in Section 1, architectures that are able to learn using only global slide-level labels have been proposed. They rely on a MIL context, i.e., slides are represented by bags of tiles, with positive bags containing at least one positive tile and negative bags containing only negative tiles. In general, WSI classification methods relying on MIL require preprocessing steps to encode tiles from WSIs as efficiently as possible, using tissue detection and normalization methods. The first preprocessing step generally consists in detecting tissue samples on the WSI, since there are many non-informative tiles that are just white background or artifacts. It can be performed using Otsu thresholding [8], color space transformation (e.g., thresholding on the saturation channel of the HSV color space in [9]), or semantic segmentation using a U-Net [10] (e.g., in [11]). Once tissue is detected, a stain normalization step is generally carried out. The motivation behind this step is to improve transferability to datasets from other hospitals that may use different scanners or staining protocols to create their WSIs [12]. Color deconvolution [13] is the most popular approach [14–16], but other methods such as color transfer [17] or, more recently, cycle-GAN [18] can be applied as well. Finally, tiling and feature extraction (i.e., encoding) are performed to enter the MIL context and reduce the input size. Tiling consists in cutting the detected sample region into patches (called tiles) w.r.t. a grid defined by the tile size, the overlap size and the magnification. Feature extraction uses a CNN pre-trained on another task (almost systematically on ImageNet [2] in the literature) to compute descriptors. For example, among the most popular works, the encoding models used are ResNet [19] in [9,20–24], LeNet [25] in [26], Inception [27,28] in [29], and VGG [30] in [23,31].
There is no clear consensus on which feature extractor gives the best performances; we give some insight on this in the discussion following the conclusion.
For example, CHOWDER [21] (which stands for "Classification of HistOpathology with Weak supervision via Deep fEature aggRegation") is an extension of WELDON [32] (which stands for "WEakly supervised Learning of Deep cOnvolutional neural Networks") and proposes a solution for WSI classification that uses a min-max strategy to guide the training and make the decision. This approach reaches an AUC of 0.858 on Camelyon-16 and 0.915 on TCGA-lung (the subset of the TCGA dataset related to lung cancer). In [26], an attention module [33] is used instead of a min-max layer. AUC values of 0.775 for a breast cancer dataset and 0.968 for a colon cancer dataset were reported. Recently, more works on large datasets proposed architectures that follow the same design [22,24,29,31]. Heat-maps based on intermediate scores computed in these architectures are what we call architectural explainability, which results from prior knowledge of WSI problems introduced by design in the architecture. They are of great interest and have proved to be highly effective, to the point of spotting cancerous lesions that had been missed by experts (in [22]). However, these explanations rely on a single "medical" score, which may limit interpretability with regard to the complex tissue structures that can be found on these slides.
While interpretability for deep learning CNN models is still in its infancy, some methods stand out in the literature. "Feature visualization", first introduced in [34], extended in [35] and then extensively developed in [36], consists of methods aiming at producing visualizations that express, in the most interpretable manner, the features associated with a single neuron or a group of neurons. It can be used to understand the general training of a model. For example, the question of transferring features learned from natural images (ImageNet) to medical images has only recently been investigated [37], despite such transfer being widely used with surprisingly good results. It has also been used to measure how robust a learned feature is [38]. Another type of explainability method consists of attribution methods, i.e., methods that output values reflecting, for each component of the input (e.g., pixels), its contribution to the prediction. They are performed either through perturbation [39] or gradient computation (i.e., measuring the gradient of the output with respect to the input). This second group of methods is gaining more and more attention. In [34], the authors show that the gradient is a good approximation of the saliency of a model and even put forward its potential to perform weakly supervised localization. This work opened a new way of accessing explanations in deep neural networks and motivated a lot of interesting research [40–42]. Mixed together, these explanation methods can provide meaningful and complementary interpretability.
As the quality of explanations improved, the need to quantify these improvements grew with it, which pushed researchers to question interpretability methods and to compare them by measuring their performances. For example, the RemOve And Retrain (ROAR) [43] method consists in removing contributing items (features, pixels, etc.) identified by an interpretability method from training samples to evaluate its completeness and relevance, by measuring the impact of such an ablation on the learning and performance of the model. Tests (cascade randomization, data randomization, etc.) are designed in [44] to show whether interpretability methods are sensitive to model parameters or input. In [45], it is shown that some popular interpretability methods perform a simple partial recovery of the input image, which makes them model- and class-insensitive, as highlighted through adversarial examples.
To the best of our knowledge, much explainability is still to be introduced in WSI classification architectures. In the next section, we present our approach to improve the interpretability of a model trained for WSI classification in histopathology. We rely on gradient-based methods to identify and attribute the importance of features in intermediate descriptors, and on patch visualization for cell-level feature explanations. We also extend feature explanation to the slide level, thus drastically improving tumor localization and medical insights.

Proposed Methods
As introduced in Section 2, WSI classification architectures share a common design inspired by the pathologist's workflow to classify a slide (screening the whole slide at a high magnification level, identifying informative regions and making a decision based on these regions).
We propose to formalize this common design here. Let i be the slide index; each slide is divided into a bag of tiles w.r.t. a tissue detection relying on Otsu segmentation [8] and a non-overlapping grid (see Figure 1), and j is the tile index within each slide. There are four distinct blocks in a typical WSI classification architecture:

1. A feature extractor module f e (typically a CNN architecture) that encodes each tile x i,j into a descriptor d i,j ∈ R N, with N the descriptor size (depending on the feature extractor): d i,j = f e (x i,j ). Note that this block is part of the pre-processing steps that encode the slide into a bag of tile descriptors.
2. A tile scoring module f s that, based on each tile descriptor d i,j , assigns a single score per tile s i,j ∈ R: s i,j = f s (d i,j ).
3. An aggregation module f a that, based on all tile scores s i,j , and sometimes their tile descriptors d i,j , computes a slide descriptor D i ∈ R M, with M the slide descriptor size (depending on the aggregation module): D i = f a ({s i,j } j , {d i,j } j ).
4. A decision module f cls that, based on the slide descriptor D i , makes a class prediction P i ∈ R C, with C the number of classes: P i = f cls (D i ).
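The four-block design above can be sketched numerically. The following is a minimal illustration with toy random weights and toy dimensions (not the trained models or sizes from the paper); a CHOWDER-style min-max aggregation stands in for f a:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, C = 8, 4, 2                # toy descriptor sizes and number of classes

W_s = rng.normal(size=(N,))      # tile scoring weights (f_s as a 1x1 convolution)
W_cls = rng.normal(size=(M, C))  # decision module weights (f_cls)

def f_e(tile):
    """Feature extractor: encodes a tile into a descriptor d in R^N (toy pooling)."""
    return tile.mean(axis=(0, 1))

def f_s(d):
    """Tile scoring module: one scalar score per tile descriptor."""
    return float(d @ W_s)

def f_a(scores, R=2):
    """Aggregation module (min-max style): keep bottom-R and top-R scores."""
    s = np.sort(scores)
    return np.concatenate([s[:R], s[-R:]])   # slide descriptor in R^{2R}

def f_cls(D):
    """Decision module: class probabilities from the slide descriptor."""
    logits = D @ W_cls
    e = np.exp(logits - logits.max())
    return e / e.sum()

tiles = rng.normal(size=(16, 32, 32, N))     # a toy bag of 16 tiles
descriptors = np.stack([f_e(t) for t in tiles])
scores = np.array([f_s(d) for d in descriptors])
P = f_cls(f_a(scores))                       # slide-level prediction
```

Each block only consumes the previous block's output, which is what makes the per-block interpretability analysis of the next paragraphs possible.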
Our approach (illustrated in Figure 2) consists in rewinding explanations from the decision module to tile information by applying interpretability methods and by answering successively the following three questions:

1. Which features of slide descriptors are relevant for a class prediction?
2. With regard to the aggregation module, which features of tile descriptors are responsible for the previously identified relevant slide descriptor features?
3. Are these features of tile descriptors relevant medically and representative of histopathological information?
The first question is answered using an attribution vector A c ∈ R M (one for each class c) computed as the gradient of the component of index c of P i (noted P i,c ) with respect to D i . It enables us to identify a set of relevant positions (corresponding to extracted features) K c = {K c,1 , ..., K c,L } in slide descriptors, i.e., the L (empirically determined) positions in A c with the highest attributions over the slides predicted in class c. Each attribution A c,m at position m (∈ [0; M]) of vector A c is computed as:

A c,m = (1 / |I c |) ∑ i∈I c |∂P i,c / ∂D i,m |, (1)

with I c the set of slides predicted to be in class c and | . | the absolute value.

Then, the second question is also answered using an attribution vector a c ∈ R N computed as the gradient of the tile score s i,j with respect to the tile descriptor d i,j . This enables us to identify feature positions k c = {k c,1 , ..., k c,l } in tile descriptors, i.e., the l (empirically determined) tile descriptor features that are responsible for high activation at the previously identified K c positions in slide descriptors. Each attribution a c,n at position n (∈ [0; N]) of vector a c is computed as:

a c,n = (1 / |J c |) ∑ (i,j)∈J c |∂s i,j / ∂d i,j,n |, (2)

with J c the set of tile positions (i, j) that most activate the K c positions in slide descriptors (threshold empirically determined, see Section 4).

To answer the third question, we rely on feature activation to highlight the features identified as discriminative for the task, by selecting the tiles x i,j that have the highest activation per feature in k c over the whole test set. Along with these tiles, we display, for each position k ∈ k c in tile descriptors, a maximum activation image X k obtained by iteratively tuning pixel values to activate the feature by gradient ascent, as follows: X k is initialized as a uniformly distributed noise image X k 0 ; then, while f e (X k n−1 ) k (the activation at position k) increases, we iterate over n > 0:

X k n = X k n−1 + λ ∇ X f e (X k n−1 ) k , (3)

with λ the gradient ascent step size.

Furthermore, we also propose a new way to compute heat-maps for each slide i.
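The slide-level attribution of Equation (1) can be sketched with finite-difference gradients on a toy linear decision module (names and dimensions are illustrative, not the paper's trained model). For a linear f cls, the gradient of P i,c with respect to D i is exactly the c-th weight column, which this sketch recovers:

```python
import numpy as np

rng = np.random.default_rng(1)
M, C = 6, 2
W = rng.normal(size=(M, C))                  # toy decision module weights

def f_cls(D):
    return D @ W                             # logits; P_{i,c} = (D @ W)[c]

def grad_wrt_D(D, c, eps=1e-5):
    """Central finite-difference gradient of P_{i,c} with respect to D."""
    g = np.zeros_like(D)
    for m in range(len(D)):
        Dp, Dm = D.copy(), D.copy()
        Dp[m] += eps
        Dm[m] -= eps
        g[m] = (f_cls(Dp)[c] - f_cls(Dm)[c]) / (2 * eps)
    return g

# A_c: mean absolute gradient over the toy slides "predicted" in class c
slides_D = rng.normal(size=(10, M))          # toy slide descriptors in I_c
c = 0
A_c = np.mean([np.abs(grad_wrt_D(D, c)) for D in slides_D], axis=0)
K_c = np.argsort(A_c)[::-1][:3]              # L = 3 top-attributed positions
```

In practice, the gradients would come from automatic differentiation through the trained decision module rather than finite differences; the averaging over I c and the top-L selection are the same.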
We note H c,i the map that highlights regions on slide i that explain what has been learned to describe class c based on the identified features. For each slide i and tile j, the heat-map value H c,i,j is computed as the average of the activations d i,j,k (normalized per feature over all tiles of all slides) over the identified features k in k c for class c:

H c,i,j = (1 / |k c |) ∑ k∈k c d̃ i,j,k , (4)

with d̃ i,j,k the normalized activation. These heat-map values (between 0 and 1) can be considered as a prediction scoring system, and thus we propose to compute the Area Under the ROC (Receiver Operating Characteristic) Curve to measure how relevant the interpretability brought by our automatic feature extraction approach is, using ground-truth lesion annotations when available. This localization AUC measures the separability between the class of interest (e.g., "tumor") and the other classes using heat-maps: for a good heat-map, we expect all tiles representative of the class of interest to have a high score and all other tiles to have a low score.
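The heat-map and localization AUC computations can be sketched as follows, on toy descriptors and toy selected positions (the rank-based AUC below is the standard Mann-Whitney formulation, equivalent to the area under the ROC curve):

```python
import numpy as np

rng = np.random.default_rng(2)
n_tiles, N = 200, 16
d = rng.random(size=(n_tiles, N))            # toy tile descriptors (pooled tiles)
k_c = [2, 5, 11]                             # toy selected feature positions

# Normalize each feature over all tiles, then average over selected features
d_norm = (d - d.min(0)) / (d.max(0) - d.min(0) + 1e-8)
H = d_norm[:, k_c].mean(axis=1)              # heat-map value per tile, in [0, 1]

def auc(scores, labels):
    """Localization AUC: rank-based (Mann-Whitney) AUC of scores vs. labels."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels.astype(bool)
    n1, n0 = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

# Toy ground truth, correlated with the selected features by construction
labels = (d[:, k_c].mean(axis=1) > 0.5).astype(int)
loc_auc = auc(H, labels)
```

With real data, `labels` would come from the per-tile lesion annotations and `d` from the ResNet-50 descriptors; only the normalization, averaging and ranking logic carry over.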
We also validate the results obtained with the localization AUC by performing a ROAR analysis adapted to the MIL context (defined in Section 2). Indeed, good heat-maps put forward discriminative tiles, so removing these tiles from the bags should prevent the model from learning. In this context, we propose to gradually remove tiles with a high tile score (by thresholding on tile scores) and to train a model on these new reduced bags. If heat-maps are relevant (i.e., if highlighted tiles represent the class of interest) and complete (i.e., if tiles representing the class of interest all have high tile scores), then slide classification performances should drop, while if heat-maps are not relevant or not complete, performances should remain stable across trainings.
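The bag-reduction step of this MIL-adapted ROAR check can be sketched as follows (toy data; the retraining itself is the expensive part and is omitted):

```python
import numpy as np

def roar_reduce_bag(tile_descriptors, heatmap_scores, threshold):
    """MIL-adapted ROAR ablation: drop the tiles whose heat-map score exceeds
    the threshold. If the heat-map is relevant and complete, retraining on
    the reduced bags should markedly degrade slide classification."""
    keep = heatmap_scores <= threshold
    return tile_descriptors[keep]

rng = np.random.default_rng(3)
bag = rng.normal(size=(50, 8))               # toy bag of 50 tile descriptors
scores = rng.random(50)                      # toy heat-map scores in [0, 1]
reduced = roar_reduce_bag(bag, scores, threshold=0.7)

# Control experiment: randomly permuted scores remove as many tiles, but
# should leave the informative ones available, keeping performances stable
control = roar_reduce_bag(bag, rng.permutation(scores), threshold=0.7)
```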
Further and deeper analysis of the impact of the number of selected features on the quality of the generated heat-maps, presented in the next section, enabled us to propose an additional feature selection block (in Figure 2) to filter out selected outlier features. We will present this method after motivating it with our results.
In the next section, we present MIL architectures and the dataset on which we make our experiments and validate our approach. We also detail intermediate results that enable us to improve the overall interpretability and explanation heat-maps in particular.

Architectures
We validate our approach on two WSI classification trained architectures: CHOWDER and Attention-based classification.
CHOWDER [21] uses a 1 × 1 convolution layer to turn each tile descriptor into a single tile score. These scores are then aggregated using a min-max layer that keeps the top-R and bottom-R scores (empirically, R = 5 gives the best results) to give a slide descriptor (M = 2 × R).
The attention-based architecture [26] uses an attention module (two 1 × 1 convolution layers with 128 and 1 channels, respectively, and a softmax layer) to compute competitive and normalized (summing to 1) tile scores from tile descriptors. Then, the slide descriptor is computed as the sum of tile descriptors weighted by the tile scores (M = N).
Note that in our experiments, the feature extractor is a ResNet-50 [19] (N = 2048) trained on ImageNet; it is part of the preprocessing and is thus not fine-tuned. The decision module is a two-layer fully connected network with 200 and 100 hidden neurons, respectively.

Datasets
We validate our approach using the Camelyon-16 dataset, which contains 345 WSIs divided into 209 "normal" cases and 136 "tumor" cases. This dataset contains slides digitized at 40× magnification, from which we perform sample detection using Otsu thresholding [8] on a thumbnail of the slide downscaled by a factor of 32, keeping tiles that contain at least 50% of foreground pixels w.r.t. the Otsu segmentation. Then, we extract, with regard to a non-overlapping grid, 224 × 224 pixel tiles at 20× magnification without stain normalization. Finally, we pre-compute, for each tile, a 2048-dimensional descriptor using a ResNet-50 model trained on ImageNet, as is done in [9,20–24]. 216 slides are used to train our models, while the remaining 129 slides form the test set used to evaluate the performances of the different trained models.
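The tissue detection step can be sketched with a from-scratch Otsu threshold applied to a toy bimodal "thumbnail" (in practice it runs on the grayscale downscaled slide thumbnail; the 50% foreground rule is the one stated above):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method [8]: pick the threshold maximizing between-class variance."""
    hist, edges = np.histogram(gray, bins=256, range=(0.0, 1.0))
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = 0.0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:t] * centers[:t]).sum() / w0
        mu1 = (p[t:] * centers[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2    # between-class variance
        if var > best_var:
            best_var, best_t = var, centers[t]
    return best_t

def keep_tile(mask_tile, min_foreground=0.5):
    """Keep a tile if at least 50% of its pixels are foreground tissue."""
    return mask_tile.mean() >= min_foreground

# Toy bimodal thumbnail: bright background (~0.9), darker tissue (~0.3)
rng = np.random.default_rng(5)
thumb = np.clip(np.concatenate([rng.normal(0.9, 0.02, 500),
                                rng.normal(0.3, 0.05, 500)]), 0, 1)
t = otsu_threshold(thumb)                    # lands between the two modes
```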

Results on CHOWDER Model
The CHOWDER model performs slide-level classification with an AUC of 0.82. Let us now illustrate and detail the results of our approach on this CHOWDER model, guided by the three questions raised in Section 3.
The first question is "Which slide descriptor features are relevant for a class prediction?", i.e., for CHOWDER, given the M = 10 (R = 5) tile scores forming the slide descriptor (the 5 minimum tile scores and the 5 maximum tile scores), what is the contribution of each of these values to the prediction? Figure 3 shows, as histograms, the distribution of the (5-)min and (5-)max scores w.r.t. predictions over the 129 test slides and highlights that the min scores are the ones that contribute to discriminating between the two classes (i.e., the lower the min scores, the more the slide is predicted as "tumor"). A Mann-Whitney U-test between the score distributions (min and max independently) reveals that the min score distributions per predicted class are statistically different (p < 10⁻³) while the max score distributions are not (p = 0.23). The attribution of the min and max score distributions, in Figure 3, validates this assertion by showing a statistically higher attribution on min tile scores than on max tile scores. After finding that the min scores are the ones describing tumorous regions (and thus that the max scores are used for the "normal" class), we are interested in identifying which features of tile descriptors are mostly responsible for the minimum scores, i.e., for describing the "tumor" class (and the maximum scores for the "normal" class). To address this second question, we use the same gradient-based explanation method on the tile scoring module.
Most minimal tile scores are under −5 (and most maximal tile scores are above 11). For each of these groups of tiles, we compute the average attribution of each of the N = 2048 features in the tile descriptors (extracted by a ResNet-50 trained on ImageNet). Figure 4 shows the resulting distribution of feature attributions and allows us to identify which features are mostly responsible for the min (and max) tile scores, i.e., which have the highest attribution for min- (and max-) scored tiles. Thus, we are able to claim that the features (defined by their position in the descriptor) that are most useful to the trained model for the "tumor" class are 242, 420, 602, 1154, 1644, 1652 and 1866. Following the same process, we identified features 565, 628, 647, 1158 and 1247 as the most contributing features for the "normal" class according to the CHOWDER model.

Results on Attention-Based Model
The attention-based model performs similarly to CHOWDER at slide-level classification, with an AUC of 0.83. Here, tile scores are used to weight how much each tile contributes to describing the slide w.r.t. the medical task the model has been trained on. As we understand that high tile scores should put forward tile descriptors that activate features relevant to the diagnosis, we also understand that, if the attention module does its job well, relevant features should be used by both the attention module and the decision module. Thus, we propose to select features that have a high attribution in both tile descriptors and slide descriptors.
Using gradient-based attribution, we compute the histogram of attributions over the 2048 features of both slide descriptors and high-scored tile descriptors of slides predicted as "tumor" (w.r.t. the class prediction made and the predicted tile score, respectively). Figure 5 shows the selection of features for the "tumor" class, i.e., the attribution of slide descriptors and of high-scored (above 0.1) tile descriptors for slides predicted as "tumor".
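This joint selection can be sketched as an intersection of the top-attributed positions of the two attribution vectors (toy attributions and an illustrative top-k cut-off; the paper's actual selection thresholds on attribution scores):

```python
import numpy as np

def joint_feature_selection(attr_slide, attr_tile, top=3):
    """For the attention model (M = N), select features with high attribution
    in both slide descriptors and high-scored tile descriptors by
    intersecting the top-`top` positions of the two attribution vectors."""
    top_slide = set(np.argsort(attr_slide)[::-1][:top])
    top_tile = set(np.argsort(attr_tile)[::-1][:top])
    return sorted(top_slide & top_tile)

# Toy attributions over 6 features: features 1 and 4 score high in both
attr_slide = np.array([0.1, 0.9, 0.2, 0.8, 0.7, 0.0])
attr_tile = np.array([0.0, 0.8, 0.7, 0.1, 0.9, 0.2])
selected = joint_feature_selection(attr_slide, attr_tile, top=3)
```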

Tile-Level Interpretability
As exposed in the previous paragraph, based on explanations of the decision block, we have been able to identify the 7 and 5 features that are most used by the trained CHOWDER model to make decisions (and we did the same for the attention-based model). Now, we are interested in interpretable information to return to pathologists so that they can use their expertise to understand what these features put forward, histopathologically speaking. We benefited from discussions with two experienced pathologists and report their overall feedback on the interpretable visualizations we proposed. Figure 6 shows the 7 tiles that most activate each feature (over all tiles) and the max activation image, which we expect to reveal what the feature means with regard to the histopathological problem the model has been trained on.
Pathologists agreed that patch-based tile visualizations are highly interpretable and exhibit features that are indeed related to each class [46]. For example, feature 1652 tends to trigger on spindle-shaped cells, which can indeed indicate a metastatic tissue organization. For "normal" tissue features, feature 565 mainly describes clustered lymphocytes, which are preponderant in normal tissue.
Appendix A shows more tile-level visualizations for more features.

Slide-Level Explanations: Heat-Maps
Coherence between patches exposed for a better interpretability led us to think about another way to present features to pathologists. Indeed, since tissues have a coherent and somehow organized structure, a relevant feature for histological problems should be activated in a coherent and somehow organized way over slides. Thus, along with patch-based visualization, we propose to access feature activation heat-maps H c,i over slides, as presented in Section 3. Figure 7 illustrates qualitative results. Quantitatively, we report a tile-level localization AUC of 0.884 for the CHOWDER model and 0.739 for the attention-based model, using feature-based heat-map values (the average normalized feature activation over all features identified for the "tumor" class, see the H c,i,j computation in Section 3) as a "tumor" prediction score, and using the lesion annotations provided by the Camelyon-16 dataset to obtain the ground-truth label per tile. Both AUCs are significantly high, which validates our approach of identifying relevant features and computing heat-maps for interpretation and explanation. Note that the AUC computed using tile scores is 0.684 for the CHOWDER model and 0.421 for the attention-based model (see Table 1). We can also note that there is a gap in interpretability between the CHOWDER model and the attention-based model while classification performances are similar (AUC of 0.82 for CHOWDER and 0.83 for the attention-based model). This gap can be explained by the fact that, in the context of Camelyon-16, identifying one tumorous tile is enough to label a slide as "tumor", so implicit tile classification does not need to be exhaustive to provide meaningful information to the slide-level decision module; however, this lack of exhaustiveness decreases interpretability.
We also validate the better explanation given by our feature-based heat-maps with a ROAR approach adapted to the MIL context. In the context of explanation heat-maps, we expect hot-colored regions (i.e., with high-scored tiles) to be informative and cold-colored regions to be non-informative. So we propose to remove tiles with an increasing threshold and to retrain from scratch (still pre-training from ImageNet) a model that we then evaluate. Thus, for a complete and relevant heat-map method, performances should dramatically drop as high-scored tiles are removed, since we would be removing the informative tiles. By contrast, for irrelevant or incomplete heat-maps, performances should remain unchanged, since informative tiles are still available for learning (for that, we included a control experiment consisting of randomly distributed tile scores). Figure 8 shows the performances of models retrained after the removal of tiles with different thresholds on the heat-maps obtained from the model trained on the full bags. We can observe that our feature-based heat-maps are the ones impacting performances the most, which confirms the results in Table 1. It also confirms that CHOWDER tile score heat-maps are complete and relevant, while attention-based tile score heat-maps are equivalent to random heat-maps, due to the large number of positive tiles receiving low scores from the attention module. Note that attention-based tile scores are not irrelevant, but they are not complete. Indeed, these scores are learned and optimized for slide classification with a competitive approach, which makes them incomplete: most tile scores are pushed to zero, leaving one or two (still relevant) tiles with high scores.

Study of the Impact of the Number of Selected Features
Up to now, the number of selected features has been fixed by hand. We propose to thoroughly study the impact of the number of selected features on the quality of our feature-based heat-maps.
First, we measured the impact with a small number of features, from using only the single most contributing feature up to the first 7 most contributing identified features. We can observe, in Figure 9, that there is an important variability in localization AUC performances depending on the number of features, with a variation between 0.82 and 0.903 (which are still strong performances). We can also see that there are features of interest that make the localization AUC increase (such as feature 602 or feature 1866) and adversarial features that make it decrease (such as feature 1644 or feature 420). Thus, it seems critical to study this problem more deeply. So, as shown in Figure 10, by thresholding at different contribution scores, we select from 1 to (all) 2048 features (according to the distribution in Figure 4).
Figure 10. Impact of the number of selected features on localization AUC (between 0.7 and 0.9) for high numbers of features.
Three behaviors can be identified depending on the number of selected features:
1. If the number of selected features is very low (here, between 1 and 12 features), localization performances are unstable;
2. If the number of selected features is between 1% and 5% of all features, we have a fairly constant regime of performances;
3. If the number of selected features is too high, localization AUC performances drop.
This leads to the conclusion that our method enables us to statistically select a majority of features of interest among the top features identified. Thus, when the number of selected features is low, performances are strongly impacted by the few adversarial features. We could therefore propose to select a fair amount of features to ensure good heat-maps. However, being able to study a small number of features individually (which leads to about 10 min of discussion per feature) really convinced pathologists, while it is not conceivable to ask a medical expert to deeply analyze a lot of features individually.

Approach
This study of the impact of the number of selected features on heat-map quality shows that three kinds of features stand out for the Camelyon-16 dataset among ImageNet features: features of interest, which activate homogeneously and mostly in tumorous regions; adversarial features, which activate homogeneously but not only in tumorous regions; and unrelated features, which either activate non-homogeneously or barely activate over slides. Figure 11 illustrates the difference between features of interest and adversarial features. It also gives another way to think about the third question we put forward ("Are these features of tile descriptors relevant medically and representative of histopathological information?") by introducing a way to measure the potential transfer of each individual feature to a given histopathological problem. Under the hypothesis that our feature-based method enables us to statistically select a majority of features of interest, we should be able to filter out adversarial features that do not colocalize with the feature-based heat-maps (computed as the normalized average of the selected features).
To do so, we propose to measure, for each selected feature individually, the mean absolute error (MAE) between the (normalized) feature activation and the feature-based heat-maps over whole slides. We thus obtain a distribution of MAE values, and we propose to keep only the features in the lower half of this distribution. Figure 12 illustrates the method in the case of the 7 selected features for the CHOWDER model.
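The colocalization filter can be sketched as follows (toy normalized activation maps over a pool of tiles; keeping the lower half of the MAE distribution, as described above):

```python
import numpy as np

def colocalization_filter(feature_maps, heatmap):
    """Keep the features whose activation colocalizes with the feature-based
    heat-map: compute the MAE of each (normalized) per-tile feature map
    against the heat-map and keep the lower half of the MAE distribution."""
    mae = np.abs(feature_maps - heatmap[None, :]).mean(axis=1)
    kept = np.where(mae <= np.median(mae))[0]
    return kept, mae

rng = np.random.default_rng(6)
n_feat, n_tiles = 7, 100
maps = rng.random(size=(n_feat, n_tiles))    # toy normalized feature activations
H = maps.mean(axis=0)                        # feature-based heat-map (their average)
kept, mae = colocalization_filter(maps, H)
```

Features with a high MAE against the averaged heat-map are the candidate adversarial outliers; the filter discards them before recomputing the heat-map from the kept features.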

Results
We performed this filtering on every group of features previously studied for the CHOWDER model. Figure 13 shows the obtained results and confirms the usefulness of this filtering, since the curve obtained with the additional colocalization filtering outperforms the feature-based heat-maps on localization AUC. We repeated this experiment for the attention-based model. Figure 14 confirms the three behaviors identified previously, with an unsettled range of features where localization AUC varies highly, a stable range of features (around 5% of features), and a range of high numbers of features where performances drop. However, we can notice that, due to the lack of slide-level interpretability of the attention-based model, it can happen that in the first range of features the majority of features are adversarial; in this case, colocalization filtering worsens performances. Still, in the vast majority of cases, it improves performances and stabilizes them.

Conclusions
In this paper, we presented our interpretability approach and research applied to WSI classification architectures. We proposed a unified design that gathers the vast majority of WSI classification methods relying on multiple instance learning (MIL). We motivated and applied a gradient-based attribution method to identify features learned to be relevant in intermediate (tile and slide) descriptors. We then showed the relevance of these features through visualization (with dataset patches and maximum activation) and validation by pathologists. These discussions led us to measure interpretability by computing explainability heat-maps over whole slides, taking into account only the identified features. Combining patch-based and slide-based visualization took interpretability to the next level, enabling pathologists to understand the histological meaning of the features used by trained models. They confirmed that our approach gives explanations that are highly meaningful and interpretable, and it enabled them to find that the characteristics used by the model are aligned with what pathologists have discovered over decades of research. Our per-block approach can be applied to any WSI classification pipeline trained on a histopathological problem (and probably beyond, such as biomarker discovery or treatment response prediction) that follows the general design defined in this work, and it deepens our understanding of how WSI pipelines learn. Validating our approach on two distinct architectures supports the generalizability of our approach, and quantifying the improvement of the generated heat-maps through two interpretability measures strengthens this point. Finally, our individual analysis of each feature selected at slide-level enabled us to filter out outlier features, to stabilize interpretability performances, and to automatically select the number of features needed for good heat-maps. This work thus digs deeper into the interpretability of trained WSI classification models.

Discussions
With this, we also open many additional questions that could be of critical interest for the development of efficient CAD tools and for guiding medical discoveries. We could start by adapting and/or validating the approach on a dataset with more than two classes, and by trying more advanced attribution methods such as SmoothGrad or Integrated Gradients. The interpretability measurement could be performed by more pathologists to obtain a quantitative evaluation of tile-level features. It would also be interesting to study the impact of stain normalization pre-processing on the identified features and on the quality of the feature-based heat-maps. We also note that mixing a small number of ImageNet features seems sufficient to describe the tissues found on histopathology slides. This may explain why the choice of a specific feature extraction model, as long as it is transferred from pre-training on ImageNet, is not a critical hyperparameter, since most popular CNN architectures are able to learn such complex features. Moreover, our work could lead to an improvement of classification performances, for example through new regularizers enforcing a clustering constraint, or through feature selection to reduce overfitting and improve generalizability. Finally, we have highlighted that most of the features used are texture classifiers, which is well suited to histopathological problems but could prevent these WSI classification pipelines from being applied directly to more challenging tasks such as liquid-based cytopathology, which relies not on cell organization and tissue structure but on individual cell analysis.

Acknowledgments:
We gratefully thank Jacques Bosq and Yaelle Harrar for their valuable medical input and feedback during this work, along with our colleagues at Keen Eye for their work and support.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: