A Review of Explainable Deep Learning Cancer Detection Models in Medical Imaging

: Deep learning has demonstrated remarkable accuracy analyzing images for cancer detection tasks in recent years. The accuracy that has been achieved rivals radiologists and is suitable for implementation as a clinical tool. However, a signiﬁcant problem is that these models are black-box algorithms therefore they are intrinsically unexplainable. This creates a barrier for clinical implementation due to lack of trust and transparency that is a characteristic of black box algorithms. Additionally, recent regulations prevent the implementation of unexplainable models in clinical settings which further demonstrates a need for explainability. To mitigate these concerns, there have been recent studies that attempt to overcome these issues by modifying deep learning architectures or providing after-the-fact explanations. A review of the deep learning explanation literature focused on cancer detection using MR images is presented here. The gap between what clinicians deem explainable and what current methods provide is discussed and future suggestions to close this gap are provided.


Introduction
Deep learning has grown both in terms of research along with private and public sector application in recent years. A part of this growth is the application of deep learning algorithms to big data in the healthcare sector [1][2][3][4][5][6][7][8][9][10][11] as depicted in Figure 1. These algorithms have demonstrated results suitable for clinical implementation but explanation remains a barrier for wide-spread clinical adoption.
Deep learning is a subset of machine learning that has grown in recent years due to the advances in computational power and the access to large datasets. Deep learning has proven to be successful in a wide breadth of applications [12][13][14][15][16]. There are three main categories of deep learning algorithms which are supervised learning, unsupervised learning, and reinforcement learning. Supervised learning fits a non-linear function using features as input data and labels as output data where the labels, or outputs, are known when training the algorithm. Common supervised deep learning algorithms are convolutional neural networks and recurrent neural networks. These algorithms are commonly used for classification tasks such as classifying lesions as benign or malignant. Unsupervised learning on the other hand does not use labeled output data. It is looking for patterns in the input data distribution. Some examples of unsupervised deep learning are generative adversarial networks and autoencoders. These are used for tasks such as data compression and data augmentation. The final category is reinforcement learning. In reinforcement learning, a reward function is defined and the neural network optimizes its weights to maximize the reward function. Deep neural networks consist of an input layer, output layer, and many hidden layers in the middle. Each layer consists of neurons which are inspired by how neurons in the brain process information. Each neuron has a weight that is updated using a technique called gradient descent during the training process. The weights of the network are optimized by minimizing a cost function. Nonlinearities are applied between layers to fit nonlinear functions from the input data to the output. Due to this, neural networks are able to model complex mappings. After training, the model is able to classify unseen data.
Deep learning has achieved high performance for numerous types of cancers. The authors of [17] demonstrated accuracy up to 98.42 for breast cancer detection. In [18], they found that they could detect lung cancer with up to 97.67 using deep transfer learning. In the fall of 2018, researchers at Seoul National University Hospital and College of Medicine developed a machine learning algorithm called Deep Learning based Automatic Detection (DLAD) to analyze chest radiographs and detect abnormal cell growth, such as potential cancers. In [19], the algorithm's performance was compared to multiple physician's detection abilities on the same images and outperformed 17 of 18 doctors. These results demonstrate the potential for using deep learning to aid medical practitioners. Despite this performance, adoption of these systems in clinical environments has been stunted due to lack of explainability.
Deep learning techniques are considered black boxes as depicted in Figure 2. Explainability attempts to mitigate the black box nature of these algorithms. Explainability is needed for a variety of reasons. First, there are legal and ethical requirements along with laws and regulations that are required for deep learning cancer detection systems to be implemented in a clinical setting. An example of a regulation is the European Union's General Data Protection Regulation (GDPR) requiring organizations that use patient data for classifications and recommendations to provide on-demand explanations [20]. The inability to provide such explanations on demand may result in large penalties for the organizations involved. There are also monetary incentives associated with explainable deep learning models. Beyond ethical and legal issues, clinicians and patients need to be able to trust the classifications provided by these systems [21]. Explanation is also needed for trust and transparency [22][23][24][25][26][27][28]. Explanation methods attempts to show the reasoning behind the model's classification therefore building a degree of trust between the system, clinician, and patient. This can reduce the number of misinformed results that would be a possible consequence of non-explainable systems. Finally, explainable deep learning systems will provide the clinician with additional usefulness such as localization and segmentation of the lesion location [29][30][31][32]. How explanability brings value to the end user is depicted in Figure 3.  Explainable and interpretable are used interchangeably in parts of the literature [33] yet they are defined separately in others. This work will use the terms are used interchangeably. There is not an agreed upon definition of explainability. As Lipton mentioned, "the term explainability holds no agreed upon meaning, and yet machine learning conferences frequently publish papers which wield the term in a quasi-mathematical way". This demonstrates the ambiguity of term and the need for a formal definition. That is outside the scope of this work but for the context of this paper, we adopt the [34] definition of explainability which is, "explainability is the degree to which a human can understand the cause of a decision".
We conducted an extensive literature review by examining relevant papers from six major academic databases. A keyword based search was used to select papers, it consists of searching for index keywords based on common variations in literature of the terms "explainable", "transparency", "black box", "understandable", "comprehensible", and "explainable" and its union with a set of terms related to deep learning cancer detection including "deep learning", "machine learning" and "convolutional neural network". The research was restricted to articles published between 2012 and 2020 with one exception published in 1991. The papers were skimmed and then filtered by quality and relevancy. More papers were then added by using a snowballing strategy that consists of using the reference list of the selected papers to identify additional papers.
The contributions of this review are three fold. First, this work will focus on explainability for cancer detection using MRI images. Second, we provide a thorough overview of the current approaches to measure explainability that are commonly used in the literature and why this is important for medical explainability. There is not an universally agreed upon metric for an explanation. This review will discuss the advantages and disadvantages of metrics commonly found in the literature. Third, this review will discuss papers related to what clinicians consider important for explanation systems. This highlights a gap between ML researchers who design the algorithms and clinicians who are the end users. This paper attempts to discuss direction for closing that gap. This paper is organized in the following way. First, a general taxonomy of explanation approaches for deep learning cancer detection systems is described. Second, current measures for explanation quality are examined and analyzed. Third, we discuss explanation techniques applied to cancer detection deep learning models. Finally, it will conclude with a discussion with suggested future directions in section four and five respectively.

Background
Although the research volume is expanding rapidly around explainability  a review focusing on image intepretability within the context of deep learning cancer detection systems is missing. Indeed, there are existing research studies of deep learning explainability for medical applications [63][64][65][66][67][68][69][70][71][72][73][74] in the literature as well as review articles. Notably, [75,76]. The survey of Tjoa and Guan reviews post-hoc explanation methods for medical applications. The review looks at medical explainability as a whole. The authors provide future directions such as including textual explanation and propose creating a authoritative body to oversee algorithmic explainability. In the review by [76], the authors provide a review of explainable deep learning models for medical imaging analysis. The authors discuss medical applications of brain, breast, retinal, skin, CT, and X-ray images. In contrast to the previously noted reviews, this review focuses solely on analyzing medical images for cancer detection while discussing the gap between clinicians and ML researchers.

Taxonomy
There are various proposed taxonomies to classify explanation methods [77][78][79]. Generally, these classification systems are task dependent therefore there is not an universally accepted taxonomy for explanation methods. This section will outline how explanation methods are grouped throughout this paper which is guided by the taxonomy by the authors of [78].

Local vs. Global
Local explanation refers to providing explanations for individual samples. This approach is useful when explanations for individual patients are of interest. An example of this is SHAP [80,81] which provides explanations for single classifications. On the other hand, global explanation refers to providing explanation for a group of samples or the entire model. This shows the overall feature importance for a group of patients. An example of global explanations are Global Model Explanation via Recursive Partitioning [82] and Garson's Algorithm [83] for depicting feature importance.

Data Modality Specific vs. Data Modality Agnostic
Data modality specific refers to explanation methods that are only applicable for a certain data type. For example, some methods only work with MR images and some methods only work for tabular clinical records. An example of a method that only works for images is Grad-CAM [84] which does not apply to other data modalities such as textual or tabular data. These data modality specific approaches are usually coupled with modelspecific explanation approaches. For example, some explanation methods use feature maps from convolution operations to calculate what information the model uses to make a prediction. Data modality agnostic refers to explanation methods that work for any data type. An example of this is LIME [85] which can provide explanations for image, tabular, and textual data. These are useful for clinical implementation due to the ability to handle a wide range of data used to make clinical decisions. These methods commonly use surrogate or perturbation based approaches to create a general approach for model explanation.

Ad-Hoc vs. Post-Hoc
For ad-hoc explanation methods, the model itself is designed to be intrinsically explainable. Common current approaches are limiting learned features to only explainable features or by altering the optimization procedure to focus on optimizing for explainability rather than accuracy. The goal of this approach is to design a deep learning model that is inherently explainable and opposes the notion that there is a trade off between accuracy and explainability. On the contrary, post-hoc explanation techniques provide explanations after the classification is made. These are more common in the literature due to ease of implementation. Some refer to these as diagnostic methods [86] due to their utility for diagnosing and their apparent limitations for providing a complete explanation for the end user. Examples of these are saliency maps [87,88], Grad-CAM [84], Respond-CAM [89], and SHAP [80] where the explanation is provided after-the-fact.

Model Agnostic vs. Model Specific
Model agnostic refers to explanation methods that are able to explain any model and are not restrained to a certain type (e.g., SHAP). A common approach for modelagnostic methods is to change the inputs and measure the corresponding change in output. Then to use this to determine what change in inputs produces the greatest change in outputs. Model specific explanation methods only work with a specific model. For example, Grad-CAM [84] produces feature visualizations for convolutional neural networks, but does not work with LSTMs. These methods often use certain aspects of model architecture, for example feature maps produced from graph convolutions.

Attribution vs. Non-Attribution
The majority of proposed explanation techniques are attribution based methods. Attribution methods attempt to calculate the inputs of the neural network that are the most important with regards to the network's result. This can be broken further into two categories: perturbation based approaches and back-propagation based approaches. Perturbation based techniques estimate the most important features by removing one, calculating change in class score, and repeating for all features. It uses this to calculate the attribution of each feature. It then ranks feature importance by attribution. An example of this is SHAP. These are usually inefficient due to having to compute many iterations. Gradient-based methods conduct a similar procedure except with a single forward or backwards pass. Some of these methods provide only positive contributions (e.g., CAM) and some provide positive and negative contributions (e.g., DeepLIFT). In general, these methods produce results faster due to having to perform only one pass.

Measures for Explanations
There are various approaches for measuring the quality of an explanation which range from computing an intersection over union (IOU) score to conducting user studies. Measuring the quality of explanation is challenging because of the importance of the context the method is used for along with the non-triviality of defining what a good explanation is. For example, measuring explanation methods for cancer detection could have different criteria then explanation methods for insurance risk prediction. Therefore, in addition to testing explanation methods in a general way, it is important to consider the context of potential applications. This is especially important for cancer explanation methods due to high-risk nature of the predictions.

What Defines a High-Quality Explanation
The question what defines a high-quality explanation is an important first question to consider. There are studies that attempt to explore this question [90][91][92]. One question that is important is the time taken to understand the explanation. Ideally, the explanation is thorough enough to convey the necessary information yet concise enough that is to be understood in a timely manner. The authors of [91] explore this by asking what makes explanations explainable in the context of verification. The authors conduct an user study and identify what kind of increases in complexity have the greatest effect on the time it takes for humans to verify the rationale, and which seem relatively insensitive. Another factor to consider is the question who will be the end user of the algorithm. The end user of the deep learning system should guide the approach to evaluation. Designing explanation techniques for clinical settings require different explanations than designing explanation for robotics. A paper published in 2018 by [92] discusses the different roles involving explanation techniques and how the explanations differs depending on what role you are providing the explanation for. They show that an explanation for one role should be measured differently for other roles. This sounds trivial but is often ignored. For example, measuring explanations for doctors should have different criteria than measuring explanations for patients receiving the diagnosis or engineers designing the system. Consequentially, this suggests the need to test explanations on a variety of human subjects to evaluate for whom the explanations fail and for whom the explanation provides adequate information. For example, the explanation can be thorough and complete for a clinicians but fail for a patient when ideally it should address the needs of each end user. A limitation of the current literature is the lack of studies that verify explanation methods with a population of radiologists or doctors. To the best of our knowledge, there is not any studies that incorporate radiologists into the design process of deep learning explanation algorithms. The majority of human-studies involved general, non-specialized individuals recruited through a crowd-sourcing service such as Amazon Mechanical Turk [93]. The authors of [94] also highlight the importance of keeping the end-user in mind when defining what is explainable and how to measure it. It is also essential to have quantitative measures to be able to compare different methodologies quantitatively when qualitative measures fail. A paper published by the authors of [95] propose Remove and Retrain (ROAR) to quantitatively compute feature importance. ROAR ranks feature importance, removes these the features ranked as most important, and then retrains the model and measures the difference during testing. This method is widely used but there still lacks a standard procedure for quantitatively measuring explanation performance. While there is no silver bullet for measuring explanation quality, these are all important questions to consider when evaluating the quality of an explanation and choosing a metric to measure results.

Methods to Measure Explanation Quality
This next section will discuss approaches to measure explanation quality commonly used in the literature. These are broken into two types: human-based evaluation and proxy-based evaluation. Human-based studies design an experiment where participants are recruited and asked to fill out a questionnaire about different explanation methods. This data is then analyzed in an attempt to answer a question regarding the different explanation methods (e.g., which one is more concise). The participants are usually recruited using a crowd-sourcing service if non-specialized participants are acceptable. If specialized participants are needed, for example doctors, the authors usually recruit a cohort from local or regional organizations. Out of the papers reviewed for this survey 73 percent use proxy-based and 27 percent conduct human studies. This statistics highlights a need for more human-verified explanation techniques. This section will discuss human-based and proxy-based methods to measure explanation quality.
"Human-based" can be broken down into two more categories which are specific tasks or general tasks. For specific tasks, studies can be designed to evaluate specific applications. An example of this would be explaining predictions for ICU duration. The other category is when the application is more general or not known. This is when the desire is to test the explanation method in a general setting. An example of this is Grad-CAM++ [93], SHAP [80], or LIME [85] which all were tested with randomized human studies using Amazon Mechanical Turk. The authors produced heatmaps for 5 classes in the ImageNet testing set totaling 250 images. These maps, along with their corresponding original image, were shown to 13 human subjects and asked which explanation algorithm invoked more trust in the underlying model. They consider the algorithm with more votes the model with greater trustworthiness.
Sometimes designing and carrying out human studies is infeasible or inconvenient due to various reasons such as IRB, recruitment, or funding. There are also cases where human-based studies do not measure what is desired. In these studies a formal definition of explanation is used as a proxy for explanation quality. The non-triviality here is choosing what proxy is best to be used. Some examples are showing increased performance for a method that has been deemed explainable based on human studies or showing that method performs better with respect to certain regularizers. An example of this approach is Saliency Maps which measure explanation quality using a localization task on benchmark datasets.

Explanation Methods for Deep Learning Cancer Detection Models
This section is separated into two sub-sections. First, post-hoc explanation methods will be discussed. The second section will discuss ad-hoc methods. Ad-hoc methods aim to develop intrinsically explainable deep learning systems as opposed to providing after-the-fact explanations.
Explainability for deep learning models within a medical context is essential to create safe and responsible systems suitable for clinical implementation. Explainability attempts to alleviate risk and build trust [96,97]. There are numerous papers published each year regarding explainability and interpretability for healthcare [98]. This section will further discuss some of these works regarding image explanation methods for cancer detection.

Post-Hoc
A majority of deep learning cancer detection models utilize post-hoc methodologies due to their ease of implementation. Post-hoc methods can be used to analyze the learned features of the model and to diagnose overfitting. This provides a feedback loop for the researcher and allows them to improve the model. It also shows the discriminative areas of the image. This section will review recent studies that apply post-hoc explanation methods to breast cancer, prostate cancer, lung cancer, brain cancer, and liver cancer.
The predecessor for post-hoc techniques such as Respond-CAM, Pyramid-CAM, Grad-CAM, Grad-CAM+, and Grad-CAM++ is class activation maps (CAM) which generates a heatmap using global average pooling. In [99] the method shows the region of the image used the identify the label by outputting the spatial average of the feature maps of each unit at the last layer. It was tested as weakly supervised segmentation on the ILSVRC dataset with a top 5 error rate of 37.1 percent which is only 3 percent away from the top score.
CAM is formally defined as the following. Let f k (x, y) represent the activation of unit k in the last convolutional layer at spatial location (x, y). The result of performing global average pooling is F k = ∑ x,y f k (x, y) therefore for a given class, c, the input to the softmax, S c is ∑ k w c k F k where w c k is the weight corresponding to class c for unit k. Essentially, w c k indicates the importance of F k for class c. Finally the output of the softmax for class c, P c is given by ∑ c exp(S c ) By plugging F k = ∑ x,y f k (x, y) into the class score, S c , we obtain The authors define M c as the class activation map for class c, where each spatial element is given by A further developed version of CAM is Grad-CAM. In [84], Grad-CAM calculates the gradients of the class score with respect to the activation of the feature maps which are globally average pooled over the width and height dimensions. This is followed by a weighted combination which is passed through a ReLU function. This works best with the final layer because of the larger receptive field compared to the beginning layers. It was evaluated using the ImageNet localization challenge and outperforms CAM with a top-5 error of 10.89 percent. It achieves an IoU score of 49.6 percent on the PASCAL VOC 2012 segmentation data set. They also used a human study and found Grad-CAM outperforms guided backpropagation by 16.79 percent.
In order to obtain the class-discriminative localization map Grad-CAM L c Grad−CAM ∈ R u×v of width u and height v for any class c, we first compute the gradient of the score for class c, y c (before the softmax), with respect to feature map activations A k of a convolutional layer, i.e., ∂y c ∂A k . These gradients flowing back are global-average-pooled over the width and height dimensions (indexed by i and j respectively) to obtain the neuron importance weights α c k : During computation of α c k while backpropagating gradients with respect to activations, the exact computation amounts to successive matrix products of the weight matrices and the gradient with respect to activation functions till the final convolution layer that the gradients are being propagated to. Hence, this weight α c k represents a partial linearization of the deep network downstream from A, and captures the importance of feature map k for a target class c. We perform a weighted combination of forward activation maps, and follow it by a ReLU to obtain, The authors of [29] propose a modified version of CAM called pyramid gradientbased class activation.they used PG-CAM a densely connected encoder decoder-based feature pyramid network (DC-FPN) as a backbone structure, and extracts a multi-scale Grad-CAM that captures hierarchical features of a tumor. Mathmatically, PG-CAM is defined as follows.
where k p is the unit of the feature map s p , GradCAM c s p is a GradCAM generated from the feature maps delivered from each scale p, and the PG-CAM aggregating the GradCAMs will consequently contain information on multiple scales of the feature pyramid.
Integrated gradients has became popular due to its ability to explain models that use different data modalities. This is in addition its ease of use, theoretical foundation, and computational efficiency. Formally, Integrated Gradients is defined as follows where i is the feature, x is the input, x is the baseline, and α interpolation constant to perturb features by. Note that this integral is often approximated due to computational costs and numerical limits.
Occlusion is a simple to perform model agnostic approach which reveals the feature importance of a model. It can reveal if a model is overfitting and learning irrelevant features as in the case of adversarial examples. Adversarial examples are inputs designed to cause the model to make a false decision. In that case, the model misclassifies the image (say malignant as benign) despite the presence of discriminating feature occluding all features (pixels) one-by-one and running the forward pass each time can be computationally expensive and can take several hours per image. It is common to use patches of sizes such as 5 × 5, 10 × 10, or even larger depending on the size of the target features and computational resources available.
CAM and Grad-CAM calculate the change in class score based on features extracted by the model. These two methods then visualize the features that cause the largest change in class score. On the other hand, occlusion occludes part of the image and then calculates the change in class score using the occluded image. It repeats this for many patches of the image and the patch that produces the largest change in class score is deemed the most important. Both methods produce a heatmap highlighting the most discriminative areas of the image. Therefore, the largest difference between the two techniques is how the they compute the most discriminative area of the image. The disadvantage of occlusion is that it is computationally expensive and takes longer to compute than CAM or Grad-CAM. The advantage of Grad-CAM and CAM is that they can compute their results in one pass instead of many passes like occlusion techniques. The disadvantage of Grad-CAM and CAM is that they only take into account the feature maps therefore could produce misleading results whereas occlusion takes into account different patches of the image. A summary of these methods is presented in Table 1.

Brain Cancer
The authors of [100] attempt to incorporate explainability into the model before training. The authors assess the advantages of implementing features to increase explainability early in the development process, by training a neural network to differentiate between MRI slices containing either a vestibular schwannoma, a glioblastoma, or no tumor. They report that after training to detect explainable features, the network highlights more explainable regions.
In [29] PG-CAM to Grad-CAM was compared and it was found that PG-CAM outperformed Grad-CAM by delivering a 23 percent higher localization accuracy for the validation set.

Liver Cancer
A CNN was developed to classify liver lesions by [101]. To interpret the network, the authors visualized the activations of the last convolution layer. The authors calculated a feature score for each feature by hindering the network to utilize that feature and observing the difference in class score. They tested if the network was able to discover explainable radiological features. The model obtained a PPV of 76.5 ± 2.2 in identifying the 1-4 correct radiological features for the 60 manually labeled test lesions.

Breast Cancer
Deep learning has achieved remarkable accuracy for the detection and segmentation of breast cancer [102][103][104][105][106][107][108][109][110][111] . Detection in Mammography using Deep Convolutional Neural Networks by the author of [112] employee deep convolutional neural networks to localize classifications and masses in mammogram images without training directly on the full images. The authors implement a VGGNet architecture and report an accuracy of 92.53. The network was train by using images cropped around a ROI and then tested on the original, unaltered images. The authors then apply Grad-CAM.
A Neoadjuvant chemotherapy (NAC) aims to minimize the tumor size before surgery. Predicting response to NAC could reduce toxicity and delays to effective intervention. The authors of [113] implement a CNN to predict response using DCE-MRI images and then produce Grad-CAM heatmaps for post-hoc explanation. They report accuracy ranging from 0.69 to 0.88. Deep learning has also shown great success for prostate cancer [114][115][116][117][118][119][120][121][122][123][124][125][126] . The authors of [127] implement a convolution feature extractor followed by a SVM classifier for the task of detecting prostate cancer. They report an AUC of 0.86 for T2W images and 0.93 for ADC images. After prediction the authors generate CAM heatmaps for the MRI images to show localization of prostate lesions.
The authors of [32] demonstrated the ability to localize prostate lesions in T2W transverse images using Grad-CAM. The authors used MRI images for classification as well as tabular data such as weight, height, and age from ProstateX dataset. Deep learning techniques were used for classification. After classification GradCAM, Saliency Maps, LIME and Deep-SHAP were used to provide post-hoc explanations. They reports results of an average error of 6.93 pixels. The authors also made the qualitative observation that different explanation methods highlight different aspects of the resulting classification.

Lung Cancer
The authors of [128] described the goal of their paper "To explain predictions of a deep residual convolutional network for characterization of lung nodule by analyzing heat maps." The authors used occlusion to systematically block regions of a nodule and map drops in malignancy risk score to generate clinical attribution heatmaps on 103 nodules from Lung Image Database Consortium image collection and Image Database Resource Initiative (LIDC-IDRI) dataset, which were analyzed by a thoracic radiologist. They report an accuracy of 85 percent for classification and qualitatively access their explanation results.
The authors of [129] applied XAI visualization to gain an insight into the features learned by a DCNN trained to classify estrogen receptor status (ER+ vs. ER−) based on dynamic contrast-enhanced magnetic resonance imaging (DCEMRI) of the breast. The authors developed a network that utilized "transfer-trained dual-domain DCNN architecture derived from the AlexNet model trained on ImageNet data that received the spatial (across the volume) and dynamic (across the acquisition sequence) components of each DCE-MRI ROI as input". The authors implemented integrated gradients as a post-hoc explanation technique. They observed that the CNN learned relevant features from the spatial and dynamic domains. They note there were differences in the contributing features from the two domains.

Ad-Hoc
Ad-Hoc explanation techniques modify the training procedure and/or network architecture to learn explainable features. An additional proposed approach to learn explainable features was the work by the authors of [130] using supervised iterative descent. The authors' approach learns explainable features for image registration by incorporating a shared modality-agnostic feature space and using segmentations to derive a weak supervisory signal for registration. The authors report their experiments "demonstrate that these features can be plugged into conventional iterative optimizers and are more robust than state-of-the-art hand-crafted features for aligning MRI and CT images". This is significant with respect to medical imaging because of the multi-modality aspect which is common to medical imaging.
The authors of [131] propose a two part explainable deep learning network. One part extracts features and predicts while the other part generates an explanation. The authors create a mathematical model to extract high-level explainable features. The explanation model reports information such as what are the most important features, what is their rank, and how does each one influence the prediction.
Capsule networks have been recently utilized due to the impression they are more explainable compared to CNN [132]. The authors of [133] state that convolutions neural networks struggle with explainability therefore they propose explainable capsules, or X-Caps. X-Caps utilize a capsule network architecture that is designed to be inherently explainable. The authors design to capsule network to learn an explainable feature space by encoding high-level visual attributes within the vectors of a capsule network to perform explainable image-based diagnosis. The authors report accuracy similar to that of unexplainable models demonstrating accuracy and explainability are not mutually exclusive. The authors define a routing sigmoid as follow.
where r i,j are the routing coefficients determined by the dynamic routing algorithm for child capsule i to parent capsule j and the initial logits, b i,j are the prior probabilities that the prediction vector for capsule i should be routed to parent capsule j.
Let the critical response maps A(x|c) for a given CT scan slice image x for each prediction c be computed via back-propagation from the last layer of the last explainable sequencing cell in the proposed SISC radiomic sequencer. The notation used in this study is based on the study done by the authors of [134] for consistency. The last layer in the explainable sequencing cell at the end of the proposed SISC radiomic sequencer contains N = 2 nodes, equal to the number of possible predictions (i.e., benign and malignant). The output activations of this layer are followed by global average pooling and then a softmax output layer. So, to create the critical maps for each possible prediction, the backpropagation starts with the individual prediction nodes in the last layer to the input space. For a single layer l, The deconvoluted output response r l is given by, where f k is the feature map and p l is the kernel of layer l. the symbol * represents the convolution operator. Therefore, the critical response map A(x|c), for a given prediction c is defined as, where M is the unpooling operation and D c L is the convolution operation at the last layer with kernel p L replaced by zero except at the cth location corresponding to the prediction c.
The majority of ad-hoc explanation techniques attempt to learn an explainable feature space using modifications to traditional architectures that vary from modifying loss functions to modifying the procedure for computing convolutional features. Some papers argue the best approach is to move away from convolutional neural networks to capsule networks due to greater explainability with similar predictive performance though data is limited [133].

Brain Cancer
A capsule network trained to recognize explainable radionomic features was proposed by the authors of [132]. The explainability of capsule networks is studied. The authors' results show that the radiomics features extracted by Capsule networks can not only distinguish between the tumor types, but also show considerable correlation with handcrafted features, which are more acceptable and reliable from a physician's point of view.

Lung Cancer
The authors of [135] proposed An explainable Deep Hierarchical Semantic Convolutional Neural Network for Lung Nodule Malignancy Classification in 2018. The proposed network provides two levels of output. One level contains the low-level, explainable features for radiologists and the second level provides a high-level malignancy prediction score. They incorporate both of these tasks in their loss function therefore optimizing the network parameters for both levels mentioned above. The authors use the LIDC dataset which contains annotated lung nodule characteristics. The network achieves a test AUC of 0.856. For the semantic, explainable features task they achieved mean accuracy of 0.908, 0.725, 0.719, 0.834 and 0.552; mean AUC score of 0.930, 0.776, 0.803, 0.850 and 0.568; mean sensitivity of 0.930, 0.758, 0.673, 0.855 and 0.552; and mean specificity of 0.763, 0.632, 0.796, 0.636 and 0.554 for calcification, margin, subtlety, texture, and sphericity, respectively.
Another approach is SISC: End-to-end explainable Discovery Radiomics-Driven Lung Cancer Prediction via Stacked explainable Sequencing Cells was proposed by the authors of [134]. They implement an end-to-end system that automatically discovers semantically meaningful radiomic features using radiomic sequencers. The radiomic sequencer being discovered possesses a deep architecture comprised of stacked explainable sequencing cells (SISC). The network produces an AUC score of 89.06.

Breast Cancer
DeepMiner was proposed by [136] in 2018. Their framework attempts to discover explainable representations and build explanations using these representations for medical purposes, specifically breast cancer. An advantage to this approach is there is no need to train on full medical reports. They attempt to analyze the individual units rather than train a network to output a medical report. Their network is composed of three phase. First, they crop patches of the mammogram image. They then train a network using these cropped patches. Second, they have human experts annotate the units from this network. Third, they generate explanations by ranking the annotated units. The utilize transfer learning and obtain a AUC for 0.838 for normal patches, 0.802 for benign patches, and 0.872 for malignant patches. They test the explanation quality by using the Greedy Matching Score to compare the generated explanations with the explanations in the ground truth resulting in a score of 0.627.
In general, ad-hoc techniques provide more advantageous explanations compared to post-hoc techniques due to their inherently explainable design. One can argue that this makes the explanations more trustworthy. The disadvantage of ad-hoc techniques is their accuracy and scalability. When altering deep learning architectures to be inherently explainable there is usually an amount of accuracy lost. Also, the scalability of ad-hoc methods comes into question. In general, there are claims of post-hoc techniques not being as trustworthy since they are attempting to explain inherently unexplainable models. However, they do not suffer from the accuracy and scalability problems the ad-hoc methods do. See Table 2 for an overview of both post-hoc and ad-hoc methods.

Discussion
There has been significant research progress with regards to explainable deep learning cancer detection systems. The majority of methods utilized in the literature are local, post-hoc methods that are usually model and data specific. Current research demonstrate a trajectory heading towards explainable deep learning for cancer diagnosis, but there is still work to be done to fully realize the implementation of these systems in a clinical setting.
When a methodology provides an obviously wrong explanation it is important to be able to probe more to find out what went wrong. There is little research around this area despite being important for clinical implementation. This would help alleviate misclassification errors and provide greater insight to the clinician. Being able to quantify uncertainity in explanations is important for clinicians as well. This would give the clinician a gauge for the amount of trust to have in an explanation. This would also help alleviate potential errors introduce by explanation techniques.
Since the end user of explainability approaches for cancer detection will be clinicians and their patients, it is essential to take into account what they consider to be explainable. This is an important aspect that is often ignored by explanation studies. Most algorithms are measured using quantitative methods such as log loss or human studies of non-clinicians recruited from a service such as Amazon Mechanical Turk. There is a need to incorporate clinicians into the design process of these algorithms and to evaluate the algorithms with the doctors that will be using them. There are studies that attempt to design clinical visualization development and evaluation with clinicians and researchers. The authors of [139] introduce Clinical-Vis which is an EHR visualization based prototype system for task-focused design evaluation of interactions between healthcare providers (HCPs) and EHRs. They design a study where they recruit 14 clinicians. They compare the clinicians interaction with their systems against a baseline system. They qualitatively analyzed the differences between systems by having the clinicians think out loud during the assigned tasks. The authors also conducted a survey after the task was complete and analyzed the surveys. The measured time taken to arrive at a decision, accuracy of decision, confidence in decision, and TLX scores which the authors define as "Self-reported mental demand, physical demand, effort level, hurriedness, success and discouragement on a scale of 0-10". Overall, they found their system performed above baseline across all measures. The authors of [59] perform an iterative design of an explanation system for recurrent neural networks (RNN) including machine learning researchers and clinicians. They use feedback from clinicians while designing the algorithm. They evaluate it using a mixture of techniques including quantitative, qualitative, and case-studies. To the best of our knowledge, there are no studies such as this for explanation techniques. There is a study that interviews a cohort of clinicians to gain insight on what they consider to be explainable deep learning. The authors of [86] interviewed 10 clinicians belonging to two groups-Intensive Care Unit and Emergency Department. The goal of this study was to shed light on what clinicians look for in explainable machine learning and developing a metric to measure it. The authors interviewed each clinician for 45-60 min. They proposed the clinician's notions of explainability to determine what each clinician understood by the term and what he/she expected from ML models in their specific clinical setting. They designed an hypothetical scenario regarding the use of an explainable deep learning system to gain insights into what the clinicians looked for in explainable systems. The authors found that the clinicians viewed explainability as a means of justifying their clinical decision-making (for instance, to patients and colleagues) in the context of the model's prediction. The clinicians desired to understand the clinically relevant model features that align with current evidencebased medical practice. The implemented system/model needs to provide clinicians with information about the context within which the model operates and promote awareness of situations where the model may fall short (e.g., model did not use specific history or did not have information around certain aspect of a patient). Models that lacked accuracy were deemed suitable as long as an explanation of the prediction was provided. The authors state a quote that stood out was "the variables that have derived the decision of the model" and mentioned it was brought up by 3 ICU, 1 ED clinicians which is also identified in. The authors report that the clinical thought process for acting on predictions of any assistive tool appears to consist of two primary steps following presentation of the model's prediction: (i) understanding, and (ii) rationalizing the predictions. The clinicians stated they wanted to see both global and local explanations. Clinicians found finding similar training examples to be useful is certain applications such as diagnosis but not as much in other cases. The clinicians reported one of the most sought after features was a measure for uncertainty. 'Patient trajectories that are influential in driving model predictions' was reported to be an important aspect of model explanation. These studies provide direction for future algorithmic development as well as a call for more research regarding the clinical performance of current explanation techniques.

Future Directions
There are numerous works that propose novel explanation methods but little focus on what makes an explanation suitable for the context it will applied. This is essential for domains such as cancer diagnosis. Furthermore, more studies are needed to identify what clinicians deem explainable and use these insights as a measuring stick for explanation methods for deep learning cancer detection systems. Another important direction is using these insights to derive quantitative explanation metrics. The majority of metrics used to measure explanation quality are designed without clinicians involved. To create explanation techniques suitable for clinician integration is is necessary to measure them in a way guided by end-user criteria. Explanation techniques also can be useful to extract clinically relevant features. There are studies that use explanation techniques to localize lesions but these models do not extract information such as shape, volume, area, and other relevant characteristics. A future opportunity is to extract these features without computationally expensive segmentation. With this, clinicians do not need to extract these features manually. If explanation techniques are implemented into clinical settings, the system can automatically extract these characteristics for the clinicians thus aiding in the diagnosis process.
The common approach for providing an explanation for image classification is to produce a heatmap showing the most discriminative region. More work needs to be done to provide greater insight than using this direct approach. For example, it would provide more insight if explanation methods could should why a classification was not made and quantify the uncertainity of the explanation. Being able to reason for and against is also important to provide greater insight and explainability. More studies should be conducted on where current methods fail and why. This would provide the community more insight on how to create more robust explanation methods. It would also provide greater confidence the methods we currently have.
Currently, the majority of cancer detection studies utilize post-hoc techniques. As stated in [33,140] it is advantageous to develop intrinsically explainable (i.e., ad-hoc) deep learning models. Post-hoc explanation methods may be useful for diagnosing the deep network model but the insight gained is unsubstantial. A model suited for clinical implementation should be fully explainable with the predictive power of modern deep learning approaches. Post-hoc methods only explain where the network is looking which is not enough information for high stakes decision making. Post-hoc methods are fragile and easily manipulated [141,142] and can provide misleading results which hinders clinical implementation. Some approaches attempting ad-hoc models were discussed in the paper with results that show that intrinsically explainable models can achieve accuracy similar to that of unexplainable models. More research is needed to verify this and evaluate the clinical performance.

Conclusions
A review on explainability techniques for cancer detection using MRI images was presented. Different evaluation methods for these methods were discussed with their respective significance. In addition to this, what clinicians deem to be explainable was also discussed along with the gap between what current methods provide versus what clinicians want. The importance of designing explanation methods with this in mind was highlighted. A critical analysis and suggestions for future work were then presented to conclude and future efforts were suggested based on the current state of the literature.
In summary, there is a need to incorporate clinicians in the design process for these algorithms and a need to build intrinsically explainable deep learning algorithms for cancer detection. The majority of explanation techniques are designed for general use and do not incorporate context. This is important for clinical use cases due to the unique, highrisk environment. It is important to evaluate proposed methods alongside clinicians to determine the clinical strengths and weaknesses of the methods. This has the potential to help guide future efforts. Furthermore, there is a need to go beyond basic visualization of the discriminative regions. Some promising future directions include quantifying the explanation uncertainty, providing counter examples, and designing ad-hoc models that are intrinsically explainable.
Author Contributions: All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.