Explainable-AI in Automated Medical Report Generation Using Chest X-ray Images

Abstract: The use of machine learning in healthcare has the potential to revolutionize virtually every aspect of the industry. However, the lack of transparency of AI applications raises concerns about the trustworthiness and reliability of the information these applications provide. Medical practitioners rely on such systems for clinical decision making, but without adequate explanations, the diagnoses made by these systems cannot be completely trusted. Explainability in Artificial Intelligence (XAI) aims to improve our understanding of why a given output has been produced by an AI system. Automated medical report generation is one area that would benefit greatly from XAI. This survey provides an extensive literature review of XAI techniques used in medical image analysis and automated medical report generation. We present a systematic classification of the XAI techniques used in this field, highlighting the most important features of each one, so that future research can select the most appropriate XAI technique for creating understandable and reliable explanations of decisions made by AI systems. In addition to providing an overview of the state of the art in this area, we identify some of the most important open issues on which research should be focused.


Introduction
Machine learning models are powerful tools for solving complex problems. However, many of these models are so complex that we do not really understand how they work [1,2]. Hence, for machine learning models used in critical applications, such as medical diagnosis, it is essential that the models provide a transparent and clear explanation of how they reached a particular decision. Without an explanation of how a model works, it cannot be trusted [1].
Explainability in artificial intelligence (XAI) is a relatively new field that aims at providing explanations on how an AI model works and how it makes its decisions. Incorporating explanations into an AI model does not directly try to improve the performance of the model, but rather to give insights into how and why a model produces a particular output. Understanding an AI model includes knowledge of the role of each parameter of the model, which factors affect the model's output and how the model's parameters and input influence the output [3].
XAI is fundamental for the development and adoption of AI prediction systems for healthcare and other critical applications, as it provides the necessary elements for transforming a mysterious, incomprehensibly complex black box system into a trustworthy and efficient tool. This paper summarizes some of the most relevant work in the field of XAI, specifically in relation to deep-neural-network-based models for chest X-ray image analysis and automated medical report generation. Medical imaging techniques are extensively used for diagnosing illnesses [4][5][6][7]. However, each image needs to be carefully examined by an experienced healthcare professional, who then needs to spend time writing a report.
1. We categorize and organize current research on the design and use of XAI techniques to generate explanations for AI models used in the analysis of X-ray images and in the automatic generation of medical reports.
2. This high-level overview of research on XAI in this field helps identify some of the most important issues with existing XAI models and highlight the importance of collaborative efforts between clinicians, practitioners and system designers. We hope this paper will help focus researchers on these issues and eventually lead to the creation of effective, accurate, efficient and highly understandable and reliable AI systems for healthcare.
The rest of this paper is organized into four parts. The first part provides background information. The second part focuses on explanation approaches for medical chest X-ray image analysis. The third part describes explanation approaches proposed for automated medical report generation. The last part discusses the pros and cons of the available XAI methods and the circumstances in which a particular XAI method is most suitable.

Background
Medical images provide extensive information that can be used to diagnose diseases and track the progress of patients. Chest X-ray images necessitate a thorough examination by radiologists, who then document the results of their analyses in full-text reports. To generate accurate reports, radiologists must have expertise in diagnosing medical images. Nonetheless, many reports do not provide a conclusive diagnosis due to the large number of potential diagnoses. Furthermore, the amount of time it takes radiologists to prepare full-text reports is a concern. In modern-day hospitals, automated medical imaging techniques are commonly employed to help alleviate these problems.
Convolutional-Neural-Network-based systems are commonly used for medical image analysis. ConvNets are powerful feature extractors able to discover relevant information in images without the need for human intervention.

Convolutional Neural Networks
Artificial neural networks (ANNs), or just neural networks, imitate the behavior of the human brain by allowing computers to learn how to recognise relevant components of a problem and how these components interact; this information aids in the solution of difficult problems. An ANN consists of nodes or neurons grouped into layers and connections between them; neuron connections have associated weights.
Neurons receive information from other neurons or from external sources, process the information and pass it to other neurons or external sources through their connections.
A ConvNet is an ANN that can extract features from an input image, classify them and identify patterns (see Figure 1). A ConvNet is a deep learning network consisting of the following kinds of neuron layers:
• Input Layer: The input layer reads the image that will be processed.
• Convolutional Layers: Convolutional layers process the input image to extract features from it. A convolutional layer applies a set of linear kernels or filters to its input to produce a set of so-called convolved features. A non-linear activation function is applied to the convolved features to produce the output of the convolutional layer, called a feature or activation map. Filters are designed to detect changes in an image's intensity values to recognise spatial patterns such as edges. The more convolutional layers a ConvNet has, the more complex the spatial features it can detect in an image. Some XAI techniques aim at visualizing feature maps to gain a better understanding of the image features that a ConvNet discovers and uses to reach conclusions about what an image represents.
• Pooling Layers: Pooling is a technique for reducing the size of feature maps to speed up computation. In a pooling layer, the feature maps produced by the previous layer are down-sampled to generate new feature maps with a condensed resolution; the spatial dimension of the input is greatly reduced by this layer. The most common types of pooling are max pooling and average pooling, in which a group of values from a feature map is replaced with the maximum or the average value in the group, respectively. In a ConvNet, a convolutional layer is usually followed by a pooling layer.
• Flatten Layer: The feature maps are flattened to create one long continuous linear vector from all the 2-dimensional arrays. A fully connected layer uses the flattened vector as input for image classification.
• Fully-connected Layers: These layers appear at the end of ConvNets. They process the feature maps computed by the other layers to determine relationships between high-level image features.
• Output Layer: In a neural network that performs multi-class classification, the output layer consists of a set of scores giving the likelihood of the image belonging to each of the classes that the ConvNet can identify. The softmax function is commonly used in ConvNets to compute these scores.
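The convolution and pooling operations described above can be sketched in a few lines of numpy. The 6x6 "image", the edge-detecting kernel and the layer sizes below are illustrative choices, not taken from any model in the surveyed papers:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in ConvNets) of one channel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling, halving each spatial dimension."""
    h, w = fmap.shape
    return fmap[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A toy 6x6 image whose right half is bright, and a vertical-edge filter.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1., 0., 1.],
                   [-1., 0., 1.],
                   [-1., 0., 1.]])   # responds to dark-to-bright vertical edges

feature_map = np.maximum(conv2d(image, kernel), 0)   # ReLU activation
pooled = max_pool(feature_map)                        # condensed feature map
```

The feature map responds only along the brightness edge in the middle of the image, and pooling reduces the 4x4 map to 2x2 while keeping the strongest responses.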
Given a large number of medical images depicting different classes of abnormalities, a ConvNet can learn the key characteristics of each class, but it lacks the ability to provide explanations about how particular classes are detected.
When dealing with data of a sensitive nature, such as medical data, there is a clear need to better understand how complex algorithms process it. Without such understanding, it is very difficult to trust the information produced by these algorithms.

Attention Mechanisms in Deep Learning
Visual selective attention enables us to focus on the most important parts of a scene and allows us to efficiently extract useful information from it. According to cognitive science, the abundance of information restricts the human ability to comprehend it, so we must focus on a small part of it [20]. To study the human visual perception process, researchers have developed models of visual selective attention that simulate the human visual system.
The study of attention mechanisms has made huge advances in the past few years in areas such as natural language processing [21] and image processing [22]; attention mechanisms mimic the perceptual mechanisms in the human brain. Most research combining deep learning algorithms with visual attention mechanisms focuses on the use of masks; masking identifies the key features in an image. Deep neural networks can learn which regions in an image are most relevant for a specific task and, hence, on which regions they must focus their attention.
Attention mechanisms in image processing compute an attention map, a matrix representing the importance that each part of the image has for a particular task. In a ConvNet, the input can be re-weighted with that map before feeding it to the convolutional layers so the ConvNet can focus on the most relevant parts of an image. The input is encoded so that it can be fed to the convolutional and pooling layers. If all the network states used to encode the input are used to produce the attention map, the corresponding attention mechanism is called global attention; if only some of the states are used, it is called local attention.
The weights in an attention map are between zero and one, with higher weights given to the parts of an image that are more relevant to a ConvNet. If the weights are continuous values between zero and one, the attention mechanism is called soft attention; if the weights are only either zero or one, the attention mechanism is called hard attention. Soft attention mechanisms are deterministic, while hard attention mechanisms are stochastic [23]. Soft attention mechanisms can be further divided into spatial attention [24], channel attention [25] and self attention [26].
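As a minimal illustration of the soft/hard distinction, the snippet below computes softmax-based soft attention weights over four hypothetical image regions and a 0/1 hard selection. For simplicity the hard weights here deterministically pick the top-scoring region; as noted above, real hard attention samples the region stochastically from the weight distribution:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical relevance scores for four image regions (higher = more relevant).
scores = np.array([0.5, 2.0, -1.0, 0.1])

soft_weights = softmax(scores)   # soft attention: continuous weights in (0, 1)
hard_weights = (scores == scores.max()).astype(float)  # 0/1 selection (argmax)

# Toy per-region feature vectors; attention re-weights them before further layers.
region_features = np.array([[1., 0.],
                            [0., 1.],
                            [1., 1.],
                            [0., 0.]])
attended = soft_weights @ region_features   # weighted combination of regions
```

The soft weights sum to one, so the attended vector is a convex combination of the region features, dominated by the highest-scoring region.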
Self attention was introduced in transformer-based architectures [27]. Self-attention focuses on a single context, and it is commonly used in NLP tasks [28]. In multi-head attention, multiple attention modules run in parallel, and this allows focus to be simultaneously centered on diverse parts of the input and multiple relationships between the input components to be discovered [26].
The layers of a convolutional neural network process the input image and generate new channels from the input's initial three channels. Each channel contains different information, and a channel attention module assigns weights to these channels reflecting their relevance, so the ConvNet can focus on the channels with higher weights [25]. Convolutional neural networks also use spatial attention modules to identify the most relevant locations in an image [24].

Explainability in AI
Deep learning models have proven to be reliable, highly effective and accurate in a wide range of research fields, but we do not know exactly how these models make predictions or why specific conclusions are reached; these concerns limit our trust in them. We can think of deep learning models as black boxes that receive input and produce output, but we do not understand the complex processes that happen inside. We would like to have models that are reliable, accurate and transparent, so they can be trusted. There are several reasons why explainability in AI is essential:
1. It enhances understanding of a model's output. Users of AI models can make informed decisions if they understand how the models work.
2. It reduces the number of errors. Explainability can help spot model anomalies, allowing us to design more accurate models. Explainability also helps us learn from mistakes and train models to prevent them.
3. It provides clarity about a model's output and strengthens our confidence in it. This is essential to build trust and have AI models adopted and accepted.
XAI explanations can be classified into the following three categories [3,29]: visual explanations, textual explanations and example-based explanations.

Visual Explanations
A visual explanation of a medical image is vital to a reliable analysis of the image. Visual explanations that show important parts of an image and can be used to justify decisions are called saliency or heat maps [5,23,30].

Textual Explanations
Text-based explanations provide descriptions of model output. A description may consist of a simple labeling of an image's contents or an entire medical report [19,31].

Example-based Explanations
In example-based explanations, examples are provided to help understand why predictions are made by deep learning models. For instance, a previously diagnosed patient who had the same symptoms as a new patient can be used to understand a diagnosis made by the model [32].
XAI methods can be classified into two groups: ante hoc and post hoc. The term post hoc refers to methods that are used to generate explanations for a model's output using the trained model, whereas ante hoc techniques create explanations during the training stage of the model. XAI methods can also be classified as global or local. A local approach provides explanations for the output produced by a deep learning model for a single, specific instance. Most local approaches are model agnostic, which means that they do not require knowledge of how the deep learning model works. Global approaches aim to explain the logic behind the functioning of a deep learning model, and therefore, these approaches are model-dependent.

XAI Approaches for Chest X-ray Image Analysis and Report Generation
A number of XAI techniques have been proposed to explain how a deep learning model works and how it infers conclusions from the given input. Table 1 summarizes XAI techniques presented in recent years for chest X-ray image analysis. In this section, we describe some of the explanation techniques that have been used in AI models that analyze chest X-rays.

Class Activation Mapping
In ConvNet-based image classification models, class activation maps (CAM) are used to highlight the most relevant or discriminative portions of an image that are used by a model to identify disorders in chest X-rays. A class activation map serves as a visual explanation of a ConvNet model that can assist radiologists in determining whether the decisions made by the model are based on the processing of the correct features of chest X-ray images. Class activation maps can also assist in the detection of data bias.
Some ConvNet layers preserve spatial information and are capable of detecting different objects in an image, such as bones, organs or tumors. The feature or activation maps of a ConvNet measure the importance of each part of an image in detecting a particular object. To construct a CAM, the average of each feature map of the ConvNet's last layer is computed, and these average values are fed to a fully connected layer to assign a weight to each image feature reflecting its importance in the computation of the output (see Figure 2). Each feature map is mapped back to the input X-ray by assigning a colour to each region of the X-ray based on the weight assigned to that region by the feature map. Colours are used to highlight the parts of the image that are most significant for detecting the object selected by a feature map. Similarly, the collection of all ConvNet feature maps can be mapped back to the image by using a linear combination of the feature maps using the weights determined by the fully connected layer. The resulting heat map highlights the areas of the input X-rays that are more relevant to the ConvNet's output.
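A minimal numpy sketch of this construction, using randomly generated stand-ins for the last layer's feature maps and for the fully connected layer's weights (the sizes 8 maps of 7x7 and the 224x224 input resolution are illustrative assumptions), might look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outputs of the last convolutional layer: K feature maps of size HxW.
K, H, W = 8, 7, 7
feature_maps = rng.random((K, H, W))   # stand-in for f_k(x, y)
class_weights = rng.random(K)          # stand-in for the learned weights w_k

# Class activation map: weighted linear combination of the feature maps.
cam = np.tensordot(class_weights, feature_maps, axes=1)   # shape (H, W)

# Normalize to [0, 1] and upsample (nearest neighbour) to the input resolution
# so the map can be overlaid on the X-ray as a colour heat map.
cam = (cam - cam.min()) / (cam.max() - cam.min())
heatmap = np.kron(cam, np.ones((32, 32)))   # 7x7 -> 224x224
```

In a real model the interpolation to input resolution is usually bilinear rather than nearest-neighbour; the structure of the computation is the same.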
Let f_k(x, y) be the value of the kth feature map at position (x, y) of the input image, and let w_k be the weight of the kth feature map. The class activation map M(x, y) is defined as

M(x, y) = ∑_k w_k f_k(x, y).    (1)

Several CAM-based methods have been proposed in the literature to create visual explanations of ConvNet-based models using linear combinations of feature maps. These methods differ in the manner in which the weights for the feature maps are computed. A theoretical study of the best algorithms for computing the values of these weights is presented in [2]. Below, we review some of the recent research on chest X-ray image analysis that uses CAMs and their variations to explain ConvNet models.
A learning system for the identification of pneumothorax in chest X-rays using the deep convolutional neural network ResNet-152 [33] is described in [34]. CAM heat map analysis was used to highlight the parts of an X-ray that are most important to the predictions of the model. It is helpful for radiologists to see which parts of an image are the focus of the neural network, as this assists them in figuring out whether the network bases its predictions on the areas of an image that are most relevant to a particular diagnosis.
CheXNet [30] is a deep learning model to detect and locate 14 different diseases on chest X-ray images. The ChestX-ray 14 dataset was used to train a 121-layer densely connected convolutional neural network that has performance comparable to that of experienced radiologists. CheXNet was also used to predict lung cancer from chest X-ray images [5] and for thoracic disease classification [35]. Transfer learning was used twice in [5] to create a more accurate model for lung cancer detection. This application of transfer learning led to the computation of class activation maps that accurately show the most salient locations on the X-rays that the model uses for making predictions.
A method is proposed in [36] to improve understanding of the features that most heavily influence the decisions of neural network classifiers, through the use of adversarial robust optimization. The invariance of a model to perturbations on its inputs is referred to as adversarial robustness. Feature understanding and interpretability is significant in X-ray analysis because it helps explain why a classifier made a diagnosis. When models are adversarially trained, CAMs reveal a substantially broader set of interpretable features.

Variations of Class Activation Mapping
Class activation maps can be used only with ConvNets of a specific architecture, in which the feature maps of the last convolutional layer feed directly into the output softmax layer; hence, CAMs can be used to explain the decisions of only a limited number of ConvNet types.
Grad-CAM [37] is a generalization of CAM that works with a wider range of ConvNets. Grad-CAM assigns weights to the feature maps based on the gradient information from the last convolutional layer. These weights are calculated by averaging the gradients across the spatial dimensions of the feature maps. Let y^c be the score that a ConvNet computes for the probability that an input X-ray image displays disease or anomaly c. Grad-CAM computes the weights w_k in (1) for anomaly c as

w_k^c = (1/z) ∑_x ∑_y ∂y^c/∂f_k(x, y),    (2)

where z is the number of pixels in the feature map.

The lesion-location guided network LLAGnet [38] integrates two different attention mechanisms, region-level attention (RLA) and channel-level attention (CLA), into a unified framework in order to focus on the discriminative features of lesion locations, as suggested by professional radiologists. Grad-CAMs are used in LLAGnet to construct class-discriminative heat maps which can identify the approximate spatial location of each candidate disease in a chest X-ray image.
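The Grad-CAM weighting described above (global average pooling of the gradients, followed by a ReLU over the weighted combination of feature maps) can be sketched with synthetic stand-ins for the gradients and feature maps; no real network is involved here:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical gradients of the class score y^c w.r.t. each feature map f_k,
# and the feature maps themselves (both random stand-ins).
K, H, W = 8, 7, 7
grads = rng.normal(size=(K, H, W))     # d y^c / d f_k(x, y)
feature_maps = rng.random((K, H, W))

z = H * W
weights = grads.sum(axis=(1, 2)) / z   # global-average-pooled gradients: one w_k per map

# Grad-CAM heat map: ReLU over the weighted combination of feature maps,
# keeping only the regions with a positive influence on the class score.
grad_cam = np.maximum(np.tensordot(weights, feature_maps, axes=1), 0)
```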
Many researchers have been working on image analysis of chest X-rays, particularly since the COVID-19 pandemic. An individual with COVID-19 can suffer from many types of respiratory illness as a result of the disease, from a common cold to pneumonia. Chest X-rays have become even more important since COVID-19, as they are used as a diagnostic tool for assessing the state of the lungs.
The deep learning architecture CovXNet [39] was designed to predict pneumonia caused by COVID-19 in chest X-ray images. X-ray images with different resolutions are used to train several ConvXNets. A meta learner uses the predictions made by the ConvXNets to generate a final output. Grad-CAM was integrated with the ConvXNets to generate heat maps used to interpret the learning of the network from a clinical perspective. The heat maps provide important information about the underlying reasons for the presence of pneumonia.
Covid-SDNet [40] is a ConvNet-based model for categorizing COVID-19 cases as severe, moderate, mild and absent from X-ray images. Grad-CAM was used to highlight the regions of an input X-ray image that triggered a prediction and also the regions that show a counterfactual explanation suggesting a different classification.
A system that integrates image processing, Guided Grad-CAM, ConvNets and risk management is presented in [41] to detect COVID-19 in chest radiography images. Guided Grad-CAM combines Grad-CAM and guided backpropagation to create high resolution heat maps that visualize at the pixel level the most important areas of an X-ray image for a ConvNet (see Figure 3). One of the shortcomings of Grad-CAM is that the heat maps that it produces might be distorted due to the gradients being backpropagated to the input. For the task of object detection and classification, Grad-CAM has poor performance when localizing multiple objects in the same image. Furthermore, for images containing a single object of interest, the heat maps produced by Grad-CAM often do not capture the entire object.
Grad-CAM++ [42] is a generalization of Grad-CAM that incorporates each pixel's contribution to the final output. Grad-CAM++ improves on Grad-CAM by providing better object localization and accurate detection of multiple objects in a single image. Grad-CAM++ computes the weights w_k in (1) for class or anomaly c as

w_k^c = ∑_x ∑_y α_xy^kc · relu(∂y^c/∂f_k(x, y)),    (3)

where α_xy^kc are the weights for the pixel-wise gradients for class c and feature map k, and relu is the rectified linear unit activation function. Work on pixel-space visualization, such as deconvolution [43] and guided backpropagation [44], has shown that positive gradients are very important in producing accurate saliency maps. An activation map f_k with a positive gradient implies that an increase in intensity at location (x, y) would have a positive influence on the classification score y^c. Based on this, in Grad-CAM++ a linear combination of the partial derivatives of each pixel in an activation map f_k represents the importance of that map.
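The Grad-CAM++ weighting can be sketched in the same synthetic setting as above. The closed form used below for the pixel-wise weights α (built from second- and third-order gradient terms) follows the formulation in the Grad-CAM++ paper, but all inputs here are random stand-ins rather than outputs of a real ConvNet:

```python
import numpy as np

rng = np.random.default_rng(2)

K, H, W = 8, 7, 7
feature_maps = rng.random((K, H, W))
grads = rng.normal(size=(K, H, W))   # stand-in first-order gradients d y^c / d f_k(x, y)

# Pixel-wise weights alpha^{kc}_{xy}; under the usual exponential-score assumption
# they reduce to a ratio of second- and third-order gradient terms.
grad2, grad3 = grads ** 2, grads ** 3
denom = 2 * grad2 + feature_maps.sum(axis=(1, 2), keepdims=True) * grad3
alpha = grad2 / np.where(denom != 0, denom, 1e-8)   # guard against zero division

# w_k = sum_xy alpha * relu(gradient); then combine maps as in Grad-CAM.
weights = (alpha * np.maximum(grads, 0)).sum(axis=(1, 2))
grad_cam_pp = np.maximum(np.tensordot(weights, feature_maps, axes=1), 0)
```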
Three XAI methods, Grad-CAM, Grad-CAM++ and Integrated Gradients [45], were applied in [46] to multiple neural network architectures trained to detect pathologies in X-rays. The accuracy of the heat maps they produced was compared to segmentations made by human experts. An explainable deep neural network called DeepCovidExplainer for the automatic detection of COVID-19 symptoms from chest X-rays is presented in [47]. Grad-CAM++ was used to highlight class-discriminating regions in X-rays. Other variations of class activation maps include Score-CAM [48], LIFT-CAM [2] and Ablation-CAM [49].

Attention-Based Explanation
In the field of deep learning, the concept of attention has attracted a lot of interest due to its powerful influence on the learning ability of deep neural networks. Studies have been conducted on developing attention-based models that can explain decisions made by neural network models, allowing humans to trust these decisions.
Attention is undoubtedly one of the most powerful ideas in the field of cognitive science. Attention focuses on relevant features of input data while fading out the nonrelevant ones. Attention allows a neural network to spend more computational power on the relevant features, which represent the critical portions of the data as shown in Figure 4.
Using attention, a neural network can focus on valuable portions of the input and learn the relationships between them.
The concept of attention is implemented in natural language processing (NLP) systems through transformers [26], which have revolutionised the field of NLP. Medical report generation is an NLP problem, which will be discussed in the next section. In image analysis, the notion of attention is incorporated into ConvNets using attention modules.

Table 1. Overview of explainable AI methods for chest X-ray image analysis. The last column indicates the disease or anomaly predicted by a model, and the first column indicates the XAI method used to explain the decisions of the model.

Explainable AI Technique | Study | Year | Chest X-ray Analysis
Class Activation Mapping (CAM) and its variations: CAM creates a heat map reflecting the importance of the feature maps. | Saporta et al. [46] | 2021 | COVID-19
| Paul et al. [34] | 2020 | Pneumothorax identification
| Mahmud et al. [39] | 2020 | COVID-19
| Tabik et al. [40] | 2020 | COVID-19
| Lin et al. [41] | 2020 | COVID-19
| Karim et al. [47] | 2020 | COVID-19
| Khakzar et al. [36] | 2019 | Classification of chest pathologies
| Dunnmon et al. [50] | 2019 | Labelling of chest X-ray pathologies
| Sedai et al. [6] | 2018 | Pathology detection
| Rajpurkar et al. [30] | 2018 | Thoracic disease classification
| Ausawalaithong et al. [5] | 2018 | Detection of lung cancer
Attention-based explanation | [52] | 2019 | Classification of thoracic diseases
| Pesce et al. [53] | 2019 | Pulmonary lesions
| Huang et al. [54] | 2019 | Diagnosis of chest pathology
| Guan et al. [55] | 2018 | Emphysema detection
| Ypsilantis et al. [56] | 2017 | Enlarged heart
Local Interpretable Model-Agnostic Explanations (LIME) | … | … | …

In the multi-label chest X-ray image classification problem, the discriminative features of different pathologies must be learned. In general, a chest X-ray can contain information about various anomalies, so critical clues are required to classify and localize the different abnormalities in the lung regions. Attention has been used with fully convolutional networks (FCNs) in [7,54] to design a multi-attention convolutional neural network for automatic disease detection in chest X-ray images. The FCNs are an adaptation of the DenseNet-121 model that can process spatial information [54]. They create multiple attention maps for each pathology category being considered via a collection of correlated convolutions followed by a mean-pooling process. Because each channel in an image shows a specific visual symptom of a disease class, the channels can represent large intra-class variability. This intra-class variability helps to generate explanations using heat maps based on spatial attention maps.
A recurrent attention model is proposed in [53] that uses reinforcement learning to focus on the parts of an X-ray image that are likely to display pulmonary lesions. Spatial attention maps were used to predict the regions where a pathology might exist, which improves the accuracy of classification.
A stochastic attention-based approach is presented in [56] to predict which areas in chest X-rays should be visually explored to look for a specific radiological anomaly: enlarged heart.
Thorax diseases typically occur in localised disease-specific regions. Irrelevant noisy regions in chest X-ray images are regions that either do not portray any information about the disorder or do not depict it clearly. These irrelevant noisy regions may have a negative effect on ConvNets that are trained with whole X-ray images. An attention-guided ConvNet is described in [55] which learns where the noisy regions are and uses that information to accurately identify regions showing the presence of a disorder. A heat map generated with the attention-guided ConvNet serves as a guide to crop out the noisy regions of an X-ray image.
KGZNet, a knowledge-guided deep neural network for the automatic diagnosis of thoracic diseases, is proposed in [52]. KGZNet is a zoom neural network that is trained on hierarchically organized partitions of X-ray images and guided by human medical expertise. According to [52], thoracic diseases are typically limited to certain lung regions. A lung lesion is learnt through the analysis of lung images guided by an attention heat map. Disease-specific CAM attention heat maps focused on locating specific disorders were used to visualize suspicious lesions in chest X-ray images.
A novel deep learning method to diagnose COVID-19 in chest X-ray images using a self-supervised learning approach with a convolutional attention module is presented in [51]. By using Score-CAM [48], they identified the causes of misclassified cases. Score-CAM heat maps were generated based on a convolutional attention mechanism.

Local Interpretable Model-Agnostic Explanations (LIME)
LIME is an interpretability method that is model agnostic; this means that LIME explains why an AI model makes a particular prediction without making any assumptions about the architecture of the model. For the case of ConvNets, these models might base their decisions on a large number of features. Model explanations need to be understandable to humans, so explanations must be based on a subset of features that humans can comprehend and relate to the predictions made by a model. Therefore, LIME approximates a ConvNet with a simpler interpretable model that behaves like the ConvNet for a specific prediction (this is called local interpretability). LIME works by first partitioning an image into blocks of homogeneous regions consisting of pixels with similar attributes (such as color and brightness). These blocks are called superpixels. Then, a set S of new images is created by graying out a random selection of superpixels. A ConvNet is then used on each image of S to make a prediction, and a weight is assigned to each superpixel denoting its importance for making the prediction. In an image, the superpixels with the largest weight are highlighted, yielding a heat map showing the most important parts of the image over which the ConvNet based its prediction.
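The LIME procedure described above can be sketched end to end with a toy "black box" whose score happens to depend mostly on one region of the image. The 2x2 superpixel grid, the sample count and the linear surrogate fit below are simplified illustrative choices; real LIME segments superpixels by pixel similarity and weights samples by proximity to the original image:

```python
import numpy as np

rng = np.random.default_rng(3)

def black_box_predict(images):
    """Stand-in for a ConvNet: the score depends mostly on the top-left region."""
    return 2.0 * images[:, :4, :4].mean(axis=(1, 2)) + 0.1 * images.mean(axis=(1, 2))

# Partition an 8x8 image into four 4x4 "superpixels".
image = rng.random((8, 8))
masks = []
for r in (slice(0, 4), slice(4, 8)):
    for c in (slice(0, 4), slice(4, 8)):
        m = np.zeros((8, 8), dtype=bool)
        m[r, c] = True
        masks.append(m)

# Sample perturbed images: gray out (zero) a random subset of superpixels.
n_samples = 200
on_off = rng.integers(0, 2, size=(n_samples, len(masks)))
perturbed = np.stack([
    image * sum(on_off[i, k] * masks[k] for k in range(len(masks)))
    for i in range(n_samples)
])

# Fit a local linear surrogate: prediction ~ intercept + sum_k w_k * on_off_k.
y = black_box_predict(perturbed)
X = np.column_stack([np.ones(n_samples), on_off])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
superpixel_weights = coef[1:]   # largest weight -> most important superpixel
```

Since the toy black box bases its score on the top-left region, the surrogate assigns the largest weight to the first superpixel, which is exactly the region a LIME heat map would highlight.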
A number of works have been proposed for COVID-19 detection since the pandemic began in 2020. To reduce dependency on limited COVID-19 test kits, an alternative is the use of screening systems for chest X-rays. Two studies on the effectiveness of COVID-19 detection using different ConvNet-based models were presented in [57,63]; LIME was used to identify the features that the ConvNets used to distinguish patients with COVID-19 from patients without COVID-19.
A ConvNet-based system to identify lesions in X-rays is presented in [62]. In this work, a combination of predictions from different classifiers was used to detect abnormalities using frontal and lateral X-rays. Radiologists highlighted the regions of X-rays they would focus on to make a diagnosis, and this human-made highlighting was compared to that produced by Grad-CAM and LIME.
A spiking neural network technique is presented in [58] to detect COVID-19 positive cases using a spike neural network (SNN) with supervised synaptic learning. Three additional works on the use of ConvNets to detect COVID-19 using X-ray images are presented in [59][60][61]. All these works use LIME to identify the regions of an X-ray showing a COVID-19 infection.
Explanations obtained through LIME have demonstrated their importance in COVID-19-related research. In the analysis of chest X-rays, localization and segmentation are crucial parts of the deep learning process, and, as the research discussed in this section reports, LIME can effectively explain the prediction of various chest pathologies.

Layer-Wise Relevance Propagation (LRP)
Layer-wise relevance propagation unravels the prediction of a deep neural network by propagating the prediction backwards through the layers of the network to compute relevance scores for the pixels of the input image. This backward propagation is performed as follows. For each neuron i in the last layer, the neural network computes an output X_i through its activation function; this output X_i is the relevance score for neuron i. Consider now a neuron j in network layer L_l with relevance score R_j^l. This relevance score is backpropagated to the neurons k in the previous layer L_{l−1} that provide input to j, so that

R_j^l = ∑_k r_jk,    (4)

where r_jk is the fraction of the relevance score of neuron j transferred to neuron k. Equation (4) defines a conservation property: the total relevance score of the neurons in each layer is the same, and it is equal to the value of the prediction p(X) computed by the neural network for image X.
The relevance score for a neuron k in layer L_{l−1} is then computed as

R_k^{l−1} = ∑_j r_jk,    (5)

where the sum runs over all neurons j of layer L_l to which k provides input. The relevance scores of the pixels are obtained from the relevance scores of the neurons in the first layer, and they are visualized as a heat map. The functions r_jk in Equation (4) are called propagation rules. Different propagation rules have been proposed, including LRP-0 [68], the epsilon rule [68], LRP-αβ [68], the z+ rule [69] and the gamma rule [70].
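The backward pass described above can be sketched for a small bias-free ReLU network using the LRP-0 rule. This is a minimal numpy illustration under simplifying assumptions (no biases, a single propagation rule), not the implementation used in the works cited below.

```python
import numpy as np

def lrp0(weights, x):
    """LRP-0 relevance propagation for a bias-free ReLU network."""
    activations = [x]
    for W in weights:                        # forward pass, storing activations
        x = np.maximum(0, x @ W)
        activations.append(x)
    R = activations[-1]                      # output scores are the initial relevance
    for W, a in zip(reversed(weights), reversed(activations[:-1])):
        z = a @ W + 1e-9                     # total contribution to each upper neuron
        s = R / z                            # relevance per unit of contribution
        R = a * (s @ W.T)                    # redistribute to the lower layer
    return R                                 # one relevance score per input feature
```

For positive weights and inputs the conservation property holds: the relevance scores of the input features sum to the network's output.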
Deep Covid-Explainer, a neural network ensemble for automatic detection of COVID-19 symptoms from chest X-rays is described in [47]. Class discriminating regions are highlighted using Grad-CAM++ and LRP to provide explanations and to identify critical regions on patients' chests.
In [65], two ConvNets, VGG16 and ResNet60, are used to detect pneumonia caused by the COVID-19 virus. LRP, LIME and Grad-CAM are used to generate explanations for the predictions made by the two models.
In [66], a model for detecting pneumonia and COVID-19 is presented that uses deep neural networks trained with transfer learning. LRP was used to discover that the words and letters printed in X-rays can influence the predictions of the model. In [67], a COVID-19 detection model was designed that consists of a segmentation module and a 201-layer ConvNet. LRP was used to generate heat maps which were correlated with the Brixia scoring system used by radiologists to measure the severity of COVID-19 in different lung regions.
ISNet [64] is a ConvNet-based system that is able to perform segmentation and classification as a single process. ISNet introduced the concept of relevance segmentation in LRP maps to minimize background relevance.

XAI Approaches in Medical Report Generation
Automated medical report generation from chest X-rays has the potential to improve patient clinical diagnosis. Automated report generation is a special type of image captioning problem. The sentences generated by image captioning are usually short and describe the most prominent visual elements of an image. This cannot fully represent the rich information of an image, but it can help train deep learning models to associate parts of an image with words. Deep learning models for image captioning that use attention mechanisms are effective and accurate [71][72][73].
An auto report generator can potentially relieve doctors of a considerable amount of work by assisting them in drafting medical reports. We have discussed the role of explanations in chest X-ray image analysis. In this section, we explain why XAI is a significant part of automated medical report generation.

Image Captioning with Visual Explanations
The image captioning problem combines elements of natural language processing (NLP) and image processing. There are several works on the use of deep neural networks for image captioning that provide visual explanations for the models' predictions. A multimodal neural network called TandemNet is presented in [74], which can detect bladder cancer and produce a diagnostic report. TandemNet uses a ResNet to analyze images and long short-term memory (LSTM) networks to model report sentences. A dual-attention module is used to train the system using images and text.
Through the interplay of semantic information with visual information, TandemNet is taught to distill the most relevant features of an image. Attention maps are used to visualize how TandemNet uses image and text information to support its predictions.
TieNet [75] is a system for predicting thoracic diseases and automatically generating diagnostic reports. TieNet uses a ResNet for image analysis and LSTM networks for text processing, and it integrates multi-level attention models for the most significant words in a report and regions in an image.
A system is proposed in [76] that generates explanations for the predictions made by a deep-neural-network-based diagnosis system. The justification generator provides explanations consisting of heat maps highlighting the most relevant regions in an image and textual reports indicating the significance of the heat maps.

Textual Explanations with Concept Activation Vectors
Concept activation vectors (CAVs) [77] are designed to provide an interpretation of the inner workings of a deep neural network using human-understandable concepts. The state of a neural network can be represented as a vector space V_n, where vectors correspond to input features and neuron outputs. This vector space is difficult to understand for humans, who are more adept at working with concepts. CAVs provide a translation between V_n and V_k, where V_k is a vector space in which vectors correspond to human-understandable concepts.
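A minimal sketch of how such a translation can be obtained: fit a linear separator between the layer activations of concept examples and of random examples, and take its normalized normal vector as the CAV; the directional derivative of a class score along this vector then measures concept sensitivity, in the spirit of TCAV. The function names are hypothetical, and the least-squares classifier is a simplification of the trained linear classifier used in [77].

```python
import numpy as np

def fit_cav(concept_acts, random_acts):
    """CAV: unit normal of a linear separator between concept and random activations."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), -np.ones(len(random_acts))])
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)    # least-squares linear classifier
    v = w[:-1]                                    # drop the bias term
    return v / np.linalg.norm(v)

def concept_sensitivity(class_grad, cav):
    """Directional derivative of a class score along the concept direction."""
    return float(class_grad @ cav)
```

A positive sensitivity means that moving the activations toward the concept increases the class score, i.e., the concept supports that class.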
In [78], a ConvNet model using variational auto-encoders (VAE) [79] is presented that detects cardiac diseases in temporal sequences of cardiac magnetic resonance (MR) segmentations. CAVs allow us to identify clinically known biomarkers that are associated with cardiac disorders. Hence, when the model classifies images, it also provides interpretable concepts relevant to the classification and relates them to the corresponding parts of the images.
The CAV model was extended in [80] through the addition of regression concept vectors (RCVs); while CAV models indicate whether a concept is present or not in an explanation of a deep learning model's prediction, regression concept vectors express continuous measures of that concept. RCVs are especially useful when investigating continuous features such as tumor size. The use of RCVs to generate explanations for the decisions of the breast cancer detection ConvNet [80] gives a better understanding of why the ConvNet classifies some areas of an image as cancerous and others as healthy.
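In the same spirit, a regression concept vector can be sketched as the direction obtained by linearly regressing the continuous concept measure (e.g., tumor size) on the layer activations. This minimal least-squares version uses hypothetical names and is only an illustration of the idea in [80].

```python
import numpy as np

def fit_rcv(acts, concept_measure):
    """RCV: unit direction in activation space along which the
    continuous concept measure increases fastest."""
    A = np.hstack([acts, np.ones((len(acts), 1))])   # append a bias column
    w, *_ = np.linalg.lstsq(A, concept_measure, rcond=None)
    v = w[:-1]                                       # drop the bias term
    return v / np.linalg.norm(v)
```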
In [81], a framework is proposed for generating explanations for ConvNet decisions using RCVs. The framework allows explanation generation for multi-class classification tasks, and it improves the learning stage through the removal of spatial dependencies of the convolutional feature maps.

Other Textual Explanation Techniques
A hierarchical model for text processing with multi-attention is presented in [31]. This work identified two main challenges in automated medical report generation: identifying the regions of an X-ray image that show a pathology, and describing this information in textual form. To address both issues, the authors propose a multi-attention hierarchical model that focuses on the image's channels and spatial information, together with a word embedding method that incorporates the patient's medical history.
As part of the efforts to develop a radiologist-interpretable algorithm for lung cancer prediction, ref. [82] presents a hierarchical semantic convolutional neural network model (HSCNN) for detecting malignant nodules in CT scans. When analyzing and detecting a malignant nodule, HSCNN considers five nodule properties: calcification, margin, subtlety, texture and sphericity. In addition to the diagnosis prediction, these five nodule properties help explain the final malignancy prediction.
In [8], the authors present a domain-aware system for generating chest X-ray radiology reports. The model first predicts the topics to be discussed in the report and then generates conditional sentences corresponding to those topics. The resulting system is fine-tuned with reinforcement learning for both clinical accuracy and readability. An attention mechanism is embedded in the model, and attention maps are produced as output, highlighting the portion of the image that corresponds to each generated description.

Discussion of Explainable AI Approaches
In the development of high stakes decision-making systems, such as computer-aided diagnostic systems, it is of fundamental importance that we understand how those systems work and how they make decisions; without this knowledge, these systems cannot be trusted. XAI is rapidly becoming one of the mainstream subjects in AI as it provides the foundations for understanding complex AI models, a necessary requisite for deploying these systems in critical applications. XAI is still in its early stages. New and better explanation techniques are being developed, and we expect that they will revolutionize the healthcare field.
Automated medical report generation can be subdivided into two related problems that come from two separate domains of study: the analysis of medical images, which belongs to computer vision, and the generation of the report text, which belongs to natural language processing (NLP). No single standard XAI technique can be applied simultaneously to computer vision and NLP models. When the analysis needs to focus on parts of a medical image, class activation mapping (CAM) and its variants have proven to be important techniques for explaining the decisions of ConvNet classifiers. In situations where we have no knowledge of the internals of the AI model, model-agnostic techniques, such as LIME, are helpful.
Attention mechanisms are part of the neural architecture itself and dynamically highlight the relevant features of the input data: specific parts of an image in the case of image analysis, and particular sequences of textual elements in the case of NLP. If we are interested in the contribution of each pixel to a model's prediction, then LRP is the best-suited technique, as it propagates the output back through the network to the input layer using the network weights and the neuron activations created by the forward pass.
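To make the mechanism concrete, here is a generic scaled dot-product attention sketch in numpy, not tied to any particular model from the surveyed works: the normalized weights indicate which input elements (image regions or words) the model attends to, and these weights double as an explanation.

```python
import numpy as np

def attend(query, keys, values):
    """Scaled dot-product attention; returns the context vector and the weights."""
    scores = keys @ query / np.sqrt(query.size)   # similarity of the query to each key
    w = np.exp(scores - scores.max())             # numerically stable softmax
    w /= w.sum()
    return w @ values, w                          # weights sum to 1 (a distribution)
```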
A summary of XAI techniques used in medical image analysis and report generation is depicted in Table 2. Being post hoc techniques, CAM, Grad-CAM, LIME, LRP and CAV use trained networks to generate explanations, whereas attention-based explanations and explanations generated for image captioning are ante hoc techniques. LIME generates local explanations, while CAV generates global explanations and can also generate local ones. Two important open research problems that we encountered through our study are the following.

•	An integrated XAI framework is required for automated medical report generation. The framework should integrate the explainability aspect for both image processing and text generation; currently, existing XAI methods deal with only one aspect of the automated report generation process.
•	A reasoning mechanism is required to provide quantitative, and not just qualitative, explanations of the decisions of a model. This will be helpful to understand and improve the accuracy of AI models.

Improving AI Models through XAI
XAI methods help explain the decisions made by AI models and can help enhance them. The reflective neural network Reflective-Net introduced in [83] uses a reflection process to improve its accuracy. In this reflection process, first a classifier makes a prediction based on an input I, and it generates an explanation E. Then, the input I and explanation E are given to a reflective network that refines the prediction. Training the reflective network with correct and incorrect explanations helps increase its accuracy.
XAI techniques have also been found to be highly effective in improving deep learning model performance, as described in [84]. MobileNet, a ConvNet to detect metal surface defects, was used in [84] on a dataset of images containing superimposed text and company logos. Through the use of LRP, it was demonstrated that the performance of the model was consistently degraded by this unwanted information: the LRP analysis showed that the model learned to identify patterns in the text and logos rather than the actual surface defects. The performance of the model was greatly improved by removing the superimposed text and logos from the images before training.

Challenges in Explainable AI
To ensure the production of accurate automated medical reports, a multidisciplinary approach should be used, taking into account input from the report generation system designers, the users of the system and anyone who will be affected by the system. Although XAI can assist in identifying problems with medical data, the existence of unstructured medical data remains a challenge for the development of useful AI-based systems.
There are several problems with existing XAI techniques. Two of the most important are:
•	The difficulty of producing explanations that are both accurate and understandable across different prediction models.
•	The lack of quantitative methods to evaluate the correctness and completeness of explanations.
To address the first challenge, we note that there is no universally best XAI technique: some techniques provide understandable and accurate explanations for some prediction models but not for others. A careful selection of the XAI method is therefore essential for improving the quality of explanations. In addition, it is very important, as mentioned above, to involve system users in the design of XAI techniques, since user input helps improve the understandability of explanations.
The second challenge highlights the need to design accurate metrics to evaluate XAI techniques [1]. Currently, studies on this area are mainly based on subjective measurements, such as user satisfaction, clarity of descriptions and trust in the system [1]. An overview of metrics for evaluating explainability properties (i.e., clarity, breadth, parsimony, completeness and soundness) is discussed in [85]. There is an overall lack of universally accepted quantitative evaluation metrics for XAI techniques, so additional research in this direction is needed.

Conclusions
A report generated by an automated medical report generator must be trustworthy, easy to understand and accurate in order to be used effectively in practice. The quality of the explanations on how the report was generated and how its diagnoses were reached is a key factor to meet these goals. Having a system that is explainable allows developers to identify any shortcomings or inefficiencies and clinicians to be confident in the decisions they make with the help of these systems.
Although many studies have been conducted on the use of XAI in the medical field, there has been no work summarizing research on the use of XAI for automated medical report generation. XAI techniques have been applied and discussed with reference to medical image analysis, but the role of XAI in NLP models for medical report generation has not been extensively explored. This paper summarizes some of the most relevant research on the use of XAI for the analysis of chest X-ray images and automatic medical report generation. We also list some of the current challenges in XAI research and mention some ways in which the performance of AI models can be improved through the use of XAI.

Conflicts of Interest:
The authors declare no conflict of interest.