An Explainable Classification Method of SPECT Myocardial Perfusion Images in Nuclear Cardiology Using Deep Learning and Grad-CAM

Background: This study targets the development of an explainable deep learning methodology for the automatic classification of coronary artery disease, utilizing SPECT MPI images. Deep learning is currently judged as non-transparent due to the model's complex non-linear structure, and thus, it is considered a "black box", making it hard to gain a comprehensive understanding of its internal processes and explain its behavior. Existing explainable artificial intelligence tools can provide insights into the internal functionality of deep learning and especially of convolutional neural networks, allowing transparency and interpretation. Methods: This study seeks to address the identification of patients' CAD status (infarction, ischemia or normal) by developing an explainable deep learning pipeline in the form of a handcrafted convolutional neural network. The proposed RGB-CNN model utilizes various pre- and post-processing tools and deploys a state-of-the-art explainability tool to produce more interpretable predictions in decision making. The dataset includes cases from 625 patients as stress and rest representations, comprising 127 infarction, 241 ischemic, and 257 normal cases previously classified by a doctor. The imaging dataset was split into 20% for testing and 80% for training, of which 15% was further used for validation purposes. Data augmentation was employed to increase generalization. The efficacy of the well-known Grad-CAM-based color visualization approach was also evaluated in this research to provide predictions with interpretability in the detection of infarction and ischemia in SPECT MPI images, counterbalancing any lack of rationale in the results extracted by the CNNs. Results: The proposed model achieved 93.3% accuracy and 94.58% AUC, demonstrating efficient performance and stability. Grad-CAM has been shown to be a valuable tool for explaining CNN-based judgments in SPECT MPI images, allowing nuclear physicians to make fast and confident judgments by using the visual explanations offered. Conclusions: Prediction results indicate a robust and efficient model based on the deep learning methodology which is proposed for CAD diagnosis in nuclear medicine.


Introduction
Coronary artery disease (CAD) is the most frequent pathological condition and is the primary reason for mortality worldwide. CAD is an atherosclerotic disease commonly resulting from genetic and environmental circumstances [1,2]. CAD occurs when the blood vessel leading to the heart muscle narrows, resulting in potential damage to part of, or the entire, heart muscle. Early detection can be lifesaving and offers the opportunity the automatic classification of SPECT MPI images. The classification task is a three-class problem that labels images as infarction, ischemia, or healthy. After thoroughly exploring the CNN architecture and the configuration of certain hyper-parameters, the authors proposed a robust CNN model followed by Grad-CAM, which can predict infarction and ischemia while providing explanations for these decisions. For further evaluation of the model, we applied k-fold cross-validation to estimate its robustness and reliability.

Literature
A number of studies have been reported regarding the automatic analysis of SPECT MPI images in CAD diagnosis utilizing DL. In particular, Betancur et al., in [19], developed a CNN to estimate obstructive CAD. The dataset included 1160 SPECT MPI polar map cases without known CAD, in semi-upright and supine positions, and stress demonstrations without predefined coronary territories. Furthermore, sex was added as information between the fully connected layers. The classification was validated by utilizing a novel leave-one-center-out cross-validation procedure with four centers, which is equivalent to external validation. The proposed model was compared against the standard quantitative method cTPD (total perfusion deficit), and it emerged that the deep CNN outperformed cTPD. More specifically, DL and cTPD achieved AUCs of 0.81 and 0.78 per patient, and 0.77 and 0.73 per vessel, respectively. In addition, Betancur et al., in [20], explored the capabilities of CNN and TPD to successfully predict obstructive CAD. A total of 1638 patients without known CAD were included in stress SPECT MPI polar map representations. Moreover, the information about sex was added as an extra reference to the CAD's characteristics. The results demonstrated that the CNN achieved higher sensitivity than TPD, which was 82.3% and 79.8% per patient and 69.8% and 64.4% per vessel for CNN and TPD, respectively. The proposed model underwent a stratified 10-fold cross-validation procedure to evaluate the prediction. Zahiri et al., in [21], investigated the prediction of abnormalities regarding CAD with the development of a CNN. A total of 3318 images of stress polar maps were included, with patients in a supine position. A stratified five-fold cross-validation procedure, including additional rest scans, was used to evaluate the model. An expert reader labeled the images for classification purposes. Furthermore, data augmentation was used to reduce over-fitting and achieve generalizability. The results demonstrated that adding rest perfusion maps improved the AUC, which reached 0.845 against the 0.827 observed with the stress images alone.
Papandrianos et al., in [22], explored the capabilities of CNNs to diagnose CAD automatically in a two-class classification problem. A total of 513 patients were included in the stress and rest representation, and the possible outputs were normal and abnormal. The data augmentation technique was utilized to increase the number of training samples. Regardless of the small size of the dataset, the authors obtained high values, namely an AUC of 93.77% and an accuracy of 90.2075%. In [5], Papandrianos et al. targeted the implementation of a CNN to automatically diagnose early signs of CAD (infarction or ischemia) utilizing SPECT MPI data. A total of 224 patients were included in the stress and rest representation. Data augmentation was also used to increase the size of the dataset. The authors implemented an RGB-CNN model and compared the results against a robust technique, which was transfer learning, employing VGG16, DenseNet, MobileNet, and InceptionV3 as pre-trained networks for the classification of images as normal or abnormal. The extracted results demonstrated the model's great future potential, with an accuracy of 93.47% ± 2.81% and an AUC score of 0.936. Apostolopoulos et al., in [23], focused on using CNNs to categorize polar maps into normal and abnormal. This research consisted of 216 patient cases in a stress and rest demonstration, wherein both attenuation-corrected (AC) and non-corrected (NAC) polar maps were included. Given the small dataset, the authors followed two methodologies. The first one was transfer learning and, more specifically, VGG-16, which is widely utilized in image classification tasks. The second methodology applied data augmentation to increase the number of training images. The evaluation of VGG-16 was accomplished through 10-fold cross-validation. The extracted results were also compared against standard semi-quantitative methodologies and experts' analyses. The pre-trained VGG-16 network outperformed them with an accuracy of 74.53%, sensitivity of 75%, and specificity of 73.43%, whereas the accuracy of the semi-quantitative analysis was 66.20%. Apostolopoulos et al., in [24], developed a hybrid method to automatically classify MPI polar maps, concerning the early diagnosis of CAD, in contrast to medical experts' diagnostic analyses. A total of 566 patients in a stress and rest representation were involved in this research, while the following clinical data were also added: gender, age, symptoms, pre-disposing factors, and recurrent diseases. Data augmentation was used to apply variation to the images, providing the model with generalizability. The authors developed a hybrid combination of InceptionV3 and random forest (RF) algorithms with the utilization of images and clinical data. The hybrid InceptionV3-random forest model achieved 79.15% accuracy, similar to the expert's analysis, with a sensitivity of 77.36% and specificity of 79.25%. The results were evaluated with the 10-fold cross-validation technique.
Liu et al., in [6], aimed to investigate a DL approach to improve the diagnostic accuracy of CAD. A total of 37,243 patients in a stress mode were selected in this study, wherein count profile maps were extracted from SPECT MPI images. In addition, clinical data were included, such as gender, BMI, length, stress type, and radiotracer, along with the options of adding or not adding attenuation correction. A DL methodology with transfer learning was developed. More specifically, ResNet-34 was utilized and compared against a conventional quantitative perfusion defect size (DS). The DL prediction was evaluated utilizing five-fold cross-validation. The AUC results for the DL and DS methods were 0.872 ± 0.002 and 0.838 ± 0.003, respectively, with the DL method showing better performance.
Berkaya et al., in [8], proposed two classification models, DL-based and knowledge-based, to identify perfusion abnormalities (infarction and/or ischemia). Concerning the first type, the authors employed transfer learning and a support vector machine (SVM) classifier. With regard to the second model, they focused on the expert readers' analysis to apply image processing techniques such as segmentation, color thresholding, and feature extraction. The dataset consisted of 192 patients in a stress and rest demonstration. The proposed models achieved results similar to the experts' analysis, providing an accuracy of 94% and 93%, sensitivity of 88% and 100%, and specificity of 100% and 86% for the DL-based and the knowledge-based models, respectively. Filho et al., in [25], developed an ML algorithm to detect perfusion abnormalities on CAD images. A total of 1007 polar maps were included in the stress and rest representation, wherein each image was split into five vertical and five horizontal slices, and ten attributes were acquired. Moreover, data augmentation was used to increase the normal images of the dataset. The authors employed random forest as a classification algorithm and compared the extracted results with adaptive boosting (AB), gradient boosting (GB), and eXtreme gradient boosting (XGB). The RF algorithm outperformed all, attaining an AUC of 0.853, accuracy of 0.938, precision of 0.968, and sensitivity of 0.963. Nakajima et al., in [26], proposed an artificial neural network (ANN) to diagnose CAD, in contrast to a conventional quantitative approach concerning several metrics. The dataset included 1001 images in a stress and rest demonstration. Furthermore, patient data concerning CAD characteristics were included as additional information in the ANN algorithm. The clinical data were sex, age, weight, height, risk factors, coronary angiography results, and history of percutaneous coronary intervention (PCI) or coronary artery bypass grafting (CABG). For evaluating the model, data from 364 patients were included and utilized as an external validation dataset. The ANN generated a better AUC value (0.9), demonstrating high capabilities for future studies. Ciecholewski et al., in [27], presented three methodologies for diagnosing ischemic heart disease: SVM, principal component analysis (PCA), and NN. The results from the three methodologies' implementation were compared against the CLIP3 algorithm, which is a combination of the decision tree algorithm and the rule induction algorithm. A total of 267 SPECT MPI heart images were included, having previously undergone a stress and rest examination. As a result of the complete experiment, the SVM outperformed all and obtained higher accuracy and specificity. Nevertheless, PCA achieved better sensitivity, and the sensitivity of NN was surprisingly low. Otaki et al., in [18], presented an explainable DL model for the extraction of the probability of CAD, with SPECT MPI stress and rest polar maps. A total of 3578 patients were included and clinical data such as age, sex, and cardiac volumes were added. Furthermore, attention maps were generated with Grad-CAM, which constitutes an explainability tool that helps ensure that model predictions are accurate and related to the problem. The CNN model demonstrated a higher AUC value (0.83), in contrast to an expert's visual analysis (0.71) and a quantitative approach (0.78). The results were evaluated with both 10-fold testing and external testing on unseen data.
Chen et al., in [4], examined the utilization of CZT SPECT myocardial perfusion images for the classification of CAD. The authors developed a three-dimensional CNN to classify abnormal and normal patients. In addition, they followed a five-fold cross-validation approach for the hyper-parameter adjustment and experimentation of the model's ability. The model's decisions were visualized using Grad-CAM, which produces heatmaps highlighting the regions that correspond to the predicted class. The CNN model achieved an accuracy, sensitivity, and specificity of 87.64%, 81.58%, and 92.16%, respectively. Otaki et al. [28] developed a DL model and compared its performance against TPD. A total of 1160 patients, in four different centers, without known CAD were included in raw polar maps in an upright and supine stress MPI demonstration. Besides the images, the following clinical data were added: sex and body mass index (BMI). Grad-CAM was applied to evaluate the CNN. The sensitivity extracted by the DL methodology was compared with those of SSS (summed stress score), U-TPD (upright-TPD), and S-TPD (supine-TPD), with values of 82%, 75%, 77%, and 73% in men, and 71%, 71%, 70%, and 65% in women, respectively. Spier et al., in [29], proposed a CNN for automatic CAD analysis. The authors developed a graph CNN for automatically classifying 946 MPI polar maps into normal and abnormal in a stress and rest representation. They further compared the results against three other neural network methodologies. The proposed model achieved results similar to human analysis, which were 89.3% for rest polar maps and 91.1% for stress polar maps. The results were evaluated with the utilization of a four-fold cross-validation procedure. It must be mentioned that heatmaps were generated to visualize the model's decisions behind the classification.
Nazari et al., in [13], developed an explainable technique, namely layer-wise relevance propagation (LRP), which produces an individual relevance map for each patient to explain the 3D-CNN model concerning the classification of DAT-SPECT data with regard to Parkinson's disease. A total of 1296 SPECT MPI images were included and classified by experienced readers into normal and abnormal. The model achieved high accuracy, sensitivity, and specificity values, which were 95.8%, 92.8%, and 98.7%, respectively. However, the CNN performed similarly to conventional semi-quantitative analysis, as well as to classification and regression tree analysis.
Magesh et al., in [30], focused on the early diagnosis of Parkinson's disease through the application of XAI and, more specifically, LIME (local interpretable model-agnostic explanations), which is a widely used XAI method for interpreting a model's decisions. A total of 642 DaTscan SPECT images were included in the corresponding research dataset. The authors employed transfer learning by utilizing the VGG-16 pre-trained network. Moreover, data augmentation was applied to increase the number of training images. The pre-trained model achieved an accuracy of 95.2%, sensitivity of 97.5%, and specificity of 90.9%.
In conclusion, the above-mentioned related works include a considerable number of studies concerning medical imaging cases and the development of a computer-aided system for automatic classification. Nevertheless, XAI has only recently been applied in nuclear imaging, in a small number of research papers, particularly on CAD, to eliminate the model's bias; therefore, there is a need for further research and experimentation in this area.

CAD Dataset
Patient data have been acquired from the Diagnostic Medical Center "Diagnostiko-Iatriki A.E." in Larisa, Greece, by the Nuclear Department and have been retrospectively examined. The study covers a period from 30 March 2012 to 28 February 2017. Over this period, 842 consecutive patients underwent gated-SPECT MPI with 99mTc-tetrofosmin. A hybrid SPECT/CT gamma-camera system (Infinia, Hawkeye-4, GE Healthcare, Chicago, USA) was used for MPI imaging. Fifty-six (56) patients were excluded from the dataset due to inconclusive MPI results. Our dataset includes a total of 625 patients, of which 127 correspond to infarction, 241 to ischemic, and 257 to normal. The images have been extracted with the SPECT method and illustrate a visual representation of the heart in rest and stress modes.
Two nuclear medicine experts (N. Papandrianos and D. Apostolopoulos) were asked to label each instance of the dataset according to their expertise and experience. The nuclear medicine experts count several years of experience (approximately 15 and 25 years, respectively). The experts completed the labeling using solely the MPI scans from each patient. This way, the model could be directly compared with the human experts. Hence, this study uses the experts' diagnostic yield as the ground truth and aims to furnish a DL model capable of competing with the human eye and expertise. The ethical committee for our institution approved the study. The nature of the survey waives the requirement to obtain patients' informed consent. In Figure 1, we provide a representation of all cases.

The clinical characteristics of the dataset are presented in Table 1. The protocol followed included a 1-day stress-rest injection of Tc-99m tetrofosmin for SPECT imaging. Symptom-limited Bruce protocol treadmill exercise testing (n = 154 [69%]) or pharmacologic stress (n = 69 [31%]) was applied to the patients, while the radiotracer was injected at peak exercise or during maximal hyperemia, respectively.
Stress SPECT images were collected in the first 20 min after injecting 7 to 9 mCi 99mTc-tetrofosmin in both medical processes (effort test or pharmacological stress with dipyridamole). Concerning the effort test, patients underwent a treadmill test based on the Bruce protocol and were injected with 99mTc-tetrofosmin once at least 85% of the age-predicted maximum heart rate had been achieved, allowing 1 min before the end of the test. During the rest process, a dose of 21-27 mCi 99mTc-tetrofosmin was injected into patients, allowing 40 min for rest imaging to be performed. In particular, 32 projections of SPECT MPI were carried out in a period of 30 s for the stress and 30 s for the rest, before the SPECT system delivered the data. A 140 keV photopeak, a 180 degree arc, and a 64 × 64 matrix were among the configurations that were additionally set up.
This study was approved by the board committee director of the diagnostic medical center "Diagnostiko-Iatriki A.E.", Dr. Vasilios Parafestas. The director of the diagnostic center waived the requirement to obtain informed consent due to its retrospective nature. All procedures in this study were in accordance with the Declaration of Helsinki.

Convolutional Neural Networks: Main Aspects
CNN refers to a computational method that mimics the functionality of the brain's neurons. CNNs include input, hidden, and output layers, where each layer consists of nodes connected by edges. The term deep neural network applies to a CNN with at least two hidden layers. CNNs have demonstrated great accuracy and efficiency and are trustworthy for use in image recognition tasks. One of their main advantages is that they can operate effectively with only images as input and do not need visual extraction of features [31]. CNNs have established their position in medical image analysis based on their fascinating results [32]. Each layer is described in detail below.
The first layer is the convolutional layer, which gives this type of neural network its name. The primary block is a convolutional layer comprising filters that use the convolution operation to build activation maps. Activation maps are made up of patterns extracted from the input images, and these patterns are what the network relies on to classify unseen data.
Next is the pooling layer, which is inserted after each convolutional layer and downsamples the picture while discarding pixel values categorized as noise. As a result, the computational time decreases, and the pixel values that are relevant to the structure of the CNN are forwarded according to the dataset.
Following that, a dropout layer is added to avoid overfitting. Dropout sets randomly selected node outputs to zero so that they are not included in the corresponding training step, reducing computing time.
Following that, we have the flatten layer, which turns multi-dimensional data into vectors.
Finally, fully connected layers are applied, connecting each node to the nodes of the preceding layer and calculating the prediction using activation functions. ReLU (rectified linear unit) was used for the convolutional layers and softmax was used for the output activation function [5].
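The layer sequence described above can be illustrated with a minimal NumPy sketch of a single forward pass. This is not the RGB-CNN of this study: the toy 8 × 8 single-channel input, the single 3 × 3 filter, the random weights, and the helper names (`conv2d`, `max_pool`, `dropout`) are assumptions chosen only to keep the example small. The dropout shown is the "inverted" training-time variant, which is also an assumption.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one filter."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit: negative values become zero."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    x = x[:h * size, :w * size]
    return x.reshape(h, size, w, size).max(axis=(1, 3))

def dropout(x, rate, rng):
    """Training-time inverted dropout: zero a random subset, rescale the rest."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.random((8, 8))                      # toy input (assumption)
kernel = rng.standard_normal((3, 3))            # one random filter (assumption)

activation = relu(conv2d(image, kernel))        # convolution + ReLU -> (6, 6)
pooled = max_pool(activation)                   # 2x2 pooling -> (3, 3)
dropped = dropout(pooled, rate=0.2, rng=rng)    # dropout layer
flat = dropped.flatten()                        # flatten layer -> length 9
weights = rng.standard_normal((3, flat.size))   # dense layer with 3 outputs
probs = softmax(weights @ flat)                 # class probabilities
```

In a framework such as Keras these operations are provided as ready-made layers; the explicit loops here only make each step visible.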

Methodological Framework
This research aims to implement an RGB-CNN model as a promising method in nuclear medical image analysis to classify CAD images into infarction, ischemia and normal, and to provide an autonomous computer-aided system to nuclear medicine experts through the application of Grad-CAM. To explore the development of an explainable AI model in nuclear cardiology for CAD diagnosis utilizing SPECT MPI images, we applied Grad-CAM to an efficient and robust, fully trained CNN model. Grad-CAM has shown critical capabilities concerning the interpretability of neural networks, which are constantly characterized as black box models because of their complex internal functionality. An overview of the experiment can be found in Figure 2.
The methodological flow includes the following parts: (i) loading the dataset; (ii) data pre-processing; (iii) CNN model design and evaluation; and (iv) Grad-CAM application. The steps are as follows:
Step 1: Loading dataset
The SPECT MPI images given by the nuclear expert (N.P.) were in RGB (red, green, and blue) format, and the corresponding patients underwent a stress and rest examination. Each instance was classified as infarction, ischemia, or normal, being assigned 0, 1, and 2, respectively. The final dataset was stored locally on a PC. The employed SPECT image acquisition technology produced 22-27 64 × 64 images that illustrated axial views of the myocardial region (slices). All the available slices per patient have been combined, generating a single image of 300 × 300 size.
Step 2: Data preparation

•
Data normalization: Data normalization is a common technique in ML classification tasks. This method rescales pixel values by transforming them to the range [0, 1]. This process contributes to the discard of outliers and the effective reduction of computation time.

•
Data shuffle: Before inserting into the algorithm, data has to be shuffled so that the extraction of patterns is as unbiased as possible. Therefore, the data shuffle technique was deployed to provide a random order of data insertion. The methodological flow includes the following parts: (i) loading the dataset; (ii) data pre-processing; (iii) CNN model design and evaluation; and (iv) Grad-CAM application. The steps are as follows: Step 1: Loading dataset The SPECT MPI images given by the nuclear expert (N.P.) were in RGB (red, green, and blue) format, and the corresponding patients underwent a stress and rest examination. Each instance was classified as infarction, ischemia, or normal, being assigned with 0, 1, and 2, respectively. The final dataset was stored locally in a PC memory. The employed SPECT image acquisition technology produced 22-27 64 × 64 images that illustrated axial views of the myocardial region (slices). All the available slices per patient have been combined, generating a single image of 300 × 300 size.
Step 2: Data preparation

•
Data normalization: Data normalization is a common technique in ML classification tasks. This method rescales pixel values by transforming them to the range [0, 1]. This process contributes to the discard of outliers and the effective reduction of computation time.

•
Data shuffle: Before inserting into the algorithm, data has to be shuffled so that the extraction of patterns is as unbiased as possible. Therefore, the data shuffle technique was deployed to provide a random order of data insertion. • Data split: We split the dataset into three parts: validation, training, and testing. More specifically, 15% of the entire dataset was given to testing and the remaining 85% was split into 20% for validation and 80% for training.
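The three data preparation operations of Step 2 can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `prepare`, the assumption of 8-bit pixel values, the rounding of subset sizes, the fixed random seeds, and the tiny random stand-in images are all hypothetical. The fractions are those stated in the data split description (15% for testing, then 20% of the remainder for validation).

```python
import numpy as np

def prepare(images, labels, test_frac=0.15, val_frac=0.20, seed=0):
    """Normalize to [0, 1], shuffle, then split into train/validation/test."""
    x = images.astype(np.float32) / 255.0       # assumes 8-bit pixel values
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))             # same order for images and labels
    x, y = x[order], labels[order]

    n_test = int(round(test_frac * len(x)))
    n_val = int(round(val_frac * (len(x) - n_test)))

    test = (x[:n_test], y[:n_test])
    val = (x[n_test:n_test + n_val], y[n_test:n_test + n_val])
    train = (x[n_test + n_val:], y[n_test + n_val:])
    return train, val, test

# Stand-in data (assumption): 625 tiny random RGB images with labels
# 0 (infarction), 1 (ischemia), 2 (normal), mirroring the label coding of Step 1.
rng = np.random.default_rng(1)
images = rng.integers(0, 256, size=(625, 4, 4, 3), dtype=np.uint8)
labels = rng.integers(0, 3, size=625)

train, val, test = prepare(images, labels)
print(len(train[0]), len(val[0]), len(test[0]))
```

With 625 instances and this particular rounding convention, the sketch yields 425 training, 106 validation, and 94 testing instances; the exact counts in the study may differ.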
Step 3: Training
• Data augmentation: Data augmentation is usually employed to enlarge small datasets. It artificially generates various versions of the existing data, utilizing specific data augmentation techniques. In our case, we selected flipping and scaling strategies to achieve generalization and avoid overfitting [33].
• Define CNN architecture and activation functions: A detailed analysis was conducted to determine the best CNN architecture. During the experimentation, various values were applied for image size (pixels), batch size, number of nodes and layers for convolutional layers, and number of nodes for fully connected layers. The selection of the activation function is crucial since it corresponds to the type of classification problem. For example, the sigmoid function is proposed in binary classification problems, producing values between 0 and 1 that are compared with a default threshold to categorize images. On the other hand, softmax is applied in multi-class classification problems, providing probabilities for each possible output, the sum of which is equal to 1 [33,34].
• Train CNN: In the training process, backpropagation of gradients was utilized to minimize the error by adjusting the weights. In this process, the CNN also extracted patterns from input images, which will be used in future classification tasks with unknown data. Furthermore, the loss function and the optimizer must be selected for training the CNN.
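The difference between the two activation functions mentioned above can be shown with a short worked example. The raw scores (`logits`), the 0.5 threshold, and the class ordering are illustrative assumptions; only the label coding 0, 1, 2 for infarction, ischemia, and normal comes from Step 1.

```python
import math

def sigmoid(z):
    """Maps one raw score to a value in (0, 1); used for binary problems."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Maps a list of raw scores to probabilities that sum to 1."""
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

CLASSES = ["infarction", "ischemia", "normal"]   # labels 0, 1, 2

logits = [1.2, 0.3, -0.5]                        # made-up output scores
probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]     # class with highest probability

binary_decision = sigmoid(0.8) >= 0.5            # thresholded binary output

print(predicted, [round(p, 3) for p in probs], binary_decision)
```

For a three-class task such as the one addressed here, the softmax output is the natural choice because each image receives one probability per class.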

Step 4: Validation
In the validation step, the validation dataset is used to evaluate the CNN on data held out from training. The CNN's hyper-parameters are fine-tuned accordingly, and the final model is defined.

Step 5: Testing
After its training and validation, the best CNN model was developed and tested on unknown data using the testing dataset. CNN's performance was computed through robust evaluation metrics such as accuracy, loss, AUC, and ROC curve.
Step 6: Explainability through Grad-CAM
Since CNNs are not inherently explainable and transparent, post-hoc explainability methods have been employed to inspect the outputs of their layers and visualize them to increase comprehensibility [35,36]. Towards explainability in our approach, we selected Grad-CAM (gradient-weighted class activation mapping) [37] to interpret the predictions of our CNN model through the production of heatmaps. More specifically, heatmaps highlight regions that indicate a positive impact on the corresponding predicted output of the fully trained model. To acquire these critical regions, we utilize the gradients extracted at the last convolutional layer of the defined model, which is expected to extract the most important, deep, and abstract features that endorse the final decision [38]. Grad-CAM uses the convolutional layers because they retain the spatial information of the high-level features generated during the training process, whereas this spatial information is lost after the flatten layer and the fully connected layers. Therefore, the last convolutional layer's gradients could help locate the regions that indicate the predicted output [39].
Step 7: Inference phase
First, our trained classification model is fed with a testing image, preferably an instance that was not part of the training dataset, and produces the predicted output for that image. Next, we calculate the gradients of the predicted class score with respect to the feature maps produced by the last convolutional layer of the model. Then, we apply GAP (global average pooling) to these gradients to obtain the alpha weights. Afterward, a heatmap is generated by computing the weighted sum of the acquired feature maps. The heatmap highlights the critical regions that correspond to the predicted output. The produced heatmap is resized to match the dimension of the testing image. It is worth mentioning that the negative values are discarded so that only the pixel values that positively impact the produced heatmap are kept. Moreover, for the overlay functionality, the heatmap is placed on top of the testing image to ensure interpretability with respect to the results [40]. Grad-CAM constitutes a trustworthy explainability algorithm that achieves excellent results, and thus, it can be applied in a wide variety of CNN architectures and pre-trained networks [37].
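The inference steps above can be sketched in NumPy under the assumption that the feature maps of the last convolutional layer and the gradients of the predicted class score with respect to them have already been extracted (in practice a deep learning framework's automatic differentiation would supply them). The function names `grad_cam` and `overlay`, the nearest-neighbour resizing, the red-channel colouring, and the random stand-in arrays are illustrative choices, not the implementation used in this study.

```python
import numpy as np

def grad_cam(feature_maps, gradients, out_size):
    """Grad-CAM heatmap from last-conv-layer outputs.

    feature_maps, gradients: arrays of shape (H, W, K); gradients holds the
    derivative of the predicted class score w.r.t. each feature-map value.
    Returns a heatmap of shape out_size with values in [0, 1].
    """
    alphas = gradients.mean(axis=(0, 1))                     # GAP -> one weight per map
    cam = np.tensordot(feature_maps, alphas, axes=([2], [0]))  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                               # discard negative values
    if cam.max() > 0:
        cam = cam / cam.max()                                # scale to [0, 1]
    # Nearest-neighbour resize to the size of the testing image.
    rows = np.arange(out_size[0]) * cam.shape[0] // out_size[0]
    cols = np.arange(out_size[1]) * cam.shape[1] // out_size[1]
    return cam[rows][:, cols]

def overlay(image, heatmap, alpha=0.4):
    """Blend the heatmap (shown in the red channel) onto an RGB image in [0, 1]."""
    colored = np.zeros_like(image)
    colored[..., 0] = heatmap
    return (1.0 - alpha) * image + alpha * colored

rng = np.random.default_rng(0)
feature_maps = rng.random((9, 9, 4))          # stand-in activations (assumption)
gradients = rng.standard_normal((9, 9, 4))    # stand-in gradients (assumption)
image = rng.random((300, 300, 3))             # stand-in 300 x 300 RGB input

heatmap = grad_cam(feature_maps, gradients, out_size=(300, 300))
blended = overlay(image, heatmap)
```

Smoother interpolation and a perceptual colour map are common in practice; nearest-neighbour resizing is used here only to keep the sketch dependency-free.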

Results
Initially, a thorough exploration of the CNN model architecture was conducted to determine the best model for image classification; various combinations of hyperparameters were tested, such as image size, batch size, number of convolutional layers and their nodes, and number of dense nodes. All experiments were executed 10 times to compute the average value. The hardware specifications on which the experiments were conducted are processor: Intel(R) Core (TM) i7-8750H CPU @ 2.20 GHz 2.21 GHz; RAM: 8 GB; and system type: 64-bit operating system, x64-based processor. The frameworks Keras 2.8.0 and scikit-learn 1.0.2 were used, together with Python 3.9.7.
To address the three-class classification problem, various architectures of the RGB-CNN algorithm were explored, which involve several batch sizes (8,16,32 We used well-known performance metrics such as accuracy, loss, AUC with confidence interval, ROC curve, sensitivity, and specificity to evaluate the examined CNN architectures. AUC represents the model's ability to distinguish between the given classes; it ranges between 0 and 1, and higher values indicate better discrimination. The ROC (receiver operating characteristic) curve is the plot from which the AUC is computed [8].
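As a hedged illustration of the sensitivity and specificity metrics listed above, the sketch below computes them per class for a three-class problem. The one-vs-rest treatment, the function names `per_class_metrics` and `accuracy`, and the example labels are assumptions made for illustration; the text does not state how these metrics were aggregated across the three classes.

```python
def per_class_metrics(y_true, y_pred, classes):
    """One-vs-rest sensitivity and specificity for each class."""
    pairs = list(zip(y_true, y_pred))
    results = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        tn = sum(t != c and p != c for t, p in pairs)  # true negatives
        results[c] = {
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
        }
    return results


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels, for illustration only (not data from the study).
classes = ["infarction", "ischemia", "normal"]
y_true = ["infarction", "ischemia", "normal", "normal", "ischemia", "infarction"]
y_pred = ["infarction", "ischemia", "normal", "ischemia", "ischemia", "normal"]
metrics = per_class_metrics(y_true, y_pred, classes)
acc = accuracy(y_true, y_pred)
```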
The experiment started with the default values for batch size, image size, convolutional layers and nodes, and dense nodes, which were 16, 200 × 200, 16-32-64-128, and 128-128, respectively. All runs were conducted for 400 epochs with a dropout rate of 0.2. Table 2 collects the findings for the various batch sizes while leaving the remaining parameters at their default values. The model performs best on the relevant metrics for a batch size of 32. Afterward, with the batch size fixed at 32, we examined different image sizes while keeping the convolutional and dense layers fixed. Figure 3a shows that 300 × 300 clearly outperforms the other image sizes; thus, 300 × 300 is the best image size for this dataset. Figure 3b presents the outcomes for various combinations of convolutional layers; the 16-32-64-128 configuration performed best in all metrics. Next, various numbers of dense nodes were examined, producing the results depicted in Figure 3. The best sequence is 128-128 when all metrics are considered.
Deep neural networks (such as CNNs) have made advances in large-sample-size applications. However, they are susceptible to overfitting and high-variance gradients when dealing with high-dimensional, low-sample-size data. In our study, we explored a relatively small dataset; therefore, increasing the image size (feature dimensionality) increases the number of the CNN's parameters (its complexity), making it more prone to overfitting. In our experiments, we sought the image size that provides the best balance between accuracy and complexity. Each candidate architecture of our RGB model was executed for at least 10 runs so that a robust and reliable model was finally built. The final architecture for RGB is 300 × 300 for image size, 32 for batch size, 16-32-64-128 for convolutional layers, and 128-128 for dense layers.
Table 3 gathers the metric values produced by the exploration process. The selected structure achieved promising results and outperformed the rest of the structures. Following the exploration and the definition of the best CNN model, a further technique was applied to evaluate the CNN's capabilities: k-fold cross-validation, where k is the number of parts into which the dataset is divided.
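The partitioning behind k-fold cross-validation can be sketched as follows. This is a minimal illustration using only the standard library; the function name `k_fold_indices`, the fixed random seed, and the use of one sample index per patient (625 in total) are assumptions, and the sketch does not reproduce the tooling used in this study.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Assumed here: one sample per patient, 625 samples in total.
splits = list(k_fold_indices(625, k=10))
```

A model would then be trained and evaluated once per split, and the metrics averaged over the k runs.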
In our case, we divided our dataset into 10 parts, of which 9 were used for training and 1 for testing. This procedure was repeated until each part had been used for testing [41]. Table 4 presents the outcomes, demonstrating that the RGB-CNN model delivers robustness and efficiency. The results of the 10-fold cross-validation are similar to those produced by the data split method, indicating that the proposed model is robust (see Table 5).
To address explainability, Grad-CAM was used in this work to provide predictions with interpretability. To verify the feasibility of Grad-CAM, we conducted experiments on the proposed medical image classification model, producing improved visualization results. In addition, the results were assessed by a nuclear medicine physician to support the decisions in this domain. Figure 4 shows the visualization results produced by Grad-CAM for each CAD category: the original images, the heatmaps generated by the Grad-CAM method, and the visualization obtained by superimposing the heatmap on the original image. Different colors indicate the importance of pixels in the classification results, representing the sensitivity of the CNN-based classification model to each pixel. The Viridis colormap, obtained from the OpenCV library, was used.
To produce the heatmaps, Grad-CAM was first fed with the gradients extracted from the last convolutional layer of the model for each image. The heatmaps are two-dimensional and indicate, for each pixel, the impact on the predicted output. With the Viridis colormap, high-impact values are displayed in yellow, whereas low-impact values are displayed in dark blue. Next, the overlay technique was applied, placing the produced heatmaps above the original image. This comparison can offer a better understanding to the nuclear experts. Based on the results, the generated heatmaps provide interpretation and are compatible with nuclear diagnoses of infarction and ischemic cases. It should be mentioned that Grad-CAM was applied to new images that were excluded from the training procedure to emulate a case with unseen data.
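The coloring and overlay steps can be sketched in a few lines. The example below is only an approximation: it uses a simple linear ramp from dark blue to yellow as a stand-in for the Viridis colormap (the actual Viridis lookup table is not reproduced), and the function names, the blending factor `alpha`, and the toy image are hypothetical choices made for illustration.

```python
def heat_to_rgb(value, low=(0, 0, 128), high=(255, 255, 0)):
    """Map a heat value in [0, 1] to a color between dark blue and yellow."""
    v = min(1.0, max(0.0, value))  # clamp to the valid range
    return tuple(round(lo + v * (hi - lo)) for lo, hi in zip(low, high))


def overlay_pixel(image_rgb, heat_value, alpha=0.4):
    """Alpha-blend the colored heat value on top of one image pixel."""
    heat_rgb = heat_to_rgb(heat_value)
    return tuple(round((1 - alpha) * p + alpha * c)
                 for p, c in zip(image_rgb, heat_rgb))


def overlay(image, heatmap, alpha=0.4):
    """Blend a heatmap over an image of the same height and width."""
    return [[overlay_pixel(px, hv, alpha) for px, hv in zip(img_row, heat_row)]
            for img_row, heat_row in zip(image, heatmap)]


# Toy 2 x 2 gray image and a heatmap already resized to the same shape.
image = [[(100, 100, 100), (100, 100, 100)],
         [(100, 100, 100), (100, 100, 100)]]
heatmap = [[0.0, 1.0], [0.5, 0.0]]
blended = overlay(image, heatmap, alpha=0.4)
```

A smaller `alpha` keeps more of the underlying scan visible, while a larger one emphasizes the highlighted regions.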
In what follows, we provide some indicative visualization results of applying Grad-CAM to infarction, ischemia, and normal cases. In particular, three cases were selected from each of the three categories. RGB-CNN correctly classified these cases, while Grad-CAM was employed to visualize the regions contributing to disease prediction on SPECT MPI images.
In Figure 5a, according to the physician's diagnosis (N.P.), a large, fixed defect was present on all three slice orientations (SA, short axis; VLA, vertical long axis; and HLA, horizontal long axis images). In particular, this SPECT MPI scan showed a case of myocardial infarction with a fixed reduction in perfusion in the apex (see slice 9 in SA and slices 31-32 (row A) in VLA), extending to several apical segments to the mid-anterior and the mid-lateral wall (see slices 12-15 (row A) and slices 13-16 (row B) in SA, post-stress and at rest, respectively; slices 29-34 (row A and row B) in VLA; and slices 27-36 (row A) and 29-37 (row B) in HLA, at stress and rest, respectively). Before the classification process, the initial image was processed following the steps defined in Section 3.2.2 (see Figure 5b).
By applying the proposed algorithm, Grad-CAM identifies the above regions of interest and colors them in bright yellow, providing the heatmap illustrated in Figure 5c. Figure 5d shows the visualization result of the SPECT MPI scan after placing the produced heatmaps above the processed image.
A similar distribution of the yellow color on the heatmaps is observed in all examined infarction cases. In Appendix A, two more indicative infarction cases (cases B and C) are provided (see Figure A3), illustrating the yellow color distribution in the visualization regions representing the perfusion defects.
In Figure 6a, according to the nuclear medicine physician's expertise (N.P.), a medium-size reversible perfusion abnormality was diagnosed in the apex and the anteroseptal myocardium (see slices 10-14 (row A in SA), slices 30-33 (row A in VLA), and slices 31-35 (row A in HLA)). By applying the algorithm, Grad-CAM identifies the segments with stress-induced hypoperfusion in row A on all three axes in stress mode, which are marked in bright yellow (see Figure 6c). More specifically, after a thorough look at the visualized regions (Figure 6d) and a comparison with those of the initial image, it emerges that the algorithm sufficiently recognizes the post-stress defects.
Moreover, in Figure 7 (ischemia, case B), the nuclear expert (N.P.) diagnosed hypoperfusion in the septum, in the inferior myocardial wall, and in a part of the apex (see slices 10-18 (row A), 29-35 (row A), and 30-36 (row A)). As soon as the algorithm was applied, we observed that the above myocardial walls were marked with more intense yellow at post-stress in row A on all three axes, for the same slices as indicated by the expert in Figure 7a.
The proposed trained model, which accurately predicted infarction, also performs well when color visualization is used with the Grad-CAM method and depicts the areas of interest in the event of infarction in MPI scans. The results produced by the visualization assessment concurred with the expert's diagnosis and assessment.
The best-performing RGB-CNN classification model, combined with the Grad-CAM technique, achieved high accuracy while also offering explainable predictions of defects and abnormalities in CAD diagnosis. This is attributed to the fact that Grad-CAM can discover complicated underlying relationships and non-linearities; thus, it performs solidly in identifying regions of interest that represent possible abnormalities (ischemia and infarction) in SPECT MPI scans.

Discussion
In this research study, we developed a fully automatic CNN-based method to detect any signs of infarction or ischemia in patients. The dataset included heterogeneous data of 625 patients. Among these data, 127 corresponded to infarction, 241 to ischemic, and 257 to normal, which nuclear experts had previously labeled for the current classification task. Given the small size of the population, we employed data augmentation to produce new images by applying different transformations to the current dataset, such as flipping and rescaling. In addition, we divided our dataset into three parts: validation, training, and testing. The validation dataset was used for fine-tuning the hyper-parameters, the training dataset for training the model, and the testing dataset for estimating its reliability.
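The three-way division described above can be sketched as follows, using the proportions reported in the abstract (20% for testing, and 15% of the remaining 80% for validation). The function name `split_indices`, the fixed seed, the plain random shuffle, and the use of one index per patient are assumptions made for illustration; whether the actual split was stratified by class is not stated here.

```python
import random


def split_indices(n_samples, test_pct=20, val_pct=15, seed=0):
    """Shuffle sample indices and split them into train/validation/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Hold out the test set first.
    n_test = n_samples * test_pct // 100
    test_idx, rest = indices[:n_test], indices[n_test:]
    # Take the validation set out of the remaining training portion.
    n_val = len(rest) * val_pct // 100
    val_idx, train_idx = rest[:n_val], rest[n_val:]
    return train_idx, val_idx, test_idx


# Assumed here: one sample per patient, 625 samples in total.
train_idx, val_idx, test_idx = split_indices(625)
```

Under these assumptions, 625 samples yield 125 for testing, 75 for validation, and 425 for training.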
To determine the specifications of the proposed model, the authors performed an in-depth exploration analysis by examining various values for batch size, image size, number of convolutional layers and their nodes, and number of nodes for the fully connected layers. Using the SPECT MPI images as our only input, we proposed a deep CNN with convolutional layers and two fully connected layers to enhance the accuracy of our three-class classification task. To define the best model in terms of performance, reliable metrics such as accuracy, loss, AUC with confidence interval, ROC curve, sensitivity, and specificity were used. The results demonstrated high efficiency and stability, achieving 93.33% accuracy and 94.58% AUC. Additionally, the authors conducted 10-fold cross-validation to further evaluate the model's stability and robustness. On the identical three-class problem, the suggested model exceeded the performance of sophisticated deep learning networks (VGG-16 and ResNet-121), which attained lower accuracy (88.54% and 86.11%, respectively), as demonstrated in [42].
In contrast to other traditional approaches, our RGB-CNN demonstrated superior performance for several reasons. First, it achieved strong results despite the small dataset and without needing to employ pre-trained networks that rely on existing datasets (e.g., ImageNet, which comprises 1000 classes). Additionally, following the in-depth analysis of its parameters, the CNN model could avoid overfitting and achieve generalization. Finally, the proposed methodology uses a simple architecture with a small number of nodes, which reduces training time.
Nevertheless, our research presents certain limitations. The proposed approach exclusively accepts images as input, despite physicians also considering additional clinical data such as age and sex to conclude an exact opinion about a patient's status. In our future considerations, we will seek to develop a hybrid method, which will use both images and clinical data as input to simulate the visual diagnosis of CAD fully.
Concerning the clinical implications that unfold from applying the proposed DL-based approach, these entail the beneficial automatic clinical diagnosis of SPECT MPI images, which could prevent unwanted heart conditions such as ischemia and infarction. The CNN-based method can serve as a vital tool to assist medical experts in providing a precise diagnosis of SPECT MPI images and explicit treatment suggestions to patients suffering from CAD.
On the other hand, CNNs do not offer transparency and interpretability in their decisions, which is a critical drawback for their full integration into medical image analysis. Thus, doctors cannot rely on the provided predictions. CNNs are characterized as black boxes by nature, since they do not supply details regarding their internal prediction process, leaving researchers to depend exclusively on the values of reliable metrics. Explainable artificial intelligence (XAI) was introduced to offer details about the internal functionality of CNNs. In our research, we implemented the Grad-CAM technique, which generates heatmaps for interpretability.
To sum up, the proposed three-class classification model can identify signs of infarction or ischemia in SPECT MPI images while ensuring reliability. Even though we dealt with a small dataset, our model performed well. The proposed RGB-CNN can be a fundamental tool that assists nuclear experts in automatically diagnosing CAD. Overall, the current research constitutes an innovation in nuclear medicine, especially in CAD diagnosis utilizing SPECT images, mostly due to the observed lack of relevant published articles in this domain. In addition, this approach investigates and applies XAI methodologies with promising results.

Conclusions
This paper presents the first known attempt to develop an explainable pipeline for CAD diagnosis utilizing SPECT MPI images and state-of-the-art DL and explainability techniques. Apart from implementing an effective CNN algorithm for accurately classifying infarction and ischemia in CAD, it is essential to address image interpretability through visualization. For the purposes of this study, the efficacy of the well-known Grad-CAM explainability tool was investigated, providing promising results for automated and accurate diagnosis in nuclear cardiology. The proposed model achieved 93.33% testing accuracy, 0.21 testing loss, and 0.94 AUC, demonstrating good applicability to the corresponding dataset and sufficient stability. As illustrated in the results section, the nuclear physician can use the Grad-CAM visualization technique to make efficient and confident decisions, taking advantage of the visual explanations provided. Thus, the Grad-CAM methodology proved to be an effective tool for explaining CNN-based decisions in SPECT MPI images. The next steps are devoted to the integration of clinical, stress, and imaging variables in DL methods to further improve disease diagnosis. To sum up, this study contributes to the effective diagnosis of ischemia and infarction in CAD, hence fostering trust in the use of explainable artificial intelligence models for diagnosis in nuclear medicine.
Institutional Review Board Statement: This research does not report human experimentation; it does not involve human participants following experimentation. All procedures in this study were in accordance with the Declaration of Helsinki.
Informed Consent Statement: This study was approved by the board committee director of the diagnostic medical center "Diagnostiko-Iatriki A.E.", Vasilios Parafestas. The requirement to obtain informed consent was waived by the director of the diagnostic center due to its retrospective nature.

Data Availability Statement:
The datasets analyzed during the current study are available from the nuclear medicine physician on reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Figures A1 and A2 show the accuracy and loss curves and the ROC curves for the best RGB-CNN model. Overall, this model achieved the highest classification accuracy among the examined architectures, while providing generalizability and robustness in CAD diagnosis.