Article

Using Shapley Values to Explain the Decisions of Convolutional Neural Networks in Glaucoma Diagnosis

by Jose Sigut 1,*, Francisco Fumero 1 and Tinguaro Díaz-Alemán 2
1 Department of Computer Science and Systems Engineering, Universidad de La Laguna, Camino San Francisco de Paula, 19, La Laguna, 38203 Santa Cruz de Tenerife, Spain
2 Department of Ophthalmology, Hospital Universitario de Canarias, Carretera Ofra S/N, La Laguna, 38320 Santa Cruz de Tenerife, Spain
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(8), 464; https://doi.org/10.3390/a18080464
Submission received: 30 May 2025 / Revised: 10 July 2025 / Accepted: 23 July 2025 / Published: 25 July 2025

Abstract

This work aims to leverage Shapley values to explain the decisions of convolutional neural networks trained to predict glaucoma. Although Shapley values offer a mathematically sound approach rooted in game theory, they require evaluating all possible combinations of features, which can be computationally intensive. To address this challenge, we introduce a novel strategy that discretizes the input by dividing the image into standard regions or sectors of interest, significantly reducing the number of features while maintaining clinical relevance. Moreover, applying Shapley values in a machine learning context necessitates the ability to selectively exclude features to evaluate their combinations. To achieve this, we propose a method involving the occlusion of specific sectors and re-training only the non-convolutional portion of the models. Despite achieving strong predictive performance, our findings reveal limited alignment with medical expectations, particularly the unexpected dominance of the background sector in the model’s decision-making process. This highlights potential concerns regarding the interpretability of convolutional neural network-based glaucoma diagnostics.

1. Introduction

In domains where the decisions made by a machine learning model are critical, the interpretability of the model becomes a fundamental requirement, since in general we want to ensure that the abstraction learned by the model is free of errors [1]. From a medical point of view, the interpretability of machine learning models is especially relevant for regulatory reasons and for building the medical specialist’s confidence in the system, as it reveals the criteria on which the model based a given prediction. In some cases, it can even allow the medical specialist to find relationships in the data that were previously hidden from the naked eye or not given enough attention or importance [2]. In this context, Shapley values have become one of the most prominent methods for explaining the decisions of a machine learning system. In the review carried out by Loh et al. [3] on the application of explainability techniques in artificial intelligence in the healthcare sector during the last decade, almost half of the works reviewed used them for this purpose.
Shapley values originate in cooperative game theory [4] and provide a way to distribute the benefits obtained in a cooperative game among the players of that game [5]. Transferring this method to the field of machine learning, a prediction from a model can be explained by treating each feature of a given instance as a “player” in a game whose payoff is the prediction. The Shapley values tell us how to “fairly” distribute that payoff among the features, depending on the contribution of each one [1].
In a previous work [6], we explored the use of saliency maps to interpret convolutional neural networks for glaucoma diagnosis, highlighting both their potential and their limitations. In contrast, Shapley values offer a fundamentally different approach to assigning a degree of importance to each feature in a decision. They have the advantage of being based on a solid mathematical foundation derived from game theory, ensuring a series of desirable properties: efficiency, symmetry, null (or dummy) player, and linearity (or additivity) [5,7]. However, their main disadvantage is that they require evaluating all possible combinations of features in order to calculate these values accurately. When applied in a machine learning context where the number of features is high, this produces a combinatorial explosion that can be unfeasible in many cases, since it would require performing on the order of $2^n$ predictions of our model, where $n$ is the total number of features. In addition, a machine learning model is generally not prepared to make predictions for an arbitrary combination of input features, such as a smaller subset of them, so re-training may be required. For this reason, several methods have been proposed that calculate the Shapley values approximately, such as the sampling method of [8], the different methods based on SHAP proposed by [9], or Deep SHAP, specifically intended for Deep Learning systems [9,10].
The main objective of this work is the use of Shapley values to explain the decisions of convolutional neural networks trained to predict glaucoma. Glaucoma is one of the leading causes of irreversible blindness in the world [11], and it is a pathology that is difficult to address because of the lack of clear and unequivocal biomarkers that warn of its existence, especially in the most incipient stages [12]. Thus, only highly specialized experts can deal with this issue, usually following a holistic approach to diagnosis, using both functional and structural techniques. In this category of structural techniques are those based on images, and more specifically retinal or fundus images, which are the ones used in this research. Figure 1 shows an example of this type of image with the most relevant anatomical structures.
For the reasons mentioned above, the application of Shapley values to a machine learning problem can be difficult in practice. For this particular problem, we propose to follow a discretization procedure consisting of dividing the input images into standard sectors. This allows us to refer the results to regions of interest to the ophthalmic specialist while drastically reducing the number of features that will be used to form the various combinations of players required for the calculation of the Shapley values, moving from the pixel level to a much higher one. It remains to be resolved how players are removed from the game. The strategy adopted to simulate this in a given image is the occlusion of the corresponding sector, which facilitates the calculation of the Shapley values, thus obtaining the contributions of each sector to the prediction made by a given model. In Section 3.3 we will describe in more detail the discretization and occlusion performed.
Although a few papers have used Shapley values to explain the decisions of conventional machine learning systems with handcrafted features for glaucoma diagnosis, to our knowledge this is the first to approach the problem from a Deep Learning perspective and to refer the explanations to regions of the input image instead. This is possible thanks to the aforementioned discretization and occlusion strategies, which we therefore consider our main contribution.
The rest of the article is structured as follows. In Section 2 we review the related work. In Section 3 we describe the datasets and models used, how to discretize the problem to compute the Shapley values, how to occlude the different sectors, and the evaluation methodology adopted. The results of all experiments are presented and discussed in Section 4. Finally, Section 5 is devoted to the conclusions we have reached.

2. Related Work

There has been some work related to ophthalmology in which Shapley values have been used. One of these works is [13], where a study was conducted to determine the degree of agreement of 14 medical specialists with the results of 13 attribution methods applied to a neural network based on InceptionV3 to diagnose choroidal neovascularization, diabetic macular edema and drusen from retinal images captured with an OCT. The results show that the three methods most highly rated by clinicians were Deep Taylor, Guided backpropagation and Deep SHAP.
Another publication, also related to this field, is that of [14]. In this work, the authors train a convolutional neural network model based on DenseNet121 to detect different degrees of diabetic retinopathy in fundus images. In order to explain the results, they choose to use Grad-CAM and Deep SHAP.
As noted in the introduction, Deep SHAP is an approximate way to calculate Shapley values in Deep Learning systems, making the calculation feasible from a computational standpoint but with some disadvantages:
  • It is approximate: built on DeepLIFT [15], it estimates rather than computes exact Shapley values, which can introduce inaccuracies, especially in complex models.
  • Feature independence is assumed: correlations between features are not handled well, which may lead to misleading explanations.
  • Sensitivity to background data: the quality of the explanations depends heavily on the choice of representative background samples.
  • Model limitations: it only works with differentiable models that support backpropagation, which rules out many non-standard architectures.
  • Harder to generalize: local explanations do not always aggregate neatly into global insights, making interpretation difficult.
  • Potentially resource-intensive: although faster than some alternatives, it can still be demanding on large-scale models.
The proposed method is much more straightforward, as it is based solely on discretizing the problem using image sectors and devising a strategy to hide those sectors in order to retrain the model.
Focusing now on glaucoma diagnosis, in [16] several models were developed, some based on clinical features to establish a baseline from which to start, and another deep learning model that combines information from macular OCT and color eye fundus images with the demographic, systemic, and ocular data that were used in the baseline models. For the interpretability of models using images, they used Integrated gradients and SmoothGrad, while for tree-based models developed with XGBoost they calculated SHAP values to assess the importance and interaction between features. A similar approach is followed in [17] using also OCT and SHAP values to develop a valuable user-friendly XAI software tool which provides transparent and interpretable insights to improve decision-making.
In the work carried out by [18], they collected 23 features, such as intraocular pressure, a visual field test, a retinal nerve fiber layer (RNFL) test performed with OCT and a fundus photograph, and carried out a feature selection process to finally keep only five of them. With the selected features, they tested several algorithms, such as support vector machines (SVM), C5.0, random forests, or XGBoost, and evaluated the importance of these features in the final decision using different graphs, including those derived from SHAP.
Also related to glaucoma, but using only structured data extracted from the electronic health record (EHR), we find the work of [19], where they train different models based on regression, decision trees or deep learning to predict whether a glaucoma patient would need surgery, based on progression data extracted from the follow-up of 4512 patients. They extracted 361 features from their medical records, including demographic, eye exam, diagnostic and prescription data. Then, using SHAP, they studied which of these features were the most important for the different models, finding that intraocular pressure, visual acuity or the use of medications commonly prescribed for glaucoma were among the most important. A very similar approach for the same problem is followed in [20,21,22,23]. The authors of [24] also focus on glaucoma progression, in their case predicting long-term, rapid thinning of the retinal nerve fiber layer over a 5-year period in 505 patients with primary open-angle glaucoma in a South Korean population. They developed models based on decision trees using systemic and ophthalmologic data, applying SHAP to analyze the importance of the different features.
As mentioned earlier in Section 1, the main limitation of these studies on glaucoma diagnosis is that they are based on hand-crafted features, rather than addressing the problem from a Deep Learning perspective, as we do, using the entire image as input and regions of the images as features.

3. Materials and Methods

3.1. Datasets

The main color image datasets used in this work are RIM-ONE DL and HUC RGB. The publicly available image set RIM-ONE DL [25] resulted from the combination of the three previous versions of the well-known RIM-ONE dataset [26]. This new version consists of 313 images from healthy subjects and 172 images from patients with glaucoma acquired at three different hospitals. On the other hand, the HUC RGB dataset is a privately accessible set of retinal images collected by the medical specialists of our research team at the Hospital Universitario de Canarias. This set is composed of 191 images of patients with glaucoma and 63 images of healthy subjects, and was acquired with the Topcon TRC-NW8 multifunctional non-mydriatic retinograph (Topcon Healthcare, Spain). Considering the two sets, a total of 739 images have been used: 363 of glaucomatous eyes and 376 of healthy eyes. All these images have been annotated by two experts and include manual disc and cup segmentation.
Two other publicly available datasets have been used for further validation of the experiments carried out, namely REFUGE and DRISHTI-GS1. The REFUGE challenge dataset [27] is composed of 1200 retinal images, of which 10% (120 samples) correspond to glaucomatous subjects, including primary open-angle glaucoma and normal tension glaucoma. The dataset also contains the ground truth segmentation of the disc and cup. The DRISHTI-GS1 dataset [28] contains 101 fundus images, 31 belonging to healthy subjects and 70 to glaucomatous ones, with different resolutions and ground truth labels for the optic disc and cup.
Since the area of interest for glaucoma diagnosis is concentrated around the optic nerve head, all images used in this study were cropped squarely around it using the same proportionality criteria as shown in Figure 1b. Table 1 contains a summary of the aforementioned datasets, with the number of healthy and glaucomatous images per dataset and their resolution after cropping them around the optic disc.

3.2. Deep Learning Models

In this section, we explain how the CNN models used in this work were trained. To do so, we describe how the training and test sets were obtained from the available data, the training strategy, and the parameters selected for each CNN.
In training the CNNs for this study, the initial set of 739 images underwent a random division into training and test sets, maintaining an 80/20 proportion, respectively. The training set comprised 290 retinographies of glaucoma and 301 of healthy eyes, while the test set included 73 retinographies of glaucoma and 75 of healthy eyes. Additionally, the training set was further partitioned into five distinct training and validation subsets, employing a 5-fold approach.
We selected four prominent convolutional neural network architectures, namely VGG19 [29], ResNet50 [30], InceptionV3 [31], and Xception [32]. The selection of architectures was guided by specific algorithmic features: ResNet50 employs residual connections to facilitate the training of deeper networks; Xception utilizes depthwise separable convolutions to optimize computational efficiency; InceptionV3 benefits from multi-scale feature extraction via parallel convolutions; and VGG19 offers a simpler sequential design known for its performance consistency. These complementary design principles offer diverse algorithmic trade-offs, enriching our analysis pipeline and enabling comparative evaluation across design paradigms. These architectures are widely adopted and have been extensively utilized in the domain of glaucoma diagnosis and other medical fields. They have also been evaluated in similar studies, such as the works analyzed in Section 2.
All specified neural network architectures are accessible in the Keras module of the TensorFlow v2 package [33]. To tailor these models to our specific problem, we modified the top layer of each network. The adaptation consisted of a DropOut layer, followed by a Flatten layer, a Dense layer with 128 neurons and ReLU activation, another DropOut layer, and a final Dense layer with two outputs using the SoftMax activation function.
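For reference, a minimal Keras sketch of this adapted head is shown below; the helper name, its default arguments, and the way the dropout rate is passed are our own illustrative choices rather than code from the original study.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(base_name="ResNet50", input_shape=(224, 224, 3), dropout_rate=0.2):
    # Pre-trained convolutional base without its original classifier
    base = getattr(keras.applications, base_name)(
        weights="imagenet", include_top=False, input_shape=input_shape)
    inputs = keras.Input(shape=input_shape)
    x = base(inputs, training=False)            # BatchNormalization stays in inference mode
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(2, activation="softmax")(x)
    return keras.Model(inputs, outputs), base
```

The dropout rate is exposed as an argument because, as noted below, it differs between architectures.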
The hyperparameters for each model were empirically selected based on the accuracy values obtained for the corresponding validation subset during training. For VGG19, the DropOut rate was set to 0.5, while for InceptionV3, ResNet50, and Xception, it was set to 0.2. Additionally, in these three networks, the BatchNormalization layers were maintained in inference mode to prevent the non-trainable weights from being updated during the training phase.
Concerning the size of the Input layer, it was configured as 224 × 224 × 3 for ResNet50 and VGG19, and 299 × 299 × 3 for InceptionV3 and Xception.
Our training strategy is the same for all the models. First, starting with the pre-trained weights from ImageNet, the base model was frozen and we trained the new top layer for 200 epochs using an Adam optimizer, with a learning rate of $1 \times 10^{-6}$, and categorical cross-entropy as the loss function. Second, we unfroze the base model, except for the BatchNormalization layers, and trained the entire model end-to-end for 250 epochs, using the same optimizer, with a learning rate of $1 \times 10^{-5}$, and the same loss function as before. In all cases, a batch size of 8 was used.
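Continuing the sketch above, the two-phase strategy might look as follows; x_train, y_train, x_val, and y_val are hypothetical NumPy arrays holding the pre-processed images and one-hot labels of a given fold.

```python
# Phase 1: train only the new head with the convolutional base frozen
model, base = build_model("ResNet50")
base.trainable = False
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-6),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=200, batch_size=8)

# Phase 2: fine-tune end-to-end, keeping the BatchNormalization layers frozen
base.trainable = True
for layer in base.layers:
    if isinstance(layer, layers.BatchNormalization):
        layer.trainable = False
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=250, batch_size=8)
```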
In the pre-processing stage, we utilized the built-in Keras pre-processing functions tailored for each network. To address overfitting, in addition to the DropOut rates described earlier, we applied on-the-fly data augmentation exclusively to the training subset of each fold in the 5-fold cross-validation setup (approximately 473 images per fold). The augmentation included random adjustments to contrast (±0.3) and brightness (±0.3), horizontal flipping, rotations up to ±45°, translations up to ±0.05 along both axes, and zooming (±0.2), all using the nearest neighbor fill mode and preserving the original aspect ratio. These transformations were applied dynamically at each epoch, so no new images were permanently added, and the total dataset size remained unchanged. Overfitting was monitored throughout training, and no significant divergence between training and validation metrics was observed, suggesting the combined augmentation and regularization strategy was effective.
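A possible implementation of this augmentation with Keras preprocessing layers is sketched below; the exact layers used in the original pipeline may differ, and some of them (e.g., RandomBrightness) require a recent TensorFlow 2.x release.

```python
# On-the-fly augmentation applied only to training batches; factors mirror the
# ranges reported above (the rotation factor is a fraction of a full turn).
augment = keras.Sequential([
    layers.RandomContrast(0.3),
    layers.RandomBrightness(0.3),
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(45 / 360, fill_mode="nearest"),
    layers.RandomTranslation(0.05, 0.05, fill_mode="nearest"),
    layers.RandomZoom(0.2, fill_mode="nearest"),
])

augmented_batch = augment(image_batch, training=True)  # image_batch is a hypothetical tensor
```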
The final weights for each model were chosen from the epoch that maximized the validation accuracy on average among the 5 folds. This resulted in 5 different models per network architecture, totaling 20 models.

3.3. Shapley Values Implementation

As discussed in the introduction, Shapley values are framed in cooperative game theory, and provide us with a way of assigning a payoff to each player: the average of all the marginal contributions of that player to each coalition of which he or she is a member. That is, those players who contribute more to the groups that include them should receive a higher payoff.
Formally, one of the most common representations of a cooperative game is through the characteristic function. Let $N = \{1, \ldots, n\}$ be a finite, non-empty set of players, and let $S \subseteq N$ be a subset of players of $N$, which we will call a coalition. A game in characteristic function form is given by the pair $(N, v)$, where $v : 2^N \rightarrow \mathbb{R}$ is a characteristic function that assigns to each coalition $S$ a real number $v(S)$. This number $v(S)$ is often referred to as the “value” or contribution of the coalition $S$ [5,34]. Thus, the payoff assigned to player $j$ is given by the corresponding Shapley value $\phi_j(v)$ and is defined as follows [5]:
$$\phi_j(v) = \phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \left[ v(S \cup \{j\}) - v(S) \right], \quad j \in N \qquad (1)$$
That is, a weighted average is taken, over all coalitions $S$ that do not contain player $j$, of the differences between the value of the coalition when player $j$ is included and when they are not.
Note that the empty set $S = \emptyset$ is also part of the summation in Equation (1). The undistributed gain is defined as $\phi_0 = v(\emptyset)$, which is the fixed payoff that is not associated with any of the players.
In order to apply this method to a machine learning model, it would, in general, be necessary to re-train the model for all feature subsets $S \subseteq F$, where $F$ is the set of all features [9,35,36,37]. Therefore, to use Equation (1) in this case, one would train a model $f_{S \cup \{j\}}$ that includes feature $j$ in its training set, and another model $f_S$ that does not include that feature. With this in mind, adopting the notation of [9], we can adapt Equation (1) as follows:
$$\phi_j(f) = \phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{j\}}(x_{S \cup \{j\}}) - f_S(x_S) \right] \qquad (2)$$
where $x_S$ and $x_{S \cup \{j\}}$ represent, respectively, the values of the input features in the subsets $S$ and $S \cup \{j\}$ for instance $x$. In this approach, the undistributed gain would be $\phi_0 = f(\emptyset)$, i.e., the model prediction that is not associated with any of the features.
In our particular case, dealing with images as input data, applying this equation at the pixel level would be impractical, since we would have to form a total of $2^n$ coalitions, where $n$ is the number of pixels. However, if we divide the input images into standard sectors [38], as shown in Figure 2, and treat each sector as if it were a feature, the number of combinations is dramatically reduced, making the problem of computing Shapley values computationally tractable in a reasonable amount of time, even if we have to re-train the models for each possible coalition. This decision constitutes a key design trade-off in algorithm implementation: by treating sectors as features, we reduce the computational complexity of the Shapley value calculation from exponential in the number of pixels to exponential in the number of sectors (i.e., from $2^n$ to $2^7$), making the algorithm feasible for execution while preserving anatomical relevance. This approach is similar to that adopted by [39], where a discretization of the problem is also proposed but through an arbitrary regular grid without medical significance.
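As an illustration of the reduced search space, the 128 coalitions over the seven sectors can be enumerated directly; the sector labels follow Figure 2, and the variable names are ours.

```python
import itertools

SECTORS = ["NS", "N", "NI", "TI", "T", "TS", "B"]  # six disc sectors plus background

# Every coalition of visible sectors; the complement of each one is occluded.
coalitions = [set(c)
              for r in range(len(SECTORS) + 1)
              for c in itertools.combinations(SECTORS, r)]
assert len(coalitions) == 2 ** len(SECTORS)        # 128
```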
For practical purposes, another important problem to solve is how to obtain an image in which only a subset $S \subseteq N$ of sectors is present, i.e., how to simulate the absence of one or more sectors to form the coalitions required for the calculation of the Shapley values. In our case, we have solved this by occlusion. When we talk about occlusion in this context, we mean it in the same sense as the occlusion method in [40], i.e., hiding or covering parts of the image so that the neural network model does not have access to that information. Specifically, we have chosen to occlude the sectors with black, i.e., setting all their pixels to 0 in the three RGB channels, which gives us a total of 128 coalitions ($2^7$), including the occlusion of the image background. Figure 3 shows an example of occluded sectors in an image.
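A minimal sketch of this occlusion step is given below, assuming a hypothetical dictionary of boolean masks that encodes the sector layout of Figure 2 for the given eye.

```python
import numpy as np

def occlude(image, sector_masks, occluded):
    """Return a copy of `image` with the given sectors set to black.

    image        : H x W x 3 RGB array
    sector_masks : dict mapping sector name -> boolean H x W mask (assumed helper data)
    occluded     : iterable of sector names to hide
    """
    out = image.copy()
    for name in occluded:
        out[sector_masks[name]] = 0  # zeroes all three channels within the sector
    return out
```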
Note that we use “coalition” and “combination of occluded sectors” interchangeably, although formally a coalition $C \subseteq N$ represents the sectors that enter the game, while the corresponding combination $O$ of occluded sectors comprises the sectors that are not present in that coalition, $O = N \setminus C$. Note also that the calculation of the Shapley values includes, in particular, $C = \emptyset$ and $C = N$, i.e., respectively, an image in which all sectors are occluded and an image in which none are occluded (the original image).
The Shapley values, as previously mentioned, are the contribution of each sector to the difference between the model prediction for the original image and the prediction for the empty set. In our case, since our models return two output probabilities (normal and glaucoma classes), whose sum is 1 due to the softmax activation function, we only consider the probability that the image corresponds to a glaucoma case. Therefore, to calculate the Shapley value for each sector $j \in F$ we proceeded as follows (a code sketch is given after the list):
  • For each input image I, generate all possible combinations of occluded sectors for I, taking into account whether the image belongs to a left or right eye to properly locate the sectors.
  • For each sector $j$ of $I$, calculate the probability of the glaucoma class of the models $f_S$ and $f_{S \cup \{j\}}$ for each possible coalition $S \subseteq F \setminus \{j\}$ of sectors, using as input $I_S$ and $I_{S \cup \{j\}}$, the images generated from $I$ with that coalition of sectors present in the image.
  • With the above probabilities, calculate the Shapley value $\phi_j$ for sector $j$ following Equation (2).
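The following sketch implements Equation (2) exactly for the seven sector features; value_fn is a hypothetical callable that returns the glaucoma probability of the model re-trained for a given coalition, evaluated on the image with only those sectors visible.

```python
import itertools
import math

def exact_shapley(value_fn, sectors):
    """Exact Shapley values over a small set of sector features (Equation (2))."""
    n = len(sectors)
    phi = {j: 0.0 for j in sectors}
    for j in sectors:
        others = [s for s in sectors if s != j]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                # Weight |S|! (n - |S| - 1)! / n! of the marginal contribution
                w = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                     / math.factorial(n))
                phi[j] += w * (value_fn(frozenset(S) | {j}) - value_fn(frozenset(S)))
    return phi
```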
The most straightforward way to calculate the output probabilities for all possible coalitions of sectors would be to use the original trained models, as they stand. However, as discussed in [41], one of the basic assumptions in machine learning is that training and testing data come from the same distribution, so if we occlude a portion of the sectors in the image when evaluating our models, we could be changing the distribution from which the testing data comes. In this scenario, it would be unclear whether the prediction of a model is affected because the removed features were important or because of the change in the distribution of the data. Therefore, it seems appropriate to re-train the models to accommodate these new images with occluded sectors. From an algorithm design perspective, we considered two alternative retraining strategies that reflect a trade-off between model-centric and data-centric fidelity:
  • Re-train only the last layer, i.e., the classification layer of the original models, keeping the convolutional base frozen, using as training data the different possible combinations of sectors, resulting in a model adapted to each combination, but without modifying the original feature extraction.
  • Re-train all layers of the original models, from start to finish, also using the same data as in the previous scenario, resulting in new models adapted to each combination.
The second strategy, according to the approach taken by [42], would be more “true to the data”: by re-training all layers, the models could start using different features that are correlated with the original ones, for example, redundant information contained in the images that was not used before, which could move us further away from the original model [43]. This could nevertheless be interesting, because it allows us to better understand what information contained in the data can be exploited for predictions.
The first strategy, according to [42], would be more “true to the model”: only the classification part is re-trained so that it can adjust to possible outliers from the convolutional layers that the classifier had not seen before, while the features that the model extracts from the images remain unchanged. This allows the classifiers to adapt to possible changes in the distribution, as previously discussed, while obtaining a prediction closer to the original model, since the same convolutional layers of the original model are used. We believe this approach is better suited to understanding the decisions made by convolutional neural network models, and for this reason the experiments have been carried out under this assumption.
Therefore, in order to re-train our original models and obtain new models for each combination of occluded sectors, we followed these steps:
  • Given a set of images with a combination O of occluded sectors, the training and test sets were divided using exactly the same division as in the corresponding original set. That is, if an image was originally in the training set, the same image with the combination O of occluded sectors will still be in the training set.
  • Similarly, the training set was subdivided into the same 5 training and validation partitions as the respective unoccluded set.
  • For each of the 5 partitions we built a new model, initializing it with the weights of the original model corresponding to that partition, and we re-trained it following the methodology described in Section 3.2, but only for 50 epochs, as we empirically observed that this was sufficient.
  • Finally, we selected the models for the epoch that achieved the highest average validation accuracy among the 5 partitions. This results in 5 models per network architecture for each combination O of occluded sectors.
The above procedure was repeated for each possible combination of occluded sectors and for each network architecture.
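A sketch of one such re-training step, under the “true to the model” setting chosen above and reusing the illustrative build_model helper from Section 3.2, could look as follows; the occluded training and validation arrays and the learning rate are assumptions consistent with the methodology described earlier.

```python
def retrain_for_occlusion(original_weights, x_train_occ, y_train, x_val_occ, y_val,
                          epochs=50):
    """Re-train only the classification head on images with one fixed
    combination of occluded sectors, starting from the original fold weights."""
    model, base = build_model("ResNet50")     # same architecture as the original model
    model.set_weights(original_weights)       # initialize from the original fold
    base.trainable = False                    # convolutional features stay fixed
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-6),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train_occ, y_train, validation_data=(x_val_occ, y_val),
              epochs=epochs, batch_size=8)
    return model
```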

3.4. Evaluation Methodology

The degree of agreement between the Shapley values calculated in different situations was assessed:
  • Agreement between Shapley values for re-trained models and randomly initialized models.
  • Agreement between models with the same architecture.
  • Agreement between models with different architectures.
In order to calculate the degree of agreement, Spearman’s correlation coefficient [44] was used. The degree of correlation between the Shapley values for the re-trained and randomly initialized models provides a baseline above which a correlation can be considered reasonable in the following experiments. It also serves as an initial control, in the same way as suggested by [45] for attribution/saliency methods, since a low correlation would be desirable.
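For instance, the agreement between two models can be computed by flattening their per-sector Shapley values into one vector each (7 sectors × number of test images) and applying Spearman’s coefficient; the array names below are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

phi_a = np.ravel(shapley_values_model_a)  # e.g., 7 sectors x 148 images = 1036 values
phi_b = np.ravel(shapley_values_model_b)
rho, p_value = spearmanr(phi_a, phi_b)
```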
Additionally, the global relevance for each sector, calculated as the probability that the largest contribution provided by the Shapley values for a given image is located in that sector, has also been studied. It is important to consider that the resulting contributions of the Shapley values can be both positive and negative. We need to take this into account when determining which sector has contributed most to the prediction. If the network predicts that the image corresponds to a glaucoma (probability greater than 0.5), the sector that contributed the most will be the one with the maximum Shapley value. If the network predicts that the image corresponds to a healthy eye, the sector that contributed the most will be the one with the minimum Shapley value.
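In code, the rule for picking the most relevant sector of an image can be expressed as a short helper, sketched below under the sign convention just described.

```python
import numpy as np

def most_relevant_sector(shap_values, glaucoma_prob):
    """Index of the sector that contributed most to the prediction.

    shap_values   : array of 7 Shapley values (contributions to the glaucoma class)
    glaucoma_prob : predicted probability of the glaucoma class
    """
    # Positive values push towards glaucoma, negative ones towards normal.
    if glaucoma_prob > 0.5:
        return int(np.argmax(shap_values))
    return int(np.argmin(shap_values))
```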

4. Experimental Results and Discussion

Different experiments have been carried out following the methodology described in Section 3. Details of these experiments can be found in the following subsections.

4.1. Performance Evaluation of the Deep Learning Models

The trained CNN models were tested with the previously mentioned independent set, which consists of 75 samples from healthy subjects and 73 from glaucoma subjects. The results achieved per network architecture and fold have been included in Table 2, highlighting those corresponding to the best performance per architecture in terms of balanced accuracy, which is the arithmetic mean of sensitivity and specificity [46]. Analogously, Table 3 and Table 4 show the performance obtained by evaluating the trained models on the REFUGE and DRISHTI-GS1 datasets. It is important to remark that these additional image sets have only been used for testing.
Furthermore, as a representative example, we present in Figure 4 the validation accuracy and loss curves for the ResNet50 model, which achieved the highest test accuracy in one of the folds. These plots offer additional insights into the model’s behavior during training, complementing the test performance metrics.
While a comprehensive evaluation of the trained CNN models’ performance is beyond the scope of this article, it can be seen that the models classify the images across all considered datasets with reasonable accuracy.

4.2. Randomization of Model Weights

In this experiment, using the same architectures as for the re-trained models described in Section 3.3, different network models have been generated by initializing their weights randomly using the initialization proposed by [47], while the bias vectors have been initialized to zero.
Given that our test set contains 148 images, we constructed two vectors of Shapley values, each with 1036 elements (7 sectors × 148 images), to compute the correlation. One vector corresponds to the randomized model and the other to the re-trained model. By doing so, we have a correlation value for each partition per architecture. These 5 values are the ones we have used to calculate each box-and-whiskers diagram in Figure 5, where we present the correlations for each architecture between the randomized and the re-trained models.
It can be observed that both the medians of the correlations and their extreme values for each architecture present absolute values, in general, lower than 0.1, except for the maximum value of Xception which reached 0.25. These values are therefore within the expected range.

4.3. Correlation Between Shapley Values for Models with the Same Architecture

In this experiment, we calculate the correlation between Shapley values when the partition from which the models are re-trained changes but the neural network architecture remains the same.
In this case, the correlation was calculated between the results of each possible pair of partitions, per architecture. For each pair of partitions, a correlation value was obtained between two vectors with as many elements as the number of test images times the number of Shapley values per image. We therefore have $\binom{5}{2} = 10$ correlation values per architecture.
The results obtained are shown in Figure 6, where each box-and-whiskers plot has been calculated from the correlation values between the results for each pair of partitions per network architecture.
All medians are above 0.65, which seems to indicate that, regardless of the architecture considered, the models are not very sensitive to the change of the training and validation partitions.

4.4. Correlation Between Shapley Values for Models with Different Architectures

These tests aim to assess the degree of agreement between Shapley values corresponding to re-trained models with different architectures but with the same data.
Correlation was calculated for all possible pairs of architectures, considering separately the results of each partition per architecture. In other words, to calculate each correlation value, a vector is constructed for each architecture with the Shapley values of each image of the test set for the models of each partition. We will therefore have 5 correlation values for each pair of architectures (one per partition).
The results obtained are shown in Figure 7, where each boxplot has been calculated from the correlations between the Shapley values for each pair of architectures over the same partitions. It can be observed, as expected, that the change of architecture has a greater impact on the calculation of the Shapley values than the change of training and validation partitions.

4.5. Relevance of Sectors According to Shapley Values

Glaucoma is characterized by a progressive enlargement of the optic cup area, leading to a narrowing of the neuroretinal rim [12]. Although this enlargement can occur in all directions, certain regions tend to be affected earlier than others. In eyes with mild glaucomatous damage, rim loss predominantly appears in the temporal inferior and temporal superior regions. As the disease advances to a moderate stage, the temporal region typically shows the most significant rim loss. In cases of severe glaucoma, the nasal disc area generally experiences the least rim loss, while the nasal inferior region tends to be more affected than the nasal superior region.
With this in mind, Shapley values have been calculated for the different sectors in order to assess in this section the degree of coincidence with the expected values from a medical point of view. For this purpose, a global relevance measure has been obtained for each sector, which indicates the probability that the maximum of the contributions provided by the Shapley values for a given image is found in that sector. More specifically, given the models trained for a specific network architecture (all the partitions have been considered as a whole) and a given test set, each sector has a global probability of N % of being the most relevant, i.e., the one that contributes the most for an input image, according to the Shapley values obtained for a normal or glaucoma prediction. In this sense, in the case of a normal prediction, the signs of the Shapley values were inverted to study the distribution of these values and their averages. The results obtained are shown in Figure 8.
What stands out the most in this figure is the predominance of the background (B) sector, which is the most probable sector in most cases, although with some differences between classes. For the VGG19 and ResNet50 architectures, the probabilities assigned to the B sector are the highest in the case of a normal prediction. However, in the case of a glaucoma prediction, the highest probabilities assigned to the B sector are found for the InceptionV3 and Xception architectures. This extremely high relevance of the B sector may seem very surprising from a medical perspective, but in this regard it is worth recalling the study carried out by [2], where it was shown that the information contained in the background may allow the detection of glaucoma in this type of model.
In Figure 9 and Figure 10 we can see, respectively, the average Shapley values per class and their dispersion. We show only the results for the partitions that obtained the highest balanced accuracy per architecture for the re-trained models, as reported in Table 5. As we can see, for the selected partitions the average values generally reflect what we observed in Figure 8, with the background sector being one of the largest contributors in many cases. In some cases, negative averages appear which indicates that those sectors contribute negatively, on average, towards the class in question.

4.6. Comparison Between Shapley Values for Different Datasets

The global relevance by sectors has also been evaluated for the other datasets considered, DRISHTI-GS1 and REFUGE. Figure 11 shows the results obtained.
It can be seen that the greatest similarity in terms of the distribution of probability values by sector is between our set and DRISHTI-GS1. The correlation in this case was 0.80679 (p-value < 0.001). Among the differences that can be appreciated, the rise of the background probability in ResNet50 for the normal class and its decrease for the glaucoma class stand out. Additionally, the probability decreases in the T, TI, and N sectors in the normal class, but increases in the glaucoma class, in which the TI sector is now the most probable. A decrease in the probability of the background sector in InceptionV3 is also noticeable for the glaucoma class, while the probability of the TI, NI, and TS sectors increases. In the normal class for this network architecture, there is an exchange of importance between the background and the nasal sector, which now appears as the major contributor in 29% of the cases. In VGG19, the probability of B also decreases in both classes. For the glaucoma class, the probability of the T sector increases, and for the normal class, the probability of the N sector increases while that of TI decreases. In Xception, the probability of the N sector also increases for the normal class, which would now be the most probable, followed by TI, which also increases. For glaucomas, as with InceptionV3, there is a reduction in the percentage of the B sector and an increase in the TI, NI, and TS sectors.
The changes in probabilities are most notable between our set and REFUGE, where the Background stands out mainly for the normal class in all architectures. However, for the glaucoma class, the N and TI sectors are the most relevant except for the Xception architecture where N and B are the most important contributors overall. The correlation between the Shapley values for our set and REFUGE was 0.53530 (p-value = 0.003).

5. Conclusions

The use of Shapley values has enabled us to assign a relevance score to each of the standard optic disc sectors typically considered by specialists in glaucoma diagnosis, reflecting the contribution of each sector to the network model’s decision-making process. This sector-based discretization has proven essential in making the problem computationally manageable in practice.
To address the challenge of discarding features to evaluate different combinations for classification, we employed an approach based on occlusion and re-training the non-convolutional portion of the model. While we acknowledge that this method introduces some limitations, we believe it remains a reasonable compromise. This strategy allows us to stay close to the original model by re-training only the classification layer to adapt to occluded images, thus reducing the impact of outliers.
Regarding the relevance scores obtained for each sector, there appears to be limited alignment with what would typically be expected from a clinical perspective. The networks generally seem to extract useful information across all sectors, without any single sector consistently holding the majority of discriminative information.
Additionally, the frequent prominence of the background sector is unexpected and suggests that, despite achieving high accuracy, the model’s decisions may not align with established medical criteria. This discrepancy raises concerns about the model’s interpretability and its potential application in clinical practice.
As future work, as mentioned in Section 3.3, we plan to explore the possibility of retraining all layers of the original models following a more “true to the data” approach to see the difference with the “true to the model” approach adopted in this work. This will require more computing time, but we think it will be interesting to compare both approaches to the problem. We are also considering including new deep learning models in our study, such as newer CNN architectures and Vision Transformers.

Author Contributions

Methodology, J.S. and F.F.; software, J.S. and F.F.; validation, J.S. and F.F.; formal analysis, J.S. and F.F.; investigation, J.S., F.F. and T.D.-A.; resources, T.D.-A.; data curation, J.S., F.F. and T.D.-A.; writing—original draft preparation, J.S. and F.F.; writing—review and editing, J.S., F.F. and T.D.-A.; visualization, J.S. and F.F.; supervision, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

Project (PROID2024010027) financed by the Canary Islands Agency for Research, Innovation and the Information Society (ACIISI) and by the European Regional Development Fund within the framework of the FEDER programme Canarias 2021–2027.

Informed Consent Statement

Patient consent was waived by the Ethics Committee of Hospital Universitario de Canarias due to the nature of the study and because the confidentiality of personal data was ensured.

Data Availability Statement

This study utilized four datasets for analysis. The HUC RGB dataset is not publicly available due to privacy restrictions. However, the remaining datasets—RIM-ONE DL, REFUGE, and DRISHTI-GS1—are publicly accessible and can be obtained online via their respective repositories or official websites. Specific access details for the publicly available datasets can be provided upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Molnar, C. Interpretable Machine Learning, 2nd ed.; Lulu Press, Inc.: Morrisville, NC, USA, 2022. [Google Scholar]
  2. Hemelings, R.; Elen, B.; Barbosa-Breda, J.; Blaschko, M.B.; De Boever, P.; Stalmans, I. Deep learning on fundus images detects glaucoma beyond the optic disc. Sci. Rep. 2021, 11, 20313. [Google Scholar] [CrossRef] [PubMed]
  3. Loh, H.W.; Ooi, C.P.; Seoni, S.; Barua, P.D.; Molinari, F.; Acharya, U.R. Application of explainable artificial intelligence for healthcare: A systematic review of the last decade (2011–2022). Comput. Methods Programs Biomed. 2022, 226, 107161. [Google Scholar] [CrossRef] [PubMed]
  4. Shapley, L.S. A Value for N-Person Games; RAND Corporation: Santa Monica, CA, USA, 1952. [Google Scholar] [CrossRef]
  5. Aas, K.; Jullum, M.; Løland, A. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. Artif. Intell. 2021, 298, 103502. [Google Scholar] [CrossRef]
  6. Sigut, J.; Fumero, F.; Estévez, J.; Alayón, S.; Díaz-Alemán, T. In-Depth Evaluation of Saliency Maps for Interpreting Convolutional Neural Network Decisions in the Diagnosis of Glaucoma Based on Fundus Imaging. Sensors 2024, 24, 239. [Google Scholar] [CrossRef] [PubMed]
  7. Roth, A.E. The Shapley Value: Essays in Honor of Lloyd S. Shapley; Cambridge University Press: Cambridge, UK, 1988. [Google Scholar]
  8. Štrumbelj, E.; Kononenko, I. Explaining prediction models and individual predictions with feature contributions. Knowl. Inf. Syst. 2014, 41, 647–665. [Google Scholar] [CrossRef]
  9. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Sydney, Australia, 2017; Volume 30. [Google Scholar]
  10. Chen, H.; Lundberg, S.; Lee, S.I. Explaining Models by Propagating Shapley Values of Local Components. arXiv 2019, arXiv:1911.11888. [Google Scholar] [CrossRef]
  11. Tham, Y.C.; Li, X.; Wong, T.Y.; Quigley, H.A.; Aung, T.; Cheng, C.Y. Global Prevalence of Glaucoma and Projections of Glaucoma Burden through 2040: A Systematic Review and Meta-Analysis. Ophthalmology 2014, 121, 2081–2090. [Google Scholar] [CrossRef] [PubMed]
  12. European Glaucoma Society. European Glaucoma Society Terminology and Guidelines for Glaucoma, 5th Edition. Br. J. Ophthalmol. 2021, 105, 1–169. [Google Scholar] [CrossRef] [PubMed]
  13. Singh, A.; Jothi Balaji, J.; Rasheed, M.A.; Jayakumar, V.; Raman, R.; Lakshminarayanan, V. Evaluation of explainable deep learning methods for ophthalmic diagnosis. Clin. Ophthalmol. 2021, 15, 2573–2581. [Google Scholar] [CrossRef] [PubMed]
  14. Shorfuzzaman, M.; Hossain, M.S.; El Saddik, A. An Explainable Deep Learning Ensemble Model for Robust Diagnosis of Diabetic Retinopathy Grading. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 113:1–113:24. [Google Scholar] [CrossRef]
  15. Shrikumar, A.; Greenside, P.; Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, ICML’17, Sydney, Australia, 6–11 August 2017; pp. 3145–3153. [Google Scholar]
  16. Mehta, P.; Petersen, C.A.; Wen, J.C.; Banitt, M.R.; Chen, P.P.; Bojikian, K.D.; Egan, C.; Lee, S.I.; Balazinska, M.; Lee, A.Y.; et al. Automated Detection of Glaucoma with Interpretable Machine Learning Using Clinical Data and Multimodal Retinal Images. Am. J. Ophthalmol. 2021, 231, 154–169. [Google Scholar] [CrossRef] [PubMed]
  17. Hasan, M.M.; Phu, J.; Wang, H.; Sowmya, A.; Kalloniatis, M.; Meijering, E. OCT-based diagnosis of glaucoma and glaucoma stages using explainable machine learning. Sci. Rep. 2025, 15, 3592. [Google Scholar] [CrossRef] [PubMed]
  18. Oh, S.; Park, Y.; Cho, K.J.; Kim, S.J. Explainable Machine Learning Model for Glaucoma Diagnosis and Its Interpretation. Diagnostics 2021, 11, 510. [Google Scholar] [CrossRef] [PubMed]
  19. Tao, S.; Ravindranath, R.; Wang, S.Y. Predicting Glaucoma Progression to Surgery with Artificial Intelligence Survival Models. Ophthalmol. Sci. 2023, 3, 100336. [Google Scholar] [CrossRef] [PubMed]
  20. Ravindranath, R.; Naor, J.; Wang, S.Y. Artificial Intelligence Models to Identify Patients at High Risk for Glaucoma Using Self-reported Health Data in a United States National Cohort. Ophthalmol. Sci. 2025, 5, 100685. [Google Scholar] [CrossRef] [PubMed]
  21. Christopher, M.; Gonzalez, R.; Huynh, J.; Walker, E.; Radha Saseendrakumar, B.; Bowd, C.; Belghith, A.; Goldbaum, M.H.; Fazio, M.A.; Girkin, C.A.; et al. Proactive Decision Support for Glaucoma Treatment: Predicting Surgical Interventions with Clinically Available Data. Bioengineering 2024, 11, 140. [Google Scholar] [CrossRef] [PubMed]
  22. Wang, S.Y.; Ravindranath, R.; Stein, J.D.; Amin, S.; Edwards, P.A.; Srikumaran, D.; Woreta, F.; Schultz, J.S.; Shrivastava, A.; Ahmad, B.; et al. Prediction Models for Glaucoma in a Multicenter Electronic Health Records Consortium: The Sight Outcomes Research Collaborative. Ophthalmol. Sci. 2024, 4, 100445. [Google Scholar] [CrossRef] [PubMed]
  23. Wang, R.; Bradley, C.; Herbert, P.; Hou, K.; Ramulu, P.; Breininger, K.; Unberath, M.; Yohannan, J. Deep learning-based identification of eyes at risk for glaucoma surgery. Sci. Rep. 2024, 14, 599. [Google Scholar] [CrossRef] [PubMed]
  24. Yoon, J.S.; Kim, Y.e.; Lee, E.J.; Kim, H.; Kim, T.W. Systemic factors associated with 10-year glaucoma progression in South Korean population: A single center study based on electronic medical records. Sci. Rep. 2023, 13, 530. [Google Scholar] [CrossRef] [PubMed]
  25. Fumero, F.; Diaz-Aleman, T.; Sigut, J.; Alayon, S.; Arnay, R.; Angel-Pereira, D. RIM-ONE DL: A Unified Retinal Image Database for Assessing Glaucoma Using Deep Learning. Image Anal. Stereol. 2020, 39, 161–167. [Google Scholar] [CrossRef]
  26. Fumero, F.; Alayon, S.; Sanchez, J.L.; Sigut, J.; Gonzalez-Hernandez, M. RIM-ONE: An open retinal image database for optic nerve evaluation. In Proceedings of the 2011 24th International Symposium on Computer-Based Medical Systems (CBMS), Bristol, UK, 27–30 June 2011; pp. 1–6. [Google Scholar] [CrossRef]
  27. Orlando, J.I.; Fu, H.; Barbosa Breda, J.; van Keer, K.; Bathula, D.R.; Diaz-Pinto, A.; Fang, R.; Heng, P.A.; Kim, J.; Lee, J.; et al. REFUGE Challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Med. Image Anal. 2020, 59, 101570. [Google Scholar] [CrossRef] [PubMed]
  28. Sivaswamy, J.; Chakravarty, A.; Joshi, G.D.; Ujjwal; Syed, T.A. A Comprehensive Retinal Image Dataset for the Assessment of Glaucoma from the Optic Nerve Head Analysis. JSM Biomed. Imaging Data Pap. 2015, 2, 1004. [Google Scholar]
  29. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar] [CrossRef]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  31. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  32. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar]
  33. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: www.tensorflow.org (accessed on 16 May 2025).
  34. Chalkiadakis, G.; Elkind, E.; Wooldridge, M. Basic Concepts. In Computational Aspects of Cooperative Game Theory; Chalkiadakis, G., Elkind, E., Wooldridge, M., Eds.; Synthesis Lectures on Artificial Intelligence and Machine Learning; Springer International Publishing: Cham, Switzerland, 2012; pp. 11–35. [Google Scholar] [CrossRef]
  35. Campbell, T.W.; Roder, H.; Georgantas, R.W., III; Roder, J. Exact Shapley values for local and model-true explanations of decision tree ensembles. Mach. Learn. Appl. 2022, 9, 100345. [Google Scholar] [CrossRef]
  36. Campbell, T.W.; Wilson, M.P.; Roder, H.; MaWhinney, S.; Georgantas, R.W.; Maguire, L.K.; Roder, J.; Erlandson, K.M. Predicting prognosis in COVID-19 patients using machine learning and readily available clinical data. Int. J. Med. Inform. 2021, 155, 104594. [Google Scholar] [CrossRef] [PubMed]
  37. Štrumbelj, E.; Kononenko, I.; Robnik Šikonja, M. Explaining instance classifications with interactions of subsets of feature values. Data Knowl. Eng. 2009, 68, 886–904. [Google Scholar] [CrossRef]
  38. Garway-Heath, D.F.; Poinoosawmy, D.; Fitzke, F.W.; Hitchings, R.A. Mapping the visual field to the optic disc in normal tension glaucoma eyes. Ophthalmology 2000, 107, 1809–1815. [Google Scholar] [CrossRef] [PubMed]
  39. Van Craenendonck, T.; Elen, B.; Gerrits, N.; De Boever, P. Systematic Comparison of Heatmapping Techniques in Deep Learning in the Context of Diabetic Retinopathy Lesion Detection. Transl. Vis. Sci. Technol. 2020, 9, 64. [Google Scholar] [CrossRef] [PubMed]
  40. Covert, I.; Lundberg, S.; Lee, S.I. Explaining by Removing: A Unified Framework for Model Explanation. arXiv 2022, arXiv:2011.14878. [Google Scholar] [CrossRef]
  41. Hooker, S.; Erhan, D.; Kindermans, P.J.; Kim, B. A Benchmark for Interpretability Methods in Deep Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Sydney, NSW, Australia, 2019; Volume 32. [Google Scholar]
  42. Chen, H.; Janizek, J.D.; Lundberg, S.; Lee, S.I. True to the Model or True to the Data? arXiv 2020, arXiv:2006.16234. [Google Scholar] [CrossRef]
  43. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.I. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat. Mach. Intell. 2020, 2, 56–67. [Google Scholar] [CrossRef] [PubMed]
  44. Zar, J.H. Significance Testing of the Spearman Rank Correlation Coefficient. J. Am. Stat. Assoc. 1972, 67, 578–580. [Google Scholar] [CrossRef]
  45. Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity Checks for Saliency Maps. In Proceedings of the Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Sydney, NSW, Australia, 2018; Volume 31. [Google Scholar]
  46. Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. The Balanced Accuracy and Its Posterior Distribution. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 3121–3124. [Google Scholar] [CrossRef]
  47. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; Chia Laguna Resort. Volume 9, pp. 249–256. [Google Scholar]
Figure 1. Sample color fundus image illustrating the key anatomical structures: the optic disc, optic cup, fovea, and blood vessels. This image was obtained from a right eye. (a) Full fundus image with its main structures: optic disc, fovea, blood vessels. (b) A magnification of the image showing the optic disc and cup.
Figure 2. Diagram illustrating the division of the optic disc into six sectors: Nasal Superior (NS), Nasal (N), Nasal Inferior (NI), Temporal Inferior (TI), Temporal (T), and Temporal Superior (TS). Additionally, the Background (B) is treated as an additional sector. The image depicts the configuration for a right eye; for a left eye, the sectors are mirrored horizontally.
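For readers who wish to reproduce a sector layout of this kind, the sketch below assigns each pixel to one of the six disc sectors or to the background based on its angle relative to the disc centre. The 60-degree angular boundaries, the sector ordering, and the mirroring convention are illustrative assumptions, not the exact parameters used in this work.

```python
# Hypothetical sketch: label each pixel of a cropped fundus image with one of the six
# optic-disc sectors (TS, NS, N, NI, TI, T) or the background (B). The equal 60-degree
# sectors and the right/left-eye mirroring are assumptions for illustration only.
import numpy as np

SECTORS = ["TS", "NS", "N", "NI", "TI", "T"]  # assumed ordering, one 60-degree wedge each

def sector_map(height, width, disc_center, disc_radius, right_eye=True):
    """Return an integer map: 0..5 for the six disc sectors, 6 for the background."""
    yy, xx = np.mgrid[0:height, 0:width]
    dy, dx = yy - disc_center[0], xx - disc_center[1]
    inside_disc = dy ** 2 + dx ** 2 <= disc_radius ** 2

    # Angle measured counter-clockwise from the horizontal axis, in [0, 360).
    angle = (np.degrees(np.arctan2(-dy, dx)) + 360.0) % 360.0
    if not right_eye:
        angle = (180.0 - angle) % 360.0  # mirror horizontally for left eyes

    labels = np.full((height, width), 6, dtype=np.uint8)               # 6 = background (B)
    labels[inside_disc] = (angle[inside_disc] // 60).astype(np.uint8)  # 0..5 = disc sectors
    return labels
```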
Figure 3. Example of occluded sectors in an image of the left eye of a normal subject extracted from RIM-ONE DL. The nasal inferior (NI) and nasal superior (NS) sectors have been occluded. (a) Original color image. (b) Example of occlusion of the NI and NS sectors.
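The occlusion illustrated in Figure 3 can be emulated with a few lines of array manipulation. In this hedged sketch the selected sectors are simply filled with a constant grey value; the actual fill strategy used in the study, and the reuse of the hypothetical `sector_map` helper from the previous sketch, are assumptions.

```python
# Hypothetical sketch: occlude a subset of sectors before presenting the image to the
# non-convolutional part of the network. The constant fill value is an assumption.
import numpy as np

def occlude_sectors(image, labels, sectors_to_occlude, fill_value=127):
    """image: HxWx3 uint8 array; labels: output of sector_map(); sectors_to_occlude: iterable of sector indices."""
    occluded = image.copy()
    mask = np.isin(labels, list(sectors_to_occlude))
    occluded[mask] = fill_value
    return occluded

# Example (indices depend on the assumed ordering): occlude the NS and NI sectors, as in Figure 3(b).
# occluded_img = occlude_sectors(img, sector_map(*img.shape[:2], disc_center, disc_radius), {1, 3})
```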
Figure 4. Validation accuracy and loss per epoch for each of the five cross-validation folds using the ResNet50 model. Each line and color represents a different fold. The plots illustrate the model’s convergence behavior and consistency across folds, providing insight into the training dynamics and generalization performance.
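Curves such as those in Figure 4 can be drawn directly from the per-fold training histories. The sketch below assumes Keras-style History objects with a validation accuracy metric named "val_accuracy"; the variable names are illustrative.

```python
# Hedged sketch: plot per-fold validation accuracy and loss, one line per cross-validation fold.
import matplotlib.pyplot as plt

def plot_validation_curves(histories):
    """histories: list of Keras History objects, one per fold (assumed metric key: 'val_accuracy')."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
    for fold, history in enumerate(histories, start=1):
        ax_acc.plot(history.history["val_accuracy"], label=f"fold {fold}")
        ax_loss.plot(history.history["val_loss"], label=f"fold {fold}")
    ax_acc.set_xlabel("epoch"); ax_acc.set_ylabel("validation accuracy")
    ax_loss.set_xlabel("epoch"); ax_loss.set_ylabel("validation loss")
    ax_acc.legend()
    plt.tight_layout()
    plt.show()
```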
Figure 5. Boxplots of the correlation coefficients between the Shapley values of the random and re-trained models. One boxplot is shown per network architecture, each computed from the five correlation values corresponding to the five data partitions (folds) of that architecture.
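Figures 5–7 summarize how strongly the sector-level Shapley attributions of two models agree. A minimal sketch of such a comparison is given below; it assumes the Shapley values are stored as arrays of shape (n_images, n_sectors) with matching image order, and that a Spearman rank correlation is the agreement measure. Whether the correlations are aggregated per image or per fold is also an assumption.

```python
# Hedged sketch: per-image rank correlation between the sector-level Shapley values of two
# models (e.g., a re-trained model vs. a randomly initialized one).
import numpy as np
from scipy.stats import spearmanr

def shapley_correlations(shap_a, shap_b):
    """Return one Spearman coefficient per test image (row of the Shapley arrays)."""
    coefficients = []
    for a, b in zip(shap_a, shap_b):
        rho, _ = spearmanr(a, b)  # rank correlation over the seven sector values
        coefficients.append(rho)
    return np.array(coefficients)

# The boxplots of Figures 5-7 could then be drawn from these coefficients, e.g. with plt.boxplot().
```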
Figure 6. Boxplots of the correlation coefficients of the Shapley values between all pairs of data partitions (folds), shown separately for each architecture.
Figure 7. Boxplots showing the correlation coefficients between Shapley values for each pair of architectures: VGG19 (v19), ResNet50 (r50), InceptionV3 (i3), and Xception (xcp).
Figure 8. Global relevance by sector according to the Shapley values for the different network architectures, separating the data by the class predicted by the networks. Given an input image, the graph indicates each sector’s probability of being the highest contributor according to the Shapley values for that class.
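The "probability of being the highest contributor" reported in Figure 8 (and later in Figure 11) can be read as the fraction of test images, within each predicted class, whose largest Shapley value falls in a given sector. A small sketch of that computation, under the same array-layout assumption as the previous sketch, follows.

```python
# Hedged sketch: frequency with which each sector has the largest Shapley value,
# grouped by the class predicted by the network.
import numpy as np

def highest_contributor_frequency(shap_values, predicted_classes, n_sectors=7):
    """shap_values: (n_images, n_sectors) array; predicted_classes: (n_images,) array of class labels."""
    freqs = {}
    for cls in np.unique(predicted_classes):
        rows = shap_values[predicted_classes == cls]
        winners = np.argmax(rows, axis=1)                 # sector with the largest value per image
        counts = np.bincount(winners, minlength=n_sectors)
        freqs[int(cls)] = counts / counts.sum()           # empirical probability per sector
    return freqs
```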
Figure 9. Average Shapley values per architecture, for each sector, for the folds with the highest balanced accuracy. The results are separated according to the class predicted by the network.
Figure 10. Dispersion of Shapley values per architecture, for each sector, for the folds with the highest balanced accuracy. The results are separated according to the class predicted by the network. For each image in the test set, a dot indicates the Shapley value of each sector; dots that would overlap with those of other images in the same sector are slightly jittered along the y-axis.
Figure 11. Global relevance by sectors according to the Shapley values for the different network architectures and the different data sets considered, separating the data by the class predicted by the networks. Given an input image, the graph indicates each sector’s probability of being the highest contributor according to the Shapley values for that class. The relevance by sector is shown for our own set (1st and 4th rows), for DRISHTI-GS1 (2nd and 5th rows), and REFUGE (3rd and 6th rows).
Table 1. Summary of the datasets used in this work, indicating the number (No.) of healthy images, the number of glaucoma images, and the minimum (Min.) and maximum (Max.) resolution of the cropped images per dataset.
Dataset | No. of Healthy Images | No. of Glaucoma Images | Min. Resolution | Max. Resolution
RIM-ONE DL | 313 | 172 | 274 × 274 | 793 × 793
HUC RGB | 63 | 191 | 520 × 520 | 1035 × 1035
REFUGE | 1080 | 120 | 268 × 268 | 562 × 562
DRISHTI-GS1 | 31 | 70 | 450 × 450 | 722 × 722
Table 2. Results obtained by training and testing with our dataset, according to different metrics. Results corresponding to the maximum balanced accuracy (B. Accuracy) are highlighted in bold. The last row of each architecture shows the mean (M) ± the standard deviation (SD) for every metric.
Network | Fold | Sensitivity | Specificity | Accuracy | B. Accuracy | F1 Score
VGG19 | 1 | 0.8767 | 0.9467 | 0.9122 | 0.9117 | 0.9078
VGG19 | 2 | 0.9726 | 0.9467 | 0.9595 | 0.9596 | 0.9595
VGG19 | 3 | 0.9041 | 0.9733 | 0.9392 | 0.9387 | 0.9362
VGG19 | 4 | 0.9863 | 0.9467 | 0.9662 | 0.9665 | 0.9664
VGG19 | 5 | 1.0000 | 0.9200 | 0.9595 | 0.9600 | 0.9605
VGG19 | M ± SD | 0.9479 ± 0.0543 | 0.9467 ± 0.0189 | 0.9473 ± 0.0221 | 0.9473 ± 0.0225 | 0.9461 ± 0.0243
ResNet50 | 1 | 0.9315 | 0.9067 | 0.9189 | 0.9191 | 0.9189
ResNet50 | 2 | 0.9589 | 0.9867 | 0.9730 | 0.9728 | 0.9722
ResNet50 | 3 | 0.9863 | 0.9600 | 0.9730 | 0.9732 | 0.9730
ResNet50 | 4 | 0.9178 | 0.9467 | 0.9324 | 0.9322 | 0.9306
ResNet50 | 5 | 0.8904 | 0.9200 | 0.9054 | 0.9052 | 0.9028
ResNet50 | M ± SD | 0.9370 ± 0.0370 | 0.9440 ± 0.0318 | 0.9405 ± 0.0311 | 0.9405 ± 0.0311 | 0.9395 ± 0.0318
InceptionV3 | 1 | 0.9589 | 0.8800 | 0.9189 | 0.9195 | 0.9211
InceptionV3 | 2 | 0.9452 | 0.8667 | 0.9054 | 0.9059 | 0.9079
InceptionV3 | 3 | 0.9726 | 0.9067 | 0.9392 | 0.9396 | 0.9404
InceptionV3 | 4 | 0.9178 | 0.9333 | 0.9257 | 0.9256 | 0.9241
InceptionV3 | 5 | 0.9178 | 0.9200 | 0.9189 | 0.9189 | 0.9178
InceptionV3 | M ± SD | 0.9425 ± 0.0245 | 0.9013 ± 0.0276 | 0.9216 ± 0.0123 | 0.9219 ± 0.0122 | 0.9223 ± 0.0118
Xception | 1 | 0.9452 | 0.9067 | 0.9257 | 0.9259 | 0.9262
Xception | 2 | 0.9315 | 0.8933 | 0.9122 | 0.9124 | 0.9128
Xception | 3 | 0.9452 | 0.8267 | 0.8851 | 0.8859 | 0.8903
Xception | 4 | 0.9178 | 0.9067 | 0.9122 | 0.9122 | 0.9116
Xception | 5 | 0.9315 | 0.8133 | 0.8716 | 0.8724 | 0.8774
Xception | M ± SD | 0.9342 ± 0.0115 | 0.8693 ± 0.0456 | 0.9014 ± 0.0222 | 0.9018 ± 0.0219 | 0.9036 ± 0.0195
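As a sanity check on how the metrics in Tables 2–5 relate to one another, the short sketch below recomputes them from a confusion matrix. The raw counts (73 glaucoma and 75 healthy test images) are an assumption consistent with the dataset sizes in Table 1 under a five-fold split, chosen so that the results reproduce the VGG19, fold 1 row of Table 2; balanced accuracy is taken here as the mean of sensitivity and specificity.

```python
# Hedged sketch: recompute the Table 2 metrics from raw counts. The fold-level counts below
# are assumptions consistent with Table 1, not values reported directly in the paper.
tp, fn = 64, 9   # glaucoma test images classified correctly / incorrectly
tn, fp = 71, 4   # healthy test images classified correctly / incorrectly

sensitivity = tp / (tp + fn)                                         # 0.8767
specificity = tn / (tn + fp)                                         # 0.9467
accuracy = (tp + tn) / (tp + tn + fp + fn)                           # 0.9122
balanced_accuracy = (sensitivity + specificity) / 2                  # 0.9117
precision = tp / (tp + fp)
f1_score = 2 * precision * sensitivity / (precision + sensitivity)   # 0.9078

print(f"{sensitivity:.4f} {specificity:.4f} {accuracy:.4f} {balanced_accuracy:.4f} {f1_score:.4f}")
```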
Table 3. Results obtained by evaluating the models with the REFUGE dataset, according to different metrics. Results corresponding to the maximum balanced accuracy (B. Accuracy) are highlighted in bold. The last row of each architecture shows the mean (M) ± the standard deviation (SD) for every metric.
Network | Fold | Sensitivity | Specificity | Accuracy | B. Accuracy | F1 Score
VGG19 | 1 | 0.8000 | 0.8769 | 0.8692 | 0.8384 | 0.5501
VGG19 | 2 | 0.8167 | 0.8472 | 0.8442 | 0.8319 | 0.5117
VGG19 | 3 | 0.7333 | 0.9269 | 0.9075 | 0.8301 | 0.6132
VGG19 | 4 | 0.7833 | 0.8537 | 0.8467 | 0.8185 | 0.5054
VGG19 | 5 | 0.8833 | 0.8898 | 0.8892 | 0.8866 | 0.6145
VGG19 | M ± SD | 0.8033 ± 0.0545 | 0.8789 ± 0.0319 | 0.8713 ± 0.0273 | 0.8411 ± 0.0264 | 0.5590 ± 0.0529
ResNet50 | 1 | 0.7250 | 0.9620 | 0.9383 | 0.8435 | 0.7016
ResNet50 | 2 | 0.8417 | 0.9009 | 0.8950 | 0.8713 | 0.6159
ResNet50 | 3 | 0.8000 | 0.8981 | 0.8883 | 0.8491 | 0.5890
ResNet50 | 4 | 0.7833 | 0.9296 | 0.9150 | 0.8565 | 0.6483
ResNet50 | 5 | 0.8083 | 0.8065 | 0.8067 | 0.8074 | 0.4554
ResNet50 | M ± SD | 0.7917 ± 0.0429 | 0.8994 ± 0.0580 | 0.8887 ± 0.0498 | 0.8456 ± 0.0237 | 0.6020 ± 0.0921
InceptionV3 | 1 | 0.7500 | 0.9435 | 0.9242 | 0.8468 | 0.6642
InceptionV3 | 2 | 0.8333 | 0.9065 | 0.8992 | 0.8699 | 0.6231
InceptionV3 | 3 | 0.8500 | 0.9389 | 0.9300 | 0.8944 | 0.7083
InceptionV3 | 4 | 0.6750 | 0.9843 | 0.9533 | 0.8296 | 0.7431
InceptionV3 | 5 | 0.7750 | 0.9426 | 0.9258 | 0.8588 | 0.6764
InceptionV3 | M ± SD | 0.7767 ± 0.0701 | 0.9431 ± 0.0276 | 0.9265 ± 0.0193 | 0.8599 ± 0.0244 | 0.6830 ± 0.0454
Xception | 1 | 0.7500 | 0.9083 | 0.8925 | 0.8292 | 0.5825
Xception | 2 | 0.8083 | 0.8963 | 0.8875 | 0.8523 | 0.5897
Xception | 3 | 0.7000 | 0.9241 | 0.9017 | 0.8120 | 0.5874
Xception | 4 | 0.7333 | 0.9148 | 0.8967 | 0.8241 | 0.5867
Xception | 5 | 0.9250 | 0.6898 | 0.7133 | 0.8074 | 0.3922
Xception | M ± SD | 0.7833 ± 0.0884 | 0.8667 ± 0.0994 | 0.8583 ± 0.0812 | 0.8250 ± 0.0176 | 0.5477 ± 0.0870
Table 4. Results obtained by evaluating the models with the DRISHTI-GS1 dataset, according to different metrics. Results corresponding to the maximum balanced accuracy (B. Accuracy) are highlighted in bold. The last row of each architecture shows the mean (M) ± the standard deviation (SD) for every metric.
Network | Fold | Sensitivity | Specificity | Accuracy | B. Accuracy | F1 Score
VGG19 | 1 | 0.8429 | 0.7742 | 0.8218 | 0.8085 | 0.8676
VGG19 | 2 | 0.8857 | 0.7419 | 0.8416 | 0.8138 | 0.8857
VGG19 | 3 | 0.8429 | 0.8065 | 0.8317 | 0.8247 | 0.8741
VGG19 | 4 | 0.8857 | 0.7419 | 0.8416 | 0.8138 | 0.8857
VGG19 | 5 | 0.9571 | 0.6452 | 0.8614 | 0.8012 | 0.9054
VGG19 | M ± SD | 0.8829 ± 0.0467 | 0.7419 ± 0.0603 | 0.8396 ± 0.0147 | 0.8124 ± 0.0086 | 0.8837 ± 0.0144
ResNet50 | 1 | 0.9286 | 0.7742 | 0.8812 | 0.8514 | 0.9155
ResNet50 | 2 | 0.8571 | 0.7419 | 0.8218 | 0.7995 | 0.8696
ResNet50 | 3 | 0.8143 | 0.7419 | 0.7921 | 0.7781 | 0.8444
ResNet50 | 4 | 0.9143 | 0.7419 | 0.8614 | 0.8281 | 0.9014
ResNet50 | 5 | 0.9429 | 0.6774 | 0.8614 | 0.8101 | 0.9041
ResNet50 | M ± SD | 0.8914 ± 0.0540 | 0.7355 ± 0.0353 | 0.8436 ± 0.0360 | 0.8135 ± 0.0279 | 0.8870 ± 0.0293
InceptionV3 | 1 | 0.8429 | 0.8065 | 0.8317 | 0.8247 | 0.8741
InceptionV3 | 2 | 0.8571 | 0.7419 | 0.8218 | 0.7995 | 0.8696
InceptionV3 | 3 | 0.9000 | 0.7097 | 0.8416 | 0.8048 | 0.8873
InceptionV3 | 4 | 0.8571 | 0.7742 | 0.8317 | 0.8157 | 0.8759
InceptionV3 | 5 | 0.8857 | 0.7742 | 0.8515 | 0.8300 | 0.8921
InceptionV3 | M ± SD | 0.8686 ± 0.0235 | 0.7613 ± 0.0368 | 0.8356 ± 0.0113 | 0.8149 ± 0.0128 | 0.8798 ± 0.0095
Xception | 1 | 0.8429 | 0.7097 | 0.8020 | 0.7763 | 0.8551
Xception | 2 | 0.8714 | 0.6774 | 0.8119 | 0.7744 | 0.8652
Xception | 3 | 0.8286 | 0.7419 | 0.8020 | 0.7853 | 0.8529
Xception | 4 | 0.9143 | 0.6452 | 0.8317 | 0.7797 | 0.8828
Xception | 5 | 0.8857 | 0.6774 | 0.8218 | 0.7816 | 0.8732
Xception | M ± SD | 0.8686 ± 0.0341 | 0.6903 ± 0.0368 | 0.8139 ± 0.0129 | 0.7794 ± 0.0043 | 0.8659 ± 0.0125
Table 5. Balanced accuracy of the re-trained models without occluding any sector. Results are shown by architecture and fold, with those with the highest balanced accuracy highlighted in bold. The last column displays the mean (M) ± the standard deviation (SD) per architecture.
Architecture | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | M ± SD
VGG19 | 0.9254 | 0.9322 | 0.9387 | 0.9459 | 0.9663 | 0.9417 ± 0.0157
ResNet50 | 0.9052 | 0.9728 | 0.9528 | 0.9391 | 0.8985 | 0.9337 ± 0.0315
InceptionV3 | 0.9391 | 0.9256 | 0.9185 | 0.9256 | 0.9393 | 0.9296 ± 0.0092
Xception | 0.8995 | 0.9187 | 0.8789 | 0.9393 | 0.8791 | 0.9031 ± 0.0261