A Trust-Based Methodology to Evaluate Deep Learning Models for Automatic Diagnosis of Ocular Toxoplasmosis from Fundus Images

Deep Learning (DL) has arisen as a powerful and promising approach to the automatic diagnosis of ocular toxoplasmosis (OT). However, despite the good performance of these models, their decision rules should be interpretable to elicit trust from the medical community. Therefore, developing an evaluation methodology to assess DL models based on interpretability methods is a challenging but necessary task to extend the use of AI among clinicians. In this work, we propose a novel methodology to quantify the similarity between the decision rules used by a DL model and those used by an ophthalmologist, based on the assumption that doctors are more likely to trust a prediction that was based on decision rules they can understand. Given an eye fundus image with OT, the proposed methodology compares the segmentation mask of OT lesions labeled by an ophthalmologist with the attribution matrix produced by interpretability methods. Furthermore, an open dataset that includes the eye fundus images and the segmentation masks is shared with the community. The proposal was tested on three different DL architectures. The results suggest that complex models tend to perform worse in terms of likelihood of being trusted, while achieving better results in sensitivity and specificity.


Introduction
Over a third of the world's human population is exposed to Toxoplasma gondii, making Toxoplasmosis one of the most common parasitic diseases worldwide [1]. Ocular toxoplasmosis (OT) occurs if the parasite reaches the retina, as it can damage host cells and neighboring cells leaving primary lesions. OT requires drug-based therapy to eliminate the parasite and the inflammation caused by it. If not treated properly, OT can lead to loss of vision [2].
Ophthalmologists conduct eye exams that look for lesions caused by the disease in eye fundus images to diagnose OT. Clinical manifestations of the disease tend to be highly characteristic; however, atypical manifestations can cause false-negative errors even by experienced doctors. Clinical examination is considered the diagnostic standard, due to the lack of a sufficiently sensitive lab test [3].
Machine learning is a subfield of artificial intelligence that allows computers to learn from existing data and make predictions. Its application has improved the performance of many challenging tasks in medical imaging, with a considerable impact on ophthalmology based on fundus photography, optical coherence tomography and slit-lamp imaging [4].
Deep learning (DL) is a subfield of machine learning based on artificial neural networks (ANN), a paradigm inspired by the human brain. DL models allow end-to-end learning, skipping the feature engineering step that was required by traditional computer vision approaches [5]. DL models have achieved promising results in automatic classification of images, and they have brought breakthroughs to the state of the art in recent years [6].
In particular, when applied to retinal images for medical diagnosis and prognosis, convolutional neural networks (CNNs) have been able to identify and estimate the severity of ocular diseases, such as age-related macular degeneration [7] and diabetic retinopathy [8,9]. Moreover, models have been trained to detect lesions caused by these diseases and classify them according to their severity [10].
Hasanreisoglu et al. [11] explored similar techniques for OT diagnosis using fundus images. Parra et al. [12] attempted an additional network architecture and achieved promising results, in addition to publishing an open OT dataset. To the best of our knowledge, these are the only works that have applied deep learning to OT diagnosis.
In the field, most works have focused on the predictive power of the models. However, despite the good results obtained, the medical community remains skeptical about their use, mainly due to the difficulty of interpreting the results. Human factors play an important role in diagnosis, and they must be taken into account to increase the reliability of the induced models and to extend human-AI collaboration. The concept of trust arises in this context, defined as the intention to accept vulnerability based on positive expectations [13]. Currently, a lack of trust in AI systems is a significant drawback in the adoption of this technology in healthcare [14]. Understanding the reasons behind predictions, and analyzing them in light of prior knowledge about the application domain, can be important to establish trust [15].
Zhang et al. defined interpretability as the ability to provide explanations in understandable terms to a human [16]. As such, interpretability methods can be used to obtain an explanation of the output of a predictive model. Attribution methods, a family of interpretability methods, assign credit (or blame) with regards to the prediction to the input features. For images, this means that they assign a score to each of the input pixels.
Several deep learning attribution methods are based on gradients, i.e., partial derivatives of the output with respect to the input. Gradient * Input [17], Integrated Gradients [18], Layer-wise Relevance Propagation (LRP) [19] and DeepLIFT [20] are examples of such methods. Although they use gradients differently to compute attribution scores, Ancona et al. have shown these methods to be strongly related, if not equivalent under certain conditions [21].
Attribution methods have been applied to classification problems with retinal images, to enrich predictions presented to physicians. Sayres et al. explored integrated gradients to grade diabetic retinopathy [22], and Mehta et al. used the same method for automatic detection of glaucoma [23].
A general-purpose trust metric was proposed by Wong et al. [24] and extended by Hryniowski et al. [25]. These metrics were experimentally tested on ImageNet with insightful results. Interpretability, a prerequisite of trust, is known to be a domain-specific notion [26]. Hence, we argue that domain-specific trust metrics are important for machine learning adoption.
In this study, we propose a method to quantitatively evaluate the trustworthiness of a model in the OT diagnosis domain. We do this by comparing the average attribution scores of pixels that belong to a lesion vs. the rest of the pixels. We assume that doctors are more likely to trust a model if its predictions are based on the features they consider for their diagnosis. Hence pixels within lesions should have higher attribution scores than the rest for an OT model to be considered trustworthy.
The rest of this paper is organized as follows. Section 2 introduces the main concepts of this work, including the data used. Then, in Section 3, the experimental results are described. The discussion about such results is given in Section 4. Finally, Section 5 presents the conclusions of this work.

Materials and Methods
In this section, the main characteristics of the data are first presented. Then, the different Deep Learning architectures are introduced. Finally, the proposed evaluation methods are described.

Dataset
Predictive models were trained and evaluated on a dataset of 160 eye fundus images. These images were collected at the Hospital de Clínicas in Asunción (Paraguay) by members of the Department of Ophthalmology. Some examples of the dataset can be seen in Figure 1. The complete dataset can be found online and is freely available for research purposes. Images were captured using a Zeiss brand camera, model Visucam 500, operated by experienced ophthalmologists. Each image was manually segmented by an ophthalmologist using an open source labeling tool (https://labelstud.io (accessed on Wednesday, 20 October 2021)) to manually highlight OT entities (active lesions and inactive scars).
Active lesions have variable size, white or yellow color, blurry edges and a cottony center. They might be associated with a brown retinal hyperpigmentation area, which is compatible with previous scar lesions. In some cases, active lesions can be hard to differentiate due to the presence of vitreitis. Inactive lesions have variable size with possible brown hyperpigmentation, with a stunted yellow or white center. An example of these annotations can be seen in Figure 2.
Figure 2. A sample of unhealthy eye fundus images (a,c) with their corresponding masks of segmented OT lesions (b,d) from the dataset.

Model Training
Deep learning models and, in particular, CNNs have achieved state-of-the-art results in terms of predictive power for computer vision use cases [27]. Convolutional neural networks are a particular type of feedforward neural network (artificial neural networks with no backlinks) that is normally composed of a combination of layers:
• Convolutional layers: capture local features by sliding a set of kernels over their input.
• Pooling layers: downsample the output of convolutional layers.
• Fully-connected layers: often used as the final layers of the model, to perform the final prediction.
Because kernels share their weights across all spatial locations of the input, they significantly reduce the total number of parameters of the network. Thus, CNNs allow building neural networks with many layers using fewer parameters than other architectures [28].
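As a sketch of how these layer types combine, the following is a minimal PyTorch CNN for binary fundus classification; the layer sizes and the 64x64 input resolution are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Toy CNN: convolution/pooling blocks followed by a fully-connected head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional: slide kernels over the input
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling: downsample by 2
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # fully-connected layer producing the final prediction
        self.classifier = nn.Linear(32 * 16 * 16, 1)

    def forward(self, x):
        x = self.features(x)          # (N, 32, 16, 16) for 64x64 inputs
        x = torch.flatten(x, 1)
        return torch.sigmoid(self.classifier(x))  # probability of OT

model = SimpleCNN()
probs = model(torch.randn(2, 3, 64, 64))  # -> shape (2, 1), values in [0, 1]
```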
We evaluate three different architectures: a vanilla CNN, trained from scratch as a baseline; VGG16, proposed by Simonyan and Zisserman, which was the first architecture to combine smaller kernel sizes with increased model depth, achieving promising results; and Resnet18, which introduced the concept of residual connections. Residual connections help transfer knowledge from previous layers, alleviating the vanishing gradient problem that deep neural networks often suffer from. Residual networks allowed even deeper models to be trained with a decreased number of parameters [28].
A comparison of the three architectures in terms of number of parameters and depth is shown in Table 1. Data augmentation based on random flips and crops was performed for all models, as shown in Figure 3. The last two models leverage transfer learning, i.e., they were pretrained on a larger general-purpose image dataset and then, with minor modifications to the learned weights, applied to OT classification for which less data is available. This is common when applying DL in domains where it is very difficult to build well-annotated datasets on a large scale due to the cost of acquiring data and annotations [31]. The idea of transfer learning is represented graphically in Figure 4. Models were optimized for 50 epochs using stochastic gradient descent (SGD) with a batch size of 32. Binary cross-entropy loss was used as the optimization target. The dataset was split into training (70%), validation (10%) and test (20%) sets. The training set was used for model fitting, the validation set for hyperparameter tuning and the test set to make the final model evaluation.
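The transfer-learning setup described above can be sketched as follows: reuse a pretrained feature extractor and train a small task-specific head with SGD and binary cross-entropy. The backbone here is a randomly initialized stand-in (in practice it would be, e.g., Resnet18 loaded with ImageNet weights), and all layer sizes and the full-freeze strategy are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Stand-in "pretrained" feature extractor; in a real setup this would be
# a backbone such as Resnet18 loaded with ImageNet weights.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
head = nn.Linear(64, 1)  # new task-specific head for the binary OT output

for p in backbone.parameters():
    p.requires_grad = False  # keep the learned weights intact

model = nn.Sequential(backbone, head)
criterion = nn.BCEWithLogitsLoss()                       # binary cross-entropy target
optimizer = torch.optim.SGD(head.parameters(), lr=1e-2)  # only the head is updated

# one illustrative SGD step on dummy data
x = torch.randn(4, 3, 32, 32)
y = torch.randint(0, 2, (4, 1)).float()
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```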

Model Evaluation
All models were evaluated using traditional predictive performance metrics: accuracy, sensitivity and specificity. In addition, we propose a method to obtain a trust score based on feature attributions, which is described in detail below. For this evaluation, we only considered images with lesions that were correctly classified by the models (as a reminder, we consider an eye fundus image to be unhealthy if it contains any lesions), since our analysis depends on OT entities that were segmented by ophthalmologists.
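For reference, the three predictive metrics reduce to simple ratios over the binary confusion matrix; the helper below (an illustrative sketch, not the paper's code) makes the definitions explicit.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = unhealthy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate: sick images caught
        "specificity": tn / (tn + fp),  # true negative rate: healthy images cleared
    }

print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
# all three metrics are 0.5 in this toy example
```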

Measuring Feature Importance: Pixel Attribution Scores
Attribution methods provide scores for each of the input features that estimate the relevance they had on the prediction. Formally, given a deep neural network (DNN) F: R^n → [0, 1], let x ∈ R^n be the model input. An attribution method can be seen as a function A(F, x) = [s_1, ..., s_n], where s_1, ..., s_n are referred to as attribution scores. In this study, we use Integrated Gradients (IG) as the attribution method of choice.
Let x′ ∈ R^n be a baseline input of the model, which is usually a black image for image networks. Integrated gradients are defined as the integral of the gradients along the straight-line path from the baseline x′ to the input x. The integrated gradient for the ith dimension is defined as follows:

IG_i(x) = (x_i − x′_i) × ∫_{α=0}^{1} ∂F(x′ + α(x − x′)) / ∂x_i dα,

where ∂F(x)/∂x_i is the gradient of F(x) along the ith dimension and α is an interpolation constant used to perturb features along the path.
We can calculate an attribution score per feature using IG. To obtain a per-pixel attribution score, we sum scores across RGB channels. The proposal of this study is independent of the actual attribution method selected.
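To make the computation concrete, here is a minimal Riemann-sum approximation of the IG integral (an illustrative sketch with assumed names; in practice a library implementation such as Captum's would typically be used). For a linear model without bias, the approximation is exact.

```python
import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    """Approximate IG_i(x) = (x_i - x'_i) * integral of dF/dx_i along the path."""
    if baseline is None:
        baseline = torch.zeros_like(x)  # black image as the baseline x'
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        alpha = k / steps  # interpolation constant along the path
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        out = model(point)
        total += torch.autograd.grad(out.sum(), point)[0]
    return (x - baseline) * total / steps  # (x - x') times the average gradient

# Per-pixel scores for an image batch of shape (N, 3, H, W) would then be
# attributions.sum(dim=1), i.e., the sum across RGB channels.
```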

Evaluating a Prediction: To Trust or Not to Trust?
Given a particular pixel-attribution matrix A ∈ R^n and a mask of OT entities for the original image, some pixels belong to an OT entity and others do not. Assume that those two groups of pixels were sampled from different populations, L and R. We expect the median of L to be larger than that of R for OT cases, i.e., pixels from the lesions identified by a physician should be relatively more relevant for the model to elicit trust. We can test this hypothesis using a one-sided Mann-Whitney U test such that:
h_0: The median of R is larger than or equal to the median of L.
h_1: The median of L is larger than the median of R.
Therefore, we can define a binary trust function t that returns 1 if h_0 is rejected for a given pixel-attribution matrix and 0 otherwise. Given a test set of images, a model is scored by calculating the ratio of images for which we obtain a one after applying t to their pixel-attribution matrices. This aggregate represents the proportion of images for which the model is likely to be considered trustworthy by an ophthalmologist.
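The test and its aggregation can be sketched with SciPy as follows (function and variable names are assumptions for illustration, not the paper's code):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def trust(attribution, mask, significance=0.05):
    """Return 1 if lesion pixels (L) score significantly higher than the rest (R)."""
    lesion = attribution[mask.astype(bool)]   # sample from population L
    rest = attribution[~mask.astype(bool)]    # sample from population R
    # one-sided test; h1: the median of L is larger than the median of R
    _, p_value = mannwhitneyu(lesion, rest, alternative="greater")
    return int(p_value < significance)

def trust_score(attributions, masks):
    """Model-level score: proportion of images whose prediction is trusted."""
    return float(np.mean([trust(a, m) for a, m in zip(attributions, masks)]))
```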
The general-purpose trust score proposed by Wong et al. [24] and extended by Hryniowski et al. [25] defines trust based on the answer to two questions: (1) How much trust do we have in a model that gives wrong answers with great confidence? and (2) How much trust do we have in a model that gives right answers hesitantly? However valuable, interpretability and trust are known to be domain-specific notions [26]. Hence, the trust score proposed in this work incorporates domain-specific knowledge (masks) and compares it with the attribution matrix to answer the question: Did the model consider the features that an ophthalmologist would have taken into account (lesions) for this prediction?
A general overview of the process to evaluate a model is depicted in Figure 5 and can be summarized as follows: (i) an eye fundus dataset was collected by ophthalmologists at the Hospital de Clínicas of Asunción, Paraguay, (ii) physicians manually segmented OT entities for every image that had lesions, (iii) a predictive model is trained on a subset of the eye fundus dataset, (iv) pixel-attribution matrices are computed for all correctly-predicted sick images of a test set and, finally, (v) segmentation masks and attribution matrices are compared using a Mann-Whitney U test, and the results are aggregated to calculate the model trust score.

Results
The experiments were performed on a Google Colab Pro account, which provides Nvidia T4 and P100 graphics cards and up to 25 GB of RAM. The models were implemented using PyTorch 1.4. Models were trained with a batch size of 32, a learning rate of 1 × 10^−2 and stochastic gradient descent (SGD) as the optimizer; these hyperparameters were selected according to the selection process performed by Parra et al. [12].
Two experiments were performed:
• Models were trained and evaluated with respect to accuracy, sensitivity and specificity, to contrast them with the results of the proposed trust metric, and then
• Models were evaluated using the proposed trust score on all correctly-predicted sick images from the test set.

Common Predictive Metrics
After fine-tuning all predictive models, common metrics used to evaluate predictive power were computed on the complete test set. Table 2 summarizes the results in terms of accuracy, sensitivity and specificity. The goal of this experiment was to determine whether the evaluated models ranked similarly to comparisons made in other domains. As expected, both VGG and Resnet achieve better results than the vanilla CNN. Interestingly, better results were achieved with VGG than with Resnet, as opposed to the results published for ImageNet [30].

Trust Score
The proposed trust score was calculated for each of the models on the subset of correctly-labeled images from the test set, following the process depicted in Figure 5. Aggregated results for each of the compared models are summarized in Table 3. Predictive metrics are included to better contrast their relationship to the proposed score. The results show that models that scored higher in terms of traditional metrics associated with predictive power, e.g., accuracy, sensitivity and specificity, performed worse in terms of the proposed trust score. This can be seen on a per-image basis in Figure 6. In addition, numeric values associated with the trust score calculation on a per-image basis can be found in Table A1 of Appendix A.

Discussion
Exploratory analysis of the IG attribution maps confirms the intuition behind the proposed trust score. Figure 7 shows an example prediction for which the model was considered trustworthy. This can be visually verified, as the attribution scores are clustered around the area of the lesion. Figure 8 shows an example prediction for which the model was considered untrustworthy. One can visually confirm that pixel attribution scores are scattered and less concentrated on the lesion area.
Figure 7. An example of an unhealthy eye fundus image that was correctly classified by the CNN model (a), the mask segmented by an ophthalmologist (b), a heatmap of the IG-based pixel attribution scores (c) and the attribution scores as an overlay (d). Median pixel attribution score differences were statistically significant between lesion and non-lesion areas.
Figure 8. An example of an unhealthy eye fundus image that was correctly classified by the Resnet18 model (a), the mask segmented by an ophthalmologist (b), a heatmap of the IG-based pixel attribution scores (c) and the attribution scores as an overlay (d). Median pixel attribution score differences were not statistically significant between lesion and non-lesion areas.
The obtained results suggest that predictions made by the most accurate deep learning models might be harder for experienced physicians to trust. These findings agree with the existing literature, as it is known that healthcare workers often find it challenging to trust complex machine-learning models [32].
Interestingly, the relationship between the trust score and the number of parameters of the trained models (a common proxy for complexity) is not perfectly inverse. Although the simple CNN clearly scored much higher, the trust score for VGG16 was higher than that of Resnet18, despite VGG16 having approximately 10-times more trainable parameters. This suggests that further research is needed into what exactly it is about complexity that penalizes the trustworthiness of predictions, e.g., are residual blocks bad for model trust? In other words, can the key architectural decisions that lead to poor trustworthiness be identified?
Answering the previous question can lead to developing better building blocks for DL and machine learning in general, and this represents a needed, but challenging, change in the way state-of-the-art models are currently evaluated. Considering metrics beyond predictive power is key to achieving mainstream adoption of predictive models in the healthcare domain.

Conclusions
We evaluated three different DL architectures and observed an inverse relation between the predictive power and our trust score. These results suggest that trust should also be considered for model selection, in addition to more traditional metrics, such as sensitivity and specificity. This is particularly the case if we expect deep learning models to be adopted by the medical community.
The main contributions of this work are: (i) an open dataset of annotated eye fundus images for OT diagnosis and (ii) a domain-specific method to evaluate predictive models with respect to trust (i.e., how likely a physician is to trust a model's predictions) for OT diagnosis.
Extensions to our work can include: (i) a user study with ophthalmologists could help validate that our trust score adequately models their reactions to different model predictions, (ii) comparing the results using alternative attribution methods and (iii) comparing our score with traditional ML models by using an extension of IG that supports non-differentiable models [33].

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: