Deep-Learning-Based Visualization and Volumetric Analysis of Fluid Regions in Optical Coherence Tomography Scans

Retinal volume computation is one of the critical steps in grading pathologies and evaluating the response to a treatment. We propose a deep-learning-based visualization tool to calculate the fluid volume in retinal optical coherence tomography (OCT) images. The pathologies under consideration are Intraretinal Fluid (IRF), Subretinal Fluid (SRF), and Pigmented Epithelial Detachment (PED). We develop a binary classification model for each of these pathologies using the Inception-ResNet-v2 and the small Inception-ResNet-v2 models. For visualization, we use several standard Class Activation Mapping (CAM) techniques, namely Grad-CAM, Grad-CAM++, Score-CAM, Ablation-CAM, and Self-Matching CAM, to visualize the pathology-specific regions in the image and develop a novel Ensemble-CAM visualization technique for robust visualization of OCT images. In addition, we demonstrate a Graphical User Interface that takes the visualization heat maps as the input and calculates the fluid volume in the OCT C-scans. The volume is computed using both the region-growing algorithm and selective thresholding technique and compared with the ground-truth volume based on expert annotation. We compare the results obtained using the standard Inception-ResNet-v2 model with a small Inception-ResNet-v2 model, which has half the number of trainable parameters compared with the original model. This study shows the relevance and usefulness of deep-learning-based visualization techniques for reliable volumetric analysis.


Introduction
Intraretinal Fluid (IRF), Subretinal Fluid (SRF), and Pigmented Epithelial Detachment (PED) are three of the most prevalent retinal pathologies [1][2][3]. Quantitative measures of retinal fluids are important biomarkers, and fluid volume computation from optical coherence tomography (OCT) scans by artificial intelligence (AI) algorithms can guide the treatment of exudative retinal conditions, such as Neovascular Age-related Macular Degeneration (NAMD). In NAMD, pathological fluids accumulate in different compartments of the retina-IRF inside the retina, SRF beneath the neurosensory retina, and PED between the Retinal Pigment Epithelium (RPE) and Bruch's membrane (BM). Different types of OCT technologies are used to capture the retina of the eye, such as time-domain, spectral-domain, and swept-source OCT [4]. In this paper, we use data generated from commercially available CIRRUS and PRIMUS OCT machines, which are spectral-domain-based OCT machines.
OCT captures structural information of the retina using the principle of coherent detection. Coherent detection produces multiplicative noise, known as speckle in the acquired images. Speckle lowers image quality and makes visual and automated analysis of OCT images challenging [5]. The noise statistics in OCT images play a significant role in developing suitable despeckling algorithms. Various data distributions have been proposed in the literature for modeling speckles in OCT images. According to Bashkansky and Reintjes [6], speckle noise follows a Gaussian distribution. Pircher et al. [7] and Karamata et al. [8] suggested that speckle noise follows a Rayleigh distribution. Schmitt et al. [9] proposed an exponentially decaying distribution for speckle. Sudeep et al. [10] proposed a denoising method based on the Gamma distribution for speckle reduction in OCT images. In this paper, we model the speckle noise as gamma distribution and perform denoising of OCT images as the first step.
Deep learning has produced remarkable advancements in the field of healthcare [11][12][13][14], especially in the automatic detection of retinal abnormalities such as glaucoma, diabetic retinopathy, Diabetic Macular Edema (DME), Age-related Macular Degeneration (AMD), and Retinal Vein Occlusion (RVO). Deep-learning-based classifiers are used to detect the relationship between the multiple input features in the image and analyze the relationships between these features to make predictions or classifications. By leveraging the complex patterns and representations learned by the model, it can provide an overall prediction based on these detected features. However, for deploying these classifiers in the real world, there is a need to provide an explanation of the relationships captured among the features of the image. This problem of explainable artificial intelligence (XAI) can be addressed to a certain extent by using appropriate visualization tools. Techniques such as Grad-CAM [15], Grad-CAM++ [16], Score-CAM [17], Ablation-CAM [18], and Selfmatching-CAM [19] are used to explain which features of the image are significant contributors to the prediction score. Explainability enhances the confidence of doctors in deploying AI models in real-world applications. The models are otherwise considered opaque or black-box type. Explainability in AI has now become an indispensable part of the deployment pipeline, because when deploying a deep learning model in real-world scenarios, it is crucial to be able to explain and interpret the model's decisions or predictions, particularly when dealing with medical applications. Explainability in AI is necessary to provide insights and justifications to doctors or medical professionals, enabling them to understand why a certain deep learning model is classifying an image into a specific pathology. Through explainability, doctors can gain trust and confidence in the model's capabilities, facilitating its adoption and integration into clinical practice.
There are several studies available in the literature that proposed automatic volume computation of retinal fluid. Typically, one performs segmentation for calculating the fluid volume, for instance, Yukun et al.'s Retinal Fluid Segmentation Network (ReF-Net) [20], Thomas et al.'s encoder-decoder-based segmentation network [21], Wilson et al.'s U-Netbased architecture [22], and Lu et al.'s automated segmentation technique [23]. To the best of our knowledge, ours is the first attempt at adopting the visualization outcome of a deep learning model for retinal fluid volume computation instead of a segmentation output. Typically, one employs a segmentation-based approach for calculating the retinal fluid volume. However, our methodology is significantly different, as we utilize the results obtained from visualization techniques. We also present a Graphical User Interface (GUI) for automatic fluid volume computation. We consider GradCAM, GradCAM++, Score-CAM, Ablation-CAM, and Self-Matching-CAM as visualization techniques for the binary classification models, which detect IRF, SRF, and PED pathologies in the retinal OCT images. The volume is computed separately for IRF, SRF, and PED. In addition, we also propose a novel fusion technique, namely Ensemble-CAM, which is a robust visualization technique.
A combination of IRF, SRF, and PED, as well as fluid volume variations, are used to categorize Age-related Macular Degeneration (AMD) patient response to antivascular endothelial growth factor (Anti-VEGF) therapy. The AMD patient is categorized as a Responder if the post-treatment fluid volume has reduced by more than 10% compared with the pretreatment fluid volume and as a Nonresponder if the post-treatment fluid volume has increased or remained the same or reduced by less than 10% compared with the pretreatment fluid volume. The automatic quantification of fluid response may be a better means of monitoring than the current qualitative evaluation used in clinical practice. The computed fluid volume could be a potential input to an AI model to predict the AMD patient's response to an Anti-VEGF injection [24]. The AI response prediction can help clinicians in deciding the form of the treatment and could be a beneficial tool in the clinical armamentarium for the therapeutic management of AMD.
Section 2 contains a description of the dataset used for classification and visualization tasks and the methodology used in the paper. Section 3 presents the results obtained for classification, visualization, and fluid volume computation, and Section 4 contains a discussion of the results. Concluding remarks are given in Section 5. Figure 1 shows an OCT image in which all three pathologies, namely IRF, SRF, and PED, are present.

Dataset
In this section, we present details of the datasets used for classification and visualization.

Classification Dataset
The dataset used in this study is proprietary to Carl Zeiss and is the same as the one used in our recent publication [25]. It comprises OCT B-scan images stacked as cubes. The images are obtained using two brands of Carl Zeiss machines-CIRRUS and PRIMUS. There are, in turn, two variants of CIRRUS, one that captures 200 images per cube and an other that captures 128 images per cube. There are two types of PRIMUS machines, of which one captures 128 images per cube and an other that captures 32 images per cube. The training and test data for classification consist of 207,535 and 52,650 images, respectively. The B-scan images from CIRRUS and PRIMUS are of size 1024 × 512 and 1024 × 200, respectively.

Segmentation Dataset
The ground-truth segmentation data required for visualization is available for 73 cubes, each, in turn, containing 128 B-scans, making up a total of 9344 (73 × 128) images. Of these, 1235, 1361, and 3686 images have IRF, SRF, and PED pathologies, respectively.

Performance Measures
The segmentation ground truth is given by an expert and is used for the computation of the Jaccard Index/Intersection over Union (IoU) [26] and Dice score [27,28] given as follows: and where A is the generated mask, B is the ground truth mask, and |·| represents the cardinality of the set.

Architecture of the Model
In this work, we use the Inception-ResNet-v2 [29] to develop the binary classification models for each pathology. Figure 2a shows the standard Inception-ResNet-v2 architecture divided into several parts, such as stem, Inception-A, Inception-B, Inception-C, Reduction-A, and Reduction-B, each explained in the following:

1.
Stem: The stem of the Inception-Resnet-v2 architecture consists of a convolutional layer followed by a max pooling layer, and several convolutional layers with increasing dilation rates.

2.
Inception-ResNet-A: The Inception-A blocks consist of multiple branches, each of which applies a different type of operation (such as convolution or pooling) to the input. These branches are then concatenated together and passed through a final convolutional layer.

3.
Inception-ResNet-B: The Inception-B blocks are similar to Inception-A blocks, but with a different set of operations. It uses an average pooling operation followed by a convolutional layer.

4.
Inception-ResNet-C: Inception-C blocks are also similar to Inception-B blocks but with a different set of operations and a larger number of filters.

5.
Reduction: Reduction-A and Reduction-B blocks reduce the spatial dimensions of the feature maps. They consist of several convolutional layers and max pooling layers.
Finally, the output of the last block is fed into an average pooling layer, followed by a dropout layer and a fully connected layer, which produces the final output. The Inception-ResNet-v2 is trained on 1 million images from ImageNet [30] dataset with 1000 classes. We fine-tuned the model using the training dataset with two classes.

Comparison with Small Inception-ResNet-v2
We developed a novel small Inception-ResNet-v2 architecture, which has a similar skeleton to the original Inception-ResNet-v2 but with half the number of trainable parameters (28 million parameters as opposed to 56 million trainable parameters). Ten Inception-ResNet-A, twenty Inception-ResNet-B, and ten Inception-ResNet-C layers in Inception-ResNet-v2 architecture are replaced by eight Inception-ResNet-A, seven Inception-ResNet-B, and four Inception-ResNet-C layers, respectively, to obtain the small Inception-ResNet-v2 architecture shown in Figure 2b. Figure 2c shows the stem.

Denoising
The OCT images obtained from the OCT scanners are typically noisy. Hence, there is a need to perform mild denoising of the images. If the models are trained directly on noisy images, there is a chance that the model might fit the noise in the image rather than the pathological features of interest. After experimenting with several noise models, we found the gamma distribution to be an appropriate model for the noise in OCT images. We carry out denoising using the Gated Convolution and Deconvolution Structure (GCDS) model [5]. • The GCDS model consists of two phases-an encoding phase and a decoding phase. The encoding phase consists of 5 convolution layers that create a representation that encapsulates all fundamental features but leaves out the noise.
• The model also contains skip connections between corresponding convolution and deconvolution layers. These skip connections reduce the number of weights to be trained in the neural network, which leads to quicker convergence of the model. • Each skip connection is associated with a gating factor, which determines the ratio of split of the information between the next convolution layer and the corresponding deconvolution layer.

Classification of the OCT Images
We now present the details involved in training the classification model.

Training Phase
We trained the classification model on 207,535 images. Each image carries labels for IRF, SRF, and PED. For training the classifier, we resize the images, which are of size (1024, 512, 3) to (450, 450, 3) for compatibility with the Inception-ResNet-v2 architecture. The model is tested using data consisting of 52,650 images. We used an Inception-ResNet-v2 architecture pretrained on the ImageNet dataset. We fine-tuned the pretrained model using OCT training images. We used the Adam optimization algorithm [31], which combines adaptive gradient methods and stochastic optimization techniques. Adam has been shown to be effective in various deep learning tasks, as it dynamically adjusts the learning rate based on the gradient magnitudes of individual parameters.
We set the learning rate to 10 −4 , which was chosen based on empirical observations and previous research in the field. This learning rate provides a good balance between convergence speed and stability, ensuring that the model converges to a suitable solution while avoiding overshooting or instability issues.
The training phase comprises iterative updating of the model parameters using backpropagation and stochastic gradient descent. The Adam optimizer adjusted the learning rate for each parameter separately, allowing the model to effectively adapt to different features and gradients present in the training data. The loss function employed is Categorical Cross-Entropy (CCE) [32] given by whereŷ i is the ith entry (scalar value) in the model output, y i is the corresponding target value, and the output size is the number of scalar values in the model output.

Visualization of the OCT Images
Due to the black-box nature of the deep-learning-based classification models, there is a need for explainability of the model. We consider standard visualization-based techniques such as GradCAM, GradCAM++, Score-CAM, Ablation-CAM, and Self-Matching-CAM on the binary classification models of IRF, SRF, and PED. These visualization techniques help doctors understand why the model made a certain decision for an image by highlighting relevant parts of the image. Figure 5b shows the Grad-CAM output of the GCDS denoised OCT image shown in Figure 5a. In the heat map, the yellow region indicates the most relevant region, and the relevance decreases as the color changes from yellow to blue.

Ensemble CAM
We applied standard visualization techniques such as Grad-CAM (GD), Grad-CAM++ (GD++), Score-CAM (SC), Ablation-CAM (AC), and Self-Matching-CAM (SM) on the nine binary classification models developed for each of the three pathologies using the noisy OCT images and denoised OCT images. The heat maps obtained from the visualization techniques are converted to binary maps using the Otsu thresholding [33] technique. The procedure is applied to expert-annotated ground-truth segmentation comprising 128 B-scans, making up a total of 9344 (73 × 128) images to calculate the Intersection over Union (IoU) and Dice scores.
We observed that for several OCT images, the heat maps given by various visualization techniques are not always in agreement, as the techniques emphasize different regions in the OCT image. The Ensemble CAM output is a binary map with the value at a certain pixel equal to one, where three or more than three visualization techniques agree with each other. Computationally, the Ensemble CAM heat map can be obtained by simply adding the five binarized class activation maps and using binary thresholding with a threshold value of three. Figure 6 shows the binary maps obtained from various visualization techniques, including the Ensemble-CAM that we developed.

Retinal Fluid Volume Computation
AI-based fluid volume computation compared with manual-annotations-based volume computation saves an enormous amount of time and removes intraobserver and interobserver variability. Currently, the device does not provide clinicians the ability to measure fluid volumes automatically, and the assessment is performed by visual inspection. An automatic fluid volume computation tool would greatly enable the clinicians in providing reliable diagnosis.
The fluid volume is computed using two different techniques, namely the region-growing algorithm [34,35] and the thresholding technique. We computed the volumes of 73 C-scans for IRF, SRF, and PED pathologies separately using the region-growing and thresholding techniques. Figure 7 shows the pipeline for computing the volume of IRF, SRF, and PED in a C-scan. First, forthe Inception-ResNet-v2 GCDS model and the small Inception-ResNet-v2 GCDS model, we pass the noisy OCT images through the denoising model and preprocess them before passing them to the classification model. If the prediction score is more than 0.5, we pass the preprocessed image through the visualization technique to obtain the heat map, which is further binarized using the Otsu thresholding technique [33]. Figure 8 shows the snapshots of the Graphical User Interface (GUI) created for volume computation. The doctors select a C-scan and press the volume computation button to display the results. Four images are displayed in the GUI-the input image, the CAM output, the binary map obtained from the CAM heat map, and the ground truth corresponding to the input image.

Post Processing
We used a classification model to gain insights into the explainability of the model's predictions. To visualize the important regions contributing to the model's decision-making process, we employed several popular techniques, such as GradCAM, GradCAM++, Score-CAM, LayerCAM, and AblationCAM. These techniques generate heat maps that highlight the regions of interest in the input image. However, we observed that these heat maps only provided an indication of the regions without precisely outlining the boundaries of the specific regions of interest. This limitation became apparent when dealing with conditions such as IRF (Intraretinal Fluid) and SRF (Subretinal Fluid), where it is crucial to localize the presence of fluid in the retina precisely.
We explored two additional techniques to address this issue, selective thresholding and region-growing, with the aim of refining the heat maps generated by the CAM techniques by emphasizing the regions corresponding to fluid-containing areas. We noticed that these regions appeared darker in contrast to the other areas. By applying selective thresholding and region-growing techniques, we improved the localization of the fluidcontained regions and obtained more accurate outlines of these regions. These techniques allowed us to enhance the interpretability of the classification model for identifying and precisely delineating the presence of fluid in the retina, specifically for IRF and SRF.

Region-Growing Technique
To further enhance the identification and delineation of regions of interest, we employed the region-growing technique [36,37], leveraging the CAM output from the classification model. The purpose was to identify potential regions within the image that corresponded to specific features, particularly in the case of Intraretinal Fluid (IRF) and Subretinal Fluid (SRF).
The region-growing approach comprises the following steps: 1.
Thresholding and contour extraction: The CAM output was thresholded to convert it into a binary image. Contours were then extracted from the binary image. The centroids of these contours were considered as seed points, representing potential regions of interest within the original input image.

2.
Seed point selection: The centroids of the extracted contours served as the seed points for region expansion. These seed points acted as starting points for the iterative growth of the regions.

3.
Region-growing algorithm: Starting from the seed points, a region-growing algorithm was applied to expand the regions of interest iteratively. This algorithm examined the neighboring pixels of each seed point and determined their inclusion in the growing region based on predefined criteria.

4.
Inclusion criteria: The decision to include neighboring pixels in the growing region was based on a specific criterion, which involved evaluating the pixel intensity difference between the seed pixel and the corresponding neighboring pixel. If the difference was below a predefined threshold, such as 15, the neighboring pixel was considered similar and included in the region. Otherwise, it was excluded.
By iteratively growing the regions based on the inclusion criteria, we could accurately outline the desired regions of interest, specifically the fluid-contained areas in the case of IRF and SRF. This enabled precise visualization and analysis of the extent and location of these regions within the retinal images.
It is important to note that while the region-growing technique demonstrated effectiveness in identifying and delineating fluid regions, its applicability may vary for different conditions. For example, in the case of Pigment Epithelial Detachment (PED), where the objective is to highlight layer detachment areas rather than fluid regions, alternative techniques may be necessary to represent the specific features of interest accurately.
In summary, the region-growing technique, utilizing the CAM output and contour extraction, allowed us to iteratively expand regions of interest and achieve accurate outlines of fluid-contained areas. However, its suitability should be carefully considered in the context of the specific condition under analysis. Alternative techniques may be required to address different types of features and meet the specific requirements of the analysis.

Selective Thresholding
In order to enhance the visualization and outline of regions of interest, we employed a technique known as selective thresholding. This approach aimed to improve upon the existing CAM-based visualization methods, such as GradCAM, GradCAM++, ScoreCAM, LayerCAM, and AblationCAM, which provided highlight regions but lacked precise outlines. The specific goal was to highlight fluid-containing regions in the case of IRF and SRF, while recognizing that a different approach was necessary for PED.
The selective thresholding process involved the following steps: 1.
Obtaining the CAM output: We obtained the CAM output from the classification model, which provided an indication of the regions of interest within the image.

2.
Applying Otsu thresholding: To convert the CAM output into a binary image, we applied the Otsu thresholding technique. This process assigns a value of 1 to pixels that exceed a certain threshold and 0 to pixels that are below the threshold. The result is a binary mask that highlights potential regions of interest.

3.
Pixel-wise product: We then applied a pixel-wise product operation between the original image and the binary mask. This operation allowed us to select and isolate the regions in the image where the mask pixel values were 1. By doing so, we focused only on the areas indicated by the CAM as regions of interest.

4.
Histogram analysis: To further refine the selection, we analyzed the histogram of the image. This analysis revealed that pixel values in the range of 25 to 50 exhibited a significant spike, indicating the presence of the desired regions.

5.
Thresholding the image: Based on the observation from the histogram analysis, we applied a thresholding operation to the image. Pixels with values in the range of 20 to 50 were considered as 1, representing the regions of interest, while the remaining pixels were set to 0. This thresholding step allowed us to achieve a more precise outline of the desired regions. 6.
Performance evaluation: We evaluated the effectiveness of selective thresholding by assessing metrics such as Intersection over Union (IoU) and Dice score. The results indicate a significant improvement in both metrics for the detection of fluid-containing regions in IRF and SRF cases. However, it should be noted that selective thresholding was not suitable for PED, as the objective in the case of PED was to highlight layer detachment rather than fluid regions.
In summary, selective thresholding provided a refined visualization approach by emphasizing the outline and extent of fluid-containing regions in the retinal images. While it demonstrated promising results for specific conditions, its applicability varied depending on the nature of the features of interest. For PED, alternative techniques were required to accurately highlight the layer detachment areas. Figure 9 depicts the procedure adopted for calculating the volume using the selective thresholding technique.
The volume computation process involves employing the region growing or selective thresholding techniques for all the images in a C-scan to derive masks for each True Positive image. Each pixel within the mask is considered as a cuboid, and the volume of the entire scan is computed by summing up all the individual pixel volumes given in Equation (4) across all the identified pixels in the C-scan. The volume occupied by each pixel v pixel [38] is scanner-specific, and for the OCT scanners under consideration, it is given by v pixel = 11.7 × 47.2 × 2.0 µm 3 . (4)

Classification Results
Metrics such as True Positive, False Positive, True Negative, False Negative, Accuracy, Precision, Sensitivity, Specificity, and F1 score are used to evaluate the classification performance. We refer to the positive labels as images in which the pathology is present and negative labels in which the pathology is absent. True Positives (TP) are the images for which prediction is abnormal, and the ground truth is also a positive label. In the case of True Negative (TN) images, the prediction is normal, and the ground truth is also a negative label. For False Positive (FP) cases, the prediction is abnormal, but the ground truth is a negative label, whereas for False Negative (FN) cases, the prediction is normal, but the ground truth is a positive label.
Accuracy is the ratio of total number of correct predictions to the total number of images and is expressed as Precision is the ratio of the total number of predictions that are classified correctly as positive to the total number of positive predictions: Sensitivity or Recall is the ratio of the number of images that are correctly classified as positive to the total number of positive images: Specificity is the ratio of the number of images that are correctly classified as negative to the total number of negative images, given by F1 score is the harmonic mean between precision and recall and is given by F1 score is mainly used when the dataset has a class imbalance. Tables 1-3 present the classification metrics for the IRF, SRF, and PED pathologies.

Visualization Results
In this section, we present the quantitative visualization results of the Grad-CAM, Grad-CAM++, Score-CAM, Ablation-CAM, Self-Matching CAM, and Ensemble-CAM visualization techniques on the noisy model, Inception-ResNet-v2 GCDS model, and small Inception-ResNet-v2 GCDS model. We used IoU and Dice metrics for evaluating the visualization results. Figures 10-12 show the visualization outputs obtained using the Grad-CAM technique, binary maps, and ground truths for IRF, SRF, and PED, respectively.

Volume Computation Results
We computed the volume over 73 C-scans obtained from the CIRRUS machine for all three kinds of models for IRF, SRF, and PED and calculated the IoU mean and standard deviation, Dice mean and standard deviation, predicted volume mean and standard deviation, and ground truth volume mean and standard deviation.

Discussion
The evaluation of the model's performance includes various classification metrics, such as accuracy, precision, sensitivity, specificity, and F1 score. The classification results for each pathology are presented in Tables  Matching-CAM, and Ensemble-CAM heat maps, and the segmentation data masks. For IRF, the small Inception-ResNet-v2 GCDS model achieved the best IoU using Grad-CAM++. In the case of SRF, the noisy model achieved the best IoU using the newly proposed Ensemble-CAM visualization technique. For PED, the small Inception-ResNet-v2 GCDS model obtained the best IoU using the Self-Matching-CAM visualization technique. Tables 7-23 present the Intersection over Union (IoU) and Dice coefficient on a volumetric level for the region-growing and selective thresholding technique. The mean and standard deviation of these metrics are calculated for both the predicted and ground truth volume. The results are presented for all three types of models: the noisy model, the Inception-ResNet-v2 GCDS model, and the Small Inception-ResNet-v2 GCDS model. While all the models performed reasonably well on IRF and SRF, their performance in terms of volume computation for PED was below par.       IoU and Dice results: In certain cases of performance evaluation, such as in PED, we observed that the standard deviation is greater than the mean. This indicates that the model's performance is inconsistent, and there is a large variability in the results obtained across different evaluations or samples. It also suggests that the model's performance varies widely from one test instance to another, indicating outliers, instability, and unpredictability. On the other hand, for IRF and SRF, this anomaly was not observed, indicating that the model's performance is relatively consistent, with the results clustering closer to the mean. A lower standard deviation indicates that the model's performance is more stable and less prone to significant variations across different evaluations. In summary, when the standard deviation is greater than the mean, it implies inconsistency and variability in the performance of the model, whereas when the mean is greater than the standard deviation, it suggests stabler and consistent model performance across different evaluations. The specific interpretation may vary based on the context of the evaluation and the performance metric used.

Fluid volume computation:
If the data points in the dataset have a wide range and are spread out, it results in a large standard deviation. This can happen when there are extreme values (outliers) that are significantly different from the rest of the data. In our case, for some patients, the fluid volume calculated was pretty high, making them outliers. Hence, the standard deviation is more than the mean in some cases.

Conclusions
In this paper, we presented a technique to compute the retinal fluid volume starting from deep-learning-based visualization modules such as Grad-CAM, Grad-CAM++, Score-CAM, Ablation-CAM, Self-Matching-CAM, and Ensemble CAM optimized on 73 C-scans obtained from CIRRUS and PRIMUS OCT machines. The visualization techniques are applied to the models trained on the Inception-ResNet-v2 and small Inception-ResNet-v2 architectures. The calculated volume is obtained using two techniques based on the regiongrowing algorithm and selective thresholding scheme and compared. Automatic volume computation using AI helps in reducing the effort put in by the annotators and reduces intraobserver and interobserver variability.  Data Availability Statement: The data were obtained from a Sankara Nethralaya eye hospital, and the patient data will be maintained with confidentiality. The data will be shared upon request and can be provided only after approval from the institutional review board.