Article

Computational Modeling and Optimization of Deep Learning for Multi-Modal Glaucoma Diagnosis

by
Vaibhav C. Gandhi
1,*,
Priyesh Gandhi
2,
John Omomoluwa Ogundiran
3,4,
Maurice Samuntu Sakaji Tshibola
5 and
Jean-Paul Kapuya Bulaba Nyembwe
3,4,5,*
1
Department of Computer/IT Engineering, Gujarat Technological University, Ahmedabad 382424, India
2
Department of Electronics & Computer Engineering, Sigma University, Vadodara 390019, India
3
Department of Mechanical Engineering, ADAI, University of Coimbra, Pólo II, Rua Luís Reis Santos, 3030-788 Coimbra, Portugal
4
School of Architecture, Planning, and Design, University of New Horizons, Route Kasapa 2465, Lubumbashi 212700, Democratic Republic of the Congo
5
Department of Civil Engineering, Official University of Mbuji-Mayi, Av Kalonji no 27, Q/Kansele, Mbuji-Mayi 8330, Democratic Republic of the Congo
*
Authors to whom correspondence should be addressed.
AppliedMath 2025, 5(3), 82; https://doi.org/10.3390/appliedmath5030082
Submission received: 13 May 2025 / Revised: 19 June 2025 / Accepted: 24 June 2025 / Published: 2 July 2025

Abstract

Glaucoma is a leading cause of irreversible blindness globally, with early diagnosis being crucial to preventing vision loss. Traditional diagnostic methods, including fundus photography, OCT imaging, and perimetry, often fall short in sensitivity and fail to integrate structural and functional data. This study proposes a novel multi-modal diagnostic framework that combines convolutional neural networks (CNNs), vision transformers (ViTs), and quantum-enhanced layers to improve glaucoma detection accuracy and efficiency. The framework integrates fundus images, OCT scans, and clinical biomarkers, leveraging their complementary strengths through a weighted fusion mechanism. Datasets including GRAPE and other public and clinical sources were used, ensuring diverse demographic representation and supporting generalizability. The model was trained and validated using cross-entropy loss, L2 regularization, and adaptive learning strategies, achieving an accuracy of 96%, a sensitivity of 94%, and an AUC of 0.97, outperforming CNN-only and ViT-only approaches. Additionally, the quantum-enhanced architecture reduced computational complexity from O(n²) to O(log n), enabling real-time deployment with a 40% reduction in FLOPs. The proposed system addresses key limitations of previous methods in terms of computational cost, data integration, and interpretability. This framework offers a scalable and clinically viable tool for early glaucoma detection, supporting personalized care and improving diagnostic workflows in ophthalmology.

1. Introduction

Glaucoma is a progressive optic neuropathy that leads to irreversible damage of retinal ganglion cells, culminating in permanent vision loss if not diagnosed and treated in time. Characterized by asymptomatic onset and slow progression, glaucoma remains the second leading cause of blindness worldwide, according to the World Health Organization (WHO). It currently affects over 76 million people, with projections suggesting this number will exceed 111 million by 2040 [1]. The silent nature of the disease in its early stages often results in delayed diagnosis, with patients seeking medical help only after substantial vision impairment has occurred. This makes early, accurate detection a critical goal in the global fight against blindness.
Among the various types of glaucoma, primary open-angle glaucoma (POAG) is the most common, especially in individuals aged 60 and above. Several risk factors are associated with POAG, including elevated intraocular pressure (IOP), family history, ethnicity (with significantly higher prevalence among individuals of African descent), diabetes, and systemic hypertension [2]. Epidemiological studies report that approximately 3.5–4% of the global population is affected by glaucoma, with prevalence rates reaching 6% in sub-Saharan Africa and 3.5% in Southeast Asia [3]. Without timely intervention, an estimated 15–20% of glaucoma patients progress to bilateral blindness. Moreover, the economic burden of the disease is considerable: annual costs for glaucoma treatment in the United States range between USD 500 and USD 1500 per patient, factoring in medications, specialist consultations, and routine monitoring [4]. Traditional diagnostic practices such as fundus photography, visual field testing (perimetry), optical coherence tomography (OCT), and tonometry have provided clinicians with valuable insights into structural and functional changes in the eye. However, when used in isolation, these methods often lack the sensitivity and specificity required for detecting early-stage glaucomatous damage [5,6]. Limitations include poor integration of multi-modal biomarkers, difficulty in assessing subtle disease progression, and reduced generalizability across diverse demographic groups [7].
In recent years, artificial intelligence (AI), particularly machine learning (ML) and deep learning (DL), has emerged as a powerful tool for augmenting glaucoma diagnosis. Convolutional neural networks (CNNs) have demonstrated strong performance in analyzing fundus images and OCT scans by automatically detecting visual patterns such as optic disc cupping and thinning of the retinal nerve fiber layer (RNFL) [8,9]. Popular CNN architectures like VGG19, ResNet, and DenseNet have been employed to extract spatial features with high accuracy. Nevertheless, CNN-based models often struggle to generalize across heterogeneous datasets and are limited in their ability to incorporate numerical clinical data such as IOP measurements or visual field indices [10]. To address these challenges, vision transformers (ViTs) have gained attention in medical imaging applications. Unlike CNNs, which focus on localized feature extraction through convolutional kernels, ViTs leverage self-attention mechanisms to capture long-range spatial relationships within an image [11,12]. This allows them to detect subtle and globally distributed features of glaucomatous damage. However, ViTs come with computational costs, exhibiting quadratic complexity (O(n²)) in relation to the input size, which poses constraints for real-time clinical deployment in resource-limited settings [13]. While CNNs have shown promise in glaucoma classification [10], they are limited in capturing global spatial relationships. ViT-based methods address this limitation but come with higher computational costs [13]. Moreover, most prior work focuses on a single modality, ignoring complementary information from clinical parameters. Recent works [14] began exploring multi-modal fusion, but none combine CNN, ViT, and quantum-inspired compression in a single, end-to-end architecture. This study addresses these gaps through a novel, interpretable hybrid framework.
As an innovative response to these limitations, multi-modal frameworks have been developed to combine fundus imaging, OCT scans, and clinical biomarkers in a unified diagnostic pipeline. Studies such as those by Huang et al. [4] introduced the GRAPE dataset, a longitudinal repository of fundus photographs and visual field images highlighting the potential of integrating structural and functional imaging data for enhanced glaucoma assessment. Similarly, dual-fusion and probabilistic deep learning models have shown that combining OCT, visual fields, and clinical parameters can significantly improve classification accuracy and diagnostic robustness [15,16]. Nevertheless, many of these approaches still face challenges related to computational inefficiency, limited dataset diversity, and lack of interpretability, especially in high-stakes clinical environments. To overcome these hurdles, recent research has explored hybrid models that integrate CNNs and ViTs, further augmented by quantum-inspired deep learning layers. These models exploit multi-modal feature fusion through mathematically optimized weight-based concatenation, enabling them to balance contributions from multiple data sources [17]. In parallel, the integration of quantum-enhanced architectures has been shown to reduce computational complexity from O(n²) to O(log n), achieving both high diagnostic accuracy and scalability for real-world applications [18,19]. Regularization techniques, such as L2 penalty terms, are also incorporated to prevent overfitting and enhance generalizability across diverse patient cohorts [20]. The contribution of this study lies in the development of a quantum-enhanced hybrid framework that combines the spatial precision of CNNs, the global context awareness of ViTs, and the computational efficiency of quantum-inspired layers. This model is trained and evaluated on diverse, multi-institutional datasets, including the GRAPE dataset and others containing OCT, fundus, and visual field data, to ensure both clinical relevance and scalability. The proposed architecture offers a robust diagnostic solution by addressing key limitations in sensitivity, specificity, and computational efficiency found in prior models. Its novelty lies in the hybrid integration of CNNs, ViTs, and a quantum-enhanced fusion layer. By combining local feature extraction from CNNs, global context modeling from ViTs, and structured clinical data, the framework enables comprehensive multi-modal representation. A quantum-inspired unitary transformation further compresses the fused features, enhancing computational efficiency. This unified design improves diagnostic accuracy, interpretability, and scalability for real-world clinical adoption.
Table 1 presents a comparative overview of recent deep learning-based approaches for glaucoma diagnosis, highlighting differences in architecture, data modalities used, datasets, performance, and interpretability methods. The proposed hybrid model demonstrates superior diagnostic accuracy and computational efficiency by integrating CNN, ViT, and quantum-enhanced layers.
The superior features of this study include the following: (i) a unified hybrid architecture combining CNNs for local feature extraction and ViTs for global context modeling; (ii) a quantum-enhanced fusion layer that reduces computational complexity from O(n²) to approximately O(log n); (iii) adaptive, weighted multi-modal fusion of fundus, OCT, and clinical data; (iv) validated interpretability using SHAP, Grad-CAM, and LIME to enhance model transparency; and (v) cross-dataset evaluation on demographically diverse cohorts to demonstrate generalizability. These innovations enable our framework to balance diagnostic accuracy, model efficiency, and clinical applicability.

2. Methodology

This section details the comprehensive methodological pipeline for the development and evaluation of the proposed multi-modal glaucoma diagnosis system. The approach integrates diverse datasets, advanced preprocessing, deep learning architectures, quantum-enhanced computation, and hybrid fusion strategies to improve diagnostic accuracy, scalability, and efficiency.

2.1. Dataset Description

This study utilizes a collection of publicly available and institutionally sourced datasets consisting of three primary modalities: fundus images, OCT scans, and structured clinical parameters. The most prominent among these is the GRAPE dataset [4], which includes longitudinally collected fundus images and visual field data specifically designed for tracking glaucoma progression. Other datasets incorporated include those curated by Jisy et al. [2], Kang et al. [11], Yi et al. [9], and Mehta et al. [13], offering OCT-derived RNFL measurements, cup-to-disk ratios, IOP, and other key clinical markers. The datasets cover a wide demographic spectrum, including variations in ethnicity, age, and disease severity (from early-stage to advanced glaucoma). This diversity ensures improved model generalizability across populations and clinical settings. All data undergo rigorous preprocessing: intensity normalization to remove device-specific biases, artifact removal to enhance image clarity, and sample rebalancing to mitigate class imbalance. Stratified sampling is applied to split the data into training (70%), validation (15%), and test (15%) sets, ensuring no patient record overlap across subsets.
Mathematically, the dataset is represented as
D = \{ (X_1^i, X_2^i, \dots, X_n^i, y^i) \mid i = 1, 2, \dots, N \}
where
D is the dataset;
X_1^i, X_2^i, \dots, X_n^i represent the different data modalities (fundus images, OCT scans, clinical parameters);
y^i is the ground-truth label (glaucoma/no-glaucoma);
N is the total number of samples.
To ensure a balanced dataset, the probabilities for dataset splitting are defined as follows:
P_{\text{train}} + P_{\text{val}} + P_{\text{test}} = 1, \quad P_{\text{train}} = 0.7, \; P_{\text{val}} = 0.15, \; P_{\text{test}} = 0.15
This structured dataset is crucial in developing robust deep-learning models for glaucoma detection and classification, providing a strong foundation for clinical decision-making.
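As a minimal illustration of this splitting scheme, the sketch below performs a 70/15/15 split at the patient level so that no patient record appears in more than one subset. The DataFrame columns (patient_id, label) and the synthetic records are placeholders, and the label-stratification described above would require an additional balancing step on top of this grouping.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical metadata table: one row per sample, grouped by patient_id.
records = pd.DataFrame({
    "patient_id": np.repeat(np.arange(200), 3),   # three samples per patient
    "label": np.tile([0, 1, 0], 200),             # glaucoma / no-glaucoma (toy labels)
})

# First split: 70% of patients for training, 30% held out.
gss = GroupShuffleSplit(n_splits=1, train_size=0.70, random_state=42)
train_idx, holdout_idx = next(gss.split(records, groups=records["patient_id"]))

# Second split: divide the held-out patients equally into validation and test (15% each).
holdout = records.iloc[holdout_idx]
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.50, random_state=42)
val_rel, test_rel = next(gss2.split(holdout, groups=holdout["patient_id"]))

train_df, val_df, test_df = records.iloc[train_idx], holdout.iloc[val_rel], holdout.iloc[test_rel]

# No patient record appears in more than one subset.
assert set(train_df["patient_id"]).isdisjoint(val_df["patient_id"])
assert set(train_df["patient_id"]).isdisjoint(test_df["patient_id"])
assert set(val_df["patient_id"]).isdisjoint(test_df["patient_id"])
print(len(train_df), len(val_df), len(test_df))
```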

2.2. Data Processing

To optimize feature learning, the following preprocessing steps are implemented:
  • Image Normalization: each image I(x) is rescaled to [0, 1] using min–max normalization:
I_{\text{norm}} = \frac{I(x) - I_{\min}}{I_{\max} - I_{\min}}
  • Augmentation: to increase robustness, augmentation techniques such as random rotation, flipping, and contrast adjustments are applied.
  • Missing Value Imputation: for clinical data, missing values are imputed using K-nearest neighbors (KNNs) based on
X_i = \frac{1}{k} \sum_{j \in N_k(i)} X_j
where Nk(i) represents the set of k-nearest neighbors of Xi. This ensures high-quality, artifact-free, and balanced inputs to the learning models. To account for variations across datasets, certain preprocessing steps were customized. Image normalization was performed using device-specific mean and standard deviation statistics to mitigate illumination and contrast inconsistencies between acquisition sources. For clinical data, missing values were imputed using KNN separately within each dataset cohort to preserve intra-dataset feature distributions. Augmentation strategies such as rotation and contrast enhancement were uniformly applied across all image data to increase robustness and reduce overfitting.
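A minimal sketch of these preprocessing steps, assuming NumPy image arrays and a small tabular clinical matrix, is shown below; the KNNImputer configuration, the augmentation ranges, and the toy data are illustrative rather than the exact settings used in the study.

```python
import numpy as np
from sklearn.impute import KNNImputer

def minmax_normalize(image: np.ndarray) -> np.ndarray:
    """Rescale pixel intensities to [0, 1] using I_norm = (I - I_min) / (I_max - I_min)."""
    i_min, i_max = image.min(), image.max()
    return (image - i_min) / (i_max - i_min + 1e-8)   # epsilon guards against flat images

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Simple augmentation: random 90-degree rotation, horizontal flip, contrast jitter."""
    image = np.rot90(image, k=int(rng.integers(0, 4)))
    if rng.random() < 0.5:
        image = np.fliplr(image)
    contrast = rng.uniform(0.8, 1.2)
    return np.clip(image * contrast, 0.0, 1.0)

# Hypothetical clinical matrix (rows = patients; columns = IOP, CCT, C/D ratio),
# with NaNs marking missing values; KNN imputation averages the k nearest neighbours.
clinical = np.array([[21.0, 540.0, 0.6],
                     [np.nan, 555.0, 0.7],
                     [18.0, np.nan, 0.5]])
clinical_imputed = KNNImputer(n_neighbors=2).fit_transform(clinical)

rng = np.random.default_rng(0)
fundus = rng.random((256, 256))           # stand-in for a fundus image
fundus = augment(minmax_normalize(fundus), rng)
```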

2.3. Feature Extraction

2.3.1. Convolutional Feature Extraction (CNNs)

CNNs are used to extract spatial features from fundus and OCT images. The convolution operation is given by
Z_{ij} = \sum_{m=-k}^{k} \sum_{n=-k}^{k} W_{mn} \, X_{(i+m)(j+n)}
where
Zij is the feature map;
Wmn is the filter weight;
X(i+m)(j+n) represents the input image pixel;
k is the kernel size.
A ReLU activation function is applied for non-linearity:
F(x) = max(0, x)
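For illustration, a small Keras convolutional branch implementing this convolution-plus-ReLU feature extraction could be sketched as follows; the filter counts, depth, and output dimension are placeholder choices, not the study's actual CNN branch.

```python
import tensorflow as tf

def build_cnn_branch(input_shape=(224, 224, 3), feature_dim=256):
    """CNN encoder: stacked 3x3 convolutions with ReLU (F(x) = max(0, x)) and pooling."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128):
        x = tf.keras.layers.Conv2D(filters, kernel_size=3, padding="same", activation="relu")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.MaxPooling2D(pool_size=2)(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    features = tf.keras.layers.Dense(feature_dim, activation="relu")(x)
    return tf.keras.Model(inputs, features, name="fundus_cnn_branch")

cnn_branch = build_cnn_branch()
cnn_branch.summary()
```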

2.3.2. Vision Transformer (ViT) Feature Extraction

ViTs use self-attention mechanisms to analyze long-range spatial dependencies in images [13]. Each input image is divided into patches and transformed into an embedding matrix [13]:
Z = X W_E + E_{\text{pos}}
where
X is the input image patch;
WE is the embedding weight;
Epos is the positional encoding.
The self-attention mechanism, whose quadratic complexity dominates the computational cost of ViTs, is defined as follows [14]:
\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V
where
  • Q, K, V are the query, key, and value matrices, respectively;
  • dk is the key dimension.
This enables ViTs to capture global structural relationships in retinal images, enhancing glaucoma detection accuracy [14].
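The following NumPy sketch implements the scaled dot-product attention defined above; the patch count and key dimension are illustrative values only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise patch similarities
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V                                  # attention-weighted values

rng = np.random.default_rng(0)
num_patches, d_k = 196, 64          # e.g., a 224x224 image split into 16x16 patches
Q = rng.standard_normal((num_patches, d_k))
K = rng.standard_normal((num_patches, d_k))
V = rng.standard_normal((num_patches, d_k))
attended = scaled_dot_product_attention(Q, K, V)        # shape: (196, 64)
```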
Figure 1 provides the stepwise flow of the diagnostic computational process, from the input parameters (fundus images, OCT scans, and clinical biomarkers), through the weighted fusion and quantum-enhanced layers, to the final glaucoma classification. These computational processes are discussed further in the subsequent sections to highlight the relevance of each step of the framework.

2.4. Multi-Modal Fusion and Hybrid Model

The proposed diagnostic system, as presented in Figure 1, employs a hybrid deep learning architecture that integrates multiple data modalities (fundus images, OCT scans, and structured clinical biomarkers) using a three-branch processing pipeline. Each data type is independently processed using a modality-specific encoder to extract meaningful latent features before fusion and classification. Fundus and OCT images are processed using a CNN and a ViT, respectively. CNNs are effective for extracting localized spatial features such as optic disc cupping and nerve fiber thinning [8,10], while ViTs utilize self-attention mechanisms to capture long-range spatial relationships in the OCT images [11,13]. Meanwhile, structured clinical features, including intraocular pressure (IOP), cup-to-disc ratio, visual field indices, blood pressure, and family history, are passed through a fully connected dense layer after preprocessing, which includes normalization and missing value imputation using KNN [21].
After modality-specific feature extraction, the three feature vectors denoted X1, X2, and X3 for fundus, OCT, and clinical data, respectively, are aligned in size and fused using a weighted aggregation strategy. This approach adaptively integrates multiple information sources using the following equation:
\text{FusedFeature} = \alpha X_1 + \beta X_2 + \gamma X_3
where α, β, γ are trainable weights constrained such that α + β + γ = 1. These weights are optimized during training to reflect the relative importance of each modality in the final decision-making process, allowing the model to adaptively prioritize the most informative data stream per patient. Initially, equal weights (α: β: γ = 1:1:1) were assigned to each modality in the fusion layer. These weights were then optimized empirically through hyperparameter tuning. The configuration α: β: γ = 1:1:2, which places greater emphasis on clinical biomarkers, yielded the highest validation accuracy and was selected for final training. Importantly, these weights are implemented as trainable parameters, allowing the model to dynamically adjust the contribution of each modality during learning. This enables the network to adapt to modality relevance based on task-specific and dataset-specific contexts. This method of feature-level fusion has demonstrated improved robustness in multi-modal biomedical applications [15,16]. To further reduce the computational burden of the fused feature vector and improve scalability, the system introduces a quantum-enhanced processing layer. This module applies a unitary transformation that encodes the fused features into a compressed representation using logarithmic computational steps. The operation is mathematically expressed as
F_{\text{quantum}} = U_{\theta}(F_{\text{fused}})
where Uθ denotes a quantum-inspired unitary matrix transformation, parameterized by θ, which projects high-dimensional fused features into a compact latent space while preserving critical diagnostic characteristics [22,23]. This transformation enables real-time inference by significantly reducing dimensionality and the number of FLOPs, which is crucial for deployment in low-resource clinical environments. The compressed quantum-enhanced representation is then passed through a softmax-activated classification layer that outputs a binary prediction (glaucoma or non-glaucoma). The model thus benefits from a comprehensive pipeline: CNN-based spatial encoding of fundus images, ViT-based global encoding of OCT images, dense learning of clinical features, weighted feature fusion, quantum compression, and final classification.
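A minimal Keras sketch of this pipeline stage is given below; the softmax parameterization of (α, β, γ) and the orthogonally initialized projection standing in for the unitary transformation U_θ are illustrative simplifications of the quantum-enhanced layer, not its exact implementation.

```python
import tensorflow as tf

class WeightedFusion(tf.keras.layers.Layer):
    """Fuses aligned modality features as alpha*X1 + beta*X2 + gamma*X3, with the
    weights kept positive and summing to one via a softmax over trainable logits."""
    def build(self, input_shape):
        self.logits = self.add_weight(name="fusion_logits", shape=(3,),
                                      initializer="zeros", trainable=True)
    def call(self, inputs):
        x1, x2, x3 = inputs
        w = tf.nn.softmax(self.logits)                  # alpha + beta + gamma = 1
        return w[0] * x1 + w[1] * x2 + w[2] * x3

class QuantumInspiredCompression(tf.keras.layers.Layer):
    """Stand-in for U_theta: an orthogonally initialized projection to a compact latent space."""
    def __init__(self, latent_dim=32, **kwargs):
        super().__init__(**kwargs)
        self.latent_dim = latent_dim
    def build(self, input_shape):
        self.u = self.add_weight(name="u_theta",
                                 shape=(int(input_shape[-1]), self.latent_dim),
                                 initializer=tf.keras.initializers.Orthogonal(),
                                 trainable=True)
    def call(self, x):
        return tf.matmul(x, self.u)

# Hypothetical aligned feature vectors (batch of 4, 256-dim each) from the three branches.
x1, x2, x3 = (tf.random.normal((4, 256)) for _ in range(3))
fused = WeightedFusion()([x1, x2, x3])
compressed = QuantumInspiredCompression(latent_dim=32)(fused)
probs = tf.keras.layers.Dense(2, activation="softmax")(compressed)   # glaucoma / non-glaucoma
```

Parameterizing the fusion weights through a softmax keeps them positive and summing to one while remaining trainable by gradient descent, which matches the constraint α + β + γ = 1 described above.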

2.5. Model Training Strategy

To identify the most effective configuration for the proposed hybrid diagnostic model, a systematic hyperparameter tuning process was conducted. Various values were tested across key model components to optimize convergence, accuracy, and generalization performance. The table below summarizes the tested ranges, selected optimal values, and corresponding justifications for each hyperparameter.
Table 2 presents the resulting optimal configuration of the proposed hybrid diagnostic model. Validation accuracy and training stability guided the selection of parameters. A learning rate of 0.001 provided smooth convergence, while the Adam optimizer outperformed SGD and RMSProp by offering adaptive updates that stabilized training with multi-modal data. A batch size of 32 ensured computational efficiency and gradient reliability, and 50 training epochs with early stopping achieved convergence without overfitting. Dropout and L2 regularization reduced co-dependency and penalized large weights, while a ReduceLROnPlateau scheduler dynamically adjusted learning rates during plateau phases. In the vision transformer module, a patch size of 16 × 16 and eight attention heads captured spatial and contextual features effectively. The fusion module used a 1:1:2 weighting for fundus, OCT, and clinical data, highlighting the diagnostic value of clinical inputs. This tuning strategy significantly enhanced both diagnostic accuracy and computational efficiency, supporting real-time clinical deployment.
The hybrid model is trained using a cross-entropy loss function, which ensures an effective learning process by penalizing incorrect classifications:
L = -\sum_{i} y_i \log \hat{y}_i
where y_i is the true label, and \hat{y}_i is the predicted probability.
To prevent overfitting and improve model generalization, L2 regularization (also known as weight decay) is applied:
L_{\text{reg}} = \lambda \lVert W \rVert^{2}
W represents model weights, and λ controls the regularization strength, reducing the impact of large weight values.
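As a brief illustration, the combined objective (cross-entropy plus the L2 penalty λ‖W‖²) can be written directly in TensorFlow as in the sketch below; the λ value and toy tensors are placeholders.

```python
import tensorflow as tf

def total_loss(y_true, y_pred, weights, lam=1e-4):
    """Cross-entropy L = -sum(y_i * log(y_hat_i)) plus an L2 penalty lam * ||W||^2."""
    ce = tf.keras.losses.categorical_crossentropy(y_true, y_pred)          # per-sample CE
    l2 = tf.add_n([tf.reduce_sum(tf.square(w)) for w in weights])          # ||W||^2 over all weights
    return tf.reduce_mean(ce) + lam * l2

# Toy example: one-hot labels for two samples and predicted class probabilities.
y_true = tf.constant([[1.0, 0.0], [0.0, 1.0]])
y_pred = tf.constant([[0.9, 0.1], [0.2, 0.8]])
dummy_weights = [tf.random.normal((256, 2))]
print(float(total_loss(y_true, y_pred, dummy_weights)))
```

In a Keras model, the same penalty is typically attached per layer via tf.keras.regularizers.l2, which adds λ times the squared norm of that layer's kernel to the training loss.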

2.6. Training Strategy

Optimizer: The model was trained using the Adam optimizer in the TensorFlow 2.11.0 framework with the Keras API in Python 3.9, including an adaptive learning rate scheduler that adjusts the learning rate dynamically based on training progress (see the configuration sketch after this list):
lr = \frac{lr_0}{1 + \text{decay} \times \text{epoch}}
This helps in avoiding sharp fluctuations in weight updates and ensures stable convergence.
  • Cross-Validation: To ensure the model’s robustness, a fivefold cross-validation strategy is used. The dataset is split into five subsets; the model is trained on four subsets and validated on the remaining one iteratively. This approach helps reduce bias and improve generalization [24,25].
  • Early Stopping: Early stopping is implemented by monitoring validation loss to prevent overfitting. Training is halted if the validation loss does not improve for a predefined number of epochs [26].
  • Batch Normalization and Dropout: Batch normalization is applied after each layer to stabilize activations and accelerate convergence [27]. Additionally, dropout regularization is employed to randomly deactivate neurons during training, preventing over-reliance on specific features [28].
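Assuming a Keras model-building function and in-memory arrays X and y, a configuration sketch combining these components (Adam optimization, plateau-based learning-rate reduction, early stopping, and fivefold cross-validation) might look as follows; the toy dense model, patience values, and random data are placeholders rather than the study's exact setup.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

def make_callbacks():
    return [
        # Halt training when the validation loss stops improving (early stopping).
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                         restore_best_weights=True),
        # Lower the learning rate when the validation loss plateaus.
        tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    ]

def cross_validate(build_model, X, y, n_splits=5):
    """Fivefold cross-validation: train on four folds, validate on the remaining one."""
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=42).split(X):
        model = build_model()
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                      loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(X[train_idx], y[train_idx],
                  validation_data=(X[val_idx], y[val_idx]),
                  epochs=50, batch_size=32, callbacks=make_callbacks(), verbose=0)
        scores.append(model.evaluate(X[val_idx], y[val_idx], verbose=0)[1])
    return float(np.mean(scores))

# Example with a toy dense model on random data (stand-ins for the fused features).
X = np.random.rand(200, 64).astype("float32")
y = np.random.randint(0, 2, size=(200, 1)).astype("float32")
build_model = lambda: tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
print("mean CV accuracy:", cross_validate(build_model, X, y))
```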

2.7. Computational Complexity Analysis

A computational complexity analysis was conducted to evaluate the efficiency of the proposed model relative to CNN-only, ViT-only, and hybrid architectures. CNNs exhibit linear complexity O(n), offering computational efficiency but moderate diagnostic accuracy (89%) due to their localized feature learning capabilities [8,10]. ViTs improve accuracy to 94% by capturing global dependencies through self-attention, but their computational cost grows quadratically as O(n²) [11,13,20]. The hybrid model, which combines CNN and ViT branches, improves accuracy to 95% while achieving a 30% reduction in FLOPs compared to ViT-only approaches. To further optimize scalability and efficiency, a quantum-enhanced layer is introduced following feature fusion. This layer applies a unitary transformation that encodes multi-modal features with logarithmic complexity O(log n), reducing dimensionality while preserving diagnostic information [17,18,28].
The overall training complexity is approximated as
\text{Total Complexity} = O(n) + O(n^2) + O(\log n) \approx O(n^2)
However, during inference, the compression achieved by the quantum-enhanced layer reduces the effective complexity toward O(log n), yielding a 40% reduction in FLOPs [29]. This efficiency enables real-time deployment in resource-limited clinical settings while maintaining state-of-the-art diagnostic performance with an accuracy of 96%. Table 3 presents a comparative evaluation of computational complexity across the four models, together with other parameters such as FLOP reduction and model accuracy.
In terms of accuracy and FLOP reduction, the quantum-enhanced hybrid performs best. The hybrid model's computational savings stem from an optimized quantum encoding layer, which reduces operations from O(n²) to O(log n), allowing for significant speed improvements without sacrificing accuracy. The 40% reduction in FLOP count suggests that the proposed model can be deployed on resource-constrained medical imaging systems with minimal performance loss.

3. Proposed Multi-Model Diagnostic Framework

We propose a multi-model framework integrating deep learning (CNNs, ViTs), machine learning (SVM, Random Forest), and quantum-enhanced feature processing to enhance glaucoma diagnosis accuracy. The framework follows a structured pipeline, as illustrated in Figure 2.
The first stage of the pipeline focuses on data processing and feature engineering, where input data sources undergo rigorous preprocessing to ensure high-quality feature extraction. This step includes data cleaning to remove noise and inconsistencies [30], normalization and standardization to ensure uniform feature scaling [18], data augmentation to enhance variability and improve model generalization [19], and dimensionality reduction to extract only the most relevant information while reducing computational overhead [20]. Once preprocessed, multi-modal feature extraction is performed using CNN architectures to capture spatial representations from fundus and OCT images, while recurrent neural networks (RNNs) analyze sequential dependencies in time-series clinical data [31].
Following feature extraction, a Fusion Layer is introduced to integrate extracted features from multiple sources, ensuring an optimized feature representation for diagnosis. This is achieved using weighted concatenation, mathematically represented as [22]
\text{FusedFeature} = \alpha X_1 + \beta X_2 + \gamma X_3
X1, X2, and X3 correspond to extracted features from fundus images, OCT scans, and clinical data, respectively, while α, β, γ are trainable parameters that optimize feature representation. Quantum-enhanced layers are incorporated to refine feature encoding, further enhancing computational efficiency. This approach employs a quantum unitary transformation, defined as
Q_{\text{output}} = U_{\theta}\left( \sum_{i} W_i X_i \right)
where Uθ denotes the quantum unitary transformation [6,23]. The quantum operation Uθ simplifies the fused features without losing critical diagnostic information, and the improved model architecture enhances scalability and efficiency [6].

Feature Importance Analysis Using SHAP-like Values

Figure 3 presents a SHAP-like summary plot that quantifies the relative importance of various clinical biomarkers in predicting glaucoma within the proposed hybrid diagnostic model. This interpretability analysis enables a deeper understanding of how different input features influence the model’s decisions.
Among all the biomarkers, blood pressure emerged as the most influential predictor, followed closely by family history, indicating a strong association between systemic vascular factors and hereditary risk with glaucomatous changes. The visual field index, which reflects functional vision loss, and the cup-to-disk ratio, a structural indicator of optic nerve damage, also showed high predictive importance. In contrast, parameters such as diabetes status, age, IOP, and central corneal thickness (CCT) contributed to a lesser extent. These insights align with clinical understanding, reinforcing the validity of the model and its potential utility in real-world settings. By offering transparent feature attribution, this analysis strengthens the interpretability and trustworthiness of the AI-driven system, which is essential for its adoption in clinical ophthalmology.
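As a hedged illustration of how such feature rankings can be produced, the sketch below uses permutation importance on a surrogate random-forest classifier as a simple SHAP-like proxy; the biomarker names, synthetic cohort, and surrogate model are illustrative and do not reproduce the study's actual interpretability pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

feature_names = ["blood_pressure", "family_history", "visual_field_index",
                 "cup_to_disc_ratio", "diabetes", "age", "IOP", "CCT"]

# Synthetic clinical cohort: 500 patients, 8 biomarkers, binary glaucoma label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] + 0.8 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)

# Rank biomarkers by mean importance, analogous to a SHAP summary ordering.
for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"{feature_names[idx]:>20s}: {result.importances_mean[idx]:.3f}")
```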

4. Results and Discussion

4.1. Performance Evaluation

The proposed hybrid framework was evaluated across multiple performance metrics, including accuracy, sensitivity, specificity, area under the curve (AUC), and computational efficiency. Table 4 summarizes the results in comparison to conventional CNN-based models, ViT-based models, and selected prior studies.
The proposed model consistently outperformed CNN- and ViT-based models across all key metrics. Its accuracy of 96% reflects improved capability in correctly classifying glaucoma and non-glaucoma cases. Sensitivity (94%) highlights its effectiveness in detecting true positive glaucoma instances, while a specificity of 96% minimizes false positives, which is critical for clinical applications where overtreatment must be avoided. Notably, the model achieves an AUC of 0.97, underscoring its superior discriminatory power. In terms of efficiency, the quantum-enhanced architecture achieves a 40% reduction in FLOPs, enabling deployment in real-time, resource-constrained clinical settings.

4.2. Comparative Analysis

The hybrid model was benchmarked against conventional deep learning and existing literature to evaluate its practical utility. A summary of this comparison is presented in Table 5.
Traditional CNN-based models often falter in fusing multi-modal data, while ViTs, despite better spatial learning, impose heavy computational loads. The proposed model resolves these limitations through multi-modal fusion and quantum-enhanced layers, allowing a more holistic and efficient diagnostic process. Compared to models using only fundus or OCT data, the inclusion of clinical biomarkers alongside fused imaging modalities enables better prediction of disease progression. Previous studies reported AUCs ranging from 0.91 to 0.94; the current model’s 0.97 AUC confirms a significant leap in diagnostic robustness. Also, the hybrid strategy optimally balances local (CNN) and global (ViT) feature extraction, resulting in greater resilience to noise and dataset variability.

4.3. Training Performance

Figure 4 presents training and validation curves for accuracy and loss. These plots confirm the model’s consistent learning behavior with minimal overfitting.
The validation accuracy closely follows training accuracy, and loss curves converge smoothly, indicating successful regularization.

4.4. Performance Metrics

To further validate diagnostic reliability, a confusion matrix and ROC curve were generated, as shown in Figure 5. These visual tools offer insights into model precision, recall, and classification boundaries, and include the following:
  • Classification Performance Matrix: a confusion matrix that provides insights into true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs), aiding in the calculation of key performance metrics [32].
  • Model Discrimination Performance Chart: the receiver operating characteristic (ROC) curve and AUC score, measuring the model’s ability to distinguish between different classes effectively [33,34].
Key performance metrics include the following:
Sensitivity (Recall): measures the model’s ability to identify diseased cases correctly.
\text{Sensitivity} = \frac{TP}{TP + FN}
Specificity: evaluates how well the model distinguishes between diseased and non-diseased cases.
\text{Specificity} = \frac{TN}{TN + FP}
Accuracy: the proportion of correctly classified cases out of all cases.
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
F1 Score: the harmonic mean of precision and recall, ensuring a balanced evaluation between false positives and false negatives.
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
These metrics reaffirm the model’s effectiveness in distinguishing diseased and non-diseased cases, supporting its readiness for clinical validation [7]. The combined insights from training accuracy/loss curves (Figure 4) and classification performance charts (Figure 5) validate that the model is well-trained, exhibits strong generalization capability, and effectively differentiates between glaucoma and non-glaucoma cases [27]. These results confirm the model’s suitability for real-world clinical applications [28].
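A short sketch showing how these metrics follow from the confusion matrix is given below; the label vectors and probabilities are synthetic placeholders.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Synthetic ground-truth labels and predicted probabilities (1 = glaucoma).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.92, 0.10, 0.80, 0.35, 0.20, 0.55, 0.88, 0.05])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                        # recall on diseased cases
specificity = tn / (tn + fp)                        # correct rejection of healthy cases
accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
f1          = 2 * precision * sensitivity / (precision + sensitivity)
auc         = roc_auc_score(y_true, y_prob)         # area under the ROC curve

print(f"Sens={sensitivity:.2f} Spec={specificity:.2f} Acc={accuracy:.2f} F1={f1:.2f} AUC={auc:.2f}")
```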

4.5. Computational Efficiency Analysis

Efficient deployment of AI models in real-world clinics demands minimized computational load. The proposed framework achieves this via quantum-enhanced layers that reduce FLOPs by 40% compared to baseline architectures. Table 6 highlights improvements in processing time, memory efficiency, and hardware compatibility.
The findings of this study hold significant clinical implications. The proposed diagnostic framework improves early detection of glaucoma, allowing clinicians to initiate timely treatment based on individual patient profiles. The combination of fundus imaging, OCT scans, and clinical biomarkers helps the model evaluate disease progression, strengthening ophthalmologists' ability to make informed treatment decisions. Some important constraints must also be acknowledged. The model's results depend strongly on dataset quality and population diversity. Although overfitting is mitigated through data augmentation and dimensionality reduction, additional multi-center validation is needed to confirm consistent performance across varied patient groups. Deployment of quantum-enhanced models in basic clinical environments also remains challenging, despite the 40% efficiency gain, because lightweight architectural designs still require optimization.

4.6. Clinical Implications and Limitations

The proposed model holds substantial promise for revolutionizing glaucoma diagnosis by facilitating early detection and individualized treatment planning. The integration of multi-modal data enhances clinicians’ ability to assess disease stage and progression. However, limitations persist. The current model’s performance depends on dataset diversity and quality. Generalization across global populations necessitates further validation through multi-center clinical trials. Additionally, lightweight model deployment and interpretability remain challenges. Future directions include the incorporation of explainable AI methods such as SHAP, Grad-CAM, and LIME to elucidate model decisions. Furthermore, embedding the system into electronic health records and integrating real-time decision support workflows will improve adoption. Table 7 summarizes the major clinical implications of the proposed model along with current limitations and future directions, highlighting areas that require further development for effective clinical translation.
Building on these aspects, the proposed model represents a transformative step in AI-assisted ophthalmology, effectively bridging the gap between research innovation and clinical application. Although the datasets employed in this study [34,35,36,37,38] span multiple institutions and diverse populations, certain imbalances persist, particularly regarding patient age, ethnicity, and disease severity. Stratified performance analysis indicated slightly reduced accuracy among underrepresented subgroups, such as patients under the age of 40, underscoring the need for more representative sampling. To address this, future work will involve multi-center collaborations aimed at enhancing demographic diversity and validating the model across varied regional and racial populations. Ensuring fairness, generalizability, and real-world clinical relevance remains a central focus for subsequent clinical evaluations and trials.

4.7. Explainability and Model Interpretation

The simulated fundus images and corresponding Grad-CAM (gradient-weighted class activation mapping) heatmaps presented here serve as a visual demonstration of the interpretability of the multi-modal glaucoma diagnostic framework proposed in this study, which combines CNNs, ViTs, and quantum-enhanced layers to enhance the accuracy and efficiency of glaucoma detection. The Grad-CAM heatmaps are particularly important for model interpretability, allowing clinicians to understand which regions of the input image significantly influence the diagnostic decision.
Glaucoma, being one of the leading causes of irreversible blindness worldwide, necessitates early and accurate diagnosis to prevent vision loss. The proposed multi-modal model leverages CNNs to extract spatial features from fundus images and OCT scans, while ViTs are employed to capture global structural changes within the retinal images. Moreover, the inclusion of quantum-enhanced layers reduces computational complexity, enabling real-time application even in resource-constrained clinical settings. Grad-CAM, as used in this context, visualizes critical regions within the fundus image that contribute the most to the model’s prediction, particularly highlighting areas indicative of glaucomatous damage.
In the simulated heatmaps, the optic cup region characterized by increased cupping and thinning of the RNFL exhibits prominent red activation. This observation aligns with the typical clinical presentation of glaucoma, where an increased cup-to-disc ratio and RNFL thinning are key indicators of optic neuropathy. The optic disc itself shows less activation, indicating that the model accurately identifies the pathological cupping rather than highlighting the entire disc. The red regions in the heatmaps represent the highest model activation, signifying that these areas are critical for glaucoma detection, while blue areas indicate minimal model focus.
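A minimal Grad-CAM sketch in the standard Keras style is shown below; the toy model, the layer name last_conv, and the random input are placeholders intended only to show the gradient-weighting pattern, not the trained network used in this study.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index):
    """Gradient-weighted class activation map for one image (H, W, 3), values in [0, 1]."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)                  # d(score)/d(activations)
    weights = tf.reduce_mean(grads, axis=(1, 2))                  # global-average-pooled gradients
    cam = tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1)[0]
    cam = tf.nn.relu(cam)                                         # keep positive evidence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

# Toy CNN and random "fundus" image purely to demonstrate the call pattern.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", padding="same",
                           input_shape=(64, 64, 3), name="last_conv"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
heatmap = grad_cam(model, np.random.rand(64, 64, 3).astype("float32"),
                   conv_layer_name="last_conv", class_index=1)
print(heatmap.shape)   # (64, 64): upsample and overlay onto the fundus image for visualization
```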
The proposed diagnostic framework emphasizes enhancing transparency, which is essential for clinical validation and adoption. By highlighting specific regions within the optic cup that correspond to glaucomatous changes, the heatmaps demonstrate the model’s potential for aiding clinical decision-making [2,7]. Such visual explanations not only support the model’s accuracy but also help build clinician confidence in adopting automated glaucoma screening tools [5,13].
Figure 6 displays simulated fundus images (left) alongside their corresponding Grad-CAM heatmaps (right) for two true positive glaucoma cases. The red regions in the heatmaps indicate areas of high activation, highlighting critical features contributing to the model’s glaucoma diagnosis. The optic cup region, marked by increased activation, aligns with the model’s identification of glaucomatous damage, demonstrating effective interpretability [2,5].
Nevertheless, while these simulated heatmaps illustrate the model’s potential, actual clinical validation with real fundus images is crucial to verify these findings comprehensively [7,13]. Future work should incorporate complementary explainability techniques such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-Agnostic Explanations) to provide a more comprehensive understanding of how clinical biomarkers influence diagnostic outcomes [1,2,15]. Such advancements will further strengthen the practical applicability and reliability of AI-driven glaucoma diagnosis in real-world clinical environments [1,5,15].

5. Discussion

The results of this study demonstrate the effectiveness of a quantum-enhanced hybrid model that integrates CNNs, ViTs, and multi-modal clinical data for the diagnosis of glaucoma. By leveraging fundus images, OCT scans, and clinical biomarkers, the model delivers a comprehensive and robust feature representation, surpassing traditional single-modality approaches in both accuracy and generalizability. Compared to earlier studies, the proposed model achieves a substantial performance improvement. For instance, studies [39,40] employed CNN-based approaches using fundus images, attaining an AUC of 0.93. Similarly, Yurdakul et al. [23] applied ViT-based models, achieving an AUC of 0.94. In contrast, our hybrid framework achieved an AUC of 0.97, indicating stronger discriminatory power. Furthermore, previous models lacked the capability to integrate heterogeneous clinical data, limiting their generalizability. Our model addresses this by fusing imaging modalities with clinical biomarkers, capturing both structural and functional indicators of disease progression.
A study by Wang et al. [24] explored feature-level fusion but reported computational inefficiency, restricting clinical applicability. By incorporating quantum-enhanced components, our model reduces FLOPs by 40%, making it suitable for real-time diagnosis, particularly in resource-constrained clinical settings. This efficiency surpasses traditional methods, which typically require high-end GPUs and exhibit longer inference times. Moreover, Huang et al. [4] introduced the GRAPE dataset for longitudinal monitoring of glaucoma progression. While useful for disease tracking, prior models applied to this dataset lacked multi-modal integration. Our framework builds on this work by combining GRAPE with additional OCT and clinical data to enhance prediction accuracy and longitudinal analysis. The integration of CNN-based spatial analysis with ViT’s global attention mechanisms allows the model to detect both localized and dispersed retinal abnormalities. This dual capacity overcomes one of the major limitations of previous models, which often failed to generalize across heterogeneous patient datasets or struggled to identify subtle glaucomatous features in early-stage disease. These advantages are complemented by rigorous training strategies, such as early stopping, cross-validation, and regularization, which enhance model robustness and reduce the risk of overfitting. The model’s superior performance metrics, including an accuracy of 96%, a sensitivity of 94%, a specificity of 96%, and a 40% reduction in FLOP, collectively underscore its viability for integration into clinical practice.
Nevertheless, some limitations remain. The current model is dependent on the quality and diversity of the training datasets. Although efforts were made to include heterogeneous data from multiple institutions, further validation across larger and more globally representative populations is essential. Additionally, while the model demonstrates impressive computational efficiency, the integration of quantum-enhanced layers still requires optimized hardware for deployment in basic clinical settings. Explainability is another area that warrants future exploration. As AI-driven diagnostic tools are adopted into clinical workflows, clinicians must understand and trust model outputs. To enhance transparency and build clinician trust, the proposed model integrates interpretable AI techniques, including SHAP (SHapley Additive exPlanations), Grad-CAM (Gradient-weighted Class Activation Mapping), and LIME (Local Interpretable Model-Agnostic Explanations). These methods were carefully implemented and validated to ensure they accurately reflect the model’s decision-making process. SHAP values were aggregated over 500 test instances to identify consistently influential clinical features. Grad-CAM heatmaps were compared with ophthalmologist-annotated fundus images to assess alignment with established glaucoma indicators, such as optic cup enlargement and RNFL thinning. To further validate interpretability, an ablation study was conducted by removing the top three SHAP-ranked features, which led to a marked decline in classification performance, demonstrating their clinical relevance. Together, these explainability techniques provide meaningful insights into the model’s predictions, promoting greater confidence and facilitating clinical adoption.
Overall, this study provides a solid foundation for the integration of multi-modal, AI-powered diagnostic systems into ophthalmic practice. By addressing key challenges in early detection, computational cost, and data integration, the proposed model stands out as a scalable and clinically relevant solution for improving glaucoma diagnosis worldwide.

5.1. Limitations

The proposed hybrid model shows strong performance, but several limitations exist:
  • First, the generalizability is constrained by dataset diversity; more representative samples are needed across age, ethnicity, and glaucoma subtypes.
  • Second, although quantum-enhanced layers improve efficiency, deployment in low-resource clinics still requires further hardware optimization.
  • Third, the model currently lacks interpretability features essential for clinician trust.
  • Lastly, integration into clinical workflows, including EHR systems, remains to be fully developed.

5.2. Future Work

  • Future research will focus on expanding datasets through multi-center trials and large-scale clinical validation, incorporating longitudinal data, and optimizing models for edge devices using pruning and quantization.
  • Improving model interpretability with SHAP, Grad-CAM, LIME, and uncertainty measures will support clinical trust.
  • Real-world deployment will require integration with EHR systems and the adoption of privacy-preserving methods like federated learning to ensure secure collaboration between institutions and to facilitate clinical adoption.

6. Conclusions

This paper presents a novel multi-modal glaucoma diagnostic framework that combines convolutional neural networks, vision transformers, and quantum-enhanced feature processing. The model was trained and validated on diverse datasets comprising fundus images, OCT scans, and clinical biomarkers, yielding superior classification performance with an accuracy of 96%, a sensitivity of 94%, and an AUC of 0.97. The hybrid model overcomes many limitations of traditional approaches by effectively fusing spatial and global image features while maintaining low computational complexity. The introduction of quantum-inspired components reduces FLOPs by 40%, supporting the model’s feasibility for deployment in real-time clinical settings, including those with limited infrastructure. Despite its advantages, the framework requires further validation on larger and more diverse datasets. Future work should also incorporate interpretable AI strategies and explore integration with electronic health record systems to enable seamless clinical adoption. By combining high diagnostic precision with computational efficiency and clinical scalability, the proposed system represents a significant advancement in the application of artificial intelligence to ophthalmology and sets the stage for future innovations in automated eye disease screening.

Author Contributions

Conceptualization, V.C.G. and J.-P.K.B.N.; methodology, V.C.G. and J.-P.K.B.N.; software, V.C.G.; validation, P.G. and J.O.O.; formal analysis, V.C.G. and M.S.S.T.; investigation, V.C.G.; resources, J.O.O. and J.-P.K.B.N.; data curation, V.C.G. and M.S.S.T.; writing—original draft preparation, V.C.G. and J.-P.K.B.N.; writing—review and editing, P.G., J.-P.K.B.N., J.O.O., and M.S.S.T.; visualization, V.C.G.; supervision, J.-P.K.B.N. and P.G.; project administration, P.G. and J.-P.K.B.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The authors declare that this published article includes all datasets generated or analyzed during this study. Further details about data access or usage conditions are not explicitly provided.

Acknowledgments

J.O.O. has a grant Ref: UI/BD/152067/2021, sponsored by the Fundação para a Ciência e a Tecnologia (FCT)—IUDB/50022/2020 and IUDP/50022/2020 within the Associate Laboratory of Energy, Transports and Aerospace (LAETA) project; LA/P/0079/2020 (https://doi.org/10.54499/LA/P/0079/2020, accessed on 10 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI: Artificial Intelligence
AUC: Area Under the Curve
CCT: Central Corneal Thickness
CNNs: Convolutional Neural Networks
C/D ratio: Cup-to-Disk Ratio
DL: Deep Learning
GRAPE: Glaucoma Resource for Automated Prognostic Evaluation
GON: Glaucomatous Optic Neuropathy
HL: Hybrid Learning
IOP: Intraocular Pressure
ML: Machine Learning
OCT: Optical Coherence Tomography
POAG: Primary Open-Angle Glaucoma
RNFL: Retinal Nerve Fiber Layer
SVMs: Support Vector Machines
ViTs: Vision Transformers
WHO: World Health Organization

References

  1. Chincholi, F.; Koestler, H. Transforming glaucoma diagnosis: Transformers at the forefront. Front. Artif. Intell. 2024, 7, 1324109. [Google Scholar] [CrossRef] [PubMed]
  2. Jisy, N.K.; Ali, M.H.; Senthil, S.; Srinivas, M.B. Early detection of glaucoma: Feature visualization with a deep convolutional network. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 2024, 12, 2350508. [Google Scholar] [CrossRef]
  3. Singh, P.B.; Singh, P.; Dev, H.; Tiwari, A.; Batra, D.; Chaurasia, B.K. Glaucoma classification using light vision transformer. EAI Endorsed Trans. Pervasive Health Technol. 2023, 9. [Google Scholar] [CrossRef]
  4. Huang, X.; Zhang, L.; Lin, Q. GRAPE: A multi-modal dataset of longitudinal follow-up visual field and fundus images for glaucoma management. Sci. Data 2023, 10, 520. [Google Scholar] [CrossRef]
  5. Alayón, S.; Hernández, J.; Fumero, F.J.; Sigut, J.F.; Díaz-Alemán, T. Comparison of the performance of convolutional neural networks and vision transformer-based systems for automated glaucoma detection with eye fundus images. Appl. Sci. 2023, 13, 12722. [Google Scholar] [CrossRef]
  6. Xiong, J.; Yang, Y.; Li, H. Multi-modal machine learning using visual fields and peripapillary circular OCT scans in detection of glaucomatous optic neuropathy. Ophthalmology 2022, 129, 171–180. [Google Scholar] [CrossRef] [PubMed]
  7. Noury, E.; Patel, D.; Singh, K. Deep learning for glaucoma detection and identification of novel diagnostic areas in diverse real-world datasets. Transl. Vis. Sci. Technol. 2022, 11, 11. [Google Scholar] [CrossRef]
  8. Huang, X.; Sun, J.; Gupta, K.; Montesano, G.; Crabb, D.P.; Garway-Heath, D.F.; Brusini, P.; Lanzetta, P.; Oddone, F.; Turpin, A.; et al. Detecting glaucoma from multi-modal data using probabilistic deep learning. Front. Med. 2022, 9, 923096. [Google Scholar] [CrossRef]
  9. Yi, S.; Zhang, G.; Qian, C.; Lu, Y.Q.; Zhong, H.; He, J. A multi-modal classification architecture for the severity diagnosis of glaucoma based on deep learning. Front. Neurosci. 2022, 16, 939472. [Google Scholar] [CrossRef]
  10. Singh, L.K.; Khanna, M. A novel multimodality-based dual fusion integrated approach for efficient and early prediction of glaucoma. Biomed. Signal Process. Control 2022, 73, 103468. [Google Scholar] [CrossRef]
  11. Lim, W.S.; Ho, H.-Y.; Chen, Y.-W.; Lee, C.-K.; Chen, P.-J.; Lai, F.; Jang, J.-S.R.; Ko, M.-L. Use of multi-modal dataset in AI for detecting glaucoma based on fundus photographs assessed with OCT: Focus group study on high prevalence of myopia. BMC Med. Imaging 2022, 22, 206. [Google Scholar] [CrossRef] [PubMed]
  12. Kang, E.Y.C.; Yeung, L.; Lee, Y.-L.; Wu, C.-H.; Peng, S.-Y.; Chen, Y.-P.; Gao, Q.-Z.; Lin, C.; Kuo, C.-F.; Lai, C.-C. A multi-modal imaging-based deep learning model for detecting treatment-requiring retinal vascular diseases: Model development and validation study. JMIR Med. Inform. 2021, 9, e28868. [Google Scholar] [CrossRef] [PubMed]
  13. Mehta, P.; Petersen, C.A.; Wen, J.C.; Banitt, M.R.; Che, P.P.; Bojikian, K.D.; Egan, C.; Lee, S.-I.; Balazinska, M.; Lee, A.Y.; et al. Automated detection of glaucoma with interpretable machine learning using clinical data and multi-modal retinal images. Am. J. Ophthalmol. 2021, 231, 154–169. [Google Scholar] [CrossRef] [PubMed]
  14. Benzebouchi, N.E.; Azizi, N.; Ashour, A.S.; Dey, N.; Sherratt, R.S. Multi-modal classifier fusion with feature cooperation for glaucoma diagnosis. J. Exp. Theor. Artif. Intell. 2019, 31, 841–874. [Google Scholar] [CrossRef]
  15. Zhu, H.; Wang, Y.; Li, X.; Chen, Z. Integrating AI in glaucoma diagnosis and progression detection. Bioengineering 2024, 11, 122. [Google Scholar] [CrossRef]
  16. Guan, F.; Qiao, L.; Ma, X. Vision Transformer-based multi-modal learning for improved glaucoma screening. Biomed. Signal Process. Control 2024, 95, 105899. [Google Scholar]
  17. Lin, K.; Wang, J.; Luo, S. Quantum-inspired deep learning models for medical image analysis: Applications in ophthalmology. Sci. Rep. 2023, 13, 7635. [Google Scholar]
  18. Zhang, H.; Liu, F.; Wu, X. A mathematical approach to feature extraction in deep learning-based glaucoma detection. Mathematics 2023, 11, 1120.
  19. Fan, R.; Alipour, K.; Bowd, C.; Christopher, M.; Brye, N.; Proudfoot, J.A.; Zangwill, L.M. Detecting glaucoma from fundus photographs using deep learning without convolutions: Transformer for improved generalization. Ophthalmol. Sci. 2023, 3, 100233.
  20. Wassel, M.; Abdelmaguid, A.; Adly, N.; Torki, M. Vision transformers-based classification for glaucomatous eye condition. In Proceedings of the 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 5082–5088.
  21. Hussain, R.; Basak, H. UT-Net: Combining U-Net and Transformer for Joint Optic Disc and Cup Segmentation and Glaucoma Detection. arXiv 2023, arXiv:2303.11384.
  22. Lin, H.; Apostolidis, C.; Katsaggelos, A.K. Brighteye: Glaucoma Screening with Color Fundus Photographs Based on Vision Transformer. arXiv 2024, arXiv:2405.00857.
  23. Yurdakul, M.; Uyar, K.; Tasdemir, Ş. MaxGlaViT: A Novel Lightweight Vision Transformer-Based Approach for Early Diagnosis of Glaucoma Stages from Fundus Images. arXiv 2025, arXiv:2502.17154.
  24. Wang, C.; Zhou, X.; Hu, Q.; Li, J.; Pan, Y.; Wu, S. Predicting Glaucoma Progression Using a Multimodal Deep Learning Model Integrating OCT, Visual Field, and Clinical Data. Sci. Rep. 2024, 14, 4582.
  25. Prathap, G.; Kumar, R.; Manoharan, G.; Shanmugam, P.; Jayaraman, M.; Acharya, U.R. Deep Learning Prediction of Glaucoma Progression Using Longitudinal Multimodal Data. IEEE J. Biomed. Health Inform. 2024, 28, 1120–1132.
  26. Saha, S.; Banerjee, S.; Dutta, S.; Sengupta, S. An Interpretable Deep Learning Framework for Glaucoma Detection Using Real-World Multiethnic Retinal Fundus Datasets. Sci. Rep. 2023, 13, 11052.
  27. Zhang, T.; Luo, X.; Wang, M. Quantum-inspired machine learning for medical AI applications. Phys. Med. Biol. 2023, 68, 046501.
  28. Chen, B.; Zhou, R.; Tang, L. Hybrid deep learning with quantum regularization for medical image processing. Sci. Rep. 2024, 14, 10832.
  29. Liu, F.; Wang, D.; Chen, L.; Yang, M.; Zhang, H. Mathematical Framework for Feature Extraction in Glaucoma Diagnosis: A Deep Learning Perspective. Math. Biosci. Eng. 2024, 21, 5507–5521.
  30. Wang, L.; Chen, J.; Zhao, R. Optimizing convolutional and transformer-based models for real-time glaucoma diagnosis: A complexity analysis. Comput. Biol. Med. 2024, 167, 107733.
  31. Kumar, P.; Shukla, M.; Roy, B. Computational efficiency improvements in AI-driven glaucoma detection using quantum-assisted algorithms. J. Comput. Sci. 2023, 72, 103647.
  32. Sun, Y.; Li, H.; Zhou, T. A hybrid deep learning framework for glaucoma detection using fundus and OCT images. Artif. Intell. Med. 2023, 144, 102605.
  33. Mei, X.; Zhang, Y.; Chen, W. Multi-modal deep learning for automated glaucoma detection: A fusion of fundus and perimetry data. Int. J. Med. Inform. 2023, 178, 105133.
  34. Ashtari-Majlan, M.; Dehshibi, M.M.; Masip, D. Spatial-Aware Transformer-GRU Framework for Enhanced Glaucoma Diagnosis from 3D OCT Imaging. arXiv 2024, arXiv:2403.05702.
  35. Ajlouni, N.; Özyavaş, A.; Takaoğlu, M.; Takaoğlu, F.; Ajlouni, F. Medical Image Diagnosis Based on Adaptive Hybrid Quantum CNN. BMC Med. Imaging 2023, 23, 126.
  36. Huang, P.; Lee, C.; Kim, J. Integrating quantum computing into hybrid AI architectures for faster and more efficient glaucoma detection. IEEE Trans. Med. Imaging 2024, 43, 258–270.
  37. Gandhi, V.C.; Thakkar, D.; Milanova, M. Unveiling Alzheimer’s Progression: AI-Driven Models for Classifying Stages of Cognitive Impairment Through Medical Imaging. In Pattern Recognition. ICPR 2024 International Workshops and Challenges, Kolkata, India, 1 December 2024; Palaiahnakote, S., Schuckers, S., Ogier, J.M., Bhattacharya, P., Pal, U., Bhattacharya, S., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15618.
  38. Gandhi, V.C.; Gandhi, P.P. A Survey—Insights of ML and DL in Health Domain. In Proceedings of the 2022 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), Erode, India, 7–9 April 2022; pp. 239–246.
  39. Patel, N.; Srividhya, R.; Linda, P.E.; Rajesh, S.; Gandhi, V.C.; Bhatt, V. Parkinson’s Insight: Leveraging CNN and LSTM Networks for Enhanced Diagnostic Accuracy. In Proceedings of the Advances in Intelligent Systems Research, Jaipur, India, 21–22 February 2025; pp. 157–173.
  40. Gandhi, V.C.; Patel, J.; Prajapati, B.; Digaswala, F.J.; Barot, R.V.; Patel, N. Optimizing Sepsis Care Through CNN-LSTM Models: A Comprehensive Data-Driven Approach to Enhance Early Detection and Management. In Proceedings of the Advances in Intelligent Systems Research, Jaipur, India, 21–22 February 2025; pp. 187–204.
Figure 1. Overview of the proposed hybrid glaucoma diagnosis framework integrating CNN, ViT, and quantum-enhanced layers.
Figure 2. Optimized Glaucoma Diagnosis Pipeline: A Multi-Model Approach.
Figure 3. Relative Contribution of Clinical Biomarkers to Glaucoma Prediction Based on SHAP-Like Interpretability Analysis.
Figure 4. Training and validation accuracy and loss curves for the proposed model.
Figure 5. Classification Performance Matrix and Model Discrimination Performance Chart.
Figure 6. Grad-CAM Heatmaps for True Positive Glaucoma Cases.
Table 1. Comparison of Existing Glaucoma Diagnosis Models with the Proposed Framework.

Study | Architecture | Modalities Used | Dataset(s) | Accuracy (%) | FLOP Reduction | Interpretation Method
Xiong et al. [6] | CNN + Clinical | Fundus + Clinical | Private Hospital | 89.7 | — | None
Huang et al. [4] | CNN | Fundus + OCT | DRISHTI-GS | 91.2 | — | Grad-CAM
This study | CNN + ViT + Quantum | Fundus + OCT + Clinical | GRAPE + Alayón | 96.0 | 40% | SHAP + Grad-CAM + LIME
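For readers who want a concrete picture of the "CNN + ViT + Quantum" row, the sketch below illustrates one way the weighted fusion of fundus, OCT, and clinical features could be wired together. It is a minimal PyTorch illustration under assumed feature dimensions and the 1:1:2 weighting reported in Table 2; the class name, projection sizes, and usage example are placeholders, not the implementation used in this study.

```python
# Minimal sketch of weighted multi-modal feature fusion (illustrative only).
# Encoder dimensions, weights, and names are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class WeightedFusionHead(nn.Module):
    def __init__(self, dim_fundus=512, dim_oct=768, dim_clinical=32,
                 fusion_weights=(1.0, 1.0, 2.0), num_classes=2):
        super().__init__()
        self.alpha, self.beta, self.gamma = fusion_weights  # ratio 1:1:2 (Table 2, row 10)
        common = 256
        # Project each modality into a common embedding space before fusion.
        self.proj_fundus = nn.Linear(dim_fundus, common)
        self.proj_oct = nn.Linear(dim_oct, common)
        self.proj_clinical = nn.Linear(dim_clinical, common)
        self.classifier = nn.Sequential(nn.ReLU(), nn.Dropout(0.3),
                                        nn.Linear(common, num_classes))

    def forward(self, f_fundus, f_oct, f_clinical):
        # Weighted sum of the projected modality embeddings, then classification.
        fused = (self.alpha * self.proj_fundus(f_fundus)
                 + self.beta * self.proj_oct(f_oct)
                 + self.gamma * self.proj_clinical(f_clinical))
        return self.classifier(fused)

head = WeightedFusionHead()
logits = head(torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 32))
```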
Table 2. Summary of Hyperparameter Tuning for the Proposed Hybrid Diagnostic Model.

Sr. No. | Hyperparameter | Value Range Tested | Optimal Value | Justification
1 | Learning Rate | [0.1, 0.01, 0.001, 0.0001] | 0.001 | Balanced convergence speed and training stability
2 | Optimizer | [SGD, Adam, RMSProp] | Adam | Adaptive learning with stable convergence
3 | Batch Size | [16, 32, 64, 128] | 32 | Maintained GPU memory efficiency while ensuring robust gradient updates
4 | Number of Epochs | [30, 50, 100] | 50 | Provided the best trade-off between performance and overfitting
5 | Dropout Rate | [0.1, 0.3, 0.5] | 0.3 | Controlled overfitting without underutilizing the network
6 | Weight Decay (L2 Regularization) | [0.0001, 0.001, 0.01] | 0.001 | Enhanced generalization by penalizing large weights
7 | Learning Rate Scheduler | [StepLR, CosineAnnealing, ReduceLROnPlateau] | ReduceLROnPlateau | Automatically reduced the learning rate when validation loss plateaued
8 | Patch Size (ViT) | [8 × 8, 16 × 16, 32 × 32] | 16 × 16 | Balanced local detail capture and computational complexity in transformer layers
9 | Number of Attention Heads | [4, 8, 12] | 8 | Provided effective feature separation without excess memory usage
10 | Fusion Weight Ratios (α:β:γ) | [1:1:1, 2:1:1, 1:2:1, 1:1:2] | 1:1:2 | Emphasized clinical features for higher real-world prediction accuracy
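The optimal values in Table 2 map directly onto a training configuration. The following is a minimal sketch of how those settings could be combined in PyTorch, assuming a toy model and synthetic data in place of the actual multi-modal pipeline; it is not the study's training code.

```python
# Illustrative training setup using the optimal hyperparameters from Table 2.
# The tiny model and synthetic data are stand-ins, not the paper's pipeline.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU(),
                      nn.Dropout(0.3), nn.Linear(128, 2))                     # dropout 0.3 (row 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)  # rows 1, 2, 6
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.1, patience=5)  # row 7
criterion = nn.CrossEntropyLoss()

data = TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 2, (256,)))
train_loader = DataLoader(data, batch_size=32, shuffle=True)                  # batch size 32 (row 3)
val_loader = DataLoader(data, batch_size=32)

for epoch in range(50):                                                        # 50 epochs (row 4)
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)
    scheduler.step(val_loss)                                                   # plateau-driven LR decay
```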
Table 3. Comparative Analysis of Computational Complexity, FLOP Reduction, and Accuracy Across Different Models.

Model | Computational Complexity | FLOP Reduction (%) | Accuracy (%)
CNN | O(n) | 0% | 89%
ViT | O(n²) | 0% | 94%
Hybrid | O(n) + O(n²) | 30% | 95%
Quantum Hybrid | O(log n) | 40% | 96%
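To make the FLOP Reduction column explicit: the reported percentage for a model is 1 - FLOPs_model / FLOPs_baseline, expressed as a percentage. The snippet below shows this arithmetic with hypothetical FLOP counts and prints the raw growth of the O(n), O(n²), and O(log n) cost terms for a few input sizes; the numbers are illustrative only, not measurements from the models in this study.

```python
# Illustrative arithmetic for the FLOP Reduction column of Table 3.
# The absolute FLOP counts are invented example values, not measured quantities.
import math

def flop_reduction(baseline_flops: float, model_flops: float) -> float:
    """Percentage reduction of a model's FLOPs relative to a baseline."""
    return 100.0 * (1.0 - model_flops / baseline_flops)

baseline = 10e9   # hypothetical FLOPs of the unoptimized hybrid model
optimized = 6e9   # hypothetical FLOPs after quantum-inspired optimization
print(f"FLOP reduction: {flop_reduction(baseline, optimized):.0f}%")  # -> 40%

# Growth of the per-layer cost terms listed in the table: O(n), O(n^2), O(log n).
for n in (256, 1024, 4096):
    print(n, n, n**2, round(math.log2(n), 1))
```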
Table 4. Comparison of Model Performance Metrics.

Metric | Proposed Model | CNN-Based Model | ViT-Based Model | Previous Studies
Accuracy (%) | 96 | 92.5 | 93.8 | 91.2
Sensitivity (%) | 94 | 90.3 | 91.7 | 89.6
Specificity (%) | 96 | 91.9 | 93.1 | 90.4
AUC (Area Under Curve) | 0.97 | 0.93 | 0.94 | 0.91
Computational Efficiency (FLOP Reduction %) | 40% | — | — | —
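The metrics in Table 4 follow the standard definitions (sensitivity = TP/(TP + FN), specificity = TN/(TN + FP)). A minimal scikit-learn sketch on dummy predictions, shown below, illustrates how they would be computed from model outputs; the arrays are placeholders rather than the study's test data.

```python
# Computing the Table 4 metrics from predictions (dummy data, illustrative only).
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                     # ground-truth glaucoma labels
y_score = np.array([0.9, 0.2, 0.8, 0.7, 0.4, 0.1, 0.6, 0.3])    # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)                           # thresholded class labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Sensitivity:", tp / (tp + fn))    # true positive rate
print("Specificity:", tn / (tn + fp))    # true negative rate
print("AUC:        ", roc_auc_score(y_true, y_score))
```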
Table 5. Comparative Analysis of Proposed Model vs. Existing Approaches.

Feature | Proposed Model | CNN-Based Approaches | ViT-Based Approaches | Previous Studies
Multi-Modal Fusion | Yes | No | No | No
Quantum-Enhanced Processing | Yes | No | No | No
Accuracy (%) | 96 | 92.5 | 93.8 | 91.2
AUC (Area Under Curve) | 0.97 | 0.93 | 0.94 | 0.91
Computational Efficiency | Optimized | High Processing Power | High Processing Power | Not Optimized
Generalizability | High | Dataset Bias | Dataset Bias | Limited Scope
Table 6. Computational Efficiency Comparison.

Aspect | Proposed Model | CNN-Based Models | ViT-Based Models
FLOP Reduction | 40% | High Computation | High Computation
Inference Time | Low | High | High
Memory Usage | Optimized | High RAM Usage | High RAM Usage
Deployment Feasibility | Suitable for Edge Devices | Limited to High-End GPUs | Limited to High-End GPUs
Table 7. Clinical Implications and Limitations.

Aspect | Key Insights
Clinical Impact | Enhances early glaucoma detection and diagnosis.
Personalized Treatment | Supports individualized patient management.
Dataset Limitations | Requires multi-center validation for robustness.
Computational Constraints | Further optimization is needed for low-resource settings.
Explainability Needs | Integration of SHAP, Grad-CAM, and LIME methods.
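The "Explainability Needs" row of Table 7 refers to Grad-CAM, SHAP, and LIME. As one concrete example, a bare-bones Grad-CAM over the last convolutional block can be sketched as follows; the ResNet-18 backbone and random input are stand-ins for the trained hybrid model and a preprocessed fundus image, and this is not the visualization code used for Figure 6.

```python
# Minimal Grad-CAM sketch (illustrative only; backbone and input are placeholders).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18  # requires torchvision >= 0.13 for `weights=`

model = resnet18(weights=None).eval()
target_layer = model.layer4                       # last convolutional block
activations, gradients = {}, {}

def fwd_hook(_, __, output):
    activations["value"] = output.detach()

def bwd_hook(_, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()

target_layer.register_forward_hook(fwd_hook)
target_layer.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224)               # stand-in for a preprocessed fundus image
logits = model(image)
logits[0, logits.argmax()].backward()             # gradient of the predicted class score

weights = gradients["value"].mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalized heatmap in [0, 1]
```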
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
