Article

A Hybrid Model for Fluorescein Funduscopy Image Classification by Fusing Multi-Scale Context-Aware Features

1 School of Computer Science and Engineering, Sichuan University of Science and Engineering, Yibin 644000, China
2 Intelligent Perception and Control Key Laboratory of Sichuan Province, Sichuan University of Science and Engineering, Yibin 644000, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Technologies 2025, 13(8), 323; https://doi.org/10.3390/technologies13080323
Submission received: 17 June 2025 / Revised: 11 July 2025 / Accepted: 28 July 2025 / Published: 30 July 2025
(This article belongs to the Special Issue Application of Artificial Intelligence in Medical Image Analysis)

Abstract

With the growing use of deep learning in medical image analysis, automated classification of fundus images is crucial for the early detection of fundus diseases. However, the complexity of fluorescein fundus angiography (FFA) images poses challenges in the accurate identification of lesions. To address these issues, we propose the Enhanced Feature Fusion ConvNeXt (EFF-ConvNeXt) model, a novel architecture combining VGG16 and an enhanced ConvNeXt for FFA image classification. VGG16 is employed to extract edge features, while an improved ConvNeXt incorporates the Context-Aware Feature Fusion (CAFF) strategy to enhance global contextual understanding. CAFF integrates an Improved Global Context (IGC) module with multi-scale feature fusion to jointly capture local and global features. Furthermore, an SKNet module is used in the final stages to adaptively recalibrate channel-wise features. The model demonstrates improved classification accuracy and robustness, achieving 92.50% accuracy and 92.30% F1 score on the APTOS2023 dataset—surpassing the baseline ConvNeXt-T by 3.12% in accuracy and 4.01% in F1 score. These results highlight the model’s ability to better recognize complex disease features, providing significant support for more accurate diagnosis of fundus diseases.

1. Introduction

Chronic fundus diseases are common and serious diseases in ophthalmology, which in advanced stages can lead to severe loss of vision or even complete blindness. Studies report that approximately 196 million people worldwide are affected by age-related macular degeneration (AMD), with cases expected to rise to 288 million by 2040 [1]. Diabetic retinopathy (DR) is a major cause of blindness [2], highlighting the need for early detection and intervention. Traditional fundus disease diagnosis relies on expert experience, which is time-consuming and limited by expertise. Manual analysis requires extensive knowledge and is prone to observer variability, increasing the risk of misdiagnosis. Thus, the demand for efficient and accurate automated diagnostic tools is growing [3].
Artificial intelligence (AI), especially deep learning, is advancing rapidly and is increasingly used in ophthalmic disease screening and classification [4,5]. However, most studies focus on a single disease or a few diseases, such as DR [6], AMD [7], and glaucoma [8]. Such approaches struggle to cope with the diversity and complexity of retinal diseases, especially when multiple pathologies coexist. For example, uveitis, a condition prevalent among young adults, has diverse types and complex etiologies, yet research on its automated classification remains limited [9,10].
Furthermore, research progress varies across imaging modalities. In fundus image classification, attention mechanisms and multi-scale feature fusion strategies are widely used and have been shown to improve classification performance effectively. Li et al. [11] proposed a neural network model based on attention mechanisms and feature fusion for multi-label fundus image classification. The model achieves outstanding performance on the ODIR dataset, with an accuracy of 94.23% and a recall of 99.23%, offering a novel approach to multi-label fundus disease diagnosis and demonstrating the benefits of attention mechanisms and multi-scale feature fusion. Yan et al. [12] proposed Fundus-DANet, a multi-disease recognition model based on dilated convolution and fused attention mechanisms; experiments on the OIA-ODIR dataset achieved an accuracy of 93%. Their results show that fused attention mechanisms help address challenges such as morphological feature extraction and lesion variability.
In retinal vascular imaging, FFA images are the gold standard for visualizing the retinal vascular system [13,14]. However, studies on multi-lesion categorization based on FFA images remain scarce, and most focus on DR. Pan et al. [15] used DenseNet with transfer learning for four-type DR classification; the proposed model achieved area under the curve (AUC) values ranging from 0.8703 to 0.9647 on a cohort of 4067 FFA images acquired from the Second Affiliated Hospital of Zhejiang University. Gao et al. [16,17] first proposed a VGG16-based framework for automatic detection of five types of DR lesions and subsequently developed a DR grading system using VGG16 with a nine-grid input strategy; however, the restriction to DR limits generalizability and clinical value. Although the FFA-Lens developed by Veena K.M et al. [18] based on YOLO achieved high precision and recall in detecting 25 lesions, their model relied on a traditional object detection framework and still has room for improvement. Lesions in FFA images vary in size and shape, and traditional CNNs, limited by fixed receptive fields, struggle to capture local details and global structural information simultaneously [19]. Multi-scale feature fusion strategies can alleviate this challenge. Meanwhile, FFA image lesions are closely associated with vascular structures, and disease progression often involves long-range interactions [20]. Accurate classification therefore requires models to capture long-range dependencies and contextual relationships, and attention mechanisms enhance classification performance by capturing vascular details in FFA images [21].
Therefore, it is essential to develop a new deep learning model for FFA images that can classify multiple lesion types, fuse features across scales, and capture global contextual information. This study proposes the EFF-ConvNeXt network to address this gap, using improved attention mechanisms and multi-scale feature fusion strategies to accurately classify 23 common fundus lesions. Experimental results demonstrate the model's superior performance across multiple evaluation metrics, highlighting its potential for fundus disease classification. The contributions of this study are as follows:
  • Our model classifies 23 common retinal lesions, unlike traditional methods that group retinal diseases into a limited number of categories. This finer-grained classification better reflects the complexity and diversity of diseases in clinical practice.
  • We propose the CAFF module to enhance multi-scale feature fusion in FFA images. It addresses a major limitation of current methods by effectively capturing both shallow and deep features, making lesions and fine details easier to identify at all scales.
  • We propose the IGC module to enhance global context extraction. With IGC, the model focuses better on important regions, which is critical for detecting small lesions in FFA images. This approach significantly improves classification accuracy and robustness.

2. Model Architecture Design

This study proposes a novel fundus lesion FFA image classification model, named EFF-ConvNeXt. Built upon the ConvNeXt model, it integrates the VGG16 network, attention mechanisms, and multi-scale feature fusion. The overall network architecture is shown in Figure 1. The model consists of two main modules: the VGG16 network, which extracts low-level features such as edges and textures, and the improved ConvNeXt (im-ConvNeXt) module, which captures high-level features, including semantic information and global context. In im-ConvNeXt, the IGC Block enhances the positional encoding of single-scale features, the CAFF module fuses shallow and deep features at each stage to establish cross-stage correlations, and SKNet is incorporated to further enhance the network’s ability to adapt to multi-scale feature fusion.

2.1. The VGG16 Network

The VGG16 [22] is a classical CNN architecture proposed by the Visual Geometry Group at the University of Oxford in 2014. It achieved remarkable results in the ImageNet challenge and has been widely used in image classification and feature extraction. As shown in Figure 1, VGG16 primarily consists of 3 × 3 convolutional filters. These small filters capture local image features while maintaining a large receptive field, avoiding excessive computational complexity. VGG16 has 13 convolutional layers and 3 fully connected layers, enabling the extraction of complex image features, and it has become a cornerstone for various computer vision tasks and a powerful pre-trained model for high-quality feature extraction. Leveraging this strength, we employ VGG16 for shallow feature extraction from FFA images; combined with the im-ConvNeXt module for deep semantic feature extraction, it substantially improves classification performance.
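As a minimal illustration (not the authors' exact pipeline), the shallow feature extraction step can be sketched with torchvision's pretrained VGG16, keeping only the early convolutional blocks that respond to edges and textures; the layer index used below is illustrative.

```python
# Minimal sketch: torchvision's pretrained VGG16 as a shallow edge/texture extractor.
# The slice index (first 10 layers) is an illustrative assumption.
import torch
from torchvision import models

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)  # downloads ImageNet weights
edge_extractor = vgg16.features.eval()     # keep the convolutional backbone only

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)        # one FFA image, resized to 224 x 224
    shallow_feats = edge_extractor[:10](x) # early blocks capture edges and textures
print(shallow_feats.shape)                 # torch.Size([1, 128, 56, 56])
```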

2.2. The Improved ConvNeXt Network

Liu et al. [23] introduced the ConvNeXt model in 2022, which integrates the strengths of ResNet and the Vision Transformer (ViT) [24]. The model outperformed others on the ImageNet-1K benchmark, and its design was inspired by the success of the Swin Transformer (SwinT) [25]. The authors [23] explored the limitations of pure ConvNets and proposed several key components for constructing and improving the ConvNeXt network. Inspired by the block stacking ratio of 1:1:3:1 in SwinT-tiny, they adjusted the stacking ratio of the ConvNeXt model to 3:3:9:3. Traditional 3 × 3 convolutions are replaced by grouped (depthwise) convolutions, and an inverted bottleneck, as in MobileNetV2 [26], improves feature representation. Additionally, ConvNeXt substitutes ReLU with GELU and replaces batch normalization (BN) with layer normalization (LN). These modifications optimize both computational efficiency and accuracy, resulting in the final ConvNeXt architecture.
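For reference, a simplified ConvNeXt-style block reflecting the design choices described above (depthwise convolution, LayerNorm, inverted bottleneck with GELU) can be sketched as follows; layer scale and stochastic depth from the original implementation are omitted, and the channel width is illustrative.

```python
# Minimal sketch of one ConvNeXt-style block (7x7 depthwise conv, LN, inverted bottleneck).
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                   # applied in channel-last layout
        self.pwconv1 = nn.Linear(dim, expansion * dim)  # inverted bottleneck: expand
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)  # project back

    def forward(self, x):                 # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)         # (N, H, W, C) for LayerNorm / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x               # residual connection

block = ConvNeXtBlock(96)
print(block(torch.randn(2, 96, 56, 56)).shape)   # torch.Size([2, 96, 56, 56])
```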
The ConvNeXt offers models of varying scales. In this study, we selected the smaller ConvNeXt-T model for more efficient deployment of FFA in clinical applications. As illustrated in Figure 1, the improved ConvNeXt module incorporates two key enhancements. First, a CAFF module is introduced between adjacent stages to fuse multi-scale contextual information from shallow and deep layers. Second, the model’s ability to adapt to multi-scale fused features is enhanced by the addition of an SKNet module prior to classification.

2.2.1. The Context-Aware Feature Fusion Module

In the ConvNeXt model, the output of each stage is directly downsampled and passed to the next stage, which weakens feature interactions and degrades lesion classification ability. To enhance feature utilization, we fuse the previous stage's output with its downsampled features and feed them into the next stage to improve lesion representation. Figure 2 illustrates the structure of the CAFF module.
The specific process of the CAFF module is as follows: First, shallow features pass through the IGC Block to capture global context and fine-grained details, generating richer deep features. Then, average pooling is applied to halve the width and height of the feature map, followed by a 1 × 1 convolution to double the number of channels. Finally, an add operation fuses the deep features with the processed shallow features. The specific calculation process is shown in Equation (1):
$X = \mathrm{LN}\big(\mathrm{Conv2D}(\mathrm{avgpool}(Y))\big) + X_d$  (1)
where $X$ denotes the CAFF module output, $Y$ represents the IGC Block output features, $\mathrm{avgpool}$ is average pooling, $\mathrm{Conv2D}$ is a 1 × 1 convolution, $\mathrm{LN}$ is layer normalization, and $X_d$ signifies the deep features.
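A minimal PyTorch sketch of the fusion step in Equation (1) is given below, assuming the IGC block has already produced Y; the single-group GroupNorm stands in for channel-first layer normalization, and the channel sizes are illustrative rather than the authors' configuration.

```python
# Minimal sketch of the CAFF fusion in Equation (1): avgpool -> 1x1 conv -> norm -> add.
import torch
import torch.nn as nn

class CAFF(nn.Module):
    def __init__(self, in_ch: int):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)        # halve H and W
        self.conv = nn.Conv2d(in_ch, 2 * in_ch, kernel_size=1)   # double the channels
        self.norm = nn.GroupNorm(1, 2 * in_ch)  # stand-in for channel-first LayerNorm

    def forward(self, y, x_deep):
        # y:      IGC output of the previous stage, (N, C, H, W)
        # x_deep: downsampled deep features of the next stage, (N, 2C, H/2, W/2)
        x = self.norm(self.conv(self.pool(y)))
        return x + x_deep

caff = CAFF(96)
y = torch.randn(2, 96, 56, 56)
x_d = torch.randn(2, 192, 28, 28)
print(caff(y, x_d).shape)   # torch.Size([2, 192, 28, 28])
```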

2.2.2. The Improved Global Context Network Module

Medical images are known to be highly complex, and global contextual feature modeling is widely recognized as crucial in medical image vision tasks. In 2019, Cao et al. [27] proposed GCNet based on Non-Local Neural Networks (NLNet) [28] and Squeeze-and-Excitation Networks (SENet) [29]. The GC block primarily comprises a Context Modeling module and a Transform module, with its architecture illustrated in Figure 3a. The Context Modeling module captures global context through adaptive weight allocation, enabling noise suppression and selection of critical features; it addresses the limitations of traditional convolution, enhancing model performance while maintaining computational efficiency. The Transform module improves feature representation by enhancing convolutional features to balance local details with global context, making it suitable for integration into existing convolutional neural networks. Although the GC block captures global context well, it lacks the ability to model the local detail required by fine-grained classification tasks such as FFA image classification.
The Mamba [30] model handles long-range and multi-modal data effectively, capturing subtle dependencies and relationships between pieces of information. Inspired by the Mamba model, we propose the IGC module in this study; the architecture of the IGC block is shown in Figure 3b. We integrate the context modeling and transformation components of the GC block, which reduces the feature-processing redundancy of the GC block and accelerates model convergence and optimization. Subsequently, we replace LN with BN, which performs better in convolutional neural networks. After attention pooling modulation, the features are weighted or re-emphasized, and these weighted features are then spatially reorganized through an additional Conv2D layer. With these improvements, the IGC block more effectively captures the relationships and dependencies among FFA image details, providing finer and more accurate information. The IGC block is implemented as follows:
The input feature is $X \in \mathbb{R}^{C \times H \times W}$, where $C$ represents the number of channels, and $H$ and $W$ are the height and width.
The feature extraction path applies multiple convolutions to the input $X$, starting with a 1 × 1 convolution, followed by BN and ReLU, then another 1 × 1 convolution:
$X_1 = W_{v1} X$  (2)
$X_2 = \mathrm{ReLU}(\mathrm{BN}(X_1))$  (3)
$X_3 = W_{v2} X_2$  (4)
where $W_{v1}$ is the weight of the first 1 × 1 convolution, $X_1$ is the intermediate result, BN refers to batch normalization, ReLU is the activation function, $W_{v2}$ is the weight of the second 1 × 1 convolution, and $X_3$ is the final result.
The global context path applies a 1 × 1 convolution to the input $X$ to obtain global features and then uses softmax to generate global context weights:
$G = W_k X$  (5)
$G_w = \mathrm{softmax}(G) = \sum_{j=1}^{N_p} \frac{e^{W_k X_j}}{\sum_{m=1}^{N_p} e^{W_k X_m}}$  (6)
where $G$ is the result of applying the weight matrix $W_k$ to the input features $X$, and $X_j$ represents the feature vector at position $j$. The term $\frac{e^{W_k X_j}}{\sum_{m=1}^{N_p} e^{W_k X_m}}$ weights the features at each position using softmax normalization, and $\sum_{j=1}^{N_p}$ denotes the global range over which features are weighted and summed to produce a global feature representation for context modeling.
The features obtained from the feature extraction path are element-wise multiplied by the global context weights, yielding a weighted context feature representation, as shown in Equation (7):
$F_g = X_3 \otimes G_w = \sum_{j=1}^{N_p} \frac{e^{W_k X_j}}{\sum_{m=1}^{N_p} e^{W_k X_m}} \cdot W_{v2}\,\mathrm{ReLU}\big(\mathrm{BN}(W_{v1} X_j)\big)$  (7)
where the symbol $\otimes$ denotes element-wise multiplication (the Hadamard product), i.e., multiplication of two matrices or vectors at corresponding positions.
Finally, a 1 × 1 convolution integrates the weighted features, and the original input $X$ is added to the global context features $F_c$ to obtain the final output:
$F_c = W_{v3} F_g$  (8)
$Y = X + F_c$  (9)
where, in Equation (8), $W_{v3}$ is the third weight matrix, and in Equation (9), $X$ is the initial input. The steps are integrated into a single equation:
$Y_i = W_{v3} \sum_{j=1}^{N_p} \frac{e^{W_k X_j}}{\sum_{m=1}^{N_p} e^{W_k X_m}} \cdot W_{v2}\,\mathrm{ReLU}\big(\mathrm{BN}(W_{v1} X_j)\big) + X_i$  (10)
In Equation (10), $X_i$ represents the pixel at the current position on the feature map, with $j$ and $m$ enumerating all possible positions; $X_j$ and $X_m$ are the pixels at the corresponding positions. $W_k$, $W_{v1}$, $W_{v2}$, and $W_{v3}$ are linear transformation matrices, and $N_p = H \times W$ is the number of positions. ReLU is the activation function, and BN refers to batch normalization.
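The following is a minimal sketch of an IGC-style block implementing Equation (10): softmax attention pooling over spatial positions weights a conv-BN-ReLU-conv transform of the input, and the pooled context is projected and added back to the input. The bottleneck reduction ratio and channel width are assumptions for illustration.

```python
# Minimal sketch of the IGC block: attention-pooled global context with a BN/ReLU transform.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IGCBlock(nn.Module):
    def __init__(self, ch: int, reduction: int = 4):
        super().__init__()
        self.wk = nn.Conv2d(ch, 1, kernel_size=1)                 # attention logits (W_k)
        self.wv1 = nn.Conv2d(ch, ch // reduction, kernel_size=1)  # W_v1
        self.bn = nn.BatchNorm2d(ch // reduction)
        self.wv2 = nn.Conv2d(ch // reduction, ch // reduction, kernel_size=1)  # W_v2
        self.wv3 = nn.Conv2d(ch // reduction, ch, kernel_size=1)  # W_v3

    def forward(self, x):                                          # x: (N, C, H, W)
        n, c, h, w = x.shape
        attn = F.softmax(self.wk(x).view(n, 1, h * w), dim=-1)     # softmax over positions
        v = self.wv2(F.relu(self.bn(self.wv1(x))))                 # transform path
        v = v.view(n, -1, h * w)                                   # (N, C/r, HW)
        context = torch.bmm(v, attn.transpose(1, 2)).view(n, -1, 1, 1)  # pooled context
        return x + self.wv3(context)                               # broadcast residual add

igc = IGCBlock(96)
print(igc(torch.randn(2, 96, 56, 56)).shape)   # torch.Size([2, 96, 56, 56])
```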

2.2.3. The SKNet Attention Module

Fundus diseases exhibit complex heterogeneity, with significant variations in lesion location, number, size, and depth. These complexities challenge CNNs in retaining critical local information, leading to performance degradation. To mitigate this issue, we introduce the SKNet attention mechanism [31], which adaptively selects the convolutional kernel size and dynamically adjusts the receptive field, enabling it to capture features at different scales effectively. The SKNet architecture comprises three key operations, Split, Fuse, and Select, responsible for splitting, fusing, and selecting convolution kernels at different scales. The structure of SKNet is shown in Figure 4. This adaptive mechanism enhances the network's ability to capture and integrate multi-resolution features; integrating it into the EFF-ConvNeXt model therefore improves the accuracy of lesion detection and classification.
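A compact two-branch sketch of the Split, Fuse, and Select mechanism is shown below; the branch kernels (3 × 3 and a dilated 3 × 3 acting as a 5 × 5) and the reduction ratio are illustrative choices, not the exact SKNet configuration used in EFF-ConvNeXt.

```python
# Minimal two-branch sketch of SKNet's Split/Fuse/Select operations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SKUnit(nn.Module):
    def __init__(self, ch: int, reduction: int = 8):
        super().__init__()
        # Split: two branches with different receptive fields
        self.branch3 = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.branch5 = nn.Conv2d(ch, ch, 3, padding=2, dilation=2, groups=ch)  # 5x5 equivalent
        # Fuse: squeeze global information into a compact descriptor
        self.fc = nn.Linear(ch, ch // reduction)
        # Select: per-branch attention logits
        self.fcs = nn.ModuleList([nn.Linear(ch // reduction, ch) for _ in range(2)])

    def forward(self, x):                                                  # (N, C, H, W)
        feats = torch.stack([self.branch3(x), self.branch5(x)], dim=1)     # (N, 2, C, H, W)
        u = feats.sum(dim=1)                                               # fuse branches
        s = u.mean(dim=(2, 3))                                             # global avg pooling
        z = F.relu(self.fc(s))                                             # (N, C/r)
        logits = torch.stack([fc(z) for fc in self.fcs], dim=1)            # (N, 2, C)
        weights = F.softmax(logits, dim=1).unsqueeze(-1).unsqueeze(-1)     # across branches
        return (feats * weights).sum(dim=1)                                # select

sk = SKUnit(768)
print(sk(torch.randn(2, 768, 7, 7)).shape)   # torch.Size([2, 768, 7, 7])
```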

2.3. Loss Function Optimization

The cross-entropy loss function [32] is commonly used in image classification; its formula is given in Equation (11):
$L_{CE} = -\sum_{i=1}^{m} p(x_i) \log q(x_i)$  (11)
where $m$ is the batch size, $i$ denotes the $i$-th FFA image, $p(x_i)$ is the true probability distribution over the 23 fundus disease types, and $q(x_i)$ is the predicted probability distribution.
The APTOS2023 dataset exhibits severe class imbalance, which can bias the model toward majority classes and neglect minority classes, limiting its practical performance. The traditional cross-entropy loss function struggles to optimize the model effectively under this imbalance. To improve model performance, we introduce Focal Loss (FL) [33] to address the class imbalance issue; its mathematical expression is given in Equation (12). FL increases the weight of difficult-to-classify samples so that the model pays more attention to challenging cases, improving accuracy and generalization ability. This strategy also optimizes the intra-class distance, further improving FFA image classification performance.
$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$  (12)
where, in Equation (12), $p_t$ is the predicted probability for class $t$, while $\alpha_t$ and $\gamma$ are hyperparameters: $\alpha_t$ balances class ratios, and $\gamma$ reduces the weight of easy samples. We set $\alpha = 0.25$ and $\gamma = 1.0$. To focus on both easy and hard examples, we combine cross-entropy loss with Focal Loss, as shown in Equation (13):
$L_{total} = \lambda L_{CE} + (1 - \lambda)\,FL$  (13)
where $L_{CE}$ denotes the cross-entropy loss, $FL$ represents the Focal Loss, and $\lambda$ is a weight coefficient ranging from 0 to 1. The model achieves optimal performance at $\lambda = 0.5$.
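A minimal sketch of the combined objective in Equation (13), with the reported settings α = 0.25, γ = 1.0, and λ = 0.5, might look as follows; the per-sample focal term is derived from the standard cross-entropy, and the scalar alpha (rather than a per-class α_t) is a simplification.

```python
# Minimal sketch of the combined cross-entropy + focal loss (Equation (13)).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=1.0):
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-sample -log(p_t)
    p_t = torch.exp(-ce)                                       # recover p_t
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

def combined_loss(logits, targets, lam=0.5):
    return lam * F.cross_entropy(logits, targets) + (1.0 - lam) * focal_loss(logits, targets)

logits = torch.randn(8, 23)                 # batch of 8, 23 lesion classes
targets = torch.randint(0, 23, (8,))
print(combined_loss(logits, targets))
```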

3. Experimental Results and Analysis

3.1. Dataset and Data Preprocessing

The dataset was sourced from the 2023 Asia Pacific Tele-Ophthalmology Society (APTOS) Big Data Competition. The dataset utilized in this study was retrospectively collected from real-world clinical practice, comprising 55,361 multi-modal fundus angiography images from both eyes of 3179 distinct patients. In order to minimize the memory footprint of the model, we conducted experiments using only FFA images. The contest only provided annotated data for the training set, so we used 33,559 images from 1921 patients in the training set in this study. The APTOS dataset (APTOS2023) includes 23 types of lesion classifications. Figure 5 illustrates the data distribution used in our study. The number of images per class ranges from 55 to 15,231, indicating a significant class imbalance.
According to Figure 5, the number of images in the “macular neovascularization” category is significantly higher than in other categories, while the “central retinal vein occlusion” and “central retinal artery occlusion” categories have relatively few images. This class imbalance may lead to substantial discrepancies in classification accuracy across categories. We implemented data balancing and preprocessing strategies to cope with this issue. The data balancing strategy adjusts the sampling frequency for each category, assigning higher sampling probabilities to underrepresented classes and lower probabilities to overrepresented ones. The sampling rates, determined according to Equation (14), are shown in Table 1.
$p_k = \frac{1/n_k}{\sum_{j=1}^{K} (1/n_j)}$  (14)
where $p_k$ is the normalized sampling probability, $n_k$ is the number of samples in class $k$, and $K$ is the total number of classes.
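In practice, the class-balanced sampling of Equation (14) can be realized with a weighted sampler; the sketch below uses toy labels and PyTorch's WeightedRandomSampler, which is an assumed implementation choice rather than the authors' code.

```python
# Minimal sketch of inverse-frequency sampling (Equation (14)) with a weighted sampler.
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 1, 1, 2])                          # toy class labels per image
class_counts = torch.bincount(labels).float()                       # n_k
class_probs = (1.0 / class_counts) / (1.0 / class_counts).sum()     # p_k from Eq. (14)

sample_weights = class_probs[labels]                                 # per-image sampling weight
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
# Pass `sampler=sampler` to a DataLoader to realize the balancing strategy.
```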
For image preprocessing, we used five data augmentation techniques: random cropping, random flipping, Gaussian blur, brightness and contrast adjustment, and sharpness adjustment. Each technique was applied to the original images with a random probability. The effect of data augmentation is shown in Figure 6; the augmented images effectively retain the detailed features of the original images. Additionally, we divided the dataset into training, validation, and test sets in a ratio of 6:2:2. These subsets were randomly partitioned on a patient-wise basis and are mutually exclusive, ensuring their independence and preventing any overlap of images from the same patient across different subsets.
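A patient-wise 6:2:2 split of the kind described above can be sketched with scikit-learn's GroupShuffleSplit, which guarantees that all images from one patient stay in a single subset; the patient IDs below are toy values and the tool choice is an assumption.

```python
# Minimal sketch of a patient-wise 6:2:2 split using group-aware splitting.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

image_ids = np.arange(10)
patient_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])   # toy patient ID per image

# Carve out 20% of patients for the test set, then 25% of the remainder for validation.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
trainval_idx, test_idx = next(gss.split(image_ids, groups=patient_ids))
gss2 = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_loc, val_loc = next(gss2.split(trainval_idx, groups=patient_ids[trainval_idx]))
train_idx, val_idx = trainval_idx[train_loc], trainval_idx[val_loc]
print(len(train_idx), len(val_idx), len(test_idx))        # 6 2 2
```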

3.2. Implementation Details and Performance Metrics

All experiments were conducted on a workstation equipped with an NVIDIA Quadro RTX 5000 GPU with 15 GB VRAM and an Intel(R) Xeon(R) Gold 521 processor. The software environment comprises the Windows 10 operating system and Python 3.9.0. We employed transfer learning by initializing the main model with pretrained weights from the ImageNet dataset and resized the FFA images to a resolution of 224 × 224. Each model was trained using the AdamW optimizer and the combined cross-entropy-focal loss function. The initial learning rate was set to $5 \times 10^{-4}$, decaying to a final learning rate of $1 \times 10^{-8}$, and the weight decay parameter (L2 regularization) was set to $1 \times 10^{-4}$. To ensure stable evaluation results, we performed five-fold cross-validation exclusively on the training set, while keeping the validation and test sets completely independent throughout the process. After several rounds of hyperparameter tuning, the batch size was set to 64 and the number of training epochs to 200.
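The optimizer settings reported above can be sketched as follows; the cosine decay schedule used to reach the final learning rate of 1 × 10^-8 is an assumption, and the model object is a placeholder.

```python
# Minimal sketch of the training configuration (AdamW, lr 5e-4 -> 1e-8, weight decay 1e-4,
# batch size 64, 200 epochs); the decay schedule type is assumed to be cosine.
import torch
import torch.nn as nn

model = nn.Linear(10, 23)                     # placeholder for EFF-ConvNeXt
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-8)

for epoch in range(200):
    # ... one pass over the training DataLoader (batch size 64) would go here ...
    scheduler.step()
```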
Several performance metrics were considered for model evaluation, including accuracy ($Acc$), precision ($Pr$), recall ($Re$), and F1 score ($F1$). The formulas are provided in Equations (15)–(18). Macro-averaging was used for model evaluation: after calculating the metrics for each class, the arithmetic mean of these values was computed as the model performance indicator.
$Acc = \frac{TP + TN}{TP + TN + FP + FN}$  (15)
$Pr = \frac{TP}{TP + FP}$  (16)
$Re = \frac{TP}{TP + FN}$  (17)
$F1 = \frac{2 \times Pr \times Re}{Pr + Re}$  (18)
where $TP$ is the count of true positives (both label and prediction are positive), $FN$ is false negatives (positive label, negative prediction), $FP$ is false positives (negative label, positive prediction), and $TN$ is true negatives (both label and prediction are negative).
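Macro-averaged metrics of this form can be computed with scikit-learn, as in the short sketch below (toy labels for illustration):

```python
# Minimal sketch of macro-averaged metrics (Equations (15)-(18)) with scikit-learn.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]   # toy ground-truth labels
y_pred = [0, 1, 2, 1, 1, 0]   # toy predictions

acc = accuracy_score(y_true, y_pred)
pr, re, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
print(f"Acc={acc:.4f}  Pr={pr:.4f}  Re={re:.4f}  F1={f1:.4f}")
```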

3.3. Experimental Results

All experiments were conducted on the APTOS2023 dataset under identical conditions. Each model was evaluated through three independent trials, with the average of the results taken as the final outcome to ensure stability.

3.3.1. Comparison with SOTA Baseline Models

The EFF-ConvNeXt model achieves 92.5% accuracy, 92.46% precision, 92.56% recall, and 92.30% F1 score on the APTOS2023 dataset. It also demonstrated an inference time of 0.007 s per image, balancing high classification performance with computational efficiency. Table 2 compares EFF-ConvNeXt with SOTA baseline methods for FFA image classification. Among all models, VGG16 shows relatively lower performance, while ConvNeXt and Swin Transformer perform better, achieving accuracies of 89.38% and 89.36%, respectively. Our EFF-ConvNeXt model outperforms ConvNeXt by 3.12% in accuracy, with a slight increase in inference time of 0.002 s.
Table 3 shows the classification performance of EFF-ConvNeXt across different lesion categories on the APTOS2023 dataset. The model achieves high precision and recall in most classes. Notably, for common and well-represented categories such as BRVO, CRVO, CME, and PDR, the model achieves high precision, recall, and F1 scores, all exceeding 95%. However, relatively lower performance is observed in challenging categories such as RPED and uveitis, where the F1 scores drop to 47.52% and 71.92%, respectively. This may be attributed to the limited sample size, ambiguous visual features, or inter-class similarities. These findings highlight the model’s overall effectiveness while also suggesting that future improvements could focus on enhancing performance for underrepresented or complex lesion types.
To intuitively compare model performance, we plotted the accuracy and loss curves of EFF-ConvNeXt, ConvNeXt, Swin-T, and ResNet50 (as shown in Figure 7). All models achieved over 95% accuracy on the training set, with most converging after 100 epochs. ResNet50 exhibited significant fluctuations in accuracy and eventually stabilized at a relatively low value of 78.64%. In contrast, EFF-ConvNeXt achieved a stable accuracy of 92.5%, outperforming ConvNeXt by 3.12% with reduced variance. Regarding loss, ResNet50 performed noticeably worse on both the training and validation sets. EFF-ConvNeXt demonstrated superior stability and reached the lowest loss value of 0.384, highlighting its overall performance advantage.
To evaluate performance improvements, we conducted ten independent runs per model and plotted box plots for key metrics. As shown in Figure 8, the proposed EFF-ConvNeXt achieved the highest median values and lowest variance across all metrics, showing superior performance and robustness. Compared to ConvNeXt, it consistently improved by over 3%, with statistically significant gains (p < 0.01). In contrast, conventional models such as VGG19 and ViT exhibited lower performance and higher variability, demonstrating the effectiveness of our enhanced multi-scale and global context fusion design.

3.3.2. Visualization Analysis

We applied Grad-CAM++ [34] to visualize model decisions on the APTOS dataset. The activation maps highlight key regions by capturing response intensity in the last convolutional layer. Figure 9 compares activation maps of ResNet, Swin Transformer, ConvNeXt, and EFF-ConvNeXt on the same image. The ResNet50 captures local features but struggles with complex lesions. SwinT and ConvNeXt extract global features more effectively. In contrast, EFF-ConvNeXt achieves precise lesion localization with stronger responses, demonstrating the effectiveness of CAFF in enhancing feature extraction.
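As an illustration, Grad-CAM++ heat maps of the kind shown in Figure 9 can be generated with the third-party grad-cam package (jacobgil/pytorch-grad-cam); the backbone, target layer, and package choice below are assumptions, not the authors' exact setup.

```python
# Minimal sketch of Grad-CAM++ visualization with the third-party `grad-cam` package.
import torch
from torchvision import models
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = models.resnet50(weights=None).eval()        # stand-in backbone for illustration
target_layers = [model.layer4[-1]]                  # last convolutional block
cam = GradCAMPlusPlus(model=model, target_layers=target_layers)

input_tensor = torch.randn(1, 3, 224, 224)          # a preprocessed FFA image
heatmap = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(0)])
print(heatmap.shape)                                # (1, 224, 224) activation map
```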
We plotted the macro-averaged ROC curve (Figure 10) and the confusion matrix (Figure 11) for EFF-ConvNeXt. The concentration of high values along the diagonal of Figure 11 shows that the model classifies most types of fundus disease accurately, confirming its effectiveness for complex FFA image classification. Figure 10 shows that the macro AUC reaches 0.9859, approaching 1, indicating excellent discriminative ability and stability. The model also maintains a high true positive rate (TPR) and a low false positive rate (FPR) across thresholds, demonstrating its reliability for fundus disease classification.

3.3.3. Comparison with Different Pretrained Models

Table 4 shows that VGG16 alone performs poorly on the APTOS2023 dataset, yet as a pre-trained feature extractor it outperforms ResNet and SwinT when paired with im-ConvNeXt, whereas SwinT+im-ConvNeXt even exhibits a performance drop. We attribute the limited improvement of ResNet+im-ConvNeXt and the decline of SwinT+im-ConvNeXt to functional redundancy with the im-ConvNeXt backbone. In contrast, VGG16's simple convolutional stacking effectively captures edges and textures, complementing ConvNeXt's semantic and global context modelling and leading to better overall performance.

3.3.4. Comparison with Previous Attention Mechanisms

In addition, we conducted comparative experiments by replacing the IGC module in the CAFF module with other attention mechanisms to validate the superiority of the IGC block. Each model was evaluated through three independent trials, with the average results reported. The best results are highlighted in bold in Table 5. As shown in Table 5, all models exhibit comparable inference times, while ConvNeXt+IGC outperforms all other models across the remaining evaluation metrics. Compared to the baseline model, the average Acc improves by 1.54%, and the average F1 score increases by 1.37%. ConvNeXt+SE and ConvNeXt+ECA achieve strong performance, with average accuracies of 91.56% and 91.84%, respectively, but still lag behind ConvNeXt+IGC. In contrast, ConvNeXt+CBAM and ConvNeXt+GC show weaker results, with accuracies of 87.83% and 89.43%. The IGC block improves accuracy by 3.07% and the F1 score by 2.84% over the GC block. These results demonstrate that the IGC block more effectively focuses on key areas of FFA images, accurately capturing lesion features.

3.3.5. Ablation Studies

A series of ablation experiments were conducted to analyse the impact of gradually adding or removing each module on the performance of EFF-ConvNeXt. Table 6 reports the results of the APTOS2023 dataset. The baseline model achieved an average accuracy of only 89.38%. After incorporating VGG16 into the backbone network, the average accuracy improved to 90.10%, indicating that deep feature extraction offers significant advantages in fundus image classification. When the CAFF module was further integrated, the average accuracy reached 91.81%, with a recall of 91.26%, further highlighting the benefits of enhanced feature extraction. In contrast, introducing SKNet or optimizing the loss function individually led to relatively modest improvements, with accuracies of 91.05% and 90.96%, respectively.
The synergy between modules plays a critical role in overall performance. When CAFF is combined with SKNet, the inference time increases by only 0.001 s while the accuracy reaches 92.05%, exceeding the gain expected from either module alone. The complete model, incorporating all components, achieved the best overall performance, with peak values in accuracy (92.50%), precision (92.46%), and recall (92.56%). Interestingly, when VGG16 was removed while retaining the other modules, the inference time decreased by 0.001 s and the F1 score improved to 92.37%, indicating that the CAFF and SKNet combination is more efficient for lightweight applications.
These findings confirm that the CAFF module provides the most significant performance gains, while SKNet and the optimized loss function rely on the backbone network to be effective. Overall, collaborative optimization among modules is key to achieving optimal classification accuracy.

3.3.6. Comparison with Previous FFA Image Classification Methods

We evaluated the proposed EFF-ConvNeXt on the APTOS2023 dataset and compared its performance with SOTA methods for fundus disease classification, as shown in Table 7. For a fair comparison, we reimplemented existing YOLO-based approaches [18] and CNN models [15,16,17] under similar experimental settings. As observed, EFF-ConvNeXt consistently outperforms previous YOLO-based and CNN-based methods on the APTOS2023 dataset. Compared to traditional CNN models [15,16,17], EFF-ConvNeXt demonstrates superior capability in extracting discriminative features from FFA images, leading to significantly improved performance. This improvement is attributed to the integration of global feature extraction, multi-scale feature fusion, and pretrained CNN backbones, which enhance the representation of disease-specific and stage-relevant features while suppressing redundant information. Notably, the performance gain is achieved without a substantial trade-off in computational speed. Therefore, the proposed framework holds great promise as an efficient system for multi-lesion fundus disease classification and can serve as a valuable tool to assist ophthalmologists in verifying screening outcomes.

4. Discussion

4.1. Experimental Results and Clinical Relevance

Deep learning is widely applied in automated disease diagnosis across ophthalmology, orthopedics, cardiology, dermatology, and pathology [38,39]. In ophthalmology, AI plays a crucial role in aiding the diagnosis and treatment of diseases such as DR, AMD, and glaucoma. Its applications hold significant potential to transform the diagnosis and treatment of various ocular diseases in the future [40]. Although much progress has been made on these tasks, to the best of our knowledge, no previous work has targeted FFA image classification for more than 20 diseases. In this work, we investigated the importance of image differentiation of multiple lesions for automatic classification. Furthermore, we propose a multi-scale feature fusion network to capture lesion locations and enable multi-disease classification. It integrates VGG16, CAFF, and SKNet into the ConvNeXt framework to enhance lesion feature representation. VGG16 effectively captures edge and texture information, while ConvNeXt improves semantic and global context modeling, boosting overall performance. Additionally, CAFF and SKNet facilitate multi-scale feature fusion and selection, enhancing system stability and classification accuracy, particularly in complex FFA images.
To comprehensively validate the feasibility and effectiveness of the EFF-ConvNeXt model, extensive experiments were conducted and comparative analyses were performed against multiple benchmark models. As shown in Table 2 and Table 7 and Figure 7, Figure 8 and Figure 9, the proposed model outperforms the latest FFA classification methods in both classification accuracy and lesion recognition, with statistically significant improvements. In particular, it surpasses ConvNeXt and FFA-Lens by 3.12% and 3.97% in accuracy, respectively, delivering more focused and accurate lesion identification while maintaining high computational efficiency. According to the results in Table 4, Table 5 and Table 6, the VGG16 and IGC modules contribute most significantly to performance enhancement. Although VGG16 alone performs poorly in FFA classification, its combination with ConvNeXt and IGC outperforms other configurations. Meanwhile, the results demonstrate that while our method introduces a slight increase in inference time due to the added modules, it achieves a significantly better classification performance, thus providing a favorable trade-off between accuracy and efficiency.
Although the present experiments were conducted solely on the APTOS2023 dataset—an inherent limitation for assessing cross-domain generalisation—we employed extensive data augmentation, preprocessing, transfer learning and regularisation strategies to maximise the model’s capacity to learn domain-invariant features. The strong internal-validation results already obtained provide initial evidence of its generalisability. Furthermore, Figure 9 indicates that the model concentrates on anatomical regions relevant to the target pathologies rather than device-specific artefacts or background patterns, implying that it has captured biologically meaningful features. Taken together, these observations strongly suggest that EFF-ConvNeXt can be transferred to data acquired from other hospitals or imaging devices. We are actively seeking collaborations to obtain multi-centre datasets and plan to incorporate advanced domain-generalisation techniques to comprehensively evaluate and further enhance the robustness and real-world utility of our model in diverse clinical environments.
Based on Table 3 and Figure 10 and Figure 11, the proposed model achieved promising results on the APTOS2023 dataset. In the confusion matrix, it is noteworthy that the classification accuracy for RPED was only 47.37%, whereas PDR reached 99.01%. As shown in Figure 5, this significant disparity is primarily attributed to class imbalance. RPED and MN are often different stages or co-manifestations of the same pathological process in clinical settings. Since MN accounts for a disproportionately large number of samples in the dataset, the model tends to predict ambiguous or borderline cases as the more frequently encountered class, namely MN. To mitigate this misclassification issue, we introduced fine-grained feature enhancement, which helped alleviate some errors. However, inter-class imbalance remains a challenge. Such imbalance and misclassification issues can lead to delayed or missed diagnoses in clinical deployment, posing potential risks to patient outcomes. Future research will focus on expanding the dataset and reducing inter-class performance disparities to enhance model robustness and clinical reliability.

4.2. AI Ethics in Ophthalmology

The application of AI models in ophthalmic diagnostics has led to significant performance improvements. However, their integration into clinical practice necessitates rigorous consideration of ethical frameworks and regulatory compliance, which are critical to ensuring patient safety, fairness, and real-world reliability [41]. The U.S. Food and Drug Administration (FDA), together with the UK’s Medicines and Healthcare products Regulatory Agency (MHRA) and Health Canada, has outlined ten Good Machine Learning Practice (GMLP) guiding principles [42]. These emphasize core elements such as data governance, transparency, independent evaluation, and human-centered interface design to ensure the quality and safety of AI-enabled medical devices. In this study, we adhered to a structured development process that included traceable data sourcing, hierarchical validation, and continuous performance monitoring. For future clinical translation, we plan to adopt a Predetermined Change Control Plan (PCCP) [43] to manage model updates and mitigate performance drift in accordance with the latest FDA guidelines.
Another key aspect is the standardization of study reporting and model validation. Internationally recognized guidelines such as CONSORT-AI [44] offer detailed recommendations for trial design, results disclosure, and model reproducibility, thereby enhancing the transparency and comparability of AI-related clinical studies [45]. In the present work, we have followed CONSORT-AI recommendations to report model inputs and outputs, conduct error analysis, and clarify limitations of use. In future stages, we intend to perform prospective validation using multicenter external datasets.

4.3. Limitations and Future Clinical Applications

Despite significant performance improvements on the experimental dataset, the model has limitations. First, current data augmentation techniques cannot fully simulate the complex pathological conditions of real-world fundus images, and the APTOS2023 dataset alone cannot fully demonstrate the generalization of EFF-ConvNeXt. As such, the reported results may be overly optimistic. Future studies will focus on collecting additional external datasets to enhance sample diversity and improve both robustness and generalizability. Second, despite high accuracy, computational speed and resource consumption may limit practical applications. Future research will design a more lightweight model, optimizing the architecture to reduce computational load and improve processing speed. Ultimately, integrating longitudinal or progression data offers irreplaceable advantages in predicting disease evolution. However, this study is limited to single-time-point FFA images, which fail to capture the dynamic changes of lesions over time. Due to the scarcity of longitudinal FFA data, we did not incorporate follow-up sequences for modeling disease progression. As part of our future work, we plan to collect multi-timepoint data and explore temporal modeling approaches to enhance the prediction of disease trajectories.

5. Conclusions

This study proposes an improved ConvNeXt model for FFA image classification, retaining the core structure of the original ConvNeXt while incorporating VGG16, CAFF, and SKNet modules, along with an optimized loss function. These enhancements significantly improve the model’s accuracy and robustness in medical image classification tasks. Experimental results show that EFF-ConvNeXt achieves an accuracy of 92.50% and an F1 score of 92.30% on the APTOS2023 dataset, which are 3.12 and 4.01 percentage points higher than the original ConvNeXt-T model, respectively. The model accurately distinguishes 23 common retinal diseases while maintaining high efficiency, providing clinicians with a reliable diagnostic aid. However, current data augmentation techniques cannot fully capture the diversity of real-world lesions. Future work will focus on expanding FFA image datasets to enhance model stability and generalization. Additionally, lightweight model designs will be explored to facilitate integration into FFA diagnostic systems, offering more efficient support for retinal disease screening and diagnosis.

Author Contributions

Conceptualization, Y.W. and Z.C.; methodology, Y.W. and Z.C.; software, C.C.; validation, Y.W., Z.C. and L.W.; formal analysis, Y.W.; investigation, Y.W. and C.C.; resources, C.C.; data curation, Y.W. and L.W.; writing—original draft preparation, Y.W. and L.W.; writing—review and editing, Y.W. and Z.C.; visualization, Y.W. and L.W.; supervision, C.C.; project administration, C.C. and Y.W.; funding acquisition, C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Innovation Talent Fund of Sichuan Provincial Department of Science and Technology (No. 2024JDRC0013), and the Scientific Research and Innovation Team Program of Sichuan University of Technology (No. SUSE652A006). This study was supported by the computational support provided by the High-Performance Computing Center, Sichuan University of Science and Engineering.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This research used a publicly available dataset, which can be accessed at https://tianchi.aliyun.com/dataset/170128 (accessed on 12 March 2024).

Acknowledgments

We thank all the participants of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
EFF-ConvNeXt: Enhanced Feature Fusion ConvNeXt
FFA: fluorescein fundus angiography
CAFF: Context-Aware Feature Fusion
IGCNet: Improved Global Context Networks
GCNet: Global Context Networks
AMD: age-related macular degeneration
AI: artificial intelligence
APTOS: Asia Pacific Tele-Ophthalmology Society
im-ConvNeXt: improved ConvNeXt
ViT: Vision Transformer
SwinT: Swin Transformer
NLNet: Non-Local Neural Networks
SENet: Squeeze-and-Excitation Networks
SKNet: Selective Kernel Networks
BN: batch normalization
LN: layer normalization
FL: Focal Loss
BRVO: branch retinal vein occlusion
CRAO: central retinal artery occlusion
CRVO: central retinal vein occlusion
CSC: central serous chorioretinopathy
CA: chorioretinal atrophy
CS: chorioretinal scar
CM: choroidal mass
CME: cystoid macular edema
DR: diabetic retinopathy
DAMD: dry age-related macular degeneration
EM: epiretinal membrane
MN: macular neovascularization
PPE: pachychoroid pigment epitheliopathy
PCV: polypoidal choroidal vasculopathy
PDR: proliferative diabetic retinopathy
RAM: retinal arterial macroaneurysm
RD: retinal dystrophy
RPED: retinal pigment epithelial detachment
RVO: retinal vein occlusion
UC: unremarkable changes
FDA: US Food and Drug Administration
MHRA: Medicines and Healthcare products Regulatory Agency
GMLP: Good Machine Learning Practice
PCCP: Predetermined Change Control Plan

References

  1. Wong, W.; Su, X.; Li, X.; Cheung, C.M.G.; Klein, R.; Cheng, C.-Y.; Wong, T.Y. Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: A systematic review and meta-analysis. Lancet Glob. Health 2014, 2, e106–e116. [Google Scholar] [CrossRef]
  2. Curran, K.; Peto, T.; Jonas, J.B.; Friedman, D.; Kim, J.E. Global estimates on the number of people blind or visually impaired by diabetic retinopathy: A meta-analysis from 2000 to 2020. Eye 2024, 38, 2047–2057. [Google Scholar] [CrossRef]
  3. Hogarty, D.T.; Mackey, D.A.; Hewitt, A.W. Current state and future prospects of artificial intelligence in ophthalmology: A review. Clin. Exp. Ophthalmol. 2019, 47, 128–139. [Google Scholar] [CrossRef]
  4. Cheng, Y.; Guo, Q.; Xu, F.J.; Fu, H.; Lin, S.W.; Lin, W. Adversarial exposure attack on diabetic retinopathy imagery grading. IEEE J. Biomed. Health Inform. 2025, 29, 297–309. [Google Scholar] [CrossRef] [PubMed]
  5. Ji, Z.; Ma, X.; Leng, T.L.; Rubin, D.L.; Chen, Q. Mirrored X-Net: Joint classification and contrastive learning for weakly supervised GA segmentation in SD-OCT. Pattern Recognit. 2024, 153, 110507. [Google Scholar] [CrossRef]
  6. Alwakid, G.; Gouda, W.; Humayun, M.; Jhanjhi, N.Z. Deep learning-enhanced diabetic retinopathy image classification. Digit. Health 2023, 9, 20552076231194942. [Google Scholar] [CrossRef] [PubMed]
  7. Li, J.; Wang, Z.; Chen, Y.; Zhu, C.; Xiong, M.; Bai, H.X. A Transformer utilizing bidirectional cross-attention for multi-modal classification of Age-Related Macular Degeneration. Biomed. Signal Process. Control 2025, 109, 107887. [Google Scholar] [CrossRef]
  8. Das, D.; Nayak, D.R.; Pachori, R.B. AES-Net: An adapter and enhanced self-attention guided network for multi-stage glaucoma classification using fundus images. Image Vis. Comput. 2024, 146, 105042. [Google Scholar] [CrossRef]
  9. Xu, Q.; Zhang, J.; Qin, T.; Bao, J.; Dong, H.; Zhou, X.; Hou, S.; Mao, L. The role of the inflammasomes in the pathogenesis of uveitis. Exp. Eye Res. 2021, 208, 108618. [Google Scholar] [CrossRef]
  10. Wildner, G.; Diedrichs-Möhring, M. Resolution of uveitis. Semin. Immunopathol. 2019, 41, 727–736. [Google Scholar] [CrossRef]
  11. Li, Z.; Xu, M.; Yang, X.; Han, Y. Multi-label fundus image classification using attention mechanisms and feature fusion. Micromachines 2022, 13, 947. [Google Scholar] [CrossRef]
  12. Yan, Y.; Yang, L.; Huang, W. Fundus-DANet: Dilated convolution and fusion attention mechanism for multilabel retinal fundus image classification. Appl. Sci. 2024, 14, 8446. [Google Scholar] [CrossRef]
  13. Kwiterovich, K.A.; Maguire, M.G.; Murphy, R.P.; Schachat, A.P.; Bressler, N.M.; Bressler, S.B.; Fine, S.L. Frequency of adverse systemic reactions after fluorescein angiography: Results of a prospective study. Ophthalmology 1991, 98, 1139–1142. [Google Scholar] [CrossRef] [PubMed]
  14. Cole, E.D.; Novais, E.A.; Louzada, R.N.; Waheed, N.K. Contemporary retinal imaging techniques in diabetic retinopathy: A review. Clin. Exp. Ophthalmol. 2016, 44, 289–299. [Google Scholar] [CrossRef] [PubMed]
  15. Pan, X.; Jin, K.; Cao, J.; Liu, Z.; Wu, J.; You, K.; Lu, Y.; Xu, Y.; Su, Z.; Jiang, J.; et al. Multi-label classification of retinal lesions in diabetic retinopathy for automatic analysis of fundus fluorescein angiography based on deep learning. Graefes Arch. Clin. Exp. Ophthalmol. 2020, 258, 779–785. [Google Scholar] [CrossRef]
  16. Gao, Z.; Pan, X.; Shao, J.; Jiang, X.; Su, Z.; Jin, K.; Ye, J. Automatic interpretation and clinical evaluation for fundus fluorescein angiography images of diabetic retinopathy patients by deep learning. Br. J. Ophthalmol. 2023, 107, 1852–1858. [Google Scholar] [CrossRef]
  17. Gao, Z.; Jin, K.; Yan, Y.; Liu, X.; Shi, Y.; Ge, Y.; Pan, X.; Lu, Y.; Wu, J.; Wang, Y.; et al. End-to-end diabetic retinopathy grading based on fundus fluorescein angiography images using deep learning. Graefes Arch. Clin. Exp. Ophthalmol. 2022, 260, 1663–1673. [Google Scholar] [CrossRef]
  18. Veena, K.M.; Tummala, V.; Sangaraju, Y.S.V.; Reddy, M.S.V.; Kumar, P.; Mayya, V.; Kulkarni, U.; Bhandary, S.; Shailaja, S. FFA-Lens: Lesion detection tool for chronic ocular diseases in Fluorescein angiography images. SoftwareX 2024, 26, 101646. [Google Scholar] [CrossRef]
  19. Lyu, J.; Yan, S.; Hossain, M.S. DBGAN: Dual branch generative adversarial network for multi-modal MRI translation. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 235. [Google Scholar] [CrossRef]
  20. Palaniappan, K.; Bunyak, F.; Chaurasia, S.S. Image analysis for ophthalmology: Segmentation and quantification of retinal vascular systems. In Ocular Fluid Dynamics: Anatomy, Physiology, Imaging Techniques, and Mathematical Modeling; Springer: Berlin/Heidelberg, Germany, 2019; pp. 543–580. [Google Scholar]
  21. Shili, W.; Yongkun, G.; Chao, Q.; Ying, L.; Xinyou, Z. Global attention and context encoding for enhanced medical image segmentation. Vis. Comput. 2025, 41, 7781–7798. [Google Scholar] [CrossRef]
  22. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  23. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar] [CrossRef]
  24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  26. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  27. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  28. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar] [CrossRef]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  30. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  31. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 510–519. [Google Scholar] [CrossRef]
  32. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  33. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef] [PubMed]
  34. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, NV, USA, 12–15 March 2018; pp. 839–847. [Google Scholar] [CrossRef]
  35. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  36. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  37. Zhu, B.; Hofstee, P.; Lee, J.; Al-Ars, Z. An Attention Module for Convolutional Neural Networks. In Proceedings of the Artificial Neural Networks and Machine Learning–ICANN 2021: 30th International Conference, Bratislava, Slovakia, 14–17 September 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 167–178. [Google Scholar] [CrossRef]
  38. Panahi, O. Deep Learning in Diagnostics. J. Med. Discov. 2025, 2, 1–6. [Google Scholar]
  39. Piccialli, F.; Di Somma, V.; Giampaolo, F.; Cuomo, S.; Fortino, G. A survey on deep learning in medicine: Why, how and when? Inf. Fusion 2021, 66, 111–137. [Google Scholar] [CrossRef]
  40. Wang, F.; Casalino, L.P.; Khullar, D. Deep Learning in Medicine—Promise, Progress, and Challenges. JAMA Intern. Med. 2019, 179, 293–294. [Google Scholar] [CrossRef]
  41. Goktas, P.; Grzybowski, A. Shaping the Future of Healthcare: Ethical Clinical Challenges and Pathways to Trustworthy AI. J. Clin. Med. 2025, 14, 1605. [Google Scholar] [CrossRef] [PubMed]
  42. U.S. Food and Drug Administration; Health Canada; MHRA. Good Machine Learning Practice for Medical Device Development: Guiding Principles. U.S. Food and Drug Administration. 2021. Available online: https://www.fda.gov/media/153486/download (accessed on 3 July 2025).
  43. U.S. Food and Drug Administration. Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence/Machine Learning (AI/ML)-Enabled Device Software Functions. U.S. Food and Drug Administration. 2024. Available online: https://www.fda.gov/media/174698/download (accessed on 3 July 2025).
  44. Liu, X.; Rivera, S.C.; Moher, D.; Calvert, M.J.; Denniston, A.K. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nat. Med. 2020, 26, 1364–1374. [Google Scholar] [CrossRef]
  45. Martindale, A.P.L.; Llewellyn, C.D.; de Visser, R.O.; Dodhia, H.; Watkinson, P.J.; Clifton, D.A. Concordance of randomised controlled trials for artificial intelligence interventions with the CONSORT-AI reporting guidelines. Nat. Commun. 2024, 15, 1619. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The architecture of the EFF-ConvNeXt Network.
Figure 2. The architecture of the Context-Aware Feature Fusion module (CAFF).
Figure 3. Comparison of the GC Block and IGC Block structure diagrams. (a) The architecture of the GC Block module. (b) The architecture of the IGC Block module.
Figure 4. The architecture of the SKNet.
Figure 5. Class distribution of the APTOS2023 dataset.
Figure 6. Data augmentation effects on FFA images in the APTOS2023 dataset.
Figure 6. Data augmentation effects on FFA images in the APTOS2023 dataset.
Technologies 13 00323 g006
Figure 7. Comparison plots of accuracy and loss of different network models on the APTOS2023 dataset. For easier comparison, a horizontal line is drawn at y = 0.9 in the accuracy plots to highlight the respective performance of each model.
Figure 7. Comparison plots of accuracy and loss of different network models on the APTOS2023 dataset. For easier comparison, a horizontal line is drawn at y = 0.9 in the accuracy plots to highlight the respective performance of each model.
Technologies 13 00323 g007
Figure 8. Distribution of performance metrics across models based on 10 independent runs. Boxes represent the interquartile range with the median line, while dots show individual run results.
Figure 8. Distribution of performance metrics across models based on 10 independent runs. Boxes represent the interquartile range with the median line, while dots show individual run results.
Technologies 13 00323 g008
Figure 9. Activation map results for different model detections.
Figure 9. Activation map results for different model detections.
Technologies 13 00323 g009
Figure 10. Macro-averaged ROC curve for EFF-ConvNeXt.
Figure 10. Macro-averaged ROC curve for EFF-ConvNeXt.
Technologies 13 00323 g010
Figure 11. Confusion matrix for EFF-ConvNeXt.
Figure 11. Confusion matrix for EFF-ConvNeXt.
Technologies 13 00323 g011
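The macro-averaged ROC curve in Figure 10 and the confusion matrix in Figure 11 can be reproduced from a model's per-class probabilities. The snippet below is a minimal sketch of one common way to compute them with scikit-learn; it is not the authors' evaluation code, and the inputs y_true (ground-truth labels) and y_prob (predicted class probabilities) are illustrative placeholders.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.preprocessing import label_binarize


def macro_roc_and_confusion(y_true, y_prob):
    """Return the macro-averaged ROC curve, its AUC, and the confusion matrix.

    y_true: integer class labels, shape (n_samples,)
    y_prob: predicted class probabilities, shape (n_samples, n_classes)
    """
    n_classes = y_prob.shape[1]
    y_bin = label_binarize(y_true, classes=np.arange(n_classes))

    # One-vs-rest ROC curve for every class.
    fpr, tpr = {}, {}
    for c in range(n_classes):
        fpr[c], tpr[c], _ = roc_curve(y_bin[:, c], y_prob[:, c])

    # Macro-average: interpolate all per-class TPRs onto a shared FPR grid.
    grid = np.unique(np.concatenate([fpr[c] for c in range(n_classes)]))
    mean_tpr = np.zeros_like(grid)
    for c in range(n_classes):
        mean_tpr += np.interp(grid, fpr[c], tpr[c])
    mean_tpr /= n_classes

    # Confusion matrix from the argmax predictions, as visualized in Figure 11.
    cm = confusion_matrix(y_true, y_prob.argmax(axis=1))
    return grid, mean_tpr, auc(grid, mean_tpr), cm
```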
Table 1. Sampling rates for each category in the APTOS2023 Dataset.
Category  Rate    Category  Rate    Category  Rate
BRVO      10      DR        8       PCV       6
CRAO      10      DAMD      7       PDR       10
CRVO      10      EM        10      RD        10
CSC       3       EM        10      RPED      8
CA        10      MN        1       ROV       10
CS        8       myopia    10      UC        5
CM        10      other     10      uveitis   6
CME       10      PPE       10
Note: Due to the length of the full names, some lesions are abbreviated in the table. The full names of these lesions can be found in Abbreviations.
Table 2. Classification performance of different models on APTOS2023.
Model            Average Acc (%)  Average Pr (%)  Average Re (%)  Average F1 (%)  Inference Time (t/s)
ViT              70.15            70.06           70.37           70.18           0.004
SwinT            89.36            89.43           89.36           88.29           0.004
VGG19            66.86            64.52           65.33           65.98           0.003
ResNet50         78.64            78.65           78.52           78.39           0.003
DenseNet121      75.41            74.81           76.24           74.12           0.004
EfficientNet-B3  68.40            68.43           68.29           68.26           0.005
ConvNeXt         89.38            89.36           89.03           89.30           0.005
EFF-ConvNeXt     92.50            92.46           92.56           92.30           0.007
Note: The bold formatting here is used to highlight the best performance results.
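The macro-averaged accuracy, precision, recall, and F1 scores reported in Tables 2 to 7, together with the per-image inference time, follow standard definitions. The sketch below shows how such figures are typically computed under common PyTorch and scikit-learn conventions; it is not the authors' published evaluation script, and model, loader, and device are placeholder arguments.

```python
import time

import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    """Macro-averaged classification metrics plus mean per-image inference time."""
    model.eval().to(device)
    y_true, y_pred, elapsed, n_images = [], [], 0.0, 0
    for images, labels in loader:
        images = images.to(device)
        if str(device).startswith("cuda"):
            torch.cuda.synchronize()  # make GPU timing meaningful
        start = time.perf_counter()
        logits = model(images)
        if str(device).startswith("cuda"):
            torch.cuda.synchronize()
        elapsed += time.perf_counter() - start
        n_images += images.size(0)
        y_pred.append(logits.argmax(dim=1).cpu().numpy())
        y_true.append(labels.numpy())
    y_true, y_pred = np.concatenate(y_true), np.concatenate(y_pred)
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "Pr": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "Re": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "F1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "t/s": elapsed / max(n_images, 1),
    }
```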
Table 3. The performance of EFF-ConvNeXt on the classification task across different lesion subsets of the APTOS2023 dataset.
Lesion Category  Acc (%)  Pr (%)  Re (%)  F1 (%)
BRVO             98.02    97.79   98.67   98.23
CRAO             96.98    95.66   96.33   95.99
CRVO             97.33    97.58   97.78   97.68
CSC              75.86    78.19   75.86   77.01
CA               94.44    93.67   91.67   92.66
CS               90.91    93.10   93.18   93.14
CM               86.56    86.15   87.10   86.62
CME              96.72    97.86   95.07   96.45
DR               93.54    94.91   93.75   94.32
DAMD             86.72    92.16   85.45   88.68
EM               98.79    89.79   96.70   93.11
MN               95.33    97.98   96.64   97.30
myopia           95.82    94.66   96.63   95.63
other            98.33    97.98   98.55   98.26
PPE              94.67    96.43   93.10   94.74
PCV              92.33    91.04   91.04   91.04
PDR              99.59    99.24   99.01   99.12
RAM              91.83    91.45   91.30   91.37
RD               97.90    97.67   97.67   97.67
RPED             48.72    47.67   47.37   47.52
RVO              81.82    82.08   81.82   81.95
UC               95.67    97.23   96.25   96.74
uveitis          73.08    72.57   71.28   71.92
Table 4. Comparison of the performance of various pretrained models using im-ConvNeXt as the baseline.
Model                     Average Acc (%)  Average Pr (%)  Average Re (%)  Average F1 (%)  Inference Time (t/s)
im-ConvNeXt               92.03            92.05           92.13           92.37           0.005
SwinT                     89.36            89.43           89.36           88.29           0.004
SwinT+im-ConvNeXt         91.78            91.86           91.77           92.02           0.008
ResNet50                  78.64            78.65           78.52           78.39           0.003
ResNet50+im-ConvNeXt      92.12            91.08           91.68           92.18           0.007
VGG16                     63.91            64.12           63.93           63.98           0.003
VGG16+im-ConvNeXt (ours)  92.50            92.46           92.56           92.30           0.007
Note: The bold formatting here is used to highlight the best performance results.
Table 5. Performance comparison of the proposed attention module, adopting im-ConvNeXt as the backbone, with different cutting-edge attention modules.
Model        Average Acc (%)  Average Pr (%)  Average Re (%)  Average F1 (%)  Inference Time (t/s)
baseline     90.96            90.95           91.10           90.93           0.005
+CBAM [35]   87.83            87.86           87.65           87.12           0.005
+ECA [36]    91.84            91.98           91.80           91.91           0.005
+SimAM [37]  91.06            91.05           91.06           91.18           0.005
+SE [29]     91.56            91.24           91.43           91.23           0.005
+GC [27]     89.43            89.36           89.48           89.46           0.005
+IGC         92.50            92.46           92.56           92.30           0.005
Note: The baseline denotes CAFF without any attention mechanism, while each of the other rows replaces the attention mechanism within CAFF by the named module. The bold formatting here is used to highlight the best performance results.
Table 6. Performance comparison of ablation experiments.
Type      Configuration          VGG16  CAFF  SKNet  Loss Opt.  Average Acc (%)  Average Pr (%)  Average Re (%)  Average F1 (%)  Inference Time (t/s)
Additive  Baseline                                              89.38            89.36           89.03           89.30           0.005
          + VGG16                ✓                              90.10            90.36           89.68           90.16           0.006
          + VGG16 + CAFF         ✓      ✓                       91.81            91.80           91.26           91.75           0.006
          + VGG16 + SKNet        ✓             ✓                91.05            91.03           91.17           91.02           0.006
          + VGG16 + Loss Opt.    ✓                    ✓         90.96            90.86           91.10           91.98           0.006
Ablative  − Loss Opt.            ✓      ✓      ✓                92.05            92.19           91.97           92.03           0.007
          − SKNet                ✓      ✓             ✓         91.81            91.82           91.68           91.78           0.006
          − CAFF                 ✓             ✓      ✓         90.96            90.95           91.10           90.93           0.006
          − VGG16                       ✓      ✓      ✓         92.03            92.05           92.13           92.37           0.006
          Completed model        ✓      ✓      ✓      ✓         92.50            92.46           92.56           92.30           0.007
Note: ✓ indicates that the corresponding module is included in the configuration. "+" indicates modules added to the baseline model, and "−" indicates modules removed from the completed model. "Opt." is short for optimization. The bold formatting here is used to highlight the best performance results.
Table 7. Comparison with previous automated FFA image classification methods on APTOS2023.
Model            Acc (%)  Pr (%)  Re (%)  F1 (%)  Recognition Time (t/s)
Pan et al. [15]  76.14    76.18   76.38   76.21   0.004
Gao et al. [17]  63.91    64.12   63.93   63.98   0.004
Gao et al. [16]  74.57    74.75   75.25   73.98   0.003
FFA-lens [18]    88.53    88.38   87.63   87.88   0.008
EFF-ConvNeXt     92.50    92.46   92.56   92.30   0.007
Note: The bold formatting here is used to highlight the best performance results.

