Multi-Fundus Diseases Classification Using Retinal Optical Coherence Tomography Images with Swin Transformer V2

Fundus diseases cause damage to any part of the retina. Untreated fundus diseases can lead to severe vision loss and even blindness. Analyzing optical coherence tomography (OCT) images using deep learning methods can provide early screening and diagnosis of fundus diseases. In this paper, a deep learning model based on Swin Transformer V2 was proposed to diagnose fundus diseases rapidly and accurately. In this method, calculating self-attention within local windows was used to reduce computational complexity and improve its classification efficiency. Meanwhile, the PolyLoss function was introduced to further improve the model’s accuracy, and heat maps were generated to visualize the predictions of the model. Two independent public datasets, OCT 2017 and OCT-C8, were applied to train the model and evaluate its performance, respectively. The results showed that the proposed model achieved an average accuracy of 99.9% on OCT 2017 and 99.5% on OCT-C8, performing well in the automatic classification of multi-fundus diseases using retinal OCT images.


Introduction
Fundus diseases include conditions such as diabetic macular edema (DME), choroidal neovascularization (CNV), and drusen, which significantly impact the quality of life [1]. With the continuous development of ophthalmic medicine, OCT technology has become an important diagnostic tool, especially in the diagnosis of fundus diseases. OCT is a non-invasive imaging technique that provides high-resolution retinal images to help diagnose eye diseases, evaluate treatment outcomes, and monitor disease progression [2]. However, due to the large amount of data and the complex structural and morphological features of retinal OCT images, manual diagnosis requires a significant amount of time and effort. Therefore, computer-aided diagnosis (CAD) techniques have significant value in the automatic classification of retinal OCT images.
CAD refers to the use of computer technology to analyze and process medical images to provide diagnostic assistance [3]. CAD is now widely used in the automatic analysis and diagnosis of medical images, for example in breast cancer, lung cancer, and colorectal cancer. CAD systems can help doctors diagnose diseases quickly and accurately, improving diagnostic accuracy and efficiency. Deep learning is a machine learning technique that has been widely applied in the field of computer-aided diagnosis [4]. Convolutional neural networks (CNNs) are a type of deep learning technique that has been continuously developed since the 1980s. CNNs have achieved great success in the field of computer vision and are widely used in tasks such as image classification, object detection, and semantic segmentation. Some early CNN models include LeNet [5] and AlexNet [6]. As deep learning technology has continued to develop, many new CNN models have emerged, including VGGNet [7], GoogLeNet [8], ResNet [9], DenseNet [10], MobileNet [11], and EfficientNet [12]. Although

The Proposed Model
In this paper, we propose a multi-fundus disease classification model based on Swin Transformer V2 [16]. The dataset is first subjected to preprocessing operations such as data augmentation, and then the network is trained. Based on the training results, hyperparameters such as the learning rate and batch size are fine-tuned to determine appropriate training settings. After comparing different loss functions, we adopted PolyLoss [17] as the loss function to obtain better performance in retinal OCT image classification. To improve the interpretability of the model and understand its decision-making process, visualization methods such as the confusion matrix and Grad-CAM heatmaps [18] were used in the testing phase. Finally, after optimizing the network parameters and loss function, the results of multiple training sessions were compared to obtain the optimal network model for multi-fundus disease classification.
The contributions of this paper are as follows:
1. The Swin Transformer V2 model is applied to classify multiple diseases in retinal OCT images.
2. Based on the Swin Transformer V2 model, the loss function is improved by introducing PolyLoss, which improves the model's performance.
3. Experimental validation was performed on two datasets, OCT2017 and OCT-C8, and Grad-CAM visualization was used to help understand the decision-making mechanisms of the network.

Related Work
The use of deep learning algorithms for identifying OCT images has been extensively studied. For example, Lee et al. used a deep neural network to classify OCT images as normal or AMD, achieving an accuracy of 87.63% [19]. Lu et al. and Bhadra et al. used a deep multi-layer CNN to categorize OCT images into healthy, dry AMD, wet AMD, and DME [20]. Kermany et al. applied deep transfer learning to automatically diagnose diabetic retinopathy in OCT images [21]. Rong et al. suggested a different auxiliary classification method, based on CNNs, for the automatic categorization of retinal OCT images [22]. Fang et al. proposed a novel lesion-aware convolutional neural network (LACNN) method for retinal OCT image classification, in which retinal lesions in OCT images guide the CNN toward more accurate classification [23]. Singh et al. studied attribution-based interpretation of deep learning for ophthalmic diagnosis and proposed a framework for explaining the classification decisions of a deep learning network on retinal OCT images [24]. Wang et al. proposed classifying volumetric OCT images via a recurrent neural network (VOCT-RNN), which can fully exploit temporal information among B-scans. However, such models may introduce unnecessary complexity, limiting the interpretation of their results in clinical practice [25]. To investigate this, Arefin et al. developed a configurable deep CNN that classifies four types of macular diseases using retinal OCT images [26]. V et al. proposed a method to improve the automatic classification and detection of macular diseases in retinal OCT images by fusing two pre-trained deep learning networks [27]. Identifying macular diseases and segmenting lesion areas are both necessary to assist ophthalmologists in clinical diagnosis, and Liu et al. studied joint disease classification and lesion segmentation in OCT images via a one-stage attention-based convolutional neural network [28]. Esfahani et al. evaluated a deep-learning-based method on publicly available data comprising 45 OCT volumes (15 age-related macular degeneration, 15 diabetic macular edema, and 15 normal) captured by Heidelberg OCT imaging equipment [29]. He et al. proposed a method for classifying retinal OCT images using an interpretable Swin-Poly Transformer network [30]. This is a significant contribution to the field of retinal OCT image classification; our work was inspired by this study and improves upon it. Other influential works include those by Ibrahim et al. and Ai et al. [31,32]. However, achieving fast and accurate detection requires breaking out of the existing CNN framework, which is challenging.
The Transformer is a model architecture that originated in natural language processing (NLP). Its relatively mature theoretical support and technological development in NLP brought it to the attention of vision researchers, and Transformer methods have been shown to apply to computer vision tasks, outperforming existing CNN methods in some of them [33]. The Vision Transformer (ViT) is a model proposed by the Google team in 2020 that applies the Transformer to image classification. The model is simple and effective, scales well (the larger the model, the better the performance), and performs well in computer vision. The Swin Transformer is a visual Transformer that can serve as a general backbone network for computer vision. It adopts a hierarchical structure and shifted windows to effectively extract multi-scale features. In addition, some researchers have attempted to combine Transformers and CNNs to improve prediction performance. For example, for object detection in drone images, a Transformer-based model can be fused with a CNN-based model [34]. Swin Transformer V2 is a large model for computer vision that addresses three main issues in training and applying large vision models: training instability, the resolution gap between pre-training and fine-tuning, and the need for labeled data. Swin Transformer V2 can better handle complex image data and achieves excellent performance in the automatic classification of retinal OCT images.

Materials and Methods
The overall framework of the proposed method is illustrated in Figure 1. The PolyLoss loss function is employed during the experiment to enhance the training efficiency of the model. Data augmentation methods are applied during the training phase to increase the diversity of the training data and enhance the network's ability to generalize. After training, Grad-CAM is utilized to visualize and explain the results.

Architecture of Swin Transformer V2
Swin Transformer V2 is an upgraded version of Swin Transformer. It improves upon version 1.0 by making the model larger and able to adapt to different image resolutions and window sizes. The Swin Transformer V2 block uses two attention modules, the window multi-head self-attention (W-MSA) module and the shifted window multi-head self-attention (SW-MSA) module, in place of the standard multi-head self-attention (MSA) module found in ViT. In addition, attention in the ViT Transformer block is computed with the dot product dot(Q, K), which Swin V2 replaces with cosine(Q, K)/τ, where τ is a learnable parameter that is not shared between blocks. The cosine operation inherently includes normalization, which further stabilizes the attention output values.
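The replacement of dot(Q, K) by cosine(Q, K)/τ can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, τ is a fixed number here rather than a learned parameter, and the relative position bias used in practice is omitted.

```python
import math

def cosine_similarity(q, k, eps=1e-12):
    """Cosine of the angle between two vectors; invariant to their magnitudes."""
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    return dot / max(norm_q * norm_k, eps)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_cosine_attention(queries, keys, values, tau=0.1):
    """Attention whose logits are cosine(q, k) / tau (tau is assumed fixed here)."""
    outputs = []
    for q in queries:
        logits = [cosine_similarity(q, k) / tau for k in keys]
        weights = softmax(logits)
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
small = scaled_cosine_attention([[1.0, 0.0]], keys, values)
large = scaled_cosine_attention([[100.0, 0.0]], keys, values)
```

Because the logits depend only on the angle between query and key, `small` and `large` coincide even though the second query is 100 times larger, which is the normalization effect described above.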
Figure 2 illustrates the overall structure of the Swin Transformer V2 model [14]. The input image, with a size of 256 × 256, is first divided into non-overlapping 4 × 4 patches by the patch partitioning module. These patches are then treated as 'tokens' and projected into C dimensions using a linear embedding layer. Two consecutive Swin Transformer V2 blocks with self-attention computation are applied to these patch tokens while keeping the number of tokens unchanged, as shown in Figure 2b. A 'stage' consists of a linear embedding layer and Swin Transformer V2 blocks. The design of Swin Transformer V2 resembles the layer structure of CNNs, where the resolution is halved and the number of channels is doubled at each stage. To produce hierarchical representations, the Swin Transformer reduces the number of tokens through patch merging layers as the network gets deeper. Figure 3a shows an example of a hierarchical representation. In contrast to the 224 × 224 input resolution used by He et al. [30], we employ Swin Transformer V2 with a higher input resolution of 256 × 256. The advantage is that the network has access to more features, and the increased feature extraction capability improves the performance of the model.
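The token arithmetic described above can be traced with a short sketch. The embedding dimension C = 96 and the use of four stages are assumptions chosen for illustration (the text only names the dimension C), and `swin_stage_shapes` is a hypothetical helper.

```python
def swin_stage_shapes(image_size=256, patch_size=4, embed_dim=96, num_stages=4):
    """Return (height, width, channels) of the token grid at each stage.

    Assumes square inputs, a halving of resolution and a doubling of
    channels at every stage after the first (patch merging).
    """
    side = image_size // patch_size      # tokens per side after patch partitioning
    channels = embed_dim                 # C after the linear embedding
    shapes = []
    for stage in range(num_stages):
        if stage > 0:                    # patch merging between stages
            side //= 2
            channels *= 2
        shapes.append((side, side, channels))
    return shapes

shapes_256 = swin_stage_shapes(image_size=256)
shapes_224 = swin_stage_shapes(image_size=224)
```

Under these assumptions, a 256 × 256 input starts from a 64 × 64 token grid, whereas a 224 × 224 input starts from 56 × 56, which is one way to see why the higher resolution exposes more features to the network.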

Shifted-Window-Based Self-Attention
Self-attention is computed within local windows to reduce computational complexity and improve modeling efficiency. The shifted-window strategy used to compute self-attention in this experiment is shown in Figure 3a. In the ViT architecture, the standard MSA module performs global attention, so its computational cost grows quadratically with the number of patches and becomes prohibitive for high-resolution images. In W-MSA, this relationship is linear, and the amount of computation is acceptable. Assuming that each window contains M × M patches, the windows partition the image evenly in a non-overlapping manner. On an image with h × w patches, the computational complexities of the global MSA module and the window-based MSA module are, respectively:

$$\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C \tag{1}$$

$$\Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC \tag{2}$$

where h × w is the total number of patches in the image, and C denotes the channel dimension of each patch. When M is constant (the default value is 7), the complexity in Equation (2) is linear in the number of patches h × w, whereas the complexity in Equation (1) is quadratic.
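To make the difference between quadratic and linear scaling concrete, the sketch below evaluates the standard complexity expressions from the Swin Transformer literature, 4hwC² + 2(hw)²C for global MSA and 4hwC² + 2M²hwC for W-MSA. The numeric values (a 64 × 64 token grid, C = 96, M = 8) are illustrative assumptions, with M chosen so that the windows tile the grid evenly.

```python
def msa_complexity(h, w, C):
    """Global multi-head self-attention: quadratic in the number of patches h*w."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_complexity(h, w, C, M):
    """Window-based self-attention with M x M windows: linear in h*w for fixed M."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h = w = 64   # assumed token grid (256 x 256 input, 4 x 4 patches)
C = 96       # assumed channel dimension
M = 8        # assumed window size

global_cost = msa_complexity(h, w, C)
window_cost = wmsa_complexity(h, w, C, M)
ratio = global_cost / window_cost
```

Doubling the number of patches doubles `wmsa_complexity` exactly, while the attention term of `msa_complexity` grows fourfold.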
The window-based self-attention module lacks cross-window connections, ignoring the relationships between different windows and limiting modeling capability. To set up cross-window connections while retaining the computational efficiency of non-overlapping windows, the partition configuration alternates between two settings in consecutive Swin Transformer V2 blocks. As illustrated in Figure 4 [14], the first module divides the 8 × 8 feature map evenly into 2 × 2 windows of size 4 × 4 (M = 4) using a regular window partitioning that starts from the top-left pixel. The next module then adopts a window configuration that is offset from that of the previous layer, shifting the windows by (M/2, M/2) pixels from the regular partition. In the new windows, the self-attention computation crosses the boundaries of the previous windows, thus providing connections between them. Using the shifted window partitioning method, consecutive Swin Transformer V2 blocks are calculated as in Equations (3) to (6), where W-MSA and SW-MSA denote window-based multi-head self-attention using regular and shifted window partitioning configurations, respectively, and $\hat{z}^{l}$ and $z^{l}$ denote the output features of the (S)W-MSA module and the MLP in layer l, respectively.
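As a sketch of what these block updates look like, the equations below follow the pre-normalization form published for the original Swin Transformer. This form is an assumption, and LN (layer normalization) is a symbol not defined in the text.

```latex
\begin{aligned}
\hat{z}^{l}   &= \text{W-MSA}\left(\mathrm{LN}\left(z^{l-1}\right)\right) + z^{l-1} \\
z^{l}         &= \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l} \\
\hat{z}^{l+1} &= \text{SW-MSA}\left(\mathrm{LN}\left(z^{l}\right)\right) + z^{l} \\
z^{l+1}       &= \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}
\end{aligned}
```

Swin Transformer V2 is usually described with residual post-normalization, in which LN is applied to the output of each attention or MLP branch before the residual addition, for example $\hat{z}^{l} = \mathrm{LN}(\text{W-MSA}(z^{l-1})) + z^{l-1}$. Which of the two forms applies here depends on the implementation used.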

The shifted window partitioning produces a number of new windows, some of which are smaller than M × M. One straightforward way to compute self-attention is to pad all windows to M × M. This method, however, results in more windows.
For instance, in Figure 3b, the number of windows grows from 2 × 2 to 3 × 3 after shifting, which considerably increases the computational cost of the model. As demonstrated in Figure 4, we address this problem with an efficient batch computation technique that cyclically shifts the feature map toward the top left. After shifting, a batched window may consist of several sub-windows that are not adjacent in the feature map. Therefore, we use a masking mechanism to restrict the self-attention computation to within each sub-window. With cyclic shifting, the number of batched windows stays the same as in regular window partitioning, so the computation remains efficient.
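The cyclic shift and masking idea can be sketched in pure Python on the 8 × 8 feature map with M = 4 mentioned in the text. All function names are hypothetical, and the region labelling mirrors the commonly used approach of tagging each position with the shifted-partition window it belongs to; it is a sketch under these assumptions rather than the authors' code.

```python
def window_partition(grid, M):
    """Split an H x W grid (list of lists) into non-overlapping M x M windows."""
    windows = []
    for top in range(0, len(grid), M):
        for left in range(0, len(grid[0]), M):
            windows.append([row[left:left + M] for row in grid[top:top + M]])
    return windows

def cyclic_shift(grid, shift):
    """Roll the grid toward the top-left by `shift` rows and columns."""
    rolled = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rolled]

def region_ids(H, W, M, shift):
    """Label every position of the rolled grid by its shifted-partition sub-window."""
    def band(i, size):
        if i < size - M:
            return 0
        if i < size - shift:
            return 1
        return 2
    return [[3 * band(r, H) + band(c, W) for c in range(W)] for r in range(H)]

def attention_allowed(window_ids):
    """True where two positions of a batched window share a sub-window."""
    flat = [v for row in window_ids for v in row]
    return [[a == b for b in flat] for a in flat]

H = W = 8
M = 4
shift = M // 2
feature_map = [[r * W + c for c in range(W)] for r in range(H)]

shifted = cyclic_shift(feature_map, shift)
batched = window_partition(shifted, M)            # still 2 x 2 = 4 windows
id_windows = window_partition(region_ids(H, W, M, shift), M)
groups_per_window = [len({v for row in w for v in row}) for w in id_windows]
mask_top_right = attention_allowed(id_windows[1])
```

In this example the four batched windows contain 1, 2, 2, and 4 sub-windows, which together account for the 3 × 3 windows of the shifted partition while only 2 × 2 windows are actually processed.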

PolyLoss
The PolyLoss function has been demonstrated to outperform cross-entropy loss and focal loss in tasks such as 2D image classification, instance segmentation, object detection, and 3D object detection. We therefore adopted PolyLoss as the loss function to improve the classification accuracy of the OCT classification model. PolyLoss expresses the loss as a polynomial in (1 − P_t), where P_t is the predicted probability of the target class. With the polynomial coefficients denoted by α_j, the PolyLoss formula is expressed as follows:

$$L_{\mathrm{Poly}} = \alpha_{1}(1 - P_t) + \alpha_{2}(1 - P_t)^{2} + \cdots = \sum_{j=1}^{\infty} \alpha_{j}(1 - P_t)^{j} \tag{7}$$

This formula contains infinitely many polynomial coefficients. Even tuning a subset of them would result in a dauntingly large search space, which is not feasible. Additionally, tuning many coefficients simultaneously does not necessarily outperform the cross-entropy loss. This problem is solved by perturbing only the leading polynomial coefficients of the cross-entropy loss while leaving the other coefficients unchanged. The resulting loss is written as Poly-N, where N is the number of leading coefficients that are tuned.
In particular, the j-th polynomial coefficient of the cross-entropy loss is changed from 1/j to 1/j + ε_j, where ε_j ∈ [−1/j, ∞) is the perturbation term:

$$L_{\text{Poly-}N} = \sum_{j=1}^{N}\left(\frac{1}{j} + \epsilon_{j}\right)(1 - P_t)^{j} + \sum_{j=N+1}^{\infty}\frac{1}{j}(1 - P_t)^{j} = -\log(P_t) + \sum_{j=1}^{N}\epsilon_{j}(1 - P_t)^{j} \tag{8}$$

Equation (8) shows that the loss can be computed exactly by adjusting only the first N polynomial terms, without having to handle the infinitely many higher-order (j ≥ N + 1) coefficients. The first polynomial term offers the largest gain. Simplifying the Poly-N formula further and focusing on Poly-1, where only the first polynomial coefficient of the cross-entropy loss is changed, gives the final PolyLoss formula:

$$L_{\text{Poly-1}} = (1 + \epsilon_{1})(1 - P_t) + \frac{1}{2}(1 - P_t)^{2} + \cdots = -\log(P_t) + \epsilon_{1}(1 - P_t) \tag{9}$$

In this experiment, we perform OCT image classification with ε_1 = 2.
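A minimal pure-Python sketch of the Poly-1 loss for a single sample is given below. It assumes the published Poly-1 form, cross-entropy plus ε₁(1 − P_t), and uses the value ε₁ = 2 stated in the text as the default. The function names are hypothetical, and a real training loop would use a batched, differentiable implementation instead.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy loss -log(P_t) for one sample."""
    return -math.log(softmax(logits)[target])

def poly1_loss(logits, target, epsilon=2.0):
    """Poly-1 loss: cross-entropy plus epsilon * (1 - P_t)."""
    p_t = softmax(logits)[target]
    return -math.log(p_t) + epsilon * (1.0 - p_t)

def ce_series(p_t, terms):
    """Truncated polynomial expansion of cross-entropy: sum of (1/j) * (1 - P_t)^j."""
    return sum((1.0 - p_t) ** j / j for j in range(1, terms + 1))

uniform_loss = poly1_loss([0.0, 0.0, 0.0, 0.0], target=0)
confident_loss = poly1_loss([8.0, 0.0, 0.0, 0.0], target=0)
```

Setting `epsilon=0` recovers the cross-entropy loss, and `ce_series` converges to −log(P_t) as more terms are added, which is the expansion whose leading coefficient Poly-1 perturbs.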

Datasets
In this paper, two public datasets, OCT2017 [35] and OCT-C8 [36], were used to train and test the network model. Dataset 1, as shown in Figure 5, contains examples of three fundus diseases and normal retina, while Dataset 2, as shown in Figure 6 [37], contains OCT images of seven diseases and one normal category. The OCT2017 dataset covers three diseases, choroidal neovascularization (CNV), diabetic macular edema (DME), and drusen, together with a class of normal fundus. The OCT2017 dataset contains 84,452 retinal OCT images of 4 classes (as shown in Figure 5) ... of 1000 images each for validation. Details of the two datasets are shown in Table 1. The OCT-C8 dataset contains 24,000 images of eight categories (as shown in Figure 6): AMD, choroidal neovascularization (CNV), central serous retinopathy (CSR), DME, diabetic retinopathy (DR), drusen, macular hole (MH), and a healthy class. The training set consists of 2300 images per category, for a total of 18,400 images, and the test and validation sets each contain 2800 images, with 350 images per category.
Before training the model, we preprocessed and augmented the data. Obtaining a large number of labeled medical images is challenging because labeling is time-consuming and requires professional medical expertise, which can be costly. To increase the diversity of the training data, data augmentation methods such as random rotation, cropping, and mirroring were used. Additionally, the images were resized to 256 × 256 and normalized to match the model's input requirements. In the final step, the data were converted into tensors and fed into the model for training. This process helps to enhance the model's ability to generalize and improve its stability.
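The OCT-C8 split sizes quoted above can be cross-checked with a few lines of arithmetic. The dictionary keys are illustrative names and not identifiers from the dataset itself.

```python
NUM_CLASSES = 8  # seven disease categories plus one healthy class

# Images per category in each split, as stated in the text.
PER_CLASS = {"train": 2300, "validation": 350, "test": 350}

split_totals = {split: count * NUM_CLASSES for split, count in PER_CLASS.items()}
grand_total = sum(split_totals.values())
train_fraction = split_totals["train"] / grand_total
```

The per-category counts reproduce the 18,400 training images, the 2800 images in each of the other two splits, and the overall total of 24,000 images.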

Evaluation Metrics
To evaluate the performance of the model in classification, we use Accuracy, Precision, and Recall as evaluation metrics.The formulas for these evaluation metrics are shown below.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$

Here, TP, TN, FP, and FN stand for the numbers of true positives, true negatives, false positives, and false negatives, respectively. For OCT classification, TP is the number of cases that the model correctly classified as positive, TN is the number of cases that the model correctly classified as negative, FP is the number of negative samples that the model incorrectly classified as positive, and FN is the number of positive cases that the model incorrectly classified as negative.
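For a multi-class problem these counts are usually obtained per category in a one-vs-rest fashion from the confusion matrix. The sketch below assumes the standard definitions Precision = TP/(TP + FP) and Recall = TP/(TP + FN). The confusion matrix is hypothetical (242 images per class with a single DME image predicted as CNV), and the class order is an assumption.

```python
def per_class_counts(confusion, cls):
    """One-vs-rest TP, TN, FP, FN for class `cls`.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[r][cls] for r in range(n)) - tp
    total = sum(sum(row) for row in confusion)
    tn = total - tp - fn - fp
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

CLASSES = ["CNV", "DME", "DRUSEN", "NORMAL"]   # assumed order
confusion = [                                  # hypothetical values
    [242, 0, 0, 0],
    [1, 241, 0, 0],
    [0, 0, 242, 0],
    [0, 0, 0, 242],
]

report = {}
for index, name in enumerate(CLASSES):
    tp, tn, fp, fn = per_class_counts(confusion, index)
    report[name] = {
        "accuracy": accuracy(tp, tn, fp, fn),
        "precision": precision(tp, fp),
        "recall": recall(tp, fn),
    }
```

In this hypothetical case the single misclassification lowers the precision of CNV and the recall of DME, while the other two categories remain at 1.0 on all three metrics.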

Results
In this research, the network was trained and evaluated on a Windows 10 operating system with 64 GB of memory, an NVIDIA 4090 24 GB GPU, a 2 TB solid-state drive, Python 3.7, and PyTorch 1.10.1 + cu102. At the start of each experiment, we loaded ImageNet-22K pre-trained models for transfer learning. The input resolution is set to 384 × 384 for EfficientNetV2 and to 224 × 224 for the ViT and Swin Transformer models; Swin Transformer V2 supports higher-resolution input than the Swin Transformer and is set to 256 × 256. The batch size was set to 32, and each model was trained for 200 epochs. During training, we saved the models with the highest accuracy and the lowest loss and, by comparison, selected the model with the highest test accuracy as the optimal model.
The performance on each category of the OCT2017 dataset was tested using pre-trained EfficientNetV2 [38], Vision Transformer (ViT), Swin Transformer, and our improved Swin Transformer V2 network. Table 2 shows the experimental results for the three retinal disease categories and the normal category when the CrossEntropy loss function is used for the four network models on the OCT2017 dataset. Table 3 shows the experimental results obtained for the different network models on the same dataset when using the PolyLoss function.
To further validate our models, we also tested the ViT, Swin Transformer, and Swin Transformer V2 network models on the OCT-C8 dataset using CrossEntropyLoss, with the results shown in Table 4, and the PolyLoss loss function, with the results shown in Table 5.
To present the performance of each model more intuitively, we use confusion matrices to show the agreement between the model predictions and the true categories. The results obtained by our models on the OCT2017 and OCT-C8 datasets using different loss functions are shown in Figure 7a,c

Discussion
As can be seen from Table 2, EfficientNetV2 achieved an accuracy of 0.975 in the CNV category and its highest accuracy of 0.988 in the normal category, with F1-Scores of 0.953 and 0.976 in the CNV and normal categories, respectively. It achieved category accuracies of 0.986 and 0.977 in DME and DRUSEN, respectively, while the ViT model obtained lower evaluation metrics than EfficientNetV2 on all four categories. Both Swin Transformer and our model achieved a per-category accuracy above 99%, and their evaluation metrics reached a score of 1 on the normal category. Table 3 shows that when using the PolyLoss function, EfficientNetV2 shows a slight decrease in diagnostic performance on the CNV and DRUSEN categories and a slight increase on the DME and NORMAL categories. The evaluation metrics for the three retinal disease diagnoses improved for Swin Transformer and our model. Compared to the Swin Transformer, our model performed better, with a category diagnostic accuracy of 0.999 for both CNV and DME and an accuracy score of 1 on DRUSEN and normal fundus.
Table 6 reports the averages of the experimental results obtained using the CrossEntropy and PolyLoss functions on the OCT2017 and OCT-C8 datasets. We observed that the EfficientNetV2 network performed better than ViT when using CrossEntropy loss, with average accuracies of 98.2% and 96.5%, respectively. However, the Swin Transformer model achieved a 3.3% average accuracy improvement over EfficientNetV2 and performed better. We achieved an average accuracy of 99.8% using Swin Transformer V2, which improved on Precision, Recall, Specificity, and F1-Score compared to the Swin Transformer. When the loss function was changed from CrossEntropyLoss to PolyLoss, the Swin Transformer network achieved the same accuracy but improved in several other evaluation metrics. With PolyLoss, compared with CrossEntropyLoss, Swin Transformer V2 showed an improvement in performance, with a 0.3% increase in Precision, a 0.4% increase in Recall, and a 0.1% increase in F1-Score. Swin Transformer V2 achieved 100% Precision, Recall, and Sensitivity in the DME, DRUSEN, and NORMAL categories and an accuracy close to 1.0 in the CNV, DME, DRUSEN, and NORMAL categories. This demonstrates the excellent classification ability of Swin Transformer V2 on the OCT dataset and shows that using the PolyLoss loss function can further improve the performance of the network. On the OCT-C8 dataset, this method outperformed ViT and Swin Transformer, and using the PolyLoss loss function further improved performance, resulting in the best average performance. With the PolyLoss loss function, Swin Transformer and our Swin Transformer V2 achieved 100% accuracy in the AMD, CSR, DR, and MH categories.
In addition, we compared our results with those of other studies; Table 7 shows the comparison. We found that our Swin Transformer V2 improved with PolyLoss achieved better accuracy and sensitivity than the other methods. These results indicate that our method is reliable and accurate in OCT image classification. They also support our research in the field of OCT image classification and lay a foundation for future research.
Figure 7a,b are the confusion matrices of Swin Transformer V2 using CrossEntropyLoss and PolyLoss, respectively, when tested with 968 images from the OCT2017 dataset. Figure 7b shows that the model judged one DME image as CNV while making zero errors in the other categories, demonstrating the excellent classification ability of the network. Figure 7c,d are the confusion matrices obtained with the two loss functions on 2800 test images in OCT-C8, respectively. As can be seen, the network successfully classified the AMD, CSR, DR, and MH data.

OCT2017
[40]: 0.985, 0.994
ResNet50-v1 [9]: 0.993, 0.993
Joint-Attention-Network ResNet-v1 [41]: 0.924
Xception [42]: 0.997, 0.997
OpticNet-71 [43]: 0.998, 0.998
Swin Transformer V1 [30]: 0.998, 0.998
Ours: 0.999, 0.999

OCT-C8
VIT: 0.975, 0.986
GAN [44]: 0.939
Swin Transformer: 0.994, 0.997
Deep CNN [45]: 0.938
CenterNet [46]: 0.981
Ours: 0.995, 0.997

For the trained OCT model, we used Grad-CAM to visualize the decision-making mechanism behind its predictions. Grad-CAM is a gradient-based deep network visualization method that explains the classification basis of deep neural network models in the form of heat maps, showing which pixels of the image drive the category judgment. Figures 8 and 9 show heat maps of the prediction results for the OCT2017 and OCT-C8 datasets, respectively. The colors of the heat map represent regions of interest, with red indicating high correlation with the target category and blue indicating that the model pays less attention to the region. The purple area is the result of filling the blank area after data augmentation of the image. Meanwhile, lesion regions show up as a darker red color in disease OCT images. As shown in Figure 8, the second row of images shows the Grad-CAM of DME images, and from the third image it can be observed that the region of interest contains the macular edema lesion. The image in the third row and fourth column of Figure 8 shows the region of interest for drusen, which is also the region where the lesion occurred. Figure 9 shows a selection of heat maps for the eight categories of the OCT-C8 dataset, illustrating how our trained model highlights the lesion regions of each disease. Grad-CAM helps us to see the regions of interest that the model focuses on when making a prediction, and thus to understand the decision-making process behind the prediction. It is worth noting that this focus on the region of interest is also consistent with the ophthalmologist's observation and diagnostic process.
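The core Grad-CAM computation can be sketched in a few lines. The following is a simplified illustration on tiny hand-written feature maps, not the code used to produce Figures 8 and 9: each channel is weighted by the spatial average of the gradient of the class score with respect to that channel, the weighted channels are summed, negative values are clipped to zero, and the map is scaled to [0, 1]. The function name and the toy inputs are assumptions; for a Swin-style model the token sequence of the chosen layer would first have to be reshaped into an H x W grid, a step that is omitted here.

```python
def grad_cam(activations, gradients):
    """Return a normalized Grad-CAM map.

    activations, gradients: lists of K feature maps, each an H x W nested list.
    gradients[k][i][j] is d(class score) / d(activations[k][i][j]).
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of the gradients over all spatial positions.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]

    # Weighted sum of the feature maps.
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]

    # ReLU keeps only the regions that support the target class.
    cam = [[max(value, 0.0) for value in row] for row in cam]

    # Scale to [0, 1] so the map can be rendered as a heat map.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example with two 2 x 2 channels (values chosen for illustration only).
activations = [
    [[1, 0], [0, 0]],   # channel 0 fires in the top-left corner
    [[0, 0], [0, 1]],   # channel 1 fires in the bottom-right corner
]
gradients = [
    [[1, 1], [1, 1]],       # channel 0 increases the class score
    [[-1, -1], [-1, -1]],   # channel 1 decreases the class score
]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

In this toy case only the top-left position remains highlighted, because the channel that fires in the bottom-right corner has a negative weight and is removed by the ReLU step.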


Conclusions
In this paper, a multi-fundus disease classification model based on Swin Transformer V2 and the PolyLoss function was proposed. By comparing two different loss functions, it was demonstrated that the PolyLoss function can enhance the model's performance. In the final experiment, evaluation indices close to 1 were achieved on the OCT2017 dataset, demonstrating the good performance of the model in classifying OCT images. To validate the generalization ability of the network, it was trained and evaluated on OCT-C8, attaining a score of 1 for accuracy and other assessment metrics in half of the OCT disease categories and an average accuracy of 99.5%, which demonstrates the effectiveness of our model in classifying fundus diseases on OCT images.
The basic Swin Transformer V2 demonstrated strong performance on the publicly available OCT2017 dataset, making further improvements challenging. In clinical practice, misdiagnosis and missed diagnosis can lead to serious medical accidents and cause great pain to patients. The aim of our work is therefore to improve the accuracy of automatic diagnosis as much as possible to reduce the occurrence of misdiagnosis and missed diagnosis. By using polynomial loss and optimizing the network parameters, we were able to achieve a comprehensive improvement in performance metrics at a high performance level of 99.7%, achieving scores close to 1. This indicates that our modified network model exhibits superior diagnostic capabilities. Although the magnitude of improvement is relatively small, it has positive implications for reducing misdiagnosis and improving diagnosis.
However, despite the good progress made by deep learning models in identifying abnormalities on retinal OCT images, the limited datasets mean that it is not yet possible to verify how well they perform on other retinal OCT data. In the future, more retinal OCT data will be sought to validate and improve the network. In addition, turning a network model into a practical tool for clinical ophthalmologists remains a major challenge, requiring more professionals to work together to turn theoretical methods into products that improve ophthalmic diagnosis.

Figure 1. The overall framework of the proposed method.


Figure 3. (a) The hierarchical structure of Swin Transformer V2 for extracting multi-scale feature representation. (b) An illustration of the shifted window strategy for computing self-attention in the Swin Transformer V2 architecture.

Each Swin Transformer V2 block comprises two units, with each unit containing two normalization layers (LayerNorm), a self-attention module, and a multi-layer perceptron (MLP) layer. The standard multi-head self-attention (MSA) module from ViT is replaced in the Swin Transformer V2 block by two consecutive modules, the window multi-head self-attention (W-MSA) module and the shifted window multi-head self-attention (SW-MSA) module, as shown in Figure 2b. The first unit uses the W-MSA module, while the second unit uses the SW-MSA module. In contrast to the Swin Transformer, Swin Transformer V2 places a LayerNorm layer after each MSA module and MLP layer and applies a residual connection after each module.
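The difference between the W-MSA and SW-MSA units lies in how tokens are grouped before self-attention is computed. The following is a minimal sketch of that grouping on a small grid of token indices, assuming a 4 x 4 grid, a window size of 2, and a shift of half the window size; the function names are illustrative, and the attention computation itself and the masking needed for wrapped-around windows are left out.

```python
def window_partition(grid, window):
    """Split an H x W grid (nested lists) into non-overlapping window x window groups."""
    height, width = len(grid), len(grid[0])
    groups = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            groups.append(
                [grid[top + i][left + j] for i in range(window) for j in range(window)]
            )
    return groups


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[(i + shift) % height][(j + shift) % width] for j in range(width)]
        for i in range(height)
    ]


# Tokens of a 4 x 4 feature map, numbered row by row (0..15).
grid = [[row * 4 + col for col in range(4)] for row in range(4)]

w_msa_groups = window_partition(grid, 2)                    # regular windows
sw_msa_groups = window_partition(cyclic_shift(grid, 1), 2)  # shifted windows

print("W-MSA groups: ", w_msa_groups)
print("SW-MSA groups:", sw_msa_groups)
```

In the regular partition, token 5 only attends to tokens 0, 1, and 4; after the shift it shares a window with tokens 6, 9, and 10, which is how information crosses window borders. The last shifted group mixes tokens from opposite corners of the grid, which is why a real implementation masks attention between such tokens.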


Figure 4. Illustration of an efficient batch computation approach for self-attention in shifted window partitioning.


Figure 5. Optical coherence tomography images from the OCT2017 dataset. The panels display images of choroidal neovascularization (CNV) on the far left, diabetic macular edema (DME) on the middle left, drusen on the middle right, and a normal image on the far right.


Figure 6. Examples of the eight classes in the OCT-C8 dataset: AMD, CNV, CSR, DME, DR, DRUSEN, MH, and NORMAL.

Figure 7a,c are the results when CrossEntropy is applied, and Figure 7b,d are the results obtained with the PolyLoss function. The diagonal elements of each confusion matrix represent correct classifications, and the remaining elements represent misclassifications.

Figure 8. Gradient-weighted class activation mapping (Grad-CAM) on OCT2017 for our proposed network.


Figure 9. Gradient-weighted class activation mapping (Grad-CAM) on OCT-C8 for our proposed network.

): 83,484 training images and 968 test images. The training data comprise 36,205 CNV images, 10,348 DME images, 7616 DRUSEN images, and 25,315 NORMAL images for training, plus 1000 images from each of the four classes for validation. Details of the two datasets are shown in Table 1.
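The counts above can be checked with a few lines of arithmetic. The sketch below assumes one reading of the text, namely that the 83,484 figure covers both the images used for training and the 4000 images held out for validation; the variable names are illustrative.

```python
# Per-class counts quoted in the text for the OCT2017 training split.
train_counts = {"CNV": 36205, "DME": 10348, "DRUSEN": 7616, "NORMAL": 25315}
val_per_class = 1000  # validation images per class, as stated in the text

train_total = sum(train_counts.values())
val_total = val_per_class * len(train_counts)

print("training images:  ", train_total)
print("validation images:", val_total)
print("combined:         ", train_total + val_total)

# Share of each class among the training images, to show the class imbalance.
for name, count in train_counts.items():
    print(f"{name}: {count / train_total:.1%}")
```

Under this reading the four training counts sum to 79,484, and adding the 4000 validation images gives the quoted total of 83,484. The class shares also show that CNV images make up the largest part of the training data and DRUSEN images the smallest.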

Table 1. Classification and dataset setup for the OCT2017 and OCT-C8 datasets.


Table 2. Classification results on OCT2017 with the CrossEntropy loss function. Significant values are in bold.

Table 6. Averages of the experimental results using the CrossEntropy and PolyLoss functions on the OCT2017 and OCT-C8 datasets, respectively. Significant values are in bold.

Table 7. Experimental results of different models on the OCT2017 and OCT-C8 datasets, respectively. Significant values are in bold.