1. Introduction
In recent years, the application of deep learning in medical imaging has sparked a paradigm shift in ophthalmology, heralding a new era of automated and precise diagnosis. Optical coherence tomography (OCT), a cornerstone of modern ophthalmic practice, offers unparalleled insight into ocular anatomy and pathology. By harnessing deep learning algorithms, OCT analysis has moved beyond traditional manual interpretation, enabling rapid, accurate, and standardized diagnoses that promise to transform patient care. An OCT image shows each layer of the retina at high resolution, as shown in Figure 1a. By interpreting OCT images, ophthalmologists can detect structural changes in the eye and investigate many pathologies, such as age-related macular degeneration (AMD), epiretinal membrane (ERM), and macular edema (ME), as depicted in Figure 1b, Figure 1c, and Figure 1d, respectively. Diagnosing these diseases at an early stage plays an important role in preventing vision loss.
Previous studies on automatic diagnosis can be categorized into feature-based and deep learning-based methods. Feature-based methods typically adopt image processing techniques such as histograms of oriented gradients (HOG) [1], local binary patterns (LBP) [2], and the scale-invariant feature transform (SIFT) [3] to extract features for the final classifier. Although these methods have achieved promising results when labeled data are scarce or computational resources are limited, their limited representation capability means they do not capture all relevant information in OCT images, which reduces diagnostic accuracy. Another challenge is that choosing a feature extraction method requires domain-specific expertise, which makes it difficult for non-experts to develop effective classifiers [4,5].
Deep learning-based methods have emerged as a popular approach for disease diagnosis in OCT images because of their ability to learn complex features directly from raw data. These models have shown state-of-the-art performance [5,6,7] in disease classification, demonstrating diagnostic accuracy and sensitivity comparable to those of ophthalmologists. Transfer learning, which fine-tunes a pre-trained model on a smaller labeled dataset, is a commonly used strategy for eye disease classification with OCT images. However, deploying pre-trained models in practical applications is challenging because of their large number of parameters and high computational requirements [5]. Researchers have also attempted to improve performance by incorporating multi-scale features [8,9] and additional information such as the region of interest [10] and disease symptoms [11]. As a result, the reported approaches require considerable computational resources and effort to design the model and extract the necessary information [5].
Despite significant progress in the use of OCT for the diagnosis and management of retinal diseases, current classification methods still have limitations. A prominent challenge is handling patients who present with multiple diseases concurrently. Notably, many existing studies focus on a single disease, typically AMD [9] or ME [2]. Other studies classify multiple diseases but restrict their data to images containing only a single disease [5,6,7], which limits their practicality for real-world applications. To the best of our knowledge, the largest and most widely used OCT dataset is OCT2017 [6], which contains 83,484 images with single-disease labels. The lack of a benchmark multi-label OCT dataset, in which an image may contain signs of one or multiple diseases, limits the application of current AI diagnosis models in clinical environments.
In this paper, we collect and annotate a large-scale multi-label OCT dataset with approximately 33,000 images. Each image in this dataset is annotated with one or more diseases, including AMD, ERM, and ME. To perform multi-disease diagnosis on this extensive multi-label OCT dataset, we propose a simple yet effective multi-scale sparse residual network (MS-SRN). First, the multi-scale learning (MSL) method effectively exploits information from OCT images of different sizes to address the problem of variously sized disease lesions, improving classification performance and enhancing interpretability, as shown in Figure 2. MSL proves effective at improving the performance of different convolutional neural networks (CNNs). Second, the lightweight SRN consists of six convolutional blocks and employs residual learning for efficient training. The proposed SRN uses only 6.1% of the learnable parameters of ResNet-101 yet achieves similar performance on all evaluation metrics; its reduced parameter count and complexity make it suitable for real-time applications. The combination of MSL and SRN significantly outperforms other methods for multi-disease diagnosis in OCT images.
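To illustrate the parameter-saving principle behind the SRN, the following is a minimal PyTorch sketch of a residual block built from stacked small-kernel convolutions. The block name, channel widths, and layer arrangement are our own illustrative assumptions, not the paper's exact architecture.

```python
import torch.nn as nn

class FactorizedResBlock(nn.Module):
    """Residual block using stacked 3x3 convolutions instead of one
    large-kernel layer. Stacking k 3x3 convolutions reaches the receptive
    field of a (2k+1)x(2k+1) kernel with far fewer weights, e.g., three
    3x3 layers use 27*C^2 weights versus 49*C^2 for a single 7x7 layer."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the skip connection matches the body's shape.
        self.skip = (
            nn.Identity()
            if stride == 1 and in_ch == out_ch
            else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual learning: add the (projected) input back to the output.
        return self.act(self.body(x) + self.skip(x))
```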
The main contributions of this paper are summarized as follows:
We collected and annotated a large-scale multi-label OCT dataset with approximately 33,000 images, where each image is labeled as normal or abnormal with one or multiple diseases, including AMD, ERM, and ME.
We propose a simple yet effective MSL method that fuses information from images of different sizes to improve classification performance and enhance visual interpretability. MSL shows its robustness when applied to different CNN architectures.
The proposed SRN is a minimal residual network in which convolutional layers with large kernel sizes are replaced with multiple convolutional layers with smaller kernel sizes, thereby reducing model complexity while achieving better performance than large-kernel CNNs.
Comprehensive experiments show that the proposed MS-SRN significantly outperforms existing methods in terms of accuracy, sensitivity, and specificity. By combining MSL and SRN, we achieve superior performance while reducing computational cost.
The remainder of this article is organized as follows: Section 2 summarizes related work. Section 3 formulates the problem and describes the proposed method and workflow in detail. Section 4 describes the datasets, implementation details, and evaluation metrics. Section 5 presents the results of the performance evaluation. Finally, Section 6 concludes the article.
2. Related Work
Recently, deep learning has brought about significant advancements in the interpretation of OCT images. This progress extends to various tasks, such as retinal layer and fluid segmentation [12,13,14,15], noise removal [16,17], image super-resolution [18,19], image generation [20], and disease classification [21,22]. For instance, for retinal layer and fluid segmentation, the authors of [12] proposed a new convolutional neural architecture, RetiFluidNet, for multi-class retinal fluid segmentation. RetiFluidNet benefits from hierarchical representation learning of textural, contextual, and edge features via the attention mechanism [23]. On the other hand, OCT images are inevitably corrupted by speckle noise due to the coherence characteristics of scattered light. To enhance OCT image quality, Zhou et al. [17] computed the weights of non-local means using deep features extracted by a self-supervised transformer and adopted a boosting strategy to achieve effective OCT image denoising. In terms of disease classification, existing studies can be categorized into feature-based and deep learning-based methods.
Feature-based methods: Traditional machine learning approaches for automatic disease classification in OCT images consist of three main blocks: preprocessing, feature extraction, and classifier design. The preprocessing block, which involves techniques such as image denoising [24] and retinal flattening [3], removes unwanted or redundant information from the raw input data, allowing the model to extract meaningful information in the following stage. Next, feature descriptors such as histograms of oriented gradients [1], local binary patterns [2], and the scale-invariant feature transform [3] are employed to manually extract features. Finally, the extracted features are fed into a classifier such as a random forest [25], a Bayesian classifier [23], or a support vector machine [2] to complete the classification. Although machine learning approaches have demonstrated promising results, they have several limitations. First, manual feature engineering is a time-consuming task that requires expertise, making it inefficient for building a large and comprehensive database. Furthermore, expert interpretations may differ, leading to results that may not be acceptable to other experts.
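As a concrete illustration of this three-block pipeline, the following is a toy sketch using non-local means denoising, HOG features, and an SVM classifier. The parameter values are placeholders, and `X_train`/`y_train` stand in for a labeled set of grayscale OCT B-scans; this is not the exact pipeline of any cited work.

```python
import numpy as np
from skimage.feature import hog
from skimage.restoration import denoise_nl_means
from sklearn.svm import SVC

def extract_features(img):
    """Classical pipeline: denoise the B-scan, then compute
    hand-crafted HOG features."""
    img = denoise_nl_means(img)  # preprocessing: speckle-noise reduction
    return hog(img, orientations=9,
               pixels_per_cell=(16, 16), cells_per_block=(2, 2))

# X_train: list of grayscale OCT B-scans; y_train: disease labels (placeholders).
X = np.stack([extract_features(img) for img in X_train])
clf = SVC(kernel="rbf").fit(X, y_train)
```

Every stage here (denoiser choice, HOG cell sizes, kernel type) must be tuned by hand, which is precisely the expertise bottleneck noted above.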
Deep learning-based methods: Previous studies [6,7] have employed CNNs such as AlexNet [26] and InceptionNet [27] pre-trained on ImageNet [28] and fine-tuned them using transfer learning. These models achieve accuracies of 97.1% and 96.1%, respectively, on the OCT2017 dataset [6]. However, the large number of parameters involved in pre-trained networks with transfer learning makes such systems complex and generally unsuitable for real-time deployment. To address this issue, Sunija et al. [5] proposed a lightweight CNN called OCTNet that achieves state-of-the-art (SOTA) performance with 99.6% accuracy on the OCT2017 dataset.
Multi-scale learning is another approach to disease classification in OCT images. Thomas et al. [9] proposed a multi-scale CNN with seven convolutional layers, allowing the network to detect a large number of local structures with different filter sizes to classify normal vs. AMD images, whereas Saman et al. [4] introduced a multi-scale CNN based on the feature pyramid network structure for single-disease multi-class classification. V. Das et al. [8] proposed a multi-scale deep feature fusion approach using four CNNs, which increases inference time and computational complexity. The limitation of these methods is that they require sophisticated model designs and are less effective on challenging tasks such as multi-disease classification.
Attention-based methods have also been explored for disease classification using OCT images. For example, Fang et al. [11] demonstrated that detected macular lesion information can guide a network to focus on discriminative features and ignore insignificant information. However, their approach employs two separate networks, a lesion detection network and a lesion-aware convolutional neural network, which increases computational complexity. Similarly, Huang et al. [10] used ReLayNet [29] for retinal layer segmentation and then employed a layer-guided convolutional neural network (LGCNN) to integrate the extracted information for classification. However, these methods are specific to eye diseases whose symptoms are easily detected, and their performance is significantly affected by the quality of the extracted information [11].
4. Experiments
In this section, we first describe our OCT dataset and the metrics used for performance evaluation. We then provide the implementation details used to train our method.
Dataset: The largest and most widely used OCT dataset in previous studies is OCT2017 [6], which contains 83,484 images with single-disease labels. Various studies have used this dataset to classify retinal pathologies in OCT images. However, the coexistence of multiple symptoms makes accurate diagnosis challenging. We therefore collect a large OCT dataset for the multi-disease classification task, as presented in Table 1. High-quality OCT videos captured with a Spectralis device are collected and anonymized to protect patients' privacy. Each OCT video is split into frames, which are manually labeled by two ophthalmologists from Kangbuk Samsung Hospital (KBSMC). In particular, the labels annotated by a junior doctor are reviewed and verified by a senior doctor for accuracy and quality assurance. Figure 6 shows the distribution of our dataset.
Evaluation metrics: For each class, accuracy (Acc), sensitivity (Sen), and specificity (Spe) are used for performance evaluation. Based on the ophthalmologists' opinion, we calculate the micro-average (μ-average) of each metric to obtain a more accurate representation of overall performance. Micro-average accuracy is determined by aggregating the counts of true negatives, true positives, false negatives, and false positives across all classes and subsequently calculating the accuracy. Micro-average sensitivity is computed by summing the counts of false negatives and true positives across all classes and then calculating the sensitivity. Micro-average specificity is derived by summing the counts of false positives and true negatives across all classes and then calculating the specificity:

$$\mathrm{Acc}_{\mu} = \frac{\sum_{c=1}^{C} (TP_c + TN_c)}{\sum_{c=1}^{C} (TP_c + TN_c + FP_c + FN_c)}, \qquad \mathrm{Sen}_{\mu} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} (TP_c + FN_c)}, \qquad \mathrm{Spe}_{\mu} = \frac{\sum_{c=1}^{C} TN_c}{\sum_{c=1}^{C} (TN_c + FP_c)},$$

where $C$ denotes the number of classes and $TP_c$, $TN_c$, $FP_c$, and $FN_c$ denote the numbers of true positives, true negatives, false positives, and false negatives for class $c$, respectively.
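The following is a short NumPy sketch of these micro-averaged metrics, assuming binary per-class predictions for the multi-label setting; the function name and array layout are illustrative choices.

```python
import numpy as np

def micro_metrics(y_true, y_pred):
    """Micro-averaged accuracy, sensitivity, and specificity for
    multi-label predictions. y_true, y_pred: (N, C) binary arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)  # micro-average sensitivity (recall)
    spe = tn / (tn + fp)  # micro-average specificity
    return acc, sen, spe
```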
Implementation details: The entire dataset is split into a training set (80%) and a testing set (20%). We first resize the OCT images and then apply data augmentation techniques such as random rotation and horizontal/vertical flipping. The proposed method is implemented in the PyTorch framework with randomly initialized weights on an NVIDIA A6000 GPU (48 GB). The batch size, learning rate, and number of epochs are set to 64, 0.003, and 200, respectively. The stochastic gradient descent (SGD) optimizer is adopted with momentum and weight decay set to 0.9 and 0.0001, respectively. All experiments are conducted with five different seeds, and the mean and standard deviation are reported to ensure solid, reproducible results.
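For concreteness, the following is a minimal PyTorch sketch of this training setup. The target image size, rotation range, loss function (binary cross-entropy for the multi-label task), and the `MSSRN` and `train_loader` names are assumptions; only the optimizer settings, batch size, learning rate, and epoch count come from the text above.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Augmentation as described: resize, random rotation, horizontal/vertical flip.
# The 224x224 size and 10-degree range are placeholders, not from the paper.
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(10),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
])

model = MSSRN()  # hypothetical class standing in for the proposed MS-SRN
optimizer = torch.optim.SGD(model.parameters(), lr=0.003,
                            momentum=0.9, weight_decay=1e-4)
# Multi-label diagnosis: one sigmoid output per disease (loss choice assumed).
criterion = nn.BCEWithLogitsLoss()

for epoch in range(200):
    for images, labels in train_loader:  # DataLoader with batch_size=64
        optimizer.zero_grad()
        loss = criterion(model(images), labels.float())
        loss.backward()
        optimizer.step()
```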
6. Discussion
The proposed MS-SRN not only outperforms other methods in multi-disease classification on OCT images, as presented in Table 2, but also provides insights into its performance, as illustrated in Figure 7. By combining information from images at different scales, the MSL method demonstrates the advantage of identifying lesions that may not be visible at a single scale but become distinguishable at higher or lower scales. Additionally, MSL is a general method that can be applied to other CNNs such as VGGNet, ResNet, and OCTNet (Table 4). Notably, SRN achieves performance similar to that of ResNet while containing considerably fewer parameters (Table 9).
One limitation of our work is the simplicity of the proposed multi-label OCT dataset, which includes only three diseases and a normal class. Although it serves as a valuable pilot dataset for multi-disease diagnosis, it does not fully capture the complexity of clinical scenarios in which patients may present with multiple concurrent diseases. In future work, we plan to expand the dataset to include a wider range of diseases, making it more representative and enhancing the model's versatility for real-world medical cases. Furthermore, the current multi-scale learning method simply concatenates the model outputs from the local and global branches and produces the prediction via a fully connected layer, as sketched below; other forms of information fusion could further improve the performance of the MSL method. Additionally, active learning with doctor assistance [36] has been shown to improve the performance of diagnosis systems. In future work, we will train our method in an active learning manner with the help of ophthalmologists to improve its effectiveness and robustness.
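The following sketch shows the concatenation-based fusion described above. The class name, the 0.5 downscaling factor, and the assumption that each branch outputs a flat feature vector of dimension `feat_dim` are ours; the paper specifies only that branch outputs are concatenated and fed to a fully connected layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Two-branch fusion: features from a global (downscaled) view and a
    local (full-resolution) view are concatenated and classified by a
    single fully connected layer."""
    def __init__(self, backbone_global, backbone_local, feat_dim, num_classes):
        super().__init__()
        self.g, self.l = backbone_global, backbone_local
        self.fc = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, x):
        # Downscale the input for the global branch (factor assumed).
        x_small = F.interpolate(x, scale_factor=0.5, mode="bilinear",
                                align_corners=False)
        # Concatenate branch features, then predict with one FC layer.
        z = torch.cat([self.g(x_small), self.l(x)], dim=1)
        return self.fc(z)
```

Replacing the concatenation with, e.g., attention-weighted or gated fusion is one of the directions for future improvement mentioned above.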
7. Conclusions
In this paper, we construct and annotate a large-scale multi-label OCT dataset with approximately 33,000 images carrying multi-disease labels. To perform multi-disease diagnosis on this dataset, we propose a simple yet effective approach, MS-SRN. By capturing both local and global features from input images of different sizes, the MSL method not only improves performance but also enhances the interpretability of CNNs through visual discrimination. For the proposed SRN, we employ factorization and residual learning principles to reduce complexity while achieving performance similar to that of existing CNNs. In particular, a convolutional layer with a large kernel size is factorized into multiple convolutional layers with small kernel sizes to reduce the number of parameters. Through extensive experiments on our multi-label OCT dataset, the proposed MS-SRN demonstrates its effectiveness and significantly outperforms other models in terms of accuracy, sensitivity, and specificity. Our method shows the potential to improve the diagnosis and treatment of a wide range of eye diseases, and its reduced complexity makes it suitable for real-time applications, enabling efficient and timely decision-making in clinical settings. In future work, we will address the limitations discussed in the Discussion section.