3.1. Dataset
To evaluate the performance of the proposed method, two public datasets were used for experimental verification.
Dataset 1 is a publicly available SD-OCT dataset [27] of patients with diabetic macular edema (DME) from Duke University, composed of 110 SD-OCT B-scan images, each of size 496 × 768. These images were taken from 10 patients with DME; 11 B-scans per patient were centered on the fovea, with 5 scans collected on each side of the foveal slice (the foveal slice plus transverse scans collected at ±2, ±5, ±10, ±15, and ±20 from it). These 110 B-scans were annotated by two clinical ophthalmologists for the retinal layers and the fluid region. In this experiment, the annotations of expert 1 and expert 2, respectively, were used as the gold standard for training the network, and the proposed method was compared and evaluated against the main existing methods.
Dataset 2 is the POne dataset [28], which consists of 100 SD-OCT B-scan images, each of size 496 × 610, collected from 10 healthy adult subjects. The POne dataset was annotated by two ophthalmic experts for eight layers, and the segmentation results of expert 1 were used as the gold standard for evaluation and comparison.
3.2. Experimental Environment and Configuration
In the experiments, PyTorch 1.10.1 (Meta AI, Menlo Park, CA, USA) was used as the deep learning framework, with an NVIDIA RTX 3090 GPU with 24 GB of video memory (NVIDIA, Santa Clara, CA, USA), Anaconda 4.10.1 (Anaconda Inc., Austin, TX, USA), and CUDA 11.1. Weights pre-trained on ImageNet were used to initialize the model parameters, with the input patch size P and resolution set to 16 and 224 × 224, respectively (unless otherwise stated). The SGD optimizer with momentum 0.9, learning rate 0.01, and weight decay 1 × 10−4 was used to optimize the model by backpropagation. The batch size was set to 24 by default, and the number of epochs to 150. In the comparison experiments, the other models were trained and tested on the OCT dataset starting from the pre-trained weights provided by their original authors. Data augmentation was applied mainly to Dataset 1: horizontal flip, mirror flip, Gaussian noise (noise standard deviation 0.12), and salt-and-pepper noise (noise amount 0.025) were used to enlarge the OCT dataset 20-fold, after which 60% of the images were randomly selected for the training set, 20% for the test set, and 20% for the validation set. No data from Dataset 2 were used for training.
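The augmentation and split procedure above can be sketched in a few lines of NumPy (a minimal illustration under our own function and variable names, not the released training code):

```python
import numpy as np

def augment(image, rng):
    """Produce augmented copies of one B-scan: two flips plus two noise types."""
    out = [np.fliplr(image),  # horizontal (left-right) flip
           np.flipud(image)]  # mirror (up-down) flip
    # Gaussian noise with standard deviation 0.12 (intensities assumed in [0, 1])
    out.append(np.clip(image + rng.normal(0.0, 0.12, image.shape), 0.0, 1.0))
    # Salt-and-pepper noise affecting 2.5% of the pixels
    noisy = image.copy()
    mask = rng.random(image.shape) < 0.025
    noisy[mask] = rng.integers(0, 2, size=mask.sum()).astype(float)  # 0 or 1
    out.append(noisy)
    return out

rng = np.random.default_rng(0)
scan = rng.random((496, 768))      # stand-in for one 496 x 768 B-scan
augmented = augment(scan, rng)

# Random 60% / 20% / 20% split of, e.g., 110 scans augmented 20-fold
indices = rng.permutation(2200)
train, test, val = np.split(indices, [1320, 1760])  # 1320 / 440 / 440 images
```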
3.3. Evaluation Metrics
The average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) were used to evaluate the performance of intraretinal layer segmentation. Well-known metrics including sensitivity (SE), specificity (SP), Jaccard similarity (JAC), and precision (PR) were also used to further demonstrate the advantages of the proposed model. Higher values of DSC, SE, SP, JAC, and PR and lower values of HD indicate better segmentation. In the formulae and brief descriptions of each metric that follow, TP denotes the count of correctly predicted positive samples; TN denotes the count of correctly predicted negative samples; FP denotes the count of negative samples wrongly predicted as positive; and FN denotes the count of positive samples wrongly predicted as negative.
For the speed evaluation of the model, the computational cost (FLOPs), parameter count (Params), and inference speed (FPS) of the neural network were selected. FLOPs measures time complexity, i.e., the computational cost of the algorithm. Params measures space complexity, i.e., the size of the model. FPS stands for frames per second, i.e., how many images the network can process per second.
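As an illustration of how Params and FLOPs are tallied, the counts for a single convolutional layer can be computed by hand (a generic sketch of the standard counting rules, not the paper's actual network):

```python
def conv2d_params(c_in, c_out, k):
    """Parameter count of a k x k convolution: weights plus one bias per filter."""
    return c_out * (c_in * k * k) + c_out

def conv2d_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate operations over the whole output feature map.
    FLOPs are often reported as either this count or twice it (mul + add)."""
    return c_out * h_out * w_out * (c_in * k * k)

# Example: 3 -> 64 channels, 3 x 3 kernel, 224 x 224 output (stride 1, padded)
p = conv2d_params(3, 64, 3)          # 1792 parameters
m = conv2d_macs(3, 64, 3, 224, 224)  # 86,704,128 multiply-accumulates
```

Summing these per-layer counts over the network gives the Params and FLOPs figures, while FPS is simply the number of images processed divided by wall-clock inference time.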
The Hausdorff distance (HD) is a measure describing the degree of similarity between two point sets, defined as a distance between the sets. Suppose there are two sets A = {a1, …, ap} and B = {b1, …, bq}; then the Hausdorff distance between the two point sets is defined as:

H(A, B) = max(h(A, B), h(B, A)), where h(A, B) = max_{a∈A} min_{b∈B} ‖a − b‖.  (8)
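The symmetric Hausdorff distance can be computed directly from this definition (a plain NumPy sketch suitable for small point sets; the helper names are ours):

```python
import numpy as np

def directed_hd(A, B):
    """h(A, B) = max over a in A of the distance from a to its nearest b in B."""
    # Pairwise Euclidean distances between every a and every b, via broadcasting
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(axis=1).max()

def hausdorff(A, B):
    """H(A, B) = max(h(A, B), h(B, A))."""
    return max(directed_hd(A, B), directed_hd(B, A))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0], [4.0, 0.0]])
print(hausdorff(A, B))  # 3.0: the point (4, 0) lies 3 away from its nearest a
```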
Specificity (SP) represents the proportion of truly negative samples correctly identified by the model as negative (TN). It is calculated as follows:

SP = TN / (TN + FP)  (9)

Sensitivity (SE), also known as recall, represents the proportion of truly positive samples correctly identified by the model as positive (TP). It is calculated as follows:

SE = TP / (TP + FN)  (10)

The Jaccard similarity (JAC) is used to compare the similarities and differences between finite sample sets. It is calculated as follows:

JAC = TP / (TP + FP + FN)  (11)

Precision (PR) represents the proportion of truly positive samples among those predicted to be positive. It is calculated as follows:

PR = TP / (TP + FP)  (12)

The Dice similarity coefficient (DSC) represents how similar the predicted segmentation is to the ground truth. It is calculated as follows:

DSC = 2TP / (2TP + FP + FN)  (13)

The false positive rate (FPR) is the proportion of truly negative samples that are wrongly predicted as positive. It is calculated as follows:

FPR = FP / (FP + TN)  (14)
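These count-based metrics reduce to a handful of ratios over the confusion-matrix entries; a compact reference implementation (our own helper, with illustrative counts):

```python
def metrics(tp, tn, fp, fn):
    """Segmentation metrics computed from confusion-matrix counts."""
    return {
        "SP":  tn / (tn + fp),               # specificity
        "SE":  tp / (tp + fn),               # sensitivity / recall
        "JAC": tp / (tp + fp + fn),          # Jaccard similarity
        "PR":  tp / (tp + fp),               # precision
        "DSC": 2 * tp / (2 * tp + fp + fn),  # Dice similarity coefficient
        "FPR": fp / (fp + tn),               # false positive rate
    }

# Illustrative counts for one layer treated as a binary problem
m = metrics(tp=90, tn=980, fp=10, fn=20)
# e.g. m["DSC"] = 180 / 210 ≈ 0.857 and m["SP"] = 980 / 990 ≈ 0.990
```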
Paired t-tests are often used to test whether two related samples come from normal populations with the same mean; in essence, they test whether the mean of the paired differences differs from zero. SPSS software was used to perform the paired t-tests on the data. If the p-value is less than 0.05, the difference is considered statistically significant.
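The paired t statistic itself is straightforward to compute (SPSS, or scipy.stats.ttest_rel, then supplies the p-value from the t distribution with n − 1 degrees of freedom); a hand-rolled sketch on hypothetical per-image Dice scores, not the paper's data:

```python
import math

def paired_t_statistic(x, y):
    """t = mean(d) / (s_d / sqrt(n)) for the paired differences d = x - y."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance of d
    return mean / math.sqrt(var / n)

# Hypothetical per-image Dice scores for two models on the same five images
ours = [0.91, 0.90, 0.92, 0.89, 0.93]
other = [0.88, 0.87, 0.90, 0.86, 0.91]
t = paired_t_statistic(ours, other)  # compare against a t-table with df = n - 1
```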
3.4. Experimental Results
The proposed model was experimentally compared with four effective current methods, U-Net, ReLayNet, Swin-Unet, and TransUNet, on the SD-OCT dataset. The comparison results are shown in Table 1.
First, taking expert 1 as the gold standard, the overall average Dice coefficient of the proposed algorithm is 0.904, exceeding U-Net, ReLayNet, Swin-Unet, and TransUNet by 0.079, 0.062, 0.043, and 0.023, respectively. It is also superior to the above four methods in the Dice coefficient score of each layer, with particularly large improvements in the INL and OPL layers. Therefore, from the perspective of the Dice coefficient, the layer segmentation results of the proposed algorithm are better than those of the other methods, and the segmentation accuracy of each layer is improved.
Second, taking expert 2 as the gold standard, the overall average Dice coefficient of the proposed algorithm is 0.903, exceeding U-Net, ReLayNet, Swin-Unet, and TransUNet by 0.078, 0.056, 0.04, and 0.031, respectively. It is likewise superior to the above four methods in the Dice coefficient score of each layer, again confirming that the proposed algorithm improves the segmentation accuracy of each layer.
Since the data in Table 1 are all mean values, the standard deviations of all values were calculated and error bars drawn, as shown in Figure 6.
It can be seen from Figure 6 that the standard deviation of the proposed method is the smallest, i.e., the distance between the upper and lower error bars is the shortest, indicating that the segmentation results of the proposed method are more stable and fluctuate less.
Among the compared methods, TransUNet achieved the second-best performance in the seven categories, so the proposed method and TransUNet were statistically compared by paired t-test to obtain significance p-values. The improvements on all seven layers are significant (p < 0.05), indicating that the proposed model significantly outperforms the four existing methods.
Furthermore, the proposed model was compared with the above four models in terms of Hausdorff distance (HD), sensitivity (SE), specificity (SP), Jaccard similarity (JAC), and precision (PR). As shown in Table 2, the HD of the proposed method was 2.43 mm, 0.23 mm lower than that of TransUNet, the best of the compared models. The specificity of the proposed method was 0.9979, the sensitivity (recall) 0.911, the Jaccard similarity 0.828, and the precision 0.897, higher than the corresponding indices of TransUNet by 0.0002, 0.018, 0.037, and 0.025, respectively. The proposed model thus outperforms TransUNet, the best of the four previous models, and obtains the best results in all six quantitative metrics, indicating the high precision, low error, and robustness of the proposed method for intraretinal layer segmentation of OCT images.
To further evaluate the predictive ability on the different layers, the seven-layer classification is treated as seven imbalanced binary classification problems, each evaluated and visualized with the receiver operating characteristic (ROC) curve. The abscissa of the ROC curve is the false positive rate, computed by Formula (14); the ordinate is the true positive rate, which is identical to recall, Formula (10). The ROC curves of U-Net, ReLayNet, Swin-Unet, TransUNet, and the proposed method for the seven-layer segmentation of the retina are compared in Figure 7, with the area under the ROC curve (AUC) as the metric for evaluating the models.
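Treating each layer as a binary problem, the ROC curve and its AUC can be obtained by sweeping a threshold over the predicted scores (a generic NumPy sketch on toy scores with no ties, not the paper's data):

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC points at every score threshold plus trapezoidal AUC."""
    order = np.argsort(-scores)      # rank predictions by descending score
    labels = labels[order]
    tps = np.cumsum(labels)          # true positives above each threshold
    fps = np.cumsum(1 - labels)      # false positives above each threshold
    tpr = np.concatenate(([0.0], tps / tps[-1]))   # true positive rate (recall)
    fpr = np.concatenate(([0.0], fps / fps[-1]))   # false positive rate
    # Trapezoidal rule over the (fpr, tpr) curve
    auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)
    return fpr, tpr, auc

# Toy scores for three positive and three negative pixels of one layer
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0])
fpr, tpr, auc = roc_auc(scores, labels)  # auc = 8/9 for this toy example
```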
As shown in Figure 7, panels (a–g) show the ROC curves of the proposed method (red line) and the four other methods for segmenting each intraretinal layer of fundus OCT images. The AUC values of the different methods for each layer are shown in Table 3.
Table 3 compares the proposed method with U-Net, ReLayNet, Swin-Unet, and TransUNet in terms of AUC values. The AUC values of the proposed method, listed in the last row, are higher than those of the four compared methods on every layer.
Figure 8 shows (a) the original image, (b,c) the labeled images of expert 1 and expert 2, respectively, (d) the U-Net prediction, (e) the ReLayNet prediction, (f) the Swin-Unet prediction, (g) the TransUNet prediction, and (h) the prediction of the proposed method. As can be seen from the figure, the U-Net predictions are very fuzzy at the boundaries, so the layers are not clearly separated. ReLayNet does not produce clear upper and lower boundaries for the OS-RPE layer. Swin-Unet makes errors in the NFL layer, indicated by a white arrow in the figure. The TransUNet predictions are relatively fuzzy at the boundaries of the first and last layers. The segmentation results of the proposed method are closest to the labeled images.