In this chapter, we mainly present our segmentation results and compare them with those of different segmentation networks. The proposed method is implemented on a GPU (GeForce GTX 1080 Ti) and a CPU (3.4 GHz Intel Core) using the Caffe framework.
3.1. Data Preparation
We have refined and built our own optimized database based on the characteristics of the public databases.
3.1.1. Public Database
The existing public retinal fundus databases include the DRIVE, DIARETDB0, DIARETDB1, MESSIDOR, e-Optha EX, e-Optha MA, and STARE libraries. Each library provides blood vessel annotations, but annotations of features other than blood vessels are largely missing from these databases.
The DRIVE database [27] consists of 40 color retinal fundus images captured by a Canon non-mydriatic 3CCD camera with a resolution of 768 × 584, a 45° FOV, and 8 bits per color channel. The database contains 33 images of healthy retinas and seven images of pathological retinas. The set of images is divided into a training set and a test set, each containing 20 images. The database provides ground-truth annotation of blood vessels for the test set. However, the dataset is limited in size and provides only a single image resolution, which is not ideal for additional fundus image testing. The DIARETDB0 and DIARETDB1 databases [28] contain 130 and 89 color retinal fundus images, respectively. Of the 130 DIARETDB0 images, 110 show signs of diabetic retinopathy (e.g., hard exudates, soft exudates, microaneurysms, or hemorrhages) and 20 show normal retinas. Of the 89 DIARETDB1 images, 84 contain at least mild signs of diabetic retinopathy, including exudates, and the remaining five are healthy retinal images. However, these annotations cover only the lesion areas of the retinal images.
The MESSIDOR database [29] consists of 1200 retinal fundus images taken with a Topcon TRC NW6 non-mydriatic fundus camera at resolutions of 1440 × 960, 2240 × 1488, and 2304 × 1536 pixels. This database provides diagnostic results for the retinal images, but it does not provide a reference standard for segmentation of fundus retinopathy.
The e-Optha EX and e-Optha MA databases are described in [30]. The e-Optha EX database consists of 82 retinal fundus images: 35 healthy retinal images and 47 pathological images containing exudates associated with mild, moderate, and severe diabetic retinopathy. The e-Optha MA database consists of 233 healthy retinal fundus images and 148 images with microaneurysms (MAs) or small hemorrhages. These annotations are useful for accurately assessing the efficiency of exudate and MA segmentation methods. Unfortunately, there are no annotations of common features such as the optic cup, optic disc, or macula.
The STARE database [31] consists of 400 color retinal fundus photographs with a resolution of 605 × 700. This database provides diagnostic results for diabetic retinopathy based on a single retinal fundus image, but it does not delineate the various retinopathy structures, such as exudates and Roth spots. Moreover, after careful comparison of the data, the database labeling differs considerably from the original images, and the edges are processed with insufficient detail, which seriously affects training results. After a detailed comparison and fitting, only 20 groups of retinal data have reference value.
In summary, the open-source benchmark databases have the following limitations: (1) they contain a limited number of retinal images, and the pixel resolution is generally low; (2) they contain only clear, high-contrast original retinal images; (3) they do not contain complete ground-truth information, e.g., labeling of all of the retina's features; and (4) the annotations differ significantly from the original images, and some lack annotation detail.
3.1.2. Optimized Database
Our comprehensive database contains 1500 retinal fundus images of 283 patients, including 444 images of 1444 × 1444 pixels and 940 images of 2124 × 2056 pixels. The images come from men aged 25–83 years and women aged 24–75 years. Additionally, each pathological grade is approved by a hospital ophthalmologist. Most of the retinal fundus images are longitudinal, i.e., the data contain images of a single patient over a period of time, which facilitates later analysis of the patient's condition. The structure of the database is shown in Figure 2. In the experiments, we use the optimized database to train and test the network segmentation model.
Our database is improved in nature: it is not merely a numerical expansion of the open-source databases, but it also supports a simple classification of disease level. The most important goal is to solve the problem of the poor quality of some fundus photographs, such as retinal images containing dark areas, so that the fundus image can be separated from the background image, as shown in the third column of Figure 1. Therefore, the first step is to segment the retinal image from the background. We grouped image pixels with similar properties using a k-means clustering method [32] to reduce the number of distinct colors in the retinal fundus image, and then applied a thresholding method to segment the FOV boundary from the background of the retinal fundus image. Since the goal is to distinguish the FOV from the background, the threshold must be greater than the intensity value of the background pixels. We compared the results while varying the threshold from 0 (corresponding to black) to 30. Initially, FOV segmentation improved as the threshold increased, but beyond a value of 23 there was no significant improvement and the FOV boundaries became distorted, because pixels belonging to the inner region of the FOV were included. Therefore, we set the threshold to 23. Some retinal images and their corresponding segmented FOVs are given in Figure 3.
Figure 3a–c shows original retinal fundus images with a dark background area, a full field-of-view circle, and a truncated circle, respectively. We see that the proposed method is capable of segmenting the FOV without being affected by the dark areas of the fundus image, as shown in
Figure 3d–f. This can also solve the problem of false detection of FOV edges, as shown in
Figure 4.
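As a concrete reference for this preprocessing step, the following minimal sketch reproduces the pipeline described above (k-means color quantization followed by thresholding at 23) in Python with OpenCV. The number of clusters and the largest-connected-component cleanup are our own illustrative assumptions, not details taken from the paper.

```python
import cv2
import numpy as np

def segment_fov(image_bgr, n_colors=8, threshold=23):
    """Separate the circular field of view (FOV) from the dark background."""
    # 1. k-means clustering to reduce the number of distinct colors.
    pixels = image_bgr.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, n_colors, None, criteria,
                                    3, cv2.KMEANS_PP_CENTERS)
    quantized = centers[labels.ravel()].reshape(image_bgr.shape).astype(np.uint8)

    # 2. Threshold the gray-level image: background pixels fall below 23
    #    on the 0-255 scale, FOV pixels lie above it.
    gray = cv2.cvtColor(quantized, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)

    # 3. Keep only the largest connected component (the FOV disc); this
    #    cleanup step is our addition, not described in the paper.
    _, cc = cv2.connectedComponents(mask)
    counts = np.bincount(cc.ravel())
    counts[0] = 0  # ignore the background label
    return np.where(cc == counts.argmax(), 255, 0).astype(np.uint8)
```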
We accurately labeled the blood vessels, bleeding points, exudates, optic discs, and macula within the FOV. Hospital clinical ophthalmologists continually reviewed and calibrated the labeled results. Finally, all results were corrected by a fundus specialist to achieve the most accurate labeling.
3.2. Segmentation Criteria
To compare the similarities and differences between the segmentation results and the manual calibration results more objectively, four quantitative statistical indicators are introduced: (1) true positives (TP) are pixels that are actually blood vessels and are correctly recognized by the model as blood vessels; (2) false negatives (FN) are pixels that are actually blood vessels but are recognized by the model as non-vascular; (3) true negatives (TN) are pixels that are actually non-vascular and are correctly identified by the model as non-vascular; (4) false positives (FP) are pixels that are actually non-vascular but are recognized by the model as blood vessels. From these counts, the sensitivity (Se), specificity (Sp), and accuracy (Acc) of the trained network are evaluated via the expressions:

Se = TP/(TP + FN), Sp = TN/(TN + FP), Acc = (TP + TN)/(TP + TN + FP + FN).

The receiver operating characteristic (ROC) curve is drawn according to the relationship between Se and 1 − Sp, and the area under the ROC curve (AUC) reflects the performance of the segmentation method; an AUC of 1 corresponds to a perfect classifier. We first validate the performance of our proposed method via a grading method that uses K1, K2, K3, K4, and binary segmentation. The detection of diabetic eye disease plays a critical role in clinical diagnosis, where false positives and false negatives need to be avoided. Thus, we validate the performance using multiple evaluation protocols, including accuracy, sensitivity, and specificity.
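As a concrete illustration of these criteria, the sketch below computes Se, Sp, Acc, and the AUC from a binary ground-truth mask and a predicted probability map. The 0.5 decision cut-off and the use of scikit-learn's roc_auc_score are assumptions made for the example, not details from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(gt_mask, prob_map, cutoff=0.5):
    """Compute Se, Sp, Acc, and AUC for one predicted probability map."""
    pred = prob_map >= cutoff       # assumed binarization cut-off
    gt = gt_mask.astype(bool)
    tp = np.sum(pred & gt)          # vessel pixels found as vessels
    fn = np.sum(~pred & gt)         # vessel pixels missed
    tn = np.sum(~pred & ~gt)        # background correctly rejected
    fp = np.sum(pred & ~gt)         # background flagged as vessel
    se = tp / (tp + fn)             # sensitivity
    sp = tn / (tn + fp)             # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)
    auc = roc_auc_score(gt.ravel(), prob_map.ravel())
    return se, sp, acc, auc
```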
Table 1 records the performance on all testing images using the K1, K2, K3, K4, and binary segmentation evaluation protocols.
From Table 1, we see that the larger the input image in a branch (i.e., as the image size increases), the lower the segmentation accuracy. This is because the number of input image pixels increases too quickly.
Figure 5 provides the ROC curve of our multi-seg-CNN. According to Figure 5, our method achieves excellent performance during the automated diagnosis step and meets the requirements for segmenting the retinal anatomy and pathological structures.
3.3. Training
The various structures of the retina differ from one another, and each structure can be assigned to one of two categories according to our network structure. One type we call the concentrated structures, e.g., the optic disc and the macula. The other type we call the overall structures, e.g., blood vessels, exudates, and Roth spots. A concentrated structure usually appears only in a certain region of the image and occupies a small proportion of the overall image, so its structural integrity is lower. An overall structure generally appears in various places throughout the image and occupies a larger proportion of the whole image.
In summary, we use two data input methods. For the concentrated structures, we perform a preprocessing step, as shown in Figure 6: we extract the concentrated structure from the complete image, as in (a) and (b) of Figure 6, so that the structure to be segmented stands out more clearly, and then select the extracted region as the input. The purpose of this step is to adapt the input to the network by increasing the proportion of the concentrated structure within the whole image. Because this type of structure is relatively simple, we use a single input scale, e.g., 224 × 224, to improve speed. For the overall structures, we use the multiscale input network shown in Figure 1, with inputs of 224 × 224, 448 × 448, 896 × 896, and the complete image. Each scale captures local detail without ignoring the integrity of the overall structure, and the weights are derived using the back-propagation algorithm [33]. A block diagram of the training parameter algorithm is shown in Figure 7. There is no need to distinguish between the two structure types during training.
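A minimal sketch of these two input pipelines is given below. It assumes the crop center of a concentrated structure is already known (e.g., from the extraction step in Figure 6); all helper names are illustrative.

```python
import cv2

def concentrated_input(image, center, size=224):
    """Crop a size x size patch around a concentrated structure
    (optic disc or macula); the crop is clamped at the image border."""
    cy, cx = center
    half = size // 2
    patch = image[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return cv2.resize(patch, (size, size))

def multiscale_inputs(image, scales=(224, 448, 896)):
    """Build the multiscale pyramid for the overall structures:
    224 x 224, 448 x 448, 896 x 896, plus the complete image."""
    pyramid = [cv2.resize(image, (s, s)) for s in scales]
    pyramid.append(image)  # complete image as the fourth input
    return pyramid
```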
In our method, each branch has a loss function. We denote Loss_K1, Loss_K2, Loss_K3, Loss_K4, and Loss_Bin as the branch one loss, branch two loss, branch three loss, branch four loss, and binary loss from the final outputs. The objective function of the whole network is given by Equation (5):

Loss = Loss_K1 + Loss_K2 + Loss_K3 + Loss_K4 + Loss_Bin. (5)
In each branch, the loss function can be back-propagated from the sub-network to the task-specific network. Accordingly, the networks can be updated based on the four classification tasks. Our design considers the three predictions in a generalized framework, which is well suited to the back-propagation of our multi-task deep neural network. The loss curves for the first 40 epochs of the training procedure are shown in Figure 8.
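For illustration, the objective of Equation (5) can be assembled as a weighted sum of the branch losses, as in the sketch below; the uniform default weights are our assumption.

```python
def total_loss(loss_k1, loss_k2, loss_k3, loss_k4, loss_bin, weights=None):
    """Combine the five branch losses into the overall objective;
    this is a hedged sketch of Equation (5), not the paper's code."""
    losses = [loss_k1, loss_k2, loss_k3, loss_k4, loss_bin]
    if weights is None:
        weights = [1.0] * len(losses)  # assumed uniform weighting
    return sum(w * l for w, l in zip(weights, losses))
```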
In summary, the loss can be defined as in Equation (6):

Loss = ∑_{m=1}^{M} α_m · Loss_m, (6)

where M is the number of detection branches and α_m is the weight of the corresponding loss. The corresponding loss of each detection layer is defined in Equations (7) and (8), and the optimal parameters are defined as in Equation (9).
For each detection layer m, there is a corresponding training sample set. For a single image, the distribution of pixels belonging to an object and pixels belonging to none is unbalanced; therefore, we modify the sampling step to eliminate this imbalance via Equation (10). There are three sampling strategies: random, bootstrapping, and mixture, as sketched below.
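The following sketch illustrates the three sampling strategies on per-pixel losses. The 50/50 split used by the mixture strategy is an assumption made for the example, since the text of Equation (10) is not reproduced here.

```python
import numpy as np

def sample_pixels(per_pixel_loss, n, strategy="mixture", rng=None):
    """Pick n pixel indices for the loss, balancing object/background."""
    rng = rng or np.random.default_rng()
    idx = np.arange(per_pixel_loss.size)
    if strategy == "random":
        return rng.choice(idx, n, replace=False)
    hard = idx[np.argsort(per_pixel_loss)]  # indices sorted by loss, ascending
    if strategy == "bootstrapping":
        return hard[-n:]                    # the n hardest pixels
    # mixture: half random, half hardest (the 50/50 split is assumed)
    half = n // 2
    rand = rng.choice(idx, half, replace=False)
    return np.concatenate([rand, hard[-(n - half):]])
```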
3.4. Comparison to Other Segmentation Networks
In this paper, a multiscale-input CNN is combined with preprocessed datasets to design a network with stronger learning ability and higher accuracy. The network is based on multiscale image patches, where 80% of the image patches are used for training and 20% are used for validation. Training proceeds one epoch at a time, while the validation set is evaluated as a whole; each epoch lasts 100 iterations.
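The 80/20 patch split can be realized as in the following sketch; the random shuffle with a fixed seed is an illustrative assumption.

```python
import numpy as np

def split_patches(patches, train_frac=0.8, seed=0):
    """Randomly split extracted patches into training and validation sets."""
    order = np.random.default_rng(seed).permutation(len(patches))
    cut = int(train_frac * len(patches))
    train = [patches[i] for i in order[:cut]]
    val = [patches[i] for i in order[cut:]]
    return train, val
```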
The multiscale segmentation network is compared with the original PixelNet and Unet networks, with all three networks trained on the same database. This paper compares the Se, Sp, and Acc indicators, and the results are reported in the results tables. Our network's results are shown in Figure 9. Note that our aim here is to obtain an accurate segmentation of the various feature structures of the fundus image.
This section compares the performance of our proposed method with related methods to demonstrate that the proposed multi-seg-CNN framework can indeed achieve better segmentation of the retinal anatomy and pathological structures. In particular, the proposed method is compared with three related state-of-the-art methods. The first is PixelNet (Aayush Bansal et al., 2017). The second is the classification baseline using the most widely adopted deep model, i.e., Unet (Olaf Ronneberger et al., 2015). The last is DeepLabv3+ (Liang-Chieh Chen et al., 2018), Google's recent work on segmentation. Our baseline trains three independent classification models using our own database. We then derive the segmentation method according to the clinical segmentation criteria.
Figure 10 provides the ROC curve of each framework. All four methods are deep learning models trained on the same training data.
The best segmentation threshold and the highest accuracy during testing are achieved for images from a single database; images from this database also yield the fewest false positives and false negatives. Our method has high segmentation accuracy. In second place is the PixelNet method, which also achieves high segmentation accuracy, but because it uses a single branch, its accuracy is not as high as that of our four-branch method. The Unet segmentation results are not ideal when blood vessels are segmented separately. The DeepLabv3+ segmentation method has the lowest accuracy, which surprised us, because DeepLabv3+ is generally effective for small targets when the overall structure is segmented.
We evaluate the Se, Sp, and Acc indicators separately for each feature structure: blood vessels, exudates, bleeding points, the optic cup and disc, and the macula. Since the structures of the optic disc and the macula are relatively simple and their features are obvious, this paper considers these two structures together.
For the concentrated structures, the segmentation results of the other three segmentation networks (for the optic disc and the macula) are listed in
Table 2.
Comparing the three networks, we notice that the accuracy of PixelNet is higher than that of Unet and of our Multi-seg-CNN; however, we also find that Multi-seg-CNN outperforms Unet and achieves accuracy similar to that of PixelNet. This is likely because the optic disc and the macula are simple structures that are relatively easy to segment, so the sensitivity and specificity of the three networks do not differ much.
We used similar methods for the overall structures, obtaining the blood vessel segmentation results in Table 3, the exudate segmentation results in Table 4, and the bleeding point segmentation results in Table 5.
We find that the accuracy for the overall structures is lower than that for the concentrated structures. The reason is that an overall structure covers far more pixels than a concentrated structure, so the Se, Sp, and Acc indicators cannot match the high accuracy obtained for the concentrated structures; this result is unavoidable. Nevertheless, compared with other networks used for medical image processing, our network improves the accuracy for the overall structures of vessels, exudates, and bleeding points. We also compare against DeepLabv3+, which represents semantic segmentation networks, and conclude that DeepLabv3+ offers no obvious advantage in segmenting the fundus or finding its features; its segmentation results are not as good as those of PixelNet. For an intuitive overview,
Table 6 combines the results from
Table 2,
Table 3,
Table 4 and
Table 5. In 2015, Olaf Ronneberger et al. [33] showed that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Aayush Bansal et al. [20] explored design principles for general pixel-level prediction problems, from low-level edge detection to mid-level surface normal estimation to high-level semantic segmentation. In 2018, Liang-Chieh Chen et al. [34] extended DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. All tables contain data from the three open-source networks and the new network we designed. Finally, we calculate the overall results of the four networks in terms of the retinal anatomy and pathological structures. Multiscale segmentation achieves a more accurate segmentation effect.