Which Color Channel Is Better for Diagnosing Retinal Diseases Automatically in Color Fundus Photographs?

Color fundus photographs are the most common type of image used for the automatic diagnosis of retinal diseases and abnormalities. Like all color photographs, these images contain information about three primary colors, i.e., red, green, and blue, in three separate color channels. This work aims to understand the impact of each channel on the automatic diagnosis of retinal diseases and abnormalities. To this end, the existing works are surveyed extensively to explore which color channel is used most commonly for automatically detecting four leading causes of blindness and one retinal abnormality, along with segmenting three retinal landmarks. From this survey, it is clear that all channels together are typically used for neural network-based systems, whereas for non-neural network-based systems, the green channel is most commonly used. However, no conclusion can be drawn from the previous works regarding the importance of the different channels. Therefore, systematic experiments are conducted to analyse this. A well-known U-shaped deep neural network (U-Net) is used to investigate which color channel is best for segmenting one retinal abnormality and three retinal landmarks.


Introduction
Diagnosing retinal diseases at their earliest stage can save a patient's vision since, at an early stage, the diseases are more likely to be treatable. However, ensuring regular retina checkups for every citizen by ophthalmologists is infeasible not only in developing countries with huge populations but also in developed countries with small populations. The main reason is that the number of ophthalmologists relative to the population is very small. This is particularly true for low-income and lower-middle-income countries with huge populations, such as Bangladesh and India. For example, according to a survey conducted by the International Council of Ophthalmology (ICO) in 2010 [1], there were only four ophthalmologists per million people in Bangladesh. For India, the number was 11. Even for high-income countries with a small population, such as Switzerland and Norway, the numbers of ophthalmologists per million were not very high (91 and 68, respectively). More than a decade later, in 2021, these numbers remain roughly the same. Moreover, the number of people aged 60 and above (who are generally at high risk of retinal diseases) is increasing in most countries. The shortage of ophthalmologists and the necessity of regular retina checkups at low cost inspired researchers to develop computer-aided systems to detect retinal diseases automatically.
Different kinds of imaging technologies (e.g., color fundus photography, monochromatic retinal photography, wide-field imaging, autofluorescence imaging, and indocyanine green angiography) are available for imaging the retina, of which color fundus photography is the most common. To capture the light reflected from the retina, a fundus camera most commonly uses an image sensor coated with a color filter array (CFA). In a CFA, color filters are generally arranged following the Bayer pattern [19], developed by the Eastman Kodak company, as shown in Figure 1a. Instead of using three filters per pixel to capture the three primary colors (i.e., red, green, and blue) reflected from the retina, the Bayer pattern uses only one filter per pixel to capture one primary color. In this pattern, the number of green filters is twice the number of blue and red filters. Different kinds of demosaicing techniques are applied to obtain full color fundus photographs [20][21][22]. Some sophisticated and expensive fundus cameras do not use a CFA with a Bayer pattern to distinguish color; rather, they use a direct imaging sensor with three layers of photosensitive elements, as shown in Figure 1b. No demosaicing technique is necessary to obtain full color fundus photographs from such fundus cameras. As shown in Figure 2, in a color fundus photograph, we can see the major retinal landmarks, such as the optic disc (OD), macula, and central retinal blood vessels (CRBVs), on the colored foreground surrounded by the dark background. As can be seen in Figure 3, different color channels highlight different things in color fundus photographs. We can see the boundary of the OD more clearly and the choroid in more detail in the red channel. The red channel helps us segment the OD more accurately and see the choroidal blood vessels and choroidal lesions such as nevi or tumors more clearly than the other two color channels. The CRBVs and hemorrhages can be seen in the green channel with excellent contrast. The blue channel allows us to see the retinal nerve fiber layer (RNFL) defects and epiretinal membranes more clearly than the other two color channels.

Figure 2.
A color fundus photograph. We can see the retinal landmarks, i.e., optic disc, macula, and central retinal blood vessels, on the circular and colored foreground, surrounded by a dark background. Source of image: publicly available DRIVE data set and image file: 21_training.tif.

Previous Works on Diagnosing Retinal Disease Automatically
Many diseases can cause retinal damage, such as glaucoma, age-related macular degeneration (AMD), diabetic retinopathy (DR), diabetic macular edema (DME), retinal artery occlusion, retinal vein occlusion, hypertensive retinopathy, macular hole, epiretinal membrane, retinal hemorrhage, lattice degeneration, retinal tear, retinal detachment, intraocular tumors, penetrating ocular trauma, pediatric and neonatal retinal disorders, cytomegalovirus retinal infection, uveitis, infectious retinitis, central serous retinopathy, retinoblastoma, endophthalmitis, and retinitis pigmentosa. Among them, glaucoma, AMD, DR, and DME have drawn the main focus of researchers for color fundus photograph-based automation. One reason could be that, in many cases, these diseases lead to irreversible complete vision loss, i.e., blindness, if left undiagnosed and untreated. According to the information reported in [23,24], glaucoma, AMD, and DR are among the five most common causes of vision impairment in adults. Among the 7.79 billion people living in 2020, 295.09 million people experienced moderate or severe vision impairment (MSVI) and 43.28 million people were blind. Glaucoma was the cause of MSVI for 4.14 million people, AMD for 6.23 million, and DR for 3.28 million. Glaucoma was the cause of blindness for 3.61 million people, AMD for 1.85 million, and DR for 1.07 million [24]. Therefore, in our literature survey, we investigate the color channels used in previously published studies for automatically diagnosing glaucoma, DR, AMD, and DME. We also survey works on the segmentation of retinal landmarks, such as the OD, macula/fovea, and CRBVs, and of retinal atrophy.
We consider both original studies and reviews as sources of information. However, our survey includes only original studies written in English and published in SJR-ranked Q1 and Q2 journals. Note that SJR (SCImago Journal Rank) is an indicator developed by SCImago from the widely known Google PageRank algorithm [25]. This indicator shows the visibility of the journals contained in the Scopus database from 1996 onward. We used different keywords, such as 'automatic retinal disease detection', 'automatic diabetic retinopathy detection', 'automatic glaucoma detection', 'detect retinal disease by deep learning', 'segment macula', 'segment optic disc', and 'segment central retinal blood vessels', in the Google search engine to find previous studies. After finding a paper, we checked the SJR rank of the journal. We also followed the reference lists of papers published in Q1/Q2 journals and especially benefited from the review papers related to our area of interest.

Data Sets
We used RGB color fundus photographs from seven publicly available data sets: (1) Child Heart Health Study in England (CHASE) data set [3,4], (2) Digital Retinal Images for Vessel Extraction (DRIVE) data set [5], (3) High-Resolution Fundus (HRF) data set [6], (4) Indian Diabetic Retinopathy Image Dataset (IDRiD) [7], (5) Pathologic Myopia Challenge (PALM) data set [218], (6) STructured Analysis of the Retina (STARE) data set [10,11], and (7) University of Auckland Diabetic Retinopathy (UoA-DR) data set [12]. Images in these data sets were captured by different fundus cameras for different kinds of research objectives, as shown in Table 7. Since not all seven data sets have manually segmented images for all retinal landmarks and atrophy, we cannot use all of them for every segmentation task. Therefore, instead of seven data sets, we used five data sets for the experiments on segmenting CRBVs, three data sets for the OD, two data sets for the macula, and only one data set for the experiments on segmenting retinal atrophy. To obtain reliable results, we used the majority of the data (i.e., 55% of the data) as test data. We prepared one training and one validation set. By combining 25% of the data from each data set, we prepared the training set, whereas we prepared the validation set by combining 20% of the data from each data set. By taking the remaining 55% of the data from each data set, we prepared individual test sets for each type of segmentation. See Table 8 for the number of images in the training, validation, and test sets. Note that the training set is used to tune the parameters of the U-Net (i.e., weights and biases), the validation set is used to tune the hyperparameters (such as the number of epochs, learning rate, and activation function), and the test set is used to evaluate the performance of the U-Net.

Image Pre-Processing
We prepared four types of 2D fundus photographs: I_R, I_G, I_B, and I_Gr. By splitting the 3D color fundus photographs into their three color channels (i.e., red, green, and blue), we prepared I_R, I_G, and I_B. Moreover, by performing a weighted summation of I_R, I_G, and I_B, we prepared the grayscale image, I_Gr. By a grayscale image, we generally mean an image whose pixels have only one value representing the amount of light. It can be visualized as different shades of gray. An 8-bit grayscale image has pixel values in the range 0-255. There are many ways to convert a color image into a grayscale image. In this paper, we use a function from the OpenCV library where each gray pixel is generated according to the following scheme: I_Gr = 0.299 × I_R + 0.587 × I_G + 0.114 × I_B. This conversion scheme is frequently used in computer vision and implemented in different toolboxes, e.g., GIMP, MATLAB [219], and OpenCV.
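As an illustration, the following minimal sketch shows how the channel splitting and grayscale conversion described above could be done with OpenCV; the file name is only an example taken from the DRIVE data set, and note that OpenCV loads images in BGR order.

```python
import cv2
import numpy as np

# Load a color fundus photograph; OpenCV reads images in BGR channel order.
bgr = cv2.imread("21_training.tif")
I_B, I_G, I_R = cv2.split(bgr)  # the three single-channel images I_B, I_G, I_R

# Grayscale image via OpenCV's built-in conversion (uses the 0.299/0.587/0.114 weights).
I_Gr = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

# Equivalent explicit weighted summation of the three channels.
I_Gr_manual = (0.299 * I_R + 0.587 * I_G + 0.114 * I_B).astype(np.uint8)
```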
The background of a fundus photograph does not contain any information about the retina that could be helpful for manual or automatic retina-related tasks. Sometimes background noise can even be misleading. In order to avoid interference of background noise in any decision, we need to use a binary background mask, which has zero for the pixels of the background and 2^n − 1 for the pixels of the foreground, where n is the number of bits used for the intensity of each pixel. For an 8-bit image, 2^n − 1 = 255. Background masks are provided only for the DRIVE and HRF data sets; they are not provided for the other five data sets. Therefore, we followed the steps described in Appendix A to generate the background masks for all data sets. We also generated binary background masks for the DRIVE and HRF data sets in order to keep the same setup for all data sets. Overall, I_R has a higher intensity than I_G and I_B in all data sets, whereas I_B has a lower intensity compared to I_R and I_G. Moreover, in I_R, the foreground is less likely to overlap with the background noise than in I_G and I_B. In I_B, the foreground intensity is the most likely to overlap with the intensity of the background noise, as shown in Figure 4. Therefore, we use I_R (i.e., the red channel image) for generating the binary background masks.
We used the generated background mask and followed the steps described in Appendix B for cropping out as much of the background as possible and removing background noise outside the field-of-view (FOV). Since cropped fundus photographs of different data sets have different resolutions, as shown in Table 7, we re-sized all masked and cropped fundus photographs to 256 × 256 by bicubic interpolation so that we could use one U-Net. After resizing the fundus photographs, we applied contrast limited adaptive histogram equalization (CLAHE) [220] to improve the contrast of each single-channel image. Then we re-scaled pixel values to [0, 1]. Note that re-scaling pixel values to [0, 1] is not strictly necessary for fundus photographs; however, we did it to keep the input and output in the same range. We did not apply any other pre-processing techniques to the images.
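A minimal sketch of this pre-processing pipeline (resizing, CLAHE, and re-scaling) using OpenCV is shown below; the clip limit and tile grid size are assumed values, since the exact CLAHE settings are not specified here.

```python
import cv2
import numpy as np

def preprocess_channel(channel_img):
    """Resize a single-channel fundus image to 256x256, apply CLAHE, and re-scale to [0, 1]."""
    resized = cv2.resize(channel_img, (256, 256), interpolation=cv2.INTER_CUBIC)
    # CLAHE parameters below are assumptions, not the values used in the paper.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(resized)
    return enhanced.astype(np.float32) / 255.0
```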
Similar to the fundus photographs, the reference masks provided by the data sets for segmenting the OD, CRBVs, and retinal atrophy can have an unnecessary and noisy background. We, therefore, cropped out the unnecessary background of the provided reference masks and removed noise outside the field-of-view area by following the steps described in Appendix B. Since some provided masks are not binary masks, we turned them into 2D binary masks by following the steps described in Appendix C. No data set provides binary masks for segmenting the macula. Instead, the center of the macula is provided by PALM and UoA-DR. We generated binary masks for segmenting the macula using the macula center values and the OD masks of PALM and UoA-DR by following the steps described in Appendix D. We re-sized all kinds of binary masks to 256 × 256 by bicubic interpolation. We then re-scaled pixel values to [0, 1], since we used the sigmoid function as the activation function in the output layer of the U-Net and the range of this function is [0, 1].

Setup for U-Net
We trained color-specific U-Nets with the architecture shown in Table A3 of Appendix E. To train our U-Nets, we set the Jaccard coefficient loss (JCL) as the loss function, RMSProp with a learning rate of 0.0001 as the optimizer, and a mini-batch size of 8. We reduced the learning rate if there was no change in the validation loss for more than 30 consecutive epochs. We stopped the training if the validation loss did not change in 100 consecutive epochs. We trained all color-specific U-Nets five times to avoid the effect of randomness caused by different factors, including weight initialization and dropout, on the U-Net's performance. That means, in total, we trained 100 U-Nets: 25 U-Nets for OD segmentation (i.e., five models for each of the RGB, gray, red, green, and blue inputs), 25 U-Nets for macula segmentation, 25 U-Nets for CRBVs segmentation, and 25 U-Nets for atrophy segmentation. We estimate the performance of each model separately and then report the mean ± standard deviation of the performance for each category.
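The training setup described above could look roughly as follows in Keras; this is a sketch under assumptions (the smoothing term in the loss, the learning rate reduction factor, and the maximum number of epochs are not specified here), and `model`, `x_train`, `y_train`, `x_val`, and `y_val` are placeholders for the U-Net and data prepared elsewhere.

```python
import tensorflow as tf

def jaccard_coefficient_loss(y_true, y_pred, smooth=1.0):
    """Jaccard coefficient loss (JCL): 1 - intersection over union of the soft masks."""
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    union = tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) - intersection
    return 1.0 - (intersection + smooth) / (union + smooth)

def train_unet(model, x_train, y_train, x_val, y_val):
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
                  loss=jaccard_coefficient_loss)
    callbacks = [
        # Reduce the learning rate when the validation loss stops improving for 30 epochs.
        tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", patience=30),
        # Stop training when the validation loss stops improving for 100 epochs.
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=100),
    ]
    return model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     batch_size=8, epochs=1000, callbacks=callbacks)
```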

Evaluation Metrics
In segmentation, the U-Net must predict whether a pixel is part of the object in question (e.g., the OD) or not. Ideally, it should therefore output 1 if the pixel belongs to the targeted retinal landmark or atrophy, and 0 otherwise.
However, instead of 0/1, the output of the U-Net is in the range [0, 1] for each pixel, since we use sigmoid as the activation function in the last layer. The output can be interpreted as the probability that the pixel is part of the mask. To obtain a hard prediction (0/1), we use a threshold of 0.5. By comparing the hard prediction to the reference, it is decided whether the prediction is a true positive (TP), true negative (TN), false positive (FP), or false negative (FN). Using those results for each pixel in the test set, we estimated the performance of the U-Net using four metrics. We used three metrics that are commonly used in classification tasks (i.e., precision, recall, and area-under-curve (AUC)) and one metric that is commonly used in image segmentation tasks (i.e., mean intersection-over-union (MIoU), also known as the Jaccard index or Jaccard similarity coefficient). We computed precision = TP / (TP + FP) and recall = TP / (TP + FN) for both semantic classes together. On the other hand, we computed IoU = TP / (TP + FP + FN) for each semantic class (i.e., 0/1) and then averaged over the classes to estimate the MIoU. We estimated the AUC for the receiver operating characteristic (ROC) curve using a linearly spaced set of thresholds. Note that AUC is a threshold-independent metric, unlike precision, recall, and MIoU, which are threshold-dependent metrics.
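As a sketch of how these threshold-dependent metrics can be computed from the sigmoid output, the following NumPy snippet thresholds the probability map at 0.5 and accumulates pixel-wise counts; handling of degenerate cases (e.g., an empty foreground) is omitted for brevity.

```python
import numpy as np

def segmentation_metrics(y_true, y_prob, threshold=0.5):
    """Compute precision, recall, and mean IoU from a probability map and a binary reference."""
    y_pred = (y_prob >= threshold).astype(np.uint8)
    y_true = y_true.astype(np.uint8)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    # IoU per semantic class (foreground and background), then averaged to get MIoU.
    iou_foreground = tp / (tp + fp + fn)
    iou_background = tn / (tn + fn + fp)
    miou = (iou_foreground + iou_background) / 2.0
    return precision, recall, miou
```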

Performance of Color Channel Specific U-Net
Comparing the results shown in Tables 9-12, we can say that the U-Net is more successful at segmenting the OD and less successful at segmenting the CRBVs for all channels. The U-Net performs better when all three color channels (i.e., RGB images) are used together than when the color channels are used individually. For segmenting the OD, the red and gray channels are better than the green and blue channels (see Table 9). For segmenting the CRBVs, the green channel performs better than the other single channels, whereas both the red and blue channels perform poorly (see Table 10). For macula segmentation, there is no clear winner between the gray and green channels. Although the blue channel is a bad choice for segmenting the CRBVs, it is reasonably good at segmenting the macula (see Table 11). For segmenting retinal atrophy, the green channel is better than the other single channels, and the blue channel is also a good choice (see Table 12).

Table 9. Performance (mean ± standard deviation) of U-Nets using different color channels for segmenting the optic disc.

To better understand the performance of the U-Nets, we manually inspect all images together with their reference and predicted masks. As shown in Table 13, we see that in the majority of cases, all color-specific U-Nets can generate at least partially accurate masks for segmenting the OD and macula. When retinal atrophy severely affects a retina, no channel-specific U-Net can generate accurate masks for segmenting the OD and macula, as shown in Figures 5 and 6. In many cases, multiple areas in the generated masks are marked as the OD (see Figure 5d-f) or the macula (see Figure 6d). As shown in Table 14, this happens more in the gray channel for the macula and in the green channel for the OD. We find that our U-Nets trained on the RGB, gray, and green channel images can segment thick vessels quite well, whereas they are in general not good at segmenting thin blood vessels. As shown in Figure 7b,e, Figure 7c,f, and Figure 7h,k, discontinuities occur in the thin vessels segmented by our U-Nets.

The performance of U-Nets also depends to some extent on how accurately CRBVs are marked in the reference masks. Among the five data sets, the reference masks of the DRIVE data set are very accurate for both thick and thin vessels. That could be one reason we get the best performance for this data set. On the contrary, we get the worst performance for the UoA-DR data set because of the inaccurate reference masks (see Appendix F for more details). If the reference masks have inaccurate information, then the estimated performance of the U-Nets will be lower than what it should be. Two things can happen when reference masks are inaccurate. The first thing is that inaccurate reference masks in the training set may deteriorate the performance of the U-Net. However, if most reference masks are accurate enough, the deterioration may be small. The second thing is that inaccurate reference masks in the test set can generate inaccurate values for the estimated metrics. These two cases happen for the UoA-DR data set. Our U-Nets can tackle the negative effect of inaccurate reference masks in the training set of the UoA-DR. Our U-Nets learn to predict the majority of the thick vessels and some parts of thin vessels quite accurately for the UoA-DR data set. However, because of the inaccurate reference masks of the test data, the precision and recall are extremely low for all channels for the UoA-DR data set.
We also notice that quite often the red channel is affected by overexposure, whereas the blue channel is affected by underexposure (see Table 15). Both kinds of inappropriate exposure wash out retinal information, which results in low entropy. Therefore, the generated masks for segmenting the CRBVs do not have vessel lines in the inappropriately exposed parts of a fundus photograph (see the overexposed part of the red channel in Figure 7j and the underexposed part of the blue channel in Figure 7l). Note that histograms of inappropriately exposed images are highly skewed and have low entropy (as shown in Figure 8).

Table 15. Number of inappropriately exposed fundus photographs. N: total number of RGB fundus photographs in the test set of a specific data set.

It is not surprising that using all three color channels (i.e., RGB images) as input to the U-Net performs the best, since the convolutional layers of the U-Net are flexible enough to use all information from the three color channels appropriately. By using multiple filters in each convolutional layer, U-Nets can extract multiple features from the retinal images, many of which are appropriate for segmentation. As discussed in Section 3, previous works based on non-neural network-based models usually used one color channel, most likely because these models could not benefit from the information contained in all three channels. The fact that individual color channels perform well in certain situations raises two questions regarding camera design:

1. Would it be worthwhile to develop cameras with only one color channel rather than red, green, and blue, possibly customized for retina analysis?
2. Could a more detailed representation of the spectrum than RGB improve the automatic analysis of retinas? The RGB representation captures the information from the spectrum that the human eye can recognize. Perhaps this is not all the information from the spectrum that an automatic system could have used.
To fully answer those questions, many hardware developments would be needed. However, an initial analysis to address the first question could be to tune the weights used to produce the grayscale image from the RGB images.
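One way to perform such an analysis, sketched below under the assumption of a Keras-based U-Net operating on 256 × 256 inputs, is to prepend a trainable 1 × 1 convolution that learns its own weighting of the R, G, and B channels, initialized to the standard grayscale weights; this is an illustrative idea, not the procedure followed in this paper.

```python
import numpy as np
import tensorflow as tf

# A 1x1 convolution without bias acts as a learnable weighted summation of the
# R, G, and B channels, producing a single "custom grayscale" channel.
rgb_input = tf.keras.Input(shape=(256, 256, 3))
to_gray = tf.keras.layers.Conv2D(filters=1, kernel_size=1, use_bias=False, name="to_gray")
gray = to_gray(rgb_input)
# ... the rest of the U-Net would consume `gray` instead of the RGB input ...

# Start from the standard 0.299/0.587/0.114 weights and let training adjust them.
to_gray.set_weights([np.array([0.299, 0.587, 0.114], dtype=np.float32).reshape(1, 1, 3, 1)])
```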

Conclusions
We conduct an extensive survey to investigate which color channel in color fundus photographs is most commonly preferred for automatically diagnosing retinal diseases. We find that green channel images dominate previous non-neural network-based works, while all three color channels together, i.e., RGB images, dominate neural network-based works. In non-neural network-based works, researchers almost ignored the red and blue channels, reasoning that these channels are prone to poor contrast, noise, and inappropriate exposure. However, no work provided a conclusive experimental comparison of the performance of different color channels. To fill that gap, we conduct systematic experiments. We use a well-known U-shaped deep neural network (U-Net) to investigate which color channel is best for segmenting retinal atrophy and three retinal landmarks (i.e., central retinal blood vessels, optic disc, and macula). In our U-Net-based segmentation approach, we see that segmentation of retinal landmarks and retinal atrophy can be conducted more accurately when RGB images are used than when a single channel is used. We also notice that, as a single channel, the red channel is bad for segmenting the central retinal blood vessels, but better than the other single channels for optic disc segmentation. Although the blue channel is a bad choice for segmenting the central retinal blood vessels, it is reasonably good for segmenting the macula and very good for segmenting retinal atrophy. In all cases, RGB images perform the best, which indicates that the red and blue channels can provide supplementary information to the green channel. Therefore, we conclude that all color channels are important in color fundus photographs.

Institutional Review Board Statement: Not applicable. We only used publicly available data sets prepared by other organizations, and these data sets are standard for use in the automatic diagnosis of retinal diseases.
Informed Consent Statement: Not applicable. We only used publicly available data sets prepared by other organizations, and these data sets are standard for use in the automatic diagnosis of retinal diseases.
Data Availability Statement: All data sets used in this work are publicly available as described in Section 4.2.

Conflicts of Interest:
We declare no conflict of interest. Author Angkan Biswas was employed by the company CAPM Company Limited. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Generating Background Mask
The background of a fundus photograph can be noisy, i.e., the background pixels can have non-zero values. Noisy background pixels are, in general, invisible to the naked eye because of their low intensity. Exceptions occur, however. For example, images in the STARE data set have visible background noise. Moreover, sometimes non-retinal information, such as the image capturing date-time and the patient's name, can be present with high intensities in the background (e.g., images in the UoA-DR data set). This kind of information is also considered noise when it is not useful for any decision. No matter whether the noise in the background is visible or invisible to human eyes, or whether the intensity of the background pixels is high or low, by global binary thresholding with threshold θ = 0 we detect the presence of noisy background pixels in almost all data sets, as shown in Figure A1. Using a background mask, we can get rid of background noise. A simple method for creating a background mask would be to consider all pixels with an intensity lower than or equal to a threshold, θ, to be part of the background and the other pixels to be part of the foreground. When the image is noiseless, setting θ = 0 (i.e., keeping zero-valued pixels unchanged while setting pixels with non-zero intensities to 2^n − 1) is good enough to generate the background mask. However, for a noisy background, if we set the threshold, θ, to a very small value (i.e., a value lower than the intensities of the noise), then the background mask will mark parts of the background as foreground, as shown in Figure A2c-i. On the other hand, if we set θ to a very high value (i.e., a value higher than the intensities of foreground pixels), then some parts of the foreground may get lost in the background mask, as shown in Figure A2k,l. Of course, in reality, some background pixels may have a higher intensity than some foreground pixels, so that no threshold would accurately separate the foreground from the background. Further, the optimal threshold may depend on the data set.
As a more robust procedure for generating background masks for removing background noise, we apply the following steps:

• Step-1: Generate a preliminary background mask, B_1, by global binary thresholding, i.e., by setting each pixel intensity, p, of a single-channel image, I, to 0 if p ≤ θ and to 2^n − 1 if p > θ, where n is the number of bits used for the intensity of p (see Figure A3c). For an 8-bit image, 2^n − 1 = 255. Note that we use the red channel image, I_R. By trial-and-error, we finally set θ to 15, 40, 35, 35, 5, 35, and 5 to get good preliminary background masks for the CHASE_DB1, DRIVE, HRF, IDRiD, PALM, STARE, and UoA-DR data sets, respectively.
• Step-2: Determine the boundary contour of the retina by finding the contour with the maximum area. Note that a contour is a closed curve joining all the continuous points having the same color or intensity (see Figure A3d).
• Step-3: Set the pixels inside the boundary contour to 2^n − 1 and the pixels outside the boundary contour to zero in order to generate the final background mask, B_2 (see Figure A3e).

Figure A4 shows seven examples of generated binary background masks, and Figure A5 illustrates the benefit of using B_2 instead of B_1 for masking out the high-intensity background noise caused by text information in an image. Using the provided masks of the DRIVE and HRF data sets, we estimate the performance of our approach for generating binary background masks. As shown in Table A1, our approach is highly successful.
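A minimal OpenCV sketch of these three steps is given below; the function assumes an 8-bit red-channel image and takes the data-set-specific threshold as an argument.

```python
import cv2
import numpy as np

def generate_background_mask(I_R, theta):
    """Generate the final background mask B_2 from an 8-bit red-channel image I_R."""
    # Step-1: preliminary mask B_1 by global binary thresholding at theta.
    _, B_1 = cv2.threshold(I_R, theta, 255, cv2.THRESH_BINARY)

    # Step-2: boundary contour of the retina = the contour with the maximum area.
    contours, _ = cv2.findContours(B_1, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boundary = max(contours, key=cv2.contourArea)

    # Step-3: fill the inside of the boundary contour with 255, leaving the rest at 0.
    B_2 = np.zeros_like(I_R)
    cv2.drawContours(B_2, [boundary], -1, 255, thickness=cv2.FILLED)
    return B_2
```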

Appendix B. Cropping Out Background
The background of an image, I_x, does not contain any information about the retina that could be helpful for automatic retina-related tasks. Note that I_x can be an RGB image, a single-channel image, or a binary mask for segmenting the OD, macula, CRBVs, or retinal atrophy. As a robust procedure for cropping the unnecessary background and removing background noise from I_x, we apply the following steps:

• Step-1: Generate the background mask, B_2, using the steps described in Appendix A.
• Step-2: Determine the minimum bounding rectangle (MBR) that minimally covers the foreground of the background mask, B_2 (see Figure A3f).
• Step-3: Crop I_x and B_2 to the MBR (see Figure A3g,h).
• Step-4: Remove background noise from the cropped I_x by masking it with the cropped B_2 (see Figure A3i).
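These steps could be implemented as follows with OpenCV; this is a sketch that assumes 8-bit inputs and a single-channel or RGB image I_x.

```python
import cv2

def crop_and_mask(I_x, B_2):
    """Crop I_x and B_2 to the minimum bounding rectangle of B_2's foreground and mask out noise."""
    # Step-2: minimum bounding rectangle covering the non-zero (foreground) pixels of B_2.
    x, y, w, h = cv2.boundingRect(B_2)

    # Step-3: crop the image and the background mask to the MBR.
    I_cropped = I_x[y:y + h, x:x + w]
    B_cropped = B_2[y:y + h, x:x + w]

    # Step-4: remove background noise outside the field of view using the cropped mask.
    I_masked = cv2.bitwise_and(I_cropped, I_cropped, mask=B_cropped)
    return I_masked, B_cropped
```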

Appendix C. Turning Provided Reference Masks into Binary Masks
Although reference masks used for segmentation need to be binary masks (i.e., having only two pixel intensities, e.g., zero for the background pixels and 255 for the foreground pixels of an 8-bit image), we notice that two data sets (i.e., HRF and UoA-DR) do not fulfill this requirement, as shown in Table A2. Three out of 45 provided masks of the HRF data set, and all 200 provided masks of the UoA-DR data set, have pixels of multiple intensities. There are two cases: in the first case, noisy background pixels that are supposed to be 0 have intensities other than zero; in the second case, foreground pixels that are supposed to be 255 have intensities other than 255. We also notice that, even though the provided masks of the IDRiD data set are binary masks, the maximum intensity is 29 instead of 255.
We turn all provided masks into binary masks with pixel intensities of 0 and 255 by global binary thresholding with threshold θ = 127. Before binarization, we remove noisy pixels from outside the field-of-view area by using the estimated background mask, B_2 (see Figure A6b for an example). As shown in Figure A6c, there may still be noisy pixels inside the FOV area. For that, we apply binary thresholding and generate the final binary mask, as shown in Figure A6d.

Table A2. Distribution of provided binary and non-binary masks for segmenting CRBVs, optic discs, macula, and retinal atrophy. n: total number of provided masks, m: number of provided binary masks.
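A minimal sketch of this binarization step is given below; note that data-set-specific handling (e.g., for the IDRiD masks whose maximum intensity is 29) is not covered by this snippet.

```python
import cv2

def binarize_reference_mask(mask, B_2, theta=127):
    """Remove noise outside the FOV with the background mask B_2, then binarize to {0, 255}."""
    cleaned = cv2.bitwise_and(mask, mask, mask=B_2)
    _, binary = cv2.threshold(cleaned, theta, 255, cv2.THRESH_BINARY)
    return binary
```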

Appendix D. Generating Binary Masks for Segmenting Macula
Even though three data sets (i.e., IDRiD, PALM, and UoA-DR) provide reference masks for segmenting the optic disc (OD), five data sets (i.e., CHASE_DB1, DRIVE, HRF, STARE, and UoA-DR) for the CRBVs, and one data set (i.e., PALM) for retinal atrophy, none of the seven data sets provide reference masks for segmenting the macula. However, two data sets (PALM and UoA-DR) provide the center of the macula. The average size of the macula in humans is around 5.5 mm. However, the average clinical size of the macula in humans is 1.5 mm, whereas the average size of the OD is 1.825 mm (vertically 1.88 mm and horizontally 1.77 mm). We assume that the size of the macula is equal to that of the OD and, using the provided center values, we generate binary masks for segmenting the macula through the following steps:

• Step-1: Get the corresponding reference mask, R_OD, of a color fundus photograph for segmenting the OD.
• Step-2: Generate the background mask, B_2, by following the steps described in Appendix A.
• Step-3: Remove the background noise outside the foreground of R_OD by masking it with B_2.
• Step-4: Turn R_OD into a binary mask, R_OD_Binary, by global thresholding.
• Step-5: Find the boundary contour of the foreground of R_OD_Binary.
• Step-6: Determine the radius, r, of the minimum enclosing circle of R_OD_Binary.
• Step-7: Draw a circle of radius r at the provided center of the macula.
• Step-8: Set the pixels inside the circle to 2^n − 1 and those outside the circle to 0 in order to generate the final reference mask, R_Macula_Binary.
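Steps 5-8 could be sketched with OpenCV as follows; the function assumes an already binarized 8-bit OD mask and a macula center given as (x, y) pixel coordinates.

```python
import cv2
import numpy as np

def generate_macula_mask(R_OD_Binary, macula_center):
    """Generate a binary macula mask, assuming the macula has the same size as the OD."""
    # Steps 5-6: boundary contour of the OD foreground and radius of its minimum enclosing circle.
    contours, _ = cv2.findContours(R_OD_Binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    od_contour = max(contours, key=cv2.contourArea)
    _, r = cv2.minEnclosingCircle(od_contour)

    # Steps 7-8: draw a filled circle of radius r at the provided macula center.
    R_Macula_Binary = np.zeros_like(R_OD_Binary)
    center = (int(round(macula_center[0])), int(round(macula_center[1])))
    cv2.circle(R_Macula_Binary, center, int(round(r)), 255, thickness=cv2.FILLED)
    return R_Macula_Binary
```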

Appendix E. Architecture of U-Net
Our color-specific U-Nets have the architecture shown in Table A3. Similar to the original U-Net proposed in [17], our U-Nets consist of two parts: a contracting side and an expansive side. Neither side has any fully connected layers; instead, both sides consist mainly of convolutional layers. Unlike the original U-Net, we use a convolutional layer with stride two instead of a max pooling layer for down-sampling on the contracting side. Instead of using unpadded convolutions, we use same-padding convolutions on both the contracting and expansive sides. Note that with same padding, the output size is the same as the input size. Therefore, we do not need the cropping on the expansive side that was needed in the original work due to the loss of border pixels in every convolution. We use the Exponential Linear Unit (ELU) instead of the Rectified Linear Unit (ReLU) as the activation function in each convolutional layer except the output layer. In the output layer, we use the sigmoid function as the activation function. An alternative would have been the softmax function with two outputs. On both the contracting and expansive sides, the two padded convolutional layers are separated by a drop-out layer. We use drop-out layers in order to avoid over-fitting.
There are 23 convolutional layers in the original U-Net, whereas in our U-Nets there are 29 convolutional layers. In the original U-Net, there are four down-sampling blocks in the contracting side and four up-sampling blocks in the expansive side, whereas in our U-Nets there are five down-sampling and five up-sampling blocks. In total, each of our U-Nets has 5,939,521 trainable parameters.
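For illustration, the building blocks described above could be sketched in Keras as follows; the kernel sizes, filter counts, and dropout rate are assumptions, and the exact layer configuration is the one given in Table A3.

```python
from tensorflow.keras import layers

def down_block(x, filters, dropout_rate=0.1):
    """Contracting-side block: two same-padded ELU convolutions separated by dropout,
    followed by a stride-2 convolution for down-sampling (instead of max pooling)."""
    x = layers.Conv2D(filters, 3, padding="same", activation="elu")(x)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="elu")(x)
    skip = x
    x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="elu")(x)
    return x, skip

def up_block(x, skip, filters, dropout_rate=0.1):
    """Expansive-side block: up-sample, concatenate the skip connection (no cropping needed
    with same padding), then two same-padded ELU convolutions separated by dropout."""
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])
    x = layers.Conv2D(filters, 3, padding="same", activation="elu")(x)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="elu")(x)
    return x
```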

Appendix F. Inaccurate Masks in UoA_DR for Segmenting CRBVs
Among the five data sets we experiment on, the UoA-DR data set has the largest number of masks for segmenting CRBVs. Even though it could therefore be a good data set for training and testing U-Nets, the performance of every color-specific U-Net on the UoA-DR test set is the worst among all data sets, regardless of whether the U-Nets are trained by combining the data of the five data sets or by using data only from the UoA-DR data set. The reason is that all the reference masks provided by the UoA-DR data set for segmenting CRBVs are inaccurate. In the UoA-DR data set, the reference masks usually do not match the real blood vessels well. In many parts of the reference masks, vessels are marked in the wrong places. Moreover, thick vessels are marked by thinner lines, and thin vessels are marked by thicker lines, in many parts of the reference masks. In some reference masks, even clearly visible thin vessels are not marked, as shown in Figure A7.

Appendix G. Performance of U-Nets Trained and Tested on Individual Data Set
Since different fundus cameras capture the retinal images of the different data sets in different experimental setups, the data sets may differ in difficulty. We, therefore, do experiments on the data sets individually, i.e., training and testing on the same data set for segmenting CRBVs. Tables A4 and A5 show the results for CRBVs segmentation on five data sets: CHASE_DB1, DRIVE, HRF, STARE, and UoA_DR. The first and second blocks in these tables show the results of the U-Nets for which 25% of the data is used for training, whereas the third block shows the results of the U-Nets for which 55% of the data is used for training. In the first block, 55% of the data is used for testing, whereas in the second and third blocks, only 25% of the data is used for testing. For all three cases, 20% of the data is used as the validation set. It should be noted that the individual test sets prepared by taking 25% of the data are fairly small, so these results may not be very reliable. However, the results in the first and second blocks are fairly similar, which indicates that the results are reasonably stable. Overall, we see a substantial improvement in the third block compared to the second, suggesting that the U-Nets benefit from more training data. We also notice that both in Table 10 (same training data for all sets) and Table A4 (set-specific training data), there is a large difference in the results for the different data sets, which indicates that the data sets have different levels of difficulty.

Appendix H. Effect of CLAHE
Different data sets have different qualities, which cause different levels of difficulty. One reason for poor-quality images is inappropriate contrast. In general, histogram equalization techniques, such as contrast limited adaptive histogram equalization (CLAHE), are applied to enhance the local contrast of fundus photographs. In this work, we also apply CLAHE in the pre-processing stage of the experiments mentioned above. In order to investigate the effect of CLAHE on different data sets, we conduct experiments using the fundus photographs without applying CLAHE. Table A5 shows the results when CLAHE is not applied. These results are obtained using the same training/validation/test splits as in the third blocks of Table A4. Overall, CLAHE improves the results of the STARE data set a lot, and also quite a lot for the DRIVE and HRF data sets. The results of the CHASE_DB1 data set are a bit mixed, depending on the metric. For the UoA-DR data set, CLAHE does not seem to help at all.

Table A5. Performance (mean ± standard deviation) of U-Nets trained using different color channels for segmenting CRBVs when CLAHE is not applied to the retinal images in the pre-processing stage. Note that 55% of the data was used for training, 20% for validation, and 25% for testing.
