Deep Ensemble Learning Based Objective Grading of Macular Edema by Extracting Clinically Significant Findings from Fused Retinal Imaging Modalities

Macular edema (ME) is a retinal condition that affects a patient's central vision. ME leads to accumulation of fluid in the region surrounding the macula, resulting in a swollen macula. Optical coherence tomography (OCT) and fundus photography are the two widely used retinal examination techniques that can effectively detect ME. Many researchers have utilized retinal fundus and OCT imaging for detecting ME. However, to the best of our knowledge, no work in the literature fuses the findings from both retinal imaging modalities for the effective and more reliable diagnosis of ME. In this paper, we propose an automated framework for the classification of ME and healthy eyes using retinal fundus and OCT scans. The proposed framework is based on deep ensemble learning, where the input fundus and OCT scans are recognized through a deep convolutional neural network (CNN) and are processed accordingly. The processed scans are then passed to the second layer of the deep CNN model, which extracts the required feature descriptors from both images. The extracted descriptors are concatenated and passed to a supervised hybrid classifier built from an ensemble of artificial neural networks, support vector machines and naïve Bayes. The proposed framework was trained on 73,791 retinal scans and validated on 5100 scans from the publicly available Zhang and Rabbani datasets. It achieved an accuracy of 94.33% for diagnosing ME and healthy subjects, and mean dice coefficients of 0.9019 ± 0.04 for extracting retinal fluids, 0.7069 ± 0.11 for extracting hard exudates and 0.8203 ± 0.03 for extracting retinal blood vessels against the clinical markings.


Introduction
Visual impairments severely degrade the quality of life and have an adverse effect on people suffering from other chronic health issues. Blindness is currently considered a major health problem worldwide. According to the Global Burden of Disease (GBD) 2017 report (released on 18 November 2018), loss of vision is the third leading form of impairment in humans, and 48.2 million people are suffering from eye diseases all over the world. In addition, 39.6 million people have severe visual impairments, whereas 279 million and 969 million people have moderate and low visual impairments, respectively [1,2]. Moreover, most of the reported visual impairments are due to retinopathy.
The prime cause of retinopathy is diabetes mellitus (DM). DM is caused by the destruction of pancreatic beta cells (β-cells), which affects the glucose metabolism of the subject. DM is graded into two types: Type I DM specifies a deficiency of insulin, whereas Type II DM is associated with insulin resistance [3][4][5]. Apart from this, DM also affects other vital organs of the human body, including the eyes, kidneys and heart [6]. The macula produces the central vision and is the most critical part of the retina. Any damage to the macula results in the loss of central vision. The retinal diseases that affect the central vision of a person are collectively known as maculopathy. The most common form of maculopathy is ME, which is caused by the leakage of extracellular fluid from hyper-permeable capillaries in the macula of the retina. ME is clinically graded into different stages depending upon the affected area of macular thickening. However, early detection and laser photocoagulation can prevent sudden blindness in most cases. Moreover, many retinal complications are treatable and, according to the "VISION 2020: The Right to Sight" initiative, different measures are being taken to eradicate avoidable blindness by the year 2020 [7]. At the same time, it is equally important to equip ophthalmologists with state-of-the-art retinal computer-aided diagnostic systems for efficient detection and grading of retinopathy.
The two non-invasive imaging modalities in clinical practice for retinal examination are OCT and fundus imagery [8]. OCT captures tissue reflections through light coherency. For retinal examination, a light beam is directed onto the fundus of the retina, yielding a cross-sectional axial scan (A-scan) [9][10]. The A-scans are joined together to produce a brightness scan (B-scan). Since OCT captures the retinal cross-section, the early progression of retinopathy can be easily visualized, and early identification of retinopathy leads to better treatment. Retinal OCT imagery has revolutionized clinical examination and eye treatment [11,12]. Figure 1a shows the basic OCT scan acquisition schematics, which are based on the Michelson interferometer (MI). In MI, a monochromatic coherent light source is used to penetrate the human eye to produce a cross-sectional retinal scan. A beam splitter at the center splits the light into two separate beams, where one beam is directed towards a reference mirror and the other travels to the subject's eye. Upon reflection, these two beams are recombined into a single beam, producing an axial scan at the detector. Fundus photography, on the other hand, captures the central and peripheral retinal regions [13]. Figure 1b shows the acquisition principle of fundus imagery, where a specialized microscope is attached to a charge coupled device (CCD) camera for taking fundus photographs. Fundus scans should ideally be taken in dim conditions. In certain circumstances, it becomes vital to consider all of the retinal examination techniques to fully analyze the pathological conditions of the human retina.
The optical principle of the fundus camera is the same as that of ophthalmoscopy, which acquires an inverted fundus scan enlarged about two to five times [14,15]. The light passes through a series of biconvex lenses, which focus it through the central aperture, forming an annulus. After that, the light passes through the cornea and falls on the fundus, and the fundus scan appears on the display device, from which it can be saved. The advantages of fundus photography are that it does not require pupil dilation, it is easy to use, it does not require a skilled user, and it captures images that can easily be examined by specialists at any time, anywhere.
However, apart from the high cost of equipment and non-portability, a major limitation of fundus photography is that it obtains a 2D representation of 3D semi-transparent retinal tissues projected onto the imaging plane, a limitation that OCT imagery addresses. Figure 2 shows ME visualization in both OCT and fundus scans.
Sensors 2019, 19, 2970

Related Work
In the past, many researchers have conducted clinical studies on analyzing ME using fundus and OCT scans [16][17][18] and concluded that OCT imaging provides better visualization of ME than fundus photography, especially in the early stages where the symptoms of ME are not yet prominent. In addition, many studies have devised automated algorithms for detecting ME from fundus or OCT scans individually. Most of the methods that use fundus images for the automated detection of ME are based on component segmentation, lesion detection and extraction of hard exudates (HE). Since the contrast between HE and other retinal structures is relatively high in digital fundus scans, the most common approaches for detecting HE include marker-controlled watershed transformation [19], a particle swarm optimization (PSO) based algorithm [20], and a combination of local standard variation in a sliding window, morphological closing of the luminance channel and watershed transform [21]. However, illumination variations, which arise from changes in tissue pigmentation and imaging conditions, greatly affect these methods. Additionally, methods based on extracting edge and color features have also been proposed for the segmentation of HE [22][23][24][25]. In general, such algorithms produce unsatisfactory results unless complex pre- and post-processing steps are included.
Different researchers have developed automated frameworks for the extraction of retinal layers and retinal fluids for analyzing ME affected pathologies [26][27][28][29]. Kernel regression and graph theory dynamic programming (KR + GTDP) [30] and software development life cycle (SDLC) [31] frameworks have also been developed for segmenting retinal layers and retinal fluids in ME affected OCT scans. Srinivasan et al. [32] proposed a maculopathy detection framework using histogram of oriented gradients. Apart from this, deep learning frameworks [33][34][35] have recently been proposed for the automated extraction of retinal information from maculopathy affected OCT scans.
However, to the best of our knowledge, no method has been proposed in the past that fuses multiple retinal imaging modalities for the objective evaluation of ME pathology. In this paper, we propose a deep ensemble learning based framework that gives an objective grading of ME pathology. The main contributions of this paper are as follows:


1. A novel method was presented in this paper that extracted the ME pathological symptoms from retinal fundus and OCT scans.
2. Instead of extracting handcrafted features, the proposed framework employed a deep convolutional neural network (CNN) model that gives the most relevant and useful features from retinal fundus and OCT scans for the objective evaluation of ME pathology irrespective of the scan acquisition machinery.
3. Many frameworks that have been proposed in the past were tested on a single dataset or on scans acquired through single OCT machinery. However, the proposed framework could give objective grading of ME pathology irrespective of OCT acquisition machinery and was rigorously tested on scans from different publicly available datasets.
4. The proposed framework employed an ensemble of artificial neural networks (ANN), support vector machines (SVM) and naïve Bayes (NB) for the in-depth grading of ME using both fundus and OCT retinal imaging modalities.
5. The proposed framework is adaptive and gives more weight to the clinical findings such as foveal swelling, fluid filled spaces and hard exudates while evaluating ME. This is achieved by fine-tuning the proposed CNN model on observing the critical ME symptoms from both fundus and OCT imagery.
The rest of the paper is organized as follows: Section 3 reports the details of the datasets used in this study, Section 4 explains the proposed methodology, Section 5 presents the results and Section 6 provides a detailed discussion of the proposed framework. Section 7 concludes the paper and highlights future directions.

Datasets
The proposed framework has been tested on retinal fundus and OCT B-scans from the publicly available Rabbani and Zhang datasets. Zhang's dataset consisted only of OCT scans of various retinal pathologies, while Rabbani's datasets contained fundus, fluorescein angiography (FA) and OCT scans. We excluded the retinal pathologies other than healthy and ME from these datasets. The detailed description of the datasets used for training and evaluation is listed in Table 1. All the scans within the datasets were marked by expert clinicians, and we used these markings as the ground truth for evaluating the performance of the proposed framework.

Notes to Table 1: (a) We only considered OCT and fundus imaging modalities consisting of healthy and ME retinal pathologies from these datasets; (b) the count shows the total number of scans in these datasets (including all the B-scans in OCT volumes).

Proposed Methodology
The proposed framework fuses retinal fundus and OCT imagery for the automated recognition and classification of ME and healthy subjects. The block diagram of the proposed framework is shown in Figure 3, where it can be observed that the proposed framework consists of five major stages: retinal imaging modality recognition, preprocessing of retinal scans, extraction of clinically significant ME pathological symptoms, CNN-based feature extraction and retinal diagnosis. At first, the input retinal scans were categorized as fundus or OCT through the first layer of the deep CNN model. Afterwards, acquisition artifacts and unwanted noise content were removed from both types of imagery in the preprocessing stage. After enhancing the scans, the information about retinal layers, retinal fluids and hard exudate regions was automatically extracted through a set of coherent tensors, which highlight the clinically significant pathological features of ME. The extracted retinal information was then mapped onto the original scan, from which distinct features were extracted through deep CNN models.
The extracted features from both fundus and OCT imagery were concatenated to form a feature vector upon which the candidate subject was graded. The detailed description of each stage is presented in the subsections below.

Retinal Imaging Modality Recognition
The first stage of the proposed framework was the automated recognition of retinal fundus and OCT scans. For this purpose, we utilized the pre-trained AlexNet model [42]. AlexNet is a 25-layered CNN architecture that is trained on the ImageNet dataset. We modified the classification layer of the AlexNet network and retrained it on the local image modality recognition training dataset through transfer learning. The transfer learning phase is shown in Figure 4 and the detailed description is given in Table 1. The pretrained weights of the AlexNet model converged quickly for the recognition of retinal imaging modalities, which resulted in shorter training and fine-tuning time. The optimization during the training phase was performed through stochastic gradient descent (SGD) [43], where two 50% dropout layers were employed to reduce overfitting. The main reason for employing the AlexNet model instead of designing a CNN architecture from scratch was to achieve greater accuracy with a small training dataset in less time. Apart from this, the softmax function was used in the modified AlexNet architecture to compute the final output probabilities. The softmax function is mathematically expressed in Equation (1) and the architectural description of the AlexNet layers is presented in Table 2.
$$\sigma(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}, \quad i = 1, 2, \ldots, N \qquad (1)$$
where $X = \{x_1, x_2, \ldots, x_N\}$ is the input vector. After each convolution layer, a rectified linear unit (ReLU) layer is employed, which ensures that only the positive values are retained in the feature map (because the negative values reflect the changes that are dissimilar between the input and the convolutional kernel). After the ReLU layer, a max pooling layer is added, which keeps only the maximum values within each neighborhood and thereby shrinks the resultant feature map.
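The three operations described above (softmax, ReLU and max pooling) can be illustrated with a minimal pure-Python sketch. It is not the AlexNet implementation used in the paper: the 2 × 2 non-overlapping pooling window, the example scores and the example feature map are all assumptions chosen for readability.

```python
import math

def softmax(x):
    """Softmax of a list of scores (the function referred to as Equation (1))."""
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def relu(feature_map):
    """Keep positive responses and set negative ones to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling (the window size is an assumption)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]

# Hypothetical class scores for the two modalities (fundus, OCT).
probs = softmax([2.0, 0.5])

# Hypothetical 4x4 convolution response.
fmap = relu([[1.0, -2.0, 3.0, 0.5],
             [-1.0, 4.0, -0.5, 2.0],
             [0.0, -3.0, 1.5, -1.0],
             [2.5, 1.0, -2.0, 0.0]])
pooled = max_pool_2x2(fmap)  # the 4x4 map shrinks to 2x2
```

The pooled map has a quarter of the original entries, which is the shrinking effect the text attributes to the max pooling layer.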

Preprocessing Retinal Scans
The acquisition of retinal scans is highly sensitive to the subject's head and eye movements, which often leads to scan degradation. Apart from this, the acquisition machines add different kinds of scan annotations, which greatly affect the automated retinal analysis. In order to cater for such noisy artifacts, a preprocessing stage was added, which removes the noisy content effectively while enhancing the retina. Since the annotations are mostly added in the top and bottom rows of the respective B-scan, they are automatically removed by setting the first and last fifty rows to zero. This threshold was empirically selected by analyzing the scans within all the datasets. Apart from this, the degraded scan areas shown in Figure 5 were automatically removed by searching for the first and last sharp transitions in each column of the respective scan and then replacing the values in the identified noisy regions with the mean of the background pixels.
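The two clean-up steps described above can be sketched as follows. This is one possible reading, not the authors' code: the scan is assumed to be a list of pixel rows, the degraded region is assumed to lie outside the first and last sharp transition of each column, and `jump` and `background_mean` are hypothetical parameters that the text does not specify.

```python
def remove_annotation_rows(scan, margin=50):
    """Zero the first and last `margin` rows of a B-scan (a list of rows)."""
    height = len(scan)
    cleaned = [row[:] for row in scan]
    for r in range(height):
        if r < margin or r >= height - margin:
            cleaned[r] = [0] * len(scan[r])
    return cleaned

def fill_degraded_columns(scan, jump, background_mean):
    """Per column, replace pixels outside the first and last sharp
    transition with the background mean.

    `jump` is a hypothetical intensity difference that counts as a
    sharp transition; `background_mean` is supplied by the caller.
    """
    height, width = len(scan), len(scan[0])
    cleaned = [row[:] for row in scan]
    for c in range(width):
        edges = [r for r in range(1, height)
                 if abs(scan[r][c] - scan[r - 1][c]) >= jump]
        if len(edges) < 2:
            continue  # no pair of sharp transitions in this column
        first, last = edges[0], edges[-1]
        for r in range(height):
            if r < first or r >= last:
                cleaned[r][c] = background_mean
    return cleaned

# Toy 7x2 scan: bright degraded bands above and below a dark retinal region.
demo = [[v, v] for v in [255, 255, 10, 12, 11, 255, 255]]
filled = fill_degraded_columns(demo, jump=200, background_mean=8)
zeroed = remove_annotation_rows(demo, margin=2)  # the paper uses fifty rows
```

On real B-scans the margin would stay at the fifty rows reported in the text; the smaller margin here only keeps the toy example short.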

The preprocessing stage further enhances the retinal portions by increasing their variability with respect to the background and by removing noisy outliers. This is accomplished through an adaptive low pass Wiener filter, which uses a localized neighborhood of a candidate pixel for denoising [34]. The response of the Wiener filter is expressed in Equations (2)-(4):
$$\mu = \frac{1}{w_h w_y} \sum_{(x_i, y_i) \in w_h \times w_y} O(x_i, y_i) \qquad (2)$$
$$\sigma^2 = \frac{1}{w_h w_y} \sum_{(x_i, y_i) \in w_h \times w_y} O^2(x_i, y_i) - \mu^2 \qquad (3)$$
$$D(x_i, y_i) = \mu + \frac{\sigma^2 - \nu^2}{\sigma^2} \left( O(x_i, y_i) - \mu \right) \qquad (4)$$
where $O(x_i, y_i)$ and $D(x_i, y_i)$ represent the pixels of the original and denoised retinal scan respectively, $w_h$ is the horizontal axis of the denoising window and $w_y$ is its vertical axis. The local estimated mean and variance are represented by $\mu$ and $\sigma^2$ respectively, and $\nu^2$ is the noise variance, taken as the average of all the locally estimated variances.
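A minimal pure-Python sketch of an adaptive Wiener filter of the kind referred to in Equations (2)-(4) is given below. It follows the common local-statistics formulation rather than the authors' exact implementation: the 3 × 3 default window, the border clipping and the estimate of the noise power as the average local variance are assumptions.

```python
def adaptive_wiener(image, w_h=3, w_y=3):
    """Adaptive Wiener denoising of a 2D image given as a list of rows."""
    height, width = len(image), len(image[0])
    half_y, half_h = w_y // 2, w_h // 2
    means = [[0.0] * width for _ in range(height)]
    variances = [[0.0] * width for _ in range(height)]

    for r in range(height):
        for c in range(width):
            # Collect the local window, clipped at the border (assumption).
            window = [image[rr][cc]
                      for rr in range(max(0, r - half_y),
                                      min(height, r + half_y + 1))
                      for cc in range(max(0, c - half_h),
                                      min(width, c + half_h + 1))]
            mu = sum(window) / len(window)
            means[r][c] = mu
            variances[r][c] = sum(v * v for v in window) / len(window) - mu * mu

    # Noise power: average of the local variances (assumption).
    noise = sum(sum(row) for row in variances) / (height * width)

    denoised = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            var = variances[r][c]
            denom = max(var, noise)
            # The gain is clamped to [0, 1] so flat regions collapse to the mean.
            gain = max(var - noise, 0.0) / denom if denom > 0 else 0.0
            denoised[r][c] = means[r][c] + gain * (image[r][c] - means[r][c])
    return denoised

flat = [[5] * 4 for _ in range(4)]
flat_out = adaptive_wiener(flat)    # a constant image is left unchanged

spiky = [row[:] for row in flat]
spiky[1][1] = 50                    # a single noisy outlier
spiky_out = adaptive_wiener(spiky)  # the outlier is pulled towards its local mean
```

Because the gain stays between zero and one, each output pixel lies between its local mean and its original value, which is why isolated outliers are attenuated while strong edges are largely preserved.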

Extraction of Clinically Significant ME Pathological Symptoms
ME is clinically graded into different categories as defined by the Early Treatment Diabetic Retinopathy Study (ETDRS). ME due to the presence of hard exudates and retinal fluids within the foveal diameter of 500 micrometers is considered clinically significant, whereas ME outside this region is considered non-clinically significant. Therefore, the accurate extraction of hard exudates and retinal fluid regions and the localization of the fovea are critical for effectively grading ME. Retinal fluids can be accurately observed through OCT scans, while hard exudates are effectively visualized through fundus images. Therefore, the proposed framework, rather than relying on either fundus or OCT imagery alone, used both of them to extract the retinal information for the reliable and objective grading of ME. In order to localize the fovea, the proposed framework extracted the retinal layers from the OCT volume and measured the deepest inner limiting membrane (ILM) point within the foveal B-scan.
The extraction of retinal information from both types of imagery was performed through the structure coherence matrix, also known as the structure tensor. The structure tensor has gained tremendous popularity in medical image processing because it provides low-level feature analysis and is very useful for detecting corners, edges and boundaries [44]. The structure tensor, also known as the Förstner interest operator, is a second moment matrix that is computed from the gradients of an image using Gaussian derivative filters, as expressed in Equations (5)-(8):
$$S_T = \begin{bmatrix} T_{XX} & T_{XY} \\ T_{YX} & T_{YY} \end{bmatrix} \qquad (5)$$
$$T_{XX} = g(x, y) * \phi_X^2 \qquad (6)$$
$$T_{YY} = g(x, y) * \phi_Y^2 \qquad (7)$$
$$T_{XY} = T_{YX} = g(x, y) * \phi_{XY} \qquad (8)$$
where $S_T$ is the second order structure tensor matrix, $T_{XX}$ is the horizontally computed tensor, $T_{YY}$ is the vertically computed tensor, and $T_{XY}$, $T_{YX}$ are the horizontally and vertically oriented tensors. $\phi_X$ and $\phi_Y$ are the partial derivatives of the denoised image within the pixel neighborhood with respect to x and y, and $\phi_{XY} = \phi_X \phi_Y$ combines both orientations. $g(x, y)$ is the Gaussian window, $*$ denotes convolution and $D(x, y)$ is the denoised retinal scan from which the derivatives are computed. Figure 6 shows the structure tensor computational stage. The structure tensor uses a set of eigenvalues to measure the degree of coherency, and the tensor with maximum coherency is automatically selected for extracting retinal information [34].
After preprocessing the retinal fundus and OCT scans, the second moment matrix was automatically computed by the proposed framework for further analysis. The structure tensor matrix $S_T$ of the retinal fundus scan was computed for the extraction of blood vascular patterns. Afterwards, the optic disc region was automatically localized by analyzing the high intensity retinal regions. The extraction of blood vessels and the localization of the optic disc region were performed in order to improve the segmentation of hard exudate regions. Since blood vessels contain high frequency components, the tensors present their detailed visualization while suppressing all other content, as evident from Figure 6b.
After computing $S_T$ of the candidate fundus scan, the four coherent tensors were obtained. The best tensor ($T_{MAX}$) was then obtained by fusing the $T_{XX}$ and $T_{YY}$ tensors, which together give the maximum information about the blood vessels. Blood vessel segmentation in the proposed framework is quite robust, as it can also extract small blood capillaries that are not even visible to the naked eye, as shown in Figure 6e.
The structure coherence matrix of the retinal OCT scans is computed for the extraction of up to nine retinal layers [34]. Since most of the retinal layers are horizontally oriented, $T_{YY}$ is the most coherent tensor in $S_T$ for extracting layer information, as evident from Figure 6b. After extracting the nine retinal layers, the ILM and the retinal pigment epithelium (RPE) layers were used to generate a retinal mask, which was then multiplied by the candidate OCT B-scan for the extraction of retinal fluids [34]. The extracted retinal information was then overlaid onto the respective scan for the extraction of the clinically significant feature set by the proposed CNN model.
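To make the structure tensor computation concrete, the following pure-Python sketch computes the tensor components and a per-pixel coherence measure for a toy image with one horizontal layer boundary. It is an illustration under stated assumptions, not the authors' pipeline: central differences stand in for Gaussian derivative filters, the 3-tap Gaussian window is arbitrary, and the coherence formula ((λ1 − λ2)/(λ1 + λ2))² is the standard definition rather than one quoted from the paper.

```python
import math

def gaussian_kernel(radius=1, sigma=1.0):
    """Normalized 1D Gaussian weights (radius and sigma are assumptions)."""
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(img, kernel):
    """Separable Gaussian smoothing with clamped borders."""
    h, w = len(img), len(img[0])
    rad = len(kernel) // 2
    tmp = [[sum(kernel[k + rad] * img[r][min(max(c + k, 0), w - 1)]
                for k in range(-rad, rad + 1))
            for c in range(w)] for r in range(h)]
    return [[sum(kernel[k + rad] * tmp[min(max(r + k, 0), h - 1)][c]
                 for k in range(-rad, rad + 1))
             for c in range(w)] for r in range(h)]

def structure_tensor(img, kernel):
    """Return the smoothed tensor components (T_xx, T_xy, T_yy)."""
    h, w = len(img), len(img[0])
    # Central differences replace Gaussian derivative filters (assumption).
    gx = [[(img[r][min(c + 1, w - 1)] - img[r][max(c - 1, 0)]) / 2.0
           for c in range(w)] for r in range(h)]
    gy = [[(img[min(r + 1, h - 1)][c] - img[max(r - 1, 0)][c]) / 2.0
           for c in range(w)] for r in range(h)]
    t_xx = smooth([[gx[r][c] ** 2 for c in range(w)] for r in range(h)], kernel)
    t_yy = smooth([[gy[r][c] ** 2 for c in range(w)] for r in range(h)], kernel)
    t_xy = smooth([[gx[r][c] * gy[r][c] for c in range(w)] for r in range(h)], kernel)
    return t_xx, t_xy, t_yy

def coherence(t_xx, t_xy, t_yy):
    """Per-pixel coherence from the eigenvalues of the 2x2 tensor."""
    out = []
    for row_xx, row_xy, row_yy in zip(t_xx, t_xy, t_yy):
        row = []
        for a, b, d in zip(row_xx, row_xy, row_yy):
            trace = a + d                               # l1 + l2
            diff = math.sqrt((a - d) ** 2 + 4 * b * b)  # l1 - l2
            row.append((diff / trace) ** 2 if trace > 1e-12 else 0.0)
        out.append(row)
    return out

# Toy "B-scan": a dark band above a bright band, i.e. one horizontal layer edge.
layered = [[0.0] * 6 for _ in range(3)] + [[100.0] * 6 for _ in range(3)]
t_xx, t_xy, t_yy = structure_tensor(layered, gaussian_kernel())
coh = coherence(t_xx, t_xy, t_yy)
energy_xx = sum(sum(row) for row in t_xx)
energy_yy = sum(sum(row) for row in t_yy)
```

For this horizontally layered input all gradient energy falls into the vertical component, which mirrors the observation in the text that the vertically computed tensor is the most informative one for retinal layers.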

CNN for Feature Extraction
After extracting the hard exudates, retinal layers and retinal fluids, they were marked on the respective fundus and OCT scans for computing the distinct features that discriminate between healthy and ME affected subjects. These features were extracted through the proposed CNN architecture. We designed a 14-layered, structure tensor influenced CNN architecture containing one input layer, three convolution layers, three batch normalization layers, three ReLUs, two max pooling layers, one dropout layer with a 50% threshold and one fully connected layer. The kernels within the convolution layers of the proposed CNN architecture contained weights that retain the structure tensor-based features while suppressing other content. This gave significant variability between ME and healthy subjects. The proposed CNN model for feature extraction was designed from scratch and was trained on more than 70,000 scans, where the optimization was performed through SGD. In the proposed CNN model, the negative convolution sum values were removed through ReLU and the max pooling layers shrank the feature map to avoid unnecessary calculations. Since retinal fundus and OCT scans show different clinically significant ME findings, the proposed CNN architecture extracted distinct features from both imaging modalities (i.e., eight distinct features from the retinal fundus scan and eight distinct features from the OCT image), which were then concatenated to generate a 16-D feature vector. These sixteen features were then used to grade healthy and ME subjects. The proposed CNN model showed promising feature extraction results after being trained on the dataset mentioned in Table 1. This was due to the robust extraction of retinal information, which was mapped onto the retinal scans from which the proposed CNN model generates the most meaningful and distinctive features, as shown in Figure 7.
The detailed configuration of the proposed CNN model for feature extraction is presented in Table 3, while Table 4 contains the sixteen features extracted from some of the healthy and ME affected scans. Figure 7 shows the detailed CNN model for feature extraction from both imaging modalities.
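The layer budget and the feature fusion step described above can be checked with a few lines of Python. This is only a bookkeeping sketch: the layer ordering is not reproduced because it is not given in the text, the helper name `fuse_features` is hypothetical, and placing the fundus features before the OCT features is an assumption.

```python
# Layer counts as stated in the text; their ordering is not specified there.
layer_counts = {
    "input": 1,
    "convolution": 3,
    "batch_normalization": 3,
    "relu": 3,
    "max_pooling": 2,
    "dropout": 1,
    "fully_connected": 1,
}
total_layers = sum(layer_counts.values())

def fuse_features(fundus_features, oct_features):
    """Concatenate eight fundus and eight OCT features into one 16-D vector."""
    if len(fundus_features) != 8 or len(oct_features) != 8:
        raise ValueError("expected eight features per modality")
    # Fundus-first ordering is an assumption made for this sketch.
    return list(fundus_features) + list(oct_features)

# Hypothetical feature values, used only to show the shapes involved.
feature_vector = fuse_features([0.1] * 8, [0.2] * 8)
```

The counts add up to the 14 layers reported for the architecture, and the fused vector has the 16 entries that the classifiers in the next stage consume.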

Retinal Diagnosis
After extracting the sixteen clinically significant features from retinal fundus and OCT imagery, they were concatenated and utilized by the hybrid classification system for grading ME. The hybrid classification model in the proposed framework consisted of an ensemble of ANN, SVM and NB. The final decision was computed by taking the majority vote of all three classification models. The description of each classification model is presented below.
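The majority voting rule can be sketched in a few lines. The class labels and the individual predictions below are hypothetical; with three classifiers and two classes a strict majority always exists, so no tie-breaking rule is needed.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most classifiers."""
    label, _ = Counter(predictions).most_common(1)[0]
    return label

# Hypothetical per-classifier decisions for one subject.
votes = {"ANN": "ME", "SVM": "healthy", "NB": "ME"}
decision = majority_vote(list(votes.values()))
```

Here two of the three classifiers agree, so the ensemble follows them even though the SVM disagrees.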

Artificial Neural Networks
In this study, we used a feed forward artificial neural network classifier with one input layer, one output layer and two hidden layers. The input layer consisted of 16 nodes as per the extracted features. For hidden layers, we experimented with two to 40 nodes to find the optimum architecture of the neural network (12 nodes for the 1st hidden layer and nine for the 2nd hidden layer). A single output layer node gave the final classification probability. The sigmoid function was used for activation in each hidden layer whereas the final output layer contained softmax as the activation function. The weights during training were updated through gradient descent. Figure 8 shows the architecture of ANN used in the proposed study.
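A forward pass through a 16-12-9-1 network of this shape can be sketched as follows. The weights are random placeholders rather than trained values, and the sketch applies a logistic sigmoid at the single output node, which is an assumption made here for simplicity (the text reports softmax at the output). All function names are illustrative.

```python
import math
import random

LAYER_SIZES = [16, 12, 9, 1]  # input, 1st hidden, 2nd hidden, output (sizes from the text)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layers(sizes, seed=0):
    """Create (weights, biases) for each layer; random placeholders, not trained values."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, features):
    """Propagate one feature vector through every layer with sigmoid activations."""
    activations = features
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations[0]  # the single output node


layers = init_layers(LAYER_SIZES)
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
score = forward(layers, [0.1] * 16)  # hypothetical 16-D feature vector
print(n_params)  # 331 trainable weights and biases
```

With these layer sizes the network has 204 + 117 + 10 = 331 trainable parameters, which is small compared with the feature-extraction CNN.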

Support Vector Machines
We also used an SVM classifier in the proposed classification model. SVM is among the most extensively used classifiers [34], and in this research a non-linear decision boundary was computed through Gaussian radial basis function (RBF) and multilayer perceptron (MLP) hyperplanes for predicting ME and healthy subjects based on the extracted feature vector ($F_V$).
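For reference, the Gaussian RBF kernel that produces such a non-linear boundary can be written as a short Python function. The value of `gamma` and the example vectors are assumptions for illustration; the kernel parameters used in the study are not stated in this section.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2); gamma is an assumed value."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical 16-D feature vectors that differ in one component.
a = [0.2] * 16
b = [0.2] * 15 + [1.2]
same = rbf_kernel(a, a)   # identical vectors give the maximum similarity of 1
close = rbf_kernel(a, b)  # similarity decays with squared distance
print(same, close)
```

The kernel equals 1 for identical vectors and decreases towards 0 as the feature vectors move apart, which is what lets the SVM separate classes that are not linearly separable in the original 16-D space.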

Naïve Bayes
NB is a probabilistic classifier, which makes a decision based on the maximum a posteriori (MAP) rule. In this study, we used the NB classifier to determine the probability of the ME and healthy classes from a 16-D feature vector. The category with the maximum probability was then automatically chosen as the diagnosis for the respective feature vector. The probabilities were computed through Bayes' rule as expressed in Equations (9) and (10):

$$P(c_i \mid F_v) = \frac{P(F_v \mid c_i)\, P(c_i)}{P(F_v)} \qquad (9)$$

$$Y = \arg\max_{c_i} P(c_i \mid F_v) \qquad (10)$$

where $c_i$ represents the healthy and ME classes, $F_v$ is the 16-D test feature vector formulated during the feature extraction stage and $Y$ represents the class assigned to the unlabeled scan, which has the largest probability given $F_v$. $F_v$ contains eight distinct features from the retinal fundus scan and eight distinct features from the OCT scan. We used the Gaussian distribution to calculate the likelihood $P(F_v \mid c_i)$.
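A Gaussian naive Bayes classifier with the MAP decision rule can be sketched in plain Python as below. The toy training data use two features instead of sixteen to keep the example short, and the values, the variance floor `eps` and the function names are assumptions for illustration only.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log of the univariate Gaussian density."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def fit(samples, labels, eps=1e-9):
    """Estimate the prior, per-feature means and per-feature variances of each class."""
    model = {}
    for c in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == c]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(zip(*rows), means)]
        model[c] = (n / len(samples), means, variances)
    return model


def predict(model, x):
    """MAP rule: choose the class with the largest log prior plus log likelihood."""
    def log_posterior(c):
        prior, means, variances = model[c]
        return math.log(prior) + sum(
            gaussian_log_pdf(v, m, s) for v, m, s in zip(x, means, variances))
    return max(model, key=log_posterior)


# Hypothetical two-feature training vectors (the framework uses 16 features).
samples = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.15],
           [0.80, 0.90], [0.90, 0.80], [0.85, 0.85]]
labels = ["healthy", "healthy", "healthy", "ME", "ME", "ME"]
model = fit(samples, labels)
print(predict(model, [0.82, 0.88]))  # ME
print(predict(model, [0.12, 0.18]))  # healthy
```

The evidence term $P(F_v)$ is the same for every class, so the sketch omits it and compares log posteriors only up to that constant.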
The detailed block diagram of the classifier training stage is shown in Figure 9. We used around 0.07 million retinal scans for training the hybrid classifier. Details of the training dataset are mentioned in Table 1. At first, sixteen distinct features were extracted from the labeled training scans to form a 16-D feature vector, which was then passed to all three classifiers separately, and their decisions were finalized through majority voting. The performance of the proposed hybrid classifier during training was measured through K-fold cross validation as shown in Table 5 for different values of k. Once the classifiers achieved the desirable accuracy, they were used for retinal diagnosis of unlabeled scans during the classification stage as shown in Figure 9. Algorithm 1 summarizes the working flow of our proposed framework.
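The K-fold protocol mentioned above partitions the training set into k folds and holds each fold out once for validation. A minimal Python sketch of the index bookkeeping follows; it uses contiguous, unshuffled folds and a tiny sample count, both of which are simplifying assumptions rather than details from the study.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print(validation, len(train))
```

Each sample appears in exactly one validation fold, so averaging the k validation scores uses every labeled scan for both training and validation.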

Results
We tested the proposed framework on an unlabeled dataset consisting of 5000 OCT B-scans, out of which 2500 were of ME affected eyes and 2500 were of healthy eyes, and 100 fundus scans with the same ratio of ME and healthy eyes. Since the feature vector is generated by concatenating the extracted features from both fundus and OCT scans, we individually computed the mean dice coefficient for measuring the performance of extracting hard exudates, blood vessels and retinal fluids.
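The dice coefficient between an extracted region and a clinical marking is twice the overlap divided by the sum of the two region sizes. The Python sketch below computes it for two small binary masks; the masks are hypothetical, and returning 1.0 when both masks are empty is a convention assumed here.

```python
def dice_coefficient(predicted, reference):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks of equal shape."""
    pred = [p for row in predicted for p in row]
    ref = [r for row in reference for r in row]
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 3x3 masks: extracted region vs. expert marking.
predicted = [[0, 1, 1],
             [0, 1, 0],
             [0, 0, 0]]
reference = [[0, 1, 1],
             [0, 0, 0],
             [0, 0, 1]]
dice = dice_coefficient(predicted, reference)
print(round(dice, 4))  # 0.6667
```

Here two pixels overlap and each mask marks three pixels, giving 2 * 2 / (3 + 3). A mean dice coefficient is the average of this value over all evaluated scans.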
The dataset in [37] consisted of 24 diabetic macular edema eyes, of which seven had a diffuse pattern of fluid leakage, 10 had a focal pattern of leakage and seven had a mixed pattern of leakage. It also provided three different expert markings of hard exudates for all these cases, which we used in validating the performance of the proposed system in extracting hard exudate regions as shown in Table 6. It can be observed from Table 6 that the proposed framework achieved the overall mean dice coefficient of 0.7069 ± 0.11 in extracting hard exudates. Figure 10 shows the visual comparison of the proposed framework for extracting hard exudates with three different expert markings.
Since the annotations for blood vessels in fundus/FA scans and retinal fluids in OCT scans were not available in the datasets used in this study, we arranged these annotations through a local expert clinician for comparative analysis. We evaluated the efficiency of the proposed system for blood vessel extraction through the mean dice coefficient computed against the manual markings done by a local clinician as shown in Table 7. We obtained the overall mean dice coefficient of 0.8589 ± 0.04 for blood vessel segmentation in the case of healthy eyes and 0.8012 ± 0.03 in the case of ME affected eyes. For both retinal conditions combined, we achieved the overall mean dice coefficient of 0.8203 ± 0.03. These results validate the accuracy of the proposed system in blood vessel segmentation against various retinal pathologies, even in the presence of hard exudates, hemorrhages and micro-aneurysms in ME fundus/FA scans, and show the effectiveness of the proposed method in detailed extraction of blood vessels. Figure 11 shows the extracted blood vessels by the proposed system in healthy and ME affected fundus/FA scans.
Similarly, we evaluated the performance of the proposed system for the extraction of retinal fluids through the mean dice coefficient computed against the manual markings done by a local clinician as shown in Table 8. We obtained the overall mean dice coefficient of 0.9026 ± 0.03 for retinal fluid extraction on the Rabbani dataset [36] and 0.9012 ± 0.04 on the Zhang dataset [35]. For both datasets combined, we achieved the overall mean dice coefficient of 0.9019 ± 0.04. These results show that the proposed method performed well in retinal fluid extraction irrespective of the datasets and the OCT acquisition equipment. Figure 12 shows the extracted retinal fluid by the proposed system in healthy and ME affected OCT scans.
We used sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV) and diagnostic accuracy (A) as the five metrics to evaluate the hybrid classifier, as expressed in Equations (11)-(15):

$$\mathrm{SE} = \frac{T_P}{T_P + F_N} \qquad (11)$$

$$\mathrm{SP} = \frac{T_N}{T_N + F_P} \qquad (12)$$

$$\mathrm{PPV} = \frac{T_P}{T_P + F_P} \qquad (13)$$

$$\mathrm{NPV} = \frac{T_N}{T_N + F_N} \qquad (14)$$

$$\mathrm{A} = \frac{T_P + T_N}{T_P + T_N + F_P + F_N} \qquad (15)$$

$T_P$ and $T_N$ are the true positives and true negatives, respectively, which specify the correctly classified (CC) cases. In this study, $T_P$ indicates the cases where the input scan was of macular edema and was also classified as macular edema, while $T_N$ represents the cases where the input scan was of a healthy eye and was also classified as healthy. $F_P$ and $F_N$ stand for false positives and false negatives, respectively, and are the false classification indicators. $F_P$ cases are those in which the input scan was of a healthy eye and the classifier classified it as ME, while $F_N$ is the reverse of $F_P$. Table 9 compares the proposed framework with other state-of-the-art techniques.
Figures 13 and 14 show some healthy and ME OCT cases from both the Rabbani [36] and Zhang [35] datasets, which were correctly processed by the proposed framework, whereas Figure 15 shows some of the healthy and ME fundus scans, which were correctly processed by the proposed system.
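The five evaluation metrics (sensitivity, specificity, PPV, NPV and diagnostic accuracy) follow directly from the four confusion counts, as the Python sketch below shows. The counts passed to the function are hypothetical and do not come from the study; only the final line uses reported figures, namely the 4811 correctly classified scans out of 5100 evaluation scans stated in the Conclusions.

```python
def diagnostic_metrics(tp, tn, fp, fn):
    """Compute SE, SP, PPV, NPV and diagnostic accuracy from confusion counts."""
    return {
        "SE": tp / (tp + fn),    # sensitivity
        "SP": tn / (tn + fp),    # specificity
        "PPV": tp / (tp + fp),   # positive predictive value
        "NPV": tn / (tn + fn),   # negative predictive value
        "A": (tp + tn) / (tp + tn + fp + fn),  # diagnostic accuracy
    }


# Hypothetical confusion counts, for illustration only.
metrics = diagnostic_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Accuracy from the reported totals: correctly classified scans / evaluated scans.
reported_accuracy = round(100 * 4811 / 5100, 2)
print(reported_accuracy)  # 94.33
```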

The performance of the AlexNet and proposed CNN models during the training phase was measured through the accuracy and the cross-entropy loss function as expressed in Equation (16):

$$C_L = -\sum_{w} I_{F_V,w} \log\left(P_{F_V,w}\right) \qquad (16)$$

where $I_{F_V,w}$ is an indicator that $w$ is the correct class for the feature vector $F_V$, $P_{F_V,w}$ is the probability computed for $F_V$ that it belongs to class $w$ and $C_L$ is the cross-entropy loss. The summation in Equation (16) runs over the total number of classes.

Figure 15. Healthy and ME fundus scans in the Rabbani Dataset [39,40], which are processed by the proposed system. (a) Original healthy fundus scans; (b) original ME fundus scans; (c) classified as healthy by the proposed system and (d) classified as ME by the proposed system.

Apart from this, validation was performed every 100 iterations, where the validation performance was also measured through the accuracy and the cross-entropy loss function. The validation was performed in order to get an unbiased evaluation of the candidate model during the training phase, as evident from Figures 16 and 17. Furthermore, we employed 50% dropout layers within each model to reduce overfitting on the dataset. The proposed CNN model achieved the accuracy of 99.23% in 1500 iterations during the training phase while the AlexNet model achieved the accuracy of 98.79%. These results were obtained through MATLAB R2018a, and Table 10 shows the details of the systems and software along with the average time required for computing the results by each classifier. Although the average time of the hybrid classifier was a few seconds more than that of the individual classifiers, the accuracy achieved by the proposed classification model was 94.33%.

Table 10. Details of the system and software used for conducting this research.
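For a single feature vector, the cross-entropy loss described above reduces to the negative logarithm of the probability assigned to the correct class. A minimal Python sketch follows; the class probabilities are hypothetical, are assumed to be strictly positive, and the function name is illustrative.

```python
import math


def cross_entropy(probabilities, correct_class):
    """Cross-entropy for one sample: -sum over classes of indicator * log(probability)."""
    return -sum(
        (1.0 if w == correct_class else 0.0) * math.log(p)
        for w, p in probabilities.items()
    )


# Hypothetical class probabilities for one scan whose true class is ME.
loss = cross_entropy({"healthy": 0.2, "ME": 0.8}, correct_class="ME")
print(round(loss, 4))  # 0.2231
```

The loss approaches zero as the probability of the correct class approaches one, which is why a falling loss curve accompanies a rising accuracy curve during training.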

Discussion
A deep retinal diagnostic framework was proposed here that combines retinal fundus and OCT imagery for the extraction of clinically significant ME findings and uses the extracted information for the reliable and accurate grading of ME. According to ETDRS, ME is clinically graded based upon the locality of edema with respect to the fovea, i.e., if the retinal fluids or hard exudates are observed within the foveal diameter of 500 micrometers, then ME is graded as clinically significant; otherwise, it is graded as non-clinically significant. Clinically significant macular edema is more critical than non-clinically significant macular edema as it produces retinal thickening near the fovea, which causes non-recoverable visual impairments (or even blindness). Retinal fundus and OCT imagery are the most common non-invasive retinal examination techniques, which depict the prominent symptoms of retinopathy. OCT imagery shows the early symptoms of retinopathy due to its ability to present retinal cross-sectional regions. Therefore, retinal blood vessel leakage and retinal fluid accumulation can be easily visualized through OCT scans. However, accurate visualization of hard exudates from OCT imagery is a very cumbersome task, so retinal fundus scans are clinically used for this purpose. To the best of our knowledge, all the retinal diagnostic frameworks that have been proposed in the past for ME diagnosis are based on a single retinal imaging modality, which does not completely depict the retinal abnormalities. The proposed framework is unique as it fuses the findings from both retinal fundus and OCT imagery for the effective, reliable and objective diagnosis as well as grading of ME subjects (especially those having a diabetic history). The proposed framework first recognizes the type of imagery through the pre-trained AlexNet CNN model.
The retinal imaging modality recognition is one of the crucial steps of the proposed framework since neither type of image contains any metadata that can depict its unique information or description. Therefore, in order to develop a generalized framework that can perform automated analysis and can automatically mass screen retinal patients, the respective imagery has to be automatically recognized first. After recognizing the retinal images, the proposed framework extracts the retinal layers and retinal fluids from the candidate OCT scans and it also extracts the hard exudate regions from the fundus scans. The extracted retinal information is then overlaid onto the respective scans and the annotated scans are then passed to the proposed CNN model, which extracts eight distinct features from the annotated fundus scan and eight distinct features from the annotated OCT scan. These features are fused together to form a 16-D feature vector, which is passed to the proposed hybrid classifier formed through the ensemble of ANN, SVM and NB. One of the major aims of the proposed framework was to accurately diagnose and grade ME pathologies. Since ME is clinically graded into different categories depending upon the disease severity levels, a hybrid classification was proposed to obtain a reliable and accurate diagnosis; it gives a decision based upon the majority votes obtained through all three supervised classifiers. This increases the diagnostic performance of the proposed framework without compromising the time performance, as evident from Table 10. Apart from this, the proposed framework was extensively tested on multiple publicly available datasets and was compared with state-of-the-art solutions against different metrics and ground truths (provided by expert clinicians), as evident from the results section.
Table 9 depicts the detailed diagnostic comparison with other existing solutions where it can be seen that the proposed framework was the only generic framework that was validated on multiple publicly available datasets containing both retinal fundus and OCT imagery and achieved the diagnostic accuracy of 94.33%.

Conclusions and Future Work
In this paper, we proposed a computer aided diagnostic method for segmentation of retinal pathological symptoms and classification of macular edema using two retinal imaging modalities (OCT and fundus imaging). The proposed framework was based on a hybrid classification model in which 16 unique features were extracted for distinguishing macular edema cases from healthy ones. The dataset used for conducting this study consisted of 78,891 retinal scans in total, out of which we used 73,791 scans for training and 5100 for evaluation. The proposed classification model correctly classified 4811 of the 5100 evaluation scans, achieving 94.33% accuracy. The proposed system was quite robust in general, insensitive to OCT B-scan orientations and performed extremely well against noisy and degraded scans, as shown in Figure 5. Moreover, the proposed technique could be optimized for detecting other ocular diseases such as age-related macular degeneration (ARMD), idiopathic central serous chorioretinopathy (CSCR), glaucoma, diabetic retinopathy, etc., as well as for segmenting other retinal layers. It could also be extended for the 3D modeling of the human retina.