HyCAD-OCT: A Hybrid Computer-Aided Diagnosis of Retinopathy by Optical Coherence Tomography Integrating Machine Learning and Feature Maps Localization

Optical Coherence Tomography (OCT) imaging has major advantages in effectively identifying the presence of various ocular pathologies and detecting a wide range of macular diseases. OCT examinations can aid in the detection of many retina disorders in early stages that could not be detected in traditional retina images. In this paper, a new hybrid computer-aided OCT diagnostic system (HyCAD) is proposed for classification of Diabetic Macular Edema (DME), Choroidal Neovascularization (CNV) and drusen disorders, while separating them from Normal OCT images. The proposed HyCAD hybrid learning system integrates the segmentation of Region of Interest (RoI), based on central serious chorioretinopathy (CSC) in Spectral Domain Optical Coherence Tomography (SD-OCT) images, with deep learning architectures for effective diagnosis of retinal disorders. The proposed system assimilates a range of techniques including RoI localization and feature extraction, followed by classification and diagnosis. An efficient feature fusion phase has been introduced for combining the OCT image features, extracted by Deep Convolutional Neural Network (CNN), with the features extracted from the RoI segmentation phase. This fused feature set is used to predict multiclass OCT retina disorders. The proposed segmentation phase of retinal RoI regions adds substantial contribution as it draws attention to the most significant areas that are candidate for diagnosis. A new modified deep learning architecture (Norm-VGG16) is introduced integrating a kernel regularizer. Norm-VGG16 is trained from scratch on a large benchmark dataset and used in RoI localization and segmentation. Various experiments have been carried out to illustrate the performance of the proposed system. Large Dataset of Labeled Optical Coherence Tomography (OCT) v3 benchmark is used to validate the efficiency of the model compared with others in literature. The experimental results show that the proposed model achieves relatively high-performance in terms of accuracy, sensitivity and specificity. An average accuracy, sensitivity and specificity of 98.8%, 99.4% and 98.2% is achieved, respectively. The remarkable performance achieved reflects that the fusion phase can effectively improve the identification ratio of the urgent patients’ diagnostic images and clinical data. In addition, an outstanding performance is achieved compared to others in literature.


Introduction
Optical Coherence Tomography (OCT) imaging is a type of optical biopsy. It is used to view and capture small changes in retina. It generates cross sectional 3D images by measuring the echo time delay or the magnitude of back reflected light [1]. Compared to traditional regular retinal Macular edema is the swelling of the central part of the retina [6]. It is considered one of the most common retinal disorders. Increased blood sugar levels, in people with diabetes, can damage the retinal blood vessels or result in tiny bulges protruding from the vessel walls, leaking or oozing fluid and blood into the retina. This leads to the serious eye complication called Diabetic Macular Edema (DME) [7], which may cause vision problems or blindness.
Another type of retinal disorder is Choroidal Neovascularization (CNV), it is the creation of new blood vessels in the choroid layer of the eye. Neovascular Degenerative maculopathy is commonly accompanied with extreme myopia, malignant myopic degeneration or age-related developments [8]. CNV is a major cause of visual loss.
Another type of retinal disorder is drusen. It is small yellow deposits of fatty proteins (lipids) that accumulate under the retina. The retina is a thin layer of tissue that lines the back of the inside of the eye, near the optic nerve. The retina contains light-sensing cells that are essential for vision [9]. Drusen are like tiny pebbles of debris that build up over time and can result in central vision loss. Central vision allows us to focus on details straight ahead [9].
The presence of overt changes in retina layers with various diseases in OCT images, encourage the development of diagnostic systems to differentiate between these diseases automatically. Computer aided diagnosis can support ophthalmologists with accurate timely decisions, thus leading to promoting health care provision.

Literature Review on Retina Segmentation
In particular, segmentation has received considerable attention recently as it is an important step in aiding diagnosis since it helps determine the RoI. Ghorbel et al. [10] proposed a method for the segmentation of eight retinal layers in Heidelberg spectralis SD-OCT images. Global segmentation algorithms such as active contours and Markov random fields are used. In addition, a Kalman filter was designed. It is used to model the approximate parallelism between the photoreceptor segments [10].
A comparison between different edge detectors was conducted by Luo et al. [11]. They studied and compared the performance of canny edge detector [12,13], the two-pass method proposed by Macular edema is the swelling of the central part of the retina [6]. It is considered one of the most common retinal disorders. Increased blood sugar levels, in people with diabetes, can damage the retinal blood vessels or result in tiny bulges protruding from the vessel walls, leaking or oozing fluid and blood into the retina. This leads to the serious eye complication called Diabetic Macular Edema (DME) [7], which may cause vision problems or blindness.
Another type of retinal disorder is Choroidal Neovascularization (CNV), it is the creation of new blood vessels in the choroid layer of the eye. Neovascular Degenerative maculopathy is commonly accompanied with extreme myopia, malignant myopic degeneration or age-related developments [8]. CNV is a major cause of visual loss.
Another type of retinal disorder is drusen. It is small yellow deposits of fatty proteins (lipids) that accumulate under the retina. The retina is a thin layer of tissue that lines the back of the inside of the eye, near the optic nerve. The retina contains light-sensing cells that are essential for vision [9]. Drusen are like tiny pebbles of debris that build up over time and can result in central vision loss. Central vision allows us to focus on details straight ahead [9].
The presence of overt changes in retina layers with various diseases in OCT images, encourage the development of diagnostic systems to differentiate between these diseases automatically. Computer aided diagnosis can support ophthalmologists with accurate timely decisions, thus leading to promoting health care provision.

Literature Review on Retina Segmentation
In particular, segmentation has received considerable attention recently as it is an important step in aiding diagnosis since it helps determine the RoI. Ghorbel et al. [10] proposed a method for the segmentation of eight retinal layers in Heidelberg spectralis SD-OCT images. Global segmentation algorithms such as active contours and Markov random fields are used. In addition, a Kalman filter was designed. It is used to model the approximate parallelism between the photoreceptor segments [10].
A comparison between different edge detectors was conducted by Luo et al. [11]. They studied and compared the performance of canny edge detector [12,13], the two-pass method proposed by Bagci et al. [14] and Edgeflow technique [15]. All of these techniques were used with retinal OCT images to delineate the retinal layer boundaries. From the evaluation of the results, it was shown that the two-pass method outperforms the Canny detector and the EdgeFlow technique that is used with OCT images to delineate the retinal layer boundaries. In addition, the mean localization deviation metrics show that the smallest edge shifting problem is caused by the two-pass method. The study suggests that the two-pass method is the best one for retinal OCT boundary detectors.
In the work of Salarian et al. [16], a method that uses graph theory and the shortest path algorithm was presented to detect certain layers. The aim was to choose the RoI that could be used to distinguish normal cases from abnormal ones. They discovered that using changes in some parts, such as inner limiting membrane (ILM), retinal nerve fiber layer (RNFL) and retinal pigment epithelium (RPE), leads to separating these layers easily. The proposed technique was applied to all B-Scan images of 16 people, including low-quality images and some images with diseased eyes. The results were accurate according to manual segmentation of an expert.
Others such as Garvin et al. [17] proposed a method to segment five layers (NFL, GCL+IPL, INL+OPL, IS, OS) in OCT retinal images. The layers were identified using constructed geometric graph and computed a minimum cost closed set. This graph is constructed from the edge/regional information and a priori determined surface smoothness and interaction constraints.

Literature Review on Retina Disorder Classification Using OCT
Although the previous studies present effective segmentation techniques, there is a need to develop comprehensive diagnostic systems that emphasize RoI and perform disease classification. Such systems can help alleviate the burden on medical care systems.
Current automated diagnostic systems either use traditional ML techniques of feature extraction, segmentation and classification or utilize deep learning architectures with unprocessed images.
An example of these automated systems with OCT imaging can be found in the work of Srinivasan et al. [18]. It described an automated method for detecting retinal diseases using features extracted by Histogram of Oriented Gradients (HOG) techniques and SVMs to classify three classes namely DME, dry AMD and Normal. Their method does not rely on the segmentation of inner retinal layers; however, retinal flattening is performed. The proposed model achieved 100% for both DME and dry AMD classes and 86.67% for normal class. Another example of traditional approach is that of Alsaih et al. [19]. They proposed a pipeline model that generated a large set of extracted features using HOG and Local Binary Pattern (LBP) at four levels of the multiscale Gaussian lowpass image pyramid. These features were either reduced using Principle Component Analysis (PCA) or directly passed to SVM classifier, which was used to classify two classes: DME and normal. Their best model achieved an accuracy of 81%.
Naz et al. [1] addressed the problem of automatic classification of OCT images through SVM (leave one out) technique after image denoising and extraction of features from thickness profile and cyst spaces. This technique was based on extracting features from segmenting the retinal layer (ILM and choroid layer) and from the changes in thickness of these layers. Naz et al. [1] applied that technique for the identification of patients with DME versus normal subjects. The proposed model achieved an accuracy of 79.65%.
These traditional approaches [1,18,19], which employed hand-crafted features for the classifiers, were shown to obtain promising results. However, these approaches shared a common disadvantage, which is requiring abundant expert knowledge. Medical knowledge was needed to correctly detect retinal layers and/or changes in layers' thickness [1] and/or assist in retinal flattening [18,19]. Such dependence on experts' availability gives rise to another concern, which is having poor generalization to other domains [20]. Despite these limitations, hand-crafted features remain to provide intuitive information, which enables the needed analysis of the model [21]. Hence, the generation of hand-crafted features that are not heavily reliant on experts' knowledge would provide the required balance of providing explanation of the model without being dependent on an expert. This balance Appl. Sci. 2020, 10, 4716 5 of 25 would help mitigate the pressure on medical systems. In order to overcome these disadvantages, deep learning techniques were employed to perform embedded feature extraction implicitly. Awais et al. [22] used a deep learning architecture (VGG16) for extracting features from different layers of the network. Classifications were made using KNN and Decision Tree based on the extracted features. The model is used to classify two classes: DME and normal. The best model got an accuracy of 87.5%.
Perdomo et al. [23] proposed a new CNN architecture for OCT classification. The new CNN consists of two blocks, the first block contains four subblocks with convolutional layers and max pooling and the last block has two fully connected layers. The new CNN was evaluated with different learning rates, until it was able to correctly detect DME with a classification accuracy of 93.75% between two classes: DME and normal at learning rate 0.00001.
Motozawa et al. [24] developed a pipeline model consisting of two CNNs. The first is used to differentiate AMD class from normal class and the second CNN model is used to divide the AMD class into those with exudates and without exudates using transfer learning. The first CNN achieved an accuracy of 99% and the second one achieved an accuracy of 93.9%. Despite that the models attained high accuracies, the classification was limited to two classes which simplifies the problem. The work of Kermany et al. [4] established a diagnostic tool based on a deep learning framework for the screening of patients with common treatable blinding retinal diseases. They applied their approach on an OCT dataset [25] with four classes which are DME, CNV, DRUSEN, NORMAL. They used transfer learning inception model which achieved 96.1% average accuracy. Their approach demonstrated performance comparable to that of human experts in classifying age-related macular degeneration and diabetic macular edema. Using the same dataset, Li et al. [20] adopted a transfer learning VGG16 network that achieved an accuracy of 98.6%.
Both studies applied deep learning approaches [4,20] to classify four different classes including serious diseases (DME, CNV) and achieved promising results. However, the problem for deep learning models (CNNs) presented in the work of Kermany et al. and Li et al. [4,20] is that its sensitivity to urgent referrals classes (CNV/DME) is unsatisfactory. This is the main argue point against the approaches in the work of Kermany et al. and Li et al. [4,20]. Another common concern of deep learning architectures is the limited interpretability of the constructed models.
The incorporation of hand-crafted features can provide an adequate solution to deep learning issues. Hand-crafted features can offer complementary information that increase the understandability of the classification decision [26,27]. In addition, the use of hand-crafted features was shown to considerably improve the performance of pure deep learning approaches [28]. Recently, similar studies for image classification and recognition have shown that the integration of deep and traditional learning yield better performance. Examples of these studies include the work of Zhang et al. [29] for face recognition and Xie at al. [30] for lung nodule classification. Moreover, deep learning architectures in return can provide automated localization of the retinal RoI, which overcomes the need for an expert to aid in the specification of the RoI, required for traditional feature extraction.
In this paper, we propose a hybrid computer aided diagnosis (HyCAD) system based on OCT images using deep and traditional learning to attain high accuracy classifications and provide relative insight into the classification decision. In addition, HyCAD provides automated RoI localization which helps advance the diagnosis process.

Data Acquisition and Labeling
The current study uses a benchmark dataset which consists of 109,312 retinal OCT images. Center Ophthalmology Associates, the Shanghai First People's Hospital and the Beijing Tongren Eye Center [20,25].
Two retinal specialists provided the class labels for the images. Four class labels were available CNV, DME, DRUSEN, NORMAL) as shown in Figure 2. Furthermore, only 109,312 out of 207,130 images were clearly distributed between these classes by the agreement of the two retinal specialists and formed the dataset while the other images contained severe artifacts. The dataset is divided into 108,312 images as training set and 1000 images as testing set, according to Li et al. [20]. The distribution of the classes within the training set is illustrated in Table 1, where both classes: DME and DRUSEN, present a small percentage of the training set and can be considered to impose the challenges of minority classes on our classification problem [4]. The testing set is sampled to be of equal distribution for all classes as illustrated in Table 2.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 6 of 25 Two retinal specialists provided the class labels for the images. Four class labels were available CNV, DME, DRUSEN, NORMAL) as shown in Figure 2. Furthermore, only 109,312 out of 207,130 images were clearly distributed between these classes by the agreement of the two retinal specialists and formed the dataset while the other images contained severe artifacts. The dataset is divided into 108,312 images as training set and 1000 images as testing set, according to Li et al. [20]. The distribution of the classes within the training set is illustrated in Table 1, where both classes: DME and DRUSEN, present a small percentage of the training set and can be considered to impose the challenges of minority classes on our classification problem [4]. The testing set is sampled to be of equal distribution for all classes as illustrated in Table 2.  The classes are categorized as Urgent referrals and Nonurgent referrals groups. The CNV and DME classes represent the Urgent referrals group, where patients need to be transferred to an ophthalmologist in time for definitive anti-VEGF treatment [4,20]. The referral is urgent in this case because if the treatment is delayed, it would almost cause irreversible vision impairment and even lead to blindness. The Nonurgent referrals group, represented by DRUSEN and NORMAL classes, does not need immediate Urgent referral from an ophthalmologist [4,20].  The classes are categorized as Urgent referrals and Nonurgent referrals groups. The CNV and DME classes represent the Urgent referrals group, where patients need to be transferred to an ophthalmologist in time for definitive anti-VEGF treatment [4,20]. The referral is urgent in this case because if the treatment is delayed, it would almost cause irreversible vision impairment and even lead to blindness.
The Nonurgent referrals group, represented by DRUSEN and NORMAL classes, does not need immediate Urgent referral from an ophthalmologist [4,20].

Image Preprocessing
A range of preprocessing steps are applied to the dataset to limit the processing system requirements and to increase the robustness of the models. First, individual cross-sectional tomography (B-scans) in the SD-OCT volume are denoised using the sparsity-based block matching and 3D-filtering (BM3D) denoising method which reduces the speckle noise. It is the filter from the nonlocal means class that looks for local neighborhoods of similar shapes and puts them into a 3D matrix [31].
The applied BM3D filter noticeably reduces the speckle noise in images, as can be seen in Figure 3. Noise reduction helps in achieving good performance metrics.

Image Preprocessing
A range of preprocessing steps are applied to the dataset to limit the processing system requirements and to increase the robustness of the models. First, individual cross-sectional tomography (B-scans) in the SD-OCT volume are denoised using the sparsity-based block matching and 3D-filtering (BM3D) denoising method which reduces the speckle noise. It is the filter from the nonlocal means class that looks for local neighborhoods of similar shapes and puts them into a 3D matrix [31].
The applied BM3D filter noticeably reduces the speckle noise in images, as can be seen in Figure  3. Noise reduction helps in achieving good performance metrics. Second, images are streamed in batches to avoid system crash due to RAM overloading, and this is because of the memory limitations. Memory limitations come from training models using a RAM size of small bandwidth compared to the large dataset used in training (108,312 images) and the large size of our deployed CNN architecture. To solve this problem, our data is partitioned into batches of 32 images and the training data was transferred batch by batch from the hard disk to RAM during training. After the training of each batch ends, it is discarded, and a new batch of images is generated and stored in RAM. So, each training epoch will have 3385 iterations (equal to number of training dataset/batch size). This technique allows us to train our large training set without RAM overflowing.
Data Augmentation is applied for each batch. Rotation by range from 0 to 30 degrees, horizontal flipping and shifting by range between 0 and 30% from left, right, up and down is applied randomly on each batch of images in each epoch to allow deep learning architectures to train on different orientations of input image, thus improving the robustness of the system. After that, all the images are resized to 200 × 200 pixels.
Each pixel value is normalized to process all images in the same manner because some images may have high pixel range that could cause stronger loss and low pixel range that could cause weaker loss. Additionally, high pixel range will have a large number of votes to determine how to update weights. So, if the values of intensities are normalized to avoid discrepancy in image processing caused by different pixel ranges, this could decrease the gap and make a fair competition in votes between high pixel range and low pixel range. Second, images are streamed in batches to avoid system crash due to RAM overloading, and this is because of the memory limitations. Memory limitations come from training models using a RAM size of small bandwidth compared to the large dataset used in training (108,312 images) and the large size of our deployed CNN architecture. To solve this problem, our data is partitioned into batches of 32 images and the training data was transferred batch by batch from the hard disk to RAM during training. After the training of each batch ends, it is discarded, and a new batch of images is generated and stored in RAM. So, each training epoch will have 3385 iterations (equal to number of training dataset/batch size). This technique allows us to train our large training set without RAM overflowing.
Data Augmentation is applied for each batch. Rotation by range from 0 to 30 degrees, horizontal flipping and shifting by range between 0 and 30% from left, right, up and down is applied randomly on each batch of images in each epoch to allow deep learning architectures to train on different orientations of input image, thus improving the robustness of the system. After that, all the images are resized to 200 × 200 pixels.
Each pixel value is normalized to process all images in the same manner because some images may have high pixel range that could cause stronger loss and low pixel range that could cause weaker loss. Additionally, high pixel range will have a large number of votes to determine how to update weights. So, if the values of intensities are normalized to avoid discrepancy in image processing caused by different pixel ranges, this could decrease the gap and make a fair competition in votes between high pixel range and low pixel range.

The Proposed HyCAD System: Hybrid Computer-Aided Diagnosis of Retinopathy Applying Deep and Traditional Machine Learning on OCT Images
In this section, the proposed model architecture will be introduced, with an illustration of the phases' description. The proposed model utilizes modified variant of an existing deep learning architecture, namely VGG16, with OCT images for efficient diagnosis of retina disorders. The proposed deep learning architecture modifications aim at improving the performance of the standard basic architectures in terms of attained accuracy and training time requirements. The implemented architectures provide RoI localization and feature generation. Hand-crafted features are generated from the RoI that is extracted by the deep learning architecture. Afterwards, both CNN-based features and hand-crafted features are merged and input to the dense layer. Diagnosis is produced based on the classification output by the dense layer. The proposed system architecture is shown in Figure 4.

The Proposed HyCAD System: Hybrid Computer-Aided Diagnosis of Retinopathy Applying Deep and Traditional Machine Learning on OCT Images
In this section, the proposed model architecture will be introduced, with an illustration of the phases' description. The proposed model utilizes modified variant of an existing deep learning architecture, namely VGG16, with OCT images for efficient diagnosis of retina disorders. The proposed deep learning architecture modifications aim at improving the performance of the standard basic architectures in terms of attained accuracy and training time requirements. The implemented architectures provide RoI localization and feature generation. Hand-crafted features are generated from the RoI that is extracted by the deep learning architecture. Afterwards, both CNN-based features and hand-crafted features are merged and input to the dense layer. Diagnosis is produced based on the classification output by the dense layer. The proposed system architecture is shown in Figure 4.

Norm-VGG16 Deep Learning Architecture
One of the most prominent deep learning architectures is VGGs Network. VGG16 [32] architecture is utilized for its known superior performance, for example, compared to other models like AlexNet [33]. This may be attributed to its small kernel size (3 × 3) and more trainable parameters, which helps increase the depth of CNN without the problem of gradient loss in deep CNNs. A set of modifications to the standard VGG network architecture are proposed to create Norm-VGG16 Network variant architecture that enhances the performance of the basic architecture.
First, the number of convolution layers is increased from 13 layers in standard VGG to 16 convolution layers. Each of the added layers has a kernel of size 3 × 3 and stride = 1. The pipeline of the convolution layers starts with first layer of size 200 × 200 and is convolved by 64 (3 × 3) kernels. The ending layer of the pipelined convolution layers has 512 (3 × 3) kernel with output of size 3 × 3. Between each group of convolution layers there is a max pooling layer of size (2 × 2) to down sample the feature tensor to reduce overfitting. The second applied modification is adding a global average pooling layer followed by a dense layer to reduce the overfitting of CNN. The dense layer has an output of four neurons representing the four classes of our dataset.
Another enhancement to the standard architecture is adding L2 kernel regularization to limit overfitting. The added regularization is shown in Figure 5 at the last dense layer. Kernel regularization was first added in different layers, but the best performance is attained by the variant scenario implementing L2 kernel regularization at the dense layer (last layer). The value of the hyperparameter of L2 regularizer, which determines the relative importance of the regularization component compared to the loss component, is set to 0.01.
L2 regularizer puts a constraint on the complexity of a network by giving a small value to its weights, which makes the distribution of weights regular. This is done by the increasing loss function of the network, which has large weights [34].

Norm-VGG16 Deep Learning Architecture
One of the most prominent deep learning architectures is VGGs Network. VGG16 [32] architecture is utilized for its known superior performance, for example, compared to other models like AlexNet [33]. This may be attributed to its small kernel size (3 × 3) and more trainable parameters, which helps increase the depth of CNN without the problem of gradient loss in deep CNNs. A set of modifications to the standard VGG network architecture are proposed to create Norm-VGG16 Network variant architecture that enhances the performance of the basic architecture.
First, the number of convolution layers is increased from 13 layers in standard VGG to 16 convolution layers. Each of the added layers has a kernel of size 3 × 3 and stride = 1. The pipeline of the convolution layers starts with first layer of size 200 × 200 and is convolved by 64 (3 × 3) kernels. The ending layer of the pipelined convolution layers has 512 (3 × 3) kernel with output of size 3 × 3. Between each group of convolution layers there is a max pooling layer of size (2 × 2) to down sample the feature tensor to reduce overfitting. The second applied modification is adding a global average pooling layer followed by a dense layer to reduce the overfitting of CNN. The dense layer has an output of four neurons representing the four classes of our dataset.
Another enhancement to the standard architecture is adding L2 kernel regularization to limit overfitting. The added regularization is shown in Figure 5 at the last dense layer. Kernel regularization was first added in different layers, but the best performance is attained by the variant scenario implementing L2 kernel regularization at the dense layer (last layer). The value of the hyperparameter of L2 regularizer, which determines the relative importance of the regularization component compared to the loss component, is set to 0.01. Appl. Sci. 2020, 10, x FOR PEER REVIEW 9 of 25 To explain how the regularizer works, a training function ŷ: f(x) should be first defined as a function that maps an input vector x to output ŷ where ŷ is predicted value for actual value y. The loss between ŷ (f(x)) and y can be defined for one sample xi with corresponding target yi. Loss (L) can be computed as L((yi), ŷi) = L (f(xi), yi) [35]. For all input samples xi ……. xn. The sum of all loss functions between each input xi and its corresponding output ŷ is minimized as given in Equation (1).
The regularizer (R(f)) is added to the loss equation to increase the training loss to decrease overfitting as shown in Equation (2).
where λ is a hyperparameter that determines the relative importance of the regularization component compared to the loss component [35]. We choose our λ = 0.01. From Equation (1) and Equation (2) [34,35]. The best results are obtained at λ = 0.01. The L2 regularization is a type of regularization that is proportional to the square of the value of the weight coefficients (the L2 norm of the weights) [36] as shown in Equation (3).

=
(3) By substituting Equation (3) into Equation (2), L2 regularization Equation (4) is generated L2 regularizer puts a constraint on the complexity of a network by giving a small value to its weights, which makes the distribution of weights regular. This is done by the increasing loss function of the network, which has large weights [34].
To explain how the regularizer works, a training functionŷ: f(x) should be first defined as a function that maps an input vector x to outputŷ whereŷ is predicted value for actual value y. The loss betweenŷ (f(x)) and y can be defined for one sample x i with corresponding target y i . Loss (L) can be computed as L((y i ),ŷ i ) = L (f(x i ), y i ) [35]. For all input samples x i . . . . . . . x n . The sum of all loss functions between each input x i and its corresponding outputŷ is minimized as given in Equation (1).
The regularizer (R(f)) is added to the loss equation to increase the training loss to decrease overfitting as shown in Equation (2).
where λ is a hyperparameter that determines the relative importance of the regularization component compared to the loss component [35]. We choose our λ = 0.01. From Equation (1) and Equation (2), minimization after adding the regularization would occur to both the component of loss and the regularization component. Different values for λ are tested in the range between 0 and 1 (0.001, 0.01, 0.02, 0.1, etc.) according to the commonly used values in literature [34,35]. The best results are obtained at λ = 0.01. The L2 regularization is a type of regularization that is proportional to the square of the value of the weight coefficients (the L2 norm of the weights) [36] as shown in Equation (3).

RoI Localization and Segmentation
The performance of deep learning is noteworthy. Nevertheless, a shortcoming of deep learning architectures is that they are considered as black boxes which do not provide any insight into the classification process. Therefore, in this phase the activation maps are generated by Norm-VGG16 for our classes to visualize and localize the areas used for classification. The activation maps highlight the focus areas, from which most of the CNN-based features were extracted.
In order to generate discriminative localization map, Gradient-weighted Class Activation Mapping (Grad-CAM) module [37] is used. The implemented Grad-CAM module structure is shown in Figure 6. Grad-CAM L c Grad-CAM ∈ R u×v of width u and height v for any class c. The gradient of the score of the class c is computed before the SoftMax layer (y c ) with respect to feature maps A k of a convolutional layer ∂y c ∂A k . These gradients flowing back are global average pooled to obtain the neuron importance weights α c k as shown in Equation (5).  This weight α c k represents a partial linearization of the deep network downstream from A and captures the "importance" of feature map k for a target class c. After that a weighted combination of forward activation maps takes place. RELU is applied on the combination of feature maps. RELU is used to emphasize the positive influencer pixels whose intensity should be increased in order to increase y c for a certain class and remove the negative influencer pixels that are likely to belong to other categories in the image as shown in Equation (6).
The RELU helps in highlighting the positive influencer pixels in the localization maps and avoid highlighting the negative influencer pixels, which achieve better localization [37].
The generated heat maps (activation maps) seem to clearly portray the RoIs used in classification, which can be considered as implicit segmentation. Thus, they provide transparency of the model, which is required in medical diagnosis. The generated heat maps for all the four classes are shown in Figure 7, where the activation maps clearly mark the RoI.   The shown samples of the activation maps clearly mark the RoI that can be used in classification and feature generation. As shown in Figure 8, a binary mask is generated for each image from the RoI and the corresponding image is segmented using this binary mask.
The binary masks are generated based on the heat maps output by the Norm-VGG16 deep learning architecture. The heat maps highlight the RoI, which allow the generation of binary masks. Simple global thresholding [38,39] is used, where a histogram is plotted for each heat map and the respective threshold value is determined. The threshold value is selected per individual heat map such that red and yellow areas in the heat map are used to create the mask (emphasizing the localized RoI region) as shown in Figure 9. The global threshold value is used to generate the binary mask. Appl. Sci. 2020, 10, x FOR PEER REVIEW 12 of 25

Hand-Crafted Feature Generation
After RoI localization and cropping, a set of spatially articulated features are generated. Before generation, the cropped part of the image is resized to 64 × 64. Then, HOG and DAISY feature descriptors were generated.
HOG descriptors are calculated by normalizing colors then dividing the image into blocks and each block is divided into smaller units called cells. Each cell contains number of pixels (pixel intensities). First, the gradient magnitude and direction of each cell's pixel intensities is calculated. If  (x, y) is assumed as a pixel intensity, then gradient magnitude is calculated from Equation (7) and gradient angle is calculated using Equation (8) [40,41].

Hand-Crafted Feature Generation
After RoI localization and cropping, a set of spatially articulated features are generated. Before generation, the cropped part of the image is resized to 64 × 64. Then, HOG and DAISY feature descriptors were generated.
HOG descriptors are calculated by normalizing colors then dividing the image into blocks and each block is divided into smaller units called cells. Each cell contains number of pixels (pixel intensities). First, the gradient magnitude and direction of each cell's pixel intensities is calculated. If  (x, y) is assumed as a pixel intensity, then gradient magnitude is calculated from Equation (7) and gradient angle is calculated using Equation (8) [40,41].

Hand-Crafted Feature Generation
After RoI localization and cropping, a set of spatially articulated features are generated. Before generation, the cropped part of the image is resized to 64 × 64. Then, HOG and DAISY feature descriptors were generated. HOG descriptors are calculated by normalizing colors then dividing the image into blocks and each block is divided into smaller units called cells. Each cell contains number of pixels (pixel intensities). First, the gradient magnitude and direction of each cell's pixel intensities is calculated. If (x, y) is assumed as a pixel intensity, then gradient magnitude is calculated from Equation (7) and gradient angle is calculated using Equation (8) [40,41].
After calculating magnitude and angle, the HOG is calculated for each cell by calculating the histogram. Q bins for angles are chosen with unsigned orientation angles between 0 and 180. Afterwards, normalization is applied since different images may have different contrasts [40]. The pipeline of HOG can be shown in Figure 10. In our implementation, a [4 × 4] , = arctan , , After calculating magnitude and angle, the HOG is calculated for each cell by calculating the histogram. Q bins for angles are chosen with unsigned orientation angles between 0 and 180. Afterwards, normalization is applied since different images may have different contrasts [40]. The pipeline of HOG can be shown in Figure 10. In our implementation, a [4 × 4] cell size, [2 × 2] cells per block and 9 orientation histogram bins creating 8100 features were used. DAISY feature descriptor is an algorithm that generates low dimensional invariant descriptors that convert local image regions into low dimensional invariant descriptors, which can be used for matching and classification. DAISY feature descriptor is faster than GLOH and SIFT feature descriptors [42]. It can be computed efficiently at every pixel unlike SURF. For DAISY descriptors generation, eight orientation maps, G, are computed for each image. one for each quantized direction, where G (u, v) equals the image gradient at location (u, v) [42]. Gaussian kernels of different ∑ values convolve each orientation map several times to obtain convolved orientation maps for different sized regions [42]. If h∑ (u, v) is the vector made of the values at location (u, v) in the orientation maps after convolution by a Gaussian kernel of standard deviation Σ as shown in Equation (9), after that, the If h∑ (u, v) is normalized where G1 ∑ , G2 ∑ …….G8 ∑ denote the Σ-convolved orientation maps. The vector is normalized and denoted by ĥ∑ (u, v). To correctly represent the pixels near occlusions, the normalization is performed in each histogram independently because normalizing the descriptor as a whole would lead to the same point, which is close to occlusions, appearing different in two images. A complete explanation of calculating Full DAISY descriptor is given by Tola et al. [42].

Feature Fusion and Classification
A set of 512 CNN-based features are extracted per image from our Norm-VGG16 network architecture. The features are extracted from the global average pooling layer. These features are fused with the 8100-feature array of hand-crafted HOG descriptors and 400-feature array of handcrafted DAISY, leading to a concatenated array size of 9012 features. The classification of these feature sets is done in batches. A Neural Network consisting of three dense layers is used to train these features with first layer of input size of 9012 inputs and 2048 neurons followed by a hidden layer of 512 neurons and the final layer contains four neurons with a SoftMax activation to normalize the distribution probability as shown in Figure 11. The model is trained using Adam optimizer with batch size 32 for 100 epochs. DAISY feature descriptor is an algorithm that generates low dimensional invariant descriptors that convert local image regions into low dimensional invariant descriptors, which can be used for matching and classification. DAISY feature descriptor is faster than GLOH and SIFT feature descriptors [42]. It can be computed efficiently at every pixel unlike SURF. For DAISY descriptors generation, eight orientation maps, G, are computed for each image. one for each quantized direction, where G (u, v) equals the image gradient at location (u, v) [42]. Gaussian kernels of different values convolve each orientation map several times to obtain convolved orientation maps for different sized regions [42]. If h (u, v) is the vector made of the values at location (u, v) in the orientation maps after convolution by a Gaussian kernel of standard deviation Σ as shown in Equation (9), after that, the If h (u, v) is normalized where G 1 , G 2 . . . . . . .G 8 denote the Σ-convolved orientation maps. The vector is normalized and denoted byĥ (u, v). To correctly represent the pixels near occlusions, the normalization is performed in each histogram independently because normalizing the descriptor as a whole would lead to the same point, which is close to occlusions, appearing different in two images. A complete explanation of calculating Full DAISY descriptor is given by Tola et al. [42].

Feature Fusion and Classification
A set of 512 CNN-based features are extracted per image from our Norm-VGG16 network architecture. The features are extracted from the global average pooling layer. These features are fused with the 8100-feature array of hand-crafted HOG descriptors and 400-feature array of hand-crafted DAISY, leading to a concatenated array size of 9012 features. The classification of these feature sets is done in batches. A Neural Network consisting of three dense layers is used to train these features with first layer of input size of 9012 inputs and 2048 neurons followed by a hidden layer of 512 neurons and the final layer contains four neurons with a SoftMax activation to normalize the distribution probability as shown in Figure 11. The model is trained using Adam optimizer with batch size 32 for 100 epochs. Appl. Sci. 2020, 10, x FOR PEER REVIEW 14 of 25 Figure 11. Neural Network Architecture.

Expermental Setup
Models implementation, training and results are done using Python language v3.7 All deep learning architectures are trained for 30 epochs from scratch using Adam optimizer with starting learning rate of 0.001 on the 108,312 training images. Inputs are divided into batches of size 32. Validation accuracy and cross-entropy loss are monitored for each epoch. In addition, the learning rate is reduced by factor of 0.2 for each three epochs without improvement in validation loss.
For the deep learning architectures, the best model is defined as having minimum validation loss, then it is stored and applied on the testing set. The best model Norm-VGG16 took about 12 h to be trained.

Performance Metrics
The following measures are used to evaluate the performance of trained models to decide which one has advantage over the others.

Confusion Matrix
It is a table that is used to visualize and describe the performance of a classification model on a testing set of data as shown in Figure 12. It is a summary of prediction results in a classification problem [43].  For the deep learning architectures, the best model is defined as having minimum validation loss, then it is stored and applied on the testing set. The best model Norm-VGG16 took about 12 h to be trained.

Performance Metrics
The following measures are used to evaluate the performance of trained models to decide which one has advantage over the others.

Confusion Matrix
It is a table that is used to visualize and describe the performance of a classification model on a testing set of data as shown in Figure 12. It is a summary of prediction results in a classification problem [43].

Accuracy (Acc)
It is measured by dividing the number of correctly labeled images by the total number of test images [4]. Equation (10) depicts single class accuracy measurements = + + + + (10)

Sensitivity (Recall)
It is determined by dividing the total number of correctly classified Urgent referrals by total number of actual Urgent referrals [4]. Equation (11) depicts single class sensitivity measurements = + (11)

Specificity (Sp)
It is determined by dividing the total number of correctly classified non-referrals by total number of actual Nonurgent referrals [4]. Equation (12) depicts single class specificity measurements = + (12)

Experimental Results
In this section, the implemented experimental scenarios used for performance comparison and the corresponding results using the described performance evaluation metrics are described.

HyCAD Architecture
The proposed system architecture presented in Section 5 is applied to the dataset described in Section 4. Norm-VGG16 integrating kernel regularizer is trained on the dataset from scratch aiming at higher performance metrics. It is used for RoI generation and feature extraction. A set of 512 features are extracted from the global average pooling layer of the CNN network. The hand-crafted feature extraction methods, namely HOG and DAISY, when applied on the segmented RoI generated 8500 features. At the fusion stage of our HyCAD system, all the extracted features are fused and fed into a three sequential dense layer neural network for a classification decision.

Norm-VGG16 (kernel regularized) Deep Learning Architecture
In this scenario, the modified Norm-VGG16 Deep Learning architecture is applied on the OCT dataset. The performed classification is based solely on the CNN-based features. This scenario is • TP (C i .) = All the instance of C i . that are classified as C i . • TN (C i .) = All the non-Ci. instances that are not classified as C i . • FP (C i .) = All the non-Ci. instances that are classified as C i . • FN (C i .) = All the Ci instances that are not classified as C i .

Accuracy (Acc)
It is measured by dividing the number of correctly labeled images by the total number of test images [4]. Equation (10)  It is determined by dividing the total number of correctly classified non-referrals by total number of actual Nonurgent referrals [4]. Equation (12) depicts single class specificity measurements Speci f icity = True Negative True Negative + False Positive (12)

Experimental Results
In this section, the implemented experimental scenarios used for performance comparison and the corresponding results using the described performance evaluation metrics are described.

HyCAD Architecture
The proposed system architecture presented in Section 5 is applied to the dataset described in Section 4. Norm-VGG16 integrating kernel regularizer is trained on the dataset from scratch aiming at higher performance metrics. It is used for RoI generation and feature extraction. A set of 512 features are extracted from the global average pooling layer of the CNN network. The hand-crafted feature extraction methods, namely HOG and DAISY, when applied on the segmented RoI generated 8500 features. At the fusion stage of our HyCAD system, all the extracted features are fused and fed into a three sequential dense layer neural network for a classification decision.
2. Norm-VGG16 (kernel regularized) Deep Learning Architecture In this scenario, the modified Norm-VGG16 Deep Learning architecture is applied on the OCT dataset. The performed classification is based solely on the CNN-based features. This scenario is presented to elucidate the merit of incorporating human articulated features on the performance of the system.
3. ResNet-50 Net (kernel regularized) Deep Learning Architecture ResNets [32,44] can be considered one of the most applied deep learning architectures for image recognition and classification. ResNets include residual blocks which allow deeper network architecture to avoid information loss during training. In addition, it was shown to achieve lower validation loss compared to VGG16 on the ImageNet dataset [44]. Therefore, ResNet-50 architecture (shown in Appendix A) is chosen as one of our experimental scenarios, to compare its performance to the proposed HyCAD system. The ResNet-50 best model is modified by adding kernel regularization in the final dense layer.
In addition to these implemented experimental scenarios, the results of HyCAD are compared with the work of Kermany et al. [4] and Li et al. [20]. These were chosen as their results were reported on the same dataset using deep learning approaches and attained high results.

Classification Results
Our experimental results are shown in two folds. First, bootstrapping is adopted, and a limited set of results are reported to show the performance of HyCAD across several testing data partitions and investigate its stability relative to Norm-VGG16 and RESNET-50 and to clarify the impact of hand-crafted features' fusion with deep learning architectures. In the second fold of experiments, elaborate results are reported on the same training and test percentages used by Kermany et al. [4] and Li et al. [20]. The results of the best run are displayed to be able to compare to Kermany et al. [4] and Li et al. [20].
1. Bootstrapping Experiments Fold Ten bootstrapping experiments are conducted to evaluate the performance of the models at different boot strapped test partitions. In each experiment, the 109,312 images are shuffled and a random selection of 108,312 images is used as training dataset and the rest of the 1000 images are used as testing dataset. Each new partition tested on our implemented architectures (HyCAD, Norm-VGG16 integrating kernel regularizer and ResNet-50 integrating kernel regularizer) taking into consideration that the testing dataset is partitioned equally between four classes (CNV, DME, DRUSEN and NORMAL). Table 3 presents the mean values and the standard deviation of overall accuracy and urgent group sensitivity for 10 different bootstrapped partitions. The depicted results reveal the improvement achieved through integrating the handcrafted features with Norm-VGG16 integrating kernel regularizer. The proposed HyCAD architecture attains the highest mean accuracy and mean sensitivity for the critical Urgent referrals compared to Norm-VGG16 integrating kernel regularizer and ResNet-50. In addition, it attains the lowest standard deviation in comparison to the pure CNN architectures. Compared to Norm-VGG16 integrating kernel regularizer, HyCAD achieves a substantial increase of 2.9% and 4.9% in terms of mean accuracy and mean sensitivity of Urgent referrals, respectively. In addition, the HyCAD model has a lower standard deviation relative to the pure deep learning architecture Norm-VGG16 integrating kernel regularizer. Such findings emphasize the positive impact of fusing hand-crafted features with learned features from deep learning architecture. Table 3. Results of 10 different experimental runs by random bootstrapping selection of 1000 images as testing set out of the 109,312 images on Large Dataset of Labeled Optical Coherence Tomography (OCT) v3 [25].

Mean Accuracy ± Std
Mean UR Sensitivity ± Std The proposed HyCAD architecture 97. The Training and Validation losses and accuracies for the two deep learning architecture (Norm-VGG16 kernel regularized and ResNet-50 kernel regularized) for each epoch are shown in Figure 13. The Norm-VGG16 architecture best model achieves training accuracy of 94.86% and training loss of 15.75%. In addition, it achieves a testing accuracy of 97.3% and testing loss of 8.34%. ResNet-50 achieves a training accuracy of 96.95% and a training loss of 9.02%. In addition, it achieves a testing accuracy of 97% and a testing loss of 8.76%. From the shown learning curves, it is manifest that the performance stabilizes around the 15th epoch, which can lead to reducing the training time considerably.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 17 of 25 The Training and Validation losses and accuracies for the two deep learning architecture (Norm-VGG16 kernel regularized and ResNet-50 kernel regularized) for each epoch are shown in Figure 13. The Norm-VGG16 architecture best model achieves training accuracy of 94.86% and training loss of 15.75%. In addition, it achieves a testing accuracy of 97.3% and testing loss of 8.34%. ResNet-50 achieves a training accuracy of 96.95% and a training loss of 9.02%. In addition, it achieves a testing accuracy of 97% and a testing loss of 8.76%. From the shown learning curves, it is manifest that the performance stabilizes around the 15th epoch, which can lead to reducing the training time considerably. The confusion matrices of the best models of the presented experimental scenarios on the test set are presented in Figure 14. ResNet-50 ( Figure 14C) achieves the lowest testing accuracy of 97%. The model's sensitivity for Urgent referral group is 98.6%, while its specificity is 95.4%. The confusion matrix output by applying the best Norm-VGG16 model on the test set is presented in Figure 14B. The confusion matrix illustrates that 973 out of 1000 testing images were correctly classified. The model's sensitivity for the Urgent referral group is 98%, while its specificity is 96.6%. The confusion matrices of the best models of the presented experimental scenarios on the test set are presented in Figure 14. ResNet-50 ( Figure 14C) achieves the lowest testing accuracy of 97%. The model's sensitivity for Urgent referral group is 98.6%, while its specificity is 95.4%. The confusion matrix output by applying the best Norm-VGG16 model on the test set is presented in Figure 14B. The confusion matrix illustrates that 973 out of 1000 testing images were correctly classified. The model's sensitivity for the Urgent referral group is 98%, while its specificity is 96.6%. The performance of the models is summarized and compared in Table 4. From the shown results, it is determined that the best performing model is HyCAD architecture according to its confusion matrix presented in Figure 14A, scoring an accuracy of 98.8%. In Urgent referrals group, the HyCAD architecture only misclassified 3 images out of 500 and achieves a sensitivity of 99.4% as shown in Table 4. In the Nonurgent referrals group, the same model misclassified only 9 images out of 500 images and achieves 98.2%. It is worth mentioning that the Norm-VGG16 architecture without Kernel regularization attained a testing accuracy of 96.7% when trained on the dataset from scratch.  [20] in Table 5. The per class accuracy is given by dividing the truly classified image for each class by their total number. It is noticeable from the accuracies in Table 5 that the lowest accuracy scored by HyCAD is of DRUSEN class, which is explainable due to the fact that it has the lowest percentage in the training set. Nevertheless, HyCAD with the fused hand-crafted features managed to surpass its CNN counterpart Norm-VGG16 in separating the DRUSEN minority class with a difference of 3.2%.
As shown in Table 5, the proposed HyCAD model achieves a significant increase in accuracy of CNV, DME, DRUSEN by 4.8%, 2.4%, 2.8%, while it attains similar NORMAL accuracy compared to Kermany et al. [4]. The proposed HyCAD model achieves an increase in accuracies of CNV class by 3.2% compared to Li et al. [20]. In order to further examine the performance of our best performing model HyCAD architecture relative to the state of the art, its overall performance is compared to the work of Kermany et al. [4]  The performance of the models is summarized and compared in Table 4. From the shown results, it is determined that the best performing model is HyCAD architecture according to its confusion matrix presented in Figure 14A, scoring an accuracy of 98.8%. In Urgent referrals group, the HyCAD architecture only misclassified 3 images out of 500 and achieves a sensitivity of 99.4% as shown in Table 4. In the Nonurgent referrals group, the same model misclassified only 9 images out of 500 images and achieves 98.2%. It is worth mentioning that the Norm-VGG16 architecture without Kernel regularization attained a testing accuracy of 96.7% when trained on the dataset from scratch.  [20] in Table 5. The per class accuracy is given by dividing the truly classified image for each class by their total number. It is noticeable from the accuracies in Table 5 that the lowest accuracy scored by HyCAD is of DRUSEN class, which is explainable due to the fact that it has the lowest percentage in the training set. Nevertheless, HyCAD with the fused hand-crafted features managed to surpass its CNN counterpart Norm-VGG16 in separating the DRUSEN minority class with a difference of 3.2%. As shown in Table 5, the proposed HyCAD model achieves a significant increase in accuracy of CNV, DME, DRUSEN by 4.8%, 2.4%, 2.8%, while it attains similar NORMAL accuracy compared to Kermany et al. [4]. The proposed HyCAD model achieves an increase in accuracies of CNV class by 3.2% compared to Li et al. [20].
In order to further examine the performance of our best performing model HyCAD architecture relative to the state of the art, its overall performance is compared to the work of Kermany et al. [4] and Li et al. [20]. The comparison is depicted in Table 6 and Figure 15. The values for Kermany et al. [4] and Li et al. [20] are calculated from the best reported confusion matrices as shown in Figure 16. The HyCAD architecture outperforms the work of Kermany et al. [4] in terms of accuracy, sensitivity and specificity scoring an increase of 2.5%, 1.6% and 0.8% respectively. In addition, compared to Li et al. [20], the proposed HyCAD model achieves a noticeable increase in sensitivity by 1.8%, a comparable specificity and a slight increase in accuracy. HyCAD architecture attains the highest sensitivity, scoring an increase of 1.6% compared to the best model of the state of art models [4,20] to reach 99.4% for the Urgent referral group. Such a high sensitivity is required for this group as they represent the critical group, which needs immediate attention. This improvement is achieved while maintaining a competitive specificity with the state of the art. and Li et al. [20]. The comparison is depicted in Table 6 and Figure 15. The values for Kermany et al. [4] and Li et al. [20] are calculated from the best reported confusion matrices as shown in Figure 16. The HyCAD architecture outperforms the work of Kermany et al. [4] in terms of accuracy, sensitivity and specificity scoring an increase of 2.5%, 1.6% and 0.8% respectively. In addition, compared to Li et al. [20], the proposed HyCAD model achieves a noticeable increase in sensitivity by 1.8%, a comparable specificity and a slight increase in accuracy. HyCAD architecture attains the highest sensitivity, scoring an increase of 1.6% compared to the best model of the state of art models [4,20] to reach 99.4% for the Urgent referral group. Such a high sensitivity is required for this group as they represent the critical group, which needs immediate attention. This improvement is achieved while maintaining a competitive specificity with the state of the art.  [20] and HyCAD architecture performance metrics.   [20] and HyCAD architecture performance metrics.  Another important aspect to study is the performance of our HyCAD architecture using hybrid integration of Norm-VGG16 CNN features and hand-crafted features in case of binary classifications. Hence, multiple binary models of CNV vs. NORMAL, DME vs. NORMAL and DRUSEN vs. NORMAL classes are trained and achieve significant results. The respective confusion matrices are shown in Figure 17 and the results are compared to Kermany et al. [4] and Li et al. [20] in Table 7. The two binary models CNV vs. NORMAL vs. DME and NORMAL achieve an accuracy of 100%, sensitivity of 100% and specificity of 100% without any wrong classification of 500 testing set as can be seen from the confusion matrix in Figure 17A,B. The binary model between DRUSEN vs. NORMAL achieves an accuracy of 99.79%, sensitivity of 99.6% and specificity of 100% by classifying only one wrong image as shown in Figure 17C The CNN features are extracted by training Norm-VGG16 for only 15 epochs with learning rate starting with 0.001. Another important aspect to study is the performance of our HyCAD architecture using hybrid integration of Norm-VGG16 CNN features and hand-crafted features in case of binary classifications. Hence, multiple binary models of CNV vs. NORMAL, DME vs. NORMAL and DRUSEN vs. NORMAL classes are trained and achieve significant results. The respective confusion matrices are shown in Figure 17 and the results are compared to Kermany et al. [4] and Li et al. [20] in Table 7. The two binary models CNV vs. NORMAL vs. DME and NORMAL achieve an accuracy of 100%, sensitivity of 100% and specificity of 100% without any wrong classification of 500 testing set as can be seen from the confusion matrix in Figure 17A,B. The binary model between DRUSEN vs. NORMAL achieves an accuracy of 99.79%, sensitivity of 99.6% and specificity of 100% by classifying only one wrong image as shown in Figure 17C The CNN features are extracted by training Norm-VGG16 for only 15 epochs with learning rate starting with 0.001.

Conclusions
In this paper, a novel hybrid computer aided diagnostic system HyCAD-OCT is introduced that effectively distinguishes a range of retinal diseases from OCT scans. The proposed HyCAD architecture integrates deep and traditional learning paradigms to present an accurate timely diagnosis. In addition, HyCAD provides an explainable diagnostic decision through automatic RoI localization using Norm-VGG16 activation maps and the extraction of human articulated features. The automatic RoI localization provides a relative advantage compared to traditional segmentation methods as it does not require expert involvement. Our HyCAD architecture proves the importance of combining automatic feature extraction and hand-crafted features in achieving higher performance metrics. Moreover, the hand-crafted features are more interpretable than the learned features by deep learning architectures models. For the employed deep learning architecture, a set of modifications are applied on the standard VGG16 architecture creating the Norm-VGG16 integrating kernel regularizer variant. Our modified Norm-VGG16 achieved better results than ResNet-50 and Kermany  In addition, the results had lower standard deviation. Overall, a new robust diagnostic system is proposed which offers automatic RoI segmentation that can be used in assisting ophthalmologists in their decision with a range of retinal disorders. Moreover, HyCAD is a general architecture that can be trained and applied on similar problems, since no underlying specific assumptions were made that would hinder its generalization.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A ResNet Architecture
ResNet architecture is chosen because they got a deeper network than Standard VGG16 because of the idea of the residual block which avoids information loss during training [44]. The ResNet-50 has shortcuts between blocks that helped in increasing the number of layers without information loss along the data processing flow, which was a very common issue in any sequential deep network. It consists of 53 convolutional layers. The input starts with an input layer with size 224 × 224 and the output of the input layer is convolved by 64 (7 × 7) kernel. This is followed by batch normalization layer which is followed by max pooling layer as shown in Figure A1.
After the first block, there are two types of residual blocks (i.e., bottle neck blocks) that are repeated in our architecture. The two blocks that were implemented are defined by De Rezende et al. [45] and can be seen in Figure A2.
The first block shown in Figure A2a is called identity shortcut bottleneck block which is composed of a sequence of convolution layers of kernel size (1 × 1) and stride = 1 connected to a convolution layer with kernel (3 × 3) and stride = 1 followed by a convolution layer followed by kernel (1 × 1) and stride = 2. This block is used when the input and output of feature map are the same.
The other block shown in Figure A2b is called projection shortcut bottleneck block which has the same sequence of layers with a newly added convolution layer in the projection shortcut which has a kernel size of (1 × 1) with stride = 2. It is applied when shortcuts go across the feature map of two sizes. In the two blocks all the convolution layers are followed by batch normalization and RELU activation function. The full architecture is shown in Figure A3.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. ResNet Architecture
ResNet architecture is chosen because they got a deeper network than Standard VGG16 because of the idea of the residual block which avoids information loss during training [44]. The ResNet-50 has shortcuts between blocks that helped in increasing the number of layers without information loss along the data processing flow, which was a very common issue in any sequential deep network. It consists of 53 convolutional layers. The input starts with an input layer with size 224 × 224 and the output of the input layer is convolved by 64 (7 × 7) kernel. This is followed by batch normalization layer which is followed by max pooling layer as shown in Figure A1. After the first block, there are two types of residual blocks (i.e., bottle neck blocks) that are repeated in our architecture. The two blocks that were implemented are defined by De Rezende et al. [45] and can be seen in Figure A2. The first block shown in Figure A2a is called identity shortcut bottleneck block which is composed of a sequence of convolution layers of kernel size (1 × 1) and stride = 1 connected to a convolution layer with kernel (3 × 3) and stride = 1 followed by a convolution layer followed by kernel (1 × 1) and stride = 2. This block is used when the input and output of feature map are the same.
The other block shown in Figure A2b is called projection shortcut bottleneck block which has the same sequence of layers with a newly added convolution layer in the projection shortcut which has a kernel size of (1 × 1) with stride = 2. It is applied when shortcuts go across the feature map of two sizes. In the two blocks all the convolution layers are followed by batch normalization and RELU activation function. The full architecture is shown in Figure A3.  The first block shown in Figure A2a is called identity shortcut bottleneck block which is composed of a sequence of convolution layers of kernel size (1 × 1) and stride = 1 connected to a convolution layer with kernel (3 × 3) and stride = 1 followed by a convolution layer followed by kernel (1 × 1) and stride = 2. This block is used when the input and output of feature map are the same.
The other block shown in Figure A2b is called projection shortcut bottleneck block which has the same sequence of layers with a newly added convolution layer in the projection shortcut which has a kernel size of (1 × 1) with stride = 2. It is applied when shortcuts go across the feature map of two sizes. In the two blocks all the convolution layers are followed by batch normalization and RELU activation function. The full architecture is shown in Figure A3.