Detecting Diabetic Retinopathy Using Embedded Computer Vision

Featured Application: A low-cost solution for detecting diabetic retinopathy for the medically underserved population who do not have access to regular eye exams and are at increased risk of diabetes-related blindness.

Abstract: Diabetic retinopathy is one of the leading causes of vision loss in the United States and other countries around the world. People who have diabetic retinopathy may not have symptoms until the condition becomes severe, which may eventually lead to vision loss. Thus, medically underserved populations are at an increased risk of diabetic retinopathy-related blindness. In this paper, we present development efforts on an embedded vision algorithm that can classify healthy versus diabetic retinopathic images. A convolution neural network and a k-fold cross-validation process were used. We used 88,000 labeled high-resolution retina images obtained from the publicly available Kaggle/EyePacs database. The trained algorithm was able to detect diabetic retinopathy with up to 76% accuracy. Although the accuracy needs to be further improved, the presented results represent a significant step forward in the direction of detecting diabetic retinopathy using embedded computer vision. This technology has the potential of being able to detect diabetic retinopathy without having to see an eye specialist in remote and medically underserved locations, which can have significant implications in reducing diabetes-related vision losses in the future.


Introduction
Diabetic retinopathy (DR) is caused by damage to the blood vessels in the tissue at the back of the eye (retina), causing vision impairment and blindness. Uncontrolled blood sugar is a major risk factor. Since diabetic retinopathy lacks early symptoms, it is very difficult to detect the disease at an early stage [1]. The 2020 diabetes report from the Centers for Disease Control and Prevention (CDC) found that the US diabetic population rose to 9.4%, or 30.3 million people [2]. Another 88 million people, more than 1 out of 3 adults, have pre-diabetes. When left untreated, pre-diabetes can lead to type-2 diabetes within five years. The diabetic population worldwide was more than 250 million in 2017 [3]. This number is projected to rise to around 629 million by 2045 [4]. An estimated one-third of the diabetic population has diabetic retinopathy symptoms, with a significant portion of these cases being vision-threatening [5]. Currently, 7.7 million Americans are impacted by this disease, and this number is expected to rise to 11.3 million by 2030 [6]. One of the major contributors to increased diabetic retinopathy-related vision loss is the lack of access to the medical care needed to catch the disease at an early stage.
The current diabetic retinopathy detection method involves a dilated eye exam. In this exam, eye-dilating drops placed in the patient's eyes widen the pupils and allow doctors to see the blood vessels of the eye [7,8]. A special dye is injected, and pictures are taken as the dye circulates through the blood vessels. The images are used to further examine the blood vessels and catch any damaged veins or fluid leaks. These eye exams are very effective; however, for patients without health insurance, an exam costs $200 or more in the US, and it is often unavailable in remote or developing parts of the world [9,10].
The detection of diabetic retinopathy using computer vision has recently been considered as a possible alternative to a visual analysis by a clinician. One such example is the Kaggle competition on diabetic retinopathy, in which more than 600 teams participated [11]. Similarly, Lam et al. present an automated detection of diabetic retinopathy using TensorFlow and GoogLeNet in [12]. Different aspects, from pre-processing to feature extraction, are presented. In [13], Wu et al. present the classification of diabetic retinopathy using a convolution neural network. Using a self-gated soft-attention mechanism and a pre-trained coarse network, the presented work performs four-class hierarchical classification based on the severity of the disease. The detection of diabetic retinopathy using deep learning is also presented in [14][15][16][17]. In another approach, hardware-based solutions have also been reported [18]. With a digital signal processing kit developed by Texas Instruments, the article reports the possibility of using hardware for detecting diabetic retinopathy. These studies were conducted using high-resolution retina images taken with a fundus camera. Efforts to detect retinopathy from images taken with a low-cost camera have also been reported [19].
This manuscript furthers the reported work through a solution toward detecting diabetic retinopathy using embedded computer vision for implementation on cost-effective portable NVIDIA Jetson TX2 hardware [20] and on-device real-time classification without the need for an Internet connection. We present the training and testing of the Caffe and Keras models using healthy and diabetic retinopathy images. The training and testing set consists of 88,000 high-resolution retina images labeled by experts. The Caffe model was trained and tested on NVIDIA-embedded hardware.

Training and Testing Dataset
Diabetic retinopathy is generally divided into five groups: Normal, Mild DR, Moderate DR, Severe DR, and Proliferative DR (Table 1). The disease starts with small changes in blood vessels, which are designated as Mild DR. At this stage, a complete recovery is possible. If proper care is not taken, in a few years, it will progress to Moderate DR, where leakage in blood vessels may begin. The disease then progresses further to Severe and Proliferative DR and may lead to complete vision loss. To predict DR with a higher accuracy using a machine learning algorithm, a large amount of training data is needed. The data need to come from reliable sources with accurate labels. We used the Kaggle dataset, which was provided by EyePacs [22]. EyePacs screened more than 750,000 patients and collected 5 million retina images [22]. The Kaggle dataset contains 35,126 images for training and another 53,594 for testing. The size of the images ranged from 360 KB to 2 MB. However, only a few images were below 500 KB.
The Kaggle dataset is one of the largest datasets of diabetic retinopathy images currently available. Table 2 below shows the number of images in each DR category among the training and testing datasets. Figure 1 shows sample images in different classes representing different stages of DR. The Kaggle database provided images of all DR categories in one folder and a comma-separated value (CSV) file with a description of each image category. For training and testing, the images needed to be separated and placed in separate folders. A script was written to separate the images based on the CSV labels. Next, the images were cropped using the Otsu method to isolate the main features [23,24]. Furthermore, the images were normalized and contrast adjusted using a filtering algorithm. Data augmentation was also performed to improve the diversity of the data, with operations including padding, cropping, and flipping.
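The label-based separation step can be sketched as a short Python script. This is a minimal sketch, not the exact script used in the study; the CSV column names (`image`, `level`) and the `.jpeg` extension are assumptions based on a typical Kaggle-style export.

```python
import csv
import shutil
from pathlib import Path

def sort_images_by_label(csv_path, image_dir, out_dir):
    """Move each image into a subfolder named after its CSV label.

    Assumes the CSV has columns `image` (file name without extension)
    and `level` (DR category label) -- hypothetical names.
    """
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            name, label = row["image"], row["level"]
            src = Path(image_dir) / f"{name}.jpeg"
            dest = Path(out_dir) / label
            dest.mkdir(parents=True, exist_ok=True)
            if src.exists():
                shutil.move(str(src), str(dest / src.name))
```

After running such a script, each DR category ends up in its own folder, which is the layout expected by most image-classification training tools.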

Convolution Neural Network
A simple neural network with a set of inputs, multiple hidden layers, and an output layer was used. Each hidden layer consists of neuron-like units, each with a set of inputs and weighted outputs, which produce an output with the help of an activation function [25][26][27][28][29]. A convolution neural network (CNN) learns features from patterns in the input training images when those patterns occur repeatedly in the input data. In diabetic retinopathy, those features are distortions in the blood vessels and in the macula of the retina. When these features occur repeatedly and the CNN sufficiently learns to classify them, the result is a trained model. This model is then tested with a new dataset for accuracy.

A filter or convolution kernel is a matrix that is convolved with the input image to detect specific features. Consider an example where the input image is 4 × 4. A 3 × 3 filter is applied to the image, producing a single output value. This process is repeated by sliding the filter by one column (and then one row), generating the next output value; for a 4 × 4 input and a 3 × 3 filter, this yields a 2 × 2 output. The initial choice of the filter size is arbitrary and is optimized based on the accuracy. The learning rate determines the step size at each iteration. It is important to have an optimum learning rate because a too-small learning rate will slow down the convergence, and a too-high learning rate may cause learning to jump over minima. To determine an optimum learning rate, the system is trained with different learning rates, typically starting from 0.001, and the rate that results in the highest accuracy is chosen. Activation functions are mathematical equations that determine the output of a neural network [29][30][31]. Without activation functions, the output of the neural network is linear. An epoch is one complete pass of the entire dataset through the neural network (forward and backward). The number of epochs impacts the performance of a system. If a dataset is very large and the number of epochs is chosen incorrectly, the time to train the system increases unnecessarily. Thus, it is important to choose an optimum epoch value. A hidden layer is an intermediate layer between the input and output layers, where a set of weighted inputs produces an output through an activation function [32].
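The sliding-filter operation described above can be illustrated with a minimal NumPy sketch. As in most deep learning frameworks, it computes cross-correlation (the kernel is not flipped); this is an illustration, not the study's implementation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid (no padding) 2D convolution: slide the kernel one pixel at a time."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise multiply the window with the kernel and sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # 4x4 input image
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging filter
result = convolve2d(image, kernel)                # 2x2 output
```

With a 4 × 4 input and a 3 × 3 filter, the filter fits in only two positions per axis, giving the 2 × 2 output mentioned above.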
The CNN architecture consists of convolution and pooling layers. The convolution layer is the core layer of a neural network and performs the convolution operation (point-wise or depth-wise). The pooling layer is used to reduce the spatial size of the convolved features. There are two types of pooling layers: max pooling and average pooling. A fully connected layer, following the pooling layer, holds composite and aggregated information and is used to predict the output as either healthy or unhealthy.
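Max pooling, the first of the two pooling types named above, can be sketched in a few lines of NumPy: each non-overlapping window of the feature map is reduced to its largest value, halving each spatial dimension for a 2 × 2 window.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h, w = x.shape
    # Reshape into (rows of windows, window height, cols of windows, window width)
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [7, 2, 9, 5],
                 [0, 1, 3, 8]], dtype=float)
pooled = max_pool2d(fmap)  # 4x4 feature map -> 2x2
```

Average pooling would replace `.max(...)` with `.mean(...)` over the same axes.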

Comparison of Various Tools
GoogLeNet is a 22-layer-deep CNN architecture developed by Google. In 2014, GoogLeNet won the ImageNet large-scale visual recognition competition with a 6.7% error rate. The network was trained on 1000 object categories. It can be retrained to perform new tasks using transfer learning [33]. TensorFlow is an open-source library and platform used to develop end-to-end solutions for machine learning. It provides an inbuilt application programming interface (API) that supports different machine learning algorithms. We initially used TensorFlow as one of the platforms for faster development [34]. We implemented our project on two platforms, Keras and NVIDIA Jetson. Keras provides a high-level interface to lower-level deep learning frameworks such as TensorFlow.

Nvidia Jetson TX2 and Nvidia Digits
We chose the NVIDIA Jetson TX2 as our hardware platform because one of our primary goals was to process the data locally and in real time so that it can be used to classify retina images without the need for an Internet connection [20]. The features provided by the NVIDIA Jetson enable producing classification results in a reasonable timeframe for a practical application and allow for future on-device retraining and improvement of the model. This is important, as our aim is to enable use of the device at remote locations where an Internet connection may not be available or reliable. However, it must be noted that it is possible to implement the presented model on a microcontroller or CPU-based hardware platform. The performance and efficiency of the Jetson are comparable to an Intel Xeon E5. In a direct comparison of the Jetson with an Intel Xeon server running a deep learning inference model based on the GoogLeNet deep learning image recognition network, the Jetson was able to process 290 images per second compared to 231 by the Intel Xeon server with a 128 batch size [35]. With a few modifications, and following the instructions provided by NVIDIA, we installed, set up, and ran the NVIDIA deep learning GPU training system (DIGITS), including installing NVIDIA drivers, installing Docker, and setting up and starting a DIGITS container [36]. NVIDIA DIGITS is open-source software provided to design, train, and visualize deep neural networks for image classification, segmentation, and object detection using Caffe, Torch, and TensorFlow. Docker provides an easy tool to create, deploy, and run applications using containers [37]. Caffe is a deep learning framework that provides processing speed and an expressive architecture for developing machine learning models.
Once the setup was complete, the training data were imported into DIGITS, and classification was selected based on the desired output labels of healthy or unhealthy. Lightning memory-mapped database (LMDB), a memory-mapped file format, was chosen as the database backend for its faster input/output performance over a large dataset. Next, an inference model was created using the solver options shown in Table 3 below. Once trained, the model reports the accuracy at each epoch. The trained model can then be tested with testing images.

Training and Testing Procedure
The training and testing were conducted with the Caffe and Keras models. Caffe was chosen because of its direct support in NVIDIA DIGITS. Keras was chosen because it allowed easy optimization to improve system accuracy. For the Caffe GoogLeNet model in NVIDIA DIGITS, the LMDB format was used. In the Keras model, hierarchical data format version 5 (HDF5) was used. HDF5 provides a simple format to read/write over a large dataset. It is easy to add data to a dataset without creating copies. In addition, it supports different languages such as R, C, and Fortran, making it easier to use across various platforms. Once the dataset was processed and converted into the respective file format, it was processed separately with the Caffe and Keras models. Once both models were trained using the training images, they were tested with a testing dataset of 53,594 images. The k-fold cross-validation method was used in this study. In k-fold cross-validation, the complete dataset is randomly divided into k folds or groups. One of the groups is used for testing, and the remaining groups are used for training. This process is repeated for all the groups until every group has served as the test set [38]. In each iteration of k-fold cross-validation, the data were split into training and testing datasets, with 30% for testing and 70% for training. We tested for k = 5, 10, and 12. We found that k = 5 was optimal.
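The k-fold procedure described above can be sketched in plain Python. This is a generic illustration of the fold construction, not the exact split used in the study.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::k] for i in range(k)]     # k roughly equal groups
    for i in range(k):
        test = folds[i]                           # one group for testing
        train = [idx for j, fold in enumerate(folds)
                 if j != i for idx in fold]       # the rest for training
        yield train, test

# Each iteration trains and evaluates the model on a different fold
for train, test in k_fold_splits(n_samples=10, k=5):
    pass  # fit on `train`, evaluate on `test`
```

Across the k iterations, every sample appears in the test set exactly once, so the evaluation covers the complete dataset.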

Results and Discussion
Results from data preprocessing, hyperparameter optimization, and test results are presented and discussed in this section.

Data Preprocessing
The images were processed in two formats, LMDB on the NVIDIA Jetson and HDF5 on Google Engine. The use of RAM, the number of CPUs and cores, and the processing time were compared. The results are shown in Table 4. The time taken to convert the data to the HDF5 file format was comparable to LMDB.

Hyperparameters Optimization
Techniques implemented during the training process were discussed in the previous section. In addition to those techniques, hyperparameters were optimized to improve accuracy. The initial selections of the hyperparameters in Table 3 were optimized through an iterative process to improve accuracy. The optimized parameters are shown in Table 5.

Accuracy Rate for Inference Model
The trained Caffe and Keras models were separately tested with 53,594 test images that included No DR, Mild DR, Moderate DR, Severe DR, and Proliferative DR. The predictions were then aggregated into Healthy and Unhealthy. A No DR result is Healthy, whereas Mild DR to Proliferative DR results are Unhealthy. For example, for a Mild DR test image, a No DR prediction is counted as a Healthy classification, and a Moderate DR prediction is counted as an Unhealthy classification. We chose these two classifications because of our end intention for the envisioned device, which is to give the healthcare professional conducting the test and the patient one of two specific types of feedback: "The eye appears to be healthy. Please consult an eye specialist if there are any concerns," or "A visit to an eye specialist is recommended. The system detects diabetic retinopathy symptoms." In addition, because of the higher risk of a false negative result, specificity is very important for this application.
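The aggregation rule above amounts to a simple mapping from the five DR categories to two classes. A minimal sketch (the category strings are illustrative):

```python
def to_binary(prediction):
    """Collapse a five-class DR prediction into Healthy/Unhealthy."""
    return "Healthy" if prediction == "No DR" else "Unhealthy"

# All five DR categories map to one of the two feedback classes
predictions = ["No DR", "Mild DR", "Moderate DR", "Severe DR", "Proliferative DR"]
binary = [to_binary(p) for p in predictions]
```

Applying this mapping to both the model output and the ground-truth label turns the five-class results into the two-class confusion counts reported below.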

Caffe Model
The classification results with the trained Caffe model developed using NVIDIA Jetson are shown in Table 6 below. As discussed above, the results were aggregated into two classes: Healthy, which means a patient does not have diabetic retinopathy (No DR), and Unhealthy, which implies that a patient has diabetic retinopathy (Mild, Moderate, Severe, or Proliferative DR). The input image types to the model are indicated in the table. As discussed previously, diabetic retinopathy generally progresses from mild to proliferative over time. The results show that the model accurately predicts healthy images 98% of the time (specificity, or true negative rate). The best result for images with DR was observed for Proliferative DR, where the model accurately classified 40% of images as unhealthy (sensitivity, or true positive rate). The least accurate results were with Mild DR images: the model inaccurately classified 97% of Mild DR images as healthy. We observed that the sensitivity increases as the severity of the disease increases. The combined sensitivity for Severe and Proliferative DR was 35%. The overall accuracy of the model is 75.6%. This accuracy is highly influenced by the high specificity of the model and the large number of No DR images compared to the other classes. The least accurate results for both the Caffe and Keras models were with Mild and Moderate DR images. We believe this is due to the fact that Mild and Moderate DR are the initial stages of retinopathy, and they are less distinguishable from healthy (No DR) images compared to Severe and Proliferative DR. The accuracy of the presented models needs to be further improved for practical application. Methods that may be used to improve the accuracy include removing low-quality images, using the local spectrum analysis method [39], or enhancing local contrast using contrast-limited adaptive histogram equalization [40].
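The interplay between specificity, sensitivity, and overall accuracy discussed above can be made concrete with the standard confusion-matrix formulas. The counts below are hypothetical, chosen only to show how a large healthy (negative) class lets overall accuracy stay high despite low sensitivity:

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity (true positive rate), specificity (true negative rate), accuracy."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Hypothetical counts: 4000 healthy images, 1000 DR images
sens, spec, acc = classification_metrics(tp=300, fn=700, tn=3920, fp=80)
# sensitivity 0.30, specificity 0.98, accuracy 0.844
```

Even with only 30% of DR images caught, the dominant healthy class pushes overall accuracy above 84% here, which mirrors why the reported 75.6% accuracy is driven by the 98% specificity and the class imbalance.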
The accuracy can also be improved by optimizing the presented model or by considering alternative models such as a recurrent neural network (RNN). The ability of an RNN to accumulate features, or memorize past inputs, can be advantageous in DR image classification [41,42].

Conclusions
We presented the training and testing of the Caffe and Keras models using retina images of healthy eyes and eyes at various stages of diabetic retinopathy. A total of 35,000 high-resolution images labeled by experts were used for training the models. Another 53,000 images, also labeled by experts, were used for testing the trained models. The classification results were aggregated as healthy and unhealthy; the unhealthy images included mild to proliferative diabetic retinopathy. The Caffe model classified healthy and unhealthy images with 75.6% accuracy and 98% specificity. The sensitivities for severe and proliferative diabetic retinopathy images were 30% and 40%, respectively. The Keras model classified healthy and unhealthy images with 76.7% accuracy and 97% specificity. However, the sensitivities for severe and proliferative diabetic retinopathy images significantly improved to 41% and 55%, respectively. The future goal of this research is to develop low-cost hardware capable of on-site real-time classification of retina images. Given the nature of the intended application, the specificity (true negative rate) needs to be close to 100% to reduce the risks associated with a wrong classification result. In addition, to reduce unnecessary tests and associated costs, the sensitivity of the system has to be further improved for a practical application. This can be achieved by further optimizing the presented model or by considering alternative models such as a recurrent neural network. The Caffe model was trained and tested on NVIDIA-embedded hardware. The Keras model has shown more promise and will be implemented on the same hardware, and the accuracy will be further improved. Although the accuracy needs to be further improved, the presented results represent a significant step forward in the direction of detecting diabetic retinopathy using embedded computer vision for implementation in portable low-cost hardware.
This technology has the potential of being able to detect diabetic retinopathy without having to see an eye specialist in remote and medically underserved locations, which can have significant implications in reducing diabetes-related vision losses in the future.
Author Contributions: Both P.V. and S.S. contributed to writing and editing the paper, research design, and results analysis; the experimental work was done by P.V. under the supervision of S.S.; and S.S. was responsible for project administration. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.