A New Approach for Detecting Fundus Lesions Using Image Processing and Deep Neural Network Architecture Based on YOLO Model

Diabetic Retinopathy is one of the main causes of vision loss, and in its initial stages it presents with fundus lesions such as microaneurysms, hard exudates, hemorrhages, and soft exudates. Computational models capable of detecting these lesions can help in the early diagnosis of the disease and prevent the manifestation of more severe forms of lesions, supporting screening and the definition of the best form of treatment. However, detecting these lesions with computerized systems is challenging due to numerous factors, such as the size and shape characteristics of the lesions, the noise and contrast of the images available in public Diabetic Retinopathy datasets, the small number of labeled examples of these lesions available in the datasets, and the difficulty deep learning algorithms have in detecting very small objects in digital images. To overcome these problems, this work proposes a new approach based on image processing techniques, data augmentation, transfer learning, and deep neural networks to assist in the medical diagnosis of fundus lesions. The proposed approach was trained, adjusted, and tested using the public DDR and IDRiD Diabetic Retinopathy datasets and implemented in the PyTorch framework based on the YOLOv5 model. On the DDR dataset, the proposed approach reached an mAP of 0.2630 at an IoU threshold of 0.5 and an F1-score of 0.3485 in the validation stage, and an mAP of 0.1540 at an IoU threshold of 0.5 and an F1-score of 0.2521 in the test stage. The results obtained in the experiments demonstrate that the proposed approach outperformed works with the same purpose found in the literature.


Introduction
Vision is one of the most essential and complex senses; through it, it is possible to see and interact with the world around us. Vision is based on the absorption of light by the photoreceptor cells of the eye [1]. Unfortunately, various diseases can harm the eyes. Thus, taking care of this organ is essential to prevent or even reduce the severity of these diseases. Furthermore, eye health is associated with a better quality of life, so it is necessary to maintain healthy vision. The human retina is the most complex of the eye tissues, having a highly organized structure. The retina receives the visual image produced by the eye's optical system and converts the light energy into an electrical signal, which, after initial processing, is transmitted through the optic nerve to the visual cortex [2]. It is a thin, semitransparent, multi-layered layer of nervous tissue that lines the inside of the posterior two-thirds of the eyeball wall. It is sensitive to light and can be compared to the film in a photographic camera, acting as a screen onto which the images seen are projected.
In this context, this work aims to present a state-of-the-art approach based on a pre-trained deep neural network model that, through image pre-processing and data augmentation techniques, can detect fundus lesions and assist in the medical diagnosis and treatment of Diabetic Retinopathy.
The contributions of this work concerning the state of the art can be summarized as follows: (i) a convolutional neural network structure based on the YOLO version 4 and 5 architectures that improves fundus lesion detection and performs real-time inference on low-cost GPUs; (ii) a pre-processing block that reduces outliers and improves enhancement, providing more efficient extraction of features from retinal lesions; and (iii) a method for partially cropping the black background of the fundus images, to minimize the generation of false-positive pixels, and for creating blocks (tiles), to increase the receptive field of the input images and minimize the information loss caused by resizing the images at the network input, which is especially harmful for small lesions such as microaneurysms.
The article is structured as follows. Section 2 describes related works. Section 3 presents the materials and methods used in this work. Section 4 describes the results obtained with the proposed approach. Section 5 presents the discussion of these results. Finally, Section 6 presents the final considerations.

Related Work
Li et al. [17] presented a new Diabetic Retinopathy dataset titled Dataset for Diabetic Retinopathy (DDR) and evaluated state-of-the-art deep learning models for classification, segmentation, and detection of lesions associated with Diabetic Retinopathy. To evaluate these methods in clinical situations, 13,673 fundus images were collected from 9598 patients. These images were divided into six classes and evaluated by seven specialists to identify the stage of Diabetic Retinopathy. In addition, the authors selected 757 images with Diabetic Retinopathy to annotate the fundus lesions: Microaneurysms, Hemorrhages, Hard Exudates, and Soft Exudates.
The authors used the DDR dataset to evaluate ten deep learning models: five for classification, two for segmentation, and three for detection. The experimental results demonstrate that model performance on microlesion recognition must improve before deep learning models can be applied in clinical practice. For DR classification, the authors used the models VGG-16 [22], ResNet-18 [23], GoogLeNet [24], DenseNet-121 [25], and SE-BN-Inception [26]. For the segmentation of DR lesions, they used the HED [27] and DeepLab-v3+ [28] models, while the SSD [29] and YOLO [30] models were used for single-stage detection. The models were evaluated with the mean Average Precision (mAP) and mean Intersection over Union (mIoU) [31] metrics, obtained on the validation and test sets of the DDR dataset. Although the authors obtained an Accuracy of 0.8284 with the SE-BN-Inception model for the classification of Diabetic Retinopathy, the detection and segmentation models performed poorly in detecting fundus lesions.
The work by Porwal et al. [18] presents the results of deep learning models used for segmentation, classification, and detection of fundus lesions during the IDRiD: Diabetic Retinopathy-Segmentation and Grading Challenge. The main contribution was the availability of the public Diabetic Retinopathy image set called IDRiD. Most of the teams participating in the challenge explored the U-Net architecture for segmenting the lesions (microaneurysms, hemorrhages, hard exudates, and soft exudates) [32]. The U-Net architecture is an extended version of the fully convolutional network (FCN), providing more precise segmentation even on small training sets [33]. In the classification challenge, the fundus images were categorized according to the severity level of DR and Diabetic Macular Edema. The team that achieved the best results presented a method based on a ResNet [23] architecture and Deep Layer Aggregation (DLA) [34]. Finally, the detection challenge aimed to obtain the location of the Optical Disc and the Fovea. The winning team presented a method based on the Mask R-CNN model to locate and segment the Optical Disc and Fovea simultaneously.
First, the authors pre-processed the fundus images to a fixed size for use as neural network input. Next, they scanned the image to generate region proposals, classified the proposals into different classes, and created a binary mask for each object. They used a ResNet-50 architecture to extract features from the images and a Feature Pyramid Network (FPN) to generate feature maps at different scales and extract regions of interest. A Region Proposal Network (RPN) then traverses the feature maps and locates regions that contain objects. Finally, architectural branches are employed to obtain the label, mask, and bounding box for each region of interest. The transfer learning technique was applied to train the model, starting with a learning rate of 0.001 and a momentum of 0.9, and the network was trained for 20 epochs. In the Optical Disc and Fovea segmentation and detection task, the team that obtained the best result on the test dataset achieved a Jaccard coefficient of 0.9338.
Mateen et al. [19] proposed a pre-trained framework based on convolutional neural networks to detect exudates in fundus images through transfer learning. In the proposed structure, the authors combined three pre-trained network architectures to perform feature fusion since, according to the authors, different architectures capture different features. The three models used to compose the proposed framework are Inception-v3 [35], VGG-19 [22], and ResNet-50 [23]. The collected features are then fed to fully connected layers for further steps, such as classification, performed through the Softmax function. On the e-Ophtha dataset, the highest accuracy obtained individually by the Inception-v3, ResNet-50, and VGG-19 architectures was 93.67%, 97.80%, and 95.80%, respectively, while the approach proposed by the authors reached a classification accuracy of 98.43%. On the DIARETDB1 dataset, the highest accuracy obtained individually by the Inception-v3, ResNet-50, and VGG-19 architectures was 93.57%, 97.90%, and 95.50%, respectively, while the proposed approach reached a classification accuracy of 98.91%.
The work by Alyoubi et al. [20] presented a DR diagnostic system that classifies fundus images into five stages (no DR, mild DR, moderate DR, severe DR, and proliferative DR) and locates lesions on the retinal surface. The system is composed of two deep learning models. The first is a CNN512, which classifies the images into one of the five DR stages; the public datasets DDR and Kaggle APTOS 2019 [36] were used. The second is a YOLOv3 [37] model, adopted to detect DR lesions. Finally, the two models, CNN512 and YOLOv3, were combined to classify DR images and locate DR lesions. The datasets used were not balanced before the models were trained.
In classifying the type of DR, the CNN512 model achieved an accuracy of 88.6% and 84.1% in the DDR and APTOS Kaggle 2019 public datasets, respectively, while the YOLOv3 model, adopted to detect the DR lesions, obtained a mAP of 0.216 in the detection of lesions in the DDR dataset.
The main limitation of the work was not balancing the retinal image datasets before training the models, which may have biased the lesion detection. Some lesions, such as hard exudates, have a significantly higher number of examples than the others. This imbalance during training can make the deep neural network tend to classify objects into the majority classes, impairing the model's generalization ability. As future work, the authors state that it is necessary to balance the number of examples in the datasets used to train the models and to conduct experiments with the YOLOv4 and YOLOv5 models to verify their performance in detecting fundus lesions.
The work by Dai et al. [21] presented a deep learning system, called DeepDR, to detect early and late stages of Diabetic Retinopathy using 466,247 fundus images of 121,342 patients with diabetes. The DeepDR system architecture had three subnetworks: an image quality assessment subnetwork, the subnet for lesion recognition, and the subnet for DR classification. These subnets were developed based on a ResNet [23] architecture and a Mask R-CNN [38] architecture, responsible for performing the detection of lesions in 2 stages: a preliminary stage, in which regions of interest (RoI) are selected and then, in a second stage, check for the presence of lesions in the regions verified in the previous step [39][40][41].
First, the authors used 466,247 images to train the image quality assessment subnet to check for quality problems in terms of artifacts in the retinal images. After this analysis, 415,139 images were used to train the DR classification subnet to classify the images: no DR, mild nonproliferative DR, moderate nonproliferative DR, severe nonproliferative DR, or proliferative DR. The subnetwork for detecting and segmenting the lesions was trained using 10,280 images with annotations of retinal lesions: microaneurysms, soft exudates, hard exudates, and hemorrhages. For microaneurysms, the value obtained from IoU (Intersection Over Union) [42,43] was not presented. For soft exudates, hard exudates, and hemorrhages, IoU of 0.941, 0.954, and 0.967 were obtained.
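The IoU metric cited in these works measures the overlap between a predicted bounding box and an annotated one. A minimal sketch, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 means a perfect match; detection metrics such as mAP@0.5 count a prediction as correct when its IoU with a ground-truth box exceeds 0.5.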
The study had limitations. The first one was using only a private DR dataset to train the deep learning models, making it difficult to reproduce the results obtained by the authors using the same method. In the validation step, the authors used the Kaggle eyePACS [44] public DR dataset only. The second limitation is that the subnet that detects the lesions was tested only on the local validation dataset due to the lack of lesion annotations in the public dataset that the authors used. Therefore, in future work, the authors claim that further validation through public datasets is necessary to assess the proposed deep learning system's performance in classifying DR and detecting fundus lesions.
The works in the literature applied deep neural networks to identify DR, but the deep learning models presented limitations. The work by Porwal et al. [18] showed results in detecting only the fovea and the optical disc, while the work by Mateen et al. [19] detected only exudates. The work by Li et al. [17] showed promising results in the classification of DR but limited results in the detection of fundus lesions. The work by Alyoubi et al. [20] showed promising results in detecting lesions but did not balance the datasets used, which possibly impacted the reported results. The work by Dai et al. [21] used a private dataset for training and validating the model responsible for detecting lesions, which makes a fair comparison and validation against the models we propose impossible. Furthermore, the model used by Dai et al. [21] performs object detection in two stages, unlike our work, which performs Single-Stage detection after pre-processing and augmentation of the image data. A summarized comparison of the related works is presented in Table 1.

Materials and Methods
The pipeline of the proposed approach is presented as a block diagram in Figure 1. The proposed approach was based on the YOLOv5 [45][46][47][48][49][50][51] deep neural network model and implemented with the open-source machine learning library PyTorch (https://pytorch.org, accessed on 25 August 2022). The model was trained for 8000 epochs with a batch size of 32, a learning rate of 0.01, and a momentum of 0.937. The sizes of the bounding box anchors were calculated adaptively [52] through a genetic algorithm that optimizes the anchors after an initial scan performed by the unsupervised K-means algorithm [53].
Figure 1. Block diagram of the proposed approach. First, the images are passed to the Pre-processing block for noise filtering, contrast improvement, partial elimination of the black background of the images, and creation of tiles. Then, the pre-processed images are transferred to the Data Augmentation block, where sub-images are artificially created to be used in the neural network input layer for training the proposed approach, which is carried out after a pre-training step in which the network is initialized with weights fitted to the Common Objects in Context (COCO) dataset.
A better-adjusted anchor set improves detection accuracy and speed [54]. Anchors are the initial sizes (width, height) of the bounding boxes, which are adjusted toward the size of the object to be detected using the neural network output (final feature maps). In this way, the network does not predict the final size of the object directly but only adjusts the size of the anchor closest to the object's size. For this reason, YOLO is regarded as a method that treats object detection as a regression problem, in which a single neural network predicts the bounding boxes and class occurrence probabilities directly from the complete image being evaluated. Moreover, as all detection is performed by one network (Single-Stage), the model can be directly optimized end-to-end [30].
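The K-means-based anchor estimation mentioned above can be sketched as follows. This is an illustrative simplification (YOLOv5's actual auto-anchor routine additionally refines the clusters with a genetic algorithm): boxes are clustered by their (width, height) pairs using 1 − IoU between origin-anchored boxes as the distance measure.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=50, seed=0):
    """Cluster ground-truth (width, height) pairs into k anchor sizes.
    Assignment uses the highest IoU between each box and each centroid,
    both anchored at the origin (i.e., distance = 1 - IoU)."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # IoU of every box against every centroid, both corner-anchored at (0, 0)
        inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centroids[None, :, 1]))
        union = (wh[:, None, 0] * wh[:, None, 1] +
                 centroids[None, :, 0] * centroids[None, :, 1] - inter)
        assign = np.argmax(inter / union, axis=1)  # nearest = highest IoU
        for j in range(k):
            if np.any(assign == j):               # keep empty clusters as-is
                centroids[j] = wh[assign == j].mean(axis=0)
    return centroids[np.argsort(centroids.prod(axis=1))]  # sort by area
```

The returned anchors, sorted by area, would then be split across the three detection heads (small, medium, large objects).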
In the proposed approach, detection is performed in the final layers and at three scales, as proposed in the YOLOv3 [37] model, allowing objects of different sizes to be learned: 19 × 19, specialized in detecting large objects; 38 × 38, specialized in detecting medium-sized objects; and 76 × 76, specialized in detecting small objects. Each of these outputs, or detection "heads", has a separate set of anchor scales. In YOLOv3, 9 anchor sizes are used, with 3 anchors per detection head. After detection, a confidence percentage is produced for each identified lesion. To carry out the experiments, a Core i7-7700K 8 × 4.6 GHz machine with 32 GB of RAM and an NVIDIA Titan Xp GPU with 12 GB of VRAM was used.
YOLOv5 is a single-stage detection model, capable of detecting objects without the preliminary step used by two-stage detectors, which first propose regions of interest and then classify them to check whether objects are present in those areas. The advantage of a single-stage detector is the speed with which it can make real-time inferences. Another feature of this model type is the possibility of running on edge devices and low-cost hardware, with training that can be performed on a single GPU [55]. With it, we intend to present an approach that achieves greater precision than approaches with the same purpose in the literature. Next, we methodologically detail each step of the pipeline of the proposed approach.

Dataset
The DDR image dataset was used in this work; it has 757 labeled images in JPEG format with variable sizes. According to [17], TRC NW48, Nikon D5200, and Canon CR 2 cameras were used to capture these images. The lesions contained in these fundus images have annotations (Ground Truth), as illustrated in Figure 2. Table 2 presents attributes of the DDR dataset, such as the number of images, their resolution, the type of annotations, the number of images with annotations for MA, HE, EX, and SE, and the total number of annotations per lesion type before the data augmentation step. Data were collected as single-view images. The bounding box annotations were generated automatically from the pixel-level annotations of the lesions [17]. Although the DDR dataset is of good quality, training the deep neural network of the proposed approach poses challenges, such as the small number of annotated fundus lesions and the variability in their size and shape. Another factor that hampers the training of deep learning models for retinal lesion detection is the reduced size of some lesion types, as in the case of microaneurysms. Data augmentation was performed to circumvent the sub-sampling of the dataset caused by the small number of examples, creating artificial images derived from the original images and lesion annotations. This data augmentation and the other techniques applied to overcome these challenges are discussed in the following sections.

Pre-Processing and Image Preparation
The use of a pre-processing block aims to (i) improve the quality of images through the elimination of periodic or random patterns through the application of filters and (ii) increase the image enhancement to improve and accentuate the characteristics of the lesions that will be used for training the deep neural network. The treatment of the images aims at (i) filtering noise generated during the capture of fundus images, (ii) correcting lighting deformities, and (iii) improving the contrast and enhancement of the images [56].
For image smoothing, a median filter of size 5 × 5 was used. The median filter is non-linear and very effective at removing impulsive noise (irregular pulses of large amplitude), such as Salt & Pepper noise [57,58]. Moreover, the Contrast-Limited Adaptive Histogram Equalization (CLAHE) [20,59] technique was used to enhance the images. This technique was initially developed for low-contrast image enhancement, is an evolution of the histogram equalization method [60], and has been used in pre-processing pipelines to improve the quality of medical images [61]. However, before applying the CLAHE algorithm to the fundus images of the dataset, it was necessary to define the most suitable color space for the enhancement (i.e., RGB, HSV, or LAB).
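As a concrete illustration of the smoothing step, a naive 5 × 5 median filter is sketched below in pure NumPy. In practice, library routines such as OpenCV's `cv2.medianBlur` (and `cv2.createCLAHE` for the CLAHE step) would be used instead; this sketch only shows the underlying operation.

```python
import numpy as np

def median_filter(img, k=5):
    """Naive k x k median filter on a 2-D uint8 image (edge-replicated
    borders). Each output pixel is the median of its k x k neighborhood,
    which suppresses impulsive (Salt & Pepper) noise."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = np.median(padded[y:y + k, x:x + k])
    return out
```

A single impulsive outlier inside a flat region is replaced by the neighborhood median, which is the behavior that makes this filter effective against capture noise.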
The background suppression was performed as the last step in the pre-processing and image preparation stage. As the works by El abbadi and Hammod [62] and Alyoubi et al. [20], a pre-processing step for partial cropping of the black background of the retinal images was performed, as illustrated in the Supplementary Materials. According to El abbadi and Hammod [62], the importance of removing the black background from the retinal image is related to the generation of false positives during the detection of lesions, especially at the retina border, where there is a similarity of the retinal border with the blood vessels. Furthermore, in the case of fundus images, only the pixels of the retina have significant information; the rest is considered the background. Therefore, it is essential to locate the area of interest and remove unwanted features related to the image's background.
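The partial background cropping can be sketched as an intensity-thresholded bounding-box crop: rows and columns whose pixels never exceed a low intensity are treated as black border. The threshold value below is an assumption for illustration.

```python
import numpy as np

def crop_black_background(img, thresh=10):
    """Crop away border rows/columns whose maximum intensity never exceeds
    `thresh`, i.e., the black background around the circular retinal region.
    `img` is a 2-D grayscale array; `thresh` is an assumed cutoff."""
    mask = img > thresh
    rows = np.any(mask, axis=1)              # rows containing retina pixels
    cols = np.any(mask, axis=0)              # columns containing retina pixels
    y0, y1 = np.argmax(rows), len(rows) - np.argmax(rows[::-1])
    x0, x1 = np.argmax(cols), len(cols) - np.argmax(cols[::-1])
    return img[y0:y1, x0:x1]
```

For an RGB fundus image, the same mask would be computed on a grayscale or single-channel view and the crop applied to all channels.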
Details about the median filter, CLAHE, and the suppression of the useless retinal background are presented in the Supplementary Materials attached to this article. It is important to note that this additional document presents images where these pre-processing methods are tested and where the used measures are clearly presented and discussed.
Pre-processing techniques were explored to improve the performance of the proposed approach, especially in detecting microlesions, through better smoothing and enhancement of the fundus images. In addition, the black background that caused the generation of false positives was partially removed, and tiling of the original images was applied so that the resulting image blocks could be used in training the deep neural network. Finally, a Data Augmentation step was applied after pre-processing the fundus images, as explained in the next section.
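The tiling of the fundus images into blocks can be sketched as follows. The tile size and overlap values are assumptions for illustration; an overlap helps avoid cutting lesions that fall exactly on tile borders, and edge tiles are shifted flush with the image border so no pixels are lost.

```python
import numpy as np

def tile_image(img, tile=640, overlap=64):
    """Split an image into square tiles of side `tile` with a small overlap.
    Edge tiles are shifted flush with the border so nothing is lost.
    Returns a list of ((y, x), tile_array) pairs with tile top-left corners."""
    h, w = img.shape[:2]
    stride = tile - overlap
    ys = list(range(0, max(h - tile, 0) + 1, stride))
    xs = list(range(0, max(w - tile, 0) + 1, stride))
    if h > tile and ys[-1] != h - tile:
        ys.append(h - tile)                  # final row flush with the bottom
    if w > tile and xs[-1] != w - tile:
        xs.append(w - tile)                  # final column flush with the right
    return [((y, x), img[y:y + tile, x:x + tile])
            for y in ys for x in xs]
```

Bounding box annotations would be remapped into each tile's coordinate frame before training, and tile-level detections merged back (e.g., with non-maximum suppression) at inference time.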

Data Augmentation
In the proposed approach, data augmentation was performed using the images and labeled lesions available in the DDR dataset. For each training batch, the model was configured to pass the images through a data loader that creates the artificial images as they are accessed. The data loader performed the following types of augmentation: Mosaic, MixUp, Copy-Paste, and Geometric Transformations.
This technique works in real time, i.e., on-the-fly [63,64]: new examples are not generated in advance (before training). Thus, in each training run, a random number of artificially created examples is generated and passed on to the neural network. Each data augmentation technique is applied to all images in the batch, except MixUp, which was set to apply randomly to 50% of the images in the batch. Details on the methods applied in the data augmentation block of the fundus images are presented in the Supplementary Materials attached to this article, which discusses the methods used and shows images with the results obtained in this step.
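The on-the-fly behavior described above can be sketched with a minimal data-loader hook. The MixUp blend and its 50% application rate follow the description above, while the Beta distribution parameter and the helper names are assumptions for illustration (Mosaic, Copy-Paste, and geometric transforms would be chained in the same way).

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(img_a, img_b, alpha=32.0):
    """Blend two images with a Beta(alpha, alpha)-distributed ratio (MixUp);
    labels of both images would be kept for the blended example."""
    lam = rng.beta(alpha, alpha)
    return (lam * img_a + (1 - lam) * img_b).astype(img_a.dtype)

def augment_batch(batch):
    """On-the-fly augmentation hook: MixUp is applied to a random 50% of
    the images in the batch, each blended with a random partner image."""
    out = []
    for img in batch:
        if rng.random() < 0.5:
            partner = batch[rng.integers(len(batch))]
            img = mixup(img, partner)
        out.append(img)
    return out
```

Because the augmented examples are generated as each batch is loaded, every epoch sees a different artificial sample set without any images being stored on disk.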
After performing the data augmentation, we also had to deal with the problem of data imbalance. Models based on deep learning generally try to minimize the number of errors when classifying new data; for this, the cost of different errors is assumed to be equal. However, the costs of different errors are often unequal in real-world applications. For example, in medical diagnosis, the cost of misdiagnosing a sick patient as healthy (False Negative) can be much higher than accidentally diagnosing a healthy person as sick (False Positive), since the former type of error can result in the loss of a life [65].
There are cases in which data imbalance biases the training of models and even generates uncertainty about the results obtained [66,67]. In the case of eye fundus lesions, an imbalance in the number of examples of the different lesions was verified, as shown in Table 2. This imbalance can become even more significant after the data augmentation step, since the number of new examples created in this step is random, and it is impossible to accurately predict how many new examples are generated for each lesion.
To balance the number of examples of each lesion, the Threshold-Moving method [68][69][70] was used during training through the image-weights parameter, in which the training images are sampled with weights proportional to the inverse of the mean Average Precision (mAP) obtained in the previous epoch's evaluation. Unlike conventional training, which samples the images uniformly, sampling here is weighted according to an evaluation metric computed at the end of the previous training epoch. This method moves the decision boundary so that minority-class examples are more easily predicted correctly [71,72].
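The image-weighting idea behind this strategy can be sketched as follows. This is a simplification under stated assumptions (not the exact YOLOv5 image-weights implementation): `class_map` is the per-class mAP vector from the previous epoch, and `labels_per_image[i]` lists the class ids present in image i.

```python
import numpy as np

def image_weights(labels_per_image, class_map, eps=1e-6):
    """Weight each training image by the inverse per-class mAP of the
    previous epoch, so images containing poorly-detected (minority) lesions
    are sampled more often. Returns a normalized sampling distribution."""
    class_w = 1.0 / (np.asarray(class_map) + eps)   # inverse mAP per class
    class_w /= class_w.sum()
    w = np.array([class_w[ids].sum() if len(ids) else class_w.mean()
                  for ids in labels_per_image])
    return w / w.sum()
```

The resulting vector would drive weighted sampling of the next epoch's images, e.g. `rng.choice(n_images, p=weights)`, instead of a uniform shuffle.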
The Threshold-Moving method was used to minimize the dataset imbalance and reduce the possibility of bias during classification caused by classes with a significantly larger number of examples. Because many examples are needed to obtain more accurate results, we chose not to undersample the majority classes. Likewise, oversampling the minority classes was discarded because raising the number of examples of these classes to match the majority classes would not reflect the natural incidence of fundus lesions.
Our approach thus used a method to minimize the imbalance in the number of lesion examples and avoid possible overfitting associated with misclassifying lesions into the majority class. After the data augmentation and balancing steps, our method was trained to detect fundus lesions. The next section discusses the architecture of the deep neural network used in our proposed approach.

Deep Neural Network Architecture
After the pre-processing and data augmentation steps described above, the image set of the DDR dataset was split into a training set (50%), a validation set (20%), and a test set (30%), the same proportions used in the work by Li et al. [17]. The images of one set are not present in the others, to avoid bias during the evaluation of the proposed approach. A validation step was used to fine-tune the hyperparameters of the architecture, and a test step to assess the generalization capability of the neural network. In addition, the public Diabetic Retinopathy dataset IDRiD [18] was also used to validate our proposed approach.
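The 50/20/30 split with disjoint sets can be reproduced with a simple shuffle-and-slice; the seed below is an assumption for reproducibility.

```python
import random

def split_dataset(image_ids, seed=42):
    """Shuffle the image ids and slice them into disjoint train (50%),
    validation (20%), and test (30%) sets, as used in this work."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.5 * n), int(0.2 * n)
    return (ids[:n_train],                      # 50% training
            ids[n_train:n_train + n_val],       # 20% validation
            ids[n_train + n_val:])              # remaining 30% test
```

Slicing a single shuffled list guarantees that no image appears in more than one set.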
In our approach, we used the architecture of the YOLOv5 [20,[45][46][47][48][49][50]53,73] model as a basis. This model currently has four versions: s (small), m (medium), l (large), and x (extra-large), each with different characteristics. Before choosing YOLOv5 s as the base architecture for our proposed approach, the different versions of YOLOv5 were tried. The depth and width (convolutional kernel scale) multipliers of the adopted model are 0.33 and 0.50, respectively. According to Iyer et al. [74], YOLOv5 s achieves precision equivalent to YOLOv3 (https://pjreddie.com/darknet/yolo/, accessed on 25 August 2022), but with superior performance in real-time inference at a lower computational cost.
YOLOv5 s was used because it achieved the best results in detecting fundus lesions at a lower computational cost than the other versions. The explanation for the smallest version of YOLOv5 obtaining the best results lies in the fact that fundus microlesions cause a gradient dissipation problem when training the YOLOv5 versions with greater depth and more parameters. Thus, using a model with fewer parameters allowed us to use fewer hardware resources, enabling the detection of fundus lesions on low-cost GPUs without impacting the precision of the proposed approach. The deep neural network structure used in our approach has a total of 283 layers, 7.2 million parameters, and 17.1 GFLOPs; according to Yu et al. [75], one GFLOP corresponds to one billion floating-point operations (FLOPs). Another advantage of using the YOLOv5 model as the basis of this approach is the possibility of integration and portability with different types of projects, including mobile devices, since this model was implemented natively in PyTorch.
The network Backbone was used as a feature extractor pre-trained on an image classification dataset, useful for detecting objects in the last layers of the network. The Backbone used in the experiments is a CSP-Darknet-53. The convolutional neural network Darknet-53 was initially used as the Backbone of the YOLOv3 [37] model, replacing its predecessor Darknet-19 [30], as it includes residual connections as well as more layers, having 53 layers of depth [37]. Furthermore, its architecture is built from consecutive 3 × 3 and 1 × 1 convolution layers followed by a skip connection, which helps activations propagate through deeper layers without gradient dissipation.
The CSP proposed by Wang et al. [76] can be applied in different architectures, such as ResNet [23], ResNeXt [77], DenseNet [25], YOLOv4 [55] and YOLOv5 [45,46] as it not only produces a reduction in computational cost and memory usage of these networks but also brings benefits such as improved inference speed and increased precision. These goals are achieved by partitioning the base layer feature map into two parts. Then, the parts are merged through a hierarchy of crossed stages, whose main idea is to make the gradient propagate through different network paths.
The block diagram of the neural network architecture that composes the proposed approach, responsible for detecting fundus lesions, is illustrated in Figure 3. As shown in Figure 3, the network architecture is based on the structure of the YOLOv5 small version [45,46,74] and is divided into three main blocks: Backbone, Neck, and Head. The network input layer size is 640 × 640 × 3, where the first two values correspond to the height and width in pixels, and the third to the number of channels of the input image. The DDR dataset provides the bounding box annotations of fundus lesions for deep neural network training; these bounding boxes were used to calculate the initial anchor sizes. In the architectures of the YOLO family, the sizes of the initial anchors are usually configured before training the model.
In training the proposed approach, we produce bounding box predictions based on the sizes of the initial anchors. The bounding box of each detected object is then compared with the bounding box of the annotated object (Ground Truth), and the result of this comparison is used to update the neural network weights during the training stage. Therefore, defining the initial size of the anchors is essential, especially when training the neural network on objects with sizes different from the standard anchor dimensions, which are typically calculated from objects in datasets such as COCO. The YOLOv5 model repository has a function that performs the adaptive calculation of the anchors: when training the neural network, it is possible to enable the "auto-anchor" option so that the best anchor box values are computed automatically. This function was used before training the proposed approach to ensure that the anchors were adjusted to the sizes of the fundus lesions present in the dataset used in the experiments.
The Backbone structure of the neural network starts with the Focus module, responsible for performing a slicing operation. In the neural network structure of the proposed approach, when an image of size 640 × 640 × 3 is inserted into the Focus module, a slicing operation is performed on this image to generate a feature map of size 304 × 304 × 64. Still in the Backbone, the Conv modules are composed of a 2D convolution followed by batch normalization. Batch normalization reduces the number of training cycles needed to train deep networks, providing regularization and reducing the generalization error [78]. After batch normalization, the Sigmoid Linear Unit (SiLU) [79] activation function, derived from the Rectified Linear Unit (ReLU) [80], is applied.
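The Focus slicing and the SiLU activation described above can be sketched as follows; the slicing samples every second pixel in four phases and stacks them on the channel axis, so an (H, W, C) input becomes (H/2, W/2, 4C) before the subsequent convolution maps it to the desired channel count.

```python
import numpy as np

def focus_slice(x):
    """Focus module slicing: sample every second pixel in four phases and
    concatenate them on the channel axis, (H, W, C) -> (H/2, W/2, 4C)."""
    return np.concatenate([x[0::2, 0::2], x[1::2, 0::2],
                           x[0::2, 1::2], x[1::2, 1::2]], axis=-1)

def silu(x):
    """SiLU (swish) activation applied after batch normalization:
    silu(x) = x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))
```

SiLU behaves like ReLU for large positive inputs but stays smooth and slightly negative near zero, which helps gradients flow through deep stacks of Conv modules.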
In the network architecture, the CSP module (C3) is used both in the Backbone and in the Neck of the network. These CSPs connect the front and back layers of the network, aiming to improve the model's inference speed without compromising its precision. They also allow better integration of the different parts that make up the neural network, in addition to reducing the size of the model [45]. These C3 modules have in their structure three Conv modules and a Bottleneck module [23]. The Bottleneck module consists of two Conv modules followed by an addition operation (add), responsible for adding tensors without expanding the image dimension.
The Backbone of the proposed approach is composed of four CSP modules (C3). After the Bottleneck module of each C3 module, there is a concatenation module (Concat), so that the features that were divided at the beginning of the C3 block are regrouped, expanding the dimensions of the tensors. The flow and composition of the various modules that make up the Backbone are illustrated in Figure 3. Another Backbone component of the proposed approach is the SPP (Spatial Pyramid Pooling) module [81]. With SPP, it is possible to introduce multiple variable-scale poolings concatenated to form a one-dimensional vector for the FC layer. As in He et al. [81], the MaxPool method was used with kernel sizes of 1 × 1, 5 × 5, 9 × 9, and 13 × 13, followed by the Concat operation to concatenate feature maps at different scales.
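The SPP idea of concatenating stride-1 max-pools of several kernel sizes can be sketched on a small single-channel map. This is illustrative only: the real module operates on multi-channel tensors and wraps the pooling with 1 × 1 convolutions:

```python
def maxpool_same(fm, k):
    """Stride-1 max pooling with 'same' padding on a 2-D feature map."""
    h, w, p = len(fm), len(fm[0]), k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # max over the k x k window, clipped at the map borders
            vals = [fm[ii][jj]
                    for ii in range(max(0, i - p), min(h, i + p + 1))
                    for jj in range(max(0, j - p), min(w, j + p + 1))]
            row.append(max(vals))
        out.append(row)
    return out

def spp(fm, kernels=(1, 5, 9, 13)):
    """Stack stride-1 max-pools of several kernel sizes along the channel
    axis, as in Spatial Pyramid Pooling (SPP)."""
    return [maxpool_same(fm, k) for k in kernels]

fm = [[0, 1, 2, 3],
      [4, 5, 6, 7],
      [8, 9, 10, 11],
      [12, 13, 14, 15]]
pyramid = spp(fm)  # 4 pooled maps, all the same spatial size as fm
```

Because every pooled map keeps the input's spatial size, the four scales can be concatenated channel-wise without any resizing.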
The Backbone structure is responsible for extracting feature maps of different sizes from the input image through multiple convolutions and pooling operations [82]. The Neck structure, in turn, is responsible for merging these feature maps obtained from different levels of the architecture to gather more contextual information and reduce problems related to information loss during the feature extraction process [45]. To merge the feature maps from the Backbone, the Feature Pyramid Network (FPN) [83] and the Path Aggregation Network (PAN) [84] are used, as illustrated in Figure 4. The structure of the FPN alone can be extensive, as spatial information may need to be propagated through hundreds of layers. In this context, an FPN was used in conjunction with a PAN. The PAN structure adds a bottom-up path to the top-down path followed by the FPN, helping to shorten the information path by using lateral connections as shortcuts. In the Neck structure, four CSP modules (C3) were used, as illustrated in Figure 3. These C3 modules were adopted to strengthen the integration of the characteristics extracted from the lesions as this information propagates through the neural network. In Figure 4, it is possible to observe that the detection of lesions is performed in layers P3 (small objects), P4 (medium objects), and P5 (large objects), with sizes of 80 × 80 × 255, 40 × 40 × 255, and 20 × 20 × 255, respectively.
Finally, the Head part of the neural network is responsible for making the dense (final) prediction. This part is composed of a vector that contains the predicted bounding box (central coordinates, height, width), the confidence score of the prediction, and the label of the class to which the detected object belongs. The prediction mechanism used in the Head of the deep neural network architecture of the proposed approach is equivalent to the one used in YOLOv3 [37]. Each object is predicted by a bounding box, and when several bounding boxes are detected for the same object, the NMS technique is applied, discarding redundant bounding boxes whose IoU with a higher-confidence box exceeds a predefined threshold, as shown in Table 3. The Head structure used in our approach is composed of three layers responsible for detecting fundus lesions, each of these layers dividing the image into grid cells of sizes 20 × 20 × 255, 40 × 40 × 255, and 80 × 80 × 255, as illustrated in Figures 3 and 4. The smaller the feature map, the larger the image area to which each grid unit corresponds; therefore, feature maps of size 20 × 20 × 255 are suitable for detecting large objects, while feature maps of size 80 × 80 × 255 are better suited for detecting small objects. The final loss function used in the proposed approach is calculated based on the confidence score (Objectness), the classification score (Class Probability), and the bounding box regression score (Bounding Box Regression), according to Equation (1). Objectness determines whether there are objects in the grid cell, Class Probability determines to which category the objects in a grid cell belong, and Bounding Box Regression is only calculated when the predicted box contains objects. In this case, the Bounding Box Regression is computed by comparing the predicted box with the Ground Truth box of the detected object.
To calculate the confidence score (objectness) and classification score (class probability) loss functions, the Binary Cross-Entropy with Logits function from PyTorch [85] was used. To calculate the loss function referring to the bounding box regression, the Generalized Intersection over Union (GIoU) loss [86,87,88] was used.
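A GIoU computation for axis-aligned boxes given as (x1, y1, x2, y2) can be sketched as follows, with the loss commonly taken as 1 - GIoU. This is an illustrative sketch, not the exact YOLOv5 implementation:

```python
def giou(a, b):
    """Generalized IoU for two boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # smallest box enclosing both a and b
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c = cw * ch
    # GIoU penalizes the empty area of the enclosing box
    return iou - (c - union) / c

loss = 1 - giou((0, 0, 2, 2), (1, 1, 3, 3))
```

Unlike the plain IoU, the GIoU still yields a useful (negative) gradient signal when the predicted and Ground Truth boxes do not overlap at all, which matters for tiny lesions such as microaneurysms.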
In the post-processing of the detected fundus lesions, it was necessary to screen and remove duplicated bounding boxes representing the same object. The NMS technique was used for this, keeping the detected bounding box with the highest confidence score. The NMS method used is based on the obtained IoU values (IoU_nms), for which a threshold of 0.45 [48] was set for the training step.
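The greedy NMS procedure can be sketched as follows, with detections as (x1, y1, x2, y2, confidence) tuples and the 0.45 threshold used in the approach. This is a simplified, class-agnostic version:

```python
def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(dets, iou_thr=0.45):
    """Greedy NMS: keep the highest-confidence box, drop every remaining
    box whose IoU with it exceeds iou_thr, and repeat."""
    dets = sorted(dets, key=lambda d: d[4], reverse=True)
    keep = []
    while dets:
        best = dets.pop(0)
        keep.append(best)
        dets = [d for d in dets if iou(best[:4], d[:4]) <= iou_thr]
    return keep

# two heavily overlapping detections of the same lesion plus a distant one
dets = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8), (20, 20, 30, 30, 0.7)]
kept = nms(dets)
```

Here the second box overlaps the first with IoU 0.81 > 0.45, so only the 0.9-confidence duplicate and the distant box survive.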

Pre-Training
Transfer learning is a method employed in the area of machine learning that consists of reusing information learned in a given task as a starting point for solving a new task [89]. This method is often used when it is not possible to obtain a large-scale dataset with labeled objects for a particular Computer Vision task [90]. In this context, as the public DR datasets do not contain a large number of labeled lesions, the proposed approach has, in addition to a data augmentation step, a pre-training step. In this step, transfer learning with weights pre-trained on the COCO dataset [91] was applied, followed by fine-tuning of the last layers of the neural network. To fine-tune the proposed approach, we kept the weights of the first layers and changed only the weights of the last layers of the neural network.
COCO provides a large dataset with labeled images for object detection tasks. The neural network output of the proposed approach was modified to suit the problem of detecting fundus lesions, preserving the knowledge (weights) of the initial layers. The reuse of information from these initial layers is essential to obtain the most basic characteristics of fundus lesions, such as contours, edges, etc. In addition, pre-training reduced the computational cost and training time of the proposed approach. The method adopted to carry out the transfer learning was based on the work proposed by Franke et al. [92] and consists of the four steps presented below:
1. The initial layers of the architecture of the proposed approach, focused on detecting the most fundamental characteristics of objects, were pre-trained with the weights of the COCO dataset, composed of 80 categories.

2. The last three layers (out of a total of 283) that make up the Head of the architecture of the proposed approach are removed and replaced by new layers.
3. The new layers added are adjusted by training the neural network on the DR dataset, while the weights of the initial layers are kept frozen.

4. After fine-tuning the Head layers of the architecture, the entire neural network is unfrozen and retrained so that minor adjustments to the weights are performed across the entire network.
The fine-tuning of the neural network aimed to optimize the proposed approach to achieve more accurate results. Hence, hyperparameters such as batch size, number of epochs, and learning rate were adjusted. According to Franke et al. [92], hyperparameter optimization aims to find a set of values that produces an ideal model in which a predefined loss function is minimized. As in the work proposed by Franke et al. [92], the methodology adopted to fine-tune the proposed approach consisted of the following steps:

1. For each adjustment performed, one hyperparameter value is varied and the proposed approach is retrained, keeping the other hyperparameter values constant.

2. The effect of this change is analyzed through the performance evaluation of the proposed approach with the metrics Average Precision (AP) and mean Average Precision (mAP), which are presented and discussed in the next section of this article.

3. If the metric values improve, the hyperparameter value is further adjusted (increased or decreased) until a local maximum is reached.

4. The same process is carried out for the other hyperparameters until a set of values is obtained that maximizes the AP and mAP results for the detection of the investigated fundus lesions.
After performing these steps using the validation set of the DDR dataset, the ideal hyperparameter values were found, as shown in Table 3. With the hyperparameter values adequately adjusted, the next step was to evaluate the proposed approach on the test set of the dataset used in the experiments. In the next section, the metrics used to evaluate the performance of the proposed approach are presented and discussed.

Performance Metrics
Object detection models are typically evaluated by their performance on a dataset's validation or test set, measured using different metrics. The metrics adopted to evaluate a model must match the type of task being investigated so that the performance of the model can be compared adequately and quantitatively. For object detection tasks, quantitative evaluation is performed by estimating the overlap between the detected bounding boxes and the annotated bounding boxes of the objects in the original images (Ground Truth).
Metrics typically used to evaluate problems involving object detection and instance segmentation in images are the Intersection over Union (IoU) [31,86], the Average Precision (AP) [38,55,93,94], and the mean Average Precision (mAP) [30,31,37,40]. The IoU, also known as the Jaccard similarity coefficient [42,43], is a statistic that estimates the similarity between two sets of samples. The Intersection over Union is obtained by the ratio between the Area of Overlap and the Area of Union of the predicted bounding boxes and the Ground Truth bounding boxes.
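The IoU computation for two axis-aligned boxes given as (x1, y1, x2, y2) follows directly from this definition:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h                     # Area of Overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)      # divided by Area of Union

overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))  # 25 / 175
```

The value ranges from 0 (disjoint boxes) to 1 (identical boxes), which is what makes it a natural threshold for deciding whether a detection matches a Ground Truth lesion.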
The Average Precision corresponds to the Area Under the Curve (AUC) of Precision × Recall, also called the PR curve [95]. With the Precision and Recall values, it is possible to plot a graph where the y axis is the Precision and the x axis is the Recall. Precision and Recall are then calculated for each class by applying the corresponding formulas to each image, as shown in Equations (2) and (3). Precision and Recall are useful measures to assess the efficiency of a model in predicting classes when they are unbalanced. The PR curve presents the trade-off between Precision and Recall for different thresholds and is an important tool for analyzing the results of a predictor.
The same approach used in the COCO challenge [96] was adopted to perform the AP calculation: a range of IoU threshold values is considered, the AP is averaged for each IoU, and a final mean of the AP values is obtained. Another important aspect of the AP calculation in the COCO challenge is that 101 recall points are used on the PR curve [96].
Another way to evaluate models that perform object detection is through the mAP, a metric widely used to evaluate deep learning models [30,31,37,40]. Its main feature is the ability to compare different models, contrasting precision with recall. The definition of the mAP metric for object detection was first formalized in the PASCAL VOC challenge. To calculate the mAP, the Average Precision values computed for all object classes are averaged [95], as shown in Equation (4). Although it is not simple to quantify and interpret the results of a model, the mAP is a metric that helps evaluate deep learning models that perform object detection.
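The COCO-style AP computation over 101 recall points, and the mAP as the average of the per-class APs, can be sketched as follows. The toy PR samples are illustrative; the per-class AP values are the DDR test-set results reported in Table 5:

```python
def average_precision(precisions, recalls):
    """COCO-style AP: average the interpolated precision sampled at the
    101 recall points 0.00, 0.01, ..., 1.00."""
    total = 0.0
    for i in range(101):
        r = i / 100
        # interpolated precision: highest precision at recall >= r
        candidates = [p for p, rr in zip(precisions, recalls) if rr >= r]
        total += max(candidates) if candidates else 0.0
    return total / 101

# toy PR samples: full precision up to recall 0.5, nothing beyond
ap = average_precision([1.0, 1.0], [0.25, 0.5])

# mAP: average the per-class APs (DDR test-set values from Table 5)
ap_per_class = {"EX": 0.210, "SE": 0.138, "MA": 0.053, "HE": 0.171}
m_ap = sum(ap_per_class.values()) / len(ap_per_class)
```

Averaging the four per-class APs reproduces the reported test mAP of 0.1430, which is exactly how the mAP values in the tables are obtained.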

Experiments and Results
The performance evaluation of the proposed approach was performed using the public DDR Diabetic Retinopathy dataset. To avoid biasing the results, the dataset was divided into training, validation, and test sets in proportions of 50%, 20%, and 30%, respectively. The architecture was implemented and trained based on the YOLOv5 model. First, we perform transfer learning based on the weights pre-trained on the COCO dataset, then fine-tune the detection layers of the architecture, and finally retrain the entire neural network.
During the training of the proposed approach, the Early Stopping regularization method [97] was used. With this technique, it was not necessary to statically define the number of epochs needed to train the proposed approach, since the classification precision is calculated on the validation data at the end of each epoch; when the precision stops improving, training is finished. Therefore, with the use of Early Stopping, it was possible to avoid problems such as underfitting, in which the neural network cannot extract enough features from the images during training due to an insufficient number of epochs, and overfitting, in which the neural network overfits the training data due to an excessive number of epochs [98]. To do so, a parameter was defined to terminate training if the classification precision did not improve during the last 100 epochs, as shown in Table 3.
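The Early Stopping rule with a patience of 100 epochs can be sketched as follows; the sequence of validation scores is illustrative:

```python
def train_with_early_stopping(epoch_scores, patience=100):
    """Stop when the validation metric has not improved for `patience`
    epochs; return (best_score, epochs_actually_run)."""
    best, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(epoch_scores):
        if score > best:
            best, best_epoch = score, epoch   # new best: reset the counter
        elif epoch - best_epoch >= patience:  # patience exhausted: stop
            return best, epoch + 1
    return best, len(epoch_scores)

# toy run: the metric peaks at epoch 2 and then plateaus slightly below
scores = [0.10, 0.20, 0.25] + [0.24] * 150
best, ran = train_with_early_stopping(scores, patience=100)
```

With this rule the run stops 100 epochs after the last improvement instead of exhausting a fixed epoch budget, and the weights from the best epoch are the ones kept.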
In addition to the Early Stopping method, the Dropout technique [99] was used to help regularize the proposed approach. This technique is widely used to regularize the training of deep neural networks [100]. Dropout helps to regularize the model [101] without modifying the cost function. With the use of Dropout, some hidden neurons of the neural network are randomly and temporarily turned off, without changing the input and output neurons. Therefore, this technique causes some neurons not to function, according to a certain probability, during training.
Dropout helps regularization because it reduces complex co-adaptations of neurons, forcing some neurons to learn features that would otherwise be learned by other neurons in the architecture. In short, the main idea is to randomly drop units (neurons), along with their connections, from the neural network during training, preventing the units from adapting too much to the data [99] and reducing the possibility of overfitting, for example after data augmentation. The parameter defined in the proposed approach for the use of Dropout is shown in Table 3.
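A minimal sketch of (inverted) dropout on a vector of activations: each unit is zeroed with probability p during training, and the survivors are rescaled by 1/(1-p) so that the expected activation is unchanged. This is illustrative; deep learning frameworks implement the same idea on tensors:

```python
import random

def dropout(activations, p=0.5, training=True, seed=42):
    """Inverted dropout: zero each unit with probability p during training
    and scale the survivors by 1/(1-p); identity at inference time."""
    if not training:
        return list(activations)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

acts = [0.5, 1.0, 1.5, 2.0]
out = dropout(acts, p=0.5)  # each unit is either zeroed or doubled
```

Because the scaling is applied at training time, nothing needs to be rescaled at inference, which is why the `training=False` branch is simply the identity.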
The proposed approach was evaluated with the AP and mAP metrics to compare the results. These metrics are often used to measure the precision of deep learning algorithms that perform object detection [29,37]. The proposed approach was compared with related approaches found in the literature, including SSD [17], YOLO [17], YOLOv3 [20], YOLOv4 [59], and YOLOv5 (unmodified), as shown in Table 4. In the experiments carried out during the validation step of the proposed approach on the DDR dataset using the Stochastic Gradient Descent (SGD) optimizer, the best result was obtained using the Tiling method, with a mAP of 0.2490 (indicated in bold font) and AP values at an IoU limit of 0.5 equal to 0.2290, 0.3280, 0.1050, and 0.3330 for Hard Exudates (EX), Soft Exudates (SE), Microaneurysms (MA), and Hemorrhages (HE), respectively, as shown in Table 4.
To investigate the results of our proposed approach, the PR curve was chosen for analysis instead of the ROC (Receiver Operating Characteristic) curve [102]. It is important to observe that the ROC curve is not recommended for situations where the dataset presents an imbalance in the number of examples among the investigated classes. In these cases, the ROC curve usually presents a very high AUC because the predictor correctly classifies the class with the highest number of examples (majority class) [103,104].
Precision and Recall, performance metrics commonly used to evaluate image classification and information retrieval systems, were used to evaluate the results obtained. Generally speaking, Precision and Recall are not discussed in isolation, and some problems may require a higher Recall over Precision, or vice versa, depending on the importance given to false positives versus false negatives. In classification problems involving medical images, for example, what is generally desired is to minimize the incidence of false negatives; therefore, a high Recall becomes more important than a high Precision, since a false negative can imply a wrong medical diagnosis and, consequently, risks to the patient. Table 4. Results obtained by the proposed approach with the SGD optimizer compared to related works with the AP and mAP metrics for an Intersection over Union limit of 0.5 in the validation set of the Dataset for Diabetic Retinopathy (DDR). Figure 5 presents the PR curve with an IoU limit of 0.5 obtained during the validation stage using the proposed approach with the SGD optimizer and Tiling on the DDR dataset. The AP values obtained for the fundus lesions are plotted on the graph, according to the results presented in Table 4, whose mean Average Precision value over all classes corresponds to 0.249. The x axis of the PR curve represents Recall, while the y axis represents Precision. This curve mainly focuses on the performance of the positive classes, which is critical when dealing with unbalanced classes. Thus, in the PR space, the goal is to be in the upper right corner (1,1), meaning that the predictor classified all positives as positive (Recall = 1) and that everything classified as positive is a true positive (Precision = 1).

Based on the analysis of the PR curve, it is possible to verify that the proposed approach found the greatest difficulty in predicting Microaneurysms (MA) (red curve), followed by Hard Exudates (EX) (cyan curve), with the best results obtained in the prediction of Soft Exudates (SE) (green curve) and Hemorrhages (HE) (orange curve), respectively. The low precision obtained in the detection of MA is mainly related to the size of these microlesions and the gradient dissipation of these objects when the neural network is trained, causing a high error rate. This fact can be noted in the confusion matrix (shown in Figure 6), with 79% of background FN and 38% of background FP, second only to Hard Exudates, with 40%. The fact that the proposed approach achieved better results in predicting SE is associated with the morphological characteristics of these lesions, since they generally have larger sizes than the other lesions. The confusion matrix is a table containing data from the experiments with the proposed approach. Based on these data, it was possible to summarize the information related to the performance of the proposed approach and compare it with the results obtained by similar works in the state of the art. Figure 6 presents the confusion matrix obtained by the proposed approach with the SGD optimizer and Tiling during the validation step on the DDR dataset. It should be noted that the confusion matrix resulting from object detection presents different characteristics compared to problems that only involve the classification of objects in images, since most model errors are associated with the background class and not with the other classes. In addition, the results presented in the confusion matrix vary according to the defined confidence limit.
When detecting objects, it is common for information regarding false positives (FP) and false negatives (FN) to be presented in the confusion matrix (background). In this sense, the confidence limit established for detecting objects directly impacts the background FP and background FN results. The last row of the confusion matrix refers to Ground Truth objects that were not detected by the approach (background FN) and were therefore considered as background. The last column of the confusion matrix contains the detections performed by the approach that do not have any corresponding label in the Ground Truth (background FP), that is, image background detected as a lesion.
A confidence limit is applied to filter the bounding boxes of possible objects, eliminating those with low confidence scores, while the Non-Maximum Suppression algorithm discards overlapping detections based on the defined IoU threshold. Thus, if a high confidence limit is defined, such as 0.90, there will be little confusion between the classes and low background FP results, but correctly detected fundus lesions with a confidence below 0.90 will be eliminated. On the other hand, if a confidence limit of 0.25 is defined, there will be a more significant generation of background FP and background FN, since the probability of the model detecting the background as a lesion, and vice versa, increases. Therefore, as the confidence limit tends to 1, the background FPs tend to 0. The results presented in the confusion matrix were calculated using a fixed confidence limit of 0.25, which is in line with the default inference configuration contained in the detect.py file of the proposed approach. In summary, lower confidence limits improve the mAP results but also produce a more significant amount of background FP in the confusion matrix, while increasing the confidence limit decreases the background FP in the confusion matrix at the cost of mAP, since more lesions are lost.
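The confidence filtering described above can be sketched as follows, with detections as (x1, y1, x2, y2, confidence) tuples; the two thresholds illustrate the trade-off between background FPs and missed lesions:

```python
def filter_by_confidence(dets, conf_thr=0.25):
    """Keep only detections whose confidence score reaches the threshold."""
    return [d for d in dets if d[4] >= conf_thr]

# illustrative detections: one strong, one borderline, one very weak
dets = [(0, 0, 10, 10, 0.92), (5, 5, 9, 9, 0.40), (1, 1, 3, 3, 0.10)]

high = filter_by_confidence(dets, conf_thr=0.90)     # few FPs, misses lesions
default = filter_by_confidence(dets, conf_thr=0.25)  # more recall, more FPs
```

Lowering the threshold from 0.90 to the default 0.25 keeps the borderline detection: a possible true lesion, but also a possible background FP, which is exactly the behavior discussed for the confusion matrix.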
Cells with darker shades of blue indicate a greater number of samples. The confusion matrix presents the correct predictions of fundus lesions on the main diagonal, while the values off the main diagonal correspond to prediction errors. It is possible to verify that the highest incidence of background FN occurred in Microaneurysms (with 79%), followed by Hard Exudates (with 69%), Soft Exudates (with 68%), and Hemorrhages (with 58%). As for background FP errors, the highest incidence occurred in Hard Exudates (with 40%), followed by Microaneurysms (with 38%), Hemorrhages (with 19%), and Soft Exudates (with 3%). Also, it can be seen that 9% of Hemorrhages were incorrectly detected as Microaneurysms, 2% of Soft Exudates were incorrectly detected as Hard Exudates, and 3% of Microaneurysms were incorrectly detected as Hemorrhages.
Thus, the results presented in the confusion matrix cannot be directly compared with the AP results presented in this work, because those values are associated with the area under the PR curve, as shown in Figure 5. A good mAP produced by a low confidence limit, for example, will necessarily contain thousands of FPs, pushed to the lower right corner of the PR curve, with Recall tending to 1 and Precision tending to 0, as shown in Figure 5.
During the test stage of the proposed approach on the DDR dataset using the SGD optimizer, the best result reached a mAP of 0.1430 (indicated in bold font) and AP values at an IoU limit of 0.5 equal to 0.2100, 0.1380, 0.0530, and 0.1710 for Hard Exudates (EX), Soft Exudates (SE), Microaneurysms (MA), and Hemorrhages (HE), respectively, as presented in Table 5. Furthermore, both results obtained by the proposed approach, with and without the use of Tiling, were superior to those of related works that also detected fundus lesions in images from the test set of the DDR dataset.
Experiments were carried out during the validation stage of the proposed approach on the DDR dataset using the Adam optimizer, in which the best result was obtained using the Tiling method, with a mAP of 0.2630 and AP values at an IoU limit of 0.5 equal to 0.2240, 0.3650, 0.1110, and 0.3520 for Hard Exudates (EX), Soft Exudates (SE), Microaneurysms (MA), and Hemorrhages (HE), respectively, as shown in Table 6. Table 5. Results obtained by the proposed approach with the SGD optimizer compared to related works with the AP and mAP metrics for an Intersection over Union limit of 0.5 in the test set of the Dataset for Diabetic Retinopathy (DDR). The use of the Adam optimizer resulted in a higher mAP than that obtained by the proposed approach with the SGD optimizer, presented in Table 4. The proposed approach presented results superior to all works with the same purpose found in the literature. Figure 7 presents the PR curve with an IoU limit of 0.5 obtained during the validation step using the proposed approach with the Adam optimizer and Tiling on the DDR dataset. The AP values obtained for the fundus lesions are plotted on the graph, according to the results presented in Table 6, whose mean Average Precision value over all classes corresponds to 0.2630 (indicated in bold font). Analyzing the PR curve, it appears that, using the Adam optimizer, the proposed approach presented results similar to those obtained using the SGD optimizer, i.e., there was a high error rate in detecting Microaneurysms (MA) (red curve).

The best results were achieved in the prediction of Soft Exudates (SE) (green curve) and Hemorrhages (HE) (orange curve), respectively. It is possible to verify in the confusion matrix (shown in Figure 8) the high rate of background FN (89%) and the high rate of background FP (34%) for Microaneurysms. The background FP rate of Hard Exudates also stands out from the other lesion classes, reaching 44%. The reasons that led to the high rates of FN and FP, both in detecting Microaneurysms and Hard Exudates, have already been discussed. Figure 8 presents the confusion matrix obtained by the proposed approach with the Adam optimizer and Tiling during the validation step on the DDR dataset. It is possible to verify that the highest incidence of background FN occurred in Microaneurysms (with 80%), followed by Hard Exudates (with 67%), Soft Exudates (with 63%), and Hemorrhages (with 58%). As for background FP errors, the highest incidence occurred in Hard Exudates (with 44%), followed by Microaneurysms (with 34%), Hemorrhages (with 17%), and Soft Exudates (with 5%). Also, it can be seen that 10% of Hemorrhages were incorrectly detected as Microaneurysms, 2% of Soft Exudates were incorrectly detected as Hard Exudates, and 2% of Microaneurysms were incorrectly detected as Hemorrhages. Figure 9 presents a batch of fundus images from the DDR dataset along with the annotations (Ground Truth) of the fundus lesions after the pre-processing and data augmentation steps used to validate the proposed approach using the Adam optimizer and Tiling. Figure 10 shows the detections of fundus lesions performed on the same batch of fundus images described above. Table 7. Results obtained by the proposed approach with the Adam optimizer compared to related works with the AP and mAP metrics for an Intersection over Union limit of 0.5 in the test set of the DDR (Dataset for Diabetic Retinopathy).
The results obtained with the following metrics are presented below: Precision, which considers, among all the positive classifications made by the model, how many are correct; Recall, which considers, among all actual positive cases, how many were correctly predicted; and the F1-score, which is the harmonic mean between Precision and Recall.
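These three metrics follow directly from the true positive (TP), false positive (FP), and false negative (FN) counts; the numbers in the example are illustrative:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1 (the harmonic mean of the two)."""
    precision = tp / (tp + fp)                       # correct among predicted positives
    recall = tp / (tp + fn)                          # found among actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two values, so a model cannot compensate a very low Recall with a very high Precision, which matters in the medical setting discussed above.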

The best results achieved by the proposed approach on the DDR dataset were obtained using the Adam optimizer and the Tiling method, according to the F1-score metric obtained in the validation and testing stages, with values of 0.3485 (indicated in bold font) and 0.2521 (indicated in bold font), respectively, as shown in Table 8. Table 8. Results obtained with the Precision, Recall, and F1-score metrics with the Stochastic Gradient Descent (SGD) and Adam optimizers during the validation and testing steps using the Dataset for Diabetic Retinopathy (DDR).

The mean inference time to detect fundus lesions per image in the DDR dataset in the validation and testing steps of the proposed approach is presented in Table 9. Without Tiling, the proposed approach had its lowest mean inference time per image with the Adam optimizer, at 14.1 ms, while with Tiling the lowest mean inference time per image was achieved with the SGD optimizer, at 4.6 ms (indicated in bold font). The highlight is the inference time of the proposed approach using Tiling, which was around three times faster than the inference time of the proposed approach applied to the images without Tiling. Therefore, in addition to increasing the precision of the proposed approach in detecting fundus lesions, the Tiling method made the prediction process faster.
To assess the precision of the proposed approach on different public DR datasets, we also performed experiments with the IDRiD Diabetic Retinopathy image set [18]. During the validation step on the IDRiD dataset, the best result obtained by the proposed approach used the SGD optimizer with the Tiling method, with a mAP of 0.3280 (indicated in bold font) and AP values at an IoU limit of 0.5 equal to 0.2630, 0.5340, 0.2170, and 0.2980 for Hard Exudates (EX), Soft Exudates (SE), Microaneurysms (MA), and Hemorrhages (HE), respectively, as shown in Table 10. Table 10. Results obtained by the proposed approach with Tiling and the SGD and Adam optimizers with the AP and mAP metrics for an Intersection over Union limit of 0.5 in the validation set of the Indian Diabetic Retinopathy Image Dataset (IDRiD). Most errors involve misclassification, in which a lesion is erroneously considered as background (background FN) or, conversely, the background is erroneously considered a lesion (background FP). For these cases, the image processing techniques applied in the pre-processing block of the proposed approach play an important role, as they aim to minimize these problems by reducing noise and improving the contrast of the images, for example.

Discussion
The works proposed by Alyoubi et al. [20] and Dai et al. [21] presented results only on the validation set. Unlike these works, we adopted the same methodology as the work proposed by Li et al. [17], evaluating the proposed approach through the analysis of the results obtained both in the validation stage and in the test stage using the public DDR Diabetic Retinopathy dataset. This method was adopted so that the evaluation of the proposed approach would not be carried out only on the validation set, since such an evaluation could give a false impression that the proposed approach is accurate in detecting fundus lesions.
Evaluating the approach on a validation set (on which the neural network model is fitted) and then on a test set (whose data are not known a priori) allowed the generalization capability of the proposed approach to be properly verified, without the risk of biases produced by possible overfitting of the model during validation. To validate the predictive capacity of the proposed approach regarding the detection of fundus lesions, we also evaluated it on the IDRiD dataset, in which we achieved results equivalent to those obtained on the DDR dataset.
The work by Dai et al. [21] was not compared with the proposed approach because the authors used a two-stage architecture, while we used a single-stage architecture. The authors also did not present AP or mAP results, unlike other works with a similar purpose found in the literature, which makes an adequate comparison impossible. In addition, the authors used a private DR dataset to train the deep learning models, which makes it difficult to reproduce the results obtained using the same method.
In the work proposed by Li et al. [17], the best results obtained for the detection of fundus lesions in the DDR dataset using a single-stage model were an mAP of 0.0059 in the validation step with SSD and 0.0030 in the test step with YOLO. Santos et al. [59], using the DDR dataset, obtained an mAP of 0.0716 with the YOLOv4 model in the validation step. In the work presented by Alyoubi et al. [20], the best result obtained by the authors in the validation step with the DDR dataset was an mAP of 0.1710, using the YOLOv3 model.
The approach based on the YOLOv5 model proposed in this work obtained an mAP of 0.2630 in the validation step and 0.1540 in the test step, both on the DDR dataset.
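The mAP figures above are computed at an IoU threshold of 0.5, i.e., a detection counts as correct only if its bounding box overlaps the ground-truth box with IoU ≥ 0.5. As a minimal illustration (our own sketch, not the evaluation code used in the experiments), the IoU between two boxes in (x1, y1, x2, y2) format can be computed as:

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes in (x1, y1, x2, y2) format."""
    # Coordinates of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Half-overlapping boxes: intersection 50, union 150 -> IoU = 1/3 (< 0.5, a miss)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

AP is then obtained from the precision-recall curve built by matching predictions to ground truth under this threshold, and mAP averages AP over the four lesion classes.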
With the confidence threshold set at 0.25, the lesions were identified with their respective confidence scores. The experimental results showed that the proposed approach obtained greater precision than similar works found in the literature. Another aspect observed during the experiments is that the proposed approach achieved greater precision in detecting Soft Exudates, Hemorrhages, and Hard Exudates, and, in contrast, lower precision in detecting Microaneurysms.
The detection of these lesions through computerized systems is a challenge due to numerous factors, among which are: the size and shape characteristics of these lesions; the noise and contrast of the images available in public DR datasets; the small number of annotated examples of these lesions available in public DR datasets; and the difficulty of deep learning algorithms in detecting very small objects. These problems were reported in the literature and observed during the experiments performed. Thus, a new approach based on image processing techniques and a state-of-the-art single-stage deep neural network architecture was proposed to overcome some of these problems in detecting lesions in fundus images.
For the problem related to the shape and size of objects, in which very small lesions such as microaneurysms are more difficult to detect, techniques were applied to increase the receptive field of these lesions, such as partial cropping of the black background of the images and tiling of the input images used for training the neural network. In addition, a data augmentation technique based on the Scale geometric transformation was applied, in which a 50% zoom is randomly performed on the input images so that new images are artificially created for training. Thus, the neural network can extract more features, making it more efficient in detecting microlesions. A pre-processing block was also developed, which first filters the images to remove outliers produced during image capture and then applies contrast-limited adaptive histogram equalization (CLAHE) to increase the local contrast of the fundus images and improve lesion enhancement.
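The tiling step can be sketched as follows. This is a simplified illustration under our own assumptions (non-overlapping tiles, zero-padded borders, a hypothetical tile size of 512), not the implementation used in the experiments; in practice, the lesion bounding boxes must also be remapped into the coordinate system of each tile:

```python
import numpy as np

def tile_image(image, tile_size):
    """Split an H x W x C image into tile_size x tile_size tiles.

    Edge tiles are zero-padded so every tile has the same shape,
    which lets small lesions occupy a larger fraction of each sample.
    """
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = image[y:y + tile_size, x:x + tile_size]
            pad_h = tile_size - tile.shape[0]
            pad_w = tile_size - tile.shape[1]
            if pad_h or pad_w:  # pad bottom/right edge tiles with zeros
                tile = np.pad(tile, ((0, pad_h), (0, pad_w), (0, 0)))
            tiles.append(tile)
    return tiles

# Hypothetical fundus image resolution: 1024 x 1536 -> 2 x 3 = 6 tiles
fundus = np.zeros((1024, 1536, 3), dtype=np.uint8)
tiles = tile_image(fundus, 512)
```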
A block responsible for data augmentation was developed to mitigate the small number of lesion examples in the public DR datasets. In this block, different state-of-the-art methods were applied, such as Mosaic, MixUp, Copy-Paste, and geometric transformations (flipping, scaling, perspective, translation, and shearing). The purpose of this step was to artificially create a larger number of example images with annotated lesions for training, allowing the deep neural network to extract a greater number of lesion characteristics and, consequently, increasing the generalization capacity of the proposed approach.
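Of these methods, MixUp is the simplest to illustrate. The sketch below is our own simplified version (the mixing coefficient drawn from a Beta distribution, as in the original MixUp formulation; the box format and `alpha` value are assumptions): two training images are blended pixel-wise while their lesion annotations are concatenated, since both sets of lesions remain visible in the mixed image.

```python
import numpy as np

def mixup(img_a, boxes_a, img_b, boxes_b, alpha=8.0):
    """Blend two equally sized images and merge their box annotations.

    Boxes keep their original coordinates; only pixel values are mixed.
    """
    lam = np.random.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    mixed = (lam * img_a.astype(np.float32)
             + (1.0 - lam) * img_b.astype(np.float32)).astype(img_a.dtype)
    return mixed, boxes_a + boxes_b

# Hypothetical example: boxes as (class_id, x1, y1, x2, y2)
a = np.full((64, 64, 3), 200, dtype=np.uint8)
b = np.zeros((64, 64, 3), dtype=np.uint8)
img, boxes = mixup(a, [(0, 10, 10, 20, 20)], b, [(2, 30, 30, 40, 40)])
```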
In comparison with similar works found in the literature, it is essential to highlight the contributions of the proposed approach related to the structure of the deep neural network used, such as the use of CSP modules (C3) in the Backbone and Neck of the architecture, which minimized the gradient dissipation problems caused by the number of dense layers. In addition, these modules improved inference speed and precision in lesion detection, while reducing computational cost and memory usage. Another innovation in the structure of the proposed approach was the use of the SiLU activation function throughout the neural network, simplifying the architecture and reducing the number of hyperparameters. The Threshold-Moving method was applied during neural network training so that the image samples were weighted using a precision metric, minimizing the imbalance in the number of examples of the different lesion classes investigated and avoiding classification biases toward the majority classes. Finally, the proposed approach was adjusted and tested on different public Diabetic Retinopathy datasets, which were split into training, validation, and testing sets so that the approach could be evaluated according to different performance metrics.
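The SiLU activation mentioned above is defined as silu(x) = x · sigmoid(x). A minimal NumPy version, for illustration only (the network itself uses the framework's built-in implementation):

```python
import numpy as np

def silu(x):
    """SiLU (Swish): x * sigmoid(x); smooth, non-monotonic near zero."""
    return x / (1.0 + np.exp(-x))  # equivalent to x * sigmoid(x)

x = np.array([-2.0, 0.0, 2.0])
y = silu(x)  # silu(0) = 0; silu(2) = 2 * sigmoid(2)
```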
The main limitation of the proposed approach concerns the identification of certain lesions, such as hard exudates, which have characteristics similar to drusen and other ocular signs not caused by DR. For this reason, the proposed approach made classification errors, considering these drusen as hard exudates. Thus, a public dataset with a broader range of ocular signs, whether or not associated with Diabetic Retinopathy, with labeled lesions and a reasonable number of examples, could support the training of deep learning models and a more accurate distinction between these different types of ocular signs.
Another limitation observed during the experiments is associated with the detection of microaneurysms. Although the results are better than those of similar works found in the literature, they are still low due to the size of these microlesions and the associated gradient dissipation problems. In this sense, we aim to increase the detection accuracy of these microlesions by exploring architectures that perform detection in two stages. Furthermore, we intend to explore different strategies, namely: (1) augmenting data to increase the number of examples of these microlesions; and (2) improving the process of creating tiles from fundus images to provide the neural network with images at the highest possible level of detail for extracting features from fundus lesions.

Conclusions
This article presented a new approach to fundus lesion detection using image processing techniques and a deep neural network based on a state-of-the-art YOLO architecture. Two public datasets of Diabetic Retinopathy images, DDR and IDRiD, were used to train and evaluate the precision of the proposed approach. Only the images with annotated lesions in these datasets were used for training and evaluation. The datasets were partitioned into training, validation, and testing sets in a ratio of 50:20:30, respectively. The best results were achieved on the DDR dataset using the Adam optimizer and the Tiling method, reaching in the validation stage an mAP of 0.2630 at an IoU threshold of 0.5 and an F1-score of 0.3485, and in the test stage an mAP of 0.1540 at an IoU threshold of 0.5 and an F1-score of 0.2521. The results obtained in the experiments demonstrate that the proposed approach outperformed equivalent works found in the literature.
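The 50:20:30 partition described above can be sketched as follows. This is a simplified illustration with hypothetical file names and a fixed seed, not the authors' split script:

```python
import random

def split_dataset(items, seed=42):
    """Shuffle and partition items into train/val/test at a 50:20:30 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n_train = int(len(items) * 0.5)
    n_val = int(len(items) * 0.2)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remaining ~30%
    return train, val, test

images = [f"img_{i:04d}.jpg" for i in range(100)]  # hypothetical file names
train, val, test = split_dataset(images)  # 50 / 20 / 30 images
```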
The deep neural network architecture was implemented based on the YOLOv5 model and the PyTorch framework, reaching, in the validation stage on the DDR dataset, an Average Precision of 22.4% for Hard Exudates, 36.5% for Soft Exudates, 11.1% for Microaneurysms, and 35.2% for Hemorrhages. Different state-of-the-art image processing and data augmentation techniques were explored, such as CLAHE, Tiling, Mosaic, Copy-Paste, and MixUp. In this way, it was possible to increase the precision of the proposed approach in detecting fundus lesions because, with the help of these techniques, the deep neural network extracted a greater and more representative set of characteristics of the investigated lesions during the training stage.
The experiments achieved state-of-the-art results, surpassing related works with a similar purpose and application found in the literature and demonstrating that fundus lesion detection can be performed effectively through a deep learning-based approach. Furthermore, for the problem related to object size, in which very small lesions are more difficult to detect, techniques were applied to increase the receptive field of these lesions, such as partial cropping of the black background of the images and tiling of the input images used for training the neural network. Nevertheless, the results presented in this work indicate that detecting microlesions such as microaneurysms in fundus images remains a challenge for future research.
In future work, we intend to explore state-of-the-art architectures that perform instance segmentation, to investigate and compare the trade-off between the precision and inference speed of these architectures and of the approach proposed in this work. Furthermore, we intend to explore new data augmentation strategies and new structures for the Backbone, Neck, and Head of the deep neural network architecture implemented in the proposed approach, as well as to carry out experiments with other public Diabetic Retinopathy datasets.

Funding: This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior-Brasil (CAPES)-Finance Code 001.