Performance Evaluation of Different Object Detection Models for the Segmentation of Optical Cups and Discs

Glaucoma is an eye disease that gradually deteriorates vision. Much research focuses on extracting information from the optic disc and optic cup, the structures used for measuring the cup-to-disc ratio. These structures are commonly segmented with deep-learning techniques, primarily encoder-decoder models, which are hard to train and time-consuming. Object detection models based on convolutional neural networks can extract features from fundus retinal images with good precision. However, the superiority of one model over another for a specific task is still being determined. The main goal of our approach is to compare the performance of object detection models for the automated segmentation of the cup and disc on fundus images. The novelty of this study is the evaluation of the behavior of different object detection models (Mask R-CNN, MS R-CNN, CARAFE, Cascade Mask R-CNN, GCNet, SOLO, Point_Rend) in the detection and segmentation of the optic disc and cup, evaluated on the Retinal Fundus Images for Glaucoma Analysis (REFUGE) and G1020 datasets. Reported metrics were Average Precision (AP), F1-score, IoU, and AUCPR. Several models achieved a perfect AP of 1.000 when the IoU threshold was set at 0.50 on REFUGE; the lowest was Cascade Mask R-CNN with an AP of 0.997. On the G1020 dataset, the best model was Point_Rend with an AP of 0.956, and the worst was SOLO with 0.906. It was concluded that the methods reviewed achieved excellent performance, with high precision and recall values, showing efficiency and effectiveness. The problem of how many images are needed was addressed with an initial value of 100, with excellent results. Data augmentation, multi-scale handling, and anchor-box sizing brought improvements. The capability to transfer knowledge from one database to another also shows promising results.


Introduction
Nearly 2.2 billion people have a vision impairment or blindness, and one billion of these have a condition that could have been prevented or has not yet been treated [1].
Glaucoma is a common cause of irreversible blindness, and its pathophysiology involves essential structures such as the retinal ganglion cells (RGCs), stroma, photoreceptors, lateral geniculate body, and visual cortex. However, the loss of RGC axons within the optic nerve head (ONH) is the leading cause of vision loss. Glaucoma is therefore considered a multifactorial disease, influenced at least by intraocular pressure (IOP), blood flow and ischemia inside the laminar and prelaminar tissues, and an autoimmune or inflammatory state of the tissues [2]. Regarding prognosis, most of the time the illness can be controlled with early diagnosis and adequate medication and care.

• Interpreting the segmentation of the OD and OC is crucial in diagnosing glaucoma. Accurate results can make the difference between a good and a poor prediction.

What is Already Known
• Deep neural networks are used for segmentation, focusing mainly on encoder-decoder models.
• Heavy pre-processing and post-processing work is required.
• Pipelines are based on first extracting the region of interest and then performing the segmentation on that trimmed area.

What This Paper Adds
• Evaluation of state-of-the-art object detection models with a two-stage approach, highlighting the best average precision. These models unify the detection and segmentation tasks.
• The traditional question of how many images are needed to train a deep-neural-network model is addressed, with experiments on a subset and on the full dataset.
• The effect of the multiscale data augmentation technique and the importance of a correct configuration of the anchor scales for object location.
After the introduction and a review of state-of-the-art practices, the article continues with the materials and methods in Section 2, where every model is described, including the framework utilized, the annotation mechanics, and the training configuration. Evaluation and results are presented in Section 3 to report performance; the discussion of the results is found in Section 4, and the conclusions of the work in Section 5.

Materials and Methods
Currently, object detection models are used for pose estimation, vehicle detection, and surveillance, among other applications. These algorithms try to draw a bounding box around the object of interest. There does not necessarily have to be a single box; there can be several boxes of different dimensions for different objects.
The MMDetection toolbox has been used in this investigation [32]; it provides an integrated framework for object detection and instance segmentation based on PyTorch [33]. This tool belongs to the MMLab project, an open-source project for academic research and industrial applications. Significant features of MMDetection are its modular design, support for multiple frameworks out of the box, and high efficiency.
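To give a flavor of this modular design, a Mask R-CNN variant is assembled in MMDetection 2.x by composing building blocks in a configuration file; the fragment below is a simplified sketch, not the authors' exact configuration:

```python
# Simplified MMDetection 2.x-style config: a model is composed of
# interchangeable backbone, neck, and head modules.
model = dict(
    type='MaskRCNN',
    backbone=dict(type='ResNet', depth=50, num_stages=4,
                  out_indices=(0, 1, 2, 3), frozen_stages=1,
                  init_cfg=dict(type='Pretrained',
                                checkpoint='torchvision://resnet50')),
    neck=dict(type='FPN', in_channels=[256, 512, 1024, 2048],
              out_channels=256, num_outs=5),
    # rpn_head, roi_head, and the train/test settings follow the same
    # dict-based pattern and are omitted here for brevity.
)
```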
The general intention of this study is to test various object detection models that can be useful in OD and OC segmentation, addressing the two-stage category of object detection models, which are generally more accurate than one-stage detectors. See Figure 1 for a presentation of the models that will be used. The two-stage framework comprises first a CNN for extracting features from the original images and then a category-specific classifier for the class label. Two approaches are taken to predict the presence of glaucoma with deep learning techniques. The first type of ONH assessment, choosing between healthy and glaucomatous, is an image-level task. However, although this task shows a high level of accuracy, a pixel-level task is preferred in this work, segmenting OD and OC, because this second type helps to obtain the different signs used by doctors to predict the presence or absence of glaucoma.


Foundation in Object Detection with Deep Learning
Understanding critical elements of how these models work, why they are relatively slow but powerful, and how sharing features improves the two-stage detector are the goals of this section.


R-CNN
In [34], the first successful system for object localization, classification, and segmentation was created. It extracts around 2000 region proposals (patches) from the original image, then computes features for each proposal using a large CNN, and finally classifies each region using class-specific linear support vector machines.

Fast R-CNN
R-CNN is slow since each proposal region passes through a CNN without sharing computation. In more recent work [35], the entire image is passed through the CNN only once. RoI pooling is introduced to extract a fixed-size feature map from each proposed region; these features are fed into fully connected layers with two outputs: a softmax probability over the classes and a per-class bounding-box regression offset.
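As an illustration of this RoI feature-extraction step (not the authors' exact pipeline), the sketch below uses torchvision's roi_align, the interpolation-based successor of RoI pooling, to pool a fixed-size feature map for two hypothetical proposal boxes:

```python
import torch
from torchvision.ops import roi_align

# Feature map from a backbone: batch of 1, 256 channels, 50 x 50 spatial size.
features = torch.randn(1, 256, 50, 50)

# Two hypothetical proposals in (batch_index, x1, y1, x2, y2) format,
# given in input-image coordinates of an 800 x 800 image.
boxes = torch.tensor([[0, 100.0, 120.0, 300.0, 340.0],
                      [0, 400.0, 80.0, 560.0, 260.0]])

# spatial_scale maps image coordinates to feature-map coordinates (50 / 800).
pooled = roi_align(features, boxes, output_size=(7, 7), spatial_scale=50 / 800)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -> one 7x7 map per proposal
```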

Faster R-CNN
Previous object-detection models still have a bottleneck in selective search, a slow and time-consuming process that affects the network's performance. In [36], the region-proposal network (RPN) was proposed: placed on top of the CNN features, its proposals are reshaped using RoI pooling, and both the classification and bounding-box regression tasks are then carried out.

Mask R-CNN
Mask R-CNN [37] was introduced to predict a segmentation mask on each RoI with a small FCN. This model extends Faster R-CNN, adding a new branch in parallel to the existing branches for classification and bounding-box regression. The model slightly increases the computational cost but is still a fast system and allows rapid experimentation.
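For context, a pre-trained model of this family can be exercised in a few lines; the sketch below uses torchvision's reference Mask R-CNN (the study itself uses MMDetection, so this is only an illustration of the model's interface):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# ResNet-50 + FPN backbone, the same backbone family used in this study.
model = maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# One dummy RGB image; a real fundus image tensor would go here.
image = torch.rand(3, 800, 800)

with torch.no_grad():
    prediction = model([image])[0]

# Parallel branches: boxes/labels/scores from detection, masks from the FCN head.
print(prediction["boxes"].shape, prediction["labels"].shape)
print(prediction["masks"].shape)  # (N, 1, H, W) soft masks, one per instance
```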

Common Components in Object Detection Architectures
A combination of ResNet50 [38], which overcame the degradation problem when models began to converge, and Feature Pyramid Network (FPN) [39], designed as an extractor of multiple levels of feature maps, was used as a backbone. The idea is to obtain semantic-rich layers from the bottom up and construct higher-resolution layers from the top-down side.
The bottom-up pathway starts with a simple convolution over the input image. This first level is not included in the pyramid scheme because of its extensive memory use. Levels 2, 3, 4, and 5 then use ResNet blocks with multiple convolution layers. The output of each level is reduced by a factor of 2 (doubling the stride) and serves both as the input to the next level and as a lateral input to the top-down pathway.
The top-down pathway begins with a 1 × 1 convolution filter to reduce the channel dimension to 256. The process is repeated at all levels, merging the output of the same level from the bottom-up side by element-wise addition; each level is upsampled by a factor of two before the addition. Finally, to obtain the final feature map at each level, a 3 × 3 convolution is applied, which helps with the aliasing introduced by upsampling. Figure 2 has a detailed schema of the backbone utilized.
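A minimal sketch of this top-down pathway (dimensions and names are illustrative, not the exact MMDetection implementation) could look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Illustrative FPN neck: lateral 1x1 convs, top-down upsampling, 3x3 smoothing."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):  # feats: bottom-up outputs C2..C5, high to low resolution
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down: upsample by 2 and add element-wise, from the coarsest level down.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(laterals[i], scale_factor=2)
        # 3x3 convolution on each merged map to reduce upsampling artifacts.
        return [s(l) for s, l in zip(self.smooth, laterals)]

# C2..C5 feature maps of a ResNet-50 for a 256x256 input (strides 4, 8, 16, 32).
feats = [torch.randn(1, c, 256 // s, 256 // s)
         for c, s in zip((256, 512, 1024, 2048), (4, 8, 16, 32))]
pyramid = TinyFPN()(feats)
print([p.shape for p in pyramid])  # all with 256 channels, resolutions 64..8
```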

Standard components in two-stage object-detection architectures are the Backbone, Neck, DenseHead, RoIExtractor, and RoIHead:
• Backbone: the network that takes an image as input and extracts the feature maps; the backbone can be a pre-trained neural network.
• Neck: following the backbone, the neck extracts more elaborate, multi-level feature maps, as the FPN does.
• DenseHead: the part that operates on dense locations of the feature maps, such as the region-proposal network.
• RoIExtractor: the part that extracts RoI-wise features from the feature maps.
• RoIHead: the part that takes RoI features as input for a specific task, such as bounding-box classification/regression and mask prediction in instance segmentation.

Instance Segmentation Models
This investigation aims to extract the OD and OC as individual elements. The extraction task is related to instance segmentation, which allows the detection and localization of an object in an image. The goal of instance segmentation is to allow objects of the same class to be divided into different instances; although, conceptually, the disc and the cup are different classes, they are very alike in shape and overlap each other. For that reason, they need to be extracted separately. Object detection models can address this problem and are covered here, assessing the performance of recent architectures.

Cascade R-CNN
Cascade R-CNN is a multi-stage object-detection architecture in which detectors with increasing IoU thresholds are applied in sequence, using the output of one stage to train the next, thereby improving quality, guaranteeing a positive training set, and minimizing overfitting [40]. This architecture is an extension of Faster R-CNN, and obtaining masks can be addressed in two ways: placing the segmentation branch at the beginning or end of Cascade R-CNN, or in each stage. The latter maximizes the diversity of samples used to learn the mask-prediction task.

Mask Scoring R-CNN
For most instance-segmentation models, the confidence of instance classification is used as the mask quality score. However, in practice, the instance mask and the ground truth are usually not well correlated with the classification score. The idea behind this architecture is to take the instance feature and the corresponding predicted mask together and regress the mask IoU through a MaskIoU head [41]. In this proposal, the predicted mask and the RoI feature are taken as input to the MaskIoU head.

PointRend: Image Segmentation as Rendering
This module brings flexibility through point-based segmentation predictions at adaptively selected locations, based on an iterative subdivision algorithm. The PointRend model produces crisp object edges in regions that were over-smoothed by previous methods [42]. PointRend chooses a set of points and predicts each point individually with a small multilayer perceptron, using interpolated features computed at these points. This process is applied iteratively to refine uncertain regions of the predicted mask.

CARAFE
Content-Aware ReAssembly of Features exploits a large field of view, aggregating contextual information. It enables instance-specific, content-aware handling, generating adaptive kernels on the fly, and is lightweight and quick to compute [43]. CARAFE is composed of two principal components: the kernel-prediction module generates reassembly kernels in a content-aware manner, whereas the content-aware reassembly module reconstructs the features within a local region using the predicted kernels.

GCNet
The Global Context Network was proposed to enhance NLNet, which captures long-range dependencies by aggregating query-specific global contexts for each query position. The improvement was set up as a three-step general framework, obtaining a better instantiation based on a query-independent formulation [44].

SOLO
Segmenting Objects by Locations [45] introduces "instance categories," an approach that assigns categories to each pixel within an instance according to the instance's location and size; this transforms instance segmentation into a single-shot classification problem. The proposed model divides the input image into a uniform grid, and if the center of an object falls into a grid cell, that cell predicts the semantic category and segments that object instance.
The reviewed models are summarized in Figure 3, where the main contribution of this investigation is highlighted. The importance of this methodology is that it allows us to analyze the behavior of object detection models based on the architectural approach, the backbone components, or the components in the neck. The overall workflow starts with retrieving the images and annotating them accordingly. Then, we set up the experimentation with and without multiscale and with an appropriate anchor-box configuration before training and predicting the segmented area. In the annotation process, the ellipse with the number 1 corresponds to the disc class, and ellipse number 2 to the cup class.

Device and Databases
The device used in all experiments is a PC with an Intel(R) Core(TM) i5-8400 CPU @ 2.80 GHz, 16 GB of RAM, and an NVIDIA GeForce GTX 1070 graphics card with 8 GB of VRAM. The software packages used were Python, version 3.9.7, by Van Rossum and Drake (Scotts Valley, CA, USA); PyTorch, version 1.10.0, an open-source tool developed by a laboratory of Facebook, today Meta; MMDetection, version 2.19.0; and MMCV, version 1.4.0, a foundational library for computer vision that supports projects such as MMDetection [46]. All are open source.
Two databases were used: (a) REFUGE [47]: The Retinal Fundus Glaucoma Challenge was the first challenge on glaucoma assessment from retinal fundus photography and is one of the most extensive public datasets available for cup/disc segmentation. It consists of 1200 retinal images in JPEG format. Two devices were used: a Zeiss Visucam 500 fundus camera with a resolution of 2124 × 2056 pixels (400 images) and a Canon CR-2 with a resolution of 1634 × 1634 pixels (800 images). The macula and optic disc are visible in each image, centered at the posterior pole. (b) G1020 [48]: a public dataset for cup/disc segmentation whose images were collected at a private clinical practice in Kaiserslautern, Germany, between 2005 and 2017, with a 45-degree field of view after dilation drops. Experts marked optic-disc and cup boundaries and bounding-box annotations using labelme [49], a free, open-source annotation tool. Images are stored in JPG format with sizes between 1944 × 2108 and 2426 × 3007 pixels.

Experimentation
Different model implementations provided by the MMDetection framework were used to run the experiments.

Annotations and Pre-Processing
This framework supports the COCO-style dataset, a large-scale object detection, segmentation, and captioning dataset format. It is crucial to determine the precise location of the optic disc and cup. The VGG Image Annotator (VIA) [50] software was used to generate the annotations of the RoIs for a proper and correct training procedure. This software is open source and is a standalone, straightforward manual-annotation platform for images, audio, and video. The REFUGE dataset was annotated with this tool and then exported in COCO format.
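As an illustration of the expected COCO-style structure (all field values here are hypothetical, not taken from the actual annotation files), an exported image with its disc and cup instances might look like this:

```python
# Hypothetical COCO-style annotation: categories 1 (disc) and 2 (cup);
# segmentations are polygon point lists approximating the ellipses drawn in VIA.
coco_sample = {
    "images": [{"id": 1, "file_name": "g0001.jpg", "width": 2124, "height": 2056}],
    "categories": [{"id": 1, "name": "disc"}, {"id": 2, "name": "cup"}],
    "annotations": [
        {
            "id": 1, "image_id": 1, "category_id": 1,
            "bbox": [980.0, 940.0, 420.0, 430.0],  # x, y, width, height
            "segmentation": [[980.0, 1155.0, 1010.0, 1040.0, 1190.0, 940.0,
                              1370.0, 1040.0, 1400.0, 1155.0, 1190.0, 1370.0]],
            "area": 141862.0, "iscrowd": 0,
        },
    ],
}
```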
An example can be seen in Figure 4. An elliptical shape was selected since this is what best fits the optic disc and cup. The original ground truth of both datasets was used as a guide. Pre-processing and augmenting images are always vital to a neural network's successful behavior; however, aggressive transformations do not always lead to better results. In this work, a few steps were taken. Resizing was done by adopting a simple data augmentation scheme based on multiscale training, where images were resized to sizes between 1333 × 640 and 1333 × 960 with a step of 32; this approach shows higher performance in terms of average precision (AP) for bounding boxes and masks than a single fixed size. A random flip follows, and then normalization based on the mean and standard deviation of ImageNet [51], commonly used with transfer learning to speed up the training process.
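In MMDetection 2.x terms, such a pipeline could be configured roughly as below; this is a sketch consistent with the description above (scale list with a step of 32, random flip, ImageNet statistics), not the authors' published configuration file:

```python
# Sketch of an MMDetection 2.x training pipeline with multiscale resizing,
# random flipping, and ImageNet normalization, as described above.
img_norm_cfg = dict(mean=[123.675, 116.28, 103.53],  # ImageNet mean (RGB)
                    std=[58.395, 57.12, 57.375],     # ImageNet std (RGB)
                    to_rgb=True)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    # 'value' mode randomly picks one of the listed scales each iteration:
    # (1333, 640), (1333, 672), ..., (1333, 960) in steps of 32.
    dict(type='Resize',
         img_scale=[(1333, s) for s in range(640, 961, 32)],
         multiscale_mode='value',
         keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
```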

Training Setting
The training pipeline was set up following the original settings of the framework used in [32]; however, minor changes were made with a view to improving performance. AdamW was used as the optimizer [52], with an initial learning rate of 0.0025. The original framework works with a value of 0.02 for 8 GPUs; since only one GPU was used here, this original value was divided by 8. As the learning-rate schedule, cosine annealing was utilized, allowing warm-restart techniques to improve performance when training deep neural networks [53]. Cosine annealing was initially developed for the Stochastic Gradient Descent (SGD) optimizer, but new studies suggest better performance when cosine annealing is used with AdamW [54].
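In MMDetection 2.x configuration syntax, these choices could be expressed roughly as follows (a hedged sketch of the described setup, with warmup values assumed rather than reported):

```python
# Sketch of the optimizer and learning-rate schedule described above.
# 0.02 / 8 GPUs = 0.0025 for single-GPU training.
optimizer = dict(type='AdamW', lr=0.0025, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)

# Cosine-annealing schedule; the linear warmup parameters are illustrative.
lr_config = dict(policy='CosineAnnealing',
                 warmup='linear',
                 warmup_iters=500,
                 warmup_ratio=0.001,
                 min_lr_ratio=1e-5)
```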
The total loss of an object detector is the combination of the classification, localization, and segmentation losses, owing to the multi-task nature of these models. Equation (1) shows the formula:

L = L_Cls + L_Reg + L_Mask. (1)

L_Cls is the classification loss and uses a log loss over the two classes, with p_i the predicted probability and q_i the ground-truth label, see Equation (2):

L_Cls = -[q_i log(p_i) + (1 - q_i) log(1 - p_i)]. (2)

L_Reg is the bounding-box regression loss. The mean squared error is typically applied between the original points u_i and the predicted points v_i over the center coordinates, width, and height, i ∈ {x, y, w, h}, see Equation (3):

L_Reg = (1/4) Σ_{i ∈ {x,y,w,h}} (u_i - v_i)². (3)

Finally, L_Mask employs an average binary cross-entropy over a mask of dimension m × m associated with the ground-truth class k. See Equation (4) for details, where x_{i,j} is the label of cell (i, j) in the ground truth and y^k_{i,j} is the value generated by the model for the same cell and class k [37]:

L_Mask = -(1/m²) Σ_{1≤i,j≤m} [x_{i,j} log(y^k_{i,j}) + (1 - x_{i,j}) log(1 - y^k_{i,j})]. (4)
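A compact sketch of how these three terms combine in PyTorch (illustrative shapes and names, not the framework's internal implementation) is given below:

```python
import torch
import torch.nn.functional as F

def detector_loss(cls_logit, cls_target, box_pred, box_target, mask_logit, mask_target):
    """Illustrative multi-task loss: L = L_Cls + L_Reg + L_Mask (Equations (1)-(4))."""
    # Equation (2): log loss over two classes (binary cross-entropy on logits).
    l_cls = F.binary_cross_entropy_with_logits(cls_logit, cls_target)
    # Equation (3): mean squared error over (x, y, w, h) of each box.
    l_reg = F.mse_loss(box_pred, box_target)
    # Equation (4): average binary cross-entropy over the m x m mask
    # of the ground-truth class.
    l_mask = F.binary_cross_entropy_with_logits(mask_logit, mask_target)
    # Equation (1): total multi-task loss.
    return l_cls + l_reg + l_mask

# Toy example with one RoI, a 28 x 28 mask head output, and random targets.
loss = detector_loss(torch.randn(1), torch.rand(1),
                     torch.randn(1, 4), torch.randn(1, 4),
                     torch.randn(1, 28, 28), torch.rand(1, 28, 28).round())
print(loss)
```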
Another characteristic that was tuned is the anchor boxes, an essential parameter for quality object detection. The anchor setting is specified with anchor scales and ratios, while the anchor strides correspond to the feature-map strides. To obtain more scales at each location, and thus increase the probability of fitting the object correctly, more scales and ratios were added.
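In MMDetection 2.x, such an enlarged anchor set could be declared as follows (a sketch using the scales and ratios reported later in the Discussion; the exact published configuration may differ):

```python
# Sketch of an enlarged RPN anchor configuration: 7 scales x 4 ratios = 28
# anchors per location at each FPN level (strides match the feature maps).
anchor_generator = dict(
    type='AnchorGenerator',
    scales=[4, 6, 8, 10, 12, 14, 16],
    ratios=[0.5, 1.0, 1.5, 2.0],
    strides=[4, 8, 16, 32, 64])
```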

Evaluations and Results
After setting up the different models, it is crucial to evaluate their performance. The COCO evaluation metrics have been adopted [55] and, as the primary metric, AP is used, with 10 Intersection over Union (IoU) thresholds of 0.50:0.05:0.95. This approach rewards detectors with better localization. AP, a measure that summarizes the precision-recall curve into a single value representing the average of the precisions at multiple recall levels, can be seen in Equation (5) [31]:

AP = (1/n) Σ_{i=1}^{n} p_interp(r_i), (5)

where n is the number of recall thresholds, r_i is the i-th recall level over the relevant samples, and p_interp is the interpolated precision at each recall level r, defined by Equation (6):

p_interp(r) = max_{r̃ ≥ r} p(r̃), (6)

where p(r̃) is the measured precision at recall r̃, for every recall that equals or exceeds r.
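A small numerical sketch of this interpolation (hypothetical precision-recall values, numpy only) may clarify the computation:

```python
import numpy as np

def average_precision(recall, precision):
    """Interpolated AP: mean of max precision at recall >= r (Equations (5)-(6))."""
    recall = np.asarray(recall)
    precision = np.asarray(precision)
    # p_interp(r) = max precision over all measured recalls r~ >= r.
    return float(np.mean([precision[recall >= r].max() for r in recall]))

# Hypothetical precision-recall points from a detector's ranked predictions.
recall = [0.2, 0.4, 0.6, 0.8, 1.0]
precision = [1.0, 0.9, 0.95, 0.7, 0.6]
print(average_precision(recall, precision))  # interpolation removes the dip at 0.4
```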
This equation is based on the following sub-metrics: IoU, recall, and precision. According to [56], these metrics are well suited for medical image segmentation, along with the F1 score or Dice similarity coefficient. The Dice coefficient is slightly different from IoU, since IoU penalizes under- and over-segmentation more than the Dice does; it is often used to quantify the performance of image segmentation methods and is a measure of how similar two objects are.
Their calculation is based on four possible interpretations of the data: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Based on these values, the following formulas can be defined for IoU, Recall, Precision, and F1-score in Equations (7)-(10), respectively:

IoU = TP / (TP + FP + FN), (7)

Recall = TP / (TP + FN), (8)

Precision = TP / (TP + FP), (9)

F1 = 2 · Precision · Recall / (Precision + Recall). (10)
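These definitions translate directly into code; a minimal sketch with hypothetical counts:

```python
def confusion_metrics(tp, fp, fn):
    """IoU, recall, precision, and F1 from confusion counts (Equations (7)-(10))."""
    iou = tp / (tp + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, recall, precision, f1

# Hypothetical pixel counts for one predicted cup mask vs. its ground truth.
print(confusion_metrics(tp=9200, fp=350, fn=420))
```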
Experimentation starts with a fraction of the Refuge dataset, specifically 100 images for training, 30 for validation, and 30 for testing. This subset was taken, due to the cumbersome nature of the labeling process, to observe a first approximation of the behavior of the models to be evaluated. It also gives an idea of how well the models behave with a limited number of images. The task was developed both without multiscale (WM) and with multiscale (M).
Three criteria are reported. AP [IoU = 0.50:0.95], where AP is averaged over multiple IoU values, rewards detectors with better localization and is taken as the primary metric. AP [IoU = 0.50] and AP [IoU = 0.75] are reported as well, since they are more common in the literature; the former is a looser threshold, the latter a stricter one. Three models improved their performance with multiscale: GCNet, MS R-CNN, and Point_Rend. The best overall result was Mask R-CNN with an AP [IoU = 0.50:0.95] of 0.671. Results can be seen in Table 2. After that, the experiment was repeated under the same conditions with the full Refuge dataset, using 400 images for training, 200 for validation, and 200 for testing; results can be seen in Table 3. Except for Carafe, all models improved their performance with the multiscale approach. The experiments were then run on the G1020 dataset with multiscale, since this approach shows better results; results are shown in Table 4. Precision-recall curves are provided for better understanding. The first plot, see Figure 5, shows that all models perform excellently, meaning that recall increases while precision is unchanged; thus, all retrieved detections are true positives.
In Figure 6, as in the previous one, all model performances are perfect except for Cascade Mask R-CNN, which cannot retrieve all true positives. Figure 7 shows promising results as well; however, this plot implicitly shows the presence of false positives, associated with localization errors and class confusion, and of false negatives. These errors occur because the G1020 dataset presents higher diversity in its images than the Refuge dataset.
A series of precision-recall (PR) curves for each class are given for interpretation purposes in Figure 8. Each PR curve is guaranteed to be higher than the previous one as the evaluation setting becomes more permissive. The legend is described as follows, with the meaning of each curve [55]:

1. C75: PR at IoU = 0.75.
2. C50: PR at IoU = 0.50.
3. Loc: PR at a loose IoU, so localization errors are ignored, but not duplicate detections.
4. Sim: PR after supercategory false positives are removed.
5. Oth: PR after all class confusions are removed.
6. BG: PR after all background (and class confusion) false positives are removed.
7. FN: PR after all remaining errors are removed (trivially AP = 1).
An explanation of Figure 6, where the precision-recall curve was not perfect for the Cascade Mask R-CNN model on the Refuge dataset, is given by a per-class analysis: Figure 8 shows a thin line at the end of the plot, exhibiting the presence of false negatives for each class.
In Figures 9 and 10, segmentation results can be seen for MS R-CNN. Across multiple experiments, models trained on the G1020 dataset gave better results than those trained on the Refuge dataset when applied to new datasets, because of the greater diversity of its images. The test was performed on multiple images; for illustrative purposes, an image from DRIONS-DB [57] and one from ORIGA DB [58] are shown in Figures 11 and 12, respectively, with the Cascade Mask R-CNN model.
As seen there, the segmentation obtained for the OD does not cover the entire expected area with the model trained on the Refuge dataset, see Figure 11b, whereas full coverage is obtained when the prediction is made with the model trained on the G1020 dataset, see Figure 11c. In Figure 12, the degradation of OD segmentation can be seen as well when comparing models trained on the Refuge and G1020 datasets.

Discussion
In this investigation, the objective was to compare different object detection models for segmenting the OD and OC on fundus images. Selected models were the original Mask R-CNN, Carafe, Cascade Mask R-CNN, GCNet, MS R-CNN, SOLO, and Point_Rend. The performance was evaluated based on average precision, F1-score, and area under the precision-recall curve (AUCPR).
The experimentation started with a small number of images to observe the behavior of the models with a reduced number of images when segmenting both the OD and OC. According to [59], between 150 and 500 images there is a turning point where the performance of the models starts to increase significantly. We initially used 100 Refuge images for training, 30 for validation, and 30 for testing. With this ratio, the best performances were associated with the Cascade, Mask R-CNN, and Point_Rend models, with an average precision of 1.000 at an IoU threshold of 0.50. The rest of the models also performed well, at around 0.98, except for SOLO, which reached 0.886; this is because of a higher number of false positives, due to the incorrect detection of a non-existing object or the misplaced detection of an existing object. According to [60], false positives can be related to classification, localization, classification plus localization, duplication, and background errors. The slight decrease for SOLO may be associated with localization errors, where an object is detected with a misaligned bounding box, meaning an overlap between 0.1 and 0.5 IoU. This problem can be addressed using negative samples in the training process [61,62]. Calculating the F1 score yielded a perfect value of 1.
With these previous results, we proceeded to annotate the entire Refuge dataset and repeat the experimentation. MS R-CNN outperformed the other models, with an average precision of 0.995 without multiscale and 1.000 when multiscale was applied, always taking the IoU threshold equal to 0.50. This data augmentation technique slightly improved performance in all models except for SOLO, which decreased from 0.989 to 0.984 but improved with the increased number of images compared to the previous experimentation. Concerning the F1 score, all models maintained a perfect value of 1 except for Cascade Mask R-CNN, which decreased to 0.997.
These experiments were repeated on the G1020 dataset, as already noted, with the best performance obtained by the Point_Rend model, with an average precision of 0.956. The remaining models maintained values around 0.94, except for the SOLO model, which decreased to 0.909. The decrease on this dataset may be related to the greater heterogeneity of its images; however, this became an advantage when transferring predictions to images from other sources that were not used in the training process. Thus, models trained on the G1020 dataset were better at segmenting both the optic disc and optic cup than those trained on Refuge.
A key element used in this research was multiscale training, varying the sizes between 1333 × 640 and 1333 × 960. This approach introduced slight improvements in the metrics when the number of images increased, allowing models to be trained with new dimensions. However, it did not bring significant improvements with few images, so researchers may consider its use, or not, depending on their resources.
Another element that was adjusted was the anchor scale. The configuration of the anchors is given by anchor_scales and anchor_ratios; originally, four anchors were generated per location in the region-proposal network. After using anchor_scales of [4, 6, 8, 10, 12, 14, 16] with anchor_ratios of [0.5, 1.0, 1.5, 2.0], 28 anchors were generated for each feature-pyramid-network level. This modification allowed us to capture a greater diversity of disc and optic-cup shapes.
From the results, the perfect F1-score values are related to the precision-recall curves. The horizontal line means that all positives have been retrieved (recall = 1) and that all retrieved detections are true positives (precision = 1). G1020 shows excellent results as well; the minor differences are due to the higher variability of the images in this dataset.

Conclusions
Segmentation tasks are complex and time-consuming in general. Traditional approaches use an encoder-decoder design, but this work has explored seven object-detection algorithms for OD and OC segmentation, chosen from the many new models that have been released.
This investigation can set a precedent for comparing object detection models for the localization, classification, and segmentation of the OD and OC. The Refuge and G1020 datasets were used, and a simple but effective mechanism for annotating input images was demonstrated on the first dataset. A fraction was initially taken from the Refuge dataset and evaluated with great results, proving that 100 images can be a good starting point, providing excellent results and reducing the cumbersome process of annotating images, although working with the whole dataset improved the average precision by around 0.050. Another common problem addressed here was image reduction. Most state-of-the-art studies reduce input images significantly or introduce an extra step in the segmentation workflow by cropping the area of interest. This work, with data augmentation based on multiscale training, proved to enhance results while keeping high-resolution input images and avoiding significant reductions. The anchor-box settings for region proposals improve target location too, and the AdamW optimizer with a cosine-annealing learning-rate schedule also brought slight improvements.
However, some limitations were identified in this research, such as the need for annotated images, which remains a substantial obstacle in the training of object detection models. The annotation process by specialists is full of variability that can be related to both human factors and image quality. Due to the complexity of evaluating segmentations, it is difficult to establish comparisons and select the best architectures since different metrics are used by researchers. Additionally, researchers use different data sources, so that even under the same training and testing conditions, it becomes impossible to compare two works if the same datasets were not used. Another limitation was the impossibility of training all the models with the different backbones that can be used, due to computational capacity and the time-consuming task of testing all the variants.
In general, all models performed almost perfectly on Refuge, while on G1020 high F1-score values were reached too. However, M. N. Bajwa et al. [48] showed a higher F1 score, but they segmented the cup and the disc separately, making the task less challenging, whereas in this investigation they were processed together.
Deeper networks such as ResNet101 and ResNeXt [32] can extract more features because of their number of layers; however, ResNet50, the least complex of these models, gives almost perfect results while extracting the correct features. Selecting the right model depends on hardware capability and the time available, since larger backbones imply long processing times.
Future work will extend the analysis to tiny objects and multi-class targets, especially crowded objects in retinal images.