ATSS Deep Learning-Based Approach to Detect Apple Fruits

In recent years, many agriculture-related problems have been addressed through the integration of artificial intelligence techniques and remote sensing systems. Specifically, in fruit detection problems, several recent works were developed using Deep Learning (DL) methods applied to images acquired at different acquisition levels. However, the increasing use of anti-hail plastic net covers in commercial orchards highlights the importance of terrestrial remote sensing systems. Apples are among the most challenging fruits to detect in images, mainly because of the frequent occurrence of target occlusion. Additionally, the introduction of high-density apple tree orchards makes the identification of single fruits a real challenge. To support farmers in detecting apple fruits efficiently, this paper presents an approach based on the Adaptive Training Sample Selection (ATSS) deep learning method applied to close-range and low-cost terrestrial RGB images. Correct identification supports apple production forecasting and gives local producers a better idea of forthcoming management practices. The main advantage of the ATSS method is that only the center point of the objects is labeled, which is much more practicable and realistic than bounding-box annotation in heavily dense fruit orchards. Additionally, we evaluated other object detection methods, such as RetinaNet, Libra Region-based Convolutional Neural Network (R-CNN), Cascade R-CNN, Faster R-CNN, Feature Selective Anchor-Free (FSAF), and High-Resolution Network (HRNet). The study area is a highly dense apple orchard consisting of Fuji Suprema apple fruits (Malus domestica Borkh) located on a smallholder farm in the state of Santa Catarina (southern Brazil). Experiments were conducted by varying the bounding box size (80, 100, 120, 140, 160, and 180 pixels) in the original images for the proposed approach.
Our results showed that the ATSS-based method slightly outperformed all other deep learning methods, by margins between 0.3% and 2.4%. We also verified that the best result was obtained with a bounding box size of 160 × 160 pixels. The proposed method was robust against most of the corruptions, except for snow, frost, and fog weather conditions. Finally, a benchmark of the reported dataset was also generated and made publicly available.


Introduction
Remote sensing systems have effectively assisted different areas of application over a long period of time. Nevertheless, in recent years, the integration of these systems with machine learning techniques has been considered state-of-the-art in serving diverse areas of knowledge, including precision agriculture [1][2][3][4][5][6][7]. A subgroup of machine learning methods is DL, which is usually designed as a deeper network with many layers of non-linear transformations for task-solving [8]. A DL model has three main characteristics: (1) it can extract features directly from the dataset itself; (2) it can learn hierarchical features that increase in complexity through the deep network; and (3) it can generalize better than shallower machine learning approaches, such as support vector machines, random forests, and decision trees [8,9].
Feature extraction with deep learning-based methods is found in several applications with remote sensing imagery [10][11][12][13][14][15][16][17][18]. These deep networks are built with different types of architectures that follow a hierarchical type of learning. In this aspect, frequently adopted architectures in recent years include Unsupervised Pre-Trained Networks (UPN), Recurrent Neural Networks (RNN), and Convolutional Neural Networks (CNN) [19]. CNN is a DL class and the most used for image analysis [20]. CNNs can be used for segmentation, classification, and object detection problems. The use of object detection methods in remote sensing has been increasing in recent years. Li et al. [21] conducted a review survey and proposed a large benchmark for object detection. The authors considered orbital imagery and 20 classes of objects. They investigated 12 methods and verified that RetinaNet slightly outperformed the others in the proposed benchmark. However, novel methods have since been proposed that have not yet been investigated in several contexts.
In precision farming problems, the integration of DL methods and remote sensing data has also produced notable improvements. Recently, Osco et al. [15] proposed a CNN-based method to count and geolocate citrus trees in a highly dense orchard using images acquired by a remotely piloted aircraft (RPA). More specifically to fruit counting, Apolo-Apolo et al. [22] applied a Faster R-CNN technique to estimate the yield and size of citrus fruits using images collected by an RPA. Similarly, Apolo-Apolo et al. [23] generated yield estimation maps of apple orchards using RPA imagery and a region-based CNN technique.
Many other examples of object detection and counting in the agricultural context are presented in the literature [2,[24][25][26], for strawberry [27,28], orange and apple [29], apple and mango [30], and mango [31,32], among others. A recent literature review [32] on the use of DL methods for fruit detection highlighted the importance of creating novel methods to ease the labeling process, since manual annotation to obtain training datasets is a labor-intensive and time-consuming task. Additionally, the authors argued that DL models usually outperform pixel-wise segmentation approaches based on traditional (shallow) machine learning methods in the fruit-on-plant detection task.
In the context of apple orchards, Dias et al. [33] developed a CNN-based method to detect flowers. They proposed a three-step approach: superpixels to propose regions with spectral similarities; a CNN to generate features; and support vector machines (SVM) to classify the superpixels. Also, Wu et al. [34] developed a YOLO v4-based approach for the real-time detection of apple flowers. Regarding apple leaf disease analysis, another study was conducted [35], and a DL approach to edge detection was proposed for apple growth monitoring by Wang et al. [36]. In apple fruit detection, recent works were likewise based on CNNs. Tian et al. [37] improved the YOLO-v3 method to detect apples at different growth stages. An approach for the real-time detection of apple fruits in orchards was proposed by Kang and Chen [38], showing an accuracy ranging from 0.796 to 0.853. The Faster R-CNN approach proved slightly superior to the remaining methods. Gené-Mola et al. [39] adapted the Faster R-CNN method to detect apple fruits using an RGB-D sensor. These papers pointed out that occlusion may hinder the process of detecting apple fruits. To cope with this, Gao et al. [40] investigated the use of Faster R-CNN considering four fruit classes: non-occluded, leaf-occluded, branch/wire-occluded, and fruit-occluded.
Although they provide very accurate results in the current application, Faster R-CNN [41] and RetinaNet [42] date from before 2018, and novel methods have recently been proposed but not yet evaluated for apple fruit detection. One of these is the Adaptive Training Sample Selection (ATSS) [43]. ATSS, unlike other methods, considers an individual intersection over union (IoU) threshold for each ground-truth bounding box. Consequently, improvements are expected in the detection of small objects or objects that occupy a small bounding box area. Although it is a recent method, few works have explored it, mainly in remote sensing applications serving precision agriculture-related problems. Also, when considering RGB imagery, the method may be useful for low-cost sensing systems.
In fact, the prediction of fruit production is made manually by counting the fruits on selected trees, and a generalization is then made for the entire orchard. This procedure is still the reality for most (if not all) of the orchards in Southern Brazil. Evidently, such forecasting is somewhat imprecise due to the natural variability of the orchards. Therefore, techniques that count apple fruits more efficiently for each tree, even with some occluded fruits, help in yield prediction. Fruit monitoring is important from the fruit-set stage until the ripe stage. After fruit setting, the fruits can be monitored weekly. Still, losses may occur, with apples dropping and lying on the ground due to natural conditions or mechanical injuries, until the harvesting process. It is important to mention that the apple fruits are only visible after the fruit-setting stage, because the ovary is retained under the stimulus of pollination [44]. The final shapes are formed one month before harvesting and can be monitored to check losses in productivity. This statistic is also key information for the orchard manager and for fruit production forecasting.
The increasing use of anti-hail plastic net covers in orchards helps to prevent damage caused by hailstorms. The installation of protective netting over high-density apple tree orchards is strongly recommended in regions where heavy convective storms occur, to avoid losses in both production and fruit quality [45,46]. The shield reduces the percentage of injuries to apple trees during critical growth stages, mainly the flowering and fruit development phases. As a result, protective netting has been implemented in different regions of the world, including Southern Brazil [47,48]. On the other hand, the use of nets hinders the use of remote sensing images acquired from orbital, airborne, and even RPA systems for fruit detection and fruit production forecasting. This highlights the importance of using close-range and low-cost terrestrial RGB images acquired by semi-professional or professional cameras. Such devices can be operated manually or even be fixed to agricultural machinery or tractors.
Terrestrial remote sensing is also important in this specific case study because of the scale. Proximal remote sensing also allows the detection of apple fruits occluded by leaves and small branches. Complementarily, the high spatial resolution enables the detection of eventual injuries caused by diseases, pests, and even climate effects on the fruits themselves, as well as on leaves, branches, and trunks, which would not be possible in images acquired by aerial systems.
This paper proposes an approach based on the ATSS DL method applied to close-range and low-cost terrestrial RGB images for automatic apple fruit detection. The main ATSS characteristic is that only the center point of the objects is labeled. This is a very practical approach, since bounding-box annotation, especially in heavily dense fruit areas, is difficult and time-consuming to perform manually.
Experiments were conducted comparing this approach to other object detection methods considered state-of-the-art in recent years: Faster-RCNN [41], Libra-RCNN [49], Cascade-RCNN [50], RetinaNet [42], Feature Selective Anchor-Free (FSAF) [51], and HRNet [52]. Moreover, we evaluated the generalization capability of the proposed method by applying simulated corruptions at different severity levels: noise (Gaussian, shot, and impulse), blurs (defocus, glass, and motion), weather conditions (snow, frost, fog, and brightness), and digital processing (elastic transform, pixelate, and JPEG compression). Finally, we evaluated the effect of the bounding box size on the ATSS method, ranging from 80 × 80 to 180 × 180 pixels. Another important contribution of this paper is the availability of the dataset used, so that future research can compare novel DL networks in the same context.

Data Acquisition
The experiment was conducted in the Correia Pinto municipality, State of Santa Catarina, Brazil, in a commercial orchard (27°40′01″ S, 50°22′32″ W) (Figure 1A-C). According to the Köppen classification, the climate is subtropical (Cfb) with temperate summers [53]. The soil of the region is classified as Inceptisol according to the American Soil Taxonomy System [54], which corresponds to Cambissolo (CHa) in the Brazilian Soil Classification System [55]. The relief is relatively flat, and the altitude above sea level is 880 m. The average annual accumulated precipitation is 1560 mm, based on the pluviometric records of the Lages station (2750005-INMET), located 15 km from the orchard, over a 74-year period (1941-2014) [56]. The precipitation is regularly distributed from September to February, with an average of 150 mm per month. The average precipitation from April to June is below 100 mm. Frost events usually occur from April to September.
The State of Santa Catarina represents 50% of the Brazilian apple production. Up to 90% of this production comes from orchards with an average size of 10 hectares [57,58]. In general, the owners of these small rural properties have limited resources to invest in management tools and decision support, making the use of close-range and low-cost terrestrial RGB images for automatic apple fruit detection an interesting alternative.
The selected study orchard has 46 ha, and the apple trees were planted in 2009 with a spacing of 0.8 × 3.5 m, resulting in a density of 3570 plants·ha⁻¹, with the planting rows following the N-S orientation. In the analyzed area, two apple cultivars were planted, Fuji Suprema and Gala, in a proportion of four rows of Fuji Suprema to two rows of Gala. This practice is used to facilitate cross-pollination [59]. The analyzed cultivar was Fuji Suprema, sampled in rows identified from L1 to L8 (Figure 1D). The two Gala cultivar rows, labeled LA and LB in the same figure, had already been harvested when the field surveys were conducted and were therefore excluded from further analysis. The surveyed area was not covered with an anti-hail net. To acquire the image dataset, a Canon EOS-T6 camera (Tokyo, Japan; CMOS sensor of 5184 × 3456 pixels (17.9 MP) and pixel size of 4.3 µm) was used with a fixed nominal focal length of 18 mm. The camera was operated manually, in portrait position at the operator's face height, with the camera's auxiliary flash active (Figure 1E). Figure 2 shows the mode of image acquisition, nearly perpendicular to a given planting row. After each image capture, a small step parallel to the planting line was taken, and a new acquisition was then made, similar to the "Stop and Go" mode adopted by Liang et al. [61]. The images were obtained on two dates, 29 March and 1 April 2019, at varying times of day. In this period, the apple fruits were near ripe and close to the point of harvesting, according to visual inspection by specialists in the field. The cultivar under analysis has a red and uniform color soon after flowering, covering 80% to 100% of the fruit [62].
The acquisition times for each planting line are shown in Table 1, and varied between 11:50 a.m. and 6:20 p.m. local time. No artificial background was adopted, and the natural light varied between cloudy and sunny days. Agricultural management procedures conducted using either agricultural machinery or tractors during a usual working day usually occur at that time interval.
The annotations of the apple fruits were performed manually by specialists in the online application VIA (VGG Image Annotator, http://www.robots.ox.ac.uk/~vgg/software/via/via-1.0.6.html) [63]. In this procedure, a marking point was annotated at the center of each fruit. Each single point was double-checked by a second specialist. The approach idealized in this study is illustrated in Figure 3 and can be summarized in three main steps. The first step (#1) illustrates the dot annotation at the center of each apple fruit and the setting of the bounding box size, with a fixed value of 120 pixels. In step (#2), we highlight the state-of-the-art object detection models used in our work and the distribution of image patches into training, validation, and test sets.
Step (#3) shows the setting of the bounding box size, whose value ranges from 80 to 180 pixels at intervals of 20, and the four categories (noise, blur, weather, and digital) of image corruption methods applied to the test set to evaluate the robustness of the ATSS model.

Object Detection Methods
In the proposed approach to apple-fruit detection, we compared the ATSS [43] method to six popular object detection methods: Faster-RCNN, Libra-RCNN [49], Cascade-RCNN [50], RetinaNet, Feature Selective Anchor-Free (FSAF) [51], and HRNet [52]. We chose these methods for two main reasons: (i) they constitute the state-of-the-art in object detection proposals of recent years; (ii) they cover the main directions of object detection research: two-stage and one-stage, and anchor-based and anchor-free detectors. Below, we briefly describe the main characteristics of each adopted method.
ATSS: Experimental results in Reference [43] showed that an important step in training object detection methods is the selection of positive and negative samples. Therefore, ATSS [43] proposes a new strategy to select samples based on the objects' statistical characteristics. For each ground truth g, the k anchor boxes whose centers are closest to the center of g are selected as positive candidates. Then, the IoU between the candidates and g is calculated, obtaining the mean m_g and standard deviation σ_g of the IoUs. A threshold t_g = m_g + σ_g is used to select the final positive candidates, whose IoU must be greater than or equal to that threshold. The remaining samples are used as negatives. In this work, ATSS has ResNet50 and a Feature Pyramid Network (FPN) as a backbone, and k = 9 anchor boxes are first selected as positive candidates.
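The selection strategy above can be sketched as follows. This is our own simplified illustration for a single feature level (the full ATSS also gathers candidates per pyramid level and requires the candidate center to lie inside the ground-truth box), not the mmdetection implementation used in the experiments:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def atss_select_positives(anchors, gt_box, k=9):
    """Select positive anchors for one ground-truth box, ATSS-style.

    anchors: (N, 4) array of [x1, y1, x2, y2] anchor boxes.
    gt_box:  (4,) ground-truth box [x1, y1, x2, y2].
    Returns the indices of anchors selected as positives.
    """
    # Centers of all anchors and of the ground truth.
    anchor_centers = np.stack([(anchors[:, 0] + anchors[:, 2]) / 2,
                               (anchors[:, 1] + anchors[:, 3]) / 2], axis=1)
    gt_center = np.array([(gt_box[0] + gt_box[2]) / 2,
                          (gt_box[1] + gt_box[3]) / 2])

    # Step 1: the k anchors whose centers are closest to the gt center.
    dists = np.linalg.norm(anchor_centers - gt_center, axis=1)
    candidate_idx = np.argsort(dists)[:k]

    # Step 2: IoU between each candidate and the ground truth.
    ious = np.array([iou(anchors[i], gt_box) for i in candidate_idx])

    # Step 3: adaptive threshold t_g = m_g + sigma_g.
    t_g = ious.mean() + ious.std()

    # Step 4: keep candidates whose IoU >= t_g as positives.
    return candidate_idx[ious >= t_g]
```

Because the threshold adapts to each object's own IoU statistics, small objects whose candidate IoUs are uniformly low can still receive positive samples, which is the property motivating its use here.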
Faster-RCNN: The Faster-RCNN [41] is a two-stage CNN that uses a Region Proposal Network (RPN) to generate candidate bounding boxes for each image. The first stage regresses the location of each candidate bounding box, and the second stage predicts its class. In this work, we build the FPN on top of the ResNet50 network, as proposed in Reference [64]. This setting improves the results of the Faster-RCNN, since ResNet+FPN extracts rich semantic features from the images that feed the RPN and help the detector predict bounding box locations better.
Cascade-RCNN: In training object detection methods, an IoU threshold is set to select positive and negative samples. Despite its promising results, Faster-RCNN uses a fixed IoU threshold of 0.5. A low IoU threshold during training can produce noisy detections, while high values can result in overfitting due to the lack of positive samples. To address this issue, Cascade-RCNN [50] proposes a sequence of detectors with increasing IoU thresholds, trained stage by stage. The output of one detector is used to train the subsequent detector. In this way, the final detector receives a better sample distribution and produces higher-quality detections. Following the authors of Reference [50], we used ResNet50+FPN as the backbone and three cascade detectors with IoUs of 0.5, 0.6, and 0.7, respectively.
Libra-RCNN: Despite the wide variety of object detection methods, most of them follow three stages: (i) region proposal, (ii) feature extraction from these regions, and (iii) category classification and bounding box refinement. However, imbalance in these three stages limits the overall accuracy. Thus, Libra-RCNN [49] proposes three components to explicitly enforce balance in the three stages. In the first stage, the proposed IoU-balanced sampling mines hard samples uniformly according to their IoU with the ground truth. To reduce the imbalance in feature extraction, Libra-RCNN proposes the balanced feature pyramid, which resizes the feature maps from the FPN to the same size, calculates their average, and resizes them back to their original sizes. Finally, the balanced L1 loss rebalances the classification and localization tasks. For this method, we used ResNet50+FPN as the backbone.
HRNet: The main purpose of the HRNet [52] is to maintain the feature extraction in high-resolution throughout the backbone. HRNet consists of four stages, where the first stage uses high-resolution convolutions. The second, third, and fourth stages are made up of multi-resolution blocks. Each multi-resolution block connects high-to-low resolution convolutions in parallel and fuses multi-scale representations across parallel convolutions. In the end, low-resolution representations are rescaled to high-resolution through bilinear upsampling. In this method, we used HRNet as a backbone and Faster-RCNN heads to detect objects.
RetinaNet: To improve the accuracy of single-stage object detectors relative to two-stage methods, RetinaNet was proposed in Reference [42] to solve the extreme class imbalance. RetinaNet addresses the imbalance between positive (foreground) and negative (background) samples by introducing a new loss function named Focal Loss. Since the annotation keeps only the positive samples (ground-truth bounding boxes), the negative samples are obtained during training from generated candidate bounding boxes that do not match the ground truth. In this sense, depending on the number of generated candidate bounding boxes, the negative samples can be over-represented. To solve this problem, RetinaNet down-weights the loss assigned to well-classified samples (whose probability for the ground-truth class is above 0.5) and focuses on samples that are hard to classify. The RetinaNet architecture is composed of a single network with ResNet+FPN as the backbone and two independent subnetworks to regress and classify the bounding boxes. In our experiments, we combined the ResNet50 network with the FPN.
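The down-weighting mechanism is the Focal Loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The sketch below is our own minimal rendering of that formula for a single binary prediction, not RetinaNet's actual implementation:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal Loss for one binary prediction (sketch).

    p: predicted foreground probability, in (0, 1).
    y: ground-truth label, 1 (foreground) or 0 (background).
    alpha, gamma: the default values reported in Reference [42].
    """
    p_t = p if y == 1 else 1.0 - p             # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights well-classified samples, so training
    # focuses on hard examples instead of the many easy backgrounds.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With γ = 2, a well-classified sample (p_t = 0.9) contributes far less loss than a hard one (p_t = 0.1), and setting γ = 0 recovers the α-weighted cross-entropy.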
FSAF: This method proposes the Feature Selective Anchor-Free (FSAF) module [51], to be applied to single-shot detectors with an FPN structure, such as RetinaNet. In such methods, choosing the best pyramid level for training each object is a complicated task. Thus, the FSAF module attaches an anchor-free branch at each level of the pyramid, trained with online feature selection. We used RetinaNet with the FSAF module, including ResNet50+FPN as its backbone.

Proposed Approach
The proposed approach to training the implemented object detection methods can be divided into three steps. The first step is to split the high-resolution images into patches. This division aims to aid the detection of small and/or occluded apples. Given an image, it is divided into non-overlapping patches of size w_p × h_p, as shown in Figure 4.
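A minimal sketch of this tiling step (our own illustration; the paper does not specify how image borders that do not divide evenly into patches are handled, so keeping partial border patches here is an assumption):

```python
import numpy as np

def split_into_patches(image, wp=1024, hp=1024):
    """Split an H x W x C image array into non-overlapping wp x hp patches.

    Border regions smaller than a full patch are kept as partial patches
    (an assumption; border handling is not described in the paper).
    """
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h, hp):
        for x in range(0, w, wp):
            patches.append(image[y:y + hp, x:x + wp])
    return patches
```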
In the second step, the apples were labeled with a point at their center (Figure 5a). Apple fruits on the ground were also marked. This point feature speeds up the labeling process and is essential for labeling a large number of patches. However, these labels are not directly usable in object detection methods that require a bounding box. To work around this problem, we propose to estimate a fixed-size bounding box around each labeled point (Figure 5b). This weak labeling also makes it possible to assess how accurately each bounding box needs to be defined for recent methods.
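Converting each annotated center point into a fixed-size box can be sketched as below (our own illustration; clipping boxes at the patch borders is an assumed behavior for apples near the edge, not a detail given in the paper):

```python
def point_to_box(cx, cy, size=120, patch_w=1024, patch_h=1024):
    """Build a fixed-size [x1, y1, x2, y2] box around a labeled center point.

    The box is clipped to the patch borders (an assumption for points
    closer than size/2 to the edge).
    """
    half = size / 2
    return [max(0, cx - half), max(0, cy - half),
            min(patch_w, cx + half), min(patch_h, cy + half)]
```

Re-running this conversion with a different `size` is all that is needed to produce the 80 × 80 to 180 × 180 label variants evaluated later.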
Finally, in the last step, patches and ground-truths were used to train the object detection methods. We then tested the methods described in Section 2.2. Given the trained method, it can be used to detect apples in the test step.

Experimental Setup
The plantation lines were randomly divided into training, validation, and testing sets. Due to the high resolution of the images, they were divided into patches of 1024 × 1024 pixels, as described in the previous section. Table 2 shows the division of the dataset with the number of images, patches, and apples in each set. The input patches were normalized channel-wise using a standard deviation of [58.395, 57.12, 57.375]. For data augmentation, each input patch was randomly flipped with a probability of 0.5. Given the manually labeled centers, bounding boxes with a fixed size of 120 × 120 pixels were first evaluated. Next, we tested different sizes, ranging from 80 × 80 to 180 × 180 pixels.
For training, the backbones of the object detection methods were initialized with pre-trained weights from ImageNet, in a strategy known as transfer learning. A set of hyper-parameters was fixed for all methods after validation. Faster-RCNN, Libra-RCNN, Cascade-RCNN, and HRNet were trained using the SGD optimizer for 17,500 iterations with a learning rate of 0.02, a momentum of 0.9, and a weight decay of 0.0001. RetinaNet, FSAF, and ATSS followed the same hyper-parameters, except that they used an initial learning rate of 0.01, as shown in Table 3. They were also trained with the higher learning rate (0.02) but did not converge properly. Figure 6 illustrates the behavior of the loss curves for the methods. The curves drop dramatically in the first iterations and stabilize afterward, showing that the training was adequate. As an evaluation metric, we use the average precision (AP) commonly used in object detection, with an IoU threshold of 0.5. In summary, this metric counts an example as a true positive (TP) when its IoU value is greater than the threshold (0.5 in our experiments) and as a false positive (FP) otherwise. IoU stands for Intersection over Union, calculated as the overlapping area between the predicted and ground-truth bounding boxes divided by the area of their union. From the TP and FP counts, the AP is calculated as the area under the precision-recall curve, and its value ranges from 0 (low) to 1 (high). Precision and recall are calculated according to Equations (1) and (2), respectively.
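The quantities involved can be illustrated with a minimal sketch (our own code, using the standard definitions precision = TP/(TP + FP) and recall = TP/(TP + FN); the full AP additionally sweeps the detector's confidence threshold to trace the precision-recall curve):

```python
def box_iou(a, b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall(preds, gts, iou_thr=0.5):
    """Greedily match predictions to ground truths at a given IoU threshold.

    A prediction matching an unmatched ground truth with IoU >= iou_thr
    counts as a TP; unmatched predictions are FPs and unmatched ground
    truths are FNs.
    """
    matched, tp = set(), 0
    for p in preds:
        best, best_iou = None, iou_thr
        for i, g in enumerate(gts):
            if i not in matched and box_iou(p, g) >= best_iou:
                best, best_iou = i, box_iou(p, g)
        if best is not None:
            matched.add(best)
            tp += 1
    fp = len(preds) - tp
    fn = len(gts) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall
```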
Model robustness was measured by applying 13 image corruption types at three severity levels (SL) per type. As pointed out in Reference [65], these image corruptions are not recommended as a data augmentation toolbox for training data. In this sense, the corruptions were applied only to the images in the test set. The corruptions are distributed in four categories: (i) noise (Gaussian, shot, and impulse); (ii) blurs (defocus, glass, and motion); (iii) weather conditions that usually occur in the study area (snow, frost, fog, and brightness); and (iv) digital image processing (elastic transform, pixelate, and JPEG compression). We followed the same severity levels as Hendrycks et al. [66]; however, only three of the five levels were considered in our experiments.
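As one illustrative example, a Gaussian-noise corruption with discrete severity levels can be sketched as follows. The severity-to-sigma mapping here is our own assumption for illustration; the experiments follow the exact parameterization of the benchmark by Hendrycks et al. [66]:

```python
import numpy as np

def gaussian_noise(image, severity=1, seed=None):
    """Corrupt a uint8 RGB image with Gaussian noise at a severity level.

    The sigma value per level is an illustrative assumption, not the
    exact benchmark parameter of Reference [66].
    """
    sigma = {1: 0.08, 2: 0.12, 3: 0.18}[severity]  # noise std on a [0, 1] scale
    rng = np.random.default_rng(seed)
    img = image.astype(np.float64) / 255.0
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    # Clip back to the valid range and restore the uint8 encoding.
    return (np.clip(noisy, 0.0, 1.0) * 255.0).astype(np.uint8)
```

Following Reference [65], such corruptions are applied only to the test set, never as training-time augmentation.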
All experiments were performed in a desktop computer with Intel(R) Xeon(R) CPU E3-1270@3.80 GHz, 64 GB memory, and NVIDIA Titan V Graphics Card (5120 Compute Unified Device Architecture-CUDA cores and 12 GB graphics memory). The methods were implemented using mmdetection toolbox (https://github.com/open-mmlab/mmdetection) on the Ubuntu 18.04 operating system.

Apple Detection Performance Using Different Object Detection Methods
The summary results of the apple fruit detection methods are shown in Table 4. For each deep learning-based method, the AP for each plantation line is presented, in addition to its average and standard deviation (std). Experimental results showed that ATSS obtained the highest AP of 0.925(±0.011), followed by HRNet and FSAF. Traditional object detection methods, such as Faster-RCNN and RetinaNet, obtained slightly lower results, with APs of 0.918(±0.016) and 0.903(±0.020), respectively. Nonetheless, all methods achieved results above 0.9, which indicates that automatic apple counting is feasible. Compared to manual counting by human inspection, automatic counting is much faster, with similar accuracy.
The last column of Table 4 shows the estimated number of apples detected by each method in the test set. ATSS detected approximately 50 more apples than the second-best methods (HRNet and FSAF). It is important to highlight that this result was obtained over three plantation lines, so the impact on the entire plantation would be much greater. In addition, the financial impact on an automatic harvesting system using ATSS, for example, would be considerable despite the small difference in AP.
Examples of apple detection are shown in Figures 7 and 8 for all object detection methods. These images present challenging scenarios with high apple occlusion and different lighting conditions. Occlusion and differences in lighting are probably the most common problems faced by this type of detection, as observed in similar applications [29][30][31]. The accuracy obtained by the ATSS method on our dataset was similar or superior to that reported for other tree-fruit detection tasks [67].
These images are challenging even for humans, who would take considerable time to perform a visual inspection. Despite the challenges imposed on these images, the detection methods presented good visual results, including ATSS (Figure 7a). On the other hand, the lower AP of RetinaNet is generally due to false positives, as shown in Figure 8c.
It is important to highlight that the methods were trained with all bounding boxes of the same size (Table 2). However, the estimated bounding boxes were not all the same size. For example, for Faster-RCNN, the predicted bounding boxes in the test set had an average width and height of 114.60(±15.44) × 114.96(±17.40) pixels and an average area of 13,233(±4232.4) pixels². The standard deviations indicate that there is variation among the predicted bounding boxes.

Influence of the Bounding Box Size
The results of the previous section were obtained with a bounding box size of 120 × 120 pixels. In this section, we evaluated the influence of the bounding box size on the ATSS, as it obtained the best results among all of the object detection methods. Table 5 shows the results returned when using sizes from 80 × 80 to 180 × 180.
Bounding boxes with small size (e.g., 80 × 80 pixels) did not achieve good results as they did not cover most of the apple fruit edges. The contrast and the edges of the object are important for the adequate generalization of the methods. On the other hand, large sizes (e.g., 180 × 180) cause two or more apples to be covered by a single bounding box. Therefore, the methods cannot detect apple fruits individually, especially small ones or in occlusion. The experiments showed that the best result was obtained with a size of 160 × 160 pixels.
Despite the challenges imposed by the application (e.g., apple fruits with large occlusion, varying scales, lighting, agglomeration), the results showed that the AP is high even without labeling a bounding box for each object. Since previous fruit-tree object detection studies [29,31] used different box sizes as labels, mainly because of perspective differences in the fruit position, this type of information should be considered, as manual labeling is an intensive and laborious operation.
Here, we demonstrated that, with the ATSS method, when objects have a regular shape, which is generally the case with apples, a fixed-size bounding box is sufficient to obtain an acceptable AP. This, by itself, may facilitate labeling these fruits, since the specialist can use a point-type annotation and a fixed size can later be adopted by the bounding-box-based methods, which implies a significant time reduction.
We can see that the method maintains a precision of around 0.94-0.95 for densities between 0 and 29 apples. Figure 9 shows examples of detection in patches without apples. This shows that the method is consistent from detecting a few apples up to a considerable amount of 29 apples in a single patch. As expected, the method's precision decreases at the last density level (patches with 30-42 apples). However, the precision is still adequate given the large number of apples.

Robustness against Corruptions
In this section, we evaluated the robustness of the ATSS model, with a bounding box size of 160 × 160 pixels, under different corruptions applied only to the images in the test set. These corruptions simulate conditions that may occur during in situ image acquisition, including adverse environmental factors, sensor attitude, and degradation in data recording. Tables 7 and 8 show the results for the different corruptions and severity levels.

The noise results across the three severity levels (Table 7) indicated a reduction in average precision between 2.7% and 11.1% compared to the non-corrupted values. For Gaussian and shot noise, the precision was reduced in the same proportion, while the reduction caused by impulse noise was higher at the first and second levels. Despite the severity of the noise corruption, the precision remained above 84%.

The reduction in precision for all blur components (Table 7) was very similar, showing a slight decrease compared with the non-corrupted condition, with reduction values between 0% and 4.8%. At all three severity levels, motion blur had less impact than defocus and glass blur, with the lowest precision caused by defocus blur. All results showed precision values above 90%, even at the highest severity levels.

According to the results shown in Table 8, fog caused the largest precision loss among the simulated weather conditions. The precision reductions ranged from approximately 15.5% to 35.5% from the first to the third severity level, respectively. Figure 10 shows apple detection in two images corrupted by fog: as the severity level increases, some apples are no longer detected. This loss of precision is mainly expected for apples with large leaf occlusion.
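The three noise corruptions can be simulated with simple numpy operations. The sketch below is illustrative: the severity parameters are our own assumptions chosen to increase distortion strength, not the exact values of the corruption benchmark used in the experiments:

```python
import numpy as np

def add_noise(img, kind="gaussian", severity=1, seed=0):
    """Apply a noise corruption to a uint8 RGB image at severity 1-3.

    Parameter values per severity level are illustrative only.
    """
    rng = np.random.default_rng(seed)
    x = img.astype(np.float64) / 255.0
    if kind == "gaussian":
        sigma = [0.04, 0.08, 0.12][severity - 1]
        x = x + rng.normal(0.0, sigma, x.shape)
    elif kind == "shot":                          # Poisson (photon) noise
        lam = [60, 25, 12][severity - 1]
        x = rng.poisson(x * lam) / lam
    elif kind == "impulse":                       # salt-and-pepper noise
        frac = [0.03, 0.06, 0.09][severity - 1]
        mask = rng.random(x.shape) < frac
        x[mask] = rng.integers(0, 2, mask.sum())  # random black/white pixels
    return (np.clip(x, 0.0, 1.0) * 255).astype(np.uint8)

img = np.full((32, 32, 3), 128, dtype=np.uint8)   # flat gray test image
noisy = add_noise(img, "gaussian", severity=3)
print(noisy.shape, noisy.dtype)
```

Running the trained detector on such corrupted copies of the test set, one severity level at a time, reproduces the evaluation protocol of Tables 7 and 8.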
The brightness corruption implied a slight precision reduction, between 0.5% and 2.4%, compared to the non-corrupted condition. At the second severity level, the snow condition achieved the second-worst precision among all weather conditions. The precisions obtained under the digital processing corruptions elastic transform and pixelate were near 95% for all severity levels, while those obtained under JPEG compression were near 92%. Pixelate showed the best precision among the digital conditions at all severity levels, with the lowest precision reduction. The JPEG compression precision reductions were between 1.4% and 3.2% for severity levels 1 to 3, respectively, compared to the non-corrupted condition. In general, the weather conditions affected precision more than the digital processing corruptions.
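Pixelate, the mildest digital corruption here, is simply block-averaging followed by upsampling. A minimal numpy-only sketch (block sizes per severity level are our assumption):

```python
import numpy as np

def pixelate(img, severity=1):
    """Pixelate a uint8 image by block-averaging, then upsampling.

    Block sizes per severity level are illustrative.
    """
    block = [4, 8, 16][severity - 1]
    h, w = img.shape[:2]
    h2, w2 = h - h % block, w - w % block        # crop to a whole number of blocks
    x = img[:h2, :w2].astype(np.float64)
    small = x.reshape(h2 // block, block, w2 // block, block, -1).mean(axis=(1, 3))
    out = np.repeat(np.repeat(small, block, axis=0), block, axis=1)
    return out.astype(np.uint8)

img = np.arange(32 * 32 * 3, dtype=np.uint8).reshape(32, 32, 3)
out = pixelate(img, severity=2)
print(out.shape)
```

Because block-averaging preserves local color statistics, it is plausible that it degrades the detector less than compression artifacts, consistent with the pixelate results above.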
Previous works [37,39,40] investigated and adapted Faster R-CNN and Yolo-v3 methods. Here, we investigated a novel method based on ATSS and presented its potential for apple fruit detection. In a general sense, the results indicated that the ATSS-based method slightly outperformed the other deep learning methods, with the best result obtained with a bounding box size of 160 × 160 pixels. The proposed method was robust regarding most of the corruptions, except for the snow, frost, and fog weather conditions. Even so, the results remained satisfactory, which implies good accuracy for apple fruit detection even when RGB pictures are acquired under unfavorable weather, such as the frost, fog, and even snow events that are frequent during winter in some areas of the southern Santa Catarina Plateau in southern Brazil.

Further Research Perspectives
This research made all the close-range and low-cost terrestrial RGB images available, including the annotations. These images can be acquired manually in the field, but cameras can also be installed on agricultural implements such as sprayers, or even on tractors and autonomous vehicles [68]. It would be interesting to check the applicability of such images in earlier phases or even during the harvesting of the apple fruits. As the inspection process of the apple fruits is performed locally, it is also subjective. It would likewise be interesting to count fruits that fall on the ground and assess the impact on the estimation of the final apple production. Another important task is to segment each fruit, which can be performed using a region growing strategy or DL-based instance segmentation methods.
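The region growing strategy mentioned above can be sketched as a breadth-first flood fill seeded at a detected fruit center; the function, tolerance, and toy image below are illustrative assumptions, not a validated segmentation pipeline:

```python
from collections import deque
import numpy as np

def region_grow(img, seed, tol=10):
    """Grow a region from a seed pixel, adding 4-connected neighbours
    whose intensity is within `tol` of the seed value.

    img  : 2-D array (e.g., the red channel, where ripe apples stand out).
    seed : (row, col) inside the fruit, e.g., a detected center point.
    Returns a boolean mask of the grown region.
    """
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_val = float(img[seed])
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc] \
                    and abs(float(img[nr, nc]) - seed_val) <= tol:
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask

# Toy image: a bright square "fruit" on a dark background.
img = np.zeros((20, 20), dtype=np.uint8)
img[5:15, 5:15] = 200
mask = region_grow(img, seed=(10, 10), tol=10)
print(mask.sum())  # -> 100
```

Since the proposed approach already produces center points, those points are natural seeds for such a segmentation step.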
Orchards with plums and grapes would present challenges similar to those shown here for apples, due to their particular shape and color. Both orchard types also use the anti-hail plastic net cover to prevent damage from fruit setting until the harvesting period. However, specific challenges may occur when dealing with avocado, pear, and feijoa, since the typical green color of these fruits is very similar to the green of their leaves. These fruits would surely require adjustments to the methodology proposed in this manuscript and are also encouraged for further studies. A comparison of the labeling time and detection performance of the center-point annotation adopted here against full bounding-box annotation is also recommended for completeness.
Besides fruit identification, other fruit attributes such as shape, weight, and color are also very relevant information targeted by the market and are recommended for further studies. The addition of quantitative information such as shape could support average fruit weight estimation, providing better and more realistic fruit production estimates. However, the images must contain the entire apple tree, which was sometimes not the case due to the tree height. Due to the limited space between the planting lines (i.e., 3.5 m), it is sometimes impossible to cover the entire apple tree in the picture frame. Therefore, we suggest installing camera devices at different heights on agricultural implements or tractors to overcome this problem. Alternatively, further studies could also explore the acquisition of oblique images or the use of fisheye lenses.
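As a first step toward size estimation, a detected box width in pixels can be converted into a physical diameter with the pinhole camera model (diameter ≈ pixels × distance / focal length in pixels). All numbers below are hypothetical, chosen only to illustrate the geometry:

```python
def fruit_diameter_mm(pixel_diameter, distance_mm, focal_px):
    """Estimate physical fruit diameter with the pinhole camera model.

    pixel_diameter : apple diameter in the image, in pixels.
    distance_mm    : camera-to-fruit distance in millimetres (e.g., about
                     half of the 3.5 m row spacing mentioned above).
    focal_px       : focal length expressed in pixels.
    """
    return pixel_diameter * distance_mm / focal_px

# Example: a 160 px detection at ~1.75 m with a 3500 px focal length.
print(fruit_diameter_mm(160, 1750, 3500))  # -> 80.0 (mm)
```

Combined with an allometric diameter-to-weight relation, such estimates could feed the production forecasts discussed above; calibrating distance and focal length per acquisition setup would of course be required.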
Interestingly, the use of photogrammetric techniques with DL would make it possible to locate the geospatial position of the visible fruits [29], as well as to model their surface. This would enable the individual assessment of fruit size, allowing a more realistic fruit count. It would also eliminate the possibility of counting a single fruit twice in sequential overlapping images, which may also occur in images of the backside of the planting row, often computed twice [69]. The technique used in this study is encouraged to be analyzed with other fruit varieties that are commercially important in the southern highlands of Brazil, such as pears, grapes, plums, and feijoa [70,71]. Finally, our dataset is available to the general community; the details for accessing it are described in Table A1. It will allow further analysis of the robustness of the reported approaches to counting objects and comparison with forthcoming up-to-date methods.
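Once detections from overlapping images are projected into a common frame, double counting can be suppressed with a simple proximity rule. This is a greedy sketch under the assumption that registration has already been done; the radius is an illustrative parameter:

```python
import numpy as np

def deduplicate(points, radius=40.0):
    """Greedy removal of duplicate fruit detections.

    points : (N, 2) detection centers already projected into a common
             coordinate frame (e.g., via photogrammetric registration).
    Any point closer than `radius` pixels to an already kept point is
    treated as a repeated detection of the same fruit.
    """
    kept = []
    for p in np.asarray(points, dtype=float):
        if all(np.hypot(*(p - q)) >= radius for q in kept):
            kept.append(p)
    return np.array(kept)

pts = [(100, 100), (110, 105), (300, 300)]   # first two: the same fruit
print(len(deduplicate(pts)))  # -> 2
```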

Conclusions
In this study, we proposed an approach based on the ATSS DL method to detect apple fruits using a center points labeling strategy. The method was compared to other state-of-the-art deep learning-based networks, including RetinaNet, Libra R-CNN, Cascade R-CNN, Faster R-CNN, FSAF, and HRNet.
We also evaluated different occlusion conditions and noise corruptions in different image sets. The ATSS-based approach outperformed all of the other methods, achieving a maximum average precision of 0.946 with a bounding box size of 160 × 160 pixels. When re-creating adverse conditions at the time of data acquisition, the approach provided a robust response to most corrupted data, except for the snow, frost, and fog weather conditions.
Further studies are suggested with other fruit varieties in which color plays an important role in differentiating them from leaves. Additional fruit attributes such as shape, weight, and color are also important information for determining the market price and are also recommended for future investigations.
To conclude, a relevant contribution of our study is making our dataset publicly accessible. This may help others evaluate the robustness of their approaches to counting objects, specifically fruits in highly dense conditions.

Data Availability Statement:
The data presented in this study are openly available. The links for downloading the data are provided in Table A1.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: