Location of Fruits by Counting: A Point-to-Point Approach

Abstract: The emergence of deep learning-based methods for harvesting and yield estimation, including methods based on object detection or image segmentation, has notably improved performance but has also introduced large annotation workloads. Considering the difficulty of such annotation, this study develops a method for locating fruit that uses only center-point labeling information. To handle point labels, the weighted Hausdorff distance is chosen as the loss function of the corresponding network, while deep layer aggregation (DLA) is used to contend with the variability in the visible area of the fruit. The detection and positioning performance of our method is not inferior to that of a Mask-RCNN-based method. Experiments on a public apple dataset further demonstrate the performance of the proposed method. Specifically, no more than two targets per field of view had positioning deviations exceeding five pixels.


Introduction
With the development of smart agriculture, computer vision technology has begun to play an important role in the perception of crops and their growth environments. The perception of fruit is the first step in yield estimation and picking, two classic applications of agricultural automation, and involves both the detection and positioning of fruits.
Research on fruit detection first began as an aid to yield estimation. Pioneering works on yield estimation covered grapes [1,2], apples [3,4], tomatoes [5,6] and cotton [7]. Hand-selected visual features, such as color, size, texture or shape, were used to accomplish the detection task. Vulnerable to the influence of the environment and lighting, the detection recall rates in these papers ranged from 63.7% to 80%, meaning that many fruits were missed. With the great success of deep learning in the field of computer vision, yield-estimation performance has been raised to around 90% through feature extraction [8], object detection [9] and image segmentation [10,11]. As another application, obtaining the position of fruits is useful for automatic harvesting, which requires judging stress points. Meanwhile, a geometric model of the fruit is often taken as a priori knowledge. Hence, accurately positioning the central point of the fruit plays a crucial role in successful harvesting. In past works, when the accuracy of perception technology on RGB images was inadequate, researchers often met positioning requirements by adding new sources of information or by reducing the complexity of the background [12][13][14][15][16][17][18][19][20].
However, due to its low cost, research on monocular vision-based localization continues. Following the development trend of computer vision, image segmentation based on deep convolutional neural networks is advantageous for locating fruit. Semantic segmentation is first used to distinguish the different tissues of plants, after which a fruit template is used to search the result for morphologically similar targets [21]; instance segmentation has also been used to further improve positioning accuracy [22][23][24]. Note that deep learning methods improve accuracy but also entail a large annotation workload [25]. The short harvesting season does not allow a sufficient number of annotations for fine-tuning network parameters [26]. Hence, we attempt to train the network in a way that allows easy labeling: empirically marking only the centers of fruits, which we call the point-to-point approach; similarly, [25] uses point information to produce bounding boxes as ground truth for a detection algorithm. Point-based approaches emerged from the demand for counting objects of large density, such as crowd counting. The "detecting then counting" approaches [27][28][29] are gradually being replaced by density-map estimation approaches [30][31][32][33], because bounding-box or instance-mask annotations are much more labor-intensive in dense crowds. However, in addition to yield estimation, precision agriculture also requires the position of fruit for harvesting. Therefore, a method of "estimating then positioning" is adopted, and the average Hausdorff distance is used to achieve point-to-point positioning.
In this paper, we propose a deep learning-based fruit location method with annotation only for estimating fruit centers. To distinguish the fruit from the background, a multi-task neural network is designed to identify the presence of fruit centers with higher activation in the output layer and to output the estimated number of fruits simultaneously. Our contribution is reducing the annotation workload for fruit detection and positioning. It includes the identification of an appropriate loss function for the tasks described in this paper and the use of a special backbone for a multi-task network according to the specific visible appearance characteristics of fruit. Thus, a multi-task deep neural network with branched outputs is developed to accurately locate the instance of each fruit with an extremely limited annotation workload for various complex environments, without limitations regarding illumination, occlusion, number of fruits, etc. Moreover, an end-to-end inference on fruit detection can ensure a real-time yield estimation with GPU acceleration.
The remainder of the paper is organized as follows: first, in Section 2, to verify the effectiveness of our method, the data preparation process for a public dataset is introduced, and the entire method is detailed. Then, for the tasks of detection and positioning, we compare the performance of our method to that of Mask-RCNN in Section 3, in which the features and limitations of our method are also discussed. Finally, conclusions and aspirations for future work are given in Section 4.

Image Dataset
For convenience in verifying the effectiveness of our method, a public dataset [24] was chosen for the tasks of fruit detection and positioning. This dataset contains raw apple images with dimensions of 1024 × 1024 under different natural lighting conditions. All images have corresponding manual annotations in the form of boxes. In this paper, we only use the central point of each fruit target. As shown in Figure 1, the ground truth is the center point of each annotated box in the database, in the form (x_i, y_i), i = 1, …, C, where x_i and y_i are the coordinates of center point i and C is the number of apples in the image.
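The annotation record described above can be read into (x_i, y_i) pairs with a few lines of code. The exact on-disk layout is not given in the text, so the sketch below assumes a hypothetical comma-separated record whose header fields follow the order stated in the Figure 1 caption: record number, file name, fruit count C, then the C center-point coordinate pairs.

```python
# Sketch of parsing one annotation record into fruit center points.
# Assumed (hypothetical) layout: id, filename, count C, x1, y1, ..., xC, yC.

def parse_record(line):
    """Return (filename, [(x_1, y_1), ..., (x_C, y_C)]) from one record."""
    fields = line.strip().split(",")
    filename = fields[1]
    count = int(fields[2])
    coords = [float(v) for v in fields[3:3 + 2 * count]]
    centers = list(zip(coords[0::2], coords[1::2]))
    assert len(centers) == count, "count field disagrees with coordinates"
    return filename, centers
```

For example, `parse_record("0,apple_001.png,2,512.0,300.5,100.0,760.0")` yields the file name and two center points.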

Method Framework
In this paper, a fruit detection algorithm is proposed to solve the location problem utilizing the concept of segmentation in computer vision. A brief flowchart of the proposed algorithm is shown in Figure 2. The main procedure can be divided into two phases, i.e., network inference and pixel clustering. In the first phase, a multi-task deep neural network (DNN) is trained to export the binary segmentation graph and estimated number of fruits. As a classical backbone for semantic segmentation, the deep layer aggregation segmentation network (DLASeg) is taken as the core element of the multi-task DNN.

Figure 2. Flowchart of the two-phase positioning approach. In the first phase, network inference is realized on the apples to produce a binary segmentation graph and estimate the number of apples.
In the second phase, based on the binary segmentation graph, the apple pixels are modeled by a Gaussian mixture model to obtain their center points.
Figure 1. Example of the annotation used in this paper. Top: the stars represent the marked center points of the fruit. Bottom: the record for this annotation; its header comprises the record number, the file name of the raw RGB image, the fruit count in the image and the center-point coordinates of the fruits.

After feature extraction, the saliency map branch consists of five 2× upsampling operations that output a 1024 × 1024 probability map, while the other branch, used to estimate the number of objects in the image, draws on the features at the deepest level and on the estimated probability map. Two sets of features, namely the 32 × 32 × 512 feature vector and the 1024 × 1024 probability map, are transformed into two 64-dimensional feature vectors, respectively. The vectors are then grouped into a hidden layer together to output a single number. After that, ReLU is applied to ensure that the output is positive, and the result is rounded to the closest integer as the final estimate of the number of fruits, named Ĉ, which can be used for yield estimation.
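The structure of the counting branch can be sketched as below. This is a schematic NumPy forward pass, not the trained network: the weight matrices are random placeholders standing in for learned parameters, and the map is shrunk to 64 × 64 (the real probability map is 1024 × 1024) to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and random placeholder weights; in the real
# network these parameters are learned during training.
FEAT_DIM, MAP_SIDE = 512, 64

W_feat = rng.normal(0.0, 0.01, (FEAT_DIM, 64))            # deepest features -> 64-d
W_map = rng.normal(0.0, 0.01, (MAP_SIDE * MAP_SIDE, 64))  # prob map -> 64-d
W_hid = rng.normal(0.0, 0.01, (128, 1))                   # joint hidden -> scalar

def estimate_count(deep_feat, prob_map):
    """Sketch of the counting branch: project each input to a 64-d vector,
    group them in a hidden layer, apply ReLU and round to the nearest
    integer to obtain the fruit-count estimate C_hat."""
    v1 = deep_feat @ W_feat            # summary of deepest-level features
    v2 = prob_map.ravel() @ W_map      # summary of the saliency map
    joint = np.concatenate([v1, v2])   # grouped into one hidden layer
    raw = float(joint @ W_hid)         # single number
    raw = max(raw, 0.0)                # ReLU keeps the count non-negative
    return int(round(raw))             # nearest integer = C_hat
```

With trained weights, the rounded output serves directly as the yield estimate for the image.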
After inference, the saliency map branch outputs P = {p_i}, p_i ∈ [0, 1], i ∈ I, where p_i is the confidence that a fruit exists at pixel coordinate i of image I. For fruit positioning, we threshold the whole map P to obtain the fruit pixels F = {i ∈ I | p_i ≥ τ}, where τ is set to 0.9 in this paper. The saliency map is thereby converted into a binary segmentation graph separating fruits from background. Finally, a Gaussian mixture model is fitted to the points in F by the expectation-maximization (EM, [34]) algorithm. The centers of the Gaussian distributions are the estimated centers of the fruits for picking.
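The second phase can be sketched end to end as follows. This is a minimal self-contained illustration, assuming isotropic Gaussian components and a simple farthest-point initialization; in practice a library GMM (e.g. scikit-learn's) with Ĉ components serves the same purpose.

```python
import numpy as np

def fruit_centers(prob_map, n_fruits, tau=0.9, iters=50):
    """Threshold the saliency map at tau, then fit an isotropic Gaussian
    mixture to the fruit pixels with EM; the fitted means are the
    estimated fruit centers."""
    pts = np.argwhere(prob_map >= tau).astype(float)   # F = {i | p_i >= tau}
    # Farthest-point initialization keeps the starting means spread out.
    mu = [pts[0]]
    for _ in range(1, n_fruits):
        d2 = ((pts[:, None, :] - np.array(mu)[None, :, :]) ** 2).sum(-1)
        mu.append(pts[d2.min(axis=1).argmax()])
    mu = np.array(mu)
    var = np.full(n_fruits, 25.0)                      # isotropic variances
    pi = np.full(n_fruits, 1.0 / n_fruits)             # mixture weights
    for _ in range(iters):
        # E-step: responsibility of each component for each fruit pixel.
        d2 = ((pts[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        log_r = np.log(pi) - d2 / (2 * var) - np.log(var)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances.
        nk = r.sum(axis=0)
        pi = nk / len(pts)
        mu = (r[:, :, None] * pts[:, None, :]).sum(axis=0) / nk[:, None]
        d2 = ((pts[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (2 * nk) + 1e-6
    return mu                                          # estimated centers
```

On a toy map with two bright square blobs, the fitted means land on the blob centers, mirroring how the hot spots in the real heatmap yield picking coordinates.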

Network Backbone
Recent fruit detection approaches have made use of a variety of backbone architectures, the most popular being the VGGNet [35] and ResNet [36] families. Although these architectures have proven benefits across a variety of tasks, we believe that more recent developments in the field can be leveraged to achieve more successful fruit detection. To this end, we make use of the DLA-34 backbone presented in [37]. The deep layer aggregation (DLA) family of models employs two forms of aggregation, iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA), to form an architecture that extends densely connected networks [38] and feature pyramid networks with hierarchical and iterative skip connections that deepen the representation and refine the resolution. In detail, IDA focuses on fusing resolutions and scales, while HDA focuses on merging features from all modules and channels. In a real environment, the visible area of a fruit is affected by its distance from the camera, its own growth and occlusion, resulting in considerable shape changes. With the above architecture, the DLA family of models unifies semantic and spatial fusion for better localization and semantic interpretation. We believe these are desirable properties for the tasks of fruit detection and segmentation.
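The IDA idea of iteratively fusing resolutions can be illustrated schematically. The sketch below is not DLA-34 itself: the real aggregation nodes are learned convolutions with batch normalization, whereas here the node is reduced to an average followed by ReLU, and nearest-neighbour upsampling stands in for the learned upsampling.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def merge(a, b):
    """Aggregation node. In DLA proper this is a learned convolution with
    batch norm; here it is reduced to an average followed by ReLU."""
    return np.maximum((a + b) / 2.0, 0.0)

def iterative_deep_aggregation(pyramid):
    """Fuse a feature pyramid from the deepest (smallest) map to the
    shallowest (largest), refining resolution step by step as IDA does."""
    out = pyramid[0]                 # deepest, lowest-resolution stage
    for skip in pyramid[1:]:         # progressively higher-resolution stages
        out = merge(upsample2x(out), skip)
    return out
```

Each fusion step injects higher-resolution detail into the deep semantic features, which is why the aggregated output localizes fruit of widely varying visible size better than a single deepest feature map.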

Loss Function for Network
In this paper, the loss function for the fruit position network is based on the average Hausdorff distance D_AH. Taking an image of a harvest environment as an example, the estimated fruit center points are denoted P_C, while the set of centers in the ground truth is denoted F_C. The commonly used Euclidean distance measures the distance between the two categories of pixels and is written ED(p, f), where p ∈ P_C and f ∈ F_C. Hence, D_AH can be formulated as follows:

D_AH(P_C, F_C) = (1/|P_C|) Σ_{p∈P_C} min_{f∈F_C} ED(p, f) + (1/|F_C|) Σ_{f∈F_C} min_{p∈P_C} ED(p, f) (1)

where |P_C| and |F_C| are the numbers of points in P_C and F_C, respectively. Since the calculation requires both P_C and F_C to be non-empty, images without fruit targets cannot be used for training. Meanwhile, the heatmap P output by the network backbone does not directly give accurate fruit location coordinates. Thus, Equation (1) is modified into the weighted Hausdorff distance [39] as follows:

D_WH(I, F_C) = (1/(S + ε)) Σ_{i∈I} p_i min_{f∈F_C} ED(i, f) + (1/|F_C|) Σ_{f∈F_C} M_α^{i∈I} [p_i ED(i, f) + (1 − p_i) d_max] (2)

where i is a pixel coordinate of image I, d_max is the maximum diagonal pixel distance of each image and p_i ∈ [0, 1] is the single-valued output of the network at pixel coordinate i. Moreover,

S = Σ_{i∈I} p_i (3)

is the sum of the probabilities that the pixels in the figure belong to fruit, ε is set to 10⁻⁶ to avoid division by zero, and M_α denotes the generalized mean:

M_α^{i∈I} [g(i)] = ((1/|I|) Σ_{i∈I} g(i)^α)^{1/α} (4)

where α = −1 and |I| is the number of pixels in the heatmap. Meanwhile, we designed the loss function for fruit counting with a smooth L1 loss Ł_reg for regression of the fruit count. Hence, the total loss function of the network Ł(I, F) is defined as follows:

Ł(I, F) = D_WH(I, F_C) + Ł_reg(C − Ĉ(I)) (5)

where C is the actual number of fruits and Ĉ(I) is the estimated number of fruits.
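Equation (2) can be written out directly in NumPy. The sketch below is an illustrative forward computation of the weighted Hausdorff distance (not the differentiable training implementation, which operates on tensors); it follows the two terms of Equation (2), with the generalized mean of Equation (4) at α = −1 and a small ε guarding both divisions.

```python
import numpy as np

def weighted_hausdorff(prob_map, gt_points, eps=1e-6, alpha=-1.0):
    """Weighted Hausdorff distance D_WH between a probability map and
    ground-truth center points, per Equation (2)."""
    H, W = prob_map.shape
    d_max = np.hypot(H - 1, W - 1)          # maximum diagonal pixel distance
    pix = np.indices((H, W)).reshape(2, -1).T.astype(float)  # all coords i
    p = prob_map.ravel()
    gt = np.asarray(gt_points, float)
    # ED(i, f) for every pixel i and every ground-truth point f.
    d = np.sqrt(((pix[:, None, :] - gt[None, :, :]) ** 2).sum(-1))
    s = p.sum()                             # S = sum of probabilities, Eq. (3)
    term1 = (p * d.min(axis=1)).sum() / (s + eps)
    # Generalized mean (Eq. (4), alpha = -1) over pixels, per GT point.
    g = p[:, None] * d + (1.0 - p[:, None]) * d_max
    m_alpha = (np.mean((g + eps) ** alpha, axis=0)) ** (1.0 / alpha)
    term2 = m_alpha.mean()
    return term1 + term2
```

When the probability mass sits exactly on a ground-truth center, both terms are near zero; mass far from every center is penalized by both terms, which is what drives the heatmap toward sharp peaks at fruit centers.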

Performance Measures

Software and Experimental Settings
In this paper, the whole approach is implemented in the Python programming language with support from OpenCV, an open-source library for computer vision. In addition to PyTorch, the training and inference of our deep CNN in the multi-task framework require GPU acceleration; the other algorithms in our approach run on a CPU. The GPU and CPU of our platform are a Tesla P40 and a Core i5-8400, respectively. ImageNet-pretrained weights are used to initialize the DLA-34 backbone, and the inputs are normalized accordingly. The batch size is set to 32. Meanwhile, an Adam optimizer with a learning rate of 10⁻⁴ and a momentum of 0.9 was chosen for training. The training images were augmented via a horizontal flipping operation, as performed in [24].
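One detail of the horizontal-flip augmentation worth making explicit is that the point annotations must be mirrored together with the pixels. A minimal sketch (the function name and exact coordinate convention are illustrative, assuming (x, y) with x measured from the left edge):

```python
import numpy as np

def hflip_with_points(image, centers):
    """Horizontally flip an image and mirror its center-point annotations;
    centers are (x, y) pixel coordinates, x measured from the left edge."""
    flipped = image[:, ::-1]                    # reverse the columns
    w = image.shape[1]
    mirrored = [(w - 1 - x, y) for (x, y) in centers]
    return flipped, mirrored
```

For a 1024-wide image, a center at x = 100 maps to x = 923 after flipping, so the flipped image and its labels stay consistent.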

Evaluation Metrics
To demonstrate the effectiveness of our approach, we evaluated its performance in fruit detection. Following [24], three indexes are considered in the evaluation:

Precision = TP / D, Recall = TP / T, F1 = 2 · Precision · Recall / (Precision + Recall)

where T is the total number of fruits in the dataset, D is the number of detected fruits and TP is the number of true positives (a true positive is counted if an estimated location is at most a distance r from a ground-truth point). We also report the mean absolute error (MAE), root mean squared error (RMSE) and mean absolute percentage error (MAPE) to evaluate the counting accuracy:

MAE = (1/N) Σ_{i=1}^{N} |e_i|, RMSE = √((1/N) Σ_{i=1}^{N} e_i²), MAPE = (100/N) Σ_{i=1}^{N} |e_i / C_i| %

where e_i = Ĉ_i − C_i, N is the number of images, C_i is the true object count in the i-th image and Ĉ_i is our estimate.
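The metrics above reduce to a few lines of code; the sketch below implements both groups directly from their definitions (function names are ours, for illustration).

```python
import math

def precision_recall_f1(tp, detected, total):
    """Detection indexes from the counts defined above: precision = TP / D,
    recall = TP / T, and F1 is their harmonic mean."""
    precision = tp / detected
    recall = tp / total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def count_errors(est_counts, true_counts):
    """MAE, RMSE and MAPE over per-image count errors e_i = C_hat_i - C_i."""
    errs = [e - c for e, c in zip(est_counts, true_counts)]
    n = len(errs)
    mae = sum(abs(e) for e in errs) / n
    rmse = math.sqrt(sum(e * e for e in errs) / n)
    mape = 100.0 * sum(abs(e) / c for e, c in zip(errs, true_counts)) / n
    return mae, rmse, mape
```

For instance, 8 true positives out of 10 detections against 16 ground-truth fruits gives a precision of 0.8 and a recall of 0.5.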

Experimental Results on Detection
In this section, using the apple dataset mentioned in Section 2, we present the detection results of our method and compare them with those of Mask-RCNN, described in [24]. As shown in Table 1, compared with Mask-RCNN, our method is less susceptible to environmental factors (such as illumination and color) and to misjudging the background as fruit, so it has an obvious advantage in precision. Our method, however, is easily confused by overlaps between fruits (especially between large and small apples). In addition, small fragments in the segmentation results were easily missed as targets. Therefore, our method performs similarly to Mask-RCNN in terms of recall, as shown in Table 1; in general, however, our method achieved a better F1-score. To visualize the advantages and disadvantages of the two methods more intuitively, we show the detection results for some images in the test set. As shown in subsets (a), (e) and (f) of Figure 3, Mask-RCNN is easily affected by occlusion and splits one apple into multiple targets. In contrast, our algorithm easily merges overlapping fruits into one target. In yield estimation, the former readily leads to overestimation, while the latter leads to underestimation.

Experimental Results on Position
Within different allowable center-point deviation distances, the number of apples located was counted and compared with the real number of apples. The statistical results are shown in Table 2. Using our method, we found that a large portion of the detected positions deviated from the center of the fruit by more than one pixel. However, if the allowable positioning error was relaxed to five pixels, there were on average only approximately one to two false detections per image. Relaxing the allowable error further to 10 pixels adds little, because the remaining errors are mainly caused by misdetection. Additionally, because most of the false positives from Mask-RCNN are caused by multiple detections or misjudgments of the background, most fruits are detected; furthermore, Mask-RCNN is not easily influenced by the overlap between fruits and thus performs better than our method in this respect. Hence, our algorithm only performs worse than Mask-RCNN for an allowable error boundary of one pixel; the two methods perform similarly for allowable error boundaries of 5 or 10 pixels.

Discussion on the Features of Our Method
• Apples on the ground: Apples on the ground should not be included in the final yield or regarded as the picking target because they are not fresh. Hence, we provide examples in Figure 4 to show that our method can distinguish fruit on the branches from fruit on the ground. In the training set, apples on the ground were not considered annotation targets. Thus, our method clearly learns this better than Mask-RCNN.

• Learning process: In Figure 5, the heatmap gradually changes as the learning process advances. In the process of focusing on fruit areas, small and broken fruit areas (green boxes) are ignored along with falsely detected background (red boxes).
• Impact of the augmentation operation: The horizontal flipping operation improved the performance of our method from 86.97% to 87.43% in terms of precision and from 84.88% to 85.32% in terms of recall.

Discussion of Limitations
Our method cannot distinguish apples with overlapping areas well; overlaps among several apples lead to one continuous hot spot in the heatmap. While this merging reduces the difficulty of segmentation, it causes a large offset in the center-point positioning. In particular, if one apple occupies a dominant part of the fruit area, the other apples are easily ignored. To reduce the impact of these limitations, we tended to choose models with lower loss, so that the hot spot corresponding to the central point shrinks and continuous areas are broken up. The disadvantage of this choice, however, is that small fruit areas may be ignored, as shown in Figure 5d. In the corresponding figure, false positives due to misannotated apples (green rectangles), false positives due to multi-detection (blue rectangles) and false negatives (yellow rectangles) are also shown.


Conclusions
In this paper, we presented a point-to-point approach for locating fruit. The framework consists of two module selections tailored to fruit location scenarios: (i) to handle considerable fruit shape changes, DLA-34 was chosen as the backbone, with IDA handling changes at the coarse level and HDA tackling changes at the fine level; (ii) a modified loss function based on the average Hausdorff distance was chosen to both predict the number of fruits and estimate fruit locations. The proposed approach reduces the annotation workload to a single click on the center point of each fruit. Furthermore, the detection and positioning performance of our approach was similar to that of Mask-RCNN on a public dataset. This approach meets the requirements for yield estimation and harvesting, but in the future, an end-to-end central-point detection network will be the focus of our research to avoid the time-consuming EM process. Meanwhile, an appropriate combination of augmentation strategies [40] will be adopted to further improve the performance of our approach.