In-Field Automatic Identification of Pomegranates Using a Farmer Robot

Ground vehicles equipped with vision-based perception systems can provide a rich source of information for precision agriculture tasks in orchards, including fruit detection and counting, phenotyping, plant growth and health monitoring. This paper presents a semi-supervised deep learning framework for automatic pomegranate detection using a farmer robot equipped with a consumer-grade camera. In contrast to standard deep-learning methods that require time-consuming and labor-intensive image labeling, the proposed system relies on a novel multi-stage transfer learning approach, whereby a pre-trained network is fine-tuned for the target task using images of fruits in controlled conditions, and then it is progressively extended to more complex scenarios towards accurate and efficient segmentation of field images. Results of experimental tests, performed in a commercial pomegranate orchard in southern Italy, are presented using the DeepLabv3+ (Resnet18) architecture, and they are compared with those that were obtained based on conventional manual image annotation. The proposed framework allows for accurate segmentation results, achieving an F1-score of 86.42% and IoU of 97.94%, while relieving the burden of manual labeling.


Introduction
Accurate and efficient in-field data gathering is a major requirement to increase crop monitoring and management efficiency, and improving the sustainability of agricultural processes. While manual survey and sampling by experts are currently adopted to get information on crop growth and health, this process is labour intensive, expensive and often destructive. Sampling is typically done on a limited number of plants, and measures can consequently be affected by sparsity as well as by human bias, resulting in inaccurate estimations.
Unmanned ground vehicles (UGVs) equipped with visual sensors have been recently proposed as a valuable technology to automate data acquisition over large farms, while simultaneously increasing accuracy on a narrow scale as well as reducing execution time and costs [1]. Related to this is the development of robust image segmentation techniques in order to extract meaningful information from collected data. For images captured at an orchard, this entails automatically labeling each pixel, or groups of pixels as representing fruits, trunks, branches or foliage. The parsed information can then be used in higher-level tasks such as fruit counting, plant phenotyping or crop health and growth monitoring, thus providing a rich source of information for the farmers. It can also enable Recently, [16] proposed a modified Faster R-CNN (FR-CNN) for fruit detection that was tested on five types of fruits including pomegranates.
To the best of our knowledge, this is the first study that demonstrates a ground robotic system and deep learning techniques for in-field data gathering and processing in pomegranate orchards. Notable examples of agriculture robots for fields operations in different cultivations can be found, for example, in [17][18][19].
Image acquisition is performed using a consumer-grade camera, namely an Intel RealSense D435 camera (Santa Clara, CA, USA) that is mounted onboard a tracked farmer robot that traverses the orchard. A deep learning segmentation framework is then applied for the separation of fruits from non-fruit regions using multi-stage transfer learning, whereby a pre-trained network is initially tuned on pomegranate images acquired under controlled conditions; this discrimination ability is progressively improved to segment field images with increasingly complex scenarios. Specifically, images of fruits arranged on a flat surface with a neutral background are first acquired both under controlled and natural lighting conditions. Images acquired under uniform lighting are automatically labeled, based on color thresholds in RGB space followed by morphological operations, and then used for transfer learning from a pre-trained architecture. The network is successively applied to produce accurate labels for the images acquired under natural lighting disturbances, for which standard filtering techniques do not prove effective, mainly due to the presence of shadows. These labels are, in turn, included to retrain the network, which is finally applied to segment field images. In order to further enhance the network's performance, two strategies are also proposed to minimize false negatives and false positives. In the first case, the network is retrained after adding negative examples (i.e., images of non-fruit regions); in the second case, field images that include both fruit and non-fruit regions are included. The DeepLabv3+ [20] pre-trained architecture has been chosen for transfer learning, since it has been demonstrated to be effective for the semantic segmentation of natural images [21]. However, the proposed approach is independent of the type of pre-trained architecture; therefore, other architectures may be alternatively adopted without changing the overall framework.
Experiments performed in a commercial pomegranate orchard in Apulia, southern Italy, are presented. It is shown that the proposed approach allows for accurate segmentation results, with an F1-score of 86.42% and IoU of 97.94%, leading to classification performance that is comparable to those obtained using the same network trained by conventional manual image annotation, while relieving the burden of the time-consuming manual labeling process. When compared to other deep learning approaches using manual data set annotation (e.g., [2,16]), the proposed system shows comparable results.
It is worth noting that, although the proposed system in this study is specifically tested in pomegranate cultivations, it can be reasonably extended to other types of fruit crops for which pre-existing databases are available [22], since it requires only a limited number of labeled frames acquired from the field for refinement. It should be finally noted that the proposed segmentation framework has been proven to be effective for color images acquired by a consumer-grade RGB-D sensor. Therefore, 2D fruit detection could be used as a preliminary step before extracting additional fruit morphology information based on associated 3D data.
The research is presented in the paper as follows: the acquisition system and the image segmentation framework are detailed in Section 2; experimental results are discussed in Section 3; conclusions are drawn in Section 4.

Materials and Methods
Accurate and robust image segmentation to separate fruit from non-fruit regions is fundamental to successful fruit detection and yield estimation. This paper proposes a semi-supervised deep learning approach to automatically segment pomegranate fruits in natural images, beginning with fruit images that are acquired under controlled environment conditions. Field images are captured by consumer-grade hardware from a farm robotic vehicle, while processing is performed off-line at the end of the acquisitions.
In the rest of this section, the data acquisition platform and the experimental environment are initially described; then, the proposed segmentation method is detailed.

Testing Platform and Data Sets
The testing platform Polibot ( Figure 1) is a research ground vehicle that was completely custom-built at the Politecnico of Bari, with the aim of achieving high mobility over challenging terrain. It features an articulated suspension system that is controlled in a purely passive manner, but can fulfill high-load capacity, vibration isolation and trafficability over rough terrain, similarly to a multi-leg insect. Polibot has a footprint of 1.5 × 1 m and weighs about 70 kg, ensuring a payload of up to 40 kg. The control and acquisition systems have been implemented under ROS (Robot Operating System).
natural images, beginning with fruit images that are acquired under controlled ment conditions. Field images are captured by consumer-grade hardware from robotic vehicle, while processing is performed off-line at the end of the acquisitio In the rest of this section, the data acquisition platform and the experimen ronment are initially described; then, the proposed segmentation method is deta

Testing Platform and Data Sets
The testing platform Polibot ( Figure 1) is a research ground vehicle that w pletely custom-built at the Politecnico of Bari, with the aim of achieving high over challenging terrain. It features an articulated suspension system that is cont a purely passive manner, but can fulfill high-load capacity, vibration isolation an cability over rough terrain, similarly to a multi-leg insect. Polibot has a footprint o m and weighs about 70 kg, ensuring a payload of up to 40 kg. The control and acq systems have been implemented under ROS (Robot Operating System). The Polibot's standard sensor suite includes sensors to measure the electrical drawn by the two drive motors, in addition to an MTI 680 GNSS/INS that suppo cm-accuracy. The aluminum frame attached to the top plate allows the robot to b ted with various dedicated sensors such as laser range finders and monitoring c among which an Intel RealSense D435 imaging system is used for crop visual da ering in this study. It consists of a consumer-grade active InfraRed (IR) stereo sens €), including a left-right IR stereo pair that provides depth information and a color The color camera is a FullHD, Bayer-patterned, rolling shutter CMOS imager. Its field of view is 69 (H) × 42 (V) deg, and runs at 30 Hz at FullHD. The stream acq the color camera is spatially-calibrated and time-synchronized with the stereo pai color-aligned depth images are also available for 3D crop characterization.
Data sets were acquired at a commercial pomegranate orchard in Apulia, s Italy, just before harvesting (October 2021). In order to collect the data sets, the was tele-operated along both sides of one row of 15 trees that were located at an distance of about 3 m from each other. The camera was mounted about 1 m a The Polibot's standard sensor suite includes sensors to measure the electrical currents drawn by the two drive motors, in addition to an MTI 680 GNSS/INS that supports RTK cm-accuracy. The aluminum frame attached to the top plate allows the robot to be outfitted with various dedicated sensors such as laser range finders and monitoring cameras, among which an Intel RealSense D435 imaging system is used for crop visual data gathering in this study. It consists of a consumer-grade active InfraRed (IR) stereo sensor (~250 €), including a left-right IR stereo pair that provides depth information and a color camera. The color camera is a FullHD, Bayer-patterned, rolling shutter CMOS imager. Its nominal field of view is 69 (H) × 42 (V) deg, and runs at 30 Hz at FullHD. The stream acquired by the color camera is spatially-calibrated and time-synchronized with the stereo pair, so that color-aligned depth images are also available for 3D crop characterization.
Data sets were acquired at a commercial pomegranate orchard in Apulia, southern Italy, just before harvesting (October 2021). In order to collect the data sets, the vehicle was tele-operated along both sides of one row of 15 trees that were located at an average distance of about 3 m from each other. The camera was mounted about 1 m above the ground, and acquired lateral views of the tree rows from a distance of about 1.5 m. Acquisitions were performed at an image resolution of 1280 × 720 and frame rate of 6 Hz, accounting for a total of 1270 frames. Considering that the farmer robot drives at an average speed of about 0.5 m/s, a frame rate of 6 Hz was found to be a good trade-off between computational burden and pomegranate shrub coverage.
Laboratory acquisitions of about 150 different pomegranates were performed with the same camera as the one onboard the robot, using the following two different setups: the first one, referred to as SET1, under uniform lighting conditions; the second one, referred to as SET2, under intense and non-uniform sun exposure. In both cases, fruits were arranged on a flat surface of neutral color, and spread well apart from each other. The reason for this split in the acquisitions is to insert an intermediate stage between the algorithm developed for indoor image labeling and the network that will be used for the segmentation of acquisitions in the field, as will be shown in the next section. Sample images from the different data sets are reported in Figure 2. Acquisitions were performed at an image resolution of 1280 × 720 and frame rate of 6 Hz, accounting for a total of 1270 frames. Considering that the farmer robot drives at an average speed of about 0.5 m/s, a frame rate of 6 Hz was found to be a good trade-off between computational burden and pomegranate shrub coverage. Laboratory acquisitions of about 150 different pomegranates were performed with the same camera as the one onboard the robot, using the following two different setups: the first one, referred to as SET1, under uniform lighting conditions; the second one, referred to as SET2, under intense and non-uniform sun exposure. In both cases, fruits were arranged on a flat surface of neutral color, and spread well apart from each other. The reason for this split in the acquisitions is to insert an intermediate stage between the algorithm developed for indoor image labeling and the network that will be used for the segmentation of acquisitions in the field, as will be shown in the next section. Sample images from the different data sets are reported in Figure 2.

Fruit Images-SET1
Fruit Images-SET2 Field Images

Image Segmentation
In the context of this study, image segmentation refers to separating fruits from nonfruit regions as a preliminary step before agricultural tasks such as fruit grading, counting or harvesting. To this end, a multi-stage transfer learning approach is developed, whereby a pre-trained network is initially tuned using fruit images that were acquired under controlled acquisitions, and then progressively extendedto more complex scenarios. The main advantage of the proposed approach is in automating the image annotation process required for network training, hence relieving the burden of having to use conventional manual labeling.
The overall image processing pipeline is shown in Figure 3. Firstly, fruit images acquired on a neutral background under controlled lighting conditions (SET1) are automatically labeled based on color thresholding, followed by morphological operations, and

Image Segmentation
In the context of this study, image segmentation refers to separating fruits from non-fruit regions as a preliminary step before agricultural tasks such as fruit grading, counting or harvesting. To this end, a multi-stage transfer learning approach is developed, whereby a pre-trained network is initially tuned using fruit images that were acquired under controlled acquisitions, and then progressively extendedto more complex scenarios. The main advantage of the proposed approach is in automating the image annotation process required for network training, hence relieving the burden of having to use conventional manual labeling.
The overall image processing pipeline is shown in Figure 3. Firstly, fruit images acquired on a neutral background under controlled lighting conditions (SET1) are auto- matically labeled based on color thresholding, followed by morphological operations, and then fed to a pre-trained deep learning architecture (Stage 1). This process builds a new network that is referred to as First Controlled Environment (CE) net. The First CE net is, in turn, used to label additional fruit images that are still acquired from a neutral background but under natural lighting conditions (SET2) that could not be straightforwardly labeled based on color thresholds, mainly due to the presence of shadows. Labeled images from SET2 are then used, along with labeled images from SET1, to retrain the network (Stage 2) and deal with non-uniform lighting conditions, leading to the so-called Final CE net.
Finally, field images that were acquired by the robot are added to the training set for a final training stage (Stage 3). Two different kinds of field images are considered: images containing only non-fruit regions (i.e., leaves, branches and background), and images containing both fruit and non-fruit regions. In the first case, image masks are obtained directly by setting to 0 all the image pixels. For the labeling of field images, instead, the Final CE net is employed, followed by the use of morphological operations instead of classical manual annotation. This leads to two nets, referred respectively to as True Negative Augmented (TNA) net and Field Image Augmented (FIA) net. As will be shown in the experimental results section, the TNA helps to reduce false positives, whereas the FIA network leads to a reduction in false negatives. then fed to a pre-trained deep learning architecture (Stage 1). This process builds a new network that is referred to as First Controlled Environment (CE) net. The First CE net is, in turn, used to label additional fruit images that are still acquired from a neutral background but under natural lighting conditions (SET2) that could not be straightforwardly labeled based on color thresholds, mainly due to the presence of shadows. Labeled images from SET2 are then used, along with labeled images from SET1, to retrain the network (Stage 2) and deal with non-uniform lighting conditions, leading to the so-called Final CE net.
Finally, field images that were acquired by the robot are added to the training set for a final training stage (Stage 3). Two different kinds of field images are considered: images containing only non-fruit regions (i.e., leaves, branches and background), and images containing both fruit and non-fruit regions. In the first case, image masks are obtained directly by setting to 0 all the image pixels. For the labeling of field images, instead, the Final CE net is employed, followed by the use of morphological operations instead of classical manual annotation. This leads to two nets, referred respectively to as True Negative Augmented (TNA) net and Field Image Augmented (FIA) net. As will be shown in the experimental results section, the TNA helps to reduce false positives, whereas the FIA network leads to a reduction in false negatives.

Network Architecture
In this research, the DeepLabv3+ pre-trained deep neural network is used. In DeepLabv3+ [20], features are extracted from a backbone network, in this case a ResNet18; it processes the input image, along with a set of 18 consecutive convolutional layers, where a pre-trained version of the network trained on more than a million images from the ImageNet (http://www.image-net.org, accessed on 8 August 2022) database is adopted. The extracted features are input to an Atrous Spatial Pyramid Pooling (ASPP) network, which resamples the features at arbitrary resolutions in order to achieve the best pixel-bypixel classification. The output of the ASPP network is passed through a 1 × 1 convolution to rearrange the data and obtain the final segmentation mask. A schematic representation of the network is depicted in Figure 4, which also shows its complexity at a glance. The strategy of transfer learning is thus applied to the proposed networks. Since the required number of the target classes is equal to 2, the last fully-connected layer of every deep network is downsized to output a binary classification. In this manner, the network is not initialized, and thus trained from the scratch.

Network Architecture
In this research, the DeepLabv3+ pre-trained deep neural network is used. In DeepLabv3+ [20], features are extracted from a backbone network, in this case a ResNet18; it processes the input image, along with a set of 18 consecutive convolutional layers, where a pretrained version of the network trained on more than a million images from the ImageNet (http://www.image-net.org, accessed on 8 August 2022) database is adopted. The extracted features are input to an Atrous Spatial Pyramid Pooling (ASPP) network, which resamples the features at arbitrary resolutions in order to achieve the best pixel-by-pixel classification. The output of the ASPP network is passed through a 1 × 1 convolution to rearrange the data and obtain the final segmentation mask. A schematic representation of the network is depicted in Figure 4, which also shows its complexity at a glance. The strategy of transfer learning is thus applied to the proposed networks. Since the required number of the target classes is equal to 2, the last fully-connected layer of every deep network is downsized to output a binary classification. In this manner, the network is not initialized, and thus trained from the scratch.

Controlled Environment (CE) Net
RGB images of harvested pomegranates located on a neutral background and acquired under uniform lighting conditions (SET1) are first fed to an algorithm using colorbased thresholding, in order to obtain ground-truth masks. Labels at this step have several noise blobs in the background as well as some stems, which need to be removed. In order to improve the labels, morphological operations are applied to automatically clean all masked images. The following operations, based on the connected components, are applied to each binary mask: 1. Noise removal: connected components with areas that are below a threshold were eliminated. 2. Stem discarding: morphological erosion is used to separate the protruding stems of each pomegranate from the rest of the fruit. The resulting connected components are approximated by ellipses. Stems are thus removed by a thresholding operation on the ellipse's eccentricity. The mask is finally restored by a dilation operation. 3. Hole-filling: a hole-filling operation is used to make labels uniform, even in the shaded areas.
An example of a labeled image from SET1 is shown in Figure 5. During the training phase, the more data that are available, the better. In order to satisfy this constraint, data augmentation is performed by rotating, reflecting and varying contrast and exposure in such ways that from each original image of SET1, 20 new images are obtained, as shown in Figure 6 for the sample case of Figure 5.
Labeled images are then used to tune a pre-trained network (Stage 1) and generate the First Controlled Environment (CE) net. The First CE net is then applied to produce image labels for images from SET2. Images from SET2 provide poor results when segmented

Controlled Environment (CE) Net
RGB images of harvested pomegranates located on a neutral background and acquired under uniform lighting conditions (SET1) are first fed to an algorithm using color-based thresholding, in order to obtain ground-truth masks. Labels at this step have several noise blobs in the background as well as some stems, which need to be removed. In order to improve the labels, morphological operations are applied to automatically clean all masked images. The following operations, based on the connected components, are applied to each binary mask:

1.
Noise removal: connected components with areas that are below a threshold were eliminated.

2.
Stem discarding: morphological erosion is used to separate the protruding stems of each pomegranate from the rest of the fruit. The resulting connected components are approximated by ellipses. Stems are thus removed by a thresholding operation on the ellipse's eccentricity. The mask is finally restored by a dilation operation.

3.
Hole-filling: a hole-filling operation is used to make labels uniform, even in the shaded areas.
An example of a labeled image from SET1 is shown in Figure 5.

Controlled Environment (CE) Net
RGB images of harvested pomegranates located on a neutral background and quired under uniform lighting conditions (SET1) are first fed to an algorithm using col based thresholding, in order to obtain ground-truth masks. Labels at this step have seve noise blobs in the background as well as some stems, which need to be removed. In ord to improve the labels, morphological operations are applied to automatically clean masked images. The following operations, based on the connected components, are a plied to each binary mask: 1. Noise removal: connected components with areas that are below a threshold w eliminated. 2. Stem discarding: morphological erosion is used to separate the protruding stems each pomegranate from the rest of the fruit. The resulting connected components a approximated by ellipses. Stems are thus removed by a thresholding operation the ellipse's eccentricity. The mask is finally restored by a dilation operation. 3. Hole-filling: a hole-filling operation is used to make labels uniform, even in t shaded areas.
An example of a labeled image from SET1 is shown in Figure 5. During the training phase, the more data that are available, the better. In order satisfy this constraint, data augmentation is performed by rotating, reflecting and varyi contrast and exposure in such ways that from each original image of SET1, 20 new imag are obtained, as shown in Figure 6 for the sample case of Figure 5.
Labeled images are then used to tune a pre-trained network (Stage 1) and gener the First Controlled Environment (CE) net. The First CE net is then applied to produce ima labels for images from SET2. Images from SET2 provide poor results when segment During the training phase, the more data that are available, the better. In order to satisfy this constraint, data augmentation is performed by rotating, reflecting and varying contrast and exposure in such ways that from each original image of SET1, 20 new images are obtained, as shown in Figure 6 for the sample case of Figure 5.
Labeled images are then used to tune a pre-trained network (Stage 1) and generate the First Controlled Environment (CE) net. The First CE net is then applied to produce image labels for images from SET2. Images from SET2 provide poor results when segmented using standard color-based filtering techniques, mainly as a result of the presence of shadows. Segmentation results for four sample images from SET2 are shown in Figure 7. This process facilitates the creation of labels that conventional color threshold tools are not able to obtain, due to the strong contrasts and chromatic distortions of the images. Also in this case, morphological operations are applied as previously described to improve the ground-truth masks that are used for subsequent training (see Figure 8 as an example).  Figure 7. Thi process facilitates the creation of labels that conventional color threshold tools are not abl to obtain, due to the strong contrasts and chromatic distortions of the images. Also in thi case, morphological operations are applied as previously described to improve th ground-truth masks that are used for subsequent training (see Figure 8 as an example).   Specifically, a new transfer learning stage (Stage 2) is performed using the full dat set that was obtained from all the controlled environment acquisitions (both SET1 an SET2), resulting in the Final CE net. For simplicity's sake, this net will be referred to as C net in the following. As will be shown in Section 3, this network has an excessive sensitiv ity in RGB space when applied to field images, neglecting, in some cases, other parameter Sensors 2022, 22, x FOR PEER REVIEW using standard color-based filtering techniques, mainly as a result of the presence ows. Segmentation results for four sample images from SET2 are shown in Figur process facilitates the creation of labels that conventional color threshold tools are to obtain, due to the strong contrasts and chromatic distortions of the images. Als case, morphological operations are applied as previously described to imp ground-truth masks that are used for subsequent training (see Figure 8 as an exa   Specifically, a new transfer learning stage (Stage 2) is performed using the set that was obtained from all the controlled environment acquisitions (both S using standard color-based filtering techniques, mainly as a result of the presence of shadows. Segmentation results for four sample images from SET2 are shown in Figure 7. This process facilitates the creation of labels that conventional color threshold tools are not able to obtain, due to the strong contrasts and chromatic distortions of the images. Also in this case, morphological operations are applied as previously described to improve the ground-truth masks that are used for subsequent training (see Figure 8 as an example).   Specifically, a new transfer learning stage (Stage 2) is performed using the full data set that was obtained from all the controlled environment acquisitions (both SET1 and SET2), resulting in the Final CE net. For simplicity's sake, this net will be referred to as CE net in the following. As will be shown in Section 3, this network has an excessive sensitivity in RGB space when applied to field images, neglecting, in some cases, other parameters Specifically, a new transfer learning stage (Stage 2) is performed using the full data set that was obtained from all the controlled environment acquisitions (both SET1 and SET2), resulting in the Final CE net. For simplicity's sake, this net will be referred to as CE net in the following. As will be shown in Section 3, this network has an excessive sensitivity in RGB space when applied to field images, neglecting, in some cases, other parameters such as shape. This behavior is primarily evident in the presence of dry and yellowed leaves near the fruits, as is the case in Figure 9. The colors of those leaves tend to confuse the network, consequently increasing the number of false positives exponentially. For this reason, a further training stage is proposed (Stage 3), as described in the following.

Sensors 2022, 22, x FOR PEER REVIEW
leaves near the fruits, as is the case in Figure 9. The colors of those leaves tend to the network, consequently increasing the number of false positives exponentially reason, a further training stage is proposed (Stage 3), as described in the followin

True Negative Augmented (TNA) Net
In order to enhance the network capability of reducing false positives, true examples are added to labeled images from SET1 and SET2 as inputs for network (Stage 3). True negative examples are images that contain only background info such as fruitless images that were not necessarily acquired in the same crop typ case, images taken from a vineyard during the post-harvesting stage are used, a in the sample images in Figure 10. The resulting network is referred to as True Augmented (TNA) net. In Section 3, it will be shown that by applying the TNA n segmentation of field images, false positives are drastically reduced, although at of a reduction in true positives, as can be seen for the sample case shown in Figu

True Negative Augmented (TNA) Net
In order to enhance the network capability of reducing false positives, true negative examples are added to labeled images from SET1 and SET2 as inputs for network training (Stage 3). True negative examples are images that contain only background information, such as fruitless images that were not necessarily acquired in the same crop type. In this case, images taken from a vineyard during the post-harvesting stage are used, as shown in the sample images in Figure 10. The resulting network is referred to as True Negative Augmented (TNA) net. In Section 3, it will be shown that by applying the TNA net to the segmentation of field images, false positives are drastically reduced, although at the cost of a reduction in true positives, as can be seen for the sample case shown in Figure 11. (Stage 3). True negative examples are images that contain only background information, such as fruitless images that were not necessarily acquired in the same crop type. In this case, images taken from a vineyard during the post-harvesting stage are used, as shown in the sample images in Figure 10. The resulting network is referred to as True Negative Augmented (TNA) net. In Section 3, it will be shown that by applying the TNA net to the segmentation of field images, false positives are drastically reduced, although at the cost of a reduction in true positives, as can be seen for the sample case shown in Figure 11.

Field Image Augmented Network (FIA) Net
As an alternative strategy, the data set can be augmented using in-field imag ples. Masks are generated from the frames of the first few seconds of field acqui applying the CE net. No morphological finishing operations are applied to the labels. A new data set for training is then generated by adding these images to th images from SET1 and SET2 for retraining the network (Stage 3). The resulting ferred to as Field Image Augmented Network (FIA) net, leads to an improvement in performance, reducing both false negatives and false positives, as can be seen for shown in Figure 12, and is discussed in detail in Section 3.

Field Image Augmented Network (FIA) Net
As an alternative strategy, the data set can be augmented using in-field image examples. Masks are generated from the frames of the first few seconds of field acquisition by applying the CE net. No morphological finishing operations are applied to the obtained labels. A new data set for training is then generated by adding these images to the labeled images from SET1 and SET2 for retraining the network (Stage 3). The resulting net, referred to as Field Image Augmented Network (FIA) net, leads to an improvement in system performance, reducing both false negatives and false positives, as can be seen for the case shown in Figure 12, and is discussed in detail in Section 3.
applying the CE net. No morphological finishing operations are applied to the labels. A new data set for training is then generated by adding these images to th images from SET1 and SET2 for retraining the network (Stage 3). The resulting ferred to as Field Image Augmented Network (FIA) net, leads to an improvement in performance, reducing both false negatives and false positives, as can be seen for shown in Figure 12, and is discussed in detail in Section 3.

Results and Discussion
This section reports on the field validation of the proposed multi-stage transf ing approach using the experimental setup described in Section 2. Quantitative obtained from the CE, TNA and FIA networks are discussed. Results are also co with a conventional approach using manually labeled field frames for training.

Results and Discussion
This section reports on the field validation of the proposed multi-stage transfer learning approach using the experimental setup described in Section 2. Quantitative metrics obtained from the CE, TNA and FIA networks are discussed. Results are also compared with a conventional approach using manually labeled field frames for training.
Each network is trained using different data sets that increase in complexity. The data sets used for training are summarized in Table 1. Specifically for the CE net, images from controlled environments enclosed in SET 1 and SET 2 are used after augmentation. The TNA network is trained by the addition of 100 true negative examples, including images of leaves and other background parts (e.g., sky regions and branches). It is worth noting that both the CE and the TNA net do not use images from the robot field acquisitions. A relatively small number of field images (i.e., 100), including both positive and negative examples of fruits acquired during the robot motion, are enclosed instead for the training of the FIA network. The CE, TNA and FIA networks are trained without requiring any manual labeling. Manually labeled field images are only used to compare the proposed approach with a conventional training procedure.
Once trained, the generalization ability of the networks is evaluated by applying the model to an independent (i.e., different from that used for training) data set that consists of 50 field images acquired during robot motion. Ground-truth labels for these images are obtained via manual labeling.

Evaluation Metrics
Pixel-wise accuracy is measured by comparing ground truth and predicted information. Specifically, having defined as TP the number of true positives (i.e., pixels correctly classified as fruit), TN as the number of true negatives (i.e., pixels correctly classified as non-fruit), FN as the number of false negatives (i.e., pixels incorrectly classified as non-fruit) and FP as the number of false positives (i.e., pixels incorrectly classified as fruit), precision (P), recall (R) and the F1-score are recovered as follows: In addition, the following metrics are computed: • Global Accuracy, the ratio of correctly classified pixels, regardless of class, to the total number of pixels. This metric allows for a quick and computationally inexpensive estimate of the percentage of correctly classified pixels.

•
Intersection over union (IoU), also known as the Jaccard similarity coefficient, is the most commonly used metric. It provides a measure of statistical accuracy that penalizes false positives. For each class, IoU is the ratio of correctly classified pixels to the total number of true and predicted pixels in that class, as shown in the following: For each image, mean IoU is the average IoU score of all classes in that particular image. For the aggregate data set, mean IoU is the average IoU score of all classes in all images.

•
Weighted IoU, is the average IoU of each class weighted by the number of pixels in that class. This metric is used when images have disproportionally sized classes, in order to reduce the impact of errors in the small classes on the aggregate quality score.

Segmentation Performance
The performance of the CE, TNA and FIA networks is reported in Table 2, in terms of precision, recall and F1-scores. These metrics are computed for each image and then averaged over the entire test set. These results show the performance of the segmentation algorithm using different training sets. Referring to Table 2, the CE net shows reasonably good recall values amounting to 79.06%, with a relatively low precision of 48.10%. These results were due to a high number of false positives that mainly related to the misclassification of dry yellow leaves, as shown in the sample case in Figure 8. Conversely, the TNA network shows an increment in the precision that reached a value of 96.87%, thanks to the addition of true negative examples to the training set, at the expense of recall, which dropped from 79.06% to 43.25%. Thus, the network no longer confuses fruit with other elements in a scene; however, at the same time, it loses information by missing several fruits. Both CE and TNA nets show comparable F1-score values of 55.79% and 56.97%, respectively. A significant improvement was achieved with the FIA net, in which the training set was enriched with a few images acquired in the field and directly labeled through the CE network, following the pseudolabeling learning strategy. The FIA net guarantees the best performance, with precision and recall values of 93.33% and 81.49%, respectively, resulting in an F1-score of 86.42%. Figure 13 shows the segmentation results obtained from the different networks for some sample field images, with white pixels representing true positives, black pixels representing true negatives, magenta pixels representing false negatives and green pixels representing false positives. The progressive reduction in misclassification passing from the CE to FIA net can be clearly seen. The proposed multi-stage approach was compared with a conventional manua beling procedure. The results are collected in Table 3. The confusion matrices that s marize the percentage of correct and incorrect predictions over the entire test set are shown in Figure 14. The proposed approach leads to comparable results to standard tr ing with manually labeled examples, with fairly consistent results in terms of accur and IoU, and with a relatively low decrement of about 4.0% in the F1-score, but with advantage of relieving the manual labeling burden.  The proposed multi-stage approach was compared with a conventional manual labeling procedure. The results are collected in Table 3. The confusion matrices that summarize the percentage of correct and incorrect predictions over the entire test set are also shown in Figure 14. The proposed approach leads to comparable results to standard training with manually labeled examples, with fairly consistent results in terms of accuracy and IoU, and with a relatively low decrement of about 4.0% in the F1-score, but with the advantage of relieving the manual labeling burden.  The computational efficiency of the FIA network was also evaluated by segmenting the entire field image data set, using an Nvidia RTX 2080 Ti GPU and an Intel (R) Core (TM) i7-4790 CPU @ 3.60 GHz, showing that the processing time attests to about 0.15 s per frame (less than the acquisition frame rate of 6 Hz) This would make it feasible the adoption of the proposed approach for real-time processing onboard farm robots.

Conclusions
In this paper, a multi-stage transfer learning approach has been developed to segment field images of pomegranate fruits acquired by an agricultural robot, without the need for a time-consuming manual labeling process. Each learning stage leads to a different network, ranging from a network that was trained using images acquired under controlled conditions (CE network), up to a network (FIA network) that incorporates field images as well. Results obtained from experimental tests have been presented, and show that despite the low quality of the input images, the proposed methods can segment field images achieving an F1-score of 86.42% and IoU of 97.94%, using the DeepLabv3+ architecture. The obtained performance is comparable with those of a standard learning approach, without the burden of having to use manual annotation.

Future Work
The methods discussed in the paper use a consumer-grade camera (worth a few hundred Euros) as the only sensory input. This choice proved to be a good trade-off between performance and cost-effectiveness. An obvious improvement would be to use higherend depth cameras available on the market. Future efforts will be devoted to further improving the proposed framework, e.g., using other pre-trained nets, and to extend the scope of the system to yield prediction or automatic size estimation of fruits. In this respect, the use of 3D data will be specifically investigated in order to improve classification results, and to recover further information on fruit morphology. In addition, the proposed framework could be enhanced for fruit control by the farmer robot. Finally, the portability of the system to different crops (e.g., vineyards), or for different fruit maturation stages, will be evaluated.
Author Contributions: R.P.D.: conceptualization, methodology, data collection, writing-original draft; A.M.: conceptualization, methodology, data collection, writing-original draft, writing-re- The computational efficiency of the FIA network was also evaluated by segmenting the entire field image data set, using an Nvidia RTX 2080 Ti GPU and an Intel (R) Core (TM) i7-4790 CPU @ 3.60 GHz, showing that the processing time attests to about 0.15 s per frame (less than the acquisition frame rate of 6 Hz) This would make it feasible the adoption of the proposed approach for real-time processing onboard farm robots.

Conclusions
In this paper, a multi-stage transfer learning approach has been developed to segment field images of pomegranate fruits acquired by an agricultural robot, without the need for a time-consuming manual labeling process. Each learning stage leads to a different network, ranging from a network that was trained using images acquired under controlled conditions (CE network), up to a network (FIA network) that incorporates field images as well. Results obtained from experimental tests have been presented, and show that despite the low quality of the input images, the proposed methods can segment field images achieving an F1-score of 86.42% and IoU of 97.94%, using the DeepLabv3+ architecture. The obtained performance is comparable with those of a standard learning approach, without the burden of having to use manual annotation.

Future Work
The methods discussed in the paper use a consumer-grade camera (worth a few hundred Euros) as the only sensory input. This choice proved to be a good trade-off between performance and cost-effectiveness. An obvious improvement would be to use higher-end depth cameras available on the market. Future efforts will be devoted to further improving the proposed framework, e.g., using other pre-trained nets, and to extend the scope of the system to yield prediction or automatic size estimation of fruits. In this respect, the use of 3D data will be specifically investigated in order to improve classification results, and to recover further information on fruit morphology. In addition, the proposed framework could be enhanced for fruit control by the farmer robot. Finally, the portability of the system to different crops (e.g., vineyards), or for different fruit maturation stages, will be evaluated.
Funding: This research was funded by ERA-NET Cofund ICT-AGRI-FOOD "ANTONIO-multimodal sensing for individual plANT phenOtyping in agriculture robotics" (Id. 41946).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study can be made available on request.