An Accurate Optimized Contour Segmentation Model for Green Spherical Fruits

Zhang, Ting; Xu, Ying; Cao, Kai; Chen, Xiude; Liu, Qiaolian; Jia, Weikuan

doi:10.3390/horticulturae11070761

by Ting Zhang ¹, Ying Xu ², Kai Cao ², Xiude Chen ³, Qiaolian Liu ¹,* and Weikuan Jia ²,*

¹ School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277160, China
² School of Information Science and Engineering, Shandong Normal University, Jinan 250358, China
³ National Research Center for Apple Engineering and Technology, Shandong Agricultural University, Taian 271018, China
* Authors to whom correspondence should be addressed.

Horticulturae 2025, 11(7), 761; https://doi.org/10.3390/horticulturae11070761

Submission received: 20 May 2025 / Revised: 19 June 2025 / Accepted: 26 June 2025 / Published: 1 July 2025

(This article belongs to the Special Issue Application of Smart Technology and Equipment in Horticulture—2nd Edition)

Abstract

Accurate fruit detection in complex orchard environments remains challenging due to variable lighting conditions and weather factors. This paper proposes an optimized contour segmentation model for green spherical fruits (apples and persimmons) based on the E2EC framework. The model employs DLA34 as the backbone network for feature extraction enhanced by a path aggregation balanced feature pyramid network (PAB FPN) with embedded attention mechanisms to refine feature representation. For contour segmentation, we introduce a Cycle MLP Aggregation Deformation (CMAD) module that incorporates cycleMLP to expand the receptive field and improve contour accuracy. Experimental results demonstrate the model’s effectiveness, achieving average precision (AP) and average recall (AR) of 75.5% and 80.4%, respectively, for green persimmons and 57.8% and 64.0% for green apples, outperforming previous segmentation methods. These advancements contribute to the development of more robust smart agriculture systems.

Keywords: green spherical fruit; PAB FPN; CMAD module; contour segmentation; object detection

1. Introduction

The continuous development of frontier technologies such as deep learning and artificial intelligence provides strong support for intelligent agriculture, and fruit and vegetable growth monitoring [1,2] and harvesting [3,4] in modern orchards are gradually becoming intelligent and automated. In recent years, computer vision has been increasingly applied to modern orchard equipment owing to its role in the vision system of agricultural automation equipment [5]; it has already solved many practical problems [6,7,8,9], reduced the consumption of human resources, and effectively improved orchard efficiency. However, fruit consumption is still increasing, and the annual consumption of spherical fruits such as apples is especially large; agricultural automation equipment therefore needs to be upgraded to improve its operating efficiency, accuracy, and real-time performance. In modern orchards, the quality of target fruit segmentation constrains the performance of the vision system of agricultural automation equipment. The actual orchard environment is complex and variable, and the detection and segmentation of target fruits are affected by factors such as weather conditions, fruit color and posture, and changes in lighting. How to further improve the accuracy of target fruit detection and segmentation in complex orchard environments, so as to meet the needs of the vision systems of automated fruit equipment, has therefore become an urgent problem.

To meet the development needs of modern orchards, many general-purpose or optimized object detection and segmentation algorithms, based on either machine learning or deep learning, have been applied to fruit detection and segmentation and have achieved good results. Qiu et al. [10] constructed the SM-YOLOv4 algorithm based on an improved YOLOv4, and the overall accuracy of grape bunch detection reached 93.52%. Liu et al. [11] proposed an instance segmentation model for greenhouse cucumbers based on an improved Mask R-CNN, which achieved an F1 score of 89.47% in tests on cucumbers. Jiang et al. [12] chose the elongated loofah fruit as the research object and proposed the instance segmentation model LuffaInst, which achieved better detection and segmentation accuracy on a loofah dataset. Spherical fruits are more prevalent in practice: owing to their shape characteristics, their planting area and production are larger than those of fruits with other shapes, and the need for orchard automation is more urgent [13]. The detection and segmentation of spherical fruits is therefore the current focus of fruit detection and segmentation research. Arefi et al. [14] chose ripe tomatoes as the research object and extracted them by removing background information and combining color space information, achieving an accuracy of 96.36%. To address the difficulty of locating small target fruits in the natural orchard environment, Sun et al. [15] proposed a balanced feature pyramid network (BFP Net) for small apple detection, which improved the ability of the model to recognize occluded target apples. Kang et al. [16] improved DASNet to obtain DASNet-v2, which uses visual sensors to segment ripe apple instances more robustly and efficiently.

In these studies on spherical fruits (grapes, tomatoes, and apples), the pronounced color contrast between the fruits and their backgrounds significantly facilitated detection and segmentation. In contrast, immature fruits in the growing stage are green, similar in color to the background leaves, so their detection and segmentation is more challenging than that of mature fruits. To improve the detection accuracy for green fruits, Jia et al. [17] used a combination of ResNet and DenseNet as the feature extraction network and improved the instance segmentation model Mask R-CNN for apple targets, which significantly improved detection accuracy for overlapping and branch-occluded apples. Mu et al. [18] combined a region-based convolutional neural network (R-CNN) with ResNet-101 for ripeness detection and yield prediction of highly occluded unripe tomatoes. Sun et al. [19] proposed GHFormer-Net, a model for detecting small green apples and begonia fruits at night; it uses PVTv2 as the backbone network for feature extraction and introduces the gradient harmonizing mechanism (GHM), which optimizes the classification and regression losses from the perspective of the gradient, thus effectively detecting small target fruits at night.

Most current state-of-the-art instance segmentation algorithms perform pixel-wise segmentation within the bounding box given by an object detector, which usually requires expensive post-processing. The alternative is contour-based segmentation, developed in recent years, which treats instance segmentation as a regression of object contour vertices and outputs the predicted vertex positions directly, without expensive and complex post-processing. Ling et al. [20] treated object annotation as a regression problem, modeled the contour of an object as a graph, and used a graph convolutional network (GCN) to predict the evolution direction of all boundary vertices simultaneously. Jia et al. [21] proposed an efficient YOLO-snake model for green fruit segmentation; it is sensitive to green fruit and significantly improves segmentation accuracy and efficiency. Liu et al. [22] proposed a deep attentive contour model that optimizes the matching scheme of contour vertices and introduces an attention mechanism in the contour deformation stage, so that the model pays more attention to vertices outside the boundary during deformation and obtains a better target contour. Zhang et al. [23] proposed E2EC, an end-to-end contour-based instance segmentation model. It uses a learnable contour initialization architecture, which replaces the manually designed initial contour and improves the accuracy of contour segmentation. However, in the complex environment of a real orchard, leaf occlusion and overlap between fruits still challenge the detection and segmentation of target fruits. The accuracy for green spherical fruits therefore remains to be improved, and the model needs further optimization for recognition in complex environments.

To improve the detection and segmentation accuracy for green spherical fruits and to meet the requirements of recognition in the complex environments of real orchards, this paper proposes an accurate optimized contour segmentation model for the instance segmentation of green persimmons and green apples. The multilayer features extracted from the fruit image by the backbone network are enhanced by PAB FPN, which deeply integrates balanced semantic features so that the resulting feature layers carry more complete fruit feature information. In the segmentation stage, the CMAD module first deforms the initial contour of the fruit to obtain coarse contours of persimmons and apples; cycleMLP is introduced into this module to expand the receptive field with step-style sampling points and to aggregate more contextual information about the fruit contours. The coarse contours are then refined and deformed again to obtain the final contours of the target green persimmons and green apples. Overall, this study makes the following contributions:

(1): An accurate optimized contour segmentation model for green spherical fruits is proposed, which achieves better results on two green spherical fruits, green persimmon and green apple.
(2): After the backbone network, PAB FPN is used to enhance the multilevel features of the fruits by deeply integrating balanced semantic features and embedding an attention mechanism.
(3): CycleMLP is introduced to optimize the initial contour deformation module, using step-style sampling points to expand the receptive field and aggregate more contextual information about the fruit contour.

The rest of the paper is organized as follows: in Section 2, the collection and production process of the green persimmon dataset and the green apple dataset are described in detail, and images of spherical fruits under a number of different environmental conditions that make up the dataset are shown. The structure of the model is given in Section 3, and the structure of each part of the model is introduced in some detail. In Section 4, the validation results and visualization images of the model are given, and comparative experiments with some other commonly used detection and segmentation models are performed. Section 5 summarizes this study.

2. Materials and Methods

2.1. Fruit Dataset

The datasets used in this paper are the unripe green spherical persimmon dataset and the green spherical apple dataset, and the pictures were taken during the period when the fruits were growing in a more rounded shape, which meets the research requirements of this study.

2.1.1. Dataset Collection

Image collection locations: Images of immature green apples were collected at the Longwangshan Apple Production Base, Fushan District, Yantai City, Shandong Province, and images of immature green persimmons were collected on the hill behind Shandong Normal University (Changqing Lake Campus).

Fruit varieties: The persimmon varieties include “Niushin”, “Jishinhuang”, “Kyo-myeon”, etc., and the apple variety is “Gala”.

Image acquisition equipment: All images were captured with the same camera, a Canon EOS 80D DSLR (Canon Inc., Tokyo, Japan) with a CMOS image sensor. A total of 1361 images of green apples and 553 images of green persimmons were captured at a resolution of 6000 × 4000 pixels and saved in .jpg format as 24-bit color images.

The actual orchard environment is complex. To simulate real situations as closely as possible, the dataset includes images collected under a variety of complex environmental conditions. The target fruits were photographed from different angles and distances, including front and side views, front lighting and backlighting, and near and far shots, and at different times of day, including morning, noon, and night. The dataset also includes cases where fruit boundaries are hard to discriminate, such as foliage occlusion and fruit overlap, as well as images with water droplets on the fruits after rain and images under different lighting, all of which can affect the final detection and segmentation results. Figure 1 and Figure 2 show fruit images in several of these situations.

In addition, the persimmon dataset includes 2524 fruits, and the apple dataset includes 7137 fruits. Table 1 shows the number and proportion of fruits at different scales: fruits with a ground truth box area less than or equal to $32^{2}$ pixels are small-scale targets, those with an area greater than $32^{2}$ and less than or equal to $96^{2}$ pixels are medium-scale targets, and those with an area greater than $96^{2}$ pixels are large-scale targets.
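The scale rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and a box area of exactly 96² pixels is assumed to count as medium-scale, in line with the thresholds given with the evaluation metrics.

```python
SMALL_MAX_AREA = 32 ** 2   # 1024 square pixels
MEDIUM_MAX_AREA = 96 ** 2  # 9216 square pixels


def scale_category(box_width, box_height):
    """Classify a ground truth box as 'small', 'medium', or 'large' by its area."""
    area = box_width * box_height
    if area <= SMALL_MAX_AREA:
        return "small"
    if area <= MEDIUM_MAX_AREA:  # assumption: exactly 96^2 counts as medium
        return "medium"
    return "large"


def scale_statistics(boxes):
    """Return {category: (count, proportion)} for a list of (width, height) boxes."""
    counts = {"small": 0, "medium": 0, "large": 0}
    for width, height in boxes:
        counts[scale_category(width, height)] += 1
    total = len(boxes)
    return {k: (v, v / total if total else 0.0) for k, v in counts.items()}


# Hypothetical boxes, in pixels of the 600 x 400 images.
print(scale_statistics([(20, 20), (40, 50), (120, 110), (32, 32)]))
```

Applied to all annotated boxes, a tally of this kind would produce the counts and proportions of the sort reported in Table 1.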

2.1.2. Dataset Production

The fruit images in the dataset fully reflect the complex and changeable environment of a real unstructured orchard and are therefore reasonably random and representative. The collected persimmon and apple images were divided into training and validation sets at a ratio of 7:3. After the division, the persimmon training set contains 388 images and the validation set 165 images, and the apple training set contains 953 images and the validation set 408 images.
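A minimal sketch of such a 7:3 split is given below. The rounding rule and the random seed are assumptions, since the text does not state them; rounding the training share up happens to reproduce the reported set sizes. The file names are hypothetical.

```python
import random


def split_7_3(image_names, seed=0):
    """Shuffle image names and split them into training and validation lists (7:3)."""
    names = sorted(image_names)
    random.Random(seed).shuffle(names)       # assumed: a seeded random shuffle
    n_train = (7 * len(names) + 9) // 10     # ceil(0.7 * n) using integer arithmetic
    return names[:n_train], names[n_train:]


persimmon_images = [f"persimmon_{i:04d}.jpg" for i in range(553)]
apple_images = [f"apple_{i:04d}.jpg" for i in range(1361)]

for fruit, images in (("persimmon", persimmon_images), ("apple", apple_images)):
    train, val = split_7_3(images)
    print(fruit, len(train), len(val))
# prints "persimmon 388 165" and "apple 953 408"
```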

To reduce the amount of computation, shorten the subsequent experiments, and meet the real-time requirements of the equipment, the images were uniformly scaled from 6000 × 4000 pixels to 600 × 400 pixels. The images were labeled with LabelMe (V5.3.0) [24]: the edge contours of the two green spherical fruits were marked with labeling points to separate the fruits from the background. The labeling information of each image, such as the labels and the coordinates of the labeled points, was saved to a corresponding .json file, and the finished .json files were finally converted to a dataset in COCO format [25].
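The conversion step can be illustrated with the following sketch, which maps already parsed LabelMe records to a COCO-style dictionary. It is not the conversion script used in this study: the function names, category names, and the example record are hypothetical, and only the commonly used fields of both formats are handled.

```python
def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def labelme_to_coco(records, category_names):
    """Convert parsed LabelMe dicts (one per image) into a COCO-style dict."""
    cat_ids = {name: idx for idx, name in enumerate(category_names, start=1)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }
    for image_id, rec in enumerate(records, start=1):
        coco["images"].append({
            "id": image_id,
            "file_name": rec["imagePath"],
            "width": rec["imageWidth"],
            "height": rec["imageHeight"],
        })
        for shape in rec["shapes"]:
            if shape.get("shape_type", "polygon") != "polygon":
                continue  # only contour polygons are converted in this sketch
            pts = [(float(x), float(y)) for x, y in shape["points"]]
            xs = [p[0] for p in pts]
            ys = [p[1] for p in pts]
            coco["annotations"].append({
                "id": len(coco["annotations"]) + 1,
                "image_id": image_id,
                "category_id": cat_ids[shape["label"]],
                "segmentation": [[v for p in pts for v in p]],  # flat x, y list
                "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
                "area": polygon_area(pts),
                "iscrowd": 0,
            })
    return coco


# Hypothetical record: one 600 x 400 image with a single square "persimmon" contour.
example = {
    "imagePath": "persimmon_0001.jpg", "imageWidth": 600, "imageHeight": 400,
    "shapes": [{"label": "persimmon", "shape_type": "polygon",
                "points": [[100, 50], [110, 50], [110, 60], [100, 60]]}],
}
coco = labelme_to_coco([example], ["persimmon", "apple"])
print(coco["annotations"][0]["bbox"], coco["annotations"][0]["area"])
```

In practice each record would be read from a .json file with `json.load`, and the resulting dictionary written back out with `json.dump`.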

2.2. Optimized Contour Segmentation Model

To address the challenges of spherical fruit detection and segmentation in complex orchard environments, this study develops an enhanced contour segmentation model based on the E2EC framework. As illustrated in Figure 3, the proposed model specifically optimizes the segmentation of green spherical fruits through a multi-stage architecture. The DLA34 backbone network first extracts discriminative features from fruit images followed by the PAB FPN module, which refines feature fusion through balanced multi-scale integration and embedded attention mechanisms. During segmentation, the initial fruit contours are generated from detected center coordinates and progressively refined through iterative deformation to achieve precise boundary delineation.

2.2.1. Backbone Network DLA34

To cope with difficult cases such as occlusion and overlap in the two green spherical fruit datasets, this study uses DLA34 [26] as the backbone network for more effective feature extraction; from each input fruit image, DLA34 generates four effective feature layers for the subsequent operations. The DLA (Deep Layer Aggregation) network is mainly composed of IDA (Iterative Deep Aggregation) and HDA (Hierarchical Deep Aggregation): IDA connects features between different stages, while HDA fuses features within the same stage. IDA therefore mainly performs fusion across resolutions and scales, while HDA mainly fuses the feature maps of individual modules and channels. The specific structure of each component is shown in Figure 4.

The input three-channel RGB image of the spherical fruit is first converted to 32 channels by the convolution operation of the Basic module and then passes through four stages, which output four different effective feature layers. As can be seen from Figure 4, each stage of DLA34 is composed of convolutional modules arranged in a tree-like structure. The connections between convolutional modules within a stage form the HDA structure, and the outputs of two tree-like substructures are fused at an aggregation node, that is, the Root module in Figure 4. The aggregation nodes also aggregate features as they are propagated from shallow to deep. The connections between different stages form the IDA structure. Between stages, the feature map output by the previous stage is down-sampled and used as the input of the next stage, so that the width and height of the feature map are halved and the number of channels is doubled at each stage.

In summary, the input green spherical fruit image passes through the stages of the DLA34 network, the HDA and IDA structures shown in Figure 4 extract its feature information, and four different effective feature layers {Layer 1, Layer 2, Layer 3, Layer 4} are finally obtained.

2.2.2. PAB-FPN

During feature extraction from the green fruit images, the semantic information in high-level feature layers becomes increasingly rich, while the resolution progressively decreases due to consecutive convolution and down-sampling operations. This results in the loss of positional details such as boundary contours. Although these high-level features can more accurately determine whether a region contains the target spherical fruit, they are less effective for precise fruit localization. Conversely, lower feature layers retain rich spatial localization information but lack sufficient semantic information, leading to missed detections and false positives, particularly when fruits are occluded by foliage or when fruit and leaf colors are similar. The four feature layers therefore contain different information. To improve the final segmentation and detection accuracy for green persimmons and green apples, a feature pyramid network is used to fuse these feature layers so that they complement each other and carry richer, more complete information.

The feature fusion pyramid network used in this study is PAB FPN (path aggregation balanced feature pyramid network), whose structure is shown in Figure 5. For feature fusion, PAB FPN adopts the structure of PANet [27]: a top-down path first up-samples the high-level features and fuses them with the low-level features, and a bottom-up path then down-samples the fused feature layers, starting from the lowest level, and fuses them again to obtain the preliminary fused feature layers. In this stage, following the network structure of CSPNet [28], the learning ability of the convolutional neural network is enhanced by adding the CSPLayer [29]. The preliminary fused feature layers are then fed into the BFP (Balanced Feature Pyramid) module [30] for further refinement and fusion. The BFP module first rescales the input feature layers to a common size: it selects an intermediate-sized feature layer, e.g., C4, and uses its size as the reference. Feature layers larger than the reference, such as C2 and C3, are reduced by max pooling, while C5, which is smaller than the reference, is enlarged by up-sampling with interpolation, so that all four input feature layers end up with the size of C4. The four feature layers are then simply averaged to obtain the balanced semantic features, which balances the information contained in the different feature layers without adding parameters, as shown in Equation (1).

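C = \frac{1}{L} \sum_{l = l_{\min}}^{l_{\max}} C_{l}

(1)

where $C_{l}$ is the rescaled feature layer at level $l$, $L$ is the number of feature layers, and $l_{\min}$ and $l_{\max}$ are the indices of the lowest and highest levels involved.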
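The rescale-and-average step of Equation (1) can be sketched as follows for single-channel feature maps stored as nested lists. This is an illustration under simplifying assumptions: the maps are square, their sizes differ by integer factors, and nearest-neighbour interpolation stands in for the up-sampling, which the text does not specify further. All function names are hypothetical.

```python
def max_pool(fmap, factor):
    """Reduce a square 2D map by taking the maximum over factor x factor windows."""
    size = len(fmap)
    return [[max(fmap[r][c]
                 for r in range(i, i + factor)
                 for c in range(j, j + factor))
             for j in range(0, size, factor)]
            for i in range(0, size, factor)]


def upsample_nearest(fmap, factor):
    """Enlarge a square 2D map by nearest-neighbour interpolation."""
    size = len(fmap) * factor
    return [[fmap[i // factor][j // factor] for j in range(size)]
            for i in range(size)]


def resize_to(fmap, target):
    """Bring a map to the reference size: pool larger maps, up-sample smaller ones."""
    size = len(fmap)
    if size > target:
        return max_pool(fmap, size // target)
    if size < target:
        return upsample_nearest(fmap, target // size)
    return fmap


def balanced_feature(levels, ref_index):
    """Equation (1): average all levels after resizing them to the reference level."""
    target = len(levels[ref_index])
    resized = [resize_to(f, target) for f in levels]
    num_levels = len(resized)
    return [[sum(f[i][j] for f in resized) / num_levels for j in range(target)]
            for i in range(target)]


# Toy pyramid: C2 (8x8), C3 (4x4), C4 (2x2, the reference), C5 (1x1).
c2 = [[1.0] * 8 for _ in range(8)]
c3 = [[2.0] * 4 for _ in range(4)]
c4 = [[3.0] * 2 for _ in range(2)]
c5 = [[6.0]]
print(balanced_feature([c2, c3, c4, c5], ref_index=2))  # [[3.0, 3.0], [3.0, 3.0]]
```

Because the averaging has no learnable weights, this step adds no parameters, which is the property noted above.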

The balanced semantic features can be further refined to make their features more distinguishable, and the refinement method used is embedded Gaussian non-local attention. The non-local module [31] computes the response of a location as a weighted sum of all the location features in the input feature mapping and directly computes the relationship between the two locations to quickly capture long-range dependencies, which leads to further extraction of the semantic features. This refining step helps to enhance the integration properties and further improve the network results, the structure of which is shown in Figure 6.

As shown in Figure 6, the feature layers are input to three 1 × 1 convolutional layers whose numbers of output channels are all set to half that of the input feature X, which greatly reduces the amount of computation. Before the feature Z is output, the number of channels is restored to that of X so that the input and output features have the same dimensions. The output is computed as shown in Equation (2).

y_{i} = \frac{1}{C (x)} \sum_{\forall j} f (x_{i}, x_{j}) g (x_{j})

(2)

where $x$ denotes the input signal (the feature map), and $x_{i}$ and $x_{j}$ are its values at positions i and j; i is the index of the output position; j enumerates all possible positions; the pairwise function $f (x_{i}, x_{j})$ computes a scalar similarity between position i and each position j; $g (x_{j})$ computes a representation of the input signal at position j; and C(x) is a normalization factor.
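For the embedded Gaussian form, f is an exponentiated dot product and C(x) is the sum of f over all positions j, so Equation (2) reduces to a softmax-weighted sum. The sketch below illustrates this on a small list of feature vectors. As a simplifying assumption, the learned 1 × 1 convolutions that embed the features are replaced by the identity, so it shows only the weighting scheme and not the module in Figure 6.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def non_local(features):
    """Embedded Gaussian non-local response with identity embeddings.

    features: list of per-position feature vectors x_j (flattened spatial grid).
    Returns y with y_i = sum_j softmax_j(x_i . x_j) * x_j.
    """
    outputs = []
    for x_i in features:
        scores = [dot(x_i, x_j) for x_j in features]
        top = max(scores)  # subtract the maximum for numerical stability
        f_vals = [math.exp(s - top) for s in scores]  # f(x_i, x_j) up to a common factor
        c_x = sum(f_vals)                             # normalization C(x)
        y_i = [sum(f * x_j[k] for f, x_j in zip(f_vals, features)) / c_x
               for k in range(len(x_i))]
        outputs.append(y_i)
    return outputs


y = non_local([[1.0, 0.0], [0.0, 1.0]])
print(y)
```

Each output position is a convex combination of all input positions, which is how the module captures long-range dependencies in a single step.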

After the features are refined by the non-local module, the refined features are rescaled back to the original size of each feature layer by reversing the rescaling procedure described above, and are used to enhance the original features and refine the positional details and semantic information contained in each feature layer.

2.2.3. Segmentation Network

The segmentation network generates a heatmap to localize target fruit centers, and the initial contours are derived by centroid-based offset regression [22]. These contours are first deformed by the Cycle MLP Aggregation Deformation (CMAD) module, which incorporates CycleMLP [32] and its step-style sampling to expand the contextual receptive field, yielding coarse fruit contours. The deformation module then refines the coarse contours iteratively to output the final segmentation contours of the target spherical fruits. The specific network structure is shown in Figure 7.

For the target green persimmons and green apples, the initial contours are deformed based on the centroid features of the fruits and the features of all contour vertices. The features of the N initial contour vertices are concatenated with the centroid features to form a feature tensor of shape (batch, c, N + 1), where c is the number of channels of the vertex features. This tensor is input into the CMAD module to predict the offsets of the contour vertices, and the offsets are added to the initial contour coordinates to obtain the coarse instance contours of the two green spherical fruits. The structure of the CMAD (Cycle MLP Aggregation Deformation) module is shown in Figure 8. The cycleFC in the figure is derived from the common channelFC, which is designed for communication between channels but has a limited receptive field. To better aggregate contextual information, cycleFC samples points along the channel dimension in a step style, which enlarges the receptive field.
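The difference between channelFC and cycleFC can be illustrated with a one-dimensional sketch over contour vertices. The offset rule used here, (channel mod stepsize) minus half the stepsize with wrap-around along the contour, is an illustrative simplification and is not necessarily the exact sampling rule of the CMAD module; the function names and the toy weights are hypothetical.

```python
def step_offset(channel, stepsize):
    """Vertex offset sampled by a given channel (assumed step-style rule)."""
    return channel % stepsize - stepsize // 2


def channel_fc(vertices, weight):
    """Plain channel FC: every channel is read at the same vertex."""
    return [[sum(x[c] * weight[c][o] for c in range(len(x)))
             for o in range(len(weight[0]))]
            for x in vertices]


def cycle_fc_1d(vertices, weight, stepsize=3):
    """Cycle FC sketch: channel c is read at a neighbouring vertex i + offset(c)."""
    n = len(vertices)
    c_in, c_out = len(vertices[0]), len(weight[0])
    out = []
    for i in range(n):
        # Gather each channel from its own shifted vertex (indices wrap around).
        sampled = [vertices[(i + step_offset(c, stepsize)) % n][c] for c in range(c_in)]
        out.append([sum(sampled[c] * weight[c][o] for c in range(c_in))
                    for o in range(c_out)])
    return out


# Five vertices with three channels each; identity weights expose the sampling pattern.
vertices = [[float(i), 10.0 + i, 20.0 + i] for i in range(5)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(cycle_fc_1d(vertices, identity)[0])  # [4.0, 10.0, 21.0]
```

With identity weights, the output at vertex 0 mixes channel 0 of the previous vertex, channel 1 of the vertex itself, and channel 2 of the next vertex, so the layer sees three neighbouring vertices at the parameter cost of a channel FC.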

After the coarse contours of the two spherical fruits are obtained, they are further refined to obtain the final contours of the green persimmons and green apples. The refinement is performed iteratively with the DeepSnake [33] network. Its backbone consists of eight CirConv-BN-ReLU layers for feature extraction. The fusion module fuses the multi-scale information of all points on the fruit contour: the features are passed through a 1 × 1 convolutional layer followed by a max-pooling operation, and the fused features are then combined with the features of each vertex. The prediction head applies three 1 × 1 convolutional layers to the vertex features and maps them to the offset of each vertex, which yields the final contour of the spherical fruit.
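The circular convolution (CirConv) at the core of these layers treats the contour as a closed loop, so the kernel wraps around from the last vertex to the first. The sketch below shows only this indexing on scalar vertex features, assuming an odd kernel length; the actual layers use learned multi-channel kernels followed by batch normalization and ReLU.

```python
def circular_conv(vertex_features, kernel):
    """1D convolution over a closed contour with wrap-around indexing."""
    n = len(vertex_features)
    half = len(kernel) // 2  # assumes an odd kernel length
    return [sum(kernel[k] * vertex_features[(i + k - half) % n]
                for k in range(len(kernel)))
            for i in range(n)]


# Four contour vertices with one scalar feature each and a 3-tap summing kernel.
print(circular_conv([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0]))  # [7.0, 6.0, 9.0, 8.0]
```

The first output, 7.0, combines the last, first, and second vertices, which shows that no padding is needed at the ends of the vertex list.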

2.2.4. Loss Function

The design of the loss function strongly influences the quality of green spherical fruit segmentation. During training, the loss is minimized by gradient back-propagation over successive iterations so that the model fits the training data, and the optimal model is obtained in this way.

According to the structure of the model in this study, the overall loss function equations are shown in Equations (3) and (4).

L = L_{det} + \alpha L_{init} + \beta L_{coarse} + L_{iter}

(3)

L_{iter} = L_{iter1} + L_{iter2}

(4)

where $\alpha$ and $\beta$ are set to 0.1, $L_{det}$ is the loss of centroid detection, $L_{init}$ is the loss incurred in the contour initialization stage, $L_{coarse}$ is the loss incurred in the coarse contour deformation stage, and $L_{iter}$ is the loss incurred in the contour refinement stage.

The loss of the model for centroid prediction is calculated using Modified Focal loss [34]; the loss calculation for this part of the model specifically is shown in Equation (5).

L_{det} = \frac{-1}{N} \sum_{xyc} \begin{cases} (1 - \hat{Y}_{xyc})^{\alpha} \log (\hat{Y}_{xyc}), & \text{if } Y_{xyc} = 1 \\ (1 - Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log (1 - \hat{Y}_{xyc}), & \text{otherwise} \end{cases}

(5)

where N denotes the number of target centroids, $\hat{Y}_{xyc}$ is the predicted heatmap value at position (x, y) of the c-th channel, and $Y_{xyc}$ is the corresponding ground truth.
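A minimal sketch of this centroid loss on a flattened heatmap is given below. The exponents used for Equation (5) are not stated in the text; the defaults here (2 and 4) are assumed from the common formulation of this loss in keypoint-based detectors and should be read as placeholders. The small `eps` term is added only to keep the logarithm finite.

```python
import math


def modified_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss over flattened heatmap values.

    pred: predicted heatmap values in (0, 1); gt: ground truth values in [0, 1],
    equal to 1 exactly at fruit centroids. alpha and beta are assumed defaults.
    """
    total = 0.0
    num_centers = 0
    for p, y in zip(pred, gt):
        if y == 1.0:
            num_centers += 1
            total += (1.0 - p) ** alpha * math.log(p + eps)
        else:
            # (1 - y)^beta down-weights the penalty near a centroid.
            total += (1.0 - y) ** beta * p ** alpha * math.log(1.0 - p + eps)
    return -total / max(num_centers, 1)  # normalize by the number of centroids N


print(modified_focal_loss([0.5, 0.5], [1.0, 0.0]))
```

Confident, correct predictions contribute almost nothing to the sum, so training concentrates on centroids that are missed and on background positions that are scored too high.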

The losses generated in the initialization stage, the coarse (global) deformation stage, and the first refinement stage are calculated using the Smooth L1 loss [35], as shown in Equations (6) to (8).

L_{init} = \frac{1}{N} \sum_{i = 1}^{N} \mathrm{smooth}_{L1} (\bar{x}_{i}^{init} - x_{i}^{gt})

(6)

L_{coarse} = \frac{1}{N} \sum_{i = 1}^{N} \mathrm{smooth}_{L1} (\bar{x}_{i}^{coarse} - x_{i}^{gt})

(7)

L_{iter1} = \frac{1}{N} \sum_{i = 1}^{N} \mathrm{smooth}_{L1} (\bar{x}_{i}^{iter1} - x_{i}^{gt})

(8)

In the above equations, N is the number of contour vertices, which is set to 128; $\bar{x}_{i}^{init}$ is the i-th predicted initial contour vertex; $\bar{x}_{i}^{coarse}$ is the i-th predicted vertex of the coarse contour; $\bar{x}_{i}^{iter1}$ is the i-th contour vertex after the first deformation in the refinement step; and $x_{i}^{gt}$ is the corresponding labeled contour vertex.
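The three contour losses share one form, which can be sketched as follows. Two details are assumptions rather than statements of the text: the transition point of the Smooth L1 function is taken as 1, and the loss of a vertex is the sum over its x and y coordinates.

```python
def smooth_l1(diff, beta=1.0):
    """Smooth L1 of a scalar difference: quadratic near zero, linear elsewhere."""
    d = abs(diff)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def contour_smooth_l1(pred, gt):
    """Mean Smooth L1 loss over N matched contour vertices given as (x, y) pairs."""
    assert len(pred) == len(gt)
    total = sum(smooth_l1(px - gx) + smooth_l1(py - gy)
                for (px, py), (gx, gy) in zip(pred, gt))
    return total / len(pred)


# Two vertices: one is off by 0.5 px (quadratic branch), one by 2 px (linear branch).
print(contour_smooth_l1([(0.0, 0.0), (2.0, 0.0)], [(0.5, 0.0), (0.0, 0.0)]))  # 0.8125
```

The same function would be evaluated on the initial, coarse, and first refined contours to obtain the three terms of Equations (6) to (8).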

The loss incurred in the second deformation stage in the refinement step is calculated using Dynamic matching loss (DML) [22] with the functional equation shown below.

L_{iter2} = L_{DML} (\bar{x}_{i}^{iter2}, x_{i}^{gt})

(9)

L_{DML} (pred, gt) = \frac{L_{1} (pred, gt) + L_{2} (pred, gt)}{2}

(10)

The DML consists of two parts. First, each predicted point is adjusted toward its nearest point on the ground truth boundary, as shown in the following equations.

x_{i}^{*} = \underset{x}{\arg\min} \| pred_{i}^{in} - gt_{x}^{ipt} \|_{2}

(11)

L_{1} (pred, gt) = \frac{1}{N} \sum_{i = 1}^{N} \| pred_{i}^{out} - gt_{x_{i}^{*}}^{ipt} \|_{1}

(12)

where $pred^{in}$ and $pred^{out}$ are the predicted contour points before and after the deformation, respectively, and $gt^{ipt}$ denotes the interpolated ground truth vertices, so that $gt_{x_{i}^{*}}^{ipt}$ is the interpolated ground truth vertex closest to the i-th predicted contour point.

Second, each key labeled point pulls its nearest predicted point toward its own position, as shown in the following equations.

y_{i}^{*} = \underset{y}{\arg\min} \| pred_{y}^{in} - gt_{i}^{key} \|_{2}

(13)

L_{2} (pred, gt) = \frac{1}{n_{key}} \sum_{i = 1}^{n_{key}} \| pred_{y_{i}^{*}}^{out} - gt_{i}^{key} \|_{1}

(14)
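The two matching directions of Equations (11) to (14) can be sketched as below. This is an illustration under stated assumptions: the ground truth boundary is densified by linear interpolation along each polygon edge (the number of points per edge is a hypothetical parameter), `pred_in` is the contour entering the deformation step and is used only for matching, and `pred_out` is the contour leaving it and is the one penalized.

```python
import math


def interpolate_contour(key_points, per_edge):
    """Densify a closed polygon with per_edge evenly spaced points on each edge."""
    dense = []
    n = len(key_points)
    for i in range(n):
        (x0, y0), (x1, y1) = key_points[i], key_points[(i + 1) % n]
        for k in range(per_edge):
            t = k / per_edge
            dense.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return dense


def l2_dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def l1_dist(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def dynamic_matching_loss(pred_in, pred_out, gt_key, per_edge=10):
    gt_ipt = interpolate_contour(gt_key, per_edge)
    # Part 1, Equations (11)-(12): each prediction is matched to its nearest
    # interpolated ground truth point.
    loss_1 = 0.0
    for p_in, p_out in zip(pred_in, pred_out):
        target = min(gt_ipt, key=lambda g: l2_dist(p_in, g))
        loss_1 += l1_dist(p_out, target)
    loss_1 /= len(pred_in)
    # Part 2, Equations (13)-(14): each key labeled point is matched to its
    # nearest prediction.
    loss_2 = 0.0
    for g in gt_key:
        y_star = min(range(len(pred_in)), key=lambda y: l2_dist(pred_in[y], g))
        loss_2 += l1_dist(pred_out[y_star], g)
    loss_2 /= len(gt_key)
    return (loss_1 + loss_2) / 2.0  # Equation (10)


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
shifted = [(x + 1.0, y) for x, y in square]
print(dynamic_matching_loss(square, shifted, square))  # 1.0
```

Matching on the input contour while penalizing the output contour means the targets stay fixed during the step, and the second term keeps sharp key points of the annotation from being smoothed away.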

3. Experimental Setup and Result Analysis

3.1. Experimental Platform

The server used for model training runs Ubuntu 16.04 with an NVIDIA A30 graphics card and CUDA 11.1. The models are implemented in Python with the PyTorch 1.8 [36] deep learning library. In the comparison experiments, several models are trained using the corresponding modules of MMDetection [37].

Before formal training, weights pre-trained on the ImageNet dataset [38] were transferred to the model as initialization parameters to accelerate convergence and improve the robustness of the whole model. In the formal training phase, the model parameters were updated with the Adam optimizer, with the learning rate and weight decay set to 0.0001 and 0.0005, respectively. The model was trained for 140 epochs; the parameters were saved every 5 epochs, and the model was evaluated on the validation set after every epoch. The trend of the loss during training is shown by the loss curve in Figure 9, where the x-axis is the number of iterations and the y-axis is the corresponding loss value.

3.2. Assessment Indicators

To evaluate the performance of the model comprehensively, this paper uses several assessment indicators, the main one being the mean average precision (mAP) of detection and segmentation. Precision P is the proportion of predicted positive samples that are truly positive, calculated as shown in Equation (15), and recall R is the proportion of actual positive samples that are correctly predicted, calculated as shown in Equation (16).

P = \frac{TP}{TP + FP}

(15)

R = \frac{TP}{TP + FN}

(16)

where TP, FP, and FN are the numbers of true positive, false positive, and false negative samples, respectively. The intersection over union (IoU) between a predicted region A and a ground truth region B is defined in Equation (17), and the average precision AP at a particular IoU threshold is calculated as shown in Equation (18).

IoU = \frac{|A \cap B|}{|A \cup B|}

(17)

AP^{IoU = i} = \frac{1}{101} \sum_{r \in R} \max_{\tilde{r} : \tilde{r} \geq r} p(\tilde{r})

(18)

where i is the IoU threshold, taken from the set I = [0.5, 0.55, 0.6, …, 0.95] (10 discrete values in total), and p(r) denotes the precision at recall r, with r ∈ R = [0, 0.01, 0.02, …, 1] (101 discrete values in total). Averaging the ten AP values at the different IoU thresholds (AP^{IoU = i}) gives the final AP metric defined in Equation (19).

AP = \frac{1}{10} \sum_{i \in I} AP^{IoU = i}

(19)

mAP = \frac{1}{N} \sum_{i = 1}^{N} AP_{i}

(20)
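Equations (15) to (19) can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation code used in the paper: the function names are hypothetical, the IoU is computed on axis-aligned boxes for simplicity (segmentation metrics would use mask or contour regions instead), and the precision-recall points are assumed to be given.

```python
def precision(tp, fp):
    """Equation (15): fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (16): fraction of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def box_iou(a, b):
    """Equation (17) for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ap_at_iou(recalls, precisions):
    """Equation (18): 101-point interpolated AP at one IoU threshold.

    recalls and precisions are matching points on the precision-recall curve.
    """
    total = 0.0
    for k in range(101):
        r = k / 100
        # Best precision among all points whose recall is at least r.
        candidates = [p for rr, p in zip(recalls, precisions) if rr >= r]
        total += max(candidates) if candidates else 0.0
    return total / 101


def ap_over_thresholds(ap_by_iou):
    """Equation (19): mean of the per-threshold AP values."""
    return sum(ap_by_iou.values()) / len(ap_by_iou)


# The ten IoU thresholds 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Toy precision-recall curve: precision 1.0 up to recall 0.5, then 0.5.
toy_ap = ap_at_iou([0.5, 1.0], [1.0, 0.5])
print(round(toy_ap, 4))
```

For the toy curve, 51 of the 101 recall levels see a best precision of 1.0 and the remaining 50 see 0.5, so the interpolated AP is 76/101.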

In addition, to evaluate the method more comprehensively, several other indicators are used: AR is the average recall; $AP^{IoU = 0.5}$ is the AP value at an IoU threshold of 0.5; $AP^{IoU = 0.75}$ is the AP value at an IoU threshold of 0.75; and $AP_{S}$, $AP_{M}$, and $AP_{L}$ are the average precision for small-scale, medium-scale, and large-scale target fruits, respectively. A target is small-scale if its ground-truth box area is less than or equal to $32^{2}$ pixels, medium-scale if the area is greater than $32^{2}$ and less than or equal to $96^{2}$ pixels, and large-scale if the area is greater than $96^{2}$ pixels.
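The size split used for $AP_{S}$, $AP_{M}$, and $AP_{L}$ can be written as a simple rule on the ground-truth box area. The sketch below follows the boundaries stated above; the function and constant names are illustrative assumptions.

```python
SMALL_MAX_AREA = 32 ** 2    # 1024 pixels
MEDIUM_MAX_AREA = 96 ** 2   # 9216 pixels


def size_category(area):
    """Map a ground-truth box area (in pixels) to a scale bucket."""
    if area <= SMALL_MAX_AREA:
        return "small"
    if area <= MEDIUM_MAX_AREA:
        return "medium"
    return "large"


# Example: a 40 x 40 pixel box has area 1600, which falls in the medium bucket.
print(size_category(40 * 40))
```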

3.3. Model Segmentation Effect and Analysis

After training, the overall performance of the model is assessed with the evaluation metrics above. The detailed results on the green persimmon and green apple validation sets are shown in Table 2, and the curve of the segmentation mAP is shown in Figure 10. In addition, multiple green persimmon and green apple images captured under mixed interference conditions, such as nighttime, after rain, fruit overlap, and occlusion by branches and leaves, were selected for segmentation and analysis. The segmentation results for green persimmons and green apples are shown in Figure 11 and Figure 12, respectively.

Table 2 reports the average precision and average recall of detection and segmentation on the green persimmon and green apple datasets for different target fruit sizes and IoU thresholds. For green persimmons, the model reaches a segmentation AP of 75.5% with an AR of 80.4%, and a detection AP of 75.1% with an AR of 80.3%. For green apples, it reaches a segmentation AP of 57.8% with an AR of 64.0%, and a detection AP of 60.3% with an AR of 65.8%. Most target fruits in the images are recognized, and both target localization and contour segmentation are accurate. Segmentation is most accurate for large fruits: the segmentation AP for large targets reaches 91.6% on the green persimmon dataset and 90.0% on the green apple dataset. For small fruits, the segmentation AP on green apples is 36.4%, which leaves room for improvement. The lower segmentation AP for small green persimmons (25.0%) may be attributed to their limited representation in the dataset (only 13% of samples). Future studies should address this imbalance by adding more small persimmon samples.

As can be seen in Figure 11 and Figure 12, factors such as occlusion, overlap, rain, and varying illumination make detection and segmentation more difficult, yet the model still performs well. Some target fruits that were not labeled in the annotations are also detected and segmented, with accurate contours. Fruits that are difficult to detect, such as those blocked by leaves or overlapping with other fruits, are also identified, and their boundary contours are accurate and smooth, giving precise localization and high-quality segmentation.

The model achieves good segmentation results on both green spherical fruits: most occluded or overlapping fruits are segmented, and the contours are accurate and smooth. However, when the background leaves are very close in color to the fruit and a large area of the fruit is occluded, the fruit may not be recognized. This can lead to missed detections or incorrect contours, as shown in Figure 13, and remains a problem for further optimization.

3.4. Algorithms Comparison

To further illustrate the effectiveness of the model for target fruit segmentation, it is compared with several common segmentation models: YOLACT [39], MS_RCNN [40], Mask_RCNN [41], BoxInst [42], and CondInst [43] for mask segmentation, and BoxSnake [44], BoundaryFormer [45], Yolov8-Seg, and E2EC for contour segmentation. All models are compared on the same evaluation metrics, and the results are shown in Table 3.

Table 3 shows the evaluation results of the different models on the two green spherical fruit datasets. The segmentation mAP of this model on green persimmons and green apples is 75.5% and 57.8%, respectively. This is higher than that of the other contour segmentation algorithms (BoxSnake, BoundaryFormer, Yolov8-Seg, and E2EC) and also higher than that of the mask segmentation models. The detection mAP of this model on the persimmon and apple datasets is 75.1% and 60.3%, respectively; this is the highest on the persimmon dataset and within 0.4 percentage points of the best result on the apple dataset (60.7% for MS_RCNN).
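The margins behind this comparison can be checked with a short calculation. The values below are copied from a few rows of Table 3; the dictionary layout and the function name are illustrative only.

```python
# (segmentation mAP, detection mAP) in percent, copied from Table 3.
PERSIMMON = {
    "Mask_RCNN": (72.1, 72.4),
    "Yolov8-Seg": (74.4, 73.2),
    "E2EC": (73.3, 72.7),
    "Ours": (75.5, 75.1),
}
APPLE = {
    "MS_RCNN": (57.0, 60.7),
    "Mask_RCNN": (57.2, 60.2),
    "E2EC": (56.6, 58.5),
    "Ours": (57.8, 60.3),
}


def margin_over(results, baseline, ours="Ours"):
    """Difference (ours - baseline) in percentage points, rounded to 0.1."""
    return tuple(round(o - b, 1) for o, b in zip(results[ours], results[baseline]))


print(margin_over(PERSIMMON, "E2EC"))   # gain over the E2EC baseline
print(margin_over(APPLE, "E2EC"))
print(margin_over(APPLE, "MS_RCNN"))    # detection margin is negative here
```

Relative to the E2EC baseline, the gains are 2.2 and 2.4 points (segmentation, detection) on persimmons and 1.2 and 1.8 points on apples, while MS_RCNN retains a 0.4-point lead in apple detection.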

In addition, as can be seen from Figure 14, the model segments the contour edges of the target green spherical fruits more accurately and smoothly than the other segmentation models. Many target fruits that were not labeled initially are still detected and segmented, even when they are occluded or overlapping. In summary, the model achieves good detection and segmentation accuracy for persimmons and apples, which can improve the vision systems of agricultural automation equipment and has practical value for handling branch and leaf occlusion in orchard picking operations.

4. Discussion of Results

The proposed optimized contour segmentation model demonstrated significant improvements in the detection and segmentation of green spherical fruits, achieving an average precision (AP) of 75.5% and an average recall (AR) of 80.4% for green persimmons and 57.8% AP and 64.0% AR for green apples. These results highlight the model’s robustness in handling complex orchard environments including challenges such as occlusions, varying lighting conditions, and color similarities between fruits and foliage. The integration of PAB FPN for feature fusion and the CMAD module for contour deformation played a pivotal role in enhancing the segmentation accuracy, particularly for large-scale fruits, where the AP reached 91.6% for persimmons and 90.0% for apples.

However, the model exhibited limitations in segmenting small-scale fruits, especially in the persimmon dataset, where the AP for small targets was only 25.0%, primarily due to their underrepresentation in the training data. Segmentation performance on small fruits could be improved by expanding the diversity of small-fruit images and optimizing the feature pyramid for scale invariance. Furthermore, instances of severe occlusion or extreme color similarity between fruits and leaves led to occasional missed detections, which may be addressed through targeted data augmentation techniques focusing on these challenging cases. Future work could explore advanced lightweight networks or additional attention mechanisms to address these challenges. The practical implications of this study are significant for smart agriculture, as the model’s high accuracy and real-time performance can enhance the efficiency of automated harvesting systems. By improving the vision systems of agricultural robots, this technology can reduce labor costs and increase productivity in orchards.

5. Conclusions

In this paper, we propose an optimized contour segmentation model for the accurate segmentation of green spherical fruits in complex orchard environments. The model integrates several key technical innovations to address practical challenges in fruit identification. The PAB FPN module enhances feature extraction from the DLA34 backbone network, effectively balancing feature semantics through embedded attention mechanisms while refining the feature layer to obtain more precise target fruit information. During contour segmentation, the cycleMLP-optimized initial contour deformation module (CMAD) demonstrates superior performance by aggregating and deforming the initial contours, expanding receptive fields, and better integrating contextual information, ultimately yielding optimal fruit contour segmentation results.

While the current dataset provides adequate validation, future work will expand it to include more challenging cases like small fruits and heavily shaded targets. We plan to develop specialized augmentation techniques to better simulate difficult field conditions, particularly for partially hidden fruits and those with background-like coloration. Additionally, we will explore temporal analysis methods using sequential image data to improve the tracking of occluded fruits. These enhancements will be validated through extended field testing across various seasonal conditions, ultimately creating a more robust system for real-world orchard environments.

Author Contributions

Conceptualization, T.Z., Y.X., and W.J.; methodology, T.Z. and Y.X.; software, K.C.; validation, K.C. and X.C.; formal analysis, T.Z. and Q.L.; investigation, K.C. and X.C.; resources, Y.X.; data curation, X.C.; writing—original draft preparation, T.Z. and Y.X.; writing—review and editing, Y.X. and W.J.; visualization, Q.L.; project administration, W.J.; funding acquisition, W.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by Natural Science Foundation of Shandong Province in China (No.: ZR2020MF076); Young Innovation Team Program of Shandong Provincial University (No.: 2022KJ250); and the New Twentieth Items of Universities in Jinan (2021GXRC049).

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Anjali; Jena, A.; Bamola, A.; Mishra, S.; Jain, I.; Pathak, N.; Sharma, N.; Joshi, N.; Pandey, R.; Kaparwal, S.; et al. State-of-the-art non-destructive approaches for maturity index determination in fruits and vegetables: Principles, applications, and future directions. Food Prod. Process. Nutr. 2024, 6, 56.
2. Costa, L.; Ampatzidis, Y.; Rohla, C.; Maness, N.; Cheary, B.; Zhang, L. Measuring pecan nut growth utilizing machine vision and deep learning for the better understanding of the fruit growth curve. Comput. Electron. Agric. 2021, 181, 105964.
3. Tang, Y.; Chen, M.; Wang, C.; Luo, L.; Li, J.; Lian, G.; Zou, X. Recognition and localization methods for vision-based fruit picking robots: A review. Front. Plant Sci. 2020, 11, 510.
4. Hou, G.; Chen, H.; Jiang, M.; Niu, R. An overview of the application of machine vision in recognition and localization of fruit and vegetable harvesting robots. Agriculture 2023, 13, 1814.
5. Wei, X.; Jia, K.; Lan, J.; Li, Y.; Zeng, Y.; Wang, C. Automatic method of fruit object extraction under complex agricultural background for vision system of fruit picking robot. Opt. Int. J. Light Electron Opt. 2014, 125, 5684–5689.
6. Xiao, F.; Wang, H.; Li, Y.; Cao, Y.; Lv, X.; Xu, G. Object detection and recognition techniques based on digital image processing and traditional machine learning for fruit and vegetable harvesting robots: An overview and review. Agronomy 2023, 13, 639.
7. Luo, J.; Li, B.; Leung, C. A survey of computer vision technologies in urban and controlled-environment agriculture. ACM Comput. Surv. 2023, 56, 1–39.
8. Tian, H.; Wang, T.; Liu, Y.; Qiao, X.; Li, Y. Computer vision technology in agricultural automation—A review. Inf. Process. Agric. 2020, 7, 1–19.
9. Apolo-Apolo, O.; Martínez-Guanter, J.; Egea, G.; Raja, P.; Pérez-Ruiz, M. Deep learning techniques for estimation of the yield and size of citrus fruits using a UAV. Eur. J. Agron. 2020, 115, 126030.
10. Qiu, C.; Tian, G.; Zhao, J.; Liu, Q.; Xie, S.; Zheng, K. Grape Maturity Detection and Visual Pre-Positioning Based on Improved Yolov4. Electronics 2022, 11, 2677.
11. Liu, X.; Zhao, D.; Jia, W.; Ji, W.; Ruan, C.; Sun, Y. Cucumber fruits detection in greenhouses based on instance segmentation. IEEE Access 2019, 7, 139635–139642.
12. Jiang, S.; Liu, Z.; Hua, J.; Zhang, Z.; Zhao, S.; Xie, F.; Ao, J.; Wei, Y.; Lu, J.; Li, Z.; et al. A Real-Time Detection and Maturity Classification Method for Loofah. Agronomy 2023, 13, 2144.
13. Mhamed, M.; Zhang, Z.; Yu, J.; Li, Y.; Zhang, M. Advances in apple’s automated orchard equipment: A Comprehensive Research. Comput. Electron. Agric. 2024, 221, 108926.
14. Arefi, A.; Motlagh, A.M.; Mollazade, K.; Teimourlou, R.F. Recognition and localization of ripen tomato based on machine vision. Aust. J. Crop Sci. 2011, 5, 1144–1149.
15. Sun, M.; Xu, L.; Chen, X.; Ji, Z.; Zheng, Y.; Jia, W. BFP net: Balanced feature pyramid network for small apple detection in complex orchard environment. Plant Phenomics 2022, 2022, 9892464.
16. Kang, H.; Chen, C. Fruit detection, segmentation and 3D visualisation of environments in apple orchards. Comput. Electron. Agric. 2020, 171, 105302.
17. Jia, W.; Tian, Y.; Luo, R.; Zhang, Z.; Lian, J.; Zheng, Y. Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot. Comput. Electron. Agric. 2020, 172, 105380.
18. Mu, Y.; Chen, T.-S.; Ninomiya, S.; Guo, W. Intact detection of highly occluded immature tomatoes on plants using deep learning techniques. Sensors 2020, 20, 2984.
19. Sun, M.; Xu, L.; Luo, R.; Lu, Y.; Jia, W. GHFormer-Net: Towards more accurate small green apple/begonia fruit detection in the nighttime. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 4421–4432.
20. Ling, H.; Gao, J.; Kar, A.; Chen, W.; Fidler, S. Fast interactive object annotation with curve-gcn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5257–5266.
21. Jia, W.; Liu, M.; Luo, R.; Wang, C.; Pan, N.; Yang, X.; Ge, X. YOLOF-Snake: An efficient segmentation model for green object fruit. Front. Plant Sci. 2022, 13, 765523.
22. Liu, Z.; Liew, J.H.; Chen, X.; Feng, J. Dance: A deep attentive contour model for efficient instance segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 345–354.
23. Zhang, T.; Wei, S.; Ji, S. E2ec: An end-to-end contour-based method for high-quality high-speed instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4443–4452.
24. Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 2008, 77, 157–173.
25. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
26. Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2403–2412.
27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
28. Wang, C.Y.; Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
29. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
30. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 821–830.
31. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803.
32. Chen, S.; Xie, E.; Ge, C.; Chen, R.; Liang, D.; Luo, P. Cyclemlp: A mlp-like architecture for dense prediction. arXiv 2021, arXiv:2107.10224.
33. Peng, S.; Jiang, W.; Pi, H.; Li, X.; Bao, H.; Zhou, X. Deep snake for real-time instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8533–8542.
34. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.
35. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
36. Paszke, A. Pytorch: An imperative style, high-performance deep learning library. arXiv 2019, arXiv:1912.01703.
37. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155.
38. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
39. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166.
40. Zhang, Y.; Chu, J.; Leng, L.; Miao, J. Mask-refined R-CNN: A network for refining object details in instance segmentation. Sensors 2020, 20, 1010.
41. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
42. Tian, Z.; Shen, C.; Wang, X.; Chen, H. Boxinst: High-performance instance segmentation with box annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 5443–5452.
43. Tian, Z.; Shen, C.; Chen, H. Conditional convolutions for instance segmentation. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 282–298.
44. Yang, R.; Song, L.; Ge, Y.; Li, X. BoxSnake: Polygonal Instance Segmentation with Box Supervision. arXiv 2023, arXiv:2303.11630.
45. Lazarow, J.; Xu, W.; Tu, Z. Instance segmentation with mask-supervised polygonal boundary transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4382–4391.

Figure 1. Image of green persimmon in different situations.

Figure 2. Image of green apple in different situations.

Figure 3. Overall structure of the segmentation model.

Figure 4. Structure of DLA34 of the backbone network.

Figure 5. Path aggregation balanced pyramid structure diagram.

Figure 6. Structure of non-localized attention.

Figure 7. Contour segmentation workflow diagram.

Figure 8. CMAD module structure diagram.

Figure 9. Loss function values for green spherical fruits.

Figure 10. Segmentation mAP values of green spherical fruits.

Figure 11. Segmentation effect of green persimmon.

Figure 12. Segmentation effect of green apple.

Figure 13. Image segmented incorrectly.

Figure 14. Segmentation effects of each model on both datasets.

Table 1. The divided results of datasets by area size of fruit.

Apple Dataset
Area	Small	Medium	Large	Fruit Total	Image Total
Train	1701/34%	2007/41%	1235/25%	4943	953
Val	851/39%	816/37%	527/24%	2194	408
Total	2552/36%	2823/39%	1762/25%	7137	1361
Persimmon Dataset
Area	Small	Medium	Large	Fruit Total	Image Total
Train	272/15%	1111/59%	482/26%	1865	388
Val	47/7%	415/63%	197/30%	659	165
Total	319/13%	1526/60%	679/27%	2524	553

Table 2. Indicator values of the model on two green spherical fruits.

Persimmon				Apple
Segm		Bbox		Segm		Bbox
Metric	Value/%	Metric	Value/%	Metric	Value/%	Metric	Value/%
$mAP$	75.5	$mAP$	75.1	$mAP$	57.8	$mAP$	60.3
$mAP_{50}$	91.3	$mAP_{50}$	90.7	$mAP_{50}$	84.1	$mAP_{50}$	85.4
$mAP_{S}$	25.0	$mAP_{S}$	28.8	$mAP_{S}$	36.4	$mAP_{S}$	39.9
$mAP_{M}$	76.6	$mAP_{M}$	76.9	$mAP_{M}$	65.1	$mAP_{M}$	68.4
$mAP_{L}$	91.6	$mAP_{L}$	90.6	$mAP_{L}$	90.0	$mAP_{L}$	91.5
$mAR$	80.4	$mAR$	80.3	$mAR$	64.0	$mAR$	65.8
$mAR_{S}$	45.7	$mAR_{S}$	41.0	$mAR_{S}$	47.5	$mAR_{S}$	49.2
$mAR_{M}$	81.7	$mAR_{M}$	82.3	$mAR_{M}$	71.1	$mAR_{M}$	73.8
$mAR_{L}$	94.5	$mAR_{L}$	94.4	$mAR_{L}$	92.8	$mAR_{L}$	93.8

Table 3. Recognition results for each model on the two datasets.

Methods	Persimmon Dataset		Apple Dataset
Methods	mAP^s/%	mAP^d/%	mAP^s/%	mAP^d/%
BoxSnake	62.6	67.6	50.6	59.5
Bounda	65.5	66.4	53.4	56.2
BoxInst	68.9	71.5	50.8	59.7
YOLACT	63.1	59.2	50.1	51.8
CondInst	70.7	72.2	56.6	60.6
MS_RCNN	71.7	71.8	57.0	60.7
Mask_RCNN	72.1	72.4	57.2	60.2
Yolov8-Seg	74.4	73.2	57.0	59.1
E2EC	73.3	72.7	56.6	58.5
Ours	75.5	75.1	57.8	60.3

Note: mAP^s is the average segmentation accuracy, and mAP^d is the average detection accuracy.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
