Article

CabbageNet: Deep Learning for High-Precision Cabbage Segmentation in Complex Settings for Autonomous Harvesting Robotics

Yongqiang Tian, Xinyu Cao, Taihong Zhang, Huarui Wu, Chunjiang Zhao and Yunjie Zhao

1 School of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China
2 Ministry of Education Engineering Research Center for Intelligent Agriculture, Urumqi 830052, China
3 Xinjiang Agricultural Informatization Engineering Technology Research Center, Urumqi 830052, China
4 National Engineering Research Center for Information Technology in Agriculture, Beijing 100125, China
5 Key Laboratory of Digital Village Technology, Ministry of Agriculture and Rural Affairs, Beijing 100125, China
* Authors to whom correspondence should be addressed.
Sensors 2024, 24(24), 8115; https://doi.org/10.3390/s24248115
Submission received: 25 November 2024 / Revised: 17 December 2024 / Accepted: 18 December 2024 / Published: 19 December 2024
(This article belongs to the Section Smart Agriculture)

Abstract

Reducing damage and missed harvest rates is essential for improving efficiency in unmanned cabbage harvesting. Accurate real-time segmentation of cabbage heads can significantly alleviate these issues and enhance overall harvesting performance. However, the complexity of the growing environment and the morphological variability of field-grown cabbage present major challenges to achieving precise segmentation. This study proposes an improved YOLOv8n-seg network to address these challenges effectively. Key improvements include modifying the baseline model’s final C2f module and integrating deformable attention with dynamic sampling points to enhance segmentation performance. Additionally, an ADown module minimizes detail loss from excessive downsampling by using depthwise separable convolutions to reduce parameter count and computational load. To improve the detection of small cabbage heads, a Small Object Enhance Pyramid based on the PAFPN architecture is introduced, significantly boosting performance for small targets. The experimental results show that the proposed model achieves a Mask Precision of 92.2%, Mask Recall of 87.2%, and Mask mAP50 of 95.1%, while maintaining a compact model size of only 6.46 MB. These metrics indicate superior accuracy and efficiency over mainstream instance segmentation models, facilitating real-time, precise cabbage harvesting in complex environments.

1. Introduction

Cabbage is one of the most extensively cultivated vegetable crops in China, which leads the world in planting area and output, at approximately 900,000 hectares and 35 million tons [1]. The complexity of the environmental conditions encountered during cabbage harvesting significantly limits the efficiency and precision of mechanical harvesting, leaving the process highly dependent on manual labor and physical resources [2,3]. The close integration of artificial intelligence with agricultural practice has gradually become a defining trend in the evolution of contemporary agriculture, placing smart agriculture at the forefront of this advancement [4]. For cabbage harvesting in particular, accurate recognition and segmentation of cabbage heads are the cornerstone technologies for automated harvesting. This capability promises to enable unmanned, automated cabbage harvesting, reduce production costs and labor intensity, and simultaneously improve harvesting precision and efficiency [5]. It also supports non-destructive prediction of cabbage yield. Consequently, the study of technologies for identifying and segmenting cabbage heads is of paramount significance.
Traditional image segmentation techniques rely on threshold selection, edge detection, and region growing to partition an image; they perform reasonably well in simple scenes, but achieving accurate segmentation in complex scenes remains a challenge [6]. The development of machine vision technology has led to the maturation of deep learning-based image segmentation, with an increasing number of segmentation algorithms being proposed [7]. In contrast to traditional approaches, deep learning-based algorithms can capture more complex and detailed texture characteristics of the target object by training neural network models, thereby markedly improving segmentation performance in a variety of complex environments [8].
Deep learning-based image segmentation techniques, such as Mask R-CNN [9], YOLACT [10], and ConvNeXt V2 [11], have demonstrated strong performance in agricultural robotics, particularly in crop, vegetable, and fruit detection for automated harvesting [12]. Shuo Kang et al. [13] employed DeepLabV3+ for broccoli head detection, achieving 57.9% mIoU and 98.56% pixel accuracy, but their model requires 0.75 s per image, limiting real-time applicability. Similarly, Pieter M. Blok et al. [14,15] employed an Occlusion Region-based Convolutional Neural Network (ORCNN) for broccoli segmentation, achieving a 6.4 mm dimensional discrepancy, but its slow processing prevents real-time deployment in field environments. Hanwen Kang et al. [16] proposed a Geometry-Aware (A3N) network for fruit identification in orchards, achieving an 87.3% instance segmentation accuracy in 35 ms. However, its low accuracy and focus on orchard environments limit its use in more complex settings. Lei Shen et al. [17] integrated attention mechanisms with Mask R-CNN for grape cluster segmentation, achieving 59.1% accuracy, but the model’s low frame rate and accuracy hinder real-world, high-speed applications. Similarly, Dandan Wang et al. [18] proposed an enhanced Mask R-CNN for apple segmentation, with 93.2% accuracy for bounding box detection, but its 0.27 s/image (3.7 fps) processing time is insufficient for real-time tasks. Gabriel Coll-Ribes et al. [19] used CNN-based instance segmentation and monocular depth estimation for grape bunch and peduncle segmentation. Despite its effectiveness, the 0.5 FPS frame rate is inadequate for real-time dynamic environments. Olarewaju Mubashiru Lawal [20] developed the YOLOv5n-based YOLOv5-LiNet for cucurbit fruit segmentation, achieving 88.5% accuracy with real-time capabilities, but it struggles with occlusion, limiting its performance in complex scenarios. Yajun Li et al. [21] proposed an MTA-YOLACT-based algorithm for tomato cluster segmentation, achieving high accuracy, but its demand for high agronomic precision limits its applicability in open-field cabbage harvesting. Nils Lüling et al. [22,23,24] designed a Mask R-CNN algorithm for cabbage volume and leaf area assessment, achieving 92.6% and 89.8% accuracy, but its processing speed hinders real-time use. Masaki Asano et al. [25] employed an SSD-based approach for autonomous cabbage harvesting, but it suffers from low recognition accuracy. Peichao Cong et al. [26] integrated Swin Transformer attention mechanisms into Mask R-CNN for sweet pepper segmentation, but its speed is inadequate for deployment in open-field agricultural settings that require real-time processing with limited computational resources. Huarui Wu et al. [27] proposed a UperNet model with a Swin Transformer backbone for Brassica napus segmentation, achieving 91.2% mIoU and 95.2% pixel accuracy, but its processing speed is too slow for use in real-time harvesting systems that require quick and efficient decision-making. Weikuan Jia et al. [28] enhanced SOLO with ResNeSt and FPN, achieving 94.84% recall and 96.16% precision for persimmon and apple segmentation, but its computational demands make it difficult to implement in real-time agricultural scenarios. Xing Sheng et al. [29] proposed EdgeSegNet for fruit segmentation in complex scenes, achieving 90.9% and 94.2% mIoU for apple and peach, but its processing speed limits real-time performance.
The studies mentioned above have improved upon deep learning-based segmentation algorithms to achieve fruit and leaf segmentation. However, accurately segmenting fruits and vegetables in complex backgrounds, especially when they are in motion, remains a significant challenge. In particular, the segmentation of cabbage heads during the harvesting process is further complicated by the need for high accuracy, real-time performance, and computational efficiency. Achieving precise, efficient, and lightweight segmentation in dynamic and cluttered field environments requires models that balance segmentation accuracy with processing speed to meet the demands of automated harvesting. To address these challenges, this study proposes an enhanced YOLOv8n [30] instance segmentation model, “CabbageNet”. Designed to overcome issues such as inaccurate detection of moving cabbage heads, inefficiencies, and misidentification or missed detection of small cabbages, CabbageNet ensures precise and efficient segmentation of cabbage heads in complex harvesting scenes. The main contributions of this study are as follows:
  • Construction of a cabbage instance segmentation dataset: Cabbage images from the harvest period were collected using image acquisition devices and search engines. After rigorous selection and processing, 10,000 images were annotated, creating a high-quality cabbage head instance segmentation dataset.
  • Integration of deformable attention into the C2f module: The C2f module was enhanced through the incorporation of deformable attention and dynamic sampling points, forming a deformable attention module. This enables the model to better adapt to varying image sizes and content, thereby improving accuracy and efficiency in instance segmentation tasks.
  • Improvement of the downsampling process using the ADown module: The ADown module employs an adaptive mechanism to retain essential information and capture higher-level image features. The model efficiently handles objects of varying sizes through multi-scale feature fusion, improving both accuracy and robustness in instance segmentation tasks.
  • Enhancement of small object segmentation using the SOEP module: The Small Object Enhance Pyramid (SOEP) applies space-to-depth convolution (SPD-Conv) to the P2 layer to extract richer small-object features. Combined with CSP-OmniKernel feature aggregation, SOEP effectively preserves critical small-object information, significantly improving segmentation performance for small objects.

2. Materials and Methods

2.1. Image Acquisition

In this study, the collection of cabbage head image data was divided into two main parts. The first part includes 6000 images, all collected on site. The collection sites included Jintaiyang Farm in Changping District, Beijing, the National Precision Agriculture Research Demonstration Base in Xiaotangshan, and the experimental base in Canal District, Cangzhou, Hebei Province. The collection focused on the Zhonggan-21 cabbage variety, a widely grown cultivar in North China. To ensure the diversity and comprehensiveness of the dataset, images were selected under various climatic conditions, illumination levels, shooting angles, and cabbage growth stages. The image resolution was set to 3024 × 3024 to ensure that the image quality met the study’s specifications. The second part comprised 4000 images obtained from the Internet through web crawling, thereby expanding the breadth and diversity of the data. For this part, keywords related to cabbage heads were first identified, followed by thorough Internet searches using these keywords, which ensured a rich quantity of images. After acquiring over 20,000 original images in this way, a series of stringent screening procedures was undertaken to guarantee the content, clarity, and relevance of the images, greatly enhancing their scientific relevance and research value. Ultimately, 4000 images were carefully selected from this initial pool, adhering to the study’s requirements.
Throughout the data collection process, we endeavored to encompass all scenarios encountered during actual harvesting, ensuring the comprehensiveness and breadth of the data. Following a meticulous preliminary phase of data organization and rigorous screening, we compiled a collection of 10,000 cabbage head images, each of which meets the standards for dataset construction in scientific research.

2.2. Dataset Establishment

The construction of the dataset is crucially important for model training. Initially, the filtered images were uniformly resized to 640 × 640 pixels. Instance segmentation labels were then annotated with X-AnyLabeling (v2.3.5, by Wei Wang, CVHub organization, published on GitHub) [31]. Images before and after labeling are shown in Figure 1. The annotated JSON files were subsequently converted into both YOLO and COCO dataset formats: the YOLO-format dataset was used to train the model proposed in this paper, while the COCO format was used for comparative experiments with other models. The converted dataset was randomly partitioned into training, validation, and testing sets at a ratio of 6:2:2. The training set was used to fit the data distribution, determining model parameters such as weights and biases; the validation set was used to tune parameters and hyperparameters, optimizing performance and preventing overfitting; and the testing set was used to evaluate the generalization capability of the model.
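The paper does not publish its data-preparation scripts, so the following is only a minimal sketch of how the 6:2:2 random split described above could be reproduced; the directory names and the use of JPEG files are assumptions for illustration, not details from the original pipeline.

```python
import random
import shutil
from pathlib import Path

# Hypothetical directory layout; the original file organization is not specified in the paper.
IMAGES_DIR = Path("dataset/images")    # 640x640 images
LABELS_DIR = Path("dataset/labels")    # YOLO-format .txt annotations exported from X-AnyLabeling
OUTPUT_DIR = Path("dataset/splits")
SPLITS = {"train": 0.6, "val": 0.2, "test": 0.2}   # 6:2:2 ratio used in the paper

def split_dataset(seed: int = 0) -> None:
    """Randomly partition image/label pairs into train, val, and test subsets."""
    images = sorted(IMAGES_DIR.glob("*.jpg"))
    random.Random(seed).shuffle(images)

    n = len(images)
    n_train = int(n * SPLITS["train"])
    n_val = int(n * SPLITS["val"])
    subsets = {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }

    for name, files in subsets.items():
        img_out = OUTPUT_DIR / name / "images"
        lbl_out = OUTPUT_DIR / name / "labels"
        img_out.mkdir(parents=True, exist_ok=True)
        lbl_out.mkdir(parents=True, exist_ok=True)
        for img in files:
            shutil.copy(img, img_out / img.name)
            label = LABELS_DIR / (img.stem + ".txt")
            if label.exists():          # background images may have no label file
                shutil.copy(label, lbl_out / label.name)

if __name__ == "__main__":
    split_dataset()
```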

2.3. Network Structure

The YOLO (You Only Look Once) algorithm is a one-stage object detection method. YOLOv8 extends it with a variant designed for segmentation, YOLOv8-seg, which simultaneously predicts bounding boxes, class probabilities, and pixel-level masks. Its architecture includes two distinct heads: a Detection Head for bounding box coordinates and class probabilities, and a Segmentation Head for generating pixel-level masks. Given the real-time requirements of unmanned cabbage harvesting, this paper adopts YOLOv8n-seg as the baseline model and proposes CabbageNet. Segmentation accuracy and efficiency for targets of various sizes are enhanced by integrating a deformable attention mechanism into the C2f module to form the C2f deformable attention (C2f-DAttention) module. Multi-scale feature fusion is achieved by introducing the ADown module, which improves segmentation accuracy across target sizes and ensures robust performance in the instance segmentation task. Finally, the SOEP module strengthens the segmentation of small objects, enabling effective segmentation of cabbage heads against complex harvesting backgrounds. Figure 2 illustrates the structure of the CabbageNet network, detailing its architecture and components.

2.3.1. C2f Deformable Attention Block

In this paper, the Bottleneck of the final C2f module is enhanced by incorporating a deformable attention module [32,33], as shown in Figure 3, which introduces a deformable attention mechanism with dynamic sampling points to improve model performance. A traditional Transformer uses standard self-attention, which processes all pixels in the image and therefore incurs substantial computation. To address this, we adopt deformable attention, which focuses on key regions of the image, reducing computational load while maintaining performance. Because the sampling points are selected dynamically, the model can concentrate on critical regions. With the deformable attention module, the model adapts better to images of varying sizes and content, improving efficiency and accuracy in instance segmentation tasks.
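The exact integration of deformable attention into the C2f bottleneck follows Figures 2 and 3 and the DAT/DAT++ designs in [32,33]; the code below is only a simplified, single-head sketch of the core idea, in which a small depthwise-convolutional offset network shifts a uniform grid of reference points, keys and values are bilinearly sampled at the deformed points, and ordinary attention is computed against the sampled tokens. The module name and all layer sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention2d(nn.Module):
    """Single-head sketch of deformable attention with dynamic sampling points."""

    def __init__(self, dim: int, sample_stride: int = 2):
        super().__init__()
        self.scale = dim ** -0.5
        self.proj_q = nn.Conv2d(dim, dim, 1)
        self.proj_k = nn.Conv2d(dim, dim, 1)
        self.proj_v = nn.Conv2d(dim, dim, 1)
        self.proj_out = nn.Conv2d(dim, dim, 1)
        # Offset network: depthwise conv -> GELU -> 1x1 conv predicting (dx, dy) per sampling point.
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 5, stride=sample_stride, padding=2, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, 2, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        q = self.proj_q(x)
        offsets = self.offset_net(q).tanh()                 # (B, 2, Hs, Ws), offsets in [-1, 1]
        Hs, Ws = offsets.shape[-2:]

        # Uniform reference grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, Hs, device=x.device)
        xs = torch.linspace(-1, 1, Ws, device=x.device)
        ref_y, ref_x = torch.meshgrid(ys, xs, indexing="ij")
        ref = torch.stack((ref_x, ref_y), dim=-1)           # (Hs, Ws, 2)
        grid = (ref.unsqueeze(0) + offsets.permute(0, 2, 3, 1)).clamp(-1, 1)

        # Sample deformed keys and values at the shifted reference points.
        k = F.grid_sample(self.proj_k(x), grid, align_corners=True)   # (B, C, Hs, Ws)
        v = F.grid_sample(self.proj_v(x), grid, align_corners=True)

        q = q.flatten(2).transpose(1, 2)                    # (B, H*W, C)
        k = k.flatten(2)                                    # (B, C, Hs*Ws)
        v = v.flatten(2).transpose(1, 2)                    # (B, Hs*Ws, C)
        attn = (q @ k) * self.scale                         # (B, H*W, Hs*Ws)
        out = attn.softmax(dim=-1) @ v                      # (B, H*W, C)
        out = out.transpose(1, 2).reshape(B, C, H, W)
        return self.proj_out(out)

if __name__ == "__main__":
    attn = DeformableAttention2d(dim=128)
    print(attn(torch.randn(1, 128, 40, 40)).shape)          # torch.Size([1, 128, 40, 40])
```

Because the input and output shapes are identical, a module of this kind can be slotted into the Bottleneck of the last C2f block without altering the rest of the backbone, which is what allows the attention to be added at a single, late stage of the network.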

2.3.2. ADown Block

The ADown module [34] is a convolutional structure designed specifically for downsampling in deep learning models, as illustrated in Figure 4. Downsampling reduces the spatial dimensions of feature maps, allowing the model to capture higher-level image features while lowering the associated computational cost. The ADown module employs an adaptive mechanism that decides whether to transmit deeper features based on the variability of information across different regions of the input feature map, yielding a flexible downsampling strategy that effectively preserves critical information. Moreover, it adopts a multi-scale feature fusion approach that combines features from various scales, enabling it to handle targets of different sizes; this mitigates the detail loss caused by excessive downsampling and improves detection accuracy. Notably, the use of depthwise separable convolutions allows the ADown module to significantly reduce the number of parameters and the computational complexity while maintaining strong feature extraction capability. As a result, the ADown module markedly improves the model’s accuracy and computational efficiency, ensuring robust performance in instance segmentation tasks even against complex backgrounds.
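Reference [34] releases ADown together with YOLOv9, and the sketch below follows that published structure: an average-pooling smoothing step, a channel split, a strided 3 × 3 CBS branch and a max-pooling plus 1 × 1 CBS branch, concatenated at the output. The exact variant used in CabbageNet, including its depthwise separable convolutions, follows Figure 4 rather than this simplified code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBS(nn.Module):
    """Convolution + BatchNorm + SiLU, as in the Figure 4 caption."""
    def __init__(self, c_in, c_out, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ADown(nn.Module):
    """Downsampling block in the spirit of YOLOv9's ADown [34]: the smoothed input
    is split along channels, one half is downsampled by a strided 3x3 CBS branch,
    the other by max pooling plus a 1x1 CBS branch, and the results are concatenated."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_half = c_out // 2
        self.cv1 = CBS(c_in // 2, c_half, k=3, s=2, p=1)
        self.cv2 = CBS(c_in // 2, c_half, k=1, s=1, p=0)

    def forward(self, x):
        x = F.avg_pool2d(x, kernel_size=2, stride=1, padding=0, count_include_pad=False)
        x1, x2 = x.chunk(2, dim=1)
        x1 = self.cv1(x1)                                     # learnable strided branch
        x2 = F.max_pool2d(x2, kernel_size=3, stride=2, padding=1)
        x2 = self.cv2(x2)                                     # cheap pooled branch
        return torch.cat((x1, x2), dim=1)                     # (B, c_out, H/2, W/2)

if __name__ == "__main__":
    down = ADown(c_in=128, c_out=256)
    print(down(torch.randn(1, 128, 80, 80)).shape)            # torch.Size([1, 256, 40, 40])
```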

2.3.3. Small Object Enhance Pyramid

In instance segmentation with YOLOv8, the standard P3, P4, and P5 feature layers often struggle with small objects. A traditional remedy is to add a P2 feature layer for finer feature capture; however, this significantly increases computational complexity and post-processing time. To address these challenges while maintaining computational efficiency, we propose an optimized method based on the Path Aggregation Feature Pyramid Network (PAFPN), called the Small Object Enhance Pyramid (SOEP). Rather than adding a P2 detection layer directly, we process the P2 feature layer with space-to-depth convolution (SPD-Conv) [35] to extract richer small-object-specific features and fuse them with the P3 layer. An illustration of the SPD-Conv process is provided in Figure 5. This fusion strategy effectively preserves key small-object information while limiting the rise in computational cost. Furthermore, by combining the Cross-Stage Partial Network (CSP) [36] architecture with OmniKernel-based [37] feature aggregation (CSP-OmniKernel), we enhance feature representation, as illustrated in Figure 6. The OmniKernel module, composed of global, coarse-scale, and fine-scale branches, efficiently captures multi-scale features from global to local levels. This multi-branch structure substantially improves segmentation performance for small objects, especially in complex scenes, making it an effective approach for instance segmentation tasks.
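As a concrete illustration of the SPD-Conv step referenced above, the sketch below rearranges every 2 × 2 spatial block of a feature map into the channel dimension and then applies a non-strided convolution, so resolution is halved without discarding pixel information [35]. The channel sizes and the CBS-style convolution are assumptions for illustration, not the exact layer used in SOEP.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided convolution: each 2x2 spatial block
    is stacked along the channel axis (C -> 4C, H -> H/2, W -> W/2) before a
    stride-1 convolution, so downsampling loses no pixel information."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4 * c_in, c_out, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, x):
        # Space-to-depth rearrangement: (B, C, H, W) -> (B, 4C, H/2, W/2)
        x = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.conv(x)

if __name__ == "__main__":
    spd = SPDConv(c_in=64, c_out=128)
    print(spd(torch.randn(1, 64, 160, 160)).shape)   # torch.Size([1, 128, 80, 80])
```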

3. Experiments and Results

3.1. Experimental Environment

To ensure experimental consistency, both the training and testing phases of our model were conducted on a workstation equipped with a 12th-generation Intel Core i9 processor (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3080 Ti laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA), running Windows 11 (Microsoft Corporation, Redmond, WA, USA). The machine learning framework used for model training was PyTorch, version 2.2.2. Furthermore, we used CUDA version 11.8, a general-purpose parallel computing platform for GPUs, together with NVIDIA’s CUDA Deep Neural Network library (cuDNN) v8.9.7 to optimize computational efficiency. The training parameters for CabbageNet were selected to optimize segmentation performance across diverse scenarios: a learning rate of 0.01, batch size of 16, momentum of 0.937, weight decay of 0.0005, 300 training epochs, and an input image size of 640 × 640 pixels.
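Since CabbageNet is built on the Ultralytics YOLOv8 code base [30], training with the hyperparameters listed above could look like the sketch below; the model and dataset YAML file names are hypothetical placeholders, as the paper does not publish its configuration files.

```python
from ultralytics import YOLO

# Hypothetical file names: "cabbagenet-seg.yaml" stands for the modified model
# definition and "cabbage-seg.yaml" for the dataset description.
model = YOLO("cabbagenet-seg.yaml")

model.train(
    data="cabbage-seg.yaml",   # paths to the train/val/test splits and class names
    epochs=300,                # training epochs reported in Section 3.1
    imgsz=640,                 # input image size
    batch=16,                  # batch size
    lr0=0.01,                  # initial learning rate
    momentum=0.937,            # SGD momentum
    weight_decay=0.0005,       # weight decay
    device=0,                  # single RTX 3080 Ti GPU
)

metrics = model.val(split="test")   # evaluate mask precision, recall, and mAP on the test split
```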

3.2. Evaluation Metrics

This study employs a comprehensive set of evaluation metrics to systematically assess the performance of the model, including precision, recall, average precision (AP), mean average precision (mAP), Intersection over Union (IoU), Dice Coefficient, F1 Score, Hausdorff Distance, Pixel Accuracy (PA), Params, GFLOPS (giga floating-point operations per second), and FPS (frames per second). Precision, defined by Equation (1), measures the model’s ability to correctly identify positive samples. Recall, calculated using Equation (2), represents the proportion of actual positive samples successfully detected by the model. AP, defined by Equation (3), evaluates detection and classification performance across varying confidence thresholds, while mAP, given by Equation (4), provides an aggregated measure of precision across multiple categories. IoU, defined by Equation (5), quantifies spatial overlap between predicted and ground-truth regions as the ratio of their intersection to union. The Dice Coefficient, detailed in Equation (6), assesses the similarity between predicted and actual regions, with a focus on overlap quality. F1 Score, represented by Equation (7), harmonizes precision and recall in a single metric. Furthermore, Hausdorff Distance, described in Equation (8), measures the maximum discrepancy between predicted and ground-truth boundaries. PA, as defined by Equation (9), represents the ratio of correctly classified pixels to the total number of pixels. Params, as defined by Equation (10), denotes the total number of learnable parameters in the model. GFLOPS, given in Equation (11), quantifies the number of floating-point operations per second during inference, while FPS, described in Equation (12), indicates the number of frames processed per second, a key metric for assessing real-time performance.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{AP} = \int_{0}^{1} \text{Precision}(\text{Recall}) \, d(\text{Recall}) \tag{3}$$

$$\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{4}$$

$$\text{IoU} = \frac{TP}{TP + FP + FN} \tag{5}$$

$$\text{Dice} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \tag{6}$$

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$

$$d_H(X, Y) = \max\left\{ \sup_{x \in X} \inf_{y \in Y} d(x, y), \; \sup_{y \in Y} \inf_{x \in X} d(x, y) \right\} \tag{8}$$

$$\text{PA} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

$$\text{Params} = \sum_{l \in \text{conv}} \left( K_h \times K_w \times C_{\text{in},l} + 1 \right) \times C_{\text{out},l} + \sum_{l \in \text{fc}} \left( C_{\text{in},l} \times C_{\text{out},l} + C_{\text{out},l} \right) \tag{10}$$

$$\text{GFLOPS} = \frac{\text{Total Floating-Point Operations}}{\text{Execution Time (s)} \times 10^{9}} \tag{11}$$

$$\text{FPS} = \frac{\text{Total Number of Frames Processed}}{\text{Total Time (s)}} \tag{12}$$
where $TP$ denotes true positives, $FP$ false positives, $FN$ false negatives, and $TN$ true negatives; $N$ denotes the number of categories and $AP_i$ the AP value of category $i$; $\sup$ represents the supremum (least upper bound) and $\inf$ the infimum (greatest lower bound); $K_h$ and $K_w$ represent the height and width of the convolution kernel, $C_{\text{in},l}$ and $C_{\text{out},l}$ denote the input and output channels of the $l$-th layer, respectively, and “+1” accounts for the bias term.
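To make the pixel-level definitions in Equations (1), (2), and (5)–(9) concrete, a straightforward reference implementation over a pair of binary masks is sketched below; it is provided only as an illustration and is not the evaluation code used in the paper.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def mask_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Pixel-level precision, recall, IoU, Dice, F1, and PA for two binary masks (True = cabbage head)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()

    eps = 1e-9                                  # guard against empty masks
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "precision": precision,
        "recall": recall,
        "iou": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "f1": 2 * precision * recall / (precision + recall + eps),
        "pixel_accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
    }

def hausdorff_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff distance (Equation (8)) between the two foreground pixel sets."""
    p = np.argwhere(pred.astype(bool))
    g = np.argwhere(gt.astype(bool))
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```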

3.3. Comparison Experiments

To validate the effectiveness of the proposed CabbageNet model in the instance segmentation task for harvested cabbage heads in complex scenes, we conducted a systematic evaluation using COCO metrics, including Mask AP0.5:0.95, Mask AP0.5, Mask AP0.75, Mask APsmall, parameter count, FPS, and GFLOPS [38]. We compared CabbageNet with several state-of-the-art instance segmentation models, including Mask R-CNN, Cascade Mask R-CNN, Mask Scoring R-CNN, Hybrid Task Cascade, YOLACT, SOLO, PointRend, SOLOv2, QueryInst, and ConvNeXt-V2, with the experimental results presented in Table 1.
The analysis reveals that CabbageNet presents superior performance across multiple key metrics, achieving Mask AP0.5:0.95 and Mask AP0.5 values of 78.9% and 94.0%, respectively, significantly surpassing other models such as Mask R-CNN (74.1%, 90.8%) and Cascade Mask R-CNN (74.5%, 90.2%) in the same metrics. Furthermore, CabbageNet excels in Mask AP0.75 and Mask APsmall, reaching 85.4% and 38.7%, respectively, far exceeding YOLACT (73.6% and 31.0%) and SOLO (62.8% and 7.0%). In terms of computational efficiency, CabbageNet’s parameter count is only 3.21 M, with a GFLOPS of 15.1 and an FPS of 154, indicating remarkable computational performance while maintaining high accuracy. This advantage is particularly evident when compared to the larger parameter models, such as Cascade Mask R-CNN (76.8M, FPS of 12) and Hybrid Task Cascade (79.9M, FPS of 6). Additionally, CabbageNet has proven its adaptability in complex scenarios, effectively handling diverse image datasets, particularly excelling in harvested cabbage head instance segmentation. Consequently, the experimental results indicate that CabbageNet demonstrates significant superiority in this specific task.
To evaluate the effectiveness of the proposed model, a comparative analysis was conducted with the advanced segmentation model SAM (Segment Anything Model) and its derivatives, which do not rely on prior knowledge. Since SAM models perform segmentation across all elements within an image, post-processing was necessary to refine their outputs. Specifically, an IoU matrix combined with the Hungarian algorithm was applied to match predicted masks with ground-truth masks, enabling precise identification and extraction of cabbage head targets. To ensure a comprehensive performance assessment, universal metrics were employed to evaluate segmentation accuracy, consistency, and boundary quality. The detailed results of the comparison are summarized in Table 2.
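The paper does not release its post-processing code, but the IoU-matrix plus Hungarian-algorithm matching described above can be realized as in the sketch below, where SciPy’s linear_sum_assignment finds the assignment of predicted masks to ground-truth masks that maximizes the total IoU; the array shapes and function name are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_masks(pred_masks: np.ndarray, gt_masks: np.ndarray):
    """Match predicted masks to ground-truth masks by maximizing total IoU.

    pred_masks: (P, H, W) boolean array of "segment everything" outputs (e.g., from SAM).
    gt_masks:   (G, H, W) boolean array of annotated cabbage heads.
    Returns a list of (pred_idx, gt_idx, iou) tuples for the optimal assignment.
    """
    P, G = len(pred_masks), len(gt_masks)
    iou = np.zeros((P, G))
    for i in range(P):
        for j in range(G):
            inter = np.logical_and(pred_masks[i], gt_masks[j]).sum()
            union = np.logical_or(pred_masks[i], gt_masks[j]).sum()
            iou[i, j] = inter / union if union > 0 else 0.0

    # The Hungarian algorithm minimizes cost, so negate IoU to maximize overlap.
    rows, cols = linear_sum_assignment(-iou)
    return [(int(r), int(c), float(iou[r, c])) for r, c in zip(rows, cols)]
```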
The experimental results clearly demonstrate that CabbageNet outperforms all other models in the SAM series across key metrics. It achieves the highest values in Intersection over Union (IoU), Dice Coefficient, and F1 Score, with 85.3%, 90.0%, and 90.0%, respectively, significantly surpassing SAM2, which shows notably lower performance. CabbageNet also excels in boundary precision, with a Hausdorff Distance of 27.1 pixels, much lower than SAM2 at 175.4 pixels. In Pixel Accuracy (PA), it achieves 99.4%, outperforming MobileSAM at 97.2%. Furthermore, CabbageNet attains a remarkable frame rate of 154 FPS, surpassing FastSAM at 63 FPS. These results underscore CabbageNet’s superior segmentation accuracy, boundary precision, and computational efficiency, making it a highly competitive model for real-time applications that require both high precision and speed.
To further evaluate the effectiveness of the CabbageNet model for segmenting mature cabbage heads in complex environments, comparative experiments were conducted against YOLOv5n-seg, YOLOv9c-seg, and the baseline model YOLOv8n-seg. The results of the experiment are presented in Table 3.
The findings indicate that CabbageNet exhibited superior performance compared to the competing models in both mask precision and mean average precision (mAP). In particular, the CabbageNet model exhibited a mask precision of 92.2%, which was markedly higher than that of the YOLOv8n-seg (90.9%) and YOLOv5n-seg (90.5%) models. This illustrates its enhanced capability in addressing intricate agricultural segmentation tasks. Although YOLOv9c-seg exhibited a marginally higher mAP50 (95.4%), CabbageNet demonstrated a superior balance between accuracy and computational efficiency, with a parameter count of 3.21M and a model size of 6.46MB. The computational load of CabbageNet was 15.1 GFLOPs, which was slightly higher than that of YOLOv8n-seg (12.0 GFLOPs). However, this increase is justified given the significant increase in precision observed. CabbageNet exhibited an inference speed of 154 FPS, which, although inferior to that of YOLOv5n-seg (329 FPS) and YOLOv8n-seg (215 FPS), was considerably higher than that of YOLOv9c-seg (22 FPS), thus meeting the requisite specifications for real-time agricultural applications. Overall, CabbageNet demonstrated substantial improvements in segmentation accuracy, model compactness, computational efficiency, and real-time performance, making it well suited for instance segmentation tasks in automated cabbage head harvesting under visually challenging conditions.

3.4. Ablation Experiments

In order to demonstrate the performance improvement contributed by each module, we conducted ablation experiments to observe the impact of gradually incorporating distinct modules on model performance. In these experiments, the C2f-DAttention module, the ADown module, and the SOEP module were sequentially added to evaluate the specific effects of adding modules on the overall model performance. The results of these experiments are presented in Table 4.
The results of the ablation experiments indicate that the proposed CabbageNet model offers significant advantages in accuracy and mAP. Compared to the baseline YOLOv8n-seg, CabbageNet improves precision from 90.9% to 92.2% while maintaining a stable mAP50 of 95.1% and a high recall of 87.2% through the integration of the C2f-DAttention, ADown, and SOEP modules. The dynamic attention mechanism within C2f-DAttention enhances responsiveness to key features and effectively suppresses background noise, improving the model’s overall detection and segmentation accuracy in challenging environments; however, this focus on prominent features slightly deprioritizes secondary features, contributing to a marginal decrease in recall. The ADown module efficiently reduces the number of model parameters while preserving feature integrity through a rational downsampling strategy, striking a balance between model complexity and detection efficiency. Additionally, the introduction of SOEP strengthens the aggregation of features across different scales, further improving precision.
Although these improvements increase the computational load to some extent and reduce inference speed, the model remains capable of real-time segmentation of cabbage heads during the harvesting period. In conclusion, the multi-module co-optimization in CabbageNet yields significant improvements in core metrics such as precision and mAP, demonstrating the effectiveness of the proposed method for this segmentation task.

3.5. Visualization and Analysis of Experiments

To compare the segmentation quality of the proposed model and the competing models more intuitively, the same image was segmented by each model. The results are shown in Figure 7: the first image is the original, and the remainder are the segmentation mask maps produced by the mainstream models and by CabbageNet, with the corresponding model indicated under each image.
In evaluating the performance of various instance segmentation algorithms on the same cabbage image, it was found that the segmentation masks generated by Cascade Mask R-CNN, ConvNeXt-V2, Hybrid Task Cascade, Mask R-CNN, Mask Scoring R-CNN, QueryInst, SOLOv2, and YOLACT did not fully cover the cabbage head. Although PointRend performed relatively better, it still exhibited deficiencies in edge segmentation. Moreover, none of these models succeeded in segmenting the small, incomplete cabbage head in the upper right corner. YOLOv5n-seg failed to completely segment the larger cabbage head, while YOLOv8n-seg incorrectly split the larger cabbage head into two overlapping instances, creating a bluish-purple overlap. YOLOv9c-seg provided better edge handling but showed slight inaccuracies in segmenting the medium-sized cabbage. In contrast, CabbageNet handles edge details with precision, outperforming all other models in overall segmentation quality.
As shown in Figure 8, this study presents a comparative analysis of the segmentation performance between the SAM series models and the proposed CabbageNet. The results reveal that SAM series models often struggle to differentiate objects that closely resemble the background, leading to significant segmentation errors and producing segmentation edges that lack precision. In contrast, CabbageNet effectively addresses these challenges by minimizing such accuracy-reducing errors. CabbageNet demonstrates superior performance, accurately and efficiently segmenting cabbage heads in complex cabbage harvesting environments.
To further validate the segmentation performance of the model in real-world production environments, an image acquisition device was installed on the harvesting equipment, leaf-stripping device, and high-altitude UAV of the cabbage harvesting machinery. The systems captured images from multiple perspectives, thereby reflecting the complex conditions typically encountered in agricultural settings. The collected images were used to compare the segmentation of cabbage heads in complex backgrounds, using both the YOLO series segmentation models and the proposed CabbageNet model. The detailed segmentation results are shown in Figure 9.
As shown in the results, among the images from the harvesting device, the YOLOv5n-seg model performed poorly in detecting and segmenting small-sized targets. In contrast, CabbageNet consistently outperformed the other models, demonstrating superior performance in both detection and segmentation. Among the images from the leaf-stripping device, YOLOv5n-seg and YOLOv9c-seg both misidentified the stripped cabbage leaves as cabbage heads, while YOLOv8n-seg mistakenly included the cabbage leaves in its segmentation mask. However, CabbageNet provided significantly more accurate segmentation, particularly in distinguishing cabbage heads from surrounding leaves. Similarly, in the images acquired by the UAV, CabbageNet outperformed the other models in segmenting small cabbage heads. Overall, CabbageNet consistently delivered superior segmentation performance across complex backgrounds, performing better than the other YOLO series models.
The ablation experiment results are visually compared and presented in Figure 10. An analysis of the figure reveals that the YOLOv8n-seg model exhibits slightly lower segmentation accuracy for both single targets and multiple small-sized targets. However, after incorporating the C2f-DAttention module, a significant improvement in accuracy is observed. Further integration of the ADown module not only reduces the number of parameters and computational complexity considerably but also preserves the model’s performance. Finally, the addition of the SOEP module leads to a substantial enhancement in segmentation performance for small-sized targets. Overall, the various improvements proposed in this study contribute positively to the model’s performance in segmentation tasks.

4. Conclusions and Discussion

This paper addresses the challenges of cabbage detection and segmentation for unmanned cabbage harvesting robots operating in complex environments. To reduce the miss rate and damage rate during automated cabbage harvesting, an improved cabbage head segmentation algorithm, CabbageNet, is proposed based on YOLOv8n-seg. First, a dynamic attention mechanism is introduced into the C2f module to create the C2f-DAttention module, enhancing segmentation accuracy for cabbages of varying sizes. Next, the downsampling mechanism is improved with the ADown module, which significantly reduces the number of parameters and the computational complexity while maintaining high-level feature extraction capabilities. Finally, the SOEP module is integrated to boost the recognition and segmentation of small cabbages. Experimental results show that the CabbageNet model, with a size of only 6.46 MB, achieves a Mask Precision of 92.2%, a Mask Recall of 87.2%, and a Mask mAP50 of 95.1%, with a segmentation speed of 154 FPS in real-time applications. These results provide a solid foundation for the application of unmanned cabbage harvesting robots in complex harvesting environments.
The dataset used in this study includes only a limited number of cabbage varieties. Consequently, the model trained on this dataset may face difficulties in generalizing to new or unseen varieties not included in the dataset. To enhance the model’s robustness and extend its applicability, further expansion of the dataset will be required to accommodate a broader range of cabbage types. In the future, the current dataset will be expanded by the addition of further cabbage varieties, thereby enhancing the model’s capacity to recognize and segment multiple types of cabbage. Furthermore, the CabbageNet algorithm will be optimized to enable its effective operation in unmanned cabbage harvesting environments, which are characterized by complex and variable conditions.

Author Contributions

Conceptualization, Y.T. and C.Z.; methodology, Y.T.; software, Y.T.; validation, Y.T., C.Z. and T.Z.; formal analysis, Y.T.; investigation, Y.T.; resources, Y.T., X.C. and H.W.; data curation, Y.T. and H.W.; writing—original draft preparation, Y.T.; writing—review and editing, Y.T. and Y.Z.; visualization, Y.T.; supervision, Y.T. and C.Z.; project administration, Y.T. and C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the National Key R&D Program of China (2022ZD0115805) and the Provincial Key S&T Program of Xinjiang (2022A02011).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available from the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tong, W.Y.; Zhang, J.F.; Song, Z.Y.; Cao, G.Q.; Jin, Y.; Ning, X.F. Research Status and Development Trend of Cabbage Mechanical Harvesting Equipment and Technology. J. Chin. Agric. Mech. 2024, 45, 322–329. [Google Scholar]
  2. Yang, J.H.; Fang, X.; Ma, L.X.; Zhou, C.; Shao, C.F. Research Status and Direction of Headed Vegetable Harvesting Machinery. J. Agric. Mechan. Res. 2023, 45, 10–17. [Google Scholar] [CrossRef]
  3. Yang, J.H.; Du, Y.G.; Fang, X.; Zhou, C. Design and Experimental study of Cabbage Picking and Conveying Device. J. Chin. Agric. Mech. 2024, 45, 32–36. [Google Scholar] [CrossRef]
  4. Ghazal, S.; Munir, A.; Qureshi, W.S. Computer vision in smart agriculture and precision farming: Techniques and applications. Artif. Intell. Agric. 2024, 13, 64–83. [Google Scholar] [CrossRef]
  5. Zou, L.L.; Liu, X.M.; Yuan, J.; Dong, X.H. Advances in Mechanized Harvesting Technology and Equipment for Leaf Vegetables. Chin. J. Agric. Mech. 2022, 43, 15–23. [Google Scholar] [CrossRef]
  6. Luo, Z.; Yang, W.; Yuan, Y.; Gou, R.; Li, X. Semantic Segmentation of Agricultural Images: A Survey. Inf. Process. Agric. 2024, 11, 172–186. [Google Scholar] [CrossRef]
  7. Charisis, C.; Argyropoulos, D. Deep Learning-Based Instance Segmentation Architectures in Agriculture: A Review of the Scopes and Challenges. Smart Agric. Technol. 2024, 8, 100448. [Google Scholar] [CrossRef]
  8. Yu, Y.; Wang, C.; Fu, Q.; Kou, R.; Huang, F.; Yang, B.; Yang, T.; Gao, M. Techniques and Challenges of Image Segmentation: A Review. Electronics 2023, 12, 1199. [Google Scholar] [CrossRef]
  9. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Venice, Italy, 2017. [Google Scholar] [CrossRef]
  10. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  11. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. arXiv 2023, arXiv:2301.00808. [Google Scholar]
  12. Li, Y.; Feng, Q.; Li, T.; Xie, F.; Liu, C.; Xiong, Z. Advance of Target Visual Information Acquisition Technology for Fresh Fruit Robotic Harvesting: A Review. Agronomy 2022, 12, 1336. [Google Scholar] [CrossRef]
  13. Kang, S.; Li, D.; Li, B.; Zhu, J.; Long, S.; Wang, J. Maturity Identification and Category Determination Method of Broccoli Based on Semantic Segmentation Models. Comput. Electron. Agric. 2024, 217, 108633. [Google Scholar] [CrossRef]
  14. Blok, P.M.; Barth, R.; van den Berg, W. Machine Vision for a Selective Broccoli Harvesting Robot. IFAC-PapersOnLine 2016, 49, 66–71. [Google Scholar] [CrossRef]
  15. Blok, P.M.; van Henten, E.J.; van Evert, F.K.; Kootstra, G. Image-Based Size Estimation of Broccoli Heads under Varying Degrees of Occlusion. Biosyst. Eng. 2021, 208, 213–233. [Google Scholar] [CrossRef]
  16. Kang, H.; Wang, X.; Chen, C. Geometry-Aware Fruit Grasping Estimation for Robotic Harvesting in Orchards. Comput. Electron. Agric. 2022, 193, 106716. [Google Scholar] [CrossRef]
  17. Shen, L.; Su, J.; Huang, R.; Quan, W.; Song, Y.; Fang, Y.; Su, B. Fusing Attention Mechanism with Mask R-CNN for Instance Segmentation of Grape Cluster in the Field. Front. Plant Sci. 2022, 13, 934450. [Google Scholar] [CrossRef]
  18. Wang, D.; He, D. Apple Detection and Instance Segmentation in Natural Environments Using an Improved Mask Scoring R-CNN Model. Front. Plant Sci. 2022, 13, 1016470. [Google Scholar] [CrossRef]
  19. Coll-Ribes, G.; Torres-Rodríguez, I.J.; Grau, A.; Guerra, E.; Sanfeliu, A. Accurate Detection and Depth Estimation of Table Grapes and Peduncles for Robot Harvesting, Combining Monocular Depth Estimation and CNN Methods. Comput. Electron. Agr. 2023, 215, 108362. [Google Scholar] [CrossRef]
  20. Lawal, O.M. YOLOv5-LiNet: A Lightweight Network for Fruits Instance Segmentation. PLoS ONE 2023, 18, e0282297. [Google Scholar] [CrossRef]
  21. Li, Y.; Feng, Q.; Liu, C.; Xiong, Z.; Sun, Y.; Xie, F.; Li, T.; Zhao, C. MTA-YOLACT: Multitask-Aware Network on Fruit Bunch Identification for Cherry Tomato Robotic Harvesting. Eur. J. Agron. 2023, 146, 126812. [Google Scholar] [CrossRef]
  22. Lüling, N.; Reiser, D.; Griepentrog, H.W. Volume and Leaf Area Calculation of Cabbage with a Neural Network-Based Instance Segmentation. In Precision Agriculture ’21; Wageningen Academic Publishers: Budapest, Hungary, 2021; pp. 719–726. [Google Scholar] [CrossRef]
  23. Lüling, N.; Reiser, D.; Stana, A.; Griepentrog, H.W. Using Depth Information and Colour Space Variations for Improving Outdoor Robustness for Instance Segmentation of Cabbage. arXiv 2021, arXiv:2103.16923. [Google Scholar] [CrossRef]
  24. Lüling, N.; Reiser, D.; Straub, J.; Stana, A.; Griepentrog, H.W. Fruit Volume and Leaf-Area Determination of Cabbage by a Neural-Network-Based Instance Segmentation for Different Growth Stages. Sensors 2022, 23, 129. [Google Scholar] [CrossRef] [PubMed]
  25. Asano, M.; Onishi, K.; Fukao, T. Robust Cabbage Recognition and Automatic Harvesting under Environmental Changes. Adv. Robot. 2023, 37, 960–969. [Google Scholar] [CrossRef]
  26. Cong, P.; Li, S.; Zhou, J.; Lv, K.; Feng, H. Research on Instance Segmentation Algorithm of Greenhouse Sweet Pepper Detection Based on Improved Mask RCNN. Agronomy 2023, 13, 196. [Google Scholar] [CrossRef]
  27. Wu, H.; Guo, W.; Liu, C.; Sun, X. A Study of cabbage Recognition Based on Semantic Segmentation. Agronomy 2024, 14, 894. [Google Scholar] [CrossRef]
  28. Jia, W.; Li, Q.; Zhang, Z.; Liu, G.; Hou, S.; Ji, Z.; Zheng, Y. Optimized SOLO Segmentation Algorithm for the Green Fruits of Persimmons and Apples in Complex Environments. Trans. Chin. Soc. Agric. Eng. 2021, 37, 121–127. [Google Scholar]
  29. Sheng, X.; Kang, C.; Zheng, J.; Lyu, C. An edge-guided method to fruit segmentation in complex environments. Comput. Electron. Agric. 2023, 208, 107788. [Google Scholar] [CrossRef]
  30. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics Yolov8. Available online: https://github.com/ultralytics/ultralytics (accessed on 12 June 2024).
  31. Advanced Auto Labeling Solution with Added Features. Available online: https://github.com/CVHub520/X-AnyLabeling (accessed on 13 June 2024).
  32. Xia, Z.; Pan, X.; Song, S.; Li, E.L.; Huang, G. DAT++: Spatially Dynamic Vision Transformer with Deformable Attention. arXiv 2023, arXiv:2309.01430. [Google Scholar]
  33. Xia, Z.; Pan, X.; Song, S.; Li, E.L.; Huang, G. Vision Transformer with Deformable Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4803. [Google Scholar]
  34. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024. [Google Scholar] [CrossRef]
  35. Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. arXiv 2022. [Google Scholar] [CrossRef]
  36. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. arXiv 2019. [Google Scholar] [CrossRef]
  37. Cui, Y.; Ren, W.; Knoll, A. Omni-Kernel Network for Image Restoration. AAAI Conf. Artif. Intell. 2023, 38, 27907. [Google Scholar] [CrossRef]
  38. Gupta, K.; Shakya, S.; Singla, A. Efficient Graph-Friendly COCO Metric Computation for Train-Time Model Evaluation. arXiv 2022, arXiv:2207.12120. [Google Scholar]
  39. Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498. [Google Scholar] [CrossRef] [PubMed]
  40. Huang, Z.; Huang, L.; Gong, Y.; Huang, C.; Wang, X. Mask Scoring R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  41. Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  42. Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. SOLO: Segmenting Objects by Locations. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  43. Kirillov, A.; Wu, Y.; He, K.; Girshick, R. PointRend: Image Segmentation as Rendering. arXiv 2019, arXiv:1912.08193. [Google Scholar]
  44. Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and Fast Instance Segmentation. Adv. Neural Inf. Process. Syst. (NeurIPS) 2020, 33, 17721–17732. [Google Scholar]
  45. Fang, Y.; Yang, S.; Wang, X.; Li, Y.; Fang, C.; Shan, Y.; Feng, B.; Liu, W. Instances As Queries. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6910–6919. [Google Scholar]
  46. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643. [Google Scholar]
  47. Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. arXiv 2024, arXiv:2408.00714. [Google Scholar] [CrossRef]
  48. Zhang, C.; Han, D.; Qiao, Y.; Kim, J.U.; Bae, S.H.; Lee, S.; Hong, C.S. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv 2023, arXiv:2306.14289. [Google Scholar]
  49. Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast Segment Anything. arXiv 2023, arXiv:2306.12156. [Google Scholar]
  50. Jocher, G. Ultralytics Yolov5. Available online: https://github.com/ultralytics/yolov5 (accessed on 27 June 2024).
Figure 1. Comparison of cabbage before and after labeling. Each pair of images is shown together, with the original on the left and the labeled version on the right.
Figure 2. CabbageNet model structure.
Figure 3. An Illustration of the deformable attention mechanism. (a) Information flow of the deformable attention. (b) Structure of the offset generation network. DWConv represents depthwise convolution.
Figure 4. ADown module network diagram. CBS denotes a sequence of operations comprising convolution, batch normalization, and SiLU activation. AvgPool2d and MaxPool2d denote average pooling and max pooling layers, respectively.
Figure 5. An Illustration of the SPD-Conv Process. (a) Traditional feature map. (b) Space-to-depth transformation. (c) Channel merging. (d) Addition operation. (e) Convolution with no stride.
Figure 6. An illustration of the OmniKernel and CSP-OmniKernel. (a) Structure of the OmniKernel module, showing local, large, and global branches. (b) CSP-OmniKernel architecture, incorporating the OmniKernel module for enhanced feature aggregation. DWConv represents depthwise convolution, DCAM represents Dual-Domain Channel Attention Module, and FSAM represents Frequency-Based Spatial Attention Module.
Figure 7. Comparison of segmentation results from different instance segmentation models. The color sequence of segmented instances in the figure is as follows: blue, orange-yellow, purple, indigo, etc.
Figure 8. Comparison of segmentation results from SAM series models. The color sequence of segmented instances in the figure is as follows: blue, orange-yellow, purple, indigo, etc.
Figure 9. Comparison of segmentation performance between YOLO series models and CabbageNet in a real-world cabbage harvesting environment. The color sequence of segmented instances in the figure is as follows: blue, orange-yellow, purple, indigo, etc.
Figure 10. Visualization comparison of ablation experiments. The color sequence of segmented instances in the figure is as follows: blue, orange-yellow, purple, indigo, etc.
Table 1. Comparison experiments with classic models.
| Models | Mask AP0.5:0.95/% | Mask AP0.5/% | Mask AP0.75/% | Mask APsmall/% | Params/M | GFLOPS | FPS |
|---|---|---|---|---|---|---|---|
| Mask R-CNN | 74.1 | 90.8 | 80.2 | 36.3 | 43.9 | 284.6 | 21 |
| Cascade Mask R-CNN [39] | 74.5 | 90.2 | 80.4 | 35.5 | 76.83 | 323.2 | 12 |
| Mask Scoring R-CNN [40] | 75.2 | 89.8 | 80.5 | 35.3 | 60.4 | 366.6 | 23 |
| Hybrid Task Cascade [41] | 75.9 | 91.7 | 81.8 | 38.0 | 79.93 | 506.0 | 6 |
| YOLACT | 68.2 | 89.4 | 73.6 | 31.0 | 34.7 | 163.2 | 21 |
| SOLO [42] | 58.2 | 78.1 | 62.8 | 7.0 | 35.9 | 311.5 | 11 |
| PointRend [43] | 73.3 | 90.5 | 79.6 | 35.1 | 55.9 | 184.1 | 18 |
| SOLOv2 [44] | 49.3 | 76.1 | 50.7 | 7.8 | 46.0 | 276.7 | 12 |
| QueryInst [45] | 68.4 | 85.0 | 74.6 | 24.4 | 172.2 | 135.6 | 10 |
| ConvNeXt-V2 | 75.8 | 89.0 | 80.5 | 34.3 | 108.1 | 469.6 | 6 |
| CabbageNet (Ours) | 78.9 | 94.0 | 85.4 | 38.7 | 3.21 | 15.1 | 154 |
Table 2. Comparison experiments with the SAM series models.
| Models | IoU/% | Dice/% | F1/% | Hausdorff Distance/Pixels | PA/% | FPS |
|---|---|---|---|---|---|---|
| SAM (base) [46] | 70.2 | 74.2 | 74.1 | 73.8 | 96.8 | - |
| SAM2 (2.1 base) [47] | 49.9 | 51.9 | 51.9 | 175.4 | 94.9 | - |
| MobileSAM [48] | 76.1 | 80.3 | 80.3 | 58.45 | 97.2 | - |
| FastSAM (s) [49] | 71.3 | 76.6 | 76.6 | 80.3 | 96.9 | 63 |
| CabbageNet (Ours) | 85.3 | 90.0 | 90.0 | 27.1 | 99.4 | 154 |
Table 3. Comparison experiments with the YOLO series models.
| Models | Mask Precision/% | Mask Recall/% | Mask mAP50/% | Mask mAP50-95/% | Params/M | FPS | GFLOPs | Model Size/MB |
|---|---|---|---|---|---|---|---|---|
| YOLOv5n-seg [50] | 90.5 | 88.5 | 94.6 | 76.5 | 1.88 | 329 | 6.7 | 3.96 |
| YOLOv9c-seg | 90.6 | 89.5 | 95.4 | 82.5 | 27.63 | 22 | 157.6 | 56.30 |
| YOLOv8n-seg (Baseline) | 90.9 | 88.0 | 94.9 | 80.4 | 3.26 | 215 | 12.0 | 6.50 |
| CabbageNet (Ours) | 92.2 | 87.2 | 95.1 | 80.6 | 3.21 | 154 | 15.1 | 6.46 |
Table 4. Results of the various ablation experiments.
| Models | C2f-DAttention | ADown | SOEP | Mask Precision/% | Mask Recall/% | Mask mAP50/% | Mask mAP50-95/% | Params/M | FPS | GFLOPs | Size/MB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv8n-seg (Baseline) | – | – | – | 90.9 | 88.0 | 94.9 | 80.4 | 3.26 | 215 | 12.0 | 6.50 |
| + C2f-DAttention | ✓ | – | – | 92.0 | 86.7 | 94.9 | 80.5 | 3.32 | 193 | 12.0 | 6.64 |
| + C2f-DAttention + ADown | ✓ | ✓ | – | 91.6 | 87.2 | 95.1 | 80.6 | 2.91 | 162 | 11.3 | 5.87 |
| CabbageNet (Ours) | ✓ | ✓ | ✓ | 92.2 | 87.2 | 95.1 | 80.6 | 3.21 | 154 | 15.1 | 6.46 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
