Article

YOLOv8n-Pose-DSW: A Precision Picking Point Localization Model for Zucchini in Complex Greenhouse Environments

1 College of Information Science and Engineering, Shanxi Agricultural University, Taigu 030801, China
2 College of Energy and Power Engineering, Lanzhou University of Technology, Lanzhou 730050, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Agriculture 2025, 15(18), 1954; https://doi.org/10.3390/agriculture15181954
Submission received: 14 August 2025 / Revised: 4 September 2025 / Accepted: 12 September 2025 / Published: 16 September 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

Zucchini growth in greenhouse environments presents significant challenges for fruit recognition and picking point localization due to characteristics such as foliage occlusion, high density, structural complexity, and diverse fruit morphologies. Current recognition and localization algorithms exhibit limitations including low accuracy, restricted applicability, and procedural complexity, falling short of the requirements for precise and robust intelligent harvesting. To address these issues, this study constructs a zucchini dataset of 942 images using an Intel RealSense D455 depth camera and a smartphone, and proposes a novel keypoint detection model named YOLOv8n-Pose-DSW. The model introduces three key enhancements compared with YOLOv8n-Pose. First, the conventional upsample operator is replaced with an adaptive point sampling operator called Dysample, improving detection accuracy while reducing GPU memory consumption. Second, a Slim-Neck structure is designed to decrease computational overhead through lightweight bottleneck architecture, while preserving robust feature representation. Third, the WIoU-v3 loss is adopted to optimize bounding box regression for object detection, thereby enhancing localization accuracy. Experimental results demonstrate that YOLOv8n-Pose-DSW achieves a zucchini detection P, R, mAP@50, and mAP@50–95 of 92.1%, 90.7%, 94.0%, and 71.4%, respectively. These metrics represent improvements of 3.3%, 11.7%, 7.4%, and 15.4%, respectively, over the original model. For picking point localization, the improved model attains a P of 93.1%, R of 89.5%, mAP@50 of 95.6%, and mAP@50–95 of 95.2%, corresponding to gains of 8.8%, 11.0%, 11.3%, and 27.9% over the original model. Further error analysis shows that picking point localization errors are concentrated within the 0–4-pixel range, demonstrating enhanced localization precision critical for practical harvesting applications. The proposed algorithm effectively addresses greenhouse environmental challenges and provides essential technical support for intelligent zucchini harvesting systems.

1. Introduction

Zucchini (Cucurbita pepo) is an annual, herbaceous, trailing vegetable belonging to the Cucurbitaceae family and the genus Cucurbita [1]. Its fruit serves as the primary part consumed as a vegetable. The fruit morphology is predominantly cylindrical or club-shaped, typically ranging in length from 15 to 30 cm [2]. The skin exhibits diverse coloration, including green, yellow, and white. Owing to its rich nutritional profile and strong adaptability, zucchini has become a significant vegetable crop widely favored in consumer markets. However, current zucchini harvesting operations predominantly rely on manual labor. This approach suffers from several limitations: low efficiency, high rates of fruit damage, escalating labor costs, and difficulties in accurately determining optimal harvest timing. In recent years, continuous growth in zucchini production has coincided with an increasing shortage of agricultural labor. Consequently, traditional harvesting methods have become inadequate, making the development of intelligent harvesting technology imperative. A significant challenge in the automated harvesting of zucchini arises from the plant’s biological and environmental characteristics [3,4]. The fruits are large and possess tender flesh, yet are attached via tough, rigid stems. Moreover, in greenhouse settings, zucchini plants exhibit highly random spatial distribution and are often affected by severe leaf occlusion. These factors collectively complicate both reliable fruit detection and accurate identification of the picking point [5]. Thus, developing intelligent algorithms capable of robust recognition and precise localization is essential to enable efficient automated harvesting.
As computer vision technology becomes increasingly prevalent in agricultural production, numerous researchers have conducted extensive studies on fruit detection and picking point localization methods. Existing approaches can be broadly divided into two categories: traditional image processing methods and deep learning-based object detection and localization methods. Traditional image processing methods typically rely on handcrafted features, including morphological characteristics, textural patterns, and color information. They employ techniques like image segmentation and morphological operations for feature extraction, subsequently achieving object detection and keypoint localization. For instance, Luo et al. [6] utilized the Otsu algorithm to segment grape clusters, then applied geometric constraints to precisely locate cutting point coordinates. Zhao et al. [7] proposed a fusion method based on an adaptive red/blue chromatic map and the sum of absolute transformed differences to segment potential citrus regions. They further employed Support Vector Machines trained on optimal co-occurrence matrix features for citrus recognition. Rathnayake et al. [8] proposed an algorithm based on the Cascaded Adaptive Network-based Fuzzy Inference System (Cascaded-ANFIS), achieving a high accuracy of 98.6% on the Fruit-360 dataset using advanced feature engineering with shallow features. While successful under specific controlled conditions, these methods are generally limited to simple simulated backgrounds. The challenges in greenhouse zucchini harvesting present a fundamentally different scenario: the severe occlusion and morphological diversity in greenhouses exceed the adaptability of such handcrafted-feature methods. Compared to traditional image processing approaches, deep learning demonstrates superior accuracy and real-time performance in object detection and keypoint localization [9], making it a more efficient and practical technology for applications in precision agriculture [10]. Consequently, numerous researchers have systematically investigated deep learning-based methods for fruit detection and picking point localization [11,12]. For instance, Zhang et al. [13] proposed YOLOv5-GAP, a grape cluster detection algorithm based on YOLOv5. This method integrates digital image processing algorithms with mathematical geometry principles to segment identified grape clusters and subsequently determine picking point coordinates. Experimental results demonstrate that YOLOv5-GAP achieved an average precision of 95.13% with a mean picking point localization error of 6.3 pixels, confirming its efficacy in rapid and accurate grape detection. Li et al. [14] enhanced YOLOv7 with efficient modules, attention mechanisms, and a small-target detection layer, combined with geometric localization. This achieved 98.8% accuracy, 96.8% mAP@95, and 90.8% localization success at 76 ms latency. Chen et al. [15] employed a lightweight DeepLabv3+ with MobileNetv2 backbone, CBAM attention, and DenseASPP, using skeleton endpoint detection for plum branch picking point localization. The method reached 86.13% MIoU and 92.92% MPA in segmentation with 59.6 MB model size, achieving 88.8% localization success. Despite advancements in existing methods for fruit recognition and picking point localization, their reliance on intricate multi-stage processing—for instance, separately performing detection, segmentation, and geometric localization—increases system complexity and computational overhead. This not only reduces the operational efficiency of mobile harvesting robots but also introduces cumulative errors that impair localization accuracy.
The rapid advancement of deep learning-based end-to-end keypoint detection methods in recent years offers a promising approach to addressing these limitations. These methods employ end-to-end coordinate regression to directly predict picking point positions, effectively circumventing the error propagation inherent in traditional multi-stage processing. This capability enables precise localization under complex environmental conditions, facilitating high-precision operations. Consequently, researchers are increasingly exploring their application in fruit harvesting to enhance picking accuracy and operational efficiency. Ma et al. [16] proposed STRAW-YOLO for strawberry detection and keypoint localization, achieving a precision of 91.6%, recall of 91.7%, and mAP@50 of 96.0%, with a detection time of 92 milliseconds per image, meeting robotic deployment requirements. Du et al. [17] developed YOLO-lmk, a YOLOv5-based algorithm that simultaneously detects tomato bounding boxes and keypoints in complex backgrounds, achieving 93.4% detection accuracy at 0.09 s per fruit processing speed. Wu et al. [18] designed a grape stem localization method that integrates a lightweight Ghost-HRNet architecture with the YOLOv5m detector. Employing a top-down strategy, their approach achieved a localization accuracy of 90.2% at a speed of 7.7 FPS. Although these deep learning-based methods demonstrate substantial potential for fruit recognition and picking point localization, their computational and storage demands remain high. These requirements can lead to significant operational costs, which may restrict the large-scale adoption of such systems in agricultural applications due to economic constraints.
Numerous lightweight models have also been developed for fruit detection. Huang et al. [19] proposed a lightweight Pepper-YOLO model for detecting peppers and locating keypoints, including the picking point and the fruit top and bottom; with only 1.9 million parameters and a 5.9 GFLOP computational load, this model achieves an mAP@50 of 87.6%. Wang et al. [20] proposed a lightweight OW-YOLO model for detecting walnuts, achieving an mAP@50 of 83.6% and an mAP@50–95 of 53.7%. However, most of these models prioritize generic object detection accuracy or mere parameter reduction, often overlooking the critical requirement for precise keypoint localization in agricultural automation tasks such as robotic picking. Moreover, their performance frequently degrades in complex, unconstrained greenhouse environments characterized by occlusion, variable lighting, and cluttered backgrounds. To address these limitations, this study proposes a novel YOLOv8n-Pose-DSW architecture specifically designed for zucchini picking point detection in complex greenhouse environments. Our research makes the following five primary contributions:
  • Zucchini Fruit Dataset: A robust dataset encompassing diverse illumination conditions, capture distances, fruit densities, and scenarios simulated through advanced data augmentation techniques was established.
  • YOLOv8n-Pose-DSW Model for Unstructured Environments: The proposed YOLOv8n-Pose-DSW model effectively addresses missed detections and low-accuracy challenges in unstructured environments. Comparative experiments validate its superiority and demonstrate the efficacy of each constituent module.
  • Adaptive Dysample Operator for Computational Efficiency: Traditional upsampling methods were replaced with Dysample, achieving synchronized optimization of computational efficiency and GPU memory consumption while maintaining detection accuracy.
  • Slim-Neck Architecture for Feature Representation: A Slim-Neck network structure was developed, enhancing computational efficiency and feature representation capability through optimized bottleneck layer design.
  • WIoUv3 Loss Function for Localization Sensitivity: The WIoUv3 loss function was introduced to replace the CIoU loss function, improving the model's detection sensitivity for zucchini fruits, enhancing picking point localization precision, and refining fitting accuracy.

2. Materials and Methods

2.1. Description of Study Area and Data Collection

The dataset was collected from zucchini cultivation plots in multi-span greenhouses at Hexicun Village, Taigu County, Shanxi Province, featuring an 80 cm row spacing. Acquisition occurred between December 2023 and April 2024 using handheld devices: an Intel RealSense D455 stereo depth camera (RealSense Inc., Cupertino, CA, USA) and a Xiaomi 14 smartphone camera (Xiaomi Inc., Beijing, China). The working distance ranged from 0.5 to 2 m, with the depth camera exhibiting a depth field of view of 58° (H) × 87° (V). Representative cultivation conditions are shown in Figure 1. A total of 942 images were captured: 258 images from the depth camera at 1280 × 720 pixel resolution and 684 images from the smartphone at 3450 × 3450 pixel resolution. To ensure data consistency and compatibility with network input requirements, all images and depth maps were uniformly cropped and resized to 720 × 720 pixels for storage.
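The following minimal sketch illustrates this acquisition and cropping step for the depth camera, assuming the pyrealsense2 and OpenCV Python packages; the output file names and the simple center crop are illustrative rather than the exact pipeline used for the dataset.

```python
# Minimal sketch: capture a 1280 x 720 color/depth pair from a RealSense D455
# and center-crop both maps to the 720 x 720 size used for storage.
import numpy as np
import cv2
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
pipeline.start(config)

try:
    frames = pipeline.wait_for_frames()
    color = np.asanyarray(frames.get_color_frame().get_data())   # 720 x 1280 x 3
    depth = np.asanyarray(frames.get_depth_frame().get_data())   # 720 x 1280, uint16

    h, w = color.shape[:2]
    x0 = (w - 720) // 2                                          # center crop to 720 x 720
    cv2.imwrite("zucchini_color.png", color[:, x0:x0 + 720])
    cv2.imwrite("zucchini_depth.png", depth[:, x0:x0 + 720])
finally:
    pipeline.stop()
```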
To ensure the dataset captured the unstructured orchard characteristics of multi-span greenhouses, image acquisition was conducted at different row positions across varying time periods. Representative samples are shown in Figure 2. Given the variable orientation of zucchini fruits during growth, both front and top views were systematically captured. Multi-angle perspectives (Figure 2a,b) enhance object observation from diverse viewpoints and mitigate occlusion effects. Significant illumination variations occurred due to positional differences within greenhouses and temporal acquisition windows: Figure 2c demonstrates optimal lighting with distinct fruit contours and prominent features, whereas Figure 2d exhibits suboptimal low-lighting conditions, increasing misidentification risks. The natural growth patterns of zucchini introduced interference and occlusion challenges. Figure 2e illustrates leaf-obscured fruits, while Figure 2f shows stem occlusion. Both scenarios substantially increase complexity for fruit recognition and stem detection.

2.2. The Dataset

2.2.1. Dataset Annotation and Division

Image annotation was performed using X-AnyLabeling software (Version: 3.0.2, CVHub, Shenzhen, GD, CN) as illustrated in Figure 3. The annotation protocol adhered to the following principles: (1) Each zucchini fruit was annotated with a minimum bounding rectangle labeled “zucchini”; a keypoint labeled “point” was marked 1 cm proximal to the peduncle base. Annotation results were stored in standardized JSON format. (2) All bounding boxes were tightly fitted around fruit contours while ensuring picking points remained within their respective boxes. (3) Specimens without visible stems or with ambiguous stem connections received only bounding box annotations to prevent picking point misidentification. (4) Each fruit and its corresponding picking point shared identical category labels to establish explicit associations.
Finally, JSON-format annotation files were converted to txt format compatible with YOLOv8n-Pose. The dataset was partitioned into training, validation, and test sets at a 6:2:2 ratio, yielding 565, 189, and 188 images, respectively.
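A minimal conversion sketch is given below, assuming LabelMe-style JSON exported by X-AnyLabeling (one "rectangle" shape per fruit plus an optional "point" shape for its picking point) and the single-class Ultralytics pose label format (class, normalized box, normalized keypoint, visibility); the field names and directory paths are illustrative.

```python
# Minimal sketch: convert LabelMe-style JSON annotations to YOLO pose txt labels.
import json
from pathlib import Path

def convert(json_path: Path, out_dir: Path) -> None:
    data = json.loads(json_path.read_text(encoding="utf-8"))
    img_w, img_h = data["imageWidth"], data["imageHeight"]
    boxes = [s for s in data["shapes"] if s["shape_type"] == "rectangle"]
    points = [s for s in data["shapes"] if s["shape_type"] == "point"]

    lines = []
    for box in boxes:
        (x1, y1), (x2, y2) = box["points"]
        cx, cy = (x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h
        bw, bh = abs(x2 - x1) / img_w, abs(y2 - y1) / img_h
        # Attach the picking point lying inside this box, if one was annotated.
        kx, ky, vis = 0.0, 0.0, 0
        for p in points:
            px, py = p["points"][0]
            if min(x1, x2) <= px <= max(x1, x2) and min(y1, y2) <= py <= max(y1, y2):
                kx, ky, vis = px / img_w, py / img_h, 2
                break
        lines.append(f"0 {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f} {kx:.6f} {ky:.6f} {vis}")

    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{json_path.stem}.txt").write_text("\n".join(lines), encoding="utf-8")

for jf in Path("annotations").glob("*.json"):
    convert(jf, Path("labels"))
```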

2.2.2. Data Augmentation

To enhance model generalization and prevent overfitting, this study implemented data augmentation techniques to diversify the training set, addressing limitations in capturing the full spectrum of real-world environmental conditions. As illustrated in Figure 4, these techniques included (1) image rotation and mirroring to simulate varying viewpoints; (2) random noise injection to emulate diverse image quality degradation; and (3) brightness and contrast adjustments to replicate different illumination conditions and camera sensor responses. Through randomized combinations of these transformations, the approach simulated multifaceted harvesting scenarios and growth environments, enabling the model to achieve enhanced recognition accuracy for zucchini in complex field conditions.
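A minimal sketch of these augmentations using OpenCV and NumPy is shown below; the probabilities and parameter ranges are illustrative assumptions, and geometric transforms (mirroring, rotation) require the box and keypoint labels to be transformed in the same way.

```python
# Minimal augmentation sketch: mirroring/rotation, noise injection, and
# brightness/contrast jitter applied with random probabilities.
import cv2
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    out = img.copy()
    if rng.random() < 0.5:                        # horizontal mirroring
        out = cv2.flip(out, 1)
    if rng.random() < 0.25:                       # 90-degree rotation
        out = cv2.rotate(out, cv2.ROTATE_90_CLOCKWISE)
    if rng.random() < 0.5:                        # Gaussian noise injection
        noise = rng.normal(0, 10, out.shape).astype(np.float32)
        out = np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    if rng.random() < 0.5:                        # brightness / contrast adjustment
        alpha = rng.uniform(0.8, 1.2)             # contrast gain
        beta = rng.uniform(-20, 20)               # brightness offset
        out = cv2.convertScaleAbs(out, alpha=alpha, beta=beta)
    return out

augmented = augment(cv2.imread("zucchini_color.png"))
```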

2.3. Improved YOLOv8n-Pose Network

The YOLO [21] (You Only Look Once) series represents a quintessential single-stage object detection algorithm that integrates classification and regression through anchor-based object localization, offering advantages in real-time processing, efficiency, and adaptability. Developed by Ultralytics, YOLOv8 enhances feature extraction and object detection capabilities by incorporating architectural innovations from prior generations. This framework scales across five variants—YOLOv8n-Pose, YOLOv8s-Pose, YOLOv8m-Pose, YOLOv8l-Pose, and YOLOv8x-Pose—where increasing model precision corresponds to higher computational complexity and parameter counts. Given deployment constraints on mobile harvesting platforms, the lightweight YOLOv8n-Pose architecture was selected to detect zucchini fruits and localize their picking points.
To address challenges in complex natural environments, this study implements targeted enhancements to YOLOv8n-Pose as depicted in Figure 5. First, to improve point localization precision, the Dysample module [22] was integrated into the network to replace the standard average upsampling operation. Second, a Slim-Neck module [23] enhances multi-scale feature extraction capabilities, improving detection accuracy and efficiency for zucchini fruits. Finally, to ensure the training process aligns with the practical demand for high-precision localization, the WIoUv3 loss function is employed during training. It assigns increased weights to critical regions, thereby directing the model’s attention to minimize errors at picking points.
A thorough theoretical analysis and a detailed description of the implementation process are provided in Section 2.3.1, Section 2.3.2 and Section 2.3.3.

2.3.1. Dysample Module

Within the neck feature fusion network of the YOLOv8n-Pose object detection architecture, feature map upsampling typically employs spatial position-based nearest neighbor interpolation. However, this fixed sampling pattern struggles to effectively leverage semantic information within feature maps. To address this limitation, this study introduces the Dysample module to refine the upsampling process. By enhancing the input feature map’s adaptability to noise variations during sampling, Dysample improves model robustness under noisy conditions. The module structure is illustrated in Figure 6.
The core of the Dysample module is a Dynamic Sampling Point Generator (DSPG) that adaptively adjusts the stride of sampling point displacement to enhance upsampled image quality. As shown in Figure 6, a feature map X of dimensions H × W × C serves as input to the DSPG. The DSPG generates upsampling offsets o through linear projection, followed by sigmoid activation, as defined by Equation (1):
$$ o = 0.5\,\sigma\big(\mathrm{linear}_1(x)\big)\cdot \mathrm{linear}_2(x) \tag{1} $$
Subsequently, the offsets o are added to the original sampling grid G to derive the dynamic sampling set S = G + o. The input features are then resampled at these coordinates using grid sampling, producing the upsampled feature map x′ of size sH × sW × C, where s denotes the upscaling factor. Here, S represents the dynamic sampling coordinates, s² indicates the replication count of offsets per dimension, g signifies the number of sampling groups, G corresponds to the original sampling grid, and σ denotes the sigmoid activation function.
This dynamic point sampling approach effectively upscales low-resolution feature maps to higher resolutions, particularly critical for detecting zucchini targets in non-uniform environments. Higher-resolution feature maps better preserve and represent intricate target details under complex conditions, enhancing recognition accuracy. Moreover, the Dysample module operates without requiring specialized CUDA kernels while utilizing fewer parameters, floating-point operations (FLOPs), latency, and GPU memory than conventional methods. These attributes enable efficient deployment on resource-constrained devices, making Dysample exceptionally suitable for the present research objectives.
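The following PyTorch sketch illustrates the dynamic point sampling idea for a single sampling group, following Equation (1) and the grid-sampling step described above; it is a simplified illustration under these assumptions, not the authors' exact Dysample implementation.

```python
# Simplified sketch of DySample-style dynamic upsampling (single group).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Two point-wise projections: offset content and its sigmoid-gated scope.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)
        self.scope = nn.Conv2d(channels, 2 * scale * scale, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = self.scale
        o = 0.5 * torch.sigmoid(self.scope(x)) * self.offset(x)   # Eq. (1)
        o = F.pixel_shuffle(o, s)                                 # (b, 2, s*h, s*w)
        # Original sampling grid G in input-pixel coordinates.
        ys, xs = torch.meshgrid(
            torch.arange(s * h, dtype=x.dtype),
            torch.arange(s * w, dtype=x.dtype),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=0).unsqueeze(0) / s      # (1, 2, s*h, s*w)
        coords = grid + o                                         # dynamic sampling set S = G + o
        # Normalize to [-1, 1] and resample the input features.
        norm = torch.tensor([w - 1, h - 1], dtype=x.dtype).view(1, 2, 1, 1)
        coords = coords / norm * 2 - 1
        return F.grid_sample(x, coords.permute(0, 2, 3, 1),
                             mode="bilinear", align_corners=True)

up = DySampleSketch(64)(torch.randn(1, 64, 20, 20))               # -> (1, 64, 40, 40)
```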

2.3.2. Slim-Neck Module

The Slim-Neck module represents a lightweight architecture specifically designed to optimize the neck structure of object detectors. Positioned between the backbone network and detection head, its core function lies in efficiently fusing multi-scale features to enhance detection accuracy and computational efficiency. The innovation integrates two key lightweight components: First, GSConv, a hybrid module combining group convolution and depthwise separable convolution, replaces standard convolutional layers, reducing computational complexity (measured in FLOPs) and parameter count. The architecture of the GSConv module is illustrated in Figure 7. The input features are first processed through a standard convolutional layer, followed by a depthwise separable convolutional layer. The resulting features from these two branches are then concatenated along the channel dimension. To enhance cross-channel information flow, the concatenated features undergo a channel-shuffling operation.
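A minimal PyTorch sketch of such a GSConv-style block is given below; the kernel sizes, normalization, and activation choices are illustrative assumptions.

```python
# Sketch of a GSConv-style block: standard conv, depthwise conv on its output,
# channel concatenation, and a channel shuffle for cross-channel information flow.
import torch
import torch.nn as nn

class GSConvSketch(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        c_half = c_out // 2
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.dwconv = nn.Sequential(                       # depthwise branch
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1 = self.conv(x)
        y2 = self.dwconv(y1)
        y = torch.cat((y1, y2), dim=1)                     # (b, c_out, h, w)
        # Channel shuffle: interleave the two branches.
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

out = GSConvSketch(64, 128)(torch.randn(1, 64, 40, 40))    # -> (1, 128, 40, 40)
```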
Second, the VoV-GSCSP module substitutes the original complex C2f structure, optimizing model complexity and gradient propagation through streamlined feature fusion pathways and information flow design. By integrating GSConv and VoV-GSCSP, Slim-Neck effectively enhances feature extraction capabilities for zucchini fruits while simultaneously improving model performance and generalization capacity, all with reduced computational burden. The detailed structure of the VoV-GSCSP module is illustrated in Figure 8.

2.3.3. WIoUv3 Loss Function

YOLOv8n-Pose employs CIoU Loss (Complete Intersection-over-Union Loss) for bounding box regression [24]. This loss function integrates three critical geometric metrics between predicted and ground-truth bounding boxes: Intersection-over-Union ratio, centroid distance, and aspect ratio consistency. The mathematical formulation is presented in Equation (2).
$$
\begin{aligned}
L_{\mathrm{CIoU}} &= 1 - IoU + \frac{\left(x - x^{gt}\right)^2 + \left(y - y^{gt}\right)^2}{C_w^2 + C_h^2} + \alpha\nu \\
\alpha &= \frac{\nu}{\left(1 - IoU\right) + \nu} \\
\nu &= \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \\
IoU &= \frac{W_i H_i}{w h + w^{gt} h^{gt} - W_i H_i}
\end{aligned}
\tag{2}
$$
where α is the weighting function and ν quantifies the aspect ratio consistency between the predicted and ground-truth bounding boxes. Figure 9 illustrates the structural components of the IoU metric. Here, (x, y), w, and h represent the center coordinates, width, and height of the predicted box; similarly, (x^gt, y^gt), w^gt, and h^gt denote the ground-truth box's center coordinates, width, and height. The width and height of the intersection region are designated W_i and H_i, with C_w and C_h signifying the width and height of the minimum enclosing rectangle containing both boxes.
However, CIoU quantifies bounding box similarity primarily through aspect ratio divergence, without incorporating dimensional confidence derived from width and height measurements. Consequently, its penalty term may approach zero when the predicted and ground-truth boxes maintain linear aspect ratio proportionality, resulting in uniform regression loss regardless of anchor box quality. To address this limitation, this study adopts WIoUv3 as the replacement loss function. WIoUv3 [25] introduces a dynamic non-monotonic focusing strategy that quantifies outlier degree as an anchor quality indicator, thereby adaptively allocating gradient gain instead of relying on traditional IoU weighting. Compared with CIoU, WIoUv3 eliminates the aspect ratio penalty term, achieving balanced regression contributions from both high- and low-quality anchors. This approach enhances model generalization capacity and detection precision. The mathematical formulation of WIoUv3 is presented in Equation (3).
$$ L_{\mathrm{WIoUv3}} = r \cdot R_{\mathrm{WIoU}} \cdot L_{\mathrm{IoU}} \tag{3} $$
The proposed loss function incorporates three adaptive mechanisms to optimize bounding box regression: (1) a scaling factor R_WIoU ∈ [1, e) magnifies the IoU loss L_IoU for normal-quality anchor boxes to enhance regression sensitivity; (2) a suppression weight L_IoU ∈ [0, 1] reduces R_WIoU's influence on high-quality anchors to prevent optimization imbalance; and (3) when the anchor–target overlap exceeds a threshold τ (IoU ≥ 0.5), a non-monotonic focusing coefficient r dynamically prioritizes normal-quality anchors exhibiting centroid misalignment. These mathematically defined mechanisms collectively optimize detection robustness, with formal implementations given in Equations (4)–(6):
$$ R_{\mathrm{WIoU}} = \exp\!\left(\frac{\left(x - x^{gt}\right)^2 + \left(y - y^{gt}\right)^2}{C_w^2 + C_h^2}\right) \tag{4} $$
$$ L_{\mathrm{IoU}} = 1 - IoU \tag{5} $$
$$ r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}}, \qquad \beta = \frac{L_{\mathrm{IoU}}^{*}}{\overline{L_{\mathrm{IoU}}}} \in [0, +\infty) \tag{6} $$
where β denotes the outlier degree of an anchor box, L_IoU* is the IoU loss detached from the computation graph, L̄_IoU is its running mean over training, and α and δ are hyperparameters of the focusing mechanism.
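The following PyTorch sketch illustrates how Equations (3)–(6) can be combined into a WIoU-v3-style box regression loss; the running mean of L_IoU and the hyperparameters α and δ are illustrative assumptions rather than the exact training configuration.

```python
# Hedged sketch of a WIoU-v3-style bounding box regression loss.
import torch

def wiou_v3(pred, target, iou_mean, alpha=1.9, delta=3.0):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2); iou_mean: running mean of L_IoU."""
    # Intersection-over-union (Eq. 5 uses L_IoU = 1 - IoU).
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    l_iou = 1.0 - iou

    # R_WIoU: center-distance term over the enclosing box (Eq. 4);
    # the denominator is detached so it only rescales the gradient.
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    enc_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    enc_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    r_wiou = torch.exp(((cxp - cxt) ** 2 + (cyp - cyt) ** 2)
                       / (enc_w ** 2 + enc_h ** 2 + 1e-7).detach())

    # Non-monotonic focusing coefficient r from the outlier degree beta (Eq. 6).
    beta = l_iou.detach() / (iou_mean + 1e-7)
    r = beta / (delta * alpha ** (beta - delta))
    return (r * r_wiou * l_iou).mean()                             # Eq. (3)
```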

3. Experimental Design and Discussion

3.1. Experimental Details

To ensure experimental fairness, all procedures were executed on an identical workstation configured with an Intel i7-13700K CPU (Intel Corporation, Santa Clara, CA, USA), 32 GB RAM, and an NVIDIA RTX3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The experiment environment was created in Miniconda (Version: 23.11.0, Anaconda Inc., Austin, TX, USA) using Python (Version: 3.11, Python Software Foundation, Beaverton, OR, USA) as the programming language and PyTorch (Version: 2.0.1, PyTorch Foundation, Menlo Park, CA, USA) as the deep learning framework. An initial pilot training session utilizing early stopping (patience = 20 epochs) was performed to quickly assess the model’s learning behavior. Subsequently, based on the analysis of validation performance metrics across multiple runs, the optimal training duration was determined to be 150 epochs. The final model was then trained for the full 150 epochs without early stopping. Through iterative experimentation comparing diverse hyperparameter combinations followed by fine-tuning based on empirical results, a critical set of training parameters was established (Table 1), balancing network performance optimization with computational resource efficiency. To ensure reproducibility, all experiments were run with four different random seeds; results are reported as the mean ± standard deviation across the four independent runs. Finally, the optimized weights were evaluated on the test set to validate model generalizability. To ensure a fair comparison, all competing pose estimation models (DeepPose, RTMPose, ViTPose, YOLOX-Pose, YOLO11n-Pose and YOLO12n-Pose) were re-trained on our zucchini dataset using the identical training schedule, data augmentation strategies, and hyperparameters that were used for our proposed YOLOv8n-Pose-DSW model.
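A minimal training sketch using the Ultralytics Python API is shown below; the dataset YAML, seed values, and image size are placeholders, and the full hyperparameter set corresponds to Table 1 rather than the values shown here.

```python
# Minimal sketch of the training and test-set evaluation procedure.
from ultralytics import YOLO

for seed in (0, 1, 2, 3):                      # four independent runs; seed values are placeholders
    model = YOLO("yolov8n-pose.yaml")          # the modified DSW variant is used in practice
    model.train(
        data="zucchini-pose.yaml",             # 565/189/188 train/val/test split
        epochs=150,                            # duration chosen from the pilot run
        imgsz=736,                             # nearest stride-32 multiple of the 720-pixel inputs
        seed=seed,
        patience=150,                          # no early stopping within the final 150 epochs
    )
    metrics = model.val(split="test")          # evaluate the optimized weights on the test set
```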

3.2. Metrics for Model Evaluation

The YOLOv8n-Pose model was originally designed for human pose estimation but was adapted in this study for zucchini fruit detection and picking point localization. To evaluate detection accuracy, performance metrics including precision (P), recall (R), mean average precision at 50% IoU (mAP@50), and mean average precision averaged over IoU thresholds from 0.50 to 0.95 with a 0.05 step (mAP@50–95) were employed, as detailed in Equations (7)–(9).
$$ \mathrm{Precision} = \frac{TP}{TP + FP} \times 100\% \tag{7} $$
$$ \mathrm{Recall} = \frac{TP}{TP + FN} \times 100\% \tag{8} $$
$$ \mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N}\int_{0}^{1} \mathrm{Precision}_i(\mathrm{Recall})\,\mathrm{d}(\mathrm{Recall}) \tag{9} $$
For keypoint detection evaluation, this study employs the object keypoint similarity (OKS)-based mAP as the primary metric. Additionally, point-level precision, recall, mean average precision at an OKS threshold of 0.5 (mAP@50), and mean average precision averaged over OKS thresholds from 0.50 to 0.95 with a 0.05 step (mAP@50–95) were adopted as performance indicators for localization accuracy, as formalized in Equation (10).
$$
\begin{aligned}
OKS &= \frac{\sum_i \exp\!\left(-\dfrac{d_i^2}{2 s^2 k_i^2}\right)\delta\!\left(v_i > 0\right)}{\sum_i \delta\!\left(v_i > 0\right)} \\
\mathrm{Precision}_{kpt} &= \frac{TP_{kpt}}{TP_{kpt} + FP_{kpt}} \times 100\% \\
\mathrm{Recall}_{kpt} &= \frac{TP_{kpt}}{TP_{kpt} + FN_{kpt}} \times 100\% \\
AP_{kpt} &= \int_{0}^{1} \mathrm{Precision}_{kpt}\,\mathrm{d}\!\left(\mathrm{Recall}_{kpt}\right) \\
\mathrm{mAP}_{kpt} &= \frac{1}{N}\sum_{i=1}^{N} AP_i
\end{aligned}
\tag{10}
$$
In Equation (10), i denotes the index of annotated picking points, with d_i² representing the squared Euclidean distance between detected and ground-truth picking point positions. The variable s denotes the object scale, such that s² corresponds to the image area occupied by the detected object, while k_i signifies the category-specific normalization factor for the i-th picking point. The indicator function δ(v_i > 0) restricts the calculation to visible keypoints. Performance evaluation employs the following metrics: a true positive (TP) is recorded when a keypoint is correctly detected and the predicted location demonstrates sufficient spatial alignment with the ground truth, exceeding a predefined object keypoint similarity threshold. A false positive (FP) arises when the model identifies a keypoint in a region where no actual feature exists, typically due to over-prediction or noise, resulting in a detection that surpasses the confidence or matching threshold without a corresponding ground truth. Conversely, a false negative (FN) occurs when a genuine keypoint is missed by the model, meaning no valid prediction is generated within the required proximity or threshold criteria for that ground-truth instance. Average precision (AP) constitutes the area under the precision–recall curve, with mean average precision (mAP) representing the categorical mean of AP values across all classes.
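A small NumPy sketch of the OKS term in Equation (10) for the single picking-point keypoint is given below; the per-keypoint constant k is an assumed value, and the object area plays the role of s² in the formula.

```python
# Sketch of the OKS computation for one detected object with one picking point.
import numpy as np

def oks(pred_xy, gt_xy, gt_vis, area, k=0.1):
    """pred_xy, gt_xy: (N, 2) keypoint coordinates; gt_vis: (N,) visibility flags;
    area: object image area, used as s**2 in Eq. (10)."""
    d2 = np.sum((pred_xy - gt_xy) ** 2, axis=1)        # squared distances d_i^2
    e = np.exp(-d2 / (2.0 * area * k ** 2))
    visible = gt_vis > 0                               # indicator delta(v_i > 0)
    return e[visible].sum() / max(visible.sum(), 1)

score = oks(np.array([[355.0, 210.0]]), np.array([[352.0, 214.0]]),
            np.array([2]), area=9000.0)
```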

3.3. Picking Point Positioning Accuracy Metrics

To evaluate picking point localization accuracy with enhanced intuitiveness, Manhattan distance was adopted as the error metric due to its alignment with the discrete grid structure of digital images. Given an image resolution of w × h pixels, where actual and predicted picking points possess normalized pixel coordinates ( x 1 , y 1 ) and ( x 2 , y 2 ) , respectively, the Manhattan distance error is defined as shown in Equation (11).
$$ \Delta(x, y) = w \times \left|x_1 - x_2\right| + h \times \left|y_1 - y_2\right| \tag{11} $$
Equation (11) incorporates coordinate denormalization to account for zucchini harvesting image dimensions, yielding absolute pixel displacement values critical for assessing robotic positioning accuracy. Within agricultural automation systems for zucchini harvesting, the direct correspondence of this metric with the mechanical motion grid (horizontal/vertical axes) provides distinct practical advantages over Euclidean distance. It better adapts to the motion control requirements of zucchini harvesting robots along both horizontal and vertical operational planes.
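A short sketch of Equation (11) applied to normalized picking point coordinates is shown below; the 720 × 720 image size matches the stored dataset resolution, while the example coordinates are illustrative.

```python
# Manhattan-distance error between actual and predicted normalized picking points.
def manhattan_error(x1, y1, x2, y2, w=720, h=720):
    return w * abs(x1 - x2) + h * abs(y1 - y2)

print(manhattan_error(0.512, 0.430, 0.515, 0.427))  # ~4.3 pixels on a 720 x 720 image
```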

3.4. Experimental Results

3.4.1. Ablation Experiment

Figure 10 presents ablation study performance curves for zucchini detection and picking point localization on the validation set. Figure 10a,b depict mAP@50 curves for fruit and picking point detection, respectively, demonstrating that all modified modules in this study achieve superior accuracy compared to the original baseline network. The model stabilizes after 120 epochs without significant fluctuations, overfitting, or underfitting, indicating enhanced training stability. Figure 10d illustrates the loss between predicted and ground-truth coordinates for each picking point in the validation set. Substituting the loss function with WIoUv3 accelerates convergence speed and improves final convergence precision. Compared to the baseline, the enhanced YOLOv8n-Pose exhibits greater stability beyond 140 epochs, confirming its improved accuracy in locating zucchini picking points. This optimized model more effectively interprets input data and extracts features conducive to picking point estimation, conclusively validating the superiority of the proposed enhancements.
To systematically evaluate the impact of the improved Dysample upsampling operator, Slim-Neck module, and WIoUv3 loss function on comprehensive zucchini detection performance, an ablation study was conducted analyzing fruit detection and picking point localization efficacy. Each enhanced component was incrementally integrated into the baseline YOLOv8n-Pose network, with quantitative results detailed in Table 2 and Table 3.
The ablation analysis reveals progressive improvements across key metrics. When replacing the original upsampling structure with Dysample, zucchini detection precision, recall, mAP@50, and mAP@50–95 increased by 2.2%, 10.5%, 4.6%, and 15.1%, respectively, while simultaneously reducing floating-point operations (FLOPs)—demonstrating effective computational complexity reduction without compromising contour precision. Picking point localization improved by 7.6% in precision, 12.2% in recall, 4.9% in mAP@50, and 11.7% in mAP@50–95. Notably, the inference speed remained high at 131 frames per second (FPS), indicating only a minimal decrease compared to the baseline (132 FPS), which confirms the practical usability of Dysample in real-time applications. Subsequent integration of Slim-Neck marginally decreased fruit detection precision but elevated recall, mAP@50, and mAP@50–95 by 8.2%, 5.5%, and 15.4%, respectively, while reducing model size by 0.21 M parameters and FLOPs by 1.1 G operations. Importantly, this modification led to a noticeable improvement in inference speed, reaching 140 FPS, higher than the baseline, demonstrating that Slim-Neck contributes not only to parameter efficiency but also to enhanced execution performance. Picking point metrics showed gains of 2.3% in precision, 7.9% in recall, 9.0% in mAP@50, and 23.7% in mAP@50–95. Finally, WIoUv3 loss optimization enhanced all precision, recall, and mAP metrics, confirming its efficacy in strengthening model fitting capacity and recognition accuracy. This enhancement was achieved with an inference speed of 135 FPS, maintaining real-time capability comparable to the original model. Collectively, these improvements have enhanced the accuracy and efficiency of zucchini detection and picking point localization. The Dysample operator optimizes feature map resolution for precise target identification, while Slim-Neck enables more efficient and representative feature extraction with reduced background noise interference. This dual improvement elevates overall accuracy in challenging conditions. Furthermore, WIoUv3 ensures rapid and stable convergence during training through enhanced fitting and recognition capabilities. Critically, the improved model achieves an inference speed of 137 FPS, successfully balancing high accuracy with real-time performance, demonstrating robust applicability in practical deployment scenarios. These synergistic advancements boost model robustness, establishing high efficacy across diverse zucchini detection scenarios.

3.4.2. Visualization and Discussion of Ablation Experiment

Figure 11 presents a comparative test between the baseline YOLOv8n-Pose model and the improved YOLOv8n-Pose-DSW model under scenarios relevant to future agricultural robotic harvesting systems. The study focused on three key operational dimensions affecting automated harvesting performance.
Regarding viewpoint adaptability, the baseline model exhibited limitations when processing different observation angles. The improved YOLOv8n-Pose-DSW model demonstrated superior performance through its advanced feature processing architecture, significantly enhancing fruit recognition confidence. This improvement originates from the model’s ability to dynamically adjust feature weights based on viewing geometry, a capability that substantially enhances the reliability of target acquisition for robotic manipulators operating in three-dimensional space. Under stem occlusion conditions, both models maintained stable fruit detection capabilities. However, the improved YOLOv8n-Pose-DSW model achieved improvements in picking point localization accuracy: experimental measurements showed substantially reduced positioning errors, with the enhanced model attaining the precision required for mechanical operations. These characteristics indicate potential for reducing collision risks and improving path planning efficiency in future robotic harvesting systems. The performance difference was most significant in partial leaf occlusion scenarios. While the baseline model showed noticeable accuracy degradation, the improved YOLOv8n-Pose-DSW model maintained reliable performance through advanced feature extraction methods. This ability to preserve recognition and localization accuracy under visual occlusion demonstrates significant application potential in densely vegetated natural growth environments.
From a technical implementation perspective, the improved model addresses the special requirements of zucchini harvesting through three key innovations: a geometry-aware feature enhancement module improving multi-view adaptability, a dynamic sampling mechanism strengthening feature expression under occlusion, and an improved loss function ensuring high localization stability. These technological innovations collectively form a specialized vision solution for zucchini harvesting in greenhouse environments.

3.4.3. Comparative Experiments

To evaluate the performance of the proposed model, comparative experiments were conducted on the detection network for zucchini picking points within a greenhouse environment. The detection results are presented in Table 4. The comparison involved several mainstream keypoint detection models: DeepPose [26], RTMPose [27], ViTPose [28], and models from the YOLO series (including YOLOX-Pose [29], YOLOv8-Pose, YOLO11n-Pose, and YOLO12n-Pose [30]).
According to the comparative experimental results for zucchini picking point detection shown in Table 4, the models exhibited marked disparities in detection performance. The DeepPose model achieved scores of 59.6% for P, 60.1% for R, 72.6% for mAP@50, and 52.0% for mAP@50–95, with 23.55 M parameters, 42.8 G FLOPs, and 39 FPS, demonstrating a relatively basic overall performance. Although the RTMPose model outperforms DeepPose on all metrics, with a precision of 64.4%, recall of 63.8%, mAP@50 of 75.2%, and mAP@50–95 of 62.5%, and demonstrates better efficiency (6.17 M parameters, 7.4 G FLOPs, 150 FPS), its precision remains limited and may be insufficient for high-accuracy requirements in agricultural automation scenarios. The ViTPose model attained 67.3% for P and 66.9% for R, surpassing the previous two models in these metrics. However, its mAP@50 and mAP@50–95 were 65.2% and 51.1%, respectively, and it requires substantial computational resources (22.46 M parameters, 88.9 G FLOPs) while achieving only 12 FPS, revealing limitations in real-time keypoint localization tasks. In contrast, the YOLOX-Pose model exhibited a notable improvement in performance, achieving 79.1% for P, 77.3% for R, 84.2% for mAP@50, and 66.5% for mAP@50–95 with balanced efficiency (6.04 M parameters, 13.7 G FLOPs, 125 FPS). The more recent YOLO11n-Pose and YOLO12n-Pose delivered further gains: YOLO11n-Pose attained 86.0% P, 82.0% R, 90.3% mAP@50, and 88.7% mAP@50–95, with low computational cost (2.63 M parameters, 6.7 G FLOPs) and high speed (135 FPS). YOLO12n-Pose achieved even higher P (88.0%) and R (84.6%) with similar efficiency (2.66 M parameters, 6.7 G FLOPs).
Most significantly, the proposed YOLOv8n-Pose-DSW model demonstrated superior performance across all evaluation metrics. It attained the highest scores of 93.1% for P, 89.5% for R, 95.6% for mAP@50, and 95.2% for mAP@50–95. A key improvement is observed in mAP@50–95, where the model exceeds YOLO11n-Pose and YOLO12n-Pose by 6.5% and 9.1%, respectively, underscoring its enhanced capability for accurate keypoint localization under varying IoU thresholds. Moreover, with a competitive computational profile (3.05 M parameters, 8.3 G FLOPs, and 137 FPS), the proposed method attains state-of-the-art accuracy while demonstrating robust performance in both detection precision and multi-scale adaptation.

3.4.4. Visualization and Discussion of Comparative Experiments

Figure 12 visually compares the detection performance of various models under combined lighting and occlusion conditions, with all results benchmarked against the proposed YOLOv8n-Pose-DSW model. In low-light and overexposed scenarios, while all models maintained baseline performance without complete failures, the YOLO-series models consistently achieved higher confidence scores than DeepPose, RTMPose, and ViTPose. Nevertheless, even the newer YOLO11n-Pose and YOLO12n-Pose exhibited slightly lower confidence and visual clarity than our proposed model under these conditions.
When introducing occlusion under normal lighting, performance differences became more apparent. DeepPose showed clear missed detections, while ViTPose and RTMPose maintained detection coverage but with confidence scores 5–8% lower than YOLOv8n-Pose-DSW. Although YOLOX-Pose avoided missed detections, its confidence scores remained substantially below our method. Both YOLO11-Pose and YOLO12-Pose showed improved robustness but still exhibited 3–5% lower confidence scores than our proposed model.
Under the most challenging strong-illumination-with-occlusion scenarios, the advantages of YOLOv8n-Pose-DSW became particularly evident. While YOLOX-Pose maintained detection capability without missed detections, it suffered from reduced confidence in severe light–shadow transitions. ViTPose exhibited noticeable missed detections under these extreme conditions, and YOLO12n-Pose produced duplicate detections with multiple false positives for single picking points.
In clear contrast, YOLOv8n-Pose-DSW outperforms all comparable models in both detection accuracy and reliability across varying environmental challenges. The consistent maintenance of high confidence levels above 85%, combined with the absence of both missed detections and duplicate identifications, confirms its suitability for precision-sensitive agricultural applications.

3.4.5. Analysis of Picking Point Location Under Different Occlusion Levels

To further investigate the robustness of our model in challenging conditions, we conducted a specialized experiment on the test set categorized by occlusion degree. We selected 120 images from the test set and divided them into three subsets: light occlusion (40 images), medium occlusion (40 images), and heavy occlusion (40 images). The performance of the baseline YOLOv8n-Pose and the proposed YOLOv8n-Pose-DSW on each subset is detailed in Table 5.
As illustrated in Table 5, both models exhibit a performance decline as occlusion severity increases, which is an expected challenge. However, our proposed YOLOv8n-Pose-DSW consistently outperforms the baseline across all occlusion levels and all metrics. Under heavy occlusion, the most significant performance bottleneck for both models is a sharp drop in R. This indicates that the primary failure reason is missed picking point detections, where the model fails to perceive heavily obscured fruits. The 11.8% improvement in R offered by our model under these conditions demonstrates the enhanced feature representation capability of the Dysample operator and Slim-Neck structure, allowing it to extract more meaningful cues from limited visible information. Furthermore, the notable 12.1% improvement in mAP@50–95 under heavy occlusion highlights our model’s superior localization precision even in the complex environment. While challenges remain in extreme scenarios, the proposed enhancements mitigate the performance degradation caused by occlusion, confirming the improved robustness of our model.

3.4.6. Picking Point Positioning Accuracy Discussion

To demonstrate localization accuracy differences in zucchini picking point detection between the improved YOLOv8n-Pose-DSW and original YOLOv8n-Pose models, both architectures were evaluated. A test set of 188 zucchini fruit images was utilized for evaluation. The error statistical analysis is illustrated in accompanying figures where the x-axis represents pixel error and the y-axis indicates error frequency. In the horizontal direction (Figure 13a), YOLOv8n-Pose-DSW demonstrated a higher proportion of predictions within the [0, 1.5] pixel error range compared to the baseline model, while showing markedly reduced error frequency in the [4.5, 9] pixel range. Similarly, in the vertical direction (Figure 13b), the improved model exhibited superior prediction precision within the [0, 1.5] pixel interval. The Manhattan distance distribution (Figure 13c)—serving as a comprehensive evaluation metric—further corroborated these findings: YOLOv8n-Pose-DSW achieved higher prediction proportions in the [0, 2] and [2, 4] pixel ranges, with lower frequencies in the [4, 6] and [8, 16] pixel intervals. This distribution pattern aligns with horizontal and vertical directional analyses, collectively indicating YOLOv8n-Pose-DSW’s substantial advantage in localization accuracy. Particularly within the critical low-error range ([0, 4] pixels), the improved model exhibited greater concentration of prediction results, which is essential for precise positioning in practical harvesting applications.

4. Discussion

4.1. Visual Analysis

To further validate the performance advantages of the improved model during feature extraction, this study introduces heatmap visualization to analyze the model’s target-focused regions. Heatmaps intuitively reflect the model’s attention distribution during target recognition through color intensity gradients, where bright regions indicate high-attention features and dark areas represent low-attention zones. This visualization enables clear assessment of the model’s capability to capture zucchini fruit characteristics. As evidenced by the comparative results in Figure 14, the heatmap of the YOLOv8n-Pose-DSW method (Figure 14b) exhibits distinctly highlighted features within zucchini fruit regions with minimal interference signals in background areas (e.g., leaves, peduncles). This demonstrates the model’s precise focus on target fruits and effective suppression of irrelevant background information. In stark contrast, the baseline model (Figure 14c) reveals significant feature extraction deficiencies: (1) partial neglect of zucchini fruits (white-circled areas), indicating missed detection risk; and (2) undesired attention to non-target background regions (yellow-circled areas), increasing false detection probability. These deviations in target feature capture directly compromise the baseline model’s detection stability in complex scenarios. Consequently, the enhanced YOLOv8n-Pose-DSW model optimizes feature extraction mechanisms to better capture zucchini fruits while reducing interference from background peduncles, thereby improving detection accuracy.

4.2. Discussion of Model Limitations and Applications

Despite the significant performance advantages demonstrated by the proposed YOLOv8n-Pose-DSW model in zucchini fruit detection and picking point localization tasks, several limitations warrant attention for further investigation.
Firstly, regarding model performance, while its lightweight design facilitates efficient deployment, its capacity for learning complex features is constrained. The model exhibits suboptimal performance under scenarios of severe occlusion (e.g., dense foliage, vines, or overlapping multiple fruits), where it is susceptible to decreased accuracy or missed detections. This indicates that the model’s adaptability to extreme environments requires enhancement. Subsequent research should focus on mitigating this limitation by strengthening the feature extraction network or incorporating more robust occlusion-handling mechanisms.
Secondly, dataset limitations also constrain the model’s generalization capability. The current training data lack sufficient coverage of morphological variations across different zucchini growth stages (seedling, fruiting, and maturity) and diverse environmental conditions (e.g., varying light intensity, weather changes). This deficiency may impair its detection efficiency within complex backgrounds and poses encountered in real orchard settings. Future work will prioritize constructing a more diverse dataset by incorporating samples under varying weather conditions, viewing angles, and growth cycles to enhance the model’s environmental robustness. Concurrently, leveraging the growth characteristics of zucchini, exploring optimized picking point configuration strategies based on dynamic factors such as pedicel angle and maturity level could further improve localization accuracy.
Furthermore, to promote practical application, this model will be integrated into the embedded vision system of our team's independently developed mobile field robot in subsequent research. The low computational load of the proposed model makes such low-latency embedded vision feasible. In addition, dedicated hardware accelerators, such as FPGA-based image processing pipelines, can potentially reduce the end-to-end system latency by orders of magnitude, meeting the stringent requirements of real-time robotic harvesting. Through system collaboration, the 3D spatial coordinates of the pedicel picking points output by the model will be transmitted in real time to the robot control system. This integration will establish a closed-loop automation process from visual perception to robotic arm execution. This practical implementation will not only validate the model's real-world utility but also provide critical technological support for the intelligent upgrading of agricultural automation equipment.

5. Conclusions

Addressing the relative scarcity of research on visual algorithms for intelligent zucchini harvesting, this study proposes a lightweight YOLOv8n-Pose-DSW keypoint detection algorithm. The algorithm aims to achieve efficient and precise identification and localization of zucchini fruits and their picking points in complex environments. By integrating the Dysample operator, Slim-Neck network, and WIoUv3 loss function into the YOLOv8n-Pose framework, the model’s overall performance was significantly enhanced.
Experimental results demonstrate substantial improvements across key metrics. In the zucchini fruit detection task, the improved YOLOv8n-Pose achieved precision, recall, mAP@50, and mAP@50–95 of 92.1%, 90.7%, 94.0%, and 71.4%, respectively. These represent significant gains of 3.3%, 11.7%, 7.4%, and 15.4% over the baseline YOLOv8n-Pose model. For the picking point localization task, precision, recall, mAP@50, and mAP@50–95 further increased to 93.1%, 89.5%, 95.6%, and 95.2%, corresponding to improvements of 8.8%, 11.0%, 11.3%, and 27.9% compared to the baseline. Furthermore, the proposed model outperformed other mainstream keypoint detection algorithms, demonstrating superior overall performance.
The study confirmed that the proposed model exhibits excellent robustness and stability under varying lighting conditions and complex growth environments. The improved YOLOv8n-Pose, with only 3.05 M parameters, effectively handles zucchini target detection and picking point localization tasks across diverse backgrounds while maintaining high accuracy and low computational load. This efficiency makes it suitable for deployment on resource-constrained mobile platforms. Different from prior works that focused solely on lightweight design or generic detection, our approach achieves a superior balance of high accuracy, real-time speed, and model efficiency.
This algorithm significantly enhances the detection and localization capabilities for zucchini fruits and their picking points, thereby laying the technical groundwork for intelligent harvesting systems. It holds broad application prospects in the field of agricultural automation, particularly for the development of unmanned precision zucchini harvesting systems.

Author Contributions

Conceptualization, H.S. (Hongxiong Su) and J.L.; methodology, H.S. (Hongxiong Su) and S.W.; validation, H.S. (Hongxiong Su), S.W. and Y.L.; formal analysis, H.S. (Honglin Su) and F.M.; investigation, H.S. (Honglin Su); resources, H.S. (Hongxiong Su) and H.S. (Honglin Su); writing—original draft preparation, H.S. (Hongxiong Su) and S.W.; writing—review and editing, S.W., F.M., J.L. and Y.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Postgraduate Education Innovation Program of Shanxi Province (Grant No. 2025AL05).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

We appreciate and thank the anonymous reviewers for their helpful comments that led to the overall improvement of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liang, J.; Liu, G. High-Quality and High-Efficiency Management Techniques for Open-Field Cultivation of Zucchini. Fruit Grow. Friend 2025, 26, 78–80. (In Chinese) [Google Scholar]
  2. Ben-Noun (Nun), L. Characteristics of Zucchini; B. N. Publication House: Hawthorne, CA, USA, 2019. [Google Scholar]
  3. Paris, H.S. Germplasm enhancement of Cucurbita pepo (pumpkin, squash, gourd: Cucurbitaceae): Progress and challenges. Euphytica 2016, 208, 415–438. [Google Scholar] [CrossRef]
  4. Grumet, R.; McCreight, J.D.; McGregor, C.; Weng, Y.; Mazourek, M.; Reitsma, K.; Labate, J.; Davis, A.; Fei, Z. Genetic resources and vulnerabilities of major cucurbit crops. Genes 2021, 12, 1222. [Google Scholar] [CrossRef] [PubMed]
  5. Zhao, Y.; Gong, L.; Huang, Y.; Liu, C. A review of key techniques of vision-based control for harvesting robot. Comput. Electron. Agric. 2016, 127, 311–323. [Google Scholar] [CrossRef]
  6. Luo, L.; Tang, Y.; Zou, X.; Ye, M.; Feng, W.; Li, G. Vision-based extraction of spatial information in grape clusters for harvesting robots. Biosyst. Eng. 2016, 151, 90–104. [Google Scholar] [CrossRef]
  7. Zhao, C.; Lee, W.S.; He, D. Immature green citrus detection based on colour feature and sum of absolute transformed difference (SATD) using colour images in the citrus grove. Comput. Electron. Agric. 2016, 124, 243–253. [Google Scholar] [CrossRef]
  8. Rathnayake, N.; Rathnayake, U.; Dang, T.L.; Hoshino, Y. An Efficient Automatic Fruit-360 Image Identification and Recognition Using a Novel Modified Cascaded-ANFIS Algorithm. Sensors 2022, 22, 4401. [Google Scholar] [CrossRef]
  9. Fu, L.; Feng, Y.; Wu, J.; Liu, Z.; Gao, F.; Majeed, Y.; Al-Mallahi, A.; Zhang, Q.; Li, R.; Cui, Y. Fast and accurate detection of kiwifruit in orchard using improved YOLOv3-tiny model. Precis. Agric. 2021, 22, 754–776. [Google Scholar] [CrossRef]
  10. Lu, Y.; Young, S. A survey of public datasets for computer vision tasks in precision agriculture. Comput. Electron. Agric. 2020, 178, 105760. [Google Scholar] [CrossRef]
  11. Tang, Y.; Qiu, J.; Zhang, Y.; Wu, D.; Cao, Y.; Zhao, K.; Zhu, L. Optimization strategies of fruit detection to overcome the challenge of unstructured background in field orchard environment: A review. Precis. Agric. 2023, 24, 1183–1219. [Google Scholar] [CrossRef]
  12. Wang, H.; Lin, Y.; Xu, X.; Chen, Z.; Wu, Z.; Tang, Y. A study on long-close distance coordination control strategy for litchi picking. Agronomy 2022, 12, 1520. [Google Scholar] [CrossRef]
  13. Zhang, T.; Wu, F.; Wang, M.; Chen, Z.; Li, L.; Zou, X. Grape-bunch identification and location of picking points on occluded fruit axis based on YOLOv5-GAP. Horticulturae 2023, 9, 498. [Google Scholar] [CrossRef]
  14. Li, Y.; Wang, W.; Guo, X.; Wang, X.; Liu, Y.; Wang, D. Recognition and positioning of strawberries based on improved YOLOv7 and RGB-D sensing. Agriculture 2024, 14, 624. [Google Scholar] [CrossRef]
  15. Chen, X.; Dong, G.; Fan, X.; Xu, Y.; Liu, T.; Zhou, J.; Jiang, H. Fruit Stalk Recognition and Picking Point Localization of New Plums Based on Improved DeepLabv3+. Agriculture 2024, 14, 2120. [Google Scholar] [CrossRef]
  16. Ma, Z.; Dong, N.; Gu, J.; Cheng, H.; Meng, Z.; Du, X. STRAW-YOLO: A detection method for strawberry fruits targets and key points. Comput. Electron. Agric. 2025, 230, 109853. [Google Scholar] [CrossRef]
  17. Du, X.; Meng, Z.; Ma, Z.; Lu, W.; Cheng, H. Tomato 3D pose detection algorithm based on keypoint detection and point cloud processing. Comput. Electron. Agric. 2023, 212, 108056. [Google Scholar] [CrossRef]
  18. Wu, Z.; Xia, F.; Zhou, S.; Xu, D. A method for identifying grape stems using keypoints. Comput. Electron. Agric. 2023, 209, 107825. [Google Scholar] [CrossRef]
  19. Huang, Y.; Zhong, Y.; Zhong, D.; Yang, C.; Wei, L.; Zou, Z.; Chen, R. Pepper-YOLO: An lightweight model for green pepper detection and picking point localization in complex environments. Front. Plant Sci. 2024, 15, 1508258. [Google Scholar] [CrossRef]
  20. Wang, H.; Yun, L.; Yang, C.; Wu, M.; Wang, Y.; Chen, Z. OW-YOLO: An Improved YOLOv8s Lightweight Detection Method for Obstructed Walnuts. Agriculture 2025, 15, 159. [Google Scholar] [CrossRef]
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  22. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6027–6037. [Google Scholar]
  23. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62. [Google Scholar] [CrossRef]
  24. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–9 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  25. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  26. Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
  27. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. Vitpose: Simple vision transformer baselines for human pose estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 38571–38584. [Google Scholar]
  28. Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. Rtmpose: Real-time multi-person pose estimation based on mmpose. arXiv 2023, arXiv:2303.07399. [Google Scholar]
  29. Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. Yolo-pose: Enhancing yolo for multi person pose estimation using object keypoint similarity loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 2637–2646. [Google Scholar]
  30. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
Figure 1. Zucchini cultivation environment in multi-span greenhouse.
Figure 2. Image data in complex environments.
Figure 3. Example of dataset annotation.
Figure 4. Dataset enhancement methods.
Figure 5. Improved YOLOv8n-Pose model structure.
Figure 6. Dysample upsampling network structure.
Figure 7. GSConv module.
Figure 8. The structure of the GSbottleneck and VoV-GSCSP modules.
Figure 9. Example of IoU structure.
Figure 10. The mAP and loss curve of ablation experiments.
Figure 11. Comparison of zucchini detection before and after improvement.
Figure 12. Comparison of the visualization results of different models. Blue ovals mark misidentified and missed zucchini fruits.
Figure 13. Histogram of pixel error statistics.
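For reference, the pixel error summarized in Figure 13 is the Euclidean distance (in pixels) between each predicted picking point and its annotated ground-truth location. The following is a minimal NumPy sketch of how such an error histogram could be computed; the array contents and bin edges are illustrative assumptions, not values from this study.

```python
import numpy as np

# Hypothetical inputs: N x 2 arrays of (x, y) picking-point coordinates in pixels.
# pred_pts would come from the keypoint head of the trained model, gt_pts from the annotations.
pred_pts = np.array([[412.3, 288.1], [198.7, 305.4], [640.2, 150.9]])
gt_pts   = np.array([[410.0, 290.0], [200.0, 303.0], [645.0, 152.0]])

# Per-point localization error: Euclidean distance in pixel units.
errors = np.linalg.norm(pred_pts - gt_pts, axis=1)

# Bin the errors into a pixel-error histogram (bin edges are assumed, not the paper's).
bins = np.arange(0, 22, 2)  # 0-2, 2-4, ..., 18-20 pixels
counts, edges = np.histogram(errors, bins=bins)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:>4.0f}-{hi:<4.0f} px: {c} point(s)")
```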
Figure 14. Comparison of heatmaps before and after network improvement. The white oval marks an area at risk of missed detection; yellow ovals mark areas with an increased probability of false detection.
Table 1. Training parameters.
Training Parameters | Values
Initial learning rate | 0.01
Optimizer | SGD
Optimizer momentum | 0.937
Optimizer weight decay rate | 0.0005
Number of images per batch | 16
Number of epochs | 150
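The hyperparameters in Table 1 map directly onto the standard Ultralytics training arguments. A minimal sketch of how such a run could be configured is shown below; the dataset YAML path is a placeholder, and the improved DSW variant would substitute its own model definition rather than the stock weights loaded here.

```python
from ultralytics import YOLO

# Start from the stock YOLOv8n-Pose weights; the DSW variant would load a custom model YAML instead.
model = YOLO("yolov8n-pose.pt")

# Training settings matching Table 1; "zucchini-pose.yaml" is a hypothetical dataset config.
model.train(
    data="zucchini-pose.yaml",
    epochs=150,           # number of epochs
    batch=16,             # images per batch
    optimizer="SGD",
    lr0=0.01,             # initial learning rate
    momentum=0.937,       # optimizer momentum
    weight_decay=0.0005,  # optimizer weight decay rate
)
```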
Table 2. Results of ablation experiments for the detection of zucchini.
Baseline Network | Dysample | Slim-Neck | WIoUv3 | P (%) | R (%) | mAP@50 (%) | mAP@50–95 (%) | Para (M) | FLOPs (G) | FPS
YOLOv8n-Pose | × | × | × | 88.8 ± 0.2 | 79.0 ± 0.4 | 86.6 ± 0.3 | 56.0 ± 0.5 | 3.08 | 8.7 | 132 ± 5
YOLOv8n-Pose | ✓ | × | × | 91.0 ± 0.2 | 89.5 ± 0.5 | 91.3 ± 0.2 | 71.1 ± 0.3 | 3.09 | 8.5 | 131 ± 5
YOLOv8n-Pose | × | ✓ | × | 87.2 ± 0.4 | 87.2 ± 0.3 | 92.1 ± 0.2 | 62.8 ± 0.4 | 2.87 | 7.6 | 140 ± 3
YOLOv8n-Pose | × | × | ✓ | 91.0 ± 0.1 | 88.6 ± 0.3 | 93.5 ± 0.1 | 71.4 ± 0.3 | 3.07 | 8.4 | 135 ± 3
YOLOv8n-Pose | ✓ | ✓ | ✓ | 92.1 ± 0.3 | 90.7 ± 0.2 | 94.0 ± 0.1 | 71.4 ± 0.3 | 3.05 | 8.3 | 137 ± 3
Table 3. Results of ablation experiments for zucchini picking point detection.
Baseline Network | Dysample | Slim-Neck | WIoUv3 | P (%) | R (%) | mAP@50 (%) | mAP@50–95 (%)
YOLOv8n-Pose | × | × | × | 84.3 ± 0.2 | 78.5 ± 0.2 | 84.3 ± 0.1 | 67.3 ± 0.2
YOLOv8n-Pose | ✓ | × | × | 91.9 ± 0.1 | 90.7 ± 0.2 | 89.2 ± 0.1 | 79.0 ± 0.1
YOLOv8n-Pose | × | ✓ | × | 86.6 ± 0.2 | 86.4 ± 0.2 | 93.3 ± 0.1 | 89.4 ± 0.2
YOLOv8n-Pose | × | × | ✓ | 91.5 ± 0.2 | 87.3 ± 0.1 | 93.7 ± 0.1 | 91.0 ± 0.2
YOLOv8n-Pose | ✓ | ✓ | ✓ | 93.1 ± 0.1 | 89.5 ± 0.2 | 95.6 ± 0.2 | 95.2 ± 0.2
Table 4. Comparison of picking point detection among different models.
Model | P (%) | R (%) | mAP@50 (%) | mAP@50–95 (%) | Para (M) | FLOPs (G) | FPS
DeepPose | 59.6 ± 0.3 | 60.1 ± 0.2 | 72.6 ± 0.3 | 52.0 ± 0.2 | 23.55 | 42.8 | 39 ± 3
RTMPose | 64.4 ± 0.2 | 63.8 ± 0.2 | 75.2 ± 0.1 | 62.5 ± 0.2 | 6.17 | 7.4 | 150 ± 6
ViTPose | 67.3 ± 0.3 | 66.7 ± 0.2 | 65.2 ± 0.4 | 51.1 ± 0.3 | 22.46 | 88.9 | 12 ± 1
YOLOX-Pose | 79.1 ± 0.3 | 77.3 ± 0.3 | 84.2 ± 0.1 | 66.5 ± 0.1 | 6.04 | 13.7 | 125 ± 4
YOLO11n-Pose | 86.0 ± 0.2 | 82.0 ± 0.2 | 90.3 ± 0.1 | 88.7 ± 0.1 | 2.63 | 6.7 | 135 ± 6
YOLO12n-Pose | 88.0 ± 0.3 | 84.6 ± 0.3 | 89.5 ± 0.2 | 86.1 ± 0.4 | 2.66 | 6.7 | 104 ± 3
YOLOv8n-Pose-DSW | 93.1 ± 0.1 | 89.5 ± 0.2 | 95.6 ± 0.2 | 95.2 ± 0.2 | 3.05 | 8.3 | 137 ± 3
Table 5. Picking point localization results under different occlusion levels.
Occlusion Degree | Model | P (%) | R (%) | mAP@50 (%) | mAP@50–95 (%)
Light | YOLOv8n-Pose-DSW | 95.8 ± 0.1 | 93.1 ± 0.2 | 97.1 ± 0.2 | 88.8 ± 0.2
Light | YOLOv8n-Pose | 90.1 ± 0.1 | 88.3 ± 0.1 | 92.2 ± 0.1 | 79.8 ± 0.2
Medium | YOLOv8n-Pose-DSW | 90.2 ± 0.3 | 86.0 ± 0.3 | 92.0 ± 0.2 | 78.5 ± 0.1
Medium | YOLOv8n-Pose | 82.1 ± 0.2 | 75.5 ± 0.3 | 80.0 ± 0.2 | 65.2 ± 0.4
Heavy | YOLOv8n-Pose-DSW | 76.3 ± 0.4 | 63.7 ± 0.3 | 71.4 ± 0.2 | 58.0 ± 0.4
Heavy | YOLOv8n-Pose | 64.8 ± 0.4 | 51.9 ± 0.2 | 58.6 ± 0.3 | 44.7 ± 0.5
