A Monocular Pose Estimation Framework for Automatic Dragon Fruit Harvesting Using Navel and Stem Keypoints

Yang, Xing; Bai, Liping; Zhang, Tai; Wu, Rongzhen

doi:10.3390/horticulturae12040505


¹ Institute of Systems Engineering and Collaborative Laboratory for Intelligent Science and Systems, Macau University of Science and Technology, Macau, China

² School of Biological Science and Medical Engineering, Hunan University of Technology, Zhuzhou 412007, China

* Author to whom correspondence should be addressed.

Horticulturae 2026, 12(4), 505; https://doi.org/10.3390/horticulturae12040505

Submission received: 28 January 2026 / Revised: 16 April 2026 / Accepted: 18 April 2026 / Published: 21 April 2026

(This article belongs to the Section Postharvest Biology, Quality, Safety, and Technology)


Abstract

Automated fruit harvesting is crucial for alleviating labor shortages and enhancing agricultural productivity. In this context, fruit pose information must be obtained before picking in order to avoid damaging the fruit and/or the plant. However, the complex and unstructured orchard environment poses significant challenges for the pose estimation task. In this study, a dragon fruit pose estimation (DFPE) framework using a single RGB image is proposed for automated dragon fruit harvesting. It includes three key components: dataset annotation processing, keypoint detection, and geometric pose estimation. First, a multi-source dataset consisting of 8476 images is constructed to enhance the estimation model’s generalizability. A pseudo four-keypoint annotation strategy is designed to fit the annotation rules of mainstream single-class keypoint detection models and mitigate the inherent limitations of multi-target keypoint detection in agricultural scenarios. This strategy implicitly encodes the fruit’s orientation using bounding box group IDs, while preserving geometric information for pose inference. Then, the fruit body and its two core keypoints (navel and stem) are detected via a real-time keypoint detection model. Notably, the proposed DFPE framework is detector-agnostic: other mainstream keypoint detection models can also be plugged into the subsequent geometric pose inference stage, which supports the generality and scalability of the framework. Finally, a dragon fruit pose estimation algorithm based on customized geometric constraints is designed, which takes the detected pose information as the input and outputs the posture of the dragon fruit. The results of experiments conducted in natural orchard and laboratory environments demonstrate that the ellipses fitted using the proposed DFPE framework closely align with fruit contours, even under foliage occlusion conditions.
In the laboratory environment, roll errors reached a maximum of 14.8°, whereas yaw errors peaked at 13.4°. Crucially, all roll and yaw errors remained consistently below 15°, which is well within the tolerance threshold required for non-destructive picking operations using a harvesting robot. In summary, this work presents a low-cost solution for dragon fruit pose estimation from a single RGB image, which can potentially be extended to other ellipsoid crops and is suitable for implementation in harvesting robots operating in orchards.

Keywords:

harvesting robots; dragon fruit; pose estimation; YOLO11n-Pose

1. Introduction

Dragon fruit is nutritionally rich, containing high levels of vitamin C and water-soluble dietary fiber [1]. The Asia–Pacific region, particularly China, is the world’s largest producer of dragon fruit. By 2020, China’s dragon fruit output reached 1.526 million tons, making it a key agricultural economic industry contributing to local rural revitalization [2]. However, dragon fruit has a short harvesting window: mature fruits must be picked within 3–5 days to avoid rot and quality deterioration, which places high demands on harvesting efficiency. To achieve efficient and high-yield production, agriculture was, and still is, highly reliant on farmers’ experience, knowledge, and wisdom [3]. At present, orchard management and fruit harvesting processes remain labor-intensive and costly. Labor consumed in harvesting accounts for 40–50% of the entire planting process, with manual harvesting costs representing 50–70% of total orchard expenses [4]. In 2021, the average comprehensive mechanization rate of orchards in China was approximately 30%, while the mechanization rate for the harvesting process was less than 3% [4]. Globally, the agricultural sector faces labor shortages driven by demographic shifts, including an aging population, rural-to-urban migration, and declining interest in physically demanding farm work. As agricultural production expands amid these labor constraints, fruit growers are encountering significant challenges, particularly low harvesting efficiency and rising labor costs. The growing gap between labor availability and harvesting demands during peak seasons underscores the need for automated solutions such as robotic harvesting systems.

The design of fruit harvesting robots often depends on the characteristics of the target fruit or fruits. Fruits can be categorized into single- and cluster-growing types. Figure 1 shows several representative fruits with different growth patterns. Cluster-growing fruits require simultaneous detection of fruit bodies and stem cutting points [5,6], while single-growing hard fruits (represented by dragon fruit [7] and citrus [8]) with firm stem–branch attachment rely on accurate 3D position and pose information to achieve non-destructive harvesting. Different picking methods require different key technologies to be implemented in the harvesting robot’s vision system. In recent years, research on vision systems for fruit harvesting robots has primarily focused on fruit recognition and localization, ripeness detection, and pose estimation. Moreover, existing research has mainly focused on high-yield, easy-to-pick fruits, such as apples and tomatoes, with less research on the detection and pose estimation of dragon fruit in natural environments.

Accurate fruit pose information is essential for the non-destructive harvesting of dragon fruit, as the growth direction of the fruit in natural orchards is highly uncertain. Without precise pose guidance, the harvesting actuator may damage branches and adjacent fruits during operation. Notably, the 2D pose parameters (roll and yaw angles) obtained from monocular RGB images in this study can be mapped to the 3D operating space of the harvesting robot through camera calibration, providing reliable spatial attitude guidance for automated picking operations. Therefore, it is crucial to accurately estimate the fruit pose to achieve non-destructive and efficient automated dragon fruit harvesting.

Existing fruit pose estimation methods are primarily classified into three categories based on their data requirements. Methods in the first category rely solely on 3D point cloud information. An example is the 3D point cloud reconstruction method [9,10], with which the fruit pose is estimated via a template-based approach that involves acquiring 3D point cloud data of target fruits, processing the data through reconstruction and segmentation algorithms, and then searching for the optimal offline template. The second category of methods depend on both 3D point cloud information and RGB images [11], integrating 3D point cloud data with RGB image processing algorithms (e.g., keypoint detection and stem segmentation), and completing pose estimation by capturing information about key parts of the fruit. The third category includes methods that rely only on RGB images, without requiring 3D point cloud data. These methods involve operations such as image segmentation, keypoint detection, and object detection, which are performed on RGB images to identify fruits and estimate their pose. Table 1 details typical fruit pose estimation methods used in the existing literature.

Notably, methods in the first two categories rely on high-quality 3D point cloud data for effective pose estimation. However, under natural orchard conditions, the accuracy of point cloud data is often compromised by environmental complexity, resulting in noise interference and partial information loss. This not only increases the difficulty of data pre-processing but may also compromise the real-time requirements of harvesting robots. Therefore, researchers have turned to single RGB image-based solutions, which usually first detect fruits and then estimate their poses via image segmentation or keypoint detection. Three primary RGB image-based fruit pose estimation approaches have been developed for robotic harvesting.

The first method classifies fruits into multiple categories based on harvest-adapted growth orientations. Lv et al. [12] divided apples into four harvest-adapted categories according to their natural growth forms in orchards to guide the harvesting robot to adopt differentiated obstacle avoidance and picking strategies for fruits with different occlusion conditions. Jin et al. [13] categorized fruits into five types (front, up, down, left, and right) according to their growth patterns; Zhang et al. [14] divided the 3D poses of tomatoes into four categories (front–right, front–middle, front–left, and back) based on the fruit’s axial direction; and Zhou et al. [7] classified dragon fruit into two orientations (forward and lateral), where forward-oriented fruits can be directly harvested while lateral ones require further angle measurement. This method converts the pose estimation problem into a multi-object detection task. However, it fails to obtain precise spatial poses, typically leading to relatively low harvesting success rates in actual field environments.

The second method enables robotic harvesting through the detection of cutting points on fruit-attached stems. Wu et al. [5] proposed a top-down stem localization method using grape growth characteristics; Chen et al. [6] proposed an improved YOLOv8-GP (YOLOv8-Grape and Picking Point) model based on YOLOv8n-pose to address the simultaneous detection of grape clusters and picking points on their stems; Gu et al. [15] and Diao et al. [16] realized concurrent detection of mango fruits and peduncles; Li et al. [17] centered on tomato bunches without leaf interference, adopting a segmentation method to segment the main stem, peduncles, and fruit clusters to obtain cutting keypoints; Zhang et al. [18] designed a system with six keypoints for tomato clusters and five for peduncles, targeting scenarios with a single object structure, and this framework was further extended by integrating 3D point clouds to generate 3D bounding boxes and 3D poses [19]; Sun et al. [20] specifically detected cutting points on citrus peduncles; and Yu et al. [21] proposed the R-YOLO model with MobileNet-V1 as the backbone network, introducing rotated bounding boxes to predict fruit poses for strawberries via keypoint detection. This method is not suitable for dragon fruit harvesting, as dragon fruit either lacks a peduncle or has an extremely short one.

Table 1. Comparison and analysis of typical fruit pose estimation methods.

Target Crop	Technical Characteristics	Performance Indicators
Tomato [22]	A 3D pose detection algorithm (TPD) consisting of the YOLO-lmk model and a point cloud processing module.	YOLO-lmk model: bounding box mAP = 92.9%, dlmk = 7.9, FLOPs = 16.6 B, speed = 0.062 s/sheet.
Strawberry [21]	An R-YOLO model with MobileNet-V1 backbone, predicting rotated bounding boxes for fruit pose estimation and picking point localization.	Overall precision = 94.43%, average detection time = 0.056 s/sheet, field harvesting success rate = 84.35%.
Tomato bunch [18]	The Tomato Pose Method (TPM) for 3D pose detection, including an a priori geometric model, a cascaded multi-task network, and a 3D reconstruction process.	2D keypoint detection success rate = 94.02%, accuracy = 85.77%; can construct 70.05% of tomato bunches with multiple poses.
Tomato bunch [19]	A multi-task algorithm (TPMv2) to estimate 3D poses based on 2D/3D bounding boxes and keypoints.	2D BBox precision = 0.9372, 3D BBox precision = 0.8700; 2D PCK = 0.8882, 3D PCK = 0.7836; 78.36% of 3D keypoints have positioning errors < 20 mm.
Grape [23]	An improved YOLOv8n-GP model based on YOLOv8n-Pose, integrating SENetV2 attention and CARAFE upsampling for synchronous detection of grape clusters and stem picking keypoints.	mAP = 97.1%, mAP-kp = 95.4%, inference time = 7.3 ms/sheet, model size = 6.4 MB.
Cherry tomato clusters [17]	An improved HER-SAC algorithm to guide collision-free grasping motions of end-effectors.	Orchard harvesting success rate = 85.5%, average operation time = 11.42 s.
Citrus [8]	A multi-task learning model (FPENet) to simultaneously locate fruit navel points and predict rotation vectors.	Fruit navel point detection AP = 88.92, average rotation vector error = 11.13°; over 90% of rotation vectors have errors < 22.5°; harvesting success rate = 79.79%.
Apple [24]	A method for estimation of fruit orientation by projecting 2D information onto the 3D space.	Median angular error: 17.6° (orchard), 14.6° (laboratory).
Tomato [25]	A deep learning network (Deep-ToMaToS) for simultaneous three-level maturity classification and 6D pose (3D translation and 3D rotation) estimation.	6D pose estimation accuracy = 96% (ADD_S metric), average harvesting success rate = 84.5%.

The third method derives fruits’ spatial pose via the fusion of keypoint detection and geometric features. Du et al. [22] used the line connecting keypoints and the centroid obtained from point clouds to represent the fruit’s growth direction; Sun et al. [8] detected navel keypoints and regressed rotation vectors; Kok et al. [24] proposed a method that projects 2D information into 3D space through a keypoint detection neural network and line segment extraction-based circle detection, calculating the unit vector of apple orientation, which was validated on public 3D point cloud datasets and in laboratory environments, achieving an azimuth error of 17.6° in orchard scenarios and 14.6° in laboratory settings, showing performance comparable to existing studies and suitability for orchard robotic harvesting; and Jang et al. [26] first segmented tomato fruit bodies and sepals using an instance segmentation model applied to RGB images, matched tomato bodies and sepals from multiple candidates, generated point clouds corresponding to each part based on RGB-D data, and estimated tomato poses as direction vectors through the centroids of the point clouds of tomato bodies and sepals respectively. This method can achieve relatively precise spatial poses but requires the integration of accurate 3D point cloud data.

Beyond the dominant YOLO-series frameworks in agricultural pose estimation, the field has seen rapid advances in ViT-based, two-stage, and bottom-up keypoint estimation paradigms. Ni et al. [27] proposed TomatoPoseNet, which outperformed the ViT-based ViTPose by 3.78% in keypoint detection accuracy and provided a benchmark for CNN and transformer backbone selection in fruit harvesting; Ci et al. [28] verified the robustness of two-stage Keypoint R-CNN in occluded greenhouse environments; and Kim et al. [29] developed a bottom-up OpenPose-based method for simultaneous multi-target pose estimation with constant computational load. Notably, ViTs have emerged as a promising direction for agricultural pose estimation with strong global feature modeling ability, yet they still face challenges of high computational overhead and large parameter scale for edge deployment on harvesting robots [27], with lightweight ViT-based solutions remaining under-explored.

Although these methods have advanced the field of fruit pose estimation, including the aforementioned non-YOLO architectures for tomato crops [27,28,29], their application remains limited to specific fruit types, with most existing works focusing on tomato, apple and other spherical fruits, while few studies have addressed the pose estimation of ellipsoidal dragon fruits in natural orchard environments. To address the challenge of automatic dragon fruit harvesting, this study proposes an RGB image-based dragon fruit pose estimation (DFPE) framework aimed at acquiring accurate spatial poses. This framework adopts a method that transforms the 3D pose estimation problem into determining the spatial yaw and roll angles. To match YOLO11n-Pose’s single-class annotation requirements, a pseudo four-keypoint annotation strategy was designed, and the spatial yaw and roll angles are then derived from the relative positions of the detected keypoints and the target fruit. The main contributions of this study are summarized as follows:

(1) A dataset for dragon fruit pose estimation is compiled under various lighting conditions and made publicly available, enabling researchers to reproduce the experimental results.

(2) We propose a 3D pose estimation method for dragon fruit that detects navel and stem keypoints to predict the fruit’s orientation, transforming the 3D pose estimation problem of near-ellipsoidal fruits into a 2D keypoint detection task and thus significantly simplifying the pose estimation process.

(3) We develop a pseudo four-keypoint annotation strategy to label two target types (fruit_front and fruit_back) and their corresponding keypoints. This meets YOLO-Pose’s annotation requirements, implicitly encoding the orientation of fruit via bounding box group IDs while preserving geometric information for pose inference and facilitating effective model training.

2. Materials and Methods

The overall workflow of the proposed DFPE framework is illustrated in Figure 2. First, the framework commences with dataset construction, which encompasses dragon fruit image acquisition and annotation: each image is annotated with the target fruit’s bounding box, stem keypoint, and navel keypoint. To align with the single-class annotation requirement of one of the mainstream keypoint detection models, YOLO11n-Pose, a pseudo four-keypoint format conversion algorithm is employed: this algorithm translates the annotated information into the four-keypoint structural format, while assigning Class ID 0 to “fruit_front” and Class ID 1 to “fruit_back” to encode the fruit’s distinct growth orientations. The processed dataset (formatted in the pseudo four-keypoint structure) is then fed into the keypoint detection model for end-to-end detection of dragon fruit instances and regression of their corresponding keypoint coordinates. Subsequently, the dragon fruit pose estimation algorithm processes the model’s inference outputs (Class ID, bounding box parameters, stem keypoint, and navel keypoint): by fitting an ellipse to the detected fruit region to characterize its spatial contour, and calculating the relative spatial positional relationship between the stem and navel keypoints, the framework derives the yaw and roll angles, which are core parameters of the dragon fruit’s 3D spatial pose. Finally, this framework is deployable in both controlled laboratory environments and natural orchard scenarios, yielding accurate spatial pose information to support automated dragon fruit harvesting operations.

2.1. Dataset Construction

2.1.1. Image Acquisition

The dataset used in this study consists entirely of dragon fruit images with natural orchard environments as the background, and it is derived from three independent image sources, with a total of 8476 images covering diverse real orchard scenarios such as different lighting conditions, occlusion levels, and target scales. The distribution of the training set and validation set of the full dataset is detailed in Table 2.

The first group consists of 3749 RGB images with a resolution of 1920 × 1080, with some examples shown in Figure 3. These images were collected at the orchard of Zhuhai Xijian Agricultural Development Co., Ltd., Guangdong, China (GPS: 22.15° N, 113.25° E), which has the largest contiguous dragon fruit planting base in Guangdong Province, China, with a ridge spacing of 2.8 m–3.5 m; the camera was fixed in the middle of the two ridges, facing the dragon fruit plants for random shooting. The image acquisition time was from 11:00 to 19:00 on 23 June 2024, and the weather was cloudy on the day of collection. The images were captured using a ZED 2i binocular stereo camera (Manufactured by Stereolabs, San Francisco, CA, USA), but only RGB images were utilized in this work. These images cover diverse lighting conditions, including weak light, moderate light, and strong light from front, side, and back directions. The images in this group are highly consistent with the actual operational environment of automated picking and mainly consist of small target scenes.

The second group consists of 868 publicly available RGB images of mature dragon fruits captured in natural orchard environments [30], with examples shown in Figure 4. This subset has the characteristics of high definition, with a single dragon fruit target in most images, and the target accounts for most of the image area, with clear and complete morphological features of the navel and stem.

The last group consists of RGB images cropped from https://universe.roboflow.com/. Roboflow Universe is a computer vision community platform developed by Roboflow Inc., a company headquartered in Des Moines, IA, USA. We first strictly screened 227 high-quality original images from this platform, all of which are natural orchard-captured dragon fruit images with complete navel and stem morphological features. The resolution of these original images was unified to 640 × 640 through scaling and padding. To eliminate data leakage risks, we first completed the division of the training set and validation set for all original images, and then performed data augmentation exclusively on the 227 original images, expanding them from 227 to 3859 via augmentation operations including rotation, brightness adjustment, flipping, sharpening, and contrast adjustment, with examples shown in Figure 5. The images in this group differ widely in clarity and in the proportion of the image occupied by the target, which can effectively improve the generalization ability of the model to targets of different scales.
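The scaling-and-padding step can be illustrated with a minimal sketch. The source does not specify how the padding was applied, so the code below assumes an aspect-preserving resize with the padding split evenly on both sides (often called letterboxing); the function names `letterbox_params` and `map_point` are hypothetical.

```python
def letterbox_params(width, height, target=640):
    """Return the scale, resized size, and per-side padding for a square canvas.

    Assumption: the aspect ratio is preserved and the padding is centered.
    """
    scale = min(target / width, target / height)
    new_w = round(width * scale)
    new_h = round(height * scale)
    pad_x = (target - new_w) / 2  # horizontal padding on each side
    pad_y = (target - new_h) / 2  # vertical padding on each side
    return scale, new_w, new_h, pad_x, pad_y


def map_point(x, y, scale, pad_x, pad_y):
    """Map a pixel coordinate (e.g., a keypoint) into the padded canvas."""
    return x * scale + pad_x, y * scale + pad_y


scale, new_w, new_h, pad_x, pad_y = letterbox_params(1920, 1080)
print(new_w, new_h, pad_x, pad_y)
```

Any bounding box or keypoint annotated on the original image would need the same transform so that labels stay aligned with the resized pixels.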

2.1.2. Data Annotation

This study establishes a unified annotation standard for bounding boxes and keypoints. For dragon fruit, a typical ellipsoidal crop, a tightly fitted rectangular bounding box is used to enclose fully visible fruits with an occlusion rate less than 50%. Two keypoints are labeled for each valid box: the navel, the geometric center of the calyx detachment at the fruit bottom, and the stem, the center of the branch connection at the pedicel. This unified paradigm, consisting of one bounding box paired with two corresponding morphological keypoints, ensures the framework’s generalizability to other ellipsoidal crops, such as citrus and elongated tomatoes, by standardizing the annotation of fruit bottom indentations and pedicel connection points.

After completing the pre-processing of all images and the division of the training set and validation set, we performed a unified annotation operation on all images using the LabelMe 5.2.1 software, as shown in Figure 6. The default YOLO-Pose models are designed around 17 keypoints, each representing a different part of the human body. In contrast, this study used a custom dataset, in which the annotations for each mature dragon fruit instance include two parts. First, a bounding box labeled “dragon_fruit” was annotated with a group_id set to 0 or 1, where 0 represents fruit_front (the navel is closer to the camera) and 1 represents fruit_back (the stem is closer to the camera). This setup enabled implicit classification without requiring dual-class labeling. Second, two keypoints labeled “navel” and “stem” were annotated with a group_id of 0 or 1, where 0 represents visible and 1 represents occluded. For occluded keypoints (e.g., those partially obscured by foliage), annotations were performed strictly based on the morphological prior knowledge of dragon fruits, and all occluded sample annotations were reviewed and corrected by experts with agronomic experience in dragon fruit cultivation to ensure annotation accuracy.

To ensure the reliability and consistency of the annotation results, we implemented a standardized two-step annotation and cross-validation procedure. First, the full initial annotation of the dataset was completed by an annotator with professional experience in agricultural computer vision annotation. Subsequently, another annotator with the same qualification performed an independent full-range cross-verification on all the initial annotation results. For samples with uncertain or disputed annotations identified during verification (particularly complex samples with occluded keypoints), the two annotators engaged in in-depth discussions utilizing morphological prior knowledge of dragon fruit to reach a consensus on all annotations. We have ensured that the collection and utilization of the dataset adhere to the relevant laws and regulations pertaining to copyright and data protection. Further details regarding the construction of the dataset and a comprehensive list of contributors can be found in Appendix A.

2.1.3. Pseudo Four-Keypoint Format Conversion Algorithm

YOLO11n-Pose natively supports only single-category keypoint detection, posing a challenge for multi-category keypoint detection tasks. To address incompatibility with the required label structure, a format conversion algorithm (the pseudo four-keypoint format conversion algorithm) was developed, as detailed in Algorithm 1. This algorithm implicitly distinguishes two dragon fruit growth orientations (fruit_front and fruit_back) via keypoint positional encoding. It enables YOLO11n-Pose to learn orientation-aware keypoint detection within a single-category framework without architectural modifications.

Algorithm 1 Pseudo four-keypoint format conversion

Require: Bounding box $B = \{x_{1}, y_{1}, x_{2}, y_{2}, c\}$, where $c \in \{0, 1\}$; keypoints $K_{stem}$, $K_{navel}$ with visibility $v$; image dimensions $W$, $H$.
Ensure: YOLO-Pose label vector $Y$ of length 17.
1: Calculate normalized box center and size:
2:     $c_{x} \leftarrow \frac{x_{1} + x_{2}}{2W}$, $c_{y} \leftarrow \frac{y_{1} + y_{2}}{2H}$
3:     $w \leftarrow \frac{|x_{2} - x_{1}|}{W}$, $h \leftarrow \frac{|y_{2} - y_{1}|}{H}$
4: Initialize vector $Y$ with zeros. Set $Y[0] \leftarrow 0$ and $Y[1:4] \leftarrow [c_{x}, c_{y}, w, h]$.
5: if $c == 0$ then
6:     $idx_{stem} \leftarrow 5$, $idx_{navel} \leftarrow 8$
7: else
8:     $idx_{stem} \leftarrow 11$, $idx_{navel} \leftarrow 14$
9: end if
10: Process Stem keypoint $K_{stem}$:
11:     $x_{norm} \leftarrow \frac{K_{stem}.x}{W}$, $y_{norm} \leftarrow \frac{K_{stem}.y}{H}$
12:     $v_{out} \leftarrow 2$ if $K_{stem}.v == 0$ else $1$
13:     $Y[idx_{stem} : idx_{stem} + 3] \leftarrow [x_{norm}, y_{norm}, v_{out}]$
14: Process Navel keypoint $K_{navel}$:
15:     $x_{norm} \leftarrow \frac{K_{navel}.x}{W}$, $y_{norm} \leftarrow \frac{K_{navel}.y}{H}$
16:     $v_{out} \leftarrow 2$ if $K_{navel}.v == 0$ else $1$
17:     $Y[idx_{navel} : idx_{navel} + 3] \leftarrow [x_{norm}, y_{norm}, v_{out}]$
18: return $Y$
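As an illustration, Algorithm 1 can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the function name `to_pseudo_four_keypoint` and the tuple layouts of its arguments are assumptions made for the example. Python slices are half-open, so the four box values written as $Y[1:4]$ in Algorithm 1 occupy `label[1:5]` here.

```python
def to_pseudo_four_keypoint(box, stem, navel, img_w, img_h):
    """Convert one annotated fruit into a 17-value pseudo four-keypoint label.

    box   = (x1, y1, x2, y2, c) in pixels, c = 0 (fruit_front) or 1 (fruit_back)
    stem  = (x, y, v) in pixels, v = 0 (visible) or 1 (occluded)
    navel = (x, y, v), same convention as stem
    """
    x1, y1, x2, y2, c = box
    label = [0.0] * 17
    label[0] = 0  # single detection class
    label[1:5] = [
        (x1 + x2) / (2 * img_w),  # normalized box center x
        (y1 + y2) / (2 * img_h),  # normalized box center y
        abs(x2 - x1) / img_w,     # normalized box width
        abs(y2 - y1) / img_h,     # normalized box height
    ]
    # fruit_front fills keypoint slots 1-2; fruit_back fills slots 3-4.
    idx_stem, idx_navel = (5, 8) if c == 0 else (11, 14)
    for idx, (x, y, v) in ((idx_stem, stem), (idx_navel, navel)):
        # YOLO visibility flag: 2 = visible, 1 = labeled but occluded.
        label[idx:idx + 3] = [x / img_w, y / img_h, 2 if v == 0 else 1]
    return label


example = to_pseudo_four_keypoint(
    box=(100, 200, 300, 600, 1),
    stem=(200, 210, 0),
    navel=(210, 590, 1),
    img_w=1000,
    img_h=1000,
)
print(example)
```

The two unused keypoint slots stay at zero, which is how the orientation (fruit_front or fruit_back) is encoded without introducing a second detection class.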

2.2. Keypoint Detection Model

Real-time and lightweight keypoint detection is critical for edge deployment of harvesting robots in unstructured orchards. Current mainstream architectures for agricultural pose estimation include YOLO-series single-stage frameworks, two-stage R-CNN [28], transformer-based ViT models [27], and bottom-up PAF-based frameworks [29]. The YOLO framework [5,6,7,13,22] achieves the optimal balance of speed, accuracy, and lightweight deployment for real-time orchard operations, while two-stage frameworks have excessive latency, ViT models bring prohibitive computational overhead for edge devices, and bottom-up methods have insufficient localization accuracy for occluded small targets. Therefore, we selected YOLO-series models as the keypoint detection model for this paper. It is noteworthy that other keypoint detection models are also suitable for the DFPE framework proposed in this paper. Many YOLO series architectures also offer multi-scale variants (n/s/m/l/x) for flexible trade-offs between detection accuracy and computational cost, with the n variants providing lightweight configurations that maintain competitive performance–cost trade-offs. For consistency across all experiments in this study, we exclusively used these n variants. In addition, the YOLO series has evolved from anchor-based to anchor-free designs through in-depth architectural optimizations.

YOLOv5, YOLOv8, and YOLO11 are representative deep learning architectures developed by Ultralytics that support varied computer vision tasks, including detection, segmentation, oriented bounding box detection, classification, and pose estimation. As a classic anchor-based variant representing the early mature stage of the YOLO series with stable performance and extensive community validation, YOLOv5 uses a C3 backbone with cross-stage partial connections and concatenation-based feature fusion. In contrast, in the anchor-free YOLOv8, the C3 backbone is replaced with the C2f module and a BiFPN is integrated for multi-scale fusion. This design eliminates anchor mismatch issues for targets of varying sizes and enhances the small-target detection performance by strengthening feature flow between shallow and deep layers. Released in 2024, YOLO11 features further iterative upgrades, using the C3k2 module as its backbone and adopting an enhanced strategy with attention mechanisms for feature fusion, achieving efficient multi-scale integration of information via the C2PSA module and an improved feature pyramid structure. The performances of YOLOv5, YOLOv8, YOLO11, YOLOv12, and YOLO26 models, built from their YAML configurations and trained for dragon fruit pose estimation, were compared, and YOLO11n-Pose was ultimately selected as the basis for the keypoint detection model. It is worth clarifying that YOLO11n-Pose was selected only to provide a real-time, lightweight baseline implementation for the keypoint detection module of the DFPE framework, and it does not constitute a binding requirement for the framework. Other mainstream keypoint detection models that can output valid navel and stem keypoint coordinates can also be connected to the subsequent customized geometric pose inference stage.

The performance of the YOLO-pose model is evaluated using two sets of metrics: bounding box (B) metrics and pose (P) metrics. The former assesses the model’s ability to locate and classify objects, while the latter evaluates the accuracy of keypoint localization. These metrics include Precision, Recall, and mean Average Precision (mAP), as defined in Equations (1)–(4). In these formulas, True Positives (TP) represent correctly identified targets, False Positives (FP) denote non-targets or incorrectly localized targets, and False Negatives (FN) represent actual targets that the model failed to identify.

For (B) metrics, a TP is defined by an Intersection over Union (IoU) threshold between the predicted and ground truth boxes. For (P) metrics, a TP is defined by an Object Keypoint Similarity (OKS) threshold.

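$$\mathrm{Precision} = \frac{TP}{FP + TP}, \tag{1}$$

$$\mathrm{Recall} = \frac{TP}{FN + TP}, \tag{2}$$

$$AP = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d\,\mathrm{Recall}, \tag{3}$$

$$mAP = \frac{1}{n} \sum_{i = 1}^{n} AP(i), \tag{4}$$

where $n$ is the number of categories and $AP(i)$ is the average precision of the $i$-th category.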
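Equations (1)–(4) can be made concrete with a short sketch. It assumes that the TP, FP, and FN counts are already available and approximates the integral in Equation (3) with the trapezoidal rule over sampled points of the precision–recall curve; evaluation toolkits typically apply an interpolated variant, so the values may differ slightly from reported ones. The function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Equations (1) and (2), guarding against empty denominators."""
    precision = tp / (fp + tp) if (fp + tp) else 0.0
    recall = tp / (fn + tp) if (fn + tp) else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Equation (3): area under the precision-recall curve.

    Assumption: points are sorted by ascending recall and the area is
    approximated with the trapezoidal rule.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


def mean_average_precision(ap_per_class):
    """Equation (4): mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.0])
print(p, r, ap, mean_average_precision([ap, 1.0]))
```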

The first category focuses on bounding box detection performance, denoted by the (B) suffix. Precision (B) and Recall (B) measure the accuracy and completeness of the object detection boxes, respectively. To evaluate the overall localization performance, mAP50 (B) provides the mean Average Precision calculated at a fixed IoU threshold of 0.50, whereas mAP50-95 (B) represents the average mAP across a range of IoU thresholds from 0.50 to 0.95 with a step of 0.05, serving as a robust indicator of detection stability and localization quality.
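The IoU criterion and the threshold sweep behind mAP50-95 (B) can be sketched as follows. Boxes are assumed to be given as `(x1, y1, x2, y2)` corner coordinates, and the per-threshold AP values passed to `map50_95` are placeholders rather than results from this study.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten thresholds from 0.50 to 0.95 with a step of 0.05.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def map50_95(ap_values):
    """Average the AP obtained at each of the ten thresholds."""
    if len(ap_values) != len(IOU_THRESHOLDS):
        raise ValueError("expected one AP value per threshold")
    return sum(ap_values) / len(ap_values)


iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
print(iou, iou >= IOU_THRESHOLDS[0])
```

A prediction counts as a TP at a given threshold only if its IoU with a ground-truth box reaches that threshold, which is why mAP50-95 penalizes loosely fitted boxes more strongly than mAP50.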

The second category specifically addresses pose estimation performance through (P) metrics based on the OKS criteria. Precision (P) and Recall (P) evaluate the accuracy of keypoint predictions, while mAP50 (P) indicates the mean Average Precision for keypoints when the OKS threshold is set to 0.50. Furthermore, mAP50-95 (P) is employed as the primary metric for pose estimation by averaging performance across OKS thresholds from 0.50 to 0.95, which effectively reflects how precisely the keypoints (here, the navel and stem of each fruit) are localized relative to the object’s overall scale.
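The OKS criterion can be illustrated with a short sketch. It follows the commonly used COCO-style definition, in which each visible keypoint contributes exp(-d² / (2 s² k²)), with d the pixel distance between prediction and ground truth, s² the object area, and k a per-keypoint falloff constant. The values of k used for the navel and stem keypoints are not stated in this article, so any value passed to this function is an assumption.

```python
import math

# OKS thresholds averaged by mAP50-95 (P): 0.50, 0.55, ..., 0.95
OKS_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def object_keypoint_similarity(pred, gt, visible, area, k):
    """COCO-style OKS for one object.

    pred, gt : lists of (x, y) pixel coordinates
    visible  : list of booleans, one per keypoint
    area     : object area in pixels^2 (the scale term s^2)
    k        : per-keypoint falloff constants (assumed values)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), vis, ki in zip(pred, gt, visible, k):
        if not vis:
            continue  # keypoints that are not labeled visible are skipped
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        count += 1
    return total / count if count else 0.0


def is_true_positive(oks, threshold=0.50):
    """A prediction counts as a TP for (P) metrics when OKS >= threshold."""
    return oks >= threshold
```

A perfect prediction yields an OKS of 1.0, and the score decays smoothly as the keypoints drift, with larger fruits tolerating larger pixel errors.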

To assess the computational efficiency and deployment suitability of the model, two additional complexity metrics are utilized. The Number of Parameters (M) refers to the total number of trainable variables within the network, in millions, reflecting the model size and hardware memory requirements. Concurrently, GFLOPs (Giga Floating-Point Operations) measure the computational cost required for a single inference pass, providing a standardized indicator of the model’s computational load and its suitability for real-time applications on various hardware platforms.

2.3. Pose Estimation Algorithm

To address challenges arising in the context of automated dragon fruit harvesting, a pose estimation method that uses only RGB images is proposed. Dragon fruits exhibit a rotational ellipsoid shape in 3D space, with the major axis aligning in the stem–navel direction. The y’ axis is defined based on the projection of the major axis onto the YOZ plane, as shown in Figure 7a. In addition, the ellipsoid’s minor and intermediate axes are equal in length and perpendicular to the major axis. Under camera perspective projection, the 3D ellipsoid projects to a 2D ellipse in the image plane, as indicated by the dashed ellipse in Figure 7b.

In 3D space, the attitude of any rigid body can be represented by a set of Euler angles (yaw (α), pitch, roll (β)). By approximating dragon fruit as an ellipsoid with equal intermediate and minor axes, it can be assumed to have axial symmetry about its major axis (the stem–navel line). Consequently, any rotation about this longitudinal axis (defined here as the pitch angle) does not alter the geometric profile of the fruit or the spatial orientation of its centerline. From the perspective of the harvesting task, the end-effector primarily requires alignment with the fruit’s major axis to execute a non-destructive grasp or cut. Since the fruit typically hangs vertically and the camera is configured to capture images from a near-horizontal viewpoint, variations in pitch are inherently constrained and primarily manifest as a slight foreshortening of the projected major axis. This foreshortening introduces negligible error in the calculation of the harvesting vector compared to the dominant influence of the fruit’s lateral and vertical orientation.

Due to this symmetry and the specific kinematic requirements of the harvesting robot, we only need to obtain the roll (β) and yaw (α) shown in Figure 7 to determine its spatial attitude, thereby guiding harvesting robots to perform non-destructive harvesting. This simplified two-degree-of-freedom representation is sufficient for the end-effector to approach the fruit along its central axis without the need for the redundant pitch component. The roll value is derived from the slope between the navel and stem keypoints in the RGB image, while the yaw value is calculated based on the relative positional relationship between these two keypoints and the projection of the dragon fruit in the image plane (YOZ plane). The black dots in Figure 7 represent the keypoints and their projections, while the dashed ellipse represents the projected outline of the dragon fruit in the YOZ plane, which also represents the fitted ellipse in the YOZ plane.

To determine the dragon fruit’s attitude from an RGB image, two angles (β and α) are derived based on the geometric relationships between the fitted ellipse and the keypoints. Here, β describes the inclination of the fitted ellipse’s major axis relative to the image’s horizontal axis. This major axis is defined by the stem–navel line, which connects the stem keypoint P_{s}(x_{s}, y_{s}) and the navel keypoint P_{n}(x_{n}, y_{n}); in particular, the image’s horizontal axis corresponds to the y-axis in Figure 7. Given the coordinates of P_{s} and P_{n}, β is calculated as the arctangent of the slope of the line:

β = arctan (\frac{y_{s} - y_{n}}{x_{s} - x_{n}}),

(5)
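Equation (5) can be sketched in a few lines of Python. The function name and the handling of a vertical stem–navel line (returning 90°) are assumptions made for illustration; Equation (5) itself is undefined when x_{s} = x_{n}.

```python
import math


def roll_from_keypoints(stem, navel):
    """Roll angle (beta, in degrees) from Equation (5).

    stem, navel: (x, y) pixel coordinates of the two keypoints.
    Note that in image coordinates the y axis usually points downward,
    so the sign convention depends on the chosen coordinate frame.
    """
    xs, ys = stem
    xn, yn = navel
    dx, dy = xs - xn, ys - yn
    if dx == 0 and dy == 0:
        raise ValueError("stem and navel keypoints coincide")
    if dx == 0:
        # Slope is undefined; 90 degrees is used here by convention (assumption).
        return 90.0
    return math.degrees(math.atan(dy / dx))
```

The arctangent form returns values between -90° and 90°, which is sufficient to describe the inclination of an undirected axis. A two-argument arctangent would additionally encode which keypoint lies on which side.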

Assume that the shape of the dragon fruit in the image is approximated as an ellipse. To calculate α, an ellipse in the XOY’ plane, as shown in Figure 7b, is defined by Equation (6):

\frac{{(x cos α + y sin α)}^{2}}{a^{2}} + \frac{{(y cos α - x sin α)}^{2}}{b^{2}} = 1,

(6)

where a and b denote the semi-major and semi-minor axes of the ellipse, respectively; α is the angle between the ellipse’s major axis and y’ axis; and (x, y) denotes the pixel coordinates of any point on the ellipse. Differentiating both sides of Equation (6) with respect to y and setting x'(y) = 0, we obtain Equation (7). Then, substituting Equation (7) into Equation (6), whereby x attains an extreme value x_{max}, it follows that x_{max} satisfies Equation (8):

\frac{sin α (x cos α + y sin α)}{a^{2}} + \frac{cos α (y cos α - x sin α)}{b^{2}} = 0,

(7)

x_{max}^{2} = a^{2} \cos^{2} α + b^{2} \sin^{2} α,

(8)
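Equation (8) can be checked numerically. The sketch below parameterizes the ellipse of Equation (6) by setting x cos α + y sin α = a cos t and y cos α − x sin α = b sin t, samples it densely, and compares the largest sampled x with the closed-form extent. The function names and the sampling density are illustrative choices.

```python
import math


def x_extent_closed_form(a, b, alpha):
    """Extreme x value of the rotated ellipse according to Equation (8)."""
    return math.sqrt(a ** 2 * math.cos(alpha) ** 2 + b ** 2 * math.sin(alpha) ** 2)


def x_extent_sampled(a, b, alpha, steps=20000):
    """Largest x found by sampling points that satisfy Equation (6)."""
    best = 0.0
    for i in range(steps):
        t = 2.0 * math.pi * i / steps
        u = a * math.cos(t)  # u = x cos(alpha) + y sin(alpha)
        v = b * math.sin(t)  # v = y cos(alpha) - x sin(alpha)
        x = u * math.cos(alpha) - v * math.sin(alpha)
        best = max(best, x)
    return best
```

With a = 3, b = 2, and α = 30°, both functions agree to within the sampling resolution, and for α = 0 the extent reduces to a, as expected.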

The relationship between the major axis of the fitted ellipse and x_{max} is expressed as Equation (9):

\text{Length of major axis} = 2 |x_{max}|,

(9)

Two variables (n and m, defined in Equations (10) and (11), respectively) are introduced to link ellipse geometry with keypoint topology. Here, m is the ratio of the ellipsoid’s major axis to its minor axis, which can be set according to the specific morphology of different varieties of dragon fruit or other ellipsoid-shaped fruits, while n is defined as the ratio of the length of the stem–navel line to the length of the major axis of the fitted ellipse in the image plane:

n = \frac{\sqrt{{(x_{s} - x_{n})}^{2} + {(y_{s} - y_{n})}^{2}}}{\text{Length of major axis of fitted ellipse}},

(10)

m = \frac{a}{b},

(11)

Finally, α is derived from Equation (12), with its sign determined by the class ID. If class ID = 0, indicating that the fruit orientation is fruit_front, α takes a positive value; whereas if class ID = 1, denoting the fruit orientation as fruit_back, α takes a negative value.

α = \pm arctan \frac{\sqrt{1 - n^{2}}}{m n},

(12)
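A minimal sketch of Equation (12) is given below. The function name is illustrative, and the clamping of n to the interval [0, 1] is an added safeguard (an assumption, not part of the equation) for cases where detection noise makes the keypoint distance slightly exceed the fitted major axis.

```python
import math


def yaw_from_ratio(n, m, cls_id):
    """Signed yaw angle (alpha, in degrees) from Equation (12).

    n      : stem-navel distance divided by the fitted major-axis length
    m      : major-to-minor axis ratio of the fruit (empirical constant)
    cls_id : 0 for fruit_front (positive yaw), 1 for fruit_back (negative yaw)
    """
    if m <= 0:
        raise ValueError("m must be positive")
    n = min(max(n, 0.0), 1.0)  # safeguard against noisy detections (assumption)
    if n == 0.0:
        magnitude = 90.0  # limit of arctan as n approaches 0
    else:
        magnitude = math.degrees(math.atan(math.sqrt(1.0 - n * n) / (m * n)))
    return magnitude if cls_id == 0 else -magnitude
```

When n = 1 the keypoints span the whole major axis and the yaw is 0°; as n shrinks toward 0 the keypoints appear to merge in the image and the yaw magnitude approaches 90°.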

Based on the above theoretical framework, we introduce the dragon fruit pose estimation algorithm, as detailed in Algorithm 2. This algorithm characterizes the pose of ellipsoid-shaped fruits using the roll and yaw values derived from a single RGB image. Utilizing the geometric constraints imposed by the detected bounding box and two keypoints, the algorithm fits an ellipse to the fruit’s profile. Subsequently, it leverages the geometric properties of this ellipse, in conjunction with the previously identified keypoints, to deduce the fruit’s spatial orientation relative to the camera.

Algorithm 2 Dragon fruit pose estimation

Require: Class ID cls \in \{0, 1\}, bounding box B = \{x, y, w, h\}, keypoints K = \{x_{0}, y_{0}, v_{0}, x_{1}, y_{1}, v_{1}\}, empirical constant m.
Ensure: Ellipse parameters \{c_{x}, c_{y}, a, b, β\}, yaw angle α.
1: Calculate Δx \leftarrow x_{1} - x_{0}, Δy \leftarrow y_{1} - y_{0}.
2: Containment check: verify that K_{navel} and K_{stem} lie within the bounding box boundaries. If not, terminate.
3: if Δx \approx 0 \lor Δy \approx 0 then
4:     Special case: axis-aligned ellipse
5:     a \leftarrow \max(w, h) / 2, b \leftarrow \min(w, h) / 2
6:     if w > h then
7:         β \leftarrow 0
8:     else
9:         β \leftarrow 90° (or π / 2 radians)
10:     end if
11: else
12:     General case: oblique ellipse
13:     β \leftarrow atan2(Δy, Δx)
14:     Compute conic parameters A, B, C, F based on w, h, β satisfying A x^{2} + B x y + C y^{2} + F = 0.
15:     Solve for semi-axes a and b from the conic parameters.
16: end if
17: Set ellipse center c_{x} \leftarrow x, c_{y} \leftarrow y.
18: Yaw estimation:
19:     d_{pts} \leftarrow \sqrt{Δx^{2} + Δy^{2}}
20:     n \leftarrow d_{pts} / (2a)
21:     α_{mag} \leftarrow \arctan \left( \frac{\sqrt{1 - n^{2}}}{m \cdot n} \right)
22: if cls == 0 then
23:     α \leftarrow α_{mag}
24: else
25:     α \leftarrow -α_{mag}
26: end if
27: return \{c_{x}, c_{y}, a, b, β, α\}

Considering natural morphological variations that may affect the ideal axial symmetry of fruits, the ellipse fitting process in this work does not enforce strict collinearity between the major axis of the ellipse and the line connecting the two keypoints. Instead, a parallel constraint strategy is adopted, which enables robust and accurate ellipse fitting and effectively reduces errors caused by irregular fruit growth and morphological deviations.
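The ellipse-fitting part of Algorithm 2 (steps 3–17) can be sketched as follows. The article does not spell out how the conic parameters are solved, so this sketch makes an explicit assumption: the ellipse is centered at the box center, its major axis is parallel to the keypoint line, and it is tangent to all four sides of the bounding box. Under that assumption the half-width W and half-height H of the box satisfy W² = a² cos²β + b² sin²β and H² = a² sin²β + b² cos²β, which can be solved for a and b in closed form. The function name, the tolerance eps, and the error handling are illustrative, and the containment check of step 2 is omitted.

```python
import math


def fit_ellipse_to_box(box, stem, navel, eps=1e-6):
    """Return (cx, cy, a, b, beta) with beta in radians.

    box         : (x, y, w, h) with (x, y) the box center
    stem, navel : (x, y) keypoints defining the major-axis direction
    Assumes the ellipse is tangent to all four box sides (see lead-in).
    """
    cx, cy, w, h = box
    dx, dy = stem[0] - navel[0], stem[1] - navel[1]

    # Special case of Algorithm 2: keypoint line is (nearly) axis-aligned.
    if abs(dx) < eps or abs(dy) < eps:
        a, b = max(w, h) / 2.0, min(w, h) / 2.0
        beta = 0.0 if w > h else math.pi / 2.0
        return cx, cy, a, b, beta

    # General case: oblique ellipse, major axis parallel to the keypoint line.
    beta = math.atan2(dy, dx)
    c2, s2 = math.cos(beta) ** 2, math.sin(beta) ** 2
    half_w2, half_h2 = (w / 2.0) ** 2, (h / 2.0) ** 2
    denom = c2 - s2  # equals cos(2 * beta)
    if abs(denom) < eps:
        raise ValueError("beta near 45 degrees: a and b are not uniquely determined")
    a2 = (half_w2 * c2 - half_h2 * s2) / denom
    b2 = (half_h2 * c2 - half_w2 * s2) / denom
    if a2 <= 0 or b2 <= 0:
        raise ValueError("no tangent ellipse exists for this box and direction")
    return cx, cy, math.sqrt(a2), math.sqrt(b2), beta
```

For instance, an ellipse with a = 4, b = 2, and β = 30° has a bounding box of width 2√13 and height 2√7; passing that box and a keypoint direction of 30° recovers the original semi-axes. Because only the direction of the keypoint line is used, the keypoints themselves do not need to lie on the fitted major axis, which is in the spirit of the parallel constraint described above.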

3. Results

3.1. Performance Evaluation of Keypoint Detection Model

In deep learning-based object detection, anchor-based and anchor-free strategies represent two fundamental localization paradigms, which differ primarily in their reliance on pre-defined anchor boxes. To assess their suitability for this task, this study compared the performance of YOLOv5n-Pose (anchor-based) and YOLOv8n-Pose (anchor-free) for dragon fruit keypoint detection, utilizing a total of 868 images (two-keypoint annotations with group ID) sourced from the public dataset as detailed in Table 2. The experimental training environment was configured as follows: Windows 10; an NVIDIA GeForce RTX 3090 Ti; and PyTorch 1.11.0, Python 3.8.5, and CUDA 10.2. Key hyperparameter settings are detailed in Table 3; all other parameters were set to their default values.

Figure 8 presents the detailed training metrics of YOLOv5n-Pose on the dragon fruit dataset. Although effective feature acquisition for bounding box detection and keypoint regression was observed initially (approximately the first 25 epochs), a pronounced instability in the subsequent training process became evident. This instability is characterized by a sharp and persistent decline in Precision, mAP50, and mAP50-95 to near-zero values, while Recall managed to recover to a fluctuating, suboptimal plateau around 0.6. This suggests that the default configuration of the model struggled to achieve a state of balanced and stable learning on this specific dataset. The severely compromised values of Precision and mAPs, contrasting with the partial recovery of Recall, highlight that the model is highly sensitive to training parameters. Achieving stability in complex scenarios would require meticulous hyperparameter tuning and specialized debugging. This indicates that, without extensive and precise manual configuration, this anchor-based method can experience severe predictive failures in later training stages. Figure 9 shows the visualization results of the YOLOv5n-Pose model. The left image shows a dragon fruit with two overlapping predicted bounding boxes, while the keypoint detection results across all three images are notably inaccurate, showing significant deviations from ground truth positions.

In Figure 10, the top four subfigures present the YOLOv8n-Pose training results for the bounding boxes, while the bottom four subfigures present those for the keypoints. The corresponding bounding box and keypoint metrics, including Precision, Recall, mAP50, and mAP50-95, exhibited nearly identical trends. All metric curves rose rapidly during the early training phase and remained stable at a high level in the middle and late stages, showing minimal fluctuations. These results indicate that the YOLOv8n-Pose model achieved a training effect characterized by early rapid convergence, late stability, and excellent performance for both dragon fruit bounding box detection and navel and stem keypoint pose estimation. Figure 11 shows the visualization results for the YOLOv8n-Pose model, from which it can be seen that the bounding boxes closely matched the actual fruit extents, and that both keypoints for each dragon fruit were detected precisely.

In comparison, although YOLOv5n-Pose demonstrated an upward trend in the early training stages, experiments on the 868 images from the public dataset (as detailed in Table 2) revealed that it required more extensive hyperparameter configuration and meticulous debugging to maintain convergence stability. This sensitivity is primarily attributed to the constraints of its anchor-based mechanism, which relies on pre-defined anchor boxes to match ground truth targets. As observed in the training metrics, the imbalance between the recovery of Recall and the collapse of Precision/mAP suggests that the default anchor settings struggled to adapt to the specific geometric distribution of dragon fruit keypoints. In contrast, YOLOv8n-Pose effectively addresses the adaptability limitations of static anchors through dynamic sample assignment (e.g., via a Task-Aligned Assigner), optimized loss functions (e.g., CIoU combined with pose regression loss), and a more robust architecture. Throughout the training process, YOLOv8n-Pose exhibited superior stability, with core metrics for bounding boxes and keypoints (including Precision, Recall, mAP50, and mAP50-95) maintaining consistently high levels. This confirms the advantages of YOLOv8n’s architectural upgrades in achieving a more reliable and accurate performance for pose estimation tasks.

The fundamental difference between these architectures lies in their localization strategies: anchor-based versus anchor-free. Anchor-based methods, exemplified by YOLOv5, utilize the K-means algorithm to cluster training data dimensions into pre-defined reference boxes. While this approach performs well in controlled environments, it necessitates manual anchor design for different datasets and often lacks the flexibility to represent objects in complex, dynamic scenes. Conversely, YOLOv8 adopts an anchor-free approach that directly utilizes grid points on feature maps as object centers, predicting distances to bounding box boundaries without prior references. Supported by Distribution Focal Loss (DFL), this method offers stronger generalization and adaptability to multi-scale objects and occlusions, while significantly reducing the manual complexity associated with anchor box optimization.

In this study, the bounding box group ID is used to distinguish between the fruit_front and fruit_back orientations, while the keypoint group IDs are used to label their visibility (visible vs. occluded). To investigate the impact of both visibility labeling and keypoint quantity on detection performance, we conducted comparative experiments using the YOLOv8-Pose model based on the 868 images from the public dataset (as specified in Table 2). The experiments evaluated four distinct configurations: two-keypoint and pseudo four-keypoint labeling, each tested with and without group ID assignment. The training environment utilized an NVIDIA GeForce RTX 3090 Ti GPU with PyTorch 1.11.0. Key hyperparameters remained consistent across all groups, including 300 epochs, an initial learning rate of 0.01, and a weight decay of 0.0005.

Figure 12 presents the training curves on the 868-image public dataset (Table 2) across the four experimental configurations. The Precision, Recall, and mAP50 curves for models with and without group ID assignment overlap completely in both the two-keypoint and pseudo four-keypoint scenarios, indicating that visibility labeling had no observable impact on the training dynamics. However, the curves reveal a discernible difference in convergence behavior between the two labeling strategies: the two-keypoint configurations converged faster and reached a stable plateau earlier, whereas the pseudo four-keypoint curves exhibited larger initial fluctuations due to the increased complexity of the regression task. Table 4 further quantifies these observations, showing that while both strategies eventually achieve high precision, the two-keypoint strategy maintains a slight lead in the final validation metrics. This is because doubling the number of keypoint targets in the pseudo four-keypoint setup places a heavier burden on the localization head, as the network must simultaneously optimize coordinate constraints for four distinct points per fruit. Nevertheless, despite the slightly longer convergence time, the pseudo four-keypoint method offers a decisive functional advantage in practical harvesting applications: it provides the richer geometric information needed to determine the spatial orientation of the dragon fruit. Specifically, the four-keypoint configuration allows the system to distinguish between the fruit_front and fruit_back orientations, which is a critical prerequisite for the picking robot to plan an accurate approach vector. In contrast, the two-keypoint model lacks sufficient spatial constraints to resolve this orientation ambiguity. Therefore, considering the dual requirements of localization precision and orientation perception, this study ultimately adopts the pseudo four-keypoint annotation strategy as the optimal framework for automated dragon fruit harvesting.

To identify a more robust keypoint detection model, we enhanced YOLOv8n-Pose by integrating four representative attention mechanisms, namely SE (Squeeze-and-Excitation), ECA (Efficient Channel Attention), EMA (Efficient Multi-Scale Attention), and CBAM (Convolutional Block Attention Module), prior to its SPPF layer and compared its performance with YOLO11n-Pose, another anchor-free framework. To improve training efficiency, the training environment for the YOLO networks was upgraded and configured as follows: Windows 10; an NVIDIA Tesla V100-SXM2-32 GB GPU (32,501 MiB); and a deep learning stack comprising PyTorch 2.7.1, Python 3.10.18, and CUDA 12.6. Key hyperparameters included 300 training epochs, an initial learning rate of 0.01, a momentum of 0.937, and a weight decay coefficient of 0.0005. To enhance the generalizability of the proposed model, we utilized the comprehensive multi-source dataset consisting of 8467 images (four-keypoint annotations with group ID) for this experiment and all subsequent specialized studies.
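For orientation, the hyperparameters listed above map onto the Ultralytics Python training interface roughly as sketched below. This is not the authors' training script: the dataset file name `dragon_fruit_pose.yaml` is a hypothetical placeholder, the model is built from a YAML configuration as described earlier in the article, and all settings not listed here are left at their library defaults.

```python
# Hyperparameters reported in the text, keyed by Ultralytics argument names.
TRAIN_ARGS = {
    "epochs": 300,           # training epochs
    "lr0": 0.01,             # initial learning rate
    "momentum": 0.937,       # optimizer momentum
    "weight_decay": 0.0005,  # weight decay coefficient
}


def train_pose_model(model_cfg="yolo11n-pose.yaml",
                     data_cfg="dragon_fruit_pose.yaml"):
    """Build a pose model from a YAML config and train it.

    data_cfg is a hypothetical dataset description file; it must be
    replaced by the actual dataset YAML before running.
    """
    # Imported lazily so the settings above can be inspected
    # without the ultralytics package installed.
    from ultralytics import YOLO

    model = YOLO(model_cfg)  # builds an untrained model from the YAML file
    return model.train(data=data_cfg, **TRAIN_ARGS)
```

Depending on the installed library version, the default automatic optimizer selection may override the learning rate and momentum values, so it is worth checking the training log to confirm which settings were actually applied.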

To ensure the reliability and repeatability of the experimental results, each training configuration was independently tested three times, and the results were found to be consistent and stable. Figure 13 shows the training results for the improved YOLOv8-Pose and other recently released YOLO-series models (YOLO11-Pose, YOLOv12-Pose, YOLO26-Pose). The results show that adding the attention mechanisms had a minimal effect on model accuracy, as YOLOv8-Pose and YOLO11-Pose already achieve high accuracy, leaving little room for further improvement. As shown in Table 5, YOLO11-Pose demonstrated superior computational efficiency, with the lowest GFLOPs among the evaluated models. Consequently, we adopted YOLO11-Pose as the basis for the final keypoint detection framework due to its optimal balance of accuracy and lightweight design.

3.2. Orchard Experiment

We collected a set of RGB images with a resolution of 1920 × 1080 at the orchard of Zhuhai Xijian Agricultural Development Co., Ltd. The images were captured using a ZED 2i binocular stereo camera, but only the RGB data was utilized in this work. The acquisition environment was consistent with that of the first dataset group, while the images themselves do not overlap with the existing dataset. These images were then used to evaluate the performance of the proposed DFPE framework for dragon fruit pose estimation. Figure 14 illustrates the performance of the proposed DFPE framework under diverse lighting conditions and fruit states in a natural orchard environment. Specifically, Figure 14a–c demonstrate the framework’s effectiveness under different ambient light intensity conditions, while Figure 14d validates its robustness under backlight conditions. Fruit samples were categorized based on two primary visibility states (in-view and partially out-of-view). In-view samples were further subcategorized by maturity (unripe/ripe), spatial orientation (fruit_front/fruit_back), and occlusion severity (≤50%: partial occlusion; >50%: heavy occlusion). A selective processing strategy was adopted in the model: pose estimation was performed exclusively on ripe dragon fruits, automatically excluding unripe fruits, as demonstrated in Figure 14c,d. Additionally, the model skipped fruits with occlusion rates exceeding 50%, as shown on the left in Figure 14a,b, and ignored partially out-of-view fruits, as shown on the right in Figure 14c. Red circles are used in Figure 14 to explicitly mark the corresponding occluded or partially visible fruits for clear visualization.
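The selective processing strategy described above amounts to a simple filter applied before pose estimation. The sketch below expresses it in Python; the `FruitDetection` record and its field names are hypothetical, and the article does not specify how maturity, occlusion rate, or view status are obtained, so they are treated here as given inputs.

```python
from dataclasses import dataclass


@dataclass
class FruitDetection:
    """Hypothetical per-fruit record; all field names are illustrative."""
    ripe: bool              # maturity state of the fruit
    occlusion_ratio: float  # fraction of the fruit hidden, in [0, 1]
    fully_in_view: bool     # False if the fruit is cut off by the image border


def should_estimate_pose(det, max_occlusion=0.5):
    """Return True only for ripe, in-view fruits with at most 50% occlusion."""
    return det.ripe and det.fully_in_view and det.occlusion_ratio <= max_occlusion


def select_for_pose_estimation(detections):
    """Keep the detections that pass the selective processing rules."""
    return [d for d in detections if should_estimate_pose(d)]
```

Filtering at this stage keeps heavily occluded or truncated fruits, whose keypoints are unlikely to be reliable, out of the geometric inference step.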

All valid pose estimation results were classified into the fruit_front and fruit_back categories based on navel orientation. The fitted green ellipses align closely with the natural fruit contours, confirming the DFPE framework’s applicability even under conditions of branch occlusion. In Figure 14b–d, fruits with navels oriented toward the plant are classified as fruit_back with negative yaw, while in Figure 14a,c,d fruits with navels facing the camera are classified as fruit_front with positive yaw. This shows that the sign of the yaw angle is consistent with the orientation of the navel.

As shown in Figure 14, all valid results clearly display the roll and yaw values output by the DFPE framework. From a visual perspective, these angle values closely align with the actual spatial orientations of the fruits, providing preliminary evidence for the intuitive rationality of the model’s output and confirming its feasibility and robustness in complex real-world environments.

3.3. Pose Estimation Experiment in a Laboratory Environment

To further quantitatively evaluate the accuracy of the roll and yaw angle estimates, we designed a controlled laboratory experiment. A robotic workstation was employed to precisely position dragon fruit models at pre-defined pose angles in front of a camera. The proposed framework was then used to estimate poses, with the output values compared against the theoretically controlled ground truth from the robotic system. Figure 15 presents representative samples of the dragon fruit pose estimation experiments conducted in the laboratory. Each subfigure displays the full camera-captured image with the pose estimation results, alongside a zoomed-in view of the dragon fruit. The blue bounding box denotes the object detection result, the green ellipse represents the ellipse fitted by the proposed dragon fruit pose estimation algorithm, the red dot marks the navel keypoint, and the black dot indicates the stem keypoint. The theoretical values (ground truth from the robotic workstation, labeled below each subfigure), experimental values (raw estimates from the proposed framework, marked in red text on the image), and corrected values (refined results, marked in blue text on the image) are clearly labeled for each sample, enabling direct visual comparison of the estimation performance. Figure 16 shows the complete scatterplot of all dragon fruit pose estimation experiments in the laboratory. In this plot, the black squares denote the theoretical ground-truth angles, the red dots represent the experimental estimated values, and the blue triangles indicate the corrected values. The purple bounding boxes group the corresponding theoretical, experimental, and corrected points, explicitly illustrating the one-to-one correspondence between the ground truth and the estimated results across all tested poses.

Analysis of the experimental results, as summarized in Figure 15, revealed significant deviations in the roll estimates for Figure 15d (11.5° error) and Figure 15e (14.8° error). For the special case in Figure 15d, where the two keypoints are clustered on the same side of the dragon fruit’s ellipsoidal surface, the ellipse-fitting strategy adopted in this study uses the keypoint connection line as the major axis direction rather than enforcing keypoint alignment on the major axis. This approach effectively reduced fitting errors caused by fruit deformation, maintaining high alignment between the fitted ellipse and the actual fruit contours. Conversely, in Figure 15e, the proximity of the two keypoints in the image plane amplified roll estimation errors when minor detection inaccuracies occurred, leading to uncertainty in the estimate. To address such edge cases, a corrective strategy was proposed: when the yaw angle exceeds 85°, the roll is forcibly set to 0° and the yaw to 90° (as shown in Figure 15g). Experimental validation demonstrated that this correction significantly improved the stability of estimates under extreme poses.
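The corrective strategy can be written as a small post-processing step. In the sketch below the function name is illustrative, and the rule is applied exactly as stated, to yaw values strictly greater than 85°; whether a mirrored rule applies to strongly negative yaw values is not stated in the text, so none is applied here.

```python
def correct_extreme_pose(roll_deg, yaw_deg, yaw_limit_deg=85.0):
    """Apply the edge-case correction to a (roll, yaw) estimate in degrees.

    When the yaw exceeds the limit, the two keypoints nearly coincide in
    the image and the roll estimate becomes unreliable, so the pose is
    snapped to roll = 0 and yaw = 90.
    """
    if yaw_deg > yaw_limit_deg:
        return 0.0, 90.0
    # Negative yaw handling is not specified in the source; left unchanged.
    return roll_deg, yaw_deg
```

For example, an estimate of (14.8°, 87°) is replaced by (0°, 90°), while an estimate of (10°, 40°) passes through unchanged.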

The final experimental data showed a maximum yaw error of 13.4°, with overall errors for both roll and yaw angles controlled below 15°, well within the tolerance threshold required for robotic harvesting operations [19,24]. Verification using an adaptive gripper-based picking system confirmed that this error range had no statistically significant impact on actual harvesting success rates, thereby proving the model’s high precision and edge-case adaptability under controlled conditions.

4. Discussion

This study constructs a novel RGB-image-based dragon fruit pose estimation framework for vision-guided robotic harvesting. This framework relies solely on single 2D RGB images for high-precision pose estimation, reducing hardware costs and environmental dependence, while achieving a balance between lightweight performance and robustness in unstructured orchard scenarios.

The pseudo four-keypoint format conversion algorithm resolves the conflict between YOLO11n-Pose’s single-class annotation requirement and multi-category keypoint detection, simplifying the annotation process and laying a solid foundation for subsequent pose inference. The dragon fruit pose estimation (DFPE) algorithm establishes a reliable geometric inference system by fitting ellipses and calculating keypoint spatial relationships, converting abstract keypoint information into interpretable pose parameters.

Compared with Reference [24], this framework achieves superior accuracy, with maximum roll and yaw deviations constrained within 15°. It ensures robots accurately grasp the spatial orientation of fruits, thereby reducing harvesting damage. Compared with the non-YOLO pose estimation methods for tomato crops, the proposed framework also shows competitive performance and better lightweight characteristics. Ni et al. [27] achieved a maximum angular error of 6.5° for tomato cutting point pose estimation with 31.015 M parameters and 5.122 GFLOPs; Ci et al. [28] achieved a mean absolute angular error of 10°–11° for tomato peduncle pose estimation using a two-stage Keypoint R-CNN framework; Kim et al. [29] reported an average angular error of 20° for tomato pedicel pose estimation with a bottom-up OpenPose framework. In contrast, the proposed DFPE framework based on YOLO11n-Pose constrained the maximum roll and yaw errors within 15° in laboratory environments, with only 2.65 M parameters and 6.6 GFLOPs, achieving a favorable balance between pose accuracy and deployment efficiency. The framework also achieves superior accuracy compared with Reference [20].

The anchor-free YOLOv8 and YOLO11 models demonstrate strong generalization capability due to the dataset’s large-scale, multi-source characteristics, enabling high-precision, high-recall keypoint detection without additional parameter tuning. These models exhibit significant advantages over the anchor-based YOLOv5, particularly in complex scenarios. The experimental results confirm that labeling keypoint visibility using group IDs has a negligible impact on final pose estimation results, indicating the model’s inherent adaptability to partially occluded scenes.

Although the ellipsoid-based pose estimation method proposed in this study theoretically has cross-crop generalization potential due to its universal geometric properties, making it applicable to similarly ellipsoidal fruits such as certain citrus fruits and papaya, its practical adaptability in real-world scenarios remains unverified due to a lack of cross-crop testing. Furthermore, existing methods for acquiring ground truth pose data in complex orchard environments are immature and lack high-precision 3D pose annotation techniques, which creates challenges in quantifying model performance under authentic orchard conditions. Future research will focus on cross-crop empirical studies and the construction of a high-precision ground truth pose database. In addition, subsequent research will explore lightweight vision transformer (ViT) architectures tailored for agricultural pose estimation and integrate the two-stage and bottom-up pose estimation paradigms into the DFPE framework to further improve its adaptability to dense, occluded orchard scenarios.

5. Conclusions

This study addresses the key challenge of pose estimation in automated dragon fruit harvesting by proposing a lightweight, high-precision pose estimation framework based on a single RGB image. It provides a practical technical solution for vision-guided harvesting robots. The core contributions and achievements are as follows:

An innovative pseudo four-keypoint format conversion algorithm is developed to resolve the compatibility conflict between YOLO11n-Pose’s single-class annotation requirement and multi-category keypoint detection tasks. This algorithm converts dragon fruit’s bounding box, stem, and navel keypoint information into a standardized format, and implicitly encodes the fruit’s front/back growth orientations via group IDs. It not only simplifies the annotation process but also fully preserves the geometric information required for pose inference, laying a solid foundation for efficient model training.
The dragon fruit pose estimation (DFPE) algorithm is proposed to realize accurate pose inference based on geometric constraints. By using ellipse fitting to characterize fruit contours and combining the relative spatial positional relationship between stem and navel keypoints, the algorithm transforms 3D pose estimation into the calculation of yaw and roll angles. Even in complex scenarios such as foliage occlusion, it can stably extract spatial pose information of dragon fruits.
A multi-source dataset containing 8467 images is constructed and made publicly available to enhance the model’s generalization ability. The dataset covers diverse lighting conditions, different maturity levels, and occlusion states in natural orchards.
Dual verification through laboratory and orchard experiments confirms the excellent accuracy and robustness of the framework. In the laboratory environment, the maximum yaw error is 13.4° and the maximum roll error is 14.8°, both within the 15° tolerance threshold of harvesting robots, outperforming the method in Reference [24]. Field tests in orchards also show that the framework can effectively adapt to unstructured environments, accurately distinguish fruit growth orientations, and output reliable pose parameters, providing key support for non-destructive harvesting.

The framework only relies on RGB images to complete pose estimation, greatly reducing hardware deployment costs. Its design based on elliptical geometric properties endows it with cross-crop generalization potential, which can be extended to ellipsoidal fruits such as citrus and papaya, and its detector-agnostic design also allows for the integration of more advanced non-YOLO pose estimation architectures in subsequent applications. In summary, this study breaks through the limitations of existing fruit pose estimation methods in adaptability to complex orchard environments, annotation complexity, and accuracy. It provides an important technical reference for the industrialization of automated harvesting of dragon fruit and similar crops. Future research will focus on cross-crop empirical studies to verify the framework’s generalizability to other ellipsoidal crops and the construction of a high-precision ground truth pose database, aiming to further promote the industrial application of multi-crop harvesting robots.

Author Contributions

Conceptualization, X.Y. and L.B.; methodology, X.Y.; software, X.Y. and R.W.; validation, X.Y., L.B. and T.Z.; formal analysis, X.Y.; investigation, X.Y.; resources, X.Y.; data curation, X.Y. and R.W.; writing—original draft preparation, X.Y.; writing—review and editing, X.Y. and T.Z.; visualization, X.Y.; supervision, L.B.; project administration, L.B.; funding acquisition, L.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Development Fund, Macau SAR (No. 0082/2025/AFJ) and the Science and Technology Planning Project of Guangdong Province, China (No. 2023A0505020007).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DFPE	Dragon fruit pose estimation
YOLO	You Only Look Once
mAP	mean Average Precision
TP	True Positives
FP	False Positives
FN	False Negatives
GFLOPs	Giga Floating-Point Operations
SE	Squeeze-and-Excitation
ECA	Efficient Channel Attention
EMA	Efficient Multi-Scale Attention
CBAM	Convolutional Block Attention Module

Appendix A

The Primary Source of the Dataset: https://data.mendeley.com/drafts/t8kbmg6cgz (accessed on 21 December 2025).

References

1. Jiang, H.; Zhang, W.; Li, X.; Shu, C.; Jiang, W.; Cao, J. Nutrition, phytochemical profile, bioactivities and applications in food industry of pitaya (Hylocereus spp.) peels: A comprehensive review. Trends Food Sci. Technol. 2021, 116, 199–217.
2. Li, H.; Gu, Z.; He, D.; Wang, X.; Huang, J.; Mo, Y.; Li, P.; Huang, Z.; Wu, F. A lightweight improved YOLOv5s model and its deployment for detecting pitaya fruits in daytime and nighttime light-supplement environments. Comput. Electron. Agric. 2024, 220, 108914.
3. Zhang, Q. Opinion: AI in agriculture, researchable issues. Comput. Electron. Agric. 2023, 212, 108110.
4. Zhao, C.; Fan, B.; Li, J.; Feng, Q. Agricultural robots: Technology progress, challenges and trends. Smart Agric. 2023, 5, 14–15.
5. Wu, Z.; Xia, F.; Zhou, S.; Xu, D. A method for identifying grape stems using keypoints. Comput. Electron. Agric. 2023, 209, 107825.
6. Chen, J.; Ma, A.; Huang, L.; Li, H.; Zhang, H.; Huang, Y.; Zhu, T. Efficient and lightweight grape and picking point synchronous detection model based on key point detection. Comput. Electron. Agric. 2024, 217, 108612.
7. Zhou, J.; Zhang, Y.; Wang, J.T. A Dragon Fruit Picking Detection Method Based on YOLOv7 and PSP-Ellipse. Sensors 2023, 23, 3803.
8. Sun, Q.; Zhong, M.; Chai, X.; Zeng, Z.; Yin, H.; Zhou, G.; Sun, T. Citrus pose estimation from an RGB image for automated harvesting. Comput. Electron. Agric. 2023, 211, 108022.
9. Gao, Y.; Wang, Q.; Rao, X.; Xie, L.; Ying, Y. OrangeStereo: A navel orange stereo matching network for 3D surface reconstruction. Comput. Electron. Agric. 2024, 217, 108626.
10. Rapado-Rincón, D.; van Henten, E.J.; Kootstra, G. Development and evaluation of automated localisation and reconstruction of all fruits on tomato plants in a greenhouse based on multi-view perception and 3D multi-object tracking. Biosyst. Eng. 2023, 231, 78–91.
11. Chu, P.; Li, Z.; Zhang, K.; Lammers, K.; Lu, R. High-precision fruit localization using active laser-camera scanning: Robust laser line extraction for 2D-3D transformation. Smart Agric. Technol. 2024, 7, 100391.
12. Lv, J.; Xu, H.; Xu, L.; Gu, Y.; Rong, H.; Zou, L. An image rendering-based identification method for apples with different growth forms. Comput. Electron. Agric. 2023, 211, 108040.
13. Jin, S.; Zhou, L.; Zhou, H. CO-YOLO: A lightweight and efficient model for Camellia oleifera fruit object detection and posture determination. Comput. Electron. Agric. 2025, 235, 110394.
14. Zhang, J.; Xie, J.; Zhang, F.; Gao, J.; Yang, C.; Song, C.; Rao, W.; Zhang, Y. Greenhouse tomato detection and pose classification algorithm based on improved YOLOv5. Comput. Electron. Agric. 2024, 216, 108519.
15. Gu, Z.; He, D.; Huang, J.; Chen, J.; Wu, X.; Huang, B.; Dong, T.; Yang, Q.; Li, H. Simultaneous detection of fruits and fruiting stems in mango using improved YOLOv8 model deployed by edge device. Comput. Electron. Agric. 2024, 227, 109512.
16. Diao, S.; Feng, J.; Zhang, B.; Xia, Y.; Gu, Y.; Li, D.; Fu, W. MangoStem-YOLOv8n: An improved YOLOv8n for mango and stem detection in natural orchard environments. Smart Agric. Technol. 2025, 12, 101485.
17. Li, Y.; Feng, Q.; Zhang, Y.; Peng, C.; Ma, Y.; Liu, C.; Zhao, C. Peduncle collision-free grasping based on deep reinforcement learning for tomato harvesting robot. Comput. Electron. Agric. 2024, 216, 108488.
18. Zhang, F.; Gao, J.; Zhou, H.; Zhang, J.; Zou, K.; Yuan, T. 3D pose detection method based on keypoints detection network for tomato bunch. Comput. Electron. Agric. 2022, 195, 106824.
19. Zhang, F.; Gao, J.; Song, C.; Zhou, H.; Zou, K.; Xie, J.; Yuan, T.; Zhang, J. TPMv2: An end-to-end tomato pose method based on 3D key points detection. Comput. Electron. Agric. 2023, 210, 107878.
20. Sun, Q.; Chai, X.; Zeng, Z.; Zhou, G.; Sun, T. Multi-level feature fusion for fruit bearing branch keypoint detection. Comput. Electron. Agric. 2021, 191, 106479.
21. Yu, Y.; Zhang, K.; Liu, H.; Yang, L.; Zhang, D. Real-Time Visual Localization of the Picking Points for a Ridge-Planting Strawberry Harvesting Robot. IEEE Access 2020, 8, 116556–116568.
22. Du, X.; Meng, Z.; Ma, Z.; Lu, W.; Cheng, H.T. Tomato 3D pose detection algorithm based on keypoint detection and point cloud processing. Comput. Electron. Agric. 2023, 212, 108056.
23. Jiang, T.; Li, Y.; Feng, H.; Wu, J.; Sun, W.; Ruan, Y. Research on a Trellis Grape Stem Recognition Method Based on YOLOv8n-GP. Agriculture 2024, 14, 1449.
24. Kok, E.; Chen, C. Occluded apples orientation estimator based on deep learning model for robotic harvesting. Comput. Electron. Agric. 2024, 219, 108781.
25. Kim, J.; Pyo, H.; Jang, I.; Kang, J.; Ju, B.; Ko, K. Tomato harvesting robotic system based on Deep-ToMaToS: Deep learning network using transformation loss for 6D pose estimation of maturity classified tomatoes with side-stem. Comput. Electron. Agric. 2022, 201, 107300.
26. Jang, M.; Hwang, Y. Tomato pose estimation using the association of tomato body and sepal. Comput. Electron. Agric. 2024, 221, 108961.
27. Ni, J.; Zhu, L.; Dong, L.; Wang, R.; Chen, K.; Gao, J.; Wang, W.; Zhou, L.; Zhao, B.; Rong, J.; et al. TomatoPoseNet: An Efficient Keypoint-Based 6D Pose Estimation Model for Non-Destructive Tomato Harvesting. Agronomy 2024, 14, 3027.
28. Ci, J.; Wang, X.; Rapado-Rincón, D.; Burusa, A.K.; Kootstra, G. 3D pose estimation of tomato peduncle nodes using deep keypoint detection and point cloud. Biosyst. Eng. 2024, 243, 57–69.
29. Kim, T.; Lee, D.-H.; Kim, K.-C.; Kim, Y.-J. 2D pose estimation of multiple tomato fruit-bearing systems for robotic harvesting. Comput. Electron. Agric. 2023, 211, 108004.
30. Khatun, T.; Nirob, M.A.S.; Bishshash, P.; Akter, M.; Uddin, M.S. A comprehensive dragon fruit image dataset for detecting the maturity and quality grading of dragon fruit. Data Brief 2025, 52, 109936.

Figure 1. Several representative fruits with different growth patterns. Cluster-growing fruits: (a) grape, (b) litchi, and (c) wampee. Single-growing fruits: (d) strawberry, (e) dragon fruit, and (f) apple.

Figure 2. Overall flowchart for the proposed dragon fruit pose estimation (DFPE) framework.

Figure 3. Examples of dragon fruit images captured in the orchard: (a) orchard environment; (b) variable lighting conditions.

Figure 4. Examples of dragon fruit images from the public dataset [30].

Figure 5. Examples of data augmentation for website-cropped images.

Figure 6. Examples of image annotation using the LabelMe software. The image resolution is (a) 1920 × 1080; (b) 640 × 640.

Figure 7. Schematic of an ellipsoid’s 2D and 3D attitude information: (a) 3D attitude, (b) 2D attitude.

Figure 8. Training results for YOLOv5n-Pose.

Figure 9. Visualization results for YOLOv5n-Pose.

Figure 10. Training results for YOLOv8n-Pose. (B) denotes bounding boxes; (P) denotes keypoints.

Figure 11. Visualization results for YOLOv8n-Pose.

Figure 12. Training results of YOLOv8-Pose under two keypoint labeling strategies. (B) denotes bounding boxes; (P) denotes keypoints.

Figure 13. Training results for improved YOLOv8n-Pose and YOLO11n-Pose. (B) denotes bounding boxes; (P) denotes keypoints.

Figure 14. Dragon fruit pose estimation experiments in the orchard.

Figure 15. Examples of dragon fruit pose estimation experiments in the laboratory.

Figure 16. Scatterplot of the dragon fruit pose estimation experiments in the laboratory.

Table 2. Distribution details of the dragon fruit dataset training set and validation set.

Image Source	Total	Training Set	Validation Set	Resolution	Image Characteristics
Orchard Acquisition	3749	2964	785	1920 × 1080	Real picking environment, mainly small target scenes
Public Dataset	868	679	189	640 × 640	High definition, single target per image, large target proportion
Roboflow Universe	3859	3137	722	640 × 640	Large differences in clarity and target scale
Total	8476	6780	1696	-	-
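
The split in Table 2 can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the tabulated counts; the dictionary layout and variable names are illustrative assumptions.

```python
# Counts copied from Table 2: (total, training, validation) per image source.
sources = {
    "Orchard Acquisition": (3749, 2964, 785),
    "Public Dataset": (868, 679, 189),
    "Roboflow Universe": (3859, 3137, 722),
}

# For every source, training + validation should equal the row total.
row_ok = {
    name: train + val == total
    for name, (total, train, val) in sources.items()
}

# Column sums, to compare with the "Total" row of the table.
total_images = sum(total for total, _, _ in sources.values())
train_images = sum(train for _, train, _ in sources.values())
val_images = sum(val for _, _, val in sources.values())

# Share of images held out for validation.
val_fraction = val_images / total_images

print(row_ok)
print(total_images, train_images, val_images, round(val_fraction, 3))
```

The column sums reproduce the "Total" row of the table, and the validation share comes out at roughly 20% of the images.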

Table 3. Key model training hyperparameters in the experiment.

Parameter	Value
Initial learning rate	0.01
Weight decay coefficient	0.001
Momentum	0.937
Number of epochs	300
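
For readers reproducing the training setup, the values in Table 3 might be passed to a trainer roughly as follows. The sketch assumes the Ultralytics Python package and its `YOLO(...).train(...)` interface. The weight file name `yolov8n-pose.pt` and the dataset file `dragon_fruit_pose.yaml` are hypothetical placeholders, and the mapping of the table rows to the argument names `lr0`, `weight_decay`, `momentum`, and `epochs` is an assumption rather than the authors' confirmed configuration.

```python
# Hyperparameters copied from Table 3. The keys follow Ultralytics
# argument names, which is an assumption about the training tool used.
TRAIN_HYPERPARAMETERS = {
    "lr0": 0.01,            # initial learning rate
    "weight_decay": 0.001,  # weight decay coefficient
    "momentum": 0.937,      # optimizer momentum
    "epochs": 300,          # number of training epochs
}


def train(data_yaml="dragon_fruit_pose.yaml", weights="yolov8n-pose.pt"):
    """Launch training with the Table 3 values.

    Both default file names are hypothetical placeholders. The
    `ultralytics` package is imported lazily so that the settings
    above can be inspected without installing it.
    """
    from ultralytics import YOLO  # third-party dependency, assumed installed

    model = YOLO(weights)
    return model.train(data=data_yaml, **TRAIN_HYPERPARAMETERS)
```

Keeping the values in a single dictionary makes it straightforward to reuse the same settings when a different keypoint detector is plugged into the framework.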

Table 4. Comparison between two keypoint labeling strategies for keypoint detection by YOLOv8n-Pose.

Metric	2 keypoints, with group ID	2 keypoints, no group ID	4 keypoints, with group ID	4 keypoints, no group ID
Precision ¹	0.931	0.931	0.945	0.945
Recall ¹	0.995	0.995	0.972	0.972
mAP50 ¹	0.981	0.981	0.988	0.988
mAP50-95 ¹	0.853	0.853	0.868	0.868
Precision ²	0.931	0.931	0.940	0.940
Recall ²	0.995	0.995	0.967	0.967
mAP50 ²	0.981	0.981	0.976	0.976
mAP50-95 ²	0.956	0.956	0.937	0.937
GFLOPs	8.3	8.3	8.3	8.3

¹ Results for bounding boxes. ² Results for keypoints.

Table 5. The quantitative results of improved YOLOv8n-Pose and YOLO11n-Pose.

Model	Precision ¹	Recall ¹	mAP50 ¹	mAP50-95 ¹	Precision ²	Recall ²	mAP50 ²	mAP50-95 ²	Parameters	GFLOPs
YOLOv8n	0.958	0.972	0.987	0.883	0.955	0.969	0.985	0.973	2,756,695	7.1
YOLO11n	0.966	0.977	0.987	0.885	0.963	0.973	0.984	0.972	2,654,479	6.6
YOLOv8n-SE	0.963	0.978	0.986	0.881	0.960	0.975	0.985	0.972	3,086,167	8.3
YOLOv8n-EMA	0.965	0.970	0.989	0.881	0.963	0.967	0.987	0.975	3,078,647	8.4
YOLOv8n-ECA	0.967	0.968	0.989	0.885	0.964	0.965	0.987	0.975	3,077,980	8.3
YOLOv8n-CBAM	0.960	0.977	0.988	0.884	0.958	0.974	0.986	0.974	3,086,265	8.3
YOLOv12n	0.954	0.975	0.987	0.883	0.951	0.973	0.985	0.973	2,629,055	6.6
YOLO26n	0.941	0.962	0.984	0.870	0.937	0.958	0.981	0.972	2,467,347	5.5

¹ Results for bounding boxes. ² Results for keypoints.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, X.; Bai, L.; Zhang, T.; Wu, R. A Monocular Pose Estimation Framework for Automatic Dragon Fruit Harvesting Using Navel and Stem Keypoints. Horticulturae 2026, 12, 505. https://doi.org/10.3390/horticulturae12040505

