1. Introduction
Dragon fruit is nutritionally rich, containing high levels of vitamin C and water-soluble dietary fiber [
1]. The Asia–Pacific region, particularly China, is the world’s largest producer of dragon fruit. By 2020, China’s dragon fruit output reached 1.526 million tons, making it a key agricultural economic industry contributing to local rural revitalization [
2]. However, dragon fruit has a short harvesting window: mature fruits must be picked within 3–5 days to avoid rot and quality deterioration, which places high demands on harvesting efficiency. To achieve efficient and high-yield production, agriculture was, and still is, highly reliant on farmers’ experience, knowledge, and wisdom [
3]. At present, orchard management and fruit harvesting processes remain labor-intensive and costly. Labor consumed in harvesting accounts for 40–50% of the entire planting process, with manual harvesting costs representing 50–70% of total orchard expenses [
4]. In 2021, the average comprehensive mechanization rate of orchards in China was approximately 30%, while the mechanization rate for the harvesting process was less than 3% [
4]. Globally, the agricultural sector faces labor shortages driven by demographic shifts, including an aging population, rural-to-urban migration, and declining interest in physically demanding farm work. As agricultural production expands amid these labor constraints, fruit growers are encountering significant challenges, particularly low harvesting efficiency and rising labor costs. The growing gap between labor availability and harvesting demands during peak seasons underscores the need for automated solutions such as robotic harvesting systems.
The design of fruit harvesting robots often depends on the characteristics of the target fruit or fruits. Fruits can be categorized into single- and cluster-growing types.
Figure 1 shows several representative fruits with different growth patterns. Cluster-growing fruits require simultaneous detection of fruit bodies and stem cutting points [
5,
6], while single-growing hard fruits (represented by dragon fruit [
7] and citrus [
8]) with firm stem–branch attachment rely on accurate 3D position and pose information to achieve non-destructive harvesting. Different picking methods require different key technologies to be implemented in the harvesting robot’s vision system. In recent years, research on vision systems for fruit harvesting robots has primarily focused on fruit recognition and localization, ripeness detection, and pose estimation. Moreover, existing research has mainly focused on high-yield, easy-to-pick fruits, such as apples and tomatoes, with less research on the detection and pose estimation of dragon fruit in natural environments.
Accurate fruit pose information is essential for the non-destructive harvesting of dragon fruit, as the growth direction of the fruit in natural orchards is highly uncertain. Without precise pose guidance, the harvesting actuator may damage branches and adjacent fruits during operation. Notably, the 2D pose parameters (roll and yaw angles) obtained from monocular RGB images in this study can be mapped to the 3D operating space of the harvesting robot through camera calibration, providing reliable spatial attitude guidance for automated picking operations. Therefore, it is crucial to accurately estimate the fruit pose to achieve non-destructive and efficient automated dragon fruit harvesting.
Existing fruit pose estimation methods are primarily classified into three categories based on their data requirements. Methods in the first category rely solely on 3D point cloud information. An example is the 3D point cloud reconstruction method [
9,
10], with which the fruit pose is estimated via a template-based approach that involves acquiring 3D point cloud data of target fruits, processing the data through reconstruction and segmentation algorithms, and then searching for the optimal offline template. The second category of methods depends on both 3D point cloud information and RGB images [
11], integrating 3D point cloud data with RGB image processing algorithms (e.g., keypoint detection and stem segmentation), and completing pose estimation by capturing information about key parts of the fruit. The third category includes methods that rely only on RGB images, without requiring 3D point cloud data. These methods involve operations such as image segmentation, keypoint detection, and object detection, which are performed on RGB images to identify fruits and estimate their pose.
Table 1 details typical fruit pose estimation methods used in the existing literature.
Notably, methods in the first two categories rely on high-quality 3D point cloud data for effective pose estimation. However, under natural orchard conditions, the accuracy of point cloud data is often compromised by environmental complexity, resulting in noise interference and partial information loss. This not only increases the difficulty of data pre-processing but may also compromise the real-time requirements of harvesting robots. Therefore, researchers have turned to single RGB image-based solutions, which usually first detect fruits and then estimate their poses via image segmentation or keypoint detection. Three primary RGB image-based fruit pose estimation approaches have been developed for robotic harvesting.
The first method classifies fruits into multiple categories based on harvest-adapted growth orientations. Lv et al. [
12] divided apples into four harvest-adapted categories according to their natural growth forms in orchards to guide the harvesting robot to adopt differentiated obstacle avoidance and picking strategies for fruits with different occlusion conditions. Jin et al. [
13] categorized fruits into five types (front, up, down, left, and right) according to their growth patterns; Zhang et al. [
14] divided the 3D poses of tomatoes into four categories (front–right, front–middle, front–left, and back) based on the fruit’s axial direction; and Zhou et al. [
7] classified dragon fruit into two orientations (forward and lateral), where forward-oriented fruits can be directly harvested while lateral ones require further angle measurement. This method converts the pose estimation problem into a multi-object detection task. However, it fails to obtain precise spatial poses, typically leading to relatively low harvesting success rates in actual field environments.
The second method enables robotic harvesting through the detection of cutting points on fruit-attached stems. Wu et al. [
5] proposed a top-down stem localization method using grape growth characteristics; Chen et al. [
6] proposed an improved YOLOv8-GP (YOLOv8-Grape and Picking Point) model based on YOLOv8n-pose to address the simultaneous detection of grape clusters and picking points on their stems; Gu et al. [
15] and Diao et al. [
16] realized concurrent detection of mango fruits and peduncles; Li et al. [
17] centered on tomato bunches without leaf interference, adopting a segmentation method to segment the main stem, peduncles, and fruit clusters to obtain cutting keypoints; Zhang et al. [
18] designed a system with six keypoints for tomato clusters and five for peduncles, targeting scenarios with a single object structure, and this framework was further extended by integrating 3D point clouds to generate 3D bounding boxes and 3D poses [
19]; Sun et al. [
20] specifically detected cutting points on citrus peduncles; and Yu et al. [
21] proposed the R-YOLO model with MobileNet-V1 as the backbone network, introducing rotated bounding boxes to predict fruit poses for strawberries via keypoint detection. This method is not suitable for dragon fruit harvesting, as dragon fruit either lacks a peduncle or has an extremely short one.
Table 1.
Comparison and analysis of typical fruit pose estimation methods.
| Target Crop | Technical Characteristics | Performance Indicators |
|---|---|---|
| Tomato [22] | A 3D pose detection algorithm (TPD) consisting of the YOLO-lmk model and a point cloud processing module. | YOLO-lmk model: bounding box mAP = 92.9%, dlmk = 7.9, FLOPs = 16.6 B, speed = 0.062 s/sheet. |
| Strawberry [21] | An R-YOLO model with MobileNet-V1 backbone, predicting rotated bounding boxes for fruit pose estimation and picking point localization. | Overall precision = 94.43%, average detection time = 0.056 s/sheet, field harvesting success rate = 84.35%. |
| Tomato bunch [18] | The Tomato Pose Method (TPM) for 3D pose detection, including an a priori geometric model, a cascaded multi-task network, and a 3D reconstruction process. | 2D keypoint detection success rate = 94.02%, accuracy = 85.77%; can construct 70.05% of tomato bunches with multiple poses. |
| Tomato bunch [19] | A multi-task algorithm (TPMv2) to estimate 3D poses based on 2D/3D bounding boxes and keypoints. | 2D BBox precision = 0.9372, 3D BBox precision = 0.8700; 2D PCK = 0.8882, 3D PCK = 0.7836; 78.36% of 3D keypoints have positioning errors < 20 mm. |
| Grape [23] | An improved YOLOv8n-GP model based on YOLOv8n-Pose, integrating SENetV2 attention and CARAFE upsampling for synchronous detection of grape clusters and stem picking keypoints. | mAP = 97.1%, mAP-kp = 95.4%, inference time = 7.3 ms/sheet, model size = 6.4 MB. |
| Cherry tomato clusters [17] | An improved HER-SAC algorithm to guide collision-free grasping motions of end-effectors. | Orchard harvesting success rate = 85.5%, average operation time = 11.42 s. |
| Citrus [8] | A multi-task learning model (FPENet) to simultaneously locate fruit navel points and predict rotation vectors. | Fruit navel point detection AP = 88.92, average rotation vector error = 11.13°; over 90% of rotation vectors have errors < 22.5°; harvesting success rate = 79.79%. |
| Apple [24] | A method for estimation of fruit orientation by projecting 2D information onto the 3D space. | Median angular error: 17.6° (orchard), 14.6° (laboratory). |
| Tomato [25] | A deep learning network (Deep-ToMaToS) for simultaneous three-level maturity classification and 6D pose (3D translation and 3D rotation) estimation. | 6D pose estimation accuracy = 96% (ADD_S metric), average harvesting success rate = 84.5%. |
The third method derives fruits’ spatial pose via the fusion of keypoint detection and geometric features. Du et al. [
22] used the line connecting keypoints and the centroid obtained from point clouds to represent the fruit’s growth direction; Sun et al. [
8] detected navel keypoints and regressed rotation vectors; Kok et al. [
24] proposed a method that projects 2D information into 3D space through a keypoint detection neural network and line segment extraction-based circle detection, calculating the unit vector of apple orientation, which was validated on public 3D point cloud datasets and in laboratory environments, achieving an azimuth error of 17.6° in orchard scenarios and 14.6° in laboratory settings, showing performance comparable to existing studies and suitability for orchard robotic harvesting; and Jang et al. [
26] first segmented tomato fruit bodies and sepals using an instance segmentation model applied to RGB images, matched tomato bodies and sepals from multiple candidates, generated point clouds corresponding to each part based on RGB-D data, and estimated tomato poses as direction vectors through the centroids of the point clouds of tomato bodies and sepals respectively. This method can achieve relatively precise spatial poses but requires the integration of accurate 3D point cloud data.
Beyond the dominant YOLO-series frameworks in agricultural pose estimation, the field has seen rapid advances in ViT-based, two-stage, and bottom-up keypoint estimation paradigms. Ni et al. [
27] proposed TomatoPoseNet, which outperformed the ViT-based ViTPose by 3.78% in keypoint detection accuracy and provided a benchmark for CNN and transformer backbone selection in fruit harvesting; Ci et al. [
28] verified the robustness of two-stage Keypoint R-CNN in occluded greenhouse environments; and Kim et al. [
29] developed a bottom-up OpenPose-based method for simultaneous multi-target pose estimation with constant computational load. Notably, ViTs have emerged as a promising direction for agricultural pose estimation with strong global feature modeling ability, yet they still face challenges of high computational overhead and large parameter scale for edge deployment on harvesting robots [
27], with lightweight ViT-based solutions remaining under-explored.
Although these methods have advanced the field of fruit pose estimation, including the aforementioned non-YOLO architectures for tomato crops [
27,
28,
29], their application remains limited to specific fruit types, with most existing works focusing on tomato, apple and other spherical fruits, while few studies have addressed the pose estimation of ellipsoidal dragon fruits in natural orchard environments. To address the challenge of automatic dragon fruit harvesting, this study proposes an RGB image-based dragon fruit pose estimation (DFPE) framework aimed at acquiring accurate spatial poses. This framework adopts a method that transforms the 3D pose estimation problem into determining the spatial yaw and roll angles. To match YOLO11n-Pose’s single-class annotation requirements, a pseudo four-keypoint annotation strategy was designed, and the spatial yaw and roll angles are then derived from the relative positions of the detected keypoints and the target fruit. The main contributions of this study are summarized as follows:
(1) A dataset for dragon fruit pose estimation is compiled under various lighting conditions and made publicly available, enabling researchers to reproduce the experimental results.
(2) We propose a 3D pose estimation method for dragon fruit that detects navel and stem keypoints to predict the fruit’s orientation, transforming the 3D pose estimation problem of near-ellipsoidal fruits into a 2D keypoint detection task and thus significantly simplifying the pose estimation process.
(3) We develop a pseudo four-keypoint annotation strategy to label two target types (fruit_front and fruit_back) and their corresponding keypoints. This meets YOLO-Pose’s annotation requirements, implicitly encoding the orientation of fruit via bounding box group IDs while preserving geometric information for pose inference and facilitating effective model training.
2. Materials and Methods
The overall workflow of the proposed DFPE framework is illustrated in
Figure 2. First, the framework commences with dataset construction, which encompasses dragon fruit image acquisition and annotation: each image is annotated with the target fruit’s bounding box, stem keypoint, and navel keypoint. To align with the single-class annotation requirement of one of the mainstream keypoint detection models, YOLO11n-Pose, a pseudo four-keypoint format conversion algorithm is employed: this algorithm translates the annotated information into the four-keypoint structural format, while assigning Class ID 0 to “fruit_front” and Class ID 1 to “fruit_back” to encode the fruit’s distinct growth orientations. The processed dataset (formatted in the pseudo four-keypoint structure) is then fed into the keypoint detection model for end-to-end detection of dragon fruit instances and regression of their corresponding keypoint coordinates. Subsequently, the dragon fruit pose estimation algorithm processes the model’s inference outputs (Class ID, bounding box parameters, stem keypoint, and navel keypoint): by fitting an ellipse to the detected fruit region to characterize its spatial contour, and calculating the relative spatial positional relationship between the stem and navel keypoints, the framework derives the yaw and roll angles, which are core parameters of the dragon fruit’s 3D spatial pose. Finally, this framework is deployable in both controlled laboratory environments and natural orchard scenarios, yielding accurate spatial pose information to support automated dragon fruit harvesting operations.
2.1. Dataset Construction
2.1.1. Image Acquisition
The dataset used in this study consists entirely of dragon fruit images with natural orchard environments as the background, and it is derived from three independent image sources, with a total of 8476 images covering diverse real orchard scenarios such as different lighting conditions, occlusion levels, and target scales. The distribution of the training set and validation set of the full dataset is detailed in
Table 2.
The first group consists of 3749 RGB images with a resolution of 1920 × 1080, with some examples shown in
Figure 3. These images were collected at the orchard of Zhuhai Xijian Agricultural Development Co., Ltd., Guangdong, China (GPS: 22.15° N, 113.25° E), which has the largest contiguous dragon fruit planting base in Guangdong Province, China, with a ridge spacing of 2.8 m–3.5 m; the camera was fixed in the middle of the two ridges, facing the dragon fruit plants for random shooting. The image acquisition time was from 11:00 to 19:00 on 23 June 2024, and the weather was cloudy on the day of collection. The images were captured using a ZED 2i binocular stereo camera (Manufactured by Stereolabs, San Francisco, CA, USA), but only RGB images were utilized in this work. These images cover diverse lighting conditions, including weak light, moderate light, and strong light from front, side, and back directions. The images in this group are highly consistent with the actual operational environment of automated picking and mainly consist of small target scenes.
The second group consists of 868 publicly available RGB images of mature dragon fruits captured in natural orchard environments [
30], with examples shown in
Figure 4. This subset has the characteristics of high definition, with a single dragon fruit target in most images, and the target accounts for most of the image area, with clear and complete morphological features of the navel and stem.
The last group consists of RGB images cropped from
https://universe.roboflow.com/. Roboflow Universe is a computer vision community platform developed by Roboflow Inc., a company headquartered in Des Moines, IA, USA. We first strictly screened 227 high-quality original images from this platform, all of which are natural orchard-captured dragon fruit images with complete navel and stem morphological features. The resolution of these original images was unified to 640 × 640 through scaling and padding; to eliminate data leakage risks, we first completed the division of the training set and validation set for all original images, and then performed data augmentation exclusively on the 227 original images, expanding them from 227 to 3859 via augmentation operations including rotation, brightness adjustment, flipping, sharpening, and contrast adjustment, with examples shown in
Figure 5. The images in this group vary widely in clarity and in the proportion of the frame occupied by the target, which can effectively improve the generalization ability of the model to targets of different scales.
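The split-before-augmentation order described above can be illustrated with a minimal bookkeeping sketch. It works on image identifiers only and performs no actual image processing. The function name, the validation fraction of 0.2, and the figure of 16 augmented variants per original are assumptions made for illustration; the text states only that 227 originals were expanded to 3859 images, which is arithmetically consistent with one original plus 16 variants each.

```python
import random


def split_then_augment(image_ids, val_fraction, copies_per_image, seed=0):
    """Split original images first, then expand each subset independently.

    Because every augmented variant inherits the subset of its source image,
    no source image can contribute samples to both subsets.
    """
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)

    n_val = round(len(ids) * val_fraction)
    val_sources, train_sources = ids[:n_val], ids[n_val:]

    def expand(sources):
        samples = []
        for src in sources:
            samples.append((src, "original"))
            # Placeholder tags stand in for rotation, flipping, etc.
            samples.extend((src, f"aug{k}") for k in range(copies_per_image))
        return samples

    return expand(train_sources), expand(val_sources)


# Hypothetical settings: 20% validation split, 16 variants per original.
train, val = split_then_augment(range(227), 0.2, 16)
print(len(train), len(val), len(train) + len(val))
```

Augmenting before splitting would instead allow near-duplicate variants of one photograph to appear on both sides of the split, which inflates validation scores.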
2.1.2. Data Annotation
This study establishes a unified annotation standard for bounding boxes and keypoints. For dragon fruit, a typical ellipsoidal crop, a tightly fitted rectangular bounding box is used to enclose each fruit with an occlusion rate of less than 50%. Two keypoints are labeled for each valid box: the navel, defined as the geometric center of the calyx detachment at the fruit bottom, and the stem, defined as the center of the branch connection at the pedicel. This unified paradigm, consisting of one bounding box paired with two corresponding morphological keypoints, ensures the framework’s generalizability to other ellipsoidal crops, such as citrus and elongated tomatoes, by standardizing the annotation of fruit bottom indentations and pedicel connection points.
After completing the pre-processing of all images and the division of the training set and validation set, we performed a unified annotation operation on all images using the LabelMe 5.2.1 software, as shown in
Figure 6. The default YOLO-Pose models are designed based on 17 keypoints, each representing a different part of the human body. In contrast, this study used a custom dataset, in which the annotations for each mature dragon fruit instance include two parts: first, a bounding box labeled “dragon_fruit” was annotated with a group_id set to 0 or 1 (where 0 represents fruit_front, denoting that the navel is closer to the camera; while 1 represents fruit_back, indicating that the stem is closer to the camera). This setup enabled implicit classification without requiring dual-class labeling. Second, two keypoints labeled “navel” and “stem” are annotated with a group_id of 0 or 1, where 0 represents visible and 1 represents occluded. For occluded keypoints (e.g., those partially obscured by foliage), annotations were performed strictly based on the morphological prior knowledge of dragon fruits, and all occluded sample annotations were reviewed and corrected by experts with agronomic experience in dragon fruit cultivation to ensure annotation accuracy.
To ensure the reliability and consistency of the annotation results, we implemented a standardized two-step annotation and cross-validation procedure. First, the full initial annotation of the dataset was completed by an annotator with professional experience in agricultural computer vision annotation. Subsequently, another annotator with the same qualification performed an independent full-range cross-verification on all the initial annotation results. For samples with uncertain or disputed annotations identified during verification (particularly complex samples with occluded keypoints), the two annotators engaged in in-depth discussions utilizing morphological prior knowledge of dragon fruit to reach a consensus on all annotations. We have ensured that the collection and utilization of the dataset adhere to the relevant laws and regulations pertaining to copyright and data protection. Further details regarding the construction of the dataset and a comprehensive list of contributors can be found in
Appendix A.
2.1.3. Pseudo Four-Keypoint Format Conversion Algorithm
YOLO11n-Pose natively supports only single-category keypoint detection, posing a challenge for multi-category keypoint detection tasks. To address incompatibility with the required label structure, a format conversion algorithm (the pseudo four-keypoint format conversion algorithm) was developed, as detailed in Algorithm 1. This algorithm implicitly distinguishes two dragon fruit growth orientations (fruit_front and fruit_back) via keypoint positional encoding. It enables YOLO11n-Pose to learn orientation-aware keypoint detection within a single-category framework without architectural modifications.
Algorithm 1 Pseudo four-keypoint format conversion
Require: Bounding box , where ; keypoints with visibility v; image dimensions .
Ensure: YOLO-Pose label vector Y of length 17.
1: Calculate normalized box center and size:
2:
3:
4: Initialize vector Y with zeros. Set and .
5: if then
6:
7: else
8:
9: end if
10: Process Stem keypoint :
11:
12: if else 1
13:
14: Process Navel keypoint :
15:
16: if else 1
17:
18: return Y
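A minimal Python sketch of such a conversion is given here for orientation. It is not a transcription of Algorithm 1, and several details are assumptions: that the 17 entries are one class index, four normalized box values, and four keypoint slots of three values each; that fruit_front occupies the first two keypoint slots and fruit_back the last two; that the stem precedes the navel within a pair; and that visibility is encoded as 2 for visible and 1 for occluded. The function name and argument layout are hypothetical.

```python
def to_pseudo_four_keypoint_label(box, orientation, stem, navel, img_w, img_h):
    """Build a 17-value single-class pose label.

    box:         (x_min, y_min, x_max, y_max) in pixels
    orientation: 0 for fruit_front, 1 for fruit_back (box group_id)
    stem, navel: (x, y, occluded) in pixels, occluded is 0 or 1 (keypoint group_id)
    """
    x_min, y_min, x_max, y_max = box
    label = [0.0] * 17

    label[0] = 0                                  # single class index
    label[1] = (x_min + x_max) / 2 / img_w        # normalized box center x
    label[2] = (y_min + y_max) / 2 / img_h        # normalized box center y
    label[3] = (x_max - x_min) / img_w            # normalized box width
    label[4] = (y_max - y_min) / img_h            # normalized box height

    # ASSUMED slot layout: front uses slots 0-1, back uses slots 2-3.
    offset = 5 if orientation == 0 else 11

    for i, (x, y, occluded) in enumerate((stem, navel)):
        visibility = 2 if occluded == 0 else 1    # ASSUMED visibility coding
        start = offset + 3 * i
        label[start:start + 3] = [x / img_w, y / img_h, visibility]

    return label


label = to_pseudo_four_keypoint_label(
    box=(100, 200, 300, 400), orientation=1,
    stem=(150, 210, 0), navel=(250, 390, 1),
    img_w=1000, img_h=800,
)
print(label)
```

Under this layout the two unused slots stay at zero, so the orientation can be read back at inference time from which pair of slots carries the keypoints.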
2.2. Keypoint Detection Model
Real-time and lightweight keypoint detection is critical for edge deployment of harvesting robots in unstructured orchards. Current mainstream architectures for agricultural pose estimation include YOLO-series single-stage frameworks, two-stage R-CNN [
28], transformer-based ViT models [
27], and bottom-up PAF-based frameworks [
29]. The YOLO framework [
5,
6,
7,
13,
22] achieves the optimal balance of speed, accuracy, and lightweight deployment for real-time orchard operations, while two-stage frameworks have excessive latency, ViT models bring prohibitive computational overhead for edge devices, and bottom-up methods have insufficient localization accuracy for occluded small targets. Therefore, we selected YOLO-series models as the keypoint detection model for this paper. It is noteworthy that other keypoint detection models are also suitable for the DFPE framework proposed in this paper. Many YOLO series architectures also offer multi-scale variants (n/s/m/l/x) for flexible trade-offs between detection accuracy and computational cost, with the n variants providing lightweight configurations that maintain competitive performance–cost trade-offs. For consistency across all experiments in this study, we exclusively used these n variants. In addition, the YOLO series has evolved from anchor-based to anchor-free designs through in-depth architectural optimizations.
YOLOv5, YOLOv8, and YOLO11 are representative deep learning architectures developed by Ultralytics that support varied computer vision tasks, including detection, segmentation, oriented bounding box detection, classification, and pose estimation. As a classic anchor-based variant representing the early mature stage of the YOLO series with stable performance and extensive community validation, YOLOv5 uses a C3 backbone with cross-stage partial connections and concatenation-based feature fusion. In contrast, in the anchor-free YOLOv8, the C3 backbone is replaced with the C2f module and a BiFPN is integrated for multi-scale fusion. This design eliminates anchor mismatch issues for targets of varying sizes and enhances the small-target detection performance by strengthening feature flow between shallow and deep layers. Released in 2024, YOLO11 features further iterative upgrades, using the C3k2 module as its backbone and adopting an enhanced strategy with attention mechanisms for feature fusion, achieving efficient multi-scale integration of information via the C2PSA module and an improved feature pyramid structure. The performances of the YOLOv5, YOLOv8, YOLO11, YOLOv12, and YOLO26 models, each built from its YAML configuration and trained for dragon fruit pose estimation, were compared, and YOLO11n-Pose was ultimately selected as the basis for the keypoint detection model. It is worth clarifying that the selection of YOLO11n-Pose is only to provide a real-time, lightweight baseline implementation for the keypoint detection module of the DFPE framework, and does not constitute a binding requirement for the DFPE framework. Other mainstream keypoint detection models that can output valid navel and stem keypoint coordinates can also be connected to the subsequent customized geometric pose inference stage.
The performance of the YOLO-pose model is evaluated using two sets of metrics: bounding box (B) metrics and pose (P) metrics. The former assesses the model’s ability to locate and classify objects, while the latter evaluates the accuracy of keypoint localization. These metrics include Precision, Recall, and mean Average Precision (mAP), as defined in Equations (1)–(4). In these formulas, True Positives (TP) represent correctly identified targets, False Positives (FP) denote non-targets or incorrectly localized targets, and False Negatives (FN) represent actual targets that the model failed to identify.
For (B) metrics, a TP is defined by an Intersection over Union (IoU) threshold between the predicted and ground truth boxes. For (P) metrics, a TP is defined by an Object Keypoint Similarity (OKS) threshold.
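The two matching criteria and the resulting precision and recall can be sketched in a few lines of Python. The OKS below follows the widely used COCO-style form, in which each keypoint's squared distance is scaled by the object area and a per-keypoint constant. The constants passed as `sigmas` are hypothetical placeholders, since the text does not report the values used for the navel and stem keypoints.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def oks(pred_kpts, gt_kpts, gt_visibility, area, sigmas):
    """COCO-style Object Keypoint Similarity over labeled keypoints (v > 0)."""
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred_kpts, gt_kpts, gt_visibility, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * (2.0 * sigma) ** 2))
        count += 1
    return total / count if count else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(oks([(10, 10), (20, 30)], [(10, 12), (21, 30)], [2, 1], area=400.0, sigmas=[0.25, 0.25]))
print(precision_recall(tp=8, fp=2, fn=2))
```

A prediction counts as a TP for (B) metrics when `box_iou` meets the chosen threshold, and for (P) metrics when `oks` does.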
The first category focuses on bounding box detection performance, denoted by the (B) suffix. Precision (B) and Recall (B) measure the accuracy and completeness of the object detection boxes, respectively. To evaluate the overall localization performance, mAP50 (B) provides the mean Average Precision calculated at a fixed IoU threshold of 0.50, whereas mAP50-95 (B) represents the average mAP across a range of IoU thresholds from 0.50 to 0.95 with a step of 0.05, serving as a robust indicator of detection stability and localization quality.
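The threshold sweep behind mAP50-95 can be illustrated with a simplified sketch. It assumes a single class and that each detection has already been paired with its best-matching ground truth, so a detection is a TP at a threshold whenever its stored IoU meets that threshold. AP is computed as the area under the monotone precision envelope; production evaluators typically use an interpolated curve and redo the matching at each threshold, so their values can differ slightly. All names and the example numbers are hypothetical.

```python
def average_precision(is_tp, n_gt):
    """Area under the precision-recall curve for detections sorted by confidence."""
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_tp:
        tp += hit
        fp += not hit
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_gt)

    # Replace each precision by the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def map_over_thresholds(detections, n_gt):
    """detections: list of (confidence, matched_iou). Returns (mAP50, mAP50-95)."""
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]  # 0.50 ... 0.95
    aps = [average_precision([iou >= t for _, iou in ordered], n_gt) for t in thresholds]
    return aps[0], sum(aps) / len(aps)


# Three hypothetical detections evaluated against two ground-truth fruits.
map50, map50_95 = map_over_thresholds([(0.9, 0.92), (0.8, 0.62), (0.7, 0.30)], n_gt=2)
print(map50, map50_95)
```

The same sweep applies to the (P) metrics by storing an OKS value instead of an IoU for each detection.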
The second category specifically addresses pose estimation performance through (P) metrics based on the OKS criteria. Precision (P) and Recall (P) evaluate the accuracy of keypoint predictions, while mAP50 (P) indicates the mean Average Precision for keypoints when the OKS threshold is set to 0.50. Furthermore, mAP50-95 (P) is employed as the primary metric for pose estimation by averaging performance across OKS thresholds from 0.50 to 0.95, which effectively reflects how precisely keypoints, such as facial landmarks or joint positions, are localized relative to the object’s overall scale.
To assess the computational efficiency and deployment suitability of the model, two additional complexity metrics are utilized. The Number of Parameters (M) refers to the total number of trainable variables within the network, in millions, reflecting the model size and hardware memory requirements. Concurrently, GFLOPs (giga floating-point operations) measure the computational cost required for a single inference pass, providing a standardized, hardware-independent indicator of the model’s computational load and its suitability for real-time applications on various hardware platforms.
2.3. Pose Estimation Algorithm
To address challenges arising in the context of automated dragon fruit harvesting, a pose estimation method that uses only RGB images is proposed. Dragon fruits exhibit a rotational ellipsoid shape in 3D space, with the major axis aligning in the stem–navel direction. The y’ axis is defined based on the projection of the major axis onto the YOZ plane, as shown in
Figure 7a. In addition, the ellipsoid’s minor and intermediate axes are equal in length and perpendicular to the major axis. Under camera perspective projection, the 3D ellipsoid projects to a 2D ellipse in the image plane, as indicated by the dashed ellipse in
Figure 7b.
In 3D space, the attitude of any rigid body can be represented by a set of Euler angles (yaw, pitch, and roll). By approximating dragon fruit as an ellipsoid with equal intermediate and minor axes, it can be assumed to have axial symmetry about its major axis (the stem–navel line). Consequently, any rotation about this longitudinal axis (defined here as the pitch angle) does not alter the geometric profile of the fruit or the spatial orientation of its centerline. From the perspective of the harvesting task, the end-effector primarily requires alignment with the fruit’s major axis to execute a non-destructive grasp or cut. Since the fruit typically hangs vertically and the camera is configured to capture images from a near-horizontal viewpoint, variations in pitch are inherently constrained and primarily manifest as a slight foreshortening of the projected major axis. This foreshortening introduces negligible error in the calculation of the harvesting vector compared to the dominant influence of the fruit’s lateral and vertical orientation.
Due to this symmetry and the specific kinematic requirements of the harvesting robot, we only need to obtain the roll and yaw angles shown in Figure 7 to determine the fruit’s spatial attitude, thereby guiding harvesting robots to perform non-destructive harvesting. This simplified two-degree-of-freedom representation is sufficient for the end-effector to approach the fruit along its central axis without the need for the redundant pitch component. The roll value is derived from the slope between the navel and stem keypoints in the RGB image, while the yaw value is calculated based on the relative positional relationship between these two keypoints and the projection of the dragon fruit in the image plane (YOZ plane). The black dots in Figure 7 represent the keypoints and their projections, while the dashed ellipse represents the projected outline of the dragon fruit in the YOZ plane, which also serves as the fitted ellipse in that plane.
To determine the dragon fruit’s attitude from an RGB image, two angles (roll and yaw) are derived based on the geometric relationships between the fitted ellipse and the keypoints. Here, the roll angle describes the inclination of the fitted ellipse’s major axis relative to the image’s horizontal axis. This major axis is defined by the stem–navel line, which connects the stem keypoint and the navel keypoint; in particular, the image’s horizontal axis corresponds to the y-axis in Figure 7. Given the coordinates of the stem and navel keypoints, the roll angle is calculated as the arctangent of the slope of the stem–navel line.
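As a minimal sketch of this step, the roll angle can be computed directly from the pixel coordinates of the two keypoints. The function name, the (u, v) tuple layout, and the choice to report the angle in degrees within (−90°, 90°] are illustrative assumptions rather than details given in the text; in particular, the sign of the result depends on how the vertical image axis is oriented.

```python
import math


def roll_from_keypoints(stem, navel):
    """Inclination of the stem-navel line relative to the horizontal image axis.

    stem, navel: (u, v) pixel coordinates (hypothetical layout, u horizontal).
    Returns degrees in (-90, 90]; a vertical line is reported as 90.
    """
    du = stem[0] - navel[0]  # horizontal offset between the keypoints
    dv = stem[1] - navel[1]  # vertical offset between the keypoints
    if du == 0 and dv == 0:
        raise ValueError("keypoints coincide; the slope is undefined")
    if du == 0:
        return 90.0  # vertical stem-navel line: infinite slope
    return math.degrees(math.atan(dv / du))  # arctangent of the slope
```

Because the arctangent of a slope cannot distinguish the stem end from the navel end, this sketch treats the line as undirected; a directed angle would need `math.atan2` and an agreed convention for which keypoint comes first.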
Assume that the shape of the dragon fruit in the image is approximated as an ellipse. To calculate the yaw angle, an ellipse in the XOY’ plane, as shown in Figure 7b, is defined by Equation (6):
where a and b denote the semi-major and semi-minor axes of the ellipse, respectively; the rotation angle is the angle between the ellipse’s major axis and the y’ axis; and (x, y) denotes the pixel coordinates of any point on the ellipse. Taking the partial derivative of both sides of Equation (6) with respect to y and setting the derivative of x to zero, we obtain Equation (7). Then, substituting Equation (7) into Equation (6), whereby x attains an extreme value, it follows that this extreme value satisfies Equation (8):
The relationship between the major axis of the fitted ellipse and
is expressed as Equation (9):
Two variables, n and m (defined in Equations (10) and (11), respectively), link ellipse geometry with keypoint topology. Here, m is the ratio of the ellipsoid’s major axis to minor axis, which can be set according to the specific morphology of different varieties of dragon fruit or other ellipsoid-shaped fruits, while n is defined as the ratio of the stem–navel line to the fitted ellipse in the image plane:
Finally, the yaw angle is derived from Equation (12), with its sign determined by the class ID. If class ID = 0, indicating that the fruit orientation is fruit_front, the yaw angle takes a positive value; whereas if class ID = 1, denoting the fruit orientation as fruit_back, it takes a negative value.
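One way such a relationship can arise is sketched below. It assumes an orthographic projection of an ellipsoid of revolution with axis ratio m, and reads n as the length of the projected stem–navel line divided by the length of the fitted ellipse's major axis. Under those assumptions, n = m·cos(yaw) / sqrt(m²·cos²(yaw) + sin²(yaw)), which rearranges to tan(yaw) = m·sqrt(1 − n²) / n. This closed form is derived here for illustration and is not claimed to be identical to Equation (12); the function name and the mapping of class ID 0 to a positive sign and class ID 1 to a negative sign are likewise assumptions.

```python
import math


def yaw_from_ratios(n, m, class_id):
    """Signed yaw in degrees from the ratios n and m (illustrative sketch).

    n: projected stem-navel length / fitted major-axis length, in [0, 1]
       (assumed reading of n).
    m: major-to-minor axis ratio of the ellipsoid, m > 0.
    class_id: 0 (fruit_front) -> positive, 1 (fruit_back) -> negative
              (assumed sign convention).
    """
    if m <= 0:
        raise ValueError("m must be positive")
    n = min(max(n, 0.0), 1.0)  # clamp against small detection noise
    # tan(yaw) = m * sqrt(1 - n^2) / n; atan2 also handles n == 0 (yaw = 90)
    magnitude = math.degrees(math.atan2(m * math.sqrt(1.0 - n * n), n))
    sign = 1.0 if class_id == 0 else -1.0
    return sign * magnitude
```

When the stem–navel line spans the full major axis (n = 1) the fruit axis lies in the image plane and the yaw is zero; as the keypoints move closer together relative to the ellipse, the estimated yaw grows toward 90°.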
Based on the above theoretical framework, we introduce the dragon fruit pose estimation algorithm, as detailed in Algorithm 2. This algorithm characterizes the pose of ellipsoid-shaped fruits using the roll and yaw values derived from a single RGB image. Utilizing the geometric constraints imposed by the detected bounding box and two keypoints, the algorithm fits an ellipse to the fruit’s profile. Subsequently, it leverages the geometric properties of this ellipse, in conjunction with the previously identified keypoints, to deduce the fruit’s spatial orientation relative to the camera.
Algorithm 2 Dragon fruit pose estimation
Require: Class ID, bounding box, keypoints, empirical constant m.
Ensure: Ellipse parameters, yaw angle.
1: Calculate , .
2: Containment Check: Verify if and are within bounding box boundaries. If not, terminate.
3: if then
4:   Special Case: Axis-Aligned
5:   ,
6:   if then
7:
8:   else
9:     {or radians}
10:   end if
11: else
12:   General Case: Oblique Ellipse
13:
14:   Compute conic parameters based on satisfying F = 0.
15:   Solve for semi-axes a and b from the conic parameters.
16: end if
17: Set ellipse center , .
18: Yaw Estimation:
19:
20:
21:
22: if then
23:
24: else
25:
26: end if
27: return
Considering natural morphological variations that may affect the ideal axial symmetry of fruits, the ellipse fitting process in this work does not enforce strict collinearity between the long axis of the ellipse and the connecting line of the two keypoints. Instead, a parallel constraint strategy is adopted, which enables robust and accurate ellipse fitting and effectively reduces errors caused by irregular fruit growth and morphological deviations.
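The fitting idea can be illustrated with a short sketch. It assumes that the ellipse is centered on the bounding box, that its major axis is parallel to the stem–navel line at inclination theta (measured from the horizontal image axis), and that the ellipse is tangent to all four sides of the box. For an ellipse with semi-axis p along the inclined direction and q across it, the half-width and half-height of the enclosing box are sqrt(p²cos²θ + q²sin²θ) and sqrt(p²sin²θ + q²cos²θ), so p² and q² follow from a 2×2 linear system. The function and parameter names are hypothetical, and this is not necessarily the conic-parameter procedure of Algorithm 2.

```python
import math


def fit_ellipse_in_bbox(width, height, theta_deg):
    """Semi-axes (along, across) of an ellipse inscribed in a width x height box.

    theta_deg: inclination of the 'along' axis from the horizontal image axis.
    Solves  (width/2)^2  = p*cos^2 + q*sin^2
            (height/2)^2 = p*sin^2 + q*cos^2   for p = along^2, q = across^2.
    """
    t = math.radians(theta_deg)
    c2, s2 = math.cos(t) ** 2, math.sin(t) ** 2
    hw2, hh2 = (width / 2.0) ** 2, (height / 2.0) ** 2
    det = c2 - s2  # equals cos(2*theta); zero at 45 degrees
    if abs(det) < 1e-9:
        raise ValueError("semi-axes are undetermined at a 45 degree inclination")
    p = (hw2 * c2 - hh2 * s2) / det
    q = (hh2 * c2 - hw2 * s2) / det
    if p <= 0 or q <= 0:
        raise ValueError("no ellipse with this inclination fits the box")
    return math.sqrt(p), math.sqrt(q)
```

The 45° case is singular because the box then constrains only the sum of the squared semi-axes, so an additional constraint (for example the ratio m) would be needed there.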
4. Discussion
This study constructs a novel RGB-image-based dragon fruit pose estimation framework for vision-guided robotic harvesting. This framework relies solely on single 2D RGB images for high-precision pose estimation, reducing hardware costs and environmental dependence, while achieving a balance between lightweight performance and robustness in unstructured orchard scenarios.
The pseudo four-keypoint format conversion algorithm resolves the conflict between YOLO11n-Pose’s single-class annotation requirement and multi-category keypoint detection, simplifying the annotation process and laying a solid foundation for subsequent pose inference. The dragon fruit pose estimation (DFPE) algorithm establishes a reliable geometric inference system by fitting ellipses and calculating keypoint spatial relationships, converting abstract keypoint information into interpretable pose parameters.
Compared with Reference [24], this framework achieves superior accuracy, with maximum roll and yaw deviations constrained within 15°. It ensures that robots accurately determine the spatial orientation of fruits, thereby reducing harvesting damage. Compared with the non-YOLO pose estimation methods for tomato crops, the proposed framework also shows competitive performance and a lighter model. Ni et al. [27] achieved a maximum angular error of 6.5° for tomato cutting point pose estimation with 31.015 M parameters and 5.122 G FLOPs; Ci et al. [28] achieved a mean absolute angular error of 10°–11° for tomato peduncle pose estimation using a two-stage Keypoint R-CNN framework; Kim et al. [29] reported an average angular error of 20° for tomato pedicel pose estimation with a bottom-up OpenPose framework. In contrast, the proposed DFPE framework based on YOLO11n-Pose constrained the maximum roll and yaw errors within 15° in laboratory environments, with only 2.65 M parameters and 6.6 G FLOPs, achieving a superior balance between pose accuracy and deployment efficiency. Compared with Reference [20], this framework also achieves superior accuracy under the same 15° bound.
The anchor-free YOLOv8 and YOLO11 models demonstrate strong generalization capability due to the dataset’s large-scale, multi-source characteristics, enabling high-precision, high-recall keypoint detection without additional parameter tuning. These models exhibit significant advantages over the anchor-based YOLOv5, particularly in complex scenarios. The experimental results confirm that labeling keypoint visibility using group IDs has a negligible impact on final pose estimation results, indicating the model’s inherent adaptability to partially occluded scenes.
Although the ellipsoid-based pose estimation method proposed in this study theoretically has cross-crop generalization potential owing to its universal geometric properties, making it applicable to similarly ellipsoidal fruits such as certain citrus fruits and papaya, its practical adaptability in real-world scenarios remains unverified because no cross-crop testing has been conducted. Furthermore, existing methods for acquiring ground truth pose data in complex orchard environments are immature and lack high-precision 3D pose annotation techniques, which makes it difficult to quantify model performance under authentic orchard conditions. Future research will focus on cross-crop empirical studies and the construction of a high-precision ground truth pose database. Subsequent research will also explore lightweight vision transformer (ViT) architectures tailored for agricultural pose estimation and integrate the two-stage and bottom-up pose estimation paradigms into the DFPE framework to further improve its adaptability to dense, occluded orchard scenarios.