1. Introduction
Algorithms for 6D pose estimation of objects are fundamental modules of any robotic system for logistics applications where the relocation of items is a task of primary relevance. One of the most challenging material-handling tasks is the pick-and-place of items characterized by a high variability of object instances in appearance and dimensions. Food handling is a logistics scenario where such a challenge arises, and food industries have shown increasing interest in autonomous solutions, especially packaging systems, over the last decade [1]. The related challenges include making robotic manipulators able to first localize in space objects of variable shape and dimensions, such as fruits and vegetables, and then carefully grasp them without any damage. Although several solutions for pick-and-place tasks have already been developed, dealing with delicate food makes the problem more challenging, as explained in [1,2,3]. Indeed, although the 6D pose estimation algorithms based on RGB-only images devised so far (such as [4,5]) can accurately localize objects that do not vary in appearance and size, they cannot localize different instances of the same object, such as fresh fruits, which greatly differ in appearance and dimensions; this is still an open problem, and some deep-learning-based solutions have been proposed in works such as [6,7]. Besides the localization problem, the existing technology typically used to pick items in industry could be unsuitable for grasping delicate fruits without damaging them. Existing solutions are based on two main approaches: the use of soft grippers [8] and of tactile sensors [9]. The former are difficult to control precisely, while the latter require proper algorithms to interpret the tactile data and correlate them to the grasping force [3]. As demonstrated by the way humans manipulate fresh fruits, the best way to avoid deteriorating food is to control the grasping force; therefore, grippers equipped with tactile sensors appear to be a valid replacement for the suction cups usually adopted in industrial practice.
The localization of the object to grasp is usually achieved by adopting a pose estimation algorithm based on deep learning techniques, able to recognize objects in a cluttered scene and estimate their position and orientation in space. This is a common feature of all the examined methods [4,5,10,11,12,13].
Here, we classify the existing 6D pose estimation algorithms into two main categories: pure learning methods, based on an end-to-end approach, and hybrid ones, which combine deep learning techniques with geometric approaches, thus exploiting additional knowledge to improve the accuracy of the estimated pose. Concerning the pure learning approach, PoseNet, proposed in [13], is one of the first end-to-end 6D pose estimation models based on convolutional neural networks (CNNs) and RGB data only, and it regresses the camera pose in both indoor and outdoor scenarios. Exploiting the same input, Deep-6DPose [4] extends the instance segmentation network Mask R-CNN [14] with an end-to-end regression-based pose estimation branch to jointly detect, segment, and recover the 6D pose of a target object without any post-refinement algorithm. To improve the pose estimation performance in complex environments, several works exploit RGB-D data, thus directly elaborating both the depth map and the corresponding RGB image. The authors of [12,15,16] propose an efficient RGB-D dense (pixel-wise) approach to fuse the two individually processed data sources. On the other hand, the approaches developed in [10,17] regress the eight corners of the 3D cuboid enclosing a known object and use a Perspective-n-Point (PnP) solver [18] to retrieve the corresponding 6D pose, thus realizing a hybrid 6D pose estimation algorithm. Moreover, the methods presented in [19,20] employ a least-squares fitting algorithm [21] instead of PnP to retrieve the final 6D pose given keypoints on the object surface. Other hybrid solutions, such as [11], exploit a pose refinement stage based on geometric approaches, such as the Iterative Closest Point (ICP) algorithm [22], to improve the accuracy of the pose produced by a deep learning technique.
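As a concrete illustration of the keypoint-plus-PnP scheme mentioned above, the following sketch (not taken from any of the cited implementations) recovers a 6D pose from eight regressed cuboid corners using OpenCV's solvePnP; the cuboid extents and intrinsic matrix are placeholder values.

```python
import numpy as np
import cv2

# Placeholder cuboid extents (m) and pinhole intrinsics; real values depend on the object and camera.
W, H, D = 0.08, 0.08, 0.08
OBJECT_POINTS = np.array([[sx * W / 2, sy * H / 2, sz * D / 2]
                          for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                         dtype=np.float32)  # the 8 cuboid corners in the object frame
K = np.array([[615.0, 0.0, 320.0],
              [0.0, 615.0, 240.0],
              [0.0,   0.0,   1.0]], dtype=np.float32)

def cuboid_pnp(image_points):
    """image_points: (8, 2) pixel coordinates of the corners regressed by the network,
    ordered consistently with OBJECT_POINTS."""
    ok, rvec, tvec = cv2.solvePnP(OBJECT_POINTS,
                                  np.asarray(image_points, dtype=np.float32), K, None)
    if not ok:
        raise RuntimeError("PnP did not converge")
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix of the object frame in the camera frame
    return R, tvec.ravel()      # 6D pose: rotation + translation
```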
In this article, which extends the conference paper [23], we present an extensive experimental comparison between a pure deep learning method based on the widely used DenseFusion deep neural network [12] and the hybrid method originally proposed in [23], resulting from the combination of the RGB-only algorithm Deep Object Pose Estimation (DOPE) [10] with a geometric pose refinement algorithm based on the availability of the depth map. The same algorithm is also used in a more sophisticated manipulation task including non-prehensile manipulation actions in [24]. Moreover, whereas the conference paper considered apples only, here we compare the two pose estimation methods on two fruit types and add new experiments with the following objectives: highlighting how both the position and the orientation of the detected fruits are relevant to the grasping action, and quantifying the noise level and the sensitivity to lighting conditions of the two compared methods.
The presented comparison is carried out by testing the robotic fruit picking not only with the apples used in the former conference paper but also with other fresh fruits. To avoid damaging the fruits, the slipping-avoidance grasp control algorithm described in [2] is added to the overall robotic pipeline. Exploiting the 6D contact wrench measurements provided by a parallel gripper equipped with the SUNTouch force/tactile sensors [3], this control algorithm computes the minimum grasp force required to lift the object, whose weight is unknown.
Figure 1 shows the overall setup of the robotic picking solution.
The test results prove that both methods can address the problem of dimension variability, but with differing performance. Indeed, the experimental results of several picking tests on various types of fruits with similar appearance but different dimensions show that the robotic picking system achieves a higher grasp success rate with the proposed hybrid 6D pose estimation algorithm than with the state-of-the-art purely deep learning method. The paper concludes with a discussion of the accuracy of the compared methods and an in-depth analysis of their pros and cons.
2. Materials and Methods
The deep learning methods addressing the pose estimation problem can be divided into two groups: those that use only RGB images and those that add the depth map to that information. In general, RGB models are attractive thanks to their simplicity, since no fusion is required to merge multiple data sources and the sensor is a standard RGB camera, making these methods quite cheap (see [5,10,25]). On the other hand, they may suffer a loss in accuracy from the lack of the geometric information given by the depth channel, especially in the case of occlusions or high-intensity lighting conditions. Furthermore, the recent availability of relatively cheap depth sensors has increased the attention on RGB-D models [11,12,16,20], which, at the cost of a more complex architecture, are usually more accurate and less sensitive to occlusions and lighting conditions. To investigate the two methodologies, we choose one model for each category, namely DOPE [10] and DenseFusion [12]; both are selected for their effectiveness in solving the pose estimation problem and for having inspired several subsequent models such as [20,26].
To make DOPE able to correctly detect objects that differ in size from the model used during training, in [23] we realize a hybrid solution by combining DOPE with a geometric algorithm that exploits the depth map without fusing it with the RGB image, as done in DenseFusion.
2.1. DenseFusion
The DenseFusion network is an instance-level pose estimator, which directly regresses the pose, as a translation vector and a quaternion, of instances of known objects from RGB-D images. The method also requires the point cloud models during training but not at the inference stage. The architecture of the model (see Figure 2) can be divided into four main components in series: the first one performs instance segmentation on the input RGB image, isolating the pixels belonging to each instance of the object of interest in the scene; in the second component, the cropped RGB image and the masked point cloud are passed to the feature extraction module, which generates both pixel-wise and global features; the third component performs the pose prediction based on each of the input features; and the last component is an optional refinement module that improves the prediction coming from the previous one. The dense fusion processing is carried out during the feature extraction phase and consists of concatenating, for each pixel, the corresponding features coming from the CNN and PointNet modules. The global feature is computed by applying an average pooling stage to the output of a fully connected multilayer perceptron (MLP), which takes all the dense features as inputs. The pose predictor module processes one feature per segmented pixel and estimates a pose for each of them; since this yields one estimated pose per segmented pixel, a mechanism to select the final pose is needed. The network training has a twofold objective: the first is to minimize the mean distance between corresponding points in the actual object's point cloud and in the one transformed by the regressed pose; the second is to learn, for each prediction, a score that measures how confident the network is. The final pose is simply chosen as the one with the maximum score.
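The per-pixel fusion and the confidence-based selection of the final pose can be summarized by the following PyTorch-style sketch; the module names (mlp, pose_head, conf_head) are illustrative stand-ins and not the original DenseFusion API.

```python
import torch

def fuse_and_select(color_feat, geo_feat, mlp, pose_head, conf_head):
    """Sketch of DenseFusion-style dense fusion and confidence-based pose selection.
    color_feat, geo_feat: (N, C) per-pixel embeddings from the CNN and PointNet branches."""
    pixel_feat = torch.cat([color_feat, geo_feat], dim=1)            # dense per-pixel fusion
    global_feat = torch.mean(mlp(pixel_feat), dim=0, keepdim=True)   # average pooling -> global feature
    fused = torch.cat([pixel_feat,
                       global_feat.expand(pixel_feat.shape[0], -1)], dim=1)
    poses = pose_head(fused)        # (N, 7): translation + quaternion, one per segmented pixel
    conf = conf_head(fused)         # (N, 1): self-reported confidence per pixel
    best = torch.argmax(conf.squeeze(1))
    return poses[best]              # final pose = highest-confidence prediction
```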
It is worth noting that the network used for the real robotic grasp is trained on synthetic data only, in contrast to the original work in [12], which used classical benchmarks such as LineMOD [27] and YCB-Video [11] for both training and testing.
2.2. DOPE + Pose Refinement
The instance-level neural network DOPE works only on RGB images and assumes that the size of the object to be recognized is known at inference time. With reference to Figure 3, DOPE is characterized by a two-stage architecture that first regresses nine keypoints of the object in the image plane; these keypoints correspond to the eight vertices and the centroid of the 3D cuboid enclosing the object. The keypoints are estimated by a deep CNN followed by a greedy algorithm (GA) that associates the keypoints to a single object. The second stage, exploiting the knowledge of the camera intrinsic parameters and the known cuboid dimensions of the recognized object, uses the standard PnP algorithm [18] to retrieve the 6D pose of the object. This separation between regression and pose retrieval is an attractive feature of the architecture because the network learns only the object's appearance features and not the camera parameters; it follows that the network parameters do not depend on the camera used at inference time, unlike other end-to-end pose estimation algorithms such as DenseFusion or [11]. However, if a scaled version of the CAD model used for training were shown to the network, the estimated position would be affected by an error proportional to the scale difference between the CAD model and the real object. Indeed, estimating the position of a scaled version of the trained object from a single RGB image is an ill-posed problem, since we cannot establish whether we are looking at a smaller object nearer to the camera or at a bigger one farther from it. To overcome this limitation and successfully use DOPE to recognize and localize objects of different scales, as shown in the last part of the scheme in Figure 3, we exploit the camera depth measurements by employing the pose refinement algorithm presented in [23], summarized here for the reader's convenience. The algorithm consists of jointly translating and scaling the CAD model of the object (used by the network) in a virtual world without altering its representation in the image plane, so that the virtual depth map of the object and the real one are best overlapped.
Let $\mathbf{p}$ and $\mathbf{q}$ be the position and unit quaternion estimated by DOPE with respect to the camera frame, respectively. We can place the CAD model (used by the network) in a virtual world at the same pose with respect to a virtual camera and capture the corresponding virtual depth map. Using the geometric relations reported in (1) and (2) (more details can be found in [23]), for a given scalar value $\gamma$, we can translate the CAD model to the new position $\mathbf{p}(\gamma)$ and scale it to the updated cuboid dimensions $\mathbf{d}(\gamma)$ (starting from the original ones $\mathbf{d}$) such that the RGB image is unchanged but the depth map is different. Therefore, we can solve an optimization problem to compute the optimal $\gamma^\ast$ such that the depth map coming from the RGB-D sensor and the rendered one are best overlapped. Once $\gamma^\ast$ is computed, the refined position $\mathbf{p}^\ast$ and the refined cuboid dimensions $\mathbf{d}^\ast$ are retrieved by substituting this value into Equations (1) and (2).
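A minimal sketch of this refinement step is given below, under the simplifying assumption that scaling both the estimated position and the cuboid dimensions by the same factor $\gamma$ leaves the RGB projection unchanged (the role played by Equations (1) and (2)); render_depth is a hypothetical renderer returning the virtual depth map of the CAD model at a given position and scale.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def refine_scale(p_dope, dims_cad, depth_real, render_depth):
    """Compute the optimal scale factor by overlapping virtual and real depth maps."""
    p_dope = np.asarray(p_dope, dtype=float)
    dims_cad = np.asarray(dims_cad, dtype=float)

    def depth_error(gamma):
        p_new = gamma * p_dope           # translate along the viewing ray through the centroid
        dims_new = gamma * dims_cad      # scale the cuboid accordingly
        depth_virtual = render_depth(p_new, dims_new)
        mask = depth_virtual > 0         # compare only where the object is rendered
        return np.mean((depth_virtual[mask] - depth_real[mask]) ** 2)

    res = minimize_scalar(depth_error, bounds=(0.3, 3.0), method="bounded")
    gamma_opt = res.x
    return gamma_opt * p_dope, gamma_opt * dims_cad   # refined position and dimensions
```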
Figure 4 shows the pose estimation of an apple smaller than the CAD model used for training DOPE and the resulting pose after the refinement. The bigger apple (under the table) has the same dimensions as the CAD model and generates the same RGB image as the actual smaller apple (on the table). As expected, the position estimated by DOPE places the object under the table due to the size mismatch, but our refinement is able to correct not only its position but also its dimensions.
2.3. Robotic Fruit Picking
To successfully pick the fresh fruit, we adopt the robotic pipeline described in detail in [23], whose architecture is illustrated in Figure 5. The pipeline is composed of a vision component for the 6D pose estimation algorithm (one of the two methods being compared), a motion planning module, used to bring the robot gripper to a collision-free grasp pose, and a grasp force controller for parallel grippers, which exploits tactile measurements to grasp the delicate items while avoiding any deformation or damage. The motion planning module and the grasp force controller are briefly described in the next subsections. In Section 3, we report the experimental results of the robotic pipeline when the objects to pick are apples or limes.
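The flow of the pipeline can be sketched as follows; all interfaces (camera, pose_estimator, select_grasp_pose, planner, grasp_controller) are hypothetical wrappers around the modules described in this and the following subsections, not an existing API.

```python
def pick_and_place(camera, pose_estimator, select_grasp_pose,
                   planner, grasp_controller, place_pose):
    """High-level sketch of the pipeline in Figure 5 (hypothetical interfaces)."""
    rgb, depth = camera.capture()
    object_pose = pose_estimator.estimate(rgb, depth)    # DenseFusion or DOPE + refinement
    grasp_pose = select_grasp_pose(object_pose)          # grasping strategy of Section 2.6
    if not planner.plan_and_execute(grasp_pose):         # collision-free RRT* motion plan
        return False
    grasp_controller.grasp()                             # slipping-avoidance force control
    success = planner.plan_and_execute(place_pose)
    grasp_controller.release()
    return success
```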
2.4. Grasp Control
The target objects are fresh fruits that must be handled with care to avoid damage; thus, the grasping force should be the minimum that permits lifting without slippage. This challenging task is carried out by the model-based grasp force controller of [2], briefly described here for the reader's convenience. The control law enables two modalities, slipping avoidance and pivoting (see Figure 6). In both cases, the algorithm outputs a grasp force reference tracked by a low-level force controller. To grasp the fruits, we use only the slipping avoidance modality.
The grasp force $f_g$ is computed as the superposition of two contributions, a static one $f_s$ and a dynamic one $f_d$. The object motion is modeled as a planar slider, i.e., a rigid object that moves in a plane, whose motion can be described as a pure rotation about the instantaneous Center of Rotation (CoR) with angular velocity $\omega$. In particular, $f_s$ is the minimum force that avoids slippage in static conditions; it exploits the Limit Surface concept [28] and is based on the tangential and torsional loads measured by the fingertip tactile sensors. When the loads are time-varying (e.g., during lifting), the static contribution is not enough, since the maximum friction decreases as the rate of variation of the load increases [29]. This condition is modeled by the LuGre dynamic friction model [30]. The dynamic contribution $f_d$ is based on a slippage velocity observer exploiting a dynamic model that combines the LuGre dynamics with the Limit Surface concept; $f_d$ is computed by applying a linear control law that drives the estimated slippage velocity to zero. The interested reader can find more details on the control design and its stability properties in [30].
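A minimal sketch of the resulting force reference is given below, assuming the static term and the slip velocity estimate are provided by the Limit-Surface computation and the LuGre-based observer described above; the gain and the saturation limits are illustrative parameters.

```python
def grasp_force_reference(f_static, slip_velocity_est, k_d, f_min=0.5, f_max=8.0):
    """Sketch of the two-term grasp force law: static + dynamic contribution.
    f_static: minimum non-slip force from the Limit-Surface-based computation.
    slip_velocity_est: output of the LuGre-based slippage velocity observer.
    k_d: illustrative gain of the linear law driving the estimated slip to zero."""
    f_dynamic = k_d * slip_velocity_est            # dynamic contribution
    f_ref = f_static + f_dynamic                   # superposition of the two contributions
    return min(max(f_ref, f_min), f_max)           # keep within the sensor calibration range
```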
2.5. Motion Planning
To improve the safety of the overall pick-and-place task and avoid collisions with obstacles in the working scene, we adopt the RRT* planning algorithm, provided by the open-source motion planning library OMPL [31] and directly integrated into the MoveIt! [32] platform. Given the pick-and-place poses, we plan from the pre-grasp pose to the post-place one by appropriately alternating plans in joint and Cartesian space. The overall task is executed by the robot only if the motion planner successfully finds the whole sequence of required trajectories. Finally, to make the robot aware of the surrounding obstacles, we exploit the point cloud from the depth camera to reconstruct the corresponding OctoMap [33].
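The planning step can be sketched with the MoveIt! Python interface as follows; the group name, the planner id, and the return signature of plan() depend on the specific MoveIt! version and robot configuration, so this is a sketch under those assumptions rather than the exact pipeline code.

```python
import moveit_commander

def plan_and_execute_pick(pre_grasp_pose, grasp_pose, group_name="manipulator"):
    """Sketch: joint-space plan to the pre-grasp pose, then a Cartesian approach segment."""
    moveit_commander.roscpp_initialize([])
    arm = moveit_commander.MoveGroupCommander(group_name)
    arm.set_planner_id("RRTstarkConfigDefault")   # OMPL RRT*; id depends on ompl_planning.yaml
    arm.set_planning_time(5.0)

    # Joint-space plan to the pre-grasp pose (plan() returns a tuple in recent MoveIt versions).
    arm.set_pose_target(pre_grasp_pose)
    success, plan, _, _ = arm.plan()
    if not success:
        return False
    arm.execute(plan, wait=True)

    # Cartesian segment from pre-grasp to grasp pose.
    path, fraction = arm.compute_cartesian_path([grasp_pose], 0.01, 0.0)
    if fraction < 0.99:
        return False
    arm.execute(path, wait=True)
    return True
```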
2.6. Grasping Strategy
Given the estimated pose of the target object, a suitable grasp pose has to be defined depending not only on the object's shape but also on the obstacles in the scene. Therefore, to ensure a successful grasp, we adopt the grasp strategy described in [23], which is well suited for the fruits under consideration, i.e., apples and limes, whose shape is practically spherical. Referring to Figure 7, the strategy consists of sampling a sphere of radius $r$ centered at the estimated position of the target object $\mathbf{p}^\ast$. In particular, the candidate positions are computed by varying the rotation angle $\alpha$ about the $z$ axis of the object frame (assumed aligned with the axis of symmetry) and the angle $\beta$ about the gripper sliding vector, while the orientations arise from the quaternion product $\mathbf{q}_a \otimes \mathbf{q}_z(\alpha) \otimes \mathbf{q}_y(\beta)$, where $\mathbf{q}_a$ is used to align the approach vector of the end-effector frame with the $z$ axis of the world frame, and $\mathbf{q}_z(\alpha)$ and $\mathbf{q}_y(\beta)$ represent the elementary rotations about the $z$ and $y$ axes by the corresponding angles, respectively.
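The sampling strategy can be sketched as follows; for simplicity, the alignment quaternion is taken as the identity and the angle ranges and counts are illustrative choices, not the values used in the experiments.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def sample_grasp_candidates(p_obj, radius, n_z=8, n_y=5, max_tilt=np.pi / 4):
    """Candidate gripper poses on a sphere of given radius around the object position."""
    q_align = R.identity()          # stands in for the world-frame alignment quaternion
    candidates = []
    for alpha in np.linspace(0.0, 2 * np.pi, n_z, endpoint=False):   # rotation about object z
        for beta in np.linspace(0.0, max_tilt, n_y):                 # rotation about sliding axis
            rot = q_align * R.from_euler("z", alpha) * R.from_euler("y", beta)
            approach = rot.apply([0.0, 0.0, 1.0])                    # gripper approach direction
            position = np.asarray(p_obj) - radius * approach         # stand off along the approach
            candidates.append((position, rot.as_quat()))             # (xyz, quaternion xyzw)
    return candidates
```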
3. Results
The following experimental tests aim to evaluate whether, and with what accuracy, the two selected algorithms estimate the 6D pose of apples and limes of various sizes, different from those of the CAD model used during the training phase. The pick-and-place tasks are performed using the seven-axis robot Yaskawa Motoman SIA5F, equipped with the RGB-D Intel Realsense D435i camera mounted on a WSG32 gripper. The latter is sensorized with the SUNTouch force/tactile fingers [3], which provide a tactile map and are calibrated to estimate the full contact wrench at 500 Hz between 0.5 and 8 N with an accuracy of 0.2 N. The gripper accepts finger velocity commands updated at 50 Hz to control the grasp force. The CAD model of the apple used to train DOPE and DenseFusion is enclosed in a cuboid whose dimensions (in meters) are defined according to the orientation of the CAD frame, one of which represents the apple height; likewise, the CAD model of the lime is enclosed in a cuboid, one dimension of which represents the lime height.
3.1. Dataset Generation and Training of the Models
Both models are trained on synthetic datasets using the CAD models described above. To ensure a suitable variability among the images, the datasets of both limes and apples are generated by randomizing several elements, such as the background, the relative orientation of the camera, the light intensity, the type and number of distractors, and the number of instances of the object of interest. This randomization is inspired by works such as [10,34,35,36], which show the effectiveness of using simulated data to train deep learning algorithms for real-world tasks such as grasping, drone flight, and object recognition. The same works also suggest that, through domain randomization and/or adaptation, the generalization capability of the network can be improved: if the training set is built with a large number of synthetic images that represent the same objects in very different configurations, or if a few real images are injected into the synthetic dataset, the high variability of the data forces the network to learn only the most relevant features of the training images.
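The per-frame randomization can be summarized by the following illustrative sketch, which only samples the randomized parameters and does not reproduce the actual NViSII calls or the values used to generate our datasets.

```python
import random

def sample_scene_config():
    """Sample one random scene configuration for a synthetic training frame."""
    return {
        "background": random.choice(["wood", "fabric", "metal", "hdri_room"]),  # placeholder names
        "light_intensity": random.uniform(0.3, 3.0),
        "camera_orientation_rpy": [random.uniform(-0.5, 0.5) for _ in range(3)],
        "num_distractors": random.randint(0, 10),
        "distractor_types": random.sample(["cube", "sphere", "cylinder", "cone"],
                                          k=random.randint(1, 4)),
        "num_target_instances": random.randint(1, 8),
    }
```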
The software used to generate the data is NViSII [37], which also provides the APIs to store the annotations in JSON format. Regarding the apples, the generated dataset contains 120 k frames, 30% of which are used as the validation set and the remaining part as the training set. The training procedure of DenseFusion takes 51 epochs with a batch size of 32 and a refine margin (the distance margin below which the refinement training starts) of 9 mm, while DOPE is trained for 120 epochs with a batch size of 116. For limes, the dataset contains about 5 k frames, 20% of which are used for validation and the remaining for training. DenseFusion is trained for 277 epochs with a batch size of 32, while DOPE requires 232 epochs with a batch size of 116. Note that, even if the lime dataset is much smaller than the apple one, each frame of the former contains a higher number of object instances than the latter, as the sample frames in Figure 8 demonstrate.
Training and testing are both implemented on a workstation equipped with an NVIDIA Titan V GV100 GPU and an Intel Core i9-7980XE processor at 2.60 GHz, running Ubuntu 20.04 LTS. The two networks are trained for different numbers of epochs because, to carry out a fair comparison, we allocate the same training time to both and assess which one performs better within that budget. The allotted training time is 72 h, which is enough to reach a satisfactory convergence of the loss function. The inference rates of DenseFusion and DOPE are about 20 Hz and 15 Hz, respectively.
3.2. Comparison of the Two Methods
The comparison of the two methods is carried out in three experimental tests by employing the robotic pipeline described in the previous section. In the first test, the quantitative results are provided in terms of two metrics.
The first one is aimed at evaluating the pose accuracy. All the estimated poses are expressed in a reference frame with the origin on the table surface and the $z$ axis pointing upward; to evaluate the pose accuracy along the $z$ direction, we define the ground truth as half of each fruit height $h$, i.e., $h/2$. Therefore, we define the centroid error metric as the difference between $h/2$ and the $z$ coordinate of the estimated frame. The second metric is the grasp success rate, defined as the ratio between the number of successful fruit grasps and the total number of attempts. A grasp is successful if the fruit is picked and lifted without dropping or damaging it; damage is evaluated by means of visual inspection. For this experimental test, we select five apples (see Figure 9) and two limes (see Figure 10), all of which differ considerably in dimensions from the CAD models used in the model training.
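For clarity, the two metrics reduce to the following elementary computations (symbols as defined above).

```python
def centroid_error(z_estimated, fruit_height):
    """Centroid error along z: ground truth is half the fruit height (table frame)."""
    return fruit_height / 2.0 - z_estimated

def grasp_success_rate(successes, attempts):
    """Ratio of successful grasps (lifted without dropping or visible damage)."""
    return successes / attempts
```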
Regarding the centroid error, we evaluate the results obtained with each apple and each lime in location 3, indicated in Figure 9 and Figure 10, respectively.
Table 1 shows a first rough comparison of the two algorithms in terms of the centroid error metric. Regarding the apples, the accuracy of the estimated position along the $z$ direction seems to be unaffected by the dimension variability for both methods; therefore, both might provide a position accurate enough to allow a successful grasp. Regarding limes, instead, unlike DOPE + Pose Refinement (PR), DenseFusion does not provide a position that allows us to grasp the fruits, since the resulting $z$ component of the estimated position is mostly above the centroid of the object.
The results of the comparison in terms of grasp success rates are computed over three repetitions of each grasp attempt for each apple and each lime, approximately placed with different orientations in the five locations shown in Figure 9 and Figure 10, respectively, on the table the robot is fixed to, for a total of 105 grasp attempts for each pose estimation algorithm. Table 2 shows the corresponding results obtained with DenseFusion and DOPE + PR. These more extensive experiments show that DOPE + Pose Refinement significantly outperforms DenseFusion (which uses its own refinement method too): for the apples, we obtain a 68% overall grasp success rate with DenseFusion and an 88% grasp success rate with DOPE + PR; for the limes, we obtain a 40% overall grasp success rate with DenseFusion and a 100% grasp success rate with DOPE + Pose Refinement. The lower success rates achieved by DenseFusion for limes compared with apples could be explained by the small dimensions of the selected limes (even smaller than Apple 5), which require a higher accuracy of the estimated pose to be picked correctly.
The second experimental test is devoted to assessing the sensitivity of the two pose estimation methods to lighting conditions by evaluating the grasp success rate metric. To this aim, additional fruits are selected: a big and a small apple and a big and a small lime. Figure 11 shows the four fruits and the four locations selected for the grasp attempts, which are repeated three times by approximately placing each fruit in each location with a random orientation, for a total of 48 grasp attempts for each pose estimation algorithm. Note that the two locations on the taller box receive brighter illumination than those on the lower box, where the fruits are in shadow. Table 3 shows the grasp success rates obtained in the second test. The two pose estimation methods show a different sensitivity to the lighting conditions: the performance of both algorithms deteriorates for the overshadowed fruits, but DenseFusion is more affected by non-optimal lighting.
A third experimental test is performed with the specific objective of comparing the two methods in the estimation of the orientation of fresh fruits. An apple is placed in a fruit punnet in the three different orientations shown in Figure 12. This arrangement is chosen for two reasons: first, it keeps the apple in a stable position with different orientations; second, it is representative of a common situation in practice. Again, the evaluation is carried out indirectly through the grasp success rate computed over three repetitions of each orientation. We perform the experiment with the apple only because the limes are too small to be grasped without collisions in a cluttered fruit punnet. Figure 13 clearly shows that the gripper orientations planned to grasp the apple are significantly different for each of the three apple orientations, demonstrating that the estimation of the full 6D pose is relevant to the grasp. The results reported in Table 4 confirm that DOPE + PR generally outperforms DenseFusion.
By observing the failure cases in more detail, we can argue that the failures of the DenseFusion method are mainly due to the low accuracy of the pose estimation, as can be appreciated in the attached video at minutes 1:12–1:43 and 8:36–8:45, where, even though MoveIt! successfully plans the grasp of the fruit, the task fails due to a position error in the horizontal plane or to an orientation error, respectively. Instead, the DOPE + PR failures are mainly due to the inaccurate estimation of the object scale, since they all occur during the planning phase because of collisions of the fingers with the support table. The video also shows a pick-and-place experiment with some limes and apples in a crate, where the robot has to place them in separate bowls. With reference to this experiment, Figure 14 (left) and Figure 14 (right) show the trend of the grasp force computed by the grasp controller and applied by the sensorized fingers to an apple and a lime, respectively. As expected, lifting the apple requires a higher force because the lime is lighter than the apple.
A final evaluation is performed to assess the noise level of the pose estimated by the two compared algorithms. An apple and a lime are placed first on the left box (good lighting) and then on the right shelf (bad lighting) in Figure 11. The standard deviations of the estimated pose components are reported in Table 5. It is quite evident that the lighting conditions affect the noise level of both algorithms, but the pose components estimated by DenseFusion are generally noisier.
4. Discussion
Besides the differences in terms of accuracy of the estimated pose and grasp success rate, the compared methods also differ in terms of the resources needed and complexity. In particular, we find that DenseFusion needs more training time to make estimations accurate enough to be used in a real grasp experiment.
This is due to two main reasons:
DenseFusion is a modular architecture, which rigidly separates the segmentation pipeline from the pose estimation one. If, on the one hand, this enhances the scalability and interpretability of the entire model, on the other hand it implies that two different (and usually large) networks must be trained: one providing the segmentation masks and one implementing the pose estimation. This can significantly increase the training time if the two networks cannot be trained in parallel, and it makes the joint optimization of the whole method more difficult.
Since the refiner is a deep network too, it must be trained in addition to the main pipeline, with the further constraint that its training cannot start until the main network has converged, as otherwise it would not learn useful information; i.e., the refiner and the main network cannot be trained in parallel, which increases the total training time.
The modular nature of DenseFusion also requires deployment on more powerful (and expensive) hardware, since inference needs more memory: two models have to be loaded into the VRAM at the same time. Of course, this could be mitigated by using simpler segmentation networks, but the accuracy of the masks greatly affects the overall performance of the estimator, so a careful trade-off must be found.
Another feature of the DenseFusion architecture is that it is quite noisy, in the sense that the estimated pose has a considerable degree of variability. This fact could be explained by the choice of regressing the pose directly from the input images, which means that the network has to learn a strongly nonlinear function mapping 2D information (in the image space) to the 3D space (position and orientation); therefore, especially for the orientation, even very slight changes in the input can produce very different predictions. DOPE, on the other hand, applies a keypoint-based regression, mapping 2D information (the RGB image) to 2D keypoints (the corners and the centroid), so the function to be learned is simpler and the pose is less noisy.
Another consequence of performing a direct regression on the images is that the network might suffer a considerable drop in performance if the camera parameters used for training are not exactly the same as those used during inference: making 6D predictions directly on the images causes the network, during training, to specialize too much on the particular characteristics of the images coming from a specific camera, indirectly learning also the intrinsics, distortions, and all the other parameters tightly linked to the camera used to construct the dataset. On the other hand, DOPE (and all methods using keypoint-based regression) does not suffer from this, thanks to the following features:
It usually has an initial normalization and resizing stage that makes all objects look identical, independent of the camera used;
The network outputs keypoints in the rectified space only;
An algorithm such as PnP is used to transform these keypoints to 3D and obtain the final pose.
In this framework, the camera parameters are used only in the first stage to rectify the image and in the final stage to obtain the pose; the network itself is trained and makes predictions on normalized images. The main advantage is that, once trained on a dataset, the algorithm is very resilient to changes in the camera parameters, to the point that a different camera could in principle be used for inference, whereas a network performing a direct regression would have to be retrained if the two cameras had very different parameters. Finally, only one disadvantage of the proposed post-refinement algorithm can be highlighted in comparison with DenseFusion: the post-refinement stage used to enhance DOPE in terms of dimension variability requires more time than DenseFusion to retrieve the estimated pose, owing to the generation of the virtual depth map and its comparison with the real one at each iteration of the optimization problem. The optimization time might be reduced depending on the solver parameters and the required accuracy. Nevertheless, for this specific use case, in which pose estimation is performed only at planning time, the inference time is less important than the accuracy of the estimated pose.