Article

From Single 2D Depth Image to Gripper 6D Pose Estimation: A Fast and Robust Algorithm for Grabbing Objects in Cluttered Scenes †

Electrical and Computer Engineering Department, University of Central Florida (UCF), Orlando, FL 32816, USA
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in Jabalameli, A.; Ettehadi, N.; Behal, A. Near Real-Time Robotic Grasping of Novel Objects in Cluttered Scenes. In Proceedings of the SAI Computer Vision Conference (CVC), Las Vegas, NV, USA, 25–26 April 2019.
Robotics 2019, 8(3), 63; https://doi.org/10.3390/robotics8030063
Submission received: 18 May 2019 / Revised: 19 July 2019 / Accepted: 24 July 2019 / Published: 30 July 2019

Abstract:
In this paper, we investigate the problem of grasping previously unseen objects in unstructured environments which are cluttered with multiple objects. Object geometry, reachability, and force-closure analysis are considered to address this problem. A framework is proposed for grasping unknown objects by localizing contact regions on the contours formed by a set of depth edges generated from a single-view 2D depth image. Specifically, contact regions are determined based on edge geometric features derived from analysis of the depth map data. Finally, the performance of the approach is successfully validated by applying it to scenes with both single and multiple objects, in both simulation and experiments. Using sequential processing in MATLAB running on a 4th-generation Intel Core Desktop, simulation results with the benchmark Object Segmentation Database show that the algorithm takes 281 ms on average to generate the 6D robot pose needed to attach with a pair of viable grasping edges that satisfy reachability and force-closure conditions. Experimental results in the Assistive Robotics Laboratory at UCF using a Kinect One sensor and a Baxter manipulator outfitted with a standard parallel gripper showcase the feasibility of the approach in grasping previously unseen objects from uncontrived multi-object settings.

1. Introduction

A crucial problem in robotics is interacting with known or novel objects in unstructured environments. Among several emerging applications, assistive robotic manipulators seek approaches to assist users to perform a desired object motion in a partial or fully autonomous system. While the convergence of a multitude of research advances is required to address this problem, our goal is to describe a method that employs the robot’s visual perception to identify and execute an appropriate grasp for immobilizing novel objects with respect to a robotic manipulator’s end-effector.
Finding a grasp configuration relevant to a specific task has been an active topic in robotics for the past three decades. In a recent article by Bohg et al. [1], grasp synthesis algorithms are categorized into two main groups, viz., analytical and data-driven. Analytical approaches search for solutions through kinematic and dynamic formulations [2]. Object and/or robotic hand models are used in [3,4,5,6,7] to develop grasping criteria such as force-closure, stability, and dexterity, and to evaluate whether a grasp satisfies them. The difficulty of modeling a task, high computational cost, and the assumption that geometric or physical models of the robot are available are the main challenges analytical approaches face in real-world experiments. Furthermore, researchers conducting experiments have found that classical metrics are not sufficient to tackle grasping problems in real-world scenarios despite their efficiency in simulation environments [8,9].
On the other hand, data-driven methods retrieve grasps according to their prior knowledge of the target object, human experience, or information obtained from acquired data. In line with this definition, Bohg et al. [1] classified data-driven approaches based on whether the encountered object is considered known, familiar, or unknown to the method. Thus, the main issues relate to how the query object is recognized and then compared with or evaluated by the algorithm's existing knowledge. As an example, References [10,11,12] assume that all objects can be modeled by a set of shape primitives such as boxes, cylinders, and cones. During the off-line phase, they assign a desired grasp to each shape, while during the on-line phase, these approaches only need to match sensed objects to one of the shape primitives and pick the corresponding grasp. In [13], a probabilistic framework is exploited to estimate the pose of a known object in an unknown scene. Ciocarlie et al. [14] introduced the human operator into the grasp control loop to define a hand-posture subspace known as eigen-grasps. This method finds an appropriate grasp for a given task using the obtained low-dimensional subspace. A group of methods considers the encountered object as a familiar object and employs 2D and/or 3D object features to measure similarities in shape or texture properties [1]. In [15], a logistic regression model is trained on labeled data sets and then grasp points for the query object are detected based on the feature vector extracted from a 2D image. The authors in [16] present a model that maps the grasp pose to a success probability; the robot learns the probabilistic model through a set of grasp and drop actions.
The last group of data-driven methods introduces and examines features and heuristics that directly map the acquired data to a set of candidate grasps [1]. They assume that the sensory data provide either full or partial information about the scene. The authors of [17] cluster the point cloud into a background and an object, then compute a grasp based on the principal axis of the object. The authors in [18] propose an approach that takes a 3D point cloud and hand geometric parameters as input, then searches for grasp configurations within a lower-dimensional space satisfying defined geometric necessary conditions. Jain et al. [19] analyze the surface of every observed point cloud cluster and automatically fit spherical, cylindrical, or box-like shape primitives to them. The method uses a predefined strategy to grasp each shape primitive. The algorithm in [20] builds a virtual elastic surface by moving the camera around the object while computing the grasp configuration in an iterative process. Similarly, [21] approaches the grasping problem through surface-based exploration. Another approach to the grasp planning problem is to use object segmentation algorithms to find surface patches [21,22].
In general, the assumed knowledge level of the object, access to partial or full shape information of the objects in a scene, and the type of features employed are the main aspects that characterize data-driven methods. One of the main challenges that most data-driven grasping approaches face is uncertainty in the measured data, which causes failures in real-world experiments. Thus, increasing the robustness of a grasp against uncertainties appearing in the sensed data, or during the execution phase, is the aim of a group of approaches. While [23] uses tactile feedback to adjust for deviation of the object position from the initial expectation, [24] employs visual servoing techniques to facilitate grasp execution. Another challenge for data-driven approaches is data preparation, specifically background elimination. This forces some methods to make simplifying assumptions about the object's situation, e.g., [17] is only validated for objects standing on a planar surface. Finding a feasible grasp configuration subject to the given task and user constraints is required for a group of applications. As discussed in [25,26,27], suggesting desired grasp configurations in assistive human-robot interaction increases user engagement and eases manipulator trajectory adaptation.
In this paper, we introduce an approach to obtain stable grasps using partial depth information for an object of interest. The expected outcome is an executable end-effector 6D pose to grasp the object in an occluded scene. We propose a framework based on the supporting principle that potential contacting regions for a stable grasp can be found by searching for (i) sharp discontinuities and (ii) regions of locally maximal principal curvature in the depth map. In addition to empirical evidence, we support this principle using the concept of wrench convexes. The framework consists of two phases: first, we localize candidate regions, and then we evaluate their local geometric features against the desired grasp features. The key point is that no prior knowledge of objects is used in the grasp planning process; nevertheless, the obtained results show that the approach successfully deals with objects of different shapes and sizes. We believe the proposed work is novel and interesting because describing the visible portion of objects by the aforementioned depth-map edges allows grasp set-points to be extracted with image processing methods focused on small 2D image regions, rather than by clustering and analyzing huge sets of 3D point-cloud coordinates. In fact, this approach completely sidesteps object reconstruction. These features result in low computational cost and make it possible to run the algorithm in near real-time. We also see this approach as a useful way to obtain a grasp configuration according to the given task and user constraints, as opposed to most data-driven methods that address this problem by locating the objects and lifting them from the top. Finally, the reliance on only a single-view depth map without the use of color information distinguishes our approach from other candidates in relevant applications. The biggest challenge in this work arises from pixel-level uncertainties; relying on larger pixel sets of interest has helped make the process more robust.
This paper is organized as follows. Grasping preliminaries are presented in Section 2. The grasp problem is stated in Section 3. The proposed approach is presented in Section 4. Specifically, in Section 4.1, we define the object model in the 2D image according to geometry and then introduce the employed grasp model in Section 4.2. Next, in Section 4.3, we propose an approach to find reliable contact regions for the force closure grasp on the targeted object. Details of algorithm implementation for a parallel gripper are provided in Section 5. In Section 6, we validate our proposed approach by considering different scenarios for grasping objects, using a Kinect One sensor and a Baxter robot as a 7-DOF arm manipulator followed by a discussion of the obtained results. Section 7 concludes the paper.

2. Preliminaries

Choosing a stable grasp is one of the key components of a given object manipulation task. According to the terminology adopted from [6], a stable grasp is defined as a grasp having force closure on the object. Force-closure requires the grasp to be disturbance resistant, meaning that any possible motion of the object is resisted by the contact forces [28]. Thus, determining the possible range of force directions and contact locations for the robotic fingers is an important part of grasp planning [6]. By considering force-closure as a necessary condition, reference [3] discussed the problem of synthesizing planar grasps. In a planar grasp, all applied forces lie in the plane of the object, and the shape of the object is the only input to the process. Any contact between fingertips and the object can be described as a convex sum of three primitive contacts.
Definition 1.
A wrench convex represents the range of force directions that can be exerted on the object and is determined depending on the contact type and the existing friction coefficient.
Figure 1 shows the primitive contacts and their wrench convexes in 2D. Wrench convexes are illustrated by two arrows forming an angular sector. In a frictionless point contact, the finger can only apply force in the direction of the normal. However, through a point contact with friction, the finger can apply any force pointing into the wrench convex. A soft finger contact can exert pure torques in addition to pure forces inside the wrench convex.
Remark 1.
Any force distribution along an edge contact can be cast to a unique force at some point inside the segment. This force is described by the positive combination of the two wrench convexes at the two ends of the contact edge. It is also common in this context to refer to the wrench convex as a friction cone. For a 2D object, resisting translational and rotational motions simplifies force-closure to maintaining force-direction closure and torque-closure [3].
Theorem 1.
(Nguyen I) A set of planar wrenches $W$ can generate force in any direction if and only if there exists a set of three wrenches $(w_1, w_2, w_3)$ whose respective force directions $f_1, f_2, f_3$ satisfy: (i) two of the three directions $f_1, f_2, f_3$ are independent; (ii) a strictly positive combination of the three directions is zero: $\sum_{i=1}^{3} \alpha_i f_i = 0$.
Theorem 2.
(Nguyen II) A set of planar forces $W$ can generate clockwise and counterclockwise torques if and only if there exists a set of four forces $(w_1, w_2, w_3, w_4)$ such that three of the four forces have lines of action that do not intersect at a common point or at infinity. Let $p_{12}$ (resp. $p_{34}$) be the point where the lines of action of $w_1$ and $w_2$ (resp. $w_3$ and $w_4$) intersect; there exist positive values $\alpha_i$ such that $\overrightarrow{p_{34} p_{12}} = \pm(\alpha_1 f_1 + \alpha_2 f_2) = \mp(\alpha_3 f_3 + \alpha_4 f_4)$.
Basically, force-direction closure checks whether the contact forces (friction cones) span all directions in the plane. Torque-closure tests whether the combination of all applied forces can produce pure torques. According to Theorems 1 and 2, the existence of four wrenches with three of them independent is necessary for a force-closure grasp in the plane. Assuming the contacts are frictional, each point contact provides two wrenches. Thus, a planar force-closure grasp is possible with at least two contacts with friction. As stated in [3,4], the conditions for forming a planar force-closure grasp with two and three points are interpreted geometrically as follows and illustrated in Figure 2:
  • Two opposing fingers: A grasp by two point contacts, $p_1$ and $p_2$, with friction is in force-closure if and only if the segment $\overline{p_1 p_2}$ points out of and into the two friction cones at $p_1$ and $p_2$, respectively. Mathematically speaking, assuming $\varphi_1$ and $\varphi_2$ are the angular sectors of the friction cones at $p_1$ and $p_2$, $\arg(\overline{p_1 p_2}) \in \pm(\varphi_1 \cap \varphi_2)$ is the necessary and sufficient condition for two point contacts with friction (a numerical sketch of this check is given after this list).
  • Triangular grasp: A grasp by three point contacts, $p_1$, $p_2$, and $p_3$, with friction is in force-closure if there exists a point $p_f$ (force focus point) such that for each $p_i$ the segment $\overline{p_f p_i}$ points out of the friction cone of the $i$-th contact. Let $k_i$ be the unit vector of segment $\overline{p_f p_i}$ pointing out of the edge; then a strictly positive combination of the three directions must be zero: $\sum_{i=1}^{3} \alpha_i k_i = 0$.
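For illustration, the following Python sketch checks the two-opposing-fingers condition numerically, assuming frictional point contacts with known inward-pointing normals and a known friction angle (function and variable names are illustrative; the paper's own implementation is in MATLAB):

```python
import numpy as np

def two_finger_force_closure(p1, n1, p2, n2, friction_angle_deg):
    """Check Nguyen's two-opposing-fingers condition for a planar grasp.

    p1, p2 : 2D contact points (arrays of shape (2,))
    n1, n2 : inward-pointing unit contact normals at p1 and p2
    The grasp is force-closure if the segment p1->p2 lies inside the
    friction cone at p1 and the segment p2->p1 lies inside the cone at p2.
    """
    alpha = np.deg2rad(friction_angle_deg)
    d12 = p2 - p1
    d12 = d12 / np.linalg.norm(d12)
    # angle between the connecting segment and each contact normal
    ang1 = np.arccos(np.clip(np.dot(d12, n1), -1.0, 1.0))
    ang2 = np.arccos(np.clip(np.dot(-d12, n2), -1.0, 1.0))
    return ang1 <= alpha and ang2 <= alpha

# Example: two parallel opposing contacts 4 cm apart, 15-degree friction cones
p1, n1 = np.array([0.00, 0.0]), np.array([1.0, 0.0])   # normal points toward p2
p2, n2 = np.array([0.04, 0.0]), np.array([-1.0, 0.0])  # normal points toward p1
print(two_finger_force_closure(p1, n1, p2, n2, 15.0))   # True
```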
An appropriate object representation, together with analysis of the object's shape based on the available geometric information, is needed to find contact regions for a stable grasp. In Section 4, we relate the planar object representation to a proper grasp configuration to obtain an object's possible grasps.

3. Problem Definition

The problem addressed in this paper is to find contacting regions for grasping unknown objects in a cluttered scene. The obtained grasp needs to exhibit force-closure, be reachable, and be feasible under the specifications of a given end-effector. Partial depth information about the object, sensed by an RGBD camera, is the only input to this process, and the proposed approach assumes that the manipulated objects are rigid and non-deformable. In practice, we do not use objects with transparent or reflective surfaces since they cannot be sensed by the employed sensor technology. We note that we do not address the problem of finger placement for a generalized n-fingered gripper interacting with an arbitrary object. Instead, we focus on identifying reachable points on the surfaces of novel objects in multi-object scenes; these can then be tested for force-closure using existing results in the literature. We only provide an explicit algorithm for locating the fingers of a parallel gripper on previously unseen objects of arbitrary shapes and sizes.

4. Approach

In this section, we first present an object representation and investigate its geometric features based on the scene depth map; then a grasp model for the end-effector is provided. Next, pursuant to this development, we draw a relationship between an object's depth edges and force-closure conditions. Finally, we specify the contact locations and the end-effector pose to grasp the target object.

4.1. 2D Object Representation

Generally, 3D scanning approaches require multiple-view scans to construct complete object models. In this work, we restrict our framework to the use of partial information captured from a single view and represent objects in a 2-dimensional space. As previously stated, our main premise is that potential contacting regions for a stable grasp can be found by looking for (i) sharp discontinuities or (ii) regions of locally maximal principal curvature in the depth map. A depth image can be represented by a 2D array of values described by an operator $d(\cdot)$:
$z = d(r, c), \qquad d(\cdot): \mathbb{R}^2 \rightarrow \mathbb{R}$
where $z$ denotes the depth value (distance to the camera) of a pixel positioned at coordinates $(r, c)$ in the depth image $I_d$. Mathematically speaking, our principle suggests searching for regions with a high gradient in the depth values or in the depth direction values. The gradient image, gradient magnitude image, and gradient direction image are defined as follows:
Depth image: $I_d = [\, d(r_i, c_i) \,]$
Image gradient: $\nabla I = \left( \dfrac{\partial I_d}{\partial x}, \dfrac{\partial I_d}{\partial y} \right)^{T}$
Gradient magnitude image: $I_M = \left[ \sqrt{ \left(\dfrac{\partial I_d}{\partial x}\right)^{2} + \left(\dfrac{\partial I_d}{\partial y}\right)^{2} } \,\right]$
Gradient direction image: $I_\theta = \left[ \tan^{-1}\!\left( \dfrac{\partial I_d}{\partial y} \Big/ \dfrac{\partial I_d}{\partial x} \right) \right]$
where each pixel of the gradient magnitude image describes the change in depth values in the horizontal and vertical directions, and each pixel of the gradient direction image indicates the direction of the largest increase in depth value. In Figure 3, color maps of the depth image and the gradient direction image are provided. A sharp change in color indicates a discontinuity in the intensity values. Regions with these features locally divide the image area into two sides and cause edges to appear. In our proposed terminology, a depth edge is defined as a 2-dimensional collection of points in the image plane which forms a simple curve and satisfies the high-gradient property.
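For illustration, the definitions above can be computed with a few lines of NumPy; this is a minimal sketch (the paper's implementation is in MATLAB, and the use of finite differences via np.gradient is an assumption):

```python
import numpy as np

def gradient_images(I_d):
    """Compute gradient magnitude and direction images from a depth map.

    I_d : 2D array of depth values (meters), one value per pixel.
    Returns (I_M, I_theta): per-pixel gradient magnitude and direction.
    """
    dIdy, dIdx = np.gradient(I_d.astype(np.float64))  # row (y) and column (x) derivatives
    I_M = np.sqrt(dIdx ** 2 + dIdy ** 2)               # magnitude of the depth change
    I_theta = np.arctan2(dIdy, dIdx)                   # direction of steepest depth increase
    return I_M, I_theta

# Toy example: a step in depth produces a ridge of high gradient magnitude
depth = np.ones((5, 8))
depth[:, 4:] = 2.0
I_M, I_theta = gradient_images(depth)
```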
Definition 2.
A point set $p = (x, y)$ in the plane is called a curve or an arc if $x = x(t)$ and $y = y(t)$ for $a \le t \le b$, where $x(t)$ and $y(t)$ are continuous functions of $t$. The points $p(a)$ and $p(b)$ are said to be the initial and terminal points of the curve. A simple curve never crosses itself, except possibly at its endpoints. A closed contour is defined as a piecewise simple curve in which $p(t)$ is continuous and $p(a) = p(b)$. According to the Jordan curve theorem [29], a closed contour divides the plane into two sets, interior and exterior. Therefore, we define a surface segment as a 2D region in the image plane which is bounded by a closed contour.
To expound on the kinds of depth edges and what they offer to the grasping problem, we investigate their properties in the depth map. All depth edges fall into two main groups: (1) Depth Discontinuity (DD) edges and (2) Curvature Discontinuity (CD) edges. A DD edge is created by a high gradient in depth values, i.e., a significant depth difference between its two sides in the 2D depth map ($I_d$). It indicates free space between the surface it belongs to and its surroundings along the edge. A CD edge emerges from a directional change of the depth values ($I_\theta$), even though the depth values on its sides change continuously. Please note that the directional change of depth values is equivalent to surface orientation in 3D. In fact, a CD edge marks the intersection of surfaces with different orientation characteristics in 3D. CD edges are further divided into two subtypes, namely concave and convex. A CD edge is called convex if, in its local neighborhood, the outer surface of the object curves like the exterior of a circle; it is called concave if the surface curves like the interior of a circle in the local neighborhood of the edge.
Moreover, each surface segment in the image plane is the projection of an object's face. In particular, the projection of a flat face maps all of its points to the corresponding surface segment, while in the case of a curved/non-planar face, the corresponding surface segment includes only the subset of the face that is visible from the viewpoint. Assume that the operator $\lambda: \mathbb{R}^2 \rightarrow \mathbb{R}^3$ maps 2D pixels to their real 3D coordinates. In Figure 4, the $S_i$ show 2D surface segments and the $A_i$ indicate collections of 3D points. It is clear that $S_1$ represents a flat face of the cube and $\lambda(S_1) = A_1$, while the surface segment $S_2$ covers only a subset of the cylinder's lateral surface in 3D, bounded between $e_2$ and $e_4$, such that $\lambda(S_2) \subset A_2$. Hence, a depth edge in the image plane may or may not represent an actual edge of the object in 3-dimensional space. Thus, edge type determination in the proposed framework depends on the viewpoint. While a concave CD edge keeps its type in all viewpoints, a convex CD edge may switch to a DD edge and vice versa as the point of view changes.
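The operator $\lambda$ can be realized with the standard pinhole back-projection, given the depth camera intrinsics. The sketch below assumes calibrated intrinsics; the numeric values are illustrative placeholders, not calibration values from the paper:

```python
import numpy as np

def back_project(r, c, z, fx, fy, cx, cy):
    """Map a depth pixel (row r, column c) with depth z to 3D camera coordinates.

    This is the standard pinhole-camera inverse projection; the intrinsic
    parameters (fx, fy, cx, cy) come from the depth camera calibration.
    """
    x = (c - cx) * z / fx
    y = (r - cy) * z / fy
    return np.array([x, y, z])

# Placeholder Kinect-like intrinsics (illustrative values only)
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5
point_3d = back_project(r=240, c=400, z=0.85, fx=fx, fy=fy, cx=cx, cy=cy)
```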

4.2. Grasp and Contact Model

Generally, a precision grasp is specified by the end-effector and fingertip poses with respect to a fixed coordinate system. Following terminology adopted from [30], for an end-effector $E$ with $n_E$ fingers and $n_\theta$ joints whose fingertips contact an object's surface, a grasp configuration $G$ is written as follows:
$G = (p_G, \theta_G, C_G)$
where $p_G$ is the end-effector pose (position and orientation) relative to the object, $\theta_G = (\theta_1, \theta_2, \ldots, \theta_{n_\theta})$ indicates the end-effector's joint configuration, and $C_G = \{c_i \in S(O)\}_{i=1}^{n_E}$ determines $n_E$ point contacts on the object's surface. The set of contact locations on the end-effector's fingers is $C_E = \{\bar{c}_i \in S(E)\}_{i=1}^{n_E}$ and is obtained by forward kinematics from the end-effector joint configuration $\theta_G$.
Throughout this paper, we make an assumption regarding the end-effector during its interaction with the object: each fingertip applies a force in the direction of its normal, and the forces exerted by all fingertips lie in the same plane. We refer to this plane and its normal direction as the end-effector's approach plane, $\rho_G$, and approach direction, $V_G$, respectively. In addition, some of the end-effector's geometric features, such as the fingers' opening-closing range, can be described according to how they appear on the approach plane. Figure 5 shows how a three-finger end-effector contacts points $c_1$, $c_2$, and $c_3$ to grasp a planar shape.
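A minimal container mirroring $G = (p_G, \theta_G, C_G)$ might look as follows; the field names, and the choice to store the approach direction $V_G$ in the first column of the rotation matrix, are illustrative assumptions rather than the paper's definitions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraspConfiguration:
    """Container mirroring G = (p_G, theta_G, C_G) from Section 4.2."""
    position: np.ndarray     # P_G: 3D end-effector position
    orientation: np.ndarray  # R_G: 3x3 rotation matrix of the end-effector
    joint_config: np.ndarray # theta_G: end-effector joint values
    contacts: list           # C_G: list of 3D contact points on the object

    @property
    def approach_direction(self):
        # V_G: by the planar-contact assumption, the approach direction is the
        # normal of the approach plane, assumed here to sit in the first column.
        return self.orientation[:, 0]
```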

4.3. Edge-Level Grasping

Until this point, we have discussed how to extract depth edges and form closed contours based on the available partial information. In other words, objects are captured through 2D shapes formed by depth edges. Experiments show a human tendency to grasp objects by contacting their edges and corners [3]. The main reason is that edges provide a larger wrench convex and, accordingly, a greater capability to apply the necessary force and torque directions. In this part, we aim to evaluate the existence of grasps for each of the obtained closed contours as a way to contact an object. For this purpose, we use contours as the input to the planar grasp synthesis process. The output grasp will satisfy reachability, force-closure, and feasibility with respect to the end-effector's geometric properties. Next, we describe how the 6D pose of the end-effector is obtained from contact points specified on the depth image. Finally, we point out the ambiguity and uncertainties that arise from the 2D representation.
If we assume that the corresponding 3D coordinates of a closed contour are located on a plane, planar grasp synthesis helps us find appropriate force directions lying on this virtual plane. In addition, edge type determination guides the evaluation of whether these force directions can feasibly be applied in 3D. Reachability of a depth edge is measured by the availability of a wrench convex lying in the plane of interest. A convex CD edge provides wrench convexes for possible contact on two virtual planes, while a concave CD edge is not reachable for a planar grasp. Exerting force on a DD edge, which also points toward the object interior, is only possible from one side. Therefore, DD and convex CD edges are regarded as reachable edges, while concave CD edges are not considered available for planar contact.
For simplicity in the analysis and without loss of generality, we approximate curved edges by a set of line segments. As a result, all 2D contours turn into polygonal shapes. To obtain the planar force-closure grasp, we assume each polygon side represents just one potential contact. Then we evaluate all the possible combinations of polygon sides subject to the force-direction closure (Theorem 1) and torque-closure (Theorem 2) conditions.
We name the validation of force-direction closure the Angle test. According to Section 2, force-direction closure is satisfied for a two-opposing-fingers contact if the angle made by the two edges is less than twice the friction angle. The Angle test for a three-finger end-effector is passed for a set of three contacts such that a wrench from the first contact, taken in the opposite direction, overlaps with some positive combination of the wrenches (friction cones) provided by the other two contacts [4].
In Section 2, we also discussed how to check if a set of points, corresponding to wrench convexes, satisfy the torque-closure condition. Here, we apply the following steps to recognize regions that include such points on each edge:
  • Form an orthogonal projection area $H_i$ for each edge $e_i$.
  • Find the intersection of the projection areas of the candidate edges and output the overlapping area $\bar{H}$.
  • Back-project the overlapping area onto each edge and output the contact regions $\bar{e}_i$.
In fact, torque-closure is satisfied if there exists a contact region for each edge. We call this procedure the Overlapping test. Figure 6 illustrates this for a grasp using a combination of 3 reachable edges detected for an object. Specifically, Figure 6a illustrates the projection area of each edge by color-coded dashed lines, while in Figure 6b the shaded region corresponds to the overlapping area and the green lines correspond to acceptable contact regions. A sketch of this test for a pair of edges is given below.
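The following is a minimal sketch of the Overlapping test for a pair of 2D edges, using the shapely library for the polygon intersection; the projection width and the back-projection onto edge parameters are implementation choices for illustration, not the authors' code:

```python
import numpy as np
from shapely.geometry import Polygon

def projection_area(p_start, p_end, normal, width):
    """Rectangle swept by translating segment [p_start, p_end] along `normal` by `width`."""
    offset = width * normal
    return Polygon([p_start, p_end, p_end + offset, p_start + offset])

def overlap_test(seg_a, normal_a, seg_b, normal_b, width=0.2):
    """Torque-closure (Overlapping) test for two candidate contact edges.

    seg_a, seg_b : (2, 2) arrays with the endpoints of each 2D edge
    normal_a/b   : unit normals pointing toward the object interior
    Returns the contact region on edge a as a parameter range in [0, 1], or None.
    """
    H_a = projection_area(seg_a[0], seg_a[1], normal_a, width)
    H_b = projection_area(seg_b[0], seg_b[1], normal_b, width)
    H_bar = H_a.intersection(H_b)                      # overlapping area
    if H_bar.is_empty or H_bar.geom_type != "Polygon":
        return None                                    # no torque-closure for this pair
    # back-project the overlap vertices onto edge a to get its contact region
    d = seg_a[1] - seg_a[0]
    t = [np.dot(np.array(v) - seg_a[0], d) / np.dot(d, d)
         for v in np.array(H_bar.exterior.coords)]
    lo, hi = max(0.0, min(t)), min(1.0, max(t))
    return (lo, hi) if lo < hi else None

# Example: two roughly opposing vertical edges of a 10 cm-wide object
seg_a = np.array([[0.00, 0.00], [0.00, 0.08]]); n_a = np.array([ 1.0, 0.0])
seg_b = np.array([[0.10, 0.02], [0.10, 0.12]]); n_b = np.array([-1.0, 0.0])
print(overlap_test(seg_a, n_a, seg_b, n_b))  # approximately (0.25, 1.0)
```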
Please note that the entire procedure up to this point is performed in the image plane. In the next step, we extract the 3D coordinates of the involved edges in order to evaluate the feasibility of the output grasp with respect to the employed end-effector. For instance, comparing the Euclidean distance between line segments with the width range of a two-fingered gripper specifies whether the end-effector can fit around a pair of edges. Furthermore, by accessing the 3D coordinates of the pixels, we find the Cartesian equation of a plane passing through the edge contact regions ($\bar{e}_i$). According to Section 4.2, the obtained plane determines the end-effector approach plane ($\rho_G$) and approach direction ($V_G$) at the grasping moment. To make the grasp robust to positioning errors, the center of each usable edge contact region ($e_i^*$) is chosen as the point contact on the object's surface ($c_i$). It is beyond the scope of this paper to completely specify the grasp configuration $G(p_G, \theta_G, C_G)$ for an arbitrary object grasped with a generalized multi-fingered gripper; to completely specify $G$, one needs the end-effector kinematics and a chosen grasp policy for execution. While necessary and sufficient conditions for two- and three-finger force-closure grasps are provided in Section 2, a complete grasp specification for a parallel gripper is provided in Section 5.
To sum up our proposed approach, Figure 7 presents a block diagram showing how the process is applied to an input depth image to output a desired 6D gripper pose. The approach is split into two main parts. During the contact region localization process, all the depth edges are extracted from the input. After a post-processing step, closed contours are formed and each edge is assigned an edge type feature (CD/DD). By filtering out concave CD edges, the available contact points along with their corresponding force directions are provided to the second part of the approach. In the force-closure grasp synthesis, all possible combinations of edge segments are constructed, depending on the desired number of contacts. Force-direction closure and torque-closure are validated by applying the Angle test and the Overlapping test to each constructed combination. These two tests are derived from Theorems 1 and 2 in the Preliminaries section and are described specifically for a parallel gripper in the Implementation section. The combinations that satisfy the force-closure conditions are then subject to the last two constraints, namely plane existence and gripper specifications. The centers of the usable edge segments determine the fingers' contact points, while the normal of the plane described by these points specifies the end-effector approach direction. Based on the kinematic model of the end-effector and the chosen grasp policy, its 6D pose at the grasping moment is calculated for each candidate grasp. The following pseudo-code summarizes the steps for constructing a force-closure grasp from a depth image:
  1. Detect depth discontinuity (DD) edges from the depth image.
  2. Detect curvature discontinuity (CD) edges from the gradient direction image.
  3. Form closed 2D contours using all depth edges from Steps 1 and 2.
  4. For each contour:
    (a) approximate curved edges by line segments to obtain the corresponding polygon;
    (b) remove the concave CD edges to obtain the reachable edges;
    (c) form combinations of edges with the desired number of contacts (depending on the available number of fingers);
    (d) for each edge combination:
      • perform the Angle test (based on Theorem 1);
      • perform the Overlapping test (based on Theorem 2) and output the edge contact regions in 2D;
      • perform the end-effector geometric constraint test and output the end-effector approach plane and contact points;
      • output the grasp parameters given the end-effector kinematics.
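A structural skeleton of this pipeline is sketched below; every helper is a placeholder stub standing in for the corresponding step described above, not the authors' implementation:

```python
from itertools import combinations

# Placeholder stubs: each stands in for the corresponding routine in the pseudo-code.
def detect_dd_edges(depth_image): return []                 # Step 1
def detect_cd_edges(depth_image): return []                 # Step 2
def form_closed_contours(edges): return []                  # Step 3
def approximate_with_line_segments(contour): return []      # Step 4(a)
def is_concave_cd(edge): return False                       # Step 4(b)
def angle_test(combo): return False                         # Theorem 1
def overlapping_test(combo): return None                    # Theorem 2
def gripper_constraint_test(regions): return None, None     # end-effector constraints
def grasp_parameters(plane, contacts): return plane, contacts

def plan_grasps(depth_image, n_contacts=2):
    """Skeleton of the grasp planning pipeline following the pseudo-code above."""
    edges = detect_dd_edges(depth_image) + detect_cd_edges(depth_image)
    grasps = []
    for contour in form_closed_contours(edges):
        polygon = approximate_with_line_segments(contour)
        reachable = [e for e in polygon if not is_concave_cd(e)]
        for combo in combinations(reachable, n_contacts):
            if not angle_test(combo):
                continue
            regions = overlapping_test(combo)
            if regions is None:
                continue
            plane, contacts = gripper_constraint_test(regions)
            if plane is None:
                continue
            grasps.append(grasp_parameters(plane, contacts))
    return grasps
```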
It is worth mentioning the possible uncertainties in our method. Throughout this paper, we assume that the friction between the fingertips and the object is large enough that the applied planar forces lie inside the 3D friction cones at the contact points. To clarify, the current form of our approach provides similar grasps for the contact pairs $(e_1, e_2)$, $(e_3, e_4)$, and $(e_5, e_6)$ in Figure 8; however, the depth edges in these three cases offer different 3D wrench convexes, which increases the grasping uncertainty. Another impacting factor is the relative position of the force focus point with respect to the object's center of gravity. Hence, we assume that a sufficient level of force is available to prevent any possible torques arising from this uncertainty.

5. Implementation for Parallel Gripper

In this section, we describe the implementation steps that process a depth image as input and identify appropriate grasps for a two-opposing-finger gripper. We employ the algorithm described in Section 4.3 to construct a grasp by forming a combination of two edges that indicates a pair of contact locations. A set of pixel-wise techniques is used to obtain the regions of interest in the 2D image and, eventually, the desired 3D grasp. In addition, to cope with noise in the edge detection step, we use a modified procedure to follow the steps of the approach: we skip the contour formation process in the third step and directly look for pairs that meet the discussed conditions. Thus, if an edge is missed in the detection step, we do not lose the whole contour and its corresponding edge pairs. However, this expands the pair-formation search space; therefore, later in this section, we introduce constraints to restrict this search space.

5.1. Edge Detection and Line Segmentation

According to Section 4.1, depth edges appear in the depth image and in the gradient direction image. Due to the discontinuity encountered when traveling orthogonal to a DD edge in the depth image ($I_d$), the pixels belonging to the edge are local maxima of $I_M$ (the magnitude of the gradient image $\nabla I$). In addition, a CD edge shows up as a discontinuity in the gradient direction image ($I_\theta$), reflecting a sudden change in the normal directions in the edge neighborhood. Thus, in the first step, an edge detection method is applied to $I_d$ and $I_\theta$ to capture all the DD and CD edges, respectively. We selected the Canny edge detection method [31], which produced the most satisfactory results with our collected data.
Generally, the output of an edge detection method is a 2D binary image. Imperfect measurements in the depth image lead to artifacts and distorted texture in the output binary images. For instance, an ideal edge is marked with a one-pixel width, but in practice the detected edges have non-uniform thickness. To reduce such effects and enhance the output of edge detection, a set of morphological operations is applied to the binary images. In line with this, a logical OR operation is used to merge all the marked pixels corresponding to depth edges from $I_d$ and $I_\theta$ into a single binary image called the detected depth image $I_{DE}$. Figure 9 shows the output of the edge detection step for a depth image acquired from the Object Segmentation Database [32]. Note that the only input to the whole algorithm is $I_d$; the color image is merely used to visualize the obtained results. For visualization, a range of colors is also assigned to the values of $I_d$ and $I_\theta$. The improvements made by the morphological operations are noticeable in Figure 9d.
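An OpenCV-based sketch of this step is shown below, assuming $I_d$ and $I_\theta$ are available as 2D arrays; the Canny thresholds and kernel size are illustrative and would need per-sensor tuning (the paper's implementation is in MATLAB):

```python
import cv2
import numpy as np

def detect_depth_edges(I_d, I_theta):
    """Detect DD and CD edge pixels and merge them into one binary image I_DE.

    I_d     : depth image (float, meters); I_theta : gradient direction image.
    Note: angle wrap-around in I_theta can create spurious responses and may
    need special handling in practice.
    """
    def to_uint8(img):
        img = np.nan_to_num(img.astype(np.float64))
        lo, hi = img.min(), img.max()
        return np.uint8(255 * (img - lo) / (hi - lo + 1e-9))

    dd = cv2.Canny(to_uint8(I_d), 30, 90)       # depth-discontinuity edges
    cd = cv2.Canny(to_uint8(I_theta), 30, 90)   # curvature-discontinuity edges
    I_DE = cv2.bitwise_or(dd, cd)               # merge DD and CD pixels (logical OR)

    # morphological cleanup: close small gaps along the detected edges
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    I_DE = cv2.morphologyEx(I_DE, cv2.MORPH_CLOSE, kernel)
    return I_DE
```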
To perform further processing, a procedure is required to represent the edges in the obtained binary image ($I_{DE}$) by a compact 2D description. Considering a 2D image with the origin at the bottom-left corner, each pixel is addressed by a pair of positive integers. We employed the method proposed in [33] to cluster binary pixels into groups and then represent them by start and end points. Given $I_{DE}$, we first group the marked pixels into connected pixel arrays such that each pixel in an array is connected only to its 8 immediate neighbors in the same array. Next, an iterative line fitting algorithm divides the pixel arrays into segments such that each segment is indicated by its two endpoints. The pixels belonging to a segment satisfy an allowable deviation threshold from the 2D line formed by the endpoints. As a result, pixels corresponding to a straight edge are represented by one line segment, while curved edges are captured by a set of line segments. Figure 10 shows the marked edge pixels and the corresponding line segmentation for a synthetic depth image; colors are randomly assigned to distinguish the captured lines. The operator $|L_i|$ computes the pixel-length of a line segment, and $\angle(L_i)$ measures the angle made by the line segment and the positive direction of the horizontal axis in the range $[0° : +180°)$, where counterclockwise is taken as the positive orientation.

5.2. Edge Feature Extraction and Pair Formation

At the end of the previous step, a set of pixel groups, represented by a corresponding set of line segments, is available. In this part, we aim to form pairs of line segments subject to the constraints mentioned in Section 4.3. In this implementation, the geometric characteristics of a line segment are extracted from its adjacent pixels. Adjacent pixels are subsets of two surface segments in the image plane and can be reached by rotating about the line segment in either the clockwise or the counterclockwise orientation. We choose the collection of pixels in each surface segment by building 2D image masks enclosing the line segment. A parallelogram mask for a line segment is obtained by the operator $h(\cdot)$:
$h(L_i, W_i) \equiv h(L_i, (w, \gamma))$
where $L_i$ and $W_i$ are the sides of the parallelogram. In the equivalent operator representation, $w$ and $\gamma$, respectively, denote the pixel-length of side $W_i$ and the angle between the sides $W_i$ and $L_i$ in the range $[-180° : +180°)$. In a similar way, we define the following masks for a line segment:
$H_0(L_i) = h(L_i, (1, +90°)), \quad H_+(L_i) = h(L_i, (w_0, +90°)), \quad H_-(L_i) = h(L_i, (w_0, -90°))$
Please note that the binary mask locates the region of interest in the image, and the desired functions are then applied to the pixel values within it. Figure 11a demonstrates masks $H_1$ and $H_2$, which provide a positive-angle parallelogram for $L_1$ and a negative-angle parallelogram for $L_2$, respectively. We remark here that increasing the value of $w$ results in larger masks and more robust feature identification.
We take advantage of the defined masks to evaluate the reachability of each line segment and the existence of a wrench convex for it. To do so, the line segments must be assigned an edge type label. Comparing the binary masks $H_0(L_i)$ applied to the $I_d$ and $I_\theta$ images distinguishes DD and CD line segments from one another. In addition, a line segment divides its local region into two sides; therefore, the object is posed with either a positive or a negative orientation with respect to the line segment. As discussed earlier, the wrench convex(es) are available on certain side(s) of each line segment. Please note that the depth values on the two sides of a DD edge hint at the object's relative pose with respect to the line segment: the side with the lower depth value corresponds to the object (foreground), while the side with the greater depth value corresponds to the background; the available wrench is assigned accordingly. Likewise, evaluating the average depth values of the sides and of the line segment itself, based on the provided definition, specifies the convexity/concavity of a CD edge. Mathematically speaking, the edge type feature is determined for a DD line segment $L_i$ and a CD line segment $L_j$ as follows:
if $\bar{d}(H_+(L_i)) < \bar{d}(H_-(L_i))$ then $L_i$ is $DD^-$; otherwise $L_i$ is $DD^+$
if $\tfrac{1}{2}\left[\bar{d}(H_-(L_j)) + \bar{d}(H_+(L_j))\right] > \bar{d}(H_0(L_j))$ then $L_j$ is $CD^\pm$; otherwise $L_j$ is $CD^0$
where the signs $(\pm, +, -, 0)$ indicate the availability of a wrench convex with respect to the line segment, and the operator $\bar{d}(\cdot)$ takes the average of the depth values over the specified region.
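A pixel-level sketch of the mask construction and the DD rule above is given below; the sign convention follows the rule as reconstructed in the text, and the mask width w0 is an illustrative choice:

```python
import numpy as np
import cv2

def side_mask(shape, p0, p1, signed_width):
    """Binary parallelogram mask h(L, (w, ±90°)): the segment swept to one side.

    p0, p1 : segment endpoints as (col, row); signed_width > 0 gives H+, < 0 gives H-.
    """
    p0 = np.asarray(p0, float)
    p1 = np.asarray(p1, float)
    d = p1 - p0
    n = np.array([-d[1], d[0]]) / (np.linalg.norm(d) + 1e-9)  # unit normal of the segment
    off = signed_width * n
    quad = np.array([p0, p1, p1 + off, p0 + off], np.int32)
    mask = np.zeros(shape, np.uint8)
    cv2.fillPoly(mask, [quad], 1)
    return mask.astype(bool)

def classify_dd_edge(I_d, p0, p1, w0=5):
    """Assign DD+/DD- by comparing the mean depth on the two sides of the segment."""
    d_plus = np.nanmean(I_d[side_mask(I_d.shape, p0, p1, +w0)])
    d_minus = np.nanmean(I_d[side_mask(I_d.shape, p0, p1, -w0)])
    return "DD-" if d_plus < d_minus else "DD+"

# usage (hypothetical pixel coordinates):
# label = classify_dd_edge(depth_image, p0=(120, 80), p1=(180, 80))
```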
According to Section 4.3, a pair of line segments constructs a planar force-closure grasp if it satisfies the Angle test and the Overlapping test. Therefore, to construct a two-opposing-fingers grasp, the line segments $L_i$ and $L_j$ need to have opposite wrench signs and satisfy the following inequality:
$|\angle(L_i) - \angle(L_j)| < 2\alpha_f$
where $\alpha_f$ is determined by the friction coefficient. To check for the existence of an overlapping area and edge contact regions, we use the $h(\cdot)$ operator to create masks and form line segments corresponding to the projection areas. The mask $\bar{H}_\beta$ is the pair's overlapping area, captured by the intersection of the edges' projection areas and computed by the following relations:
$\beta = \tfrac{1}{2} \times \left| 180° - |\angle(L_i) - \angle(L_j)| \right|$
$H_\beta(L_i) = \begin{cases} h(L_i, (w_{\max}, +\beta)) & \text{if } L_i \text{ is } DD^- \text{ or } CD^- \\ h(L_i, (w_{\max}, -\beta)) & \text{if } L_i \text{ is } DD^+ \text{ or } CD^+ \end{cases}$
$\bar{H}(\beta) = H_\beta(L_i) \cap H_\beta(L_j)$
where $H_\beta(L_i)$ is the projection area made by line segment $L_i$ at the angle $\beta$. In fact, $\beta$ corresponds to the direction orthogonal to the bisector of the two segments. Assuming the overlapping area exists, the edge contact regions $L_i^*$ and $L_j^*$ are the parts of the line segments enclosed by the $\bar{H}(\beta)$ mask. Figure 11b shows the projection areas and contact regions for a pair of edges.
Remark 2.
In the case that we have access to the closed contours formed by depth edges, both the line segments are required to belong to the same closed contour.
Up to this point, planar reachability and force-closure features have been assessed. As the final step, we check whether the pair is feasible under the employed gripper constraints. We let $P_i = \lambda(L_i^*)$ be the set of all 3D points located on the $L_i^*$ region. The Euclidean distance between the average points of the two sets $P_i$ and $P_j$ is required to satisfy:
$\epsilon_{\min} < \| \bar{P}_i - \bar{P}_j \|_2 < \epsilon_{\max}$
where $\epsilon$ denotes the width range of the gripper and $\bar{P}_i$ is the average point of set $P_i$. In addition, to ensure that $P_i$ and $P_j$ lie on a plane, we fit a plane model to the data. In the current implementation, we have used RANSAC [34] to estimate the plane parameters. The advantage of RANSAC is its ability to reject outlier points resulting from noise: if a point lies farther from the plane than an allowable threshold ($t_{\max}$), it is considered an outlier. The fitted plane and its unit normal vector are referred to as $\rho_R$ and $V_R$. Please note that for further processing, the sets $P_i$ and $P_j$ are replaced with the corresponding sets excluding the outliers.
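A minimal NumPy RANSAC plane fit in the spirit of [34] is sketched below; the threshold and iteration count are illustrative, and this is not the authors' implementation:

```python
import numpy as np

def ransac_plane(points, t_max=0.005, n_iters=200, rng=None):
    """Fit a plane to 3D points with RANSAC, rejecting outliers beyond t_max (meters).

    points : (N, 3) array. Returns (unit normal V_R, point on plane, inlier mask).
    """
    rng = np.random.default_rng(rng)
    best_normal, best_point, best_inliers = None, None, None
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                 # degenerate (collinear) sample
            continue
        normal = normal / norm
        dist = np.abs((points - sample[0]) @ normal)   # point-to-plane distances
        inliers = dist < t_max
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_normal, best_point, best_inliers = normal, sample[0], inliers
    return best_normal, best_point, best_inliers

# usage: fit the plane through the 3D points of both contact regions
# V_R, origin, inliers = ransac_plane(np.vstack([P_i, P_j]))
```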

5.3. 3D Grasp Specification

We now calculate the grasp parameters based on the model presented in Section 4.2. To reduce the effects of uncertainties, we pick the centers of the edge contact regions ($\bar{P}_i$) as the safest contact points. As stated by [35], a key factor in improving grasp quality is orthogonality of the end-effector approach direction to the object surface. In addition, the fingers of a parallel gripper can only move toward each other. Hence, according to the employed grasp policy, the gripper holds a pose such that its approach direction is aligned with the normal of the extracted plane; closing the fingers then yields contact with the object at the desired contact points. Thus, for a graspable pair, the grasp parameters are described by:
$G(L_i, L_j) = (p_G, \theta_G, C_G)$ with
$p_G = (P_G, R_G), \quad \theta_G = \{\theta_1, \theta_2\}, \quad C_G = \{c_1, c_2\} = \{\bar{P}_i, \bar{P}_j\}$
where the 3D vector $P_G$ and the rotation matrix $R_G$ indicate the gripper pose. We adjust $\theta_G$ such that the fingers are at maximum width before contact and at a width equal to $\|\bar{P}_i - \bar{P}_j\|_2$ during contact. If the length of the fingers is $l_d$ and the fingers' closing direction is defined by the unit vector $V_c = (\bar{P}_i - \bar{P}_j)/\|\bar{P}_i - \bar{P}_j\|_2$, then we obtain:
$P_G = \tfrac{1}{2}(\bar{P}_i + \bar{P}_j) - l_d V_R, \qquad R_G = [\, V_R;\ V_c \times V_R;\ V_c \,].$
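A sketch of this grasp policy in Python is shown below; the sign of the finger-length offset depends on the chosen orientation of the plane normal $V_R$ and is an assumption here:

```python
import numpy as np

def gripper_pose(P_i_bar, P_j_bar, V_R, finger_length):
    """6D parallel-gripper pose from two contact points and the plane normal.

    P_i_bar, P_j_bar : 3D contact points (centers of the contact regions)
    V_R              : unit normal of the extracted plane (approach direction)
    finger_length    : l_d, distance from the gripper frame origin to the fingertips
    """
    V_c = (P_i_bar - P_j_bar) / np.linalg.norm(P_i_bar - P_j_bar)  # closing direction
    # stand off from the contact midpoint along the approach direction
    P_G = 0.5 * (P_i_bar + P_j_bar) - finger_length * V_R
    # orthonormal frame: approach axis, a third axis, and the closing axis as columns
    R_G = np.column_stack([V_R, np.cross(V_c, V_R), V_c])
    return P_G, R_G

# Example with illustrative values (contacts 5 cm apart, plane normal along +z)
P_G, R_G = gripper_pose(np.array([0.50, 0.025, 0.10]),
                        np.array([0.50, -0.025, 0.10]),
                        np.array([0.0, 0.0, 1.0]),
                        finger_length=0.06)
```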

5.4. Practical Issues

During implementation on real data, we face issues caused by uncertainties in the measured data. According to [36], error sources for data acquired by depth sensors originate from imperfect camera calibration, lighting conditions, and properties of the object surface. RGBD sensors are subject to specific problems in measuring depth information depending on the technology they use [37]. A common example is shadows or holes appearing in the depth image, which indicate the sensor's inability to measure the depth of those pixels; the main reason is that some regions are visible to the emitter but not to the receiver. Consequently, the sensor returns a non-value code for these regions. Since our implementation is dominated by pixel-level processes, a procedure is required to handle this issue. We use a recursive median filter to estimate depth values for the shadow regions [38]. In Figure 9b, white pixels display shadows in the sensed depth image, and Figure 9c shows the depth image after the estimation procedure is performed.
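A simple iterative (recursive) median fill for the shadow regions, in the spirit of [38], is sketched below; the pass count and the zero-as-invalid convention are assumptions for illustration:

```python
import numpy as np

def fill_depth_holes(I_d, max_passes=10):
    """Iteratively replace invalid (zero/NaN) depth pixels with the median of
    their valid 3x3 neighbors, repeating until no holes remain or passes run out."""
    depth = I_d.astype(np.float64).copy()
    depth[depth == 0] = np.nan                      # treat zero returns as holes
    for _ in range(max_passes):
        holes = np.argwhere(np.isnan(depth))
        if len(holes) == 0:
            break
        filled = depth.copy()
        for r, c in holes:
            patch = depth[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            valid = patch[~np.isnan(patch)]
            if valid.size:                          # fill only if a valid neighbor exists
                filled[r, c] = np.median(valid)
        depth = filled
    return depth
```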
Another issue with DD edges is whether each edge pixel is correctly placed on the surface it belongs to. In practice, an edge is detected as a combination of pixels lying on both the object and the background. Since the marked pixels are used for object pose estimation, we want to locate them on the foreground object. Although there are efficient ways to recognize foreground pixels, such as [39], as roboticists, we use a very simple pixel-level procedure to keep the computational cost low. Based on the object-line relative position, we create an $H_+$ or $H_-$ mask oriented toward the object side. Applying the mask to the gradient magnitude image ($I_M$) provides the accurate location of the maximum depth gradient along the mask width (perpendicular to the line segment). The position of the marked edge pixels relative to the peak of the gradient determines whether they are located on the object side or not. In the case of incorrect allocation, we move the marked pixel in the direction perpendicular to the line segment by a sufficient displacement to make sure the new marked pixel is located on the foreground side. It is important to note that this process is applied only to DD edges, since there is no foreground/background concept for a CD edge.
Due to the 3D-to-2D projection in the camera, an ambiguity emerges in which two distinct depth edges aligned with each other are captured as a single line segment. This issue can be handled by adding an extra check to the edge feature extraction: applying the mask $H_0$ to the images $I_d$ and $I_\theta$ reveals whether there is any depth or gradient orientation discontinuity along the edge. The occurrence of such a discontinuity results in breaking the edge into two line segments at the location of the abrupt change. Note that this ambiguity does not arise during the formation of closed contours; thus, this test is discarded during that exercise. In Figure 12, the edges $e_1$ and $e_2$ are captured as one line segment in the 2D image, but by performing the above test they can be distinguished.

6. Results

In this section, we first evaluate in simulation the performance of the detection step of the grasp planning algorithm and then conduct experiments with two setups to test the overall grasping performance using the 7-DOF Baxter arm manipulator. The robotics community has assessed grasping approaches using diverse datasets; however, most of these datasets, e.g., the YCB Object set [40], only capture single objects in isolation. Since we are interested in scenes with multiple objects, a standard dataset named the Object Segmentation Database (OSD) [32] is adopted for the simulation. In addition, we collected our own dataset using a Microsoft Kinect One sensor for the real-world experiments. The datasets include a variety of unknown objects in terms of shape, size, and pose. In both cases, the objects are placed on a table inside the camera view, and the datasets provide RGBD images. The depth image is fed into the grasp planning pipeline, while the RGB image is simply used to visualize the obtained results. Please note that all computations are performed in MATLAB.

6.1. Simulation-Based Results

In this part, to validate our method, we used a simulation-based PC environment; specifically, the algorithm was executed using sequential processing in MATLAB running on a 4th-generation Intel Core desktop i7-4790 @ 3.6 GHz. In what follows, we focus on the output of the detection steps, i.e., edge detection, line segmentation, and pair evaluation. To explain our results, we chose 8 illustrative images from the OSD dataset including different object shapes in cluttered scenes. Figure 13 shows the selected scenes.
To specify the ground truth, we manually mark all the reachable edges (DD and convex CD) of the existing objects and consider them graspable edges. A graspable edge detected with correct features is counted as a detected edge. Assuming there are no gripper constraints, a surface segment is considered graspable if it provides at least one planar force-closure grasp in the camera view. Detected surface segments, graspable objects, and detected objects are specified in a similar way. Table 1 shows the results obtained by applying the proposed approach to the dataset. In addition, Figure 14 illustrates the ground truth and detected edges for scene number 4.
According to the results from the illustrative subset, although 20% of the graspable edges are missed in the detection steps, 97% of the existing objects are detected and represented by at least one of their graspable surface segments. This emphasizes the positive effect of skipping the contour formation step on grasp planning.
A large-sample (N = 63) simulation study containing objects relevant to robotic grasping was also conducted. Each of the 63 scenes comprised between 2 and 16 objects, with an average of 5.57 objects per scene. The algorithm predicted an average of 19.4 pairs, out of which an average of 18.4 were deemed graspable, which computes to a 94.8% accuracy. As previously stated, the algorithm was run using sequential processing in MATLAB on a 4th-generation Intel Core desktop, specifically the i7-4790 @ 3.6 GHz. Per pair of viable grasping edges that satisfy reachability and force-closure conditions, an average of 281 ms was needed in the aforementioned environment to generate the 6D robot pose needed to successfully attach with the object. However, increased efficiency in terms of detection time per graspable pair of edges was seen as the scenes became more cluttered with multiple objects. In fact, it dipped as low as 129 ms per pair in a scene containing 15 objects and 61 viable graspable edge pairs. More detailed statistics, as well as the raw data behind them, can be made available to the reader by the authors upon request.

6.2. Robot Experiments

For the real-world experiments, the approach is run in two phases, namely grasp planning and grasp execution. In the first phase, the proposed approach is applied to the sensed data, and the extracted grasping options are presented to the user by displaying the candidate pairs of contact regions. Based on the selected candidate, a 3D grasp is computed for the execution phase and the grasp strategy is performed. During all the experiments, the arm manipulator, the RGBD camera, and the computer station are connected through a ROS network. The right arm of the Baxter is fitted with a parallel gripper. The gripper is controlled in two modes: in its "open mode", the finger separation is manually adjusted to $\epsilon_{\max} = 7$ cm based on the size of the used objects; in its "closed mode", the fingers either reach the minimum separation, $\epsilon_{\min} = 2$ cm, or hold a certain force value in the case of contact. Please note that a motion planner [41] is used to find feasible trajectories for the robotic arm joints. The grasp strategy is described for the end-effector by the following steps:
Step 1
Move from an initial pose to the planned pre-grasp pose.
Step 2
Move along a straight line from the pre-grasp pose to the final grasp pose with the fingers in the open mode.
Step 3
Contact the object by switching the fingers to the closed mode.
Step 4
Lift the object and move to post-grasp pose.
We defined two scenarios to examine the algorithm's overall performance: single-object and multiple-object setups. In all the experiments, we assume the target object is placed in the camera field of view, there exists at least one feasible grasp for the employed gripper configuration, and the planned grasps are within the robot's workspace. An attempt is considered a successful grasp if the robot can grasp the target object and hold it for 5 s after lifting. In cases where the user's desired object does not provide a grasp choice, the algorithm acquires a new image from the sensor; if no grasp shows up even on the second try, we consider the attempt a failed case. In the case where the planned grasp is valid but the motion planner fails to plan or execute the trajectory, the iteration is discarded and a new query is called.
In the single-object experiments, objects are placed in an isolated arrangement on a table in front of the robot. Four iterations are performed, covering different positions and orientations for each object. The grasp is planned by the algorithm provided in the previous section, followed by the robot carrying out the execution strategy to approach the object. Prior to each experiment, the relative finger positions of the Baxter gripper are set to be wide enough for the open mode and narrow enough for the closed mode. Figure 15 displays all the objects used in the experiments, and Table 2 shows the results of the single-object experiment.
According to the obtained rates, 90% of the robot attempts were successful over the entire set: 11 objects were grasped successfully in all 4 iterations, 4 objects failed in 1 out of 4 iterations, and one object (the mouse) had 2 successful and 2 unsuccessful attempts. In the unsuccessful attempts, inappropriate orientation of the gripper at the approach moment was observed as the main reason for failure (4 out of 6), preventing the fingers from forming force-closure on the desired contact regions. Basically, this relates to the performance of plane extraction from the detected contact regions. Observations during the experiments illustrate the high sensitivity of the plane retrieval step to unreliable data in the case of curved objects. For instance, in grasping the toothpaste box, although the estimated normal direction ($V_R$) made a 19° angle with the expected normal direction (the actual normal of the surface), the object was lifted successfully; however, a 9° error resulted in failure to grasp the mouse. The impact of force-closure uncertainties on the mouse case is also noticeable. For the other 2 unsuccessful attempts in the single-object experiment, inaccurate positioning of the gripper was the main reason for failure. In grasping the Apple charger, the gripper could not contact the planned regions due to noisy values retrieved from the low number of pixels on the object edges.
Multi-object experiments were conducted to demonstrate the algorithm's overall performance in a more complex environment. In each scene, a variety of objects are placed on the table and the robot approaches the object of interest in each attempt. Measuring the reliability and quality of candidate grasps is not within the scope of this paper. Hence, the order of grasping objects is manually determined such that:
(i) objects which are not blocked by other objects in the view are attempted first;
(ii) objects whose lifting would scatter the other objects are attempted last.
Therefore, the user chooses one of the candidate grasps, and the robot attempts the target object unless there are no feasible grasps in the image. This experiment includes 6 different scenes: two scenes with box-shaped objects, two scenes with cylinder-shaped objects, and two scenes with a variety of shapes. Table 3 reports the results of the multi-object experiment, while Figure 16 shows the setups of three of the scenes.
Based on the obtained results, the proposed approach yields a 100% success rate for box-shaped objects, 90% for curved shapes, and 72% for very cluttered scenes with mixed objects. Figure 17 shows a sequence of images during the grasp execution for scene #3 (Figure 16) in the multi-object experiment. The robotic arm attempted to grasp the cylinder-shaped objects located on the table. In the first two attempts, the orange and blue bottles were successfully grasped, lifted, and removed from the scene. Although in the third attempt the gripper contacted the paste can and elevated it, the object was dropped due to a lack of sufficient friction between the fingertips and the object surface. The arm then approached the remaining objects (the green cylinder and the large paper cup) and grasped them successfully. In the last attempt, another grasp was planned for the paste can by capturing a new image; this attempt also failed because of inaccurate pose estimation. Finally, the experiment finished with one extra attempt and a success rate of 80% (specifically, 4 out of 5).
Figure 18 illustrates the outcomes of three steps of the algorithm. The left column shows color maps of the depth-gradient direction images, the middle column shows the detected depth edges before any post-processing, and the right column shows the candidate grasps, each presented as a pair of same-colored line segments. As discussed in the Implementation Section, the 3D coordinates of the edge points yield the 6D pose of the end-effector. Please note that the scene in the first row is from the OSD dataset, while the scene in the second row was collected in our experiments. A video of the robot executing grasping tasks can also be found online at [42].
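To make the mapping from a pair of contact edges to a 6D end-effector pose concrete, the following MATLAB sketch shows one plausible construction under stated assumptions; the function name, the camera-facing sign convention, the SVD plane fit, and the axis ordering of the gripper frame are our own illustrative choices and are not necessarily the paper's exact implementation.

function [P_G, R] = edgePairToPose(E1, E2)
% EDGEPAIRTOPOSE  Sketch: 6D gripper pose from two contact-region point sets.
%   E1, E2 are N-by-3 matrices of 3D edge points in the camera frame.
%   P_G is the grasp position; the columns of R are the finger axis, the
%   closing direction V_C, and the plane normal V_R (assumed conventions).

    c1  = mean(E1, 1);                     % centroid of first contact region
    c2  = mean(E2, 1);                     % centroid of second contact region
    P_G = (c1 + c2) / 2;                   % grasp point between the two regions

    V_C = (c2 - c1) / norm(c2 - c1);       % closing direction (finger motion)

    P   = [E1; E2];                        % fit a plane to all contact points
    [~, ~, V] = svd(P - mean(P, 1), 'econ');
    V_R = V(:, 3)';                        % smallest singular vector = plane normal
    if V_R(3) > 0                          % assumed convention: normal faces the camera
        V_R = -V_R;
    end

    % Complete a right-handed frame (assumes V_C is not parallel to V_R)
    V_X = cross(V_C, V_R);  V_X = V_X / norm(V_X);
    V_C = cross(V_R, V_X);                 % re-orthogonalized closing direction
    R   = [V_X', V_C', V_R'];              % columns form an orthonormal, right-handed basis
end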

6.3. Discussion

Based on the implemented approach, we discuss the performance and the reasons for failure at three levels, namely 2D contact-region detection, 3D grasp extraction, and execution. In the detection phase, the output is a pair of 2D line segments. False positive and false negative pairs are caused by the following: (i) inefficiency of the edge detection, (ii) incorrect identification of the edge-type feature (DD/CD), and (iii) incorrect identification of the wrench-direction feature (±, +, −, 0). These errors stem from measurement noise and the appearance of artifacts in the data. However, tuning these steps for a specific dataset can improve performance. Please note that if detection is performed on a synthetic dataset without added noise, these error sources do not affect the output.
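As an illustration of the edge-type feature, the following MATLAB sketch labels an edge pixel as a depth discontinuity (DD) or a curvature discontinuity (CD) using a simple local depth-jump test; the 3×3 window, the threshold tau_d, and the function name are hypothetical choices for illustration, not the criterion used in the paper.

function label = classifyEdgePixel(D, r, c, tau_d)
% CLASSIFYEDGEPIXEL  Sketch: DD vs. CD label for an edge pixel.
%   D is the depth map, (r, c) the edge pixel, and tau_d an assumed
%   depth-jump threshold (e.g., 0.02 m).

    win  = D(max(r-1,1):min(r+1,size(D,1)), max(c-1,1):min(c+1,size(D,2)));
    jump = max(win(:)) - min(win(:));   % largest depth difference around the pixel
    if jump > tau_d
        label = 'DD';                   % depth discontinuity: object/background boundary
    else
        label = 'CD';                   % curvature discontinuity: crease between surfaces
    end
end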
Since the user selects a desired Tpair, false positives from the detection phase did not impact the grasp attempts in the conducted experiments. In the 3D grasp extraction step, the approach therefore provides the grasp parameters (P_G, V_R, V_C) based on a true positive pair of contact regions. Overall, the grasp parameter estimation errors can be traced to the following underlying causes: (i) inaccurate DD edge-pixel placement (foreground/background), (ii) unreliable data for objects with a low pixel density, and (iii) noise in the captured data. Since we derive contact regions instead of contact points, deviation of P_G in certain directions is negligible unless a finger collides with an undesired surface while approaching the object; the width of the gripping area with respect to the target surface determines the limits of this deviation. Furthermore, error in the estimation of V_R results in force being exerted on improper regions and, consequently, an unsuccessful grasp. The sensitivity of a grasp to this parameter depends on the surface geometry and the finger kinematics: compliant fingers tolerate a large estimated-plane error, whereas stiff, wide fingertips do not. Uncertainties and assumptions regarding the friction coefficient, robot calibration, and camera calibration errors are among the factors impacting the performance of the execution step.
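Since the sensitivity to V_R manifests directly as an angular error (e.g., the 19° and 9° cases discussed above), the MATLAB sketch below shows how such an error could be measured. It uses a least-squares plane fit for brevity, whereas a RANSAC-style fit as in [34] would be more robust to outliers; the function name and the reference normal n_ref are our own assumptions for illustration.

function [V_R, errDeg] = planeNormalError(pts, n_ref)
% PLANENORMALERROR  Sketch: estimated surface normal and its angular error.
%   pts is an N-by-3 matrix of 3D contact-region points; n_ref is the
%   expected (ground-truth) surface normal. errDeg is the angle in degrees.

    [~, ~, V] = svd(pts - mean(pts, 1), 'econ');
    V_R    = V(:, 3);                       % least-squares plane normal
    n_ref  = n_ref(:) / norm(n_ref);
    cosang = abs(dot(V_R, n_ref));          % the sign of the normal is irrelevant here
    errDeg = acosd(min(cosang, 1));         % clamp to guard against round-off
end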
The obtained results also indicate that the efficiency of the proposed approach (in terms of the average detection time needed per pair of graspable edges) increases as the scene becomes more cluttered. However, as expected, the total time to process a scene correlates with the complexity of the scene. Quantifying exactly how the performance of the pixel-wise techniques, such as edge detection and morphological operations, affects the efficiency of our approach is complex: the output quality and parameter settings of these methods strongly depend on the characteristics of the image view and the scene. Therefore, we analyze only the effect of edge length and do not detail the other influential parameters. An edge that appears longer in the 2D image is composed of a greater number of pixels and thus has a smaller chance of being missed in the detection step. In addition, since there is uncertainty in the measured data, a longer 2D edge carries more reliable information in the grasp extraction step. On the other hand, the appearance of an edge in the image depends on the distance and orientation of the object with respect to the camera view. Thus, the depth-pixel density of an object in the 2D image affects both the detection performance and the reliability of its corresponding grasp.

7. Conclusions

We have proposed an approach to grasp novel objects from a scene containing multiple objects in an uncontrived setting. Our algorithm estimates reliable regions for contacting the object on the contours (formed by a set of depth edges), based on geometric features extracted from a single-view depth map, and leads to a force-closure grasp. Real-world experiments demonstrate the ability of the proposed method to successfully grasp a variety of objects of different shapes, sizes, and colors. Future work will focus on four directions: (1) demonstrating the implementation of the approach for a general multi-fingered gripper, (2) extracting more geometric features from the available data to control the uncertainties, (3) employing efficient techniques to reduce noise effects, and (4) equipping the approach with a process to evaluate grasp quality and reliability.

Author Contributions

A.J. and A.B. conceived of the presented idea. A.J. implemented the theory and carried out the experiments. A.B. verified the analytical methods and supervised the findings of this work. A.J. wrote the manuscript. All authors discussed the results and commented on the manuscript.

Funding

This study was funded in part by NSF grant numbers IIS-1409823 and IIS-1527794, and in part by NIDILRR grant #H133G120275. However, these contents do not necessarily represent the policy of the aforementioned funding agencies, and you should not assume endorsement by the Federal Government.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bohg, J.; Morales, A.; Asfour, T.; Kragic, D. Data-Driven Grasp Synthesis—A Survey. IEEE Trans. Robot. 2014, 30, 289–309.
  2. Sahbani, A.; El-Khoury, S.; Bidaud, P. An overview of 3D object grasp synthesis algorithms. Robot. Auton. Syst. 2012, 60, 326–336.
  3. Nguyen, V.D. Constructing force-closure grasps. In Proceedings of the 1986 IEEE International Conference on Robotics and Automation, San Francisco, CA, USA, 7–10 April 1986; pp. 1368–1373.
  4. Park, Y.C.; Starr, G.P. Grasp Synthesis of Polygonal Objects Using a Three-Fingered Robot Hand. Int. J. Robot. Res. 1992, 11, 163–184.
  5. Howard, W.S.; Kumar, V. On the stability of grasped objects. IEEE Trans. Robot. Autom. 1996, 12, 904–917.
  6. Okamura, A.M.; Smaby, N.; Cutkosky, M.R. An overview of dexterous manipulation. In Proceedings of the IEEE Conference on Robotics and Automation, San Francisco, CA, USA, 22–28 April 2000; Volume 1, pp. 255–262.
  7. Bicchi, A.; Kumar, V. Robotic grasping and contact: A review. In Proceedings of the 2000 ICRA Millennium Conference, Symposia Proceedings (Cat. No.00CH37065), San Francisco, CA, USA, 24–28 April 2000; Volume 1, pp. 348–353.
  8. Diankov, R. Automated Construction of Robotic Manipulation Programs. Ph.D. Dissertation, Carnegie Mellon University, Pittsburgh, PA, USA, August 2010.
  9. Weisz, J.; Allen, P.K. Pose error robust grasping from contact wrench space metrics. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, Saint Paul, MN, USA, 14–18 May 2012; pp. 557–562.
  10. Miller, A.T.; Knoop, S.; Christensen, H.I.; Allen, P.K. Automatic grasp planning using shape primitives. In Proceedings of the 2003 IEEE International Conference on Robotics and Automation (Cat. No.03CH37422), Taipei, Taiwan, 14–19 September 2003; pp. 1824–1829.
  11. Hübner, K.; Kragic, D. Selection of robot pre-grasps using box-based shape approximation. In Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, Nice, France, 22–26 September 2008; pp. 1765–1770.
  12. Przybylski, M.; Asfour, T.; Dillmann, R. Planning grasps for robotic hands using a novel object representation based on the medial axis transform. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, CA, USA, 25–30 September 2011; pp. 1781–1788.
  13. Detry, R.; Pugeault, N.; Piater, J.H. A Probabilistic Framework for 3D Visual Object Representation. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 1790–1803.
  14. Ciocarlie, M.; Allen, P. Hand posture subspaces for dexterous robotic grasping. Int. J. Robot. Res. 2009, 28, 851–867.
  15. Saxena, A.; Driemeyer, J.; Ng, A.Y. Robotic grasping of novel objects using vision. Int. J. Robot. Res. 2008, 27, 157–173.
  16. Detry, R.; Kraft, D.; Kroemer, O.; Bodenhagen, L.; Peters, J.; Krüger, N.; Piater, J. Learning Grasp Affordance Densities. In Proceedings of the 2009 IEEE 8th International Conference on Development and Learning, Shanghai, China, 5–7 June 2009.
  17. Suzuki, T.; Oka, T. Grasping of unknown objects on a planar surface using a single depth image. In Proceedings of the 2016 IEEE International Conference on Advanced Intelligent Mechatronics (AIM), Banff, AB, USA, 12–15 July 2016; pp. 572–577.
  18. ten Pas, A.; Platt, R. Using geometry to detect grasp poses in 3D point clouds. arXiv 2015, arXiv:1501.03100.
  19. Jain, S.; Argall, B. Grasp detection for assistive robotic manipulation. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 2015–2021.
  20. Lippiello, V.; Ruggiero, F.; Siciliano, B.; Villani, L. Visual Grasp Planning for Unknown Objects Using a Multifingered Robotic Hand. IEEE/ASME Trans. Mechatron. 2013, 18, 1050–1059.
  21. Teng, Z.; Xiao, J. Surface-Based Detection and 6-DoF Pose Estimation of 3-D Objects in Cluttered Scenes. IEEE Trans. Robot. 2016, 32, 1347–1361.
  22. Ückermann, A.; Haschke, R.; Ritter, H. Realtime 3D segmentation for human-robot interaction. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; pp. 2136–2143.
  23. Hsiao, K.; Chitta, S.; Ciocarlie, M.; Jones, E.G. Contact-reactive grasping of objects with partial shape information. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, 18–22 October 2010; pp. 1228–1235.
  24. Gratal, X.; Romero, J.; Bohg, J.; Kragic, D. Visual servoing on unknown objects. Mechatronics 2012, 22, 423–435.
  25. Parkhurst, E.L.; Rupp, M.A.; Jabalameli, A.; Behal, A.; Smither, J.A. Compensations for an Assistive Robotic Interface. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Austin, TX, USA, 9–13 October 2017; p. 1793.
  26. Rahmatizadeh, R.; Abolghasemi, P.; Boloni, L.; Jabalameli, A.; Behal, A. Trajectory adaptation of robot arms for head-pose dependent assistive tasks. In Proceedings of the Twenty-Ninth International Flairs Conference, Largo, FL, USA, 16–18 May 2016.
  27. Ettehadi, N.; Behal, A. Implementation of Feeding Task via Learning from Demonstration. In Proceedings of the Second IEEE International Conference on Robotic Computing (IRC), Laguna Hills, CA, USA, 1–2 February 2018; pp. 274–277.
  28. León, B.; Morales, A.; Sancho-Bru, J. From Robot to Human Grasping Simulation; Springer: Cham, Switzerland, 2014.
  29. do Carmo, M.P. Differential Geometry of Curves and Surfaces: Revised and Updated Second Edition; Courier Dover Publications: Mineola, NY, USA, 2016.
  30. Stork, A. Representation and Learning for Robotic Grasping, Caging, and Planning. Ph.D. Dissertation, KTH Royal Institute of Technology, Stockholm, Sweden, 2016.
  31. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698.
  32. Richtsfeld, A.; Mörwald, T.; Prankl, J.; Zillich, M.; Vincze, M. Segmentation of unknown objects in indoor environments. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, Portugal, 7–12 October 2012; pp. 4791–4796.
  33. Kovesi, P.D. MATLAB and Octave Functions for Computer Vision and Image Processing. 2000. Available online: http://www.peterkovesi.com/matlabfns (accessed on 10 May 2017).
  34. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395.
  35. Balasubramanian, R.; Xu, L.; Brook, P.D.; Smith, J.R.; Matsuoka, Y. Human-guided grasp measures improve grasp robustness on physical robot. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010; pp. 2294–2301.
  36. Khoshelham, K.; Elberink, S.O. Accuracy and resolution of Kinect depth data for indoor mapping applications. Sensors 2012, 12, 1437–1454.
  37. Moyà-Alcover, G.; Elgammal, A.; Jaume-i-Capó, A.; Varona, J. Modeling depth for nonparametric foreground segmentation using RGBD devices. Pattern Recognit. Lett. 2016, arXiv:1609.09240.
  38. Hudson, B. Emphasis on Filtering & Depth Map Occlusion Filling, Clarkson University Computer Vision Course CS 611, Fall 2012. Available online: https://people.clarkson.edu/~hudsonb/courses/cs611/ (accessed on 10 May 2017).
  39. Carmichael, O.; Hebert, M. Shape-based recognition of wiry objects. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison, WI, USA, 18–20 June 2003; Volume 2.
  40. Calli, B.; Singh, A.; Walsman, A.; Srinivasa, S.; Abbeel, P.; Dollar, A. The YCB object and model set: Towards common benchmarks for manipulation research. In Proceedings of the International Conference on Advanced Robotics (ICAR), Istanbul, Turkey, 27–31 July 2015.
  41. Sucan, I.A.; Chitta, S. MoveIt! Available online: http://moveit.ros.org (accessed on 10 May 2017).
  42. Jabalameli, A. Sample Grasping Execution. Available online: https://youtu.be/7r0KaxpQk6E (accessed on 10 May 2019).
Figure 1. Planar contacts: (a) frictionless point contact, (b) point contact with friction, (c) soft finger contact, (d) edge contact.
Figure 2. Geometric interpretation of force closure for a two-opposing-finger gripper and a triangular end-effector. (a,b) show feasible force-closure grasps, while (c,d) illustrate infeasible force-closure grasps.
Figure 3. (a) RGB image of the scene, I_c. (b) Color map of the raw depth map. (c) Color map of the computed gradient direction image, I_θ.
Figure 4. Geometric interpretation of a surface segment for a cube and a cylinder.
Figure 5. Grasp representation for a planar shape.
Figure 6. Overlapping test. (a) Intersection of the orthogonal projections of three edges; (b) the overlapped region and the edge contact regions.
Figure 7. Grasp Planning Algorithm Block Diagram.
Figure 8. Shapes with similar planar grasps despite different 3D friction cones.
Figure 9. Edge detection applied to an acquired depth map. (a) RGB image of the scene, I_c. (b) Color map of the raw depth map; white pixels indicate non-returned values from the sensor (depth shadows). (c) Color map of the processed depth map, I_d. (d) Color map of the computed gradient direction image, I_θ. (e) Detected edges before applying the morphological operations. (f) Detected edges after the morphological process, I_DE.
Figure 10. Line segmentation step is applied to a synthetic depth map. (a) Detected edge pixels are marked. (b) Edges are broken into line segment(s).
Figure 11. (a) Examples of parallelogram masks for the sides of the 2D shape ABCDE. (b) Projection area and edge contact regions for a pair of edges.
Figure 12. 2D object representation ambiguity.
Figure 13. Subset of 8 images used to illustrate simulation results.
Figure 14. Reference and detected edges for scene No.4 in the simulation-based results. Please note that assigned colors are only used to distinguish the line segments visually. (a) Reference graspable edges: each edge is manually marked by a line segment, (b) detected graspable edges: marked points are detected by algorithm as graspable edges, (c) detected line segments: each detected edge is represented by several line segments.
Figure 15. The entire set of objects used in the real-world experiments (16 objects).
Figure 16. Multi-object experiment scenes including a variety of objects. From left to right: scenes No. 2, No. 3, and No. 6.
Figure 17. A sequence of snapshots of the robot arm approaching and grasping objects in a cluttered scene.
Figure 18. Outcomes of the grasp planning algorithm stages. Left: color map of the gradient direction; middle: binary image of the detected depth edges; right: detected grasp candidates for a parallel gripper.
Table 1. Simulation results. Columns give the numbers of (G)raspable and (D)etected objects, surface segments, and edges for the 8 scenes drawn from the benchmark OSD dataset. The last row indicates the average detection accuracy rates at the object, surface, and edge levels.
Scene   Objects                 G. Object   D. Object   G. Surface   D. Surface   G. Edge   D. Edge
No.1    Boxes                   3           3           6            6            17        14
No.2    Boxes                   3           3           8            8            20        17
No.3    Cylinders               3           3           6            5            12        10
No.4    Cylinders               5           5           10           9            20        19
No.5    Mixed-low cluttered     6           6           13           9            28        21
No.6    Mixed-low cluttered     7           7           13           9            28        22
No.7    Mixed-high cluttered    11          11          24           17           55        42
No.8    Mixed-high cluttered    12          10          22           16           49        33
Average detection accuracy rate: 97% (object level), 81% (surface level), 80% (edge level)
Table 2. Single-object experiment results. Four attempts were performed for each object. “L” indicates large-size and “S” indicates small-size objects.
Object              % Succ.     Object              % Succ.
Toothpaste Box      100         L Box               100
S Blue Box          75          L Paper Cup         100
Banana              100         L Plastic Cup       100
S Paper Cup         75          Green Cylinder      100
Apple Charger       75          L Pill Container    100
Tropicana Bottle    100         Chips Container     75
S Pill Container    100         Smoothie Bottle     100
Mouse               50          Fruit Can           100
Average: 90.62%
Table 3. Multi-object experiment results. The success rate is the number of objects grasped successfully out of the total number of objects in the scene.
Scene No.   Objects     Grasped Objects   Total Attempts
1           Boxes       4 out of 4        4
2           Boxes       5 out of 5        5
3           Cylinders   4 out of 5        6
4           Cylinders   5 out of 5        6
5           Mix.        5 out of 6        8
6           Mix.        5 out of 8        8
