Pose Estimation of Primitive-Shaped Objects from a Depth Image Using Superquadric Representation

Abstract: This paper presents a method for estimating the six Degrees of Freedom (6DoF) pose of texture-less, primitive-shaped objects from depth images. Conventional methods for object pose estimation require rich texture or geometric features on the target objects, so they are not suitable for texture-less and geometrically simple objects. To estimate the pose of a primitive-shaped object, the parameters that represent its primitive shape are estimated. However, existing methods explicitly limit the types of primitive shapes that can be estimated. We employ superquadrics as a primitive shape representation that can express various types of primitive shapes with only a few parameters. To estimate the superquadric parameters of a primitive-shaped object, the point cloud of the object must be segmented from the depth image. The parameter estimation is known to be sensitive to outliers, which are caused by the mis-segmentation of the depth image. Therefore, we propose a novel estimation method for superquadric parameters that is robust to outliers. For the experiments, we constructed a dataset in which a person grasps and moves primitive-shaped objects. The experimental results show that our estimation method outperformed three conventional methods and a baseline method.


Introduction
The 3D pose estimation and tracking of objects play an important role in object grasping by robots, scene understanding, augmented/virtual reality, and other applications. In the computer vision field, numerous methods have been proposed to estimate the six Degrees of Freedom (6DoF) pose of an object from an RGB image [1][2][3] or a depth image [4][5][6]. Most approaches extract handcrafted features [4,5] or learned features [1,2]. Although feature-based methods are powerful for various types of objects, they require rich textures or rich geometric features on the objects in order to detect feature points for matching.
3D objects can be tracked by sequentially estimating the 6DoF pose of the object. To estimate the object pose between successive frames, the Iterative Closest Point (ICP) algorithm [7] is widely employed [8,9]. ICP registers two point clouds by minimizing the Euclidean distance between corresponding points. However, when the ICP algorithm is applied to objects that have a limited number of geometric features, the pose estimation is inaccurate and unstable due to the difficulty of obtaining correct point correspondences. In this paper, we aim to tackle the problem of pose estimation for geometrically simple (primitive-shaped), texture-less objects from sequential depth images. In this case, the above feature point-based methods and ICP pose estimation are unsuitable.
As primitive shapes can be represented by just a few parameters, a model fitting method, such as RANdom SAmple Consensus (RANSAC) [10] or the Hough voting algorithm [11], can be applied to estimate the pose and shape parameters of primitive-shaped objects. For example, three shape parameters (height, width, and depth) represent a cuboid, two parameters (height and radius) represent a cylinder, and one parameter (radius) represents a sphere. These parameters can be estimated by defining a cost function for each primitive shape representation. However, these methods explicitly use a limited set of shape representations, which in turn limits the applicability of primitive shape pose estimation.
The superquadric is an ideal shape representation that can express various kinds of shapes with a single equation [12]. Fitting a superquadric to an object enables the object to be expressed as one of various primitive shapes, such as cuboids, cylinders, and spheres, with only a few parameters of the equation. As we aim to estimate the pose of geometrically simple objects, we assume that the objects can be represented by superquadrics.
A naive approach to estimating the pose of objects represented by superquadrics is to apply the method proposed by Solina et al. [13], which estimates the superquadric parameters and the 6DoF pose parameters from the 3D point cloud of the object. After extracting the 3D point cloud of the object from the depth image, the superquadric and pose parameters can be estimated with their method. However, because the optimization takes the object's 3D point cloud as input, the depth image must first be segmented, and the result of the parameter estimation therefore relies on the quality of the depth image segmentation. That is, if the object point cloud contains outlier points caused by mis-segmentation, parameters that fit the outliers are estimated, leading to low pose estimation accuracy.
To achieve robust superquadric pose estimation, it is important to exclude the 3D points that do not belong to the object. One simple idea for excluding outlier points is to employ a threshold value to cut off the points that are far from the object centroid. However, the threshold value differs among objects according to their scale and shape, so this would require defining the threshold hyperparameter for each object. Therefore, we introduce a coefficient that implicitly down-weights outlier points and is invariant to the shape and scale of objects.
In this paper, we present a method for estimating the 6DoF pose of primitive-shaped objects from sequential depth images. To estimate the pose parameters, we propose a novel pose prediction method for primitive-shaped objects using a superquadric representation that is robust to outliers and independent of the shape or scale of the object. The results of the proposed method are shown in Figure 1. At the initial frame, we geometrically segment the depth image using normal vector concavities and depth discontinuities. Second, we label the binary segmented depth image using a connected component algorithm. Thereafter, we estimate the superquadric shape, scale, and pose parameters for each primitive-shaped object. In successive frames, we match the label maps of the initial frame and the current frame to find each object, and we update the pose parameters of each primitive-shaped object. Our method estimates the shape, size, and pose of the primitive-shaped objects in the scene, and the detected objects are tracked sequentially. The results of the pose estimation are overlaid onto the RGB image for visualization only; we do not use any color information, but only the depth images.
As our method enables us to estimate the pose of superquadrics even if outlier points exist in the object point cloud, it can handle the case in which a person freely moves the objects, where mis-segmentation easily occurs. In the experiments, we captured scenes in which a user interacts with four primitive-shaped objects to show the robustness of our proposed method. We compare the pose estimation results with three conventional methods and a baseline method. The experimental results show that our method outperformed these methods, verifying its effectiveness.

Related Work
As our work addresses primitive shape pose estimation using superquadric representation, we review the pose estimation of primitive shapes and research using superquadric representation.

Pose Estimation of Primitive-Shaped Objects
Recently, due to the development of deep learning techniques, 6DoF pose estimation from a single RGB image using Convolutional Neural Networks has been well developed [1,14]. However, as we aim to estimate the pose of visually simple (no color) objects, RGB information does not contribute to the estimation. The object pose can also be estimated from a single depth image. However, pose estimation methods based on learnable features [1,14] or handcrafted features [4,15] require rich textures or rich geometric keypoints. As we aim to estimate the pose of geometrically and visually simple objects, these approaches are unsuitable.
Numerous approaches to primitive shape pose estimation employ the RANSAC algorithm to estimate the pose of the primitives [16][17][18][19][20][21][22]. Ana et al. [19] fit the parameters of a plane, sphere, and cylinder to the object point cloud by M-estimator SAmple Consensus (MSAC) [23], and the final primitive shape is selected based on the number of inliers during estimation. Drost et al. [22] employ a local Hough Transform algorithm to estimate pre-defined geometric primitive shapes. Most approaches define each primitive shape separately; for example, three parameters (width, height, and depth) are needed to represent a cuboid, and two parameters (radius and height) are needed to represent a cylinder. As all previous work represents primitive-shaped objects in this way, only limited types of primitive shapes can be handled. In contrast, our method can handle any primitive shape as long as it can be represented by superquadrics.
Sano et al. [16] proposed a method to estimate the pose of a fixed-size cuboid for Spatial Augmented Reality. Although they handle a single type of primitive shape, their method estimates the pose of the cube sequentially while the user interactively moves the target object. They estimate the pose parameters of cubes by efficient planar region detection using RGB-D superpixel segmentation. Their method has two main limitations: first, the size of the objects must be known beforehand and, second, they can only estimate the pose of and track cuboids. Our work overcomes both limitations.

Superquadrics
Superquadric functions are an extension of quadric surfaces and include supertoroids, superhyperboloids, and superellipsoids. Superellipsoids are most commonly used in object modeling because they define closed surfaces. Examples of elementary objects that can be represented with superellipsoids are depicted in Figure 2. Recently, superquadrics have been widely used for shape abstraction [24], object grasping [25,26], object localization [27], and object recognition [28].
A superquadric in an object-centered coordinate system can be defined by the inside-outside function with a scale parameter (s_x, s_y, s_z) and a shape parameter (ε_1, ε_2):

F(x, y, z; Λ) = ((x/s_x)^{2/ε_2} + (y/s_y)^{2/ε_2})^{ε_2/ε_1} + (z/s_z)^{2/ε_1},    (1)

where Λ is the tuple (s_x, s_y, s_z, ε_1, ε_2). Parameters s_x, s_y, and s_z are scale parameters that define the superquadric size along the x, y, and z axes, respectively. The superquadric function F in Equation (1) with a unit scale can be rewritten with the following parametric equation:

x(η, ω) = cos^{ε_1} η cos^{ε_2} ω,
y(η, ω) = cos^{ε_1} η sin^{ε_2} ω,    (2)
z(η, ω) = sin^{ε_1} η,
−π/2 ≤ η ≤ π/2, −π ≤ ω < π.

Parameters ε_1 and ε_2 are shape parameters that express the squareness along the z axis and in the x-y plane, respectively. Given a point (x, y, z), if F < 1, the point is inside the superquadric; if F > 1, the point is outside the superquadric; and if F = 1, the point lies on the surface of the superquadric. Further, the inside-outside description can be expressed in a generic coordinate system by adding six additional variables representing the superquadric pose Φ (three for translation (t_x, t_y, t_z) and three Euler angles for rotation (θ_x, θ_y, θ_z)), for a total of eleven independent variables, i.e., q ∈ R^11.
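The inside-outside test is straightforward to implement. The sketch below is a minimal NumPy implementation of the reconstructed form of Equation (1) for a batch of points in the object-centered frame; the function name is ours. Absolute values keep the fractional powers real-valued, which is valid because the surface is symmetric in each octant.

```python
import numpy as np

def inside_outside(p, scale, eps):
    """Superquadric inside-outside function F of Equation (1).

    p     : (N, 3) array of points in the object-centered frame
    scale : (s_x, s_y, s_z)
    eps   : (eps_1, eps_2)
    Returns F(p); F < 1 inside, F = 1 on the surface, F > 1 outside.
    """
    sx, sy, sz = scale
    e1, e2 = eps
    x, y, z = p[:, 0], p[:, 1], p[:, 2]
    # Absolute values keep the fractional exponents real-valued.
    xy = (np.abs(x / sx) ** (2.0 / e2) + np.abs(y / sy) ** (2.0 / e2)) ** (e2 / e1)
    return xy + np.abs(z / sz) ** (2.0 / e1)
```

For a unit sphere (scale (1, 1, 1) and ε_1 = ε_2 = 1), F reduces to x² + y² + z².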

Methodology
The overview of our proposed method is illustrated in Figure 3. At the initial frame of the scene obtained from a depth sensor, such as Kinect, the superquadric and pose parameters are estimated from the segmented 3D point cloud of each object. At the successive frames, we do not re-estimate the superquadric parameters, but only update the pose parameters, due to the instability of superquadric parameter estimation.
The steps in Sections 3.1 and 3.2 are applied to both the initial and successive frames. The step in Section 3.3 is applied only to the initial frame, and the steps in Sections 3.4 and 3.5 are applied only to the successive frames.

Preprocessing
First, in the preprocessing stage, a depth map D_t at the current frame t is transformed into a metric vertex map V_t(u) = D_t(u) K^{−1} ũ, with the known camera intrinsic matrix K, a depth map pixel u = (x, y)^T in the image domain u ∈ Ω ⊂ R^2, and its homogeneous representation ũ. The vertex map V_t(u) is then smoothed by applying a median filter. The normal map of the current frame, N_t, is simply generated from the vertex map V_t by a cross-product of neighboring pixels [29].
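The back-projection and normal-map steps can be sketched in a few lines of NumPy. This is an illustrative implementation only (the median-filter smoothing step is omitted), and the function names are ours, not from the paper.

```python
import numpy as np

def depth_to_vertex_map(depth, K):
    """Back-project a depth map D_t (in meters) into a vertex map V_t
    with the intrinsic matrix K: V_t(u) = D_t(u) * K^{-1} [x, y, 1]^T."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(depth)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T      # per-pixel viewing rays K^{-1} u~
    return rays * depth[..., None]       # scale each ray by its depth

def vertex_to_normal_map(V):
    """Normals from the cross-product of neighboring vertex differences."""
    dx = V[:, 1:] - V[:, :-1]            # horizontal neighbor difference
    dy = V[1:, :] - V[:-1, :]            # vertical neighbor difference
    n = np.cross(dx[:-1, :], dy[:, :-1])
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.where(norm > 0, norm, 1.0)
```

For a constant depth map the recovered surface is a fronto-parallel plane, so every normal points along the camera z axis.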
We assume that the z-axes of all the objects are parallel to the normal vector of the floor plane at the initial depth frame D_0. Plane estimation is applied to the initial depth map as preprocessing to extract the normal vector of the floor plane, n_p. We apply RANSAC-based plane parameter estimation to estimate the vector n_p.

Geometric Segmentation
As primitive shapes have only convex surfaces, we segment the depth map D_t into convex regions. Tateno et al. [30] adapted a concave region penalty to the normal edge-based segmentation in their SLAM pipeline, and we employ their segmentation approach. They classify whether each pixel is an object edge by employing two operators. The first operator detects concave boundaries by computing the dot product between the normal vector of the target pixel and the normal of each of its eight-connected neighboring pixels. The second operator takes into account the maximum 3D point-to-plane distance between the target pixel and its eight neighbors. To set the threshold of the second operator, we employ an uncertainty measure computed following the noise model of [31].
As a result, we obtain a binary geometric edge map B_t at frame t from the input depth frame. To the edge map B_t, we apply a four-neighborhood connected component analysis algorithm to yield a label map L_t, where each element u is associated with a segment label L_t(u) = l_j. Unlike the method in [30], we do not segment the points on the floor plane, in order to extract only the object point clouds.
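A minimal sketch of the two edge operators and the connected-component labeling, assuming simple fixed thresholds in place of the sensor-noise model of [31]; the threshold values and function names below are illustrative, not the paper's.

```python
import numpy as np
from scipy import ndimage

def geometric_edges(V, N, dot_th=0.94, dist_th=0.01):
    """Binary edge map B_t from concave boundaries and depth discontinuities.
    V: (H, W, 3) vertex map, N: (H, W, 3) normal map.
    dot_th / dist_th are illustrative thresholds."""
    H, W, _ = V.shape
    edges = np.zeros((H, W), dtype=bool)
    for dv, du in [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                   (0, 1), (1, -1), (1, 0), (1, 1)]:
        Vn = np.roll(V, (dv, du), axis=(0, 1))
        Nn = np.roll(N, (dv, du), axis=(0, 1))
        # Operator 1: concavity -- normals disagree and the neighbor lies
        # below the tangent plane of the target pixel.
        concave = (np.einsum('ijk,ijk->ij', N, Nn) < dot_th) & \
                  (np.einsum('ijk,ijk->ij', Vn - V, N) < 0)
        # Operator 2: point-to-plane distance (depth discontinuity).
        jump = np.abs(np.einsum('ijk,ijk->ij', Vn - V, N)) > dist_th
        edges |= concave | jump
    return edges

def label_segments(edges):
    """Four-neighborhood connected components on the non-edge pixels."""
    structure = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
    labels, n = ndimage.label(~edges, structure=structure)
    return labels, n
```

A flat region produces no edges and therefore a single segment label.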

Superquadric Fitting
The i-th superquadric surface O_i, which best represents the object, is estimated from the given K 3D points p_k = (x_k, y_k, z_k) of the i-th object's point cloud. The superquadric surface O_i is represented by Λ_i and Φ_i. The minimization of the algebraic distance from the points to the superquadric surface can be formulated as a non-linear least-squares problem:

min_{Λ,Φ} Σ_{k=1}^{K} ( √(s_x s_y s_z) (F(Tr_Φ(p_k); Λ) − 1) )²,    (3)

where (F(Tr_Φ(p_k); Λ) − 1)² imposes the point-to-surface distance minimization, and the term √(s_x s_y s_z), which is proportional to the superquadric volume, compensates for the fact that the distance term alone is biased toward larger superquadric surfaces. Tr_Φ(·) is a rigid transformation by the 6DoF pose Φ. The Levenberg-Marquardt algorithm [32] is used to minimize the non-linear function in Equation (3). Moreover, Equation (3) is numerically unstable when ε_1, ε_2 < 0.1, and the superquadric surface has concavities when ε_1, ε_2 > 2. We therefore constrain the shape parameters to 0.1 ≤ ε_1, ε_2 ≤ 2 and the scale parameters to s_x, s_y, s_z > 0 when minimizing Equation (3).
As the function in Equation (3) is not convex, the initial parameters determine to which local minimum the minimization converges. It is therefore important to roughly estimate the translation, rotation, scale, and shape parameters. First, as it is difficult to roughly estimate the shape of the object, the initial shape parameters ε_1 and ε_2 are set to 1, which means that the initial model is an ellipsoid. Second, the centroid of all 3D data points is used as the initial translation. Third, to compute the initial rotation, we compute the covariance matrix of all 3D data points and obtain its three pairs of eigenvectors and eigenvalues. The eigenvector with the largest eigenvalue points in the direction of the largest variance of the data, and the variance along it equals the corresponding eigenvalue. The second eigenvector is orthogonal to the first and points in the direction of the second-largest spread of the data, and likewise for the third. Therefore, the eigenvectors can be used as the initial rotation parameters, and the eigenvalues can be used to set the initial scale parameters.
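The fitting and initialization procedure above can be sketched with SciPy. This is not the authors' implementation: it uses scipy.optimize.least_squares with a trust-region solver (which supports the box constraints on ε_1 and ε_2) rather than a bare Levenberg-Marquardt routine, and all function names are ours.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def fit_superquadric(P):
    """Fit Lambda = (s_x, s_y, s_z, eps_1, eps_2) and pose Phi to points
    P (N, 3) by minimizing Equation (3). Initialization follows the text:
    centroid -> translation, PCA eigenvectors -> rotation, eigenvalues ->
    scale, eps_1 = eps_2 = 1 (ellipsoid)."""
    c = P.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov((P - c).T))  # ascending order
    R0 = evecs[:, ::-1]                               # largest variance first
    if np.linalg.det(R0) < 0:                         # keep a right-handed frame
        R0[:, 2] *= -1
    s0 = np.sqrt(np.maximum(evals[::-1], 1e-6)) * 2.0
    q0 = np.concatenate([s0, [1.0, 1.0], c,
                         Rotation.from_matrix(R0).as_euler('xyz')])

    def residuals(q):
        s, e, t, ang = q[:3], q[3:5], q[5:8], q[8:]
        R = Rotation.from_euler('xyz', ang).as_matrix()
        Pc = (P - t) @ R                              # into the object frame
        xy = (np.abs(Pc[:, 0] / s[0]) ** (2 / e[1]) +
              np.abs(Pc[:, 1] / s[1]) ** (2 / e[1])) ** (e[1] / e[0])
        F = xy + np.abs(Pc[:, 2] / s[2]) ** (2 / e[0])
        return np.sqrt(np.prod(s)) * (F - 1.0)        # volume-bias compensation

    lo = [1e-3] * 3 + [0.1, 0.1] + [-np.inf] * 6
    hi = [np.inf] * 3 + [2.0, 2.0] + [np.inf] * 6
    sol = least_squares(residuals, q0, bounds=(lo, hi), method='trf')
    return sol.x
```

On clean surface samples of a unit sphere, the recovered scales converge near 1 and the translation near the origin (the rotation is arbitrary for a sphere).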

Label Matching
As the labels of the label maps L_t and L_{t+1} do not correspond to each other, labels between sequential frames must be matched. We re-label the label map L_{t+1} using the overlapping area of the two frames. We denote by Π_t(l_i, l_j) the number of pixels u for which the label of L_t(u) is l_i and the label of L_{t+1}(u) is l_j. We normalize this term in the following manner:

Π̃_t(l_i, l_j) = Π_t(l_i, l_j) / Π_t(l_i),    (4)

where Π_t(l_i) is the number of pixels with the label l_i in the label map L_t. We re-label l_j by finding the label l* in L_t that maximizes Π̃_t(l_i, l_j):

l* = argmax_{l_i ∈ L_t} Π̃_t(l_i, l_j).    (5)

If max_{l_i ∈ L_t} Π̃_t(l_i, l_j) > τ, we re-label the label l_j in L_{t+1} to l*.
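The overlap-based re-labeling can be sketched directly. The sketch assumes label 0 marks edge/unlabeled pixels and uses an illustrative τ = 0.5 (the text does not specify the value of τ); the function name is ours.

```python
import numpy as np

def match_labels(L_prev, L_next, tau=0.5):
    """Re-label L_{t+1} to agree with L_t using the normalized overlap.
    tau is the acceptance threshold; label 0 = unlabeled / edge pixels."""
    out = L_next.copy()
    for lj in np.unique(L_next):
        if lj == 0:
            continue
        mask_j = (L_next == lj)
        overlap = L_prev[mask_j]                 # previous labels under segment lj
        best_score, best_li = 0.0, None
        for li in np.unique(overlap):
            if li == 0:
                continue
            pi = np.count_nonzero(L_prev == li)        # Pi_t(l_i)
            pij = np.count_nonzero(overlap == li)      # Pi_t(l_i, l_j)
            if pij / pi > best_score:                  # normalized overlap
                best_score, best_li = pij / pi, li
        if best_li is not None and best_score > tau:
            out[mask_j] = best_li                      # re-label l_j to l*
    return out
```

When the segments do not move between frames, each new label is simply mapped back to its old counterpart.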

Superquadric Tracking
As superquadric fitting is numerically unstable, we do not sequentially re-estimate the parameters Λ, based on the assumption that the shape of the objects does not change over time. We only update the pose parameters Φ at consecutive frames. The naive approach to estimating the pose Φ_t at frame t is

Φ_t = argmin_Φ Σ_{k=1}^{K} (F(Tr_Φ(p_k); Λ_i) − 1)²,    (6)

where Λ_i is the superquadric parameters estimated at the initial frame. However, the outliers caused by mis-segmentation lead to low pose estimation accuracy. Therefore, we down-weight the points that are far away from the center of the objects during optimization by introducing a coefficient, β.
A naive approach to down-weighting distant points is to employ a threshold on the Euclidean distance between the centroid of the object and each point. However, there are two problems with this. First, as the distance is defined on an absolute scale, the threshold value would differ between objects of different scales. Second, calculating the distance in Euclidean space is not compatible with superquadric parameter estimation, because the distance to a superquadric surface is non-linear. Therefore, we introduce a scale-invariant, non-linear threshold parameter to down-weight mis-segmented points for robust pose parameter estimation.
We introduce the coefficient β into the minimization in Equation (6) to eliminate the points distant from the origin in the object-centered coordinate system. First, we transform the point cloud p_k by the pose Φ_{t−1} at the previous frame:

p̃_k = Tr_{Φ_{t−1}}(p_k).    (7)

From this equation, it follows that

Tr_{Φ_t}(p_k) = Tr_{Φ_{t−1:t}}(Tr_{Φ_{t−1}}(p_k)) = Tr_{Φ_{t−1:t}}(p̃_k).    (8)

Instead of estimating the pose Φ_t directly, we estimate the residual pose Φ_{t−1:t}. Therefore, the equation below is applied to estimate the residual pose Φ_{t−1:t} (using (x)_+ ≡ max(0, x)):

Φ_{t−1:t} = argmin_Φ Σ_{k=1}^{K} β_k (F(Tr_Φ(p̃_k); Λ_i) − 1)²,  with  β_k = (β_th − (F(p̃_k; Λ_i) − 1))_+.    (9)

Since β_k vanishes for points that lie far outside the superquadric surface at the previous pose, mis-segmented points are implicitly excluded from the optimization, and since F is dimensionless, β is invariant to the scale of the object. We set the hyperparameter β_th to 0.75 throughout the experiments.
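A sketch of the residual-pose update with the down-weighting coefficient. The weight form β_k = (β_th − (F(p̃_k) − 1))_+ is our reading of the clipped, scale-invariant coefficient described in the text, and all function names are ours.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def inside_outside(Pc, s, e):
    """Inside-outside function F for object-frame points Pc (N, 3)."""
    xy = (np.abs(Pc[:, 0] / s[0]) ** (2 / e[1]) +
          np.abs(Pc[:, 1] / s[1]) ** (2 / e[1])) ** (e[1] / e[0])
    return xy + np.abs(Pc[:, 2] / s[2]) ** (2 / e[0])

def track_pose(P, s, e, R_prev, t_prev, beta_th=0.75):
    """Estimate the residual pose Phi_{t-1:t} with outlier down-weighting.
    Points far outside the surface at the previous pose get zero weight."""
    Pprev = (P - t_prev) @ R_prev                 # previous object frame
    F0 = inside_outside(Pprev, s, e)
    beta = np.maximum(0.0, beta_th - (F0 - 1.0))  # scale-invariant weight

    def residuals(q):
        R = Rotation.from_euler('xyz', q[3:]).as_matrix()
        Pc = (Pprev - q[:3]) @ R
        return beta * (inside_outside(Pc, s, e) - 1.0)

    sol = least_squares(residuals, np.zeros(6), method='lm')
    dt, dR = sol.x[:3], Rotation.from_euler('xyz', sol.x[3:]).as_matrix()
    # Compose the residual transform with the previous pose.
    return R_prev @ dR, t_prev + R_prev @ dt
```

Tracking surface samples of a unit sphere that have drifted slightly from the previous pose recovers the small translation.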

Dataset
As there is no dataset that contains sequences in which a person moves primitive-shaped objects, we created a dataset using Kinect v1 to evaluate our pose estimation method. We did not conduct the evaluation on a synthetic dataset, because the effectiveness of the proposed method can only be validated with data captured in a real environment. The main task of this paper is to robustly estimate the pose of primitive-shaped objects when the object point clouds cannot be accurately segmented. Although sensor noise can be added to synthetic data, it is difficult to simulate mis-segmentation.
Four primitive-shaped objects were used in the experiments: a cube (edge length = 20 cm), a tall cylinder (radius = 7.5 cm, height = 40 cm), a wide cylinder (radius = 15 cm, height = 5 cm), and a half sphere (radius = 10 cm). These objects are illustrated in Figure 4. The dataset contains 10 scenes, and each scene comprises approximately 220 frames. In each scene, the user lifted, piled, and moved each object. Example frames from the scenes are depicted in Figure 5. The ground-truth pose is needed to evaluate our proposed method, and we used CloudCompare [33] to annotate it. First, synthesized primitive-shaped objects are generated; the detailed procedure for generating the object point cloud is explained in a later section. For example, we set the superquadric parameters (s_x, s_y, s_z, ε_1, ε_2) = (10.0, 10.0, 10.0, 0.1, 0.1) to generate the cube, and (s_x, s_y, s_z, ε_1, ε_2) = (7.5, 7.5, 20.0, 0.1, 1.0) to generate the tall cylinder. Then, we manually align the generated model to the point cloud of each frame. By extracting the transformation matrix after the alignment, we obtain the ground-truth pose of each primitive-shaped object.
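For illustration, surface points of such ground-truth models can be generated from the parametric form in Equation (2). This is naive parametric sampling (which, as noted in the evaluation section, is biased toward high-curvature regions); the function name is ours.

```python
import numpy as np

def sample_superquadric(scale, eps, n_eta=40, n_omega=80):
    """Sample surface points of a superquadric from the parametric form of
    Equation (2), scaled by (s_x, s_y, s_z). Illustrates how models such as
    the cube (eps = (0.1, 0.1)) or tall cylinder can be synthesized."""
    sx, sy, sz = scale
    e1, e2 = eps
    eta = np.linspace(-np.pi / 2, np.pi / 2, n_eta)
    omega = np.linspace(-np.pi, np.pi, n_omega, endpoint=False)
    eta, omega = np.meshgrid(eta, omega)

    def f(w, exp):
        # Signed power sign(w)*|w|^exp keeps the surface in all octants.
        return np.sign(w) * np.abs(w) ** exp

    x = sx * f(np.cos(eta), e1) * f(np.cos(omega), e2)
    y = sy * f(np.cos(eta), e1) * f(np.sin(omega), e2)
    z = sz * f(np.sin(eta), e1)
    return np.stack([x.ravel(), y.ravel(), z.ravel()], axis=1)
```

With unit scale and ε_1 = ε_2 = 1 the sampled surface is exactly the unit sphere; with ε = (0.1, 0.1) the points stay within the cube's bounding box.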

Evaluation Metrics
We employed two metrics: 3D error and 2D error. The 3D error measures the 3D Euclidean distance between the point cloud transformed by the ground-truth 6DoF pose and that transformed by the predicted pose. The 2D error measures the distance in image pixels between the projections of a 3D point onto the image using the ground-truth pose and the predicted pose.
For the 3D error, we employed the 3D average distance metric for symmetric objects proposed by Hinterstoisser et al. [34]:

e_3D = (1/|ν|) Σ_{x_1 ∈ ν} min_{x_2 ∈ ν} || Tr_Φ̂(x_1) − Tr_Φ(x_2) ||,    (10)

where x_1 and x_2 are points sampled from the object point cloud ν, Φ̂ is the ground-truth pose, and Φ is the predicted pose.
For the 2D error, we employed the 2D projection metric, which is suited for applications such as augmented reality [35]. The error is calculated as follows:

e_2D = (1/|ν|) Σ_{x ∈ ν} || Γ(Tr_Φ̂(x)) − Γ(Tr_Φ(x)) ||,    (11)

where Γ projects a 3D point onto the 2D image plane using the camera intrinsic parameters. For both metrics, lower is better. We sample M points from the superquadric surface to generate x. Sampling points from a unit superquadric surface according to Equation (2) emphasizes the points that have high curvature. For an unbiased sample distribution, we apply equidistant sampling using spherical angles, as introduced by Bardinet et al. [36]. The generated ground-truth models are visualized in Figure 4 (right). Note that these models are only used to annotate the ground-truth pose and calculate the evaluation metrics.
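Both metrics are simple to compute given the sampled model points. A sketch with function names of our own and a brute-force nearest-neighbor search for the symmetric 3D metric:

```python
import numpy as np

def add_s_error(model, R_gt, t_gt, R_pred, t_pred):
    """Average distance to the closest point (symmetric-object 3D metric).
    model: (N, 3) points sampled from the object surface."""
    gt = model @ R_gt.T + t_gt
    pred = model @ R_pred.T + t_pred
    # For each ground-truth point, distance to the nearest predicted point.
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def projection_error(model, K, R_gt, t_gt, R_pred, t_pred):
    """2D projection metric: mean pixel distance between the projections."""
    def project(R, t):
        pc = model @ R.T + t
        uv = pc @ K.T
        return uv[:, :2] / uv[:, 2:3]   # perspective division
    return np.linalg.norm(project(R_gt, t_gt) - project(R_pred, t_pred),
                          axis=1).mean()
```

Identical poses give zero error under both metrics; a small translation of the predicted pose produces a correspondingly small 3D error and a non-zero pixel error.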

The Comparison with Other Methods and the Baseline Method
In order to evaluate the effectiveness of our pose estimation, we compared the pose estimation accuracy with four methods. First, we compare with the Iterative Closest Point (ICP) algorithm with the point-to-plane metric [7], which achieves faster convergence than the point-to-point metric. Unlike the point-to-point metric, which has a closed-form solution, the point-to-plane metric is usually solved with a standard nonlinear least-squares method; we employ the Levenberg-Marquardt algorithm. Second, we compare with the feature-based RANSAC method proposed by Buch et al. [37]. We employ the Fast Point Feature Histograms (FPFH) feature extractor to extract point features and match the features using RANSAC with a pre-rejection step in the pose estimation loop, in order to avoid verifying pose hypotheses that are likely to be wrong. Third, we compare with the Normal Distributions Transform (NDT) algorithm [38]. Finally, we compare with our baseline method, which does not employ the coefficient β in Equation (9), to evaluate the effectiveness of implicitly excluding outlier points.
Note that the baseline and our method estimate the shape, scale, and pose of each object at the initial frame, and only the pose parameters are updated in the successive frames. For the other methods (ICP [7], NDT [38], RANSAC [37]), we estimate the pose between two successive frames using the objects' point clouds.
Sano et al. [16] also estimate the pose of a cube from a depth image. However, conventional methods for primitive shape pose estimation must know the type of the primitive-shaped object (e.g., cuboid, cylinder, or sphere) before estimating the pose. Instead, we employ superquadrics to represent the primitive shapes so that the representative primitive shapes can be expressed with a single equation. Moreover, the method proposed by Sano et al. [16] assumes that two faces of the cuboid are visible from the camera. As a fair comparison against these methods is difficult, we did not compare against them.

Qualitative Evaluation
In this section, we evaluate the accuracy of the pose estimated by our approach. The qualitative results are illustrated in Figure 6. We present the results of superquadric and pose parameter estimation from four scenes. The images in column Figure 6a depict RGB images of the scenes, and Figure 6b depicts the corresponding estimation results.
We compare the pose estimation results with three conventional methods and one baseline method. The qualitative results are shown in Figure 7. Note that the estimated superquadric surface is used to visualize the result of the pose estimation. In these scenes, mis-segmentation occurred due to the interaction between the user and the object (marked with red rectangles). Although the pose estimation failed with the baseline and the conventional methods (Figure 7b-e), our proposed method successfully estimated the pose of each object (Figure 7f). As the geometric segmentation cannot distinguish the pixels of the person from those of the object, the pose estimation is conducted with all the labeled pixels. This verifies the robustness of the proposed method against the mis-segmentation of object pixels.

Quantitative Evaluation
The quantitative results for each object are summarized in Table 1. We calculated the average error over all 10 scenes in our dataset and summarize it per primitive-shaped object. We exclude the frames in which the point cloud matching failed (Section 3.4) in order to compare the object pose estimation accuracy in isolation. It can be seen that the proposed method outperforms the conventional methods [7,37,38] and the baseline method for all of the objects. Even though the RANSAC-based method [37] outperformed the baseline for cylindrical objects, introducing the coefficient β improves the performance.
The quantitative results for each scene are summarized in Table 2. It can be seen that our method outperformed the conventional and baseline methods for most of the scenes. For the scenes in which the baseline outperforms the proposed method, our robust pose estimation method failed to track the objects. An example failure is visualized in Figure 8. As our method implicitly excludes the 3D points that are distant from the superquadric surface, the point clouds of fast-moving objects are not considered for pose estimation. In the bottom figure, the pose of the sphere was not updated due to the fast movement of the object.
Table 2. Error of primitive shape pose estimation (per scene) using our dataset. S1 to S10 denote the indices of the scenes in the dataset. The rightmost column shows the average over all scenes.

Discussion
Currently, the proposed method cannot run at real-time speed (30 FPS). Although geometric segmentation and label matching run at a reasonable frame rate (over 30 FPS), pose estimation cannot. The computational time of the proposed method is summarized in Table 3. Note that the proposed system was implemented on a Windows 10 64-bit laptop with an Intel Core i7-6950X 3.00 GHz CPU and 16 GB memory. We did not use any GPUs in the system. Further, we confirmed that the pose update process is the bottleneck. Even if there is a single object in the scene, the pose update takes 97 ms, which is not real-time.
Table 3. Computational time of each step (ms). Note that the pose of each object is estimated independently, and the overall system takes longer to run if a large number of objects exist in the scene.

To achieve the same results in a real-time system, we can employ the fast parameter estimation method proposed by Duncan et al. [39]. They estimate the superquadric parameters in near real-time by recursively down-sampling the input point cloud until the optimization fails. Although they achieved 40 ms, this is still not suitable for augmented reality or robot grasping systems. Another solution is implementing the Levenberg-Marquardt algorithm on a GPU [40]. As our system currently does not use any GPU acceleration, a GPU implementation can be expected to estimate the parameters in real time.
In this paper, three types of primitive shapes (cuboid, sphere, and cylinder) and four objects were used to construct the dataset. Even though the types of primitive shaped objects are limited, the dataset includes the representative primitive shapes that are used in the conventional primitive shape estimation methods [16,19]. Also, unlike existing methods, our method uses superquadrics for the primitive shape representation, which enables us to estimate the shape and the pose regardless of the type of the primitive shape (sphere, cylinder, cuboid, etc.). Although only four objects were used in the experiments, Figures 6 and 7 show that the superquadrics (shape and scale) parameters were correctly estimated. As the shape and scale parameters are estimated at the initial frame, our method can be extended to other primitive shaped objects as long as the objects can be represented by superquadrics. The purpose of this paper is not object shape classification, so we did not conduct experiments with a variety of primitive shaped objects, but this can be an important subject for future study.

Conclusions
In this paper, we proposed a method for estimating the six Degrees of Freedom (6DoF) pose of texture-less, primitive-shaped objects from a depth image. The method is robust to the outliers caused by the mis-segmentation of the depth image. To achieve robustness, we introduced a novel pose estimation method that implicitly ignores the points distant from the object surface, so that the optimization is conducted with only the object's point cloud. Further, we created a new dataset for the experiments, in which a user interactively moves primitive-shaped objects. The experimental results demonstrated the effectiveness of our estimation method through comparisons of the pose accuracy for primitive-shaped objects.
Several directions for future work can be considered. First, as mentioned in the discussion section, we aim to achieve real-time computation of the superquadric parameters. Second, our pose estimation method currently relies on preprocessing steps applied to the obtained depth image, such as geometric segmentation and segment matching, to extract the object point cloud. The pose of an object cannot be estimated if these preprocessing steps fail to extract its point cloud. We will further investigate methods to estimate the pose of objects from the raw depth image.