A Novel Metric-Learning-Based Method for Multi-Instance Textureless Objects’ 6D Pose Estimation

Abstract: 6D pose estimation of objects is essential for intelligent manufacturing. Current methods mainly emphasize pose estimation of a single object, which limits their use in real-world applications. In this paper, we propose a multi-instance framework for 6D pose estimation of textureless objects in an industrial environment. We use a two-stage pipeline for this purpose. In the detection stage, EfficientDet is used to detect target instances in the image. In the pose estimation stage, the cropped images are first interpolated to a fixed size and then fed into a pseudo-siamese graph matching network to calculate dense point correspondences. A modified circle loss is defined to measure the differences between positive and negative correspondences. Experiments on the antenna support demonstrate the effectiveness and advantages of our proposed method.


Introduction
Estimating the 6D pose, i.e., the 3D translation and 3D rotation of a target, is a fundamental problem in intelligent manufacturing, especially in applications such as object grasping [1,2], assembling [3,4], bin-picking [5,6], and stacking [7] with the help of visual sensors.
Visual sensors in an industrial environment can mainly be divided into three categories, namely, RGB, D, and RGB-D sensors. RGB sensors acquire only color information through a CMOS unit. D sensors use structured light, a lidar emitter-receiver, or a radar emitter-receiver to measure the distance from the camera to the target. RGB-D sensors combine RGB and D sensors and rely on calibration to assign color information to the depth measurements. However, D sensors have limitations in industrial environments [8]. On the one hand, depth sensors are not always useful there, because many objects, such as metal parts, glass, and ceramics, have non-Lambertian surfaces whose uncertain reflectance makes the depth unmeasurable. On the other hand, thanks to the fast development of deep learning in recent years, the performance of 6D pose estimation methods that use only RGB information is comparable to that of methods using RGB-D information [9,10]. Therefore, we focus on RGB-based 6D pose estimation in this paper.
Traditional methods use different kinds of hand-crafted descriptors [11][12][13] to extract features from the neighborhood of image points and thereby build feature descriptions of those points. Scale and rotation invariance is usually enforced to ensure that the same point of an object yields similar features under different viewpoints. These methods are sufficient for richly textured objects because of the varied color gradients on their surfaces; however, they cannot obtain distinguishable point features from textureless surfaces such as metal, glass, and ceramics. To solve this problem, geometric features such as lines [14,15], moments [16], circles [17], and gradients of edges [18,19], which can represent the geometric structure of an object, have been designed to describe the implicit features. Properties that are invariant to scale and rotation have also been studied for these geometric features [20]. However, geometric features usually describe the overall structure of an object. When they are invariant to rotation and scale, they are useful only for detecting the object in an image and lose the ability to distinguish different translations and rotations of the object.
With the fast development of deep learning technologies in recent years, many researchers have used deep neural networks to predict the 6D pose of a textureless object. SSD-6D [10] uses a direct regression strategy to predict a translation and orientation based on the popular Single Shot MultiBox Detector (SSD) object detection framework. DeepIM [21] proposes a CNN structure to iteratively measure the difference between the 2D image projection of the currently predicted pose and the real 2D image. A deep neural network that outputs the optical flow between the two images was designed to refine the current pose. The method in [22] combines semantic key-points predicted by a convolutional network with a deformable shape model to determine the 2D-3D correspondences. PVNet [9] regresses pixelwise vectors pointing to the key-points with a modified U-Net structure and proposes a voting scheme to decide the location of the key-points. HybridPose [23] extends the approach of PVNet [9] by using a hybrid intermediate representation to express different geometric information in the input image, including key-points, edge vectors, and symmetry correspondences. CosyPose [24] develops a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate the camera viewpoints and the 6D poses of all the objects in a single consistent scene.
Recently, finding dense correspondences with deep neural networks has shown advantages in 6D pose estimation [25,26]. A per-pixel matching scheme is used to design and train the network. In [26], a pseudo-siamese matching network was proposed to match dense correspondences in a high-dimensional feature space; the dense correspondences were then used to calculate the target pose through the Perspective-n-Point (PnP) method [27]. This method achieved state-of-the-art performance on the LineMod [28] and Occlusion-LineMod [29] datasets. However, both datasets contain only one instance of each object, and the network is designed to directly segment the object in the image. Thus, it is not applicable to multi-target pose estimation tasks. In this paper, we improve this method in two main aspects for industrial usage.
(1) We adopt EfficientDet [30] to first detect every object in the image. Each object is then cropped using the bounding box provided by EfficientDet and resized to a fixed size. All the resized images are fed into the correspondence matching network to predict dense correspondences. After the correspondences are obtained, the PnP-RANSAC method is used to calculate the 6D pose of the target. By adopting this two-stage network structure, we solve the problem of multi-instance 6D pose estimation.
(2) We introduce the circle loss, a well-known loss function in metric learning, to measure the similarities between pixelwise deep features from a 2D image and nodewise deep features from a 3D mesh model. We analyze why the softmax cross-entropy loss [31] used in [26] is not suitable for dense correspondence matching and compare the proposed masked circle loss with the softmax cross-entropy loss through ablation studies to show the superiority of the proposed loss.
In summary, our main contributions are a 6D pose estimation framework that can deal with multiple instances in a single frame and a novel metric learning loss that efficiently constrains the matching of the 2D-3D correspondences.
The remainder of this paper is organized as follows: In Section 2, we introduce the whole two-stage 6D pose estimation framework for multi-instance textureless objects. The masked circle loss for 2D-3D correspondences matching is introduced in detail. In Section 3, we test our proposed method on the pose estimation problem for the antenna support and compare it with some other state-of-the-art methods to show the effectiveness and advantages of our method. Conclusions are drawn in Section 4.

Methodology
Given an RGB image, the main purpose of 6D pose estimation is to predict a rotation matrix $R \in SO(3)$ and a translation vector $t \in \mathbb{R}^3$ from the object's coordinate system to the camera coordinate system. When the pose is accurately estimated, the transformation between the industrial robot and the object can be easily inferred for further actions such as object grasping or assembling. In fact, the 6D pose estimation problem can be divided into two subtasks: (1) find the target objects in the image; (2) calculate the poses of all the target objects. Most of the existing works [9,25,26,32] solve the two problems in a unified framework to boost performance on commonly used open evaluation datasets such as LINEMOD, Occlusion LINEMOD, and YCB Video. However, all of these datasets contain only a single target of each class in one frame. When there are many targets of the same type, these models cannot handle the situation well. Therefore, in this paper, we propose a two-stage framework that solves the two subtasks separately for multi-instance environments.

Overview
In this section, we introduce the framework of the proposed multi-instance pose estimation method in detail. The framework consists of four modules, namely, the object detection module, the mesh feature encoding module, the image feature encoding module, and the pose estimation module. The flowchart of the framework is shown in Figure 1. The input of the model is an RGB image taken by an industrial camera. The image is first fed into the object detection module to find the bounding box of each object in the image. We chose EfficientDet [30] for this module because it is lightweight and performs well among commonly used object detectors. EfficientDet offers a family of versions (D0 to D7) with different model sizes to fit the needs of various applications. The BiFPN used in EfficientDet can effectively extract useful features for different kinds of objects.
After the bounding box of each object in the image is obtained, the objects are cropped out of the image using the bounding boxes. We expand each bounding box by $\delta_w$ and $\delta_h$ in width and height, respectively, to ensure that the object lies inside the box. Bilinear interpolation is used to resize all the cropped images to a fixed size $(H_{crop}, W_{crop})$. Then, the resized images are fed in parallel into the image feature encoding module to obtain a deep representation for each pixel.
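The cropping step can be sketched in a few lines of plain Python. This is only an illustration and not the authors' implementation: the helper names (`expand_box`, `crop`, `bilinear_resize`), the choice of adding the margins on each side of the box, the clipping to the image bounds, and the corner-aligned sampling grid are all assumptions made for the example, and a single-channel image stored as nested lists stands in for an RGB tensor.

```python
def expand_box(box, delta_w, delta_h, img_w, img_h):
    """Grow (x0, y0, x1, y1) by delta_w / delta_h on each side (assumed), clipped to the image."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - delta_w), max(0, y0 - delta_h),
            min(img_w, x1 + delta_w), min(img_h, y1 + delta_h))


def crop(img, box):
    """Cut the box out of an image stored as a list of rows."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in img[y0:y1]]


def bilinear_resize(img, out_h, out_w):
    """Resize a single-channel image to (out_h, out_w) with bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        # Map the output row onto the input grid (corner-aligned convention).
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

Calling `bilinear_resize(crop(img, expand_box(box, dw, dh, W, H)), H_crop, W_crop)` for every detected box yields equally sized inputs that can be stacked into one batch.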
The image feature encoding module uses a U-Net as the backbone for feature extraction. The cropped images, all of the same size, are fed into the U-Net to extract deep features. Two groups of fully connected layers are designed to predict the semantic segmentation and the pixelwise deep representation, respectively. The function of the U-Net can be represented as $F_I = \Lambda_{\theta_{unet}}(I)$, where $I$ denotes the input cropped image of an object, $\theta_{unet}$ denotes the parameters of the U-Net model, and $F_I \in \mathbb{R}^{(C+1+D) \times H \times W}$ indicates the output tensor of the U-Net. The $F_I^{seg} \in \mathbb{R}^{(C+1) \times H \times W}$ part of the tensor is responsible for the semantic segmentation of $C$ object classes, while the $F_I^{feat} \in \mathbb{R}^{D \times H \times W}$ part represents the pixelwise $D$-dimensional deep features of the object in the image.
In the mesh feature encoding module, a 4-layer SplineCNN $F_M = \Phi_{\theta_{spline}}(M) \in \mathbb{R}^{D \times L}$ is used to extract nodewise deep features from the 3D mesh model. Here, $M$ is the 3D mesh model of the target object, $\theta_{spline}$ denotes the parameters of the SplineCNN, and $F_M$ represents the node features calculated by the SplineCNN, where $L$ is the number of nodes in the 3D mesh model. The affinity submodule explicitly provides an affine transformation between the pixelwise deep features and the nodewise deep features through Equation (1):
$$ s_{i,j} = (F_I^{x_i})^{\top} A_k F_M^{y_j} \qquad (1) $$
where $A_k \in \mathbb{R}^{D \times D}$ are the learnable parameters of the affinity submodule for the $k$-th object, and $F_I^{x_i} \in \mathbb{R}^D$ and $F_M^{y_j} \in \mathbb{R}^D$ are the deep features of pixel $x_i$ in the image and node $y_j$ in the 3D mesh model, respectively. This submodule enables the network to learn affine-invariant features that can be matched with each other through the feature similarity $s_{i,j} \in \mathbb{R}$.
In the pose estimation module, the deep features encoded by the image feature encoding module and the mesh feature encoding module are combined through a dot product to calculate the similarity of the correspondences. For each pixel, the node with the maximum similarity is chosen as its 2D-3D correspondence. As there is one correspondence from the 3D model for each object pixel in the RGB image, dense correspondences are directly obtained for the RANSAC-based PnP method to calculate the relative pose between the camera and the object.
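The matching step can be illustrated with the following minimal sketch. It assumes that the affinity takes the bilinear form $s = f^{\top} A g$ between a pixel feature $f$ and a node feature $g$; the function names are hypothetical, plain lists replace tensors, and the subsequent RANSAC-based PnP solve is not shown.

```python
def bilinear_similarity(f_pix, A, f_node):
    """Similarity s = f_pix^T A f_node (assumed form of the affinity)."""
    dim = len(f_pix)
    # First apply the affinity matrix to the node feature, then take the dot product.
    a_node = [sum(A[r][c] * f_node[c] for c in range(dim)) for r in range(dim)]
    return sum(f_pix[r] * a_node[r] for r in range(dim))


def match_pixels_to_nodes(pix_feats, A, node_feats):
    """For every pixel feature, return the index of the most similar mesh node."""
    matches = []
    for f_pix in pix_feats:
        scores = [bilinear_similarity(f_pix, A, g) for g in node_feats]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    pixels = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]
    nodes = [[0.0, 1.0], [1.0, 0.0]]
    print(match_pixels_to_nodes(pixels, identity, nodes))
```

Each returned index pairs a 2D pixel location with the 3D coordinates of the selected node, which is the input a PnP solver expects.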

Masked Circle Loss for Matching Dense Correspondences
The core operation in dense 2D-3D correspondence matching is to calculate the similarity between the pixelwise deep features from an image and the nodewise deep features from a 3D model. The cosine similarity is used to measure the distance between the features (Equation (2)), where $S_k$ denotes the similarity matrix for object $k$. In [26], the softmax cross-entropy loss, which is the most widely used loss function for traditional classification problems, was chosen to select the corresponding node of the 3D model for each image pixel that belongs to the target object. For pixel $i$, the loss function can be described as
$$ L_{ce} = -\log \frac{e^{s_{il}}}{\sum_{j=1}^{n} e^{s_{ij}}} \qquad (3) $$
where $s_{ij}$ denotes the similarity between the $i$-th pixel in the image and the $j$-th node of the 3D model, $n$ is the number of nodes, and $l$ is the correct label for the matching. The softmax step $p_{iq} = e^{s_{iq}} / \sum_{j=1}^{n} e^{s_{ij}}$, $q = 1, \ldots, n$, turns each similarity $s_{ij}$ into a probability $p_{ij}$. Then, $p_{ij}$ is used to calculate the cross entropy with the one-hot vector, in which only the true class equals one while all the other classes remain zero. The gradient with respect to the similarity of the $j$-th node in the softmax cross-entropy loss is
$$ \frac{\partial L_{ce}}{\partial s_{ij}} = \begin{cases} p_{il} - 1, & j = l \\ p_{ij}, & j \neq l \end{cases} \qquad (4) $$
As shown in Equation (4), the gradient of the true class is $p_{il} - 1$, which means the network is trained to push the softmax-normalized similarity of the true class toward one and that of every false class toward zero. However, in the case of feature matching, the divergence among the classes is not as large as that in traditional classification problems.
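The behavior of the softmax cross-entropy gradient can be checked numerically. The sketch below is illustrative only (function names and the similarity values are made up for the example): it computes the gradient "probability minus one-hot target" for one pixel and shows that a node whose similarity is almost as high as that of the labeled node still receives a positive gradient, that is, it is pushed away.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of similarities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Softmax cross-entropy of one pixel whose correct node index is `label`."""
    return -math.log(softmax(scores)[label])


def cross_entropy_grad(scores, label):
    """Gradient w.r.t. each similarity: p_j - 1 for the true node, p_j otherwise."""
    probs = softmax(scores)
    return [p - (1.0 if j == label else 0.0) for j, p in enumerate(probs)]


if __name__ == "__main__":
    # Node 0 is the labeled match, node 1 is a close neighbor, node 2 is far away.
    sims = [0.90, 0.88, 0.10]
    for j, g in enumerate(cross_entropy_grad(sims, label=0)):
        print(f"node {j}: gradient {g:+.4f}")
```

In this toy case the neighboring node receives a larger positive gradient than the distant node, which is the effect discussed in the text.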
As shown in Figure 2, the red point denotes the true match from image pixel $i$ to node $j$ in the 3D model. The green circle denotes a nearby region of node $j$. Under the softmax cross-entropy loss, all the other nodes in the green circle are trained to have zero similarity with respect to pixel $i$, while node $j$ is trained to have a similarity of one. This is clearly unreasonable for training. In fact, the main purpose of correspondence matching is to find the most similar node of the 3D model for pixel $i$, rather than exactly the same node. Thus, it is more suitable to learn a distance metric for the 2D-3D correspondences.
Metric learning, also known as similarity learning, was an established research area before the deep learning era. Deep metric learning introduces deep neural networks into conventional metric learning. One of the most popular metric learning losses is the contrastive loss (Equation (5)), where $m$ is a margin among different classes and $c_i$ denotes the $i$-th class. Another well-known metric loss is the triplet loss
$$ L_{triplet} = \left[ m + \| f_i - f_j \|_2^2 - \| f_i - f_k \|_2^2 \right]_+ \qquad (6) $$
where $f_i$ and $f_j$ are features of the same class, $f_k$ is a feature of a different class, and $[\cdot]_+$ denotes $\max(\cdot, 0)$. The main difference between these two losses is that the triplet loss stops optimizing the inner-class distance $\| f_i - f_j \|_2^2$ once the condition $m + \| f_i - f_j \|_2^2 - \| f_i - f_k \|_2^2 < 0$ is fulfilled, while the contrastive loss always optimizes the distance among features that belong to the same class. The triplet loss is therefore more suitable for dense feature matching: the similarity of a true correspondence does not have to be one; a pixel only needs to be more similar to its correspondence than to the other nodes.
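The difference between the two losses can be made concrete with a small sketch. The triplet loss follows the hinge condition stated above; the contrastive loss is written in one common squared-distance variant, which is an assumption and may differ in detail from the form used in the paper. All names and numbers are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge on margin + d(anchor, positive) - d(anchor, negative)."""
    return max(0.0, margin + sq_dist(anchor, positive) - sq_dist(anchor, negative))


def contrastive_loss(f_i, f_j, same_class, margin):
    """Assumed variant: pull same-class pairs together, push others beyond the margin."""
    d = sq_dist(f_i, f_j)
    return d if same_class else max(0.0, margin - d)


if __name__ == "__main__":
    anchor, positive, negative = [0.0, 0.0], [0.5, 0.0], [3.0, 0.0]
    # The negative is already far away, so the triplet loss is satisfied ...
    print(triplet_loss(anchor, positive, negative, margin=1.0))
    # ... while the contrastive loss keeps shrinking the positive distance.
    print(contrastive_loss(anchor, positive, same_class=True, margin=1.0))
```

With these values the triplet loss is zero although the positive pair is not identical, whereas the contrastive term for the same pair is still positive.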
Circle loss [33] provides a unified perspective to explain the triplet loss and the softmax cross-entropy loss. Assume there are $K_{in}$ within-class similarities and $K_{out}$ between-class similarities, denoted by $s_p^i$ ($i = 1, 2, \ldots, K_{in}$) and $s_n^j$ ($j = 1, 2, \ldots, K_{out}$), respectively, where the subscripts $p$ and $n$ indicate positive and negative similarities.
In order to minimize $s_n^j$ ($\forall j \in \{1, 2, \ldots, K_{out}\}$) as well as to maximize $s_p^i$ ($\forall i \in \{1, 2, \ldots, K_{in}\}$), the unified loss function can be designed as
$$ L_{uni} = \log\left[ 1 + \sum_{i=1}^{K_{in}} \sum_{j=1}^{K_{out}} \exp\left( \gamma (s_n^j - s_p^i + m) \right) \right] \qquad (7) $$
where $\gamma$ is a scale factor and $m$ is a margin. If we set $\gamma = 1$, $m = 0$, and $K_{in} = 1$, Equation (7) degenerates to the softmax cross-entropy loss shown in Equation (3). The main purpose of the function is to minimize $(s_n - s_p)$, in which reducing $s_n$ is equivalent to increasing $s_p$. Circle loss introduces $(\alpha_n s_n - \alpha_p s_p)$ instead of $(s_n - s_p)$, where
$$ \alpha_p^i = [1 + m - s_p^i]_+, \qquad \alpha_n^j = [s_n^j + m]_+ \qquad (8) $$
in which $[\cdot]_+$ is the ReLU function that ensures $\alpha_p^i$ and $\alpha_n^j$ are non-negative. The factors $\alpha_p^i$ and $\alpha_n^j$ adjust the weights so that each similarity is optimized at its own pace, depending on how far it is from its optimum. When $s_n$ approaches zero and $s_p$ approaches one, the gradients drop to small values according to $\alpha_p^i$ and $\alpha_n^j$. The loss therefore emphasizes the hard examples in which $s_n$ is similar to $s_p$. For dense 2D-3D correspondence matching, we need to emphasize the hard examples and pay less attention to the easy cases. Thus, the circle loss is more suitable than the softmax cross-entropy loss.
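The relation between these losses can be verified numerically. The sketch below implements the unified loss and the circle loss as published in [33], with the weighting factors taken as $[1 + m - s_p]_+$ and $[s_n + m]_+$ and the margins as $1 - m$ and $m$; function names and test values are illustrative, and the code is not taken from the authors' implementation.

```python
import math


def unified_loss(s_pos, s_neg, gamma, margin):
    """log(1 + sum_i sum_j exp(gamma * (s_n^j - s_p^i + margin)))."""
    total = sum(math.exp(gamma * (sn - sp + margin)) for sp in s_pos for sn in s_neg)
    return math.log1p(total)


def circle_loss(s_pos, s_neg, gamma, margin):
    """Circle loss with self-paced weights alpha_p and alpha_n."""
    pos = sum(math.exp(-gamma * max(0.0, 1.0 + margin - sp) * (sp - (1.0 - margin)))
              for sp in s_pos)
    neg = sum(math.exp(gamma * max(0.0, sn + margin) * (sn - margin))
              for sn in s_neg)
    return math.log1p(pos * neg)


def softmax_cross_entropy(scores, label):
    """Reference softmax cross-entropy for comparison."""
    log_z = math.log(sum(math.exp(s) for s in scores))
    return log_z - scores[label]


if __name__ == "__main__":
    scores, label = [0.2, 0.9, 0.4], 1
    negatives = [s for j, s in enumerate(scores) if j != label]
    # With gamma = 1, margin = 0 and a single positive, both values coincide.
    print(unified_loss([scores[label]], negatives, gamma=1.0, margin=0.0))
    print(softmax_cross_entropy(scores, label))
    # An easy pair contributes far less than a hard pair under the circle loss.
    print(circle_loss([0.95], [0.05], gamma=64.0, margin=0.25))
    print(circle_loss([0.50], [0.50], gamma=64.0, margin=0.25))
```

The first two printed values agree, which illustrates the degeneration to the softmax cross-entropy loss; the last two show how the weighting concentrates the loss on hard examples.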
Another problem for the 2D-3D correspondence matching is that the ground-truth poses of the objects have measurement errors that lead to the mismatch of the correspondences. To overcome the problem, we assign a neighborhood area N for each pixel. If the nodes on the 3D mesh model lie in the neighborhood area, they are regarded as positive correspondences. Each pixel has its own neighborhood area to eliminate the influence of the measurement errors of the ground-truth poses.
For every neighborhood area, we set a mask on it and name the overall loss function the masked circle loss. The masked circle loss is formulated in Equation (9), where $u$ denotes the number of pixels that belong to the object in the image; $\Delta_p = 1 - m$ and $\Delta_n = m$ are the margins for the positive pairs and the negative pairs, respectively; and $N_k$ denotes the set of nodes of the 3D mesh model that lie in the neighborhood area of pixel $k$.
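One plausible reading of the masked circle loss is sketched below. It is an assumption rather than the paper's exact formula: for each object pixel, every node inside its neighborhood set is treated as a positive and every other node as a negative, the circle loss of [33] is evaluated per pixel, and the result is averaged over the pixels. The function name and the toy similarities are hypothetical.

```python
import math


def masked_circle_loss(sim, neighborhoods, gamma, margin):
    """Average per-pixel circle loss with neighborhood masks.

    sim[k][j]        similarity between object pixel k and mesh node j
    neighborhoods[k] set of node indices counted as positives for pixel k
    """
    total = 0.0
    for k, row in enumerate(sim):
        pos = 0.0
        neg = 0.0
        for j, s in enumerate(row):
            if j in neighborhoods[k]:
                # Positive pair: weight [1 + m - s]_+, margin 1 - m.
                alpha = max(0.0, 1.0 + margin - s)
                pos += math.exp(-gamma * alpha * (s - (1.0 - margin)))
            else:
                # Negative pair: weight [s + m]_+, margin m.
                alpha = max(0.0, s + margin)
                neg += math.exp(gamma * alpha * (s - margin))
        total += math.log1p(pos * neg)
    return total / len(sim)


if __name__ == "__main__":
    # One pixel, three nodes: node 0 is the labeled match, node 1 lies next to it.
    sim = [[0.90, 0.85, 0.10]]
    print(masked_circle_loss(sim, [{0}], gamma=16.0, margin=0.25))
    print(masked_circle_loss(sim, [{0, 1}], gamma=16.0, margin=0.25))
```

In this example, including the neighboring node in the mask removes the large penalty that would otherwise be applied to a nearly correct match.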
The final loss of the network is defined as the combination of the segmentation loss and the correspondence matching loss
$$ L_{all} = L_{seg} + \zeta L_{m\_circle} \qquad (10) $$
where $\zeta$ is a hyperparameter to balance the two parts of the loss and $L_{seg}$ is the pixelwise softmax cross-entropy for the semantic segmentation of the objects. After the dense correspondences are obtained, the PnP method with RANSAC is used to calculate the final pose of the target.

Results
In this section, we apply our proposed method to a real industrial application to verify its effectiveness and advantages. The target object in the experiment is an antenna support, as shown in Figure 3a. The target is first molded through injection; the mounting hole is then made with a hole puncher. Between these two steps, the antenna support needs to be collected from the conveyor belt with a correct pose and then placed on the screw for punching. Therefore, we train a deep learning model based on our proposed method to predict the pose of the antenna support.

Implementation Details
Data collection. In order to recognize the pose of the antenna support correctly, we collected ten videos (5679 frames) of the antenna support as the training dataset, two videos (1096 frames) for evaluation, and another five videos (3105 frames) as the validation dataset. For each video, we manually selected some key points on the 3D model of the antenna support, as shown in Figure 3b. The 2D correspondences in the first frame of the video were then marked (Figure 3c), and the ground-truth poses of the objects were calculated through the PnP method, as shown in Figure 3d.
The ArUco markers were used to calculate the pose of the camera with respect to the board. Because the objects remain stationary relative to the board across the frames of a video, this camera pose was used to calculate the pose of each object with respect to the camera for the rest of the frames. To enhance the performance of the model, we further rendered 20,000 synthetic images through the BOP [34] renderer for training, as shown in Figure 4. We also applied data augmentation to the original images during training, including random cropping, resizing, 3D rotation, and color jittering.
Model settings. We used EfficientDet-D2 as the object detection backbone to balance detection accuracy and memory usage. The dimension $D$ of the pixelwise and nodewise deep features was set to 128. The hyperparameter $\zeta$, which balances the segmentation loss and the similarity matching loss, was set to 0.01 through cross validation on the evaluation dataset. All the objects detected by EfficientDet were resized to 256 × 256 for further calculation by the U-Net.
Training strategy. We used PyTorch [35] to implement our framework. The network was trained on two NVIDIA RTX 3090 graphics cards with 24 GB of memory. The batch size was set to 16. We used the Adam optimizer [36] for gradient descent on the parameters. The initial learning rate was set to 0.001 and halved every twenty epochs. The model was trained for 200 epochs in total and evaluated every ten epochs. The model with the best score on the evaluation dataset was chosen as the final model for testing.
Mesh model simplification. To reduce the memory usage of our model, we simplified the 3D mesh model of the antenna support to fewer than 8000 triangular patches and 4000 vertices through quadric edge collapse decimation in MeshLab [37]. The average node-pixel matching error is less than 0.5 pixel under this setting.
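The learning-rate schedule described in the training strategy can be written as a short step-decay function. This is a sketch under the assumption that epochs are counted from zero and that the rate changes exactly at epochs 20, 40, and so on; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, step=20, factor=0.5):
    """Step decay: multiply the base rate by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)


if __name__ == "__main__":
    for epoch in (0, 19, 20, 40, 199):
        print(epoch, learning_rate(epoch))
```

Over 200 epochs the rate is halved nine times, so the final epochs run at roughly 1/512 of the initial value.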

Evaluation Metric and Comparison
We used two commonly used evaluation metrics to compare our proposed method with some state-of-the-art methods.
2D Projection metric. This metric computes the mean distance in the 2D image between the projections of the 3D mesh model from the estimated pose and the ground truth pose. A pose is considered correct if the distance is less than σ pixels.
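A minimal sketch of the 2D projection metric is given below. It assumes an ideal pinhole camera without lens distortion; the function names, the intrinsics tuple `(fx, fy, cx, cy)`, and the toy poses are illustrative and do not come from the paper.

```python
import math


def project(point, R, t, fx, fy, cx, cy):
    """Project a 3D model point into the image with pose (R, t) and pinhole intrinsics."""
    cam = [sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3)]
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)


def projection_error(points, R_est, t_est, R_gt, t_gt, intrinsics):
    """Mean pixel distance between projections under the estimated and true pose."""
    errors = []
    for p in points:
        u_est, v_est = project(p, R_est, t_est, *intrinsics)
        u_gt, v_gt = project(p, R_gt, t_gt, *intrinsics)
        errors.append(math.hypot(u_est - u_gt, v_est - v_gt))
    return sum(errors) / len(errors)


def is_correct_2d(error, sigma):
    """A pose counts as correct when the mean error is below sigma pixels."""
    return error < sigma


if __name__ == "__main__":
    eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    pts = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
    err = projection_error(pts, eye, [0.01, 0.0, 1.0], eye, [0.0, 0.0, 1.0],
                           (500.0, 500.0, 0.0, 0.0))
    print(err, is_correct_2d(err, sigma=10.0))
```

In the toy case a 1 cm lateral offset at a depth of 1 m with a focal length of 500 pixels shifts every projection by about 5 pixels.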
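ADD metric. This metric [32] computes the mean distance between the two sets of transformed model points obtained with the estimated pose and the ground-truth pose through
$$ \mathrm{ADD} = \frac{1}{|M|} \sum_{x \in M} \left\| (R x + t) - (\hat{R} x + \hat{t}) \right\|_2 \qquad (11) $$
where $M$ is the set of 3D model points, $(R, t)$ is the ground-truth pose, and $(\hat{R}, \hat{t})$ is the estimated pose. When the distance is less than a certain percentage of the model diameter, the estimated pose is considered correct.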
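The ADD metric can be sketched as follows. The code assumes the standard definition as the mean Euclidean distance between model points transformed by the two poses; the function names and the example values are illustrative, and the acceptance fraction is left as a parameter because the text does not fix it here.

```python
import math


def transform(point, R, t):
    """Apply the rigid transform R * point + t."""
    return [sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3)]


def add_metric(points, R_est, t_est, R_gt, t_gt):
    """Mean 3D distance between model points under the estimated and true pose."""
    dists = [math.dist(transform(p, R_est, t_est), transform(p, R_gt, t_gt))
             for p in points]
    return sum(dists) / len(dists)


def is_correct_add(add_value, diameter, fraction):
    """A pose counts as correct when ADD is below `fraction` of the model diameter."""
    return add_value < fraction * diameter


if __name__ == "__main__":
    eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    pts = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (0.0, 0.05, 0.0)]
    value = add_metric(pts, eye, [0.003, 0.004, 0.5], eye, [0.0, 0.0, 0.5])
    print(value, is_correct_add(value, diameter=0.1, fraction=0.1))
```

For a pure translation error every model point moves by the same amount, so ADD equals the length of the translation offset (5 mm in the example).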
We compare our method with PSGMN [26], DPOD [25], and HybridPose [23]. As all three methods are one-stage pose estimation schemes that cannot detect multiple instances in one frame, we used EfficientDet as the detection backbone for all the methods and tested them on fixed-size images that contain only one object each. The results in terms of the 2D projection metric are shown in Table 1. Our proposed method achieves better performance than the other methods, especially when the metric is stricter. The comparison in terms of the ADD metric is shown in Table 2. Our method also outperforms the other methods by a large margin. As this metric focuses on the distances between the 2D-3D correspondences, our method benefits from the dense matching loss and shows a great improvement in the scores. Some qualitative examples of our proposed method are shown in Figure 5. They show that our proposed method can handle the multi-instance situation well and deal successfully with partial occlusion and changing lighting conditions.

Conclusions and Future Work
In this paper, a multi-instance 6D pose estimation framework was proposed to solve the localization problem of certain objects in intelligent manufacturing. EfficientDet is used as the backbone for object detection. The detected objects in the image are resized and fed into a U-Net model to extract pixelwise deep features for 2D-3D correspondence matching. We proposed a novel metric-based loss, named the masked circle loss, for the feature matching. The results of the pose estimation of the antenna support demonstrate the effectiveness of our proposed method compared with state-of-the-art pose estimation methods.
However, the current framework does not consider the geometric structure and constraints among pixels; further studies will focus on the relationships between pixels.