A 6DoF Pose Estimation Dataset and Network for Multiple Parametric Shapes in Stacked Scenarios

: Most industrial parts are instantiated from different parametric templates. The 6DoF (6D) pose estimation tasks are challenging, since some part objects from a known template may be unseen before. This paper releases a new and well-annotated 6D pose estimation dataset for multiple parametric templates in stacked scenarios donated as Multi-Parametric Dataset, where a training set (50K scenes) and a test set (2K scenes) are obtained by automatical labeling techniques. In particular, the test set is further divided into a TEST-L dataset for learning evaluation and a TEST-G dataset for generalization evaluation. Since the part objects from the same template are regarded as a class in the Multi-Parametric Dataset and the number of part objects is inﬁnite, we propose a new 6D pose estimation network as our baseline method, Multi-templates Parametric Pose Network (MPP-Net), aiming to have sufﬁcient generalization ability for parametric part objects in stacked scenarios. To our best knowledge, our dataset and method are the ﬁrst to jointly achieve 6D pose estimation and parameter values prediction for multiple parametric templates. Many experiments are conducted on the Multi-Parametric Dataset. The mIoU and Overall Accuracy of foreground segmentation and template segmentation on the two test datasets exceed 99.0%. Besides, MPP-Net achieves 92.9% and 90.8% on mAP under the threshold of 0.5cm for translation prediction, achieves 41.9% and 36.8% under the threshold of 5 ◦ for rotation prediction, and achieves 51.0% and 6.0% under the threshold of 5% for parameter values prediction, on the two test set, respectively. The results have shown that our dataset has exploratory value for 6D pose estimation and parameter values prediction tasks. Author Contributions: Conceptualization, X.Z., W.L. and L.Z.; methodology, X.Z. and W.L.; software, X.Z.; validation, X.Z.; formal analysis, X.Z. and W.L.; investigation, X.Z. and W.L.; resources, X.Z. and W.L.; writing—original draft preparation, X.Z. and W.L.; writing—review and editing, L.Z.


Introduction
Parametric techniques have been widely used in the field of industrial design [1]. The assembly of an industrial product usually requires many parametric part objects from different parametric shapes. A parametric shape is a parametric template described by a set of driven parameters, which can be instantiated as many parametric part objects [2,3]. For example, many common industrial products comprise a variety of screw parts and nut parts generated from the screw template and the nut template.
When we disassemble the recyclable part objects from products into the recycling bins, it is common that there is a stacked scene including parametric part objects from multiple templates. Then, the part objects from the same template are sorted into their own bins according to their parameter values. In recent years, robots guided by visual systems are often used to sort the part objects automatically. However, due to the varied templates, the frequent changes of parameter values, heavy occlusion, sensor noise, etc., the accurate 6D pose estimation and parameter values prediction in such stacked scenes are challenging.
Accurate 6D pose estimation, i.e., 3D translation and 3D rotation, is very essential for robotic grasping tasks. Existing 6D pose estimation methods based on deep learning can be roughly classified into instance-level and category-level methods. Some instancelevel 6D pose estimation methods [4][5][6][7][8][9][10][11][12] established 2D-3D or 3D-3D correspondence to solve the 6D pose with exact 3D models for each object. However, these methods cannot generalize to the unseen objects from the same category which have no exact 3D models. Category-level 6D pose estimation methods [13][14][15][16][17] regarded the objects from the same category as a class and they had sufficient generalization ability to estimate the unseen objects' pose. In addition, they also estimated the size of the object, e.g., Refs. [13][14][15] estimated a scale value as the size, which was different from the parameter values in parametric templates. However, since a parametric template is considered as a class in this paper and the parameter values prediction is necessary for the sorting tasks, there are no existing category-level methods to jointly achieve 6D pose estimation and parameter values prediction of part objects from multiple templates in stacked scenarios. Besides, the lack of datasets for such tasks is a barrier for learning-based methods to research further.
To solve the lack of dataset, we construct a new dataset for stacked scenarios of parametric part objects from multiple templates, donated as Multi-Parametric Dataset. As shown in Figure 1a, we first select four templates from Zeng's database [1], which well represent the geometric features and rotation types. Then we sample the parameter values to instantiate each template into different part objects, and they are randomly selected to form a stacked scene. Through automatically labeling technique, we generate a large RGB-D dataset (50K training set, 2K test set) with ground truth annotations of each instance in the stacked scenes, including template label, segmentation mask, 6D pose, parameter values, and visibility. To solve the lack of method for stacked scenarios of parametric part objects from multiple templates, we propose a new network with residual modules as our baseline method, Multi-templates Parametric Pose Network, donated as MPP-Net. MPP-Net can jointly achieve foreground segmentation, instance segmentation, template segmentation, 6D pose estimation and parameter values prediction. As shown in Figure 1b, MPP-Net takes unordered point cloud as the input, and first predicts the foreground points in the scene. Then, similar to [3,18,19], we predict point-wise template, parameter values and 6D pose. To improve the accuracy of our method, we also design the residual modules for the prediction of translation, rotation and parameter values. In our experiments, MPP-Net is evaluated on our Multi-Parametric Dataset for the learning and generalization abilities evaluation, respectively.
To the best of our knowledge, compared with existing datasets and methods for 6D pose estimation, we propose the first public RGB-D dataset and the first deep learning network to jointly achieve 6D pose estimation and parameter values prediction. Besides, we set the evaluation metrics and provide the benchmark results for our dataset.
In summary, the main contributions of our work are: • We construct a new dataset with evaluation metrics for stacked scenarios of parametric part objects from multiple templates. • We propose a new network to provide benchmark results, which jointly achieves foreground segmentation, instance segmentation, template segmentation, 6D pose estimation and parameter values prediction.

Dataset
Recent 6D pose estimation methods based on deep learning become a trend and have shown remarkable performance. Dataset is one of the key factors that determines the performance of deep learning networks. There are many public datasets used for instance-level and category-level 6D pose estimation learning-based methods, as shown in Table 1.
Instance-level tasks usually treat each object as a class and they need an exact 3D model for each object. The LINEMOD dataset [20] is designed for the household cluttered scenarios, providing about 18,000 RGB-D images with manual annotations, and it contained 15 texture-less objects. Tejani et al. [21] proposed a dataset (IC-MI dataset) designed for the household cluttered scenarios which contain two textureless and four textured objects with 5000 RGB-D images. The Rutgers APC dataset [22] contains 24 textured objects from the Amazon Picking Challenge 2015 [23] with 10,000 RGB-D images, which is designed for household cluttered scenarios. The YCB-Video dataset [24] is also constructed for the household cluttered scenarios, which provides 133,827 frames observed in 92 videos with manual annotations by using an RGB-D camera. The dataset contains 21 daily life objects with varying textures and shapes chosen from the YCB dataset [25]. The StereOBJ-1M dataset [26] provides 396,000 frames of 18 household objects, which are constructed in 11 different environments to form cluttered scenes. Different from the dataset for the household cluttered scenarios, the dataset for the industrial stacked scenarios has more occlusion and higher pose variability of objects, since the objects are piled into a bin together. The T-LESS dataset [27] is an industrial stacked scenario dataset, which contains 30 textureless industrial objects with strong inter-object similarity and provided 47,762 frames captured by cameras. The Siléane dataset [28] is a typical dataset for the industrial stacked scenarios, including 8 objects and providing 1922 synthetic RGB-D images with noise by using automatic annotating technique and 679 real-world RGB-D images. The Fraunhofer IPA dataset [29] is an extension of the Siléane dataset with extra 2 industrial objects, providing 206,000 synthetic and 520 real-world RGB-D images. These datasets treat an object as a class and the objects in their test set are seen in the training phrase.
Category-level tasks target at generalization for the unseen objects without exact 3D models. There are also many datasets for category-level tasks, which treat the objects from the same category as a class, so their test set with unseen objects focuses on evaluation for the generalization ability. The CAMERA dataset [13] is designed for the household cluttered scenarios, comprising 300,000 real scenes with about 1000 synthetic objects within six daily life categories generated by a new Context-Aware MixEd ReAlity approach. The REAL dataset [13] is also constructed for the household cluttered scenarios, providing 8000 RGB-D frames of 18 real-world scenes, 42 unique objects within 6 daily life categories. For the industrial stacked scenarios, Parametric Dataset [3] is a synthetic dataset consisting of 374,400 depth images, 416 part objects within 4 industrial templates. However, each scene in the Parametric Dataset contains instances of a single part object only and there is a lack of dataset for industrial stacked scenarios of parametric part objects from multiple templates. Many existing methods are designed for instance-level tasks. Methods [4][5][6][7][8][9] targeted on training their network to learn the 2D-3D correspondence, i.e., the match between the 2D keypoints on RGB image and the 3D keypoints on the 3D model. Then, the object pose was solved by Perspective-n-Point algorithm [30]. However, due to the lack of depth information in RGB images, these methods performance were deteriorated in the stacked scene with heavy occlusion. Recently, He et al. [10,11] and Liu et al. [12] took RGB-D images as the input, predefined the keypoints on the 3D model in different ways and predicted the keypoints of the object in the scene. Then the 6D pose was solved by the leastsquare fitting algorithm [31] with the 3D-3D correspondence. Although these methods explored the way to fully leverage RGB-D information to achieve a better performance, they cannot be generalized to the category-level unseen objects since the selections of keypoints are based on the known 3D models.

Category-Level 6D Pose Estimation
Compared with instance-level tasks, category-level tasks are more challenging due to the large number of shape and size variations of objects from the same category. To address the challenges, Wang et al. [13] innovatively proposed a canonical representation, i.e., normalized object coordinate space (NOCS), for all the objects from a category. They directly predicted the 3D coordinates in NOCS for each pixel of the object on the RGB image. Then the Umeyama algorithm [32] was used to solve the 6D pose and size of the objects. However, it is difficult to predict 3D coordinates in NOCS directly from RGB images, and the predicting deviation will make the results worsen. Tian et al. [14] explored the NOCS method by adding a shape prior and RGB-D fusion feature to reconstruct the canonical representation. Wang et al. [15] also adopted the NOCS method and proposed a cascaded relation network to exploit the RGB-D fusion feature and the relation between instance and category features. In addition, a recurrent reconstruction network is designed to refine the canonical shape reconstruction of the objects. Chen et al. [16] consider both geometric and category information to produce dense-fusion features to regress 6D pose and to calculate the size of objects by reconstruction. Chen et al. [17] proposed a network with an online 3D deformation mechanism for data augmentation to increase the generalization ability, which directly regress the 6D pose and size of objects. However, the size predicted by these methods is the 3D bounding box size of the object, or the ratio between the object shape in their unified space and that in the scene. So, there are no suitable methods to predict exact driven parameter values of parametric part objects.

Dataset
In this section, we will introduce our new dataset, Multi-Parametric Dataset, targeting 6D pose estimation and parameter values prediction for stacked scenarios of parametric part objects from multiple parametric templates.

Dataset Description
Our dataset is completely generated automatically by simulation techniques. The synthetic stacked scenes are constructed by the physics engine, and the annotations are obtained by the virtual camera in the rendering engine. As mentioned in [29], the simulation dataset can perform well on the real test set through the domain transfer.
We select four parametric templates in Zeng's database [1], including TN06, TN16, TN34, and TN42, as shown in Figure 2. They represent the symmetry types commonly existing in industrial scenarios. For a parametric template with p parameters, we sample k (k = 4) values for each one within a certain range. Then we instantiate it into k p different part objects to construct a 3D model library where the models are selected to construct our synthetic stacked scenes. The dataset is divided into a training set (50K scenes) and a test set (2K scenes). The test set comprises TEST-L (1K scenes) where part objects' parameter values are the same as those of the training set and TEST-G (1K scenes) where part objects' parameter values are different from those of the training set. The two test datasets are set up to evaluate the learning ability and generalization ability, respectively. The parameter values distribution of part objects in the dataset is shown in Figure 3. The rectangular regions represent the sampling ranges of the parameter values, and the lines represent the sampled parameter values.
The ground truth annotations comprise template labels, segmentation masks, translation labels t ∈ R 3 and rotation matrix labels R ∈ SO(3) relative to the camera frame, visibility labels v ∈ [0, 1], and parameter values labels, for each instance in a scene.

Synthetic Data Generation
We build a 3D model of the bin with a size of 50 × 50 × 15 cm which is randomly rotated at an angle around the axis perpendicular to the ground in each scene to increase the variety of the dataset. Then we extract randomly one part object from the model library n (n ∈ [35,45]) times with the place back. So the number of objects in a scene is n and the number of each part object in a scene may be more than one. A physical simulation engine, i.e., Bullet, is used to simulate the free fall motion and collision of the objects to generate a typical stacked scene, where the labels of the bin and part objects are obtained automatically. Repeating the above process, we can generate different synthetic scenes. After rendering each scene through the render engine, i.e., Blender, we obtain an RGB image, a depth image, a segmentation image, and a set of images with each individual object for each scene. All the results are saved for the perspective and orthogonal version, as shown in Figure 4. The RGB information of the synthetic scenes is stored in the RGB images. The depth information of each pixel is stored in the depth image as 16 bit unsigned integer format (unit16). The segmentation images store the instance information to its corresponding pixels. In addition, we save a set of mask images for each individual instance without occlusion in a scene to calculate their corresponding pixel number. Intuitively, we regard the degree of visible surface of the ith (i = 1, 2, . . . , n) instance in the scene as its visibility v i : where N scene i is the pixel number of the ith instance in the segmentation image, and N total i is the pixel number of the ith instance in its own mask image.

Evaluation Metrics
Our dataset is designed for the evaluation of foreground segmentation, template segmentation, 6D pose estimation, and parameter values prediction.
For evaluation of foreground segmentation and template segmentation, we use mean Intersection over Union (mIoU) as the evaluation metric, which is calculated by the ratio between the intersection and the union of ground truth and predicted segmentation results. The IoUs are calculated on each class and averaged to get mIoU as follows: where N c is the number of the class, p ij is the point which is predicted as jth class and the ground truth is ith class.
For evaluation of 6D pose estimation, we regard the instances as true positive, whose error is less than m cm for translation and n • for rotation similar to [33,34]. Given the rotation label R and translation label t, and the predicted rotationR and translationt, the error of rotation and translation e R and e t can be, respectively, computed by: In particular, for the template with finite symmetry, e R can be computed by: where G ⊂ SO (3) is the set of rotation matrix G that has equivalent effect on a object. For the template with revolution symmetry, we assume that the object is symmetric on the z axis of the object's local frame. So we can abstract it as a unit vector z = [0, 0, 1] T along the z axis, and e R can be computed by: For evaluation of parameter values prediction, we regard the instance as true positive whose relative error of all the parameter values are less than q%. Given the ground truth of the sth parameter para s , and the predicted sth parameter pâra s , the error of the sth parameter e p s can be computed by: For 6D pose estimation and parameter values prediction, only the poses of objects that are less than 60% occluded are relevant for the retrieval. The metric breaks down the performance of a method to a single scalar value named average precision (AP) by taking the area under the precision-recall curve. Then the mean average precision (mAP) is adopted as the final evaluation metric computed by the average of the APs of all the parametric templates.

Baseline Method
In this section, we propose a new network with residual modules, Multi-templates Parametric Pose Network, donated as MPP-Net. This baseline method can jointly achieve foreground segmentation, instance segmentation, template segmentation, 6D pose estimation, and parameter values prediction. The architecture of our network is shown in Figure 5. Firstly, the point-wise features of the point cloud are extracted by a backbone network, e.g., Pointnet [35], Pointnet++ [36], and PointSIFT [37]. The backbone network takes unordered point cloud of the scene with size of N p × 3 as input to produce a point-wise feature F e with size of N p × N f . Then foreground segmentation and template segmentation are achieved with F e and F ef , respectively. Furthermore, the 4 branches which consume the enhanced feature F efc with shared multi-layer perception (MLP) and residual modules, can jointly obtain foreground segmentation result, template segmentation result, translation prediction, rotation prediction, parameter values prediction, and visibility prediction for each point. Thus, the loss function of our network is the sum of each part loss with their own weight: where L f g , L tem , L t , L R , L para , L v are the loss of the parts, and λ f g , λ tem , λ t , λ R , λ para , λ v are the loss weights for different parts.

Foreground Segmentation
For stacked scenes consisting of the bin and part objects, the point cloud in the scene need to be divided into foreground and background, since only the foreground point cloud are what we concern. We feed the embedded feature F e into MLPs to produce point-wise are the probability of the ith point belonging to background and foreground, respectively. The loss L f g is the binary cross entropy softmax loss: where b i represents foreground segmentation label of the ith point:

Template Segmentation
It is one of the important tasks for recycling scenarios to distinguish parts belonging to different templates, which is also an important prerequisite for 6D pose estimation and parameter values prediction. Therefore, we design a template segmentation branch to identify templates to which the point belongs. Since the foreground segmentation information might be useful for other tasks, so we concatenate F e andB to produce a feature F ef with size of N p × (N f + 2) as the input of the template segmentation branch. Then we feed F ef into MLPs to produce point-wise template segmentation resultsĈ = {[ĉ i1 ,ĉ i2 , . . . ,ĉ iN c ]} N p i=1 with size of N p × N c . The elementĉ ij represents probability of the ith point belonging to the jth template. In order to avoid the influence of background points during training phrase, only the foreground points are considered when applying the cross entropy softmax loss L tem : where N f g is the number of the foreground points, N c is the number of the templates, and c ij represents the template label of the ith point: c ij = 1, the ith foreground point belongs to the jth template 0, the ith foreground point belongs to other templates (12) Due to segmentation information may be beneficial to other tasks, so we concatenate F ef andĈ to produce a feature F efc with size of N p × (N f + 2 + N c ) as the input of the following branches.

Translation Branch
This branch is used to predict the translation of instances and achieve instance segmentation. Since the local frame origin of the object is its centroid in our dataset, the translation is the centroid coordinate in the scene. We feed F efc into MLPs to regress the point-wise offsets to the centroid of the instance to which each point belongs. Then the predicted centroids with size of N p × 3 is calculated by adding the offsets to the point cloud. The loss L t 0 considering foreground points only is L 1 loss between predicted centroids and centroid labels: where t i andt i are point-wise labels and predicted centroids of the instance to which the ith foreground point belongs. To make the most of the information we extract from the backbone network, we add MLPs which consume F efc to regress the residual δt i between predicted centroids and centroid labels to improve our prediction results. However, in fact, ground truth of the residual is unknown. Similar to [38], we set the optimal target of the residual δt i = |t i −t i | by online learning and the loss L t res is L 1 loss: Furthermore, the total loss of the translation branch is the sum of two loss: The residual module comprises two independent MLPs to jointly regress the coarse prediction and the residual prediction to obtain the accurate predictiont i + δt i , as shown in Figure 6. Similar to [3,18,19], we believe that if the points belong to the same instance, then the centroid predictions of these points will be close in the centroid space. During inference, we cluster the points in the centroid space into D clusters, i.e., D instances, by the unsupervised learning clustering method. Finally, the final centroid of each instance with the final size of D × 3 is obtained by Hough voting in each cluster.

Rotation Branch
Similar to the translation branch, we feed F efc into the residual module to regress the point-wise rotation prediction of the instance to which each point belongs. There are many representations for rotation, such as quaternion, rotation matrix, Euler angles, and axis-angle. Gao et al. [39] proposed that the axis-angle representation is a better choice for rotation, and Dong et al. [18] proved that their point-wise pose regression framework has almost the same results by learning Euler angles and axis-angle. Therefore, Euler angles are chosen to learn which is more intuitive for humans.
The loss function for rotation prediction adopts the pose distance proposed by Romain Brégier et al. [40], which is calculated by the pose vectors with at most 12 dimensions in Euclidean space for different types of object symmetry. Let the set R(R) represents the vectors (at most 9 dimensions) of the equivalent poses of the rotation matrix R ∈ SO(3). The distance between two rotation vectors, i.e., r 1 , r 2 , can be represented as follows: The rotation of the instances are obtained by a residual module which comprises of two MLPs. One MLPs aims to regress point-wise Euler angles α i , β i , γ i directly, and then they are converted into the coarse predicted rotation matrix: Therefore, the loss L R 0 is the sum of rotation distance of different templates between the coarse predicted rotation and the rotation labels: where N j f g is the number of the foreground points belonging to the jth template. The other one MLPs aims to regress the residual Euler angles δα i , δβ i , δγ i , and then they are converted into the accurate rotation matrix by adding them to α i , β i , γ i : The loss L R is the sum of rotation distance of different templates between the accurate predicted rotation and the rotation labels: Furthermore, the total loss of the rotation branch is the sum of the two loss: During inference, the foreground point-wise rotation results with size of N f g × 3 are divided into D clusters according to the instance segmentation result. Then, the final predicted rotations with size of D × 3 is obtained by Hough voting in each cluster.

Parameter Values Branch
When we recycle the parametric part objects, we need to sort them according to the different parameter values. So we need to predict the value of each parameter for each part object. The feature F efc is fed into the residual module to regress k parameter values of the instance to which each point belongs. The two loss considering only the foreground points L para 0 , L para res are L 1 loss and the sum is the total loss of this branch: L para = L para 0 + L para res (24) where pâra i , para i , δpâra i are coarse predicted parameter vectors, ground truth parameter vectors, and the residual parameter vectors with k parameter values for the instance to which the ith foreground point belongs. Furthermore, we can obtain the accurate parameter values pâra i + δpâra i . During inference, the foreground point-wise parameter values are divided into D clusters in the parameter space according to instance segmentation result. Then, each cluster's points votes for the final predicted parameter values for each instance with size of D × k.

Visibility Branch
In typical stacked scenarios, we are not interested in the instances with heavy occlusion since they are difficult for robots to grasp. In addition, severely occluded instances only have very limited local information, which is likely to damage the performance of the network. So we filter out these instances by a visibility threshold T v . Similar to other branches, we feed F efc into MLPs to regress the visibility result with size of N p × 1 of each point, and divide the foreground point-wise results into D clusters. Then each cluster's points votes for the final predicted visibility for each instance with size of D × 1. The loss L v for the visibility branch is L 1 loss: where v i andv i are the point-wise visibility labels and predicted visibility.

Implementation Details
We sampled N p = 16,384 points with only 3D coordinate information in each scene by Furthest Point Sampling. PointSIFT [37] was chosen as our backbone network which is the latest work compared with Pointnet [35] and Pointnet++ [36], and has stronger feature extraction ability. We set N f = 128, T v = 0.4, λ f g = λ tem = 20, λ t = λ R = λ para = 200, and λ v = 50. We implemented our network by using Tensorflow 1.5 on a GeForce RTX2080Ti. We optimized our network using the Adam optimizer with batch size 6 and initial learning rate 0.001. The learning rate decayed every 200,000 steps by a factor of 0.5. The forwardpass time of MPP-Net for a single scene was about 300 ms.

Results
For foreground segmentation and template segmentation, we evaluate the learning and generalization ability on TEST-L and TEST-G, as shown in Table 2. Our experiment results show that the segmentation performance of MPP-Net is good enough to meet the segmentation requirements, and the generalization performance is almost the same as the learning performance. Specifically, the background segmentation results on two test datasets exceed 99.9%. Besides, the IoU results of each template on TEST-L exceed 99.5%, while the IoU results of each template on TEST-G exceed 98.5%. The mIoU and Overall Accuracy on the two test datasets exceed 99.0%.
Obviously, MPP-Net has excellent learning and generalization performance for translation prediction. However, learning and generalization performance for rotation prediction are poor. As for the parameter values prediction, its generalization performance is much lower than learning performance, especially if the threshold is strict. So the accurate prediction for rotation and parameter value is very challenging. The visualization for predicted results on our test set is shown in Figure 7.

Ablation Study
In addition, we explore the influence of the residual module of our network. For translation prediction, as shown in Table 4, the residual module has little improvement due to the performance is saturated. As the predicted results converge to the ground truth, it becomes harder for residual learning to improve the performance of translation prediction. For rotation prediction, as shown in Table 5, the residual module can achieve improvements by 5.9%, 8.1%, 8.0%, 9.1% under the four thresholds for learning performance, and 2.7%, 7.8%, 8.1%, 8.8% under the four thresholds for generalization performance. Obviously, it is a large improvement for rotation prediction. When the network cannot directly regress an exact prediction result, the residual module can further explore the information in the embedding features, so as to make an improvement on the performance of rotation prediction. For parameter values prediction, as shown in Table 6, the residual module can achieve improvements by 3.5%, 2.8%, 1.9%, 1.6% under the four thresholds for learning performance, and 1.2%, 2.4%, 1.3%, 3.0% under the four thresholds for generalization performance. It shows that the residual module slightly improves parameter prediction results, due to this it may be difficult to explore the parameter information from the embedding features. The experiments results show that the residual modules bring great benefits for the rotation prediction, while the performance of parameter values prediction is improved a little.

Further Discussion
In addition to the above experiments, we explored the learning and generalization performance of each template under more threshold settings to show more analysis and comparison details, as shown in Figure 8. For translation prediction, when the threshold is lower than 1cm, the AP of each template is close to 100%, which means MPP-Net shows excellent performance on TEST-L and TEST-G. For rotation prediction, the performance of the four templates on TEST-L and TEST-G are close. Obviously, TN06 and TN16 perform better, while TN34 and TN42 perform worse. For parameter values prediction, the network generalization performance is much worse than the learning performance, since the main difference between TEST-L and TEST-G lies in parameter values. Among the four templates, TN06 is worse than the others. In the visualization for evaluation, the prediction results of inner diameter r, outer diameter R, and height h of the part objects from the TN06 template are inaccurate. Therefore, we need more research works targeting the improvements of rotation and parameter values prediction for various parametric templates, and our dataset, Multi-Parametric Dataset, has exploratory value for 6D pose estimation methods in the future. Figure 8. Further experiment results on the evaluation of translation, rotation, and parameter values prediction. The images in the left column are the results evaluated on TEST-L for learning performance, and the images in the right column are the results evaluated on TEST-G for generalization performance.

Conclusions
In this paper, we proposed a new dataset Multi-Parametric Dataset for the lack of recycling stacked scenarios containing parametric part objects from multiple templates. Besides, we designed the evaluation metrics for the evaluation on the Multi-Parametric Dataset. To provide benchmark results, a new 6D pose estimation network, MPP-Net, was designed for such stacked scenarios as a baseline method for the Multi-Parametric Dataset.
The experiment results showed that the Multi-Parametric Dataset has exploratory value for 6D pose estimation and parameter values prediction tasks. In the future, our research will focus on improving the generalization performance of different templates on the rotation and parameter values prediction, and exploring more templates in the field of industry.