DeepHMap++: Combined Projection Grouping and Correspondence Learning for Full DoF Pose Estimation

In recent years, estimating the 6D pose of object instances with convolutional neural networks (CNNs) has received considerable attention. Depending on whether intermediate cues are used, the relevant literature can be roughly divided into two broad categories: direct methods and two-stage pipelines. For the latter, intermediate cues such as 3D object coordinates, semantic keypoints, or virtual control points, rather than pose parameters, are regressed by a CNN in the first stage. The object pose is then solved from correspondence constraints constructed with these intermediate cues. In this paper, we focus on the postprocessing of a two-stage pipeline and propose to combine two learning concepts for estimating object pose in challenging scenes: projection grouping on one side, and correspondence learning on the other. We first employ a local-patch based method to predict projection heatmaps, which denote the confidence distribution of the projections of the 3D bounding box's corners. A projection grouping module is then proposed to remove redundant local maxima from each layer of the heatmaps. Instead of directly feeding 2D-3D correspondences to the perspective-n-point (PnP) algorithm, multiple correspondence hypotheses are sampled from each local maximum and its neighborhood and ranked by a correspondence-evaluation network. Finally, correspondences with higher confidence are selected to determine the object pose. Extensive experiments on three public datasets demonstrate that the proposed framework outperforms several state-of-the-art methods.


Introduction
Estimating the full degree-of-freedom (DoF) pose of a rigid object, that is, its 3D translation and 3D orientation, from a single frame is an important topic in computer vision. A large number of approaches have been proposed to address applications in domains such as robotics, augmented reality, and medical navigation [1]. Real-time performance is a key requirement for almost all applications in these three fields. One line of solutions models objects as a sparse set of feature points [2]. For well-textured objects, this problem has been well addressed by constructing correspondence constraints between the prior model and the scene image [2,3]. However, both robustness and accuracy remain critical issues that limit existing methods in challenging scenarios [4]. Thus, many researchers have recently begun to employ convolutional neural networks (CNNs) or ensemble learning [5] to address these issues.
Depending on whether intermediate cues such as 3D object coordinates [6], projections of virtual control points [7], or semantic keypoints [8] are used, related approaches can be roughly divided into two categories: direct methods and two-stage pipelines. In a typical two-stage pipeline, the intermediate cues are prepared in the first stage, and the pose parameters are computed from these cues in the second. Throughout the rest of this paper, the front-end and back-end refer to the first and second stage of a two-stage pipeline, respectively. Instead of directly predicting the full DoF pose, Brachmann et al. [6] first compute 3D object coordinates and a confidence map of scene pixels with random forests. The dense correspondences are then passed to a random sample consensus (RANSAC) based optimization step. Encoding local features of the input image makes 3D object coordinates inherently robust to partial occlusion, and this approach achieves top-level results on the Occluded LineMOD dataset [6]. However, Brachmann et al. [6] did not specifically consider the case of symmetrical objects [9]. Sparsity is another important requirement for robustness to heavy occlusion. In contrast to dense object coordinates [6], both projections of virtual control points [10] and projections of bounding box corners [7] are sparse. Oberweger et al. [11] took both sparsity and locality into account and improved the robustness of BB8 [7] (named after the 8 corners of the bounding box) by predicting projection heatmaps from random local patches. For simplicity, we refer to the method reported in Ref. [11] as DeepHMap. As a part-based method, DeepHMap [11] lifts the robustness to occlusion to a new level with simple local patches. However, the predicted heatmaps frequently contain multiple local maxima due to the absence of global information. Oberweger et al. [11] directly select the global maximum to construct correspondence constraints, without considering the plausibility of the corner projection. Compared with the wide range of intermediate cues, the postprocessing in the second stage has received little attention. The studies [12][13][14] are three of the few works that turn attention to the back-end, and all of them are less portable because of their depth of customization.
Motivated by the above analysis, in this paper we focus on the back-end of a two-stage pipeline to ensure both accuracy and robustness. Given the simple yet efficient strategy of the baseline [11], we follow the same line to obtain projection heatmaps of the 3D bounding box corners (BBCs). For the raw merged heatmaps from the baseline [11], a good postprocessing method should have the following features: (1) the postprocessing can be seamlessly integrated with the front-end network, so that no extra effort is needed to connect the two parts; (2) the back-end should be efficient, since introducing a heavy computational cost in exchange for improved accuracy is not advisable; and (3) unreasonable projection distributions on a single layer of the heatmaps should be properly excluded. To this end, we present a two-stage approach, as depicted in Figure 1. The proposed method consists of three parts: projection prediction, projection grouping, and correspondence evaluation. A simple projection grouping module is first designed to learn the spatial correlation among the projections of different BBCs. Unreasonable local maxima can thus be removed by the geometric constraints learned with this module, so that each layer of the filtered heatmaps contains only one peak. In each layer of the heatmaps, each pixel stores the corresponding confidence of the projection distribution. In practice, the projection predictions are still biased with respect to the ground truth. Therefore, instead of directly feeding 2D-3D correspondences to the perspective-n-point (PnP) method [15], multiple correspondence hypotheses are sampled from both the single remaining local maximum and its neighborhood. Similar hypothesis sampling can be found in Ref. [16]. Instead of random hypothesis selection or iterative refinement [13], we delegate the selection to a correspondence-evaluation network [17]. Finally, correspondences with higher confidence are chosen to calculate the object pose. For brevity, we name our method DeepHMap++ in the text below.
In summary, the main contributions of this paper are as follows:
1. We present a simple yet efficient projection grouping module for removing fake local maxima in each layer of the projection heatmaps. The projection grouping module learns correlation constraints among the projections of different BBCs and selects the optimal projection.
2. To suppress jitter during inference, multiple correspondence hypotheses are randomly sampled from each local maximum and its neighborhood and ranked by a correspondence-evaluation network.
3. We show the effectiveness of projection grouping and correspondence evaluation, and the resulting two-stage pipeline achieves state-of-the-art performance on public benchmarks including the LineMOD dataset [18], the Occluded LineMOD dataset [6], and the YCB-Video dataset [19].
The rest of the paper is structured as follows: an overview of related work is provided in Section 2. Section 3 describes the complete pipeline. Extensive evaluations and comparisons with several state-of-the-art baselines are presented in Section 4. Conclusions and future work are presented in Section 5.

Figure 1. Overview of the proposed two-stage approach for recovering the 6D pose, with its three modules: projection prediction, projection grouping, and correspondence evaluation. For any input patch (yellow box), the corresponding output of the projection prediction module consists of projection heatmaps with eight layers.

Direct Methods
For direct methods, full DoF poses are directly encoded in both learning and inference. Building on the efficient LineMOD [18] template, Tejani et al. [20] proposed a latent-class Hough forest that employs a part-based version of the template to improve robustness to partial occlusion and clutter. Instead of a random forest, a convolutional auto-encoder [31] is trained on local view patches and generalizes well to both seen and unseen objects. A similar extension of local patch based regression can be found in Ref. [32]. Wohlhart et al. [21] presented a learning-based descriptor that maps object categories and viewpoints to a Euclidean space. A large Euclidean distance between descriptors thus indicates different categories, while the distance between descriptors of the same object is directly related to the difference between views. More aggressively, a pose-guided feature [22,30] is designed to learn exact pose differences.
One-shot 6D pose estimation has been frequently addressed in the recent literature. In Ref. [23], a fully connected auto-encoder is employed to learn latent features. Deep learning has brought significant progress in visual object recognition and detection, and many researchers have therefore begun to predict pose parameters in one branch of a deep neural network. Following the state-of-the-art object detector Mask R-CNN [40], a multi-task learning network [27] with a pose branch is demonstrated. Similar end-to-end approaches to pose estimation can also be found in Refs. [26,28]. To reduce the reliance on data annotations, implicit orientation learning [29] is proposed, which learns from samples processed by an augmented autoencoder. Mitash et al. [25] presented a comprehensive framework for full DoF pose estimation via Monte Carlo tree search, which completely eliminates the time-consuming labeling step. More recently, a multi-view and multi-class framework [24] demonstrates impressive 6D pose estimation via a multi-class representation of the pose space.

Two-Stage Pipeline
The biggest difference between direct methods and two-stage pipelines is whether intermediate cues are used. A common intermediate cue in pick-and-place systems is the segmented point cloud [33]. The 6D pose of a rigid object instance is obtained by aligning the segmented point cloud with a pre-scanned 3D model. The 3D object coordinate [6] is another flexible intermediate cue, which has proven very effective for 6D pose estimation [6,8,35,38,41] and camera localization [16]. To cope with the multi-object case, 3D object coordinates together with object labels [34] are jointly employed as intermediate cues.
Holistic methods [14,19,36] formulate the pose detection problem globally and feed the entire scene image directly into the regression network. We highlight SSD-6D (SSD denotes single shot detector) [36], which builds its 6D hypotheses from 2D bounding boxes. In contrast to 3D object coordinates, a 2D bounding box is an implicit intermediate cue that does not explicitly contain 3D information. After significant progress [42] on conventional cluttered scenes [18], researchers have begun to shift their attention to more challenging scenes with severe occlusion [6]. Compared with the part-based approach [6], a holistic method such as SSD-6D is more likely to be disturbed by foreground occlusion when constructing the mapping to pose space.
Crivellaro et al. [10] introduced a novel intermediate cue, virtual control points. The 3D pose of an object part is represented by the projections of virtual control points, making it possible to handle poorly textured objects under partial occlusion and heavy clutter. To avoid manually selecting parts of a specific object as in Crivellaro et al. [10], more general-purpose virtual control points, the BBCs [7], are utilized to construct 2D-3D correspondences. Following the same principle as BB8 [7], Oberweger et al. [11] proposed training a CNN on random local patches and achieved state-of-the-art performance. Similar part loss [43] or part response [44] based methods also show strong robustness to occlusion in other areas such as face detection. In multi-branch networks [19,39], the segmentation mask, 2D bounding box, and object center are frequently employed as intermediate cues. Unlike the virtual control points mentioned above, semantic keypoints [37] have also proven to be an effective intermediate cue. Unfortunately, the time-consuming automatic extraction of semantic keypoints hinders real-time applications.
The postprocessing that follows the acquisition of intermediate cues is equally important. After obtaining 3D object coordinates using random forests in the first stage, a novel pose agent [13] is designed to repeatedly refine pose hypotheses. This is the only reinforcement learning based example of a back-end that we are aware of. In the differentiable RANSAC (DSAC) based pipeline [16], the finite differences used for the refinement gradients lead to high gradient variance during end-to-end learning. To address the remaining issues in DSAC, a fully differentiable back-end [12] is proposed for camera localization. Compared with intermediate cues in the front-end, research on postprocessing is still relatively scarce.

Methods
According to the definitions given above, our method is a typical two-stage method. In the front-end, projection heatmaps of the 3D BBCs are predicted following the procedure described in DeepHMap [11]. The core task of this paper is to design a comprehensive postprocessing stage for the back-end. Our proposed postprocessing consists of two modules: projection grouping and correspondence learning based hypothesis selection. We describe each step in this section.

Local Patch Based Heatmap Prediction
DeepHMap [11] uses an asymmetric hourglass network for predicting projection heatmaps, which takes a random local patch of size 32 × 32 as input and produces the corresponding predicted heatmaps of size 128 × 128. Unlike the direct projection prediction of BBCs from a holistic patch in BB8 [7], DeepHMap outputs projection heatmaps that denote a confidence distribution over the projection. Compared with predicting projection heatmaps, direct pose regression is a more demanding task. The heatmaps predicted from different random local patches are then merged via simple averaging, which frequently produces multiple local maxima in a single channel of the heatmaps. A more flexible strategy than taking the global maximum [11] is adopted and described in detail in the subsection below.
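The merging step can be illustrated with a minimal NumPy sketch. It assumes that each per-patch heatmap is centered on its patch center in image coordinates and that overlapping predictions are averaged pixel by pixel; the function name `merge_patch_heatmaps` and this placement rule are illustrative assumptions, not details taken from DeepHMap [11].

```python
import numpy as np

def merge_patch_heatmaps(heatmaps, centers, image_hw):
    """Average per-patch heatmaps into one image-sized heatmap stack.

    heatmaps: (N, h, w, C) array, one prediction per local patch.
    centers:  (N, 2) integer (row, col) patch centers in the image.
    Assumption: each heatmap is centered on its patch center.
    """
    img_h, img_w = image_hw
    _, h, w, c = heatmaps.shape
    acc = np.zeros((img_h, img_w, c))   # sum of predictions per pixel
    cnt = np.zeros((img_h, img_w, 1))   # number of predictions per pixel
    for hm, (r, col) in zip(heatmaps, centers):
        r0, c0 = int(r) - h // 2, int(col) - w // 2
        # Clip the pasted window to the image bounds.
        rs, re = max(r0, 0), min(r0 + h, img_h)
        cs, ce = max(c0, 0), min(c0 + w, img_w)
        if rs >= re or cs >= ce:
            continue
        acc[rs:re, cs:ce] += hm[rs - r0:re - r0, cs - c0:ce - c0]
        cnt[rs:re, cs:ce] += 1
    # Pixels never covered by a patch keep a confidence of zero.
    return acc / np.maximum(cnt, 1)
```

Because every patch votes independently, two patches that disagree about a corner leave two separate peaks in the averaged channel, which is the ambiguity the projection grouping module is meant to resolve.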

Projection Grouping
For DeepHMap [11], the key to improving robustness to heavy occlusion is to feed random local patches, instead of the whole object of interest, to the CNN. However, using local patches means that the correlation between different parts of an object is ignored. During inference, each sample in the minibatch predicts projection heatmaps according to its own content. As a result, multiple local maxima can frequently be found in each channel of the merged heatmaps. Oberweger et al. [11] select the global maximum to eliminate this ambiguity. However, the global maximum is not always the optimal choice.
To solve these ambiguities more thoroughly, we propose a simple projection grouping module to guide the projection selection. For the projection distribution on a single channel of the heatmaps, the rationality of its location can be evaluated by constraints from two aspects: correlation constraints among projections on different channels on one side, and correspondence constraints between the 3D BBCs and their corresponding 2D projections on the other. Next, we elaborate on the design of the network architecture with respect to the correlation constraints; the correspondence constraints are fused in the subsequent correspondence evaluation step. The first difficulty is that the number of local maxima on each channel of the heatmaps changes dynamically. Each channel of the merged heatmaps may contain anywhere from zero to many projection clusters. For local patches from background or occluded areas, the heatmaps may not contain any peaks. Before giving further design details, we first revisit the strategy employed in DeepHMap. In particular, let $X = \{l_1, l_2, \ldots, l_8\}$ represent the predicted heatmaps, consisting of eight channels corresponding to the different BBCs. To get a group of projections from the different heatmap channels, Oberweger et al. [11] consistently choose the global maximum. This simple strategy can be written as:
$$y_i = \max(l_i), \quad i = 1, 2, \ldots, 8, \tag{1}$$
where $\max(\cdot)$ is a function that takes the global maximum from a single-channel heatmap, and the output $y_i$ denotes the $i$th channel of the filtered heatmaps and contains the predicted projection of the corresponding BBC. Clearly, the projection grouping described in Equation (1) is carried out separately, layer by layer.
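The baseline selection rule and the ambiguity it faces can be sketched in a few lines of NumPy. The helper names `global_maxima` and `local_maxima`, the 8-neighborhood peak definition, and the threshold value are assumptions made for illustration.

```python
import numpy as np

def global_maxima(heatmaps):
    """Per-channel argmax, i.e. the layer-by-layer baseline rule.

    heatmaps: (h, w, C). Returns a (C, 2) array of (row, col) peaks.
    """
    h, w, c = heatmaps.shape
    flat_idx = heatmaps.reshape(h * w, c).argmax(axis=0)
    return np.stack(np.unravel_index(flat_idx, (h, w)), axis=1)

def local_maxima(channel, threshold=0.1):
    """All strict 8-neighborhood peaks above a (hypothetical) threshold.

    channel: (h, w) float array. Returns a (K, 2) array of (row, col).
    """
    h, w = channel.shape
    padded = np.pad(channel, 1, mode="constant", constant_values=-np.inf)
    neighbors = np.stack([
        padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
        for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)
    ])
    mask = (channel > neighbors.max(axis=0)) & (channel > threshold)
    return np.argwhere(mask)
```

When `local_maxima` returns more than one candidate for a channel, `global_maxima` always keeps the strongest one, regardless of whether it is geometrically consistent with the peaks selected on the other seven channels.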
To fuse the correlation constraints mentioned above, a simple fully connected network with a residual architecture is employed to learn different projection cases. The task of projection grouping becomes constructing a mapping $f$ with learned parameters $\psi$, such that $\forall i$,
$$y_i = [f_\psi(X)]_i. \tag{2}$$
Compared with Equation (1), the strategy given in Equation (2) takes the correlation constraints among different channels into account. As shown in Figure 2, the projection grouping module takes the merged heatmaps adjusted by the spatial transformation layer [45] as input. For the input patch $p_j$ of a minibatch consisting of $N_{batch}$ samples, let $Y_j$ and $O_j$ represent the predicted heatmaps and the expected heatmaps, respectively. Note that the expected heatmaps are normalized to ensure that the maximum of each channel is equal to 1. Following the procedure reported in DeepHMap, ground truth heatmaps are generated by placing a 2D Gaussian distribution at the ground truth projection in each channel. More details can be seen in Figure 2. In practice, the immediate output of the last layer of the projection grouping module is a feature vector $V$ of dimension $N_{out}$. We flatten the ground truth heatmaps $O_j$ to construct the corresponding probability labels. The standard cross-entropy loss for training can be written as
$$L_{pg} = \frac{1}{N_{batch}} \sum_{j=1}^{N_{batch}} H\big(s(V_j), \hat{V}_j\big), \qquad H\big(s(V), \hat{V}\big) = -\sum_{i=1}^{N_{out}} \hat{v}_i \log s(v_i), \tag{3}$$
where $H(\cdot)$ represents the cross entropy, $N_{batch}$ denotes the number of samples in a minibatch, $s(\cdot)$ is the softmax function, and $v_i$ and $\hat{v}_i$ are the $i$th element of the output vector $V$ and its corresponding probability label, respectively. With the projection grouping module, most unmatched local maxima are removed and the projection clusters corresponding to the ground truth are preserved. We test different configurations of the projection grouping module (see Figure 2), and the corresponding results can be found in Section 4.
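The label construction and the training loss can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the Gaussian width `sigma` is a hypothetical value, and the flattened ground truth heatmaps are turned into probability labels by normalizing them to sum to one, which the text does not spell out.

```python
import numpy as np

def gaussian_heatmap(h, w, center, sigma=2.0):
    """Ground truth channel: a 2D Gaussian at `center`, peak scaled to 1."""
    rows, cols = np.mgrid[0:h, 0:w]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    hm = np.exp(-d2 / (2.0 * sigma ** 2))
    return hm / hm.max()

def heatmaps_to_labels(heatmaps):
    """Flatten (h, w, C) heatmaps into one probability label vector.

    Assumption: labels are the flattened heatmaps normalized to sum to 1.
    """
    flat = heatmaps.reshape(-1)
    return flat / flat.sum()

def softmax_cross_entropy(logits, labels):
    """Mean over the minibatch of H(softmax(logits), labels).

    logits, labels: (N_batch, N_out); each label row sums to 1.
    """
    z = logits - logits.max(axis=1, keepdims=True)           # stabilize
    log_s = z - np.log(np.exp(z).sum(axis=1, keepdims=True)) # log-softmax
    return float(-(labels * log_s).sum(axis=1).mean())
```

Computing the log-softmax directly, rather than taking the logarithm of the softmax output, avoids overflow for large logits.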
Figure 2. Architecture of the projection grouping module. The projection grouping network adopts a residual structure [46] consisting of a fully connected feedforward path and a shortcut connection (see case (c) and case (d)). Compared with case (c), case (d) utilizes a fully connected two-layer feedforward network. The projection grouping layer takes merged heatmaps with a shape of [w, h, 8] as input and generates heatmaps of the same size that contain only one peak in each channel. On the input side, the merged heatmaps with a shape of [w, h, 8] are flattened to match the architecture of the projection grouping module. On the output side, the immediate output is reshaped to generate the filtered projection heatmaps. All layers use the rectified linear unit (ReLU) [47] activation function except for the output layer. Dropout [48] is applied to the first dense layer, and the softmax function is placed after the add operation. Different configurations of the projection grouping layer are detailed in the evaluation section. The additional cases (a) and (b) are plain modules without a shortcut connection, which are tested for comparison purposes.
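A forward pass through a two-layer residual variant of this module can be sketched in NumPy. The sketch is an inference-time illustration only: dropout is omitted, the weights `W1, b1, W2, b2` are hypothetical placeholders, and applying the softmax over the whole flattened vector is an assumption based on the loss definition rather than a detail confirmed by the figure.

```python
import numpy as np

def projection_grouping_forward(heatmaps, W1, b1, W2, b2):
    """Flatten -> dense (ReLU) -> dense -> add shortcut -> softmax -> reshape.

    heatmaps: (h, w, C) merged heatmaps.
    W1: (hidden, h*w*C), b1: (hidden,), W2: (h*w*C, hidden), b2: (h*w*C,).
    """
    h, w, c = heatmaps.shape
    x = heatmaps.reshape(-1)                 # input side: flatten
    hidden = np.maximum(W1 @ x + b1, 0.0)    # first dense layer with ReLU
    v = W2 @ hidden + b2 + x                 # output layer plus shortcut
    v = v - v.max()                          # numerically stable softmax
    p = np.exp(v) / np.exp(v).sum()
    return p.reshape(h, w, c)                # output side: reshape
```

The shortcut requires the output dimension to equal the flattened input dimension, so that the network only has to learn a correction to the merged heatmaps instead of regenerating them.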

Correspondence Learning Based Hypothesis Scoring
Usually, only one peak remains in each channel after the raw predicted heatmaps pass through the projection grouping module. However, directly using this maximum to construct correspondence constraints is still insufficient, because the CNN is inevitably disturbed by occlusion, background clutter, and noise during inference. To minimize these perturbations, we construct a hypothesis pool and feed all of its 2D-3D correspondences into a correspondence-evaluation network. Each correspondence hypothesis is assigned a confidence score, and the high-confidence correspondences are collected to calculate the object pose. The process of assigning confidence to correspondence hypotheses is referred to as hypothesis scoring.

Generating Hypothesis Pool
We first describe the construction of the hypothesis pool. As mentioned earlier, only one projection cluster usually remains in each channel after projection grouping. Eight projections of interest (POIs), each centered on the global maximum of its channel with a radius of $R$, are first determined. These POIs accommodate the bias caused by jitter. A total of $N_{ch}$ correspondence hypotheses are randomly sampled from each POI, including the one corresponding to the peak. Additionally, the sampled points must have a confidence higher than a predefined threshold. These $8N_{ch}$ correspondence hypotheses are then fed into the subsequent correspondence-evaluation network.
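The pool construction can be sketched as follows. This is a minimal NumPy illustration: the function name `sample_hypotheses`, the threshold value, and sampling with replacement are assumptions, while the radius and the number of hypotheses per channel follow the roles of $R$ and $N_{ch}$ in the text.

```python
import numpy as np

def sample_hypotheses(heatmaps, bbcs, radius=10, n_ch=60,
                      conf_thresh=0.1, rng=None):
    """Build a pool of 2D-3D correspondences [u, v, x, y, z].

    heatmaps: (h, w, 8) filtered heatmaps, one peak per channel.
    bbcs:     (8, 3) bounding box corners in object coordinates.
    Returns an array of shape (8 * n_ch, 5) if every channel has a peak
    above conf_thresh (a hypothetical threshold value).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = heatmaps.shape
    rows, cols = np.mgrid[0:h, 0:w]
    pool = []
    for i in range(c):
        ch = heatmaps[..., i]
        pr, pc = np.unravel_index(ch.argmax(), ch.shape)
        # Projection of interest: disc of the given radius around the peak.
        inside = (rows - pr) ** 2 + (cols - pc) ** 2 <= radius ** 2
        cand = np.argwhere(inside & (ch > conf_thresh))
        if len(cand) == 0:
            continue  # no confident projection on this channel
        # The peak itself is always kept; the rest is drawn at random.
        idx = rng.choice(len(cand), size=n_ch - 1, replace=True)
        picks = np.vstack([[pr, pc], cand[idx]])
        for r, col in picks:
            pool.append((col, r, *bbcs[i]))  # (u, v) = (col, row)
    return np.asarray(pool, dtype=float)
```

Each 3D corner is paired with several nearby 2D candidates, so the network that follows can down-weight samples that are inconsistent with the other seven corners.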

Learning with a Hybrid Loss
The hybrid loss of the correspondence learning network [17] consists of a classification term and a regression term. The input correspondences are assigned weights that indicate whether they are inliers or outliers. The weighted correspondences are then utilized to formulate an essential matrix based regression loss. As shown in Figure 3, the correspondence-evaluation network in our case is formally similar to the correspondence-learning network [17]. However, the input of our correspondence-evaluation network is a 2D-3D correspondence $c_i = [p_i, P_i]$ instead of a keypoint pair on stereo images. The loss function thus needs to be reformulated to accommodate the new input type. Let $C = [c_1, c_2, \ldots, c_N]$ be a set of 2D-3D correspondences, where $p_i = [u_i, v_i]$ is the predicted projection in the heatmaps and $P_i = [x_i, y_i, z_i]$ is the spatial coordinate of a BBC in the object coordinate system. For each object of interest, an arbitrary pose can be represented by eight size-specific BBCs. The spatial coordinates of the BBCs are thus reused when constructing different correspondences. For each 2D-3D correspondence, the mapping takes the form of a 3 × 4 projection matrix $H$:
$$p_i = H P_i, \tag{4}$$
where $p_i$ and $P_i$ are written in homogeneous coordinates and the equality holds up to scale. The vectors $p_i$ and $HP_i$ have the same direction, and Equation (4) can thus be expressed in terms of a vector cross product:
$$p_i \times H P_i = 0. \tag{5}$$
For the over-determined case with more than six 2D-3D correspondences, Equation (5) can be rewritten in the following form:
$$A_i h = 0, \tag{6}$$
where $A_i$ denotes the correspondence matrix and $h$ is the coefficient vector made up of the entries of $H$.
We now construct a 2N × 12 correspondence matrix $A$ by stacking Equation (6) for each correspondence. The projection matrix $H$ can be computed by performing the singular value decomposition (SVD) of $A$ and taking the unit singular vector corresponding to the smallest singular value [15]. With the classification term in the hybrid loss, possible numerical instability [17,49] in the eigendecomposition is well suppressed. The singular value based regression term is replaced with a common reprojection error (Equation (7)), in which $N_h$ represents the number of 2D-3D correspondences with a predicted label of 1. The classification term $L_{cla}$ is computed as a binary cross-entropy loss, which efficiently rejects outliers through correspondence classification. Putting the classification term and the geometry term together, the overall loss of $N_{in}$ 2D-3D correspondences can be written as: ... In practice, the multi-layer perceptrons are implemented using conv1d in TensorFlow [50].
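The linear solve described above corresponds to the standard direct linear transform (DLT), which can be sketched in NumPy as follows. The two rows contributed by each correspondence come from expanding the cross product $p_i \times HP_i = 0$ with $p_i = (u_i, v_i, 1)$. The function names are illustrative, and the mean Euclidean pixel distance in `reprojection_error` is one plausible form of the reprojection term, not necessarily the exact one used in the loss.

```python
import numpy as np

def dlt_projection_matrix(pts2d, pts3d):
    """Estimate the 3x4 projection matrix H (up to scale) by DLT.

    pts2d: (N, 2) image points, pts3d: (N, 3) object points, N >= 6.
    """
    rows = []
    for (u, v), P in zip(pts2d, pts3d):
        X = np.append(P, 1.0)                      # homogeneous 3D point
        # Two independent rows of the cross-product constraint.
        rows.append(np.concatenate([np.zeros(4), -X, v * X]))
        rows.append(np.concatenate([X, np.zeros(4), -u * X]))
    A = np.asarray(rows)                           # shape (2N, 12)
    _, _, vt = np.linalg.svd(A)
    # Right singular vector of the smallest singular value, as a 3x4 matrix.
    return vt[-1].reshape(3, 4)

def reprojection_error(H, pts2d, pts3d):
    """Mean pixel distance between pts2d and the projections of pts3d."""
    X = np.hstack([pts3d, np.ones((len(pts3d), 1))])
    proj = X @ H.T
    proj = proj[:, :2] / proj[:, 2:3]              # perspective division
    return float(np.linalg.norm(proj - pts2d, axis=1).mean())
```

In a weighted variant, each pair of rows would be scaled by the inlier weight predicted for its correspondence before the decomposition, so that outliers contribute little to the estimate.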
The main differences between the correspondence-evaluation network used here and the original one [17] are as follows: (1) Instead of 2D-2D correspondences obtained from stereo images, the network here takes 2D-3D correspondences as input and learns the mapping between projection heatmaps and BBCs. (2) The training loss of the network is reformulated by replacing the SVD based regression term with a general reprojection loss (see Equations (7) and (8)). (3) The training dataset of the correspondence-evaluation network differs from the conventional 2D case [17] and 3D case [51]: additional correlation constraints are fused to imitate the projection distribution of 3D BBCs. More training details are given in Section 3.4.

Training Dataset
Our proposed two-stage pipeline cannot be trained end-to-end because of the non-differentiable paths connecting the different modules. The three subtasks, namely prediction of projection heatmaps, projection grouping, and correspondence evaluation, are thus trained separately. In the first stage, a mixed dataset consisting of synthetic and real samples is generated according to the procedure described in Ref. [19]. The synthetic samples are collected over a series of discrete viewpoints, and the real samples are generated by segmenting the masked object of interest and then applying an additional in-plane rotation. This mixed dataset contains 200,000 samples, with a synthetic-to-real ratio of 1:1. The hyperparameters of DeepHMap are kept unchanged. Note that DeepHMap is an object-specific network, so a similar object-specific training dataset must be prepared for each object.
The merged heatmaps are collected to train the projection grouping module. The correspondence-evaluation network takes a set of 2D-3D correspondences as input. We thus synthesize a series of 2D-3D correspondences by projecting size-specific BBCs into the image coordinate system. Similar to the preparation of the training dataset for DeepHMap, a sample set consisting of 200,000 2D-3D correspondences is collected by placing a virtual camera at different viewpoints. Eight BBCs, instead of a mesh model, are placed at the center of the view sphere. Additional noise and outliers are added to augment the synthesized samples. In practice, the weights in Equation (8) are set to α = 1 and β = 0.15, respectively.
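The synthesis of 2D-3D training correspondences can be sketched as follows. This is a minimal NumPy illustration: the look-at camera model, the Gaussian pixel noise, and replacing outliers with uniformly random image positions are assumptions, and all function and parameter names are hypothetical.

```python
import numpy as np

def box_corners(size):
    """Eight bounding box corners (BBCs) of a box centered at the origin."""
    sx, sy, sz = np.asarray(size, dtype=float) / 2.0
    return np.array([[x, y, z] for x in (-sx, sx)
                     for y in (-sy, sy) for z in (-sz, sz)])

def look_at_pose(cam_pos):
    """Rotation and translation of a camera at cam_pos facing the origin."""
    z = -cam_pos / np.linalg.norm(cam_pos)        # optical axis
    up = np.array([0.0, 0.0, 1.0])
    if abs(z @ up) > 0.99:                        # avoid a degenerate up
        up = np.array([0.0, 1.0, 0.0])
    x = np.cross(up, z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                       # rows: camera axes
    return R, -R @ cam_pos

def synth_correspondences(size, K, cam_pos, noise_px=1.0, outlier_ratio=0.2,
                          image_hw=(480, 640), rng=None):
    """Return ((8, 5) array of [u, v, x, y, z], (8,) inlier labels)."""
    rng = np.random.default_rng() if rng is None else rng
    P = box_corners(size)
    R, t = look_at_pose(np.asarray(cam_pos, dtype=float))
    uvw = (P @ R.T + t) @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]                 # perspective projection
    uv = uv + rng.normal(0.0, noise_px, uv.shape) # pixel noise
    labels = np.ones(len(P), dtype=int)
    out = rng.random(len(P)) < outlier_ratio
    # Outliers: replace the projection with a random image position.
    uv[out] = rng.uniform([0, 0], [image_hw[1], image_hw[0]],
                          (int(out.sum()), 2))
    labels[out] = 0
    return np.hstack([uv, P]), labels
```

Sampling `cam_pos` over a view sphere and repeating the call yields a labeled set in which the inlier flags can supervise the classification term.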

Datasets and Evaluation Metric
In this section, three public datasets: LineMOD dataset [18], Occluded LineMOD dataset [6] and YCB-Video dataset [19] are employed to evaluate the proposed backend and integrated two-stage pipeline.
The LineMOD dataset consists of 15 different object sequences with corresponding ground truth poses. The occluded version [6] is generated by selecting images from the LineMOD dataset in which the objects occlude each other to a large extent under different viewing directions. The YCB-Video dataset contains 21 different object sequences with significant image noise, illumination changes, background clutter, and severe occlusion.
To evaluate the performance of pose estimation algorithms objectively, two popular metrics in this field, the 2D reprojection error [34] and ADD|I [18], are employed to define a correctly estimated pose. With the 2D reprojection error, an estimated pose is accepted if the average distance between the projections of all model points under the estimated pose and under the ground truth pose is below five pixels. ADD is based on the ratio between the average distance of the transformed model points and the object's diameter. ADI is specifically designed to deal with symmetrical objects, for which the average distance is computed using the closest transformed model point. The default ratio threshold in ADD|I is retained and set to 0.1.
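The three acceptance tests can be written compactly in NumPy. The sketch below follows the usual definitions of these metrics; the function names are illustrative, and `pts` stands for the model points of the object.

```python
import numpy as np

def transform(R, t, pts):
    """Apply a rigid transform to (N, 3) model points."""
    return pts @ R.T + t

def add_metric(R_est, t_est, R_gt, t_gt, pts):
    """ADD: mean distance between corresponding transformed points."""
    diff = transform(R_est, t_est, pts) - transform(R_gt, t_gt, pts)
    return float(np.linalg.norm(diff, axis=1).mean())

def adi_metric(R_est, t_est, R_gt, t_gt, pts):
    """ADI: mean distance to the closest transformed point (symmetry-aware)."""
    est = transform(R_est, t_est, pts)
    gt = transform(R_gt, t_gt, pts)
    d = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def reprojection_2d(K, R_est, t_est, R_gt, t_gt, pts):
    """Mean pixel distance between projections under both poses."""
    def project(R, t):
        cam = transform(R, t, pts) @ K.T
        return cam[:, :2] / cam[:, 2:3]
    delta = project(R_est, t_est) - project(R_gt, t_gt)
    return float(np.linalg.norm(delta, axis=1).mean())

def is_correct(distance, diameter, ratio=0.1):
    """Accept a pose if ADD or ADI is below ratio * object diameter."""
    return distance < ratio * diameter
```

A pose would then be counted as correct under the 2D metric when `reprojection_2d(...) < 5.0`, and under ADD|I when `is_correct` holds for the metric that matches the object's symmetry.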

Architecture and Parameter Selection for Projection Grouping Module
In the raw merged heatmaps from occluded scenes, multiple local peaks can frequently be found in different channels. The one-size-fits-all rule in DeepHMap is not always able to find the optimal projection cluster, which is defined as a region of interest centered at the ground truth projection with a radius of 10 pixels. To quantify the effect of the projection grouping module, we count the number of false projection selections (FPS), $N_{ps}$, per hundred channels. A projection selection is considered correct if its location is inside the corresponding projection cluster. The goal of our projection grouping module is to implicitly learn correlation constraints among the projections of BBCs. To best meet the three design principles mentioned above, we test different configurations and report the results in Figures 4 and 5. The corresponding results provided by the max function [11] are also included. Unless explicitly stated, results from DeepHMap do not utilize feature mapping [52]. As shown in Figures 4 and 5, we observe the following: (1) the projection grouping module with a residual architecture and a dropout layer achieves the best results in the FPS metric on both datasets; (2) compared with the results on the Occluded LineMOD dataset, all tested methods give a lower number of FPS on the other dataset, which is in line with expectations because severe occlusion brings more interference to the inference of the network; (3) besides the optimal configuration, the other configurations of the projection grouping module also outperform the max function of the baseline [11]. These experiments consistently demonstrate the effectiveness of the projection grouping module. Such improvements can also be seen in Figure 6. For occluded cases (Figure 6a,b), the projection heatmaps from DeepHMap++ are much cleaner. For non-occluded cases, the improvements from the projection grouping module are visually limited. Benefiting from the learned correlation constraints, the projection grouping module can find the matching projection even when it does not correspond to the global maximum (as shown in Figure 6). For the rest of the evaluation, we use the optimal configuration, that is, PG-2-2048+D.
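The FPS count used in this comparison can be sketched in a few lines of NumPy. The function name is illustrative, and treating a selection exactly on the 10-pixel boundary as correct is an assumption.

```python
import numpy as np

def false_projection_selections(selected, ground_truth, radius=10.0):
    """Number of false projection selections per hundred channels.

    selected, ground_truth: (M, 2) pixel positions, one row per channel.
    A selection is false if it lies outside the projection cluster, i.e.
    farther than `radius` pixels from the ground truth projection.
    """
    dist = np.linalg.norm(np.asarray(selected, dtype=float)
                          - np.asarray(ground_truth, dtype=float), axis=1)
    return 100.0 * float(np.mean(dist > radius))
```

For example, one false selection among the eight channels of a single test image corresponds to an FPS value of 12.5.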

Correspondence Evaluation
We evaluate the correspondence evaluation module from three perspectives. First, we evaluate the performance of the correspondence-evaluation network with varying sizes of the projections of interest (POIs) and varying numbers of sampled correspondences per channel (see Figure 7). Second, we evaluate the correlation constraints. As mentioned in Sections 3.3 and 3.4, correlation constraints are employed to guide the learning of the correspondence-evaluation network. To verify the effectiveness of correlation constraints in the training dataset, we test two versions of the correspondence-evaluation network: one trained on a dataset with correlation constraints (CorrNet) and one without (CorrNet w/o CC). For the unconstrained case, we follow a procedure similar to the view-sphere based method [18]: for each viewpoint, we randomize the 2D positions of the projections and obtain the corresponding 3D reference points by a back-projection function. Third, we evaluate the correspondence-evaluation module against the RANSAC based strategy employed in DeepHMap. Regarding the radius of the POI, Figure 7 shows that the increase in precision saturates at 10 px. As for the number of sampled correspondences, increasing it from 60 to 80 or more only slightly affects the performance of the correspondence-evaluation network. For the rest of the evaluation, we thus use r = 10 px and n = 60.
Using the selected POI radius and number of sampled correspondences, we then evaluate the impact of correlation constraints on CorrNet. Note that an evaluation of DeepHMap on the LineMOD dataset has not been reported, and we thus list the corresponding results of BB8 [7] as a substitute. Table 1 shows that both versions significantly outperform BB8 [7] on the LineMOD dataset [18]. This is mainly due to the specially designed network for weighted correspondences, the projection grouping, and the boost from DeepHMap. Without correlation constraints, the average accuracy of CorrNet w/o CC is about 9.7% higher than that of BB8 [7] in the ADD|I metric and about 5.3% higher in the 2D reprojection error metric. In addition, correlation constraints in the training dataset further improve the average accuracy over CorrNet w/o CC by 1.8% in the ADD|I metric and by about 0.7% in the 2D reprojection error metric. This supports the validity of correlation constraints for correspondence selection.
Similar results on the Occluded LineMOD dataset [6] can be found in Table 2. The margins between the RANSAC based strategy in DeepHMap and the two versions of CorrNet reach an average of 1.6% and 1.0% under the ADD|I metric, and 5.2% and 3.0% under the 2D reprojection error metric. This confirms once again that correlation constraints further improve the performance of CorrNet, and that CorrNet handles false correspondences better than the RANSAC based strategy [11].
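For readers unfamiliar with the two metrics used above, the sketch below shows how they are commonly computed from an estimated pose, a ground-truth pose and the object model points. The function names are illustrative. The acceptance thresholds are not stated in this section; widely used conventions are 5 px for the 2D reprojection error and 10% of the object diameter for ADD|I, and these should be read as assumptions here.

```python
import numpy as np

def project(points_3d, R, t, K):
    """Project Nx3 model points into the image with pose (R, t) and intrinsics K."""
    cam = points_3d @ R.T + t      # model frame -> camera frame
    uvw = cam @ K.T                # camera frame -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]

def reprojection_error(points_3d, R_est, t_est, R_gt, t_gt, K):
    """Mean 2D distance (px) between projections under the two poses."""
    diff = project(points_3d, R_est, t_est, K) - project(points_3d, R_gt, t_gt, K)
    return float(np.linalg.norm(diff, axis=1).mean())

def add_error(points_3d, R_est, t_est, R_gt, t_gt):
    """ADD: mean 3D distance between corresponding transformed model points."""
    est = points_3d @ R_est.T + t_est
    gt = points_3d @ R_gt.T + t_gt
    return float(np.linalg.norm(est - gt, axis=1).mean())

def adi_error(points_3d, R_est, t_est, R_gt, t_gt):
    """ADI (symmetric objects): mean distance to the closest transformed point."""
    est = points_3d @ R_est.T + t_est
    gt = points_3d @ R_gt.T + t_gt
    # Pairwise distances; fine for a few thousand points, use a KD-tree beyond that.
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())
```

ADD|I then means that `add_error` is applied to asymmetric objects and `adi_error` to symmetric ones; the reported accuracy is the fraction of test frames whose error falls below the chosen threshold.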

Results from the Full Pipeline
We now evaluate our full pipeline on two datasets with severe occlusion, namely the Occluded LineMOD dataset [6] and the YCB-Video dataset [19]. For comparison, we use two state-of-the-art methods, PoseCNN [19] and DeepHMap [11]. Note that all methods in the evaluation section take only RGB images as input. For the YCB-Video dataset, the area under the accuracy-threshold curve (AUC) [19] is used as an additional metric.
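The AUC metric can be sketched as follows: the accuracy (fraction of frames whose pose error is below a threshold) is computed for a sweep of thresholds, and the area under the resulting curve is normalised to [0, 1]. The function name, the maximum threshold of 0.1 m and the number of steps are assumptions made for illustration; the exact settings follow [19] and are not restated in this section.

```python
import numpy as np

def accuracy_threshold_auc(errors, max_threshold=0.1, n_steps=1000):
    """Normalised area under the accuracy-vs-threshold curve.

    errors: per-frame pose errors (e.g. ADD or ADI, in metres).
    max_threshold: largest threshold of the sweep (assumed 0.1 m here).
    """
    errors = np.asarray(errors, dtype=float)
    thresholds = np.linspace(0.0, max_threshold, n_steps + 1)
    # accuracy[i] = fraction of frames with error below thresholds[i]
    accuracy = (errors[None, :] < thresholds[:, None]).mean(axis=1)
    # Trapezoidal rule, written out to stay independent of the NumPy version.
    area = np.sum(0.5 * (accuracy[1:] + accuracy[:-1]) * np.diff(thresholds))
    return float(area / max_threshold)
```

A method whose errors are all far below the maximum threshold scores close to 1, while a method whose errors all exceed it scores 0, so the AUC summarises the whole curve rather than a single operating point.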
Figure 8 gives a more complete comparison between DeepHMap and DeepHMap++. For all eight sequences from the Occluded LineMOD dataset [6], DeepHMap++ consistently achieves better results than DeepHMap under different pixel thresholds. With the 2D reprojection error metric, a smaller pixel threshold corresponds to a more accurate estimate. Unsurprisingly, the gain of DeepHMap++ is more pronounced in the low-threshold range from 0 px to 30 px. A test scene that is only counted as correct at a larger pixel threshold is one whose estimated pose deviates significantly from the ground truth. Such challenging scenes are very difficult for both the projection grouping module and the correspondence-evaluation module. Thus, the improvement of DeepHMap++ becomes limited once the pixel threshold exceeds 30 px.
Comparisons between DeepHMap++, DeepHMap and PoseCNN are listed in Table 3. For all object sequences from the YCB-Video dataset, DeepHMap++ consistently improves on DeepHMap, which uses RANSAC based correspondence sampling and max-function based projection grouping, under three different metrics. The other baseline, PoseCNN [19], jointly employs semantic labeling and the object center as intermediate cues. However, the entire image is fed directly into the CNN to build the mapping from image space to object center. This holistic scheme is more sensitive to foreground occlusion than DeepHMap and DeepHMap++, which use local input. Local feature input plus a specially designed back-end ensures that DeepHMap++ achieves the best results on most of the entries. We also show some qualitative results on both datasets in Figures 9 and 10, respectively.

Runtime Analysis
Our current implementation is written in Python on an Ubuntu machine with an Intel E5-2640 CPU (Intel, Santa Clara, CA, USA) and an NVIDIA GeForce GTX 1080 Ti GPU (NVIDIA, Santa Clara, CA, USA). The network part is built on TensorFlow [50], a large-scale machine learning library. In the first stage, projection heatmap prediction takes 80 ms for 64 patches; parallel processing reduces this to 20 ms. Thanks to its simple architecture, the projection grouping module takes only 3 ms. The subsequent correspondence-evaluation network takes 20 ms to assign correspondence weights and compute the full DoF pose parameters. For a 640 px × 480 px image, sliding-window pose detection takes about 130 ms, which drops to 60 ms with parallel processing. The main runtime statistics are listed in Table 4.

Conclusions and Future Work
We have improved the back-end of a two-stage pipeline for recovering the 6D pose of rigid objects in challenging scenes. With a simple fully connected module, projection ambiguity is handled better than with the one-size-fits-all strategy in DeepHMap. The proposed projection grouping module learns the correlation constraints among different BBCs and reduces the number of false projection selections. A correspondence-evaluation network is then employed to obtain weighted correspondences, in place of a RANSAC based strategy. These efforts enable the proposed method to outperform state-of-the-art solutions on three public benchmarks. Meanwhile, these refinements add little computational burden, which indicates the potential of our method for real-time applications.
In the future, an interesting direction is to add a branch to the backbone for object segmentation, which can provide additional regularization to some extent. Another direction is to integrate the improved two-stage approach into a pose tracking framework. In a standard pose tracking pipeline, pose parameters from the previous frame can be reused to replace the pose detection step in DeepHMap. In addition, adapting the network architecture to allow end-to-end training may benefit the final results.

Figure 3. Architecture of the correspondence-evaluation network. It takes 2D-3D correspondences as input and directly produces correspondence weights. The basic residual block consists of weight-sharing perceptrons and context normalization. In practice, the multi-layer perceptrons are implemented using conv1d in TensorFlow [50].

Figure 4. Statistics of false projection selections per one hundred channels for different configurations on parts of the LineMOD dataset [18]. A typical configuration is written as PG-x-y w/o SC+D, where PG stands for the projection grouping module, w/o SC indicates that the network contains no shortcut connection (SC), D denotes the dropout layer, and x and y are the number of fully connected layers and the dimensionality of the output space, respectively. max refers to the strategy used in DeepHMap. Lower is better.

Figure 5. Statistics of false projection selections per one hundred channels on the Occluded LineMOD dataset [6]. The tested methods are the same as in Figure 4.

Figure 6. Predicted projection heatmaps for different RGB images from the Occluded LineMOD dataset [6]. For each of (a) to (e), the region of interest of a test frame is shown together with the corresponding heatmap channels predicted by DeepHMap (top) and by the projection grouping module (bottom).

Figure 7. Evaluation of our proposal with a varying radius of the projection of interest (POI) and different numbers of sampled correspondences. The horizontal axis denotes the radius of the POI in pixels. The vertical axis denotes the fraction of correctly estimated scenes under the 2D reprojection error metric.

Figure 8. Evaluation on the Occluded LineMOD dataset [6]. Each curve shows accuracy vs. pixel threshold under the 2D reprojection error metric. The vertical axis denotes the fraction of correctly estimated scenes. The horizontal axis denotes the pixel threshold.

Figure 9. Estimated 6D poses on the Occluded LineMOD dataset [6]. The red and blue bounding boxes denote the ground truth and the results estimated by DeepHMap++, respectively. The left, middle and right columns show results from the Ape, Can and Driller sequences, respectively.

Figure 10. Estimated 6D poses on the YCB-Video dataset [19]. The ground truth is shown in red, and the results estimated by DeepHMap++ are shown in blue. The four rows (from top to bottom) correspond to test images from the 003_cracker_box, 004_sugar_box, 005_tomato_soup_can and 007_tuna_fish_can sequences, respectively.

Table 1. Estimation results of two different versions, CorrNet and CorrNet w/o CC, on the LineMOD dataset [18]. The best results for each term are shown in bold.

Table 3. Comparisons with state-of-the-art methods on the YCB-Video dataset [19]. We report the AUC scores, ADD|I and 2D reprojection error for the 21 image sequences of the YCB-Video dataset. The best results for each term are shown in bold.