1. Introduction
The precise determination of an object’s location in a given scene is a key capability to achieve flexible interaction and manipulation to perform autonomous operations [1,2]. Traditionally, most computer vision research efforts have been centered on the detection and classification of objects in monocular images; however, only a few studies have focused on explicitly solving the full six-degree-of-freedom problem. Along this line, most approaches face the problem from a 2D point of view, rather than inferring the precise rotation and position of the objects in 3D space, commonly known as the 6D pose estimation problem. Within the last decade, with the rise of machine learning, novel monocular methods appeared showing increasing levels of robustness [3]. However, most of these approaches were still limited to the understanding of the scene in terms of object classification, segmentation, and bounding box detection. Only recently, methods based on deep learning [4,5,6,7] have shown promising results in solving the problem from a 6D pose estimation perspective.
In another direction, methods based on three-dimensional scene data have been the best solutions to robustly solve the 6D pose estimation problem for different types of objects and complex scenarios. A vast variety of solutions based on local features [8], such as PFH [9], SHOT [10], PPF [11] and related methods [12,13], global features, such as VFH [9] or ESF [14], template-matching approaches like Linemod [15], and also machine learning methods based on deep learning [16,17] or random forests [18] have been proposed. The existing wide range of traditional and more recent approaches and techniques, evaluated under different platforms, procedures and datasets, drew a rather complex picture of the state-of-the-art. In order to obtain a clearer picture of the state of the field, Hodan et al. [19] presented an extensive benchmark for the 6D pose estimation of a single instance of a single object (SiSo) task, where different challenging existing and new datasets were collected under a standardized evaluation procedure. Initially, 15 different methods were evaluated and, since then, the 6D pose estimation of a varying number of instances of a varying number of objects (ViVo) task has been introduced and more methods have been tested. Among these solutions, methods based on the Point Pair Features (PPF) voting approach [11] have shown some of the best performances. These approaches combine the benefits of a global object definition and a local matching approach by matching the object’s and scene’s surface data using feature quantization and a voting-based correspondence grouping and clustering process. Only recently have deep learning methods reached higher levels of accuracy, although they require a tremendous amount of data and long training procedures, compared to the single 3D model needed to train a PPF approach. Among all methods, our previous PPF-based depth-only solution presented in [20] obtained the best overall result in the BOP Challenge SiSo task in 2017, the best overall result in 2019, and the best depth-only result in 2020 for the ViVo task. This solution proposes a six-step pipeline that focuses on extracting more discriminative information from the surface data and uses additional refinements and improvements to boost its performance. Despite the good performance of those top-scoring methods against clutter and illumination changes, results show that occluded scenarios still remain challenging. In particular, for the SiSo task, results obtained on the Linemod occluded (LM-O) dataset [21,22] show a clear weakness of state-of-the-art methods against occluded cases, decreasing the overall recognition results from
to
when compared to the non-occluded version, the Linemod (LM) dataset. In a more detailed analysis presented in [19], recall scores related to the visible fraction of the target object show that more recent methods’ performance decreases to less than
recognition rates when occlusion levels reach
of the object.
In this paper, we propose to incorporate color information and visual attention principles to boost the performance of a state-of-the-art pose estimation method for highly occluded scenarios. Specifically, we propose to improve the method presented in [20] by using color information to guide the attention of the method to potential scene zones and to improve surface matching.
Visual attention is an important biological mechanism based on selecting subsets of the world information to perform faster and more efficient scene understanding. Inspired by the understanding of the human visual system and the development of more efficient intelligent applications, visual attention has been an important research topic in both the neuroscience and computer vision fields. Either based on bottom-up or top-down architectures [23], different computer vision methods for visual attention have been presented around the ideas of saliency maps [24], object-based attention [25], and saliency feature vectors [26]. Potapova et al. [27] presented a survey of visual attention from a 3D point of view, analyzing 3D visual attention for both human and robot vision. Their work reviews the most important attention computational models presented, from the widely used contrast-based saliency models [24] to the recently proposed Convolutional Neural Network (CNN) learning approaches [28]. Along this line, most research done on visual attention has focused on biologically inspired bottom-up attentional mechanisms. For most solutions, the generalized idea of salient feature identification is applied to devote the limited computational resources to the most attractive elements, regardless of the final task or prior knowledge. This pathway, however, does not completely match the requirements of occluded scenarios, where target objects may not necessarily be prominent or highly distinguishable attention elements in the scene. Hence, top-down mechanisms, where previously known features are identified as salient scene points for potential targets, are considered more suitable. Therefore, following this direction, we propose to integrate a top-down attention mechanism into the method presented in [20] by using color cues as prior knowledge of the object.
Although studies suggest that color contributes to biological object recognition [29], traditionally, color information has been scarcely applied to computer vision recognition approaches. While most methods rely on shape and texture information [30], only a few cases have considered color information as a prominent feature. Although this situation has been abruptly reversed with the rise of artificial neural network approaches, for which color information is usually considered, only a few traditional model-based solutions have relied on color information for object detection and recognition, such as color SIFT features [31] for 2D vision or CSHOT [32] and VCSH [33] for 3D vision. For the PPF voting approaches, Drost et al. [34] proposed a multimodal variant of the original method [11], defining pairs of oriented 3D points and 2D gradient edges. The proposed method showed a noticeable improvement in performance and robustness to light changes, with the main drawback of a large impact on the runtime performance [19]. In a different direction, Choi et al. [35] proposed to extend the PPF to 10 dimensions, including color information from the surface underneath both points. Although showing positive results on some datasets, more recent results presented in the Ref. [36] suggest that the inclusion of color in the PPF may provide, for some cases, higher precision results, but lower recognition rates. The deterioration of the recognition rates can be attributed to the subjugation of the geometric information to the color information, disregarding valid geometrical matches for non-matching color cases produced by illumination changes, modeling artifacts or different sensor characteristics. Arguably, this undesired behavior can be considered the main reason why historically few traditional model-based methods have relied on color data. In general, the mathematical modeling of color information, including its dependency on the sensor technology, is much more complex and unstable than that of other available features, like gradients or geometrical features. Instead, we propose to use the color information only as a cue of a correct match on top of the existing geometrical approach based on PPF. The main idea is to use the color information as a weighting factor to give more relevance to color-consistent cases and to help distinguish between geometrically ambiguous cases.
Therefore, we present a novel solution based on visual attention and color cues to boost the performance of a state-of-the-art method on highly occluded cases. First, we propose to use a top-down attention mechanism to focus the method on those parts of the scene that potentially belong to the object. Secondly, we propose to use color information as a weighting factor to improve the geometrical matching of the method. The proposed solutions have been evaluated on the SiSo task for different parameters, color spaces and metrics on the state-of-the-art benchmark occluded LM-O dataset. Results show that the proposed method obtains a very significant improvement on occluded cases, increasing recognition rates for objects with relatively low visibility and outperforming the other solutions. In addition, the proposed solution has been tested under different illumination conditions on the TUD-L dataset, obtaining better performance than previous methods and showing the robustness of the proposed approach to illumination changes. Finally, the method’s robustness has also been tested for cases with multiple instances on two different datasets, IC-MI and IC-BIN, showing robustness against scenes with a high number of repeated color patterns of the target object.
2. Method
We propose to integrate the attention-based approach and the color cue weighting solution into a state-of-the-art PPF voting approach. Specifically, the method of [20] is extended by using color information to identify a set of salient points that will guide the attention of the pose estimation algorithm, decreasing the complexity of the global matching problem while increasing the chances of obtaining a positive result. In addition, the color information is used as a weighting factor in the point pair matching and re-scoring steps to increase the relevance of color-consistent geometrical data.
2.1. The Point Pair Features Voting Approach
We base the proposed solution on a Point Pair Features (PPF) voting approach, extending with color the depth-only method presented in [20]. The PPF voting approach, first introduced by Drost et al. [11], is a feature-based method that globally defines an object as the set of pairs of the oriented points that define its surface, allowing a local matching of the object in a given scene by only matching a subset of these pairs.
The pairs are individually matched by using 4D features that encode the distance between the pair of points and the differences between their normal angles. More specifically, for a set of model points $M = \{m_1, \ldots, m_n\}$, a PPF is defined between a reference point $m_r$ and a second point $m_i$ with their respective normal vectors $n_r$ and $n_i$, as shown in Equation (1):

$$ F(m_r, m_i) = \left( \lVert d \rVert_2, \angle(n_r, d), \angle(n_i, d), \angle(n_r, n_i) \right), \quad (1) $$

where $d = m_i - m_r$ and $\angle(a, b)$ is the angle between the vectors $a$ and $b$. As the number of all possible point pair combinations grows quadratically with the number of points, the method proposes to reduce the overall number of points by downsampling both scene and model data with respect to the object diameter. Similar point pairs, i.e., pairs that define similar surface features, are grouped together in a hash table by quantizing the feature space. This table defines the object model as a mapping from the quantized PPF to the set of their corresponding model point pairs. Later, during the scene matching, this table is used to determine scene-model point pair correspondences, which are grouped into geometrically consistent 6D poses representing potential candidate poses for the object in the scene. This correspondence grouping relies on the fact that pairs sharing the same reference point can be efficiently grouped in a 2D space in a Hough transform manner. Specifically, for a given scene point belonging to the object model surface, a candidate pose is represented by a local coordinate (LC), which is defined by two parameters: a corresponding model point and the rotation around their aligned normals. The method defines a two-dimensional accumulator for each scene reference point where each cell represents an LC. Then, all pairs defined from this reference point are matched against the model and each scene-model pair correspondence is used to define an LC that casts a vote in the accumulator. The most voted LC defines a potential candidate pose. Finally, similar candidate poses obtained from different scene reference points are joined together using a clustering approach.
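The feature and hashing steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the quantization parameters `dist_step` and `angle_step` are hypothetical free parameters (in practice tied to the object diameter and an angular resolution):

```python
import numpy as np

def ppf(m_r, n_r, m_i, n_i):
    """4D point pair feature of two oriented points (Drost et al. style)."""
    d = m_i - m_r
    dist = np.linalg.norm(d)

    def angle(a, b):
        # numerically safe angle between two vectors
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return np.arccos(np.clip(cos, -1.0, 1.0))

    return np.array([dist, angle(n_r, d), angle(n_i, d), angle(n_r, n_i)])

def quantize(f, dist_step, angle_step):
    """Hash-table key: quantized feature, grouping similar pairs together."""
    return (int(f[0] / dist_step),
            int(f[1] / angle_step),
            int(f[2] / angle_step),
            int(f[3] / angle_step))
```

Pairs of model points whose quantized keys coincide end up in the same hash-table bucket, which is what allows constant-time retrieval of candidate model pairs during scene matching.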
Starting from here, the method presented in [20] further extends this idea by proposing a set of novel and improved steps in an integrated local pipeline. First, the preprocessing part is improved, proposing a resolution-independent process for normal estimation and introducing two novel downsampling steps that optimize the method performance by filtering non-discriminative surface data. These two steps check the normal variation between neighbouring points, clustering and filtering cases that have similar normal information. During matching, the method uses a more efficient kd-tree structure for the neighbour search and includes two additional improvements to tackle problems derived from the quantization and the over-representation of similar scene features. In addition, a novel threshold is introduced after correspondence grouping to discard poorly supported poses from the accumulator. A complete-linkage clustering approach is also proposed to improve the original clustering step, joining similar poses more robustly. Another relevant improvement is the introduction of an accurate solution to recompute the object fitting score by counting the number of model points matching the scene. This process employs the model’s rendered view refined by an efficient variant of the Iterative Closest Point (ICP) method. Finally, two different verification steps are included to discard false positive cases which do not consistently fit the scene surface in terms of visibility context and geometrical edges. Overall, the method showed a significant improvement with respect to [11] for varying types of objects and scene cases, outperforming other methods on different types of datasets under clutter and occlusion. Further details about the method can be found in the Ref. [20].
2.2. Attention-Based Matching Using Color Cues
The PPF voting approach is characterized by describing the whole object model as a set of oriented pairs from each of its points, as shown in Figure 1. As explained before, the matching process relies on finding, for each scene reference point, the best LC, i.e., the corresponding model point and rotation angle that best fit the object model in the scene, i.e., the most voted cell in the accumulator. Indeed, only scene reference points that truly belong to the object model will have a matching corresponding model point, and thus a correct LC. Therefore, all the other scene points will only add superfluous cases, i.e., wrong hypotheses, that will increase processing time and the likelihood of a final mismatch. From this point of view, the right selection of these reference points is an important element of the method performance which has been underestimated so far. In fact, up to now most available approaches propose to use a blind-search approach, using all scene points [35,37] or a fixed random fraction of them, usually one-fifth [11,20].
If we consider a rather more intuitive human perception approach, an object could be more efficiently found by focusing attention on zones of the scene that contain elements or features which resemble those of the object and can potentially be part of it. However, there are a number of reasons, e.g., occlusion, illumination changes, imperfections, for which those zones may not be properly identified and, therefore, the whole scene should still be searched. In that case, it seems reasonable to search the scene at regular intervals related to the object size. Hence, we propose to combine two different strategies: (1) to focus the matching attention on parts of the scene that are similar to the object; and (2) to search the whole scene at constant intervals. Following this reasoning, and taking advantage of the PPF voting approach’s nature of matching an object from a single reference point, we propose to center the attention of these matching points on scene points that have a color similar to the object, as well as on selected points distributed homogeneously at fixed spaced intervals. Therefore, the matching attention will be focused on salient points that are selected based on their relevance in the image (i.e., their color prominence) as well as their spatial distribution.
In order to identify the scene points that have a color similar to the object and can potentially belong to an object part, we propose to check the color similarity between each of the scene points and the object model. As a single object can have multiple colors on its surface and in different amounts, we only consider those scene points whose color is found multiple times on the model surface. Therefore, for each scene point, we propose to use a color metric to search all model points with similar color and only use those scene points with a minimum number of color-matching model points, which are more likely part of the object. Specifically, for a given scene point $s_i$, the set of similar color model points is defined by Equation (2),

$$ M_c(s_i) = \{ m_j \in M : d_c(s_i, m_j) < \omega \}, \quad (2) $$

where $d_c$ is a color distance metric between two points and $\omega$ is a threshold bounding the similarity level. Then, the set of reference points used to center the method attention is defined by the cardinality of color-matching points as defined by Equation (3),

$$ S_a = \{ s_i \in S : |M_c(s_i)| \geq \beta \}, \quad (3) $$

where $\beta$ is a threshold bounding the minimum number of color matches for a scene point to be considered.
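This selection of attention points can be sketched as below. It is a simplified illustration assuming colors stored as numpy arrays and a plain Euclidean color distance; `omega` and `beta` play the role of the similarity and cardinality thresholds of Equations (2) and (3):

```python
import numpy as np

def color_attention_points(scene_colors, model_colors, omega, beta):
    """Keep scene point indices whose color matches at least `beta`
    model points within color distance `omega`."""
    selected = []
    for i, sc in enumerate(scene_colors):
        # |M_c(s_i)|: number of model points with a similar color
        matches = np.count_nonzero(
            np.linalg.norm(model_colors - sc, axis=1) < omega)
        if matches >= beta:
            selected.append(i)
    return selected
```

Scene points whose color is rare or absent on the model surface are thus skipped as reference points, which reduces the number of accumulators to fill during matching.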
In another direction, a voxel-grid structure is defined to divide the scene at fixed regular distance intervals in the three dimensions. These divisions are used to determine a homogeneously distributed set of potential points in the 3D space. In practice, we propose to divide the scene using a voxel size defined as a fraction of the object diameter and to use the nearest point to each voxel’s center as a reference point.
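The voxel-based selection can be sketched as follows. This is an illustrative implementation, with `voxel_size` left as a free parameter since the exact fraction of the object diameter is not fixed here:

```python
import numpy as np

def voxel_grid_points(points, voxel_size):
    """One reference point per occupied voxel: the point nearest to the
    voxel center, giving homogeneous spatial coverage of the scene."""
    cells = np.floor(points / voxel_size).astype(int)
    best = {}  # voxel cell -> (distance to center, point index)
    for i, (p, c) in enumerate(zip(points, cells)):
        center = (c + 0.5) * voxel_size
        d = np.linalg.norm(p - center)
        key = tuple(c)
        if key not in best or d < best[key][0]:
            best[key] = (d, i)
    return [i for _, i in best.values()]
```

Combining the indices returned here with the color-based attention points gives the full set of scene reference points used for matching.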
Figure 2 shows a representation of the two different proposed reference points selection strategies for the Duck object.
2.3. Color-Weighted PPF Matching
In addition to drawing attention to potential scene zones, the object model color information can be used to improve the matching process. Choi and Christensen [35,38] proposed a straightforward approach to use the color information underneath each point pair, using the HSV color space to define 10-dimensional features which include both the geometrical and color data. This solution, however, subordinates the 3D geometrical information to the quality of the color information, and vice versa. This subordination implies the requirement of high-quality color models and scene data. Otherwise, the solution can dramatically decrease the method performance on low-quality color scenarios produced by the discrepancy and distortion introduced by different sensor properties, illuminations, and the model creation process. We propose a different solution in which color information is used as a weighting factor for the geometric data, rewarding those feature correspondences that are consistent with the scene in terms of both geometrical and color information. In this direction, a weight value is applied for each LC in the accumulator to increase the value of those poses supported by color-consistent point pairs. The weight value for a given scene-model correspondence of point pairs, $(s_r, s_i)$ and $(m_r, m_i)$, is defined by Equation (4),

$$ W\big((s_r, s_i), (m_r, m_i)\big) = 1 + w(s_r, m_r) \cdot w(s_i, m_i), \quad (4) $$

and Equation (5),

$$ w(s_j, m_k) = \begin{cases} \alpha & \text{if } d_c(s_j, m_k) < \omega \\ 0 & \text{otherwise,} \end{cases} \quad (5) $$

where $\alpha$ is a scalar factor that relates the value of the color information with respect to the geometrical data. Notice that the multiplication factor links the consistency of both points of the pair and the added unit accounts for the basic value of the geometrical matching.
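One possible reading of this weighting scheme is sketched below; since the exact equations are not reproduced here, both functions are assumptions consistent with the description (a per-point color weight, and a pair vote weight whose unit term preserves the purely geometric vote):

```python
def point_color_weight(color_dist, alpha, omega):
    """Per-point color weight: alpha for a color match, 0 otherwise."""
    return alpha if color_dist < omega else 0.0

def pair_vote_weight(w_ref, w_sec):
    """Vote weight for a scene-model pair correspondence: the added unit
    keeps the plain geometric vote, while the product rewards pairs
    whose two points are both color consistent."""
    return 1.0 + w_ref * w_sec
```

A geometry-only correspondence therefore still casts a vote of 1, while a fully color-consistent pair casts a larger vote, biasing the accumulator toward color-consistent poses without discarding geometric matches.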
As described earlier, the method rescores the clustered candidate poses to obtain a better fitting value for selecting the best candidate hypothesis. Therefore, the proposed color weighting only affects the correspondence grouping step, and color information would not be taken into account after rescoring. Nevertheless, the color information can also be considered to improve rescoring and compute a better fitting score. In this direction, we propose a novel improved rescoring approach which takes into consideration both geometrical and color data. Following the rescoring formula proposed in the Ref. [20], the fitting score is obtained by summing over the model points that have a scene nearest neighbour within a threshold. In this work, we propose a more refined solution in which the score value of each object point is computed by adding the inlier maximum distance plus the additive inverse of the point’s distance, i.e., the Euclidean distance between the object point and its nearest scene point. In this way, inliers that are further away from the surface provide lower scores. Then, to consider color information, this geometric score is multiplied by one plus the color matching weight, in a similar way to the weighted matching of Equation (4). Specifically, for a given pose $P$ which transforms the model $M$ to the scene $S$, the score is computed as defined by Equation (6),

$$ score(P) = \sum_{\substack{m_k \in M \\ \lVert P m_k - NN(P m_k) \rVert < d_{th}}} \big( d_{th} - \lVert P m_k - NN(P m_k) \rVert \big) \cdot \big( 1 + w(NN(P m_k), m_k) \big), \quad (6) $$

where $NN(\cdot)$ represents the nearest neighbour of an object model point on the scene surface and $d_{th}$ represents the inlier maximum distance threshold, which is set to half of the downsampling voxel size, as in the Ref. [20].
2.4. Color Models and Distance
Color information can be affected by scene conditions (e.g., illumination and shadows), sensor properties (e.g., exposure time, white balance, resolution), and the object modeling process. In this direction, we have evaluated several combinations of the most commonly used color models and metrics to determine the most robust solution.
First, we consider the RGB color space [39], as the most standardized solution. We propose to use the $L_2$ norm as defined by Equation (8),

$$ d_{RGB}(s, m) = \sqrt{(R_s - R_m)^2 + (G_s - G_m)^2 + (B_s - B_m)^2}. \quad (8) $$
We also consider the HSV/HSL [39] spaces, due to their known illumination-invariant properties. Similarly to RGB, we propose to use a variant of the $L_2$ metric which takes into consideration the cyclic nature of the Hue dimension of both spaces; this metric $d_{HSV}$ is defined by Equation (9),

$$ d_{HSV}(s, m) = \sqrt{\min\big(|H_s - H_m|,\, 360^\circ - |H_s - H_m|\big)^2 + (S_s - S_m)^2 + (V_s - V_m)^2}. \quad (9) $$
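A direct implementation of such a hue-aware distance can be sketched as follows; the degree-valued hue and the unscaled handling of the other two channels are assumptions of this sketch, not choices stated in the text:

```python
import math

def hsv_distance(c1, c2):
    """L2-style distance over (H, S, V) with cyclic hue handling;
    hue is assumed to be expressed in degrees [0, 360)."""
    dh = abs(c1[0] - c2[0])
    dh = min(dh, 360.0 - dh)  # wrap-around on the hue circle
    ds = c1[1] - c2[1]
    dv = c1[2] - c2[2]
    return math.sqrt(dh * dh + ds * ds + dv * dv)
```

The wrap-around keeps two reddish hues such as 350° and 10° close together, which a naive Euclidean distance on the raw hue values would not.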
Finally, we have also considered the CIELAB color space [39,40], as a perceptually uniform space with respect to human vision. This color space provides a device-independent color model with respect to a defined white point. Although conceived and mostly used in industry, this complex color space has also been tested before in other 3D computer vision methods [32]. In this case, the CIE94 $\Delta E^*_{94}$ distance metric is used as a trade-off between accuracy and speed, defined by Equation (10),

$$ \Delta E^*_{94} = \sqrt{ \left( \frac{\Delta L^*}{k_L S_L} \right)^2 + \left( \frac{\Delta C^*}{k_C S_C} \right)^2 + \left( \frac{\Delta H^*}{k_H S_H} \right)^2 }, \quad (10) $$

where the model point is considered as the standard reference and the parameters are set as for graphic arts applications under reference conditions, with $k_L = k_C = k_H = 1$, $S_L = 1$, $S_C = 1 + 0.045\,C^*_1$ and $S_H = 1 + 0.015\,C^*_1$. Notice that the LAB color space transformation has been done by using the X, Y, and Z tristimulus reference values for a perfect reflecting diffuser, using the standard A illuminant (incandescent lamp) and the 2° observer (CIE 1931). The reader can refer to the Refs. [39,40] for more details about this color space and its metrics.
2.5. Precomputing Color Weights
It can be observed that the color weight between a scene and a model point, i.e., Equation (5), is computed for each point pair correspondence during matching, i.e., Equation (4), and for each model point during rescoring, i.e., Equation (6). Therefore, the weight value for the same scene-model combination is required multiple times in both cases, significantly increasing the method’s running time. This problem can be easily solved by precomputing the weight for every scene-model point combination in a lookup table. In this way, the weight for any scene-model point combination can be found by accessing the lookup table in constant time. As all scene-model point combinations are checked during this weight precomputation, we propose to also determine the attention reference points simultaneously. To obtain further efficiency, for the $L_2$-based color metrics, we propose to create a kd-tree structure with the object model color information that can help to efficiently retrieve the model points with similar color information.
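A vectorized sketch of this single precomputation pass is given below; it uses a dense distance matrix instead of the kd-tree mentioned above, and all parameter names are the hypothetical ones used earlier:

```python
import numpy as np

def precompute_color_weights(scene_colors, model_colors, alpha, omega, beta):
    """One pass over all scene-model color combinations, returning
    a weight lookup table plus the attention reference point indices."""
    # pairwise Euclidean color distances, shape (n_scene, n_model)
    d = np.linalg.norm(
        scene_colors[:, None, :] - model_colors[None, :, :], axis=2)
    weights = np.where(d < omega, alpha, 0.0)
    # attention points: scene indices with at least `beta` color matches
    attention = np.flatnonzero((weights > 0).sum(axis=1) >= beta)
    return weights, attention
```

Both the matching and rescoring steps can then read `weights[i, j]` in constant time instead of recomputing the color distance for every repeated scene-model combination.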
4. Conclusions and Future Work
A novel solution based on visual attention and color cues for improving robustness against occlusion for 6D pose estimation using the Point Pair Features voting approach has been presented. The proposed method incorporates color information at different steps: first, to identify potential scene points belonging to the object in order to focus the pose estimation method; secondly, to weight the feature matching and re-scoring steps, giving more weight to those points matching both geometry and color. The method has been analyzed for different parameters, color spaces and metrics, showing a better performance for all tested color spaces on the SiSo task on the widely used LM-O dataset. The best result has been obtained with the HSV color space and the L2 metric, with alpha = 0.45, beta = 10 and omega = 5, showing the benefits of including color cues and obtaining an average recall of 70%. Compared to the original PPF-based method without color information, the proposed method obtains an improvement of 9%, which is especially concentrated at occlusion levels between 30% and 70%. Compared to the state-of-the-art, the proposed method outperforms all approaches by at least 8%, including current machine learning (deep learning)-based methods. The method’s robustness to illumination changes has been evaluated on the TUD-L dataset, showing stable behavior and obtaining an overall better performance compared to the other approaches, with the improvement limited to the cases with meaningful color information. Finally, the proposed solution has also shown robustness under repeated color patterns when tested against a moderate and high number of multiple instances of the same object on the IC-MI and IC-BIN datasets.
Future work will focus on four main directions. First, we will study how the presented color solutions can improve other well-known problems faced by object recognition approaches, especially distinguishing objects with identical or similar shapes. Secondly, future work will also focus on investigating richer features based on color and texture patterns that could potentially improve the robustness and results of the method. Third, we will also study more complex color models based on the idea of weighted color for surface features. Finally, we will adapt the current SiSo solution to the slightly different ViVo task, where multiple instances and multiple objects are considered simultaneously.