Journal of Imaging
  • Review
  • Open Access

24 February 2022

A Survey of 6D Object Detection Based on 3D Models for Industrial Applications

1 Fraunhofer-Institut für Graphische Datenverarbeitung, Fraunhoferstraße 5, 64283 Darmstadt, Germany
2 Department Graphisch-Interaktive Systeme, Technische Universität Darmstadt, Karolinenplatz 5, 64289 Darmstadt, Germany
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advanced Scene Perception for Augmented Reality

Abstract

Six-dimensional object detection of rigid objects is a problem especially relevant for quality control and robotic manipulation in industrial contexts. This work is a survey of the state of the art of 6D object detection with these use cases in mind, specifically focusing on algorithms trained only with 3D models or renderings thereof. Our first contribution is a listing of requirements typically encountered in industrial applications. The second contribution is a collection of quantitative evaluation results for several different 6D object detection methods trained with synthetic data and the comparison and analysis thereof. We identify the top methods for individual requirements that industrial applications have for object detectors, but find that a lack of comparable data prevents large-scale comparison over multiple aspects.

1. Introduction

The problem of 6D object detection comprises the detection of objects and the estimation of the translation and rotation thereof. In a three-dimensional space, both of these properties have three degrees of freedom, thus resulting in the 6D portion of the term. In many cases, algorithms that solve this problem also give an estimate of the target object’s class (in this work, the term object detection implies object classification). The most common sensors used here to record scenes are cameras. In this work, we focus on approaches for solving this problem with two specific properties:
  • RGBD cameras (i.e., color and depth) are available for providing input to the algorithm;
  • Only 3D object models (CAD or reconstructed) are required to set up the algorithm (i.e., no recordings by real cameras).
Algorithms with these properties are especially well suited for industrial applications, specifically automation tasks. On the one hand, RGBD images are easy to obtain in production environments: scenes are mostly indoors with controlled lighting, which simplifies the usage of active sensors. The larger form factor compared to RGB cameras is usually no problem in static setups, and the pricing of high-quality RGBD sensors does not weigh heavily on company-scale budgets. On the other hand, industrially manufactured products are usually based on computer-aided design (CAD), which makes the 3D models of target objects readily available. There are two major use cases in industrial environments that require localizing real-world objects: robotic manipulation and quality control; examples of both are shown in Figure 1.
Figure 1. Two examples of applying object detectors in industrial production processes. (a) Quality control. (b) Bin picking (image from [1] published under CC BY 4.0).
In this work, we examine the current state of the art of 6D object detection for application in industrial use cases. We put a strong focus on empirical data; to our knowledge, we collected the most comprehensive comparison of evaluation scores for object detectors with the aforementioned properties to date. Our core contributions are the following:
  • A listing of requirements that typical industrial use cases have for object detectors.
  • A comprehensive collection of empirical data from experiments with 6D object detectors that meet the identified criteria.
  • Empirical data on the performance of the object detector FFB6D [2], which has not yet been evaluated with purely model-based training.
In the remainder of this work, we first give an overview of related work. Then, we present the background of our work: we define the 6D object detection task, establish a rationale for our focus by identifying the requirements of typical industrial applications, and give a short overview of model-based training (strictly speaking, the term training refers to setting up learning-based algorithms; for better readability, we also use it in this work to refer to generating reference data for non-learning-based algorithms) and synthetic data generation for this purpose. We then describe the method of our analysis, including a categorization of the examined algorithms and a description of the datasets and metrics used. Finally, the collected data are presented and discussed, followed by a short conclusion.

3. Background

This section provides background information on concepts referenced in this work and sets the scope for our analysis. Specifically, we define the problem of 6D object detection, describe requirements that industrial applications pose, describe what constitutes model-based training and give an overview of the modalities we took into account in this work.

3.1. Problem Definition: 6D Object Detection

Six-dimensional object detection comprises the detection of objects and an estimation of their three-dimensional translation and their three-dimensional rotation. We define the relationship of 6D object detection, object detection and 6D pose estimation as follows:
6D object detection = object detection + 6D pose estimation.
i.e., a 6D object detector detects object instances in a scene and outputs their locations as 6D poses. For some detection methods, this is a single algorithmic step (often called single-stage detectors, e.g., [9,26,34]), while others perform object detection and pose estimation as distinct steps (two- or multi-stage detectors, e.g., [42,43,44]). The latter usually first employ an object detector that outputs 2D bounding boxes for object instances found in an image and then feed these into a pose estimator.
Hodaň et al. [3] differentiated object detection and object localization in Appendix 1 of their work. When detecting objects, one tries to find an unknown number of objects, while localizing objects means that we know a priori that N objects are visible in the scene and we need to find their locations. In our specific case, localizing an object means estimating its 6D pose. Thus, in this work, the term object localization is synonymous with 6D pose estimation for N objects, which, according to the relationship given above, makes object localization a sub-task of 6D object detection.
The common usage of object detection in the literature also implies the classification of objects. In this work, we assume 3D models of objects as references to look for, i.e., we look at object detectors that perform their task based on very specific geometric properties, which is known as instance-level detection. This stands in contrast to category-level detection, whose goal is to detect objects that fall into broader categories, for instance, “find all cars in an image”.
Similar to Sahin et al. [6], we define 6D pose estimation formally as
T_i^* = \arg\max_{T_i} P(T_i \mid I, S, O),
where T_i = (r_1, r_2, r_3, t_1, t_2, t_3) is the six-dimensional pose of object instance i, I is the input image, S is a seen instance of an object, and O is the reference for an object class. Pose estimators try to maximize the probability function P. In practice, different pose estimators mainly differ in their formulation of P, e.g., some use neural networks [27,34,45], while others use hand-crafted heuristics to determine the probability [9,14]. The output of P can also be interpreted as the detection score.
Whether a use case is a detection or localization task has two important practical implications:
  • The parametrization of the algorithm is different. For localization, we can accept the N best hypotheses produced by the object detector, while for detection, we need to set a score threshold for P as an acceptance criterion for the hypotheses (a minimal sketch of both acceptance strategies is given after this list).
  • The required metrics for evaluating the performance differ. For localization, determining a score that only regards the rate of positive detections is sufficient (e.g., recall). As the detector outputs a maximum of N results, we know that every false positive implies a false negative, i.e., the precision is always at least as good as the recall here. For detection tasks, this is not true, and so we need to regard metrics that take both true and false positives into account (e.g., recall and precision).
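The practical difference between the two acceptance strategies can be illustrated with a minimal sketch (hypothetical function and variable names, not taken from any of the surveyed methods):

```python
from typing import Any, List, Tuple

# A hypothesis is a pair of (detection score P, estimated 6D pose T).
Hypothesis = Tuple[float, Any]

def accept_localization(hypotheses: List[Hypothesis], n_expected: int) -> List[Hypothesis]:
    """Localization: the number of visible instances N is known a priori,
    so we simply keep the N highest-scoring hypotheses."""
    return sorted(hypotheses, key=lambda h: h[0], reverse=True)[:n_expected]

def accept_detection(hypotheses: List[Hypothesis], score_threshold: float) -> List[Hypothesis]:
    """Detection: the number of instances is unknown, so a threshold on the
    detection score P is the acceptance criterion; its value trades recall
    against precision."""
    return [h for h in hypotheses if h[0] >= score_threshold]
```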

3.2. Industrial Applications

There is a great potential in object detectors when applied to problems that occur in industrial environments. Particular tasks from the areas of quality control and robotic manipulation require fast and accurate detection and pose estimation of target objects. To deduce the requirements that industrial use cases have for object detectors, we identified their chances and challenges regarding 6D object detection. Chances are aspects of these scenarios that potentially simplify 6D object detection, while challenges are those that make it harder. Chances are as follows:
  • CAD models are available, which means that generating reference data is cheap.
  • High-end RGBD cameras are available, as their higher costs and larger form factor compared to RGB cameras are negligible in large-scale production environments.
  • Scene setups are controlled. Production mostly happens indoors, and the placement of lights and cameras can be controlled easily. Indoor setups also allow for a broader range of possible RGBD cameras, as active cameras often do not work well in sunlight.
  • The minimum frame rate required for many automation tasks is determined by the production's takt time and is usually lower than the frame rate required for interactive applications.
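For instance, a takt time of 10 s translates to a required detection rate of only 0.1 Hz, whereas interactive applications typically demand 10–30 Hz.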
We conclude that the combination of RGBD cameras and 3D models as references makes optimal use of these chances. RGBD allows for a higher robustness and accuracy than RGB, and the availability of 3D models allows generating synthetic images, which can be acquired much more simply and cheaply than annotated real-world recordings. On the other hand, we find that industrial applications pose the following specific challenges for 6D object detection:
  • Lots of industrially manufactured objects are textureless. Specifically, workpieces that are at the beginning of production chains are often made of a single material with flat and untextured surfaces.
  • A lot of man-made objects, especially those with simple geometry, are rotationally symmetric, or at least appear so under certain perspectives. This makes their poses ambiguous, which can be a difficult problem for algorithms relying on optimization.
  • A common task in the area of robotic manipulation is bin picking. Here, individual objects can be highly occluded.
  • Additionally, especially in bin-picking tasks, we have an unknown number of instances of the same object class. As described in Section 3.1, we refer to this task as object detection in contrast to object localization, where the number of objects to detect is known a priori. When attempting to detect an unknown number of instances, false positives can be a major problem.
  • Object colors are often unspecified in the reference data. CAD models generally store an object’s geometric and kinetic properties, but not its surface properties, defining color and reflective behavior.
  • There are objects with difficult surface properties that hinder the recognition of geometric properties based on optical recordings, i.e., objects made from materials with high specular reflections, such as metals, or objects made from translucent or transparent materials, such as glass.
Here, too, RGBD can alleviate the problems posed by these challenges. The geometric information encoded in the depth channel can complement the color information and lead to better accuracy when there is no discerning texture on the target object or no color information in the reference. Especially in use cases involving many object instances, much can be gained by generating annotated synthetic scenes showing heaps of objects, as manual annotation is hardly feasible in such cases.
Of course, not all of these characteristics necessarily apply to every industrial use case. However, based on our experience, we find that these properties are typical for production environments, and so they set the scope of this work. In the remainder, we analyze whether and how well published object detectors can fulfill the requirements posed by the presented chances and challenges.

3.3. Model-Based Training

In this section, we define what constitutes model-based training and give an overview of how synthetic images can be generated. For object detectors that only regard the geometric properties of target objects, model-based training is straightforward: this kind of algorithm can be trained directly with the reference model by generating features in a latent space, e.g., PPFs [9] fall into this category. Training object detectors that work on surface properties and projections of model geometries (i.e., on images) is more involved. This is especially true for learning-based algorithms, which generally work better the more similar the data available at training time are to the input data during inference. Generating "real images", i.e., recording the physical target objects and annotating the recordings with ground-truth poses, is an involved and costly process, especially if the use case requires training images taken from a multitude of different perspectives, under varying lighting conditions or of different objects. Synthetic images, i.e., simulated recordings, are cheap in comparison, and images covering a great number of different simulated conditions can be generated easily. To achieve this, a 3D model of the target object has to be available, which is the case in industrial production environments, as most products are modeled before they are manufactured. From these models, one can derive rendered images that also take into account the properties of the recording process. We found the following strategies of utilizing models of the real world in order to train object detectors:
3D models: 
Here, we directly derive features in the latent space from the information contained in a 3D model, i.e., a model’s vertices and normals, e.g., PPFs only require a 3D model of an object at training time.
Augmented real images: 
In this strategy, real images are augmented to generate a higher variety of training images. This can be done by simulating varied recording conditions, e.g., changing an image’s size or aspect ratio, its brightness or sharpness, or adding noise. A more involved mode of image augmentation is the “Render and Paste” strategy in which an object is cropped from its original scene and pasted onto a different background to simulate a varying background, or covered by another cropping to simulate occlusion.
Renderings: 
Rendering is the process of simulating the full image recording pipeline and thus generating 2D images from 3D models. There is a big variety in how this simulation is implemented and how realistic the resulting output is. The simplest and quickest method for rendering images is using a rasterization-based renderer, such as OpenGL. This type of renderer usually produces plausible, but not necessarily physically accurate renderings, in order to achieve real-time performance. A more ambitious mode of generating realistic images is called physically based rendering (PBR), which is not a strictly defined term, but usually entails more realistic simulation of the behavior of light and surfaces than the commonly used Blinn–Phong model [46], e.g., by employing ray tracing.
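As an illustration of the rasterization-based rendering strategy described above, the following is a minimal sketch assuming the trimesh and pyrender libraries, a hypothetical model file and example camera intrinsics; it is not the data-generation pipeline of any specific method surveyed here:

```python
import numpy as np
import trimesh
import pyrender

mesh = trimesh.load("obj_000001.ply")                       # hypothetical CAD/reconstructed model
scene = pyrender.Scene(bg_color=[0.0, 0.0, 0.0])
scene.add(pyrender.Mesh.from_trimesh(mesh))
scene.add(pyrender.DirectionalLight(color=np.ones(3), intensity=3.0))
camera = pyrender.IntrinsicsCamera(fx=572.4, fy=573.6, cx=325.3, cy=242.0)  # example intrinsics

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Build a 4x4 camera-to-world pose; pyrender cameras look along -Z."""
    z = eye - target; z = z / np.linalg.norm(z)
    x = np.cross(up, z); x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = x, y, z, eye
    return pose

renderer = pyrender.OffscreenRenderer(viewport_width=640, viewport_height=480)
for i in range(10):                                         # sample a few random viewpoints
    eye = np.random.uniform(-1.0, 1.0, size=3)
    eye = 0.6 * eye / np.linalg.norm(eye)                   # place the camera 0.6 m from the object
    cam_node = scene.add(camera, pose=look_at(eye))
    color, depth = renderer.render(scene)                   # RGB image and depth map
    # ...store color/depth together with the known object-to-camera pose as ground truth...
    scene.remove_node(cam_node)
```

With a physically based renderer, the rasterization step would be replaced by, for example, a ray-traced simulation, while the viewpoint and lighting randomization stays the same.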
In this work, the term model-based trained object detectors refers to algorithms that are trained either on 3D models only or renderings thereof (synthetic images), i.e., the training of these algorithms does not include recording the physical target objects. Note, however, that we do not exclude algorithms that use training images involving generic real-world images as backgrounds, textures or distractors, as these images can be easily obtained from 2D image datasets, such as ImageNet (https://www.image-net.org/, accessed on 23 December 2021).
We did not investigate the nature of each training set used to train the methods referenced here. This means that it is likely that some methods could perform considerably better by training them on images generated with more advanced strategies for synthetic image generation. Therefore, the values presented should be regarded as the empirically proven lower bound for each algorithm’s performance.

3.4. Modalities

As described in Section 3.2, we assume the availability of an RGBD camera for industrial use cases and include 6D object detectors that take RGBD images as input. For a potential user, the quality of a method's output is much more relevant than which modality it accepts, as long as the latter is compatible with the available hardware. As RGB-based detectors are fully compatible with RGBD images, we also include them in our survey. Note that RGB-based detection can easily be refined with depth information by employing a geometry-based refinement algorithm, such as ICP [47].
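A minimal sketch of such a depth-based refinement step, assuming the Open3D library and placeholder inputs (the point clouds and the coarse pose are illustrative, not taken from any specific method):

```python
import numpy as np
import open3d as o3d

def refine_with_icp(model_points, scene_points, coarse_pose, max_dist=0.01):
    """model_points: Nx3 points sampled from the 3D model (object frame, in metres).
    scene_points:  Mx3 points back-projected from the observed depth image (camera frame).
    coarse_pose:   4x4 object-to-camera pose estimated by, e.g., an RGB-based detector."""
    source = o3d.geometry.PointCloud()
    source.points = o3d.utility.Vector3dVector(np.asarray(model_points))
    target = o3d.geometry.PointCloud()
    target.points = o3d.utility.Vector3dVector(np.asarray(scene_points))
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_dist, coarse_pose,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # refined object-to-camera pose
```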
There are two modalities that could fit the requirements of industrial use cases as defined in Section 3.2 well, but they were not explicitly considered in this work: multi-view images and point clouds. We found that algorithms using these modalities as input were difficult to fit into the scope of this work, for the following reasons:
  • Besides CosyPose [5], we did not find any multi-view approaches that fit the scope given by our use case.
  • Point-cloud-based object detectors are very popular in the area of autonomous driving. Consequently, they are commonly evaluated on datasets and metrics tailored to this use case (e.g., the KITTI dataset [48]), and the evaluation scores found in the literature cannot be compared to those of most RGBD-based object detectors.
However, we also found that these modalities could potentially benefit the industrial use case, particularly in these regards:
  • Multi-view images as well as point clouds usually cover a larger portion of a scene than single-view images. Thus, they could mitigate problems due to occlusion, pose ambiguities and specular reflections.
  • Point clouds are primarily geometrical representations of scenes, and thus, object detection based on geometrical 3D models potentially requires less preprocessing of training data, as input and training data are already in the same domain. In particular, the involved generation of synthetic images can be skipped.
For these reasons, we decided to postpone evaluating object detectors based on these modalities to future work; in particular, we plan to evaluate their performance with metrics and datasets commonly used for RGBD-based detectors.

4. Materials and Methods

4.1. Methods

In this section, we categorize the methods for 6D object detection that we have examined and compared. As this work's scope is use-case specific, we focus on method properties that constrain a method's usage and regard aspects such as the type of CNN used as implementation details. In the following, we describe method properties that we found to be relevant in application scenarios and the ways in which they can constrain potential usages. The property descriptions should be regarded as a set of general guidelines and not as strict rules, e.g., although depth-based detectors tend to give better camera–object distance estimates than RGB-based detectors, this need not hold in all circumstances. Table 1 shows our categorization of algorithms.
Table 1. Properties of 6D object detectors. Methods are sorted alphabetically. For detailed explanations of properties, see Section 4.1. Alternating grey and white rows are visual aids.
Modality 
describes which type of input a method accepts at training time and runtime. RGB-based methods tend to have a larger error when estimating the distance of objects to the camera. Depth-based methods are based on geometry only, so they cannot use color cues or textures visible on objects. RGBD-based methods can use the best of both worlds. We only regarded the modality that the core method uses, i.e., without optional refinement steps. Of course, every RGB-based detector can be extended to RGBD by, for example, post-processing the results with ICP [47], and every depth-based detector can be extended to RGBD by employing some kind of 2D-edge-based pose refinement.
Features 
states whether a method uses learned or hand-crafted features for object detection, i.e., whether the algorithm is data- or model-driven. As the name suggests, data-driven methods tend to require large amounts of training data: in our case, synthetic images. The generation of these data and the subsequent training can be computationally very demanding, in some cases needing several days for a full setup. Hand-crafted features usually do not require as much data, and the conversion of training data to features is straightforward, as no weight optimization takes place. However, methods with hand-crafted features tend to have more parameters that need to be fine-tuned manually for optimal results.
Scope 
describes whether a feature in the object-detection step represents the full target object (e.g., a “template”) or a single point of interest (e.g., a single pixel or an image patch). Global features, representing the whole object, are usually more robust when detecting multiple instances of a single object class that are close to or even occluding each other. Local features tend to be more robust against general occlusion or difficult lighting conditions.
Output 
gives the type of space that the output pose is in. Regression-based methods predict continuous results, i.e., the poses they estimate are not restricted to a discrete set and can, in theory, be arbitrarily precise. Classification-based methods predict discrete results, i.e., their output is one of a previously learned finite number of classes. Whether a discrete estimate is good enough depends on the use-case requirements and on whether there are enough computational resources to perform a refinement step.
All methods referenced in this work have been tested with model-based training. Note that the mode of generating synthetic data differs between methods, i.e., there may be potential for better scores, and thus the presented scores are only lower bounds.

Remarks for Individual Methods

  • The learning-based method FFB6D [2] was trained on synthetic images by ourselves. To train FFB6D, we used the synthetic images generated with BlenderProc [4] for BOP Challenge 2020 [3], using scene 2 as the validation set. We deactivated all data augmentation and trained on the renderings as they are. The training ran for 366,000 iterations at a batch size of 3.
  • PoseRBPF [33] is a tracking and not an object-detection method. However, the algorithm can actually be used for object detection (referred to as initialization in the respective paper), and the pose estimation accuracy is improved over consecutive frames. For this reason, we regarded it in this work, despite not fully fitting the required profile.
  • PointVoteNet [30] supports both global and local features, as it is based on PointNet by Qi et al. [50], which represents target objects as a cascade of global and local features.
  • The depth-based methods PointVoteNet [30], PoseCluster [22] and all PPF-variants [9,10,11] can be trained with point clouds only, i.e., no image synthesis is required here.

4.2. Datasets

In this work, we focus on datasets that, on the one hand, pose challenges mapping to the requirements stated in Section 3.2 and, on the other hand, for which a significant amount of quantitative data is available in the literature. For all of the datasets used in this work, RGBD images with ground-truth pose annotations and 3D models of the depicted objects are available. Furthermore, synthetically generated RGBD training images are provided through BOP (see https://bop.felk.cvut.cz/home/, accessed on 30 December 2021 or [3]). Sample images of the datasets can be seen in Figure 2. The following datasets are regarded in this work:
Figure 2. Sample images from the three datasets considered in this work. Lighter renderings show annotated poses. (a) LM, (b) LMO, (c) TLESS. Images used with permission from Hodaň et al. [3].
LineMOD (LM) [14]: 
First presented by Hinterstoisser et al. to evaluate their algorithm of the same name, the LM dataset provides 15 scenes. In each scene, 1 of 15 different objects from an office environment is annotated and placed on a desktop with severe clutter.
LineMOD occluded (LMO) [19]: 
This dataset includes scene number 2 of the original LineMOD dataset, but with ground-truth annotations for multiple objects from different classes in a single frame. In addition to the background clutter, this poses the challenge of a lot of occlusion between objects.
TLESS [51]: 
The T-LESS dataset comprises 20 scenes with annotations for 30 different object classes. The depicted objects are all typical industrially manufactured objects, made from textureless white plastic, many of which are rotationally symmetric. The objects are all placed on a black background, so there is little background clutter. All scenes show different combinations of objects with different placements, with cases of multiple instances of one object in a scene and objects occluding each other.
For a comprehensive overview of other datasets typically used to evaluate 6D object detection, we recommend the publication of Hodaň et al. [3]. Their work not only gives details on the datasets and the specific challenges they pose, but the authors also brought 12 widely used datasets into a common format and provided synthetic training images for most of them, generated using BlenderProc [4].
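All three datasets are distributed in the BOP format, in which ground-truth poses are stored per scene in a scene_gt.json file. The following is a minimal reading sketch (assumed directory layout and hypothetical path; see the BOP website for the authoritative format description):

```python
import json
import numpy as np

def load_scene_gt(scene_gt_path):
    """Return {image_id: [(obj_id, 4x4 object-to-camera pose), ...]} from a BOP scene_gt.json."""
    with open(scene_gt_path) as f:
        scene_gt = json.load(f)
    poses = {}
    for im_id, annotations in scene_gt.items():
        entries = []
        for ann in annotations:
            T = np.eye(4)
            T[:3, :3] = np.asarray(ann["cam_R_m2c"]).reshape(3, 3)  # rotation, model to camera
            T[:3, 3] = np.asarray(ann["cam_t_m2c"])                 # translation in millimetres
            entries.append((ann["obj_id"], T))
        poses[int(im_id)] = entries
    return poses

# e.g., gt = load_scene_gt("lm/test/000002/scene_gt.json")  # hypothetical path
```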

4.3. Metrics

Object detectors are usually evaluated by regarding them as binary classifiers. Consequently, metrics used to measure the performance of object detectors are calculated in two stages:
  • The distances between detected instances and ground-truth annotations are calculated with a geometric metric. Based on a metric-specific threshold, every detected instance is then classified as a true positive (TP) or false positive (FP), and every unmatched ground-truth annotation counts as a false negative (FN).
  • The numbers of TPs, FPs and FNs are aggregated based on a metric for the evaluation of binary classifiers, which then gives the final evaluation score.
There are several metrics commonly used in the literature to determine the performance of object detectors. We only describe in detail the metrics that are relevant to this work. A metric is deemed relevant if it allows comparing multiple object detectors that fit the industrial use-case profile. We note that a large majority of publications considering model-based training use one of the metrics presented below. For an overview of the distribution of these metrics in the literature, consult Table 2.
Table 2. Dataset–metric combinations we found empirical data for, the tasks (Section 3.1) and challenges (Section 3.2) that they address, and the number of data points we found for each (i.e., the number of relevant methods evaluated). Alternating grey and white rows are visual aids.
The following geometric metrics are most often used in the literature to evaluate the performance of 6D object detectors that are trained on 3D models only:
Average distance (symmetric) (ADD(S)) [14]: 
This metric measures the average distance of 3D points of an object's model transformed with two different poses. ADD-S (also ADI) is a variant which takes into account that rotationally symmetric objects can have multiple valid pose estimates. ADD(S) is used to denote that the symmetric variant ADD-S is used for objects with rotational symmetries and ADD for non-symmetric objects. The most commonly used threshold for classifying an estimate as correct is t = 0.1 · d, where d is the target object's diameter. Some publications use t = 0.15 · d; these are marked in the respective locations. (A minimal computation sketch of ADD and ADD-S is given after this list.)
Visual surface discrepancy (VSD) [52]: 
As the name suggests, this metric measures the difference of the visible surface of an object transformed with two different poses relative to the camera, i.e., if an object looks exactly the same when transformed with two poses, the VSD is 0. In particular, this handles rotational symmetries more intuitively than ADD(S). This metric has two threshold parameters determining whether a pose is considered correct: τ is the maximum allowed difference in the camera distance of overlapping pixels; θ is the minimum percentage of object pixels that need to be considered correct according to the τ condition for the whole hypothesis to be considered correct. A widely used combination of thresholds is τ = 20 mm and θ = 0.3. BOP Challenge 2020 [3] used a different approach by increasing τ in the interval [0.05 · d, 0.5 · d] in steps of 0.05 · d and θ in [0.05, 0.5] in steps of 0.05. They then determined the score for every (τ, θ) pair and took the average as the total score. We refer to this configuration as VSDBOP.
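As a concrete illustration of the ADD and ADD-S errors described above, the following is a minimal sketch assuming model points given in millimetres and 4x4 ground-truth and estimated poses as inputs:

```python
import numpy as np
from scipy.spatial import cKDTree

def transform(points, pose):
    """Apply a 4x4 rigid transform to an Nx3 point array."""
    return points @ pose[:3, :3].T + pose[:3, 3]

def add_error(model_points, pose_gt, pose_est):
    """ADD: mean distance between corresponding model points under both poses."""
    return np.linalg.norm(transform(model_points, pose_gt)
                          - transform(model_points, pose_est), axis=1).mean()

def adds_error(model_points, pose_gt, pose_est):
    """ADD-S: mean distance to the closest point, tolerating rotational symmetries."""
    pts_gt = transform(model_points, pose_gt)
    pts_est = transform(model_points, pose_est)
    distances, _ = cKDTree(pts_est).query(pts_gt, k=1)
    return distances.mean()

def is_correct(error, diameter, rel_threshold=0.1):
    """Commonly used acceptance criterion: error below 10% of the object diameter."""
    return error < rel_threshold * diameter
```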
For evaluating object detectors as binary classifiers, we found that the most widely used metrics for methods fitting our requirements profile are recall and F1-score, where the latter is the harmonic mean of recall and precision. They are calculated as follows:
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad \mathrm{precision} = \frac{TP}{TP + FP},
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.
Recall is suitable for evaluating the object localization task, as defined in Section 3.1. In this case, we know that there are N object instances in the scene, and we have a maximum of N result hypotheses. From this, it follows that FPs ≤ FNs and recall ≤ precision. This makes calculating precision redundant for this task. For object detection, the number of objects to find is unknown, so here, the F1-score is required, as it takes true and false positives into account.
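As a small numerical example, 8 TPs, 2 FPs and 4 FNs give recall = 8/12 ≈ 0.67, precision = 8/10 = 0.80 and thus F1 ≈ 0.73.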
The geometric metrics that we did not take into account, due to their irrelevancy based on the conditions given above, but which can be found in literature include 2D projection error, intersection over union (IoU), translational and angular error, maximum symmetry-aware surface distance (MSSD), maximum symmetry-aware projection distance (MSPD) and average orientation similarity (AOS). The binary classification metrics that we did not regard include average precision (AP), mean average precision (mAP) and area under curve (AUC). See Hodaň et al. [3] or Sahin et al. [6] for more information on these metrics.

5. Evaluation

Benchmarks of object detectors are defined by three main aspects: the dataset used, the metric used and the thresholds (tolerances) used to classify a detection result as a success or failure. Searching for quantitative data on the performance of object detectors meeting the requirements we posed, we found data for the dataset–metric combinations listed in Table 2.
Although we focused on the object detection task in this work, we also examined the results for object localization benchmarks. On the one hand, there are industrial use cases for which object localization is sufficient, and on the other hand, as stated in Section 3.1, object localization can be regarded as a sub-task of 6D object detection.
Note that the empirical data we found do not address the two challenges posed in Section 3.2: unknown object color and difficult surfaces. For learning-based methods, the properties of the synthetic images used for training play a major role in the robustness against varying colors of objects, in particular, whether the renderings were generated with known colors or using some randomization strategy. We could not gather enough information on the mode of data generation for the training of all algorithms presented herein, and thus the performance regarding this aspect remains inconclusive. The same is true for reflective or translucent objects, which pose a very challenging case for all computer vision tasks and for which, to our knowledge, no annotated dataset for 6D object detection is available.

5.1. Discussion

In the following, we present the empirical evaluation results we found for several object detectors. We first summarize our findings regarding benchmark scores, again focusing on the requirements defined in Section 3.2; then we look at the runtimes of different methods, and finally we take a step back and describe our findings on the availability and comparability of empirical data in the literature.

5.1.1. Method Scores

The quantitative evaluation results we found for object detectors trained on purely model-based data are reported in Table 3. From these numbers, we can draw the following conclusions, regarding the requirements posed in Section 3.2:
Table 3. Scores of model-based trained 6D object detectors for different datasets and metrics, in percentages. Methods are sorted alphabetically for easy cross-referencing with Table 1. Method variants are given in brackets (e.g., refinements such as ICP). Unlike in Table 1, refinement steps are considered in the modality column. All except ADD(S)-F1 are recall-based scores. Darker green cell backgrounds denote higher scores, column-wise; alternating grey and white rows are visual aids. The top three methods of a column are in bold. † denotes that a threshold of t = 0.15 · d was used for ADD(S) instead of t = 0.1 · d. No citation is given for the values of FFB6D, as we evaluated them ourselves.
Object localization: 
For LM-ADD(S), LM-VSD, LM-VSDBOP, TLESS-VSD and TLESS-VSDBOP, the following respective methods perform best: LCHFs [17], PPFs by Vidal et al. [11], SurfEmb [36], PoseRBPF [33] and again SurfEmb [36]. LMO-VSDBOP allows a direct comparison of PPFs and SurfEmb, from which we can assume that the latter is the overall better method. We cannot compare the other top runners because they were not evaluated on the same metric–dataset combination, so the best overall object localizer remains inconclusive.
Object detection: 
For LMO-ADD(S)-F1, LCHFs [17] perform best. As they also perform very well for object localization in LM-ADD(S), we conclude that this method can outperform many other object detectors, albeit with some reservations.
Occlusion: 
The LMO-VSDBOP ranking is led by SurfEmb [36], followed with some margin by PointVoteNet [30] and HybridPose [42].
Workpiece-detection (textureless, rotationally symmetric): 
The top runners on TLESS-VSD are PoseRBPF with SDF-based refinement, followed by the same method without refinement [33]; AAE refined with ICP [43] takes third place, with a large gap to the top two. For TLESS-VSDBOP, SurfEmb [36] again takes first place by a large margin, followed by HybridPose [42] and CosyPose [5].
Which is the overall best 6D pose estimator when RGBD images and only model-based training data are available? From the quantitative data, we found that we cannot answer this question. Most top performers were evaluated on different dataset–metric combinations, and thus cannot be compared based on the available data. In particular, most of the promising methods, LCHF [17], PPFs by Vidal et al. [11], SurfEmb [36] and PoseRBPF [33], cannot be compared directly to each other.
It is of specific note that, under certain circumstances, hand-crafted features can still contend with learning-based methods, regardless of their age. In particular, for LM-ADD(S), the ten-year-old LineMOD algorithm [14] and its variant by Rios-Cabrera [15] very nearly reach the performance of LCHFs [17] and outperform many newer methods (e.g., SSD6D [44], SynPo-Net [32] and AAEs [43]). For LM-VSD, the PPF-based method by Vidal et al. [11] still outperforms all other methods, and for TLESS-VSD, it takes second place. However, the good performance in this ranking must be viewed with some reservations:
  • A lot of newer methods in the literature are trained on real data or a combination of real and synthetic data, and for many generally promising methods, there are currently little or no empirical data available on the performance with purely model-based training; where data are available, they are often not comparable.
  • LineMOD as well as PPFs have drawbacks compared to learning-based methods that are not reflected in the scores, such as the need for manual parameter optimization (both), fragility against occlusion (LineMOD) and slow runtimes (PPF).
  • Both LineMOD and PPFs show mediocre performance for LM-F1, while being good at generating high recalls. We assume this is because both methods are not discriminative (i.e., they do not explicitly “know” what to exclude), and thus tend to have lower precision than learning-based methods.

5.1.2. Runtime

In Table 4, we list the runtimes for the evaluated methods, where available. The top two performers, and the only ones that reach a frame rate suitable for interactive real-time applications, are methods based on neural networks that work on RGB images without any refinement [24,32], which comes as no surprise. The fastest RGBD-based method is the LineMOD variant DTT-OPT-3D [15]. The numbers show that ICP refinement is a costly operation: methods relying on ICP generally run considerably slower, and specifically, we see a difference of 0.6 s when comparing AAE with and without refinement. The slowest methods are those based on PPFs. It is of specific note that SurfEmb, the top performer regarding occlusion and workpiece detection, needs about 9 s per frame, which is not acceptable for many use cases.
Table 4. Runtimes of methods and their variants. Note that the comparability of the values listed here is subject to great reservations, as they were generated by different persons with different parameters, using different hardware, over the course of 11 years. Nevertheless, they can at least give an impression of the order of magnitude at which the algorithms operate. Alternating grey and white row colors are visual aids.

5.1.3. Availability and Comparability of Empirical Data

In Table 2, we list the amount of data we found for each dataset–metric combination. We observe a focus on recall-based metrics in the literature: recall-based metrics have a total of 70 rows of data, while the F1-score, which also considers precision, only has 7, i.e., many publications only evaluate the performance of their algorithms regarding object localization, ignoring false positives, which can be a significant problem in object detection scenarios.
Many methods cannot be compared due to being evaluated on different datasets or metrics. We assume that one major reason for this is the effort required for processing multiple datasets or implementing different metrics. It would be desirable to have a benchmarking framework with a well-defined interface for datasets and pose-estimation results, supporting multiple metrics and allowing simple extension thereof. This framework should be accompanied by an online database that allows easy collection and analysis of empirical data on object detection performance. BOP is a big step in this direction, especially regarding the standardization of datasets. However, the evaluation metrics implemented therein were novel at the time of publication, and so they are not comparable to older results. Additionally, the BOP challenges do not regard precision in their benchmarks. In Appendix A.1 of [3], the authors discussed their decision to exclude precision from their benchmark and concluded that, for the purpose of their benchmark, recall-based scores are appropriate for two reasons: first, these scores were not saturated at the time of writing, and second, regarding only the recall is less computationally complex for the evaluation framework. We consider these valid arguments for a benchmark targeting the research community, but argue that for the potential application of object detectors, more information is needed to determine a method's suitability for the requirements of a specific use case.

6. Conclusions and Future Work

In this work, we surveyed the state of the art of 6D object detection with a focus on industrial applications for which we identified model-based training and support for RGBD images as being especially important. We presented a collection of qualitative and quantitative information on object detectors from the literature and new data for the object detector FFB6D [2]. These data were discussed, and promising candidates for taking on specific challenges of industrial applications were identified.
Furthermore, we had a look at the availability of data that provide information on the suitability of algorithms for the use case we analyzed. We found that, for many methods, there are not enough empirical data available in the literature to determine how well they are suited to tackle specific challenges or to compare them to competing methods. In particular, many methods in the literature were only tested on real data, and many methods were only evaluated on recall-based metrics.
Based on these findings, we believe that the following future research topics would benefit the application of 6D object detection in industrial contexts:
  • Train established and promising object detectors with model-based data and evaluate them.
  • Evaluate established and promising object detectors with metrics that take precision into consideration.
  • Take methods based on point clouds and multi-view images into consideration.
  • Allow researchers to produce meaningful and comparable data by providing tools and frameworks that offer uniform formats and interfaces for evaluating object detectors on a multitude of different datasets and metrics. Additionally, provide an online database that simplifies collecting, categorizing and analyzing evaluation results. We consider BOP to be a good start in this direction, but in order to be a general-purpose framework for evaluating object detection, it should be extended with more metrics and simpler interfaces.

Author Contributions

Conceptualization, F.G.; methodology, F.G.; software, F.G., P.R. and T.P.; validation, F.G., P.R. and T.P.; formal analysis, F.G.; investigation, F.G.; resources, F.G.; data curation, F.G.; writing—original draft preparation, F.G.; writing—review and editing, F.G., P.R. and T.P.; visualization, F.G.; supervision, P.R.; project administration, F.G.; funding acquisition, F.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. He, Z.; Feng, W.; Zhao, X.; Lv, Y. 6D pose estimation of objects: Recent technologies and challenges. Appl. Sci. 2020, 11, 228. [Google Scholar] [CrossRef]
  2. He, Y.; Huang, H.; Fan, H.; Chen, Q.; Sun, J. FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 3002–3012. [Google Scholar] [CrossRef]
  3. Hodaň, T.; Sundermeyer, M.; Drost, B.; Labbé, Y.; Brachmann, E.; Michel, F.; Rother, C.; Matas, J. BOP Challenge 2020 on 6D Object Localization. In Computer Vision—ECCV 2020 Workshops; Springer Science and Business Media Deutschland GmbH: Berlin/Heidelberg, Germany, 2020; Volume 12536, pp. 577–594. [Google Scholar] [CrossRef]
  4. Denninger, M.; Sundermeyer, M.; Winkelbauer, D.; Zidan, Y.; Olefir, D.; Elbadrawy, M.; Lodhi, A.; Katam, H. BlenderProc 2019. Available online: http://xxx.lanl.gov/abs/1911.01911 (accessed on 28 December 2021).
  5. Labbé, Y.; Carpentier, J.; Aubry, M.; Sivic, J. CosyPose: Consistent Multi-View Multi-Object 6D Pose Estimation. In Computer Vision—ECCV 2020; Springer Science and Business Media Deutschland GmbH: Berlin/Heidelberg, Germany, 2020; Volume 12362, pp. 574–591. [Google Scholar] [CrossRef]
  6. Sahin, C.; Garcia-Hernando, G.; Sock, J.; Kim, T.K. A review on object pose recovery: From 3D bounding box detectors to full 6D pose estimators. Image Vis. Comput. 2020, 96, 103898:1–103898:25. [Google Scholar] [CrossRef] [Green Version]
  7. Cong, Y.; Chen, R.; Ma, B.; Liu, H.; Hou, D.; Yang, C. A Comprehensive Study of 3-D Vision-Based Robot Manipulation. IEEE Trans. Cybern. 2021. [Google Scholar] [CrossRef] [PubMed]
  8. Du, G.; Wang, K.; Lian, S.; Zhao, K. Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: A review. Artif. Intell. Rev. 2021, 54, 1677–1734. [Google Scholar] [CrossRef]
  9. Drost, B.; Ulrich, M.; Navab, N.; Ilic, S. Model globally, match locally: Efficient and robust 3D object recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 998–1005. [Google Scholar] [CrossRef] [Green Version]
  10. Hinterstoisser, S.; Lepetit, V.; Rajkumar, N.; Konolige, K. Going further with point pair features. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; Volume 9907, pp. 834–848. [Google Scholar] [CrossRef] [Green Version]
  11. Vidal, J.; Lin, C.Y.; Lladó, X.; Martí, R. A method for 6D pose estimation of free-form rigid objects using point pair features on range data. Sensors 2018, 18, 2678. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  12. Hodaň, T.; Michel, F.; Brachmann, E.; Kehl, W.; Buch, A.G.; Kraft, D.; Drost, B.; Vidal, J.; Ihrke, S.; Zabulis, X.; et al. BOP: Benchmark for 6D object pose estimation. In Computer Vision—ECCV 2018; Springer: Cham, Switzerland, 2018; Volume 11214, pp. 19–35. [Google Scholar] [CrossRef] [Green Version]
  13. Hinterstoisser, S.; Holzer, S.; Cagniart, C.; Ilic, S.; Konolige, K.; Navab, N.; Lepetit, V. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 858–865. [Google Scholar] [CrossRef]
  14. Hinterstoisser, S.; Lepetit, V.; Ilic, S.; Holzer, S.; Bradski, G.; Konolige, K.; Navab, N. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Computer Vision—ACCV 2012; Springer: Berlin/Heidelberg, 2013; Volume 7724, pp. 548–562. [Google Scholar] [CrossRef] [Green Version]
  15. Rios-Cabrera, R.; Tuytelaars, T. Discriminatively trained templates for 3D object detection: A real time scalable approach. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2013; pp. 2048–2055. [Google Scholar] [CrossRef] [Green Version]
  16. Tejani, A.; Tang, D.; Kouskouridas, R.; Kim, T.K. Latent-class Hough Forests for 3D Object Detection and Pose Estimation. In Computer Vision—ECCV 2014; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8694, pp. 462–477. [Google Scholar] [CrossRef]
  17. Tejani, A.; Kouskouridas, R.; Doumanoglou, A.; Tang, D.; Kim, T.K. Latent-Class Hough Forests for 6 DoF Object Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 119–132. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Hodan, T.; Zabulis, X.; Lourakis, M.; Obdrzalek, S.; Matas, J. Detection and fine 3D pose estimation of texture-less objects in RGB-D images. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Hamburg, Germany, 28 September–2 October 2015; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2015; pp. 4421–4428. [Google Scholar] [CrossRef]
  19. Brachmann, E.; Krull, A.; Michel, F.; Gumhold, S.; Shotton, J.; Rother, C. Learning 6D object pose estimation using 3D object coordinates. In Computer Vision—ECCV 2014; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8690, pp. 536–551. [Google Scholar] [CrossRef] [Green Version]
  20. Brachmann, E.; Michel, F.; Krull, A.; Yang, M.Y.; Gumhold, S.; Rother, C. Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3364–3372. [Google Scholar] [CrossRef]
  21. Kehl, W.; Milletari, F.; Tombari, F.; Ilic, S.; Navab, N. Deep learning of local RGB-D patches for 3D object detection and 6D pose estimation. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; Volume 9907, pp. 205–220. [Google Scholar] [CrossRef] [Green Version]
  22. Buch, A.G.; Kiforenko, L.; Kraft, D. Rotational Subgroup Voting and Pose Clustering for Robust 3D Object Recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2017; pp. 4137–4145. [Google Scholar] [CrossRef] [Green Version]
  23. Rambach, J.; Deng, C.; Pagani, A.; Stricker, D. Learning 6DoF Object Poses from Synthetic Single Channel Images. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality, ISMAR-Adjunct 2018, Munich, Germany, 16–20 October 2018; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2018; pp. 164–169. [Google Scholar] [CrossRef]
  24. Tekin, B.; Sinha, S.N.; Fua, P. Real-Time Seamless Single Shot 6D Object Pose Prediction. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 292–301. [Google Scholar] [CrossRef] [Green Version]
  25. Sundermeyer, M.; Marton, Z.C.; Durner, M.; Brucker, M.; Triebel, R. Implicit 3D orientation learning for 6D object detection from RGB images. In Computer Vision—ECCV 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11210, pp. 712–729. [Google Scholar] [CrossRef] [Green Version]
  26. Park, K.; Patten, T.; Vincze, M. Pix2pose: Pixel-wise coordinate regression of objects for 6D pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 7667–7676. [Google Scholar] [CrossRef] [Green Version]
  27. Zakharov, S.; Shugurov, I.; Ilic, S. DPOD: 6D pose object detector and refiner. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 1941–1950. [Google Scholar] [CrossRef] [Green Version]
  28. Thalhammer, S.; Patten, T.; Vincze, M. SyDPose: Object Detection and Pose Estimation in Cluttered Real-World Depth Images Trained using only Synthetic Data. In Proceedings of the 2019 International Conference on 3D Vision, 3DV 2019, Quebec City, QC, Canada, 16–19 September 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 106–115. [Google Scholar] [CrossRef]
  29. Li, Z.; Wang, G.; Ji, X. CDPN: Coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 7677–7686. [Google Scholar] [CrossRef]
  30. Hagelskjar, F.; Buch, A.G. Pointvotenet: Accurate Object Detection and 6 DOF Pose Estimation in Point Clouds. In Proceedings of the International Conference on Image Processing, ICIP, Abu Dhabi, United Arab Emirates, 25–28 October 2020; IEEE Computer Society: Washington, DC, USA, 2020; pp. 2641–2645. [Google Scholar] [CrossRef]
  31. Hodaň, T.; Baráth, D.; Matas, J. EPOS: Estimating 6D pose of objects with symmetries. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; IEEE Computer Society: Washington, DC, USA, 2020; pp. 11700–11709. [Google Scholar] [CrossRef]
  32. Su, Y.; Rambach, J.; Pagani, A.; Stricker, D. Synpo-net—Accurate and fast CNN-based 6DoF object pose estimation using synthetic training. Sensors 2021, 21, 300. [Google Scholar] [CrossRef] [PubMed]
  33. Deng, X.; Mousavian, A.; Xiang, Y.; Xia, F.; Bretl, T.; Fox, D. PoseRBPF: A rao-blackwellized particle filter for 6-D object pose tracking. IEEE Trans. Robot. 2021, 37, 1328–1342. [Google Scholar] [CrossRef]
  34. He, Y.; Sun, W.; Huang, H.; Liu, J.; Fan, H.; Sun, J. PVN3D: A deep point-wise 3D keypoints voting network for 6DoF pose estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11629–11638. [Google Scholar] [CrossRef]
  35. Wang, C.; Xu, D.; Zhu, Y.; Martin-Martin, R.; Lu, C.; Fei-Fei, L.; Savarese, S. DenseFusion: 6D object pose estimation by iterative dense fusion. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3338–3347. [Google Scholar] [CrossRef] [Green Version]
  36. Haugaard, R.L.; Buch, A.G. SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings. arXiv 2021, arXiv:2111.13489. [Google Scholar]
  37. Rudorfer, M.; Neumann, L.; Krüger, J. Towards Learning 3d Object Detection and 6d Pose Estimation from Synthetic Data. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation, ETFA, Zaragoza, Spain, 10–13 September 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 1540–1543. [Google Scholar] [CrossRef]
  38. Hodan, T.; Vineet, V.; Gal, R.; Shalev, E.; Hanzelka, J.; Connell, T.; Urbina, P.; Sinha, S.N.; Guenter, B. Photorealistic Image Synthesis for Object Instance Detection. In Proceedings of the International Conference on Image Processing, ICIP, Taipei, Taiwan, 22–25 September 2019; IEEE Computer Society: Washington, DC, USA, 2019; pp. 66–70. [Google Scholar] [CrossRef] [Green Version]
  39. Hinterstoisser, S.; Pauly, O.; Heibel, H.; Martina, M.; Bokeloh, M. An annotation saved is an annotation earned: Using fully synthetic training for object detection. In Proceedings of the 2019 International Conference on Computer Vision Workshop, ICCVW 2019, Seoul, Korea, 27–28 October 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 2787–2796. [Google Scholar] [CrossRef]
  40. Rojtberg, P.; Pöllabauer, T.; Kuijper, A. Style-transfer GANs for bridging the domain gap in synthetic pose estimator training. In Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and Virtual Reality, AIVR 2020, Utrecht, The Netherlands, 14–18 December 2020; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2020; pp. 188–195. [Google Scholar] [CrossRef]
  41. Eversberg, L.; Lambrecht, J. Generating images with physics-based rendering for an industrial object detection task: Realism versus domain randomization. Sensors 2021, 21, 7901. [Google Scholar] [CrossRef] [PubMed]
  42. König, R.; Drost, B. A Hybrid Approach for 6DoF Pose Estimation. In Computer Vision—ECCV 2020 Workshops; Springer Science and Business Media Deutschland GmbH: Berlin/Heidelberg, Germany, 2020; Volume 12536, pp. 700–706. [Google Scholar] [CrossRef]
  43. Sundermeyer, M.; Durner, M.; Puang, E.Y.; Marton, Z.C.; Vaskevicius, N.; Arras, K.O.; Triebel, R. Multi-path learning for object pose estimation across domains. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; IEEE Computer Society: Washington, DC, USA, 2020; pp. 13913–13922. [Google Scholar] [CrossRef]
  44. Kehl, W.; Manhardt, F.; Tombari, F.; Ilic, S.; Navab, N. SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1530–1538. [Google Scholar] [CrossRef] [Green Version]
  45. Shugurov, I.; Zakharov, S.; Ilic, S. DPODv2: Dense Correspondence-Based 6 DoF Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2021. [Google Scholar] [CrossRef] [PubMed]
  46. Blinn, J.F. Models of light reflection for computer synthesized pictures. ACM Siggraph Comput. Graph. 1977, 11, 192–198. [Google Scholar] [CrossRef]
  47. Rusinkiewicz, S.; Levoy, M. Efficient variants of the ICP algorithm. In Proceedings of the International Conference on 3-D Digital Imaging and Modeling, 3DIM, Quebec City, QC, Canada, 28 May–1 June 2001; pp. 145–152. [Google Scholar] [CrossRef] [Green Version]
  48. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef] [Green Version]
  49. Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. PVNET: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE Computer Society: Washington, DC, USA, 2019; pp. 4556–4565. [Google Scholar] [CrossRef] [Green Version]
  50. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2016; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2017; pp. 77–85. [Google Scholar] [CrossRef] [Green Version]
  51. Hodaň, T.; Haluza, P.; Obdrzalek, Š.; Matas, J.; Lourakis, M.; Zabulis, X. T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, Santa Rosa, CA, USA, 24–31 March 2017; pp. 880–888. [Google Scholar] [CrossRef] [Green Version]
  52. Hodăn, T.; Matas, J.; Obdržálek, Š. On evaluation of 6D object pose estimation. In Computer Vision—ECCV 2016 Workshops; Springer: Cham, Switzerland, 2016; Volume 9915, pp. 609–619. [Google Scholar] [CrossRef]
  53. Hodan, T. BOP: Benchmark for 6D Object Pose Estimation. Available online: https://bop.felk.cvut.cz/leaderboards/ (accessed on 28 December 2021).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
