A Survey on Deep Learning Based Methods and Datasets for Monocular 3D Object Detection

Owing to recent advancements in deep learning methods and relevant databases, it is becoming increasingly easy to recognize 3D objects using only RGB images from single viewpoints. This study investigates the major breakthroughs and current progress in deep learning-based monocular 3D object detection. For relatively low-cost data acquisition systems without depth sensors or cameras at multiple viewpoints, we first consider existing databases with 2D RGB photos and their relevant attributes. Based on this simple sensor modality for practical applications, deep learning-based monocular 3D object detection methods that overcome significant research challenges are categorized and summarized. We present the key concepts and detailed descriptions of representative single-stage and multiple-stage detection solutions. In addition, we discuss the effectiveness of the detection models on their baseline benchmarks. Finally, we explore several directions for future research on monocular 3D object detection.


Introduction
Deep learning networks have increasingly been extending the generality of object detectors. In contrast to traditional methods in which each stage is individually handcrafted and optimized by classical pipelines, deep learning networks achieve superior performance by automatically deriving each stage for feature representation and detection. In addition, new approaches for data-driven representation and end-to-end learning with a substantial number of images have led to significant performance improvements in 3D object detection. With the evolution of deep representation, object detection is being widely used in robotic manipulation, autonomous driving vehicles, augmented reality, and many other applications, such as CCTV systems.
Beyond the significant progress in image-based 2D object detection, 3D understanding of real-world objects is an open challenge that has not been explored extensively thus far. In addition to the most closely related studies [1][2][3][4][5][6], we focus on investigating deep learning-based monocular 3D object detection methods. For location-sensitive applications, conventional 2D detection systems have a critical limitation in that they do not provide physically correct metric information on objects in 3D space. Hence, 3D object detection is an interesting topic in both academia and industry, as it can provide relevant solutions that significantly improve existing 2D-based applications.
Camera sensors that capture color and texture information have emerged as an essential imaging modality in many computer vision applications. The passive camera sensors do not interfere with other active optical systems, and always work well with them when needed. For image-based deep representations that encode depth cues, monocular images are also highly cost-effective. Owing to considerable accumulations of annotations for RGB databases, the data-driven representations using deep neural networks make monocular 3D object detectors even more advantageous without expensive depth-aware sensors or cameras at additional viewpoints.
To understand the major breakthroughs and current progress in practical 3D object detection, we contribute to the literature by reviewing recent developments in deep learning-based state-of-the-art 3D object detection with monocular RGB databases. The remainder of this paper is organized as follows. Section 2 presents the overall background for our taxonomic approach. Section 3 summarizes well-known datasets for monocular 3D object detection. Section 4 comprehensively describes multi-stage approaches and end-to-end learning for monocular 3D object detection methods. The key concepts, representative solutions, and effectiveness of the detection models in terms of their baseline benchmarks are discussed in detail. Section 5 briefly highlights potential research opportunities. Finally, Section 6 concludes the paper.

Background on Object Detection
Given an image with a pixel grid representation, object detection is the task of localizing instances of objects with bounding boxes of a certain class. An important contribution in solving the 2D object detection problem is the use of region-based convolutional neural networks (R-CNNs), which involve two main stages: region proposal and detection. The region of interest (ROI) of an image is proposed on the basis of certain assumptions, such as color, texture, and size. The ROI is cropped and fed to a CNN that performs the detection. By combining prior knowledge and labeled datasets, the two-stage detection framework has emerged as a classical model in both 2D and 3D object detection [7][8][9][10].
Another important algorithm for object detection is the YOLO algorithm [11]. It does not have a separate region proposal stage; instead, it divides an input image roughly into an N × N grid. Based on each grid cell, localization and classification tasks are performed together in a unified regression network, followed by further post-processing. Early end-to-end approaches performed poorly in the detection of small or occluded objects. As new datasets are being developed, there have been significant innovations in end-to-end networks [12][13][14]. As fewer proposal steps with hand-crafted features are involved in single-stage methods, they are computationally less complex than multi-stage approaches that usually prioritize detection accuracy. In practice, there has been active competition between multi-stage and single-stage methods for object detection tasks. 3D object detection follows a similar overall flow.
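The grid-based assignment described above can be illustrated with a minimal sketch. It only shows how a box center is mapped to one cell of an N × N grid; the function name and the box format (x_min, y_min, x_max, y_max) are assumptions made for illustration and are not taken from the YOLO paper [11].

```python
def grid_cell_for_box(box, image_w, image_h, n):
    """Return (row, col) of the grid cell that contains the box center.

    box is (x_min, y_min, x_max, y_max) in pixels (assumed format);
    n is the number of cells along each image side.
    """
    center_x = (box[0] + box[2]) / 2.0
    center_y = (box[1] + box[3]) / 2.0
    # Scale the center to grid units and clamp so that a center lying
    # exactly on the right or bottom image border stays inside the grid.
    col = min(int(center_x * n / image_w), n - 1)
    row = min(int(center_y * n / image_h), n - 1)
    return row, col
```

In a single-stage detector of this kind, the cell returned here would be the one responsible for regressing the box and its class scores.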
The goal of 3D object detection systems is to provide 3D-oriented bounding boxes for 3D objects in the 3D real world. The 3D cuboids can be parameterized by 8-corners, 3D centers with offsets, 4-corner-2-height representations, or other encoding methods. In monocular 3D object detection methods, we seek the oriented bounding boxes of 3D objects from single RGB images. Similarly to 2D-image-based object detection systems, monocular 3D object detection methods can be also categorized into two main types, as shown in Figure 1. From a taxonomic point of view, we have extended them to six sub-categories, according to the main distinguishing features of each sub-category. As shown in Table 1, we have summarized the main features of ten high-quality datasets, such as descriptions with quick links, input data types, contextual information for different applications, the availability of synthetic RGB images, the number of 3D object instances/categories, the number of training/testing images, and lastly, other related references, which can be used for future research. In Table 2, we have briefly explained key features of the most representative works for each category and the related databases, computational time, and so on. All of those methods use powerful algorithms that can only run on a high-performance system using GPUs, and we did not pay attention to lightweight deep learning models for lower-power embedded/mobile systems.
Based on a general understanding of object detection, we review 11 datasets for monocular 3D object detection and more than 29 recent algorithms. The unique properties of 3D object detection systems, such as different data representations and the availability of both 2D and 3D annotations, make the 3D detection frameworks more complicated and interesting.
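To make the 8-corner parameterization mentioned in this section concrete, the following sketch converts a box given by its 3D center, dimensions, and a single yaw angle into eight corner points. The axis convention (z pointing up, yaw about the z axis) and the function name are assumptions for illustration; individual datasets use their own coordinate conventions.

```python
import math


def box3d_corners(center, dims, yaw):
    """Return the 8 corners of an oriented cuboid as (x, y, z) tuples.

    center: (cx, cy, cz); dims: (length, width, height);
    yaw: rotation about the vertical z axis in radians (assumed convention).
    """
    cx, cy, cz = center
    length, width, height = dims
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (length / 2.0, -length / 2.0):
        for dy in (width / 2.0, -width / 2.0):
            for dz in (height / 2.0, -height / 2.0):
                # Rotate the horizontal offset by yaw, then translate.
                x = cx + cos_y * dx - sin_y * dy
                y = cy + sin_y * dx + cos_y * dy
                corners.append((x, y, cz + dz))
    return corners
```

The center-with-offsets encoding carries the same information in seven numbers, whereas the 8-corner encoding uses 24 numbers and does not by itself guarantee a valid cuboid.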

Table 2 (excerpt). Key features, databases, and computational time of representative methods. Method names, categories, and column headers that were lost in extraction are marked as not recoverable.

Method: (not recoverable)
Database: KITTI 3D
Computational time: It takes 1.8 s on a single core, but exhaustive search in the proposal step can be done efficiently as all features can be computed with integral images.

Method: (not recoverable)
Key features: Simultaneous vehicle detection, part localization (even if some parts are hidden), visibility characterization, and 3D template for each detection. Coarse-to-fine object proposal with multiple refinement steps for accurate 2D vehicle bounding boxes.
Database: KITTI 3D
Computational time: It is approximately twice as fast as Mono3D, because the lower image resolution in the coarse-to-fine method considerably reduces the search space.

Method: MF3D [21]
Category: Depth Estimation
Key features: Multi-level fusion scheme for monocular 3D object detection utilizing a standalone depth estimation module to ensure accurate 3D localization and improve the detection performance.
Database: Cityscape, KITTI 3D
Computational time: The inference time, including the depth module, is about 120 ms per image on an NVIDIA GeForce GTX Titan X.
Additional columns (headers not recoverable): No; Partial implement.

Method: (not recoverable)
Category: Represent. Transform
Key features: Conversion of an estimated depth map from stereo or monocular imagery into a 3D point cloud, which mimics the real LiDAR, and takes advantage of existing LiDAR-based detection pipelines.
Database: KITTI 3D
Computational time: The paper does not focus on real-time processing. A more effective way to speed up depth estimation is required.
Additional columns (headers not recoverable): Yes

Method: Deep-6D Pose [25]
Category: Direct Regression
Key features: An end-to-end deep learning framework for detection, segmentation, and 6D pose estimation of 3D objects. It directly regresses 6D object poses without any post-refinements.
Database: LineMOD (LM), [51]
Computational time: Due to the end-to-end architecture, it offers an inference speed of 10 fps on a Titan X GPU (not optimized speed).
Additional columns (headers not recoverable): No; Partial implement.

Method: (not recoverable)
Key features: A single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses. It predicts the projected vertices of the object's 3D bounding box.
Database: LM, LM-O
Computational time: A pose refinement step can be used to boost the accuracy, but it then runs at 10 fps. Without additional post-processing, it runs at 50 fps on a single Titan X GPU.

Datasets Used for Monocular 3D Object Detection
Although deep learning methods for 2D object detection using pure RGB images have achieved considerable success, it is much more challenging to obtain 3D-oriented bounding boxes owing to the absence of absolute 3D information in the 2D image plane. In general, when the number of layers to be trained increases, the size of the labeled datasets is especially important for obtaining the data-driven solution. Compared with well-built 2D datasets, 3D datasets are still under construction. In this section, we review well-known RGB (or RGB-D) datasets used in recent 3D object detection tasks.

Beyond PASCAL
PASCAL3D+ [31], which is an extension of one of the most popular 2D detection benchmarks, PASCAL VOC [32], handles 12 selected categories of rigid objects. As shown in Figure 2, 3D CAD models are collected and aligned to images in the PASCAL VOC database. To overcome some ambiguities of 2D images in different categories, additional photos from ImageNet [33] are included for the same 12 categories. Instead of a small number of images per category, captured in controlled environments, more than 3000 objects per category are stored in PASCAL3D+, with rich 3D annotations for objects appearing in a variety of natural images. Indeed, PASCAL3D+, with its extended 3D information, facilitated significant progress in research on monocular 3D object detection.

SUN RGB-D
SUN RGB-D [34], which is an extension of the SUN3D dataset [37] developed at Princeton University, contains 10,355 images with depth channels from four different sensors. For example, 3389 frames without severe motion blur have been manually selected from the SUN3D videos. Further, 1449 RGB-D images from the NYU Depth V2 dataset [35] and 554 realistic scene images from the Berkeley B3DO dataset [36] are included. The collected datasets handle 47 scene categories and around 800 object categories, and the annotations consist of 146,617 2D polygons and 64,595 3D-oriented bounding boxes. On average, 14.2 objects are annotated in each image. Thus, the SUN RGB-D dataset has had a major impact on indoor vision tasks such as 3D object detection using RGB or RGB-D images, object orientation estimation, and indoor scene understanding.

ObjectNet3D
ObjectNet3D [38] comprises 90,127 images with 44,147 3D models in 100 rigid object categories. The 2D images are initially acquired from ImageNet [33] and supplemented with images from Google searches for categories that do not include sufficient numbers of images. Furthermore, 3D models from ShapeNet [39] and Trimble Warehouse were selected to precisely align with 201,888 objects appearing in these photos. Similarly to PASCAL3D+, this process gives a 3D shape label and the closest pose annotation for each object. Given accurate 2D and 3D annotations, ObjectNet3D facilitates the study of object proposal, shape retrieval, object detection, and pose estimation algorithms.

Falling Things (FAT)
Falling Things (FAT) [40], which is an extension of the Yale-CMU-Berkeley (YCB) dataset [41], contains 61,500 snapshots of 21 household objects. A physics-based graphic simulator was introduced to generate photorealistic training images and automatic annotations to evaluate and train robotic manipulation algorithms for household scenes. By combining synthetic objects and backgrounds, all the information, such as 2D/3D locations, poses, and segmentation masks, is available for all the objects drawn in the high-quality simulation images. The simulation process and analysis are also well described. In the context of robust perception for robotic manipulation, this synthetic dataset can help improve the overall performances of object classification algorithms, pose recognition algorithms, and other related algorithms.

Benchmark for 6D Object Pose Estimation (BOP)
The Benchmark for 6D Object Pose Estimation (BOP) dataset [42,43] contains training images with rigid objects at various viewpoints, wherein the 6D poses (3D translation and 3D rotation in space) of the presented objects are known, or texture-mapped models of the 3D objects were well prepared. The images with test objects have occlusions or background clutter; hence, some parts of the objects may not be observable and only the visible surface can fit multiple 3D models. For the evaluation, the benchmark additionally consists of eight well-known datasets in different scenarios.
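As a small illustration of what a 6D pose (3D rotation plus 3D translation) means in practice, the sketch below applies a pose to a point of a 3D object model. The function name and the plain nested-list matrix layout are assumptions for illustration and are not part of the BOP toolkit.

```python
def apply_pose(rotation, translation, point):
    """Map a model point into the camera frame: p_cam = R * p_model + t.

    rotation: 3x3 rotation matrix as nested sequences (row-major);
    translation and point: 3-element sequences.
    """
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Example: rotate 90 degrees about the z axis, then translate.
ROT_Z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
moved = apply_pose(ROT_Z_90, (1, 2, 3), (1, 0, 0))
```

Pose estimation methods evaluated on such benchmarks predict the rotation and translation; applying them to the model vertices, as above, yields the object surface in camera coordinates for comparison against the ground truth.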
One of the datasets is the LineMOD (LM) benchmark [44], comprising 15 texture-less objects with discriminative shapes, sizes, and colors in household environments. A test image with background clutter shows an annotated object with small occlusions only. The level of occlusion is further controlled in the LineMOD-Occluded (LM-O) dataset [45] with additional annotations of all associated objects. T-LESS [46] comprises 30 industry-relevant objects from 20 scenes with discriminative colors and no significant textures. The objects have mutual similarities and symmetries in size and shape, and some objects are composited from other assemblable objects. The MVTec Industrial 3D Object Detection Dataset (ITODD) [47] contains 3500 labeled scenes and 28 objects acquired from realistic setups for industrial applications. The 6D poses are known for the validation images only and are not available publicly for the test images. The YCB-Video (YCB-V) dataset [27] contains 133,827 frames with 21 objects, selected from 92 videos of the YCB dataset. The 80K simulation images in the original dataset are also included in this benchmark. In the case of the HomebrewedDB (HB) dataset [48], there are 33 toy, household, and industry-relevant objects in 13 complex scenes with different backgrounds.
As shown in Figure 3, other datasets, such as RU-APC [49], IC-BIN [50], IC-MI [50], and TYO-L [42], were also used for the BOP Challenge. The training and test images for 3D object detection are annotated with ground-truth object poses. Every dataset, together with the given 3D models, is available in the unified BOP format.

Context-Aware MixEd ReAlity (CAMERA)
The Context-Aware MixEd ReAlity (CAMERA) dataset [54] addresses the limitations of traditional data generation by synthetically generating a large number of training images and ground truths in a faster and more cost-effective manner. From mostly tabletop scenes, 553 background images were acquired in widely varying conditions. For hand-scale objects such as a bowl, a bottle, a can, a camera, a mug, and a laptop, selected from ShapeNetCore, a total of 300,000 composited images of 31 indoor tabletop scenes were rendered. Further, 25,000 photorealistic images were set aside for validation. For point sampling and plane detection, the mixed reality compositing technique exploits the Unity engine with custom plugins. In contrast to non-context-aware images from previous approaches, the simulated images in this database facilitate and improve generalization in learning-based methods.

Objectron
The Objectron dataset [52] is a collection of short, object-centric video clips that are accompanied by AR session metadata, including camera poses, sparse point clouds, and characterizations of planar surfaces in the surrounding environment. In each video, the camera moves around the object, capturing it from different angles. The data also include manually annotated 3D bounding boxes for each object, which describe the object's position, orientation, and dimensions. The dataset consists of 15,000 annotated video clips supplemented with over 4 million annotated images of the following objects: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes. In addition, to ensure geo-diversity, the dataset was collected from 10 countries across five continents. Along with the dataset, a 3D object detection solution was released for four categories of objects: shoes, chairs, mugs, and cameras. These models were trained using this dataset and released in MediaPipe, Google's open-source framework for cross-platform customizable ML solutions for live and streaming media.

KITTI 3D
The KITTI 3D benchmark [55] comprises 7481 training images, no official validation images, and 7518 test images. As there is no validation set, the training images are often split into 3712 images for training and 3769 images for validation before the results on the test set are reported via the evaluation server. For 3D annotations of 2D images, the possible 3D bounding boxes are given for only three categories, namely, cyclist, car, and pedestrian. Depending on object truncation, occlusion, and distance to the camera, the difficulty of 3D detection is determined as hard, moderate, or easy. Figure 4 shows examples of the target objects with their ground-truth bounding boxes. Virtual KITTI [56] is one of the first synthetic datasets for training and testing machine learning models for autonomous driving applications. In a video-game world, it is easy to create data for rare events, and scenes with changes in only one condition (such as the weather) can be generated. Moreover, the exact ground truth can be generated along with simulated images; hence, little annotation is required. The Unity game engine has been used to explore this concept by carefully recreating real-world videos from the popular KITTI autonomous driving benchmark suite.
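The easy, moderate, and hard levels mentioned above can be sketched as a simple rule on three per-object attributes. The thresholds below follow the commonly cited KITTI object benchmark criteria (minimum 2D box height, maximum occlusion level, maximum truncation), where the box height acts as a proxy for distance to the camera; they are not stated in this survey, so treat them as an assumption and check the benchmark documentation before relying on them. The function name is hypothetical.

```python
# (label, min 2D box height in px, max occlusion level, max truncation)
# Occlusion levels: 0 = fully visible, 1 = partly occluded, 2 = largely occluded.
# Thresholds are an assumption based on the commonly cited KITTI criteria.
KITTI_LEVELS = (
    ("easy", 40, 0, 0.15),
    ("moderate", 25, 1, 0.30),
    ("hard", 25, 2, 0.50),
)


def kitti_difficulty(box_height_px, occlusion, truncation):
    """Return the easiest level whose criteria the object meets, else None."""
    for label, min_height, max_occlusion, max_truncation in KITTI_LEVELS:
        if (
            box_height_px >= min_height
            and occlusion <= max_occlusion
            and truncation <= max_truncation
        ):
            return label
    return None  # Too small, too occluded, or too truncated for any level.
```

Note that this returns the single easiest level an object qualifies for; in evaluation, the levels are cumulative, so an object that is easy is also counted when scoring the moderate and hard settings.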

Cityscapes 3D
Cityscapes 3D [39] is an extension of Cityscapes [57], one of the most influential datasets, which enriches annotations with high-quality 3D bounding boxes for vehicles. The original Cityscapes dataset [58] contains 5000 images, of which 2975 are used for training, 500 are used for validation, and 1525 are used for testing. The 3D bounding box annotations cover all eight semantic classes in the vehicle category of the Cityscapes dataset, i.e., bicycle, bus, car, caravan, train, motorcycle, trailer, and truck. The 3D annotations are newly labeled with nine degrees of freedom (DoF) using stereo images, resulting in more accurate re-projection in images and a higher range than LiDAR-based methods. It is a new benchmark for 3D detection tasks in autonomous driving with full 3D orientation, including yaw, pitch, and roll labels. Compared to other 3D detection datasets, Cityscapes 3D has a high object density, which indicates complex scenes.

Synscapes
The Synscapes database [59], created by a collaboration between 7DLabs Inc. and researchers at Linköping University, is a synthetic dataset comprising more than 25,000 simulated images. In the context of street scene parsing, the photorealistic rendering technique tries to capture every aspect of the optical process in the camera system, from illumination sources such as the sun, to the object's material and geometric composition, and finally, to the sensors. As photons hit digital sensors through a lens in a pinhole camera, the signal is converted into an image with other physically plausible noise. For example, owing to the relative velocities of vehicles, motion blur can be modeled. Synscapes ensures simulations that are representative of the real world for data augmentation in driving scenes.

SYNTHetic Collection of Imagery and Annotations (SYNTHIA-AL)
The SYNTHetic Collection of Imagery and Annotations (SYNTHIA) dataset [61] provides photorealistically rendered frames in city-level scenes. The categories handled in the database are building, bicycle, car, fence, lane marking, road, pedestrian, pole, sidewalk, sky, traffic light, traffic sign, vegetation, and void. The SYNTHIA-AL dataset is generated by modifying the SYNTHIA environment using the Unity Pro game engine. In the context of driving scenarios, the data are generated in a virtual world consisting of three different areas, namely, town, city, and highway. These areas are populated with a variety of pedestrians, cars, cyclists, and wheelchairs, except for the highway, which is limited to cars. Several environmental conditions, such as season (winter, fall, and spring), time of day (day or night), and weather (clear or rainy), can be set. The ground truth is provided in terms of 2D/3D bounding boxes, instance segmentation, and depth information [60].

Monocular 3D Object Detection Methods
Researchers have proposed new methods to overcome challenges for monocular 3D object detection. Here, we categorize these methods into multi-stage and end-to-end approaches.

Multi-Stage Approaches
First, we can deal with the ill-posed problem by employing prior hypotheses on 3D objects. The prior knowledge includes semantic, context, shape, and location information. By performing distinct tasks sequentially, including handcrafted feature extraction, 2D boxes of interest can be proposed. Alternatively, we can use standard 2D object detectors with simple deep neural networks. For example, in GS3D [16], 2D detections are converted into basic 3D boxes using projection knowledge; these basic boxes are called guidance. Given the guidance, the 3D bounding boxes are further refined without expensive stereo data or point clouds.
With an additional 3D shape prior, we can perform 3D object detection through CAD template matching. During the detection process, a template library will be established, and the network will match the best model in the template library. In the case of the method in [19], the 3D template, partial visibility, and partial coordinates of the detected vehicle are given. Then, these features are considered to estimate the localization and orientation with 2D-3D model fitting. Even if some parts of the test objects are not visible in the 2D case, vehicle models can be retrieved via template matching.
As monocular images lack depth information owing to the principle of perspective transformation, we can use deep learning to predict the depth map of the image first, which serves as the basis for 3D object detection in the next stage. To achieve effective monocular depth estimation, many algorithms have been developed in recent years. In addition to using the depth estimation module, the object ROI and depth feature map are fused to calculate the object coordinate and spatial location information [21]. Using a multi-layer fusion scheme, this framework [21] can generate the final pseudo-point cloud information for its application.
Likewise, converting the image information into point cloud information is also a popular approach; a point-cloud-based network is then used for processing. As another application of representation transform, the orthographic feature transform (OFT) [24] maps perspective images to orthographic bird's eye view (BEV) images in the deep-learning-based framework. In general, the representation transform selects an application-specific data representation that is more suitable for the target scenario than the image domain. Hence, it can achieve satisfactory detection results.
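The conversion from an estimated depth map to a pseudo point cloud can be sketched with the standard pinhole back-projection. This is a minimal illustration, not the implementation of any specific method cited here; the function name, the nested-list depth map, and the intrinsics fx, fy, cx, cy are assumptions.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map into 3D points in the camera frame.

    depth: 2D nested list indexed as depth[v][u], in meters;
    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points
```

The resulting list of (x, y, z) tuples can then be passed to a point-cloud-based detector, which is the idea behind the representation transform described above.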

2D Detection-Driven Methods
Based on PASCAL3D+ [31], simultaneous 2D object detection and viewpoint estimation was proposed by Su et al. [62]. Given an input RGB image and a bounding box from an off-the-shelf detector, a deep representation was tailored specifically for viewpoint estimation. The authors selected category-specific orientations of objects with a novel loss layer adapted over synthetically generated viewpoint labels. Experimental results indicated that the performances of both joint detection and viewpoint estimation can be significantly improved on PASCAL 3D+. 2D images and 3D shapes/scans can be connected through their image synthesis pipeline; thus, information can be transported between the two domains bidirectionally. When training datasets for deep learning need to be manually annotated, this approach infers 3D information with negligible human effort.
For encoding raw images with different sensor modalities in compact descriptors, Wohlhart et al. [63] used pair-wise and triplet-wise constraints on training images and template views. By considering the dissimilarity and similarity of the descriptors, they efficiently captured both the object categories and the 3D poses. The constraints untangle the input images with different objects from different views into several clusters, which are not only well separated but also structured as the corresponding sets of 3D poses. The Euclidean metric between descriptors is sufficiently large when the descriptors are given from different objects. Furthermore, when the encoded descriptors are given from the same object, the distance is directly associated with the different 3D poses of the object. In this manner, the learned descriptors can be generalized to classify unseen objects as well. This approach requires binary masks of the objects of interest; however, it works well with either RGB or RGB-D images of the LineMOD dataset [44].
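The idea behind triplet-wise constraints can be illustrated with a generic hinge-style triplet loss on descriptor vectors: a descriptor should be closer to a similar sample (same object, similar pose) than to a dissimilar one by some margin. This is a common textbook form used here only for illustration; it is not claimed to be the exact loss of Wohlhart et al. [63], and the names and the margin value are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic hinge triplet loss: zero once the negative is farther from
    the anchor than the positive by at least `margin`."""
    return max(
        0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    )
```

Minimizing such a loss over many triplets pushes descriptors of different objects apart while keeping descriptors of nearby poses close, which matches the cluster structure described above.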
To use a standard CNN method for high-quality detections, Chen et al. [15] assumed that objects are always on the ground plane. Initially, given a set of category-specific object proposals, the monocular 3D object detection is formulated as an energy minimization task that optimally locates object candidates in the 3D world. Based on prior information such as object size, location, shape, segmentation, and contextual information, each intuitive loss function accurately optimizes a 3D box. Hence, Mono3D [15] uses two stages with a 2D object detection network. The detection performance of this approach was quantitatively confirmed on the challenging KITTI benchmark.
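The ground-plane assumption is what makes depth recoverable from a single image: if an object touches a flat ground and the camera height is known, the image row of the contact point determines its distance. The sketch below shows this geometric relation only, under the additional assumptions of a pinhole camera whose optical axis is parallel to the ground; it is not the energy formulation of Mono3D [15], and all names are hypothetical.

```python
def ground_plane_depth(v_bottom, fy, cy, camera_height):
    """Depth of a ground contact point seen at image row v_bottom.

    Assumes a pinhole camera mounted camera_height meters above a flat
    ground, with its optical axis parallel to the ground and image rows
    increasing downward. Returns None for rows at or above the horizon.
    """
    rows_below_horizon = v_bottom - cy
    if rows_below_horizon <= 0:
        return None
    return fy * camera_height / rows_below_horizon
```

For example, with fy = 700 pixels, cy = 180, and a camera 1.65 m above the road, a box whose bottom edge lies at row 250 corresponds to a depth of roughly 16.5 m.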
When objects have significant truncation, occlusion, and scale variations in the CNN-based detection pipeline, region proposals can often be a bottleneck. To alleviate this issue, subcategory-aware CNNs [17] have an interesting region proposal network whereby the proposal step is guided by subcategory information. The subcategory concept refers to categories of objects that share similar attributes, such as 3D shape and pose. Based on this key assumption, the SubCNN considers a new detection network for joint detection and subcategory classification. In addition, test objects at different scales are handled using image pyramids in an efficient manner. While exploring the effects of subcategory information on CNN-based object detection, extensive experiments were conducted on the PASCAL VOC 2007, PASCAL3D+, and KITTI detection benchmarks.
In GS3D [16], 2D detections are converted into basic 3D boxes using projection knowledge, which is called guidance. Given the guidance, the 3D bounding boxes are further refined without expensive stereo data or point clouds. To remove representation ambiguities in 2D bounding boxes, the underlying 3D structure in the surface feature is extracted. In practice, coarse cuboids are reported to have sufficient accuracy for determining the 3D bounding boxes of objects by refinement. To refine the 3D detections, the surface feature extraction module, which is an affine extension of RoIAlign, is also used. In this framework, the complex residual regression problem is reformulated as a classification task, which is much easier to train. Finally, the discriminative ability is enhanced by the quality-aware loss. This approach was evaluated using the KITTI 3D benchmark. Figure 5 shows 2D detection-driven GS3D. 2D detections and the orientations of target objects are obtained using the CNN-based model (2D+O subnet). Then, the proposed algorithm generates the guidance using the given 2D bounding box and orientation with a projection matrix. The refinement model (3D subnet) uses the extracted features from visible parts and 2D detections of the projection guidance. Instead of direct regression, the reformulated classification task is adopted by the refinement model with the quality-aware loss to achieve a more accurate result.
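The step from a 2D detection to a coarse 3D box can be illustrated with a simplified pinhole-camera sketch: an assumed physical object height and the 2D box height give an approximate depth, and the 2D box center is then back-projected at that depth. This is only an illustration of the projection reasoning, not the exact guidance generation of GS3D [16]; the function name, the height prior, and the intrinsics are assumptions.

```python
def coarse_center_from_2d_box(box, prior_height_m, fx, fy, cx, cy):
    """Estimate a coarse 3D center (x, y, z) from a 2D box.

    box: (x_min, y_min, x_max, y_max) in pixels;
    prior_height_m: assumed real-world object height in meters;
    fx, fy, cx, cy: pinhole intrinsics in pixels.
    """
    box_height_px = box[3] - box[1]
    if box_height_px <= 0:
        raise ValueError("2D box must have positive height")
    # Similar triangles: pixel height = fy * real height / depth.
    depth = fy * prior_height_m / box_height_px
    u = (box[0] + box[2]) / 2.0
    v = (box[1] + box[3]) / 2.0
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
```

A coarse estimate of this kind is only a starting point; the refinement stage described above is what recovers an accurate 3D box.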

3D Shape Information
While Mono3D [15], an optimization-based pioneering method, does not show satisfactory accuracy and speed, its successor, Mono3D++ [20], achieves improved performance with better template matching and optimization using the Ceres toolbox [64]. Mono3D++ [20] uses coarse and fine 3D hypotheses to infer the object shape and pose from one RGB image. Specifically, a fine representation for vehicles is generated by morphable wireframe models with different shapes and poses. For lower sensitivity to 2D landmark features, a coarse representation aims to model 3D bounding boxes to improve stability and robustness. For joint energy minimization with a projection error, three priors are considered, namely, vehicle shape, a ground plane constraint, and unsupervised monocular depth.
3D shape information-based methods tend to become slow when the number of shape templates or object poses increases, because hand-crafted steps for comparing them are required for optimization. To tackle this problem with some physical quantities, Konishi et al. [65] proposed a new image feature based on orientation histograms of random projection images from CAD models. Similarly, in [66], coarse initialization was adopted for 3D poses of texture-less objects. In [67], temporally consistent, local color histograms were used for pose estimation and segmentation of rigid 3D objects. For handheld objects, the statistical descriptors can be learnt online within a few seconds.
Instead of optimizing separate quantities, Chabot et al. [18] proposed a multi-tasking network structure for 2D and 3D vehicle analysis from a single image. For simultaneous part localization, visibility characterization, vehicle detection, and 3D dimension estimation, the many-tasks network (MANTA) first detects 2D bounding boxes of vehicles in multiple refinement stages. For each detection, it also gives the 3D shape template, part visibility, and part coordinates of the detected vehicle even if some parts are not visible. Then, these features are considered to estimate the vehicle localization and orientation using 2D-3D correspondence matching. To access the 3D information of the test objects, the vehicle models are searched for template matching. The real-time pose and orientation estimation uses the outputs of the network in the inference stage. At the time of publication, this approach was the state-of-the-art approach using the KITTI 3D benchmark in terms of vehicle detection, 3D localization, and orientation estimation tasks.
As shown in Figure 6, an input image is passed forward to the deep MANTA network where convolution layers with the same weights have the same color. The existing architecture is split into three blocks. With these networks, the object proposals are refined iteratively until the final detection that is associated with the part's coordinate, the part's visibility, and the template similarity. Moreover, non-maximum suppression (NMS) removes some redundant detections. Based on the outputs, the best 3D shape is chosen in the inference stage. 2D and 3D pose computation is then performed with the associated shape.
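Non-maximum suppression, mentioned above, can be sketched as the standard greedy procedure on 2D boxes: keep the highest-scoring box, discard boxes that overlap it too much, and repeat. This is a generic illustration rather than the specific variant used in Deep MANTA; the box format (x_min, y_min, x_max, y_max) and the threshold value are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box
        # by more than the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Because the threshold is chosen by hand, NMS is one of the hand-crafted modules that can noticeably influence final detection results.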
In the ROI-10D algorithm [19], a monocular deep network directly optimizes a novel 3D loss formulation and then lifts a 2D bounding box to 3D shape recovery and pose estimation. Using CAD templates and synthetic data augmentation, deep feature maps are generated and combined to obtain the shape dimensions. Then, shape regression is performed to obtain the object information. In particular, the pose distributions are well analyzed in the KITTI 3D benchmark. In metrically accurate pose estimation, learning synthetic data is useful for increasing the pose recall; however, some hand-crafted modules such as 2D and 3D NMS have a strong influence on the final results.

Depth Estimation
On the basis of deep-learning-based monocular depth estimation, Xu and Chen [21] proposed the multi-level fusion-based 3D object detection (MF3D) algorithm, which combines the Deep3Dbox algorithm [68] and a standard depth estimation module. Using deep CNN features, it basically uses the existing detectors. In addition to 2D proposals, the disparity estimation is computed to generate a 3D point cloud. Thus, the deep features derived from the RGB images and the point cloud are fused to enhance the object detection performance. The depth feature map and the ROI for objects are combined to obtain the 3D spatial location. The key idea of this framework is to use the multi-level fusion scheme, taking advantage of the standalone module for disparity computation. Experimental results showed that the performance of 3D object detection can be boosted by 3D localization. Figure 7 shows sub-networks of the MF3D framework [21]. The task-specific modules are responsible for objectness classification, 2D box regression, and disparity prediction. Based on region proposals and point cloud maps from the estimated disparity, the 3D bounding box of the object is optimized and visualized as shown in the figures on the right.
MonoGRNet [22] uses depth estimation for similar reasons. However, it does not require precise pixel-level depth annotation; only instance-level depth is considered for 3D localization. The unified MonoGRNet uses four sub-networks for instance depth estimation (IDE), 2D object detection, 3D localization, and corner regression. In the IDE module, the depth of a target object is predicted at the center of its 3D bounding box. With sparse supervision, this network performs depth inference only on the areas of objects detected as 2D bounding boxes. By avoiding depth estimation for the entire image, it reduces the computational requirements considerably. The global 3D position is achieved by simply estimating the object location in the vertical and horizontal dimensions. Then, the corner coordinates are regressed in the local context. By optimizing the poses and positions of 3D bounding boxes, MonoGRNet is trained in the global context.

Representation Transform
Pseudo-LiDAR [23] assumes that the main innovation for bridging the gap between LiDAR-based and pure image-based 3D object detection is the computational representation itself for expressing the 3D scene. In other words, the point cloud representation may be more suitable for monocular 3D object detection than the image-based representation with the same quality of depth information. Thus, the image-based depth maps are converted into the proposed pseudo-LiDAR representations mimicking real LiDAR signals. To this end, the deep ordinal regression network (DORN) [69] has been exploited for monocular depth estimation, and the mathematical relationship between the 2D image coordinates and the 3D pseudo-point cloud has been derived. After processing two additional networks for pseudo-point data, the representation transform makes it possible to apply any LiDAR-based algorithms to monocular 3D object detection.
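The relationship between 2D image coordinates and the 3D pseudo-point cloud is the standard pinhole back-projection. The following is a minimal sketch under the assumption of a calibrated camera with focal lengths fx, fy and principal point (cx, cy); the function name and the list-of-rows depth format are ours, chosen for illustration only.

```python
def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map into 3D points in the camera frame.

    depth is a list of rows of metric depths. Pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = z.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                # Skip pixels without a valid depth estimate.
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

The resulting point list can, in principle, be handed to a detector designed for LiDAR point clouds, which is the essence of the representation transform. Any change of axes to a LiDAR coordinate convention is omitted here.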
The BEV image is another popular representation for applications involving autonomous vehicles. A common technique for converting images into BEV images is inverse perspective mapping (IPM); however, it typically assumes that all the pixels should be on the ground plane and it requires accurate camera parameters for estimating the plane homography. Without needing to calibrate extrinsic parameters, the OFT [24] maps perspective images to orthographic BEV images in the deep-learning-based framework. The overall architecture and its output are shown in Figure 8. To encode camera images, the authors used a front-end ResNet-18 architecture and accumulated image-based features into a voxel-based representation. The voxel features were then collapsed along the vertical dimension to yield orthographic ground plane features. Another network was finally employed to remove the distortional effects of perspective projection and refine the BEV map. The top-down network processes these features in the BEV space. At each location on the ground plane, it predicts a confidence score S, a position offset ∆pos, a dimension offset ∆dim, and an angle vector ∆ang. Reasoning in the 3D space improves the performance, and the network is robust to objects that are distant or occluded.
In fact, the method proposed in [70] is similar. The two-phase approach basically uses IPM to infer the distance from the 3D scene. In the first phase, camera motions such as pitch and roll rotations are removed using inertial measurement units. The front view is corrected and projected onto the BEV via the IPM module. In the second phase, the position, orientation, and size of the vehicle are detected by the CNN. By canceling the camera pitch and roll rotations, a vanishing point is moved to infinity so that it is not affected by any vehicle attitude. The resulting projection image is parallel and linear with respect to the x-y coordinate system of the vehicle. For 3D localization of objects in the real world, the bounding box detected from the BEV is transformed by the inverse projection matrix for conversion into metric units. The proposed algorithm was quantitatively validated using KITTI 3D.
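To make the ground-plane assumption of IPM concrete, the sketch below maps a pixel to metric ground coordinates for the simplest case: a camera whose pitch and roll have already been cancelled, mounted at a known height above a flat road. The function name, the camera-frame convention (x right, y down, z forward), and the parameter names are our assumptions and do not reproduce the implementation in [70].

```python
def pixel_to_ground(u, v, fx, fy, cx, cy, cam_height):
    """Intersect the viewing ray of pixel (u, v) with the flat ground plane.

    Assumes zero pitch and roll, so the ground is the plane y = cam_height
    in the camera frame. Returns (x, z) in metres, or None when the pixel
    lies on or above the horizon and never meets the ground.
    """
    dv = v - cy
    if dv <= 0:
        return None
    z = fy * cam_height / dv      # forward distance along the optical axis
    x = (u - cx) * z / fx         # lateral offset at that distance
    return x, z
```

Pixels close to the horizon row map to very large distances, so a small error in the row coordinate produces a large metric error. This is one reason accurate camera parameters matter for IPM.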
The representation transform is also a promising candidate for robotics, augmented reality, and 3D scene understanding. Wang et al. [54] recently proposed a novel normalized object coordinate space (NOCS) for indoor applications. It defines a shared space with consistent object orientation and scaling. To estimate the metrically accurate size and pose of unseen objects, the NOCS map is predicted by the proposed network and used with the depth map for pose fitting. Extensive experiments on the CAMERA dataset [54] demonstrated that the proposed method can estimate the sizes and poses of unseen object instances robustly in real environments.

End-to-End Approaches
Some recent methods directly return 3D location information of objects and pose parameters of a camera. For example, a well-known deep representation uses the shared 2D and 3D detection space to build an independent monocular 3D region proposal network, which achieved the best performance at the time of its publication [26]. In practice, the space for directly searching for rotation parameters is nonlinear, making it difficult for the CNN to recognize the rotation of an object. To avoid this problem, some algorithms have been proposed to discretize the rotation space or refine the result iteratively. Postprocessing is often crucial for direct regression.
Meanwhile, algorithms that use key points for an algebraic solver do not directly obtain the pose of the object from a monocular image, but focus on 2D-3D point correspondences to find a firm geometric model using the perspective-n-point (PnP) algorithm [71].
2D key point detection is easier than 3D localization and rotation estimation; however, it requires a model of a known 3D object and some predefined key points. One of the clear trends is to increase the number of matching pairs. For example, a pixel-wise voting network (PVNet) [72] predicts pixel-level indicators corresponding to the key points so that it can handle truncation or occlusion of object parts. Each pixel votes for the predefined key points, which are optimized by the Ceres toolbox [64]. PVNet can achieve good results compared to previous algorithms.
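For reference, a common least-squares formulation of the PnP problem is given below. The notation is ours: the X_i are the known 3D model points, the x_i are their detected 2D key points, K is the intrinsic matrix, and π denotes perspective division. Individual solvers may minimize a different (for example, algebraic) error.

```latex
(\mathbf{R}^{*}, \mathbf{t}^{*}) =
\arg\min_{\mathbf{R} \in SO(3),\; \mathbf{t} \in \mathbb{R}^{3}}
\sum_{i=1}^{n} \left\| \mathbf{x}_i - \pi\!\left( \mathbf{K} \left( \mathbf{R}\,\mathbf{X}_i + \mathbf{t} \right) \right) \right\|_2^{2},
\qquad
\pi\!\left( [a,\; b,\; c]^{\top} \right) = [a/c,\; b/c]^{\top}
```

Increasing the number of matching pairs n adds residual terms to this sum, which makes the estimate less sensitive to an error in any single key point.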
In this section, we will review these methods using end-to-end CNNs with monocular images.

Direct Regression
Mousavian et al. [68] proposed Deep3Dbox, a method that estimates the 3D pose and the size of the 3D bounding box of an object. Similarly to previous 2D-based object detectors, it partitions the parameter space of the 3D bounding boxes into multiple bins (MultiBin). From the shared convolution features, the proposed architecture estimates the dimensions, angles, and confidences using fully connected layers, which can facilitate robust MultiBin-based regression. Instead of using the L2 loss function to extract a rotation angle directly, the angle is separated into numerous bins. Then, the confidence of each bin and the offset are predicted using the residual of the center bin. In the object space size estimation, the L2 loss function is directly used to compute the offset of the space size. After determining the size and rotation angle of the object, we can restore the object's 6D pose by computing the rotation matrix of the object. This method outperforms other previous methods in terms of the orientation accuracy on the KITTI dataset and viewpoint estimation on Pascal3D+.
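The bin-plus-residual idea can be sketched as follows. This is a simplified illustration rather than the Deep3Dbox implementation: it assumes equally spaced, non-overlapping bins and encodes the residual as a (cosine, sine) pair; all function names are hypothetical.

```python
import math


def bin_centers(num_bins):
    """Centers of equally spaced orientation bins over [0, 2*pi)."""
    width = 2.0 * math.pi / num_bins
    return [i * width for i in range(num_bins)]


def wrap(angle):
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def encode_angle(theta, num_bins):
    """Return (bin index, (cos, sin) of the residual to that bin center)."""
    residuals = [wrap(theta - c) for c in bin_centers(num_bins)]
    idx = min(range(num_bins), key=lambda i: abs(residuals[i]))
    r = residuals[idx]
    return idx, (math.cos(r), math.sin(r))


def decode_angle(idx, cos_sin, num_bins):
    """Recover the angle in [0, 2*pi) from a bin index and its residual."""
    r = math.atan2(cos_sin[1], cos_sin[0])
    return (bin_centers(num_bins)[idx] + r) % (2.0 * math.pi)
```

In a network, the bin index would be trained as a classification target and the residual as a regression target; at inference, the bin with the highest confidence is decoded.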
Xiang et al. [27] proposed PoseCNN for 6D object pose estimation. It consists of feature extraction, embedding, and classification/regression blocks. The feature extraction network is based on a single-shot detector (SSD) [13]. Here, the extracted features are shared among all the tasks performed by the second stage. Semantic labeling can provide rich information for the objects, and this pixel-level classification is effective at dealing with occlusions. Inspired by the implicit shape model, it can regress the center position and object distance. It is difficult for the CNN to regress the 3D rotation matrix directly owing to the nonlinearity of the target space. Hence, a discretization scheme was proposed for the space of rotation. However, the accuracy of estimating the rotation matrix may be degraded by converting the regression of the rotation into a classification problem. To overcome this problem, in PoseCNN [27], two new loss functions were designed for estimating the 3D rotation matrix to handle symmetric objects and to match object shapes by decoupling the estimations of 3D translation and 3D rotation. Poirson et al. [73] also used SSD [13], the 2D object detector, to integrate the pose estimation for each detected object in the same network. The previous two-stage approach requires at least three resamplings of the image for region proposals, object detection, and pose estimation. By combining these steps into a single network, they achieved very fast object detection and pose estimation of up to 46 frames per second on a Titan X GPU.
Deep-6DPose [25] performs object detection, segmentation, and pose estimation simultaneously using an end-to-end network. Interestingly, it takes advantage of the concept in Mask-RCNN [9] in order to directly regress 6D poses of objects without further post-processing. The remarkable contribution of this approach is the separate regression of translation and rotation, with the rotation represented in a Lie algebra. Compared with a conventional orthonormal matrix or quaternion-based representation, a Lie algebra representation is better suited to regression owing to its fewer parameters and the absence of constraints. Deep-6DPose achieves rapid processing through its end-to-end architecture, and it is suitable for various robotic applications.
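The benefit of an unconstrained rotation parameterization can be seen in the exponential map from the Lie algebra so(3) to the rotation group SO(3): any three real numbers yield a valid rotation matrix. The sketch below uses Rodrigues' formula; it illustrates the representation only and is not taken from the Deep-6DPose code (the function name is ours).

```python
import math


def so3_exp(w):
    """Map a 3-vector w (rotation axis times angle) to a 3x3 rotation matrix.

    Rodrigues' formula: R = I + (sin t / t) K + ((1 - cos t) / t^2) K^2,
    where t = |w| and K is the skew-symmetric matrix of w.
    """
    wx, wy, wz = w
    theta = math.sqrt(wx * wx + wy * wy + wz * wz)
    K = [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]
    if theta < 1e-12:
        a, b = 1.0, 0.5  # limits of the two coefficients as theta -> 0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]
```

A network can therefore regress the three components of w freely, whereas a directly regressed 3x3 matrix must be orthonormalized and a quaternion must be normalized.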
M3D-RPN [26] uses a shared space of 2D detection and 3D detection for a single 3D region proposal architecture. It gives greater weight to the relationship of 2D and 3D aspects. To improve the 3D parameter estimation accuracy, depth-aware convolution has been proposed for learning more high-level features with spatial information, as shown in Figure 9. Then, the pose optimization algorithm is adopted for orientation estimation, followed by 2D detections and 3D projections. Applying M3D-RPN to BEV and 2D and 3D object detection tasks shows the effectiveness of the single-stage network. Liu et al. [74] proposed measuring the degree of visual fitting between the object and the projected 3D proposals for achieving high-precision localization. After regressing the 3D bounding box and the orientation of the object for constructing suitable 3D proposals, they proposed the fitting quality network (FQNet), which can predict intersection over union (IoU) in the 3D space between the 3D bounding box and the target object using 2D cues, as shown in Figure 10. Their motivation was that denoting the projections on the image domain can provide additional knowledge to better understand the spatial relationship. Matching the object-rendered image with the input image generates better results compared to the limited accuracy of direct regression. DeepIM [75] is a new refinement method that uses a deep neural network for matching 6D poses iteratively. Given an initial pose, a relative transformation can be predicted by matching the rendered image with the observed image. As rendering the object and estimating the 6D pose are complementary, the accuracy of pose estimation increases with iteration. The separate representation of 3D position and rotation not only achieves accurate estimated poses but also allows unseen objects to be refined. 
Experiments on commonly used benchmarks such as LM [44] and T-LESS [46] demonstrated that the proposed method shows significant improvements over previous methods.
Figure 10. Overall pipeline of a deep fitting degree scoring network [74], which refines an initial bounding box using a regression module and FQNet.
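The 3D IoU that FQNet [74] is trained to predict can be illustrated with a simplified computation. The sketch below assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax); oriented boxes, as used in practice, additionally require a polygon intersection in the ground plane, which is omitted here. The function name is ours.

```python
def iou_3d_axis_aligned(a, b):
    """3D intersection over union of two axis-aligned boxes."""
    inter = 1.0
    for k in range(3):
        # Overlap of the two intervals along axis k.
        overlap = min(a[k + 3], b[k + 3]) - max(a[k], b[k])
        if overlap <= 0:
            return 0.0
        inter *= overlap
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)
```

Because the overlap is a product over three axes, a modest localization error along each axis reduces the 3D IoU much faster than the corresponding 2D IoU.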

2D-3D Correspondences
BB8 [76] is based on the idea of using 2D-3D correspondences for 3D object localization. In the first step, a network of object segmentation is applied to an input image for localizing the objects. Next, another network is used to estimate the 2D projections of the interest points of the 3D boundaries around the target objects. The 6D pose is estimated using the relationship between the 3D bounding box corners and the corresponding projected 2D points. To handle the rotational symmetry, it restricts the pose ranges in the training stage and introduces a classifier to estimate the pose ranges at run time. For the final refinement of the estimated poses, it includes a feedback loop to compare the input image and the rendered object for better prediction of the 2D projected points. This holistic approach showed more accurate results on the challenging T-LESS dataset [46]. SSD-6D [29] applies the SSD concept [13] to 6D pose estimation. As an extension of SSD for inferring the 3D location and orientation, it predicts the corner points in the bounding boxes, classes, viewpoints, and in-plane rotation. For better results, it tries to find a proper sampling for the space of rotation. Interestingly, SSD-6D is trained using only a synthetic dataset, which can alleviate the difficulties of building a database for new target objects. SSD-6D can treat depth as an optional modality for hypothesis verification and pose refinement.
Tekin et al. [28] used the YOLO network [77] to predict the key points of corresponding objects. The network has a regular grid to present feature maps spatially. In each cell, the 2D positions of the corner points corresponding to the 3D bounding boxes are predicted. Then, the 6D pose can be computed using the PnP algorithm for the given 2D and 3D correspondences. However, the predicted 2D points may be insufficient for pose estimation when there is severe occlusion. To overcome such problems due to occlusion or truncation, Hu et al. [78] proposed an image-segmentation-based method to estimate an object's 6D pose by aggregating numerous local pose estimates, which can achieve more accurate key point estimations, even in cases of severe occlusion. To combine the pose candidates into a more robust set of 3D and projected 2D correspondences, confidence measures are computed. Even in the case of severe occlusion, because it generates precise results based on merging local pose estimates robustly, it does not adopt an additional refinement process. The proposed algorithm was tested on the challenging LM-O [45] and YCB-V [27] datasets. We believe the future goal of these 2D-3D correspondence approaches is to incorporate the PnP step into the network to establish a complete, end-to-end framework.
The dense pose object detector (DPOD) [79] predicts dense multi-class 2D and 3D correspondence maps between input images and possible 3D models. A 6D pose is computed on the basis of the PnP algorithm and RANdom SAmple Consensus (RANSAC) for the correspondences. Then, the pose is refined from the initial pose estimation using the refinement architecture. In contrast to the previous methods that regress projections of the object's bounding boxes [28,76] or formulate pose estimation as a discrete classification problem [29], DPOD shows more robust and accurate 6D pose estimation owing to the dense correspondences. PVNet [72] also uses a denser key point prediction method, as shown in Figure 11. Instead of using sparse key points by regression or prediction, PVNet is used to predict pixel-level indicators corresponding to the key points. This flexible representation can handle occlusion or truncated key points robustly. The RANSAC-based voting scheme provides the spatial probability distribution of each key point for estimating 6D poses with an uncertainty-driven PnP algorithm.
Figure 11. Overview of the keypoint localization in PVNet [72]. The probability distributions of the keypoint locations are estimated from hypotheses.
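The RANSAC-based voting can be sketched as follows, assuming each object pixel carries a predicted unit vector pointing toward the key point. Two random pixels give a hypothesis at the intersection of their rays, and every pixel whose direction agrees with the hypothesis casts a vote. The function names, the cosine threshold, and the number of hypotheses are illustrative assumptions; the sketch returns only the best hypothesis rather than a full probability distribution.

```python
import math
import random


def intersect(p1, d1, p2, d2):
    """Intersection of the lines p1 + s*d1 and p2 + t*d2, or None if parallel."""
    det = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(det) < 1e-9:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    s = (dx * d2[1] - dy * d2[0]) / det
    return (p1[0] + s * d1[0], p1[1] + s * d1[1])


def vote_score(h, pixels, dirs, cos_threshold=0.99):
    """Count pixels whose predicted direction points toward hypothesis h."""
    score = 0
    for p, d in zip(pixels, dirs):
        vx, vy = h[0] - p[0], h[1] - p[1]
        norm = math.hypot(vx, vy)
        if norm > 0 and (vx * d[0] + vy * d[1]) / norm >= cos_threshold:
            score += 1
    return score


def vote_keypoint(pixels, dirs, num_hypotheses=50, seed=0):
    """Return the best-supported key point hypothesis and its vote count."""
    rng = random.Random(seed)
    best, best_score = None, -1
    for _ in range(num_hypotheses):
        i, j = rng.sample(range(len(pixels)), 2)
        h = intersect(pixels[i], dirs[i], pixels[j], dirs[j])
        if h is None:
            continue
        score = vote_score(h, pixels, dirs)
        if score > best_score:
            best, best_score = h, score
    return best, best_score
```

Because the key point is located by agreement among many pixels, it can be recovered even when it falls outside the visible part of the object, which is how this representation copes with occlusion and truncation.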
As the pose estimation problem belongs to the domain of geometric vision, it is essential to approach it as an end-to-end optimization to seamlessly combine the geometrically relevant information with the deep learning process. To this end, BPnP [30] was proposed as an effective network module that backpropagates gradients through a PnP solver to guide parameter updates in the network. If the optimization block is differentiable, the gradients of the PnP solver can be derived accurately via implicit differentiation. Because it integrates the PnP solver as a layer, the proposed method can be effectively employed to learn feature representations for various geometric vision problems such as structure from motion, geometric camera calibration, and pose estimation. For pose estimation, a BPnP-based trainable pipeline achieves higher accuracy by incorporating the feature map loss with 2D-3D reprojection errors.
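The implicit differentiation step can be written generically as follows. The notation is ours and is not copied from [30]: x collects the solver inputs (for example, the predicted 2D key points), y is the pose, and o is the objective minimized by the solver. The result assumes that o is twice differentiable and that the Jacobian of g with respect to y is invertible at the solution.

```latex
\mathbf{y}^{*}(\mathbf{x}) = \arg\min_{\mathbf{y}} \; o(\mathbf{x}, \mathbf{y})
\;\;\Longrightarrow\;\;
g(\mathbf{x}, \mathbf{y}^{*}) := \left. \frac{\partial o}{\partial \mathbf{y}} \right|_{\mathbf{y} = \mathbf{y}^{*}} = \mathbf{0},
\qquad
\frac{\partial \mathbf{y}^{*}}{\partial \mathbf{x}}
= - \left[ \frac{\partial g}{\partial \mathbf{y}} \right]^{-1} \frac{\partial g}{\partial \mathbf{x}}
```

The gradient thus depends only on derivatives of the objective at the converged pose, so the iterations of the solver itself do not need to be unrolled during backpropagation.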

Discussion
In images captured by cameras, geometric clues are essentially lost during dimension reduction through 3D to 2D projection. To overcome this problem, the studies reviewed in this paper constitute an active domain of 3D object detection using RGB images. To begin with, we summarized well-known benchmark datasets [31,34,38,40-42,54,55,57,59,61] built by research groups from academia and industry. Benchmark databases are useful for fair comparison of the previous methods in the fields of machine learning and computer vision. They freely present high-quality training and test datasets. This is important for most of the deep-learning-based problems; in particular, for 3D-related tasks, sufficiently comprehensive information with a large amount of data is necessary. In fact, 3D bounding box annotation requires more specific guidelines for annotators and considerable time and effort.
When no single annotation alone is sufficient for ideal end-to-end training, we often use different types of human annotations together for a new task. By exploiting easily accessible 2D/3D databases, multi-stage methods typically have intermediate representations or features learnt from different annotations/guidance. The 2D detection-driven methods [15-17,62,63] started with two-stage 2D detectors, and developed the feature representation of RGB images for detecting 3D objects. The 3D information of objects is often inferred through fusion schemes with hand-crafted features on points, patches, and parts, or topological structures in 2D.
As additional prior knowledge, 3D shape hypotheses have long been used to recognize and localize a 3D object from a single RGB image. To represent objects reliably, edges or more robust local features are extracted from a photo and matched with their counterparts in 3D models. Because they are conceptually similar to the traditional approaches, the algorithms using a 3D shape hypothesis [18][19][20][65][66][67] can derive 3D information based on template matching with known 3D CAD models. Although it is not easy to deal with multi-object cases in real time, this approach can be highly practical, providing new deep representations and efficient optimization. Recognizing object locations in the actual 3D space also plays an important role in scene perception.
By inferring the scene geometry from 2D images, the depth information can compensate for the weakness of monocular vision. Given only a single RGB image and sufficient ground-truth depth data on the Web, we can predict the depth value of each pixel of the object of interest using learning-based monocular depth estimation. For example, DORN [69] is a popular network for depth extraction that incorporates multi-scale features to estimate pixel-level depth information with small errors. On the basis of existing depth extraction networks, many 3D object detection algorithms [20][21][22] incorporate such depth information as a sub-block in their proposed networks.
Some researchers argue that 3D objects are difficult to detect in a perspective image-based representation, especially when the appearance and scale of objects vary drastically with depth and meaningful distances. A typical approach for the representation adaptation is to transform the 2D image into a 3D point cloud. Then, we can use the available networks for processing the 3D point cloud. Some studies such as [23] have suggested that the point cloud data format is more suitable for detecting and recognizing 3D objects. Another way to alleviate occlusion and scale variation in perspective views is to convert the images into orthographic BEV images [24,70]. This approach forms the basis for future exploration of other tasks where the BEV representation is naturally applicable, such as 3D object tracking and motion forecasting. To address the representation challenge for hand-scale objects on a plane, Wang et al. [54] approached it as a problem of detecting correspondences in the normalized coordinates of a shared space of object description.
The most recent trend in monocular 3D object detection is learning deep neural networks to directly regress the 6D pose from a single image [25-27,68,75] or to estimate the 2D positions of 3D key points and solve the PnP algorithm [28-30,72,76,78,79]. The efficient, robust PnP algorithm can detect multiple 3D objects from the candidate correspondences between 2D and 3D points, but the object is considered as a global body in such cases. Consequently, these methods suffer from severe occlusion, and they easily fail in various real-world situations. Owing to the limited representation in the deep learning network and the widely occurring occlusion, it is difficult to interpret the 2D-3D relationship correctly using a single CNN model.
While the above-mentioned issues provide important clues for possible research directions, we believe that 3D object localization with hybrid representations [80,81] has considerable scope for improvement in the near future. Compared to a unitary representation, a hybrid representation with edge, region, or creative geometric assumptions or any object-part awareness can use multiple training databases. Another possible direction is enforcing the consistency across diverse representations by training the network in a self-supervised manner. In particular, synthetic datasets can pave the way to robust representation for feature domain adaptation. Finally, beyond multiple intermediate representations, geometric relationships across object categories in different scenes can be ultimately formulated as an end-to-end optimization throughout an entire network. We believe that there is considerable scope for finding the best representation transforms, geometric relationships, or other physical conditions of 3D objects, and these discoveries can have a strong influence on future work.

Conclusions
Recently, deep learning methods have attracted considerable attention and witnessed rapid development. In contrast to previous hand-crafted features, the success of the CNN is attributed to its powerful ability to learn rich feature descriptions from an adequate amount of training data. Monocular 3D object detection is no exception. Hence, we surveyed the current methodologies for deep-learning-based 3D object detection using single RGB images. They are being employed in various practical applications such as autonomous vehicles and robotics. We believe that the current gap between mature 2D-based methods and nascent 3D-based methods can be rapidly bridged on the basis of the intensive review presented herein. First, we summarized the widely used benchmark databases for training and evaluating the proposed methods in this area, and we reviewed the recent progress in monocular 3D object detection approaches by categorizing them into multi-stage and end-to-end approaches. We dealt with the main approaches used by recent methods to tackle the problem and discussed their underlying limitations. Finally, we examined the issues involved in localizing objects in the 3D space, which presently is an active research field because of its practical implications. Based on the current research status, object localization followed by pose estimation could be developed adequately for the 3D domain. In particular, enabling 3D perception only from a single camera will be useful for prospective applications.