1. Introduction
Object detection and localization have long been central topics in computer vision. Conventional methods, such as You Only Look Once (YOLO) [1] and the Single-Shot Detector (SSD) [2], have demonstrated impressive performance in the 2D domain. Achieving the same semantic representation of the objective 3D world, on the other hand, is more difficult due to the lack of information about an object's rotation and position relative to the camera. An object pose has six degrees of freedom (6DOF), three in translation and three in rotation, and this information is needed in many applications in robotics and scene interpretation [3]. The estimation of 6D poses has been the subject of extensive study. Nevertheless, estimating an object's 6D pose remains challenging because of several factors, e.g., shape variation, viewing angle, illumination, and occlusion among objects.
Feature-based and template-based techniques are the most traditional approaches in this area. Feature-based algorithms extract local features from points of interest in the image and match them, via local descriptors, against 3D models to retrieve the 6D pose [4,5,6,7,8]. For example, [8] used Scale-Invariant Feature Transform (SIFT) descriptors and clustered images taken from similar viewpoints into a single model, while [7] provided a fast and flexible sensing approach to detect objects and estimate their poses. However, these methods share a common limitation: they require sufficient texture on the objects. Feature-learning methods [9,10] were proposed to achieve better matching when texture is inadequate. Because feature-based techniques compare characteristic points between 3D models and images, they are functional only when the objects are richly textured and cannot handle texture-less objects [11]. Moreover, their multi-stage design is time-consuming. In template-based methods, a rigid template is scanned across the image and a distance measure is computed to find the best fit [12,13]. These approaches can work precisely and rapidly but degrade when clutter and occlusion are involved.
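To make the feature-based pipeline concrete, the following is a minimal sketch of the matching step using OpenCV's SIFT implementation; the file names and the ratio-test threshold are illustrative, and a full pose pipeline would additionally align the matched keypoints with a 3D model.

```python
import cv2

# Load a query image and a reference (model) view; paths are illustrative.
query = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
reference = cv2.imread("reference_view.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute SIFT descriptors in both images.
sift = cv2.SIFT_create()
kp_q, des_q = sift.detectAndCompute(query, None)
kp_r, des_r = sift.detectAndCompute(reference, None)

# Match descriptors with a brute-force matcher and Lowe's ratio test
# to discard ambiguous correspondences.
matcher = cv2.BFMatcher(cv2.NORM_L2)
candidates = matcher.knnMatch(des_q, des_r, k=2)
good = [m for m, n in candidates if m.distance < 0.75 * n.distance]

# The surviving 2D-2D matches would then be lifted to 2D-3D
# correspondences against the object model to recover the 6D pose.
print(f"{len(good)} reliable matches")
```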
Depth cameras have made RGB-D object pose estimation methods more prevalent [14,15,16,17,18,19,20,21]. For instance, an algorithm suitable for both textured and texture-less generic objects was studied in [19]. Another study used multiple views to identify the 6DOF poses of multiple objects in a crowded scene [20], and [14] established a fast and flexible process for RGB-D images. Depth images make the task easier, but they require extra hardware to acquire the depth data.
In recent years, approaches based on convolutional neural networks (CNNs) have become mainstream for 6D pose problems, both for camera poses [22,23] and for object poses [24,25,26,27,28]. Both [22] and [23] trained CNNs to regress the 6D camera pose directly; predicting the camera pose is simpler, since no object needs to be detected. The study in [22] used the quaternion as the rotation representation but ignored its spherical (unit-norm) constraint, leading to an ill-suited loss function; [23] addressed this by developing a more principled loss function.
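To illustrate the issue: a quaternion q and its antipode -q encode the same rotation, and valid quaternions lie on the unit 3-sphere, so a plain L2 loss between raw network outputs respects neither property. The snippet below is a minimal sketch (not necessarily the loss used in [22] or [23]) of a distance that normalizes the prediction and is invariant to the sign ambiguity.

```python
import numpy as np

def quaternion_loss(q_pred, q_gt):
    """Rotation distance between a predicted and a ground-truth quaternion.

    Projects the raw prediction onto the unit 3-sphere, then uses the
    absolute inner product so that q and -q (the same rotation) give
    zero loss against each other.
    """
    q_pred = q_pred / np.linalg.norm(q_pred)  # enforce the spherical constraint
    q_gt = q_gt / np.linalg.norm(q_gt)
    dot = np.clip(np.abs(np.dot(q_pred, q_gt)), 0.0, 1.0)
    return 1.0 - dot  # 0 when the rotations coincide

# Example: an unnormalized prediction close to the identity rotation.
print(quaternion_loss(np.array([0.9, 0.01, 0.0, 0.0]),
                      np.array([1.0, 0.0, 0.0, 0.0])))
```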
Another study [25] used CNNs for direct 3D object regression, but its results were limited to 3D rotation estimates, with no regard for 3D translation. The SSD detection method [2] has also been extended to 6D pose estimation: the authors of [26] decomposed pose recognition into two simpler steps, viewpoint classification and in-plane rotation classification. However, a misclassification at either step results in an inaccurate pose estimate. In another work, called BB8 [27], a segmentation network was used to locate objects, and a second CNN then predicted the 2D projections of the eight corners of the 3D bounding box around the object. A Perspective-n-Point (PnP) algorithm determined the 6D pose from these correspondences, and a final CNN refined the pose. BB8 is highly accurate but time-consuming.
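For reference, the PnP step in such pipelines maps the predicted 2D corner projections back to a pose. Below is a minimal sketch using OpenCV's solver; the box half-extents, intrinsics, and ground-truth pose are placeholder values used to synthesize the 2D points a network such as BB8 would otherwise predict.

```python
import numpy as np
import cv2

# Eight corners of the object's 3D bounding box in the model frame
# (placeholder half-extents of a 10 cm cube).
s = 0.05
corners_3d = np.array([[x, y, z] for x in (-s, s)
                                 for y in (-s, s)
                                 for z in (-s, s)], dtype=np.float64)

# Illustrative pinhole intrinsics (no lens distortion assumed).
K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]])

# Stand-in for the predicted corner projections: synthesized here from
# a known pose (identity rotation, 0.5 m in front of the camera).
rvec_gt = np.zeros(3)
tvec_gt = np.array([0.0, 0.0, 0.5])
corners_2d, _ = cv2.projectPoints(corners_3d, rvec_gt, tvec_gt, K, None)

# PnP recovers the 6D pose from the 2D-3D correspondences.
ok, rvec, tvec = cv2.solvePnP(corners_3d, corners_2d, K, None)
R, _ = cv2.Rodrigues(rvec)
print(ok, R, tvec.ravel(), sep="\n")
```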
Another study [24] extended the YOLO object detection framework [1] to predict the 2D projections of the 3D bounding box corners. Regressing such an excessive number of 2D points enlarges the learning problem and slows down training [24,27]. In contrast to the above, [28] proposed a CNN that regresses the orientation of a 3D object and then combines the estimate with geometric constraints derived from the 2D object bounding box to produce a complete 3D bounding box. Because each of the four sides of the 2D box may be touched by any of the eight projected 3D box corners, this approach generally requires solving 8^4 = 4096 linear systems. Even in special cases, e.g., the KITTI dataset [29] (a project of the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago), where the pitch and roll angles of the objects are both zero, 64 systems still have to be solved, which makes the method computationally expensive. Our method avoids these issues. In a related line of deep learning work, [30] proposed a novel method for spatio-temporal localization of human actions in videos by integrating an optical-flow subnet; the architecture performs action localization and optical flow estimation jointly, in an end-to-end manner.
Deep learning models have recently become commonplace in 6D object pose estimation. In this paper, we propose a generic framework that overcomes the limitations of previous 6D pose estimation methods. We developed an entirely new way of estimating the 3D rotation R and the 3D translation t from a single RGB image. The 6D pose estimation procedure is divided into two main phases. In the first step, the 3D rotation R is regressed by a new convolutional neural network we named Q-Net. In the second step, the Bounding Box Equation algorithm recovers the 3D translation t from R and the 2D bounding box in the original image.
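The Bounding Box Equation itself is defined in Section 2.6; the following is only a rough sketch of the underlying geometry, under the simplifying assumption that we already know which 3D model point projects onto each side of the 2D box. With R and the intrinsics K fixed, each box side yields one equation linear in t, and the four sides give an over-determined system.

```python
import numpy as np

def translation_from_bbox(R, K, touching_pts, bbox):
    """Least-squares estimate of t from R, intrinsics K, and a 2D box.

    touching_pts: 3D model points assumed to project onto the
      (left, right, top, bottom) sides of the box, respectively.
    bbox: (u_min, u_max, v_min, v_max) in pixels.
    Each side constraint u = fx*(r1.X + tx)/(r3.X + tz) + cx (and the
    analogous v equation) becomes linear in t once R is fixed.
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u_min, u_max, v_min, v_max = bbox
    A, b = [], []
    # u-constraints for the left and right sides.
    for X, u in ((touching_pts[0], u_min), (touching_pts[1], u_max)):
        A.append([fx, 0.0, cx - u])
        b.append((u - cx) * (R[2] @ X) - fx * (R[0] @ X))
    # v-constraints for the top and bottom sides.
    for X, v in ((touching_pts[2], v_min), (touching_pts[3], v_max)):
        A.append([0.0, fy, cy - v])
        b.append((v - cy) * (R[2] @ X) - fy * (R[1] @ X))
    t, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return t
```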
We evaluated our approach on the LineMod dataset [13], a standard benchmark for 6D pose prediction. Our experiments show that Q-Net and the Bounding Box Equation work effectively on this demanding dataset, producing state-of-the-art results even in complicated scenes. We also applied our system to estimate the 6D pose of ordinary objects in everyday life, with very satisfactory results. The objective world is three-dimensional, and in practical applications we are chiefly concerned with the three-dimensional position and orientation of the target relative to the camera, which makes the target's semantic information in the scene easier to interpret. In short, our work offers the following contributions:
Our approach requires no depth information and works on both textured and texture-less objects. It also has practical significance and can be used in everyday life.
Compared to previous 2D detection systems, our approach works more robustly and more simply.
We designed a channel-normalization layer that unifies the n-dimensional feature vectors formed by corresponding neurons across n feature maps, so that the layer can directly output the attitude quaternion of the target. At the same time, we defined a new quaternion pose loss function that measures the similarity between the predicted pose and the ground truth (see the sketch after this list).
We implemented the Bounding Box Equation, a new algorithm that recovers the 3D translation effectively and accurately from R and the 2D bounding box.
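As an illustration of the channel-normalization idea (a minimal sketch, not the exact layer defined in Section 2; the module name and shapes are our own), the layer below L2-normalizes, at every spatial location, the vector formed across the n feature maps, so that a four-channel output becomes a unit quaternion:

```python
import torch
import torch.nn as nn

class ChannelNormalize(nn.Module):
    """L2-normalize the feature vector formed across channels.

    For an input of shape (batch, n, H, W), the n values at each
    spatial location are treated as one n-dimensional vector and
    scaled to unit length; with n = 4 and H = W = 1, the output is
    a valid (unit) attitude quaternion.
    """
    def forward(self, x):
        return x / x.norm(dim=1, keepdim=True).clamp_min(1e-8)

# Usage: a 4-channel head pooled to 1x1 yields one quaternion per image.
head = ChannelNormalize()
features = torch.randn(2, 4, 1, 1)   # dummy network output
q = head(features).flatten(1)        # shape (2, 4), unit norm
print(q.norm(dim=1))                 # ~1.0 for each sample
```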
The rest of the paper is structured as follows. Section 2 describes our method: Sections 2.2 and 2.3 introduce the network architecture and the new network model Q-Net for predicting the attitude quaternion; Section 2.4 presents the loss function for regressing the quaternion of the target attitude; Section 2.5 studies the backpropagation algorithm for the quaternion computation; and Section 2.6 covers the prediction of the object position through the BB3D algorithm. Experimental results and their evaluation are elaborated in Section 3. The last two sections contain the discussion and the conclusion.