Article

Joint Optimization of the 3D Model and 6D Pose for Monocular Pose Estimation

School of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(11), 626; https://doi.org/10.3390/drones8110626
Submission received: 3 September 2024 / Revised: 12 October 2024 / Accepted: 27 October 2024 / Published: 30 October 2024

Abstract

The autonomous landing of unmanned aerial vehicles (UAVs) relies on a precise relative 6D pose between platforms. Existing model-based monocular pose estimation methods require an accurate 3D model of the target and cannot cope with its absence. This paper exploits the multi-view geometry constraints within a monocular image sequence to address this problem and introduces a novel monocular pose estimation approach that jointly optimizes the target's 3D model and the relative 6D pose. The target's 3D model is represented by a set of sparse 3D landmarks, and the corresponding 2D landmarks are detected in the input image by a trained neural network. Based on the 2D–3D correspondences, an initial pose estimate is obtained by solving the PnP problem. For the joint optimization, an objective function is built on the minimization of the reprojection error, with the correction values of the 3D landmarks and the 6D pose as the parameters to be solved. Solving this optimization problem realizes the joint optimization of the target's 3D model and the 6D pose. In addition, a sliding window combined with a keyframe extraction strategy is adopted to speed up processing. Experimental results on synthetic and real image sequences show that the proposed method achieves real-time, online, and high-precision monocular pose estimation in the absence of an accurate 3D model via the joint optimization of the target's 3D model and pose.

1. Introduction

The autonomous landing of unmanned aerial vehicles (UAVs) necessitates precise estimation of the relative 6D pose, encompassing both 3D rotation and 3D translation [1]. Compared to radar, optoelectronic, and navigation satellite guidance, monocular vision guidance has the advantages of simple configuration, low cost, high autonomy, and strong anti-interference ability, and it has therefore received widespread attention from researchers. Monocular pose estimation, one of the critical technologies in monocular vision guidance and a significant research direction in computer vision, aims to estimate the position and orientation of a target in three-dimensional space solely from the image information captured by a monocular camera. This technology is not only crucial for applications such as autonomous driving and robot navigation but also shows broad application prospects in fields such as augmented reality and virtual reality. Due to complex environmental interference and application conditions, monocular pose estimation remains a challenging task.
Existing monocular pose estimation methods can be divided into cooperative and non-cooperative methods according to whether cooperation markers are used [2]. Cooperative methods [3,4,5] measure the pose by detecting markers placed on the target. Such methods are simple and efficient, but the additional cooperation markers limit their application. In contrast, non-cooperative methods [6,7,8] only utilize the features of the target itself to achieve pose estimation, and they can be further divided into traditional and deep learning-related methods. Traditional methods perform pose estimation based on feature points, templates, etc. [9,10], and have difficulties dealing with textureless objects, cluttered backgrounds, and similar conditions. Benefiting from the powerful feature extraction and representation capabilities of neural networks, deep learning-based methods achieve excellent performance in monocular pose estimation and have become the mainstream. Some of these methods regress the pose directly in an end-to-end manner; such methods are concise, but their accuracy is limited. Other methods introduce an intermediate representation of the target: the trained networks output the detected intermediate representation, and, with accurate 3D prior information of the target, the pose is solved using Iterative Closest Point (ICP) [11], Perspective-n-Point (PnP) [12,13,14], etc. Existing non-cooperative methods, both traditional and deep learning-related, need an accurate 3D model of the target for template preparation, model training, and accurate 2D–3D correspondence establishment. However, in practice, there are differences between the target entity and its designed 3D model due to deviations during manufacturing, and, for sensitive targets, e.g., carrier-borne aircraft or satellites, it is difficult to obtain accurate 3D models at all. In the absence of an accurate 3D model of the target, existing methods are unable to achieve high-precision pose estimation.
Multi-view geometry constraints have been widely used in structure reconstruction. Motivated by this, this paper adopts the geometry constraints within the monocular image sequence to handle the problem mentioned above, and a novel monocular pose estimation method is proposed that performs joint optimization of the target's 3D model and pose. Similarly to the landmarks used in human pose estimation [15], this paper represents the target's 3D model using a set of sparse landmarks. A trained neural network detects the landmarks in the input image, and, with the 2D–3D correspondences, the proposed method obtains an initial pose estimate by solving the PnP problem. For the joint optimization, the correction values of the landmarks and the pose are taken as the parameters to be optimized, and the objective function is built on reprojection errors. By solving this optimization problem, the target's 3D model and pose are optimized jointly, and a high-precision pose estimate is obtained in the absence of an accurate 3D model. Further, a sliding window combined with a keyframe extraction strategy is adopted to achieve real-time monocular pose estimation. Experimental results indicate that the proposed method achieves real-time, online, and high-precision monocular pose estimation when an accurate 3D model is unavailable, while the target's 3D model is optimized jointly.
The main contributions of this paper are as follows.
1)
To handle the absence of an accurate 3D model, this paper proposes a novel joint optimization method of the 3D model and 6D pose for monocular pose estimation. Using the multi-view geometry constraints in a monocular image sequence, it achieves real-time, online, and high-precision pose estimation.
2)
This paper proposes to represent the target’s 3D model using a set of sparse landmarks. It enables the joint optimization problem to be efficiently solved.
The rest of the paper is organized as follows: Section 2 gives an overview of the most relevant works to this paper. The proposed method is presented in detail in Section 3. Section 4 describes the experimental results and analysis, and the paper is concluded in Section 5.

2. Related Works

This section summarizes the recent works most related to the proposed method, including deep learning-related monocular pose estimation and multi-view geometry constraints used in 3D structure reconstruction.

2.1. Deep Learning-Related Monocular Pose Estimation

As mentioned above, deep learning-related methods have become the mainstream in monocular pose estimation. According to whether an intermediate representation is used, existing deep learning-related monocular pose estimation methods can be categorized into two main kinds: direct and indirect methods. Direct methods [16,17,18] take the input image and regress the pose parameters directly. Kendall et al. [18] proposed PoseNet, which trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image. Direct methods achieve end-to-end network training and inference, but the precision of pose estimation is limited. Indirect methods introduce an intermediate representation of the target; by inferring this intermediate representation, they adopt PnP, ICP, etc., to solve for the pose parameters. Common intermediate representations include dense models, normalized coordinates, and sparse landmarks. Dense model-based pose estimation methods [19,20,21] match each pixel in the image or scene. For instance, DPOD [20] estimates dense multiclass 2D–3D mappings between the input image and the available 3D model. Dense model-based representations offer high accuracy and stability, but low computational efficiency. Normalized coordinates-based methods [22,23,24,25] normalize objects belonging to the same category through a normalized object coordinate space and do not need an accurate 3D model of the target; however, they rely on a depth image to recover the true scale of the object. Sparse landmark representation methods [26,27,28,29,30,31,32] model the object as a sparse landmark set; they are simple and efficient, suitable for real-time pose estimation tasks, and are therefore widely used at present. By establishing sparse or dense 2D–3D correspondences, indirect methods can solve PnP to obtain a more accurate pose than direct methods. Although indirect methods perform better, the non-differentiable PnP step limits their application in end-to-end pose estimation. To solve this problem, Chen et al. [33] and EPro-PnP [34] interpreted PnP as a differentiable layer embedded in the neural network, enabling end-to-end training of indirect methods. In addition, multi-view geometry constraints have been applied in monocular pose estimation [35,36]; for example, CosyPose [36] realizes global scene refinement through object-level bundle adjustment.
The above methods all require accurate 3D prior information about the target, such as an accurate 3D CAD model, through which 2D–3D or 3D–3D correspondences are established to achieve accurate pose estimation. However, they cannot handle the case in which an accurate 3D model of the target is absent.

2.2. Multi-View Geometry Constraints

Multi-view geometry constraints recover the 3D structure of an object or scene from the internal relationships between images while simultaneously calculating the camera pose, e.g., in Simultaneous Localization and Mapping (SLAM) [37,38,39,40], Structure from Motion (SfM) [41,42], object localization [43], and high-precision localization of the aircraft itself [44]. Similarly, in the monocular pose estimation considered in this paper, the image sequence provides multi-view geometry constraints due to the presence of relative motion. Therefore, we attempt to use multi-view geometry constraints to handle monocular pose estimation in the absence of an accurate 3D model of the target. It should be noted that our method differs from traditional SLAM and SfM methods, which rely on feature point extraction and matching to establish inter-frame correspondences and then solve the pose and localization by optimization. The target platform we focus on does not support correspondences based on traditional feature points, and, even when feature point correspondences are established, linking them to a specific 3D model remains challenging. Therefore, this paper discards complex feature point extraction and matching and instead uses an existing neural network-based landmark detection algorithm to directly regress the 2D projections corresponding to predefined 3D landmarks.

3. Method

To achieve reliable monocular pose estimation for UAV autonomous landing missions without an accurate 3D model, this paper proposes a novel joint optimization method using multi-view geometry constraints within image sequences. In this section, we first describe the joint optimization problem and notations. Then, the proposed method is provided in detail in Section 3.2.

3.1. Problem Formulation and Notations

A basic schematic of the object imaging and monocular pose estimation process is given in Figure 1, where $O_O X_O Y_O Z_O$ and $O_C X_C Y_C Z_C$ denote the target coordinate frame and the camera coordinate frame, respectively.
In Figure 1, the rigid body transformation from the target coordinate frame to the camera coordinate frame is represented by $T_O^C$, i.e.,

$$T_O^C = \begin{bmatrix} R_O^C & t_O^C \\ \mathbf{0} & 1 \end{bmatrix} \in SE(3), \quad R_O^C \in SO(3), \quad t_O^C \in \mathbb{R}^3 \tag{1}$$

where the symbols $SE(3)$ and $SO(3)$ used in this article denote the special Euclidean group and the special orthogonal group, respectively [45]. $R_O^C$ and $t_O^C$ are the rotation and translation components of $T_O^C$, respectively.
The perspective projection of a vertex $\tilde{V}^O = \left[v_x^O \ \ v_y^O \ \ v_z^O \ \ 1\right]^T$ of the target's 3D model onto the 2D image can be modeled as follows:

$$x = \pi\!\left( K \left[ T_O^C \tilde{V}^O \right]_{3\times 1} \right) \tag{2}$$

with $K$ representing the precalibrated and fixed $3\times3$ camera intrinsic matrix and $x = [u \ \ v]^T$ the image point, where $u$ and $v$ are the pixel coordinates of $x$ on the horizontal and vertical axes, respectively. $\pi(X) = [x/z \ \ y/z]^T$, where $X = [x \ \ y \ \ z]^T$ are the homogeneous coordinates of the image point $x$.
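To make the projection model concrete, the following Python sketch implements Equation (2) under stated assumptions: the intrinsic parameters mirror the synthetic camera described later (focal length 1910 px, 1024 × 1024 image), while the example pose and landmark values are arbitrary placeholders.

```python
# Minimal sketch of the projection model in Equation (2).
import numpy as np

def project(K: np.ndarray, T_oc: np.ndarray, V_o: np.ndarray) -> np.ndarray:
    """Project a homogeneous 3D landmark V_o = [vx, vy, vz, 1] into the image.

    K    : 3x3 camera intrinsic matrix
    T_oc : 4x4 rigid transform from the target frame to the camera frame
    V_o  : (4,) homogeneous landmark in the target frame
    returns (2,) pixel coordinates [u, v]
    """
    X_c = (T_oc @ V_o)[:3]      # [T V]_{3x1}: point expressed in the camera frame
    x_h = K @ X_c               # homogeneous image coordinates [x, y, z]
    return x_h[:2] / x_h[2]     # pi(X) = [x/z, y/z]

# Example with an assumed 1910 px focal length and 1024x1024 image
K = np.array([[1910.0, 0.0, 512.0],
              [0.0, 1910.0, 512.0],
              [0.0, 0.0, 1.0]])
T_oc = np.eye(4); T_oc[2, 3] = 1000.0    # placeholder pose: target 1 km ahead
print(project(K, T_oc, np.array([2.0, 1.0, 0.0, 1.0])))
```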
Monocular pose estimation extracts 2D features in the image and combines them with the target's accurate 3D information to establish 2D–3D correspondences for solving the 6D pose. Hence, an inaccurate 3D model makes it difficult for existing methods to solve the pose. To be specific, there is a difference between the obtained target's 3D model $X^O$ and the accurate target's 3D model $\bar{X}^O$. The 3D model error is denoted as $\Delta X^O$, so $X^O = \bar{X}^O + \Delta X^O$. Traditional methods cannot obtain an accurate pose utilizing $X^O$. Therefore, this paper addresses monocular pose estimation in the absence of an accurate 3D model, and the 3D model and pose are solved through joint optimization.

3.2. Joint Optimization of the 3D Model and Pose for Monocular Pose Estimation

This paper proposes to represent the target's 3D model using a set of sparse 3D landmarks. The initial pose is obtained by detecting the 2D landmarks and solving the PnP problem, and the pose and 3D model are then jointly optimized via bundle adjustment. As shown in Figure 2, the proposed monocular pose estimation method mainly includes sparse landmark-based 3D model representation, initial pose estimation, joint optimization of the 3D model and pose, and a sliding window combined with a keyframe extraction strategy, which are described in detail next.

3.2.1. Sparse Landmark-Based 3D Model Representation

In this paper, the pose measurement in the visual guidance of UAV autonomous landing is used as an example to illustrate the proposed method, and the ship shown in Figure 3 is used as the target platform. In Figure 3b, the 3D model is obtained by accurate 3D scanning of a simplified physical model of the ship. As shown in Figure 3b, to make the proposed joint optimization method efficient and reliable, a set of sparse 3D landmarks capable of representing the main structure of the ship is employed to represent the target's 3D model. Different from the 2D feature points used in traditional SLAM, SfM, and related fields, this paper uses semantic-level landmarks to represent the target's 3D model.
Furthermore, this paper uses $\tilde{X}^O$ to represent the set of sparse landmarks used, as shown in Equation (3):

$$\left\{ \tilde{X}_j^O \right\}_{j=1,2,\dots,N} = \left\{ \left[ x_j^O \ \ y_j^O \ \ z_j^O \ \ 1 \right]^T \right\}_{j=1,2,\dots,N} \tag{3}$$

where $N = 13$ is the number of 3D landmarks used. To simulate the dual-motion-platform rendezvous in the actual task, we scaled the 3D landmark coordinates of the ship to the actual size.
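A minimal sketch of how such a landmark set can be stored is given below; the coordinate values and the scale factor are placeholders rather than the paper's actual landmark data.

```python
# Sketch of the sparse-landmark model representation of Equation (3):
# N = 13 homogeneous 3D landmarks stored as an (N, 4) array.
import numpy as np

N = 13
landmarks_model = np.random.uniform(-1.0, 1.0, size=(N, 3))  # placeholder x_j, y_j, z_j
scale = 50.0                                                 # hypothetical model-to-actual scale
X_o = np.hstack([landmarks_model * scale, np.ones((N, 1))])  # homogeneous coordinates, shape (13, 4)
```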

3.2.2. Initial Pose Estimation Based on the Set of Sparse Landmarks

As shown in Figure 4, the initial pose estimation part provides the initial pose for the subsequent optimization. Specifically, this paper performs target detection on the initial frame. Subsequently, the 2D landmarks are detected within the target region, and the initial pose is obtained by solving the PnP problem. For efficiency, a 2D target tracking algorithm is adopted in subsequent frames. In addition, to improve the robustness of the initial pose estimation, we design a consistency criterion to determine the success or failure of the 2D tracking and pose estimation.
A. 2D landmark detection
YOLOX employs a lightweight network architecture, which gives it a notable advantage in computational speed. Compared to other object detection algorithms, YOLOX achieves faster detection while maintaining accuracy, making it suitable for application scenarios with high real-time requirements. Therefore, as shown in Figure 4, this paper utilizes the YOLOX algorithm [46] to detect the target position; YOLOX outputs the target box region in the image, denoted as B. However, running YOLOX on every frame is relatively time-consuming. Hence, to improve efficiency, after obtaining the target area in the initial frame, the STAPLE algorithm [47] is utilized to track the target in the subsequent input images. To keep the notation uniform, the target box region output by the STAPLE algorithm is also denoted as B. Then, based on the sparse landmark set representation proposed in Section 3.2.1, the RTMPose algorithm [48], which shows superior performance in human pose estimation tasks, is utilized to realize efficient and accurate detection of the 2D landmarks in the input image. RTMPose outputs the positions of the 2D landmarks through heat maps, which is concise, efficient, and easy to deploy. In fact, the approach presented in this paper can incorporate a variety of efficient and precise landmark detection methods, rather than being confined to RTMPose. Finally, this paper utilizes the EPnP [12] + RANSAC [49] method to solve the initial pose.
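A minimal sketch of this initialization step is given below, assuming the 2D landmarks have already been detected (a placeholder array in practice); it uses OpenCV's EPnP solver inside a RANSAC loop, with the reprojection threshold and iteration count chosen arbitrarily for illustration.

```python
# Sketch of the initial pose step: EPnP + RANSAC via OpenCV.
import cv2
import numpy as np

def initial_pose(landmarks_3d, landmarks_2d, K):
    """Solve the PnP problem with EPnP inside a RANSAC loop.

    landmarks_3d : (N, 3) model landmarks in the target frame
    landmarks_2d : (N, 2) detected 2D landmarks in pixels
    K            : 3x3 intrinsic matrix
    returns a 4x4 pose T_O^C and the RANSAC inlier indices
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        landmarks_3d.astype(np.float32), landmarks_2d.astype(np.float32),
        K, distCoeffs=None, flags=cv2.SOLVEPNP_EPNP,
        reprojectionError=3.0, iterationsCount=100)   # illustrative RANSAC settings
    if not ok:
        return None, None
    R, _ = cv2.Rodrigues(rvec)                        # rotation vector -> rotation matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T, inliers
```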
For training, this paper first manually annotates the coordinates of the sparse landmarks on single images as the ground truth. Subsequently, the annotated data are divided into a training set and a validation set, which are fed into the network for training, and a 2D landmark detection model is eventually obtained. It is important to note that RTMPose is an instance-level landmark detection algorithm and therefore requires separate training for each individual object to obtain the corresponding landmark detection model, which limits the generalization ability of the proposed method. Exploring category-level landmark detection algorithms is therefore a next step for this work.
B. Consistency criteria
After solving the pose, this paper further calculates the intersection ratio $I$ between $B$ and the bounding box $B_r$ of the reprojected 3D landmarks of the target. Based on $I$, the proposed method determines whether the solved initial pose is reliable, as shown in Equation (4):

$$I = \frac{B \cap B_r}{B_r} \tag{4}$$

When $I < Threshold$, i.e., when $B_r$ is not sufficiently contained in $B$, the solved pose is considered unreliable. The current frame is then processed as an initial frame and the pose is re-solved. Otherwise, the subsequent joint optimization operations are performed.
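The following sketch illustrates this consistency check; the (x_min, y_min, x_max, y_max) box format and the threshold value are illustrative assumptions, not the paper's settings.

```python
# Sketch of the consistency criterion in Equation (4): the fraction of the
# reprojection bounding box B_r covered by the tracked box B.
import numpy as np

def consistency_ratio(B, B_r):
    """I = area(B intersect B_r) / area(B_r); boxes are (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(B[0], B_r[0]), max(B[1], B_r[1])
    ix2, iy2 = min(B[2], B_r[2]), min(B[3], B_r[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_r = (B_r[2] - B_r[0]) * (B_r[3] - B_r[1])
    return inter / area_r if area_r > 0 else 0.0

def pose_is_reliable(B, projected_landmarks, threshold=0.8):
    """B_r is the bounding box of the reprojected 3D landmarks (placeholder threshold)."""
    B_r = (projected_landmarks[:, 0].min(), projected_landmarks[:, 1].min(),
           projected_landmarks[:, 0].max(), projected_landmarks[:, 1].max())
    return consistency_ratio(B, B_r) >= threshold
```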

3.2.3. Joint Optimization of the 3D Model and 6D Pose

Based on the initial pose estimation results obtained above, in this section, the correction value $\Delta X^O$ of the target's 3D model $X^O$ and the initial pose estimates $\{T_O^{C_i}\}_{i=1:M}$ of $M$ consecutive frames are taken as the parameters to be optimized. The optimization objective function is established by minimizing the reprojection error, as shown in Equation (5):
$$\begin{aligned} \Delta X^{O*}, T_O^{C*} &= \arg\min_{\Delta X^O,\, T_O^C} \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{N} \left\| x_{ij} - \pi\!\left( K \left[ T_O^{C_i} \left( \tilde{X}_j^O + \Delta \tilde{X}_j^O \right) \right]_{3\times 1} \right) \right\|_2^2 \\ X^{O*} &= X^O + \Delta X^{O*} \end{aligned} \tag{5}$$

where $\|\cdot\|_2^2$ represents the squared vector norm. $\Delta X^{O*}$ is the optimized correction value of the 3D model. $X^{O*} = \{X_j^{O*}\}_{j=1,2,\dots,N}$ and $T_O^{C*} = \{T_O^{C_i *}\}_{i=1,2,\dots,M}$ are the optimized 3D model and poses, respectively. Furthermore, $x_{ij}$ is the observed value of the $j$-th 3D landmark in the $i$-th frame, $\tilde{X}_j^O$ is the original homogeneous coordinate of the $j$-th 3D landmark, and $\Delta \tilde{X}_j^O$ is the homogeneous coordinate of its correction value to be optimized. $T_O^{C_i}$ is the pose to be optimized for the $i$-th frame, which can be modeled with a twist $\xi \in \mathbb{R}^6$, as follows:
$$T_O^{C_i} = \exp\!\left( \hat{\xi} \right) \tag{6}$$

where

$$\xi = \begin{bmatrix} \phi \\ \rho \end{bmatrix} = \left[ \phi_1 \ \ \phi_2 \ \ \phi_3 \ \ \rho_1 \ \ \rho_2 \ \ \rho_3 \right]^T \in \mathfrak{se}(3), \qquad \hat{\xi} = \begin{bmatrix} \hat{\phi} & \rho \\ \mathbf{0}^T & 0 \end{bmatrix} \in \mathbb{R}^{4\times 4}, \quad \phi \in \mathfrak{so}(3), \ \rho \in \mathbb{R}^3 \tag{7}$$

where $\mathfrak{se}(3)$ is the Lie algebra corresponding to $SE(3)$, and $\mathfrak{so}(3)$ is the Lie algebra corresponding to $SO(3)$ [45]. The rotation and translation components of $\xi \in \mathfrak{se}(3)$ are $\phi \in \mathfrak{so}(3)$ and $\rho \in \mathbb{R}^3$, respectively [45]. $\hat{\phi}$ denotes the skew-symmetric matrix defined by $\phi$.
In this paper, we focus on monocular pose estimation in the absence of an accurate 3D model. During the optimization process, the magnitude of the model correction is constrained, mainly because the gap between the 3D model and the real object is relatively small in practice. Constraining the optimization range of the correction value therefore limits the search space, prevents the algorithm from becoming stuck in a local optimum, and keeps the optimized 3D model closer to the actual situation. When the deviation between the 3D model and the real object is large, it is difficult to reduce the model error by joint optimization, and this paper does not address the optimization of 3D models with excessive errors. Hence, a correction value threshold $\sigma_1$ for the target's 3D model is set: after optimization, when $\Delta X^{O*}$ exceeds $\sigma_1$, the optimization result of the 3D model is discarded, as shown in Equation (8).
$$\Delta X^{O*} = \begin{cases} \Delta X^{O*}, & \text{if } \left\| \Delta X^{O*} \right\|_2 \le \sigma_1 \\ 0, & \text{else} \end{cases} \tag{8}$$
In addition, the poses obtained by the joint optimization can be output directly as the final measurement result. However, the poses output directly from the optimization may not be sufficiently accurate. Therefore, this paper solves the PnP problem again using the optimized 3D model to obtain the final pose estimation result $\hat{T}_O^{C_{cur}}$ for the current frame.
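To illustrate how Equations (5) and (8) can be solved in practice, the sketch below implements the joint refinement as a small bundle adjustment with scipy.optimize.least_squares. This is not the paper's C++ implementation; the axis-angle pose parameterization, the solver choice, and the value of sigma_1 are assumptions made for the example.

```python
# Minimal bundle-adjustment sketch of the joint optimization in Equation (5),
# with the correction threshold of Equation (8) applied after the solve.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation


def reproject(K, pose6, X):
    """Project (N, 3) points X with a 6-vector pose [rotvec | t]."""
    R = Rotation.from_rotvec(pose6[:3]).as_matrix()
    Xc = X @ R.T + pose6[3:]
    xh = Xc @ K.T
    return xh[:, :2] / xh[:, 2:3]


def joint_optimize(K, X_o, obs_2d, init_poses, sigma_1=5.0):
    """Jointly refine per-frame poses and the 3D landmark corrections.

    X_o        : (N, 3) initial (inaccurate) 3D landmarks
    obs_2d     : (M, N, 2) observed 2D landmarks for M frames
    init_poses : (M, 6) initial poses [rotvec | t] from EPnP
    sigma_1    : illustrative correction threshold (Equation (8))
    """
    M, N, _ = obs_2d.shape

    def residuals(params):
        poses = params[:6 * M].reshape(M, 6)
        dX = params[6 * M:].reshape(N, 3)
        res = [obs_2d[i] - reproject(K, poses[i], X_o + dX) for i in range(M)]
        return np.concatenate(res).ravel()

    x0 = np.concatenate([init_poses.ravel(), np.zeros(3 * N)])
    sol = least_squares(residuals, x0, method="trf")

    poses_opt = sol.x[:6 * M].reshape(M, 6)
    dX_opt = sol.x[6 * M:].reshape(N, 3)
    if np.linalg.norm(dX_opt) > sigma_1:   # Equation (8): discard large corrections
        dX_opt = np.zeros_like(dX_opt)
    return poses_opt, X_o + dX_opt
```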

3.2.4. Sliding Window Combined with Keyframe Extraction Strategy

In Section 3.2.3, based on Equations (5) and (8), the 6D pose of each input image and the 3D model are continuously and jointly optimized to obtain a high-precision pose and reduce the 3D model error. However, monocular visual guidance requires real-time, online, and continuous pose measurement, and the number of consecutive images in Section 3.2.3 grows over time, so the number of residual terms and pose parameters in the optimization objective function increases continuously. Therefore, this paper adopts a sliding window strategy instead of frame-by-frame optimization to restrict the optimization scale: only the poses of the images within a time window of fixed length $L$ are optimized.
The adoption of the sliding window strategy effectively reduces the computational scale of the optimization. However, the pose variation between consecutive frames is small, so the geometry constraints they provide are insufficient. Therefore, to exploit stronger geometry constraints and further improve processing efficiency, this paper additionally adopts a keyframe processing strategy on top of the sliding window. Referring to the keyframe extraction strategy in ORB-SLAM2 [40], this paper selects keyframes from the input image sequence. The poses of the extracted keyframes and the correction value of the target's 3D model are then taken as the parameters to be optimized. If the current frame is a non-keyframe, the PnP problem is solved using the most recently optimized target 3D model to obtain its high-precision pose. In detail, keyframes are selected according to the camera rotation angle, as shown in Figure 5. This paper assumes that the existing keyframe set is $F_{key} = \{F_{key}^0, F_{key}^1, \dots, F_{key}^{k-1}\}$ and that its corresponding pose set is $T_O^{C_{key}} = \{T_O^{C_{key}^0}, T_O^{C_{key}^1}, \dots, T_O^{C_{key}^{k-1}}\}$. The camera origins of the current frame and of the nearest keyframe are each connected with the origin of the target coordinate frame, giving the two vectors $\mathbf{OF}_{key}^{k-1}$ and $\mathbf{OF}_{cur}$ in Figure 5. The angle $\omega$ between the two vectors is then calculated according to Equations (9) and (10). When $\omega$ is greater than the angle threshold $\omega_{\min}$, the current frame is added to the set of keyframes.
$$\mathbf{OF}_{key}^{k-1} = \left( R_O^{C_{key}^{k-1}} \right)^T t_O^{C_{key}^{k-1}}, \qquad \mathbf{OF}_{cur} = \left( R_O^{C_{cur}} \right)^T t_O^{C_{cur}} \tag{9}$$

$$\omega = \arccos \frac{ \mathbf{OF}_{key}^{k-1} \cdot \mathbf{OF}_{cur} }{ \left\| \mathbf{OF}_{key}^{k-1} \right\| \left\| \mathbf{OF}_{cur} \right\| } \tag{10}$$
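A brief sketch of this keyframe criterion follows; the angle threshold value below is a placeholder, since the paper selects ω_min per sequence (Section 4.1.2).

```python
# Sketch of the keyframe criterion in Equations (9) and (10): a frame becomes a
# keyframe when the angle between its viewing ray and that of the latest
# keyframe exceeds omega_min.
import numpy as np

def view_ray(R_oc, t_oc):
    """Vector from the target origin toward the camera, as in Equation (9)."""
    return R_oc.T @ t_oc

def is_keyframe(R_key, t_key, R_cur, t_cur, omega_min_deg=2.0):
    v1, v2 = view_ray(R_key, t_key), view_ray(R_cur, t_cur)
    cos_w = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    omega = np.degrees(np.arccos(np.clip(cos_w, -1.0, 1.0)))   # Equation (10)
    return omega > omega_min_deg
```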
By combining the sliding window with the keyframe extraction strategy, the computational complexity of the algorithm is reduced and its efficiency is improved. Hence, Equation (5) is modified as follows:
$$\begin{aligned} \Delta X^{O*}, T_O^{C_{key}*} &= \arg\min_{\Delta X^O,\, T_O^{C_{key}}} \frac{1}{2} \sum_{i=0}^{L} \sum_{j=1}^{N} \left\| x_{ij} - \pi\!\left( K \left[ T_O^{C_{key}^i} \left( \tilde{X}_j^O + \Delta \tilde{X}_j^O \right) \right]_{3\times 1} \right) \right\|_2^2 \\ X^{O*} &= X^O + \Delta X^{O*} \end{aligned} \tag{11}$$

4. Experimental Results and Analysis

To evaluate the performance of the proposed method, this paper conducts experiments using both synthetic and real image sequences. This section first describes the experimental settings. Subsequently, the robustness of the proposed method is analyzed on simulation data. The performance of the proposed method on the 3D model and 6D pose is then evaluated on the synthetic and real image sequences. Finally, the shortcomings of the proposed method and future improvements are discussed.

4.1. Experiment Settings

4.1.1. Training Details Based on Deep Learning Methods

In the synthetic and real experiments detailed below, the autonomous landing process of the UAV was simulated from a distance of approximately 2.0 km down to 0.2 km. Based on the set camera trajectory, the synthetic images are generated by combining the target's 3D model with a computer graphics engine; this paper rendered about 16,000 synthetic images with a resolution of 1024 × 1024 pixels. In the real experiment, the size of the target ship model is about 113 mm × 120 mm × 440 mm, and this paper simulates the process of a UAV approaching a ship using a handheld camera; in total, about 9000 real images with a resolution of 1280 × 720 pixels were captured. Subsequently, about 13,000 synthetic images and nearly 8000 real images were fed into the network for training, with the remaining images serving as the test set. As for the parameters of the deep learning networks, this paper only modifies the output dimension of RTMPose to the number of selected landmarks.
All experiments were conducted on a laptop equipped with an AMD Ryzen 7 5800H CPU @ 3.2 GHz and an NVIDIA GeForce RTX 3060 GPU. The GPU was utilized for rendering, training, and inference. Our method is implemented in C++14.

4.1.2. Parameter Settings

In the experiment, a uniform distribution allows the error level of the target's 3D model to be controlled precisely. Therefore, to simulate the absence of an accurate 3D model, this paper adds an error drawn from the uniform distribution $U(a, b)$ to each of the three dimensions of every accurate 3D landmark, where $U(a, b)$ is equal to $U(2, 3)$ or $U(-3, -2)$. The detailed parameter settings of the proposed method are shown in Table 1. It should be noted that, in the synthetic experiment, to ensure that sufficient keyframes are selected for optimization while considering efficiency, this paper sets an appropriate keyframe selection threshold $\omega_{\min}$ based on the rendezvous angles of the different image sequences.
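The following sketch shows one way to generate such a perturbed model under the interpretation above (per-axis errors of magnitude 2 to 3 with random sign); the random seed and array shapes are illustrative.

```python
# Sketch of simulating an inaccurate 3D model with uniform per-axis errors.
import numpy as np

rng = np.random.default_rng(0)

def perturb_model(X_o_true):
    """Add per-axis uniform errors to each accurate 3D landmark, shape (N, 3)."""
    magnitude = rng.uniform(2.0, 3.0, size=X_o_true.shape)   # |error| in U(2, 3)
    sign = rng.choice([-1.0, 1.0], size=X_o_true.shape)      # random sign per axis
    return X_o_true + sign * magnitude
```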

4.1.3. Evaluation Metrics

To evaluate the performance of different landmark detection algorithms, this paper uses the Normalized Mean Error (NME) [50] to calculate the 2D landmark detection error, as shown in Equation (12):

$$NME = \frac{ E\!\left( \left\| x_j^p - x_j^g \right\|_2 \right) }{ \sqrt{wh} } \tag{12}$$

where $E(\cdot)$ indicates averaging over all landmarks, $w$ and $h$ indicate the width and height of the object bounding box, respectively, and $x_j^p$ and $x_j^g$ are the $j$-th predicted and ground truth landmarks, respectively.
As in MC-LRF [1], the rotation error $E_R$ and the normalized translation error $E_t$ are used to evaluate the accuracy of the 6D pose estimation, and the average 3D landmark error $E_L$ is used to evaluate the accuracy of the target's 3D model, as shown in the following:

$$E_R = 2\cos^{-1}\!\left( \min\!\left( \mathrm{abs}\!\left( R_g^T R_p \right), 1 \right) \right) \cdot \frac{180}{\pi}, \qquad E_t = \frac{ \left\| t_g - t_p \right\| }{ \left\| t_g \right\| } \times 100\%, \qquad E_L = \frac{1}{N} \sum_{j=1}^{N} \left\| L_g^j - L_p^j \right\|_2 \tag{13}$$

where the true values of the rotation and translation components are denoted by $R_g$ and $t_g$, respectively, $R_p$ and $t_p$ are the predicted rotation and translation components, and $L_g^j$ and $L_p^j$ are the accurate and estimated values of the $j$-th 3D landmark, respectively. The Euclidean distance between the estimated and accurate values of the 3D landmarks is used as the evaluation metric of the model accuracy.
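For reference, the sketch below computes these metrics: $E_t$ and $E_L$ follow Equation (13) directly, while the rotation error is computed here as the geodesic angle between $R_g$ and $R_p$, a common realization of a rotation error (the paper follows the MC-LRF definition). The NME of Equation (12) is included with the $\sqrt{wh}$ normalization assumed above.

```python
# Sketch of the evaluation metrics used in Section 4.1.3.
import numpy as np

def rotation_error_deg(R_g, R_p):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos_angle = (np.trace(R_g.T @ R_p) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def translation_error_pct(t_g, t_p):
    """Normalized translation error E_t of Equation (13), in percent."""
    return np.linalg.norm(t_g - t_p) / np.linalg.norm(t_g) * 100.0

def landmark_error(L_g, L_p):
    """Average Euclidean distance E_L between accurate and estimated landmarks (N, 3)."""
    return np.linalg.norm(L_g - L_p, axis=1).mean()

def nme(x_pred, x_gt, w, h):
    """Normalized Mean Error of Equation (12), normalized by sqrt(w*h) (assumed)."""
    return np.linalg.norm(x_pred - x_gt, axis=1).mean() / np.sqrt(w * h)
```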

4.2. The Robustness Analysis of the Proposed Method

This paper adopts the geometry constraints within the monocular image sequence to handle the absence of an accurate 3D model. The accuracy of the multi-view geometry constraints mainly depends on two critical factors: first, the accuracy of the 2D observations, which usually involves the precise extraction and localization of feature points, line segments, or other geometric elements in the image; and, second, the magnitude of the rendezvous angle between viewpoints in the image sequence, which affects the stability and accuracy of reconstructing the 3D structure from different viewpoints. Thus, the accuracy of 2D landmark detection and the rendezvous angle of the image sequence are the two crucial factors that influence the performance of the proposed method.
This paper takes the autonomous landing of UAVs as an example to verify the proposed method. Hence, as depicted in Figure 6, this paper simulates the UAV flying around the central axis of the stern of the ship at a distance of 1.0 km, and the simulation experiment data are generated synchronously. In Figure 6, $\theta$ represents the set rendezvous angle. Additionally, the projections of the 3D landmarks on the image are computed based on the true poses corresponding to the simulated trajectory.
The optimization objective function presented in this paper jointly optimizes the 3D model and poses by minimizing the reprojection error. Hence, a significant 2D landmark detection error can result in the failure of the proposed method. To investigate the tolerance of the proposed method to landmark detection errors, this paper first selects the simulation data with $\theta = 180°$. Subsequently, 2D landmark detection errors are simulated by adding different levels of Gaussian noise $N(0, \sigma^2)$ to the 2D landmark ground truth, and each noise level is simulated 1000 times. The average optimization outcomes for the pose and 3D model are depicted in Figure 7. In Figure 7 and the subsequent figures, "EPnP" indicates the pose error acquired directly by solving the PnP problem, "Proposed" represents the pose and 3D model errors obtained through the further optimization of the proposed method, and "Original" represents the initial 3D model error.
As depicted in Figure 7, when $\sigma$ is less than 2.0, the proposed method effectively mitigates the pose and 3D model errors by leveraging the multi-view geometry constraints and attains high-precision monocular pose estimation. However, when $\sigma$ exceeds 2.0, the ability of the proposed method to optimize the 3D model attenuates, causing the resulting pose error to approach or even exceed that of EPnP. Thus, within the established simulation scenario, the proposed method requires the 2D noise standard deviation $\sigma$ to be less than 2.0 for effective optimization of the 3D model and poses. In fact, the above experiments indicate that the proposed method requires the reprojection error of the 3D model to be greater than the 2D landmark error in order to effectively optimize the 3D model and pose.
The proposed method jointly optimizes the 3D model and poses by leveraging multi-view geometry constraints. Hence, when the rendezvous angle is small, the geometry constraints provided by the observed data may be insufficient, leading to a poorer optimization effect for the 3D model and pose. To investigate the impact of various rendezvous angles on the optimization effect of the proposed method, this paper first selects the simulation data with $\sigma = 1.0$ and simulates it 1000 times. Subsequently, as depicted in Figure 6, data segments with varying rendezvous angles $\theta$ were selected from the simulation data for optimization. The average optimization outcomes for the pose and 3D model are depicted in Figure 8.
As depicted in Figure 8, when the rendezvous angle varies within the range of 5 to 180 degrees, the proposed method mitigates the errors of the 3D model and the pose to a certain extent in all instances. However, when the rendezvous angle is 5 degrees, the optimization effect on the 3D model and the pose is not conspicuous; the outcome at this point is statistically measurable but lacks practical value. It can be observed from Figure 8 that, within the established simulation scenario, when the rendezvous angle exceeds 30 degrees, the proposed method can largely reduce the errors of the 3D model and pose. In fact, within the optimization approach presented in this paper, a larger rendezvous angle provides more geometry constraint information, which is more conducive to the optimization of the 3D model and poses.
The above experiments show that, when the accuracy of the 2D landmarks is relatively high, that is, when their error is smaller than the reprojection error of the 3D model, the proposed method exhibits excellent performance. Additionally, when the rendezvous angle between the viewpoints in the image sequence is large, the proposed method can further and effectively jointly optimize the 3D model and poses.

4.3. Experiments on Synthetic Image Sequences

In this section, this paper selects synthetic image sequences to conduct an experiment. The synthetic images are produced through the target’s 3D model combined with a computer graphics engine. Specifically, this paper employs the visual simulation tool BlenderProc [51] to load the 3D CAD model of the ship. It sets the image resolution of the virtual camera to 1024 × 1024 pixels, the focal length to 1910 pixels, the acquisition frame rate to 25 fps, and the field of view angles to 30° (horizontal) and 30° (vertical). As depicted in Figure 9, within the synthetic environments, corresponding image data are produced by establishing diverse camera trajectories. The generated synthetic image sequences are elaborated in detail below.
In Figure 9a, $Radius$ represents the flight radius. The experiment in Section 4.2 demonstrates that, in the circular flight scenario with a radius of 1 km, the proposed method can accommodate significant landmark noise, and the detection of 2D landmarks in synthetic images is comparatively precise. Consequently, in this section, the $Radius$ in Figure 9a is set to 2.0 km to investigate the optimization in a more distant circular flight scenario, and the image sequence Orbit-2.0 ($Radius = 2.0$ km) is produced. In Figure 9b, the rendezvous process of the UAV and the ship with different small rendezvous angles is simulated, generating the image sequences Rendezvous-10 ($\theta = 10°$), Rendezvous-20 ($\theta = 20°$), and Rendezvous-30 ($\theta = 30°$).
This paper selects the Orbit-2.0, Rendezvous-10, Rendezvous-20, and Rendezvous-30 image sequences to conduct the experiment. Existing approaches typically adopt sparse or dense 2D–3D correspondences and directly solve PnP problems to obtain pose estimates. Consequently, this paper separately employs the current state-of-the-art RTMPose and PP-TinyPose [52] approaches to detect 2D landmarks and contrasts the proposed method with the RTMPose-EPnP and PP-TinyPose-EPnP methods, both of which solve the PnP problem on the basis of landmark detection to obtain pose results.
Example reprojection results on the synthetic sequences using the poses of PP-TinyPose-EPnP, RTMPose-EPnP, and the proposed method are presented in Figure 10, where the locally zoomed-in reprojection results are displayed in the red box at the upper left corner. The yellow mask represents the ground truth, the green mask the reprojection using the pose solved by PP-TinyPose-EPnP, the blue mask the reprojection using the pose solved by RTMPose-EPnP, and the purple mask the proposed method.
As depicted in Figure 10, owing to the inaccurate 3D model of the target, the PP-TinyPose-EPnP and RTMPose-EPnP methods fail to acquire precise poses; as shown in the reprojection results in the second and third rows, their pose reprojections do not fully align with the true target area. The proposed method takes advantage of the multi-view geometry constraints to jointly optimize the target's 3D model and pose parameters. The experimental results in the fourth row of Figure 10 show that, with the proposed method, the reprojected target region conforms closely to the ground truth. These outcomes reveal that the proposed method can accomplish high-precision monocular pose estimation in the absence of an accurate target 3D model.
To quantitatively assess the performance of the various methods in predicting the pose based on Equation (13), this paper selects one complete pose prediction sample from each of the four synthetic image sequences. These samples are depicted in the first two rows of Figure 11. Additionally, the proposed method is configured to output the pose optimization results of the current frame only when the number of optimized keyframes reaches the length of the sliding window; thus, Figure 11 presents only the optimized portion of each image sequence. The third row of Figure 11 exhibits the variation in the average error of the target's 3D model throughout the entire optimization for the four image sequence samples. Meanwhile, Table 2 presents the average 2D landmark detection results of PP-TinyPose and RTMPose on the various synthetic image sequences.
As depicted in the first two rows of Figure 11, within the four image sequence samples, as a consequence of errors in the target's 3D model and the detected 2D landmarks, the PP-TinyPose-EPnP and RTMPose-EPnP methods fail to obtain precise poses. In particular, as shown in Table 2, the 2D landmark error of PP-TinyPose is more substantial, so the pose error obtained by PP-TinyPose-EPnP is conspicuously larger than those of RTMPose-EPnP and the proposed method. In contrast, RTMPose detects 2D landmarks with relatively high accuracy, and its detection error is less than the reprojection error of the 3D model. Hence, the proposed method utilizes the multi-view geometry constraints to jointly optimize the target's 3D model and pose, obtaining high-precision poses. As depicted in the third row of Figure 11, throughout the entire joint optimization procedure, the error of the target's 3D model generally shows a downward trend; the pose error obtained by solving the PnP problem based on the optimized 3D model is reduced accordingly.
In Orbit-2.0, the long imaging distance will magnify the influence of 2D errors on the optimization. Consequently, the successful pose optimization outcomes on Orbit-2.0 suggest that the approach presented in this paper is moderately robust against 2D landmark detection errors at extended distances. As analyzed in Section 4.2, the rendezvous angles of Rendezvous-10, Rendezvous-20, and Rendezvous-30 are relatively small and unable to provide sufficient information for optimization. However, the continuous proximity between platforms serves to mitigate the influence of 2D detection errors on the proposed optimization algorithm. Hence, the proposed method still achieves favorable pose optimization results on these three sequence samples.
Figure 11 depicts the pose prediction outcomes of a solitary sample from each of the four synthetic image sequences. Subsequently, Table 3 exhibits the statistical findings of the average pose errors of all samples within the four synthetic image sequences, with the minimum error highlighted in bold.
As demonstrated in Table 3, in comparison with the PP-TinyPose-EPnP and RTMPose-EPnP approaches, the proposed method achieves the pose prediction outcomes that are closest to the ground truth in all four image sequences. Additionally, as can be observed from Table 3, due to the magnified impact of a greater distance on the 2D landmark detection error, the optimization effect of the pose in Orbit-2.0 is marginally subpar.
The proposed method employs multi-view geometry constraint information to jointly optimize the target’s 3D model and pose. Consequently, the optimization effect of the proposed method for the 3D model on a single sample from each of the four image sequences is presented in Figure 12. In this figure, the green dot represents the original 3D model error and the purple five-pointed star represents the optimized 3D model error. As depicted in Figure 12, the proposed method effectively decreases the overall error of the 3D model within four synthetic image sequence samples.
Figure 12 exhibits the influence of 3D model optimization in a solitary sample. Subsequently, Table 4 presents the statistical outcomes of the average 3D model errors of all samples within four synthetic image sequences, where the minimum error is highlighted in bold.
As depicted in Table 4, the proposed method reduces the 3D model error in all four image sequences. Furthermore, a longer imaging distance will increase the impact of 2D landmark detection errors on optimization. Therefore, the optimization effect for the 3D model in the Orbit-2.0 image sequence is relatively poor.
In addition, this paper compares the processing time of different algorithms on a single frame in Table 5.
As shown in Table 5, thanks to its lightweight network design, PP-TinyPose-EPnP's average processing time on the synthetic image sequences is only about 8.15 ms/frame. Due to the added optimization module, the proposed method takes longer than RTMPose-EPnP, at about 32.5 ms/frame. However, for camera frame rates of around 20–25 Hz, such as the 25 fps acquisition rate used here, the proposed method can still meet the requirements of real-time applications.

4.4. Experiments on Real Image Sequences

Similarly to the synthetic experiments, in the real experiments this paper employs a handheld camera to simulate scenarios in which the UAV circles and rendezvouses with the ship. The simulated process of the UAV circling the ship at about 1 km in the real scene is presented in Figure 13a, and its corresponding image sequence is named real orbit. The simulated rendezvous between the UAV and the ship in the real scene is presented in Figure 13b, and the corresponding image sequence is designated real rendezvous. The ground truth poses for the real image sequences are obtained through manual annotation.
This paper selects the real orbit and real rendezvous image sequences to conduct the real experiments, with PP-TinyPose-EPnP and RTMPose-EPnP again employed as comparative approaches. Similarly to Figure 10, example pose reprojection results of PP-TinyPose-EPnP, RTMPose-EPnP, and the proposed method on the real image sequences are presented in Figure 14, where the locally zoomed-in reprojection results are depicted in the red box on the left; the meanings of the different color masks are consistent with those in Figure 10. Moreover, Table 6 presents the average 2D landmark detection results of PP-TinyPose and RTMPose on the real image sequences.
As shown in Table 6, the 2D landmark detection error of PP-TinyPose on the real image sequences is significantly greater than that of RTMPose. Therefore, in Figure 14, the pose reprojection of PP-TinyPose-EPnP even exhibits severe deformation, indicating that the obtained pose results deviate significantly from the ground truth. In contrast, the accuracy of the 2D landmarks detected by RTMPose generally meets the boundary conditions given in Section 4.2, so the poses obtained by RTMPose-EPnP and the proposed method are more accurate. Furthermore, as shown in the first two examples in Figure 14, due to the relatively large 2D landmark detection error in the real image sequence, the subsequent optimization effect on the pose is limited.
Similarly to the synthetic experiments, this paper selects one complete pose prediction sample from each of the two real image sequences; these samples are depicted in the first two rows of Figure 15. In addition, the third row of Figure 15 shows the variation in the average error of the target's 3D model throughout the optimization for the two image sequence samples.
As depicted in the first two rows of Figure 15, the pose errors achieved by the proposed method are consistently smaller than those obtained by the PP-TinyPose-EPnP and RTMPose-EPnP methods throughout the image sequence. This is predominantly because, as shown in the third row of Figure 15, the proposed method continuously diminishes the error of the 3D model during the optimization procedure, thereby reducing the pose error. However, owing to the significant 2D detection errors in the real images, the optimization effect on the 3D model in Figure 15 is not as good as in the synthetic experiment. As a result, the pose error curves of RTMPose-EPnP and the proposed method in Figure 15 are not clearly separated, and, in a few images, the pose error achieved by RTMPose-EPnP is smaller than that of the proposed method.
Figure 15 depicts the pose prediction outcomes of a single sample from each of the two real image sequences. Table 7 then presents the statistics of the average pose errors over all samples in the two real image sequences, with the minimum error highlighted in bold.
As demonstrated in Table 7, in comparison with the PP-TinyPose-EPnP and RTMPose-EPnP approaches, the proposed method achieves the pose predictions closest to the ground truth in both image sequences.
The proposed method employs multi-view geometry constraint information to jointly optimize the target’s 3D model and pose. Consequently, the optimization effect of the proposed method for the 3D model on a single sample from each of the two image sequences is presented in Figure 16. In this figure, the green dot represents the original 3D model error and the purple five-pointed star represents the optimized 3D model error.
As depicted in Figure 16, the proposed method effectively decreases the overall error of the 3D model for both real image sequence samples. However, in contrast to the synthetic experiments, the errors of a few 3D landmarks in Figure 16 further increase after optimization. This is mainly due to the large 2D landmark detection error on the real image sequences, which is not conducive to the subsequent optimization based on minimizing the reprojection error.
Figure 16 shows the effect of 3D model optimization on a single sample. Table 8 then presents the statistics of the average 3D model errors over all samples in the two real image sequences, where the minimum error is highlighted in bold.
As depicted in Table 8, the proposed method reduces the 3D model error in both image sequences. Furthermore, in line with the above analysis, on account of the significant 2D detection error, the optimization effect on the 3D model in the real image sequences is limited: in real orbit, the model error is reduced by only about 4%, while in real rendezvous it is reduced by approximately 10%. Therefore, high-precision 2D landmark detection is of crucial importance to the proposed method.
In addition, this paper compares the processing time of different algorithms on a single frame in Table 9.
Similarly to the statistics in synthetic experiments, in real experiments, PP-TinyPose-EPnP consumes the least amount of time. RTMPose-EPnP takes slightly longer, approximately 29.15 ms. The proposed method has the longest duration of around 36.15 ms. It can be observed that, compared to synthetic experiments, the optimization module consumes more time in real experiments. This is primarily due to the larger landmark detection errors in real image sequences, which subsequently leads to an increase in the number of iterations required by the optimization module. Additionally, inaccurate landmark detection implies larger errors in the initial pose estimation, requiring the optimization algorithm to take longer to find the correct convergence direction.

4.5. Limitation Analysis

The joint optimization outcomes in synthetic and real image sequences demonstrate that the proposed method can effectively diminish the errors in both the pose and the target’s 3D model. Nevertheless, in some samples, as a result of significant 2D landmark detection errors, the proposed approach fails to effectively optimize the pose and 3D model.
Figure 17 shows a pose optimization failure case from a real sample. Since the pose error acquired by the PP-TinyPose-EPnP approach is excessively large, we only compare RTMPose-EPnP and the proposed method. Further, Figure 18 shows several failed optimization results for the target's 3D model in the real image sequences. In certain images within the real image sequence, the detection errors of the 2D landmarks are significantly larger than the reprojection error of the 3D model. Consequently, as depicted in Figure 17 and Figure 18, it is challenging for the method presented in this paper to optimize the pose and 3D model by minimizing the reprojection error.
The current research primarily focuses on landmark detection under sufficient lighting and clear image conditions; in-depth discussions and experiments under adverse weather conditions have yet to be conducted. Consequently, the proposed method may exhibit lower robustness and accuracy in these challenging environments. This is primarily because the algorithm learns feature representations under favorable conditions during training and therefore lacks sufficient adaptability to the variations in image features under extreme weather conditions.
To enhance the robustness of the landmark detection algorithm under adverse weather conditions, this study will improve the proposed method in two respects in future work. Firstly, simulated adverse weather data (such as images with intense light, rain, snow, or fog generated through image processing techniques) will be introduced during training to increase the model's adaptability to such conditions. Secondly, the integration of other sensor data (such as infrared cameras, LiDAR, etc.) will be explored to assist landmark detection, thereby improving detection accuracy and stability under adverse weather conditions.

5. Conclusions

Aiming at the issue of pose estimation for UAV autonomous landing missions in the absence of an accurate 3D model, this paper proposes a new monocular pose estimation method that jointly optimizes the target's 3D model and pose. The target's 3D model is represented by a set of sparse 3D landmarks, and a trained neural network detects the landmarks in the input image. The correction values of the target's 3D model and the pose are taken as the parameters to be optimized, and the optimization objective function is established by minimizing the reprojection error. By solving the optimization problem, the target's 3D model and poses are optimized jointly, and a high-precision pose is obtained in the absence of an accurate 3D model. Moreover, a sliding window combined with a keyframe extraction strategy is adopted to improve the processing speed of the algorithm. Experimental results on synthetic and real data demonstrate that the proposed method achieves real-time and online high-precision pose estimation when an accurate 3D model is unavailable.
Despite the clear benefit of jointly optimizing the 3D model and pose, the proposed method is still not applicable to large 2D landmark detection errors. In addition, this paper does not address the robustness of landmark detection under adverse weather conditions. In subsequent work, this study plans to improve the adaptability of the model by incorporating data from harsh weather conditions during training and to explore the integration of sensors such as infrared cameras and radars to assist landmark detection. To improve the accuracy of landmark detection, we will investigate the use of multi-view geometry constraint information from sequence images to further reduce 2D landmark detection errors. Additionally, by enhancing the parameterization of the optimization equation, the joint optimization of the target's 3D model and poses under considerable 2D noise will be considered.

Author Contributions

Conceptualization, X.S.; Data Curation, Z.Z.; Formal Analysis, L.C.; Funding Acquisition, X.S.; Investigation, Q.W.; Methodology, L.G.; Project Administration, X.S.; Resources, Q.W.; Software, L.G.; Supervision, X.S.; Validation, L.C.; Visualization, L.G.; Writing—Original Draft, L.G.; Writing—Review and Editing, Q.W., Z.Z. and X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Natural Science Foundation of China (12272404) and the Hunan Science and Technology Innovation Program (2023RC3023).

Data Availability Statement

The data used to support this study have not been made available.

Acknowledgments

The authors sincerely appreciate the diligent collaboration of all members in the laboratory and the valuable suggestions provided by the anonymous reviewers for this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Z.; Wang, Q.F.; Bi, D.M.; Sun, X.L.; Yu, Q.F. MC-LRF based pose measurement system for shipborne aircraft automatic landing. Chin. J. Aeronaut. 2023, 36, 298–312. [Google Scholar] [CrossRef]
  2. Han, H.; Kim, H.; Bang, H. Monocular pose estimation of an uncooperative spacecraft using convexity defect features. Sensors 2022, 22, 8541. [Google Scholar] [CrossRef] [PubMed]
  3. Xun, Z.; Huang, J.; Li, Z.; Ying, Z.J.; Wang, Y.J.; Xu, C.; Gao, F.; Gao, Y.J. CREPES: Cooperative relative pose estimation system. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 5274–5281. [Google Scholar] [CrossRef]
  4. Guo, M.; Chen, Y.; Liang, B.; Wang, Y.B.; Huang, K.; Luo, Z.W.; Jiang, S. Fast recognition and pose estimation algorithm for space cooperative target via mono-vision. J. Phys. Conf. Ser. 2022, 2405, 012021. [Google Scholar] [CrossRef]
  5. Lei, J.; Wang, J.; Shi, J.; Xu, G.; Cheng, Y.H. Configuration optimization method of cooperative target for pose estimation with monocular vision. Opt. Eng. 2024, 63, 023102. [Google Scholar] [CrossRef]
  6. Zhu, Z.; Xiang, W.; Huo, J.; Yang, M.; Zhang, G.Y.; Wei, L. Non-cooperative target pose estimation based on improved iterative closest point algorithm. J. Syst. Eng. Electron. 2022, 33, 1–10. [Google Scholar] [CrossRef]
  7. Deng, L.; Suo, H.; Jia, Y.; Huang, C. Pose estimation method for non-cooperative target based on deep learning. Aerospace 2022, 9, 770. [Google Scholar] [CrossRef]
  8. Liu, K.; Wang, L.; Liu, H.; Zhang, X. A relative pose estimation method of non-cooperative space targets. J. Phys. Conf. Ser. 2022, 2228, 012029. [Google Scholar] [CrossRef]
  9. Sharma, S.; Ventura, J.; D’Amico, S. Robust model-based monocular pose initialization for noncooperative spacecraft rendezvous. J. Spacecr. Rocket. 2018, 55, 1414–1429. [Google Scholar] [CrossRef]
  10. Zhang, X.; Jiang, Z.; Zhang, H.; Wei, Q.M. Vision-based pose estimation for textureless space objects by contour points matching. IEEE Trans. Aerosp. Electron. Syst. 2018, 54, 2342–2355. [Google Scholar] [CrossRef]
  11. Besl, P.J.; McKay, N.D. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 1992, 14, 239–256. [Google Scholar] [CrossRef]
  12. Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef]
  13. Gao, X.S.; Hou, X.R.; Tang, J.; Cheng, H.F. Complete solution classification for the perspective-three-point problem. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 930–943. [Google Scholar] [CrossRef]
  14. Penate-Sanchez, A.; Andrade-Cetto, J.; Moreno-Noguer, F. Exhaustive linearization for robust camera pose and focal length estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2387–2400. [Google Scholar] [CrossRef] [PubMed]
  15. Lu, P.; Jiang, T.; Li, Y.; Li, X.T.; Chen, K.; Yang, W.M. RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 1491–1500. [Google Scholar]
  16. Zhuang, B.; Chandraker, M. Fusing the old with the new: Learning relative camera pose with geometry-guided uncertainty. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 32–42. [Google Scholar] [CrossRef]
  17. Li, Z.; Wang, G.; Ji, X. Cdpn: Coordinates-based disentangled pose network for real-time rgb-based 6-dof object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7678–7687. [Google Scholar] [CrossRef]
  18. Kendall, A.; Grimes, M.; Cipolla, R. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2938–2946. [Google Scholar] [CrossRef]
  19. Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.W.; Bao, H.J. PVNet: Pixel-wise voting network for 6DoF pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4561–4570. [Google Scholar] [CrossRef]
  20. Zakharov, S.; Shugurov, I.; Ilic, S. Dpod: 6D pose object detector and refiner. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1941–1950. [Google Scholar] [CrossRef]
  21. Shugurov, I.; Pavlov, I.; Zakharov, S.; Ilic, S. Multi-view object pose refinement with differentiable renderer. IEEE Robot. Autom. Lett. 2021, 6, 2579–2586. [Google Scholar] [CrossRef]
  22. Wang, H.; Sridhar, S.; Huang, J.; Valentin, J.; Song, S.; Guibas, L.J. Normalized object coordinate space for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2642–2651. [Google Scholar] [CrossRef]
  23. Liu, J.; Chen, Y.; Ye, X.; Qi, X.J. Prior-free category-level pose estimation with implicit space transformation. In Proceedings of the IEEE International Conference on Computer Vision, Paris, France, 2–6 October 2023; p. 2303. [Google Scholar] [CrossRef]
  24. Wan, B.; Shi, Y.; Xu, K. SOCS: Semantically-aware Object Coordinate Space for Category-Level 6D Object Pose Estimation under Large Shape Variations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 14065–14074. [Google Scholar] [CrossRef]
  25. Chen, K.; James, S.; Sui, C.; Liu, Y.H.; Abbeel, P.; Dou, Q. Stereopose: Category-level 6D transparent object pose estimation from stereo images via back-view nocs. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2855–2861. [Google Scholar] [CrossRef]
  26. Pérez-Villar, J.I.; García-Martín, Á.; Bescós, J. Spacecraft pose estimation based on unsupervised domain adaptation and on a 3d-guided loss combination. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 37–52. [Google Scholar] [CrossRef]
  27. Wang, Z.; Zhang, Z.; Sun, X.; Li, Z.; Yu, Q.F. Revisiting monocular satellite pose estimation with transformer. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 4279–4294. [Google Scholar] [CrossRef]
  28. Wang, Z.; Chen, M.; Guo, Y.; Li, Z.; Yu, Q.F. Bridging the domain gap in satellite pose estimation: A self-training approach based on geometrical constraints. IEEE Trans. Aerosp. Electron. Syst. 2023, 60, 2500–2514. [Google Scholar] [CrossRef]
  29. Park, T.H.; Sharma, S.; D’Amico, S. Towards robust learning-based pose estimation of noncooperative spacecraft. arXiv 2019, arXiv:1909.00392. [Google Scholar]
  30. Wang, Q.; Zhou, J.; Li, Z.; Sun, X.L.; Yu, Q.F. Robust and accurate monocular pose tracking for large pose shift. IEEE Trans. Ind. Electron. 2022, 70, 8163–8173. [Google Scholar] [CrossRef]
  31. Chen, B.; Cao, J.; Parra, A.; Chin, T.J. Satellite pose estimation with deep landmark regression and nonlinear pose refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 2816–2824. [Google Scholar] [CrossRef]
  32. Wang, Q.; Zhou, J.; Li, Z.; Sun, X.L.; Yu, Q.F. Robust monocular object pose tracking for large pose shift using 2D tracking. Vis. Intell. 2023, 1, 22. [Google Scholar] [CrossRef]
  33. Chen, B.; Parra, A.; Cao, J.; Li, N.; Chin, T.J. End-to-end learnable geometric vision by backpropagating pnp optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8100–8109. [Google Scholar] [CrossRef]
  34. Chen, H.; Wang, P.; Wang, F.; Tian, W.; Xiong, L.; Li, H. Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 2781–2790. [Google Scholar] [CrossRef]
  35. Li, Y.; Wang, G.; Ji, X.; Xiang, Y.; Fox, D. Deepim: Deep iterative matching for 6D pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 683–698. [Google Scholar] [CrossRef]
  36. Labbé, Y.; Carpentier, J.; Aubry, M.; Sivic, J. CosyPose: Consistent multi-view multi-object 6D pose estimation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 574–591. [Google Scholar] [CrossRef]
  37. Leutenegger, S.; Lynen, S.; Bosse, M.; Siegwart, R.; Furgale, P. Keyframe-based visual-inertial odometry using nonlinear optimization. Int. J. Robot. Res. 2015, 34, 314–334. [Google Scholar] [CrossRef]
  38. Sibley, G.; Matthies, L.; Sukhatme, G. A sliding window filter for incremental SLAM. In Unifying Perspectives in Computational and Robot Vision; Springer: Boston, MA, USA, 2008; pp. 103–112. [Google Scholar] [CrossRef]
  39. Qin, T.; Li, P.; Shen, S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  40. Mur-Artal, R.; Tardós, J.D. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  41. Moulon, P.; Monasse, P.; Marlet, R. Global fusion of relative motions for robust, accurate and scalable structure from motion. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3248–3255. [Google Scholar] [CrossRef]
  42. Stathopoulou, E.K.; Welponer, M.; Remondino, F. Open-source image-based 3D reconstruction pipelines: Review, comparison and evaluation. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. W17. [CrossRef]
  43. Qiu, K.; Qin, T.; Gao, W.; Shen, S.J. Tracking 3D Motion of Dynamic Objects Using Monocular Visual-Inertial Sensing. IEEE Trans. Robot. 2019, 35, 799–816. [Google Scholar] [CrossRef]
  44. Chen, J.; Wang, R.; Wang, R. Vision Positioning method for Autonomous Precise Landing of UAV Based on Square Landing Mark. J. Phys. Conf. Ser. 2020, 1651, 012182. [Google Scholar] [CrossRef]
  45. Murray, R.M.; Li, Z.; Sastry, S.S. A Mathematical Introduction to Robotic Manipulation, 1st ed.; CRC Press: Boca Raton, FL, USA, 2017. [Google Scholar] [CrossRef]
  46. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  47. Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P.H.S. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1401–1409. [Google Scholar] [CrossRef]
  48. Jiang, T.; Lu, P.; Zhang, L.; Ma, N.S.; Han, R.; Lyu, C.Q.; Li, Y.; Chen, K. Rtmpose: Real-time multi-person pose estimation based on mmpose. arXiv 2023, arXiv:2303.07399. [Google Scholar]
  49. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Comm. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  50. Dong, X.Y.; Yu, S.I.; Weng, X.S.; Wei, S.E.; Yang, Y.; Sheikh, Y. Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 360–368. [Google Scholar] [CrossRef]
  51. Denninger, M.; Sundermeyer, M.; Winkelbauer, D.; Olefir, D.; Hodan, D.; Zidan, T.; Elbadrawy, Y.; Knauer, M.; Katam, M.; Lodhi, H. Blenderproc: Reducing the reality gap with photorealistic rendering. In Proceedings of the International Conference on Robotics: Science and Systems, RSS 2020, Virtually, 12–16 July 2020. [Google Scholar]
  52. PaddlePaddle Authors. PaddleDetection: Object Detection and Instance Segmentation Toolkit Based on PaddlePaddle, 2019. Available online: https://github.com/PaddlePaddle/PaddleDetection (accessed on 5 January 2024).
Figure 1. Illustration of the object imaging and monocular pose estimation.
Figure 2. Overall architecture of the proposed method. (For brevity, only some of the landmarks are shown in the figure).
Figure 3. Ship target used in this paper. (a) Ship target; (b) 3D model and landmark annotation. The red dots represent the landmarks used in this paper.
Figure 4. Flow chart of the initial pose estimation. The initial pose estimation stage mainly consists of 2D landmark detection and consistency criteria for discrimination.
Figure 5. Keyframe selection from the image sequence. (The connection between the origin of the camera and the origin of the target coordinate frame is represented by a red line).
Figure 6. Visual illustration of the simulated trajectory.
Figure 7. The average optimization results of the pose and 3D model with varying 2D landmark errors.
Figure 8. The average optimization results of the pose and 3D model with varying rendezvous angles.
Figure 9. Synthetic camera trajectory visualization. (a) Orbit trajectory; (b) Rendezvous trajectory.
Figure 10. Examples of pose estimation results on synthetic image sequences.
Figure 11. Error curve samples of the pose and 3D model from synthetic image sequences.
Figure 12. The errors of the target's 3D landmarks from four synthetic samples.
Figure 13. Real camera trajectory visualization. (a) Orbit trajectory; (b) Rendezvous trajectory.
Figure 14. Examples of pose estimation results on real image sequences.
Figure 15. Error curve samples of the pose and 3D model from real image sequences. The frame number corresponding to the red dotted line indicates that the number of optimized keyframes has reached the required sliding window length.
Figure 16. The errors of the target's 3D landmarks from two real samples.
Figure 17. Example of failed pose optimization from a real image sequence.
Figure 18. Examples of the target's 3D model optimization failures from real image sequences.
Table 1. Parameter settings of the proposed method on synthetic and real experiments.

Parameter                                      Setting
Consistency criterion threshold                0.8
Correction value threshold σ_1                 5
Angle threshold for keyframe selecting ω_min   Set the appropriate value according to the rendezvous angle of the image sequence
Sliding window length L                        20
Table 2. NME results of 2D landmark detection based on RTMPose and PP-TinyPose in the synthetic experiment. The minimum error is shown in bold.

Method        Evaluation Metric   Orbit-2.0   Rendezvous-10   Rendezvous-20   Rendezvous-30
PP-TinyPose   NME × 100 (%)       3.90        7.89            7.94            7.21
RTMPose       NME × 100 (%)       0.48        0.32            0.33            0.35
Table 3. Results of pose estimation from four synthetic image sequences. The minimum error is shown in bold.

                  E_R (°)                                        E_T (%)
Image Sequence    PP-TinyPose-EPnP   RTMPose-EPnP   Proposed     PP-TinyPose-EPnP   RTMPose-EPnP   Proposed
Orbit-2.0         3.34               1.41           1.23         2.71               1.24           1.12
Rendezvous-10     1.73               1.39           0.38         4.11               2.31           0.69
Rendezvous-20     1.27               1.39           0.48         2.85               2.28           0.69
Rendezvous-30     1.73               1.40           0.47         3.78               2.27           0.71
Table 4. The average optimization results of the 3D model from four synthetic image sequences. The minimum error is shown in bold.

                  E_L (m)
Image Sequence    Original   Proposed
Orbit-2.0         4.35       3.16
Rendezvous-10     4.35       2.30
Rendezvous-20     4.35       2.15
Rendezvous-30     4.35       2.39
Table 5. The comparison of the average processing time of different methods on synthetic image sequences.

Method             Time (ms/Frame)
PP-TinyPose-EPnP   8.15
RTMPose-EPnP       27.15
Proposed           32.15
Table 6. NME results of 2D landmark detection based on RTMPose and PP-TinyPose in the real experiment. The minimum error is shown in bold.

Method        Evaluation Metric   Real Orbit   Real Rendezvous
PP-TinyPose   NME × 100 (%)       5.28         4.91
RTMPose       NME × 100 (%)       0.29         0.29
Table 7. Results of pose estimation from two real image sequences. The minimum error is shown in bold.

                  E_R (°)                                        E_T (%)
Image Sequence    PP-TinyPose-EPnP   RTMPose-EPnP   Proposed     PP-TinyPose-EPnP   RTMPose-EPnP   Proposed
Real orbit        10.82              2.24           1.45         12.78              2.19           1.34
Real rendezvous   6.69               2.30           1.23         10.44              2.69           1.33
Table 8. The average optimization results of the 3D model from two real image sequences. The minimum error is shown in bold.

                  E_L (m)
Image Sequence    Original   Proposed
Real orbit        4.35       4.18
Real rendezvous   4.35       3.88
Table 9. The comparison of the average processing time of different methods on real image sequences.

Method             Time (ms/Frame)
PP-TinyPose-EPnP   9.15
RTMPose-EPnP       29.15
Proposed           36.15