Article

A Learnable Viewpoint Evolution Method for Accurate Pose Estimation of Complex Assembled Product

School of Mechanical Engineering and Automation, Beihang University, 37 College Road, Haidian District, Beijing 100191, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(11), 4405; https://doi.org/10.3390/app14114405
Submission received: 29 April 2024 / Revised: 17 May 2024 / Accepted: 21 May 2024 / Published: 22 May 2024


Featured Application

This study can be applied to camera pose estimation and appearance quality inspection of large-sized complex assembled products that require high reliability, adaptability, and accuracy simultaneously.

Abstract

Balancing adaptability, reliability, and accuracy in vision technology has always been a major bottleneck limiting its application in appearance assurance for complex objects in high-end equipment production. Data-driven deep learning is robust to feature diversity but limited in interpretability and accuracy. The traditional vision scheme is reliable and can achieve high accuracy, but its adaptability is insufficient. The deeper reason is the lack of an appropriate architecture and integration strategy between the learning paradigm and empirical design. To this end, a learnable viewpoint evolution algorithm for high-accuracy pose estimation of complex assembled products under free view is proposed. To alleviate the balance problem of exploration and optimization in estimation, shape-constrained virtual–real matching, an evolvable feasible region, and specialized population migration and reproduction strategies are designed. Furthermore, a learnable evolution control mechanism is proposed, which integrates a guided model based on experience and is cyclically trained with automatically generated effective trajectories to improve the evolution process. Compared to the (1.69°, 55.67 mm) error of the state-of-the-art data-driven method and the (1.28°, 77.67 mm) error of the classic strategy combination, the pose estimation error of the complex assembled product in this study is (0.23°, 23.71 mm), which proves the effectiveness of the proposed method. Meanwhile, through in-depth exploration, the robustness, parameter sensitivity, and adaptability to virtual–real appearance variations are sequentially verified.

1. Introduction

In the high-end equipment manufacturing industry, accurate multi-view pose estimation of complex products is an essential step for subsequent appearance quality inspection. Taking an aerospace product as an example, a piece of equipment may consist of multiple sub-modules, including dozens or even hundreds of intricate parts (complex assembled products). To ensure integrity and reliability, the product needs to be repeatedly compared with 3D models and process documents from different viewpoints, covering all visible elements, to avoid defects during the assembly stage. Owing to the structural complexity and strict requirements, this work is still completed manually in most cases. In this context, a method that obtains the pose of the current viewpoint would greatly improve the efficiency of the process. Based on the estimated pose, the 3D model can be adjusted to the same perspective as the real object so as to extract the corresponding region of interest (ROI) and filter irrelevant context [1,2]. In such a way, the reference provided by a virtual image can greatly simplify inspection and evaluation (whether for the operator or an algorithm) while avoiding the difficulty of building an end-to-end multi-target detector. The inherent logic of pose estimation is to construct virtual–real feature connections in a scenario with diverse noise and to continuously adjust incorrect correspondences through correct ones (e.g., feature points). For vision technology that relies on sensor perception and algorithmic judgment, this task, seemingly simple for humans, is quite difficult, and the challenges can be summarized as follows:
(1) Industry characteristics: In the realm of high-end equipment manufacturing, the interpretability and dependability of methods are important considerations. Although it is possible to train a specific neural network for pose estimation through a large amount of data, blindly adhering to the big data-driven learning paradigm in the pursuit of adaptability may render it challenging to address failures, abnormalities, and re-improvement, ultimately impeding promotion [3].
(2) Task features: These assembled products with high structural complexity may involve dozens or even hundreds of components, typically assembled through human or human–machine collaboration. Under such operation scenarios, it is difficult to collect enough high-quality samples at once or to accurately pre-plan the imaging pose. The task features of small samples, multiple components, and a coarse initial pose greatly increase the difficulty of pose estimation/adjustment in common vision-based applications, e.g., AR/VR-based projection [4] and mobile robots based on viewpoint pre-planning [5,6].
(3) Product appearance: As shown in Figure 1, the appearance features of the target and background captured from different viewpoints are quite different, and the multi-layer stacking between objects will further enhance the difference. Due to perspective effect and distortion, whether the extremely small-sized target located at the far end can be successfully recognized is sensitive to the accuracy of pose estimation. It is common that there are differences in features and visual styles between the process model and physical object. Additionally, the surface treatment of machined products can result in sparse texture and high color consistency that are not conducive to feature extraction.
Technically, the camera pose estimation addressed in this study is absolute pose estimation (APE). Research in this direction mainly falls into five categories [7,8,9]:
(1) APE based on cooperative target can be further subdivided into manual markers [10,11], auxiliary projection [12], laser tracker [13], and structured light measurement [14]. In these methods, easily identifiable features are supplemented to the target manually or by automated devices to alleviate the difficulty of feature extraction and analysis during APE. This strategy is commonly used in high-precision tasks, such as calibration, and is also suitable for large structural parts with smooth surfaces and simple structures. Furthermore, with the support of a movable robot vision platform [15], APE can be transformed into matching and stitching of a 3D point cloud.
(2) APE based on geometric features includes two steps: feature extraction and relationship construction between 2D pixels and 3D coordinates (2D–3D), followed by optimization-based APE estimation. Manual features [16,17,18] and learning-based approaches [19] are usually adopted in the first step, in which two key issues need to be considered, namely, the adaptability of 2D features and the robustness of the 2D–3D relationship evaluation. The representative optimization method for APE estimation is EPnP [20], along with other related studies [21,22,23,24]. In the estimation stage, it is essential to reduce the interference of outliers on the 2D–3D relationship and avoid getting trapped in a wrong locally optimal solution. Compared with other technical routes, strategy (2) can obtain the best APE accuracy, but scene complexity limits its ability to obtain enough correctly related 2D–3D pairs at one time. Moreover, the iterative mode is easily affected by the latest observations, making the final convergent result significantly biased.
(3) APE based on mapping and retrieval is one of the commonly used strategies in the field of simultaneous localization and mapping (SLAM), emphasizing the robustness of scene feature encoding and the ability to describe differences between different viewpoints [25,26]. When the accuracy requirement is not high, the pose can be obtained by retrieving the viewpoint closest to the current feature in the map library; otherwise, the result of strategy (3) will be adopted as the initial value for strategy (2) [27].
(4) APE based on deep learning (DL) relies on a deep convolutional neural network (CNN) to perform pose estimation in an end-to-end manner. CNN-based architectures represented by the classic PoseNet series [28,29] do not require feature engineering and have strong adaptability to feature variation. Nevertheless, there is still a gap in reliability and accuracy between strategy (4) and strategy (2) [30]. Moreover, some state-of-the-art studies indicate that close performance can be achieved on public datasets such as LineMod and LineMod-O [31,32,33], and MegaPose [34] can adapt to unseen objects through rendering and comparison, but they all require an accurate model, few targets, and balanced sizes, which differs significantly from the complex assembled product.
(5) Hybrid and hierarchical strategies include methods based on functional hybrid training of multiple CNN models [35,36] and hierarchical approximation from coarse to fine [37]. Research in strategy (5) is devoted to addressing the adaptability of pose estimation to changeable feature context in open multi-scene. The accuracy achieved depends partly on the architecture and trick design of CNN and partly on the geometric feature-based APE algorithm adopted in the backend.
The literature review above shows that reconciling accuracy and adaptability in pose estimation of complex assembled products under free view hinges on building an interpretable hierarchical architecture that retains the advantages of both empirical design and the learning paradigm. However, the limited interpretability and consistency of DL-based methods restrict their application in manufacturing, while traditional APE lacks local–global collaborative control in its matching and optimization mechanisms and relies on fixed empirical parameters for adaptability, resulting in quite strict requirements on the initial pose for complex assembled products. Therefore, it is necessary to deeply integrate the traditional APE- and DL-based paradigms in terms of execution mechanism, operation rules, and interactive strategy, laying the foundation for consistency of results, local–global feature adaptability, and traceability of exceptions, rather than simply concatenating the two and applying them directly.
Driven by this motivation, this paper proposes a novel, learnable viewpoint evolution algorithm for high-accuracy pose estimation of complex products in high-end equipment manufacturing. Compared with existing APE methods, the contribution can be summarized as follows:
(1) Reconstructing camera pose estimation as the evolution of a parameter population to generate an interpretable fine-grained iterative framework;
(2) Designing feasible region constraints, parent migration, and reproduction strategies for local–global collaborative control to balance exploration and optimization;
(3) Proposing a guided-model-based hierarchical cyclic effective trajectory learning mechanism to reduce the empirical dependence of control and enhance adaptability.

2. Proposed Method

As shown in Figure 2, the method includes two stages, which can be divided into five modules. The initial stage is used to generate the initial pose, including the extraction of significant features, viewpoint sampling and projection, feature matching and viewpoint searching, virtual image rendering, and the initialization of the evolution stage. Significant features refer to geometric features (e.g., points, lines, and curves) that are clearly visible and distinguishable from different viewpoints. Virtual viewpoints are generated following the spherical or ellipsoidal rule [38]. The rough pose is provided by the sampled virtual viewpoint whose significant features are most similar to those of the real image. The feature matching and similarity evaluation are performed by a weighted point pair matching algorithm [39]. Some CNN-based coarse pose estimators trained with a small number of samples are also suitable choices for obtaining the initial pose.
The evolution stage is redesigned as population initialization and migration, offspring generation, viewpoint quality assessment and selection, pose feasible region construction and update, and learnable evolution function optimization, inspired by importance sampling in reinforcement learning [40].

2.1. Viewpoint Performance Assessment

For each viewpoint $vp$, the performance $vp_q$ reflects the average similarity of the virtual–real ROIs of all visible components. Geometric features are adopted to overcome texture sparsity. Inspired by [41], as shown in Figure 3, a self-adaptive evaluation line-based feature matching and evaluation algorithm that balances structural stability and multiple considerations is designed to enhance robustness against virtual–real differences as well as uneven correspondence. For each visible component $w$, where $w$ belongs to the set of all visible components $W$ under $vp$, the virtual–real feature pair is denoted as $(p^w, x^w)$, and $p^w$ is projected from the 3D coordinates $P^w$. $p^w_l = \{p^w_{l,1}, \ldots, p^w_{l,m}\}$ represents the set of points on the $l$-th projected line segment of $p^w$, and the same applies to $x^w_r$.
For any $p^w_{l,m}$ in $p^w_l$, a rectangular search region of adjustable size is expanded along the normal direction, where the size depends on $\sigma = E_s(vp_q)$, $vp_q$ is the performance of the parent view of $vp$, and $E_s$ denotes the evolution function of the search factor $\sigma$. In the rectangular region, any point belonging to $x^w_r$ can be selected as a candidate matching point for $p^w_{l,m}$ as long as the angle between $p^w_l$ and $x^w_r$ satisfies the preset range. After searching, a redundant candidate point set $\tilde{x}^w$ of $p^w$ is established. Theoretically, there is always an ideal transformation $T_{2D}$ that can minimize the error of $(T_{2D}\,p^w, x^w)$ under the constraints provided by the root $vp$. During the evolution stage, the translational part of $T_{2D}$ is the main factor that narrows the gap between $(p^w, x^w)$. Based on the above assumptions, the quadrant is divided uniformly to roughly determine the potential direction of translation, and a pair of matching points is defined as $(p^w_{l,m}, x^w_{r,n})$. According to
$$\arccos\!\left[\frac{\left(x^w_{r,n} - p^w_{l,m}\right)\cdot(1, 0)^T}{\left\|x^w_{r,n} - p^w_{l,m}\right\|}\right],\tag{1}$$
the candidate set of $p^w$ in each slope range can be obtained. For each slope range, the number ratio $ra$ and the distance variance $var_{res}$ between the candidate set and $p^w$ can be counted. The direction advantage is defined as $norm(ra) + (1 - norm(var_{res}))$, where $norm$ represents the normalization operation (over all slope ranges), and candidate matching points are reassigned to $p^w$ in order of advantage. During the reassignment process, if a one-to-many situation arises, the candidate point belonging to the slope range that reduces the variance of the current matching distance is preferentially selected. Afterward, we can obtain $(p^w, \check{x}^w)$, $cp_{ra}$, and $cp_{res}$, where $\check{x}^w$ represents an incomplete set (part of $x$ finds no corresponding $p$), $cp_{ra}$ is the number ratio, and $cp_{res}$ denotes the average distance of all matched pairs.
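To make the slope-range analysis concrete, the following is a minimal Python sketch of the direction-advantage computation described above. It uses a signed `arctan2` angle in place of the unsigned arccos of Equation (1); the bin count and all function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def direction_advantage(points, candidates, n_bins=8):
    """Bin candidate matches by the angle of (x - p) and score each slope
    range by norm(ra) + (1 - norm(var_res)). Parameters are illustrative."""
    ranges = np.linspace(-np.pi, np.pi, n_bins + 1)
    counts = np.zeros(n_bins)
    residuals = [[] for _ in range(n_bins)]
    for p, cands in zip(points, candidates):
        for x in cands:
            d = x - p
            ang = np.arctan2(d[1], d[0])   # signed angle; Eq. (1) uses the unsigned form
            b = min(np.searchsorted(ranges, ang, side="right") - 1, n_bins - 1)
            counts[b] += 1
            residuals[b].append(np.linalg.norm(d))
    ra = counts / max(counts.sum(), 1)      # number ratio per slope range
    var_res = np.array([np.var(r) if r else np.inf for r in residuals])
    finite = np.isfinite(var_res)
    v = np.where(finite, var_res, np.max(var_res[finite]) if finite.any() else 1.0)
    norm = lambda a: (a - a.min()) / (np.ptp(a) + 1e-9)  # normalize over all bins
    return norm(ra) + (1.0 - norm(v))       # direction advantage per bin
```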
Evaluation criteria that rely solely on $cp_{res}$ are easily affected by background noise, incomplete geometric features, and virtual–real differences. Therefore, a ridge function mapping strategy is further embedded with $cp_{ra}$ as an additional evaluation factor. Inspired by the inverse square root unit (ISRU) [42], the mapping is defined as follows:
$$ISRU(cp_{ra}) = \frac{1}{2}\left[\frac{1.07\left(2 n_f\, cp_{ra} - n_f\right)}{\sqrt{1.5 + \left(2 n_f\, cp_{ra} - n_f\right)^2}} + 1\right],\tag{2}$$
where $n_f = 5$ is the transformation factor of $cp_{ra}$. Equation (2) shows a parameter combination with suitable performance, which makes $cp_{ra}$ less sensitive to extreme ranges. Afterward, define the measurement points $mp_1 = (cp^0_{res}, 0, vp^0_q)^T$, $mp_2 = (cp^0_{res}, 1.0, vp^1_q)^T$ and the fixed points $fp_1 = (cp^0_{res}, 1.0, 1.0)^T$, $fp_2 = (root^0_{res}, 0.0, 0.0)^T$, where $cp^0_{res} = 0$ describes the ideal average distance error. For the discrimination of the result after mapping, let $cp^0_{res} = 0.1\, root^0_{res}$, where $root^0_{res}$ is the average distance error of the root $vp$. The ridge function $rf(cp_{ra}, cp_{res}, vp_q)$ can be planned as
$$\begin{cases}\left[(fp_1 - mp_1)\times(fp_2 - mp_1)\right]^T\left[(cp_{res}, cp_{ra}, vp_q)^T - mp_1\right] = 0, & cp_{ra} \le rf_{T1}\\ \left[(fp_1 - mp_2)\times(fp_2 - mp_2)\right]^T\left[(cp_{res}, cp_{ra}, vp_q)^T - mp_2\right] = 0, & cp_{ra} > rf_{T2}\end{cases}\tag{3}$$
where $rf_{T1}$, $rf_{T2}$ are the boundaries of the transition region and $rf_{T1} = rf_{T2} = (cp_{res} - root^0_{res})/(cp^0_{res} - root^0_{res})$. The transition region can similarly be set to be planar, making the mapped $vp_q$ smoother. $\{vp^0_q, vp^1_q\}$ are a set of indicators adopted to adaptively modulate the propensity of $cp_{ra}$ and $cp_{res}$. Intuitively, it may be more suitable to set $vp^1_q > vp^0_q$ at the beginning and $vp^1_q < vp^0_q$ at the end.
In this way, the four aspects of structural stability, error direction, number of matching points, and distance error can be integrated. For the $w$-th component, let $vp_q|_w$ denote the result mapped by (3). The white numbers in Figure 3 represent $cp_{res}$ (first row) and $vp_q|_w$ (second row). To maintain consistency with the gradient direction of the distance error, let $vp_q|_w = 1 - vp_q|_w$ in Figure 3. For some matching states that $cp_{res}$ cannot effectively distinguish, $vp_q|_w$ provides reliable discrimination. The viewpoint performance can be calculated according to $vp_q = Q(vp) = (1/|W|)\sum_{w\in W} vp_q|_w$. If the $W$ and $P^w$ obtained by the root $vp$ are not updated in time during the evolution stage, the evaluation will encounter greater viewpoint ambiguity and differences in geometric features. Therefore, by configuring different rendering frequencies, the adaptability of the proposed method with respect to the degree of virtual–real difference is evaluated in depth in Section 3.
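As a worked illustration of Equation (2) and the aggregation $vp_q = Q(vp)$, the sketch below implements our reading of the ISRU mapping (the square root placement follows the standard ISRU form, since the original typesetting is ambiguous) and the per-component averaging; the clipping to $[0, 1]$ is an added safeguard.

```python
import numpy as np

def isru_map(cp_ra, n_f=5):
    """ISRU-style mapping of the match-count ratio cp_ra (Eq. (2)):
    rescale to x = 2*n_f*cp_ra - n_f, squash with x/sqrt(1.5 + x^2),
    and shift to [0, 1]."""
    x = 2.0 * n_f * np.asarray(cp_ra, dtype=float) - n_f
    y = 0.5 * (1.07 * x / np.sqrt(1.5 + x ** 2) + 1.0)
    return np.clip(y, 0.0, 1.0)

def viewpoint_performance(vpq_per_component):
    """vp_q = Q(vp): mean of the per-component mapped scores over W."""
    return float(np.mean(vpq_per_component))

# toy usage: a high count ratio maps close to 1, a low one close to 0
print(isru_map([0.1, 0.5, 0.9]))   # -> [0.0, 0.5, 1.0]
```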

2.2. Feasible Region Construction and Update

2.2.1. Problem Planning

The projection matrix (or pose) $T$ contains seven variables if the quaternion-plus-translation representation is employed. The crossover and mutation in the traditional genetic algorithm are too exploratory to ensure the optimization of $T$. Commonly used numerical optimization is limited by the observed $\{x, p, P\}$ and prone to fall into local optima. Thus, it is expected to provide a range constraint in the solution space from parent $P_i$ to children $C_i$ based on the observed $\{x, p, P\}$, which can be continuously updated and contracted toward the ideal solution with the adjustment of $x, P$ during the evolution. The range in this study is defined as the pose feasible region $F$, and for $F_i$, $F_i = \{T \mid F_i(\delta_i, W, T) \le 0,\ T \in SE(3)\}$, where
$$F_i(\delta_i, W, T) = \sum_{w\in W}\left\|x^w - \pi(T P^w)\right\|^2_2 - \delta^2_i.\tag{4}$$
$x^w$ represents the pixel coordinates corresponding to the best $vp_q|_w$ achieved by the $w$-th visible component in the previous $i$ rounds. $\delta_i = E_g(\overline{vp}^i_q)$, $\overline{vp}^i_q = (1/|P'_i|)\sum_{k\in P'_i} vp^k_q$, $vp_k \in P_{i+1}$, where $P_{i+1}$ is the initial parent set selected from $\{C_i, P'_i\}$ in the $i$-th round, and $P'_i$ refers to $P_i$ after migration to $F_i$. $E_g$ is the evolution function of the tolerance factor for constructing $F$. To provide a feasible constraint for the pose search, the expression of $F$ should be solved. This expression problem can be formulated as
$$\min_T \sum_{w\in W}\left\|x^w - \pi(T P^w)\right\|^2_2 - \frac{1}{\dot{t}}\log\left(-F_i(\delta_i, W, T)\right),\tag{5}$$
where $\log(-F_i(\delta_i, W, T))$ is the logarithmic barrier function and $\dot{t}$ denotes the approximation factor of the indicator function. This is an optimization technique that uses the interior point method to remove inequality constraints. In practice, however, the optimization process of (5) is too complicated: given different initial values, the repeated iterations and optimization of $\dot{t}$ are computationally expensive. To this end, the assumption of structural stability is again applied as a basis for simplification. Specifically, only the matching relation corresponding to the best advantage in $(p^w, \tilde{x}^w)$ is retained to sparsify $x^w$ and reduce the scale of problem (5). To alleviate the tedious optimization work, a combined strategy based on boundary projection and sparse trajectory approximation is provided below.
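The following sketch shows the feasible region test of Equation (4) and the log-barrier objective of Equation (5) for a pinhole camera. The 4×4 pose convention, the intrinsics matrix `K`, and the barrier sign convention are assumptions made for illustration.

```python
import numpy as np

def project(K, T, Pw):
    """Pinhole projection pi(T P^w): T is a 4x4 camera-from-model pose, K is 3x3."""
    Pc = (T @ np.append(Pw, 1.0))[:3]
    uv = K @ Pc
    return uv[:2] / uv[2]

def feasible_region_value(K, T, X, P, delta):
    """F_i(delta, W, T) of Eq. (4): summed squared reprojection error minus
    delta^2; a value <= 0 means T lies inside the feasible region."""
    res = sum(np.sum((x - project(K, T, Pw)) ** 2) for x, Pw in zip(X, P))
    return res - delta ** 2

def barrier_objective(K, T, X, P, delta, t_dot=10.0):
    """Interior-point form of Eq. (5); valid only strictly inside F."""
    F = feasible_region_value(K, T, X, P, delta)
    if F >= 0:
        return np.inf                  # outside the region: barrier undefined
    data = F + delta ** 2              # the reprojection term itself
    return data - (1.0 / t_dot) * np.log(-F)   # standard log-barrier (sign assumed)
```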

2.2.2. Initialization

As shown in Figure 4, an ellipsoidal rule [38] is used to sample the initial viewpoint poses, denoted as $V$. Since the in-plane rotation during machine or manual imaging is small, the out-of-plane rotation can be regarded as a strong prior for constructing the initial viewpoint network.

2.2.3. Boundary Projection

For each $T$ of $vp$, $vp \in V$, the projection onto the boundary of $F_i$ can be planned as
$$\arg\min_{T_v} f_d(R, R_v, t, t_v), \quad \text{s.t.}\ F_0(\delta_0, W, T_v) = 0,\tag{6}$$
where $\{R_v, t_v\} \to T_v$ is the projection result. The function $f_d(\cdot) = \left\|q \otimes q_v^{-1} - [0_{3\times1}^T, 1]^T\right\| + \left\|t - q \otimes q_v^{-1}(t_v)\right\|$ is designed to measure the distance between poses, where $q$ is the quaternion representation of $R$ and $\otimes$ defines the multiplication of quaternions. The objective function (6) is a small-scale equality-constrained optimization problem, and the solution of each iteration can be approximately equivalent to
$$\min f_1(\Delta T) = \frac{1}{2}\Delta T^T \nabla^2 f(T_k)\,\Delta T + \nabla f(T_k)^T \Delta T, \quad \text{s.t.}\ \nabla F_0(\delta_0, W, T_k)^T \Delta T + F_0(\delta_0, W, T_k) = 0,\tag{7}$$
according to SQP [43], where $f_d(T_k)$ is the objective function of (6) and $(\ast)^T$ is the transpose operation. Then, (7) can be rewritten as
$$\min L(\Delta T, \lambda) = \frac{1}{2}\Delta T^T H_f\,\Delta T + J_f\,\Delta T + \lambda^T\left(J_F\,\Delta T + F_0(\delta_0, W, T_k)\right)\tag{8}$$
at $T_k$, where $\Delta T = T_{k+1} \ominus T_k$ is the variable to be solved and $\lambda$ is the Lagrange multiplier. It should be noted that the above calculation involves the derivation of $T$ and the update of $T$. However, the $R$ of $T$ is only closed under multiplication, not under addition and subtraction. Therefore, a logarithmic mapping needs to be performed on $T$ to obtain $\xi = (\phi, \rho) \in \mathbb{R}^6$, $\rho \in \mathbb{R}^3$, $\phi \in so(3)$, based on the Lie algebra representation. $\ominus$ is the subtraction defined on $\xi$. Let $f(\cdot) = \|T_q\| + \|T_t\|$, where $\hat{T}_q, \hat{T}_t$ are the respective unit vectors. Then, the $J_f$ of (8) can be expanded as
$$J_f = \left[\frac{\partial f_1}{\partial \delta q}, \frac{\partial f_1}{\partial \delta t}\right] = \left[\hat{T}_q^T\frac{\partial T_q}{\partial \delta q} + \hat{T}_t^T\frac{\partial T_t}{\partial \delta q},\ \hat{T}_t^T\frac{\partial T_t}{\partial \delta t}\right].\tag{9}$$
The related research is mature; hence, the calculation process of $\rho$, $\phi$, and (9) is omitted. After the optimization is completed, the results can be recovered from $so(3)$ to $SE(3)$ by exponential mapping. The rSQP [44] is recommended if the scale of $x^w$ and $x$ in other applications is still large.
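Because $R$ is closed only under multiplication, the update step travels through the Lie algebra, as described above. The sketch below shows the standard $SO(3)$ logarithmic and exponential maps (via the Rodrigues formula) that such an additive tangent-space update relies on; it is generic textbook machinery, not the authors' exact routine.

```python
import numpy as np

def so3_log(R):
    """Logarithmic map SO(3) -> so(3): rotation matrix to axis-angle vector phi."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * w / (2.0 * np.sin(theta))

def so3_exp(phi):
    """Exponential map so(3) -> SO(3) via the Rodrigues formula."""
    theta = np.linalg.norm(phi)
    if theta < 1e-8:
        return np.eye(3)
    k = phi / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

# additive step in the tangent space, then recover a valid rotation
R = so3_exp(np.array([0.0, 0.0, 0.3]))
R_new = so3_exp(so3_log(R) + np.array([0.01, 0.0, 0.0]))
```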

2.2.4. Sparse Trajectory Approximation

Let the projected $V$ be $V'$. For each $vp \in V'$, an unconstrained objective function can be constructed as
$$f_2(T) = \min \sum_{w\in W} H\!\left(\left\|x^w - \pi(T P^w)\right\|^2_2\right),\tag{10}$$
where $H(\cdot)$ is a kernel function that enhances robustness to outliers; Huber is selected in this study. Taking $V'$ as the initial state, (10) is solved in parallel based on the LM strategy, where the pose changes can be recorded as $tra^h = \{tra_{T_0}, tra_{T_1}, \ldots, tra_{T_v}\}$, $tra_{T_v} = \{T^1_v, T^2_v, \ldots, T^{|V|}_v\}$, $T_v \in V'$, and $|V|$ is the number of iteration steps of the $v$-th trajectory.
Let the optimization directions of all trajectories in the $z$-th step be $\{\Delta T^z_0, \Delta T^z_1, \ldots, \Delta T^z_v\}$, and define the trajectory affinity as
$$A^z_{m,n} = f_d(T^z_m, T^z_n)\cdot\left(\hat{T}^z_m \cdot \hat{T}^z_n\right).\tag{11}$$
Afterward, the horizontal interpolation is evaluated for $\{T^z_1, T^z_2, \ldots, T^z_v\}$. For all $A^z_{m,n}$, $m, n \in tra^h$,
$$tra^v = li\!\left(T^z_m, T^z_n, \tfrac{1}{2}\left[f_d(T^{z+1}_m, T^z_m) + f_d(T^{z+1}_n, T^z_n)\right]\right), \quad A^z_{m,n} > \frac{1}{v - m}\sum_{n=m+1}^{v} A^z_{m,n}.\tag{12}$$
Let $t = \left(f_d(T^{z+1}_m, T^z_m) + f_d(T^{z+1}_n, T^z_n)\right)/\left(2 f_d(T^z_m, T^z_n)\right)$; then, $li(\cdot)$ can be expressed as a spherical linear interpolation on $q$ and a linear interpolation on $t$, respectively, in the following way:
$$li_q = \frac{\sin\left((1 - t)\theta\right) q_m + \sin\left(t\theta\right) q_n}{\sin(\theta)}, \quad li_t = t_m + t\,(t_n - t_m).\tag{13}$$
Through the above work, the discrete approximation $tra_i = \{tra^h_i, tra^v_i\}$ of $F_i$ can be obtained.
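A minimal sketch of the interpolation operator $li(\cdot)$ of Equation (13) is given below: spherical linear interpolation on the quaternion part and linear interpolation on the translation part. The short-arc sign flip is a common implementation detail assumed here.

```python
import numpy as np

def slerp(q_m, q_n, t):
    """Spherical linear interpolation between unit quaternions (Eq. (13))."""
    q_m, q_n = q_m / np.linalg.norm(q_m), q_n / np.linalg.norm(q_n)
    dot = np.clip(np.dot(q_m, q_n), -1.0, 1.0)
    if dot < 0.0:                     # take the short arc
        q_n, dot = -q_n, -dot
    theta = np.arccos(dot)
    if theta < 1e-8:
        return q_m
    return (np.sin((1 - t) * theta) * q_m + np.sin(t * theta) * q_n) / np.sin(theta)

def interp_pose(q_m, t_m, q_n, t_n, t):
    """li(.): slerp on the rotation, linear interpolation on the translation."""
    return slerp(q_m, q_n, t), t_m + t * (t_n - t_m)
```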
During evolution, once the historical best $vp_q|_w$ of any component is reset due to the generation of a child $vp$, the pose feasible region update can be selectively started according to the change in $vp_q$. $F_{i+1}$ is updated from $F_i$, and the update can be summarized in four steps: (a) simplifying $x^w$ based on structural stability to obtain $x^w$, $x$; (b) determining $\delta_{i+1}$ and $F_{i+1}(\delta_{i+1}, W)$ and selecting the support $vp$ according to $vp_q$ from $\{P_i, C_i\}$; (c) calculating the projection of the support $vp$ onto the boundary by Formulas (6)–(9); and (d) solving the optimization problem (10) and performing affinity-based interpolation with (11)–(13).

2.3. Parent Migration and Offspring Generation

As shown in Figure 5, a parent migration mechanism is proposed to ensure that the population performance can be gradually improved with the update of $F$. The migration operation is defined as $P_{i+1} \to P'_{i+1}$, and $T^*_v$ of $vp_k \in P'_{i+1}$, which denotes the pose of the migrated parent, can be expressed as
$$T^*_v = T^{\,rand\left[0,\ (1 - \lambda_k)\cdot v_k\right]}_v, \quad T_v \in tra_{T_v}, \quad \lambda_k = E_t\left(vp^k_q\right),\tag{14}$$
where $rand$ means randomly picking an integer in the range $[0, (1 - \lambda_k)\cdot v_k]$ and $E_t$ is the evolution function of the migration factor $\lambda_k$. To further align the exploration with the actual situation, rotation correction is performed on the randomly generated values.
Sampling $vp$ is highly exploratory, while proceeding strictly along the optimization direction lacks exploration. To address these problems, an offspring generation strategy based on the pose feasible region constraint is designed.
The Jacobian matrix of (10) is $J_F \in \mathbb{R}^{I\times 7}$. For any $vp_k$ and its $T^*_v$, the next feasible point within the trust region $r^*$ can be calculated from $(J_F^T J_F + r^* I)\,d = J_F^T f_2(q^*_v, t^*_v)$. The basic direction is defined as
$$d = d_o, \quad \left(J_o^T J_o + r^* I\right) d_o = J_o^T f_2\left(q^*_v, t^*_v\right), \quad J_o = J_F|_{I_o = 0}, \quad o \in \mathbf{o},\tag{15}$$
where $I_o = 0$ indicates that the corresponding columns of $I$ are set to 0 according to the gradient clipping configuration $o$. For example, $o = \{5, 6, 7\}$ means that only the gradient of rotation is considered. In this study, $\mathbf{o}$ includes maintaining the gradient of $t^*_v$ while clipping $q^*_v$ in four dimensions, and maintaining the gradient of $q^*_v$ while clipping $t^*_v$ in three dimensions. The estimated amount of gradient descent can be expressed as $l_o = d_o^T J_o^T f_2(q^*_v, t^*_v) - \frac{1}{2} d_o^T J_o^T J_o\, d_o$. The probability of generating offspring in each basic direction is described as
$$p_o = I(l_o)\,(1 - p_e)\,\frac{l_o}{\sum_{o\in C_u o_e} l_o} + \left(1 - I(l_o)\right) p_e\,\frac{l_o}{\sum_{o\in o_e} l_o},\tag{16}$$
where $I(l_o) = 1$ if $l_o > 0$; otherwise, $I(l_o) = 0$. $p_e$ is the pre-allocated probability of exploration, $o_e = \{o \mid l_o \le 0,\ o \in \mathbf{o}\}$, and $C_u o_e$ is the complementary set of $o_e$. To prevent the imbalance of $p_o$ due to too few exploration directions, it is recommended to apply a secondary adjustment according to the ratio of $\bar{p}_o$ in the two sets.
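The probability split of Equation (16) can be sketched as follows. The handling of all-zero exploration weights and the value of $p_e$ are assumptions; the source only specifies that descent directions share $1 - p_e$ and exploration directions share $p_e$.

```python
import numpy as np

def direction_probs(l, p_e=0.2):
    """Eq. (16): directions with positive predicted descent l_o share
    probability mass (1 - p_e) in proportion to l_o; the remaining
    (exploration) set o_e shares p_e. p_e is an assumed value."""
    l = np.asarray(l, dtype=float)
    pos = l > 0
    p = np.zeros_like(l)
    if pos.any():
        p[pos] = (1.0 - p_e) * l[pos] / l[pos].sum()
    if (~pos).any():
        # weight exploration directions by |l_o| (uniform if all zero)
        w = np.abs(l[~pos])
        w = w / w.sum() if w.sum() > 0 else np.full((~pos).sum(), 1.0 / (~pos).sum())
        p[~pos] = p_e * w
    return p / p.sum()                # renormalize in case one set is empty

print(direction_probs([0.8, 0.1, -0.05, -0.3]))
```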
As shown in Figure 5, once the basic direction $d_o$ is selected, a cone-based disturbance is further applied to the rotation and translation, respectively. For simplicity, $\tau = E_a(vp^k_q)$ and $s = E_e(vp^k_q)$ represent the disturbance ranges of angle and scale, respectively, where $E_a$, $E_e$ are the corresponding evolution functions. Therefore, the perturbed direction is $d'_o = rand\{C(T)\} - T^*_v$, where $C(T)$ is described as
$$\hat{d}_o^T\left(T - (T^*_v + d_o)\right) = 0, \quad \text{s.t.}\ \left\|T - (T^*_v + d_o)\right\|^2_2 \le \tan^2(\tau)\left\|d_o\right\|^2_2,\tag{17}$$
where $\hat{d}_o$ is the unit vector and $rand\{C(T)\}$ denotes random sampling from the distribution $C(T)$. After $d'_o$ is determined, the pose $T^*_v$ can be updated through $T^{*\prime}_v = T^*_v + s\sum_{j=1}^{|V|}\left\|T^j_v - T^{j+1}_v\right\| d'_o$, $T^j_v \in tra_{T_v}$, according to $s$ and the cumulative change in pose.
To alleviate the unpredictability of the actual gradient descent caused by the introduction of $\tau$, $s$, a relaxed projection approach based on $F$ is established. The projection position of $T^{*\prime}_v$ is determined by the hyperplane supported by its three neighbors in $F$. The hyperplane is defined as $A^T(T - b) = 0$, and the relaxed factor is denoted as $\alpha = E_p(vp^k_q)$. $vp_{k,o}$, the offspring of $vp_k$, can be calculated after relaxed projection through
$$T^{*,c}_v = (T^*_v + d'_o) + rand(0, \alpha)\left(T^{*\to tra_i}_v - (T^*_v + d'_o)\right), \quad T^{*\to tra_i}_v = A(A^T A)^{-1} A^T b,\tag{18}$$
according to the projection theorem, where $\alpha$ is adopted to control the constraint strength and $E_p$ is the evolution function of $\alpha$.
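A compact sketch of the relaxed projection of Equation (18) follows, treating the pose as a 7-vector and `A`, `b` as the plane built from the three neighbors; reading the formula as an affine projection (with `b` as a point on the plane and the columns of `A` as an assumed basis) is our interpretation of the garbled original.

```python
import numpy as np

def relaxed_projection(T_step, A, b, alpha, rng=None):
    """Eq. (18): pull the perturbed pose T_step = T_v* + d_o' part of the way
    toward its projection onto the neighbor-supported plane; the random
    factor in [0, alpha] relaxes the constraint strength."""
    rng = rng or np.random.default_rng()
    # affine projection of T_step onto the plane (projection theorem)
    proj = b + A @ np.linalg.solve(A.T @ A, A.T @ (T_step - b))
    return T_step + rng.uniform(0.0, alpha) * (proj - T_step)

# toy usage with a 7-dim pose vector and an assumed 3-vector basis
T_step = np.ones(7)
A = np.eye(7)[:, :3]    # hypothetical basis of the neighbor-supported plane
b = np.zeros(7)         # hypothetical point on the plane
print(relaxed_projection(T_step, A, b, alpha=0.3))
```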

2.4. A Learnable Evolutionary Function Optimization

The evolution functions mentioned above include $\delta = E_g(\overline{vp}_q)$, $\lambda = E_t(vp^k_q)$, $s = E_e(vp^k_q)$, $\tau = E_a(vp^k_q)$, $\alpha = E_p(vp^k_q)$, and $\sigma = E_s(vp^k_q)$, which are employed to manage the evolution parameters and process based on the performance of the population. Each evolution function is initialized as a linear function, and the abscissas of the head and tail are taken as $\overline{vp}^0_q$ and $\min\{\varepsilon\,\overline{vp}^0_q, 1.0\}$, where $\varepsilon$ is the termination threshold. $\delta_0$ can be referenced by $\frac{1}{N_i}\sum_k^{N_i} cp_{res}$, $vp_k \in P_1$, and $\delta_{end} = 0.0$. The ordinates of the other parameters need to be provided empirically; see Section 3 for details.
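For concreteness, a linear evolution function with the head/tail parameterization described above might look like the following sketch; the head abscissa value is a placeholder, and the ordinates come from the empirical configuration in Section 3.

```python
import numpy as np

def make_linear_evolution_fn(vpq_head, vpq_tail, y_head, y_tail):
    """Initial linear evolution function E(.): maps population performance
    vp_q in [vpq_head, vpq_tail] to a control parameter, clamped at the
    endpoints."""
    def E(vp_q):
        t = np.clip((vp_q - vpq_head) / (vpq_tail - vpq_head + 1e-12), 0.0, 1.0)
        return y_head + t * (y_tail - y_head)
    return E

# e.g., the search factor sigma decaying from 25.0 to 5.0 as performance
# improves; the head abscissa 0.3 is a hypothetical initial vp_q
E_s = make_linear_evolution_fn(0.3, 1.0, 25.0, 5.0)
print(E_s(0.3), E_s(0.65), E_s(1.0))   # 25.0, 15.0, 5.0
```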
As shown in Figure 6, a learning mechanism for optimizing the evolution functions is proposed. The learning stage consists of three main modules and is designed as a hierarchical cyclic process. In the data collection module, each image of the dataset is repeatedly tested according to the current evolution function, and the evolution trajectory $T$ of the viewpoint with the best effectiveness in each test is collected. The effectiveness of the $k$-th viewpoint in the $i$-th round is defined as
$$\varrho^i_k = vp^{i,k}_q + \sum_{j=i}\left(vp^{j,k}_q + p_o\, vp^{j+1,k}_q\right),\tag{19}$$
which can be regarded as reverse tracking of the $vp_q$ of the trajectory generated by the final output viewpoint. For each node in $T$, $\{vp_q, \delta, \sigma, \dot{\lambda}, \dot{s}, \dot{\tau}, \dot{\alpha}\}$ are saved to complete the trajectory collection, where $\dot{\ast}$ represents the actual sampled value. Afterward, $\delta, \sigma, \dot{\lambda}, \dot{s}, \dot{\tau}, \dot{\alpha}$ are normalized to prevent overwhelming effects due to magnitude differences. It can be found that the core purpose of evolution function optimization is to build the mapping relation between the observed population performance and the process adjustment through high-quality results. Intuitively, there should be a better way of exploratory decay than six separate descent strategies with fixed slopes.
The uneven performance of $T$ (a large variance of $vp_q$) generated by the linear functions will make further learning difficult to converge. To this end, a guided model based on multi-model fusion is designed to provide a reasonable transition for data distribution learning. The linear model can be extended as $e = a_b\, vp^b_q + c_b$, $e \in E_0$, where $b$ controls the downward trend of the evolution curve and $e$ is the normalized result of each parameter in $E_0$. Given the head and tail points, $a_b, c_b$ can be modeled as functions of $b$. Considering the correlation between parameters, factor fusion is performed as
$$\left\{a_{b_e}\left(vp^{b_e}_q\right) + c_{b_e}\right\}_{e\in E_0} = e^T W e,\tag{20}$$
where $W$ is the correlation matrix. In this way, the relationship between $vp_q$ and $E_0$ can be comprehensively modeled by (20). The design of this guided model is derived from an empirical understanding of the coarse-to-fine evolution process and a priori knowledge of the interaction of parameters. For example, the correlation between $s$ and $\tau$ should be prioritized over the correlation between $s$ and the other parameters.
Furthermore, let the $b_e$ of each parameter be initialized as a constant $b_0$, $b_0 > 1.0$. The fitting of (20) is executed with $T_0$ as the dataset to obtain the guided model. Next, the guided model is applied to testing and data collection as a temporary evolution function to prepare $T_1$, which implies that the distribution state of the data will be naturally constrained. A neural network with ResNet18 as the backbone and four fully connected layers as the output is employed to learn the complex nonlinear mapping relationships from $T_1$. The mapping model is $E_1 = E_{MLP}(vp_q)$, where the input is the $vp_q$ set of five parents and the output is $E_1 \in \mathbb{R}^{6\times 1}$. Let $T_{E_1}$ be the new dataset obtained according to the mapping model. Obviously, the construction of $E_{MLP}$ is directly affected by the value of $b_e$, which symbolizes the strictness of the intermediate guidance strategy. For the automation of the whole learning process, the value of $b_e$ is set with reference to $e$ following the principle of decreasing integral. For $b_1$,
$$\int\left(e(b_0) - e(b_1)\right)\mathrm{d}(vp_q) = p_r\int\left(e(1.0) - e(b_0)\right)\mathrm{d}(vp_q).\tag{21}$$
Then, for $b_i$,
$$\int e(b_i)\,\mathrm{d}(vp_q) = \int e(b_0)\,\mathrm{d}(vp_q) - i\cdot p_r\int\left(e(1.0) - e(b_0)\right)\mathrm{d}(vp_q).\tag{22}$$
In this study, $b_0 = 1.5$, $p_r = 0.5$, and the standard deviation of the output offspring performance after repeated evolution of each image is denoted as $std(\varrho_{end})$. The hierarchical cyclic learning process is shown in Figure 6, which can be specifically expressed as $b_0 \to T_1 \to E_{MLP} \to T_{E_1} \to std(\varrho_{end}) \to b_1 \to \cdots$. Each update of $E_{MLP}$ is based on the previous one, and the $E_{MLP}$ corresponding to the best $std(\varrho_{end})$ is taken as the final learned evolution function.
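The guided power-curve model and the decreasing-integral schedule of Equations (21) and (22) can be sketched as below. The grid search for $b_i$ and the head/tail values are illustrative stand-ins, and the correlation-matrix fusion of Equation (20) is omitted for brevity.

```python
import numpy as np

def power_curve(vp_q, b, head=(0.3, 1.0), tail=(1.0, 0.0)):
    """Guided-model curve e = a_b * vp_q**b + c_b, with a_b, c_b solved from
    the fixed head/tail points so that only b (the downward-trend control)
    remains free."""
    (x0, y0), (x1, y1) = head, tail
    a_b = (y1 - y0) / (x1 ** b - x0 ** b)
    c_b = y0 - a_b * x0 ** b
    return a_b * vp_q ** b + c_b

def next_b(i, b0=1.5, p_r=0.5, head=(0.3, 1.0), tail=(1.0, 0.0)):
    """Decreasing-integral schedule (Eqs. (21) and (22)): choose b_i so the
    area under the curve shifts by i * p_r of the gap between the areas at
    b0 and 1.0. The grid search is a numeric stand-in."""
    x = np.linspace(head[0], tail[0], 200)
    area = lambda b: float(np.mean(power_curve(x, b, head, tail)))  # proportional to the integral
    target = area(b0) - i * p_r * (area(1.0) - area(b0))
    grid = np.linspace(1.0, 8.0, 400)
    return float(grid[np.argmin([abs(area(b) - target) for b in grid])])

print(next_b(1), next_b(2))
```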

3. Experiment and Results

Section 3.1 describes the preparation work. Cutting-edge DL-based methods and classic strategy combinations are compared in Section 3.2. The robustness of the proposed method to pose deviation and the advantage of enabling the learning mechanism are verified in Section 3.3. Section 3.4 conducts a parameter sensitivity analysis to provide a basis for parameter selection; Section 3.5 verifies the adaptability to appearance and viewpoint differences; and Section 3.6 presents practical application cases.

3.1. Preliminary Works

(1) Experimental platform. A complex assembled product with 113 components was constructed for this study, including 27 different types of equipment, 10 parts (e.g., craft caps), 14 plugs, and 62 screws. The appearance of this product is inconsistent across viewpoints. Some equipment and parts were surface-treated to exhibit low texture and high color consistency.
(2) Parameter configuration. Migration factor $[\lambda_0, \lambda_{end}] = [0.5, 0.05]$, scale disturbance factor $[s_0, s_{end}] = [0.10, 0.001]$, angle disturbance factor $[\tau_0, \tau_{end}] = [20°, 0.0]$, relaxed factor $[\alpha_0, \alpha_{end}] = [0.3, 0.0]$, search factor $[\sigma_0, \sigma_{end}] = [25.0, 5.0]$, termination threshold $\varepsilon = 4.0$, and $\{vp^0_q, vp^1_q\} = \{0.1, 0.7\}$. The number of parent nodes is 10, and the number of child nodes is 5.
(3) Dataset. As shown in Figure 7, the dataset consists of about 240 viewpoints, captured by a hand-held industrial camera and a mobile robot vision system, respectively. The working distance is about 1.8 m, the view width is about 1.5 m, and the resolution is 3264 × 2448. The ground truth was calculated through artificial markers and manual refinement (the beige base is used only to calculate the ground truth).
(4) Evaluation indicator. The pose of complex products in the manufacturing industry mainly serves subsequent inspection, AR-based projection, or robot operation. Therefore, the positioning accuracy of each component better reflects the practical significance of pose accuracy. For this reason, in addition to the widely used pose deviations (mean translation error $\bar{t}_e$ and mean angular error $\bar{a}_e$), we also modified the IoU as follows:
$$l_*\left(\overline{IoU}_e\right) = \frac{1}{|l_*|}\sum_{w\in l_*} IoU\!\left(rect(p^{w,8}), rect(p^{w,8}_{gt})\right), \quad p^{w,8} = \pi\!\left(T P^{w,8}\right),\tag{23}$$
to better distinguish performance, where $P^{w,8}$ denotes the 3D 8-point box coordinates of component $w$, and $l_*$ represents the set of elements ($|l_*|$ is its size) belonging to Range $*$ (Range I, II, III, plugs, and screws).
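A minimal sketch of the modified IoU of Equation (23) is given below, reading $rect(\cdot)$ as the axis-aligned rectangle enclosing the projected 8-point box; that reading, and the per-range averaging, are assumptions.

```python
import numpy as np

def rect_iou(pts_a, pts_b):
    """IoU of the axis-aligned rectangles enclosing two projected
    8-point boxes, each an (8, 2) array of pixel coordinates."""
    ax1, ay1 = pts_a.min(0); ax2, ay2 = pts_a.max(0)
    bx1, by1 = pts_b.min(0); bx2, by2 = pts_b.max(0)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def range_iou_error(proj_boxes, gt_boxes):
    """l*(IoU_e): mean IoU over the components of one range."""
    return float(np.mean([rect_iou(p, g) for p, g in zip(proj_boxes, gt_boxes)]))
```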

3.2. Performance of Pose Estimation

Three commonly used APE solutions were employed for comparison, including direct estimation, iterative estimation, and a hierarchical approach. Each solution was supported by a different combination of the most common and classic algorithms. Specifically, the methods for virtual–real feature matching consist of the nearest neighbor principle (S0), normal-based search [14] (S1), fusion of Tukey estimation and color [45] (S2), a region-based method [16] (S3), iterative point matching [39] (S4), and the global strategy FRICP [46]. The optimization part can be arbitrarily selected from EPnP [20], EPPnP, CEPnP [22], ASPnP [31], PPnP [23], and MLPnP [24]. DL-based methods employed in the hierarchical strategy include PoseNet2 [29], single-shot pose [47], and the state-of-the-art estimators PVNet [30] and MegaPose [34]. Other recent studies [31,32] were also examined; however, given the insufficient sample size/form and their dense auxiliary supervision signals, these structurally complex networks are difficult to train sufficiently and are prone to overfitting. Therefore, considering the different assumptions of that research, they were not compared for the sake of fairness.
The control group settings are as follows: (a) Control group I: the virtual viewpoint sampling and matching method described in Section 2 was employed to perform the rough positioning, and only the direct estimation scheme with the best performance is listed. (b) Control group II: the iterative version of Group I; for each optimization algorithm, the combination with the best performance and the combination with the method in Section 2.1 (VPS) were compared. (c) Control group III: different feature-matching algorithms were embedded into the proposed viewpoint evolution.
The group applying the deep learning method was denoted as Group DL, which includes the following: (a) DL-A. The rough pose was adjusted by PoseNet2 with a different number of bounding box labels. The bounding boxes of two devices from Range I were set as labels in DL-A1, and the same 2 boxes from Range I, together with another 4 boxes of devices from Range II, were set as labels in Group DL-A2. (b) DL-B. The single-shot pose network was trained with the labels of DL-A1, and its output was used for PnP estimation to obtain the rough pose. The label was modified from a bounding box to an 8-point bounding box. (c) DL-C. The estimated result of DL-B was adopted as the rough pose for Group II; DL-C1 represents the combination with best performance, and DL-C2 represents the combination with VPS. (d) DL-D. PVNet is employed, whose semantic labels were generated from the minimum bounding rectangle of significant geometric features. (e) DL-E. MegaPose is selected, where the base is removed from the 3D model, and the bounding box prompt is manually set.
For all DL Groups A–D, about 120 real samples and 1200 virtual samples were prepared for model training. The configurations of the proposed method were as follows: (a) Proposed-A. Evolution function learning was disabled in this group, while only one rendering according to the rough pose was allowed. (b) Proposed-B. Evolution function learning and multiple rendering were enabled in this group, and the estimated result of DL-B was accepted as a rough pose.
The results are listed in Table 1, and the performance is shown in Figure 8. The areas that reflect the difference in accuracy are marked; the higher the brightness, the more accurate the result. Overall, the results can be summarized as follows: (a) the effectiveness of the proposed method is proved; (b) non-DL methods are still the first choice for high-precision applications; (c) a DL module with weak annotation can be used to provide a suitable initial pose when the dataset is insufficient; and (d) the possibility of escaping from local optima in feature matching is the main factor affecting the performance.
The difficulty of the problem is evident in Figure 8a, and the results of Group I and the underlying reasons are shown in Figure 8b. Unpredictable background diversity and various overlapping states pose challenges to the construction of correspondences in the matching process under dense noise, while virtual–real geometric differences further increase the difficulty of noise identification. Moreover, under some special initial viewpoints, there is an uneven deviation in the virtual–real correspondence, which makes it more difficult to explore the correct direction in local large-deviation areas with dense noise. Take the upper plug in the sixth row of the sub-image as an example: there is a large spatial difference between the plug and all the visible components that provide the 2D–3D pairs for pose estimation. Therefore, even if these visible components are well positioned, the $\overline{IoU}_e$ of the plug may still be less than 0.5. The results of the screws' $\overline{IoU}_e$ prove that a satisfactory camera pose cannot be achieved by direct estimation alone. Since the local iteration mode of S4 brings more exploration to the virtual–real feature matching of each component, it shows the best performance and is listed in Table 1.
In terms of matching strategy, S3, which simultaneously fuses edge, texture, and color, is the best choice among S0–S4. The matching and weight adjustment of S4 are based on S0, so it is easily interfered with by noise and has no advantage in Group II. Compared with S0–S4, the strategies of structural stability and advantage direction in VPS effectively improve feature matching, especially in the reduction of angle error. From the perspective of the matching mechanism, the conventional iterative mode lacks exploration, and this cannot be solved by simply increasing the search range. Once components with sparse texture stack severely and the initial deviation is large, the matching process is easily affected, which further leads to a locally optimal result (e.g., Figure 8b). Therefore, as shown by the hard example in Figure 8a, no matter how many iterations are performed, there is no obvious difference between the obtained pose and the rough pose. Similarly, iterative global matching tends to maintain the majority of good correspondences (minor deviation) while sacrificing bad ones (large deviation), resulting in cumulative deviation.
On the other hand, the applied optimization methods usually assume that the noise follows a Gaussian distribution. Therefore, the method can ignore outliers and escape from the locally optimal state only when the distribution of 2D–3D noise is close to the Gaussian assumption. In practice, a more common situation is that the final matching results tend to exhibit a biased state because the support of geometric features in low-quality matching regions is ignored. For example, as shown in the results of Group II in rows 5 and 2, the projection result is skewed by the estimated pose toward the side with suitable 2D–3D pairs. According to the results of the combinations with VPS, the performance difference between the optimization algorithms is not obvious.
The results of Group III demonstrate the robustness of the proposed method to different feature-matching algorithms and its ability to improve them. Due to its iterative local exploration ability, S4 in Group III shows excellent performance, even surpassing Proposed-A; however, this approach is less efficient and of limited practical significance. For the DL groups, the size of the dataset is the primary issue. From the results of DL-A1 and -A2, DL models always tend to provide suitable results near fully trained viewpoints, while their estimation performance degrades greatly when the targets participating in training are occluded. Therefore, the improvement of DL-A2 with more labels is not significant compared with DL-A1. By transforming the pose estimation into a two-stage method of significant geometric point regression and PnP, DL-B effectively alleviates the unreliability of end-to-end estimation and is a recommended rough pose acquisition strategy. Supported by DL-B, DL-C further brings better performance to Group II, which fully illustrates the importance of the initial pose. Despite being provided with precise target boxes, DL-E still cannot directly adapt to unseen complex products.

3.3. In-Depth Analysis of the Proposed Method

3.3.1. Verifying Robustness to Initial Pose Deviation

In this section, the repeatability of Proposed-A and -B and their robustness to initial pose deviation were tested. The DL-B method was employed in the rough positioning process, and the test was repeated 20 times for each image. Furthermore, Gaussian image noises with different maxima were added to the coordinates of the two 8-point bounding boxes recognized by DL-B.
In Table 2, $\bar{\bar{a}}_e$ represents the average value of $\bar{a}_e$ over all datasets, $\bar{a}_{e\_std}$ is the standard deviation of all $\bar{a}_e$, and $\bar{a}_{e\_min}$ and $\bar{a}_{e\_max}$ denote the extreme values of $\bar{a}_e$; the same holds for $\bar{\bar{t}}_e$, $\bar{t}_{e\_std}$, $\bar{t}_{e\_min}$, and $\bar{t}_{e\_max}$. It can be clearly seen that the proposed method adapts to rough poses within the error range of (6.12°, 187.53 mm). Moreover, as shown by the results of $\bar{a}_{e\_std}$ and $\bar{t}_{e\_std}$ in Noise-0, it is reasonable to believe that the proposed method has suitable repeatability, and this repeatability can be maintained until the noise exceeds the tolerance (e.g., Noise-10). As shown in Figure 9, the pose estimation process was visualized according to the number of iteration rounds. It can be found that a rough pose with a large error requires more exploration, which means that the effective updates of the feasible region decrease and the value of $\overline{iter}$ tends to increase. The update lag of $F$ reduces the probability of offspring with suitable $vp_q$ appearing before termination, thereby increasing $\bar{a}_{e\_max}$ and $\bar{t}_{e\_max}$. Due to the evolution function optimization and multiple rendering mechanisms, the performance gap between -A and -B is obvious, which can also be seen from the $1 - vp_q$ and projection errors of -A and -B at the same $iter$. Once the error of the initial pose exceeds the acceptable range, as in Noise-12, the performance of both groups is destroyed. In this case, it is necessary to modify the method parameters to increase the degree of exploration, such as increasing the initial ellipsoid size, $s_0$, $\tau_0$, and $\sigma_0$.

3.3.2. The Advantage of Learning Mechanism

To further explore the performance difference between Proposed-A and -B during the evolution, an additional metric is designed as
$$\bar{a}^{i,*}_e = \frac{1}{|Data|}\sum_{Data}\frac{1}{|*|}\sum_{*} a_e(i), \quad * = P_i\ \text{or}\ P'_i,\tag{24}$$
where $P_i$ and $P'_i$ represent the $i$-th round parents and the migrated parents, respectively. Then, let $\bar{\bar{a}}^{i,*}_e$ be the mean of $\bar{a}^{i,*}_e$ over all repetitions ($\bar{\bar{t}}^{i,*}_e$ is defined in the same way); the results are shown in Figure 10.
Overall, the final performance and improvement speed of $\bar{\bar{a}}_e$ in -B are significantly higher than those of -A, which indicates that a more reasonable parameter reduction approach is conducive to the high-quality update of $F$, indirectly increasing the probability of locally difficult components escaping the dilemma. Comparing $\bar{\bar{a}}_e$, $\bar{\bar{t}}_e$ with their respective $\bar{\bar{a}}^{i,*}_e$, $\bar{\bar{t}}^{i,*}_e$, it can be observed that an appropriate reproductive strategy can provide suitable local exploration for the composition of the optimal viewpoint and make $\bar{\bar{\ast}}_e < \bar{\bar{\ast}}^{i,P_i}_e$. However, as the iterations progress, the absence of fine control in -A leads to an inability to strictly constrain the exploration. Meanwhile, the lack of timely rendering means that viewpoints with high $vp_q$ do not necessarily exhibit smaller errors. Therefore, in the later stage, the optimal viewpoint of -A is perturbed and invalidly updated, causing fluctuations in $\bar{\bar{t}}_e$, even beyond $\bar{\bar{t}}^{i,*}_e$. Proposed-B shows better result stability than -A, suggesting that many ineffective explorations can be avoided and a smaller $\overline{iter}$ can be obtained. Therefore, Proposed-B was employed in all subsequent experiments.

3.4. Parameter Sensitivity Analysis

Important parameters include the numbers of parents and children and a series of initial values $\{\lambda_0, \sigma_0, s_0, \tau_0, \alpha_0\}$ related to the initialization of the evolution functions. Sensitivity analysis was first carried out on the numbers of parents and children, and an orthogonal experiment was then performed for $\{\lambda_0, \sigma_0, s_0, \tau_0, \alpha_0\}$. Based on the results of the orthogonal experiment, the main factors were extracted using the main effect plot and analyzed further.
As shown in Figure 11, $P_5C_5$ indicates that the numbers of parents and children are both 5. It should be noted that the application of the guided model (20) requires additional sub-viewpoints when $|P| < 6$, while the $P$ of the parents still follows the configuration during offspring generation. In theory, more parents mean greater global exploration, and more children mean stronger local exploration. When $P$ is fixed, the performance first increases and then decreases with the increase in $C$, as shown in Figure 11b. This is because local exploration requires a reasonable combination with global exploration, so blindly expanding $C$ injects too many low-quality attempts, harming the update of $F$ and the evolution process. In terms of process, an excessive $C$ makes it difficult to maintain the early advantages of evolution, leading to premature convergence to a low-performance range. When $C$ is fixed, the performance gradually stabilizes with the increase in $P$. However, from a process perspective, although the results improve, more fluctuations emerge during this period, resulting in delayed convergence. To balance accuracy and efficiency, $P_{10}C_5$ was adopted for the subsequent experiments.
Letting $s_0 \in \{0.15, 0.125, 0.1, 0.075, 0.05\}$, $\tau_0 \in \{35°, 30°, 25°, 20°, 15°\}$, $\lambda_0 \in \{0.9, 0.8, 0.7, 0.6, 0.5\}$, $\alpha_0 \in \{0.6, 0.5, 0.4, 0.3, 0.2\}$, and $\sigma_0 \in \{35, 30, 25, 20, 15\}$, orthogonal experiments with 5 factors and 5 levels were performed, and the results are shown in Table 3.
Moreover, to further clarify the impact of each factor, we drew the corresponding 5-factor, 5-level main effect plot based on Table 3. For ease of calculation, the pose error was merged as $norm(\bar{\bar{a}}_e) + norm(\bar{\bar{t}}_e)$, and the results are provided in Figure 12. In terms of value range (the first row), it can be found that $\{\lambda_0, \sigma_0\}$ have a strong influence on the performance (0.3597, 0.3940) and tend to provide the best results at levels 4 and 5. This is because $\lambda_0$ is a variable that controls the degree of global exploration, and $\sigma_0$ determines the search scope of the viewpoint quality evaluation. With the assistance of the method mechanism, an appropriate increase in $\{\lambda_0, \sigma_0\}$ can indeed contribute to early exploration. Since the projection operation based on $\alpha_0$ in offspring generation provides additional limitations on exploration, $\{s_0, \tau_0\}$ have the least impact on the performance; the effect of $\sigma_0$ is similar at levels 3 and 4, and increasing $\sigma_0$ enlarges the range of viewpoint evaluation (reducing efficiency). Moreover, the balance of efficiency and performance depends on the hardware devices and application scenarios. In this study, the configuration of $P_{10}C_5$ with levels {3, 4, 5, 4, 3} was ultimately selected.

3.5. Appearance Difference Adaptability

This section verifies the adaptability of the proposed method to the appearance differences (e.g., products at different stages or a lightweight model) and viewpoint differences between the virtual model and the real product. The settings of the appearance differences are shown in Figure 13: A, D, and E were not equipped with cables; one device belonging to Range II was removed in B1; one virtual component belonging to Range II was removed in B2; and C = B1 + B2. Additionally, the visual style of Groups A–C differs from the standard, and the part dimensions of Groups D and E were modified, as shown in Figure 13. The dataset was adjusted according to the above changes, and the experimental results are listed in Table 4.
The results prove the adaptability of the proposed method to state differences, especially to visual feature changes caused by differences in model rendering style. Moreover, the pose errors of Groups A, D, and E reflect that sparse context changes, such as component size modification, have less impact on the performance than rendering style changes. Although the error of Group C is close to (0.75°, 69.59 mm), the proposed method still has an advantage over the existing methods listed in Table 1. The impact of a missing part is significant, and the reasons are shown in Figure 14. The current object has a high similarity with the component behind it in this viewpoint, and the mismatch is considered a high-quality correspondence. Moreover, when occluded components participate in matching, their features are extremely different from the region that contains the missing part but highly similar to the noise. The coupling of these situations prevents anomalies from being intercepted by VPS, causing viewpoints that should have been discarded to continually participate in feasible region construction and evolution, resulting in a continuous accumulation of errors.
The viewpoint difference was adjusted by setting the total number of renderings in the evolution process. For example, Group 2 means that rendering was enabled only in the first two rounds. The virtual image rendering was performed based on the viewpoint with the best $vp_q$ in each round. The upper limit of iteration rounds was set to 11, and the results are shown in Table 5. Comparing the results of Group 1 in Table 5 with those of Proposed-A in Section 3.2 further confirms the value of evolution function optimization. The results of Groups 1–3 prove the adaptability of the proposed method to the differences between virtual and real viewpoints and to the geometric feature inconsistency caused by them. The pose estimation accuracy is positively correlated with the number of renderings. However, the change in $\overline{iter}$ from Groups 4 to 11 shows that more rendering is not always better. Since suitable results usually appear in the 6th to 9th rounds and then tend to be stable, the virtual–real feature differences brought by additional rendering will stimulate more exploration, leading to fluctuations in the results and an increase in $\overline{iter}$. Therefore, the balance between accuracy and efficiency should be determined based on the practical application.

3.6. Practical Application Cases

The application cases in two real manufacturing scenarios are shown in Figure 15.
The first row shows a large-sized truss-type product painting task, and the second row introduces the AR-assisted assembly and gluing of porous structural components. The core task of both cases is pose estimation, and their respective problems correspond to parts of the experimental platform. The test results show that the proposed method can reduce the initial error of the former from (5.0°, 80 mm) to (0.8°, 17 mm) and increase the hole projection $\overline{IoU}_e$ of the latter from <0.1 to >0.5. Some valuable setting strategies and an analysis of potential limitations are as follows.
(1) Setting strategies. (a) Divide the object set based on assembly relationships and part sizes, and define a rule for sub-objects to gradually participate in evolution; (b) given an empirical viewpoint, obtain the initial pose through the strategies provided in this article or in studies [48,49,50]; (c) considering both uniform distribution and matching residuals, determine the constraint features for feasible region initialization; (d) choose to adjust VPS or replace it with other matching strategies based on the object structure; (e) select the number of termination evolution rounds based on observations; and (f) decide whether calibration optimization is needed based on the characteristics of the scene.
(2) Potential limitations. (a) No significant structural features (such as flexible bodies): the lack of an initial pose will make it impossible to initialize the feasible region. One possible solution is to adopt a method like MegaPose [34], which can adapt to unseen objects, to provide an initial pose. (b) Excessive initial pose deviation is always a challenge: for possible failures, we recommend some engineering approaches to constrain the initial pose, including viewpoint planning, empirical adjustment, fixtures, etc. (c) Equipment failure: the coverage of the field of view and the rationality of the depth of field need to be given priority during the planning phase.

4. Discussion and Conclusions

In this study, we present a novel integration of knowledge-based agent models and data-driven deep learning for accurate pose estimation of complex products. Pose estimation is reconstructed as an evolution process. To handle optimization over a large-scale solution space, virtual-real matching that considers structural stability, an evolvable feasible region constraint, and adaptive population migration and reproduction are proposed. To reduce the empirical dependency of parameter tuning rules, enhance robustness, and promote rapid convergence, a guided-model-based hierarchical cyclic effective-trajectory learning mechanism is proposed and embedded. Compared with the best results of 1.28°, 77.67 mm and 1.69°, 55.67 mm achieved by classic and recent solutions under different configurations, the proposed method performs better: the pose error reaches 0.23°, 23.71 mm, and the cause of abnormal results can be traced back.
Moreover, through experiments and analysis, we have found the following:
(1) A hierarchical, coarse-to-fine architecture is a practical way to perform image-based pose estimation while maintaining high accuracy and reliability;
(2) Consistent with previous studies [28,29,30,35], cascading a DL module with geometric optimization effectively balances adaptability and accuracy (see the sketch after this list);
(3) Practical constraints such as limited samples, complex structures, and multiple elements make it difficult to deploy advanced DL-based networks directly; in contrast, indirect models that require only weak annotation are easier to deploy in practical applications;
(4) For global or divide-and-conquer architectures based on matching, metric, and optimization, the coordination of exploration and optimization requires finer regulation in the pose estimation of complex assembled products.
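As referenced in finding (2), the cascade pattern can be sketched as follows: a learned model supplies a coarse pose, which then seeds an iterative PnP refinement. `coarse_model` is a hypothetical learned 6-DoF regressor; the OpenCV call is standard, but the sketch illustrates the pattern rather than this study's implementation.

```python
import cv2
import numpy as np

def cascaded_pose(image, pts_3d, pts_2d, K, coarse_model):
    """DL coarse estimate refined by geometric optimization (iterative PnP)."""
    rvec0, tvec0 = coarse_model(image)        # hypothetical learned regressor
    ok, rvec, tvec = cv2.solvePnP(
        pts_3d.astype(np.float64),            # N x 3 model points
        pts_2d.astype(np.float64),            # N x 2 image points
        K, None,                              # intrinsics, no distortion
        rvec=rvec0, tvec=tvec0,
        useExtrinsicGuess=True,               # seed with the coarse pose
        flags=cv2.SOLVEPNP_ITERATIVE)
    return (rvec, tvec) if ok else (rvec0, tvec0)
```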
From the perspective of application value, the case studies and validation show that the method benefits applications that require the camera pose during the assembly of complex products but do not demand strict metrology, such as AR projection, robot-assisted control, and appearance inspection.
In terms of potential limitations, it is undeniable that many specialized designs and an appropriate initial pose must be ensured. In practical applications, we recommend planning the imaging pose network in advance or employing robot imaging for known scenarios to reduce the search space. We also plan to validate the method in more manufacturing scenarios to identify further practical challenges (e.g., actually missing parts) and to adopt CNN-based strategies for representing feasible regions.

Author Contributions

Conceptualization, D.Z.; methodology, D.Z.; software, D.Z. and F.K.; validation, D.Z. and F.K.; formal analysis, D.Z. and F.K.; investigation, D.Z. and F.D.; resources, F.D.; data curation, F.K.; writing—original draft preparation, D.Z.; writing—review and editing, D.Z.; visualization, D.Z. and F.K.; supervision, F.D.; project administration, F.D.; funding acquisition, F.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work is financially supported by the National Natural Science Foundation of China (grant number 52375478).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Marchand, E.; Uchiyama, H.; Spindler, F. Pose Estimation for Augmented Reality: A Hands-On Survey. IEEE Trans. Visual Comput. 2016, 22, 2633–2651.
2. Wang, Z.; Shang, Y.; Zhang, H. A Survey on Approaches of Monocular CAD Model-Based 3D Objects Pose Estimation and Tracking. In Proceedings of the 2018 IEEE CSAA Guidance, Navigation and Control Conference (CGNCC), Xiamen, China, 10–12 August 2018.
3. Jia, Z.; Wang, M.; Zhao, S. A review of deep learning-based approaches for defect detection in smart manufacturing. J. Opt. 2024, 53, 1345–1351.
4. Eswaran, M.; Gulivindala, A.K.; Inkulu, A.K.; Bahubalendruni, M.R. Augmented reality-based guidance in product assembly and maintenance/repair perspective: A state of the art review on challenges and opportunities. Expert Syst. Appl. 2023, 213, 118983.
5. Hao, J.C.; He, D.; Li, Z.Y.; Hu, P.; Chen, Y.; Tang, K. Efficient cutting path planning for a non-spherical tool based on an iso-scallop height distance field. Chin. J. Aeronaut. 2023, in press.
6. Glorieux, E.; Franciosa, P.; Ceglarek, D. Coverage path planning with targeted viewpoint sampling for robotic free-form surface inspection. Robot. Comput. Integr. Manuf. 2020, 61, 101843.
7. Du, G.; Wang, K.; Lian, S.; Zhao, K. Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: A review. Artif. Intell. Rev. 2021, 54, 1677–1734.
8. Wang, H.Y.; Shen, Q.; Deng, Z.L.; Cao, X.; Wang, X. Absolute pose estimation of UAV based on large-scale satellite image. Chin. J. Aeronaut. 2023, in press.
9. Zhang, M.; Zhang, C.C.; Wang, W.; Du, R.; Meng, S. Research on Automatic Assembling Method of Large Parts of Spacecraft Based on Vision Guidance. In Proceedings of the 2021 2nd International Conference on Artificial Intelligence and Information Systems, ACM, Chongqing, China, 28–30 May 2021.
10. Qin, L.; Wang, T. Design and research of automobile anti-collision warning system based on monocular vision sensor with license plate cooperative target. Multimed. Tools Appl. 2017, 76, 14815–14828.
11. Jiang, C.; Li, W.; Li, W.; Wang, D.F.; Zhu, L.J.; Xu, W.; Zhao, H.; Ding, H. A Novel Dual-Robot Accurate Calibration Method Using Convex Optimization and Lie Derivative. IEEE Trans. Robot. 2024, 40, 960–977.
12. Huang, B.; Tang, Y.; Ozdemir, S.; Ling, H. A Fast and Flexible Projector-Camera Calibration System. IEEE Trans. Autom. Sci. Eng. 2021, 18, 1049–1063.
13. Nubiola, A.; Bonev, I.A. Absolute calibration of an ABB IRB 1600 robot using a laser tracker. Robot. Comput. Integr. Manuf. 2012, 29, 236–245.
14. Yu, H.; Huang, Y.; Zheng, D.; Bai, L.; Han, J. Three-dimensional shape measurement technique for large-scale objects based on line structured light combined with industrial robot. Optik 2020, 202, 163656.
15. Li, D.; Wang, H.; Liu, N.; Wang, X.; Xu, J. 3D Object Recognition and Pose Estimation from Point Cloud Using Stably Observed Point Pair Feature. IEEE Access 2020, 8, 44335–44345.
16. Wuest, H.; Vial, F.; Stricker, D. Adaptive line tracking with multiple hypotheses for augmented reality. In Proceedings of the Fourth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'05), Vienna, Austria, 5–8 October 2005; pp. 62–69.
17. Han, P.; Zhao, G. A review of edge-based 3D tracking of rigid objects. Virtual Real. Intell. Hardw. 2019, 1, 580–596.
18. Huang, H.; Zhong, F.; Sun, Y.; Qin, X. An Occlusion-aware Edge-Based Method for Monocular 3D Object Tracking using Edge Confidence. Comput. Graph. Forum 2020, 39, 399–409.
19. Jau, Y.-Y.; Zhu, R.; Su, H.; Chandraker, M. Deep keypoint-based camera pose estimation with geometric constraints. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 4950–4957.
20. Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166.
21. Ferraz, L.; Binefa, X.; Moreno-Noguer, F. Leveraging Feature Uncertainty in the PnP Problem. In Proceedings of the British Machine Vision Conference 2014, British Machine Vision Association, Nottingham, UK, 1 September 2014.
22. Zheng, Y.; Sugimoto, S.; Okutomi, M. ASPnP: An Accurate and Scalable Solution to the Perspective-n-Point Problem. Trans. Inf. Syst. 2013, 96, 1525–1535.
23. Garro, V.; Crosilla, F.; Fusiello, A. Solving the PnP Problem with Anisotropic Orthogonal Procrustes Analysis. In Proceedings of the 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, Zurich, Switzerland, 13–15 October 2012.
24. Urban, S.; Leitloff, J.; Hinz, S. MLPnP-A Real-Time Maximum Likelihood Solution to the Perspective-n-Point Problem. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, III-3, 131–138.
25. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1437–1451.
26. Gordo, A.; Almazán, J.; Revaud, J.; Larlus, D. Deep Image Retrieval: Learning Global Representations for Image Search. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016.
27. Humenberger, M.; Cabon, Y.; Guerin, N.; Morat, J.; Leroy, V.; Revaud, J.; Rerole, P.; Pion, N.; de Souza, C.; Csurka, G. Robust Image Retrieval-based Visual Localization using Kapture. arXiv 2022, arXiv:2007.13867.
28. Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
29. Kendall, A.; Cipolla, R. Geometric Loss Functions for Camera Pose Regression with Deep Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
30. Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. PVNet: Pixel-wise voting network for 6dof pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3212–3223.
31. Balntas, V.; Li, S.; Prisacariu, V. RelocNet: Continuous Metric Learning Relocalisation Using Neural Nets. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018.
32. Xu, Y.; Lin, K.-Y.; Zhang, G.; Wang, X.; Li, H. RNNPose: Recurrent 6-DoF Object Pose Refinement with Robust Correspondence Field Estimation and Pose Optimization. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
33. Bukschat, Y.; Vetter, M. EfficientPose-An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach. arXiv 2020, arXiv:2011.04307.
34. Labbe, Y.; Manuelli, L.; Mousavian, A.; Tyree, S.; Birchfield, S.; Tremblay, J.; Carpentier, J.; Aubry, M.; Fox, D.; Sivic, J. MegaPose: 6D pose estimation of novel objects via render & compare. arXiv 2022, arXiv:2212.06870.
35. Brachmann, E.; Krull, A.; Michel, F.; Gumhold, S.; Shotton, J.; Rother, C. Learning 6D Object Pose Estimation Using 3D Object Coordinates. In Computer Vision—ECCV 2014; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; Volume 8690.
36. Brachmann, E.; Krull, A.; Nowozin, S.; Shotton, J.; Michel, F.; Gumhold, S.; Rother, C. DSAC-Differentiable RANSAC for Camera Localization. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
37. Sarlin, P.-E.; Cadena, C.; Siegwart, R.; Dymczyk, M. From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
38. Ben Abdallah, H.; Jovančević, I.; Orteu, J.-J.; Brèthes, L. Automatic Inspection of Aeronautical Mechanical Assemblies by Matching the 3D CAD Model and Real 2D Images. J. Imaging 2019, 5, 81.
39. Liu, Y.; De, D.L.; Wei, B.; Chen, L.; Martin, R.R. Regularization Based Iterative Point Match Weighting for Accurate Rigid Transformation Estimation. IEEE Trans. Vis. Comput. Graph. 2015, 21, 1058–1071.
40. Hanna, J.P.; Niekum, S.; Stone, P. Importance sampling in reinforcement learning with an estimated behavior policy. Mach. Learn. 2021, 110, 1267–1317.
41. Tjaden, H.; Schwanecke, U.; Schömer, E. Real-time monocular pose estimation of 3d objects using temporally consistent local color histograms. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
42. Carlile, B.; Delamarter, G.; Kinney, P.; Marti, A.; Whitney, B. Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs). arXiv 2017, arXiv:1710.09967.
43. Gill, P.E.; Wong, E. Sequential Quadratic Programming Methods. In Mixed Integer Nonlinear Programming; Lee, J., Leyffer, S., Eds.; Springer: New York, NY, USA, 2012; pp. 147–224.
44. Schmid, A.; Biegler, L.T. Reduced Hessian Successive Quadratic Programming for Realtime Optimization. IFAC Proc. Vol. 1994, 27, 173–178.
45. Huang, H.; Zhong, F.; Qin, X. Pixel-Wise Weighted Region-Based 3D Object Tracking Using Contour Constraints. IEEE Trans. Visual Comput. Graph. 2022, 28, 4319–4331.
46. Zhang, J.; Yao, Y.; Deng, B. Fast and Robust Iterative Closest Point. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3450–3466.
47. Tekin, B.; Sinha, S.N.; Fua, P. Real-Time Seamless Single Shot 6D Object Pose Prediction. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
48. Wang, Q.; Zhou, J.; Li, Z.; Sun, X.; Yu, Q. Robust and Accurate Monocular Pose Tracking for Large Pose Shift. IEEE Trans. Ind. Electron. 2023, 70, 8163–8173.
49. Tian, X.; Lin, X.; Zhong, F.; Qin, X. Large-Displacement 3D Object Tracking with Hybrid Non-local Optimization. In Proceedings of the Computer Vision-ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022.
50. Stoiber, M.; Pfanne, M.; Strobl, K.H.; Triebel, R.; Albu-Schäffer, A. A sparse gaussian approach to region-based 6DoF object tracking. In Proceedings of the Computer Vision-ACCV 2020: 15th Asian Conference on Computer Vision, Kyoto, Japan, 30 November 2020.
Figure 1. Difficulties of high-precision camera pose estimation for complex assembled products.
Figure 2. Method framework.
Figure 3. An improved virtual–real feature matching and similarity evaluation criterion. (a) Principles of viewpoint performance assessment. The blue and green points represent virtual features and candidate real features, respectively; (b) the effect after ridge function mapping.
Figure 4. Pose feasible region construction.
Figure 5. Parent migration and offspring generation strategy.
Figure 6. A learnable mechanism based on effective trajectory tracking and guided model.
Figure 7. Practical application, data preparation, and configuration.
Figure 8. Visualization of estimation performance. (a) The error is reflected by the projected deviation of a component with an 8-point box. (b) Performance in dealing with dense noise and geometric differences.
Figure 9. Visualized comparison of robustness of Proposed-A and -B to pose deviation.
Figure 10. Performance comparison of whether to enable evolution function optimization.
Figure 11. The influence of the number of parents and children on the performance. (a) Pose performance during evolution. (b) Skews ($\overline{IoU}_e$) under different configurations when $iter$ = 10 and $iter$ = 20.
Figure 12. Main effect of performance. The red circle indicates the potential level of factors.
Figure 13. Configuration of state difference between 3D model and real product.
Figure 14. Analysis of the impact caused by missing parts.
Figure 15. Settings and application cases in real scenarios. (a) The components are divided according to size and participate in pose estimation based on this order. (b) Pose estimation performance. The blue points represent the reprojection effect of virtual features.
Table 1. Camera pose estimation performance comparison. Range I–III, Plugs, and Screws report $\overline{IoU}_e$; the average error comprises $\bar{a}_e$ (°) and $\bar{t}_e$ (mm).

| Group | Method | Range I | Range II | Range III | Plugs | Screws | $\bar{a}_e$ (°) | $\bar{t}_e$ (mm) |
|---|---|---|---|---|---|---|---|---|
| | Rough pose | 0.64 | 0.57 | 0.34 | 0.11 | 0.02 | 6.03 | 185.71 |
| I | S4 + ASPnP | 0.74 | 0.60 | 0.43 | 0.13 | 0.03 | 3.61 | 134.74 |
| II | FRICP + EPnP | 0.77 | 0.63 | 0.48 | 0.15 | 0.04 | 1.89 | 110.77 |
| II | EPnP (S3) | 0.81 | 0.69 | 0.53 | 0.22 | 0.08 | 1.83 | 96.92 |
| II | EPnP (VPS) | 0.86 | 0.75 | 0.62 | 0.32 | 0.18 | 1.44 | 85.23 |
| II | CEPnP (S3) | 0.79 | 0.66 | 0.51 | 0.19 | 0.06 | 1.95 | 98.20 |
| II | CEPnP (VPS) | 0.86 | 0.75 | 0.62 | 0.32 | 0.18 | 1.44 | 86.07 |
| II | RPnP (S3) | 0.77 | 0.64 | 0.49 | 0.17 | 0.04 | 2.02 | 102.32 |
| II | RPnP (VPS) | 0.86 | 0.75 | 0.62 | 0.31 | 0.18 | 1.43 | 86.29 |
| II | ASPnP (S3) | 0.82 | 0.74 | 0.62 | 0.29 | 0.12 | 1.64 | 90.67 |
| II | ASPnP (VPS) | 0.87 | 0.75 | 0.63 | 0.33 | 0.19 | 1.39 | 84.86 |
| II | EPPnP (S3) | 0.84 | 0.75 | 0.61 | 0.28 | 0.12 | 1.59 | 92.45 |
| II | EPPnP (VPS) | 0.87 | 0.75 | 0.62 | 0.32 | 0.19 | 1.42 | 83.29 |
| II | MLPnP (S3) | 0.81 | 0.71 | 0.55 | 0.23 | 0.09 | 1.73 | 95.10 |
| II | MLPnP (VPS) | 0.86 | 0.75 | 0.62 | 0.32 | 0.18 | 1.41 | 85.44 |
| III | S0 | 0.89 | 0.81 | 0.72 | 0.49 | 0.30 | 0.97 | 63.41 |
| III | S1 | 0.92 | 0.87 | 0.79 | 0.60 | 0.49 | 0.63 | 47.90 |
| III | S2 | 0.92 | 0.87 | 0.80 | 0.60 | 0.47 | 0.71 | 47.33 |
| III | S3 | 0.93 | 0.89 | 0.83 | 0.65 | 0.51 | 0.59 | 44.78 |
| III | S4 | 0.97 | 0.95 | 0.90 | 0.81 | 0.65 | 0.26 | 33.05 |
| DL | DL-A1 | 0.77 | 0.65 | 0.46 | 0.15 | 0.03 | 2.14 | 115.67 |
| DL | DL-A2 | 0.79 | 0.67 | 0.50 | 0.19 | 0.05 | 1.93 | 103.52 |
| DL | DL-B | 0.68 | 0.57 | 0.35 | 0.12 | 0.02 | 5.32 | 167.42 |
| DL | DL-C1 | 0.85 | 0.75 | 0.61 | 0.32 | 0.15 | 1.57 | 84.18 |
| DL | DL-C2 | 0.88 | 0.81 | 0.65 | 0.38 | 0.22 | 1.28 | 77.67 |
| DL | DL-D | 0.85 | 0.74 | 0.60 | 0.29 | 0.11 | 1.79 | 89.45 |
| DL | DL-E | 0.88 | 0.77 | 0.61 | 0.34 | 0.19 | 1.69 | 55.67 |
| | Proposed-A | 0.97 | 0.95 | 0.89 | 0.79 | 0.63 | 0.27 | 35.24 |
| | Proposed-B | 0.98 | 0.97 | 0.94 | 0.85 | 0.71 | 0.23 | 23.71 |
Table 2. The robustness to initial pose deviation. A and B denote Proposed-A and Proposed-B; each noise-level cell reports A / B.

| | Noise-0 | Noise-2 | Noise-4 | Noise-6 | Noise-8 | Noise-10 | Noise-12 |
|---|---|---|---|---|---|---|---|
| $\bar{\bar{a}}_d$ (°) | 5.32 | 5.94 | 5.74 | 6.12 | 7.75 | 8.97 | 11.74 |
| $\bar{\bar{t}}_d$ (mm) | 167.42 | 178.22 | 183.28 | 187.53 | 194.13 | 226.56 | 277.29 |
| $\bar{\bar{a}}_e$ (°) | 0.26 / 0.23 | 0.28 / 0.24 | 0.31 / 0.27 | 0.34 / 0.26 | 0.46 / 0.34 | 0.59 / 0.43 | 0.78 / 0.71 |
| $\bar{a}_{e\_std}$ (°) | 0.18 / 0.13 | 0.17 / 0.14 | 0.21 / 0.15 | 0.23 / 0.22 | 0.29 / 0.24 | 0.34 / 0.29 | 0.47 / 0.46 |
| $\bar{a}_{e\_min}$ (°) | 0.11 / 0.10 | 0.10 / 0.09 | 0.14 / 0.11 | 0.13 / 0.10 | 0.15 / 0.13 | 0.20 / 0.15 | 0.17 / 0.18 |
| $\bar{a}_{e\_max}$ (°) | 0.54 / 0.44 | 0.61 / 0.50 | 0.75 / 0.59 | 0.78 / 0.58 | 0.93 / 0.77 | 1.07 / 0.92 | 1.35 / 1.23 |
| $\bar{\bar{t}}_e$ (mm) | 32.66 / 23.71 | 31.54 / 22.97 | 33.46 / 25.61 | 37.35 / 28.38 | 45.52 / 34.27 | 51.63 / 42.57 | 73.83 / 66.34 |
| $\bar{t}_{e\_std}$ (mm) | 17.14 / 12.87 | 16.82 / 13.21 | 19.52 / 12.79 | 23.63 / 18.75 | 26.48 / 27.82 | 31.57 / 32.11 | 41.63 / 39.69 |
| $\bar{t}_{e\_min}$ (mm) | 10.27 / 6.42 | 9.17 / 8.36 | 9.49 / 7.94 | 14.49 / 8.34 | 13.53 / 11.75 | 17.93 / 13.56 | 17.31 / 14.57 |
| $\bar{t}_{e\_max}$ (mm) | 54.53 / 43.08 | 59.31 / 42.24 | 61.40 / 47.75 | 64.37 / 51.85 | 71.59 / 73.33 | 83.23 / 77.59 | 92.64 / 88.62 |
| $\overline{iter}$ | 12.48 / 8.14 | 12.75 / 8.21 | 12.69 / 8.17 | 12.95 / 8.39 | 13.44 / 8.80 | 12.95 / 8.91 | 14.27 / 9.05 |
Table 3. Orthogonal experiment of initial parameters.

| Config | $\bar{\bar{a}}_e$ (°) | $\bar{\bar{t}}_e$ (mm) | Config | $\bar{\bar{a}}_e$ (°) | $\bar{\bar{t}}_e$ (mm) |
|---|---|---|---|---|---|
| {1,1,1,1,1} | 0.7118 | 38.6272 | {3,4,5,1,2} | 0.3289 | 31.6075 |
| {1,2,3,4,5} | 0.5432 | 30.9155 | {3,5,2,4,1} | 0.6245 | 31.4838 |
| {1,3,5,2,4} | 0.5148 | 32.5514 | {4,1,3,5,2} | 1.0717 | 40.5826 |
| {1,4,2,5,3} | 0.4505 | 30.4453 | {4,2,5,3,1} | 0.3486 | 25.7130 |
| {1,5,4,3,2} | 0.5673 | 32.5823 | {4,3,2,1,5} | 0.6769 | 51.0665 |
| {2,1,5,4,3} | 0.5493 | 26.0468 | {4,4,4,4,4} | 0.2366 | 24.5641 |
| {2,2,2,2,2} | 0.5637 | 35.1910 | {4,5,1,2,3} | 0.3477 | 31.0679 |
| {2,3,4,5,1} | 0.5201 | 21.0529 | {5,1,2,3,4} | 0.6443 | 27.2303 |
| {2,4,1,3,5} | 0.4341 | 59.3825 | {5,2,4,1,3} | 0.5959 | 43.7068 |
| {2,5,3,1,4} | 0.4369 | 49.7276 | {5,3,1,4,2} | 0.5311 | 38.3228 |
| {3,1,4,2,5} | 0.3744 | 37.6432 | {5,4,3,2,1} | 0.5638 | 32.4311 |
| {3,2,1,5,4} | 0.2759 | 32.6803 | {5,5,5,5,5} | 0.3495 | 35.0093 |
| {3,3,3,3,3} | 0.2489 | 31.9432 | | | |
Table 4. Appearance difference adaptability.

| | A-Dataset | B1-Dataset | B2-Dataset | C-Dataset | D-Dataset | E-Dataset |
|---|---|---|---|---|---|---|
| $\bar{\bar{a}}_e$ (°) | 0.33 | 0.43 | 0.39 | 0.75 | 0.27 | 0.28 |
| $\bar{\bar{t}}_e$ (mm) | 30.50 | 44.49 | 50.67 | 69.59 | 28.92 | 26.16 |
| Proposed-B with standard 3D model: | | | | | | |
| $\bar{\bar{a}}_e$ (°) | 0.25 | 0.26 | 0.26 | 0.25 | 0.23 | 0.25 |
| $\bar{\bar{t}}_e$ (mm) | 24.71 | 25.71 | 26.09 | 26.65 | 24.34 | 23.37 |
Table 5. Adaptability to the difference between virtual and real viewpoints. Column headers give the number of rounds with rendering enabled; "Each" enables rendering in every round.

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Each |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $\bar{\bar{a}}_e$ (°) | 0.26 | 0.24 | 0.23 | 0.22 | 0.23 | 0.22 | 0.21 | 0.22 | 0.23 | 0.23 | 0.23 |
| $\bar{\bar{t}}_e$ (mm) | 27.24 | 26.40 | 27.12 | 26.11 | 25.48 | 25.26 | 24.51 | 23.96 | 23.74 | 23.70 | 23.71 |
| $\overline{iter}$ (round) | 9.12 | 8.73 | 8.81 | 8.93 | 8.25 | 7.97 | 7.93 | 8.11 | 8.13 | 8.14 | 8.14 |