Efficient Model-Based Object Pose Estimation Based on Multi-Template Tracking and PnP Algorithms

Three-dimensional (3D) object pose estimation plays a crucial role in computer vision because it is an essential function in many practical applications. In this paper, we propose a real-time model-based object pose estimation algorithm that integrates template matching and Perspective-n-Point (PnP) pose estimation to deal with this issue efficiently. The proposed method first extracts and matches keypoints between the scene image and the object reference image. Based on the matched keypoints, a two-dimensional (2D) planar transformation between the reference image and the detected object is formulated as a homography matrix, which efficiently initializes a template tracking algorithm. Based on the template tracking result, the correspondence between image features and control points of the Computer-Aided Design (CAD) model of the object can be determined efficiently, leading to fast 3D pose tracking. Finally, the 3D pose of the object with respect to the camera is estimated by a PnP solver from the tracked 2D-3D correspondences, which improves the accuracy of the pose estimation. Experimental results show that the proposed method not only achieves real-time performance in tracking multiple objects, but also provides accurate pose estimation results. These advantages make the proposed method suitable for many practical applications, such as augmented reality.


Introduction
Accurate and efficient pose estimation of an Object-Of-Interest (OOI) is an important task in many robotic and computer vision applications, including vision-based robotic manipulation, position-based visual servoing, augmented reality and camera localization. The purpose of object pose estimation is to recover the relative 3D pose of the OOI in the camera coordinate frame from captured scene images. This problem is one of the fundamental research topics in computer vision, and a variety of approaches have been proposed in the literature. Traditionally, object pose estimation techniques are divided into feature-based and model-based approaches. The feature-based approaches use geometric features, such as points, lines or circles, to estimate the 3D pose of the OOI from the scene image. In contrast, the model-based approaches use an a priori-constructed 3D model of the OOI to deal with object detection and pose estimation efficiently. Empirically, the model-based approaches are more robust than the feature-based ones in heavily cluttered environments, but they usually require more computational effort to compute the optimal pose of the OOI.
Feature-based approaches detect the OOI and estimate its 3D pose with respect to (w.r.t.) the camera based on keypoints, which are image features robust to translation, scaling and rotation.
In this study, a novel and efficient model-based object pose estimation algorithm is proposed. The proposed algorithm is inspired by the work presented in [18], but we employ multiple HA template trackers instead of the SSD tracker to track an OOI with a multi-planar structure. Based on the tracking result, the correspondences between image features and control points of a given 3D CAD model can be determined efficiently. Moreover, an enhanced 3D pose estimation algorithm is proposed by combining an HD-based initial pose solver and a Perspective-n-Point (PnP) solver to improve the accuracy of the pose estimation results. Experimental results demonstrate the tracking performance and estimation accuracy of the proposed model-based object pose tracker.
The rest of this paper is organized as follows. Section 2 introduces the system framework of the proposed model-based object pose estimation algorithm. Section 3 presents the proposed multi-template tracking design, which improves the performance of pose tracking with multiple templates. Section 4 describes the proposed model-based 3D pose estimation method based on the template tracking result. Experimental results are reported in Section 5 to evaluate the effectiveness and efficiency of the proposed method. Section 6 concludes the contributions of this paper.

System Framework

Figure 1 shows the framework of the proposed model-based object pose estimation system, which consists of three modules: object recognition, template tracking and 3D pose estimation. The object recognition module aims to detect and recognize the OOI in the input image using a keypoint-based object detection algorithm. For extracting keypoint descriptors of an image or template patch, the Scale-Invariant Feature Transform (SIFT) algorithm with Graphics Processing Unit (GPU) acceleration [24] was employed to improve the robustness and real-time performance of the proposed system. Moreover, a GPU-accelerated multi-resolution keypoint-descriptor matching algorithm [25] was used to detect the OOI by matching the keypoint descriptors between the input image and the multi-template database. Given multiple template patches of the OOI, a keypoint-descriptor dataset was created a priori by recording the SIFT descriptors of each template patch. When a scene image is captured, the SIFT algorithm is applied to extract the descriptors of the input image, and the keypoint-descriptor matching algorithm is then used to find matches between the scene descriptors and the reference descriptors in the template database. Next, a Random Sample Consensus (RANSAC) algorithm [26] is employed to remove outliers from the keypoint matches, and the template with the maximum number of matching inliers is used to determine the initial position of the OOI in the image via an initial homography estimation method [27].
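The detection step described above (keypoint matching followed by RANSAC-based homography estimation) can be sketched as follows. This is a minimal numpy-only illustration using the Direct Linear Transform (DLT) and a toy RANSAC loop; the function names, iteration count and inlier threshold are illustrative assumptions, not the GPU-accelerated implementation used in the paper.

```python
import numpy as np

def fit_homography(src, dst):
    """Estimate a 3x3 homography H mapping src -> dst via the DLT algorithm.
    src, dst: (N, 2) arrays of matched keypoint coordinates, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)          # null-space vector of the DLT system
    return H / H[2, 2]

def ransac_homography(src, dst, iters=500, thresh=3.0, rng=None):
    """Toy RANSAC loop: repeatedly fit H from 4 random matches and keep
    the model with the most inliers (reprojection error < thresh pixels)."""
    rng = np.random.default_rng(rng)
    best_H, best_inliers = None, np.zeros(len(src), dtype=bool)
    src_h = np.hstack([src, np.ones((len(src), 1))])
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        proj = src_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]
        inliers = np.linalg.norm(proj - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    return best_H, best_inliers
```

The returned homography plays the role of the initial parameter vector handed to the template tracking module.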
When the OOI is detected in the image, the corresponding initial homography parameters and template index (ID) number are sent to activate the template tracking module, which tracks the target in the incoming frames. In this work, the existing HA template tracking algorithm [19] was extended to multi-template tracking, and the technical details are presented in the next section. Based on the tracking result, the six degree-of-freedom (6-DoF) pose of the OOI is estimated by the 3D pose estimation module, which is a PnP solver, as shown in Figure 1 and presented in Section 4.

Multi-Template Tracking Algorithm
The template tracking module was designed based on an HA template tracker, which consists of an offline learning process and an online tracking process. The former learns a tracking model of the target from a template patch, and the latter uses the trained model to predict the optimal position of the OOI between the input image and the reference template in the sense of energy minimization [28]. In this section, a multi-template tracking algorithm is proposed based on the HA template tracking algorithm.

Offline Learning
Suppose that there are N template patches of the OOI in the database, and each of them is assigned an ID number from 1 to N. In each template patch, $N_s$ sample points are selected as the reference intensity pixels, denoted by $X_s^{(j)} = \{p_1^{(j)}, \ldots, p_{N_s}^{(j)}\}$, where the symbol j denotes the ID number of the template, $X_s^{(j)}$ denotes the sample-point set of the j-th template and $p_i^{(j)}$ denotes the image coordinates $(x_i, y_i)$ of the i-th reference intensity pixel in the j-th sample-point set. We also define an initial parameter vector $\mu_0^{(j)}$, which stores the n parameters of the j-th position prediction model $f_p^{(j)}(p, \mu)$ that predicts the new image coordinates of a pixel location p of the j-th template patch w.r.t. a parameter vector $\mu$.

Let $I_{ref}^{(j)}$ denote the intensity image of the j-th template patch and $f_m^{(j)}(X, I, \mu)$ a motion variation model that predicts a motion variation vector $\delta\mu$ from the intensity differences between two consecutive images at the pixel locations X, evaluated under a given parameter vector $\mu$. In the offline learning process, the goal is to find the motion variation model for each given template patch. To achieve this goal, a large number of random motion transformations $\delta\mu_k^{(j)}$, with $k = 1 \sim N_t$ and $N_t \gg N_s$, is generated a priori to form a reference motion variation matrix
$$M^{(j)} = \left[ \delta\mu_1^{(j)}, \ldots, \delta\mu_{N_t}^{(j)} \right] \in \mathbb{R}^{n \times N_t}.$$
Next, a corresponding intensity variation matrix
$$H^{(j)} = \left[ \delta i_1^{(j)}, \ldots, \delta i_{N_t}^{(j)} \right] \in \mathbb{R}^{N_s \times N_t}$$
is computed based on the initial parameter vector $\mu_0^{(j)}$ and the random transformations $\delta\mu_k^{(j)}$, where $\delta i_k^{(j)}$ is the k-th intensity variation vector of the j-th reference intensity template associated with the k-th random transformation. Suppose that the j-th motion variation model is a linear predictor related to the j-th intensity variation vector:
$$\delta\mu = A_p^{(j)} \, \delta i. \quad (1)$$
Then, we have:
$$M^{(j)} = A_p^{(j)} H^{(j)}, \quad (2)$$
where $A_p^{(j)}$ is an n-by-$N_s$ matrix that predicts the j-th motion variation vector $\delta\mu_k^{(j)}$ from the j-th intensity variation vector $\delta i_k^{(j)}$ in the sense of the minimal SSD criterion:
$$A_p^{(j)} = \arg\min_{A} \left\| M^{(j)} - A H^{(j)} \right\|^2. \quad (3)$$
A closed-form solution to the minimization problem (3) is:
$$A_p^{(j)} = M^{(j)} \left( H^{(j)} \right)^{T} \left( H^{(j)} \left( H^{(j)} \right)^{T} \right)^{-1}, \quad (4)$$
which is computationally expensive because it requires inverting an $N_s$-by-$N_s$ matrix $H^{(j)} (H^{(j)})^{T}$. Therefore, the closed-form solution (4) is unsuitable for the online learning of the template tracking module.
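As a sanity check of the closed-form least-squares predictor of Equation (4), the following numpy sketch learns the linear map from synthetic training data. The dimensions and the ground-truth map `A_true` are arbitrary illustrative assumptions; for testability, the intensity variations here are drawn at random and the motion variations are generated from the linear model, rather than from a real template image.

```python
import numpy as np

rng = np.random.default_rng(42)
n, N_s, N_t = 8, 100, 1000     # motion parameters, sample points, random transforms

# Hypothetical ground-truth linear map from intensity variations to motion
# variations, standing in for the unknown relation learned offline.
A_true = rng.normal(size=(n, N_s))

# Synthetic training data: an intensity variation matrix H (N_s x N_t) and
# the motion variation matrix M (n x N_t) it induces under the linear model (2).
H = rng.normal(size=(N_s, N_t))
M = A_true @ H

# Closed-form least-squares predictor, Eq. (4): A = M H^T (H H^T)^{-1}.
# Note the N_s x N_s inversion that the text identifies as the expensive step.
A = M @ H.T @ np.linalg.inv(H @ H.T)

# The learned predictor maps a new intensity variation to its motion variation.
delta_mu = A @ H[:, 0]
```

Because $N_t \gg N_s$, the Gram matrix $H H^T$ is well conditioned and the learned predictor recovers the underlying map essentially exactly in this noise-free setting.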

Online Tracking
When the object recognition module has detected the OOI using one template in the database, the corresponding template ID number is stored, and an initial homography matrix is computed to form an a priori parameter vector $\mu_{t-1}$ of the position prediction model. Let J denote the stored template ID number. To track the OOI in the current image, denoted by $I_t$, the intensity variation vector w.r.t. the reference intensity pixels is given by:
$$\delta i_t = I_t\left( f_p^{(J)}\left( X_s^{(J)}, \mu_{t-1} \right) \right) - I_{ref}^{(J)}\left( X_s^{(J)} \right), \quad (5)$$
which is used to predict the motion variation $\delta\mu_t$ between the two consecutive images:
$$\delta\mu_t = A_p^{(J)} \, \delta i_t. \quad (6)$$
Similar to the forward additive and forward compositional algorithms used in dense image alignment [29], the prediction result obtained from Equation (6) is used to update the a priori parameter vector $\mu_{t-1}$ of the position prediction model:
$$\mu_t = \mu_{t-1} + \delta\mu_t. \quad (7)$$
Finally, the time index is updated by $t-1 \leftarrow t$ to track the OOI in the next input image continuously.
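The offline learning and online update stages can be exercised end-to-end on a synthetic example. The sketch below is an assumption-laden toy, not the paper's implementation: it uses a smooth synthetic image and a pure 2-D translation model $f_p(p, \mu) = p + \mu$, so the parameter vector is just a translation, and the toy's training convention makes the correction appear as a subtraction of the prediction.

```python
import numpy as np

def bilinear(img, pts):
    """Sample image intensities at floating-point (x, y) locations."""
    x, y = pts[:, 0], pts[:, 1]
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * img[y0, x0] + fx * (1 - fy) * img[y0, x0 + 1]
            + (1 - fx) * fy * img[y0 + 1, x0] + fx * fy * img[y0 + 1, x0 + 1])

rng = np.random.default_rng(0)

# Smooth synthetic image; the position prediction model is f_p(p, mu) = p + mu.
yy, xx = np.mgrid[0:64, 0:64]
img = np.sin(xx / 5.0) + np.cos(yy / 7.0) + 0.1 * np.sin(xx * yy / 300.0)

X = rng.uniform(10.0, 50.0, size=(50, 2))   # N_s sample points of the template
i_ref = bilinear(img, X)                    # reference intensities I_ref(X)

# Offline learning: random translations, the intensity variations they
# induce, and the closed-form linear predictor A (Eqs. 1-4).
N_t = 400
dM = rng.uniform(-2.0, 2.0, size=(2, N_t))               # motion variations
dH = np.stack([bilinear(img, X + dM[:, k]) - i_ref
               for k in range(N_t)], axis=1)             # intensity variations
A = dM @ dH.T @ np.linalg.inv(dH @ dH.T)

# Online tracking: the "current frame" is the scene shifted by an unknown
# translation mu_true; iterate the predictor and the additive update.
mu_true = np.array([1.3, -0.8])
I_t = lambda pts: bilinear(img, pts - mu_true)
mu = np.zeros(2)
for _ in range(10):
    di = I_t(X + mu) - i_ref    # Eq. (5): intensity variation
    mu = mu - A @ di            # Eqs. (6)-(7): predict, then additive update
```

After a few iterations, the estimated translation converges to the true shift, since the intensity variation vanishes exactly when the sample points land back on the reference template.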

Model-Based 3D Pose Estimation
As suggested in [30], the PnP methods usually work better than the HD methods because they optimize the pose with a cost function based on the correspondence mapping errors. Therefore, the proposed 3D pose estimation module includes an initial pose solver and a PnP solver: the former solves the initial 3D pose of the OOI using an efficient HD algorithm, and the latter refines the initial pose using the PnP algorithm.

Initial Pose Solver
Suppose that each template patch in the database corresponds to a 3D planar model of the OOI. Given the template tracking result, the correspondences between the 2D image features of the tracked template patch and the 3D control points of the CAD model of the object can be determined efficiently. For the j-th 3D planar model, we define $N_c$ coplanar 3D control points $X_n^{(j)}$, $n = 1 \sim N_c$, and let $X_n^{(j)}(i)$ denote the i-th element of the n-th 3D control point, with $i = 1 \sim 3$. If the OOI has been tracked using the j-th template, we first calculate the mean vector of the corresponding 3D control points of the planar model:
$$C^{(j)} = \frac{1}{N_c} \sum_{n=1}^{N_c} X_n^{(j)}, \quad (8)$$
where $C^{(j)}$ denotes the center location of the j-th 3D planar model. Next, a three-by-three covariance matrix of the 3D control points is calculated:
$$\Sigma^{(j)} = \frac{1}{N_c} \sum_{n=1}^{N_c} \left( X_n^{(j)} - C^{(j)} \right) \left( X_n^{(j)} - C^{(j)} \right)^{T}, \quad (9)$$
which can be decomposed by Singular Value Decomposition (SVD):
$$\Sigma^{(j)} = U_c \, \Lambda_c \, V_c^{T}, \quad (10)$$
where $U_c$ and $V_c$ are real unitary matrices and $\Lambda_c$ is a rectangular diagonal matrix with non-negative singular values on the diagonal. According to the SVD result (10), the control points of the 3D planar model are transformed by a unitary transformation:
$$\hat{X}_n^{(j)} = U_c^{T} \left( X_n^{(j)} - C^{(j)} \right). \quad (11)$$
As a result, the Z-value of each transformed 3D control point $\hat{X}_n^{(j)}$ is fixed to zero. Let $p_n$ denote the 2D feature points corresponding to the projection of the 3D planar control points onto the image plane, and define a mapping function $\rho : \mathbb{R}^3 \rightarrow \mathbb{R}^2$ by $\rho(x_1, x_2, x_3) = (x_1/x_3, \, x_2/x_3)$. Then, an optimal homography matrix $H_{opt}$ between the transformed 3D planar model and the 2D image plane can be obtained by solving the following nonlinear optimization problem:
$$H_{opt} = \arg\min_{H} \sum_{n=1}^{N_c} \left\| p_n - \rho\left( H \hat{X}_n^{(j)} \right) \right\|^2, \quad (12)$$
where $\hat{X}_n^{(j)}$ is formed by the result of Equation (11). Suppose that the camera has been calibrated previously. By using the camera intrinsic matrix K and Equation (12), the pose of the transformed 3D planar model in the unitary coordinate system can be solved by the perspective HD method [18]:
$$\left[ \, h_1 \ h_2 \ h_3 \, \right] = K^{-1} H_{opt}, \quad (13)$$
which can be used to recover the rotation matrix and translation vector of the transformed 3D planar model:
$$\hat{R} = \left[ \frac{h_1}{\| h_1 \|}, \ \frac{h_2}{\| h_2 \|}, \ \frac{h_1 \times h_2}{\| h_1 \times h_2 \|} \right], \qquad \hat{T} = \frac{2 h_3}{\| h_1 \| + \| h_2 \|}, \quad (14)$$
where the symbol $\| x \|$ denotes the two-norm of the vector x and the operator $\times$ represents the cross-product of two vectors. Finally, the initial 6-DoF pose of the 3D planar model in the original camera coordinate system can be estimated from Equations (13) and (14) as:
$$R_0 = \hat{R} \, U_c^{T}, \qquad T_0 = \hat{T} - R_0 \, C^{(j)}, \quad (15)$$
where $R_0$ and $T_0$ represent the initial rotation matrix and the translation vector of the OOI w.r.t. the camera system, respectively.
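The planarization step of Equations (8)-(11) can be checked numerically: for coplanar control points, the covariance matrix has rank two, so its third singular value is (near-)zero, and rotating the centered points by $U_c^T$ zeroes their Z-coordinates. A small numpy sketch with hypothetical control points on a tilted plane:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical coplanar control points: random points on a tilted 3-D plane.
basis = np.linalg.qr(rng.normal(size=(3, 3)))[0][:, :2]  # two orthonormal in-plane axes
origin = np.array([0.5, -1.0, 4.0])
coeffs = rng.uniform(-1.0, 1.0, size=(6, 2))             # N_c = 6 control points
X = origin + coeffs @ basis.T                            # (6, 3) coplanar points

C = X.mean(axis=0)                                       # Eq. (8): centroid
cov = (X - C).T @ (X - C) / len(X)                       # Eq. (9): covariance
U, s, Vt = np.linalg.svd(cov)                            # Eq. (10): SVD

# Eq. (11), X_hat = U^T (X - C), written for row-stacked points as (X - C) @ U.
# The third column of U is the plane normal, so the transformed z vanishes
# and the model can be treated as a z = 0 plane for homography-based pose.
X_hat = (X - C) @ U
```

Because $U$ is orthonormal, the transform is trivially invertible, so the pose recovered in the rotated frame can be mapped back to the camera frame as in Equation (15).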

PnP Solver
Once the initial pose of the OOI has been obtained, the optimal pose can be further estimated from the correspondences between the 2D feature points and the 3D control points. Let $\pi(X, R, T)$ denote the projection function of a 3D point X associated with a rotation matrix R and a translation vector T:
$$\pi(X, R, T) = K \left( R X + T \right), \quad (16)$$
where K is the camera intrinsic matrix. Then, the re-projection error $E(R, T)$ over the 2D-3D correspondences $\{ p_n, X_n \}_{n=1}^{N_c}$ can be computed as:
$$E(R, T) = \sum_{n=1}^{N_c} \left\| p_n - \rho\left( \pi(X_n, R, T) \right) \right\|^2, \quad (17)$$
where $\rho$ is the mapping function of a 3D point defined previously. Hence, the proposed PnP solver refines the initial pose $(R_0, T_0)$ of the OOI by minimizing the re-projection error of Equation (17):
$$\left( R_{opt}, T_{opt} \right) = \arg\min_{R, T} E(R, T), \quad (18)$$
where $R_{opt}$ and $T_{opt}$ are the optimal rotation matrix and the optimal translation vector of the OOI w.r.t. the camera system, respectively. The performance of the proposed model-based object pose estimation algorithm is evaluated in the next section.
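A minimal PnP refinement in the spirit of Equations (16)-(18) can be sketched as a Gauss-Newton loop over an axis-angle rotation vector and a translation. This is an illustrative numpy-only stand-in with a finite-difference Jacobian (the `rodrigues` helper and the six-parameter parameterization are assumptions), not the solver used in the paper, which would typically use an analytic Jacobian or a library routine.

```python
import numpy as np

K = np.array([[935.4444, 0.0, 642.7415],
              [0.0, 935.1667, 360.9098],
              [0.0, 0.0, 1.0]])            # intrinsic matrix from the experiments

def rodrigues(r):
    """Rotation matrix from an axis-angle vector r (Rodrigues' formula)."""
    t = np.linalg.norm(r)
    if t < 1e-12:
        return np.eye(3)
    k = r / t
    Kx = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(t) * Kx + (1.0 - np.cos(t)) * (Kx @ Kx)

def residuals(params, Xs, ps):
    """Stacked reprojection residuals p_n - rho(K (R X_n + T)), cf. Eq. (17)."""
    R, T = rodrigues(params[:3]), params[3:]
    q = (Xs @ R.T + T) @ K.T               # pi(X, R, T) for every point
    return (ps - q[:, :2] / q[:, 2:3]).ravel()

def refine_pose(params0, Xs, ps, iters=20, eps=1e-6):
    """Gauss-Newton minimization of the reprojection error, cf. Eq. (18),
    with a finite-difference Jacobian for brevity."""
    params = np.asarray(params0, dtype=float).copy()
    for _ in range(iters):
        r = residuals(params, Xs, ps)
        J = np.empty((r.size, 6))
        for j in range(6):
            d = np.zeros(6)
            d[j] = eps
            J[:, j] = (residuals(params + d, Xs, ps) - r) / eps
        params += np.linalg.lstsq(J, -r, rcond=None)[0]
    return params
```

Starting from the HD-based initial pose, each Gauss-Newton step solves a small linear least-squares problem in six unknowns, which is why the refinement adds little to the per-frame cost.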

Experimental Results
Figure 2 shows three template patches of a cube box used as the OOI in the experiments. For each template patch, the four corners of the image were set as the control points used in the proposed algorithm. The camera used in the experiments was a Logitech HD Pro Webcam C920, which provides images with a size of 1280 × 720 pixels. The camera was calibrated, and its intrinsic matrix, used in Equations (13) and (16), is given by:
$$K = \begin{bmatrix} 935.4444 & 0 & 642.7415 \\ 0 & 935.1667 & 360.9098 \\ 0 & 0 & 1 \end{bmatrix}.$$
For evaluating the estimation performance of the proposed algorithm, the following experiments focus on four issues: (1) pose estimation testing; (2) quantitative evaluation; (3) computational efficiency; and (4) multi-object pose tracking.


Pose Estimation Results

Figure 3 shows the experimental results of the proposed object pose estimation algorithm in tracking the OOI with the three templates shown in Figure 2. In this experiment, the OOI was first rotated around the y-axis in a clockwise direction, as shown in Figure 3a-c. Hence, the 3D pose of the OOI was estimated by continuously tracking the templates from No. 1 to No. 3. Next, the OOI was rotated around the y-axis in a counterclockwise direction, as shown in Figure 3d-g. The 3D pose of the OOI was then estimated by tracking templates No. 3, No. 1 and No. 2, successively. Finally, the OOI was rotated back to the initial pose and then rotated around the z-axis in a counterclockwise direction, as shown in Figure 3h. From the 3D pose estimation results shown in Figure 3, it is clear that the value of Ry first increases from 0° to about 120° and then decreases in steps to about −95°. After that, the value of Ry returns to about 0°, and the value of Rz then decreases from 0° to about −37°. These pose estimation results are consistent with the actual motion trajectory of the OOI in the experiment. Therefore, the pose estimation performance of the proposed algorithm for an OOI with a multi-plane structure is validated.


Quantitative Evaluation
To evaluate the estimation accuracy of the proposed algorithm quantitatively, a protractor and a ruler were used in the experiments to manually measure the orientation angle and the translation distance of the target along each axis, respectively. These measurements were used as the ground truth of the target poses. Figure 4 illustrates the translation estimation results of the proposed algorithm; the results for target translations along the x-, y- and z-axes are shown in Figure 4a1-h1, Figure 4a2-h2 and Figure 4a3-h3, respectively. From Figure 4a1-h1, the maximum absolute translation error (defined by |Ground Truth − Estimation|) is less than 0.8 cm along the x-axis. In Figure 4a2-h2, the maximum absolute translation error along the y-axis is about 1.0 cm when the target is close to the boundary of the image. In Figure 4a3-h3, the maximum absolute translation error along the z-axis is less than 0.6 cm. Therefore, the accuracy of the target translation estimation of the proposed algorithm is verified. As shown in Figure 5, the maximum absolute rotation error is less than 1.8 degrees about the x-axis, 1.9 degrees about the y-axis and 0.9 degrees about the z-axis. Hence, the accuracy of the target rotation estimation of the proposed algorithm is also verified.
Based on the above experimental results, two estimation error criteria were employed to quantitatively evaluate the estimation performance of the proposed algorithm:
$$E_R^{\Omega} = \left| R_{\Omega} - \hat{R}_{\Omega} \right|, \qquad E_T^{\Omega} = \left| T_{\Omega} - \hat{T}_{\Omega} \right|,$$
where $\Omega \in \{x, y, z\}$ denotes one of the three axes of the 3D Cartesian coordinate system; $R_{\Omega}$ and $T_{\Omega}$ denote the ground truths of the rotation angle and translation distance on the $\Omega$-axis, respectively; and $\hat{R}_{\Omega}$ and $\hat{T}_{\Omega}$ denote the corresponding estimates obtained from the proposed algorithm. Table 1 records the estimation errors of the experimental results shown in Figures 4 and 5, and its last row shows the average estimation error of the experiments. From Table 1, it is clear that the rotation estimation about the y-axis has the largest estimation error, about 1.64 degrees on average. By contrast, the average rotation estimation errors about the x- and z-axes are smaller than 0.82° and 0.23°, respectively. Moreover, the translation estimation errors along the three axes are all smaller than 0.33 cm on average. In other words, the average percentage error of the translation estimation results is lower than 3.3% when compared to the real dimensions of the object. This result is similar to that of the extended Kalman filter-based direct homography tracking method [23], which also reports a translational percentage error of 3.3%. This accuracy level is suitable for many practical applications, such as augmented reality. Therefore, the above quantitative analysis confirms that the proposed model-based object pose estimation algorithm can provide accurate and stable 3D pose estimation results for an OOI with a multi-planar structure.

Computational Efficiency
The proposed algorithm was implemented in C++ on a Windows 7 platform equipped with a 3.6-GHz Intel Core i7-4790 CPU and an NVIDIA Tesla C2050 GPU with 448 CUDA cores [31]. Table 2 tabulates the average processing time of each stage of the proposed model-based object pose estimation algorithm. From Table 2, it is clear that the object recognition stage costs the most processing time. However, once the OOI has been detected, the algorithm switches to the template tracking and 3D pose estimation processes, which together cost about 1.09 ms on average after tracker initialization. Therefore, the proposed algorithm achieves extremely high processing speeds once the OOI has been detected in the input image.

Multi-Object Pose Tracking
The proposed method can be extended to track multiple objects. Figure 6 shows the experimental results of the proposed method tracking three different objects simultaneously. From Figure 6, it is clear that the proposed method not only performs 3D pose tracking of multiple objects at a frame rate higher than 30 frames per second (FPS), but also overcomes the partial occlusion of object No. 3 during the tracking process. Moreover, the pose estimation results remain accurate when objects No. 1 and No. 2 change their poses. Therefore, these experimental results validate the tracking performance and accuracy of the proposed pose estimation algorithm. Two video clips of the experiment can be accessed through the webpages of [32,33].
Remark: From the above experimental results, we conclude that the advantages of the proposed method include real-time performance in tracking multiple targets and accurate pose estimation results. However, the proposed method still has some limitations; for example, it cannot track textureless objects or planar objects with homogeneous surfaces. Furthermore, unfavorable lighting conditions may significantly reduce its tracking performance.

Conclusions and Future Work
In this paper, a novel and efficient model-based object pose estimation algorithm is proposed to achieve accurate and stable pose tracking for an OOI with a multi-planar structure, a property useful in many computer vision applications. Thanks to GPU acceleration and template tracking technologies, the proposed algorithm can efficiently and robustly detect and track the target in real time. Based on the template tracking result, the 3D pose of the OOI is obtained accurately from the proposed model-based pose estimator, which combines an HD-based initial pose solver and a model-based PnP solver. Experimental results show that the maximum estimation error of the proposed algorithm occurs when estimating the rotation about the y-axis and is about 1.64° on average; otherwise, the proposed algorithm provides accurate pose estimation results. The rotation estimation errors about the x- and z-axes are about 0.81° and 0.22° on average, respectively, and the translation estimation errors along the three axes are all smaller than 0.33 cm on average. Moreover, the entire system achieves an extremely high processing speed on a desktop computer equipped with a GPU accelerator. These advantages significantly increase the applicability of the proposed algorithm in practical applications.
In the future, a robust visual tracker can be integrated with the template tracker to improve the tracking robustness of the system against external uncertainties during the pose tracking process. Furthermore, it is also worthwhile to combine a robust object detection method, such as a deep learning-based object detector [34], with the proposed algorithm to detect the target in complex scenes, which would help to improve the robustness and computational efficiency of the proposed algorithm in practical applications.


Figure 1. System framework of the proposed model-based object pose estimation algorithm.

Figure 2. Three template patches of the Object-Of-Interest (OOI) used in the experiments.
Figure 3 shows the experimental results of the proposed object pose estimation algorithm in tracking the OOI with the three templates shown in Figure 2. In this experiment, the OOI was first rotated around the y-axis in a clockwise direction, as shown in Figure 3a-c. Hence, the 3D pose of the OOI was estimated by continuously tracking the templates from No. 1 to No. 3. Next, the OOI was rotated around the y-axis in a counterclockwise direction, as shown in Figure 3d-g. The 3D pose of the OOI was then estimated by tracking the templates No. 3, No. 1 and No. 2, successively.

Figure 3. Pose estimation results of the proposed algorithm in tracking the OOI with the three templates shown in Figure 2: 3D pose estimation by tracking (a) the template No. 1, (b) the two templates No. 1 and No. 3, (c) the template No. 3, (d) the two templates No. 1 and No. 3 again, (e) the template No. 1, (f) the two templates No. 1 and No. 2, (g) the template No. 2 and (h) the template No. 1 with a rotation on the z-axis.

Figure 4 illustrates the translation estimation results of the proposed algorithm. The experimental results of target translations along the x-, y- and z-axes are shown in Figure 4a1-h1, Figure 4a2-h2 and Figure 4a3-h3, respectively. From Figure 4a1-h1, it is clear that the maximum absolute translation error (defined by |Ground Truth − Estimation|) is less than 0.8 cm on the x-axis.
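The error metric used above can be computed directly from the ground-truth and estimated trajectories. The helper below is a hypothetical illustration (the function name and the sample values are ours, not from the paper) of the per-axis mean and maximum of |Ground Truth − Estimation| over a sequence of frames.

```python
import numpy as np

def pose_errors(ground_truth, estimates):
    """Per-axis absolute errors |ground truth - estimate| for a sequence of
    pose parameters (rows = frames, columns = axes). Returns (mean, max)."""
    err = np.abs(np.asarray(ground_truth, float) - np.asarray(estimates, float))
    return err.mean(axis=0), err.max(axis=0)

# Hypothetical x/y/z translations (cm) over three frames:
gt  = [[10.0, 5.0, 50.0], [10.5, 5.2, 50.5], [11.0, 5.5, 51.0]]
est = [[10.2, 5.1, 50.3], [10.4, 5.0, 50.9], [11.3, 5.4, 50.8]]
mean_err, max_err = pose_errors(gt, est)
```

The same helper applies unchanged to the per-axis rotation errors, with rows holding Euler angles in degrees instead of translations.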

Figure 5 presents the rotation estimation results of the proposed algorithm. Similarly, the experimental results of target rotation estimation around the x-, y- and z-axes are shown in Figure 5a1-h1, Figure 5a2-h2 and Figure 5a3-h3, respectively. From Figure 5, the maximum absolute rotation error occurs when estimating the rotation around the y-axis.

Figure 6. Experimental results of multi-object pose tracking of the proposed algorithm [32].

Table 1. Estimation errors of the proposed algorithm shown in Figures 4 and 5.

Table 2. Average processing times in each stage of the proposed algorithm.