An Online 3D Modeling Method for Pose Measurement under Uncertain Dynamic Occlusion Based on Binocular Camera

3D modeling plays a significant role in many industrial applications that require geometric information for pose measurement, such as grasping and spraying. Due to random pose changes of the workpieces on the production line, demand for online 3D modeling has increased and many researchers have focused on it. However, online 3D modeling has not been entirely solved, because uncertain dynamic objects occlude the target and disturb the modeling process. In this study, we propose an online 3D modeling method under uncertain dynamic occlusion based on a binocular camera. Firstly, focusing on uncertain dynamic objects, a novel dynamic object segmentation method based on motion consistency constraints is proposed, which achieves segmentation by random sampling and pose hypothesis clustering without any prior knowledge of the objects. Then, in order to better register the incomplete point cloud of each frame, an optimization method based on local constraints of overlapping view regions and a global loop closure is introduced. It establishes constraints in the covisibility regions between adjacent frames to optimize the registration of each frame, and also establishes them between global closed-loop frames to jointly optimize the entire 3D model. Finally, a confirmatory experimental workspace is designed and built to verify and evaluate our method. Our method achieves online 3D modeling under uncertain dynamic occlusion and acquires an entire 3D model. The pose measurement results further demonstrate its effectiveness.


Introduction
With the increasing demands for intelligent manufacturing and industrial automation, digital three-dimensional (3D) models of workpieces play an increasingly important role in production processes [1,2], such as grasping and spraying. Such a model refers to the whole 3D point cloud rather than the 2.5D point cloud in a single frame captured by a depth sensor.
3D modeling applied in factories can be performed offline or online [3][4][5]. Offline 3D modeling methods can achieve high accuracy and are often used in high-precision applications, such as defect detection. However, their efficiency is low and they involve ex situ processes. Online 3D modeling is often used to measure the pose of the workpiece as a standard guide [6][7][8]. For example, the pose of the workpiece is necessary for robotic spray painting, and the 3D model is an important prerequisite for pose measurement. Because random pose changes often occur during workpiece processing on the production line, online 3D modeling becomes more important for pose measurement. Moreover, an entire 3D model contains rich geometric information, which can guarantee the accuracy of pose measurement, whereas the 2.5D point cloud obtained from a single view is not sensitive to the shape of the workpiece, resulting in a decrease in pose accuracy. Although there are many online 3D modeling methods, online 3D modeling under occlusion has not been entirely solved. Among related approaches, one line of work performs background reconstruction based on ElasticFusion [24]. PoseFusion [20] mainly aims at scenarios of daily human life: it uses the OpenPose network [25] to identify human joints as priors, performs a minimum cut to propose human body regions, and reconstructs the background. These methods are suitable for scenes where the objects are known beforehand. However, the number of object samples in many scenarios often does not meet the training requirements of learning-based methods, and the offline and online appearances of objects differ. In conclusion, learning-based methods rely on priors and have limited adaptability, and the first two methods are weak in terms of robustness. The above methods cannot meet the needs of 3D modeling under uncertain dynamic occlusion.
In this paper, aiming at the uncertain dynamic occlusion problem, an online 3D modeling method based on a single binocular camera is proposed. This method does not rely on prior information about the objects, and achieves robust segmentation of feature points and the corresponding point clouds (2D-3D) solely by analyzing the motion patterns of adjacent frames. By exploring the distribution of pose hypotheses calculated from 2D-3D point pairs obtained by multiple random sampling, the 2D-3D point pairs corresponding to dense regions of the pose distribution are chosen as the segmentation results. Initial poses between adjacent frames can be acquired from these 2D-3D point pairs. On this basis, considering that the point cloud of the occluded target object obtained after segmentation is incomplete, an optimization method based on local constraints of covisibility regions and global loop closure is proposed to optimize the local registration and the entire 3D model. Furthermore, we design and build an experimental workspace to verify the feasibility and effectiveness of our method.

Methods
Aimed at the task of 3D modeling under uncertain dynamic occlusion, we propose an online 3D modeling method, as shown in Figure 1. It mainly includes four steps: raw data processing and data association, dynamic object segmentation, 3D modeling of the target object and dynamic objects, and closed-loop optimization (bundle adjustment). For the original images captured by the binocular camera, stereo rectification [26] is performed to align the epipolar lines of the left and right images. Then, the depth information of a single frame can be calculated by disparity estimation methods. After that, data association between adjacent frames is established through feature point matching, image block correlation, or auxiliary texture. That is to say, it is necessary to obtain the image point-image point correspondences (2D-2D) and the image point-3D point correspondences (2D-3D) between adjacent frames. On this basis, dynamic object segmentation is carried out to divide these 2D-3D correspondences into different objects with different motions. The initial pose of each object is estimated from these 2D-3D correspondences. Subsequently, the initial poses and points are optimized by local constraints between adjacent frames. The constraints between multiple closed-loop frames with covisibility regions are also added to the optimization. Finally, the poses and 3D models of the target object and dynamic objects are acquired. In the following, uncertain dynamic object segmentation and 3D modeling based on closed-loop optimization are described in detail.

Figure 1. An illustration of our pipeline. (1) The vision system captures left and right images. Stereo rectification is first performed to align the epipolar lines of the left and right images by rotating the two images. The depth of the points can be calculated by disparity. (2) When two consecutive frames arrive, data association is established by Grid-based Motion Statistics (GMS) and image block correlation.
Then, object segmentation is carried out by random sampling and pose hypothesis clustering. (3) The segmentation results, which include initial poses and 2D-3D point pairs, are sent to the 3D modeling part. It includes the target object tracking module and the dynamic object tracking module. The target object poses (namely, the camera poses) and the dynamic object motions are estimated and optimized. (4) Finally, local BA and global BA are carried out to optimize and update these variables. The outputs of our pipeline are the poses of the target and dynamic objects, a 3D model, and the 3D point cloud of the dynamic objects.
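For a rectified stereo pair, the depth recovery in step (1) follows the standard relation Z = fB/d, with focal length f in pixels, baseline B, and disparity d in pixels. A minimal sketch (the function name is ours):

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair.

    disparity:  array of disparities in pixels (valid where d > 0)
    focal_px:   focal length in pixels
    baseline_m: stereo baseline in metres
    """
    d = np.asarray(disparity, dtype=float)
    z = np.full_like(d, np.inf)          # zero disparity -> point at infinity
    valid = d > 0
    z[valid] = focal_px * baseline_m / d[valid]
    return z

# A 7-pixel disparity with f = 700 px and B = 0.10 m gives Z = 10 m.
print(depth_from_disparity([7.0, 14.0], 700.0, 0.10))  # [10.  5.]
```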


Object Segmentation by Clustering
Model for Dynamic Objects Segmentation. Given two consecutive images, the feature point correspondences {p_i} and {q_i} can be obtained first. For the dynamic object segmentation problem, we need to divide the feature point correspondences so that the feature points in each divided point set conform to one motion model, the total number of motion models is as small as possible, and the number of points in each set is as large as possible. Therefore, the multiple dynamic object segmentation problem can be modeled as

    min_{n_τ, {S_j, M_S_j}}  n_τ + Σ_{j=1}^{n_τ} ( V(S_j) + Σ_{p_i ∈ S_j} r(p_i, M_S_j) ),    (1)

where S_j is the j-th segmented feature point set, M_S_j means the motion model satisfied by the points in S_j, n_τ denotes the number of divided objects, and V(S_j) represents the reciprocal of the number of points in S_j. r(p_i, M_S_j) represents the reprojection error of a point p_i on the motion model M_S_j. It is defined as follows:

    r(p_i, M_S_j) = || q_i − (1/s_i) K (R P_i + t) ||,    (2)

where s_i is the depth of p_i and P_i is a 3D point that can be computed by triangulation. This model combines the complexity of the segmentation, the number of inliers, and the motion consistency of each object, where n_τ reflects the complexity of the segmentation and V(S_j) implies the number of inliers of each set. Motion consistency means that feature points belonging to the same object satisfy the same epipolar constraint:

    q_i^T K^{−T} t^∧ R K^{−1} p_i = 0,    (3)

where K refers to the camera intrinsic matrix and t^∧ means the antisymmetric matrix of the translation vector t. [R, t] is the pose transformation matrix. The best segmentation {S_j^*}, j = 1, ..., n_τ^*, can be obtained by solving (1). In order to solve (1), we propose a segmentation method based on cluster analysis below. Cluster Analysis. A pose hypothesis can be obtained by one sampling in the point correspondences, where one sampling means sampling the smallest set of point pairs that is sufficient to compute a pose hypothesis. The pose hypotheses calculated from point pairs sampled on the same object must coincide in the pose parameter space.
Based on this idea, cluster analysis can be achieved by finding the dense regions in the pose hypotheses, and the object segmentation can be acquired.
(1) Pose Hypothesis Calculation. To compute the pose hypothesis M_S_j, we consider the fundamental matrix F and the homography matrix H. We use the eight-point method [27] and the four-point method [28] to calculate F and H, respectively. Therefore, we have ξ = g(P, Q), where ξ ∈ se(3) is the Lie algebra form of M_S_j, P and Q are the sampled matching point sets, and g(·, ·) represents the algorithm for solving the pose. We need to select the method according to its applicability, limitations, and possible degeneracies [29]. In this paper, because the depth of the points can be obtained, the perspective-n-point (PnP) method [30] is adopted to calculate the pose hypotheses. Here, the PnP algorithm requires at least four 2D-3D point pairs to compute a pose.
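As a self-contained illustration of how one minimal sample yields one pose hypothesis, the sketch below substitutes a closed-form 3D-3D alignment (Kabsch/Umeyama) for the PnP solver; this substitution is possible here because depth is available in both frames. All function names are ours, and a production system would call an established PnP implementation instead:

```python
import numpy as np

def pose_hypothesis(P, Q):
    """Rigid pose [R|t] aligning a minimal sample P -> Q (both (n,3), n >= 3)
    via the Kabsch/Umeyama closed form. A stand-in for the PnP solver in the
    text: with depth available in both frames, 3D-3D alignment also yields a
    pose hypothesis from a minimal sample."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                         # proper rotation (det = +1)
    t = cq - R @ cp
    return R, t

def sample_hypotheses(P_all, Q_all, n_samples=200, k=4, rng=None):
    """Random minimal sampling: each draw of k correspondences gives one
    pose hypothesis; hypotheses sampled on the same rigid object cluster
    together in the pose parameter space."""
    rng = np.random.default_rng(rng)
    hyps = []
    for _ in range(n_samples):
        idx = rng.choice(len(P_all), size=k, replace=False)
        hyps.append(pose_hypothesis(P_all[idx], Q_all[idx]))
    return hyps
```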
(2) Pose Hypotheses Clustering. Based on these pose hypotheses, the clustering algorithm is carried out. Euclidean distance in the Lie algebra coordinate system is adopted to measure the distance between two pose hypotheses.
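The Euclidean distance in Lie-algebra coordinates can be approximated by stacking a rotation vector (obtained from the matrix logarithm) with the translation. A hedged numpy sketch, with names ours; the exact se(3) logarithm also warps the translation by the inverse left Jacobian, which this common approximation omits:

```python
import numpy as np

def rotation_log(R):
    """Rotation vector (axis-angle) of a rotation matrix via the matrix log.
    Valid away from the 180-degree singularity."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:                 # near identity: first-order approximation
        return v / 2.0
    return theta / (2.0 * np.sin(theta)) * v

def pose_distance(R1, t1, R2, t2, w_rot=1.0, w_trans=1.0):
    """Euclidean distance between two pose hypotheses in stacked
    (rotation-vector, translation) coordinates, the kind of metric used to
    compare hypotheses during clustering."""
    d_rot = np.linalg.norm(rotation_log(np.asarray(R1)) - rotation_log(np.asarray(R2)))
    d_trans = np.linalg.norm(np.asarray(t1) - np.asarray(t2))
    return np.hypot(w_rot * d_rot, w_trans * d_trans)
```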
Considering that the number of objects in the scene is unknown, the clustering method proposed in [31] is introduced. This method can not only automatically and accurately determine the number of cluster centers, but is also very fast. After clustering, we select the neighborhood of each cluster center as a region Ω_l. Usually, the number of regions is the same as the number of objects. We only need to re-use the 2D-3D point pairs that generated the pose hypotheses in each region Ω_l to compute a new pose, which is the pose of the object. However, due to the randomness of sampling, there may be regions made up of noise. To remove noise regions, we adopt the idea of the ordered residual kernel (ORK) [32]. We sort the number η_l of poses in each region Ω_l in descending order {η_l1, η_l2, ..., η_lL}, where ∀k: η_lk ≥ η_lk+1. For one region Ω_lk with η_lk pose hypotheses, the 2D-3D point pairs used to calculate the pose hypotheses ξ_i in Ω_lk are taken out and considered as effective points: (P_lk, Q_lk) = G(ξ_i), i = 1, 2, ..., η_lk, where ξ_i denotes a pose hypothesis in Ω_lk and G(·) means taking out the 2D-3D point pairs (P_lk, Q_lk) that calculated all of the hypotheses in Ω_lk. For (P_lk, Q_lk), the number of point pairs from different objects has already been significantly reduced; in other words, such an operation causes most of these 2D-3D point pairs to come from the same object. We carry out multiple sampling in (P_lk, Q_lk) once more and calculate pose hypotheses. The clustering algorithm in [31] is used again to find cluster centers. The regions with only one cluster center are chosen, and the regions with more than one cluster center are rejected. Finally, each remaining region corresponds to an object. The pose of each object can be calculated from the 2D-3D point pairs corresponding to the pose hypotheses in its region. The above process is summarized as Algorithm 1.
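The clustering step of [31] (clustering by fast search and find of density peaks) can be sketched as follows. For brevity this version takes the number of centres as an argument, whereas [31] selects it automatically from the decision graph of rho times delta; the names are ours:

```python
import numpy as np

def density_peaks(X, d_c, n_centers):
    """Density-peaks clustering (Rodriguez & Laio): rho_i = number of
    neighbours within d_c; delta_i = distance to the nearest point of higher
    density. Points with large rho*delta are cluster centres; every other
    point takes the label of its nearest higher-density point. Assumes the
    globally densest point is among the centres, which holds for well
    separated clusters."""
    X = np.asarray(X, float)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    rho = (D < d_c).sum(1) - 1                   # exclude the point itself
    order = np.argsort(-rho)                     # indices by descending density
    delta = np.empty(len(X))
    nearest_higher = np.empty(len(X), dtype=int)
    delta[order[0]] = D[order[0]].max()          # densest point: max distance
    nearest_higher[order[0]] = order[0]
    for rank, i in enumerate(order[1:], start=1):
        higher = order[:rank]
        j = higher[np.argmin(D[i, higher])]
        delta[i], nearest_higher[i] = D[i, j], j
    centers = np.argsort(-(rho * delta))[:n_centers]
    labels = np.full(len(X), -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                              # assign in density order
        if labels[i] < 0:
            labels[i] = labels[nearest_higher[i]]
    return centers, labels
```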

Algorithm 1. Pose Estimation for Each Cluster Region Ω_lk
Input: feature points (P_lk, Q_lk) taken out from Ω_lk.

3D Modeling Based on Graph Optimization
Notation. We take the coordinate system of object O_1 as the reference coordinate system and illustrate the notations in Figure 2. Each object coordinate system is established at the centroid of its object. Let T_C^k, T_O_l^k ∈ SE(3) denote the poses of the camera C and object O_l in the O_1 coordinate system at the k-th frame, respectively, where k ∈ F, l indexes the l-th object, l = 2, ..., N_L, l ≠ 1, and N_L is the total number of objects tracked in the k-th frame. In the k-th frame I_k, an image feature point on O_1 is denoted by p_i^k; it is a homogeneous coordinate representation. Similarly, an image feature point on O_l in I_k is represented by p_l,j^k, j = 1, 2, ..., N. P_i^k and P_l,j^k mean the i-th 3D point on O_1 and the j-th 3D point on O_l in the O_1 coordinate system in frame I_k, respectively. T_C^k and T_O_l^k are the pose transformation matrices of the camera and object O_l in the O_1 coordinate system at I_k, where the camera pose is the pose of object O_1 itself. We use the notation T_O_l^(k-1,k) to term the motion of O_l from I_(k-1) to I_k.

Figure 2. Notation and coordinate system. The big black and green circles represent a 3D point of O_1 and a 3D point of O_l in the O_1 coordinate system. Small black and green circles denote the corresponding observations in the images of adjacent frames. We establish the camera coordinate system at the camera optical center and the object coordinate system at the centroid of the object.
Constraints and Factor Graph Establishment. After object segmentation, the 2D-3D corresponding point pairs and the initial pose of each object can be obtained. Because which object is the target object is unknown, all objects need to be estimated. Moreover, using only these initial poses cannot register the point clouds well, because the point cloud of each frame is incomplete due to occlusion. Therefore, an optimization method based on local constraints and global loop closure is introduced to acquire accurate poses.
(1) Reprojection Constraints. For a spatial 3D point P_i^(k-1) on O_1 in the O_1 coordinate system, according to the camera projection model, we have

    p_i^(k-1) = π( K [ (T_C^(k-1))^(-1) P̃_i^(k-1) ]_1:3 ),

where π(·) is the projection function, K is the 3 × 3 camera intrinsic matrix, and [·]_1:3 means taking the first three rows of the matrix. Over the index set of corresponding point pairs, we can compute the reprojection error as the difference between each observed feature point and the projection of its associated 3D point. For a 3D point in the camera coordinate system, no matter how its object moves, on account of the object motion consistency, the coordinates of the 3D point remain fixed in the object coordinate system.
(2) Object Motion Constraints. On the basis of the object motion T_O_l^(k-1,k), the 3D point in I_(k-1) after this motion transformation should be the same as the corresponding 3D point in I_k. Accordingly, we establish the object motion constraints:

    P̃_l,j^k = T_O_l^(k-1,k) P̃_l,j^(k-1).

That is to say, the moving speed of a dynamic object between adjacent frames is roughly the same.
All of the above constraints are added to the factor graph. The variables that need to be optimized are the camera poses, the object motions, and the 3D points of each object. The resulting objective (14) can then be solved using the Levenberg-Marquardt method [33] to optimize the factor graph.
Here, we not only establish the above constraint relationships between adjacent frames, but also establish constraints between closed-loop frames with covisibility regions, so as to further optimize relevant variables. After optimization, 3D models and object poses can be acquired at last.
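The Levenberg-Marquardt refinement can be illustrated on a toy problem. The sketch below optimises a 2D rigid alignment with the damped normal equations, standing in for the full SE(3) factor-graph optimisation that g2o performs in the pipeline; all names are ours:

```python
import numpy as np

def lm_rigid2d(p, q, iters=50, lam=1e-3):
    """Levenberg-Marquardt on a toy 2D rigid-alignment residual
    r_i = q_i - (R(theta) p_i + t). Illustrates the damped normal equations
    (J^T J + lam*I) dx = -J^T r used by factor-graph solvers; the real
    pipeline optimises SE(3) poses and 3D points jointly."""
    x = np.zeros(3)                            # parameters [theta, tx, ty]

    def residual(x):
        c, s = np.cos(x[0]), np.sin(x[0])
        R = np.array([[c, -s], [s, c]])
        return (q - (p @ R.T + x[1:])).ravel()

    for _ in range(iters):
        r = residual(x)
        J = np.empty((r.size, 3))              # numerical Jacobian
        eps = 1e-6
        for k in range(3):
            dx = np.zeros(3)
            dx[k] = eps
            J[:, k] = (residual(x + dx) - r) / eps
        H = J.T @ J + lam * np.eye(3)
        step = np.linalg.solve(H, -J.T @ r)
        x_new = x + step
        if residual(x_new) @ residual(x_new) < r @ r:
            x, lam = x_new, lam * 0.5          # accept: trust the model more
        else:
            lam *= 10.0                        # reject: damp harder
        if np.linalg.norm(step) < 1e-10:
            break
    return x
```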

Results
Our method was verified on an experimental workspace built by ourselves. We modeled three objects under uncertain dynamic occlusion in the workspace and used the modeling results to estimate poses. The three objects were a complex surface part (W_a), a 3D printed block (W_b), and an earphone case (W_c). Their sizes were 39.52 mm × 24.61 mm × 38.00 mm, 40.00 mm × 36.00 mm × 30.47 mm, and 47.10 mm × 55.24 mm × 28.67 mm, respectively. In the following experiments, we use only W_a as the illustrative example.

Introduction of Confirmatory Experimental Workspace
As shown in Figure 3, our workspace was mainly composed of a machining system, a positioner, and a vision system. The machining system consisted of a robotic arm UR5 and an end effector. Tools or other objects were clamped by the end effector and the robotic arm drove tools to perform the related operations, which could be grasping, spraying, etc. Of course, there were many kinds of tools that were not known in advance. The positioner involved a turntable and a rotating shaft. The target object was installed in the slot of the turntable. These two devices could drive the target object to change poses and they consisted of two high-precision servo motors that enabled accurate rotational control.
Figure 3. The structure of the workspace. The vision system mainly included two symmetrically arranged high-resolution cameras and an auxiliary texture device in the middle. The workpiece was placed in the white slot on the turntable. The turntable and rotating shaft could change the pose of the object. Tools could be clamped by the end effector and be moved by the robotic arm.
The vision system mainly included two high-resolution Basler cameras acA1920-50 gm (2.3 million pixels) and an auxiliary texture device. The two cameras were symmetrically arranged with a large baseline. The purpose was to improve the accuracy of depth measurements. By adjusting the focal length, the two cameras could focus on the central axis of the turntable. The auxiliary texture device was installed between the two cameras. It needed to be synchronized with the cameras to provide texture when untextured objects were measured. After the cameras and the auxiliary texture device were installed and fixed, the calibration of the intrinsic and extrinsic parameters of the two cameras was performed. The stereo calibration method [34] was adopted to acquire these parameters. After that, stereo rectification [26] was implemented to align the epipolar lines of the left and right images. On this basis, the depth information in a single frame could be obtained by disparity or depth estimation methods. The texture of the objects or the auxiliary texture could be used to acquire point correspondences. The camera was also mounted on a platform controlled by servo motors and could be driven to move.
In our experiment, the camera was fixed and the target object was rotated by a positioner during modeling. To test the effectiveness of the proposed method, we kept the tool moving between the target object and the camera so that the target object was occluded throughout the whole process. The auxiliary texture was utilized to obtain the corresponding points of the left and right images. The resolution of captured images was 1920 × 1200. Since the background was removed, the actual maximum resolution of the image occupied by the target object and tool did not exceed 1000 × 1200. The vision system continuously observed the object and generated 2.5D point clouds within the field of view.

Demonstration of the Clustering Process
Dynamic object segmentation can be implemented by the algorithm described in Section 2.1. Before segmentation, the image corresponding points (2D-2D) of the current frame and the last frame are first obtained using the Grid-based Motion Statistics (GMS) method [35] and image block correlation [36]. Because 3D points corresponding to the image points of the last frame are known, 2D-2D point pairs can be converted into the correspondences between the image points in the current frame and the 3D points in the last frame (2D-3D point pairs). Based on these 2D-3D point pairs, the segmentation method is carried out.
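Image block correlation can be sketched with zero-mean normalised cross-correlation (NCC). This brute-force version (names ours) illustrates the association idea; GMS [35] is a published feature-match filter and is not reimplemented here:

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalised cross-correlation of two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def match_block(img, template):
    """Exhaustive block-correlation match: slide `template` over `img` and
    return the top-left (row, col) offset with the highest NCC score. In
    practice the search would be restricted to a window around a prediction."""
    th, tw = template.shape
    H, W = img.shape
    best, best_yx = -2.0, (0, 0)
    for y in range(H - th + 1):
        for x in range(W - tw + 1):
            score = ncc(img[y:y + th, x:x + tw], template)
            if score > best:
                best, best_yx = score, (y, x)
    return best_yx, best
```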
In order to present the clustering process more graphically, we chose the translation vector of the pose transformation to represent the pose hypothesis. The 3D pose parameter space is displayed in the 3D Cartesian coordinate system. The translations of the camera and objects have a real scale. The intermediate results of the segmentation of W a and the tool are shown in Figure 4. Red circles and blue circles in each grayscale image represent the points on W a and the tool, respectively. The black part in the grayscale image means the background has been removed. In the pose parameter space, we use shades of color to show the distribution of pose hypotheses. The darker the color (blue), the denser it is. The lighter the color (yellow), the sparser it is. It can be seen from Figure 4 that there are two obvious cluster centers, which fully reflect the existence of two objects with inconsistent motion. The 2D-3D point pairs of each object are acquired. In our experiment, dynamic object segmentation is only performed at the beginning and when the number of points on the objects is less than a certain threshold. In other frames, the points on each object are separately tracked.

Modeling Results and Running Time
After object segmentation, the initial pose of each object can be acquired. Then, the factor graph optimization described in Section 2.2 is implemented to refine the poses of each object. The outputs are the camera poses and the 3D models of the target object and dynamic objects, as shown in Figure 5. Usually, we take the object with the largest number of 2D-3D point pairs as object O_1. In our experiment, the target object has the largest number of point pairs, so it is object O_1, and the camera poses are the poses of the target object point clouds. In Figure 5, all results are represented in the reference coordinate system. The blue line denotes the trajectory of the camera. Each magenta pyramid denotes the 6DOF pose of each frame. For clearer presentation, we display the images of the 5th, 15th, 21st, and 32nd frames captured by the camera. In each image, the green and blue marks denote the tracked feature points on the target object and the tool, respectively. The 3D model of W_a is shown in the center of Figure 5, surrounded by the 3D points of the tool. The different colors of the tool's 3D point clouds indicate the observations of the tool in different frames. With the movement of the target object and tool, occluded parts are observed and modeled in other frames, as shown in the middle of the target object's 3D model with different colors. The different colors on the 3D model correspond to the colors of the tool. This demonstrates that the parts occluded in some frames are observed in others. It can be seen that the occluded parts are well modeled after optimization.
We also tested the running time of optimization. The experiments were performed on an Intel Core i7-8700 3.2 GHz desktop with 24 GB RAM. For three objects W a , W b , W c to be modeled, we counted the time consumed by each important part of the program in each image frame, and calculated the average running time, as shown in the first four rows of Table 1. The last row of Table 1 gives the time for closed-loop optimization of all frames when there are closed-loop frames. In our experiment, we tracked 2000 points on the target object and 500 points on the moving tool. Local BA and global BA were implemented using the g2o [37] library.

Comparison Experiment (Accuracy Evaluation)
In order to evaluate the registration accuracy of the incomplete point cloud of each frame in the case of uncertain dynamic occlusion, we compared our method with three methods. One was an incremental open-loop registration method: the method in [38] was utilized to calibrate the extrinsic parameters between the rotation axis and the camera, and then the point clouds were registered. The other two methods were the classic point cloud registration method Iterative Closest Point (ICP) [39] and an advanced method, Correntropy-based Bidirectional Generalized Iterative Closest Point (CoBigICP) [40]. These three methods are good registration methods when there are no moving objects; however, they cannot handle 3D modeling under uncertain dynamic occlusion. Specifically, when we adopted the latter two methods to stitch point clouds that included dynamic objects, we found that they often did not converge or converged to wrong solutions due to the presence of dynamic objects, which indicates the necessity of uncertain dynamic object segmentation. Therefore, we first used our object segmentation method to segment the target object and moving tool in each frame, and then stitched the incomplete point clouds of the target object with the above three methods. Finally, the registration results of the three objects W_a, W_b, W_c obtained using the three methods and our method were compared with their ground truths (obtained by a high-precision 3D camera with a nominal accuracy of 0.02 mm). The ground truths were a set of single-frame 2.5D point clouds captured by the high-accuracy 3D camera from multiple perspectives. The model obtained by each method was aligned with the ground truths. Then, the registration errors (root mean squared error of the Euclidean distance of each point in a 3D model to its nearest point in the ground truth) and the uncertainty (standard deviation) were calculated, as shown in Table 2, where the values in brackets are the corresponding standard deviations.
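The registration error metric described above (RMSE of nearest-neighbour distances, plus their standard deviation) can be computed as in this sketch (function name ours):

```python
import numpy as np

def registration_rmse(model, gt):
    """RMSE of each model point's Euclidean distance to its nearest
    ground-truth point, plus the standard deviation of those distances.
    Brute-force nearest neighbour; a k-d tree is preferable for large
    point clouds."""
    model, gt = np.asarray(model, float), np.asarray(gt, float)
    D = np.linalg.norm(model[:, None, :] - gt[None, :, :], axis=2)
    d = D.min(axis=1)                      # nearest-neighbour distances
    return np.sqrt((d ** 2).mean()), d.std()
```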
It can be seen from the table that the performance of our method was better than that of the other three methods.
The pose measurement results for the target object were also evaluated. The initial entire 3D model was first acquired. Then, the target object was driven by the turntable to perform pre-defined relative pose transformations. After each pose transformation, the 3D model was acquired again at the transformed position and was registered with the initial entire 3D model to estimate the relative pose transformation. The pre-defined pose transformations could be regarded as ground truths of the pose transformations between two positions. Here, we provide three pre-defined relative pose transformations T_k = [θ_k, t_k] in the camera coordinate system, where k = 1, 2, 3, as shown by the ground truths (GT) in Table 3. θ_k = [θ_kx, θ_ky, θ_kz] denotes the Euler angles about the xyz-axes (measured in degrees) and t_k = [t_kx, t_ky, t_kz] denotes the translation (measured in millimeters). These three pre-defined relative pose transformations are T_1 = [0°, 10°, 0°, −10, 0, 50], T_2 = [0°, −10°, 0°, 0, 0, −50], and T_3 = [0°, 20°, 0°, −20, 0, 80], respectively. At each pre-defined pose, we performed 3D modeling multiple times and measured the relative pose transformations. In this experiment, the expected translation error and angle error were within 0.1 mm and 0.4 degrees, respectively. We selected the pose measurement results of W_a corresponding to the pre-defined poses T_1 and T_3, and show the error of each measurement value with respect to the ground truth in Figure 6. Figure 6a,b shows the rotation and translation errors corresponding to T_1, and Figure 6c,d shows those corresponding to T_3. The angle errors are within 0.4 degrees and the translation errors are within 0.1 mm, with only a few results exceeding these bounds. Our method performed well. The pose measurement results of the three target objects W_a, W_b, W_c at the pre-defined positions transformed by T_k are given in the form of statistical values in Table 3.
In Table 3, θ̄ and σ_θ denote the average and standard deviation (uncertainty) of the measured Euler angles, θ_med includes the median values of θ_kx, θ_ky, and θ_kz in the measured results, and |Δθ|_max and |Δθ|_mean are the maximum and mean of the absolute errors of θ_kx, θ_ky, and θ_kz, where Δθ = θ̃ − θ, and θ̃ and θ are the measurement value and the ground truth, respectively. Similarly, t̄ and σ_t denote the average and standard deviation (uncertainty) of the measured translations, t_med includes the median values of t_kx, t_ky, and t_kz, and |Δt|_max and |Δt|_mean are the maximum and mean of the absolute errors of t_kx, t_ky, and t_kz, where Δt = t̃ − t. As can be seen from Table 3, the average values of the pose measurements are very close to the ground truths. Although the maximum absolute errors are large, the standard deviations, median values, and mean absolute errors show that the measurement results are relatively stable.
Table 3. Averages, standard deviations, median values, maximum absolute errors, and mean absolute errors of the pose measurement results in the camera coordinate system. T_1, T_2, T_3 are three pre-defined relative pose transformations. The 3D modeling is carried out at the initial position and at the positions transformed by these three pre-defined relative pose transformation matrices. The pose transformations between two positions can be measured with these 3D models and compared with the ground truths T_1, T_2, T_3.
In order to more vividly show the registration results of the incremental open-loop registering method, ICP, and our method, we demonstrate the results for W_a in Figure 7. Figure 7a is the result of the incremental open-loop registering method. Due to accumulated errors, the 3D model does not close well, and a gap can be seen on the edge of the model. The registration result of the ICP method is displayed in Figure 7b. ICP does not perform well; a possible reason is that W_a has a complex curved structure and ICP tends to converge to a local optimum. As can be seen from the figure, the accumulation of errors during registration makes it difficult for the later point clouds to register well with the earlier ones. Our method performs well, as shown in Figure 7c.
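The per-axis statistics reported in Table 3 (mean, standard deviation, median, and maximum/mean absolute error against a pre-defined ground truth) are straightforward to reproduce. A short sketch, assuming `measured` holds N repeated measurements of one pose component (Euler angles or translations) as an (N, 3) array:

```python
import numpy as np

def pose_error_stats(measured, ground_truth):
    """Per-axis statistics of repeated pose measurements against a
    pre-defined ground truth: mean, sample standard deviation
    (uncertainty), median, and maximum / mean absolute errors."""
    measured = np.asarray(measured, dtype=float)
    errors = measured - np.asarray(ground_truth, dtype=float)
    return {
        "mean": measured.mean(axis=0),
        "std": measured.std(axis=0, ddof=1),
        "median": np.median(measured, axis=0),
        "max_abs_err": np.abs(errors).max(axis=0),
        "mean_abs_err": np.abs(errors).mean(axis=0),
    }
```

Reporting the median and mean absolute error alongside the maximum, as in Table 3, makes clear when a large worst-case error comes from a few outliers rather than systematic drift.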

Discussion
In this paper, we propose a segmentation method based on motion consistency constraints that requires no prior knowledge, achieving robust segmentation solely by analyzing the motion patterns between adjacent frames. The demonstration experiment on the clustering process confirms that pose hypotheses calculated from point pairs sampled on the same object must coincide in the pose parameter space, and the number of clustering centers represents the number of possible dynamic objects with inconsistent motion. Compared with deep learning-based methods, our method requires no object samples for training, and compared with point cloud clustering-based and optical-flow-based methods, it is insensitive to the size and motion of moving objects. It can therefore meet the needs of 3D modeling under uncertain dynamic occlusion. The experiments further emphasize the importance of dynamic object segmentation: ICP and similar methods struggle to register point clouds when dynamic objects are not segmented. For the registration of incomplete point clouds, we propose an optimization method based on local constraints and global loop closure constraints. Table 2 and Figure 7 show that this optimization plays an important role in obtaining the entire 3D model. It effectively improves the accuracy of 3D modeling, although it increases the computation and time cost; as the number of feature points increases, the calculation speed decreases. Finally, the pose measurement experiment shows that the rich geometric information of the entire 3D model makes the pose measurement more accurate and reflects the accuracy of the 3D modeling results.
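As a minimal illustration of the sampling-and-clustering idea (a sketch of the principle, not the paper's exact implementation), the following Python code samples triplets of tracked 3D correspondences between two frames, turns each sample into a rigid-motion hypothesis via the Kabsch algorithm, and greedily clusters the hypotheses in the pose parameter space; the tolerances and vote threshold are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rigid_from_triplet(P, Q):
    """Kabsch algorithm: least-squares rigid transform (R, t) such
    that Q ≈ R @ P + t, from three 3D correspondences."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def cluster_pose_hypotheses(pts_a, pts_b, n_samples=300,
                            rot_tol=0.05, trans_tol=0.05, seed=0):
    """Randomly sample correspondence triplets between two frames,
    compute one pose hypothesis per sample, and greedily cluster the
    hypotheses in pose parameter space. Hypotheses drawn from the same
    rigid object coincide, so each surviving cluster corresponds to
    one consistently moving object."""
    rng = np.random.default_rng(seed)
    clusters = []  # each entry: [rotation vector, translation, votes]
    for _ in range(n_samples):
        idx = rng.choice(len(pts_a), size=3, replace=False)
        R, t = rigid_from_triplet(pts_a[idx], pts_b[idx])
        rvec = Rotation.from_matrix(R).as_rotvec()
        for c in clusters:
            if (np.linalg.norm(rvec - c[0]) < rot_tol
                    and np.linalg.norm(t - c[1]) < trans_tol):
                c[2] += 1
                break
        else:
            clusters.append([rvec, t, 1])
    # triplets that mix two objects yield scattered, low-vote hypotheses
    return [c for c in clusters if c[2] >= 0.04 * n_samples]
```

Two point sets moving with different rigid motions produce two dominant clusters, while mixed-object samples scatter into low-vote hypotheses that the threshold discards.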

Conclusions
Aiming at the task of 3D modeling under uncertain dynamic occlusion, we propose an online 3D modeling method for pose measurement. Firstly, in order to solve the problem of occlusion by uncertain dynamic objects, we propose a dynamic object segmentation method based on motion consistency constraints that requires no prior knowledge about the objects, segmenting them by random sampling and pose hypothesis clustering. The initial pose and the 2D-3D point pairs of each object between two adjacent frames can then be obtained. After that, because the incomplete point cloud of the target object in each frame needs to be registered, and inspired by the constraint relationships of covisibility regions, an optimization method based on local constraints of overlapping view regions and global loop closure is introduced to optimize the registration of adjacent frames and the entire 3D model. Constraints between global closed-loop frames are also established to refine the results. Finally, based on the proposed framework and method, we design and build a confirmatory experimental workspace, in which experiments are carefully carried out to test and verify the performance of our method. On the modeling results for three objects, our method outperforms three other classical methods. The 3D model is used to measure poses, which are further compared with the ground truths. The performance of our method meets the expected requirements.
In future work, we will mainly focus on improving the optimization speed and solving the object reflection problem. Optimization slows down as the number of points increases; Graphics Processing Unit (GPU) acceleration and filter-based methods will be considered to address this. Reflection is also a challenging problem in computer vision, and traditional methods currently cannot solve it well; we intend to introduce a learning-based technique to deal with it.