A Robust CoS-PVNet Pose Estimation Network in Complex Scenarios

Abstract: Object 6D pose estimation, a key technology in applications such as augmented reality (AR), virtual reality (VR), robotics, and autonomous driving, requires robust prediction of the 3D position and 3D orientation of objects from images of complex scenes. However, complex environmental factors such as occlusion, noise, weak texture, and lighting changes may reduce the accuracy and robustness of object 6D pose estimation. We propose a robust CoS-PVNet (complex scenarios pixel-wise voting network) pose estimation network for complex scenes. A pixel-weight layer is added to PVNet to select more accurate pixel vectors, and dilated convolution and adaptive weighting strategies are used to capture local and global contextual information of the input feature map. The perspective-n-point (PnP) algorithm is then applied to the accurately located 2D key points to solve the 6D pose of the object, and the transformation matrix of the 6D pose projection is solved. The results indicate that, on the LineMod and Occlusion LineMod datasets, CoS-PVNet has high accuracy and can achieve stable and robust 6D pose estimation even in complex scenes.


Introduction
Object 6D pose estimation, an important task in computer vision, has many applications in fields such as augmented reality (AR), virtual reality (VR), robotics, and autonomous driving. As shown in Figure 1, by estimating the 6D pose of an object in the camera coordinate system, namely its 3D position and 3D orientation, virtual and real objects can be combined in the real environment to enhance people's perception of the real world [1]. In industrial manufacturing, robots can perform precise part positioning and assembly operations through pose estimation. In autonomous driving navigation, cars need to understand their location in the environment in order to plan the optimal path. However, complex conditions in real environments, such as background clutter and target occlusion [2], make 6D object pose estimation inaccurate and poorly robust. Therefore, accurately and robustly estimating the 6D pose of the target object in complex scenes is crucial for improving the performance of AR, VR, robotics, and autonomous driving [3].
The 6D pose estimation of target objects aims to detect targets and estimate their rotation and translation relative to a canonical frame [4]. The main challenge of traditional 6D pose estimation is to establish correspondences between the input image and the available 3D model and then use the perspective-n-point (PnP) algorithm to calculate the pose parameters. However, the quality of the correspondences is sensitive to factors such as lighting changes, weak textures, and cluttered backgrounds [5], so traditional methods struggle with textureless objects and are not robust to severe occlusion and background changes [6]. In recent years, deep learning-based methods have shown strong capabilities in 6D pose estimation. They can generally be divided into two categories: end-to-end methods based on direct regression and two-stage methods based on object class priors. End-to-end methods train a neural network to regress the 6D pose directly from the input image; they are highly efficient but less accurate than traditional geometry-based PnP algorithms [7]. Two-stage methods first use a CNN to regress an intermediate representation that establishes 2D-3D correspondences and then execute the PnP algorithm on these correspondences. However, this type of method usually relies on regression and multiple representations to estimate the pose, which requires accurate key point information for the target object. The existing mainstream 6D object pose estimation methods thus model the problem as a regression task and need special designs to handle the multiple-solution problem that arises with symmetric and partially visible objects.
We propose a deep learning-based CoS-PVNet (complex scenarios pixel-wise voting network) for 6D object pose estimation in complex scenes, which achieves accurate and robust 6D pose estimation of the target object. CoS-PVNet provides support for stable and robust 6D pose estimation in interactive applications that fuse virtual and real content. The main contributions of this article are as follows:
(1) For complex scenes such as cluttered environments and severe occlusion, a CoS-PVNet object pose estimation network framework is proposed. It enhances the processing of key point features in RGB images, accurately filters and predicts pixel vectors, and effectively improves the accuracy and robustness of 6D pose estimation in complex scenes.
(2) Inaccurate vector field prediction degrades the quality of the generated key point hypotheses. A pixel-weight self-learning module is added between the encoder and decoder of PVNet to predict pixel confidence. Because it is learnable, the module adapts to more complex image features and changes, prevents the loss of key feature information, and makes the semantic segmentation results more accurate.
(3) To improve the quality of key point feature extraction in complex scenes, a pixel-weight layer is added to PVNet to filter out more accurate pixel vectors, and a global attention mechanism is proposed to enhance the extraction of useful key point features while adding contextual information, which improves the performance of CoS-PVNet in extracting features from weakly textured scenes.

Related Work
Given an RGB or RGB-D image containing the target object and a 3D model of the target object, a deep learning network model is used to calculate the 6D pose $[R, T]$ of the target object from the image [8], as shown in Formula (1):

$$[R, T] = F(I, \mathrm{Model} \mid \theta) \tag{1}$$

where $F$ is the deep learning model, $I$ is the input image, $\mathrm{Model}$ is the 3D model of the object, and $\theta$ denotes the model parameters.
Deep learning-based 6D object pose estimation typically uses object detection networks or semantic segmentation networks as feature extraction networks to annotate target regions in images and encode pose-related semantic features [9]. Unlike the pixel-level classification of semantic segmentation, object detection has a faster inference speed and better meets the real-time requirements of AR, VR, robotics, and autonomous driving. Therefore, early 6D pose estimation often used object detection networks as feature extraction networks [10]. Algorithms such as SSD-6D [11], YOLO-6D [12], and CDPN [13] first calculate the 2D bounding box of the object with the SSD [11], YOLO V2 [14], and Faster R-CNN [15] detection networks, respectively, and then send the 2D bounding box region into the pose calculation branch to estimate the 6D pose of the target object. The inference times of YOLO-6D and CDPN are 20 ms and 33 ms, much faster than the pose estimation models based on semantic segmentation from the same period. However, because the 2D bounding boxes output by the object detection network contain some background or occluded areas, the features input to the pose calculation module inevitably contain interfering features, which reduces the accuracy of 6D pose estimation and the robustness of the model to occlusion.
Semantic segmentation is a pixel-level object detection method that accurately segments objects along their contours, eliminating occluded and irrelevant background regions. Therefore, it is more suitable as a feature extraction network for complex scenes [16]. For example, the average accuracy of PoseCNN (2017) [17] in occluded scenes was 24.9%, much higher than the 6.42% of YOLO-6D (2018) [12], but the frame rate of PoseCNN was 10 FPS, only 1/5 of that of YOLO-6D. The semantic segmentation architecture of PoseCNN is similar to FCN [18], with a VGG [19] encoder that gradually encodes semantic features of different dimensions. However, the output high-resolution semantic features severely lack detailed information, and transposed convolution is inefficient. U-Net [20] is a classic semantic segmentation network that adopts a symmetric encoder-decoder structure and fuses detailed features with deep semantic features through skip connections to improve the network's understanding of images. Inspired by U-Net, PVNet [21] uses residual blocks and bilinear interpolation to build a lightweight U-Net as its feature extraction network, with an inference time of 40 ms. However, inaccurate vector field predictions in complex scenes degrade the quality of the generated key point hypotheses, and feature extraction for target objects in complex scenes can be difficult and insufficient in PVNet, which affects the accuracy and robustness of 6D pose estimation.
In summary, semantic segmentation-based pose estimation methods are more suitable for 6D object pose estimation in complex scenes. Pose estimation built on feature extraction networks of different architectures is a form of multitask learning [22]: it not only annotates the target object in the input image but also calculates the 6D pose of the target object [23]. Therefore, appropriate self-learned multitask weights help to explore the correlation between the object detection, semantic segmentation, and pose estimation tasks and to extract sufficient semantic features for distinguishing target objects from occluding objects. This reduces the impact of occluded areas, improves the expressiveness of the semantic features and the accuracy of the estimated 6D pose, and thus improves the performance of the 6D object pose estimation network.

Overall Framework Structure of CoS-PVNet
In response to the difficulty of accurately estimating the 6D pose of objects in complex scenes [24], this paper proposes CoS-PVNet, a two-stage pose estimation network built on PVNet that takes a single RGB image as input. By integrating key point localization into a deep learning architecture, a CNN establishes the 2D-3D correspondence of the target object and, together with the global attention mechanism and the voting mechanism, accurately locates the 2D key points. The PnP algorithm is then executed to solve the 6D pose of the object, so the 6D pose of the target object is accurately estimated without any pose refinement.
The overall framework structure of CoS-PVNet is shown in Figure 2. Given a single RGB image containing the target object, a weight self-learning module is added between the skip connections of PVNet, and three tasks are performed: constructing semantic labels for predicting pixel directions, constructing unit vectors, and predicting pixel weights. Then, a new global attention mechanism (GAM) is proposed to enhance the extraction of useful features and increase contextual information. Furthermore, the ASPP-DF-PVNet algorithm [25] is used to optimize RANSAC voting for locating 2D key points, filtering out biased votes and further refining the voting results to obtain more accurate 2D key points. Finally, the PnP algorithm is used to solve the 6D pose of the target, and a homogeneous coordinate transformation matrix, composed of the translation and rotation of the target object coordinate system relative to the camera coordinate system, is solved to realize the coordinate system transformation of CoS-PVNet.

CoS-PVNet Weight Self-Learning Module Structure
The weight self-learning module structure consists of a series of residual units. As shown in Figure 2, the overall backbone of the network is a pretrained ResNet-18 [26], followed by a weight self-learning module and several convolutional and upsampling layers, described as a weight self-learning structure. In the network structure, a weight self-learning module is added to the skip connections. Through this module, larger weight information is added to prevent the loss of key information, thereby making the semantic segmentation results more accurate.
The weight self-learning module adds conv5-conv10x to the network structure of ResNet-18. An image of size $H \times W \times 3$ is taken as input and downsampled until the feature map reaches $H/8 \times W/8$, and the convolutions in the last two blocks of ResNet-18 are then replaced with dilated convolutions with rate = 2 and rate = 4. Subsequently, the feature maps output by the encoder are input into the weight self-learning module to extract dense features. Finally, the resulting feature maps from all branches are concatenated and fed to another 1 × 1 convolution to obtain the desired spatial dimension. In the weight self-learning module, the number of output channels is set to 256. After the feature map has been processed by the weight self-learning module, upsampling is performed until the size reaches $H \times W$. Assuming there are $C$ object classes and each object has $K$ key points, a 1 × 1 convolution is applied on the feature map to output one tensor for the vector field representations of the key points and one tensor for the semantic labels. Given an input RGB image, the network therefore outputs the semantic labels of the segmented image and the predicted unit vector $v_k(p)$ of each pixel for the $K$ key points, both at the input image size.
$v_k(p)$ represents the direction in which pixel $p$ votes for the 2D key point $x_k$. It is calculated as the difference between the $k$-th key point and the current pixel $p$, divided by the 2-norm of that difference:

$$v_k(p) = \frac{x_k - p}{\lVert x_k - p \rVert_2} \tag{2}$$

The pixel weight output by CoS-PVNet is the confidence score of each pixel, which is used to filter out outliers and select inlier pixels for voting before the 2D positions of the key points are calculated. $I_e$ denotes the pixel weight, which estimates the cosine of the angle between the predicted vector $\hat{v}_k(p)$ and the target vector $v_k(p)$:

$$I_e(p) = \cos\big(\hat{v}_k(p),\, v_k(p)\big) \tag{3}$$

The larger the pixel-weight value, the closer the predicted vector is to the true value. When the key points are calculated later, the pixels that vote are selected according to these predicted pixel-weight values to ensure the accuracy of pose estimation. The total loss function is:

$$L = \lambda_{vec} L_{vec} + \lambda_{sem} L_{sem} + \lambda_{e} L_{e} \tag{4}$$

where $L_{vec}$ is the vector field prediction loss function, $L_{sem}$ is the semantic segmentation loss function, and $L_{e}$ is the weight prediction loss function. $\lambda_{vec}$, $\lambda_{sem}$, and $\lambda_{e}$ are the corresponding coefficients. The loss function for vector field prediction is defined as follows:
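The quantities in this subsection can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the coefficients default to 1.0 purely for illustration, and the vector field loss assumes the smooth L1 form used by the original PVNet, which may differ from the exact loss used in CoS-PVNet.

```python
import numpy as np

def unit_vector_field(mask, keypoint):
    """Target field v_k(p) = (x_k - p) / ||x_k - p||_2 for every object pixel."""
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pixels = np.stack([xs, ys], axis=-1).astype(np.float64)  # (H, W, 2) as (x, y)
    diff = np.asarray(keypoint, dtype=np.float64) - pixels
    norm = np.linalg.norm(diff, axis=-1, keepdims=True)
    field = diff / np.maximum(norm, 1e-8)  # guard against the pixel on the key point
    return field * mask[..., None]         # zero outside the object mask

def pixel_weight(pred, target):
    """Per-pixel cosine between predicted and target vectors (the pixel weight)."""
    dot = np.sum(pred * target, axis=-1)
    denom = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    return dot / np.maximum(denom, 1e-8)

def vector_field_loss(pred, target, mask):
    """Smooth L1 loss over object pixels (assumed form, borrowed from PVNet)."""
    d = np.abs((pred - target) * mask[..., None])
    return float(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum())

def total_loss(l_vec, l_sem, l_e, lam_vec=1.0, lam_sem=1.0, lam_e=1.0):
    """Weighted sum of the three loss terms; coefficient values are placeholders."""
    return lam_vec * l_vec + lam_sem * l_sem + lam_e * l_e
```

Pixels whose weight falls below a chosen threshold would be excluded from voting, which is how the confidence score filters outliers before the key points are located.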

CoS-PVNet Global Attention Mechanism
To cope with scenes in which features are hard to extract, sparse, or absent, a global attention mechanism is proposed in the CoS-PVNet algorithm to enhance the extraction of useful features and increase contextual information, so that the input feature maps are exploited more effectively. As shown in Figure 3, this mechanism adopts dilated convolution and adaptive weighting strategies to capture local and global contextual information of the input feature map. First, a dilated convolutional layer is applied to the input feature map $X$ to capture its local contextual information. The dilated convolutional layer outputs a feature map $D$, which contains the spatial information of the original feature map and the contextual information captured through dilated convolution. Next, global average pooling is performed on feature map $D$ to extract global contextual information; it transforms $D$ into a feature vector $G$ that represents global information. To achieve an adaptive attention mechanism, the global feature vector $G$ is passed to a shared fully connected layer (MLP), which outputs a weight matrix $W$ with the same dimension as the input feature map $X$. Subsequently, the weight matrix $W$ is used to perform weighted fusion on the input feature map $X$. The weighted fusion operation can be expressed as $A = W \otimes X$, where $\otimes$ represents element-wise multiplication. In this way, a weighted feature map $A$ containing adaptive attention information is obtained.
Then, an element-wise addition is performed on the weighted feature map $A$ and $X$, and the final feature map $Z$ is generated through the ReLU activation function. For the given input $X$, GAM is expressed as:

$$D = \mathrm{DConv}(X) \tag{8}$$
$$G = \mathrm{GAP}(D) \tag{9}$$
$$W = \mathrm{MLP}(G) \tag{10}$$
$$Z = \mathrm{ReLU}(W \otimes X + X) \tag{11}$$

where $D$ is the feature map obtained by dilated convolution, $G$ is the feature vector generated by average pooling, $W$ is the output weight matrix of the fully connected layer, and $Z$ is the feature map generated by the ReLU activation function after addition. By integrating local and global contextual information, this adaptive attention mechanism extracts information from input feature maps more effectively. The adaptive weighting strategy enables the model to automatically adjust attention weights based on the input feature maps, thereby improving the performance of CoS-PVNet.
In addition, since PVNet uses the cosine similarity between two vectors to determine votes, the method is more reliable when a key point hypothesis is consistent with more predicted directions [25]. However, when pixels are far from the key point hypothesis, even a small angle between two direction vectors may cause significant voting bias, and when two hypotheses are close to each other, the voting becomes inaccurate. Therefore, this paper uses the ASPP-DF-PVNet algorithm to optimize RANSAC voting for locating 2D key points, in order to obtain more accurate 2D key points and support the subsequent accurate pose estimation of the target object.
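The data flow of the global attention mechanism described above can be traced with a minimal NumPy sketch. Several details are assumptions rather than facts from the paper: the dilated convolution is implemented as a depthwise 3 × 3 convolution, the MLP has two layers (`w1`, `w2`) with a sigmoid gate, and the resulting per-channel weights are broadcast over the spatial dimensions to match the shape of $X$. All names are hypothetical.

```python
import numpy as np

def dilated_conv3x3(x, kernel, rate):
    """Depthwise 3x3 dilated convolution with zero padding.

    x: (C, H, W) feature map, kernel: (C, 3, 3), rate: dilation rate.
    """
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (rate, rate), (rate, rate)))
    out = np.zeros((c, h, w), dtype=np.float64)
    for i in range(3):
        for j in range(3):
            dy, dx = i * rate, j * rate  # tap offset inside the padded map
            out += kernel[:, i, j, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def global_attention(x, kernel, w1, w2, rate=2):
    """Z = ReLU(W * X + X), with W derived from dilated conv + global pooling."""
    d = dilated_conv3x3(x, kernel, rate)             # D: local context
    g = d.mean(axis=(1, 2))                          # G: global average pooling, (C,)
    hidden = np.maximum(w1 @ g, 0.0)                 # first MLP layer + ReLU (assumed)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gate (assumed)
    a = weights[:, None, None] * x                   # A = W (element-wise) X
    return np.maximum(a + x, 0.0)                    # Z = ReLU(A + X)
```

In a trained network `kernel`, `w1`, and `w2` would be learned parameters; here they are plain arrays so that the shapes and the order of operations are easy to check.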

CoS-PVNet Target Object Pose Estimation
After the 2D positions of the key points of the target object have been determined, CoS-PVNet estimates the pose through the PnP algorithm. Using the mean $\mu_k$ and covariance matrix $\Sigma_k$ estimated for each key point, the 6D pose $(R, t)$ is calculated by minimizing the Mahalanobis distance:

$$\min_{R,\,t} \sum_{k=1}^{K} (\tilde{x}_k - \mu_k)^{T}\, \Sigma_k^{-1}\, (\tilde{x}_k - \mu_k) \tag{12}$$
$$\tilde{x}_k = \pi(R X_k + t) \tag{13}$$

where $X_k$ represents the 3D coordinates of the key point, $\tilde{x}_k$ represents the 2D projection of the 3D coordinates $X_k$, and $\pi$ is the perspective projection function. The rotation and translation parameters $R$ and $t$ are initialized using the EPnP (efficient perspective-n-point) algorithm. Because of the uncertainty of the features, the Levenberg-Marquardt algorithm (a nonlinear least squares algorithm) is used to minimize the reprojection error and solve Formula (12). Therefore, based on the voting results, PnP can accurately locate and utilize the 2D key points, allowing distance-filtered voting schemes to further improve the performance of pose estimation. In addition, in the subsequent experiments of this article, different numbers of key points are compared to explore their impact on pose estimation, and K = 8 is chosen as a trade-off between efficiency and accuracy.
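The statistics that enter this objective can be sketched as follows. The sketch assumes, as in the original PVNet, that the mean and covariance of each key point are weighted moments of its voted hypotheses; the function names are hypothetical, and the pose optimization itself (EPnP initialization followed by Levenberg-Marquardt) is not reproduced here.

```python
import numpy as np

def hypothesis_statistics(hypotheses, weights):
    """Weighted mean mu_k and 2x2 covariance Sigma_k of voted 2D hypotheses."""
    h = np.asarray(hypotheses, dtype=np.float64)  # (N, 2)
    w = np.asarray(weights, dtype=np.float64)     # (N,)
    mu = (w[:, None] * h).sum(axis=0) / w.sum()
    centered = h - mu
    outer = centered[:, :, None] * centered[:, None, :]  # (N, 2, 2)
    cov = (w[:, None, None] * outer).sum(axis=0) / w.sum()
    return mu, cov

def mahalanobis_cost(projected, mus, covs):
    """Sum over key points of (x~_k - mu_k)^T Sigma_k^{-1} (x~_k - mu_k)."""
    cost = 0.0
    for x, mu, cov in zip(projected, mus, covs):
        r = np.asarray(x, dtype=np.float64) - mu
        cost += float(r @ np.linalg.solve(cov, r))
    return cost
```

A key point whose hypotheses are tightly clustered has a small covariance and therefore penalizes reprojection error more strongly than an uncertain one, which is the motivation for using the Mahalanobis distance instead of a plain Euclidean one.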

CoS-PVNet Coordinate System Conversion Relationship
6D pose estimation refers to estimating the 3D position and 3D orientation of an object in the camera coordinate system. The coordinate system of the object itself can be regarded as the world coordinate system, so the task is to obtain the homogeneous transformation matrix, composed of translation and rotation, from the world coordinate system of the object to the camera coordinate system. As shown in Figure 4, CoS-PVNet registration mainly establishes the transformation relationship between the world coordinate system, camera coordinate system, image coordinate system, and pixel coordinate system. The transformation from the world coordinate system to the camera coordinate system involves the rotation and translation $(R, t)$ that occur during the camera shooting process, while the transformation from the camera coordinate system to the image coordinate system involves $K$ (the camera intrinsics). According to the principle of pinhole imaging, the target image is inverted, and the transformation is represented by a homography matrix. The mapping relationship between the homography matrix and the rotation-translation matrix $(R, t)$ is used to solve the various parameters. From the relationships between the coordinate systems, the relationship between point $(u, v)$ in the pixel coordinate system of CoS-PVNet and point $(X_w, Y_w, Z_w)$ in the world coordinate system is:

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & t \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \tag{14}$$

where $K$ represents the intrinsic matrix of the camera, which can be obtained by calibrating the camera, and $Z_c$ is the depth of the point in the camera coordinate system. When the camera is actually shooting, the camera pose can be solved according to Formula (14). Therefore, after the spatial projection coordinate transformation based on CoS-PVNet has been completed, the camera pose parameters can be solved, enabling stable and robust system applications in AR, VR, robotics, and autonomous driving. The coordinate systems and symbols involved are as follows:
$O_c$-$X_cY_cZ_c$: Camera coordinate system, with the optical center as the origin.
$o$-$xy$: Image coordinate system, with the origin at the midpoint of the image, where the optical axis meets the image plane.
$uv$: Pixel coordinate system, with the origin in the upper left corner of the image.
$P$: A point in the world coordinate system, that is, a real point in the scene.
$p$: The imaging point of $P$ in the image, with coordinates $(x, y)$ in the image coordinate system and $(u, v)$ in the pixel coordinate system.
$f$: Camera focal length, equal to the distance between $o$ and $O_c$.
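As a worked illustration of the world-to-pixel relationship, the sketch below builds the homogeneous transformation from a rotation and translation and projects one world point with an intrinsic matrix. The function name and the numeric intrinsics are hypothetical values chosen for the example, not calibration results from the paper.

```python
import numpy as np

def project_world_point(point_w, K, R, t):
    """Map a world point to pixel coordinates (u, v) and camera depth Z_c."""
    T = np.eye(4)                    # homogeneous world-to-camera transform
    T[:3, :3] = R
    T[:3, 3] = t
    p_cam = (T @ np.append(point_w, 1.0))[:3]  # point in the camera frame
    uvw = K @ p_cam                  # equals Z_c * [u, v, 1]
    return uvw[:2] / uvw[2], p_cam[2]

# Example intrinsics: focal length 500 px, principal point (320, 240).
K_example = np.array([[500.0, 0.0, 320.0],
                      [0.0, 500.0, 240.0],
                      [0.0, 0.0, 1.0]])
uv, depth = project_world_point(np.array([0.2, -0.1, 0.0]),
                                K_example, np.eye(3), np.array([0.0, 0.0, 2.0]))
print(uv, depth)  # [370. 215.] 2.0
```

Dividing by the third homogeneous component is what removes the depth factor on the left-hand side of the projection equation, so the same pixel is obtained for every point along the viewing ray.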

CoS-PVNet Pose Estimation and Application Process
The pose estimation and application process of CoS-PVNet is shown in Figure 5. Feature information is extracted from the input RGB image, and with the weight self-learning module the model can automatically adjust and optimize its weights during training, which improves its flexibility and adaptability. Then, CoS-PVNet predicts the key point positions of each target object on the feature map and combines the global attention mechanism to enhance the extraction of useful features and increase contextual information, so that information is extracted more effectively from the input feature map. Subsequently, CoS-PVNet generates a voting vector for each detected key point and uses the voting results to estimate the 6D pose of the target object. Finally, the CoS-PVNet coordinate system transformation relationship is solved to support applications in AR, VR, robotics, and autonomous driving. The specific steps for CoS-PVNet pose estimation and application are:

RGB image input
Step 1. Use the camera to input an RGB image containing the target object.
Step 2. Feed the input image into a pretrained ResNet-18 convolutional neural network to accurately extract feature information such as the shape, texture, and color of objects in the input image. For different RGB image data, the CoS-PVNet weight self-learning module can balance the focus of the model by adjusting the weights of different categories.
Step 3. During the training process, CoS-PVNet updates its weights through the weight self-learning module and the backpropagation algorithm to minimize the value of the loss function and generate accurate key point feature maps.
Step 4. CoS-PVNet predicts the key point positions of each target object on the feature map, usually the corners, centers, or other prominent feature points of the target object.
Step 5. A set of key points is defined on the 3D model of an object with fixed coordinates (X, Y, Z) in 3D space. When the object is placed in a certain pose in the real world and captured as an image, these 3D key points are projected onto the image plane to form 2D key points. The projection involves the intrinsic parameters (such as focal length and principal point) and extrinsic parameters (such as the rotation matrix and translation vector) of the camera, which together map points in 3D space to the 2D image plane.
Step 6. Before the feature map is predicted, the global attention mechanism is used to enhance the extraction of useful features and increase contextual information, so that the input feature map is exploited more effectively and the 2D-3D correspondence of the target object is established more reliably.
Step 7. CoS-PVNet generates a voting vector for each detected key point, uses a Gaussian kernel function to balance the importance of different votes, and aggregates all voting vectors in the image space to form a voting density map or voting cloud, which reflects the 3D position and 3D orientation of the target object in the image.
Step 8. In CoS-PVNet, PnP is used to calculate the 3D position of the object from the centroid position of the votes, and the relative positions of the key points are used to estimate the rotation of the object.
Step 9. CoS-PVNet can estimate the pose parameter matrix of the camera, including rotation matrix, translation vector, or quaternion, based on a set of known 3D points and their projections in the image and apply it to AR, VR, robotics, and autonomous driving.
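The voting stage in the steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the ASPP-DF-PVNet procedure used in the paper: it follows the RANSAC scheme of the original PVNet (intersect the vote lines of two random pixels, then count pixels whose predicted direction agrees with the hypothesis by cosine similarity), filters pixels by a predicted pixel weight first, and uses plain inlier counts instead of Gaussian-kernel weighting. All function names, thresholds, and defaults are hypothetical.

```python
import numpy as np

def intersect(p1, v1, p2, v2):
    """Intersection of the lines p1 + s*v1 and p2 + u*v2, or None if parallel."""
    a = np.array([[v1[0], -v2[0]], [v1[1], -v2[1]]])
    if abs(np.linalg.det(a)) < 1e-8:
        return None
    s, _ = np.linalg.solve(a, p2 - p1)
    return p1 + s * v1

def ransac_vote(pixels, vectors, weights, n_hyp=64, cos_thresh=0.99,
                min_weight=0.5, seed=0):
    """Return the best key point hypothesis and its inlier count.

    pixels: (N, 2) pixel coordinates, vectors: (N, 2) unit vote directions,
    weights: (N,) predicted pixel weights used to discard unreliable pixels.
    """
    rng = np.random.default_rng(seed)
    keep = weights >= min_weight       # drop low-confidence pixels before voting
    pixels, vectors = pixels[keep], vectors[keep]
    if len(pixels) < 2:
        return None, 0
    best, best_score = None, -1
    for _ in range(n_hyp):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        h = intersect(pixels[i], vectors[i], pixels[j], vectors[j])
        if h is None:
            continue
        to_h = h - pixels
        norms = np.linalg.norm(to_h, axis=1)
        cos = np.sum(to_h * vectors, axis=1) / np.maximum(norms, 1e-8)
        score = int(np.sum(cos >= cos_thresh))  # pixels agreeing with h
        if score > best_score:
            best, best_score = h, score
    return best, best_score
```

The surviving hypotheses and their scores are what a mean and covariance per key point would be computed from before the pose is solved.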

Experimental Environment Configuration
CoS-PVNet provides accurate initial pose estimation based on RGB images, aiming to accurately locate objects and estimate their 3D rotation and 3D translation. This article conducts an experimental analysis of PVNet, CoS-PVNet, and recent 6D pose estimation algorithms and uses ablation experiments to analyze the performance of each module of CoS-PVNet. The configuration of the experimental environment is shown in Table 1. Experiments are conducted on two benchmark datasets, LineMod [27] and Occlusion LineMod [28], which are widely used in 6D pose estimation experiments, to evaluate the performance of CoS-PVNet. The LineMod dataset exhibits significant clutter, diversity, multiple viewpoints, and true pose annotation but only slight occlusion. The Occlusion LineMod dataset introduces interference at different occlusion levels on top of LineMod and is characterized by a complex relationship between the target object and the background. This provides more information for evaluating the performance of the 6D object pose estimation network.
(1) LineMod is a benchmark dataset for 6D object pose estimation, as shown in Figure 6. It consists of 15 objects, each with over 1200 images, for a total of 15,783 images. It not only annotates the central object in each RGB image but also provides the 3D CAD model of each object and the camera intrinsics. The complex factors of LineMod include background clutter, textureless objects, and lighting changes.
(2) The Occlusion LineMod dataset, a subset of LineMod, contains 1214 images of 8 objects and provides additional pose annotations for non-central objects. Compared with LineMod, the images in the Occlusion LineMod dataset contain multiple objects under severe occlusion, making 6D pose estimation extremely challenging.
To ensure a fair comparison with PVNet and related algorithms, the same training/test split is used on the LineMod dataset (15% for training and 85% for testing), while the Occlusion LineMod dataset is used only for testing. In addition to the training images provided by LineMod, synthetic images are used to enlarge the training data. Moreover, this article adopts data augmentation techniques to prevent overfitting, including rotating images by an angle in (−30°, 30°), randomly blurring and cropping with a 50% probability, and randomly scaling the original brightness and contrast of each image by a factor between 0.9 and 1.1. During training, an Adam optimizer with an initial learning rate of 0.001 is used, and the batch size is set to 20. The network is trained for 100 epochs.

Evaluation Indicators
The performance of CoS-PVNet is evaluated using two metrics, the 2D projection metric and the average 3D distance of model points (ADD) metric [29], which measure pose errors in 2D and 3D space. The 2D projection metric measures the average distance between the projections of the 3D model points under the estimated pose and under the real pose, specifically:

$$e_{proj} = \frac{1}{m} \sum_{x \in M} \left\lVert K(\hat{R}x + \hat{T}) - K(Rx + T) \right\rVert \tag{15}$$

where $M$ represents the set of 3D model points, $m$ is the number of points, and $K$ is the intrinsic matrix of the camera. $\hat{R}$ and $\hat{T}$ are the estimated rotation and translation, while $R$ and $T$ are those of the real pose; each projected point is normalized by its depth to obtain pixel coordinates. When the average distance after 2D projection is within 5 pixels, the estimated 6D pose is considered correct.
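A minimal NumPy sketch of the 2D projection metric is given below. The function names are hypothetical, and "within 5 pixels" is interpreted here as a strict inequality, which is an assumption.

```python
import numpy as np

def projection_error(model_points, K, R_est, T_est, R_gt, T_gt):
    """Mean pixel distance between model points projected under two poses."""
    def project(R, T):
        cam = model_points @ R.T + T         # (m, 3) points in the camera frame
        uvw = cam @ K.T                      # homogeneous pixel coordinates
        return uvw[:, :2] / uvw[:, 2:3]      # normalize by depth
    diff = project(R_est, T_est) - project(R_gt, T_gt)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

def projection_correct(error_px, threshold_px=5.0):
    """A pose counts as correct when the mean error is below the threshold."""
    return error_px < threshold_px
```

For example, with a focal length of 500 px and an object 2 m from the camera, a lateral translation error of 1 cm shifts every projected point by 2.5 px, which would still count as a correct pose under this metric.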
Two common variants of the average distance metric, ADD and ADD-S, are used to evaluate the estimated pose; they are referred to jointly as ADD(-S) in this paper.
(1) ADD metric: The model points are transformed by the estimated pose and by the ground-truth pose, and the average distance between the two transformed point sets is calculated. When this distance is less than 10% of the model diameter, the estimated pose is considered correct, as shown in Formula (16):

$$\mathrm{ADD} = \frac{1}{m} \sum_{y \in W} \left\lVert (Ry + T) - (\hat{R}y + \hat{T}) \right\rVert \tag{16}$$

where $W$ represents the set of sampling points of the target 3D model, $y$ represents a point in $W$, $m$ represents the total number of sampling points, $\hat{R}$ and $\hat{T}$ denote the estimated pose, and $R$ and $T$ denote the true pose.
(2) ADD-S metric: For symmetric objects, the ADD-S metric is used, in which the average distance is calculated based on the distance to the nearest point. The target is evaluated using ADD-S accuracy and the AUC (area under the curve), where AUC is the area under the accuracy-threshold curve obtained by varying the distance threshold in the evaluation. The ADD-S metric is shown in Formula (17):

$$\mathrm{ADD\text{-}S} = \frac{1}{m} \sum_{y_1 \in W} \min_{y_2 \in W} \left\lVert (Ry_1 + T) - (\hat{R}y_2 + \hat{T}) \right\rVert \tag{17}$$
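The two distance metrics can be sketched in NumPy as follows. The function names are hypothetical, and the ADD-S implementation uses a brute-force nearest-neighbour search, which is adequate for a few thousand sampled points but is only one possible choice.

```python
import numpy as np

def add_metric(points, R_est, T_est, R_gt, T_gt):
    """Mean distance between corresponding model points under the two poses."""
    est = points @ R_est.T + T_est
    gt = points @ R_gt.T + T_gt
    return float(np.mean(np.linalg.norm(gt - est, axis=1)))

def add_s_metric(points, R_est, T_est, R_gt, T_gt):
    """Mean distance from each ground-truth point to its nearest estimated point."""
    est = points @ R_est.T + T_est
    gt = points @ R_gt.T + T_gt
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)  # (m, m)
    return float(np.mean(dists.min(axis=1)))

def pose_correct(distance, diameter, ratio=0.1):
    """Correct when the distance is below 10% of the model diameter."""
    return distance < ratio * diameter
```

For a symmetric object rotated by half a turn about its symmetry axis, ADD reports a large error because corresponding points are far apart, while ADD-S reports a small one because every point still has a nearby counterpart. This is why egg-box and glue are scored with ADD-S.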

LineMod Dataset Experimental Results
The visualization results of CoS-PVNet pose estimation on the LineMod dataset are shown in Figure 7. The green 3D bounding box represents the true pose, and the blue bounding box represents the estimated pose. The figure shows that CoS-PVNet has high accuracy, with the estimated bounding box almost overlapping the true one.

Occlusion LineMod Dataset Experimental Results
The Occlusion LineMod dataset is used only as a test set, so the previously trained model is used directly for testing. The pose estimation results on the Occlusion LineMod dataset are shown in Figure 8. Again, the green 3D bounding box represents the true pose, and the blue bounding box represents the estimated pose. Compared with the baseline method PVNet, CoS-PVNet produces accurate results even under severe occlusion. However, the last column also shows that CoS-PVNet cannot obtain sufficient information for 6D pose estimation when the target area is too small. The rendered results on the LineMod and Occlusion LineMod datasets therefore show that CoS-PVNet achieves good overlap, indicating that it retains high accuracy in complex backgrounds. However, when the target object is too small, its 6D pose cannot be estimated accurately, which is related to overfitting caused by the weight self-learning module in CoS-PVNet. This article uses data augmentation to counter this overfitting.

Comparison Experiment of 2D Projection Metrics
CoS-PVNet is compared with relevant RGB image-based pose estimation methods. On the LineMod and Occlusion LineMod datasets, CoS-PVNet is compared quantitatively with BB8 [30], YOLO-6D [12], and PVNet [21] on the 2D projection metric. The results are shown in Table 2. BB8 and YOLO-6D use the eight corners of a 3D bounding box plus the object center as key points and directly regress their coordinates, while PVNet and CoS-PVNet apply a voting strategy to locate eight surface key points and one object center from the predicted vector field. When using the same loss as PVNet, CoS-PVNet achieves better performance on most objects. On the LineMod dataset, its average accuracy exceeds that of BB8 by 10.08%, and the accuracy for the target objects can and cat increases by more than 15%. This indicates that CoS-PVNet is also more accurate for small-scale object pose estimation. In the case of target occlusion, CoS-PVNet improves on YOLO-6D by 37.46% on the 2D projection metric across the target object categories. CoS-PVNet also outperforms PVNet, with an improvement of 1.32% on the 2D projection metric. Therefore, consistent with the evaluation results on the LineMod dataset, CoS-PVNet performs better than PVNet in complex occlusion scenes on the Occlusion LineMod dataset, which further supports the effectiveness of the CoS-PVNet pose estimation in this paper.

Comparative Experiment of CoS-PVNet Algorithm ADD (-S)
(1) LineMod Dataset ADD (-S) Comparative Experiment
Experiments are conducted on the LineMod dataset, comparing CoS-PVNet with YOLO-6D [12], PoseCNN [17], DenseFusion [31], DualStream [32], and PVNet [21]. The two symmetrical objects, egg-box and glue, are evaluated using the ADD-S metric, while the other objects are evaluated using the ADD metric. The results are shown in Table 3. CoS-PVNet improves on the average accuracy of YOLO-6D, PoseCNN, DenseFusion, DualStream, and PVNet by 39.5%, 6.8%, 1.1%, 0.6%, and 9.1%, respectively. For four objects (ape, cat, duck, and hole puncher), the improvement in pose estimation accuracy is relatively small. The reason is that CoS-PVNet has certain advantages in extracting features for large-scale target objects, and a pose is considered correct only when the ADD value is less than 10% of the maximum diameter of the target. Because the maximum diameter of these four targets is small, the tolerance is correspondingly tight and the improvement is relatively small. CoS-PVNet performs better on the other target objects, indicating that it extracts the features of the target object thoroughly and can effectively improve the accuracy of 6D pose estimation for objects in complex scenes.
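The ADD and ADD-S criteria referred to above can be sketched as follows. This is an illustrative implementation of the commonly used definitions (average distance between corresponding transformed model points for ADD, average closest-point distance for ADD-S) with the 10% diameter threshold stated in the text. The function names and array layout are assumptions, and the brute-force nearest-neighbor search is only suitable for small point sets.

```python
import numpy as np


def transform(points, R, t):
    """Apply pose (R, t) to (N, 3) model points."""
    return points @ R.T + t


def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding points under the two poses."""
    diff = transform(points, R_gt, t_gt) - transform(points, R_pred, t_pred)
    return float(np.mean(np.linalg.norm(diff, axis=1)))


def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its closest predicted point."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    # Pairwise (N, N) distances; brute force, fine for a small illustrative model.
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(np.mean(dists.min(axis=1)))


def is_correct_add(value, diameter, ratio=0.1):
    """A pose is counted as correct when the metric is below ratio * object diameter."""
    return value < ratio * diameter
```

For a symmetrical object, a pose rotated onto an equivalent view gives a large ADD but a near-zero ADD-S, which is why egg-box and glue are scored with ADD-S.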
(2) Comparison Experiment of the Occlusion LineMod Dataset ADD (-S)
Experiments are conducted on the Occlusion LineMod dataset to compare CoS-PVNet with HybridPose [33], SSPE [34], RePOSE [35], SegDriven [36], PoseCNN [17], and PVNet [21]. Using the same metrics as for the LineMod dataset, the results are shown in Table 4. CoS-PVNet outperforms HybridPose, SSPE, SegDriven, PoseCNN, and PVNet on the average value, with improvements of 1.7%, 5.9%, 22.2%, 24.3%, and 8.4%, respectively. However, CoS-PVNet is 2.4% lower than RePOSE on the average value, with the gap concentrated in three objects: can, cat, and duck. This indicates that RePOSE can quickly and accurately refine the pose by minimizing the feature-metric error between the input and rendered image representations. However, when small targets are severely occluded, or the extracted features are insufficient to recognize the target object well, CoS-PVNet performs better.

CoS-PVNet Ablation Experiment
To verify the effectiveness of each module, ablation experiments are conducted on CoS-PVNet. Table 5 compares the results obtained by adding the CoS-PVNet modules step by step. Because some categories of the LineMod dataset show significant improvement, ablation experiments on it are reasonably representative, so this paper uses the LineMod dataset for accuracy and speed testing. Table 5 reports the accuracy and speed of pose estimation for the different modules of CoS-PVNet. A predicted object pose is considered correct if its translation and rotation errors relative to the actual pose are less than 5 cm and 5°, respectively. If only the CoS-PVNet weight self-learning module is added, the accuracy and speed of solving the pose with the PnP algorithm are 46.3% and 14 FPS. If the CoS-PVNet global attention mechanism is also added to infer the pose, the accuracy improves by 17.3% and the speed improves by 7 FPS. Therefore, when only the weight self-learning module is added, the model is prone to incomplete fitting or overfitting, resulting in lower pose estimation accuracy. In contrast, the global attention mechanism directly exploits the local and global context information of the feature map, further improving the robustness of CoS-PVNet pose estimation.
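The 5 cm, 5° criterion used in the ablation can be sketched as below. The rotation error is taken as the angle of the relative rotation between the predicted and actual rotation matrices, and the translation error as the Euclidean distance between the translation vectors. The function names are illustrative, and translations are assumed to be expressed in meters, so 5 cm becomes 0.05.

```python
import numpy as np


def rotation_error_deg(R_gt, R_pred):
    """Angle in degrees of the relative rotation R_pred^T R_gt."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def translation_error(t_gt, t_pred):
    """Euclidean distance between translations (same unit as the inputs)."""
    return float(np.linalg.norm(np.asarray(t_gt) - np.asarray(t_pred)))


def pose_is_correct(R_gt, t_gt, R_pred, t_pred, max_trans=0.05, max_rot_deg=5.0):
    """Assumes meters: correct if translation error < 5 cm and rotation error < 5 degrees."""
    return (translation_error(t_gt, t_pred) < max_trans
            and rotation_error_deg(R_gt, R_pred) < max_rot_deg)
```

Both conditions must hold at the same time, so a pose with a small translation error but a rotation error above 5° is still counted as incorrect.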

Discussion
Object 6D pose estimation is a core technology for applications such as AR, VR, robotics, and autonomous driving. However, complex scene factors such as background clutter, target occlusion, and weak texture features can easily lead to inaccurate 6D pose estimation. This article proposes a robust CoS-PVNet pose estimation network for complex scenes. Firstly, a pixel-weight self-learning layer is added to the PVNet network structure, and the predicted pixel weights are used to select the pixels that take part in voting. Then, a global attention mechanism that exploits local and global contextual information in the input feature map is used to extract stable and robust features. Finally, the PnP algorithm is used to solve the 6D pose, which improves the accuracy and robustness of 6D object pose estimation in complex scenes.
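How per-pixel weights can influence key point localization is illustrated by the sketch below. It is not the voting procedure of PVNet or CoS-PVNet; it substitutes a simpler weighted least-squares intersection of the pixel direction vectors for the voting step, purely to show that low-weight pixels contribute little to the located key point. All names are hypothetical, and the weights are assumed to be given rather than learned.

```python
import numpy as np


def weighted_keypoint(pixels, directions, weights):
    """Locate a 2D key point from pixels, their direction vectors, and per-pixel weights.

    Minimizes sum_i w_i * ||(I - d_i d_i^T)(k - p_i)||^2, the weighted squared
    distance from k to the line through each pixel p_i along its direction d_i.
    """
    pixels = np.asarray(pixels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    d = np.asarray(directions, dtype=float)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)        # unit directions
    # Projector onto the normal of each line, shape (N, 2, 2).
    proj = np.eye(2)[None, :, :] - d[:, :, None] * d[:, None, :]
    weighted = weights[:, None, None] * proj
    A = weighted.sum(axis=0)                                # (2, 2) normal matrix
    b = np.einsum("nij,nj->i", weighted, pixels)            # (2,) right-hand side
    return np.linalg.solve(A, b)
```

A pixel whose direction vector is unreliable, for example one on an occluder, can be suppressed by giving it a weight near zero. In the pipeline described above, the located 2D key points would then be passed, together with their 3D counterparts, to a PnP solver.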
6D object pose estimation is an important research topic in computer vision; it determines the 3D position and orientation of an object in the camera-centered coordinate system. In AR, virtual elements can be superimposed on objects and keep their relative pose as the objects move. With the maturity of technologies such as SLAM, robots can localize themselves well in 3D space, but 6D pose estimation is still needed for object grasping and interaction. In autonomous driving, 6D pose estimation can support dynamic 360° panoramic driving assistance. In this paper, pixel-weight layers are added to the PVNet network to select more accurate pixel point vectors, the pose of the object is estimated from the local and global contextual information of the feature map, and the coordinate system transformation matrix is then solved. The framework for applying CoS-PVNet to virtual-real fusion interactive applications is shown in Figure 9. Feature detection operators extract key feature points and descriptors from real-world scene images and match them with the corresponding natural feature templates constructed offline. CoS-PVNet is then used to solve the pose of AR cameras and assembly objects through geometric visual transformation [37], and 3D virtual-real interaction technology enables stable and robust virtual-real fusion interactive applications in AR, VR, robotics, and autonomous driving.
In recent years, 6D pose estimation methods have made significant progress in fields such as AR registration, robot grasping, and autonomous driving navigation. However, the lack of higher-dimensional semantic modeling and understanding of specific complex interactive application scenarios has made it difficult to meet the accuracy and robustness requirements of 6D pose estimation in different working scenarios [38]. On the other hand, with the optimization of deep learning models and the development of new architectures, 6D pose estimation algorithms will be able to handle object recognition and pose estimation in complex scenes more quickly. Although the CoS-PVNet pose estimation algorithm proposed in this article achieves good results on the LineMod and Occlusion LineMod datasets, the dynamic uncertainty of "human-machine-object" interaction in AR, VR, robotics, and autonomous driving [39] means that pose estimation for severely occluded and truncated target objects remains a difficult and important problem in 6D object pose estimation [40]. Therefore, there is still considerable room for improvement in the accuracy of 6D pose estimation in complex scenes. Future work will use the latest advances in target-area semantic segmentation models to accelerate inference and will consider combining reinforcement learning to achieve active 6D object pose estimation [3]. This will also help improve the system performance of AR, VR, robotics, and autonomous driving, promoting the digital and intelligent transformation and upgrading of manufacturing, transportation, and other industries.

Conclusions
We propose a robust CoS-PVNet pose estimation network for complex scenes to address the low accuracy of object 6D pose estimation. Pixel-weight self-learning layers are added to PVNet to select more accurate pixel point vectors, and a global attention mechanism is proposed to improve feature extraction by adding contextual information. With these components, CoS-PVNet estimates the pose of target objects and solves the coordinate system transformation matrix, providing support for AR, VR, robotics, and autonomous driving applications. The performance of CoS-PVNet is evaluated on the LineMod and Occlusion LineMod datasets. The experimental results show that CoS-PVNet can accurately estimate the 6D pose of target objects and can estimate the 6D pose of occluded objects with higher accuracy and robustness. However, this study is limited in that it does not fully integrate geometric, normal, and other multivariate features. The next step is to integrate context features from industry applications more deeply in order to adapt to more complex industrial application scenarios.

Figure 1. Relationship between real world and virtual information in an AR system.
$R_t$ and $T_1$ represent the actual rotation and translation, while $R_v$ and $T_2$ represent the predicted rotation and translation, respectively.

Table 2. Comparison results of 2D projection metrics (unit: %). Bold represents the maximum value of each row, and the meaning expressed in subsequent tables is the same.