SaMfENet: Self-Attention Based Multi-Scale Feature Fusion Coding and Edge Information Constraint Network for 6D Pose Estimation

Abstract: Accurate estimation of an object's 6D pose is one of the crucial technologies for robotic manipulators. When the lighting conditions change or the object is occluded, object information is missing or corrupted, which makes accurate 6D pose estimation more challenging. To estimate the 6D pose of the object accurately, a self-attention-based multi-scale feature fusion coding and edge information constraint 6D pose estimation network is proposed, which achieves accurate 6D pose estimation from RGB-D images. The proposed algorithm first introduces an edge reconstruction module into the pose estimation network, which improves the attention of the feature extraction network to edge features. Furthermore, a self-attention multi-scale point cloud feature extraction module, i.e., MSPNet, is proposed to extract the geometric features of point clouds reconstructed from depth maps. Finally, a clustering feature encoding module, i.e., SE-NetVLAD, is proposed to encode multi-modal dense feature sequences and construct more expressive global features. The proposed method is evaluated on the LineMOD and YCB-Video datasets, and the experimental results illustrate that the proposed method has an outstanding performance, which is close to the current state-of-the-art methods.
Experimental results illustrate that our method outperforms CosyPose, DenseFusion, MSCNet, and G2L-Net by 2.8%, 1.4%, 1.1%, and 0.2% on the first metric, respectively. The maximum error tolerated by robot grasping is 2 cm. Our method surpasses PoseCNN + ICP, DenseFusion, and MSCNet by 2.4%, 0.3%, and 1.7% in this metric, respectively. This proves that our method is more suitable for grasping tasks in the real world.


Introduction
Six-dimensional pose estimation is of great significance for robotic grasping, augmented reality, autonomous driving, etc. However, variation in lighting conditions and occlusion of objects can make accurate 6D pose estimation extremely difficult.
Generally, the classical pose estimation methods can be roughly divided into two main categories for indoor environments. One kind of method is the corresponding feature points-based method, which establishes corresponding relationships between RGB image feature points and the 3D object model feature points; then the object pose can be calculated by applying the perspective-n-point (PnP) [1] algorithm. The other kind of method is a template matching-based method, which samples the 3D object point cloud model from multiple observation views to establish a template library. Then the image is matched with the templates in the template library to obtain the initial poses and perform subsequent optimization. Although traditional methods have many advantages, such as fast calculation, little amount of data required for training, etc., traditional methods cannot be applied in complex environments due to the weak robustness to disturbances, e.g., changes in illumination, occlusions, and the weak surface texture features of objects.
With the successful application of deep learning in computer vision, the application of deep learning in pose estimation has been researched and explored. This trend has led to the emergence of several data-driven 6D pose estimation methods, such as RGB data input-based networks, e.g., PVNet [2], BB8 [3], and PoseCNN [4], and RGB-D data input-based networks, e.g., DenseFusion [5] and PVN3D [6]. To a certain extent, the deep learning-based 6D pose estimation method has overcome the shortcomings of the above traditional methods. However, vulnerability to environmental disturbance cannot be eliminated easily. For instance, PoseCNN employs an end-to-end approach to directly regress the 6D poses from RGB images. Such methods have a greater advantage in calculating speed but generally have lower accuracy when the ambient lighting conditions are poor or the object is occluded. With the inspiration of the traditional methods based on corresponding feature points, many methods, such as PVNet, have been proposed to calculate the 6D pose by locating feature points in the image. These methods exploit neural networks to predict feature points, which can improve the robustness significantly compared to traditional methods. However, these methods are still sensitive to influencing factors such as illumination changes and weak texture of the object.
Considering the above issues, approaches to improving the robustness of 6D pose estimation networks are worth exploring. When illumination conditions change, object edge features remain stable. Similarly, objects with weak surface textures still have effective edge features that can be used for pose estimation. MaskedFusion [7], HybridPose [8], and several other approaches already utilize mask or edge features in object 6D pose estimation. For example, MaskedFusion extends the DenseFusion network with mask feature extraction branches. However, MaskedFusion simply applies to the object mask the same method it uses for RGB images, and the complementary nature of the color and edge features is not emphasized, so MaskedFusion still relies on iterative refinement to achieve high accuracy in pose estimation.
In addition, among the currently advanced methods, e.g., DenseFusion [5], MSCNet [9], and EANet [10], PointNet [11] is usually employed to extract the geometric features of the point cloud. However, PointNet only uses a single scale to extract the point cloud features, which loses the local geometric feature information of the point cloud. This leads to low accuracy pose estimation when the object is heavily occluded. Moreover, PointNet++ [12] is applied in PVN3D [6] to solve the loss of local geometric features of point clouds, while the complex structure of the PointNet++ network leads to slower forward inference in PVN3D.
The pose estimation methods such as DenseFusion and EANet usually use Maxpooling to extract global features of dense feature sequences. However, Maxpooling simply takes the maximum value in the dense feature sequence as the global feature, which ignores the distribution characteristics of the feature sequences. Additionally, the existence of outliers in the dense feature sequence may cause interference to the global features.
To solve the above problems, a novel 6D pose estimation network with edge feature constraints is proposed. To be more specific, texture features extracted by the network are used to perform edge reconstruction and calculate the edge reconstruction loss. The edge reconstruction loss is then combined with the pose estimation loss to optimize the proposed network, so the edge reconstruction and pose estimation tasks are trained simultaneously. The edge reconstruction module draws the attention of the pose estimation network to edge features, thereby dramatically improving the robustness to disturbances such as illumination changes. Meanwhile, to address the problem that PointNet extracts geometric features at a single scale, we propose MSPNet, a multi-scale point cloud feature extraction network based on the self-attention mechanism. We then introduce MSPNet into our pose estimation network for multi-scale feature extraction from point clouds reconstructed from depth images. MSPNet adopts multiple parallel point cloud feature extraction modules to extract local geometric features at different scales and employs a self-attention mechanism to fuse the local geometric features from different scales.
To address the problem that Maxpooling has insufficient modeling capability and is susceptible to outlier interference, we propose the SE-NetVLAD, a clustered feature coding network. SE-NetVLAD clusters and encodes multi-modal dense feature sequences so that SE-NetVLAD is capable of extracting distributed features in feature sequences to construct more expressive global features. Finally, we further enhance the multi-modal dense feature sequences by reinforcing influential features and suppressing redundant features through a self-attention mechanism.
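The difference between the two ways of building a global feature can be illustrated with a small NumPy sketch. It is not the SE-NetVLAD implementation: it places plain max pooling next to a generic NetVLAD-style soft-assignment encoding, and the function names, the random cluster centers, and the `sharpness` parameter are assumptions made purely for illustration.

```python
import numpy as np

def max_pool_global(features):
    """Global feature as the per-channel maximum over the sequence: (N, D) -> (D,)."""
    return features.max(axis=0)

def vlad_global(features, centers, sharpness=1.0):
    """Generic NetVLAD-style encoding (illustrative, not SE-NetVLAD).

    Each feature is softly assigned to K cluster centers and its residuals
    to those centers are accumulated, giving a (K * D,) descriptor.
    """
    residuals = features[:, None, :] - centers[None, :, :]    # (N, K, D)
    logits = -sharpness * (residuals ** 2).sum(axis=2)        # (N, K)
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)               # rows sum to 1
    vlad = (assign[:, :, None] * residuals).sum(axis=0)       # (K, D)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12   # per-cluster norm
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)              # global L2 norm

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 8))   # dense feature sequence: N = 500, D = 8
centers = rng.normal(size=(4, 8))      # K = 4 hypothetical cluster centers

corrupted = features.copy()
corrupted[0] = 50.0                    # a single outlier feature

pooled = max_pool_global(corrupted)    # every channel is now dictated by the outlier
descriptor = vlad_global(features, centers)
```

In this toy setting, one corrupted feature alone determines the max-pooled vector, whereas the clustered descriptor is built from the residuals of all features with respect to every center, so it reflects how the sequence is distributed.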
Our method has been evaluated on the LineMOD Dataset [13] and the YCB-Video Dataset [4]. The experiment results show that our method outperforms the advanced DenseFusion [5] with refinement by 0.7%, and our method has the best performance on smooth, untextured objects in the YCB-Video Dataset.
In summary, there are three main contributions of this work:
• We propose a self-attention-based multi-scale feature fusion coding and edge information constraint network for 6D pose estimation, named SaMfENet. The proposed network introduces an edge reconstruction module, which enhances the attention of the network to edge features. An accurate estimation of the object's 6D pose can be achieved despite changing lighting conditions and weak surface texture of the object.
• A self-attention multi-scale point cloud feature extraction network, named MSPNet, is proposed to extract local geometric features of point clouds at different scales and integrate features from different scales through the self-attention module. MSPNet improves the 6D pose estimation accuracy with only a small increase in model parameters.
• The clustered feature coding network, named SE-NetVLAD, is proposed to extract global features from multi-modal dense feature sequences. Compared to the maximum pooling layer, SE-NetVLAD is less sensitive to outlier interference and is capable of constructing more expressive global features.
The remainder of the article is organized as follows. Section 2 introduces the related works of pose estimation and attention mechanisms. Section 3 describes SaMfENet in detail. Section 4 describes experiments on LineMOD Dataset and YCB-Video Dataset and analyzes the experiment results. Finally, the conclusion of this article is given in Section 5.
Our code is open source and available at https://github.com/r-9li/SaMfENet.

Pose Estimation
The existing pose estimation methods can be mainly divided into traditional methods and deep learning-based methods. Because traditional methods have a low tolerance to environmental disturbances and poor robustness, we focus on 6D object pose estimation methods based on deep learning.

Pose Estimation from RGB Images
The 6D pose estimation methods based on RGB images extract the features of the image through the network and further return the 6D pose of the object. The direct regression method and the key point method are popular approaches for 6D pose estimation. The direct regression method uses the network to regress the object pose directly; for instance, PoseCNN [4] splits the 6D pose estimation task into three subtasks, namely semantic segmentation, 3D translation, and 3D rotation, then constructs a link between these three subtasks to make the network structure more reasonable. SSD-6D [14] extends the object detection task [15] to a 6D pose estimation task and abstracts the 3D rotation of an object to a discrete classification in space. The key point method involves predicting the projection of the object's 3D key points on the 2D image through the network and then acquiring the object's poses by the PnP algorithm. This method is more robust and faster than the direct regression method. In YOLO-6D [16], the object is detected by the YOLO [17] module at first, and the 2D projection of the 3D bounding box of the object is obtained, then the PnP algorithm is employed to solve the pose. Moreover, PVNet [2] extracts the features of the object through a convolutional neural network at the beginning and adopts pixel-level unit vector representation. These vectors are then voted on to determine the key points of the object, and the highest scoring key points are used to solve the poses.
Li et al. [18] proposed DeepIM, an iterative refinement method based on deep learning for estimating the 6D pose of objects. DeepIM optimizes the initial pose by minimizing the difference between the observed image and the rendered image. The iterative refinement process continues until the optimized pose converges or the number of iterations reaches a threshold. Further, Labbé et al. [19] proposed a more effective multi-object pose estimation method based on the idea of DeepIM, called CosyPose. They adopted the rotation parametrization reported in [20] in CosyPose to make CNN training more stable.
Generally speaking, object 6D pose estimating methods based on RGB images have the advantages of simple input and fast processing. However, the neglect of spatial geometric information makes the estimation accuracy of these methods limited and less robust.

Pose Estimation from RGB-D Data
The current methods for pose estimation based on RGB-D images are mainly divided into three types. For the first type, depth information is utilized as additional information to optimize the estimation accuracy, e.g., PoseCNN [4], YOLO-6D [16], and BB8 [3]. These methods employ only RGB images to estimate the object pose roughly in the first stage, then generate point clouds based on depth information and RGB images. Finally, point cloud matching algorithms (e.g., Iterative Closest Point (ICP) [21], Generalized Iterative Closest Point (GICP) [22], and Super 4PCS [23]) are adopted to refine the object pose. For the second type, taking MCN [24] as an example, MCN splices the depth information channel with the RGB information channel and feeds the whole features into the network to predict the object pose. However, neither of these two types of methods deeply fuse RGB information with depth information, so these two types of methods cannot fully exploit the complementary nature of the two types of data. The use of point cloud matching algorithms such as ICP consumes a lot of time, which means these methods (e.g., PoseCNN + ICP, BB8 + ICP) cannot estimate the pose in real time.
By contrast, DenseFusion [5], MaskedFusion [7], and other methods have attempted to integrate RGB features deeply with depth-informed features at a later stage of feature extraction, which has achieved better results than the previous two types of methods. MSCNet [9] extends DenseFusion to extract further contextual information of point-level multi-modal features for enhancing feature expression ability after constructing those features. However, all of these methods only extract the geometric features of the point cloud at a single scale through PointNet, so the local geometric features of the point cloud are lost. Moreover, all of the above approaches focus on extracting color features of RGB images only, ignoring the importance of edge features in the pose estimation task.
To solve the above problem, a self-attention multi-scale point cloud feature extraction module, i.e., MSPNet, is proposed. At the same time, we incorporate the edge feature constraint into the pose estimation network and propose a self-attention-based multi-scale feature fusion coding and edge information constraint 6D pose estimation network.

Attention Mechanism
The attention mechanism was initially applied in machine translation [25][26][27], and now it has become an important part of neural networks. Many variants of attention mechanisms are widely used in computer vision and natural language processing tasks. For example, SENet [28] can learn the importance of each feature channel autonomously to activate practical features and suppress ineffective ones. ECANet [29] replaces the fully connected layer in SENet with a 1D convolutional layer, improving network performance dramatically with a small increase in the number of parameters and computation. GSoP-Net [30] extends SENet and replaces the global average pooling in SENet with second-order pooling, which compensates for the lack of modeling capability of global average pooling and makes GSoP-Net more capable of extracting global information.
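As a concrete reference for the channel attention described above, the following is a minimal NumPy sketch of a squeeze-and-excitation block in the style of SENet: global average pooling, a bottleneck of two fully connected layers, and a sigmoid gate per channel. The weights are random placeholders and the reduction ratio `r` is an assumed value chosen for illustration.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-excitation over a feature map x of shape (C, H, W)."""
    squeezed = x.mean(axis=(1, 2))                       # squeeze: (C,)
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)         # FC + ReLU: (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))    # FC + sigmoid: (C,)
    return x * gates[:, None, None], gates               # rescale each channel

rng = np.random.default_rng(0)
C, r = 16, 4                                             # r: assumed reduction ratio
x = rng.normal(size=(C, 8, 8))
w1, b1 = rng.normal(size=(C // r, C)) * 0.1, np.zeros(C // r)
w2, b2 = rng.normal(size=(C, C // r)) * 0.1, np.zeros(C)

y, gates = se_block(x, w1, b1, w2, b2)
```

Each gate lies in (0, 1), so a channel is passed through almost unchanged when its gate is close to 1 and suppressed when it is close to 0.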
Unlike the traditional convolutional layers with only one convolutional kernel, CondConv [31] assembles multiple convolutional kernels in one convolutional layer. Specifically, CondConv adopts a routing function to calculate the weight of each convolutional kernel based on the input to the convolutional layer and weighs each convolutional kernel by its weight. Finally, the weighted generated convolution kernel is used to convolve the input. CondConv achieves only a slight increase in computation while boosting model capacity and performs well on tasks such as image classification. SKNet [32] is capable of adjusting the receptive field of features adaptively by assigning weights to feature maps generated by convolution kernels of different sizes. Convolutional Block Attention Module (CBAM) [33] combines channel attention with spatial attention, which has excellent performance for improving the accuracy of object detection tasks.
Inspired by SKNet, the attention mechanism is introduced into MSPNet to integrate local features from different scales, avoiding significant increases in network parameters caused by high feature dimensionality.
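The idea of weighting features from different scales can be sketched as follows. This is an SKNet-style selection step adapted to per-point features, not the exact MSPNet fusion: pooling over the points, the bottleneck size `d`, and all weight matrices are assumptions introduced for illustration.

```python
import numpy as np

def sk_fuse(branches, w_reduce, w_select):
    """SKNet-style fusion of M feature branches.

    branches: list of M arrays of shape (C, N), one per scale
    w_reduce: (d, C) bottleneck weights (hypothetical)
    w_select: (M, C, d) per-branch selection weights (hypothetical)
    """
    stacked = np.stack(branches)                      # (M, C, N)
    summed = stacked.sum(axis=0)                      # fuse: element-wise sum
    s = summed.mean(axis=1)                           # pool over points: (C,)
    z = np.maximum(w_reduce @ s, 0.0)                 # compact descriptor: (d,)
    logits = w_select @ z                             # (M, C)
    logits -= logits.max(axis=0, keepdims=True)       # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)     # softmax across branches
    fused = (weights[:, :, None] * stacked).sum(axis=0)   # (C, N)
    return fused, weights

rng = np.random.default_rng(0)
M, C, N, d = 3, 128, 50, 16
branches = [rng.normal(size=(C, N)) for _ in range(M)]
w_reduce = rng.normal(size=(d, C)) * 0.1
w_select = rng.normal(size=(M, C, d)) * 0.1

fused, weights = sk_fuse(branches, w_reduce, w_select)
```

Because the branches are combined by a weighted sum rather than concatenation, the fused feature keeps the dimensionality of a single branch, which is the property the paragraph above relies on.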

The Proposed Method
The proposed network takes the RGB-D image as an input and outputs the 6D pose of the object. Specifically, the 6D pose of the object is the rigid transformation from the object coordinate system to the camera coordinate system. This rigid transformation is represented as a homogeneous transformation matrix p = [R, t] consisting of a rotation transformation R ∈ SO(3) and a translation transformation t ∈ ℝ³. Figure 1 illustrates the overall architecture of SaMfENet. SaMfENet contains five main parts as follows:

Overview
• I. Semantic segmentation module. Based on the semantic segmentation network proposed in PoseCNN [4], the input RGB images are segmented to obtain a mask and bounding box for each instance object. The 3D point cloud is transformed from depth pixels covered by the mask, and the image block obtained by cropping with bounding boxes is used for subsequent feature extraction.
• II. Edge reconstruction module. The image block of the instance object is fed into an image feature extractor constructed via an encoder-decoder structure to extract the texture features. Then the edge reconstruction network generates an edge reconstruction image of the object based on the texture features. The object's edges generated by the Canny [34] operator are used to constrain the edge reconstruction. This constraint improves the ability of the image feature extractor to perceive edge information, thereby enhancing the robustness of the network to illumination changes.
• III. Multi-scale point cloud geometric feature extraction module. The object point cloud is fed into MSPNet, which extracts local geometric features at different scales. Finally, a self-attention mechanism is applied to fuse the local geometric features at different scales into a multi-scale geometric feature of the point cloud.
• IV. SE-NetVLAD for feature fusion. The multi-modal dense feature sequence A, constructed from pixel-wise texture features and geometric features, is fed into SE-NetVLAD. Then, SE-NetVLAD constructs global features by clustering and encoding feature sequence A and concatenates the global features and feature sequence A at the pixel level. The influential features are further enhanced through a self-attention mechanism, while redundant features are suppressed.
• V. Pose estimation module. Feature sequence B is fed into the pose estimator, which consists of multiple consecutive convolutional layers. The pose estimator regresses translations and rotations directly.
Figure 1. The overall architecture of SaMfENet. SaMfENet can be divided into five parts. First, the input image is semantically segmented to obtain image blocks containing the object and the object point cloud reconstructed from the depth images. The image block is then fed into the image feature extraction module to extract features. We add edge feature constraints to improve the attention of the image feature extraction module to edge features. At the same time, the object point cloud is fed into MSPNet to extract the multi-scale geometric features of the point cloud. In addition, the two features are merged at the pixel level to obtain a multi-modal dense feature sequence, which is fed into SE-NetVLAD to extract the global features. Finally, the dense feature sequence is fed into the pose estimator to predict the rotation and translation.
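The pose representation p = [R, t] introduced at the start of this section can be made concrete with a short NumPy sketch that assembles the homogeneous matrix and maps model points from the object frame to the camera frame. The function names and the example rotation and translation are illustrative assumptions.

```python
import numpy as np

def to_homogeneous(R, t):
    """Assemble the 4x4 homogeneous transform from R in SO(3) and t in R^3."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def transform_points(T, points):
    """Map (N, 3) points from the object frame to the camera frame."""
    return points @ T[:3, :3].T + T[:3, 3]

theta = np.pi / 2                                  # 90 degree rotation about z
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.1, 0.0, 0.5])                      # translation in meters

T = to_homogeneous(R, t)
model_points = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
camera_points = transform_points(T, model_points)
# camera_points is approximately [[0.1, 1.0, 0.5], [-0.9, 0.0, 0.5]]
```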

Semantic Segmentation
In order to reduce the interference of the surrounding environment on the pose estimation, the object region should be segmented from the image first. In this work, the semantic segmentation network provided by PoseCNN is employed. The image semantic segmentation network constructed by an encoder-decoder structure takes an RGB image as input and outputs N + 1 binary maps. The activated pixels in each binary image indicate that these pixels belong to the object represented by the binary image. Based on the masks of the objects, we can obtain the bounding box that encloses these objects and crop the image with this bounding box to obtain the image block containing these objects. Moreover, the object regions in the depth map can be obtained by multiplying the masks of the objects and the depth map. Further, we transform the depth map into a visible surface point cloud of the object using the camera's intrinsic parameters.
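The cropping and back-projection steps described above can be sketched with NumPy under a standard pinhole camera model. The function names, the intrinsics `fx`, `fy`, `cx`, `cy`, and the `depth_scale` parameter are assumptions for illustration; real values come from the dataset's camera calibration.

```python
import numpy as np

def mask_bounding_box(mask):
    """Tight bounding box of a binary mask as (row0, row1, col0, col1), end-exclusive."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1

def depth_to_point_cloud(depth, mask, fx, fy, cx, cy, depth_scale=1.0):
    """Back-project masked, valid depth pixels to an (N, 3) camera-frame point cloud."""
    v, u = np.nonzero(mask.astype(bool) & (depth > 0))   # pixel rows and columns
    z = depth[v, u] / depth_scale
    x = (u - cx) * z / fx                                # pinhole model
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

depth = np.full((4, 4), 2.0)                 # toy depth map, 2 m everywhere
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                        # object occupies the central 2x2 pixels

bbox = mask_bounding_box(mask)               # (1, 3, 1, 3)
points = depth_to_point_cloud(depth, mask, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
# points[0] is [-0.5, -0.5, 2.0] for pixel (row 1, column 1)
```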

Image Feature Extraction Module
As the size of the cropped image blocks is not fixed, inspired by the property that the fully convolutional network (FCN) is not sensitive to the size of the input image, we designed an image feature extraction module that can be fed with images of arbitrary size. Although PSPNet [35] used in DenseFusion [5] can integrate features at different scales, the lack of shallow image features makes the output feature maps unfavorable for edge reconstruction. Therefore, an encoder-decoder network with a symmetric hourglass-type structure and a skip connection structure between the encoder and decoder is employed.
As shown in Figure 2, we first feed the image block into a 2D convolutional layer to generate a feature map with a size of (H, W, 64). This feature map is then fed into four successive downsampling modules to generate a feature map with a size of (H/16, W/16, 1024). The downsampling module includes two consecutive 2D convolutional layers (DoubleConv) and a maximum pooling layer. For the decoder part, we use four successive upsampling modules to recover the feature map with a size of (H/16, W/16, 1024) to a feature map with a size of (H, W, 64). The upsampling module consists of two consecutive 2D convolutional layers and a bilinear interpolation upsampling layer.
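The feature map sizes quoted above can be traced with a few lines of Python. The sketch assumes that each downsampling module halves the spatial size (rounding down) and doubles the channel count, which is consistent with going from 64 to 1024 channels in four stages, and that each upsampling module restores the size of the matching encoder stage; both are assumptions about details the text does not spell out.

```python
def encoder_decoder_shapes(height, width, base_channels=64, stages=4):
    """Trace (H, W, C) feature map sizes through the encoder and mirrored decoder."""
    encoder = [(height, width, base_channels)]
    for _ in range(stages):
        h, w, c = encoder[-1]
        # 2x2 max pooling halves H and W; channels are assumed to double
        encoder.append((h // 2, w // 2, c * 2))
    # Each upsampling stage is assumed to return to the size of its skip connection
    decoder = encoder[-2::-1]
    return encoder, decoder

encoder, decoder = encoder_decoder_shapes(80, 96)
# encoder: (80, 96, 64) -> (40, 48, 128) -> (20, 24, 256) -> (10, 12, 512) -> (5, 6, 1024)
# decoder: (10, 12, 512) -> (20, 24, 256) -> (40, 48, 128) -> (80, 96, 64)
```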

Edge Reconstruction Module
In real scene applications, the texture features of the object surface are easily affected by changes in ambient lighting. Moreover, the surface of some objects is smooth and weakly textured, which makes the texture-based feature extraction method ineffective. We observe that the edge features of the object surface remain stable when the lighting condition changes dramatically, and the objects with weak textures also have effective edge features for pose estimation. Therefore, the edge reconstruction module is designed. As shown in Figure 1(II), this module uses the Edge Reconstructor to generate an edge reconstruction image from the feature map output by the image feature extraction module. The Edge Reconstructor consists of two 1 × 1 convolutional layers, which can map the input feature map with the size of (H, W, 64) to an edge reconstruction image with the size of (H, W, 1). At the same time, we multiply the image block cropped by the bounding box with the mask of the object region to obtain an image block containing the object only. Moreover, we process this image block with the Canny [34] operator to generate the object edge for edge reconstruction.
The loss of the edge reconstruction task is defined as the binary cross-entropy (BCE), whose loss function can be expressed as:
$$L_{edge} = -\sum_{i,j} \left[ \beta \, E_{gt}(i,j) \log E_{x}(i,j) + (1-\beta) \left(1 - E_{gt}(i,j)\right) \log\left(1 - E_{x}(i,j)\right) \right]$$
where $(i, j)$ indicates the position of the pixel on the image, $E_{gt}(i, j) = 1$ indicates that pixel $(i, j)$ is an edge pixel in the object edge, $E_{x}(i, j)$ represents the value of the pixel $(i, j)$ in the edge reconstruction image, and $\beta$ indicates the percentage of non-edge pixels to all pixels in the object edge.
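A minimal NumPy sketch of such a class-balanced BCE is given below. It assumes that the edge term is weighted by β and the non-edge term by 1 − β, as is common in edge detection losses, and that the loss is summed over pixels; the function name and the `eps` clipping constant are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def edge_bce_loss(edge_pred, edge_gt, eps=1e-7):
    """Class-balanced binary cross-entropy between a predicted edge map and a
    binary ground-truth edge map of the same shape (H, W)."""
    edge_gt = edge_gt.astype(np.float64)
    pred = np.clip(edge_pred, eps, 1.0 - eps)     # avoid log(0)
    beta = 1.0 - edge_gt.mean()                   # fraction of non-edge pixels
    per_pixel = -(beta * edge_gt * np.log(pred)
                  + (1.0 - beta) * (1.0 - edge_gt) * np.log(1.0 - pred))
    return per_pixel.sum(), beta                  # summed over pixels (assumed)

edge_gt = np.zeros((4, 4))
edge_gt[0, :] = 1.0                               # 4 edge pixels out of 16
uniform_pred = np.full((4, 4), 0.5)

loss, beta = edge_bce_loss(uniform_pred, edge_gt)
# beta = 0.75 and loss = 6 * ln(2): edge and non-edge pixels contribute equally
perfect_loss, _ = edge_bce_loss(edge_gt, edge_gt)
```

With this weighting, the few edge pixels and the many non-edge pixels contribute equally to the loss for an uninformative prediction, so the network cannot minimize the loss by predicting "no edge" everywhere.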

Multi-Scale Point Cloud Geometric Feature Extraction Module
Typically, pose estimation methods do not fully exploit the complementary nature of depth information and RGB information. To solve this problem, we employ the depth image to generate the surface point cloud of the object and feed it into our network. Then the geometric features of the point cloud are extracted by a self-attention multi-scale point cloud feature extraction network, MSPNet. MSPNet extracts multi-scale geometric features of the point cloud through a parallel structure. Meanwhile, we introduce the self-attention mechanism into MSPNet to adaptively select the feature extraction scale. Figure 3 displays the specific process of our network for extracting geometric features. The network can be divided into two parts. The upper branch generates point-level features for each point by encoding the spatial location information of each point through three successive multi-layer perceptrons.
The lower branch aims to extract the local features of the neighborhood of each point. Multiple parallel Graph Conv Layers are employed to extract the local geometric features. Each Graph Conv Layer selects a different number of neighborhood points to extract local geometric features at multiple scales. Figure 4 demonstrates the structure of the Graph Conv Layer. We first take each point as the center, then select the k nearest neighbor points by Euclidean distance to form a neighbor point set Y with the size of (3, k, N). Additionally, the k neighborhood points are subtracted from each point in the point cloud to generate a local feature vector F with the size of (3, k, N), which is then mapped to a local feature vector F′ with the size of (128, k, N) by a multi-layer perceptron. The process is expressed as:
$$F'(i, j) = h\left(P(i) - Y(i, j)\right)$$
where P(i) represents the i-th point, and Y(i, j) represents the j-th neighborhood point of the i-th point. h( ) represents a non-linear function with parameters, i.e., a multi-layer perceptron.
Meanwhile, an attention mechanism is introduced to assign different weights to each point and its k neighborhood points. As shown in Figure 4, the self-coefficients α obtained from the point cloud mapping represent the weights of each point, and the local-coefficients β obtained from the local feature vector F′ mapping represent the weights of the k neighborhood points of each point. We add the self-coefficients α and local-coefficients β to generate the final attention-coefficients γ. Then we multiply the local feature vector F′ by the attention-coefficients γ with the size of (1, k, N) and then sum the result in the channel dimension to obtain the geometric features F″ output by the Graph Conv Layer. The process can be expressed as:
$$\alpha = h'(P), \qquad \beta = h''(F'), \qquad \gamma = \alpha + \beta, \qquad F'' = \sum \gamma \odot F'$$
where h′( ) and h″( ) represent different non-linear functions with parameters.
To allow the network to choose the scale for local features extraction adaptively, we introduce a self-attention mechanism after the feature extraction layer. Here, in order to obtain the weights ( = 1 ⋯ ) of the feature vectors, we feed the sum of the feature vector ′′ ( = 1 ⋯ ) output from the feature extraction layer into the network, which Meanwhile, an attention mechanism is introduced to assign different weights to each point and its k neighborhood points. As shown in Figure 4, the self-coefficients α obtained from the point cloud mapping represent the weights of each point, and the localcoefficients β obtained from the local feature vector F mapping represent the weights of the k neighborhood points of each point. We add the self-coefficients α and local-coefficients β to generate the final weight attention-coefficients γ. Then we multiply the local feature vector F by the attention-coefficients γ with the size of (1, k, N) and then sum the result in the channel dimension to obtain the geometric features F of the Graph Conv Layer output. The process can be expressed as: where h ( ) and h ( ) represent different non-linear functions with parameters.
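The Graph Conv Layer described above can be sketched in NumPy as follows. This is a hedged illustration, not the authors' code: single linear maps (plus a ReLU) stand in for the multi-layer perceptrons, the weight arrays `w_feat`, `w_self`, and `w_local` are hypothetical, each point is excluded from its own neighborhood, and the weighted features are reduced over the k neighbors so that every point keeps one 128-dimensional feature. The last two choices are assumptions, since the text does not pin them down.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours (Euclidean) of every point, shape (N, k)."""
    diff = points[:, None, :] - points[None, :, :]     # (N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)               # (N, N)
    np.fill_diagonal(dist, np.inf)                     # assumption: skip the point itself
    return np.argsort(dist, axis=1)[:, :k]

def graph_conv_layer(points, k, w_feat, w_self, w_local):
    """One attention-weighted Graph Conv Layer; linear maps stand in for the MLPs."""
    neighbours = points[knn_indices(points, k)]        # Y: (N, k, 3)
    local = points[:, None, :] - neighbours            # F = P(i) - Y(i, j): (N, k, 3)
    feat = np.maximum(local @ w_feat, 0.0)             # F': (N, k, C)
    alpha = points @ w_self                            # self-coefficients: (N, 1)
    beta = feat @ w_local                              # local-coefficients: (N, k, 1)
    gamma = alpha[:, None, :] + beta                   # attention-coefficients: (N, k, 1)
    return (gamma * feat).sum(axis=1)                  # F'': (N, C), reduced over neighbours

rng = np.random.default_rng(0)
cloud = rng.normal(size=(50, 3))
out = graph_conv_layer(cloud, k=8,
                       w_feat=rng.normal(size=(3, 128)),
                       w_self=rng.normal(size=(3, 1)),
                       w_local=rng.normal(size=(128, 1)))
print(out.shape)  # (50, 128)
```

Running several such layers with different values of k gives the parallel multi-scale features that MSPNet later fuses.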
To allow the network to adaptively choose the scale for local feature extraction, we introduce a self-attention mechanism after the feature extraction layer. To obtain the weights $W_i$ ($i = 1, \cdots, n$) of the feature vectors, we feed the sum of the feature vectors $F''_i$ ($i = 1, \cdots, n$) output from the feature extraction layer into a network that consists of an average pooling layer, a fully connected layer, and a softmax layer. The final output local feature vector $F_{local}$ is the weighted sum of the feature vectors at each scale, which can be expressed as:

$$F_{local} = \sum_{i=1}^{n} W_i \cdot F''_i$$

Finally, the geometric features are generated by concatenating the point-level features of each point with the local features.
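The scale-selection step (sum, average pooling, fully connected layer, softmax, weighted sum) can be sketched as below. This is a sketch under stated assumptions: the function `fuse_scales`, the weight matrix `w_fc`, and the use of one scalar weight per scale (rather than one weight per scale and channel) are illustrative choices, not details confirmed by the paper.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_scales(features, w_fc):
    """Fuse the outputs of n Graph Conv Layers.

    features: (n, N, C) array, one (N, C) feature map per scale.
    w_fc:     (C, n) hypothetical fully connected weights.
    """
    summed = features.sum(axis=0)                     # (N, C) sum over scales
    pooled = summed.mean(axis=0)                      # (C,) average pooling over points
    weights = softmax(pooled @ w_fc, axis=0)          # (n,) one weight per scale
    fused = np.tensordot(weights, features, axes=1)   # (N, C) weighted sum over scales
    return fused, weights

features = np.stack([np.ones((4, 3)), 3.0 * np.ones((4, 3))])  # two toy scales
fused, weights = fuse_scales(features, w_fc=np.zeros((3, 2)))
```

With all-zero `w_fc` the softmax is uniform, so the fused feature is simply the mean of the two scales; trained weights would instead favor whichever neighborhood size is more informative.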

Feature Fusion
To fuse geometric and texture features effectively, we use a pixel-level dense fusion strategy. This strategy avoids estimating the object pose from a single global feature, thereby improving the robustness of the pose estimation network against inaccurate image segmentation or object occlusion. As shown in Figure 1(IV), the geometric features of each point are concatenated with the texture features to generate a multi-modal dense feature sequence.

Global Feature Extraction
Since the object is composed of multiple pixels, global features built from the features of each pixel summarize the object as a whole, which is essential for estimating object poses in a changing environment.
Here, we propose SE-NetVLAD, rather than the commonly used maximum pooling layer or average pooling layer, to extract the global features of a feature sequence. Vector of Locally Aggregated Descriptors (VLAD) [36] was originally designed to aggregate local feature descriptors in an image into a global feature description vector. However, VLAD cannot be introduced into neural networks directly because it applies hard classification to find the nearest clustering center for each local feature descriptor, and hard classification prevents the neural network from being optimized end-to-end by back-propagation. We replace the hard classification of local feature descriptors in VLAD with a differentiable soft classification, which allows SE-NetVLAD to be optimized end-to-end through back-propagation.
As shown in Figure 1(IV), the input to SE-NetVLAD is a one-dimensional multi-modal dense feature sequence $\{x_i\}$ with the size of (N, D), where D is the feature dimension. We first input the feature sequence into a 1 × 1 convolutional layer followed by a softmax layer to obtain the soft classification weights $p_k(x_i)$ with the size of (N, K). $p_k(x_i)$ can be expressed as:

$$p_k(x_i) = \frac{e^{W_k^{T} x_i}}{\sum_{k'=1}^{K} e^{W_{k'}^{T} x_i}}$$

Afterward, the soft classification weight $p_k(x_i)$ is multiplied by the residual $(x_i(j) - c_k(j))$ and accumulated over the sequence to obtain a feature vector V with a size of (K, D), which can be expressed as:

$$V(k, j) = \sum_{i=1}^{N} p_k(x_i)\,\big(x_i(j) - c_k(j)\big)$$

where K is the number of clusters, and $\{W_k\}$ and $\{c_k\}$ are the trainable parameter sets. We then reshape the feature vector V into a 1D feature vector with a size of (1, K × D) and feed it into a single fully connected layer to map it to a global feature vector. Finally, we concatenate the global feature vector with the multi-modal dense feature sequence at the pixel level to generate a multi-modal dense feature sequence containing the global features.
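The soft-assignment aggregation at the core of this module can be illustrated with the short NumPy sketch below. It covers only the soft classification and residual accumulation; the final fully connected layer and any normalization are left out, and the names `soft_vlad`, `w`, and `c` are assumptions made for illustration.

```python
import numpy as np

def soft_vlad(x, w, c):
    """Soft-assignment VLAD aggregation.

    x: (N, D) dense feature sequence
    w: (D, K) assignment weights (plays the role of the 1 x 1 convolution)
    c: (K, D) cluster centres
    Returns the flattened (1, K * D) descriptor.
    """
    logits = x @ w                                   # (N, K)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)                # soft weights p_k(x_i), rows sum to 1
    residual = x[:, None, :] - c[None, :, :]         # (N, K, D): x_i - c_k
    v = (p[:, :, None] * residual).sum(axis=0)       # (K, D), accumulated over the sequence
    return v.reshape(1, -1)

x = np.array([[1.0, 2.0], [3.0, 4.0]])
c = np.array([[0.0, 0.0], [2.0, 3.0]])
descriptor = soft_vlad(x, w=np.zeros((2, 2)), c=c)
```

Unlike max or average pooling, every cluster keeps its own residual summary, and because the assignment is a softmax rather than an argmin, gradients reach both `w` and `c`.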

Self-Attention Mechanism
Multi-modal dense feature sequences may contain redundant feature information, so we need to model the importance of feature channels explicitly and suppress redundant channels. Thus, we insert a channel attention module, as shown in Figure 1(IV). The feature sequence $\{f_{in}\}$ with the size of (N, D) is mapped to the channel weight vector W with a size of (1, D). The final output of the module, the feature sequence $\{f_{out}\}$, is the original input feature sequence $\{f_{in}\}$ weighted channel-wise by W, which can be expressed as:

$$f_{out} = W \odot f_{in}$$
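A channel attention module of this kind can be sketched as follows. The text does not specify how W is computed, so the squeeze-and-excitation style mapping used here (average pooling, a ReLU bottleneck, and a sigmoid), as well as the names `channel_attention`, `w1`, and `w2`, are assumptions for illustration only.

```python
import numpy as np

def channel_attention(f_in, w1, w2):
    """Re-weight the channels of an (N, D) feature sequence.

    w1: (D, D_r) and w2: (D_r, D) are hypothetical bottleneck weights.
    Returns the re-weighted sequence and the (1, D) channel weights.
    """
    squeezed = f_in.mean(axis=0, keepdims=True)      # (1, D) average over the sequence
    hidden = np.maximum(squeezed @ w1, 0.0)          # (1, D_r) ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # (1, D) sigmoid channel weights W
    return f_in * weights, weights                   # W is broadcast over all N features

f_in = np.ones((5, 4))
f_out, channel_weights = channel_attention(f_in, np.zeros((4, 2)), np.zeros((2, 4)))
```

Channels whose weight is pushed toward zero are effectively suppressed, which is the intended treatment of redundant channels.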

Pose Estimator
The pose estimator is composed of three parallel network structures, which regress rotation, translation, and confidence, respectively. Each of the three structures consists of four consecutive 1 × 1 convolutional layers. To make our method more robust to environmental changes and occlusion, we regress a predicted pose for each feature vector in the multi-modal dense feature sequence. As shown in Figure 1(V), we input all feature vectors in the sequence into the pose estimator and generate a prediction for each feature vector. Meanwhile, we adopt a self-supervised method to select the best prediction, i.e., when the network regresses the pose, the confidence of the prediction is regressed simultaneously. Among the dense predictions, the one with the highest confidence is selected as the final network output.
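Selecting the final output from the dense predictions amounts to an argmax over the regressed confidences, as in the small sketch below. The array layout and the 4-vector rotation entries are placeholders, since the rotation parameterization is not stated in this section.

```python
import numpy as np

def select_best_prediction(rotations, translations, confidences):
    """Return the dense prediction with the highest regressed confidence."""
    best = int(np.argmax(confidences))
    return rotations[best], translations[best], best

# Three hypothetical per-pixel predictions (placeholder rotations, translations in metres).
rotations = np.array([[1.0, 0.0, 0.0, 0.0], [0.7, 0.7, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
translations = np.array([[0.0, 0.0, 0.5], [0.1, 0.0, 0.6], [0.0, 0.1, 0.7]])
confidences = np.array([0.2, 0.9, 0.4])
rot, trans, index = select_best_prediction(rotations, translations, confidences)
print(index)  # 1
```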

Loss Function
The proposed network model needs to learn the 6D pose $\hat{p}_i$ of the object and the prediction confidence $c_i$. For 6D pose estimation, we define the loss as the distance between the points sampled on the object model transformed by the ground truth pose and the same points transformed by the predicted pose. Therefore, the loss function of each dense prediction is defined as follows:

$$L_i^p = \frac{1}{M} \sum_{j} \left\| (R x_j + t) - (\hat{R}_i x_j + \hat{t}_i) \right\|$$

where M is the number of sample points on the 3D model of the object, $x_j$ is the j-th sampled point, $p = [R|t]$ is the ground truth pose, and $\hat{p}_i = [\hat{R}_i|\hat{t}_i]$ is the i-th predicted pose. The above loss function is only applicable to asymmetric objects, which have only one correct pose. Symmetrical objects may have more than one correct pose, so using the above loss function would make the pose estimation ambiguous. Therefore, for symmetrical objects, we define the loss as the distance between each sampled point on the object model transformed by the ground truth pose and its nearest neighbor among the points transformed by the predicted pose:

$$L_i^p = \frac{1}{M} \sum_{j} \min_{0<k<M} \left\| (R x_j + t) - (\hat{R}_i x_k + \hat{t}_i) \right\|$$

To make the network learn a confidence level for each prediction, we weight the loss of each prediction by its confidence and add a confidence regularization term. The final pose estimation loss function can be expressed as:

$$L_{pose} = \frac{1}{N} \sum_{i} \left( L_i^p c_i - w \log(c_i) \right)$$

where N is the number of predictions and w is a balancing hyperparameter. Finally, we combine the edge reconstruction loss function and the pose estimation loss function as the final loss function Loss with the hyperparameter λ, as shown in Equation (12).
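The confidence weighting with its regularization term can be illustrated as below. This is a minimal sketch that takes the per-prediction distances as given; the function name is an assumption, and the default `w=0.015` is the value reported in the implementation details.

```python
import numpy as np

def confidence_weighted_loss(per_prediction_loss, confidences, w=0.015):
    """Average of L_i * c_i - w * log(c_i) over all dense predictions."""
    losses = np.asarray(per_prediction_loss, dtype=float)
    conf = np.asarray(confidences, dtype=float)
    return float(np.mean(losses * conf - w * np.log(conf)))

certain = confidence_weighted_loss([1.0, 2.0], [1.0, 1.0])        # log(1) = 0, so plain mean
hesitant = confidence_weighted_loss([0.0, 0.0], [0.5, 0.5], w=1)  # only the penalty remains
```

Lowering a confidence shrinks that prediction's contribution to the loss, but the `-w * log(c_i)` term grows, so the network cannot drive all confidences to zero.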

Datasets
To evaluate the performance of the proposed network, two open datasets (LineMOD Dataset [13] and YCB-Video Dataset [4]) are used to conduct experiments.
The LineMOD Dataset, which is widely used to evaluate both classical and learning-based pose estimation methods, contains 13 weakly textured objects from 13 videos and does not contain synthetic images.
The YCB-Video Dataset consists of 92 videos containing a total number of 21 objects. Further, the YCB-Video Dataset contains distractions such as lighting condition changes and objects being occluded, which makes this dataset challenging.

Metrics
The accuracy of the pose estimation can be measured by two metrics, the average distance (ADD) and the average closest point distance (ADD-S).
The average distance is defined as the average distance between the points sampled on the 3D object model transformed by the ground truth pose and the same points transformed by the predicted pose, which can be expressed as:

$$\mathrm{ADD} = \frac{1}{m} \sum_{x \in \mathcal{M}} \left\| (R x + t) - (\hat{R} x + \hat{t}) \right\|$$

where $\mathcal{M}$ is the set of 3D model points and m is the number of sample points. The average closest point distance takes, for each sampled point on the 3D model transformed by the ground truth pose, the distance to the closest point transformed by the predicted pose, and then averages these distances over all sampled points, which can be expressed as:

$$\mathrm{ADD\text{-}S} = \frac{1}{m} \sum_{x_1 \in \mathcal{M}} \min_{x_2 \in \mathcal{M}} \left\| (R x_1 + t) - (\hat{R} x_2 + \hat{t}) \right\|$$
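Both metrics can be computed in a few lines of NumPy, as in the sketch below. The function names and the brute-force pairwise distance matrix are illustrative choices; for large models a k-d tree would usually replace the (m, m) matrix.

```python
import numpy as np

def add_metric(model_pts, rot_gt, t_gt, rot_pred, t_pred):
    """Average distance between corresponding transformed model points."""
    gt = model_pts @ rot_gt.T + t_gt
    pred = model_pts @ rot_pred.T + t_pred
    return float(np.linalg.norm(gt - pred, axis=1).mean())

def adds_metric(model_pts, rot_gt, t_gt, rot_pred, t_pred):
    """Average distance from each ground-truth point to its closest predicted point."""
    gt = model_pts @ rot_gt.T + t_gt
    pred = model_pts @ rot_pred.T + t_pred
    pairwise = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1)  # (m, m)
    return float(pairwise.min(axis=1).mean())

# A two-point "object" that is symmetric under a 180 degree rotation about z.
pts = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
identity, zero = np.eye(3), np.zeros(3)
flip = np.diag([-1.0, -1.0, 1.0])
print(add_metric(pts, identity, zero, flip, zero))   # 2.0
print(adds_metric(pts, identity, zero, flip, zero))  # 0.0
```

The toy example shows why ADD-S is used for symmetric objects: the flipped pose is visually indistinguishable, yet ADD penalizes it heavily while ADD-S does not.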

Implementation Details
Our method is implemented in the PyTorch framework and uses the Adam optimizer to optimize the network parameters during training. All experiments in this work run on a desktop computer with an Intel® Xeon® E5-2680 v4 CPU and two NVIDIA RTX 3090 GPUs. During training, we set the initial learning rate to 0.0001, the maximum number of iterations to 500, the number of sampling points to 1000, hyperparameter λ to 0.3, and hyperparameter w to 0.015.

Evaluation on LineMOD Dataset
In the LineMOD Dataset, we consider the pose estimation to be correct if ADD(-S) (measured by ADD for asymmetric objects and ADD-S for symmetric objects) is lower than ten percent of the object diameter, following previous work [5]. We evaluate each method by the percentage of key-frames with a correct pose estimate among all key-frames, and refer to this percentage as accuracy in the following. Table 1 shows the performance of our method and other state-of-the-art RGB-based or RGB-D-based methods on the LineMOD Dataset. From Table 1, the average accuracy of the RGB-based methods, such as BB8 (62.7%), PVNet (86.3%), PoseCNN + DeepIM (88.6%), and HRPose (91.6%), is lower than ours (95.0%), because the RGB-based methods do not utilize spatial geometric information. When the ambient lighting conditions are poor or the surface texture of the object is weak (e.g., ape, duck), effective texture features cannot be extracted, resulting in inaccurate pose estimation. Among the RGB-D-based methods, SSD-6D + ICP only uses depth information in the post-processing stage without a deep fusion of texture features and geometric features, so its average accuracy merely reaches 79%. EANet also exploits edge cues, but since our method uses MSPNet to extract multi-scale geometric features of point clouds and constructs more expressive dense feature sequences, our method outperforms EANet by 3.5% in average accuracy. MSCNet uses multi-scale dense features for pose estimation, but since it does not use edge information to constrain the network, its pose estimation accuracy is lower when the object surface texture is weak, and its average accuracy is lower than that of our method by 0.4%. We notice that keypoint-based methods (such as PVNet and HRPose) achieve high pose estimation accuracy on bench and can in the LineMOD Dataset. This is because bench and can have obvious texture features and geometric corners, so the keypoint-based methods can stably predict keypoints on them. Since our method does not exploit keypoints, it performs slightly worse on these two types of objects.
Moreover, we present the visualized results of our method on the LineMOD Dataset, which can be seen in Figure 5.

Evaluation on YCB-Video Dataset
Table 2 shows the pose estimation results on the YCB-Video Dataset. Two metrics are used to measure the effectiveness of the methods. One is the area under the ADD-S score-threshold curve (AUC), with thresholds ranging from 0 to 10 cm. The other is the percentage of ADD-S scores of less than 2 cm (<2 cm). All methods use semantic segmentation masks from PoseCNN to guarantee a fair comparison. The experimental results illustrate that our method outperforms CosyPose, DenseFusion, MSCNet, and G2L-Net by 2.8%, 1.4%, 1.1%, and 0.2% on the first metric, respectively. The maximum error tolerated by robot grasping is 2 cm. Our method surpasses PoseCNN + ICP, DenseFusion, and MSCNet by 2.4%, 0.3%, and 1.7% on this second metric, respectively, which indicates that our method is well suited to grasping tasks in the real world. An edge reconstruction module is introduced into the network, which implicitly improves the attention of the image feature extraction module to edge features, so our method shows the best performance on weakly textured objects such as banana, mug, and wood_block.
Table 2. Quantitative evaluation results using the ADD-S metric on the YCB-Video Dataset, where the data shown in bold are the highest scores among the different methods, and the objects marked with an asterisk * are symmetrical objects.
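The two YCB-Video metrics, the area under the ADD-S accuracy-threshold curve up to 10 cm and the fraction of ADD-S errors below 2 cm, can be sketched as follows. This is a simplified illustration using a trapezoidal integral over evenly spaced thresholds; the function names, the number of threshold steps, and the normalization to the range [0, 1] are assumptions, and published evaluation scripts may integrate the curve differently.

```python
import numpy as np

def adds_auc(errors_m, max_threshold=0.10, steps=1001):
    """Normalized area under the accuracy-vs-threshold curve (errors in metres)."""
    errors = np.asarray(errors_m, dtype=float)
    thresholds = np.linspace(0.0, max_threshold, steps)
    # accuracy[t] = fraction of frames whose ADD-S error is within thresholds[t]
    accuracy = (errors[None, :] <= thresholds[:, None]).mean(axis=1)
    area = ((accuracy[1:] + accuracy[:-1]) * 0.5 * np.diff(thresholds)).sum()
    return float(area / max_threshold)

def fraction_below(errors_m, threshold=0.02):
    """Fraction of frames whose ADD-S error is below the threshold (default 2 cm)."""
    return float((np.asarray(errors_m, dtype=float) < threshold).mean())

errors = [0.005, 0.012, 0.030, 0.150]   # hypothetical per-frame ADD-S errors in metres
auc = adds_auc(errors)
under_2cm = fraction_below(errors)
```

The AUC rewards small errors across the whole 0 to 10 cm range, whereas the <2 cm score is a single operating point tied to the grasping tolerance.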
In the YCB-Video Dataset, large_clamp and extra_large_clamp are two types of objects with the same appearance but different sizes. Therefore, it is difficult for the semantic segmentation network provided by PoseCNN to generate the correct semantic segmentation masks for these two types of objects, which leads to poor performance of our network on large_clamp and extra_large_clamp. Moreover, scissors in the YCB-Video Dataset are small and have a discontinuous surface, so the edge-attention image feature extraction module cannot completely extract the texture features of scissors from the masked RGB images. Therefore, our network has lower pose estimation accuracy on scissors.
Figure 6 presents the qualitative analysis results of different methods on the YCB-Video Dataset. All methods use the semantic segmentation results provided by PoseCNN in this experiment. We transform the point cloud of the object according to the predicted 6D pose and project it onto a 2D image. A higher degree of overlap between the transformed point cloud and the object indicates a higher accuracy of pose estimation. The predictions of our network have the highest degree of overlap on smooth and texture-less objects, such as bowl and banana. Conversely, DenseFusion and PoseCNN + ICP fail to accurately estimate the pose of the bowl and banana. This is because our method introduces edge information constraints into the pose estimation network, which improves the attention of our network to edge features and enables it to extract effective features for pose estimation even on smooth and texture-less objects. Because of the pixel-level prediction, our network is also robust to occlusion and shows high prediction accuracy on the severely occluded cracker_box, scissors, and mustard_bottle.
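The visualization step, transforming the object point cloud by a predicted pose and projecting it onto the image, can be sketched with a standard pinhole camera model. The intrinsics `fx`, `fy`, `cx`, and `cy` below are hypothetical values chosen for illustration, not the calibration of the dataset camera.

```python
import numpy as np

def project_points(model_pts, rot, trans, fx, fy, cx, cy):
    """Project 3D model points into pixel coordinates under a pinhole camera.

    model_pts: (M, 3) points in the object frame
    rot, trans: predicted 3 x 3 rotation and 3-vector translation (camera frame)
    """
    cam = model_pts @ rot.T + trans          # (M, 3) points in the camera frame
    u = fx * cam[:, 0] / cam[:, 2] + cx      # horizontal pixel coordinate
    v = fy * cam[:, 1] / cam[:, 2] + cy      # vertical pixel coordinate
    return np.stack([u, v], axis=1)          # (M, 2)

pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
pixels = project_points(pts, np.eye(3), np.array([0.0, 0.0, 1.0]),
                        fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Drawing the resulting pixels over the RGB frame gives the overlay used to judge how well the predicted pose coincides with the object.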
Figure 7 shows the performance of our method when the lighting condition changes. We randomly selected three images from the YCB-Video Dataset and then used OpenCV to change the brightness of the images to simulate changes in illumination. From Figure 7, we can see that the pose estimation results hardly change with the brightness, which indicates that our method is robust to lighting condition changes.
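A brightness perturbation of this kind can be sketched as a linear gain and bias applied to the pixel values. The text does not say which OpenCV routine was used, so the NumPy function below is only a stand-in with assumed parameter names `gain` and `bias`.

```python
import numpy as np

def change_brightness(image, gain=1.0, bias=0.0):
    """Scale and shift pixel intensities, then clip back to the 8-bit range."""
    out = image.astype(np.float32) * gain + bias
    return np.clip(out, 0, 255).astype(np.uint8)

image = np.array([[100, 200]], dtype=np.uint8)   # toy two-pixel image
brighter = change_brightness(image, gain=1.5)    # 200 * 1.5 saturates at 255
darker = change_brightness(image, gain=0.5)
```

Feeding the brightened and darkened copies of the same frame through the network and comparing the predicted poses reproduces the kind of check shown in Figure 7.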

Figure 7. Visualized results of our method on the YCB-Video Dataset when lighting condition changes.

Figure 8 shows the performance of our method when the object is occluded. For a clearer presentation, we only render the pose estimation results of occluded objects on the graph. Figure 8 proves that our method can still estimate the 6D pose of the object accurately when the object is heavily occluded.

Ablation Study
All ablation experiments are performed on the LineMOD Dataset. The ablation study results are shown in Table 3, where the definition of accuracy is the same as the evaluation metric for the LineMOD Dataset. We use DenseFusion [5] as the baseline, i.e., model (a) represents DenseFusion without refinement. Building on DenseFusion, we successively add the edge reconstruction module, MSPNet, and SE-NetVLAD, forming model (b), model (c), and model (d), respectively. Model (d) represents our method. Comparing model (a) and model (b), the pose estimation accuracy of model (b) improves by 4.3% after the edge reconstruction module is introduced into the network. This indicates that the edge reconstruction module implicitly increases the attention of the image feature extraction module to edge features, thus improving the pose estimation performance. At the same time, we noticed that this multi-task training scheme significantly alleviates the difficulty of training the image feature extraction module within the deep overall network and also speeds up network convergence during training.
Similarly, the effectiveness of SE-NetVLAD can be verified by comparing model (c) and model (d). Table 3 shows that model (d) improves the pose estimation accuracy by 0.7% compared to model (c). Combining the above comparisons, we find that each module proposed in our network contributes to improving the performance of pose estimation.

Conclusions
In this paper, we propose an end-to-end 6D pose estimation network based on RGB-D images, named SaMfENet. Through a series of experiments, we demonstrate the effectiveness of our method on the task of object 6D pose estimation. The proposed method can stably estimate the 6D pose of smooth, weakly textured objects in complex lighting conditions. Our network is also robust to situations such as objects being severely occluded, meeting the needs of grasping tasks in real scenes. Moreover, ablation experiments show that edge information constraints and multi-scale feature fusion improve pose estimation accuracy.
Our network is developed for real scene applications. In future work, we will apply our network to a robot to improve its performance in practical applications. Moreover, our network relies on a robust semantic segmentation network to segment object regions. However, an independently trained semantic segmentation network has difficulty providing reliable and robust segmentation results. In future work, we plan to integrate the semantic segmentation network as a module into our network and train it together with the pose estimation network.