1. Introduction
The aim of object 6D pose estimation is to estimate the 3D rotation and translation of an object relative to the camera coordinate system, which is a fundamental problem in computer vision. It plays a critical role in a wide range of applications such as robotic manipulation [1,2,3,4], augmented reality [5], and autonomous navigation [6,7,8,9]. However, achieving robust and accurate 6D pose estimation in real-world scenarios remains highly challenging, particularly when objects are textureless, partially occluded, or captured under poor lighting conditions.
Traditional approaches can be broadly categorized into feature point-based and template-based methods. Feature point-based methods estimate 6D poses by establishing 2D–3D correspondences between image keypoints and the 3D model [10]. These methods typically exhibit degraded performance on low-texture objects because reliable keypoints cannot be extracted. Template-based methods, on the other hand, match input images against a pre-rendered template database. While efficient, these methods tend to fail in the presence of occlusions or appearance variations.
With the rise of deep learning, learning-based approaches have significantly improved 6D pose estimation performance. Early RGB-based methods such as PoseCNN [11] and BB8 [12] achieved promising results but are inherently limited by the lack of depth information, making them less robust in complex environments. RGB-D methods such as DenseFusion [13] leverage complementary depth cues and fuse RGB and point cloud features at the pixel level. Although effective, these approaches rely heavily on PointNet [14] for geometric feature extraction, which cannot sufficiently capture local context because of its global feature aggregation design. Later approaches such as PVN3D [15] introduced local feature modeling, but this significantly increased model complexity and computational cost, limiting their deployment in real-time or resource-constrained applications.
More recently, large-scale foundation models [16,17,18,19] have demonstrated strong generalization capabilities and robustness through the use of high-capacity architectures and multimodal representations. However, their considerable parameter sizes and inference latency present challenges for real-time deployment and mobile use. There is therefore a clear research gap between high-accuracy yet computationally expensive models and lightweight models with limited robustness and precision.
To bridge this gap, we propose a novel lightweight multiscale feature fusion network for 6D pose estimation that achieves an optimal balance between accuracy and efficiency. Specifically, we design a multiscale point cloud feature extraction module using parallel graph convolution layers with varying neighborhood sizes to capture geometric information at different scales. A channel-wise self-attention mechanism is applied to adaptively fuse these features, allowing the model to focus on the most informative scale. In parallel, a lightweight RGB feature extractor using depthwise separable convolutions reduces computations while preserving color cues. Finally, RGB and geometric features are fused at the pixel level to estimate the 6D pose. Compared with existing methods, our approach achieves a competitive accuracy with significantly fewer parameters and faster inference, making it well suited for real-time, mobile deployment.
3. Materials and Methods
3.1. Datasets
To evaluate the performance of the network, three mainstream public datasets (LineMOD, YCB-Video, and Occlusion LineMOD) were used for validation.
The LineMOD dataset is a widely used computer vision dataset for object recognition and pose estimation. It includes RGB-D images and 3D models of 13 objects, featuring both symmetric and asymmetric common objects. Each object has approximately 1000 manually annotated images. These images and models are captured from real-world objects, ensuring high realism and diversity.
The YCB-Video dataset is based on the YCB dataset and includes 21 objects selected for their high-quality model files and comprehensive depth images. The dataset contains a total of 133,827 frames, all annotated with 6D poses. It includes scenes with clutter and occlusions featuring both symmetric and asymmetric objects, as well as objects with minimal surface features.
The Occlusion LineMOD (LM-O) dataset is an extended version of the LineMOD dataset, specifically designed to evaluate the performance of models under complex occlusion conditions. It contains many heavily occluded scenes, providing a more challenging benchmark for assessing the robustness of pose estimation models. Compared to the original LineMOD dataset, the LM-O dataset introduces more complex occlusions and object arrangements, allowing for a more comprehensive evaluation of a model's performance when dealing with occluded objects in real-world environments. The LM-O dataset has become an important benchmark for validating the ability of 6D pose estimation algorithms to handle occlusions.
3.2. Overall Structure of the Network
The aim of the network proposed in this paper is to estimate the 6D pose of objects from RGB-D images. The 6D pose of an object refers to the rigid transformation from the object's coordinate system to the camera coordinate system. This transformation can be intuitively expressed as a combination of a rotational and a translational transformation. It is represented using a homogeneous transformation matrix $[R \mid t]$, where $R \in SO(3)$ is the rotational transformation and $t \in \mathbb{R}^3$ is the translational transformation.
3.2.1. Overview
Figure 1 shows the overall structure of the network, which mainly consists of four parts as follows:
- Semantic Segmentation Module: This module uses the semantic segmentation network proposed in PoseCNN. It segments the target object's bounding box from the RGB image, crops the corresponding mask, and combines the camera parameters to convert the depth map into the corresponding 3D point cloud, laying the foundation for the subsequent extraction of image and point cloud features.
- Multiscale Point Cloud Feature Extraction Module: This module primarily extracts features from the segmented point cloud data. The point cloud information passes through multiple graph convolution layers, each extracting local information from the point cloud at a different scale based on neighborhoods of varying sizes. The local features are then fused into more expressive multiscale point cloud features using a self-attention mechanism.
- Lightweight Image Feature Extraction Module: This module introduces an inverted residual structure to improve the original CNN network. To ensure that the extracted features can be fused with the point cloud features at the pixel level, the features are upsampled into texture feature maps of the original size.
- Feature Fusion and Pose Estimation: The multiscale point cloud features and texture feature maps are concatenated at the pixel level and then encoded to construct global features. These pixel-level features are fused with the global features and input into the pose estimation network, which consists of multiple consecutive convolutional layers that directly regress the translation and rotation matrices.
3.2.2. Semantic Segmentation
Semantic segmentation is a crucial step in the network, providing a data foundation for the main network. It helps eliminate distractions and improves the estimation accuracy to some extent. The function of this module is to segment the target area from the image, crop out the target object, and, finally, obtain an image patch and depth map containing only the target object. Since this work is focused on pose estimation algorithms, we directly use PoseCNN’s semantic segmentation network to crop the image. The overall framework is an encoder–decoder network, which produces N+1 binary images for each input RGB image in the dataset. These binary images represent the masks for each object, and the RGB image is cropped based on these masks to produce image patches containing only the target object. Similarly, depth maps containing only the target object are obtained, as shown in
Figure 2.
Since the focus of this research is on optimizing and improving the pose estimation algorithm, we did not make additional improvements to the semantic segmentation module, maintaining its original structure and method. However, the effectiveness of this module remains critical to the overall estimation process. It ensures the accurate fusion of image and point cloud information, providing a solid foundation for subsequent multiscale feature extraction and fusion. The masks generated by this module can effectively filter out background noise and focus on extracting features from the target object, further improving the model’s estimation accuracy for occluded and low-texture objects.
3.2.3. Multiscale Point Cloud Feature Extraction Module
Point cloud data serve as the data carrier for the entire network, and the accurate conversion of depth images into object surface point clouds is particularly important. Based on optical principles and the camera parameters, pixel points from the depth image are restored into corresponding 3D points, as shown below:

$z = \dfrac{d}{s}, \quad x = \dfrac{(u - c_x)\, z}{f_x}, \quad y = \dfrac{(v - c_y)\, z}{f_y}$   (1)

where $(u, v)$ represents the pixel coordinates in the image, $d$ is the depth value of the pixel, $s$ is the camera's scaling factor, $c_x$ and $c_y$ represent the image coordinates of the camera's optical axis, $f_x$ and $f_y$ are the camera's focal lengths along the x-axis and y-axis, respectively, and $(x, y, z)$ are the 3D coordinates corresponding to the pixel.
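For illustration, the back-projection in Equation (1) can be sketched as follows. This is a minimal NumPy version under our reading of the equation; the function name, variable names, and intrinsic values are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, s=1000.0, mask=None):
    """Back-project a depth map (H, W) into camera-frame 3D points.

    depth: raw depth values d(u, v); s is the depth scaling factor,
    (cx, cy) the principal point, (fx, fy) the focal lengths in pixels.
    mask: optional boolean mask selecting the target object's pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grid
    z = depth / s                                   # metric depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)           # (H, W, 3)
    if mask is not None:
        points = points[mask & (depth > 0)]         # keep valid object pixels only
    return points

# Illustrative intrinsics and a random depth map, for shape checking only.
depth = np.random.randint(400, 800, size=(480, 640)).astype(np.float32)
cloud = depth_to_pointcloud(depth, fx=572.4, fy=573.6, cx=325.3, cy=242.0, s=1000.0)
```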
PointNet, a commonly used network for processing point cloud data, can only extract geometric information at a single scale and cannot perceive local information. To address this issue, we draw on the idea of multiscale local information extraction from the grouping and sampling layers of PointNet++ [46]. We simultaneously collect neighborhood information of different sizes around 3D points and integrate this information using a self-attention mechanism to create more expressive fused features.
Figure 3 shows the specific process of point cloud feature extraction in our network.
The multiscale point cloud extraction module consists of two parts:
- The upper branch, which contains three consecutive multilayer perceptrons (MLPs) and is structurally similar to the traditional PointNet network. It mainly encodes the geometric information of each point in the point cloud to generate point-level features.
- The lower branch, which captures local geometric features for each point through multiple parallel graph convolution layers. Each layer operates on neighborhoods of a different size, allowing the network to extract geometric information across various spatial scales. To effectively integrate these multi-scale local features, we introduce a channel-wise self-attention mechanism that adaptively fuses the features based on their relative significance. After obtaining the feature maps from the graph convolution branches, we first apply global average pooling to the summed features, resulting in a compact representation. This pooled vector is then passed through two lightweight MLP layers to generate attention weights with a shape of [3,128], where the dimensions correspond to the three spatial scales and the 128 feature channels. A softmax function is applied to normalize the attention weights across the scales, and these weights are used to reweight the scale-specific feature maps before aggregation. This attention-driven fusion enables the network to dynamically emphasize the most informative features for each channel, thereby enhancing its discriminative capability and improving robustness to variations in scale (a code sketch of this fusion follows the list).
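The following is a minimal PyTorch sketch of the channel-wise fusion described above, under the assumption of three scale branches with 128 channels each over N sampled points; the module and variable names are ours and the hidden-layer width is an illustrative choice.

```python
import torch
import torch.nn as nn

class ScaleAttentionFusion(nn.Module):
    """Fuse S scale-specific feature maps (B, C, N) with channel-wise attention."""

    def __init__(self, channels=128, scales=3, reduction=4):
        super().__init__()
        # Two lightweight MLP layers producing one weight per (scale, channel) pair.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, scales * channels),
        )
        self.scales = scales
        self.channels = channels

    def forward(self, feats):                               # feats: list of S tensors, each (B, C, N)
        stacked = torch.stack(feats, dim=1)                 # (B, S, C, N)
        pooled = stacked.sum(dim=1).mean(dim=-1)            # sum branches, global average pool -> (B, C)
        weights = self.mlp(pooled).view(-1, self.scales, self.channels)  # (B, S, C), i.e. [3, 128] per sample
        weights = torch.softmax(weights, dim=1)             # normalize across the scales
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)  # reweight and aggregate -> (B, C, N)
        return fused

# Example: three scale branches with 128 channels over 512 points.
branches = [torch.randn(2, 128, 512) for _ in range(3)]
fused = ScaleAttentionFusion()(branches)                    # -> (2, 128, 512)
```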
The structure of each graph convolution layer is shown in Figure 4. We select k neighboring points within a radius r of the center point based on the Euclidean distance, forming a point set Y of size (3,k,N). The center point is then subtracted from all neighboring points in Y to represent local features of size (3,k,N), which are input into a multilayer perceptron (MLP) to obtain high-dimensional features F of size (128,k,N). A larger radius r expands the receptive field, allowing the model to capture richer contextual information, which is particularly beneficial for objects with detailed texture. Nevertheless, an excessively large r may introduce noise or irrelevant points, potentially degrading feature quality. Conversely, a smaller r preserves finer local geometric details but may lose the global structure, reducing effectiveness for highly occluded or low-texture regions. To balance this trade-off, we adopt the multi-scale neighborhood definition strategy of the PointNet++ grouping layers, selecting r with an interval of 0.2 to effectively capture features at different scales. The local feature computation is as follows:

$F_{ij} = h(p_{ij} - p_i), \quad i = 1, \ldots, N, \; j = 1, \ldots, k$   (2)
where N represents the total number of sampled points in the point cloud P, $p_i$ represents the i-th center point, k represents the number of points selected in the neighborhood, $p_{ij}$ represents the j-th neighboring point of the i-th center point, and $h(\cdot)$ denotes a nonlinear computation, i.e., an MLP.
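A minimal sketch of one graph convolution branch, as we interpret Equation (2), is given below: a ball query within radius r, center-point subtraction, and a shared MLP lifting the offsets to 128 dimensions. The brute-force distance computation and layer widths are simplifying assumptions for clarity.

```python
import torch
import torch.nn as nn

class GraphConvBranch(nn.Module):
    """One scale branch: group k neighbors within radius r, subtract the
    center point, and lift the offsets to 128-D with a shared MLP h(.)."""

    def __init__(self, radius=0.2, k=16, out_channels=128):
        super().__init__()
        self.radius, self.k = radius, k
        # Shared MLP applied point-wise via 1x1 2D convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(3, 64, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_channels, 1), nn.ReLU(inplace=True),
        )

    def forward(self, points):                        # points: (B, N, 3)
        dist = torch.cdist(points, points)            # (B, N, N) pairwise distances
        dist = dist.masked_fill(dist > self.radius, float('inf'))
        # k nearest points within r; if fewer than k lie within r, the nearest
        # points beyond r are used in this simplified sketch.
        _, idx = dist.topk(self.k, dim=-1, largest=False)
        neighbors = torch.gather(
            points.unsqueeze(1).expand(-1, points.size(1), -1, -1),   # (B, N, N, 3)
            2, idx.unsqueeze(-1).expand(-1, -1, -1, 3))               # (B, N, k, 3)
        offsets = neighbors - points.unsqueeze(2)     # p_ij - p_i, (B, N, k, 3)
        offsets = offsets.permute(0, 3, 2, 1)         # (B, 3, k, N), as in the text
        return self.mlp(offsets)                      # F: (B, 128, k, N)

cloud = torch.rand(2, 512, 3)
local_feat = GraphConvBranch(radius=0.2, k=16)(cloud)  # -> (2, 128, 16, 512)
```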
To obtain more effective local features, we introduce an attention mechanism that assigns an adaptive weight to each point's local feature representation. This weight consists of a self-weight $a_i$ and a local weight $b_{ij}$, where $a_i$ is derived by passing the global point cloud through a shared MLP with dimensions (3,128,1), representing the significance of each point independently. Meanwhile, $b_{ij}$ is computed by processing the extracted local features F through another shared MLP (128,1), assigning importance to the k nearest neighboring points. The final attention coefficient $w_{ij}$ is obtained by summing $a_i$ and $b_{ij}$ and further refining the result through a shared MLP to generate the final local coefficients. These attention coefficients are then used to weight the local feature vectors $F_{ij}$, dynamically emphasizing the more relevant geometric structures. The weighted features are then summed along the neighborhood dimension, resulting in the final geometry-enhanced output of the graph convolution layer, $F'_i$, which captures both local and global dependencies in a more structured manner. The computation process is as follows:

$a_i = \varphi(p_i), \quad b_{ij} = \psi(F_{ij}), \quad w_{ij} = h_w(a_i + b_{ij}), \quad F'_i = \sum_{j=1}^{k} w_{ij}\, F_{ij}$   (3)
where $F_{ij}$ represents the local features obtained by each graph convolution layer; P, $p_i$, i, and j have the same definitions as in Equation (2); and $\varphi$ and $\psi$ represent the mapping functions used to calculate the self-weight and local weight, respectively.
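A sketch of the attention weighting in Equation (3), as we read it, is shown below: a self-weight from the raw points, a local weight from the lifted features, a shared refinement layer, and a weighted sum over the k neighbors. Layer sizes follow the (3,128,1) and (128,1) MLPs described above; everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    """Weight each neighbor feature by a self-weight (from the point itself)
    plus a local weight (from its 128-D feature), then sum over neighbors."""

    def __init__(self, feat_channels=128):
        super().__init__()
        # Self-weight branch: shared MLP with dimensions (3, 128, 1).
        self.self_mlp = nn.Sequential(
            nn.Conv1d(3, 128, 1), nn.ReLU(inplace=True), nn.Conv1d(128, 1, 1))
        # Local-weight branch: shared MLP with dimensions (128, 1).
        self.local_mlp = nn.Conv2d(feat_channels, 1, 1)
        # Refinement of the summed weights into the final coefficients.
        self.refine = nn.Conv2d(1, 1, 1)

    def forward(self, points, feats):
        # points: (B, 3, N); feats: (B, 128, k, N) from a graph convolution branch.
        a = self.self_mlp(points).unsqueeze(2)   # (B, 1, 1, N) self-weights a_i
        b = self.local_mlp(feats)                # (B, 1, k, N) local weights b_ij
        w = self.refine(a + b)                   # final coefficients w_ij
        return (w * feats).sum(dim=2)            # sum over the neighborhood -> (B, 128, N)

points = torch.rand(2, 3, 512)
feats = torch.rand(2, 128, 16, 512)
out = LocalAttention()(points, feats)            # -> (2, 128, 512)
```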
To ensure that the network extracts the optimal local features, a self-attention module is introduced after the parallel graph convolution layers. When passing through this module, each scale-specific feature vector $F'^{(r)}$ is assigned a specific weight and, after passing through an average pooling layer, fully connected layers, and a softmax layer, the final local feature vector $F_{local}$ is output. This process can be expressed as follows:

$\alpha = \mathrm{softmax}\!\left(\mathrm{MLP}\!\left(\mathrm{AvgPool}\!\left(\sum_{r} F'^{(r)}\right)\right)\right), \quad F_{local} = \sum_{r} \alpha_r \odot F'^{(r)}$   (4)
Finally, the point-level features and local features are concatenated to form the final geometric features of the point cloud.
3.2.4. Lightweight Image Feature Extraction Module
To ensure that image features can be correspondingly fused with point cloud features, we designed the lightweight feature extraction module as a U-shaped encoder–decoder network, as shown in
Figure 5. The network structure is based on the lightweight MobileNetV2 [47] network. Each downsampling layer features an inverted residual module and depthwise separable convolution, and each downsampling layer contains two consecutive bottlenecks. The encoder includes four downsampling layers, ultimately generating a feature map of size (H/16,W/16,1024). Correspondingly, the decoder symmetrically uses four consecutive upsampling layers to restore the feature map to size (H,W,64). The upsampling layers follow the upsampling method used in DenseFusion and PSPNet [48], with each upsampling layer consisting of bilinear interpolation and a 2D convolution layer.
The core of the network is the inverted residual module, which is used multiple times during downsampling. As shown in
Figure 6, in addition to the depthwise separable structure, this module also employs Expansion and Projection layers. The Expansion layer uses a 1 × 1 network structure to map the low-dimensional space to a high-dimensional space. The Projection layer also uses a 1 × 1 network structure but functions in the opposite way, mapping the high-dimensional space back to a low-dimensional space. These two layers are positioned at either end of the inverted residual module, with a 3 × 3 depthwise separable convolution in between, giving the module a spindle-like structure—narrow at the ends and wide in the middle. This design facilitates smoother gradient propagation, helping to improve the model’s convergence speed. Additionally, the presence of skip connections in the inverted residual module allows the original input to be directly passed to subsequent layers, making gradient propagation through the network easier. This helps alleviate the vanishing gradient problem, making the network easier to train.
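The inverted residual block described above can be sketched as follows: a 1 × 1 expansion, a 3 × 3 depthwise convolution, a 1 × 1 projection, and a skip connection when shapes match. This is a simplified MobileNetV2-style bottleneck; the exact channel settings and expansion factor used in our network may differ.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expansion (1x1) -> depthwise 3x3 -> projection (1x1), with a skip
    connection when the input and output shapes match."""

    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # Expansion layer: map the low-dimensional input to a wider space.
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # Depthwise 3x3 convolution in the expanded space.
            nn.Conv2d(hidden, hidden, 3, stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # Projection layer: map back to a low-dimensional space (no activation).
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

x = torch.randn(1, 32, 120, 160)
y = InvertedResidual(32, 32)(x)             # skip connection active -> (1, 32, 120, 160)
z = InvertedResidual(32, 64, stride=2)(x)   # downsampling bottleneck -> (1, 64, 60, 80)
```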
3.2.5. Feature Fusion and Pose Estimation
During feature fusion, the pixel-level image features obtained from the two feature extraction modules and the point-level features containing local information need to be fully integrated. By leveraging the natural complementarity between the two, the features are concatenated to construct global features. These features are then input into the pose estimation network to estimate the object’s pose.
Feature fusion: Considering that the point cloud features are derived from the depth map and that there is a direct correspondence between the depth map and the RGB image, feature fusion is performed by concatenating features at the pixel level based on this positional correspondence. This forms point cloud features that not only include geometric and color information but also contain neighborhood information. These features can then be directly used to extract global variables, which are encoded through two multilayer shared perceptrons. Afterward, an average pooling layer is used to obtain representative global variables. The pixel-level fused features and global features are concatenated and input into the pose estimation network as the final features. The process is expressed as follows:
$F = x + f_{rgb}, \qquad F_{final} = F + \mathrm{AvgPool}\big(\varphi_2(\varphi_1(F))\big)$   (5)

where x represents the point cloud features, $f_{rgb}$ represents the image features, F represents the pixel-level fused features containing color and local information, AvgPool is the average pooling operation, $\varphi_1$ and $\varphi_2$ are nonlinear functions (shared perceptrons), and $F_{final}$ is the final feature. Note that the + operation in Equation (5) represents feature concatenation.
To clearly illustrate the feature extraction pipeline from the input images to the final fused features, we present the entire fusion process in a diagram. As shown in Figure 7, the flow proceeds from left to right, showing how the RGB and depth images are processed. A mask is first applied to extract the target object region (choose), which is then used to index both the RGB image and the depth map, leveraging the one-to-one pixel-level correspondence between them for precise feature alignment.
RGB features are extracted using a CNN, producing a feature map of shape $(H, W, C_{rgb})$. Using choose, we obtain an image feature tensor of shape $(N, C_{rgb})$. Simultaneously, the depth map is back-projected using the camera intrinsics to generate 3D coordinates and corresponding geometric features of size $(N, C_{geo})$. These features are concatenated to form the fused feature $F \in \mathbb{R}^{N \times (C_{rgb} + C_{geo})}$, where N is the number of sampled points and $C_{rgb}$ and $C_{geo}$ are the channel dimensions of the corresponding features.
To capture the global context, F is passed through shared MLPs, and an average pooling operation is used to extract global features. These are then concatenated back to the point-wise fused features, resulting in the final feature $F_{final}$. This process is represented in Equation (5) and offers an intuitive view of the pixel-aligned fusion between image and point cloud features.
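A minimal sketch of this pixel-aligned fusion (Equation (5)) is given below. The channel sizes and module names are illustrative assumptions rather than the exact values used in the network.

```python
import torch
import torch.nn as nn

class PixelFusion(nn.Module):
    """Concatenate per-point geometric and image features, encode a global
    vector with shared MLPs plus average pooling, and append it to every point."""

    def __init__(self, geo_ch=128, rgb_ch=64, global_ch=512):
        super().__init__()
        self.phi1 = nn.Sequential(nn.Conv1d(geo_ch + rgb_ch, 256, 1), nn.ReLU(inplace=True))
        self.phi2 = nn.Sequential(nn.Conv1d(256, global_ch, 1), nn.ReLU(inplace=True))

    def forward(self, geo_feat, rgb_feat):
        # geo_feat: (B, geo_ch, N); rgb_feat: (B, rgb_ch, N), pixel-aligned via "choose".
        fused = torch.cat([geo_feat, rgb_feat], dim=1)               # F = x (+) f_rgb
        g = self.phi2(self.phi1(fused)).mean(dim=-1, keepdim=True)   # AvgPool -> (B, global_ch, 1)
        g = g.expand(-1, -1, fused.size(-1))                         # broadcast global feature to all points
        return torch.cat([fused, g], dim=1)                          # F_final, (B, geo+rgb+global, N)

geo = torch.rand(2, 128, 512)
rgb = torch.rand(2, 64, 512)
final = PixelFusion()(geo, rgb)    # -> (2, 704, 512)
```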
Pose estimation: When the features are input into the neural network for pose estimation, the network directly regresses the rotation, translation, and confidence score. The pose estimation network in this paper is similar to that of DenseFusion and consists of three parallel sub-networks, each consisting of four 1D convolutional layers. The estimation accuracy is determined by calculating the average deviation distance between the sampled points of the predicted and ground truth point clouds. For asymmetric objects, the loss is defined as follows:

$L_i^p = \dfrac{1}{G} \sum_{j=1}^{G} \left\| (R x_j + t) - (\hat{R}_i x_j + \hat{t}_i) \right\|$   (6)

This metric is only suitable for asymmetric objects with a unique correct pose. For symmetric objects, a new measurement is introduced:

$L_i^p = \dfrac{1}{G} \sum_{j=1}^{G} \min_{1 \le k \le G} \left\| (R x_j + t) - (\hat{R}_i x_k + \hat{t}_i) \right\|$   (7)

The two metrics are similar, with $(R, t)$ representing the ground truth pose, $x_j$ being the j-th point in the point cloud model, $(\hat{R}_i, \hat{t}_i)$ being the pose estimated by the network, and G being the number of sampled points in the point cloud model. The loss function for the entire network is defined as follows:

$L = \dfrac{1}{N} \sum_{i=1}^{N} \left( L_i^p c_i - w \log(c_i) \right)$   (8)
where N is the number of predicted poses, w is the balancing hyperparameter for confidence regularization, which penalizes low-confidence predictions, $c_i$ represents the confidence score for each regressed pose, and $L_i^p$ represents the loss of each predicted pose.
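The following sketch illustrates the loss in Equations (6)–(8). It assumes a DenseFusion-style formulation with per-point pose predictions and confidences; the tensor shapes and function name are our illustrative choices.

```python
import torch

def pose_loss(pred_R, pred_t, conf, gt_R, gt_t, model_pts, symmetric=False, w=0.015):
    """Confidence-weighted average distance loss over N predicted poses.

    pred_R: (N, 3, 3), pred_t: (N, 3), conf: (N,) confidence per prediction;
    gt_R: (3, 3), gt_t: (3,); model_pts: (G, 3) sampled model points.
    """
    gt_pts = model_pts @ gt_R.T + gt_t                                     # (G, 3) ground-truth transform
    pred_pts = model_pts @ pred_R.transpose(1, 2) + pred_t.unsqueeze(1)    # (N, G, 3)
    if symmetric:
        # Equation (7): match each transformed point to its nearest ground-truth point.
        dist = torch.cdist(pred_pts, gt_pts.unsqueeze(0).expand(pred_pts.size(0), -1, -1))
        per_pose = dist.min(dim=-1).values.mean(dim=-1)                    # (N,)
    else:
        # Equation (6): compare corresponding points directly.
        per_pose = (pred_pts - gt_pts).norm(dim=-1).mean(dim=-1)           # (N,)
    # Equation (8): confidence weighting with a regularizer penalizing low confidence.
    return (per_pose * conf - w * torch.log(conf)).mean()

# Toy usage with placeholder tensors (shapes only; not meaningful poses).
N, G = 500, 500
loss = pose_loss(torch.eye(3).repeat(N, 1, 1), torch.zeros(N, 3),
                 torch.full((N,), 0.5), torch.eye(3), torch.zeros(3),
                 torch.rand(G, 3))
```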
3.3. Evaluation Metrics
The accuracy of pose estimation is generally determined by two metrics—the average distance (ADD) and average nearest point distance (ADD-S)—used to evaluate the accuracy of asymmetric and symmetric objects, respectively.
The average distance (ADD) is defined as the mean distance between the sampled points on the 3D model transformed by the ground truth pose and those transformed by the predicted pose. It is calculated as follows:

$\mathrm{ADD} = \dfrac{1}{m} \sum_{x \in \mathcal{M}} \left\| (R x + t) - (\tilde{R} x + \tilde{t}) \right\|$   (9)

where m is the number of sampled points, $(R, t)$ and $(\tilde{R}, \tilde{t})$ represent the ground truth pose and the estimated pose, respectively, and x refers to the sampled points from the point cloud model $\mathcal{M}$.
The average nearest point distance (ADD-S) is defined as the distance between each sampled point on the 3D model transformed by the ground truth pose and the nearest point on the model transformed by the predicted pose. The average of these nearest point distances is calculated across all sampled points as follows:

$\mathrm{ADD\text{-}S} = \dfrac{1}{m} \sum_{x_1 \in \mathcal{M}} \min_{x_2 \in \mathcal{M}} \left\| (R x_1 + t) - (\tilde{R} x_2 + \tilde{t}) \right\|$   (10)
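For completeness, a compact sketch of both metrics is given below (PyTorch tensors; function and variable names are ours).

```python
import torch

def add_metric(R, t, R_hat, t_hat, model_pts):
    """ADD: mean distance between corresponding transformed model points."""
    gt = model_pts @ R.T + t
    pred = model_pts @ R_hat.T + t_hat
    return (gt - pred).norm(dim=-1).mean()

def add_s_metric(R, t, R_hat, t_hat, model_pts):
    """ADD-S: mean distance from each ground-truth-transformed point to the
    nearest predicted-transformed point (handles symmetric objects)."""
    gt = model_pts @ R.T + t
    pred = model_pts @ R_hat.T + t_hat
    return torch.cdist(gt, pred).min(dim=-1).values.mean()

pts = torch.rand(500, 3)
err = add_metric(torch.eye(3), torch.zeros(3), torch.eye(3), torch.ones(3) * 0.01, pts)
# A prediction is typically counted correct if err < 10% of the object diameter (LineMOD protocol).
```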
3.4. Experimental Details
Validation was performed using the PyTorch (version 2.6) framework with the Adam optimizer. All experiments were conducted on an Intel® Xeon® Platinum 8255C CPU and an NVIDIA 3080 Ti GPU. The training parameters were configured as follows, based on insights from previous studies and data analysis: the initial learning rate was set to 0.001 to promote faster convergence, the number of training iterations was set to 500, the hyperparameter w was set to 0.015, and the number of point cloud sampling points (n) was set to 512. The optimizer and learning rate schedule were configured in line with the settings used in DenseFusion. Additionally, the batch size was set to 8 to accommodate GPU memory limitations, and several data augmentation strategies were applied to enhance generalization. With these settings, the model demonstrated rapid convergence without significant oscillations during training.
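For reference, the training configuration described above can be sketched as follows; the model object is a placeholder, and only the hyperparameters listed in this section are reflected.

```python
import torch

# Placeholder model; in practice this is the proposed fusion network.
model = torch.nn.Linear(10, 10)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # initial learning rate 0.001
num_epochs = 500     # number of training iterations
w = 0.015            # confidence-regularization hyperparameter in Equation (8)
num_points = 512     # number of point cloud sampling points n
batch_size = 8       # limited by GPU memory
```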
Table 1 outlines the model parameters for the training process.
4. Results
4.1. Ablation Study
We conducted an ablation study on the LineMOD dataset to evaluate the effectiveness of each proposed module in terms of accuracy, parameter count, and real-time inference speed. The results are summarized in
Table 2, where the accuracy metric is based on ADD-(S) (ADD is used for asymmetric objects, while ADD-S is applied to symmetric ones). The number of parameters reflects the model’s complexity, and the runtime (FPS) indicates its inference efficiency.
DenseFusion is used as the baseline model. The MGC column denotes the use of multi-branch graph convolution modules, while SA represents the self-attention mechanism. These two components together form the multiscale point cloud feature extraction module. LIFE refers to the lightweight image feature extraction module, which replaces the original image downsampling layers in DenseFusion with MobileNetV2. As LIFE is an independent replacement of the image encoder, we did not further break it down into sub-components in the ablation experiment.
From the results in
Table 2, we can observe that each proposed module contributes positively to model performance in terms of accuracy and efficiency. Below, we provide a detailed analysis of the effects of the multiscale point cloud feature extraction module (MGC + SA) and the lightweight image feature extraction module (LIFE), as well as the trade-offs between accuracy, parameter count, and runtime speed.
Comparing the baseline (Row 1) with the variant incorporating only MGC (Row 2), we observe a significant improvement in accuracy from 94.3% to 96.4%, indicating the benefit of introducing multi-branch graph convolution. This module enhances the local geometric representation by aggregating features from multiple receptive fields. However, it introduces a moderate increase in parameters and a slight drop in FPS due to the added computational paths. When the self-attention (SA) module is further added (Row 3), the accuracy rises to 97.6%, showing that the adaptive weighting of multi-scale features can better emphasize task-relevant patterns. Notably, despite the added attention mechanism, the overall parameter count decreases slightly compared to Row 2 due to structural balancing. This confirms that the SA module improves the feature selection efficiency with only a minor increase in overhead, demonstrating a favorable trade-off between accuracy and complexity.
Replacing the original CNN backbone with MobileNetV2 (Row 4) significantly reduces the parameter count from 23.39M to 18.86M and boosts the inference speed from 25.5 FPS to 35.3 FPS. However, this comes with a slight drop in accuracy (94.1%), indicating that aggressive compression may compromise the representation power if not complemented by other enhancements. When LIFE is combined with MGC (Row 5), the accuracy climbs to 97.5% while maintaining a relatively low parameter count (22.48M) and high speed (32.6 FPS). This suggests that point cloud enhancements can compensate for lightweight image features, resulting in a balanced and effective design. Finally, integrating all three components (Row 6) achieves the best accuracy (98.5%) with the fewest parameters (19.49 M) among the high-performing variants, confirming the effectiveness of the full architecture.
Comparing Row 3 and Row 6 in Table 2, we observe an interesting phenomenon: the addition of the LIFE module not only reduces the parameter count (from 25.83M to 19.49M) and improves the runtime speed (from 22.4 FPS to 31.8 FPS) but also increases the accuracy from 97.6% to 98.5%. This result may seem counterintuitive, as lightweight modules are typically expected to trade accuracy for efficiency. We attribute this improvement to the complementary nature of the LIFE module when combined with the multiscale point cloud feature extraction module. Using MobileNetV2 introduces a more efficient image encoding pipeline that preserves essential semantic features while reducing redundancy, which may help suppress irrelevant noise and overfitting. Moreover, the compact representation from LIFE may improve the effectiveness of the subsequent fusion with point cloud features, resulting in better joint representations and an enhanced overall performance.
These results demonstrate a clear trade-off landscape: while MGC and SA contribute to accuracy gains through richer geometric modeling and adaptive feature weighting, LIFE introduces substantial efficiency benefits with minimal compromise. When appropriately combined, these modules not only offset each other’s limitations but also reinforce each other’s strengths. The final model thus achieves the best of both worlds—delivering a state-of-the-art performance with a significantly reduced computational cost—making it highly suitable for real-time deployment on resource-constrained devices.
4.2. Comparison of Algorithms Under the LineMOD Dataset
For the LineMOD dataset, following previous work, we considered a prediction as correct if the ADD-(S) was less than 10% of the object’s diameter. The percentage of correctly predicted keyframes out of the total number of keyframes was used as the evaluation metric (i.e., accuracy). The evaluation results for various algorithms are shown below in
Table 3. Bold numbers represent the highest scores among the methods, and objects marked with an asterisk (*) are symmetric.
Table 3 presents a comparative evaluation of our proposed method against several representative RGB- and RGB-D-based algorithms on the LineMOD dataset. Among the RGB-based approaches, PoseCNN+ICP and HRPose [49] achieved average accuracies of 88.6% and 87.6%, respectively, with both failing to surpass the 90% threshold. In contrast, RNNPose exhibited a strong performance among RGB-only methods, with an average ADD(S) accuracy of 97.1%, indicating that even in the absence of depth data, it can produce competitive results through effective temporal modeling.
For RGB-D-based methods, FS6D achieved an accuracy of 91.5% using few-shot learning techniques and RGB-D inputs. DenseFusion and MaskedFusion, which adopt dense pixel-level fusion strategies between RGB and depth features, achieved higher accuracies of 94.3% and 97.3%, respectively. These results highlight the benefit of effective geometric and visual feature integration for pose estimation robustness. Notably, DFTr further improves upon previous results, with an average accuracy of 99.2%, benefiting from transformer-based global feature aggregation and a cross-modal fusion design.
In contrast, our proposed method reaches an average ADD(S) accuracy of 98.5%, outperforming PoseCNN+ICP by 9.9%, DenseFusion by 4.2%, and MaskedFusion by 1.2%. Although DFTr achieved a slightly higher mean accuracy, our method remains highly competitive while offering better computational efficiency and significantly lower model complexity, making it more suitable for real-time or resource-constrained applications.
From a category-wise perspective, the proposed method achieves the best performance in several challenging object categories such as ape, can, driller, duck, and hole, all of which contain occlusions or have weak textures. This superior performance is attributed to the proposed multiscale point cloud feature fusion module, which enhances local geometric detail extraction and captures spatial context more effectively. Furthermore, for symmetric objects such as eggbox and glue, the proposed method achieves or matches the highest possible accuracy (100%), demonstrating robustness in handling symmetry ambiguity.
When comparing the experimental results, we observed that nearly all algorithms struggle to accurately estimate the poses of objects with indistinct surface textures, such as apes, drillers, and ducks. Due to their lack of effective texture features, RGB-based methods find it difficult to handle such objects. Using depth as supplemental information for iterative optimization can improve the accuracy to some extent; however, spatial geometric information is not fully integrated using this approach, resulting in a lower overall accuracy. Both DenseFusion and the proposed method use fused information from both images and depth data as the basis for pose estimation, taking advantage of the complementary nature of both data sources. Therefore, they both demonstrate good performances in pose estimation for objects with indistinct surface textures.
Figure 8 shows a comparison of DenseFusion and the proposed method’s results on several weak-textured objects.
From the comparison between DenseFusion and the proposed method, it is evident that the proposed method exhibits significantly improved accuracy across nearly all object categories. This is because DenseFusion only extracts single point cloud features, which, while improving accuracy to some extent, still has its limitations. In contrast, the point cloud extraction module proposed in this work enhances the focus on local point cloud information during pixel-level point cloud extraction, leading to richer features and better performance on weak-textured objects. In summary, the proposed method achieves a favorable balance between pose estimation accuracy and computational efficiency, outperforming most baseline methods.
4.3. Comparison of Algorithms Using the YCB-Video Dataset
For pose estimation on the YCB-Video dataset, we employed two evaluation metrics: the area under the ADD-S score curve (AUC) with a threshold from 0 to 10 cm, and the percentage of poses with an ADD-S error less than 2 cm, which reflects the high accuracy. In
Table 4, the best-performing results for each object under both metrics are shown in bold. Objects marked with an asterisk (*) are symmetric and evaluated accordingly.
Considering the AUC metric, our proposed method achieves an average accuracy of 93.1%, outperforming G2L-Net (92.3%), FS6D (88.4%), SaMfENet (92.6%), DenseFusion (91.2%), and MaskedFusion (92.9%). Compared to DenseFusion and FS6D, our method shows improvements of 1.9% and 4.7%, respectively, demonstrating stronger capability in pose regression.
Considering the <2 cm accuracy metric, our method achieves 95.9%, slightly outperforming SaMfENet (95.6%) and DenseFusion (95.3%) and remaining competitive with MaskedFusion (96.3%). Notably, our method achieves 100% accuracy on several challenging objects such as mug, gelatin_box, and mustard_bottle, showcasing its effectiveness in precise pose estimation, particularly for symmetric or small-scale objects.
Although FS6D demonstrates good generalization in few-shot scenarios, its overall AUC accuracy remains lower than our method, likely due to the absence of depth inputs. Overall, our method consistently ranks among the top-performing methods across both metrics and object categories, showing strong generalization and robustness. However, as shown in
Table 4, all methods, including ours, struggled with the large_clamp and extra_large_clamp objects, resulting in a relatively low accuracy. This is mainly due to the use of PoseCNN for semantic segmentation, as the two clamps are nearly identical in appearance except for scale, making it difficult to produce accurate masks and thus impacting pose estimation. Despite these challenges, our approach achieves notable improvements in overall pose estimation accuracy, particularly considering the <2 cm metric, and demonstrates enhanced robustness and precision compared to several representative RGB-D methods.
To provide a more intuitive comparison of the prediction results of the different algorithms on the YCB-Video dataset, we visualized the estimation results of each method. All methods used the semantic segmentation output from the PoseCNN network. We combined the sampled point information to transform the predicted results into point clouds and projected them onto the original image, allowing a direct observation of the consistency between the estimated pose and the ground truth. The comparison results are shown in Figure 9. As seen in Figure 9a, for smooth and textureless objects such as bananas and bleach cleaners, the method proposed in this paper provides accurate pose estimation, and for objects that lack texture features and have complex shapes, the pose estimates from our method were closer to the ground truth (as shown in Figure 9b). This improvement is attributed to the optimization of depth information processing in our method, which effectively integrates image and depth data to leverage richer features for more precise pose estimation. Moreover, the effective fusion of texture and depth features ensured that our method maintained good compatibility with symmetric objects, such as bowls and wooden blocks (as shown in Figure 9b,c). Furthermore, by using pixel-level prediction and enhancing the multi-scale local attention of point clouds, our method demonstrated strong robustness to occlusion and was able to provide accurate predictions for severely occluded objects, such as the scissors in Figure 9d.
4.4. Comparison of Algorithms Using the Occlusion LineMOD Dataset
The experimental results on the Occlusion LineMOD dataset indicate the performance of various algorithms in pose estimation under occluded scenes. The challenge of this dataset lies in the partial occlusion of objects, requiring the algorithm to not only identify the target object but also accurately estimate its 6D pose in complex scenes. We used the AUC as the evaluation metric, which comprehensively reflects the robustness of the algorithm when handling occlusions.
As shown in Table 5, the proposed method demonstrated outstanding performance in several object categories, showcasing strong competitiveness in occluded scenarios. For example, for objects with complex shapes and for symmetric objects, such as the eggbox and glue, the proposed method achieved AUC scores of 63.9% and 77.3%, respectively, surpassing PVNet, HybridPose, and GDR-Net [51] and even approaching or exceeding the performance of CRT-6D. In these categories, the geometric properties and high occlusion rates make precise 6D pose estimation highly challenging, but the proposed method effectively addressed these issues, demonstrating its robustness in handling occlusions.
Furthermore, for objects with severe occlusions, such as the duck and holepuncher, the proposed method outperformed most existing algorithms. For instance, in the duck category, our method achieved an AUC of 50.2%, surpassing both CRT-6D and GDR-Net, demonstrating its stronger capability to handle severe occlusion. Similarly, in the holepuncher category, the proposed method reached an AUC of 76.3%, outperforming GDR-Net and HybridPose and approaching the performance of CRT-6D.
Despite these promising results, we observed that pose estimation under severe occlusion remains challenging. Failures are mainly attributed to the lack of visible discriminative features when large portions of the object’s surface are obscured. In such cases, the fused RGB-D features may be ambiguous or insufficient to guide accurate regression, especially for symmetric or textureless objects.
Overall, the proposed method achieved an average AUC of 63.4% on the Occlusion LineMOD dataset, demonstrating strong competitiveness compared with the latest CRT-6D and SAM-6D methods. This indicates that the proposed approach offers both stability and accuracy in complex and occluded scenarios and still exhibits substantial potential under severe occlusion. By leveraging efficient computational strategies and multiscale geometric reasoning, the method achieves a favorable balance between estimation accuracy and resource consumption, making it well suited for practical applications in challenging environments.
4.5. Real-Time Performance and Efficiency Evaluation
To comprehensively evaluate the real-time performance and efficiency of the proposed method, we compared the inference speed, model size (in terms of parameters), and pose estimation accuracy (measured using ADD(S)) across several representative algorithms on the LineMOD dataset. The inference speed is expressed in FPS, indicating the number of keyframe images processed by each model per second in a full end-to-end pipeline.
Table 6 summarizes the performance of seven commonly used methods, covering RGB-based, RGB-D-based, and large-scale vision models.
The results show that HybridPose and RNNPose, as RGB-based methods, achieve inference speeds of 113.3 FPS and 87.1 FPS, respectively, with relatively small model sizes of 4.7M and 11.9M parameters. These models demonstrate excellent real-time capabilities but exhibit limited ADD(S) accuracies (91.3% and 97.1%, respectively) due to their lack of depth information. In contrast, RGB-D-based methods such as MaskedFusion, DFTr, and DenseFusion achieve significantly higher ADD(S) accuracies of 97.3%, 99.2%, and 94.3%, respectively, through the incorporation of geometric depth cues and multimodal fusion. However, this comes at the cost of increased model complexity. For example, DFTr introduces 162.1M parameters and has an inference speed of only 20.6 FPS, while MaskedFusion and DenseFusion have moderate sizes (37.4M and 23.4M, respectively) but relatively lower speeds (23.5 FPS and 25.5 FPS, respectively).
SAM-6D, which relies on a large-scale vision backbone, has the highest parameter count (>1000M) but suffers from an extremely low inference speed (0.2 FPS), rendering it impractical for real-time applications despite its strong performance in static scenarios. Our proposed method achieves a better balance between efficiency and accuracy. Optimized from DenseFusion, it attains a high ADD(S) accuracy of 98.5% while maintaining a lightweight architecture of only 19.5M parameters and achieving a real-time inference speed of 31.8 FPS. Compared with MaskedFusion and DFTr, our method offers a competitive or superior efficiency, with only minor sacrifices in accuracy, making it more suitable for resource-constrained real-time applications.
In summary, RGB-based methods exhibit high-speed inference but fall short in accuracy due to the absence of depth cues. RGB-D-based methods have significantly improved accuracy at the cost of greater computational complexity. By leveraging a lightweight design and efficient multiscale fusion, the proposed method achieves a favorable trade-off between speed, model size, and pose estimation accuracy, demonstrating strong potential for real-world deployment on mobile and embedded platforms.
5. Discussion
We focused on two main components in our approach. The first is a multiscale point cloud feature extraction module integrated with a self-attention mechanism, which enhances the representation of local features and mitigates information loss caused by partial occlusion. The second is a lightweight image feature extraction module, designed to reduce computational complexity while maintaining high-precision pose estimation. These two components complement each other to improve the performance in complex and low-texture environments.
Despite these contributions, several limitations remain. Although the proposed method improves the pose estimation accuracy under moderate occlusion, challenges still persist in cases of severe occlusion or highly symmetric objects. In these scenarios, pose ambiguity arises due to insufficient visual cues. However, as observed in the Occlusion LineMOD dataset, our method still shows considerable potential in such cases, outperforming several strong baselines on difficult categories such as duck and holepuncher. This suggests that with further refinement—such as improved feature alignment and occlusion-aware learning—the method could be extended to better handle these challenging conditions.
Another limitation is that the reliance on depth and point cloud data makes the model sensitive to sensor quality; poor or missing depth maps significantly affect estimation results, especially in dynamic or low-light environments.
While our lightweight design improves efficiency, we did not perform extensive comparisons with other lightweight 6D pose estimation frameworks such as PVN3D-Lite or EfficientPose. Future work will include a more systematic evaluation of accuracy–efficiency trade-offs across representative models.
Furthermore, the current study focuses primarily on known-object pose estimation, and its generalization ability to unseen objects is not explicitly addressed. This limits the model’s applicability in zero- or few-shot scenarios, which are commonly encountered in real-world robotics. One possible direction is to explore meta-learning techniques or feature-space adaptation to allow generalization to novel object categories with minimal training samples.
In summary, future work will focus on (1) enhancing robustness to occlusions and symmetry by incorporating synthetic and augmented training data; (2) integrating multi-view fusion and temporal consistency modules; (3) improving the generalization capability to unseen objects by investigating zero- or few-shot learning extensions of the current framework; and (4) further optimizing the model for deployment on edge devices, especially under real-world constraints such as low-power and limited bandwidth conditions.
6. Conclusions
In this paper, a lightweight 6D pose estimation network based on RGB-D images was proposed, which exhibited significantly improved pose estimation accuracy in occluded and low-texture object scenarios. The model is composed of a multiscale point cloud feature extraction module with a self-attention mechanism and a lightweight image feature extraction module. These components were validated through ablation studies on the LineMOD dataset, demonstrating their effectiveness and complementarity.
The proposed network achieved an accuracy of 98.5% on LineMOD with only 19.49 million parameters and a real-time speed of 31.8 FPS. Comparative experiments on the LineMOD, YCB-Video, and Occlusion LineMOD datasets showed that the proposed method outperformed several state-of-the-art approaches. Specifically, it achieved a 4.2% and 0.6% improvement over DenseFusion on the LineMOD and YCB-Video datasets, respectively, and reached 63.4% accuracy on Occlusion LineMOD.
While the method demonstrates promising results, we acknowledge that it still exhibits weaknesses in handling severely occluded and highly symmetric objects. Nevertheless, its performance on difficult cases suggests that the method also holds great potential in these areas, especially with further enhancements in feature learning and occlusion handling. In addition to planned improvements such as multi-view fusion and sensor combination, we also aim to investigate the model’s potential in few- or zero-shot settings, which would further enhance its practical deployment value.
In conclusion, the proposed method exhibits a compelling balance between accuracy and computational efficiency for 6D pose estimation and lays a solid foundation for practical deployment in resource-constrained environments such as edge devices or robotic platforms.