1. Introduction
The aim of object 6D pose estimation is to estimate the 3D rotation and translation of an object relative to the camera coordinate system, which is a fundamental problem in computer vision. It plays a critical role in a wide range of applications such as robotic manipulation [1,2,3,4], augmented reality [5], and autonomous navigation [6,7,8,9]. However, achieving robust and accurate 6D pose estimation in real-world scenarios remains highly challenging, particularly when objects are textureless, partially occluded, or captured under poor lighting conditions.
Traditional approaches can be broadly categorized into feature point-based and template-based methods. Feature point-based methods estimate 6D poses by establishing 2D–3D correspondences between image keypoints and the 3D model [10]. These methods typically exhibit degraded performance on low-texture objects because reliable keypoints cannot be extracted. Template-based methods, on the other hand, match input images against a pre-rendered template database. While efficient, these methods tend to fail in the presence of occlusions or appearance variations.
With the rise of deep learning, learning-based approaches have significantly improved 6D pose estimation performance. Early RGB-based methods such as PoseCNN [11] and BB8 [12] achieved promising results but are inherently limited by the lack of depth information, making them less robust in complex environments. RGB-D methods such as DenseFusion [13] leverage complementary depth cues and fuse RGB and point cloud features at the pixel level. Although effective, these approaches rely heavily on PointNet [14] for geometric feature extraction, which cannot sufficiently capture local context because of its global feature aggregation design. Later approaches such as PVN3D [15] introduced local feature modeling, but this significantly increased model complexity and computational cost, limiting their deployment in real-time or resource-constrained applications.
More recently, large-scale foundation models [16,17,18,19] have demonstrated strong generalization capabilities and robustness through the use of high-capacity architectures and multimodal representations. However, their considerable parameter sizes and inference latency present challenges for real-time deployment and mobile use. There is therefore a clear research gap between high-accuracy yet computationally expensive models and lightweight models with limited robustness and precision.
To bridge this gap, we propose a novel lightweight multiscale feature fusion network for 6D pose estimation that achieves an optimal balance between accuracy and efficiency. Specifically, we design a multiscale point cloud feature extraction module using parallel graph convolution layers with varying neighborhood sizes to capture geometric information at different scales. A channel-wise self-attention mechanism is applied to adaptively fuse these features, allowing the model to focus on the most informative scale. In parallel, a lightweight RGB feature extractor using depthwise separable convolutions reduces computations while preserving color cues. Finally, RGB and geometric features are fused at the pixel level to estimate the 6D pose. Compared with existing methods, our approach achieves a competitive accuracy with significantly fewer parameters and faster inference, making it well suited for real-time, mobile deployment.
3. Materials and Methods
3.1. Datasets
To evaluate the performance of the network, three mainstream public datasets (LineMOD, YCB-Video, and Occlusion LineMOD) were used for validation.
The LineMOD dataset is a widely used computer vision dataset for object recognition and pose estimation. It includes RGB-D images and 3D models of 13 objects, featuring both symmetric and asymmetric common objects. Each object has approximately 1000 manually annotated images. These images and models are captured from real-world objects, ensuring high realism and diversity.
The YCB-Video dataset is based on the YCB dataset and includes 21 objects selected for their high-quality model files and comprehensive depth images. The dataset contains a total of 133,827 frames, all annotated with 6D poses. It includes scenes with clutter and occlusions featuring both symmetric and asymmetric objects, as well as objects with minimal surface features.
The Occlusion LineMOD (LM-O) dataset is an extended version of the LineMOD dataset, specifically designed to evaluate the performance of models under complex occlusion conditions. It contains many heavily occluded scenes, providing a more challenging benchmark for assessing the robustness of pose estimation models. Compared to the original LineMOD dataset, the LM-O dataset introduces more complex occlusions and object arrangements, allowing for a more comprehensive evaluation of a model's performance when dealing with occluded objects in real-world environments. The LM-O dataset has become an important benchmark for validating the ability of 6D pose estimation algorithms to handle occlusions.
3.2. Overall Structure of the Network
The aim of the network proposed in this paper is to estimate the 6D pose of objects from RGB-D images. The 6D pose of an object refers to the rigid transformation from the object's coordinate system to the camera coordinate system. This transformation can be intuitively expressed as a combination of a rotational and a translational transformation. It is represented using a homogeneous transformation matrix $[R \mid t]$, where $R \in SO(3)$ is the rotational transformation and $t \in \mathbb{R}^3$ is the translational transformation.
3.2.1. Overview
Figure 1 shows the overall structure of the network, which mainly consists of four parts as follows:
- Semantic Segmentation Module: This module uses the semantic segmentation network proposed in PoseCNN. It segments the target object's bounding box from the RGB image, crops the corresponding mask, and combines the camera parameters to convert the depth map into the corresponding 3D point cloud, laying the foundation for the subsequent extraction of image and point cloud features.
- Multiscale Point Cloud Feature Extraction Module: This module primarily extracts features from the segmented point cloud data. The point cloud information passes through multiple graph convolution layers, each extracting local information from the point cloud at a different scale based on neighborhoods of varying sizes. The local features are then fused into more expressive multiscale point cloud features using a self-attention mechanism.
- Lightweight Image Feature Extraction Module: This module introduces an inverted residual structure to improve the original CNN network. To ensure that the extracted features can be fused with the point cloud features at the pixel level, the features are upsampled into texture feature maps of the original size.
- Feature Fusion and Pose Estimation: The multiscale point cloud features and texture feature maps are concatenated at the pixel level and then encoded to construct global features. These pixel-level features are fused with the global features and input into the pose estimation network, which consists of multiple consecutive convolutional layers that directly regress the translation and rotation matrices.
3.2.2. Semantic Segmentation
Semantic segmentation is a crucial step in the network, providing a data foundation for the main network. It helps eliminate distractions and improves the estimation accuracy to some extent. The function of this module is to segment the target area from the image, crop out the target object, and, finally, obtain an image patch and depth map containing only the target object. Since this work is focused on pose estimation algorithms, we directly use PoseCNN’s semantic segmentation network to crop the image. The overall framework is an encoder–decoder network, which produces N+1 binary images for each input RGB image in the dataset. These binary images represent the masks for each object, and the RGB image is cropped based on these masks to produce image patches containing only the target object. Similarly, depth maps containing only the target object are obtained, as shown in
Figure 2.
Since the focus of this research is on optimizing and improving the pose estimation algorithm, we did not make additional improvements to the semantic segmentation module, maintaining its original structure and method. However, the effectiveness of this module remains critical to the overall estimation process. It ensures the accurate fusion of image and point cloud information, providing a solid foundation for subsequent multiscale feature extraction and fusion. The masks generated by this module can effectively filter out background noise and focus on extracting features from the target object, further improving the model’s estimation accuracy for occluded and low-texture objects.
3.2.3. Multiscale Point Cloud Feature Extraction Module
Point cloud data serve as the data carrier for the entire network, and the accurate conversion of depth images into object surface point clouds is particularly important. Based on optical principles and the camera parameters, pixel points from the depth image are restored into corresponding 3D points, as shown below:

$z = \dfrac{d}{s}, \quad x = \dfrac{(u - c_x)\, z}{f_x}, \quad y = \dfrac{(v - c_y)\, z}{f_y}$   (1)

where $(u, v)$ represents the pixel coordinates in the image, $d$ is the depth value of the pixel, $s$ is the camera's scaling factor, $c_x$ and $c_y$ represent the image coordinates of the camera's optical axis, $f_x$ and $f_y$ are the camera's focal lengths along the x-axis and y-axis, respectively, and $(x, y, z)$ are the 3D coordinates corresponding to the pixel.
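For illustration, the back-projection in Equation (1) can be sketched as follows. This is a minimal NumPy version under our reading of the equation; the function name, variable names, and intrinsic values are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, s=1000.0, mask=None):
    """Back-project a depth map (H, W) into camera-frame 3D points.

    depth: raw depth values d(u, v); s is the depth scaling factor,
    (cx, cy) the principal point, (fx, fy) the focal lengths in pixels.
    mask: optional boolean mask selecting the target object's pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grid
    z = depth / s                                   # metric depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)           # (H, W, 3)
    if mask is not None:
        points = points[mask & (depth > 0)]         # keep valid object pixels only
    return points

# Illustrative intrinsics and a random depth map, for shape checking only.
depth = np.random.randint(400, 800, size=(480, 640)).astype(np.float32)
cloud = depth_to_pointcloud(depth, fx=572.4, fy=573.6, cx=325.3, cy=242.0, s=1000.0)
```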
PointNet, a commonly used network for processing point cloud data, can only extract geometric information at a single scale and cannot perceive local information. To address this issue, we draw on the idea of multiscale local information extraction from the grouping and sampling layers of PointNet++ [46]. We simultaneously collect neighborhood information of different sizes around 3D points and integrate this information using a self-attention mechanism to create more expressive fused features.
Figure 3 shows the specific process of point cloud feature extraction in our network.
The multiscale point cloud extraction module consists of two parts:
- The upper branch, which contains three consecutive multilayer perceptrons (MLPs) and is structurally similar to the traditional PointNet network. It mainly encodes the geometric information of each point in the point cloud to generate point-level features.
- The lower branch, which captures local geometric features for each point through multiple parallel graph convolution layers. Each layer operates on neighborhoods of a different size, allowing the network to extract geometric information across various spatial scales. To effectively integrate these multi-scale local features, we introduce a channel-wise self-attention mechanism that adaptively fuses the features based on their relative significance. After obtaining the feature maps from the graph convolution branches, we first apply global average pooling to the summed features, resulting in a compact representation. This pooled vector is then passed through two lightweight MLP layers to generate attention weights with a shape of [3,128], where the dimensions correspond to the three spatial scales and the 128 feature channels. A softmax function is applied to normalize the attention weights across the scales, and these weights are used to reweight the scale-specific feature maps before aggregation. This attention-driven fusion enables the network to dynamically emphasize the most informative features for each channel, thereby enhancing its discriminative capability and improving robustness to variations in scale (a code sketch of this fusion follows the list).
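The following is a minimal PyTorch sketch of the channel-wise fusion described above, under the assumption of three scale branches with 128 channels each over N sampled points; the module and variable names are ours and the hidden-layer width is an illustrative choice.

```python
import torch
import torch.nn as nn

class ScaleAttentionFusion(nn.Module):
    """Fuse S scale-specific feature maps (B, C, N) with channel-wise attention."""

    def __init__(self, channels=128, scales=3, reduction=4):
        super().__init__()
        # Two lightweight MLP layers producing one weight per (scale, channel) pair.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, scales * channels),
        )
        self.scales = scales
        self.channels = channels

    def forward(self, feats):                               # feats: list of S tensors, each (B, C, N)
        stacked = torch.stack(feats, dim=1)                 # (B, S, C, N)
        pooled = stacked.sum(dim=1).mean(dim=-1)            # sum branches, global average pool -> (B, C)
        weights = self.mlp(pooled).view(-1, self.scales, self.channels)  # (B, S, C), i.e. [3, 128] per sample
        weights = torch.softmax(weights, dim=1)             # normalize across the scales
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)  # reweight and aggregate -> (B, C, N)
        return fused

# Example: three scale branches with 128 channels over 512 points.
branches = [torch.randn(2, 128, 512) for _ in range(3)]
fused = ScaleAttentionFusion()(branches)                    # -> (2, 128, 512)
```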
The structure of each graph convolution layer is shown in Figure 4. We select k neighboring points within a radius r of the center point based on the Euclidean distance, forming a point set Y of size (3,k,N). The center point is then subtracted from all neighboring points in Y to represent local features of size (3,k,N), which are input into a multilayer perceptron (MLP) to obtain high-dimensional features F of size (128,k,N). A larger radius r expands the receptive field, allowing the model to capture richer contextual information, which is particularly beneficial for objects with detailed texture. Nevertheless, an excessively large r may introduce noise or irrelevant points, potentially degrading feature quality. Conversely, a smaller r preserves finer local geometric details but may lose the global structure, reducing effectiveness for highly occluded or low-texture regions. To balance this trade-off, we adopt the multi-scale neighborhood definition strategy of the PointNet++ grouping layers, selecting r with an interval of 0.2 to effectively capture features at different scales. The local feature computation is as follows:

$F_{ij} = h(p_{ij} - p_i), \quad i = 1, \ldots, N, \; j = 1, \ldots, k$   (2)
where N represents the total number of sampled points in the point cloud P, $p_i$ represents the i-th center point, k represents the number of points selected in the neighborhood, $p_{ij}$ represents the j-th neighboring point of the i-th center point, and $h(\cdot)$ denotes a nonlinear computation, i.e., an MLP.
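A minimal sketch of one graph convolution branch, as we interpret Equation (2), is given below: a ball query within radius r, center-point subtraction, and a shared MLP lifting the offsets to 128 dimensions. The brute-force distance computation and layer widths are simplifying assumptions for clarity.

```python
import torch
import torch.nn as nn

class GraphConvBranch(nn.Module):
    """One scale branch: group k neighbors within radius r, subtract the
    center point, and lift the offsets to 128-D with a shared MLP h(.)."""

    def __init__(self, radius=0.2, k=16, out_channels=128):
        super().__init__()
        self.radius, self.k = radius, k
        # Shared MLP applied point-wise via 1x1 2D convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(3, 64, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_channels, 1), nn.ReLU(inplace=True),
        )

    def forward(self, points):                        # points: (B, N, 3)
        dist = torch.cdist(points, points)            # (B, N, N) pairwise distances
        dist = dist.masked_fill(dist > self.radius, float('inf'))
        # k nearest points within r; if fewer than k lie within r, the nearest
        # points beyond r are used in this simplified sketch.
        _, idx = dist.topk(self.k, dim=-1, largest=False)
        neighbors = torch.gather(
            points.unsqueeze(1).expand(-1, points.size(1), -1, -1),   # (B, N, N, 3)
            2, idx.unsqueeze(-1).expand(-1, -1, -1, 3))               # (B, N, k, 3)
        offsets = neighbors - points.unsqueeze(2)     # p_ij - p_i, (B, N, k, 3)
        offsets = offsets.permute(0, 3, 2, 1)         # (B, 3, k, N), as in the text
        return self.mlp(offsets)                      # F: (B, 128, k, N)

cloud = torch.rand(2, 512, 3)
local_feat = GraphConvBranch(radius=0.2, k=16)(cloud)  # -> (2, 128, 16, 512)
```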
To obtain more effective local features, we introduce an attention mechanism that assigns an adaptive weight to each point's local feature representation. This weight consists of a self-weight $a_i$ and a local weight $b_{ij}$, where $a_i$ is derived by passing the global point cloud through a shared MLP with dimensions (3,128,1), representing the significance of each point independently. Meanwhile, $b_{ij}$ is computed by processing the extracted local features F through another shared MLP (128,1), assigning importance to the k nearest neighboring points. The final attention coefficient $w_{ij}$ is obtained by summing $a_i$ and $b_{ij}$ and further refining the result through a shared MLP to generate the final local coefficients. These attention coefficients are then used to weight the local feature vectors $F_{ij}$, dynamically emphasizing the more relevant geometric structures. The weighted features are then summed along the neighborhood dimension, resulting in the final geometry-enhanced output of the graph convolution layer, $F'_i$, which captures both local and global dependencies in a more structured manner. The computation process is as follows:

$a_i = \varphi(p_i), \quad b_{ij} = \psi(F_{ij}), \quad w_{ij} = h_w(a_i + b_{ij}), \quad F'_i = \sum_{j=1}^{k} w_{ij}\, F_{ij}$   (3)
where $F_{ij}$ represents the local features obtained by each graph convolution layer; P, $p_i$, i, and j have the same definitions as in Equation (2); and $\varphi$ and $\psi$ represent the mapping functions used to calculate the self-weight and local weight, respectively.
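A sketch of the attention weighting in Equation (3), as we read it, is shown below: a self-weight from the raw points, a local weight from the lifted features, a shared refinement layer, and a weighted sum over the k neighbors. Layer sizes follow the (3,128,1) and (128,1) MLPs described above; everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    """Weight each neighbor feature by a self-weight (from the point itself)
    plus a local weight (from its 128-D feature), then sum over neighbors."""

    def __init__(self, feat_channels=128):
        super().__init__()
        # Self-weight branch: shared MLP with dimensions (3, 128, 1).
        self.self_mlp = nn.Sequential(
            nn.Conv1d(3, 128, 1), nn.ReLU(inplace=True), nn.Conv1d(128, 1, 1))
        # Local-weight branch: shared MLP with dimensions (128, 1).
        self.local_mlp = nn.Conv2d(feat_channels, 1, 1)
        # Refinement of the summed weights into the final coefficients.
        self.refine = nn.Conv2d(1, 1, 1)

    def forward(self, points, feats):
        # points: (B, 3, N); feats: (B, 128, k, N) from a graph convolution branch.
        a = self.self_mlp(points).unsqueeze(2)   # (B, 1, 1, N) self-weights a_i
        b = self.local_mlp(feats)                # (B, 1, k, N) local weights b_ij
        w = self.refine(a + b)                   # final coefficients w_ij
        return (w * feats).sum(dim=2)            # sum over the neighborhood -> (B, 128, N)

points = torch.rand(2, 3, 512)
feats = torch.rand(2, 128, 16, 512)
out = LocalAttention()(points, feats)            # -> (2, 128, 512)
```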
To ensure that the network extracts the optimal local features, a self-attention module is introduced after the parallel graph convolution layers. When passing through this module, each scale-specific feature vector $F'^{(r)}$ is assigned a specific weight and, after passing through an average pooling layer, fully connected layers, and a softmax layer, the final local feature vector $F_{local}$ is output. This process can be expressed as follows:

$\alpha = \mathrm{softmax}\!\left(\mathrm{MLP}\!\left(\mathrm{AvgPool}\!\left(\sum_{r} F'^{(r)}\right)\right)\right), \quad F_{local} = \sum_{r} \alpha_r \odot F'^{(r)}$   (4)
Finally, the point-level features and local features are concatenated to form the final geometric features of the point cloud.
3.2.4. Lightweight Image Feature Extraction Module
To ensure that image features can be correspondingly fused with point cloud features, we designed the lightweight feature extraction module as a U-shaped encoder–decoder network, as shown in
Figure 5. The network structure is based on the lightweight MobileNetV2 [47] network. Each downsampling layer features an inverted residual module and depthwise separable convolution, and each downsampling layer contains two consecutive bottlenecks. The encoder includes four downsampling layers, ultimately generating a feature map of size (H/16,W/16,1024). Correspondingly, the decoder symmetrically uses four consecutive upsampling layers to restore the feature map to size (H,W,64). The upsampling layers follow the upsampling method used in DenseFusion and PSPNet [48], with each upsampling layer consisting of bilinear interpolation and a 2D convolution layer.
The core of the network is the inverted residual module, which is used multiple times during downsampling. As shown in
Figure 6, in addition to the depthwise separable structure, this module also employs Expansion and Projection layers. The Expansion layer uses a 1 × 1 network structure to map the low-dimensional space to a high-dimensional space. The Projection layer also uses a 1 × 1 network structure but functions in the opposite way, mapping the high-dimensional space back to a low-dimensional space. These two layers are positioned at either end of the inverted residual module, with a 3 × 3 depthwise separable convolution in between, giving the module a spindle-like structure—narrow at the ends and wide in the middle. This design facilitates smoother gradient propagation, helping to improve the model’s convergence speed. Additionally, the presence of skip connections in the inverted residual module allows the original input to be directly passed to subsequent layers, making gradient propagation through the network easier. This helps alleviate the vanishing gradient problem, making the network easier to train.
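The inverted residual block described above can be sketched as follows: a 1 × 1 expansion, a 3 × 3 depthwise convolution, a 1 × 1 projection, and a skip connection when shapes match. This is a simplified MobileNetV2-style bottleneck; the exact channel settings and expansion factor used in our network may differ.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expansion (1x1) -> depthwise 3x3 -> projection (1x1), with a skip
    connection when the input and output shapes match."""

    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # Expansion layer: map the low-dimensional input to a wider space.
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # Depthwise 3x3 convolution in the expanded space.
            nn.Conv2d(hidden, hidden, 3, stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # Projection layer: map back to a low-dimensional space (no activation).
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

x = torch.randn(1, 32, 120, 160)
y = InvertedResidual(32, 32)(x)             # skip connection active -> (1, 32, 120, 160)
z = InvertedResidual(32, 64, stride=2)(x)   # downsampling bottleneck -> (1, 64, 60, 80)
```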
3.2.5. Feature Fusion and Pose Estimation
During feature fusion, the pixel-level image features obtained from the two feature extraction modules and the point-level features containing local information need to be fully integrated. By leveraging the natural complementarity between the two, the features are concatenated to construct global features. These features are then input into the pose estimation network to estimate the object’s pose.
Feature fusion: Considering that the point cloud features are derived from the depth map and that there is a direct correspondence between the depth map and the RGB image, feature fusion is performed by concatenating features at the pixel level based on this positional correspondence. This forms point cloud features that not only include geometric and color information but also contain neighborhood information. These features can then be directly used to extract global variables, which are encoded through two multilayer shared perceptrons. Afterward, an average pooling layer is used to obtain representative global variables. The pixel-level fused features and global features are concatenated and input into the pose estimation network as the final features. The process is expressed as follows:
$F = x + f_{rgb}, \qquad F_{final} = F + \mathrm{AvgPool}\big(\varphi_2(\varphi_1(F))\big)$   (5)

where x represents the point cloud features, $f_{rgb}$ represents the image features, F represents the pixel-level fused features containing color and local information, AvgPool is the average pooling operation, $\varphi_1$ and $\varphi_2$ are nonlinear functions (shared perceptrons), and $F_{final}$ is the final feature. Note that the + operation in Equation (5) represents feature concatenation.
To clearly illustrate the feature extraction pipeline from the input images to the final fused features, we present the entire fusion process in a diagram. As shown in Figure 7, the flow proceeds from left to right, showing how the RGB and depth images are processed. A mask is first applied to extract the target object region (choose), which is then used to index both the RGB image and the depth map, leveraging the one-to-one pixel-level correspondence between them for precise feature alignment.
RGB features are extracted using a CNN, producing a feature map of shape $(H, W, C_{rgb})$. Using choose, we obtain an image feature tensor of shape $(N, C_{rgb})$. Simultaneously, the depth map is back-projected using the camera intrinsics to generate 3D coordinates and corresponding geometric features of size $(N, C_{geo})$. These features are concatenated to form the fused feature $F \in \mathbb{R}^{N \times (C_{rgb} + C_{geo})}$, where N is the number of sampled points and $C_{rgb}$ and $C_{geo}$ are the channel dimensions of the corresponding features.
To capture the global context, F is passed through shared MLPs, and an average pooling operation is used to extract global features. These are then concatenated back to the point-wise fused features, resulting in the final feature $F_{final}$. This process is represented in Equation (5) and offers an intuitive view of the pixel-aligned fusion between image and point cloud features.
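A minimal sketch of this pixel-aligned fusion (Equation (5)) is given below. The channel sizes and module names are illustrative assumptions rather than the exact values used in the network.

```python
import torch
import torch.nn as nn

class PixelFusion(nn.Module):
    """Concatenate per-point geometric and image features, encode a global
    vector with shared MLPs plus average pooling, and append it to every point."""

    def __init__(self, geo_ch=128, rgb_ch=64, global_ch=512):
        super().__init__()
        self.phi1 = nn.Sequential(nn.Conv1d(geo_ch + rgb_ch, 256, 1), nn.ReLU(inplace=True))
        self.phi2 = nn.Sequential(nn.Conv1d(256, global_ch, 1), nn.ReLU(inplace=True))

    def forward(self, geo_feat, rgb_feat):
        # geo_feat: (B, geo_ch, N); rgb_feat: (B, rgb_ch, N), pixel-aligned via "choose".
        fused = torch.cat([geo_feat, rgb_feat], dim=1)               # F = x (+) f_rgb
        g = self.phi2(self.phi1(fused)).mean(dim=-1, keepdim=True)   # AvgPool -> (B, global_ch, 1)
        g = g.expand(-1, -1, fused.size(-1))                         # broadcast global feature to all points
        return torch.cat([fused, g], dim=1)                          # F_final, (B, geo+rgb+global, N)

geo = torch.rand(2, 128, 512)
rgb = torch.rand(2, 64, 512)
final = PixelFusion()(geo, rgb)    # -> (2, 704, 512)
```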
Pose estimation: When the features are input into the neural network for pose estimation, the network directly regresses the rotation, translation, and confidence score. The pose estimation network in this paper is similar to that of DenseFusion and consists of three parallel sub-networks, each consisting of four 1D convolutional layers. The estimation accuracy is determined by calculating the average deviation distance between the sampled points of the predicted and ground truth point clouds. For asymmetric objects, the loss is defined as follows:

$L_i^p = \dfrac{1}{G} \sum_{j=1}^{G} \left\| (R x_j + t) - (\hat{R}_i x_j + \hat{t}_i) \right\|$   (6)

This metric is only suitable for asymmetric objects with a unique correct pose. For symmetric objects, a new measurement is introduced:

$L_i^p = \dfrac{1}{G} \sum_{j=1}^{G} \min_{1 \le k \le G} \left\| (R x_j + t) - (\hat{R}_i x_k + \hat{t}_i) \right\|$   (7)

The two metrics are similar, with $(R, t)$ representing the ground truth pose, $x_j$ being the j-th point in the point cloud model, $(\hat{R}_i, \hat{t}_i)$ being the pose estimated by the network, and G being the number of sampled points in the point cloud model. The loss function for the entire network is defined as follows:

$L = \dfrac{1}{N} \sum_{i=1}^{N} \left( L_i^p c_i - w \log(c_i) \right)$   (8)
where N is the number of predicted poses, w is the balancing hyperparameter for confidence regularization, which penalizes low-confidence predictions, $c_i$ represents the confidence score for each regressed pose, and $L_i^p$ represents the loss of each predicted pose.
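The following sketch illustrates the loss in Equations (6)–(8). It assumes a DenseFusion-style formulation with per-point pose predictions and confidences; the tensor shapes and function name are our illustrative choices.

```python
import torch

def pose_loss(pred_R, pred_t, conf, gt_R, gt_t, model_pts, symmetric=False, w=0.015):
    """Confidence-weighted average distance loss over N predicted poses.

    pred_R: (N, 3, 3), pred_t: (N, 3), conf: (N,) confidence per prediction;
    gt_R: (3, 3), gt_t: (3,); model_pts: (G, 3) sampled model points.
    """
    gt_pts = model_pts @ gt_R.T + gt_t                                     # (G, 3) ground-truth transform
    pred_pts = model_pts @ pred_R.transpose(1, 2) + pred_t.unsqueeze(1)    # (N, G, 3)
    if symmetric:
        # Equation (7): match each transformed point to its nearest ground-truth point.
        dist = torch.cdist(pred_pts, gt_pts.unsqueeze(0).expand(pred_pts.size(0), -1, -1))
        per_pose = dist.min(dim=-1).values.mean(dim=-1)                    # (N,)
    else:
        # Equation (6): compare corresponding points directly.
        per_pose = (pred_pts - gt_pts).norm(dim=-1).mean(dim=-1)           # (N,)
    # Equation (8): confidence weighting with a regularizer penalizing low confidence.
    return (per_pose * conf - w * torch.log(conf)).mean()

# Toy usage with placeholder tensors (shapes only; not meaningful poses).
N, G = 500, 500
loss = pose_loss(torch.eye(3).repeat(N, 1, 1), torch.zeros(N, 3),
                 torch.full((N,), 0.5), torch.eye(3), torch.zeros(3),
                 torch.rand(G, 3))
```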
3.3. Evaluation Metrics
The accuracy of pose estimation is generally determined by two metrics—the average distance (ADD) and average nearest point distance (ADD-S)—used to evaluate the accuracy of asymmetric and symmetric objects, respectively.
The average distance (ADD) is defined as the mean distance between the sampled points on the 3D model transformed by the ground truth pose and those transformed by the predicted pose. It is calculated as follows:

$\mathrm{ADD} = \dfrac{1}{m} \sum_{x \in \mathcal{M}} \left\| (R x + t) - (\tilde{R} x + \tilde{t}) \right\|$   (9)

where m is the number of sampled points, $(R, t)$ and $(\tilde{R}, \tilde{t})$ represent the ground truth pose and the estimated pose, respectively, and x refers to the sampled points from the point cloud model $\mathcal{M}$.
The average nearest point distance (ADD-S) is defined as the distance between each sampled point on the 3D model transformed by the ground truth pose and the nearest point on the model transformed by the predicted pose. The average of these nearest point distances is calculated across all sampled points as follows:

$\mathrm{ADD\text{-}S} = \dfrac{1}{m} \sum_{x_1 \in \mathcal{M}} \min_{x_2 \in \mathcal{M}} \left\| (R x_1 + t) - (\tilde{R} x_2 + \tilde{t}) \right\|$   (10)
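For completeness, a compact sketch of both metrics is given below (PyTorch tensors; function and variable names are ours).

```python
import torch

def add_metric(R, t, R_hat, t_hat, model_pts):
    """ADD: mean distance between corresponding transformed model points."""
    gt = model_pts @ R.T + t
    pred = model_pts @ R_hat.T + t_hat
    return (gt - pred).norm(dim=-1).mean()

def add_s_metric(R, t, R_hat, t_hat, model_pts):
    """ADD-S: mean distance from each ground-truth-transformed point to the
    nearest predicted-transformed point (handles symmetric objects)."""
    gt = model_pts @ R.T + t
    pred = model_pts @ R_hat.T + t_hat
    return torch.cdist(gt, pred).min(dim=-1).values.mean()

pts = torch.rand(500, 3)
err = add_metric(torch.eye(3), torch.zeros(3), torch.eye(3), torch.ones(3) * 0.01, pts)
# A prediction is typically counted correct if err < 10% of the object diameter (LineMOD protocol).
```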
3.4. Experimental Details
Validation was performed using the PyTorch (version 2.6) framework with the Adam optimizer. All experiments were conducted on an Intel® Xeon® Platinum 8255C CPU and an NVIDIA 3080 Ti GPU. The training parameters were configured as follows, based on insights from previous studies and data analysis: the initial learning rate was set to 0.001 to promote faster convergence, the number of training iterations was set to 500, the hyperparameter w was set to 0.015, and the number of point cloud sampling points (n) was set to 512. The optimizer and learning rate schedule were configured in line with the settings used in DenseFusion. Additionally, the batch size was set to 8 to accommodate GPU memory limitations, and several data augmentation strategies were applied to enhance generalization. With these settings, the model demonstrated rapid convergence without significant oscillations during training.
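For reference, the training configuration described above can be sketched as follows; the model object is a placeholder, and only the hyperparameters listed in this section are reflected.

```python
import torch

# Placeholder model; in practice this is the proposed fusion network.
model = torch.nn.Linear(10, 10)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # initial learning rate 0.001
num_epochs = 500     # number of training iterations
w = 0.015            # confidence-regularization hyperparameter in Equation (8)
num_points = 512     # number of point cloud sampling points n
batch_size = 8       # limited by GPU memory
```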
Table 1 outlines the model parameters for the training process.
4. Results
4.1. Ablation Study
We conducted an ablation study on the LineMOD dataset to evaluate the effectiveness of each proposed module in terms of accuracy, parameter count, and real-time inference speed. The results are summarized in
Table 2, where the accuracy metric is based on ADD-(S) (ADD is used for asymmetric objects, while ADD-S is applied to symmetric ones). The number of parameters reflects the model’s complexity, and the runtime (FPS) indicates its inference efficiency.
DenseFusion is used as the baseline model. The MGC column denotes the use of multi-branch graph convolution modules, while SA represents the self-attention mechanism. These two components together form the multiscale point cloud feature extraction module. LIFE refers to the lightweight image feature extraction module, which replaces the original image downsampling layers in DenseFusion with MobileNetV2. As LIFE is an independent replacement of the image encoder, we did not further break it down into sub-components in the ablation experiment.
From the results in
Table 2, we can observe that each proposed module contributes positively to model performance in terms of accuracy and efficiency. Below, we provide a detailed analysis of the effects of the multiscale point cloud feature extraction module (MGC + SA) and the lightweight image feature extraction module (LIFE), as well as the trade-offs between accuracy, parameter count, and runtime speed.
Comparing the baseline (Row 1) with the variant incorporating only MGC (Row 2), we observe a significant improvement in accuracy from 94.3% to 96.4%, indicating the benefit of introducing multi-branch graph convolution. This module enhances the local geometric representation by aggregating features from multiple receptive fields. However, it introduces a moderate increase in parameters and a slight drop in FPS due to the added computational paths. When the self-attention (SA) module is further added (Row 3), the accuracy rises to 97.6%, showing that the adaptive weighting of multi-scale features can better emphasize task-relevant patterns. Notably, despite the added attention mechanism, the overall parameter count decreases slightly compared to Row 2 due to structural balancing. This confirms that the SA module improves the feature selection efficiency with only a minor increase in overhead, demonstrating a favorable trade-off between accuracy and complexity.
Replacing the original CNN backbone with MobileNetV2 (Row 4) significantly reduces the parameter count from 23.39M to 18.86M and boosts the inference speed from 25.5 FPS to 35.3 FPS. However, this comes with a slight drop in accuracy (94.1%), indicating that aggressive compression may compromise the representation power if not complemented by other enhancements. When LIFE is combined with MGC (Row 5), the accuracy climbs to 97.5% while maintaining a relatively low parameter count (22.48M) and high speed (32.6 FPS). This suggests that point cloud enhancements can compensate for lightweight image features, resulting in a balanced and effective design. Finally, integrating all three components (Row 6) achieves the best accuracy (98.5%) with the fewest parameters (19.49 M) among the high-performing variants, confirming the effectiveness of the full architecture.
Comparing Row 3 and Row 6 in Table 2, we observe an interesting phenomenon: the addition of the LIFE module not only reduces the parameter count (from 25.83M to 19.49M) and improves the runtime speed (from 22.4 FPS to 31.8 FPS) but also increases the accuracy from 97.6% to 98.5%. This result may seem counterintuitive, as lightweight modules are typically expected to trade accuracy for efficiency. We attribute this improvement to the complementary nature of the LIFE module when combined with the multiscale point cloud feature extraction module. Using MobileNetV2 introduces a more efficient image encoding pipeline that preserves essential semantic features while reducing redundancy, which may help suppress irrelevant noise and overfitting. Moreover, the compact representation from LIFE may improve the effectiveness of the subsequent fusion with point cloud features, resulting in better joint representations and an enhanced overall performance.
These results demonstrate a clear trade-off landscape: while MGC and SA contribute to accuracy gains through richer geometric modeling and adaptive feature weighting, LIFE introduces substantial efficiency benefits with minimal compromise. When appropriately combined, these modules not only offset each other’s limitations but also reinforce each other’s strengths. The final model thus achieves the best of both worlds—delivering a state-of-the-art performance with a significantly reduced computational cost—making it highly suitable for real-time deployment on resource-constrained devices.
4.2. Comparison of Algorithms Under the LineMOD Dataset
For the LineMOD dataset, following previous work, we considered a prediction as correct if the ADD-(S) was less than 10% of the object’s diameter. The percentage of correctly predicted keyframes out of the total number of keyframes was used as the evaluation metric (i.e., accuracy). The evaluation results for various algorithms are shown below in
Table 3. Bold numbers represent the highest scores among the methods, and objects marked with an asterisk (*) are symmetric.
Table 3 presents a comparative evaluation of our proposed method against several representative RGB- and RGB-D-based algorithms on the LineMOD dataset. Among the RGB-based approaches, PoseCNN+ICP and HRPose [49] achieved average accuracies of 88.6% and 87.6%, respectively, with both failing to surpass the 90% threshold. In contrast, RNNPose exhibited a strong performance among RGB-only methods, with an average ADD(S) accuracy of 97.1%, indicating that even in the absence of depth data, it can produce competitive results through effective temporal modeling.
For RGB-D-based methods, FS6D achieved an accuracy of 91.5% using few-shot learning techniques and RGB-D inputs. DenseFusion and MaskedFusion, which adopt dense pixel-level fusion strategies between RGB and depth features, achieved higher accuracies of 94.3% and 97.3%, respectively. These results highlight the benefit of effective geometric and visual feature integration for pose estimation robustness. Notably, DFTr further improves upon previous results, with an average accuracy of 99.2%, benefiting from transformer-based global feature aggregation and a cross-modal fusion design.
In contrast, our proposed method reaches an average ADD(S) accuracy of 98.5%, outperforming PoseCNN+ICP by 9.9%, DenseFusion by 4.2%, and MaskedFusion by 1.2%. Although DFTr achieved a slightly higher mean accuracy, our method remains highly competitive while offering better computational efficiency and significantly lower model complexity, making it more suitable for real-time or resource-constrained applications.
From a category-wise perspective, the proposed method achieves the best performance in several challenging object categories such as ape, can, driller, duck, and hole, all of which contain occlusions or have weak textures. This superior performance is attributed to the proposed multiscale point cloud feature fusion module, which enhances local geometric detail extraction and captures spatial context more effectively. Furthermore, for symmetric objects such as eggbox and glue, the proposed method achieves or matches the highest possible accuracy (100%), demonstrating robustness in handling symmetry ambiguity.
When comparing the experimental results, we observed that nearly all algorithms struggle to accurately estimate the poses of objects with indistinct surface textures, such as apes, drillers, and ducks. Due to their lack of effective texture features, RGB-based methods find it difficult to handle such objects. Using depth as supplemental information for iterative optimization can improve the accuracy to some extent; however, spatial geometric information is not fully integrated using this approach, resulting in a lower overall accuracy. Both DenseFusion and the proposed method use fused information from both images and depth data as the basis for pose estimation, taking advantage of the complementary nature of both data sources. Therefore, they both demonstrate good performances in pose estimation for objects with indistinct surface textures.
Figure 8 shows a comparison of DenseFusion and the proposed method’s results on several weak-textured objects.
From the comparison between DenseFusion and the proposed method, it is evident that the proposed method exhibits significantly improved accuracy across nearly all object categories. This is because DenseFusion only extracts single point cloud features, which, while improving accuracy to some extent, still has its limitations. In contrast, the point cloud extraction module proposed in this work enhances the focus on local point cloud information during pixel-level point cloud extraction, leading to richer features and better performance on weak-textured objects. In summary, the proposed method achieves a favorable balance between pose estimation accuracy and computational efficiency, outperforming most baseline methods.
4.3. Comparison of Algorithms Using the YCB-Video Dataset
For pose estimation on the YCB-Video dataset, we employed two evaluation metrics: the area under the ADD-S score curve (AUC) with a threshold from 0 to 10 cm, and the percentage of poses with an ADD-S error less than 2 cm, which reflects the high accuracy. In
Table 4, the best-performing results for each object under both metrics are shown in bold. Objects marked with an asterisk (*) are symmetric and evaluated accordingly.
Considering the AUC metric, our proposed method achieves an average accuracy of 93.1%, outperforming G2L-Net (92.3%), FS6D (88.4%), SaMfENet (92.6%), DenseFusion (91.2%), and MaskedFusion (92.9%). Compared to DenseFusion and FS6D, our method shows improvements of 1.9% and 4.7%, respectively, demonstrating stronger capability in pose regression.
Considering the <2 cm accuracy metric, our method achieves 95.9%, slightly outperforming SaMfENet (95.6%) and DenseFusion (95.3%) and remaining competitive with MaskedFusion (96.3%). Notably, our method achieves 100% accuracy on several challenging objects such as mug, gelatin_box, and mustard_bottle, showcasing its effectiveness in precise pose estimation, particularly for symmetric or small-scale objects.
Although FS6D demonstrates good generalization in few-shot scenarios, its overall AUC accuracy remains lower than our method, likely due to the absence of depth inputs. Overall, our method consistently ranks among the top-performing methods across both metrics and object categories, showing strong generalization and robustness. However, as shown in
Table 4, all methods, including ours, struggled with the large_clamp and extra_large_clamp objects, resulting in a relatively low accuracy. This is mainly due to the use of PoseCNN for semantic segmentation, as the two clamps are nearly identical in appearance except for scale, making it difficult to produce accurate masks and thus impacting pose estimation. Despite these challenges, our approach achieves notable improvements in overall pose estimation accuracy, particularly considering the <2 cm metric, and demonstrates enhanced robustness and precision compared to several representative RGB-D methods.
To provide a more intuitive comparison of the prediction results of the different algorithms on the YCB-Video dataset, we visualized the estimation results of each method. All methods used the semantic segmentation output from the PoseCNN network. We combined the sampled point information to transform the predicted results into point clouds and projected them onto the original image, allowing a direct observation of the consistency between the estimated pose and the ground truth. The comparison results are shown in Figure 9. As seen in Figure 9a, for smooth and textureless objects such as bananas and bleach cleaners, the method proposed in this paper provides accurate pose estimation, and for objects that lack texture features and have complex shapes, the pose estimates from our method were closer to the ground truth (as shown in Figure 9b). This improvement is attributed to the optimization of depth information processing in our method, which effectively integrates image and depth data to leverage richer features for more precise pose estimation. Moreover, the effective fusion of texture and depth features ensured that our method maintained good compatibility with symmetric objects, such as bowls and wooden blocks (as shown in Figure 9b,c). Furthermore, by using pixel-level prediction and enhancing the multi-scale local attention of point clouds, our method demonstrated strong robustness to occlusion and was able to provide accurate predictions for severely occluded objects, such as the scissors in Figure 9d.
4.4. Comparison of Algorithms Using the Occlusion LineMOD Dataset
The experimental results on the Occlusion LineMOD dataset indicate the performance of various algorithms in pose estimation under occluded scenes. The challenge of this dataset lies in the partial occlusion of objects, requiring the algorithm to not only identify the target object but also accurately estimate its 6D pose in complex scenes. We used the AUC as the evaluation metric, which comprehensively reflects the robustness of the algorithm when handling occlusions.
As shown in Table 5, the proposed method demonstrated outstanding performance in several object categories, showcasing strong competitiveness in occluded scenarios. For example, for objects with complex shapes and for symmetric objects, such as the eggbox and glue, the proposed method achieved AUC scores of 63.9% and 77.3%, respectively, surpassing PVNet, HybridPose, and GDR-Net [51] and even approaching or exceeding the performance of CRT-6D. In these categories, the geometric properties and high occlusion rates make precise 6D pose estimation highly challenging, but the proposed method effectively addressed these issues, demonstrating its robustness in handling occlusions.
Furthermore, for objects with severe occlusions, such as the duck and holepuncher, the proposed method outperformed most existing algorithms. For instance, in the duck category, our method achieved an AUC of 50.2%, surpassing both CRT-6D and GDR-Net, demonstrating its stronger capability to handle severe occlusion. Similarly, in the holepuncher category, the proposed method reached an AUC of 76.3%, outperforming GDR-Net and HybridPose and approaching the performance of CRT-6D.
Despite these promising results, we observed that pose estimation under severe occlusion remains challenging. Failures are mainly attributed to the lack of visible discriminative features when large portions of the object’s surface are obscured. In such cases, the fused RGB-D features may be ambiguous or insufficient to guide accurate regression, especially for symmetric or textureless objects.
Overall, the proposed method achieved an average AUC of 63.4% on the Occlusion LineMOD dataset, demonstrating strong competitiveness compared with the latest CRT-6D and SAM-6D methods. This indicates that the proposed approach offers both stability and accuracy in complex and occluded scenarios and still exhibits substantial potential under severe occlusion. By leveraging efficient computational strategies and multiscale geometric reasoning, the method achieves a favorable balance between estimation accuracy and resource consumption, making it well suited for practical applications in challenging environments.
4.5. Real-Time Performance and Efficiency Evaluation
To comprehensively evaluate the real-time performance and efficiency of the proposed method, we compared the inference speed, model size (in terms of parameters), and pose estimation accuracy (measured using ADD(S)) across several representative algorithms on the LineMOD dataset. The inference speed is expressed in FPS, indicating the number of keyframe images processed by each model per second in a full end-to-end pipeline.
Table 6 summarizes the performance of seven commonly used methods, covering RGB-based, RGB-D-based, and large-scale vision models.
The results show that HybridPose and RNNPose, as RGB-based methods, achieve inference speeds of 113.3 FPS and 87.1 FPS, respectively, with relatively small model sizes of 4.7M and 11.9M parameters. These models demonstrate excellent real-time capabilities but exhibit limited ADD(S) accuracies (91.3% and 97.1%, respectively) due to their lack of depth information. In contrast, RGB-D-based methods such as MaskedFusion, DFTr, and DenseFusion achieve significantly higher ADD(S) accuracies of 97.3%, 99.2%, and 94.3%, respectively, through the incorporation of geometric depth cues and multimodal fusion. However, this comes at the cost of increased model complexity. For example, DFTr introduces 162.1M parameters and has an inference speed of only 20.6 FPS, while MaskedFusion and DenseFusion have moderate sizes (37.4M and 23.4M, respectively) but relatively lower speeds (23.5 FPS and 25.5 FPS, respectively).
SAM-6D, which relies on a large-scale vision backbone, has the highest parameter count (>1000M) but suffers from an extremely low inference speed (0.2 FPS), rendering it impractical for real-time applications despite its strong performance in static scenarios. Our proposed method achieves a better balance between efficiency and accuracy. Optimized from DenseFusion, it attains a high ADD(S) accuracy of 98.5% while maintaining a lightweight architecture of only 19.5M parameters and achieving a real-time inference speed of 31.8 FPS. Compared with MaskedFusion and DFTr, our method offers a competitive or superior efficiency, with only minor sacrifices in accuracy, making it more suitable for resource-constrained real-time applications.
In summary, RGB-based methods exhibit high-speed inference but fall short in accuracy due to the absence of depth cues. RGB-D-based methods have significantly improved accuracy at the cost of greater computational complexity. By leveraging a lightweight design and efficient multiscale fusion, the proposed method achieves a favorable trade-off between speed, model size, and pose estimation accuracy, demonstrating strong potential for real-world deployment on mobile and embedded platforms.
5. Discussion
We focused on two main components in our approach. The first is a multiscale point cloud feature extraction module integrated with a self-attention mechanism, which enhances the representation of local features and mitigates information loss caused by partial occlusion. The second is a lightweight image feature extraction module, designed to reduce computational complexity while maintaining high-precision pose estimation. These two components complement each other to improve the performance in complex and low-texture environments.
Despite these contributions, several limitations remain. Although the proposed method improves the pose estimation accuracy under moderate occlusion, challenges still persist in cases of severe occlusion or highly symmetric objects. In these scenarios, pose ambiguity arises due to insufficient visual cues. However, as observed in the Occlusion LineMOD dataset, our method still shows considerable potential in such cases, outperforming several strong baselines on difficult categories such as duck and holepuncher. This suggests that with further refinement—such as improved feature alignment and occlusion-aware learning—the method could be extended to better handle these challenging conditions.
Another limitation is that the reliance on depth and point cloud data makes the model sensitive to sensor quality; poor or missing depth maps significantly affect estimation results, especially in dynamic or low-light environments.
While our lightweight design improves efficiency, we did not perform extensive comparisons with other lightweight 6D pose estimation frameworks such as PVN3D-Lite or EfficientPose. Future work will include a more systematic evaluation of accuracy–efficiency trade-offs across representative models.
Furthermore, the current study focuses primarily on known-object pose estimation, and its generalization ability to unseen objects is not explicitly addressed. This limits the model’s applicability in zero- or few-shot scenarios, which are commonly encountered in real-world robotics. One possible direction is to explore meta-learning techniques or feature-space adaptation to allow generalization to novel object categories with minimal training samples.
In summary, future work will focus on (1) enhancing robustness to occlusions and symmetry by incorporating synthetic and augmented training data; (2) integrating multi-view fusion and temporal consistency modules; (3) improving the generalization capability to unseen objects by investigating zero- or few-shot learning extensions of the current framework; and (4) further optimizing the model for deployment on edge devices, especially under real-world constraints such as low-power and limited bandwidth conditions.
6. Conclusions
In this paper, a lightweight 6D pose estimation network based on RGB-D images was proposed, which exhibited significantly improved pose estimation accuracy in occluded and low-texture object scenarios. The model is composed of a multiscale point cloud feature extraction module with a self-attention mechanism and a lightweight image feature extraction module. These components were validated through ablation studies on the LineMOD dataset, demonstrating their effectiveness and complementarity.
The proposed network achieved an accuracy of 98.5% on LineMOD with only 19.49 million parameters and a real-time speed of 31.8 FPS. Comparative experiments on the LineMOD, YCB-Video, and Occlusion LineMOD datasets showed that the proposed method outperformed several state-of-the-art approaches. Specifically, it achieved a 4.2% and 0.6% improvement over DenseFusion on the LineMOD and YCB-Video datasets, respectively, and reached 63.4% accuracy on Occlusion LineMOD.
While the method demonstrates promising results, we acknowledge that it still exhibits weaknesses in handling severely occluded and highly symmetric objects. Nevertheless, its performance on difficult cases suggests that the method also holds great potential in these areas, especially with further enhancements in feature learning and occlusion handling. In addition to planned improvements such as multi-view fusion and sensor combination, we also aim to investigate the model’s potential in few- or zero-shot settings, which would further enhance its practical deployment value.
In conclusion, the proposed method exhibits a compelling balance between accuracy and computational efficiency for 6D pose estimation and lays a solid foundation for practical deployment in resource-constrained environments such as edge devices or robotic platforms.