LA-Net: An End-to-End Category-Level Object Attitude Estimation Network Based on Multi-Scale Feature Fusion and an Attention Mechanism
Abstract
1. Introduction
- We extend the 3DGCN network architecture into a new framework called LA-Net. Building on the 3DGCN structure, LA-Net adds a parallel linear branch that extracts and fuses features from different hierarchical levels of the network. This enhancement significantly boosts the network's expressive power and improves its ability to recognize complex patterns.
- We introduce the PSA (Pyramid Split Attention) mechanism together with a Max-Pooling layer, which jointly capture the local and global geometric information of the point cloud. This combination is especially advantageous for objects with complex shapes and improves robustness to noise.
- Our experimental results show that LA-Net effectively improves object pose accuracy. Compared with the baseline method (HS-Pose), LA-Net more reliably recovers the poses of objects with complex shapes.
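The three contributions above can be illustrated in miniature. The sketch below is our own illustration, not the paper's implementation: the function names, array shapes, and the use of plain NumPy (with random weights standing in for learned ones) are all assumptions. It shows an LS-Layer-style parallel linear branch fusing features from several hierarchical levels, a PSA-style split-and-reweight over channel groups, and a global max-pool that summarizes per-point features into a global geometric descriptor.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ls_layer_fuse(level_feats, out_dim, rng):
    """LS-Layer-style fusion (illustrative): concatenate per-point
    features from several hierarchical levels, then mix them with one
    linear map, forming a branch parallel to the graph-conv backbone."""
    stacked = np.concatenate(level_feats, axis=0)         # (sum C_i, N)
    W = rng.standard_normal((out_dim, stacked.shape[0]))  # stand-in for learned weights
    return W @ stacked                                    # (out_dim, N)

def psa_reweight(features, n_groups=4):
    """PSA-style channel attention (illustrative): split channels into
    groups, pool each group into an SE-like descriptor, let the groups
    compete via a cross-group softmax, and rescale the channels."""
    C, N = features.shape
    assert C % n_groups == 0, "channels must divide evenly into groups"
    groups = features.reshape(n_groups, C // n_groups, N)
    desc = groups.mean(axis=2)          # global average pool per channel
    weights = softmax(desc, axis=0)     # groups compete for attention
    return (groups * weights[:, :, None]).reshape(C, N)

# Toy usage: three feature levels over N = 128 points.
rng = np.random.default_rng(0)
levels = [rng.standard_normal((c, 128)) for c in (32, 64, 128)]
fused = ls_layer_fuse(levels, out_dim=64, rng=rng)  # (64, 128)
attended = psa_reweight(fused, n_groups=4)          # (64, 128)
global_feat = attended.max(axis=1)                  # (64,) global geometry via max-pool
```

The max over points makes the global descriptor invariant to point ordering, which is why max-pooling is a common choice for summarizing point-cloud features.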
2. Related Works
3. Methods
3.1. Background of 3DGCN
3.2. Overall Framework
3.3. Multi-Scale Feature Fusion (LS-Layer)
3.4. Pyramid Split Attention
3.5. Selection of Network Parameters
4. Experimental Results
4.1. Experimental Environment and Evaluation Metrics
4.2. Ablation Study
4.3. Comparison with State-of-the-Art Methods
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Kothari, N.; Gupta, M.; Vachhani, L.; Arya, H. Pose estimation for an autonomous vehicle using monocular vision. In Proceedings of the 2017 Indian Control Conference (ICC), Guwahati, India, 4–6 January 2017; pp. 424–431. [Google Scholar]
- Su, Y.; Rambach, J.; Minaskan, N.; Lesur, P.; Pagani, A.; Stricker, D. Deep multi-state object pose estimation for augmented reality assembly. In Proceedings of the 2019 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Beijing, China, 10–18 October 2019; pp. 222–227. [Google Scholar]
- Li, Y.; Wang, H.; Dang, L.M.; Nguyen, T.N.; Han, D.; Lee, A.; Jang, I.; Moon, H. A deep learning-based hybrid framework for object detection and recognition in autonomous driving. IEEE Access 2020, 8, 194228–194239. [Google Scholar] [CrossRef]
- Remus, A.; D’Avella, S.; Di Felice, F.; Tripicchio, P.; Avizzano, C.A. i2c-net: Using instance-level neural networks for monocular category-level 6D pose estimation. IEEE Robot. Autom. Lett. 2023, 8, 1515–1522. [Google Scholar] [CrossRef]
- Sahin, C.; Garcia-Hernando, G.; Sock, J.; Kim, T.-K. Instance-and category-level 6d object pose estimation. In RGB-D Image Analysis and Processing; Springer: Berlin/Heidelberg, Germany, 2019; pp. 243–265. [Google Scholar]
- Wei, F.; Sun, X.; Li, H.; Wang, J.; Lin, S. Point-set anchors for object detection, instance segmentation and pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 527–544. [Google Scholar]
- Lin, J.; Wei, Z.; Li, Z.; Xu, S.; Jia, K.; Li, Y. Dualposenet: Category-level 6d object pose and size estimation using dual pose network with refined learning of pose consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3560–3569. [Google Scholar]
- Tian, M.; Ang, M.H.; Lee, G.H. Shape prior deformation for categorical 6d object pose and size estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 530–546. [Google Scholar]
- Song, C.; Song, J.; Huang, Q. Hybridpose: 6d object pose estimation under hybrid representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 431–440. [Google Scholar]
- Wang, G.; Manhardt, F.; Tombari, F.; Ji, X. Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 16611–16621. [Google Scholar]
- Cai, D.; Heikkilä, J.; Rahtu, E. Ove6d: Object viewpoint encoding for depth-based 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6803–6813. [Google Scholar]
- Chen, H.; Wang, P.; Wang, F.; Tian, W.; Xiong, L.; Li, H. Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2781–2790. [Google Scholar]
- Wang, H.; Sridhar, S.; Huang, J.; Valentin, J.; Song, S.; Guibas, L.J. Normalized object coordinate space for category-level 6d object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2642–2651. [Google Scholar]
- Zou, L.; Huang, Z.; Gu, N.; Wang, G. MSSPA-GC: Multi-Scale Shape Prior Adaptation with 3D Graph Convolutions for Category-Level Object Pose Estimation. Neural Netw. 2023, 166, 609–621. [Google Scholar] [CrossRef] [PubMed]
- Castro, P.; Armagan, A.; Kim, T.K. Accurate 6d object pose estimation by pose conditioned mesh reconstruction. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4147–4151. [Google Scholar]
- Lin, Z.H.; Huang, S.Y.; Wang, Y.C.F. Convolution in the cloud: Learning deformable kernels in 3d graph convolution networks for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1800–1809. [Google Scholar]
- Zhang, J.; Wu, M.; Dong, H. Generative category-level object pose estimation via diffusion models. Adv. Neural Inf. Process. Syst. 2023, 36, 1–18. [Google Scholar]
- Irshad, M.Z.; Zakharov, S.; Ambrus, R.; Kollar, T.; Kira, Z.; Gaidon, A. Shapo: Implicit representations for multi-object shape, appearance, and pose optimization. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 275–292. [Google Scholar]
- Cai, D.; Heikkilä, J.; Rahtu, E. Sc6d: Symmetry-agnostic and correspondence-free 6d object pose estimation. In Proceedings of the 2022 International Conference on 3D Vision (3DV), Prague, Czech Republic, 12–15 September 2022; pp. 536–546. [Google Scholar]
- Nguyen, V.N.; Hu, Y.; Xiao, Y.; Salzmann, M.; Lepetit, V. Templates for 3d object pose estimation revisited: Generalization to new objects and robustness to occlusions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6771–6780. [Google Scholar]
- Duan, G.; Cheng, S.; Liu, Z.; Zheng, Y.; Su, Y.; Tan, J. Zero-Shot 3D Pose Estimation of Unseen Object by Two-step RGB-D Fusion. Neurocomputing 2024, 597, 128041. [Google Scholar] [CrossRef]
- Chen, W.; Jia, X.; Chang, H.J.; Duan, J.; Leonardis, A. Fs-net: Fast shape-based network for category-level 6d object pose estimation with decoupled rotation mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 1581–1590. [Google Scholar]
- Di, Y.; Zhang, R.; Lou, Z.; Manhardt, F.; Ji, X.; Navab, N.; Tombari, F. Gpv-pose: Category-level object pose estimation via geometry-guided point-wise voting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6781–6791. [Google Scholar]
- Zheng, L.; Wang, C.; Sun, Y.; Dasgupta, E.; Chen, H.; Leonardis, A.; Zhang, W.; Chang, H. Hs-pose: Hybrid scope feature extraction for category-level object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17163–17173. [Google Scholar]
- Lin, H.; Liu, Z.; Cheang, C.; Fu, Y.; Guo, G.; Xue, X. Sar-net: Shape alignment and recovery network for category-level 6d object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6707–6717. [Google Scholar]
- Zhang, R.; Di, Y.; Manhardt, F.; Tombari, F.; Ji, X. Ssp-pose: Symmetry-aware shape prior deformation for direct category-level object pose estimation. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 7452–7459. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. EPSANet: An efficient pyramid squeeze attention block on convolutional neural network. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 1161–1177. [Google Scholar]
- Lin, J.; Wei, Z.; Zhang, Y.; Jia, K. Vi-net: Boosting category-level 6d object pose estimation via learning decoupled rotations on the spherical representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 14001–14011. [Google Scholar]
- Pitteri, G.; Ramamonjisoa, M.; Ilic, S.; Lepetit, V. On object symmetries and 6d pose estimation from images. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Quebec City, QC, Canada, 16–19 September 2019; pp. 614–622. [Google Scholar]
- Tremblay, J.; To, T.; Sundaralingam, B.; Xiang, Y.; Fox, D.; Birchfield, S. Deep object pose estimation for semantic robotic grasping of household objects. arXiv 2018, arXiv:1809.10790. [Google Scholar]
- Targ, S.; Almeida, D.; Lyman, K. Resnet in resnet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar]
- Wang, C.Z.; Tong, X.; Zhu, J.H.; Gao, R. Ghost-YOLOX: A lightweight and efficient implementation of object detection model. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Quebec City, QC, Canada, 21–25 August 2022; pp. 4552–4558. [Google Scholar]
- Liu, X.J.; Nie, Z.; Yu, J.; Xie, F.; Song, R. (Eds.) Intelligent Robotics and Applications: 14th International Conference, ICIRA 2021, Yantai, China, 22–25 October 2021, Proceedings, Part III; Springer Nature: Berlin/Heidelberg, Germany, 2021. [Google Scholar]
- Zhang, R.; Di, Y.; Lou, Z.; Manhardt, F.; Tombari, F.; Ji, X. Rbp-pose: Residual bounding box projection for category-level pose estimation. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 655–672. [Google Scholar]
- Liu, L.; Jiang, H.; He, P.; Chen, P.; Lin, X.; Gao, J.; Han, J. On the variance of the adaptive learning rate and beyond. arXiv 2019, arXiv:1908.03265. [Google Scholar]
- Yong, H.; Huang, J.; Hua, X.; Zhang, L. Gradient centralization: A new optimization technique for deep neural networks. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 635–652. [Google Scholar]
- Zhang, M.; Lucas, J.; Ba, J.; Hinton, G. Lookahead optimizer: K steps forward, 1 step back. Adv. Neural Inf. Process. Syst. 2019, 32, 1–12. [Google Scholar]
| Row | Method | IoU25 | IoU50 | IoU75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm | 2 cm | 5° |
|---|---|---|---|---|---|---|---|---|---|---|
| A0 | HS-Pose [24] | 84.2 | 82.1 | 74.7 | 46.5 | 55.2 | 68.6 | 82.7 | 78.2 | 58.2 |
| B0 | A0 + PSA | 84.3 | 82.5 | 76.9 | 47.9 | 57.5 | 70.9 | 85.5 | 79.5 | 61.2 |
| C0 | A0 + LS-Layer | 84.3 | 83.2 | 78.4 | 49.3 | 58.7 | 71.6 | 86.3 | 80.5 | 62.4 |
| D0 | A0 + PSA + LS-Layer + Max-Pooling | 84.5 | 83.6 | 79.1 | 49.5 | 59.8 | 71.9 | 87.0 | 79.5 | 62.6 |
| Method | IoU25 | IoU50 | IoU75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm | 10°10 cm |
|---|---|---|---|---|---|---|---|---|
| DualPoseNet [7] | - | 79.8 | 62.2 | 29.3 | 35.9 | 50.0 | 66.8 | - |
| FS-Net [22] | 84.0 | 81.1 | 63.5 | 19.9 | 33.9 | - | 69.1 | 71.0 |
| GPV-Pose [23] | 84.1 | 83.0 | 64.4 | 32.0 | 42.9 | 55.0 | 73.3 | 74.6 |
| HS-Pose [24] | 84.2 | 82.1 | 74.7 | 46.5 | 55.2 | 68.6 | 82.7 | 83.7 |
| SAR-Net [25] | - | 79.3 | 62.4 | 31.6 | 42.3 | 50.3 | 68.3 | - |
| SSP-Pose [26] | 84.0 | 82.3 | 66.3 | 34.7 | 44.6 | - | 77.8 | 79.7 |
| SPD [8] | 83.4 | 77.3 | 53.2 | 19.3 | 21.4 | 43.2 | 54.1 | - |
| RBP-Pose [35] | - | - | 67.8 | 38.2 | 48.1 | 63.1 | 79.2 | - |
| Ours | 84.5 | 83.6 | 79.1 | 49.5 | 59.8 | 71.9 | 87.0 | 88.0 |
| Method | IoU50 | IoU75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm |
|---|---|---|---|---|---|---|
| DualPoseNet [7] | 92.4 | 86.4 | 64.7 | 70.7 | 77.2 | 84.7 |
| GPV-Pose [23] | 93.4 | 88.3 | 72.1 | 79.1 | - | 89.0 |
| HS-Pose [24] | 93.3 | 89.4 | 73.3 | 80.5 | 80.4 | 89.4 |
| SAR-Net [25] | 86.8 | 79.0 | 66.7 | 70.9 | 75.3 | 80.3 |
| SSP-Pose [26] | - | 86.8 | 64.7 | 75.5 | - | 87.4 |
| SPD [8] | 93.2 | 83.1 | 54.3 | 59.0 | 73.3 | 81.5 |
| RBP-Pose [35] | 93.1 | 89.0 | 73.5 | 79.6 | 82.1 | 89.5 |
| Ours | 95.2 | 92.9 | 76.3 | 82.6 | 83.5 | 91.3 |
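The n° m cm columns in the tables above count a prediction as correct when the rotation error is within n degrees and the translation error within m centimeters (the IoU columns instead threshold 3D bounding-box overlap). A minimal sketch of that pose criterion follows; the function names are hypothetical, and translations are assumed to be in meters.

```python
import numpy as np

def pose_errors(R_pred, t_pred, R_gt, t_gt):
    """Rotation error in degrees via the trace formula for the angle
    of the relative rotation; translation error as Euclidean distance
    (same units as the inputs)."""
    cos_theta = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    rot_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
    return rot_deg, np.linalg.norm(t_pred - t_gt)

def within_threshold(R_pred, t_pred, R_gt, t_gt, max_deg, max_cm):
    """True when both errors fall inside the n-degree / m-cm bounds."""
    rot_deg, trans_m = pose_errors(R_pred, t_pred, R_gt, t_gt)
    return rot_deg <= max_deg and trans_m * 100.0 <= max_cm

# A 4-degree rotation about z with a 1 cm translation error passes 5° 2 cm.
theta = np.radians(4.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
ok = within_threshold(Rz, np.array([0.01, 0.0, 0.0]),
                      np.eye(3), np.zeros(3), max_deg=5.0, max_cm=2.0)
```

Reported accuracy is then the fraction of test instances for which this check succeeds; symmetric categories typically need a symmetry-aware variant of the rotation error.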
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, J.; Liu, G.; Guo, C.; Ma, Q.; Song, W. LA-Net: An End-to-End Category-Level Object Attitude Estimation Network Based on Multi-Scale Feature Fusion and an Attention Mechanism. Electronics 2024, 13, 2809. https://doi.org/10.3390/electronics13142809