Technical Note

Group-in-Group Relation-Based Transformer for 3D Point Cloud Learning

Shaolei Liu, Kexue Fu, Manning Wang and Zhijian Song
1 Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Shanghai 200030, China
2 Digital Medical Research Center, School of Basic Medical Science, Fudan University, Shanghai 200032, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2022, 14(7), 1563; https://doi.org/10.3390/rs14071563
Submission received: 18 February 2022 / Revised: 15 March 2022 / Accepted: 21 March 2022 / Published: 24 March 2022

Abstract

Deep point cloud neural networks have achieved promising performance in remote sensing applications, yet the success of Transformer in natural language processing and computer vision stands in stark contrast to its limited exploration in point-based methods. In this paper, we propose an effective Transformer-based network for point cloud learning. To better learn global and local information, we propose a group-in-group relation-based Transformer architecture that learns the relationships between point groups to model global information and the relationships between points within each group to model local semantic information. To further enhance the local feature representation, we propose a Radius Feature Abstraction (RFA) module that extracts radius-based density features characterizing the sparsity of local point clouds. Extensive evaluation on public benchmark datasets demonstrates the effectiveness and competitive performance of our proposed method on point cloud classification and part segmentation.

1. Introduction

Point cloud processing tasks have attracted much attention due to their wide application in remote sensing [1,2], autonomous driving [3], robotics [4], and augmented reality [5]. Unlike 2D images, point clouds are unordered, irregular, and sparse. Therefore, designing an effective approach for point cloud learning is an important yet challenging task.
Currently, with the continuous development of deep learning, much work has attempted to tackle this challenge [6,7,8], and existing methods can be divided into three categories. The first category builds on the success of convolutional neural networks on regular grids. Some of these methods transform unordered point clouds into regular voxels [9] or multi-view images [10] and then extract features by sparse 3D and 2D convolutions, respectively. Although they achieve promising results, they inevitably lose important geometric information during the transformation, and their high computational cost and memory consumption further hinder performance improvement. The second category processes the raw point cloud data directly, thus avoiding the information loss caused by transforming the point cloud into a regular data structure. The pioneering work PointNet [11] learns point-based global information by a shared multi-layer perceptron (MLP) and global aggregation. Some of its successors [12] use local aggregation mechanisms and hierarchical structures to further improve network performance, and others [13,14,15] learn deeper semantic structure information by defining new convolution-like operations or constructing graph structures [16,17]. However, due to the irregularity of point clouds, point-based methods must satisfy permutation invariance and therefore rely on symmetric functions (e.g., pooling), which limits their feature representation capability to some extent [18]; moreover, they do not capture the local relations among points sufficiently. The third category comprises hybrid-data based methods, which attempt to combine voxel-based and point-based approaches. For example, two studies [19,20] integrate voxel features and PointNet features at the scene level. However, because these two point cloud representations are fundamentally different, the current methods in this category are not very effective in extracting point cloud features.
In order to develop a more efficient point cloud learning network, PCT [21] recently introduced Transformer to point cloud processing tasks. Transformer has been extremely successful in natural language processing [22,23] and computer vision [24]; it is not affected by the order of the input sequence and thus satisfies the permutation invariance requirement of point clouds. PCT focuses on using offset-attention layers to extract global features. Although PCT utilizes the neighbor feature embedding of EdgeConv [17], it still cannot effectively learn the relationships within local point clouds to model local geometric features, so its extraction of local semantic features is insufficient.
Inspired by transformer-in-transformer (TNT) [25], we propose a point cloud feature extraction network that efficiently learns both global and local information through a group-in-group relation-based Transformer architecture (GiG). Specifically, we divide a point cloud into a series of groups, use an inner-group Transformer to extract local features within each group, and use a cross-group Transformer to model interactions among groups and extract global features. To further enhance local shape information, we propose a Radius Feature Abstraction (RFA) module to extract radius-based density features that represent the sparsity of the local point cloud.
The main contributions of this paper are as follows:
We propose a new Transformer-based point cloud learning architecture GiG. By dividing a point cloud into a set of small groups, GiG can not only learn the relationship between groups to model global information but also model local semantic information by learning the relationship between points within each group. Therefore, it is possible to extract effective object-related global information as well as shape-related local information.
We propose an RFA module to enhance the representation of local semantic information by extracting radius-based density features that characterize the sparsity of local point clouds.
Extensive experiments demonstrate that our method achieves new state-of-the-art performance in object classification and part segmentation.
Our paper is structured as follows. In Section 2, we briefly review related work. In Section 3, the proposed method is introduced in detail. Section 4 introduces the experimental results and analysis. In the last section, we give the conclusions of our paper.

2. Related Work

2.1. Deep Learning on Point Clouds

2.1.1. Multi-View Based Methods

The multi-view based methods [10,26,27] mainly project an original 3D point cloud to 2D images from different viewing angles and then extract view-wise features using a 2D convolutional neural network. The pioneering MVCNN [5] aggregates the multi-view features into a global descriptor by max-pooling; how to integrate the features from different views into discriminative global features remains the main challenge. Despite the good results of these methods, the projection process inevitably loses shape information, and high performance often requires a large number of views, which is time-consuming.

2.1.2. Voxel-Based Methods

The voxel-based methods [9,28,29] voxelize irregular point clouds into a regular volumetric grid structure, which can then be directly used for feature extraction using 3D convolutional neural networks. Although there are encouraging results, these methods tend to suffer from low resolution due to computational cost, and thus lose geometric structural information.

2.1.3. Point-Based Methods

Point-based methods process the raw point cloud data directly. According to the network architecture used to extract the features, point-based methods can be further classified into point-wise MLP methods, convolution-based methods, and graph-based methods.
Point-wise MLP Methods: As a pioneering work, PointNet [11] processes each point individually by a shared MLP and then aggregates global features by a symmetric function. Its follow-up work further aggregates local features to extract local geometric information [12,30,31].
Convolution-based Methods: Convolution-based methods [13,14,15,32,33,34,35,36] extract local features of point clouds by a convolution-like strategy. Due to the irregularity of point clouds, the current convolution-based methods can be further classified into continuous convolution methods [14,15,32,33] and discrete convolution methods [13]. Continuous methods define the convolution kernel in continuous space, and the weights of its nearest neighbors are related to the spatial distribution of the corresponding centroids, whereas discrete methods define the convolution kernel in regular grids, and the weights of its nearest neighbors are related to the offsets of the corresponding centroids.
Graph-based Methods: Graph-based methods [16,17,37,38,39] model the original point clouds as graphs and then apply graph-based learning methods. For example, FeastNet [37] and DGCNN [16] construct different graph convolution operators that aggregate neighboring features to model geometric information.

2.1.4. Hybrid-Data Based Methods

The hybrid-data based methods [29,40,41] are built on different point cloud data structures (e.g., octree and kd-tree) or integrate voxel features and PointNet features at the scene level [19,20]. However, because these two point cloud representations are fundamentally different, the effectiveness of these methods in extracting point cloud features is limited.

2.2. Transformer in Point Clouds

Transformer has achieved promising results in the fields of natural language processing [22,23] and computer vision [24]. Compared with convolution and MLP-like operations, Transformer can model effective long-range dependencies while directly processing irregular point clouds. However, there are relatively few studies on Transformer-based point cloud processing. PCT [21] mainly uses offset-attention layers to extract global features and uses the neighbor feature embedding of EdgeConv [17] to assist in extracting local features. However, PCT does not make full use of Transformer to extract local features and cannot model local geometric features effectively.

3. Methods

3.1. Preliminaries

To make our method easier to follow, we first describe the basic components of the Transformer [42], including scaled dot-product self-attention (SA), multi-head self-attention (MSA), the feed-forward network (FFN), and layer normalization (LN).
SA: Given the input $X \in \mathbb{R}^{N \times D}$, the SA module first linearly maps it into three matrices: queries $Q \in \mathbb{R}^{N \times D_k}$, keys $K \in \mathbb{R}^{N \times D_k}$, and values $V \in \mathbb{R}^{N \times D_V}$, where $N$ is the sequence length and $D$, $D_k$, $D_V$ are the dimensions of the input, queries (keys), and values, respectively. Each element in the sequence is treated as a query and all elements are treated as keys. We compute the similarity score between each query and each key, normalize the scores, and multiply them by the corresponding values to obtain a weighted sum, i.e., the attention output, which is defined as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{D_k}}\right) V$
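A minimal sketch of the scaled dot-product attention defined above, assuming a PyTorch implementation; the function name and tensor shapes are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (N, D_k); V: (N, D_v). Returns the attention output of shape (N, D_v)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # pairwise query-key similarity, scaled by sqrt(D_k)
    weights = F.softmax(scores, dim=-1)             # normalize the scores over the keys
    return weights @ V                              # weighted sum of the values
```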
MSA: A single self-attention operation aggregates all information at the same level. The study in [42] therefore stacks multiple self-attention modules (i.e., multi-head self-attention) to learn a stronger discriminative representation. Specifically, MSA applies $h$ different linear mappings to $Q$, $K$, $V$ in parallel and concatenates the outputs of the heads, which is defined as follows:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}$
$\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)$
where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learnable matrices that perform the linear mappings of the queries, keys, values, and final output, respectively.
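For concreteness, a hedged PyTorch sketch of MSA with $h$ parallel heads, a fused QKV projection, and the output mapping $W^{O}$; the class name and the fused-projection layout are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # fused projections for Q, K, V
        self.proj = nn.Linear(dim, dim)      # output mapping W^O

    def forward(self, x):                    # x: (B, N, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (B, num_heads, N, head_dim)
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)   # concatenate the heads
        return self.proj(out)
```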
FFN: The FFN module (achieved by several MLPs) is used between MSA layers for feature mapping:
$\mathrm{FFN}(x) = \sigma\left(x W_1 + b_1\right) W_2 + b_2$
where $W_1$, $W_2$ and $b_1$, $b_2$ are the weights and biases of the MLPs, and $\sigma(\cdot)$ is an activation function such as the Gaussian error linear unit (GELU) [43].
LN: Layer normalization is applied in Transformer to effectively stabilize the training process and reduce the training time [44]. For each input $x \in \mathbb{R}^{D}$, LN is performed as follows:
$\hat{x} = \frac{x - \mu_L}{\sqrt{\sigma_L^{2} + \epsilon}}$
$y = \gamma \hat{x} + \beta \equiv \mathrm{LN}_{\gamma, \beta}(x)$
where $\mu_L$ and $\sigma_L$ are the mean and standard deviation of the input, respectively, and $\gamma$ and $\beta$ are two learnable transform parameters.
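Putting these components together, the following sketch shows a pre-norm Transformer block of the form x = x + MSA(LN(x)), x = x + FFN(LN(x)), which is the pattern reused by the inner-group and cross-group blocks in Section 3.4. It relies on PyTorch's built-in nn.MultiheadAttention; the hidden-width ratio of the FFN is an assumption.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                 # FFN(x) = GELU(x W1 + b1) W2 + b2
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                         # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual + MSA(LN(x))
        x = x + self.ffn(self.norm2(x))                      # residual + FFN(LN(x))
        return x
```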

3.2. Framework Overview

The overall framework is shown in Figure 1. In this paper, we propose GiG to efficiently learn global and local information for point cloud feature extraction. The network focuses not only on object-related global information through the cross-group Transformer but also on shape-related local semantic information through the inner-group Transformer. In addition, we propose the RFA module to enhance the local semantic information with radius-based density features.
Formally, we are given an input point cloud $p \in \mathbb{R}^{N \times 3}$ with $N$ points. First, a global feature extraction module (E) is used to extract point features $f_p \in \mathbb{R}^{N \times d}$, and the original point cloud $p$ and the features $f_p$ are input to the Group Abstraction (GA) module. GA samples $M_1$ representative points from the original point cloud $p$ and constructs a local neighboring point group around each of them, so that we obtain $M_1$ groups denoted as $G_1 \in \mathbb{R}^{M_1 \times K \times 3}$, where $K$ ($K = 32$) is the number of points in each group. We use $f_{G_1} \in \mathbb{R}^{M_1 \times K \times d}$ to denote the features of the point groups. The radius-related density features $f_{R_1} \in \mathbb{R}^{M_1 \times K \times d_1}$ are extracted from $G_1$ by the RFA module. The features $f_{G_1}$ are then aggregated by the local feature aggregation module (LA, consisting of cascaded MLPs and MaxPooling) and concatenated with $f_{R_1}$ to obtain the local group features $\psi_{G_1} \in \mathbb{R}^{M_1 \times d_G}$. Subsequently, all groups $G_1$ and group features $\psi_{G_1}$ are input to another GA module and sampled into $M_2$ groups $G_2 \in \mathbb{R}^{M_2 \times K \times 3}$. Similarly, local features are aggregated to obtain group features $\psi_{G_2} \in \mathbb{R}^{M_2 \times d_G}$, and $f_{G_2}$ is reshaped to obtain inner-group point features $\varphi_{G_2} \in \mathbb{R}^{M_2 \times K \times d_p}$. Finally, the group features $\psi_{G_2}$ and the inner-group point features $\varphi_{G_2}$ are input to the GiG module, which extracts cross-group and inner-group features to learn global information and local semantic information, respectively, and the point cloud feature $F$ is obtained by MaxPooling. The GA, RFA, and GiG modules are introduced in detail in the following subsections.
Classification: After obtaining the global feature F, in order to classify the point cloud p into N c classes, we input F to a classification head, which consists of two cascaded classification layers. Each classification layer consists of a linear layer, a BatchNorm, a LeakyReLU layer, and a Dropout layer with probability 0.5. Finally, the final classification result is predicted by a linear layer.
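A minimal sketch of the classification head as described: two (Linear, BatchNorm, LeakyReLU, Dropout(0.5)) layers followed by a final linear predictor. The hidden widths and the LeakyReLU slope are assumptions, not values reported in the paper.

```python
import torch.nn as nn

def classification_head(feat_dim, num_classes, hidden=(512, 256)):
    return nn.Sequential(
        nn.Linear(feat_dim, hidden[0]), nn.BatchNorm1d(hidden[0]), nn.LeakyReLU(0.2), nn.Dropout(0.5),
        nn.Linear(hidden[0], hidden[1]), nn.BatchNorm1d(hidden[1]), nn.LeakyReLU(0.2), nn.Dropout(0.5),
        nn.Linear(hidden[1], num_classes),         # final linear layer predicts the N_c class scores
    )
```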
Segmentation: The segmentation network differs slightly from the classification network because the segmentation task requires higher resolution. Therefore, we do not perform a downsampling operation before the GiG module. We adopt DGCNN as the backbone of the segmentation network and introduce the GiG module after its second graph feature extraction layer to model local and global information, which effectively improves segmentation performance and demonstrates the effectiveness and practicality of the proposed GiG module.

3.3. Group Abstraction (GA) and Radius-Based Feature Abstraction (RFA)

First, given a point cloud $p \in \mathbb{R}^{N \times 3}$, we group it by the GA module and perform linear projection. Specifically, we obtain $M$ representative points $p_1, p_2, \ldots, p_M$ from the original point cloud $p$ by farthest point sampling (FPS) [45], and then for each representative point we collect its $K$ nearest neighbors to form a group, obtaining $M$ groups $G \in \mathbb{R}^{M \times K \times 3}$; the features corresponding to all points in the groups are retained. The features of all points in each group are denoted as $f_{p_i}$ ($i = 1, 2, \ldots, M$) and are augmented by a local feature enhancement module (LE, consisting of cascaded MLPs) that linearly maps the local features to obtain $f_{G_i} \in \mathbb{R}^{K \times d}$ ($f_{G_i} \in f_G$, $f_G \in \mathbb{R}^{M \times K \times d}$). After two GAs, we reshape the features $f_G$ to obtain the inner-group features $\varphi_G \in \mathbb{R}^{M \times K \times d_p}$.
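To illustrate the grouping step, the following is a plain-PyTorch sketch of FPS followed by K-nearest-neighbor grouping, under the definitions above; it is an illustrative reference implementation, not the authors' (optimized) code.

```python
import torch

def farthest_point_sampling(xyz, M):
    """xyz: (B, N, 3) point coordinates -> indices of M sampled centroids, shape (B, M)."""
    B, N, _ = xyz.shape
    idx = torch.zeros(B, M, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.zeros(B, dtype=torch.long, device=xyz.device)   # start from point 0 (could be random)
    batch = torch.arange(B, device=xyz.device)
    for i in range(M):
        idx[:, i] = farthest
        centroid = xyz[batch, farthest].unsqueeze(1)                 # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))  # distance to the nearest chosen centroid
        farthest = dist.argmax(-1)                                   # next point is the one farthest from all centroids
    return idx

def knn_group(xyz, centroids, K):
    """Return the indices (B, M, K) of the K nearest neighbors of each centroid."""
    d = torch.cdist(centroids, xyz)                                  # (B, M, N) pairwise distances
    return d.topk(K, largest=False).indices
```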
For each group, we apply the RFA module to enhance local semantic information: the farthest distance between the centroid and the other inner-group points is mapped to a feature representation $f_{R_i} \in \mathbb{R}^{K \times d}$ ($f_{R_i} \in f_R$, $f_R \in \mathbb{R}^{M \times K \times d}$) that indicates whether the group is sparse or dense. The features $f_G$ are then aggregated by the LA module and concatenated with $f_R$ to obtain the local group features $\psi_G \in \mathbb{R}^{M \times d_G}$.
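A hedged sketch of the radius cue used by RFA: the distance from each group centroid to its farthest in-group neighbor, which is large for sparse groups and small for dense ones. The subsequent mapping of this scalar to a feature vector is assumed here to be a small MLP with illustrative widths.

```python
import torch
import torch.nn as nn

def group_radius(group_xyz, centroid_xyz):
    """group_xyz: (B, M, K, 3); centroid_xyz: (B, M, 3) -> per-group radius (B, M, 1)."""
    d = torch.linalg.norm(group_xyz - centroid_xyz.unsqueeze(2), dim=-1)  # (B, M, K) distances to the centroid
    return d.max(dim=-1, keepdim=True).values                             # farthest in-group distance

# Hypothetical mapping of the radius to a density feature (layer widths are assumptions).
radius_mlp = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 64))
```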

3.4. GiG (Group-in-Group Relation-Based Transformer)

The GiG module is described in detail below, and its architecture is shown in Figure 1. The encoded inner-group point features $\varphi$ and group features $\psi$ are input to the GiG module. To enhance the inner-group feature representation, we use an inner-group Transformer block to learn the relationships between the points $p_{i,j}$ in each group, where $p_{i,j}$ denotes the $j$th point in the $i$th group. The inner-group Transformer block is defined as follows:
$\varphi_l^{\prime i} = \varphi_{l-1}^{i} + \mathrm{MSA}\left(\mathrm{LN}\left(\varphi_{l-1}^{i}\right)\right)$
$\varphi_l^{i} = \varphi_l^{\prime i} + \mathrm{FFN}\left(\mathrm{LN}\left(\varphi_l^{\prime i}\right)\right)$
where $l = 1, 2, \ldots, L$ indexes the blocks, and $L$ is the number of Transformer blocks.
For the extraction of cross-group features, we use another standard Transformer to learn the relationships between different groups. In each layer, the inner-group point features are flattened, linearly projected, and added to the corresponding group features, which is defined as follows:
$\psi_{l-1}^{i} = \psi_{l-1}^{i} + \mathrm{Vec}\left(\varphi_{l-1}^{i}\right) W_{l-1} + b_{l-1}$
where $\mathrm{Vec}(\cdot)$ denotes flattening the input into a one-dimensional vector, and $W_{l-1}$ and $b_{l-1}$ are the weights and bias, respectively. The cross-group Transformer block is defined as follows:
$\psi_l^{\prime i} = \psi_{l-1}^{i} + \mathrm{MSA}\left(\mathrm{LN}\left(\psi_{l-1}^{i}\right)\right)$
$\psi_l^{i} = \psi_l^{\prime i} + \mathrm{FFN}\left(\mathrm{LN}\left(\psi_l^{\prime i}\right)\right)$
The inner-group Transformer block models the relationships between the points within a group to extract shape-related local semantic features, while the cross-group Transformer block models the relationships between groups to extract object-related global features. By stacking $L$ GiG blocks, we build the GiG module:
$\varphi_l, \psi_l = \mathrm{GiG}\left(\varphi_{l-1}, \psi_{l-1}\right)$
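The sketch below assembles one GiG block from two standard pre-norm Transformer blocks, following the equations above: the inner-group block updates the point features of each group, the flattened group content is projected and added to the group feature, and the cross-group block then models relations between groups. The exact ordering of the injection relative to the inner update, and the use of nn.TransformerEncoderLayer, are assumptions for illustration rather than the authors' implementation.

```python
import torch.nn as nn

class GiGBlock(nn.Module):
    """One GiG block: inner-group Transformer, Vec(.)-projection into the group feature, cross-group Transformer."""
    def __init__(self, point_dim, group_dim, K, num_heads=4):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(point_dim, num_heads, batch_first=True, norm_first=True)
        self.proj = nn.Linear(K * point_dim, group_dim)      # Vec(phi) W + b
        self.cross = nn.TransformerEncoderLayer(group_dim, num_heads, batch_first=True, norm_first=True)

    def forward(self, phi, psi):
        # phi: (B * M, K, point_dim) inner-group point features; psi: (B, M, group_dim) group features
        B, M, _ = psi.shape
        phi = self.inner(phi)                                # relations among the K points of each group
        psi = psi + self.proj(phi.reshape(B, M, -1))         # inject flattened local content into group features
        psi = self.cross(psi)                                # relations among the M groups
        return phi, psi
```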

4. Results

The comparison results are detailed as follows. In Section 4.1, we compare our proposed method with existing methods on the classification task on the ModelNet40 dataset. In Section 4.2, we further compare our method on the part segmentation task on the ShapeNet dataset.

4.1. Classification on ModelNet40 Dataset

Dataset: ModelNet40 is currently the most widely used dataset for point cloud classification. It includes 12,311 CAD models from 40 categories. For a fair comparison, we used the official data split, with 9843 point clouds for training and 2468 for testing. The point clouds were uniformly downsampled to 1024 points, and no data augmentation was used during either training or testing. The training batch size and number of epochs were set to 16 and 250, respectively; the initial learning rate was set to 0.01 following DGCNN [16] and decayed with a cosine annealing schedule at each epoch. In addition, because the point cloud data already contain location information, no positional encoding is required, and no class token is used.
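A minimal sketch of the reported training schedule (batch size 16, 250 epochs, initial learning rate 0.01, cosine annealing per epoch). The optimizer (SGD with momentum, as in DGCNN), the minimum learning rate, the stand-in model, and the empty loop body are placeholders and assumptions, not reported settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 40)          # stand-in for the classification network; illustrative only
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=250, eta_min=0.001)

for epoch in range(250):
    # ... one pass over the ModelNet40 training split with batch size 16 would go here ...
    scheduler.step()                 # cosine-annealed learning-rate decay applied once per epoch
```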
Performance Comparison: The classification results are shown in Table 1, where our method achieves the best accuracy among the compared state-of-the-art methods. We also conducted an ablation study on the proposed RFA module: even without RFA, our method reaches an accuracy comparable to PCT (93.4% vs. 93.2%), and adding the RFA module raises the classification accuracy to 93.9%. Compared with PointNet, the accuracy of our method is 4.7% higher.
To gain an intuitive understanding of the classification performance of our model, we visualize the learned features on the ModelNet40 test set in Figure 2. For better visualization, we randomly select 10 classes and map their high-level features to 2D space using t-SNE [46]. Figure 2 shows that features from different categories are separated fairly well, reflecting the strong discriminative power of the representation learned by our proposed method.
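The t-SNE visualization can be reproduced along these lines: project the high-level test-set features of 10 randomly chosen classes to 2D with scikit-learn. The feature and label files are hypothetical placeholders for features exported from the trained network.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

feats = np.load("modelnet40_test_features.npy")   # (num_samples, feat_dim); hypothetical export
labels = np.load("modelnet40_test_labels.npy")    # (num_samples,)
keep = np.isin(labels, np.random.choice(40, size=10, replace=False))   # 10 random classes

emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats[keep])
plt.scatter(emb[:, 0], emb[:, 1], c=labels[keep], cmap="tab10", s=5)
plt.show()
```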
Table 1. Comparison of classification accuracy on ModelNet40.

Method                  Input              #Points   Accuracy (%)
MVCNN [5]               multi-view         –         90.1
OctNet [29]             octree             –         86.5
PointwiseCNN [34]       points             1K        86.1
PointNet [11]           points             1K        89.2
PointNet++ [12]         points + normal    5K        91.9
SpecGCN [38]            points + normal    2K        92.1
PCNN [33]               points             1K        92.3
SpiderCNN [47]          points + normal    1K        92.4
DGCNN [16]              points             1K        92.9
PointCNN [13]           points             1K        92.5
PointWeb [30]           points + normal    1K        92.3
PointConv [14]          points + normal    1K        92.5
RS-CNN [32] w/o vot.    points             1K        92.4
KPConv [15]             points             1K        92.9
3D-GCN [35]             points             1K        92.1
FPConv [36]             points             1K        92.5
PCT [21]                points             1K        93.2
Ours w/o RFA            points             1K        93.4
Ours                    points             1K        93.9

4.2. Part Segmentation on ShapeNet Dataset

Dataset: Part segmentation of point clouds is more challenging than point cloud classification. We perform experiments on the commonly used ShapeNet dataset [45], which consists of 16,880 3D models from 16 classes annotated with 50 parts, where each model contains 2–5 parts. We use the DGCNN data split, with 14,006 models for training and 2874 for testing. The training batch size and number of epochs were set to 16 and 200, respectively, and the remaining experimental settings are the same as in the classification task.
Evaluation Metric: We use the intersection-over-union (IoU) to quantitatively evaluate the segmentation results of our proposed method and to compare with other existing methods. Following the protocol of PointNet [11], the IoU of each category is defined as the average of the IoUs of all shapes belonging to that category, and the overall mean IoU (mIoU) is calculated as the average of the IoUs over all shape instances.
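A simplified sketch of this evaluation protocol under the stated definitions: the IoU of one shape is averaged over the parts of its category (with an absent part counted as IoU 1, a common convention), the category IoU averages over the shapes of that category, and the instance mIoU averages over all shapes.

```python
import numpy as np

def shape_iou(pred, gt, part_ids):
    """pred, gt: (num_points,) part labels of one shape; part_ids: the parts of its category."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)   # part absent in both pred and gt -> IoU 1
    return float(np.mean(ious))

def mean_ious(per_shape_ious, shape_categories):
    instance_miou = float(np.mean(per_shape_ious))          # average over all shape instances
    category_iou = {c: float(np.mean([iou for iou, cat in zip(per_shape_ious, shape_categories) if cat == c]))
                    for c in set(shape_categories)}         # average over the shapes of each category
    return instance_miou, category_iou
```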
Performance Comparison: The part segmentation results of the proposed method and the comparison methods are shown in Table 2, and the segmentation results of several models are visualized in Figure 3. Compared with the other methods, our method achieves the highest mean IoU. In addition, inserting the proposed GiG module improves the mIoU of DGCNN from 85.2% to 86.6%, with improvements in 13 of 16 classes. PointCNN achieves the best IoU in 5 of 16 classes, including "bag", "earphone", "laptop", "knife", and "skateboard". These categories are small in shape and relatively simple in structure, which suggests that PointCNN, built on CNNs, learns local information well but fails to capture long-range global information. In contrast, our Transformer-based method performs better on complicated and relatively large shapes, such as "plane" and "motor". As shown in Figure 3, the segmentation results demonstrate the robustness of our method to diverse shapes.

5. Conclusions

In this paper, we propose a new Transformer-based point cloud learning network that can efficiently extract both local and global information from point clouds. The proposed GiG module extracts object-related global information and shape-related local information by modeling the relationships between groups and between the points within each group. In addition, to further enhance the characterization of local semantic information, the RFA module is proposed to represent the sparsity of local regions of point clouds with radius-based density features. The proposed method achieves the best results in both the classification and part segmentation experiments. In the future, the proposed method has potential for other point cloud research topics, such as point cloud generation and completion, and more effective architectures could be designed to extract the global and local information of point clouds.

Author Contributions

Conceptualization, S.L. and K.F.; methodology, S.L.; software, S.L. and K.F.; writing—original draft preparation, S.L.; writing—review and editing, S.L. and M.W.; supervision, M.W. and Z.S.; funding acquisition, Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Natural Science Foundation of China under Grant 62076070.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this paper, ModelNet40 and ShapeNet, are publicly available.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wan, J.; Xie, Z.; Xu, Y.; Zeng, Z.; Yuan, D.; Qiu, Q. DGANet A Dilated Graph Attention-Based Network for Local Feature Extraction on 3D Point Clouds. Remote Sens. 2021, 13, 3484. [Google Scholar] [CrossRef]
  2. Wu, W.; Xie, Z.; Xu, Y.; Zeng, Z.; Wan, J. Point Projection Network: A Multi-View-Based Point Completion Network with Encoder-Decoder Architecture. Remote Sens. 2021, 13, 4917. [Google Scholar] [CrossRef]
  3. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  4. Nezhadarya, E.; Taghavi, E.; Razani, R.; Liu, B.; Luo, J. Adaptive hierarchical down-sampling for point cloud classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 12956–12964. [Google Scholar]
  5. Park, Y.; Lepetit, V.; Woo, W. Multiple 3d object tracking for augmented reality. In Proceedings of the 2008 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, Cambridge, UK, 15–18 September 2008; pp. 117–120. [Google Scholar]
  6. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep learning for 3d point clouds: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4338–4364. [Google Scholar] [CrossRef] [PubMed]
  7. Liu, X.; Han, Z.; Liu, Y.S.; Zwicker, M. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8778–8785. [Google Scholar]
  8. Duan, Y.; Zheng, Y.; Lu, J.; Zhou, J.; Tian, Q. Structural relational reasoning of point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 949–958. [Google Scholar]
  9. Maturana, D.; Scherer, S. Voxnet: A 3d convolutional neural network for real-time object recognition. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems, Hamburg, Germany, 28 September–2 October 2015; pp. 922–928. [Google Scholar]
  10. Su, H.; Maji, S.; Kalogerakis, E.; Learned-Miller, E. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 945–953. [Google Scholar]
  11. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  12. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv 2017, arXiv:1706.02413. [Google Scholar]
  13. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. Adv. Neural Inf. Process. Syst. 2018, 31, 820–830. [Google Scholar]
  14. Wu, W.; Qi, Z.; Fuxin, L. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9621–9630. [Google Scholar]
  15. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  16. Phan, A.V.; Le Nguyen, M.; Nguyen, Y.L.H.; Bui, L.T. Dgcnn: A convolutional neural network over large-scale labeled graphs. Neural Netw. 2018, 108, 533–543. [Google Scholar] [CrossRef] [PubMed]
  17. Jiang, X.; Ma, X. Dynamic graph CNN with attention module for 3D hand pose estimation. In Proceedings of the International Symposium on Neural Networks, Moscow, Russia, 10–12 July 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 87–96. [Google Scholar]
  18. Pan, X.; Xia, Z.; Song, S.; Li, L.E.; Huang, G. 3d object detection with pointformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7463–7472. [Google Scholar]
  19. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10529–10538. [Google Scholar]
  20. Ye, M.; Xu, S.; Cao, T. Hvnet: Hybrid voxel network for lidar based 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1631–1640. [Google Scholar]
  21. Guo, M.H.; Cai, J.X.; Liu, Z.N.; Mu, T.J.; Martin, R.R.; Hu, S.M. PCT: Point cloud transformer. Comput. Vis. Media 2021, 7, 187–199. [Google Scholar] [CrossRef]
  22. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  23. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  25. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. arXiv 2021, arXiv:2103.00112. [Google Scholar]
  26. Feng, Y.; Zhang, Z.; Zhao, X.; Ji, R.; Gao, Y. GVCNN: Group-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 264–272. [Google Scholar]
  27. Guo, H.; Wang, J.; Gao, Y.; Li, J.; Lu, H. Multi-view 3D object retrieval with deep embedding network. IEEE Trans. Image Process. 2016, 25, 5526–5537. [Google Scholar] [CrossRef] [PubMed]
  28. Gadelha, M.; Wang, R.; Maji, S. Multiresolution tree networks for 3d point cloud processing. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 103–118. [Google Scholar]
  29. Riegler, G.; Osman Ulusoy, A.; Geiger, A. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3577–3586. [Google Scholar]
  30. Jiang, L.; Zhao, H.; Liu, S.; Shen, X.; Fu, C.W.; Jia, J. Hierarchical point-edge interaction network for point cloud semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 10433–10441. [Google Scholar]
  31. Yan, X.; Zheng, C.; Li, Z.; Wang, S.; Cui, S. Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5589–5598. [Google Scholar]
  32. Liu, Y.; Fan, B.; Xiang, S.; Pan, C. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8895–8904. [Google Scholar]
  33. Atzmon, M.; Maron, H.; Lipman, Y. Point convolutional neural networks by extension operators. ACM Trans. Graph. 2018, 37, 1–12. [Google Scholar] [CrossRef] [Green Version]
  34. Hua, B.S.; Tran, M.K.; Yeung, S.K. Pointwise convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 984–993. [Google Scholar]
  35. Lin, Z.H.; Huang, S.Y.; Wang, Y.C.F. Convolution in the cloud: Learning deformable kernels in 3d graph convolution networks for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1800–1809. [Google Scholar]
  36. Lin, Y.; Yan, Z.; Huang, H.; Du, D.; Liu, L.; Cui, S.; Han, X. Fpconv: Learning local flattening for point convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4293–4302. [Google Scholar]
  37. Verma, N.; Boyer, E.; Verbeek, J. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2598–2606. [Google Scholar]
  38. Wang, C.; Samari, B.; Siddiqi, K. Local spectral graph convolution for point set feature learning. In Proceedings of the European conference on computer vision, Munich, Germany, 8–14 September 2018; pp. 52–66. [Google Scholar]
  39. Te, G.; Hu, W.; Zheng, A.; Guo, Z. Rgcnn: Regularized graph cnn for point cloud segmentation. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Korea, 22–26 October 2018; pp. 746–754. [Google Scholar]
  40. Klokov, R.; Lempitsky, V. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 21–26 July 2017; pp. 863–872. [Google Scholar]
  41. Li, J.; Chen, B.M.; Lee, G.H. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9397–9406. [Google Scholar]
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  43. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  44. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  45. Yi, L.; Kim, V.G.; Ceylan, D.; Shen, I.C.; Yan, M.; Su, H.; Lu, C.; Huang, Q.; Sheffer, A.; Guibas, L. A scalable active framework for region annotation in 3d shape collections. ACM Trans. Graph. 2016, 35, 1–12. [Google Scholar] [CrossRef]
  46. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  47. Xu, Y.; Fan, T.; Xu, M.; Zeng, L.; Qiao, Y. Spidercnn: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 87–102. [Google Scholar]
Figure 1. Group-in-group relation-based transformer network for point cloud classification.
Figure 2. Visualization of learned representations on the test set of ModelNet40 using t-SNE. Best viewed in color.
Figure 3. Visualization of part segmentation.
Table 2. Comparison of part segmentation results.

Method mIoU Plane Bag Cap Car Chair Earph. Guitar Knife Lamp Laptop Motor Mug Pistol Rocket Skate Table
ShapeNet [45] 81.4 81.0 78.4 77.7 75.7 87.6 61.9 92.0 85.4 82.5 95.7 70.6 91.9 85.9 53.1 69.8 75.3
PointNet [11] 83.7 83.4 78.7 82.5 74.9 89.6 73.0 91.5 85.9 80.8 95.3 65.2 93.0 81.2 57.9 72.8 80.6
PointNet++ [12] 85.1 82.4 79.0 87.7 77.3 90.8 71.8 91.0 85.9 83.7 95.3 71.6 94.1 81.3 58.7 76.4 82.6
KD-Net [40] 82.3 80.1 74.6 74.3 70.3 88.6 73.5 90.2 87.2 81.0 94.9 57.4 86.7 78.1 51.8 69.9 80.3
SO-Net [41] 84.9 82.8 77.8 88.0 77.3 90.6 73.5 90.7 83.9 82.8 94.8 69.1 94.2 80.9 53.1 72.9 83.0
RGCNN [39] 84.3 80.2 82.8 92.6 75.3 89.2 73.7 91.3 88.4 83.3 96.0 63.9 95.7 60.9 44.6 72.9 80.4
PCNN [33] 85.1 82.4 80.1 85.5 79.5 90.8 73.2 91.3 86.0 85.0 95.7 73.2 94.8 83.3 51.0 75.0 81.8
SRN [8] 85.3 82.4 79.8 88.1 77.9 90.7 69.6 90.9 86.3 84.0 95.4 72.2 94.9 81.3 62.1 75.9 83.2
DGCNN [16] 85.2 84.0 83.4 86.7 77.8 90.6 74.7 91.2 87.5 82.8 95.7 66.3 94.9 81.1 63.5 74.5 82.6
P2Sequence [7] 85.2 82.6 81.8 87.5 77.3 90.8 77.1 91.1 86.9 83.9 95.7 70.8 94.6 79.3 58.1 75.2 82.8
PointConv [14] 85.7
PointCNN [13] 86.1 84.1 86.5 86.0 80.8 90.6 79.7 92.3 88.4 85.3 96.1 77.2 95.2 84.2 64.2 80.0 83.0
PointASNL [31] 86.1 84.1 84.7 87.9 79.7 92.2 73.7 91.0 87.2 84.2 95.8 74.4 95.2 81.0 63.0 76.3 83.2
PCT [21] 86.4 85.0 82.4 89.0 81.2 91.9 71.5 91.3 88.1 86.3 95.8 64.6 95.8 83.6 62.2 77.6 83.7
Ours 86.6 85.6 84.6 88.0 79.3 91.8 79.7 92.5 86.5 83.6 95.3 77.9 94.9 84.9 65.6 77.4 83.8
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
