Point Cloud Segmentation Network Based on Attention Mechanism and Dual Graph Convolution

Abstract: To overcome the limitations of inadequate local feature representation and the underutilization of global information in dynamic graph convolutions, we propose a network that combines attention mechanisms with dual graph convolutions. Firstly, alongside the dynamic graph we construct a static graph using the K-nearest neighbors algorithm and the geometric distances between points; the combination of the dynamic and static graphs forms a dual graph structure, compensating for the dynamic graph's underutilization of geometric positional relationships. Edge convolutions are then applied to extract edge features from the dual graph structure. To further strengthen the capture of local features, we employ attention pooling, which combines max pooling and average pooling operations. Secondly, we introduce a channel attention module and a spatial self-attention module to improve the representation of global features and raise the semantic segmentation accuracy of the network. Experimental results on the S3DIS dataset demonstrate that, compared to dynamic graph convolution alone, the proposed approach effectively exploits both the semantic and the geometric relationships between points through dual graph convolutions, addressing insufficient local feature extraction, while the introduced attention mechanisms mitigate the underutilization of global information, resulting in significant improvements in model performance.


Introduction
With the increasing maturity of laser scanning technology [1,2], the efficiency and quality of point cloud acquisition have significantly improved. Three-dimensional point cloud semantic segmentation has crucial applications in fields such as autonomous driving, smart cities, and virtual reality [3], making it a prominent research area in computer vision [4]. Compared with two-dimensional images, 3D point clouds offer more comprehensive information, including scene depth and three-dimensional object shape, enabling machines to better comprehend scenes and recognize the real world. However, 3D point clouds are unstructured, unordered, and sparse, so traditional convolution cannot be applied to them directly; thus, automatically and efficiently performing point cloud semantic segmentation to understand the objective world remains an immensely challenging task [5].
Traditional point cloud segmentation algorithms, such as edge-based, model-based, and region-based methods, rely heavily on manually extracted features. This dependence results in poor generalization and high computational complexity, leading to low efficiency on large-scale data [6]. In contrast, deep learning-based algorithms offer efficient computation and strong generalization, enabling them to handle large-scale point clouds effectively; consequently, they have gradually become dominant in the field of point cloud semantic segmentation [7,8]. In recent years, researchers have proposed a variety of deep learning-based segmentation networks, which can be categorized into three types: projection-based [9–13], voxelization-based [14–18], and point-based [19–23]. Projection-based and voxelization-based methods, however, may suffer from information loss because the point clouds are transformed into other data forms, which can hurt segmentation accuracy. Point-based methods, in contrast, learn features directly on the point clouds without additional transformations. In 2017, Qi et al. [19] introduced the pioneering PointNet network, which used a multi-layer perceptron to extract per-point features and max pooling to aggregate global information; however, it neglected the learning of local features. To address this issue, Qi et al. [23] proposed PointNet++, which adopts a hierarchical approach, sampling points at each layer and using PointNet to learn features within each local neighborhood, thereby enlarging the receptive field and aggregating local features. Despite these advances, PointNet++ does not model inter-point relationships, so its learning of local features remains insufficient. In 2019, Wang et al. [21] introduced the Dynamic Graph Convolutional Neural Network (DGCNN), which represents point clouds as graph structures and uses edge convolutions to capture the local relationships between center points and their neighbors, effectively extracting local features from point clouds. In 2020, Zhai et al. [24] proposed a multi-scale graph convolutional architecture that extracts and aggregates multi-scale features in groups using graph convolutional networks of different scales, improving model robustness. In 2021, Zhang et al. [25] connected the outputs of the dynamic graphs at each layer, effectively alleviating the vanishing-gradient problem; they froze the feature extractor after training and then repeatedly trained the classifier, improving network performance.
However, current dynamic graph convolutional networks face two challenges: (1) constructing the graph structure from feature distances partially discards the geometric relationships between points, losing intrinsic connections in the data and yielding insufficient local feature extraction, and (2) using a max pooling function to obtain global information discards too much information, leaving the global representation insufficient. To address these issues and further exploit the semantic and geometric relationships between points, this paper proposes a Dual Graph Convolutional Neural Network (DualGraphCNN). (1) To remedy the insufficient local feature extraction of dynamic graph convolutions, we introduce a dual graph convolution approach. While the dynamic graph is being constructed, a static graph, independent of the network layers, is built from spatial geometric distances; the combination of the dynamic and static graphs forms a dual graph structure. A multi-layer perceptron is utilized to extract edge features on the dual graph structure and aggregate the neighborhood features of the point cloud. Dual graph convolution not only retains the non-local diffusion of information characteristic of dynamic graph convolution but also accounts for the geometric structure of the point cloud, enhancing the network's ability to capture the internal relationships between points. (2) To strengthen the network's extraction of global information, we propose a channel attention module and a spatial self-attention module. The channel attention module learns to use global information to selectively enhance informative channels and suppress irrelevant ones, while the spatial self-attention module models long-range dependencies between points, enriching the representation of global information in the point cloud features.
To summarize, our main contributions include the following.

•
We propose a dual-graph convolution module, which can make full use of the geometric and semantic information of point clouds, capture the internal relationship between point clouds, and realize the effective extraction of local features of point clouds.

•
We propose a channel attention module and a spatial self-attention module to enhance the network's ability to extract global information in both spatial and channel dimensions, thereby optimizing the final segmentation results.

•
We apply the proposed methodology to conduct quantitative experiments and various ablation studies on the challenging Stanford Large-Scale 3D Indoor Space (S3DIS) dataset. The experimental results verify the rationality and effectiveness of our method.

Network Structure Design
The overall structure of the DualGraphCNN network is illustrated in Figure 1. The input to the network is a point cloud of size N × D, where N is the number of input points and D is the feature dimension of each point. The input points pass through a stack of three dual graph convolution modules to extract local neighborhood information effectively. To emphasize important features at the channel level, suppress irrelevant ones, and enhance the representation of the point cloud features, a channel attention module follows each dual graph convolution module. The point cloud features extracted at each layer are then concatenated and processed by a shared multi-layer perceptron for deeper feature extraction, and global features are obtained by applying max pooling to the result. These global features are concatenated again with the output features of the first three dual graph convolution layers, followed by another round of feature learning using three shared multi-layer perceptrons. Finally, the network outputs N × D segmentation results for the point cloud. Additionally, a spatial self-attention module is placed between these three shared multi-layer perceptrons to model long-range dependencies among points and enhance the global information representation.
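As a shape-level illustration, the data flow described above can be sketched with stand-in modules. Plain random-weight ReLU projections replace the dual graph convolution, channel attention, and spatial self-attention blocks, and the channel widths (64/1024/512/256) and the 13-class output are assumptions made for this sketch, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, c_out):
    # Stand-in for a shared MLP: one random linear projection plus ReLU.
    return np.maximum(x @ rng.normal(size=(x.shape[-1], c_out)), 0.0)

N, D = 4096, 9                                  # e.g., xyz + rgb + normalized xyz
x = rng.normal(size=(N, D))

l1 = mlp(x, 64)                                 # dual graph conv + channel attention, layer 1
l2 = mlp(l1, 64)                                # layer 2
l3 = mlp(l2, 64)                                # layer 3

deep = mlp(np.concatenate([l1, l2, l3], axis=1), 1024)  # shared MLP on concatenated features
glob = deep.max(axis=0, keepdims=True)          # global feature via max pooling

# broadcast the global feature back to every point and fuse with per-layer features
fused = np.concatenate([np.repeat(glob, N, axis=0), l1, l2, l3], axis=1)
scores = mlp(mlp(mlp(fused, 512), 256), 13)     # three shared MLPs -> per-point class scores
```

The point of the sketch is the wiring, not the modules: per-layer features are concatenated, pooled into one global vector, broadcast back, and refined into per-point predictions.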


Dual Graph Convolution Module
The data are organized into a graph structure consisting of vertices and edges, and the convolution operation performed on graph data is referred to as graph convolution. DGCNN employs dynamic graph convolution, which uses the distances between point cloud features to compute the K nearest neighbors of each point. As the point cloud features evolve through the network, the K nearest neighbors of a point vary across layers, so the graph structure is dynamically updated at each layer; hence the name dynamic graph convolution. Generally, closer spatial distances indicate higher similarity in feature information and stronger internal relationships. In dynamic graph convolution, however, the neighborhood point sets are constructed solely from the distances between point features, which partially loses the geometric positional relationships among points and diminishes the network's capacity to capture fine-grained local features.
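The feature-space neighbor search at the heart of the dynamic graph can be sketched in a few lines of NumPy; the dense O(N²) distance matrix is for clarity only (real implementations batch this on the GPU):

```python
import numpy as np

def knn_graph(feats, k):
    """Indices of the k nearest neighbors of every point, by pairwise
    Euclidean distance in feature space (a toy stand-in for the dynamic
    graph construction used by DGCNN)."""
    # squared pairwise distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(feats ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    np.fill_diagonal(d2, np.inf)          # exclude the point itself
    return np.argsort(d2, axis=1)[:, :k]  # (N, k) neighbor indices

# Because the graph is rebuilt from the *current* features at every layer,
# a point's neighborhood can change as the features evolve through the network.
feats = np.random.default_rng(0).normal(size=(32, 8))
idx = knn_graph(feats, k=4)
```

Running the same function on updated features yields a different `idx`, which is exactly the "dynamic" behavior described above.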
To address the insufficient utilization of geometric relationships in dynamic graph convolution, this paper proposes a dual-graph convolution method that leverages spatial geometric information to enhance local feature extraction from point clouds by constructing a dual graph structure. As illustrated in Figure 2, dual-graph convolution extracts features in two steps: (1) employing the K-nearest neighbor algorithm to construct the dual graph structure and (2) utilizing an MLP to extract edge features in the graph structure, which are then aggregated by the feature aggregation module to generate the local features of the point cloud; C* denotes a higher feature dimension.

(1) Construction of Dual Graph Structure
The input point cloud is defined as P = {p_i | i = 1, 2, …, N}, where N is the number of points, p_i is the i-th input point, x_i ∈ R^3 is the 3D coordinate vector of p_i, f_i ∈ R^C is the feature vector of p_i, and C is the feature dimension. For an input point p_i, the construction process of the dual graph structure is shown in Figure 3. The red circle indicates the center point, the blue circles indicate the nearest neighbors found by feature distance, and the orange circles indicate the nearest neighbors found by geometric distance. First, taking p_i as the center point, the feature distances between the center point and the remaining points are computed with the K-nearest neighbor algorithm, the K_d closest points are selected as neighbors, p_ij (j = 1, 2, …, K_d), and the edge features between these neighbors and the center point are calculated as shown in Equation (1):

e_ij = (f_i, f_i − f_ij)    (1)

where f_ij is the feature vector of the neighboring point p_ij and f_i − f_ij is the feature distance between the two points. The center point, its neighbors, and the edge features constitute the dynamic graph structure. Since the point cloud features change continually as the network extracts them, the dynamic graph constructed from feature distances is dynamically updated at each layer. Second, the geometric distances between the center point and the remaining points are computed with the K-nearest neighbor algorithm, the K_s closest points are selected as neighbors, p′_ij (j = 1, 2, …, K_s), and the edge features between these neighbors and the center point are calculated as shown in Equation (2):

e′_ij = (f_i, f_i − f′_ij)    (2)

where f′_ij is the feature vector of the nearest neighbor point p′_ij and x′_ij is its 3D coordinate vector, used when computing the geometric distance ‖x_i − x′_ij‖. The center point, the nearest neighbors, and the edge features constitute the static graph structure. Since the positions of the points in space are fixed, the vertices of the static graph need to be computed only once; only the edge features must be updated each time, which incurs little additional cost while compensating for the missing geometric positional relationships in the dynamic graph structure. Finally, the dynamic graph and the static graph are combined to obtain the dual graph structure. The dual graph structure preserves both points with similar features and points with related geometric positions, which aids the extraction of fine-grained local features at the center point.
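A minimal NumPy sketch of the construction above: one KNN pass in feature space for the dynamic graph, one in coordinate space for the static graph, and edge features formed from the center feature and the feature difference as in Equations (1) and (2). The exact concatenation order of the edge encoding is an assumption of this sketch:

```python
import numpy as np

def knn(ref, k):
    # k nearest neighbors of each row of `ref` by Euclidean distance
    sq = np.sum(ref ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * ref @ ref.T
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def dual_graph_edges(xyz, feats, k_d, k_s):
    """Edge features of the dual graph: the dynamic graph is built from
    feature distances, the static graph from geometric distances."""
    n, c = feats.shape
    dyn_idx = knn(feats, k_d)              # (N, k_d) feature-space neighbors
    sta_idx = knn(xyz, k_s)                # (N, k_s) geometric neighbors
    fi = feats[:, None, :]                 # (N, 1, C) center features
    # e_ij  = (f_i, f_i - f_ij)  on the dynamic graph
    e_dyn = np.concatenate([np.broadcast_to(fi, (n, k_d, c)),
                            fi - feats[dyn_idx]], axis=-1)
    # e'_ij = (f_i, f_i - f'_ij) on the static graph
    e_sta = np.concatenate([np.broadcast_to(fi, (n, k_s, c)),
                            fi - feats[sta_idx]], axis=-1)
    return e_dyn, e_sta                    # each (N, k, 2C)
```

Note that only the static index set `sta_idx` stays fixed across layers; `dyn_idx` would be recomputed from the updated features each time.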
Electronics 2023, 12, x FOR PEER REVIEW


(2) Extraction of Local Features
In this paper, we employ a shared multi-layer perceptron to extract edge features on the dual graph structure. Encoding the edge features establishes the feature relationships between the center point and its neighbors, while the multi-layer perceptron further extracts and abstracts high-level semantic features. Specifically, a two-layer MLP is utilized for extracting edge features in the dual-graph convolution module, as shown in Equation (3):

ẽ_ij = σ(bn(MLP(e″_ij)))    (3)

where σ is the activation function, bn is batch normalization, and e″_ij denotes the edge features of the dual graph structure. Subsequently, the local features of the point cloud are obtained via a feature aggregation operation, as shown in Equation (4):

f̃_i = Λ_{ẽ_ij ∈ E_i} ẽ_ij    (4)

where E_i is the edge set of the dual graph structure, ẽ_ij denotes the edge features after feature extraction, f̃_i is the local feature of point p_i, and Λ is the feature aggregation operation. For feature aggregation, the common approach is max or average pooling, taking the maximum or average of the local neighborhood features as the representation. However, such simple pooling can lose a significant amount of neighborhood information. To transmit as much local neighborhood information as possible, we propose a feature aggregation module that combines attention mechanisms with both average and max pooling. This allows the module to focus automatically on important local neighborhood information while still considering all features within the area. The feature aggregation module is shown in Figure 4, which corresponds to the feature aggregation step in Figure 2.
For attention pooling, the input neighborhood features first pass through a shared multi-layer perceptron to learn the potential activation level of each feature, and a softmax function is then applied to compute an attention score. This learned score acts as a soft mask that automatically selects important features. The attention score is multiplied element-wise with the input features, and the results are summed to obtain the attention-pooled feature. Next, these focused features are concatenated with the salient features from the max pooling output and the global features from the average pooling output. Information interaction and feature redistribution among the three pooled outputs are achieved via a multi-layer perceptron, finally producing features rich in local detail.
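The aggregation just described (attention pooling fused with max and average pooling) might look like the following sketch, where a single weight matrix `w_score` stands in for the shared scoring MLP, a simplifying assumption:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(edge_feats, w_score):
    """edge_feats: (N, K, C) neighborhood features; w_score: (C, C) stands in
    for the shared MLP that scores each feature."""
    scores = softmax(edge_feats @ w_score, axis=1)    # soft mask over the K neighbors
    att = (scores * edge_feats).sum(axis=1)           # attention-pooled feature (N, C)
    mx = edge_feats.max(axis=1)                       # salient features from max pooling
    avg = edge_feats.mean(axis=1)                     # overall tendency from average pooling
    return np.concatenate([att, mx, avg], axis=-1)    # (N, 3C) fused representation
```

A final shared MLP (omitted here) would remix the concatenated (N, 3C) output back down to C channels, performing the "information interaction and feature redistribution" step.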


Channel Attention Module
The channel attention mechanism is an effective deep learning technology, which can be used to improve the learning ability of the model for key information in the input features [26][27][28].It can adaptively adjust the weight of different channels, reduce the weight of unimportant channels, and increase the weight of useful information channels so that the model pays more attention to important features.To this end, we design a channel attention module to highlight important feature channels of point clouds, improve the representation ability of the network by modeling the interdependence between feature channels, and learn to use global information to selectively strengthen useful channels and suppress irrelevant channels.
The channel attention module is shown in Figure 5. First, the spatial information of the input point cloud feature f is aggregated via average pooling and max pooling to generate two different spatial-context descriptors, f_avg and f_max, which represent the average and maximum levels of the input point cloud in the channel dimension, respectively. f_avg and f_max are fed into a shared multi-layer perceptron to further capture the channel dependencies; the two output vectors are summed, and the channel attention score cas is obtained with the sigmoid function, as shown in Equation (5):

cas = σ(MLP(f_avg) + MLP(f_max))    (5)

where σ denotes the sigmoid function and the MLP shares its parameters across the inputs f_avg and f_max. Finally, the input feature f is multiplied element-wise with cas to obtain the enhanced feature after channel attention screening, and, following the idea of residual connections, the input feature is added to the enhanced feature to obtain the final output f_out, as shown in Equation (6):

f_out = f ⊕ (f ⊗ cas)    (6)

where ⊕ represents element-wise addition and ⊗ represents element-wise multiplication.



Spatial Self-Attention Module
In semantic segmentation tasks, the global information of the point clouds is very important for the final semantic class prediction because two points with large spatial distance differences may belong to the same class, and their feature representations can be considered at the same time to enhance each other.In the field of 2D images, many works [29,30] adopt the self-attention mechanism to model the long-range dependencies between pixels and obtain global information between points.To effectively capture the global contextual information within each point cloud, we propose a spatial selfattention module that facilitates the modeling of interrelationships among point clouds.For any given point in space, we utilize a weighted sum approach to aggregate all point cloud features and update its feature representation based on similarity with other points.This enables mutual enhancement even when points are spatially distant but exhibit similar features.
The specific operation is shown in Figure 6. Given the input point cloud feature F_in ∈ R^{N×C}, it is fed into two convolutional layers to generate two new feature maps F_A, F_B ∈ R^{N×C}. Matrix multiplication is performed between F_A and the transpose of F_B, and a softmax function is applied to obtain the spatial similarity matrix S ∈ R^{N×N}, as shown in Equation (7):

S = softmax(F_A F_B^T)    (7)

where s_ij represents the influence of the j-th point on the i-th point, that is, the correlation between the two points. At the same time, F_in is fed into a convolutional layer to generate a new feature map F_C ∈ R^{N×C}; the similarity matrix S from Equation (7) is then multiplied with F_C, and the result is summed element-wise with F_in to obtain the final output F_out, as shown in Equation (8):

F_out = S F_C ⊕ F_in    (8)
The output feature F_out^i of each point is a weighted sum of its original feature and the features of all points; it carries global contextual information and selectively aggregates that information according to the similarity matrix. The semantic features of similar points reinforce one another, thereby enhancing intra-class aggregation and semantic consistency.
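The module above amounts to standard dot-product self-attention over points. In this sketch the 1×1 convolutions are replaced by plain projection matrices `wa`, `wb`, `wc` (an assumption; the paper uses convolutional layers):

```python
import numpy as np

def spatial_self_attention(f_in, wa, wb, wc):
    """f_in: (N, C) point features; wa, wb, wc: (C, C) projections standing in
    for the three convolutional layers. Returns softmax(F_A F_B^T) @ F_C + F_in."""
    fa, fb, fc = f_in @ wa, f_in @ wb, f_in @ wc
    logits = fa @ fb.T                        # (N, N) pairwise affinities s_ij
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    s = np.exp(logits)
    s = s / s.sum(axis=1, keepdims=True)      # similarity matrix: each row sums to 1
    return s @ fc + f_in                      # weighted sum over all points + residual
```

Because every row of S spans all N points, two spatially distant points with similar features can still contribute to each other's output, which is exactly the long-range dependency the module is meant to capture.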

Experimental Dataset
To evaluate the segmentation performance of the proposed model, the public S3DIS dataset [31] is selected for training and testing. S3DIS is a point-wise annotated semantic dataset developed by Stanford University. It is a large indoor point cloud dataset covering six large indoor areas with 271 rooms, including office areas, educational and exhibition spaces, meeting rooms, personal offices, restrooms, open spaces, lobbies, staircases, and corridors. The dataset consists of about 273 million scanned points; each point carries XYZ coordinates and RGB color information and has an explicit semantic label. S3DIS contains 13 labeled classes: ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board, and clutter.
The performance evaluation metrics of semantic segmentation are overall accuracy (OA) and Mean Intersection over Union (mIoU), which can be defined as follows: In Equations ( 9) and ( 10), K is the number of categories, n is the true category, m is the predicted category, nn p is the number of correct category predictions, nm p repre- sents the number of false negative examples, and mn p represents the number of false positive examples.

Network Parameter Setting
The experimental setup for the algorithm in this paper is as follows: the hardware used was an RTX3090 24GB GPU with 128 GB of running memory, while the software employed included UBUNTU20.04operating system, CUDA11.1, and Python 3.7; Pytorch 1.8.2 served as the deep learning framework utilized in this study.Network training parameters were set to include SGD optimizer with momentum of 0.9, initial learning rate of 0.1, and cosine annealing algorithm to decay it to a value of 0.001.The batch size is set to 12, the number of point clouds fed into the network at each time is about 4096 points,

Experimental Dataset
In order to evaluate the segmentation performance of the proposed model, the public S3DIS dataset [31] is selected for training and testing. The S3DIS dataset is a semantic dataset with point-level annotations developed by Stanford University. It is a large indoor point cloud dataset containing six large indoor areas comprising 271 rooms, including office areas, educational and exhibition spaces, meeting rooms, personal offices, restrooms, open spaces, lobbies, staircases, and corridors. The dataset consists of about 273 million scanned points. Each point contains XYZ coordinates and RGB color information and has an explicit semantic label. The S3DIS dataset contains 13 labeled classes: ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board, and clutter.
The performance evaluation metrics for semantic segmentation are overall accuracy (OA) and mean intersection over union (mIoU), defined as follows:

OA = Σ_(n=1)^(K) p_nn / Σ_(n=1)^(K) Σ_(m=1)^(K) p_nm, (9)

mIoU = (1/K) Σ_(n=1)^(K) p_nn / (p_nn + p_nm + p_mn). (10)

In Equations (9) and (10), K is the number of categories, n is the true category, m is the predicted category, p_nn is the number of correct category predictions, p_nm represents the number of false negative examples, and p_mn represents the number of false positive examples.
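Given a K × K confusion matrix with rows indexed by true class and columns by predicted class, both metrics follow directly from the definitions above. The following is a minimal sketch; the function name and matrix layout are illustrative, not from the paper's code.

```python
def segmentation_metrics(conf):
    """OA and mIoU from a K x K confusion matrix conf[n][m] = number of points
    of true class n predicted as class m. Diagonal entries are the correct
    predictions p_nn; off-diagonal row sums are false negatives, off-diagonal
    column sums are false positives."""
    K = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[k][k] for k in range(K))
    oa = correct / total
    ious = []
    for k in range(K):
        p_nn = conf[k][k]
        fn = sum(conf[k]) - p_nn                       # false negatives for class k
        fp = sum(conf[n][k] for n in range(K)) - p_nn  # false positives for class k
        denom = p_nn + fn + fp
        ious.append(p_nn / denom if denom else 0.0)
    return oa, sum(ious) / K
```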

Network Parameter Setting
The experimental setup for the algorithm in this paper is as follows: the hardware was an RTX 3090 24 GB GPU with 128 GB of RAM, and the software environment comprised the Ubuntu 20.04 operating system, CUDA 11.1, Python 3.7, and PyTorch 1.8.2 as the deep learning framework. The network was trained with the SGD optimizer with a momentum of 0.9 and an initial learning rate of 0.1, decayed to 0.001 by the cosine annealing algorithm. The batch size is set to 12, about 4096 points are fed into the network at each time, and the network is trained for 100 epochs. The nearest neighbor size K_d is set to 20 for the dynamic graph, and the nearest neighbor size K_s is set to 20 for the static graph.
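The stated schedule (initial learning rate 0.1 annealed to 0.001 over 100 epochs) corresponds to the standard cosine annealing formula, sketched below; this mirrors what PyTorch's `CosineAnnealingLR` computes, though the function here is our own illustration.

```python
import math

def cosine_annealed_lr(epoch, total_epochs=100, lr_max=0.1, lr_min=0.001):
    """Cosine annealing from lr_max down to lr_min over total_epochs:
    lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))
```

The schedule starts at exactly 0.1 at epoch 0 and reaches 0.001 at epoch 100, decaying slowly at first and fastest around the midpoint.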

Contrast Experiment
The proposed model is evaluated using six-fold cross-validation on the S3DIS dataset: each time, five areas are selected as the training set and the remaining area is used as the test set, and this process is repeated six times to cover the entire dataset. The input data for each room are divided into blocks of 1 m × 1 m, and 4096 points are sampled from each block. All point cloud data are used for testing, with overall accuracy and mean intersection over union serving as metrics. Table 1 compares the semantic segmentation metrics of the proposed method and other methods on the S3DIS dataset:

Method               mIoU (%)  OA (%)
[23]                 54.5      81.0
DGCNN [21]           56.1      84.1
Point-PlaneNet [32]  54.8      83.9
KVGCN [33]           60.9      87.4
DBAN [34]            60.9      86.1
DualGraphCNN         63.2      87.8

The results demonstrate that DualGraphCNN, which incorporates dual graph convolution, outperforms the DGCNN network by 7.1% and 3.7% in terms of mIoU and OA, respectively, indicating its effectiveness in aggregating local features and improving segmentation accuracy. Moreover, compared with the PointNet, PointNet++, Point-PlaneNet, KVGCN, and DBAN networks, our model achieves the best mIoU and OA scores, demonstrating its superior performance in fine-grained segmentation of complex scenes. Table 2 presents a comparison of segmentation results for various object types between the proposed network and other models on Area 5.
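The block preprocessing described above (partitioning each room into 1 m × 1 m columns in the XY plane and sampling 4096 points per block) can be sketched as follows; the function and its parameters are illustrative, not the paper's actual pipeline.

```python
import random

def sample_blocks(points, block=1.0, npts=4096, seed=0):
    """Partition a room's points into block x block columns in the XY plane
    and draw npts points from each: without replacement when a block is large
    enough, with replacement otherwise. points: sequence of (x, y, z, ...)."""
    rng = random.Random(seed)
    blocks = {}
    for p in points:
        key = (int(p[0] // block), int(p[1] // block))
        blocks.setdefault(key, []).append(p)
    out = []
    for key, pts in sorted(blocks.items()):
        if len(pts) >= npts:
            out.append(rng.sample(pts, npts))          # subsample dense blocks
        else:
            out.append([rng.choice(pts) for _ in range(npts)])  # pad sparse blocks
    return out
```

Sampling a fixed point count per block is what allows the network to receive uniformly sized inputs at the stated batch size.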
Compared to the DGCNN network, except for beam and window, the IoU of all other categories improves to varying degrees; in particular, sofa, bookcase, and board increase by 15.9%, 12.4%, and 17.3%, respectively. These findings demonstrate that incorporating dual graph convolution and attention mechanisms enhances the model's ability to capture local and global information from point clouds and improves segmentation accuracy across different object types. Figure 7 shows the visual segmentation results of the proposed model and the DGCNN network: Figure 7a shows the input point clouds, Figure 7b the segmentation results of DGCNN, Figure 7c the segmentation results of the proposed method, and Figure 7d the reference standard. From top to bottom, they are room 1, room 2, and room 3. The red dashed lines mark the regions where the results of our method are compared with those of the dynamic graph convolutional network. Compared with DGCNN, the proposed method achieves better segmentation of the details of the various objects. In room 1, at the junction of windows, walls, and columns, DGCNN confuses their edges and the segmentation is not smooth. In room 2, DGCNN mistakenly segments part of the blackboard as a beam of similar shape, and the boundary between the door frame and the wall is not clearly delineated, causing part of the door frame to be lost. In room 3, DGCNN fails to distinguish walls of similar color from white blackboards and segments stacked bookshelves and sundry objects entirely as bookshelves. Our method achieves comparatively good segmentation at all of these positions, indicating that the dual graph convolution and attention mechanisms strengthen the network's local information mining, improve the contour segmentation of the various objects, and produce smoother segmentation at object junctions.

Ablation Study
To validate the functionality and effectiveness of the different modules in the network, we conducted an ablation study on various combinations of the DualGraphCNN modules. Areas 1-5 of the S3DIS dataset were selected as the training set, and area 6 served as the test set. The results are presented in Table 3, where √ means the corresponding module is used and × means it is not. Method 1 is the segmentation result of DGCNN. In method 2, we replaced the graph convolution module of DGCNN with the dual graph convolution module, increasing mIoU by 3.8%. This improvement can be attributed to the dual graph convolution module compensating for the geometric structure limitations of dynamic graph convolution by considering both the geometric and the feature connections between points simultaneously; in addition, when aggregating point cloud features, the module combines three pooling methods to further enhance local feature representation. Method 3 adds the spatial self-attention module (SSA) on top of method 2 and improves mIoU by a further 1.3%, owing to SSA's ability to model long-range dependencies among points and improve the global information in the point cloud features. Method 4 adds the channel attention module (CA) to method 2; by selectively strengthening useful channels using global information while suppressing irrelevant ones, it enhances the point cloud feature representation and achieves a mIoU 0.6% higher than method 2. In method 5, the simultaneous use of all modules further improves segmentation performance, increasing mIoU by 5.8% over method 1 and yielding improvements of varying degrees over the other methods.
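The channel attention module is only described at a high level here. As one plausible realization, a squeeze-and-excitation-style sketch is shown below; the weight matrices `W1` and `W2` are illustrative stand-ins for the learned layers, not the paper's actual parameters.

```python
import math

def channel_attention(F_in, W1, W2):
    """SE-style channel attention sketch: average-pool over the N points to get
    a per-channel descriptor (squeeze), pass it through two small transforms
    with ReLU and sigmoid (excite), and rescale F_in channel-wise, which
    strengthens informative channels and suppresses irrelevant ones."""
    N, C = len(F_in), len(F_in[0])
    r = len(W1[0])  # reduced dimension
    # Squeeze: global average pooling over points
    desc = [sum(F_in[n][c] for n in range(N)) / N for c in range(C)]
    # Excite: C -> r with ReLU, then r -> C with sigmoid
    h = [max(0.0, sum(desc[c] * W1[c][j] for c in range(C))) for j in range(r)]
    w = [1.0 / (1.0 + math.exp(-sum(h[j] * W2[j][c] for j in range(r)))) for c in range(C)]
    # Rescale each channel of every point by its learned weight
    return [[F_in[n][c] * w[c] for c in range(C)] for n in range(N)]
```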

Selection of Pooling Methods
During the aggregation of local features, different pooling functions yield local features with distinct characteristics, thereby influencing segmentation accuracy. To investigate the impact of pooling methods on network performance, combination tests were conducted using various pooling techniques. Area 6 was used as the test set, and the results are presented in Table 4, where √ means the corresponding pooling method is used and × means it is not. Method A employed max pooling, method B average pooling, and method C attention pooling, yielding mIoU values of 75.1%, 74.7%, and 75.3%, respectively. It can be inferred that attention pooling produces focused features with the best effectiveness, max pooling produces saliency-based features with a secondary effect, and average pooling yields overall features with comparatively lower efficacy. Method D combines max and average pooling and achieves better results than A or B alone, indicating a complementarity between salient and overall features that improves information transmission when they are combined. Building on method D, method E adds attention pooling and attains the best mIoU by enriching the feature representation with important neighborhood-specific attributes.

K Nearest Neighbor Size Test
The K value of the nearest neighbor search plays a crucial role in capturing local information during the construction of the dynamic and static graphs. To investigate its impact on network performance, various values were tested in combination, with area 6 serving as the test set. The results are presented in Table 5, where K_d denotes the K value of the dynamic graph and K_s that of the static graph. Combination 1 corresponds to using only the dynamic graph, as in DGCNN, while combination 2 uses only the static graph. Combinations 3-9 employ the dual graph structure. Comparing combinations 1 and 2 reveals that constructing the graph based on feature distance outperforms construction based on geometric distance. Furthermore, comparing combinations 1, 2, and 6 demonstrates that the dual graph structure yields better results than either the dynamic or the static graph alone, since it retains both points with similar features and points with related geometric positions, facilitating the extraction of fine-grained local features around central points. Comparing combinations 3, 4, 5, and 6 shows that network performance consistently improves as K_d and K_s increase, because small K values capture insufficient local information. Conversely, comparing combinations 6, 7, 8, and 9 indicates that further increasing K_d and K_s degrades performance, owing to the extra noise and redundant information introduced by large K values. Therefore, K_d = 20 and K_s = 20 are chosen to facilitate local feature extraction in the proposed network.
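The three-way feature aggregation combined in method E above can be sketched as follows. This is a simplified illustration: the attention weights here come from a softmax over a naive per-neighbor score (the sum of its features), standing in for the learned scoring function.

```python
import math

def aggregate_neighbors(neigh_feats):
    """Combine max, average, and attention pooling over a point's K neighbor
    feature vectors and concatenate the three results (output length 3C)."""
    K, C = len(neigh_feats), len(neigh_feats[0])
    # Max pooling: salient per-channel responses
    max_pool = [max(f[c] for f in neigh_feats) for c in range(C)]
    # Average pooling: overall per-channel responses
    avg_pool = [sum(f[c] for f in neigh_feats) / K for c in range(C)]
    # Attention pooling: softmax-weighted sum (illustrative scoring function)
    scores = [sum(f) for f in neigh_feats]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    att_pool = [sum(w[k] * neigh_feats[k][c] for k in range(K)) for c in range(C)]
    return max_pool + avg_pool + att_pool
```

Concatenation rather than summation preserves the distinct character of the salient, overall, and focused features, matching the complementarity observed in Table 4.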

Dual Graph Convolutional Module Layer Number Test
To further validate the impact of the number of layers in the dual graph convolution module on segmentation accuracy, we tested stacks of different numbers of dual graph convolution modules, using area 6 as the test set. The results are presented in Table 6. As the number of layers increases, mIoU and OA gradually improve, because too few layers aggregate local features insufficiently. With three layers, mIoU and OA reach their optimum. Increasing the number of layers further causes mIoU and OA to decline, owing to increased network complexity, a higher risk of overfitting, training difficulties, and hindered extraction of high-dimensional features.

Robustness Experiments
To evaluate the model's robustness to sparse point clouds, we trained it on areas 1-5 of the S3DIS dataset using randomly subsampled point clouds with 2048, 1024, 512, 256, and 128 points, in addition to the original sampling of 4096 points. We then tested the mIoU of our model on area 6 and compared it with that of DGCNN. As shown in Figure 8, the proposed model achieves higher mIoU than DGCNN at every sampling density, and the gap widens as the number of sampled points decreases. With only 128 points, our model outperforms DGCNN by up to 10.1%, indicating its superior robustness to sparse point clouds.

Conclusions
This paper proposes a network model that integrates a dual graph convolution module, a channel attention module, and a spatial self-attention module to enhance the extraction of both local and global features. First, the dual graph convolution module extracts fine-grained local features by exploiting the geometric positional relationships among points that dynamic graph convolution underuses. Second, the channel attention mechanism selectively enhances useful channels and suppresses irrelevant ones by learning from global information, sharpening the features' focus on relevant aspects and strengthening the model's ability to extract discriminative information. In addition, the spatial self-attention module captures long-range dependencies between points, enriching the global information in the point cloud features and allowing the network to exploit contextual information and interactions between points. Experimental results demonstrate that the proposed model enhances point interaction, better extracts the local and global features of point clouds, and achieves good performance.
The proposed network model still leaves room for improvement in many respects; further simplifying the model and reducing its complexity is the focus of our next step. In addition, we will extend the method to large-scale outdoor scenes for autonomous driving, providing high-level semantic information for vehicles to understand their surroundings and improving the environment perception and scene understanding capability of autonomous navigation.

Figure 3 .
Figure 3. Construction process of dual graph structure.


Figure 7 .
Figure 7. Visual comparison of segmentation results of S3DIS dataset.

The dual graph convolution module proceeds in two steps: (1) employing the K-nearest neighbor algorithm to construct the dual graph structure and (2) utilizing an MLP to extract edge features in the graph structure, which are then aggregated by the feature aggregation module to generate the local features of the point cloud; C* represents a higher feature dimension.
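The dual graph construction in step (1) can be sketched in plain Python: the dynamic graph links each point to its K_d nearest neighbors in feature space, the static graph to its K_s nearest neighbors in geometric space, and their union forms the dual graph neighborhood. A brute-force kNN is used here for clarity; function names are illustrative.

```python
def knn(vectors, query_idx, k):
    """Indices of the k nearest neighbors of vectors[query_idx]
    under squared Euclidean distance, excluding the point itself."""
    q = vectors[query_idx]
    order = sorted(
        (sum((a - b) ** 2 for a, b in zip(q, v)), i)
        for i, v in enumerate(vectors) if i != query_idx
    )
    return [i for _, i in order[:k]]

def dual_graph(xyz, feats, k_d=20, k_s=20):
    """Dual graph neighborhoods: union of the feature-space (dynamic) and
    geometric-space (static) kNN sets for every point."""
    dynamic = [knn(feats, i, k_d) for i in range(len(feats))]  # semantic links
    static = [knn(xyz, i, k_s) for i in range(len(xyz))]       # geometric links
    return [sorted(set(d) | set(s)) for d, s in zip(dynamic, static)]
```

Taking the union (rather than intersecting) keeps both kinds of neighbors, which is what lets the edge convolution see semantically similar points as well as geometrically adjacent ones.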

Table 1 .
Comparison of segmentation accuracy of different methods on S3DIS dataset.

Table 3 .
Ablation studies of different modules.

Table 4 .
Combinatorial testing of different pooling methods.

Table 5 .
The K nearest neighbor size test.

Table 6 .
Dual graph convolutional module layer number test.

Funding: National Natural Science Foundation of China, Grant/Award Number: 62272426; Shanxi Province Science and Technology Major Special Project, Grant/Award Number: 202201150401021.

Figure 8. Network robustness testing.