KVGCN: A KNN Searching and VLAD Combined Graph Convolutional Network for Point Cloud Segmentation

Abstract: Semantic segmentation of sensed point cloud data plays a significant role in scene understanding and reconstruction, robot navigation, etc. This work presents a Graph Convolutional Network integrating K-Nearest Neighbor (KNN) searching and the Vector of Locally Aggregated Descriptors (VLAD). KNN searching is used to construct the topological graph of each point and its neighbors. Convolution is then performed on the edges of the constructed graph to extract representative local features with multiple Multilayer Perceptrons (MLPs). Afterwards, a trainable VLAD layer, NetVLAD, is embedded in the feature encoder to aggregate the local and global contextual features. The designed feature encoder is repeated multiple times, and the extracted features are concatenated in a jump-connection style to strengthen the distinctiveness of features and thereby improve the segmentation. Experimental results on two datasets show that the proposed work settles the shortcoming of insufficient local feature extraction and improves the accuracy of semantic segmentation (mIoU 60.9% and oAcc 87.4% on S3DIS) compared with existing models.


Introduction
As one of the key technologies for scene understanding, semantic segmentation [1][2][3] of 3D point clouds plays a fundamental role in 3D reconstruction, autonomous driving, and robotics. Vehicles in autonomous driving applications need to interpret objects (e.g., pedestrians and cars) and their motion states in outdoor scenes before making reliable decisions. For a robot, reconstructing and parsing models of the surrounding environment is the premise of navigation and object manipulation. Unlike 2D images, point clouds are unstructured, unevenly distributed, and large in data volume, making them difficult to process and analyze with conventional methods. Great attention has been paid to achieving reliable semantic segmentation of point clouds with deep learning. However, how to effectively learn representative features from unorganized point clouds remains a challenging problem.
To address the difficulty of feature learning caused by the disorder of point clouds, a number of works [4][5][6][7][8][9][10] preprocess the 3D point cloud data into a normalized 2D or 3D representation, such as voxelization or multi-view projection. However, projection causes information loss, and 3D convolution after voxelization incurs greater computational cost. Another line of work instead takes the original point clouds as input and conducts end-to-end learning, such as the pioneering PointNet [11], the upgraded PointNet++ [12], and other deep networks [13][14][15][16][17][18][19][20]. They all directly handle unordered point clouds and learn global and local features for classification and segmentation. These works have achieved excellent results, yet difficulties remain in point cloud processing [21]: effective local geometric feature correspondences cannot be established, and the information shared between points is not well utilized, which results in low semantic segmentation accuracy.
To solve the above issues, this work mainly considers two aspects: how to learn features from local structures and how to aggregate local and global information. In this paper, we propose the K-Nearest Neighbor searching (KNN) and Vector of Locally Aggregated Descriptors (VLAD) [22] combined Graph Convolutional Network (KVGCN), which learns point cloud features from the topological graph constructed from each point and its K nearest neighbors. Specifically, it performs convolution on the edges of the constructed graph to extract representative local features with multiple MLPs, followed by two different pooling operations to compensate for information loss. Then, a VLAD layer is embedded in the feature encoding block to aggregate local and global contextual features. Furthermore, the designed feature encoding block is repeated multiple times, and the extracted features are concatenated in a jump-connection style to strengthen the representational ability of the network and thereby improve segmentation accuracy. Our key contributions are as follows.
• We propose an edge convolution on the constructed KNN graph to encode the local features. The query point denotes the global position, and the edges, namely, the relative positions between the query point and its neighbors, represent the local geometry. By concatenating this information, representative features can be obtained via edge convolution.
• We employ max pooling and average pooling as symmetric functions, and because the VLAD layer is also symmetric, the proposed network guarantees invariance to point order.
• By embedding the VLAD layer in the feature encoding block, we construct an enhanced feature merging encoder which effectively aggregates the local features and the global contextual information, improving segmentation in dense or easily confused regions.
• Experimental results on two datasets show that KVGCN achieves comparable or superior performance compared with state-of-the-art methods.
The rest of this work is organized as follows. Section 2 classifies and discusses the related work. Section 3 describes details of the proposed KGCN and KVGCN segmentation networks. Experimental evaluations and discussions are performed in Sections 4 and 5. Then, Section 6 concludes this work.

Related Work
Before the deep learning era, most researchers followed the traditional routine of "segmenting then labeling" in point cloud segmentation. However, with the growing scale of 3D data, the workload of traditional methods increases rapidly, resulting in low segmentation efficiency [23][24][25][26]. Deep learning-based methods, on the other hand, can improve both the accuracy and the efficiency of semantic segmentation [27]. At present, deep learning-based semantic segmentation models can be divided into two main categories.

Networks with Preprocessing
This type of model presents the 3D data as multi-view renderings or voxels. The multi-view convolutional neural network (MV-CNN) was first applied to 3D shape recognition [4], rendering 3D data from multiple viewpoints into 2D images as input to 2D CNNs. Kalogerakis et al. [6] propose a more complex multi-view rendering framework based on multi-view CNNs. The MV-RNN [7] model successfully applies a Recurrent Neural Network (RNN) to the 3D image field; it treats multi-view image sequences as time series and uses the RNN to learn features from the images.
Point cloud voxelization converts the irregular point cloud into a regular 3D voxelized representation. Qi et al. [9] propose a volumetric convolutional neural network (volumetric CNN) developed from 2D CNNs. Maturana et al. [5] develop the VoxNet model, which divides the 3D space into regular 3D voxel grids and feeds them to a revised 3D CNN. However, the biggest disadvantage of this approach is that the higher the voxelization resolution, the faster the memory and computation consumption grows. Worse, increasing the voxel resolution produces many empty voxel grids, so the computation spent on them is wasted. To reduce this waste, 3D CNNs at coarse resolution [8] seem to be a solution, but they in turn lead to greater quantization errors.
Some researchers propose other works to reduce the huge memory and computation consumption of voxelized CNNs, for instance, Kd-Net [28] and OctNet [29]. These are tree-based models that can perform high-resolution voxel learning on point clouds. They try to ignore empty voxel grids and focus on the information-rich ones, but they are difficult to implement and limited in expressiveness. Qi et al. [9] compare the recognition performance of the volumetric methods and the multi-view representation. Compared with the voxel representation, multi-view rendering lowers the computational cost but causes information loss.

End-to-End Learning Networks
To make full use of the information in point clouds, many researchers abandoned preprocessing and tried to learn from point clouds end-to-end. A groundbreaking work, PointNet [11], directly processes unordered point clouds and uses a pooling layer to extract global point features. On several different datasets, PointNet achieves advanced performance in 3D classification and semantic segmentation. However, PointNet lacks local feature representation. Therefore, the authors proposed an improved model, PointNet++ [12], which applies a farthest point sampling algorithm and a ball query algorithm to cluster the input point clouds, combining local dependence and multi-level features to achieve better performance. Later, an RNN-based deep learning model, RSNet [13], was proposed; it projects the features of disordered points onto ordered sequences of feature vectors, which are then fed into a recurrent neural network for learning. Besides, Graph Convolutional Neural Networks [30][31][32] and Multi-scale Graph Convolutional Neural Networks [15,33] are also used to deal with the irregular data structure. Mao et al. [16] and Thomas et al. [17] present new convolution methods to extract local features. In 2020, more powerful architectures were presented for 3D point cloud processing, e.g., JSENet [34], CT2 [35], and Point Transformer [36], pushing point cloud semantic segmentation to a new level. Among these networks, Point Transformer designs self-attention layers for point clouds and uses them to construct self-attention networks for tasks such as semantic scene segmentation and object part segmentation, achieving the best performance on the S3DIS dataset. By dropping the preprocessing, end-to-end learning methods can learn the intrinsic information of point clouds and reduce storage and computation costs.
In summary, point clouds retain most of the sensor information, and the difficulty of directly learning from them has largely been solved. However, how to sample from point cloud data and how to effectively extract local features remain open questions, so we explore further around this issue. This paper proposes a convolution network based on a KNN topological graph to encode the local features. By embedding the VLAD layer in the feature encoding block, we construct an enhanced feature merging encoder which effectively aggregates the local features and the global contextual information. The proposed network is introduced in detail in Section 3.

Method
Inspired by PointNet [11] and convolutions, this work proposes a different segmentation network based on the idea of graph convolution to resolve the segmentation defects under the premise of permutation invariance of point cloud data. Unlike PointNet, which extracts features from isolated points, this network constructs neighboring geometric graphs and then performs convolution on the edges of the local graphs to gather powerful features for semantic segmentation.

KNN Graph Convolution
For a point cloud with N points X = {x_1, x_2, ..., x_N} ⊆ R^F, we establish the KNN topological graph. As Figure 1 illustrates, the directed graph G = {V, E} captures the local structure of a point x_i and its K nearest neighbors {x_j1, x_j2, ..., x_jK}. On each edge, a feature e_ij = h_θ(x_i, x_j) is computed, where h_θ is a nonlinear function with learnable parameters θ, and F' denotes the dimension of the edge features. Then, an aggregation operation C is applied to the K edge features to obtain the output y_i corresponding to x_i, as defined in Equation (1):

y_i = C_{j:(i,j)∈E} h_θ(x_i, x_j). (1)

The choice of edge function and aggregation method is crucial to the proposed KNN graph convolution network. PointNet only encodes the global coordinates without considering the local structure, leading to unsatisfactory segmentation in local regions. To simultaneously take the global and local information into account, we utilize an asymmetric operation as the edge function:

h_θ(x_i, x_j) = h̄_θ(x_i, x_i − x_j). (2)

This function obtains global information through the query point x_i and local information through (x_i − x_j). In this way, both global and local cues are combined for a stronger feature representation. For the feature aggregation C, we use two-channel pooling (max and average pooling) to compensate for information loss, retaining both the strong responses and the fine-grained local features. Specifically, we define the following operator in Equation (3) to compute the edge feature, which can be considered a shared MLP with learnable parameters θ = {λ, ϕ} and ReLU activation:

e_ij = ReLU(λ · (x_i − x_j) + ϕ · x_i). (3)

Then, the aggregation over the edge features is realized by Equation (4):

y_i = [max_j e_ij, avg_j e_ij]. (4)
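The graph construction and edge-feature scheme above can be sketched in a few lines of NumPy. This is a simplified illustration under our own naming (brute-force distance computation, shared MLPs omitted), not the paper's implementation:

```python
import numpy as np

def knn_indices(points, k):
    # Brute-force pairwise squared distances; k nearest neighbors per point
    # (self is excluded by masking the diagonal).
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]            # (N, k) index tensor

def edge_features(points, idx):
    # Input to h(x_i, x_j): concatenate the query point x_i (global position)
    # with the offset x_i - x_j (local geometry), as in Equation (2).
    xi = np.repeat(points[:, None, :], idx.shape[1], axis=1)
    xj = points[idx]
    return np.concatenate([xi, xi - xj], axis=-1)   # (N, k, 2F)

def aggregate(edge_feats):
    # Two-channel pooling over the k edges: max keeps the strong responses,
    # average keeps the fine-grained local detail, as in Equation (4).
    return np.concatenate([edge_feats.max(axis=1),
                           edge_feats.mean(axis=1)], axis=-1)
```

In the full network, the (N, k, 2F) edge tensor would pass through the shared MLPs of Equation (3) before the pooling step.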

KGCN
To realize the above idea, we propose a KNN Graph Convolution Network (KGCN) consisting of three main parts, as shown in Figure 2.
-First, it searches the K nearest neighbors of each point and constructs the KNN graph (an N × K tensor, where N is the number of points) for the input point cloud (an N × F tensor, where F is the data dimension, usually including coordinates, colors, normals, etc.).
-Afterward, the point cloud and its KNN graph are fed into the local feature encoder, which transforms the inputs into an N × 256 tensor. The local feature encoder block is recursively assembled four times to expand the receptive field, and each block outputs a different level of feature abstraction.
-Finally, the third-level features and the final features are concatenated in the jump-connection style, followed by two sets of shared MLPs (MLPs{512, 256, 128} and MLPs{128, 64, Q}) for dimensionality reduction to obtain the semantic labels. The final output of the network is an N × Q matrix, namely, the probabilities of each point belonging to Q different categories, which can be further assigned by a softmax function.

As implied by the network, the design of the local feature encoder plays a decisive role. As shown in Figure 2, the structure of the encoder follows the idea of the KNN graph convolution introduced above. First, the two input tensors, the point features and the K neighboring relations in R^F space, are merged and transformed into an N × K × 2F tensor by a specific operation L. Assume that k_i stores the indices of the K nearest neighbors of point x_i and l_i signifies the transformed vector related to x_i; then the operation L is defined as

l_i(j) = [x_i, x_i − x_{k_i(j)}], j = 1, ..., K.

Here, l_i(j) corresponds to the neighbor x_j, and [x_i, x_i − x_j] denotes the concatenation of the vectors x_i and (x_i − x_j). The tensor of edge features is represented as E = {e_1, ..., e_N | e_i ∈ R^{K×128}}, which is computed by Equation (8):

e_i = σ(M(σ(M(l_i; θ_1)); θ_2)), (8)

where M denotes a shared MLP, σ is the ReLU activation function, and θ_1 and θ_2 are, respectively, the learnable parameters of the two MLPs. Afterward, max-pooling and average-pooling are applied over the K channels to simultaneously obtain the strong and fine-grained responses. Then, the output features Y = {y_1, ..., y_N | y_i ∈ R^256} of the local feature encoder are determined by Equation (9):

y_i = M([max_j e_i(j), avg_j e_i(j)]; θ_3), (9)

in which θ_3 is the parameter of the last MLP. The cross-entropy function is employed as the loss in network training.
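For a single point, the encoder step described above can be sketched as follows. This is a hedged NumPy illustration under the assumption that the two shared MLPs of Equation (8) are stacked with ReLU and that Equation (9) concatenates max- and average-pooled edge features; dimensions are shrunk from 128/256 for brevity, and the function names are ours:

```python
import numpy as np

def shared_mlp(x, layers):
    # The same weights are applied to every edge vector (a "shared MLP").
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)  # ReLU activation
    return x

def encode_point(edge_feats, layers_edge, layer_out):
    # edge_feats: (k, 2F) tensor l_i produced by the operation L.
    e = shared_mlp(edge_feats, layers_edge)            # per-edge features, Eq. (8)
    pooled = np.concatenate([e.max(0), e.mean(0)])     # two-channel pooling
    W3, b3 = layer_out
    return np.maximum(pooled @ W3 + b3, 0.0)           # final MLP -> y_i, Eq. (9)
```

Because the weights are shared across all K edges, the cost grows linearly in K and the encoder can be applied to every point in parallel.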

KVGCN
The KGCN extracts global features by assembling local feature encoders repeatedly. In this way, it loses certain semantic context of the scene, causing confusion in dense regions or regions with similar appearance. To abstract and exploit the global contextual information effectively, we embed a Vector of Locally Aggregated Descriptors (VLAD) [22] layer into the proposed network and name the result the KNN and VLAD combined Graph Convolution Network (KVGCN).

Overall Structure
VLAD is a traditional feature encoding algorithm based on feature descriptors, usually applied in image classification and instance retrieval. It aggregates local feature descriptors into a global vector that elaborately represents local details without data loss. Arandjelović et al. [37] imitated VLAD and designed a trainable generalized VLAD layer for CNN frameworks, known as NetVLAD, to aggregate local features. In this work, we embed the NetVLAD layer into our network and merge the local and global features for point cloud semantic segmentation (as shown in Figure 3).

The proposed KVGCN follows a similar encoder-decoder pipeline, but a new feature merging encoder is designed. The new encoder merges the N × 256 local features and the global features aggregated by the NetVLAD layer to compose an N × 512 feature representation for the following decoding stage. The NetVLAD layer uses a convolution and a softmax function to compute the parameters of the VLAD core, which aggregates the input local features. Afterward, L2-normalization and feature concatenation are applied to generate the merged features. Similar to KGCN, two feature vectors from different stages are then jump-connected for a decoding process to label the points. For the loss function, we follow the setting in KGCN and use the cross-entropy function to weigh the segmentation loss in network training.
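The merging step itself is a simple tile-and-concatenate. A minimal sketch, with `merge_features` a hypothetical name of ours and the 256-dimensional sizes taken from the text:

```python
import numpy as np

def merge_features(local_feats, global_vec):
    # Tile the 1 x 256 global descriptor to every point and concatenate it
    # with the N x 256 local features, giving the N x 512 merged tensor.
    n = local_feats.shape[0]
    tiled = np.broadcast_to(global_vec, (n, global_vec.shape[0]))
    return np.concatenate([local_feats, tiled], axis=1)
```

Every point thus carries both its own local encoding and an identical copy of the scene-level context, which the decoder can weigh against each other.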

NetVLAD Layer
VLAD can be considered a feature pooling method that stores the residuals between feature vectors and their cluster centers. Given N F-dimensional local features {x_i} as input and P cluster centers {c_p} as parameters, the output V of VLAD is a P × F representation matrix, which is further normalized into a vector as the global feature. The (j, p) element of V is computed by Equation (10):

V(j, p) = Σ_{i=1}^{N} a_p(x_i) (x_i(j) − c_p(j)), (10)

where x_i(j) and c_p(j) are the jth elements of the ith local feature and the pth cluster center. a_p(x_i) defines the contribution of feature x_i to cluster c_p; it is 1 if c_p is the closest cluster to x_i and 0 otherwise. Intuitively, the elements of matrix V record the residuals (x_i − c_p) of each feature to each cluster. Apparently, VLAD is untrainable due to the discontinuity of a_p(x_i).
In order to construct a trainable VLAD layer and embed it into a back-propagation network, the layer operation must be differentiable with respect to all its parameters. To this end, the hard assignment a_p(x_i) is replaced with a soft assignment in Equation (11):

ā_p(x_i) = e^{−α‖x_i − c_p‖²} / Σ_{p'} e^{−α‖x_i − c_{p'}‖²}, (11)

where ā_p(x_i) ranges between 0 and 1 according to the proximity of feature x_i to the cluster center c_p. Note that α is a positive parameter controlling the decay of the response, and α → +∞ corresponds to the original VLAD. Expanding this equation and canceling the e^{−α‖x_i‖²} terms yields a softmax function in Equation (12):

ā_p(x_i) = e^{w_p^T x_i + b_p} / Σ_{p'} e^{w_{p'}^T x_i + b_{p'}}, (12)

where w_p = 2αc_p and b_p = −α‖c_p‖². Then, the soft-assignment version of V is determined by Equation (13):

V(j, p) = Σ_{i=1}^{N} ā_p(x_i) (x_i(j) − c_p(j)). (13)

{w_p}, {b_p}, and {c_p} are the trainable parameter sets of NetVLAD. Compared with the original VLAD, which only has the parameter set {c_p}, NetVLAD has three sets of parameters, enabling greater flexibility. As shown in Figure 3, the NetVLAD layer includes three steps before the normalizations: (1) a convolution "conv(w, b)" with P filters {w_p} of size 1 × 1 and biases {b_p}; (2) the following "softmax" function, which computes the weights ā_p(x_i) by Equation (12); and (3) the "VLAD core", which takes ā_p(x_i) as parameters to calculate the output matrix V by Equation (13). Afterward, intra-normalization and L2-normalization are used to normalize the aggregated P × F features. A fully connected layer and L2-normalization are added to obtain a concise 1 × 256 global descriptor, followed by a vector expansion operation to get the N × 256 global features, which are then concatenated with the local features for further processing.
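The soft assignment, the VLAD core, and the two normalizations can be sketched in NumPy. This is a simplified illustration (the fully connected reduction and the per-point expansion are omitted; the intra-normalization is per-cluster L2, as in NetVLAD), not the paper's implementation:

```python
import numpy as np

def netvlad(X, W, b, C):
    """X: (N, F) local features; W: (F, P) and b: (P,) soft-assignment
    parameters; C: (P, F) cluster centers. Returns a normalized P*F vector."""
    # Soft assignment via softmax(W^T x + b), Eq. (12).
    logits = X @ W + b
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                       # (N, P)
    # VLAD core, Eq. (13): V[p, j] = sum_i a[i, p] * (X[i, j] - C[p, j]).
    V = np.einsum('np,nf->pf', a, X) - a.sum(axis=0)[:, None] * C
    # Intra-normalization (per cluster), then global L2-normalization.
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

The `einsum` form computes all P residual sums at once instead of looping over clusters, which keeps the aggregation a single batched matrix operation.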

Permutation Invariance
For unordered point cloud data, a model is required to be invariant to input permutation. The local feature encoder uses two symmetric functions, max pooling and average pooling, to aggregate the information from each point, and is thus invariant to input order. We show below that the NetVLAD layer also meets this requirement, meaning that it can be embedded into the feature merging encoder for point cloud feature learning.
The 1 × 1 kernel transforms the feature of each input independently, without considering spatial relations with the other inputs, so it is not affected by the input permutation; hence, the parameters obtained from the "conv(w, b)" and "softmax" operations are stable. Therefore, the invariance of NetVLAD to input permutation reduces to that of the "VLAD core".
Given N input feature descriptors X = {x_1, x_2, ..., x_N} and the trained parameters of P clusters {w_p}, {b_p}, and {c_p}, the output of the "VLAD core" is V = [V(j, p)], where V(j, p) aggregates the residuals of the jth component of each descriptor to the pth cluster center:

V(j, p) = Σ_{i=1}^{N} ā_p(x_i) (x_i(j) − c_p(j)). (14)

It is a sum of a function over the descriptors x_i. For another input X̃ with the same descriptors as X but in a different order, the corresponding "VLAD core" output Ṽ = [Ṽ(j, p)] is calculated as

Ṽ(j, p) = Σ_{i=1}^{N} ā_p(x̃_i) (x̃_i(j) − c_p(j)). (15)

As the parameters are stable to input order, the soft assignment ā_p(x_i) and the residual of the same descriptor remain unchanged, which means Equations (14) and (15) yield the same output. Therefore, the "VLAD core" is invariant to input permutation, and so is the NetVLAD layer.
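This argument can be checked numerically: since the "VLAD core" is a sum over descriptors, shuffling the input rows (with their soft assignments reordered accordingly) leaves the output unchanged. A small self-contained check, not from the paper's code:

```python
import numpy as np

def vlad_core(X, a, C):
    # V[p, j] = sum_i a[i, p] * (X[i, j] - C[p, j]); summation is order-free.
    return np.einsum('np,nf->pf', a, X) - a.sum(axis=0)[:, None] * C

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))            # 50 descriptors, F = 8
C = rng.normal(size=(4, 8))             # P = 4 cluster centers
logits = X @ rng.normal(size=(8, 4))    # per-descriptor assignment scores
a = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

perm = rng.permutation(50)              # same descriptors, different order
assert np.allclose(vlad_core(X, a, C), vlad_core(X[perm], a[perm], C))
```

Because the soft assignment of each descriptor depends only on that descriptor, permuting the rows of X and of a together models exactly the reordered input X̃ of Equation (15).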

Datasets
To evaluate the performance of the proposed network, we test it on two different datasets: the high-density indoor datasets S3DIS [1] and ScanNet [38], as demonstrated in Figure 4.
The S3DIS dataset contains six large-scale indoor areas in three buildings, involving 11 room types, which are annotated with 13 semantic categories. It covers an area of more than 6000 square meters and exceeds 200 million points, each composed of 3D coordinates, colors, and a point label. ScanNet is a dataset of richly annotated RGB-D scans of real-world environments, containing 2.5 M RGB-D images in 1513 scans acquired in 707 distinct spaces, with estimated calibration parameters, camera poses, 3D surface reconstructions, textured meshes, and dense object-level semantic segmentations.

Experimental Details
As the S3DIS dataset is divided into six areas, we follow the same evaluation protocol used in PointNet [11]: a 6-fold cross-validation over all areas, namely, choosing five areas as the training set and one as the testing set in each run, and using cross-validation to build six models covering the whole dataset. Three widely used metrics are employed to measure the segmentation performance: overall accuracy (oAcc), mean accuracy (mAcc), and mean intersection over union (mIoU).
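The three metrics can all be derived from a per-class confusion matrix. A minimal NumPy sketch using the standard definitions (function name is ours, not from the paper's code):

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """pred, gt: integer label arrays of equal length.
    Returns (oAcc, mAcc, mIoU)."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt, pred), 1)
    tp = np.diag(cm).astype(float)
    oacc = tp.sum() / cm.sum()                       # overall point accuracy
    class_acc = tp / np.maximum(cm.sum(1), 1)        # per-class recall
    # IoU per class: TP / (TP + FP + FN).
    iou = tp / np.maximum(cm.sum(1) + cm.sum(0) - tp, 1)
    return oacc, class_acc.mean(), iou.mean()
```

Note that mAcc and mIoU average over classes with equal weight, so rare classes influence them far more than they influence oAcc.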

Parameter Evaluations
Before conducting comparisons with state-of-the-art methods, we first evaluate the influence of two critical parameters on semantic segmentation to choose the best settings: the number of neighboring points K in the KNN algorithm and the number of clusters D in the NetVLAD layer. Our evaluations are carried out on the S3DIS dataset, with "Area_5" as the testing set and the other five areas as the training set.

Parameter K
The parameter K controls the scope of neighbor searching and the volume of local information for training. Too small a value of K leads to inadequate local contextual information, while too large a value may include irrelevant noise and is computationally expensive. The evaluation of K is performed by comparing the segmentation results of KGCN under different settings. Table 1 lists the segmentation metrics under six settings of K, from which the trend can be observed even though only six tests were carried out. The best setting is K = 20 (highest mAcc, mIoU, and oAcc); segmentation accuracy does not keep improving as more neighboring information is counted, contrary to what is usually anticipated. In the following evaluations, we fix parameter K to 20 for both KGCN and KVGCN.

Parameter D

In the NetVLAD layer of the proposed KVGCN, the input features are grouped into D clusters before the output of the "VLAD core" is calculated. This preset parameter D has a major impact on the performance of KVGCN. Here, we experimentally determine the optimal setting for D; the segmentation results under different D are shown in Table 2. The segmentation accuracy rises as D ranges from 4 to 12 and drops afterwards. The optimal cluster number is 12, which is the setting for parameter D in the following evaluations.

S3DIS Dataset

We gather the segmentation results of the proposed KGCN and KVGCN models on the S3DIS dataset by 6-fold cross-validation and compare them with state-of-the-art methods. The overall accuracy results are reported in Table 3, and the IoU scores of each semantic class are reported in Table 4. The mean IoU, overall accuracy, and mean accuracy of KGCN are, respectively, 59.0%, 85.8%, and 70.9%, better than PointNet [11], G+RCU [2], SEGCloud [3], RSNet [13], and A-SCN [19]. By combining the KNN graph convolution and the NetVLAD layer in the feature encoder, the KVGCN model further lifts the segmentation accuracy on S3DIS, with the three metrics reaching 60.9%, 87.4%, and 72.3%.
Meanwhile, our KVGCN model improves the IoU results in five classes (Table 4). These quantitative results show that NetVLAD is an effective feature pooling method for aggregating global features from local features and can promote the accuracy of point cloud semantic segmentation. To present intuitive observations, we select three samples to visualize the segmentation results, shown in Figure 5. KGCN segments most of the tables and chairs in the scenes, but the separation of certain parts (e.g., the chairs in the first two samples, denoted by white boxes) and of easily confused areas (e.g., the corner with several different objects and the junction of wall and windows in the third sample, denoted by white boxes) is ambiguous. KVGCN aggregates the local and global contextual features and improves the segmentation in these areas. Though better segmentation boundaries are obtained by KVGCN, there are a few negative examples, marked by black boxes in the samples. Overall, KVGCN achieves segmentation comparable to the ground truth on the S3DIS dataset.

ScanNet Dataset
To evaluate the segmentation performance of the proposed models on the ScanNet dataset, which provides no color information, 1201 of the 1513 scanned scenes are used for training and the remaining 312 for testing. Table 5 gives the overall accuracy results, and Table 6 lists the IoU scores of each semantic class. Results in Table 6 reveal that KGCN outperforms PointNet [11], PointNet++ [12], and RSNet [13] in IoU on 11 categories (e.g., floor, chair, bathtub, bath curtain, etc.) and maintains comparable results on the other nine, meaning that the local feature encoders in KGCN can recognize and provide distinctive feature representations in local regions. With the global aggregation layer NetVLAD combined in the feature encoder, KVGCN obtains richer semantic features to distinguish different objects. KVGCN improves the segmentation IoUs of all classes in this dataset, including several challenging classes (such as sofa, sink, window, and photo), and achieves the best performance in 13 classes compared with the other methods. The overall results in Table 5 verify the advantages of KVGCN in the three metrics mIoU, oAcc, and mAcc. Intuitive examples are demonstrated in Figure 6. KGCN performs better in semantic parts with larger data portions, producing fewer mislabeled points than existing methods. KVGCN gets closer to the ground truth in semantic parts such as chairs and walls, as well as in easily confused regions, such as the joints between chair/desk and floor, and generates fewer mislabelings on the three examples. This illustrates that the proposed feature merging encoder in KVGCN is beneficial for aggregating meaningful global information and improves the semantic segmentation.

Discussion
Establishing the local graph structure of the query point and convolving the graph edges to obtain powerful local features shows great advantages in feature representation. It extracts more sophisticated features from the graph structure than PointNet, which directly captures single-point features, and causes less data loss than the slicing in RNN-based methods. The choice of K in KNN graph construction plays a crucial role in the feature gathering. In this work, we employ a fixed value for K, which could limit the feature learning ability in dense areas. An adaptive adjustment strategy for K, or ensuring uniformly distributed input points, could be a plausible solution. A similar concern applies to parameter D in KVGCN.
Another issue is that four feature encoders are assembled in the proposed KGCN and KVGCN for feature learning. As a general rule, a greater number of encoders implies a deeper network, meaning that more distinctive features could be learnt. However, too deep a network may result in gradient explosion or overfitting during training. In this work, we chose four encoders based on experimental tests. Table 7 gives the segmentation performance of KGCN on "Area_5" of S3DIS using different numbers of local feature encoders. The segmentation metrics lift significantly as more encoders are used, especially from three to four, and start declining when more than four encoders are utilized. Therefore, we use four encoders in all evaluations of the proposed KGCN and KVGCN.

Assembling multiple local feature encoders does not guarantee that reliable global features can be trained. The segmentation of KGCN in object-concentrated or easily confused areas is below average. To aggregate representative global vectors and make full use of the semantic information, the trainable version of VLAD, the NetVLAD layer, is introduced into our network. Similar to the Fisher Vector [39] or other single-vector representations [40], NetVLAD considers each element of the local features and depicts fine-grained details without data loss. The most important characteristic of NetVLAD is that it is amenable to training via back-propagation, which makes it pluggable into other deep learning architectures. By embedding NetVLAD, the proposed KVGCN significantly improves the segmentation performance. Figures 7 and 8 illustrate more segmentation results of KVGCN. Compared with the best available network [41], our network achieves comparable oAcc and mAcc results but falls behind on the mIoU metric. As a concrete manifestation, the separation of similarly shaped areas, such as walls, embedded boards, and objects in corners, requires more distinguishable features in model training.
However, NetVLAD has shown its potential power in feature aggregating; the study of exploiting and enhancing NetVLAD for global feature aggregation could be a promising future direction.

Conclusions
This work designs a new graph convolutional network architecture trained for semantic segmentation of point cloud data in an end-to-end manner. The input points and the KNN graph topological information are sent into four sequentially assembled feature encoders to train powerful features for semantic segmentation. The encoder performs edge convolution on each query point and its K nearest neighbors to gain rich details, outperforming existing methods that only focus on isolated points. Another highlight is the NetVLAD layer, a pooling mechanism with learnable parameters that can be trained via back-propagation, which we introduce into the proposed feature merging encoder to aggregate a global representation from local features. This combination of the KNN graph and the NetVLAD layer pays off in the final segmentation. We evaluated the proposed network on two challenging indoor datasets, and the results show its superiority in segmentation (better mIoU, oAcc, and mAcc). Based on this work, further research could focus on designing a network with adaptive parameters and high training efficiency.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: