3.1.1. Indoor3DNet
As illustrated in Figure 2, Indoor3DNet is a deep semantic labeling network that first encodes each point position into a high-dimensional vector by concatenating the embeddings of the $x$, $y$, and $z$ axes. The network then progressively aggregates local spatial features through a multi-stage hierarchy of farthest point sampling (FPS), $k$-nearest neighbors ($k$-NN), and MLPs.
Position encoding. The 3D positions of the point cloud are encoded with trigonometric functions, inspired by the methods described in PointNN [51] and PCT [52]. For each point $p$ within the point cloud $P$, its $x$, $y$, and $z$ coordinates are embedded separately by sine and cosine functions, and the resulting embeddings are concatenated to form a high-dimensional feature vector:
$$\mathrm{PE}_{(v,\,2i)} = \sin\!\left(\frac{v}{10000^{2i/d}}\right), \qquad \mathrm{PE}_{(v,\,2i+1)} = \cos\!\left(\frac{v}{10000^{2i/d}}\right),$$
where $d$ denotes the dimension of the feature, $i \in \{0, 1, \ldots, d/2-1\}$, and the notation $v$ represents a value from one of the dimensions $x$, $y$, or $z$. The wavelengths form a geometric series ranging from $2\pi$ to $10000 \cdot 2\pi$.
The inherent properties of trigonometric functions enable the encoded feature vector to effectively capture relative position information between points: the spatial relation of two points can be depicted by the dot product of their embeddings, and for any fixed offset $k$, $\mathrm{PE}_{v+k}$ can be represented as a linear function of $\mathrm{PE}_{v}$. This facilitates the extraction of high-frequency information and the capture of fine-grained structural changes in 3D shapes.
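To make the encoding concrete, the sketch below shows a Transformer-style sine/cosine embedding of each coordinate axis, concatenated across $x$, $y$, and $z$. The per-axis dimension and the 10000 scale constant are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch (not the authors' code): sinusoidal position encoding
# for point coordinates, with sin/cos pairs over a geometric frequency series.
import numpy as np

def encode_axis(v, d_axis=42):
    """Embed one coordinate axis (shape [N]) into [N, d_axis] with sin/cos pairs."""
    i = np.arange(d_axis // 2)                       # frequency index
    freq = 1.0 / (10000.0 ** (2 * i / d_axis))       # geometric series of wavelengths
    angles = v[:, None] * freq[None, :]              # [N, d_axis/2]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def position_encoding(points):
    """Concatenate the x, y, z embeddings into one high-dimensional vector."""
    return np.concatenate([encode_axis(points[:, k]) for k in range(3)], axis=1)

pts = np.random.rand(1024, 3)        # a toy point cloud
feat = position_encoding(pts)        # [1024, 3 * 42] encoded features
```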
Deep feature extraction sub-network. Figure 3 illustrates the deeply supervised encoder–decoder sub-network. Adopting the encoder–decoder architecture of [55], we expand the network by adding more intermediate nodes and incorporating skip connections between them. At the end of each down-sampling (DS) block, an offset-attention (OA) mechanism, as originally proposed in PCT [52], is incorporated. Additionally, deep supervision is employed so that gradients propagate back to the intermediate nodes, yielding faster and more effective convergence. This, in turn, allows the network depth to be adjusted flexibly through pruning, balancing accuracy against efficiency.
We use a four-layer network structure to further boost accuracy when computational resources are available. Each DS block conducts local feature extraction as local information aggregation, starting with FPS and followed by $k$-NN to obtain local 3D region sets. Since the proposed network directly processes points, FPS is adopted to reduce the density of the point cloud while preserving its overall shape and structural features. The $k$-NN operator then clusters the $k$ nearest neighbour points around each sampled point to construct the local region sets. Additionally, we apply an MLP to each local 3D region to extract local features, followed by an aggregation step that employs both max and average pooling, denoted as
$$f = \lambda \cdot \mathrm{Max}(\mathcal{F}) + (1 - \lambda) \cdot \mathrm{Avg}(\mathcal{F}),$$
where $\mathrm{Max}(\cdot)$ denotes the maximum-value operator, $\mathrm{Avg}(\cdot)$ denotes the average-value operator over the local feature set $\mathcal{F}$, and $\lambda$ is a weighting coefficient. Notice that max pooling alone may discard a considerable amount of local information by selecting only the maximum feature, while average pooling can attenuate key features. By combining max and average pooling, we aggregate local information more completely, enhancing the feature representation.
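The sampling-and-aggregation pipeline of a DS block can be sketched as follows. The sample and group sizes, the use of raw coordinates as stand-in features (in place of the MLP output), and the value of $\lambda$ are assumptions for illustration.

```python
# Minimal sketch of one DS block's pipeline: FPS -> k-NN grouping -> pooling.
import numpy as np

def farthest_point_sampling(points, m):
    """Greedily pick m points that are mutually far apart (O(N*m))."""
    n = points.shape[0]
    chosen = [0]
    dist = np.full(n, np.inf)
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(np.argmax(dist)))
    return np.array(chosen)

def knn_group(points, centers, k):
    """For each sampled center, return the indices of its k nearest points."""
    d = np.linalg.norm(points[None, :, :] - points[centers][:, None, :], axis=2)
    return np.argsort(d, axis=1)[:, :k]              # [m, k]

def aggregate(features, groups, lam=0.5):
    """Max-average pooling over each local region: lam*max + (1-lam)*avg."""
    local = features[groups]                         # [m, k, C]
    return lam * local.max(axis=1) + (1 - lam) * local.mean(axis=1)

pts = np.random.rand(2048, 3)
centers = farthest_point_sampling(pts, 512)
groups = knn_group(pts, centers, k=16)
pooled = aggregate(pts, groups)   # raw coords stand in for MLP features here
```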
After the pooling process, the features are refined through OA, which is estimated from the input feature $F_{in}$ of the self-attention (SA) block and its output feature $F_{sa}$. The result $F_{out}$ of OA is the offset between the input features $F_{in}$ and the output features $F_{sa}$ of the enhanced SA, as follows:
$$F_{out} = \mathrm{LBR}(F_{in} - F_{sa}) + F_{in},$$
where $\mathrm{LBR}(\cdot)$ represents the sequential Linear, BatchNorm, and ReLU layers. Then, the up-sampling (US) block uses the same inverse distance interpolation as in PointNet [45], followed by global feature learning through an MLP.
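A condensed PyTorch sketch of the OA refinement is given below. The plain scaled dot-product attention stands in for PCT's exact attention normalization, and all layer sizes are illustrative.

```python
# Sketch of offset-attention: F_out = LBR(F_in - F_sa) + F_in.
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        # LBR: Linear -> BatchNorm -> ReLU, applied to the offset
        self.lbr = nn.Sequential(
            nn.Linear(channels, channels),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )

    def forward(self, f_in):                         # f_in: [N, C] point features
        attn = torch.softmax(
            self.q(f_in) @ self.k(f_in).T / f_in.shape[1] ** 0.5, dim=-1)
        f_sa = attn @ self.v(f_in)                   # self-attention output
        return self.lbr(f_in - f_sa) + f_in          # offset + residual

oa = OffsetAttention(64)
out = oa(torch.randn(1024, 64))                      # refined features, [1024, 64]
```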
Loss function. In the encoder–decoder sub-network, each down-sampling layer has a corresponding up-sampling layer whose output has the same size as the input point cloud, as the four top-layer nodes in Figure 3 show. The loss of each layer employs a cross-entropy measure, and the overall loss is defined as a weighted aggregation of the individual losses from the decoder stages:
$$\mathcal{L} = \sum_{l} w_l \mathcal{L}_l, \qquad \mathcal{L}_l = -\frac{1}{N} \sum_{i=1}^{N} y_i \log \hat{y}_i,$$
where $\mathcal{L}_l$ represents the loss of layer $l$ and $w_l$ its weight. In this context, $y_i$ is the one-hot encoded label vector representing the true label, where each class is represented by a binary vector with a 1 in the position of the correct class and 0 elsewhere; $\hat{y}_i$ is the predicted label probability vector generated by the model, indicating how likely each class is for a given input; $i$ indexes the points within the layer; and $N$ is the total number of points within the layer.
At the four top-layer nodes we predict the semantic label of each point, so each node maintains segmentation capability. The skip connections allow the network to propagate gradient information to the earlier layers.
Similar concepts are also employed in RFFS-Net [56], which targets airborne laser scanning (ALS) data and uses multi-scale receptive field fusion and stratification to address complex structures and scale variations. It incorporates Dilated Graph Convolution (DGConv) and Annular Dilated Convolution (ADConv) to capture multi-scale features, integrates features via the DAGFusion module, and optimizes classification with a Multi-level Receptive Field Aggregation Loss (MRFALoss). In contrast, Indoor3DNet is designed for indoor point clouds, incorporating positional encoding to capture spatial relationships. It employs FPS and $k$-NN for feature extraction, followed by aggregation using max–average pooling (MAP). Its loss function is a weighted sum of cross-entropy losses from the decoder stages. In summary, RFFS-Net excels at handling complex ALS data through multi-scale fusion, while Indoor3DNet leverages positional encoding and precise local feature extraction for indoor classification.
3.1.2. Super-Point Guided Instance Segmentation
Growth of super-points. To generate instance segmentation, we use a modified Voxel Cloud Connectivity Segmentation (VCCS) [57] method to over-segment the input point cloud. The coordinates of the patch centroids serve as the coordinates of the corresponding super voxels, while the normal vectors of the super voxels are determined from the points within the patches. Subsequently, a region-growing algorithm incrementally merges neighbouring super voxels into larger homogeneous regions, i.e., super-points.
The growth of super-points is constrained by the following criterion: if the spatial distance and feature dissimilarity between neighbouring super-points fall within pre-defined thresholds, they are merged. This merging process is crucial for label propagation, ensuring that all points within the same super-point receive a uniform clustering label and thereby providing a higher-level representation of the scene.
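A minimal sketch of the merging criterion is shown below, assuming centroid distance and normal-vector agreement as the spatial and feature tests; the thresholds are illustrative.

```python
# Heuristic sketch of the super-point growth test: merge two neighbouring
# super voxels when both criteria fall below preset thresholds.
import numpy as np

def should_merge(c1, n1, c2, n2, d_max=0.3, angle_max=0.2):
    """c*: centroids, n*: unit normals of two neighbouring super voxels."""
    close = np.linalg.norm(c1 - c2) < d_max          # spatial criterion
    similar = 1.0 - abs(np.dot(n1, n2)) < angle_max  # feature (normal) criterion
    return close and similar
```

Merged super voxels then share a single clustering label, which is how the label propagates to every point inside the grown super-point.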
Final instance segmentation. A decision fusion process is applied to integrate two types of labels (from the deep network and the super-point clustering) to generate final instance labels that are closely aligned with the distribution of real-world objects. The decision fusion process involves a comparative analysis between the semantic and clustering labels for each point, with subsequent label adjustments based on the intersection coverage between these labels.
Specifically, we designed a heuristic algorithm that adjusts the semantic and clustering labels of each point toward their true values in a bottom-up manner, combining local analysis with global optimization. The algorithm focuses on the boundaries where semantic and clustering labels disagree in 3D space. First, we mark and classify points, as shown in Figure 4. The algorithm selects the $k$ nearest neighbors of each point in its local neighborhood using a ball query, then compares the labels of these neighbors to detect discrepancies: if the semantic labels vary while the clustering labels are uniform, or vice versa, the point is marked to reflect the discrepancy.
The mark of each point is expressed as a pair (left: all semantic labels in the local neighborhood; right: all instance labels in the local neighborhood). After traversing all points, we merge some semantic/clustering labels according to pre-defined decision rules: points with identical marks are merged and classified together, as are points whose marks exhibit a containment relationship on one side.
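The marking step might be sketched as follows, with a $k$-NN search standing in for the ball query; the neighborhood size is an assumption.

```python
# Sketch of point marking: record the semantic/instance label sets seen in
# each point's neighbourhood; mark points where exactly one set is mixed.
import numpy as np

def mark_points(points, sem, inst, k=8):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    nbrs = np.argsort(d, axis=1)[:, :k]              # k-NN stand-in for ball query
    marks = []
    for nb in nbrs:
        sem_set, inst_set = set(sem[nb]), set(inst[nb])
        # mark = (semantic labels, instance labels) in the neighbourhood,
        # kept only when one side varies while the other is uniform
        if (len(sem_set) > 1) != (len(inst_set) > 1):
            marks.append((frozenset(sem_set), frozenset(inst_set)))
        else:
            marks.append(None)                       # labels agree locally
    return marks
```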
Then, for the points classified into one category, we conduct decision fusion analysis on the clustering regions they occupy. An example of instance segmentation is shown in Figure 5. If the region contains the same clustering label but different semantic labels (Case 1), we trust the semantic boundary and modify the instance labels of these points accordingly. If the region contains different clustering labels but the same semantic label, we decide based on the number of classified points and the number of clusters in the region: if the number of classified points is below a pre-defined threshold (Case 2), or the number of clusters in the region exceeds a pre-defined threshold (Case 3), we trust the semantic boundary and modify the instance labels accordingly; otherwise (Case 4), we trust the clustering boundary and modify the instance labels according to it. Finally, if the region contains both different semantic and different clustering labels (Case 5), we trust the semantic boundary and modify the instance labels accordingly.
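The five cases reduce to a small decision rule; the threshold values below are illustrative assumptions.

```python
# Condensed sketch of the five fusion cases. n_marked = marked points in a
# clustering region; n_clusters = distinct clustering labels in that region.
def choose_boundary(same_cluster, same_semantic, n_marked, n_clusters,
                    t_points=50, t_clusters=3):
    if same_cluster and not same_semantic:       # Case 1: trust semantics
        return "semantic"
    if same_semantic and not same_cluster:
        if n_marked < t_points:                  # Case 2: few points -> semantics
            return "semantic"
        if n_clusters > t_clusters:              # Case 3: fragmented -> semantics
            return "semantic"
        return "clustering"                      # Case 4: trust clustering
    return "semantic"                            # Case 5: both differ -> semantics
```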