BEMF-Net: Semantic Segmentation of Large-Scale Point Clouds via Bilateral Neighbor Enhancement and Multi-Scale Fusion

Abstract: The semantic segmentation of point clouds is a crucial task in 3D reconstruction. However, achieving precise semantic segmentation remains a significant hurdle. In this paper, we present BEMF-Net, an efficient method for large-scale environments. It starts with an effective feature extraction method. Unlike images, 3D data comprise not only geometric relations but also texture information. To accurately depict the scene, it is crucial to take into account the impacts of texture and geometry on the task, and to incorporate modifications that improve feature description. Additionally, we present a multi-scale feature fusion technique that effectively promotes the interaction between features at different resolutions. The approach mitigates the smoothing of detailed information caused by downsampling mechanisms, while ensuring the integrity of features across different layers, allowing a more comprehensive representation of the point cloud. We confirmed the effectiveness of this method by testing it on the benchmark datasets S3DIS, SensatUrban, and Toronto3D.


Introduction
With the advancement of science and technology in recent years, 3D data have played an increasingly important role in intelligent analysis and simulation. Point clouds, an important segment of visual data, are obtained by 3D sensors from either individual objects or entire scenes. Point clouds also represent the digital mapping of the real world, providing a comprehensive understanding of the state of a large and complex environment. This has led to a significant shift in focus to 3D point clouds in areas such as smart cities [1][2][3][4][5], autonomous driving [6][7][8], and land monitoring [9][10][11]. The challenge, however, is that point clouds exist in the form of discrete point collections, making their effective processing a complex task.
Semantic segmentation is crucial for both upstream and downstream tasks related to point clouds [12][13][14]. Directly acquired point cloud data lack auxiliary information, so categorizing each point and providing semantic information is necessary for the effective performance of subsequent related tasks. Currently, image semantic segmentation [15], change detection [16][17][18], and classification [19] have achieved significant success due to advances in deep learning. These achievements have spurred research towards the effective application of deep learning to point cloud tasks [20], which has become an important research direction.
As point cloud deep learning research progresses, related tasks have yielded certain outcomes, leading to numerous novel methods [21][22][23][24]. In 2017, PointNet [25] introduced a direct approach to handling point clouds. After that, PointNet++ [26] utilized a methodology capable of perceiving local information. RandLA-Net [27] used the U-Net structure and a random sampling strategy to address large-scale scene applications by examining the point cloud information at different scales. Since then, a large number of methods have been proposed [28][29][30], mainly aimed at acquiring more comprehensive features and enhancing local information by constructing geometric neighborhood relationship maps to improve experimental results. Although these approaches have proven to be crucial in boosting the feature description capability for point cloud semantic segmentation, they still face a number of challenges.
Firstly, accurately describing point cloud information is the core problem of scene understanding. One way to achieve this is to create a geometric neighborhood relationship graph. However, this approach [31] also has its limitations. Geometric relationships created simply by finding the K-nearest neighbors in Euclidean space may not accurately capture local relationships. Local regions in a scene frequently display similar geometric patterns, making it difficult to effectively distinguish between them using a geometric relationship. Therefore, we propose the dilated bilateral block (DBB), which generates multiple feature spaces by incorporating additional information and exploits the differences between these spaces to improve feature representativeness. Image semantic segmentation is commonly achieved through color, while point clouds also carry texture information. This study enhances local information by establishing texture relationships, and precise segmentation is achievable in areas with a dense distribution of various semantic categories through the difference between the initial and offset spatial attributes.
Secondly, the effective utilization of information across varying scales is critical to solving the problem of accurately segmenting large-scale scenes. The encoder-decoder architecture adopts an inverted pyramid structure, allowing the integration of features at different scales. The downsampling process yields several point cloud segments with varying densities, and the point cloud becomes less dense at each lower layer. These point clouds enable the perception of neighborhood states through varying receptive fields. Subsequently, the upsampling structure combines this information to provide a comprehensive description at multiple scales. Several methods [32][33][34][35] have been proposed to enhance this framework for more efficient usage. However, the existing methods tend to fuse information layer by layer. This sequential fusion omits a considerable number of intricate details in sparser point clouds and hinders cross-scale information exchange, thereby diminishing the veracity of features. To address the aforementioned issues, we introduce the U-Fusion module, which incorporates a symmetrical structure of progressive aggregation and divergence. The purpose of progressive aggregation is to reduce the feature gap as the fusion proceeds and to prevent feature information from becoming blurred during multi-scale fusion. Furthermore, to guarantee the exchange of information between different scales and maintain the integrity of the data, we adopt a gradual divergence approach.
In summary, our main contributions are as follows:
• We propose the dilated bilateral block (DBB) module, which allows the fine-grained learning of point clouds and optimizes the understanding of their local relationships. The module enriches the neighborhood representation by constructing local texture relations. In addition, it uses the differences in the neighborhood space to effectively differentiate semantic class boundaries.
• We design a novel U-Fusion module, which facilitates the exchange of information from point clouds at multiple resolutions and ensures the effective utilization of features at each resolution.
• We propose BEMF-Net for the semantic segmentation of large-scale point cloud scenes and achieve excellent results on the public benchmark datasets tested (S3DIS, SensatUrban, and Toronto3D).

Semantic Segmentation on Point Cloud
As deep learning research on point clouds progresses, numerous new methods [36][37][38][39][40] have achieved excellent performance on semantic segmentation. Currently, there are three primary methods for semantically segmenting point clouds: projection-based, voxel-based, and point-based. Projection-based approaches [41] rely on a virtual camera to project the point cloud as a set of images into multiple viewpoints, then perform semantic segmentation through 2D image deep learning, and finally reproject the image segmentation results onto the point cloud. Voxel-based techniques [39] require the point cloud to be converted into voxels, and the points within each voxel block share the same semantic segmentation result. Both methods dilute the intricacies of geometric structural information within the point cloud data structure, leading to a reduction in its descriptive capacity. Point-based methods [42][43][44] do not require data type conversion for the point cloud and directly use points as input. PointNet [25] achieves this by using multilayer perceptrons (MLPs) to learn features, and a max-pooling layer for global feature extraction. PointNet++ [26] addresses the issue of insufficient local information by introducing the concept of neighborhood balls. Most of the methods are implemented through MLPs or graph convolution, while KPConv [33] takes a different approach by proposing kernel point convolution, a type of convolution suitable for 3D data. PCT [45] introduces a transformer module to achieve the interaction between global and local information.

Point Cloud Feature Extraction
The increasing research focus on point clouds has shifted the focus of feature extraction from individual points to local regions. In contrast to point-based methods, current advanced approaches [43,46,47] emphasize the extraction of valuable insights from local connections, typically established by spatial proximity. DGCNN [48] successfully depicts local information by using the Euclidean distance to find the K-nearest points and establishing edge relationships upon these points. SCF-Net [44] introduces the polar coordinate space to represent point clouds, aiming to overcome the problem of orientation sensitivity of certain objects. However, current methods lack sufficient localized information extraction. RandLA-Net [27] incorporates multi-resolution characteristics, which provide unique descriptions of local information at different resolutions, expanding the perceptual field through fusion. BAAFNet [32] uses both geometric relationships and semantic associations, leveraging bilateral information to mutually enhance and boost local contextual information. These methods independently process extensive information using their respective approaches, enabling the information to be effectively exploited. In this work, inspired by feature learning in image vision, we introduce texture information into the encoding process to enrich the local description. Furthermore, we effectively distinguish semantic category edges using the differences between neighborhood spaces.

Multi-Scale Feature Fusion
The U-Net framework is frequently implemented in image processing tasks. The inverted pyramid architecture enables the acquisition of features at diverse resolutions, which capture diverse neighborhoods depending on the resolution. Hence, several studies have been undertaken to effectively fuse information at multiple scales. Res-UNet [49] applies residual concatenation instead of the sub-modules of the original structure. Dense-UNet [50] interconnects each layer with the following layers. UNet++ [51] also adopted a similar approach to improve skip-connection processing.
For 3D point clouds, several methods likewise work on fusing multi-scale information. PointNet++ [26] achieves this through interpolative fusion. BAAFNet [32] involves a feature fusion module that uses an adaptive strategy to fuse features at different scales. ResDLPS-Net [35] presents a method for pairwise aggregation to effectively extract the appropriate neighborhood information. Meanwhile, MFA [34] utilizes dense skip connections to improve feature retention at the current resolution. Although these methods have produced satisfactory outcomes in real-world applications, they have overlooked a potential risk. Fine-grained segmentation is highly dependent on fully comprehending local information. The conventional U-Net approach only facilitates the information interaction between neighboring layers, with no regard for the impact of information at other resolutions. In contrast, our U-Fusion module not only explores the efficient utilization of multi-scale information but also facilitates the interaction of features at various resolutions to eliminate any possible perceptual blind spots.

Methodology
In 2D image semantic segmentation, the color information of each pixel is often utilized for feature representation and semantic discrimination. Similarly, in 3D point cloud semantic segmentation, we think color information, a semantic representation form for each point, can enhance the accuracy and robustness of semantic segmentation with proper handling and utilization. For the scene point cloud, we take the spatial coordinates and color information of each point as the raw input $P \in \mathbb{R}^{N \times 6}$. Firstly, we use a fully connected (FC) layer to perform feature extraction on $P$ to obtain an initial semantic feature $F \in \mathbb{R}^{N \times d}$. Then, $P$ and $F$ are jointly input into five consecutive encoders for feature encoding, which yields the encoding features $\{E_i\}_{i=1}^{5}$. Subsequently, a U-Fusion module is employed to facilitate feature interactions across multiple scales and layer-wise feature decoding. Finally, three FC layers are used to predict the final semantic labels $C \in \mathbb{R}^{N \times N_C}$ from the final decoding features, where $N_C$ is the number of object categories.

Encoder Module
The encoder module consists of five successive dilated bilateral blocks (DBBs), each of which includes two bilateral local aggregation (BLA) modules, as illustrated in Figure 1. For each encoder layer, $P$ serves as the guidance information to enhance and fuse with $F$. Afterward, random sampling (RS) is performed to reduce the resolution of the point cloud, and then we use an MLP to transform the feature into the specified dimension (the detailed settings are shown in the caption of Figure 1), and finally obtain the input for the next layer of the encoder. Following the encoding process through several encoders, the final encoded features will comprise a discriminative, spatially aware feature representation. At this stage, each feature will cover a broader receptive field, making it more global compared to the initial one.
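The random sampling step between encoder layers can be illustrated with a minimal sketch. The function name `random_sample`, the `ratio` parameter, and the toy data are hypothetical; the actual per-layer settings are given in the caption of Figure 1 and are not reproduced here.

```python
import random

def random_sample(points, features, ratio, seed=0):
    """Keep len(points) // ratio points chosen uniformly at random.

    The same indices are applied to points and features so that each
    surviving point keeps its own feature vector.
    """
    rng = random.Random(seed)
    n_keep = max(1, len(points) // ratio)
    idx = sorted(rng.sample(range(len(points)), n_keep))
    return [points[i] for i in idx], [features[i] for i in idx], idx

# Toy input: 16 points with xyz + rgb (6 values) and an 8-d feature each.
points = [[float(i), 0.0, 0.0, 0.5, 0.5, 0.5] for i in range(16)]
features = [[float(i)] * 8 for i in range(16)]

# Assumed ratio of 4, chosen only for illustration.
sub_points, sub_features, kept = random_sample(points, features, ratio=4)
```

Random sampling costs almost nothing compared with farthest-point sampling, which is why it suits large scenes, but it can drop fine details; the feature aggregation that precedes it is meant to compensate for that loss.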

Bilateral Local Aggregation
Given the point cloud $P$ and semantic features $F$, as shown in Figure 2, we first perform a K-nearest neighbor (KNN) search for each point $p_i$ of the point cloud and its corresponding semantic feature $f_i$. This allows us to obtain the neighboring points $p_i^K = \{p_i^1, p_i^2, \ldots, p_i^k\}$ and their corresponding features within a certain range of point $p_i$. Following the encoding approach of RandLA-Net [27], we apply relative position encoding to $p_i$ and its corresponding semantic feature $f_i$ based on the neighborhood relationships to capture the local geometric relationships around each point, as follows:

where $p_i^T = \mathrm{Tile}(p_i)$ and $f_i^T = \mathrm{Tile}(f_i)$ align $p_i$ and $f_i$ with the dimension of the neighborhood features. Here, Tile means expanding the point quantity dimension to $K$, Concat is the concatenation operation, and $\|\cdot\|$ is a scalar operator similar to the Euclidean distance: we first calculate the sum of the squares of the differences of all components in the feature space, and then take its square root as a measure of distance in the feature space.

However, the features obtained after relative position encoding may exhibit some ambiguity. This arises because KNN selects neighbors purely by distance, and in boundary regions that contain multiple classes, it usually results in feature inconsistency for points, making it challenging to distinguish between different semantic categories. To address this issue, we update each point's coordinate-color encoding feature and semantic encoding feature interactively to obtain two offsets $f_i^o$ and $p_i^o$ by using a linear layer (e.g., an MLP layer) and their corresponding neighborhood features. These offsets are then concatenated with the previous encoding features to generate $\hat{p}_i$ and $\hat{f}_i$. This prevents features from being confined to a single feature space, mitigating errors and making the feature representation more distinctive and representative, as follows:

As mentioned before, pixel-level segmentation in 2D primarily relies on color information. Therefore, we assume that the primordial coordinates and color information can provide supervisory guidance for point cloud semantic segmentation. To prevent information redundancy caused by the continuous encoding of features, we once again concatenate the relative position encoding features with the bilateral features $f_{bi}$. This results in a multifaceted feature representation $\tilde{f}_i$ that takes into account geometry, color, semantics, and the primordial information of the point cloud:

Due to the abundance of information contained in the multifaceted features $\tilde{f}_i$, we employ a straightforward attention mechanism to enable the network to automatically select highly representative information while discarding irrelevant features. This results in a more comprehensive and highly representative bilateral local aggregation enhanced feature $f_i^1$:

where $\delta(\cdot)$ is the softmax activation function and $\odot$ is the element-wise product. Sum aggregates the features of the $K$ points by summation, thereby reducing the quantity dimension to a single point.
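To make the ingredients of this module concrete, the following is a minimal sketch of three of its building blocks: the KNN search, a RandLA-Net-style relative encoding (center, neighbor, difference, norm), and softmax-weighted summation over the K neighbors. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the exact terms concatenated in the paper's encoding may differ, the offset step is omitted, and the attention scores, which the network would learn, are passed in by hand.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k points closest to points[i] (Euclidean; includes i itself)."""
    order = sorted(range(len(points)), key=lambda j: math.dist(points[i], points[j]))
    return order[:k]

def relative_encoding(center, neighbors):
    """One row per neighbor: [center, neighbor, center - neighbor, ||center - neighbor||].

    Assumed layout, modeled on RandLA-Net's relative position encoding.
    """
    rows = []
    for nb in neighbors:
        diff = [c - n for c, n in zip(center, nb)]
        norm = math.sqrt(sum(d * d for d in diff))
        rows.append(list(center) + list(nb) + diff + [norm])
    return rows

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_sum(neighbor_feats, scores):
    """Per channel: softmax over the K neighbors, element-wise product, sum over K."""
    k, channels = len(neighbor_feats), len(neighbor_feats[0])
    out = []
    for ch in range(channels):
        w = softmax([scores[j][ch] for j in range(k)])
        out.append(sum(w[j] * neighbor_feats[j][ch] for j in range(k)))
    return out

# Toy example: four points in 3D, K = 3.
pts = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 5.0, 5.0]]
nbr = knn_indices(pts, 0, 3)
enc = relative_encoding(pts[0], [pts[j] for j in nbr])

# With all-zero scores the softmax is uniform, so the result is the neighbor mean.
pooled = attentive_sum([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [[0.0, 0.0]] * 3)
```

The same `relative_encoding` helper can be applied to semantic feature vectors instead of coordinates, which is the sense in which the text describes the norm as a distance in the feature space.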

Dilated Bilateral Block
As previously noted, our dilated bilateral block (DBB) consists of two BLA modules and is designed to further expand the neighborhood of features to capture a wider range of semantic information. However, there is a potential risk of losing the original information when expanding the perception region. To avoid this, inspired by ResDLPS-Net [35], we concatenate the outputs of the two BLA modules and update them through an MLP layer. Next, the input feature $f_i$ is also updated through an MLP layer and then summed with the fused output features, finally yielding the encoded feature $\bar{f}_i$:

In summary, the DBB module has the capability to learn a more comprehensive feature space in the context of semantic segmentation tasks. This representation not only adjusts and integrates coordinates, colors, and semantic information, but also encapsulates certain geometric structural details. As a result, it yields a comprehensive and distinctive encoded feature that effectively improves the accuracy and stability of semantic segmentation.
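The data flow described for the block (two BLA outputs concatenated and passed through an MLP, plus an MLP-updated shortcut from the input) can be sketched for a single point as follows. This is a simplified illustration: `linear` stands in for an MLP layer without bias or nonlinearity, the two BLA modules are passed in as arbitrary callables, and chaining the second module on the output of the first is an assumption modeled on RandLA-Net's dilated residual block.

```python
def linear(x, weight):
    """Bias-free linear layer: x has length n_in, weight has shape n_in x n_out."""
    n_out = len(weight[0])
    return [sum(x[i] * weight[i][o] for i in range(len(x))) for o in range(n_out)]

def dilated_bilateral_block(f_in, bla1, bla2, w_fuse, w_short):
    """Concatenate both BLA outputs, fuse them, and add the updated input feature."""
    f1 = bla1(f_in)                    # first aggregation
    f2 = bla2(f1)                      # second aggregation, wider neighborhood (assumed chaining)
    fused = linear(f1 + f2, w_fuse)    # list '+' concatenates the two outputs
    shortcut = linear(f_in, w_short)   # MLP-updated input feature
    return [a + b for a, b in zip(fused, shortcut)]

# Toy stand-ins for the two BLA modules and the layer weights.
double = lambda f: [2.0 * v for v in f]
plus_one = lambda f: [v + 1.0 for v in f]
w_fuse = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # 4 -> 2
w_short = [[1.0, 0.0], [0.0, 1.0]]                          # 2 -> 2 identity

encoded = dilated_bilateral_block([1.0, 2.0], double, plus_one, w_fuse, w_short)
```

The shortcut path is what guards against the loss of original information mentioned above: even if both aggregation steps smooth the feature, the input still reaches the output through its own layer.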

U-Fusion Module
The traditional U-Net [52] architecture typically consists of three components: encoders, decoders, and skip connections that link the features of each encoder and decoder layer. However, this simple form of skip connection is often inadequate for large-scale semantic segmentation. Considering outdoor urban scenes as an example, they typically contain semantic categories at multiple scales, such as large-scale objects like buildings and roads, and small-scale objects like cars and bicycles. In scene-level semantic segmentation, relying solely on the traditional U-Net architecture provides no feature interaction between different scales. Consequently, the encoding process enlarges the receptive field while losing a considerable amount of local and fine-grained details.
To address the aforementioned problems, we propose the U-Fusion module. This module innovatively combines encoding features from different layers, with a central focus on identifying the anchor layer (or intermediate layer). It integrates feature information from the local layer and global layer (the adjacent upper and lower layers relative to the anchor layer) and from the current anchor layer. This integration enables the anchor layer to access feature information from multi-scale receptive fields. Additionally, note that the dimensions for features at the same layer are the same.
We hope that the fused information will possess both global and local characteristics, resulting in comprehensive and extensive fused features. In this study, we selected the second and fourth layers, which are relatively intermediate, as anchor layers to obtain the fused features $F_1$ and $F_2$, as illustrated in Figure 3. After the initial fusion, to further enhance and deepen the fused features, we once again fuse $F_1$ and $F_2$ with the features from the third layer of the encoder. This final fusion yields the fused encoder features $F_3$, encompassing information from all scales, as follows:

where RS is random sampling, which is used to downsample the point cloud to a given size, thereby sparsifying the point cloud. IS represents interpolation sampling, which is used to upsample the point cloud to a specified quantity, primarily to restore the resolution of the point cloud. Conv denotes a convolution operation.

Similarly, in the decoding stage, we split the previously generated fusion encoder features in an order that is symmetrical to the previous fusion process and concatenate them with the corresponding decoder layers. This ensures that each decoder layer can access scale-specific feature information, preventing performance degradation due to information loss during the encoding process.
For each decoder layer, we first apply IS or RS to align the resolution of points in the concatenated features and then use a transformation layer to reduce the feature dimension, as follows:

where TransConv is the transpose convolution operation.
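The resolution alignment that precedes each fusion can be sketched as follows for one anchor layer. This is an illustration under assumptions rather than the module itself: nearest-neighbor lookup stands in for interpolation sampling (IS), index selection stands in for random sampling (RS), the convolution applied after concatenation is omitted, and all function names are hypothetical.

```python
import math

def downsample(feats, keep_idx):
    """RS stand-in: keep the features of the points that survive sampling."""
    return [feats[i] for i in keep_idx]

def nearest_upsample(coarse_xyz, coarse_feats, dense_xyz):
    """IS stand-in: each dense point copies the feature of its nearest coarse point."""
    out = []
    for p in dense_xyz:
        j = min(range(len(coarse_xyz)), key=lambda c: math.dist(p, coarse_xyz[c]))
        out.append(coarse_feats[j])
    return out

def fuse_at_anchor(denser_feats, keep_idx, anchor_xyz, anchor_feats,
                   sparser_xyz, sparser_feats):
    """Bring both neighboring layers to the anchor resolution, then concatenate."""
    down = downsample(denser_feats, keep_idx)
    up = nearest_upsample(sparser_xyz, sparser_feats, anchor_xyz)
    return [d + a + u for d, a, u in zip(down, anchor_feats, up)]

# Toy layers: 4 points (denser), 2 points (anchor), 1 point (sparser).
denser_feats = [[1.0], [2.0], [3.0], [4.0]]
keep_idx = [0, 2]                       # anchor points are denser points 0 and 2
anchor_xyz = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
anchor_feats = [[10.0], [20.0]]
sparser_xyz = [[2.0, 0.0, 0.0]]
sparser_feats = [[100.0]]

fused = fuse_at_anchor(denser_feats, keep_idx, anchor_xyz, anchor_feats,
                       sparser_xyz, sparser_feats)
```

After fusion each anchor point carries one feature from each of the three resolutions, which is the property the anchor layers rely on to see both local and global context.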

In summary, after feature compensation through the U-Fusion module, each decoder layer no longer relies solely on the feature information from its same-scale layer, in contrast to the traditional U-Net architecture. By fusing and compensating for multi-scale features, the final decoder features have robust scale-awareness. They can effectively recognize global semantic feature information while preserving fine-grained geometric details at a more local scale. With the help of these decoder features, the accuracy and robustness of our network in semantic segmentation tasks are significantly improved.

Experiments
In this section, we demonstrate the effectiveness of the proposed BEMF-Net through various benchmark datasets. First, we introduce the evaluation metrics used in the experiments, as well as the parameter settings and hardware configurations. Next, we provide a brief description of each dataset to aid in understanding their properties. Finally, we show the performance of BEMF-Net on different datasets, comparing it with state-of-the-art networks. We also present ablation studies on different modules to demonstrate the contribution of each individual module.

Experiment Settings
For each dataset, we used the coordinates and color information of points as inputs. The datasets are large-scale scene point clouds of two types: outdoor and indoor. Comprehensive experiments were conducted on both types of datasets; detailed quantitative results and qualitative visualizations demonstrate the generalization capability of BEMF-Net.
Evaluation metrics. For all experiment datasets, we employed the same evaluation metrics: overall accuracy (OA), intersection over union (IoU), and mean intersection over union (mIoU):

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{mIoU} = \frac{1}{N_C} \sum_{c=1}^{N_C} \mathrm{IoU}_c,$$

where TP is the true positive case, TN is the true negative case, FP is the false positive case, and FN is the false negative case.

Loss function. Like most point cloud segmentation tasks, we chose weighted cross-entropy as the loss function for all experiments, as follows:

$$L = -\sum_{i=1}^{N_C} \omega_i \, y_i \log(\hat{y}_i),$$

where $\omega_i = 1/\sqrt{r_i}$, $y_i$ and $\hat{y}_i$ are the ground truth and predicted class labels, respectively, and $r_i$ is the ratio of the number of points in the $i$-th category to the total number of points.

Configuration setting. For the parameter settings, we followed the configuration of RandLA-Net [27]. The number of neighbors for the K-nearest neighbor (KNN) search was set to 16, and the size and stride of the convolution kernel were both [1,1]. The initial learning rate was 0.01, with a 5% decrease every epoch, for a total of 100 epochs. All experiments were run on an Ubuntu system, using a single NVIDIA RTX 3090 GPU for training and inference, and the network was implemented in the TensorFlow framework.
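As a worked illustration of these settings, the sketch below computes OA, per-class IoU, and mIoU from label lists, derives the class weights $1/\sqrt{r_i}$, evaluates a weighted cross-entropy, and reproduces the learning-rate schedule. The function names and toy labels are hypothetical. OA is computed as the fraction of correctly labeled points, the usual multi-class reading, and the loss is averaged over points, which is an assumption about the reduction.

```python
import math

def overall_accuracy(y_true, y_pred):
    """Fraction of points whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def class_iou(y_true, y_pred, cls):
    """IoU = TP / (TP + FP + FN) for one class; 0.0 if the class never occurs."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

def mean_iou(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class IoU values."""
    return sum(class_iou(y_true, y_pred, c) for c in range(num_classes)) / num_classes

def class_weights(y_true, num_classes):
    """w_i = 1 / sqrt(r_i), with r_i the share of points in class i (0.0 if absent)."""
    n = len(y_true)
    weights = []
    for c in range(num_classes):
        r = sum(t == c for t in y_true) / n
        weights.append(1.0 / math.sqrt(r) if r > 0 else 0.0)
    return weights

def weighted_cross_entropy(probs, y_true, weights):
    """Mean over points of -w[y] * log(p[y]); probs[n][c] is a class probability."""
    losses = [-weights[t] * math.log(p[t]) for p, t in zip(probs, y_true)]
    return sum(losses) / len(losses)

def learning_rate(epoch, initial=0.01, decay=0.95):
    """Initial rate 0.01, reduced by 5% after every epoch."""
    return initial * decay ** epoch

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
oa = overall_accuracy(y_true, y_pred)
miou = mean_iou(y_true, y_pred, 3)
weights = class_weights(y_true, 3)
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, so mIoU is 0.5 even though OA is 2/3, which shows why mIoU is the stricter summary when classes are confused with one another.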

Dataset Description
SensatUrban [53] is a large-scale urban scene point cloud dataset constructed using UAV-based photogrammetry. The dataset was collected across several cities in the UK, covering approximately 7.6 square kilometers of scenes and containing over 3 billion points with semantic annotations. The entire point cloud is manually classified into thirteen object classes, including ground, vegetation, buildings, walls, bridges, parking lots, cars, bicycles, etc., covering most of the semantic categories commonly encountered in real-world urban areas. The multi-scale characteristics of the dataset serve to validate the robustness and effectiveness of the network. The entire point cloud dataset was divided into 34 tiles for training and testing purposes, and we followed the official partitioning method to ensure fair comparisons.
Toronto3D [54] was, as its name suggests, collected on a street in Toronto, Canada. It comprises approximately 78.3 million points annotated in eight semantic categories: roads, road markings, nature, buildings, utility lines, poles, cars, and fences. This dataset was acquired using a vehicle-mounted LiDAR system and is considered a standard city street scene dataset. The dataset was divided into four blocks: L001, L002, L003, and L004. Following the official partitioning method, we used L002 as the test set, while the rest were used as the training set. The ratio of the number of points of the test set to that of the training set is approximately 1:7.
The Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset [38] is one of the prominent indoor datasets, captured using Matterport scanners. The dataset is divided into 6 areas, covering a total area of 6000 square meters, including 272 rooms. It consists of thirteen common indoor object categories: ceilings, floors, walls, beams, columns, windows, doors, tables, chairs, sofas, bookshelves, boards, and clutter. In this study, we chose Area 5 as the test set, while Areas 1-4 and 6 served as the training set to assess the generalization ability of our network in indoor scenes.

Experiment Result and Analysis
In this section, we elaborate on the experimental results of BEMF-Net on three datasets: SensatUrban, Toronto3D, and S3DIS. The segmentation accuracy for each single class in the datasets is fully compared and illustrated. Overall, the extensive experiments demonstrate the effectiveness and robustness of our proposed method for the semantic segmentation of large-scale point clouds.

Evaluation on SensatUrban
Table 1 quantitatively demonstrates the leading position of our method on SensatUrban and establishes a new state-of-the-art (SoTA). To elucidate the impact of the different input information on the network and to demonstrate fairness in subsequent comparisons, we present our numerical results for two different types of input. The first type only uses the spatial coordinates of the points as input (w/o color), while the second type uses both the coordinates and colors of the points as input information (w/ color). Our method (w/ color) outperforms the baseline RandLA-Net [27] with an improvement of 2.9% and 9.1% in terms of OA and mIoU, respectively. Particularly noteworthy is its outstanding performance in categories such as railways and footpaths, highlighting the significant improvement of our method in detecting small-scale objects compared to most existing approaches. In addition, our method (w/o color) still achieves strong performance on SensatUrban. With a 4.2% advantage over the previous state-of-the-art method NeiEA-Net in terms of mIoU, our approach achieved higher accuracy on individual categories such as bridges, parking lots, and bicycles. Analyzed from the perspective of feature information, these categories of objects tend to rely more on positional information for classification, with color information potentially leading to misleading results. However, regardless of whether color information is used as input, the excellent comparative results demonstrate the superiority of our network. Overall, for urban scenes like SensatUrban, incorporating color information remains a favorable choice.
The visualizations in Figure 4 further illustrate the improvements of BEMF-Net in segmenting large-scale objects such as buildings and parking lots compared to the baseline RandLA-Net. Moreover, our method exhibits a strong recognition capability for small-scale objects such as cargo on rivers and railways. This demonstrates the ability of our method to perceive multi-scale objects in large scenes and the robustness of its segmentation.

Evaluation on Toronto3D
To validate the effectiveness of BEMF-Net in city street scenes, we conducted experiments on Toronto3D. As before, we provide the experimental results for two models (w/o and w/ color) to ensure the fairness and persuasiveness of the experiments. As shown in Table 2, our method retains a competitive advantage over most existing approaches, achieving an improvement of 2.6% in OA over the baseline. For individual categories, we achieve the highest segmentation accuracy in the nature and car categories. When comparing our two models, we find that the use of color leads to improved accuracy compared to relying solely on coordinates. For instance, notable improvements are observed in the segmentation of poles and cars. This suggests that, on Toronto3D, including color information benefits both the overall and the individual segmentation results. On the SensatUrban dataset, in contrast, the addition of color information led to a significant decline in accuracy for certain individual classes, which suggests that the effectiveness of incorporating color information varies across datasets. This, in turn, motivates future work on how to better leverage color information.

[Figure panels: Input, RandLA-Net, Ours (w/ color)]

Qualitative visualizations in Figure 5 highlight the excellent performance of our approach. In particular, there is a noticeable improvement in the delineation of road contours, even in scenarios with multiple overlapping objects in close proximity, indicating the effectiveness of BEMF-Net in scenarios where traditional single-space feature encoding may introduce ambiguities.

Evaluation on S3DIS
Initially, our network modules were designed for outdoor scenes, but we believe that BEMF-Net also possesses some degree of generalization for indoor environments. Therefore, we conducted validation experiments on the mainstream indoor segmentation dataset, S3DIS. As mentioned earlier in the methodology, indoor scenes often contain objects that are spatially very close, making it difficult to distinguish them using only spatial coordinates. Therefore, we used color information as a complement and constructed a bilateral feature space.
Table 3 presents our comparative results, which set a new SoTA among current point-based algorithms. From the comparisons, it is evident that both of our proposed models show improvements in the overall evaluation metrics such as OA and mIoU, thus confirming our initial hypotheses. When comparing the segmentation accuracy of individual objects, a notable improvement is observed in most categories, indicating the effectiveness of color in the context of S3DIS and similar indoor scenes. The additional provision of color information for densely distributed objects can efficiently enhance the discriminative capabilities of the segmentation network. We also achieved excellent results in the wall, chair, bookshelf, board, and clutter categories. As shown in Figure 6, where we showcase the segmentation results for several rooms, BEMF-Net performs exceptionally well when walls and other objects lie on the same surface, such as blackboards on walls or doors near bookshelves. This confirms that, even in indoor environments where object scales do not vary significantly, there are still objects with similar colors and geometries that are challenging to recognize. Thanks to our proposed bilateral encoding approach, our method does not rely solely on a single feature space for semantic representation. The integration of semantic information from multiple feature spaces effectively aids indoor semantic segmentation tasks, improving the segmentation accuracy and robustness across all scenes.

Ablation Studies
To elucidate the individual contribution and effectiveness of each core module we designed, we conducted ablation experiments separately for the bilateral local aggregation (BLA), dilated bilateral block (DBB), and U-Fusion modules. Since SensatUrban requires online submission, and Toronto3D has fewer test blocks, we selected S3DIS as the test dataset for the ablation experiments. Our baseline model is based on RandLA-Net [27], so all ablation variants are built by combining our modules with the RandLA-Net structure and comparing against it. The ablation models are described below:
• $a_1$: Replace RandLA-Net's local spatial encoding and attentive pooling modules with our BLA module. This is intended to validate the effectiveness of the proposed encoder and the enhancement provided by the inclusion of multifaceted reinforced features, including coordinates, colors, and semantics, for the segmentation task.
• $a_2$: Replace RandLA-Net's dilated residual block with our proposed DBB. This aims to demonstrate the effectiveness of the multi-receptive field space provided by dense connections for feature representation.
All evaluation metrics are consistent with the preceding experiments. We chose mIoU for the overall performance assessment, and detailed results and visualizations are shown in Table 4 and Figure 7. It is evident that our full model achieves the best performance. From the perspective of individual module contributions, the numerical results of $a_1$, $a_2$, and $a_3$ indicate that the most significant effect comes from our U-Fusion module ($a_3$), leading to an improvement in mIoU of almost 2% compared to the baseline. Additionally, model $a_4$ shows that removing the U-Fusion module significantly decreases the accuracy, highlighting the critical importance of multi-scale feature information for the semantic segmentation of large-scale point clouds. This demonstrates the indispensability of our multi-scale feature fusion module, U-Fusion, to the network. Moreover, the comparison of $a_1$, $a_2$, and $a_3$ with the baseline reflects the accuracy improvement achieved by our encoding modules and multi-scale fusion module. Conversely, $a_4$, $a_5$, and $a_6$ show how the modules interact: the modules are interdependent and cooperate with each other to improve semantic segmentation.
In addition, our proposed U-Fusion module effectively addresses the problem of multi-scale perception in large scenes, and can be easily transferred to other networks as a plug-and-play module and applied to various visual tasks.
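The ablation variants of this section can be summarized as a table of module switches. The sketch below is only one reading of the variant descriptions: it assumes each module can be toggled independently, and the class and field names are hypothetical rather than taken from the authors' code.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class AblationVariant:
    """Which proposed modules replace their RandLA-Net counterparts."""
    bla: bool       # bilateral local aggregation
    dbb: bool       # dilated bilateral block
    u_fusion: bool  # interlayer multi-scale fusion

# Assumed mapping, derived from the textual descriptions of a1 to a6.
ABLATION_VARIANTS = {
    "baseline": AblationVariant(bla=False, dbb=False, u_fusion=False),
    "a1": AblationVariant(bla=True, dbb=False, u_fusion=False),
    "a2": AblationVariant(bla=False, dbb=True, u_fusion=False),
    "a3": AblationVariant(bla=False, dbb=False, u_fusion=True),
    "a4": AblationVariant(bla=True, dbb=True, u_fusion=False),
    "a5": AblationVariant(bla=True, dbb=False, u_fusion=True),
    "a6": AblationVariant(bla=False, dbb=True, u_fusion=True),
    "full": AblationVariant(bla=True, dbb=True, u_fusion=True),
}

def enabled_modules(name):
    """Return the names of the modules switched on for a given variant."""
    variant = ABLATION_VARIANTS[name]
    return [f.name for f in fields(variant) if getattr(variant, f.name)]
```

Read this way, $a_1$ to $a_3$ each add one module to the baseline, while $a_4$ to $a_6$ each remove one module from the full model.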

Discussion on Hyperparameter
In this section, we primarily analyze the main influencing hyperparameters. Since our neighborhood construction algorithm is implemented through KNN, the choice of the number of neighbors in KNN has a significant impact on the type of neighborhood relationships we construct. We conducted tests with different K values on the S3DIS dataset, and the experimental results are shown in Figure 8.
We further investigated how the variation of K affects the segmentation results and found that the segmentation accuracy initially increases rapidly as K grows. This is because the expansion of the neighborhood provides a broader range of semantic information and geometric correlations, thus ensuring greater semantic consistency within the local region. However, once K reaches a certain threshold, the performance starts to decline. This is likely because an overly large neighborhood blurs local information when, as previously mentioned, multiple objects exist within a small space. Additionally, as K increases, the time required for neighborhood construction and feature extraction increases significantly, which is highly undesirable when processing large-scale scenes. In other words, a larger K leads to a substantial increase in time and memory consumption, with only minimal improvements in segmentation results.
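The trade-off described above can be illustrated with a toy example. The following sketch uses a brute-force KNN search (an assumption made for brevity; practical pipelines typically use a spatial index) and a simple "label purity" proxy, which is not a metric from our experiments, to show how a neighborhood that is too large starts to mix points from different objects.

```python
import numpy as np

def knn_indices(points, k):
    """Brute-force KNN: for each point, indices of its k nearest points.

    points: (N, 3) array. The result has shape (N, k) and includes the
    query point itself (distance 0). The pairwise step costs O(N^2) memory.
    """
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)
    return np.argsort(dist2, axis=1)[:, :k]

def neighbor_label_purity(labels, idx):
    """Average fraction of neighbors sharing the label of the center point."""
    return float((labels[idx] == labels[:, None]).mean())

# Two small, well-separated objects with four points each.
points = np.array([[0.0, 0, 0], [0.1, 0, 0], [0.2, 0, 0], [0.3, 0, 0],
                   [10.0, 0, 0], [10.1, 0, 0], [10.2, 0, 0], [10.3, 0, 0]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

purity_small_k = neighbor_label_purity(labels, knn_indices(points, 4))
purity_large_k = neighbor_label_purity(labels, knn_indices(points, 8))
```

With K = 4 every neighborhood stays inside one object, whereas K = 8 forces each neighborhood to span both objects. The neighbor index array also grows linearly with K, which is one source of the additional time and memory cost.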

Discussion on Loss Function
In this section, we present experiments and discussions on the weighted cross-entropy (WCE) loss commonly used in semantic segmentation. Specifically, we experimented with two different weighting schemes for WCE: the first is $w_{default} = 1/r_i$, which is also the default weighting method of WCE, while the second is $w_{sqrt} = 1/\sqrt{r_i}$, denoted by sqrt. Here, $r_i$ is the ratio of the number of points in the $i$-th category to the total number of points.
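The two weighting schemes can be sketched as follows. This is a minimal NumPy illustration, not the training code: the function names are hypothetical, every class is assumed to contain at least one point, and normalizing the loss by the sum of the weights is an assumed convention that may differ from a given framework.

```python
import numpy as np

def class_weights(points_per_class, mode="sqrt"):
    """Per-class WCE weights from the class frequencies r_i."""
    counts = np.asarray(points_per_class, dtype=float)
    r = counts / counts.sum()      # r_i: share of points in class i
    if mode == "default":
        return 1.0 / r             # w_default = 1 / r_i
    if mode == "sqrt":
        return 1.0 / np.sqrt(r)    # w_sqrt = 1 / sqrt(r_i)
    raise ValueError(f"unknown weighting mode: {mode}")

def weighted_cross_entropy(logits, labels, weights):
    """Weighted cross-entropy over N points.

    logits: (N, C) raw scores, labels: (N,) integer classes,
    weights: (C,) per-class weights.
    """
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    w = weights[labels]
    # Assumed convention: weighted mean over the points.
    return float((w * nll).sum() / w.sum())
```

For a class with $r_i = 0.01$, the default scheme gives a weight of 100 and the sqrt scheme a weight of 10, so the sqrt variant emphasizes rare classes less aggressively.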
The experimental results for the S3DIS dataset are shown in Table 5. It can be seen that the choice of weighting computation method for either type of loss does not significantly affect the overall performance, but the combination of the two losses (w mixed = 1 ) does lead to a slight improvement in accuracy compared to using either method alone. Since most segmentation algorithms currently only employ a single type of loss function for training and evaluation, for the sake of fairness, we used only WCE(sqrt), which is also used in RandLA-Net [27], as our loss function in the experiments for comparison with other algorithms. Furthermore, the improvement in accuracy resulting from the combined loss provides us with some potential directions for future research. For example, it is worth further investigating how to design a reasonable and effective loss function tailored to the network architecture and how to apply the loss function across multiple layers.

Discussion on Computational Efficiency
Table 6 reports the inference speed and the number of parameters of the models on the S3DIS dataset. Our method has slightly fewer parameters than BAF-LAC, and its speed is similar to that of LEARD-Net. In terms of segmentation performance, however, it shows a slight improvement over both of these algorithms, and it achieves a 4.5% increase in mIoU compared to RandLA-Net.

Learning Process of Our Methods
To provide a more intuitive depiction of the actual performance of our BEMF-Net during the learning process, we visually present the curves of loss and mIoU in Figure 9. We describe the training process of RandLA-Net and our method on the S3DIS dataset. The figure shows that our method converges faster compared to RandLA-Net, with the loss tending to stabilize around the 40th epoch. In addition, the mIoU reaches its peak around the 60th epoch and remains relatively stable in the subsequent training, without significant fluctuations.

Conclusions
Semantic segmentation techniques allow a better understanding of the scene environment, thereby extracting valuable information from 3D data that can be used to simulate the real world. In this paper, we proposed a novel model, called BEMF-Net, designed for the semantic segmentation of large-scale point clouds. This method encompasses two significant contributions. Firstly, we presented the DBB module, which integrates texture information to supplement the description of neighbor spaces and improve the perception of local details. Our encoding process systematically exploits differences in the neighboring spaces to achieve the accurate segmentation of semantic class boundaries. Additionally, we introduced the U-Fusion module, which is based on the traditional skip connection. This component circumvents issues caused by feature smoothing from sampling mechanisms, enabling the integration of multi-scale data and maintaining the integrity of information across different layers. Notably, we achieved excellent results for several benchmark datasets such as S3DIS and SensatUrban, and our performance on the Toronto3D benchmark was on par with state-of-the-art methods. Finally, we conducted ablation experiments to demonstrate the effectiveness of each proposed module.
The following conclusions were drawn from the above work:
• Enhancing the network's ability to describe the point cloud is possible by adding extra data, such as color information. The simultaneous use of geometry and color data can help distinguish semantic class boundaries.
• Effective utilization of features at different resolutions is essential to improve scene understanding. Ablation tests show that the proposed U-Fusion method is sensitive to feature changes and provides positive feedback.
• This methodology can function effectively in three distinct environments: SensatUrban, Toronto3D, and S3DIS. SensatUrban covers large-scale outdoor urban scenes captured by UAVs, Toronto3D contains localized urban scenes captured by vehicle-mounted LiDAR, and S3DIS contains indoor scene data. This showcases the ability to address data variability to a certain extent.
• Real-world point cloud data are commonly obtained by LiDAR or which often leads to inherent problems such as noise and incomplete data. In the future, we will focus on overcoming these challenges and achieving accurate point cloud segmentation, especially in regions characterized by low data quality.

Figure 4. Visual comparison of RandLA-Net and our method on SensatUrban.

Figure 5. Visual comparison of RandLA-Net and our method on Toronto3D.

Figure 6. Visual comparison of RandLA-Net and our method on S3DIS.

• $a_3$: Embed the interlayer multi-scale fusion module U-Fusion into RandLA-Net to illustrate the advantages of multi-scale feature fusion over the single-scale feature connections of the traditional U-Net.
• $a_4$: Remove multi-scale features from the complete network structure to demonstrate the importance of multi-scale information.
• $a_5$: Remove DBB from the entire network structure to demonstrate the effectiveness of dense connections.
• $a_6$: Remove BLA from the full network to highlight the effectiveness of bilateral features.

Figure 7. Visual comparison of different variants on S3DIS.

Figure 9. Validation mIoU and training loss curves of RandLA-Net and our method on S3DIS.

Table 1. Quantitative comparison results on SensatUrban (%). The best results are presented in bold, and the second-best results are underlined. Category columns: Ground, Veg., Building, Wall, Bridge, Parking, Rail, Traffic., Street., Car, Footpath, Bike, Water.

Table 2. Quantitative comparison results on Toronto3D (%). The best results are presented in bold, and the second-best results are underlined.

Table 3. Quantitative comparison results on S3DIS (Area5) (%). The best results are presented in bold, and the second-best results are underlined.

Table 4. Ablation study of BEMF-Net core modules.

Table 5. Ablation study of the loss function in our method.

Table 6. The number of parameters and computational efficiency on S3DIS.