PVI-Net: Point–Voxel–Image Fusion for Semantic Segmentation of Point Clouds in Large-Scale Autonomous Driving Scenarios

In this study, we introduce PVI-Net, a novel framework for the semantic segmentation of point clouds in autonomous driving scenarios. The framework integrates three data perspectives (point clouds, voxels, and distance maps) and performs feature extraction through three parallel branches. Within this process, we design a point cloud–voxel cross-attention mechanism and a multi-perspective feature fusion strategy for points and images. These strategies facilitate information interaction across the feature dimensions of the different perspectives, thereby optimizing the fusion of information from the various viewpoints and enhancing the overall performance of the model. The network employs a U-Net structure and residual connections, effectively merging and encoding information to improve the precision and efficiency of semantic segmentation. We validated the performance of PVI-Net on the SemanticKITTI and nuScenes datasets. The results demonstrate that PVI-Net surpasses most previous methods on various performance metrics.


Introduction
In recent years, with the rapid development of artificial intelligence technology, 3D point cloud processing has become an important branch of computer vision. Especially in outdoor applications such as autonomous driving, urban planning, and Geographic Information Systems (GISs), LiDAR point cloud segmentation plays a crucial role. For autonomous vehicles, accurate point cloud segmentation is key to safe navigation and decision making. Owing to the working principle of LiDAR sensors, the collected point cloud data often exhibit uneven density and occlusion. These characteristics make extracting accurate and reliable semantic information from the data a challenging task.
Recent advancements in point cloud semantic segmentation have substantially contributed to the field, particularly within large-scale autonomous driving scenarios [1][2][3][4]. These advancements predominantly revolve around the effective processing and representation of voluminous point cloud data captured by LiDAR. Our work represents a single point cloud through three distinct but complementary perspectives: point-based, voxel-based, and distance map representations. This approach aims to enhance the model's feature extraction capabilities by leveraging the intrinsic advantages of each representation. Voxel-based methods convert point clouds into three-dimensional grids and process them with 3D convolutional neural networks, which is convenient for capturing spatial information but requires high resolution when dealing with sparse point clouds, increasing computational and storage burdens. Point-based methods retain the precision of the original structure but are computationally inefficient on unstructured data, while image-based methods accelerate processing but may lose three-dimensional spatial information when projecting point clouds onto two-dimensional images, which affects segmentation accuracy.
Therefore, in building models for large-scene point cloud segmentation, we treat the fusion of point cloud, voxel, and distance map perspectives not as a simple data overlay but as a multi-dimensional information fusion strategy. Point clouds, as a high-fidelity representation of the raw data, maintain the original precision of spatial information and the integrity of fine details, directly reflecting the depth perception of scenes. Voxelization, though it introduces some quantization error, provides an intuitive and operable geometric expression of the macro form and volumetric characteristics of the data. Distance maps, as a structured representation of the spatial relationships in point clouds, provide a key perspective for understanding the geometric continuity and topological structure of scenes by encoding the spatial distances between points. This multi-dimensional representation lays the foundation for in-depth analysis and accurate segmentation of large-scale point cloud scenes, and a segmentation model that integrates the different perspectives shows improved robustness and accuracy on complex, large-scale scenes.
In this study, we propose an adaptive point-voxel-distance map feature fusion framework, PVI-Net, to optimize the semantic segmentation of point clouds in outdoor scenes. The framework combines the advantages of the point cloud, voxel, and distance map perspectives, providing a comprehensive view for processing complex large-scale data. PVI-Net uses a multi-layer feature extraction and fusion mechanism that combines a multilayer perceptron (MLP), 3D sparse convolution, and 2D convolution, and it implements feature fusion and information retention through a U-Net structure and residual connections, thereby improving the accuracy and efficiency of semantic segmentation. Specifically, the point cloud-voxel cross-attention mechanism and the point-image multi-perspective feature fusion strategy handle the structural differences between perspectives and the fusion of their information, enhancing the overall performance of the model. For computational efficiency, PVI-Net reduces the cost of multi-perspective fusion in two ways. Voxelization quickly filters point clouds in the early stage of data processing, reducing the burden of processing high-density information, while the high-level spatial relationships expressed by distance maps help the model identify scene features quickly, reducing the need for point-by-point analysis of complex data. Together, these strategies improve the balance between computational efficiency and resource usage while maintaining high segmentation accuracy. The evaluation results on two key datasets, SemanticKITTI and nuScenes, show that PVI-Net achieves strong point cloud semantic segmentation accuracy in large-scale autonomous driving scenarios.
Our work offers the following key contributions:
• We propose PVI-Net, a semantic segmentation framework for large-scale point cloud scenes, which integrates three different data perspectives (point cloud, voxel, and distance map) to achieve an adaptive multi-dimensional information fusion strategy.
• We design point-voxel cross-attention and Multi-perspective Fusion Attention (MF-Attention) mechanisms in the network structure, addressing the structural differences and information fusion issues between different perspectives.
• We design a multi-perspective feature post-fusion module. This module combines features from point clouds, voxels, and distance maps. In the post-fusion stage, the model integrates information from the different perspectives, enhancing semantic understanding of complex outdoor scenes.

Point Processing in Point Cloud Segmentation
Point-based methods [5][6][7][8] are renowned for their ability to learn global features directly from raw point clouds. However, they fall short in capturing details and local structures within point clouds. To address this deficiency, multi-scale processing methods [9] have been proposed. Such methods enhance the understanding of complex structures by analyzing point cloud features at different scales. Nevertheless, these methods often increase the computational burden. Graph-based methods [10], on the other hand, transform point cloud data into graph structures and use graph neural networks to capture complex relationships between points. This approach is particularly suitable for processing unstructured point cloud data but incurs high computational costs in graph construction and processing. Overall, in large-scale autonomous driving point cloud processing, point cloud data are unstructured: the points are unordered, and the number of neighbors of each point can vary. This irregularity poses significant challenges to point cloud processing.

Voxel Processing in Point Cloud Segmentation
Voxel-based point cloud segmentation [11][12][13] has garnered widespread attention in the understanding of autonomous driving scenes. Park et al. [14] proposed an Efficient Point Cloud Transformer (EPT) based on local self-attention to understand large-scale 3D scenes. Owing to its voxel structure, EPT offers faster inference than point-based work. Wang et al. [15] introduced the Dynamic Sparse Voxel Transformer (DSVT), a single-stride, window-based voxel Transformer backbone for outdoor 3D perception. This method divides each window into a series of local regions according to sparsity and then computes the features of all regions in a fully parallel manner. Although these methods use sparse voxel grids to reduce memory occupancy and employ layered, multi-scale voxel representations to capture more details, converting point cloud data into voxel format still loses detail through quantization. Our proposed PVI-Net bridges this gap through multiple perspectives.

Range Image Processing in Point Cloud Segmentation
Recent advancements in point cloud segmentation have highlighted the potential of range images as a complementary representation to traditional point-based and voxel-based methods. Range images, derived from point clouds through spherical projection, maintain depth information in a structured, image-like format, facilitating the application of mature 2D image-processing techniques. The transformation of point clouds into range images involves projecting 3D points onto a 2D plane based on their azimuth and elevation angles relative to a specific viewpoint, typically the sensor origin. This process preserves spatial locality and depth information, offering a compact representation that is particularly beneficial for capturing surface geometries and contours. Several notable studies have leveraged range images to enhance point cloud analysis. For instance, RangeNet++ [16] employs a deep neural network to segment range images semantically, exploiting their structured nature for efficient processing. Similarly, SqueezeSeg [17] and its successors demonstrate the efficacy of convolutional neural networks in interpreting range images for tasks such as semantic segmentation and object detection within point clouds. Despite their advantages, range images are not without challenges. The projection process can introduce distortions, particularly at large distances or near the edges of the field of view. Narrowing the gap between 2D image processing and 3D point cloud analysis therefore remains an open objective.

Multi-Perspective Fusion
The advantages of multi-perspective point cloud segmentation [18][19][20] are primarily manifested in its ability to provide a more comprehensive spatial understanding than a single perspective. In multi-perspective point cloud segmentation, data from different representations are fused to form more complete three-dimensional representations of target objects or environments. Chen et al. [21] explored interactive fusion between point cloud and image data, using an autoencoder structure to enhance the performance of 3D object detection by simultaneously learning features of point clouds and images. Tang et al. [8] focused on finding efficient 3D architectures. They combined sparse point and voxel convolutions, aiming to create a network that is both efficient and accurate for processing point cloud data. These methods can significantly reduce the occlusions and blind spots caused by single perspectives, especially in complex environments. Compared with previous methods, our approach proposes a point-voxel-image tri-perspective point cloud semantic segmentation framework, which captures more information about shape, size, and other important features from multiple perspectives.

Methodology
In this section, we provide a comprehensive introduction to the PVI-Net framework for point cloud processing in outdoor scene segmentation. In Section 3.1, we outline the overall structure of the network and its data flow. In Section 3.2, we detail the input data sources and feature extraction processes of the network's three key branches. In Section 3.3, we describe how the three branches are fused during the feature extraction stage and present the key modules designed for the post-fusion stage. This section aims to offer an in-depth understanding of the PVI-Net framework and of how it processes complex outdoor scene point cloud data.

Overview
Figure 1 shows our newly developed PVI-Net, a tri-branch feature fusion network for point cloud semantic segmentation. For the input point cloud data, we first map point cloud features into voxel grid features, providing input for the voxel feature learning branch. The point cloud data are then transformed into range images through spherical projection, serving as input for the image feature learning branch. The point cloud branch employs a basic PointNet structure and several MLPs to generate multi-resolution features. The voxel and image branches use 3D sparse convolution and 2D convolution, respectively, and each employs a U-Net structure for feature encoding and decoding, while features from the three perspectives are fused along the way. In the decoding stage, we apply residual connections to ensure that information learned during the encoding stage is effectively transferred to the output. Finally, a multi-perspective feature post-fusion module fuses the features from the three branches and restores the semantic information of each point.
In the point cloud branch of PVI-Net, the input is an unordered set of points $P = \{p_P^i\}_{i=1}^{N}$, where each point $p_P^i \in \mathbb{R}^{C}$ comprises its coordinates $c_P^i = [x_i, y_i, z_i]$ and its point features. Using MLPs directly to extract features in the point cloud branch avoids the high computational load and memory consumption of searching for neighboring relationships, thereby enabling efficient processing of large-scale data and simplifying the network structure. Each point in the point cloud is processed individually by the MLP, which extracts and learns the features of each point, and can be represented as follows:

where $l$ denotes the layer of the MLP, and $F_P^l$ represents the features extracted via the MLP at layer $l$. The point cloud feature extraction processes each point in the point cloud individually. MLP layers, comprising linear transformations and nonlinear activations, allow the network to learn complex patterns in the data. This process is crucial for capturing the geometric details of the point cloud, and these features are subsequently integrated with the voxel and range image branches through the fusion process.
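As a minimal sketch of the per-point processing described above, the following example applies one shared linear layer followed by a ReLU to every point independently, with no neighbor search. The function name `shared_mlp_layer`, the toy weights, and the choice of ReLU as the activation are illustrative assumptions rather than details taken from PVI-Net.

```python
def shared_mlp_layer(points, weight, bias):
    """Apply one shared linear layer + ReLU to each point independently.

    points: list of N feature vectors, each of length C_in
    weight: C_out rows, each of length C_in (hypothetical toy weights)
    bias:   list of length C_out
    """
    out = []
    for feat in points:
        row = []
        for w_row, b in zip(weight, bias):
            z = sum(w * x for w, x in zip(w_row, feat)) + b
            row.append(max(0.0, z))  # ReLU (assumed activation)
        out.append(row)
    return out


# Toy example: 3 points with (x, y, z, intensity) features, mapped to 2 channels.
points = [[1.0, 0.0, 0.0, 0.5], [0.0, 2.0, 0.0, 0.1], [-1.0, 0.0, 1.0, 0.9]]
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 1.0]]
bias = [0.0, 0.0]
features = shared_mlp_layer(points, weight, bias)
```

Because the same weights are applied to every point, reordering the input points only reorders the output rows, which is the property that makes such a layer suitable for unordered point sets.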

Voxel Feature Extraction Branch
For the input point cloud $P = \{p_P^i\}_{i=1}^{N}$, a three-dimensional voxel grid covering the entire range of the point cloud is first defined. This grid consists of many small cubes (voxels), each with a fixed size. The point cloud data are then mapped onto the voxel grid to obtain voxel features with a voxel resolution of $L_V \times H_V \times W_V$. The voxel index of each point is calculated from its coordinates in three-dimensional space. For a point $p_P^i = (x, y, z)$ and a voxel grid in which each voxel has size $\Delta x \times \Delta y \times \Delta z$, the voxel index $(i, j, k)$ of point $p_P^i$ can be calculated as follows: where $x_{\min}$, $y_{\min}$, and $z_{\min}$ are the minimum coordinate values of the voxel grid in each direction, $\lfloor \cdot \rfloor$ denotes the floor function, and $c_k$ is the downsampling stride of the 3D CNN. This approach ensures that each point in the point cloud is allocated to a corresponding voxel, establishing a correspondence between points and voxels that facilitates feature interaction between the two. To avoid the memory overhead caused by empty voxels, we use 3D sparse convolution to downsample and encode voxel features: where SConv3D(·) contains a 3D sparse convolution and an activation function, and $F_V^l$ represents the voxel features extracted via 3D sparse convolution at layer $l$. The encoder preserves the voxel feature maps at three downsampling levels, and the voxel features are then upsampled to restore the original voxel resolution.
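The point-to-voxel indexing described above can be sketched as follows. This is a minimal illustration that assumes the index takes the form floor((coordinate − minimum) / (voxel size × stride)); the helper names `voxel_index` and `build_point_to_voxel_table`, the grid bounds, and the exact placement of the stride in the formula are assumptions made for this example.

```python
import math


def voxel_index(point, grid_min, voxel_size, stride=1):
    """Return the integer voxel index (i, j, k) of a 3D point.

    Assumed form: floor((coord - min) / (voxel_size * stride)), where
    `stride` plays the role of the downsampling stride c_k.
    """
    return tuple(
        math.floor((c - c_min) / (size * stride))
        for c, c_min, size in zip(point, grid_min, voxel_size)
    )


def build_point_to_voxel_table(points, grid_min, voxel_size, stride=1):
    """Group point ids by voxel index so features can flow in both directions."""
    table = {}
    for pid, p in enumerate(points):
        key = voxel_index(p, grid_min, voxel_size, stride)
        table.setdefault(key, []).append(pid)
    return table


# Hypothetical grid bounds; 5 cm voxels as in the experimental settings.
points = [(0.26, 0.02, -0.97), (0.27, 0.03, -0.96), (10.03, -4.98, 0.52)]
grid_min = (-50.0, -50.0, -3.0)
voxel_size = (0.05, 0.05, 0.05)
table = build_point_to_voxel_table(points, grid_min, voxel_size, stride=2)
```

Storing only the occupied voxels in a dictionary mirrors the motivation for sparse convolution: empty voxels never consume memory, and the table gives the point-to-voxel correspondence in both directions.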

Image Feature Extraction Branch
Point cloud data are converted into range images through spherical projection, in which the position of each point is mapped onto a two-dimensional plane. Given a three-dimensional point cloud $P = \{p_P^i\}_{i=1}^{N}$ with coordinates $(x_i, y_i, z_i)$ in the three-dimensional Cartesian coordinate system, the corresponding two-dimensional coordinates $[u_i, v_i]$ in the image $I \in \mathbb{R}^{H_I \times W_I \times C}$, with height $H_I$, width $W_I$, and channel dimension $C$, can be expressed through spherical projection as follows: where $d = \sqrt{x_i^2 + y_i^2 + z_i^2}$ is the Euclidean distance from the point to the origin of the LiDAR coordinate system, which is also the projection center. $R$ represents the vertical field of view of the LiDAR sensor, and $R_d$ is the lower boundary of the vertical field of view. Spherical projection is a non-bijective process in which each point $p_P^i$ in the point cloud maps to one pixel position in the projected image. However, multiple three-dimensional points may fall on the same pixel, so the mapping from points to pixels is many-to-one.
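A minimal sketch of such a spherical projection is given below. It assumes the convention commonly used for range images, in which the column index follows the azimuth and the row index follows the elevation, with row 0 at the upper boundary of the field of view. The field-of-view values (+3° and −25°), the image size, and the function names are assumptions chosen for illustration, not settings confirmed by the text.

```python
import math


def spherical_project(point, height, width, fov_up_deg, fov_down_deg):
    """Map a 3D point (x, y, z) to an integer pixel (u, v) and its range d."""
    x, y, z = point
    d = math.sqrt(x * x + y * y + z * z)
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)  # negative: lower boundary R_d
    fov = fov_up - fov_down                # total vertical field of view R

    u = 0.5 * (1.0 - math.atan2(y, x) / math.pi) * width
    v = (1.0 - (math.asin(z / d) - fov_down) / fov) * height

    # Clamp to valid pixel indices.
    u = min(width - 1, max(0, int(math.floor(u))))
    v = min(height - 1, max(0, int(math.floor(v))))
    return u, v, d


def build_range_image(points, height=64, width=2048,
                      fov_up_deg=3.0, fov_down_deg=-25.0):
    """Keep the closest point per pixel and record every point hitting a pixel."""
    nearest = {}       # (v, u) -> (range, point id)
    pixel_points = {}  # (v, u) -> list of point ids (many-to-one mapping)
    for pid, p in enumerate(points):
        u, v, d = spherical_project(p, height, width, fov_up_deg, fov_down_deg)
        pixel_points.setdefault((v, u), []).append(pid)
        if (v, u) not in nearest or d < nearest[(v, u)][0]:
            nearest[(v, u)] = (d, pid)
    return nearest, pixel_points


# Two points on the same ray share a pixel; the third lands elsewhere.
points = [(10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 5.0, -1.0)]
nearest, pixel_points = build_range_image(points)
```

The `pixel_points` table makes the many-to-one relationship explicit: the first two points share one pixel, and only the closer one contributes its range to the image.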
In the image feature extraction branch, convolutional operations are used to extract features from the two-dimensional image obtained through spherical projection, which can be represented as follows: where Conv(·) contains a 2D convolution and an activation function, and $F_I^l$ represents the image features extracted via 2D convolution at layer $l$. As in the voxel branch, the image feature maps at three downsampling levels are preserved for the encoding-decoding process.

Multi-Perspective Feature Fusion
In the previous section, we introduced the projection system, which establishes corresponding index systems among the point, voxel, and range representations, and described the feature extraction process of the three branches. In this section, we construct interactions between the point-based, voxel-based, and range-based representations.
The distinct characteristics and advantages of point clouds, voxels, and depth maps necessitate different fusion strategies, based on their properties and complementarity in fusion.Point cloud data are irregular, while voxels partition space into regular grids.This structural difference makes simple addition or concatenation fusion insufficient for capturing their complex relationships.
Therefore, we designed an adaptive point-voxel cross-attention feature interaction method to handle this irregularity and structural difference better. It computes the relationship between point cloud and voxel features, enabling more flexible weighting of these features and a more effective combination of their information. The interaction is shown in Figure 2, where MLP(·) denotes a feature encoding function, ⊙ represents element-wise multiplication, and δ is the positional encoding, defined as follows: where $p_P^k$ denotes the 3D coordinates of the $k$-th point, $\mu_c = \frac{1}{K}\sum_{i=1}^{K} p_P^i$ is the mean of all projected point coordinates, σ is a nonlinear activation function, and Concat(·) denotes vector concatenation. The encoding combines relative and absolute position information, each passed through the nonlinear activation and then concatenated as input to the MLP, capturing the spatial relationships of points in both local and global contexts.
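To illustrate the positional encoding described above, the following sketch builds the concatenated input that would be passed to the MLP: the activated relative position (each point minus the group mean) followed by the activated absolute position. The use of `tanh` as the nonlinearity σ and the function name `positional_encoding_input` are assumptions, and the final MLP is omitted.

```python
import math


def positional_encoding_input(coords, activation=math.tanh):
    """Build the pre-MLP positional input for K points grouped together.

    For each point p_k this returns Concat(sigma(p_k - mu_c), sigma(p_k)),
    i.e. relative and absolute position, each passed through a nonlinearity.
    """
    k = len(coords)
    mu = [sum(p[axis] for p in coords) / k for axis in range(3)]
    encoded = []
    for p in coords:
        relative = [activation(p[a] - mu[a]) for a in range(3)]
        absolute = [activation(p[a]) for a in range(3)]
        encoded.append(relative + absolute)
    return mu, encoded


coords = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (2.0, 3.0, 0.0)]
mu, encoded = positional_encoding_input(coords)
```

The relative half of each vector is unchanged if the whole group is translated, while the absolute half retains where the group sits in the scene, which is the local/global split the text describes.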

MF-Attention Feature Fusion Module
When point cloud data are mapped to a two-dimensional image, multiple points in the point cloud may map to the same pixel position. To consider information from the point and image perspectives comprehensively and to balance their contributions dynamically, we designed an MF-Attention feature fusion module. Suppose a set of points $\{p_P^k\}_{k=1}^{K}$ in the point cloud maps to a pixel $P_I$ in the two-dimensional image. Each point $p_P^k$ in the set has a corresponding feature vector $f_P^k$, and the pixel $P_I$ has a feature vector $f_I$. The goal of MF-Attention fusion is to update the point features $f_P^k$ to reflect their relationship with the corresponding pixel feature $f_I$. First, we calculate the attention weights between point cloud features and image features: where $W_I$ and $W_P$ are learnable weight matrices that transform the mapped features into the attention computation space, and $d_k$ is the dimension of the key vectors.
The scaling factor $\sqrt{d_k}$ helps preserve numerical stability within the attention mechanism. The final MF-Attention fusion feature is then represented as: where Concat(·) concatenates features. The point-image attention fusion mechanism synthesizes information from point clouds and images, enabling the model to discover and leverage their inherent connections when processing multi-perspective data. This is particularly useful when combining point cloud and image data for the semantic prediction of point clouds.
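The attention step can be sketched for the case of K points that fall on a single pixel. The example below computes scaled dot-product scores between projected point features and the projected pixel feature, normalizes them with a softmax over the K points, and concatenates each point feature with the attention-weighted pixel feature. The softmax normalization, the specific concatenation form, and the identity matrices standing in for the learnable $W_P$ and $W_I$ are all assumptions made for illustration.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def point_pixel_attention(point_feats, pixel_feat, w_p, w_i):
    """Scaled dot-product attention between K point features and one pixel.

    Returns the softmax weights over the K points and, for each point,
    its own feature concatenated with the weighted pixel feature
    (this fusion form is an assumption made for illustration).
    """
    key = matvec(w_i, pixel_feat)
    d_k = len(key)
    scores = [
        sum(q * k for q, k in zip(matvec(w_p, f), key)) / math.sqrt(d_k)
        for f in point_feats
    ]
    # Numerically stable softmax over the K points sharing this pixel.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [f + [w * x for x in pixel_feat] for f, w in zip(point_feats, weights)]
    return weights, fused


identity = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for learnable W_P, W_I
point_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pixel_feat = [2.0, 0.0]
weights, fused = point_pixel_attention(point_feats, pixel_feat, identity, identity)
```

Dividing by the square root of the key dimension keeps the scores in a range where the softmax does not saturate as the feature dimension grows.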

Multi-Perspective Feature Post-Fusion Module
We extract features from each branch and design a deep fusion method for the features of the three branches to enhance the feature representation ability of each branch, as shown in Figure 3. Furthermore, we post-fuse the final prediction results of the point cloud, voxel, and depth map features to provide a richer and more comprehensive feature representation for point cloud semantic segmentation. For the final features obtained from the point cloud branch, $F_P \in \mathbb{R}^{N \times D}$, the voxel branch, $F_V \in \mathbb{R}^{L_V \times H_V \times W_V \times D}$, and the image branch, $F_I \in \mathbb{R}^{H_I \times W_I \times D}$, the corresponding semantic segmentation pseudo-probabilities are represented as follows: where $O_P \in \mathbb{R}^{N \times T}$ and $T$ represents the number of semantic categories. The voxel and image predictions, $E_V \in \mathbb{R}^{L_V \times H_V \times W_V \times T}$ and $E_I \in \mathbb{R}^{H_I \times W_I \times T}$, are mapped back to the original point positions according to the hash tables built during voxelization and spherical projection: where $O_V \in \mathbb{R}^{N \times T}$ and $O_I \in \mathbb{R}^{N \times T}$. To associate global features, we weight the features of each branch globally, allowing the model to learn the key features of each perspective automatically.
The weighted features of each branch are represented as follows: where g(·) denotes a (2, 1) linear mapping, 3DGAP(·) represents 3D global average pooling, and GAP(·) represents global average pooling. Thus, the final fusion result is represented as follows: Fusing the features of point clouds, voxels, and depth maps exploits the unique advantages of each perspective to provide a more comprehensive and powerful data representation, thus achieving better performance in specific tasks.
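The post-fusion data flow can be sketched as follows: per-voxel and per-pixel class scores are gathered back to the N points through the index tables recorded during voxelization and projection, and the three N × T score tables are then combined. In this minimal example the branch weights are fixed constants and the combination is a weighted sum; in the module described above the weights come from global pooling and a learned linear mapping, so both choices here are simplifying assumptions, as are all function names.

```python
def gather_to_points(branch_scores, point_to_cell):
    """Map per-voxel or per-pixel class scores back to the N points.

    branch_scores: dict from cell index (voxel or pixel) to a list of T scores
    point_to_cell: list of length N giving each point's cell index, as recorded
                   during voxelization or spherical projection
    """
    return [list(branch_scores[cell]) for cell in point_to_cell]


def fuse_branches(o_p, o_v, o_i, weights):
    """Weighted sum of the three N x T score tables (weights assumed given)."""
    w_p, w_v, w_i = weights
    return [
        [w_p * a + w_v * b + w_i * c for a, b, c in zip(rp, rv, ri)]
        for rp, rv, ri in zip(o_p, o_v, o_i)
    ]


def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=row.__getitem__)


# Toy case: N = 3 points, T = 2 classes.
o_p = [[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]
voxel_scores = {(0, 0, 0): [0.9, 0.1], (1, 0, 0): [0.3, 0.7]}
pixel_scores = {(6, 1024): [0.7, 0.3], (32, 512): [0.4, 0.6]}
point_to_voxel = [(0, 0, 0), (0, 0, 0), (1, 0, 0)]
point_to_pixel = [(6, 1024), (6, 1024), (32, 512)]

o_v = gather_to_points(voxel_scores, point_to_voxel)
o_i = gather_to_points(pixel_scores, point_to_pixel)
fused = fuse_branches(o_p, o_v, o_i, weights=(0.4, 0.3, 0.3))
labels = [argmax(row) for row in fused]
```

In the toy case, the second point is ambiguous in the point branch alone (0.5 versus 0.5), and the voxel and image branches resolve it, which illustrates why the three predictions are combined per point rather than per branch.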

Experiments
In this section, we evaluate PVI-Net and its application in autonomous driving. In Section 4.1, we introduce the two key datasets used in our experiments, SemanticKITTI and nuScenes, and explain their importance in network testing and evaluation. In Section 4.2, we describe the components of the PVI-Net architecture, detailing the key aspects and experimental settings of the network to ensure transparency and reproducibility. To make the impact of each indicator on network performance easy to read, we use "↓" and "↑" to denote that smaller or larger values of an indicator, respectively, correspond to better network performance. Finally, in Section 4.3, we conduct a comprehensive performance evaluation of the PVI-Net model and perform a series of ablation experiments to verify the effectiveness of its key components.

Datasets
SemanticKITTI. The SemanticKITTI dataset, an extension of the KITTI Vision Benchmark Suite, is a leading dataset in autonomous driving and robotics vision. Its key feature is the provision of a large-scale, time-sequenced LiDAR scanning dataset comprising more than 43,000 densely annotated scans distributed across 22 sequences, covering various road types and climatic conditions. The points in the dataset are labeled with 25 categories, and the training and test sets are composed of sequences 00 to 10 and 11 to 21, respectively. This split allows researchers to test and optimize their models, ensuring effective operation in various environments and an accurate understanding of the surroundings.
nuScenes. The nuScenes dataset, released by Aptiv Autonomous Mobility, is a widely used multi-perspective dataset in autonomous driving research. It was collected in diverse urban environments in Boston and Singapore, providing rich information on roads, traffic, and climate conditions. The dataset combines data from six cameras, five radars, and one LiDAR, achieving 360-degree environmental capture, which facilitates an in-depth understanding of complex scenes and supports tasks such as object detection, tracking, and segmentation. nuScenes includes over 1 million 3D bounding box annotations covering 23 object categories, totaling about 40,000 annotated frames. These data are divided into 28,130 training samples, 6019 validation samples, and 6008 test samples. For LiDAR semantic segmentation, the category annotations are merged into 16 primary categories.

Implementation Details and Settings
Architecture Settings. As shown in Figure 1, we propose a multi-perspective point cloud segmentation network architecture. This architecture first converts point cloud data into quantized voxels with a high resolution of 1600 × 1408 × 40 × 8. At the core of voxel processing, the backbone network employs 3D sparse convolution, generating voxel feature maps at four different scales with output dimensions of 32, 64, 128, and 256, respectively. Subsequently, these feature maps are restored by a decoder whose dimensions are symmetrical to those of the encoder to recover voxel features. In our experiments, the voxel resolution is set to a 5 cm edge length for each voxel. For the image branch, the input range-image size is set to 64 × 2048 on the SemanticKITTI dataset. On the nuScenes dataset, the initial input range-image size of 32 × 2048 is adjusted to 64 × 2048 to align with the dimensions used for SemanticKITTI.
Training Strategies. In our experiments, we trained the model for 120 epochs using the Adam optimizer, with the initial learning rate set to 0.01 and adjusted by a cosine annealing strategy. Training was conducted on a system equipped with 4× RTX 3090 GPUs, with the batch size set to 4. To prevent overfitting and to increase data diversity and the model's generalization capability, we used data augmentation, including GT-sampling, random flipping, random rotation around the Z-axis, and global scaling within the range [0.95, 1.05].
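For reference, a cosine annealing schedule with the stated 120 epochs and initial learning rate of 0.01 can be sketched as below. The minimum learning rate of 0 and the absence of warm-up or restarts are assumptions, since the text does not specify them.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=120, base_lr=0.01, min_lr=0.0):
    """Cosine-annealed learning rate for a given 0-based epoch."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos_term


# Learning rate at every epoch boundary from 0 to 120 inclusive.
schedule = [cosine_annealing_lr(e) for e in range(121)]
```

The schedule starts at the initial rate, passes through half of it at the midpoint of training, and decays smoothly to the minimum at the final epoch.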

Evaluation on SemanticKITTI Dataset
We conducted comprehensive experiments with the proposed PVI-Net on the SemanticKITTI dataset and compared it with recent advanced methods, as shown in Table 1. The results show that PVI-Net achieved a significant improvement of over 10% in the mean intersection over union (mIoU) metric compared with previous classic single-perspective input networks (such as point-based, voxel-based, and image-based methods). In comparison with mixed-perspective methods, PVI-Net also exhibited the best mIoU performance. Notably, PVI-Net outperformed RPVNet by 0.6% in mIoU, highlighting the effectiveness and practical value of the cross-attention mechanism and the proposed MF-Attention multi-perspective fusion strategy compared with the direct averaging fusion approach of RPVNet.
To validate the robustness of our model, we carried out a series of detailed experiments on the nuScenes dataset. As shown in Table 2, PVI-Net performed strongly, especially in the key metric of mIoU, where it surpassed the other classic single-perspective and multi-perspective networks. This result further confirms the potential of multi-perspective data fusion in point cloud semantic segmentation. Notably, by combining point cloud and voxel data, our network overcomes geometric distortions that may occur during point cloud projection, enhancing the accuracy of point cloud segmentation. Moreover, Figure 4 presents the semantic segmentation visualization results of PVI-Net on the nuScenes dataset. These experimental results showcase the performance of PVI-Net and emphasize the importance of multi-perspective fusion in enhancing point cloud processing capabilities in complex environments.
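Since mIoU is the main metric in these comparisons, a minimal reference computation is sketched below. For each class, the intersection over union is TP / (TP + FP + FN), and mIoU is the mean over classes. The function name, the optional `ignore_label` argument, and the choice to skip classes that appear in neither the prediction nor the ground truth are assumptions of this sketch; benchmark toolkits may handle such cases differently.

```python
def mean_iou(pred, target, num_classes, ignore_label=None):
    """Mean intersection over union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        tp = fp = fn = 0
        for p, t in zip(pred, target):
            if t == ignore_label:
                continue
            if p == c and t == c:
                tp += 1
            elif p == c:
                fp += 1
            elif t == c:
                fn += 1
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy per-point labels for six points and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so a class with few points weighs as much as a frequent one, which is why mIoU is preferred over plain point accuracy on imbalanced driving scenes.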

Ablation Study
In this section, we examine the key components of PVI-Net, conducting a series of fusion experiments to analyze the impact of each branch, the multi-perspective feature deep fusion modules, and the post-fusion module within the network. Additionally, we evaluate the computational efficiency and parameter count of PVI-Net under various branch combinations. All of these experiments are implemented on the SemanticKITTI dataset, and we report results on the validation part (sequence 08) of this dataset.

Impact of Different Perspectives on Network Performance
As shown in Table 3, we conducted a series of independent and interactive ablation experiments on the three branches. We also report the parameter count and model inference speed of the network in each ablation experiment. For uniformity, all ablation experiments in Table 3 use the same hardware settings and batch sizes as the PVI-Net experiments (see Section 4.2). Our experimental results clearly show that multi-perspective inputs perform better than single-perspective inputs in segmentation tasks. Specifically, regarding the interaction of the point cloud branch with multi-perspective features, we found that voxel features, as opposed to image features, provide a richer and more comprehensive feature supplement for the point cloud branch.
In Table 4, we present a series of ablation experiments on the key modules of PVI-Net, verifying their contributions to deep feature fusion. In this table, modules marked with a "✓" default to using an averaging method for fusion. Through these experimental results, we observed that each module in the network positively impacted the model's effectiveness.
In Table 5, we compare the multi-perspective feature post-fusion method used in our network with the common Addition (additive fusion) and Concatenation (concatenative fusion) methods. The experimental results show that, on the SemanticKITTI dataset, our fusion method improved the mIoU by 1.7% and 1.4% compared with the Addition and Concatenation methods, respectively. This outcome demonstrates that our fusion strategy integrates information from different sources more effectively when processing multi-perspective data, thereby enhancing the accuracy of semantic segmentation.
This paper enhances the understanding of complex 3D scenes by introducing a multi-view fusion approach, addressing the limitations of single-view methods, which often miss crucial scene details due to occlusions, scale variations, and viewpoint dependencies. By integrating data from various perspectives, our multi-view fusion technique reconstructs obscured parts, mitigates scale discrepancies, and generates viewpoint-invariant features, leading to improved feature completeness and classification accuracy. Although our initial model, PVI-Net, does not outperform the latest state-of-the-art models in accuracy, it validates the feasibility of multi-view fusion and offers a novel perspective for 3D scene comprehension.
Figure legend: P2V: Point to Voxel; V2P: Voxel to Point; P2I: Point to Image; I2P: Image to Point. Inputs: Range Image (M×D), Point (N×C), Voxel (V×D).

Figure 4. A visual comparison of the results from the model on the nuScenes dataset.

Table 1. Experimental results of the model on the SemanticKITTI dataset. To compare the performance of different models clearly, we divide the compared models into four groups based on the type of input data: point-based input, image-based input, voxel-based input, and mixed-view input. In the table, we highlight the highest mIoU score in each category in red and the second-highest score in blue.

Table 2. Experimental data on PVI-Net for the nuScenes dataset. We highlight the highest score in red and the second-highest score in blue.

Table 3. Impact of different perspectives on network performance.

Table 4. Impact of different perspectives on network performance.

Table 5. Impact of different perspectives on network performance.