1. Introduction
With the rapid development of 3D data acquisition technologies such as LiDAR and multi-view stereo vision [1], point clouds, as an important carrier of 3D spatial information, have been widely used in fields such as autonomous driving [2], smart cities [3,4], industrial inspection [5], digital twins [6], and agricultural monitoring [7]. However, due to factors such as the physical limitations of sensors, object occlusion, and variations in scanning distance, collected point cloud data often exhibit highly non-uniform density [6,7,8,9]: point clouds near the sensor are dense and rich in detail, while those in distant or occluded areas are sparse and information-poor. This density difference directly causes a sharp decline in the feature representation ability of traditional point cloud processing methods in sparse areas, leading to problems such as detail loss, feature blurring, and even target misidentification [10,11], and seriously restricting the reliability and robustness of 3D vision systems in complex real-world scenarios.
Currently, significant progress has been made in point cloud analysis, and mainstream methods can be roughly divided into point-based, voxel-based, and projection-based categories. Point-based methods, such as the PointNet++ series, aggregate local features through hierarchical sampling and grouping operations [12]. However, their feature extraction performance depends heavily on the density and distribution quality of local neighborhoods [13]; in sparse areas they suffer from an insufficient number of neighboring points, making it difficult to construct effective geometric contexts. Voxel-based methods regularize irregular point clouds, making it convenient to apply 3D convolutions, but they inevitably introduce quantization errors that cause loss of detail, and they face substantial computation and memory costs when processing large-scale scenes [14]. To balance efficiency and detail, projection-based methods project 3D point clouds onto 2D image planes (such as range views or bird’s-eye views) and then use mature 2D convolutional neural networks for feature extraction [15]. Although this type of method improves computational efficiency, it introduces information loss during backprojection and is sensitive to occlusion and viewpoint changes; its performance largely depends on the completeness and quality of the projected view [16].
Although the above methods have advanced the field at different levels, significant shortcomings remain in addressing the fundamental challenge of uneven density. Firstly, most existing network structures are “density blind”: their original designs did not explicitly consider the spatial variation of point cloud density, and they lack mechanisms for perceiving and adapting to local density information [17]. This makes it difficult for a model to distinguish between reasonable sparsity caused by distance and information loss caused by occlusion, and it cannot devote sufficient attention and compensation to information-deficient areas during feature learning. Secondly, at the feature representation level, a single data representation (pure 3D points, voxels, or 2D projections) can hardly capture all beneficial information in complex scenes [18]. For example, 3D point clouds accurately express geometric structure but lack texture detail, whereas 2D images are rich in texture and edge information but lose some three-dimensional spatial relationships. How to effectively integrate multi-source complementary information, especially using 2D visual priors to enhance the representation of 3D sparse regions, remains a key issue that has not been fully explored [19]. Furthermore, from the perspective of loss function design, the standard cross-entropy loss treats all sample points equally and does not weight them according to the density (i.e., reliability) of their respective regions, resulting in insufficient attention to low-density, high-uncertainty regions during optimization and further exacerbating the risk of misclassification in sparse regions [20].
In recent years, some studies have begun to address related challenges. For example, some works simulate different densities through data augmentation [21] or implicitly learn density changes using attention mechanisms [22], but these methods often fail to embed density as an explicit, quantifiable feature signal in the network. In terms of multimodal fusion, the combination of BIM models and point clouds provides a new approach for building reconstruction [3,4], while knowledge-enhanced domain-adaptive learning attempts to improve generalization under different data distributions [23]. In specific tasks such as point cloud registration [24], compression [25], shape completion [26], SLAM [27], and object detection [28], researchers are increasingly focusing on the robustness of local features [29]. In addition, the importance of high-quality point cloud processing has been highlighted in applications such as rapid volume calculation from point clouds [21,22,23], modeling [24,25], animation evaluation [26], and even hologram computation [27,28], as well as in computer graphics more broadly. A systematic review [9] pointed out the many challenges faced by deep learning on 3D point clouds, including noise, density variation, and computational efficiency. However, current research still lacks a unified framework that systematically and explicitly models density changes while synergistically exploiting multidimensional information for feature enhancement and efficient optimization.
In summary, existing 3D point cloud analysis methods have significant limitations in density perception, multi-source information fusion, and targeted optimization of loss functions when dealing with scenes with uneven density. To overcome these shortcomings, this paper proposes an innovative framework that integrates density-adaptive feature enhancement and lightweight spectral fine-tuning, aiming to comprehensively improve the feature representation robustness and classification accuracy of the model in challenging scenarios such as sparsity and occlusion by explicitly injecting density information, fusing multi-view complementary features, and introducing density-aware optimization objectives.
2. Design of Density-Adaptive Feature Enhancement Algorithm
2.3. Multi-View Projection Fusion Module
The multi-view projection fusion module is the core innovative design of this framework for achieving complementary multi-source information and alleviating the vulnerability of sparse features in 3D point clouds. Unlike existing projection-based methods [15], which typically use a single view (such as a range view or bird’s-eye view), the core of this module is a symmetric multi-view projection and feature consistency fusion mechanism that captures complementary 2D texture and structural information from multiple optimal views and achieves high-fidelity fusion with 3D point cloud features through a differentiable backprojection operation, effectively combating occlusion and enhancing the detail representation of sparse areas.
The input of the module is the density-feature-enhanced point cloud P = {p_i}, i = 1, …, N. Firstly, the module adopts an adaptive viewpoint selection strategy based on principal component analysis (PCA) of the point cloud instead of fixed, predefined viewpoints. By computing the eigenvectors of the covariance matrix of the point cloud, the planes spanned by the two eigenvectors with the largest eigenvalues are taken as the two main projection planes (approximating, e.g., top and side views), ensuring that the projection views cover the maximum variance of the point cloud and thereby maximize information capture. For each selected viewpoint v_k, k = 1, …, K (with K = 3 determined via grid search; viewpoint-selection radius 0.1 m; visibility-score scaling factor γ = 1.0), the 3D points p_i are transformed into 2D image coordinates through parallel projection:

u_i^(k) = Π(R_k p_i + t_k),

where R_k and t_k, respectively, represent the rotation matrix and translation vector corresponding to viewpoint v_k, p_i denotes the three-dimensional coordinates of the point, and Π is the orthographic operator that discards the depth coordinate after the rigid transformation. Each projected pixel additionally stores the density feature of its source point, which enables the subsequent 2D CNNs to perceive the density information of the three-dimensional points behind them when extracting features.
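As a minimal illustration of the steps above, the PCA-based viewpoint selection and the parallel projection can be sketched in NumPy. This is a sketch under simplifying assumptions: the image resolution, pixel size, and function names are illustrative, and occlusion handling is omitted.

```python
import numpy as np

def principal_projection_axes(points):
    """Eigenvectors of the point cloud's covariance matrix, sorted by
    descending eigenvalue; the top two span the main projection plane."""
    centered = points - points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]                       # columns: principal axes

def parallel_project(points, R, t, pixel_size=0.05, H=64, W=64):
    """u_i = Pi(R p_i + t): rigid transform into the view frame, drop the
    depth axis, then discretize (and clip) to pixel coordinates."""
    cam = points @ R.T + t                         # view-frame coordinates
    uv = np.round(cam[:, :2] / pixel_size).astype(int)
    uv[:, 0] = np.clip(uv[:, 0] + W // 2, 0, W - 1)
    uv[:, 1] = np.clip(uv[:, 1] + H // 2, 0, H - 1)
    return uv                                      # (N, 2) pixel indices

# Build a view whose image plane is the dominant PCA plane:
pts = np.random.default_rng(0).normal(size=(200, 3)) * [3.0, 1.0, 0.3]
R = principal_projection_axes(pts).T               # rows: new basis axes
uv = parallel_project(pts, R, np.zeros(3))
```

In this sketch each view shares one projection routine, so adding further PCA-derived viewpoints only changes the rotation `R` passed in.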
After the above processing, each density-aware feature map I_k is fed into a shared-weight feature extraction backbone of a pretrained 2D CNN (such as ResNet-18) to extract multi-scale 2D features F_k^(l), where l denotes the feature layer level. The key innovation of this article lies in the backprojection and feature fusion steps. To avoid the information loss and edge blurring caused by simple interpolation, we design a differentiable backprojection mechanism based on bidirectional nearest-neighbor search. For each three-dimensional point p_i, we extract the deep two-dimensional feature f_i^(k) from the selected feature layer, corresponding to its projection point u_i^(k) at each viewpoint. However, direct backprojection leaves each 3D point with K potentially different 2D features. To integrate these multi-view features and maintain consistency, we propose adaptive aggregation with per-view feature weights:

f_i^2D = Σ_k w_i^(k) f_i^(k),  w_i^(k) = exp(γ s_i^(k)) / Σ_j exp(γ s_i^(j)),

where w_i^(k) is the adaptively computed weight, γ is the scaling factor, and s_i^(k) is the visibility score of the point in viewpoint v_k, determined by the angle between the point and the viewing direction as well as by whether the point is occluded in the projected image. The depth of this formulation lies in allowing the network to adaptively select the most reliable and informative viewpoint features for weighted fusion for each point, rather than treating all viewpoints equally. For example, for a partially occluded point, the weight of the occluded viewpoint automatically decreases while that of the unobstructed viewpoint increases, significantly improving the robustness of the fused features.
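The visibility-weighted aggregation described above can be sketched as a softmax over views (a minimal NumPy sketch; the array shapes and function name are illustrative assumptions):

```python
import numpy as np

def aggregate_view_features(feats, vis_scores, gamma=1.0):
    """Fuse K per-view 2D features per point using visibility weights.

    feats:      (N, K, C) back-projected 2D features f_i^(k)
    vis_scores: (N, K)    visibility scores s_i^(k)
    gamma:      scaling factor controlling how sharply the fusion
                prefers the most visible viewpoint
    """
    logits = gamma * vis_scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                    # softmax over views
    return (w[:, :, None] * feats).sum(axis=1)           # (N, C) fused f_i^2D

# A point occluded in view 1 (low visibility score) leans on view 0:
feats = np.stack([np.ones((1, 4)), 5 * np.ones((1, 4))], axis=1)  # (1, 2, 4)
scores = np.array([[3.0, -3.0]])       # view 0 visible, view 1 occluded
fused = aggregate_view_features(feats, scores)  # close to view 0's feature
```

The softmax keeps the weights positive and normalized, so the fused feature stays a convex combination of the per-view features.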
Finally, the fused two-dimensional features are combined with the features extracted by the 3D backbone network through channel concatenation followed by a convolutional fusion layer, generating the final enhanced joint feature representation f_i^joint = Conv([f_i^3D ; f_i^2D]). This design complements three-dimensional geometric features with two-dimensional texture details at the point level and, in particular, injects rich contextual information from other perspectives into sparse or occluded areas of the original point cloud, fundamentally improving its feature representation ability.
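The concatenation-plus-convolution fusion step amounts, per point, to a linear map over the stacked channels (an illustrative NumPy sketch; in the actual network the map `W`, `b` would be a learned 1×1 convolution, and all names here are assumptions):

```python
import numpy as np

def fuse_3d_2d(f3d, f2d, W, b):
    """Joint feature: concatenate per-point 3D and 2D features along the
    channel axis, then apply a 1x1-conv-style linear map (W, b)."""
    joint = np.concatenate([f3d, f2d], axis=1)  # (N, C3 + C2)
    return joint @ W + b                        # (N, C_out)

rng = np.random.default_rng(0)
f3d = rng.normal(size=(10, 32))                 # 3D backbone features
f2d = rng.normal(size=(10, 16))                 # fused multi-view 2D features
W, b = rng.normal(size=(48, 64)), np.zeros(64)
out = fuse_3d_2d(f3d, f2d, W, b)                # (10, 64) joint features
```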
2.4. Design of the Density-Aware Loss Function
The density-aware loss function module is the key innovation of this framework for achieving targeted optimization and improving the robustness of sparse-region feature learning. Unlike the traditional cross-entropy loss, which treats all sample points equally [11], this paper proposes a dynamic weight adjustment mechanism based on local density priors, which allows the model to explicitly focus on the learning difficulties of low-density regions during training, thereby forcing the network to extract more robust feature representations from sparse points.
This loss function is based on the standard cross-entropy loss but introduces a density-adaptive dynamic weight term. Let the point cloud contain N points, with each point i having a true label y_i, a predicted probability distribution p_i, and a local density feature ρ_i (calculated in Section 2.2). The basic cross-entropy loss is

L_CE = -(1/N) Σ_i log p_i(y_i).

Introducing the density-aware weight term, a new loss function L_DA is constructed:

L_DA = -(1/N) Σ_i w(ρ_i) log p_i(y_i).
The design of the weight term w(ρ_i) is the core innovation of this module. We propose a nonlinear mapping function based on the reciprocal of density:

w(ρ_i) = (σ(z_i) + ε)^(−α),  z_i = (ρ_i − μ_ρ) / σ_ρ,

where α is the hyperparameter that controls the punishment intensity (set to 0.4 in the experiments, corresponding to the weight increase mentioned in the abstract), ε is a small constant that prevents division by zero, and σ is the sigmoid function used to normalize the standardized density value z_i to the interval (0, 1).
Among them, μ_ρ and σ_ρ are the mean and standard deviation of the density values of all points in the training set, respectively. The deeper significance of this design lies in its mathematical properties. Firstly, the sigmoid normalization ensures the scale consistency of density values across different point cloud datasets, avoiding training instability caused by differences in absolute density values. Secondly, the weight function decreases monotonically with the density value ρ_i, which means that points in low-density areas (smaller ρ_i) receive greater loss weights.
From a probabilistic perspective, points in low-density areas have higher uncertainty in their feature representation and a greater risk of misclassification due to the lack of neighborhood information. By increasing the gradient backpropagation strength of these points, the loss function gives the model a clear indication of the learning direction during optimization: discriminative feature learning in sparse regions must be strengthened. When a point lies in a low-density area, the sigmoid-normalized density approaches 0 and the weight w(ρ_i) rises well above 1, strengthening the penalty for misclassifying that point. Conversely, in high-density areas, the sigmoid-normalized density approaches 1, the weight stays close to 1, and the loss reverts to the standard cross-entropy loss.
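The weighting and the resulting loss can be sketched as follows (a minimal NumPy sketch; the exact weight expression w(ρ_i) = (σ(z_i) + ε)^(−α) is one reasonable reading of the reciprocal-of-density mapping consistent with the stated properties, and all function names are illustrative):

```python
import numpy as np

ALPHA, EPS = 0.4, 1e-6   # punishment intensity and zero-division guard

def density_weights(rho, mu, sd, alpha=ALPHA, eps=EPS):
    """w(rho_i) = (sigmoid(z_i) + eps) ** (-alpha): points with density
    below the training-set mean receive weights above 1, while dense
    points fall back to roughly the standard cross-entropy weight."""
    z = (rho - mu) / sd                         # z-score vs. training set
    s = 1.0 / (1.0 + np.exp(-z))                # sigmoid into (0, 1)
    return (s + eps) ** (-alpha)

def density_aware_ce(probs, labels, rho, mu, sd):
    """L_DA = -(1/N) * sum_i w(rho_i) * log p_i(y_i)."""
    w = density_weights(rho, mu, sd)
    p_true = probs[np.arange(len(labels)), labels]
    return float(-np.mean(w * np.log(p_true + 1e-12)))

# Sparse points are penalized more heavily than dense ones:
w_sparse = density_weights(np.array([0.1]), mu=1.0, sd=0.5)
w_dense = density_weights(np.array([3.0]), mu=1.0, sd=0.5)
```

Because the weight multiplies the per-point log-likelihood, sparse points receive proportionally larger gradients during backpropagation, which is exactly the optimization emphasis the section describes.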
This design is consistent with human visual cognition, which devotes more attention to identifying and judging areas with incomplete information. Within the framework, the loss function forms a closed-loop synergy with the density feature injection and multi-view fusion mechanisms described earlier: density features help the network “perceive” the reliability of a region, multi-view fusion “supplements” sparse points with cross-modal information, and the density-aware loss “forces” the network, at the level of the optimization objective, to use this information effectively. The combined effect of the three significantly improves the performance of the model in challenging scenarios.
3. Experimental Simulation and Performance Evaluation
3.5. Performance Verification in Different Scenarios
In order to comprehensively evaluate the applicability of the algorithm in different scenarios, we conducted performance verification experiments under various point cloud density distributions and image types, covering synthetic algorithm test images as well as outdoor and indoor scenes.
Firstly, a comparative analysis was conducted on the performance of the algorithm in different-density regions, including high-, medium-, low-, and very-low-density scenarios. The results are shown in Table 4. As Table 4 shows, the proposed algorithm maintains stable performance across density regions: even in extremely challenging low-density areas, the F1-Score still reaches 0.73, a significant improvement over the baseline model’s 0.65 in the same region.
Secondly, this article analyzed the point cloud computing effect and density distribution of PNG images in different scenarios. Figure 1, Figure 2 and Figure 3 show ordinary indoor scenes and complex outdoor road scenes, and circular test images were selected for comparative analysis. The results are shown in Figure 5.
According to the experimental results in Figure 5, the point cloud density distribution uniformity of our algorithm in complex outdoor road scenes reaches 0.92, 8.2% higher than the 0.85 achieved in indoor scenes. In the processing of circular test images, the multi-view projection fusion mechanism achieves an edge feature retention rate of 95.3%, 23.6% higher than traditional methods. In particular, in the occluded areas of outdoor road scenes, the KD-tree-based density feature extraction constructed in this paper improves the completion accuracy of missing point cloud regions to 87.5%, verifying the robustness of the algorithm in complex environments. Considering that one of Figure 1 and Figure 2 shows an indoor environment and the other a complex outdoor road environment, and that their point cloud computing results differ significantly, a further comprehensive comparative analysis and visualization of these two scenarios was conducted; the results are shown in Figure 6.
The experimental results in Figure 6 show that the completeness of point cloud reconstruction in outdoor scenes reaches 91.2%, slightly lower than the 94.5% in indoor scenes, but occlusion handling is better, with a point cloud recovery rate of 85.3% in occluded areas. This is thanks to the adaptive view-feature weighting and aggregation mechanism designed in Section 2.3, which achieves a view consistency index of 0.89 for complex outdoor scenes. In terms of computational efficiency, the processing time for outdoor scenes is only 15.7% longer than for indoor scenes, demonstrating the algorithm’s good scalability.
Considering that practical application scenarios mostly involve indoor and outdoor point cloud computing for semantic segmentation, this study focuses solely on the SemanticKITTI benchmark to maintain clarity. The comparison results are shown in Figure 7. The conversion algorithm in this paper achieves a similarity of 93.8% in maintaining the intensity distribution, and the depth distribution error is controlled within 0.05. The feature retention rate of the 2D projection is 96.2% for indoor scenes and 92.7% for outdoor scenes, verifying the effectiveness of the multi-view fusion mechanism of Section 2.3. In terms of texture detail conversion in particular, the algorithm improves the edge information retention rate of 2D images to 89.4%, 31.2% higher than the baseline method.
We also conducted an in-depth analysis of the relationship between density and significance, with “significance” explicitly defined as a measure based on the semantic segmentation labels of SemanticKITTI, calculated from point-wise confidence values, revealing the behavior of the algorithm under different density conditions. The experimental results show that the density-aware mechanism effectively balances detection accuracy across density regions and avoids a sharp performance decline in sparse regions. The results are shown in Figure 8. When the point cloud density increases from 0.1 to 0.5, the significance detection accuracy rises approximately linearly from 73.2% to 91.5%. This validates the effectiveness of the density-aware loss function of Section 2.4: after the 40% weight increase in low-density areas (<0.2), the classification accuracy of the model in those areas increased from 65.3% to 78.9%. Error analysis shows that the correlation coefficient between density and significance reaches 0.87, indicating that the proposed density-adaptive mechanism effectively guides feature learning.
Finally, this article analyzed the processing performance and computational complexity of the algorithm for different image types; the results are shown in Table 5 and Table 6. Table 5 compares Program Time, Precision, Recall, and F1-Score for 1000-point computations across the different image types.
Analysis of the results in Table 5 shows that the proposed algorithm performs well across image types. In gradient image processing, the algorithm achieved a precision of 89% and a recall of 87%, with an F1-Score of 88%, mainly owing to the adaptability of the density feature extraction module to continuously varying features. Circular images are processed best, with an F1-Score of up to 91%, mainly due to the multi-view fusion mechanism’s strength on regular geometric shapes. An F1-Score of 83% is still maintained on noisy images, showing that the density-aware mechanism effectively suppresses noise interference. In complex scenes, the F1-Score reaches 90%, fully reflecting the collaborative optimization effect of the three core modules, among which the density-adaptive loss function contributes significantly to the optimization of complex boundaries.
Table 6 compares algorithm complexity in terms of Time Complexity, Space Complexity, Training Time, and Inference Time. The analysis shows that the proposed algorithm strikes a good balance between performance gains and computational cost. Although the time complexity increases from O(n) to O(n log n), the actual training time only grows from 120 s to 185 s (an increase of 54.2%), a moderate cost for the 12.5% performance improvement. The space complexity increases from O(n) to O(n + v), where v is the number of views; through the shared-weight design and feature-layer selection optimization, this additional overhead is kept within an acceptable range. The inference time increases from 5 ms to 8 ms (an increase of 60%), but still meets real-time requirements in practical applications. This efficiency benefits from the lightweight design of the overall framework, especially the feature sharing and selective backprojection strategies adopted in the multi-view processing stage. Although the algorithm consumes more computational resources, the performance improvement clearly outweighs the added cost.
In summary, the above systematic experimental verification fully demonstrates the effectiveness, robustness, and practicality of the proposed density-adaptive feature enhancement algorithm in different scenarios, providing reliable technical support for point cloud analysis in complex environments.