1. Introduction
Construction machinery plays an important role in national economic development and emergency rescue. Intelligent and unmanned construction machinery can effectively reduce the operational difficulty caused by harsh working environments and the hazards present at rescue and disaster relief sites. Digital acceleration and intelligent development have therefore become the consensus in this industry [1].
Environmental perception is essential for the intelligent development of construction machinery. However, construction machinery operates in a large number of unstructured scenes. Additionally, the working environment is highly dynamic, and the scene is often cluttered with numerous irregularly distributed objects. This inherent complexity and variability make it difficult for a single sensor to achieve effective perception, so multi-sensor fusion has become an important development trend.
At present, in the area of multi-sensor semantic perception, the development of deep learning has produced a large number of perception tasks based on network fusion methods [2,3]. F-PointNet projected 2D image candidate boxes into 3D space and used the frustum formed by each 2D box to gather the features of the enclosed points as a 3D candidate region; outlier points within the frustum were removed by instance segmentation, and a network then predicted the object category, which improved prediction accuracy to a certain extent [4]. PointPainting mapped image semantics onto the point cloud by projection and fed the resulting semantic point cloud into an object detection network to improve detection regression through semantic information; however, low mapping accuracy limited the performance gain [5]. The representative MV3D network proposed by Chen et al. was based on the mapping of 3D candidate boxes: convolutional networks extracted features from three inputs, the point cloud front view, the point cloud top view, and the image; the 3D candidate boxes extracted from the top-view point cloud were then mapped to the other modalities for feature extraction, and all modalities were fused and detected based on these boxes. Prediction accuracy improved to a certain extent, but the fusion result depended on the quality of the initial candidate boxes [6]. The MVX-Net network raised the dimensionality of the fused information by directly fusing image feature vectors with the voxelized point features of the point cloud [7]. The CLOCs network fused the outputs of separate point cloud and image branches into a final prediction by evaluating confidence, distance, and intersection over union, but it required the output formats of the two modalities to be unified [8]. Zhuang et al. proposed the PMF network, in which the image data and the forward-view projection of the point cloud are fed into a two-stream architecture; a residual-based fusion module in the point cloud branch learns the features complementary between the image RGB and the point cloud data, yielding better semantic segmentation results [9]. These fusion algorithms improved network accuracy by mapping image semantics onto point clouds, but they all relied on acquiring point cloud semantic labels and consumed substantial computational resources.
Most of the aforementioned fusion algorithms follow the “data-level” or “feature-level” fusion paradigms. While they improve accuracy, their common prerequisite is reliance on large-scale, high-quality point cloud semantic labels for end-to-end training, which consumes enormous computational resources. Meanwhile, to reduce annotation costs in the 2D semantic segmentation domain, weakly supervised methods such as mask-supervised learning [10] have been proposed, achieving significant progress using only image-level or bounding box labels. In 3D point cloud processing, unsupervised or self-supervised geometric feature learning and clustering represent long-standing research directions. For instance, methods like graph-regularized sparse coding [11] have been successfully applied to 3D shape representation and clustering. However, the key challenge for practical 3D semantic perception in unstructured environments lies in effectively and robustly fusing low-cost, readily available 2D semantic knowledge with unlabeled 3D geometric clustering results, thereby completely eliminating dependence on 3D point cloud semantic labels.
Unlike fully supervised fusion paradigms, this work pursues a fundamentally different direction by investigating reliable 3D semantic perception using only 2D image labels, thereby entirely bypassing the need for costly 3D point cloud annotations. This approach is especially valuable in unstructured scenarios, where 3D labels are inherently scarce. The novelty of the approach lies not in the concept of projection itself, but in the specific technical innovations introduced. These include PSO-based calibration refinement and Kd-tree-guided semantic consistency optimization. Together, they enable robust performance under the real-world conditions of weak supervision and imperfect calibration typically found on construction sites.
Currently, high-quality 3D semantic segmentation datasets are scarce. Moreover, unstructured scenes are highly complex, with numerous irregularly distributed objects whose shapes change rapidly, which sharply increases the acquisition cost of point cloud semantic labels. Environmental semantic perception technology from the passenger vehicle field therefore cannot be applied directly to construction machinery. In this paper, multi-sensor fusion is carried out based on a camera and LiDAR. Point cloud clustering and image semantic segmentation results are fused by perspective projection, so that 3D semantic perception in an unstructured environment is realized using only image semantic labels. Particle Swarm Optimization (PSO) [12] is used to optimize the perspective projection fusion process, and a Kd-tree-based radius nearest neighbor (RNN) search is used to match the fusion results. Finally, a semantic dataset of an excavator's actual unstructured working environment is constructed for perception experiments to verify the effectiveness of the proposed scheme. Notably, the accurate 3D semantic understanding achieved by our fusion framework serves as a critical foundation for dynamic 4D scene analysis (3D + time), which is essential for advanced applications such as progress monitoring and constructability review in construction projects [13,14].
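To make the projection-based fusion concrete, the following is a minimal sketch of perspective projection semantic painting under a pinhole camera model. The function names, the intrinsic matrix K, and the extrinsic rotation R and translation t are illustrative placeholders, not the exact implementation used in this work.

```python
# A minimal sketch of perspective-projection semantic painting, assuming a
# pinhole camera; K, R, and t stand in for the calibrated parameters.
import numpy as np

def project_points(points_lidar, K, R, t):
    """Project Nx3 LiDAR points into pixel coordinates (pinhole model)."""
    pts_cam = points_lidar @ R.T + t          # LiDAR frame -> camera frame
    in_front = pts_cam[:, 2] > 0.0            # keep points with positive depth
    uv_h = pts_cam[in_front] @ K.T            # homogeneous pixel coordinates
    uv = uv_h[:, :2] / uv_h[:, 2:3]           # perspective divide
    return uv, in_front

def paint_semantics(uv, in_front, sem_mask):
    """Assign each projected point the label of the pixel it falls on."""
    labels = np.full(in_front.shape[0], -1, dtype=np.int64)  # -1: out of view
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    h, w = sem_mask.shape
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels[np.flatnonzero(in_front)[ok]] = sem_mask[v[ok], u[ok]]
    return labels
```

Each LiDAR point that lands inside the image inherits the semantic label of its pixel; points outside the camera frustum keep a placeholder label and are handled later by the consistency step.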
Our contributions are summarized as follows. First, we present a weakly supervised 3D semantic perception pipeline that operates exclusively on 2D image labels, completely bypassing the cost and effort of 3D point cloud annotation and establishing a practical paradigm for perception in unstructured environments. Second, we propose a dual-branch fusion architecture with novel optimization components: a PSO-based mechanism corrects weak calibration in projection, and an adaptive RNN algorithm enforces semantic consistency, directly addressing the key bottlenecks in fusion reliability. Third, we demonstrate competitive performance on a dedicated dataset through rigorous experimentation, with the framework achieving an mIoU of 75.85% and an mPA of 84.72%, offering a compelling balance between high accuracy and low resource expenditure.
3. Results and Discussion
3.1. Test Scene and Vehicle Platforms
Figure 21 shows the test environment scene. The environment is a construction site with no fixed lane lines, where the shapes of the perceived objects are irregular. It includes a variety of construction machinery, earthwork of various shapes awaiting operation, vegetation, staff, unstructured pavement, and other targets, constituting a typical unstructured excavator working environment.
The experimental vehicle platform, based on an electric excavator, is shown in Figure 22. The excavator has a height of 2.6 m. A VLS-128 LiDAR is mounted on a sensor bracket at a height of 1.65 m, positioned 0.06 m to the right of the left track wheel center and 0.15 m behind the front track shaft center. A binocular camera is installed with its optical center at a height of 1.43 m and 0.02 m forward of the front track shaft, maintaining co-planar alignment with the LiDAR in the horizontal direction. An initial extrinsic estimate derived from these measurements is sketched below.
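As a rough illustration, the mounting measurements above can be assembled into an initial (weak) LiDAR-to-camera extrinsic estimate. The axis convention (x forward, y left, z up), the zero lateral offset implied by the stated horizontal co-planarity, and the identity initial rotation are all assumptions that the subsequent PSO refinement is meant to correct.

```python
# A rough sketch of the initial ("weak") LiDAR-to-camera extrinsic guess
# assembled from the mounting measurements in Section 3.1 (meters).
# Axis convention and lateral alignment are assumptions.
import numpy as np

lidar_height, cam_height = 1.65, 1.43
lidar_behind_axle, cam_ahead_of_axle = 0.15, 0.02

# Translation of the camera origin expressed in the LiDAR frame.
t_cam_in_lidar = np.array([
    lidar_behind_axle + cam_ahead_of_axle,   # x: camera sits ahead of the LiDAR
    0.0,                                     # y: assumed laterally aligned
    cam_height - lidar_height,               # z: camera sits lower
])
# With aligned mounting, the initial rotation is taken as identity; the
# residual misalignment is what the PSO refinement later corrects.
R_init = np.eye(3)
print(t_cam_in_lidar)   # -> [ 0.17  0.   -0.22]
```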
Given the computational cost of training and the consistency of results across the dataset, performance metrics in this study are derived from a single training run using a fixed train/validation/test split. The reported results thus represent a point estimate of the model’s performance on the designated test set.
3.2. Algorithm Test
Based on the aforementioned framework, the image branch employs the optimal parameter weights as the inference model for semantic prediction. The point cloud branch utilizes the DBSCAN clustering algorithm with parameters ε = 2.5 m and MinPts = 30 to segment the raw LiDAR data. The results from both branches are fused via perspective projection. Clusters corresponding to construction machinery and personnel, which exhibit clear spatial independence, are selected as optimization targets. A Particle Swarm Optimization (PSO) algorithm with 12 particles is applied to refine the 6-DoF extrinsic parameters (rotation and translation). The semantically fused point cloud is then restored to 3D space, and its consistency is further enhanced using a Kd-tree-based RNN matching algorithm. The semantic segmentation performance metrics for key targets, including mounds, personnel, construction machinery, and vegetation, are summarized in Table 8.
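For reference, the point cloud branch can be reproduced with an off-the-shelf DBSCAN implementation using the parameters reported above; the input file name is hypothetical, and the points are assumed to be ground-filtered.

```python
# A minimal sketch of the point cloud branch with the reported DBSCAN
# parameters (eps = 2.5 m, min_samples = 30); "scan.npy" is a hypothetical
# Nx3 array of ground-filtered LiDAR returns.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.load("scan.npy")
clusterer = DBSCAN(eps=2.5, min_samples=30, algorithm="kd_tree")
cluster_ids = clusterer.fit_predict(points)   # -1 marks noise points

# Each non-noise cluster is a candidate object instance; its semantic class
# is assigned later by projecting its points into the segmented image.
for cid in np.unique(cluster_ids):
    if cid == -1:
        continue
    print(f"cluster {cid}: {np.sum(cluster_ids == cid)} points")
```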
The experimental results demonstrate that the proposed algorithm achieves a segmentation accuracy and Intersection over Union (IoU) for construction machinery of 96.21% and 93.14%, respectively, indicating excellent performance for large, rigid objects, which served as a primary optimization target. For operational mounds, the accuracy and IoU reach 86.34% and 83.61%, also reflecting robust segmentation. In the case of personnel, which are dynamic and small targets, the accuracy remains high at 92.13%, but the IoU is 72.33%. This lower consistency primarily arises from the inherent sparsity of human-body point clouds and potential motion artifacts in dynamic scenes. The ground category, mapped using filtered ground points, attains an accuracy of 79.56% and an IoU of 70.87%. Vegetation, including shrubs with staggered distributions and weak feature similarity during clustering, shows weaker object independence, resulting in an accuracy of 69.38% and an IoU of 59.31%.
In summary, the proposed algorithm meets the semantic perception requirements for unstructured excavator environments, achieving an overall mean accuracy of 84.72% and a mean IoU of 75.85%. This performance is competitive for 3D semantic perception, which is particularly notable because it was attained using only 2D image labels during training, thereby substantially reducing annotation cost. The final sampled point cloud segmentation result (Figure 23) contains, within the perspective range, 84 points for construction machinery, 23 for personnel, 56 for operational mounds, 29 for vegetation, and 38 for the ground. It is important to emphasize that these results were achieved without any 3D point cloud semantic labels in training, highlighting the key advantage of our weakly supervised approach. Through quantitative metrics, segmentation visualization, and target contour reconstruction, the proposed multi-sensor fusion algorithm demonstrates robust semantic perception performance, especially for construction machinery and operational mounds.
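The Kd-tree-based RNN consistency step can be sketched as a majority vote over radius neighborhoods, as below; the 0.5 m search radius is a hypothetical value, since the adaptive radius selection used in our algorithm is not reproduced here.

```python
# A minimal sketch of Kd-tree-based radius nearest neighbor (RNN) label
# refinement: each point adopts the majority label within its search radius.
import numpy as np
from scipy.spatial import cKDTree

def refine_labels(points: np.ndarray, labels: np.ndarray,
                  radius: float = 0.5) -> np.ndarray:
    """Majority-vote relabeling over fixed-radius neighborhoods."""
    tree = cKDTree(points)                       # build the Kd-tree once
    refined = labels.copy()
    for i, nbrs in enumerate(tree.query_ball_point(points, r=radius)):
        votes = labels[nbrs]
        votes = votes[votes >= 0]                # ignore unlabeled points
        if votes.size:
            refined[i] = np.bincount(votes).argmax()
    return refined
```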
3.3. Ablation Study
To further validate the effectiveness of the optimization strategies in the proposed multi-sensor post-fusion semantic perception algorithm, an ablation study was conducted. The study focuses on the following two aspects: the refinement of weakly calibrated perspective projection via the PSO algorithm and the enhancement of semantic consistency using the Kd-tree-based RNN algorithm. Under the condition that the image and point cloud branches produce consistent preliminary outputs, the following four configurations were tested: (1) no post-fusion optimization, (2) PSO optimization only, (3) RNN optimization only, and (4) combined PSO and RNN optimization. The corresponding results are summarized in Table 9. The findings demonstrate that the proposed post-fusion optimization strategies significantly improve the performance of 3D semantic segmentation.
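For illustration, the PSO refinement can be realized with a compact, hand-rolled optimizer over a 6-DoF perturbation vector [rx, ry, rz, tx, ty, tz]. Only the particle count (12) and the search dimensionality follow the text; the fitness function, search span, iteration count, and inertia/acceleration coefficients are assumptions. A suitable fitness would score the overlap between projected cluster points and the image mask of the matching class.

```python
# A minimal, hand-rolled PSO sketch for 6-DoF extrinsic refinement with
# 12 particles; hyperparameters other than the particle count are assumed.
import numpy as np

def pso_refine(fitness, dim=6, n_particles=12, iters=50,
               w=0.7, c1=1.5, c2=1.5, span=0.1):
    rng = np.random.default_rng(0)
    x = rng.uniform(-span, span, (n_particles, dim))   # candidate offsets
    v = np.zeros_like(x)
    p_best = x.copy()
    p_val = np.array([fitness(p) for p in x])
    g_best = p_best[p_val.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        x = np.clip(x + v, -span, span)
        vals = np.array([fitness(p) for p in x])
        improved = vals > p_val
        p_best[improved], p_val[improved] = x[improved], vals[improved]
        g_best = p_best[p_val.argmax()].copy()
    return g_best          # best extrinsic perturbation found
```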
To further evaluate the performance of the proposed multi-sensor post-fusion semantic perception algorithm, comparative tests were conducted against the following baseline methods: the image-based semantic segmentation model DeeplabV3Plus [23], the projection-based point cloud segmentation method RangeNet++ [29], and the random-sampling-based point cloud network RandLA-Net [30]. The performance comparison is summarized in Table 10. As shown in the table, our method achieves a competitive mIoU of 75.85%. Notably, it outperforms all the fully supervised 3D segmentation methods that require point cloud labels for training. This result highlights the key advantage of our framework: by using only 2D image labels, our weakly supervised approach not only reduces annotation cost substantially but also attains comparable or even superior perception performance through effective multi-sensor fusion and optimization.
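For completeness, the reported mPA and mIoU can be computed from a per-class confusion matrix as sketched below, where C[i, j] counts points of true class i predicted as class j; the class ordering is illustrative.

```python
# A minimal sketch of the evaluation metrics: mean pixel/point accuracy (mPA)
# and mean Intersection over Union (mIoU) from a confusion matrix C.
import numpy as np

def mpa_miou(C: np.ndarray):
    tp = np.diag(C).astype(float)           # correctly labeled points
    fp = C.sum(axis=0) - tp                 # wrongly claimed by each class
    fn = C.sum(axis=1) - tp                 # missed points of each class
    iou = tp / np.maximum(tp + fp + fn, 1.0)
    pa = tp / np.maximum(tp + fn, 1.0)      # per-class accuracy (recall)
    return pa.mean(), iou.mean()
```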
3.4. Analysis of Computational Efficiency and Deployment Feasibility
Although this framework integrates multiple modules, including image segmentation, point cloud clustering, projection optimization, and semantic consistency matching, its overall computational overhead remains manageable in practical deployment. The image branch leverages the lightweight MobileNetV2 backbone network (with only 23.5 million parameters), significantly reducing model complexity and inference time while maintaining high segmentation accuracy. This offers clear advantages over networks like Xception (219.9 million parameters). The point cloud branch employs unsupervised DBSCAN clustering, eliminating dependencies on extensive labeled data and complex neural network training. Optimal parameters (ε = 2.5 m, MinPts = 30) were experimentally determined and combined with Kd-tree data structures for efficient neighborhood search, further enhancing clustering efficiency. Within the fusion optimization module, PSO is employed for offline calibration optimization without introducing online perception latency. The Kd-tree-based radius nearest neighbor matching algorithm achieves O(log n) query efficiency on point cloud data, enabling rapid semantic consistency correction. Practical testing demonstrates that this framework operates stably with near-real-time performance on the NVIDIA AGX Orin embedded platform, validating its deployability in resource-constrained construction machinery environments. Future enhancements will focus on code optimization and hardware acceleration to further improve system operational efficiency.
4. Conclusions
This paper addresses the challenge of semantic perception in unstructured environments for construction machinery. Given the scarcity of relevant datasets and the high cost of acquiring 3D point cloud semantic labels in such settings, a multi-sensor fusion semantic segmentation algorithm is proposed, leveraging a binocular camera and LiDAR. The core of the approach lies in utilizing 2D image semantic labels to derive 3D semantic information, thereby circumventing the need for expensive point cloud annotations. To address imperfect sensor calibration during fusion, Particle Swarm Optimization (PSO) is used to refine the extrinsic parameters, and a Kd-tree-based radius nearest neighbor (RNN) matching algorithm is applied to enforce semantic consistency in the fused results. Validation in a real-world operational environment demonstrates that the proposed algorithm achieves competitive performance without relying on point cloud semantic labels, significantly reducing the cost associated with data annotation.
While the proposed weakly supervised 3D semantic perception framework achieves competitive performance using only 2D image labels, it has certain limitations. The method relies heavily on the accuracy of the 2D image segmentation branch, as any segmentation errors can propagate directly to the 3D results and are difficult to correct subsequently, especially under challenging conditions such as extreme lighting, severe occlusion, or unseen object categories. Additionally, the unsupervised DBSCAN clustering remains sensitive to parameter settings, which may require recalibration in significantly different environments. Nevertheless, this work establishes a practical, low-cost pathway for 3D scene understanding and provides a valuable foundation for future research.