Figure 1.
The overall framework of the proposed method. The system consists of three main components: (1) differentiable mesh representation with multi-layer perceptron (MLP) optimization for Bird’s-Eye-View (BEV) scene reconstruction, (2) semantic filtering to eliminate off-road false positives, and (3) multi-frame fusion through ray-casting projection and temporal accumulation. The bottom panel shows the detection network architecture with En-Backbone for feature extraction. Arrows indicate data flow and processing sequence. Color coding: red boxes represent input (left) and output (right), black boxes indicate intermediate processing modules, and blue boxes show the detection network architecture.
Figure 2.
Processing pipeline. After mesh initialization, each frame undergoes YOLO detection and semantic segmentation, followed by semantic filtering to retain on-road detections. Ray-casting projects the filtered 2D detections onto 3D vertices, which are accumulated across T frames. A final filter on observation count yields high-confidence defect vertices.
Figure 3.
Multi-modal vertex representation. Each vertex stores four attribute types: (1) geometry (3D position and height); (2) appearance (RGB values); (3) semantic (class probabilities); and (4) defect (observation count, severity, confidence). The right panel shows an example vertex with typical attribute values.
Figure 4.
BEV mesh reconstruction via MLP optimization. Multi-view images are processed through an MLP network to optimize vertex positions, appearance (RGB), and semantic attributes. The network is trained with four multi-task loss terms to produce a continuous BEV mesh covering 100 × 100 m at 0.1 m resolution with 143,857 vertices. Colors in the layered structure represent the different vertex attributes (geometry, appearance, semantics) optimized by the MLP. The output shows the reconstructed BEV mesh with RGB and semantic information.
Figure 5.
Semantic filtering process. The method computes the road overlap ratio for each detection using the rendered semantic segmentation. Detections whose overlap ratio meets the road overlap threshold (0.5 in Table 2) are retained, while off-road detections are filtered out. The right panel shows results: (top) raw detections, (middle) the road mask, and (bottom) filtered on-road detections, with 33.7% of detections removed as off-road false positives.
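The overlap test in this caption can be illustrated with a minimal sketch. It assumes the ratio is the fraction of bounding-box pixels labelled as road and that a detection is kept when the ratio is at or above the threshold; the exact definition and comparison are not given here, and the names `road_overlap_ratio` and `filter_on_road` are hypothetical.

```python
from typing import List, Sequence, Tuple

# (x_min, y_min, x_max, y_max) in pixels, with the max edges exclusive.
Box = Tuple[int, int, int, int]


def road_overlap_ratio(road_mask: Sequence[Sequence[int]], box: Box) -> float:
    """Fraction of pixels inside `box` that the mask labels as road (1)."""
    x_min, y_min, x_max, y_max = box
    area = (x_max - x_min) * (y_max - y_min)
    if area <= 0:
        return 0.0
    road_pixels = sum(
        road_mask[y][x]
        for y in range(y_min, y_max)
        for x in range(x_min, x_max)
    )
    return road_pixels / area


def filter_on_road(
    road_mask: Sequence[Sequence[int]],
    boxes: List[Box],
    threshold: float = 0.5,  # road overlap threshold from Table 2
) -> List[Box]:
    """Keep detections whose overlap ratio reaches the threshold (assumed >=)."""
    return [b for b in boxes if road_overlap_ratio(road_mask, b) >= threshold]


# Toy 4 x 4 mask: the two bottom rows are road.
mask = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
detections = [(0, 2, 2, 4), (0, 0, 2, 2), (0, 1, 2, 3)]
kept = filter_on_road(mask, detections)
```

In this toy case the fully off-road box `(0, 0, 2, 2)` is dropped, while the box straddling the road boundary is kept because its ratio equals the threshold.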
Figure 6.
Coordinate systems and transformations. The ray-casting process involves: camera coordinate system (S), world coordinate system (O), image coordinate system (D), and BEV grid. Transformations are performed using camera intrinsic and extrinsic parameters to project 2D detections to 3D vertices.
Figure 7.
Ray-casting for 2D–3D association. A ray is cast from the camera center through the detection center into 3D space using the camera parameters. Vertices within the distance threshold of the ray (highlighted region) are associated with the detection for attribute update.
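The association step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a camera-to-world rotation matrix, and a hypothetical distance threshold `max_dist` whose value is not given in the caption.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pixel_to_ray(
    u: float, v: float,
    fx: float, fy: float, cx: float, cy: float,
    cam_to_world_rot: Sequence[Sequence[float]],
) -> Vec3:
    """Unit ray direction in world coordinates for pixel (u, v), pinhole model."""
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    d_world = [sum(row[k] * d_cam[k] for k in range(3)) for row in cam_to_world_rot]
    norm = math.sqrt(sum(c * c for c in d_world))
    return (d_world[0] / norm, d_world[1] / norm, d_world[2] / norm)


def distance_to_ray(point: Vec3, origin: Vec3, direction: Vec3) -> float:
    """Distance from `point` to the half-line starting at `origin`."""
    rel = [p - o for p, o in zip(point, origin)]
    # Clamp at zero so points behind the camera measure to the origin.
    t = max(0.0, sum(r * d for r, d in zip(rel, direction)))
    closest = [o + t * d for o, d in zip(origin, direction)]
    return math.dist(point, closest)


def associate_vertices(
    vertices: Sequence[Vec3], origin: Vec3, direction: Vec3, max_dist: float
) -> List[int]:
    """Indices of vertices lying within `max_dist` of the ray."""
    return [
        i for i, p in enumerate(vertices)
        if distance_to_ray(p, origin, direction) <= max_dist
    ]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A detection centered on the principal point looks straight down the optical axis.
ray = pixel_to_ray(320.0, 240.0, 500.0, 500.0, 320.0, 240.0, identity)
verts = [(0.0, 0.0, 5.0), (0.1, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 0.0, -1.0)]
hits = associate_vertices(verts, (0.0, 0.0, 0.0), ray, max_dist=0.2)
```

Clamping the ray parameter at zero treats the ray as a half-line, so vertices behind the camera are not matched even when they lie close to the extended line.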
Figure 8.
Vertex-wise temporal accumulation. Detections from multiple frames are projected to BEV vertices and accumulated over time. Vertices with consistent observations across frames (vertical alignment) exhibit high confidence. Observation count and severity are updated via an exponential moving average (EMA) with coefficient α.
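A minimal sketch of the per-vertex update follows. The update formula is not recoverable from the caption, so this assumes the common EMA form `new = α · observed + (1 − α) · old`, a plain increment for the observation count, and initialization from the first observation. `DefectState`, `update_vertex`, `confident_vertices`, and the `min_count` cutoff are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DefectState:
    count: int = 0          # number of frames in which the vertex was hit
    severity: float = 0.0   # smoothed severity estimate


def update_vertex(state: DefectState, observed: float, alpha: float = 0.3) -> DefectState:
    """Fold one observation into the vertex state (assumed EMA convention)."""
    if state.count == 0:
        severity = observed  # assumption: first observation initializes the EMA
    else:
        severity = alpha * observed + (1.0 - alpha) * state.severity
    return DefectState(count=state.count + 1, severity=severity)


def confident_vertices(states: Dict[int, DefectState], min_count: int = 2) -> List[int]:
    """Vertex ids observed in at least `min_count` frames (hypothetical cutoff)."""
    return sorted(v for v, s in states.items() if s.count >= min_count)


states: Dict[int, DefectState] = {}
# (vertex id, observed severity) pairs, as ray-casting might produce across frames.
for vertex_id, severity in [(7, 0.2), (7, 0.4), (9, 0.5)]:
    states[vertex_id] = update_vertex(states.get(vertex_id, DefectState()), severity)

stable = confident_vertices(states)
```

Here vertex 7 is seen twice and survives the count filter, while vertex 9 is seen once and is discarded, which mirrors the final filtering by observation count described for the pipeline.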
Figure 9.
Visualization of semantic filtering process. (a) Scene-0064 shows raw YOLOv8 detections with off-road false positives on vehicles. (b) KITTI-00 demonstrates accurate road/non-road discrimination despite vehicle occlusion.
Figure 10.
Evolution of multi-frame fusion process. The figure shows the temporal progression of BEV RGB reconstruction from Epoch 1 to Epoch 7, with progressively clearer lane markings, pavement texture, and vehicle details. The depth map (right) at Epoch 7 demonstrates improved geometric accuracy essential for precise 3D defect mapping.
Figure 11.
Final BEV defect map for Scene-0064. (a) BEV RGB reconstruction. (b) Semantic segmentation. (c) Defect overlay (green: D00 longitudinal cracks, yellow: D10 transverse cracks, orange: D20 alligator cracks, red: D40 potholes). (d) Statistical summary showing defect type distribution.
Figure 12.
Cross-scene defect distribution comparing five representative nuScenes scenes. Color coding: green = D00 longitudinal cracks, yellow = D10 transverse cracks, orange = D20 alligator cracks, red = D40 potholes. Scene-specific statistics are listed on the right.
Figure 13.
Confidence threshold ablation visualization for Scene-0064, showing progressive defect map refinement. Color coding: green = D00 longitudinal cracks, yellow = D10 transverse cracks, orange = D20 alligator cracks, red = D40 potholes.
Table 1.
Dataset statistics and scene characteristics.
| Scene | Type | Lighting | Traffic | Frames | Vertices | Total Detections |
|---|---|---|---|---|---|---|
| nuScenes-0063 | Exit passage | Daytime | Sparse | 39 | 130,770 | 1710 |
| nuScenes-0064 | Parking lot | Daytime | Medium | 40 | 143,857 | 4908 |
| nuScenes-0200 | Parking lot | Daytime | Dense | 39 | 149,635 | 4233 |
| nuScenes-0283 | Right-turn intersection | Daytime | Medium | 40 | 144,924 | 2160 |
| nuScenes-0655 | Complex parking lot | Daytime | Dense | 41 | 143,144 | 2247 |
| KITTI-00 | Urban street | Daytime | Sparse | 4541 | 3,171,274 | 1452 |
Table 2.
Implementation parameters.
| Parameter | KITTI | nuScenes |
|---|---|---|
| BEV Coverage | 600 × 600 m | 100 × 100 m |
| Resolution | 0.1 m | 0.1 m |
| Positional Encoding | L = 4 | L = 5 |
| Confidence Threshold | 0.10 | 0.05 |
| Road Overlap Threshold | 0.5 | 0.5 |
| EMA Coefficient | 0.3 | 0.3 |
| Cameras | 1 (front) | 6 (surround) |
Table 3.
Semantic filtering performance across scenes. The filter removes approximately one-third of all detections as off-road false positives, with filter rates ranging from 27.5% to 39.6% across scenes.
| Scene | Total Detections | Filtered | Filter Rate (%) | On-Road Precision (%) |
|---|---|---|---|---|
| nuScenes-0063 | 1710 | 471 | 27.5 | 72.5 |
| nuScenes-0064 | 4908 | 1656 | 33.7 | 66.3 |
| nuScenes-0200 | 4233 | 1460 | 34.5 | 65.5 |
| nuScenes-0283 | 2160 | 732 | 33.9 | 66.1 |
| nuScenes-0655 | 2247 | 889 | 39.6 | 60.4 |
| nuScenes Average | 3052 | 1042 | 33.8 ± 4.1 | 66.2 |
| KITTI-00 | 1452 | 536 | 36.9 | 63.1 |
Table 4.
Multi-frame fusion quality. High observation counts per vertex indicate effective multi-frame aggregation, enhancing defect detection reliability.
| Scene | Vertices | Defect Vertices | Coverage (%) | Obs/V | Severity |
|---|---|---|---|---|---|
| nuScenes-0063 | 130,770 | 130 | 0.10 | 2.28 | 0.132 |
| nuScenes-0064 | 143,857 | 375 | 0.26 | 2.39 | 0.153 |
| nuScenes-0200 | 149,635 | 329 | 0.22 | 2.03 | 0.127 |
| nuScenes-0283 | 144,924 | 166 | 0.12 | 2.97 | 0.184 |
| nuScenes-0655 | 143,144 | 162 | 0.11 | 2.71 | 0.191 |
| nuScenes Average | 142,466 | 232 | 0.16 | 2.48 | 0.157 |
| KITTI-00 | 3,171,274 | 127 | <0.01 | 2.26 | 0.158 |
Table 5.
3D mapping accuracy. MSR is the percentage of valid detections successfully projected onto the mesh. Mapping failures stem primarily from depth estimation limitations.
| Scene | Valid Detections | Mapped | MSR (%) | Average Distance (m) |
|---|---|---|---|---|
| nuScenes-0063 | 1239 | 297 | 24.0 | 0.18 |
| nuScenes-0064 | 3252 | 895 | 27.5 | 0.15 |
| nuScenes-0200 | 2773 | 667 | 24.1 | 0.17 |
| nuScenes-0283 | 1428 | 493 | 34.5 | 0.14 |
| nuScenes-0655 | 1358 | 439 | 32.3 | 0.16 |
| nuScenes Average | 2010 | 558 | 27.8 | 0.16 |
| KITTI-00 | 916 | 287 | 31.3 | 0.12 |
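The derived columns in Tables 3 to 5 appear to follow simple ratios of the reported counts. The sketch below reproduces the Scene-0064 row under that reading; the relationships are inferred from the table values (they match to the reported rounding) rather than stated in the captions, and `summarize_scene` is a hypothetical helper.

```python
from typing import Dict


def summarize_scene(
    total: int, filtered: int, mapped: int, vertices: int, defect_vertices: int
) -> Dict[str, float]:
    """Recompute the derived table columns from raw counts (inferred definitions)."""
    valid = total - filtered  # detections kept by the semantic filter
    return {
        "valid": valid,
        "filter_rate_pct": 100.0 * filtered / total,
        "on_road_pct": 100.0 * valid / total,
        "msr_pct": 100.0 * mapped / valid,
        "coverage_pct": 100.0 * defect_vertices / vertices,
        "obs_per_vertex": mapped / defect_vertices,
    }


# Counts reported for nuScenes-0064 in Tables 1, 3, 4, and 5.
scene_0064 = summarize_scene(
    total=4908, filtered=1656, mapped=895, vertices=143_857, defect_vertices=375
)
for name, value in scene_0064.items():
    print(f"{name}: {value:.2f}")
```

Under these definitions the on-road precision column is the complement of the filter rate, and the observations-per-vertex column equals mapped detections divided by defect vertices.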
Table 6.
Road filtering threshold ablation (Scene-0064). The three threshold settings yield nearly identical results, indicating that the method is robust to this parameter.
| Road Overlap Threshold | Total Detections | Filtered | Filter Rate (%) | Coverage (%) | On-Road Precision (%) |
|---|---|---|---|---|---|
| 0.3 | 5052 | 1715 | 33.9 | 0.27 | 66.1 |
| 0.5 | 4908 | 1656 | 33.7 | 0.26 | 66.3 |
| 0.7 | 4932 | 1672 | 33.9 | 0.26 | 66.1 |
Table 7.
Confidence threshold ablation (Scene-0064). A lower threshold maximizes recall, and the subsequent semantic filtering maintains precision.
| Confidence Threshold | Total Detections | Filtered | Valid | Coverage (%) | Defect Vertices |
|---|---|---|---|---|---|
| 0.05 | 4908 | 1656 | 3252 | 0.26 | 375 |
| 0.10 | 4809 | 1616 | 3193 | 0.25 | 365 |
| 0.15 | 3339 | 1122 | 2217 | 0.18 | 259 |
| 0.20 | 2523 | 851 | 1672 | 0.14 | 198 |
| 0.25 | 2064 | 722 | 1342 | 0.11 | 162 |
Table 8.
EMA weight ablation (Scene-0064). Different α values have minimal impact on final performance, consistent with the relatively short training cycle.
| EMA α | Coverage (%) | Defect Vertices | Total Observations | Obs/V | Severity |
|---|---|---|---|---|---|
| 0.5 | 0.26 | 375 | 895 | 2.39 | 0.153 |
| 0.7 | 0.26 | 372 | 893 | 2.40 | 0.154 |
| 0.9 | 0.26 | 372 | 896 | 2.41 | 0.155 |
Table 9.
Cross-dataset comparison between nuScenes and KITTI-00.
| Metric | nuScenes Average | KITTI-00 | Notes |
|---|---|---|---|
| Filter Rate (%) | 33.8 ± 4.1 | 36.9 | Stable |
| Obs/Vertex | 2.48 | 2.26 | Consistent |
| Coverage (%) | 0.16 | <0.01 | Dataset-dependent |
| Average Severity | 0.157 | 0.158 | Similar |