1. Introduction
The construction of large-scale, diverse, and realistic 3D vehicle assets has become a central bottleneck for closed-loop testing of autonomous driving systems. While current research shows that the primary limitation of autonomous driving algorithms lies in data scarcity and the resulting poor handling of corner cases [
1], collecting such rare and often dangerous scenarios (e.g., collisions, near-misses) in the real world is constrained by safety, ethical, and legal concerns [
2,
3]. Closed-loop testing addresses this issue by evaluating autonomous vehicles in virtual environments, enabling rapid, safe, and cost-effective iteration of algorithms [
3]. However, the realism and coverage of these tests are fundamentally limited by the quality and diversity of the underlying 3D assets.
Early closed-loop testing mainly relied on simulation environments, where scenes were crafted using 3D rendering tools and computer graphics techniques [
4,
5,
6]. Despite the visual fidelity of modern simulators, researchers soon identified a noticeable gap between results obtained in simulation and those in the real world [
7]. This has led to a shift toward constructing 3D assets directly from real-world data via 3D reconstruction, and then using these assets for more faithful closed-loop evaluation [
8]. To further reduce manual effort and cost, recent work explores reconstructing 3D vehicle assets from relatively simple and readily available inputs, such as single images or sparse views [
9,
10,
11]. Despite recent progress in single-image 3D reconstruction, existing vehicle reconstruction frameworks still suffer from two major limitations: First, most existing methods primarily rely on visual observations while neglecting semantic prior information such as vehicle brand, model, and category attributes, which are highly correlated with global geometric structure. Second, general-purpose reconstruction frameworks often assume relatively complete or ideal observations and, therefore, struggle under real-world non-ideal conditions involving occlusion, truncation, arbitrary viewpoints, and incomplete structural information.
This motivates the following question:
Can we design a reliable algorithm that reconstructs 3D assets of vehicles with specific types and brands from casually taken images captured from non-ideal viewpoints? Exploring this question is the primary focus of this work. Thanks to rapid advances in techniques such as Gaussian splatting [12] and NeRF [13], reconstructing a 3D scene or object from multiple images and their associated camera parameters has become significantly more accessible. In contrast, reconstructing a 3D object from a single image, where both geometric and semantic information are limited, is a highly ill-posed problem that has attracted increasing attention in recent research.
In many practical closed-loop testing scenarios, simply representing vehicles as generic “cars” is no longer sufficient. Perception performance, interaction behavior, and safety margins depend strongly on fine-grained properties such as vehicle type (e.g., SUV, truck, sedan) and specific model (e.g., compact hatchback vs. long-wheelbase sedan). For example, accurate modeling of occlusions, visibility, and sensor returns depends on the true size and shape of the vehicle; realistic simulation of traffic flow and collision risk requires distinguishing between heavy trucks and small passenger cars; and scenario libraries for functional safety and regulation testing often specify particular brands and models involved in corner cases. Therefore, being able to reconstruct vehicles with specific types and brands is critical for building high-fidelity, semantically consistent 3D asset libraries [
14,
15].
This work focuses on the following key question:
Given a casually captured single image of a vehicle from a non-ideal viewpoint, can we reliably reconstruct a 3D point cloud asset that is both geometrically plausible and semantically faithful to its specific type and model? This setting is highly challenging. Real-world images typically exhibit occlusions, truncated views, and cluttered backgrounds; single-view 3D reconstruction is inherently ill-posed due to severe depth and shape ambiguities; and existing methods are often trained on idealized datasets and treat all vehicles uniformly as “cars”, ignoring fine-grained semantics. As illustrated in
Figure 1, vehicle images captured in real-world scenarios are often obtained from arbitrary viewpoints and under non-ideal conditions. Such images frequently contain severe occlusions, truncation, incomplete contours, and missing structural regions, making single-view reconstruction highly ambiguous. In these situations, the visible image regions are often insufficient for reliably inferring the complete global geometry of the vehicle.
When these non-ideal inputs are directly processed by general-purpose image-to-3D reconstruction frameworks, the generated results may become severely distorted or semantically inconsistent. This is because universal reconstruction models typically rely heavily on visible image observations and lack sufficient prior understanding of vehicle-specific geometric structures and semantic characteristics. As a result, they may reconstruct only partial structures or generate geometrically implausible assets that do not match the actual vehicle category or shape.
This observation reveals an important gap between generic reconstruction frameworks and practical autonomous-driving asset reconstruction scenarios. Therefore, we argue that introducing semantic and geometric prior information during the generation process is necessary for guiding the model toward structurally plausible and semantically consistent vehicle reconstruction under non-ideal single-view conditions.
Under these conditions, existing universal reconstruction frameworks tend to generate incomplete geometries, semantically inconsistent structures, or reconstructions that are biased toward the visible regions only. These failure cases show that purely image-driven reconstruction methods are insufficient when the input observations are sparse or partially missing. Therefore, semantic and geometric prior knowledge is necessary for recovering structurally plausible and semantically consistent vehicle point clouds under non-ideal single-view scenarios.
To cope with these challenges, we choose 3D point clouds as the target representation. Compared with meshes, point clouds impose weaker topological constraints and are more tolerant to missing or uncertain regions, reducing the risk of severe geometric artifacts under non-ideal viewpoints. At the same time, point clouds serve as a flexible intermediate representation: given additional information or stronger downstream modules, they can be converted into meshes or textured surface models with high fidelity [
16].
Beyond conceptual motivation, we also conduct a feasibility study to verify that vehicle type and textual semantics provide informative priors for point cloud reconstruction. After extracting features [17] from vehicle point clouds with our proposed Vehicle Type Regulator, we apply t-SNE [18] to visualize their distributions across vehicle types; clear clusters emerge for categories such as “Sedan” and “SUV” (Figure 2), indicating that type labels correlate strongly with geometric structure and can serve as effective guidance signals. In parallel, we encode textual prompts [19] of the form “Brand; Model; Type” (e.g., “BMW; X3; SUV”) using our proposed Vehicle Model Regulator and compare their t-SNE maps with those of the corresponding point cloud features (Figure 3a,b). The two distributions exhibit similar clustering patterns, which suggests a strong alignment between geometric features and text semantics and supports our use of textual information as a high-level prior in the diffusion model.
To investigate whether the proposed Vehicle Model Regulator effectively captures semantic vehicle information, we visualize both textual semantic embeddings and point cloud feature embeddings using t-SNE, as shown in
Figure 3a,b, respectively. Different colors correspond to different vehicle categories/types.
As observed in the visualizations, semantically related vehicle categories exhibit similar clustering tendencies in both the textual embedding space and the point cloud feature space. Representative categories such as SUVs, sedans, and pickup vehicles form relatively consistent neighborhood distributions across the two feature domains. This observation suggests that the proposed semantic regulator helps align textual semantic priors with geometric point cloud representations, thereby providing semantically meaningful guidance during the reconstruction process.
Our key idea is to embed rich prior knowledge into a diffusion-based point cloud generator so that the model can compensate for the information loss of single-view inputs and adhere to fine-grained vehicle semantics. Concretely, we design a prior-guided multimodal diffusion framework [
20] in which
(i) geometric priors from the input image (camera parameters, vehicle mask, distance-transform map, and vision backbone features) are projected onto the evolving 3D point cloud as control signals at each denoising step, providing strong structural cues under occlusion and viewpoint bias;
(ii) semantic priors are introduced via textual prompts of the form “Brand; Model; Type”, whose CLIP embeddings are fused into the point cloud features through a cross-attention mechanism, guiding the generation toward vehicle-specific semantics; and
(iii) a set of regulators, i.e., pretrained neural networks that encode high-level semantic priors, are used during training to regularize the diffusion process. The Vehicle Type Regulator learns the distribution of different vehicle categories and encourages type-consistent outputs, while the Vehicle Model Regulator aligns point cloud features with text features, suppressing mode collapse and semantic drift in low-information regimes. Intuitively, these regulators act as semantic critics that gently push the diffusion trajectory toward globally coherent and brand/model-consistent shapes.
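To make step (i) more concrete, the following minimal sketch projects 3D points into the image with a pinhole camera and reads the vehicle mask at the resulting pixel, which yields one conditioning value per point. The function names, the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, and the nearest-pixel lookup are assumptions made for illustration only; they are not taken from our implementation, which relies on rasterization.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def project_point(p: Point3D, fx: float, fy: float,
                  cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point (z > 0) to pixel coordinates."""
    x, y, z = p
    return fx * x / z + cx, fy * y / z + cy


def sample_mask_at_points(points: Sequence[Point3D],
                          mask: Sequence[Sequence[float]],
                          fx: float, fy: float,
                          cx: float, cy: float) -> List[float]:
    """Return one mask value per point (0.0 if the point is not visible).

    `mask` is indexed as mask[row][col]; a nearest-pixel lookup is used here
    as a simplification (an assumption, not the rasterization-based method).
    """
    height, width = len(mask), len(mask[0])
    values = []
    for p in points:
        if p[2] <= 0:  # behind the camera
            values.append(0.0)
            continue
        u, v = project_point(p, fx, fy, cx, cy)
        col, row = int(round(u)), int(round(v))
        inside = 0 <= row < height and 0 <= col < width
        values.append(float(mask[row][col]) if inside else 0.0)
    return values
```

The same lookup could be repeated for the distance-transform map and the backbone feature map, so that every point carries a small vector of image-derived cues at each denoising step.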
Based on the above design motivations, our contributions are summarized as follows:
We investigate the problem of single-view vehicle point cloud reconstruction under non-ideal real-world conditions and analyze the limitations of existing general-purpose reconstruction frameworks in handling incomplete and semantically ambiguous vehicle observations.
We propose a prior-guided reconstruction framework that introduces semantic vehicle information, including brand, model, and category attributes, to improve semantic consistency and geometric plausibility during the point cloud generation process.
Extensive experiments demonstrate that the proposed framework achieves superior reconstruction quality and robustness compared with existing reconstruction approaches under challenging vehicle observation conditions.
4. Dataset Construction
This section describes the construction and analysis of our dataset. To the best of our knowledge, 3DRealCar [50] is the first and currently the only large-scale dataset of real 3D cars; it contains 2500 car instances and their point clouds with actual sizes in real-world scenes. Based on 3DRealCar, we curate a higher-quality dataset, 3DRealCar++, tailored for fine-grained vehicle point cloud reconstruction.
The 3DRealCar dataset was built with an image-based pipeline that relies on COLMAP [51] for reconstruction and SAM [52] for segmentation. However, this pipeline suffers from two key limitations: (1) the lack of accurate camera parameters and depth data introduces errors in pose estimation and point cloud generation; (2) segmentation with SAM requires extensive manual validation to ensure that the masks are correct, which is labor-intensive and error-prone. We observed a large number of inaccurate masks and reconstruction artifacts in the dataset, because COLMAP does not detect or correct erroneous inputs and instead propagates these errors throughout the reconstruction process.
Therefore, in 3DRealCar++, we optimize this process. The overall data curation and reconstruction pipeline is illustrated in
Figure 7. For each image used for SfM [
51], we employ LSAM [
45] with the prompt “car” to perform segmentation, ensuring that all extracted masks correspond to vehicles.
We feed the images and the corresponding masks into VggSfM [
53] for vehicle point cloud reconstruction. Additionally, based on the reprojection filtering functionality of VggSfM [
53], the masks segmented by LSAM [
45] undergo further filtering. Masks and their corresponding images with too few reprojected points are considered invalid and are discarded.
We then perform a final round of filtering for the reconstructed point clouds and corresponding images. Point clouds with fewer than 2048 points are considered invalid and discarded. For each valid point cloud, we manually check for errors and remove any erroneous reconstructions. Additionally, we review the quality of the corresponding images (e.g., lighting, motion blur) and discard those with poor quality. The remaining high-quality images, along with their camera parameters, are retained as monocular inputs, with each valid point cloud corresponding to multiple high-quality images.
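The automatic part of this filtering can be sketched as follows. The 2048-point minimum comes from the text; the `MIN_REPROJECTED` threshold, the data classes, and the rule that an instance with no remaining views is dropped are hypothetical choices made for illustration. The manual checks for reconstruction errors and image quality are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_POINTS = 2048      # from the text: smaller point clouds are discarded
MIN_REPROJECTED = 50   # hypothetical value; the text does not state a threshold


@dataclass
class View:
    image_id: str
    reprojected_points: int  # reconstructed points that reproject into this view's mask


@dataclass
class Instance:
    instance_id: str
    num_points: int
    views: List[View]


def filter_instance(inst: Instance,
                    min_points: int = MIN_POINTS,
                    min_reprojected: int = MIN_REPROJECTED) -> Optional[Instance]:
    """Drop weak views, then drop the instance if it is too sparse or has no views left."""
    if inst.num_points < min_points:
        return None
    kept = [v for v in inst.views if v.reprojected_points >= min_reprojected]
    if not kept:  # assumption: an instance without valid views is unusable
        return None
    return Instance(inst.instance_id, inst.num_points, kept)
```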
Figure 8 illustrates our dataset filtering process. A comparison between the original 3DRealCar and our optimized 3DRealCar++ is shown in
Figure 9. The comparison reveals that the original dataset’s reconstructed point clouds suffer from background noise and poor uniformity, with ground points surrounding the vehicle. In contrast, while our point clouds remain sparse, their uniformity and accuracy are significantly improved.
As introduced in
Section 1, we assign textual descriptions to each instance using human annotation and LLM assistance (ChatGPT-4o API). For each instance, the most distinctive image is selected, and the LLM is queried for the vehicle’s brand, model, and type in the format “Brand; Model; Type” (e.g., “BMW; X3; SUV”). The generated descriptions are then manually reviewed for accuracy.
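Because every label follows the fixed “Brand; Model; Type” pattern, malformed LLM responses can be caught before manual review. The helper below is a small sketch of such a check; the names `VehiclePrompt`, `parse_prompt`, and `format_prompt` are hypothetical and are not part of our released tooling.

```python
from typing import NamedTuple


class VehiclePrompt(NamedTuple):
    brand: str
    model: str
    vehicle_type: str


def parse_prompt(text: str) -> VehiclePrompt:
    """Parse a 'Brand; Model; Type' string and reject malformed labels."""
    parts = [part.strip() for part in text.split(";")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'Brand; Model; Type', got {text!r}")
    return VehiclePrompt(*parts)


def format_prompt(prompt: VehiclePrompt) -> str:
    """Serialize a label back into the canonical 'Brand; Model; Type' form."""
    return "; ".join(prompt)
```

For example, `parse_prompt("BMW; X3; SUV")` yields a label whose `vehicle_type` field is `"SUV"`, while a response with a missing field raises `ValueError` and can be routed to an annotator.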
Table 3 presents the distribution and statistical results of our 3DRealCar++ dataset. After filtering, 2017 vehicle instances from the original 3DRealCar dataset were retained, each associated with multiple surrounding-view images and annotated for brand, model, and type.
To address the data imbalance (e.g., fewer buses compared to sedans), we applied augmentation techniques (flipping, rotation, scaling) for underrepresented types and randomly removed data from overrepresented types. The 3DRealCar++ dataset was split into 60% training (14,335 images) and 40% testing (9554 images), ensuring no overlap between the sets.
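One simple way to guarantee that the two sets do not overlap is to split by vehicle instance, so that all images of one vehicle fall on the same side. The sketch below assumes this instance-level grouping; the text only states that the sets do not overlap, so the grouping key, the function name, and the seed are illustrative assumptions.

```python
import random
from typing import Dict, List, Tuple


def split_by_instance(images_per_instance: Dict[str, List[str]],
                      train_fraction: float = 0.6,
                      seed: int = 0) -> Tuple[List[str], List[str]]:
    """Assign whole instances to train or test so no vehicle appears in both."""
    instance_ids = sorted(images_per_instance)  # sort first for reproducibility
    random.Random(seed).shuffle(instance_ids)
    n_train = round(train_fraction * len(instance_ids))
    train_ids, test_ids = instance_ids[:n_train], instance_ids[n_train:]
    train = [img for i in train_ids for img in images_per_instance[i]]
    test = [img for i in test_ids for img in images_per_instance[i]]
    return train, test
```

Note that an instance-level split fixes the ratio over instances, not over images. The reported image counts correspond to 14,335 / (14,335 + 9554) ≈ 0.600.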
5. Experiments
5.1. Implementation Details
During training and evaluation, each image was resized to 256 × 256, and each point cloud was downsampled to 2048 points. Point clouds underwent random rotation, translation, scaling, and shuffling during training to improve generalization, while no augmentation or shuffling was applied during evaluation for consistency.
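A minimal sketch of this preprocessing is given below. Uniform random subsampling, rotation about the vertical (y) axis, and all augmentation ranges are assumptions chosen for illustration; the text specifies only the target size of 2048 points and the four augmentation types.

```python
import math
import random
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def downsample(points: Sequence[Point3D], n: int, rng: random.Random) -> List[Point3D]:
    """Uniformly subsample n points without replacement (assumed strategy)."""
    if len(points) < n:
        raise ValueError("point cloud has fewer points than requested")
    return rng.sample(list(points), n)


def augment(points: Sequence[Point3D], rng: random.Random,
            max_angle: float = math.pi,
            scale_range: Tuple[float, float] = (0.9, 1.1),
            max_shift: float = 0.05) -> List[Point3D]:
    """Random rotation about the y axis, scaling, translation, and shuffling."""
    angle = rng.uniform(-max_angle, max_angle)
    c, s = math.cos(angle), math.sin(angle)
    scale = rng.uniform(*scale_range)
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    out = []
    for x, y, z in points:
        xr, zr = c * x + s * z, -s * x + c * z  # rotate in the x-z plane
        out.append((scale * xr + shift[0], scale * y + shift[1], scale * zr + shift[2]))
    rng.shuffle(out)  # point order carries no meaning, so it is randomized
    return out
```

During evaluation only `downsample` would be applied, in line with the protocol described above.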
Regarding the training of the Vehicle Type Regulator , we employed the Adam optimizer with an initial learning rate of . The optimizer was configured with and , an epsilon value of , and a weight decay of to prevent overfitting. The model was trained for 500 epochs using the designated training split of 3DRealCar++.
For the Vehicle Model Regulator, we used the same optimizer settings as for the Vehicle Type Regulator, but set the initial learning rate to 1 × 10⁻³, which gave better convergence and stability in our experiments. Additionally, a learning rate scheduler reduced the learning rate by a factor of 0.7 every 20 epochs.
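The step schedule for the Vehicle Model Regulator can be written in closed form. The function below is a plain-Python illustration using the initial learning rate, decay factor, and step size stated above; the function name is hypothetical.

```python
def step_decay_lr(epoch: int, base_lr: float = 1e-3,
                  gamma: float = 0.7, step_size: int = 20) -> float:
    """Learning rate after multiplying by `gamma` once every `step_size` epochs."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr * gamma ** (epoch // step_size)
```

In PyTorch, `torch.optim.lr_scheduler.StepLR` with `step_size=20` and `gamma=0.7`, stepped once per epoch, follows the same rule.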
For the proposed diffusion model, we adopted a standard generative model training strategy. The model was trained for 100,000 iterations, with a batch size of 16 per training step. We used the AdamW optimizer with
and
[
28]. Throughout the training process, we set the initial learning rate to
and applied linear decay to 0 over the course of the training steps.
For the hyperparameters in the loss function mentioned in
Section 3.3.1, we set
,
, and
. The hyperparameter settings for the loss function mainly depend on the magnitude of each loss term. We aimed to keep their magnitudes and convergence speeds as consistent as possible.
For our diffusion noise schedule, we used a linear schedule with a warm-up phase, where linearly increases from to .
For the sampling phase, we provide two sampling modes based on DDPM [44] and DDIM [43]. During the evaluation phase, similar to previous state-of-the-art works such as PC2 [28], BDM [29], and PVD [54], we chose the F-score and Chamfer Distance [29] as the quantitative evaluation metrics. Our evaluation code is consistent with that of BDM [29], and more details can be found in their open-source code.
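For readers unfamiliar with these metrics, the sketch below gives one common formulation of both. It is a brute-force reference for small point sets, not the BDM evaluation code: the squared-distance convention, the averaging, and the distance threshold are assumptions, and absolute values depend on such conventions and on any scaling applied.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def nearest_distances(src: Sequence[Point3D], dst: Sequence[Point3D]) -> List[float]:
    """Distance from every point in src to its nearest neighbor in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred: Sequence[Point3D], gt: Sequence[Point3D]) -> float:
    """Symmetric Chamfer Distance using mean squared nearest-neighbor distances."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    return (sum(d * d for d in d_pred) / len(d_pred)
            + sum(d * d for d in d_gt) / len(d_gt))


def f_score(pred: Sequence[Point3D], gt: Sequence[Point3D],
            threshold: float) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) at the given distance threshold."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    precision = sum(d < threshold for d in d_pred) / len(d_pred)
    recall = sum(d < threshold for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Precision measures how much of the prediction lies near the ground truth, and recall measures how much of the ground truth is covered by the prediction.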
All experiments were conducted on a single NVIDIA RTX 3090 GPU. The code was primarily implemented in Python 3.9, with the core model built using PyTorch 2.0, and rasterization projections handled by PyTorch3D 0.7.4.
5.2. Semantic Annotation Workflow Discussion
The proposed semantic priors are annotated at the vehicle-instance level rather than the image level. Since multiple images correspond to the same reconstructed vehicle instance, only one semantic label of the form “Brand; Model; Type” is required for an entire image group. This significantly reduces the annotation burden compared with conventional image-wise labeling pipelines. The average human verification time for each vehicle instance is approximately 15 s.
To further improve efficiency, we employed an LLM-assisted annotation strategy using the ChatGPT-4o API to automatically generate candidate semantic descriptions. Human annotators mainly performed lightweight verification and correction instead of manually labeling all semantic information from scratch.
Therefore, the overall human workload remained manageable, while preserving semantic consistency across the dataset. Moreover, in practical deployment scenarios, semantic priors can be automatically estimated using existing vehicle recognition systems, OCR-based identification methods, or vision–language models, reducing the dependence on manual annotation during inference.
5.3. Comparison with Existing Methods
For point cloud-based reconstruction methods, we selected RGB2Point [
27], PC2 [
29], and BDM [
32] as strong baselines, and we further included two recent diffusion-based 3D reconstruction approaches—SDFit [
55] and ICDDPM [
56]—for a comprehensive comparison with our method. We reimplemented all methods and followed the training settings reported in their original papers as closely as possible. To ensure fairness, however, every model was trained for the same number of iterations (100,000) on 3DRealCar++. The quantitative results under this unified protocol are summarized in
Table 4.
Following several recent diffusion-based point cloud reconstruction frameworks [
29], this work primarily adopts the Chamfer Distance (CD) and F-score for quantitative evaluation. Compared with the Earth Mover’s Distance (EMD), CD is more suitable for measuring the geometric coverage and structural consistency of generated point clouds under stochastic sampling conditions. Since the proposed task focuses on semantically plausible vehicle reconstruction rather than strict point-wise correspondence, CD-based evaluation provides a more computationally practical and structurally representative assessment for large-scale diffusion-based reconstruction experiments.
As shown in
Table 4, our DDIM-based variant (Ours) already surpasses all baselines in terms of both F1-score and Chamfer Distance, improving the F1-score from 0.6163 (BDM) to 0.6990 and reducing the CD from 79.2421 to 70.0585. When switching to the DDPM sampling strategy (Ours-DDPM), performance is further boosted across all metrics: precision increases from 0.7020 to 0.7399, recall from 0.7278 to 0.7395, and F1-score from 0.6990 to 0.7226, while the Chamfer Distance decreases from 70.0585 to 62.1525. While DDPM sampling delivers the best reconstruction quality, it also entails higher computational cost; therefore, all main comparisons with prior SOTA methods and most ablation studies are reported using the DDIM-based variant, with DDPM results provided to reveal the upper-bound potential of our model under more expensive sampling settings.
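The relative size of these gains can be checked directly from the quoted numbers. The short script below uses only the values stated in the text; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(old: float, new: float) -> float:
    """Signed relative change from `old` to `new`."""
    return (new - old) / old


# Values quoted in the text (baseline -> Ours, then Ours -> Ours-DDPM).
f1_gain_ddim = relative_change(0.6163, 0.6990)
cd_drop_ddim = relative_change(79.2421, 70.0585)
f1_gain_ddpm = relative_change(0.6990, 0.7226)
cd_drop_ddpm = relative_change(70.0585, 62.1525)

for name, value in [("F1, baseline to DDIM", f1_gain_ddim),
                    ("CD, baseline to DDIM", cd_drop_ddim),
                    ("F1, DDIM to DDPM", f1_gain_ddpm),
                    ("CD, DDIM to DDPM", cd_drop_ddpm)]:
    print(f"{name}: {value:+.1%}")
```

In relative terms, the DDIM variant improves the F1-score by about 13.4% and lowers the Chamfer Distance by about 11.6% with respect to the quoted baseline values, and DDPM sampling adds a further F1 gain of about 3.4% and a Chamfer Distance reduction of about 11.3%.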
In
Table 5, we provide detailed quantitative results for each vehicle category. Although data balancing techniques were applied during training, we can observe that model performance across categories remains closely correlated with the original sample distribution—categories with fewer real samples, such as Van and Sport Car, generally exhibit lower F-scores. Nevertheless, the overall performance, with an average F-score of 0.6990 achieved by our model, highlights its superiority over existing state-of-the-art approaches. Additionally, we provide several representative cases in
Figure 10 for qualitative analysis.
From the results shown in
Figure 10, our method reconstructs point clouds that most closely approximate the ground-truth geometry, which corroborates our quantitative results. Combining the quantitative and qualitative analyses, we make the following observations:
RGB2Point adopts a transformer-based encoder–decoder architecture; however, under non-ideal conditions, its performance falls short compared to diffusion-based models such as PC2 and BDM, which leverage iterative denoising. Nevertheless, the performance of PC2 remains significantly constrained due to the lack of prior information guidance. Although BDM incorporates a joint reasoning paradigm between two models, it also struggles to achieve robust performance when sufficient input information is unavailable.
In addition to the quantitative comparisons against point cloud-based reconstruction methods, we also compare our approach with other end-to-end monocular 3D reconstruction methods. As shown in
Figure 11, we provide several case studies that illustrate why our prior-based approach remains robust under non-ideal conditions.
Specifically, in
Figure 11, we select two of the most prominent end-to-end mesh reconstruction methods, One-2-3-45++ [38] and SPAR3D [39], to benchmark against our approach. While these methods are highly effective for general reconstruction tasks, they can still produce notable errors when the input image contains limited visual information. For instance, One-2-3-45++ reconstructs an SUV as a sedan when only the front of the vehicle is visible, while SPAR3D reconstructs only a partial frontal section of the car.
From a narrow, image-consistency perspective, these results are technically correct, since they agree with the visible image evidence. However, from the broader perspective of controllable asset generation for autonomous driving applications, such errors are critical. Their root cause lies in the inherent limitations of single-view input images. In contrast, our prior-guided point cloud reconstruction method successfully reconstructs the 3D geometry of the vehicles, even under information-sparse conditions.
Furthermore, VQA-Diff [
40] introduces an LLM-based [
41] approach that leverages the extensive knowledge and reasoning capabilities of LLMs to perform 3D reconstruction from single-view image inputs. We reproduced the LLM-related component of their work using ChatGPT-4o, and the results are shown in
Figure 12.
Due to the hallucination issues inherent in current LLMs, their responses under suboptimal input conditions are not always accurate. As shown in
Figure 12, both input images depict a “Hong Qi” brand vehicle of the “H5” model. However, the LLM fails to provide the correct identification, leading to inaccurate downstream reconstruction by VQA-Diff [
40].
Therefore, we argue that guiding 3D reconstruction under constrained conditions using prior information is a more reliable and robust approach.
5.4. Ablation Studies
This section presents ablation studies to verify the importance and effectiveness of the key components in our method. Each experiment isolates a specific factor to clarify its contribution to the final performance.
We ablate the two regulators, the Vehicle Type Regulator and the Vehicle Model Regulator, introduced in Section 3.3.1. The results in Table 6 show that both modules improve performance, with the Vehicle Type Regulator providing the more pronounced gain. This is likely because features associated with different vehicle types are more separable in the feature space, offering stronger guidance for generating point clouds of the desired category.
We evaluate the feature fusion strategy described in
Section 3.2.2. In particular, we study the effect of incorporating text features and the choice of cross-attention for fusion. As summarized in
Table 7, introducing text features already leads to a clear performance boost (first vs. third column), and replacing simple concatenation with attention-based fusion yields further improvements (second vs. third column), confirming that our fusion design is both necessary and effective.
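The difference between the two fusion strategies can be illustrated with a deliberately simplified sketch. Here the attention uses a single head, identity query/key/value projections, and a residual connection; these simplifications, as well as the function names, are assumptions for illustration, whereas a practical module would use learned projections.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(point_feats: Sequence[Sequence[float]],
                    text_tokens: Sequence[Sequence[float]]) -> List[List[float]]:
    """Each point feature (query) attends over the text tokens (keys and values)."""
    d = len(text_tokens[0])
    fused = []
    for q in point_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in text_tokens]
        weights = softmax(scores)
        context = [sum(w * tok[j] for w, tok in zip(weights, text_tokens))
                   for j in range(d)]
        fused.append([qi + ci for qi, ci in zip(q, context)])  # residual connection
    return fused


def concat_fusion(point_feats: Sequence[Sequence[float]],
                  text_vector: Sequence[float]) -> List[List[float]]:
    """Baseline: append the same pooled text vector to every point feature."""
    return [list(q) + list(text_vector) for q in point_feats]
```

With concatenation, every point receives an identical text vector, whereas attention lets each point weight the text tokens according to its own feature. This is one plausible reading of why attention-based fusion performs better in Table 7.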
In addition to the fusion mechanism, we also examined the impact of different text encoders on our framework. We compared the original CLIP text encoder with two widely used language models—BERT [
57] and RoBERTa [
58]—as well as a baseline without any text priors. As shown in
Table 8, using text priors consistently improves performance over the no-text setting, and CLIP provides the strongest gains among all choices, likely due to its stronger alignment between visual and textual semantics.
We further investigated the effects of different diffusion noise schedules while keeping the sampling strategy fixed. In particular, we compared a linear schedule with a warm-up (our default choice) against cosine and quadratic schedules. As reported in
Table 9, all three schedules yield reasonable results, but the linear schedule consistently achieves the best F1-score and the lowest Chamfer Distance, justifying the design choice described in Section 5.1.
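For reference, the three schedule families can be sketched as follows. The formulas are the commonly used ones from the diffusion literature; the example endpoints `1e-4` and `0.02` and the 1000-step horizon are widely used defaults, not the values of our experiments, and the warm-up phase of our linear schedule is omitted.

```python
import math
from typing import List


def linear_betas(T: int, beta_start: float, beta_end: float) -> List[float]:
    """Noise variances that increase linearly from beta_start to beta_end."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]


def quadratic_betas(T: int, beta_start: float, beta_end: float) -> List[float]:
    """Linear in sqrt(beta), hence quadratic in beta."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * t / (T - 1)) ** 2 for t in range(T)]


def cosine_betas(T: int, s: float = 0.008, max_beta: float = 0.999) -> List[float]:
    """Betas derived from a squared-cosine cumulative signal level."""
    def alpha_bar(t: int) -> float:
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1 - alpha_bar(t + 1) / alpha_bar(t), max_beta) for t in range(T)]


def alpha_bars(betas: List[float]) -> List[float]:
    """Cumulative product of (1 - beta_t): the remaining signal fraction."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


# Placeholder settings for illustration only (not the paper's values).
example = alpha_bars(linear_betas(1000, 1e-4, 0.02))
```

Comparing `alpha_bars` across the three schedules shows how quickly each one destroys the signal, which is the property that the ablation varies.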
To better understand the role of the regulators as semantic regularizers, we compared different training schemes: (i) removing both regulators, (ii) jointly training regulators and the diffusion model end-to-end, and (iii) using our two-stage strategy where the regulators are pretrained and then frozen. As summarized in
Table 10, joint training improves over the no-regulator setting but still underperforms our frozen-regulator scheme, which achieves the best F1-score and the lowest Chamfer Distance. This supports our interpretation that frozen regulators act as stable, data-dependent regularizers rather than merely adding model capacity.
We also compared different sampling strategies for the diffusion model, as reported in
Table 11. The first row corresponds to DDPM sampling with 100 steps from random Gaussian noise, the second row to DDIM sampling with 100 steps, and the third row to DDPM with 1000 steps. Consistent with expectations, DDPM with 1000 steps achieves the best reconstruction quality, but at a substantial cost in inference time. Therefore, we adopted DDIM with 100 steps as our default setting, which provides a favorable trade-off between accuracy and efficiency.
Finally, we designed a reverse experiment to further validate the effectiveness and necessity of the regulators. We fed four types of point clouds—Gaussian noise, outputs from PC2, our reconstructed results, and ground-truth point clouds—into the regulators and compared their evaluations. As shown in
Table 12, the Vehicle Type Regulator (VTR) was assessed using instance accuracy, while the Vehicle Model Regulator (VMR) was evaluated using the matching score, defined as the Euclidean distance in Equation (
9). Gaussian noise yielded the worst scores, ground truth the best, and our reconstructions consistently outperformed PC2. These results confirm that the regulators can reliably distinguish point clouds by type and model, and that their supervision provides meaningful priors that effectively enhance the performance of our diffusion model.
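The two evaluation quantities used in this experiment can be sketched as follows. Equation (9) is not reproduced here; the sketch assumes that the matching score is the plain Euclidean distance between a point cloud feature vector and the corresponding text feature vector, so lower values indicate better alignment. Function names are illustrative.

```python
import math
from typing import Sequence


def instance_accuracy(predicted: Sequence[str], target: Sequence[str]) -> float:
    """Fraction of instances whose predicted vehicle type equals the label."""
    if len(predicted) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


def matching_score(point_feature: Sequence[float],
                   text_feature: Sequence[float]) -> float:
    """Euclidean distance between geometry and text embeddings (lower is better)."""
    return math.dist(point_feature, text_feature)
```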
6. Conclusions
In this work, we tackled the challenging problem of reconstructing fine-grained 3D vehicle point clouds from a single non-ideal image, where severe viewpoint limitations, occlusions, and background clutter lead to highly ambiguous geometry. Our main contribution is a prior-guided conditional diffusion framework that injects rich geometric and semantic priors into both the denoising process and the training objective. Concretely, we (i) fused camera-aware image features, masks, and distance-transform maps with the evolving point cloud to provide strong geometric control; and (ii) embedded textual prompts of the form “Brand; Model; Type” through cross-attention, enabling vehicle-specific semantic guidance. On top of this, we introduced frozen Vehicle Type and Vehicle Model Regulators as semantic regularizers that enforce consistency with learned type distributions and text–geometry alignments. Extensive experiments on the 3DRealCar++ dataset show that our method significantly improves over state-of-the-art single-view point cloud reconstruction baselines.
In practical deployment scenarios, the proposed framework is primarily designed for offline or semi-online 3D asset generation in autonomous driving simulation and digital-twin systems, rather than strict real-time onboard perception. Therefore, reconstruction fidelity and semantic consistency are prioritized over ultra-low-latency inference. Nevertheless, the DDIM-based sampling strategy provides substantially faster inference while maintaining competitive reconstruction quality, demonstrating a favorable trade-off between computational efficiency and reconstruction performance.
Although the proposed framework utilizes semantic prior information such as vehicle brand, model, and category attributes, the objective is not to memorize fixed vehicle templates but to learn generalized semantic–geometric correlations across vehicle structures. Therefore, the proposed framework can maintain a certain generalization capability toward unseen or newly introduced vehicle variants. In practical deployment scenarios, newly emerging vehicle models can be incorporated through incremental dataset expansion and lightweight fine-tuning without modifying the overall reconstruction architecture. The text-prompt-based semantic guidance mechanism also provides a flexible way to extend the semantic space as new vehicle categories and models become available.
Future research may further improve the practicality of the proposed framework by reducing the dependence on human intervention during semantic prior generation. In particular, integrating vision–language models and automatic vehicle recognition systems may enable semantic priors such as vehicle brand, model, and type to be extracted directly from unconstrained real-world images. In addition, continual learning strategies could help the reconstruction model incrementally adapt to newly emerging vehicle models without requiring complete retraining. Exploring self-supervised semantic alignment mechanisms that jointly learn geometric and semantic consistency without explicit human verification also represents a promising direction toward fully automated and scalable vehicle reconstruction systems.