3.1. Implementation Details
In accordance with the proposed pipeline, a dataset comprising eight textual prompts was constructed to comprehensively evaluate the performance of the proposed method across diverse indoor and outdoor scenarios. The prompts included six standard descriptions: “A mountain landscape”, “Waves on the beach”, “A luxury bathroom”, “A bedroom”, “Hulunbuir grassland with blue sky”, and “Beijing city library”. In addition, two more complex scenes were introduced to examine spatial complexity and geometric generalization, namely “An indoor exhibition hall with multiple art installations, glass display cases, large posters on the wall, and spotlights” and “An outdoor city plaza with a large central fountain, stone benches, tiled ground, and modern street lamps surrounded by open space.”
All experiments were conducted using PyTorch 2.4.0 with CUDA 12.4 on a workstation equipped with an NVIDIA RTX 3090 GPU (24 GB memory). Each training session consisted of 10,000 iterations to ensure stable convergence and consistent reconstruction performance.
To comprehensively evaluate our method, we employed three established reconstruction metrics: PSNR, SSIM, and LPIPS, which collectively gauge pixel-level accuracy, structural fidelity, and perceptual realism. Leveraging the dense pseudo labels that cover the entire panoramic scene, we directly compared the reconstructed outputs against these pseudo labels to obtain an objective and reliable assessment.
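For concreteness, the sketch below shows one way such a per-view evaluation can be implemented, assuming rendered views and their pseudo labels are available as H×W×3 arrays in [0, 1]; the helper name evaluate_view and the AlexNet-based LPIPS backbone are illustrative choices rather than a description of our exact evaluation code.

```python
import torch
import lpips                                            # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Perceptual metric; the AlexNet backbone is an assumed (common) choice.
lpips_fn = lpips.LPIPS(net='alex')

def evaluate_view(rendered, pseudo_label):
    """Compare one rendered view against its dense pseudo label.
    Both inputs: float numpy arrays of shape (H, W, 3) scaled to [0, 1]."""
    psnr = peak_signal_noise_ratio(pseudo_label, rendered, data_range=1.0)
    ssim = structural_similarity(pseudo_label, rendered,
                                 channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1]
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2.0 - 1.0
    lp = lpips_fn(to_t(rendered), to_t(pseudo_label)).item()
    return psnr, ssim, lp
```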
3.2. Comparison with Baselines
Baselines. We compared our method with three representative approaches to text-to-3D generation and panoramic scene reconstruction. LucidDreamer [30] iteratively enhances a single image and its textual prompt to generate multi-view consistent content, progressively expanding the scene to form a holistic view. To ensure a fair comparison, we adapted its pipeline to accept an initial panoramic image as input and integrated our dense pseudo label supervision into its training. DreamScene360 [31] constructs immersive 360° panoramic scenes from textual prompts by projecting generated images into 3D environments. While it preserves global scene coherence, the projection process often leads to local geometric distortions, especially near high-curvature regions. Scene4U [32] introduces a panoramic image-driven framework for immersive 3D scene reconstruction that enhances scene integrity by removing distracting elements. The method generates panoramas with specific spatiotemporal attributes, decomposes them into semantic layers, and refines each layer through inpainting and depth restoration before reconstructing a multi-layered 3D scene using 3DGS. Since the official implementation is unavailable, we reproduced a variant following its multi-layer decomposition principle to ensure consistency within our framework.
Qualitative results.
Figure 4 presents visual comparisons with the baseline methods. Our method exhibits sharper textures, cleaner structural boundaries, and fewer rendering artifacts. In outdoor scenes, fine-grained details in vegetation and terrain are preserved while maintaining global structural consistency. In indoor scenes, object contours and furniture edges are preserved without the blurring and blocky distortions seen in the baseline outputs.
Quantitative results.
Figure 5 summarizes performance across all eight scenes in terms of (a) PSNR, (b) SSIM, and (c) LPIPS; scenes are indexed as Scene1–SceneN, and the mapping to full scene names is provided in Table 1. In outdoor scenes such as “Hulunbuir grassland with blue sky”, our method achieves an improvement of more than 5 dB in PSNR. In indoor scenarios such as “A bedroom”, our method achieves the highest SSIM and the lowest LPIPS, indicating better structural similarity and perceptual realism. We attribute these gains to the improved rendering capabilities of Gaussian Splatting and to our initialization strategy, which seeds a larger, more uniformly distributed set of points, enabling accurate recovery of critical details throughout the scene.
3.3. Ablation Study and Analysis
Ablation on point cloud initialization. We conducted ablation studies to evaluate the effectiveness of our point cloud initialization scheme against several mainstream approaches, including BiFuse, Depth Anything V2 [33], VGGT [34], COLMAP, COLMAP (MVS), and FlowMap. Specifically, BiFuse fuses ERP and CubeMap projections through a dual-branch network; Depth Anything V2 leverages large-scale pseudo-labeled data for robust monocular depth estimation; VGGT introduces a geometry transformer capable of directly predicting depth and point clouds; COLMAP and COLMAP (MVS) provide classical sparse and dense reconstructions; and FlowMap jointly optimizes depth and camera parameters in a differentiable framework. As summarized in Table 2, our method achieves the best overall performance across all three metrics, with average scores of 42.07 (PSNR), 0.992 (SSIM), and 0.020 (LPIPS).
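As a point of reference for what such an initialization involves, the following is a minimal sketch of lifting a panoramic (ERP) depth map into an initialization point cloud by back-projecting each pixel along its spherical viewing ray; the coordinate convention (y-up, z-forward) and the function name are assumptions for illustration and may differ from our actual implementation.

```python
import numpy as np

def erp_depth_to_points(depth: np.ndarray) -> np.ndarray:
    """Back-project an equirectangular (ERP) depth map into a 3D point cloud.
    depth: (H, W) array of per-pixel radial distances from the panorama center."""
    H, W = depth.shape
    # Pixel centers -> longitude in [-pi, pi), latitude in [-pi/2, pi/2]
    lon = (np.arange(W) + 0.5) / W * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(H) + 0.5) / H * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Unit viewing ray per pixel (y-up, z-forward convention assumed)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    rays = np.stack([x, y, z], axis=-1)                 # (H, W, 3)
    return (rays * depth[..., None]).reshape(-1, 3)     # (H*W, 3) points
```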
Generalization experiments on real-world panoramic data. To further assess the generalization capability and robustness of the proposed approach in real-world environments, a supplementary dataset was collected using a Teche 360 panoramic camera (Teche, Shenyang, China). The first scene was an indoor exhibition hall on the first floor, the second was the exterior area of the laboratory building, and the third was a public park adjacent to the university library, as illustrated in
Figure 6. For each of the three scenes, we conducted comparative experiments to quantitatively assess reconstruction performance. The detailed numerical results are summarized in
Table 3, where our method consistently outperforms the other methods.
Teacher model comparison. To validate the choice of Moge2 as the teacher model in our distillation framework, we conducted comparative experiments using three alternative teachers: DPT, Metrics3D, and VGGT. All training settings and evaluation metrics were kept identical to ensure fairness. As summarized in Table 4, the student distilled from Moge2 achieves the highest reconstruction quality, whereas DPT performs significantly worse and Metrics3D and VGGT yield slightly lower results. This gap demonstrates that the intrinsic errors and generalization ability of the teacher model critically influence pseudo label quality and, in turn, student performance.
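To make the distillation setup concrete, the sketch below illustrates one plausible form of pseudo label supervision, in which the student's predicted depth is penalized against the teacher's pseudo label over valid pixels; the masked L1 objective and the function name are assumptions for illustration, not necessarily the exact loss used in our framework.

```python
import torch

def pseudo_label_loss(student_depth: torch.Tensor,
                      teacher_depth: torch.Tensor,
                      valid_mask: torch.Tensor) -> torch.Tensor:
    """Masked L1 distillation loss (illustrative).
    student_depth, teacher_depth: (B, H, W) depth maps.
    valid_mask: (B, H, W) binary mask marking reliable teacher pixels."""
    diff = (student_depth - teacher_depth).abs()
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```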
Effect of pseudo label quantity. We further conducted an ablation study with different numbers of pseudo labels: 120, 180, 240, 300, and 360. As presented in Table 5, the results exhibit a non-monotonic trend: performance first improves as the number of pseudo labels increases, reaching the highest PSNR and SSIM at 180 samples, and then declines slightly as more pseudo labels are added. With fewer pseudo labels, the student model is trained on a compact set of relatively clean labels, which reduces the influence of outliers and large teacher model errors. As the number of pseudo labels increases, the training set becomes more diverse and covers a broader range of geometric structures, which in principle enhances generalization; at the same time, it increases training time and memory usage. Overall, these results suggest that the tradeoff between label representativeness and noise accumulation plays a key role in distillation performance. In our experiments, using around 240 samples achieves the best balance between supervision diversity, label reliability, and training efficiency.
Ablation study on the number of Fibonacci sampling points. To analyze the impact of the number of Fibonacci sampling points on the results, we performed an ablation study evaluating different numbers of sampling points in terms of the three metrics and reconstruction time. The results are summarized in
Table 6.
From the results, 20 points provided the best balance between the three metrics and reconstruction time. This choice aligns with common practice in the field, where 20-point sampling is widely adopted; we also used the traditional icosahedron method as a baseline for comparison.
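For reference, Fibonacci (golden-angle) sphere sampling can be implemented in a few lines; the sketch below generates N approximately uniform directions on the unit sphere (e.g., the 20 sampling points discussed above). The function name fibonacci_sphere is illustrative.

```python
import numpy as np

def fibonacci_sphere(n_points: int = 20) -> np.ndarray:
    """Generate n_points approximately uniform unit directions on the sphere
    using the Fibonacci (golden-angle) spiral."""
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))        # ~2.39996 rad
    i = np.arange(n_points)
    z = 1.0 - 2.0 * (i + 0.5) / n_points               # heights uniform in (-1, 1)
    radius = np.sqrt(1.0 - z ** 2)                     # circle radius at height z
    theta = golden_angle * i                           # azimuth advances by golden angle
    x, y = radius * np.cos(theta), radius * np.sin(theta)
    return np.stack([x, y, z], axis=-1)                # (n_points, 3) unit vectors

dirs = fibonacci_sphere(20)                            # e.g., 20 sampling directions
```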
Under identical settings, we compared our optimizer with the traditional baseline, as shown in Table 7. The adaptive optimization achieves higher fidelity at nearly the same runtime, reduces artifacts, and improves consistency without adding significant computational cost.