3.1. Implementation Details
In accordance with the proposed pipeline, a dataset comprising eight textual prompts was constructed to comprehensively evaluate the performance of the proposed method across diverse indoor and outdoor scenarios. The prompts included six standard descriptions: “A mountain landscape”, “Waves on the beach”, “A luxury bathroom”, “A bedroom”, “Hulunbuir grassland with blue sky”, and “Beijing city library”. In addition, two more complex scenes were introduced to examine spatial complexity and geometric generalization, namely “An indoor exhibition hall with multiple art installations, glass display cases, large posters on the wall, and spotlights” and “An outdoor city plaza with a large central fountain, stone benches, tiled ground, and modern street lamps surrounded by open space.”
All experiments were conducted using PyTorch 2.4.0 with CUDA 12.4 on a workstation equipped with an NVIDIA RTX 3090 GPU (24 GB memory). Each training session consisted of 10,000 iterations to ensure stable convergence and consistent reconstruction performance.
To comprehensively evaluate our method, we employed three established reconstruction metrics: PSNR, SSIM, and LPIPS, which collectively gauge pixel-level accuracy, structural fidelity, and perceptual realism. Leveraging the dense pseudo labels that cover the entire panoramic scene, we directly compared the reconstructed outputs against these pseudo labels to obtain an objective and reliable assessment.
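For concreteness, the sketch below shows one way such a per-view evaluation can be implemented, assuming rendered views and their pseudo labels are available as H×W×3 arrays in [0, 1]; the helper name evaluate_view and the AlexNet-based LPIPS backbone are illustrative choices rather than a description of our exact evaluation code.

```python
import torch
import lpips                                            # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Perceptual metric; the AlexNet backbone is an assumed (common) choice.
lpips_fn = lpips.LPIPS(net='alex')

def evaluate_view(rendered, pseudo_label):
    """Compare one rendered view against its dense pseudo label.
    Both inputs: float numpy arrays of shape (H, W, 3) scaled to [0, 1]."""
    psnr = peak_signal_noise_ratio(pseudo_label, rendered, data_range=1.0)
    ssim = structural_similarity(pseudo_label, rendered,
                                 channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1]
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2.0 - 1.0
    lp = lpips_fn(to_t(rendered), to_t(pseudo_label)).item()
    return psnr, ssim, lp
```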
3.2. Comparison with Baselines
Baselines. We compared our method with three representative approaches to text-to-3D generation and panoramic scene reconstruction. LucidDreamer [30] iteratively enhances a single image and its textual prompt to generate multi-view consistent content, progressively expanding the scene to form a holistic view. To ensure a fair comparison, we adapted its pipeline to accept an initial panoramic image as input and integrated our dense pseudo label supervision into its training. DreamScene360 [31] constructs immersive 360° panoramic scenes from textual prompts by projecting generated images into 3D environments. While it preserves global scene coherence, the projection process often leads to local geometric distortions, especially near high-curvature regions. Scene4U [32] introduces a panoramic image-driven framework for immersive 3D scene reconstruction that enhances scene integrity by removing distracting elements. The method generates panoramas with specific spatiotemporal attributes, decomposes them into semantic layers, and refines each layer through inpainting and depth restoration before reconstructing a multi-layered 3D scene using 3DGS. Since the official implementation is unavailable, we reproduced a variant following its multi-layer decomposition principle to ensure consistency within our framework.
Qualitative results.
Figure 4 presents visual comparisons with the baseline methods. Our method exhibits sharper textures, cleaner structural boundaries, and fewer rendering artifacts. In outdoor scenes, fine-grained details in vegetation and terrain are preserved while maintaining global structural consistency. In indoor scenes, object contours and furniture edges are preserved without the blurring and blocky distortions seen in the baseline outputs.
Quantitative results.
Figure 5 summarizes performance across all eight scenes in terms of (a) PSNR, (b) SSIM, and (c) LPIPS; scenes are indexed as Scene1–SceneN, and the mapping to full scene names is provided in Table 1. In outdoor scenes such as “Hulunbuir grassland with blue sky”, our method achieves an improvement of more than 5 dB in PSNR. In indoor scenarios such as “A bedroom”, our method achieves the highest SSIM and the lowest LPIPS, indicating better structural similarity and perceptual realism. We attribute these gains to the improved rendering capabilities of Gaussian Splatting and to our initialization strategy, which seeds a larger, more uniformly distributed set of points, enabling accurate recovery of critical details throughout the scene.
3.3. Ablation Study and Analysis
Ablation on point cloud initialization. We conducted ablation studies to evaluate the effectiveness of our point cloud initialization scheme against several mainstream approaches, including BiFuse, Depth Anything V2 [33], VGGT [34], COLMAP, COLMAP (MVS), and FlowMap. Specifically, BiFuse fuses ERP and CubeMap projections through a dual-branch network; Depth Anything V2 leverages large-scale pseudo-labeled data for robust monocular depth estimation; VGGT introduces a geometry transformer capable of directly predicting depth and point clouds; COLMAP and COLMAP (MVS) provide classical sparse and dense reconstructions; and FlowMap jointly optimizes depth and camera parameters in a differentiable framework. As summarized in Table 2, our method achieves the best overall performance across all three metrics, with average scores of 42.07 (PSNR), 0.992 (SSIM), and 0.020 (LPIPS).
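As a point of reference for what such an initialization involves, the following is a minimal sketch of lifting a panoramic (ERP) depth map into an initialization point cloud by back-projecting each pixel along its spherical viewing ray; the coordinate convention (y-up, z-forward) and the function name are assumptions for illustration and may differ from our actual implementation.

```python
import numpy as np

def erp_depth_to_points(depth: np.ndarray) -> np.ndarray:
    """Back-project an equirectangular (ERP) depth map into a 3D point cloud.
    depth: (H, W) array of per-pixel radial distances from the panorama center."""
    H, W = depth.shape
    # Pixel centers -> longitude in [-pi, pi), latitude in [-pi/2, pi/2]
    lon = (np.arange(W) + 0.5) / W * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(H) + 0.5) / H * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Unit viewing ray per pixel (y-up, z-forward convention assumed)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    rays = np.stack([x, y, z], axis=-1)                 # (H, W, 3)
    return (rays * depth[..., None]).reshape(-1, 3)     # (H*W, 3) points
```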
Generalization experiments on real-world panoramic data. To further assess the generalization capability and robustness of the proposed approach in real-world environments, a supplementary dataset was collected using a Teche 360 panoramic camera (Teche, Shenyang, China). The first scene was an indoor exhibition hall on the first floor, the second was the exterior area of the laboratory building, and the third was a public park adjacent to the university library, as illustrated in
Figure 6. For each of the three scenes, we conducted comparative experiments to quantitatively assess reconstruction performance. The detailed numerical results are summarized in
Table 3, where our method consistently outperforms the other methods.
Teacher model comparison. To validate the choice of Moge2 as the teacher model in our distillation framework, we conducted comparative experiments using three alternative teachers: DPT, Metrics3D, and VGGT. All training settings and evaluation metrics were kept identical to ensure fairness. As summarized in Table 4, the student distilled from Moge2 achieves the highest reconstruction quality, whereas DPT performs significantly worse and Metrics3D and VGGT yield slightly lower results. This gap demonstrates that the intrinsic errors and generalization ability of the teacher model critically influence pseudo label quality and, in turn, student performance.
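To make the distillation setup concrete, the sketch below illustrates one plausible form of pseudo label supervision, in which the student's predicted depth is penalized against the teacher's pseudo label over valid pixels; the masked L1 objective and the function name are assumptions for illustration, not necessarily the exact loss used in our framework.

```python
import torch

def pseudo_label_loss(student_depth: torch.Tensor,
                      teacher_depth: torch.Tensor,
                      valid_mask: torch.Tensor) -> torch.Tensor:
    """Masked L1 distillation loss (illustrative).
    student_depth, teacher_depth: (B, H, W) depth maps.
    valid_mask: (B, H, W) binary mask marking reliable teacher pixels."""
    diff = (student_depth - teacher_depth).abs()
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```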
Effect of pseudo label quantity. We further conducted an ablation study with different numbers of pseudo labels: 120, 180, 240, 300, and 360. As presented in Table 5, the results exhibit a non-monotonic trend: performance first improves as the number of pseudo labels increases, reaching the highest PSNR and SSIM at 180 samples, and then declines slightly as more pseudo labels are added. With fewer pseudo labels, the student model is trained on a compact set of relatively clean labels, which reduces the influence of outliers and large teacher model errors. As the number of pseudo labels increases, the training set becomes more diverse and covers a broader range of geometric structures, which in principle enhances generalization; at the same time, it increases training time and memory usage. Overall, these results suggest that the tradeoff between label representativeness and noise accumulation plays a key role in distillation performance. In our experiments, using around 240 samples achieves the best balance between supervision diversity, label reliability, and training efficiency.
Ablation study on the number of Fibonacci sampling points. To analyze the impact of the number of Fibonacci sampling points on the results, we performed an ablation study evaluating different numbers of sampling points in terms of the three metrics and reconstruction time. The results are summarized in
Table 6.
From the results, 20 points provided the best balance between the three metrics and reconstruction time. This choice aligns with common practice in the field, where 20-point sampling is widely adopted; we also used the traditional icosahedron method as a baseline for comparison.
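For reference, Fibonacci (golden-angle) sphere sampling can be implemented in a few lines; the sketch below generates N approximately uniform directions on the unit sphere (e.g., the 20 sampling points discussed above). The function name fibonacci_sphere is illustrative.

```python
import numpy as np

def fibonacci_sphere(n_points: int = 20) -> np.ndarray:
    """Generate n_points approximately uniform unit directions on the sphere
    using the Fibonacci (golden-angle) spiral."""
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))        # ~2.39996 rad
    i = np.arange(n_points)
    z = 1.0 - 2.0 * (i + 0.5) / n_points               # heights uniform in (-1, 1)
    radius = np.sqrt(1.0 - z ** 2)                     # circle radius at height z
    theta = golden_angle * i                           # azimuth advances by golden angle
    x, y = radius * np.cos(theta), radius * np.sin(theta)
    return np.stack([x, y, z], axis=-1)                # (n_points, 3) unit vectors

dirs = fibonacci_sphere(20)                            # e.g., 20 sampling directions
```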
Under identical settings, we compared our optimizer with the traditional baseline, as shown in Table 7. The adaptive optimization achieves higher fidelity at nearly the same runtime, reduces artifacts, and improves consistency without adding significant computational cost.