In the experimental section, we conducted three groups of experiments to evaluate the network's performance under different conditions. The first group assessed the accuracy of pose estimation in complex environments. The second and third groups evaluated the network's ability to reconstruct dataset scenes and real-world scenes, and to synthesize novel views, under the dual adverse conditions of unknown camera poses and sparse viewpoints.
All experiments were conducted on a workstation equipped with a 13th Gen Intel(R) Core(TM) i7-13700KF CPU and an NVIDIA GeForce RTX 4090 GPU. Our method follows a per-scene optimization paradigm. For each scene, we train our SparsePose–NeRF model for 200,000 iterations, a number consistent with related state-of-the-art works such as DietNeRF and FreeNeRF, to ensure full convergence. The entire pipeline for a typical 3-view scene is highly efficient. The front-end pose and geometry estimation using MASt3R-SfM completes in approximately 4 s. We also note that MASt3R-SfM reduces the computational complexity from quadratic to quasi-linear by constructing a sparse scene graph, as detailed in its original paper. The subsequent back-end NeRF optimization takes about 4.5 h. Once trained, rendering a new 800 × 800 resolution view takes approximately 5–10 s. This optimization time is competitive and notably faster than standard NeRF baselines, which can require over 10 h to converge on sparse data. The near and far rendering bounds were determined from the point cloud provided by the front end, with the redundancy margin ξ (Equations (9) and (10)) set to 5% of the scene's depth range. The hyperparameter λ in our loss function (Equation (12)), which weights the global, confidence-based depth loss LM, is activated in the second half of training and increases linearly from 0 to its final value of 0.1. We found this setting to be robust across all scenes and datasets, requiring no per-scene tuning.
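As a concrete illustration of these two settings, the minimal sketch below (in Python, with hypothetical function names; the point-cloud handling is simplified relative to Equations (9), (10), and (12)) shows how the near/far bounds could be derived from the front-end depths with a 5% redundancy margin, and how the λ weight ramps up during the second half of training.

```python
import numpy as np

def rendering_bounds(point_depths, margin_ratio=0.05):
    """Derive near/far rendering bounds from per-view point depths taken
    from the front-end point cloud. The redundancy margin xi is a fraction
    of the depth range, mirroring the 5% setting described in the text."""
    z_min, z_max = float(point_depths.min()), float(point_depths.max())
    xi = margin_ratio * (z_max - z_min)   # redundancy margin
    near = max(z_min - xi, 1e-3)          # keep the near plane positive
    far = z_max + xi
    return near, far

def lambda_schedule(step, total_steps=200_000, lambda_final=0.1):
    """Weight of the confidence-based global depth loss L_M: inactive in the
    first half of training, then a linear ramp from 0 to lambda_final."""
    half = total_steps // 2
    if step < half:
        return 0.0
    return lambda_final * (step - half) / (total_steps - half)
```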
4.1. Validation of Camera Pose Estimation Accuracy
To validate the accuracy of the network in estimating camera poses, we conducted experiments on the Tanks and Temples [36] dataset, the CO3Dv2 [37] dataset, and the RealEstate10K [38] dataset. The Tanks and Temples dataset features a wide range of complex indoor and outdoor scenes, encompassing various lighting conditions, texture complexities, and scene scales. The design of the scenes in this dataset considers real-world complexities, such as non-uniform lighting, low-texture regions, and intricate geometric structures. These challenging characteristics make it an ideal benchmark for evaluating the robustness of algorithms. To quantitatively assess the accuracy of camera pose estimation, we compared our network with the COLMAP algorithm. It is worth noting that the COLMAP algorithm computes camera poses using all the images in each dataset, and its results are treated as ground truth for comparison purposes.
To calculate the Absolute Trajectory Error (ATE) [39], we align two camera trajectories using the Procrustes alignment method. This involves three key steps: decentering, normalization, and rotational alignment of both trajectories. After alignment, the translational error between the two poses is computed. We also evaluate the Relative Rotation Accuracy (RRA@5) and Relative Translation Accuracy (RTA@5) to assess the precision of relative rotations and translations between camera pairs. RRA@5 is a commonly used metric for high-precision rotation error evaluation. Considering that COLMAP performs remarkably well in high-overlap and high-texture scenes where it achieves highly accurate rotation estimations, we adopted a rigorous evaluation criterion. In addition, we report the registration success rate (Reg), which measures the proportion of cameras with successfully estimated poses. This metric is an essential indicator of an algorithm's robustness and applicability across various scenarios. In the Tanks and Temples dataset, we selected eight scenes and performed resampling and subdivision of the original dataset into subsets containing 25, 50, and 100 frames. Our algorithm was run under these three conditions, and the corresponding camera pose estimation metrics are summarized in Table 1.
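For reference, the sketch below illustrates how these metrics are commonly computed (the function names are ours, and the world-to-camera extrinsic convention is an assumption; the exact error conventions in our evaluation scripts may differ slightly):

```python
import numpy as np

def procrustes_ate(est_centers, gt_centers):
    """Align two camera-center trajectories (decenter, normalize scale,
    rotate) and return the ATE as the RMSE of the residuals."""
    A = gt_centers - gt_centers.mean(axis=0)
    B = est_centers - est_centers.mean(axis=0)
    A /= np.linalg.norm(A)
    B /= np.linalg.norm(B)
    U, _, Vt = np.linalg.svd(A.T @ B)
    if np.linalg.det(U @ Vt) < 0:        # avoid reflections
        U[:, -1] *= -1
    R = U @ Vt                           # optimal rotation (orthogonal Procrustes)
    residual = A - B @ R.T
    return float(np.sqrt((residual ** 2).sum(axis=1).mean()))

def rra_rta(R_est, t_est, R_gt, t_gt, thresh_deg=5.0):
    """RRA/RTA over all camera pairs, assuming world-to-camera extrinsics
    [R|t]; relative translations are compared by direction only."""
    def rel(R, t, i, j):
        R_ij = R[j] @ R[i].T
        return R_ij, t[j] - R_ij @ t[i]
    def deg(c):
        return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
    n = len(R_est)
    rot_ok = tra_ok = pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            Re, te = rel(R_est, t_est, i, j)
            Rg, tg = rel(R_gt, t_gt, i, j)
            rot_err = deg((np.trace(Re.T @ Rg) - 1.0) / 2.0)
            tra_err = deg(te @ tg / (np.linalg.norm(te) * np.linalg.norm(tg) + 1e-9))
            rot_ok += rot_err < thresh_deg
            tra_ok += tra_err < thresh_deg
            pairs += 1
    return rot_ok / pairs, tra_ok / pairs
```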
The experimental results comprehensively validate the efficiency and robustness of our method in multi-view 3D reconstruction tasks across various view numbers (25, 50, and 100 views) and diverse scenes (e.g., Ballroom, Barn, and Church). As the number of views increases, the absolute trajectory error (ATE) decreases significantly, and both the relative rotation accuracy (RRA@5) and relative translation accuracy (RTA@5) consistently improve. For example, in geometrically rich scenes (e.g., Barn and Ignatius), our method exhibits near-perfect results under all view configurations, with extremely low ATE values and RRA@5 and RTA@5 approaching 100%. Furthermore, even in complex scenes with sparse texture and geometry (e.g., Museum), our method maintains a 100% registration rate (Reg), demonstrating the robustness of its global keypoint matching. In scenes with intricate structures and more dynamic changes (e.g., Family and Church), our method captures finer details, and as the number of views increases, its accuracy and consistency improve significantly, further narrowing the gap with the reference results.
This performance indicates that our method not only excels in traditional static scenes but also adapts well to scenes with complex geometry and dynamic changes. Overall, these results confirm the advantages of our method in terms of accuracy, efficiency, robustness, and scene adaptability. Whether in geometrically rich scenes or in challenging scenarios with sparse texture or dynamic changes, it consistently exhibits high stability and consistency, providing strong support for accurate 3D reconstruction in complex scenes.
In the qualitative experiments, we utilized COLMAP to generate high-quality 3D reconstructions and camera pose estimations as ground truth. To evaluate the accuracy of our network's camera pose estimation, we compared the camera trajectories generated by our method against those produced by the state-of-the-art pose-free NeRF algorithm, NoPe–NeRF, across eight selected scenarios. During the experiments, we applied the Procrustes alignment method to align and visualize the camera trajectories, as illustrated in Figure 4. From the visualized results, it can be observed that our proposed method performs well on complex datasets such as Tanks and Temples. For instance, in most scenes (e.g., Ballroom, Barn, Francis, and Museum), the trajectories reconstructed by MASt3R-SfM are highly consistent with those of COLMAP, demonstrating superior accuracy and robustness. This indicates that our method can reliably recover 3D trajectories in these scenarios. In contrast, NoPe–NeRF exhibits significant deviations or instability in certain scenes (e.g., Horse and Church), highlighting its limitations in handling complex or structurally unique environments, possibly due to constraints in its model capacity or algorithmic design. Moreover, in scenes with intricate structures and variations (e.g., Family and Church), MASt3R-SfM outperforms NoPe–NeRF by capturing finer details and maintaining alignment with COLMAP's trajectories. Overall, these trajectory visualizations clearly illustrate the advantages of our algorithm in terms of camera pose estimation accuracy, stability, and adaptability to diverse scenes. However, the results also reveal that NoPe–NeRF requires further optimization to handle challenging scenarios effectively.
To evaluate the accuracy of our network's camera pose estimation under sparse viewpoints, we conducted experiments on the CO3Dv2 and RealEstate10K datasets, both of which provide ground truth camera parameters. On the RealEstate10K dataset, we reported the mAA(30) metric, while on the CO3Dv2 dataset, we included RRA@15, RTA@15, and mAA(30). This choice is justified because RealEstate10K is primarily designed for novel view synthesis (NVS), focusing on the quality of generated views rather than precise pose estimation. The dataset typically involves views captured across large-scale scenes, which may lack highly accurate 3D pose annotations. The mAA(30) metric, being directly based on the matching quality of synthesized views, better reflects overall image-level performance compared to RRA and RTA, which are more reliant on precise camera poses. Furthermore, we set the threshold parameters for RRA and RTA to 15° because a rotational error within this range is generally acceptable in various computer vision tasks such as multi-view geometry, SLAM, and visual localization. For tasks like scene understanding or novel view synthesis, visual consistency is often maintained even with rotational errors within 15°. Additionally, practical applications such as indoor navigation and scene modeling do not demand extremely fine-grained rotational accuracy, making 15° a reasonable threshold to evaluate robustness in common scenarios. During the experiments, we selected 10 scenes from the test set of the RealEstate10K dataset. For each scene, three subsets were created by randomly sampling 3, 5, and 10 consecutive images, respectively. Similarly, from the CO3Dv2 dataset, we selected five categories: hotdog, motorcycle, bicycle, laptop, and toy bus. For each category, two shooting scenes were chosen, and subsets of varying sizes were constructed following the same approach used for the RealEstate10K dataset. Finally, the average parameter values were calculated across all scenes, and the results are presented in Table 2.
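For completeness, one standard way to compute mAA(30) is sketched below: the fraction of camera pairs whose pose error falls below a threshold, averaged over thresholds from 1° to 30°. The exact error aggregated per pair (here, the maximum of the relative rotation and translation angular errors) is an assumption of this sketch.

```python
import numpy as np

def mAA(pairwise_errors_deg, max_threshold=30):
    """mean Average Accuracy over integer thresholds 1..max_threshold deg.
    pairwise_errors_deg holds, per camera pair, the max of the relative
    rotation and translation angular errors."""
    errs = np.asarray(pairwise_errors_deg)
    accs = [(errs < t).mean() for t in range(1, max_threshold + 1)]
    return float(np.mean(accs))
```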
From the experimental results, it can be observed that our network achieves highly accurate camera pose estimation even under conditions of extreme sparsity in image data. This effectively addresses the challenge faced by NeRF networks in obtaining precise camera poses when operating with sparse viewpoints.
4.2. Comparative Experiments on LLFF Dataset with Sparse-View NeRF Models
We conducted comparative experiments on the LLFF [40] dataset with other sparse-view NeRF models. The experiments focused primarily on scene reconstruction and novel view synthesis using three input views. Both quantitative and qualitative analyses were performed to evaluate the results. Previous improvements in sparse-view NeRFs often evaluated their models by directly sampling a fixed number of images from datasets and utilizing the camera poses provided within those datasets for NeRF reconstruction. However, an overlooked fact is that these camera poses are estimated using the COLMAP algorithm by processing dozens of images of the same scene collectively. Under such conditions, the errors in camera pose estimation are minimal.
Accurate camera poses are crucial for NeRF reconstruction, as NeRF fundamentally relies on optimizing scene density and radiance fields. This optimization requires known camera poses to establish the origin and direction of rays, correlating 3D points in space with the corresponding image pixels. In real-world applications, obtaining accurate camera poses from sparse views is highly challenging. COLMAP typically requires at least five images to generate stable sparse point clouds. Moreover, it assumes sufficient overlap in the field of view and robust feature matching between images. When the number of images is limited or the overlap and viewing-angle differences are small, the number of matched points decreases significantly, resulting in incomplete scene reconstruction and inaccurate camera pose estimation. Through experimental validation, we observed that existing sparse-view reconstruction methods perform exceptionally well when provided with more than six input views. Therefore, our experiments focus on scenarios with three input views, where current sparse-view NeRF methods generally underperform. Even when these NeRF models directly use accurate camera poses from the datasets, thus avoiding the challenge of inaccurate poses in real-world scenarios, their reconstruction quality and novel view synthesis remain suboptimal. This limitation is primarily due to insufficient scene information, which prevents the network from capturing adequate low-frequency information. By improving the NeRF network architecture, our proposed model effectively addresses these challenges and achieves superior performance. We compared four different methods for NeRF-based 3D reconstruction under sparse view inputs, using only three input views in all experiments.
Directly using COLMAP to compute camera parameters from these three views often failed or produced significant camera pose errors. Furthermore, no ground truth images were available for the novel view synthesis experiments under these conditions. To address these challenges and facilitate experiments with three input views, we adopted a strategy of reading the standard camera pose parameters directly from the dataset as input to the comparison networks. Specifically, we selected three images from each of the four scenes in the Real Forward-Facing dataset as training inputs for NeRF and one additional image from the remaining images as ground truth for evaluating novel view synthesis performance. During the experiments with our network, however, we did not use the camera poses provided by the dataset; we consistently used our MASt3R-SfM front-end network to estimate camera poses. While our estimated poses exhibited minor deviations from the COLMAP poses in the dataset, these differences had a negligible effect on the quantitative analysis, and in the qualitative experiments the slight shifts in viewing angle were insignificant. This demonstrates that our pose estimation is a viable replacement for the traditional SfM algorithm in the NeRF pipeline.
The final results, as shown in Figure 5, highlight the performance differences across methods. For the “room” scene, PixelNeRF exhibited poor performance, with blurred boundaries on walls and ceilings and significant geometric distortions in objects on the desk and floor. DSNeRF improved boundary sharpness, particularly at wall-ceiling transitions. MVSNeRF captured geometric consistency effectively, with shapes closer to reality and improved detail recovery, though textures on small objects remained blurred. FreeNeRF achieved geometric consistency comparable to MVSNeRF but excelled in restoring details on small objects. Our method outperformed all others, delivering clear and realistic details on small objects on the floor and boundary regions of walls, while also achieving consistent depth and lighting in transitions between walls and ceilings. In the quantitative analysis, we calculated the PSNR, SSIM [41], and LPIPS [42] metrics.
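For reproducibility, the sketch below shows how these three metrics are typically computed with widely used libraries (scikit-image and the lpips package); the exact preprocessing and library versions in our evaluation code may differ.

```python
import numpy as np
import torch
import lpips                                     # pip install lpips
from skimage.metrics import structural_similarity

lpips_fn = lpips.LPIPS(net="alex")               # AlexNet backbone, a common default

def evaluate_view(pred, gt):
    """pred, gt: HxWx3 float arrays in [0, 1]. Returns (PSNR, SSIM, LPIPS)."""
    mse = np.mean((pred - gt) ** 2)
    psnr = -10.0 * np.log10(mse + 1e-10)
    ssim = structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()   # LPIPS expects inputs in [-1, 1]
    return psnr, ssim, lp
```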
It is worth noting that the PixelNeRF and MVSNeRF networks used in the experiments were not fine-tuned for the current scenes. This is because our focus is on reconstruction performance under the three-view condition, while typical fine-tuning procedures require two to three additional images per scene, undermining the comparability of results. Therefore, we used non-fine-tuned versions of these networks for the experiments. The detailed comparison results are presented in Table 3.
4.3. Sparse Input-Based Reconstruction and View Synthesis in Real-World Scenes
In real-world scenarios, accurate camera pose information is typically unavailable, unlike the data provided by synthetic datasets. To achieve precise 3D scene reconstruction and novel view synthesis under these conditions, the pose estimation module of the network must be capable of accurately estimating camera poses even with sparse input. We captured three real-world scenes with a smartphone and extracted frames at fixed intervals with FFmpeg to create the training and testing images required for our experiments. Our goal is to enable NeRF-based 3D reconstruction and novel view synthesis from only three images in real-world applications.
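A minimal sketch of this frame extraction step is shown below; the 2 fps sampling rate and output naming are illustrative assumptions, since the text only states that frames were extracted at fixed intervals.

```python
import subprocess
from pathlib import Path

def extract_frames(video_path, out_dir, fps=2):
    """Sample frames from a smartphone video at a fixed rate using FFmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}",
         str(Path(out_dir) / "frame_%04d.png")],
        check=True,
    )
```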
Experimental results indicate that when the other networks employ a traditional SfM algorithm for pose estimation following NeRF's standard pipeline, the reconstruction quality of all four baseline networks deteriorates significantly, making a direct comparison with our method impractical. The primary reason lies in the inherent limitations of traditional SfM algorithms, which require substantial overlap between input images and sufficient texture features within the overlapping regions. In addition, the camera positions must provide a sufficient baseline (viewpoint difference) to allow triangulation. Even when these conditions are met, SfM-estimated camera poses often contain significant noise, leading to poor reconstruction quality and rendering such a comparison uninformative.
Given these limitations, our final experimental setup uses our network’s frontend, MASt3R-SfM, to estimate camera poses for all three input images. The same estimated poses are then used for NeRF reconstruction by the four baseline networks. In our assumed real-world scenario (only three images available), there is no standardized test set like the Real Forward-Facing dataset. Ground truth images for quantitative evaluation are unavailable.
To mitigate this, we use adjacent frames near the rendered viewpoints as the best available reference. We acknowledge that this introduces slight viewpoint parallax, which can affect the absolute values of the reported metrics. However, since this evaluation condition is applied consistently across all compared methods, it provides a fair and valid basis for assessing their relative performance. All networks render images from the same viewpoints for comparison.
To complement this evaluation, whose reference images are affected by viewpoint differences, we also render depth maps for each network. These depth maps provide a direct visualization of geometric consistency and detail recovery. The reference depth maps are estimated using the current state-of-the-art monocular depth estimation network, DepthAnythingV2 [43,44]. It is important to note that these depth maps serve strictly as a qualitative visual reference for geometric plausibility and are not used for any quantitative evaluation metrics; our quantitative analysis is performed exclusively on the rendered RGB images. In these maps, closer points are shown in white and farther points in black, the inverse of NeRF-rendered depth maps, where closer points appear in black and farther points in white. The qualitative and quantitative experimental results are shown in Figure 6 and Table 4, respectively.
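The sketch below shows the simple normalization and inversion used to bring a raw depth map into this visualization convention (function name and the min-max normalization are illustrative choices).

```python
import numpy as np

def to_visual_depth(depth, invert=True):
    """Normalize a depth map to [0, 1] for display. With invert=True,
    closer points map to white and farther points to black, i.e., the
    inverse of the raw NeRF-rendered depth convention."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    return 1.0 - d if invert else d
```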
From the quantitative results, it is evident that our model achieves superior reconstruction quality compared to the other sparse-view NeRF networks, with strong performance across all three metrics. This advantage primarily stems from the incorporation of the sampling annealing strategy and the series of regularization techniques in our network.
From the RGB images output by the comparison networks, it is apparent that PixelNeRF produces subpar reconstructions (MVSNeRF performs at a similar level and is therefore not presented separately), exhibiting significant blurriness and artifacts. Many regions lose texture and color information, likely because the network was not fine-tuned, which substantially degrades its reconstruction quality. In the red-boxed areas, there are noticeable color shifts and texture blurriness (e.g., the doll's face, the leaves of the plant, and the bottom box). Although DSNeRF shows notable improvements over PixelNeRF, it still suffers from artifacts and color shifts; the occlusion boundaries in the red-boxed areas (e.g., the edges of the doll's ears and the plant's leaves) exhibit blending issues, leading to unnatural texture and color rendering. FreeNeRF achieves better texture restoration in the red-boxed regions (e.g., plant details) but still produces blurriness in geometrically complex regions (e.g., the edges of the plant leaves and the doll's ears). In contrast, our method accurately reconstructs texture and color in the red-boxed regions (e.g., the doll's ears, plant leaves, and bottles) without noticeable blurriness or artifacts. The occlusion boundaries and the separation between foreground and background are sharp, with high color consistency, and fine textures in complex geometric areas (e.g., plant stems and leaves) are reconstructed with minimal artifacts.
Comparing the depth maps output by the different networks reveals that PixelNeRF produces depth maps with significant blurriness and noise, particularly in detail-rich or geometrically complex regions (e.g., the areas marked by red boxes). Although DSNeRF demonstrates significant improvement over PixelNeRF, it still exhibits incomplete geometry in certain regions and insufficient sharpness at depth boundaries. FreeNeRF generally outperforms DSNeRF in sharpness, but depth estimation errors persist in some regions (e.g., gaps between objects and small objects). Our method produces depth maps with the best clarity, detail, and boundary sharpness, closely approximating the reference; the red-boxed regions (e.g., edge details, occluded areas, and small objects) demonstrate significantly higher accuracy than the other methods. Overall, our approach significantly outperforms the other methods in both RGB rendering and depth map reconstruction quality.
4.4. Ablation Study
4.4.1. Efficacy of the High-Frequency Annealing Strategy
To isolate and rigorously assess the contribution of our high-frequency annealing strategy for positional encoding in sparse-view scenarios, we conduct a targeted ablation study. This evaluation is performed on the NeRF-synthetic dataset, as its scenes—characterized by sharp geometric boundaries and uniform color surfaces—present an ideal testbed for evaluating the network’s susceptibility to overfitting on high-frequency information. To eliminate confounding factors from camera pose inaccuracies, we utilize the ground truth camera poses provided with the dataset. This experimental setup ensures that any observed performance differences are solely attributable to the inclusion or exclusion of the proposed annealing strategy.
We compare the following two model variants, each trained using three sparse input views.
Ours (w/Annealing): This is our full model, which incorporates the proposed high-frequency annealing strategy. The strategy progressively unmasks frequency bands by incrementally increasing the maximum level, Lmax, of the positional encoding. This compels the network to first learn the coarse, low-frequency structure of the scene during the initial training stages before gradually enabling the learning of fine, high-frequency details.
Ours (w/o Annealing): This is a baseline variant where the high-frequency annealing strategy is ablated. In this configuration, the network has access to the full spectrum of positional encoding frequencies from the outset of training. This approach, particularly in a sparse-view setting, makes the model prone to overfitting on high-frequency noise rather than capturing the underlying global scene structure.
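To make the mechanism of the annealed variant concrete, a minimal coarse-to-fine positional encoding is sketched below; the linear band-release schedule and the function signature are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def annealed_positional_encoding(x, L_max, step, anneal_steps):
    """Coarse-to-fine positional encoding for sparse-view training.

    Frequency bands above the current level are masked and released
    linearly with training progress, so the network fits low-frequency
    structure before high-frequency detail."""
    alpha = L_max * min(step / anneal_steps, 1.0)   # currently unmasked level
    feats = []
    for l in range(L_max):
        w = np.clip(alpha - l, 0.0, 1.0)            # 0 = band fully masked
        feats.append(w * np.sin((2.0 ** l) * np.pi * x))
        feats.append(w * np.cos((2.0 ** l) * np.pi * x))
    return np.concatenate(feats, axis=-1)
```

The ablated variant corresponds to calling this encoding with all weights fixed to 1 from the first iteration.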
The experimental results clearly demonstrate the necessity of the high-frequency annealing strategy for mitigating overfitting in sparse-view settings. As illustrated in Figure 7, the baseline model (top row), which lacks this strategy, exhibits catastrophic failure. Specifically, its rendered novel views are plagued by erroneous geometric artifacts, often termed “floaters,” that appear in otherwise empty regions of space, resulting in an overall blurry and noisy scene. In sharp contrast, our full model (bottom row) with the annealing strategy generates clean, sharp, and geometrically faithful images that are highly consistent with the ground truth. This stark visual disparity is corroborated by the quantitative metrics in Table 5. Our full model achieves a PSNR of 24.47, nearly doubling the baseline's score of 12.31. Similarly, it shows a decisive advantage across the other metrics, with SSIM improving from 0.513 to 0.832 and LPIPS decreasing from 0.482 to 0.162. These results confirm substantial improvements in both structural similarity and perceptual quality.
This experiment conclusively demonstrates that our high-frequency annealing is a critical component for robust reconstruction from sparse views. By guiding the network to learn in a coarse-to-fine manner, it ensures the model first establishes a robust, low-frequency foundation of the scene before progressively refining high-frequency details. Notably, even under the idealized condition of perfect camera poses, this strategy remains indispensable for achieving high-fidelity renderings.
4.4.2. Impact of Low-Frequency Regularization on Geometric Accuracy
To verify the synergistic contribution of our two-part regularization strategy—which involves sampling contiguous patches from low-frequency regions and an accompanying depth regularization term, LU, based on the Pearson correlation coefficient—we conduct a targeted ablation study on the DTU dataset. This strategy is designed to stabilize the training process and enhance geometric accuracy. The DTU dataset is selected for its primary advantage of providing high-precision, ground truth 3D geometry, which enables a direct quantitative evaluation of our model’s geometric fidelity. To isolate the impact of the regularization strategy, we utilize the provided ground truth camera poses, thereby eliminating potential confounding effects from pose estimation errors.
We compare two variants of our model, both trained on sparse input views:
Ours (Full Model): Our standard model, which employs both the low-frequency patch sampling strategy and the Pearson correlation-based depth loss (LU) during the first half of the training process (t ≤ T/2).
Ours (w/o Low-Freq. Regularization): A baseline variant where both components of the low-frequency regularization are ablated. In this setting, the sampling scheme reverts to standard random pixel sampling across the entire image, and the depth regularization term LU is omitted from the loss function.
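For clarity, a minimal sketch of a Pearson correlation-based depth term over one sampled patch is shown below; the 1 − r formulation is a common choice and an assumption of this sketch, while the exact definition of LU is given earlier in the paper.

```python
import torch

def pearson_depth_loss(rendered_depth, prior_depth, eps=1e-8):
    """Depth regularization based on the Pearson correlation coefficient.

    Both inputs cover one contiguous patch sampled from a low-frequency
    image region. Because correlation is invariant to affine rescaling,
    the rendered depth only has to agree with the prior up to scale and
    shift. The loss 1 - r is minimized when the two are perfectly correlated."""
    x = rendered_depth.flatten() - rendered_depth.mean()
    y = prior_depth.flatten() - prior_depth.mean()
    r = (x * y).sum() / (x.norm() * y.norm() + eps)
    return 1.0 - r
```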
Our ablation study reveals that the low-frequency regularization strategy is critical for ensuring geometric stability and the final rendering quality. This is particularly evident in the qualitative results presented in Figure 8. The baseline model (top row), which omits this strategy, suffers from severe visual artifacts. Specifically, its rendered surfaces exhibit unnatural splotches of color and shading, an effect typically indicative of an incorrectly learned underlying 3D geometry. Furthermore, the boundaries between the object and the background are poorly defined, further evidencing defects in its geometric boundary reconstruction.
These qualitative degradations are quantitatively substantiated by the metrics in Table 6. Our full model achieves a PSNR of 21.63, outperforming the baseline's 15.23 by a significant margin of 6.4 points. Commensurate advantages are observed in the SSIM and LPIPS metrics, collectively demonstrating a substantial leap in both image fidelity and perceptual quality. We also note that the training process of the baseline is markedly unstable, exhibiting slower convergence and greater volatility in its loss curve.
In summary, both the qualitative and quantitative results provide compelling evidence for our central claim: the strategy of sampling contiguous patches in low-frequency regions and the Pearson-based depth regularization loss are complementary and act in synergy. This composite strategy plays a pivotal role throughout the optimization process by stabilizing training and ensuring geometric integrity. It is therefore indispensable for mitigating geometric artifacts induced by sparse views to ultimately achieve high-fidelity view synthesis.
4.4.3. Ablation on the Front-End Pose and Geometry Estimation
To demonstrate the foundational and indispensable role of our integrated MASt3R-SfM front-end, we conduct a final ablation study. We establish a baseline that emulates a conventional NeRF pipeline by first estimating camera poses from the sparse input views using a classical SfM algorithm (i.e., COLMAP). Replacing our MASt3R-SfM front end, however, deprives the system of the initial self-supervised depth maps and the final normalized point cloud it generates. Consequently, our core depth regularization losses, LU and LM, cannot be applied.
To enable this baseline to function, we leverage a pre-trained DepthAnythingV2 model to generate monocular relative depth priors for each input view. These priors are then used to guide our patch sampling strategy and to offer a basic geometric scaffold. This comparative experiment is performed on the challenging forward-facing scenes of the LLFF dataset.
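As an illustration of how such a monocular prior can be obtained, the sketch below uses the Hugging Face depth-estimation pipeline; the hub model id and loading route are assumptions for this sketch, and the actual baseline may load DepthAnythingV2 differently.

```python
from PIL import Image
from transformers import pipeline

# Hub id assumed for illustration; substitute the checkpoint actually used.
depth_estimator = pipeline(
    "depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf"
)

def monocular_depth_prior(image_path):
    """Relative (scale/shift-ambiguous) depth prior for one input view,
    used by the baseline to guide patch sampling in place of the
    MASt3R-SfM geometry."""
    result = depth_estimator(Image.open(image_path))
    return result["predicted_depth"].squeeze().cpu().numpy()
```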
Our final ablation study conclusively demonstrates that the integrated MASt3R-SfM front-end is the cornerstone of our framework. As shown in Figure 9 and Table 7, replacing it with a baseline composed of traditional SfM and a monocular depth estimator leads to a precipitous and catastrophic decline in reconstruction performance. The baseline's renderings are rife with severe geometric distortions and blurry artifacts, rendering the scene structure entirely unrecognizable. This visual collapse is corroborated by the quantitative metrics: the baseline achieves a PSNR of only 7.56, a score indicative of complete reconstruction failure. In stark contrast, our full pipeline achieves a PSNR of 21.36, with a similarly vast performance gap observed for the SSIM and LPIPS metrics.
The root cause of this failure is twofold. First, for complex, real-world scenes such as those in the LLFF dataset, traditional SfM algorithms struggle to estimate accurate and robust camera poses from only 3–5 sparse views. The resulting pose inaccuracies fundamentally corrupt the geometric foundation required for NeRF. Second, while the monocular depth maps from DepthAnythingV2 provide relative depth cues, they fall significantly short of the priors generated by our MASt3R-SfM in terms of accuracy, scale, and multi-view consistency.
Ultimately, this experiment demonstrates that MASt3R-SfM is more than a mere pose estimator; it provides a dual-pronged foundation for the system. It delivers not only high-accuracy, robust camera poses but also a set of multi-view consistent geometric priors that are critical for regularization. Under the challenging conditions of extreme view sparsity in real-world scenes, both of these elements are indispensable prerequisites for achieving high-quality reconstruction.
Taken together, these three ablation studies demonstrate the strong synergistic effects between our proposed components. The front-end ablation (Section 4.4.3) shows that the robust pose estimation and multi-view consistent geometry from MASt3R-SfM form an indispensable foundation. Concurrently, the back-end ablations (Section 4.4.1 and Section 4.4.2) prove that even with a strong geometric foundation, our novel regularization strategies are critical for preventing overfitting and achieving high-fidelity results under sparse-view conditions.