1. Introduction
With the rapid development of remote sensing technology, geographic information systems (GISs), and virtual reality, high-precision three-dimensional modeling of the Earth’s surface has become a critical foundation for spatial information services [1,2,3]. The digital surface model (DSM), the core dataset describing surface morphology and spatial structure, has been widely applied in urban planning, environmental simulation, disaster warning, and virtual simulation [4,5,6,7]. Compared to direct measurement using LiDAR [8], satellite imagery has become a key data source for DSM generation due to its broad coverage, rapid update cycles, and lower acquisition costs [9,10].
As a standard technique for reconstructing 3D surfaces from multi-view images, multi-view stereo (MVS) has been studied for decades, yielding both traditional [11,12,13] and learning-based [14,15,16,17] methods. However, both families often struggle with blurred model contours and distorted geometric structures, making it difficult to meet the demands of high-precision modeling. Moreover, most learning-based methods need labeled data for training, which limits their applicability.
In response to these limitations, neural implicit representation methods have recently emerged to reconstruct 3D shapes from multi-view images. Compared with traditional representations such as point clouds or voxel grids, neural implicit approaches model the color and density distributions in 3D space using neural networks, offering higher continuity and better detail preservation. Neural Radiance Fields (NeRF), one of the most representative models [18], map spatial coordinates and viewing directions to color and density values, enabling high-fidelity reconstruction of complex geometry and illumination variations [19,20] without the need for labeled data.
Since its introduction, NeRF and its variants have driven breakthroughs in virtual reality and computer graphics. To extend its applicability, various improved models have been proposed: Deformable-NeRF [21] and NeRFies [22] handle non-rigid objects; D-NeRF [23] models dynamic scenes; and other works focus on high dynamic range [24], dark scenes [25], and computational efficiency [26,27,28,29]. These methods have significantly enhanced the practicality of NeRF in close-range and large-scale 3D modeling.
However, applying NeRF to satellite imagery presents unique challenges due to lower spatial resolution, wide-angle imaging, inconsistent acquisition times, and atmospheric scattering. To address these, researchers have tailored the NeRF framework for satellite photogrammetry. S-NeRF [30] incorporates solar illumination direction as a prior; LiDeNeRF [31] and recent depth-guided works [32] integrate LiDAR or depth priors to improve robustness. Sat-NeRF [33] and NeRF-MS [34] utilize multi-temporal images to reconstruct scene geometry and handle transient objects. More recently, CaLiSa-NeRF [35] and EO-NeRF [36] have further advanced this direction by incorporating separate streams for geometry and appearance or utilizing shadow-free albedo. Despite this progress, current methods still struggle to recover geometric details in texture-less or shadowed areas and to reconstruct sharp building contours, which restricts their scalability in high-precision urban modeling.
These remaining challenges reflect the fact that 3D surface reconstruction from multi-view images is an ill-posed inverse problem. The quality of reconstructed surfaces tends to degrade when the images contain textureless, specular, or shadowed areas. As a result, some researchers have incorporated an imaging model based on the Bidirectional Reflectance Distribution Function (BRDF) [37,38] into the reconstruction process, i.e., they optimize the illumination, surface albedo, and surface normals while reconstructing the 3D surfaces at the same time.
For texture-less areas, the shape from shading (SfS) technique is often used to refine the 3D surfaces. For example, Peng et al. [39] proposed a method that uses shading information from a single-view high-resolution image to refine a low-resolution DSM, achieving better recovery of detailed shape than interpolation methods. However, this method assumes that the direction of the incident light is known, an assumption that rarely holds in a general imaging process. Researchers have therefore approximated the illumination with spherical harmonics to remove the need for a known incident light direction. Wu et al. [40] proposed a method utilizing spherical harmonics to refine the 3D surfaces reconstructed by MVS methods under uncalibrated illumination. Hu et al. [41] also combined shading and edge information for better surface refinement in urban scenes.
To improve the physical realism and robustness of light modeling, researchers have increasingly incorporated more sophisticated BRDF models. For example, the SREVAS method [42] explicitly introduces a specular reflection component into the imaging model and employs spherical harmonics to simulate natural illumination. This approach enables accurate modeling of non-uniform albedo and non-Lambertian surfaces under complex lighting conditions, significantly enhancing the detail preservation in 3D reconstruction. However, most of the above shading-based 3D reconstruction methods do not incorporate the imaging model into the reconstruction from images itself; instead, they rely on the initial surfaces recovered by MVS methods, which can lead to suboptimal reconstruction results. Moreover, how to better combine shading information with the NeRF-based reconstruction process still needs further research.
Beyond pure shading cues, geometric constraints are also crucial for refining 3D surfaces. Traditional multi-view stereo methods often employ smoothness priors to reduce noise, but standard smoothness terms can blur sharp features. To address this, edge-preserving techniques, such as bilateral filtering or anisotropic diffusion, have been widely used in depth map refinement and mesh processing [43,44,45]. These methods utilize intensity gradients to guide the smoothing process, ensuring that geometric edges are preserved where photometric edges exist.
In the context of Neural Radiance Fields, integrating such physical shading models and edge-aware geometric constraints is a promising direction. By explicitly modeling the interaction between illumination and geometry and enforcing feature-preserving spatial regularity, we can potentially enhance the NeRF’s ability to recover fine structural details and reduce geometric ambiguity.
Although spherical harmonics and edge-aware priors are not new in themselves, their integration into satellite NeRF is not a direct transplantation of traditional SfS or MVS post-processing. Satellite imagery introduces wide viewing-angle changes, multi-date radiometric inconsistency, cast shadows, and complex surface materials, so illumination and geometry cannot be refined only after a stable surface has already been reconstructed. The present contribution, therefore, lies in coupling SH-based radiometric supervision and geometry-aware regularization with NeRF optimization itself, allowing density, albedo, shadow-aware shading, and normals to be adjusted jointly from multi-view satellite observations.
To address these challenges, this study proposes a general Shading and Geometric Constraint (SGC) method that can be integrated into existing NeRF-based frameworks to improve DSM reconstruction.
Specifically, alongside the rendering procedure of the neural radiance field, we add a physical rendering method based on the Lambertian reflectance model, represented by spherical harmonics, to handle complex lighting conditions, including shadowed areas. The objective of the proposed method is to minimize two differences at the same time: the difference between the images rendered by the neural radiance field and the observed images, and the difference between the images rendered by the Lambertian reflectance model and the observed images. To make this joint minimization possible, we estimate the surface normals from the neural radiance field.
In addition, to improve the fidelity of the reconstructed 3D surfaces, we design a geometric constraint method. Since most of the reconstructed scenes are continuous areas, we add a normal constraint to the neural radiance field, which encourages neighboring 3D points to have similar normals. Finally, to overcome the trade-off between noise reduction and edge preservation, we design a novel position constraint based on bilateral weights. This constraint utilizes the color similarity between neighboring rays to guide geometric smoothing. Specifically, it enforces strong position consistency in regions with similar colors while relaxing the constraint in regions with large color differences. This effectively preserves sharp building boundaries while smoothing valid surfaces.
In summary, the main contributions are as follows:
- (1) We introduce a Lambertian reflectance-based physical imaging model using spherical harmonics into the neural radiance field to tackle illumination inconsistency. The generated synthetic physical images serve as auxiliary supervision to improve the model’s robustness under complex lighting conditions.
- (2) We design a geometric feature constraint mechanism to overcome geometric distortion and blurred boundaries. The surface normals are constrained to guide model optimization and enhance the reconstruction accuracy of terrain structures and object edges. Simultaneously, we introduce a bilateral edge-aware constraint mechanism into the neural radiance field optimization. By dynamically weighting the geometric smoothness based on photometric cues, we achieve feature-preserving surface refinement, yielding sharper building edges and smoother terrains.
- (3) Qualitative experiments show that the proposed method can recover the finely detailed shapes of the ground surfaces while keeping the planar structures smooth. Quantitative experiments show that the proposed method achieves higher accuracy in terms of mean absolute error (MAE) compared with recently published NeRF-based DSM reconstruction methods. For example, when integrated into EO-NeRF, the proposed SGC reduces the elevation MAE on JAX_264 RGB from 3.064 m to 1.289 m, corresponding to a 57.93% reduction.
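The relative reduction quoted for JAX_264 RGB can be checked directly from the two MAE values stated above; the short calculation below only restates that arithmetic:

```latex
\frac{3.064 - 1.289}{3.064} = \frac{1.775}{3.064} \approx 0.5793 \quad\Longrightarrow\quad 57.93\%
```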
2. Materials and Methods
To enhance the geometric accuracy and illumination consistency of NeRF in remote sensing image-based 3D reconstruction, this paper proposes a general Shading and Geometric Constraint (SGC) method that can be integrated into various satellite NeRF architectures (e.g., Sat-NeRF, EO-NeRF). The method leverages the interaction between surface geometry and incident lighting, combining a Lambertian reflectance model represented by spherical harmonic basis functions with edge-aware geometric constraints to form a unified constraint mechanism. The overall pipeline consists of two parts: the Base NeRF Backbone that predicts the scene attributes, and the proposed SGC method that imposes physical and geometric regularizations. During training, the SGC method generates synthetic images under modeled illumination and enforces structural consistency using bilateral weights derived from photometric cues. These combined losses guide the optimization of the backbone network, enabling improved reconstruction of both geometry and radiometry in complex remote sensing scenes. The overall pipeline is illustrated in Figure 1.
2.1. Shadow-Aware Irradiance Model
This study proposes a method for DSM generation by integrating NeRF with a physical imaging model. By fully leveraging the rich illumination information from multi-view satellite imagery and accurately modeling scene illumination characteristics, the proposed method enhances NeRF’s learning capability of scene density information, particularly achieving more stable predictions in shadowed regions, thus enabling the construction of more accurate three-dimensional ground models.
Vanilla NeRF samples points along each camera ray and renders the pixel color by accumulating transmittance, opacity, and point color along the ray, as expressed in Equation (1):
$$C(\mathbf{r}) = \sum_{i} T_i \, \alpha_i \, \mathbf{c}_i \qquad (1)$$
where $\mathbf{c}_i$ represents the color value at the i-th sampled point, and the opacity $\alpha_i$ and the transmittance $T_i$ are computed by Equations (2) and (3):
$$\alpha_i = 1 - \exp(-\sigma_i \delta_i) \qquad (2)$$
$$T_i = \prod_{j<i} (1 - \alpha_j) \qquad (3)$$
where $\sigma_i$ represents the density value at the sampled point, indicating the probability that the point lies on a surface (a higher $\sigma_i$ implies higher opacity), and $\delta_i$ denotes the distance between adjacent sampled points along the ray.
In contrast, following Sat-NeRF and EO-NeRF, we adopt the shadow-aware irradiance model to compute the NeRF-based rendering color in Equation (4). In this model, the color of each sampled point is determined by three quantities: the albedo of the surface point, a shading scalar that takes values from 0 to 1, and an ambient color that accounts for the shadowed areas.
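As a minimal sketch of Equations (1) to (4), the following Python code composites the samples of a single ray. The compositing weights follow the standard NeRF formulation. The irradiance term assumes the form used in Sat-NeRF, in which the albedo is multiplied by s + (1 − s) · ambient; the exact form adopted in this study may differ. All function and variable names are hypothetical.

```python
import math


def composite_weights(sigmas, deltas):
    """Return the per-sample weights T_i * alpha_i for one ray."""
    weights = []
    transmittance = 1.0  # nothing occludes the first sample
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha  # fraction of light passing this sample
    return weights


def shadow_aware_color(albedo, shading, ambient):
    """Assumed Sat-NeRF-style irradiance: albedo * (s + (1 - s) * ambient)."""
    return [a * (shading + (1.0 - shading) * amb)
            for a, amb in zip(albedo, ambient)]


def render_ray(sigmas, deltas, albedos, shadings, ambient):
    """Composite shadow-aware point colors into one RGB pixel value."""
    weights = composite_weights(sigmas, deltas)
    pixel = [0.0, 0.0, 0.0]
    for w, albedo, shading in zip(weights, albedos, shadings):
        color = shadow_aware_color(albedo, shading, ambient)
        pixel = [p + w * c for p, c in zip(pixel, color)]
    return pixel, weights
```

With a shading scalar of 1 the point color reduces to the albedo, while a shading scalar of 0 leaves only the ambient contribution, which is how shadowed samples remain visible in the rendered image.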
2.2. Physical Imaging Model
To further improve model stability and accuracy under complex lighting environments, a physical imaging model is introduced. This model relies on scene attributes predicted by the NeRF and the estimated surface normals, and models illumination features using spherical harmonics. Spherical harmonics, due to their strong global representation and ability to capture low-frequency information, are widely applied in large-scale, complex illumination and spatially non-uniform 3D reconstruction tasks. Low-order spherical harmonic components are employed to model global illumination characteristics, while high-order components capture finer local illumination variations. Decomposing and reconstructing the scene illumination through spherical harmonics enables NeRF to learn radiative properties more accurately during training, thereby enhancing the geometric precision and illumination consistency of the 3D reconstruction.
Applying the physical imaging process only as a post-processing step makes it difficult to achieve an optimal reconstruction result, because illumination and geometry would be refined only after the radiance field has already been estimated. Therefore, in this work, the physical imaging process is unified with the NeRF reconstruction procedure so that radiometric decomposition can participate directly in optimization, enabling more accurate and faithful 3D reconstruction under complex satellite imaging conditions. Specifically, we decompose illumination, albedo, and surface normals within the NeRF framework. Sunlight and ambient illumination are modeled separately rather than merged into a single term: the sunlight component is treated as a directional source associated with the solar illumination condition, whereas the spherical-harmonic coefficients are used to represent the ambient skylight field. Their combination forms the incident irradiance used by the physical imaging branch.
As illustrated in Figure 2, the physical imaging module is integrated into the NeRF optimization process rather than being applied only after reconstruction. Direct sunlight is modeled as a directional illumination term associated with the solar geometry, while the spherical-harmonic component represents the low-frequency ambient skylight field. These illumination terms interact with the albedo predicted by the NeRF branch and the normals extracted from the same density field to synthesize a physically constrained image. The shared shadow-aware state further modulates the balance between sunlight and ambient light, so that shadow-induced radiometric variation is explained by illumination rather than being absorbed into false reflectance changes. In this way, the module reduces lighting ambiguity and improves reconstruction fidelity in shadowed or radiometrically inconsistent regions.
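In terms of detailed implementation, surface normals are extracted by calculating the gradient of the density field. The gradient of the density field $\sigma$ at point P = (x, y, z) is defined in Equation (5):
$$\nabla\sigma(P) = \left(\frac{\partial\sigma}{\partial x}, \frac{\partial\sigma}{\partial y}, \frac{\partial\sigma}{\partial z}\right) \qquad (5)$$
The unit normal vector is then obtained by normalization in Equation (6):
$$\mathbf{n}(P) = \frac{\nabla\sigma(P)}{\lVert\nabla\sigma(P)\rVert} \qquad (6)$$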
However, because the direction of the NeRF-predicted density gradients is not consistent, the extracted normals may sometimes point in the opposite direction to the actual surface normals. Therefore, this study proposes a camera-view-based normal direction constraint strategy. First, the camera origin position op is obtained from the camera parameters of the satellite image, and the observation vector from the surface point to the camera origin, normalize(op − P), is calculated using the coordinates P of the sampling point. Then, the inner product between the unit normal and the observation vector is computed. If the inner product is negative, the normal vector points away from the camera and is therefore reversed; otherwise, the original normal is retained. This strategy ensures that all extracted normals face towards the camera, at an angle of less than 90° to the observation vector, providing stable and reliable geometric information for subsequent physical imaging modeling.
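The normal extraction and the camera-view-based flipping described above can be sketched as follows. This is an illustration under stated assumptions: the density gradient is approximated by central finite differences on a hypothetical `density` callable (a trained network would normally supply the gradient by automatic differentiation), and the observation vector is taken to point from the surface point towards the camera origin.

```python
import math


def density_gradient(density, p, h=1e-3):
    """Central finite-difference gradient of a scalar density function at p."""
    grad = []
    for axis in range(3):
        hi = list(p)
        lo = list(p)
        hi[axis] += h
        lo[axis] -= h
        grad.append((density(hi) - density(lo)) / (2.0 * h))
    return grad


def normalize(v, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / (norm + eps) for c in v]


def camera_facing_normal(density, p, cam_origin):
    """Unit normal from the density gradient, flipped to face the camera."""
    n = normalize(density_gradient(density, p))
    to_camera = normalize([o - c for o, c in zip(cam_origin, p)])
    if sum(a * b for a, b in zip(n, to_camera)) < 0.0:
        n = [-c for c in n]  # normal pointed away from the camera: reverse it
    return n
```

For a density that increases towards the ground, the raw gradient points downward, so a camera located above the scene triggers the flip and the returned normal points upward.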
In the physical imaging model, considering that sunlight and skylight can be approximated as distant illumination sources, this study models the lighting environment using second-order spherical harmonics. Specifically, the spherical harmonic basis functions are defined in Equation (7) in terms of the three components of the unit surface normal at the i-th surface point.
Using these basis functions, the physical imaging model can be formulated as Equation (8). It relates the image intensity at the i-th surface point to the surface albedo (including RGB channels), which reflects the material’s reflectance properties across different wavelengths, the illumination coefficients corresponding to each spherical harmonic basis, and the evaluated basis functions. The image intensity is modeled across three RGB channels, where each channel’s final illumination intensity is determined by the corresponding surface albedo, spherical harmonic basis, and illumination coefficient.
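A minimal sketch of a spherical-harmonic Lambertian shading model of this kind is given below. It assumes the common nine-term second-order basis written as polynomials of the normal components, with normalization constants absorbed into the illumination coefficients, and one coefficient vector per RGB channel. The separate sunlight term and the shadow state described in this section are omitted, so this is not a reproduction of Equations (7) and (8); the function names are hypothetical.

```python
def sh_basis(normal):
    """Nine second-order SH basis terms as polynomials of a unit normal."""
    nx, ny, nz = normal
    return [
        1.0,                  # constant (ambient) term
        nx, ny, nz,           # first-order terms
        nx * ny, nx * nz, ny * nz,
        nx * nx - ny * ny,
        3.0 * nz * nz - 1.0,  # second-order terms
    ]


def lambertian_intensity(albedo, coeffs, normal):
    """Per-channel intensity: albedo times the SH-weighted shading.

    albedo: [r, g, b]; coeffs: three lists of nine coefficients, one per channel.
    """
    basis = sh_basis(normal)
    intensity = []
    for channel_albedo, channel_coeffs in zip(albedo, coeffs):
        shading = sum(l * h for l, h in zip(channel_coeffs, basis))
        intensity.append(channel_albedo * shading)
    return intensity
```

Because the basis depends only on the normal, any error in the normals extracted from the density field changes the synthesized intensity, which is what lets the photometric error of this branch act on the geometry.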
Regarding loss function design, this study considers three terms: the error between NeRF-rendered images and ground truth images, the error between synthetic images generated by the physical imaging model and ground truth images, and the normal vector smoothness constraint.
Following the uncertainty-aware training strategy, we model the photometric observations with a Gaussian distribution whose variance represents the aleatoric uncertainty. The loss measuring the fitting error between the NeRF-rendered image and the ground truth is defined as the negative log-likelihood in Equation (9), which is computed over all N pixels from the pixel value of the NeRF-rendered image and the corresponding ground truth pixel value.
Similarly, the second loss measures the fitting error between the synthetic image generated by the physical imaging model and the ground truth, and is defined in Equation (10) from the pixel value of the synthesized image and the transient uncertainty predicted by the network for each ray.
The two losses in Equations (9) and (10) are both written in an uncertainty-aware form, so the predicted per-ray uncertainty downweights unreliable observations in both branches during optimization.
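To illustrate how a per-ray uncertainty downweights unreliable observations, the sketch below evaluates a plain Gaussian negative log-likelihood. It is an assumption-laden simplification: the exact expressions of Equations (9) and (10), including any constant offsets or bounds on the uncertainty used by the backbone, are not reproduced, and `beta` is a hypothetical name for the predicted per-ray standard deviation.

```python
import math


def gaussian_nll(pred, target, beta):
    """Mean Gaussian negative log-likelihood over rays (constants dropped).

    pred, target: lists of [r, g, b] pixel values; beta: per-ray std > 0.
    """
    total = 0.0
    for p, t, b in zip(pred, target, beta):
        sq_err = sum((pc - tc) ** 2 for pc, tc in zip(p, t))
        # A large beta shrinks the data term but is penalized by log(beta).
        total += sq_err / (2.0 * b * b) + math.log(b)
    return total / len(pred)
```

The same function could be applied to both branches, once with the NeRF-rendered pixels and once with the pixels synthesized by the physical imaging model, sharing the same predicted uncertainty.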
From a module perspective, the framework contains a NeRF rendering module and a physical imaging module that are optimized jointly rather than independently. The physical imaging module directly uses the albedo predicted by the NeRF branch, while its surface normals are extracted from the NeRF geometry represented by the same density field. In this way, the two modules share scene attributes and are coupled through a common latent representation during training.
In terms of functionality, the NeRF rendering module in Equation (4) explains the observed image through shadow-aware volume rendering, whereas the physical imaging module in Equation (8) synthesizes a physically constrained image under decomposed illumination using the shared albedo and geometry-derived normals. The same shadow state is maintained across the two modules: in Equation (4), it modulates the balance between direct illumination and the ambient component, while in Equation (8) it separates sunlight and ambient spherical-harmonic coefficients, preventing shadow-induced brightness variation from being absorbed into albedo.
2.3. Geometric Structure Constraint
To further enhance the quality of 3D reconstruction and the geometric accuracy of DSM using NeRF, we propose a geometry-aware structural optimization method consisting of two complementary terms: a Normal Smoothness Constraint and a novel Bilateral Edge-Aware Constraint. Building upon the fusion of NeRF and the physical imaging model, this approach introduces geometric features as external supervision signals. These features guide the network in accurately learning the density field distribution over the target area, thereby improving the accuracy of the elevation estimates derived from the density field.
As illustrated in Figure 3, the two geometric branches play complementary roles. The normal-smoothing branch promotes local continuity in flat or homogeneous regions, but its effect is weakened when local albedo differences indicate a likely edge, thereby avoiding undesired smoothing across roofs, walls, or terrain breaks. The bilateral branch further uses the tangent plane of the center point to distinguish local noise from true structural change: neighbors that are close in both geometry and albedo receive strong regularization, whereas neighbors with inconsistent photometric cues receive weak weights and are preserved as possible discontinuities. This adaptive design is especially important for implicit NeRF geometry, because unlike explicit point-based or Gaussian-based representations, the radiance field does not provide fixed neighborhoods or surface connectivity a priori, and nearby image samples may still belong to different structures in the continuous field.
- (1) Normal Smoothness Constraint:
To ensure good continuity of the estimated surface normals in flat regions while preserving sharp features, an adaptive normal smoothness constraint is introduced in Equation (11). It penalizes the discrepancy between the surface normal at a sampled point and the normal at a randomly perturbed neighboring point. In the implementation, the perturbed point is sampled by adding uniform noise in the normalized 3D space with magnitude 0.01. The edge-aware weight mask is computed from the albedo color difference between the two points: it takes one value if the difference is no larger than a threshold, and another value otherwise.
By minimizing the discrepancy between adjacent normals, surface smoothness is maintained, thus improving the quality of subsequent 3D reconstructions. This term effectively reduces high-frequency noise on continuous surfaces.
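The masked normal smoothness term can be sketched as follows. Several details are assumptions made for illustration only: the mask values are taken to be 1 below the threshold and 0 above it, the albedo difference is measured as a Euclidean distance, the normal discrepancy is a squared difference, and the default threshold of 0.15 is the value adopted for the edge threshold in the experiments. The function names are hypothetical.

```python
import math
import random


def perturb_point(p, rng, magnitude=0.01):
    """Jitter a point with uniform noise in [-magnitude, magnitude] per axis."""
    return [c + rng.uniform(-magnitude, magnitude) for c in p]


def edge_mask(albedo, neighbor_albedo, tau=0.15):
    """Assumed binary mask: 1.0 for similar albedo, 0.0 across a likely edge."""
    return 1.0 if math.dist(albedo, neighbor_albedo) <= tau else 0.0


def normal_smoothness_loss(normals, neighbor_normals,
                           albedos, neighbor_albedos, tau=0.15):
    """Mean masked squared difference between paired normals."""
    total = 0.0
    for n, m, a, b in zip(normals, neighbor_normals, albedos, neighbor_albedos):
        sq_diff = sum((x - y) ** 2 for x, y in zip(n, m))
        total += edge_mask(a, b, tau) * sq_diff
    return total / len(normals)
```

In this form, a pair of points whose albedo differs by more than the threshold contributes nothing to the loss, so normals on the two sides of a roof edge are not pulled towards each other.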
- (2) Bilateral Edge-Aware Constraint:
Standard smoothness constraints often lead to over-smoothed results, blurring the distinct boundaries of buildings and terrain features. To address this, we introduce a Bilateral Edge-Aware Constraint. Inspired by edge-preserving filtering techniques, this constraint utilizes the photometric differences between neighboring rays to weight the geometric regularization dynamically.
The core idea is that significant changes in pixel color often correspond to geometric discontinuities, while similar colors imply a continuous surface. We define a bilateral weight for a ray and each of its neighbors, considering both spatial proximity and photometric similarity, as shown in Equation (12). The weight is computed from the 3D coordinates of the center point and its j-th neighbor and from their albedo features. Both constraints are guided by albedo-based photometric differences between the center point and its neighboring point(s); the normal-smoothing branch uses a thresholded pairwise albedo difference, whereas the bilateral branch uses continuous Gaussian weights derived from adaptive pairwise albedo differences. In the implementation, K = 4 neighbors are sampled per ray by drawing integer image-plane offsets uniformly from [−2, 2] × [−2, 2] while excluding the zero offset. The corresponding neighbor rays are rendered to recover their 3D positions. The adaptive spatial scale is defined as the mean Euclidean neighborhood distance in Equation (13), and the photometric scale is computed adaptively from local albedo differences in Equation (14), rather than being fixed as a global constant.
Instead of directly smoothing depth values, we enforce a coplanarity constraint locally. We minimize the distance of neighboring points from the tangent plane of the central point. The bilateral geometric loss is formulated in Equation (15), which uses the unit normal vector at the central point and the average edge length for normalization, so that the loss is scale-invariant. This term encourages neighbors to lie on the same plane defined by the surface normal, weighted by their bilateral consistency.
The adaptive mask in Equation (11) weakens normal smoothing when local albedo differences indicate an edge, while the bilateral weights in Equation (12) jointly consider spatial proximity and photometric consistency to preserve discontinuities at roofs, walls, and shadow transitions. In this sense, the proposed masks and weights are specifically designed to make geometric regularization compatible with an implicit NeRF representation rather than assuming an already segmented explicit surface.
By minimizing this loss, the network is encouraged to smooth the geometry in texture-less or homogeneous regions while allowing for sharp depth transitions at object boundaries.
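The bilateral weighting and the tangent-plane penalty can be illustrated with the sketch below. It is not a reproduction of Equations (12) to (15): the Gaussian form of both factors, the use of the mean albedo difference as the photometric scale, the absolute (rather than squared) point-to-plane distance, and drawing the K = 4 offsets without replacement are all assumptions, and the function names are hypothetical.

```python
import math
import random


def sample_neighbor_offsets(rng, k=4, radius=2):
    """Draw k distinct integer pixel offsets in [-radius, radius]^2, excluding (0, 0)."""
    candidates = [(dx, dy)
                  for dx in range(-radius, radius + 1)
                  for dy in range(-radius, radius + 1)
                  if (dx, dy) != (0, 0)]
    return rng.sample(candidates, k)


def bilateral_weights(p, albedo, neighbor_ps, neighbor_albedos, eps=1e-8):
    """Gaussian weights from spatial and albedo distances with adaptive scales."""
    d_space = [math.dist(p, q) for q in neighbor_ps]
    d_color = [math.dist(albedo, b) for b in neighbor_albedos]
    sigma_s = sum(d_space) / len(d_space) + eps  # mean neighborhood distance
    sigma_c = sum(d_color) / len(d_color) + eps  # assumed: mean albedo difference
    weights = [math.exp(-ds ** 2 / (2.0 * sigma_s ** 2))
               * math.exp(-dc ** 2 / (2.0 * sigma_c ** 2))
               for ds, dc in zip(d_space, d_color)]
    return weights, sigma_s


def coplanarity_loss(p, normal, neighbor_ps, weights, mean_edge):
    """Weighted mean distance of neighbors from the tangent plane at p."""
    total = 0.0
    for q, w in zip(neighbor_ps, weights):
        offset = [qc - pc for qc, pc in zip(q, p)]
        plane_dist = abs(sum(nc * oc for nc, oc in zip(normal, offset)))
        total += w * plane_dist / mean_edge  # normalized for scale invariance
    return total / len(neighbor_ps)
```

In this sketch the spatial scale returned by `bilateral_weights` can also be passed as `mean_edge`, which treats the mean neighborhood distance and the average edge length as the same quantity; whether the study does so is not stated.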
2.4. Optimization and DSM Generation
The final objective function integrates the rendering loss, the physical shading loss, and the dual geometric constraints, as summarized in Equation (16). It combines the standard NeRF rendering loss with the physical shading loss and the two geometric terms, using three weights that control the physical shading, adaptive normal smoothing, and bilateral edge-aware terms, respectively. In the implementation, two of these weights are selected from {0.001, 0.005, 0.01}, whereas the third is selected from {0.1, 0.01, 0.001}; the final configuration for each scene is chosen according to validation performance.
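Once the network is optimized, we generate the final Digital Surface Model. In this phase, this study adopts a ray integration-based depth estimation method to convert the continuous density field into discrete surface point clouds. By performing a weighted average of the sampled point depths $t_i$ along each ray, the final depth $d$ is computed by Equation (17):
$$d = \sum_{i} T_i \, \alpha_i \, t_i \qquad (17)$$
where $T_i$ and $\alpha_i$ are the transmittance and opacity defined in Equations (2) and (3). Then, combining the ray origin $\mathbf{o}$ and the direction $\mathbf{v}$, the 3D coordinates are obtained by Equation (18):
$$P = \mathbf{o} + d \, \mathbf{v} \qquad (18)$$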
Subsequently, the point cloud coordinates are transformed into geographic latitude and longitude using the RPC model, and further projected into a 2D planar coordinate system through Universal Transverse Mercator (UTM) projection, enabling the point cloud to be mapped directly into a gridded raster structure. Finally, based on the predefined spatial resolution and grid range, the elevation values of all points are mapped to their corresponding grid cells, thus completing the DSM generation process.
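The final two steps, depth integration along a ray and gridding of the resulting points, can be sketched as below. The sketch assumes that the points have already been converted to UTM easting and northing (the RPC and UTM transformations are not shown), that the grid origin is its top-left corner, and that the highest elevation is kept when several points fall into one cell; the source does not state how such collisions are resolved. All names are hypothetical.

```python
import math


def ray_depth(weights, sample_depths):
    """Depth as the sum of sample depths weighted by T_i * alpha_i."""
    return sum(w * t for w, t in zip(weights, sample_depths))


def surface_point(origin, direction, depth):
    """3D point reached by travelling `depth` along the ray."""
    return [o + depth * v for o, v in zip(origin, direction)]


def rasterize_dsm(points, x_min, y_max, resolution, width, height):
    """Map (easting, northing, elevation) points onto a north-up grid.

    Empty cells stay NaN; colliding points keep the highest elevation
    (an assumption made for this sketch).
    """
    dsm = [[math.nan] * width for _ in range(height)]
    for easting, northing, elevation in points:
        col = int((easting - x_min) // resolution)
        row = int((y_max - northing) // resolution)
        if 0 <= row < height and 0 <= col < width:
            current = dsm[row][col]
            if math.isnan(current) or elevation > current:
                dsm[row][col] = elevation
    return dsm
```

For the benchmark tiles described in the experiments, `resolution` would be 0.5 and `width` and `height` would both be 512.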
In summary, this section presents a comprehensive DSM optimization framework driven by the proposed Shading and Geometric Constraint method. Beyond standard surface normal computation and spatial binding, the approach introduces a novel bilateral edge-aware mechanism and physical shading supervision into the NeRF training process. This effectively addresses the limitations of traditional NeRF frameworks in elevation and structural recovery, ensuring both smoothness in flat areas and sharpness at building edges.
3. Results
3.1. Experimental Setup
To evaluate the effectiveness of the proposed method, we used the public US3D/DFC2019 Track 3 data released through the IEEE GRSS 2019 Data Fusion Contest and publicly hosted on IEEE DataPort (open access; non-IEEE members can also browse and download the released files at https://ieee-dataport.org/open-access/data-fusion-contest-2019-dfc2019 (accessed on 2 April 2026)). The source imagery consists of WorldView-3 panchromatic and 8-band visible and near-infrared (VNIR) satellite images courtesy of DigitalGlobe. According to the contest description, the native ground sampling distance is approximately 35 cm for the panchromatic band and 1.3 m for the VNIR bands, and the VNIR images released for the contest are pan-sharpened.
In this work, we follow the Sat-NeRF/EO-NeRF data preparation and use the RGB image products distributed from DFC2019: Track3-RGB-crops for the JAX_RGB setting and Track3-NEW-crops for the JAX_NEW setting. The former contains preprocessed uint8 RGB crops, whereas the latter contains float32 RGB crops pansharpened from the raw DFC2019 data. The NIR bands are not introduced as additional channels, so that the comparison remains aligned with the released Sat-NeRF/EO-NeRF RGB protocols rather than being confounded by a change in input modality. The selected areas are located in Jacksonville, Florida, USA (JAX_004, JAX_068, JAX_214, JAX_260, JAX_207, and JAX_264) and Omaha, Nebraska, USA (OMA_203 and OMA_212). At the city scale, the contest imagery was collected between 2014 and 2016 over Jacksonville and between 2014 and 2015 over Omaha.
For the experimental setup used here, each target region is defined by a released Track 3 reference DSM tile of size 512 × 512 pixels at 0.5 m spacing, corresponding to an approximately 256 × 256 m ground footprint. The associated perspective RGB inputs are view-dependent image crops and therefore do not all have identical pixel dimensions after the contest preprocessing; for example, OMA_203 input views are not stored as 512 × 512 images but include sizes such as 821 × 786 and 854 × 854, even though they correspond to the same 512 × 512 reference DSM tile. As shown in Figure 4, JAX_068 and JAX_214 are dominated by dense and regular urban structures with clearly defined boundaries, whereas JAX_004 and JAX_260 contain more vegetation and irregular terrain, making 3D reconstruction more challenging. JAX_207 and JAX_264 contain larger contiguous urban layouts, while OMA_203 and OMA_212 exhibit evident radiance differences.
Table 1 summarizes the key scene information, including the number of training and test images, total sampled rays, and DSM elevation ranges in meters.
All SGC-enhanced models were trained using the Adam optimizer with an initial learning rate of $5 \times 10^{-4}$. Each training batch included 1024 sampled rays, with 128 sampled 3D points per ray. The average training time required to reach convergence is discussed in Section 4.4.
To justify the choice of the edge threshold in the normal smoothness term, we additionally evaluated three values (0.10, 0.15, and 0.20) on four NEW scenes, namely JAX_004, JAX_068, JAX_214, and JAX_260. The average MAE values were 1.507, 1.495, and 1.521 m, respectively. The best result was obtained at a threshold of 0.15, indicating that an overly small threshold tends to weaken beneficial smoothing, whereas an overly large threshold may blur true geometric discontinuities. Therefore, a threshold of 0.15 was adopted in the reported experiments.
For evaluation, we used the airborne LiDAR-derived DSM TIFFs released in DFC2019 Track 3 as the reference geometry for all scenes. Elevation errors were computed directly between the predicted DSM and the corresponding reference DSM raster. The public Track 3 DSM files are provided as float32 TIFF products, and their GeoTIFF headers do not explicitly encode vertical-datum metadata; therefore, we follow the released benchmark convention and report elevation differences in meters with respect to the provided reference raster. Smaller errors indicate higher precision and better terrain generalization.
Accordingly, the present evaluation should be interpreted as benchmark-relative to the released DFC2019 reference raster rather than as a geodetically validated claim about orthometric or ellipsoidal height.
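As a minimal sketch of the raster-to-raster comparison, the following code computes the MAE and RMSE between a predicted DSM and a reference DSM stored as nested lists of elevations in meters. Skipping cells where either raster is NaN is an assumption made for this sketch; the handling of empty or invalid cells in the benchmark evaluation is not described here.

```python
import math


def elevation_errors(pred_dsm, ref_dsm):
    """Return (MAE, RMSE) in meters over cells valid in both rasters."""
    diffs = [p - r
             for pred_row, ref_row in zip(pred_dsm, ref_dsm)
             for p, r in zip(pred_row, ref_row)
             if not (math.isnan(p) or math.isnan(r))]
    if not diffs:
        raise ValueError("no overlapping valid cells")
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse
```

Since RMSE squares each difference before averaging, it is more sensitive than MAE to a small number of large elevation errors, which is why the two metrics are reported together.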
3.2. Method Comparison
To comprehensively evaluate the effectiveness and generalization capability of the proposed SGC method for DSM reconstruction, this section presents a series of comparative experiments. In addition to the standard NeRF-based methods (NeRF, S-NeRF, Sat-NeRF), we introduced two state-of-the-art baselines: EO-NeRF [
36] and EOGS [
46].
To ensure a fair and unified comparison, all methods were evaluated under the same scene-specific protocol for each available scene-format pair. Specifically, the same input images, camera metadata, reference LiDAR DSM raster, DSM grid definition, and error-computation procedure were used when comparing different methods on the same scene. MAE and RMSE were computed in the same manner for all methods. RGB and NEW results are reported separately and are compared only within the same input format rather than across formats. For the Sat-NeRF and EO-NeRF backbones, we followed the original backbone settings as closely as possible and introduced only the proposed SGC module in the enhanced variants. For scenes where the NEW-format data could not be generated because the corresponding preprocessing scripts are unavailable, the entries are marked as “/” and are excluded from comparisons requiring complete baseline coverage.
Crucially, to validate that our proposed method is a general enhancement unit, we applied it to both Sat-NeRF and EO-NeRF backbones. In
Table 2, ‘Sat-NeRF+SGC’ denotes the Sat-NeRF model integrated with our SGC method, while ‘EO-NeRF+SGC’ denotes the EO-NeRF model enhanced by the same method. MAE remains the primary elevation indicator, but for a more comprehensive assessment,
Table 2 reports scene-wise MAE on the first line and RMSE on the second line, both in meters. Additional structure-aware metrics are further summarized and discussed in
Section 4.1.
As shown in
Table 2, the proposed method consistently improves the reconstruction accuracy across different backbones. The RMSE values follow the same overall trend as the MAE values, further confirming that the proposed SGC module improves both average elevation accuracy and robustness to larger errors.
Improvement on Sat-NeRF: When integrated with the Sat-NeRF backbone, our ‘Sat-NeRF+SGC’ model reduces the Mean Absolute Error (MAE) in all evaluated scenes relative to the baseline. This improvement is particularly notable in scenarios with complex surface structures. For instance, in JAX_260 and JAX_004, which feature irregular terrain, the method demonstrates enhanced adaptability. Furthermore, in the JAX_264 scene, the MAE decreases substantially from 2.107 m to 1.442 m, indicating that the method can correct geometric distortions in challenging environments.
Improvement on EO-NeRF: The efficacy of our approach is even more pronounced when applied to the advanced EO-NeRF architecture. For the JAX_264 RGB scene, the EO-NeRF baseline yields an elevation MAE of 3.064 m, whereas EO-NeRF+SGC reduces this value to 1.289 m. On the NEW-format scenes, the method also improves the JAX_214 and JAX_260 areas from 1.667 m to 1.553 m and from 1.768 m to 1.452 m, respectively. These gains confirm that introducing explicit physical shading models and bilateral geometric constraints brings substantial benefits, even to state-of-the-art frameworks.
Comparison with State-of-the-art: Our enhanced model also demonstrates superior performance compared with the emerging 3D Gaussian Splatting baseline EOGS [
46] on the scenes where NEW-format results are available. For example, on JAX_214 and JAX_260, EO-NeRF+SGC achieves lower elevation MAE values than EOGS (1.553 m vs. 1.750 m on JAX_214 and 1.452 m vs. 1.553 m on JAX_260). Additionally, on JAX_068, our method achieves an MAE of 1.087 m, slightly lower than the 1.112 m obtained by EOGS. These results suggest that NeRF-based methods, when equipped with our SGC module, remain competitive for high-precision satellite photogrammetry.
EOGS also shows that explicit and implicit representations have different strengths for this task. As an explicit Gaussian-based representation, EOGS can optimize local surface structure more directly and can therefore be highly competitive in scenes with stable radiometry and clear geometric support. By contrast, the implicit NeRF formulation remains attractive when radiometric decomposition and shadow-aware supervision are important, because density, albedo, shading, and geometry can be optimized within one continuous field. The results in
Table 2, therefore, suggest a complementary picture rather than a one-sided dominance: EOGS is strong in some scenes, while EO-NeRF+SGC is particularly effective when shadow ambiguity and radiometric inconsistency become major error sources.
In summary, the quantitative results in
Table 2 confirm that the proposed SGC method is not limited to a specific network but serves as a generalizable plug-in that improves geometric fidelity and elevation accuracy across the evaluated backbones and scenes.
A scene-specific reading of
Table 2 shows that the gains of EO-NeRF+SGC are not uniform across all regions. The largest MAE reductions appear in scenes where radiometric inconsistency, strong shadow effects, or more evident geometric ambiguity are present, such as JAX_264 RGB (3.064 m to 1.289 m), OMA_212 RGB (1.306 m to 1.111 m), OMA_203 RGB (1.504 m to 1.362 m), and JAX_260 NEW (1.768 m to 1.452 m). The NEW inputs are pansharpened from the DFC2019 raw data, so their radiometric characteristics are not identical to those of the original RGB imagery; in such cases, the added physical imaging and geometric constraints provide especially useful complementary guidance beyond the EO-NeRF baseline.
By contrast, the improvements are more modest in JAX_068 RGB (1.301 m to 1.283 m), JAX_260 RGB (1.484 m to 1.448 m), JAX_207 RGB (1.921 m to 1.908 m), and JAX_004 NEW (1.419 m to 1.387 m). These cases suggest that the EO-NeRF baseline already provides comparatively stable reconstruction when illumination inconsistency is less dominant, so the room for further global MAE reduction becomes smaller. In addition, some remaining errors in these scenes are still related to vegetation, fine irregular structures, and local boundary complexity, which are only partially addressed by the current SGC design. The three RGB scenes represent native observations with relatively limited shadow-driven ambiguity, whereas JAX_004 NEW is a pansharpened product derived from the DFC2019 raw data and is additionally affected by vegetation and non-rigid canopy structure. Overall, this scene-dependent behavior indicates that SGC is most beneficial when radiometric decomposition and edge-aware regularization directly target the dominant source of reconstruction error.
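The scene-dependent behavior becomes easier to compare when the quoted MAE pairs are expressed as relative reductions. The sketch below uses only the values cited in the two preceding paragraphs; the helper `relative_reduction` is an illustrative name.

```python
# (EO-NeRF MAE, EO-NeRF+SGC MAE) in meters, as quoted in the text.
scenes = {
    "JAX_264 RGB": (3.064, 1.289),
    "OMA_212 RGB": (1.306, 1.111),
    "OMA_203 RGB": (1.504, 1.362),
    "JAX_260 NEW": (1.768, 1.452),
    "JAX_068 RGB": (1.301, 1.283),
    "JAX_260 RGB": (1.484, 1.448),
    "JAX_207 RGB": (1.921, 1.908),
    "JAX_004 NEW": (1.419, 1.387),
}


def relative_reduction(before, after):
    """Percentage decrease of `after` relative to `before`."""
    return 100.0 * (before - after) / before


reductions = {name: relative_reduction(b, a) for name, (b, a) in scenes.items()}

# List scenes from largest to smallest relative gain.
for name, pct in sorted(reductions.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct:.1f}%")
```

On these figures, the first group of scenes shows relative reductions of roughly 9% to 58%, whereas the four modest cases all stay below 3%.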
3.3. Three-Dimensional Model Visualization and Qualitative Performance Assessment
To further validate the effectiveness and robustness of the proposed method, this section presents a comparative 3D visualization analysis, under mesh representation, of the ground-truth DSM and the DSMs generated by the baseline Sat-NeRF, Sat-NeRF+SGC, the baseline EO-NeRF, and EO-NeRF+SGC. Direct visual inspection reveals differences in reconstruction quality, geometric consistency, and model completeness, providing deeper insight into the practical performance of each method.
Figure 5,
Figure 6,
Figure 7,
Figure 8,
Figure 9,
Figure 10,
Figure 11 and
Figure 12 present the visualized details of the 3D models reconstructed for the eight target regions in the experiments. Comparing the detailed areas highlighted with red rectangles shows that the proposed method yields clearer building boundaries, smoother surfaces, and more complete structures than the corresponding baselines.
As highlighted by the red bounding boxes in
Figure 5 and
Figure 6, which are dominated by dense and regular urban structures with clearly defined boundaries, the SGC module improves geometric reconstruction across both backbones.
In these regions, the baseline Sat-NeRF exhibits severe planar distortion and noise, which blur edges and fragment the reconstructed structures. EO-NeRF improves the overall layout, but still shows noticeable surface roughness and jagged artifacts along building boundaries.
After SGC is introduced, both backbones improve visibly. Sat-NeRF+SGC reduces much of the distortion present in Sat-NeRF and produces smoother surfaces and clearer shapes, indicating that the module can correct substantial geometric errors. EO-NeRF+SGC yields the cleanest results overall, preserving the structural advantages of EO-NeRF while further suppressing residual noise and sharpening building transitions. As a result, EO-NeRF+SGC provides the most complete meshes and the clearest building facades among the compared methods.
As highlighted by the red bounding boxes in
Figure 7 and
Figure 8, which contain abundant vegetation and irregular terrain, the proposed SGC method improves the reconstruction of complex landscape details.
In these scenes, the baseline Sat-NeRF struggles to separate vegetation from terrain structure, producing noisy 3D outputs with limited local detail. EO-NeRF captures the overall layout more reliably, but still exhibits surface roughness and unstable texture-geometry separation, especially where dense vegetation covers the ground.
With SGC, reconstruction quality improves for both backbones. Sat-NeRF+SGC reduces noise and clarifies the shapes of buildings and terrain, showing that the module can recover part of the missing structure. EO-NeRF+SGC gives the best results overall, with clearer separation between vegetation and uneven terrain and fewer visible artifacts. By combining the stronger EO-NeRF backbone with the proposed radiometric and geometric constraints, it produces a more faithful reconstruction of these challenging scenes.
As highlighted by the red bounding boxes in
Figure 9 and
Figure 10, which contain larger contiguous urban layouts, the proposed SGC method improves geometric consistency while preserving local detail.
The baseline Sat-NeRF performs poorly in these larger contiguous urban scenes and fails to recover the overall layout reliably. EO-NeRF captures the global structure more successfully, but still lacks local refinement: building boundaries remain ragged, and the surfaces show noticeable noise and granularity.
The SGC-enhanced models improve both global completeness and local detail. Sat-NeRF+SGC restores much of the missing structural continuity relative to its baseline and yields a more coherent overall layout. EO-NeRF+SGC again provides the best reconstruction quality, preserving large-scale organization while producing sharper building boundaries and smoother transitions between urban zones.
As highlighted by the red bounding boxes in
Figure 11 and
Figure 12, where the input images exhibit pronounced radiance differences, the proposed SGC method improves geometric reconstruction under challenging illumination variation.
Under these conditions, the baseline Sat-NeRF is strongly affected by radiometric inconsistency, producing fragmented meshes with blurred edges and unstable geometry. EO-NeRF improves boundary definition, but still suffers from high-frequency surface noise and visible artifacts.
After SGC is added, both models better disentangle geometry from brightness variation. Sat-NeRF+SGC produces more coherent surfaces and preserves geometric boundaries more reliably than the baseline. EO-NeRF+SGC yields the best overall result, further reducing residual noise and recovering more faithful structural detail in these radiometrically inconsistent scenes.
Across the four scene categories—dense regular urban areas (JAX_068, JAX_214), vegetation-rich and irregular regions (JAX_004, JAX_260), larger contiguous urban layouts (JAX_207, JAX_264), and scenes with pronounced radiance variation (OMA_203, OMA_212)—the proposed SGC method consistently improves the baseline reconstructions.
In regular urban scenes, SGC sharpens building boundaries and completes facades; in vegetation-rich and irregular scenes, it recovers finer terrain and object separation; in larger urban layouts, it better balances global structural consistency with local detail; and in radiometrically inconsistent scenes, it better separates geometry from brightness variation. These effects are observed with both the Sat-NeRF and EO-NeRF backbones.
Overall, EO-NeRF+SGC provides the most robust performance by combining the strengths of EO-NeRF with the radiometric and geometric regularization introduced by SGC. These results support the effectiveness of the proposed method across diverse and challenging reconstruction scenarios.
3.4. Ablation Study
Table 3 and
Table 4 report the ablation results of the proposed components on EO-NeRF and Sat-NeRF, respectively. In
Table 3, “+BI” denotes the bilateral edge-aware term alone, “+BI+SM” further adds the normal smoothing term, and “+GC+PIM” corresponds to the complete EO-NeRF+SGC model. In
Table 4, “+GC” denotes the geometric-constraint branch added to Sat-NeRF, while “+GC+PIM” represents the full Sat-NeRF+SGC configuration. This design separates the effect of edge-aware geometric regularization from that of the physical imaging model.
The EO-NeRF ablation reveals that the contribution of individual modules is not necessarily monotonic in every scene. The bilateral term alone already brings clear gains in scenes with stable man-made boundaries, but its standalone effect can be mixed when vegetation, irregular objects, or radiometric inconsistency are dominant. After the smoothing term is added, the results become more stable, indicating that BI and SM are complementary: BI helps preserve meaningful edges, whereas SM suppresses local geometric noise. The full GC+PIM configuration then yields the lowest MAE in all listed scenes, showing that radiometric decomposition further strengthens the geometric optimization rather than simply duplicating the role of the geometric terms.
A similar trend can be observed for Sat-NeRF. The geometric constraint branch improves most scenes but not every case, which is expected for a strong regularizer applied to more challenging radiometric observations. Once PIM is jointly introduced, the full model again achieves the best overall results across all reported RGB scenes. Therefore, the ablation study supports a nuanced conclusion: the proposed components should be interpreted as complementary modules whose joint use is the most robust, rather than as isolated terms that must individually improve every scene.
3.5. Illumination Decomposition Analysis
To directly evaluate the role of the physical imaging model (PIM), we perform a consistency-based quantitative analysis on the illumination decomposition produced by EO-NeRF+SGC. Because the DFC2019 benchmark does not provide ground-truth albedo, shading, or normal maps, the goal here is not to construct a full intrinsic-image benchmark, but to verify whether the learned decomposition behaves in the intended physically meaningful way.
Specifically, we analyze four manually selected flat-roof ROIs in validation view JAX_068_011_RGB. Each ROI contains both sunlit and shadowed pixels on the same roof surface. For each ROI, we measure the mean sunlit-shadow intensity gap in the original RGB image, the estimated albedo, and the derived shading map. A successful decomposition is expected to reduce the sunlit-shadow gap in albedo while preserving or even enlarging the contrast in shading. We further report the coefficient of variation (CV) for RGB and albedo as a complementary indicator of illumination sensitivity within the same surface.
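The two ROI statistics can be sketched on a toy example. The code assumes a single-channel intensity per pixel and a binary shadow mask per ROI; the function names and the synthetic roof are hypothetical and serve only to show the expected behavior of an ideal decomposition.

```python
import numpy as np


def sunlit_shadow_gap(values, shadow_mask):
    """Mean intensity of sunlit pixels minus mean of shadowed pixels."""
    return float(values[~shadow_mask].mean() - values[shadow_mask].mean())


def coefficient_of_variation(values):
    """Standard deviation divided by mean over all pixels in the ROI."""
    return float(values.std() / values.mean())


# Toy ROI: a uniform roof (reflectance 0.5) half covered by shadow.
shadow = np.array([[False, False, True, True]] * 2)
shading = np.where(shadow, 0.4, 1.0)      # hypothetical illumination factor
albedo = np.full(shadow.shape, 0.5)
rgb = albedo * shading                    # observed intensity

gap_rgb = sunlit_shadow_gap(rgb, shadow)
gap_albedo = sunlit_shadow_gap(albedo, shadow)
gap_shading = sunlit_shadow_gap(shading, shadow)
cv_rgb = coefficient_of_variation(rgb)
cv_albedo = coefficient_of_variation(albedo)
print(gap_rgb, gap_albedo, gap_shading, cv_rgb, cv_albedo)
```

In this idealized case the albedo gap and albedo CV vanish while the shading map carries the full contrast. A learned decomposition is expected to move in the same direction without reaching zero.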
As shown in
Table 5, the mean sunlit-shadow gap decreases from 0.1551 in the original RGB image to 0.0627 in the estimated albedo, corresponding to a 60.36% reduction. In contrast, the derived shading map retains a mean gap of 0.2841, which indicates that the dominant illumination contrast is transferred to shading rather than remaining entangled with surface reflectance. The mean coefficient of variation also drops from 0.6480 in RGB to 0.2928 in albedo. Although this is not a benchmark-style intrinsic decomposition evaluation, these results provide direct quantitative support for the intended behavior of the PIM. Representative qualitative examples are discussed later in
Section 4.3.
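As a quick arithmetic check, the relative changes implied by the reported means can be computed directly. Only the four mean values quoted above are used; how the per-ROI values are aggregated is not stated in the text, so the note after the code is an assumption.

```python
gap_rgb, gap_albedo = 0.1551, 0.0627   # mean sunlit-shadow gaps reported in the text
cv_rgb, cv_albedo = 0.6480, 0.2928     # mean coefficients of variation reported in the text

# Relative reduction computed from the ratio of the reported means.
gap_reduction = 100.0 * (gap_rgb - gap_albedo) / gap_rgb
cv_reduction = 100.0 * (cv_rgb - cv_albedo) / cv_rgb

print(f"gap reduction from means: {gap_reduction:.2f}%")
print(f"CV reduction from means:  {cv_reduction:.2f}%")
```

The ratio of the two reported mean gaps gives about 59.6%, slightly below the stated 60.36%. Such a difference would arise if the stated figure averages the per-ROI reductions rather than taking the reduction of the averaged gaps, although the text does not specify which convention is used. The corresponding CV reduction is about 54.8%.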
4. Discussion
The results in
Section 3 show consistent gains from SGC across backbones, but the mechanism is better understood when quantitative and qualitative evidence are read together. This section therefore interprets the supplementary metrics in
Table 6, analyzes spatial error patterns in
Figure 13, and discusses the PIM decomposition evidence in
Figure 14. The emphasis is not only on average improvement, but also on where gains are concentrated and where residual errors remain.
4.1. Interpretation of the Supplementary Metrics
Beyond the scene-wise MAE and RMSE in
Table 2,
Table 6 reports the mean supplementary metrics over the eight RGB scenes for Sat-NeRF and EO-NeRF, with and without SGC. These metrics complement global error: RE normalizes height error by scene relief; completeness and correctness characterize tolerance-level consistency; F1@
balances the two; and Boundary-MAE/Boundary-RMSE focus on edge-sensitive regions where urban DSM reconstruction is typically most difficult.
As shown in
Table 6, SGC improves all supplementary metrics for both backbones. For Sat-NeRF, RE decreases from 11.024 to 9.255, and F1@
increases from 0.457 to 0.513; for EO-NeRF, RE decreases from 9.956 to 8.053, and F1@
increases from 0.436 to 0.527. Boundary-MAE and Boundary-RMSE also decrease for both models, indicating that the gain is not limited to global averages but extends to boundary-critical regions.
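To make two of these metrics concrete, the sketch below combines completeness and correctness into an F1 score and restricts MAE to boundary cells. The exact definitions used for Table 6 are not reproduced in this section, so this is an assumption-laden illustration: F1 is taken as the harmonic mean of the two rates, and boundary cells are taken as those whose reference height differs from a 4-neighbor by more than a hypothetical `step_threshold`.

```python
import numpy as np


def f1_score(completeness, correctness):
    """Harmonic mean of completeness and correctness."""
    total = completeness + correctness
    return 0.0 if total == 0 else 2.0 * completeness * correctness / total


def boundary_mask(ref, step_threshold):
    """Mark cells whose height differs from a 4-neighbor by more than the threshold."""
    mask = np.zeros(ref.shape, dtype=bool)
    dy = np.abs(np.diff(ref, axis=0)) > step_threshold   # vertical neighbors
    dx = np.abs(np.diff(ref, axis=1)) > step_threshold   # horizontal neighbors
    mask[:-1, :] |= dy
    mask[1:, :] |= dy
    mask[:, :-1] |= dx
    mask[:, 1:] |= dx
    return mask


def boundary_mae(pred, ref, step_threshold):
    """MAE in meters restricted to boundary cells of the reference DSM."""
    mask = boundary_mask(ref, step_threshold)
    if not mask.any():
        raise ValueError("no boundary cells at this threshold")
    return float(np.mean(np.abs(pred[mask] - ref[mask])))


# Toy scene: flat ground (0 m) next to a 10 m building, larger errors at the edge.
ref = np.zeros((4, 4))
ref[:, 2:] = 10.0
err = np.full((4, 4), 0.5)
err[:, 1] = 1.0
err[:, 2] = 3.0
pred = ref + err

global_mae = float(np.mean(np.abs(pred - ref)))
edge_mae = boundary_mae(pred, ref, step_threshold=2.0)
print(global_mae, edge_mae)
```

In the toy scene the boundary-restricted error is larger than the global error, which illustrates why a boundary metric can expose edge problems that a scene-wide average dilutes.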
4.2. Spatial Error Analysis in Boundary and Vegetation Regions
Average metrics indicate overall improvement, but they do not reveal where residual errors are concentrated. We therefore analyze two complementary cases: boundary-dominant urban regions and vegetation-heavy regions.
Figure 13 first compares signed rdsm_diff maps and local enlargements for EO-NeRF and EO-NeRF+SGC in two representative urban scenes, JAX_068 and JAX_214. To extend the analysis beyond building-boundary regions, we additionally evaluate vegetation and non-vegetation MAE in JAX_004 and JAX_260 using binary region masks. All heat maps are rendered with the same fixed color scale [−8, +8] m, enabling direct visual comparison of error intensity and spatial distribution. Under this fixed display range, a small number of localized boundary outliers can remain visually salient even when the scene-wide MAE decreases.
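The effect of the fixed display range can be sketched with a simple clipping step. The sign convention of the difference map and the function name `to_display_range` are assumptions; only the [−8, +8] m limits come from the text.

```python
import numpy as np


def to_display_range(signed_diff, limit=8.0):
    """Clip a signed error map to [-limit, +limit] m and rescale to [0, 1]."""
    clipped = np.clip(signed_diff, -limit, limit)
    return (clipped + limit) / (2.0 * limit)


# Hypothetical signed elevation errors in meters.
diff = np.array([-12.0, -8.0, 0.0, 0.5, 8.0, 20.0])
shown = to_display_range(diff)
saturated = np.abs(diff) >= 8.0   # cells drawn at the extreme colors

print(shown)
print(int(saturated.sum()), "of", diff.size, "cells saturate the color scale")
```

Any error of 8 m or more maps to the same extreme color, so a few outliers remain prominent in the heat map regardless of how much the average error elsewhere has decreased.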
In JAX_068, the baseline already captures the main urban layout, but residual errors remain around roof boundaries, roof-to-ground transitions, and concave structural corners. After adding SGC, several roof and ground regions become less diffuse in the error map, and the enlarged views show tighter boundary-localized responses. This pattern agrees with the boundary-metric improvements reported in
Table 6.
JAX_214 shows a complementary behavior: EO-NeRF+SGC regularizes broad planar regions and suppresses part of scattered high-error responses, while thin residual strips persist near sharp edges and complex transitions.
Whereas the JAX_068 and JAX_214 examples emphasize boundary-localized urban errors, vegetation-heavy scenes require a different diagnostic because their dominant errors are less confined to straight building boundaries. Using binary vegetation masks on JAX_004 and JAX_260, we separately computed MAE over vegetation and non-vegetation regions from the registered DSM outputs. EO-NeRF+SGC reduces the vegetation/non-vegetation MAE from 2.856/0.909 m to 2.795/0.842 m in JAX_004 and from 2.598/1.378 m to 2.166/1.097 m in JAX_260, corresponding to relative reductions of 2.13%/7.28% and 16.61%/20.39%, respectively. These results show that SGC improves not only rigid urban structures but also more irregular vegetated areas, although the residual errors remain larger over vegetation than over surrounding non-vegetated surfaces. This gap is consistent with the non-rigid canopy structure, weaker multi-view consistency on thin branches and crowns, and locally unstable depth support under self-occlusion. Therefore, the strongest and most localized gains still appear around regular urban boundaries, whereas in vegetation-heavy scenes, SGC delivers broader but less spatially concentrated error reduction.
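The region-wise evaluation can be sketched as a masked MAE. This is a minimal illustration, assuming the predicted DSM, the reference DSM, and the binary vegetation mask share one grid; `masked_mae` and the toy arrays are hypothetical.

```python
import numpy as np


def masked_mae(pred, ref, mask):
    """MAE in meters over cells selected by `mask` and valid in both rasters."""
    valid = mask & np.isfinite(pred) & np.isfinite(ref)
    if not valid.any():
        raise ValueError("mask selects no valid cells")
    return float(np.mean(np.abs(pred[valid] - ref[valid])))


# Toy 2 x 3 scene: two vegetated cells with larger errors than the rest.
ref = np.zeros((2, 3))
pred = np.array([[3.0, -3.0, 1.0],
                 [1.0, -1.0, 1.0]])
vegetation = np.array([[True, True, False],
                       [False, False, False]])

veg_mae = masked_mae(pred, ref, vegetation)
non_veg_mae = masked_mae(pred, ref, ~vegetation)
print(veg_mae, non_veg_mae)
```

Evaluating the two regions separately keeps the larger vegetation errors from being averaged away by the more numerous non-vegetated cells.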
4.3. Role of the Physical Imaging Model: Albedo and Shading Decomposition
Figure 14 provides mechanism-oriented evidence for the role of the physical imaging model in shadowed urban reconstruction. In JAX_068, the estimated albedo maps are more uniform across roofs and roads than the raw RGB observations, while the mean-ratio shading maps isolate the dominant illumination-loss pattern. This separation helps decouple reflectance from lighting and reduces shadow-driven ambiguity during geometry optimization.
The paired views show that corresponding structures retain similar albedo despite clear brightness changes in the original images, indicating that illumination variation is largely absorbed by the shading component rather than encoded as false reflectance change.
The normal maps offer a geometric consistency check: major roof planes and transition areas remain coherent after introducing PIM, which supports stable optimization near boundaries and shaded facades.
These qualitative results should be interpreted as explanatory evidence rather than a strict inverse-rendering benchmark, because this dataset provides no ground-truth albedo, shading, or normals. Accordingly,
Figure 14 is used to explain why PIM helps geometric reconstruction under shadows, not to claim fully accurate physical decomposition for all materials and conditions.
4.4. Efficiency and Practical Applicability
To assess the engineering practicality of the proposed module, we report the average training time required to reach convergence over the four NEW scenes used in the main EO-NeRF comparison, namely JAX_004, JAX_068, JAX_214, and JAX_260. Based on these experiments, EO-NeRF requires about 2.6 h on average to converge, whereas EO-NeRF+SGC requires about 8.2 h.
Although SGC lengthens training, the GPU memory overhead remains very small. In the profiled reference run, peak GPU memory stays nearly unchanged after introducing SGC, remaining at about 9.30–9.33 GiB. Therefore, the main computational cost of SGC lies in additional optimization time rather than in substantially higher memory usage.
The inference-stage overhead is much smaller. When rendering three validation images from JAX_068, the baseline requires 6.69 s per image on average (24,549 rays/s) with 0.69 GiB peak allocated GPU memory, while EO-NeRF+SGC requires 6.73 s per image (24,378 rays/s) with 0.68 GiB. This near-identical test-time behavior indicates that the additional cost of SGC is concentrated in training, whereas inference remains essentially unchanged. From an application perspective, such a trade-off is acceptable for offline DSM reconstruction, where improved geometric fidelity around boundaries, shadowed roofs, and complex urban transitions is often more important than minimizing training time.
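The training and inference figures reported in this subsection can be put side by side with a few lines of arithmetic. All inputs are the values quoted above; the variable names are illustrative.

```python
# Average training time to convergence (hours), as reported in the text.
train_hours = {"EO-NeRF": 2.6, "EO-NeRF+SGC": 8.2}

# Average rendering time per image and throughput, as reported in the text.
render = {
    "EO-NeRF": {"seconds": 6.69, "rays_per_s": 24549},
    "EO-NeRF+SGC": {"seconds": 6.73, "rays_per_s": 24378},
}

train_ratio = train_hours["EO-NeRF+SGC"] / train_hours["EO-NeRF"]
render_overhead_pct = 100.0 * (
    render["EO-NeRF+SGC"]["seconds"] - render["EO-NeRF"]["seconds"]
) / render["EO-NeRF"]["seconds"]

# Rays rendered per image, implied by time multiplied by throughput.
rays_per_image = {k: v["seconds"] * v["rays_per_s"] for k, v in render.items()}

print(f"training time ratio: {train_ratio:.2f}x")
print(f"rendering overhead:  {render_overhead_pct:.2f}%")
print({k: round(v) for k, v in rays_per_image.items()})
```

Training takes roughly 3.2 times longer with SGC, whereas the per-image rendering time grows by well under 1%. Both configurations imply about 164,000 rays per image, which is consistent with the two runs rendering the same images.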
To provide a first controlled check of image-count robustness, we additionally conducted a reduced-view experiment on OMA_203 RGB using the same rdsm_diff-based MAE criterion as in
Table 2 and focusing on the near-nadir evaluation image OMA_203_010_RGB. With 15 training images, the MAE decreased from 1.840 m for EO-NeRF to 1.788 m for EO-NeRF+SGC. With 24 training images, the MAE decreased from 1.701 m for EO-NeRF to 1.654 m for EO-NeRF+SGC. These preliminary results indicate that SGC remains beneficial under reduced-image settings.
5. Conclusions
In this paper, we propose the Shading and Geometric Constraint (SGC), a general enhancement module for multi-view satellite NeRF-based DSM reconstruction.
The main theoretical contribution is a unified optimization formulation that couples a spherical-harmonics physical imaging model with edge-aware geometric regularization inside the NeRF training objective. This design links radiometric decomposition, shadow-robust supervision, and local surface-consistency priors, providing a physically grounded route to improve geometry rather than relying only on post-reconstruction refinement.
Across diverse scene types and two representative backbones, SGC consistently improves global elevation accuracy as well as boundary-sensitive quality. The results indicate that the radiometric component is especially helpful under radiance inconsistency and shadow, whereas the geometric component sharpens contours and stabilizes smooth structural regions.
The present study mainly evaluates DSM elevation accuracy and spatial error structure. A systematic benchmark of derived terrain parameters such as slope, roughness, curvature, and hillshade would also be valuable for downstream applications, but this lies beyond the current scope and remains an important direction for future work. To facilitate such follow-up analysis, the GeoTIFF elevation grids used in this study have been publicly released.
Remaining difficulties are concentrated in vegetation-dominated and highly non-Lambertian areas, where static-scene assumptions and appearance-guided regularization become less reliable. Future work will therefore focus on richer reflectance modeling, more efficient implementations, and stronger geometric priors for large-scale remote-sensing DSM reconstruction.