This paper infers scene geometry from monocular RGB images and trains an image-conditioned NeRF model in a self-supervised manner. Depth is derived from the NeRF radiance volume and optimized with a reprojection loss. The inputs to the NeRF model are the extracted feature vectors, the encoded positions of the sampling points and the viewing directions.
To estimate depth in large, complex scenes and to give the model generalization ability, the method uses a spherical network based on an adaptive fine-grained channel attention mechanism to extract image features and generate general-purpose representations of the sampling points. In addition, a Gaussian probability-based ray sampling method is introduced to sample points close to surfaces, which reduces the number of sampling points needed in large autonomous driving scenes. The training data consist of multiple sequences, each containing RGB images and their corresponding camera poses. The method estimates a neural representation conditioned on the first frame of each sequence and learns a radiance field shared among sequences. The implementation is illustrated in Figure 1.
3.1. Spherical Network Based on Channel Attention Mechanism
NeRF-based monocular depth estimation methods generally require scene-specific retraining and exhibit limited generalization to unseen environments. A key factor underlying this limitation lies in their reliance on conventional encoder–decoder architectures for 2D feature extraction. Standard U-Net structures confine learned features to the camera’s field of view (FOV), preventing NeRF from inferring colors and depths beyond visible regions. Moreover, repeated downsampling–upsampling operations tend to produce ambiguous or degraded features, which weakens the effectiveness of projected 3D point representations.
To overcome these shortcomings, this paper introduces a spherical U-Net enhanced with a fine-grained adaptive channel attention mechanism. Through spherical projection, 2D features are mapped onto a wider angular domain, enabling the network to incorporate contextual information beyond the original FOV and construct richer 3D point descriptors. Meanwhile, the adaptive channel attention module dynamically fuses global contextual cues with local geometric details, delivering more accurate feature weighting and substantially enhancing the discriminative ability of extracted features.
Building on this design, the decoder is restructured to operate on a spherical surface, which reduces geometric distortion compared with planar projection and expands the usable FOV to approximately 120°. This extension allows the network to recover depth and color information from regions that would otherwise lie outside the image boundary. To mitigate the feature degradation commonly introduced during upsampling, the proposed fine-grained channel attention (FCA) module explicitly models both global and local dependencies. Unlike squeeze-and-excitation (SE) attention, which relies mainly on global statistics processed by fully connected layers, FCA integrates global and local cues, enabling more accurate channel weighting and improved generalization.
In the spherical U-Net, the adaptive fine-grained channel attention module is applied primarily in the decoder, where feature refinement is essential. For view-synthesis tasks, combining global context with local channel cues helps suppress blur and improves reconstruction fidelity. As illustrated in Figure 2, the method first summarizes channel-level responses by converting the feature map $F$, which contains global spatial information, into a channel descriptor $U$ through global average pooling (GAP). Given the feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$ and $W$ denote the number of channels, height and width, respectively, the channel descriptor $U \in \mathbb{R}^{C}$ is generated via GAP. The $n$-th channel element of $U$ is expressed by Equation (1):

$$U_n = \mathrm{GAP}(F_n) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_n(i, j) \tag{1}$$
Here, $F_n(i, j)$ denotes the activation at spatial location $(i, j)$ in the $n$-th channel of the feature map, while $\mathrm{GAP}(\cdot)$ refers to the global average pooling function. This function compresses the feature map from size $C \times H \times W$ to $C \times 1 \times 1$. To obtain local channel information while keeping the number of model parameters small, a matrix $B$ is used for local channel interaction, with $B = [b_1, b_2, \ldots, b_k]$. This leads to Equation (2):

$$U_{lc} = \sum_{i=1}^{k} U \cdot b_i \tag{2}$$
where $U$ denotes the channel descriptor, $U_{lc}$ represents the local information, and $k$ signifies the number of adjacent channels. In the experiments, a one-dimensional convolution (conv1D) implements this module. To obtain global channel information and strengthen the representation of global context, a diagonal matrix $D$ is used to capture dependencies among all channels as global information, with $D = [d_1, d_2, \ldots, d_C]$. This yields Equation (3):

$$U_{gc} = \sum_{i=1}^{C} U \cdot d_i \tag{3}$$
where $U_{gc}$ denotes the global information and $C$ represents the number of channels; a two-dimensional convolution implements this module. To enable meaningful integration of global and local cues, the global features derived from the diagonal matrix are fused with the local features produced by the weight matrix. Finally, cross-correlation operations capture the correlations between them at various granularities, as shown in Equation (4):

$$M = U_{gc} \cdot U_{lc}^{\top} \tag{4}$$
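Here, $M$ represents the correlation matrix. To balance accurate feature weighting with computational efficiency, an adaptive fusion strategy is introduced. This mechanism constructs global and local weight vectors by extracting row- and column-level statistics from $M$ and its transpose, respectively, and then merges them using a learnable fusion coefficient, as formalized in Equations (5)–(7):

$$U_{gc}^{w} = \sum_{j=1}^{C} M_{i,j}, \quad i \in \{1, \ldots, C\} \tag{5}$$

$$U_{lc}^{w} = \sum_{j=1}^{C} \left(M^{\top}\right)_{i,j}, \quad i \in \{1, \ldots, C\} \tag{6}$$

$$\omega = \mathrm{sigmoid}\left(\mathrm{sigmoid}(\theta)\, \mathrm{sigmoid}\left(U_{gc}^{w}\right) + \left(1 - \mathrm{sigmoid}(\theta)\right) \mathrm{sigmoid}\left(U_{lc}^{w}\right)\right) \tag{7}$$

where $U_{gc}^{w}$ and $U_{lc}^{w}$ denote the fused global and local channel weights, respectively, $C$ is the number of channels, and $\theta$ is a learnable parameter. This design eliminates unnecessary cross-correlation computations between global and local representations while strengthening their mutual interaction. As a result, the mechanism selectively amplifies informative channels and suppresses irrelevant ones, yielding more accurate weight assignments for deblurring-related features. The resulting weights $\omega$ are then applied to the input feature map, as indicated in Equation (8), where $F$ represents the feature map, $\otimes$ denotes channel-wise multiplication and $F^{*}$ denotes the final output feature map:

$$F^{*} = \omega \otimes F \tag{8}$$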
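The channel-attention steps of Equations (1)–(8) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the local interaction is approximated by a 1D convolution over the channel descriptor, the global interaction by a per-channel (diagonal) scaling, and the fusion parameter `theta` is a fixed scalar rather than a learned one. The function names `fca_weights` and `fca_apply` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fca_weights(feat, local_kernel, global_diag, theta):
    """Channel weights for a (C, H, W) feature map.

    local_kernel: (k,) kernel over adjacent channels (assumed stand-in for the local matrix).
    global_diag:  (C,) diagonal entries (assumed stand-in for the diagonal matrix).
    theta:        scalar fusion parameter (learnable in the paper, fixed here).
    """
    u = feat.mean(axis=(1, 2))                         # global average pooling -> (C,)
    # np.convolve flips the kernel; for a symmetric kernel this equals a conv1D layer.
    u_lc = np.convolve(u, local_kernel, mode="same")   # local channel information
    u_gc = global_diag * u                             # global channel information
    m = np.outer(u_gc, u_lc)                           # correlation matrix (C, C)
    w_gc = m.sum(axis=1)                               # row statistics of M
    w_lc = m.T.sum(axis=1)                             # row statistics of M transposed
    t = sigmoid(theta)
    return sigmoid(t * sigmoid(w_gc) + (1.0 - t) * sigmoid(w_lc))

def fca_apply(feat, weights):
    """Rescale every channel of the feature map by its weight."""
    return feat * weights[:, None, None]
```

Because the fused weights pass through a final sigmoid, every channel is rescaled by a factor between 0 and 1, so the module reweights features without changing their shape.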
At the network bottleneck, features are transformed onto a spherical surface using ψ(⋅) before entering the spherical decoder. To accommodate the expanded feature domain, the decoder applies lightweight dilated convolutions, enabling a larger receptive field at low cost. Following the U-Net design principle, multi-scale skip connections are employed to maintain effective gradient flow, requiring only feature remapping through ψ(⋅). The encoder leverages a pretrained EfficientNet-B7 for 2D feature extraction, while the spherical decoder comprises five stages that upsample resolution and progressively reduce channel depth. Each layer incorporates an adaptive fine-grained channel self-attention module. To compensate for the large blank areas caused by the expanded field of view, three ResNet blocks with dilation rates of 1, 2 and 3 are embedded in each layer to enhance the receptive field. Additionally, skip connections are applied between the encoder and decoder at corresponding scales. The specific network architecture is illustrated in
Figure 3.
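As a rough check on how the dilation rates of 1, 2 and 3 enlarge the receptive field, the standard formula for a stack of stride-1 convolutions can be evaluated directly. The sketch assumes 3×3 kernels and one dilated convolution per block; neither detail is stated above, so `kernel` and `convs_per_block` are illustrative parameters.

```python
def receptive_field(dilations, kernel=3, convs_per_block=1):
    """Receptive field (in pixels, along one axis) of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        # Each convolution widens the field by (kernel - 1) * dilation.
        rf += convs_per_block * (kernel - 1) * d
    return rf
```

Under these assumptions, `receptive_field([1, 2, 3])` gives 13 pixels, compared with 7 for three undilated blocks.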
In the experiments, each 2D pixel is converted into normalized spherical coordinates. Given the vector representing the viewing ray that originates at the camera center and passes through that pixel, the corresponding spherical projection is formulated as in Equation (9). When input into the decoder, the spherical coordinates are uniformly discretized, and features are stored in a tensor covering an arbitrarily large FOV. With the above modules, given an input image, new depth views can be synthesized uniformly at different angles along an imaginary straight path.
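A pixel-to-sphere mapping of this kind can be sketched as follows. The exact form of Equation (9) is not reproduced here, so the azimuth and elevation convention, the pinhole intrinsic matrix `K` and the function names are assumptions chosen for illustration.

```python
import numpy as np

def pixel_to_sphere(u, v, K):
    """Map pixel (u, v) to the spherical angles (theta, phi) of its viewing ray."""
    x, y, z = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray through the pixel
    theta = np.arctan2(x, z)                             # azimuth (assumed convention)
    phi = np.arctan2(y, np.hypot(x, z))                  # elevation (assumed convention)
    return theta, phi

def sphere_to_grid(theta, phi, fov_h, fov_v, width, height):
    """Uniformly discretize angles (radians) into continuous grid coordinates."""
    col = (theta / fov_h + 0.5) * (width - 1)
    row = (phi / fov_v + 0.5) * (height - 1)
    return col, row
```

With this convention the principal point maps to the center of the spherical grid, and the grid can cover a FOV wider than the input image simply by choosing a larger `fov_h` and `fov_v`.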
3.2. Feature-Informed NeRF Color Prediction
In its standard formulation, NeRF models a continuous volumetric radiance field that maps a 3D location $x$ and viewing direction $d$ to two quantities: the volume density $\sigma$ and the RGB color $c$. Building upon PixelNeRF, this method learns a generalizable cross-sequence radiance field and introduces novel sampling designs for efficiently synthesizing new depth views.
The basic architecture is shown in Figure 1. Using the first frame of Sequence 1 as input, a spherical U-Net with adaptive fine-grained attention extracts a feature volume $W$. A future source frame is randomly selected, from which a set of pixels is sampled. Using the known source poses and camera intrinsics, $N$ points are efficiently sampled along the rays passing through these pixels. Each sampled point $x$ is projected onto a sphere via $\psi(\cdot)$, allowing the corresponding input image feature vector $W(\psi(x))$ to be retrieved by bilinear interpolation. These features, combined with the viewing direction $d$ and the positional encoding $\gamma(x)$, are fed into NeRF's multi-layer perceptron $f(\cdot)$ to predict the point density $\sigma$ and RGB color $c$ in the input frame coordinates, as formalized in Equation (10):

$$(\sigma, c) = f\left(W(\psi(x)),\, \gamma(x),\, d\right) \tag{10}$$
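The assembly of the MLP input described above (bilinearly interpolated feature, positional encoding and viewing direction) can be sketched as below. The number of frequencies, the inclusion of the raw coordinates in the encoding and all function names are assumptions for illustration; the MLP itself is omitted.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """NeRF-style encoding: raw x followed by sin/cos at frequencies 2^l * pi."""
    x = np.asarray(x, dtype=float)
    out = [x]
    for l in range(num_freqs):
        out.append(np.sin((2.0 ** l) * np.pi * x))
        out.append(np.cos((2.0 ** l) * np.pi * x))
    return np.concatenate(out, axis=-1)

def bilinear_sample(feat, col, row):
    """Interpolate a (C, H, W) feature volume at continuous grid position (col, row)."""
    c0, r0 = int(np.floor(col)), int(np.floor(row))
    c1 = min(c0 + 1, feat.shape[2] - 1)
    r1 = min(r0 + 1, feat.shape[1] - 1)
    wc, wr = col - c0, row - r0
    top = (1 - wc) * feat[:, r0, c0] + wc * feat[:, r0, c1]
    bottom = (1 - wc) * feat[:, r1, c0] + wc * feat[:, r1, c1]
    return (1 - wr) * top + wr * bottom

def mlp_input(feat, col, row, x, d):
    """Concatenate feature vector, positional encoding of x and viewing direction d."""
    return np.concatenate([bilinear_sample(feat, col, row), positional_encoding(x), d])
```

Here `col` and `row` stand for the discretized spherical coordinates of the projected point, so the same feature volume can be queried for points that fall outside the original image frustum.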
Following the NeRF formulation, the color $\hat{C}(r)$ is computed by numerically aggregating the radiance samples along ray $r$. Its generalized expression is provided in Equation (11):

$$\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) c_i \tag{11}$$

where $T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$, with $T_i$ denoting the cumulative transmittance and $\delta_i$ the distance to the previous adjacent point. Unlike traditional self-supervised methods, this approach extracts depth from the radiance volume and defines the estimated depth as the distance from the sampling location to the object surface.
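The numerical aggregation along a ray can be sketched as follows, using the standard NeRF quadrature. The same weights render both color and depth. The function name `render_ray` and the `near` parameter, which supplies the spacing for the first sample, are assumptions.

```python
import numpy as np

def render_ray(sigma, color, dist, near=0.0):
    """Volume-render one ray.

    sigma: (N,) densities, color: (N, 3) RGB values,
    dist:  (N,) increasing distances of the samples from the ray origin.
    """
    delta = np.diff(dist, prepend=near)          # distance to the previous point
    alpha = 1.0 - np.exp(-sigma * delta)         # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j) = prod_{j<i} (1 - alpha_j)
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha[:-1])])
    weights = trans * alpha
    rgb = weights @ color                        # aggregated color
    depth = weights @ dist                       # aggregated depth
    return rgb, depth, weights
```

For a ray that crosses empty space and then hits an opaque surface, almost all of the weight falls on the first sample at the surface, so the rendered depth matches that sample's distance.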
3.3. Monocular Depth Estimation Method via Neural Radiance Field
Analogously to the color prediction in the previous subsection, the depth $\hat{D}(r)$ estimated by NeRF is defined as in Equation (12):

$$\hat{D}(r) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) d_i \tag{12}$$
Here, $d_i$ denotes the distance between the $i$-th sampled point and the sampling location. To optimize depth without ground-truth annotations, the method follows conventional self-supervised monocular depth estimation and applies a photometric reprojection loss between the warped source image and its preceding frame (i.e., the target frame). Consecutive frames are selected to ensure maximum overlap. For the sparse depth estimate, the photometric reprojection loss is given in Equation (13), in which each 2D pixel coordinate is projected onto the source image using the camera's intrinsic parameters and poses. Although the depth obtained in this way is sparse (it is estimated only for certain rays), the randomness of these rays provides statistically dense supervision. To account for moving objects in autonomous driving scenarios, the method also applies a pixel-wise auto-masking strategy during depth estimation.
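The projection step behind a photometric reprojection loss can be sketched as follows. This is a generic pinhole formulation, not a reproduction of Equation (13): it assumes `depth` is measured along the optical axis, that `T_src_from_tgt` is a 4×4 rigid transform from the target to the source camera frame, and that a plain L1 difference stands in for the photometric error.

```python
import numpy as np

def reproject(pixel, depth, K, T_src_from_tgt):
    """Project a target-frame pixel with known depth into the source image."""
    u, v = pixel
    p_tgt = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))    # 3D point, target frame
    p_src = T_src_from_tgt[:3, :3] @ p_tgt + T_src_from_tgt[:3, 3]  # 3D point, source frame
    uvw = K @ p_src
    return uvw[:2] / uvw[2]                                       # pixel in the source image

def photometric_l1(target_colors, source_colors):
    """Mean absolute color difference between target pixels and their reprojections."""
    return np.mean(np.abs(target_colors - source_colors))
```

Since only the sampled rays have a depth estimate, the loss is evaluated on those pixels alone; the source colors would be read at the reprojected coordinates, typically by bilinear interpolation.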
To reduce the number of sampling points NeRF needs in large-scale scenarios such as autonomous driving, this paper proposes a Gaussian probability-based sampling strategy that incorporates the depth prior predicted above. The strategy models the density distribution along each ray with a one-dimensional Gaussian mixture guided by the sampled points. Because higher mixture responses typically indicate proximity to object surfaces, sampling can concentrate in the more informative regions, which lowers the number of samples required.
3.4. Gaussian Probability-Based Ray Sampling Method
Concretely, the density profile of each ray is approximated with a 1D Gaussian mixture estimated from sampled points. Because the peaks of the mixture align with likely surface locations, sampling can be focused around them, so that only 64 points are needed for a 100 m ray.
As shown in Figure 4, for a given ray $r$, a set of points (blue dots in the figure) is first sampled uniformly between the near and far bounds. Taking these points and their features as inputs, an MLP network $g(\cdot)$ predicts a mixture of $k$ 1D Gaussians $\{G_1, \ldots, G_k\}$. Then, $m$ points (square points in the figure) are sampled from each Gaussian and 32 points (triangular points in the figure) are sampled uniformly along the ray, giving a total of $k \cdot m + 32$ sampled points.
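The mixed sampling scheme can be sketched as follows. The values used in the test (four Gaussians with eight points each) are assumptions chosen only so that the total matches the 64 points per ray mentioned above; the Gaussian parameters would come from the predictor network, which is not modelled here.

```python
import numpy as np

def sample_ray_points(means, stds, m, n_uniform, near, far, rng):
    """Draw m points from each 1D Gaussian plus n_uniform evenly spaced points."""
    k = len(means)
    gauss = rng.normal(means[:, None], stds[:, None], size=(k, m)).ravel()
    gauss = np.clip(gauss, near, far)              # keep samples inside the ray bounds
    uniform = np.linspace(near, far, n_uniform)    # coverage of the whole ray
    return np.sort(np.concatenate([gauss, uniform]))
```

Sorting the merged samples by distance keeps them in the order required by volume rendering.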
The additional uniform samples force the radiance volume to be evaluated along the whole ray, which prevents $g(\cdot)$ from getting trapped in local minima. All sampled points are then fed into $f(\cdot)$ in Equation (10) for NeRF volume rendering of color $\hat{C}$ and depth $\hat{D}$. The densities $\sigma$ inferred during rendering serve as cues for 3D surface positions, from which new Gaussian mixtures can be obtained; this, however, requires solving a point-Gaussian assignment problem. This paper therefore adopts a probabilistic self-organizing map (PSOM) to address the issue, as shown in Algorithm 1. In this framework, sampling points are associated with individual Gaussian components according to the probability that each point is generated by that component, while the structure of the mixture is strictly maintained. For a Gaussian $G_j$ and its assigned point set $X_j$, the updated Gaussian $G'_j$ is the mean of all points $x_i \in X_j$, weighted by the conditional probability $p(j \mid x_i)\,\alpha_i$, where $\alpha_i$ is the occupancy of $x_i$. In the experiments, $\alpha_i$ from the original NeRF formulation is used because it serves as a sufficiently good occupancy estimator, i.e., $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$, where $\delta_i$ is the distance to the previous point.
| Algorithm 1: PSOM-based Point–Gaussian Assignment |
| Input: Sampling points $\{x_i\}$, Gaussian components $\{G_j\}$ |
| Output: Updated Gaussian components $\{G'_j\}$ |
| |
| 1: Render sampling points with NeRF to obtain densities $\sigma_i$ |
| 2: Compute occupancy $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ |
| 3: for each sampling point $x_i$ do |
| 4: for each Gaussian component $G_j$ do |
| 5: Compute conditional probability $p(j \mid x_i)$ |
| 6: end for |
| 7: end for |
| 8: for each Gaussian component $G_j$ do |
| 9: Update $G'_j$ using the weighted mean of assigned points with $p(j \mid x_i)$ and $\alpha_i$ |
| 10: end for |
| 11: Compute Gaussian consistency loss: |
| 12: $L_{\mathrm{gauss}} = \frac{1}{k} \sum_{j=1}^{k} \mathrm{KL}(G_j \parallel G'_j)$ |
| 13: Compute surface consistency loss $L_{\mathrm{surface}}$ |
| 14: $L_{\mathrm{samp}} = L_{\mathrm{gauss}} + L_{\mathrm{surface}}$ |
| 15: Update Gaussian predictor $g(\cdot)$ by minimizing $L_{\mathrm{samp}}$ |
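Steps 2–10 of Algorithm 1 can be sketched as a soft-assignment update. This is an illustrative reading rather than the authors' PSOM: it assumes equal mixture weights when computing the conditional probabilities and also re-estimates each standard deviation from the same weights, which the text above does not specify. The function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def psom_update(points, sigma, delta, means, stds, eps=1e-8):
    """Update k 1D Gaussians from N ray samples.

    points, sigma, delta: (N,) sample distances, densities and spacings.
    means, stds:          (k,) current Gaussian parameters.
    """
    alpha = 1.0 - np.exp(-sigma * delta)                      # occupancy per point
    lik = gaussian_pdf(points[None, :], means[:, None], stds[:, None])   # (k, N)
    resp = lik / (lik.sum(axis=0, keepdims=True) + eps)       # conditional probability p(j | x_i)
    w = resp * alpha[None, :]                                 # probability times occupancy
    wsum = w.sum(axis=1) + eps
    new_means = (w * points[None, :]).sum(axis=1) / wsum      # weighted mean
    new_vars = (w * (points[None, :] - new_means[:, None]) ** 2).sum(axis=1) / wsum
    return new_means, np.sqrt(new_vars + eps)
```

Points with near-zero density receive near-zero occupancy and therefore barely move the Gaussians, which pulls each component toward occupied regions along the ray. A component that attracts no occupied points degenerates in this sketch and would need special handling in practice.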
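Finally, the Gaussian predictor $g(\cdot)$ is updated by minimizing the average KL divergence between the current Gaussians $G_j$ and the revised Gaussians $G'_j$, as expressed in Equation (14):

$$L_{\mathrm{gauss}} = \frac{1}{k} \sum_{j=1}^{k} \mathrm{KL}\left(G_j \parallel G'_j\right) \tag{14}$$

To further enforce a Gaussian on visible surfaces, the method also minimizes the distance between the rendered depth and the nearest Gaussian, which gives the surface consistency loss $L_{\mathrm{surface}}$. The total sampling loss is $L_{\mathrm{samp}} = L_{\mathrm{gauss}} + L_{\mathrm{surface}}$. In the experiments, each Gaussian contributes a fixed number of sampling points and 32 further points are sampled uniformly, so that each ray requires only 64 sampling points.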
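The two sampling losses can be sketched with the closed-form KL divergence between 1D Gaussians. The direction of the divergence (current relative to revised) and the use of an absolute distance to the nearest Gaussian mean for the surface term are assumptions; the function names are hypothetical.

```python
import numpy as np

def kl_gauss_1d(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, std_p^2) || N(mu_q, std_q^2) ) in closed form."""
    return np.log(std_q / std_p) + (std_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * std_q ** 2) - 0.5

def gaussian_loss(means, stds, new_means, new_stds):
    """Average KL divergence between current and revised Gaussians."""
    return np.mean(kl_gauss_1d(means, stds, new_means, new_stds))

def surface_loss(depth, means):
    """Distance between the rendered depth and the nearest Gaussian mean."""
    return np.min(np.abs(means - depth))

def sampling_loss(means, stds, new_means, new_stds, depth):
    return gaussian_loss(means, stds, new_means, new_stds) + surface_loss(depth, means)
```

The loss is zero when the predicted Gaussians already coincide with the revised ones and one of them is centered on the rendered depth.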