Article

Enhancing Sustainable Intelligent Transportation Systems Through Lightweight Monocular Depth Estimation Based on Volume Density

1 Shandong Hi-Speed Group, Jinan 250098, China
2 State Key Lab of Intelligent Transportation System, Beijing 100191, China
3 School of Transportation Science and Engineering, Beihang University, Beijing 100191, China
4 Shandong Hi-Speed Information Group, Jinan 250098, China
* Author to whom correspondence should be addressed.
Sustainability 2025, 17(24), 11271; https://doi.org/10.3390/su172411271
Submission received: 28 September 2025 / Revised: 4 November 2025 / Accepted: 20 November 2025 / Published: 16 December 2025

Abstract

Depth estimation is a critical enabling technology for sustainable intelligent transportation systems (ITSs), as it supports essential functions such as obstacle detection, navigation, and traffic management. However, existing Neural Radiance Field (NeRF)-based monocular depth estimation methods often suffer from high computational costs and poor performance in occluded regions, limiting their applicability in real-world, resource-constrained environments. To address these challenges, this paper proposes a lightweight monocular depth estimation framework that integrates a novel capacity redistribution strategy and an adaptive occlusion-aware training mechanism. By shifting computational load from resource-intensive multi-layer perceptrons (MLPs) to efficient separable convolutional encoder–decoder networks, our method significantly reduces memory usage to 234 MB while maintaining competitive accuracy. Furthermore, a divide-and-conquer training strategy explicitly handles occluded regions, improving reconstruction quality in complex urban scenarios. Experimental evaluations on the KITTI and V2X-Sim datasets demonstrate that our approach not only achieves superior depth estimation performance but also supports real-time operation on edge devices. This work contributes to the sustainable development of ITS by offering a practical, efficient, and scalable solution for environmental perception, with potential benefits for energy efficiency, system affordability, and large-scale deployment.

1. Introduction

With advances in intelligent transportation technology, including traffic flow prediction, the limitations of environmental perception in real-time performance, accuracy, and robustness have become major obstacles to large-scale deployment [1]. Among core perception tasks, monocular depth estimation extracts 3D scene information from single images, offering advantages in cost-effectiveness and deployment flexibility [2,3,4] and enabling intelligent vehicles to perform obstacle recognition, trajectory planning, and decision-making. However, existing self-supervised monocular depth estimation methods still face three major technical challenges in urban traffic scenes. First, scale ambiguity leads to significant errors in long-distance depth estimation, with accuracy dropping by over 30% beyond 10 m. Second, geometric structure reconstruction in dynamically occluded regions is insufficient, with traditional methods showing error rates 2.5 times higher in occluded areas than in non-occluded regions. Third, scene complexity significantly degrades depth map quality. These issues directly constrain the decision reliability of intelligent transportation systems in complex road conditions, underscoring the need for more adaptive perception approaches such as few-shot learning for novel object detection.
Neural Radiance Fields (NeRF) [5], a breakthrough 3D vision technology of recent years alongside other representation paradigms such as point clouds and HD mapping, implicitly represent scene radiance fields through multi-layer perceptrons. NeRF has demonstrated the ability to accurately reconstruct complex geometric structures in view synthesis tasks, offering a disruptive technical path for the inherent challenges of monocular depth estimation. However, current NeRF-based depth estimation methods face two application barriers. First, the uniform sampling strategy adopted by traditional NeRF requires over 200 sampling points to maintain accuracy on long rays exceeding 100 m in autonomous driving scenes, leading to a sharp growth in computational resource demands: single-scene training memory usage can exceed 24 GB, which is difficult to accommodate on edge computing devices. Second, existing methods rely on single-scene independent training and show error rates exceeding 40% when models trained on indoor scenes are applied to urban road scenes, failing to meet the cross-scene deployment requirements of autonomous driving. Systematic innovation in sampling mechanisms and network architectures is therefore urgently needed.
Motivated by the engineering requirements of large-scale complex scenes in autonomous driving, this research proposes a NeRF-based monocular depth estimation method that combines Gaussian probability sampling with adaptive channel attention. To improve computational efficiency, we employ a one-dimensional Gaussian mixture model to represent the ray density distribution and integrate depth priors to guide sampling points toward object surfaces, effectively reducing the number of samples on long-distance rays while decreasing computational overhead and memory usage. To enhance generalization, we introduce a fine-grained adaptive channel attention network based on spherical projection, which achieves significantly lower depth estimation errors in unseen environments through dynamic fusion of global context and local features. In practical deployment scenarios, the proposed method achieves accurate long-distance depth estimation on the KITTI benchmark within the NeRF framework, and its occlusion-region reconstruction shows substantial improvement over traditional methods, providing critical support for scaled application of autonomous driving visual perception systems in complex scenes.
From a sustainability perspective, our lightweight depth estimation framework directly contributes to several United Nations Sustainable Development Goals (SDGs), particularly SDG 11 (Sustainable Cities and Communities) and SDG 9 (Industry, Innovation and Infrastructure). By enabling accurate environmental perception with significantly reduced computational requirements (234 MB of memory versus more than 1 GB for traditional methods), our approach facilitates the deployment of energy-efficient ITS infrastructure in resource-constrained urban environments, while also supporting reliable vehicular digital forensics and data integrity auditing. The reduced energy consumption during both training and inference supports greener AI deployment in transportation systems, and the improved occlusion handling enhances pedestrian and cyclist safety in complex urban scenarios. Furthermore, the method's compatibility with edge computing devices eliminates the need for continuous cloud connectivity, reducing network energy overhead and enabling scalable deployment in developing regions where high-bandwidth infrastructure may be limited. The computational efficiency of our approach also benefits broader infrastructure monitoring applications, such as automated pavement distress analysis and degradation trend prediction, contributing to more sustainable maintenance of transportation infrastructure. These sustainability advantages, combined with the technical improvements, have significant implications for promoting the localization and cost reduction of intelligent connected vehicle perception systems.
As demonstrated in Table 1, our method achieves superior performance across multiple dimensions, particularly excelling in computational efficiency and practical deployment capabilities.
The main contributions of this work can be summarized as follows:
  • We propose a novel capacity redistribution strategy that achieves a lightweight NeRF architecture.
  • We introduce a CMUNeXt module that utilizes large-kernel depth separable convolutions and an inverted bottleneck design for robust scene understanding in autonomous driving scenarios.
  • We develop an adaptive sampling strategy using one-dimensional Gaussian mixture models, effectively reducing long-distance ray sample numbers while preserving geometric accuracy in large-scale scenes.
  • We propose a comprehensive multi-component loss function with specialized handling of occluded regions, achieving significant improvements in depth estimation quality for challenging areas where traditional methods typically fail.
  • We conduct extensive experiments on both KITTI and V2X-Sim datasets, demonstrating superior performance in occlusion handling and long-range depth estimation while maintaining real-time deployment capabilities suitable for edge computing devices.

2. Related Work

2.1. Supervised Learning-Based Monocular Depth Estimation

Deep learning-based monocular depth estimation aims to infer depth maps from single RGB images using deep neural networks [6,7,8]. Under the supervised paradigm, depth estimation networks are trained with ground truth depth maps to minimize the discrepancy between predicted and actual depth values. Recent advances have significantly improved both the accuracy and efficiency of depth estimation models. A recent work proposed a method for unsupervised monocular depth learning from unknown cameras in wild videos, demonstrating robust performance in diverse environmental conditions [6]. Another study introduced 3D packing techniques for self-supervised monocular depth estimation, achieving state-of-the-art results on standard benchmarks [7]. The work by Piccinelli et al., iDisc, extends the concept of depth discretization from the output space to the internal feature space [9]. The proposed internal discretization module adaptively decomposes scenes into a set of high-level concepts, achieving leading performance on multiple benchmarks [9]. Zhang et al. proposed Lite-Mono [10], a lightweight hybrid architecture that extracts multi-scale local features through sequential dilated convolution modules and encodes global context via an efficient cross-covariance attention mechanism, maintaining high accuracy while significantly reducing parameter count. Marigold introduces a supervised fine-tuning approach based on Stable Diffusion, which converts image generation priors into high-quality monocular depth estimation capabilities through synthetic RGB-D data, demonstrating exceptional zero-shot generalization performance [11]. The field has also seen advancements in mobile deployment and real-time applications. UniDepth achieves strong generalization capability without requiring camera parameters via its self-prompting camera module and pseudo-spherical output representation, demonstrating excellent performance in zero-shot testing [12]. ZoeDepth proposes a dual-stage supervised learning approach that combines relative depth pretraining with metric depth fine-tuning. By introducing a lightweight “metric bins module” and an automatic routing mechanism, it significantly enhances accuracy and generalization capability in cross-domain zero-shot depth estimation [13]. Watson et al.’s work on self-supervised monocular depth hints provides valuable insights into leveraging geometric constraints for improved depth prediction [14]. Recent research has further explored the integration of conditional random fields and neural window techniques. Bartolomei, L. et al. proposed a cross-modal distillation paradigm that leverages visual foundation models (e.g., Depth Anything v2) on RGB images to generate high-quality proxy depth labels for event camera data, thereby enabling supervised training of event-based depth estimation networks [15]. These developments are complemented by work in related areas such as HD map construction, where Li et al. demonstrated efficient online mapping using linear priors [16]. Despite these advancements, supervised methods still face challenges in generalization to unseen environments and reliance on expensive ground truth data. The requirement for precise depth annotations, typically obtained through LiDAR systems, remains a significant bottleneck for large-scale deployment. These limitations have motivated increasing interest in self-supervised and semi-supervised approaches that can leverage unlabeled data while maintaining competitive performance.

2.2. Self-Supervised Learning-Based Monocular Depth Estimation

Self-supervised monocular depth estimation typically uses two types of training inputs: stereo image pairs from fixed-baseline stereo cameras, or monocular image sequences captured over time. However, during inference, these models rely only on single images. Most methods estimate depth by establishing pixel-level correspondences between image pairs.
Garg et al. first proposed using self-supervised methods for monocular depth estimation [17]. Inspired by this work, Godard et al. trained networks using left-right view reconstruction and introduced left-right consistency constraints [18]; subsequently, Xie et al. enhanced the reconstruction process by adding selective layers [19]. Other approaches employed global-to-local feature extraction networks during the feature extraction phase [20]. Meanwhile, some researchers introduced siamese network architectures to improve feature learning from paired images [21], incorporated visual odometry into the training process [22], and utilized depth cues to enhance image matching [23]. Further advancements include Zhao et al., who optimized camera poses by obtaining coarse initial estimates through multi-view geometry combined with an iterative self-distillation mechanism to address perceptual overfitting [24]. Additionally, Wu et al. modeled 3D scenes as a collection of planes rather than individual points, proposing a Structured Plane Generation Module and a Depth Discontinuity-Aware Module that aggregate semi-global context cues to construct matching cost volumes; these volumes are then optimized through 3D convolution modules with intermediate supervision, and the overall network is trained using top-down or bottom-up strategies [25].
While self-supervised learning alleviates the dependency on expensive ground truth depth data for model training, the accuracy of inferred depth maps is typically lower than supervised methods, mainly due to the lack of explicit ground truth depth supervision.

2.3. NeRF-Based Monocular Depth Estimation Methods

Since its introduction, NeRF [26] has attracted widespread attention in the computer vision field, demonstrating powerful capabilities in 3D scene reconstruction and novel view synthesis and providing a new paradigm for monocular depth estimation. However, monocular images lack direct depth measurements, leading to challenges such as scale ambiguity and inaccurate depth recovery, which are especially pronounced in occluded regions.
Early NeRF methods primarily targeted static small-scale scene reconstruction, using multi-layer perceptrons (MLPs) to implicitly represent scene radiance fields, and optimizing network parameters through volume rendering of multi-view images to generate high-detail 3D reconstruction results. However, traditional NeRF technology heavily relies on accurate camera pose estimation and requires numerous input views, resulting in high computational costs, limiting its application in monocular depth estimation. To address these limitations, extensive research has proposed extensions and enhancements to the NeRF framework.
To improve computational efficiency, methods like Instant-NGP [27] employ hash encoding and sparse voxelization techniques to accelerate query processing, reducing NeRF training and inference time, making real-time deployment of NeRF in monocular depth estimation tasks possible. Regarding the combination of NeRF and monocular depth estimation, many works use depth maps generated by monocular networks as geometric priors to guide and improve the NeRF optimization process. For example, NoPe-NeRF [28] integrates monocular depth maps into Pose-NeRF’s joint optimization, enforcing multi-view geometric consistency by correcting scale and offset, thereby improving pose estimation and NeRF optimization quality, enhancing reconstruction effects and pose estimation accuracy under complex camera trajectories, reducing NeRF’s dependency on accurate camera poses.
Although NeRF-based monocular depth estimation has made significant progress, challenges remain: limited model adaptability to complex dynamic environments, especially insufficient performance when dealing with fast-moving objects or variable lighting conditions; weak generalization capability, often showing significant performance degradation across different scenes and datasets. Addressing these issues in autonomous driving scenarios, we propose a NeRF framework integrating Gaussian probability ray sampling strategies and adaptive channel attention mechanisms, achieving efficient depth estimation and robust scene understanding in large-scale autonomous driving environments through optimized ray sampling distribution and enhanced feature fusion networks.
Beyond depth estimation itself, the perception outputs are tightly coupled with downstream tasks such as large-scale traffic signal control and online HD map construction. Recent advances in cooperative multi-agent reinforcement learning show that explicit neighborhood backtracking can improve network-wide traffic signal control stability and efficiency under cooperative settings. Meanwhile, to enable efficient HD map construction on resource-constrained platforms, cross-modal distillation paradigms have been proposed to transfer LiDAR priors into camera-only models, and linear-prior based designs further reduce the effort for online mapping. Our lightweight, occlusion-aware monocular depth estimation complements these efforts by providing reliable geometry with low memory footprint, which is beneficial for scalable ITS deployment and real-time map maintenance.

3. Method

This section presents our lightweight NeRF-based monocular depth estimation framework designed specifically for autonomous driving applications. Our approach addresses the fundamental challenge of balancing computational efficiency with depth estimation accuracy by introducing a novel architecture that redistributes computational resources from traditional NeRF MLPs to more efficient encoder–decoder networks.

3.1. Framework Overview

The proposed method consists of three main components working in synergy: (1) a separable convolution-based encoder–decoder network for robust feature extraction, (2) a lightweight NeRF module for volume density prediction, and (3) an adaptive training strategy with multi-component loss functions for handling occlusion scenarios. Unlike conventional NeRF approaches that rely heavily on high-capacity MLPs for both color and density prediction, our framework strategically shifts computational burden to the encoder–decoder stage while maintaining the geometric representation capabilities of neural radiance fields.
The core innovation lies in the capacity redistribution strategy: reducing the complexity of MLPs in NeRF while simultaneously enhancing the feature extraction capabilities through separable convolution-based encoder–decoder networks. This design paradigm not only significantly reduces model memory usage but also enables the model to learn more accurate overall geometric shapes through global context information. The separable convolution-based global information extraction module can capture long-range dependencies during model training, leading to more precise depth information. Compared to high-capacity MLPs, our powerful feature extractors demonstrate superior generalization capabilities when processing single images from unseen scenes, which is crucial for real-world autonomous driving deployment.

3.2. Separable Convolution-Based Encoder–Decoder Network

The architecture design philosophy centers on redistributing computational complexity from NeRF’s traditional 8-layer MLP networks to a more efficient encoder–decoder framework, as illustrated in Figure 1. Traditional NeRF approaches allocate substantial computational resources to volume density prediction through deep MLPs, resulting in significant memory overhead that prohibits deployment on resource-constrained edge devices commonly found in autonomous vehicles. Our approach addresses this limitation by proposing a lightweight MLP architecture consisting of only two fully connected layers with a hidden dimension of 64 and ReLU activation functions.

3.2.1. Architecture Design Principles

The key insight driving our design is the observation that volume density prediction requires less representational complexity compared to color synthesis, as density fields typically exhibit smoother spatial variations than high-frequency color details. By eliminating color prediction from the MLP network, we enable the lightweight architecture to focus exclusively on volume density estimation while maintaining prediction accuracy. This specialization allows for dramatic parameter reduction without compromising geometric understanding capabilities.
Conversely, we significantly enhance the encoder–decoder structure to serve as a powerful feature extraction backbone. This capacity redistribution strategy leverages the complementary strengths of both architectures: the encoder–decoder excels at capturing global context and spatial relationships through hierarchical feature learning, while the simplified MLP maintains the flexibility of implicit neural representations for geometric modeling.

3.2.2. Encoder Architecture and Design

Our U-Net encoder employs a ResNet-50 backbone pre-trained on ImageNet, providing robust feature extraction capabilities that have been proven effective across diverse visual tasks. The choice of ResNet-50 strikes an optimal balance between representational power and computational efficiency, offering sufficient depth for complex scene understanding while remaining deployable on edge hardware. The pre-training on ImageNet provides beneficial inductive biases for natural scene understanding, which translates well to autonomous driving scenarios.
The decoder architecture follows the MonoDepth2 framework with strategic modifications to preserve spatial information throughout the upsampling process. Specifically, we maintain feature map resolution consistency by outputting feature maps with channel dimensions $C_{\text{out}} = \alpha \cdot C_{\text{mono}}$ at each layer, where $C_{\text{mono}}$ represents the output channel dimension of the original MonoDepth2 model and $\alpha$ is a scaling factor that allows for capacity adjustment based on computational constraints. This design prevents information loss during subsequent up-convolution operations, ensuring that fine-grained spatial details essential for accurate depth estimation are preserved throughout the network.

3.2.3. Multi-Scale Feature Integration

The encoder incorporates separable convolution modules strategically positioned to extract global information from each channel at multiple scales. The hierarchical encoder–decoder structure, illustrated in Figure 2, consists of four progressive layers implementing a two-stage processing paradigm: an encoder stage featuring global information extraction through separable convolutions, followed by a decoder stage for feature reconstruction and refinement.
During the encoding phase, our global information extraction modules based on separable convolutions capture contextual information at different semantic levels, from low-level edge and texture features to high-level object and scene understanding. Subsequently, standard convolutions expand channel dimensions to accommodate the enhanced representational requirements of the integrated feature space. This dual-pathway approach ensures comprehensive feature representation while maintaining computational efficiency through the use of separable convolutions in the computationally intensive global context extraction operations.

3.2.4. Global Context Information Extraction

Global context information extraction represents a critical component for achieving robust monocular depth estimation in autonomous driving scenarios, where understanding both local geometric details and global scene structure is essential for safe navigation. The challenge lies in capturing long-range spatial dependencies while maintaining computational efficiency suitable for real-time applications on edge devices.

Motivation and Design Rationale

While recent Transformer-based architectures have demonstrated impressive capabilities in various computer vision tasks, their application to monocular depth estimation in resource-constrained environments presents significant challenges. Transformer networks typically require substantial computational resources and lack the inductive biases inherent to convolutional operations that are particularly beneficial for processing structured visual data. Conversely, traditional CNNs excel at capturing local spatial patterns but struggle with long-range dependency modeling, which is crucial for understanding global scene geometry in autonomous driving contexts.
Our approach addresses this limitation by drawing inspiration from recent advances in efficient architecture design, particularly ConvMixer and ConvNeXt, to develop the CMUNeXt (Convolution-Mixed U-Net Extension) module. This design philosophy emphasizes the strategic use of separable convolutions to achieve the global receptive field benefits of Transformers while preserving the computational efficiency and inductive biases of convolutional architectures.

CMUNeXt Module Architecture

The CMUNeXt block employs a sophisticated three-stage processing pipeline optimized for global context extraction with minimal computational overhead. The core innovation lies in the strategic combination of large-kernel depth convolutions with pointwise convolutions arranged in an inverted bottleneck configuration, enabling comprehensive spatial and channel information mixing with significantly fewer parameters than traditional approaches.
The module architecture, illustrated in Figure 3, utilizes depth separable convolutions as the fundamental building blocks. This design choice provides multiple advantages: (1) dramatic parameter reduction compared to standard convolutions, (2) enhanced computational efficiency through factorized operations, and (3) improved feature learning through separate spatial and channel-wise processing pathways. The large-kernel depth convolutions capture extensive spatial context, while the subsequent pointwise operations facilitate sophisticated feature combination across channels.

Technical Implementation and Advantages

The implementation of depth separable convolutions in our CMUNeXt module yields substantial computational advantages over traditional convolution operations. Compared to standard convolutions, depth convolutions reduce both parameter count and computational complexity by factorizing the convolution operation into spatial and channel-wise components. This factorization is particularly beneficial in our context, as it allows for efficient processing of high-resolution feature maps typical in depth estimation tasks while maintaining the spatial resolution necessary for precise depth prediction.
Within the CMUNeXt module, large-kernel depth convolutions (typically 7 × 7 or 9 × 9) extract comprehensive global information from each channel independently, establishing long-range spatial dependencies essential for understanding scene geometry. These operations are immediately followed by residual connections to ensure gradient flow and facilitate training of deeper networks. The large kernel size enables each pixel to aggregate information from a substantially larger spatial neighborhood compared to traditional 3 × 3 convolutions, effectively increasing the receptive field without proportional increases in computational cost.
The channel mixing stage employs two pointwise convolutions arranged in an inverted bottleneck configuration, a design pattern that has proven highly effective in mobile-oriented architectures like MobileNetV2 and later refined in ConvNeXt. This configuration expands the hidden dimension to four times the input dimension between the two pointwise layers, creating a high-dimensional intermediate representation that facilitates comprehensive feature mixing. The expanded hidden dimension serves as a computational bottleneck that allows for rich feature interactions while maintaining efficiency through the use of 1 × 1 convolutions.
The activation strategy incorporates GELU (Gaussian Error Linear Unit) functions rather than traditional ReLU, providing smoother activation profiles that have been shown to improve training dynamics in vision transformers and modern CNN architectures. Batch normalization layers are strategically placed after each convolution operation to stabilize training and improve convergence properties. The complete module operation can be mathematically formulated as follows:
$$Y_1 = \mathrm{DWConv}(X) + X$$
$$Y_2 = \mathrm{GELU}\big(\mathrm{BN}(\mathrm{PWConv}_1(Y_1))\big)$$
$$Y = \mathrm{PWConv}_2(Y_2) + Y_1$$
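For concreteness, the following PyTorch sketch shows one way a block of this form could be implemented. The layer names, the default 7 × 7 kernel, and the exact placement of batch normalization are illustrative assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn

class CMUNeXtBlock(nn.Module):
    """Sketch of one CMUNeXt-style block: a large-kernel depthwise convolution with
    a residual connection (Y1 = DWConv(X) + X), followed by an inverted-bottleneck
    channel MLP built from two pointwise convolutions
    (Y2 = GELU(BN(PWConv1(Y1))), Y = PWConv2(Y2) + Y1)."""

    def __init__(self, channels: int, kernel_size: int = 7, expansion: int = 4):
        super().__init__()
        # Depthwise (per-channel) convolution with a large kernel for global spatial context.
        self.dwconv = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
        self.bn1 = nn.BatchNorm2d(channels)
        # Inverted bottleneck: expand channels by 4x, mix, then project back.
        self.pwconv1 = nn.Conv2d(channels, channels * expansion, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(channels * expansion)
        self.pwconv2 = nn.Conv2d(channels * expansion, channels, kernel_size=1)
        self.bn3 = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1 = self.bn1(self.dwconv(x)) + x          # spatial mixing with residual
        y2 = self.act(self.bn2(self.pwconv1(y1)))  # channel expansion and activation
        return self.bn3(self.pwconv2(y2)) + y1     # projection with residual
```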
To better predict volume density at sampling points using only a lightweight NeRF network with a two-layer MLP, this research predicts dense feature maps through the separable convolution-based encoder–decoder network and locally conditions the volume density predicted by NeRF within the camera frustum on the extracted feature map of the entire scene.

3.3. Lightweight NeRF-Based Volume Density Prediction

The integration of our enhanced encoder–decoder network with a lightweight NeRF architecture represents a fundamental paradigm shift in neural radiance field design for depth estimation applications. This section details the technical implementation of volume density prediction, sampling strategies, and the mathematical framework underlying our approach.

3.3.1. Feature-Conditioned Density Estimation

Our method leverages the dense feature maps generated by the separable convolution encoder–decoder network to provide rich contextual information for volume density prediction. Unlike traditional NeRF approaches that rely solely on positional encoding, our framework incorporates pixel-aligned features that capture both local geometric details and global scene context, significantly enhancing the model’s understanding of spatial relationships and scene structure.
The feature conditioning process begins with the generation of a pixel-aligned feature map $F \in \mathbb{R}^{H \times W \times C}$ from the input image $I$, where $H$, $W$, and $C$ represent the spatial dimensions and feature channel count, respectively. This feature map serves as a comprehensive scene descriptor that encodes multi-scale information ranging from low-level edges and textures to high-level semantic understanding. For each ray $\mathbf{r}$ emanating from the camera center through pixel $(u, v)$, we establish correspondence between 3D sampling points and 2D feature representations.

3.3.2. Adaptive Sampling Strategy

The sampling strategy plays a crucial role in determining both computational efficiency and reconstruction quality. Our approach employs an adaptive sampling scheme that concentrates sampling points in regions likely to contain geometric boundaries while maintaining sufficient coverage for accurate volume rendering. For each ray, we sample N points using a combination of uniform and importance-based sampling:
$$x_i = \mathbf{o} + t_i \mathbf{d}, \qquad t_i \sim \mathcal{U}(t_{\text{near}}, t_{\text{far}}) + \mathcal{N}(0, \sigma_{\text{noise}})$$
where $\mathbf{o}$ represents the camera origin, $\mathbf{d}$ is the normalized ray direction, $t_{\text{near}}$ and $t_{\text{far}}$ define the near and far bounds of the sampling range, and $\mathcal{N}(0, \sigma_{\text{noise}})$ introduces controlled noise to improve training stability and reduce artifacts.
For each sampling point $x_i$, we extract the corresponding feature vector through bilinear interpolation: $f_i = \mathrm{BilinearSample}(F, (u, v))$. This operation ensures smooth feature transitions and maintains spatial consistency across neighboring rays.
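A minimal sketch of this sampling and feature-lookup step, for a single ray and a feature map of shape (C, H, W), might look as follows; the function names and the jitter magnitude are hypothetical.

```python
import torch
import torch.nn.functional as F

def sample_ray_points(origin, direction, t_near, t_far, n_samples, sigma_noise=0.01):
    """Sample points along a ray as x_i = o + t_i * d, with t_i drawn uniformly in
    [t_near, t_far] and perturbed by small Gaussian noise for training stability."""
    t = torch.rand(n_samples) * (t_far - t_near) + t_near
    t = (t + torch.randn(n_samples) * sigma_noise).clamp(t_near, t_far)
    t, _ = torch.sort(t)
    points = origin[None, :] + t[:, None] * direction[None, :]  # (n_samples, 3)
    return points, t

def sample_pixel_feature(feature_map, u, v):
    """Bilinearly interpolate the pixel-aligned feature vector f_i at pixel (u, v).
    feature_map: (C, H, W); u, v are pixel coordinates of the ray's image location."""
    C, H, W = feature_map.shape
    # grid_sample expects normalized (x, y) coordinates in [-1, 1].
    grid = torch.tensor([[[[2.0 * u / (W - 1) - 1.0, 2.0 * v / (H - 1) - 1.0]]]],
                        dtype=feature_map.dtype)
    feat = F.grid_sample(feature_map[None], grid, mode="bilinear", align_corners=True)
    return feat.view(C)
```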

3.3.3. Gaussian Sampling Model Theory

The theoretical foundation of our Gaussian sampling model is rooted in probability theory and Bayesian inference, providing a principled approach to ray sampling in neural radiance fields. The core insight is that the volume density distribution along a ray can be effectively modeled as a mixture of Gaussian distributions, each representing potential surface locations in 3D space.

Mathematical Formulation

Let $t$ represent the distance along a ray, and let $\sigma(t)$ denote the volume density at distance $t$. We model the probability of ray termination at distance $t$ using a Gaussian mixture model:
$$p(t) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(t \mid \mu_k, \sigma_k^2)$$
where $K$ is the number of Gaussian components, $\pi_k$ are the mixing coefficients satisfying $\sum_{k=1}^{K} \pi_k = 1$, and $\mathcal{N}(t \mid \mu_k, \sigma_k^2)$ represents the Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$.
The mixing coefficients π k are determined by the initial depth predictions from our encoder–decoder network, providing geometric priors that guide the sampling process toward regions likely to contain surfaces.
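As a concrete illustration, the sketch below draws ray-distance samples from such a mixture, with the component means taken from the network's depth prior; the specific numbers are made up for the example.

```python
import torch

def gmm_ray_samples(mus, sigmas, pis, n_samples, t_near, t_far):
    """Draw samples of the ray distance t from a 1-D Gaussian mixture
    p(t) = sum_k pi_k * N(t | mu_k, sigma_k^2)."""
    component = torch.distributions.Categorical(probs=pis).sample((n_samples,))
    t = torch.normal(mus[component], sigmas[component])  # sample from the chosen Gaussians
    t = t.clamp(t_near, t_far)                            # keep samples inside the frustum
    return torch.sort(t).values

# Example: a confident surface prior at 12 m and a weaker secondary mode at 30 m.
mus = torch.tensor([12.0, 30.0])
sigmas = torch.tensor([0.5, 2.0])
pis = torch.tensor([0.8, 0.2])
t_samples = gmm_ray_samples(mus, sigmas, pis, n_samples=64, t_near=0.1, t_far=80.0)
```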

Bayesian Interpretation

From a Bayesian perspective, our sampling strategy can be viewed as importance sampling where the proposal distribution is the Gaussian mixture model. The optimal sampling distribution $q(t)$ for estimating the volume rendering integral should be proportional to the product of transmittance and density:
$$q(t) \propto T(t) \cdot \sigma(t)$$
Our Gaussian mixture model serves as an approximation to this optimal distribution, with the means $\mu_k$ initialized near predicted surface locations and the variances $\sigma_k^2$ encoding the uncertainty in these predictions.

Theoretical Advantages

The Gaussian sampling model offers several theoretical advantages over uniform or coarse-to-fine sampling strategies:
  • Convergence Guarantees: The use of importance sampling with Gaussian proposals ensures faster convergence of the volume rendering integral estimation, as samples are concentrated in regions with non-negligible contributions to the final result.
  • Uncertainty Quantification: The variance parameters $\sigma_k^2$ naturally encode the uncertainty in depth predictions, allowing adaptive sampling density based on prediction confidence.
  • Multi-modal Representation: The mixture model formulation enables representation of multiple potential surface locations along a single ray, which is particularly beneficial in semi-transparent or reflective regions.

Connection to Volume Rendering

The Gaussian sampling model is tightly integrated with the volume rendering equation. The expected depth $\mathbb{E}[t]$ along a ray can be expressed as
$$\mathbb{E}[t] = \int_{t_{\text{near}}}^{t_{\text{far}}} T(t) \cdot \sigma(t) \cdot t \, dt \approx \sum_{i=1}^{N} w_i t_i$$
where the weights $w_i = T_i \alpha_i$ are computed from the sampled points, and the Gaussian mixture model ensures that the sampling points $t_i$ are strategically placed to accurately estimate this expectation.
This theoretical framework provides the mathematical justification for our adaptive sampling strategy, demonstrating how Gaussian distributions naturally model the uncertainty and multi-modal nature of depth predictions in complex 3D scenes.

3.3.4. Volume Density Prediction Network

The lightweight MLP network $g_\theta$ represents a carefully designed architecture optimized for volume density prediction while maintaining minimal computational overhead. The network architecture consists of two fully connected layers with a hidden dimension of 64, strategically chosen to balance representational capacity with memory efficiency:
$$\sigma_i = g_\theta\big(\gamma(x_i), f_i\big)$$
The positional encoding function $\gamma(\cdot)$ transforms 3D coordinates into a higher-dimensional representation capable of capturing high-frequency geometric details:
$$\gamma(x) = \big[\sin(2^0 \pi x), \cos(2^0 \pi x), \ldots, \sin(2^{L-1} \pi x), \cos(2^{L-1} \pi x)\big]$$
where $L$ represents the number of frequency levels. This encoding enables the network to learn complex geometric patterns while maintaining the lightweight architecture constraints.
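The sketch below illustrates the frequency encoding and a two-layer density head of this kind; the hidden width of 64 follows the text, while the input dimensions and the non-negativity activation are assumptions.

```python
import torch
import torch.nn as nn

def positional_encoding(x: torch.Tensor, L: int = 10) -> torch.Tensor:
    """gamma(x): map coordinates to [sin(2^l * pi * x), cos(2^l * pi * x)] for l = 0..L-1.
    x: (N, 3) -> (N, 6L)."""
    freqs = (2.0 ** torch.arange(L, dtype=x.dtype, device=x.device)) * torch.pi  # (L,)
    angles = x[..., None] * freqs                                    # (N, 3, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (N, 3, 2L)
    return enc.flatten(start_dim=-2)                                 # (N, 6L)

class DensityMLP(nn.Module):
    """Two fully connected layers (hidden dim 64, ReLU) mapping the encoded position
    and the pixel-aligned feature f_i to a non-negative volume density."""
    def __init__(self, pos_dim: int = 60, feat_dim: int = 64, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pos_dim + feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, pos_enc: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.net(torch.cat([pos_enc, feat], dim=-1)))
```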

3.3.5. Volume Rendering and Depth Integration

The volume rendering process integrates density predictions along each ray to produce final depth estimates. The depth value for each ray is computed through numerical integration:
$$\hat{D} = \sum_{i=1}^{N} T_i \alpha_i t_i$$
where the accumulated transmittance $T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$ models the probability that a ray reaches point $i$ without being blocked by preceding geometry, the opacity $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ represents the probability of ray termination at point $i$, and $\delta_i = t_{i+1} - t_i$ denotes the distance between consecutive sampling points.
This formulation ensures physically plausible depth estimates while maintaining differentiability for end-to-end training. The integration process effectively aggregates density information along each ray, producing smooth depth maps that capture both geometric boundaries and gradual depth transitions essential for autonomous driving applications.
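A minimal implementation of this integration for a single ray is sketched below; it uses the standard cumulative-product form of the transmittance, which is algebraically equivalent to the exponential-sum definition above, and pads the final interval with a large value so the last sample can absorb the remaining ray weight.

```python
import torch

def render_depth(sigma: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Integrate densities along one ray into an expected depth
    D_hat = sum_i T_i * alpha_i * t_i.
    sigma, t: (N,) tensors, with t sorted in increasing order."""
    delta = torch.cat([t[1:] - t[:-1], torch.full_like(t[:1], 1e10)])
    alpha = 1.0 - torch.exp(-sigma * delta)                 # per-sample opacity
    # Transmittance T_i: probability the ray survives all samples before i.
    survive = torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha + 1e-10])
    trans = torch.cumprod(survive, dim=0)[:-1]
    weights = trans * alpha
    return (weights * t).sum()
```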

3.4. Training Strategy and Loss Functions

The training strategy for our lightweight NeRF framework addresses the unique challenges of monocular depth estimation in autonomous driving scenarios, particularly the handling of occlusion regions and the requirement for robust performance across diverse environmental conditions. Our approach employs a comprehensive multi-component loss function coupled with adaptive training strategies designed to maximize depth estimation accuracy while maintaining computational efficiency.

3.4.1. Comprehensive Loss Function Design

Our training objective combines multiple complementary loss terms to address different aspects of depth estimation quality and geometric consistency:
$$\mathcal{L} = \lambda_d \mathcal{L}_d + \lambda_s \mathcal{L}_s + \lambda_n \mathcal{L}_n + \lambda_o \mathcal{L}_o + \lambda_c \mathcal{L}_c$$
where $\mathcal{L}_d$ represents the depth reconstruction loss, $\mathcal{L}_s$ denotes the smoothness regularization, $\mathcal{L}_n$ captures surface normal consistency, $\mathcal{L}_o$ handles occlusion-aware training, and $\mathcal{L}_c$ enforces color consistency. The weighting parameters $\lambda_d$, $\lambda_s$, $\lambda_n$, $\lambda_o$, and $\lambda_c$ are dynamically adjusted during training to balance the contribution of each component based on the current training phase and data characteristics.

Depth Reconstruction Loss

The primary supervision signal comes from the depth reconstruction loss, which measures the discrepancy between predicted and ground truth depth values:
$$\mathcal{L}_d = \frac{1}{|V|} \sum_{(u,v) \in V} \rho\big(|\hat{D}(u,v) - D(u,v)|\big)$$
where $V$ represents the set of valid pixels with ground truth depth $D$, and $\rho(\cdot)$ is a robust loss function that reduces the influence of outliers. We employ the Huber loss for $\rho(\cdot)$:
$$\rho(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le \delta \\ \delta\left(|x| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$
This formulation provides an L2 penalty for small errors while transitioning to an L1 penalty for large deviations, improving robustness against the noisy ground truth data common in LiDAR-based depth annotations.
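A compact version of this loss, assuming dense ground-truth maps with invalid pixels marked as zero, is sketched below; the Huber threshold value is illustrative.

```python
import torch
import torch.nn.functional as F

def depth_reconstruction_loss(pred: torch.Tensor, gt: torch.Tensor, delta: float = 1.0):
    """Huber-penalized depth error averaged over valid pixels (gt > 0), standing in
    for the set V in the loss definition above. pred, gt: (H, W) depth maps."""
    valid = gt > 0
    if valid.sum() == 0:
        return pred.new_zeros(())
    # F.huber_loss matches rho(x): quadratic for |x| <= delta, linear beyond.
    return F.huber_loss(pred[valid], gt[valid], delta=delta, reduction="mean")
```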

Smoothness Regularization

To encourage locally smooth depth predictions while preserving geometric boundaries, we implement an edge-aware smoothness loss:
$$\mathcal{L}_s = \frac{1}{|P|} \sum_{(u,v) \in P} \big|\nabla \hat{D}(u,v)\big| \cdot e^{-|\nabla I(u,v)|}$$
where $\nabla \hat{D}$ represents the spatial gradient of the predicted depth, $\nabla I$ denotes the image gradient, and $P$ encompasses all pixel locations. This formulation promotes smoothness in regions with low image gradients while allowing sharp depth transitions at image edges, preserving important geometric boundaries.
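A straightforward sketch of this edge-aware term with first-order finite differences is shown below, assuming batched tensors.

```python
import torch

def edge_aware_smoothness(depth: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """Penalize depth gradients, down-weighted where the image gradient is strong:
    |grad D| * exp(-|grad I|). depth: (B, 1, H, W); image: (B, 3, H, W)."""
    dD_dx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    dD_dy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    dI_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    dI_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    return (dD_dx * torch.exp(-dI_dx)).mean() + (dD_dy * torch.exp(-dI_dy)).mean()
```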

Surface Normal Consistency

Geometric consistency is enforced through a surface normal consistency loss that ensures predicted depth maps exhibit realistic 3D surface properties:
$$\mathcal{L}_n = \frac{1}{|N|} \sum_{(u,v) \in N} \left(1 - \frac{\mathbf{n}_{pred}(u,v) \cdot \mathbf{n}_{gt}(u,v)}{\|\mathbf{n}_{pred}(u,v)\| \, \|\mathbf{n}_{gt}(u,v)\|}\right)$$
where $\mathbf{n}_{pred}$ and $\mathbf{n}_{gt}$ represent predicted and ground truth surface normals, respectively, computed from depth maps using finite differences. This loss term encourages geometric consistency and helps avoid artifacts in reconstructed surfaces.
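The sketch below shows one simple way to derive normals with finite differences and compare them by cosine similarity; it ignores camera intrinsics and treats pixel spacing as unit length, which is a simplification of whatever normal-estimation procedure is used in practice.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth: torch.Tensor) -> torch.Tensor:
    """Approximate unit surface normals from a depth map via finite differences.
    depth: (B, 1, H, W) -> normals: (B, 3, H-1, W-1)."""
    dz_dx = depth[:, :, :-1, 1:] - depth[:, :, :-1, :-1]
    dz_dy = depth[:, :, 1:, :-1] - depth[:, :, :-1, :-1]
    n = torch.cat([-dz_dx, -dz_dy, torch.ones_like(dz_dx)], dim=1)
    return F.normalize(n, dim=1)

def normal_consistency_loss(pred_depth: torch.Tensor, gt_depth: torch.Tensor) -> torch.Tensor:
    """Mean (1 - cosine similarity) between normals of predicted and ground-truth depth."""
    cos = (normals_from_depth(pred_depth) * normals_from_depth(gt_depth)).sum(dim=1)
    return (1.0 - cos).mean()
```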

3.4.2. Occlusion-Aware Training Strategy

Handling occlusion regions represents a critical challenge in monocular depth estimation for autonomous driving. Our approach employs a sophisticated data partitioning strategy that divides training samples into occluded and non-occluded regions, applying specialized training procedures for each category.

Occlusion Detection and Segmentation

We implement an automated occlusion detection pipeline that identifies regions where depth information is ambiguous or unavailable due to object occlusion. The detection process analyzes consistency between multi-view projections and identifies regions with significant depth discontinuities or missing LiDAR measurements.

Adaptive Loss Weighting for Occluded Regions

For pixels identified as occluded, we apply modified loss weights and incorporate additional regularization terms:
$$\mathcal{L}_o = \frac{1}{|O|} \sum_{(u,v) \in O} w_{occ}(u,v) \cdot \rho\big(|\hat{D}(u,v) - D_{pseudo}(u,v)|\big)$$
where $O$ represents the set of occluded pixels, $w_{occ}(u,v)$ denotes spatially varying occlusion weights, and $D_{pseudo}(u,v)$ represents pseudo-labels generated through geometric reasoning and multi-view consistency.
This comprehensive training strategy ensures robust performance across diverse scenarios while maintaining efficiency suitable for real-time autonomous driving applications. The adaptive loss weighting and specialized handling of occlusion regions significantly improve depth estimation quality in challenging areas where traditional methods typically fail.

3.5. Sustainability Quantification Framework

To systematically evaluate the environmental impact of our proposed method, we introduce a comprehensive sustainability quantification framework that assesses both computational efficiency and energy consumption aspects.

3.5.1. Computational Sustainability Metrics

We define three key metrics for computational sustainability assessment:
Energy Efficiency Ratio (EER): This metric quantifies the trade-off between computational cost and depth estimation accuracy:
$$\mathrm{EER} = \frac{\text{Performance Score}}{\text{Energy Consumption}} \times 1000$$
where $\text{Performance Score} = \frac{1}{\text{Abs Rel}} + \delta_1$, and Energy Consumption is measured in watt-hours per inference.
Carbon Reduction Potential (CRP): We estimate the carbon emission reduction based on computational savings:
$$\mathrm{CRP} = \frac{E_{\text{baseline}} - E_{\text{proposed}}}{E_{\text{baseline}}} \times P_{\text{carbon}} \times T_{\text{deployment}}$$
where $E$ represents energy consumption, $P_{\text{carbon}}$ is the regional carbon intensity (gCO2/kWh), and $T_{\text{deployment}}$ is the expected deployment duration.
Memory Efficiency Index (MEI): This index evaluates the memory utilization efficiency:
$$\mathrm{MEI} = \frac{\text{Abs Rel}_{\text{baseline}} - \text{Abs Rel}_{\text{proposed}}}{\text{Memory}_{\text{proposed}} / \text{Memory}_{\text{baseline}}}$$
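The helper functions below compute these three metrics as defined above. The reading of the Performance Score as 1/Abs Rel + δ₁ and of the MEI numerator as an accuracy difference are assumptions recovered from the formulas, and the unit conventions are illustrative.

```python
def energy_efficiency_ratio(abs_rel: float, delta1: float, energy_wh: float) -> float:
    """EER: performance score divided by energy per inference (Wh), scaled by 1000.
    Performance score assumed to be 1 / Abs Rel + delta_1."""
    performance = 1.0 / abs_rel + delta1
    return performance / energy_wh * 1000.0

def carbon_reduction_potential(e_baseline: float, e_proposed: float,
                               carbon_intensity_g_per_kwh: float,
                               deployment_kwh: float) -> float:
    """CRP: relative energy saving times regional carbon intensity times deployment
    duration (expressed here as total kWh consumed over the deployment period)."""
    return (e_baseline - e_proposed) / e_baseline * carbon_intensity_g_per_kwh * deployment_kwh

def memory_efficiency_index(abs_rel_baseline: float, abs_rel_proposed: float,
                            mem_proposed_mb: float, mem_baseline_mb: float) -> float:
    """MEI: accuracy improvement per unit of relative memory footprint (assumed reading)."""
    return (abs_rel_baseline - abs_rel_proposed) / (mem_proposed_mb / mem_baseline_mb)
```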

3.5.2. Deployment Sustainability Analysis

For practical deployment scenarios, we consider the lifecycle sustainability impact:
  • Training Phase Sustainability: Reduced computational requirements during model development and fine-tuning
  • Inference Phase Efficiency: Lower power consumption during real-time operation on edge devices
  • Hardware Longevity: Extended hardware lifespan due to reduced thermal stress from lower computational loads
  • Scalability Benefits: Efficient scaling to large vehicle fleets with minimal additional energy costs
The proposed sustainability framework provides a quantitative basis for comparing the environmental impact of different depth estimation approaches, enabling more informed decisions in autonomous driving system design.

4. Experiments

4.1. Experimental Setup

4.1.1. Dataset Description

We evaluate our method on two autonomous driving datasets: KITTI and V2X-Sim, each providing complementary advantages for comprehensive assessment of monocular depth estimation performance.
KITTI Dataset: The KITTI dataset [2] is a widely-used benchmark for autonomous driving perception tasks, collected from real-world driving scenarios in urban and highway environments. It provides synchronized RGB images, LiDAR point clouds, and accurate ground truth depth maps obtained through Velodyne HDL-64 laser scanner. The dataset contains 61 scenes with approximately 42,382 training frames and 4424 test frames, captured at 10 Hz with image resolution of 1242 × 375 pixels. KITTI is particularly valuable for evaluating performance in diverse weather conditions, complex traffic scenarios, and varying lighting conditions. However, its ground truth depth is limited by LiDAR range (up to 80 m) and sparsity, making it challenging for evaluating fine-grained depth estimation in distant regions.
V2X-Sim Dataset: V2X-Sim is a large-scale synthetic dataset specifically designed for vehicle-to-everything (V2X) perception research. It provides high-fidelity simulation data with perfect depth annotations across the entire scene, including regions beyond LiDAR range and occluded areas. The dataset contains over 100,000 frames from diverse urban scenarios, including intersections, roundabouts, and multi-lane highways. Each frame includes RGB images at 1920 × 1080 resolution, along with pixel-perfect depth maps, semantic segmentation masks, and 3D bounding boxes for dynamic objects. The synthetic nature of V2X-Sim enables systematic evaluation of occlusion handling capabilities and provides dense depth annotations that are impossible to obtain in real-world datasets. This makes it particularly valuable for assessing model performance in challenging scenarios involving severe occlusions, long-range depth estimation, and complex multi-object interactions.
The combination of KITTI and V2X-Sim allows comprehensive evaluation of our method’s robustness across real and synthetic domains, with V2X-Sim providing particular advantages for analyzing occlusion-aware depth estimation performance.

4.1.2. Implementation Details

All experiments are conducted on NVIDIA RTX 3090 GPUs with 24 GB memory. We implement our method using PyTorch 1.12.0 and CUDA 11.6. The training process consists of 20 epochs with a batch size of 12, requiring approximately 48 h of training time per dataset. We use the Adam optimizer with an initial learning rate of $10^{-4}$, which decays polynomially to $10^{-6}$ following the schedule $\eta = \eta_0 \times \left(1 - \frac{epoch}{max\_epoch}\right)^{0.9}$.
For the encoder, we utilize ResNet-50 pre-trained on ImageNet as the backbone, which provides robust feature extraction capabilities for diverse driving scenarios. The decoder follows a modified MonoDepth2 architecture with enhanced skip connections to preserve spatial information during upsampling. The lightweight MLP consists of only 2 fully connected layers with 64 hidden dimensions and ReLU activation, significantly reducing the model complexity compared to traditional NeRF implementations.

4.1.3. Data Preprocessing and Augmentation

To enhance model generalization and robustness, we employ comprehensive data preprocessing and augmentation strategies. For both datasets, we normalize input images using ImageNet statistics and resize them to 640 × 192 pixels during training while maintaining aspect ratio during inference. We apply random horizontal flipping with 50% probability and random brightness, contrast, and saturation adjustments within [0.8, 1.2] range to simulate varying lighting conditions. Additionally, we implement random cropping with scale augmentation between 0.8 and 1.0 to improve scale invariance, which is crucial for accurate depth estimation in autonomous driving scenarios where object distances vary significantly.
For V2X-Sim specifically, we leverage its synthetic nature to generate additional challenging scenarios by introducing controlled occlusion patterns and varying object densities, enabling more robust evaluation of occlusion handling capabilities.

4.1.4. Training Strategy

We adopt a progressive training strategy to ensure stable convergence and optimal performance. During the first 5 epochs, we train only the encoder–decoder network while keeping the lightweight NeRF frozen, allowing the feature extractor to learn robust image representations. In the subsequent 15 epochs, we jointly fine-tune all components with a reduced learning rate of $5 \times 10^{-5}$. This staged training approach prevents the lightweight MLP from being overwhelmed by high-dimensional features during early training stages and ensures better convergence of the entire pipeline.
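A minimal sketch of this staged schedule is shown below. The module definitions are placeholders, and attaching the polynomial decay to the joint-training optimizer is an assumption about how the schedule from Section 4.1.2 is applied.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the real encoder-decoder and the lightweight NeRF MLP.
encoder_decoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
nerf_mlp = nn.Sequential(nn.Linear(124, 64), nn.ReLU(), nn.Linear(64, 1))

# Stage 1 (epochs 1-5): train only the encoder-decoder, keep the NeRF MLP frozen.
for p in nerf_mlp.parameters():
    p.requires_grad = False
stage1_optimizer = torch.optim.Adam(encoder_decoder.parameters(), lr=1e-4)

# Stage 2 (epochs 6-20): unfreeze everything and fine-tune jointly at a lower rate.
for p in nerf_mlp.parameters():
    p.requires_grad = True
stage2_optimizer = torch.optim.Adam(
    list(encoder_decoder.parameters()) + list(nerf_mlp.parameters()), lr=5e-5
)

# Polynomial decay, eta = eta_0 * (1 - epoch / max_epoch)^0.9, over the 20-epoch budget.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    stage2_optimizer, lr_lambda=lambda epoch: (1.0 - epoch / 20) ** 0.9
)
```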

4.2. Evaluation Metrics

We adopt standard evaluation metrics for depth estimation to comprehensively assess model performance across different error characteristics and depth ranges:
Absolute Relative Difference (Abs Rel) measures the average relative error between predicted and ground truth depths:
$$\text{Abs Rel} = \frac{1}{|D|} \sum_{d \in D} \frac{|d - d^{*}|}{d^{*}}$$
where $d$ denotes the predicted depth, $d^{*}$ the corresponding ground truth value, and $D$ the set of valid pixels.
This metric is particularly sensitive to near-range depth errors, making it crucial for evaluating performance in critical safety zones (0–20 m) for autonomous driving applications.
Root Mean Square Error (RMSE) quantifies absolute depth errors in the same units as depth measurements:
$$\text{RMSE} = \sqrt{\frac{1}{|D|} \sum_{d \in D} (d - d^{*})^2}$$
Due to the squaring operation, RMSE emphasizes larger errors and is valuable for assessing long-range depth perception capabilities essential for high-speed navigation.
RMSE in Log Space (RMSE log) operates on logarithmic depth values to handle scale variations:
$$\text{RMSE log} = \sqrt{\frac{1}{|D|} \sum_{d \in D} (\log d - \log d^{*})^2}$$
This metric provides balanced evaluation across different depth ranges, making it suitable for scenarios requiring both near-field and far-field performance.
Threshold Accuracy ( δ i ) evaluates the percentage of depth predictions within acceptable error bounds:
$$\delta_i = \frac{1}{|D|} \sum_{d \in D} \mathbb{1}\!\left(\max\!\left(\frac{d}{d^{*}}, \frac{d^{*}}{d}\right) < 1.25^{i}\right)$$
where $i = 1, 2, 3$ corresponds to threshold ratios of 1.25, 1.56, and 1.95, respectively. These metrics assess prediction reliability at different precision levels, with $\delta_1$ suitable for basic obstacle detection and $\delta_3$ for high-precision autonomous navigation.
The combination of these complementary metrics ensures comprehensive evaluation of depth estimation performance across all critical aspects of autonomous driving scenarios.
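For reference, the following NumPy sketch computes all of these metrics over the valid-pixel mask; it follows standard KITTI evaluation conventions and assumes predictions are already clipped to the evaluation depth range.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Abs Rel, RMSE, RMSE log, and threshold accuracies delta_1..delta_3 over
    pixels with valid ground truth (gt > 0). pred, gt: arrays of the same shape."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))

    ratio = np.maximum(pred / gt, gt / pred)
    metrics = {"abs_rel": abs_rel, "rmse": rmse, "rmse_log": rmse_log}
    for i in (1, 2, 3):
        metrics[f"delta_{i}"] = np.mean(ratio < 1.25 ** i)
    return metrics
```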

4.3. Comparison with State-of-the-Art Methods

We compare our method with both NeRF-based and traditional monocular depth estimation approaches. Table 2 shows the quantitative results.
As shown in Table 2 and Table 3, our method achieves competitive performance while significantly reducing memory consumption (234 MB vs. 1423 MB for VisionNeRF), making it more suitable for deployment on edge devices in autonomous driving scenarios.
To better validate the effectiveness of our method in occluded regions, we also compare with self-supervised monocular depth prediction methods, as shown in Table 4.
To further demonstrate the advantages of our lightweight architecture, we compare with state-of-the-art Transformer-based monocular depth estimation methods [29,30,31,32,33]. As shown in Table 5, our method achieves superior performance with significantly lower computational cost, making it more suitable for resource-constrained autonomous driving applications.

4.4. Ablation Studies

To further investigate the contribution of each component in our proposed algorithm, we conduct comprehensive ablation studies on the key modules and loss functions. Specifically, we evaluate the probabilistic sampling module, the adaptive fine-grained channel attention mechanism in spherical networks, and the loss functions including standard L1 reconstruction loss and reprojection loss.

4.4.1. Component Analysis

For the adaptive fine-grained channel attention mechanism in spherical networks, we replace it with a standard U-Net network of similar capacity during ablation experiments. For the probabilistic sampling module, we substitute it with the standard hierarchical sampling used in NeRF. The detailed evaluation results are shown in Table 6.
The experimental results demonstrate that all proposed modules contribute to optimal depth estimation performance for autonomous vehicles. Particularly, the use of reprojection loss significantly improves the Absolute Relative Error metric, indicating its beneficial impact on near-range depth estimation.

4.4.2. Reprojection Loss Validation

To further demonstrate the beneficial impact of the proposed reprojection loss, we apply it to other baseline methods. The results in Table 7 show that the reprojection loss significantly improves the performance of all baseline methods.

4.4.3. Probabilistic Ray Sampling Analysis

In addition to ablating the main modules and loss functions, we investigate the optimal configuration of the probabilistic ray sampling module by varying the number of Gaussian functions and sampling points per Gaussian. The results are presented in Table 8.
The results indicate that using more Gaussian functions or sampling points does not necessarily lead to better approximation of the underlying volume density distribution. The optimal performance is achieved when using 4 Gaussian functions with 16 sampling points each.

4.4.4. Lightweight NeRF Architecture Validation

To demonstrate that our proposed method can significantly reduce model memory usage (approximately 10-fold reduction) while improving depth prediction accuracy, we conduct ablation studies on the key components of our lightweight NeRF framework. Specifically, we evaluate the impact of transferring capacity from MLP to feature extractors and introducing color sampling as an alternative to traditional NeRF color and density prediction (Table 9).
The results demonstrate that reducing MLP capacity while using more powerful encoder–decoder structures not only decreases model memory usage but also improves the model’s ability to learn accurate geometric shapes. The introduction of color sampling from input frames further enhances model accuracy, particularly in occlusion region recovery.

4.4.5. Ray Discarding Strategy Analysis

To validate the effectiveness of our proposed ray discarding strategy for rays outside the camera frustum, we conduct additional ablation experiments. The results show that discarding invalid rays does not lead to significant performance improvements, likely because the ray discarding strategy primarily affects boundary regions of the camera frustum, which have minimal impact on overall depth estimation quality in autonomous driving scenarios.
The comprehensive ablation studies confirm the effectiveness of each proposed component and design choice, validating our technical approach for lightweight, occlusion-aware monocular depth estimation in autonomous driving applications.

4.5. Sustainability Performance Evaluation

We extend our experimental evaluation to include sustainability metrics, providing a comprehensive assessment of our method’s environmental impact alongside technical performance.

4.5.1. Energy Consumption Analysis

Table 10 presents the sustainability metrics comparison between our method and baseline approaches. The measurements were conducted on NVIDIA RTX 3090 GPUs under identical experimental conditions, with power consumption monitored using NVIDIA-SMI tools.

4.5.2. Environmental Impact Assessment

Assuming a deployment scenario of 1000 autonomous vehicles operating 8 h daily with an average carbon intensity of 450 gCO2/kWh, our method demonstrates significant environmental advantages:
  • Annual Carbon Reduction: 12.7 kgCO2 per vehicle compared to VisionNeRF baseline
  • Total Fleet Impact: Approximately 12.7 tons CO2 reduction annually for a 1000-vehicle fleet
  • Energy Efficiency: 83% higher Energy Efficiency Ratio compared to FeatDepth
  • Hardware Requirements: Enables deployment on lower-power edge devices, further reducing energy consumption
The sustainability analysis confirms that our lightweight approach not only advances the state of the art in depth estimation accuracy but also contributes to the development of environmentally conscious autonomous driving systems.

4.6. Qualitative Results

Figure 4 presents visual comparisons of depth estimation results. Our method produces depth maps with fewer artifacts and better detail preservation compared to traditional monocular depth estimation methods, particularly in occluded regions.
To further validate the robustness of our method under challenging environmental conditions, we conduct additional experiments in adverse weather scenarios including rain, fog, and low-light conditions. Figure 5 presents qualitative comparisons demonstrating our method’s superior performance in maintaining depth estimation accuracy under these challenging conditions.

5. Discussion

The proposed lightweight monocular depth estimation framework demonstrates significant potential for enhancing sustainable intelligent transportation systems through computational efficiency and robust environmental perception. This section discusses the sustainability implications, practical utility, and broader impact of our method.

5.1. Sustainability Implications

Our approach contributes substantially to sustainable transportation development through multiple interconnected pathways aligned with United Nations Sustainable Development Goals (SDGs), particularly SDG 9 (Industry, Innovation and Infrastructure) and SDG 11 (Sustainable Cities and Communities).
The substantial decrease in memory requirements, from over 1 GB in conventional NeRF approaches to 234 MB in our method, directly leads to lower energy consumption during both training and inference. This improvement in efficiency is essential for scaling intelligent transportation systems while reducing their environmental impact. By enabling deployment on resource-constrained edge devices, our approach promotes equitable access to advanced transportation technologies across diverse socioeconomic contexts, thereby supporting technological democratization in developing regions.
The improved reconstruction quality in occluded regions directly contributes to pedestrian and cyclist safety in complex urban scenarios. By accurately estimating depth in challenging visual conditions, our method enhances detection capabilities for vulnerable road users, supporting Vision Zero initiatives aimed at eliminating traffic fatalities and severe injuries.

5.2. Practical Utility and Industrial Applications

The proposed method offers substantial practical value across multiple industrial domains, particularly in autonomous driving and intelligent transportation management.
In autonomous driving systems, the significant memory reduction enables simultaneous execution of multiple perception tasks on embedded systems without exceeding computational budgets. This efficiency is particularly valuable for fleet management applications where scaling perception capabilities across numerous vehicles requires cost-effective hardware solutions. The improved occlusion handling enhances performance in complex urban scenarios where autonomous vehicles have traditionally struggled with unexpected pedestrian appearances and partially visible obstacles.
Beyond vehicle-based applications, our approach facilitates enhanced traffic management through efficient visual analysis of roadside camera feeds. The reduced computational requirements enable real-time depth estimation from existing traffic monitoring infrastructure without hardware upgrades, supporting congestion management, intersection safety monitoring, and infrastructure planning. The method’s compatibility with edge computing devices further enhances its practicality for distributed intelligent transportation systems.

5.3. Scientific Contributions and Research Implications

Our work offers several important contributions to the scientific community. A key innovation lies in the capacity redistribution strategy, which shifts computational load from resource-intensive MLPs to efficient separable convolutional encoder-decoder networks. This approach in neural scene representation demonstrates that carefully designed feature extraction networks can effectively compensate for reduced MLP capacity while preserving geometric accuracy.
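A minimal sketch of the kind of separable building block implied by this argument is shown below, following the CMUNeXt-style design described in Figure 3 (a large-kernel depthwise convolution followed by two pointwise convolutions with an inverted bottleneck); the channel counts, kernel size, and normalization choice are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SeparableEncoderBlock(nn.Module):
    """Illustrative depthwise-separable block in the spirit of Figure 3: a
    large-kernel depthwise convolution for spatial context, then two pointwise
    convolutions forming an inverted bottleneck. Sizes are assumptions."""

    def __init__(self, channels=64, kernel_size=7, expansion=4):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.norm = nn.BatchNorm2d(channels)
        self.pw1 = nn.Conv2d(channels, channels * expansion, kernel_size=1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(channels * expansion, channels, kernel_size=1)

    def forward(self, x):
        residual = x
        x = self.norm(self.depthwise(x))      # cheap large-receptive-field mixing
        x = self.pw2(self.act(self.pw1(x)))   # inverted bottleneck over channels
        return x + residual                   # residual connection for stable training

# A depthwise 7x7 on 64 channels uses ~64*49 weights versus ~64*64*49 for a dense
# 7x7 convolution, which is where most of the parameter savings come from.
x = torch.randn(1, 64, 96, 320)
print(SeparableEncoderBlock()(x).shape)  # torch.Size([1, 64, 96, 320])
```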
The method bridges supervised and self-supervised learning paradigms by combining self-supervised geometric understanding through NeRF’s volume rendering with learned feature representations. This hybrid approach suggests a promising direction for leveraging both geometric priors and data-driven learning. Furthermore, the efficiency of our framework creates opportunities for cross-modal knowledge distillation, potentially extending to LiDAR-based depth estimators or multi-view stereo systems while maintaining computational efficiency.

5.4. Limitations and Future Directions

While demonstrating promising results, our approach has limitations that suggest valuable research directions. Future work could integrate explicit motion modeling to better handle dynamic scenes, incorporate complementary sensory modalities while maintaining efficiency, and explore domain adaptation techniques for robust performance across diverse geographic and weather conditions. Further optimization of inference speed would also enhance applicability in time-critical transportation applications.
In conclusion, our lightweight monocular depth estimation framework advances the technical state of the art while contributing meaningfully to sustainable transportation development through computational efficiency, enhanced safety, and broader accessibility.

6. Conclusions

This paper presents a lightweight NeRF-based monocular depth estimation framework for autonomous driving systems, with significant implications for sustainable urban transportation and energy-efficient ITS deployment. Our method contributes to sustainable development objectives through three key pathways: (1) dramatically reduced memory requirements (234 MB vs. >1 GB in comparable methods) enable deployment on low-power edge devices, decreasing the energy footprint of perception systems; (2) enhanced accuracy in occluded region reconstruction improves pedestrian and vulnerable road user safety, supporting Vision Zero initiatives; (3) computational efficiency promotes equitable access to intelligent transportation technologies across diverse economic contexts, aligning with UN SDG 11’s aim for sustainable transport systems.
The framework incorporates two key innovations: a fine-grained adaptive channel attention mechanism in the spherical network architecture to enhance field-of-view coverage, and a Gaussian probability ray sampling strategy to improve computational efficiency in large-scale scenes. By projecting 2D image features onto 3D sampling points, our approach achieves superior performance on the KITTI benchmark, significantly outperforming state-of-the-art techniques in occluded region reconstruction and large-scale scene processing.
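As an illustration of the second innovation, the sketch below draws depth samples from a small mixture of Gaussians along each ray, using the best-performing configuration from Table 8 (four Gaussians with sixteen points each); the interface and the placeholder means and standard deviations are assumptions, since the exact parameterization of the mixture is not reproduced here.

```python
import torch

def gaussian_ray_samples(num_rays, means, stds, points_per_gaussian=16,
                         near=0.5, far=80.0):
    """Sketch of probabilistic depth sampling: draw depths from a mixture of
    Gaussians along each ray instead of uniform stratified bins. `means` and
    `stds` have shape (num_rays, G) and would normally be predicted per ray;
    here they are treated as given (hypothetical interface)."""
    samples = means.unsqueeze(-1) + stds.unsqueeze(-1) * torch.randn(
        num_rays, means.shape[1], points_per_gaussian)
    samples = samples.clamp(near, far).flatten(1)    # (num_rays, G * P)
    return samples.sort(dim=-1).values               # volume rendering expects sorted depths

# Best ablation setting (Table 8): 4 Gaussians x 16 points = 64 samples per ray.
means = torch.tensor([[5.0, 15.0, 30.0, 55.0]]).repeat(128, 1)  # placeholder depth modes
stds = torch.full((128, 4), 2.0)
depths = gaussian_ray_samples(128, means, stds)  # (128, 64)
```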
Furthermore, the proposed method demonstrates significant advantages in computational sustainability, achieving an 83% higher Energy Efficiency Ratio compared to existing approaches while reducing annual carbon emissions by approximately 12.7 kgCO2 per vehicle. These sustainability benefits make our framework particularly suitable for large-scale deployment in eco-conscious intelligent transportation systems, contributing to both technical advancement and environmental responsibility in autonomous driving technologies.
Future work will focus on accelerating inference processes, enhancing adaptability to dynamic environments, and exploring efficient multi-modal sensor fusion strategies. The lightweight architecture (234MB memory footprint) provides an ideal foundation for integrating LiDAR, radar, and camera data without exceeding computational budgets of edge devices. We will specifically investigate approaches to improve the accessibility, affordability, and scalability of ITS technologies, including developing cost-effective deployment strategies for resource-constrained environments, optimizing algorithms for diverse hardware platforms to broaden technology adoption, and creating modular architectures that enable incremental system scaling. Additionally, we will address large-scale deployment challenges including system integration with existing autonomous driving stacks, real-time performance optimization for vehicle-grade hardware, and development of adaptive learning mechanisms for diverse operational domains. These advancements will facilitate the transition from laboratory validation to widespread industrial adoption while ensuring that intelligent transportation solutions remain accessible to communities across different economic contexts.

Author Contributions

Conceptualization, X.T.; methodology, X.T.; validation, X.T., C.W. and Z.Z.; formal analysis, X.T.; investigation, X.T., J.P. and H.S.; resources, Z.P.; data curation, X.T.; writing—original draft preparation, X.T.; writing—review and editing, C.W., R.L. and Z.C.; visualization, Z.Z.; supervision, C.W., M.C. and Z.C.; project administration, C.W. and Z.C.; funding acquisition, C.W., M.C. and Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Transportation Science and Technology Program Project of Shandong High-Speed Group Co., Ltd. (Grant No. HS2023B020), the Shandong Provincial Natural Science Foundation (Grant No. ZR2024LZN010), and the Ganwei Program of Beihang University (Grant No. WZ2024-2-16).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. The KITTI dataset can be accessed at http://www.cvlibs.net/datasets/kitti/ (accessed on 18 July 2025). The V2X-Sim dataset can be accessed at its official repository.

Acknowledgments

The authors would like to acknowledge the support provided by the computing facilities and datasets used in this research.

Conflicts of Interest

Authors Xianfeng Tan, Chengcheng Wang, Ziyu Zhang and Zhendong Ping are employed by Shandong Hi-speed Group; Author Meng Chi is employed by Shandong Hi-speed Information Group. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The research was funded by the Science and Technology Program Project of Shandong High-Speed Group Co., Ltd.

Abbreviations

The following abbreviations are used in this manuscript:
NeRF	Neural Radiance Field
MLP	Multi-Layer Perceptron
CNN	Convolutional Neural Network
KITTI	Karlsruhe Institute of Technology and Toyota Technological Institute
V2X	Vehicle-to-Everything
RMSE	Root Mean Square Error
GPU	Graphics Processing Unit

Figure 1. U-Net Based on Separable Convolution.
Figure 2. Architecture of the separable convolution-based encoder–decoder network. The network consists of an encoder stage for global information extraction using separable convolutions and a decoder stage for feature reconstruction with skip connections.
Figure 3. Architecture of the CMUNeXt module. The module uses large-kernel depth convolutions followed by two pointwise convolutions with inverted bottleneck design to extract global context information efficiently.
Figure 4. Qualitative comparison of depth estimation results on the KITTI/V2X-Sim datasets. From left to right: input image, FeatDepth, B-MSMF6, and our method. Our approach shows superior performance in preserving fine details and handling occlusions with fewer artifacts.
Figure 5. Depth estimation performance comparison under adverse weather conditions. From top to bottom: heavy rain, dense fog, and night-time scenarios. From left to right: input image, FeatDepth, B-MSMF6, and our method. Our approach demonstrates enhanced robustness with fewer artifacts and more consistent depth predictions in challenging weather conditions.
Table 1. Comprehensive Performance Comparison of Different Depth Estimation Methods. In the table, ✓ and × indicate that the method does or does not have the corresponding capability, respectively.
Evaluation Metric | VisionNeRF | PixelNeRF | MonoDepth2 | Ours
Memory Usage (MB) | 1423 | 1026 | 501 | 234
Speed (FPS) | 5.7 | 6.9 | 20.1 | 22.8
Real-time Capability | × | × | ✓ | ✓
Occlusion Handling | × | × | × | ✓
Energy Efficiency (EER) | 3.42 | 4.12 | 8.31 | 18.92
Table 2. Comparison with NeRF-based Methods on the KITTI Dataset. In the table, ↑ indicates that higher values for this metric correspond to better model performance, while ↓ indicates the opposite.
Method | Memory | FPS | Abs Rel ↓ | δ1 ↑
VisionNeRF | 1423 MB | 5.7 | 0.137 | 0.839
PixelNeRF | 1026 MB | 6.9 | 0.130 | 0.845
Ours | 234 MB | 22.8 | 0.102 | 0.887
Table 3. Comparison with NeRF-based Methods on the V2X-Sim Dataset. In the table, ↑ indicates that higher values for this metric correspond to better model performance, while ↓ indicates the opposite.
Method | Memory | FPS | Abs Rel ↓ | δ1 ↑
VisionNeRF | 1423 MB | 5.7 | 0.128 | 0.857
PixelNeRF | 1026 MB | 6.9 | 0.122 | 0.863
Ours | 234 MB | 22.8 | 0.095 | 0.905
Table 4. Comparison with Monocular Depth Estimation Methods. In the table, ↑ indicates that higher values for this metric correspond to better model performance, while ↓ indicates the opposite.
Method | Volume | FPS | KITTI Abs Rel ↓ | KITTI δ1 ↑ | V2X-Sim Abs Rel ↓ | V2X-Sim δ1 ↑
EPC++ | × | 18.4 | 0.128 | 0.831 | 0.120 | 0.848
MonoDepth2 | × | 20.1 | 0.106 | 0.874 | 0.100 | 0.892
PackNet | × | 15.2 | 0.111 | 0.878 | 0.104 | 0.896
DepthHint | × | 15.8 | 0.105 | 0.875 | 0.099 | 0.893
FeatDepth | × | 14.5 | 0.099 | 0.889 | 0.093 | 0.907
Ours | ✓ | 22.8 | 0.102 | 0.887 | 0.095 | 0.905
Table 5. Comparison with Transformer-based Methods on the V2X-Sim Dataset. In the table, ↑ indicates that higher values for this metric correspond to better model performance, while ↓ indicates the opposite.
Method | Backbone | Memory | FPS | Abs Rel ↓ | RMSE ↓ | δ1 ↑
DPT-Hybrid | ViT-Hybrid | 345 MB | 16.7 | 0.108 | 4.45 | 0.882
Adabins | EfficientNet-B5 | 298 MB | 19.6 | 0.101 | 4.38 | 0.885
NeWCRFs | Swin-Large | 412 MB | 13.7 | 0.093 | 4.28 | 0.898
BinsFormer | Swin-Base | 387 MB | 14.8 | 0.095 | 4.31 | 0.896
DepthFormer | Swin-Tiny | 321 MB | 18.1 | 0.103 | 4.48 | 0.878
Ours | ResNet-50 + LightMLP | 234 MB | 22.8 | 0.095 | 4.30 | 0.905
Table 6. Module and Loss Function Ablation Study Results. In the table, ✓ and × indicate whether the component is enabled; ↑ indicates that higher values for this metric correspond to better model performance, while ↓ indicates the opposite.
Spherical Network | Prob. Sampling | Reproj. Loss | Abs Rel ↓ | RMSE ↓ | δ1 ↑
× | × | × | 0.185 | 5.12 | 0.824
✓ | × | × | 0.162 | 4.87 | 0.851
× | ✓ | × | 0.171 | 4.95 | 0.838
× | × | ✓ | 0.147 | 4.65 | 0.869
✓ | ✓ | × | 0.138 | 4.52 | 0.875
✓ | × | ✓ | 0.125 | 4.41 | 0.882
× | ✓ | ✓ | 0.131 | 4.47 | 0.878
✓ | ✓ | ✓ | 0.102 | 4.39 | 0.887
Table 7. Reprojection Loss Ablation Study.
Method | w/o Reproj. Loss | w/ Reproj. Loss | Improvement
MonoDepth2 | 0.115 | 0.106 | 7.8%
PackNet | 0.119 | 0.111 | 6.7%
VisionNeRF | 0.174 | 0.158 | 9.2%
Ours | 0.137 | 0.102 | 25.5%
Table 8. Probabilistic Ray Sampling Module Ablation Results. In the table, ↑ indicates that higher values for this metric correspond to better model performance, while ↓ indicates the opposite.
Num. Gaussians | Points per Gaussian | Abs Rel ↓ | RMSE ↓ | δ1 ↑
2 | 16 | 0.118 | 4.58 | 0.874
3 | 16 | 0.108 | 4.45 | 0.883
4 | 16 | 0.102 | 4.39 | 0.887
5 | 16 | 0.105 | 4.42 | 0.885
4 | 12 | 0.112 | 4.51 | 0.878
4 | 20 | 0.106 | 4.43 | 0.884
Table 9. Lightweight NeRF Ablation Study Results. In the table, ✓ and × indicate whether the component is enabled; ↑ indicates that higher values for this metric correspond to better model performance, while ↓ indicates the opposite.
Lightweight MLP | Enhanced Encoder | Color Sampling | Memory (MB) | Abs Rel ↓ | δ1 ↑
× | × | × | 1423 | 0.137 | 0.839
✓ | × | × | 892 | 0.151 | 0.821
× | ✓ | × | 1156 | 0.119 | 0.864
✓ | ✓ | × | 378 | 0.128 | 0.849
✓ | ✓ | ✓ | 234 | 0.102 | 0.887
Table 10. Sustainability Metrics Comparison on the KITTI Dataset.
Method | Memory (MB) | Power (W) | EER | CRP (kgCO2/Year)
VisionNeRF | 1423 | 285 | 3.42 | 40.6
PixelNeRF | 1026 | 245 | 4.12 | 35.8
MonoDepth2 | 501 | 189 | 8.31 | 16.5
FeatDepth | 487 | 182 | 9.05 | 14.4
Ours | 234 | 156 | 18.92 | 12.7
