Article

AuxDepthNet: Real-Time Monocular 3D Object Detection with Depth-Sensitive Features

1 Department of Mechanical Engineering, National Korea Maritime and Ocean University, Busan 49112, Republic of Korea
2 Maritime ICT and Mobility Research Department, Korea Institute of Ocean Science and Technology, Busan 49111, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7538; https://doi.org/10.3390/app15137538
Submission received: 10 June 2025 / Revised: 1 July 2025 / Accepted: 2 July 2025 / Published: 4 July 2025

Abstract

Monocular 3D object detection is a challenging task in autonomous systems due to the lack of explicit depth information in single-view images. Existing methods often depend on external depth estimators or expensive sensors, which increase computational complexity and complicate integration into existing systems. To overcome these limitations, we propose AuxDepthNet, an efficient framework for real-time monocular 3D object detection that eliminates the reliance on external depth maps or pre-trained depth models. AuxDepthNet introduces two key components: the Auxiliary Depth Feature (ADF) module, which implicitly learns depth-sensitive features to improve spatial reasoning and computational efficiency, and the Depth Position Mapping (DPM) module, which embeds depth positional information directly into the detection process to enable accurate object localization and 3D bounding box regression. Leveraging the DepthFusion Transformer (DFT) architecture, AuxDepthNet globally integrates visual and depth-sensitive features through depth-guided interactions, ensuring robust and efficient detection. Extensive experiments on the KITTI dataset show that AuxDepthNet achieves state-of-the-art performance, with $AP_{3D}$ scores of 24.72% (Easy), 18.63% (Moderate), and 15.31% (Hard), and $AP_{BEV}$ scores of 34.11% (Easy), 25.18% (Moderate), and 21.90% (Hard) at an IoU threshold of 0.7.

1. Introduction

Three-dimensional (3D) object detection [1,2,3,4,5] plays a critical role in autonomous driving and robotic perception by enabling machines to perceive and interact with their surroundings in a spatially aware manner. Traditionally, this task has relied on precise depth information from multiple sensors, such as LiDAR, stereo cameras, or depth sensors. While these systems provide high accuracy, they are often expensive and complex to deploy. In recent years, monocular 3D object detection has gained attention as a cost-effective alternative, requiring only a single RGB camera [6,7,8]. Despite its simplicity and accessibility, monocular 3D detection faces challenges due to the absence of explicit depth cues, making effective depth reasoning essential for improving accuracy and practicality.
Monocular 3D object detection methods can be broadly categorized into two approaches, as illustrated in Figure 1. The first category utilizes pre-trained depth estimation models to generate pseudo-depth maps, which are combined with LiDAR-based 3D detectors for object recognition and localization [9,10]. This approach, as exemplified by methods like Pseudo-LiDAR and F-Pointnet [5,11,12], achieves improved localization but suffers from inaccuracies in depth priors and high computational costs due to the reliance on external depth estimators, limiting their applicability in real-time scenarios.
The second category focuses on feature fusion, where features from monocular images and estimated depth maps are extracted and combined to enhance detection. Methods like D4LCN and CaDDN [13,14], as shown in Figure 1b, use specialized convolutional architectures to integrate visual and depth features. While these methods demonstrate promising results, their reliance on the quality of depth maps and the complexity of their architectures can hinder efficiency and robustness in dynamic environments.
To address these challenges, we propose AuxDepthNet, a novel framework for real-time monocular 3D object detection that avoids the need for external depth estimators or pre-generated depth maps. AuxDepthNet introduces two key modules: the Auxiliary Depth Feature Module (ADF) and the Depth Position Mapping Module (DPM), as depicted in Figure 1c. These modules enable implicit learning of depth-sensitive features and inject depth positional cues directly into the detection pipeline. Conventional CNN-based monocular 3D detection frameworks often struggle to capture global dependencies and integrate multi-source depth features. In contrast, Transformers leverage self-attention mechanisms to process image and depth information simultaneously [15,16], allowing for end-to-end 3D object detection within a streamlined pipeline [17]. Built upon the DepthFusion Transformer architecture, our method efficiently integrates contextual and depth-sensitive features, achieving superior detection performance with reduced computational costs [11,12].
Figure 1. Representative depth-assisted monocular 3D object detection frameworks. (a) Depth estimation methods use monocular inputs to construct pseudo-LiDAR data, enabling LiDAR-style 3D detectors [5,11,12]. (b) Fusion-based methods combine visual and depth features to improve object detection accuracy [13,14,18]. (c) Our AuxDepthNet leverages depth guidance during training to develop depth-sensitive features and performs end-to-end 3D object detection without requiring external depth estimators.
The main contributions of this paper are as follows:
(1) We propose AuxDepthNet, a novel framework for efficient, real-time monocular 3D object detection that eliminates reliance on external depth maps or estimators.
(2) We design the Auxiliary Depth Feature Module (ADF) and Depth Position Mapping Module (DPM) to implicitly learn depth-sensitive features and encode depth positional cues into the detection process.
(3) We provide a plug-and-play architecture that can be seamlessly integrated into other image-based detection frameworks to enhance their 3D reasoning capabilities.

2. Related Work

Monocular 3D object detection has gained significant attention for its cost-effectiveness compared to sensor-based approaches, such as LiDAR and stereo cameras. Existing methods can be broadly categorized into depth-based methods and Transformer-based methods, which are discussed below.

2.1. Monocular 3D Object Detection Methods

Image-based 3D object detection leverages monocular or stereo images to estimate depth and generate 3D proposals. Approaches like MONO3D [13], MLF [19], and Frustum PointNet [20] use depth maps or disparity estimates to compute 3D coordinates, often integrating them into 2D detection pipelines. However, poor depth representations in these methods limit the ability of convolutional networks to accurately localize objects, especially at greater distances, leading to performance gaps compared to LiDAR-based systems. Depth-based methods aim to bridge this gap by explicitly or implicitly leveraging depth information. One common approach is to estimate pseudo-depth maps from monocular images and utilize them for 3D object detection. For instance, Pseudo-LiDAR [5] and F-PointNet [12] rely on pre-trained depth models to generate point clouds, which are then processed with LiDAR-style 3D detectors. While these methods improve spatial reasoning, they suffer from depth estimation errors and increased computational costs due to reliance on external depth estimators.
Another line of work focuses on integrating monocular image features with estimated depth information in a feature fusion framework. Methods like D4LCN [13] and CaDDN [21] incorporate depth-sensitive features into the 2D detection pipeline to enhance accuracy. However, these approaches are constrained by the quality of the estimated depth maps and often rely on complex architectures, limiting their real-time applicability. In contrast, our proposed method avoids reliance on external depth estimators by directly learning depth-sensitive features through the Auxiliary Depth Feature (ADF) module. This design allows AuxDepthNet to achieve high accuracy while maintaining computational efficiency.

2.2. Transformer in Monocular 3D Object Detection

The recent success of Transformers in computer vision has inspired their application in monocular 3D object detection. Unlike convolutional neural networks (CNNs), Transformers capture global spatial dependencies using self-attention mechanisms, addressing the limitations of local feature extraction. For example, MonoDETR [22] employs a dual-encoder architecture and a depth-guided decoder to improve depth representation and object localization. Similarly, MonoPSTR [23] introduces scale-aware attention and position-coded queries to enhance detection precision and efficiency. Recently, diffusion models [24,25,26,27] have also demonstrated their potential in depth estimation by iteratively refining depth predictions, offering an alternative to traditional depth map generation methods. These models could complement Transformer-based architectures by providing robust depth priors for enhanced detection.
Despite their advancements, Transformer-based methods often rely on pre-computed depth maps or handcrafted priors to guide detection [17,28]. This dependence on external inputs can hinder adaptability to diverse scenarios. AuxDepthNet addresses this limitation by embedding depth reasoning directly into the network via the Auxiliary Depth Feature (ADF) and Depth Position Mapping (DPM) modules. These modules enable the model to implicitly learn depth-sensitive features, eliminating external dependencies and delivering robust, scalable performance.

3. Proposed Method

3.1. Overview

As illustrated in Figure 2, the AuxDepthNet architecture employs a multi-faceted feature representation to enhance monocular 3D object detection. Specifically, our framework integrates three key feature types: depth-sensitive, context-sensitive, and depth-guided features—each capturing critical information at different processing stages.
We adopt DLA-102x as the Backbone network, which takes an input image of size $3 \times H \times W$ and produces multi-scale feature maps. The final output of the Backbone, denoted $F$, is a feature map with dimensions $C \times H \times W$, where $C$ is the number of channels and $H$, $W$ denote the downsampled spatial dimensions, which are smaller than the original input resolution. These feature maps feed into the Auxiliary Depth Feature (ADF) module, which encodes depth-related cues through an additional “auxiliary” learning task.
This task eliminates the need for pre-computed depth maps by implicitly capturing depth prototypes, thereby strengthening depth-sensitive feature representation. In this context, “auxiliary” denotes an extra but complementary objective that supports the model’s primary 3D detection goal by providing more accurate depth information for improved spatial reasoning.
Meanwhile, context-sensitive features extracted by the Backbone are refined via a feature pyramid and the DepthFusion Transformer (DFT) to supply essential semantic and spatial context. In parallel, depth-guided features receive further enhancement from the Depth Position Mapping (DPM) module and positional encoding, embedding depth-related positional cues into the latent space. This combination of local and global spatial relationships, delivered at minimal computational cost, results in robust and efficient 2D and 3D object detection.
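To make this data flow concrete, the following PyTorch-style sketch wires the three feature streams together. Every module here is a deliberately tiny stand-in (single convolutions, an assumed channel width and depth-bin count); the real DLA-102x backbone, ADF, DPM, DFT, and detection heads are far richer, so this only illustrates how the pieces connect.

```python
import torch
import torch.nn as nn

class AuxDepthNetSketch(nn.Module):
    """Illustrative wiring of the AuxDepthNet pipeline; every block is a tiny stand-in."""
    def __init__(self, channels=64, depth_bins=96, head_outputs=64):
        super().__init__()
        self.backbone = nn.Sequential(                     # context-sensitive features
            nn.Conv2d(3, channels, 3, stride=4, padding=1), nn.ReLU())
        self.adf_head = nn.Conv2d(channels, depth_bins, 1)  # auxiliary depth-bin prediction (ADF)
        self.dpm = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # DPM positional cue
        self.fuse = nn.Conv2d(2 * channels, channels, 1)     # stand-in for the DFT fusion
        self.head = nn.Conv2d(channels, head_outputs, 1)     # 2D/3D detection head

    def forward(self, image):
        feats = self.backbone(image)                # (B, C, H', W') context features
        depth_logits = self.adf_head(feats)         # depth-sensitive auxiliary output
        depth_pos = self.dpm(feats)                 # depth position mapping cue
        fused = self.fuse(torch.cat([feats, depth_pos], dim=1))  # depth-guided fusion
        return self.head(fused), depth_logits

model = AuxDepthNetSketch()
detections, depth_logits = model(torch.randn(1, 3, 288, 1280))
print(detections.shape, depth_logits.shape)
```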

3.2. Depth-Sensitive Feature Enhancement

Current depth-assisted approaches face challenges in generalizing to varied datasets and environments, which hampers their adaptability. Moreover, their dependency on external depth sensors or estimators not only increases hardware requirements but also risks propagating inaccuracies into the detection pipeline [5,13,17,18]. To address these challenges, we propose the Auxiliary Depth Feature (ADF) module, which leverages auxiliary supervision during training to implicitly learn depth-sensitive features, eliminating the need for external depth maps or estimators.
Unlike previous methods [3,5,8], which rely on pre-computed depth maps or external estimators, the ADF module captures depth-sensitive features directly from the input feature map using auxiliary learning. This approach ensures scalability, improves generalization across datasets, and enables real-time applicability.
As illustrated in Figure 3, the Auxiliary Depth Feature (ADF) module operates in three stages to enhance depth-sensitive features. First, it generates initial depth-sensitive features through an auxiliary supervision task, which predicts a probability distribution over discretized depth bins for each pixel. Next, the module learns depth prototypes by leveraging spatial-depth attention mechanisms, aggregating features across spatial dimensions to capture meaningful depth relationships. Finally, these depth prototypes are projected back to refine the feature map, producing an enhanced representation that integrates depth-sensitive information and spatial context, thereby improving the accuracy and robustness of 3D object detection.
The ADF module is designed as a lightweight and efficient solution that enhances depth reasoning and spatial localization while maintaining low computational cost. The term ‘auxiliary’ refers to the fact that the module learns depth-sensitive features through an auxiliary supervision task. This task does not rely on pre-computed depth maps or external depth estimators, but instead enables the network to learn depth-related information directly from the image data during training.
The following sections detail each step of the ADF module and its role in improving depth representation for 3D object detection.

3.2.1. Backbone Network Architecture and Multi-Scale Feature Extraction

Figure 4 illustrates the changes in input image size and channel dimensions as the Backbone processes the input. We assume that the input image has a resolution of 1280 × 288 × 3 (width × height × channels). In the first layer of the Backbone, the image undergoes an initial convolution operation, resulting in a feature map with a size of 640 × 144 × 16. As the network progresses, the spatial dimensions of the feature map gradually decrease, while the number of channels increases. Specifically, at Level 0, the feature map retains the same size of 640 × 144 × 16, but at Level 1, after downsampling, the feature map size reduces to 320 × 72 × 32. Further down the network, the feature map dimensions continue to shrink while the channel depth continues to grow. At Level 2, the output is 160 × 36 × 128, at Level 3 it is 80 × 18 × 256, at Level 4 it is 40 × 9 × 512, and finally, at Level 5, the feature map size becomes 20 × 5 × 1024. This process demonstrates the gradual extraction of deeper features from the image through successive downsampling and convolution operations.
Additionally, the DLA-102x network includes a feature aggregation module, which fuses feature maps from different layers. Through this aggregation process, the final output feature map has a size of 320 × 72 × 256, containing information from multiple scales that enhances the subsequent depth-sensitive feature extraction process.
Mathematically, let the input image $I$ have a size of $H \times W \times 3$; the Backbone network first applies convolution to produce an initial feature map $F_0$:
$$F_0 = \mathrm{Backbone}(I), \quad F_0 \in \mathbb{R}^{C_0 \times H_0 \times W_0}$$
where $C_0$ is the number of channels in the feature map, and $H_0$, $W_0$ are the spatial dimensions after convolution and pooling. As the network deepens, each layer generates deeper feature representations through convolution operations and downsampling, gradually reducing the spatial dimensions and increasing the number of channels. After $n$ layers, the feature map's size and channels become the following:
$$F_n = \mathrm{Backbone}_n(F_{n-1}), \quad F_n \in \mathbb{R}^{C_n \times H_n \times W_n}$$
Finally, after the feature aggregation module, the resulting feature map $F_{aggregated}$ has the following dimensions:
$$F_{aggregated} \in \mathbb{R}^{C_{aggregated} \times H_{aggregated} \times W_{aggregated}}$$
This aggregated feature map serves as the foundation for subsequent depth-sensitive feature extraction, providing rich visual information for depth estimation and object detection tasks. This input–output definition not only provides clear guidance for understanding the feature extraction process in the Backbone but also forms a solid foundation for extracting depth-sensitive features, ensuring the success of the subsequent tasks.
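The progressive halving of resolution and growth of channel depth described above can be reproduced with a toy stack of stride-2 convolutions. The channel widths follow the numbers quoted in the text, but the layer structure is a simplification; the actual DLA-102x contains many more blocks plus the aggregation nodes.

```python
import torch
import torch.nn as nn

# Toy pyramid mimicking the channel progression of Section 3.2.1
# (16 -> 32 -> 128 -> 256 -> 512 -> 1024); not the actual DLA-102x topology.
channels = [16, 32, 128, 256, 512, 1024]
layers, in_ch = [], 3
for out_ch in channels:
    layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU()]
    in_ch = out_ch
backbone = nn.Sequential(*layers)

x = torch.randn(1, 3, 288, 1280)          # (batch, channels, height, width)
level = 0
for layer in backbone:
    x = layer(x)
    if isinstance(layer, nn.Conv2d):
        print(f"level {level}: {tuple(x.shape)}")
        level += 1
```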

3.2.2. Extracting Foundational Depth-Sensitive Features

In the Auxiliary Depth Feature (ADF) module, foundational depth-sensitive features are generated using an auxiliary depth estimation task modeled as a sequential classification problem. Given the input feature map $F \in \mathbb{R}^{C \times H \times W}$ from the Backbone, we apply two convolutional layers to predict a probability distribution over discretized depth bins, $P \in \mathbb{R}^{D_b \times H \times W}$, where $D_b$ denotes the number of depth categories. The predicted probability represents the confidence of each pixel belonging to a specific depth bin:
$$P_{i,j,d} = \frac{\exp(z_{i,j,d})}{\sum_{k=1}^{D_b} \exp(z_{i,j,k})},$$
where $z_{i,j,d}$ is the raw score for pixel $(i, j)$ in bin $d$.
To discretize continuous depth values into bins, we utilize Linear-Increasing Discretization (LID), which refines depth granularity for closer objects while allocating broader intervals for farther ones. $D$ represents the total number of discretized depth bins, and $i$ denotes the index of the corresponding depth bin. The discretization can be formulated as follows:
$$B_i = \begin{cases} \dfrac{i + 1}{2D}, & \text{if } i < D, \\ \dfrac{i + 1}{D}, & \text{otherwise.} \end{cases}$$
By applying depth-wise separable convolutions, the intermediate feature map $X \in \mathbb{R}^{C \times H \times W}$ efficiently captures depth-sensitive features with reduced computational overhead. These features, further enhanced through attention mechanisms, provide a robust foundation for downstream 3D object detection tasks.
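A minimal sketch of this auxiliary depth head: two convolutions produce per-pixel logits over the depth bins and a softmax yields the probabilities of the equation above, while a helper builds LID-style bin edges that are finer for nearby depths. The bin-edge formula follows the linear-increasing discretization commonly used in prior work (e.g., CaDDN), and the depth range, bin count, and channel width are assumed values.

```python
import torch
import torch.nn as nn

def lid_bin_edges(d_min=2.0, d_max=46.8, num_bins=96):
    """LID-style edges: bin width grows linearly with depth, so nearby
    objects get finer bins (standard formulation, assumed here)."""
    i = torch.arange(num_bins + 1, dtype=torch.float32)
    return d_min + (d_max - d_min) * i * (i + 1) / (num_bins * (num_bins + 1))

class AuxDepthHead(nn.Module):
    """Two convolutional layers predicting a per-pixel distribution over depth bins."""
    def __init__(self, in_ch=64, num_bins=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, num_bins, 1))

    def forward(self, feats):
        logits = self.net(feats)              # raw scores z_{i,j,d}: (B, D_b, H, W)
        return torch.softmax(logits, dim=1)   # per-pixel depth-bin probabilities P_{i,j,d}

head = AuxDepthHead()
probs = head(torch.randn(1, 64, 72, 320))
print(probs.shape, float(probs.sum(dim=1).mean()))   # probabilities sum to ~1 per pixel
print(lid_bin_edges()[:5])                           # first (finest) bin edges
```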

3.2.3. Depth-Sensitive Prototype Representation Module

This module aims to refine feature representations through depth prototype learning and enhancement. Starting from the initial depth-sensitive feature map $F_{init} \in \mathbb{R}^{C \times H \times W}$, generated by the Backbone network, the module predicts the depth distribution $P_{depth} \in \mathbb{R}^{D \times H \times W}$ for each pixel, where $C$ is the feature dimension, $H$ and $W$ are spatial dimensions, and $D$ represents the number of depth bins. We extract spatial features $F_{init}^{i}$ from the input image and match them with their corresponding depth information using a trainable attention mechanism to obtain $Attention_{i,d}$. This attention weight measures the confidence of pixel $i$ at different depth ranges, allowing the network to adaptively emphasize relevant depth features without relying on manually designed priors.
The predicted probability represents the confidence of each pixel belonging to a specific depth bin, which is computed using softmax normalization. The ADF module employs D-Convolution (D-Conv) and Grouped Convolution (G-Conv) to refine depth-sensitive features. D-Conv expands the receptive field without increasing parameters by using dilation rates, while G-Conv reduces computational complexity by performing convolution in separate channel groups. Using the predicted depth distribution, the module estimates a depth prototype representation $P_d \in \mathbb{R}^{C}$ for each depth bin $d$ by aggregating features across all pixels:
$$P_d = \sum_{i=1}^{N} Attention_{i,d} \times F_{init}^{i}, \quad d \in \{1, \dots, D\},$$
where $P_d$ is the prototype for depth bin $d$, $N = H \times W$ represents the total number of spatial positions, and $Attention_{i,d}$ denotes the attention weight for pixel $i$ and depth bin $d$, derived from the similarity between $F_{init}$ and $P_{depth}$. All pixels contributing strongly to depth $d$ are combined into a single global depth prototype $P_d$. Intuitively, the model scans the entire image to collect spatial cues most representative of depth $d$. If certain pixels are more likely to be at depth $d$, they receive higher weights and thus dominate the resulting prototype.
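At the tensor level, the prototype computation above is an attention-weighted sum over all spatial positions. The short sketch below shows one plausible realization; the shapes and the softmax normalization over pixels are assumptions, not the exact implementation.

```python
import torch

B, C, D, H, W = 2, 64, 48, 36, 160          # assumed batch, channels, bins, grid
N = H * W
f_init = torch.randn(B, C, H, W)             # initial depth-sensitive features F_init
p_depth = torch.randn(B, D, H, W)            # per-pixel depth-bin scores P_depth

# Attention_{i,d}: here normalized over spatial positions so every prototype is a
# convex combination of pixel features (one plausible choice).
attn = torch.softmax(p_depth.flatten(2), dim=2)        # (B, D, N)
feats = f_init.flatten(2)                               # (B, C, N)

# P_d = sum_i Attention_{i,d} * F_init^i  for every depth bin d
prototypes = torch.einsum('bdn,bcn->bdc', attn, feats)  # (B, D, C)
print(prototypes.shape)
```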

3.2.4. Feature Enhancement with Depth Prototype

We redistribute the depth prototypes back to each pixel $i$ by weighting the prototypes $P_d$ again with $Attention_{i,d}$. In other words, each pixel gathers relevant cues from all depth levels based on its learned attention distribution. This re-projection step injects global depth context into local features, thereby strengthening the model's spatial–depth reasoning capability.
The depth prototypes are then projected back to enhance the feature map, creating an updated representation:
$$F_{enhanced}^{i} = \sum_{d=1}^{D} Attention_{i,d} \times P_d, \quad i \in \{1, \dots, N\}$$
Each pixel $i$ not only retains its local feature representation but also absorbs contextual information from the global depth prototypes, allowing for enhanced depth reasoning and spatial consistency. Specifically, each pixel $i$ retrieves information from every depth prototype $P_d$ by applying the same attention distribution $Attention_{i,d}$, so that its final representation $F_{enhanced}^{i} = \sum_{d=1}^{D} Attention_{i,d} \times P_d$ incorporates both its original local features and the global depth cues distilled into $P_d$. Physically, this second step “injects” scene-wide depth awareness back into each local position, reinforcing the model's three-dimensional perception.
Finally, the enhanced feature map $F_{enhanced} \in \mathbb{R}^{C \times H \times W}$ undergoes a convolutional refinement to integrate depth-sensitive information. As a result, the enhanced depth feature is generated by combining the initial depth-sensitive feature with the reconstructed features through concatenation. This mechanism ensures that the depth-sensitive prototypes effectively capture spatial–depth relationships, significantly improving feature quality for 3D object detection tasks.
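Continuing the sketch from Section 3.2.3, the prototypes are scattered back to every pixel with the same attention weights and then fused with the initial features by concatenation and a convolution. The 1 × 1 refinement layer and the tensor shapes are assumptions consistent with the description.

```python
import torch
import torch.nn as nn

B, C, D, H, W = 2, 64, 48, 36, 160
N = H * W
f_init = torch.randn(B, C, H, W)                         # initial depth-sensitive features
attn = torch.softmax(torch.randn(B, D, N), dim=2)        # stand-in for Attention_{i,d}
prototypes = torch.randn(B, D, C)                        # P_d from the prototype step

# F_enhanced^i = sum_d Attention_{i,d} * P_d  (re-projecting global depth context)
reconstructed = torch.einsum('bdn,bdc->bcn', attn, prototypes).view(B, C, H, W)

# Concatenate with the initial features and refine with a convolution (assumed 1x1).
refine = nn.Conv2d(2 * C, C, kernel_size=1)
f_enhanced = refine(torch.cat([f_init, reconstructed], dim=1))
print(f_enhanced.shape)                                  # (B, C, H, W)
```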

3.3. Depth Position Mapping and Transformer Integration

Inspired by the success of Transformer architectures in capturing long-range dependencies and modeling global relationships [29], we introduce the Depth Position Mapping (DPM) module as a key component of our framework. The DPM module is designed to address the challenges of integrating spatial and depth cues in monocular 3D object detection. Unlike traditional methods that rely solely on local visual features or predefined depth priors, DPM leverages a depth-guided mapping mechanism to explicitly encode positional depth information into the feature space. By embedding depth positions into learnable queries within an encoder–decoder structure, the module enables a precise alignment of spatial and depth representations. This design enhances the understanding of scene-level depth geometry while improving the quality of feature fusion for downstream 3D attribute prediction. The DPM module thus provides a robust and efficient solution for bridging the gap between depth estimation and object localization in monocular 3D object detection.
To further enhance these depth-sensitive features, we introduce the DepthFusion Transformer (DFT), which leverages self-attention to capture long-range spatial dependencies. By integrating local depth cues with a global spatial perspective, the model achieves more accurate depth reasoning, ultimately improving 3D object localization. This approach helps the model handle common challenges in real-world monocular 3D detection, such as occlusion and variable distances, by focusing on pertinent spatial regions and depth information. The encoder–decoder design within the DepthFusion Transformer (DFT) is pivotal in capturing long-range interactions between depth and spatial features. In the encoder, spatial features are processed via self-attention, while the decoder employs cross-attention to refine depth features against spatial context. This structure ensures the model can effectively capture global interactions throughout the entire image, boosting object localization even in scenarios with occlusion or diverse depth scales.

3.3.1. Transformer Encoder

The Transformer Encoder in the Depth Position Mapping (DPM) module plays a critical role in capturing global dependencies and refining feature representations by leveraging self-attention mechanisms. It enables the model to incorporate contextual information across the entire sequence, facilitating a deeper understanding of spatial and depth relationships. The encoder refines input features by employing a multi-head self-attention mechanism and a feed-forward neural network (FFN) [17,28]. Given the input feature tensor $X \in \mathbb{R}^{N \times L \times C}$, where $N$ is the batch size, $L$ is the sequence length, and $C$ is the feature dimensionality, the encoder projects $X$ into query $Q_f$, key $K_f$, and value $V_f$ matrices, where $Q_f, K_f, V_f \in \mathbb{R}^{N \times L \times C}$. These are further divided into $H$ attention heads, with each head having a dimensionality of $D_k = C/H$. The attention weights are computed as follows:
$$\mathrm{Attention}(Q_f, K_f) = \mathrm{Softmax}\!\left(\frac{Q_f K_f^{T}}{\sqrt{D_k}}\right),$$
enabling the encoder to capture long-range dependencies across the sequence. A linear attention mechanism enhances computational efficiency by transforming $Q_f$ and $K_f$ [29] using an ELU-based feature map. The output of the attention layer, $Z \in \mathbb{R}^{N \times L \times C}$, is combined with the input via a residual connection and normalized using layer normalization, $X_1 = \mathrm{Norm}(X + Z)$. This is followed by a two-layer FFN, $\mathrm{FFN}(X) = \mathrm{ReLU}(X W_1) W_2$, where $W_1$ and $W_2$ are learnable weight matrices. The FFN output is also added back to the input with a residual connection and normalized, $X_{out} = \mathrm{Norm}(X_1 + \mathrm{FFN}(X_1))$. The final output, $X_{out} \in \mathbb{R}^{N \times L \times C}$, integrates depth-sensitive global context and long-range dependencies, providing refined features for downstream tasks.
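One encoder layer can be sketched compactly with an ELU-based linear attention, in the spirit of linearized attention formulations; the single-head treatment, hidden sizes, and normalization details below are simplifying assumptions rather than the exact DFT encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linearized attention with an ELU+1 feature map: avoids the L x L matrix."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                  # phi(Q), phi(K) >= 0
    kv = torch.einsum('nlc,nld->ncd', k, v)             # sum_l phi(K)_l V_l^T : (B, C, C)
    z = 1.0 / (torch.einsum('nlc,nc->nl', q, k.sum(dim=1)) + eps)  # per-token normalizer
    return torch.einsum('nlc,ncd,nl->nld', q, kv, z)    # phi(Q) (phi(K)^T V), normalized

class EncoderLayer(nn.Module):
    def __init__(self, dim=256, ffn_dim=1024):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.proj = nn.Linear(dim, dim)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))

    def forward(self, x):                               # x: (B, L, C)
        z = self.proj(linear_attention(self.q(x), self.k(x), self.v(x)))
        x = self.norm1(x + z)                           # X_1 = Norm(X + Z)
        return self.norm2(x + self.ffn(x))              # X_out = Norm(X_1 + FFN(X_1))

layer = EncoderLayer()
print(layer(torch.randn(2, 1024, 256)).shape)           # (B, L, C)
```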

3.3.2. Transformer Decoder

The Transformer Decoder is designed to integrate depth-sensitive features with contextual information, enabling robust representation refinement for downstream tasks [29]. By employing a multi-layer structure, each layer consists of self-attention, cross-attention, and a feed-forward network. Self-attention operates on the input queries to capture internal dependencies, while cross-attention aligns these queries with encoder outputs, ensuring the effective fusion of spatial and contextual features. Specifically, cross-attention uses depth-sensitive features as queries and leverages contextual embeddings as keys and values, allowing the decoder to focus on relevant spatial–depth relationships. Each layer applies residual connections and normalization to stabilize feature updates, while a feed-forward network further enhances the refined representations. Through iterative processing across multiple layers, the decoder progressively aligns and integrates depth-sensitive features and contextual cues, resulting in task-specific predictions with enhanced accuracy and robustness.
The computational complexity of the transformer is as follows. Let the input feature map have spatial dimensions $H \times W$, batch size $N$, and per-position embedding dimension $C$, giving a flattened sequence length $L = H \times W$. Each attention layer uses $h$ heads with per-head dimension $D = C/h$. For each attention computation, the three linear projections contribute $3 N L C^2$ multiply-accumulate operations (MACs), while the linear attention mechanism avoids the explicit $L \times L$ similarity matrix and requires only about $N L (C + C^2/h)$ MACs, which is negligible compared to the projections. An output projection adds another $N L C^2$, resulting in a dominant per-attention complexity of approximately $4 N L C^2$. Each layer also includes a feed-forward network contributing $2 N L C^2$. Thus, an encoder layer (one attention + FFN) requires $6 N L C^2$ MACs, and a decoder layer (two attentions + FFN) requires $10 N L C^2$. The total complexity of the transformer is therefore approximately $16 N L C^2$, which scales linearly with $L$. In contrast, standard softmax attention incurs $O(N L^2 C)$ complexity due to the quadratic term in sequence length. This linear formulation yields substantial computational benefits for high-resolution inputs.
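The per-layer breakdown above reduces to a simple closed form; the helper below evaluates it for an example feature map size (an illustrative back-of-the-envelope calculation, not code from the paper).

```python
def transformer_macs(n, h_feat, w_feat, c):
    """Approximate MACs following the per-layer breakdown in Section 3.3.2."""
    l = h_feat * w_feat                  # flattened sequence length
    attention = 4 * n * l * c ** 2       # 3 input projections + output projection
    ffn = 2 * n * l * c ** 2
    encoder_layer = attention + ffn      # 6 * N * L * C^2
    decoder_layer = 2 * attention + ffn  # 10 * N * L * C^2
    return encoder_layer, decoder_layer, encoder_layer + decoder_layer  # 16 * N * L * C^2

enc, dec, total = transformer_macs(n=1, h_feat=72, w_feat=320, c=256)
print(f"encoder: {enc:.3e} MACs, decoder: {dec:.3e} MACs, total: {total:.3e} MACs")
```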

3.3.3. Depth Position Mapping (DPM) Module

The Depth Position Mapping (DPM) module embeds depth-related positional information to enhance the model's understanding of spatial-depth relationships, as shown in Figure 5. Given the input features $X \in \mathbb{R}^{B \times N \times C}$, where $B$ is the batch size, $N = H \times W$ is the number of spatial positions, and $C$ is the feature dimension, the features are reshaped into their spatial form $F \in \mathbb{R}^{B \times C \times H \times W}$. Based on previously predicted depth bins $D_b = \{d_1, \dots, d_D\} \in \mathbb{R}^{D \times C}$, where $D$ is the number of depth bins and $d_i$ represents the learnable embedding for the $i$-th depth category, the depth features are locally refined using a depth-wise separable convolution with a $3 \times 3$ kernel. The operation is defined as follows:
$$F' = \mathrm{Conv}_{3 \times 3}(F) + F$$
where $\mathrm{Conv}_{3 \times 3}$ integrates local depth information while maintaining computational efficiency. The resulting depth-sensitive feature map $F'$ is then flattened back to $\mathbb{R}^{B \times N \times C}$, ensuring alignment between the depth positional mappings and the spatial features. By embedding depth-sensitive positional cues through $D_b$, the DPM module improves the model's ability to capture 3D geometric structures, while preserving efficiency and scalability for downstream tasks.
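A minimal sketch of this refinement step: a depth-wise separable 3 × 3 convolution with a residual connection, followed by flattening back to token form. The depthwise/pointwise split and the channel width are assumptions consistent with the shapes given in the text.

```python
import torch
import torch.nn as nn

class DepthPositionMapping(nn.Module):
    """F' = Conv_{3x3}(F) + F with a depth-wise separable 3x3 convolution,
    then flattened back to (B, N, C) to align with the transformer tokens."""
    def __init__(self, channels=64):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x, hw):                   # x: (B, N, C), hw = (H, W) with N = H * W
        b, n, c = x.shape
        h, w = hw
        f = x.transpose(1, 2).reshape(b, c, h, w)
        f = self.pointwise(self.depthwise(f)) + f    # local depth refinement + residual
        return f.flatten(2).transpose(1, 2)          # back to (B, N, C)

dpm = DepthPositionMapping()
tokens = torch.randn(2, 36 * 160, 64)
print(dpm(tokens, hw=(36, 160)).shape)
```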

3.4. Loss Function

The proposed framework adopts a single-stage detector design, which directly predicts object bounding boxes and class probabilities using pre-defined 2D-3D anchors. This architecture is optimized for efficient object detection and depth estimation in a single forward pass. To handle challenges such as class imbalance and bounding box regression, the framework incorporates Focal Loss [30,31] for classification and Smooth L1 Loss [32,33] for bounding box regression. These loss functions are combined with a custom depth loss to ensure robust performance across all tasks.
For the classification task, we use Focal Loss to address class imbalance by down-weighting easy examples and emphasizing hard ones. For each anchor $i$, let $p_i$ denote the predicted probability that the anchor belongs to the positive class, and let $p_i^{gt} \in \{0, 1\}$ be the ground-truth label. The Focal Loss for a single anchor is defined as follows:
$$\mathrm{FocalLoss}(p_i, p_i^{gt}) = \begin{cases} -\alpha (1 - p_i)^{\gamma} \log(p_i), & \text{if } p_i^{gt} = 1, \\ -(1 - \alpha)\, p_i^{\gamma} \log(1 - p_i), & \text{if } p_i^{gt} = 0, \end{cases}$$
where $p_i$ is the predicted probability of anchor $i$ being positive, and $p_i^{gt}$ is the ground-truth label for anchor $i$ (either 0 or 1). $\alpha$ is a balancing factor that controls the importance of positive versus negative examples, and $\gamma$ is a focusing parameter that reduces the loss contribution from easy examples and focuses on harder ones. The loss is averaged over all positive anchors:
$$L_{cls} = \frac{1}{N_{pos}} \sum_{i=1}^{N_{pos}} \mathrm{FocalLoss}(p_i, p_i^{gt}),$$
where $N_{pos}$ is the number of positive anchors.
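A direct implementation of this focal classification term for binary anchor labels is sketched below; the alpha and gamma values are the common defaults, and summing over all anchors before normalizing by the number of positives is an assumption consistent with standard practice.

```python
import torch

def focal_loss(p, p_gt, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: piecewise term per anchor, summed over all anchors and
    normalized by the number of positives (RetinaNet-style normalization, assumed)."""
    p = p.clamp(eps, 1.0 - eps)
    pos_term = -alpha * (1.0 - p) ** gamma * torch.log(p)        # case p_gt = 1
    neg_term = -(1.0 - alpha) * p ** gamma * torch.log(1.0 - p)  # case p_gt = 0
    loss = torch.where(p_gt == 1, pos_term, neg_term)
    n_pos = (p_gt == 1).sum().clamp(min=1)                       # avoid division by zero
    return loss.sum() / n_pos

p = torch.tensor([0.9, 0.2, 0.7, 0.1])      # predicted positive-class probabilities
p_gt = torch.tensor([1, 0, 1, 0])           # ground-truth anchor labels
print(focal_loss(p, p_gt))
```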
For bounding box regression, we adopt Smooth L1 Loss. Let $\delta_i = b_i - b_i^{gt}$ be the difference between the predicted bounding box $b_i$ and the ground-truth bounding box $b_i^{gt}$, where $b_i$ and $b_i^{gt}$ represent the predicted and ground-truth bounding box parameters, respectively. The Smooth L1 Loss is defined as follows:
$$\mathrm{SmoothL1}(\delta_i) = \begin{cases} \dfrac{0.5\, \delta_i^{2}}{\beta}, & \text{if } |\delta_i| < \beta, \\ |\delta_i| - 0.5\, \beta, & \text{if } |\delta_i| \geq \beta, \end{cases}$$
where $\beta$ is a threshold that controls the transition between L1 and L2 loss. This loss is less sensitive to outliers compared to L2 loss and is computed as follows:
$$L_{reg} = \frac{1}{N_{pos}} \sum_{i=1}^{N_{pos}} \mathrm{SmoothL1}(b_i, b_i^{gt}),$$
where $N_{pos}$ is the number of positive anchors, and $\delta_i = b_i - b_i^{gt}$ is the difference between the predicted and ground-truth bounding box parameters.
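The regression term can be written directly from the piecewise definition; the beta value below is an assumed default (torch.nn.SmoothL1Loss exposes the same beta parameter).

```python
import torch

def smooth_l1(pred, target, beta=1.0 / 9.0):
    """Piecewise smooth L1: quadratic below beta, linear above, per box parameter."""
    diff = torch.abs(pred - target)                    # |delta_i|
    loss = torch.where(diff < beta,
                       0.5 * diff ** 2 / beta,         # L2-like region near zero
                       diff - 0.5 * beta)              # L1-like region for large errors
    return loss.sum(dim=-1).mean()                     # average over (positive) anchors

pred = torch.tensor([[0.10, 0.20, 1.50, 0.90]])        # predicted box parameters
target = torch.tensor([[0.00, 0.25, 1.00, 1.00]])      # ground-truth box parameters
print(smooth_l1(pred, target))
```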
In addition to the classification and regression losses, we introduce a depth estimation loss to enable depth-sensitive feature learning. To do this, we discretize the continuous depth range into $D$ bins, and the model outputs a probability distribution over these bins for each pixel. Let $d_{i,b}^{gt} \in [0, 1]$ denote the ground-truth probability for depth bin $b$ at pixel $i$, and $d_{i,b} \in [0, 1]$ denote the predicted probability for that bin. The depth loss is then formulated as follows:
$$L_{depth} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{b=1}^{D} w_{i,b}\, d_{i,b}^{gt} \log(d_{i,b}),$$
where $N$ is the total number of valid pixels, and $w_{i,b} = (1 - d_{i,b})^{\gamma}$ is a weighting factor that emphasizes harder-to-predict depth bins. The term $w_{i,b}$ follows the same principle as Focal Loss, where bins with higher uncertainty or difficulty are weighted more heavily. Depending on whether a one-hot or soft labeling strategy is used, $d_{i,b}^{gt}$ can be 1 for the true bin and 0 otherwise, or distributed across several bins in a soft fashion.
The overall training objective combines the three terms:
$$L_{total} = L_{cls} + \lambda_{reg} \times L_{reg} + \lambda_{depth} \times L_{depth},$$
where $\lambda_{reg}$ and $\lambda_{depth}$ control the weights of the regression and depth losses, respectively. During training, anchors with an intersection-over-union (IoU) greater than 0.5 with the ground-truth boxes are selected for optimization. By combining classification, regression, and depth estimation objectives, the proposed loss function ensures a balanced and effective training process. The use of Focal Loss and Smooth L1 Loss further enhances the robustness of the framework in handling challenging 3D object detection scenarios.
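A sketch of the depth term and the combined objective: the focal-style weight uses the predicted bin probability as the difficulty signal, the ground truth may be one-hot or soft, and the lambda values and placeholder classification/regression losses are illustrative assumptions.

```python
import torch

def depth_loss(pred_probs, gt_probs, gamma=2.0, eps=1e-7):
    """Focal-weighted cross-entropy over D depth bins.
    pred_probs, gt_probs: (N_pixels, D); gt may be one-hot or soft labels."""
    pred_probs = pred_probs.clamp(eps, 1.0)
    w = (1.0 - pred_probs) ** gamma                 # low-confidence (hard) bins weigh more
    return -(w * gt_probs * torch.log(pred_probs)).sum(dim=1).mean()

# Toy example: 4 pixels, 8 depth bins, one-hot ground truth.
pred = torch.softmax(torch.randn(4, 8), dim=1)
gt = torch.zeros(4, 8)
gt[torch.arange(4), torch.tensor([0, 3, 5, 7])] = 1.0

l_cls, l_reg = torch.tensor(0.8), torch.tensor(0.4)   # placeholders for the other terms
lam_reg, lam_depth = 1.0, 1.0                          # assumed loss weights
l_total = l_cls + lam_reg * l_reg + lam_depth * depth_loss(pred, gt)
print(l_total)
```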

4. Experiment and Analysis

4.1. Dataset

The KITTI dataset is a widely used benchmark for autonomous driving research, providing labeled data for tasks such as 3D object detection, stereo vision, and SLAM. It includes 7481 training images and 7518 test images, captured with stereo cameras and LiDAR in real-world driving scenarios. KITTI evaluates 3D object detection across three categories (Car, Pedestrian, Cyclist) and three difficulty levels (Easy, Moderate, Hard), making it a standard for testing perception algorithms. Consistent with prior work [34], we partitioned the 7481 training images into 3712 training samples and 3769 validation samples. The detailed distribution of object instances in this split is summarized in Table 1.

4.2. Evaluation Metrics

The KITTI dataset evaluates the performance of 3D object detection and bird's-eye view (BEV) detection using Average Precision (AP) as the primary metric. To ensure a more accurate evaluation, it adopts the $AP_{40}$ metric, which calculates the Average Precision across 40 evenly spaced recall positions, reducing the potential bias from the fewer recall points used in the original AP metric.
Detection tasks are divided into three difficulty levels—Easy, Moderate, and Hard—based on object attributes such as occlusion, truncation, and size. The dataset includes three object categories: Car, Pedestrian, and Cyclist. The evaluation employs intersection over union (IoU) thresholds of 0.7 for the Car category and 0.5 for Pedestrian and Cyclist.
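The 40-point metric can be computed by sampling the precision-recall curve at 40 evenly spaced recall thresholds and averaging the interpolated precision; the sketch below uses a toy curve and is an illustrative implementation of the KITTI-style AP_40, not the official evaluation code.

```python
import numpy as np

def ap_40(recall, precision):
    """Mean of the best precision at 40 evenly spaced recall points (1/40 ... 1.0)."""
    ap = 0.0
    for r in np.linspace(1.0 / 40.0, 1.0, 40):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 40.0

# Toy precision-recall curve for illustration only.
recall = np.linspace(0.0, 0.9, 50)
precision = np.clip(1.0 - 0.5 * recall, 0.0, 1.0)
print(f"AP_40 = {ap_40(recall, precision):.3f}")
```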

4.3. Implementation Details

The proposed AuxDepthNet framework was evaluated on the KITTI dataset, targeting the Car category. Input images were resized to a resolution of 1280 × 288. The model was trained for 120 epochs with a batch size of 16, using the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$ and no weight decay. A cosine annealing learning rate scheduler was employed, with the minimum learning rate set to $5 \times 10^{-6}$.
To enhance the robustness of the model, several data augmentation techniques were applied during training. These included photometric distortions (e.g., random adjustments to brightness, contrast, saturation, and hue), cropping the top 100 pixels of the image, random horizontal flipping with a probability of 0.5, resizing to the target resolution, and normalization using RGB mean values [0.485, 0.456, 0.406] and standard deviations [0.229, 0.224, 0.225].
For testing, augmentation was limited to cropping, resizing, and normalization without photometric distortions or flipping. The detection head employed an anchor-based design, with IoU thresholds of 0.5 for positive samples and 0.4 for negative samples. The loss function combined Focal Loss with Smooth L1 Loss to balance the classification and regression tasks. During inference, non-maximum suppression (NMS) was applied with an IoU threshold of 0.4, and predictions with a confidence score below 0.75 were discarded.
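The optimizer and scheduler settings above map directly onto standard PyTorch components; the snippet below is a minimal configuration sketch in which the model is a placeholder and the cosine annealing period is assumed to span the 120 training epochs.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)                    # placeholder for the AuxDepthNet model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=120, eta_min=5e-6)        # anneal to the stated minimum learning rate

# Normalization constants quoted in the text (ImageNet statistics).
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

for epoch in range(120):
    # ... one training pass with batch size 16, photometric augmentation, flipping ...
    scheduler.step()                           # step the schedule once per epoch
```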
To illustrate the convergence behavior of the model during training, we plotted the loss value across iterations. As shown in Figure 6, the training loss consistently decreases and stabilizes, demonstrating the effective optimization and convergence of the proposed framework.
As illustrated in Figure 6, the horizontal axis represents the number of iterations during training, while the vertical axis denotes the loss value. The curve demonstrates that the loss decreases rapidly in the early training phase and gradually stabilizes as training progresses. This behavior indicates that the model effectively learns from the data and reaches convergence after approximately 15,000 iterations. The stable and low final loss confirms the robustness and reliability of the training procedure.

4.4. Comparison with State-of-the-Art Methods

To evaluate the effectiveness of AuxDepthNet, we compare its performance with state-of-the-art monocular 3D object detection methods on the KITTI 3D Object Detection Benchmark, as shown in Table 2. The benchmark measures Average Precision (AP) under an IoU threshold of 0.7 for both 3D detection ($AP_{3D}$) and bird's-eye view detection ($AP_{BEV}$) across three levels: Easy, Moderate, and Hard.
From Table 2, AuxDepthNet achieves competitive results. For 3D detection, it obtains an AP of 24.72, 18.63, and 15.31 on Easy, Moderate, and Hard levels, respectively. Compared to MonoUNI with 24.75, 16.73, and 13.43, our method improves the Moderate and Hard scores by 1.90 and 1.88 points. Similarly, AuxDepthNet outperforms Cube R-CNN, which achieves 23.59, 15.01, and 12.65, across all three categories.
For BEV detection, AuxDepthNet achieves an AP of 34.11, 25.18, and 21.90. Compared to MonoATT, which achieves 36.87, 24.42, and 21.88, our method improves the Moderate score by 0.76 points while maintaining competitive performance on other levels.
Overall, AuxDepthNet surpasses MonoUNI, Cube R-CNN, and MonoATT, especially on Moderate and Hard levels. These results demonstrate the effectiveness of our Depth Position Mapping (DPM) module in capturing depth-sensitive features and enhancing 3D detection accuracy on the KITTI benchmark.

4.5. Ablation Studies and Analysis

4.5.1. Effectiveness of Each Proposed Component

To assess the impact of each component in the AuxDepthNet framework, we conducted ablation studies highlighting their contributions to overall performance. (a) The baseline utilizes only context-aware features for 3D object detection, without incorporating depth-sensitive modules. (b) Context-aware and depth-sensitive features were integrated using a convolutional concatenate operation as an alternative to the depth-sensitive transformer module. (c) Depth-sensitive features in the transformer were replaced with object queries similar to DETR, creating a baseline with a DETR-like transformer. (d) Depth-sensitive features were replaced with those extracted from depth images generated by a pretrained depth estimator (DORN), instead of being learned in an end-to-end manner.
The results in Table 3 highlight several key observations: substituting object queries in the transformer with depth-sensitive features (c→e) significantly enhances depth representation, resulting in notable performance improvements. Furthermore, using features derived from a pretrained depth estimator (d) underperforms compared to the end-to-end approach in our full framework (e), emphasizing the benefits of directly learning depth-sensitive features. Our depth-sensitive transformer module (e) effectively combines context- and depth-sensitive features, outperforming the simpler convolutional concatenation method (b). Additionally, the inclusion of the depth prototype enhancement module (d→e) further refines detection accuracy, particularly under more challenging scenarios. Ultimately, the complete AuxDepthNet model (e) delivers marked advancements across all levels of difficulty relative to the baseline (a), demonstrating the combined contributions of the proposed modules to robust and precise 3D object detection.

4.5.2. The Impact of Different Backbones

To further evaluate the impact of different Backbone networks on the performance of 3D object detection for the Car category in the KITTI validation set, we conducted an ablation study comparing various backbones. Specifically, we tested DLA-102, DLA-102x2, DLA-60, DenseNet, ResNet, and our proposed AuxDepthNet (Ours). Among these, our framework employs DLA-102 as the default Backbone due to its balanced trade-off between efficiency and accuracy. The results demonstrate that while other backbones provide competitive performance, DLA-102 achieves superior precision, validating its suitability for our approach. The results are shown in Table 4.
The ablation study highlights the impact of different backbones on KITTI's Car category detection. DLA-102 achieves a strong balance between efficiency and accuracy, while DLA-102x2 slightly outperforms it, particularly in the Easy and Moderate settings. DenseNet and ResNet provide competitive results but fall short of DLA-based models, and DLA-60 demonstrates limited performance due to reduced capacity. Notably, our proposed AuxDepthNet achieves the highest accuracy, with an $AP_{3D}$ of 18.63 (Moderate) and an $AP_{BEV}$ of 25.18 (Moderate), showcasing its superior depth-sensitive feature extraction capabilities for monocular 3D detection. Overall, the AuxDepthNet configuration outperforms the conventional backbones, highlighting the effectiveness of its design in enhancing depth-sensitive feature representation.

4.5.3. The Impact of Different Dilation Rates in the ADF Module

To investigate how various dilation rates in the ADF module affect 3D object detection performance, we conducted ablation experiments on the validation set of the KITTI dataset, focusing on the Car category. Specifically, we set the dilation rate to 1 (standard convolution, no dilation), 2, 4 (ours), 8, or 16, all under the same network architecture and training hyperparameters, and evaluated the $AP_{3D}$ (Average Precision in 3D) metric. As shown in Figure 7, different dilation rate configurations yielded varying results across the Easy, Moderate, and Hard difficulty levels. Notably, a dilation rate of 4 achieves the highest detection accuracy, reaching 24.72%, 18.63%, and 15.31% on Easy, Moderate, and Hard, respectively, outperforming the other settings. These findings suggest that using a suitable dilation rate helps capture more effective multi-scale features, thereby significantly enhancing the accuracy of 3D object detection. Hence, we chose a dilation rate of 4 for all subsequent experiments to achieve optimal performance.

5. Visualization

Figure 8 presents qualitative examples from the KITTI validation set. Compared to the baseline model without depth-sensitive modules, AuxDepthNet predictions align much more closely with the ground truth, demonstrating the effectiveness of the proposed depth-sensitive modules in enhancing object localization accuracy.

6. Conclusions

In this study, we proposed AuxDepthNet, a framework for real-time monocular 3D object detection that eliminates the need for external depth maps or pre-trained depth models. By introducing the Auxiliary Depth Feature Module (ADF) and the Depth Position Mapping Module (DPM), AuxDepthNet effectively learned depth-sensitive features and integrated depth positional information, enhancing spatial reasoning with minimal computational cost. Built on the DepthFusion Transformer architecture, the framework demonstrated robust performance in object localization and 3D bounding box regression.
Although AuxDepthNet does not rely on any externally pre-computed depth maps or pre-trained depth estimation networks, it still requires dataset-specific information, such as camera calibration parameters and anchor box statistics derived from the KITTI dataset. These configurations are standard hyper-parameters that can be re-estimated for other datasets with appropriate 3D annotations and calibration data. Therefore, the framework remains general in its architecture and learning process, without depending on handcrafted priors unique to KITTI.
Despite its effectiveness, AuxDepthNet has limitations, including its focus on the KITTI dataset and potential challenges in generalizing to diverse environments. Future work will explore broader dataset adaptation, improve robustness under extreme conditions, and optimize computational efficiency for edge applications.

Author Contributions

Conceptualization, R.Z. and H.-S.C.; methodology, R.Z. and D.J.; software, R.Z. and S.-K.J.; validation, P.H.N.A. and Z.Z.; formal analysis, R.Z. and D.J.; data curation, D.J. and S.-K.J.; writing—original draft preparation, R.Z.; writing—review and editing, R.Z.; visualization, P.H.N.A. and Z.Z.; supervision, H.-S.C.; funding acquisition, H.-S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Korea Institute of Marine Science & Technology Promotion (KIMST), funded by the Korea Coast Guard (RS-2021-KS211488). It was also supported by the Unmanned Vehicles Core Technology Research and Development Program through the National Research Foundation of Korea (NRF) and the Unmanned Vehicle Advanced Research Center (UVARC), funded by the Ministry of Science and ICT, the Republic of Korea (NRF-2020M3C1C1A02086321).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, C.; Zeng, H.; Huang, J.; Hua, X.-S.; Zhang, L. Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11873–11882. [Google Scholar]
  2. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  3. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  4. Li, P.; Chen, X.; Shen, S. Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7644–7652. [Google Scholar]
  5. Wang, Y.; Chao, W.-L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar]
  6. Brazil, G.; Pons-Moll, G.; Liu, X.; Schiele, B. Kinematic 3d object detection in monocular video. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XXIII 16; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 135–152. [Google Scholar]
  7. Chen, Y.; Tai, L.; Sun, K.; Li, M. Monopair: Monocular 3d object detection using pairwise spatial relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12093–12102. [Google Scholar]
  8. Ku, J.; Pon, A.D.; Waslander, S.L. Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11867–11876. [Google Scholar]
  9. Shen, F.; Xie, Y.; Zhu, J.; Zhu, X.; Zeng, H. Git: Graph interactive transformer for vehicle re-identification. IEEE Trans. Image Process. 2023, 32, 1039–1051. [Google Scholar] [CrossRef] [PubMed]
  10. Shen, F.; Shu, X.; Du, X.; Tang, J. Pedestrian-specific bipartite-aware similarity learning for text-based person retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, New York, NY, USA, 27 October 2023; pp. 8922–8931. [Google Scholar]
  11. Ma, X.; Wang, Z.; Li, H.; Zhang, P.; Ouyang, W.; Fan, X. Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6851–6860. [Google Scholar]
  12. Weng, X.; Kitani, K. Monocular 3d object detection with pseudo-lidar point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  13. Ding, M.; Huo, Y.; Yi, H.; Wang, Z.; Shi, J.; Lu, Z.; Luo, P. Learning depth-guided convolutions for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 1000–1001. [Google Scholar]
  14. Ouyang, E.; Zhang, L.; Chen, M.; Arnab, A.; Fu, Y. Dynamic depth fusion and transformation for monocular 3d object detection. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  15. Kim, B.; Lee, J.; Kang, J.; Kim, E.-S.; Kim, H.J. Hotr: End-to-end human-object interaction detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 74–83. [Google Scholar]
  16. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020. [Google Scholar] [CrossRef]
  17. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  18. Wang, L.; Du, L.; Ye, X.; Fu, Y.; Guo, G.; Xue, X.; Feng, J.; Zhang, L. Depth-conditioned dynamic message propagation for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 454–463. [Google Scholar]
  19. Xu, B.; Chen, Z. Multi-level fusion based 3d object detection from monocular images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2345–2353. [Google Scholar]
  20. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
  21. Reading, C.; Harakeh, A.; Chae, J.; Waslander, S.L. Categorical depth distribution network for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8555–8564. [Google Scholar]
  22. Zhang, R.; Qiu, H.; Wang, T.; Guo, Z.; Cui, Z.; Qiao, Y.; Li, H.; Gao, P. Monodetr: Depth-guided transformer for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 9155–9166. [Google Scholar]
  23. Yang, F.; He, X.; Chen, W.; Zhou, P.; Li, Z. MonoPSTR: Monocular 3D Object Detection with Dynamic Position & Scale-aware Transformer. IEEE Trans. Instrum. Meas. 2024, 73, 5028313. [Google Scholar]
  24. Shen, F.; Ye, H.; Zhang, J.; Wang, C.; Han, X.; Yang, W. Advancing pose-guided image synthesis with progressive conditional diffusion models. arXiv 2023, arXiv:2310.06313. [Google Scholar]
  25. Shen, F.; Ye, H.; Liu, S.; Zhang, J.; Wang, C.; Han, X.; Yang, W. Boosting consistency in story visualization with rich-contextual conditional diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Madrid, Spain, 20–22 October 2025; Volume 39, pp. 6785–6794. [Google Scholar]
  26. Shen, F.; Jiang, X.; He, X.; Ye, H.; Wang, C.; Du, X.; Li, Z.; Tang, J. Imagdressing-v1: Customizable virtual dressing. In Proceedings of the AAAI Conference on Artificial Intelligence, Madrid, Spain, 20–22 October 2025; Volume 39, pp. 6795–6804. [Google Scholar]
  27. Shen, F.; Tang, J. Imagpose: A unified conditional framework for pose-guided person generation. Adv. Neural Inf. Process. Syst. 2024, 37, 6246–6266. [Google Scholar]
  28. Zou, C.; Wang, B.; Hu, Y.; Liu, J.; Wu, Q.; Zhao, Y.; Li, B.; Zhang, C.; Zhang, C.; Wei, Y.; et al. End-to-end human object interaction detection with hoi transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11825–11834. [Google Scholar]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 20750–20762. [Google Scholar]
  30. Ross, T.Y.; Dollár, G. Focal loss for dense object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2980–2988. [Google Scholar]
  31. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  32. Wei, L.; Zheng, C.; Hu, Y. Oriented object detection in aerial images based on the scaled smooth L1 loss function. Remote Sens. 2023, 15, 1350. [Google Scholar] [CrossRef]
  33. Liu, C.; Yu, S.; Yu, M.; Wei, B.; Li, B.; Li, G.; Huang, W. Adaptive smooth L1 loss: A better way to regress scene texts with extreme aspect ratios. In Proceedings of the 2021 IEEE Symposium on Computers and Communications (ISCC), IEEE, Athens, Greece, 5–8 September 2021; pp. 1–7. [Google Scholar]
  34. Chen, X.; Kundu, K.; Zhu, Y.; Ma, H.; Fidler, S.; Urtasun, R. 3d object proposals using stereo imagery for accurate object class detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1259–1272. [Google Scholar] [CrossRef] [PubMed]
  35. Zou, Z.; Ye, X.; Du, L.; Cheng, X.; Tan, X.; Zhang, L.; Feng, J.; Xue, X.; Ding, E. The devil is in the task: Exploiting reciprocal appearance-localization features for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2713–2722. [Google Scholar]
  36. Zhou, Y.; He, Y.; Zhu, H.; Wang, C.; Li, H.; Jiang, Q. Monocular 3d object detection: An extrinsic parameter free approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7556–7566. [Google Scholar]
  37. Zhang, Y.; Lu, J.; Zhou, J. Objects are different: Flexible monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3289–3298. [Google Scholar]
  38. Lu, Y.; Ma, X.; Yang, L.; Zhang, T.; Liu, Y.; Chu, Q.; Yan, J.; Ouyang, W. Geometry uncertainty projection network for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3111–3121. [Google Scholar]
  39. Huang, K.-C.; Wu, T.-H.; Su, H.-T.; Hsu, W.H. Monodtr: Monocular 3d object detection with depth-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4012–4021. [Google Scholar]
  40. Lian, Q.; Li, P.; Chen, X. Monojsg: Joint semantic and geometric cost volume for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1070–1079. [Google Scholar]
  41. Kumar, A.; Brazil, G.; Corona, E.; Parchami, A.; Liu, X. Deviant: Depth equivariant network for monocular 3d object detection. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 664–683. [Google Scholar]
  42. Li, Y.; Chen, Y.; He, J.; Zhang, Z. Densely constrained depth estimator for monocular 3d object detection. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 718–734. [Google Scholar]
  43. Peng, L.; Wu, X.; Yang, Z.; Liu, H.; Cai, D. Did-m3d: Decoupling instance depth for monocular 3d object detection. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 71–88. [Google Scholar]
  44. Brazil, G.; Kumar, A.; Straub, J.; Ravi, N.; Johnson, J.; Gkioxari, G. Omni3d: A large benchmark and model for 3d object detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13154–13164. [Google Scholar]
  45. Jinrang, J.; Li, Z.; Shi, Y. Monouni: A unified vehicle and infrastructure-side monocular 3d object detection network with sufficient depth clues. Adv. Neural Inf. Process. Syst. 2023, 36, 11703–11715. [Google Scholar]
  46. Zhou, Y.; Zhu, H.; Liu, Q.; Chang, S.; Guo, M. Monoatt: Online monocular 3d object detection with adaptive token transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17493–17503. [Google Scholar]
  47. Zhang, J.; Li, J.; Lin, X.; Zhang, W.; Tan, X.; Han, J.; Ding, E.; Wang, J.; Li, G. Decoupled Pseudo-Labeling for Semi-Supervised Monocular 3D Object Detection. arXiv 2024. [Google Scholar] [CrossRef]
  48. Shi, P.; Dong, X.; Ge, R.; Liu, Z.; Yang, A. Dp-M3D: Monocular 3D object detection algorithm with depth perception capability. Knowl. Based Syst. 2025, 318, 113539. [Google Scholar] [CrossRef]
  49. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
Figure 2. The overall framework of our proposed method. AuxDepthNet enhances monocular 3D object detection by integrating depth-sensitive, context-sensitive, and depth-guided features. The Auxiliary Depth Feature (ADF) module encodes depth-related cues without pre-computed depth maps. Context-sensitive features are refined by a feature pyramid and DepthFusion Transformer (DFT), providing semantic and spatial context. The Depth Position Mapping (DPM) module embeds depth-based positional information for precise 3D localization. This integration captures local and global spatial relationships efficiently, delivering robust 2D and 3D detection.
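To make the data flow of Figure 2 concrete, the following minimal PyTorch-style sketch wires the components together; the module interfaces, tensor shapes, and names (`AuxDepthNetSketch`, `head_2d`, `head_3d`) are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class AuxDepthNetSketch(nn.Module):
    """Illustrative wiring of the pipeline in Figure 2:
    backbone -> ADF (depth-sensitive features) -> DPM (depth positional
    encoding) -> DFT (depth-guided fusion) -> 2D/3D detection heads.
    All sub-modules are placeholders with assumed interfaces."""

    def __init__(self, backbone, adf, dft, dpm, head_2d, head_3d):
        super().__init__()
        self.backbone = backbone   # e.g., a DLA-style feature extractor
        self.adf = adf             # Auxiliary Depth Feature module
        self.dft = dft             # DepthFusion Transformer
        self.dpm = dpm             # Depth Position Mapping
        self.head_2d = head_2d     # 2D classification / box head
        self.head_3d = head_3d     # 3D bounding-box regression head

    def forward(self, image):
        feats = self.backbone(image)                 # context-sensitive features
        depth_feats, depth_logits = self.adf(feats)  # implicit depth cues
        pos = self.dpm(depth_logits)                 # depth-based positional encoding
        fused = self.dft(feats, depth_feats, pos)    # depth-guided global fusion
        return self.head_2d(fused), self.head_3d(fused), depth_logits
```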
Figure 3. Architecture of the Auxiliary Depth Feature Module. (a) Generate the initial depth-sensitive feature $F_{\mathrm{init}}$ and determine the depth distribution $P_{\mathrm{depth}}$. (b) $P[d]$ represents the feature representation of the depth prototype. (c) The depth-prototype-enhanced feature $F_{\mathrm{enhanced}}$ is generated and fused with $F_{\mathrm{init}}$.
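A minimal sketch of the prototype-based enhancement described in Figure 3, assuming a softmax depth distribution over discrete bins and distribution-weighted pooling; the channel count, bin count, and layer choices are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryDepthFeature(nn.Module):
    """Sketch of the ADF idea in Figure 3 (shapes and bin count assumed)."""

    def __init__(self, channels=256, num_bins=96):
        super().__init__()
        self.init_conv = nn.Conv2d(channels, channels, 3, padding=1)  # -> F_init
        self.depth_head = nn.Conv2d(channels, num_bins, 1)            # -> P_depth logits

    def forward(self, feats):
        f_init = F.relu(self.init_conv(feats))        # (B, C, H, W)
        depth_logits = self.depth_head(f_init)        # (B, D, H, W)
        p_depth = depth_logits.softmax(dim=1)         # per-pixel depth distribution

        # (b) depth prototypes P[d]: distribution-weighted pooling of F_init per bin
        b, c, h, w = f_init.shape
        flat_f = f_init.flatten(2)                    # (B, C, HW)
        flat_p = p_depth.flatten(2)                   # (B, D, HW)
        prototypes = torch.einsum('bdn,bcn->bdc', flat_p, flat_f)
        prototypes = prototypes / (flat_p.sum(-1, keepdim=True) + 1e-6)

        # (c) redistribute prototypes to pixels and fuse with F_init
        f_enhanced = torch.einsum('bdn,bdc->bcn', flat_p, prototypes).view(b, c, h, w)
        return f_init + f_enhanced, depth_logits
```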
Figure 4. Backbone network architecture and multi-scale feature extraction.
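As a rough illustration of the multi-scale feature extraction in Figure 4, the sketch below implements a generic top-down feature pyramid; the stage channel counts and fusion scheme are assumptions standing in for the actual backbone aggregation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFeaturePyramid(nn.Module):
    """Generic top-down feature pyramid over backbone stages (cf. Figure 4);
    channel counts and the number of stages are illustrative assumptions."""

    def __init__(self, in_channels=(128, 256, 512), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        # feats: backbone stage outputs ordered from high to low resolution
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down pathway
            upsampled = F.interpolate(laterals[i + 1],
                                      size=laterals[i].shape[-2:], mode='nearest')
            laterals[i] = laterals[i] + upsampled
        return [s(x) for s, x in zip(self.smooth, laterals)]
```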
Figure 5. Overview of the proposed Depth Position Mapping (DPM) module. This process aligns spatial features with depth information, enhancing the model’s 3D geometric understanding.
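The sketch below illustrates one way the depth positional information of Figure 5 could be realized, assuming a learned embedding per depth bin and a soft (expectation-based) lookup driven by the ADF depth distribution; the embedding table and its dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthPositionMapping(nn.Module):
    """Sketch of the DPM idea in Figure 5: turn per-pixel depth distributions
    into positional encodings that can be injected into the transformer."""

    def __init__(self, num_bins=96, embed_dim=256):
        super().__init__()
        self.depth_embed = nn.Embedding(num_bins, embed_dim)  # one vector per bin

    def forward(self, depth_logits):
        # depth_logits: (B, D, H, W) from the ADF module
        p_depth = depth_logits.softmax(dim=1)                 # (B, D, H, W)
        # soft positional encoding: expectation over bin embeddings
        pos = torch.einsum('bdhw,dc->bchw', p_depth, self.depth_embed.weight)
        return pos                                            # (B, C, H, W)
```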
Figure 6. Training loss curve during optimization. The figure shows that the loss steadily decreases and converges, confirming the stability of the training process.
Figure 7. Comparison of $AP_{3D}$ detection accuracy for the Car category on the KITTI validation set when the standard convolution in the ADF module is replaced by dilated convolutions with dilation rates of 2, 4, 8, and 16. (a) Easy. (b) Mod. (c) Hard.
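For reference, swapping the standard 3 × 3 convolution for a dilated one while preserving the feature-map resolution only requires matching the padding to the dilation rate, as in the helper below (the function name and its placement inside the ADF module are assumptions).

```python
import torch.nn as nn

def make_adf_conv(channels: int, dilation: int = 1) -> nn.Conv2d:
    """3x3 convolution for the ADF branch; setting padding equal to the
    dilation rate keeps the spatial resolution unchanged, so rates 2, 4, 8,
    and 16 (as ablated in Figure 7) are drop-in replacements for rate 1."""
    return nn.Conv2d(channels, channels, kernel_size=3,
                     padding=dilation, dilation=dilation)
```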
Figure 8. Qualitative results on the KITTI dataset. (a) Bird’s-eye view of detection results, where blue represents the ground truth (GT), red denotes predictions from the AuxDepthNet model, and green indicates baseline (without depth-sensitive modules) predictions. (b) Predicted results from the AuxDepthNet model visualized in the image view.
Table 1. Data distribution of the KITTI training and validation splits used for monocular 3D detection.
| Dataset | Images | Car Instances | Pedestrian Instances | Cyclist Instances |
|---|---|---|---|---|
| Training | 3712 | 14,357 | 2207 | 734 |
| Validation | 3769 | 14,385 | 2280 | 893 |
Table 2. Detection performance for the Car category on the KITTI validation set; the best and second-best results are highlighted in red and blue, respectively.
| Method | AP3D@IoU=0.7 Easy | Mod. | Hard | APBEV@IoU=0.7 Easy | Mod. | Hard |
|---|---|---|---|---|---|---|
| DDMP-3D [18] | 19.71 | 12.78 | 9.80 | 28.08 | 17.89 | 13.44 |
| CaDDN [21] | 19.17 | 13.41 | 11.46 | 27.94 | 18.91 | 17.19 |
| DFRNet [35] | 19.40 | 13.63 | 10.35 | 28.17 | 19.17 | 14.84 |
| MonoEF [36] | 21.29 | 13.87 | 11.71 | 29.03 | 19.70 | 17.26 |
| MonoFlex [37] | 19.94 | 13.89 | 12.07 | 28.23 | 19.75 | 16.89 |
| GUPNet [38] | 20.11 | 14.20 | 11.77 | - | - | - |
| MonoDTR [39] | 21.99 | 15.39 | 12.73 | 28.59 | 20.38 | 17.14 |
| MonoJSG [40] | 24.69 | 16.14 | 13.64 | 32.59 | 21.26 | 18.18 |
| DEVIANT [41] | 21.88 | 14.46 | 11.89 | 29.65 | 20.44 | 17.43 |
| DCD [42] | 23.94 | 17.38 | 15.32 | 32.55 | 21.50 | 18.25 |
| DID-M3D [43] | 24.40 | 16.29 | 13.75 | 32.95 | 22.76 | 19.83 |
| Cube R-CNN [44] | 23.59 | 15.01 | 12.56 | 31.70 | 21.20 | 18.43 |
| MonoUNI [45] | 24.75 | 16.73 | 13.49 | 33.28 | 23.05 | 19.39 |
| MonoATT [46] | 24.72 | 17.37 | 15.00 | 36.87 | 24.42 | 21.88 |
| DPL [47] | 24.19 | 16.67 | 13.83 | 33.16 | 22.12 | 18.74 |
| DPM3D [48] | 23.41 | 13.65 | 12.91 | 32.23 | 20.13 | 17.14 |
| AuxDepthNet (Ours) | 24.72 | 18.63 | 15.31 | 34.11 | 25.18 | 21.90 |
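The AP values in Tables 2–4 follow the KITTI evaluation protocol; the sketch below shows interpolated average precision in the spirit of that metric, assuming the 40-recall-point (R40) variant and precomputed precision–recall pairs, and is not the official KITTI evaluation code.

```python
import numpy as np

def interpolated_ap(recalls, precisions, num_points=40):
    """Interpolated average precision in the spirit of the KITTI metric
    (assuming the 40-recall-point variant). `recalls` and `precisions`
    are matched 1-D arrays for one class and difficulty level."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    ap = 0.0
    # sample recall thresholds; skipping r = 0 is an assumption about R40
    for r in np.linspace(1.0 / num_points, 1.0, num_points):
        mask = recalls >= r
        # interpolated precision: best precision at any recall >= r
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / num_points
    return 100.0 * ap  # reported as a percentage, as in Table 2

# Example: interpolated_ap([0.1, 0.5, 0.9], [0.95, 0.80, 0.60])
```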
Table 3. Analysis of AuxDepthNet components on the KITTI validation set (Car category).
| Index | Ablation | AP3D@IoU=0.7 Easy | Mod. | Hard |
|---|---|---|---|---|
| (a) | Baseline | 19.50 | 15.51 | 12.66 |
| (b) | w/o depth prototype enhancement | 23.91 | 18.27 | 15.21 |
| (c) | Depth-sensitive feature → object query | 20.25 | 16.15 | 13.89 |
| (d) | Depth-sensitive feature → DORN [49] | 24.27 | 17.15 | 13.84 |
| (e) | AuxDepthNet (full model) | 24.72 | 18.63 | 15.31 |
Table 4. Comparison of different backbones on the KITTI validation set for the Car category; the table presents the performance ($AP_{3D}$ and $AP_{BEV}$ @ IoU = 0.7) for various backbones, including our proposed AuxDepthNet.
| Backbone | AP3D@IoU=0.7 Easy | Mod. | Hard | APBEV@IoU=0.7 Easy | Mod. | Hard |
|---|---|---|---|---|---|---|
| DLA-102 | 24.40 | 18.55 | 15.30 | 34.02 | 25.11 | 21.24 |
| DLA-102x2 | 24.59 | 18.56 | 15.25 | 34.01 | 25.13 | 21.32 |
| DLA-60 | 22.85 | 17.37 | 14.33 | 32.14 | 23.64 | 19.92 |
| DenseNet | 24.24 | 18.04 | 15.14 | 33.01 | 24.69 | 21.48 |
| ResNet | 24.09 | 17.98 | 15.15 | 32.97 | 24.39 | 21.33 |
| Ours | 24.72 | 18.63 | 15.31 | 34.11 | 25.18 | 21.90 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
