3.1. Overall Framework of MonoAMP Network
The overall framework of our proposed method is illustrated in
Figure 1. The multi-task detection head comprises two main branches. The keypoint branch estimates the likelihood of vehicle existence at the current location, generating a heatmap of vehicle center points. The regression branch predicts key geometric attributes, including depth, dimensions, and orientation of objects.
The backbone network utilizes DLA-34 [42] to extract features from images and incorporates triplet attention [43] to enhance the key feature extraction capability through collaborative interactions across different dimensions. The three-branch attention structure establishes cross-dimensional attention dependency modeling through rotational transformations and residual mappings. This structure constructs attention maps from three orthogonal planes and enhances the discriminative representation of features through adaptive attention weight computations based on dimensional decomposition, ultimately achieving efficient cross-dimensional attention interaction and fusion. The tail of the backbone network is connected to an adaptive multi-order perceptual aggregation (AMPA) module, which effectively aggregates multi-order contextual information, fully utilizes inter-channel correlations, and enhances the model's feature selection capability.
Subsequently, a multi-scale feature hierarchy is constructed by hierarchically upsampling the features processed by the AMPA. We obtain feature maps with resolutions of 1/16, 1/8, and 1/4 of the input image size, respectively. We employ feature alignment operations to achieve a unified representation of multi-scale features, and feature concatenation is used to establish a joint semantic expression across multiple scales. During the feature fusion stage, we improve detection capability for multi-scale vehicles and address issues such as blurred object boundaries.

For the heatmaps output by the keypoint branch, a set of response values is generated by applying a two-dimensional Gaussian kernel centered at the actual target point locations. The ground truth value of the vehicle center point on the heatmap corresponds to the pixel-level maximum response generated at the Gaussian kernel location. For depth prediction in the regression branch, an uncertainty-guided depth ensemble is designed, which adaptively fuses multiple depth estimation methods through weighted aggregation, thereby improving the model's depth estimation precision and robustness. The regression vector of this network primarily includes center point offsets, 3D box vertex offsets, vehicle residual scales, orientation angles, direct depth, uncertainty estimates, and geometric depth, among others.
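To make the heatmap ground truth concrete, the following is a minimal NumPy sketch of how CenterNet-style Gaussian targets of the kind described above could be generated; the radius cutoff, sigma value, and resolution are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def draw_gaussian_target(heatmap, center, sigma):
    """Splat a 2D Gaussian peak onto `heatmap` around the integer `center`.

    heatmap: (H, W) array for one class; center: (cx, cy) in heatmap pixels.
    The peak value at the center is 1.0, matching the maximum-response
    ground truth described in the text.
    """
    h, w = heatmap.shape
    radius = int(3 * sigma)  # illustrative cutoff; the paper does not state one
    cx, cy = int(center[0]), int(center[1])

    for y in range(max(0, cy - radius), min(h, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(w, cx + radius + 1)):
            g = np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            # Keep the element-wise maximum so overlapping objects do not
            # erase each other's peaks.
            heatmap[y, x] = max(heatmap[y, x], g)
    return heatmap

# Usage: one 96x320 heatmap (1/4 of a 384x1280 input), a vehicle centered at (100, 40).
hm = np.zeros((96, 320), dtype=np.float32)
hm = draw_gaussian_target(hm, center=(100, 40), sigma=2.0)
```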
3.2. Triplet Attention Mechanism
Existing attention mechanisms typically model feature relationships within a single dimension. SENet [44] adjusts the weights of different channels through channel attention, enhancing relevant features while suppressing irrelevant or redundant ones. However, SENet overlooks spatial information and fails to account for the correlations between different spatial locations. In contrast, CBAM [22] successfully demonstrates the importance of capturing both channel and spatial attention. Although CBAM improves performance, it does not consider cross-dimensional interactions. In monocular 3D object detection, the model needs to infer the spatial location, keypoints, pose, and size of objects from 2D images, so it must accurately capture both local and global features during the feature extraction phase. Unlike traditional 2D detection, this task requires more refined feature representations, particularly for estimating the depth, spatial pose, and scale variations of objects. The key to enhancing performance therefore lies in effectively capturing the interaction between spatial and channel information. The triplet attention mechanism addresses cross-dimensional interactions through three parallel branches: two branches model the interaction between the channel dimension (C) and one spatial dimension (H or W), while the third captures spatial (H, W) dependencies. This design enables complementary enhancement of information across dimensions. In principle, triplet attention can significantly improve model performance, especially for monocular 3D object detection tasks that require fine-grained feature representation.
In this paper, we introduce triplet attention [43] and integrate it into the backbone network to capture the mutual dependencies between the channel and spatial dimensions. Specifically, we integrate triplet attention (TA) into the deeper levels of the DLA-34 network, Level 4 and Level 5, for cross-dimensional information modeling, since these layers typically extract high-level semantic features. TA effectively enhances the fine-grained interactions between features, providing the fine-grained feature representations required for the task. It strengthens cross-dimensional dependencies through parallel computation, combining rotational operations and residual transformations, as shown in Figure 2. In each branch, features of specific dimensions are compressed and extracted through Z-pool and convolution operations, while a Sigmoid function computes attention weights that determine the importance of features in each dimension. This mechanism enables the backbone network to enhance its focus on multi-dimensional features from a global attention perspective. It preserves the original information of the input features while enhancing cross-dimensional cooperation through the attention mechanism, all without significantly increasing computational overhead, thereby delivering a notable performance improvement.
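For illustration, below is a compact PyTorch sketch of one way to implement the three triplet attention branches (Z-pool, convolution, and Sigmoid applied to permuted views of the tensor). The kernel size, the averaging of branch outputs, and the feature map sizes are assumptions for this sketch, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    # Concatenate max- and mean-pooling along the channel axis (2 output channels).
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    def __init__(self, kernel_size=7):  # kernel size is an assumption
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        # Multiplicative gating preserves the original feature content.
        return x * torch.sigmoid(self.bn(self.conv(self.pool(x))))

class TripletAttention(nn.Module):
    """Three parallel branches realized by rotating the tensor, applying an
    attention gate, rotating back, and averaging the results."""
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()
        self.ch = AttentionGate()
        self.hw = AttentionGate()

    def forward(self, x):
        # Branch 1: swap C and H so the gate models (C, W) interactions.
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Branch 2: swap C and W so the gate models (C, H) interactions.
        x_ch = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # Branch 3: ordinary spatial attention over (H, W).
        x_hw = self.hw(x)
        return (x_cw + x_ch + x_hw) / 3.0

# Usage: refine a deep DLA-34 feature map (batch 2, 512 channels, 12x40).
feat = torch.randn(2, 512, 12, 40)
out = TripletAttention()(feat)   # same shape as the input
```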
3.4. Uncertainty-Guided Depth Ensemble Strategy
For monocular 3D object detection, a single depth prediction method often struggles to handle the complexities of various environments and scenarios. To enhance the robustness and accuracy of depth estimation, we propose an uncertainty-guided depth ensemble (UGDE). The depth estimation methods in this paper include direct regression of depth predictions and multiple geometric depth estimates derived from keypoints.
As shown in Figure 6, our depth ensemble strategy combines multiple depth estimation methods and adaptively weights different predictions based on uncertainty, generating more reliable depth estimates. Each depth prediction $d_i$ has a corresponding weight $\omega_i$, which represents the model's confidence in that prediction. Through uncertainty-based weighted averaging, predictions with higher confidence are given greater weight, dominating the depth ensemble. The computation formula for the ensemble is as follows:

$$d_{ens} = \frac{\sum_{i} \omega_i \, d_i}{\sum_{i} \omega_i},$$

where the weight term $\omega_i$ is defined as $\omega_i = 1/\sigma_i$. Estimates with smaller uncertainty $\sigma_i$ are given greater weights, while those with larger uncertainty are given lower weights.
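The following is a minimal sketch of the weighted aggregation described above, assuming inverse-uncertainty weights $\omega_i = 1/\sigma_i$; the clamping range and tensor shapes are illustrative assumptions.

```python
import torch

def ensemble_depth(depths, sigmas, sigma_range=(0.01, 10.0)):
    """Fuse several depth estimates per object with inverse-uncertainty weights.

    depths: (N, K) tensor, K depth candidates for each of N objects.
    sigmas: (N, K) tensor of predicted uncertainties (larger = less confident).
    """
    sigmas = sigmas.clamp(*sigma_range)                    # keep uncertainty in a reasonable range
    weights = 1.0 / sigmas                                 # smaller sigma -> larger weight
    weights = weights / weights.sum(dim=1, keepdim=True)   # normalize per object
    return (weights * depths).sum(dim=1)                   # weighted average d_ens

# Usage: 2 objects, each with a direct depth and two keypoint-geometric depths.
d = torch.tensor([[22.3, 21.8, 23.1], [45.0, 47.5, 44.2]])
s = torch.tensor([[0.4, 0.9, 1.5], [2.0, 0.5, 0.8]])
print(ensemble_depth(d, s))
```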
The uncertainty factor is computed from the model's regression feature maps. Specifically, it is estimated based on the input features and the model's performance on that input. During the forward pass, the regression feature maps contain the depth predictions along with other related features. We extract the features corresponding to the target locations (i.e., points of interest, POI) from these maps and obtain the specific regression results for each location using the corresponding indices. The uncertainty factor, which plays a critical role in depth estimation, is embedded in a dedicated channel of the regression feature map; its value indicates the reliability of the model's depth predictions in certain regions. In practice, the uncertainty factor is constrained within a predefined range to prevent it from exceeding a reasonable limit, ensuring that the model's uncertainty predictions remain both precise and adjustable. The visualization of the uncertainty factor during the training process is shown in Figure 7. The uncertainty is influenced by factors such as differences in input images or scene conditions: when the training data include complex, occluded, or blurred scenes, the model tends to exhibit higher uncertainty. Furthermore, the model's architecture and training process also play a crucial role in determining the magnitude of the uncertainty factor.
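As a sketch of the POI extraction step, the snippet below gathers regression vectors at flattened center indices and clamps an assumed uncertainty channel; the channel layout, channel index, and clamping range are hypothetical, not taken from the paper.

```python
import torch

def gather_poi_features(reg_map, indices):
    """Pick regression vectors at object center locations (points of interest).

    reg_map: (B, C, H, W) regression feature map; one of the C channels is
             assumed to carry the log-uncertainty of the direct depth.
    indices: (B, N) flattened spatial indices (y * W + x) of the POIs.
    """
    b, c, h, w = reg_map.shape
    flat = reg_map.view(b, c, h * w)                  # (B, C, H*W)
    idx = indices.unsqueeze(1).expand(-1, c, -1)      # (B, C, N)
    return flat.gather(2, idx).permute(0, 2, 1)       # (B, N, C)

# Usage: clamp the (assumed) uncertainty channel before it enters the loss.
reg_map = torch.randn(2, 8, 96, 320)
poi = gather_poi_features(reg_map, torch.randint(0, 96 * 320, (2, 5)))
log_sigma = poi[..., 7].clamp(-5.0, 5.0)   # illustrative channel index and range
```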
3.5. Multi-Task Loss Function Design
Based on the designed network output, the loss function consists of seven components: the center point classification loss $L_{cls}$, the 3D bounding box vertex offset loss $L_{kpt}$, the center point offset loss $L_{off}$, the direct regression depth loss $L_{dep}$, the geometric depth loss $L_{geo}$, the vehicle 3D residual scale loss $L_{dim}$, and the heading angle loss $L_{ori}$. The center point classification loss $L_{cls}$ is computed using focal loss [46]:

$$L_{cls} = -\frac{1}{N}\sum_{i,j}
\begin{cases}
\left(1-\hat{Y}_{ij}\right)^{\alpha}\log\left(\hat{Y}_{ij}\right), & Y_{ij}=1,\\
\left(1-Y_{ij}\right)^{\beta}\,\hat{Y}_{ij}^{\alpha}\log\left(1-\hat{Y}_{ij}\right), & \text{otherwise},
\end{cases}$$

where $\alpha$ and $\beta$ are hyperparameters that adjust the weights of the positive and negative sample losses, $N$ indicates the quantity of positive samples, $\hat{Y}_{ij}$ represents the predicted response value at $(i, j)$, and $Y_{ij}$ represents the true response value at $(i, j)$ generated by the Gaussian kernel function $Y_{ij}=\exp\!\left(-\frac{(i-\tilde{p}_x)^2+(j-\tilde{p}_y)^2}{2\sigma_p^2}\right)$, where $\sigma_p$ is the standard deviation of the Gaussian distribution.
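The snippet below is a hedged PyTorch sketch of the penalty-reduced focal loss reconstructed above; the defaults $\alpha=2$ and $\beta=4$ are common CenterNet choices and are assumptions here, not values confirmed by the paper.

```python
import torch

def center_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss over the center-point heatmap.

    pred, gt: (B, C, H, W); gt holds the Gaussian targets with peak value 1.
    """
    pred = pred.clamp(eps, 1.0 - eps)
    pos = gt.eq(1).float()
    neg = 1.0 - pos

    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg

    num_pos = pos.sum().clamp(min=1.0)          # avoid division by zero
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```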
The center point offset loss $L_{off}$ and the 3D bounding box vertex offset loss $L_{kpt}$ are trained using L1 Loss:

$$L_{off} = \frac{1}{N}\sum \left| \hat{\delta} - \delta \right|, \qquad \delta = \frac{P}{R} - \left\lfloor \frac{P}{R} \right\rfloor,$$

where $P$ represents the actual target point coordinates, $R$ is the downsampling factor, $\delta$ denotes the true coordinate offset, $\lfloor\cdot\rfloor$ denotes the floor operation, and $\hat{\delta}$ represents the predicted coordinate offset.
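A short sketch of how the sub-pixel offset targets and the L1 objective could be computed follows; the downsampling factor of 4 and the example coordinates are illustrative.

```python
import torch

def offset_targets(centers, stride=4):
    """Sub-pixel offsets delta = P/R - floor(P/R) for image-space points."""
    scaled = centers / stride
    return scaled - scaled.floor()

# Usage with a plain L1 loss, averaged over the positive samples.
centers = torch.tensor([[413.7, 182.2], [96.5, 240.9]])   # image coordinates P
delta = offset_targets(centers)                            # ground-truth offsets
delta_hat = torch.rand_like(delta)                          # stand-in predictions
l_off = (delta_hat - delta).abs().mean()
```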
For the vehicle 3D residual scale loss $L_{dim}$, each regression quantity is evaluated using the Smooth L1 Loss [47]:

$$L_{dim} = \sum_{s \in \{h,\, w,\, l\}} \mathrm{SmoothL1}\left(\Delta_s\right), \qquad
\mathrm{SmoothL1}(x) =
\begin{cases}
0.5\,x^{2}, & |x| < 1,\\
|x| - 0.5, & \text{otherwise},
\end{cases}$$

where $h$, $w$, and $l$ represent the real height, width, and length dimensions of the vehicle, respectively, and $\Delta_s$ represents the numerical difference between the ground truth and predicted values.
The heading angle regression loss $L_{ori}$ is predicted by applying the Multi-Bin [48] method. In our approach, we divide the orientation range into four bins. The total loss for Multi-Bin orientation estimation is:

$$L_{ori} = L_{conf} + L_{loc}.$$

The confidence loss $L_{conf}$ is defined as the softmax loss for the confidence of each interval. The calculation formula for the localization loss $L_{loc}$ is as follows:

$$L_{loc} = -\frac{1}{n_{\theta^{*}}}\sum_{i} \cos\left(\theta^{*} - c_i - \Delta\theta_i\right),$$

where $n_{\theta^{*}}$ denotes the number of intervals covering the true heading angle $\theta^{*}$, $c_i$ represents the center angle of interval $i$, and $\Delta\theta_i$ denotes the correction required for the center of interval $i$.
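The following is a hedged sketch of the Multi-Bin objective (softmax bin confidence plus the cosine localization term). Four evenly spaced bin centers and a half-width of $\pi/2$ are assumptions for illustration; the paper's exact bin layout is not reproduced here.

```python
import torch
import torch.nn.functional as F

# Four evenly spaced bin centers (an assumption; the paper's exact centers are not shown).
BIN_CENTERS = torch.tensor([0.0, 0.5, 1.0, 1.5]) * torch.pi

def multibin_loss(bin_logits, delta_pred, theta_gt, bin_width=torch.pi / 2):
    """bin_logits: (N, 4) confidence per bin; delta_pred: (N, 4) angle corrections;
    theta_gt: (N,) ground-truth heading angles in radians."""
    # Angular distance of the target to every bin center, wrapped to [-pi, pi).
    diff = (theta_gt[:, None] - BIN_CENTERS[None, :] + torch.pi) % (2 * torch.pi) - torch.pi
    covered = diff.abs() < bin_width            # bins whose range covers theta_gt
    target_bin = diff.abs().argmin(dim=1)       # closest bin, used for the softmax term

    l_conf = F.cross_entropy(bin_logits, target_bin)
    # Localization: maximize cos(theta* - c_i - delta_i) over the covering bins.
    cos_term = torch.cos(diff - delta_pred) * covered
    l_loc = -(cos_term.sum(dim=1) / covered.sum(dim=1).clamp(min=1)).mean()
    return l_conf + l_loc
```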
The direct regression depth loss $L_{dep}$ is designed to quickly capture depth features by directly regressing the object depth information. The network output $z_o$ is first transformed into absolute depth using the inverse sigmoid function to reduce the influence of the output range on predictions:

$$\hat{z} = \frac{1}{\mathrm{Sigmoid}(z_o) + \varepsilon} - 1,$$

where $\mathrm{Sigmoid}(z_o)$ is defined as $1/(1+e^{-z_o})$. The term $\varepsilon$ represents a small constant that is introduced to ensure the stability of the direct depth calculation process. Meanwhile, by incorporating uncertainty modeling, the model can adaptively adjust the loss value when it has low confidence in depth predictions:

$$L_{dep} = \frac{\left| \hat{z} - z^{*} \right|}{\sigma_{dep}} + \log\left(\sigma_{dep}\right),$$

where $\hat{z}$ denotes the predicted depth, $z^{*}$ denotes the ground truth depth, $\log(\sigma_{dep})$ is used to prevent the model from increasing the uncertainty to avoid loss, and $\sigma_{dep}$ is the predicted depth uncertainty.
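Below is a sketch of the depth decoding and the uncertainty-weighted L1 term reconstructed above, assuming the uncertainty is predicted in log space to keep it positive (a common choice, not confirmed by the paper).

```python
import torch

def decode_direct_depth(z_o, eps=1e-6):
    """Inverse-sigmoid style transform from unbounded network output to absolute depth."""
    return 1.0 / (torch.sigmoid(z_o) + eps) - 1.0

def uncertainty_l1(depth_pred, depth_gt, log_sigma):
    """Laplacian-style uncertainty-weighted L1: |z_hat - z*| / sigma + log(sigma).

    The log term stops the model from inflating uncertainty to drive the loss to zero.
    """
    sigma = torch.exp(log_sigma)
    return ((depth_pred - depth_gt).abs() / sigma + log_sigma).mean()

# Usage on a handful of objects.
z_o = torch.randn(4)                    # raw network outputs
z_hat = decode_direct_depth(z_o)        # absolute depths
z_gt = torch.tensor([18.0, 35.5, 52.3, 9.7])
loss = uncertainty_l1(z_hat, z_gt, log_sigma=torch.zeros(4))
```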
The keypoint-based geometric depth loss $L_{geo}$ estimates depth by leveraging the geometric relationships of the object's keypoints. It computes the center depth $z_c$ and the depths of four 3D bounding box diagonals, $z_{d1}$, $z_{d2}$, $z_{d3}$, and $z_{d4}$. The loss function for geometric depth estimation also uses L1 Loss and incorporates uncertainty modeling:

$$L_{geo} = \mathbb{1}_{vis}\left( \frac{\left| z_{geo} - z^{*} \right|}{\sigma_{geo}} + \log\left(\sigma_{geo}\right) \right),$$

where $z_{geo}$ represents the depth calculated based on the geometric relationships of the keypoints, $\sigma_{geo}$ denotes the uncertainty of the depth, and $\mathbb{1}_{vis}$ is an indicator function that determines whether the keypoints used for depth calculation are visible. When the keypoints are not visible, the model allows the impact of the loss to be reduced by assigning a larger uncertainty.
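As an illustration of how depth can be recovered from keypoint geometry, the sketch below uses the pinhole relation $z \approx f_y H / h$ between the physical height and the projected pixel height of a keypoint pair; the exact keypoint pairing, camera parameters, and variable names are assumptions, not the paper's implementation.

```python
import torch

def depth_from_keypoint_pair(y_top, y_bottom, height_3d, focal_y):
    """Pinhole relation: pixel height h = f_y * H / z  =>  z = f_y * H / h.

    y_top, y_bottom: vertical image coordinates of a top/bottom keypoint pair.
    height_3d: estimated physical height of the vehicle (meters).
    focal_y: camera focal length in pixels.
    """
    pixel_height = (y_bottom - y_top).clamp(min=1e-3)
    return focal_y * height_3d / pixel_height

# Usage: two diagonal keypoint pairs give two geometric depth candidates.
y_top = torch.tensor([250.0, 248.0])
y_bot = torch.tensor([310.0, 305.0])
z_geo = depth_from_keypoint_pair(y_top, y_bot, height_3d=1.5, focal_y=720.0)
```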
Based on the above, the comprehensive loss function of the network, $L_{total}$, is formulated in the manner presented below:

$$L_{total} = \lambda_1 L_{cls} + \lambda_2 L_{kpt} + \lambda_3 L_{off} + \lambda_4 L_{dep} + \lambda_5 L_{geo} + \lambda_6 L_{dim} + \lambda_7 L_{ori},$$

where $\lambda_1$ through $\lambda_7$ are the balance factors between the individual loss functions.