MDS-Net: Multi-Scale Depth Stratification 3D Object Detection from Monocular Images

Monocular 3D object detection is highly challenging in autonomous driving due to the lack of depth information. This paper proposes a one-stage monocular 3D object detection network, MDS-Net, which uses an anchor-free method to detect 3D objects in a per-pixel prediction manner. Firstly, a novel depth-based stratification structure is developed to improve the network's depth prediction ability; it exploits the mathematical relationship between an object's size in the image and its depth, derived from the pinhole model. Secondly, a new angle loss function is developed to improve both the accuracy of angle prediction and the convergence speed of training. An optimized Soft-NMS is finally applied in the post-processing stage to adjust the confidence scores of the candidate boxes. Experimental results on the KITTI benchmark demonstrate that the proposed MDS-Net outperforms existing monocular 3D detection methods in both the 3D detection and BEV detection tasks while fulfilling real-time requirements.


Introduction
3D object detection is a fundamental capability for complex and advanced autonomous driving tasks, such as object tracking and event detection. Nowadays, most 3D object detection algorithms use the LiDAR point cloud [1,2,3,4,5] to provide distance information, and 3D detection based on point clouds [1,2,3,6,7,8] has developed rapidly. A common theme among state-of-the-art point-cloud-based 3D detection methods is to project the point cloud into sets of 2D views. For example, AVOD [9] projects the 3D point cloud onto the bird's-eye view (BEV) and then fuses the features from the BEV and the image to predict 3D bounding boxes. Although the view-based method is time-saving, it destroys the information integrity of the point cloud. The voxel-based method [1] divides the point cloud into regular voxels and then regresses the 3D bounding boxes with a 3D convolutional network; it preserves shape information but usually suffers from high time complexity. The set-based method [2,3] represents the point cloud as an unordered point set and uses a multi-layer perceptron (MLP) to learn features from it directly. In this way, it reduces the impact of the point cloud's local irregularity; however, it still suffers from the mismatch between regular grid operations and the point cloud structure. In summary, LiDAR data are hard to process in real time because of their large volume, and the irregularity of the point cloud makes it difficult to extract features with convolutional operations. By contrast, feature extraction from images can be realized with convolutional neural networks, and cameras are cheaper. However, due to the lack of reliable depth information, 3D object detection based on monocular images is generally considered more challenging.
Finally, the NMS algorithm causes a large number of reasonable candidate boxes to be dropped. Aiming at solving the above problems, this paper proposes an end-to-end one-stage 3D detection network based on a Multi-scale Depth-based Stratification structure, MDS-Net. The novelty of MDS-Net primarily lies in the following three features. Firstly, we propose a depth-based stratification structure derived from the conventional Feature Pyramid Network (FPN) [31]. By establishing a mathematical model between an object's depth and its 2D scale, we assign each FPN feature map a different predictable depth range based on prior statistics and improve the depth perception of our network without adding time complexity. Secondly, we design a novel angle loss based on the consistency of IoU and angle, which improves the prediction precision of the boxes and the convergence speed of training at the same time, and which also solves the problem that small derivatives of the loss make the regression difficult to train. Thirdly, we design a density-based Soft-NMS algorithm for post-processing: since MDS-Net generates multiple boxes for the same object, we believe that objects are more likely to exist at locations with a high density of predictions.
In summary, our contributions are as follows:
• We propose a one-stage monocular 3D object detection network, MDS-Net, based on a Multi-scale Depth-based Stratification structure, which accurately predicts an object's localization in the 3D camera coordinate system from a monocular image in an end-to-end manner. The proposed MDS-Net achieves state-of-the-art performance on the KITTI benchmark for monocular 3D object detection.
• We design a novel angle loss function to strengthen the network's angle prediction ability.
• We propose a density-based Soft-NMS method to improve the confidence of credible boxes.

The Proposed MDS-Net
This section comprises five subsections. In Section 2.1, we introduce the overall architecture of our network. In Section 2.2, we introduce the depth-based stratification structure, which improves monocular depth estimation. In Section 2.3, the formulation that transforms the network outputs into 3D coordinates is presented. In Section 2.4, we expound our loss function and propose a novel angle loss to improve angle prediction accuracy. Finally, in Section 2.5, we detail a density-based Soft-NMS algorithm that reasonably processes the predicted boxes and increases recall.

Network Architecture
As shown in Figure 1, our object detection network consists of three parts: the backbone, the FPN [31], and the detection heads. Our network uses Darknet53 [15] as the backbone to generate three feature maps, D_1, D_2, and D_3, with downsampling ratios of 32, 16, and 8, respectively. Between the backbone and the FPN, we use the Shape-Aware Convolution (SAC) to enhance the network's perception of objects with different aspect ratios. We then use the FPN to fuse features from different layers and obtain the fused feature maps F_1, F_2, and F_3. Finally, each feature map F_i (i = 1, 2, 3) is connected to two detection heads H_k (k = 1, 2). Each detection head is composed of two branches: one predicts the 3D parameters of the bounding box, and the other predicts the confidence, intersection-over-union (IoU), and center-ness of the object. During inference, the predicted IoU and center-ness are multiplied with the corresponding confidence to calculate the final score.

Depth Stratification
As shown in Figure 2, we assume that the course angle is 0; then, according to the pinhole model, the relationship between the 3D depth Z_3d and the 2D size (w_2d, h_2d) of an object can be approximately expressed as

w_2d ≈ f_u·W_3d/Z_3d,  h_2d ≈ f_v·H_3d/Z_3d,  (1)

where W_3d, L_3d, and H_3d are the 3D sizes of the object (for a non-zero course angle, the projected width also involves L_3d), and f_u and f_v are the camera's focal lengths. It can be inferred from Equation (1) that, for objects of the same class with similar 3D sizes, the 2D size is mainly determined by the depth. Based on this deduction, and inspired by [31], which uses the Feature Pyramid Network (FPN) to detect objects of different scales, we apply the FPN structure to further improve depth prediction by capturing the objects' size gap on the image, as shown in Figure 1. The feature maps output by the FPN are denoted F_1, F_2, and F_3 in ascending order of resolution, and each feature map is connected to two detection heads. The depth range predicted by each of the six detection heads is (ϕ·2^(i+k−2), ϕ·2^(i+k−1)), where i indexes the feature map, k indexes the detection head, and ϕ is a hyperparameter used to control the predictable ranges. During training, we assign the ground-truth boxes to the appropriate detection heads based on these predefined depth ranges, and each head is only responsible for predicting objects in its corresponding depth range. Our proposed depth-based stratification structure thus assigns long-distance objects to high-resolution feature maps, following the pinhole model, to obtain precise depth predictions.

The depth-based stratification structure has two advantages. Firstly, each feature map's predictable depth range is double that of the previous layer; keeping the growth rate of the predictable depth range consistent with that of the receptive fields leads to better depth prediction. Secondly, the three feature maps' predictable depth ranges overlap each other, for two reasons. On the one hand, Equation (1) shows that the 2D size of an object is not strictly inversely proportional to its depth but is also affected by the object's pose, so we expand each feature map's predictable depth range to strengthen the network's prediction robustness. On the other hand, the overlapping stratification enables the network to predict objects in crucial areas multiple times, which in consequence eases the imbalance between positive and negative samples.

To further verify the proposed model in Equation (1), we plot the correlation between the 3D depth and the 2D scale for all ground-truth boxes in the KITTI dataset (the three colored scatters in Figure 3a,b) and the fitting curve of Equation (1), produced by replacing W_3d, L_3d, and H_3d with their statistical averages (the red curves in Figure 3a,b). It should be noted that the practical correlation between Z_3d and h_2d (Figure 3a) fits the proposed model much better than that between Z_3d and w_2d (Figure 3b): the pose affects an object's 2D width on the image, and the 3D width of each category varies widely, whereas h_2d is less affected by the pose and has smaller inter-class and intra-class variance.

Figure 3. The correlation of Z_3d and w_2d for part of the ground-truth boxes in the KITTI dataset and the fitting curve (red curve). We use green points, blue stars, and black triangles to denote cars, pedestrians, and cyclists, respectively.
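To make the stratification concrete, the following sketch (ours, not the authors' code) shows how ground-truth boxes could be assigned to the six detection heads from the depth ranges above; the function names and the closed-interval handling at the range endpoints are assumptions.

```python
# Illustrative reconstruction of the depth-based head assignment.
# Head (i, k) on feature map F_i covers depths in (phi*2^(i+k-2), phi*2^(i+k-1)).

PHI_CAR = 5.0  # metres; phi = 2.5 for pedestrian/cyclist (see Implementation Details)

def head_depth_range(i: int, k: int, phi: float = PHI_CAR) -> tuple[float, float]:
    """Predictable depth range of detection head H_k on feature map F_i."""
    return phi * 2 ** (i + k - 2), phi * 2 ** (i + k - 1)

def assign_heads(z3d: float, phi: float = PHI_CAR) -> list[tuple[int, int]]:
    """Return every (i, k) whose range contains the ground-truth depth z3d.
    Consecutive ranges meet at their endpoints and two heads can share a
    range, so one object may be assigned to several heads -- this realizes
    the overlapping stratification described above."""
    return [(i, k)
            for i in (1, 2, 3)
            for k in (1, 2)
            if head_depth_range(i, k, phi)[0] <= z3d <= head_depth_range(i, k, phi)[1]]

# Example with phi = 5 m: a car at 22 m falls into the range (20, 40) twice,
# via heads (2, 2) and (3, 1):
assert assign_heads(22.0) == [(2, 2), (3, 1)]
```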
Based on this observation, we propose a more adaptive convolutional structure, the Shape-Aware Convolution (SAC), which is more robust to objects of various scales, as shown in Figure 1. The SAC performs 1 × 3 and 3 × 1 convolutions on the input feature map in parallel and then concatenates the two outputs with the input feature map, followed by a 1 × 1 convolution that yields the final output feature map. As shown in Figure 4, the statistical distribution of the 2D aspect ratios in the KITTI dataset indicates that the aspect ratios of most objects lie between 1:3 and 3:1. The network with SAC is therefore more compatible with the various aspect ratios of objects and improves the network's robustness by alleviating the underfitting problem caused by the depth stratification structure.
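A minimal PyTorch sketch of the SAC block described above; the equal channel widths and the absence of normalization and activation layers are our assumptions.

```python
import torch
import torch.nn as nn

class ShapeAwareConv(nn.Module):
    """Sketch of Shape-Aware Convolution (SAC): two parallel elongated
    convolutions, concatenated with the input and fused by a 1x1 convolution."""

    def __init__(self, channels: int):
        super().__init__()
        # One kernel elongated along the width, one along the height
        # (PyTorch kernel_size is given as (height, width)).
        self.conv_wide = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.conv_tall = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        wide = self.conv_wide(x)   # favours flat, wide objects
        tall = self.conv_tall(x)   # favours thin, tall objects (e.g., pedestrians)
        return self.fuse(torch.cat([x, wide, tall], dim=1))

# Usage: y = ShapeAwareConv(256)(torch.randn(1, 256, 48, 156))
```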

Network Predictions
The outputs of our network consist of two parts. The first part is the predicted 3D parameters of an object, including its 3D center, 3D size, and angle. The second part is the predicted score of an object, including its confidence, predicted-IoU, and center-ness.

3D Prediction
According to the pinhole model of the camera, we project the object's 3D center [X_3d, Y_3d, Z_3d]^T in the camera coordinate system onto the image:

u = f_u·X_3d/Z_3d + c_u,  v = f_v·Y_3d/Z_3d + c_v,  (2)

where u and v denote the coordinates of the projected center in the image, f_u and f_v are the focal lengths, and c_u and c_v are the principal point offsets of the camera. From Equation (2), we obtain the relationship between u, v and X_3d, Y_3d, Z_3d:

X_3d = (u − c_u)·Z_3d/f_u,  Y_3d = (v − c_v)·Z_3d/f_v,  (3)

where c_u, c_v, f_u, and f_v are camera parameters, and u, v, and Z_3d can be predicted by the network. For the 3D size prediction, we calculate the average 3D size (w_0, l_0, h_0) of all objects in the dataset as a preset value and then predict the logarithm of the ratio between the ground truth and the preset value. For the angle prediction, we use the observation angle, which is intuitively more meaningful when processing image features. Following Kinematic 3D [32], we split the observation angle β into the object heading θ_h, the axis α, and the angle offset θ, as shown in Figure 5.
Suppose the predicted values on the feature map F_i are û, v̂, Ẑ, ŵ, l̂, ĥ, α̂, θ̂, and θ̂_h; we then compute the 3D parameters X_3d, Y_3d, Z_3d, W_3d, L_3d, H_3d, and β of the object from them, where (X_P, Y_P) represents the center of the grid.
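Since the formula itself is not reproduced above, the following is a hedged sketch of how such a decoding could look; the back-projection follows Equations (2) and (3), while the offset, size, and angle compositions are plausible reconstructions rather than the authors' exact scheme.

```python
import math

def decode_3d(pred: dict, grid_center: tuple, cam: dict, priors: dict) -> dict:
    """Hedged decoding of per-grid outputs into 3D box parameters.
    pred keys (our naming): u_off, v_off, z, w, l, h (log-ratios),
    axis, heading (values in [0, 1] after sigmoid), theta (angle offset).
    cam: fu, fv, cu, cv from the calibration; priors: average sizes w0, l0, h0."""
    # Projected 2D center: grid center (X_P, Y_P) plus a predicted offset
    # (the offset form is an assumption).
    u = grid_center[0] + pred["u_off"]
    v = grid_center[1] + pred["v_off"]
    z = pred["z"]  # depth, constrained to the head's stratum

    # Inverse pinhole projection, Equation (3).
    x = (u - cam["cu"]) * z / cam["fu"]
    y = (v - cam["cv"]) * z / cam["fv"]

    # 3D size: the network predicts log(size / dataset average).
    w3d = priors["w0"] * math.exp(pred["w"])
    l3d = priors["l0"] * math.exp(pred["l"])
    h3d = priors["h0"] * math.exp(pred["h"])

    # Observation angle from the Kinematic 3D-style split; the exact
    # composition rule is our assumption.
    beta = pred["theta"] \
        + (math.pi / 2) * round(pred["axis"]) \
        + math.pi * round(pred["heading"])
    return {"X": x, "Y": y, "Z": z, "W": w3d, "L": l3d, "H": h3d, "beta": beta}
```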

Score Prediction
For the confidence prediction, according to KITTI's standard [33] for difficulty levels, we predict the confidence piecewise for the different difficulty levels, setting the ground truth to 1.0, 0.8, 0.6, and 0.4, respectively. Following FCOS [17], the center-ness is defined as the normalized distance from the grid's center to the object's 2D center. Following IoU-Net [34], the IoU is defined as the intersection over union of the 3D predicted box and the 3D ground-truth box.
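For illustration, a sketch of how these score targets could be constructed; the mapping of the fourth confidence value and the center-ness normalizer are assumptions on our part.

```python
# Assumed mapping of KITTI difficulty levels to confidence targets; the text
# gives the four values but not the exact correspondence.
DIFFICULTY_CONFIDENCE = {"easy": 1.0, "moderate": 0.8, "hard": 0.6, "unknown": 0.4}

def centerness_target(grid_cx: float, grid_cy: float,
                      obj_cx: float, obj_cy: float,
                      box_w: float, box_h: float) -> float:
    """Center-ness as a normalized distance from the grid center to the
    object's 2D center: 1 at the center, falling to 0 at the box border
    (the normalizer, the box half-extent, is our assumption)."""
    dx = (grid_cx - obj_cx) / (box_w / 2.0)
    dy = (grid_cy - obj_cy) / (box_h / 2.0)
    return 1.0 - min((dx * dx + dy * dy) ** 0.5, 1.0)
```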

Loss
For the positive sample assignment strategy, M3D-RPN [25] sets the grids containing the object center as positive samples, and FCOS3D [35] sets the grids near the object center as positive samples. We instead set all the grids inside the projected ground-truth box as positive samples and those outside as negative, which eases the imbalance between positive and negative samples. Since our network uses the depth-based stratification structure, we set the confidence label of a grid to "ignore" if its center lies within an object beyond the predictable depth range of the corresponding detection head. To resolve the label ambiguity that occurs when two or more boxes overlap on the image, we make a grid responsible only for the closest object, since that object is the one visible from the camera's perspective. As shown in Figure 6, compared with M3D-RPN and FCOS3D, our network has more positive samples, which eases the imbalance between positive and negative samples. The loss L of our network is composed of the classification loss L_c and the 3D box loss L_3d. We use the sum of the confidence loss, center-ness loss, and IoU loss as the classification loss and ignore the contribution of all samples labeled "ignore". Inspired by FCOS [17], we add two branches parallel to the confidence branch to predict the center-ness and IoU, and we use the Quality Focal Loss (QFL) [36] to optimize the confidence, center-ness, and predicted IoU.
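For reference, a self-contained sketch of the Quality Focal Loss [36] used here for the three soft targets; β = 2 is the value suggested in the QFL paper, and the mean reduction is our choice.

```python
import torch
import torch.nn.functional as F

def quality_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                       beta: float = 2.0) -> torch.Tensor:
    """QFL for soft labels in [0, 1] (confidence, center-ness, predicted IoU):
    binary cross-entropy scaled by |target - prob|^beta, so that predictions
    already close to their quality target are down-weighted."""
    prob = logits.sigmoid()
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return ((targets - prob).abs().pow(beta) * bce).mean()

# Usage: loss = quality_focal_loss(conf_logits, conf_targets)
```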
We use the sum of the localization loss, size loss, and angle loss as our 3D box loss L_3d, and we use the L2 loss to optimize the localization and size predictions. VoxelNet [1] directly uses the offset in radians as the angle loss function. However, in the case shown in Figure 7a, the overlap of the two boxes is considerable while the network still incurs a large angle loss. To solve this problem, SECOND [37] proposed a sine-based angle loss,

L_β = SmoothL1(sin(β − β̂)),

where β and β̂ denote the ground-truth and predicted observation angles, respectively. This function naturally models the IoU against the angle offset. However, in the case shown in Figure 7b, the derivative of the loss is improperly small even though the loss reaches its maximum, making it difficult for the network to regress. To overcome the above problems, we design a new angle loss composed of axis, angle offset, and heading terms, where α and α̂ denote the ground-truth and predicted axes, θ and θ̂ the ground-truth and predicted angle offsets, and θ_h and θ̂_h the ground-truth and predicted headings. Our angle loss not only establishes a consistent model between the IoU and the observation angle but also makes the network easier to train when the angle loss reaches its maximum. Assume the network has predicted an accurate θ̂. As shown in Figures 8 and 9, an inaccurately predicted axis α̂ has a minor effect on the IoU when the ground-truth angle offset is close to either 0° or 90°, while it has a significant impact when the ground-truth angle is close to 45°. We therefore apply a weighting parameter sin(2·θ̂) to the axis loss to increase the penalty for inaccurate axis predictions when the ground-truth angle is close to 45°.
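Since the full expression of the proposed loss is not reproduced above, the following is a hedged sketch of its structure: the sin(2·θ̂) weighting of the axis term is taken from the text, while the concrete per-term forms (binary cross-entropy for axis and heading, a SECOND-style sine term for the offset) are our assumptions.

```python
import torch
import torch.nn.functional as F

def mds_angle_loss(axis_logit: torch.Tensor, theta_pred: torch.Tensor,
                   heading_logit: torch.Tensor,
                   axis_gt: torch.Tensor, theta_gt: torch.Tensor,
                   heading_gt: torch.Tensor) -> torch.Tensor:
    """Sketch of an axis/offset/heading angle loss with the paper's
    sin(2*theta_hat) axis weighting; not the authors' exact formulation."""
    # Offset: sine of the angle difference keeps IoU/angle consistency.
    offset_loss = F.smooth_l1_loss(torch.sin(theta_pred - theta_gt),
                                   torch.zeros_like(theta_pred))
    # Axis: penalized most when the offset is near 45 degrees, where a wrong
    # axis hurts the IoU the most.
    axis_bce = F.binary_cross_entropy_with_logits(axis_logit, axis_gt,
                                                  reduction="none")
    axis_loss = (torch.sin(2.0 * theta_pred).abs() * axis_bce).mean()
    # Heading: plain binary classification (front/back).
    heading_loss = F.binary_cross_entropy_with_logits(heading_logit, heading_gt)
    return offset_loss + axis_loss + heading_loss
```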

Density-Based Soft-NMS
As shown in Algorithm 1, the traditional Soft-NMS algorithm [38] selects the detection box b_i in descending order of predicted confidence score. It uses the IoU-based weighting function

f(IoU(M, b_i)) = exp(−IoU(M, b_i)²/σ)

to decay the confidence score of a box b_i that has a high IoU with the set of selected boxes M, where σ is a hyperparameter of the decay coefficient. Since our network predicts redundant boxes for one object in different depth strata, we develop a density-based Soft-NMS algorithm to filter repeated boxes. The key idea of our NMS algorithm is that the denser the predicted boxes, the more likely an object exists. We define the density of a candidate box as the sum of IoU between the candidate box and all surrounding boxes. Based on the results obtained by the 3D Soft-NMS algorithm, we use the density-based weighting function g(IoU(b_m, B_0)) to boost candidate boxes that have a high density (line 12 of Algorithm 1: s_m ← s_m · g(IoU(b_m, B_0))), where γ is another hyperparameter of the decay coefficient.
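A hedged sketch of the two-stage procedure: stage 1 is Gaussian Soft-NMS [38] with the decay f above; for stage 2, the text defines the density but not the exact form of g, so the bounded boost below is an assumption. The defaults are the car hyperparameters reported in the implementation details.

```python
import numpy as np

def density_soft_nms(boxes, scores, iou_fn, sigma=0.9, gamma=25.0, nt=0.7):
    """Sketch of density-based Soft-NMS. iou_fn computes the 3D IoU of two
    boxes; returns the selection order and the adjusted scores."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(boxes)))
    kept = []

    # Stage 1: Gaussian Soft-NMS -- decay instead of hard suppression.
    while remaining:
        m = max(remaining, key=lambda i: scores[i])
        kept.append(m)
        remaining.remove(m)
        for i in remaining:
            iou = iou_fn(boxes[m], boxes[i])
            if iou > nt:
                scores[i] *= np.exp(-(iou ** 2) / sigma)  # f(IoU(M, b_i))

    # Stage 2: boost boxes in dense clusters; density = sum of IoU with all
    # other boxes, and g(.) is an assumed bounded boost controlled by gamma.
    for m in kept:
        density = sum(iou_fn(boxes[m], boxes[i])
                      for i in range(len(boxes)) if i != m)
        scores[m] *= 1.0 + density / gamma  # smaller gamma -> stronger boost

    return kept, scores
```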

Dataset
We evaluate our MDS-Net on the challenging KITTI dataset [33]. The 3D detection benchmark of this dataset contains two core tasks: bird's-eye-view (BEV) object detection and 3D object detection. The dataset assigns its samples to three difficulty levels (easy, moderate, and hard) according to the object's truncation, occlusion, and 2D bounding-box height. In this work, we evaluate our algorithm on KITTI's three classes. Following [21], we apply the two split methods of the validation dataset [39] and the official test dataset [33]. In each split method, data from the same sequence appear in only one split, so that the interference of adjacent frames with the network model is eliminated.
Following [40], KITTI uses the Average Precision on 40 recall positions (AP_40) on the official test dataset to replace the Average Precision on 11 recall positions (AP_11). On both the official test dataset and the validation dataset, we report AP_40 with a 0.7 IoU threshold unless otherwise noted.

Implementation Details
Since the center-ness and IoU predictions require a reasonably trained box predictor, we first train the localization regression network without the center-ness and IoU branches, using a piecewise-decayed learning rate: the initial learning rate is 10^−4, and it decays to 3·10^−5 and 10^−5 at the 30th and 50th epochs, respectively. We then train the score branch for 10 epochs with a learning rate of 10^−5. We initialize our backbone with weights pre-trained on the COCO dataset [41].

In our density-based Soft-NMS algorithm, σ is the hyperparameter of the weighting function f(IoU(M, b_i)), N_t is the IoU threshold, and γ is the hyperparameter of the weighting function g(IoU(b_m, B_0)). A smaller σ decays the confidence of redundant boxes more strongly; a smaller N_t causes more boxes to be treated as redundant; and a smaller γ boosts the confidence of high-density boxes more strongly. We use Bayesian optimization to tune σ, γ, and N_t on the validation set, with the moderate-level AP_3D as the optimization goal for each of the three classes, and we show the tuning process and the effect of the parameters on the AP_3D score in Section 3.4.1. We then apply the optimized parameters to the test dataset, obtaining σ = 0.9, γ = 25, N_t = 0.7 for car; σ = 1.0, γ = 32, N_t = 0.4 for pedestrian; and σ = 1.2, γ = 30, N_t = 0.4 for cyclist.

For the parameters of the depth stratification structure, as described in Section 2.2, the depth range predicted by each of the six detection heads is (ϕ·2^(i+k−2), ϕ·2^(i+k−1)), with i and k, respectively, indexing the feature maps and the detection heads, and ϕ being a hyperparameter that controls the predictable ranges of our framework. We obtain ϕ by counting the depths of the three classes in the KITTI training set; the resulting histograms are shown in Figure 10. From these, we learn that the depth of cars is mainly distributed between 5 m and 70 m, while the depths of pedestrians and cyclists are mainly distributed between 2.5 m and 55 m. We therefore set ϕ to 5 m for car and to 2.5 m for pedestrian and cyclist. Due to the structure of the network, the maximum depth the network can predict is 40 m when ϕ is set to 2.5 m. This causes the network to give up predicting pedestrians and cyclists between 40 m and 55 m, which occupy about 2% of the pedestrians and 8% of the cyclists in KITTI. This trade-off is worthwhile, considering that close objects pose greater danger in autonomous driving tasks. We thus set (5 m, 80 m) as the predictable depth range for car and (2.5 m, 40 m) for pedestrian and cyclist, ignoring objects outside these ranges.
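To make the range arithmetic explicit, the following worked check (ours, derived from the head ranges above) confirms the quoted limits:

```latex
% Overall span of the six heads (i in {1, 2, 3}, k in {1, 2}):
\[
  Z_{\min} = \phi \cdot 2^{1+1-2} = \phi, \qquad
  Z_{\max} = \phi \cdot 2^{3+2-1} = 16\,\phi .
\]
% phi = 5 m   =>  (5 m, 80 m):   the car range quoted above.
% phi = 2.5 m =>  (2.5 m, 40 m): the pedestrian/cyclist range, hence
%                 objects beyond 40 m are given up for these classes.
```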

Evaluation Results
We evaluate our network on the BEV object detection task and the 3D object detection task, each under both the validation split [39] and the official test dataset [33]. In Table 1, we compare the evaluation results of our network with the existing state-of-the-art monocular 3D detection algorithms. Our network obtains remarkable results in car detection at all three difficulty levels, especially the easy level, for both tasks. For instance, under the test data split with IoU ≥ 0.7, we surpass the previous state-of-the-art 3D object detection approach by +5.25 on easy, +2.70 on moderate, and +1.73 on hard for car. Moreover, in Table 2, for pedestrian and cyclist, we also surpass the previous state-of-the-art approach by +5.76/+2.56 on easy, +3.61/+1.01 on moderate, and +3.12/+0.86 on hard. In addition, in Table 3, we compare the performance of our method with MonoEF, GUPNet, and MonoDTR. These three methods obtain performance similar to ours while suffering from higher time complexity. At the easy level, our method surpasses MonoDTR by +2.31 on AP_3D and +4.22 on AP_BEV, indicating that our method detects easy objects better; at the other difficulty levels, our results are also close to those of the other methods. It is worth noting that our method does not need to generate extra information and thus has the shortest inference time among the above methods.
To prove the validity of the depth stratification, we evaluate our network on car in different depth ranges (5-20 m, 10-40 m, 20-80 m) with IoU criteria of 0.7 and 0.5 on the KITTI validation set. The improvement is defined as the percentage gain of our algorithm over M3D-RPN. In Table 4, our method outperforms M3D-RPN [25] with AP margins of +52.29%/+15.16% in 5-20 m, +72.87%/+29.21% in 10-40 m, and +169.81%/+53.55% in 20-80 m for the criteria IoU ≥ 0.7 and IoU ≥ 0.5, respectively. To evaluate the depth prediction itself, we define the average depth estimation error as the average difference in depth between each ground-truth object and its best-predicted box, and we calculate this error in intervals of 10 m in Figure 11. As shown in the results, the depth estimation error increases as the distance grows. Moreover, we achieve a larger improvement in depth estimation accuracy in 10-40 m, indicating that our network obtains better predictions in critical areas.

Figure 11. The average error of depth prediction visualized in different depth ranges.
It must be noted that, by regressing the 3D bounding boxes directly on the image, our approach reaches an operating speed of 20 FPS and thus meets the real-time requirements of autonomous driving better than the existing networks based on pseudo point clouds.
As mentioned in Section 2 of this paper, the depth stratification strategy enhances the model's depth prediction, the positive sample assignment strategy enhances the prediction of occluded objects, and the new angle loss function enhances the angle prediction. To verify the validity of these components, we visualize some detection results in Figure 12. As shown in the results, our method obtains much higher accuracy than M3D-RPN in the prediction of depth and angle, and works more effectively in predicting occluded objects, as demonstrated in the bottom-left plot of the figure. Our method therefore obtains improvements in AP_3D and AP_BEV, which also proves the effectiveness of the proposed depth stratification strategy, positive sample assignment strategy, and angle loss. Moreover, in Figure 13, we visualize more 3D detection results on the KITTI validation set [39].

Ablation Study
For the ablation study, following [39], we divide the training samples of the KITTI dataset into 3712 training samples and 3769 validation samples and then evaluate the network's accuracy according to the AP_40 standard.

Depth Stratification
We compare the object detection AP_40 with and without depth stratification on the car class. In the test without depth stratification, following the strategy of the YOLOv3 network [15], we use the IoU between the 2D box of the object and the preset anchor as the basis for stratification. As shown in the first and second rows of Table 5, the network based on Multi-scale Depth Stratification (MDS) (the second row) achieves a significant gain of 6.35 over the baseline implementation (the first row) on 3D detection at the moderate level, which verifies the superiority of the MDS structure.

Figure 12. Qualitative results from M3D-RPN and our MDS-Net on the KITTI validation set. We use red boxes to denote the ground-truth boxes and green boxes to denote the predicted boxes. It can be observed that our model predicts depth and angle more accurately and detects occluded objects better.

In addition, we compare the impact of different depth stratification strategies on the results. First, we set each FPN layer's predictable depth range to be 'equal interval and no overlap': we modify the network so that each feature layer is connected to one detection head, and the depth ranges of the predicted car for F_1, F_2, and F_3 are [5, 30] m, [30, 55] m, and [55, 80] m, respectively. For the second experiment, denoted 'exponentially increasing interval and no overlap', we add a feature layer F_0 with a downsampling ratio of 64 to the network, connect each feature layer to one detection head, and set the depth ranges of the predicted car for F_0, F_1, F_2, and F_3 to [5, 10] m, [10, 20] m, [20, 40] m, and [40, 80] m, respectively. Finally, the depth stratification structure described in Section 2.2 is denoted 'exponentially increasing interval with overlap'. In Figure 14, we calculate the AP_40 in different depth ranges and find that our depth stratification structure achieves better AP_40 for cars in all depth ranges, especially within 10-40 m. This verifies that the 'exponentially increasing interval with overlap' stratification allows the network to better learn a unified representation of objects in different depth ranges and improves the robustness of the network.

Density-Based Soft-NMS and Piecewise Score
As shown in Figure 15, we use Bayesian optimization to select the parameters (σ, γ, N_t) for the density-based Soft-NMS and demonstrate the effect of these three parameters on the AP_40. The values of the three parameters that maximize the AP_40 for the different classes are given in Section 3.2. To better understand the effect of the density-based Soft-NMS and the piecewise score, we ablate them by using a standard NMS algorithm, as in M3D-RPN, and by setting all object scores to 1 as the baseline implementation (the second row in Table 5). We observe that both components achieve considerable gains from both the 3D and BEV perspectives, and their combination surpasses the baseline by +9.17 on 3D detection at the moderate level.

Assignment Strategy of Positive Samples
We replace our positive sample assignment strategy with the strategies proposed by M3D-RPN [25] and FCOS3D [35]. We calculate the ratios of positive to negative samples under the different strategies and find that our proposed strategy yields a more balanced ratio. Moreover, we compare the 3D and BEV detection AP to verify our method's effectiveness. As shown in Table 6, our positive sample assignment strategy improves performance by +6.95/+2.86 in AP_BEV and +7.49/+1.84 in AP_3D at the moderate level over M3D-RPN and FCOS3D, respectively. The results demonstrate that our proposed strategy effectively eases the imbalance between positive and negative samples and improves the robustness of the network.

Angle Loss
We compare the 3D and BEV detection AP and the Average Heading Similarity (AHS) of our proposed angle loss with the angle losses of AVOD [9] and SECOND [37]. The comparison demonstrates that our angle loss focuses more on the consistency between angle prediction and overlap, rather than struggling with direction recognition in challenging cases. As shown in Table 7, our angle loss improves performance by +3.94/+3.38 in AP_BEV and +2.74/+2.71 in AP_3D over SECOND and AVOD at the moderate level. Moreover, our angle loss improves the AHS on the BEV object detection task by +3.07/+2.42 over SECOND and AVOD at the moderate level. We also observe in practice that our angle loss speeds up the convergence of the angle regression, suggesting that we mitigate the convergence problem at the loss maximum observed with SECOND's angle loss.

Shape-Aware Convolution
As shown in Table 8, we compare the impact of different SAC structures. We use our network without the SAC module as structure A. In Figure 16b, we add a branch of 3 × 1 convolution to A as structure B; the added branch increases the horizontal receptive field of the network. In Figure 16c, we add a branch of 1 × 3 convolution to A as structure C; the added branch increases the longitudinal receptive field of the network. Finally, our proposed SAC structure is referred to as structure D. The results demonstrate that our proposed SAC structure surpasses A by +1.94/+1.89 on easy, +1.15/+1.03 on moderate, and +0.99/+1.08 on hard for the car and pedestrian classes, showing that the 3 × 1 and 1 × 3 convolutions together improve car and pedestrian prediction. Shape-Aware Convolution did not improve the results for cyclist; according to the statistical results in Figure 4, we attribute this to cyclists being relatively rare in KITTI, which makes the network insensitive to this class.

Figure 16. Different Shape-Aware Convolution structures: structure A (a), structure B (b), structure C (c), and structure D (d).

Table 8. Comparison of the AP_40 among the four Shape-Aware Convolution structures in Figure 16 on the 3D object detection task on the KITTI validation set. We use bold type to indicate the best result.

Limitations
As shown in Figure 17, the visualization on the dataset demonstrates that our model is insensitive to some close objects, causing inaccurate predictions or even missed detections of these objects. The visualization illustrates that most of these objects are truncated by the image. Our proposed depth stratification strategy uses the mathematical relationship between an object's size in the image and its depth to enhance depth prediction; however, when the object is truncated by the image, this mathematical relationship no longer holds, causing inaccurate predictions or even missed detections. Although severely occluded objects are not required to be detected in the KITTI dataset, inaccurate prediction or missed detection of close objects is very dangerous in autonomous driving tasks. For the detection of truncated objects, a new branch could be added after the output of the pyramid network, using a strategy specific to truncated objects and responsible for their detection; this is part of our future work.

Figure 17. Some failure cases of our method. We use red boxes to denote the ground-truth boxes and green boxes to denote the predicted boxes.

Conclusions
In this work, we propose a one-stage anchor-free network that directly detects 3D objects from monocular images by applying the newly proposed depth-based stratification structure, IoU-aware angle loss, and density-based Soft-NMS algorithm. The experimental results on the KITTI dataset [33] demonstrate that our network significantly improves accuracy and recall in both the bird's-eye-view and 3D object detection tasks. However, our network suffers from decreased detection accuracy for objects truncated by the image. We will investigate strategies that can be applied to the proposed network to enhance the prediction of close objects.

Data Availability Statement: The datasets presented in this study are available at http://www.cvlibs.net/datasets/kitti/, accessed on 29 June 2022. The code presented in this study is available on request from the corresponding author and will be released in the future.