In this section, we present the proposed method in detail. Section 3.1 describes the construction of the spatial attention frustum (SAF). Section 3.2 describes the local feature aggregation (LFA) module for the point cloud. Section 3.3 describes the feature extractor and fully convolutional network (FCN), and Section 3.4 describes the joint 3D-2D projection loss (PL) function.
3.1. Spatial Attention Frustum (SAF) Module
This study proposes a SAF module based on monocular depth estimation. The segmentation strategy for spatial attention is guided by object height, since the assignment of attention is closely tied to the estimated distance of objects. F-PointNet showed that restricting the search to the local point cloud corresponding to a 2D region proposal avoids traversing the full point cloud and improves detection efficiency. We aim to construct a model that resembles the human visual attention mechanism so that occluded objects can be observed more effectively. Inspired by F-ConvNet, we further examined the strategy of sliding frustums along the viewing direction. Point cloud density decreases with distance, so points on unimportant objects close to the sensor are denser than points on occluded objects farther away. As shown in Figure 3, a fixed frustum slicing step makes unimportant point features and the point features of interest indistinguishable during feature extraction, so computation may be wasted on unimportant points and the feature expression of the occluded object is weakened. Because the occluded object carries only a small weight in the limited feature space, its features become insignificant in subsequent processing. In contrast, the human visual attention mechanism suppresses unimportant object features and increases the weight of occluded object features in the feature map. Therefore, we designed a frustum structure with spatial attention. As shown in Figure 4, given 2D region proposals and camera parameters, the model can focus more strongly on the features of the occluded object.
SAF Segmentation Method
We estimated a coarse distance for the model to focus on the features of the occluded object, while the exact regression of the 3D position was performed in the point cloud. Therefore, we chose a relatively lightweight approach to restore depth based on the principle of camera projection.
As shown in Figure 5, $H$ is the true height of the object in the 3D ground truth. The height $H$ of each cuboid is fixed in 3D space, but the projected heights of its four vertical edges differ on the image plane. For $H$, we used the class-average height obtained from the height statistics of the dataset for cars in KITTI. In the image plane, the larger the projected height of a vertical edge, the closer the corresponding edge in 3D space is to the camera. Therefore, we chose the side with the larger projected vertical edge height to estimate the closest possible depth. For the depth $d$ of each vertical edge of the box projected onto the image plane, we assumed a distortion-free camera and solved Equation (1):
$$d_i = \frac{f \cdot H}{h_i}, \quad i = 1, 2, \quad (1)$$
where $f$ is the focal length of the camera, $d_i$ is the true distance from the optical centre $O$ to the observed object, and $h_1$ and $h_2$ are the heights on the image plane of the projected vertical edges of the target bounding box cuboid in 3D space.
In practice, it is complicated to recover the projections of the individual vertical edges of the 3D box on the image plane from a 2D detector. Therefore, we did not rely on the 2D detector to regress the orientation angle of the 3D box and only took the class and position information of the 2D box from the detector. As shown in Figure 5, the rough depth estimate can be completed from the red 2D box. The projection height was calculated according to Equation (2):
$$h = y_{\max} - y_{\min}, \quad (2)$$
where $y_{\max}$ and $y_{\min}$ are the bottom and top image coordinates of the 2D box. The corresponding depth estimate was then
$$d_{est} = \frac{f \cdot \bar{H}}{h}, \quad (3)$$
where $\bar{H}$ is the average object height of the class.
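To make the geometry concrete, the following is a minimal sketch of the coarse depth estimate described by Equations (1)–(3); the variable names, the box layout, and the mean-height value are illustrative assumptions rather than the exact implementation.

```python
def estimate_depth(box_2d, focal_px, mean_height_m):
    """Coarse monocular depth from a 2D detection via the pinhole model.

    box_2d:        (x_min, y_min, x_max, y_max) in pixels (hypothetical layout)
    focal_px:      camera focal length in pixels
    mean_height_m: class-average object height in metres (dataset statistic)
    """
    _, y_min, _, y_max = box_2d
    h_proj = max(y_max - y_min, 1e-6)          # projected height, Equation (2)
    return focal_px * mean_height_m / h_proj   # depth estimate, Equation (3)


# Example usage with illustrative numbers (not from the paper): a 2D car box
# 120 px tall, a focal length of ~721 px (typical for the KITTI colour camera).
depth = estimate_depth((400.0, 180.0, 520.0, 300.0), focal_px=721.5, mean_height_m=1.5)
print(f"coarse depth estimate: {depth:.1f} m")
```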
F-ConvNet verified that integrating frustum features at multiple resolutions is effective. To simplify feature alignment in the subsequent operations, we kept some of the settings of the original paper in the FCN module. For the frustum of each region proposal, we propose the following segmentation scheme.
Table 1 lists the division scales and numbers of sub-frustums. The division scale is the slicing granularity of the frustum and can also be interpreted as the resolution of the frustum-level features; there are four resolution levels, T, T/2, T/4, and T/8, where T is generally taken as a multiple of 8. The slice step of each sub-frustum is its length along the viewing axis of the frustum, denoted as $s_i$, where $i \in \{1, 2, 3, 4\}$ is the resolution level of the division scale. The slice parameters are determined by the correction factor $\alpha$ and the estimated depth $d_{est}$, where the correction coefficient $\alpha$ compensates for the error of the distance estimation. $L$ is the total length of the extracted frustum. Num A is the number of sub-frustums in the non-attention region, and Num B is the number of sub-frustums in the region of interest (ROI). For a frustum of any scale, the sub-frustum length $l_A$ and step length $s_A$ of the non-attention region are obtained from Equation (4), and the sub-frustum length $l_B$ and step length $s_B$ of the ROI at the four scales are obtained from Equations (5)–(8), respectively.
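Since Equations (4)–(8) are referenced through Table 1 above, the sketch below only illustrates the general idea of the scheme, coarse slices outside the ROI and finer slices inside it; the way the ROI interval is derived from the corrected depth estimate, the proportional split, and all names are assumptions for illustration, not the paper's exact formulas.

```python
import numpy as np

def slice_frustum(total_len, roi_start, roi_end, num_a, num_b):
    """Illustrative SAF-style slicing along the frustum axis (assumed scheme).

    The region outside [roi_start, roi_end] is covered by num_a coarse slices,
    split proportionally before and after the ROI; the ROI itself is covered
    by num_b fine slices. Returns a list of (start, end) intervals.
    """
    front_len, back_len = roi_start, total_len - roi_end
    n_front = max(1, round(num_a * front_len / (front_len + back_len)))
    n_back = max(1, num_a - n_front)

    front = np.linspace(0.0, roi_start, n_front + 1)
    roi = np.linspace(roi_start, roi_end, num_b + 1)
    back = np.linspace(roi_end, total_len, n_back + 1)

    edges = np.concatenate([front, roi[1:], back[1:]])
    return list(zip(edges[:-1], edges[1:]))

# Example: a 70 m frustum with the ROI centred on a ~30 m coarse depth estimate.
for start, end in slice_frustum(70.0, roi_start=25.0, roi_end=35.0, num_a=4, num_b=8):
    print(f"{start:5.1f} m -> {end:5.1f} m")
```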
3.2. Local Feature Aggregation (LFA) Module
For an occluded object, usually only part of its point cloud is visible to the LiDAR, and the missing features increase the difficulty of recognition. Enhancing the understanding of the local structure of the object is therefore crucial, because the full object position must sometimes be inferred from a small number of local points. We argue that each point should have a larger receptive field at the sampling stage while sampling efficiency is maintained. We analysed and compared the common point cloud sampling methods before selecting one. Farthest point sampling (FPS) was considered first because it guarantees good coverage of the sampled points; however, its quadratic computational complexity makes the real-time performance of the model unsatisfactory on large-scale point cloud scenes. Grid sampling discretizes 3D space into cells, samples one point per cell, and controls the spacing between points through the cell size, but its uniformity is worse than that of FPS. Curvature-based sampling is stable with respect to shape, but the long curvature computation time makes it unsuitable for large-scale datasets. Random sampling (RS) has the most efficient, constant computational complexity and scales well to datasets of any size, but it inevitably discards some useful information, which can harm the feature representation of the model.
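As a rough illustration of why RS scales better than FPS, the following sketch contrasts the two on a random point set; it is a minimal reference implementation, not the sampling code used in this work.

```python
import numpy as np

def random_sample(points, m):
    """Random sampling: O(m), independent of the total number of points."""
    idx = np.random.choice(len(points), size=m, replace=False)
    return points[idx]

def farthest_point_sample(points, m):
    """Farthest point sampling: O(m * n) distance updates, ~O(n^2) when m ~ n."""
    n = len(points)
    selected = np.zeros(m, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)
    for i in range(1, m):
        # Update each point's squared distance to the nearest already-selected point.
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(dist))   # pick the point farthest from the set
    return points[selected]

pts = np.random.rand(16384, 3).astype(np.float32)
sub_rs = random_sample(pts, 4096)            # fast, may miss sparse regions
sub_fps = farthest_point_sample(pts, 4096)   # even coverage, quadratic cost
```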
This study therefore selected the RS algorithm, which allows the model to work well on datasets of any size. Inspired by RandLA-Net and PointNet [33,34], we designed an LFA module to increase the receptive field of each point. The LFA module uses the K nearest neighbour (KNN) algorithm to find the K nearest neighbouring points. Figure 6 illustrates the feature aggregation and down-sampling process of the LFA module. The red dashed box shows the feature aggregation process for the sampled points, and the number of points after each RS operation is reduced to half of the original number.
For any one of the frustums, assume that it contains $M$ local points, represented in the camera coordinate system as $\{p_1, p_2, \dots, p_M\}$. Instead of the camera coordinates, we used the coordinates relative to the centre $c$ of the current frustum, calculated as $\tilde{p}_i = p_i - c$. For each point $p_i$, the KNN algorithm finds its $K$ nearest neighbouring points $\{p_i^1, p_i^2, \dots, p_i^K\}$ in Euclidean space; the Euclidean coordinate of each point is $p_i \in \mathbb{R}^3$, and the point feature of each neighbour is $f_i^k$. Local graph-structure position encoding is then performed on the 3D coordinates of $p_i$ and its neighbours. A multilayer perceptron (MLP) maps the encoded positions to a high-dimensional space, concatenates them with the original neighbouring point features, and pools them; the output is used as the new point feature. The encoding of the neighbour features is given by Equation (9):
$$r_i^k = \mathrm{MLP}\left(p_i \oplus p_i^k \oplus \left(p_i - p_i^k\right) \oplus \left\|p_i - p_i^k\right\|\right) \quad (9)$$
The new point features are given by Equation (10):
$$\hat{f}_i = \mathrm{MaxPool}\left(\left\{r_i^k \oplus f_i^k\right\}_{k=1}^{K}\right) \quad (10)$$
In Equations (9) and (10), MLP denotes the multi-layer perceptron, MaxPool denotes maximum pooling, $p_i$ is the coordinate of the selected point, $p_i^k$ is the coordinate of the neighbour point, $\oplus$ is the concatenation operation, $p_i - p_i^k$ is the relative coordinate, and $\left\|p_i - p_i^k\right\|$ is the Euclidean distance. The KNN search ensures that neighbouring points can still be found in sparse regions of the point cloud. After two rounds of down-sampling over the aggregated local features, the sampled points can be considered to have a larger receptive field. The local graph structure embeds the coordinates of all neighbouring points and efficiently learns complex local structures, retaining more local features.
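A minimal sketch of this aggregation step is given below, assuming a RandLA-Net-style relative position encoding as in Equations (9) and (10); the tensor layouts, layer sizes, and the brute-force KNN are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class LocalFeatureAggregation(nn.Module):
    """KNN-based local spatial encoding and max-pooling (Equations (9) and (10))."""

    def __init__(self, feat_dim: int, enc_dim: int, k: int = 16):
        super().__init__()
        self.k = k
        # Encodes [p_i, p_i^k, p_i - p_i^k, ||p_i - p_i^k||] into a high-dimensional space.
        self.mlp = nn.Sequential(nn.Linear(3 + 3 + 3 + 1, enc_dim), nn.ReLU())

    def forward(self, xyz: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) point coordinates; feats: (N, C) point features.
        # Brute-force KNN in Euclidean space (fine for frustum-sized point sets).
        dists = torch.cdist(xyz, xyz)                        # (N, N)
        knn_idx = dists.topk(self.k, largest=False).indices  # (N, K)

        neigh_xyz = xyz[knn_idx]                             # (N, K, 3)
        neigh_feat = feats[knn_idx]                          # (N, K, C)
        center = xyz.unsqueeze(1).expand_as(neigh_xyz)       # (N, K, 3)
        rel = center - neigh_xyz                             # relative coordinates
        dist = rel.norm(dim=-1, keepdim=True)                # Euclidean distances

        # Equation (9): encode the concatenated positions with an MLP.
        r = self.mlp(torch.cat([center, neigh_xyz, rel, dist], dim=-1))  # (N, K, enc_dim)
        # Equation (10): concatenate with neighbour features and max-pool over K.
        return torch.cat([r, neigh_feat], dim=-1).max(dim=1).values      # (N, enc_dim + C)

# Example usage on a toy frustum of 512 points with 8-dimensional features.
xyz = torch.rand(512, 3)
feats = torch.rand(512, 8)
new_feats = LocalFeatureAggregation(feat_dim=8, enc_dim=32, k=16)(xyz, feats)
print(new_feats.shape)  # torch.Size([512, 40])
```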
3.3. Feature Extractor and Fully Convolutional Network (FCN)
As in F-ConvNet, we used a weight-shared PointNet for parallel processing to aggregate point features into frustum features. The PointNet module consists of three MLP layers and one max pooling layer. T weight-shared PointNets aggregate the features of the T sub-frustums into T frustum-level feature vectors, which are combined into a 2D feature map F used as the input of the subsequent FCN. The FCN contains four convolution layers and three deconvolution layers, and each convolutional layer is followed by batch normalization and a ReLU nonlinearity. Except for the first convolutional layer, each convolution block uses a stride-2 convolution to down-sample the 2D feature map, so the output of each convolution block in the FCN has a 2-fold lower resolution in the frustum dimension. At the T/2 scale, the feature map is compatible with its corresponding map in the FCN; to maintain the integrity of the FCN, we concatenate the feature vectors extracted in the T-scale down-sampling process with the feature vectors of the T/2 scale and use a fusion convolution layer to keep the size constant. The feature map output by each convolution block is up-sampled by the corresponding deconvolution block, and all deconvolution outputs are concatenated along the feature dimension. Our detection header consists of a classification (CLS) part and a regression (REG) part.
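The following PyTorch sketch shows one plausible realization of the FCN trunk described above (four conv blocks with stride-2 down-sampling after the first, three deconv blocks whose outputs are concatenated); the channel widths and kernel sizes are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride):
    # 1D convolution over the frustum dimension, followed by BN and ReLU.
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm1d(c_out),
        nn.ReLU(inplace=True),
    )

class FrustumFCN(nn.Module):
    """Sketch of the FCN trunk: 4 conv blocks (stride 2 except the first)
    and 3 deconv blocks whose outputs are concatenated along the channels."""

    def __init__(self, c_in=128, c=128):
        super().__init__()
        self.convs = nn.ModuleList([
            conv_block(c_in, c, stride=1),   # keeps resolution T
            conv_block(c, c, stride=2),      # T   -> T/2
            conv_block(c, c, stride=2),      # T/2 -> T/4
            conv_block(c, c, stride=2),      # T/4 -> T/8
        ])
        # Deconv blocks up-sample the three down-sampled maps back to resolution T/2.
        self.deconvs = nn.ModuleList([
            nn.ConvTranspose1d(c, c, kernel_size=1, stride=1),
            nn.ConvTranspose1d(c, c, kernel_size=2, stride=2),
            nn.ConvTranspose1d(c, c, kernel_size=4, stride=4),
        ])

    def forward(self, f):                    # f: (B, c_in, T) frustum-level feature map
        skips = []
        x = f
        for conv in self.convs:
            x = conv(x)
            skips.append(x)
        # Up-sample the T/2, T/4 and T/8 maps and concatenate along the channels.
        ups = [deconv(s) for deconv, s in zip(self.deconvs, skips[1:])]
        return torch.cat(ups, dim=1)         # (B, 3c, T/2)

# Example: T = 32 sub-frustums, 128-dimensional frustum features.
out = FrustumFCN()(torch.rand(2, 128, 32))
print(out.shape)  # torch.Size([2, 384, 16])
```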
3.4. Projection Loss Function
To fully exploit the strong performance of the 2D detector, and inspired by [31], we proposed a 3D-2D coupled loss function in the regression stage to obtain a more accurate 3D box estimate. The ideal 2D bounding box is the projection of the 3D bounding box onto the image plane, so the constraint that the 2D bounding box imposes on the 3D bounding box should be fully used during 3D box regression. The ground truth 3D bounding box is represented as $(x, y, z, l, w, h, \theta)$ in the LiDAR coordinate system, where $(x, y, z)$ denotes the coordinates of the box centre, $(l, w, h)$ denotes the three side lengths of the box, and $\theta$ is the object orientation in the bird's-eye view (BEV). The 2D bounding box is represented as $(u, v, w^{2d}, h^{2d})$, where $(u, v)$ is the 2D bounding box centre and $(w^{2d}, h^{2d})$ is the 2D bounding box size. The projection from a point $\mathbf{x}$ in the Velodyne LiDAR coordinate system to the image coordinate $\mathbf{y}$ is given by Equation (11):
$$\mathbf{y} = P_{rect}\, R_{rect}\, T_{velo}^{cam}\, \mathbf{x} \quad (11)$$
In Equation (11), $\mathbf{x}$ is the homogeneous coordinate form of the point cloud point, $P_{rect}$ is the $3 \times 4$ projection matrix containing the camera parameters, $R_{rect}$ is the rectifying rotation matrix of the reference camera (expanded to homogeneous form), and $T_{velo}^{cam}$ is the extrinsic matrix between the LiDAR and the camera obtained by calibration, composed of the rotation matrix $R$ and the translation vector $t$ as follows:
$$T_{velo}^{cam} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \quad (12)$$
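As a reference for Equations (11) and (12), the following sketch projects LiDAR points onto the image plane with KITTI-style calibration matrices; the array shapes follow the KITTI calibration convention, and the helper itself is an illustrative assumption rather than the paper's code.

```python
import numpy as np

def project_lidar_to_image(pts_velo, P_rect, R_rect, Tr_velo_to_cam):
    """Project LiDAR points to pixel coordinates (Equations (11) and (12)).

    pts_velo:       (N, 3) points in the Velodyne coordinate system
    P_rect:         (3, 4) projection matrix of the rectified camera
    R_rect:         (3, 3) rectifying rotation of the reference camera
    Tr_velo_to_cam: (3, 4) extrinsic [R | t] from LiDAR to camera
    """
    n = pts_velo.shape[0]
    x_h = np.hstack([pts_velo, np.ones((n, 1))])    # homogeneous coordinates, (N, 4)

    T = np.vstack([Tr_velo_to_cam, [0, 0, 0, 1]])   # Equation (12): [[R, t], [0, 1]]
    R = np.eye(4)
    R[:3, :3] = R_rect                               # expand R_rect to homogeneous form

    y = (P_rect @ R @ T @ x_h.T).T                   # Equation (11), (N, 3)
    return y[:, :2] / y[:, 2:3]                      # divide by depth -> pixel (u, v)
```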
We followed the existing study [12] for anchor box generation. For any anchor represented as $(x_a, y_a, z_a, l_a, w_a, h_a, \theta_a)$, the centre offsets $(\Delta x, \Delta y, \Delta z)$, the predefined size offsets $(\Delta l, \Delta w, \Delta h)$, and the orientation offset $\Delta\theta$ were computed. For the regression of the projection of the 3D bounding box, we projected the regressed 3D anchors onto the image to generate 2D anchors of size $(w_a^{2d}, h_a^{2d})$, and computed the 2D centre offsets $(\Delta u, \Delta v)$ and size offsets $(\Delta w^{2d}, \Delta h^{2d})$. The offsets were calculated using Equation (13).
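Equation (13) is not reproduced here; purely for reference, the sketch below shows a residual encoding that is common in anchor-based 3D detectors and matches the offset symbols above. It is an assumption for illustration, not necessarily the exact encoding of this work or of [12].

```python
import numpy as np

def encode_3d_offsets(gt, anchor):
    """Common residual encoding of a 3D box against an anchor (illustrative).

    gt, anchor: (x, y, z, l, w, h, theta) in the LiDAR coordinate system.
    """
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    da = np.sqrt(la**2 + wa**2)                           # anchor diagonal in the BEV
    return np.array([
        (xg - xa) / da, (yg - ya) / da, (zg - za) / ha,   # centre offsets
        np.log(lg / la), np.log(wg / wa), np.log(hg / ha),# size offsets
        tg - ta,                                          # orientation offset
    ])

# Example with a car-like anchor and a nearby ground-truth box.
print(encode_3d_offsets((10.2, 3.1, -1.0, 3.9, 1.6, 1.55, 0.10),
                        (10.0, 3.0, -1.0, 3.9, 1.6, 1.56, 0.00)))
```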
The regression loss function is given in Equation (14). It is based on the Euclidean distance for the centre offsets and the smooth-L1 loss for the size and angle offsets, covering both the 3D box offsets and the projected 2D box offsets. $\lambda_1$, $\lambda_2$, and $\lambda_3$ are loss coefficients. The focal loss [35] is used to calculate the point segmentation loss $L_{seg}$ to handle the class imbalance issue:
$$L_{focal} = -\alpha_t \left(1 - p_t\right)^{\gamma} \log\left(p_t\right), \quad (15)$$
where
$$p_t = \begin{cases} p, & \text{for a foreground point} \\ 1 - p, & \text{otherwise,} \end{cases}$$
where $p$ is the predicted foreground probability of a single 3D point. We also use a corner loss [11] to regularize the box regression over all parameters.
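A minimal PyTorch sketch of the point segmentation focal loss in Equation (15) is given below; the default values of α and γ follow the original focal loss paper and are assumptions here.

```python
import torch

def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    """Binary focal loss for per-point foreground segmentation (Equation (15)).

    logits: (N,) raw scores for each 3D point; labels: (N,) in {0, 1}.
    """
    p = torch.sigmoid(logits)                       # predicted foreground probability
    p_t = torch.where(labels > 0.5, p, 1.0 - p)     # p for foreground, 1 - p otherwise
    alpha_t = torch.where(labels > 0.5, torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))
    return loss.mean()

# Example on random points: 100 logits, ~10% foreground.
logits = torch.randn(100)
labels = (torch.rand(100) < 0.1).float()
print(focal_loss(logits, labels))
```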