Article

OccTr: A Two-Stage BEV Fusion Network for Temporal Object Detection

College of Information and Engineering, Zhejiang University of Technology, Hangzhou 310000, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2024, 13(13), 2611; https://doi.org/10.3390/electronics13132611
Submission received: 13 May 2024 / Revised: 17 June 2024 / Accepted: 1 July 2024 / Published: 3 July 2024
(This article belongs to the Special Issue Artificial Intelligence Empowered Internet of Things)

Abstract

Temporal fusion approaches are critical for 3D visual perception tasks in the IOV (Internet of Vehicles), but they often rely on intermediate representations without fully utilizing the position information contained in the previous frame’s detection results, and thus cannot compensate for the lack of depth information in visual data. In this work, we propose a novel framework called OccTr (Occupancy Transformer) that combines two temporal cues, an intermediate representation and a back-end representation, via an occupancy map to enhance temporal fusion in the object detection task. OccTr leverages attention mechanisms to perform both intermediate and back-end temporal fusion by incorporating intermediate BEV (bird’s-eye view) features and the back-end prediction results of the detector. Our two-stage framework comprises occupancy map generation and cross-attention feature fusion. In stage one, the prediction results are converted into occupancy grid map format to generate the back-end representation. In stage two, the high-resolution occupancy maps are fused with BEV features using cross-attention layers. This fused temporal cue provides a strong prior for the temporal detection process. Experimental results demonstrate the effectiveness of our method in improving detection performance, achieving an NDS (nuScenes Detection Score) of 37.35% on the nuScenes test set, which is 1.94 points higher than the baseline.

1. Introduction

Object detection in 3D space, which provides information about the surroundings of the ego vehicle for subsequent decision-making in autonomous driving, plays a pivotal role in the Internet of Vehicles (IOV) [1]. Cameras are preferred for object detection due to their ability to capture visual road elements and long-distance features. Visual object detection of the surrounding scene aims to predict a set of 3D bounding boxes from 2D RGB images for IOV trajectory analysis works such as [2]. At the same time, the challenge of occlusion arises in 3D perception. To address this issue, leveraging video data and integrating the temporal and spatial dimensions [3] is commonly employed to compensate for missing location information and enable object prediction based on temporal fusion. Generally, in the temporal fusion process, each input image group from a video stream is processed individually to obtain the BEV features or depth map of each frame. The cross-frame BEV features are then aligned with respect to ego-motion and passed through a detector to predict 3D bounding boxes. Another line of work implicitly incorporates spatial depth relations by leveraging the exceptional global perception capabilities of Transformer models such as Cross-view Transformers [4], BEVFormer [5], and VoxFormer [6].
However, previous works have relied on intermediate representations for the temporal fusion of 3D detection tasks. Whether the BEV feature or the depth map is used, these approaches essentially rely on an ambiguous form of feature representation, which inherently lacks the precise position information of 3D point clouds. The object detection system therefore applies biased preframe information that has not passed through a detection process, leading to inefficient temporal fusion. We argue that efficient temporal fusion should incorporate two temporal cues, the BEV feature and the occupancy map, in order to fully leverage the preframe result. This preframe result serves as a temporal memory from the back end and accumulates inherited memory of objects that have become invisible. Therefore, we enhance the temporal fusion of BEV features by incorporating preframe result information, as shown in Figure 1, and introduce the concept of the occupancy grid map into the task of multiframe object detection.
As depicted in Figure 2, we propose a novel system called OccTr (Occupancy Transformer) that facilitates the integration of preframe results into object detection through BEV temporal fusion. OccTr is built upon the Transformer framework, offering an alternative to explicit methods reliant on geometric relations and avoiding the introduction of compound errors. The fundamental concept behind OccTr is to maintain a cross-frame occupancy grid map as spatial memory from preframe detection results, providing a robust prior for enhancing object detection with BEV temporal fusion. In tasks such as 3D detection and semantic segmentation for autonomous driving, the occupancy map represents the occupancy of each location in space in 2D or 3D form. This representation enables the conversion of preframe 3D bounding boxes into a format suitable for 2D BEV temporal fusion. Our proposed two-stage temporal fusion system consists of an occupancy map generation module and an occupancy cross-attention module. In stage-one, we transform the previous frame’s 3D prediction coordinates to a world frame using camera poses and update the occupancy states at each location. By assigning different values to areas inside and outside the detection box, we obtain a 2D occupancy map similar to the BEV feature format. In stage-two, we introduce a cross-attention module that can adapt to occupancy maps with different resolutions. Through this module, ambiguous information in BEV features is modified and enhanced by explicit information from the occupancy map. Ultimately, our two-stage pipeline allows us to leverage preframe detection results and predict object states using two temporal cues, compensating for location information in visual work while effectively handling occlusion and truncation.
We evaluate our system on the nuScenes [8] dataset and perform ablation experiments on the different modules of the system. The experimental results demonstrate that 3D object detection performance is improved compared to the baseline, which lacks preframe prior assistance. In summary, our contributions are as follows:
We propose OccTr, a novel framework that integrates the BEV feature and preframe result for sufficient temporal fusion. By incorporating an occupancy grid map, our model enhances the efficiency and accuracy of perception tasks in multicamera autonomous driving.
We design an occupancy grid map pipeline, comprising an occupancy generation stage and a temporal cue fusion stage, to effectively generate and update the occupancy map from the preframe detector output. Furthermore, we successfully integrate two distinct temporal cues, the BEV feature and the occupancy map, employing diverse methodologies.
The temporal fusion framework OccTr is assessed on the nuScenes dataset. In comparison to the baseline, OccTr achieves an NDS of 37.35% on the nuScenes test set, surpassing the baseline by 1.94 points. Experiments demonstrate that the prior provided by the back-end detection results effectively improves the accuracy of object detection and that the cross-attention form of fusion is also highly effective.

2. Related Works

2.1. Temporal Camera-Based 3D Perception

The 3D perception framework plays a pivotal role in our proposed methodology. Our focus lies on object detection with image inputs, as it aligns with the nature of our work, OccTr. Early approaches to 3D object detection relied on the conventional 2D object detection framework followed by post-processing of results, which overlooked interframe feature relationships. FCOS3D [9] adopts the 2D detection approach of FCOS [10] and treats 3D perception as a depth prediction task in 2D perception. Camera-based 3D object detection typically integrates image features from multiple camera views into a unified bird’s-eye view [11,12,13,14], following the BEV detection paradigm used in LiDAR-based methods such as [15,16]. However, compared to the point cloud input of LiDAR-based methods such as [17], RGB images compress location information from 3D to 2D, placing camera-based 3D perception at a disadvantage. To address this issue of inaccurate location, two families of solutions have been proposed in the literature [18,19,20,21,22]: depth-based approaches [18,19,20,23,24,25] predict a depth map via Structure from Motion [26] or learned depth estimation and map image features into the BEV space, while transformer-based approaches [21,22,27,28,29] project predefined learnable 3D queries in BEV space into the image perspective.
The occlusion and truncation of input images can be effectively addressed through temporal fusion. As shown in Figure 1, perception frameworks [30,31,32] enhance prediction accuracy by incorporating temporal cues into the inference pipeline. In the context of monocular object detection [32], the previous frame’s detection result is directly utilized as a temporal cue to enhance the prediction of the current frame. Another approach to temporal fusion involves utilizing intermediate representations such as BEV features or depth maps as temporal cues, which is commonly employed in autonomous driving 3D perception tasks. For instance, BEVDet4D [31], based on the BEVDet [25] framework, aligns intermediate BEV features from different frames for interaction. Methods like Fiery [30] stack the BEV features from multiple frames together. Furthermore, some works leverage temporal fusion to iteratively optimize predicted depth information during the depth estimation stage, e.g., BEVStereo [33]. Our work aims to address this issue by incorporating back-end prediction as a temporal cue into the temporal fusion pipeline. We posit that combining different temporal fusion schemes can lead to improved object detection accuracy, and, thus, propose a scheme wherein the prediction results of previous frames assist in the intermediate BEV feature temporal fusion.

2.2. Occupancy-Based Perception

Occupancy grid maps classify each spatial location as either occupied or unoccupied and have long been applied in robotics navigation. Based on Bayes’ theorem, the probabilistic occupancy grid map [34] assumes that the grid cells are mutually independent to avoid an exponentially growing number of occupancy hypotheses. This independence assumption is not conducive to camera-based scene understanding, but the concept of occupancy representation it inspired offers an intuitive and effective approach that can be tailored to diverse requirements [6,35,36,37,38,39]. KinectFusion [40] utilizes a truncated signed distance function to construct a spatial occupancy representation by continuously searching for surfaces along camera rays. UDOLO [37] incorporates occupancy representations into the conventional two-stage detection framework for monocular indoor scene tasks. TPVFormer [38], building upon MonoScene [39], extends its application to multiview scenes through occupancy prediction. Additionally, there exist single-frame methodologies that generate occupancy representations from image features and depth maps for subsequent occupancy prediction. The Occformer model [35] encodes multiple layers of 3D feature volumes with varying sizes and formulates occupancy representations as a set of binary masks associated with different category labels. Voxformer [6] utilizes depth prediction to generate proposals for the 3D occupancy grid. Voxelformer [41] iteratively refines BEV features by incorporating 3D occupancies generated from depth maps. In contrast, Surroundocc [36] generates offline ground-truth occupancy values and supervises the occupancy prediction process using an occupancy map as the primary form of representation. We utilize the predictions from previous frames to generate online occupancy representations for temporal multiview outdoor scene object detection.

3. Proposed Method

For multicamera perception tasks, the inefficiency of temporal fusion with BEV features stems from the inaccurate location information of image input. Leveraging the 3D bounding box predictions from previous frames can provide a robust prior for detecting objects in the current frame. In this study, we propose a novel approach for temporal object detection that effectively transforms geometric information of bounding boxes into feature representations and employs attention mechanisms to fuse them with BEV features, thereby enhancing temporal fusion for object detection.

3.1. Overall Architecture

As shown in Figure 2, given a set of 3D detection bounding boxes $\{box_{t-1}^{i}\}_{i=1}^{M}$ obtained by the detection decoder of frame $t-1$, where each 3D bounding box is in the $3 \times 8$ corner form $(x_1, y_1, z_1, x_2, y_2, z_2, \ldots, x_8, y_8, z_8)$, our objective is to transform the geometric information of the bounding boxes into feature-based data and integrate it with the BEV features using an appropriate methodology, facilitating temporal fusion for enhanced object detection.
The image set is represented as $I_t = \{I_1, I_2, \ldots, I_N\}$, where $t$ is the frame index and $N = 6$ denotes the number of cameras in different views. Given the multiview images $I_t$ of the current frame, the image feature set $F_t = \{F_1, F_2, \ldots, F_N\}$ is extracted via a backbone network. On the temporal dimension, we add the intermediate feature $F_{BEV}^{t-1}$ of frame $t-1$ and the occupancy grid map $Occ_{t-1}$ generated from the detection result $Result_{t-1}$ to provide a temporal cue for the generation of $F_{BEV}^{t}$. The multiview image features $F_{img} = \{F_{img}^{i}\}_{i=1}^{N}$ are fused by spatial cross-attention layers in the unified BEV space to obtain $F_{BEV}^{t}$. The detection result is derived by the detector from the BEV features of the current frame. It is then processed by the occupancy map generation module and the occupancy cross-attention module to provide a prior for object detection in frame $t+1$. The overall workflow can be summarized as follows:
$Occ_{t-1} = \varphi_{Occ}(\mathcal{N}(I_{t-1}))$ (1)
$F_{OB}^{t-1} = \mathrm{CrossAttn}(F_{BEV}^{t-1}, Occ_{t-1})$ (2)
$F_{BEV}^{t} = \mathrm{AlignAttn}(F_{img}^{t}, F_{OB}^{t-1})$ (3)
$Occ_{t} = \varphi_{Occ}(\mathrm{Decoder}(F_{BEV}^{t}))$ (4)
where $\mathcal{N}$ is a neural network that takes image information as input and outputs object detection predictions, $\varphi_{Occ}(\cdot)$ is the module that generates the occupancy-space information, and $Occ_{t-1}$ denotes the occupancy of the scene around the ego vehicle in frame $t-1$. By adding the occupancy information $Occ_{t-1}$ of frame $t-1$ to the object detection process of frame $t$, the accurate position information already obtained from the previous frame is introduced as a prior for the current frame. This avoids relying solely on the fuzzy information of intermediate features as a temporal cue.
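To make the workflow of Equations (1)–(4) concrete, the following Python sketch traces one temporal step. It is only an illustration: the five callables (`backbone`, `occ_generator`, `occ_cross_attn`, `align_attn`, `detector`) are hypothetical placeholders for the modules in Figure 2, not the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple
import torch

def occtr_step(
    images_t: List[torch.Tensor],       # multiview RGB images of frame t
    state: Dict[str, object],           # temporal memory carried over from frame t-1
    backbone: Callable,                 # image feature extractor (e.g., ResNet-50 + FPN)
    occ_generator: Callable,            # stage one: boxes -> occupancy map, Eq. (1)
    occ_cross_attn: Callable,           # stage two: BEV/occupancy fusion, Eq. (2)
    align_attn: Callable,               # ego-motion alignment + BEV encoding, Eq. (3)
    detector: Callable,                 # detection decoder, feeds Eq. (4)
) -> Tuple[object, Dict[str, object]]:
    """One temporal step of the OccTr pipeline (illustrative sketch)."""
    feats_t = [backbone(img) for img in images_t]                 # F_img^t
    occ_prev = occ_generator(state["boxes_prev"],
                             occ_prev=state["occ_prev"])          # Occ_{t-1}, Eq. (1)
    f_ob_prev = occ_cross_attn(state["bev_prev"], occ_prev)       # F_OB^{t-1}, Eq. (2)
    bev_t = align_attn(feats_t, f_ob_prev)                        # F_BEV^t, Eq. (3)
    boxes_t = detector(bev_t)                                     # back-end cue for t+1, Eq. (4)
    new_state = {"bev_prev": bev_t, "boxes_prev": boxes_t, "occ_prev": occ_prev}
    return boxes_t, new_state
```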

3.2. Occupancy Map Generation

In multicamera 3D perception for autonomous driving, the temporal fusion approach that utilizes intermediate features as temporal cues may overlook deterministic information in the back-end detection results. However, object detection results contain valuable geometry information about 3D bounding boxes that are difficult to fuse with primary image features or a unified BEV representation. To address this issue, we propose an occupancy grid map generation module (shown in Figure 3 and described by Equations (1) and (4)) that initializes a set of reference points in BEV space to incorporate back-end detection results into each location’s occupancy on the BEV space.
The detection result of the previous frame is $3Dboxes_{t-1} = \{box_{t-1}^{1}, box_{t-1}^{2}, \ldots, box_{t-1}^{M}\}$, where $M$ denotes the total number of detected boxes in frame $t-1$. We establish a set of reference points $ref_{Occ}$ with resolution $(H_{Occ}, W_{Occ}, Z_{Occ})$. By calculating the distances between the points and the edges of each 3D bounding box, we establish the positional relationship between points and bounding boxes and, based on this relationship, determine whether a location is occupied. The occupancy information $Occ_{t-1}$ is derived using the following equation:
$Occ_{t-1} = \varphi_{Occ}(3Dboxes_{t-1}, ref_{Occ}) + Occ_{t-2}$ (5)
where $\varphi_{Occ}(\cdot)$ is the process of converting $3Dboxes_{t-1}$ into the 2D BEV occupancy map form. Following the idea of the occupancy grid map, for each point $(x, y, z)$ in $ref_{Occ}$, we calculate its spatial relationship with each box $box_{t-1}^{j}$, $j = 1, \ldots, M$, by geometric means. When $(x, y, z)$ lies inside a 3D bounding box, the position is considered occupied and its value is set to $+1$. Points outside the 3D bounding boxes are considered unoccupied and assigned a value of $-1$. We then compress the assigned reference point set $ref_{Occ}$ along the height dimension to obtain the intermediate occupancy representation $Occ_{int}$, which has a resolution of $(H_{Occ}, W_{Occ})$ and whose value at each position $(x, y)$ lies in $[-Z_{Occ}, Z_{Occ}]$. For further processing, we set a threshold $\delta$ on the value $Occ_{int}(x, y)$ at position $(x, y)$:
$Occ_{int}(x, y) < \delta \Rightarrow \text{unoccupied}, \qquad Occ_{int}(x, y) \geq \delta \Rightarrow \text{occupied},$ (6)
with which the occupancy space is binarized.
We employ this straightforward and intuitive approach to categorize locations with different values into unoccupied and occupied regions, resulting in the final occupancy representation $Occ_{t-1}$ of the detection result at frame $t-1$. $Occ_{t-1}$ incorporates positional information learned from the previous frame, which offers more valuable position priors for the current frame than intermediate BEV features. While RGB image data provide richer visual features, they lack sufficient location information. By incorporating occupancy information, we introduce temporal references from the location history to address this limitation of the current frame images $I_t = \{I_1, I_2, \ldots, I_N\}$.
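The sketch below illustrates one way stage one (Equations (5) and (6)) could be realized. It is a simplified, hypothetical implementation: the BEV footprint of each box is approximated by its axis-aligned extent rather than an exact point-in-box test, the grid extent follows Section 4.2, and the threshold value is an assumption.

```python
import numpy as np

def generate_occupancy_map(boxes_bev, grid=200, pc_range=(-51.2, 51.2),
                           z_bins=4, delta=0.0, occ_prev=None):
    """Stage-one occupancy generation (illustrative sketch of Eqs. (5)-(6)).

    boxes_bev: (M, 4) array of BEV footprints [x_min, y_min, x_max, y_max]
               taken from the previous frame's 3D boxes (world frame).
    Columns of reference points start at -z_bins (every point outside a box);
    columns covered by a footprint are set to +z_bins, mimicking the height
    compression, and the map is binarized with threshold `delta`.
    """
    cell = (pc_range[1] - pc_range[0]) / grid
    occ_int = np.full((grid, grid), -float(z_bins), dtype=np.float32)

    for x0, y0, x1, y1 in boxes_bev:
        i0 = max(int((x0 - pc_range[0]) / cell), 0)
        i1 = min(int(np.ceil((x1 - pc_range[0]) / cell)), grid)
        j0 = max(int((y0 - pc_range[0]) / cell), 0)
        j1 = min(int(np.ceil((y1 - pc_range[0]) / cell)), grid)
        occ_int[i0:i1, j0:j1] = float(z_bins)      # cells intersecting a box footprint

    occ = (occ_int >= delta).astype(np.float32)    # 1 = occupied, 0 = unoccupied
    if occ_prev is not None:
        occ = np.maximum(occ, occ_prev)            # fold in the historical map Occ_{t-2}
    return occ
```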

3.3. Time Cues Fusion

The occupancy information $Occ$, whose form is consistent with the space in which temporal fusion operates, is thus acquired. Our experiments reveal that simply applying the occupancy map as a mask or concatenating it with the BEV features cannot fully exploit the occupancy cues and is also incompatible with higher-resolution occupancy information $Occ$. To effectively leverage the back-end temporal cue, the occupancy information $Occ$ must be fused with the BEV features more efficiently.
We employ cross-attention, as illustrated in Figure 4 and Equation (2), to integrate the occupancy state information $Occ$ and the BEV feature $F_{BEV}$. This cross-attention approach enables the fusion of features with different sizes, effectively mitigating the computational burden associated with high-resolution features. Initially, a set of reference points $ref_{BEV}$, consistent with the dimensions $(H_{BEV}, W_{BEV})$ of the BEV feature $F_{BEV}$, is initialized and projected onto the occupancy information $Occ$ at resolution $(H_{Occ}, W_{Occ})$. Leveraging this positional correspondence, $F_{BEV}$ serves as a query to retrieve location-based points and aggregate information using deformable attention.
The query tensor $F_{BEV} \in \mathbb{R}^{C \times H \times W}$, shown in Figure 4, is associated with locations on the occupancy map by leveraging the mapping relationship between features of different resolutions in a uniformly scaled space. The occupancy map is then treated as a feature map, the features surrounding each mapped location are sampled, and the sampled features are aggregated back into the query tensor $F_{BEV} \in \mathbb{R}^{C \times H \times W}$, enabling efficient fusion of the occupancy information $Occ$ and the BEV features. The overall process can be summarized as follows:
$\mathrm{DeformAttn}(q, p, x) = \sum_{i=1}^{M} W_i \sum_{j=1}^{N} A_{ij} \, W_i' \, x(p + \Delta_{ij})$ (7)
F O B = D e f o r m A t t n F B E V , P ( p B E V ) , O c c
where $\rho(\cdot)$ denotes the mapping between the reference points and the corresponding positions on $Occ$. For each query, its position $p_{BEV}$ maps to a position on $Occ$ as $p = \rho(p_{BEV})$, and $x(p + \Delta_{ij})$ is the feature at the sampled position $p + \Delta_{ij}$ on $Occ$. $W_i$ and $W_i'$ are the learnable weights of the $i$th head, and $A_{ij} \in [0, 1]$ is the attention weight corresponding to the $i$th head and the $j$th sampling point, calculated by the dot product of query and key. The output $F_{OB}$ is finally obtained.
The outcome of fusing high-resolution back-end cues effectively addresses the issues of insufficient training and semantic ambiguity associated with the intermediate feature $F_{BEV}$. Additionally, the cross-attention fusion scheme mitigates the significant increase in computational cost that would result from directly incorporating the high-resolution $Occ$.
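A minimal single-head sketch of how such a deformable cross-attention layer could look is given below. The offset scale, number of sampling points, and layer sizes are assumptions for illustration; the actual module uses multi-head deformable attention as in Equation (7).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OccCrossAttention(nn.Module):
    """Single-head sketch of the occupancy cross-attention (cf. Eqs. (7)-(8)).

    Each BEV query at reference point p_BEV is mapped onto the occupancy map,
    samples `num_points` nearby occupancy values with learned offsets, and
    aggregates them with learned attention weights (a simplified stand-in for
    multi-head deformable attention).
    """
    def __init__(self, dim: int = 256, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, 2 * num_points)  # Delta_ij in normalized coordinates
        self.weights = nn.Linear(dim, num_points)      # A_ij
        self.value_proj = nn.Linear(1, dim)            # lift 1-channel occupancy to C dims
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, bev_query, ref_pts, occ):
        # bev_query: (B, Nq, C); ref_pts: (B, Nq, 2) as (x, y) in [0, 1]; occ: (B, 1, H_occ, W_occ)
        B, Nq, _ = bev_query.shape
        offs = 0.05 * self.offsets(bev_query).view(B, Nq, self.num_points, 2)  # small neighborhood
        attn = self.weights(bev_query).softmax(dim=-1)                         # (B, Nq, K)
        loc = (ref_pts.unsqueeze(2) + offs).clamp(0, 1) * 2 - 1                # grid_sample range [-1, 1]
        sampled = F.grid_sample(occ, loc, align_corners=False)                 # (B, 1, Nq, K)
        sampled = self.value_proj(sampled.permute(0, 2, 3, 1))                 # (B, Nq, K, C)
        fused = (attn.unsqueeze(-1) * sampled).sum(dim=2)                      # weighted aggregation
        return bev_query + self.out_proj(fused)                                # F_OB
```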

3.4. BEV Feature Alignment

The BEV feature of the previous frame, with enhanced position information, is thus obtained. However, there remains a discrepancy between the representations of different frames due to ego-motion. In Equation (3), we utilize the can_bus extension package of the dataset to build a set of queries: a set of queries $Q$ is initialized and subsequently augmented with the can_bus information. Queried by $Q_{can\_bus}$, which carries the position, velocity, and acceleration of the ego-motion, $F_{OB}^{t-1}$ is converted to $F_{OB}^{a}$, which is expressed in the current ego coordinate frame and aligned with the BEV space of the current frame. The query-based alignment formula is as follows:
$F_{OB}^{a} = \mathrm{DeformAttn}(Q_{can\_bus}, p, F_{OB}^{t-1})$ (9)
where $Q_{can\_bus}$ denotes the can_bus queries and $p$ is the position of the queries. The aligned feature $F_{OB}^{a}$ can then be regarded as a set of queries carrying temporal cues in the BEV space of the current frame. Each query is based on the corresponding abundant feature and exact position from the previous frame, and the queries continue to interact with the 2D image features through deformable attention.
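The snippet below sketches one way the can_bus signals could condition the BEV queries used in Equation (9). The 18-dimensional can_bus vector follows the nuScenes convention adopted by BEVFormer; the embedding and MLP sizes are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CanBusQueries(nn.Module):
    """Builds ego-motion-conditioned BEV queries Q_can_bus (illustrative sketch)."""
    def __init__(self, bev_h: int = 50, bev_w: int = 50, dim: int = 256, can_bus_dim: int = 18):
        super().__init__()
        self.bev_queries = nn.Embedding(bev_h * bev_w, dim)   # one learnable query per BEV cell
        self.can_bus_mlp = nn.Sequential(                     # embeds position/velocity/acceleration
            nn.Linear(can_bus_dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, dim),
        )

    def forward(self, can_bus: torch.Tensor) -> torch.Tensor:
        # can_bus: (B, 18) ego-motion signals of the current frame.
        q = self.bev_queries.weight.unsqueeze(0).expand(can_bus.size(0), -1, -1)
        return q + self.can_bus_mlp(can_bus).unsqueeze(1)     # Q_can_bus, then fed to DeformAttn

# Usage (shapes only): CanBusQueries()(torch.zeros(2, 18)) -> (2, 2500, 256)
```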

4. Experiments and Results

4.1. Datasets

We conducted training and evaluation on the nuScenes dataset [8], a widely used large-scale autonomous driving dataset. The RGB images in nuScenes are collected from six cameras mounted on the vehicle’s roof, providing a 360° horizontal field of view centered on the ego vehicle. The dataset consists of 1000 scenes, each lasting 20 s, with keyframes captured at a frequency of 2 Hz. For object detection tasks, the nuScenes dataset includes 1.4 million annotated 3D bounding boxes across 10 categories. Additionally, it provides official evaluation metrics for detection tasks. To improve evaluation accuracy, instead of IoU-based matching for mAP, nuScenes matches predictions to ground truth by ground-plane center distance and additionally reports true positive metrics (ATE, ASE, AOE, AVE, and AAE) that measure the matching errors. These metrics are combined into a scalar score called NDS for comprehensive evaluation, as follows:
$mAP = \frac{1}{|\mathbb{C}|\,|\mathbb{D}|} \sum_{c \in \mathbb{C}} \sum_{d \in \mathbb{D}} AP_{c,d}$ (10)
$NDS = \frac{1}{10} \left[ 5\, mAP + \sum_{mTP \in \mathbb{TP}} \left( 1 - \min(1, mTP) \right) \right]$ (11)
where $\mathbb{C}$ is the set of classes, $\mathbb{D}$ is the set of matching distance thresholds, and $\mathbb{TP}$ is the set of the five mean true positive metrics. We use these metrics to evaluate the model in the later experiments.
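As a quick worked check of Equation (11), the short function below computes the NDS from an mAP value and the five mean TP errors; plugging in the OccTr-200 row of Table 2 reproduces the reported score of 0.3735.

```python
def nds(mAP: float, tp_errors: dict) -> float:
    """nuScenes Detection Score, Eq. (11): 5*mAP plus the clipped TP terms, divided by 10."""
    tp_score = sum(1.0 - min(1.0, err) for err in tp_errors.values())
    return (5.0 * mAP + tp_score) / 10.0

# Values taken from the OccTr-200 row of Table 2.
score = nds(0.2711, {"mATE": 0.8757, "mASE": 0.2916, "mAOE": 0.6199,
                     "mAVE": 0.6168, "mAAE": 0.2165})
print(round(score, 4))  # 0.3735
```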
To assess OccTr’s robustness, we also conducted experiments on another large-scale autonomous driving dataset, Waymo [42]. Its RGB images are captured by five cameras covering a 252° horizontal field of view around the ego vehicle. The dataset consists of 1950 scenes, each lasting 20 s and collected at 10 Hz, totaling 390,000 frames. The 3D detection metric is APH (average precision with heading information), which is similar to AP (average precision) but takes heading information into account. The equation is presented below:
$APH = 100 \int_{0}^{1} \max\{ h(r') \mid r' \geq r \}\, dr$ (12)
where $h(r)$ is the precision/recall curve in which each true positive is weighted by its heading accuracy.
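For illustration, Equation (12) can be evaluated numerically from sampled points of the heading-weighted P/R curve, as in the short sketch below (the input curve is assumed to be sampled at increasing recall values).

```python
import numpy as np

def aph(recall, heading_weighted_precision):
    """Numerically evaluate Eq. (12) from sampled points of h(r).

    The curve is first replaced by its upper envelope max{h(r') | r' >= r}
    (a running maximum from the right), then integrated over recall.
    """
    r = np.asarray(recall, dtype=float)                    # increasing recall samples
    h = np.asarray(heading_weighted_precision, dtype=float)
    h_env = np.maximum.accumulate(h[::-1])[::-1]           # max over recalls >= r
    return 100.0 * np.trapz(h_env, r)

# Toy example: aph([0.0, 0.5, 1.0], [0.9, 0.6, 0.2])
```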

4.2. Implementation Details

Following previous methods, we adopt ResNet-50 [7] as the backbone of our framework and employ an FPN as the neck. Considering the computational resources, we set the size of the input images to 800 × 450. For the BEV space, the size of each grid cell is set to 2.048 m. The occupancy map generation covers an occupancy space of [−51.2 m, 51.2 m] along the X and Y axes, consistent with the BEV space, and [−3.0 m, 5.0 m] along the Z axis. All experiments are trained for 24 epochs. For the comparison experiments, the resolution of the occupancy maps varies: the number of grid cells over the same spatial extent is 50 × 50, 100 × 100, or 200 × 200. We conducted a comparative analysis of multiple fusion schemes using the 1/20 nuScenes train set and evaluated their performance on the original-scale test set. Ultimately, our model exhibited an improvement of 1.87 points in mAP and 1.94 points in NDS compared to the baseline.
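The stated setup can be summarized as a configuration sketch; the field names below are hypothetical and do not correspond to the authors' actual configuration files.

```python
# Illustrative training/model configuration collected from Section 4.2.
occtr_config = {
    "backbone": "ResNet-50",
    "neck": "FPN",
    "input_size": (800, 450),            # input RGB resolution (width x height)
    "bev_cell_size_m": 2.048,            # size of each BEV grid cell
    "occ_range_xy_m": (-51.2, 51.2),     # occupancy extent along X and Y
    "occ_range_z_m": (-3.0, 5.0),        # occupancy extent along Z
    "occ_resolutions": [50, 100, 200],   # grid counts compared in the experiments
    "epochs": 24,
    "train_subset": "1/20 of the nuScenes train set",
}
```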

4.3. 3D Object Detection Results

To investigate the effect of the occupancy map resolution and of different Occ-BEV fusion methods, we performed comparative experiments on a 1/20 nuScenes training set. Ultimately, we selected a 200 × 200 occupancy map and the cross-attention-based feature fusion scheme as our final version.

4.3.1. Occ-BEV Fusion Methods

Three distinct fusion methods were devised with the aim of enhancing the effectiveness of high-resolution occupancy maps in temporal fusion. The fusion method selected for the final model bypasses the need for up- or downsampling and instead utilizes cross-attention to fuse features of different resolutions. Additionally, we explored two alternative approaches. The first upsamples the BEV features and directly concatenates them with the occupancy map in the channel dimension, followed by a downsampling operation back to the original size. The second follows the transformer structure of the baseline, downsampling the high-resolution occupancy grid map and then letting it interact with the low-resolution BEV features through a self-attention module.
The models for this part of the experiments were trained on a subset of the nuScenes train set, which accounted for only 1 / 20 of the entire dataset. As depicted in Table 1, the performance metrics of the cross-attention scheme surpass those of the other two schemes at an equivalent resolution. Ultimately, we designed the temporal cue fusion module based on cross attention.

4.3.2. Resolution Factor

We argue that the effect of location information on temporal fusion effectiveness increases with the resolution of the occupancy grid map. To investigate the trend of the resolution’s effect on model accuracy, we conducted experiments using occupancy grid maps at varying resolutions with the cross-attention fusion scheme. Resolutions of 50 × 50, 100 × 100, 200 × 200, and 400 × 400 were used in the experiments.
The trend of detection performance, as indicated by the mAP and NDS metrics, is illustrated in Figure 5. Both metrics increase with increasing resolution. However, while increasing the resolution improves model performance, the impact gradually diminishes. In particular, when the resolution increases from 200 × 200 to 400 × 400, mAP improves only slightly while NDS remains almost unchanged, and the sharp increase in the number of reference points places additional pressure on model training. Considering the trade-off between computational cost and performance, we selected a final version with a resolution of 200 × 200.

4.3.3. Object Detection Results

In Table 2, we compare OccTr with current methods, including the baseline BEVFormer-Tiny and previous works [9,21,24]. The table shows that OccTr outperforms the baseline method, BEVFormer-Tiny, with a significant improvement of 1.94 points in NDS and 1.87 points in mAP. Furthermore, even when the occupancy map has the same resolution as the BEV feature (OccTr-50), OccTr still achieves a notable enhancement of 1.44 points in NDS and 1.21 points in mAP, thereby validating the effectiveness of our main idea regarding occupancy grid mapping. Despite being lightly trained on RGB images of 800 × 450 pixels, OccTr demonstrates performance comparable to FCOS3D and DETR3D, which operate on images twice that size.
To evaluate the robustness of OccTr, we compared it with the baseline work on the Waymo dataset, another large-scale autonomous driving dataset, which uses IoU-based metrics. We conducted experiments on a 1/10 subset of the training set and focused on detecting the vehicle category. As shown in Table 3, OccTr outperforms the baseline in terms of APH (average precision with heading information). With an IoU criterion of 0.5, OccTr exceeds the baseline by 0.3 points and 0.2 points at the LEVEL_1 and LEVEL_2 difficulties, respectively.

4.4. Ablation Study

The effects of the designed occupancy map generation module and occupancy cross-attention module were investigated through ablation experiments on the nuScenes train set. As shown in Table 4, compared to the baseline model, NDS improved by 0.64 points and mAP by 0.70 points after straightforwardly incorporating the occupancy information as a mask on $F_{BEV}$. When employing cross-attention fusion instead, the model performed even better, with a further increase of 1.30 points in NDS and 1.17 points in mAP over the mask-based variant. These findings highlight the positive guidance provided by the positional information in the occupancy map during temporal fusion and show that the cross-attention method effectively leverages this positional information.

4.5. Visualization

We present the predicted results of the OccTr object detection task in Figure 6 and compare them with the baseline. With occupancy information from the back end of the previous frame, OccTr corrects small objects mistakenly detected by BEVFormer and effectively predicts objects that are occluded. The figure clearly shows that OccTr detects objects that BEVFormer-tiny misses due to truncation or occlusion (green circles) and rectifies bounding boxes incorrectly detected by the baseline (red circles). This visual evidence further highlights how incorporating the back-end detection results from previous frames enhances the temporal cues and improves object detection in the current frame.

5. Conclusions

OccTr is a transformer-based end-to-end framework that fuses BEV temporal information with occupancy maps generated online. To compensate for missing positional information, OccTr adds a back-end temporal cue to a temporal fusion pipeline that previously relied on a single cue. It realizes the fusion of the two temporal cues through a two-stage pipeline, leveraging the concept of the classical occupancy grid map to construct an occupancy grid map that is updated online. This map serves as a bridge for bringing historical back-end detection results into BEV format. By employing structurally consistent cross-attention layers, OccTr effectively integrates intermediate BEV features and back-end prediction features, experimentally demonstrating enhanced location information in the temporal cues and addressing inaccuracies commonly encountered in visual object detection works. We anticipate that OccTr will contribute to enriching the repertoire of multiview camera-based 3D object detection work.

Author Contributions

Conceptualization, Q.F. and X.Y.; methodology, Q.F.; software, Q.F.; validation, Q.F., X.Y. and L.O.; formal analysis, Q.F.; investigation, Q.F.; resources, L.O.; data curation, Q.F.; writing—original draft preparation, X.Y.; writing—review and editing, X.Y.; visualization, Q.F.; supervision, X.Y.; project administration, X.Y.; funding acquisition, L.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Baima Lake Laboratory Joint Funds of the Zhejiang Provincial Natural Science Foundation of China (under grant no. LBMHD24F030002) and the National Natural Science Foundation of China (under grant no. 62373329).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
IOV    Internet of Vehicles
BEV    Bird’s-eye view
NDS    nuScenes Detection Score
IoU    Intersection over union
AP     Average precision
mAP    Mean average precision
APH    Average precision with heading information

References

  1. Kong, X.; Wang, J.; Hu, Z.; He, Y.; Zhao, X.; Shen, G. Mobile Trajectory Anomaly Detection: Taxonomy, Methodology, Challenges, and Directions. IEEE Internet Things J. 2024, 11, 19210–19231. [Google Scholar] [CrossRef]
  2. Kong, X.; Lin, H.; Jiang, R.; Shen, G. Anomalous Sub-Trajectory Detection With Graph Contrastive Self-Supervised Learning. IEEE Trans. Veh. Technol. 2024, 1–13. [Google Scholar] [CrossRef]
  3. Hu, P.; Ziglar, J.; Held, D.; Ramanan, D. What you see is what you get: Exploiting visibility for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  4. Zhou, B.; Krähenbühl, P. Cross-view transformers for real-time map-view semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  5. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Yu, Q.; Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  6. Li, Y.; Yu, Z.; Choy, C.; Xiao, C.; Alvarez, J.M.; Fidler, S.; Feng, C.; Anandkumar, A. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  8. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  9. Wang, T.; Zhu, X.; Pang, J.; Lin, D. FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  10. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  11. Saha, A.; Maldonado, O.M.; Russell, C.; Bowden, R. Translating images into maps. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022. [Google Scholar]
  12. Li, Y.; Huang, B.; Chen, Z.; Cui, Y.; Liang, F.; Shen, M.; Liu, F.; Xie, E.; Sheng, L.; Ouyang, W.; et al. Fast-BEV: A Fast and Strong Bird’s-Eye View Perception Baseline. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 1–14. [Google Scholar] [CrossRef]
  13. Xie, E.; Yu, Z.; Zhou, D.; Philion, J.; Anandkumar, A.; Fidler, S.; Luo, P.; Alvarez, J. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation. arXiv 2022, arXiv:2204.05088. [Google Scholar]
  14. Can, Y.B.; Liniger, A.; Unal, O.; Paudel, D.P.; Gool, L.V. Understanding Bird’s-eye View Semantic hd-maps Using an Onboard Monocular Camera. arXiv 2020, arXiv:2012.03040. Available online: https://api.semanticscholar.org/CorpusID:227342485 (accessed on 16 June 2024).
  15. Mohapatra, S.; Yogamani, S.; Gotzig, H.; Milz, S.; Mäder, P. BEVDetNet: Bird’s Eye View LiDAR Point Cloud based Real-time 3D Object Detection for Autonomous Driving. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021. [Google Scholar]
  16. Beltrán, J.; Guindel, C.; Moreno, F.M.; Cruzado, D.; García, F.; Escalera, A.d. Birdnet: A 3d object detection framework from lidar information. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018. [Google Scholar]
  17. Wang, Y.; Chao, W.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-Lidar from Visual Depth Estimation: Bridging the Gap in 3d Object Detection for Autonomous Driving. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  18. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar]
  19. Zhang, Y.; Zhu, Z.; Zheng, W.; Huang, J.; Huang, G.; Zhou, J.; Lu, J. Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv 2022, arXiv:2205.09743. [Google Scholar]
  20. Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection. arXiv 2022, arXiv:2206.10092. [Google Scholar] [CrossRef]
  21. Wang, Y.; Guizilini, V.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J.M. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Proceedings of the Conference on Robot Learning (CoRL), London, UK, 8–11 November 2021. [Google Scholar]
  22. Liu, Y.; Wang, T.; Zhang, X.; Sun, J. Petr: Position embedding transformation for multi-view 3d object detection. arXiv 2022, arXiv:2203.05625. [Google Scholar]
  23. Philion, J.; Fidler, S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the 2020 European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  24. Pan, B.; Sun, J.; Andonian, A.; Oliva, A.; Zhou, B. Cross-view semantic segmentation for sensing surroundings. IEEE Robot. Autom. Lett. 2020, 5, 4867–4873. [Google Scholar] [CrossRef]
  25. Huang, J.; Huang, G.; Zhu, Z.; Du, D. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790. [Google Scholar]
  26. Schönberger, J.L.; Frahm, J.-M. Structure-from-motion revisited. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  27. Jiang, Y.; Zhang, L.; Miao, Z.; Zhu, X.; Gao, J.; Hu, W.; Jiang, Y. PolarFormer: Multi-camera 3D Object Detection with Polar Transformers. In Proceedings of the AAAI conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023. [Google Scholar]
  28. Yang, C.; Chen, Y.; Tian, H.; Tao, C.; Zhu, X.; Zhang, Z.; Huang, G.; Li, H.; Qiao, Y.; Lu, L.; et al. BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision. arXiv 2022, arXiv:2211.10439. [Google Scholar]
  29. Chen, Y.; Tai, L.; Sun, K.; Li, M. Monopair: Monocular 3d Object Detection Using pairwise spatial relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  30. Hu, A.; Murez, Z.; Mohan, N.; Dudas, S.; Hawke, J.; Badrinarayanan, V.; Cipolla, R.; Kendall, A. FIERY: Future instance prediction in bird’s-eye view from surround monocular cameras. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  31. Huang, J.; Huang, G. Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv 2022, arXiv:2203.17054. [Google Scholar]
  32. Yang, W.; Liu, B.; Li, W.; Yu, N. Tracking Assisted Faster Video Object Detection. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 1750–1755. Available online: https://api.semanticscholar.org/CorpusID:199490530 (accessed on 15 July 2019).
  33. Li, Y.; Bao, H.; Ge, Z.; Yang, J.; Sun, J.; Li, Z. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with dynamic temporal stereo. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023. [Google Scholar]
  34. Elfes, A.; Matthies, L. Sensor integration for robot navigation: Combining sonar and stereo range data in a grid-based representation. In Proceedings of the 26th IEEE Conference on Decision and Control, Los Angeles, CA, USA, 9–11 December 1987; Volume 26, pp. 1802–1807. [Google Scholar]
  35. Zhang, Y.; Zhu, Z.; Du, D. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. arXiv 2023, arXiv:2304.05316. [Google Scholar]
  36. Wei, Y.; Zhao, L.; Zheng, W.; Zhu, Z.; Zhou, J.; Lu, J. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. arXiv 2023, arXiv:2303.09551. [Google Scholar]
  37. Sun, J.; Xie, Y.; Zhang, S.; Chen, L.; Zhang, G.; Bao, H.; Zhou, X. You Don’t Only Look Once: Constructing spatial-temporal memory for integrated 3d object detection and tracking. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  38. Huang, Y.; Zheng, W.; Zhang, Y.; Zhou, J.; Lu, J. Tri-perspective view for vision-based 3d semantic occupancy prediction. arXiv 2023, arXiv:2302.07817. [Google Scholar]
  39. Cao, A.-Q.; de Charette, R. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  40. Newcombe, R.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.; Kohi, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A. KinectFusion: Real-time dense surface mapping and tracking. In Proceedings of the 2011 10th IEEE International Symposium On Mixed And Augmented Reality, Basel, Switzerland, 26–29 October 2011; pp. 127–136. [Google Scholar]
  41. Li, Z.; Zhang, C.; Ma, W.-C.; Zhou, Y.; Huang, L.; Wang, H.; Lim, S.; Zhao, H. Voxelformer: Bird’s-eye-view feature generation based on dual-view attention for multi-view 3d object detection. arXiv 2023, arXiv:2304.01054. [Google Scholar]
  42. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
Figure 1. OccTr fuses BEV features and detection results, represented as occupancy states, to assist intermediate-representation temporal fusion for 3D object detection tasks with back-end information.
Figure 2. The framework of the proposed OccTr for multiview camera-based 3D object detection tasks. Given a horizontal full-view RGB image, 2D features are extracted through backbone, ResNet-50 [7]. BEV queries of the current frame interact with the BEV feature of the historical frame to obtain the time cue of the previous frame. Stage-one generates an occupancy grid map based on the detection results of the previous frame. Stage-two fuses intermediate BEV feature and back-end occupancy feature through cross-attention form.
Figure 3. Illustration of the occupancy map generation module. We initialize a set of 3D reference points and obtain the 3D occupancy representation via the geometric relationship between every point and the 3D bounding box. Then, we compress it to 2D as $Occ_{int}$ and combine it with historical occupancy information to obtain the occupancy map $Occ_{t-1}$.
Figure 4. Illustration of the occupancy cross-attention module. In order to fuse features of different resolutions, we project each BEV query to the corresponding location of the occupancy map and sample the nearby features (red circle) to interact with the BEV feature query.
Figure 5. Impact of occupancy maps in different resolutions on performance metrics. ‘↑’ indicates that an increase in the value is better. Both mAP and NDS show an increasing trend with increasing resolution. However, the trend is slowing down given the sharp increase in reference points.
Figure 6. OccTr visualization results on nuScenes val set. Boxes of different colors correspond to different object categories. Compared to the baseline framework, OccTr is significantly able to correct false detections (red circle), as well as detect occluded objects (green circle). This demonstrates the effectiveness of occupancy information for object detection.
Table 1. Performance of different Occ-BEV fusion methods at the same resolution on the 1/20 nuScenes train set. ‘S&D’ means downsampling the resolution of the occupancy map to the resolution of the BEV features and then integrating them by a self-attention approach. ‘Concat’ means upsampling the BEV feature and then performing a concatenate operation between it and the occupancy map in the channel dimension. ‘Cro’ means fusing the two features of different resolutions directly by cross-attention. ‘↓’ indicates that a decrease in the value is better, while ‘↑’ indicates that an increase in the value is better.

Method | Modify | RES | NDS ↑ | mAP ↑ | mATE ↓ | mASE ↓ | mAOE ↓ | mAVE ↓ | mAAE ↓
OccTr-S | S&D | 200 | 0.1796 | 0.1118 | 1.0651 | 0.3476 | 1.2153 | 1.1771 | 0.4156
OccTr-Cat | Concat | 200 | 0.1784 | 0.1117 | 1.0662 | 0.3632 | 1.1792 | 1.2127 | 0.4116
OccTr | Cro | 200 | 0.1803 | 0.1129 | 1.0632 | 0.3465 | 1.1564 | 1.1857 | 0.4151
Table 2. Performance of 3D detection on the nuScenes dataset. ‘Image-RES’ means RGB image resolution. ‘Occ-RES’ means the resolution of the occupancy map. ‘↓’ indicates that a decrease in the value is better, while ‘↑’ indicates that an increase in the value is better. Even at the same resolution as the BEV feature, OccTr outperforms the baseline and previous work. With a more fine-grained occupancy map, OccTr exceeds the baseline work by 1.94 points on NDS and 1.87 points on mAP.

Method | Image-RES | Occ-RES | NDS ↑ | mAP ↑ | mATE ↓ | mASE ↓ | mAOE ↓ | mAVE ↓ | mAAE ↓
VPN | 1600 × 900 | – | 0.3330 | 0.2530 | – | – | – | – | –
PointPillars | LiDAR | – | 0.4530 | 0.3050 | 0.5170 | 0.2900 | 0.5000 | 0.3160 | 0.3680
MEGVII | LiDAR | – | 0.6330 | 0.5280 | 0.3000 | 0.2470 | 0.3790 | 0.2450 | 0.1400
FCOS3D | 1600 × 900 | – | 0.3730 | 0.2990 | 0.7850 | 0.2680 | 0.5570 | 1.3960 | 0.1540
DETR3D | 1600 × 900 | – | 0.3740 | 0.3030 | 0.8600 | 0.2780 | 0.4370 | 0.9670 | 0.2350
BEVFormer-tiny | 800 × 450 | – | 0.3541 | 0.2524 | 0.8995 | 0.2937 | 0.6549 | 0.6571 | 0.2158
OccTr-50 | 800 × 450 | 50 × 50 | 0.3688 | 0.2645 | 0.8860 | 0.2927 | 0.6118 | 0.6352 | 0.2083
OccTr-200 | 800 × 450 | 200 × 200 | 0.3735 | 0.2711 | 0.8757 | 0.2916 | 0.6199 | 0.6168 | 0.2165
Table 3. Performance of 3D detection on the Waymo dataset. ‘Occ-RES’ means the resolution of the occupancy map. ‘L1’ and ‘L2’ are the APH metrics for the two difficulty levels in the Waymo dataset. OccTr outperforms the baseline work on this IoU-based dataset.

Method | Img-RES | Occ-RES | L1 (IoU = 0.5) | L2 (IoU = 0.5) | L1 (IoU = 0.7) | L2 (IoU = 0.7)
DETR3D | 1600 × 900 | – | 0.220 | 0.216 | 0.055 | 0.051
BEVFormer-tiny | 800 × 450 | – | 0.151 | 0.120 | 0.038 | 0.032
OccTr | 800 × 450 | 200 × 200 | 0.154 | 0.122 | 0.039 | 0.032
Table 4. The ablation experiment of OccTr. ‘O’ means that occupancy information is added to the baseline in the form of a mask, with little or no change to the basic structure of the BEV feature. ‘O&C’ means integrating occupancy maps with BEV features to make the best use of occupancy information. ‘↓’ indicates that a decrease in the value is better, while ‘↑’ indicates that an increase in the value is better.

Method | Modify | NDS ↑ | mAP ↑
BEVFormer-tiny | – | 0.3541 | 0.2524
OccTr-mask | O | 0.3605 | 0.2594
OccTr | O&C | 0.3735 | 0.2711
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
