Article

RecurrentOcc: An Efficient Real-Time Occupancy Prediction Model with Memory Mechanism

by Zimo Chen, Yuxiang Xie and Yingmei Wei *
Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410003, China
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(7), 176; https://doi.org/10.3390/bdcc9070176
Submission received: 21 April 2025 / Revised: 7 June 2025 / Accepted: 30 June 2025 / Published: 2 July 2025
(This article belongs to the Special Issue Perception and Detection of Intelligent Vision)

Abstract

Three-dimensional occupancy prediction provides a detailed representation of the surrounding environment, which is essential for autonomous driving. Long temporal image sequence fusion is a common technique used to improve occupancy prediction performance. However, existing temporal fusion methods are inefficient due to three issues: repetitive feature extraction from temporal images, redundant fusion of temporal features, and suboptimal fusion of long-term historical features. To address these challenges, we propose the Recurrent Occupancy Prediction Network (RecurrentOcc). We introduce the Scene Memory Gate, a new temporal fusion module that condenses temporal scene features into a single historical feature map. This eliminates the need for repeated extraction and aggregation of multiple temporal images, reducing computational overhead. The Scene Memory Gate selectively retains valuable information from historical features and recurrently updates the historical feature map, enhancing temporal fusion performance. Additionally, we design a simple yet efficient encoder, significantly reducing the number of model parameters. Compared with other real-time methods, RecurrentOcc achieves state-of-the-art performance of 39.9 mIoU on the Occ3D-NuScenes dataset with the fewest parameters (59.1 M) and an inference speed of 23.4 FPS.

1. Introduction

In autonomous driving, the ability to perceive and understand the surrounding environment is a fundamental requirement. 3D occupancy prediction, which reconstructs a fine-grained representation of the street scene in the form of 3D voxels, has gained significant attention in recent years. Although previous works [1,2,3,4] have significantly improved the performance of 3D occupancy prediction, they have also resulted in a corresponding increase in model scale and computation. Due to the limited computational resources available on vehicles, 3D occupancy prediction models should have a compact model size and low computational demands for practical deployment. Additionally, to ensure safety, the model must achieve real-time inference speed. Therefore, some researchers focus on improving the model’s computational efficiency, such as SparseOcc [5], FlashOcc [6], and FastOcc [7].
To enhance occupancy prediction performance, many methods [5,6,8,9,10] have incorporated temporal image sequences. Historical images provide supplementary information, aiding the model in reconstructing the current scene, especially in occluded areas. However, integrating temporal images, particularly long-term sequences, increases computational costs for image processing and feature aggregation, thus reducing the model’s inference speed. Figure 1a illustrates a widely used method proposed by [11] for temporal feature aggregation. It reuses the backbone to extract features from historical images and feeds these temporal features into a fusion module for aggregation. The fusion module typically consists of several convolutional layers. However, this approach has three main drawbacks, leading to computational inefficiency. First, the extraction of historical image features introduces redundant computation, since these image features have already been extracted at previous timestamps. Second, the temporal fusion process is redundant. For example, let $T$ represent the current time and $K$ represent the sequence length. At $T-1$, the module aggregates features from $[T-1-K, T-1]$, whereas at $T$, it aggregates features from $[T-K, T]$, causing redundant fusion of features in the overlapping interval $[T-K, T-1]$. Third, current temporal fusion methods risk introducing noise. These methods directly concatenate short-term and long-term temporal features with current features and aggregate them indiscriminately. While long-term features may provide valuable information about occluded areas (inaccessible to short-term observations), they often contain outdated scene information due to the ego-vehicle’s motion. Direct fusion without proper filtering may therefore propagate irrelevant or outdated information, adversely affecting model performance. GSD-Occ [12] introduces a temporal feature queue to store the computed scene features, as shown in Figure 1b. When reconstructing the current scene, the model simply retrieves the adjacent historical features from the queue, avoiding redundant feature extraction [3,6,9,11]. Nevertheless, GSD-Occ’s temporal fusion still exhibits the last two drawbacks. Moreover, the size of stored features grows with the length of the temporal image sequence.
In this paper, we propose a new paradigm for temporal fusion, as shown in Figure 1c. The network condenses the historical information into a single feature (we refer to it as a historical feature map) and recurrently updates this feature map over time. As a result, the network integrates only the historical feature map with the current scene feature for temporal fusion, avoiding the redundant computation caused by temporal image sequence fusion. We introduce the Scene Memory Gate (SMG) as the temporal fusion module, which selectively integrates the historical feature map with the current scene feature and retains the fused feature for the next iteration, mitigating the third challenge. Furthermore, we design the Light Encoder, a simpler and more efficient encoder structure that adopts the VRWKV [13] network as its basic block. The Light Encoder significantly reduces the number of model parameters while maintaining high inference speed and prediction performance. In summary, our contributions are as follows:
  • We introduce a new temporal fusion module, SMG, which selectively condenses valuable temporal information into a historical feature map. It allows the model to integrate only a single feature map instead of multiple temporal features, improving computational efficiency and achieving superior performance.
  • We design a simpler encoder structure based on the VRWKV network, termed the Light Encoder. The introduction of this encoder reduces the model’s parameters to only 59.1 M without compromising performance.
  • Our proposed Recurrent Occupancy Prediction Network (RecurrentOcc, ReOcc) achieves state-of-the-art performance of 39.9 mIoU on the Occ3D-NuScenes dataset [14] with the fewest parameters and an inference speed of 23.4 FPS, compared with other real-time competitors [5,6,9,12].

2. Related Works

2.1. Vision-Based 3D Occupancy Prediction

Monoscene [1] is the first vision-based occupancy prediction work that reconstructs scenes solely from RGB images. It proposes FLoSP to transform 2D image features into 3D features and uses a 3D U-Net architecture to predict voxel occupancy and semantic classification. VoxFormer [2] introduces a two-stage framework, where sparse voxels are identified in the first stage and the scene is refined in the second. OccFormer [15] builds a transformer-based encoder-decoder network. It utilizes a dual-path transformer [16] block as an encoder to extract the global and local features of the scene and Mask2Former [17] as a decoder for semantic occupancy prediction. SurroundOcc [18] employs deformable attention [19] to directly create a dense 3D representation from multi-camera images without first predicting depth. Instead of voxel-wise prediction, Symphonies [20] creates instance queries and reconstructs the scene using instance semantics, improving contextual understanding. COTR [3] proposes a geometry-aware encoder, consisting of explicit and implicit view transformation, to generate a compact 3D representation. In recent years, numerous advanced methods have emerged, rapidly improving occupancy prediction performance. However, many of these methods overlook deployment challenges and suffer from high computational costs. Consequently, researchers have increasingly focused on developing lightweight models suitable for vehicle deployment.

2.2. Deployment-Friendly Occupancy Prediction

While dense 3D voxels provide an effective representation for autonomous driving, their computation is time-consuming and memory-intensive. To simplify intermediate computations, TPVFormer [21] reduces the 3D space to tri-perspective views, and FlashOcc [6] uses Bird’s Eye View (BEV) for scene feature representation. Given that more than half of the voxels in a scene are empty, SparseOcc [5] employs a sparse transformer to minimize computational costs. OctreeOcc [22] uses an octree-based [23] architecture to construct the occupancy prediction network, significantly reducing the number of voxels needed to depict the scene. GSD-Occ [12] introduces a dual-branch network, with a 2D encoder for BEV semantic perception and a 3D encoder leveraging the re-parameterization technique [24] for 3D geometry perception. Despite significant progress in deployment-friendly occupancy prediction, we find that the temporal fusion process is still inefficient. This inspires us to seek a more efficient temporal fusion approach.

2.3. Temporal Fusion in 3D Occupancy Prediction

Incorporating temporal frames into the input is a common practice to improve the prediction performance. A widely used approach is to concatenate temporal features extracted from the backbone and aggregate them using convolutional operations [11]. This approach overlooks the varying contributions of features across different times and locations, potentially leading to insufficient aggregation. Moreover, the repeated feature extraction operation introduces redundant computation. HTCL [25] proposes a new paradigm for improving temporal feature aggregation, which includes cross-frame affinity measurement and affinity-based dynamic refinement. While this method greatly improves prediction performance, it also increases computational complexity. Inspired by [26], GSD-Occ [12] employs a history feature queue to store temporal features, avoiding redundant extraction, but it still relies on long-term temporal fusion to achieve better results. In this paper, we propose the scene memory gate that aggregates temporal features from historical frames into a single feature map and continuously updates it over time, enabling efficient temporal fusion and real-time inference.

3. Method

3.1. Problem Setup

The task input is a sequence of multi-view temporal images $I \in \mathbb{R}^{b \times t \times n \times 3 \times h \times w}$, where $b$ denotes the batch size, $t$ the temporal sequence length, $n$ the number of camera views, and $h, w$ the image resolution. The output of the occupancy prediction model $G$ is a 3D representation of the scene with semantic labels $O \in \mathbb{R}^{H \times W \times Z \times C}$, where $H, W, Z$ denote the spatial resolution of the scene and $C$ denotes the number of labels. The labels consist of $C-1$ object categories and an 'empty' category. If a voxel in the scene is occupied, $G$ assigns it an object category; otherwise, it assigns the 'empty' category.
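To make these shapes concrete, the following minimal PyTorch sketch instantiates the input and output tensors described above; the numeric sizes are illustrative assumptions (chosen to be roughly consistent with the Occ3D-NuScenes setup in Section 4.1), not values prescribed by this section.

```python
import torch

# Illustrative sizes only (assumed); b, t, n, h, w and H, W, Z, C follow the notation above.
b, t, n, h, w = 1, 2, 6, 256, 704      # batch, temporal length, camera views, image height/width
H, W, Z, C = 200, 200, 16, 18          # voxel grid resolution and number of labels (incl. 'empty')

I = torch.rand(b, t, n, 3, h, w)       # multi-view temporal images
logits = torch.rand(b, H, W, Z, C)     # per-voxel class scores that a model G would produce
O = logits.argmax(dim=-1)              # one label index per voxel; one index is the 'empty' category
print(I.shape, O.shape)                # -> [1, 2, 6, 3, 256, 704] and [1, 200, 200, 16]
```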

3.2. Overall Architecture

Figure 2 illustrates the overall architecture of ReOcc. First, the input images $I$ at time $T$ are processed by the backbone to extract image features $F \in \mathbb{R}^{b \times n \times c \times h \times w}$, where $c$ denotes the feature dimension. The features $F$ are then lifted to the 3D feature $V_T \in \mathbb{R}^{b \times C \times H_0 \times W_0 \times Z_0}$ in the 2D-to-3D transformation module, where $C$ and $H_0, W_0, Z_0$ denote $V_T$'s feature and spatial dimensions, respectively. Traditional occupancy networks employ 3D encoders to aggregate the semantic and geometry features from $V_T$, which is computationally expensive. To reduce complexity, we adopt FlashOcc's [6] approach of transforming the 3D features into Bird's Eye View (BEV) representations. Although this projection sacrifices height-dimension information, it trades a small amount of accuracy for faster inference. Critically, the compressed 2D features retain implicit height cues, which can help the network reconstruct 3D features from the BEV space.
Building upon this, we adopt the dual-path architecture from GSD-Occ [12], which separately processes semantic information in 2D and geometric information in 3D. The semantic path incorporates a Scene Memory Gate for efficient temporal fusion along with a lightweight 2D encoder, while the geometric path maintains GSD-Occ's original design, which consists of a non-dilated small-kernel 3D convolution and multiple dilated small-kernel 3D convolutions. During inference, these small-kernel convolutions are re-parameterized into a large-kernel convolution to accelerate calculation. Specifically, the BEV feature map $B_T \in \mathbb{R}^{b \times C \times H_0 \times W_0}$ is generated by averaging $V_T$'s features along the height dimension. The Scene Memory Gate aggregates $B_T$ with the historical scene feature map $h_{T-1}$ and outputs the temporally fused feature $h_T$, which is then saved as the next historical scene feature map. Simultaneously, $h_T$ is restored to the 3D feature $V_b$ via the BEV-Voxel Lifting (BVL) module [12]. $V_b$ is combined with $V_T$ to generate $V_T'$. Next, the 3D geometric feature $V_T'$ and the 2D semantic feature $h_T$ are further processed by the 3D and 2D encoders, respectively, producing $V_g$ and $B_s$. $B_s$ is then restored to the 3D semantic feature $V_s$ via the BVL method. Finally, $V_s$ and $V_g$ are combined, and the resulting feature is processed by a prediction head to obtain the final prediction $O$. The dual-path architecture enables computationally and memory-efficient processing of dense, high-dimensional semantic features through its 2D pathway, while the 3D geometric path preserves crucial spatial information that compensates for the height-dimension loss in the semantic processing.
Two-dimensional-to-three-dimensional transformation. This module aggregates multi-view image features to generate the corresponding 3D scene feature. We adopt the LSS method [27], where the image features $F \in \mathbb{R}^{b \times n \times c \times h \times w}$ are fed into a depth net and a context net [28] to predict the depth distribution $D \in \mathbb{R}^{b \times n \times d \times h \times w}$ and the semantic feature $F' \in \mathbb{R}^{b \times n \times c \times h \times w}$, respectively, where $d$ denotes the number of depth bins. The outer product of $F'$ and $D$ generates a pseudo 3D point-cloud feature $P \in \mathbb{R}^{b \times n \times c \times d \times h \times w}$, which is then converted into 3D voxel features $V \in \mathbb{R}^{b \times c \times H_0 \times W_0 \times Z_0}$ via voxel pooling based on the camera intrinsic parameters $\{K_i\}_{i=1}^{n}$ and extrinsic parameters $\{[R_i \mid t_i]\}_{i=1}^{n}$.
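The following is a hedged PyTorch sketch of the LSS-style lift step described above; the `depth_net` and `context_net` heads are simple placeholders (not the authors' exact modules), the tensor sizes are assumptions, and the final voxel pooling with camera parameters is omitted.

```python
import torch
import torch.nn as nn

b, n, c_in, c, d, h, w = 1, 6, 256, 80, 59, 16, 44   # assumed sizes for illustration

depth_net = nn.Conv2d(c_in, d, kernel_size=1)     # placeholder head for the depth distribution D
context_net = nn.Conv2d(c_in, c, kernel_size=1)   # placeholder head for the semantic feature F'

F = torch.rand(b * n, c_in, h, w)                 # backbone image features (views folded into batch)
D = depth_net(F).softmax(dim=1)                   # (b*n, d, h, w): per-pixel depth bins
Fc = context_net(F)                               # (b*n, c, h, w): per-pixel context features

# Outer product over the depth dimension -> pseudo 3D point-cloud feature P
P = (D.unsqueeze(1) * Fc.unsqueeze(2)).view(b, n, c, d, h, w)

# Voxel pooling would then scatter P into a (b, c, H0, W0, Z0) grid using the camera
# intrinsics {K_i} and extrinsics {[R_i | t_i]}; that step is omitted in this sketch.
```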
BEV-to-3D transformation. This module aims to recover the height information from the BEV feature. We use the BVL method [12], illustrated in the top right of Figure 2. The BVL approach shares similarities with LSS and consists of two branches: a height branch that employs a convolutional network to predict a height distribution from the BEV feature, and a context branch that generates height-aware features via a separate convolutional network. The 3D voxel features are then obtained through the outer product of the height distribution and the height-aware features. For example, taking $B_s \in \mathbb{R}^{b \times C \times H_0 \times W_0}$ as input, BVL converts $B_s$ into the height distribution $H_s \in \mathbb{R}^{b \times Z_0 \times H_0 \times W_0}$ and the height-aware feature $B_s' \in \mathbb{R}^{b \times C \times H_0 \times W_0}$. The outer product $B_s' \otimes H_s$ then yields the 3D feature $V_s \in \mathbb{R}^{b \times C \times H_0 \times W_0 \times Z_0}$.
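A corresponding sketch of the BVL lifting is given below; the two convolutional heads are placeholders and the channel/grid sizes are assumptions, so it only illustrates the outer-product structure described above.

```python
import torch
import torch.nn as nn

b, C, H0, W0, Z0 = 1, 128, 200, 200, 16           # assumed sizes for illustration

height_net = nn.Conv2d(C, Z0, kernel_size=3, padding=1)   # height-distribution branch (placeholder)
context_net = nn.Conv2d(C, C, kernel_size=3, padding=1)   # height-aware-feature branch (placeholder)

B_s = torch.rand(b, C, H0, W0)                    # input BEV feature
H_s = height_net(B_s).softmax(dim=1)              # (b, Z0, H0, W0): height distribution
B_s_ctx = context_net(B_s)                        # (b, C, H0, W0): height-aware features

# Outer product over the height dimension recovers a 3D voxel feature
V_s = B_s_ctx.unsqueeze(2) * H_s.unsqueeze(1)     # (b, C, Z0, H0, W0)
V_s = V_s.permute(0, 1, 3, 4, 2)                  # (b, C, H0, W0, Z0), matching the notation above
```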

3.3. Scene Memory Gate

Inspired by LSTM [29], SMG employs a recurrent process for temporal fusion. The core idea of SMG is to selectively retain valuable information from historical scene features and discard features from relatively distant or highly uncertain areas. The retained historical features are then integrated into the current BEV feature. The fused feature is then saved as the historical feature map, preparing for the next iteration of temporal fusion. Due to pose changes, the historical feature map needs to be reprojected onto the current BEV plane before fusion. The detailed structure of SMG is illustrated in Figure 3.
SMG first transforms the historical scene feature $h_{T-1}$ into a hidden state $s_{T-1}$ through a hidden block comprising a $3 \times 3$ convolution layer and a point-wise convolution layer, with each layer followed by a normalization layer and a ReLU activation. Next, a separate $3 \times 3$ convolution layer followed by a sigmoid activation is used to predict the value score of each pixel in $h_{T-1}$ based on $s_{T-1}$. The same module is also applied to compute the score of the current BEV feature $B_T$. The mask $M_{T-1}$ is derived from Equation (1):
$M_{T-1} = \sigma\big(f_s(s_{T-1})\big) + \big(1 - \sigma(f_c(B_T))\big)$
where $f_c(\cdot)$ and $f_s(\cdot)$ denote convolution operations, and $\sigma(\cdot)$ denotes the sigmoid function. The term $1 - \sigma(f_c(B_T))$ is used to enhance the weight of historical features in highly uncertain areas of the current scene, such as occluded regions. Due to the limitations of the LSS method [27], the initial BEV feature $B_T$ is sparse, as it cannot infer features for occluded regions. The introduction of $1 - \sigma(f_c(B_T))$ enhances the retention of historical features at the corresponding positions. Furthermore, the mask values are bounded within the interval $[0, 1]$, as formulated in Equation (2).
$M_{T-1}(x) = \begin{cases} M_{T-1}(x), & \text{if } M_{T-1}(x) \leq 1 \\ 1, & \text{if } M_{T-1}(x) > 1 \end{cases}$
The valuable historical features are selected according to Equations (2) and (3), and the fused feature map is then generated:
$\tilde{h}_{T-1} = M_{T-1} \odot h_{T-1}$
$h_T = \tilde{h}_{T-1} + B_T$
For the first BEV feature map $B_1$ in an image sequence (where no historical features exist), we initialize the hidden state as
$h_0 = B_1$
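The gating above can be summarized in a short PyTorch sketch; the layer widths, normalization choice, and one-channel scores are assumptions, and the reprojection (warp) of $h_{T-1}$ onto the current BEV plane is omitted.

```python
import torch
import torch.nn as nn

class SceneMemoryGateSketch(nn.Module):
    """Hedged sketch of the SMG mask-and-fuse steps above; not the authors' exact module."""
    def __init__(self, C=128):
        super().__init__()
        self.hidden = nn.Sequential(                      # hidden block: 3x3 conv + point-wise conv
            nn.Conv2d(C, C, 3, padding=1), nn.BatchNorm2d(C), nn.ReLU(inplace=True),
            nn.Conv2d(C, C, 1), nn.BatchNorm2d(C), nn.ReLU(inplace=True),
        )
        self.f_s = nn.Conv2d(C, 1, 3, padding=1)          # scores the historical hidden state
        self.f_c = nn.Conv2d(C, 1, 3, padding=1)          # scores the current BEV feature

    def forward(self, B_T, h_prev=None):
        if h_prev is None:                                # first frame of a sequence: h_0 = B_1
            return B_T
        s_prev = self.hidden(h_prev)
        mask = torch.sigmoid(self.f_s(s_prev)) + (1 - torch.sigmoid(self.f_c(B_T)))
        mask = mask.clamp(max=1.0)                        # bound mask values to [0, 1], Eq. (2)
        return mask * h_prev + B_T                        # select history, then fuse with B_T
```

In use, the returned feature map plays the role of $h_T$: it is stored and passed back (after pose-compensated warping, not shown) as $h_{T-1}$ at the next timestep.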
Compared with the temporal feature queue [12], SMG is more memory-friendly, as it only needs to save a single feature map rather than multiple ones. Its mask mechanism also reduces the risk of noise contamination in the feature fusion process. Additionally, because SMG is a recurrent operation over time, it exhibits long-term memory forgetting. We argue that this mechanism is well suited to autonomous driving: in most cases, autonomous vehicles move forward, making information from earlier scenes less relevant for the current scene reconstruction. Long-term forgetting automatically discards features from earlier scenes, allowing the network to focus more on the current moment and recent temporal features. Therefore, our designed SMG can achieve excellent temporal feature aggregation performance by fusing only one historical feature map.

3.4. Light Encoder

While SMG achieves efficient temporal feature fusion, it introduces additional computational overhead and parameters compared with convolution-based methods when fusing two temporal features. Therefore, we design a simpler and more efficient 2D encoder to work alongside SMG, balancing computational speed, network size, and performance. The Light Encoder's architecture is illustrated on the left of Figure 4. The encoder consists of two encoding blocks, with a downsampling layer inserted to obtain high-dimensional semantic features. The high-dimensional and low-dimensional features are merged via a single convolution layer; a structural sketch is given below.
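The following skeleton reflects this description; the placement of the downsampling layer between the two blocks, the channel widths, and the bilinear upsampling used to align the two resolutions before merging are all assumptions, and `block` merely stands in for the VRWKV encoding block discussed next.

```python
import torch
import torch.nn as nn

class LightEncoderSketch(nn.Module):
    """Structural sketch of the Light Encoder; `block` is a placeholder for the encoding block."""
    def __init__(self, C=128, block=lambda c: nn.Conv2d(c, c, 3, padding=1)):
        super().__init__()
        self.block1 = block(C)                                    # first encoding block
        self.down = nn.Conv2d(C, 2 * C, 3, stride=2, padding=1)   # downsampling layer (assumed position)
        self.block2 = block(2 * C)                                # second encoding block
        self.merge = nn.Conv2d(C + 2 * C, C, kernel_size=1)       # single conv merging both features

    def forward(self, x):
        low = self.block1(x)                                      # low-dimensional, full-resolution feature
        high = self.block2(self.down(low))                        # high-dimensional semantic feature
        high = nn.functional.interpolate(high, size=low.shape[-2:],
                                         mode='bilinear', align_corners=False)
        return self.merge(torch.cat([low, high], dim=1))
```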
We aim to find an encoding block structure that offers both strong reasoning capability and fast inference. We therefore employ VRWKV [13] as the encoding block. The architecture of VRWKV is illustrated on the right side of Figure 4. The input is first fed into the spatial mix module. After the Q-shift operation, the input is transformed into $r_s$, $k_s$, and $v_s$. The global attention result $wkv$ is then obtained through the bidirectional attention mechanism Bi-WKV, which has linear complexity and can be expressed in summation form:
$wkv_t = \operatorname{Bi\text{-}WKV}(K, V)_t = \dfrac{\sum_{i=0, i \neq t}^{L-1} e^{-(|t-i|-1)/L \cdot w + k_i} v_i + e^{u + k_t} v_t}{\sum_{i=0, i \neq t}^{L-1} e^{-(|t-i|-1)/L \cdot w + k_i} + e^{u + k_t}}$
where $L$ denotes the length of the sequence. $wkv$ is multiplied by $\sigma(r_s)$ and passed through a linear projection to generate the output of the spatial mix module, $O_s$:
$O_s = F_l(\sigma(r_s) \odot wkv)$
where $\sigma(\cdot)$ denotes the sigmoid function, $\odot$ denotes the Hadamard product, and $F_l$ denotes the linear projection. The channel mix module follows a similar calculation, formulated as
$v_c = F_l(\mathrm{SquaredReLU}(k_c))$
$O_c = F_l(\sigma(r_c) \odot v_c)$
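As a reference for the summation form above, the sketch below computes Bi-WKV naively in O(L²) together with the spatial-mix output. The shapes (per-token vectors of dimension C, learned per-channel decay $w$ and bonus $u$) and the sign convention of the decay follow the VRWKV description as we read it and are assumptions; the optimized linear-complexity kernel, the Q-shift, and the linear projections are not reproduced.

```python
import torch

def bi_wkv_naive(k, v, w, u):
    """Naive O(L^2) reference of the Bi-WKV summation; k, v: (L, C), w, u: (C,)."""
    L, _ = k.shape
    out = torch.empty_like(v)
    for t in range(L):
        num = torch.exp(u + k[t]) * v[t]                          # bonus term for the current token
        den = torch.exp(u + k[t])
        for i in range(L):
            if i == t:
                continue
            decay = torch.exp(-(abs(t - i) - 1) / L * w + k[i])   # relative-position decay
            num = num + decay * v[i]
            den = den + decay
        out[t] = num / den
    return out

L, C = 8, 4
r_s, k_s, v_s = torch.rand(L, C), torch.rand(L, C), torch.rand(L, C)
w, u = torch.rand(C), torch.rand(C)
wkv = bi_wkv_naive(k_s, v_s, w, u)
O_s = torch.sigmoid(r_s) * wkv        # spatial-mix output before the final linear projection F_l
```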
For comparison, we also implement VMamba [30] and ResNet [31] as basic encoding blocks and ultimately choose the VRWKV network. The detailed experiments are elaborated in Section 4.5.

3.5. Loss Function

Adhering to common practices [1,6,10,12], the loss function consists of four parts: the binary cross-entropy loss $L_{geo}$, the cross-entropy loss $L_{sem}$, the Lovász-Softmax loss $L_{lovasz}$, and the focal loss $L_{focal}$:
$L_{total} = L_{geo} + L_{sem} + L_{lovasz} + L_{focal}$
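A hedged sketch of assembling this objective is shown below; `lovasz_softmax` and `focal_loss` are hypothetical stand-ins for external implementations, and how the geometric (occupied vs. empty) and semantic targets are derived from the voxel labels is an assumption, not specified by this section.

```python
import torch.nn.functional as F

def total_loss(geo_logits, geo_gt, sem_logits, sem_gt, lovasz_softmax, focal_loss):
    """Sums the four loss terms; the last two are supplied as external callables."""
    l_geo = F.binary_cross_entropy_with_logits(geo_logits, geo_gt.float())   # L_geo: occupied vs. empty
    l_sem = F.cross_entropy(sem_logits, sem_gt)                              # L_sem: semantic labels
    l_lovasz = lovasz_softmax(sem_logits.softmax(dim=1), sem_gt)             # L_lovasz (external impl.)
    l_focal = focal_loss(sem_logits, sem_gt)                                 # L_focal (external impl.)
    return l_geo + l_sem + l_lovasz + l_focal
```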

4. Experiment

4.1. Datasets

We conduct experiments on the NuScenes dataset [32], a comprehensive autonomous driving dataset consisting of 1000 driving scenarios collected in Boston and Singapore. Each scenario spans 20 s, totaling approximately 15 h of driving data that cover a wide range of locations, times, and weather conditions. The NuScenes dataset offers abundant sensor data, including six cameras, one LiDAR (Light Detection and Ranging) sensor, five radars, and GPS and IMU (Inertial Measurement Unit) measurements. The occupancy ground truth is provided by Occ3D-NuScenes [14]. It includes 700 scenes for training and 150 scenes for validation. Each scene covers a range of [−40 m, −40 m, −1 m, 40 m, 40 m, 5.4 m] with a voxel size of [0.4 m, 0.4 m, 0.4 m]. The labels include 16 known object categories, an 'others' class for unknown objects, and an 'empty' class for unoccupied voxels.

4.2. Evaluation Metrics

We report the mean Intersection over Union (mIoU) and Ray-level mIoU (RayIoU) [5] to evaluate the prediction performance of our model. We also evaluate the model’s number of parameters, frames per second (FPS), and storage size of temporal data.

4.3. Implementation Details

We adopt ResNet-50 [31] as the image backbone, and the input image resolution is set to 704 × 256. We propose two versions of our model: the feature dimensions of ReOcc-B and ReOcc-S are set to 128 and 64, respectively. Our model is trained on 8 NVIDIA 4090 GPUs for 24 epochs with a batch size of 3. The optimizer is Adam [33] with a learning rate of $1 \times 10^{-4}$ and a weight decay of 0.01. Following previous research [5,9,12], the inference speed is tested on a single NVIDIA A100 GPU with a batch size of 1. Historical features are saved using the 'savez_compressed' function from the NumPy package to measure their storage size.
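For reproducibility of the storage measurement, the snippet below shows one way such a measurement can be taken with NumPy's `savez_compressed`; the feature-map size is an assumption, and the compressed size of real (non-random) features will generally be smaller than that of random data.

```python
import os
import numpy as np
import torch

h_T = torch.rand(1, 128, 200, 200)                  # an assumed BEV-sized historical feature map
np.savez_compressed('history.npz', h=h_T.numpy())   # compress and write the feature to disk
size_mb = os.path.getsize('history.npz') / 1024 ** 2
print(f'stored historical feature: {size_mb:.2f} MB')
```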

4.4. Main Results

We compare ReOcc with recent state-of-the-art (SOTA) methods on Occ3D-NuScenes in Table 1. When trained with a visible mask, ReOcc-B achieves 39.9 mIoU and 31.5 RayIoU on the NuScenes validation dataset with an inference speed of 23.4 FPS. Without the mask, ReOcc-B achieves 31.9 mIoU and 38.2 RayIoU. ReOcc-S, with reduced feature dimensions, shows a slight drop of 1.7% mIoU and 0.3% RayIoU with the mask and 0.9% mIoU and 1.0% RayIoU without the mask, but improves FPS by 12.8% over ReOcc-B.
To demonstrate the efficiency of our model, we perform a more detailed comparison with other real-time methods, as shown in Table 2. Fusing only a single feature map, our model achieves performance comparable to GSD-Occ(16f) [12] and Panoptic-FlashOcc(8f) [9] and outperforms SparseOcc(8f) [5] by 1.8 mIoU and 4.2 RayIoU. We provide a qualitative comparison with GSD-Occ [12] in Figure 5. Our method demonstrates stable reconstruction and superior perception capabilities. In terms of inference speed, ReOcc-B and ReOcc-S outperform SparseOcc(8f) [5] by 6.1 FPS and 9.1 FPS, GSD-Occ(16f) [12] by 3.4 FPS and 6.4 FPS, and GSD-Occ(2f) [12] by 2.4 FPS and 5.4 FPS, respectively. However, they are still around 10 FPS slower than the Panoptic-FlashOcc series. Although Panoptic-FlashOcc [9] boasts a very fast inference speed, it still uses the temporal fusion structure shown in Figure 1a, and its reported inference speed excludes the time required to extract features from historical images. Therefore, Panoptic-FlashOcc [9] cannot avoid redundant calculations in the temporal fusion process during the training stage. We include the time for extracting historical features and retest its inference speed. As shown by Panoptic-FlashOcc(2f)* and Panoptic-FlashOcc(8f)*, there is a significant decrease in FPS, indicating that the model's training is not efficient. In contrast, our model maintains a high inference speed during training, thus saving more training resources. Our model has the smallest number of parameters while maintaining promising performance and inference speed. The parameter count of ReOcc-S is 30.5% lower than that of SparseOcc(8f) [5], 51.4% lower than GSD-Occ(16f) [12], and 27.3% lower than Panoptic-FlashOcc(8f) [9]. Additionally, since both GSD-Occ [12] and Panoptic-FlashOcc [9] utilize long-term temporal feature fusion, they need to store multiple historical features. We measure these and find that the storage size of GSD-Occ(16f) [12] is 32.3 M and that of Panoptic-FlashOcc(8f) [9] is 55.0 M. Thanks to the advantages of the SMG structure, ReOcc only needs to store a single feature map, with a size approximately one-tenth of theirs. The storage size of ReOcc-S is only 1.9 M, even smaller than that of GSD-Occ(2f) [12] and Panoptic-FlashOcc(2f) [9]. We report ReOcc's IoU for specific categories and provide additional quantitative results in Appendix A.

4.5. Ablation Studies

We conduct ablation experiments with ReOcc-B on the NuScenes dataset. All of the experiments are trained with a visible mask.
  • Ablation on different components: Table 3 demonstrates the effectiveness of the Scene Memory Gate (SMG) and the Light Encoder. The baseline uses the GSD-Occ structure with a temporal fusion of 2 frames. Replacing the baseline's temporal fusion method with SMG leads to only minor changes in the model's parameters and inference speed while increasing IoU by 3.5%, mIoU by 7.0%, and RayIoU by 6.0%. This demonstrates SMG's effectiveness in preserving fused historical features. Replacing the baseline's 2D encoder with the Light Encoder keeps mIoU and RayIoU nearly unchanged while reducing the parameter count by 16.8%, demonstrating the Light Encoder's efficiency. When both modules are combined, the model's inference speed slightly decreases compared with the baseline, but its IoU, mIoU, and RayIoU improve by 5.1%, 8.4%, and 6.3%, respectively, and its parameter count decreases by 16.5%.
  • Ablation on Light Encoder: To explore a more efficient network structure for the encoding block, we conduct experiments with the VRWKV [13], VMamba [30], and ResNet [31] networks, as shown in Table 4. Compared with ResNet, VMamba improves IoU by 0.37 and mIoU by 0.20, slightly decreases RayIoU by 0.03, reduces the parameter count by 0.8 M, and lowers the inference speed by 1.5 FPS. VRWKV, on the other hand, improves IoU by 0.43 and mIoU by 0.28 over ResNet but lowers RayIoU by 0.30; it has 0.4 M fewer parameters and a 0.8 FPS lower inference speed. Considering mIoU as the primary metric, we choose the VRWKV network as the basic block in the Light Encoder. We also perform ablation experiments on the number of encoding blocks. Increasing the number of blocks from 1 to 2 improves IoU by 0.21, mIoU by 0.30, and RayIoU by 0.03, but FPS decreases by 0.8 and the parameter count increases by 2.34 M. Increasing the number of blocks from 2 to 3 improves IoU by 0.37, mIoU by 0.23, and RayIoU by 0.12, but FPS decreases by 1.4 and the parameter count increases by 5.77 M. To balance the model's performance, size, and inference speed, we set the number of blocks to 2 in this paper.
  • Ablation on SMG: To further validate the effectiveness and efficiency of SMG for temporal fusion, we compare it with GSD-Occ's temporal fusion method (denoted as GSD fusion) with 2-frame and 16-frame inputs, as shown in Table 5. SMG outperforms GSD fusion(16f) by 0.63 in IoU, 0.89 in mIoU, 0.41 in RayIoU, and 0.4 in FPS, while requiring only 3.53 M of storage space for temporal features. Although GSD fusion(2f) has fewer parameters and smaller storage than SMG, its performance is significantly inferior: SMG outperforms GSD fusion(2f) by 3.4% in IoU, 7.9% in mIoU, and 7.3% in RayIoU. This demonstrates SMG's superior temporal feature fusion efficiency compared with GSD-Occ's method. Moreover, to further validate the effectiveness of SMG's mask mechanism, we conduct an ablation study by removing the mask module. As shown in Table 6, this modification results in a performance degradation of 0.42 mIoU and 0.45 RayIoU. The result demonstrates that SMG's mask mechanism effectively filters temporal feature information and enhances the model's focus on semantically valuable regions. Additionally, we conduct an experiment to investigate whether SMG requires an active forgetting mechanism similar to LSTM's forget gate, also reported in Table 6. We implement a simple forget gate (FG) consisting of a convolutional layer $f_g$ with a kernel size of 3 and a sigmoid activation layer. Before the history feature map is stored, $h_T$ is first processed through this forget gate:
    $h_T \leftarrow \sigma(f_g(h_T)) \odot h_T$
    The addition of the forget gate yielded marginal IoU improvement while causing slight degradation in mIoU and RayIoU metrics. These results demonstrate that the active forgetting mechanism is unnecessary, as our existing mask mechanism already provides adequate filtering capability, and the recurrent computation inherently possesses long-term forgetting capacity.
  • Ablation on 3D-to-BEV transformation: We compare two existing methods for compressing 3D features into 2D representations: the first performs mean pooling along the height dimension, while the second, proposed by FlashOcc [6], first merges the feature dimension and the height dimension into a single dimension and then applies a convolutional layer for channel compression; a sketch of both variants is given after this list. As shown in Table 7, we denote these methods as 'mean pooling' and 'convolution', respectively. Both methods achieve comparable IoU scores, with mean pooling showing a slight advantage of +0.33 mIoU and +0.21 RayIoU.
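The sketch below illustrates the two compression variants compared in Table 7; the tensor sizes are assumptions, and the 1 × 1 convolution is only one plausible realization of FlashOcc's channel-to-height compression.

```python
import torch
import torch.nn as nn

b, C, H0, W0, Z0 = 1, 128, 200, 200, 16
V = torch.rand(b, C, H0, W0, Z0)                        # 3D voxel feature

# (1) Mean pooling: average along the height dimension
B_mean = V.mean(dim=-1)                                 # (b, C, H0, W0)

# (2) Convolution (FlashOcc-style): fold height into channels, then compress with a conv
fold = V.permute(0, 1, 4, 2, 3).reshape(b, C * Z0, H0, W0)
B_conv = nn.Conv2d(C * Z0, C, kernel_size=1)(fold)      # (b, C, H0, W0)
```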

5. Conclusions

In this paper, we introduce an efficient occupancy prediction model named ReOcc. It uses an SMG module to aggregate and store historical feature information in a single historical feature map. This approach reduces redundant computation in temporal fusion and saves storage space for feature data. Furthermore, we design a Light Encoder to further decrease the model's parameter count and computational load. On the Occ3D-NuScenes validation dataset, our model achieves performance comparable to other real-time competitors that fuse multiple historical features, but with fewer parameters and while fusing only one historical feature map, demonstrating its high efficiency. While converting 3D features to a BEV representation significantly improves inference speed, the inherent loss of height-dimension information in BEV features limits 3D reconstruction quality, even with supplementary 3D geometric features. This limitation becomes particularly apparent when objects overlap vertically (e.g., a person standing under a tree), as BEV representations struggle to distinguish such cases. Moreover, although our proposed Scene Memory Gate reduces redundant computation and storage compared with previous methods, its speed advantage remains modest. Future research will focus on developing more efficient temporal fusion mechanisms to accelerate inference, as well as exploring dual-view 2D image representations as potential alternatives to 3D features, thereby addressing BEV's information loss.

Author Contributions

Conceptualization, Y.X. and Z.C.; methodology, Z.C.; validation, Z.C.; formal analysis, Z.C.; investigation, Z.C. and Y.W.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C. and Y.W.; visualization, Z.C.; supervision, Y.X. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hunan Province, China, under Grant 2023JJ30082.

Data Availability Statement

The NuScenes dataset is available at https://www.nuscenes.org/ (accessed on 21 April 2025). The occupancy annotations are available at https://github.com/Tsinghua-MARS-Lab/Occ3D (accessed on 21 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In Table A1, we demonstrate the recognition and reconstruction capabilities of ReOcc-B and ReOcc-S for specific categories. In Figure A1, we visualize additional prediction results based on ReOcc-B. We present these results from a BEV perspective to facilitate a comprehensive display of the reconstructed scene.
Figure A1. Qualitative results on Occ3D-NuScenes validation set.
Table A1. ReOcc's 3D occupancy prediction performance for specific categories on the Occ3D-NuScenes.
| Method | Vis. Mask | SC IoU | SSC mIoU | Others | Barrier | Bicycle | Bus | Car | Const. Veh. | Motorcycle | Pedestrian | Traffic Cone | Trailer | Truck | Drive. Suf. | Other Flat | Sidewalk | Terrain | Manmade | Vegetation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReOcc-B | ✓ | 70.50 | 39.97 | 12.37 | 47.80 | 28.76 | 43.82 | 51.23 | 27.32 | 28.02 | 28.94 | 29.40 | 32.94 | 37.64 | 81.34 | 43.50 | 50.96 | 55.53 | 42.86 | 37.01 |
| ReOcc-S | ✓ | 70.02 | 39.28 | 12.78 | 47.22 | 27.54 | 44.31 | 50.90 | 26.38 | 27.58 | 28.20 | 28.23 | 31.00 | 36.92 | 80.77 | 42.21 | 50.39 | 55.10 | 41.73 | 36.48 |
| ReOcc-B | | 45.91 | 31.98 | 11.20 | 43.64 | 24.75 | 40.42 | 43.91 | 22.51 | 26.89 | 25.74 | 27.90 | 22.36 | 32.48 | 65.40 | 36.75 | 39.75 | 36.70 | 21.26 | 22.01 |
| ReOcc-S | | 45.47 | 31.61 | 10.87 | 42.60 | 25.93 | 39.50 | 43.62 | 22.30 | 26.46 | 25.63 | 27.35 | 21.48 | 31.95 | 65.07 | 35.91 | 39.08 | 36.22 | 21.03 | 22.11 |

References

  1. Cao, A.-Q.; de Charette, R. Monoscene: Monocular 3d Semantic Scene Completion. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  2. Li, Y.; Yu, Z.; Choy, C.B.; Xiao, C.; Álvarez, J.M.; Fidler, S.; Feng, C.; Anandkumar, A. VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  3. Ma, Q.; Tan, X.; Qu, Y.; Ma, L.; Zhang, Z.; Xie, Y. COTR: Compact Occupancy TRansformer for Vision-Based 3D Occupancy Prediction. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  4. Wang, Y.; Chen, Y.; Liao, X.; Fan, L.; Zhang, Z. PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  5. Liu, H.; Chen, Y.; Wang, H.; Yang, Z.; Li, T.; Zeng, J.; Chen, L.; Li, H.; Wang, L. Fully Sparse 3D Occupancy Prediction. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 8–16 October 2023. [Google Scholar]
  6. Yu, Z.; Shu, C.; Deng, J.; Lu, K.; Liu, Z.; Yu, J.; Yang, D.; Li, H.; Chen, Y. FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin. arXiv 2023, arXiv:2311.12058. [Google Scholar]
  7. Hou, J.; Li, X.; Guan, W.; Zhang, G.; Feng, D.; Du, Y.; Xue, X.; Pu, J. FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird’s-Eye View and Perspective View. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  8. Pan, M.; Liu, J.; Zhang, R.; Huang, P.; Li, X.; Liu, L.; Zhang, S. Renderocc: Vision-Centric 3d Occupancy Prediction with 2d Rendering Supervision. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  9. Yu, Z.; Shu, C.; Sun, Q.; Linghu, J.; Wei, X.; Yu, J.; Liu, Z.; Yang, D.; Li, H.; Chen, Y. Panoptic-FlashOcc: An Efficient Baseline to Marry Semantic Occupancy with Panoptic via Instance Center. arXiv 2024, arXiv:2406.10527. [Google Scholar]
  10. Li, Z.; Yu, Z.; Wang, W.; Anandkumar, A.; Lu, T.; Alvarez, J.M. Fb-bev: Bev representation from forward-backward view transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023. [Google Scholar]
  11. Huang, J.; Huang, G. BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection. arXiv 2022, arXiv:2203.17054. [Google Scholar]
  12. He, Y.; Chen, W.; Xun, T.; Tan, Y. Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement. arXiv 2024, arXiv:2407.13155. [Google Scholar]
  13. Duan, Y.; Wang, W.; Chen, Z.; Zhu, X.; Lu, L.; Lu, T.; Qiao, Y.; Li, H.; Dai, J.; Wang, W. Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures. arXiv 2024, arXiv:2403.02308. [Google Scholar]
  14. Tian, X.; Jiang, T.; Yun, L.; Wang, Y.; Wang, Y.; Zhao, H. Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving. arXiv 2023, arXiv:2304.14365. [Google Scholar]
  15. Zhang, Y.; Zhu, Z.; Du, D. OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023. [Google Scholar]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  17. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  18. Wei, Y.; Zhao, L.; Zheng, W.; Zhu, Z.; Zhou, J.; Lu, J. SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023. [Google Scholar]
  19. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 26–30 April 2020. [Google Scholar]
  20. Jiang, H.; Cheng, T.; Gao, N.; Zhang, H.; Liu, W.; Wang, X. Symphonize 3D Semantic Scene Completion with Contextual Instance Queries. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  21. Huang, Y.; Zheng, W.; Zhang, Y.; Zhou, J.; Lu, J. Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  22. Lu, Y.; Zhu, X.; Wang, T.; Ma, Y. OctreeOcc: Efficient and Multi-Granularity Occupancy Prediction Using Octree Queries. arXiv 2023, arXiv:2312.03774. [Google Scholar]
  23. Meagher, D. Geometric modeling using octree encoding. Comput. Graph. Image Process. 1982, 19, 129–147. [Google Scholar] [CrossRef]
  24. Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series and image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  25. Li, B.; Deng, J.; Zhang, W.; Liang, Z.; Du, D.; Jin, X.; Zeng, W. Hierarchical Temporal Context Learning for Camera-Based Semantic Scene Completion. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 1–4 October 2024. [Google Scholar]
  26. Park, J.; Xu, C.; Yang, S.; Keutzer, K.; Kitani, K.M.; Tomizuka, M.; Zhan, W. Time will tell: New outlooks and a baseline for temporal multi view 3d object detection. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 1–5 May 2023. [Google Scholar]
  27. Philion, J.; Fidler, S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
  28. Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 February 2023. [Google Scholar]
  29. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  30. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  32. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  33. Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  34. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  35. Gan, W.; Mo, N.; Xu, H.; Yokoya, N. A Comprehensive Framework for 3D Occupancy Estimation in Autonomous Driving. IEEE Trans. Intell. Veh. 2024. early access. [Google Scholar] [CrossRef]
Figure 1. Comparisons between occupancy prediction temporal fusion methods. (a) illustrates a widely used architecture for temporal fusion, where the backbone extracts all temporal features and sends them to the fusion module for aggregation. (b) is the architecture that GSD-Occ [12] employs. It stores the extracted history features in a queue and retrieves these features from the queue before the temporal fusion process. (c) represents our proposed method. We condense historical features through the Scene Memory Gate and pass the fused historical feature map to the next iteration.
Figure 2. The overall architecture of ReOcc. The backbone extracts image features from the input, which are then transformed into 3D voxel features via a 2D-to-3D transformation. On the semantic branch, the 3D voxel feature is compressed with a BEV feature. The Scene Memory Gate fuses the BEV feature with the historical feature map, and the fused feature is then fed into the 2D semantic encoder. On the geometric branch, the 3D voxel feature is processed through the 3D geometric encoder. The semantic and geometric features are combined to generate the prediction result.
Figure 3. The overview of Scene Memory Gate. The left part shows the iteration process of SMG, and the right part illustrates its detailed architecture. There’s a normalization layer between the convolution layer and the ReLU activation layer, which is omitted for clarity.
Figure 4. The overview of the Light Encoder. The left part shows the Light Encoder’s architecture, while the right part illustrates the VRWKV architecture.
Figure 5. Qualitative results comparison on Occ3D-NuScenes validation set. Our proposed ReOcc is able to capture more detailed information compared with GSD-Occ. Moreover, ReOcc’s reconstruction is more stable and avoids creating nonexistent structures in the scene (row 1).
Table 1. Three-dimensional occupancy prediction performance on the Occ3D-NuScenes. '-' indicates the metrics are not reported. Best values are bolded.
| Method | Backbone | Input Size | Vis. Mask | mIoU | RayIoU | RayIoU_1m | RayIoU_2m | RayIoU_4m | FPS |
|---|---|---|---|---|---|---|---|---|---|
| BEVFormer [34] | R101 | 1600 × 900 | ✓ | 39.2 | 32.4 | 26.1 | 32.9 | 38.0 | 3.0 |
| FB-Occ(16f) [10] | R50 | 704 × 256 | ✓ | 39.1 | 33.5 | 26.7 | 34.1 | 39.7 | 10.3 |
| FlashOcc [6] | R50 | 704 × 256 | ✓ | 32.0 | - | - | - | - | 29.6 |
| PanoOcc [4] | R101 | 1600 × 900 | ✓ | 42.1 | - | - | - | - | 3.0 |
| COTR [3] | R50 | 704 × 256 | ✓ | 44.5 | - | - | - | - | 0.9 |
| GSD-Occ(16f) [12] | R50 | 704 × 256 | ✓ | 39.4 | - | - | - | - | 20.0 |
| ReOcc-B(2f) (ours) | R50 | 704 × 256 | ✓ | 39.9 | 31.5 | 25.0 | 32.0 | 37.6 | 23.4 |
| ReOcc-S(2f) (ours) | R50 | 704 × 256 | ✓ | 39.2 | 31.4 | 24.9 | 31.9 | 37.3 | 26.4 |
| BEVFormer [34] | R101 | 1600 × 900 | | 23.7 | 33.7 | - | - | - | 3.0 |
| FB-Occ(16f) [10] | R50 | 704 × 256 | | 27.9 | 35.6 | - | - | - | 10.3 |
| SimpleOcc [35] | R101 | 672 × 336 | | 31.8 | 22.5 | 17.0 | 22.7 | 27.9 | 9.7 |
| BEVDet-Occ(2f) [11] | R50 | 704 × 256 | ✓ | 36.1 | 29.6 | 23.6 | 30.0 | 35.1 | 2.6 |
| BEVDet-Occ-Long(8f) [11] | R50 | 704 × 384 | ✓ | 39.3 | 32.6 | 26.6 | 33.1 | 38.2 | 0.8 |
| SparseOcc(8f) [5] | R50 | 704 × 256 | | 30.1 | 34.0 | 28.0 | 34.7 | 39.4 | 17.3 |
| GSD-Occ(16f) [12] | R50 | 704 × 256 | | 31.8 | 38.9 | 32.9 | 39.7 | 44.0 | 20.0 |
| Panoptic-FlashOcc(8f) [9] | R50 | 704 × 256 | | 31.6 | 38.5 | 32.8 | 39.3 | 43.4 | 35.6 |
| ReOcc-B(2f) (ours) | R50 | 704 × 256 | | 31.9 | 38.2 | 32.4 | 38.9 | 43.2 | 23.4 |
| ReOcc-S(2f) (ours) | R50 | 704 × 256 | | 31.6 | 37.8 | 31.9 | 38.6 | 43.0 | 26.4 |
Table 2. The detailed comparison of 3D occupancy prediction performance with real-time methods on the Occ3D-NuScenes. '-' indicates the metrics are not reported. '*' indicates the metrics are tested by us. Best values are bolded.
| Method | Backbone | Input Size | Vis. Mask | mIoU | RayIoU | FPS | Storage* ↓ | Param.* ↓ |
|---|---|---|---|---|---|---|---|---|
| GSD-Occ(2f) [12] | R50 | 704 × 256 | ✓ | 37.2 | - | 21.0 | 2.5 M | 114.7 M |
| GSD-Occ(8f) [12] | R50 | 704 × 256 | ✓ | 38.4 | - | 20.6 | 15.7 M | 114.8 M |
| GSD-Occ(16f) [12] | R50 | 704 × 256 | ✓ | 39.4 | - | 20.0 | 32.3 M | 115.0 M |
| ReOcc-B(2f) (ours) | R50 | 704 × 256 | ✓ | 39.9 | 31.5 | 23.4 | 3.5 M | 59.1 M |
| ReOcc-S(2f) (ours) | R50 | 704 × 256 | ✓ | 39.2 | 31.4 | 26.4 | 1.9 M | 55.8 M |
| SparseOcc(8f) [5] | R50 | 704 × 256 | | 30.1 | 34.0 | 17.3 | - | 80.3 M |
| GSD-Occ(16f) [12] | R50 | 704 × 256 | | 31.8 | 38.9 | 20.0 | 32.3 M | 115.0 M |
| Panoptic-FlashOcc(2f) [9] | R50 | 704 × 256 | | 30.3 | 36.8 | 35.9 | 6.8 M | 75.2 M |
| Panoptic-FlashOcc(8f) [9] | R50 | 704 × 256 | | 31.6 | 38.5 | 35.6 | 55.0 M | 76.8 M |
| Panoptic-FlashOcc(2f) * [9] | R50 | 704 × 256 | | 30.3 | 36.8 | 22.5 * | - | 75.2 M |
| Panoptic-FlashOcc(8f) * [9] | R50 | 704 × 256 | | 31.6 | 38.5 | 6.9 * | - | 76.8 M |
| ReOcc-B(2f) (ours) | R50 | 704 × 256 | | 31.9 | 38.2 | 23.4 | 3.5 M | 59.1 M |
| ReOcc-S(2f) (ours) | R50 | 704 × 256 | | 31.6 | 37.8 | 26.4 | 1.9 M | 55.8 M |
Table 3. Ablation on different components. Best values are bolded.
| Method | IoU | mIoU | RayIoU | FPS | Param. |
|---|---|---|---|---|---|
| Baseline | 67.07 | 36.85 | 29.71 | 23.6 | 70.84 M |
| +SMG | 69.43 | 39.43 | 31.52 | 23.6 | 71.07 M |
| +Light Encoder | 68.13 | 37.02 | 29.44 | 23.3 | 58.91 M |
| Final | 70.50 | 39.97 | 31.59 | 23.4 | 59.15 M |
Table 4. Ablation on Light Encoder. Best values are bolded.
| Method | Blocks | IoU | mIoU | RayIoU | FPS | Param. |
|---|---|---|---|---|---|---|
| ResNet | 2 | 70.07 | 39.69 | 31.89 | 24.2 | 59.55 M |
| VMamba [30] | 2 | 70.44 | 39.89 | 31.86 | 22.7 | 58.95 M |
| VRWKV [13] | 2 | 70.50 | 39.97 | 31.59 | 23.4 | 59.15 M |
| VRWKV [13] | 1 | 70.29 | 39.67 | 31.56 | 24.2 | 56.81 M |
| VRWKV [13] | 3 | 70.87 | 40.20 | 31.71 | 22.0 | 64.92 M |
Table 5. Ablation on temporal fusion methods. Best values are bolded.
| Method | IoU | mIoU | RayIoU | FPS | Param. | Storage |
|---|---|---|---|---|---|---|
| SMG | 70.50 | 39.97 | 31.59 | 23.4 | 59.15 M | 3.53 M |
| GSD fusion(2f) | 68.13 | 37.02 | 29.44 | 23.3 | 58.91 M | 2.53 M |
| GSD fusion(16f) | 69.87 | 39.08 | 31.18 | 23.0 | 59.06 M | 25.09 M |
Table 6. Ablation on SMG. Best values are bolded.
| Method | IoU | mIoU | RayIoU | RayIoU_1m | RayIoU_2m | RayIoU_4m |
|---|---|---|---|---|---|---|
| SMG | 70.50 | 39.97 | 31.59 | 25.04 | 32.05 | 37.66 |
| SMG w/o mask | 70.27 | 39.55 | 31.14 | 24.70 | 31.63 | 37.08 |
| SMG w/ FG | 70.59 | 39.77 | 31.46 | 24.88 | 31.97 | 37.54 |
Table 7. Ablation on 3D-to-BEV transformation. Best values are bolded.
| Method | IoU | mIoU | RayIoU | RayIoU_1m | RayIoU_2m | RayIoU_4m |
|---|---|---|---|---|---|---|
| mean pooling | 70.50 | 39.97 | 31.59 | 25.04 | 32.05 | 37.66 |
| convolution | 70.51 | 39.64 | 31.38 | 24.93 | 31.81 | 37.39 |
