Article

HBEVOcc: Height-Aware Bird’s-Eye-View Representation for 3D Occupancy Prediction from Multi-Camera Images

1
School of Information Science and Engineering, Shandong University, Qingdao 266237, China
2
School of Computer Science, University of Nottingham Malaysia, Semenyih 43500, Malaysia
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(3), 934; https://doi.org/10.3390/s26030934
Submission received: 26 December 2025 / Revised: 20 January 2026 / Accepted: 30 January 2026 / Published: 1 February 2026
(This article belongs to the Section Sensing and Imaging)

Abstract

Due to its ability to perceive fine-grained 3D scenes and recognize objects of arbitrary shape, 3D occupancy prediction plays a crucial role in vision-centric autonomous driving and robotics. However, most existing methods rely on voxel-based representations, which inevitably demand large amounts of memory and computing resources. To address this challenge and facilitate more efficient 3D occupancy prediction, we propose HBEVOcc, a Bird’s-Eye-View (BEV)-based method for 3D scene representation with a novel height-aware deformable attention module, which effectively leverages the latent height information within the BEV framework to compensate for the missing height dimension, significantly reducing computing resource consumption while enhancing performance. Specifically, our method first extracts multi-camera image features and lifts these 2D features into 3D BEV occupancy features via explicit and implicit view transformations. The BEV features are then further processed by a BEV feature extraction network and the height-aware deformable attention module, with the final 3D occupancy prediction obtained through a prediction head. To further enhance voxel supervision along the height axis, we introduce a height-aware voxel loss with adaptive vertical weighting. Extensive experiments on the Occ3D-nuScenes and OpenOcc datasets demonstrate that HBEVOcc achieves state-of-the-art results in terms of both mIoU and RayIoU with less training memory (even when trained on an RTX 2080Ti).

1. Introduction

Accurate 3D perception is a crucial foundation for scene understanding and obstacle avoidance in autonomous driving and robotics. In recent years, vision-based 3D perception methods have garnered significant attention over LiDAR-based methods due to their lower cost, superior generalization, stability, as well as their ability to obtain richer color information. Some vision methods have demonstrated notable success in 3D perception tasks, such as 3D object detection [1,2,3,4,5,6], semantic map reconstruction [7,8], depth estimation [9,10,11], and motion prediction [12], etc.
Unlike the above visual 3D perception tasks, 3D occupancy prediction [13,14,15] takes multi-camera images as input and represents the real 3D world as voxels, estimating the semantic occupancy state of each voxel in the surrounding environment. 3D occupancy prediction provides a more fine-grained 3D scene perception capability, able to describe arbitrarily complex shapes [16]. Moreover, 3D occupancy prediction models can identify both general objects and unusual obstacles, which is extremely important for scene understanding and reconstruction in autonomous driving and robotics. As an effective alternative to LiDAR-based perception, 3D occupancy prediction offers better assistance for downstream tasks and has very broad development prospects.
Despite the aforementioned advantages, 3D occupancy prediction remains a highly challenging task that needs to achieve a balance of accuracy, robustness, and efficiency. Current 3D occupancy prediction methods mostly rely on voxel-based heavy 3D representation and processing, such as 3D convolutions and transformer operators [14,16,17]. These approaches lead to high computational cost and memory consumption, making them impractical for the actual perception requirements of autonomous driving and robotics. Recent works have aimed to address these issues through various optimizations. For instance, TPVFormer [15] uses tri-perspective view representations to reduce the amount of computation, and OctreeOcc [18] employs an octree structure to represent 3D scenes. However, these models still take up a large amount of memory during training.
Bird’s-Eye-View (BEV)-based methods have achieved remarkable success in 3D object detection in terms of both accuracy and efficiency. Unlike voxel-based methods, which explicitly model 3D spatial structure via voxels (leading to high memory and computational costs), BEV-based methods [1,2] project multi-view image features onto a 2D top-down plane, collapsing the height dimension into channel-wise features to enable efficient computation. However, for the task of 3D occupancy prediction, it is generally believed that BEV-based methods collapse the height information and cannot effectively describe fine-grained 3D scene details. Although some recent efforts have attempted to employ BEV representations for 3D occupancy prediction [19,20], they often fail to achieve performance comparable to that of voxel-based methods.
During the transformation from 2D image features to 3D occupancy features, there are two primary view transformations. One is the explicit view transformation (EVT) that performs forward projection based on the predicted depth map, and the other is the implicit view transformation (IVT) that conducts backward projection through cross-attention. The EVT can efficiently lift 2D image features to 3D space using the predicted depth map, but its drawback is that sparse LiDAR points limit the supervision of pixel-level depth prediction. On the other hand, IVT enables end-to-end transformation but suffers from inherent depth ambiguities.
To solve the above problems, we propose a BEV-based 3D occupancy prediction framework to achieve excellent results while reducing resource consumption. We adopt both explicit and implicit view transformations to take advantage of their strengths and compensate for their weaknesses simultaneously. To address the problem of the height information deficiency in BEV, we introduce a height-aware deformable attention module that can mine the potential latent height information, enabling interactions between features of the same and different heights. To complement this at the supervision level, we further introduce a height-aware voxel loss with adaptive height weighting to better guide the learning of sparsely distributed occupancy voxels.
Our contributions are summarized as follows:
  • We design HBEVOcc, a framework leveraging BEV representation and a novel height-aware deformable attention module for 3D occupancy prediction. By effectively exploiting the latent height information embedded in BEV features, it addresses the absence of vertical dimensionality in BEV representations, resulting in a significant improvement in 3D occupancy prediction performance.
  • Our proposed method learns 3D occupancy prediction from multi-camera images through both explicit and implicit view transformations. It enables the efficient fusion of explicit, implicit, and multi-scale BEV features, significantly reducing the memory usage of 3D occupancy prediction whilst maintaining high performance. To further improve height voxel supervision, we introduce a height-aware voxel loss with adaptive weighting along the height axis.
  • Through extensive experiments on the Occ3D-nuScenes and OpenOcc dataset, we demonstrate that HBEVOcc outperforms existing methods in 3D occupancy prediction, achieving superior performance in this challenging task. Our results outperform not only BEV-based but also voxel-based methods, achieving a better trade-off between memory consumption and accuracy.

2. Related Work

2.1. Vision-Based 3D Occupancy Prediction

Recently, vision-based 3D occupancy prediction has attracted considerable attention in both academia and industry. PanoOcc [21] proposes a unified occupancy representation for camera-based 3D panoptic segmentation and occupancy prediction, aiming to integrate object detection and semantic segmentation into a single framework. It uses voxel queries to aggregate spatio-temporal information from multi-frame multi-view images via a coarse-to-fine scheme and introduces an occupancy sparsify module. RenderOcc [22] achieves 3D occupancy prediction using only 2D labels for supervision. SelfOcc [23] and OccNeRF [24] adopt a self-supervised approach for occupancy prediction, eliminating the dependence on occupancy labels. FB-OCC [25] enhances 3D occupancy prediction through forward–backward view transformation, integrating BEV and voxel representations, while employing depth and semantic pre-training. COTR [26] reconstructs a compact occupancy representation using a geometric encoder and a semantic decoder via the compact occupancy transformer. Nevertheless, it still relies on voxel-based modeling, which inherently incurs high GPU memory usage and computational costs, limiting scalability compared with BEV-based solutions. OctreeOcc [18] introduces a novel multi-granularity octree framework, which sparsifies the space and reduces the number of voxels. SAMOccNet [27] introduces the Segment Anything Model into occupancy prediction, enhancing fine-grained scene understanding through detailed visual feature extraction and fusion. OFMPNet [28] is an end-to-end model that jointly predicts future occupancy and motion flow using BEV inputs and a novel time-weighted loss. STCOcc [29] introduces a spatial–temporal cascade framework that explicitly utilizes the occupancy state to guide 3D feature refinement for improved scene understanding. Compared with these voxel-based methods, our approach avoids explicit voxelization and instead directly models height cues within BEV features. 
This design achieves effective height-aware occupancy prediction with significantly lower memory consumption and better scalability, making it more suitable for efficient deployment.

2.2. 3D Semantic Scene Completion

3D semantic scene completion (SSC) is the task most closely related to 3D occupancy prediction; it was first introduced in [30]. MonoScene [31] achieved 3D SSC from a monocular image for the first time through 2D and 3D UNets, bridged by Feature Line of Sight Projection (FLoSP). VoxFormer [32] adopts a novel two-stage design, employing depth-based query proposals and a sparse voxel transformer with deformable cross-attention and self-attention to achieve 3D SSC. OccFormer [17] designs a dual-path transformer network and adopts Mask2Former [33] to achieve semantic scene completion and 3D occupancy prediction. OccDepth [34] exploits the implicit depth information in stereo images, using Stereo Soft Feature Assignment (Stereo-SFA) and Occupancy-Aware Depth (OAD) modules to improve the effectiveness of 3D SSC. Symphonize [35] presents a novel paradigm that dynamically encodes instance-centric semantics, effectively mitigating geometric ambiguity through contextual scene reasoning.

2.3. BEV-Based 3D Scene Representation

BEV representations have been demonstrated to be a highly successful and effective approach in 3D object detection. A BEV representation encodes each BEV grid cell as a feature vector. Compared to voxel-based methods, BEV-based methods collapse the height dimension, thus improving computational efficiency. BEVDet [1] projects image features into BEV features using predicted depth, achieving a good balance between accuracy and inference speed. BEVFormer [2] implements 3D object detection through a transformer and uses cross-attention and self-attention to aggregate spatial and temporal features. Recently, some works have also applied BEV methods to 3D occupancy prediction. FlashOcc [20] introduces a plug-and-play paradigm that replaces 3D convolutions with 2D convolutions, while using a channel-to-height prediction head to convert BEV features into 3D occupancy outputs. FastOcc [19] accelerates inference by collapsing voxel features into 2D BEV features, supplementing them with voxel features obtained through the interpolation of image features, while utilizing BEV semantic segmentation for supervision. Although the aforementioned methods utilize BEV for 3D occupancy prediction, there still remains a gap compared to voxel-based 3D occupancy prediction methods. DHD [36] introduces an explicit height prior into occupancy prediction by predicting height maps with LiDAR supervision and decoupling them into multiple height masks via the proposed Mask Guided Height Sampling (MGHS) module. These masks enable 2D features to be projected into separate 3D subspaces. This explicit height decoupling strategy improves the accuracy on Occ3D-nuScenes. However, DHD requires dense height labels and a relatively heavy architecture consisting of multiple dedicated modules (HeightNet, MGHS), which increases the model complexity and training overhead.
In particular, projecting features into multiple height subspaces and aggregating them layer by layer incurs substantial GPU memory consumption, making DHD [36] less efficient compared to lightweight BEV-based approaches. In contrast, our method leverages height-aware deformable attention to implicitly mine latent vertical information already embedded in BEV features, without relying on external height labels or complex multi-stage subspace modeling. As a result, our framework achieves stronger efficiency–performance trade-offs: it improves height-aware representation while maintaining lightweight memory usage and architectural simplicity.

3. Proposed Method

3.1. Problem Formulation

Given a sequence of multi-camera image inputs, the aim of 3D occupancy prediction is to estimate the occupancy state and semantic category of each voxel in the 3D space surrounding the ego-vehicle. Specifically, the input images are defined as $I_i^t \in \mathbb{R}^{H_i \times W_i \times 3}$, where $i \in \{1, 2, \ldots, N\}$ indexes the $N$ surround-view cameras, and $t \in \{T, T-1, \ldots, T-\tau\}$ denotes the current timestamp $T$ together with $\tau$ historical frames. Here, $H_i$ and $W_i$ indicate the height and width of the input images, respectively. Furthermore, the extrinsic parameters $\{R_i^t\}$ and intrinsic parameters $\{K_i\}$ of the cameras, used for conversion between coordinate systems and for ego-motion compensation across frames, are also known. The range of the 3D space around the ego vehicle is $[X_{min}, Y_{min}, Z_{min}, X_{max}, Y_{max}, Z_{max}]$, and the resolution of the voxel label for 3D occupancy prediction is $[X, Y, Z]$ (e.g., $[200, 200, 16]$ in Occ3D [14]), with each voxel representing a real-world size of $\left[\frac{X_{max}-X_{min}}{X}, \frac{Y_{max}-Y_{min}}{Y}, \frac{Z_{max}-Z_{min}}{Z}\right]$.
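As a quick sanity check of the voxel-size formula above, the Occ3D-nuScenes numbers quoted later in the paper can be plugged in directly (a minimal sketch; the variable names are ours):

```python
# Voxel size = (range extent) / (grid resolution) per axis, using the
# Occ3D-nuScenes values quoted in the experiments section.
xyz_min = (-40.0, -40.0, -1.0)   # [X_min, Y_min, Z_min] in meters
xyz_max = (40.0, 40.0, 5.4)      # [X_max, Y_max, Z_max] in meters
resolution = (200, 200, 16)      # [X, Y, Z] voxel grid resolution

voxel_size = tuple(
    (hi - lo) / n for lo, hi, n in zip(xyz_min, xyz_max, resolution)
)
print(voxel_size)  # each axis comes out to 0.4 m
```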

3.2. Overview

Figure 1 shows the pipeline of our method. Given multi-camera images as input, we first extract features of images using a backbone network (e.g., ResNet-50 [37]); then, we lift the image features to 3D BEV space. After obtaining the initial features of BEV, a BEV encoder and height-aware deformable attention module are used to further refine the features, and then a BEV decoder progressively restores the spatial resolution. Finally, the explicit and implicit features are fused and fed into the prediction head to obtain the 3D occupancy results.
EVT and IVT are designed to address the inherent ambiguity and information loss when lifting multi-view image features into the BEV space by jointly modeling the explicit geometric projection and implicit learned occupancy queries.

3.3. Explicit View Transformation

We follow previous works such as FlashOcc [20] and BEVDet [4] to implement EVT, which lifts image features into 3D space using depth-based projections. After extracting features with a 2D backbone, we obtain the multi-camera image features $F_{img} = \{F_i \in \mathbb{R}^{C_f \times H_f \times W_f}\}_{i=1}^{N}$. For EVT, the depth distribution $D_{depth} = \{D_i \in \mathbb{R}^{D_{bin} \times H_f \times W_f}\}_{i=1}^{N}$ is predicted via a depth net, where $D_{bin}$ denotes the number of depth bins. The outer product $D_{depth} \otimes F_{img}$ is applied to lift image features to pseudo-LiDAR points $P_l \in \mathbb{R}^{N \times D_{bin} \times C_f \times H_f \times W_f}$ in the camera coordinates. Then, $P_l$ is transformed to the ego coordinate system and warped into a voxel grid with fixed resolution $[X, Y, Z_e]$ based on the points' 3D positions, where $Z_e$ is the height resolution along the $z$ axis. Next, voxel pooling is performed to obtain the voxel feature $P_{l1} \in \mathbb{R}^{C_f \times X \times Y \times Z_e}$, which is subsequently permuted and reshaped into $P_{l2} \in \mathbb{R}^{(Z_e \times C_f) \times X \times Y}$. Finally, a 2D convolutional preprocessing network is applied to produce the initial explicit BEV occupancy feature $O_e \in \mathbb{R}^{C_e \times X \times Y}$.
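The lift step of EVT can be sketched as an outer product between a per-pixel depth distribution and the image features (a minimal NumPy sketch with toy shapes, not the paper's implementation; the softmax depth and random features are stand-ins):

```python
import numpy as np

# Minimal sketch of the explicit view transformation lift step (lift-splat
# style). Shapes follow the text: C_f feature channels, D_bin depth bins,
# (H_f, W_f) feature map size.
rng = np.random.default_rng(0)
C_f, D_bin, H_f, W_f = 8, 4, 2, 3

feat = rng.standard_normal((C_f, H_f, W_f))           # image features F_img
depth_logits = rng.standard_normal((D_bin, H_f, W_f))
depth = np.exp(depth_logits)
depth /= depth.sum(axis=0, keepdims=True)             # softmax over depth bins

# Outer product D ⊗ F: one weighted feature per (depth bin, pixel), i.e. a
# pseudo-LiDAR point feature before voxel pooling.
points = depth[None, :, :, :] * feat[:, None, :, :]   # (C_f, D_bin, H_f, W_f)
print(points.shape)  # (8, 4, 2, 3)
```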

3.4. Implicit View Transformation

For IVT, we adopt the query-based cross-attention strategy proposed in BEVFormer [2], where learnable BEV queries interact with multi-view image features via spatial cross-attention. In IVT, we only use the cross-attention component of BEVFormer [2], which projects 3D points onto 2D image features to obtain the BEV features. As illustrated in Figure 1 (Implicit View Transformation), we predefine learnable parameters $Q_{ivt} \in \mathbb{R}^{C_{ivt} \times \frac{X}{2} \times \frac{Y}{2}}$ as implicit BEV occupancy queries. Then, we obtain the corresponding implicit BEV occupancy feature $O_{ivt} \in \mathbb{R}^{C_{ivt} \times \frac{X}{2} \times \frac{Y}{2}}$ using spatial cross-attention. The above process can be expressed as follows:
$$O_{ivt} = \frac{1}{|V_{hit}|} \sum_{i \in V_{hit}} \mathrm{CA}\big(Q_{ivt}, \mathcal{P}_i, F_i\big),$$
where $V_{hit}$ is the set of views hit by the 3D reference points in IVT, and $\mathrm{CA}(\cdot)$ denotes the cross-attention. For each 3D point in $Q_{ivt}$, we use a projection function $\mathcal{P}_i$ to obtain the reference point on the $i$-th camera image.
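The projection function and the hit-view test can be sketched with a pinhole camera model (the intrinsics and extrinsics below are illustrative, not values from the paper):

```python
import numpy as np

# Sketch of the 'hit view' test in IVT: project a 3D reference point with a
# pinhole model and keep only views where it lands inside the image.
def project(point_ego, R, t, K, hw):
    p_cam = R @ point_ego + t                # ego -> camera coordinates
    if p_cam[2] <= 0:                        # behind the camera: no hit
        return None
    uv = (K @ p_cam)[:2] / p_cam[2]          # perspective division
    h, w = hw
    if 0 <= uv[0] < w and 0 <= uv[1] < h:
        return uv                            # hit: valid 2D reference point
    return None

# Illustrative camera: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
print(project(np.array([0.0, 0.0, 10.0]), R, t, K, (480, 640)))  # [320. 240.]
```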

3.5. Height-Aware Deformable Attention

BEV-based methods collapse the height dimension into channel-wise features, leading to the loss of vertical spatial information and an inability to capture fine-grained 3D scene details. This deficiency limits their performance in 3D occupancy prediction, as height is critical for distinguishing objects and describing spatial structure. To address the lack of height information in BEV methods, we design a height-aware deformable attention (HADA) module, as shown in Figure 2. Inspired by [38], we also use deformable sampling points in the sampling process. After the initial BEV occupancy feature is passed through the BEV encoder, the BEV feature is $O_1 \in \mathbb{R}^{C_1 \times X \times Y}$ (e.g., $O_{ivt}$), where $[X, Y]$ is the resolution of the current BEV feature. A grid of reference points $p \in \mathbb{R}^{(N_h \times 2) \times X \times Y}$ is predefined, where $N_h$ is the number of attention height heads.
We define the deformable sampling process as follows:
$$\Delta p = B_{offset}(O_1), \qquad O_s = \mathrm{Sample}\big(W_{O_1} O_1,\; p + \Delta p\big),$$
where $\Delta p \in \mathbb{R}^{(N_h \times p_1 \times 2) \times X \times Y}$ is obtained from the 2D offset prediction network $B_{offset}$ shown in Figure 3b, and $p_1$ is the number of sampling points in the horizontal direction at each height. $W_{O_1}$ is a 2D convolutional network, $\mathrm{Sample}$ is the bilinear interpolation function, and $O_s \in \mathbb{R}^{(N_h \times p_1 \times C_h) \times X \times Y}$ is the sampled feature, where $C_h = C_1 / N_h$ is the feature dimension per height head. $P_{sample} = p + \Delta p \in \mathbb{R}^{(N_h \times p_1 \times 2) \times X \times Y}$ is the corresponding sampling position; for this addition, each reference point in $p$ is broadcast (replicated $p_1$ times along the sampling-point dimension) to match the shape of $\Delta p$. The query, key, and value features are then computed as
$$Q = W_q O_1, \qquad K_1 = W_{K_1} O_s, \qquad V_1 = W_{v_1} O_s.$$
After applying the 2D convolutional network $W_q$ to $O_1$, we obtain $Q \in \mathbb{R}^{(N_h \times C_h) \times X \times Y}$. We use GroupConv in $W_{K_1}$ and $W_{v_1}$, obtaining $K_1, V_1 \in \mathbb{R}^{(N_h \times p_1 \times C_h) \times X \times Y}$.
Figure 3. The architecture of each module in height-aware deformable attention. (a) Channel attention network. (b) Offset network. (c) Feedforward network.
In the height direction, we process $O_1$ using the channel attention shown in Figure 3a:
$$O_2 = \mathrm{ChannelAttention}(O_1), \qquad K_2 = W_{K_2} O_2, \qquad V_2 = W_{v_2} O_2,$$
where $O_2 \in \mathbb{R}^{C_1 \times X \times Y}$, $K_2, V_2 \in \mathbb{R}^{(N_h \times p_2 \times C_h) \times X \times Y}$, and $p_2$ is the number of points in the height direction. Then, we compute the attention by the following equation:
$$K = \mathrm{Concat}(K_1, K_2), \qquad V = \mathrm{Concat}(V_1, V_2), \qquad O_A = \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{C_h}}\right)V,$$
where the shapes of $K$ and $V$ are both $(N_h \times (p_1 + p_2) \times C_h) \times X \times Y$. When computing $O_A$, each query attends only to its nearby sampled points, so $Q$ is reshaped to $(N_h \times 1 \times C_h) \times X \times Y$, giving $K^T \in \mathbb{R}^{(N_h \times C_h \times (p_1 + p_2)) \times X \times Y}$, $QK^T \in \mathbb{R}^{(N_h \times 1 \times (p_1 + p_2)) \times X \times Y}$, and $O_A \in \mathbb{R}^{(N_h \times 1 \times C_h) \times X \times Y}$. After $O_A$ is processed by a feedforward network (FFN), the final output of HADA has shape $C_1 \times X \times Y$. As can be seen from Figure 2 and Figure 3c, we use GroupConv and GroupNorm in the FFN, which enables different processing of features at different heights.
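The per-cell attention over the $p_1 + p_2$ sampled points can be sketched as follows (a NumPy sketch with toy shapes and random stand-in features; the (heads, points, channels, X, Y) layout mirrors the shapes in the text):

```python
import numpy as np

# Sketch of the HADA attention step: at each BEV cell, a query of dimension
# C_h attends only to its p1 + p2 sampled key/value points, per height head.
rng = np.random.default_rng(0)
N_h, p1, p2, C_h, X, Y = 4, 4, 2, 8, 5, 5
P = p1 + p2  # total sampled points per query

Q = rng.standard_normal((N_h, 1, C_h, X, Y))
K = rng.standard_normal((N_h, P, C_h, X, Y))
V = rng.standard_normal((N_h, P, C_h, X, Y))

# Q K^T over the channel dim, per head and per BEV cell: (N_h, 1, P, X, Y).
scores = np.einsum('hqcxy,hpcxy->hqpxy', Q, K) / np.sqrt(C_h)
attn = np.exp(scores)
attn /= attn.sum(axis=2, keepdims=True)       # softmax over the P points
O_A = np.einsum('hqpxy,hpcxy->hqcxy', attn, V)
print(O_A.shape)  # (4, 1, 8, 5, 5)
```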

3.6. 3D Occupancy Prediction Head

After the BEV processing network, the explicit and implicit BEV occupancy features are denoted as $O_E \in \mathbb{R}^{C_E \times X \times Y}$ and $O_I \in \mathbb{R}^{C_I \times X \times Y}$, respectively. We upsample the low-resolution features $O_{E1} \in \mathbb{R}^{C_{E1} \times \frac{X}{2} \times \frac{Y}{2}}$ and $O_{I1} = \mathrm{HADA}(O_{ivt}) \in \mathbb{R}^{C_{I1} \times \frac{X}{2} \times \frac{Y}{2}}$ shown in Figure 1 and then fuse them with $O_E$ and $O_I$ to obtain the fused feature $O_F \in \mathbb{R}^{C_F \times X \times Y}$. Specifically, the explicit and implicit features are concatenated along the channel dimension, the upsampled low-resolution fused features are added to the high-resolution fused features, and a convolution layer performs the final feature fusion. The above process can be expressed as follows:
$$O_F = \mathrm{Conv}\big(\mathrm{Concat}(O_E, O_I) + \mathrm{Upsample}(\mathrm{Concat}(O_{E1}, O_{I1}))\big).$$
Like FlashOcc [20], we employ a Channel2Height prediction head: applying a $1 \times 1$ convolution to $O_F$ yields $O_{F1} \in \mathbb{R}^{(Z \times C) \times X \times Y}$, which is finally permuted and reshaped into $O \in \mathbb{R}^{X \times Y \times Z \times C}$ to obtain the final 3D occupancy prediction output.
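The Channel2Height step itself is a pure reshape-and-permute, which can be sketched as follows (toy sizes; $Z$ height levels and $C$ classes are placeholders):

```python
import numpy as np

# Sketch of the Channel2Height head: a (Z*C, X, Y) BEV tensor is reshaped and
# permuted into an (X, Y, Z, C) occupancy volume. Sizes are illustrative.
Z, C, X, Y = 16, 18, 4, 4
O_F1 = np.arange(Z * C * X * Y, dtype=np.float32).reshape(Z * C, X, Y)

# Split the channel axis into (Z, C), then move the spatial axes to the front.
O = O_F1.reshape(Z, C, X, Y).transpose(2, 3, 0, 1)  # -> (X, Y, Z, C)
print(O.shape)  # (4, 4, 16, 18)
```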

3.7. Height-Aware Voxel Loss

In 3D occupancy prediction, voxels are unevenly distributed along the height axis (Z-axis): voxels near the ground (e.g., 0–2 m) are densely occupied by objects like vehicles and pedestrians, while voxels at higher altitudes (e.g., 2–5.4 m) are sparsely occupied. Conventional loss functions treat all voxels equally, leading to biased supervision—models prioritize learning from dense low-height voxels and underperform on sparse high-height voxels, reducing the overall prediction accuracy.
To address this issue, we propose a height-aware voxel loss (HAVL), which enhances voxel supervision along the height axis by introducing adaptive height-dependent weighting. Specifically, we randomly sample a set of positions in the XY plane and compute the loss over all height levels at those positions as shown in Figure 4. This sampling strategy reduces the computational cost while ensuring that vertical occupancy patterns are jointly optimized.
For each height, a weight is assigned based on the number of occupied voxels at that height: heights with fewer occupied voxels are assigned larger weights. To penalize low-confidence errors more strongly, we apply a log function to the predicted probability. This design increases the gradient magnitude when the predicted probability approaches zero, which is particularly important for sparse occupancy voxels. We adopt Smooth L1 to prevent excessively large gradients caused by extreme log-probability values. The final loss is formulated as
$$\mathcal{L}_{HAV} = \mathrm{SmoothL1}\!\left(\sum_{z=1}^{Z}\sum_{(x,y) \in S} w_z \cdot \log\big(p_c(x,y,z)\big),\; 0\right),$$
where $w_z = w_{max} \cdot \left(\frac{w_{min}}{w_{max}}\right)^{N_z / N_{max}}$ is the height-aware weight at height $z$, $N_z$ is the number of occupied voxels at height $z$, and $N_{max}$ is the maximum among all $N_z$. This exponential scheme emphasizes sparse heights and ensures smooth weighting without hard thresholds. $(x, y) \in S$ denotes the randomly sampled positions in the XY plane, and $p_c(x,y,z)$ is the predicted probability of ground-truth class $c$ at voxel $(x,y,z)$.
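The weighting scheme can be sketched numerically as follows (the voxel counts and the $(w_{min}, w_{max})$ values are illustrative; only the functional form is from the paper):

```python
# Height-aware weight: w_z = w_max * (w_min / w_max) ** (N_z / N_max).
# Heights with few occupied voxels (small N_z) get weights close to w_max;
# the densest height (N_z = N_max) gets exactly w_min.
def height_weight(N_z, N_max, w_min=0.5, w_max=2.0):
    return w_max * (w_min / w_max) ** (N_z / N_max)

counts = [8000, 4000, 500, 50]   # hypothetical occupied-voxel counts per height
N_max = max(counts)
weights = [height_weight(n, N_max) for n in counts]
print([round(w, 3) for w in weights])  # increases as heights get sparser
```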

3.8. Model Optimization

In the model training stage, we use the cross-entropy loss $\mathcal{L}_{ce}$, the depth loss $\mathcal{L}_{depth}$ supervised by sparse LiDAR points, the Lovász-softmax loss $\mathcal{L}_{lovasz}$ from [39], the affinity losses $\mathcal{L}_{sem}$ and $\mathcal{L}_{geo}$ from MonoScene [31], and our proposed height-aware voxel loss $\mathcal{L}_{HAV}$ to optimize our model. The total loss function can thus be defined as follows:
$$\mathcal{L}_{total} = \lambda_{depth}\mathcal{L}_{depth} + \lambda_{ce}\mathcal{L}_{ce} + \lambda_{lovasz}\mathcal{L}_{lovasz} + \lambda_{sem}\mathcal{L}_{sem} + \lambda_{geo}\mathcal{L}_{geo} + \lambda_{HAV}\mathcal{L}_{HAV},$$
where each $\lambda$ is the weight of the corresponding loss; we set $\lambda_{depth} = 0.15$ and $\lambda_{ce} = \lambda_{lovasz} = 1$ following [4,20,36,40] in our experiments. The weights $\lambda_{sem} = \lambda_{geo} = \lambda_{HAV} = 0.1$ are chosen to ensure numerical consistency with the existing loss terms for stable training.

4. Experiments

4.1. Dataset

We conduct experiments on two large-scale 3D occupancy prediction benchmarks: Occ3D-nuScenes [14] and OpenOcc [41]. Each dataset comprises 700, 150, and 150 scenes in the training, validation, and testing sets, respectively; each scene is 20 s long and annotated at 2 Hz. Each frame consists of 6 surround-view camera images. The 3D occupancy labels have a spatial range of [−40 m, −40 m, −1 m, 40 m, 40 m, 5.4 m] along the X, Y, and Z axes, a voxel resolution of [200, 200, 16], and a voxel size of [0.4 m, 0.4 m, 0.4 m]. Each voxel of Occ3D-nuScenes is labeled with one of 18 categories (including 1 “other” class and 1 “free” class), and the labels provide a camera-visibility mask. OpenOcc annotates each voxel with 17 categories (including 1 “free” class) and additionally provides per-voxel motion flow annotations.

4.2. Experimental Settings

Implementation Details

For explicit BEV features, the initial resolution is set to 200 × 200, and the encoder follows the design of FlashOcc [20]. HADA is applied at resolutions of 100 × 100 and 25 × 25, where the BEV feature dimensions are set to 128/512 for single-frame input and 160/640 when temporal history frames are used. Accordingly, the explicit BEV feature $O_e$ has a shape of 200 × 200 × 64 or 200 × 200 × 80, depending on the presence of temporal input. For EVT, the voxel grid resolution along the $z$ axis, denoted as $Z_e$, is set to 8 for single-frame or one-history-frame input and reduced to 1 for inputs with multiple historical frames to save computation. For implicit BEV features, the initial resolution is 100 × 100, with feature dimensions of 128 and 160 for single-frame and temporal inputs, respectively. The resulting BEV feature $O_{ivt}$ has a shape of 100 × 100 × 128 or 100 × 100 × 160, followed by an upsampling operation and a two-layer convolution. HADA is applied at a resolution of 100 × 100. In HBEVOcc, both explicit and implicit BEV outputs, $O_E$ and $O_I$, are unified to a channel dimension of 256, and the fused BEV feature $O_F$ has a final dimension $C_F$ of 512. Under the setting without history frames, our approach does not rely on LiDAR-based depth supervision. We also construct a fast version, HBEVOcc-Fast, in which the explicit and implicit BEV features are added at the 100 × 100 resolution and HADA is applied solely at 25 × 25; in this case, the final fused dimension $C_F$ is reduced to 256. For temporal fusion in HBEVOcc and HBEVOcc-Fast, we adopt the Stereo4D and Depth4D schemes used in [4,20,36,40]. Specifically, when computing the BEV features of historical frames, gradients are disabled to reduce memory usage. After obtaining 3D features from both the current and historical frames, we concatenate them after a lightweight preprocessing network and then feed the concatenated features into the BEV encoder.

4.3. Evaluation Metrics

For 3D semantic occupancy prediction, we use mIoU as the evaluation metric on Occ3D-nuScenes [14]. In addition, we also use the RayIoU proposed in SparseOcc [42] as an evaluation metric on both Occ3D and OpenOcc [41]; under RayIoU, a prediction is counted as a true positive (TP) only if its class is consistent with the ground truth and the L1 distance between the predicted depth and the true depth is below a given threshold. We use the mean absolute velocity error (mAVE) to evaluate scene flow prediction across the defined categories (e.g., pedestrian, bus) on OpenOcc.

Training

During training, we adopt the AdamW [43] optimizer with a learning rate of 2 × 10−4 and a weight decay of 0.01, using a linear warm-up over the first 200 iterations. Models with a ResNet-50 backbone are trained on 8 RTX 2080Ti GPUs (11 GB of memory each), and those with a Swin-B backbone are trained on 4 RTX 4090 GPUs (24 GB of memory each), both with a batch size of 2.

4.4. Main Results

4.4.1. 3D Occupancy Prediction Results on Occ3D-nuScenes

We report the quantitative results and qualitative visualizations of the 3D semantic occupancy prediction results on the Occ3D-nuScenes dataset. In Table 1, we report in detail the comparison results of our HBEVOcc and other existing state-of-the-art methods on mIoU and each semantic class. Our method consistently achieves the best performance, regardless of whether camera masks or history frames are used. In Figure 5, we visualize the training memory and performance comparison between HBEVOcc and other methods. Our method uses less memory but achieves better performance. In Figure 6, we visualize the results of our model and the state-of-the-art methods. Our method can predict the occupancy semantic classes more accurately compared to SOTA.
In Table 2, we also report in detail the comparison results of our HBEVOcc, HBEVOcc-Fast, and other existing methods on RayIoU and mIoU. “Testing GPU: RTX 4090” indicates that each model is tested on an RTX 4090 using its official code. Regardless of whether a camera mask is used, our method achieves state-of-the-art RayIoU while maintaining fast inference speed.

4.4.2. 3D Occupancy Prediction Results on OpenOcc

In Table 3, we report the results of our HBEVOcc and other existing methods on RayIoU and mAVE. Our model achieves better occupancy and flow prediction results while consuming less memory (only 7.5 GB).

4.5. Ablation Study

To verify the effectiveness of our proposed method and modules, we perform ablation experiments on the Occ3D and OpenOcc datasets. For a fair comparison, we retrain FlashOcc [20] with the additional loss functions, including the Lovász-softmax loss $\mathcal{L}_{lovasz}$ from [39] and the affinity losses $\mathcal{L}_{sem}$ and $\mathcal{L}_{geo}$ from MonoScene [31], and treat this enhanced model as our baseline. As shown in Table 4, incorporating both EVT and IVT results in a 0.98% improvement in mIoU compared to using EVT alone, demonstrating the benefit of their combination. Meanwhile, using HADA yields a 1.08% higher mIoU than not using it, underscoring its effectiveness. In Figure 7, we visualize the 3D occupancy prediction results under three settings. It can be seen that our proposed HADA and HAVL enhance the geometric structure and semantic coherence of the 3D occupancy results, thereby improving scene understanding. Table 5 further shows that, without a camera mask, HADA still improves RayIoU and mIoU to some extent, and HAVL improves the mIoU by 1.7% without increasing memory usage. Table 6 demonstrates the improvements from HADA and HAVL on OpenOcc. Due to the long training time, we use an image resolution of 256 × 704 and a ResNet-50 image backbone in the ablation experiments, and no history frame is used.
As shown in Table 7, concatenation achieves the best mIoU performance among all evaluated fusion methods, outperforming additive fusion and gated fusion. In Table 8, we present the impact of different numbers of horizontal and height points in HADA on mIoU. The best performance is achieved with 4 horizontal points and 2 height points. To investigate the impact of the voxel grid resolution along the $z$ axis ($Z_e$) under different numbers of historical frames, we conduct experiments on the Occ3D dataset in Table 9. When no historical frame or only a single historical frame is used, adopting $Z_e = 8$ leads to better performance than $Z_e = 1$. However, when the number of historical frames increases to 4 or 8, $Z_e = 1$ consistently outperforms $Z_e = 8$. The temporal fusion of multiple historical frames provides additional geometric cues, which compensates for the potential loss of vertical information caused by a smaller z-dimension.
We first investigate the influence of the number of height levels in HAVL. As shown in Table 10, increasing the number of height levels generally improves performance, with the best result achieved at 16 levels. This indicates that sufficiently fine-grained height supervision benefits the learning of sparse occupancy patterns along the vertical axis, whereas an overly coarse height partition limits its effectiveness. We further analyze the effect of the number of sampled spatial positions in the XY plane. Table 11 shows that sampling 4000 positions achieves the best performance, while denser sampling (e.g., 20,000 or 40,000 positions) brings no further improvement. This suggests that HAVL does not rely on exhaustive voxel supervision; a moderate number of sampled positions suffices to provide stable and effective height-aware gradients. Table 12 shows that, within HAVL, the best performance is achieved with our proposed height-aware weights.
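The adaptive vertical weighting in HAVL can be illustrated with a small sketch. The exact weighting formula is not reproduced here; as an assumption, each height level is weighted by the log-damped inverse frequency of occupied voxels at that level, so that sparse upper levels receive stronger supervision than the densely occupied ground levels.

```python
import math

# Illustrative (assumed) height-aware weighting for HAVL:
# `counts[z]` is the number of occupied voxels at height level z.
# Levels with fewer occupied voxels get larger weights, normalized
# so the mean weight over all levels is 1.
def height_aware_weights(counts):
    total = sum(counts)
    raw = [math.log(1.0 + total / max(c, 1)) for c in counts]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]
```

For a typical driving scene, where occupancy decreases with height, this assigns monotonically increasing weights from road level to the topmost voxel layer.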
Table 13 shows the effect of the number of history frames in temporal fusion on RayIoU; long-term sequences lead to a significant improvement in the results. To evaluate the capability of HADA in mining and exploiting height information, we visualize in Figure 8 the heatmaps of the attention output features O_A at eight different height levels. As shown in Figure 8, the regions emphasized by HADA vary across heights, indicating that the model attends to distinct spatial features depending on the vertical dimension. This clearly demonstrates the effectiveness of HADA in capturing and leveraging height-aware representations. To quantify how the height setting of HADA affects performance across different vertical ranges, we conduct additional ablation experiments by varying the number of height levels in HADA (2, 4, 8, 16). As Table 14 shows, the model achieves the best mIoU of 36.76% with 8 height levels, confirming that 8 levels strike an optimal balance between capturing vertical details and computational efficiency. To further verify the effectiveness of our proposed HADA, we apply it to other methods. For BEV-based methods, HADA is applied directly to the BEV features. For voxel-based methods, we first transform the voxel features into BEV features by collapsing the z axis into the channel dimension and applying a 2D convolutional network; HADA is then applied to the transformed BEV features, which are subsequently mapped back into the voxel representation and fused with the original voxel features through addition. As shown in Table 15, both BEV-based and voxel-based models achieve a significant mIoU improvement with only a limited increase in memory usage. Here, mIoU* denotes the performance reported by the original methods; to ensure a fair comparison, we retrain these methods to obtain mIoU.
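The voxel-to-BEV adaptation described above (collapsing the z axis into the channel dimension, applying HADA, then mapping back and fusing by addition) can be sketched as follows. Here `hada` is a stand-in for the attention module, the 2D convolutional network is omitted for brevity, and tensors are plain nested lists of shape (Z, Y, X) holding C-dimensional feature vectors.

```python
# Sketch of attaching HADA to voxel-based methods (cf. Table 15).
def voxel_to_bev(vox):                        # (Z, Y, X, C) -> (Y, X, Z*C)
    Z, Y, X = len(vox), len(vox[0]), len(vox[0][0])
    return [[[f for z in range(Z) for f in vox[z][y][x]]
             for x in range(X)] for y in range(Y)]

def bev_to_voxel(bev, Z, C):                  # inverse reshape
    Y, X = len(bev), len(bev[0])
    return [[[bev[y][x][z * C:(z + 1) * C] for x in range(X)]
             for y in range(Y)] for z in range(Z)]

def apply_hada_to_voxel(vox, hada):
    Z, C = len(vox), len(vox[0][0][0])
    out = bev_to_voxel(hada(voxel_to_bev(vox)), Z, C)
    # residual fusion: add the refined features back onto the originals
    return [[[[a + b for a, b in zip(out[z][y][x], vox[z][y][x])]
              for x in range(len(vox[0][0]))]
             for y in range(len(vox[0]))] for z in range(Z)]
```

The round trip is lossless, so the voxel backbone keeps its original representation and HADA acts purely as a height-aware refinement branch.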

5. Discussion

As shown in Table 1, Table 2 and Table 3, our method consistently outperforms previous BEV-based and voxel-based approaches in terms of the occupancy prediction accuracy, while maintaining competitive inference efficiency. Compared with voxel-based methods, our approach avoids explicit 3D voxel computation and benefits from the compact BEV representation, leading to reduced inference memory consumption and favorable runtime performance. Although the proposed HADA module introduces additional computation, its cost is moderate due to the localized and sparse sampling strategy, making the overall complexity suitable for real-time or near-real-time deployment. Despite its effectiveness, our method has several limitations. First, extremely complex scenes with a large number of small or highly detailed objects may still pose challenges due to the inherent resolution limits of the BEV grid. Second, the current framework assumes relatively accurate camera calibration, and calibration errors may negatively impact the lifting process. Finally, like most vision-based methods, our approach may be affected by challenging environmental conditions such as low illumination, adverse weather, or sensor noise. In future work, incorporating temporal information or robustness-oriented data augmentation could further enhance performance under such conditions.

6. Conclusions

In this paper, we present HBEVOcc, a 3D occupancy prediction method based on the BEV representation. To enhance the perception and understanding of 3D scenes, we employ both explicit and implicit view transformations to obtain BEV features. Our proposed HADA module and HAVL effectively exploit the latent height information, compensating for the missing height dimension in BEV and significantly improving model performance. Our method achieves superior 3D occupancy prediction results while also reducing the training memory. Extensive experiments on the Occ3D-nuScenes and OpenOcc datasets demonstrate that HBEVOcc outperforms existing methods on both the mIoU and RayIoU metrics, proving the effectiveness of our approach.

Author Contributions

Conceptualization, Methodology, Writing—original draft, Investigation, C.L.; Software, Validation, Visualization, W.L. and C.L.; Formal analysis, I.Y.L.; Writing—review and editing, I.Y.L., F.D., H.L., and H.Z.; Funding acquisition, Supervision, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Key Research and Development Program of China (2021YFB2800300).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available at https://github.com/lvchuandong/HBEVOcc (accessed on 20 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790. [Google Scholar]
  2. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 1–18. [Google Scholar]
  3. Li, Y.; Bao, H.; Ge, Z.; Yang, J.; Sun, J.; Li, Z. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2023; Volume 37, pp. 1486–1494. [Google Scholar]
  4. Huang, J.; Huang, G. Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv 2022, arXiv:2203.17054. [Google Scholar]
  5. Wu, X.; Ma, D.; Qu, X.; Jiang, X.; Zeng, D. Depth dynamic center difference convolutions for monocular 3D object detection. Neurocomputing 2023, 520, 73–81. [Google Scholar]
  6. Tang, Y.; He, H.; Wang, Y.; Mao, Z.; Wang, H. Multi-modality 3D object detection in autonomous driving: A review. Neurocomputing 2023, 553, 126587. [Google Scholar] [CrossRef]
  7. Zhao, T.; Chen, Y.; Wu, Y.; Liu, T.; Du, B.; Xiao, P.; Qiu, S.; Yang, H.; Li, G.; Yang, Y.; et al. Improving Bird’s Eye View Semantic Segmentation by Task Decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 15512–15521. [Google Scholar]
  8. Xu, Z.; Li, S.; Peng, L.; Jiang, B.; Huang, R.; Chen, Y. Ultra-fast semantic map perception model for autonomous driving. Neurocomputing 2024, 599, 128162. [Google Scholar] [CrossRef]
  9. Agarwal, A.; Arora, C. Attention attention everywhere: Monocular depth prediction with skip attention. In IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 5861–5870. [Google Scholar]
  10. Masoumian, A.; Rashwan, H.A.; Abdulwahab, S.; Cristiano, J.; Asif, M.S.; Puig, D. GCNDepth: Self-supervised monocular depth estimation based on graph convolutional network. Neurocomputing 2023, 517, 81–92. [Google Scholar] [CrossRef]
  11. Zhao, G.; Wei, H.; He, H. IAFMVS: Iterative Depth Estimation with Adaptive Features for Multi-View Stereo. Neurocomputing 2025, 629, 129682. [Google Scholar] [CrossRef]
  12. Hu, A.; Murez, Z.; Mohan, N.; Dudas, S.; Hawke, J.; Badrinarayanan, V.; Cipolla, R.; Kendall, A. Fiery: Future instance prediction in bird’s-eye view from surround monocular cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 15273–15282. [Google Scholar]
  13. Xu, H.; Chen, J.; Meng, S.; Wang, Y.; Chau, L.P. A survey on occupancy perception for autonomous driving: The information fusion perspective. Inf. Fusion 2025, 114, 102671. [Google Scholar] [CrossRef]
  14. Tian, X.; Jiang, T.; Yun, L.; Mao, Y.; Yang, H.; Wang, Y.; Wang, Y.; Zhao, H. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. Adv. Neural Inf. Process. Syst. 2024, 36, 64318–64330. [Google Scholar]
  15. Huang, Y.; Zheng, W.; Zhang, Y.; Zhou, J.; Lu, J. Tri-perspective view for vision-based 3d semantic occupancy prediction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 9223–9232. [Google Scholar]
  16. Wei, Y.; Zhao, L.; Zheng, W.; Zhu, Z.; Zhou, J.; Lu, J. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 21729–21740. [Google Scholar]
  17. Zhang, Y.; Zhu, Z.; Du, D. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 9433–9443. [Google Scholar]
  18. Lu, Y.; Zhu, X.; Wang, T.; Ma, Y. Octreeocc: Efficient and multi-granularity occupancy prediction using octree queries. Adv. Neural Inf. Process. Syst. 2024, 37, 79618–79641. [Google Scholar]
  19. Hou, J.; Li, X.; Guan, W.; Zhang, G.; Feng, D.; Du, Y.; Xue, X.; Pu, J. FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird’s-Eye View and Perspective View. arXiv 2024, arXiv:2403.02710. [Google Scholar]
  20. Yu, Z.; Shu, C.; Deng, J.; Lu, K.; Liu, Z.; Yu, J.; Yang, D.; Li, H.; Chen, Y. Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv 2023, arXiv:2311.12058. [Google Scholar]
  21. Wang, Y.; Chen, Y.; Liao, X.; Fan, L.; Zhang, Z. Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 17158–17168. [Google Scholar]
  22. Pan, M.; Liu, J.; Zhang, R.; Huang, P.; Li, X.; Xie, H.; Wang, B.; Liu, L.; Zhang, S. Renderocc: Vision-centric 3d occupancy prediction with 2d rendering supervision. In 2024 IEEE International Conference on Robotics and Automation (ICRA); IEEE: Piscataway, NJ, USA, 2024; pp. 12404–12411. [Google Scholar]
  23. Huang, Y.; Zheng, W.; Zhang, B.; Zhou, J.; Lu, J. Selfocc: Self-supervised vision-based 3d occupancy prediction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 19946–19956. [Google Scholar]
  24. Zhang, C.; Yan, J.; Wei, Y.; Li, J.; Liu, L.; Tang, Y.; Duan, Y.; Lu, J. Occnerf: Advancing 3d occupancy prediction in lidar-free environments. IEEE Trans. Image Process. 2025, 34, 3096–3107. [Google Scholar] [CrossRef]
  25. Li, Z.; Yu, Z.; Wang, W.; Anandkumar, A.; Lu, T.; Alvarez, J.M. Fb-bev: Bev representation from forward-backward view transformations. In IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 6919–6928. [Google Scholar]
  26. Ma, Q.; Tan, X.; Qu, Y.; Ma, L.; Zhang, Z.; Xie, Y. Cotr: Compact occupancy transformer for vision-based 3d occupancy prediction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 19936–19945. [Google Scholar]
  27. Tan, Q.; Liu, W.; Bi, H.; Wang, L.; Yang, L.; Qiao, Y.; Zhao, Z.; Jiang, Y.; Guo, Q.; Liu, H.; et al. SAMOccNet: Refined SAM-based Surrounding Semantic Occupancy Perception for Autonomous Driving. Neurocomputing 2025, 650, 130918. [Google Scholar] [CrossRef]
  28. Murhij, Y.; Yudin, D. OFMPNet: Deep end-to-end model for occupancy and flow prediction in urban environment. Neurocomputing 2024, 586, 127649. [Google Scholar] [CrossRef]
  29. Liao, Z.; Wei, P.; Chen, S.; Wang, H.; Ren, Z. Stcocc: Sparse spatial-temporal cascade renovation for 3d occupancy and scene flow prediction. In Computer Vision and Pattern Recognition Conference; IEEE: Piscataway, NJ, USA, 2025; pp. 1516–1526. [Google Scholar]
  30. Song, S.; Yu, F.; Zeng, A.; Chang, A.X.; Savva, M.; Funkhouser, T. Semantic scene completion from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 1746–1754. [Google Scholar]
  31. Cao, A.Q.; De Charette, R. Monoscene: Monocular 3d semantic scene completion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 3991–4001. [Google Scholar]
  32. Li, Y.; Yu, Z.; Choy, C.; Xiao, C.; Alvarez, J.M.; Fidler, S.; Feng, C.; Anandkumar, A. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 9087–9098. [Google Scholar]
  33. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 1290–1299. [Google Scholar]
  34. Miao, R.; Liu, W.; Chen, M.; Gong, Z.; Xu, W.; Hu, C.; Zhou, S. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv 2023, arXiv:2302.13540. [Google Scholar]
  35. Jiang, H.; Cheng, T.; Gao, N.; Zhang, H.; Lin, T.; Liu, W.; Wang, X. Symphonize 3d semantic scene completion with contextual instance queries. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 20258–20267. [Google Scholar]
  36. Wu, Y.; Yan, Z.; Wang, Z.; Li, X.; Hui, L.; Yang, J. Deep height decoupling for precise vision-based 3d occupancy prediction. arXiv 2024, arXiv:2409.07972. [Google Scholar] [CrossRef]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar]
  38. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision transformer with deformable attention. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 4794–4803. [Google Scholar]
  39. Berman, M.; Triki, A.R.; Blaschko, M.B. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 4413–4421. [Google Scholar]
  40. Yu, Z.; Shu, C.; Sun, Q.; Linghu, J.; Wei, X.; Yu, J.; Liu, Z.; Yang, D.; Li, H.; Chen, Y. Panoptic-FlashOcc: An Efficient Baseline to Marry Semantic Occupancy with Panoptic via Instance Center. arXiv 2024, arXiv:2406.10527. [Google Scholar]
  41. Tong, W.; Sima, C.; Wang, T.; Chen, L.; Wu, S.; Deng, H.; Gu, Y.; Lu, L.; Luo, P.; Lin, D.; et al. Scene as occupancy. In IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 8406–8415. [Google Scholar]
  42. Liu, H.; Chen, Y.; Wang, H.; Yang, Z.; Li, T.; Zeng, J.; Chen, L.; Li, H.; Wang, L. Fully sparse 3d occupancy prediction. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 54–71. [Google Scholar]
  43. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  44. Shi, Y.; Cheng, T.; Zhang, Q.; Liu, W.; Wang, X. Occupancy as set of points. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 72–87. [Google Scholar]
  45. Ye, Z.; Jiang, T.; Xu, C.; Li, Y.; Zhao, H. CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction. arXiv 2024, arXiv:2409.13430. [Google Scholar]
  46. Li, J.; He, X.; Zhou, C.; Cheng, X.; Wen, Y.; Zhang, D. ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers. arXiv 2024, arXiv:2405.04299. [Google Scholar]
  47. Tan, X.; Wu, W.; Zhang, Z.; Fan, C.; Peng, Y.; Zhang, Z.; Xie, Y.; Ma, L. Geocc: Geometrically enhanced 3d occupancy network with implicit-explicit depth fusion and contextual self-supervision. IEEE Trans. Intell. Transp. Syst. 2025, 26, 5613–5623. [Google Scholar] [CrossRef]
  48. Gan, W.; Mo, N.; Xu, H.; Yokoya, N. A Comprehensive Framework for 3D Occupancy Estimation in Autonomous Driving. IEEE Trans. Intell. Veh. 2024, 9, 7852–7864. [Google Scholar] [CrossRef]
  49. He, Y.; Chen, W.; Xun, T.; Tan, Y. Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement. arXiv 2024, arXiv:2407.13155. [Google Scholar]
  50. Liu, Y.; Mou, L.; Yu, X.; Han, C.; Mao, S.; Xiong, R.; Wang, Y. Let occ flow: Self-supervised 3d occupancy flow prediction. arXiv 2024, arXiv:2407.07587. [Google Scholar] [CrossRef]
Figure 1. The overview of our proposed 3D occupancy prediction model (HBEVOcc). Firstly, we extract 2D features from multi-camera images using an image backbone network. Subsequently, we utilize EVT and IVT to lift 2D image features into 3D BEV space. We process BEV features by a BEV encoder and HADA module. Finally, we use a BEV decoder to recover spatial resolution, and then, we further fuse the explicit and implicit features to predict 3D occupancy through the prediction head.
Figure 2. The architecture of height-aware deformable attention.
Figure 4. Height voxel loss. Different colors represent randomly selected voxels at different heights.
Figure 5. Comparison of mIoU and training memory of various 3D occupancy prediction methods on the Occ3D-nuScenes dataset. Different colors and shapes represent different methods. “1f” and “8f” mean fusing temporal information from 1 and 8 history frames using ResNet-50 backbone. The solid and hollow shapes represent whether the camera mask is used for training or not.
Figure 6. Qualitative visualization comparison results on the Occ3D-nuScenes dataset. The first column is a camera image, the second column is the ground truth, and the rest are the 3D occupancy prediction results of BEVDet4D, FlashOcc, FBOcc, and our method HBEVOcc.
Figure 7. Qualitative visualization results for ablation study on the Occ3D-nuScenes dataset. The leftmost part shows surround-view camera image inputs, the second part shows the ground truth, and the last three parts visualize the 3D occupancy predictions under three settings: (1) without HADA and HAVL, (2) with HADA only, and (3) with both HADA and HAVL.
Figure 8. Visualization of the HADA attention maps at eight different height levels. Different colors in the heatmaps indicate varying intensities of the attention features.
Table 1. 3D Occupancy prediction results (mIoU) on Occ3D-nuScenes dataset. All results are from official papers or codes. The best results are marked in bold.
Method | Mask | History Frame | Backbone | Image Size | mIoU (%) ↑ | others | barrier | bicycle | bus | car | cons. veh. | motorcycle | pedestrian | traffic cone | trailer | truck | drive. surf. | other flat | sidewalk | terrain | manmade | vegetation
MonoScene [31]ResNet-101900 × 16006.061.757.234.264.939.385.673.983.015.904.457.1714.916.327.927.431.017.65
OccFormer [17]ResNet-101900 × 160021.935.9430.2912.3234.4039.1714.4416.4517.229.2713.9026.3650.9930.9634.6622.736.766.97
TPVFormer [15]ResNet-101900 × 160028.346.6739.2014.2441.5446.9819.2122.6417.8714.5430.2035.5156.1833.6535.6931.6119.9716.12
CTF-Occ [14]ResNet-101900 × 160028.538.0939.3320.5638.2942.2416.9324.5222.7221.0522.9831.1153.3333.8437.9833.2320.7918.00
HBEVOcc (ours)ResNet-50256 × 70429.136.4837.6518.0538.6642.5618.4521.7219.9418.1321.4330.0362.6634.3339.9437.3924.0123.73
BEVFormer [2]3ResNet-101900 × 160023.675.0338.799.9834.4141.0913.2416.5018.1517.8318.6627.7048.9527.7329.0825.3815.4114.46
BEVStereo [3]1ResNet-101900 × 160024.515.7338.417.8838.7041.2017.5617.3314.6910.3116.8429.6254.0828.9232.6826.5418.7417.49
SparseOcc [42]16ResNet-50256 × 70430.910.639.220.232.943.319.423.823.429.321.429.367.736.344.640.922.021.9
HBEVOcc (ours)1ResNet-50256 × 70434.3410.5145.4124.3241.1047.6523.7926.5924.6827.2927.8834.5865.0235.3842.8340.4935.0331.23
HBEVOcc (ours)8ResNet-50256 × 70436.4312.6949.1327.1341.1849.3223.4729.7127.3632.1229.0336.4367.0137.1644.7141.7137.6933.48
BEVDetOcc [4]ResNet-50256 × 70431.646.6536.978.3338.6944.4615.2113.6716.3915.2727.1131.0478.7036.4548.2751.6836.8232.09
FlashOcc [20]ResNet-50256 × 70431.956.2139.5711.2736.3243.9516.2514.7316.8915.7628.5630.9178.1637.5247.4251.3536.7931.42
DHD-S [36]ResNet-50256 × 70436.5010.5943.2123.0240.6147.3121.6823.2523.8523.4031.7534.1580.1641.3049.9554.0738.7333.51
HBEVOcc (ours)ResNet-50256 × 70436.9311.0044.0723.8340.4648.922.2924.4925.8025.8029.1934.2479.8241.3250.3353.5638.5434.16
BEVDet4D [4]1ResNet-50256 × 70436.018.2244.2110.3442.0849.6323.3717.4121.4919.7031.3337.0980.1337.3750.4154.2945.5639.59
FlashOcc [20]1ResNet-50256 × 70437.849.0846.3217.7142.750.6423.7220.1322.3424.0930.2637.3981.6840.1352.3456.4647.6940.6
OSP [44]1ResNet-101900 × 160041.2110.9549.027.6850.2455.9922.9631.0230.9130.2535.6041.2382.0942.5951.955.144.8238.17
COTR (BEVDet4D) [26]1ResNet-50256 × 70441.3912.2048.5129.0844.6653.3327.0129.1928.9130.9835.0339.5081.8342.5353.7156.8648.1842.09
DHD-M [36]1ResNet-50256 × 70441.4912.7248.6826.3143.2252.9227.3328.4928.5230.0235.8140.2483.1244.6754.7157.6948.8742.09
HBEVOcc (ours)1ResNet-50256 × 70441.8413.2849.9628.8845.7653.7728.1929.6829.2032.3834.7740.2782.2844.0053.6056.7847.2341.20
FBOCC [25]16ResNet-50256 × 70439.1113.5744.7427.0145.4149.125.1526.3327.8627.7932.2836.7580.0742.7651.1855.1342.1937.53
FastOcc [19]16ResNet-101640 × 160039.2112.0643.5328.0444.8052.1622.9629.1429.6826.9830.8138.4482.0441.9351.9253.7141.0435.49
BEVFormer [2]3ResNet-101900 × 160039.2410.1347.9124.9047.5754.5220.2328.8528.0225.7333.0338.5681.9840.6550.9353.0243.8637.15
BEVDet4D [4]8ResNet-50384 × 70439.269.3347.0519.2341.4752.2127.1922.2323.3221.5835.7738.9482.4840.4253.7557.7149.9445.76
CVT-Occ [45]6ResNet-101900 × 160040.349.4549.4623.5749.1855.6323.127.8528.8829.0734.9740.9881.4440.9251.3754.2545.9439.71
ViewFormer [46]3ResNet-50256 × 70441.8512.9450.1127.9744.6152.8522.3829.6228.0129.2835.1839.4084.7149.3957.4459.6947.3740.56
PanoOcc [21]3ResNet-101900 × 160042.1311.6750.4829.6449.4455.5223.2933.2630.5530.9934.4342.5783.3144.2354.4056.0445.9440.40
GEOcc [47]8ResNet-50256 × 70443.6414.2951.2731.1146.1355.0929.1230.4630.9935.4735.241.8284.047.055.5259.550.0344.82
HBEVOcc (ours)8ResNet-50256 × 70443.9814.3852.8930.6546.2955.8429.0033.2932.1536.4237.1241.9982.8645.4854.9158.9650.7744.6
BEVDet4D [4]1Swin-B512 × 140842.0212.1549.6325.152.0254.4627.8727.9928.9427.2336.4342.2282.3143.2954.4657.948.6143.55
FlashOcc [20]1Swin-B512 × 140843.5213.4251.0727.6851.5756.2227.2729.9829.9329.8037.7743.5283.8146.5556.1559.5650.8444.67
GEOcc [47]8Swin-B512 × 140844.6714.0251.433.0852.0856.7230.0433.5432.3435.8339.3444.1883.4946.7755.7258.9448.8543.0
HBEVOcc (ours)1Swin-B512 × 140845.2015.0352.5133.6652.9856.9329.0334.5433.4135.8338.5844.2983.7947.4256.2459.3350.5344.30
Table 2. 3D Occupancy prediction results (RayIoU) on Occ3D-nuScenes dataset. All results are from official papers or codes. The best results are marked in bold.
Method | Mask | History Frames | Backbone | Input Size | Epoch | RayIoU (%) ↑ | RayIoU@1m, 2m, 4m | mIoU (%) ↑ | FPS ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Training GPU | Testing GPU
SimpleOccupancy [48]ResNet-101336 × 6721222.517.022.727.931.89.7--A100A100
BEVFormer [2]3ResNet-101900 × 16002432.426.132.938.039.23.025.16.7A100A100
BEVDet4D [4]1ResNet-50256 × 7049029.623.630.035.136.12.68.44.7A100A100
BEVDet4D [4]8ResNet-50384 × 7049032.626.633.138.239.30.810.16.4A100A100
FBOcc [25]16ResNet-50256 × 7049033.526.734.139.739.110.311.15.5A100A100
HBEVOcc-Fast(ours)1ResNet-50256 × 7042431.424.831.837.639.118.96.42.7RTX 2080TiRTX 4090
HBEVOcc-Fast(ours)8ResNet-50256 × 7042433.426.933.839.441.214.66.92.8RTX 2080TiRTX 4090
HBEVOcc (ours)1ResNet-50256 × 7042433.426.933.839.441.88.27.33.0RTX 2080TiRTX 4090
HBEVOcc (ours)8ResNet-50256 × 7042434.928.635.440.844.05.47.53.1RTX 2080TiRTX 4090
SparseOcc [42]8ResNet-50256 × 7042434.028.034.739.430.117.112.25.4A100RTX 4090
SparseOcc [42]16ResNet-50256 × 7042435.129.135.840.330.614.122.96.9A100RTX 4090
SparseOcc [42]16ResNet-50256 × 7044836.130.236.841.230.914.122.96.9A100RTX 4090
Panoptic-FlashOcc [40]1ResNet-50256 × 7042436.030.136.841.129.639.46.12.2A100RTX 4090
Panoptic-FlashOcc [40]8ResNet-50256 × 7042438.532.839.343.431.520.46.32.4A100RTX 4090
GSD-Occ [49]16ResNet-50256 × 7042438.9----20.0-4.8A100A100
HBEVOcc-Fast (ours)1ResNet-50256 × 7042437.130.937.942.531.718.96.42.7RTX 2080TiRTX 4090
HBEVOcc-Fast (ours)8ResNet-50256 × 7042439.633.440.345.034.014.66.92.8RTX 2080TiRTX 4090
HBEVOcc-Fast (ours)8ResNet-50256 × 7044840.134.240.845.334.214.66.92.8RTX 2080TiRTX 4090
HBEVOcc (ours)1ResNet-50256 × 7042439.233.340.044.434.38.27.33.0RTX 2080TiRTX 4090
HBEVOcc (ours)8ResNet-50256 × 7042441.035.941.845.536.45.47.53.1RTX 2080TiRTX 4090
HBEVOcc (ours)8ResNet-50256 × 7044841.536.142.246.336.55.47.53.1RTX 2080TiRTX 4090
Table 3. 3D occupancy prediction results (RayIoU and mAVE) on OpenOcc. C and L denote camera and Lidar supervision. The best results are marked in bold.
Method | Sup. | Backbone | Input Size | History Frames | Epoch | RayIoU (%) ↑ | mAVE ↓ | FPS ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Training GPU | Testing GPU
OccNeRF-C [24]CR101900 × 1600--21.61.53-----
OccNeRF-L [24]LR101900 × 1600--31.71.59-----
RenderOcc [22]LR101900 × 16061236.71.63-----
Let Occ Flow [50]C+LR101512 × 140821640.51.45-----
OccNet [41]3DR101900 × 16032439.71.61-----
BEVFormer [2]3DR50900 × 16032428.11.123.026.06.7A100A100
FB-Occ [25]3DR50256 × 704169032.30.8310.311.15.5A100A100
SparseOcc [42]3DR50256 × 70484833.40.8717.115.85.4A100RTX 4090
STCOcc [29]3DR50256 × 704164840.80.444.710.05.6RTX 4090RTX 4090
HBEVOcc (ours)3DR50256 × 70412439.40.528.27.33.0RTX 2080TiRTX 4090
HBEVOcc (ours)3DR50256 × 70482440.80.415.47.53.1RTX 2080TiRTX 4090
HBEVOcc (ours)3DR50256 × 70484841.40.395.47.53.1RTX 2080TiRTX 4090
Table 4. mIoU of HBEVOcc for ablation study on Occ3D. The best results are marked in bold.
Baseline | EVT | IVT | HADA | HAVL | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs
34.344.82.344.9253.1
35.025.12.350.1259.1
34.434.82.344.9253.1
35.135.12.350.1259.1
34.705.02.345.8280.3
34.404.12.228.1148.7
35.685.32.550.7384.5
36.766.32.656.2393.6
36.936.32.656.2393.6
Table 5. RayIoU of HBEVOcc for ablation study on Occ3D. The best results are marked in bold.
Baseline | EVT | IVT | HADA | HAVL | RayIoU (%) ↑ | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs
32.1225.514.82.344.9253.1
32.2825.905.12.350.1259.1
32.2527.274.82.344.9253.1
33.1127.815.12.350.1259.1
32.5825.975.02.345.8280.3
32.5126.244.12.228.1148.7
33.6326.975.32.550.7384.5
34.0127.436.32.656.2393.6
34.3029.136.32.656.2393.6
Table 6. RayIoU of HBEVOcc for ablation study on OpenOcc. The best results are marked in bold.
Baseline | EVT | IVT | HADA | HAVL | RayIoU (%) ↑ | mAVE ↓ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs
31.951.204.82.345.1261.6
32.011.125.12.350.3267.7
32.831.814.82.345.1261.6
32.921.755.12.350.3267.7
32.161.025.02.346.0285.9
32.091.374.12.228.2154.1
33.191.145.32.551.0395.1
33.671.016.32.656.5404.2
34.271.036.32.656.5404.2
Table 7. Different fusion methods of EVT and IVT. The best results are marked in bold.
Fusion Method | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs
Concat | 35.68 | 5.3 | 2.5 | 50.7 | 384.5
Add | 35.42 | 5.2 | 2.3 | 47.7 | 297.6
Gated Fusion | 35.58 | 5.4 | 2.4 | 48.1 | 308.8
Table 8. Effect of different numbers of horizontal and height points. The best results are marked in bold.
History Frames | Horizontal Points | Height Points | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs
0 | 2 | 1 | 36.67 | 6.2 | 2.6 | 55.6 | 392.3
0 | 2 | 2 | 36.68 | 6.2 | 2.6 | 56.2 | 393.3
0 | 4 | 1 | 36.70 | 6.3 | 2.6 | 55.6 | 392.6
0 | 4 | 2 | 36.76 | 6.3 | 2.6 | 56.2 | 393.6
1 | 2 | 1 | 41.36 | 7.2 | 3.0 | 73.4 | 640.8
1 | 2 | 2 | 41.63 | 7.2 | 3.0 | 74.3 | 642.3
1 | 4 | 1 | 41.56 | 7.3 | 3.0 | 73.4 | 641.2
1 | 4 | 2 | 41.64 | 7.3 | 3.0 | 74.3 | 642.7
Table 9. Effect of different heights Z_e in different history frames on Occ3D. The best results are marked in bold.
History FramesHeight Z e mIoU (%) ↑Training Mem (G) ↓Inference Mem (G) ↓Params (M)GFLOPs
136.786.22.655.6370.0
836.936.32.656.2393.6
1141.487.33.073.5578.2
841.847.33.074.3642.7
4142.947.33.174.1946.8
842.217.53.175.01131.2
8143.987.53.175.01450.7
842.587.63.175.91782.7
Table 10. Effect of different heights in height-aware voxel loss. The best results are marked in bold.

| HAVL Height | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| 2 | 35.80 | 5.3 | 2.5 | 50.7 | 384.5 |
| 4 | 35.86 | 5.3 | 2.5 | 50.7 | 384.5 |
| 8 | 35.82 | 5.3 | 2.5 | 50.7 | 384.5 |
| 16 | **36.00** | 5.3 | 2.5 | 50.7 | 384.5 |
Table 11. Effect of different number of sampled positions in height-aware voxel loss. The best results are marked in bold.

| Sampled Positions | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| 2000 | 35.87 | 5.3 | 2.5 | 50.7 | 384.5 |
| 4000 | **36.00** | 5.3 | 2.5 | 50.7 | 384.5 |
| 20,000 | 35.71 | 5.3 | 2.5 | 50.7 | 384.5 |
| 40,000 | 35.81 | 5.3 | 2.5 | 50.7 | 384.5 |
Table 12. Effect of different weight strategies in height-aware voxel loss. The best results are marked in bold.

| HAVL | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| – | 35.68 | 5.3 | 2.5 | 50.7 | 384.5 |
| Fixed Weight w_z = 1 | 35.89 | 5.3 | 2.5 | 50.7 | 384.5 |
| Fixed Weight w_z = 2 | 35.80 | 5.3 | 2.5 | 50.7 | 384.5 |
| Fixed Weight w_z = 3 | 35.84 | 5.3 | 2.5 | 50.7 | 384.5 |
| Height-aware Weight | **36.00** | 5.3 | 2.5 | 50.7 | 384.5 |
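For intuition on the fixed-weight versus height-aware comparison in Table 12, the sketch below shows a height-weighted voxel cross-entropy in NumPy. It is a minimal illustration under assumptions, not the paper's exact formulation: the linear weighting ramp and the names `height_aware_weights` and `weighted_voxel_ce` are hypothetical, and a fixed weight w_z corresponds to passing a constant vector.

```python
import numpy as np

def height_aware_weights(num_z, w_min=1.0, w_max=2.0):
    # Illustrative linear ramp along the height axis: slices farther from
    # the BEV plane contribute more to the loss (assumed scheme).
    return np.linspace(w_min, w_max, num_z)

def weighted_voxel_ce(logits, labels, z_weights):
    # logits: (X, Y, Z, C) class scores; labels: (X, Y, Z) int class ids;
    # z_weights: (Z,) per-height-slice loss weights.
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    # probability assigned to the ground-truth class of each voxel
    p_true = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
    ce = -np.log(p_true + 1e-9)                            # per-voxel CE, (X, Y, Z)
    return float((ce * z_weights[None, None, :]).mean())
```

With `z_weights = np.ones(Z)` this reduces to the plain mean cross-entropy (the "Fixed Weight w_z = 1" row), so the height-aware variant differs only in how much each vertical slice contributes to the supervision.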
Table 13. Effect of different numbers of history frames on Occ3D and OpenOcc. The best results are marked in bold.

| Dataset | History Frames | RayIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|
| Occ3D-nuScenes | 1 | 39.21 | 7.3 | 3.0 | 74.3 | 642.7 |
| Occ3D-nuScenes | 4 | 40.50 | 7.3 | 3.1 | 74.1 | 946.8 |
| Occ3D-nuScenes | 8 | **41.05** | 7.5 | 3.1 | 75.0 | 1450.7 |
| OpenOcc | 1 | 39.43 | 7.3 | 3.0 | 74.6 | 653.3 |
| OpenOcc | 4 | 40.02 | 7.3 | 3.1 | 74.3 | 957.4 |
| OpenOcc | 8 | **40.78** | 7.5 | 3.1 | 75.3 | 1461.3 |
Table 14. Effect of different heights in height-aware deformable attention. The best results are marked in bold.

| HADA Height | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| 2 | 36.66 | 6.3 | 2.6 | 56.8 | 395.6 |
| 4 | 36.59 | 6.3 | 2.6 | 56.4 | 394.2 |
| 8 | **36.76** | 6.3 | 2.6 | 56.2 | 393.6 |
| 16 | 36.70 | 6.3 | 2.6 | 56.1 | 393.2 |
Table 15. Effectiveness of our proposed HADA with different methods. Here, mIoU* denotes the performance of the original methods; to ensure the fairness of the experiment, we retrain these methods to obtain the mIoU.

| Methods | Representation | History Frames | mIoU* ↑ | mIoU ↑ | mIoU ↑ (+HADA) | ΔmIoU ↑ | ΔMem (G) ↓ |
|---|---|---|---|---|---|---|---|
| DHD-S [36] | BEV | – | 36.50 | 36.51 | 36.99 | +0.48 | +0.94 |
| DHD-M [36] | BEV | 1 | 41.49 | 40.74 | 41.36 | +0.62 | +1.09 |
| FlashOcc:M2 [20] | BEV | – | 32.08 | 32.62 | 33.54 | +0.92 | +0.29 |
| FlashOcc-4D-Stereo:M2 [20] | BEV | 1 | 37.84 | 38.80 | 39.73 | +0.93 | +0.31 |
| BEVDet4D [4] | Voxel | 1 | 36.01 | 37.40 | 38.35 | +0.95 | +0.98 |
| FBOcc [25] | BEV and Voxel | 16 | 39.11 | 40.21 | 40.65 | +0.44 | +0.66 |

Share and Cite

MDPI and ACS Style

Lyu, C.; Li, W.; Liao, I.Y.; Ding, F.; Liu, H.; Zhou, H. HBEVOcc: Height-Aware Bird’s-Eye-View Representation for 3D Occupancy Prediction from Multi-Camera Images. Sensors 2026, 26, 934. https://doi.org/10.3390/s26030934


