1. Introduction
Object detection in visual data has conventionally been formulated as the regression of a bounding box around each object, typically predicting the object’s center and size in the image plane. However, this deterministic bounding-box paradigm fails to capture the inherent uncertainty in object appearance, geometric projection, and the three-dimensional structure of the scene.
Recently, several works have reformulated detection as a density estimation problem, modeling each object as a Gaussian (or mixture) distribution over the image or 3D space rather than as a fixed box [1,2]. Modeling the object region with a Gaussian allows a more flexible representation of uncertainty and can improve alignment with evaluation metrics such as IoU or generalized divergence losses. In parallel, RGB-D sensing (i.e., combining color images with depth channels) has emerged as a powerful modality for 3D object detection, providing richer geometric cues than RGB alone. Standard approaches in the RGB-D domain often process the RGB and depth streams separately and fuse them only at a later stage; alternatively, some methods extract 2D image proposals first and then refine them using depth information [3,4]. While effective, these strategies incur a modality gap and require separate processing steps.
In this paper, we propose a unified framework for direct 3D object detection from the RGB-D input by modeling each object as a 3D Gaussian distribution characterized by its mean and covariance.
Our network takes a four-channel RGB-D tensor as input and directly predicts Gaussian parameters for each object, representing both spatial location and geometric uncertainty. Importantly, the detector is built upon a vision transformer backbone that is modified to support four-channel inputs through an extended patch embedding layer. This design enables joint modeling of appearance and geometry from the earliest stages of feature extraction while leveraging pretrained ImageNet representations.
Key advantages of our approach include the following: (i) representing objects with a probabilistic formulation that naturally encodes localization uncertainty and shape anisotropy; (ii) extending the Gaussian-based detection paradigm into true 3D space using RGB-D input; (iii) integrating RGB and depth modalities from the outset via early fusion within a transformer architecture, thereby reducing the domain gap and enabling more coherent feature learning; and (iv) benefiting from the global receptive field of transformers, which facilitates the long-range reasoning needed to estimate 3D Gaussian covariances that span large spatial extents.
We evaluate our proposed method on the SUN RGB-D benchmark and demonstrate that it achieves competitive 3D detection accuracy compared to state-of-the-art point-cloud methods while providing uncertainty-aware predictions. Our contributions are summarized as follows:
We propose a direct 3D object detection framework from RGB-D input that predicts 3D Gaussian distribution parameters (mean and covariance) for each object, replacing traditional box regression.
We design a transformer-based RGB-D backbone with a four-channel patch embedding that processes RGB and depth channels in a unified tensor, thereby eliminating late fusion and reducing modality gap.
We introduce a Cholesky-based covariance parameterization and derive the geometric transformations required to lift image-aligned Gaussian predictions to metrically interpretable 3D ellipsoids.
We demonstrate through experiments that our probabilistic Gaussian formulation yields more physically interpretable detections and improved performance over conventional bounding-box and fusion-based RGB-D detectors.
2. Related Work
We review three strands that directly motivate our approach: (1) Gaussian and probabilistic representations for object detection, (2) RGB-D and 3D object detection, and (3) fusion strategies for RGB and depth modalities.
2.1. Probabilistic and Gaussian Representations for Object Detection
Recent research has challenged the canonical bounding-box parameterization by modeling object regions as probability distributions. Early works explored fuzzy object region parameterizations and probabilistic IoU proxies based on distributional distances. Llerena et al. introduced Gaussian Bounding Boxes and a Probabilistic IoU notion for fuzzy object regions, demonstrating that distributional similarity (e.g., Hellinger distance) can serve as a viable box alternative [2]. Yang et al. developed a more complete framework that models rotated 2D/3D boxes as Gaussians and uses between-distribution metrics such as the Gaussian Wasserstein Distance (GWD), KL divergence, and Bhattacharyya distance as regression losses [1,5]. They demonstrated that many recurring problems in rotation detection (boundary discontinuity, square-like ambiguity) can be mitigated under Gaussian modeling and extended the formulation to 3D. The G-Rep work provides a unified Gaussian construction to represent oriented boxes, quads, and point sets, enabling a single Gaussian representation across different geometric primitives [6].
These approaches bring several benefits. First, distributional metrics align training objectives more closely with geometric overlap metrics (IoU variants) while remaining differentiable. Second, Gaussian parameterization naturally encodes uncertainty (via covariance) and anisotropy (via eigenstructure), which is useful for high-precision localization and downstream tasks that exploit uncertainty. Third, Gaussian modeling often removes angle periodicity and edge-exchange ambiguities that plague angle-regression schemes for rotated boxes. However, most prior Gaussian works focus on 2D detection or oriented bounding box (OBB) representations and do not directly address the RGB-D to 3D problem using a single fused input tensor.
Beyond Gaussian parameterizations for rotated bounding boxes, probabilistic modeling has also been explored from a Bayesian and uncertainty-aware perspective. Kendall and Gal demonstrated that aleatoric uncertainty can be learned implicitly through probabilistic regression losses, motivating distribution-valued predictions in deep networks. Subsequent works incorporated uncertainty estimation into detection and localization tasks, showing improved robustness under noisy observations and ambiguous geometries [7,8].
In the context of object detection, several studies have investigated probabilistic bounding boxes and uncertainty-aware scoring mechanisms. He et al. proposed uncertainty-guided non-maximum suppression and confidence calibration strategies, highlighting that explicit uncertainty modeling improves ranking and post-processing reliability [9]. These lines of work further support the view that modeling objects as distributions rather than deterministic boxes provides a principled framework for uncertainty-aware perception.
2.2. RGB-D and 3D Object Detection
The community has developed many approaches for 3D detection using depth sensors (RGB-D) or LiDAR. Two historical threads are notable.
Early studies such as Sliding Shapes and follow-ups like Deep Sliding Shapes showed that depth/volumetric representations enable direct 3D box estimation by operating on 3D volumes or truncated signed distance fields (TSDFs) constructed from depth maps [3,10]. These volumetric pipelines demonstrated how geometry alone greatly simplifies certain detection challenges (occlusion, viewpoint) compared to pure RGB pipelines. Many subsequent RGB-D works adopted 3D ConvNets or hybrid voxel/TSDF encoders for amodal 3D bounding box estimation.
In parallel, point-cloud methods such as VoxelNet [11], PointPillars [12], and PointRCNN [13] established powerful encoders and training paradigms for 3D detection from LiDAR. These methods emphasize structured representations (voxels, pillars, or point sets) and demonstrate how spatially aware backbones boost localization accuracy in metric space. More recently, VoteNet [14] introduced deep Hough voting for indoor 3D detection, achieving strong results on SUN RGB-D and ScanNet using only point cloud input. H3DNet [15] further improved performance by predicting a hybrid set of geometric primitives (centers, face centers, and edge centers). ImVoteNet [4] extended VoteNet by incorporating image votes from 2D detectors, effectively fusing RGB and point cloud modalities. FCAF3D [16] proposed an anchor-free, fully convolutional approach using sparse convolutions, achieving state-of-the-art results on indoor benchmarks. Group-Free 3D [17] introduced a transformer-based detection head that directly learns object features from point clouds without explicit grouping.
Although LiDAR and RGB-D are different sensors, the core insight is shared: direct 3D geometric representations help produce metric 3D boxes and support uncertainty estimation in world coordinates. Despite those successes, a large fraction of RGB-D detectors still adopt hybrid strategies: (i) 2D proposal generation from RGB followed by depth-based refinement (lifting), or (ii) separate RGB and depth encoding followed by late fusion. Very few works directly estimate probabilistic 3D object parameters (mean and covariance) from a single fused RGB-D tensor in an end-to-end manner; this is the gap our paper addresses.
More recently, transformer-based architectures have been introduced for 3D object detection, leveraging global context modeling to replace hand-crafted grouping or voting heuristics. Works such as 3DETR formulate 3D detection as a set prediction problem using transformers, demonstrating strong performance without explicit anchors or proposals [18]. Related efforts further extend transformer-based reasoning to indoor scenes and multi-modal inputs, suggesting that global attention is particularly beneficial for capturing long-range geometric dependencies in 3D space.
2.3. Fusion Strategies for RGB and Depth Modalities
Integrating RGB and depth modalities has been studied extensively. Fusion strategies generally fall into (a) early fusion (concatenating depth as an extra channel), (b) middle fusion (cross-modal attention or feature fusion at intermediate layers), and (c) late fusion (separate backbones with combined scores or proposals). Empirical studies highlight that the optimal fusion strategy depends on the architecture and task. For classification, middle or late fusion is often effective; in contrast, for dense geometric tasks, early fusion is preferred because it allows the network to jointly learn geometric-appearance correlations from the outset [19,20]. Recent transformer-based studies empirically compare early versus late fusion regimes and show that aligning modality representations matters greatly for final performance [19].
In object detection, late-fusion pipelines commonly extract 2D RGB proposals and then exploit the point cloud or depth within ROIs to lift 2D boxes to 3D. While pragmatic, this two-stage processing can suffer from modality gaps and limits end-to-end uncertainty propagation. Our approach adopts an early, fused representation (an RGB-D tensor) but differs from naive early fusion by predicting 3D Gaussian parameters directly, instead of regressing 3D box corners or lifting 2D boxes. This enables a single forward pass to produce both metric centers and covariances, simplifying uncertainty quantification and downstream integration.
From a representation learning perspective, early fusion can be interpreted as learning a joint appearance–geometry embedding, whereas late fusion enforces a factorized representation that may hinder uncertainty propagation across modalities. Recent analyses indicate that early fusion is particularly advantageous when the prediction target is geometrically structured, such as depth, surface normals, or 3D object parameters [21,22]. Our approach aligns with this view by directly predicting metric 3D Gaussian parameters from a unified RGB-D tensor.
2.4. Label Assignment and Loss Design
Practical advances like Adaptive Training Sample Selection (ATSS) showed that how positives and negatives are chosen (anchor/point assignment) significantly affects detector calibration and training stability [23]. Gaussian-based losses can be paired with distribution-aware assignment strategies (e.g., metric thresholding in distribution space) to obtain sample assignments consistent with the regression objective. We build on these insights in our training design.
4. Covariance Parameterization and Geometric Transformation
Section 3 introduced the probabilistic formulation in the 3D camera frame. We now describe how the network parameterizes Gaussian ellipsoids in an image-aligned frame, and how these ellipsoids are transformed into metrically interpretable ellipsoids in the camera frame through perspective geometry. We follow the coordinate convention illustrated in Figure 1.
4.1. Cholesky-Based Covariance in the Image-Aligned Frame
The network first predicts ellipsoids in the augmented pixel coordinate frame, whose axes are aligned with the two image coordinates and a third coordinate w oriented along the optical axis. In this frame, the covariance is parameterized via a Cholesky factorization (Equation (9)).
Ensuring positive definiteness of predicted covariance matrices is critical for stable optimization. Cholesky-based parameterizations are commonly adopted in probabilistic deep learning to guarantee valid covariance estimates while maintaining differentiability [26]. Our formulation follows this principle, enabling unconstrained network outputs to be mapped to valid 3D ellipsoids.
With a lower-triangular Cholesky factor, the parameterization in Equation (9) spans all 3×3 symmetric positive definite matrices and is differentiable with respect to its six scalar parameters. The network outputs six raw values that are mapped to these parameters through Equations (10) and (11), using a sigmoid function and characteristic length scales in the three directions.
The exponential mapping in Equation (10) ensures the principal scales are strictly positive, and the sigmoid normalization in Equation (11) keeps the angular parameters within a bounded interval, avoiding discontinuities in the orientation. As a result, every network output corresponds to a valid ellipsoid with a well-conditioned covariance.
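To make this concrete, the following minimal NumPy sketch maps six unconstrained network outputs to a valid covariance through a lower-triangular factor. It is an illustrative approximation of Equations (9)–(11): the exponential handles the diagonal scales, while bounded off-diagonal terms stand in for the angular parameters; the function name and the characteristic length scales are assumptions, not the exact formulation in the paper.

```python
import numpy as np

def raw_to_covariance(raw, scales=(1.0, 1.0, 1.0)):
    """Map six unconstrained outputs to a valid 3x3 covariance.

    Sketch: the first three values parameterize the diagonal of a
    lower-triangular Cholesky factor through an exponential (strictly
    positive), and the last three parameterize bounded off-diagonal entries
    through a scaled sigmoid. The length scales are illustrative.
    """
    raw = np.asarray(raw, dtype=np.float64)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Strictly positive diagonal entries (principal scales).
    d = np.exp(raw[:3]) * np.asarray(scales, dtype=np.float64)
    # Bounded off-diagonal entries, scaled so the factor stays well conditioned.
    o = (2.0 * sigmoid(raw[3:]) - 1.0) * np.sqrt(d[[1, 2, 2]] * d[[0, 0, 1]])

    L = np.array([[d[0], 0.0,  0.0],
                  [o[0], d[1], 0.0],
                  [o[1], o[2], d[2]]])
    return L @ L.T  # symmetric positive definite by construction

cov = raw_to_covariance(np.random.randn(6))
assert np.all(np.linalg.eigvalsh(cov) > 0)
```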
4.2. Perspective Projection Matrix and Notation
We explicitly decompose the perspective projection matrix into its intrinsic and translation components. In homogeneous coordinates, the mapping from a 3D point in the camera frame to image-plane coordinates in the image-aligned frame is given by Equation (12). The intrinsic parameters comprise the number of pixels per unit length in the horizontal and vertical directions, the focal length f, and the principal point offset in pixels. The projection matrix is partitioned into a 3×3 intrinsic submatrix and a translation column.
4.3. Inverse Mapping for the Ellipsoid Center
The network predicts a 2D center and a depth value z in the image-aligned frame, which must be lifted back to a 3D point in the camera frame. Depth is parameterized via an exponential mapping of the raw network output, scaled by a reference depth. The exponential mapping guarantees positive metric depth and improves numerical stability by compressing the dynamic range of raw outputs.
Given the predicted center and depth, the 3D center of the ellipsoid in the camera frame is obtained by inverting Equation (12), using the inverse of the 3×3 intrinsic submatrix of the projection matrix. The resulting Equation (14) is equivalent to classical back-projection with the camera intrinsics, but it is written in a homogeneous linear-algebra form that fits the ellipsoid transformation.
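As a concrete illustration of this lifting step, the sketch below back-projects a predicted pixel center and a raw depth output into a 3D point using the inverse intrinsic matrix. The intrinsic values and the reference depth scale z_ref are placeholder assumptions; the exponential depth mapping mirrors the positive-depth parameterization described above.

```python
import numpy as np

def backproject_center(uv, raw_depth, K, z_ref=2.0):
    """Lift a predicted 2D center and raw depth output to a 3D point.

    Minimal sketch of pinhole back-projection: the exponential depth mapping
    z = z_ref * exp(raw_depth) guarantees positive metric depth (z_ref is an
    assumed reference scale), and the inverse intrinsics give the viewing ray.
    """
    z = z_ref * np.exp(raw_depth)          # strictly positive metric depth
    uv_h = np.array([uv[0], uv[1], 1.0])   # homogeneous pixel coordinates
    ray = np.linalg.inv(K) @ uv_h          # direction through the pixel
    return z * ray                         # 3D center in the camera frame

# Placeholder intrinsics (focal lengths in pixels, principal point).
K = np.array([[525.0,   0.0, 320.0],
              [  0.0, 525.0, 240.0],
              [  0.0,   0.0,   1.0]])
center_3d = backproject_center((350.0, 260.0), raw_depth=0.1, K=K)
```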
4.4. Depth-Dependent Scaling of Covariance
Perspective projection shrinks apparent object size proportionally to depth. Conversely, when lifting a covariance from the image-aligned frame back to the metric camera frame, the covariance must be scaled up according to depth. We capture this depth-dependent scaling with a diagonal matrix.
The first two entries correspond to the scaling factors in the horizontal and vertical directions, obtained from the focal length and pixel density. The third entry uses the geometric mean of the lateral scalings, which yields a depth scaling that is consistent with the isotropic shrinkage implied by perspective projection. We emphasize that this formulation is a geometrically motivated approximation under the pinhole camera model, rather than an exact probabilistic projection of a 3D Gaussian through perspective, which is known to yield a non-Gaussian distribution. In practice, this approximation provides a useful inductive bias and leads to stable and metrically consistent ellipsoid predictions.
4.5. Rotation Alignment with the Viewing Direction
The ellipsoid predicted in the image-aligned frame is aligned with that frame's axes, where the w-axis is orthogonal to the image plane. In the camera frame, we align the ellipsoid such that its main axis is oriented along the direction from the camera center to the ellipsoid center.
Let one unit vector denote the optical axis in the camera frame and another the direction from the camera center to the ellipsoid center. The unit rotation axis is obtained from their cross product and the rotation angle from their inner product, and the corresponding rotation matrix follows from Rodrigues' formula, expressed through the skew-symmetric matrix of the rotation axis. This alignment is physically motivated by the observation that uncertainty in RGB-D perception is typically elongated along the viewing ray, particularly due to depth noise and occlusion. Empirically, removing this alignment led to unstable training and implausible covariance orientations in preliminary experiments, supporting its practical necessity.
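A minimal sketch of this axis-angle alignment is given below; the optical axis is assumed to be the camera +z direction, and the function name is illustrative.

```python
import numpy as np

def align_to_viewing_ray(center_3d):
    """Rotation aligning the optical axis with the ray to the ellipsoid center.

    Sketch of the axis-angle construction: the rotation axis is the normalized
    cross product of the optical axis and the unit viewing direction, the angle
    is their arccosine, and the matrix follows from Rodrigues' formula via the
    skew-symmetric matrix of the axis.
    """
    e_z = np.array([0.0, 0.0, 1.0])                  # optical axis (assumed +z)
    d = np.asarray(center_3d, dtype=np.float64)
    d = d / np.linalg.norm(d)                        # unit viewing direction
    axis = np.cross(e_z, d)
    s = np.linalg.norm(axis)
    c = np.clip(np.dot(e_z, d), -1.0, 1.0)
    if s < 1e-8:                                     # parallel or anti-parallel
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    axis = axis / s
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])         # skew-symmetric [axis]_x
    theta = np.arccos(c)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

R = align_to_viewing_ray(np.array([0.4, -0.2, 2.5]))
```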
4.6. Final Covariance in the Camera Frame
Combining the Cholesky factorization, scaling, and rotation, the covariance of the ellipsoid in the camera frame is given by Equation (18). This covariance is used in the distributional losses in Section 3 and ensures that the learned covariance directly corresponds to metric ellipsoids in 3D.
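The following sketch assembles a camera-frame covariance from the three ingredients above. It is one plausible composition consistent with the description, not necessarily the exact form of Equation (18); the focal length and pixel-density values are assumptions.

```python
import numpy as np

def camera_frame_covariance(L_chol, z, f=525.0, k_u=1.0, k_v=1.0, R=np.eye(3)):
    """Assemble an ellipsoid covariance in the camera frame (hedged sketch).

    Combines the Cholesky factor predicted in the image-aligned frame, a
    diagonal depth-dependent scaling (lateral factors from focal length and
    pixel density, third entry their geometric mean), and the viewing-ray
    rotation. The exact composition in Equation (18) may differ.
    """
    s_u = z / (f * k_u)                       # horizontal metric scale per pixel
    s_v = z / (f * k_v)                       # vertical metric scale per pixel
    S = np.diag([s_u, s_v, np.sqrt(s_u * s_v)])
    sigma_img = L_chol @ L_chol.T             # covariance in the image-aligned frame
    return R @ S @ sigma_img @ S.T @ R.T      # scaled and rotated into the camera frame

L_chol = np.diag([20.0, 10.0, 15.0])          # toy lower-triangular factor (pixels)
cov_cam = camera_frame_covariance(L_chol, z=2.5)
```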
5. RGB-D Transformer Network Architecture
This section describes the concrete network architecture that instantiates the probabilistic formulation developed in Section 3 and Section 4. As illustrated in Figure 2, the detector follows a dense, single-stage design that takes an RGB-D image as input and outputs, for each feature-grid location, the parameters of a 3D Gaussian distribution together with objectness and class scores.
5.1. RGB-D Input Construction
Given an RGB image and an aligned depth map, we construct a four-channel RGB-D tensor whose depth channel is obtained by clipping raw depth values to a maximum range (8 m in all experiments) and linearly rescaling them to [0, 1]. The RGB channels are normalized using the standard ImageNet mean and variance, while the depth channel is normalized with its own statistics.
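A minimal sketch of this input construction, assuming depth given in meters and placeholder depth-normalization statistics, is shown below.

```python
import numpy as np

def build_rgbd_tensor(rgb, depth_m, max_depth=8.0,
                      rgb_mean=(0.485, 0.456, 0.406),
                      rgb_std=(0.229, 0.224, 0.225),
                      depth_mean=0.5, depth_std=0.25):
    """Build a normalized four-channel RGB-D tensor of shape (H, W, 4).

    Sketch: depth is clipped to max_depth (8 m in the experiments) and rescaled
    to [0, 1]; RGB uses the standard ImageNet statistics, while the depth
    statistics here are placeholder values.
    """
    rgb = (rgb.astype(np.float32) / 255.0 - np.array(rgb_mean)) / np.array(rgb_std)
    depth = np.clip(depth_m, 0.0, max_depth) / max_depth
    depth = (depth - depth_mean) / depth_std
    return np.concatenate([rgb, depth[..., None]], axis=-1).astype(np.float32)

x = build_rgbd_tensor(np.zeros((480, 640, 3), dtype=np.uint8),
                      np.full((480, 640), 2.0, dtype=np.float32))
```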
5.2. Transformer Backbone with Four-Channel Patch Embedding
We adopt a vision transformer backbone, instantiated as either a ViT-Small or Swin-Tiny model, to encode the RGB-D tensor into a sequence of tokens. The backbone is modified to accept four-channel inputs by extending the patch embedding layer.
Let the initial convolutional projection use a P×P kernel with stride P. We replace it with a four-channel patch embedding that maps the RGB-D tensor to a sequence of tokens, one per patch. The weight tensor of the new embedding is initialized by copying the pretrained RGB filters and setting the depth filter to their channel-wise mean. This “mean depth” initialization preserves the behavior of the pretrained ImageNet backbone on RGB while introducing a sensible starting point for the depth channel.
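The sketch below illustrates the “mean depth” initialization for a ViT-style patch embedding in PyTorch; the token dimension and patch size in the example are assumed values, and the helper name is illustrative.

```python
import torch
import torch.nn as nn

def extend_patch_embed_to_rgbd(rgb_patch_embed: nn.Conv2d) -> nn.Conv2d:
    """Extend a pretrained 3-channel patch embedding to four channels.

    Sketch of the "mean depth" initialization: RGB filters are copied
    unchanged and the new depth filter is set to their channel-wise mean,
    preserving the pretrained backbone's behavior on RGB.
    """
    out_ch, _, kh, kw = rgb_patch_embed.weight.shape
    rgbd_embed = nn.Conv2d(4, out_ch, kernel_size=(kh, kw),
                           stride=rgb_patch_embed.stride,
                           bias=rgb_patch_embed.bias is not None)
    with torch.no_grad():
        rgbd_embed.weight[:, :3] = rgb_patch_embed.weight              # copy RGB filters
        rgbd_embed.weight[:, 3:] = rgb_patch_embed.weight.mean(dim=1, keepdim=True)
        if rgb_patch_embed.bias is not None:
            rgbd_embed.bias.copy_(rgb_patch_embed.bias)
    return rgbd_embed

# Example: a ViT-style 16x16 patch embedding with 384-dim tokens (assumed sizes).
pretrained = nn.Conv2d(3, 384, kernel_size=16, stride=16)
rgbd = extend_patch_embed_to_rgbd(pretrained)
```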
After adding positional encodings, the token sequence is processed by a stack of L transformer encoder blocks. The final tokens form a dense RGB-D representation on a coarse spatial grid.
5.3. Feature Refinement Neck
The feature refinement neck serves as the bridge between the transformer encoder and the Gaussian-based 3D detection head. While Section 3 and Section 4 introduced the theoretical and geometric formulation of the Gaussian parameterization, the role of the neck is to reshape globally contextualized transformer tokens into spatially structured feature maps that enable dense prediction of 3D Gaussian distributions. As shown in Figure 3, the feature refinement neck consists of two main components: a module that converts tokens into grid-structured features and a module that refines the features in a multi-scale manner.
Token-to-grid restoration. Given the final-layer token sequence, we reshape it into a 2D spatial grid to recover a feature representation with explicit geometric structure. This grid acts as the highest-resolution feature map.
Multi-scale pyramid construction. To support objects of different physical sizes and depth-dependent scale variations, we construct a shallow 2D feature pyramid from the restored grid.
Unlike traditional FPNs, which are typically paired with bounding-box regression, our pyramid is specifically designed to provide multi-scale geometric cues for Gaussian parameter regression. Since Gaussian ellipsoids encode shape anisotropy and uncertainty, feature responses benefit significantly from hierarchical context (global semantics from transformers and localized spatial structure from convolutional refinement).
Importantly, the Gaussian representation is continuous and differentiable with respect to its mean and covariance. Therefore, producing stable predictions requires features that exhibit both spatial smoothness and global consistency. The refinement neck fulfills this by injecting local convolutional inductive biases on top of globally aggregated transformer features.
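A simplified PyTorch sketch of the neck is given below; the channel widths, number of pyramid levels, and module structure are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class FeatureRefinementNeck(nn.Module):
    """Reshape transformer tokens into a grid and build a small feature pyramid.

    Sketch of the neck in Section 5.3: tokens are restored to a 2D grid, then
    strided convolutions produce two coarser levels. Channel widths and the
    number of levels are assumptions.
    """

    def __init__(self, dim=384, out_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(dim, out_dim, kernel_size=1)
        self.down1 = nn.Sequential(nn.Conv2d(out_dim, out_dim, 3, stride=2, padding=1),
                                   nn.BatchNorm2d(out_dim), nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(nn.Conv2d(out_dim, out_dim, 3, stride=2, padding=1),
                                   nn.BatchNorm2d(out_dim), nn.ReLU(inplace=True))

    def forward(self, tokens, grid_hw):
        b, n, c = tokens.shape
        h, w = grid_hw
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)   # token-to-grid restoration
        p0 = self.proj(grid)                                 # highest-resolution level
        p1 = self.down1(p0)
        p2 = self.down2(p1)
        return [p0, p1, p2]

neck = FeatureRefinementNeck()
levels = neck(torch.randn(2, 30 * 40, 384), grid_hw=(30, 40))
```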
5.4. Dense Gaussian 3D Detection Head
The dense detection head operates on each pyramid level and predicts, at every spatial location, the parameters of a full 3D Gaussian distribution. Each detection head consists of a lightweight convolutional tower with four Conv–BN–ReLU layers, followed by one output layer producing the 3D mean parameters, the covariance parameters via the Cholesky factor, an objectness score, and the class logits for the object categories.
The objectness (confidence) score is computed with a sigmoid, and class probabilities are obtained via a softmax over the subsequent channels. Together with the Gaussian parameters, these outputs define the mixture in Equation (2) for each input frame.
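The following PyTorch sketch illustrates the structure of the dense head; the channel count and output ordering are assumptions, and the class count of 10 matches the SUN RGB-D setting used in the experiments.

```python
import torch
import torch.nn as nn

class GaussianDetectionHead(nn.Module):
    """Dense head predicting Gaussian parameters, objectness, and classes.

    Sketch of the head in Section 5.4: a four-layer Conv-BN-ReLU tower followed
    by a 1x1 output layer producing, per location, 3 mean parameters, 6 Cholesky
    parameters, 1 objectness logit, and C class logits.
    """

    def __init__(self, in_ch=256, num_classes=10):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(in_ch, in_ch, 3, padding=1),
                       nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True)]
        self.tower = nn.Sequential(*layers)
        self.out = nn.Conv2d(in_ch, 3 + 6 + 1 + num_classes, kernel_size=1)
        self.num_classes = num_classes

    def forward(self, feat):
        x = self.out(self.tower(feat))
        mean_raw, chol_raw, obj_logit, cls_logit = torch.split(
            x, [3, 6, 1, self.num_classes], dim=1)
        objectness = torch.sigmoid(obj_logit)            # confidence score
        class_probs = torch.softmax(cls_logit, dim=1)    # per-category probabilities
        return mean_raw, chol_raw, objectness, class_probs

head = GaussianDetectionHead()
outputs = head(torch.randn(2, 256, 30, 40))
```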
6. Experiments
We evaluate the proposed transformer-based probabilistic detector on the SUN RGB-D benchmark and compare it against state-of-the-art methods from both point-cloud-based and RGB-D-based detection paradigms.
6.1. Experimental Setting
SUN RGB-D [27,28,29,30] is a widely used indoor RGB-D benchmark containing 10,335 RGB-D images collected with four depth sensors (Intel RealSense, Asus Xtion, Kinect v1, and Kinect v2) across diverse indoor environments including bedrooms, living rooms, offices, and classrooms. We follow the official split, using 5285 images for training and 5050 for testing over 10 object categories. The benchmark presents several challenges including cluttered scenes, significant inter-class variation in object sizes, partial occlusions, and depth noise from consumer-grade sensors.
Ground-truth oriented cuboids are converted into Gaussian ellipsoids in the camera frame using the formulation of Section 3. The cuboid center becomes the Gaussian mean, and the covariance is derived from the cuboid dimensions and yaw angle following Gaussian-based 3D detection works [1,6]. Specifically, the covariance eigenvalues are set proportional to the squared half-dimensions of the cuboid, and the eigenvectors are aligned with the cuboid axes.
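A minimal sketch of this ground-truth conversion is shown below; the proportionality constant and the choice of the vertical axis for the yaw rotation are assumptions.

```python
import numpy as np

def cuboid_to_gaussian(center, dims, yaw, scale=1.0):
    """Convert an oriented 3D cuboid into a Gaussian (mean, covariance).

    Sketch: the cuboid center becomes the mean, the eigenvalues are
    proportional to the squared half-dimensions, and the eigenvectors are the
    cuboid axes (yaw rotation about the assumed up axis). The proportionality
    constant `scale` is an assumption.
    """
    w, l, h = dims                                   # cuboid width, length, height
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])                  # rotation about the up axis
    eigvals = scale * (np.array([w, l, h]) / 2.0) ** 2
    cov = R @ np.diag(eigvals) @ R.T
    return np.asarray(center, dtype=np.float64), cov

mean, cov = cuboid_to_gaussian(center=(1.0, 0.5, 2.0), dims=(0.8, 1.6, 0.7), yaw=0.3)
```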
We use a ViT-Small backbone with 12 encoder layers; patch size and embedding dimension follow the standard ViT-Small configuration. The total number of parameters is approximately 45 M. All models are trained on two NVIDIA RTX 3080 Ti GPUs with batch size 8 for 80–100 epochs using the AdamW optimizer. We apply standard data augmentation including random horizontal flipping, random scaling (0.9–1.1), and color jittering. The loss weights are kept fixed throughout all experiments.
6.2. Results
We evaluate our probabilistic Gaussian detector on the SUN RGB-D validation split and compare it against state-of-the-art 3D detection networks from both point-cloud-based and RGB-D-based categories.
Table 1 summarizes the quantitative performance using mean Average Precision at IoU threshold 0.25 (mAP@0.25) for 10 standard indoor object categories.
Our method achieves an mAP@0.25 of 61.9%, demonstrating competitive performance among state-of-the-art methods. This result surpasses several established point-cloud detectors including VoteNet (59.1%), H3DNet (60.1%), MLCVNet (59.8%), and 3DETR (59.1%), while achieving comparable performance to BRNet (61.1%). Among RGB-D methods, our approach significantly outperforms traditional fusion strategies such as Frustum PointNets (54.0%), PointFusion (45.4%), and DSS (42.1%), establishing it as the second-best RGB-D method after ImVoteNet (63.4%). This competitive performance is particularly meaningful given our fundamentally different paradigm. Rather than relying on complex multi-stage pipelines or voting heuristics (e.g., ImVoteNet, Group-Free 3D), we achieve strong results through a simpler single-stage architecture. Our approach leverages unified RGB-D early fusion and a probabilistic Gaussian representation. This architectural simplicity, combined with the inherent uncertainty quantification provided by our Gaussian formulation, represents a compelling trade-off for practical applications where interpretability and uncertainty awareness are valued alongside raw accuracy.
When compared against point-cloud-based detectors, our RGB-D approach demonstrates that the combination of appearance and depth information in a unified tensor provides competitive results while offering additional benefits. VoteNet, which pioneered deep Hough voting for indoor 3D detection, achieves 59.1% mAP using only point cloud input. Our method outperforms VoteNet by +2.8% mAP, suggesting that the explicit integration of RGB appearance features enables more discriminative object representations. H3DNet extends VoteNet with hybrid geometric primitives (centers, face centers, edge centers), reaching 60.1% mAP, yet our probabilistic Gaussian formulation surpasses this by +1.8% mAP without requiring explicit primitive prediction. The transformer-based 3DETR achieves 59.1% mAP on point clouds; our approach demonstrates that incorporating RGB information and adopting a Gaussian-based representation provides complementary benefits that push performance beyond pure geometric reasoning. While our method does not reach the performance of Group-Free 3D (63.0%), it achieves comparable results to BRNet (61.1%) while additionally providing uncertainty estimates that are valuable for downstream robotic applications.
Among RGB-D methods, our approach demonstrates substantial improvements over traditional fusion strategies. Deep Sliding Shapes (DSS), an early volumetric approach, achieves only 42.1% mAP due to its reliance on hand-crafted 3D representations and limited fusion capability. PointFusion, which projects image features onto point clouds for late fusion, reaches 45.4% mAP but suffers from modality misalignment. The 2D-driven approach, which lifts 2D detections to 3D using depth, achieves 45.1% mAP and is limited by the quality of 2D proposals. Frustum PointNets improves to 54.0% mAP by constraining 3D search within 2D frustums, but still relies on a two-stage pipeline. Our approach achieves 61.9% mAP, representing a significant improvement of +7.9% over Frustum PointNets and establishing our method as the second-best RGB-D approach. While ImVoteNet achieves 63.4% mAP by combining 2D image votes with point cloud voting, our approach offers distinct architectural advantages: early fusion of RGB and depth in a unified tensor eliminates the need for complex cross-modal feature alignment and multi-stage processing, and our probabilistic Gaussian representation provides inherent uncertainty quantification that ImVoteNet’s deterministic box regression cannot offer. The 1.5% gap to ImVoteNet is offset by the simplicity and interpretability of our single-stage probabilistic framework.
The per-category results reveal interesting patterns that highlight the strengths of our probabilistic Gaussian representation on specific object types. Our method achieves notable improvements on the Sofa category (+5.8% over ImVoteNet, 76.5% vs. 70.7%), where the full covariance matrix effectively captures the soft boundaries and variable geometries that characterize upholstered furniture. Similarly, our approach shows gains on Bathtub (+1.2%, 77.1% vs. 75.9%) and Table (+0.8%, 51.9% vs. 51.1%), categories that benefit from the continuous ellipsoidal representation’s ability to model curved surfaces and varying aspect ratios. These improvements suggest that the Gaussian formulation is particularly effective for objects with smooth boundaries or high geometric variability.
For categories with well-defined geometric structures such as Bed (85.8%), Toilet (87.2%), and Chair (74.8%), our method shows competitive but slightly lower performance compared to ImVoteNet (87.6%, 90.5%, and 76.7%, respectively). This suggests that ImVoteNet’s explicit point cloud voting mechanism may be particularly effective for objects where depth discontinuities provide strong boundary cues. The Nightstand category shows a larger gap (59.4% vs. 69.9%), which may be attributed to the relatively small size and frequent occlusion of nightstands in cluttered bedroom scenes, where point cloud voting can leverage fine-grained geometric details more effectively. Nevertheless, our method maintains strong overall performance while providing additional uncertainty information that is valuable for downstream applications.
The consistent improvements across categories validate our early fusion strategy, where RGB and depth are concatenated into a unified tensor before being processed by the transformer backbone. Unlike late fusion approaches (e.g., ImVoteNet) that must learn separate representations and then align them, our method enables the network to discover joint appearance–geometry patterns from the first layer. This is particularly beneficial for depth-ambiguous scenarios where appearance cues are essential for disambiguation, and for objects with similar appearances where depth provides critical discriminative information.
The Gaussian representation offers several practical advantages beyond raw detection accuracy. First, the predicted covariance provides a natural uncertainty estimate for each detection, which is valuable for downstream tasks such as robot manipulation and navigation where understanding localization confidence is critical. Second, the continuous and differentiable nature of Gaussian parameterization enables smoother gradient flow during training compared to discrete box regression, contributing to more stable optimization. Third, the distributional loss functions (KL divergence and Bhattacharyya distance) provide a geometrically meaningful training signal that is inherently aligned with the evaluation metric, avoiding the mismatch between smooth L1 loss and IoU-based evaluation in traditional detectors.
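For reference, the sketch below computes the closed-form KL divergence between two 3D Gaussians. It is a generic reference implementation of the distributional loss named above, not the paper's exact loss weighting or training objective.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL divergence KL(N_p || N_q) between two multivariate Gaussians.

    Standard closed-form expression; in practice such a term would be computed
    batch-wise and combined with classification and objectness losses.
    """
    k = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p)
                  + diff @ cov_q_inv @ diff
                  - k
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

kl = gaussian_kl(np.zeros(3), np.eye(3), 0.1 * np.ones(3), 1.2 * np.eye(3))
```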
6.3. Qualitative Analysis
Figure 4 presents qualitative detection results on representative scenes from the SUN RGB-D validation set. The figure illustrates how our model visualizes objects in real-world indoor environments. By representing detections as 3D Gaussian ellipsoids rather than rigid boxes, the model captures the actual shape and orientation of objects such as chairs, sofas, and desks more naturally. For instance, the ellipsoids stretch to follow the long edges of desks and sofas while remaining compact for smaller items like chairs. This allows the system to reflect the physical structure of the scene accurately.
However, a close look at the prediction results in Figure 4 shows that the rotation angles of the boxes do not always perfectly align with the ground truth. This misalignment typically occurs in cluttered office spaces or with objects that have nearly square dimensions, where it is difficult to distinguish the exact front and back from depth data alone. Nevertheless, our model handles these errors gracefully by increasing the size of the ellipsoid when it is uncertain. Instead of providing a wrong box with high confidence, the model signals its uncertainty through the spread of the Gaussian distribution, offering a more reliable and safety-aware detection result for practical use.
6.4. Ablation Study
To quantify the contribution of each architectural and representational component, we conduct systematic ablation experiments on three key aspects: (1) the effect of the Gaussian-based detection head versus traditional bounding box regression, (2) the impact of covariance parameterization complexity, and (3) the influence of transformer backbone depth on detection performance. All ablation experiments are conducted on the SUN RGB-D validation set using the same training protocol, and the results are summarized in Table 2. We note that certain components, such as the geometric transformations and the distributional loss formulation, are mathematically coupled. Isolating them independently leads to degenerate or physically inconsistent behavior; therefore, we focus our ablation on coherent modeling choices rather than artificially separated submodules.
Replacing the conventional bounding box regression head with a Gaussian detection head yields a substantial improvement of +4.1 mAP (from 57.8% to 61.9%). This improvement can be attributed to several fundamental differences between the two representations. First, the Gaussian formulation provides a continuous and smooth parameterization of object regions, which facilitates gradient-based optimization. In contrast, box regression with explicit orientation angles suffers from discontinuities at angle boundaries (e.g., angle wrapping) and ambiguities for near-square objects, where small perturbations can cause large changes in the predicted angle. The Gaussian covariance matrix naturally handles these cases through its eigenstructure, which varies smoothly with object orientation.
Second, the distributional losses (KL divergence and Bhattacharyya distance) used in Gaussian regression are geometrically meaningful and directly relate to the overlap between predicted and ground-truth regions. This contrasts with the commonly used smooth L1 loss for box regression, which treats center, size, and angle independently without considering their joint geometric effect. The distributional losses provide a more holistic training signal that encourages predictions to match the overall spatial extent of objects.
Third, the Gaussian representation inherently encodes uncertainty, which regularizes the network to produce predictions that are consistent with the underlying data distribution. Objects with high appearance variability or frequent occlusions naturally receive higher uncertainty estimates, leading to more calibrated confidence scores.
The choice of covariance parameterization significantly impacts detection performance. Using only diagonal covariance (3 parameters) achieves 58.7% mAP, representing a +0.9 mAP improvement over the baseline box head. However, upgrading to the full Cholesky parameterization (6 parameters) provides an additional +3.2 mAP gain, reaching 61.9% mAP. This result demonstrates the importance of modeling cross-axis correlations in 3D object representation.
From a geometric perspective, diagonal covariance restricts the predicted ellipsoid to be axis-aligned in the camera coordinate frame, which is a significant limitation for indoor objects that appear at arbitrary orientations. Real-world objects such as sofas, beds, and tables are often rotated relative to the camera, and their 3D extents exhibit strong correlations across axes. For example, a bed viewed at a 45-degree angle has correlated uncertainty in the camera’s x and z directions, which cannot be captured by diagonal covariance.
The full Cholesky parameterization allows the network to predict arbitrary ellipsoid orientations, effectively learning an implicit rotation that aligns with each object’s principal axes. This flexibility is particularly valuable for categories with elongated shapes (beds, tables, desks) or objects at non-canonical orientations. The Cholesky factorization guarantees positive definiteness regardless of network outputs, providing numerical stability during training while maintaining full expressiveness.
Reducing the transformer backbone depth from 12 layers to 8 layers leads to a performance drop of 1.5 mAP (from 61.9% to 60.4%). This result indicates that accurate estimation of 3D Gaussian parameters, particularly the covariance matrix, benefits substantially from deeper networks with stronger global context modeling capabilities.
The transformer’s self-attention mechanism enables each spatial location to aggregate information from the entire image, which is crucial for several aspects of Gaussian prediction. First, estimating the full extent of an object requires reasoning about its boundaries, which may be distant from the object center in the feature map. Deeper networks with more attention layers can propagate information across larger spatial extents more effectively. Second, the covariance estimation requires understanding the object’s shape and orientation, which often depends on contextual cues such as surrounding furniture, room layout, and perspective geometry. These cues are better captured by networks with larger effective receptive fields.
Third, the depth prediction component of our method benefits from global context because depth estimation in indoor scenes is inherently ambiguous from local appearance alone. Objects of similar appearance (e.g., chairs) can appear at vastly different depths, and disambiguating these cases requires understanding the overall scene structure. The 4-layer reduction from L = 12 to L = 8 decreases the network’s capacity for such global reasoning, leading to less accurate depth and covariance predictions.
The relatively modest performance drop (1.5 mAP) suggests that the Gaussian representation itself provides strong inductive bias that partially compensates for reduced backbone capacity. Nevertheless, the full L = 12 configuration remains preferable for achieving optimal performance.
7. Conclusions
We presented a probabilistic framework for RGB-D 3D object detection that models each object as a 3D Gaussian distribution rather than a deterministic bounding box. This formulation enables unified appearance–geometry encoding through early fusion in a vision transformer backbone and provides inherent uncertainty estimates via covariance prediction. The Cholesky-based parameterization ensures valid and stable covariance outputs, while distributional loss functions offer geometrically coherent supervision aligned with overlap metrics.
Our method achieves 61.9% mAP@0.25 on SUN RGB-D, outperforming several point-cloud methods (VoteNet, H3DNet, MLCVNet, and 3DETR) and RGB-D approaches (Frustum PointNets, PointFusion, and DSS). Although ImVoteNet reports slightly higher accuracy, our approach provides advantages not captured by box-based detectors, including calibrated uncertainty, metrically interpretable ellipsoids, and a simplified architecture without multi-stage fusion. The improvements observed on categories with soft boundaries, such as Sofa, highlight the benefits of full covariance modeling. Ablation studies further validate that both the Gaussian representation and the Cholesky parameterization contribute significantly to performance.
This work also has limitations. The current formulation operates on single frames, assumes single-ellipsoid representations, and carries computational cost due to the transformer backbone. Nevertheless, several avenues remain promising for future work, including temporal modeling for video, mixture-of-Gaussians representations for complex objects, lightweight backbone design for real-time deployment, and leveraging uncertainty for downstream tasks such as active perception.
In summary, Gaussian-based object representations offer a practical and uncertainty-aware alternative to box-based 3D detection, opening opportunities for more reliable perception in robotics and augmented reality applications.