1. Introduction
Object detection in visual data has conventionally been formulated as the regression of a bounding box around each object, typically predicting the object’s center and size in the image plane. However, this deterministic bounding-box paradigm fails to capture the inherent uncertainty in object appearance, geometric projection, and the three-dimensional structure of the scene.
Recently, several works have reformulated detection as a density estimation problem, modeling each object as a Gaussian (or mixture) distribution over the image or 3D space rather than as a fixed box [1,2]. Modeling the object region with a Gaussian allows a more flexible representation of uncertainty and can improve alignment with evaluation metrics such as IoU or generalized divergence losses. In parallel, RGB-D sensing (i.e., combining color images with depth channels) has emerged as a powerful modality for 3D object detection, providing richer geometric cues than RGB alone. Standard approaches in the RGB-D domain often process the RGB and depth streams separately and fuse them only at a later stage; alternatively, some methods extract 2D image proposals first and then refine them using depth information [3,4]. While effective, these strategies incur a modality gap and require separate processing steps.
In this paper, we propose a unified framework for direct 3D object detection from the RGB-D input by modeling each object as a 3D Gaussian distribution characterized by its mean and covariance.
Our network takes a four-channel RGB-D tensor as input and directly predicts Gaussian parameters for each object, representing both spatial location and geometric uncertainty. Importantly, the detector is built upon a vision transformer backbone that is modified to support four-channel inputs through an extended patch embedding layer. This design enables joint modeling of appearance and geometry from the earliest stages of feature extraction while leveraging pretrained ImageNet representations.
Key advantages of our approach include the following: (i) representing objects with a probabilistic formulation that naturally encodes localization uncertainty and shape anisotropy; (ii) extending the Gaussian-based detection paradigm into true 3D space using RGB-D input; (iii) integrating RGB and depth modalities from the outset via early fusion within a transformer architecture, thereby reducing the domain gap and enabling more coherent feature learning; and (iv) benefiting from the global receptive field of transformers, which facilitates the long-range reasoning needed to estimate 3D Gaussian covariances that span large spatial extents.
We evaluate our proposed method on the SUN RGB-D benchmark and demonstrate that it achieves competitive 3D detection accuracy compared to state-of-the-art point-cloud methods while providing uncertainty-aware predictions. Our contributions are summarized as follows:
We propose a direct 3D object detection framework from RGB-D input that predicts 3D Gaussian distribution parameters (mean and covariance) for each object, replacing traditional box regression.
We design a transformer-based RGB-D backbone with a four-channel patch embedding that processes RGB and depth channels in a unified tensor, thereby eliminating late fusion and reducing modality gap.
We introduce a Cholesky-based covariance parameterization and derive the geometric transformations required to lift image-aligned Gaussian predictions to metrically interpretable 3D ellipsoids.
We demonstrate through experiments that our probabilistic Gaussian formulation yields more physically interpretable detections and improved performance over conventional bounding-box and fusion-based RGB-D detectors.
2. Related Work
We review three strands that directly motivate our approach: (1) Gaussian and probabilistic representations for object detection, (2) RGB-D and 3D object detection, and (3) fusion strategies for RGB and depth modalities.
2.1. Probabilistic and Gaussian Representations for Object Detection
Recent research has challenged the canonical bounding-box parameterization by modeling object regions as probability distributions. Early works explored fuzzy object region parameterizations and probabilistic IoU proxies based on distributional distances. Llerena et al. introduced Gaussian Bounding Boxes and a Probabilistic IoU notion for fuzzy object regions, demonstrating that distributional similarity (e.g., Hellinger distance) can serve as a viable box alternative [2]. Yang et al. developed a more complete framework that models rotated 2D/3D boxes as Gaussians and uses between-distribution metrics such as the Gaussian Wasserstein Distance (GWD), KL divergence, and Bhattacharyya distance as regression losses [1,5]. They demonstrated that many recurring problems in rotation detection (boundary discontinuity, square-like ambiguity) can be mitigated under Gaussian modeling and extended the formulation to 3D. The G-Rep work provides a unified Gaussian construction to represent oriented boxes, quads, and point sets, enabling a single Gaussian representation across different geometric primitives [6].
These approaches bring several benefits. First, distributional metrics align training objectives more closely with geometric overlap metrics (IoU variants) while remaining differentiable. Second, Gaussian parameterization naturally encodes uncertainty (via covariance) and anisotropy (via eigenstructure), which is useful for high-precision localization and downstream tasks that exploit uncertainty. Third, Gaussian modeling often removes angle periodicity and edge-exchange ambiguities that plague angle-regression schemes for rotated boxes. However, most prior Gaussian works focus on 2D detection or oriented bounding box (OBB) representations and do not directly address the RGB-D to 3D problem using a single fused input tensor.
Beyond Gaussian parameterizations for rotated bounding boxes, probabilistic modeling has also been explored from a Bayesian and uncertainty-aware perspective. Kendall and Gal demonstrated that aleatoric uncertainty can be learned implicitly through probabilistic regression losses, motivating distribution-valued predictions in deep networks. Subsequent works incorporated uncertainty estimation into detection and localization tasks, showing improved robustness under noisy observations and ambiguous geometries [7,8].
In the context of object detection, several studies have investigated probabilistic bounding boxes and uncertainty-aware scoring mechanisms. He et al. proposed uncertainty-guided non-maximum suppression and confidence calibration strategies, highlighting that explicit uncertainty modeling improves ranking and post-processing reliability [9]. These lines of work further support the view that modeling objects as distributions rather than deterministic boxes provides a principled framework for uncertainty-aware perception.
2.2. RGB-D and 3D Object Detection
The community has developed many approaches for 3D detection using depth sensors (RGB-D) or LiDAR. Two historical threads are notable.
Early studies such as Sliding Shapes and follow-ups like Deep Sliding Shapes showed that depth/volumetric representations enable direct 3D box estimation by operating on 3D volumes or truncated signed distance fields (TSDFs) constructed from depth maps [3,10]. These volumetric pipelines demonstrated how geometry alone greatly simplifies certain detection challenges (occlusion, viewpoint) compared to pure RGB pipelines. Many subsequent RGB-D works adopted 3D ConvNets or hybrid voxel/TSDF encoders for amodal 3D bounding box estimation.
In parallel, point-cloud methods such as VoxelNet [11], PointPillars [12], and PointRCNN [13] established powerful encoders and training paradigms for 3D detection from LiDAR. These methods emphasize structured representations (voxels, pillars, or point sets) and demonstrate how spatially aware backbones boost localization accuracy in metric space. More recently, VoteNet [14] introduced deep Hough voting for indoor 3D detection, achieving strong results on SUN RGB-D and ScanNet using only point cloud input. H3DNet [15] further improved performance by predicting a hybrid set of geometric primitives (centers, face centers, and edge centers). ImVoteNet [4] extended VoteNet by incorporating image votes from 2D detectors, effectively fusing RGB and point cloud modalities. FCAF3D [16] proposed an anchor-free, fully convolutional approach using sparse convolutions, achieving state-of-the-art results on indoor benchmarks. Group-Free 3D [17] introduced a transformer-based detection head that directly learns object features from point clouds without explicit grouping.
Although LiDAR and RGB-D are different sensors, the core insight is shared: direct 3D geometric representations help produce metric 3D boxes and support uncertainty estimation in world coordinates. Despite those successes, a large fraction of RGB-D detectors still adopt hybrid strategies: (i) 2D proposal generation from RGB followed by depth-based refinement (lifting), or (ii) separate RGB and depth encoding followed by late fusion. Very few works directly estimate probabilistic 3D object parameters (mean and covariance) from a single fused RGB-D tensor in an end-to-end manner; this is the gap our paper addresses.
More recently, transformer-based architectures have been introduced for 3D object detection, leveraging global context modeling to replace hand-crafted grouping or voting heuristics. Works such as 3DETR formulate 3D detection as a set prediction problem using transformers, demonstrating strong performance without explicit anchors or proposals [18]. Related efforts further extend transformer-based reasoning to indoor scenes and multi-modal inputs, suggesting that global attention is particularly beneficial for capturing long-range geometric dependencies in 3D space.
2.3. Fusion Strategies for RGB and Depth Modalities
Integrating RGB and depth modalities has been studied extensively. Fusion strategies generally fall into (a) early fusion (concatenating depth as an extra channel), (b) middle fusion (cross-modal attention or feature fusion at intermediate layers), and (c) late fusion (separate backbones with combined scores or proposals). Empirical studies highlight that the optimal fusion strategy depends on the architecture and task. For classification, middle or late fusion is often effective; in contrast, for dense geometric tasks, early fusion is preferred because it allows the network to jointly learn geometric-appearance correlations from the outset [19,20]. Recent transformer-based studies empirically compare early versus late fusion regimes and show that aligning modality representations matters greatly for final performance [19].
In object detection, late-fusion pipelines commonly extract 2D RGB proposals and then exploit the point cloud or depth within ROIs to lift 2D boxes to 3D. While pragmatic, this two-stage processing can suffer from modality gaps and limits end-to-end uncertainty propagation. Our approach adopts an early, fused representation (an RGB-D tensor) but differs from naive early fusion by predicting 3D Gaussian parameters directly, instead of regressing 3D box corners or lifting 2D boxes. This enables a single forward pass to produce both metric centers and covariances, simplifying uncertainty quantification and downstream integration.
From a representation learning perspective, early fusion can be interpreted as learning a joint appearance–geometry embedding, whereas late fusion enforces a factorized representation that may hinder uncertainty propagation across modalities. Recent analyses indicate that early fusion is particularly advantageous when the prediction target is geometrically structured, such as depth, surface normals, or 3D object parameters [21,22]. Our approach aligns with this view by directly predicting metric 3D Gaussian parameters from a unified RGB-D tensor.
2.4. Label Assignment and Loss Design
Practical advances like Adaptive Training Sample Selection (ATSS) showed that how positives and negatives are chosen (anchor/point assignment) significantly affects detector calibration and training stability [23]. Gaussian-based losses can be paired with distribution-aware assignment strategies (e.g., metric thresholding in distribution space) to obtain sample assignments consistent with the regression objective. We build on these insights in our training design.
4. Covariance Parameterization and Geometric Transformation
Section 3 introduced the probabilistic formulation in the 3D camera frame. We now describe how the network parameterizes Gaussian ellipsoids in an image-aligned frame, and how these ellipsoids are transformed into metrically interpretable ellipsoids in the camera frame through perspective geometry. We follow the coordinate convention illustrated in Figure 1.
4.1. Cholesky-Based Covariance in the Image-Aligned Frame
The network first predicts ellipsoids in the augmented pixel coordinate frame, whose axes are aligned with the two image coordinates and a third coordinate w oriented along the optical axis. In this frame, the covariance is parameterized via a Cholesky factorization (Equation (9)).
Ensuring positive definiteness of predicted covariance matrices is critical for stable optimization. Cholesky-based parameterizations are commonly adopted in probabilistic deep learning to guarantee valid covariance estimates while maintaining differentiability [26]. Our formulation follows this principle, enabling unconstrained network outputs to be mapped to valid 3D ellipsoids.
With a lower-triangular Cholesky factor, the parameterization in Equation (9) spans all 3×3 symmetric positive definite matrices and is differentiable with respect to its six scalar parameters. The network outputs six raw values that are mapped to these parameters through Equations (10) and (11), using a sigmoid function and characteristic length scales in the three directions.
The exponential mapping in Equation (10) ensures the principal scales are strictly positive, and the sigmoid normalization in Equation (11) keeps the angular parameters within a bounded interval, avoiding discontinuities in the orientation. As a result, every network output corresponds to a valid ellipsoid with a well-conditioned covariance.
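To make this concrete, the following minimal NumPy sketch maps six unconstrained network outputs to a valid covariance through a lower-triangular factor. It is an illustrative approximation of Equations (9)–(11): the exponential handles the diagonal scales, while bounded off-diagonal terms stand in for the angular parameters; the function name and the characteristic length scales are assumptions, not the exact formulation in the paper.

```python
import numpy as np

def raw_to_covariance(raw, scales=(1.0, 1.0, 1.0)):
    """Map six unconstrained outputs to a valid 3x3 covariance.

    Sketch: the first three values parameterize the diagonal of a
    lower-triangular Cholesky factor through an exponential (strictly
    positive), and the last three parameterize bounded off-diagonal entries
    through a scaled sigmoid. The length scales are illustrative.
    """
    raw = np.asarray(raw, dtype=np.float64)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Strictly positive diagonal entries (principal scales).
    d = np.exp(raw[:3]) * np.asarray(scales, dtype=np.float64)
    # Bounded off-diagonal entries, scaled so the factor stays well conditioned.
    o = (2.0 * sigmoid(raw[3:]) - 1.0) * np.sqrt(d[[1, 2, 2]] * d[[0, 0, 1]])

    L = np.array([[d[0], 0.0,  0.0],
                  [o[0], d[1], 0.0],
                  [o[1], o[2], d[2]]])
    return L @ L.T  # symmetric positive definite by construction

cov = raw_to_covariance(np.random.randn(6))
assert np.all(np.linalg.eigvalsh(cov) > 0)
```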
4.2. Perspective Projection Matrix and Notation
We explicitly decompose the perspective projection matrix into its intrinsic and translation components. In homogeneous coordinates, the mapping from a 3D point in the camera frame to image-plane coordinates in the image-aligned frame is given by Equation (12). The intrinsic parameters comprise the number of pixels per unit length in the horizontal and vertical directions, the focal length f, and the principal point offset in pixels. The projection matrix is partitioned into a 3×3 intrinsic submatrix and a translation column.
4.3. Inverse Mapping for the Ellipsoid Center
The network predicts a 2D center and a depth value z in the image-aligned frame, which must be lifted back to a 3D point in the camera frame. Depth is parameterized via an exponential mapping of the raw network output, scaled by a reference depth. The exponential mapping guarantees positive metric depth and improves numerical stability by compressing the dynamic range of raw outputs.
Given the predicted center and depth, the 3D center of the ellipsoid in the camera frame is obtained by inverting Equation (12), using the inverse of the 3×3 intrinsic submatrix of the projection matrix. The resulting Equation (14) is equivalent to classical back-projection with the camera intrinsics, but it is written in a homogeneous linear-algebra form that fits the ellipsoid transformation.
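As a concrete illustration of this lifting step, the sketch below back-projects a predicted pixel center and a raw depth output into a 3D point using the inverse intrinsic matrix. The intrinsic values and the reference depth scale z_ref are placeholder assumptions; the exponential depth mapping mirrors the positive-depth parameterization described above.

```python
import numpy as np

def backproject_center(uv, raw_depth, K, z_ref=2.0):
    """Lift a predicted 2D center and raw depth output to a 3D point.

    Minimal sketch of pinhole back-projection: the exponential depth mapping
    z = z_ref * exp(raw_depth) guarantees positive metric depth (z_ref is an
    assumed reference scale), and the inverse intrinsics give the viewing ray.
    """
    z = z_ref * np.exp(raw_depth)          # strictly positive metric depth
    uv_h = np.array([uv[0], uv[1], 1.0])   # homogeneous pixel coordinates
    ray = np.linalg.inv(K) @ uv_h          # direction through the pixel
    return z * ray                         # 3D center in the camera frame

# Placeholder intrinsics (focal lengths in pixels, principal point).
K = np.array([[525.0,   0.0, 320.0],
              [  0.0, 525.0, 240.0],
              [  0.0,   0.0,   1.0]])
center_3d = backproject_center((350.0, 260.0), raw_depth=0.1, K=K)
```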
4.4. Depth-Dependent Scaling of Covariance
Perspective projection shrinks apparent object size proportionally to depth. Conversely, when lifting a covariance from the image-aligned frame back to the metric camera frame, the covariance must be scaled up according to depth. We capture this depth-dependent scaling with a diagonal matrix.
The first two entries correspond to the scaling factors in the horizontal and vertical directions, obtained from the focal length and pixel density. The third entry uses the geometric mean of the lateral scalings, which yields a depth scaling that is consistent with the isotropic shrinkage implied by perspective projection. We emphasize that this formulation is a geometrically motivated approximation under the pinhole camera model, rather than an exact probabilistic projection of a 3D Gaussian through perspective, which is known to yield a non-Gaussian distribution. In practice, this approximation provides a useful inductive bias and leads to stable and metrically consistent ellipsoid predictions.
4.5. Rotation Alignment with the Viewing Direction
The ellipsoid predicted in the image-aligned frame is aligned with that frame's axes, where the w-axis is orthogonal to the image plane. In the camera frame, we align the ellipsoid such that its main axis is oriented along the direction from the camera center to the ellipsoid center.
Let one unit vector denote the optical axis in the camera frame and another the direction from the camera center to the ellipsoid center. The unit rotation axis is obtained from their cross product and the rotation angle from their inner product, and the corresponding rotation matrix follows from Rodrigues' formula, expressed through the skew-symmetric matrix of the rotation axis. This alignment is physically motivated by the observation that uncertainty in RGB-D perception is typically elongated along the viewing ray, particularly due to depth noise and occlusion. Empirically, removing this alignment led to unstable training and implausible covariance orientations in preliminary experiments, supporting its practical necessity.
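A minimal sketch of this axis-angle alignment is given below; the optical axis is assumed to be the camera +z direction, and the function name is illustrative.

```python
import numpy as np

def align_to_viewing_ray(center_3d):
    """Rotation aligning the optical axis with the ray to the ellipsoid center.

    Sketch of the axis-angle construction: the rotation axis is the normalized
    cross product of the optical axis and the unit viewing direction, the angle
    is their arccosine, and the matrix follows from Rodrigues' formula via the
    skew-symmetric matrix of the axis.
    """
    e_z = np.array([0.0, 0.0, 1.0])                  # optical axis (assumed +z)
    d = np.asarray(center_3d, dtype=np.float64)
    d = d / np.linalg.norm(d)                        # unit viewing direction
    axis = np.cross(e_z, d)
    s = np.linalg.norm(axis)
    c = np.clip(np.dot(e_z, d), -1.0, 1.0)
    if s < 1e-8:                                     # parallel or anti-parallel
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    axis = axis / s
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])         # skew-symmetric [axis]_x
    theta = np.arccos(c)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

R = align_to_viewing_ray(np.array([0.4, -0.2, 2.5]))
```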
4.6. Final Covariance in the Camera Frame
Combining the Cholesky factorization, scaling, and rotation, the covariance of the ellipsoid in the camera frame is given by Equation (18). This covariance is used in the distributional losses in Section 3 and ensures that the learned covariance directly corresponds to metric ellipsoids in 3D.
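The following sketch assembles a camera-frame covariance from the three ingredients above. It is one plausible composition consistent with the description, not necessarily the exact form of Equation (18); the focal length and pixel-density values are assumptions.

```python
import numpy as np

def camera_frame_covariance(L_chol, z, f=525.0, k_u=1.0, k_v=1.0, R=np.eye(3)):
    """Assemble an ellipsoid covariance in the camera frame (hedged sketch).

    Combines the Cholesky factor predicted in the image-aligned frame, a
    diagonal depth-dependent scaling (lateral factors from focal length and
    pixel density, third entry their geometric mean), and the viewing-ray
    rotation. The exact composition in Equation (18) may differ.
    """
    s_u = z / (f * k_u)                       # horizontal metric scale per pixel
    s_v = z / (f * k_v)                       # vertical metric scale per pixel
    S = np.diag([s_u, s_v, np.sqrt(s_u * s_v)])
    sigma_img = L_chol @ L_chol.T             # covariance in the image-aligned frame
    return R @ S @ sigma_img @ S.T @ R.T      # scaled and rotated into the camera frame

L_chol = np.diag([20.0, 10.0, 15.0])          # toy lower-triangular factor (pixels)
cov_cam = camera_frame_covariance(L_chol, z=2.5)
```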
5. RGB-D Transformer Network Architecture
This section describes the concrete network architecture that instantiates the probabilistic formulation developed in Section 3 and Section 4. As illustrated in Figure 2, the detector follows a dense, single-stage design that takes an RGB-D image as input and outputs, for each feature-grid location, the parameters of a 3D Gaussian distribution together with objectness and class scores.
5.1. RGB-D Input Construction
Given an RGB image and an aligned depth map, we construct a four-channel RGB-D tensor whose depth channel is obtained by clipping raw depth values to a maximum range (8 m in all experiments) and linearly rescaling them to [0, 1]. The RGB channels are normalized using the standard ImageNet mean and variance, while the depth channel is normalized with its own statistics.
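A minimal sketch of this input construction, assuming depth given in meters and placeholder depth-normalization statistics, is shown below.

```python
import numpy as np

def build_rgbd_tensor(rgb, depth_m, max_depth=8.0,
                      rgb_mean=(0.485, 0.456, 0.406),
                      rgb_std=(0.229, 0.224, 0.225),
                      depth_mean=0.5, depth_std=0.25):
    """Build a normalized four-channel RGB-D tensor of shape (H, W, 4).

    Sketch: depth is clipped to max_depth (8 m in the experiments) and rescaled
    to [0, 1]; RGB uses the standard ImageNet statistics, while the depth
    statistics here are placeholder values.
    """
    rgb = (rgb.astype(np.float32) / 255.0 - np.array(rgb_mean)) / np.array(rgb_std)
    depth = np.clip(depth_m, 0.0, max_depth) / max_depth
    depth = (depth - depth_mean) / depth_std
    return np.concatenate([rgb, depth[..., None]], axis=-1).astype(np.float32)

x = build_rgbd_tensor(np.zeros((480, 640, 3), dtype=np.uint8),
                      np.full((480, 640), 2.0, dtype=np.float32))
```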
5.2. Transformer Backbone with Four-Channel Patch Embedding
We adopt a vision transformer backbone, instantiated as either a ViT-Small or Swin-Tiny model, to encode the RGB-D tensor into a sequence of tokens. The backbone is modified to accept four-channel inputs by extending the patch embedding layer.
Let the initial convolutional projection use a P×P kernel with stride P. We replace it with a four-channel patch embedding that maps the RGB-D tensor to a sequence of tokens, one per patch. The weight tensor of the new embedding is initialized by copying the pretrained RGB filters and setting the depth filter to their channel-wise mean. This “mean depth” initialization preserves the behavior of the pretrained ImageNet backbone on RGB while introducing a sensible starting point for the depth channel.
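The sketch below illustrates the “mean depth” initialization for a ViT-style patch embedding in PyTorch; the token dimension and patch size in the example are assumed values, and the helper name is illustrative.

```python
import torch
import torch.nn as nn

def extend_patch_embed_to_rgbd(rgb_patch_embed: nn.Conv2d) -> nn.Conv2d:
    """Extend a pretrained 3-channel patch embedding to four channels.

    Sketch of the "mean depth" initialization: RGB filters are copied
    unchanged and the new depth filter is set to their channel-wise mean,
    preserving the pretrained backbone's behavior on RGB.
    """
    out_ch, _, kh, kw = rgb_patch_embed.weight.shape
    rgbd_embed = nn.Conv2d(4, out_ch, kernel_size=(kh, kw),
                           stride=rgb_patch_embed.stride,
                           bias=rgb_patch_embed.bias is not None)
    with torch.no_grad():
        rgbd_embed.weight[:, :3] = rgb_patch_embed.weight              # copy RGB filters
        rgbd_embed.weight[:, 3:] = rgb_patch_embed.weight.mean(dim=1, keepdim=True)
        if rgb_patch_embed.bias is not None:
            rgbd_embed.bias.copy_(rgb_patch_embed.bias)
    return rgbd_embed

# Example: a ViT-style 16x16 patch embedding with 384-dim tokens (assumed sizes).
pretrained = nn.Conv2d(3, 384, kernel_size=16, stride=16)
rgbd = extend_patch_embed_to_rgbd(pretrained)
```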
After adding positional encodings, the token sequence is processed by a stack of L transformer encoder blocks. The final tokens form a dense RGB-D representation on a coarse spatial grid.
5.3. Feature Refinement Neck
The feature refinement neck serves as the bridge between the transformer encoder and the Gaussian-based 3D detection head. While Section 3 and Section 4 introduced the theoretical and geometric formulation of the Gaussian parameterization, the role of the neck is to reshape globally contextualized transformer tokens into spatially structured feature maps that enable dense prediction of 3D Gaussian distributions. As shown in Figure 3, the feature refinement neck consists of two main components: a module that converts tokens into grid-structured features and a module that refines the features in a multi-scale manner.
Token-to-grid restoration. Given the final-layer token sequence, we reshape it into a 2D spatial grid to recover a feature representation with explicit geometric structure. This grid acts as the highest-resolution feature map.
Multi-scale pyramid construction. To support objects of different physical sizes and depth-dependent scale variations, we construct a shallow 2D feature pyramid from the restored grid.
Unlike traditional FPNs, which are typically paired with bounding-box regression, our pyramid is specifically designed to provide multi-scale geometric cues for Gaussian parameter regression. Since Gaussian ellipsoids encode shape anisotropy and uncertainty, feature responses benefit significantly from hierarchical context (global semantics from transformers and localized spatial structure from convolutional refinement).
Importantly, the Gaussian representation is continuous and differentiable with respect to its mean and covariance. Therefore, producing stable predictions requires features that exhibit both spatial smoothness and global consistency. The refinement neck fulfills this by injecting local convolutional inductive biases on top of globally aggregated transformer features.
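A simplified PyTorch sketch of the neck is given below; the channel widths, number of pyramid levels, and module structure are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class FeatureRefinementNeck(nn.Module):
    """Reshape transformer tokens into a grid and build a small feature pyramid.

    Sketch of the neck in Section 5.3: tokens are restored to a 2D grid, then
    strided convolutions produce two coarser levels. Channel widths and the
    number of levels are assumptions.
    """

    def __init__(self, dim=384, out_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(dim, out_dim, kernel_size=1)
        self.down1 = nn.Sequential(nn.Conv2d(out_dim, out_dim, 3, stride=2, padding=1),
                                   nn.BatchNorm2d(out_dim), nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(nn.Conv2d(out_dim, out_dim, 3, stride=2, padding=1),
                                   nn.BatchNorm2d(out_dim), nn.ReLU(inplace=True))

    def forward(self, tokens, grid_hw):
        b, n, c = tokens.shape
        h, w = grid_hw
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)   # token-to-grid restoration
        p0 = self.proj(grid)                                 # highest-resolution level
        p1 = self.down1(p0)
        p2 = self.down2(p1)
        return [p0, p1, p2]

neck = FeatureRefinementNeck()
levels = neck(torch.randn(2, 30 * 40, 384), grid_hw=(30, 40))
```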
5.4. Dense Gaussian 3D Detection Head
The dense detection head operates on each pyramid level and predicts, at every spatial location, the parameters of a full 3D Gaussian distribution. Each detection head consists of a lightweight convolutional tower with four Conv–BN–ReLU layers, followed by one output layer producing the 3D mean parameters, the covariance parameters via the Cholesky factor, an objectness score, and the class logits for the object categories.
The objectness (confidence) score is computed with a sigmoid, and class probabilities are obtained via a softmax over the subsequent channels. Together with the Gaussian parameters, these outputs define the mixture in Equation (2) for each input frame.
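The following PyTorch sketch illustrates the structure of the dense head; the channel count and output ordering are assumptions, and the class count of 10 matches the SUN RGB-D setting used in the experiments.

```python
import torch
import torch.nn as nn

class GaussianDetectionHead(nn.Module):
    """Dense head predicting Gaussian parameters, objectness, and classes.

    Sketch of the head in Section 5.4: a four-layer Conv-BN-ReLU tower followed
    by a 1x1 output layer producing, per location, 3 mean parameters, 6 Cholesky
    parameters, 1 objectness logit, and C class logits.
    """

    def __init__(self, in_ch=256, num_classes=10):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(in_ch, in_ch, 3, padding=1),
                       nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True)]
        self.tower = nn.Sequential(*layers)
        self.out = nn.Conv2d(in_ch, 3 + 6 + 1 + num_classes, kernel_size=1)
        self.num_classes = num_classes

    def forward(self, feat):
        x = self.out(self.tower(feat))
        mean_raw, chol_raw, obj_logit, cls_logit = torch.split(
            x, [3, 6, 1, self.num_classes], dim=1)
        objectness = torch.sigmoid(obj_logit)            # confidence score
        class_probs = torch.softmax(cls_logit, dim=1)    # per-category probabilities
        return mean_raw, chol_raw, objectness, class_probs

head = GaussianDetectionHead()
outputs = head(torch.randn(2, 256, 30, 40))
```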
6. Experiments
We evaluate the proposed transformer-based probabilistic detector on the SUN RGB-D benchmark and compare it against state-of-the-art methods from both point-cloud-based and RGB-D-based detection paradigms.
6.1. Experimental Setting
SUN RGB-D [27,28,29,30] is a widely used indoor RGB-D benchmark containing 10,335 RGB-D images collected with four depth sensors (Intel RealSense, Asus Xtion, Kinect v1, and Kinect v2) across diverse indoor environments including bedrooms, living rooms, offices, and classrooms. We follow the official split, using 5285 images for training and 5050 for testing over 10 object categories. The benchmark presents several challenges including cluttered scenes, significant inter-class variation in object sizes, partial occlusions, and depth noise from consumer-grade sensors.
Ground-truth oriented cuboids are converted into Gaussian ellipsoids in the camera frame using the formulation of Section 3. The cuboid center becomes the Gaussian mean, and the covariance is derived from the cuboid dimensions and yaw angle following Gaussian-based 3D detection works [1,6]. Specifically, the covariance eigenvalues are set proportional to the squared half-dimensions of the cuboid, and the eigenvectors are aligned with the cuboid axes.
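A minimal sketch of this ground-truth conversion is shown below; the proportionality constant and the choice of the vertical axis for the yaw rotation are assumptions.

```python
import numpy as np

def cuboid_to_gaussian(center, dims, yaw, scale=1.0):
    """Convert an oriented 3D cuboid into a Gaussian (mean, covariance).

    Sketch: the cuboid center becomes the mean, the eigenvalues are
    proportional to the squared half-dimensions, and the eigenvectors are the
    cuboid axes (yaw rotation about the assumed up axis). The proportionality
    constant `scale` is an assumption.
    """
    w, l, h = dims                                   # cuboid width, length, height
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])                  # rotation about the up axis
    eigvals = scale * (np.array([w, l, h]) / 2.0) ** 2
    cov = R @ np.diag(eigvals) @ R.T
    return np.asarray(center, dtype=np.float64), cov

mean, cov = cuboid_to_gaussian(center=(1.0, 0.5, 2.0), dims=(0.8, 1.6, 0.7), yaw=0.3)
```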
We use a ViT-Small backbone with 12 encoder layers; patch size and embedding dimension follow the standard ViT-Small configuration. The total number of parameters is approximately 45 M. All models are trained on two NVIDIA RTX 3080 Ti GPUs with batch size 8 for 80–100 epochs using the AdamW optimizer. We apply standard data augmentation including random horizontal flipping, random scaling (0.9–1.1), and color jittering. The loss weights are kept fixed throughout all experiments.
6.2. Results
We evaluate our probabilistic Gaussian detector on the SUN RGB-D validation split and compare it against state-of-the-art 3D detection networks from both point-cloud-based and RGB-D-based categories.
Table 1 summarizes the quantitative performance using mean Average Precision at IoU threshold 0.25 (mAP@0.25) for 10 standard indoor object categories.
Our method achieves an mAP@0.25 of 61.9%, demonstrating competitive performance among state-of-the-art methods. This result surpasses several established point-cloud detectors including VoteNet (59.1%), H3DNet (60.1%), MLCVNet (59.8%), and 3DETR (59.1%), while achieving comparable performance to BRNet (61.1%). Among RGB-D methods, our approach significantly outperforms traditional fusion strategies such as Frustum PointNets (54.0%), PointFusion (45.4%), and DSS (42.1%), establishing it as the second-best RGB-D method after ImVoteNet (63.4%). This competitive performance is particularly meaningful given our fundamentally different paradigm. Rather than relying on complex multi-stage pipelines or voting heuristics (e.g., ImVoteNet, Group-Free 3D), we achieve strong results through a simpler single-stage architecture. Our approach leverages unified RGB-D early fusion and a probabilistic Gaussian representation. This architectural simplicity, combined with the inherent uncertainty quantification provided by our Gaussian formulation, represents a compelling trade-off for practical applications where interpretability and uncertainty awareness are valued alongside raw accuracy.
When compared against point-cloud-based detectors, our RGB-D approach demonstrates that the combination of appearance and depth information in a unified tensor provides competitive results while offering additional benefits. VoteNet, which pioneered deep Hough voting for indoor 3D detection, achieves 59.1% mAP using only point cloud input. Our method outperforms VoteNet by +2.8% mAP, suggesting that the explicit integration of RGB appearance features enables more discriminative object representations. H3DNet extends VoteNet with hybrid geometric primitives (centers, face centers, edge centers), reaching 60.1% mAP, yet our probabilistic Gaussian formulation surpasses this by +1.8% mAP without requiring explicit primitive prediction. The transformer-based 3DETR achieves 59.1% mAP on point clouds; our approach demonstrates that incorporating RGB information and adopting a Gaussian-based representation provides complementary benefits that push performance beyond pure geometric reasoning. While our method does not reach the performance of Group-Free 3D (63.0%), it achieves comparable results to BRNet (61.1%) while additionally providing uncertainty estimates that are valuable for downstream robotic applications.
Among RGB-D methods, our approach demonstrates substantial improvements over traditional fusion strategies. Deep Sliding Shapes (DSS), an early volumetric approach, achieves only 42.1% mAP due to its reliance on hand-crafted 3D representations and limited fusion capability. PointFusion, which projects image features onto point clouds for late fusion, reaches 45.4% mAP but suffers from modality misalignment. The 2D-driven approach, which lifts 2D detections to 3D using depth, achieves 45.1% mAP and is limited by the quality of 2D proposals. Frustum PointNets improves to 54.0% mAP by constraining 3D search within 2D frustums, but still relies on a two-stage pipeline. Our approach achieves 61.9% mAP, representing a significant improvement of +7.9% over Frustum PointNets and establishing our method as the second-best RGB-D approach. While ImVoteNet achieves 63.4% mAP by combining 2D image votes with point cloud voting, our approach offers distinct architectural advantages: early fusion of RGB and depth in a unified tensor eliminates the need for complex cross-modal feature alignment and multi-stage processing, and our probabilistic Gaussian representation provides inherent uncertainty quantification that ImVoteNet’s deterministic box regression cannot offer. The 1.5% gap to ImVoteNet is offset by the simplicity and interpretability of our single-stage probabilistic framework.
The per-category results reveal interesting patterns that highlight the strengths of our probabilistic Gaussian representation on specific object types. Our method achieves notable improvements on the Sofa category (+5.8% over ImVoteNet, 76.5% vs. 70.7%), where the full covariance matrix effectively captures the soft boundaries and variable geometries that characterize upholstered furniture. Similarly, our approach shows gains on Bathtub (+1.2%, 77.1% vs. 75.9%) and Table (+0.8%, 51.9% vs. 51.1%), categories that benefit from the continuous ellipsoidal representation’s ability to model curved surfaces and varying aspect ratios. These improvements suggest that the Gaussian formulation is particularly effective for objects with smooth boundaries or high geometric variability.
For categories with well-defined geometric structures such as Bed (85.8%), Toilet (87.2%), and Chair (74.8%), our method shows competitive but slightly lower performance compared to ImVoteNet (87.6%, 90.5%, and 76.7%, respectively). This suggests that ImVoteNet’s explicit point cloud voting mechanism may be particularly effective for objects where depth discontinuities provide strong boundary cues. The Nightstand category shows a larger gap (59.4% vs. 69.9%), which may be attributed to the relatively small size and frequent occlusion of nightstands in cluttered bedroom scenes, where point cloud voting can leverage fine-grained geometric details more effectively. Nevertheless, our method maintains strong overall performance while providing additional uncertainty information that is valuable for downstream applications.
The consistent improvements across categories validate our early fusion strategy, where RGB and depth are concatenated into a unified tensor before being processed by the transformer backbone. Unlike late fusion approaches (e.g., ImVoteNet) that must learn separate representations and then align them, our method enables the network to discover joint appearance–geometry patterns from the first layer. This is particularly beneficial for depth-ambiguous scenarios where appearance cues are essential for disambiguation, and for objects with similar appearances where depth provides critical discriminative information.
The Gaussian representation offers several practical advantages beyond raw detection accuracy. First, the predicted covariance provides a natural uncertainty estimate for each detection, which is valuable for downstream tasks such as robot manipulation and navigation where understanding localization confidence is critical. Second, the continuous and differentiable nature of Gaussian parameterization enables smoother gradient flow during training compared to discrete box regression, contributing to more stable optimization. Third, the distributional loss functions (KL divergence and Bhattacharyya distance) provide a geometrically meaningful training signal that is inherently aligned with the evaluation metric, avoiding the mismatch between smooth L1 loss and IoU-based evaluation in traditional detectors.
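For reference, the sketch below computes the closed-form KL divergence between two 3D Gaussians. It is a generic reference implementation of the distributional loss named above, not the paper's exact loss weighting or training objective.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL divergence KL(N_p || N_q) between two multivariate Gaussians.

    Standard closed-form expression; in practice such a term would be computed
    batch-wise and combined with classification and objectness losses.
    """
    k = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p)
                  + diff @ cov_q_inv @ diff
                  - k
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

kl = gaussian_kl(np.zeros(3), np.eye(3), 0.1 * np.ones(3), 1.2 * np.eye(3))
```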
6.3. Qualitative Analysis
Figure 4 presents qualitative detection results on representative scenes from the SUN RGB-D validation set. The figure illustrates how our model visualizes objects in real-world indoor environments. By representing detections as 3D Gaussian ellipsoids rather than rigid boxes, the model captures the actual shape and orientation of objects such as chairs, sofas, and desks more naturally. For instance, the ellipsoids stretch to follow the long edges of desks and sofas while remaining compact for smaller items like chairs. This allows the system to reflect the physical structure of the scene accurately.
However, a close look at the prediction results in Figure 4 shows that the rotation angles of the boxes do not always perfectly align with the ground truth. This misalignment typically occurs in cluttered office spaces or with objects that have nearly square dimensions, where it is difficult to distinguish the exact front and back from depth data alone. Nevertheless, our model handles these errors gracefully by increasing the size of the ellipsoid when it is uncertain. Instead of providing a wrong box with high confidence, the model signals its uncertainty through the spread of the Gaussian distribution, offering a more reliable and safety-aware detection result for practical use.
6.4. Ablation Study
To quantify the contribution of each architectural and representational component, we conduct systematic ablation experiments on three key aspects: (1) the effect of the Gaussian-based detection head versus traditional bounding box regression, (2) the impact of covariance parameterization complexity, and (3) the influence of transformer backbone depth on detection performance. All ablation experiments are conducted on the SUN RGB-D validation set using the same training protocol, and the results are summarized in Table 2. We note that certain components, such as the geometric transformations and the distributional loss formulation, are mathematically coupled. Isolating them independently leads to degenerate or physically inconsistent behavior; therefore, we focus our ablation on coherent modeling choices rather than artificially separated submodules.
Replacing the conventional bounding box regression head with a Gaussian detection head yields a substantial improvement of +4.1 mAP (from 57.8% to 61.9%). This improvement can be attributed to several fundamental differences between the two representations. First, the Gaussian formulation provides a continuous and smooth parameterization of object regions, which facilitates gradient-based optimization. In contrast, box regression with explicit orientation angles suffers from discontinuities at angle boundaries (e.g., angle wrapping) and ambiguities for near-square objects, where small perturbations can cause large changes in the predicted angle. The Gaussian covariance matrix naturally handles these cases through its eigenstructure, which varies smoothly with object orientation.
Second, the distributional losses (KL divergence and Bhattacharyya distance) used in Gaussian regression are geometrically meaningful and directly relate to the overlap between predicted and ground-truth regions. This contrasts with the commonly used smooth L1 loss for box regression, which treats center, size, and angle independently without considering their joint geometric effect. The distributional losses provide a more holistic training signal that encourages predictions to match the overall spatial extent of objects.
Third, the Gaussian representation inherently encodes uncertainty, which regularizes the network to produce predictions that are consistent with the underlying data distribution. Objects with high appearance variability or frequent occlusions naturally receive higher uncertainty estimates, leading to more calibrated confidence scores.
The choice of covariance parameterization significantly impacts detection performance. Using only diagonal covariance (3 parameters) achieves 58.7% mAP, representing a +0.9 mAP improvement over the baseline box head. However, upgrading to the full Cholesky parameterization (6 parameters) provides an additional +3.2 mAP gain, reaching 61.9% mAP. This result demonstrates the importance of modeling cross-axis correlations in 3D object representation.
From a geometric perspective, diagonal covariance restricts the predicted ellipsoid to be axis-aligned in the camera coordinate frame, which is a significant limitation for indoor objects that appear at arbitrary orientations. Real-world objects such as sofas, beds, and tables are often rotated relative to the camera, and their 3D extents exhibit strong correlations across axes. For example, a bed viewed at a 45-degree angle has correlated uncertainty in the camera’s x and z directions, which cannot be captured by diagonal covariance.
The full Cholesky parameterization allows the network to predict arbitrary ellipsoid orientations, effectively learning an implicit rotation that aligns with each object’s principal axes. This flexibility is particularly valuable for categories with elongated shapes (beds, tables, desks) or objects at non-canonical orientations. The Cholesky factorization guarantees positive definiteness regardless of network outputs, providing numerical stability during training while maintaining full expressiveness.
Reducing the transformer backbone depth from 12 layers to 8 layers leads to a performance drop of 1.5 mAP (from 61.9% to 60.4%). This result indicates that accurate estimation of 3D Gaussian parameters, particularly the covariance matrix, benefits substantially from deeper networks with stronger global context modeling capabilities.
The transformer’s self-attention mechanism enables each spatial location to aggregate information from the entire image, which is crucial for several aspects of Gaussian prediction. First, estimating the full extent of an object requires reasoning about its boundaries, which may be distant from the object center in the feature map. Deeper networks with more attention layers can propagate information across larger spatial extents more effectively. Second, the covariance estimation requires understanding the object’s shape and orientation, which often depends on contextual cues such as surrounding furniture, room layout, and perspective geometry. These cues are better captured by networks with larger effective receptive fields.
Third, the depth prediction component of our method benefits from global context because depth estimation in indoor scenes is inherently ambiguous from local appearance alone. Objects of similar appearance (e.g., chairs) can appear at vastly different depths, and disambiguating these cases requires understanding the overall scene structure. The 4-layer reduction from L = 12 to L = 8 decreases the network’s capacity for such global reasoning, leading to less accurate depth and covariance predictions.
The relatively modest performance drop (1.5 mAP) suggests that the Gaussian representation itself provides strong inductive bias that partially compensates for reduced backbone capacity. Nevertheless, the full L = 12 configuration remains preferable for achieving optimal performance.
7. Conclusions
We presented a probabilistic framework for RGB-D 3D object detection that models each object as a 3D Gaussian distribution rather than a deterministic bounding box. This formulation enables unified appearance–geometry encoding through early fusion in a vision transformer backbone and provides inherent uncertainty estimates via covariance prediction. The Cholesky-based parameterization ensures valid and stable covariance outputs, while distributional loss functions offer geometrically coherent supervision aligned with overlap metrics.
Our method achieves 61.9% mAP@0.25 on SUN RGB-D, outperforming several point-cloud methods (VoteNet, H3DNet, MLCVNet, and 3DETR) and RGB-D approaches (Frustum PointNets, PointFusion, and DSS). Although ImVoteNet reports slightly higher accuracy, our approach provides advantages not captured by box-based detectors, including calibrated uncertainty, metrically interpretable ellipsoids, and a simplified architecture without multi-stage fusion. The improvements observed on categories with soft boundaries, such as Sofa, highlight the benefits of full covariance modeling. Ablation studies further validate that both the Gaussian representation and the Cholesky parameterization contribute significantly to performance.
This work also has limitations. The current formulation operates on single frames, assumes single-ellipsoid representations, and carries computational cost due to the transformer backbone. Nevertheless, several avenues remain promising for future work, including temporal modeling for video, mixture-of-Gaussians representations for complex objects, lightweight backbone design for real-time deployment, and leveraging uncertainty for downstream tasks such as active perception.
In summary, Gaussian-based object representations offer a practical and uncertainty-aware alternative to box-based 3D detection, opening opportunities for more reliable perception in robotics and augmented reality applications.