Article

TCG-Depth: A Two-Stage Symmetric Confidence-Guided Framework for Transparent Object Depth Completion

1 Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo 315211, China
2 Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315201, China
* Author to whom correspondence should be addressed.
Symmetry 2026, 18(3), 405; https://doi.org/10.3390/sym18030405
Submission received: 26 January 2026 / Revised: 19 February 2026 / Accepted: 24 February 2026 / Published: 25 February 2026
(This article belongs to the Section Computer)

Abstract

Transparent objects present persistent challenges for RGB-D sensing due to complex refraction and reflection effects, which frequently produce unreliable depth measurements and compromise depth completion performance. In this paper, we propose TCG-Depth, a two-stage symmetric confidence-guided framework tailored for transparent object depth completion, wherein confidence estimation and depth refinement are synergistically integrated within a structurally consistent architecture. In the first stage, an initial dense depth map is synthesized from RGB images and sparse depth inputs, concomitant with a pixel-wise confidence map that characterizes the spatially varying reliability of depth predictions via an adaptive thresholding strategy. In the second stage, a confidence-aware depth refinement module is introduced, utilizing the predicted confidence to actively modulate intermediate feature representations. This mechanism enables the selective restoration of low-confidence regions while simultaneously preserving reliable depth information. Experimental results on the benchmark TransCG dataset demonstrate that the proposed framework exhibits superior performance compared to state-of-the-art depth completion methods, particularly in transparent object regions, thereby validating the effectiveness of the symmetric confidence-guided design.

1. Introduction

Transparent objects are pervasive across both daily life and industrial applications, most notably in laboratory environments where glassware, plastic containers, and specialized instruments are ubiquitous [1,2]. Due to the distinctive optical properties of transparent materials, their surfaces concurrently reflect and refract light, thereby violating the Lambertian assumption fundamental to standard RGB-D sensors [3]. This phenomenon frequently manifests as depth discontinuities, boundary artifacts, and data voids, which severely impair the fidelity of depth measurements. In practical scenarios, reliable perception of transparent objects is indispensable for robotic systems, particularly in the context of robotic manipulation and autonomous navigation [4]. Given that most robotic manipulation tasks necessitate high-precision visual perception for grasp synthesis [5], enhancing the perceptual robustness against transparent materials is of paramount importance. Consequently, achieving accurate reconstruction of transparent objects remains a formidable and critical challenge [6].
Driven by the rapid advancement of deep learning techniques, data-driven approaches for transparent object depth completion have gained significant traction. Existing methodologies in this field are generally bifurcated into single-view and multi-view paradigms.
Single-view methods, which operate on a single RGB or RGB-D frame, offer distinct advantages in architectural simplicity and practical deployment, making them highly sought-after in real-world applications. Existing single-view frameworks typically perform depth completion directly on RGB-D inputs to rectify missing or erroneous depth measurements. However, due to the intrinsic characteristics of transparent objects such as indistinct boundaries, sparse textures, and pronounced optical distortions, these methods often produce erratic or inaccurate reconstructions in transparent regions, especially on glassware where reliable depth cues are notoriously limited [7]. More importantly, conventional single-view approaches focus primarily on optimizing global accuracy while implicitly assuming that all pixel-level predictions are equally trustworthy. In perceptually-degraded regions such as transparent surfaces, this flawed assumption can propagate erroneous estimates, thereby compromising both depth completion quality and subsequent robotic perception tasks.
In contrast, multi-view methods leverage geometric constraints across multiple viewpoints, including disparity cues, cross-view consistency, or light-field information, to enhance reconstruction accuracy in transparent regions. However, these methods typically necessitate complex multi-view data acquisition or multi-camera configurations, imposing strict requirements on system calibration and precise camera motion control. Consequently, multi-view approaches incur prohibitive computational costs and hardware complexity, which significantly circumscribe their feasibility in real-world robotic deployments [8].
To address the instability of depth completion results in transparent regions and the lack of spatial reliability awareness in single-view settings, this paper proposes TCG-Depth, a two-stage confidence-aware depth completion method. Rather than treating all depth completion results as equally reliable, TCG-Depth explicitly leverages confidence estimation to differentiate the reliability of predictions across spatial regions. This confidence information serves as an active guide for the subsequent refinement, thereby improving both the accuracy and stability of depth completion in transparent areas.
In the first stage, TCG-Depth employs a vision-based depth completion network that fuses RGB images and sparse depth observations to generate a structurally consistent initial depth map. Based on these initial results, pixel-wise confidence prediction, combined with an image-level adaptive threshold, is used to assess relative reliability, resulting in a continuous soft confidence representation over spatial regions.
In the second stage, a confidence-aware depth refinement strategy is introduced, where the estimated confidence map adaptively modulates intermediate feature representations. This allows the model to preserve reliable depth information while selectively refining unreliable predictions. Consequently, depth completion quality and stability in transparent regions are significantly improved without the need for additional sensors or complex post-processing.
While uncertainty estimation has been extensively explored in depth completion tasks [9,10,11], most existing approaches primarily utilize uncertainty as a post hoc measure of prediction reliability or for re-weighting loss functions during training. In contrast, our proposed confidence modeling plays a fundamentally different role by explicitly integrating confidence as an active, feature-level modulation gate within the refinement process, rather than treating it merely as a secondary output metric. Unlike general epistemic or aleatoric uncertainty, which typically models stochastic sensor noise, our confidence map specifically characterizes the structural unreliability induced by the non-linear optical properties of transparent materials. This specialized design enables the framework to adaptively distinguish between structurally reliable regions and transparency-induced error regions, thereby guiding the second-stage refinement to selectively preserve valid geometric structures while targetedly correcting distorted transparent areas.
Most existing methods, including LIDF [12] and DFNet [13], treat all predicted pixels with equal importance during optimization. However, this uniform approach fails to account for the physical reality of transparent objects: depth errors are not distributed evenly. Distortions are far more severe at refractive boundaries than at stable object centers. To bridge this gap, we move away from blind global optimization. Instead, our TCG-Depth uses predicted confidence as a spatial guide to prioritize the refinement of these high-error regions, ensuring better geometric stability in complex optical scenes.
The main contributions of this work are summarized as follows:
  • We propose TCG-Depth, a two-stage confidence-guided depth completion framework specifically tailored for transparent objects, which introduces spatial reliability awareness to enhance the robustness of single-view depth completion.
  • We design a reliability estimation mechanism that integrates pixel-wise confidence prediction with an image-level adaptive threshold, enabling the framework to autonomously distinguish between reliable and unreliable regions across diverse scenes and transparent object geometries.
  • We develop a confidence-aware feature modulation strategy that selectively refines unreliable depth predictions while preserving high-confidence geometric information, significantly improving completion accuracy in challenging transparent regions.

2. Related Work

2.1. Transparent Object Perception

Over the past decade, transparent object perception has gained substantial prominence within the computer vision and robotics communities. Early studies primarily focused on recognition and segmentation [14,15], laying the foundation for more challenging geometric perception tasks. Progressing from these earlier works, transparent object depth completion has emerged as a critical frontier, where reliable sensing remains particularly formidable due to the intrinsic optical ambiguity introduced by transparent materials. To address such perceptual uncertainty, existing methodologies have explored the fusion of multi-modal signals, such as integrating RGB, infrared, or tactile data to complement geometric cues and enhance depth recovery robustness [16,17]. Furthermore, some approaches incorporate explicit 3D object priors, including CAD models or canonical geometric shapes, to facilitate pose estimation and geometric alignment, thereby regularizing depth inference and completion for transparent objects [18,19].
In transparent environments, depth completion faces intensified challenges because refraction and reflection effects fundamentally violate the Lambertian assumption. To tackle this issue, prior works have introduced active sensing devices or specialized acquisition setups to mitigate the ambiguity caused by transparent materials [20], analyzed imaging distortions stemming from non-linear light transport [21,22], and explored structured reconstruction strategies customized for transparent scenes [23,24,25]. Although these methods provide valuable insights into the optical characteristics of transparent objects, they typically rely on restrictive geometric assumptions or constrained imaging conditions, which limits their versatility in complex real-world environments. Concurrently, the paradigm has gradually shifted toward data-driven methods, enabling end-to-end depth completion from RGB or RGB-D inputs. However, in these contexts, depth completion is compromised not only by sensor noise but also by stochastic optical effects, frequently manifesting as misestimation, boundary artifacts, and volumetric data loss. Consequently, the refractive and reflective properties of transparent materials severely disrupt standard sensing mechanisms, rendering reliable depth completion for transparent objects substantially more challenging than for their opaque counterparts [26].
From the perspective of observation modalities, existing methods for this task can generally be categorized into multi-view and single-view approaches. Multi-view methods attenuate depth uncertainty by aggregating geometric information from multiple viewpoints, albeit at the cost of increased system complexity and deployment overhead. In contrast, single-view methods are preferable for practical deployment, despite the inherent scarcity of geometric cues and the difficulty in assessing spatial reliability. Therefore, the following subsections review multi-view and single-view transparent object perception methods in detail, and analyze their respective merits and limitations.

2.2. Single-View Transparent Object Perception

Owing to their minimal hardware footprint and ease of deployment, single-view transparent object perception methods have attracted extensive attention for practical robotic systems. Early methodologies typically formulated depth recovery as a direct regression problem from RGB or RGB-D inputs to dense maps via convolutional neural networks. While these methods achieved satisfactory results on opaque objects, their efficacy degrades in transparent regions, where texture information is inherently sparse and depth observations are notoriously unreliable.
To mitigate this issue, some studies introduce auxiliary geometric priors, such as surface normals or occlusion boundaries [27], or adopt sophisticated network architectures. Other works employ multi-task learning by jointly modeling depth estimation with complementary tasks, including semantic segmentation or structural feature detection [28]. However, these methods typically operate under the uniform assumption that all pixel-level depth predictions are equally reliable during optimization. Consequently, erroneous estimates in transparent regions can be easily propagated and amplified, detrimentally affecting downstream perception and manipulation tasks.
Most existing single-view methods rely on regression, normal estimation, or boundary-guided strategies. For example, ClearGrasp [27] first demonstrated the feasibility of RGB-D refinement for depth completion; LIDF [12] facilitates detail recovery via local implicit functions; and TDCNet [29] utilizes a Transformer-based architecture to improve contextual modeling. However, a critical gap in these representative works is the lack of explicit reliability awareness. For instance, while FDCT [30] and TCRNet [31] leverage cross-layer fusion or global context to improve performance, they typically treat all predicted pixels with uniform importance. In complex scenes where refraction and reflection induce severe optical distortions, these methods are prone to propagate errors from unreliable transparent regions to the entire depth map, leading to cluttered boundaries and persistent structural ambiguities.
The above analysis indicates that merely increasing network complexity or augmenting feature representations, as seen in methods like TranspareNet [32] which rely on implicit feature fusion, is often insufficient to fundamentally address the instability of depth completion. In contrast to existing approaches that overlook spatial reliability variations, considering region-wise reliability differences and applying targeted processing to error-prone transparent regions has received limited attention. While uncertainty estimation has been explored in general vision tasks [10], its integration as an active modulation mechanism—rather than a post hoc metric—remains underexplored in this field. This observation motivates our development of a confidence-guided strategy to explicitly distinguish and refine unreliable transparent regions.

2.3. Multi-View Transparent Object Perception

Multi-view transparent object perception methods reconstruct the three-dimensional geometry of a scene by aggregating observations from multiple viewpoints. By exploiting geometric constraints across different views, these approaches can partially mitigate the depth uncertainty arising from the complex refraction and reflection effects characteristic of transparent objects.
To address the challenges transparent materials impose on traditional stereo matching, several studies introduce additional priors or indirect supervision signals to fortify the robustness of multi-view systems. For example, ActiveZero enhances matching stability in transparent regions by combining active structured light projection with stereo vision, although it relies heavily on specific hardware configurations and controlled scene conditions [33]. Building upon this framework, ActiveZero++ refines the active sensing strategy and learning formulation, yielding superior depth recovery performance in transparent and highly reflective regions [34]. More recently, advanced learning-based modeling has been integrated with optical measurement techniques to further enhance robustness. Lu et al. proposed a generative deep-learning-embedded framework for asynchronous structured light imaging, which effectively mitigates phase error artifacts in complex optical conditions [35]. Similarly, Su et al. demonstrated the importance of measurement consistency in stereo imaging through wavelet analysis, offering valuable insights for maintaining geometric stability in challenging sensing environments [36].
Dex-NeRF utilizes neural radiance field representations to model transparent objects from multi-view RGB images, facilitating both geometric reconstruction and grasp planning [23]. Furthermore, GraspNeRF integrates multi-view NeRF-based reconstruction with 6-DoF grasp detection, achieving joint perception and manipulation of transparent and specular objects [37]. In addition, ASGrasp employs an RGB-D active stereo system enhanced by multi-view observations, fusing RGB and infrared information to simultaneously reconstruct the visible and occluded geometry of transparent objects [8]. StereoPose directly estimates object poses from stereo images by exploiting multi-view geometric cues without the need for explicit depth reconstruction, albeit still requiring a synchronized stereo camera setup [38].
However, from a practical deployment perspective, multi-view methods typically necessitate multi-camera systems or protracted data acquisition sequences, which elevate hardware complexity, computational overhead, and deployment expenditure. In robotic applications with stringent requirements for real-time performance, system minimalism, and robustness, such methods present significant integration hurdles. Consequently, despite their theoretical advantages, the practicality of multi-view approaches remains constrained, motivating the growing interest in single-view transparent object depth completion methods that rely exclusively on a single input frame.

3. Method

3.1. Overview

Achieving high-fidelity depth completion for transparent objects under single-view RGB-D observations presents a formidable challenge, as depth measurements are severely degraded by refraction and reflection effects inherent to transparent materials. Given an RGB image captured by an RGB-D sensor and its corresponding sparse raw depth map, the objective of this work is to synthesize a refined dense depth map that can reliably support downstream robotic perception tasks.
Formally, let $X_{RGB} \in \mathbb{R}^{H \times W \times 3}$ denote the input RGB image and $X_D \in \mathbb{R}^{H \times W}$ denote the raw depth map. Due to transparency-induced optical distortions, $X_D$ often contains significant data voids and stochastic measurement errors. The goal of transparent object depth completion is to infer a refined depth map $D \in \mathbb{R}^{H \times W}$ from these degraded observations.
A key observation driving this work is that depth completion results on transparent objects exhibit highly non-uniform reliability across spatial locations. Even within the same transparent object, the fidelity of completed depth values can vary significantly due to varying angles of refraction and partial observability. However, existing paradigms typically enforce a uniform optimization strategy across the entire image domain, implicitly assuming a homogeneous confidence distribution. Such indiscriminate optimization ignores intrinsic reliability variations and often leads to error propagation and unstable depth estimation.
A pivotal attribute of the proposed framework is its architectural and functional symmetry, defined here as the structural congruence between the initial estimation and refinement modules. By deploying mirroring encoder–decoder architectures in both stages, we guarantee that the receptive fields and feature resolutions remain strictly aligned throughout the pipeline. This design is paramount for spatial coherence, ensuring that the pixel-wise confidence map generated in the first stage maps precisely to the features modulated in the second stage. Unlike asymmetric configurations, which risk feature space misalignment or receptive field discrepancies due to disparate backbone capacities, our symmetric design facilitates the seamless propagation of confidence cues. Consequently, this structural alignment underpins robust depth recovery, a prerequisite for correcting the complex non-linear distortions typical of transparent objects.
Motivated by this observation, we propose TCG-Depth, a confidence-guided depth completion framework for single-view RGB-D transparent object perception. As illustrated in Figure 1, the framework consists of three synergistic components. First, an initial depth completion network (Section 3.2) synthesizes a dense depth map from RGB-D inputs. Then, a confidence mask estimation module (Section 3.3) quantifies the reliability of the completed depth at each spatial location. Finally, a confidence-aware depth refinement network (Section 3.4) selectively rectifies unreliable depth predictions while preserving stable depth structures.
Specifically, the overall depth completion process can be formulated as:
$(D_0, C) = F_1(X_{RGB}, X_D), \qquad D = F_2(X_{RGB}, D_0, C),$
where $D_0$ denotes the initially completed depth map and $C$ represents the predicted pixel-wise confidence map. To ensure numerical stability and avoid the risk of recursive dependency, TCG-Depth is trained in a strictly sequential two-stage manner. Stage 1 (Initial Completion and Confidence Estimation) is trained to convergence first; subsequently, its parameters are frozen, and Stage 2 (Refinement) is trained using the static outputs of Stage 1 as deterministic priors. This decoupling prevents unstable feedback loops and ensures that the refinement module learns to rectify errors based on a consistent reliability distribution. Here, $F_1(\cdot)$ corresponds to the integration of the initial depth completion network and the confidence estimation module, while $F_2(\cdot)$ denotes the confidence-aware depth refinement network. By leveraging confidence information to guide selective refinement, TCG-Depth effectively mitigates depth estimation instability caused by refraction and missing measurements, while maintaining high efficiency and practical deployability under a single-view RGB-D setting.
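The two-stage composition above can be sketched as a simple function pipeline. The stand-ins `f1` and `f2` below are hypothetical toy functions, not the paper's networks; they only illustrate the data flow in which Stage 1 emits a depth and confidence pair that Stage 2 consumes as a fixed prior:

```python
import numpy as np

def tcg_depth_infer(x_rgb, x_d, f1, f2):
    """Two-stage inference: F1 yields the initial depth D0 and confidence C;
    F2 refines D0 under confidence guidance. Stage 1 is frozen when Stage 2
    is trained, so C acts as a deterministic prior here."""
    d0, conf = f1(x_rgb, x_d)
    return f2(x_rgb, d0, conf)

# Toy stand-ins: f1 fills depth holes with a constant and marks filled
# pixels as low confidence; f2 blends toward the scene mean where
# confidence is low.
f1 = lambda rgb, d: (np.where(d > 0, d, 1.0), (d > 0).astype(float))
f2 = lambda rgb, d0, c: c * d0 + (1.0 - c) * d0.mean()
```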

3.2. Initial Depth Completion

The primary objective of the initial depth completion stage is to reconstruct dense geometry from severely degraded RGB-D observations, where extensive data voids and transparency-induced distortions render direct depth estimation highly ill-posed. In such scenarios, standard depth regression paradigms often gravitate towards local interpolation or the propagation of sparse artifacts, consequently yielding over-smoothed predictions and unstable depth structures on transparent objects.
To circumvent this issue, we design an encoder–decoder network that explicitly integrates appearance cues and sparse depth information across multiple scales. As illustrated in Figure 2, the network follows a symmetric multi-scale architecture with dense feature aggregation, building upon residual dense U-Net designs that have proven effective for image restoration tasks [39]. The encoder progressively extracts contextual representations through hierarchical downsampling [40], while the decoder symmetrically restores spatial resolution to recover coherent depth structures.
Formally, given an RGB image X R G B and a raw depth map X D , the network input is constructed via channel-wise concatenation:
$X = \mathrm{Concat}(X_{RGB}, X_D),$
facilitating joint reasoning over visual appearance and sparse geometric measurements. The network outputs an initial depth prediction $D_0 \in \mathbb{R}^{H \times W}$.
A key design consideration is the effective utilization of valid depth measurements without propagating unreliable values into transparent regions. To this end, depth information is injected into the feature encoding process at multiple scales. At each encoding level, the downsampled depth map is concatenated with the intermediate feature representation:
$F^{(l)} = \Phi^{(l)}\big(\mathrm{Concat}\big(F^{(l-1)}, \downarrow^{(l)}(X_D)\big)\big),$
which encourages the network to preserve reliable geometric cues while learning to infer missing or distorted depth in a structure-consistent manner.
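The per-level injection can be sketched as follows. The downsampling mode and the channel widths of the per-level conv block $\Phi^{(l)}$ are assumptions for illustration; the paper specifies only that the downsampled depth is concatenated with the intermediate features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def inject_depth(feat, raw_depth, phi):
    """One encoder level: downsample the raw depth map to the current
    feature resolution, concatenate it as an extra channel, then apply
    the level's conv block phi (which must accept C_in + 1 channels)."""
    d = F.interpolate(raw_depth, size=feat.shape[-2:], mode='nearest')
    return phi(torch.cat([feat, d], dim=1))

# Illustrative level: 16 feature channels + 1 depth channel -> 32 channels.
phi = nn.Conv2d(16 + 1, 32, kernel_size=3, padding=1)
```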
Although multi-scale encoding helps capture global context, refraction and reflection effects introduced by transparent objects often lead to feature ambiguities and high-frequency noise, which can interfere with reliable depth inference. To suppress these artifacts, we incorporate an attention mechanism into the network. Specifically, a CBAM (Convolutional Block Attention Module) [41] is employed to adaptively recalibrate feature responses along channel and spatial dimensions:
$\tilde{F} = A_s(A_c(F)) \odot F,$
where $A_c(\cdot)$ and $A_s(\cdot)$ denote channel-wise and spatial attention operations, respectively. By guiding the network to focus on more reliable features at each convolutional stage, the attention mechanism mitigates the tendency of depth regression to collapse into trivial interpolation.
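A minimal sketch of this recalibration in the spirit of CBAM is given below. The reduction ratio and spatial kernel size are assumptions taken from the original CBAM design, not values stated in this paper:

```python
import torch
import torch.nn as nn

class CBAMLite(nn.Module):
    """Minimal channel + spatial attention, approximating
    F_tilde = A_s(A_c(F)) * F from the text."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, f):
        b, c, _, _ = f.shape
        # Channel attention from a globally pooled descriptor.
        a_c = self.mlp(f.mean(dim=(2, 3))).view(b, c, 1, 1)
        fc = f * a_c
        # Spatial attention from channel-wise mean and max maps.
        desc = torch.cat([fc.mean(1, keepdim=True),
                          fc.max(1, keepdim=True).values], dim=1)
        a_s = self.spatial(desc)
        return f * a_c * a_s
```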
The decoder mirrors the encoder structure and progressively reconstructs spatial resolution through a series of upsampling stages with skip connections. These skip connections facilitate the fusion of high-level contextual information and low-level spatial details, which is essential for preserving depth discontinuities and object boundaries. Finally, a lightweight regression head maps the decoded features to the initial depth estimate:
$D_0 = H(F),$
where $F$ denotes the final decoded feature map. The resulting depth map $D_0$ provides a dense and structurally coherent initialization for subsequent pixel-wise confidence estimation and confidence-aware refinement.

3.3. Confidence Mask Estimation

While the initial depth completion stage yields a structurally coherent depth map, the stochastic nature of refractive distortions renders the reliability of these predictions spatially non-uniform. In practice, regions subject to strong specular reflections or transmission ambiguities often exhibit significant measurement errors. Explicitly identifying such unreliable regions is, therefore, a prerequisite for guiding subsequent refinement.
To this end, we introduce a dedicated Confidence Mask Estimation module. As illustrated in Figure 3, this module operates in a hierarchical manner: predicting a dense pixel-wise confidence map, estimating a scene-adaptive threshold, and fusing them into a soft modulation mask.

3.3.1. Pixel-Wise Confidence Prediction

Given the initial depth prediction $D_0$, a pixel-wise confidence map $C \in [0, 1]^{H \times W}$ is computed as:
$C = \sigma\big(G(D_0)\big),$
where $\sigma(\cdot)$ denotes the sigmoid activation function [42]. As illustrated in Figure 3, the network $G(\cdot)$ is implemented as a lightweight sequence of three successive convolutional layers. This design prioritizes computational efficiency while capturing local structural anomalies and gradient inconsistencies, the typical signatures of refraction-induced errors manifested within the initial depth map $D_0$. By processing the dense geometry through these layers, the module generates a raw reliability score for each pixel, which is subsequently normalized to the $[0, 1]$ range to serve as the baseline for spatial reliability awareness.
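The three-convolution-plus-sigmoid head can be sketched as follows; the hidden channel width is an assumption, since the paper specifies only the layer count:

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Three stacked 3x3 convolutions over the initial depth D0,
    followed by a sigmoid, producing a pixel-wise confidence map
    in (0, 1). Hidden width is illustrative."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1))

    def forward(self, d0):
        return torch.sigmoid(self.net(d0))
```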

3.3.2. Scene-Adaptive Threshold Regression

Reliability is relative; a confidence score considered “high” in a complex scene might be insufficient in a simple one. To determine a threshold that dynamically adapts to varying optical complexities, we feed the confidence map $C$ into a threshold predictor $T(\cdot)$:
$\tau = T(C).$
The predictor $T(\cdot)$ employs a “Conv-GAP-FC” structure to aggregate global scene statistics. It consists of three convolutional layers for feature compression, followed by a GAP (Global Average Pooling) operation that collapses spatial dimensions into a distinct scene-level context vector. Finally, two FC (Fully Connected) layers with a ReLU (Rectified Linear Unit) bottleneck regress the scalar threshold $\tau \in [0, 1]$. This mechanism empowers the framework to adaptively shift the refinement focus based on the global reliability distribution of the current scene.
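The Conv-GAP-FC structure can be sketched as below. Channel widths and kernel sizes are assumptions; only the layer sequence (three convs, global average pooling, two FC layers with a ReLU bottleneck, sigmoid output) follows the text:

```python
import torch
import torch.nn as nn

class ThresholdPredictor(nn.Module):
    """Conv-GAP-FC regressor mapping a confidence map C to a scalar
    scene-adaptive threshold tau in (0, 1). Widths are illustrative."""
    def __init__(self, hidden=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU())
        self.fc = nn.Sequential(
            nn.Linear(hidden, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 1), nn.Sigmoid())

    def forward(self, c):
        f = self.conv(c)
        g = f.mean(dim=(2, 3))   # GAP -> scene-level context vector
        return self.fc(g)        # tau in (0, 1), one per image
```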

3.3.3. Soft Confidence-Gating Function

Conventional binary gating strategies often introduce abrupt geometric discontinuities at the boundary of reliable regions. To mitigate this, we utilize a soft confidence function to model the gradual transition of reliability:
$M = \dfrac{2}{1 + \exp\big(\epsilon\,(C - \tau)\big)},$
where $\epsilon$ is a scaling hyper-parameter controlling the transition sharpness (empirically set to 10). This formulation yields a smooth and differentiable mask $M$ that suppresses unreliable depth predictions while retaining stable geometric anchors through a continuous gradient.
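The gating function is a shifted, scaled logistic and is straightforward to compute; a minimal sketch:

```python
import numpy as np

def soft_gate(confidence, tau, eps=10.0):
    """Soft confidence gate M = 2 / (1 + exp(eps * (C - tau))).

    M approaches 2 where confidence falls below the scene threshold tau
    (strong refinement signal) and 0 where confidence exceeds it
    (features preserved); M = 1 exactly at C = tau. eps controls the
    transition sharpness."""
    return 2.0 / (1.0 + np.exp(eps * (confidence - tau)))
```

At `C = tau` the mask is exactly 1, and the slope there is `-eps / 2`, so larger `eps` approaches hard thresholding while small `eps` approaches a flat linear blend.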
As visualized in Figure 4, compared to hard thresholding (Binary) or simple linear scaling, the proposed adaptive soft masking mechanism provides more stable and flexible region selection. Furthermore, this continuous gating mechanism works synergistically with the symmetric architecture described in Section 3.1. By ensuring that high-level semantic priors and low-level geometric details are modulated synchronously without spatial discretization, we effectively avoid the boundary artifacts common in asymmetric or hard-switching refinement pipelines.

3.3.4. Confidence Supervision

For the soft gating mechanism to function effectively, the predicted confidence $C$ must be more than a latent feature; it must be physically grounded in the actual reconstruction quality. To guarantee this, we incorporate explicit supervision derived from the pixel-wise discrepancy between the initial prediction $D_0$ and the ground truth $D_{gt}$.
We define the confidence supervision loss $\mathcal{L}_{conf}$ as:
$\mathcal{L}_{conf} = \big\| C - \exp\big(-\alpha\,|D_0 - D_{gt}|\big) \big\|_1,$
where $\alpha$ is a sensitivity scaling factor (empirically set to 5.0). This formulation establishes a direct mapping between geometric error and reliability scores: as the reconstruction error $|D_0 - D_{gt}|$ increases, typically in regions with severe refractive distortion, the target confidence exponentially decays toward zero. Conversely, structurally stable regions are compelled to approach a confidence of one. By grounding the confidence estimation in actual physical error distributions, we preclude the network from converging toward trivial solutions and ensure that the confidence map serves as a faithful, error-aware guide for the subsequent refinement stage.
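The loss reduces to an L1 distance between the predicted confidence and an error-derived target; a minimal sketch, assuming a mean reduction over pixels:

```python
import numpy as np

def conf_loss(conf, d0, d_gt, alpha=5.0):
    """L1 loss between predicted confidence and the target
    exp(-alpha * |D0 - Dgt|): zero reconstruction error maps to a
    target confidence of 1, large error decays the target toward 0."""
    target = np.exp(-alpha * np.abs(d0 - d_gt))
    return float(np.mean(np.abs(conf - target)))
```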

3.4. Confidence-Aware Depth Refinement

Upon deriving the pixel-wise confidence estimation, the framework explicitly identifies regions where the initial depth prediction is compromised. In our sequential training paradigm, the refinement network $R(\cdot)$ is optimized while the weights of the preceding modules ($F_1$) are fixed. However, leveraging this information to refine geometry without corrupting high-confidence regions presents a delicate challenge. Indiscriminate refinement of the entire depth map risks degrading essentially correct geometric anchors, leading to over-smoothing or structural deterioration.
To address this, we design a confidence-aware depth refinement network that utilizes confidence cues to strictly regulate feature propagation and depth correction. As illustrated in Figure 5, the refinement network takes the initial depth prediction $D_0$, its corresponding confidence map $C$, and the RGB image as inputs to synthesize the final refined depth map $D$.
Formally, the refinement process is expressed as:
$$D = R\left(X_{RGB}, D_0, C\right),$$
where $R(\cdot)$ denotes the confidence-aware refinement network. Distinct from the initial completion stage, this network does not aim to reconstruct depth ab initio, but focuses on selectively rectifying unreliable predictions conditioned on confidence priors.
In stark contrast to conventional paradigms where uncertainty estimation serves merely as a post hoc metric or a loss-reweighting scalar, our approach integrates confidence as an active modulation gate directly within the feature encoding process. While prior attention mechanisms typically apply global recalibration, our confidence-aware module selectively intensifies feature transformation in low-confidence regions while explicitly preserving features in high-confidence areas. This design ensures targeted rectification of refractive distortions without compromising valid geometric structures, a capability absent in methods that lack region-specific refinement constraints.
The refinement network employs a multi-scale encoder–decoder architecture, yet differs fundamentally from the initial stage in its incorporation of confidence. At multiple encoding stages, confidence-guided feature gating is applied to modulate intermediate representations:
$$\tilde{F}^{(l)} = G\left(F^{(l)}, C\right),$$
where $F^{(l)}$ denotes the feature map at scale $l$, and $G(\cdot)$ represents the confidence-gated modulation function. This gating mechanism attenuates feature updates in high-confidence regions while amplifying corrective signals in low-confidence areas. Unlike standard attention mechanisms that apply global feature weighting, this method adjusts the magnitude of feature updates according to local prediction reliability, empowering the model to correct unreliable regions.
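One plausible realization of such a gating function, sketched here in PyTorch under the assumption that the correction is a residual update scaled by $(1 - C)$ (the module name and layer choices are hypothetical, not the paper's exact design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceGate(nn.Module):
    """Sketch of confidence-gated feature modulation: high-confidence pixels
    keep their features, low-confidence pixels receive a stronger correction."""
    def __init__(self, channels):
        super().__init__()
        self.transform = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat, conf):
        # Resize the confidence map to the current feature resolution.
        conf = F.interpolate(conf, size=feat.shape[-2:],
                             mode='bilinear', align_corners=False)
        update = self.transform(feat)
        # Residual update attenuated where confidence is high.
        return feat + (1.0 - conf) * update
```

By construction, a fully confident region ($C = 1$) passes its features through unchanged, while $C \to 0$ admits the full corrective signal.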
Furthermore, the refinement network incorporates multi-scale geometric guidance by injecting downsampled versions of the initial depth prediction $D_0$ into both the encoding and decoding processes. Combined with skip connections, this design facilitates the effective propagation of global contextual information and local geometric details.
The decoder progressively restores spatial resolution through a sequence of upsampling stages and produces the final refined depth map via a lightweight prediction head. By conditioning depth refinement on confidence information, the proposed network performs targeted rectification rather than global regression, resulting in significantly more stable and accurate depth estimates in complex transparent object scenes.

4. Experiments

4.1. Datasets

We evaluate the proposed method on the TransCG dataset [13], a large-scale real-world benchmark for transparent object perception. TransCG contains 57,715 RGB images with corresponding high-quality depth annotations acquired by RGB-D sensors, covering 51 categories of transparent objects and approximately 200 categories of opaque objects. All images are captured in diverse real indoor environments, ranging from simple scenes to complex and cluttered settings, which closely resemble real-world robotic manipulation and interaction scenarios. In our experiments, we follow the official data split and evaluation protocol provided by TransCG to ensure fair comparison and reproducibility. All methods are evaluated under the same RGB-D depth completion setting.

4.2. Evaluation Metrics

To quantitatively evaluate depth completion performance, we adopt standard evaluation metrics widely used in the transparent object perception and depth completion literature [12,27]. These metrics assess prediction quality from both absolute and relative error perspectives.
Root Mean Squared Error (RMSE) measures the overall deviation between the predicted depth map and the ground-truth depth map, defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2},$$
where $N$ denotes the number of valid pixels within the evaluation mask, $y_i$ is the predicted depth value, and $\hat{y}_i$ is the corresponding ground-truth depth value.
Absolute Relative Error (REL) evaluates the average relative difference between the predicted and ground-truth depths:
$$\mathrm{REL} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| y_i - \hat{y}_i \right|}{\hat{y}_i}.$$
Mean Absolute Error (MAE) computes the average absolute difference between the predicted and ground-truth depths:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|.$$
In addition, we report the Accuracy under Threshold metric, which measures the percentage of valid pixels satisfying:
$$\max\left( \frac{y_i}{\hat{y}_i}, \frac{\hat{y}_i}{y_i} \right) < \delta.$$
Following common practice, we evaluate accuracy with $\delta = 1.05$, $1.10$, and $1.25$. This metric reflects the robustness of depth completion results under different error tolerance levels, and is particularly informative for assessing depth stability in challenging regions.
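The four metrics above can be computed with a short NumPy sketch (the `depth_metrics` helper and its valid-pixel masking convention are illustrative assumptions, not the official evaluation code):

```python
import numpy as np

def depth_metrics(pred, gt, mask=None):
    """RMSE, REL, MAE, and threshold accuracies over valid pixels."""
    if mask is None:
        mask = gt > 0  # assume zero ground truth marks invalid pixels
    y, y_hat = pred[mask], gt[mask]
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    rel = np.mean(np.abs(y - y_hat) / y_hat)
    mae = np.mean(np.abs(y - y_hat))
    # Accuracy under threshold: fraction of pixels with max ratio below delta.
    ratio = np.maximum(y / y_hat, y_hat / y)
    acc = {d: float(np.mean(ratio < d)) for d in (1.05, 1.10, 1.25)}
    return rmse, rel, mae, acc
```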

4.3. Implementation Details and Baselines

Our framework is implemented using PyTorch (version 1.9.0) and trained on a single NVIDIA RTX 4090 GPU. We utilize the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$ and a batch size of 4, training the network for a total of 40 epochs. To balance computational efficiency with the preservation of fine spatial details, all input images are resized to a resolution of $380 \times 240$. To enhance model robustness and generalization, standard data augmentation techniques are employed during training, including random horizontal flipping, color jittering, and random cropping. Notably, no additional post-processing is applied during inference.
To ensure a fair and rigorous comparison, we locally reproduced several representative state-of-the-art methods on the TransCG dataset using the same NVIDIA RTX 4090 hardware. The selected baselines encompass a wide range of technical paradigms, including the iterative self-correction framework LIDF, and the geometric optimization-based ClearGrasp. We also benchmarked against fusion-centric models such as FDCT, TranspareNet, and DFNet, which utilize multi-modal guidance to mitigate optical uncertainty.

4.4. Ablation Studies

To rigorously analyze the contribution of each component within the proposed TCG-Depth framework, we conduct a comprehensive series of ablation studies on the TransCG dataset. These experiments are designed to systematically evaluate how different architectural choices and module configurations influence depth completion performance, particularly in challenging transparent object scenes.
First, we dissect the overall framework by incrementally introducing key components: the base depth completion network, the confidence prediction module, and the confidence-guided refinement stage. This incremental setup allows us to isolate the specific impact of each component and assess their synergistic effect on final performance.
Next, we examine the fidelity of the predicted confidence maps through a quantitative region-wise error analysis. By stratifying results into high-confidence and low-confidence areas, this analysis aims to verify whether the predicted confidence scores effectively correlate with the actual reliability of depth completion results.
Finally, we conduct a targeted ablation study on the confidence-guided gating mechanism employed in the second-stage refinement. We evaluate three different functional forms for mapping the adaptively predicted threshold τ to the final gating weights. Specifically, we compare a binary mask, which performs a hard partition of confidence values into reliable and unreliable regions; a linear mask, which applies a linear ramp centered at τ to generate continuous gating weights; and the proposed soft mask, which employs a sigmoid-shaped mapping centered at τ to produce smoothly varying weights. This comparison aims to analyze how the differentiability and continuity of the gating function influence the ability to seamlessly distinguish reliable anchors from unreliable predictions.
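The three gating variants compared here can be sketched as simple NumPy functions (the ramp width and sigmoid sharpness parameters are illustrative assumptions):

```python
import numpy as np

def binary_mask(c, tau):
    # Hard partition: 1 for reliable pixels, 0 for unreliable ones.
    return (c >= tau).astype(np.float32)

def linear_mask(c, tau, width=0.2):
    # Linear ramp of the given width centered at tau.
    return np.clip((c - tau) / width + 0.5, 0.0, 1.0)

def soft_mask(c, tau, k=10.0):
    # Sigmoid-shaped mapping centered at tau; k controls transition sharpness.
    return 1.0 / (1.0 + np.exp(-k * (c - tau)))
```

All three agree that confidence exactly at the threshold maps near the midpoint; they differ in how abruptly the weights transition around it, which is precisely what the ablation probes.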
Unless otherwise stated, all ablation experiments are conducted under the identical training and evaluation protocols described in the previous subsections.

4.4.1. Effectiveness of the Overall Framework Design

To isolate and quantify the contribution of each component within the TCG-Depth architecture, we perform a stepwise ablation study on the complete pipeline. This experiment rigorously examines the impact of input modality fusion, the attention mechanism in Stage1-a, and the confidence-guided refinement strategy in Stage2.
The quantitative results are summarized in Table 1. Models relying on a single modality (RGB-only or Depth-only) exhibit markedly higher error rates, as evidenced by elevated RMSE and REL metrics. This confirms that neither modality alone suffices for robust depth recovery in transparent scenes: RGB data lacks explicit geometric depth, while raw depth maps suffer from severe data loss due to transparency. As visualized in Figure 6, the fusion of RGB and Depth modalities yields significantly more plausible structures and reduces artifacts, validating the necessity of multimodal integration.
Building upon the RGB+Depth baseline, the incorporation of the Convolutional Block Attention Module (CBAM) in Stage1-a further enhances depth completion accuracy, as detailed in Table 1. The qualitative improvements are particularly evident around object boundaries and thin transparent structures. Figure 7 corroborates this, showing that the attention-augmented Stage1-a produces smoother and more coherent depth predictions compared to the baseline.
We further investigate the efficacy of the second-stage refinement. Direct refinement without confidence-guided gating (Stage2 w/o conf-gating) yields marginal gains and risks modifying high-confidence regions unnecessarily. Conversely, the full TCG-Depth framework employs selective updates based on predicted confidence, thereby refining unreliable regions while anchoring stable areas. This strategy achieves the best overall performance across all metrics. As illustrated in Figure 7, our confidence-guided refinement effectively suppresses residual noise and structural inconsistencies in transparent regions, while maintaining the integrity of stable depth structures. These findings substantiate the effectiveness of the proposed two-stage symmetric design.

4.4.2. Effectiveness of Confidence-Based Region Separation

This study evaluates the discriminative power of the confidence estimation module. By partitioning the predicted depth maps into high-confidence and low-confidence subsets based on the predicted confidence scores, we analyze the region-wise error distribution. This analysis serves to verify the correlation between predicted confidence and actual depth completion accuracy.
As presented in Table 2, high-confidence regions consistently exhibit substantially lower errors across all evaluation metrics. Specifically, the RMSE, REL, and MAE in high-confidence regions are 0.009, 0.012, and 0.006, respectively, whereas low-confidence regions show significantly higher error rates. The full-scene performance represents an aggregation of these distinct distributions.
These results demonstrate a strong alignment between the predicted confidence scores and the physical depth errors, confirming that the learned confidence map serves as a reliable indicator of prediction fidelity. This validation supports the core premise of our framework: using confidence as a guiding signal for region-adaptive refinement.
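The region-wise stratification underlying this analysis can be sketched as follows (the helper and the fixed threshold are illustrative, not the exact analysis code):

```python
import numpy as np

def regionwise_rmse(pred, gt, conf, tau=0.5):
    """RMSE computed separately over high- and low-confidence pixels."""
    hi = conf >= tau
    def rmse(m):
        return float(np.sqrt(np.mean((pred[m] - gt[m]) ** 2))) if m.any() else float('nan')
    return rmse(hi), rmse(~hi)
```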

4.4.3. Effectiveness of Different Confidence-Based Refinement Strategies

To systematically evaluate the impact of different gating mechanisms in Stage 2, we compare three variants: Binary Confidence Gating (B-Mask), Linear Confidence Gating (L-Mask), and the proposed Soft Confidence Gating (S-Mask). These strategies differ in how they construct the gating mask from the predicted confidence scores, while sharing identical backbones and training settings to ensure fairness.
As reported in Table 3, both binary and linear gating provide consistent improvements over the no-gating baseline, indicating the fundamental benefit of confidence-weighted refinement. However, the marginal performance gap between them suggests that simple mask-based modulation is insufficient to fully capture the nuances of prediction uncertainty. In contrast, the proposed soft confidence-guided gating achieves the best performance across RMSE, MAE, and REL metrics, clearly outperforming the other strategies.
Qualitative results in Figure 8 further corroborate these findings. The Binary Confidence Mask (B-Mask) introduces noticeable artifacts and sharp transitions around object boundaries, while the Linear Confidence Mask (L-Mask) partially mitigates these issues but still suffers from boundary blurring and residual noise. Conversely, the proposed Soft Confidence Mask (S-Mask) effectively suppresses error propagation in low-confidence regions while preserving structural integrity in high-confidence areas, resulting in significantly smoother and more consistent depth maps. This demonstrates that a differentiable, soft-gating mechanism is superior to rigid thresholding for confidence-based refinement.
To further validate the necessity of the adaptive mechanism, we performed ablation studies by replacing the predicted threshold with fixed scalar values (e.g., $\tau = 0.5$ and $\tau = 0.6$). We observed a consistent performance degradation, characterized by an approximate 8% increase in RMSE. This decline stems from the inherent variability of confidence distributions across diverse scenes; a static threshold struggles to accommodate fluctuations in lighting and object composition, leading to suboptimal region separation. By contrast, our adaptive module functions as a global context aggregator comparable to channel attention mechanisms, allowing it to dynamically calibrate the threshold for each specific input and thereby ensuring robust feature gating and training stability.
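A minimal PyTorch sketch of how such a global-context threshold predictor might look, assuming a squeeze-and-excitation-style design in which pooled confidence statistics feed a small MLP (all layer choices and the two-statistic descriptor are hypothetical):

```python
import torch
import torch.nn as nn

class AdaptiveThreshold(nn.Module):
    """Sketch: predicts a scene-specific threshold tau in (0, 1) from
    globally pooled statistics of the confidence map."""
    def __init__(self, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, conf):
        # Global mean and std of confidence act as the pooled descriptor.
        stats = torch.stack(
            [conf.mean(dim=(1, 2, 3)), conf.std(dim=(1, 2, 3))], dim=1)
        return self.mlp(stats)  # one tau per image
```

Because the sigmoid bounds the output, the predicted threshold always stays in a valid confidence range, regardless of the input scene.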

4.5. Comparison to State-of-the-Art Methods

Table 4 presents a quantitative comparison with state-of-the-art transparent object depth completion methods on the TransCG benchmark. The proposed method achieves the lowest errors across all three core error metrics, with RMSE, REL, and MAE reaching 0.013, 0.018, and 0.009, respectively. Notably, it consistently outperforms competitive baselines such as FDCT and TCRNet. Moreover, our method attains the highest accuracy under the stringent $\delta < 1.05$ and $\delta < 1.10$ thresholds, indicating superior reliability in producing precise depth predictions. Furthermore, as shown in Table 4, our method maintains a competitive inference speed of 21.4 FPS and a parameter count of 38.2 M, demonstrating its suitability for real-time deployment on standard hardware.
Figure 9 provides qualitative comparisons on representative transparent object scenes. Existing methods such as FDCT and TranspareNet frequently exhibit depth blurring and geometric discontinuities, particularly in large transparent regions and near object boundaries. Although TCRNet alleviates some of these artifacts, local distortions persist in cluttered and occluded scenarios. In contrast, the proposed method more accurately recovers object contours and relative depth ordering (as highlighted in the zoomed-in regions), significantly reducing boundary artifacts and noise. This results in depth maps that are both globally consistent and locally faithful to the ground truth.
Collectively, both quantitative and qualitative results demonstrate that TCG-Depth achieves superior robustness and generalization. Compared with approaches relying on fixed geometric assumptions or unguided local refinement, our confidence-guided strategy effectively mitigates the influence of unreliable regions, leading to more stable and accurate depth recovery in complex transparent scenes. These results strongly validate the effectiveness of the proposed symmetric, confidence-aware design.

4.6. Cross-Dataset Generalization

To further evaluate the robustness and transferability of the proposed TCG-Depth, we conducted a zero-shot cross-dataset evaluation on the ClearGrasp-real dataset. In this experiment, our model was trained exclusively on the TransCG dataset and directly evaluated on the real-world test suite of ClearGrasp without any fine-tuning or domain adaptation. This setting poses a significant challenge due to the domain gap in sensor noise profiles and lighting conditions between the two datasets.
As summarized in Table 5, TCG-Depth demonstrates superior generalization capabilities by outperforming existing state-of-the-art methods across nearly all key metrics. Notably, our method achieves the lowest RMSE (0.031) and the highest $\delta_{1.05}$ (67.26%). The substantial lead in the strictly defined $\delta_{1.05}$ metric is particularly significant, as it indicates that our confidence-guided mechanism provides much higher precision in localized refractive regions, even in unseen environments. While TCRNet shows a marginally higher $\delta_{1.25}$, our model’s lower overall error (RMSE, REL, and MAE) confirms better global geometric stability and reliability.

4.7. Qualitative Comparison

As illustrated in Figure 10, we provide a visual comparison between TCG-Depth and several representative baselines (FDCT, TranspareNet, and TCRNet) on the ClearGrasp-real dataset. Due to the inherent sensor domain gap, existing methods frequently suffer from bleeding artifacts near object boundaries or geometric collapses within the centers of transparent objects, resulting in blurred and inconsistent depth estimates.
In contrast, TCG-Depth successfully recovers sharp and accurate geometries, maintaining high fidelity even in regions with complex refractive properties. This visual evidence directly aligns with our quantitative findings, proving that the proposed framework has learned intrinsic physical cues of refractive geometry rather than merely over-fitting to dataset-specific noise patterns. Such robust performance in a zero-shot setting highlights the superior generalization capability of the TCG-Depth framework.

5. Discussions and Limitations

5.1. Reliability Analysis of Confidence Mapping

To verify the reliability of the predicted confidence maps, we first conduct a region-wise error analysis. As shown in Table 2, the regions assigned with high confidence scores exhibit significantly lower errors (RMSE = 0.009) compared to low-confidence regions (RMSE = 0.031), confirming that the module effectively identifies unreliable depth predictions. To further quantify this relationship at a pixel level, we calculate the Pearson Correlation Coefficient (PCC) between the confidence scores and the absolute depth errors. We observe a strong negative correlation of approximately −0.769, which demonstrates that the predicted confidence serves as a continuous and faithful indicator of depth uncertainty.
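The reported pixel-level correlation can be reproduced in principle with a short NumPy sketch (the function name is illustrative):

```python
import numpy as np

def confidence_error_pcc(conf, pred, gt):
    """Pearson correlation between pixel confidence and absolute depth error."""
    err = np.abs(pred - gt).ravel()
    return float(np.corrcoef(conf.ravel(), err)[0, 1])
```

A well-calibrated confidence map should yield a strongly negative coefficient, since larger errors must coincide with lower confidence.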
The superior performance of TCG-Depth over baseline methods like FDCT and TCRNet highlights the necessity of explicit reliability modeling in transparent object perception. While global feature extraction is effective for opaque surfaces, it lacks the flexibility to handle the non-linear depth ambiguities of glassware. By introducing a confidence-guided gate, our framework prevents the blurring and degradation of geometric boundaries—a common failure mode in uniform-optimization paradigms.

5.2. Failure Cases and Environmental Constraints

Despite its effectiveness, we identify certain failure cases where the framework’s performance reaches its boundary. We analyze these constraints from two distinct perspectives:
First, while TCG-Depth demonstrates robustness across most scenarios, we observe performance boundaries under conditions of extreme refraction and severe texture scarcity. As illustrated in Figure 11, when transparent objects possess highly complex overlapping structures or when intense lighting leads to a complete loss of visual texture, the refinement stage tends to produce over-smoothed results. Nevertheless, the confidence module (S-Mask) exhibits remarkable self-aware capability: even in regions where depth recovery fails (as indicated by the high-energy regions in the red-boxed Error Map), the module can still accurately flag them as low-confidence areas (represented by the black regions). This characteristic is paramount in practical applications. It ensures that while the system may not perfectly restore all geometric defects, it can effectively block the propagation of erroneous depth information to downstream tasks such as robotic grasping or navigation by actively identifying unreliable regions.
Second, the method’s reliance on the current TransCG dataset remains a significant constraint. While TransCG offers diverse indoor scenes, it lacks extreme optical scenarios such as high-curvature vessel geometries and specialized instruments under intense specular reflections. The fixed sensor parameters and the inherent synthetic-to-real gap limit our comprehensive understanding of the model’s robustness against hardware-specific noise. Therefore, the development of a more exhaustive dataset featuring extreme optical conditions, combined with multi-modal fusion or generative priors, represents the next frontier for this research.

5.3. Future Roadmap

While TCG-Depth demonstrates strong performance in single-view settings, several directions remain for future exploration to bridge the gap between academic benchmarks and physical deployment:
  • Real-world Deployment and Hardware Integration: A primary focus of our future roadmap is to validate the framework’s zero-shot generalization on diverse real-world captures. We aim to deploy TCG-Depth onto physical robotic manipulation platforms to evaluate its real-time performance and grasping success rates in unstructured environments.
  • Robustness in Extreme Optical Scenarios: Future work will involve extensive experiments on edge cases, such as glassware with high-curvature geometries and environments under extreme lighting conditions (e.g., intense specular glare or low-light transmission). This will further refine the confidence module’s ability to characterize structural unreliability in perceptually degraded scenes.
  • Complete Volumetric Reconstruction: Beyond depth completion, we plan to explore the integration of generative priors to infer missing geometric structures. This research will focus on recovering complete 3D point clouds for transparent objects, even in cases of severe texture loss or complex overlapping refractions.

6. Conclusions

In this study, we presented TCG-Depth, an approach that incorporates region-wise reliability into the refinement process for transparent object depth completion. By exploring the integration of confidence estimation with architectural modulation, TCG-Depth provides an effective framework for single-view depth recovery that balances accuracy and structural consistency. This method offers a potential solution to the instability often observed in refractive regions during single-view perception. Experimental results indicate that the proposed confidence-guided strategy achieves competitive performance on existing benchmarks and serves as a reliable reference for perception systems operating in complex, unstructured environments.

Author Contributions

Conceptualization, K.H. and S.Y.; Methodology, K.H.; Data curation, C.Y.; Writing—original draft preparation, K.H.; Writing—review and editing, J.Z., K.L. and S.Y.; Visualization, K.H.; Supervision, S.Y.; Project administration, J.Z. and S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TCG-Depth: Confidence-Aware Two-Stage Depth Completion for Transparent Objects
RMSE: Root Mean Squared Error
MAE: Mean Absolute Error
REL: Absolute Relative Error
CMAM: Convolutional Multi-scale Attention Module

References

  1. Schober, D.; Güldenring, R.; Love, J.; Nalpantidis, L. Vision-based robot manipulation of transparent liquid containers in a laboratory setting. In Proceedings of the 2025 IEEE/SICE International Symposium on System Integration (SII), Munich, Germany, 21–24 January 2025; pp. 1193–1200. [Google Scholar] [CrossRef]
  2. Ren, L.; Dong, J.; Liu, S.; Zhang, L.; Wang, L. Embodied intelligence toward future smart manufacturing in the era of AI foundation model. IEEE/ASME Trans. Mechatronics 2024, 30, 2632–2642. [Google Scholar] [CrossRef]
  3. Jiang, J.; Cao, G.; Deng, J.; Do, T.-T.; Luo, S. Robotic perception of transparent objects: A review. IEEE Trans. Artif. Intell. 2024, 5, 2547–2567. [Google Scholar] [CrossRef]
  4. Wei, L.; Ding, M.; Li, S. Monocular vision-based depth estimation of forward-looking scenes for mobile platforms. Appl. Sci. 2025, 15, 4267. [Google Scholar] [CrossRef]
  5. Liu, H.; Guo, D.; Cangelosi, A. Embodied intelligence: A synergy of morphology, action, perception and learning. ACM Comput. Surv. 2025, 57, 186. [Google Scholar] [CrossRef]
  6. Kim, J.; Jeon, M.; Jung, S.; Yang, W.; Jung, M.; Shin, J.; Kim, A. Transpose: Large-scale multispectral dataset for transparent objects. Int. J. Robot. Res. 2024, 43, 731–738. [Google Scholar] [CrossRef]
  7. Liang, Y.; Deng, B.; Liu, W.; Qin, J.; He, S. Monocular depth estimation for glass walls with context: A new dataset and method. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15081–15097. [Google Scholar] [CrossRef]
  8. Shi, J.; Yong, A.; Jin, Y.; Li, D.; Niu, H.; Jin, Z.; Wang, H. ASGrasp: Generalizable transparent object reconstruction and 6-DoF grasp detection from RGB-D active stereo camera. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 5441–5447. [Google Scholar]
  9. Kendall, A.; Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  10. Poggi, M.; Kim, S.; Tosi, F.; Kim, S.; Aleotti, F.; Min, D.; Sohn, K.; Mattoccia, S. On the confidence of stereo matching in a deep-learning era: A quantitative evaluation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5293–5313. [Google Scholar] [CrossRef] [PubMed]
  11. Hwang, S.; Lee, K. Confidence-guided LiDAR depth completion for robust 3D object detection. IEEE Access 2025, 13, 159998–160009. [Google Scholar] [CrossRef]
  12. Zhu, L.; Mousavian, A.; Xiang, Y.; Mazhar, H.; van Eenbergen, J.; Debnath, S.; Fox, D. RGB-D local implicit function for depth completion of transparent objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4649–4658. [Google Scholar]
  13. Fang, H.; Fang, H.S.; Xu, S.; Xu, S.; Lu, C. TransCG: A large-scale real-world dataset for transparent object depth completion and a grasping baseline. IEEE Robot. Autom. Lett. 2022, 7, 7383–7390. [Google Scholar] [CrossRef]
  14. Fritz, M.; Bradski, G.; Karayev, S.; Darrell, T.; Black, M.J. An additive latent feature model for transparent object recognition. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 7–10 December 2009. [Google Scholar]
  15. Maeno, K.; Nagahara, H.; Shimada, A.; Taniguchi, R.-I. Light field distortion feature for transparent object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013. [Google Scholar]
  16. Bian, L.; Shi, P.; Chen, W.; Xu, J.; Yi, L.; Chen, R. TransTouch: Learning transparent objects depth sensing through sparse touches. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 9566–9573. [Google Scholar]
  17. Li, S.; Yu, H.; Ding, W.; Liu, H.; Ye, L.; Xia, C.; Wang, X.; Zhang, X. Visual–tactile fusion for transparent object grasping in complex backgrounds. IEEE Trans. Robot. 2023, 39, 3838–3856. [Google Scholar] [CrossRef]
  18. Wang, Y.R.; Zhao, Y.; Xu, H.; Eppel, S.; Aspuru-Guzik, A.; Shkurti, F.; Garg, A. MvTrans: Multi-view perception of transparent objects. arXiv 2023, arXiv:2302.11683. [Google Scholar] [CrossRef]
  19. Dai, Q.; Zhang, J.; Li, Q.; Wu, T.; Dong, H.; Liu, Z.; Tan, P.; Wang, H. Domain randomization-enhanced depth simulation and restoration for perceiving and grasping specular and transparent objects. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 374–391. [Google Scholar]
  20. Fang, I.; Shi, K.; He, X.; Tan, S.; Wang, Y.; Zhao, H.; Huang, H.; Yuan, W.; Feng, C.; Zhang, J. FusionSense: Bridging common sense, vision, and touch for robust sparse-view reconstruction. arXiv 2024, arXiv:2410.08282. [Google Scholar]
  21. Lee, J.; Kim, S.M.; Lee, Y.; Kim, Y.M. NFL: Normal field learning for 6-DoF grasping of transparent objects. IEEE Robot. Autom. Lett. 2023, 9, 819–826. [Google Scholar] [CrossRef]
  22. Lin, J.; Yeung, Y.; Ye, S.; Lau, R.W.H. Leveraging RGB-D data with cross-modal context mining for glass surface detection. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2025; Volume 39, pp. 5254–5261. [Google Scholar]
  23. Ichnowski, J.; Avigal, Y.; Kerr, J.; Goldberg, K. Dex-NeRF: Using a neural radiance field to grasp transparent objects. In Proceedings of the Conference on Robot Learning (CoRL), London, UK, 8–11 November 2021; pp. 526–536. [Google Scholar]
  24. Kerr, J.; Fu, L.; Huang, H.; Avigal, Y.; Tancik, M.; Ichnowski, J.; Kanazawa, A.; Goldberg, K. Evo-NeRF: Evolving NeRF for sequential robot grasping. In Proceedings of the Conference on Robot Learning (CoRL), Auckland, New Zealand, 14–18 December 2022; pp. 353–367. [Google Scholar]
  25. Dai, Q.; Zhu, Y.; Geng, Y.; Ruan, C.; Zhang, J.; Wang, H. GraspNeRF: Multiview-based 6-DoF grasp detection for transparent and specular objects using generalizable NeRF. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 1757–1763. [Google Scholar]
  26. Tao, T.; Zheng, H.; Xiao, J.; Wu, W.; Yang, J. SRNet-Trans: A single-image guided depth completion regression network for transparent objects. Appl. Sci. 2025, 15, 10566. [Google Scholar] [CrossRef]
  27. Sajjan, S.; Moore, M.; Pan, M.; Nagaraja, G.; Lee, J.; Zeng, A.; Song, S. ClearGrasp: 3D shape estimation of transparent objects for manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 3634–3642. [Google Scholar]
  28. Liu, J.; Ma, H.; Guo, Y.; Zhao, Y.; Zhang, C.; Sui, W.; Zou, W. Monocular depth estimation and segmentation for transparent objects with iterative semantic and geometric fusion. arXiv 2025, arXiv:2502.14616. [Google Scholar] [CrossRef]
  29. Fan, X.; Ye, C.; Deng, A.; Wu, X.; Pan, M.; Yang, H. TDCNet: Transparent objects depth completion with CNN–Transformer dual-branch parallel network. IEEE Sens. J. 2025, 25, 36629–36641. [Google Scholar] [CrossRef]
  30. Li, T.; Chen, Z.; Liu, H.; Wang, C. FDCT: Fast depth completion for transparent objects. IEEE Robot. Autom. Lett. 2023, 8, 5823–5830. [Google Scholar] [CrossRef]
  31. Zhai, D.-H.; Yu, S.; Wang, W.; Guan, Y.; Xia, Y. Tcrnet: Transparent object depth completion with cascade refinements. IEEE Trans. Autom. Sci. Eng. 2024, 22, 1893–1912. [Google Scholar] [CrossRef]
  32. Xu, H.; Wang, Y.R.; Eppel, S.; Aspuru-Guzik, A.; Shkurti, F.; Garg, A. SeeingGlass: Joint point cloud and depth completion for transparent objects. arXiv 2021, arXiv:2110.00087. [Google Scholar]
  33. Liu, I.; Yang, E.; Tao, J.; Chen, R.; Zhang, X.; Ran, Q.; Liu, Z.; Su, H. ActiveZero: Mixed domain learning for active stereovision with zero annotation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13033–13042. [Google Scholar]
  34. Chen, R.; Liu, I.; Yang, E.; Tao, J.; Zhang, X.; Ran, Q. ActiveZero++: Mixed domain learning stereo and confidence-based depth completion with zero annotation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14098–14113. [Google Scholar] [CrossRef] [PubMed]
  35. Lu, L.; Bu, C.; Su, Z.; Guan, B.; Yu, Q.; Pan, W.; Zhang, Q. Generative deep-learning-embedded asynchronous structured light for three-dimensional imaging. Adv. Photonics 2024, 6, 046004. [Google Scholar] [CrossRef]
  36. Su, Z.; Xu, Y.; Qin, X.; Zhang, D.; Wang, H.; Lu, L. Automatic crack tracking with consistent stereo imaging and wavelet analysis. Int. J. Mech. Sci. 2025, 304, 110662. [Google Scholar] [CrossRef]
  37. Dai, Q.; Zhu, Y.; Geng, Y.; Ruan, C.; Zhang, J.; Wang, H. GraspNeRF: Multiview-based 6-DoF grasp detection for transparent and specular objects using generalizable NeRF. arXiv 2022, arXiv:2210.06575. [Google Scholar]
  38. Chen, K.; James, S.; Sui, C.; Liu, Y.; Abbeel, P.; Dou, Q. StereoPose: Category-level 6D transparent object pose estimation from stereo images via back-view NOCS. arXiv 2022, arXiv:2211.01644. [Google Scholar]
  39. Gurrola-Ramos, J.; Dalmau, O.; Alarcon, T.E. A residual dense U-Net neural network for image denoising. IEEE Access 2021, 9, 31742–31754. [Google Scholar] [CrossRef]
  40. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460. [Google Scholar]
41. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
42. Yin, X.; Goudriaan, J.; Lantinga, E.A.; Vos, J.; Spiertz, H.J. A flexible sigmoid function of determinate growth. Ann. Bot. 2003, 91, 361–371. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overview of the proposed TCG-Depth framework for transparent object depth completion. The framework takes single-view RGB-D inputs and outputs a refined dense depth map through a two-stage confidence-guided refinement process.
Figure 2. Architecture of the proposed initial depth completion network. The encoder–decoder network takes RGB-D inputs and produces an initial dense depth map.
Figure 3. Architecture of the proposed Confidence Mask Estimation module. Given the initial depth prediction, the module predicts a pixel-wise confidence map and regresses a scene-adaptive threshold to construct a soft reliability mask for subsequent refinement.
Figure 4. Visual comparison of different confidence functions. Unlike binary (hard) or linear functions, the proposed soft confidence function (Equation (8)) generates a smooth, differentiable mask that avoids geometric discontinuities.
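Equation (8) itself is not reproduced in this excerpt, so the sketch below only illustrates the three mask families compared in Figure 4. The sigmoid form of the soft mask, the function names, and the parameters `tau` (threshold), `width`, and `k` (steepness) are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def binary_mask(conf, tau=0.5):
    # Hard threshold: 1 where confidence exceeds tau, else 0.
    # Non-differentiable, so gradients cannot flow through the gate.
    return (conf > tau).astype(np.float64)

def linear_mask(conf, tau=0.5, width=0.2):
    # Linear ramp of the given width centred on tau, clipped to [0, 1].
    # Differentiable inside the ramp, but with kinks at its endpoints.
    return np.clip((conf - tau + width / 2.0) / width, 0.0, 1.0)

def soft_mask(conf, tau=0.5, k=10.0):
    # Smooth sigmoid transition centred on the (scene-adaptive) threshold.
    # Everywhere differentiable, avoiding hard geometric discontinuities.
    return 1.0 / (1.0 + np.exp(-k * (conf - tau)))
```

All three map a raw confidence value in [0, 1] to a gating weight; only the soft variant is smooth across the whole range, which is the property Figure 4 highlights.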
Figure 5. Architecture of the confidence-aware depth refinement network. The network refines the initial depth prediction by selectively modulating features based on the predicted confidence mask.
Figure 6. Qualitative comparison of Stage1-a depth completion results under different input modality configurations. From left to right, the columns show the input RGB image, raw depth map, ground-truth depth, and the depth predictions produced by Stage1-a using RGB-only, Depth-only, and RGB+Depth inputs, respectively. In the figure, the color map represents depth information, with warmer colors (yellow and red) indicating closer objects, and cooler colors (blue and green) representing farther objects.
Figure 7. Qualitative ablation results illustrating the effectiveness of the overall TCG-Depth framework. The columns correspond to the input RGB image, raw depth map, ground-truth depth, Stage1-a with RGB+Depth+CBAM, Stage2 refinement without confidence-guided gating, and the final results produced by TCG-Depth (proposed). In the figure, the color map represents depth information, with warmer colors (yellow and red) indicating closer objects, and cooler colors (blue and green) representing farther objects.
Figure 8. The first row shows the RGB input and the corresponding confidence masks constructed using different confidence functions, including the raw confidence map (Raw Mask), the Binary Confidence Mask (B-Mask), the Linear Confidence Mask (L-Mask), and the Soft Confidence Mask (S-Mask, Ours). The second row presents the corresponding refined depth results. Compared with B-Mask and L-Mask, the proposed soft confidence-guided gating using the Soft Confidence Mask (S-Mask, Ours) produces more stable and consistent depth predictions, particularly around object boundaries and low-confidence regions. In the figure, the color map represents depth information, with warmer colors (yellow and red) indicating closer objects, and cooler colors (blue and green) representing farther objects.
Figure 9. Quantitative comparison with state-of-the-art transparent object depth completion methods on the TransCG dataset. Lower is better for RMSE, MAE, and REL, while higher is better for accuracy metrics.
Figure 10. Zero-shot qualitative comparison on the ClearGrasp-real dataset. The first three columns display the input RGB image, raw depth map, and ground-truth (GT) depth, respectively.
Figure 11. Failure case analysis under extreme conditions. Red boxes highlight regions where extreme refraction or texture loss leads to over-smoothed depth; however, the S-mask successfully identifies and flags these high-error areas.
Table 1. Comprehensive ablation study of the proposed TCG-Depth framework. Bold values indicate the best performance for each metric, downward arrows (↓) signify that lower values are better, and checkmarks (✓) represent the inclusion of specific components in each model.
| Model | RGB | Depth | CBAM | Conf-Gating | RMSE ↓ | REL ↓ | MAE ↓ |
|---|---|---|---|---|---|---|---|
| Stage1-a (RGB only) | ✓ | | | | 0.029 | 0.071 | 0.018 |
| Stage1-a (Depth only) | | ✓ | | | 0.027 | 0.053 | 0.016 |
| Stage1-a (RGB+Depth) | ✓ | ✓ | | | 0.018 | 0.023 | 0.013 |
| Stage1-a (RGB+Depth+CBAM) | ✓ | ✓ | ✓ | | 0.017 | 0.022 | 0.012 |
| Stage2 w/o conf-gating | ✓ | ✓ | ✓ | | 0.016 | 0.022 | 0.011 |
| TCG-Depth (Ours) | ✓ | ✓ | ✓ | ✓ | **0.013** | **0.018** | **0.009** |
Table 2. Region-wise depth error analysis based on predicted confidence maps. Bold values indicate the best performance for each metric, downward arrows (↓) signify that lower values are better.
| Region | RMSE ↓ | REL ↓ | MAE ↓ |
|---|---|---|---|
| High-confidence regions | 0.009 | 0.012 | 0.006 |
| Low-confidence regions | 0.031 | 0.056 | 0.021 |
| Full Scene | 0.017 | 0.022 | 0.012 |
Table 3. Comparison of different confidence-guided refinement strategies in Stage 2. Bold values indicate the best performance for each metric, downward arrows (↓) signify that lower values are better.
| Strategy | RMSE ↓ | MAE ↓ | REL ↓ |
|---|---|---|---|
| Binary confidence-guided gating (B-Mask) | 0.016 | 0.021 | 0.011 |
| Linear confidence-guided gating (L-Mask) | 0.016 | 0.020 | 0.010 |
| Soft confidence-guided gating (S-Mask, Ours) | **0.013** | **0.018** | **0.009** |
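The three gating strategies in Table 3 differ only in how the confidence mask blends the initial and refined depth. A minimal sketch of such a confidence-gated blend follows; the convex-combination form and the name `confidence_gated_fusion` are illustrative assumptions, since the paper's Stage 2 modulates intermediate features rather than necessarily the final depth maps:

```python
import numpy as np

def confidence_gated_fusion(d_init, d_refined, s_mask):
    # Convex combination controlled by the confidence mask:
    # high-confidence pixels (mask near 1) keep the initial, reliable depth,
    # while low-confidence pixels (mask near 0) take the refined estimate.
    # With a smooth S-mask the blend stays differentiable end to end,
    # unlike a hard binary switch.
    return s_mask * d_init + (1.0 - s_mask) * d_refined
```

Under this view, the B-Mask, L-Mask, and S-Mask rows of Table 3 correspond to plugging different mask functions into the same fusion rule.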
Table 4. Quantitative comparison with state-of-the-art transparent object depth estimation methods on the TransCG dataset. Lower is better for RMSE, MAE, and REL, while higher is better for accuracy metrics. Inference speed (FPS) is measured on an NVIDIA RTX 4090 GPU. Bold values indicate the best performance for each metric, downward arrows (↓) signify that lower values are better, upward arrows (↑) signify that higher values are better.
| Model | RMSE ↓ | REL ↓ | MAE ↓ | δ1.05 ↑ | δ1.10 ↑ | δ1.25 ↑ | Params (M) ↓ | FPS ↑ |
|---|---|---|---|---|---|---|---|---|
| CG (ClearGrasp) | 0.054 | 0.083 | 0.037 | 50.48 | 68.68 | 95.28 | 42.5 | 21.6 |
| LIDF-Refine | 0.019 | 0.034 | 0.015 | 78.22 | 94.26 | 99.80 | 28.7 | 31.2 |
| DFNet | 0.018 | 0.023 | 0.013 | 83.76 | 95.67 | 99.71 | **26.1** | **33.0** |
| FDCT | 0.015 | 0.022 | 0.010 | 88.18 | 97.15 | 99.81 | 27.8 | 30.2 |
| TranspareNet | 0.026 | 0.023 | 0.013 | 88.45 | 96.25 | 99.42 | 30.3 | 27.4 |
| TCRNet | 0.017 | 0.020 | 0.010 | 88.96 | 96.94 | **99.87** | 34.5 | 22.8 |
| Ours | **0.013** | **0.018** | **0.009** | **91.58** | **97.65** | 99.82 | 38.2 | 21.4 |
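The error and accuracy columns in Table 4 follow the standard depth-estimation conventions: RMSE, MAE, and REL are pixel-wise errors over valid ground-truth pixels, and each δ column reports the percentage of pixels whose prediction-to-ground-truth ratio stays below the threshold. The sketch below shows these standard definitions; the paper's exact validity mask (e.g. restriction to transparent-object regions) is an assumption here:

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-8):
    """Standard depth metrics: RMSE, MAE, REL, and threshold accuracies
    delta < 1.05 / 1.10 / 1.25, reported as percentages."""
    valid = gt > eps                      # evaluate only where GT depth exists
    p, g = pred[valid], gt[valid]
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    mae = float(np.mean(np.abs(p - g)))
    rel = float(np.mean(np.abs(p - g) / g))
    ratio = np.maximum(p / g, g / p)      # symmetric prediction/GT ratio
    deltas = {t: float(np.mean(ratio < t) * 100.0) for t in (1.05, 1.10, 1.25)}
    return rmse, mae, rel, deltas
```

A prediction that is uniformly 20% too deep, for example, would score 0% on δ1.05 and δ1.10 but 100% on δ1.25, which is why the tighter δ1.05 column separates methods most sharply.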
Table 5. Cross-dataset evaluation: models trained on TransCG and tested on ClearGrasp-real. Bold values indicate the best performance for each metric; downward arrows (↓) signify that lower values are better, and upward arrows (↑) signify that higher values are better.
| Model | RMSE ↓ | REL ↓ | MAE ↓ | δ1.05 ↑ | δ1.10 ↑ | δ1.25 ↑ |
|---|---|---|---|---|---|---|
| CG (ClearGrasp) | 0.085 | 0.095 | 0.052 | 47.26 | 70.76 | 92.54 |
| LIDF-Refine | 0.152 | 0.225 | 0.139 | 9.86 | 20.63 | 46.02 |
| DFNet | 0.041 | 0.054 | 0.031 | 62.74 | 83.31 | 97.33 |
| FDCT | 0.042 | 0.058 | 0.033 | 60.12 | 81.45 | 96.88 |
| TranspareNet | 0.045 | 0.071 | 0.040 | 33.43 | 70.14 | 99.40 |
| TCRNet | 0.034 | 0.049 | 0.027 | 63.67 | 86.63 | **99.47** |
| Ours | **0.031** | **0.042** | **0.025** | **67.26** | **88.16** | 99.03 |
Share and Cite

MDPI and ACS Style

Huang, K.; Yao, C.; Lv, K.; Ye, S.; Zhuang, J. TCG-Depth: A Two-Stage Symmetric Confidence-Guided Framework for Transparent Object Depth Completion. Symmetry 2026, 18, 405. https://doi.org/10.3390/sym18030405