Article

RGB-D Mirror Segmentation with Reliability-Guided Residual Correction

School of Computing, Gachon University, Seongnam-si 13120, Republic of Korea
* Author to whom correspondence should be addressed.
Sensors 2026, 26(12), 3739; https://doi.org/10.3390/s26123739
Submission received: 29 April 2026 / Revised: 5 June 2026 / Accepted: 9 June 2026 / Published: 11 June 2026

Abstract

Mirror segmentation remains challenging because mirror regions often share appearance with the reflected scene, while sensor depth around mirrors is frequently missing, noisy, or geometrically inconsistent. Although recent RGB-based methods have achieved strong results by exploiting contextual and symmetry-aware cues, their ability to use geometric information reliably is still limited. In this paper, we propose a reliable RGB-D mirror segmentation framework built upon SATNet. Specifically, we extend the symmetry-aware baseline with a dedicated depth branch that injects hierarchical sensor-depth features into the multi-scale decoder, and we introduce a Reliability-Guided Residual Correction Module (RGRCM) for final prediction refinement. Instead of treating predicted depth as an independent modality branch, RGRCM internally constructs dual-depth evidence from sensor depth and monocular depth estimated by a pretrained Depth Anything v2 model, encoding raw depth observations, cross-depth discrepancies, validity cues, and local depth instability. The resulting evidence is used to guide uncertainty-aware residual correction only in regions where depth-driven refinement is likely to be beneficial. Experiments on the RGBD-Mirror benchmark show that the proposed method achieves 83.57 IoU, 0.899 $F_\beta$, 0.026 MAE, and 6.26 BER, outperforming existing RGB and RGB-D mirror segmentation methods.

1. Introduction

Mirror segmentation aims to identify mirror regions at the pixel level. It is an important yet challenging problem in scene understanding because mirrors violate the common assumption that the appearance of an image region reflects the intrinsic property of a visible surface. Instead, mirror regions reflect surrounding objects and structures, often making them visually similar to the reflected scene itself. This ambiguity makes mirror segmentation difficult for models that rely mainly on local appearance or semantic context, and it also affects downstream tasks such as robotic navigation, scene parsing, augmented reality, and spatial perception [1,2]. As illustrated in Figure 1, mirror regions are difficult to segment not only because their appearance resembles the reflected scene, but also because the corresponding sensor depth is often missing, noisy, or geometrically inconsistent. This is a well-known limitation of depth sensors near reflective and transparent surfaces such as mirrors and glass [3,4]. In the example, the sensor depth map is severely corrupted around the mirror region, whereas the monocular predicted depth still preserves a plausible scene layout. Existing methods therefore tend to either over-segment reflected content or miss the mirror region entirely. This example highlights the central challenge of RGB-D mirror segmentation: depth is potentially useful, but it cannot be treated as uniformly reliable.
Recent progress in mirror segmentation has been largely driven by RGB-based methods. Early approaches exploited contextual contrast, boundary reasoning, and semantic cues to distinguish mirror interiors from their surroundings [1,5]. Subsequent methods introduced more intrinsic and structural cues, including visual chirality, semantic associations, and symmetry-aware reasoning [6,7,8]. In particular, SATNet demonstrated that the loose symmetry between the input image and its horizontally flipped counterpart provides a strong cue for mirror detection, establishing a powerful RGB baseline [8].
Despite these advances, RGB-only reasoning remains limited in challenging cases where appearance cues are ambiguous. In principle, depth can provide complementary geometric information for mirror segmentation. However, directly using sensor depth is nontrivial because mirror surfaces often produce missing, noisy, or inconsistent measurements. As a result, naive RGB-D fusion may introduce unreliable geometric cues rather than improve segmentation quality [2,9,10]. This observation suggests that the key issue is not simply how to add depth, but how to use depth reliably.
In this work, we propose a reliable RGB-D mirror segmentation framework built upon SATNet. Starting from the symmetry-aware RGB baseline, we first introduce a dedicated depth branch that extracts hierarchical geometric features from the sensor depth map and injects them into the multi-scale decoder. We then propose a Reliability-Guided Residual Correction Module (RGRCM), which performs uncertainty-aware residual correction on the final prediction. Rather than treating dual-depth information as an independent prediction pathway, RGRCM internally constructs discrepancy-aware depth evidence from the sensor depth and a monocular predicted depth generated by a pretrained Depth Anything v2 model [11]. This evidence is encoded by a Dual-Depth Evidence Block (DDEB), which aggregates raw dual-depth observations, cross-depth discrepancy cues, a joint-validity cue, and a local sensor-depth variance cue. The resulting evidence is then used to support selective residual correction only where such correction is likely to be beneficial.
This design is motivated by two observations. First, sensor depth remains useful for mirror segmentation whenever valid geometric measurements are available, and therefore a dedicated depth branch can complement symmetry-aware RGB reasoning. Second, in reflective regions, the relationship between sensor depth and monocular predicted depth (DA depth) often contains more useful information than either source alone. Accordingly, we do not use dual-depth evidence as a separate segmentation branch. Instead, we exploit it as an internal evidence representation within RGRCM, where it helps determine how depth-driven residual correction should be applied under ambiguous depth conditions.
Extensive experiments show that the proposed design is effective. The dedicated depth branch consistently strengthens the SATNet baseline, while RGRCM further improves the prediction through reliability-guided residual correction. The gains are particularly evident in terms of region overlap and balanced error, indicating that reliable depth utilization, rather than indiscriminate RGB-D fusion, is critical for robust mirror segmentation.
The main contributions of this work are summarized as follows:
  • We extend SATNet to RGB-D mirror segmentation by introducing a dedicated depth branch that injects hierarchical sensor-depth features into the symmetry-aware decoder.
  • We propose a Reliability-Guided Residual Correction Module (RGRCM), which internally constructs dual-depth evidence through a Dual-Depth Evidence Block (DDEB) and performs uncertainty-aware residual correction for final prediction refinement.
  • We demonstrate through extensive experiments that the proposed framework improves mirror segmentation performance in a stable manner, especially in terms of region overlap and balanced error.
The rest of this paper is organized as follows. Section 2 reviews previous work related to mirror segmentation and reliability-aware RGB-D fusion. Section 3 describes the proposed method in detail, including the depth branch, the internal dual-depth evidence construction, the Reliability-Guided Residual Correction Module (RGRCM), and the loss function. Section 4 presents the experimental setup and quantitative results, including comparisons with state-of-the-art methods, ablation studies, and complexity analysis. Finally, Section 5 concludes the paper.

2. Related Work

2.1. Mirror Segmentation from RGB Images

Mirror segmentation has been actively studied from RGB images. MirrorNet introduced one of the earliest large-scale benchmarks for mirror segmentation and proposed contextual contrast modeling to distinguish mirror regions from surrounding content [1]. PMDNet further improved mirror localization by progressively learning contextual relations and explicitly refining mirror boundaries [5]. Later methods explored more intrinsic mirror cues. VCNet modeled visual chirality through a flipping–convolution–flipping transformation and introduced chirality-guided boundary reasoning [6], while semantic association-based methods leveraged the functional relation between mirrors and surrounding objects to improve robustness against distractors such as windows and doorways [7]. HetNet proposed multi-level heterogeneous learning to extract complementary mirror cues at different feature levels [12]. More recently, SATNet proposed a symmetry-aware transformer architecture that models the loose symmetry between the input image and its horizontally flipped counterpart, providing a strong RGB baseline for mirror detection [8].

2.2. RGB-D Mirror Segmentation

Compared with RGB-only approaches, RGB-D mirror segmentation seeks to exploit geometric cues to reduce appearance ambiguity. A representative early method in this direction is PDNet, which introduced the RGBD-Mirror dataset and showed that depth discontinuities and color-depth correlations can substantially improve mirror localization and boundary refinement [2]. More recent RGB-D mirror segmentation methods have focused on stronger multi-modal fusion, uncertainty modeling, and knowledge distillation. UTLNet introduced an uncertainty-aware transformer localization framework for RGB-depth mirror segmentation, explicitly modeling unreliable depth during localization and fusion [13]. Morphology-Guided Network (MGNet) further improved RGB-D mirror segmentation by incorporating morphology-aware structural guidance into a knowledge distillation framework [14]. ADRNet-S* proposed asymmetric depth registration and contrastive knowledge distillation for RGB-D mirror segmentation, emphasizing cross-modal alignment and depth-guided multimodal interaction [9]. NDANet-S* advanced this line by introducing neighborhood-level feature matching and demand-modal adaptive fusion within a distillation framework [15]. Recently, Kurohiji and Hachiya showed that inconsistency between sensor depth and predicted depth itself is a useful cue for mirror segmentation, and they proposed a depth inconsistency-based spatial-channel attention gate to emphasize informative reflective regions [16].
Our approach is distinct from depth inconsistency-based mirror segmentation methods such as that of Kurohiji and Hachiya [16], who employ the inconsistency between sensor and predicted depth as a spatial-channel attention cue during feature fusion. The key difference lies in where and how the dual-depth cue is exploited: rather than acting as an attention modulation distributed across the decoder, our dual-depth evidence is consumed at the final prediction stage to guide explicit residual correction. Furthermore, the proposed safe loss regularization explicitly penalizes unnecessary correction on pixels where the base prediction is already confident and correct, a conservative correction principle that is not present in prior reliability-aware fusion or depth inconsistency attention designs.
These studies consistently show that depth is a valuable cue for mirror segmentation. At the same time, they also reveal a common challenge: the benefit of depth depends strongly on how geometric information is represented, aligned, and fused, especially when depth measurements are missing or unreliable around reflective surfaces.

2.3. Reliability-Aware RGB-D Fusion and Dual-Depth Cues

The broader RGB-D segmentation literature has repeatedly shown that depth should not be treated as uniformly reliable. ACNet selectively gathers complementary RGB and depth information through attention-based fusion [17]. SA-Gate reduces the influence of noisy depth measurements through bi-directional cross-modal gating [18]. UCTNet further models depth uncertainty explicitly and uses uncertainty-aware cross-modal interaction for robust RGB-D semantic segmentation [10]. More broadly, handling unreliable or heterogeneous depth cues is a shared challenge across multi-modal vision tasks, including RGB-D salient object detection [19,20,21] and depth estimation [22,23,24]. In parallel, monocular depth estimation has improved significantly in recent years. Depth Anything v2 provides fine-grained and robust depth predictions with strong generalization ability, making predicted depth a practical complementary cue when sensor depth is degraded [11]. Motivated by these observations, our method does not use predicted depth as an independent modality branch. Instead, it combines sensor depth and predicted depth to internally construct dual-depth evidence inside the proposed RGRCM and uses this evidence to guide selective residual correction.
In summary, existing RGB mirror segmentation methods mainly exploit contextual contrast, semantic relations, chirality, or symmetry-aware reasoning, whereas RGB-D approaches demonstrate the value of geometry but remain sensitive to unreliable depth and suboptimal fusion. Our method differs in two main aspects: first, it extends the strong SATNet baseline to the RGB-D setting through a dedicated depth branch; second, it introduces a reliability-guided residual correction design in which dual-depth evidence is internally constructed and used for the uncertainty-aware correction of the final prediction.

3. Method

This section presents the proposed RGB-D mirror segmentation framework. As shown in Figure 2, the network is built upon SATNet and augments the original symmetry-aware RGB pipeline with two key additions: a dedicated depth branch for encoding sensor-depth geometry, and a Reliability-Guided Residual Correction Module (RGRCM) for final prediction refinement. A key design choice is that dual-depth information is not treated as an independent prediction branch. Instead, it is internally constructed inside RGRCM as discrepancy-aware evidence and is used to support reliability-guided residual correction. The detailed structure of RGRCM, including the internal Dual-Depth Evidence Block (DDEB), is shown in Figure 3.

3.1. Overall Architecture

Let $I \in \mathbb{R}^{3 \times H \times W}$ denote the input RGB image, $D_s \in \mathbb{R}^{1 \times H \times W}$ the sensor depth map, and $M \in \{0, 1\}^{H \times W}$ the ground-truth mirror mask. Following SATNet, we construct a horizontally flipped image
$$I^f = \mathcal{F}(I),$$
where $\mathcal{F}(\cdot)$ denotes horizontal flipping.
The original image $I$ and the flipped image $I^f$ are then processed by two weight-sharing Swin-S encoders. Denoting the shared RGB encoder by $E_{rgb}$, the resulting multi-scale feature pyramids are
$$\{X_k\}_{k=0}^{3} = E_{rgb}(I), \quad \{X_k^f\}_{k=0}^{3} = E_{rgb}(I^f),$$
where $k = 0, 1, 2, 3$ correspond to feature resolutions $H/4$, $H/8$, $H/16$, and $H/32$, respectively. The flipped-stream features are spatially realigned by inverse flipping:
$$\bar{X}_k^f = \mathcal{F}^{-1}(X_k^f).$$
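Because horizontal flipping is its own inverse, the realignment step can reuse the same operation as the initial flip. The minimal Python sketch below illustrates this on a toy 2D map; the helper name `hflip` and the list-of-rows layout are assumptions made for illustration and are not taken from the SATNet code.

```python
def hflip(feature_map):
    """Horizontally flip a 2D map stored as a list of rows."""
    return [list(reversed(row)) for row in feature_map]


# Toy single-channel "feature map" (hypothetical values).
x = [[1, 2, 3],
     [4, 5, 6]]

flipped = hflip(x)          # plays the role of F(.)
realigned = hflip(flipped)  # plays the role of F^{-1}(.), the same flip again
```

Applying `hflip` twice returns the original layout, which is why the flipped-stream features can be compared position by position with the main-stream features after realignment.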
Following SATNet, the symmetry-aware attention module (SAAM) models the interaction between the main-stream features and the realigned flipped-stream features, and the contrast and fusion decoder module (CFDM), which incorporates efficient channel attention [25], progressively decodes the fused RGB representations to produce hierarchical decoder features and auxiliary predictions [8]. As shown in Figure 2, the decoder outputs four auxiliary predictions, denoted by $P_3$, $P_2$, $P_1$, and $P_0$, and the top decoder feature is further refined by RGRCM to generate the final output $P_{final}$.
In parallel, the sensor depth map $D_s$ is processed by a dedicated depth branch that extracts multi-scale geometric features aligned with the decoder hierarchy. These projected depth features are injected into the decoder through element-wise addition, enabling the symmetry-aware RGB decoder to exploit explicit geometry throughout the decoding process.
To provide complementary depth cues, we additionally obtain a monocular predicted depth map from a pretrained and frozen Depth Anything v2 model:
$$D_{da} = \Phi_{DA}(I),$$
where $\Phi_{DA}(\cdot)$ denotes the Depth Anything v2 estimator. The predicted depth $D_{da}$ is not directly fused into the decoder. Instead, it is used only inside RGRCM, where it is combined with the sensor depth $D_s$ to construct dual-depth evidence for final-stage residual correction.
Let $F_{top} \in \mathbb{R}^{96 \times H/4 \times W/4}$ denote the top decoder feature from CFDM, as shown in Figure 3. The final mirror prediction is obtained by the proposed RGRCM:
$$P_{final} = \mathrm{RGRCM}(F_{top}, D_s, D_{da}),$$
where $P_{final} \in \mathbb{R}^{2 \times H \times W}$ denotes the final two-channel logits.

3.2. Depth Branch

The depth branch is designed to encode the sensor depth map into hierarchical geometric features that are spatially aligned with the SATNet decoder. While RGB appearance is often ambiguous in reflective regions, valid depth measurements provide direct structural cues that are useful for mirror localization and boundary recovery.
As shown in Figure 2, the depth branch consists of a sequence of convolutional blocks. Each block is composed of a $3 \times 3$ convolution, batch normalization, and ReLU activation, followed by max-pooling for downsampling. Let $B^{(t)}$ denote the intermediate feature at stage $t$. The depth encoding process is written as
$$B^{(0)} = D_s, \quad B^{(t)} = \mathrm{Pool}\big(\phi^{(t)}(B^{(t-1)})\big), \quad t = 1, \dots, T,$$
where $\phi^{(t)}(\cdot)$ denotes a Conv-BN-ReLU block and $\mathrm{Pool}(\cdot)$ denotes max-pooling.
To match the decoder hierarchy, a projection block is attached to each decoding scale. Let $G_k$ denote the projected depth feature at level $k$. Then
$$G_k = \Pi_k(B_k), \quad k \in \{0, 1, 2, 3\},$$
where $B_k$ is the depth-branch feature at the resolution of decoder level $k$, $\Pi_k(\cdot)$ denotes a $1 \times 1$ Conv-BN-ReLU projection block, and $G_k$ has the same spatial resolution and channel dimension as the corresponding decoder skip feature.
At each decoder stage, the projected depth feature is injected into the RGB skip features through element-wise addition. Let $S_k^{(1)}$ and $S_k^{(2)}$ denote the two skip tensors used by the CFDM block at level $k$. The depth-enhanced skip features are defined as
$$\tilde{S}_k^{(1)} = S_k^{(1)} + G_k, \quad \tilde{S}_k^{(2)} = S_k^{(2)} + G_k.$$
This design preserves the original symmetry-aware decoder topology of SATNet while enabling the network to exploit sensor-depth geometry throughout multi-scale decoding. Importantly, the depth branch uses only sensor depth and focuses on hierarchical geometric encoding, whereas the dual-depth representation is reserved for the final reliability-guided correction stage.

3.3. Reliability-Guided Residual Correction Module

The proposed RGRCM performs final-stage residual correction using both prediction uncertainty and internally constructed dual-depth evidence. Its purpose is to avoid the indiscriminate use of depth and instead apply depth-driven correction only where such correction is needed and likely to be reliable. As shown in Figure 3, RGRCM consists of three functional stages: (1) dual-depth evidence construction through DDEB, (2) main prediction and uncertainty estimation, and (3) reliability-guided residual correction.

3.3.1. Dual-Depth Evidence Block

The purpose of DDEB is to construct discrepancy-aware depth evidence from the sensor depth and the monocular predicted depth. Rather than acting as an independent branch, DDEB provides an internal evidence representation for the subsequent residual correction and reliability gating inside RGRCM.
Given the sensor depth $D_s$ and the predicted depth $D_{da}$, we first resize both maps to the input resolution. We do not apply per-image depth normalization or scale-shift alignment before constructing the evidence channels. Although $D_s$ and $D_{da}$ are not necessarily metrically aligned, the proposed DDEB uses their raw preprocessed responses as learnable inconsistency cues rather than calibrated metric differences. This preserves sensor-depth failure patterns, local instability, and the native response distribution of the predicted depth. We define the following raw depth evidence channels:
$$e_1 = D_s, \quad e_2 = D_{da}, \quad e_3 = |D_s - D_{da}|,$$
$$e_4 = \log(D_s + \epsilon) - \log(D_{da} + \epsilon), \quad e_5 = |\partial_x D_s - \partial_x D_{da}|, \quad e_6 = |\partial_y D_s - \partial_y D_{da}|,$$
$$e_7 = \mathbb{1}[D_s > 0] \wedge \mathbb{1}[D_{da} > 0], \quad e_8 = \mathrm{Var}_k(D_s), \quad k = 5,$$
where $\epsilon$ is a small constant for numerical stability, $\partial_x$ and $\partial_y$ denote horizontal and vertical finite-difference operators, $\wedge$ denotes logical conjunction, and $\mathrm{Var}_k(D_s)$ denotes the local variance of the sensor depth computed over a $k \times k$ patch. In our implementation, $k = 5$. The first two channels preserve the original depth observations, $e_3$ and $e_4$ encode raw cross-depth inconsistency in linear and logarithmic spaces, $e_5$ and $e_6$ capture gradient inconsistency, and $e_7$ provides a joint-validity cue for reliable cross-depth comparison. Since the predicted depth is dense in practice, this term primarily acts as a validity cue for cross-depth comparison in regions where sensor depth is available. Finally, $e_8$ encodes the local instability of the sensor depth that frequently occurs around reflective surfaces and invalid depth regions.
The raw depth evidence tensor is then formed by channel-wise concatenation:
$$E_{raw} = [e_1; e_2; e_3; e_4; e_5; e_6; e_7; e_8] \in \mathbb{R}^{8 \times H \times W}.$$
As shown in Figure 3, $E_{raw}$ is subsequently encoded by a shallow evidence encoder $\Psi_e(\cdot)$, composed of stacked Conv-BN-ReLU-MaxPooling blocks:
$$E = \Psi_e(E_{raw}), \quad E \in \mathbb{R}^{C_e \times H/4 \times W/4},$$
where $C_e$ denotes the channel dimension of the encoded evidence feature.
DDEB therefore converts raw dual-depth observations into a compact discrepancy-aware representation that can be used internally by RGRCM. This design allows the network to exploit the relationship between the two depth sources without introducing a separate segmentation pathway.
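To make the eight raw evidence channels concrete, the following pure-Python sketch computes them on small depth grids stored as lists of rows. It is only an illustration under stated assumptions: the value of `EPS`, the use of forward differences with a zero last column or row, the clipping of the variance window at image borders, and all function names are hypothetical choices that the text does not specify.

```python
import math

EPS = 1e-6  # assumed value; the text only says "a small constant"


def diff_x(d):
    """Forward horizontal difference; last column is 0 (assumed border rule)."""
    w = len(d[0])
    return [[row[x + 1] - row[x] if x + 1 < w else 0.0 for x in range(w)]
            for row in d]


def diff_y(d):
    """Forward vertical difference; last row is 0 (assumed border rule)."""
    h, w = len(d), len(d[0])
    return [[d[y + 1][x] - d[y][x] if y + 1 < h else 0.0 for x in range(w)]
            for y in range(h)]


def local_variance(d, k=5):
    """Variance over a k x k window, clipped at the borders (assumption)."""
    h, w, r = len(d), len(d[0]), k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [d[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            mean = sum(vals) / len(vals)
            out[y][x] = sum((v - mean) ** 2 for v in vals) / len(vals)
    return out


def dual_depth_evidence(ds, dda, k=5):
    """Return the channels [e1, ..., e8] as a list of 2D grids."""
    h, w = len(ds), len(ds[0])
    gxs, gxd = diff_x(ds), diff_x(dda)
    gys, gyd = diff_y(ds), diff_y(dda)

    def per_pixel(fn):
        return [[fn(y, x) for x in range(w)] for y in range(h)]

    return [
        per_pixel(lambda y, x: ds[y][x]),                         # e1: sensor depth
        per_pixel(lambda y, x: dda[y][x]),                        # e2: predicted depth
        per_pixel(lambda y, x: abs(ds[y][x] - dda[y][x])),        # e3: linear gap
        per_pixel(lambda y, x: math.log(ds[y][x] + EPS)
                  - math.log(dda[y][x] + EPS)),                   # e4: log gap
        per_pixel(lambda y, x: abs(gxs[y][x] - gxd[y][x])),       # e5: x-gradient gap
        per_pixel(lambda y, x: abs(gys[y][x] - gyd[y][x])),       # e6: y-gradient gap
        per_pixel(lambda y, x: 1.0 if ds[y][x] > 0 and dda[y][x] > 0
                  else 0.0),                                      # e7: joint validity
        local_variance(ds, k),                                    # e8: local variance
    ]


# Toy example: the top-left sensor pixel is missing (zero).
sensor = [[0.0, 0.5], [0.5, 0.5]]
predicted = [[0.4, 0.5], [0.5, 0.6]]
evidence = dual_depth_evidence(sensor, predicted)
```

In this toy case the missing sensor pixel yields a zero in the joint-validity channel and a large negative value in the log-gap channel, which is the kind of failure pattern the evidence encoder is meant to see.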

3.3.2. Main Prediction and Uncertainty Estimation

Given the top decoder feature $F_{top}$, RGRCM first computes a main prediction branch:
$$P_{main} = \mathrm{Up}\big(\mathrm{Conv}_{1 \times 1}(F_{top})\big),$$
where $P_{main} \in \mathbb{R}^{2 \times H \times W}$ and $\mathrm{Up}(\cdot)$ denotes bilinear upsampling to the input resolution. This branch corresponds to the baseline segmentation logits before reliability-guided correction.
We then obtain the posterior probability map by softmax:
$$Q = \mathrm{Softmax}(P_{main}),$$
where $Q \in [0, 1]^{2 \times H \times W}$. Based on this, we compute the normalized binary entropy map
$$H_{main} = -\frac{1}{\log 2} \sum_{c=1}^{2} Q_c \log(Q_c + \epsilon),$$
where $H_{main} \in [0, 1]^{1 \times H \times W}$. A larger value of $H_{main}$ indicates greater uncertainty in the main prediction and therefore a stronger need for correction.
This uncertainty estimate plays a central role in RGRCM. Rather than correcting all pixels uniformly, the module uses the entropy map to identify ambiguous regions where depth-driven residual refinement is more likely to be beneficial.
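As a small illustration of the uncertainty estimate, the sketch below evaluates the normalized binary entropy for a single pixel from its two-class logits. The function names and the value of `EPS` are assumptions for this example only.

```python
import math

EPS = 1e-8  # assumed value of the stability constant


def softmax2(l0, l1):
    """Numerically stable two-class softmax for one pixel."""
    m = max(l0, l1)
    e0, e1 = math.exp(l0 - m), math.exp(l1 - m)
    s = e0 + e1
    return e0 / s, e1 / s


def normalized_entropy(l0, l1):
    """Binary entropy of the softmax posterior, divided by log 2."""
    q = softmax2(l0, l1)
    return -sum(p * math.log(p + EPS) for p in q) / math.log(2)


ambiguous = normalized_entropy(0.0, 0.0)     # equal logits: maximal uncertainty
confident = normalized_entropy(10.0, -10.0)  # large margin: almost no uncertainty
```

Equal logits give a value close to 1 and a large logit margin gives a value close to 0, so the map highlights exactly the pixels where the main prediction hesitates between mirror and non-mirror.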

3.3.3. Reliability-Guided Residual Correction

The final stage of RGRCM performs reliability-guided residual correction based on the top decoder feature $F_{top}$, the encoded depth evidence $E$, and the uncertainty map $H_{main}$.
First, a Depth Residual Head (DRH) predicts a depth-driven residual correction:
$$\Delta P = \mathrm{Up}\big(\mathrm{DRH}([F_{top}; E])\big),$$
where $\Delta P \in \mathbb{R}^{2 \times H \times W}$ and $[\cdot\,;\cdot]$ denotes channel-wise concatenation. As shown in Figure 3, DRH is implemented using the shared head architecture composed of two $3 \times 3$ Conv-BN-ReLU layers followed by a final $1 \times 1$ convolution.
Next, a Reliability Gate Head (RGH) estimates a learned gate conditioned on the encoded evidence and the uncertainty map. Specifically, the encoded evidence $E$ and the downsampled uncertainty map $\mathrm{Down}(H_{main})$ are concatenated and fed into RGH:
$$G_{raw} = \mathrm{Up}\Big(\sigma\big(\mathrm{RGH}([E; \mathrm{Down}(H_{main})])\big)\Big),$$
where $G_{raw} \in [0, 1]^{1 \times H \times W}$ and $\sigma(\cdot)$ denotes the sigmoid function. RGH has the same shared head architecture as DRH but uses independent parameters.
The final gate is obtained by modulating the learned gate with the uncertainty map:
$$G = H_{main} \odot G_{raw},$$
where $\odot$ denotes element-wise multiplication.
Finally, the corrected output is given by
$$P_{final} = P_{main} + G \odot \Delta P,$$
where the single-channel gate $G$ is broadcast along the class dimension.
This formulation has a clear interpretation. The main branch first provides a stable baseline prediction. RGRCM then constructs dual-depth evidence internally through DDEB and predicts a residual correction through DRH, but it applies that correction only where the current prediction is uncertain and where RGH judges the correction to be reliable. In this way, RGRCM suppresses harmful depth-driven correction in confident regions while selectively refining ambiguous mirror regions.
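The gating behaviour described above can be checked on single pixels with the following sketch. It treats the residual and the learned gate as given numbers rather than network outputs; the function names, the value of `EPS`, and the toy logits are assumptions for illustration.

```python
import math

EPS = 1e-8  # assumed value of the stability constant


def normalized_entropy(l0, l1):
    """Normalized binary entropy of the two-class softmax posterior."""
    m = max(l0, l1)
    e0, e1 = math.exp(l0 - m), math.exp(l1 - m)
    q = (e0 / (e0 + e1), e1 / (e0 + e1))
    return -sum(p * math.log(p + EPS) for p in q) / math.log(2)


def corrected_logits(p_main, delta_p, g_raw):
    """Apply the gated residual to one pixel.

    p_main, delta_p: two-class logits and residual for the pixel.
    g_raw: learned gate value in [0, 1] (here simply supplied by hand).
    Returns the corrected logits and the final gate value.
    """
    gate = normalized_entropy(*p_main) * g_raw   # final gate: entropy times learned gate
    final = tuple(p + gate * d for p, d in zip(p_main, delta_p))
    return final, gate


# Same residual and learned gate, two different main predictions.
residual = (-5.0, 5.0)
confident_out, confident_gate = corrected_logits((8.0, -8.0), residual, 1.0)
uncertain_out, uncertain_gate = corrected_logits((0.1, -0.1), residual, 1.0)
```

With identical residuals, the confident pixel keeps its original class because its entropy drives the gate toward zero, whereas the near-tied pixel is pushed toward the class suggested by the residual.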

3.4. Loss Function

The proposed network is trained with three complementary objectives: a baseline multi-scale supervision term inherited from SATNet, a final prediction loss for the corrected output, and a safe regularization term for the proposed RGRCM. The overall objective is defined as
$$\mathcal{L} = \mathcal{L}_{final} + \mathcal{L}_{base} + \lambda \mathcal{L}_{safe},$$
where $\lambda$ controls the contribution of the safe regularization term. In all experiments, we set $\lambda = 0.1$.
Following SATNet [8], we apply deep supervision to the four decoder predictions $\{P_i\}_{i=0}^{3}$. The baseline loss is defined as
$$\mathcal{L}_{base} = \sum_{i=0}^{3} w_i \, \mathcal{L}_{ce}(P_i, M),$$
where $w_i$ is the scale-dependent weight for the $i$-th prediction map, and $\mathcal{L}_{ce}(\cdot, \cdot)$ denotes the pixel-wise cross-entropy loss. Following the SATNet [8] setting, we use $[1.25, 1.25, 1.0, 1.5]$ for $[w_0, w_1, w_2, w_3]$. This term preserves the original multi-scale supervision strategy of SATNet [8] and stabilizes the learning of the symmetry-aware decoder.
In addition to the auxiliary supervision, we directly supervise the final corrected prediction $P_{final}$ produced by RGRCM:
$$\mathcal{L}_{final} = \mathcal{L}_{ce}(P_{final}, M).$$
This term ensures that the final output after reliability-guided residual correction is optimized directly for mirror segmentation.
For a prediction $P$, the cross-entropy loss is defined as
$$\mathcal{L}_{ce}(P, M) = -\frac{1}{N} \sum_{j=1}^{N} \sum_{c \in \{0, 1\}} y_{j,c} \log p_{j,c},$$
where $N = H \times W$, $p_{j,c}$ is the softmax probability of class $c$ at pixel $j$, and $y_{j,c}$ is the corresponding one-hot ground-truth label.
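The weighted deep-supervision term can be sketched in a few lines of Python. The example assumes that each auxiliary prediction has already been resized to the mask resolution and flattened into a list of per-pixel two-class logits; that layout and the function names are illustrative assumptions, while the weights are the ones quoted in the text.

```python
import math

SCALE_WEIGHTS = [1.25, 1.25, 1.0, 1.5]  # [w0, w1, w2, w3] as stated in the text


def cross_entropy(logit_map, labels):
    """Mean pixel-wise cross-entropy.

    logit_map: list of (l0, l1) logits, one pair per pixel.
    labels: list of ground-truth class indices (0 or 1), one per pixel.
    """
    total = 0.0
    for (l0, l1), y in zip(logit_map, labels):
        m = max(l0, l1)
        log_z = m + math.log(math.exp(l0 - m) + math.exp(l1 - m))  # log-sum-exp
        total += log_z - (l1 if y == 1 else l0)                    # -log p_{j,y}
    return total / len(labels)


def base_loss(aux_logit_maps, labels):
    """Weighted sum of the cross-entropy over the four auxiliary predictions."""
    return sum(w * cross_entropy(p, labels)
               for w, p in zip(SCALE_WEIGHTS, aux_logit_maps))


# Toy check: uninformative (all-zero) logits at every scale.
labels = [0, 1, 1, 0]
flat_prediction = [(0.0, 0.0)] * len(labels)
loss_value = base_loss([flat_prediction] * 4, labels)
```

For uninformative logits every scale contributes $\log 2$, so the total equals the sum of the weights times $\log 2$, which makes the relative emphasis on each scale easy to read off.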
To explicitly encourage RGRCM to avoid unnecessary correction on already reliable predictions, we introduce a safe regularization term:
$$\mathcal{L}_{safe} = \frac{1}{N} \sum_{j=1}^{N} e^{-\alpha H_j} \cdot c_j \cdot \lVert G_j \, \Delta P_j \rVert_1,$$
where $\alpha$ controls the sensitivity to uncertainty, $H_j$ is the uncertainty at pixel $j$, $c_j$ is the agreement coefficient between the main prediction and the ground truth, $G_j$ is the reliability gate, and $\Delta P_j$ is the residual correction predicted by RGRCM. Since $G_j$ is a one-channel gate, it is broadcast along the class dimension when multiplied with $\Delta P_j$. In our experiments, we set $\alpha = 5.0$.
In our framework, the main prediction corresponds to the pre-correction output $P_{main}$, which is identical to the top-scale prediction $P_0$. Let
$$Q = \mathrm{Softmax}(P_{main}),$$
where $Q \in [0, 1]^{2 \times H \times W}$. We then compute the normalized binary entropy as
$$H_j = -\frac{1}{\log 2} \sum_{c=1}^{2} Q_{j,c} \log(Q_{j,c} + \epsilon),$$
where $Q_{j,c}$ is the posterior probability of class $c$ at pixel $j$, and $\epsilon$ is a small constant for numerical stability.
The agreement coefficient $c_j$ is defined as
$$c_j = \sum_{c=1}^{2} y_{j,c} \, Q_{j,c},$$
which corresponds to the posterior probability assigned by the main prediction to the ground-truth class at pixel $j$. Therefore, $c_j$ becomes large when the main prediction already agrees well with the ground truth.
The safe loss in Equation (25) penalizes large residual corrections on pixels where the main prediction is already correct and confident. Specifically, the factor $e^{-\alpha H_j}$ imposes a stronger penalty when the uncertainty is low, while the coefficient $c_j$ further emphasizes pixels whose main prediction already agrees well with the ground truth. As a result, $\mathcal{L}_{safe}$ encourages RGRCM to behave conservatively on easy and already-correct pixels, while allowing larger corrections on ambiguous or erroneous regions. This is consistent with the design goal of reliability-guided residual correction, namely, to suppress unnecessary modification of stable predictions and focus correction on difficult mirror regions.
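The per-pixel weighting of the safe loss can be illustrated as follows. The sketch takes the gate and residual as given numbers and does not model gradient flow; the function names, the value of `EPS`, and the toy pixels are assumptions for illustration, and the text does not say whether the entropy and agreement terms are detached during training.

```python
import math

EPS = 1e-8    # assumed value of the stability constant
ALPHA = 5.0   # alpha as stated in the text


def safe_penalty(p_main, delta_p, gate, label, alpha=ALPHA):
    """Safe-loss contribution of one pixel.

    p_main, delta_p: two-class main logits and residual for the pixel.
    gate: reliability gate value for the pixel.
    label: ground-truth class index (0 or 1).
    """
    m = max(p_main)
    exps = [math.exp(l - m) for l in p_main]
    q = [e / sum(exps) for e in exps]                                   # softmax posterior
    entropy = -sum(p * math.log(p + EPS) for p in q) / math.log(2)     # H_j
    agreement = q[label]                                                # c_j
    l1_norm = sum(abs(gate * d) for d in delta_p)                       # ||G_j dP_j||_1
    return math.exp(-alpha * entropy) * agreement * l1_norm


def safe_loss(pixels):
    """Mean of the per-pixel penalties; pixels is a list of argument tuples."""
    return sum(safe_penalty(*px) for px in pixels) / len(pixels)


residual, gate = (1.0, -1.0), 0.5
confident_correct = safe_penalty((6.0, -6.0), residual, gate, 0)
ambiguous = safe_penalty((0.0, 0.0), residual, gate, 0)
confident_wrong = safe_penalty((6.0, -6.0), residual, gate, 1)
```

For the same gated residual, the confident and correct pixel receives nearly the full L1 penalty, while the ambiguous pixel and the confidently wrong pixel are penalized far less, matching the conservative behaviour described above.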

4. Experiments and Results

This section evaluates the proposed method from multiple perspectives, including state-of-the-art comparison, qualitative analysis, component ablation, uncertainty-map visualization, statistical stability, robustness to sensor-depth corruption, depth-source analysis, safe loss and hyperparameter studies, computational complexity, and failure-case analysis.

4.1. Experimental Setup

Dataset and Evaluation Metrics

We evaluate the proposed method on the RGBD-Mirror benchmark [2], following the standard evaluation protocol used in prior work. RGBD-Mirror contains 3049 RGB-D image triplets, where each sample consists of an RGB image, a depth map, and a corresponding ground-truth mirror mask. The dataset is split into 2000 training images and 1049 testing images. Performance is measured using four widely adopted metrics, namely, intersection over union (IoU), $F_\beta$, mean absolute error (MAE), and balanced error rate (BER).
Let $\hat{M} \in [0, 1]^{H \times W}$ denote the predicted foreground probability map and $M \in \{0, 1\}^{H \times W}$ denote the ground-truth binary mirror mask. Let $\tilde{M}$ denote the binarized prediction obtained from $\hat{M}$, and let $TP$, $TN$, $FP$, and $FN$ be the corresponding numbers of true positives, true negatives, false positives, and false negatives.
The IoU is defined as
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}.$$
The $F_\beta$ score is defined as
$$F_\beta = \frac{(1 + \beta^2) \, \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \, \mathrm{Precision} + \mathrm{Recall}},$$
where
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
The MAE is computed on the continuous prediction map as
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{m}_i - m_i|,$$
where $N = H \times W$, and $\hat{m}_i$ and $m_i$ denote the predicted foreground probability and the ground-truth binary label at pixel $i$, respectively.
The BER is defined as
$$\mathrm{BER} = \left(1 - \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)\right) \times 100.$$
Higher IoU and $F_\beta$ indicate better segmentation performance, while lower MAE and BER indicate more accurate and balanced predictions. In all experiments, we set $\beta^2 = 0.3$ for $F_\beta$, and use a fixed binarization threshold of 0.5 to obtain the binary prediction map. All threshold-based metrics reported in this paper are computed under this setting.
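For reference, the four metrics can be computed for a single image as in the sketch below. It assumes flattened probability and mask lists, a strict `>` comparison at the threshold, and a zero result for empty denominators; these conventions and the function names are assumptions, since the text does not specify them. IoU is returned as a fraction here, whereas the reported results appear to use a percentage scale.

```python
def _ratio(num, den):
    """Safe division that returns 0.0 for an empty denominator (assumption)."""
    return num / den if den else 0.0


def mirror_metrics(prob, mask, threshold=0.5, beta2=0.3):
    """IoU, F_beta, MAE, and BER for one flattened image.

    prob: predicted foreground probabilities in [0, 1].
    mask: ground-truth labels (0 or 1).
    """
    pred = [1 if p > threshold else 0 for p in prob]
    tp = sum(1 for p, m in zip(pred, mask) if p == 1 and m == 1)
    tn = sum(1 for p, m in zip(pred, mask) if p == 0 and m == 0)
    fp = sum(1 for p, m in zip(pred, mask) if p == 1 and m == 0)
    fn = sum(1 for p, m in zip(pred, mask) if p == 0 and m == 1)

    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "iou": _ratio(tp, tp + fp + fn),
        "f_beta": _ratio((1 + beta2) * precision * recall,
                         beta2 * precision + recall),
        # MAE uses the continuous probabilities, not the binarized map.
        "mae": sum(abs(p - m) for p, m in zip(prob, mask)) / len(mask),
        "ber": (1 - 0.5 * (_ratio(tp, tp + fn) + _ratio(tn, tn + fp))) * 100,
    }


# Toy example with one pixel of each outcome (TP, FP, FN, TN).
scores = mirror_metrics([0.9, 0.8, 0.4, 0.1], [1, 0, 1, 0])
```

Note that MAE is the only metric that does not depend on the binarization threshold, which is why it can move independently of IoU and BER.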

4.2. Implementation Details

We adopt Swin-S [26] pretrained on ImageNet-1K [27] as the RGB backbone encoder. All experiments are conducted using input RGB images and depth maps resized to $512 \times 512$. The monocular predicted depth is obtained from a pretrained Depth Anything v2 [11] model, which is kept frozen during training. The remaining network, including the RGB backbone, the depth branch, the decoder, and RGRCM, is trained end-to-end using the loss function described in Section 3.4. Following Section 3.4, the multi-scale weights are set to $[1.25, 1.25, 1.0, 1.5]$, while the coefficients of the safe loss are set to $\lambda = 0.1$ and $\alpha = 5.0$.
We train the network for 200 epochs using AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a weight decay of 0.01. The learning rate is set to $3 \times 10^{-4}$. A PolynomialLR scheduler is adopted during training. The batch size is set to 8, and all experiments are conducted on a single NVIDIA GeForce RTX 5090 GPU.
Sensor depth maps are loaded as single-channel depth images and converted to the $[0, 1]$ range according to the stored 8-bit depth scale. No explicit masking, interpolation, hole-filling, or inpainting is applied to invalid or missing depth pixels at the preprocessing stage. Zero-valued or missing depth pixels are retained as zero values because missingness itself provides useful evidence near reflective surfaces. DA depth maps generated by Depth Anything v2 undergo the same $[0, 1]$ normalization. Instead of relying on hand-crafted preprocessing heuristics, we provide raw depth observations to the network and rely on the internal evidence mechanisms described in Section 3.3.1 (in particular, the joint-validity channel $e_7$ and the local-variance channel $e_8$) to identify and handle unreliable measurements during inference.
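A minimal sketch of this preprocessing is given below. It assumes that the 8-bit scale means integer values in 0 to 255 and that normalization is a division by 255; both the divisor and the function names are assumptions, since the text does not give the exact conversion.

```python
def normalize_depth_8bit(raw):
    """Map 8-bit depth values (0..255) to [0, 1]; zeros stay zero."""
    return [[value / 255.0 for value in row] for row in raw]


def valid_fraction(depth):
    """Share of pixels with a non-zero depth reading."""
    values = [v for row in depth for v in row]
    return sum(1 for v in values if v > 0) / len(values)


# Toy sensor map with two missing (zero) pixels, left untouched on purpose.
raw_depth = [[0, 255],
             [51, 0]]
depth = normalize_depth_8bit(raw_depth)
coverage = valid_fraction(depth)
```

Missing pixels remain exactly zero after normalization, so they can still be recognized by a simple positivity test such as the one used for the joint-validity channel.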
During training, we apply several data augmentation strategies to improve generalization. Specifically, random cropping is applied with a crop size of up to 25% of the image area, random horizontal flipping is applied with probability 0.5, and random brightness and contrast jittering within ± 0.2 is applied with probability 0.5. In addition, random Gaussian blur with a kernel size 3 or 5 is applied with probability 0.3, and random Gaussian noise with standard deviation σ = 5 is applied with probability 0.3. After augmentation, all RGB images and depth maps are resized to 512 × 512 , and the RGB images are normalized using the ImageNet mean and standard deviation. During testing, only resizing and RGB normalization are applied.
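A part of this pipeline can be sketched as follows. The sketch covers only the horizontal flip, the brightness and contrast jitter, and the Gaussian noise; cropping, blur, resizing, and normalization are omitted for brevity. It assumes that geometric transforms are applied jointly to the RGB image, the depth map, and the mask, that photometric transforms touch the RGB image only, and that the noise standard deviation is expressed on the 0 to 255 intensity scale. None of these details is stated in the text, and the function name `augment` is hypothetical.

```python
import numpy as np

def augment(rgb, depth, mask, rng):
    """rgb: HxWx3 float array in [0, 255]; depth and mask: HxW arrays."""
    rgb, depth, mask = rgb.copy(), depth.copy(), mask.copy()

    # Horizontal flip (p = 0.5): geometric, so applied to all three jointly.
    if rng.random() < 0.5:
        rgb, depth, mask = rgb[:, ::-1], depth[:, ::-1], mask[:, ::-1]

    # Brightness/contrast jitter within +/-0.2 (p = 0.5): RGB only (assumption).
    if rng.random() < 0.5:
        contrast = 1.0 + rng.uniform(-0.2, 0.2)
        brightness = 255.0 * rng.uniform(-0.2, 0.2)
        mean = rgb.mean()
        rgb = (rgb - mean) * contrast + mean + brightness

    # Gaussian noise with sigma = 5 (p = 0.3): RGB only, 0-255 scale (assumption).
    if rng.random() < 0.3:
        rgb = rgb + rng.normal(0.0, 5.0, size=rgb.shape)

    return np.clip(rgb, 0.0, 255.0), depth, mask
```

Passing an explicit generator such as `np.random.default_rng(0)` keeps the augmentation reproducible across runs.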

4.3. Comparison with State-of-the-Art Methods

Table 1 compares the proposed method with representative RGB mirror segmentation methods and RGB-D mirror segmentation methods on the RGBD-Mirror benchmark [2]. Specifically, the compared RGB mirror methods include VCNet [6], SATNet [8], CSFwinformer-B [28], DPRNet [29], S2MD [30], and SAMirror [31], while the compared RGB-D mirror methods include PDNet [2], SANet [7], NDANet-S* [15], MGNet-S* [14], UTLNet [13], ADRNet-S* [9], and Kurohiji and Hachiya [16]. Except for Kurohiji and Hachiya, all threshold-based metrics are evaluated using the same fixed binarization threshold of 0.5. Since Kurohiji and Hachiya report the mean values of the top five runs selected from ten repeated trials in their original paper, we include their reported mean values for reference only. These methods cover both strong RGB baselines and recent RGB-D approaches, providing a comprehensive benchmark for evaluating the effectiveness of the proposed depth branch and RGRCM. For quantitative comparison, SATNet (our baseline) was retrained using the official codebase under our experimental setting. For all other competing methods, performance numbers are cited directly from their respective original publications. Note that we group SAMirror [31] with RGB methods because its main segmentation backbone is RGB-driven while predicted depth is used as an auxiliary cue.
The proposed method achieves the best overall performance, reaching 83.57 IoU, 0.899 $F_\beta$, 0.026 MAE, and 6.26 BER. In particular, our method outperforms all compared methods in IoU, $F_\beta$, and BER, while also matching the best MAE.
Compared with the RGB baseline SATNet [8], the proposed method improves IoU from 80.69 to 83.57, $F_\beta$ from 0.877 to 0.899, and BER from 7.33 to 6.26. Compared with the strongest RGB-D competitor ADRNet-S* [9], our method further improves IoU by 1.36 points, $F_\beta$ by 0.028, and BER by 0.76 points. These results verify that the proposed depth branch and RGRCM provide more effective and more reliable depth utilization than conventional RGB-only reasoning or direct RGB-D fusion.
Figure 4 presents representative qualitative comparisons between the proposed method and prior RGB and RGB-D mirror segmentation methods. Across all examples, the sensor depth is often incomplete, noisy, or unreliable around reflective regions, whereas the predicted depth provides a more plausible global scene layout. This observation supports the design motivation of the proposed framework, which uses sensor depth as the primary geometric cue in the depth branch and exploits the predicted depth only as complementary evidence inside RGRCM. Note that, for qualitative comparison in Figure 4, we used officially released prediction maps from methods with publicly available code. SANet predictions were generated using the authors’ released model weights.

4.4. Qualitative Results

In the first row of Figure 4, window-like structures are easily confused with mirrors because of their similar rectangular shape and reflective appearance. Several competing methods produce noticeable false positives or over-segment the ambiguous region, whereas the proposed method better suppresses non-mirror structures and produces a mask that is more consistent with the ground truth. This suggests that the proposed reliability-guided correction helps avoid erroneous depth-driven updates in visually confusing regions.
The second to fourth rows show mirrors that are difficult to recognize due to weak appearance cues, narrow shapes, or challenging scene context. In these cases, several existing methods either miss the mirror entirely or predict incomplete masks. By contrast, the proposed method better recovers the correct mirror extent, especially when the sensor depth alone is insufficient but the dual-depth evidence still provides useful complementary guidance.
The fifth and sixth rows contain scenes with two mirrors of different sizes. These examples are challenging because small mirrors are easily missed while large mirrors may dominate the prediction. The proposed method is able to preserve both mirror regions more reliably, indicating that the combination of hierarchical depth features and reliability-guided residual correction improves scale robustness.
The last two rows show scenes containing multiple mirrors. Existing methods often miss one of the mirror regions, produce fragmented masks, or fail to localize all reflective areas consistently. In contrast, the proposed method detects multiple mirror regions more completely and yields predictions that are closer to the ground truth. Overall, the qualitative results demonstrate that the proposed framework is particularly effective in ambiguous scenes where naive use of depth is unreliable and selective correction is necessary.

4.5. Ablation Study

We conduct ablation studies to analyze the contribution of the proposed architectural components, the roles of different depth inputs and the proposed dual-depth evidence, and the effect of the safe loss term. Unless otherwise specified, all RGRCM variants in this section use the full training objective described in Section 3.4.

4.5.1. Effect of the Proposed Components

Table 2 reports the contribution of the proposed components starting from the SATNet baseline. Adding the dedicated depth branch already provides a substantial gain, improving IoU from 80.69 to 83.02 and reducing BER from 7.33 to 6.72. This verifies that hierarchical sensor-depth features provide useful geometric cues when injected into the symmetry-aware decoder.
Adding RGRCM on top of the depth branch brings further improvements. In the “simple depth evidence” variant, DDEB is removed and the projected feature from Stage 2 of the depth branch is directly fed to RGRCM. Inside RGRCM, this feature replaces the DDEB output and is used as the input to both DRH and RGH. Even with this simplified design, the model improves IoU to 83.41 and BER to 6.48, showing that reliability-guided residual correction is effective when conditioned on an intermediate depth-branch feature. The full RGRCM, which replaces this simple feature with the complete DDEB representation, achieves the best overall performance with 83.57 IoU, 0.899 $F_\beta$, 0.026 MAE, and 6.26 BER. Although the difference in $F_\beta$ between the simple and full variants is marginal, the full design yields the best trade-off across all metrics, especially in MAE and BER.
Figure 5 provides a qualitative ablation analysis of the proposed components. Compared with the RGB-only SATNet baseline, adding the depth branch generally recovers additional mirror cues and improves the localization of challenging mirror regions. However, because the sensor depth is often corrupted or incomplete around reflective surfaces, the depth-branch-only variant may still produce noisy responses, miss weak mirror regions, or generate inaccurate shapes.
By further introducing RGRCM, the full model produces visibly cleaner and more complete predictions. In several examples, RGRCM suppresses spurious activations caused by unreliable sensor depth while preserving mirror regions that are weak in appearance or difficult to distinguish from surrounding structures. This behavior is consistent with the design of RGRCM, which uses dual-depth evidence and uncertainty-aware residual correction to refine the baseline prediction selectively rather than modifying all pixels indiscriminately.
The qualitative results therefore complement the quantitative ablation in Table 2. They show that the depth branch is effective for injecting geometric information into the decoder, while RGRCM further improves robustness by correcting difficult regions in a reliability-aware manner.

4.5.2. Uncertainty Map Visualization

Figure 6 visualizes the intermediate representations of RGRCM for representative test images. Each row shows, from left to right, the following: (a) the input RGB image, (b) the sensor depth map, (c) the monocular predicted depth from Depth Anything v2, (d) the base prediction $P_{main}$, (e) the uncertainty map $H_{main}$, (f) the final prediction $P_{final}$, and (g) the ground-truth mask.
The uncertainty map $H_{main}$ is computed as the normalized binary entropy of the base prediction $P_{main}$, where high values (warm colors) indicate regions of low prediction confidence. As shown in Figure 6, $H_{main}$ consistently highlights mirror boundaries and regions that share visual characteristics with mirrors, such as glass surfaces and reflective objects. In contrast, well-predicted background regions exhibit low uncertainty values (cool colors).
This spatial distribution of uncertainty directly governs the behavior of RGRCM through the final gate $G = H_{main} \odot G_{raw}$. In regions where $H_{main}$ is low, the gate is suppressed regardless of the learned reliability gate $G_{raw}$, preventing unnecessary depth-based correction from degrading already confident predictions. In regions where $H_{main}$ is high, the gate permits correction only when $G_{raw}$ additionally confirms that the dual-depth evidence supports reliable correction. This two-stage gating mechanism ensures that residual correction is applied selectively, focusing on ambiguous regions where depth-driven refinement is most likely to be beneficial.
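The two-stage gating can be illustrated numerically. The sketch below reads the final gate as the element-wise product of the uncertainty map and the learned reliability gate, and it normalizes the binary entropy by log 2 so that it peaks at 1 for a prediction of 0.5; both readings, as well as the function names and the clipping constant, are assumptions made for illustration.

```python
import numpy as np

def normalized_binary_entropy(p, eps=1e-7):
    """Binary entropy of p divided by log(2): 1 at p = 0.5, near 0 at 0 or 1."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0 - eps)
    h = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return h / np.log(2.0)

def final_gate(p_main, g_raw):
    """Gate is open only where the base prediction is uncertain AND g_raw is high."""
    return normalized_binary_entropy(p_main) * np.asarray(g_raw, dtype=np.float64)

# Same learned gate value (0.8) at an ambiguous pixel and at a confident pixel.
gate = final_gate([0.5, 0.999], [0.8, 0.8])
```

In this example the ambiguous pixel keeps a gate of 0.8, while the confident pixel is suppressed to roughly 0.01 even though the learned gate is identical, which is the behavior the text attributes to the entropy term.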

4.5.3. Statistical Stability

To verify the reliability of the reported improvements, we repeat the main ablation experiments with five different random seeds and report the mean ± standard deviation for all four evaluation metrics. In addition, we conduct paired t-tests between model variants using seed-matched runs on the same test set. While the main comparison tables report the primary run for consistency with prior single-run reports, this section provides additional multi-seed statistics to evaluate training stability.
Table 3 summarizes the results. The full model achieves 83.42 ± 0.16 IoU, 0.898 ± 0.002 $F_\beta$, 0.027 ± 0.001 MAE, and 6.37 ± 0.14 BER across five seeds. Compared with the RGB baseline, the improvement is statistically significant across all four metrics ($p < 0.01$). Compared with the depth-branch-only model, the improvement in IoU is statistically significant ($p = 0.0294$), while the improvements in $F_\beta$, MAE, and BER are directionally consistent but do not reach the 0.05 significance level. We report these non-significant results transparently to avoid overclaiming.
We do not compute p-values against competing methods because their repeated-run predictions are not publicly available. Computing paired statistics against single-run results from other groups would not be methodologically valid; we therefore restrict statistical testing to our own seed-matched ablation experiments. A further observation is that the IoU standard deviation progressively decreases from the baseline (0.95) through the depth branch (0.25) to the full model (0.16). This trend suggests that the proposed modules not only improve mean performance but also reduce sensitivity to random initialization, contributing to more stable training.
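The seed-matched paired test used here can be sketched with the standard library alone. The IoU values below are made up purely to illustrate the computation and are not the per-seed results behind Table 3; the function name is hypothetical. With five seeds the test has four degrees of freedom, for which the two-sided 5% critical value of Student's t is 2.776.

```python
import math

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples in matching seed order."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

# Made-up IoU values for five seeds, for illustration only.
full_model = [83.4, 83.6, 83.3, 83.5, 83.2]
depth_only = [83.1, 83.2, 83.0, 83.3, 82.9]

t, dof = paired_t_statistic(full_model, depth_only)
T_CRIT_DF4 = 2.776  # two-sided 5% critical value, Student's t, 4 d.o.f.
significant = abs(t) > T_CRIT_DF4
```

Pairing by seed removes the variance shared by both variants under the same initialization, which is why a small but consistent gain can reach significance with only five runs. An exact p-value could be obtained with a routine such as `scipy.stats.ttest_rel`.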

4.5.4. Robustness to Sensor-Depth Corruption

To evaluate how the proposed method behaves when sensor depth is degraded beyond the naturally occurring corruption in the dataset, we conduct a controlled robustness experiment. Three types of corruption are applied to the sensor depth at test time—without any retraining—and we compare the depth-branch-only model (without DDEB/RGRCM) against the full model to isolate the contribution of the proposed modules under degraded conditions.
The three corruption types are defined as follows. Missing Depth sets random pixels to zero at rates of 30%, 50%, and 70%, simulating the pixel-level sensor failure commonly observed near reflective surfaces. Gaussian Noise adds noise sampled from $\mathcal{N}(0, \sigma^2)$ with $\sigma \in \{3, 5, 10\}$ in the 8-bit depth-intensity space (i.e., $[0, 255]$) and then converts the result back to $[0, 1]$ for the model, simulating electronic sensor noise and environmental interference. Block Corruption zeroes out contiguous 64 × 64 blocks at area ratios of 10%, 30%, and 50%, simulating regional sensor failure where an entire spatial neighborhood produces no valid measurement.
Table 4 reports the results. Across all three corruption types and all severity levels, the full model consistently exhibits a smaller IoU drop than the depth-branch-only baseline. For Missing Depth, the gap is most evident at 70% corruption, where the IoU drop decreases from 6.42 points (depth branch only) to 5.69 points (full model). For Gaussian Noise at $\sigma = 10$, the IoU drop decreases from 1.49 to 0.90 points. For Block Corruption at a 50% area ratio, the drop decreases from 2.29 to 1.92 points.
These results indicate that the DDEB evidence channels, particularly the joint-validity channel $e_7$ and the local-variance channel $e_8$, enable the network to detect when sensor depth has become unreliable, while the RGRCM reliability gate suppresses correction in regions where depth information is uninformative. We note that these corruption experiments provide a controlled stress test within the RGBD-Mirror benchmark and do not replace evaluation on additional datasets, which we discuss as a limitation in Section 5.
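The three corruptions can be sketched as simple test-time transforms on a depth map already normalized to [0, 1]. This is a minimal NumPy sketch: sampling missing pixels independently per pixel, clipping noisy values to the valid range, and placing blocks on a non-overlapping 64 × 64 grid are all assumptions, since the text does not specify these details, and the function names are hypothetical.

```python
import numpy as np

def missing_depth(depth, rate, rng):
    """Zero out roughly a fraction `rate` of pixels, chosen independently."""
    out = depth.copy()
    out[rng.random(depth.shape) < rate] = 0.0
    return out

def gaussian_noise(depth, sigma, rng):
    """Add N(0, sigma^2) noise in the 8-bit space, then map back to [0, 1]."""
    noisy = depth * 255.0 + rng.normal(0.0, sigma, size=depth.shape)
    return np.clip(noisy, 0.0, 255.0) / 255.0

def block_corruption(depth, area_ratio, rng, block=64):
    """Zero out grid-aligned block x block tiles covering about `area_ratio`."""
    out = depth.copy()
    h, w = depth.shape
    tiles = [(y, x) for y in range(0, h, block) for x in range(0, w, block)]
    n_drop = int(round(area_ratio * len(tiles)))
    for i in rng.choice(len(tiles), size=n_drop, replace=False):
        y, x = tiles[i]
        out[y:y + block, x:x + block] = 0.0
    return out
```

At a 512 × 512 resolution the grid contains 64 tiles, so a 30% area ratio is rounded to 19 tiles in this sketch; an implementation that samples block positions freely could hit the target ratio differently.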

4.5.5. Effect of Sensor Depth, Predicted Depth, and Dual-Depth Evidence

Table 5 analyzes the roles of sensor depth, predicted depth, and the proposed dual-depth evidence. The first setting, sensor depth only, removes DDEB and uses only the sensor depth through the dedicated depth branch. This variant achieves 83.02 IoU and 6.72 BER, showing that valid sensor depth already provides strong geometric cues when injected into the decoder.
The second setting, predicted depth only, also removes DDEB, but it replaces the sensor depth input of the depth branch with the monocular predicted depth generated by Depth Anything v2. This variant achieves 82.50 IoU and 6.85 BER, which is lower than the sensor-depth-only setting. This result indicates that predicted depth can provide useful geometric information, but it is not a sufficient replacement for sensor depth in the depth branch.
The final setting uses the full model, in which the depth branch takes sensor depth as input and RGRCM internally constructs dual-depth evidence from both sensor depth and predicted depth through DDEB. This setting achieves the best performance across all metrics, reaching 83.57 IoU, 0.899 $F_\beta$, 0.026 MAE, and 6.26 BER. The comparison shows that sensor depth should remain the primary source for hierarchical geometric encoding, while predicted depth is most effective when used as a complementary cue for internal dual-depth evidence construction inside RGRCM rather than as a direct replacement for sensor depth.
We note that the primary value of DA depth in our framework is not as an independent input modality competing with sensor depth. Rather, it serves as a reference signal for assessing the reliability of sensor depth within DDEB. When sensor depth is corrupted, the discrepancy between sensor depth and DA depth provides an informative cue that DDEB encodes through its evidence channels. This interpretation is supported by the corruption robustness experiments in Table 4: under severe sensor-depth corruption (e.g., 70% missing depth), the full dual-depth model retains IoU 77.87 compared to 76.60 for the sensor-depth-only model. The benefit of dual-depth evidence becomes more pronounced precisely when sensor depth is most unreliable, which is consistent with the design rationale that predicted depth is most valuable as a reliability reference rather than a standalone geometric source.

4.5.6. Effect of the Safe Loss Term

Table 6 evaluates the effect of the proposed safe loss term $L_{\mathrm{safe}}$ in the training objective of RGRCM. Without $L_{\mathrm{safe}}$, adding RGRCM on top of the depth branch does not improve IoU over the depth-branch baseline and yields only limited benefit in MAE, while the BER remains relatively high at 6.75. In contrast, when $L_{\mathrm{safe}}$ is included, the full model improves all evaluation metrics, reaching 83.57 IoU, 0.899 $F_\beta$, 0.026 MAE, and 6.26 BER.
This result shows that $L_{\mathrm{safe}}$ plays an important role in stabilizing the behavior of RGRCM. Since RGRCM performs residual correction on top of an already strong main prediction, unconstrained correction may unnecessarily perturb pixels that are already correct. The safe loss alleviates this issue by suppressing large corrections on pixels where the main prediction is already confident and consistent with the ground truth, thereby allowing the module to focus its correction capacity on ambiguous and difficult regions. As a result, the proposed loss improves not only BER but also IoU and MAE, confirming that it is an effective training objective for reliability-guided residual correction.
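To make the intuition concrete, the following sketch shows one plausible penalty with the behavior described above. It is not the definition from Section 3.4, which is not reproduced here: the weighting by `exp(-alpha * entropy)`, the correctness mask, the absolute-value penalty on the residual `delta`, and the function name are all hypothetical choices made only for illustration.

```python
import numpy as np

def safe_loss_sketch(p_main, delta, gt, alpha=5.0, eps=1e-7):
    """Hypothetical penalty on corrections at confident, already-correct pixels.

    p_main: base prediction in [0, 1]; delta: residual correction;
    gt: binary ground truth. All arrays share one shape.
    """
    p = np.clip(np.asarray(p_main, dtype=np.float64), eps, 1.0 - eps)
    gt = np.asarray(gt, dtype=np.float64)
    delta = np.asarray(delta, dtype=np.float64)

    # Normalized binary entropy of the base prediction (1 = most uncertain).
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p)) / np.log(2.0)
    # 1 where the thresholded base prediction already agrees with the label.
    correct = ((p >= 0.5) == (gt >= 0.5)).astype(np.float64)
    # Weight is close to 1 only for confident and correct pixels.
    weight = correct * np.exp(-alpha * entropy)
    return float((weight * np.abs(delta)).mean())
```

Under this hypothetical form, a larger $\alpha$ makes the weight fall off more sharply with uncertainty, and the term would enter the objective scaled by $\lambda$, which matches the roles the text assigns to the two coefficients.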

4.5.7. Hyperparameter Sensitivity Analysis

Table 7 reports the sensitivity analysis for the two hyperparameters of the safe loss: the regularization weight $\lambda$ and the uncertainty sensitivity $\alpha$. For each experiment, one hyperparameter is varied while the other is fixed at its default value.
The choice of $\lambda$ has a notable effect on performance. With $\lambda = 0.01$, the regularization is too weak to prevent unnecessary corrections, yielding IoU 82.85 and BER 6.83. With $\lambda = 0.2$, the residual correction is over-suppressed, and performance drops to IoU 83.04 and BER 6.70. The default value $\lambda = 0.1$ achieves the best trade-off, reaching 83.57 IoU and 6.26 BER.
In contrast, $\alpha$ has a milder influence. Both $\alpha = 5.0$ and $\alpha = 7.0$ produce nearly identical results (IoU 83.57 vs. 83.56), while $\alpha = 3.0$ leads to a modest decrease (IoU 83.15). Since $\alpha$ controls how sharply the safe loss distinguishes confident predictions from uncertain ones, a value that is too small blurs this distinction and weakens the regularization effect. The results show that the default setting ($\lambda = 0.1$, $\alpha = 5.0$) achieves the best performance among the tested settings and that the method is not overly sensitive to either hyperparameter within the tested range.
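The one-at-a-time protocol corresponds to five distinct training runs rather than a full grid of nine. A short sketch of how such a run list could be generated is given below; the values are those reported in the text, while the dictionary keys and the helper name are hypothetical.

```python
DEFAULTS = {"lambda": 0.1, "alpha": 5.0}
GRID = {"lambda": [0.01, 0.1, 0.2], "alpha": [3.0, 5.0, 7.0]}

def one_at_a_time(defaults, grid):
    """Vary one hyperparameter at a time, keeping the others at their defaults."""
    configs = []
    for name, values in grid.items():
        for value in values:
            cfg = dict(defaults, **{name: value})
            if cfg not in configs:  # the default setting occurs in both sweeps
                configs.append(cfg)
    return configs

configs = one_at_a_time(DEFAULTS, GRID)
```

Because the two coefficients are never varied jointly, this design cannot reveal interactions between them, so conclusions about sensitivity hold only along the two tested axes.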

4.6. Complexity Analysis

Table 8 compares the computational complexity and accuracy–efficiency trade-off of the proposed method with representative RGB-D mirror segmentation methods under an input size of 416 × 416. PDNet is the lightest and fastest model in the table, achieving 153.8 FPS with 0.41 GB peak memory. However, its segmentation accuracy is substantially lower than that of the proposed method, with 77.77 IoU and 7.77 BER. In contrast, the proposed model requires more computation than PDNet but improves IoU by 5.80 points and reduces BER by 1.51 points, while still maintaining real-time inference at 57.8 FPS.
Compared with the SATNet baseline, adding the depth branch and RGRCM increases the computational cost only modestly, from 102.14 to 105.08 GFLOPs, and the parameter count from 125.35M to 126.13M. The final model achieves 57.8 FPS with 1.15 GB peak GPU memory. These results show that the proposed method does not aim to be the fastest model; rather, it provides a favorable trade-off between segmentation accuracy and computational efficiency. Note that, for PDNet, SATNet, and our variants, GFLOPs, FPS, and peak GPU memory were measured under the same environment using an input size of 416 × 416 and batch size 1. For reference, Table 8 also includes UTLNet, a recent RGB-D mirror segmentation method. Since the UTLNet FPS is reported from the original paper and was measured under a different hardware and implementation environment, it is used only as a contextual reference rather than a strictly direct speed comparison.
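A throughput figure of this kind is usually obtained by timing repeated forward passes after a warm-up phase. The sketch below shows that pattern in plain Python; the measurement procedure is not described in the text, so the warm-up count, the iteration count, and the function name are assumptions. For a GPU model, `run_once` would additionally need to wait for the device to finish (for example by calling `torch.cuda.synchronize()` in PyTorch) so that asynchronous kernels are included in the timing.

```python
import time

def measure_fps(run_once, warmup=10, iters=50):
    """Average calls per second of `run_once`, excluding warm-up calls."""
    if iters <= 0:
        raise ValueError("iters must be positive")
    for _ in range(warmup):  # warm-up: caches, lazy initialization, autotuning
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return iters / max(elapsed, 1e-12)  # guard against a zero-length interval
```

Here `run_once` stands for a single forward pass on one 416 × 416 input with batch size 1, matching the setting reported for Table 8.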

4.7. Failure Case Analysis

To provide a balanced evaluation, we analyze representative failure cases of the proposed method, as shown in Figure 7. We identify three failure modes. First, mirrors with irregular or uncommon shapes that deviate from the rectangular forms predominant in the training set can lead to incomplete segmentation boundaries. Second, when sensor depth contains an extremely high proportion of missing values, the dual-depth evidence channels carry little discriminative information, limiting the ability of DDEB to construct meaningful evidence for correction. Third, glass doors, windows, and other transparent or reflective surfaces may exhibit depth characteristics similar to those of mirrors, which can cause false positive predictions because depth-based evidence alone cannot fully distinguish mirrors from other reflective surfaces.

5. Conclusions

In this paper, we presented a reliable RGB-D mirror segmentation framework built upon SATNet. The proposed method extends the symmetry-aware RGB baseline with two key components: a dedicated depth branch for hierarchical sensor-depth encoding and a Reliability-Guided Residual Correction Module (RGRCM) for final prediction refinement. A central design choice of the proposed framework is that the predicted monocular depth is not used as an independent modality branch. Instead, it is combined with the sensor depth inside RGRCM to construct discrepancy-aware dual-depth evidence through the Dual-Depth Evidence Block (DDEB). This evidence is then used to support uncertainty-aware residual correction only where such correction is likely to be reliable and beneficial.
Extensive experiments on the RGBD-Mirror benchmark showed that the proposed design is effective. The depth branch consistently strengthened the SATNet baseline by providing explicit geometric cues, while RGRCM further improved the final prediction through reliability-guided residual correction. The full model achieved the best overall performance among the compared methods, reaching 83.57 IoU, 0.899 $F_\beta$, 0.026 MAE, and 6.26 BER. The ablation results further confirmed that sensor depth remains the primary geometric cue, while monocular predicted depth serves as a complementary source of internal evidence for more reliable correction.
Overall, the results suggest that the key to effective RGB-D mirror segmentation is not simply to add depth information, but to use it in a reliability-aware manner. By combining hierarchical sensor-depth encoding with selective dual-depth-guided correction, the proposed method provides a practical and effective solution for reliability-aware RGB-D mirror segmentation. We note that the current evaluation is limited to the RGBD-Mirror benchmark. Evaluation on additional RGB-D mirror segmentation datasets, when they become available, would further strengthen the generality of the proposed framework. The corruption robustness experiments demonstrate that the proposed method degrades more gracefully than the depth-branch-only baseline under the tested sensor-depth corruption conditions, providing evidence of robustness within the scope of this benchmark. In future work, it would be interesting to explore stronger hard-case modeling for severe depth corruption, finer boundary-aware correction strategies, and broader generalization to other reflective or transparent objects.

Author Contributions

Conceptualization, T.K. and Y.J.J.; methodology, T.K.; software, T.K.; formal analysis, T.K. and Y.J.J.; investigation, T.K.; data curation, T.K.; visualization, T.K. and Y.J.J.; validation, T.K. and Y.J.J.; writing—original draft preparation, T.K. and Y.J.J.; writing—review and editing, T.K. and Y.J.J.; supervision, Y.J.J.; resources, Y.J.J.; funding acquisition, Y.J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-16065170) and the Gachon University research fund of 2025 (GCU-202500770001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created in this study. The RGBD-Mirror dataset used in this work is publicly available and was introduced in reference [2]. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, X.; Mei, H.; Xu, K.; Wei, X.; Yin, B.; Lau, R.W.H. Where Is My Mirror? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8809–8818.
  2. Mei, H.; Dong, B.; Dong, W.; Peers, P.; Yang, X.; Zhang, Q.; Wei, X. Depth-Aware Mirror Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3044–3053.
  3. Whelan, T.; Goesele, M.; Lovegrove, S.; Straub, J.; Green, S.; Szeliski, R.; Butterfield, S.; Verma, S.; Newcombe, R. Reconstructing Scenes with Mirror and Glass Surfaces. ACM Trans. Graph. 2018, 37, 102:1–102:11.
  4. Mei, H.; Yang, X.; Wang, Y.; Liu, Y.; He, S.; Zhang, Q.; Wei, X.; Lau, R.W.H. Don’t Hit Me! Glass Detection in Real-World Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3684–3693.
  5. Lin, J.; Wang, G.; Lau, R.W.H. Progressive Mirror Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3697–3705.
  6. Tan, X.; Lin, J.; Xu, K.; Chen, P.; Ma, L.; Lau, R.W.H. Mirror Detection with the Visual Chirality Cue. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3492–3504.
  7. Guan, H.; Lin, J.; Lau, R.W.H. Learning Semantic Associations for Mirror Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5941–5950.
  8. Huang, T.; Dong, B.; Lin, J.; Liu, X.; Lau, R.W.H.; Zuo, W. Symmetry-Aware Transformer-Based Mirror Detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 935–943.
  9. Zhou, W.; Cai, Y.; Dong, X.; Qiang, F.; Qiu, W. ADRNet-S*: Asymmetric Depth Registration Network via Contrastive Knowledge Distillation for RGB-D Mirror Segmentation. Inf. Fusion 2024, 108, 102392.
  10. Ying, X.; Chuah, M.C. UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 20–37.
  11. Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything V2. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 December 2024.
  12. He, R.; Lin, J.; Lau, R.W.H. Efficient Mirror Detection via Multi-level Heterogeneous Learning. Proc. AAAI Conf. Artif. Intell. 2023, 37, 790–798.
  13. Zhou, W.; Cai, Y.; Zhang, L.; Yan, W.; Yu, L. UTLNet: Uncertainty-Aware Transformer Localization Network for RGB-Depth Mirror Segmentation. IEEE Trans. Multimed. 2024, 26, 4564–4574.
  14. Zhou, W.; Cai, Y.; Qiang, F. Morphology-Guided Network via Knowledge Distillation for RGB-D Mirror Segmentation. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17382–17391.
  15. Zhou, W.; Zhang, H.; Liu, Y.; Luo, T. Enhancing RGB-D Mirror Segmentation with a Neighborhood-Matching and Demand-Modal Adaptive Network Using Knowledge Distillation. IEEE Trans. Autom. Sci. Eng. 2025, 22, 12679–12692.
  16. Kurohiji, R.; Hachiya, H. Depth Inconsistency-based Spatial-channel Attention Gate for Mirror Segmentation. In Proceedings of the 36th British Machine Vision Conference (BMVC), Sheffield, UK, 24–27 November 2025.
  17. Hu, X.; Yang, K.; Fei, L.; Wang, K. ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1440–1444.
  18. Chen, X.; Lin, K.-Y.; Wang, J.; Wu, W.; Qian, C.; Li, H.; Zeng, G. Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 561–577.
  19. Fan, D.-P.; Zhai, Y.; Borji, A.; Yang, J.; Shao, L. BBS-Net: RGB-D Salient Object Detection with a Bifurcated Backbone Strategy Network. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 275–292.
  20. Wang, S.; Jiang, F.; Xu, B. Global Guided Cross-Modal Cross-Scale Network for RGB-D Salient Object Detection. Sensors 2023, 23, 7221.
  21. Peng, Y.; Zhai, Z.; Feng, M. SLMSF-Net: A Semantic Localization and Multi-Scale Fusion Network for RGB-D Salient Object Detection. Sensors 2024, 24, 1117.
  22. Kim, J.; Ghosh, D.K.; Jung, Y.J. Event-based video deblurring based on image and event feature fusion. Expert Syst. Appl. 2023, 223, 119917.
  23. Ghosh, D.K.; Jung, Y.J. Two-stage cross-fusion network for stereo event-based depth estimation. Expert Syst. Appl. 2024, 241, 122743.
  24. Ghosh, D.K.; Jung, Y.J. Depth cue fusion for event-based stereo depth estimation. Inf. Fusion 2025, 117, 102891.
  25. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
  27. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  28. Xie, Z.; Wang, S.; Yu, Q.; Tan, X.; Xie, Y. CSFwinformer: Cross-Space-Frequency Window Transformer for Mirror Detection. IEEE Trans. Image Process. 2024, 33, 1853–1867.
  29. Zha, M.; Fu, F.; Pei, Y.; Wang, G.; Li, T.; Tang, X.; Yang, Y.; Shen, H.T. Dual Domain Perception and Progressive Refinement for Mirror Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11942–11953.
  30. Shao, Z.; Chen, R.; Shi, X.; Liu, B.; Li, C.; Ma, L.; Yeung, D.-Y. Mirror Detection via Multi-Directional Similarity Perception and Spectral Saliency Enhancement. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 10099–10109.
  31. Meng, Q.; Liu, Y.; Hu, R.; Liang, M.; Yan, J.; Zhu, L. SAMirror: Enhancing Mirror Detection via Integrated Visual-Depth Cues in Segment Anything Model. Vis. Comput. 2025, 41, 12679–12690.
Figure 1. A motivating example on the RGBD-Mirror benchmark. (a) Input image, (b) sensor depth, (c) monocular predicted depth from Depth Anything v2, (d) SATNet, (e) UTLNet, (f) ours, and (g) ground truth. The sensor depth is severely corrupted around the mirror, while the predicted depth still provides a plausible geometric cue. As a result, existing methods may over-segment reflected content or miss the mirror region, whereas the proposed method better matches the ground truth.
Figure 2. Overall architecture of the proposed method.
Figure 3. Structure of the reliability-guided residual correction module (RGRCM).
Figure 4. Qualitative comparison on the RGBD-Mirror benchmark. Columns from left to right show (a) input image, (b) sensor depth, (c) monocular predicted depth from Depth Anything v2, (d) PDNet, (e) SANet, (f) VCNet, (g) SATNet, (h) CSFwinformer-B, (i) DPRNet, (j) S2MD, (k) UTLNet, (l) ours, and (m) ground-truth mask.
Figure 5. Qualitative ablation results of the proposed framework. From left to right, each column shows (a) input image, (b) sensor depth, (c) monocular predicted depth from Depth Anything v2, (d) SATNet, (e) SATNet + depth branch, (f) SATNet + depth branch + RGRCM (ours), and (g) ground-truth mask. The examples show representative challenging cases where sensor depth is corrupted, incomplete, or ambiguous around mirror regions. Compared with the RGB-only baseline, the depth branch recovers additional geometric cues but may still produce incomplete or noisy predictions. By further introducing RGRCM, the full model produces cleaner and more accurate masks that are more consistent with the ground truth.
Figure 6. Visualization of the uncertainty map $H_{main}$ from RGRCM. From left to right: (a) input image, (b) sensor depth, (c) monocular predicted depth from Depth Anything v2 (DA depth), (d) base prediction $P_{main}$, (e) uncertainty map $H_{main}$, (f) final prediction $P_{final}$, and (g) ground-truth mask.
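The caption of Figure 6 does not say how the uncertainty map is computed. One common way to turn a probability map into a per-pixel uncertainty map is the binary entropy of the prediction, which peaks where the prediction is closest to 0.5. The sketch below illustrates that idea only; the assumption that $H_{main}$ is an entropy of $P_{main}$, the function name `binary_entropy_map`, and the clamping constant `eps` are all illustrative and not taken from the paper.

```python
import math


def binary_entropy_map(prob, eps=1e-7):
    """Per-pixel binary entropy (in bits) of a probability map.

    `prob` is a 2-D list of mirror probabilities in [0, 1].
    ASSUMPTION: entropy is used here as a generic uncertainty measure;
    the paper's actual definition of H_main may differ.
    """
    entropy = []
    for row in prob:
        out_row = []
        for p in row:
            # Clamp to avoid log(0) at fully confident pixels.
            p = min(max(p, eps), 1.0 - eps)
            h = -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
            out_row.append(h)
        entropy.append(out_row)
    return entropy


# Toy 2 x 2 probability map (hypothetical values, not model output).
toy_pred = [[0.5, 0.99],
            [0.0, 0.8]]
toy_entropy = binary_entropy_map(toy_pred)
```

Under this reading, pixels with a value near 1 would be the ones where a residual correction has the most room to change the decision, while pixels near 0 are already confidently classified.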
Figure 7. Representative failure cases of the proposed method. From left to right: (a) input image, (b) sensor depth, (c) monocular predicted depth from Depth Anything v2 (DA depth), (d) SATNet (baseline), (e) our prediction, and (f) ground-truth mask. Rows 1 and 2 show unusual mirror shapes that deviate from the shapes predominant in the training set. Row 3 shows a case with excessive missing depth, where the evidence channels carry insufficient information. Row 4 shows a glass surface misidentified as a mirror due to similar depth characteristics.
Table 1. Comparison results of the mirror segmentation methods on the RGBD-Mirror benchmark. Best results are shown in bold. The results of Kurohiji and Hachiya are taken from their original paper, where the reported values are the mean results of the top five runs selected from ten repeated trials. “–” indicates that the corresponding metric was not reported in the original paper.
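| Type | Method | IoU↑ | $F_\beta$↑ | MAE↓ | BER↓ |
|---|---|---|---|---|---|
| RGB mirror methods | VCNet | 73.01 | 0.849 | 0.052 | 10.42 |
| | SATNet | 80.69 | 0.877 | 0.030 | 7.33 |
| | CSFwinformer-B | 78.66 | 0.863 | 0.031 | 8.57 |
| | DPRNet | 76.10 | 0.811 | 0.047 | – |
| | S2MD | 78.60 | 0.866 | 0.030 | – |
| | SAMirror | 79.20 | 0.836 | 0.026 | 10.02 |
| RGB-D mirror methods | PDNet | 77.77 | 0.825 | 0.042 | 7.77 |
| | SANet | 78.43 | 0.834 | 0.041 | 8.16 |
| | NDANet-S* | 79.93 | 0.844 | 0.035 | 7.56 |
| | MGNet-S* | 80.80 | 0.859 | 0.030 | 7.39 |
| | UTLNet | 80.50 | 0.858 | 0.032 | 7.23 |
| | ADRNet-S* | 82.21 | 0.871 | 0.030 | 7.02 |
| | Kurohiji and Hachiya | 70.94 | 0.881 | 0.079 | – |
| | Ours | 83.57 | 0.899 | 0.026 | 6.26 |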
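For readers who want to reproduce numbers of the kind reported in Table 1, the following sketch computes the four metrics for a single image using definitions that are common in mirror segmentation work. Several details are assumptions rather than facts from this paper: the binarization threshold of 0.5, the weight $\beta^2 = 0.3$ in the F-measure, computing MAE on the unthresholded probabilities, and the function name `mirror_metrics`.

```python
def mirror_metrics(pred, gt, threshold=0.5, beta2=0.3):
    """IoU (%), F-beta, MAE, and BER (%) for one image.

    pred: 2-D list of mirror probabilities in [0, 1].
    gt:   2-D list of ground-truth labels (1 = mirror, 0 = background).
    ASSUMPTIONS: threshold=0.5 and beta2=0.3 are conventional choices,
    not values confirmed by the paper.
    """
    tp = fp = fn = tn = 0
    abs_err = 0.0
    count = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            abs_err += abs(p - g)  # MAE uses the raw probability
            count += 1
            b = 1 if p >= threshold else 0
            if b and g:
                tp += 1
            elif b and not g:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1

    union = tp + fp + fn
    iou = 100.0 * tp / union if union else 100.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta2 * precision + recall
    f_beta = (1.0 + beta2) * precision * recall / denom if denom else 0.0

    mae = abs_err / count

    # Balanced error rate: mean of the per-class error rates, in percent.
    pos_acc = tp / (tp + fn) if (tp + fn) else 1.0
    neg_acc = tn / (tn + fp) if (tn + fp) else 1.0
    ber = 100.0 * (1.0 - 0.5 * (pos_acc + neg_acc))

    return {"iou": iou, "f_beta": f_beta, "mae": mae, "ber": ber}


# Toy 2 x 2 example (hypothetical values, not benchmark data).
toy = mirror_metrics([[0.9, 0.8], [0.2, 0.6]], [[1, 1], [0, 0]])
```

A benchmark score would then be an aggregate of these per-image values over the test set; the exact aggregation protocol is not specified in this section.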
Table 2. Ablation results for the effect of the proposed components. The “simple depth evidence” variant removes DDEB and instead feeds the projected feature from Stage 2 of the depth branch into RGRCM. Specifically, this projected feature replaces the DDEB output as the input to both DRH and RGH. The “full” variant uses the complete DDEB representation inside RGRCM. The best result in each column is shown in bold.
| Method | IoU↑ | $F_\beta$↑ | MAE↓ | BER↓ |
|---|---|---|---|---|
| Baseline (SATNet) | 80.69 | 0.877 | 0.030 | 7.33 |
| SATNet + Depth Branch | 83.02 | 0.898 | 0.028 | 6.72 |
| SATNet + Depth Branch + RGRCM (simple depth evidence) | 83.41 | 0.900 | 0.028 | 6.48 |
| SATNet + Depth Branch + RGRCM (full) | 83.57 | 0.899 | 0.026 | 6.26 |
Table 3. Statistical stability over 5 random seeds with paired t-test p-values. The best result in each column is shown in bold.
| Method | IoU↑ | $F_\beta$↑ | MAE↓ | BER↓ |
|---|---|---|---|---|
| Baseline (SATNet) | 80.13 ± 0.95 | 0.875 ± 0.006 | 0.032 ± 0.002 | 7.84 ± 0.49 |
| + Depth Branch | 83.04 ± 0.25 | 0.897 ± 0.003 | 0.028 ± 0.001 | 6.64 ± 0.20 |
| + DDEB + RGRCM (Full) | 83.42 ± 0.16 | 0.898 ± 0.002 | 0.027 ± 0.001 | 6.37 ± 0.14 |
| Paired t-test p-values vs. Full model: | | | | |
| Baseline (SATNet) | 0.0015 ** | 0.0006 *** | 0.0053 ** | 0.0038 ** |
| + Depth Branch | 0.0294 * | 0.8712 | 0.0993 | 0.0695 |

*** $p < 0.001$, ** $p < 0.01$, * $p < 0.05$.
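As a rough illustration of how p-values like those in Table 3 can be obtained, the sketch below runs a two-sided paired t-test on five matched runs using only the Python standard library. With five seeds the test has four degrees of freedom, for which the Student t CDF has a closed form, so no external package is needed. The function names and the per-seed scores are hypothetical placeholders; the paper's per-seed results are not given in this section, and a general-purpose routine such as `scipy.stats.ttest_rel` would normally be preferable.

```python
import math
from statistics import mean, stdev


def t_cdf_df4(t):
    """Closed-form CDF of Student's t distribution with 4 degrees of freedom."""
    s = 1.0 + t * t / 4.0
    return 0.5 + 0.375 * (t / math.sqrt(s)) * (1.0 - (t * t) / (12.0 * s))


def paired_t_test_5(scores_a, scores_b):
    """Two-sided paired t-test for exactly five matched runs (df = 4).

    Returns (t statistic, p-value). Runs are paired by seed.
    """
    if len(scores_a) != 5 or len(scores_b) != 5:
        raise ValueError("this sketch only handles five paired runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0.0:
        raise ValueError("differences are constant; t statistic is undefined")
    t_stat = mean(diffs) / (sd / math.sqrt(len(diffs)))
    p_value = 2.0 * (1.0 - t_cdf_df4(abs(t_stat)))
    return t_stat, p_value


# Made-up placeholder scores, NOT the paper's per-seed results.
full_model = [83.4, 83.5, 83.3, 83.6, 83.2]
baseline = [80.0, 81.0, 79.5, 80.5, 79.0]
t_stat, p_value = paired_t_test_5(full_model, baseline)
```

Pairing by seed matters here: the test works on the per-seed differences, so variation shared by both models under the same seed cancels out instead of inflating the variance.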
Table 4. Robustness to sensor-depth corruption. Δ denotes the IoU drop from the clean baseline.
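| Method | Corruption | IoU↑ | $F_\beta$↑ | MAE↓ | BER↓ |
|---|---|---|---|---|---|
| SATNet + Depth Branch | No corruption | 83.02 | 0.898 | 0.028 | 6.72 |
| | Missing 30% | 77.79 (−5.23) | 0.859 | 0.033 | 9.97 |
| | Missing 50% | 76.71 (−6.31) | 0.851 | 0.034 | 10.60 |
| | Missing 70% | 76.60 (−6.42) | 0.851 | 0.034 | 10.69 |
| | Noise σ = 3 | 82.68 (−0.34) | 0.896 | 0.029 | 6.95 |
| | Noise σ = 5 | 82.32 (−0.70) | 0.893 | 0.029 | 7.16 |
| | Noise σ = 10 | 81.53 (−1.49) | 0.888 | 0.030 | 7.67 |
| | Block 10% | 82.47 (−0.55) | 0.893 | 0.028 | 6.76 |
| | Block 30% | 81.37 (−1.65) | 0.884 | 0.029 | 6.97 |
| | Block 50% | 80.73 (−2.29) | 0.879 | 0.029 | 7.14 |
| Ours | No corruption | 83.57 | 0.899 | 0.026 | 6.26 |
| | Missing 30% | 78.94 (−4.63) | 0.868 | 0.031 | 9.37 |
| | Missing 50% | 77.93 (−5.63) | 0.861 | 0.032 | 9.98 |
| | Missing 70% | 77.87 (−5.69) | 0.861 | 0.032 | 10.04 |
| | Noise σ = 3 | 83.32 (−0.24) | 0.899 | 0.027 | 6.45 |
| | Noise σ = 5 | 83.11 (−0.46) | 0.898 | 0.027 | 6.58 |
| | Noise σ = 10 | 82.66 (−0.90) | 0.896 | 0.028 | 6.89 |
| | Block 10% | 83.21 (−0.35) | 0.897 | 0.027 | 6.41 |
| | Block 30% | 82.22 (−1.35) | 0.889 | 0.028 | 6.85 |
| | Block 50% | 81.64 (−1.92) | 0.886 | 0.029 | 7.14 |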
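The three corruption types in Table 4 can be simulated in several ways, and this section does not spell out the exact protocol. The sketch below shows one plausible interpretation: "Missing" zeroes a random fraction of pixels, "Noise" adds zero-mean Gaussian noise with standard deviation σ, and "Block" zeroes a single rectangle covering roughly the given fraction of the image. The 8-bit depth range, the use of 0 as the missing-value marker, the single-rectangle block shape, and all function names are assumptions.

```python
import random


def drop_pixels(depth, ratio, seed=0):
    """Set a random `ratio` of pixels to 0 (0 is ASSUMED to mean 'missing')."""
    rng = random.Random(seed)
    h, w = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    k = round(ratio * h * w)
    for idx in rng.sample(range(h * w), k):
        out[idx // w][idx % w] = 0
    return out


def add_gaussian_noise(depth, sigma, seed=0, max_value=255):
    """Add zero-mean Gaussian noise, clipped to [0, max_value].

    ASSUMPTION: depth is stored on an 8-bit scale, so sigma is in gray levels.
    """
    rng = random.Random(seed)
    return [[min(max_value, max(0, v + rng.gauss(0, sigma))) for v in row]
            for row in depth]


def mask_block(depth, ratio, seed=0):
    """Zero one randomly placed rectangle covering about `ratio` of the area."""
    out = [row[:] for row in depth]
    if ratio <= 0:
        return out
    rng = random.Random(seed)
    h, w = len(depth), len(depth[0])
    scale = ratio ** 0.5  # same aspect ratio as the image
    bh = min(h, max(1, round(h * scale)))
    bw = min(w, max(1, round(w * scale)))
    top = rng.randint(0, h - bh)
    left = rng.randint(0, w - bw)
    for y in range(top, top + bh):
        for x in range(left, left + bw):
            out[y][x] = 0
    return out


# Toy 10 x 10 depth map with a constant value (hypothetical input).
toy_depth = [[100] * 10 for _ in range(10)]
missing_30 = drop_pixels(toy_depth, 0.3)
noisy_5 = add_gaussian_noise(toy_depth, 5)
blocked_25 = mask_block(toy_depth, 0.25)
```

Each function returns a new map and leaves its input untouched, so the same clean depth map can be reused across all corruption settings with a fixed seed.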
Table 5. Effect of sensor depth, predicted depth, and dual-depth evidence. The first two settings remove DDEB, while the last setting uses the full model with DDEB inside RGRCM. The best result in each column is shown in bold.
| Method | IoU↑ | $F_\beta$↑ | MAE↓ | BER↓ |
|---|---|---|---|---|
| Sensor depth only (without DDEB) | 83.02 | 0.898 | 0.028 | 6.72 |
| Predicted depth only (without DDEB; predicted depth as depth-branch input) | 82.50 | 0.882 | 0.029 | 6.85 |
| Sensor + predicted depth (full model with DDEB) | 83.57 | 0.899 | 0.026 | 6.26 |
Table 6. Effect of the safe loss term in the proposed loss function. The best result in each column is shown in bold.
| Method | IoU↑ | $F_\beta$↑ | MAE↓ | BER↓ |
|---|---|---|---|---|
| SATNet + Depth Branch + RGRCM (without $L_{safe}$) | 83.02 | 0.897 | 0.027 | 6.75 |
| SATNet + Depth Branch + RGRCM (with $L_{safe}$) | 83.57 | 0.899 | 0.026 | 6.26 |
Table 7. Sensitivity analysis of safe loss hyperparameters λ and α . The best result in each column is shown in bold.
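| $\lambda$ | $\alpha$ | IoU↑ | $F_\beta$↑ | MAE↓ | BER↓ |
|---|---|---|---|---|---|
| 0.01 | 5.0 | 82.85 | 0.894 | 0.027 | 6.83 |
| 0.1 | 5.0 | 83.57 | 0.899 | 0.026 | 6.26 |
| 0.2 | 5.0 | 83.04 | 0.896 | 0.027 | 6.70 |
| 0.1 | 3.0 | 83.15 | 0.895 | 0.027 | 6.60 |
| 0.1 | 5.0 | 83.57 | 0.899 | 0.026 | 6.26 |
| 0.1 | 7.0 | 83.56 | 0.899 | 0.027 | 6.34 |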
Table 8. Complexity comparison with representative methods.
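| Method | Input Size | GFLOPs↓ | Params (M)↓ | FPS↑ | Peak Mem (GB)↓ | IoU↑ | BER↓ |
|---|---|---|---|---|---|---|---|
| PDNet | 416 × 416 | 82.31 | 80.54 | 153.8 | 0.41 | 77.77 | 7.77 |
| UTLNet | 416 × 416 | 157.74 | 263.69 | 9.5 | - | 80.50 | 7.23 |
| SATNet (baseline) | 416 × 416 | 102.14 | 125.35 | 60.8 | 0.59 | 80.69 | 7.33 |
| + Depth Branch | 416 × 416 | 103.11 | 126.01 | 59.5 | 0.62 | 83.02 | 6.72 |
| + Depth Branch + RGRCM (ours) | 416 × 416 | 105.08 | 126.13 | 57.8 | 1.15 | 83.57 | 6.26 |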
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Kim, T.; Jung, Y.J. RGB-D Mirror Segmentation with Reliability-Guided Residual Correction. Sensors 2026, 26, 3739. https://doi.org/10.3390/s26123739
