Adaptive RGB-D Semantic Segmentation with Skip-Connection Fusion for Indoor Staircase and Elevator Localization
Abstract
1. Introduction
1.1. Limitations
- Reflective and Translucent Surfaces: Current models struggle when depth sensors provide insufficient geometric cues on reflective and translucent surfaces, leading to inaccurate segmentation of reflective elevator doors.
- Ineffective Fusion of Complementary Modalities: Current approaches rely largely on generic fusion strategies that do not specifically cater to the unique demands of architectural segmentation. In particular, these methods often fail to dynamically integrate RGB and depth information, resulting in suboptimal feature fusion in complex environments.
1.2. Key Contributions
- Learnable Skip-Connection Fusion (SCF): A dynamic fusion module that adaptively balances RGB and depth features through learnable weighting mechanisms. This design enhances segmentation robustness by effectively handling reflective and translucent surfaces commonly encountered in architectural environments.
- Integration with PSPNet: Our framework leverages pyramid pooling within the PSPNet architecture to capture both global and local contextual information. This integration significantly improves segmentation accuracy, particularly in cluttered and intricate scenes.
- Task-Specific Optimization: The proposed method is specifically tailored to segment stairs and elevators, achieving a favorable trade-off between segmentation accuracy and real-time computational efficiency.
- Dataset Contribution: We have collected a dedicated dataset comprising 2386 images from real-world environments at Minjiang University in Fuzhou, Fujian Province, China. The dataset includes 1818 staircase images and 568 elevator images, captured from two camera perspectives to simulate robotic vision. The segmentation task is defined over six classes: upstairs, downstairs, elevator interior, elevator door, elevator frame, and background. All images were pre-processed via segmentation and padding and resized to 640 × 480 pixels, with 1671 images used for training and 715 for validation.
- Problem Significance: We address a critical yet relatively overlooked challenge in indoor robotic navigation—precise segmentation of stairs and elevators. Accurate recognition of these architectural structures is essential for ensuring safe and efficient navigation in multi-floor environments. Misclassification can lead to serious consequences, such as navigation failures, accessibility issues, and potential safety hazards for both robots and humans. Our approach introduces a novel fusion mechanism that improves segmentation accuracy, making autonomous navigation more reliable in real-world scenarios.
- Methodological Contribution: We propose a novel Skip-Connection Fusion (SCF) module that dynamically integrates RGB and depth features for enhanced segmentation. Unlike conventional concatenation-based fusion, SCF learns adaptive weights to balance modality contributions, effectively mitigating challenges such as occlusions, reflections, and variations in lighting conditions. Using this approach, we significantly improve the segmentation performance of robotic perception systems in complex indoor environments.
2. Related Work
2.1. Multimodal Fusion for Semantic Segmentation
2.1.1. Early and Foundational Approaches
2.1.2. Advanced Fusion Techniques: Feature-Level Integration
2.2. Segmentation of Architectural Structures
2.2.1. Stair and Elevator Detection
2.2.2. Context-Aware Multi-Scale Fusion
Extension to Other Architectural Features
Justification for Model Choices
3. Methodology
3.1. Summary of SCF and Comparative Analysis
3.2. Preliminaries
3.2.1. Task Description
3.2.2. Model Training and Loss Function
3.2.3. Skip-Connection Fusion Module
3.2.4. SCF Module Architecture Design and Placement
3.3. Feature Fusion via Concatenation
3.3.1. Overview of Concatenation
3.3.2. Computational Trade-Offs
3.3.3. Limitations and Transition to SCF
3.4. Our Module: Skip-Connection Fusion (SCF)
3.4.1. Feature Extraction
- RGB Backbone: This network extracts texture and appearance features from the RGB input $I_{\mathrm{RGB}}$. It captures rich color information and fine-grained texture details, which are critical for delineating object boundaries and understanding visual context.
- Depth Backbone: In contrast, the depth backbone processes the depth input $I_{\mathrm{D}}$. While it shares the same ResNet architecture, the first convolutional layer is modified to accept single-channel input. This network is optimized to capture geometric and spatial structures, providing complementary cues to the RGB features.
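As an illustration of this dual-backbone design, the following is a minimal PyTorch sketch that builds two ResNet-50 streams and swaps the depth stream's first convolution for a single-channel version; the framework, the ResNet-50 variant, and the layer at which features are taken are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DualBackbone(nn.Module):
    """Illustrative dual-stream extractor: one ResNet-50 for RGB,
    one ResNet-50 with a 1-channel first convolution for depth."""
    def __init__(self):
        super().__init__()
        self.rgb_net = resnet50(weights=None)
        self.depth_net = resnet50(weights=None)
        # Replace the depth stream's first conv to accept a single channel.
        self.depth_net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                         padding=3, bias=False)

    def _features(self, net, x):
        # Run the ResNet stem and residual stages, keeping the spatial map
        # (stop before global pooling and the classifier head).
        x = net.relu(net.bn1(net.conv1(x)))
        x = net.maxpool(x)
        x = net.layer1(x); x = net.layer2(x)
        x = net.layer3(x); x = net.layer4(x)
        return x

    def forward(self, rgb, depth):
        f_rgb = self._features(self.rgb_net, rgb)      # B x 2048 x H/32 x W/32
        f_depth = self._features(self.depth_net, depth)
        return f_rgb, f_depth

# Example with a 640 x 480 RGB-D pair as used in the dataset described above:
# rgb = torch.randn(1, 3, 480, 640); depth = torch.randn(1, 1, 480, 640)
# f_rgb, f_depth = DualBackbone()(rgb, depth)
```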
3.4.2. Fusion Process in the SCF Module
- Feature Extraction: The input consists of RGB features $F_{\mathrm{RGB}}$ and depth features $F_{\mathrm{D}}$, which are extracted from separate backbone networks trained on RGB and depth images.
- RGB features capture texture, color, and fine-grained appearance details.
- Depth features emphasize geometric structure and object boundaries.
- Concatenation: The extracted RGB and depth features are concatenated along the channel dimension: $F_{\mathrm{concat}} = \mathrm{Concat}(F_{\mathrm{RGB}}, F_{\mathrm{D}})$. This step merges visual and geometric information, allowing the model to leverage complementary cues from both modalities.
- Dimensionality Reduction: To reduce computational complexity and restore the feature channel size to C, a $1 \times 1$ convolution is applied: $F_{\mathrm{reduced}} = \mathrm{Conv}_{1 \times 1}(F_{\mathrm{concat}})$. Batch normalization and ReLU activation further refine the output: $F_{\mathrm{reduced}} = \mathrm{ReLU}(\mathrm{BN}(F_{\mathrm{reduced}}))$.
- Attention Mechanism (Dynamic Weighting): The SCF module dynamically adjusts feature importance using an attention mechanism: $A = \sigma(W_{a} \cdot F_{\mathrm{reduced}})$. The computed attention weights are applied to the feature map: $F_{\mathrm{att}} = A \odot F_{\mathrm{reduced}}$. The attention mechanism consists of the following:
- Spatial Attention: Identifies key regions in the image by emphasizing areas that contribute most to segmentation.
- Channel Attention: Prioritizes informative feature channels, allowing the model to selectively amplify critical depth or RGB features.
In practice, this dynamic weighting mechanism enables the model to determine when to rely more on depth or RGB based on scene characteristics. For instance, in environments with poor lighting or strong reflections, where RGB features become unreliable, the model assigns higher importance to depth features to ensure accurate segmentation. Conversely, in cases where depth information is noisy or lacks structural details, the RGB features are given more weight to compensate. This adaptive balancing improves the segmentation of challenging structures such as stairs and elevators, where variations in texture, illumination, and perspective often complicate recognition.
- Skip-Connection Integration: A residual connection integrates the refined feature map with the original RGB input: $F_{\mathrm{out}} = F_{\mathrm{RGB}} + F_{\mathrm{att}}$. This preserves low-level RGB information while incorporating the enhanced depth features. The skip-connection also improves gradient flow, ensuring stable training and preventing vanishing gradients.
Algorithm 1: Skip-Connection Fusion (SCF) module.
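The full pseudocode of Algorithm 1 is not reproduced here. As a minimal sketch of the fusion steps listed in Section 3.4.2 (concatenation, $1 \times 1$ reduction with batch normalization and ReLU, channel and spatial attention, and the residual skip-connection), the following PyTorch module illustrates one possible implementation; the exact attention parameterization and the class name are our assumptions.

```python
import torch
import torch.nn as nn

class SCFModule(nn.Module):
    """Sketch of Skip-Connection Fusion: concatenate RGB and depth features,
    reduce channels with a 1x1 convolution, re-weight with channel and spatial
    attention, and add the result back to the RGB stream via a residual path."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution restores the channel count to C after concatenation.
        self.reduce = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Channel attention (squeeze-and-excitation style) and spatial attention.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_rgb, f_depth):
        f_concat = torch.cat([f_rgb, f_depth], dim=1)   # F_concat
        f_reduced = self.reduce(f_concat)               # Conv 1x1 + BN + ReLU
        # Dynamic weighting: emphasize informative channels and regions.
        f_att = f_reduced * self.channel_att(f_reduced)
        f_att = f_att * self.spatial_att(f_att)
        # Skip-connection: preserve low-level RGB information.
        return f_rgb + f_att

# Usage: fused = SCFModule(channels=2048)(f_rgb, f_depth)
```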
3.4.3. Our Dataset
- Task-specific Design: Focused on stair and elevator segmentation for robotic perception.
- Rich Environmental Variation: Includes varied lighting, occlusions, and camera perspectives.
- Balanced Complexity and Scale: Provides a manageable yet diverse dataset for training and benchmarking segmentation models.
- Fine-grained Annotation: Supports multi-class segmentation beyond binary stair detection.
4. Experimental Evaluation
4.1. Experimental Setup
4.1.1. Training Strategy
4.1.2. Evaluation Metric: Mean Intersection over Union (mIoU)
The mean Intersection over Union is computed as $\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{|P_i \cap G_i|}{|P_i \cup G_i|}$, where:
- $P_i$: The set of pixels predicted to belong to class i;
- $G_i$: The set of ground truth pixels for class i;
- N: The total number of classes.
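As an illustration of how this metric is computed per class and then averaged, the following is a small NumPy sketch; the function name and the handling of classes absent from both prediction and ground truth are our choices.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Compute per-class IoU = |P_i ∩ G_i| / |P_i ∪ G_i| and average over classes.

    pred, gt: integer label maps of identical shape (H, W).
    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for i in range(num_classes):
        p, g = (pred == i), (gt == i)
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class i does not appear; do not penalize it
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Example with the six classes used in this work:
# miou = mean_iou(pred_mask, gt_mask, num_classes=6)
```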
4.2. Comparison with Other Methods
4.2.1. Baseline
4.2.2. Segmentation Results
- Input Image: Displays the original RGB image used for testing.
- Ground Truth: Provides the manually annotated segmentation mask for reference.
- SCF (Proposed Method): Highlights improved segmentation accuracy, particularly in ambiguous regions like shadows or low-contrast areas.
4.2.3. SCF vs. Concatenation in Challenging Scenarios
- List 1: Original RGB images.
- List 2: Manually annotated segmentation masks for reference.
- List 3: SCF (proposed method) results, showing improved segmentation accuracy, particularly in ambiguous regions such as shadows or low-contrast areas.
- List 4: Concatenation results, showing partial improvement but struggling with reflective surfaces or overlapping objects.
4.2.4. Additional Improvements for SCF
Proof of SCF Efficacy
- $F_{\mathrm{RGB}}$: Feature map extracted from the RGB modality.
- $F_{\mathrm{D}}$: Feature map extracted from the depth modality.
- $\alpha$ and $\beta$: Learnable weights that dynamically adjust the importance of RGB and depth features, respectively, ensuring robust segmentation. A sensitivity analysis is conducted to evaluate their impact on performance in challenging cases, such as reflective or translucent surfaces. Empirical analysis shows that when $\alpha$ is high, segmentation is more sensitive to texture and color but struggles with reflective surfaces, where misleading reflections can cause incorrect classifications. Conversely, when $\beta$ dominates, the model relies more on spatial geometry, improving segmentation in low-texture regions but introducing noise in areas where depth data is unreliable (e.g., glass surfaces). To achieve optimal segmentation, we fine-tune these weights dynamically during training using a learnable balancing mechanism, $F_{\mathrm{fused}} = \sigma(\alpha F_{\mathrm{RGB}} + \beta F_{\mathrm{D}})$ (a minimal code sketch is given after this list).
- $\sigma$: An activation function, such as sigmoid or softmax, which normalizes the weighted sum to ensure numerical stability and sparsity.
- $F_{\mathrm{fused}}$: The final fused feature map, emphasizing key regions for improved performance.
- The number of parameters in the dynamic weighting mechanism.
- The computational cost of the feature fusion process.
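A minimal sketch of the learnable balancing mechanism above, assuming scalar weights $\alpha$ and $\beta$ normalized by a softmax and a sigmoid as the activation $\sigma$; the exact parameterization (scalar vs. per-channel weights) is not specified here and is our assumption.

```python
import torch
import torch.nn as nn

class ModalityBalance(nn.Module):
    """Sketch of F_fused = sigma(alpha * F_RGB + beta * F_D) with learnable
    alpha, beta. A softmax over the two raw weights keeps them positive and
    summing to one, which helps stabilize training."""
    def __init__(self):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(2))  # logits for (alpha, beta)

    def forward(self, f_rgb, f_depth):
        alpha, beta = torch.softmax(self.raw, dim=0)
        fused = alpha * f_rgb + beta * f_depth
        return torch.sigmoid(fused)  # sigma(.) as the normalizing activation

# Intuition: where depth is noisy (e.g., glass surfaces), training can push
# alpha up; under poor lighting or strong reflections, beta tends to grow.
```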
Adaptive Attention Mechanism
The attention weights are computed from the combined RGB-D feature map and applied multiplicatively, $F_{\mathrm{refined}} = \sigma(W_{s} \cdot F_{\mathrm{combined}}) \odot F_{\mathrm{in}}$, where:
- $W_{s}$: Learnable weight matrix for spatial attention.
- $F_{\mathrm{combined}}$: The combined feature map of RGB and depth modalities.
- $F_{\mathrm{in}}$: The input feature map to be refined.
- $F_{\mathrm{refined}}$: The final refined feature map.
This design offers the following properties:
- Dynamic Feature Selection: By normalizing the attention weights, $\sigma$ ensures that the model focuses on critical regions while suppressing irrelevant features.
- Numerical Stability: Activation functions such as sigmoid or softmax prevent extreme weight values, facilitating stable convergence during training.
- Sparsity: Functions like sigmoid introduce sparsity in attention weights, allowing the model to prioritize a smaller subset of features for efficient learning.
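To make the gate concrete, below is a minimal PyTorch sketch of this attention step; using a $1 \times 1$ convolution to play the role of the learnable matrix $W_s$ is our simplification, not necessarily the authors' exact parameterization.

```python
import torch
import torch.nn as nn

class SpatialAttentionGate(nn.Module):
    """Sketch of A = sigmoid(W_s · F_combined), F_refined = A ⊙ F_in.
    The sigmoid keeps weights in (0, 1) for numerical stability and lets the
    model prioritize a smaller subset of regions, as described above."""
    def __init__(self, in_channels):
        super().__init__()
        self.w_s = nn.Conv2d(in_channels, 1, kernel_size=1)  # stands in for W_s

    def forward(self, f_combined, f_in):
        attention = torch.sigmoid(self.w_s(f_combined))  # B x 1 x H x W
        return f_in * attention                          # broadcast over channels

# Usage: refined = SpatialAttentionGate(in_channels=2048)(f_combined, f_reduced)
```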
4.3. Ablation Experiment
4.3.1. Objective
4.3.2. Experimental Design
Computational Overhead
- Runtime: SCF performs learnable weight adjustments across feature channels, while CFM involves contour-based feature guidance operations. Both increase runtime compared to concatenation but aim to enhance the integration of RGB and depth features for better segmentation performance.
- Memory Usage: These modules introduce extra parameters and intermediate feature maps, leading to a modest increase in memory consumption. However, this overhead is justified by their ability to improve segmentation accuracy in complex scenes.
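As a rough way to quantify such overhead, the sketch below counts parameters and times a CPU forward pass of a fusion module (using the illustrative SCFModule from Section 3.4); the feature-map sizes in the example are hypothetical.

```python
import time
import torch

def profile_module(module, *inputs, repeats=10):
    """Report parameter count and average CPU forward time for a fusion module."""
    n_params = sum(p.numel() for p in module.parameters())
    with torch.no_grad():
        module(*inputs)                      # warm-up pass
        start = time.perf_counter()
        for _ in range(repeats):
            module(*inputs)
        elapsed_ms = (time.perf_counter() - start) / repeats * 1000.0
    return n_params, elapsed_ms

# Example (hypothetical sizes): feature maps with 2048 channels at 15 x 20.
# f_rgb = torch.randn(1, 2048, 15, 20); f_depth = torch.randn(1, 2048, 15, 20)
# params, ms = profile_module(SCFModule(2048), f_rgb, f_depth)
```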
4.3.3. Results and Analysis
Key Observations
- RGB-only: Training with only RGB features achieved reasonable accuracy but struggled with challenging scenarios, demonstrating the limitations of single-modality input.
- Concatenation: Simple concatenation of RGB and depth features improved the top 10% mIoU but introduced a marginal degradation in overall mIoU due to suboptimal feature integration.
- CFM: An advanced fusion method, CFM outperformed RGB-only in top 10% mIoU but achieved a lower overall mIoU (85.05%) than concatenation and SCF, indicating its limited robustness in complex indoor scenes.
- SCF (Ours): The SCF module achieved the highest mIoU and top 10% mIoU, demonstrating its ability to dynamically fuse features and enhance segmentation robustness.
Significance of SCF Efficacy
Computational Overhead
5. Conclusions and Future Work
- Limitations:
- Handling Highly Dynamic Scenes: Although our SCF module improves segmentation in challenging static architectural scenes, it may face difficulties in highly dynamic or rapidly changing environments and would require further adaptation.
- Computational Cost in Edge Cases: While the proposed method achieves a good balance between accuracy and efficiency, certain complex scenes with extreme occlusions or reflections can still increase computational overhead.
- Generalization to Other Architectural Elements: Our approach is specifically optimized for stairs and elevators; its effectiveness on other indoor architectural components or outdoor environments needs further investigation.
- Strengths:
- Adaptive Fusion of RGB and Depth: The learnable Skip-Connection Fusion (SCF) module dynamically balances complementary modalities, improving robustness against occlusion, reflection, and lighting variations that commonly challenge traditional methods.
- Integration with Contextual Features: By leveraging PSPNet’s pyramid pooling, our framework captures rich global and local context, significantly enhancing segmentation accuracy in cluttered and complex indoor scenes.
- Task-Specific Optimization: Tailoring the model specifically to stairs and elevators achieves superior segmentation performance without compromising the real-time computational requirements crucial for robotic navigation.
- Comprehensive Dataset and Evaluation: The newly collected, diverse real-world dataset simulating robot vision provides a solid benchmark to validate the effectiveness and generalization capability of the proposed approach.
- Practical Impact: The improved segmentation precision contributes directly to safer and more reliable robot navigation in multi-floor architectural environments, addressing a critical real-world need often overlooked in generic segmentation models.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
- Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
- Urrea, C.; Vélez, M. Advances in Deep Learning for Semantic Segmentation of Low-Contrast Images: A Systematic Review of Methods, Challenges, and Future Directions. Sensors 2025, 25, 2043.
- Betsas, T.; Georgopoulos, A.; Doulamis, A.; Grussenmeyer, P. Deep Learning on 3D Semantic Segmentation: A Detailed Review. Remote Sens. 2025, 17, 298.
- Velastegui, R.; Tatarchenko, M.; Karaoglu, S.; Gevers, T. Image Semantic Segmentation of Indoor Scenes: A Survey. Comput. Vis. Image Underst. 2024, 248, 104102.
- Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2014, arXiv:1411.4038.
- Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. arXiv 2014, arXiv:1407.5736.
- Lagos, J.P.; Rahtu, E. SemSegDepth: A Combined Model for Semantic Segmentation and Depth Completion. arXiv 2022, arXiv:2209.00381.
- Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. In Proceedings of the Asian Conference on Computer Vision (ACCV), Taipei, Taiwan, 20–24 November 2016; pp. 213–228.
- Zhang, S.; Xiong, Y.; Liu, J.; Ye, X.; Sun, G. RDF-GAN: RGB-Depth Fusion GAN for Indoor Depth Completion. arXiv 2022, arXiv:2203.10856.
- Zhang, Y.; Xiong, C.; Liu, J.; Ye, X.; Sun, G. Spatial Information-Guided Adaptive Context-Aware Network for Efficient RGB-D Semantic Segmentation. IEEE Sens. J. 2023, 23, 23512–23521.
- Hao, Z.; Xiao, Z.; Luo, Y.; Guo, J.; Wang, J.; Shen, L.; Hu, H. PrimKD: Primary Modality Guided Multimodal Fusion for RGB-D Semantic Segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024.
- Zhang, S.; Xie, M. Optimizing RGB-D Semantic Segmentation through Multi-Modal Interaction and Pooling Attention. arXiv 2023, arXiv:2311.11312.
- Bui, M.; Alexis, K. Diffusion-Based RGB-D Semantic Segmentation with Deformable Attention Transformer. arXiv 2024, arXiv:2409.15117.
- Wang, C.; Pei, Z.; Qiu, S.; Tang, Z. RGB-D-Based Stair Detection and Estimation Using Deep Learning. Sensors 2023, 23, 2175.
- Jiang, S.; Xu, Y.; Li, D.; Fan, R. Multi-Scale Fusion for RGB-D Indoor Semantic Segmentation. Sci. Rep. 2022, 12, 20305.
- Kirch, S.; Olyunina, V.; Ondřej, J.; Pagés, R.; Martín, S.; Pérez-Molina, C. RGB-D-Fusion: Image Conditioned Depth Diffusion of Humanoid Subjects. IEEE Access 2023, 11, 99111–99129.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012.
- Song, S.; Lichtenberg, S.P.; Xiao, J. SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 567–576.
- Li, S.; He, Y.; Zhang, W.; Zhang, W.; Tan, X.; Han, J.; Ding, E.; Wang, J. CFCG: Semi-Supervised Semantic Segmentation via Cross-Fusion and Contour Guidance Supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 13436–13446.
| Method | Main Idea | Advantages | Limitations |
|---|---|---|---|
| FuseNet | Dual-stream network fuses RGB and depth in the decoding stage | Improves segmentation accuracy; simple and easy to implement | Direct fusion, lacks adaptive adjustment; sensitive to depth noise |
| RDF-GAN | Uses GAN for RGB-D semantic segmentation | Enhances feature representation; suitable for complex scenes | Hard to train, prone to mode collapse; high computational cost |
| Our Method | Uses SCF for RGB-D fusion with PSPNet | Learnable fusion improves complementarity; PSPNet enhances accuracy | Generalization needs further validation; more fusion comparisons needed |
| Dataset | Image Count | Classes | Lighting Conditions | Occlusions | Perspective Variation |
|---|---|---|---|---|---|
| Minjiang Dataset (Ours) | 2386 | 6 | High | Yes | Yes |
| NYU Depth v2 [19] | 1449 | 13 | Moderate | Yes | No |
| SUN RGB-D [20] | 10,335 | 37 | High | Limited | Yes |
| Stair Dataset with Depth Maps | 2996 | 2 | High | Yes | Yes |
DeepLabv3 + ResNet-101:
| Model | mIoU (%) | Top 10% mIoU (%) | Improvement |
|---|---|---|---|
| Only RGB | | | Baseline |
| Depth + RGB (Concat) | | | (Top 10%) |
| With SCF | | | (Top 10%) |
PSPNet + ResNet-50:
| Model | mIoU (%) | Top 10% mIoU (%) | Improvement |
|---|---|---|---|
| Only RGB | | | Baseline |
| Depth + RGB (Concat) | | | (Top 10%) |
| With SCF | | | (Top 10%) |
| Method | Theoretical Time Complexity | CPU Time (ms) | Space Complexity |
|---|---|---|---|
| Concatenation | | 4784 | |
| SCF | | 3749 | |
| Method | mIoU (%) | Top 10% mIoU (%) | Normalized Runtime Increase |
|---|---|---|---|
| RGB-only | 88.19 | 81.69 | 0 |
| Concatenation | 88.04 | 84.28 | +100% |
| CFM [21] | 85.05 | 83.82 | +270% |
| SCF (Ours) | 88.30 | 85.45 | +78% |