Article

Semantic-Guided Spatial and Temporal Fusion Framework for Enhancing Monocular Video Depth Estimation

Department of Computer Science and Artificial Intelligence, Dongguk University, Seoul 04620, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 212; https://doi.org/10.3390/app16010212
Submission received: 25 November 2025 / Revised: 16 December 2025 / Accepted: 23 December 2025 / Published: 24 December 2025
(This article belongs to the Special Issue Advances in Computer Vision and Digital Image Processing)

Abstract

Despite advancements in deep learning-based Monocular Depth Estimation (MDE), applying these models to video sequences remains challenging due to geometric ambiguities in texture-less regions and temporal instability caused by independent per-frame inference. To address these limitations, we propose STF-Depth, a novel post-processing framework that enhances depth quality by logically fusing heterogeneous information—geometric, semantic, and panoptic—without requiring additional retraining. Our approach introduces a robust RANSAC-based Vanishing Point Estimation to guide Dynamic Depth Gradient Correction for background separation, alongside Adaptive Instance Re-ordering to clarify occlusion relationships. Experimental results on the KITTI, NYU Depth V2, and TartanAir datasets demonstrate that STF-Depth functions as a universal plug-and-play module. Notably, it achieved a 25.7% reduction in Absolute Relative error (AbsRel) and significantly enhanced temporal consistency compared to state-of-the-art backbone models. These findings confirm the framework’s practicality for real-world applications requiring geometric precision and video stability, such as autonomous driving, robotics, and augmented reality (AR).
Keywords: monocular video depth estimation; heterogeneous information fusion; temporal consistency; semantic and panoptic segmentation; vanishing point estimation
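
As a point of reference for the AbsRel figure quoted in the abstract, the sketch below shows the standard Absolute Relative error computation commonly used in monocular depth benchmarks such as KITTI and NYU Depth V2. It is not taken from the paper; the function name abs_rel and the validity-masking convention are illustrative assumptions.

import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Absolute Relative error: mean(|pred - gt| / gt) over valid pixels.

    Pixels with non-positive ground-truth depth (e.g., missing LiDAR
    returns in KITTI) are excluded, following common benchmark practice.
    """
    valid = gt > eps  # mask out pixels without valid ground-truth depth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

# Example: a prediction that is uniformly 10% too deep gives AbsRel = 0.1.
gt = np.full((4, 4), 5.0)
pred = 1.1 * gt
print(abs_rel(pred, gt))  # ~0.10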

Share and Cite

MDPI and ACS Style

Kim, H.; Lee, Y.; Ko, H.; Jeong, J.; Son, Y. Semantic-Guided Spatial and Temporal Fusion Framework for Enhancing Monocular Video Depth Estimation. Appl. Sci. 2026, 16, 212. https://doi.org/10.3390/app16010212


