Article

Multimodal Latent Representation Learning for Video Moment Retrieval

Department of AI, Chung-Ang University, Seoul 06974, Republic of Korea
* Author to whom correspondence should be addressed.
Sensors 2025, 25(14), 4528; https://doi.org/10.3390/s25144528
Submission received: 14 June 2025 / Revised: 16 July 2025 / Accepted: 16 July 2025 / Published: 21 July 2025

Abstract

The rise of artificial intelligence (AI) has revolutionized the processing and analysis of video sensor data, driving advancements in areas such as surveillance, autonomous driving, and personalized content recommendations. However, leveraging video data presents unique challenges, particularly the time-intensive feature extraction required for model training. This challenge is intensified in research environments that lack advanced hardware resources such as GPUs. We propose a new method, the multimodal latent representation learning framework (MLRL), to address these limitations. MLRL improves downstream task performance by conducting additional representation learning on pre-extracted features: it integrates and augments multimodal data to predict latent representations, reducing model training time while improving task performance. We validate the efficacy of MLRL on the video moment retrieval task using the QVHighlight dataset, benchmarking against the QD-DETR model. Our results demonstrate significant improvements, highlighting the potential of MLRL to streamline video data processing: by operating on pre-extracted features, it bypasses the time-consuming extraction step on raw sensor data and enhances model accuracy in a range of sensor-based applications.
Keywords: video moment retrieval; visual language reasoning; multimodal representation learning
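The abstract describes MLRL only at a high level, so the following minimal PyTorch-style sketch is purely illustrative: it shows one way additional representation learning could be layered on top of pre-extracted video and query features, projecting both modalities into a shared latent space, fusing them, and predicting latent representations with a simple self-supervised objective. Every module name, feature dimension, and loss term below is a hypothetical placeholder, not the authors' implementation.

```python
# Illustrative sketch only: the paper's actual MLRL architecture, fusion scheme,
# and training objective are not given in this abstract, so all components here
# are assumptions chosen for readability.
import torch
import torch.nn as nn


class LatentRepresentationLearner(nn.Module):
    """Toy example of extra representation learning on pre-extracted features."""

    def __init__(self, video_dim=2304, text_dim=512, latent_dim=256):
        super().__init__()
        # Project each pre-extracted modality into a shared latent space.
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        # Fuse the two modalities with a small transformer encoder.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Predict a latent representation for every video clip token.
        self.predictor = nn.Linear(latent_dim, latent_dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (batch, num_clips, video_dim)  pre-extracted clip features
        # text_feats:  (batch, num_tokens, text_dim)  pre-extracted query features
        v = self.video_proj(video_feats)
        t = self.text_proj(text_feats)
        fused = self.fusion(torch.cat([v, t], dim=1))
        pred = self.predictor(fused[:, : v.size(1)])
        # Simple self-supervised objective: predicted latents should match the
        # projected video features (a stand-in for whatever target MLRL uses).
        loss = nn.functional.mse_loss(pred, v.detach())
        return pred, loss


# Usage with random tensors standing in for cached, pre-extracted features.
video_feats = torch.randn(2, 75, 2304)  # placeholder clip-feature dimension
text_feats = torch.randn(2, 20, 512)    # placeholder query-feature dimension
model = LatentRepresentationLearner()
latents, loss = model(video_feats, text_feats)
print(latents.shape, loss.item())
```

Under this reading, such a learner would be trained over cached features and its outputs fed to a downstream moment retrieval model such as QD-DETR, which is consistent with the abstract's claim that working from pre-extracted features avoids repeated extraction from raw video while still improving accuracy.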

Citation

Hwang, J.; Jeon, M.; Kim, J. Multimodal Latent Representation Learning for Video Moment Retrieval. Sensors 2025, 25, 4528. https://doi.org/10.3390/s25144528
