Article

Adaptive Temporal Action Localization in Video

Zhiyu Xu, Zhuqiang Lu, Yong Ding, Liwei Tian and Suping Liu
1 College of Computer Science, Guangdong University of Science & Technology, Dongguan 523668, China
2 The University of Sydney, Sydney, NSW 2008, Australia
* Author to whom correspondence should be addressed.
Electronics 2025, 14(13), 2645; https://doi.org/10.3390/electronics14132645
Submission received: 28 April 2025 / Revised: 16 June 2025 / Accepted: 18 June 2025 / Published: 30 June 2025
(This article belongs to the Special Issue Applications of Artificial Intelligence in Image and Video Processing)

Abstract

Temporal action localization aims to identify the temporal boundaries of actions of interest in a video. Most existing methods take a two-stage approach: they first generate a set of action proposals and then determine the precise temporal locations of the actions based on this set. However, the diversely distributed semantics of a video over time have not been well considered, which can compromise localization performance, especially for ubiquitous short actions or events (e.g., a fall in healthcare or a traffic violation in surveillance). To address this problem, we propose a novel deep learning architecture, an adaptive template-guided self-attention network, which characterizes proposals adaptively from their relevant frames. An input video is segmented into temporal frames, whose spatio-temporal patterns are encoded by a global–local Transformer-based encoder. Each frame serves as the starting frame for a number of proposals of different lengths. Learnable templates are introduced for proposals of different lengths, and each template guides the sampling for proposals of a specific length: it produces the probabilities with which a proposal aggregates spatio-temporal patterns from its relevant temporal frames. The semantics of each proposal can therefore be formulated adaptively, yielding an appropriately characterized feature map over all proposals. To estimate the IoU of these proposals with ground-truth actions, a two-level scheme is introduced. A shortcut connection is also utilized to refine the predictions by applying convolutions to the feature map in a coarse-to-fine manner. Comprehensive experiments on two benchmark datasets demonstrate the state-of-the-art performance of our proposed method: 32.6% mAP@IoU 0.7 on THUMOS-14 and 9.35% mAP@IoU 0.95 on ActivityNet-1.3.
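To make the template-guided sampling idea concrete, below is a minimal, self-contained PyTorch-style sketch of one plausible reading of the abstract, not the authors' released code: a learnable template for each proposal length scores the frames a proposal covers, the scores are normalised into sampling probabilities, and the proposal feature is the resulting weighted sum of its relevant frame features. The module name TemplateGuidedSampler, the tensor shapes, and all hyperparameters are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemplateGuidedSampler(nn.Module):
    """Builds a (T x L x C) proposal feature map from T frame features.

    A proposal is indexed by its starting frame t and its length l (in frames).
    A learnable template for each length l scores the frames the proposal covers,
    and the scores are normalised into sampling probabilities (an assumed design,
    sketched from the abstract).
    """

    def __init__(self, feat_dim: int, max_len: int):
        super().__init__()
        self.max_len = max_len
        # One learnable template per proposal length: (L, C)
        self.templates = nn.Parameter(torch.randn(max_len, feat_dim) * 0.02)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C) frame features, e.g. from a global-local encoder
        B, T, C = frames.shape
        proposal_feats = frames.new_zeros(B, T, self.max_len, C)
        for l in range(1, self.max_len + 1):
            template = self.templates[l - 1]                 # (C,)
            # Relevance of every frame to this length's template
            scores = frames @ template / C ** 0.5            # (B, T)
            for t in range(T):
                end = min(t + l, T)
                # Sampling probabilities over the frames covered by proposal (t, l)
                w = F.softmax(scores[:, t:end], dim=-1)      # (B, end - t)
                # Proposal feature = probability-weighted sum of its relevant frames
                proposal_feats[:, t, l - 1] = torch.einsum(
                    "bk,bkc->bc", w, frames[:, t:end])
        return proposal_feats                                # (B, T, L, C)

if __name__ == "__main__":
    sampler = TemplateGuidedSampler(feat_dim=256, max_len=16)
    feats = sampler(torch.randn(2, 100, 256))   # 2 videos, 100 frames each
    print(feats.shape)                          # torch.Size([2, 100, 16, 256])

The resulting feature map over all proposals would then feed the two-level IoU estimation and the coarse-to-fine convolutional refinement mentioned in the abstract, which are not sketched here.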
Keywords: video understanding; temporal action localization; neural networks

Share and Cite

MDPI and ACS Style

Xu, Z.; Lu, Z.; Ding, Y.; Tian, L.; Liu, S. Adaptive Temporal Action Localization in Video. Electronics 2025, 14, 2645. https://doi.org/10.3390/electronics14132645

AMA Style

Xu Z, Lu Z, Ding Y, Tian L, Liu S. Adaptive Temporal Action Localization in Video. Electronics. 2025; 14(13):2645. https://doi.org/10.3390/electronics14132645

Chicago/Turabian Style

Xu, Zhiyu, Zhuqiang Lu, Yong Ding, Liwei Tian, and Suping Liu. 2025. "Adaptive Temporal Action Localization in Video" Electronics 14, no. 13: 2645. https://doi.org/10.3390/electronics14132645

APA Style

Xu, Z., Lu, Z., Ding, Y., Tian, L., & Liu, S. (2025). Adaptive Temporal Action Localization in Video. Electronics, 14(13), 2645. https://doi.org/10.3390/electronics14132645

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
