Adaptive Temporal Action Localization in Video

Xu, Zhiyu; Lu, Zhuqiang; Ding, Yong; Tian, Liwei; Liu, Suping

doi:10.3390/electronics14132645

Open AccessArticle

Adaptive Temporal Action Localization in Video

by

Zhiyu Xu

¹,

Zhuqiang Lu

²,

Yong Ding

^1,*,

Liwei Tian

¹ and

Suping Liu

¹

College of Computer Science, Guangdong University of Science & Technology, Dongguan 523083, China

²

The University of Sydney, Sydney, NSW 2008, Australia

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(13), 2645; https://doi.org/10.3390/electronics14132645

Submission received: 28 April 2025 / Revised: 16 June 2025 / Accepted: 18 June 2025 / Published: 30 June 2025

(This article belongs to the Special Issue Applications of Artificial Intelligence in Image and Video Processing)

Download

Browse Figures

Versions Notes

Abstract

Temporal action localization aims to identify the boundaries of the action of interest in a video. Most existing methods take a two-stage approach: first, identify a set of action proposals; then, based on this set, determine the accurate temporal locations of the action of interest. However, the diversely distributed semantics of a video over time have not been well considered, which could compromise the localization performance, especially for ubiquitous short actions or events (e.g., a fall in healthcare and a traffic violation in surveillance). To address this problem, we propose a novel deep learning architecture, namely an adaptive template-guided self-attention network, to characterize the proposals adaptively with their relevant frames. An input video is segmented into temporal frames, within which the spatio-temporal patterns are formulated by a global–Local Transformer-based encoder. Each frame is associated with a number of proposals of different lengths as their starting frame. Learnable templates for proposals of different lengths are introduced, and each template guides the sampling for proposals with a specific length. It formulates the probabilities for a proposal to form the representation of certain spatio-temporal patterns from its relevant temporal frames. Therefore, the semantics of a proposal can be formulated in an adaptive manner, and a feature map of all proposals can be appropriately characterized. To estimate the IoU of these proposals with ground truth actions, a two-level scheme is introduced. A shortcut connection is also utilized to refine the predictions by using the convolutions of the feature map from coarse to fine. Comprehensive experiments on two benchmark datasets demonstrate the state-of-the-art performance of our proposed method: 32.6% mAP@IoU 0.7 on THUMOS-14 and 9.35% mAP@IoU 0.95 on ActivityNet-1.3.

Keywords:

video understanding; temporal action localization; neural networks

1. Introduction

Temporally localizing actions of interest has become a fundamental problem for video understanding, with a fast-growing amount of video data from a diverse range of applications in many sectors, such as surveillance, sports, and healthcare. The temporal localization of actions of interest aims to identify the boundaries (i.e., starting and ending locations) of the actions in a given video. Thereafter, temporal segments can be obtained and analyzed by other techniques such as video action recognition. Hence, a common practice for a temporal action localization task is to accurately generate and characterize action proposals.

Recent temporal action localization methods generate proposals or candidate proposals based on the fact that the hints of an event may not overlap with the event itself. For example, athletes approaching the starting lines in the race track can be a cue for a race, although the action of approaching the starting line of the racing track is not counted as the action of racing. Therefore, it is reasonable to extend the boundary of a proposal to consider the hints from its surroundings. Pyramid approaches inspired by image object detection help to collect multi-scale spatio-temporal patterns from the extended region of a proposal [1,2,3]. Nonetheless, as the duration of the actions can vary rapidly, this raises difficulties in forming consistent representations using these methods. More specifically, the large receptive field used in these methods is computationally expensive, and the spatio-temporal representations of the proposals are sensitive to changes in proposal length.

In pursuit of a more flexible and efficient solution for proposal formulation, various sampling strategies have been introduced to sample and aggregate a number of frames from the proposal-related temporal segment to represent the proposal [4,5,6,7,8,9,10,11,12,13]. For example, a widely used strategy is to sample relevant frame representations uniformly to characterize a proposal [6]. Although these strategies allow the efficient computation of a feature map of the proposals, important semantics can be missed from the frames that are not selected during the sampling. This would be the case where the semantics are not distributed over time consistently with their sampling assumptions.

In particular, in various applications, short actions or events (e.g., a fall in healthcare and a traffic violation in surveillance) are ubiquitous, and their semantics are often distributed diversely over time [14]. Manually identifying such events or adjusting them from inaccurate algorithms can be time-consuming and difficult. Therefore, in this study, a novel deep learning architecture, namely an adaptive template-guided self-attention network, is proposed to formulate the proposals in an adaptive manner without pre-defined sampling assumptions for temporal action localization. As shown in Figure 1, the proposed method aims to estimate a temporal Intersection over Union (IoU) map for a given video, in which the element at the position

(t, k)

represents the IoU of a proposal with a starting time t and a length k on a ground truth action.

Specifically, transformer-based global–local encoders are introduced to obtain a spatio-temporal representation for each temporal frame of an untrimmed input video [15]. These representations are used to characterize the action proposals, which are usually associated with a number of temporal frames in line with their length. Instead of using a conventional sampling strategy to sample and aggregate frame representations as the proposal’s feature vector, template-guided self-attention is proposed for adaptive sampling. Each template contains learnable weights and guides the sampling of proposals with a specified length. It formulates the probabilities for a proposal to obtain spatio-temporal patterns from relevant temporal frames. Therefore, proposal representations are formulated in a data-driven approach, and a feature map of all proposals is appropriately characterized. Finally, this feature map is utilized to estimate the temporal IoU of proposals with ground truth actions. Unlike existing methods, a two-level scheme is proposed for obtaining a smooth and continuous IoU map, as such properties can be observed from a ground truth IoU map. A shortcut connection is utilized to refine the predictions by using full convolutions on the feature map from coarse to fine.

In summary, the main contributions of this paper are three-fold:

A novel deep learning architecture, namely an adaptive template-guided self-attention network, is proposed by introducing a template-guided self-attention to formulate proposal representations adaptively without prior assumptions.
We propose a two-level scheme to derive a fine probability map from a coarse one by establishing a shortcut connection in pursuit of smooth and continuous properties.
Comprehensive experiments demonstrate the effectiveness of our proposed method. It achieved state-of-the-art performance on two benchmark datasets: 32.6% mAP@Iou 0.7 on THUMOS-14 and 9.35% mAP@IoU 0.95 on ActivityNet-1.3.

The remaining of this paper is organized as follows. Section 2 reviews the related work, explaining the significance of this paper. Section 3 introduces the details of the proposed method, describing the innovation of this research. Section 4 presents the implementation details and the experimental results, demonstrating the effectiveness of our method. Finally, Section 5 concludes this study with a discussion of future work.

2. Related Work

While both temporal action localization (TAL) and video moment retrieval (VMR) involve localizing temporal segments within untrimmed videos, they differ in their objectives: TAL focuses on identifying all instances of pre-defined actions and classifying them, while VMR focuses on retrieving a segment that matches a textual query.

Temporal action localization is a task to identify the starting and ending positions as well as the category of an action in a long and untrimmed video. The task can be divided into two stages: action proposal generation and action classification. Recent temporal action localization methods dominantly focus on the accurate proposal generation, as the latter can be conducted by an additional classification method based on the proposal results. In pursuit of accurate temporal proposals, it is common to involve the surrounding information of an action proposal, as the region containing critical hints of an action may not overlap with the action itself. Convolutional neural networks (CNNs) have been widely used in various applications [16]. By extending the boundary of a proposal, CNN-based methods can be adopted to formulate the spatio-temporal patterns for a proposal from this extended region [1,2,3]. Similar to image object detection, a pyramid mechanism was investigated to formulate the representation of a candidate proposal. The internal feature maps of multiple temporal scales in the pyramid can be summarized [1]. Another study, which is inspired by R-CNN, proposed a 3D fully convolutional neural network for temporal localization [2]. A progressive boundary refinement network further involved a coarse pyramid and a refined pyramid to detect actions at different scales [3]. However, as the duration of the actions can vary rapidly and the relevant hints can be far away from a proposal, it raises difficulties to adequately characterize a proposal through these methods.

To improve the flexibility, various sampling strategies are introduced to address the proposals of different durations. They sample and aggregate a number of frames or temporal frames from a region to formulate a proposal representation [4,5,6,7,8,9,10,11,12,13]. Uniform sampling from associated frames to characterize a proposal was investigated by pyramid-based action localization methods [4,5]. A structured segment network was proposed, which sparsely samples a fixed number of snippets within an augmented proposal with additional surroundings to form its representation [6]. A boundary-sensitive network was proposed with a two-stage sampling strategy: (1) the probability that a frame is the starting point or ending point is estimated to construct proposals, and (2) for each proposal, a fixed number of frames are sampled uniformly from the frames associated with this proposal [7]. Further regularization constraints were introduced for the starting point and ending point over a sequence in [8]. A graph convolutional neural network and its low-pass filtering properties were explored to analyze the relations between the proposals collected from an existing method such as [7], which further increased proposal accuracy [9]. A boundary-matching network was proposed to retrieve proposals with reliable confidence scores instead of only using boundary probabilities, in which a weighted sampling strategy was adopted if the sampled frame index is not an integer [10]. Similar weighted strategies were adopted for a graph-based localization method [11], which formulates a semantic graph and a temporal graph to identify proposals and for a dense boundary generator method [12]. A set of Gaussian kernels were introduced to formulate the contextual contributions for an action proposal, which can be used to characterize proposals of various lengths [13]. Although these strategies allow the efficient computation of the proposal feature map, they could fail if the semantics are not distributed as per the assumptions made in these methods. The risk of missing important patterns from the frames not selected during the sampling could negatively impact the localization performance.

As large language models (LLMs) are increasingly utilized in various applications [17,18], recent studies have also explored their potential in online action detection [19] and video representation learning [20]. Recent works have started to explore the use of LLMs for TAL, such as T3AL [21] and MLLM4WTAL [22]. These methods leverage pre-trained vision–language models or multimodal LLMs to improve localization performance but may introduce high computational overhead and efficiency concerns, particularly during test-time adaptation. In contrast, our approach aims to achieve strong performance with efficient computation and without relying on large-scale LLMs. Exploring LLM-based approaches remains an interesting direction for future work.

3. Proposed Method

The overview of the proposed architecture is illustrated in Figure 2. A probability map representing the temporal IoU of the proposals with a ground truth action is estimated for a given video. Firstly, an input sequence is encoded following a two-level strategy for its global and local characteristics. Secondly, a density-adaptive sampler, which is adaptive to the temporal distribution of the encoded features, is introduced to formulate action proposal features. The probabilities of a proposal associated with an event at both fine and coarse levels are estimated to satisfy the local maximum property of a ground truth probability map. Finally, to enhance the robustness, temporal classification is applied to the final proposal selection. The details of the proposed modules are discussed in the following sections.

3.1. Problem Formulation

An input video can be viewed as a sequence of T image frames. By using pre-trained CNN features for each frame, a sequence

X = {x_{t}}_{t = 1}^{T}

can be obtained to characterize the video, where

x_{t} \in R^{d}

is the feature vector of the t-th frame.

Action localization is expected to identify a number of candidate action proposals and select the proposals in line with some pre-defined criteria. In this paper, the maximum duration of an action is denoted as

T_{s}

. Considering the events with the t-th frame as their starting point, it is reasonable to have

T_{s}

potential event proposals in terms of different lengths. In particular, we denote

P_{t, k}

as a proposal for which the starting frame is t and the length is k. Note that for the starting frames with

t > T - T_{s}

, the proposals that exceed the scope of the video are removed. For simplicity, we formally obtained a total of

T \times T_{s}

proposals. For the proposal

P_{t, k}

, the maximum temporal intersection of union with a ground truth action is defined as

p_{t, k}

, which indicates the probability that a proposal is associated with an event. A matrix

P = {p_{t, k}}_{t, k} \in R^{T \times T_{s}}

can be further defined as a probability map, and the proposals with probability 1 (i.e., temporal IoU = 1) can be selected as the localization outcomes. Figure 1 provides an illustration of this probability map.

With the spatial-temporal patterns of each proposal obtained from

X

, our proposed method aims to estimate the probability map

P

, which is denoted as

\hat{P} = {{\hat{p}}_{t, k}}_{t, k} \in R^{T \times T_{s}}

. In addition, to provide the supervision from a different perspective, the category of a frame is considered by introducing binary variables:

s_{t}

to indicate if the t-th frame is a starting point of an event and

e_{t}

to indicate if the t-th frame is an ending point of an event. The frame patterns obtained in line with

X

are used to estimate

s_{t}

and

e_{t}

as

{\hat{s}}_{t}

and

{\hat{e}}_{t}

, respectively.

3.2. Global and Local Encoders

Effective temporal action localization requires capturing both global context and local fine-grained information from the video sequence

X

. The motivation for this module is to design an encoder that captures long-range dependencies as well as local neighborhood patterns. We propose a two-level transformer-based encoding scheme, where the global encoder models long-term temporal relationships across the entire video, and the local encoder refines these representations with a localized context. This enables the model to handle actions of diverse durations and varying temporal distributions.

The global encoder aims to formulate long-term patterns of the entire sequence, which takes

X

as its input and outputs a sequence

X^{g} = {x_{t}^{g}}_{t = 1}^{T}

. As each frame can be associated with an event having a maximum duration of

T_{s}

frames, a sliding window of length

T_{s}

and stride 1 is further introduced to slide through the sequence

X^{g}

to observe the local-level temporal patterns for each frame. Specifically, for the t-th frame, the local encoder is applied to a subsequence

{x_{t}^{g}, x_{t + 1}^{g}, \dots, x_{t + T_{s} - 1}^{g},}

of

X^{g}

, and a feature vector

x_{t}^{l}

can be obtained. Therefore, a global–local representation

X^{l} = {x_{t}^{l}}_{t = 1}^{T}

for the input video is derived.

3.3. Density Adaptive Sampler

Sampling strategies critically impact the quality of proposal representations, especially for short and long actions with diverse temporal patterns. The motivation for this module is to develop an adaptive sampling method that effectively selects informative frames while accounting for the varying density of action semantics. We propose a template-guided self-attention mechanism to adaptively sample frames according to the proposal length. This approach allows for the data-driven learning of temporal dependencies, enabling more precise and flexible proposal feature representations.

An event proposal contains a number of continuous frames. As the length of different proposals can be inconsistent, existing practices introduced a sampler to select a subset of frames from each proposal containing the same number of frames to represent an event proposal. However, these conventional sampling methods could ignore critical information for localization, such as evenly distributed selection from the frames could be carried out by treating all frames with equal importance as having a risk of selecting meaningless frames where the content density is diversely distributed.

Therefore, a density adaptive sampler is proposed to effectively sample the most informative frames’ features from the global-=local representation

X^{l}

without using hand-crafted selection strategies and to combine these features into a representation with dimension

d_{s}

. This is inspired by the self-attention mechanism [23] used for transformers, which have recently gained promising performance for various vision tasks (e.g., [24,25,26]). As actions with different lengths can involve the frame patterns very differently, templates are introduced for the self-attention mechanism to guide the proposal formulation of different lengths.

The key advantage of the adaptive sampler is that it learns to assign dynamic attention weights to frames within each proposal through the template-guided self-attention mechanism. This mechanism allows the sampler to focus on frames that contain salient semantic information relevant to the action while down-weighting less informative or noisy frames. Unlike uniform sampling, which treats all frames equally, the adaptive sampler adapts its focus based on the content distribution within each proposal. This flexibility helps capture diverse temporal patterns, such as sudden motion changes or key action segments, that are crucial for accurate action localization. In essence, the adaptive sampler learns to identify and emphasize the frames that best characterize the semantics of the action proposal, effectively addressing the challenge of diverse and non-uniform temporal distributions in videos.

Template-Guided Self-attention. Formally,

u_{k} \in R^{d_{s}}

is a learnable vector representing the template to formulate a proposal of length k. Overall, we have

U = [u_{1}, \dots, u_{T_{s}}]

for proposal lengths in a range from 1 to

T_{s}

. The sampler first formulates the following probability:

p_{t^{'}, t, k} = p (A_{t^{'}, t, k} | X_{t, T_{s}}^{l}, u_{k}, Θ) s . t . t^{'} = t, \dots, t + T_{s} - 1,

(1)

where

A_{t', t, k}

indicates an incident that the

t^{'}

-th frame is associated with the proposal

P_{t, k}

,

X_{t, T_{s}}^{l} = [x_{t}^{l}, \dots, x_{t + T_{s} - 1}^{l}]

, and

Θ

is a set of learnable weights.

To estimate the probability

p_{t^{'}, t, k}

, we propose a template-guided self-attention mechanism. Similar to the original self-attention, the query

Q

, the key

K_{t, T_{s}}^{l}

, and the value

V_{t, T_{s}}^{l}

can be computed as

\begin{matrix} Q & = & W_{Q} U, \\ K_{t, T_{s}}^{l} & = & W_{K} X_{t, T_{s}}^{l}, \\ V_{t, T_{s}}^{l} & = & W_{V} X_{t, T_{s}}^{l}, \end{matrix}

(2)

where

W_{Q}

,

W_{K}

and

W_{V} \in Θ

are matrices of linear transforms with learnable parameters. With the k-th column vector

q_{k}

of

Q

, the probability

p_{t^{'}, t, k}

can be computed as

p_{t^{'}, t, k} = softmax {(\frac{q_{k}^{⊤} K_{t, T_{s}}^{l}}{\sqrt{d_{s}}})}_{t^{'}} .

(3)

The probabilities

p_{t^{'}, t, k}

,

t^{'} = t, \dots, t + T_{s} - 1

can be further treated as weights to aggregate the feature vectors in

V_{t, T_{s}}^{l}

as a feature vector for the proposal

p_{t, k}

:

x_{t, k}^{p} = softmax (\frac{q_{k}^{⊤} K_{l, t, T_{s}}}{\sqrt{d_{s}}}) V_{l, t, T_{s}} .

(4)

To apply the template-guided self-attention by considering the domain knowledge of the temporal localization problem, the following mechanisms are further investigated to increase the robustness of localization and the quality and proposals.

Ordered Multi-head Self-attention is introduced to facilitate the sampling procedure from multiple perspectives. A number of independent template-guided self-attentions, as defined in Equations (2)–(4), can be conducted, and the results can be concatenated and further projected by a linear transform. Note that each self-attention may dominantly concentrate on a unique frame in line with the derived probability vector. However, concatenating these results with a default order breaks the temporal relationships and could miss the temporal patterns associated with their natural sequential order. Therefore, these multi-head attentions are reordered in line with the order of their dominant frames appearing in the input sequence. For simplicity, we do not introduce new symbols for the results of the multi-head self-attention which

x_{t, k}^{p}

is used to represent the proposal features.

Masking Attention aims to prevent the computations of

x_{t, k}^{p}

for proposal

P_{t, k}

from involving irrelevant patterns that are out of the scope between the t-th to the

t + k - 1

-th frames. Specifically, only the first k keys are used when computing Equations (2)–(4), and the

t + k

-th to

T_{s} - 1

-th frame’s probabilities are ignored.

Noisy Sampling for Short Proposals. When sampling for a short proposal, it is possible that the number of heads is greater than the number of frames that the proposal contains. In this scenario, it is possible that the adaptive sampler samples similar frame features multiple times, which can increase the risk of over-fitting. To prevent this from happening, we concatenate

n_{h}

“meaningless” frame feature vectors to the head of

X_{t, T_{s}}^{l}

as

[H, X_{t, T_{s}}^{l}]

, where

H \in R^{d_{l} \times n_{h}}

, to derive the template-guided self-attentions. In this study,

H

is randomly generated with a Gaussian distribution.

In summary, the output of the density adaptive sampler can be defined as

X^{p} = {x_{t, k}^{p}} \in R^{T \times T_{s} \times d_{s}}

, which can be viewed as a feature map containing d channels.

3.4. Probability Map Estimation

Precise action localization requires estimating the likelihood that a given proposal overlaps with ground truth actions. The motivation for this module is to design an effective method for generating a probability map that reflects the IoU between proposals and ground truth actions. We formulate this as a regression problem and introduce a two-level convolutional scheme to first obtain a coarse probability map and then refine it into a fine probability map. This hierarchical approach ensures smoothness and continuity of the IoU map, improving localization accuracy.

To derive the proposal probability map, a set of 2D convolution layers are applied to the feature map

X^{p}

obtained from the template-guided self-attention, which formulates the localization as a regression problem. Specifically, the output

{\hat{P}}^{c} = {{\hat{p}}_{t, k}^{c}}_{t, k}

of the convolutions estimates the probability map

P

. Note that the proposals are not independent, and the neighboring proposals may share common patterns. Therefore, the kernel size of these convolutions can be set to a specific value other than 1, which only focuses on the proposal itself. However, this localized modeling could derive noisy probability maps, as illustrated in Figure 3, which leads to inaccurate proposal identifications.

Therefore, we treat this estimated probability map as a coarse prediction of the ground truth probability map, which indicates the importance of a proposal at a coarse level. It can be used to filter the proposal feature map

X^{p}

, by which the features of the unimportant proposals can be scaled down to reduce their impacts. Specifically, a proposal feature

x_{t, k}^{p}

is refined by

x_{t, k}^{p} \leftarrow {\hat{p}}_{t, k}^{c} \cdot x_{t, k}^{p} .

(5)

With the refined feature map, convolutions can be further utilized to obtain a fine probability map

{\hat{P}}^{f} = {{\hat{p}}_{t, k}^{f}}_{t, k}

, which tends to be smoother compared to its coarse counterpart.

3.5. Localization Loss

To train the model effectively, we need loss functions that align with the goals of temporal action localization: accurately predicting IoU maps and proposal boundaries. The motivation for this module is to design a comprehensive loss function that guides the model to learn both proposal quality and temporal boundary classification. We combine a proposal loss, a regional decreasing loss (to enforce smoothness in the probability map), and temporal classification losses. This combination provides strong supervision to optimize both proposal localization and classification.

To supervise and optimize the parameters, three types of losses are introduced, including a proposal loss

L_{p}

, a temporal classification loss

L_{t}

and a regional decreasing loss

L_{r}

with an additional

ℓ_{2}

regularization term

L_{Θ}

. The overall loss function

L

can be defined as

L = L_{p} + L_{t} + L_{r} + λ_{Θ} \cdot L_{Θ},

(6)

where

λ_{Θ}

is a hyper-parameter and

L_{Θ}

is the regularization term. The details of these losses are discussed as follows.

Proposal Loss

L_{p}

aims to measure the quality of the proposal’s probability map, which consists of two parts: the fine regression loss

L_{p, f}

and the coarse regression loss

L_{p, c}

. To balance the impacts from

L_{p, f}

and

L_{p, c}

, they are weighted by two hyper-parameters,

λ_{p, f}

and

λ_{p, c}

, respectively. Formally, the proposal loss

L_{p}

can be written as follows:

\begin{matrix} L_{p} = & λ_{p, f} L_{p, f} + λ_{p, c} L_{p, c} \\ = & λ_{p, f} \sum {‖ {\hat{P}}^{f} - P ‖}_{F}^{2} + λ_{p, c} \sum {‖ {\hat{P}}^{c} - P ‖}_{F}^{2}, \end{matrix}

(7)

where

{‖ \cdot ‖}_{F}

is a Frobenius norm.

Regional Decreasing Loss. As a peak (temporal IoU = 1) of a ground truth probability map indicates an actual event of action, intuitively, a neighborhood of the peak can be identified, within which the probability values tend to decrease. This fact can be observed in Figure 3a. To guide the estimated probability map to satisfy this property, a regional decreasing loss is introduced to the neighbourhoods of the local maximum in a predicted probability map.

For the coarse probability map

{\hat{P}}_{c}

, define

{\hat{P}}_{c}^{'} = {(t, k)}

, where a pair

(t, k)

is associated with a local minimum

{\hat{p}}_{t, k}^{c}

and

R_{t, k}^{c}

is a neighborhood of this local maximum. Similar definitions can be made for the fine probability map

{\hat{P}}^{f}

.

L_{r, f}

and

L_{r, c}

represent the fine and the coarse regional decreasing losses, respectively. Formally, the regional decreasing loss term is defined as

\begin{matrix} L_{r} = & λ_{r, f} L_{r, f} + λ_{r, c} L_{r, c} \\ = & λ_{r, f} L_{r, f} \sum_{(t, k) \in {\hat{P}}_{f}^{'}} \sum_{r \in R_{t, k}^{f}} m a x (r - p_{t, k}, 0) \\ + & λ_{r, c} L_{r, c} \sum_{(t, k) \in {\hat{P}}_{c}^{'}} \sum_{r \in R_{t, k}^{c}} m a x (r - p_{t, k}, 0), \end{matrix}

(8)

where

λ_{r, f}

and

λ_{r, c}

are two hyper-parameters to balance the two losses and r is a neighborhood value of local maximum.

Although the regional decreasing loss

L_{r}

is mainly introduced as an intuitive regularization term to enforce smoothness and decreasing patterns in the predicted probability map, we also provide some theoretical insights here. Specifically,

L_{r}

can be interpreted as a local shape prior that encourages the probability values near a detected peak (i.e., temporal IoU = 1) to decrease monotonically outward, reflecting the natural decay of relevance as the temporal distance increases. This loss effectively adds local constraints to the probability landscape, which can help reduce spurious peaks and promote robustness.

From a gradient perspective,

L_{r}

is piecewise differentiable and is composed of hinge-like terms that penalize the violation of the decreasing property. Hence, it does not introduce non-differentiability into the optimization, allowing standard gradient-based optimizers to train the model efficiently. However, we acknowledge that a rigorous mathematical proof of global convergence is beyond the scope of this paper and remains an open problem in the deep learning community. We plan to investigate this in future work by exploring the theoretical properties of

L_{r}

and its effect on the overall loss landscape.

Temporal Classification Loss. Temporal classification aims to identify the type of a frame, which introduces two binary classifiers: a starting point classifier and an ending point classifier. Specifically, each classifier contains a number of convolution layers, which are applied to the global-level representation

X^{g}

to obtain the classification predictions

{\hat{s}}_{t}

and

{\hat{e}}_{t}

for all frames. Due to the imbalanced nature of the ground truth, weighted binary entropy loss is applied, and the temporal classification loss

L_{c}

is defined as

L_{c} = \sum_{t} L_{c l s} ({\hat{s}}_{t}, s_{t}) + \sum_{t} L_{c l s} ({\hat{e}}_{t}, e_{t}),

(9)

where

L_{c l s}

is the loss for each classifier,

s_{t}, e_{t}

are the ground-truth starting and ending points at time t, and

{\hat{s}}_{t}, {\hat{e}}_{t}

are the predictions of starting and ending points at time t.

3.6. Proposal Selection

After estimating the proposal probabilities and boundaries, the final step is to select the most relevant proposals with high confidence. The motivation for this module is to develop a reliable post-processing strategy that filters and ranks proposals effectively for precise localization results. We introduce a confidence score formulation combining the fine probability map with temporal classification outputs, followed by soft non-maximum suppression (NMS) [27] to suppress redundant proposals. This ensures that the final selected proposals are both accurate and non-overlapping.

In addition to the fine prediction probability map

{\hat{P}}_{f}

, the inference also involves the temporal classification results, including

{\hat{s}}_{t}

and

{\hat{e}}_{t}

, to enhance the localization quality as an additional perspective. The final confidence score for a proposal

P_{t, k}

can be formulated as

{\hat{p}}_{t, k}^{f} * \sqrt{{\hat{s}}_{t} * {\hat{e}}_{t}} .

(10)

To process the overlapping proposals, post-processing algorithms are needed. Similar to the practice in the existing methods [7,10], soft-NMS [27] is applied to remove those proposals. Each proposal is inspected in a descending order in line with their confidence scores; other overlapping proposals are removed if their overlapping area is higher than a pre-defined threshold.

4. Experimental Results and Discussion

4.1. Datasets and Evaluation Metrics

THUMOS-14 [28] contains 200 videos for training and 213 videos for validation and testing. In total, 147 h of videos of 20 action categories were obtained, with an average of 7.5 actions per video. Note that the actions are more densely and diversely distributed in THUMOS-14 compared to ActivityNet-1.3.

ActivityNet-1.3 [29] contains 10,024 training videos, 4926 validation videos, and 5044 testing videos. This dataset provides 200 action categories. In total, 470 h of videos were obtained, with an average of 1.5 actions per video.

Evaluation Metrics Mean average precision at different temporal IoU thresholds are reported. Specifically, for THUMOS-14, we calculated mAP over temporal IoU thresholds of {0.3, 0.4, 0.5, 0.6, 0.7}, and for ActivityNet-1.3, we measured mAP over temporal IoU thresholds of {0.5, 0.75, 0.95}.

Although our experiments were primarily conducted on ActivityNet-1.3 and THUMOS-14, which are widely recognized benchmarks in temporal action localization, we acknowledge that testing on additional noisy, real-world datasets could provide further insights into the model’s generalization ability. ActivityNet-1.3 and THUMOS-14 do include actions with varying levels of background noise and event complexity, making them reasonable testbeds for evaluating model robustness. Nevertheless, future work will consider extending the evaluation to more challenging real-world video datasets, including those with more cluttered backgrounds and variable recording conditions.

4.2. Implementation Details

Feature Extraction. The spatial features of a video frame were obtained through pre-trained neural networks. For THUMOS-14, feature sequences were extracted using a pre-trained TSN [30]; by following [11], each video was divided into multiple sub-sequences with a fixed length of 256 via an observation window with a stride set to 128. Similar to [11], to reduce the computational cost, a downsampling strategy was applied for every five frames with an average pooling on their pre-trained features.

For ActiviyNet-1.3, by following the same settings in [1,10,11], feature sequences were extracted using the pre-trained two-stream network proposed in [31]. In particular,

δ

was set to 16 and the sequence was down-sampled to 100 frames using a linear interpolation.

Training and Inference. To reduce the computational costs, we further reduced the dimensions of a frame feature vector to 512 for training and inference by introducing an additional fully connected layer. The transformer for the global encoder consisted of three layers with eight heads per layer, balancing model expressiveness and computational cost. The choice of eight heads allows the model to capture diverse temporal dependencies without overwhelming GPU memory usage. Similarly, the local encoder uses one layer with eight heads to focus on capturing local temporal patterns effectively while maintaining computational efficiency. The size of the sliding window and the maximum duration of a proposal

T_{s}

was set to 64 and 100 frames for THUMOS-14 and ActivityNet-1.3, respectively. The learning rate was set to 0.01 and 0.001 for THUMOS-14 and ActivityNet-1.3, respectively, to be trained for 10 epochs. The batch size for THUMOS-14 was set to 4, and for ActivityNet-1.3, it was 16. In particular, we set

λ_{Θ}

to

1 \times 10^{- 4}

to apply moderate regularization that prevents overfitting without impeding convergence. Both

λ_{p, f}

and

λ_{p, c}

were set to 10 to balance the influence of the fine and coarse probability map losses relative to other loss components. These values were chosen based on preliminary experiments and grid search on a validation subset to achieve stable training and competitive performance on the benchmark datasets. The collected proposals were further classified by methods in [31,32] for THUMOS-14 and ActivityNet-1.3, respectively. The classification results (action scores) were weighted to

P^{f}

as the final probability map. To select the proper proposals as the final prediction, Soft-NMS [27] was employed using the weighted probability map. Top k predictions were adopted. Specifically, the top 200 and top 100 predictions were selected as the final output for THUMOS-14 and ActivityNet-1.3, respectively.

Platform. The experiments were implemented with an Nvidia GTX 3090 GPU, Intel Core i7-10700 CPU, with 32 GB of RAM. The implementation framework was Pytorch 1.7.0 with Python 3.8.3.

Efficiency Analysis. To analyze the computational efficiency of our method, we note that the proposed adaptive template-guided self-attention mechanism introduces an additional computational cost compared to uniform or linear sampling methods. Specifically, the complexity of the template-guided self-attention is

O (T \times T_{s} \times d)

, where T is the number of frames in the input video,

T_{s}

is the maximum duration of proposals, and d is the feature dimension. Although this increases the computational burden compared to uniform sampling, it remains tractable in practice, as we reduce the feature dimensions to 512 and apply efficient multi-head attention implementations. On our experimental platform (NVIDIA GTX 3090 GPU), the training time per epoch on THUMOS-14 is approximately 2.5 h, and inference takes around 0.02 s per video segment on average. This demonstrates that the method is practically efficient and scalable, even with the additional self-attention mechanism. However, for real-time deployment, further optimizations are necessary. Future research could explore model pruning, knowledge distillation, or lightweight attention modules to reduce computational costs while maintaining accuracy.

4.3. Comparison with the State of the Art

Table 1 lists the action detection performance in terms of mAP at different temporal IoU thresholds on the validation set of THUMOS-14. Our proposed method significantly outperformed the state-of-the-art methods at high temporal IoU thresholds. In particular, our method improves mAP@IoU 0.7 from 29.5 to 31.4 by 1.9 and mAP@0.6 from 41.8 to 43.5 by 1.7. Note that a metric associated with a higher temporal IoU threshold involves the predicted actions with higher temporal IoU values with ground truth actions, which aims to measure the predicted action proposals with more accurate boundaries. This demonstrates that our approach is particularly effective in scenarios requiring precise action boundary localization. Such precise localization is critical for applications like sports analytics, surveillance, and healthcare, where fine-grained temporal boundaries determine the accurate identification of critical events (e.g., injury incidents or traffic violations). The template-guided self-attention network, coupled with the two-level probability map estimation, enables the model to capture subtle temporal details, which is reflected by its superior performance at high thresholds. While the focus of this work is on achieving state-of-the-art accuracy, exploring parameter-efficient architectures, template parameter sharing, or lightweight transformers to reduce computational burden remains an important direction for future work. This would further enhance the applicability of the proposed method in real-world scenarios without compromising its strength in precise action boundary localization.

Table 2 lists the performance of our method on the validation set of ActivityNet-1.3. Our method achieved superior performance in terms of mAP@IoU 0.95 at 9.35, demonstrating its strength in generating precise temporal boundaries under strict conditions. Meanwhile, at IoU thresholds of 0.5 and 0.75, the performance is comparable to other methods. This is partly because ActivityNet-1.3 contains a larger proportion of long-duration actions, with only 14.6% of actions under 5 s compared to 68.12% in THUMOS-14. Such long actions often require capturing extensive temporal dependencies, which can pose challenges for methods that primarily focus on modeling local or mid-range temporal patterns. This suggests that while our proposed method is effective for short- and mid-duration actions—common in domains such as surveillance and sports—further research into modeling long-range temporal structures could benefit performance on datasets with longer actions. Overall, our results demonstrate that the proposed method performs well under strict IoU criteria and on shorter actions, highlighting its potential for high-precision temporal action localization. A fuller significance study—e.g., ten-fold repetitions and paired two-tailed t-tests across the entire validation set—remains a valuable avenue, but it demands considerable computing time that is beyond the scope of the present retrospective work; we therefore defer this to future research.

4.4. Ablation Study on Density Adaptive Sampler

In this subsection, an ablation study is presented to demonstrate the effectiveness of the proposed density adaptive sampler. A widely used linear sampling strategy was explored in our method to substitute for the template-guided self-attention [10]. Moreover, as a number of mechanisms, including ordered multi-head self-attention, masking attention, and noisy sampling for short proposals, are adopted for the proposed density adaptive sampler, an individual analysis of each mechanism is studied.

The linear sampling strategy samples 32-frame feature vectors from a candidate-proposal-related temporal region to form a proposal representation with a fixed length. The frame feature vector was obtained from our proposed two-level encoders. The results are listed in Table 1 as Uniform Sampling. It can be observed that the proposed sampling method achieved higher mAP at different temporal IoU thresholds on THUMOS-14 compared to the uniform sampling strategy. This indicates that an adaptive sampling method is more effective for capturing the semantics in THUMOS-14, and the semantics in this dataset are not distributed uniformly.

An architecture based on the proposed method without involving ordered multi-head self-attention, masking attention, and noisy sampling strategies is investigated. This architecture was improved with each mechanism individually for comparison purposes, which helps to verify their contributions one by one. The temporal localization performance metrics on THUMOS-14 of these mechanisms are listed in Table 3. Overall, every mechanism shows its effectiveness for temporal action localization. In particular, applying all mechanisms, the mAP@IoU 0.7 improves 4.4 from 28.2 to 32.6.

The mAP@IoU 0.7 improved 0.5 from 28.2 to 28.7 by using the ordered attention mechanism, which was not as significant as the other two mechanisms. Nonetheless, it is noticed that the ordered attention has a noticeable impact on facilitating the convergence when training the localization model. This indicates that providing the candidate features in line with their temporal order in the video could be helpful in guiding the optimization. The masking mechanism helps a proposal to focus on the patterns in its relevant frames, which improved the mAP@IoU 0.7 from 28.2 to 30.3. The introduction of noise sampling for short proposals has the most significant performance increments. It improved mAP@IoU 0.7 2.7 from 28.2 to 30.9. In addition, we further investigate how the number of noise frames contributes to the performance. Specifically, we set the number of noise frames to 1, 4 and 16. The results are listed in Table 4. As mAP@IoU 0.7 is a more difficult metric and one noisy frame performed best in regards to mAP@IoU 0.7, only one noise frame is inserted in our architecture.

4.5. Impacts of the Sliding Window Size

The sliding window size

T_{s}

could be a factor that affects the performance of the proposed method since

T_{s}

determines the maximum duration of the proposals. Intuitively, a small sliding window tends to limit the capacity and the scope of the method to identify an action of longer duration, whilst a large sliding window requires more computational resources and increases the model complexity. Note that increasing

T_{s}

will also increase the number of templates. In this subsection, the impact of different sliding window sizes

T_{s}

is explored in THUMOS-14. Specifically, we compare the

T_{s}

of 48, 64, and 80, where 64 is the maximum duration of an action in THUMOS-14. As shown in Table 5, increasing

T_{s}

from a small value to a large value does have a positive impact. The mAP@IoU 0.7 improves from 28.4 to 32.6. However, an overly large

T_{s}

negatively impacts the performance, where the mAP@IoU 0.7 drops from 32.6 to 30.0. With the increase in the templates, more trainable weights are involved, and potential overfitting risks may also increase.

The reason for this phenomenon is that a larger

T_{s}

allows the model to capture longer temporal dependencies, which is beneficial for localizing actions with longer duration and more temporal context. However, it also results in higher computational complexity, as more proposals need to be evaluated, and the self-attention modules must process longer sequences. This can lead to increased training and inference time as well as a higher risk of overfitting, particularly if the dataset has fewer long-duration actions. In addition, as

T_{s}

increases, the number of learnable templates also grows, potentially leading to scalability issues. This is because each template represents a distinct proposal length and therefore increases the parameter count. To address this, future work could explore dynamic or learnable sliding window sizes, as well as template parameter sharing, weight tying, or other parameter-efficient architectures. These approaches could help mitigate model complexity and overfitting risk while maintaining or improving performance. Therefore, selecting an appropriate

T_{s}

requires balancing accuracy, computational efficiency, and model scalability.

4.6. Qualitative Analysis

As illustrated in Figure 4, the localization results of two videos from the THUMOS-14 dataset are compared between the actions identified by the proposed method and the ground truth. In particular, different mechanisms proposed in this paper were also investigated, as in the ablation study. The first video contains four events of interest, which are tennis swing actions, of which the duration varies from 1 s to 4 s. A method without ordered attention, mask, and noisy sampling mechanisms (for simplicity, denoted as the baseline method) identified three events with high overlap with the ground truth actions. A short event that is missed by using the baseline method can be identified by the method with the noisy sampling strategy. As the duration of these actions is short, additional noisy frames could be more important in these proposals and are beneficial for localization. The ordered attention mechanism helps to formulate more accurate boundaries compared to the baseline method. Taking advantage of all the proposed mechanisms, a false negative event proposal was removed, which suggests that these mechanisms are able to mutually contribute to the localization. The second video identifies cricket bowling actions, of which the duration is around 1 s. By involving all these mechanisms, more accurate action boundaries for cricket bowling can be identified.

5. Conclusions

In this paper, a novel temporal action localization method is presented, namely the adaptive template-guided self-attention network, to effectively characterize the diversely distributed semantics within a video over time. Learnable templates for the proposals with different lengths are introduced to represent spatio-temporal patterns from their associated temporal frames in a data-driven manner. In addition, the properties of a ground truth temporal IoU probability map are considered with a two-level strategy, which formulates the probability map from coarse to fine using a shortcut connection between two sets of full convolutions. The experimental results on two widely used benchmark datasets demonstrated the effectiveness of our proposed method, especially for the THUMOS-14 dataset, in which the actions present a more diverse distribution over time. Nevertheless, our method has some limitations. For example, in videos with significant background clutter or very low resolution, the global and local encoders may struggle to extract accurate features, leading to sub-optimal localization. Moreover, in cases where actions overlap densely or occur concurrently, precise boundary localization remains challenging. In our future work, we aim to reduce the computations of the adaptive sampler for a more efficient temporal action localization. Furthermore, the properties of a temporal IoU map will be investigated to guide the localization modeling. In addition, we plan to explore methods such as attention map visualization and template contribution analysis to further illustrate the model’s internal reasoning and improve its interpretability.

Author Contributions

Conceptualization, Z.X., Y.D. and L.T.; methodology, Z.X., Y.D., L.T. and S.L.; software, Z.L.; validation, Z.X., Z.L. and Y.D.; investigation, Z.L.; data curation, Z.L.; writing—original draft, Z.X. and Z.L.; writing—review & editing, Y.D., L.T. and S.L.; visualization, S.L.; supervision, L.T. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study did not require Institutional Review Board (IRB) or Human Research Ethics Committee (HREC) approval because it exclusively used publicly available benchmark datasets (THUMOS-14 and ActivityNet-1.3) that were collected prior to this research. No new data or human subject interaction was involved. According to institutional and national guidelines, secondary analysis of such non-identifiable data is exempt from formal ethical review.

Data Availability Statement

The original data presented in the study are openly available in THUMOS14 dataset at https://paperswithcode.com/dataset/thumos14-1 (accessed on 1 June 2025), and ActivityNet-1.3 dataset at https://paperswithcode.com/dataset/activitynet (accessed on 1 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Gao, J.; Yang, Z.; Chen, K.; Sun, C.; Nevatia, R. TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Xu, H.; Das, A.; Saenko, K. R-C3D: Region Convolutional 3D Network for Temporal Activity Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Liu, Q.; Wang, Z. Progressive boundary refinement network for temporal action detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11612–11619. [Google Scholar]
Dai, X.; Singh, B.; Zhang, G.; Davis, L.S.; Qiu Chen, Y. Temporal Context Network for Activity Localization in Videos. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Chao, Y.; Vijayanarasimhan, S.; Seybold, B.; Ross, D.A.; Deng, J.; Sukthankar, R. Rethinking the Faster R-CNN Architecture for Temporal Action Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1130–1139. [Google Scholar]
Zhao, Y.; Xiong, Y.; Wang, L.; Wu, Z.; Tang, X.; Lin, D. Temporal Action Detection with Structured Segment Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Lin, T.; Zhao, X.; Su, H.; Wang, C.; Yang, M. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
Zhao, P.; Xie, L.; Ju, C.; Zhang, Y.; Wang, Y.; Tian, Q. Bottom-up temporal action localization with mutual regularization. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 539–555. [Google Scholar]
Zeng, R.; Huang, W.; Tan, M.; Rong, Y.; Zhao, P.; Huang, J.; Gan, C. Graph Convolutional Networks for Temporal Action Localization. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
Lin, T.; Liu, X.; Li, X.; Ding, E.; Wen, S. BMN: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3889–3898. [Google Scholar]
Xu, M.; Zhao, C.; Rojas, D.S.; Thabet, A.; Ghanem, B. G-tad: Sub-graph localization for temporal action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10156–10165. [Google Scholar]
Lin, C.; Li, J.; Wang, Y.; Tai, Y.; Luo, D.; Cui, Z.; Wang, C.; Li, J.; Huang, F.; Ji, R. Fast learning of temporal action proposal via dense boundary generator. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11499–11506. [Google Scholar]
Long, F.; Yao, T.; Qiu, Z.; Tian, X.; Luo, J.; Mei, T. Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 344–353. [Google Scholar]
Kanwal, S.; Mehta, V.; Dhall, A. Large Scale Hierarchical Anomaly Detection and Temporal Localization. In Proceedings of the ACM International Conference on Multimedia, Online, 12–16 October 2020; pp. 4674–4678. [Google Scholar]
Deng, Z.; Guo, Y.; Han, C.; Ma, W.; Xiong, J.; Wen, S.; Xiang, Y. AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways. ACM Comput. Surv. 2025, 57, 1–36. [Google Scholar] [CrossRef]
Chen, X.; Li, C.; Wang, D.; Wen, S.; Zhang, J.; Nepal, S.; Xiang, Y.; Ren, K. Android HIV: A study of repackaging malware for evading machine-learning detection. IEEE Trans. Inf. Forensics Secur. 2019, 15, 987–1001. [Google Scholar] [CrossRef]
Zhu, X.; Zhou, W.; Han, Q.L.; Ma, W.; Wen, S.; Xiang, Y. When software security meets large language models: A survey. IEEE/CAA J. Autom. Sin. 2025, 12, 317–334. [Google Scholar] [CrossRef]
Zhou, W.; Zhu, X.; Han, Q.L.; Li, L.; Chen, X.; Wen, S.; Xiang, Y. The security of using large language models: A survey with emphasis on ChatGPT. IEEE/CAA J. Autom. Sin. 2024, 12, 1–26. [Google Scholar] [CrossRef]
Chen, J.; Lv, Z.; Wu, S.; Lin, K.Q.; Song, C.; Gao, D.; Liu, J.W.; Gao, Z.; Mao, D.; Shou, M.Z. Videollm-online: Online video large language model for streaming video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18407–18418. [Google Scholar]
Zhao, Y.; Misra, I.; Krähenbühl, P.; Girdhar, R. Learning video representations from large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 6586–6597. [Google Scholar]
Liberatori, B.; Conti, A.; Rota, P.; Wang, Y.; Ricci, E. Test-time zero-shot temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18720–18729. [Google Scholar]
Zhang, Q.; Fang, J.; Yuan, R.; Tang, X.; Qi, Y.; Zhang, K.; Yuan, C. Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, Honolulu, HI, USA, 19–25 October 2025; pp. 24139–24148. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Tomizuka, M.; Keutzer, K.; Vajda, P. Visual transformers: Token-based image representation and processing for computer vision. arXiv 2020, arXiv:2006.03677. [Google Scholar]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision. Springer, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS — Improving Object Detection with One Line of Code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5562–5570. [Google Scholar]
Jiang, Y.G.; Liu, J.; Roshan Zamir, A.; Toderici, G.; Laptev, I.; Shah, M.; Sukthankar, R. THUMOS Challenge: Action Recognition with a Large Number of Classes. 2014. Available online: http://crcv.ucf.edu/THUMOS14/ (accessed on 8 October 2024).
Heilbron, F.C.; Escorcia, V.; Ghanem, B.; Niebles, J.C. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 961–970. [Google Scholar]
Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision. Springer, Amsterdam, The Netherlands, 11–14 October 2016; pp. 20–36. [Google Scholar]
Xiong, Y.; Wang, L.; Wang, Z.; Zhang, B.; Song, H.; Li, W.; Lin, D.; Qiao, Y.; Van Gool, L.; Tang, X. CUHK & ETHZ & SIAT submission to activitynet challenge 2016. arXiv 2016, arXiv:1608.00797. [Google Scholar]
Wang, L.; Xiong, Y.; Lin, D.; Van Gool, L. UntrimmedNets for Weakly Supervised Action Recognition and Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6402–6411. [Google Scholar]
Qiu, H.; Zheng, Y.; Ye, H.; Lu, Y.; Wang, F.; He, L. Precise temporal action localization by evolving temporal proposals. In Proceedings of the ACM on International Conference on Multimedia Retrieval, Yokohama, Japan, 11–14 June 2018; pp. 388–396. [Google Scholar]

Figure 1. Illustration of temporal action localization and proposal probability (i.e., temporal IoU (Intersection over Union) with the ground truth map) for

T_{s}

templates. Candidate proposals are evaluated via temporal IoU. Our method predicts the temporal IoU map, in which the x-axis indicates the starting location and the y-axis indicates the duration of a proposal.

Figure 1. Illustration of temporal action localization and proposal probability (i.e., temporal IoU (Intersection over Union) with the ground truth map) for

T_{s}

templates. Candidate proposals are evaluated via temporal IoU. Our method predicts the temporal IoU map, in which the x-axis indicates the starting location and the y-axis indicates the duration of a proposal.

Figure 2. Illustration of the proposed architecture. The input video is fed to a Global Transformer to encode the long-range dependency between frames and generate a global frame feature map

X^{g}

. Frame Classifier identifies the boundary probability of each frame using

X^{g}

. Local Transformer focuses on neighborhood patterns of a frame, and a local frame feature map

X^{l}

can be obtained. A sliding window mechanism is employed to capture a portion of

X^{l}

for sampling frames to construct proposal feature representations

X^{p}

using an adaptive sampling strategy. A coarse probability map is generated through convolution layers using

X^{p}

. The coarse probability map is used to refine the proposal feature map

X^{p}

and aims for a fine probability map.

Figure 2. Illustration of the proposed architecture. The input video is fed to a Global Transformer to encode the long-range dependency between frames and generate a global frame feature map

X^{g}

. Frame Classifier identifies the boundary probability of each frame using

X^{g}

. Local Transformer focuses on neighborhood patterns of a frame, and a local frame feature map

X^{l}

can be obtained. A sliding window mechanism is employed to capture a portion of

X^{l}

for sampling frames to construct proposal feature representations

X^{p}

using an adaptive sampling strategy. A coarse probability map is generated through convolution layers using

X^{p}

. The coarse probability map is used to refine the proposal feature map

X^{p}

and aims for a fine probability map.

Figure 3. Illustration of the two-level probability map. (a) Ground truth probability map. (b) Estimated coarse probability map. (c) Estimated fine probability map.

Figure 4. Qualitative examples of the localization results on THUMOS-14.

Table 1. Action detection results on THUMOS-14’s test set, measured by mAP (%) at different temporal IoU thresholds.

Methods	0.3	0.4	0.5	0.6	0.7
TURN [1]	44.1	34.9	25.6	-	-
TCN [4]	-	33.3	25.6	15.9	0.9
R-C3D [2]	44.8	35.6	28.9	-	-
SSN [6]	51.9	41.0	29.8	19.6	10.7
BSN [7]	53.5	45.0	36.9	28.4	20.0
TAL-NET [5]	53.2	48.5	42.8	33.8	20.8
GTAN [13]	57.8	47.2	38.8	-	-
BMN [10]	56.0	47.4	38.8	29.7	20.5
BSN+PGCN [9]	63.6	57.8	49.1	-	-
PBRNet [3]	58.5	54.6	51.3	41.8	29.5
ETP [33]	48.2	42.4	34.2	23.4	13.9
G-TAD [11]	54.5	47.6	40.2	30.8	23.4
G-TAD+PGCN [11]	66.4	60.4	51.6	37.6	22.9
DBG [12]	57.8	49.4	39.8	30.2	21.7
BU-TAL [8]	53.9	50.7	45.4	38.0	28.5
Uniform Sampling *	42.0	36.5	29.9	21.2	13.6
Ours	66.0	60.7	53.7	44.16	32.6

* A global–local encoder with a linear interpolated uniform sampling.

Table 2. Action detection results on ActivityNet-1.3’s validation set, measured by mAP (%) at different temporal IoU thresholds.

Methods	0.5	0.75	0.95	Avg
TCN [4]	37.49	23.47	4.47	23.58
SSN [6]	43.26	28.70	5.63	28.28
BSN [7]	46.45	29.96	8.02	30.03
TAL-NET [5]	38.23	18.30	1.30	20.22
GTAN [13]	52.61	34.14	8.91	34.31
BMN [10]	50.07	34.78	8.29	33.85
PBRNet [3]	53.96	34.97	8.98	35.01
G-TAD [11]	50.36	34.60	9.02	34.09
BU-TAL [8]	43.47	33.91	9.21	30.12
Ours	42.11	30.21	9.35	30.0

Table 3. Effectiveness of the proposed mechanisms for self-attention on THUMOS-14’s test set measured by mAP(%) at different temporal IoU thresholds.

Ordered	Mask	Noise
Attention		Frame	0.3	0.4	0.5	0.6	0.7
✗	✗	✗	61.7	57.2	50.6	39.7	28.2
✓	✗	✗	62.7	58.4	50.5	40.9	28.7
✗	✓	✗	63.0	58.5	50.9	40.8	30.3
✗	✗	✓	63.4	59.3	52.3	43.1	30.9
✓	✓	✓	66.0	60.7	53.7	44.2	32.6

Table 4. Impacts of the number of noise frames on THUMOS-14’s test set measured by mAP (%) at different temporal IoU thresholds.

# of Noise Frames	0.3	0.4	0.5	0.6	0.7
1	63.4	59.3	52.3	43.1	30.9
4	62.4	58.5	52.2	42.8	30.4
16	64.8	60.6	54.3	43.0	30.5

Table 5. Impacts of the sliding window size on THUMOS-14’s test set measured by mAP(%) at different temporal IoU thresholds.

Sliding Windows Size $T_{s}$	0.3	0.4	0.5	0.6	0.7
48	63.4	58.8	50.4	40.4	28.4
64	66.0	60.7	53.7	44.2	32.6
80	65.6	60.8	53.5	41.9	30.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xu, Z.; Lu, Z.; Ding, Y.; Tian, L.; Liu, S. Adaptive Temporal Action Localization in Video. Electronics 2025, 14, 2645. https://doi.org/10.3390/electronics14132645

AMA Style

Xu Z, Lu Z, Ding Y, Tian L, Liu S. Adaptive Temporal Action Localization in Video. Electronics. 2025; 14(13):2645. https://doi.org/10.3390/electronics14132645

Chicago/Turabian Style

Xu, Zhiyu, Zhuqiang Lu, Yong Ding, Liwei Tian, and Suping Liu. 2025. "Adaptive Temporal Action Localization in Video" Electronics 14, no. 13: 2645. https://doi.org/10.3390/electronics14132645

APA Style

Xu, Z., Lu, Z., Ding, Y., Tian, L., & Liu, S. (2025). Adaptive Temporal Action Localization in Video. Electronics, 14(13), 2645. https://doi.org/10.3390/electronics14132645

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Adaptive Temporal Action Localization in Video

Abstract

1. Introduction

2. Related Work

3. Proposed Method

3.1. Problem Formulation

3.2. Global and Local Encoders

3.3. Density Adaptive Sampler

3.4. Probability Map Estimation

3.5. Localization Loss

3.6. Proposal Selection

4. Experimental Results and Discussion

4.1. Datasets and Evaluation Metrics

4.2. Implementation Details

4.3. Comparison with the State of the Art

4.4. Ablation Study on Density Adaptive Sampler

4.5. Impacts of the Sliding Window Size

4.6. Qualitative Analysis

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI