Article

Temporal Map-Based Boundary Refinement Network for Video Moment Localization

1 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
2 School of Artificial Intelligence, Anhui University, Hefei 230039, China
3 College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(8), 1657; https://doi.org/10.3390/electronics14081657
Submission received: 29 March 2025 / Revised: 16 April 2025 / Accepted: 17 April 2025 / Published: 19 April 2025
(This article belongs to the Special Issue Big Model Techniques for Image Processing)

Abstract

Video moment localization has gradually become a hot research problem in video understanding. Despite remarkable progress, two limitations remain. Firstly, previous methods usually regard the moment candidate with the highest confidence score directly as the final localization result, overlooking the inevitable deviations between moment candidates and the ground truth. Secondly, past models have not considered that moment candidates of varying quality have different impacts on model training. Therefore, this paper proposes a novel Temporal Map-based Boundary Refinement Network to address these problems. Specifically, besides the conventional confidence score prediction network, we introduce a boundary refinement network based on the 2D temporal map, which fine-tunes the temporal boundaries of the generated moment candidates to obtain more precise results. Additionally, to treat moment candidates of different qualities discriminately and further boost localization performance, we devise an innovative weighted refinement loss that guides the model to focus more on refining those moment candidates closer to the ground truth. Finally, we evaluate our model on two publicly available datasets, and extensive experiments show that our technique outperforms state-of-the-art methods (e.g., a relative improvement of 4.63% on R1@0.7 for the ActivityNet Captions dataset).

1. Introduction

With the rapid development of the Internet, we have entered the era of the Internet of Everything. Massive amounts of multimedia data, including speech, text, images, and videos, have poured into people's daily lives [1,2,3]. Although these data bring unprecedented convenience, they also pose formidable challenges. How to quickly and accurately retrieve valuable information from such vast and chaotic data, especially video data, has become a mainstream research direction in information retrieval. Video moment localization, one of the most typical topics in the multimedia retrieval and computer vision communities [4,5,6], is a critical and challenging problem.
Video moment localization (VML for short) aims to precisely locate the segment of a long untrimmed video that corresponds to a given natural language query. For instance, as query 2 in Figure 1 shows, “she washes her face and wipes it down with a rag still speaking to the camera” involves three actions: “wash face”, “wipe face with a rag”, and “speak to camera”. Therefore, we need to retrieve the segment of the video that contains at least these three consecutive action scenes. In addition, it is worth mentioning that VML has penetrated many industrial fields [7] (e.g., intelligent security and human–machine interaction) and also benefits several related cross-modal retrieval tasks, such as visual dialogue [8], video captioning [9,10], and so on.
Solving this intractable yet meaningful task involves two significant challenges. (1) Text queries and videos exhibit significant heterogeneity since they come from different modalities, so there is a substantial semantic gap between them. (2) The background and action scenes presented in videos are complicated and dynamic. Effectively tackling these issues requires both a thorough comprehension of video content and accurate cross-modal semantic alignment.
To overcome these challenges, researchers have developed various strategies to address VML tasks effectively. To date, mainstream approaches can generally be classified into two categories. (1) Proposal-free methods, which, instead of generating abundant moment candidates, directly predict the start and end times of the target segment or the probability distributions of the two boundaries over all frames of a video. Because this type of approach does not need to generate moment candidates, it is computationally efficient. However, it focuses on directly predicting more precise boundaries while ignoring the rich visual content in video clips, leading to mediocre localization results. (2) Proposal-based methods, which generally follow the object detection pipeline. Specifically, they first generate many moment candidates in a pre-defined manner and then predict their confidence scores. Subsequently, they sort these moment candidates by their confidence scores and regard the candidate with the highest score as the final result. Here, it is worth noting that the most common way to generate moment candidates is the sliding window method, which obtains abundant candidates by dense sampling along the temporal dimension of the video with multiple windows of different sizes. Compared with proposal-free methods, their final predictions are more impressive due to a more accurate comprehension of the visual content in the video. Nonetheless, proposal-based methods also have drawbacks, such as a heavy computational burden and over-reliance on heuristic algorithms to generate moment candidates.
Motivated by this, we propose a novel Temporal Map-based Boundary Refinement Network (TMBRN) in this paper. Our solution also follows the proposal-based pipeline because of its better localization performance. Specifically, to avoid the excessive reliance on heuristic candidate-generation algorithms in most previous proposal-based methods, we construct a 2D temporal feature map to obtain moment candidates flexibly. As presented in Figure 1, the horizontal and vertical axes of the 2D coordinate system represent the start and end clip indices, respectively, and each grid in the temporal map represents a moment proposal (candidate). In this way, we can obtain abundant moment proposals with extremely fine-grained time spans. However, in conventional proposal-based methods, the moment candidate with the highest confidence score is directly regarded as the final localization result; no matter how fine the granularity of the grid partitioning is, its boundaries rarely coincide exactly with the start and end times of the target moment. Therefore, besides the confidence score prediction network, we introduce a boundary refinement network based on this temporal map to further fine-tune the endpoints of the moment candidates so that they are as close as possible to the boundaries of the target moment. Meanwhile, to reduce the heavy computational burden caused by the massive number of generated moment candidates, we develop a multi-step sampling strategy to drop unnecessary moment proposals.
Meanwhile, we observe that existing models do not consider that moment candidates of different qualities may have different effects on model training. For instance, as shown in Figure 2, both proposal 1 and proposal 2 are sampled from the constructed temporal map, but proposal 1 is closer to the ground truth. If these moment proposals are treated equally during model learning, the superiority of proposal 1 over proposal 2 is not fully exploited. To solve this problem, we propose a novel weighted refinement loss function to strengthen the boundary refinement of proposal 1 and weaken that of proposal 2. To the best of our knowledge, we are the first in the VML task to incorporate a weighting factor into the loss function to distinguish the boundary refinement process for moment candidates of different qualities.
In conclusion, our main contributions are summarized below:
  • In this paper, a Temporal Map-based Boundary Refinement Network framework is proposed to handle VML tasks. Firstly, in addition to predicting the confidence scores for moment candidates, we introduce a boundary refinement network based on a temporal map to fine-tune the boundaries of these generated moment candidates. Then, to reduce the heavy computation burden, we leverage a multi-step sampling strategy to drop some redundant moment candidates.
  • We innovatively design a weighted refinement loss function to guide the model to focus more on the boundary refinement process of those moment candidates closer to the ground truth, thereby further improving the model’s performance.
  • Extensive experiments are conducted on two datasets, and the results demonstrate that our proposed model has outperformed the state-of-the-art methods.
Roadmap. The remainder of this paper is structured as follows. In Section 2, we introduce the related work. In Section 3, the proposed approach is described in detail. In Section 4, we present comprehensive experimental results and in-depth analysis to validate our approach. Finally, in Section 5, we conclude this paper. In addition, all vectors and matrices are denoted in bold.

2. Related Work

In this section, following the evolution of video retrieval research, we briefly describe related work on text–video retrieval, temporal action localization, and video moment localization in sequence.

2.1. Text–Video Retrieval

The conventional video retrieval task is to find, from a video corpus, the entire video that includes the scene described by the text query. It is worth noting that this type of task does not need to localize a specific moment. In addition, video retrieval is treated as a bidirectional task in many previous works: a video can be searched with a given text, and the relevant text can also be found with a given video. Most early studies on video retrieval were based on the idea of a common space [11,12,13]. That is, they project both text and video into a joint common space, where semantically matching videos and texts are closer. To improve retrieval efficiency, ref. [14] proposed a deep hashing framework for large-scale video similarity search that learns lightweight yet effective binary hash codes. In [15], aiming to obtain a comprehensive understanding of both text and video content, a dual deep encoding network was proposed to represent video and query at three levels of features, comprising global, temporal-aware, and local encoding. As traditional approaches tend to ignore the deep fusion between text and visual content, Cao et al. [16] proposed a novel co-attention network to explicitly emphasize the cross-modal feature interaction between video and recipe (text). Moreover, conventional common space learning is not sufficient to represent complex visual and textual details. To solve this problem, Chen et al. [17] introduced a hierarchical graph reasoning (HGR) model, which decomposes the text into a multilevel semantic graph containing three levels: scenario, actions, and entities. Hierarchical text representations are then generated through attention-based graph reasoning, so that video representations at different levels can be guided to perform cross-modal matching and capture global and local interactive details. Furthermore, to realize video retrieval for complex queries, a tree-augmented cross-modal encoding method was proposed that jointly learns the semantic structure of the text query and the temporal representation of the video [18].

2.2. Temporal Action Localization

The purpose of the temporal action localization task is to localize all small segments from a video and predict their classes of action. For this kind of task, Shou et al. [19] proposed a sequential multi-stage CNN network, which includes a proposal identification network, a classification network, and a localization network. These three sub-networks were designed to identify, classify, and regress candidates, respectively, achieving precise action localization. Meanwhile, the reuse of feature units guaranteed fast and efficient computation. Buch et al. [20] introduced the single-stream temporal action proposal network (SST), a new efficient deep architecture to generate candidates. Specifically, it can run continuously in long frame sequences through a single stream and eliminate the necessity of splitting the input video into short overlapping clips or temporal windows. Most of the time, many previous methods handled all action proposals individually and did not explicitly take advantage of their correlations in the model learning phase. However, the relationship between candidates is necessary for temporal action localization. Therefore, Zeng et al. [21] utilized graph convolutional networks (GCNs) to explore the potential relations between candidates.

2.3. Video Moment Localization

Compared to video retrieval and temporal action localization, the video moment localization task is closer to real application scenarios. Therefore, researchers have paid increasing attention to this open problem in recent years. In short, previous research solutions can be roughly divided into two categories: proposal-based and proposal-free methods.
The proposal-based methods generally follow a proposal-and-rank pipeline. Specifically, all moment candidates are generated first and sorted by their predicted confidence scores, and then the one with the highest score is regarded as the final prediction result. Chen et al. [22] proposed a novel temporal grounding network to capture the evolving fine-grained frame–word interaction between video and natural language in real time. In contrast to the traditional sliding window method, the final localization result was acquired by extracting the video once, which can greatly reduce the prediction time cost. Zhang et al. [23] proposed a cross-modal interaction network encompassing GCNs and the Multi-Head Self-Attention module to learn more fine-grained representation for each modality. Xu et al. [24] introduced a query-guided segment proposal network, which integrated the features of text query in the early phase of moment candidate generation to avoid the production of irrelevant video segments. In [25], the authors proposed a fast video moment retrieval model, which introduced a fine-grained semantic distillation framework to improve the efficiency of the common space.
The proposal-free methods directly predict the two boundaries of the target moment or the probability distribution of the boundaries over all video frames. Reference [26] proposed a collaborative attention mechanism to achieve multi-modal interaction and enhance localization precision. Lu et al. [27] proposed a dense bottom-up grounding framework, which regards all frames within the ground-truth interval as foreground, with each foreground frame regressing its distances to the two boundaries. Mun et al. [28] proposed a hierarchical feature interaction model to extract middle-level features corresponding to significant semantic entities in textual phrases. Li et al. [29] devised a context pyramid network to study multi-scale temporal correlation in the video.
In addition to the above two kinds of mainstream video moment localization methods, many scholars have applied the currently popular reinforcement learning technology to address VML tasks. The conventional proposal-based methods have to generate abundant candidates, which leads to substantial computational overhead. For this problem, He et al. [30] described the task as a sequential decision problem by learning an agent that gradually adjusts the time boundary according to some specific strategies. Specifically, they proposed a reinforcement learning framework based on multi-task learning, which takes additional supervised boundary information into account during the training phase, and finally improves the location performance. To solve the issue that most existing methods are inefficient and deviate from the characteristics of human perception, Wu et al. [31] proposed a new progressive reinforcement learning framework with the tree structure strategy, which adjusts the temporal boundaries sequentially through an iterative process.

3. Proposed Approach

In this section, we first elaborate on the formulation of VML tasks. Then, we will describe every module of our proposed TMBRN in detail. In addition, the overview of the TMBRN is shown in Figure 3.

3.1. Problem Formulation

In the case of a given pair of a long untrimmed video $V$ and a natural language query $Q$, the VML task aims to exactly localize a short segment $S$ that is semantically the same as the text query. For computational convenience, the video is denoted as $V = \{f_i\}_{i=0}^{l_V - 1}$, where $f_i$ indicates the $i$-th frame and $l_V$ represents the total number of frames in the video. In the same way, a natural language query can be denoted as $Q = \{q_i\}_{i=0}^{l_Q - 1}$, in which $q_i$ represents the $i$-th word and $l_Q$ stands for the total number of words in the text query. Therefore, the key challenge of VML tasks is how to locate a precise segment $S = \{f_i\}_{i=t_s}^{t_e}$ according to the textual query, where $t_s$ and $t_e$ represent the indices of the start and end frames.

3.2. Feature Extraction

For a given text query $Q$, we first obtain the embedding $w_i \in \mathbb{R}^{d_q}$ of each word $q_i$ using pre-trained GloVe embeddings [32], where $d_q$ represents the embedding size. It is well known that the LSTM (Long Short-Term Memory) network [33] has demonstrated exceptional capability in extracting contextual information in the deep learning community. Therefore, we feed the sentence query $Q$ into a three-layer bidirectional LSTM and gather the hidden states at each time step. Consequently, we obtain the contextual feature of the query, denoted as $Q^h \in \mathbb{R}^{l_Q \times d_h}$, where $d_h$ indicates the dimension of the hidden state. Formally, it can be represented as:
$$h_i^f = \mathrm{LSTM}_q^f(w_i, h_{i-1}^f), \tag{1}$$
$$h_i^b = \mathrm{LSTM}_q^b(w_i, h_{i+1}^b), \tag{2}$$
$$q_i^h = [h_i^f; h_i^b], \tag{3}$$
where $\mathrm{LSTM}_q^f$ and $\mathrm{LSTM}_q^b$ are the forward and backward LSTM networks, $h_i^f$ and $h_i^b$ represent the hidden states of the $i$-th unit of the forward and backward LSTM, and $[\,;\,]$ stands for the concatenation of tensors.
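For concreteness, the following is a minimal PyTorch sketch of the query encoder in Equations (1)–(3); the embedding size, hidden size, and three-layer bidirectional LSTM follow the settings reported in Section 4.3, while the class and variable names are our own illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Bidirectional LSTM query encoder (Equations (1)-(3))."""
    def __init__(self, d_q: int = 300, d_h: int = 512, num_layers: int = 3):
        super().__init__()
        # Each direction outputs d_h // 2 so the concatenated state has size d_h.
        self.bilstm = nn.LSTM(d_q, d_h // 2, num_layers=num_layers,
                              batch_first=True, bidirectional=True)

    def forward(self, word_embs: torch.Tensor) -> torch.Tensor:
        # word_embs: (batch, l_Q, d_q) GloVe embeddings of the query words.
        q_h, _ = self.bilstm(word_embs)           # (batch, l_Q, d_h)
        return q_h                                # contextual query feature Q^h

# Toy usage: a batch of 2 queries, 10 words each, 300-d embeddings.
encoder = QueryEncoder()
print(encoder(torch.randn(2, 10, 300)).shape)     # torch.Size([2, 10, 512])
```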
Due to the nature of video data, it is essential to analyze spatial and temporal information simultaneously. Fortunately, researchers have proposed several 3D convolutional neural networks (3D-CNNs) [34,35] to solve this problem. We split the sequence of input video frames into small clips of the same length $T$ and then perform uniform downsampling with fixed intervals to obtain $N$ clips. From every sampled clip, we extract the video feature, denoted as $V \in \mathbb{R}^{N \times d_v}$, with a pre-trained 3D-CNN. Then, a bidirectional LSTM is utilized to learn the local context information within the video, in the same way as for the text query. In addition, the Transformer [36,37] has been proven to have a strong ability to capture long-distance dependencies in previous work, so we adopt it to obtain more informative video representations with a global perspective. As a result, we derive the contextual representation of the video, denoted as $V^h \in \mathbb{R}^{N \times d_h}$, where $d_h$ represents the dimension of the feature.

3.3. Cross-Modal Feature Fusion

Through the previous parts, we have gained the representations of videos and text queries. However, significant modality heterogeneity persists, necessitating effective feature fusion and precise semantic alignment between these distinct representations. In this paper, we use a Context–Query Attention (CQA) module to achieve feature interaction [38,39]. Originally, the CQA module was widely used for QA (Question Answering) tasks, where C and Q stand for the context and query [40]. Specifically, we first calculate a correlation matrix $M \in \mathbb{R}^{N \times l_Q}$ between the video and text, where each entry is computed as:
$$f(v_i, q_j) = w_0\,[v_i; q_j; v_i \odot q_j], \tag{4}$$
where $\odot$ is the Hadamard product and $w_0$ is a trainable vector for linear transformation. In addition, $v_i$ and $q_j$ denote the feature vectors of the $i$-th clip in the video and the $j$-th word in the text, respectively. Therefore, $f(v_i, q_j)$ represents the similarity between $v_i$ and $q_j$, which corresponds to the value in the $i$-th row and $j$-th column of the matrix $M$.
Then, we can further obtain the text-aware video feature $A^{v2q}$ and the self-attended video feature $A^{v2v}$ through the attention mechanism, calculated as:
$$A^{v2q} = M_r \cdot Q^h, \tag{5}$$
$$A^{v2v} = M_r \cdot M_c^{\top} \cdot V^h, \tag{6}$$
where $M_r$ and $M_c$ are computed by row-wise and column-wise softmax operations on the similarity matrix $M$. Subsequently, we obtain the ultimate cross-modal representation of video and text, denoted as $V^q \in \mathbb{R}^{N \times d}$, which is calculated as follows:
$$V^q = \mathrm{MLP}([\,V^h; A^{v2q}; V^h \odot A^{v2q}; V^h \odot A^{v2v}\,]), \tag{7}$$
where $\mathrm{MLP}$ is a Multi-Layer Perceptron, which linearly transforms the dimension of the fused cross-modal feature to $d$.
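As a rough guide to Equations (4)–(7), the sketch below implements a Context–Query Attention style fusion, assuming the video and query features have already been projected to the same hidden dimension; the layer names and the exact MLP are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CQAFusion(nn.Module):
    """Context-Query Attention fusion between video clips and query words."""
    def __init__(self, d_h: int = 512, d_out: int = 512):
        super().__init__()
        self.w0 = nn.Linear(3 * d_h, 1, bias=False)   # trainable vector w_0 in Eq. (4)
        self.mlp = nn.Sequential(nn.Linear(4 * d_h, d_out), nn.ReLU())

    def forward(self, v_h, q_h):
        # v_h: (B, N, d_h) video clip features; q_h: (B, l_Q, d_h) query word features.
        B, N, d = v_h.shape
        l_q = q_h.size(1)
        v_exp = v_h.unsqueeze(2).expand(B, N, l_q, d)
        q_exp = q_h.unsqueeze(1).expand(B, N, l_q, d)
        # Similarity matrix M: f(v_i, q_j) = w0 [v_i; q_j; v_i * q_j]  (Eq. (4))
        m = self.w0(torch.cat([v_exp, q_exp, v_exp * q_exp], dim=-1)).squeeze(-1)
        m_r = F.softmax(m, dim=-1)                    # row-wise softmax
        m_c = F.softmax(m, dim=1)                     # column-wise softmax
        a_v2q = m_r @ q_h                             # text-aware video feature, Eq. (5)
        a_v2v = m_r @ m_c.transpose(1, 2) @ v_h       # self-attended video feature, Eq. (6)
        fused = torch.cat([v_h, a_v2q, v_h * a_v2q, v_h * a_v2v], dim=-1)
        return self.mlp(fused)                        # V^q: (B, N, d_out), Eq. (7)

fusion = CQAFusion()
print(fusion(torch.randn(2, 16, 512), torch.randn(2, 10, 512)).shape)  # (2, 16, 512)
```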

3.4. Boundary Refinement Network

In this part, we first describe the concrete construction process of the 2D temporal map in detail, which is leveraged to obtain a mass of moment candidates with extremely fine-granularity time spans. Then, to reduce the heavy computation burden, we leverage a multi-step sampling strategy to drop some redundant moment candidates. In the end, we further introduce a boundary refinement network based on such a temporal map to fine-tune the boundaries of the moment candidates further.
Through the previous subsection, we have obtained the fused cross-modal feature $V^q$; we now elaborate on the construction of the temporal feature map. The feature of a moment proposal spanning clips $a$ to $b$ is represented as $f_{ab} = \mathrm{maxpool}(V^q_a, V^q_{a+1}, \dots, V^q_b)$, where $V^q_a$ indicates the fused feature of the $a$-th clip in the video and $0 \le a \le b < N$. Therefore, we can obtain diverse moment candidates with many different time spans. In this manner, we construct a whole 2D temporal feature map, where each grid corresponds to a video moment proposal whose feature is obtained by consecutive max-pooling. Formally, this temporal feature map is denoted as $F \in \mathbb{R}^{N \times N \times d}$. Furthermore, it is worth noting that the start clip index of each moment candidate in the feature map must not be larger than the end clip index, so moment candidates below the diagonal of the temporal feature map are invalid. In our implementation, we pad these positions with zero vectors.
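A minimal sketch of the temporal feature map construction described above, assuming the fused clip features $V^q$ are already available; the incremental max-pooling loop is one straightforward realization, not necessarily the authors' exact implementation.

```python
import torch

def build_temporal_map(v_q: torch.Tensor) -> torch.Tensor:
    """v_q: (N, d) fused clip features -> F: (N, N, d) temporal feature map.

    Grid (i, j) holds maxpool(v_q[i], ..., v_q[j]) for i <= j; invalid grids
    (j < i, below the diagonal) are left as zero vectors.
    """
    n, d = v_q.shape
    fmap = torch.zeros(n, n, d)
    for i in range(n):
        running = v_q[i]
        for j in range(i, n):
            running = torch.maximum(running, v_q[j])  # consecutive max-pooling
            fmap[i, j] = running
    return fmap

fmap = build_temporal_map(torch.randn(16, 512))
print(fmap.shape)   # torch.Size([16, 16, 512])
```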
However, the total number of moment candidates in such a temporal feature map is still substantial, which brings a heavy computational burden. To alleviate this, we use a multi-step sampling strategy to remove redundant moment candidates: we sample densely when a moment candidate is short, and the sampling probability decreases as the moment proposal becomes longer. Specifically, when the size of the temporal map satisfies $N \le 16$, we sample all moment candidates in the map. For simplicity, we take $N = 64$ as an example to illustrate the proposed strategy; a sketch of the resulting sampling mask follows this paragraph. As shown in Figure 4, we divide all moment candidates into four regions according to their length, where $i$ and $j$ denote the start and end clip indices of a moment candidate. When $0 \le j - i \le 15$, i.e., the lengths of moment candidates range from 1 to 16 clips, we fetch all moment proposals. If $16 \le j - i \le 31$, i.e., the time span ranges from 17 to 32 clips, we take moment proposals with a two-clip stride over the candidate lengths; in detail, we take proposals whose lengths belong to $L \in \{18, 20, \dots, 32\}$. Similarly, when the candidate length is from 33 to 48 clips (i.e., $32 \le j - i \le 47$), we take moment proposals with a three-clip stride, corresponding to the length set $L \in \{35, 38, \dots, 47\}$. Finally, when the candidate length is from 49 to 64 clips (i.e., $48 \le j - i \le 63$), we take moment proposals with a four-clip stride, corresponding to the length set $L \in \{52, 56, 60, 64\}$. It is worth explaining that we adopt such a strategy because, in most cases, the video segment to be located occupies less than one-quarter of the whole video, and the longer the target segment, the lower its probability. Therefore, the longer the duration of moment candidates, the sparser the sampling. Furthermore, the above rule can be extended to the case of $N > 64$.
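The sampling rule can be summarized as a binary mask over the $N \times N$ map. The sketch below, for $N = 64$, directly encodes the strides and length sets listed above; it is our reading of the strategy rather than the released code.

```python
import torch

def sampling_mask(n: int = 64) -> torch.Tensor:
    """Binary mask over the N x N temporal map marking sampled candidates."""
    keep_lengths = set()
    keep_lengths.update(range(1, 17))        # lengths 1-16: keep all candidates
    keep_lengths.update(range(18, 33, 2))    # lengths 17-32: stride 2 -> {18, 20, ..., 32}
    keep_lengths.update(range(35, 48, 3))    # lengths 33-48: stride 3 -> {35, 38, ..., 47}
    keep_lengths.update(range(52, 65, 4))    # lengths 49-64: stride 4 -> {52, 56, 60, 64}

    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        for j in range(i, n):
            if (j - i + 1) in keep_lengths:  # candidate length in clips
                mask[i, j] = True
    return mask

mask = sampling_mask(64)
print(int(mask.sum()))   # number of valid sampled candidates
```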
Most of the previous proposal-based methods predominantly relied on ingenious strategies to generate high-quality moment candidates, then predicted their confidence scores and directly selected the moment proposal with the highest score as the final localization result. However, as is depicted in Figure 5, although these moment candidates are segmented so exquisitely through the temporal map in this paper, there are still some deviations compared with the ground truth. Motivated by this, in addition to the usual confidence score prediction network, we innovatively introduce a boundary refinement network based on the constructed temporal feature map to continuously fine-tune the boundaries of generated moment candidates, further improving localization performance.
As we know, stacked convolutions gradually expand the receptive field and continuously fuse adjacent context information [6,41]. Inspired by this, the boundary refinement network comprises $L$ stacked convolutional layers: the first $L - 1$ layers continuously fuse context information, and the last convolutional layer, followed by an activation function, is used for scoring and boundary refinement. To guarantee that the shape of the output equals that of the input, we apply proper padding before each convolution. In addition, it is worth noting that moment candidates padded with zeros in the temporal feature map are excluded from the computation since they are invalid.
Finally, we obtain a confidence score map and an offset map based on the temporal feature map. Here, the offsets predicted by the boundary refinement network refer to the gaps between the two boundaries of each moment candidate in the temporal map and the ground truth. We then collect the confidence scores and offsets of all valid moment candidates in the temporal map, denoted as $\hat{P} = \{\hat{p}_i\}_{i=0}^{C}$ and $\hat{O} = \{(\hat{o}_i^s, \hat{o}_i^e)\}_{i=0}^{C}$, respectively, where $C$ stands for the number of valid moment candidates. The candidate with the highest confidence score is regarded as the correct moment candidate, and the final prediction is the sum of the original timestamps of this candidate and its corresponding predicted offsets.
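The sketch below outlines such a refinement head on the temporal feature map: the first $L-1$ convolutions aggregate context, and a final convolution outputs one confidence score and two boundary offsets per grid. The choice of sigmoid for the score and the masking of invalid grids are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RefinementHead(nn.Module):
    """Predicts a confidence-score map and a boundary-offset map from F: (B, N, N, d)."""
    def __init__(self, d: int = 512, num_layers: int = 4, kernel: int = 9):
        super().__init__()
        pad = kernel // 2                               # keep output shape equal to input
        layers = []
        for _ in range(num_layers - 1):                 # first L-1 layers: context aggregation
            layers += [nn.Conv2d(d, d, kernel, padding=pad), nn.ReLU()]
        self.backbone = nn.Sequential(*layers)
        self.out = nn.Conv2d(d, 3, kernel, padding=pad)  # 1 score + 2 offsets per grid

    def forward(self, fmap, valid_mask):
        # fmap: (B, N, N, d); valid_mask: (N, N) bool, True for sampled candidates.
        x = self.backbone(fmap.permute(0, 3, 1, 2))      # (B, d, N, N)
        out = self.out(x)                                # (B, 3, N, N)
        scores = torch.sigmoid(out[:, 0]) * valid_mask   # confidence-score map
        offsets = out[:, 1:] * valid_mask                # (start, end) offset map
        return scores, offsets

head = RefinementHead(d=64, num_layers=4, kernel=9)
scores, offsets = head(torch.randn(1, 16, 16, 64), torch.ones(16, 16, dtype=torch.bool))
print(scores.shape, offsets.shape)   # (1, 16, 16) (1, 2, 16, 16)
```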

3.5. Loss Function

As shown in Figure 1, the start and end boundaries of each moment candidate in the temporal feature map can be expressed as $(i\delta, (j+1)\delta)$, where $i$ and $j$ stand for the start and end clip indices in the temporal map, respectively. Moreover, $\delta$ is the unit of time in the temporal map, i.e., the length of each grid converted to absolute time, calculated as $\delta = D / N$, where $D$ stands for the duration of the entire video. Based on the above, we can calculate the IoU (Intersection over Union) between each moment candidate and the ground truth in the temporal map. As shown in Figure 6, the IoU is computed by the following formula:
$$\mathrm{IoU} = \frac{\text{Area of overlap}}{\text{Area of union}} = \frac{\min(\hat{t}_e, t_e) - \max(\hat{t}_s, t_s)}{\max(\hat{t}_e, t_e) - \min(\hat{t}_s, t_s)}, \tag{8}$$
where $\hat{t}_s$ and $\hat{t}_e$ represent the start and end times of the video segment predicted by the boundary refinement network, while $t_s$ and $t_e$ indicate the start and end times of the ground truth.
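For reference, Equation (8) translates directly into code; the clamp to zero for non-overlapping segments is a standard convention we add here, since the raw formula can otherwise become negative.

```python
def temporal_iou(pred_s: float, pred_e: float, gt_s: float, gt_e: float) -> float:
    """Temporal IoU between a predicted segment and the ground truth (Eq. (8))."""
    overlap = max(0.0, min(pred_e, gt_e) - max(pred_s, gt_s))
    union = max(pred_e, gt_e) - min(pred_s, gt_s)
    return overlap / union if union > 0 else 0.0

print(temporal_iou(10.0, 25.0, 12.0, 30.0))  # 0.65
```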

3.5.1. Alignment Loss

Based on this, the IoU of the $i$-th moment candidate in the temporal map can be readily calculated and denoted as $p_i$. Meanwhile, we can also obtain the offset labels between every moment candidate in the temporal map and the ground truth, denoted as $o_i = (o_i^s, o_i^e)$. When training the TMBRN, we utilize a scaled IoU as the label instead of the traditional hard binary label [25], calculated as follows:
$$y_i = \begin{cases} 0, & p_i \le t_{min}, \\[2pt] \dfrac{p_i - t_{min}}{t_{max} - t_{min}}, & t_{min} < p_i < t_{max}, \\[2pt] 1, & p_i \ge t_{max}, \end{cases} \tag{9}$$
where $t_{min}$ and $t_{max}$ are two thresholds used to scale the IoU label. Consequently, $y_i$ serves as a supervised signal in the model training phase.
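Equation (9) is a clipped linear rescaling of the raw IoU; a minimal sketch with the thresholds reported in Section 4.3 ($t_{min} = 0.5$, $t_{max} = 1.0$):

```python
def scaled_iou(p: float, t_min: float = 0.5, t_max: float = 1.0) -> float:
    """Scaled IoU label y_i of Equation (9)."""
    if p <= t_min:
        return 0.0
    if p >= t_max:
        return 1.0
    return (p - t_min) / (t_max - t_min)

print(scaled_iou(0.75))  # 0.5
```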
To bridge the semantic gap between the text and video, we introduce an alignment loss function based on binary cross-entropy loss, which can be expressed by the following formula:
$$\mathcal{L}_{align} = -\frac{1}{C}\sum_{i=1}^{C}\big[\, y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\,\big], \tag{10}$$
where $C$ stands for the number of valid sampled moment candidates in the temporal map and $\hat{p}_i$ represents the predicted confidence score of the $i$-th moment proposal.
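In PyTorch, Equation (10) reduces to the built-in binary cross-entropy with the scaled IoU as a soft label; the handling of the validity mask below is an assumption about how invalid grids are excluded.

```python
import torch
import torch.nn.functional as F

def alignment_loss(pred_scores, y_labels, valid_mask):
    """Eq. (10): BCE between predicted confidence scores and scaled-IoU labels.

    pred_scores, y_labels: (B, N, N); valid_mask: (N, N) bool for sampled candidates.
    """
    p = pred_scores[:, valid_mask]   # keep only valid candidates -> (B, C)
    y = y_labels[:, valid_mask]
    return F.binary_cross_entropy(p, y)

scores = torch.rand(2, 16, 16)
labels = torch.rand(2, 16, 16)
mask = torch.rand(16, 16) > 0.5
print(alignment_loss(scores, labels, mask))
```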

3.5.2. Weighted Refinement Loss

In addition to the alignment loss, it is essential to incorporate a loss on the offsets predicted by the boundary refinement network. Because the boundaries of the moment candidates generated by the temporal map are fixed, the model's performance would be limited if we trained the TMBRN with the alignment loss alone. Motivated by this, we introduce a refinement loss to train the model jointly. Because many moment candidates deviate significantly from the target video segment (i.e., their IoUs with the ground truth are very small), it is unprofitable to refine these proposals. To handle this dilemma, we devise a weighted refinement loss function. Specifically, we inject the IoUs between the original boundaries of moment candidates and the ground truth into the plain refinement loss as a weighting factor. In this way, we can guide the network to pay more attention to fine-tuning those moment candidates closer to the ground truth and thereby improve the localization performance of the model. Formally, the weighted refinement loss function is expressed as:
$$\mathcal{L}_{wref} = \frac{1}{C}\sum_{i=0}^{C} y_i \big[\, R(o_i^s - \hat{o}_i^s) + R(o_i^e - \hat{o}_i^e)\,\big], \tag{11}$$
where $R(\cdot)$ is the smooth $L_1$ function. It is worth noting that we adopt the scaled IoU $y_i$ as the weighting factor here. Furthermore, this also reduces the training computation cost to some extent, because moment candidates that have no overlap with the ground truth are assigned a weight of 0.
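A sketch of Equation (11), using the scaled IoU as the per-candidate weight; smooth_l1_loss with reduction="none" is used so that each candidate can be weighted individually before averaging.

```python
import torch
import torch.nn.functional as F

def weighted_refinement_loss(pred_offsets, gt_offsets, y_weights):
    """Eq. (11): scaled-IoU-weighted smooth-L1 loss over boundary offsets.

    pred_offsets, gt_offsets: (C, 2) predicted / ground-truth (start, end) offsets
    of the C valid candidates; y_weights: (C,) scaled IoU of each candidate.
    """
    per_bound = F.smooth_l1_loss(pred_offsets, gt_offsets, reduction="none")  # (C, 2)
    per_cand = per_bound.sum(dim=-1)          # R(o^s - o_hat^s) + R(o^e - o_hat^e)
    return (y_weights * per_cand).mean()      # candidates with y = 0 contribute nothing

loss = weighted_refinement_loss(torch.randn(100, 2), torch.randn(100, 2), torch.rand(100))
print(loss)
```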
Combined with that mentioned above, the total loss is expressed as:
$$\mathcal{L} = \mathcal{L}_{align} + \lambda \times \mathcal{L}_{wref}, \tag{12}$$
where $\lambda$ is a hyperparameter used to balance the two loss functions.
During inference, we sort all valid moment candidates by their predicted confidence scores, and the one with the highest score is taken as the prediction. It is worth mentioning that the final result is the sum of the original boundaries of the selected moment candidate (converted to absolute time) in the temporal map and its corresponding refined offsets. In addition, we use NMS (Non-Maximum Suppression) to drop redundant moment candidates.
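Putting these pieces together, inference adds the predicted offsets to the candidate boundaries, sorts by confidence, and applies temporal NMS. The sketch below assumes flattened per-candidate tensors; the default NMS threshold here corresponds to the ActivityNet Captions setting in Section 4.3.

```python
import torch

def temporal_iou(a, b):
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return overlap / union if union > 0 else 0.0

def predict(bounds, offsets, scores, nms_thresh=0.5, top_k=5):
    """bounds, offsets: (C, 2) candidate (start, end) times and predicted offsets;
    scores: (C,) confidence scores. Returns up to top_k refined segments."""
    refined = bounds + offsets                         # refined boundaries
    order = torch.argsort(scores, descending=True)     # sort by confidence
    kept = []
    for idx in order.tolist():
        seg = refined[idx].tolist()
        # temporal NMS: drop candidates overlapping a kept segment too much
        if all(temporal_iou(seg, k) < nms_thresh for k in kept):
            kept.append(seg)
        if len(kept) == top_k:
            break
    return kept

print(predict(torch.tensor([[0., 10.], [2., 12.], [30., 40.]]),
              torch.zeros(3, 2), torch.tensor([0.9, 0.8, 0.7])))
```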

4. Experiments

We conduct extensive experiments on two popular and challenging datasets, i.e., Charades-STA [42] and ActivityNet Captions [43], to evaluate the effectiveness of our proposed TMBRN model. In this section, we first briefly introduce these two datasets, the evaluation metrics, and the relevant implementation details. Then, we compare the experimental results of the proposed model with state-of-the-art methods from recent years and analyze them in depth. Finally, we conduct several ablation experiments to quantify the impact of various components of the TMBRN on performance and present visualized video moment localization results.

4.1. Datasets

Charades-STA: It is built on the Charades dataset [44], which was originally used for the temporal action localization task. Charades contains 9848 videos of daily indoor activities, with an average duration of about 30 s, covering 15 activity scenes and 157 action categories. On this basis, Gao et al. split the videos in Charades into several small segments according to the initial annotations and constructed the Charades-STA dataset by annotating each segment with a start time, an end time, and a sentence. Following previous work [42], we use 12,408 video–text pairs as the training set and 3720 pairs as the test set in our experiments.

ActivityNet Captions: It is built on the ActivityNet v1.3 dataset [45], which contains 19,994 untrimmed YouTube videos. This dataset includes approximately 100 K descriptive sentences, and the average video duration is about 120 s. Each video contains at least two annotations: the timestamp label (i.e., the start and end time) and a hand-written sentence description. According to the official split, we use 37,421 moment–text pairs for the training set, 17,505 pairs for the validation set, and 17,031 pairs for the test set.

4.2. Evaluation Metrics

Similar to previous work, we utilize R$n$@$m$ as the key metric to evaluate the experimental results of our proposed approach. Here, $m$ is an IoU threshold: if the IoU between a predicted result and the ground truth is greater than $m$, we regard the result as a correct prediction. $n$ refers to the top-$n$ results sorted by confidence score. Therefore, R$n$@$m$ is the percentage of samples having at least one correct prediction among the top-$n$ results. For the two datasets, we adopt $n \in \{1, 5\}$ with $m \in \{0.5, 0.7\}$ for Charades-STA, and $n \in \{1, 5\}$ with $m \in \{0.3, 0.5, 0.7\}$ for ActivityNet Captions.
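A small sketch of how R$n$@$m$ can be computed from per-sample prediction lists under the reading of the metric given above; the function and variable names are illustrative.

```python
def recall_at_n_iou(all_preds, all_gts, n=1, m=0.5):
    """R n@m: fraction of samples whose top-n predictions contain at least one
    segment with IoU > m against the ground truth.

    all_preds: list of per-sample lists of (start, end), sorted by confidence.
    all_gts:   list of (start, end) ground-truth segments.
    """
    def iou(a, b):
        overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return overlap / union if union > 0 else 0.0

    hits = sum(
        any(iou(p, gt) > m for p in preds[:n])
        for preds, gt in zip(all_preds, all_gts)
    )
    return hits / len(all_gts)

preds = [[(0, 10), (5, 15)], [(20, 30)]]
gts = [(1, 9), (40, 50)]
print(recall_at_n_iou(preds, gts, n=1, m=0.5))  # 0.5
```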

4.3. Implementation Details

To make a fair comparison with previous work, we maintain identical experimental setups. When extracting the embedding of each word in the sentences, we use the pre-trained GloVe-300 word embeddings [32]; hence, the length of all word vectors is 300, i.e., $d_q = 300$. We statistically analyzed the lengths (numbers of words) of all sentence queries in each dataset and, for convenient implementation, set all sentences to a uniform length $L_Q$: when a sentence is longer than $L_Q$, we truncate the excess, whereas shorter sentences are padded with zero vectors. We set $L_Q$ to 10 for Charades-STA and 25 for ActivityNet Captions. For video features, we utilize pre-trained 3D ConvNets; here, we adopt the C3D [34] and VGG [46] features for Charades-STA and the C3D feature for ActivityNet Captions. The clip length $T$ is set to 16 frames, and the number of downsampled clips $N$ is set to 16 for Charades-STA and 64 for ActivityNet Captions. In the Transformer module, which captures long-distance dependencies in the video, the number of attention heads is set to 8. The dimension $d_h$ of all hidden states is uniformly set to 512. In the boundary refinement network, we stack $L$ convolutional layers with kernel size $K$, setting $L = 8$ and $K = 5$ for Charades-STA and $L = 4$ and $K = 9$ for ActivityNet Captions. The thresholds $t_{max}$ and $t_{min}$ are set to 1.0 and 0.5 for both datasets. The hyperparameter $\lambda$ used to balance the alignment and refinement losses is set to 10.0 for Charades-STA and 0.1 for ActivityNet Captions. Moreover, the NMS threshold is set to 0.4 for Charades-STA and 0.5 for ActivityNet Captions. The learning rate and batch size are set to $1 \times 10^{-4}$ and 32, respectively. We train the TMBRN for 100 epochs on both datasets. The PyTorch framework (torch 1.10.0) is used in our experiments, with Adam as the optimizer, and the model is trained on a computer equipped with two NVIDIA RTX 3090 GPUs.
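For convenience, the hyperparameters listed above can be gathered into one configuration per dataset; the dictionary below simply restates the values from this subsection (the key names are our own).

```python
# Hyperparameters from Section 4.3, grouped per dataset (key names are illustrative).
CONFIGS = {
    "charades_sta": {
        "sentence_len": 10, "clip_len": 16, "num_clips": 16,
        "d_q": 300, "d_h": 512, "heads": 8,
        "refine_layers": 8, "kernel_size": 5,
        "t_min": 0.5, "t_max": 1.0, "lambda": 10.0, "nms": 0.4,
        "lr": 1e-4, "batch_size": 32, "epochs": 100,
    },
    "activitynet_captions": {
        "sentence_len": 25, "clip_len": 16, "num_clips": 64,
        "d_q": 300, "d_h": 512, "heads": 8,
        "refine_layers": 4, "kernel_size": 9,
        "t_min": 0.5, "t_max": 1.0, "lambda": 0.1, "nms": 0.5,
        "lr": 1e-4, "batch_size": 32, "epochs": 100,
    },
}
print(CONFIGS["charades_sta"]["lambda"])  # 10.0
```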

4.4. Performance Comparison

We evaluate the proposed model through extensive experiments on two extremely challenging benchmarks: the Charades-STA and ActivityNet Captions datasets. To demonstrate the superiority of the TMBRN, we compared it with the state-of-the-art approaches in recent years, including:
  • Proposal-based methods: CTRL [42], MCN [47], ACRN [48], ROLE [49], TGN [22], MAC [50], QSPN [24], SAP [51], SCDM [52], 2D-TAN [53], CBP [54], BPNet [55], FVMR [25], TLL [56], MIM [57].
  • Proposal-free methods: DEBUG [27], ABLR [26], GDP [58], LGI [28], PMI [59], CPNet [29], PLPNet [60], TCFN [61], STCNet [62], FVC [63].
  • Reinforcement learning methods: RWM [30], SM-RL [64], TSP-PRL [31], TripNet [65].
The results on the Charades-STA dataset are presented in Table 1, and the results on the ActivityNet Captions dataset are shown in Table 2. The values of best performance in the table are highlighted in bold font. It is worth noting that we compare the experimental results on both C3D and VGG features in the Charades-STA dataset.
Table 1 shows the experimental results of the TMBRN on the Charades-STA dataset. As demonstrated in the table, our model achieves almost the best results with both C3D and VGG visual features. When using C3D features, the TMBRN improves over the second-best FVMR by 5.64% on the R5@0.5 metric. With VGG features, the TMBRN outperforms state-of-the-art methods on all metrics. In addition, compared to the other models, our proposed model shows a more significant gain on the R5 metrics than on the R1 metrics. This may be because the moment candidates provided by the temporal feature map in the TMBRN are more flexible than those of conventional proposal-based methods, yielding more reasonable moment candidates with different time spans.
Similarly, the experimental results of the TMBRN on the ActivityNet Captions dataset are presented in Table 2. The data clearly demonstrate that our proposed model achieves the best results on all metrics, confirming that the TMBRN predicts the scores of moment candidates more accurately and that the refined offsets are also precise. This also benefits from the weighting factor in the refinement loss: when training the TMBRN, it urges the network to concentrate more on fine-tuning those moment candidates closer to the ground truth, further improving the localization performance.

4.5. Ablation Studies

In this part, we conduct ablation experiments to analyze the influence of several components of the proposed model and parameter settings on localization precision, including the number of stacked convolutional layers and the convolution kernel size in the boundary refinement network, the upper and lower thresholds of the scaled IoU, the weighting factor of the refinement loss, the feature fusion module (CQA), and the hyperparameter $\lambda$ used to balance the alignment and refinement losses.
Figure 7 shows the impact of the number of convolutional layers and the convolution kernel size in the boundary refinement network on localization performance. When studying the number of convolutional layers, we first empirically fixed the kernel size at $K = 9$; the left chart shows that the best performance occurs at $L = 4$, and performance declines when more convolutional layers are added. This can be attributed to the potential loss of crucial original features when stacking excessive convolutional layers. Similarly, with the number of convolutional layers fixed at $L = 4$, the right subgraph shows that localization precision peaks at $K = 9$. When $K < 9$, a larger convolution kernel gives better performance, indicating that a small kernel cannot effectively aggregate the context information of adjacent moment candidates. When $K > 9$, performance begins to decline as the kernel size increases, possibly because an overly large kernel makes information fusion less smooth and degrades localization. Therefore, more convolutional layers and larger kernels do not necessarily yield better performance, and these results confirm that the values of $K$ and $L$ used in this paper are set reasonably.
Figure 8 shows the impact of different threshold settings of the scaled IoU. The horizontal axis represents the difference between the thresholds $t_{max}$ and $t_{min}$, while the three bars in each group correspond to $t_{min} = 0.3$, 0.4, and 0.5, respectively. As can be seen from the figure, the smaller the difference between the upper and lower thresholds, the lower the overall localization accuracy of the model. This may be because compressing all IoU values into a smaller interval loses the diversity of the original IoU values, which in turn weakens the training effect of the alignment loss. In addition, the larger the $t_{min}$, the better the model performance. We think the potential reason is that the scaled IoU directly regards moment proposals with small IoU values as invalid candidates, which reduces the complexity of the prediction.
Table 3 and Table 4 present the importance of the weighting factor of the refinement loss on the Charades-STA and ActivityNet Captions datasets. These two tables reveal that removing the weighting factor from the refinement loss (i.e., the first row) leads to a substantial degradation in model performance across all evaluation metrics. We attribute this to the fact that, with the scaled IoUs added as the weighting factor during training, the model can pay more attention to refining those moment proposals closer to the ground truth and down-weight candidates that are far from it, making training more efficient. In addition, to further verify this, we replace the scaled IoUs with the original IoU values as the weighting factor in (11) (i.e., the second row). We conclude that, although this variant performs worse than the complete TMBRN, it is still much better than the refinement loss without any weighting factor.
Table 5 and Table 6 demonstrate the importance of the scaled IoU on the Charades-STA and ActivityNet Captions datasets, respectively. As shown in these two tables, the higher the IoU threshold, the more significant the improvement brought by the scaled IoU. This is highly consistent with the goal of VML tasks, i.e., maximizing the overlap between the predicted segment and the ground truth.
Table 7 shows the results of the TMBRN with different multi-modal feature fusion strategies on the ActivityNet Captions dataset. We use the feature fusion scheme of CTRL and a cross-attention mechanism for these ablation experiments. It can be intuitively seen from the table that replacing the CQA module of the TMBRN with the fusion module of CTRL leads to the worst performance: because the feature fusion module of CTRL only comprises plain tensor operations, e.g., element-wise multiplication and concatenation, it fails to align multi-modal semantics well. The model's performance improves significantly when using the cross-attention mechanism to integrate features, but it is still worse than with the CQA module. Nevertheless, even without the CQA module, the TMBRN surpasses some recent methods, demonstrating its superiority and robustness.
Figure 9 shows the impact of the hyperparameter $\lambda$, which balances the alignment loss and the weighted refinement loss, on the performance of the TMBRN. Both a larger and a smaller $\lambda$ degrade the model's performance; across all metrics, the best performance is achieved at $\lambda = 10$ in this paper. Through this experiment, we find that the two terms of (12) are of the same order of magnitude only when $\lambda = 10$. This indicates that, in the actual training phase, an appropriate $\lambda$ is needed so that the model treats the two losses equally, which is beneficial for improving the localization precision of the model.

4.6. Visualization Results

Figure 10 presents some visualized results, which intuitively verify the reliability of the devised TMBRN. As shown in the top subfigure, for the given query "A person takes a book out from the entertainment center", the TMBRN generates a video segment whose IoU is about 91%. This example is from the Charades-STA dataset, in which video durations are relatively short. As a comparison, the bottom subfigure shows a visualized result on the ActivityNet Captions dataset, whose videos have an average duration of about 2 min. For the more complex query "The man takes out the car, and other men start fighting for the car, and the three are riding it", which contains three different actions, the TMBRN also outputs a video segment with about 90% localization precision. This further confirms that our proposed TMBRN model has strong stability and high reliability.

5. Conclusions

In this paper, we develop a Temporal Map-based Boundary Refinement Network (TMBRN) for video moment localization tasks. We obtain abundant moment candidates with extremely fine-grained time spans through a constructed 2D temporal map. Based on this temporal map, in addition to predicting confidence scores, we introduce a boundary refinement network to fine-tune the boundaries of moment candidates and obtain more precise localization results. Meanwhile, we adopt a multi-step sampling strategy to remove redundant moment candidates and reduce the heavy computational burden. Moreover, we design an innovative weighted refinement loss function that enables the model to focus more on refining the boundaries of those moment candidates closer to the ground truth, substantially improving the model's performance. Finally, extensive experiments on two publicly available datasets suggest that our proposed method is highly effective for VML tasks and outperforms state-of-the-art methods.
Meanwhile, our model also has some weaknesses; for example, for complex textual queries (containing multiple actions), it may not accurately model their semantics, resulting in poor multi-modal semantic alignment and inferior localization performance. Future work may consider developing more effective text encoders that obtain informative contextual features to enhance the language modeling capability.

Author Contributions

L.L. implemented the model and wrote the original draft. D.L. reviewed and edited the manuscript. C.Z. revised the paper. Y.Z. and H.R. analyzed the experimental results and prepared figures. L.Z. acquired funding. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 62472161, 61836016, and 62202163, and the Natural Science Foundation of Hunan Province under Grants 2023JJ30169 and 2022JJ40190.

Data Availability Statement

All datasets can be accessed through their respective official websites or mirrors. The Charades-STA dataset can be accessed at https://prior.allenai.org/projects/charades. The ActivityNet Captions dataset can be accessed at http://activity-net.org/challenges/2016/download.html.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tian, C.; Zheng, M.; Li, B.; Zhang, Y.; Zhang, S.; Zhang, D. Perceptive self-supervised learning network for noisy image watermark removal. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7069–7079. [Google Scholar] [CrossRef]
  2. Zhang, J.; Zhao, B.; Zhang, W.; Miao, Q. Knowledge Translator: Cross-Lingual Course Video Text Style Transform via Imposed Sequential Attention Networks. Electronics 2025, 14, 1213. [Google Scholar] [CrossRef]
  3. Tian, C.; Zheng, M.; Zuo, W.; Zhang, S.; Zhang, Y.; Lin, C.W. A cross transformer for image denoising. Inf. Fusion 2024, 102, 102043. [Google Scholar] [CrossRef]
  4. Liu, J.; Liu, W.; Han, K. MNv3-MFAE: A Lightweight Network for Video Action Recognition. Electronics 2025, 14, 981. [Google Scholar] [CrossRef]
  5. Zhang, H.; Li, Z.; Yang, J.; Wang, X.; Guo, C.; Feng, C. Revisiting Hard Negative Mining in Contrastive Learning for Visual Understanding. Electronics 2023, 12, 4884. [Google Scholar] [CrossRef]
  6. Tian, C.; Zhang, X.; Liang, X.; Li, B.; Sun, Y.; Zhang, S. Knowledge distillation with fast CNN for license plate detection. IEEE Trans. Intell. Veh. 2023, 1–7. [Google Scholar] [CrossRef]
  7. Tian, C.; Xiao, J.; Zhang, B.; Zuo, W.; Zhang, Y.; Lin, C.W. A self-supervised network for image denoising and watermark removal. Neural Netw. 2024, 174, 106218. [Google Scholar] [CrossRef] [PubMed]
  8. Matsumori, S.; Shingyouchi, K.; Abe, Y.; Fukuchi, Y.; Sugiura, K.; Imai, M. Unified questioner transformer for descriptive question generation in goal-oriented visual dialogue. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1898–1907. [Google Scholar]
  9. Song, J.; Guo, Y.; Gao, L.; Li, X.; Hanjalic, A.; Shen, H.T. From deterministic to generative: Multimodal stochastic RNNs for video captioning. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 3047–3058. [Google Scholar] [CrossRef] [PubMed]
  10. Fu, Y.; Yang, Y.; Ye, O. Video Captioning Method Based on Semantic Topic Association. Electronics 2025, 14, 905. [Google Scholar] [CrossRef]
  11. Wang, B.; Yang, Y.; Xu, X.; Hanjalic, A.; Shen, H.T. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 154–162. [Google Scholar]
  12. Xu, X.; Wang, T.; Yang, Y.; Zuo, L.; Shen, F.; Shen, H.T. Cross-modal attention with semantic consistence for image–text matching. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 5412–5425. [Google Scholar] [CrossRef] [PubMed]
  13. Shen, H.T.; Liu, L.; Yang, Y.; Xu, X.; Huang, Z.; Shen, F.; Hong, R. Exploiting subspace relation in semantic labels for cross-modal hashing. IEEE Trans. Knowl. Data Eng. 2020, 33, 3351–3365. [Google Scholar] [CrossRef]
  14. Wu, G.; Han, J.; Guo, Y.; Liu, L.; Ding, G.; Ni, Q.; Shao, L. Unsupervised deep video hashing via balanced code for large-scale video retrieval. IEEE Trans. Image Process. 2018, 28, 1993–2007. [Google Scholar] [CrossRef] [PubMed]
  15. Dong, J.; Li, X.; Xu, C.; Ji, S.; He, Y.; Yang, G.; Wang, X. Dual encoding for zero-example video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9346–9355. [Google Scholar]
  16. Cao, D.; Yu, Z.; Zhang, H.; Fang, J.; Nie, L.; Tian, Q. Video-based cross-modal recipe retrieval. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1685–1693. [Google Scholar]
  17. Chen, S.; Zhao, Y.; Jin, Q.; Wu, Q. Fine-grained video-text retrieval with hierarchical graph reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10638–10647. [Google Scholar]
  18. Yang, X.; Dong, J.; Cao, Y.; Wang, X.; Wang, M.; Chua, T.S. Tree-augmented cross-modal encoding for complex-query video retrieval. In Proceedings of the 43rd international ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 25–30 July 2020; pp. 1339–1348. [Google Scholar]
  19. Shou, Z.; Wang, D.; Chang, S.F. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1049–1058. [Google Scholar]
  20. Buch, S.; Escorcia, V.; Shen, C.; Ghanem, B.; Carlos Niebles, J. Sst: Single-stream temporal action proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2911–2920. [Google Scholar]
  21. Zeng, R.; Huang, W.; Tan, M.; Rong, Y.; Zhao, P.; Huang, J.; Gan, C. Graph convolutional networks for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7094–7103. [Google Scholar]
  22. Chen, J.; Chen, X.; Ma, L.; Jie, Z.; Chua, T.S. Temporally grounding natural sentence in video. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 162–171. [Google Scholar]
  23. Zhang, Z.; Lin, Z.; Zhao, Z.; Xiao, Z. Cross-modal interaction networks for query-based moment retrieval in videos. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 655–664. [Google Scholar]
  24. Xu, H.; He, K.; Plummer, B.A.; Sigal, L.; Sclaroff, S.; Saenko, K. Multilevel language and vision integration for text-to-clip retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9062–9069. [Google Scholar]
  25. Gao, J.; Xu, C. Fast video moment retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1523–1532. [Google Scholar]
  26. Yuan, Y.; Mei, T.; Zhu, W. To find where you talk: Temporal sentence localization in video with attention based location regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9159–9166. [Google Scholar]
  27. Lu, C.; Chen, L.; Tan, C.; Li, X.; Xiao, J. Debug: A dense bottom-up grounding approach for natural language video localization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5144–5153. [Google Scholar]
  28. Mun, J.; Cho, M.; Han, B. Local-global video-text interactions for temporal grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10810–10819. [Google Scholar]
  29. Li, K.; Guo, D.; Wang, M. Proposal-free video grounding with contextual pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 1902–1910. [Google Scholar]
  30. He, D.; Zhao, X.; Huang, J.; Li, F.; Liu, X.; Wen, S. Read, watch, and move: Reinforcement learning for temporally grounding natural language descriptions in videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8393–8400. [Google Scholar]
  31. Wu, J.; Li, G.; Liu, S.; Lin, L. Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12386–12393. [Google Scholar]
  32. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  33. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  34. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  35. Carreira, J.; Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  37. Tian, C.; Zheng, M.; Lin, C.W.; Li, Z.; Zhang, D. Heterogeneous window transformer for image denoising. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 6621–6632. [Google Scholar] [CrossRef]
  38. Xiong, C.; Zhong, V.; Socher, R. Dynamic Coattention Networks For Question Answering. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  39. Yu, A.W.; Dohan, D.; Le, Q.; Luong, T.; Zhao, R.; Chen, K. Fast and accurate reading comprehension by combining self-attention and convolution. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; Volume 2. [Google Scholar]
  40. Seo, M.; Kembhavi, A.; Farhadi, A.; Hajishirzi, H. Bidirectional Attention Flow for Machine Comprehension. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  41. Tian, C.; Zheng, M.; Jiao, T.; Zuo, W.; Zhang, Y.; Lin, C.W. A self-supervised CNN for image watermark removal. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7566–7576. [Google Scholar] [CrossRef]
  42. Gao, J.; Sun, C.; Yang, Z.; Nevatia, R. Tall: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5267–5275. [Google Scholar]
  43. Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; Carlos Niebles, J. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 706–715. [Google Scholar]
  44. Sigurdsson, G.A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; Gupta, A. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 510–526. [Google Scholar]
  45. Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 961–970. [Google Scholar]
  46. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–14. [Google Scholar]
  47. Anne Hendricks, L.; Wang, O.; Shechtman, E.; Sivic, J.; Darrell, T.; Russell, B. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5803–5812. [Google Scholar]
48. Liu, M.; Wang, X.; Nie, L.; He, X.; Chen, B.; Chua, T.S. Attentive moment retrieval in videos. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 15–24. [Google Scholar]
49. Liu, M.; Wang, X.; Nie, L.; Tian, Q.; Chen, B.; Chua, T.S. Cross-modal moment localization in videos. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 843–851. [Google Scholar]
  50. Ge, R.; Gao, J.; Chen, K.; Nevatia, R. Mac: Mining activity concepts for language-based temporal localization. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 245–253. [Google Scholar]
  51. Chen, S.; Jiang, Y.G. Semantic proposal for activity localization in videos via sentence query. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8199–8206. [Google Scholar]
  52. Yuan, Y.; Ma, L.; Wang, J.; Liu, W.; Zhu, W. Semantic conditioned dynamic modulation for temporal sentence grounding in videos. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar] [CrossRef] [PubMed]
  53. Zhang, S.; Peng, H.; Fu, J.; Luo, J. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12870–12877. [Google Scholar]
  54. Wang, J.; Ma, L.; Jiang, W. Temporally grounding language queries in videos by contextual boundary-aware prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12168–12175. [Google Scholar]
  55. Xiao, S.; Chen, L.; Zhang, S.; Ji, W.; Shao, J.; Ye, L.; Xiao, J. Boundary proposal network for two-stage natural language video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 19–21 May 2021; Volume 35, pp. 2986–2994. [Google Scholar]
  56. Bao, P.; Mu, Y. Learning Sample Importance for Cross-Scenario Video Temporal Grounding. In Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 322–329. [Google Scholar]
  57. Li, J.; Zhang, F.; Lin, S.; Zhou, F.; Wang, R. Mim: Lightweight multi-modal interaction model for joint video moment retrieval and highlight detection. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1961–1966. [Google Scholar]
  58. Chen, L.; Lu, C.; Tang, S.; Xiao, J.; Zhang, D.; Tan, C.; Li, X. Rethinking the bottom-up framework for query-based video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10551–10558. [Google Scholar]
  59. Chen, S.; Jiang, W.; Liu, W.; Jiang, Y.G. Learning modality interaction for temporal sentence localization and event captioning in videos. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 333–351. [Google Scholar]
  60. Li, S.; Li, C.; Zheng, M.; Liu, Y. Phrase-level Prediction for Video Temporal Localization. In Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 360–368. [Google Scholar]
  61. Lv, Z.; Su, B. Temporal-enhanced Cross-modality Fusion Network for Video Sentence Grounding. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1487–1492. [Google Scholar]
  62. Wang, Y.; Li, K.; Chen, G.; Zhang, Y.; Guo, D.; Wang, M. Spatiotemporal contrastive modeling for video moment retrieval. World Wide Web 2023, 26, 1525–1544. [Google Scholar] [CrossRef]
  63. Shi, M.; Su, Y.; Lin, X.; Zao, B.; Kong, S.; Tan, M. Frame as Video Clip: Proposal-Free Moment Retrieval by Semantic Aligned Frames. IEEE Trans. Ind. Inform. 2024, 20, 13158–13168. [Google Scholar] [CrossRef]
  64. Wang, W.; Huang, Y.; Wang, L. Language-driven temporal activity localization: A semantic matching reinforcement learning model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 334–343. [Google Scholar]
  65. Hahn, M.; Kadav, A.; Rehg, J.M.; Graf, H.P. Tripping through time: Efficient localization of activities in videos. arXiv 2019, arXiv:1904.09936. [Google Scholar]
Figure 1. A diagram of VML tasks and the 2D temporal map.
Figure 2. A demonstration of the weighted refinement loss.
Figure 3. The architecture of the Temporal Map-based Boundary Refinement Network. The TMBRN primarily comprises three modules: feature extraction, feature fusion, and the boundary refinement network. The two blue dotted boxes of the figure represent the video and text encoding modules; the box filled with green is the cross-modal fusion module, and the red dotted box indicates the boundary refinement network based on the 2D temporal map.
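To make the three-stage layout named in the Figure 3 caption easier to picture, the following skeleton sketches how feature extraction, feature fusion, and the two heads over the 2D temporal map could be composed. This is only a structural sketch: the class name TMBRNSkeleton, the GRU/linear encoders, the pairwise max-pooling used to build the toy map, and all dimensions are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TMBRNSkeleton(nn.Module):
    """Structural sketch of the modules in Figure 3 (placeholder submodules)."""

    def __init__(self, video_dim=500, word_dim=300, hidden_dim=256):
        super().__init__()
        self.video_encoder = nn.Linear(video_dim, hidden_dim)            # clip features -> hidden
        self.text_encoder = nn.GRU(word_dim, hidden_dim, batch_first=True)
        self.fusion = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())
        # Heads over the 2D temporal map: one confidence score and two boundary offsets per cell.
        self.score_head = nn.Conv2d(hidden_dim, 1, kernel_size=1)
        self.refine_head = nn.Conv2d(hidden_dim, 2, kernel_size=1)

    def forward(self, clip_feats, word_feats):
        v = self.video_encoder(clip_feats)             # (B, T, H)
        _, q = self.text_encoder(word_feats)           # final query state, (1, B, H)
        q = q.squeeze(0).unsqueeze(1).expand_as(v)     # broadcast query over clips
        fused = self.fusion(torch.cat([v, q], dim=-1))         # (B, T, H)
        # Toy 2D temporal map via pairwise max-pooling of fused clip features.
        fmap = torch.maximum(fused.unsqueeze(1), fused.unsqueeze(2))  # (B, T, T, H)
        fmap = fmap.permute(0, 3, 1, 2)                               # (B, H, T, T)
        return self.score_head(fmap), self.refine_head(fmap)
```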
Figure 4. A demonstration of the multi-step sampling on the 2D temporal map. In the figure, the grids marked blue, orange, purple, and green represent moment proposals sampled with four different interval segments, respectively. The blue area denotes the moment candidates whose lengths range from 1 to 16 clips, where we sampled all moment proposals. The orange region represents the moment candidates whose lengths range from 17 to 32 clips, sampled with a two-clip stride on both the horizontal and vertical axes of the 2D coordinate system. The purple area stands for the moment candidates whose durations range from 33 to 48 clips, sampled with a three-clip stride. Similarly, the green region indicates moment candidates whose lengths range from 49 to 64 clips, sampled with a four-clip stride.
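The stride rule described in the Figure 4 caption can be summarized in a short sketch. The snippet below is one plausible reading of that rule rather than the paper's implementation; the function name sample_moment_candidates, the 64-clip map size, and the keep-if-on-the-stride-grid criterion are assumptions used only for demonstration.

```python
def sample_moment_candidates(num_clips: int = 64):
    """Return (start, end) cells kept by length-dependent strides (a sketch)."""
    candidates = []
    for start in range(num_clips):
        for end in range(start, num_clips):
            length = end - start + 1
            # Stride grows with candidate length: 1 for 1-16 clips, 2 for 17-32,
            # 3 for 33-48, and 4 for 49-64 clips, matching the four colored bands.
            stride = (length - 1) // 16 + 1
            # Keep the cell only if both coordinates lie on the stride grid.
            if start % stride == 0 and end % stride == 0:
                candidates.append((start, end))
    return candidates

if __name__ == "__main__":
    kept = sample_moment_candidates()
    print(f"{len(kept)} candidates kept out of {64 * 65 // 2} possible cells")
```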
Figure 5. The diagram of the boundary refinement network.
Figure 6. An illustration of Intersection over Union (IoU).
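For reference, the temporal IoU illustrated in Figure 6 is the standard interval-overlap measure, and the Rn@m numbers in the tables below count a query as a hit when at least one of the top-n candidates reaches IoU ≥ m. The short sketch below shows one way to compute it; the function name and example values are illustrative only.

```python
def temporal_iou(pred, gt):
    """IoU of two temporal segments given as (start, end) pairs in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: an 18 s candidate shifted two seconds from an 18 s ground truth.
print(temporal_iou((12.0, 30.0), (10.0, 28.0)))  # 0.8
```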
Figure 7. The ablation study of the number of stacked convolution layers and the convolution kernel size on the ActivityNet Captions dataset. The left chart shows the result of the R1@0.7 metric, and the right one represents the result of R5@0.7.
Figure 8. The ablation study of various values of $t_{min}$ and $t_{max}$ on the Charades-STA dataset at the R5@0.5 metric.
Figure 9. The ablation study of the hyperparameter λ on the Charades-STA dataset. From left to right, each subgraph represents the result of R1@0.5, R1@0.7, R5@0.5, and R5@0.7, respectively.
Figure 10. The top and bottom subfigures show the visualization results on the Charades-STA and ActivityNet Captions datasets, respectively.
Table 1. Performance comparison with other methods on the Charades-STA dataset.
Method Types | Methods | Feature | R1@0.5 | R1@0.7 | R5@0.5 | R5@0.7
Reinforcement learning | RWM | C3D | 34.12 | 13.74 | - | -
 | TripNet |  | 38.29 | 16.07 | - | -
 | TSP-PRL |  | 37.39 | 17.69 | - | -
Proposal-free | DEBUG | C3D | 37.39 | 17.69 | - | -
 | GDP |  | 39.47 | 18.49 | - | -
 | PMI |  | 39.73 | 19.27 | - | -
 | STCNet |  | 38.44 | 20.73 | - | -
 | FVC |  | 41.56 | 22.61 | - | -
Proposal-based | CTRL | C3D | 23.63 | 8.89 | 58.92 | 29.52
 | ACRN |  | 20.26 | 7.64 | 71.99 | 27.79
 | ROLE |  | 12.12 | - | 40.59 | -
 | MAC |  | 30.48 | 12.2 | 64.84 | 35.13
 | QSPN |  | 35.6 | 15.8 | 79.4 | 45.4
 | CBP |  | 36.8 | 18.87 | 70.94 | 50.19
 | BPNet |  | 38.25 | 20.51 | - | -
 | FVMR |  | 38.16 | 18.22 | 82.18 | 44.96
 | TMBRN |  | 42.33 | 23.28 | 87.82 | 52.2
Reinforcement learning | SM-RL | VGG | 24.36 | 11.17 | 61.25 | 32.08
Proposal-free | ABLR | VGG | 24.36 | 9.01 | - | -
 | PLPNet |  | 41.88 | 20.56 | - | -
Proposal-based | MCN | VGG | 17.46 | 8.01 | 48.22 | 26.73
 | SAP |  | 27.42 | 13.36 | 66.37 | 38.15
 | 2D-TAN |  | 39.70 | 23.31 | 80.32 | 51.26
 | MIM |  | 43.92 | 25.89 | 87.07 | 52.26
 | TMBRN |  | 45.14 | 27.03 | 88.49 | 53.31
Table 2. Performance comparison with other methods on the ActivityNet Captions dataset.
Method Types | Methods | Feature | R1@0.3 | R1@0.5 | R1@0.7 | R5@0.3 | R5@0.5 | R5@0.7
Reinforcement learning | RWM | C3D | - | 36.9 | - | - | - | -
 | TripNet |  | 48.42 | 32.19 | 13.93 | - | - | -
 | TSP-PRL |  | 56.08 | 38.76 | - | - | - | -
Proposal-free | DEBUG | C3D | 55.91 | 39.72 | - | - | - | -
 | ABLR |  | 55.67 | 36.79 | - | - | - | -
 | GDP |  | 56.17 | 39.27 | - | - | - | -
 | LGI |  | 58.52 | 41.51 | 23.07 | - | - | -
 | PMI |  | 59.69 | 38.28 | 17.83 | - | - | -
 | CPNet |  | - | 40.56 | 21.63 | - | - | -
 | PLPNet |  | 56.92 | 39.20 | 20.91 | - | - | -
 | TCFN |  | 56.81 | 40.58 | 24.73 | 67.87 | 55.51 | 39.42
 | STCNet |  | - | 40.15 | 21.85 | - | - | -
 | FVC |  | 63.59 | 46.32 | 26.01 | - | - | -
Proposal-based | CTRL | C3D | 47.43 | 29.01 | 10.34 | 75.32 | 59.17 | 37.54
 | MCN |  | 39.35 | 21.36 | 6.43 | 68.12 | 53.23 | 29.70
 | ACRN |  | 49.70 | 31.67 | 11.25 | 76.50 | 60.34 | 38.57
 | TGN |  | 43.81 | 27.93 | - | 54.56 | 44.20 | -
 | SCDM |  | 54.80 | 36.75 | 19.86 | 77.29 | 64.99 | 41.53
 | QSPN |  | 52.13 | 33.26 | 13.43 | 77.72 | 62.39 | 40.78
 | 2D-TAN |  | 59.45 | 44.51 | 26.54 | 85.53 | 77.13 | 61.96
 | CBP |  | 54.30 | 35.76 | 17.80 | 77.63 | 65.89 | 46.20
 | BPNet |  | 58.98 | 42.07 | 24.69 | - | - | -
 | TLL |  | - | 44.24 | 27.01 | - | 75.22 | 60.23
 | TMBRN |  | 64.51 | 47.02 | 28.26 | 85.89 | 77.22 | 62.32
Table 3. Results of ablation experiments of the weighted factor (WF) of the refinement loss on the Charades-STA dataset. The first row stands for the TMBRN without any weighted factors, and the second row indicates that the TMBRN regards the original IoU (OIoU) as the weighted factor.
Method | R1@0.5 | R1@0.7 | R5@0.5 | R5@0.7
TMBRN w/o WF | 32.12 | 15.46 | 38.68 | 18.92
TMBRN w/ WF (OIoU) | 36.34 | 19.11 | 82.50 | 44.33
TMBRN | 42.33 | 23.28 | 87.82 | 52.2
Table 4. Results of ablation experiments of the weighted factor (WF) of the refinement loss on the ActivityNet Captions dataset. The first row stands for the TMBRN without any weighted factors, and the second row indicates that the TMBRN regards the original IoU (OIoU) as the weighted factor.
Method | R1@0.3 | R1@0.5 | R1@0.7 | R5@0.3 | R5@0.5 | R5@0.7
TMBRN w/o WF | 52.60 | 35.54 | 17.63 | 65.07 | 49.64 | 29.39
TMBRN w/ WF (OIoU) | 54.68 | 40.11 | 23.21 | 82.90 | 72.17 | 53.36
TMBRN | 64.51 | 47.02 | 28.26 | 85.89 | 77.22 | 62.32
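To make the idea ablated in Tables 3 and 4 easier to follow, the sketch below shows one way a quality-dependent weight can modulate a boundary-regression loss: candidates with higher (scaled) IoU against the ground truth contribute more to the refinement objective, while low-quality candidates are largely down-weighted. The smooth-L1 form, the scaling thresholds t_min and t_max, and all tensor names are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def weighted_refinement_loss(pred_offsets, gt_offsets, ious,
                             t_min: float = 0.3, t_max: float = 0.7):
    """Boundary-refinement loss weighted by candidate quality (a sketch).

    pred_offsets, gt_offsets: (N, 2) predicted / target boundary offsets.
    ious: (N,) IoU of each candidate with the ground-truth moment.
    The IoU is linearly rescaled to [0, 1] between t_min and t_max, so
    low-quality candidates contribute little to the refinement gradient.
    """
    weights = ((ious - t_min) / (t_max - t_min)).clamp(0.0, 1.0)
    per_candidate = F.smooth_l1_loss(pred_offsets, gt_offsets,
                                     reduction="none").sum(dim=-1)
    return (weights * per_candidate).sum() / weights.sum().clamp(min=1e-6)
```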
Table 5. Results of ablation experiments of scaled IoU on the Charades-STA dataset.
Method | R1@0.5 | R1@0.7 | R5@0.5 | R5@0.7
TMBRN w/o scaled IoU | 34.70 | 17.39 | 75.22 | 40.51
TMBRN | 42.33 | 23.28 | 87.82 | 52.2
Table 6. Results of ablation experiments of scaled IoU on the ActivityNet Captions dataset.
Method | R1@0.3 | R1@0.5 | R1@0.7 | R5@0.3 | R5@0.5 | R5@0.7
TMBRN w/o scaled IoU | 53.83 | 38.58 | 20.51 | 78.18 | 67.24 | 49.95
TMBRN | 64.51 | 47.02 | 28.26 | 85.89 | 77.22 | 62.32
Table 7. Results of ablation experiments of the feature fusion module on the ActivityNet Captions dataset. The first row indicates that the feature fusion module of the TMBRN is replaced with that of CTRL [42], and the second row represents that it is replaced with the cross-attention (CA) module.
Method | R1@0.3 | R1@0.5 | R1@0.7 | R5@0.3 | R5@0.5 | R5@0.7
TMBRN w/ CTRL | 56.30 | 41.07 | 24.91 | 83.19 | 72.48 | 53.91
TMBRN w/ CA | 56.89 | 42.47 | 26.13 | 83.40 | 74.02 | 57.69
TMBRN | 64.51 | 47.02 | 28.26 | 85.89 | 77.22 | 62.32