1. Introduction
Temporal language grounding (TLG) is a multi-modal task that aims to locate events in untrimmed videos by matching a natural language query and outputting the start and end times of the corresponding video segment. Unlike temporal action localization, which retrieves moments from a finite set of predefined actions such as “run” or “drink”, TLG operates on open-domain natural language sentences. This requires models to deeply understand the semantic context in both the video and text modalities to achieve accurate grounding. TLG is particularly effective for interpreting mixed and complex events because it can process an essentially unlimited variety of sentence queries. For instance, handling a query like “A close-up of a glass door is shown, along with a person’s hand moving around in the camera and making various gestures” demonstrates TLG’s ability to address such complex scenarios, giving it broad applicability. This requires the model not simply to learn specific patterns but to understand the connection between natural language queries and video content in depth. Some works [
1,
2,
3] have introduced training under a fully supervised setting. However, this requires extensive human effort and time for temporal annotations. Therefore, some researchers [
4,
5] have turned to weakly supervised TLG, where training samples only provide “video–text” pairs without annotated starting and ending times for the matching segments. Previous works [
6,
7] have conducted cross-modal alignment by treating untrimmed videos as positive bags in multiple-instance learning. Recently, reconstruction-based approaches [
8,
9,
10] have been proposed for weakly supervised TLG; these regard the video segment that better reconstructs the query as more relevant to it. However, the aforementioned methods primarily focus on correlations between query phrases and isolated video segments, neglecting event-oriented semantic coherence and consistency across modalities. This oversight can cause models to be misled by similar irrelevant frames. Our research question is therefore how to suppress the spurious high correlations produced by similar but irrelevant frames.
In fact, when humans understand the events taking place in a video, they grasp how an event causes objects to change from their initial state to their final state [
11]. For example, as shown in
Figure 1, for the query “person pours a cup of coffee”, the viewer not only understands the action itself but also infers that its start state may be “a person picks up a coffee pot” and its end state would be “a person puts down the coffee pot”. Inspired by this insight, we hypothesize that only video segments where the start state, event action, and end state all align with the query should be considered valid matches, while confusing frames relate only to specific words in the query and fail to capture the complete state transitions. That is to say, a model needs to comprehend the “start–event–end” sequence of state changes to truly understand the essence of the event described in the query, rather than merely identifying superficial similarities between words and frame elements. This necessitates multi-modal features that ensure intra-modal semantic coherence and cross-modal semantic consistency throughout the state transition process.
To address this issue, we propose the Event-oriented State Alignment Network (ESAN) for improved cross-modal alignment and inference. First, we construct a cross-modal semantic state space. In this space, query events from the textual modality are transformed into a “start–event–end” state set. This state set needs to be semantically coherent, meaning that its start and end states are closely related to the event itself. For the video modality, due to the fact that temporal annotations are not available for hard alignment in weakly supervised TLG, we use the model’s predicted proposals as soft labels. We then project the video features within the proposed start and end times into the cross-modal semantic state space. We believe that the projection of video features for a correct proposal can meet semantic coherence, while incorrect proposals often do not directly relate to the start and end states of the event. In this manner, both the textual and video modalities have an event-oriented start, end, and change of state in the cross-modal semantic state space. Aligning them encourages cross-modal semantic consistency, enabling more precise temporal localization.
Specifically, we extract semantic information from static frames at regular intervals using a pre-trained vision–language model. However, the semantic information from static frames only reflects the semantics of the current frame and does not capture changes in semantic states effectively. Recently, large language models have demonstrated powerful language understanding and generalization capabilities, enabling the generation of high-quality natural language responses. Therefore, we leverage pre-trained models such as CLIP and Llama2 for knowledge distillation, utilizing the broad, universal knowledge embedded within these models. This keeps our approach adaptable and effective across a diverse range of contexts. We extract knowledge from a pre-trained large language model by feeding it the semantic information from static frames of the video, allowing us to obtain high-quality descriptions of semantic changes and project them into a cross-modal semantic state space. For the predicted proposal T = (t_s, t_e), we map the projected static features at times t_s and t_e to represent the start and end states, respectively, and regard the semantic changes from t_s to t_e as indicating the “event”, building a video-modal state set that maintains semantic coherence. Additionally, we extract potential event start and end states from the knowledge of the language model to establish a “start–event (query sentence)–end” state set that maintains semantic coherence for the text modality. Subsequently, we align the state sets of the two modalities in the cross-modal semantic state space. Our model then reconstructs the original query using the video features within the proposal. The reconstructed word distribution is used to compute the Kullback–Leibler divergence (relative entropy) with respect to the distribution of state similarities, as well as the cross-entropy with respect to the true word distribution of the original query. Through this approach, we enhance event-oriented cross-modal semantic consistency, avoiding the false high correlations caused by irrelevant frames and queries. Simultaneously, we ensure semantic coherence from the beginning to the occurrence and end of events in both the video and text modalities, facilitating the model’s understanding of the essence of events.
In summary, our method addresses the challenge of spurious high correlations of similar irrelevant video frames for weakly supervised temporal language grounding (WSTLG). The key innovations and contributions of our approach include:
We propose a novel Event-oriented State Alignment Network (ESAN) to improve cross-modal alignment in weakly supervised temporal language grounding. Unlike existing methods that often overlook the need for event-oriented semantic coherence, our ESAN constructs a semantic state space that captures the progression of events from start to end. This ensures more accurate representation and alignment of the events described in natural language queries with corresponding video segments.
We devise a semantic state alignment module that aligns the event-oriented “start–event–end” state sets constructed from two modalities to facilitate better understanding of the essence of event occurrences. Our method leverages entropy-based knowledge distillation to enhance the alignment of “start–event–end” state sets across modalities. This novel application of information theory principles improves the discriminative power of the model, particularly for distinguishing relevant frames from similar but irrelevant ones, which is often overlooked in existing methods.
With significantly enhanced semantic coherence and consistency, ESAN achieves substantial performance improvements on both the Charades-STA and ActivityNet-Captions datasets compared to recent, well-established methods. This highlights the potential of our insight in weakly supervised temporal language grounding.
These contributions address critical gaps in existing methods and provide a comprehensive solution to the challenges of weakly supervised temporal language grounding, making our approach both innovative and effective.
2. Related Work
2.1. Fully Supervised Temporal Language Grounding
Temporal language grounding refers to the process of aligning a video segment with a natural language query, which enables it to handle more complex and mixed events. Since these queries arise from natural language statements in open domains, establishing robust interactions between video and text modalities becomes crucial. In fully supervised settings, previous research can be categorized into three main types. The first category employs a two-stage paradigm, where a series of candidate moment proposals are generated and subsequently scored for ranking. Early methodologies such as TALL [
12] and MCN [
13] employ techniques like sliding windows along the temporal dimension to produce proposals of varying lengths. Following this, multi-modal fusion and alignment techniques are applied to identify the most suitable proposal [
3,
14,
15]. Some methods extend this approach by introducing deeper visual–textual interactions. Another category known as one-stage methods designs end-to-end structures to directly output the matching moment for a given sentence. For instance, ExCL [
16] utilizes a tied LSTM span predictor to output start and end frames following cross-modal interactions. 2DTAN [
17] creates a two-dimensional temporal map to cover diverse video segments with different lengths. This approach proves beneficial for learning the relationships within temporal contexts. Lastly, reinforcement-learning-based methods [
18] involve training an agent to make decisions regarding the movement of start and end points along the temporal sequence. However, all these methods necessitate temporal boundary annotations as supervision during training, which can be labor-intensive and prone to errors due to the subjectivity of different annotators.
2.2. Weakly Supervised Temporal Language Grounding
To address the reliance on temporal boundary annotations, researchers are exploring weakly supervised temporal language grounding, which relies solely on video-level descriptions. Many of these methods draw on the multi-instance learning (MIL) paradigm. TGA [
5] introduces text-specific global features for videos to conduct MIL and effectively computes cross-modal similarity. Meanwhile, WSLLN [
4] focuses on measuring segment–text consistency and simultaneously conducts segment selection. Recognizing the effectiveness of modeling temporal relationships, some weakly supervised methods [
7,
19] adopt 2DTAN: they treat the 2D feature map as a bag and consider candidate moments as instances within an MIL framework. Another line of weakly supervised temporal language grounding methods is reconstruction-based: these approaches reconstruct the original query and optimize a cross-entropy-based reconstruction loss, which significantly enhances the accuracy of query reconstruction and leads to more reliable temporal grounding results. For instance, SCN [
8] generates multiple proposals and selects the top-k candidates that excel at reconstructing masked words in queries. On the other hand, CNM [
9] and CPL [
10] not only utilize reconstruction mechanisms but also incorporate negative samples for contrastive learning. These methods fuse multimodal information to predict a Gaussian mask as a proposal and mine challenging negative samples within the same video.
However, a challenging problem remains with previous methods: false high correlations. These methods primarily learn the relevance between query sentences and video frames, and some frames outside the correct moment may still show a certain relevance to words in the query sentence. Due to the lack of temporal annotations, such pseudo-relevant frames are not penalized, leading to false correlations that can interfere with the accuracy of event localization. To solve this issue, we hypothesize that improving event-oriented semantic coherence and consistency will prevent the model from being misled by false correlations from individual frames.
2.3. Pre-Trained Large Language Model
Recently, advancements in pre-trained large language models [
20,
21] have revolutionized natural language understanding and generation. Similarly, multi-modal pre-trained models in vision–language tasks [
22,
23,
24] have demonstrated impressive performance in image–text comprehension, leveraging self-supervision and vast text data to achieve strong generalization capabilities.
However, deploying these large models in downstream applications often entails significant computational overhead due to fine-tuning requirements. Additionally, current multimodal models predominantly focus on static image–text relationships, exhibiting limited proficiency in capturing dynamic temporal information in videos.
To utilize pre-trained vision–language models under limited computing resources, our proposed ESAN addresses these limitations through a knowledge distillation technique that supports dynamic semantic state alignment and captures event-oriented transitions accurately, thereby improving weakly supervised temporal language grounding.
3. Approach
The objective of weakly supervised temporal language grounding is to identify and retrieve the moment T = (t_s, t_e) within an untrimmed video that corresponds to a specified query. As shown in
Figure 2, our reconstruction framework comprises four key components. First, the feature extractor extracts visual and textual features from the video and the query. These extracted features serve as inputs to the moment grounding module, which predicts the temporal proposal (t_s, t_e) of the corresponding event. Subsequently, the reconstruction module produces the reconstructed query by leveraging multi-modal information within the predicted proposal to reconstruct the masked query. Finally, the video features and the reconstruction performance are supervised by the semantic state alignment module.
3.1. Feature Extraction
3.1.1. Language Encoder
We extract the word embeddings of a query using GloVe [
25] and feed them into a single-layer fully connected network. The resulting word embeddings then interact with the visual features in the subsequent modules; their dimension matches that of the visual features, and the sequence length equals the number of words in the sentence.
3.1.2. Video Encoder
We extract the visual features of the given video using a pre-trained model, which helps us capture spatial information effectively even in the presence of noise. Fixed-interval sampling is then performed on the visual features to divide the frames of the video into clips, each consisting of a fixed number of frames. Hence, the input video is denoted as a sequence of short video clips whose feature dimension is the same as that of the text features.
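To make the sampling step concrete, the following is a minimal sketch (in PyTorch) of fixed-interval sampling of pre-extracted frame features into a fixed number of clips. The clip count of 200 follows the experimental settings in Section 4.3.2; the helper name `sample_clips` and the averaging of frames within each clip are illustrative assumptions rather than the exact implementation.

```python
import torch

def sample_clips(frame_feats: torch.Tensor, num_clips: int = 200) -> torch.Tensor:
    """Downsample a sequence of frame features (T, D) into `num_clips` clip
    features by averaging the frames inside each fixed interval.
    Illustrative helper; the paper only specifies fixed-interval sampling."""
    T, D = frame_feats.shape
    # Boundaries of `num_clips` equal-length intervals over the T frames.
    bounds = torch.linspace(0, T, num_clips + 1).long()
    clips = []
    for i in range(num_clips):
        lo = bounds[i].item()
        hi = max(bounds[i + 1].item(), lo + 1)  # keep every interval non-empty
        clips.append(frame_feats[lo:hi].mean(dim=0))
    return torch.stack(clips)  # (num_clips, D)

# Example: 1024-dim frame features for a ~30 s video sampled at ~24 fps.
video_feats = torch.randn(720, 1024)
clip_feats = sample_clips(video_feats)
print(clip_feats.shape)  # torch.Size([200, 1024])
```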
3.2. Moment Grounding Module
Our method leverages transformer-based architectures and multi-head attention mechanisms for cross-modal fusion due to their adaptability and their ability to establish long-range contextual relationships. We construct a moment grounding module based on the standard transformer to predict the temporal segments in videos that match the queried events. The multi-head attention mechanism ensures that each video frame perceives contextual information from its surrounding frames, which helps to mitigate the impact of noisy frames.
We first perform self-attention over the video features, yielding context-aware video tokens.
Given a transformer encoder and a transformer decoder, we perform cross-modal fusion with them and feed the fused features into a grounding head, consisting of learnable parameters followed by a sigmoid function, which predicts the moment of the event described in the query. The temporal grounding prediction T = (t_s, t_e) is obtained from the outputs of the grounding head.
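As an illustration of this module, the sketch below realizes the grounding pipeline with a standard transformer encoder/decoder and a sigmoid grounding head. The specific head design here, predicting a normalized center and width per proposal, is an assumption in the spirit of CPL-style methods and not necessarily the exact architecture used.

```python
import torch
import torch.nn as nn

class MomentGrounding(nn.Module):
    """Sketch of the grounding module: video self-attention (encoder),
    query-conditioned fusion (decoder), and a sigmoid grounding head."""
    def __init__(self, dim=256, heads=4, layers=3, num_proposals=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        # Assumed head: each proposal is a (center, width) pair in [0, 1].
        self.head = nn.Linear(dim, 2 * num_proposals)

    def forward(self, video, query):
        # video: (B, Nv, dim), query: (B, Nq, dim)
        ctx_video = self.encoder(video)                     # context-aware clips
        fused = self.decoder(tgt=ctx_video, memory=query)   # cross-modal fusion
        pooled = fused.mean(dim=1)                          # (B, dim)
        out = torch.sigmoid(self.head(pooled))              # (B, 2K) in [0, 1]
        center, width = out.chunk(2, dim=-1)
        t_start = (center - width / 2).clamp(0, 1)
        t_end = (center + width / 2).clamp(0, 1)
        return t_start, t_end                               # (B, K) each

model = MomentGrounding()
ts, te = model(torch.randn(2, 200, 256), torch.randn(2, 12, 256))
print(ts.shape, te.shape)  # torch.Size([2, 4]) torch.Size([2, 4])
```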
3.3. Reconstruction Module
Our method employs a reconstruction module to assist with temporal grounding. To select the best moment proposal in an end-to-end manner, we follow the reconstruction framework of CPL: our method evaluates and selects proposals by reconstructing masked words in the query. Recognizing the importance of context, we believe that words containing key contextual information are closely related to the video content; consequently, only video segments that match the query can accurately reconstruct these key words. To enhance the reconstruction module’s capability, we randomly mask one-third of the words in the query and then use a mask conditional reconstructor to recover the masked query, optimizing the model with cross-entropy and cross-entropy-based contrastive loss functions. In the mask conditional reconstructor, the moment proposal is formulated as a mask m that follows a Gaussian distribution, and this mask serves as our prediction. The mask m forces the attention in the transformers onto the context of the predicted proposal. Following CPL [
10], we predict K proposals, each with its own Gaussian mask, and select the one with the best reconstruction. We then reconstruct the masked query from the video features and the mask m, obtaining for the i-th proposal a reconstructed word probability distribution through a trainable projection. We also calculate the word distributions obtained when using masks that represent the total video and the segments beyond the predicted proposal, respectively, to reconstruct the masked query. Given the importance of accurate reconstruction for effective temporal grounding, we utilize a cross-entropy-based reconstruction loss function to optimize the module. This approach enhances the accuracy of the reconstructed queries, which is critical for selecting the optimal proposal in weakly supervised scenarios. Assuming the word distribution of the reconstruction result is P and the real distribution of the original sentence is Q, we use the cross-entropy to calculate the distance between them.
We apply the reconstruction loss together with a ranking loss so that our model develops comparative reconstructive capabilities, where hyperparameters control the margins between the reconstruction performances of the different masks.
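For clarity, here is a minimal sketch of the CPL-style Gaussian proposal mask together with the cross-entropy reconstruction loss and a margin-based ranking term. The variance scaling, the margin value, and the way the mask re-weights clip features are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_mask(center, width, num_clips=200):
    """Proposal as a Gaussian weighting over clip positions (CPL-style sketch)."""
    t = torch.linspace(0, 1, num_clips, device=center.device)            # (Nv,)
    sigma = width.unsqueeze(-1) / 2 + 1e-6                               # assumed scaling
    return torch.exp(-0.5 * ((t - center.unsqueeze(-1)) / sigma) ** 2)   # (B, Nv)

def reconstruction_loss(word_logits, target_ids, mask_positions):
    """Cross-entropy computed only on the masked word positions."""
    return F.cross_entropy(word_logits[mask_positions], target_ids[mask_positions])

def ranking_loss(loss_pos, loss_neg, margin=0.1):
    """Encourage the positive proposal to reconstruct better than a negative
    one by at least `margin` (margin value is an assumption)."""
    return F.relu(loss_pos - loss_neg + margin)

# Usage sketch: the mask re-weights clip features before reconstruction.
center, width = torch.tensor([0.4]), torch.tensor([0.2])
m = gaussian_mask(center, width)                 # (1, 200)
clip_feats = torch.randn(1, 200, 256)
masked_feats = clip_feats * m.unsqueeze(-1)      # attend mainly inside the proposal
```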
3.4. Semantic State Alignment Module
In this module, we treat the large model as the teacher network and our grounding network as the student network for knowledge distillation, aiming to better leverage the generic knowledge in the large model to enhance the event-oriented semantic consistency of the grounding network.
3.4.1. Principle Assumption
Our method is based on the assumption that only video segments whose states all align with the query should be considered matched. We define the semantic states in a video as follows: at time t1, the video’s semantic state is the start state; the aggregated semantic state from t1 to t2 is the event state; and the semantic state at time t2 is the end state. Together, these form a unimodal semantic state set for the video. For the query event Q, we generate a corresponding unimodal semantic state set using knowledge distillation. Given the general knowledge within large models, this query-side state set maintains semantic coherence and temporal consistency.
If the video segment within (t1, t2) matches the query, each state in the video-side set will correspond to a state in the query-side set, whereas for non-matching segments only partial matches may occur. Our semantic state alignment module addresses the false high correlations caused by unrelated segments with similar semantic elements by ensuring that the start and end states of unrelated segments do not align with those of the query.
3.4.2. Event-Oriented Semantic State Space
We construct an event-oriented semantic state space to implement semantic alignment, which is shown in
Figure 3. Our goal is for the features from both the video modality and the text modality to be event-oriented, ensuring intra-modal semantic coherence and cross-modal semantic consistency. The former entails that feature projections in the semantic state space from both modalities adhere to the state transformation of “start–event–end”. The latter necessitates aligning state transformations from both modalities within the same semantic state space.
Pre-trained large language models trained on massive datasets possess strong language understanding and generation capabilities. We utilize pre-trained models that are trained on extensive and diverse datasets covering a wide range of general-purpose scenarios. By leveraging the universal knowledge from these models, we ensure that our method maintains high applicability and robustness across different scenes and domains.
However, directly fine-tuning large models requires high computational resources. Recognizing the resource intensity of training and fine-tuning large models, we utilize a knowledge extraction strategy to leverage the general knowledge embedded within these models. This approach enhances the accuracy of our smaller network without incurring significant resource costs.
Firstly, we construct prompting statements based on the event Q in the text modality and input them into the large language model to generate embeddings of the start state and end state of the event. Since these states are generated from the universal prior knowledge in large models, we believe that the resulting “start–event–end” set is semantically coherent with respect to the event Q; this set corresponds to the query-side state set in the principle assumption.
For the video modality, we similarly obtain soft descriptions of semantic states by captioning the video at equidistant intervals using a vision–language model. These soft descriptions only contain static information about individual frames and cannot capture dynamic changes. To capture the semantic state changes in the video, we construct prompting statements based on two adjacent soft descriptions and use a large language model to explicitly model the changes in semantic state that occur during this time interval in the video. This process allows us to derive a set C of semantic states for the video. Since C is generated from continuous video soft descriptions, it satisfies semantic coherence.
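The snippet below is a hypothetical sketch of how the text-side start/end states and the video-side change descriptions could be elicited from the frozen language model; the prompt wording and helper names are assumed for illustration and are not necessarily the exact prompts used in our experiments.

```python
# Hypothetical prompt construction for the teacher networks (wording assumed).

def query_state_prompts(query: str) -> dict:
    """Prompts asking the LLM for the likely start and end states of an event."""
    return {
        "start": f'Describe the scene immediately before the event "{query}" begins.',
        "end": f'Describe the scene immediately after the event "{query}" ends.',
    }

def change_prompt(caption_t1: str, caption_t2: str) -> str:
    """Prompt asking the LLM to describe the semantic change between two
    adjacent frame captions produced by the vision-language model."""
    return (
        f'Frame A: "{caption_t1}"\nFrame B: "{caption_t2}"\n'
        "Describe what changed between frame A and frame B in one sentence."
    )

prompts = query_state_prompts("person pours a cup of coffee")
print(prompts["start"])
print(change_prompt("a hand holds a coffee pot", "a full cup sits on the table"))
```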
3.4.3. Knowledge Distillation
In this part, the outputs of the teacher network serve as soft labels for computing the relative entropy that supervises our grounding network. Firstly, we project the video features from the student network into the cross-modal semantic state space and align their embeddings with the semantic state embeddings of the videos. This reduces the semantic gap between the video features and the semantic state sets we have constructed, as expressed by the alignment loss in Equation (7). Note that this alignment loss can be divided into two parts: the alignment of image features with single-frame semantic information and the alignment of context-aware video features with the soft descriptions of semantic state changes, achieving cross-modal semantic alignment from both static and dynamic perspectives.
Each proposal predicted by the moment grounding module serves as a pseudo-label, and the event-oriented semantic state set of the video modality is constructed from the projected video features within the proposed interval; this set corresponds to the video-side state set in the principle assumption. The semantic state sets of the same event Q in the two modalities demonstrate semantic consistency. We compute the element-wise similarity between them and use it as soft labels from the teacher network to train the student network.
For the k-th predicted proposal, the resulting similarity scores respectively measure how well the static information at the proposed start time matches the start state of the event, how well the semantic state changes within the proposal match the event itself, and how well the static information at the proposed end time matches the end state of the event. The reconstructed word probability distribution output by the student network is used as an indicator for selecting the best proposal. We supervise this distribution using the state similarities as soft labels, promoting semantic coherence and consistency between the videos covered by the proposals and the queried events. The distillation loss is computed as the relative entropy (Kullback–Leibler divergence) between the soft-label distribution and the reconstructed word distribution. Finally, we use the time points with the highest similarities between the start and end states of the two modalities as pseudo-labels to guide the temporal localization of the best proposal, i.e., the one with the smallest reconstruction loss.
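As a sketch, one plausible implementation of this entropy-based supervision normalizes the teacher’s element-wise state similarities into a soft-label distribution and matches it against student proposal scores with KL divergence; the exact pairing of distributions, the temperature, and the tensor shapes below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def state_similarity(video_states, text_states):
    """Element-wise cosine similarity between the 'start-event-end' sets
    of the two modalities. Shapes: (K, 3, D) vs (3, D) -> (K, 3)."""
    v = F.normalize(video_states, dim=-1)
    t = F.normalize(text_states, dim=-1)
    return (v * t.unsqueeze(0)).sum(dim=-1)

def distillation_loss(student_scores, teacher_sim, temperature=1.0):
    """KL divergence between softmax-normalized teacher similarities (soft
    labels) and the student's per-proposal scores. Temperature is assumed."""
    soft_labels = F.softmax(teacher_sim.mean(dim=-1) / temperature, dim=0)   # (K,)
    log_student = F.log_softmax(student_scores / temperature, dim=0)         # (K,)
    return F.kl_div(log_student, soft_labels, reduction="batchmean")

video_states = torch.randn(4, 3, 256)   # K proposals, 3 states each
text_states = torch.randn(3, 256)       # start, event, end for the query
student_scores = torch.randn(4)         # e.g., negative reconstruction losses
print(distillation_loss(student_scores, state_similarity(video_states, text_states)))
```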
3.5. Training and Inference
During the training phase, we update the parameters of the student network using a weighted sum of the aforementioned losses, where two hyperparameters balance the weights of the different loss terms.
During the inference phase, among the K proposals generated by the moment grounding module, we select the proposal with the smallest reconstruction loss as the predicted event-matching moment.
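A minimal sketch of the training objective and the inference-time selection is given below; the weight names `lambda_align` and `lambda_distill` and the grouping of loss terms under the two weights are placeholders rather than the exact formulation.

```python
import torch

def total_loss(l_rec, l_rank, l_align, l_distill, l_guide,
               lambda_align=1.0, lambda_distill=1.0):
    """Weighted sum of the losses described above; weights are placeholders."""
    return l_rec + l_rank + lambda_align * l_align + lambda_distill * (l_distill + l_guide)

def select_proposal(proposals, rec_losses):
    """Inference: pick the proposal whose mask reconstructs the query best."""
    best = torch.argmin(rec_losses)
    return proposals[best]

proposals = torch.tensor([[0.10, 0.35], [0.30, 0.60], [0.55, 0.90]])  # (t_s, t_e)
rec_losses = torch.tensor([2.3, 1.1, 3.0])
print(select_proposal(proposals, rec_losses))  # tensor([0.3000, 0.6000])
```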
4. Experiments
4.1. Datasets
To evaluate our method, we conducted experiments on temporal language grounding using two benchmark datasets.
Charades-STA [
12] is a dataset that focuses on indoor daily activities and contains 12,408 moment–sentence pairs in the training set and 3720 pairs in the testing set. The average video length, moment length, and query length in Charades-STA are 30.60 s, 8.09 s, and 7.22 words, respectively. We report the results on the test set.
ActivityNet-Captions [
26] is currently the largest dataset for TLG and consists of 20,000 untrimmed videos and over 70,000 query–moment pairs related to open-world activities. The average lengths of the target moments and the untrimmed videos are 37.1 and 117.6 s, respectively, with an average query length of 14.4 words. We present the results based on the validation split.
4.2. Metrics
All results are measured using the evaluation metric “R@1, IoU > m”, which indicates the percentage of queries for which the intersection over union (IoU) between the model’s best-matched moment and the ground truth exceeds the threshold m. We use m ∈ {0.1, 0.3, 0.5} for ActivityNet-Captions and m ∈ {0.3, 0.5, 0.7} for Charades-STA.
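For reference, a small sketch of the “R@1, IoU > m” metric, assuming one top-1 predicted segment per query:

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, threshold):
    """Percentage of queries whose top-1 prediction exceeds the IoU threshold."""
    hits = sum(temporal_iou(p, g) > threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

preds = [(2.0, 9.5), (10.0, 20.0)]
gts = [(1.5, 8.0), (30.0, 40.0)]
print(recall_at_1(preds, gts, 0.5))  # 50.0
```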
4.3. Implementation Details
4.3.1. Data Preprocessing
For the Charades-STA dataset, we apply I3D [
27] features, while for the ActivityNet-Captions dataset, we use C3D [
28] features. As for the natural language sentence queries, we utilize pre-trained GloVe [
25] embeddings to extract 300-dimensional word embeddings.
4.3.2. Experiment Settings
For knowledge distillation, we adopt frozen CLIP [
29] and Llama2 [
21] as the teacher network. The number of downsampled video clips for all datasets is set to 200. We utilize three attention layers with a 256-dimensional hidden state in the transformer model. During training, the batch size is set to 32 for all datasets, and we employ the Adam optimizer with an initial learning rate of 0.0004. We conduct training and inference on a single NVIDIA RTX 3090 GPU with 24 GB of memory.
4.4. Results and Comparisons
Table 1 and
Table 2 present the results of our method alongside those of previous state-of-the-art methods on ActivityNet-Captions and Charades-STA.
Comparison to the baseline: We selected CPL [
10] as the baseline to demonstrate the effectiveness of our method, given that we adopted the reconstruction framework of CPL.
In comparison, our method exhibits significant improvements on both datasets. Specifically, on Charades-STA, when considering accuracy at IoU = 0.3, 0.5, and 0.7, our approach achieves substantial improvements of 2.69%, 1.61%, and 1.59%, respectively.
Comparison to SOTA methods: As shown in
Table 1, our model clearly achieves the highest performance across all metrics for Charades-STA. Additionally, for the ActivityNet-Captions dataset, our method also demonstrates the best performance at IoU values of 0.3 and 0.5, with only a slight deviation below CPL at an IoU of 0.1. Our method consistently delivers strong performance on both datasets, indicating the potential of our approach.
Overall, the empirical results validate our assumption by demonstrating the effectiveness of our approach at both matching and distinguishing relevant and irrelevant video segments. This is consistent with previous research emphasizing the significance of semantic coherence and consistency in video analysis.
4.5. Statistical Testing
To demonstrate the robustness and reliability of our results, we conduct statistical tests and report confidence intervals for three evaluation metrics on each of the two datasets, Charades-STA and ActivityNet-Captions. The p-values obtained from these tests are all less than 0.05, indicating that the results are statistically significant. Moreover, we conduct multiple experiments and compute the 95% confidence intervals (CIs) for different thresholds. The 95% confidence intervals are [67.53, 68.92] (IoU > 0.3), [49.95, 51.01] (IoU > 0.5), and [23.56, 24.35] (IoU > 0.7) for Charades-STA, and [78.33, 81.83] (IoU > 0.1), [55.37, 57.21] (IoU > 0.3), and [32.20, 33.51] (IoU > 0.5) for ActivityNet-Captions. These confidence intervals demonstrate the robustness of our method across different thresholds and datasets.
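As an illustration, a 95% confidence interval over repeated runs can be computed with a normal approximation as below; the number of runs and the example scores are hypothetical.

```python
import statistics
import math

def confidence_interval_95(scores):
    """95% CI over repeated runs using the normal approximation (assumed)."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
    half = 1.96 * sem
    return mean - half, mean + half

runs = [68.0, 68.4, 67.9, 68.6, 68.2]  # hypothetical R@1 (IoU > 0.3) scores
print(confidence_interval_95(runs))
```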
4.6. Ablation Study
To analyze the effectiveness of our method, we conduct an ablation study on the Charades-STA dataset.
Effectiveness of knowledge distillation in the semantic state alignment module: We construct ablation models that perform only embedding alignment (Emb Align), only state alignment (State Align), only localization guiding (Loc Guide), or none of these (w/o Any). As shown in
Table 3, the full model outperforms the ablation models on all metrics. This can be attributed to our integration of the general knowledge of pre-trained large models, which prompts the model to gain a deeper understanding of the semantic state changes of the described events and thus improves cross-modal semantic consistency. It is worth noting that state alignment alone does not yield a clear improvement because it lacks embedding alignment: a semantic gap remains between the video features of the student network and the state sets in the semantic space, so the constructed event-oriented state sets lack semantic coherence, which hinders cross-modal semantic alignment.
Effectiveness of event-oriented semantic state space: Further, we evaluate different ways of using the knowledge in large models to highlight the efficacy of our proposed event-oriented semantic state space. The ablation approach directly calculates the similarity between the queries and the captions of frames in the video, without constructing an event-oriented semantic state space or the proposed “start–event–end” sets of semantic states. As the results in
Table 4 show, our approach demonstrates significant advantages. This is because the captions of frame images only contain static visual information, which can easily produce false high correlations with some words in the query. In contrast, our method constructs a “start–event–end” semantic state set for the query event, capturing coherent dynamic changes between semantic states. Cross-modal alignment based on these state changes improves cross-modal semantic consistency, thereby improving the accuracy of moment grounding.
4.7. Sensitivity Analysis
To analyze the sensitivity, we vary the weights of the losses and observe the accuracy at IoU thresholds of 0.3, 0.5, and 0.7 while keeping the other hyperparameters unchanged.
As shown in
Figure 4, our model exhibits stability and robustness across different hyperparameter values. Specifically, our model achieves optimal performance at moderate values of the two loss weights; exceeding these values leads to a slight decline in performance. This is attributed to an excessive reliance on soft labels when the weights are too large, which neglects the importance of reconstruction performance for selecting the best proposals.
4.8. Discussion about Complexity
Comparison with large pre-trained models: The common knowledge within large pre-trained language models benefits our semantic state alignment. However, fine-tuning large models is resource-intensive. Thus, we efficiently leverage the general knowledge within these models by employing a knowledge extraction strategy, minimizing resource consumption. Our experiments indicate that this process requires limited resources and can be performed on a single NVIDIA RTX 3090 GPU, highlighting the practicality and scalability of our approach.
In constructing the cross-modal semantic state space for events, we employ Llama2-7B [
21] as the teacher network. Thus, we compare its complexity against that of our own network. Since Llama2-7B, being a language model, does not possess visual perception capabilities, we input the semantic embeddings of video frames in chronological order to facilitate temporal grounding. Experiments are carried out on the Charades-STA dataset using an NVIDIA RTX 3090, and the results are presented in
Table 5. The outcomes demonstrate that our model has only 1/57 of the parameter size of Llama2-7B and runs roughly 27 times faster at inference (0.25 s versus 6.7 s per sample). This suggests that our model can effectively harness the general knowledge of large models through knowledge distillation, making it more suitable for scenarios with limited computational resources.
Space and time costs: Table 6 demonstrates that by utilizing transformer-based architectures and multi-head attention mechanisms, our model exhibits relatively low time and space overhead during both training and inference. Specifically, with a batch size of 32, the memory consumption during training the small model is approximately 6.7 GB, knowledge distillation requires about 13 GB, and inference uses about 3.2 GB. Additionally, on a single NVIDIA RTX 3090 GPU, the inference time per sample is around 0.25 s. These results highlight the efficiency of our model, making it feasible for near-real-time applications even in environments with limited computational resources.
4.9. Qualitative Results
Figure 5 shows some qualitative results of our approach.
Figure 5a,b demonstrate that our method is capable of better cognition and understanding of the video content and the events described in the natural language queries, thereby enabling more precise localization. This is attributed to our model’s semantic coherence and consistency, which prevent it from being misled by the spurious high similarities of some unrelated frames. Additionally, as shown in
Figure 5c,d, our method produces more precise boundaries, which we attribute to the construction of the “start–event–end” semantic state set: it facilitates the identification of the temporal boundaries associated with the start and end states, thereby enhancing moment grounding accuracy.
Figure 5e shows that the model achieves rough localization but lacks precision. The query “person walks through a doorway to leave the room” is misinterpreted by the model, which inaccurately associates the ending state “leave the room” with the overall event state, leading to deviations. Another example,
Figure 5f, showcases a clear misalignment due to the essential semantic element “book” not being prominent in the video frames. This absence results in incorrect cross-modal alignment and highlights a failure case.
These examples underscore both the strengths and limitations of our model, providing valuable insights into its performance across different scenarios.
5. Future Works
In this section, we outline several directions for future research that address the challenges posed by multi-language queries and by subtle but important visual elements.
Addressing multi-language queries: As current datasets and prior work predominantly focus on English, expanding to multi-language queries presents unique challenges. Different languages exhibit distinct grammar, vocabulary, and expression styles, leading to potential information loss or misinterpretation. Code-mixing [
39] and code-switching [
40] add a further layer of complexity. Moreover, supporting multiple languages often requires different models or adaptations, posing challenges for the scalability and generalization of the system.
To tackle these issues, future work will explore the use of language identification techniques and multilingual pre-trained models as text pre-processing. This approach aims to enhance the model’s adaptability and scalability, ensuring effective performance in diverse linguistic contexts.
Addressing insignificant but important video elements: Addressing the issue of reduced accuracy when semantic elements are not prominently featured in video frames is another focus. Enhancing the model’s robustness in such scenarios could involve integrating additional visual context or employing more sophisticated attention mechanisms to better capture subtle semantic cues.
6. Conclusions
In conclusion, this paper introduces the Event-oriented State Alignment Network (ESAN), which makes several novel contributions to the field of weakly supervised temporal language grounding:
Firstly, ESAN constructs event-oriented “start–event–end” semantic state sets for both textual and video data. This approach ensures that the model captures the complete sequence of state changes, thereby aligning text more accurately with the true essence of queried events.
Secondly, by applying relative entropy for knowledge distillation from pre-trained large models, ESAN achieves effective cross-modal alignment across various domains. This use of information-theoretic principles allows the model to leverage the general knowledge in pre-trained models while maintaining semantic coherence and consistency across modalities.
Thirdly, ESAN’s methodology enables a deeper understanding of queried events based on semantic state changes rather than relying solely on superficial correlations between words and video frames. This leads to better intra-modal semantic coherence and cross-modal semantic consistency during the state transition process.
Experiments conducted on two widely-used datasets demonstrate that ESAN significantly outperforms existing methods in weakly supervised TLG. The results highlight the robustness and effectiveness of our proposed approach at capturing event-oriented semantics and reducing false positives caused by irrelevant frames.
However, our approach also has some limitations that must be considered. One notable challenge is handling multilingual queries. The current model may struggle with queries in different languages due to potential inconsistencies in semantic interpretation. Additionally, the accuracy of our method can diminish when essential semantic elements are not prominently featured in the video frames, leading to potential misalignments and decreased performance. Future work will focus on exploring methods to address these limitations.
Author Contributions
Conceptualization, H.W. and X.Z.; methodology, H.W. and X.Z.; software, H.W.; validation, T.T., C.Y. and X.Z.; formal analysis, H.W.; investigation, H.W.; resources, Z.L.; data curation, H.W.; writing—original draft preparation, H.W.; writing—review and editing, H.W. and X.Z.; visualization, H.W.; supervision, T.T., C.Y. and Z.L.; project administration, C.Y. and Z.L.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research is supported by the National Key Research and Development Program of China (2023YFB3001903).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are openly available in an open-access repository at [
9]. Only publicly available data sources have been used, and thus no ethical permission is required.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
TLG | Temporal Language Grounding |
ESAN | Event-oriented State Alignment Network |
TALL | Temporal Activity Localization via Language |
MCN | Moment Context Network |
ExCL | Extractive Clip Localization |
LSTM | Long Short-Term Memory |
2DTAN | 2-dimensional Temporal Adjacent Networks |
MIL | Multi-Instance Learning |
TGA | Text-specific Global features Alignment |
WSLLN | Weakly Supervised Language Localization Network |
SCN | Semantic Complement Network |
CNM | Contrastive Negative sample Mining |
CPL | Contrastive Proposal Learning |
IoU | Intersection over Union |
CLIP | Contrastive Language-Image Pre-training |
References
- Shen, X.; Lan, L.; Tan, H.; Zhang, X.; Ma, X.; Luo, Z. Joint modality synergy and spatio-temporal cue purification for moment localization. In Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 369–379. [Google Scholar]
- Xiao, S.; Chen, L.; Zhang, S.; Ji, W.; Shao, J.; Ye, L.; Xiao, J. Boundary proposal network for two-stage natural language video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021. [Google Scholar]
- Yang, X.; Wang, S.; Dong, J.; Dong, J.; Wang, M.; Chua, T. Video moment retrieval with cross-modal neural architecture search. IEEE Trans. Image Process. 2022, 31, 1204–1216. [Google Scholar] [CrossRef] [PubMed]
- Gao, M.; Larry, S.; Socher, R.; Xiong, C. Wslln: Weakly supervised natural language localization networks. In Proceedings of the Empirical Methods Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 1481–1487. [Google Scholar]
- Mithun, N.; Paul, S.; Roy-Chowdhury, A.K. Weakly supervised video moment retrieval from text queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11592–11601. [Google Scholar]
- Tan, R.; Xu, H.; Saenko, K.; Plummer, B.A. Logan: Latent graph co-attention network for weakly-supervised video moment retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2083–2092. [Google Scholar]
- Wang, Y.; Deng, J.; Zhou, W.; Li, H. Weakly supervised temporal adjacent network for language ground. IEEE Trans. Multimed. 2021, 24, 3276–3286. [Google Scholar] [CrossRef]
- Lin, Z.; Zhao, Z.; Zhang, Z.; Wang, Q.; Liu, H. Weakly-supervised video moment retrieval via semantic completion network. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11539–11546. [Google Scholar]
- Zheng, M.; Huang, Y.; Chen, Q.; Liu, Y. Weakly supervised video moment localization with contrastive negative sample mining. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2022; pp. 3517–3525. [Google Scholar]
- Zheng, M.; Huang, Y.; Chen, Q.; Peng, Y.; Liu, Y. Weakly supervised temporal sentence grounding with gaussian-based contrastive proposal learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Zhuang, L.; Fei, H.; Hu, P. Knowledge-enhanced event relation extraction via event ontology prompt. Inf. Fusion 2023, 100, 101919. [Google Scholar] [CrossRef]
- Gao, J.; Sun, C.; Yang, Z.; Nevatia, R. Tall: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5267–5275. [Google Scholar]
- Hendricks, L.A.; Wang, O.; Shechtman, E.; Sivic, J.; Darrell, T.; Russell, B. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5804–5813. [Google Scholar]
- Dong, J.; Li, X.; Xu, C.; Yang, X.; Yang, G.; Wang, X.; Wang, M. Dual encoding for video retrieval by text. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 48, 4065–4080. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Li, K.; Li, J.; Chen, G.; Wang, M.; Guo, D. Spatiotemporal contrastive modeling for video moment retrieval. World Wide Web. 2022, 26, 1525–1544. [Google Scholar]
- Ghosh, S.; Agarwal, A.; Parekh, Z.; Hauptmann, A. Excl: Extractive clip localization using natural language descriptions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
- Zhang, S.; Peng, H.; Fu, J.; Luo, J. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12870–12877. [Google Scholar]
- Sun, X.; Wang, H.; He, B. Maban: Multi-agent boundary-aware network for natural language moment retrieval. IEEE Trans. Image Process. 2021, 30, 5589–5599. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Z.; Lin, Z.; Zhao, Z.; Zhu, J.; He, X. Regularized two-branch proposal networks for weakly-supervised moment retrieval in videos. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 4098–4106. [Google Scholar]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. [Google Scholar]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
- Qu, L.; Wu, S.; Fei, H.; Nie, L.; Chua, T. Next-gpt: Any-to-any multi-modal llm. arXiv 2023, arXiv:2309.05519. [Google Scholar]
- Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv 2023, arXiv:2308.12966. [Google Scholar]
- Wu, S.; Fei, H.; Ji, W.; Chua, T. Cross2stra: Unpaired cross-lingual image captioning with cross-lingual cross-modal structure-pivoted alignment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 10–12 July 2023. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Krishna, R.; Hata, K.; Ren, F.; Li, F.; Niebles, J.C. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 706–715. [Google Scholar]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Radford, A.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Chen, Z.; Ma, L.; Luo, W.; Tang, P.; Wong, K. Look closer to ground better: Weakly supervised temporal grounding of sentence in video. arXiv 2020, arXiv:2001.09308. [Google Scholar]
- Ma, F.; Zhu, L.; Yang, Y. Weakly supervised moment localization with decoupled consistent concept prediction. Int. J. Comput. Vis. 2022, 130, 1244–1258. [Google Scholar] [CrossRef]
- Song, Y.; Wang, J.; Ma, L.; Yu, Z.; Yu, J. Weakly-supervised multi-level attentional reconstruction network for grounding textual queries in videos. arXiv 2020, arXiv:2003.07048. [Google Scholar]
- Wang, Z.; Chen, J.; Jiang, Y. Visual co-occurrence alignment learning for weakly-supervised video moment retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, China, 20–24 October 2021. [Google Scholar]
- Huang, J.; Liu, Y.; Gong, S.; Jin, H. Cross-sentence temporal and semantic relations in video activity localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7199–7208. [Google Scholar]
- Yang, W.; Zhang, T.; Zhang, Y.; Wu, F. Local correspondence network for weakly supervised temporal sentence grounding. IEEE Trans. Image Process. 2021, 30, 3252–3262. [Google Scholar] [CrossRef] [PubMed]
- Duan, X.; Huang, W.; Gan, C.; Wang, J.; Huang, J. Weakly supervised dense event captioning in videos. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
- Teng, J.; Lu, X.; Gong, Y.; Liu, X.; Nie, X.; Yin, Y. Regularized Two Granularity Loss Function for Weakly Supervised Video Moment Retrieval. IEEE Trans. Multimed. 2022, 24, 1141–1151. [Google Scholar] [CrossRef]
- Chen, S.; Jiang, Y. Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8425–8435. [Google Scholar]
- Ehtesham, H.; Sule, Y.Y.; Sarang, S. Augmenting sentiment prediction capabilities for code-mixed tweets with multilingual transformers. Soc. Netw. Anal. Min. 2024, 14, 86. [Google Scholar]
- William, R.; Aldo, M.B.; Lianelys, C.M.; Ivan, C. The Complexity in Bilingual Code-Switching Research: A Systematic Review. Language 2024, 9, 217. [Google Scholar]
Figure 1.
An example of event-oriented semantic coherence and consistency in temporal language grounding. The event described in the query and the corresponding video segment can be transformed into sets of semantic state changes as a “start–event–end” pattern. These semantic sets are intra-modally semantically coherent. Due to them being sets of semantic states in different modalities about the same event, they satisfy cross-modal consistency in semantics.
Figure 2.
The overall architecture of our approach, including four components: feature encoder, moment grounding module, reconstruction module, and semantic state alignment module.
Figure 3.
The generation of soft labels by the teacher network in the semantic state alignment module. We use a pre-trained vision–language model and a large language model to produce the event-oriented semantic state sets of the two modalities. Then, we calculate the element-wise similarities between the produced sets; these similarity values are used to supervise the reconstruction performance. In addition, the embeddings of the semantic states in the video modality also serve as alignment targets.
Figure 4.
The results with different values of the two loss-weight hyperparameters. Subfigures (a–c) show the results under different settings of these weights.
Figure 5.
The qualitative results on Charades-STA. Subfigures (a–d) show successful examples of our model. Subfigures (e,f) show failure cases.
Table 1.
Performance comparison on the Charades-STA dataset. The best results are highlighted in bold, and the second-best results are underlined.
Method | IoU = 0.3 | IoU = 0.5 | IoU = 0.7 |
---|---|---|---|
TGA [5] | 32.14 | 19.94 | 8.84 |
CTF [30] | 37.8 | 27.3 | 12.9 |
SCN [8] | 42.96 | 23.58 | 9.97 |
DCCP [31] | - | 29.8 | 11.9 |
WSTAN [7] | 43.39 | 29.35 | 12.28 |
LoGAN [6] | 48.04 | 31.74 | 13.71 |
MARN [32] | 48.55 | 31.94 | 14.81 |
VCA [33] | 58.58 | 38.13 | 19.57 |
CRM [34] | 53.66 | 34.76 | 16.37 |
LCNet [35] | 59.60 | 39.19 | 18.87 |
RTBPN [19] | 60.04 | 32.36 | 13.24 |
CNM [9] | 60.04 | 35.15 | 14.95 |
CPL [10] | 65.99 | 49.05 | 22.61 |
Ours | | | |
Table 2.
Performance comparison on the ActivityNet-Captions dataset. The best results are highlighted in bold, and the second-best results are underlined.
Method | IoU = 0.1 | IoU = 0.3 | IoU = 0.5 |
---|---|---|---|
DCCP [31] | - | 41.6 | 23.2 |
WS-DEC [36] | 62.71 | 41.98 | 23.34 |
RGL [37] | 65.99 | 44.49 | 24.33 |
VCA [33] | 67.96 | 50.45 | 31.00 |
EC-SL [38] | 68.48 | 44.29 | 24.26 |
MARN [32] | - | 47.01 | 29.95 |
SCN [8] | 71.48 | 47.23 | 29.22 |
RTBPN [19] | 73.73 | 49.77 | 29.63 |
CTF [30] | 74.2 | 44.30 | 23.6 |
WSLLN [4] | 75.4 | 42.8 | 22.7 |
LCNet [35] | 78.58 | 48.49 | 26.33 |
WSTAN [7] | 79.78 | 52.45 | 30.01 |
CPL [10] | 82.55 | 55.73 | 31.37 |
CRM [34] | 81.61 | 55.26 | 32.19 |
CNM [9] | 78.13 | 55.68 | 33.33 |
Ours | | | |
Table 3.
The effectiveness of knowledge distillation in the semantic state alignment module for moment localization on the Charades-STA dataset. Bold indicates the best score.
Method | IoU = 0.3 | IoU = 0.5 | IoU = 0.7 |
---|---|---|---|
Full Model | | | |
Emb Align | 66.62 | 49.24 | 23.50 |
State Align | 65.71 | 49.30 | 23.02 |
Loc Guide | 66.66 | 49.62 | 24.12 |
w/o Any | 65.62 | 49.21 | 22.20 |
Table 4.
The importance of the event-oriented semantic state set (ESS) on the Charades-STA dataset. Bold indicates the best score.
Method | IoU = 0.3 | IoU = 0.5 | IoU = 0.7 |
---|---|---|---|
Full Model | | | |
w/o ESS | 63.12 | 48.36 | 21.13 |
Table 5.
Complexity comparison with pre-trained large model.
Method | Parameter Size | Time Cost |
---|---|---|
Our Model | 13 M | 0.25 s |
Llama2-7B | 745 M | 6.7 s |
Table 6.
Space and time costs of our model on Charades-STA.
Phase | Memory Usage | Time Cost |
---|---|---|
Training | 6773 MB | 1.12 s |
Inference | 3223 MB | 0.25 s |
| Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).