Article

Advancing Temporal Action Localization with a Boundary Awareness Network

School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(6), 1099; https://doi.org/10.3390/electronics13061099
Submission received: 23 February 2024 / Revised: 14 March 2024 / Accepted: 14 March 2024 / Published: 16 March 2024

Abstract

Temporal action localization (TAL) is crucial in video analysis, yet it presents notable challenges. The task is to precisely identify and categorize action instances within long, untrimmed videos. A key difficulty in TAL lies in determining the exact start and end points of actions, owing to the often unclear boundaries of these actions in real-world footage. Existing methods tend to take insufficient account of changes in action boundary features. To tackle these issues, we propose a boundary awareness network (BAN) for TAL. Specifically, the BAN consists of a feature encoding network, coarse pyramidal detection to obtain preliminary proposals and action categories, and fine-grained detection with a Gaussian boundary module (GBM) to extract more informative boundary cues. The GBM contains a novel Gaussian boundary pooling, which aggregates the relevant features of the action boundaries and captures discriminative boundary and actionness features. Furthermore, we introduce a novel approach named Boundary Differentiated Learning (BDL) to ensure that our model accurately identifies action boundaries across diverse proposals. Comprehensive experiments on the THUMOS14 and ActivityNet v1.3 datasets, on which our BAN model improves mean Average Precision (mAP) by 1.6% and 0.2%, respectively, over existing state-of-the-art methods, demonstrate the effectiveness of our approach.

1. Introduction

With vast numbers of videos uploaded to social media every day, effective and efficient video understanding algorithms are becoming increasingly important. Temporal action localization (TAL) plays a crucial role in video understanding, particularly for applications such as video surveillance, content analysis and summarization, and human–computer interaction. In long, untrimmed videos, TAL must not only detect the temporal boundaries of action instances but also identify their action categories.
TAL bears a strong resemblance to object detection: TAL aims to accurately generate temporal proposals that segment action instances, while object detection aims to precisely produce spatial proposals that delineate object instances. Therefore, many TAL methods extend object detection frameworks, including anchor-based [1,2,3,4,5,6], actionness-guided [7,8,9,10,11], and anchor-free (boundary-based) methods [12,13], the last of which are inspired by the flexibility and efficiency of anchor-free object detectors such as CornerNet [14], FCOS [15], and BorderDet [16]. Anchor-based methods in TAL are generally classified into two distinct types: two-stage frameworks [2,3,4], which offer higher accuracy, and one-stage frameworks [1,5,6], which offer higher efficiency. The two-stage framework separates proposal generation from classification; the former corresponds to the task of temporal action proposal (TAP) generation, and the latter is closer to action recognition.
Anchor-free (boundary-based) methods are known for their accurate boundary delineation and their adaptability to actions of varying duration; they produce a single proposal at each temporal point throughout the video sequence, represented by a pair of values indicating the distances from the current location to the start and end times. Among these, the Anchor-Free Saliency-based Detector (AFSD) has made notable strides, pioneering the use of frame-level features and saliency cues to enhance boundary detection. Although the above methods have achieved good results, they still fall short of accurately localizing actions; the most difficult aspect is the ambiguity of action boundaries. Owing to the continuity of motion, an action does not begin abruptly at a certain moment but emerges gradually. If an “actionness value” is used to mark the beginning (or end) of an action, then this value accumulates until it reaches a threshold (i.e., the moment of the action) before the action begins (or after it ends). Therefore, the start point of an action (and, equivalently, its end point) exhibits two critical characteristics: (1) the features in the boundary region vary relatively little, and (2) there exists an objective threshold that distinguishes the boundaries of actions.
To this end, we introduce the boundary awareness network (BAN) as a refinement to the existing anchor-free framework for temporal action localization (TAL), focusing on achieving a more accurate detection of action boundaries within videos. This refinement is executed through a comprehensive architecture comprising a feature encoding network, a coarse pyramidal detection module, and a fine-grained detection mechanism powered by the Gaussian boundary module (GBM). Compared to the anchor-free method AFSD, the BAN introduces innovative approaches for boundary detection, aiming to address the ambiguities of action boundaries more effectively; while the AFSD laid the groundwork for anchor-free TAL by simplifying the detection process and reducing dependency on hyperparameters, the BAN builds upon this by enhancing the precision of boundary detection through the GBM and a novel learning strategy, Boundary Differentiated Learning (BDL).
Feature encoding network. Utilizing the I3D [17], renowned for its effectiveness in capturing rich, dynamic features from video data, this network serves as the cornerstone of our approach. By extracting robust spatial–temporal features, it lays a solid foundation for the precise detection and localization of actions within videos.
Coarse pyramidal detection module. To address the spatial–temporal dimensions of video data comprehensively, we employ a feature pyramid network equipped with various temporal convolutions. This module is pivotal in generating initial action proposals by amalgamating information across multiple layers, thereby facilitating a multi-scale understanding of potential action instances.
Fine-grained detection. Rough estimates of the start and end boundary distances are first derived through projection. The predicted temporal regions are then used to generate fine-grained features with the Gaussian boundary module, which utilizes Gaussian boundary pooling to precisely locate the action boundary. Since actions are continuous, their boundaries are inherently fuzzy, and accurate boundary detection requires discriminative boundary features. The core difficulty in obtaining such features is how to correlate the moment-level boundary features with their relevant context. A simple but non-differentiable approach is to extract the moment-level boundary feature from the frame with the maximum activation value; although such a feature is the most discriminative one, much useful information is lost, since the features surrounding the salient frame also contain boundary information.
Gaussian Boundary Module. To tackle the challenges in capturing high-quality boundary features for action prediction, we introduce the Gaussian boundary module (GBM). This module features a novel boundary pooling method named Gaussian boundary pooling (GBP), employing a Gaussian kernel function to aggregate features around the action boundary effectively. The kernel, centered on the boundary time stamp from coarse predictions, ensures that features closer to the boundary have a higher influence, optimizing the quality of boundary features extracted. It is important to note the distinction between our Gaussian kernels and those used in GTAN [5], which aims to represent action proposals of varying durations, whereas our focus is on extracting moment-range boundary features for precise action localization.
Boundary Differentiated Learning. To further enhance our model’s ability to discern complex actions from background clutter in videos, we propose a novel strategy named Boundary Differentiated Learning (BDL). This strategy innovatively transforms the classification task of distinguishing action boundaries into a regression task focused on identifying the ’actionability’ of a scene. By re-scaling the ground truth labels to guide the model, BDL allows for a more nuanced understanding of action presence, facilitating the discrimination between action and non-action segments. Our extensive experiments on datasets such as THUMOS14 and ActivityNet-1.3 demonstrate the effectiveness of the BDL strategy, significantly improving our model’s performance and establishing new state-of-the-art results.
Our key contributions can be outlined as follows:
1. In this study, our enhanced anchor-free framework for temporal action localization modestly increases computational demands relative to the AFSD method, yet it offers a more precise detection of action boundaries. This methodical refinement bolsters the framework’s precision without significantly impacting computational efficiency, thereby advancing the state of the art in temporal action localization.
2. For a more accurate delineation of action boundaries, we have developed a unique boundary feature extraction method using the Gaussian boundary module. This is utilized in conjunction with coarse pyramidal detection to yield more detailed predictions. Additionally, we have devised a Boundary Differentiated Learning (BDL) strategy to further refine the accuracy of boundary information.
3. In our thorough assessments on the THUMOS14 and ActivityNet v1.3 datasets, the boundary awareness network (BAN) has shown advancements in mean Average Precision (mAP) when compared with the existing state-of-the-art methods, underscoring the enhanced performance of our proposed approach.

2. Related Work

Anchor-based. Current anchor-based approaches in the field can be categorized into two distinct frameworks: the one-stage and the two-stage framework. The two-stage framework includes TURN-TAP [2], R-C3D [3], TAL-Net [4], etc. TURN-TAP [2] cuts the untrimmed video into multiple non-overlapping clips, uses a pyramid method to construct unit features, and independently regresses the temporal boundary on each unit. R-C3D [3], taking inspiration from Faster-RCNN [18], proposes an end-to-end network that generates temporal action proposals and performs action classification in two stages. TAL-Net [4] proposes multi-scale anchors to adapt to the diverse durations of action instances. The main drawback of the two-stage framework is that the boundaries of action instances are already fixed by the classification step. To address this issue, one-stage frameworks have been proposed in recent years, including SSAD [1], GTAN [5], and PBR-Net [6]. SSAD [1] uses 1D temporal convolution to generate multiple temporal action anchors and captures actions of different durations at three hierarchical levels. GTAN [5] represents actions of different durations through a mixture of Gaussian kernels. PBR-Net [6] designs a series of regression modules to progressively refine the boundary information and integrates a fine-grained frame-level method with the anchor-based method.
The above anchor-based methods rely heavily on predefined anchors, which lack flexibility and are sensitive to anchor parameters. To address these problems, A2Net [19] combines anchor-based and anchor-free methods to detect temporal action proposals accurately; however, it still depends on predefined anchors to some extent. Our method differs in that it does not require predefined anchors and is thus more efficient and flexible.
Actionness-guided. Anchor-based TAL methods can be regarded as top–down algorithms, whereas actionness-guided TAL methods are bottom–up: they estimate the probability of potential actions using frame-level actionness scores, combine candidate start and end moments into proposals, and, finally, predict a category for each proposal. Compared with anchor-based methods, actionness-guided methods detect proposals in a more flexible manner. SSN [7] models the different stages (start, course, and end) of an action and evaluates the completeness of proposals. CDC [11] predicts the action category frame by frame, which makes action boundaries more accurate. BSN [8] locates action boundaries via an actionness score and starting and ending probabilities. BMN [9], an extension of BSN, combines the temporal boundary probability sequence with a boundary-matching mechanism to obtain the final proposals and applies a threshold to screen them. However, the boundary matching strategy of BMN only uses low-level boundary information and ignores the overall information of the action, so it performs poorly on complex actions and ambiguous backgrounds. To overcome this limitation, DBG [10] assesses comprehensive boundary confidence maps across all proposals and introduces action-sensitive features for a thorough evaluation of action completeness.
In contrast to these approaches, anchor-free methods like SALAD [13] and the AFSD [12] demonstrate efficiency and flexibility by eschewing predefined anchors. The AFSD, in particular, has pioneered an entirely anchor-free framework for TAL, emphasizing moment-level feature identification for action boundaries. Despite its innovations, the AFSD encounters challenges in capturing comprehensive boundary features, often losing valuable information surrounding salient features.
Building on the AFSD’s foundation, we introduce the boundary awareness network (BAN) equipped with the Gaussian boundary module (GBM) to enhance action boundary localization precision. Unlike the AFSD, the BAN leverages Gaussian boundary pooling (GBP) within the GBM to aggregate features around action boundaries more effectively, ensuring that high-quality boundary features are captured without losing peripheral information. This targeted approach allows the BAN to address the limitations observed in both anchor-based and actionness-guided methods, marking a significant step forward in the domain of anchor-free temporal action localization.

3. Method

In this section, the detailed structure of our newly developed boundary awareness network (BAN) is discussed. The comprehensive architecture tailored for temporal action localization is depicted in Figure 1.

3.1. Problem Definition

In the context of an untrimmed video V, comprising T frames, we segment it into $l_v$ sections, each containing a fixed number of frames. These segments are represented as
$$X = \{ x_n \}_{n=1}^{l_v}$$
wherein $x_n$ symbolizes the n-th element of X. The temporal labeling set of X is constituted by a collection of temporal action instances, defined as
$$\Psi_g = \{ \varphi_n = (t_{s,n}, t_{e,n}) \}_{n=1}^{N_g}$$
with $N_g$ denoting the total number of actual action instances, $t_{s,n}$ being the start time of the action instance $\varphi_n$, and $t_{e,n}$ representing its end time. Given these annotations for video V, our objective is to accurately predict proposals, complete with class scores, ensuring both high recall and high precision.

3.2. Boundary Awareness Network

The architecture of the proposed boundary awareness network is illustrated in Figure 1. The detection process of the BAN consists of three steps, including feature encoding, coarse pyramidal detection, and fine-grained detection.
Feature Encoding. We adopt the I3D [17] architecture as the backbone, taking both RGB and optical flow frames as inputs. This choice is based on I3D's proven efficacy in both action recognition and temporal action localization tasks. Specifically, for a given input video V, we initially derive clip-level features $F = \{ f_i \}_{i=0}^{T-1} \in \mathbb{R}^{T \times C \times H \times W}$ from consecutive clips, where T, C, H, and W represent the time step, channel, height, and width, respectively. We then compress the features along the last three dimensions into a one-dimensional feature sequence $F \in \mathbb{R}^{T \times C}$, encapsulating both the appearance and motion aspects of the video. In our study, the outputs from the final two layers are utilized as the predictive features.
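As a small illustration of this flattening step, the sketch below compresses clip-level I3D features into the T × C sequence; spatial average pooling is an assumption here, since the text does not specify the pooling operator.

```python
import torch

# Clip-level I3D features (T, C, H, W); random values stand in for real features.
clip_feats = torch.randn(256, 1024, 3, 3)

# Compress the spatial dimensions into a one-dimensional temporal sequence (T, C).
# Average pooling over (H, W) is assumed; the paper only states that the last
# dimensions are compressed.
seq_feats = clip_feats.mean(dim=(2, 3))
print(seq_feats.shape)  # torch.Size([256, 1024])
```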
Coarse Pyramidal Detection. To integrate the spatial dimension and aggregate the temporal dimension across multiple layers, we use a feature pyramid network with several temporal convolutions. Following recent methods [12], after feature encoding we obtain a feature map $F \in \mathbb{R}^{T \times C}$ from the raw video clips V, and then use it to generate the FPN-level features $F_{fpn} = \{ f^l \in \mathbb{R}^{T_l \times c} \}$ with a series of convolutional layers, where l denotes a layer of the FPN level. The objective of the feature pyramid network (FPN) layer is to generate feature representations at multiple scales, ensuring that each level of the feature maps possesses robust semantic details pertinent to temporal action localization. We then generate coarse prediction features by projecting these features into $f_{cls}$ and $f_{loc}$, which contain action classification information and action position information, respectively. The coarse start and end boundary relative distances are obtained according to the following formula:
$$d_i^{s,e} = F_t(f_{cls,loc})$$
where $F_t$ is one layer of temporal convolution shared among all FPN layers. Finally, we can calculate the absolute boundary positions as follows:
$$\varphi_i = i \cdot 2^l - d_i^{s}, \qquad \phi_i = i \cdot 2^l + d_i^{e}$$
Subsequently, at each specific location, the predicted temporal regions are used to generate fine-grained features with Gaussian boundary pooling.
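Before moving to the fine-grained stage, one way the coarse prediction step above could be implemented is sketched below; the module layout, channel sizes, kernel sizes, and the ReLU used to keep the predicted distances non-negative are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class CoarseHead(nn.Module):
    """Illustrative coarse prediction head shared across FPN levels (layout assumed)."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        # F_t: one temporal convolution shared among all FPN layers
        self.shared_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.cls_head = nn.Conv1d(channels, num_classes, kernel_size=3, padding=1)
        self.loc_head = nn.Conv1d(channels, 2, kernel_size=3, padding=1)  # (d^s, d^e)

    def forward(self, f_l: torch.Tensor, level: int):
        # f_l: (B, C, T_l) feature map of FPN level `level`
        x = torch.relu(self.shared_conv(f_l))
        cls_logits = self.cls_head(x)            # (B, num_classes, T_l)
        dist = torch.relu(self.loc_head(x))      # (B, 2, T_l); ReLU keeps distances >= 0
        pos = torch.arange(f_l.shape[-1], device=f_l.device) * (2 ** level)  # i * 2^l
        start = pos - dist[:, 0]                 # coarse start boundary per location
        end = pos + dist[:, 1]                   # coarse end boundary per location
        return cls_logits, start, end

# usage with random features standing in for one FPN level
head = CoarseHead(channels=512, num_classes=20)
cls_logits, start, end = head(torch.randn(2, 512, 32), level=3)
```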
Fine-grained Detection. To remedy the limitations of coarse pyramidal detection, we construct fine-grained features via the Gaussian boundary module, as shown in Figure 2. Concatenating these features, we obtain a boundary-awareness feature as follows:
$$f_{BAP} = [\, f, f_s, f_e, \hat{f}_s, \hat{f}_e \,]$$
where $f$, $f_s$, $f_e$, $\hat{f}_s$, and $\hat{f}_e$ denote the frame-level feature, the start and end frame-level features, and the start and end boundary features obtained by Gaussian boundary pooling, respectively, as detailed in Section 3.3. Additionally, we implement Boundary Differentiated Learning (BDL) as a strategy to differentiate between action and non-action segments, effectively increasing the distinction between action features and the background, as detailed in Section 3.4.

3.3. Gaussian Boundary Module

The accurate localization of action boundaries within videos is a paramount challenge in temporal action localization, primarily due to the inherent ambiguity of action boundaries. The Gaussian boundary module (GBM) is designed to tackle this challenge by aggregating a comprehensive range of features to extract highly discriminative boundary features for accurate action prediction. These features encompass frame features along with start and end features, which are crucial for delineating the precise extents of actions.
The cornerstone of the GBM is the innovative use of Gaussian boundary pooling, a technique that significantly differs from traditional convolutional methods. This choice is motivated by our dual detection network structure, which incorporates both coarse-grained and fine-grained detection mechanisms. In this context, the coarse-grained detection outcomes serve not only as preliminary action proposals but also as critical guides for the subsequent fine-grained detection phase. This synergy is essential for refining the detection process and achieving high precision in boundary localization.
Gaussian boundary pooling operates by applying a Gaussian kernel centered on the predicted action boundaries from the coarse detection phase. The kernel’s function is to weight the features according to their spatial–temporal proximity to the center, with closer features receiving higher weights. This method ensures that the boundary features are not just moment-level snapshots, but rather encompass a segment range, capturing the nuanced transitions that characterize the start and end of actions. The Gaussian kernel’s unique property of emphasizing features near the predicted boundary while smoothly decreasing the influence of distant features allows the GBM to extract boundary features that are both highly localized and contextually enriched.
This approach is particularly effective in capturing discriminative features that are vital for distinguishing between closely spaced actions and for accurately identifying action boundaries amidst complex background activities. To address this issue, we propose a novel differentiable pooling method called Gaussian boundary pooling, which is defined as follows:
$$\hat{f}_s^k = f_{frame} \cdot g(x,t)$$
where $\cdot$ denotes matrix multiplication. The function $g(x,t)$ is specified as follows:
$$g(x,t) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(P - S^{T})^{2}}{2\sigma^{2}} \right)$$
where $P \in \mathbb{N}^{T \times T}$ is an automatically generated continuous sequence, and $S \in \mathbb{R}^{C \times T}$ consists of the k-th start and end time stamps calculated in the coarse pyramidal detection stage. We empirically set the standard deviation $\sigma$ to 0.5.
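A minimal, differentiable sketch of Gaussian boundary pooling under our reading of the formula above is given below: a Gaussian kernel centered on each coarse boundary time stamp weights the frame-level features, which are then aggregated by matrix multiplication. The tensor layout, the per-proposal weight normalization, and the units of the boundary time stamps are assumptions rather than the authors' implementation.

```python
import math
import torch

def gaussian_boundary_pooling(frame_feat: torch.Tensor,
                              boundary_ts: torch.Tensor,
                              sigma: float = 0.5) -> torch.Tensor:
    """Aggregate frame-level features around predicted boundary time stamps.

    frame_feat:  (B, C, T) frame-level features
    boundary_ts: (B, N) predicted boundary positions, in the same units as the T axis
    returns:     (B, C, N) pooled boundary features
    """
    T = frame_feat.shape[-1]
    positions = torch.arange(T, dtype=frame_feat.dtype, device=frame_feat.device)
    # (B, N, T): distance of every temporal position to each predicted boundary
    diff = positions.view(1, 1, T) - boundary_ts.unsqueeze(-1)
    weights = torch.exp(-diff.pow(2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # normalize per proposal
    # weighted aggregation: (B, C, T) @ (B, T, N) -> (B, C, N)
    return torch.bmm(frame_feat, weights.transpose(1, 2))

# pool start-boundary features for two coarse proposals per clip
feat = torch.randn(2, 512, 256)
starts = torch.tensor([[12.0, 87.5], [40.0, 200.0]])
f_hat_start = gaussian_boundary_pooling(feat, starts, sigma=0.5)  # (2, 512, 2)
```

Because the weights are a smooth function of the predicted time stamps, gradients flow back to the coarse detection stage, which is the reason for preferring this pooling over a hard, non-differentiable max selection.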
It is important to note that, although GTAN [5] has a similar structure, the two designs are fundamentally different: the Gaussian kernel in our work is driven by the coarse detection results and serves to extract moment-range boundary features, whereas GTAN [5] employs Gaussian kernels to characterize action proposals of varying durations.

3.4. Boundary Differentiated Learning

In the realm of action detection within continuous video streams, the precision in delineating action boundaries is a nuanced challenge due to the inherently fuzzy nature of these boundaries. Traditional approaches often rely on discrete feature representation for action-versus-background discrimination, inadvertently leading the model to emphasize features that might not be critical for distinguishing subtle variations in action boundaries. This issue stems from the binary classification framework, which inherently lacks the granularity to accommodate the continuous variation observed in the temporal dynamics of actions.
To address this challenge and enhance the model’s sensitivity to the continuum of action boundary features, we introduced Boundary Differentiated Learning (BDL). At its core, BDL re-envisions the task of action boundary detection by transitioning from a binary classification paradigm to a regression framework. This shift allows for a more nuanced representation of boundary features, reflecting the continuous nature of actions and their transitions.
The BDL strategy employs a novel approach to label assignment, where action boundaries are treated as a spectrum rather than discrete points. Specifically, we reformulated the label assignment process by introducing a continuous label space for the boundary regions. This is achieved through a re-scaling function applied to the ground truth labels, effectively transforming the binary classification task into a regression problem. The re-scaling function is designed to gradually adjust the label values based on their proximity to the action boundaries, thereby creating a gradient of importance that mirrors the fuzzy nature of these boundaries in real-world scenarios. The formula for re-measured boundary labels is as follows:
$$I_A(x) = \begin{cases} 1 & \text{if } x \in Action \\ f(x) & \text{if } x \in Boundary \\ 0 & \text{if } x \in Background \end{cases}$$
where the set $Boundary \subset \mathbb{R}^{l}$ contains the predefined boundary area, and $f(x)$ is a re-measure function:
$$f(x) = 1 - \frac{1}{d_{action} + x}$$
where $d_{action}$ is the length of the action, and $x$ represents the position of the label. After such a re-measurement process, the boundary between action and background is no longer clear-cut but slowly changing. During model training, such a regression task enables the model to extract more robust boundary features. Following this, we proceed to compute the cross entropy:
$$L_{action} = CE_{loss}(\hat{j}, f_{model})$$
where CE denotes the cross entropy loss function. In addition, to further improve the model accuracy and to facilitate the comparison with the baseline, we also used the Boundary Consistency Learning [12] strategy.
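To make the label re-measurement concrete, the sketch below follows our reconstruction of $I_A(x)$ and $f(x)$; the boundary band width and the use of absolute frame indices for the position $x$ are illustrative assumptions.

```python
import torch

def remeasure(x: torch.Tensor, d_action: float) -> torch.Tensor:
    # Reconstructed re-measure function f(x) = 1 - 1 / (d_action + x).
    return 1.0 - 1.0 / (d_action + x)

def bdl_labels(clip_len: int, start: int, end: int, band: int = 5) -> torch.Tensor:
    """Piecewise label I_A(x): 1 inside the action, f(x) over the boundary bands,
    0 in the background. The band width and index conventions are assumptions."""
    labels = torch.zeros(clip_len)
    labels[start:end] = 1.0                      # x in Action
    d_action = float(end - start)
    for lo, hi in ((max(start - band, 0), start), (end, min(end + band, clip_len))):
        if hi > lo:                              # x in Boundary
            x = torch.arange(lo, hi, dtype=torch.float32)
            labels[lo:hi] = remeasure(x, d_action)
    return labels                                # remaining positions: Background (0)

# 128-frame clip, action between frames 40 and 80, 5-frame boundary bands
soft_targets = bdl_labels(128, 40, 80)
```

The resulting soft targets replace the hard 0/1 labels in the actionness regression branch, which is what turns the binary classification task into the regression task described above.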
The incorporation of BDL into our model framework significantly bolsters its capability to discern and precisely localize action boundaries, even in challenging scenarios characterized by ambiguous or subtle transitions. By acknowledging and effectively leveraging the continuous nature of action boundaries, BDL adds substantial value to our model, enhancing its overall performance in temporal action localization tasks. This refined learning strategy, coupled with our model’s architecture, marks a significant advancement in the field, pushing the boundaries of what is achievable in action detection within continuous video streams.

3.5. Training and Inference

Label Assignment. Following the AFSD [12], we evaluate the temporal Intersection over Union (tIoU) between each initial boundary prediction and its respective ground truth. A specific location i is considered positive if its tIoU score exceeds 0.5. Let $N_C$ and $N_R$ represent the counts of positive samples for the initial and refined predictions, respectively. For the initial prediction, we label location i as a positive sample for ground truth j if the condition $\varphi_j \leq i \leq \phi_j$ is met. Moreover, within the boundary region, we assign position i to the recalibrated ground truth $\hat{j}$.
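The tIoU test used for this assignment can be sketched as follows; the (start, end) segment layout and the helper itself are illustrative conveniences rather than the authors' code.

```python
import torch

def tiou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Temporal IoU between predicted segments (N, 2) and ground-truth segments (M, 2),
    each row being (start, end); returns an (N, M) matrix."""
    ps, pe = pred[:, None, 0], pred[:, None, 1]
    gs, ge = gt[None, :, 0], gt[None, :, 1]
    inter = (torch.minimum(pe, ge) - torch.maximum(ps, gs)).clamp(min=0)
    union = (pe - ps) + (ge - gs) - inter
    return inter / union.clamp(min=1e-6)

# a coarse prediction is treated as positive if its best tIoU exceeds 0.5
preds = torch.tensor([[10.0, 50.0], [60.0, 70.0]])
gts = torch.tensor([[12.0, 48.0]])
positive = tiou(preds, gts).max(dim=1).values > 0.5
```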
Training Objective. The BAN was trained using a composite multitask loss function, which encompasses the losses for coarse detection, fine-grained detection, action classification, and an L2 regularization term:
$$L_{total} = L_{coarse} + L_{fine\text{-}grained} + \gamma L_{\Theta}$$
where $L_{\Theta}$ is the L2 regularization term and $\gamma$ is a hyperparameter, set to 0.001.
Loss of coarse detection. In coarse detection, we directly process the FPN-level features with convolution to obtain the class of the action and the boundary time stamp. The optimization of the coarse detection can be achieved using the subsequent objective function:
$$L_{coarse} = \lambda_1 L_{cls} + \lambda_2 L_{loc}$$
where $\lambda_1$ is set to 1, and $\lambda_2$ is set to 10 and 1 on THUMOS14 and ActivityNet-1.3, respectively. $L_{cls}$ is a softmax focal loss, and $L_{loc}$ is a tIoU loss.
Loss of fine-grained detection. In fine-grained detection, the loss function can be optimized through the application of the following specific function:
$$L_{fine\text{-}grained} = \lambda_1 L_{cls}^{f} + \lambda_2 L_{loc}^{f} + L_{BCL} + \gamma_1 L_{action}$$
where $\lambda_1$ and $\lambda_2$ are set as in coarse detection, and $\gamma_1$ is set to 10 and 1 on THUMOS14 and ActivityNet-1.3, respectively. The composition of $L_{BCL}$ is consistent with that described in the AFSD [12], and $L_{loc}^{f}$ is an L1 loss.
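Putting the terms together, the composite objective can be sketched as below; the weight values follow the text, while the dictionary keys and the explicit L2 term over the model parameters are assumptions.

```python
import torch
import torch.nn as nn

def total_loss(losses: dict, model: nn.Module, dataset: str = "THUMOS14") -> torch.Tensor:
    """L_total = L_coarse + L_fine-grained + gamma * L_Theta (sketch).
    `losses` holds scalar tensors for the individual loss terms."""
    lam1 = 1.0
    lam2 = 10.0 if dataset == "THUMOS14" else 1.0
    gamma1 = 10.0 if dataset == "THUMOS14" else 1.0
    gamma = 1e-3
    l_coarse = lam1 * losses["cls"] + lam2 * losses["loc"]
    l_fine = (lam1 * losses["cls_f"] + lam2 * losses["loc_f"]
              + losses["bcl"] + gamma1 * losses["action"])
    l_theta = sum(p.pow(2).sum() for p in model.parameters())  # L2 regularization term
    return l_coarse + l_fine + gamma * l_theta
```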
Inference. In the prediction phase, the BAN is utilized as outlined in the subsequent methodology:
$$\varphi_{start} = \varphi_{start}^{coarse} + \frac{1}{2}\, d^{coarse} \cdot \varphi_{start}^{fine\text{-}grained}$$
where $\varphi_{start}^{coarse}$ denotes the action start time obtained from the coarse-grained prediction, and $\varphi_{start}^{fine\text{-}grained}$ denotes the fine-grained action start time. The end time is expressed in the same way as the start time. Subsequently, we compile all the predictions and apply Soft-NMS [20] to eliminate overlapping proposals.
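The inference-time refinement and the Soft-NMS step can be sketched as follows, assuming $d^{coarse}$ denotes the duration of the coarse proposal and that the fine-grained head predicts an offset for each boundary; the Gaussian Soft-NMS here is a standard formulation, not the authors' exact implementation.

```python
import torch

def _tiou_row(seg: torch.Tensor, segs: torch.Tensor) -> torch.Tensor:
    # tIoU of one segment (2,) against a batch of segments (N, 2)
    inter = (torch.minimum(seg[1], segs[:, 1]) - torch.maximum(seg[0], segs[:, 0])).clamp(min=0)
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / union.clamp(min=1e-6)

def refine_boundaries(coarse: torch.Tensor, fine_offsets: torch.Tensor) -> torch.Tensor:
    """Refine coarse proposals (N, 2) with fine-grained offsets (N, 2) using the
    reconstructed rule: boundary + 0.5 * coarse duration * predicted offset."""
    duration = (coarse[:, 1] - coarse[:, 0]).unsqueeze(1)
    return coarse + 0.5 * duration * fine_offsets

def soft_nms(segments: torch.Tensor, scores: torch.Tensor,
             sigma: float = 0.5, score_thresh: float = 1e-3):
    """Gaussian Soft-NMS over temporal segments (standard variant, parameters assumed)."""
    keep = []
    segments, scores = segments.clone(), scores.clone()
    while scores.numel() > 0:
        i = int(torch.argmax(scores))
        keep.append((segments[i].tolist(), float(scores[i])))
        ious = _tiou_row(segments[i], segments)
        scores = scores * torch.exp(-(ious ** 2) / sigma)  # decay overlapping scores
        mask = scores > score_thresh
        mask[i] = False
        segments, scores = segments[mask], scores[mask]
    return keep
```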

4. Experiments

4.1. Datasets

THUMOS14 [21] and ActivityNet1.3 [22] were adopted to demonstrate the performance of our BAN against numerous SOTA methods.
THUMOS14 [21]. The dataset includes 1010 videos in the validation set and 1574 in the testing set, spanning 20 distinct categories. For temporal action localization specifically, it features 200 untrimmed videos in the validation subset and 212 in the testing subset, each with temporal annotations. Our model was trained on the validation subset and assessed on the test subset.
ActivityNet-1.3 [22]. For our purposes, we utilized version 1.3 of the dataset, which comprises 10,024 training videos, 4926 validation videos, and 5044 testing videos, covering 200 distinct activity classes. Most of these videos contain instances of a single activity class occupying a substantial portion of their duration. In comparison with THUMOS14, it is a more extensive dataset, both in the diversity of activities and in the total volume of video content. The training, validation, and testing sets follow a 2:1:1 ratio.

4.2. Implementation Details

In our work with the THUMOS14 dataset [21], video encoding was performed at a rate of 10 frames per second (fps) and a spatial resolution of 96 × 96. In line with the methodology of Lin et al. [12], consecutive video clips were produced using sliding windows. Each clip was set to a temporal length of 256 frames, with an overlap of 30 frames in the training phase and 128 frames during testing. For the ActivityNet-1.3 dataset [22], we adjusted the frame sampling rates to guarantee that each video contained a total of 768 frames; as a result, each video is represented by a single clip of 768 frames. The frame resolution was 96 × 96, the same as for THUMOS14. During training, data augmentation strategies such as random cropping and horizontal flipping were utilized. For feature extraction, we used I3D [17] as the backbone, initialized with parameters pre-trained on the Kinetics [23] dataset. Our experiments used a modified version of the AFSD code [12], tailored to support our boundary detection and learning components while maintaining a comparable baseline with the original model.
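For reference, clip generation under these settings can be sketched as below; treating the quoted "30 frames" as the overlap (rather than the stride) and the tail-covering behaviour are our assumptions.

```python
def sliding_window_clips(num_frames: int, clip_len: int = 256, overlap: int = 30):
    """Enumerate (start, end) frame indices of overlapping clips; stride = clip_len - overlap.
    The last clip is shifted back so the tail of the video is always covered."""
    stride = clip_len - overlap
    starts = list(range(0, max(num_frames - clip_len, 0) + 1, stride))
    if not starts or starts[-1] + clip_len < num_frames:
        starts.append(max(num_frames - clip_len, 0))
    return [(s, s + clip_len) for s in starts]

# a 10 fps, 2-minute THUMOS14 video has 1200 frames
train_clips = sliding_window_clips(1200, overlap=30)    # training-time overlap
test_clips = sliding_window_clips(1200, overlap=128)    # testing-time overlap
```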
Our approach was implemented in Python 3.8 using PyTorch on a GPU server with an NVIDIA GeForce GTX 1080 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). The number of training epochs was set to 16 for both THUMOS14 and ActivityNet-1.3. We utilized the Adam [24] optimizer with a learning rate of $10^{-5}$ and a weight decay of $10^{-3}$. Across all experiments, we consistently used a batch size of 1. During training, the loss weight λ was set to 10 for THUMOS14 and 1 for ActivityNet-1.3, while γ was fixed at 10. In the testing phase, we averaged the results from the RGB and optical flow streams to determine the final locations and class scores. The temporal Intersection over Union (tIoU) threshold for Soft-NMS was set to 0.5 for THUMOS14 and 0.85 for ActivityNet-1.3.

4.3. Metrics

For the task of temporal action detection, the mean Average Precision (mAP) served as the primary evaluation metric. In the case of ActivityNet-1.3, we utilized mAP at IoU thresholds of 0.5, 0.75, and 0.95, along with an average mAP calculated over a range of IoU thresholds from 0.5 to 0.95 in increments of 0.05. For THUMOS14, the mAP was assessed at IoU thresholds of 0.3, 0.4, 0.5, 0.6, and 0.7.
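A simplified sketch of the evaluation for a single class at a single tIoU threshold is given below; the data structures and the greedy matching rule are common conventions, not the official evaluation code.

```python
import numpy as np

def average_precision(preds, gts, tiou_thresh=0.5):
    """AP for one action class at one tIoU threshold (simplified sketch).
    preds: list of (video_id, start, end, score); gts: dict video_id -> list of (start, end)."""
    preds = sorted(preds, key=lambda p: -p[3])
    matched = {vid: np.zeros(len(segs), dtype=bool) for vid, segs in gts.items()}
    npos = sum(len(v) for v in gts.values())
    tp, fp = np.zeros(len(preds)), np.zeros(len(preds))
    for k, (vid, s, e, _) in enumerate(preds):
        best_iou, best_j = 0.0, -1
        for j, (gs, ge) in enumerate(gts.get(vid, [])):
            inter = max(0.0, min(e, ge) - max(s, gs))
            union = (e - s) + (ge - gs) - inter
            iou = inter / union if union > 0 else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= tiou_thresh and not matched[vid][best_j]:
            tp[k], matched[vid][best_j] = 1, True   # first match of this ground truth
        else:
            fp[k] = 1                               # duplicate or low-overlap detection
    recall = np.cumsum(tp) / max(npos, 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-6)
    return float(np.trapz(precision, recall))      # area under the precision-recall curve

# mAP averages AP over classes; the averaged mAP on ActivityNet additionally
# averages over the tIoU thresholds 0.5:0.05:0.95.
```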

4.4. Main Results

Our model was benchmarked against the leading methods in the field, as shown in Table 1 and Table 2, across two demanding datasets, with a detailed report on the specifics employed by each approach, including BSN [8], GTAN [5], BMN [9], and the AFSD [12]. In the evaluation of our boundary awareness network (BAN) for temporal action localization, we conducted an extensive comparison with existing state-of-the-art methods, with a particular focus on the AFSD method, recognized for its leading performance in the domain.
Our BAN method introduced a novel Gaussian boundary module (GBM) and Boundary Differentiated Learning (BDL) strategy, specifically designed to enhance the detection of action boundaries in video sequences. This approach demonstrated superior performance on the challenging benchmark dataset THUMOS14 and showed promising results on ActivityNet-1.3, indicating its effectiveness across different contexts.
On the THUMOS14 dataset, the BAN achieved a mean Average Precision (mAP) of 56.4% at an IoU threshold of 0.5, outperforming the AFSD method, which recorded an mAP of 55.5%. This improvement is even more pronounced at higher IoU thresholds. For instance, at an IoU threshold of 0.7, the BAN achieved an mAP of 32.7%, which is a significant increase from the AFSD’s 31.1%. Such results highlight the BAN’s enhanced ability to accurately identify action boundaries, even in more stringent evaluation settings.
For ActivityNet v1.3, as detailed in Table 2, the BAN demonstrates a competitive performance with an mAP of 52.5% at an IoU threshold of 0.5 and shows significant improvement in more challenging conditions, with an mAP of 35.5% at an IoU threshold of 0.75 and an average mAP of 34.6%. These metrics not only affirm the model’s robustness across varied datasets but also underscore its capacity to excel in identifying precise action boundaries amidst diverse and complex video contents.
The advantages of the BAN are attributed to its GBM, which effectively aggregates features around action boundaries, providing a richer and more discriminative representation of boundary features. Moreover, the BAN’s Boundary Differentiated Learning (BDL) strategy facilitates the model in distinguishing between action and non-action segments more effectively, enhancing the overall precision of action localization.
In conclusion, the comparative analysis underscores the BAN’s advancements over existing methods, including the AFSD. By focusing on the intricate detection of action boundaries and employing a refined learning strategy, the BAN sets a new benchmark for temporal action localization, promising significant improvements for real-world applications.

4.5. Ablation Study

We conducted multiple ablation tests on THUMOS14 to demonstrate the impact of key components of the BAN. In this section, we present experiments with the goal of identifying the main elements that contribute to high-quality temporal action detection.
The Effect of the Gaussian Boundary Module. To ascertain the specific contribution of the Gaussian boundary module (GBM) to temporal action localization (TAL), we devised a comprehensive series of ablation experiments. These experiments were designed to evaluate two primary aspects: (1) the impact of varying the range of boundary features on detection accuracy, and (2) the efficacy of different Gaussian boundary pooling methods within the GBM.
The results, detailed in Table 3, illustrate a clear trend: as the boundary feature range narrows (indicated by a decrease in σ), the localization accuracy improves significantly, peaking at σ = 0.5. The parameter σ is the standard deviation of the Gaussian function, with lower values indicating a more focused aggregation of features around the action boundaries. The optimal performance at σ = 0.5 underscores the necessity for boundary features to be scope-specific, enabling the precise delineation of action start and end points, as opposed to being broadly defined or moment-specific.
Impact of Boundary Differentiated Learning (BDL). Further, to quantify the influence of Boundary Differentiated Learning (BDL) on the model’s performance, we compared the BAN’s effectiveness with and without the implementation of BDL. This comparison revealed that BDL significantly enhances the model’s ability to distinguish between action and non-action segments, thereby improving the overall accuracy of action localization. The integration of BDL into the BAN led to a marked increase in mean Average Precision (mAP) across various IoU thresholds, particularly highlighting its role in refining the model’s predictive capabilities in complex video sequences where action boundaries may be less distinct.
Combined Influence of the GBM and BDL. In addition to evaluating the individual contributions of the GBM and BDL, our study also examined their combined effect on the model’s performance. The full incorporation of both the GBM and BDL into the BAN resulted in the highest observed improvements in localization accuracy, demonstrating the synergistic impact of these components. This combined configuration significantly outperformed other variations, including models with either component omitted, thereby validating the integral role of both the GBM and BDL in enhancing the efficacy of the BAN for TAL.
The findings from these ablation studies are encapsulated in Table 4, which presents a detailed breakdown of performance metrics under various configurations of the BAN model. Through this rigorous analysis, we have established the critical contributions of both the GBM and BDL to the overall performance of the BAN, affirming their value in our proposed framework for temporal action localization.

4.6. Comparison of Inference Speed

In this section, we compare the detection speed of our model with that of other leading methods. The inference speed on the THUMOS14 dataset was assessed in terms of frames per second (fps) across various models, as outlined in Table 5. RGB frames were employed as inputs for all networks to ensure a balanced comparison. Our findings indicate that our model operates significantly faster than most existing methods, with only a marginal difference compared to the benchmark model AFSD. Specifically, our model ran at 3197 fps on a single GTX 1080 Ti (NVIDIA Corporation, Santa Clara, CA, USA) GPU for temporal action localization, rising to 4291 fps on the more recent RTX 2080 Ti (NVIDIA Corporation, Santa Clara, CA, USA) GPU.
While our model demonstrated a high inference speed, making it suitable for near-real-time applications, we acknowledge the ongoing need to enhance its efficiency for real-time processing in live video streams. Future optimizations may include algorithmic improvements, model simplification, and leveraging edge computing technologies to further reduce latency and computational requirements. These efforts aim to expand the applicability of our model to real-time scenarios, ensuring it can deliver a high performance in a wider range of practical applications.

4.7. Visualization Results

In this section, we compare the visualization results of our method and the state-of-the-art method AFSD [12] on the THUMOS14 test set; the results are shown in Figure 3. According to these results, our proposed method locates action boundaries more accurately.

5. Conclusions

In this study, we introduced a novel action localization framework named BAN, which achieves state-of-the-art performance on standard benchmarks. Our method generates precise boundary proposals by aggregating features around the boundary and refines these predictions to make more accurate boundary determinations. We proposed the Gaussian boundary module (GBM) to obtain accurate boundary features, coupled with a novel Boundary Differentiated Learning (BDL) strategy to discern discriminative action information effectively. Extensive experiments conducted on well-known benchmarks, including ActivityNet-1.3 and THUMOS14, underscore the BAN’s advanced capabilities in comparison with existing state-of-the-art methods, including the AFSD. This contribution significantly enriches the community’s toolbox for video understanding, offering a powerful means to tackle the nuanced challenges of action localization with unprecedented precision. Moreover, by setting new benchmarks, the BAN paves the way for future research to build upon our findings, fostering innovation and further exploration in the realm of video analysis.
Acknowledging the limitations of our approach is crucial for setting realistic expectations and guiding future work; while the BAN represents a significant advancement in temporal action localization, its performance is contingent upon the quality of feature representations and may encounter challenges in cluttered scenes where actions are less distinct. These limitations highlight the importance of ongoing research efforts to enhance the robustness of the BAN, particularly in complex scenarios. Future directions include exploring advanced techniques for feature extraction and representation, aiming to improve the model’s adaptability and effectiveness across a broader range of video content. Moreover, addressing these challenges will involve developing methods that can more effectively navigate the complexities of video scenes, thereby ensuring the BAN’s relevance and applicability as video analysis challenges evolve. To further improve performance in complex scenes, we plan to explore the potential of integrating context-aware models, such as scene segmentation or object recognition. These models could provide richer contextual cues for action boundary detection, thereby improving action localization accuracy in visually complex scenes. By introducing this level of context understanding, we anticipate future iterations of our framework to better handle the subtle distinctions between actions and backgrounds, offering deeper insights for video analysis. Additionally, we aim to incorporate more advanced deep learning techniques, like attention mechanisms or transformers, in future iterations to more accurately capture the temporal dynamics of videos, further enhancing our model’s ability to discern and localize actions within complex video sequences. To address the dependence on quality feature representations, we also intend to explore unsupervised or semi-supervised learning methods in future work. These approaches could enhance the model’s ability to learn from unlabeled data, potentially reducing the reliance on extensively annotated datasets and improving the model’s generalization capabilities.
In light of the advancements in temporal action localization (TAL) demonstrated by our model, it is crucial to acknowledge the ethical implications associated with its application. The potential for enhancing understanding and interaction with video content comes with a responsibility to consider privacy concerns and guard against misuse. As TAL technologies become increasingly capable, their application in surveillance, media analysis, and beyond must be approached with a commitment to ethical standards and privacy protection. We advocate for ongoing dialogue and the establishment of robust guidelines to ensure that the benefits of TAL are realized in a manner that respects individual rights and societal values.

Author Contributions

J.G.: As the first author, J.G. was instrumental in the conceptualization and design of the study. He was primarily responsible for the development and execution of the coding and experimental work. Additionally, J.G. contributed to part of the manuscript writing, ensuring that the experimental results and methodologies were accurately and clearly presented. Y.Y.: As the second author, Y.Y. provided comprehensive guidance throughout the project. Her role was pivotal in overseeing the research direction and methodology. Y.Y.’s extensive experience and knowledge were crucial in formulating the research questions and in interpreting the results. She also provided significant input during the manuscript revision stages, ensuring the overall quality and integrity of the research. M.W.: The third author, M.W., played a significant role in the manuscript writing process. She actively participated in drafting the manuscript and revising it critically for important intellectual content. Her contributions also included engaging in discussions and providing insights that significantly shaped the research and analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the National Natural Science Foundation of China (No. 61672546) and the Guangzhou Science and Technology Project (No. 201707010127).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the editor and anonymous reviewers for their contributions towards improving the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lin, T.; Zhao, X.; Shou, Z. Single shot temporal action detection. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 988–996. [Google Scholar]
  2. Gao, J.; Yang, Z.; Chen, K.; Sun, C.; Nevatia, R. Turn tap: Temporal unit regression network for temporal action proposals. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3628–3636. [Google Scholar]
  3. Xu, H.; Das, A.; Saenko, K. R-C3D: Region convolutional 3D network for temporal activity detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5783–5792. [Google Scholar]
  4. Chao, Y.-W.; Vijayanarasimhan, S.; Seybold, B.; Ross, D.A.; Deng, J.; Sukthankar, R. Rethinking the Faster R-CNN architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1130–1139. [Google Scholar]
  5. Long, F.; Yao, T.; Qiu, Z.; Tian, X.; Luo, J.; Mei, T. Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 344–353. [Google Scholar]
  6. Liu, Q.; Wang, Z. Progressive boundary refinement network for temporal action detection. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11612–11619. [Google Scholar] [CrossRef]
  7. Zhao, Y.; Xiong, Y.; Wang, L.; Wu, Z.; Tang, X.; Lin, D. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2914–2923. [Google Scholar]
  8. Lin, T.; Zhao, X.; Su, H.; Wang, C.; Yang, M. BSN: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  9. Lin, T.; Liu, X.; Li, X.; Ding, E.; Wen, S. BMN: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3889–3898. [Google Scholar]
  10. Lin, C.; Li, J.; Wang, Y.; Tai, Y.; Luo, D.; Cui, Z.; Wang, C.; Li, J.; Huang, F.; Ji, R. Fast learning of temporal action proposal via dense boundary generator. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11499–11506. [Google Scholar] [CrossRef]
  11. Shou, Z.; Chan, J.; Zareian, A.; Miyazawa, K.; Chang, S.-F. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5734–5743. [Google Scholar]
  12. Lin, C.; Xu, C.; Luo, D.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Fu, Y. Learning salient boundary feature for anchor-free temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3320–3329. [Google Scholar]
  13. Vaudaux-Ruth, G.; Chan-Hon-Tong, A.; Achard, C. SALAD: Self-Assessment Learning for Action Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1269–1278. [Google Scholar]
  14. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  15. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  16. Qiu, H.; Ma, Y.; Li, Z.; Liu, S.; Sun, J. BorderDet: Border Feature for Dense Object Detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 549–564. [Google Scholar]
  17. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  18. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef] [PubMed]
  19. Yang, L.; Peng, H.; Zhang, D.; Fu, J.; Han, J. Revisiting Anchor Mechanisms for Temporal Action Localization. IEEE Trans. Image Process. 2020, 29, 8535–8548. [Google Scholar] [CrossRef]
  20. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–Improving Object Detection with One Line of Code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar]
  21. Idrees, H.; Zamir, A.R.; Jiang, Y.-G.; Gorban, A.; Laptev, I.; Sukthankar, R.; Shah, M. The THUMOS challenge on action recognition for videos “in the wild”. Comput. Vis. Image Underst. 2017, 155, 1–23. [Google Scholar]
  22. Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 961–970. [Google Scholar]
  23. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
  24. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  25. Gao, J.; Yang, Z.; Nevatia, R. Cascaded boundary regression for temporal action detection. arXiv 2017, arXiv:1705.01180. [Google Scholar]
  26. Xu, M.; Zhao, C.; Rojas, D.S.; Thabet, A.; Ghanem, B. G-TAD: Sub-graph localization for temporal action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10156–10165. [Google Scholar]
  27. Zhao, P.; Xie, L.; Ju, C.; Zhang, Y.; Wang, Y.; Tian, Q. Bottom-up temporal action localization with mutual regularization. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 539–555. [Google Scholar]
  28. Bai, Y.; Wang, Y.; Tong, Y.; Yang, Y.; Liu, Q.; Liu, J. Boundary content graph neural network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 121–137. [Google Scholar]
  29. Liu, X.; Wang, Q.; Hu, Y.; Tang, X.; Bai, S.; Bai, X. End-to-end temporal action detection with transformer. arXiv 2021, arXiv:2106.10271. [Google Scholar] [CrossRef]
Figure 1. Architecture of the boundary awareness network (BAN) and its comparison with the AFSD. Panel (a) outlines the BAN structure, illustrating key components such as the Gaussian boundary module for precise boundary detection. Panel (b) presents a side-by-side structural comparison with the AFSD, highlighting the distinct features of the BAN, including Boundary Differentiated Learning (BDL) which is activated during training. This comparison delineates the specific architectural differences, demonstrating the BAN’s approach to achieving more accurate action delineation.
Figure 2. An illustration of the Gaussian boundary module.
Figure 3. Visualization results of our method and AFSD on THUMOS14.
Table 1. Comparison with state-of-the-art methods on THUMOS14 testing set.
| Type | Model | Backbone | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | Avg. |
|---|---|---|---|---|---|---|---|---|
| Anchor-based | SSAD [1] | TS | 43.0 | 35.0 | 24.6 | - | - | - |
| | TURN [2] | C3D | 44.1 | 34.9 | 25.6 | - | - | - |
| | RC3D [3] | C3D | 44.8 | 35.6 | 28.9 | - | - | - |
| | CBR [25] | TS | 50.1 | 41.3 | 31.0 | 19.1 | 9.9 | 30.3 |
| | TAL [4] | I3D | 53.2 | 48.5 | 42.8 | 33.8 | 20.8 | 39.8 |
| | GTAN [5] | P3D | 57.8 | 47.2 | 38.8 | - | - | - |
| | PBRNet [6] | I3D | 58.5 | 54.6 | 51.3 | 41.8 | 29.5 | 47.1 |
| Actionness | CDC [11] | - | 40.1 | 29.4 | 23.3 | 13.1 | 7.9 | 22.8 |
| | SSN [7] | TS | 51.0 | 41.0 | 29.8 | - | - | - |
| | BSN [8] | TS | 53.5 | 45.0 | 36.9 | 28.4 | 20.0 | 36.8 |
| | BMN [9] | TS | 56.0 | 47.4 | 28.8 | 29.7 | 20.5 | 38.5 |
| | DBG [10] | TS | 57.8 | 49.4 | 42.8 | 33.8 | 21.7 | 41.1 |
| | GTAD [26] | TS | 54.5 | 47.6 | 40.2 | 30.8 | 23.4 | 39.3 |
| | BUTAL [27] | I3D | 53.9 | 50.7 | 45.4 | 38.0 | 28.5 | 43.3 |
| | BCGNN [28] | TS | 57.1 | 49.1 | 40.4 | 31.2 | 23.1 | 40.2 |
| Anchor-free | TadTR [29] | I3D | 62.4 | 57.4 | 49.2 | 37.8 | 26.3 | 46.6 |
| | AFSD [12] | I3D | 67.3 | 62.4 | 55.5 | 43.7 | 31.1 | 52.0 |
| | BAN (ours) | I3D | 68.8 | 63.9 | 56.4 | 46.1 | 32.7 | 53.6 |
Table 2. Comparison with state-of-the-art methods on ActivityNet-1.3 testing set.
| Type | Model | Backbone | 0.5 | 0.75 | Avg. |
|---|---|---|---|---|---|
| Anchor-based | SSAD [1] | TS | 44.39 | 18.30 | 29.17 |
| | TAL [4] | I3D | 38.2 | 18.3 | 20.2 |
| | GTAN [5] | P3D | 52.6 | 34.1 | 34.3 |
| Actionness | CDC [11] | - | 45.3 | 26.0 | 23.8 |
| | SSN [7] | TS | 43.2 | 28.7 | 28.3 |
| | BSN [8] | TS | 46.5 | 30.0 | 30.0 |
| | BMN [9] | TS | 50.1 | 34.8 | 33.9 |
| | GTAD [26] | TS | 50.4 | 34.6 | 34.1 |
| | BCGNN [28] | TS | 50.6 | 34.8 | 34.3 |
| Anchor-free | TadTR [29] | I3D | 49.1 | 32.6 | 32.3 |
| | AFSD [12] | I3D | 52.4 | 35.3 | 34.4 |
| | BAN (ours) | I3D | 52.5 | 35.5 | 34.6 |
Table 3. The effect of the boundary range on detection. The value σ = 0.5 used in our model achieves the highest performance.
| σ | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | Avg. |
|---|---|---|---|---|---|---|
| 5 | 58.8 | 53.1 | 49.2 | 39.8 | 24.6 | 45.1 |
| 1 | 65.1 | 59.9 | 55.2 | 45.4 | 30.9 | 51.3 |
| 0.5 | 68.8 | 63.9 | 56.4 | 46.1 | 32.7 | 53.6 |
| 0.2 | 67.2 | 62.4 | 55.1 | 45.9 | 32.3 | 52.0 |
Table 4. The effect of the Gaussian boundary module. The second and third columns indicate which feature levels (frame-level, FPN-level) are used for Gaussian boundary pooling.
| Model | Frame-Level | FPN-Level | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | Avg. |
|---|---|---|---|---|---|---|---|---|
| BAN without GBM | | | 61.2 | 58.4 | 49.8 | 38.7 | 25.5 | 46.7 |
| BAN with GBM | | | 66.6 | 61.0 | 53.8 | 44.5 | 30.9 | 49.1 |
| | | | 68.2 | 62.8 | 55.8 | 46.0 | 32.5 | 53.1 |
| | | | 68.1 | 63.4 | 55.6 | 45.8 | 32.3 | 53.0 |
| | | | 68.8 | 63.9 | 56.4 | 46.1 | 32.7 | 53.6 |
Table 5. Activity detection speed during inference.
| Method | SS-TAD | R-C3D | AFSD | PBR | BAN (Ours) | BAN (Ours) |
|---|---|---|---|---|---|---|
| GPU | TITAN Xm | TITAN Xp | 1080Ti | 1080Ti | 1080Ti | 2080Ti |
| FPS | 701 | 1030 | 3259 | 1488 | 3197 | 4291 |

