GLFormer: Global and Local Context Aggregation Network for Temporal Action Detection

Abstract: As the core component of video analysis, Temporal Action Localization (TAL) has experienced remarkable success. However, some issues are not well addressed. First, most of the existing methods process the local context individually, without explicitly exploiting the relations between features in an action instance as a whole. Second, the duration of different actions varies widely; thus, it is difficult to choose the proper temporal receptive field. To address these issues, this paper proposes a novel network, GLFormer, which can aggregate short, medium, and long temporal contexts. Our method consists of three independent branches with different ranges of attention, and these features are then concatenated along the temporal dimension to obtain richer features. One is multi-scale local convolution (MLC), which consists of multiple 1D convolutions with varying kernel sizes to capture the multi-scale context information. Another is window self-attention (WSA), which tries to explore the relationship between features within the window range. The last is global attention (GA), which is used to establish long-range dependencies across the full sequence. Moreover, we design a feature pyramid structure to be compatible with action instances of various durations. GLFormer achieves state-of-the-art performance on two challenging video benchmarks, THUMOS14 and ActivityNet 1.3. Our performance is 67.2% and 54.5% AP@0.5 on the datasets THUMOS14 and ActivityNet 1.3, respectively.


Introduction
In recent years, with the popularization of multimedia devices and the rapid development of the Internet, a dramatically increasing number of videos are produced every day, and relying on people to analyze videos falls far short of actual needs. As a fundamental component of video content analysis technology, temporal action detection (TAD) has attracted more and more interest from industry and academia. The TAD task aims to predict the start/end boundaries and action categories for action instances as accurately as possible in untrimmed video. TAD mainly serves public domains such as security surveillance, precision medicine, and intelligent manufacturing. These applications must collect large amounts of video data, so a distributed storage platform that can detect copyright infringement is particularly important. In addition, there remains the threat of raw video being forged or tampered with; one can refer to [1,2] for further information.
As we all know, image classification and object detection have achieved impressive performance. Most of the previous works were inspired by object detection due to its macroscopic similarity to the temporal action detection task. Notable studies predominantly fall into three paradigms. The first is the temporal convolution paradigm, whose representative works include SSAD [3], R-C3D [4], MGG [5], and A2Net [6], stacking multiple 1D convolutions with a fixed kernel size to extract feature information. Another thread of work, such as [7-9], tackles the TAD task by evaluating frame-level action probabilities and combining consecutive frames through the watershed algorithm to generate proposals.
The third is the boundary-sensitive paradigm, such as BSN [10] and BMN [11]: for each temporal location, these networks predict the probability that it is a boundary or belongs to an action, and then generate proposals based on the boundary probabilities. However, these approaches fall into a local trap: each temporal location is treated independently as an isolated point, resulting in a lack of contextual connection and sensitivity to noise.
To solve this island problem, several works have attempted to establish long-range dependencies. PGCN [12] proposed exploiting the relations between proposals using graph convolution. BSN++ [13] improves on methods that generate proposals based only on boundary probabilities by designing a proposal relation block to explore proposal-proposal relations. These methods follow a local-then-global strategy, in which video-level prediction is achieved by a post-processing step applied to frame-level results. However, these works ignore the fact that the internal features of an action instance form a whole; thus, this two-stage processing strategy fragments their integrity. Inspired by Transformer's success in NLP [14] and object detection [15], several recent works attempted to adopt the Transformer architecture to establish long-range dependencies, such as TadTR [16], RTD-Net [17], and ActionFormer [18].
We observed that action instances with various durations are randomly distributed in the original video. Therefore, if the Transformer structure is applied directly, noise and multi-scale problems arise. To address these issues, we present the global and local context aggregation network (GLFormer). Compared with other visual Transformer methods, we make three significant improvements to adapt to the TAD task. Firstly, to alleviate the noise issue while avoiding the attention distraction caused by excessively long temporal ranges, we replace the original self-attention of Transformer with window self-attention. However, we are convinced that long-range dependencies are crucial for generating proposals; thus, in order to exploit their advantages, the local self-attention module is followed by a lightweight global attention module as another branch. Lastly, to tackle the temporal multi-scale problem, we propose a parallel branch structure, adding a multi-scale local context module in parallel with the window self-attention module. Furthermore, a temporal feature pyramid is constructed by temporal downsampling, and each level is regarded as a stage. The structure we design takes into account multi-scale local context, long-range dependencies, and global information, all of which complement each other. We conduct extensive ablation experiments on the THUMOS14 [19] and ActivityNet 1.3 [20] datasets to verify the effectiveness of our work. In summary, our main contributions are three-fold:
• We design a tandem structure with window self-attention followed by a lightweight global attention module, which can not only establish long-range dependencies, but also effectively avoid the introduction of noise.
• We add a multi-scale local context branch parallel to the window self-attention, forming a dual-branch structure. This stems from our desire to simultaneously take into account local context, long-range dependencies, and global information, which helps adaptively capture temporal context for temporal action detection.
• We design a feature pyramid structure to be compatible with action instances of various durations. Moreover, our network enables end-to-end training and achieves state-of-the-art performance on two representative large-scale human activity datasets, namely THUMOS14 and ActivityNet 1.3.

Action Recognition
Like image recognition in the field of image analysis, as a fundamental task in video understanding areas, action recognition has been extensively investigated in recent years. Traditional methods such as MBH [21], HOF [22], and HOG [23] rely heavily on hand-designed ways to extract features. Inspired by the vast success of convolutional neural networks in the image domain, the current mainstream contains two categories: (a) The first is two-stream networks [24], which take RGB and optical flow as the input. The spatial stream captures the appearance features from the RGB image, while the temporal stream learns the motion information from dense optical flow. (b) The C3D network [25] obtains spatial-temporal feature information directly from the input video. A pre-trained action recognition model is usually used as a feature extractor for TAD tasks. By convention, we adopt I3D [26] pre-trained on the Kinetics-400 [27] dataset to generate the feature sequence as the input for our model.

Temporal Action Localization
Recent approaches in this task can be roughly divided into three categories: (1) The first is local-then-proposal; these methods, such as TAG [8], BSN [10], and R-C3D [4], first extract frame-level or snippet-level features and then generate proposals via action/boundary probabilities or the distance from the boundary. (2) Next is proposal-and-proposal; since processing each proposal individually ignores the semantic relationships between proposals, PGCN [12] constructs a graph of proposals to explore proposal-proposal relations. Complex videos may include overlapping, irregular, or non-sequential instances; AGT [28] proposes a novel Graph Transformer method to model the non-linear temporal structure using graph self-attention mechanisms. (3) The third is global-then-proposal, utilizing the global context in the sequence task. RTD-Net [17] and ActionFormer [18] adopt the Transformer structure, which helps to establish long-range dependencies. A summary of some recent work is shown in Table 1.

Table 1. Summary of some representative temporal action detection methods (venue, mAP@0.5 on THUMOS14, advantages, and shared limitations per group).

Anchor-based:
• R-C3D [4] (ICCV 2017, 28.9): adopts a 3D fully convolutional network and proposal-wise pooling to predict the class confidence and boundary offset for each pre-specified anchor.
• TALNet [29] (CVPR 2018, 42.8): proposes dilated convolutions and a multi-tower network to align receptive fields.
• GTAN [30] (CVPR 2019, 38.8): learns a set of Gaussian kernels to dynamically predict the duration of the candidate proposal.
• PBRNet [31] (AAAI 2020, 51.3): uses three cascaded modules to refine the anchor boundary.
Limitation: these methods require pre-defined anchors, which are inflexible for action instances with varying durations.

Bottom-up:
• BSN [10] (ECCV 2018, 36.9): predicts the probability of the start/end/action for each temporal location and then pairs the locations with higher scores to generate proposals.
• BMN [11] (ICCV 2019, 38.8): proposes an end-to-end framework to predict the candidate proposal and category scores simultaneously.
• BUTAL [32] (ECCV 2020, 45.4): uses the potential relationship between boundary actionness and boundary probabilities to refine the start and end positions of action instances.
• BSN++ [13] (AAAI 2021, 41.3): exploits proposal-proposal relation modeling and a novel boundary regressor to improve boundary precision.
Limitation: these methods utilize the boundary probability to estimate the proposal quality, which is sensitive to noise and prone to local traps.

Anchor-free:
• MGG [5] (CVPR 2019, 37.4): combines two complementary generators with different granularities to generate proposals from fine (frame) and coarse (instance) perspectives, respectively.
Limitation: these methods directly localize action instances without predefined anchors, thus lacking the guidance of prior knowledge and easily missing action instances.

Vision Transformer
Transformer was originally developed by [14] for the machine translation task. The core of Transformer is the self-attention architecture, which transforms one sequence into another: the output is computed as a weighted sum of the input features, where the weights are computed by dot products at each temporal location. Therefore, Transformer can establish long-range dependencies. Recurrent neural networks such as RNN [34], LSTM [35], and GRU [36] have natural advantages in sequence modeling; however, in recent years, they have been gradually replaced by Transformers in many sequential tasks. Transformers have three key advantages: (1) the parallel computing architecture breaks through the inherent serial nature of RNN models; (2) in the self-attention structure, the distance between any two temporal locations is one, which enables the model to remember longer-range dependencies; (3) compared with the convolutional neural network (CNN), the Transformer model has stronger interpretability. We only use the encoder part of the original Transformer on videos to explore long-range dependencies.

Overall Architecture
An overview of our method is depicted in Figure 1. Our method consists of a stack of L = 6 stages, with the temporal length halved except for the first stage, thereby generating a temporal feature pyramid structure. All stages have an identical structure, which is used to aggregate multi-range context features. Moreover, every stage includes three modules: multi-scale local convolution (MLC), window self-attention (WSA), and global attention (GA).

Window Self-Attention
Unlike the traditional way of directly using X = {X_1, X_2, X_3, · · · , X_T} as the input, we first use a 1D convolutional network to map the input X ∈ R^{T×C} to the required dimension, R^{T×C} → R^{T×D}, where T denotes the temporal length and C and D represent the channel dimensions. This operation also contributes to the stability of the training process [37]. The core part of the Transformer [14] network is self-attention. We inserted a normalization block (Pre-LN [38]) before the self-attention block to remove the learning-rate warm-up stage. To perform the attention function, the traditional way uses linear projections to generate queries (Q), keys (K), and values (V): parameter matrices W^Q ∈ R^{D×d_q}, W^K ∈ R^{D×d_k}, and W^V ∈ R^{D×d_v} project the features Z ∈ R^{T×D} to Q ∈ R^{T×d_q}, K ∈ R^{T×d_k}, and V ∈ R^{T×d_v}, where d_q, d_k, and d_v denote the channel dimensions. However, we found it beneficial to replace the linear projection with a 1D depthwise convolution (DC), implemented as a single 1D group convolution with a kernel size of 3 and the number of groups equal to the channel dimension. A main advantage of the self-attention block is its ability to capture global-range dependencies across the full sequence. However, this advantage comes at the cost of introducing noise and increasing computation when the temporal length exceeds a certain range. Inspired by the local self-attention of Longformer [39] and ActionFormer [18], we adopted a window self-attention mechanism to get rid of the impact of a long sequence. We call the feature at location i ∈ [1, T] a token. Given a sequence of arbitrary length, our window self-attention pattern applies a fixed-range attention around each token. For a specified window size ω, we evenly divided the input sequence into T/ω attention chunks.
Each chunk covers a region [t_s, t_e] with ω = t_e − t_s (see Figure 2) and has its own keys (K_i), queries (Q_i), and values (V_i) computed from the feature sequence. In this way, the range of attention is limited within a chunk. At stage j ∈ [1, L], the range of attention is j × ω. Stacking multiple window self-attention layers naturally integrates short-, medium-, and long-range features, which is beneficial to multi-scale prediction. We used the projected Q and K-V pairs to compute the window attention of each chunk, where Q ∈ R^{ω×d_q}, K ∈ R^{3ω×d_k}, and V ∈ R^{3ω×d_v}; the weight assigned to V is computed by the matrix product QK^T, where K^T denotes the transpose of K. The attention of the full sequence is obtained via a temporal sliding window with window size 3ω and stride ω. The result of window self-attention is calculated as a weighted sum of V:

Attention(Q, K, V) = softmax(QK^T / √d_q) V.

To avoid vanishing gradients caused by the dot product growing large in magnitude, we scaled the dot product by 1/√d_q. Similar to Transformer's multi-head attention, we used h = 4 parallel attention heads [H_1, · · · , H_4], whose outputs are concatenated.
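The chunked attention above can be written as a minimal PyTorch sketch. For simplicity, this version restricts attention to non-overlapping chunks of size ω rather than the paper's 3ω sliding window, and the class name and defaults are our own assumptions:

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Sketch of window self-attention: Pre-LN, depthwise-conv Q/K/V
    projections (kernel 3, groups == channels), chunked attention."""
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.w, self.heads, self.dh = window, heads, dim // heads
        self.norm = nn.LayerNorm(dim)  # Pre-LN before attention
        self.q = nn.Conv1d(dim, dim, 3, padding=1, groups=dim)
        self.k = nn.Conv1d(dim, dim, 3, padding=1, groups=dim)
        self.v = nn.Conv1d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x):                      # x: (B, T, D), T % window == 0
        B, T, D = x.shape
        z = self.norm(x).transpose(1, 2)       # (B, D, T)
        q, k, v = self.q(z), self.k(z), self.v(z)

        def chunk(t):                          # (B, D, T) -> (B, h, T//w, w, dh)
            return (t.view(B, self.heads, self.dh, T // self.w, self.w)
                     .permute(0, 1, 3, 4, 2))

        q, k, v = chunk(q), chunk(k), chunk(v)
        att = (q @ k.transpose(-2, -1)) / self.dh ** 0.5   # scaled dot product
        out = att.softmax(dim=-1) @ v          # weighted sum of values
        out = out.permute(0, 2, 3, 1, 4).reshape(B, T, D)
        return x + out                         # residual connection
```

The per-head scale 1/√d_h plays the role of the 1/√d_q factor in the equation above.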

Global Attention
In actual processing, a fundamental problem for the window self-attention method is how to set the window size. If it is too long, it goes against the original intention of the design; if it is too short, it may not cover the full context of a longer action instance, resulting in insufficient information. Inspired by Transformer [14] and non-local networks [40], the window self-attention module is followed by a lightweight global attention module. We define the operation in our network as

Z_i = (1 / N(Y_i)) ∑_{∀j} F(θ(Y_i), φ(Y_j)) H(Y_j), i, j ∈ [1, T],

where the linear projection functions θ, φ, and H are implemented as three independent 1D convolutions with kernel size 1. The pairwise function F computes the dot product between arbitrary feature vectors, which represents their similarity. This similarity score is normalized by N(Y_i) = ∑_{∀j} F(θ(Y_i), φ(Y_j)), which prevents the dot product from becoming too large.
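A minimal sketch of such a non-local style global attention block is given below. We use a softmax over all positions as the normalization N(·), a common variant of the sum normalization described above; the class name is our own:

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Sketch of a lightweight non-local style global attention block:
    theta/phi/H are independent kernel-size-1 1D convolutions."""
    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Conv1d(dim, dim, 1)
        self.phi = nn.Conv1d(dim, dim, 1)
        self.H = nn.Conv1d(dim, dim, 1)

    def forward(self, y):                       # y: (B, T, D)
        z = y.transpose(1, 2)                   # (B, D, T)
        th, ph, h = self.theta(z), self.phi(z), self.H(z)
        # pairwise dot-product similarity over the full sequence, normalized
        att = (th.transpose(1, 2) @ ph).softmax(dim=-1)   # (B, T, T)
        return y + att @ h.transpose(1, 2)      # weighted sum over all j
```

Because the projections are pointwise and there is a single attention map, the block stays lightweight relative to a full Transformer layer.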

Multi-Scale Local Convolution
The Transformer structure is mainly used to establish long-range dependencies, and capturing local context is its shortcoming. The convolutional neural network (CNN), by contrast, is adept at capturing local features, and combining convolution kernels of various sizes can extract multi-scale local features simultaneously. Based on this concept, we propose a multi-scale local convolution (MLC) module in parallel with the window self-attention module to capture multi-scale local context information. MLC consists of multiple independent 1D convolutions with different kernel sizes, typically chosen as d_1 < · · · < d_n, which define increasingly larger receptive fields (see Figure 1).
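A minimal sketch of the MLC idea follows; the kernel sizes and the max-fusion across branches are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MultiScaleLocalConv(nn.Module):
    """Sketch of MLC: parallel 1D convolutions with increasing kernel
    sizes d1 < ... < dn, fused by an element-wise maximum."""
    def __init__(self, dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes)

    def forward(self, x):                       # x: (B, T, D)
        z = x.transpose(1, 2)                   # (B, D, T)
        outs = torch.stack([b(z) for b in self.branches])
        # keep the strongest response across receptive-field scales
        return outs.max(dim=0).values.transpose(1, 2)
```

Each branch preserves the temporal length via `padding = k // 2`, so the branch outputs can be fused position-wise.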

Temporal Feature Pyramid
Inspired by ResNet [41] and FPN [42], we designed a temporal feature pyramid by stacking multiple pyramid levels to enhance the expressiveness of the network. Specifically, we applied a temporal downsampling operation at each pyramid level to accommodate action instances with various durations, with downsampling rates d ∈ {1, 2, 2, 2, 2, 2}. Each pyramid level consists of the MLC, WSA, and GA modules, and the output of each module is independent. Therefore, the network has the ability to capture local-, window-, and global-range features simultaneously. In order to aggregate these features, a MAX operator is adopted along the temporal dimension, which filters out the strongest features. The downsampling between levels can be expressed as

S_τ = ↓(S_{τ−1}), τ = 2, 3, · · · , L, S_τ ∈ R^{(T/2^{τ−1})×D}, (10)

where τ represents the pyramid level and ↓ represents the downsampling operation, implemented with a 1D convolution with a stride of 2. Finally, we combined the results of all stages and obtained the temporal feature pyramid set S = {S_1, S_2, · · · , S_6}. We designed two lightweight prediction branches with the same structure, but independent of each other, for action classification and boundary regression, respectively. Each branch consists of three layers of 1D convolution with a kernel size of 3. We considered every location i ∈ [1, T] in the sequence as an action instance candidate. For the classification branch, a sigmoid function is applied to predict the probability C_t^i of each action category. For the regression branch, in order to ensure that the distances (d_t^s, d_t^e) to the left and right boundaries are positive, a ReLU activation function is attached at the end.
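The pyramid construction above can be sketched as follows; `nn.Identity` placeholders stand in for the MLC/WSA/GA modules, and the MAX fusion and stride-2 convolution follow the text (the helper names are our own):

```python
import torch
import torch.nn as nn

class PyramidLevel(nn.Module):
    """Sketch of one pyramid stage: three parallel branch placeholders
    fused by an element-wise MAX, then stride-2 1D conv downsampling
    (stride 1 at the first stage)."""
    def __init__(self, dim, downsample=True):
        super().__init__()
        self.mlc, self.wsa, self.ga = nn.Identity(), nn.Identity(), nn.Identity()
        self.down = (nn.Conv1d(dim, dim, 3, stride=2, padding=1)
                     if downsample else nn.Identity())

    def forward(self, x):                       # x: (B, T, D)
        fused = torch.stack([self.mlc(x), self.wsa(x), self.ga(x)]).max(0).values
        return self.down(fused.transpose(1, 2)).transpose(1, 2)

def build_pyramid(x, dim, levels=6):
    """Collect features S_1..S_L; temporal length halves after level 1."""
    feats = []
    for i in range(levels):
        x = PyramidLevel(dim, downsample=(i > 0))(x)
        feats.append(x)
    return feats
```

With an input of length 32, the six levels produce temporal lengths 32, 16, 8, 4, 2, and 1, matching the halving schedule of Equation (10).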

Loss Function
The output of our network includes the start/end boundary offsets (d_t^s, d_t^e) and the class probability C_t, and we used the following loss function to optimize the model:

L = L_cls + λ L_reg,

where λ is a hyper-parameter used to adjust the weight of the classification and regression losses; we treated the two losses equally and set λ = 1. Our classification loss uses the focal loss [43], which can effectively handle the imbalance between foreground and background:

L_cls = −(1/T_j) ∑_{t=1}^{T_j} ∑_{i=1}^{N} α_t^i (1 − p_t^i)^γ log(p_t^i), with p_t^i = C_t^i and α_t^i = α if y_t^i = 1, and p_t^i = 1 − C_t^i and α_t^i = 1 − α otherwise.

In the above, the class label y_t ∈ {0, 1}^N indicates whether the temporal location t ∈ [1, T] belongs to an action (y_t^i = 1) or the background (y_t = 0), and C_t = {C_t^1, C_t^2, · · · , C_t^N} is the predicted classification score. T_j is the length of the full sequence in the j-th stage, and N represents the total number of action categories. α and γ are hyper-parameters, specified as 0.25 and 2 in the experiments. L_reg is the intersection-over-union (IoU) loss [44] between the predicted boundaries Ω̂_t = (θ̂_t, ξ̂_t) and the corresponding ground truth Ω_t = (θ_t, ξ_t):

L_reg = (1/T_p) ∑_{t=1}^{T} I_t (1 − IoU(Ω̂_t, Ω_t)),

where T_p represents the total number of positive samples, and the indicator function I_t equals 1 if the location t ∈ [1, T] is inside an action instance and 0 otherwise.
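A minimal sketch of the two loss terms is given below, assuming sigmoid class probabilities and (left, right) boundary offsets; the reduction over locations is a simple mean here, which is an assumption rather than the paper's exact normalization:

```python
import torch

def focal_loss(pred, target, alpha=0.25, gamma=2.0):
    """Focal loss on sigmoid probabilities, with the paper's
    hyper-parameters alpha = 0.25 and gamma = 2."""
    pt = pred * target + (1 - pred) * (1 - target)      # p if y=1 else 1-p
    a = alpha * target + (1 - alpha) * (1 - target)     # alpha_t weighting
    return (-a * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-8))).mean()

def iou_loss(pred, gt):
    """1D IoU loss between predicted and ground-truth (left, right)
    boundary offsets measured from a location inside the action."""
    ls, le = pred.unbind(-1)
    gs, ge = gt.unbind(-1)
    inter = torch.min(ls, gs) + torch.min(le, ge)
    union = torch.max(ls, gs) + torch.max(le, ge)
    return (1 - inter / union.clamp(min=1e-8)).mean()

def total_loss(cls_pred, cls_gt, reg_pred, reg_gt, lam=1.0):
    # L = L_cls + lambda * L_reg, with lambda = 1 as in the paper
    return focal_loss(cls_pred, cls_gt) + lam * iou_loss(reg_pred, reg_gt)
```

In practice the regression term would only be accumulated at positive locations (I_t = 1), which the indicator in L_reg expresses.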

Inference
During inference, we fed the feature sequences into the network and obtained the predictions A_t^j = (C_t^i, d_t^s, d_t^e) for every temporal location t across all pyramid levels, where i ∈ [1, N] and j ∈ [1, L] index the action category and pyramid level, respectively. For the t-th temporal location in the j-th level, the predicted action instance is represented by (t − d_t^s, t + d_t^e, C_t^i), where d_t^s and d_t^e are the left and right offsets of the action instance and C_t^i represents its category score. Finally, the predicted results from all locations are merged, and we performed Soft-NMS [45] to obtain the final outputs.
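The decoding step for one pyramid level can be sketched as follows; `decode_level`, the `stride` mapping back to input coordinates, and the score threshold are hypothetical helpers for illustration, and Soft-NMS would be applied to the merged candidates afterwards:

```python
import torch

def decode_level(scores, offsets, stride=1, score_thresh=0.1):
    """Turn one level's per-location predictions into candidate
    (start, end, score, label) tuples.
    scores: (T, N) class probabilities; offsets: (T, 2) left/right."""
    cands = []
    for t in range(scores.shape[0]):
        label = int(scores[t].argmax())
        score = float(scores[t, label])
        if score < score_thresh:
            continue                      # drop low-confidence locations
        center = t * stride               # map level position to input scale
        ds = float(offsets[t, 0]) * stride
        de = float(offsets[t, 1]) * stride
        cands.append((center - ds, center + de, score, label))
    return cands
```

Candidates from all levels are concatenated before the final Soft-NMS pass.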

Datasets and Settings
We evaluated our model on two widely used large-scale datasets, THUMOS14 [19] and ActivityNet 1.3 [20]. THUMOS14 contains 20 sport categories and consists of three parts: training, validation, and testing sets. Among these, the training set, which has no temporal annotations, is used for action recognition. Following previous research [3,6,12], we trained our model on the validation set, which includes 213 untrimmed videos, and evaluated the performance on the test set, which includes 200 untrimmed videos. ActivityNet 1.3 is composed of 19,994 videos, which contain 200 action categories, and the dataset is divided into training, testing, and validation subsets in a ratio of 2:1:1.

Evaluation Metrics
To compare with existing methods, we used official evaluation metrics, the mean average precision (mAP) at different temporal intersection over unions (tIoUs) and the average mAP to evaluate the performance on the two datasets. On THUMOS14, the tIoU thresholds were chosen from [0.3:0.1:0.7], which focuses on the performance of mAP@0.5.
On ActivityNet 1.3, the tIoU thresholds were selected from {0.5, 0.75, 0.95}, which pays more attention to the average mAP over [0.5:0.05:0.95]. Each video in THUMOS14 contains more than 15 short-duration action instances, while each video in ActivityNet 1.3 contains an average of 1.7 long-duration action instances. Thus, the two datasets use different evaluation metrics.
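The temporal IoU underlying these metrics can be stated as a small helper; `tiou` is our own illustrative name:

```python
def tiou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

A prediction counts as a true positive at threshold τ when its tIoU with an unmatched ground-truth instance of the same class is at least τ; mAP is then the mean of the per-class average precisions.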

Implementation Details
THUMOS14: In order to extract spatial-temporal features from THUMOS14, we used a two-stream Inflated 3D ConvNet (I3D) [26] pre-trained on the Kinetics-400 [27] dataset. We sampled 16 consecutive RGB and optical-flow frames with an overlap rate of 75% as clips, which were respectively input to the I3D network, and extracted features of dimension 512 × 2 at the first fully connected layer. The output features of the two-stream network were then concatenated to obtain the input features of our model. Following the evaluation protocol, the mean average precision (mAP) with IoU thresholds {0.3, 0.4, 0.5, 0.6, 0.7} was the evaluation metric used on THUMOS14. We used Adam [46] to optimize the network and set the batch size, initial learning rate, and total epoch number to 2, 10^−4, and 50, respectively. ActivityNet 1.3: We adopted an R(2+1)D model pre-trained via TSP [47] to extract features from ActivityNet 1.3. We sampled 16 consecutive frames with a stride of 16 as clips (i.e., non-overlapping clips). Following [10,18], the length of the sequence was rescaled to a fixed length of 128 using linear interpolation. Following the evaluation protocol, the mAP with IoU thresholds {0.5, 0.75, 0.95} and the average mAP over [0.5:0.05:0.95] were the evaluation metrics used on ActivityNet 1.3. We used Adam to optimize the network and set the batch size, initial learning rate, and total epoch number to 16, 10^−3, and 15, respectively.
Our method was implemented based on PyTorch 1.1, Python 3.8, and CUDA 11.6. We conducted experiments on one NVIDIA GeForce RTX 3090 GPU with an Intel i5-10400 CPU and 128 GB of memory.

THUMOS14
We compared our model with several recent state-of-the-art methods including one-stage, two-stage, and Transformer models on the THUMOS14 dataset. Table 2 summarizes the performance. It can be seen intuitively that our model outperformed all previous methods, establishing a new state-of-the-art of 67.2% mAP@0.5 on THUMOS14.
In particular, our model achieved an improvement of 11.7% (from 55.5% to 67.2%) on mAP@0.5 and 10.9% (from 52.0% to 62.9%) on the average mAP ([0.3:0.1:0.7]) compared with AFSD [33], which is currently the best-performing one-stage detector. We outperformed MUSES [48], the strongest two-stage competitor, by 10.3% (from 56.9% to 67.2%) on mAP@0.5 and 9.5% (from 53.4% to 62.9%) on the average mAP. Moreover, our model achieved gains of up to 1.6% (from 65.6% to 67.2%) on mAP@0.5 and 0.3% (from 62.6% to 62.9%) on the average mAP over the Transformer method [16], the current state-of-the-art in TAD tasks. This excellent performance proves that, for TAD, simultaneously modeling local multi-scale features and long-range temporal dependencies can effectively improve performance. ActivityNet 1.3: The performance on the ActivityNet 1.3 dataset is shown in Table 3. On the average mAP, GLFormer reached 36.3%, which is 0.7% higher than the current state-of-the-art of 35.6% by ActionFormer [18]. Our method achieved 37.7% mAP@0.75, outperforming all previous methods by at least 1.5%. GLFormer achieved 54.5% mAP@0.5, which did not match the previous method ContextLoc [49] (54.5% vs. 56.0%), but outperformed it on the other evaluation metrics. Our method achieved 7.6% mAP@0.95, which holds no advantage over other methods. Considering that the evaluation index mAP@avg averages the results over multiple tight tIoUs (such as tIoU = 0.95) and therefore requires higher accuracy, the performance of 36.3% is also commendable. This demonstrates the effectiveness and generalizability of fusing multi-scale context and long-range features.

Ablation Experiments
Here, in order to validate the various design decisions, we discuss the contributions from several key modules. All ablation experiments were performed on THUMOS14.

Effectiveness of WSA Module
By comparing the first and second rows in Table 4, we found it beneficial to replace the linear projection approach of the traditional Transformer model with a 1D depthwise convolution to project the queries, keys, and values. The results showed that when using the 1D depthwise convolution, the mAP at tIoU 0.5 and the average mAP increased by 0.9% and 0.5%, respectively. This indicates that choosing an appropriate projection approach can help improve the performance of the TAL task. As shown in the third row, we also evaluated the effect of the window size on the results. We can see that increasing the window size from 4 to 9 brought an improvement (from 65.7% to 67.2% on mAP@0.5), while the performance dropped from 67.2% to 65.8% after further increasing the window size to the full sequence. The features located in the deep layers of the network are already highly abstracted, and simple linear projection may over-reorganize the channel features, resulting in insufficient feature distinguishability. The 1D group convolution prevents the features between groups from interfering with each other and learns independent expression patterns, which helps enhance the distinguishability of features. Setting the window size too small damages the integrity of the information, while setting it too large distracts the attention and introduces irrelevant information.

Effectiveness of MLC Module
To discuss the effectiveness of the MLC module, we compared two settings: (1) WSA + GA module; (2) MLC + WSA + GA module. MLC is a multi-branch structure, and each branch consists of a layer of 1D convolution with different kernel sizes. Table 5 shows the comparison results on the THUMOS14 test set. It can be seen that MLC + WSA + GA had a +1.1% mAP@0.5 improvement compared to only using the WSA + GA module, suggesting that the MLC module can provide complementary information with WSA + GA. As the number of branches increases, the accuracy further increases, with only a small increase in the model parameters. However, when using a larger kernel size (k_s = 13), the effectiveness of MLC seems to become weaker (−1.4% in mAP@0.5). Further expanding the kernel size (e.g., k_s = 15) leads to greater performance degradation. Multi-scale local features play an important role in enriching the feature details, but the range of the temporal receptive field must be controlled within a certain range. Beyond a certain range, the module's attention is distracted, not only being unable to capture long-range features, but also unable to focus on the local context.

Effectiveness of GA Module
We validated the design of the GA module in Figure 1 by comparing three settings: (1) WSA + MLC module; (2) follow a GA after WSA; (3) follow multiple GAs after WSA. We compare the performance with or without the GA module in Table 6. The result clearly shows that with a GA, the performance improves by 1.1% mAP@0.5. This experiment demonstrated that the GA was beneficial to make the localization precise. However, when we replaced a single GA module with two or three GAs, the performance dropped by 1.2% and 1.1% mAP@0.5, respectively. Using a GA module can help improve performance, indicating that only considering local features is not conducive to the model capturing sufficient features, especially for instances with a long duration. While noise may be introduced, the benefits are greater. Stacking more than two GA modules will amplify the noise in a cumulative manner, negating the benefits of capturing global features.

Module Complementarity
In order to study the relationship between the three modules of WSA (medium-range), MLC (short-range), and GA (long-range), we compared different combinations of these modules, and the results are presented in Table 7. Using the WSA module alone, the performance was 65.8% mAP@0.5, making it the strongest contributor to our model's performance. Combined with MLC or GA, the performance improved by 0.3% and 0.1%, respectively. When we used all three modules together, the result was 67.2% mAP@0.5, which proves that the three modules have a complementary relationship. The MLC module is responsible for capturing multi-scale local contextual information. The WSA module and the following lightweight GA module can not only establish long-range dependencies, but also effectively avoid introducing noise. The three modules work together to enable the network to adaptively capture action instances with randomly varying durations.

Temporal Feature Pyramid
The temporal feature pyramid is used to accommodate action instances with various durations. In our experiments, different numbers of levels were tried, and the performance is listed in Table 8. We can observe that the performance became progressively better as the number of levels increased, with the six-level pyramid achieving the best result on mAP@0.5 and the average mAP, while using more levels did not result in better performance; this is due to more redundant candidates being involved. In Table 8, we also show the effect of different numbers of channels on the performance. The initial number of channels was set to 256, and increasing the number of channels to 1024 helped improve the performance (from 65.0% to 67.2%). However, when continuing to increase the number of channels to 4096, the performance dropped from 67.2% to 65.6%. The number of channels is directly related to the expressiveness of the network. On the one hand, a low feature dimension has difficulty providing sufficient information, making it difficult for the network to distinguish instances with high similarity. On the other hand, for high-dimensional features, performing linear projection operations on highly abstract features not only increases the computational complexity, but also causes network overfitting.

Temporal Downsampling Module
For each pyramid level, we used a 1D convolution with a stride of 2 (except for the first level, which uses a stride of 1) to reduce the temporal resolution. In addition, we also tried other downsampling methods, including average pooling and max pooling, and compared their results with this method. The performance is summarized in Table 9. Among all temporal downsampling modes, the 1D convolution achieved the best performance, showing a 1.2% and 0.9% advantage on mAP@0.5 against average pooling and max pooling, respectively. For max pooling and average pooling, reducing the temporal dimension comes at the cost of losing some feature information. By contrast, the convolution with a stride of 2 reduces the dimension while not only retaining the features more completely, but also enhancing the expressiveness of the network.

Visualization Results
We visualized four proposals generated by GLFormer, covering short, medium, and long action instances, and compared them with the ground truth in Figure 3. These four action instance samples are from the THUMOS14 dataset. The visualizations indicate that our results match the corresponding ground truth well, even for short and long action instances.

Conclusions
In this paper, we introduced GLFormer for temporal action detection (TAD). This network takes advantage of multi-scale 1D convolution, global attention, and window self-attention to learn rich contexts for action classification and boundary prediction. Furthermore, taking into account action instances with various durations, an important component of our network is the temporal feature pyramid, which is achieved by using a 2× downsampling between successive stages. The experimental evaluation shows that our model achieves state-of-the-art performance on the human activity datasets THUMOS14 and ActivityNet 1.3. Overall, this work highlights the importance of aggregating features with different ranges of attention and shows that window self-attention is an effective means to model longer-range temporal context in complex activity videos.
Discussion: Although great progress has been made in temporal action detection, several shortcomings still need further improvement: (i) manually labeling data requires excessive human and material investment, motivating the combination of supervised learning with semi-supervised or even unsupervised learning; (ii) effective post-processing methods should be explored to improve boundary localization accuracy; (iii) the traditional TV-L1 optical-flow algorithm is inefficient and occupies storage space, and could be replaced by a deep learning algorithm to realize end-to-end processing; (iv) real-world action instances are often dynamic and random, so effective methods to model non-linear video sequence features should be explored; (v) data preprocessing steps could be added to improve the quality of the raw data, for example against clutter, illumination effects, and complex scenarios.