Temporal Context Modeling Network with Local-Global Complementary Architecture for Temporal Proposal Generation

Abstract: Temporal Action Proposal Generation (TAPG) is a promising but challenging task with a wide range of practical applications. Although state-of-the-art methods have made significant progress in TAPG, most ignore the impact of the temporal scales of action and lack the exploitation of effective boundary contexts. In this paper, we propose a simple but effective unified framework named Temporal Context Modeling Network (TCMNet) that generates temporal action proposals. TCMNet innovatively uses convolutional filters with different dilation rates to address the temporal scale issue. Specifically, TCMNet contains a BaseNet with dilated convolutions (DBNet), an Action Completeness Module (ACM), and a Temporal Boundary Generator (TBG). The DBNet aims to model temporal information. It handles input video features through different dilated convolutional layers and outputs a feature sequence as the input of ACM and TBG. The ACM aims to evaluate the confidence scores of densely distributed proposals. The TBG is designed to enrich the boundary context of an action instance. The TBG can generate action boundaries with high precision and high recall through a local-global complementary structure. We conduct comprehensive evaluations on two challenging video benchmarks: ActivityNet-1.3 and THUMOS14. Extensive experiments demonstrate the effectiveness of the proposed TCMNet on tasks of temporal action proposal generation and temporal action detection.


Introduction
Temporal action detection is one of the most fundamental tasks in video understanding. It aims not only to classify every action instance in each video, but also to find their accurate temporal locations. In general, the temporal action detection task is composed of two subtasks: temporal action proposal generation and action classification. Although current action recognition methods [1,2] can achieve convincing classification accuracy, the performance of temporal action detection is still unsatisfactory on mainstream benchmarks. Object detection aims to find as many tight bounding box locations and classes of objects as possible, and quite a number of recent methods [3][4][5] have achieved remarkable progress and superior performance. Akin to object proposals for object detection in images, temporal action proposals have a crucial impact on the quality of action detection. As a result, more and more works are devoted to improving the quality of temporal action proposals, and Temporal Action Proposal Generation (TAPG) has gradually become a research focus in video understanding. TAPG is not only used for temporal action detection, but is also the core of several downstream tasks such as video recommendation, video captioning, and summarization.
Proposals generated by a robust TAPG method usually have two essential properties: (1) The generated temporal proposals should cover ground-truth action instances accurately and exhaustively, and have flexible durations and accurate boundaries. (2) The generated temporal proposals should be precisely evaluated so that redundant proposals can be effectively suppressed. Existing TAPG methods can be roughly divided into two categories. The first category follows a top-down fashion. Such methods generate proposals by predefining sliding windows [6] or designing a set of regularly distributed anchors [7] of different scales for each video segment. The generated proposals are then evaluated by a binary classifier. However, as sliding windows and anchors are defined manually, the generated proposals inevitably have imprecise boundaries. For this reason, more and more researchers have begun to study bottom-up TAPG methods, which form the second category. By using local clues on the video sequence to evaluate each temporal location, these approaches can generate more precise boundaries and more flexible durations.
Although recent methods have made significant progress in TAPG, they still have unresolved problems. (1) The duration of ground-truth action instances varies, typically ranging from seconds to minutes. However, existing methods use a fixed temporal receptive field for all action instances and thus ignore the temporal scale issue of action instances.
(2) Most of the existing methods only exploit the local details around the boundaries to predict starting and ending times, but do not pay much attention to the global context in the video sequence. Figure 1 shows the diversity of ground-truth action instances' durations on two challenging video benchmarks: ActivityNet-1.3 and THUMOS14. Motivated by the above observations, we propose a Temporal Context Modeling Network (TCMNet) to efficiently model video action instances with different durations and make full use of global context to generate more accurate temporal boundaries. In our framework, a BaseNet named DBNet takes the extracted video features as input and provides a shared feature sequence for subsequent modules. To efficiently model video action instances with different durations, DBNet contains multiple convolution layers with different dilation rates. These convolution filters have different receptive fields, each of which is most effective at a specific temporal scale. An Action Completeness Module (ACM) is designed to take the shared feature sequence as input and generate action completeness maps to evaluate densely distributed proposals. A Temporal Boundary Generator (TBG) is designed to generate temporal boundaries with high precision and high recall. The TBG contains a local branch and a global branch. The local branch consists of only two temporal convolution layers. It focuses on the local abrupt background-to-action (or action-to-background) change in the input feature sequence and generates rough boundaries with high recall but low precision. The global branch generates temporal boundaries with high precision but low recall by using our improved contextual U-shaped network structure. It uses multiple temporal convolutional layers followed by down-sampling steps to extract semantic information from different granularities. 
To restore the resolution of the temporal feature sequence, the up-sampling operation is repeated multiple times and the features of the same resolution are fused. All in all, the contributions of our work are fourfold: (1) We propose a Temporal Context Modeling Network (TCMNet) for temporal action proposal generation. TCMNet adopts multiple dilated temporal convolutions to obtain different receptive fields, each of which is most effective for action instances with a specific duration. The responses of all temporal convolutions are fused to generate more reliable temporal action proposals. (2) To generate complete action proposals, we embed an improved U-shaped network in the temporal boundary generator. TCMNet can therefore improve performance by leveraging the global context for boundary detection through local-global structures. (3) We propose a pooling operation to obtain more useful deep semantic information and an aggregation function to achieve adaptive fusion of semantic features. The pooling operation and the aggregation function are embedded in the U-shaped network to reduce the disturbance of noise. (4) We conduct extensive experiments on the THUMOS14 and ActivityNet-1.3 benchmarks. The results show that TCMNet achieves strong proposal generation performance. Combined with existing action classifiers, TCMNet also achieves remarkable temporal action detection performance compared with other approaches.

Temporal Anchoring Methods
With the continuous development of deep learning networks [8,9], great progress has been made in video analysis tasks. Temporal action detection is a key part of video analysis tasks, and extracting high-quality temporal proposals is crucial for action detection. Temporal proposals have different time spans to align with action instances. However, fixed-size features must be extracted from each proposal to be fed to fully connected layers for proposal evaluation [10]. Bottom-up methods [11,12] first obtain the boundary candidates and then use 1D RoI pooling to estimate all possible combinations. Multi-scale anchor methods [13] extend image detection to temporal action localization. They generate class-agnostic proposals by jointly classifying and regressing a fixed set of multi-scale anchors at each location. Anchor-free methods [14] directly predict the confidence score, the center offset, and the length of time through the center point feature. Continuous representation [15] proposes modeling action segments by maximizing the confidence scores in a 2D function. It enables a more flexible and efficient data sampling space.

Action Recognition
Action recognition is a fundamental and important task in the video understanding area, and deep learning models have recently achieved significant performance gains in this task. Ref. [16] uses human boxes and key points to represent instance-level features, and the action region features of this framework are used as the input of the temporal action head network, which makes the framework more discriminative. The authors of [17] propose a multi-scale feature extraction method to extract richer feature information. At the same time, a multi-task learning model is introduced, which can further improve classification accuracy by sharing data among multiple tasks. With the continuous development of deep models in the field of action recognition, some works [18,19] have begun to address the difficulties of deploying deep models in real-life applications, so that those models can be used in practice.

Temporal Action Proposal Generation
Temporal action proposal generation (TAPG) aims to generate proposals with precise temporal action boundaries and confidence in untrimmed videos. Existing proposal generation methods are mainly divided into two branches: top-down and bottom-up methods. Top-down methods mainly generate proposals based on sliding windows or uniformly distributed anchors, and then use a binary classifier to evaluate the generated proposals. SCNN [6] first uses sliding windows of different scales to generate some proposals with a fixed overlap rate. TURN [20] draws on the classic algorithm Faster R-CNN [21] in object detection. It generates proposals through uniformly distributed anchors. GTAN [22] introduces Gaussian kernels to dynamically optimize the temporal scale of each action proposal. Those methods are inspired by the achievements of anchor-based object detectors in still images; they discretize the proposal task into a classification task where multiple predefined anchors with different lengths are used as classes and a class that best fits the ground-truth action length is regarded as a ground-truth class for training.
As for the bottom-up methods [23][24][25][26], they generate proposals by locating temporal boundaries and then combining the boundaries according to a certain strategy. TAG [27] designs a temporal watershed algorithm to generate proposals but lacks confidence scores for retrieval. On the basis of TAG, BSN [11] utilizes a temporal evaluation module to locate temporal boundaries and adopts a proposal evaluation module to regress the confidence of proposals. However, BSN is inefficient because it conducts proposal feature construction and confidence evaluation separately for each proposal. To solve this problem, BMN [28] designs a boundary-matching (BM) mechanism for the confidence evaluation of densely distributed proposals. Bottom-Up-TAL [12] introduces two regularization terms to mutually regularize the learning procedure. By jointly optimizing these two terms, the entire framework is aware of potential constraints during an end-to-end optimization process. Considering that proposals generated by methods using only local clues are susceptible to noise, TSI [29] leverages temporal context for boundary detection with a local-global complementary structure to improve performance. TSI also designs a scale-invariant loss function to improve detection performance for short actions. RTD-Net [30] adopts a Transformer architecture to directly generate action proposals in untrimmed videos. It models dependencies between proposals from a global perspective and avoids non-maximum suppression post-processing through a simple but efficient design.

Temporal Action Detection
Temporal action detection can be divided into two types of methods. One is the one-stage method, which aims to localize an action and predict its class simultaneously. The other is the two-stage approach, which works by classifying proposals and detecting them. As one-stage methods, PBRNet [31] and AFSD [14] skip the proposal generation by directly detecting action instances in untrimmed videos. P-GCN [32] exploits the proposal-proposal relations for temporal action detection in videos. G-TAD [33] adaptively incorporates multilevel semantic context into video features and casts temporal action detection as a sub-graph localization problem to localize actions in video graphs. As for two-stage temporal action detection methods, TCANet [34] and RCL [15] adopt the progressive boundary refinement method to achieve precise boundaries and reliable confidence of proposals, thus improving the efficiency of action detection.

Problem Definition
We are given an untrimmed video sequence $V = \{v_t\}_{t=1}^{l_v}$, where $v_t$ denotes the $t$-th frame in the video sequence and $l_v$ is the length of the video. The temporal annotation set corresponding to the video $V$ is composed of a set of action instances $\psi_g = \{\phi_{g,n} = (t_n^s, t_n^e)\}_{n=1}^{N_g}$, where $N_g$ is the number of ground-truth action instances and $t_n^s$ and $t_n^e$ are the starting and ending time of action instance $\phi_{g,n}$. TAPG aims to predict proposals $\psi_p = \{\phi_{p,n} = (t_n^s, t_n^e, p_n)\}_{n=1}^{N_p}$ to cover $\psi_g$ with high recall and high temporal overlap, where $p_n$ is the action completeness score of predicted proposal $\phi_{p,n}$, which is further used for proposal ranking.

Feature Encoding
We employ two-stream networks to encode the raw video sequence and generate a visual feature sequence. Specifically, given an untrimmed video $V$ containing $l_v$ frames, we extract a visual feature sequence $F = \{f_i\}_{i=1}^{l_s}$ by concatenating the output of the last FC-layer in the two-stream networks, where $l_s$ denotes the length of the visual feature sequence $F$. Like previous works [11,24,28,29], we extract features at a regular frame interval $\delta$ to reduce computational cost; thus $l_s = l_v / \delta$.

Temporal Context Modeling Network (TCMNet)
TCMNet is designed to generate densely distributed proposals directly in a unified network. It generates action completeness maps that represent the confidence of densely distributed proposals and local-global boundary probability sequences that represent boundary information simultaneously. The framework of TCMNet is illustrated in Figure 2, which contains three main modules: BaseNet with dilated convolutions (DBNet), Action Completeness Module (ACM) and Temporal Boundary Generator (TBG). DBNet can be seen as the backbone of TCMNet, which aims to handle the input video features through different dilated convolutional layers to better model the temporal information. It receives the video feature sequence as input and outputs a feature sequence as the input to ACM and TBG. ACM generates action completeness maps of dense proposals through Boundary-Matching (BM) layers proposed in BMN [28]. In addition, dilated convolutional layers are embedded in ACM to obtain different receptive fields. TBG contains a local branch and a global branch. The local branch focuses on local sudden changes in the input feature sequence and generates rough boundaries with high recall. The global branch extracts contextual features and generates high-precision boundaries through our improved U-shaped architecture.

DBNet
In order to faithfully detect boundaries, each action instance in the video sequence needs to have an appropriate temporal receptive field. However, the duration of different action instances in the video generally varies widely, so it is impossible to find a one-for-all temporal receptive field. As a natural solution, we embed a set of convolutional filters with different dilation rates in BaseNet and name it DBNet. The goal of DBNet is to receive the two-stream video feature sequence $F$ as input and output a feature sequence $F_{BaseNet}$ shared by ACM and TBG. As shown in Figure 3, we embed a dilated convolutional layer consisting of several different dilated convolutional filters after the traditional temporal convolution. The outputs from all dilated convolutions are simply averaged, returning fused contextual information. Note that a skip connection is inserted after the average operation, such that the dilated convolutions are encouraged to focus on learning the residual. In this formulation, conv1 and conv2 denote two traditional temporal convolutions and dc(·) denotes the dilated convolutional layer. By combining convolutions with different dilation rates, DBNet can better model the temporal relationship of action instances with different durations.
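The averaging-plus-skip structure of the dilated layer can be illustrated with a minimal sketch. It is not the actual DBNet implementation: it assumes a single feature channel, zero padding, hand-picked kernel weights and the hypothetical dilation rates 1 and 2 (the rates used in TCMNet are not stated in this section), and it omits conv1 and conv2.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Length-preserving 1D convolution with zero padding and the given dilation."""
    half = (len(kernel) - 1) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - half) * dilation  # taps are spaced `dilation` steps apart
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def dilated_block(x: Sequence[float], kernels, dilations) -> List[float]:
    """Average the responses of all dilated filters, then add the skip connection."""
    responses = [dilated_conv1d(x, k, d) for k, d in zip(kernels, dilations)]
    n = len(responses)
    averaged = [sum(r[t] for r in responses) / n for t in range(len(x))]
    # Skip connection: the dilated filters only need to model the residual.
    return [xi + ai for xi, ai in zip(x, averaged)]


# Hypothetical input: a single impulse at t = 2.
features = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
fused = dilated_block(features, kernels=[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]], dilations=[1, 2])
```

With the same three-tap kernel, the filter with dilation 2 spreads the impulse over a span of five time steps instead of three, which is the sense in which each dilation rate covers a different temporal scale.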

Action Completeness Module (ACM)
The ACM module receives the shared feature sequence generated by DBNet as input and outputs action completeness maps to evaluate the confidence score of dense proposals. To achieve this goal, we adopt the Boundary-Matching (BM) mechanism proposed in BMN [28]. As shown in Figure 4, the BM layer can transfer the temporal feature sequence $F_{BaseNet} \in \mathbb{R}^{C \times D}$ to proposal feature maps $M_F \in \mathbb{R}^{D \times T \times 128 \times 32}$, where $T$ is the length of the feature sequence and $D$ is the maximum duration of proposals. The proposal feature maps are then fed into several 2D convolutional and dilated convolutional layers to generate new feature maps $M_F \in \mathbb{R}^{D \times T \times 128}$. After going through the ACM module, two confidence scores are predicted for each proposal, which are supervised by the IoU classification loss and the IoU regression loss.

Temporal Boundary Generator (TBG)
The goal of TBG is to accurately evaluate the start and end probabilities of all temporal locations in untrimmed videos. These boundary probability sequences are then used to generate proposals in the post-processing stage. Previous methods treat the boundary as a kind of local information but do not pay enough attention to global context or deep semantic features, which makes the detection of the boundary vulnerable to noise [35]. To remedy this defect, we follow the structural details of TSI [29] to accurately detect temporal boundaries through a local-global complementary architecture. The architecture of TBG is shown in Figure 5. The local branch in TBG contains only two temporal convolutional layers. This branch focuses on local abrupt changes and generates a rough boundary with high recall but low precision to cover all actual start/end points. Inspired by the UNet [36] used in image segmentation, the global branch is designed to represent the action boundary through a U-shaped contextual architecture. The global branch uses multiple temporal convolutional layers followed by down-sampling to extract semantic information from different granularities. In order to restore the resolution of the temporal feature sequence, the up-sampling operation is repeated multiple times, and the features of the same resolution are adaptively fused.
Unlike the TBD proposed in TSI [29], we argue that: (1) Deep semantic features obtained through temporal max pooling during down-sampling are not enough because the fine-grained temporal information critical for localizing boundaries is lost. Therefore, as shown in Figure 6a, we design a new pooling method called Pool, which uses both average-pooling and max-pooling operations to generate two different temporal context descriptors. The two descriptors are then forwarded to a shared MLP, and the two outputs are merged by element-wise addition to produce our deep semantic features.
(2) Semantic features of different granularities contribute to boundary detection differently, so directly concatenating semantic features is not the most appropriate way to fuse them. Therefore, as shown in Figure 6b, we design an aggregation function to achieve adaptive fusion of the same-resolution features. Specifically, we first concatenate the input features along the channel dimension to obtain a new feature, which is fed into a squeeze-excitation architecture consisting of several temporal convolutions to explicitly model the channel relationship; the channel scaling factor in the squeeze-excitation architecture is denoted as $r$. Then, we normalize the output using the Softmax function to obtain the weights $\alpha_1, \ldots, \alpha_n$ of the semantic features of different granularities. Finally, we use these weights to obtain the adaptively fused feature, denoted by $F^{upsample}_{TBG}$.
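The two operations described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the TBG implementation: the pooling window of 2, the `shared_mlp` callable, and the use of one scalar weight per granularity are assumptions, and the raw `scores` passed to `aggregate` stand in for the output of the squeeze-excitation branch, which is not reproduced here.

```python
import math


def pool_downsample(feat, shared_mlp, window=2):
    """Down-sample a T x C feature sequence with combined average and max pooling.

    feat: list of T time steps, each a list of C channel values.
    shared_mlp: callable applied to both descriptors (same weights for both).
    """
    out = []
    for start in range(0, len(feat) - window + 1, window):
        chunk = feat[start:start + window]
        channels = list(zip(*chunk))  # regroup values per channel
        avg_desc = [sum(c) / window for c in channels]
        max_desc = [max(c) for c in channels]
        a, m = shared_mlp(avg_desc), shared_mlp(max_desc)
        out.append([ai + mi for ai, mi in zip(a, m)])  # element-wise addition
    return out


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(features, scores):
    """Fuse n same-resolution T x C features with Softmax-normalized weights."""
    alphas = softmax(scores)
    steps, chans = len(features[0]), len(features[0][0])
    fused = [
        [sum(a * f[t][c] for a, f in zip(alphas, features)) for c in range(chans)]
        for t in range(steps)
    ]
    return fused, alphas


# Hypothetical inputs; the identity function stands in for the shared MLP.
pooled = pool_downsample([[1.0, 4.0], [3.0, 2.0], [5.0, 0.0], [7.0, 8.0]], lambda v: list(v))
fused, alphas = aggregate([[[1.0, 1.0]], [[3.0, 5.0]]], scores=[0.0, 0.0])
```

Because the weights sum to one, the fused feature stays on the same scale as its inputs, whereas plain concatenation would leave the weighting of granularities to later layers.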

Label Assignment
For each action instance $\phi_{g,n} = (t_n^s, t_n^e)$ in the annotation $\psi_g$, its starting region and ending region are defined as $r_s = [t_s - d/10, t_s + d/10]$ and $r_e = [t_e - d/10, t_e + d/10]$, respectively, where $d = t_e - t_s$ is the duration of $\phi_{g,n}$. Then, by computing the maximum overlap ratio of each temporal interval with $r_s$, we can obtain $G_s = \{g^s_{t_n}\}$ as the starting label of TBG. The ending label $G_e = \{g^e_{t_n}\}$ can be obtained through the same label assignment process. For ACM, we follow the definition in BMN [28] to get the label of the action completeness map $G_c = \{g^c_{i,j}\}$.
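As an illustration of the starting-label assignment, the following sketch computes the maximum overlap ratio per temporal location. The discretization is an assumption made for this example: time is normalized to [0, 1] and location n is represented by the interval [n/T, (n+1)/T]; the exact interval definition follows BMN [28] and is not restated in this section.

```python
def overlap_ratio(a_min, a_max, b_min, b_max):
    """Fraction of interval a that is covered by interval b."""
    inter = max(0.0, min(a_max, b_max) - max(a_min, b_min))
    length = a_max - a_min
    return inter / length if length > 0 else 0.0


def start_labels(instances, num_locations):
    """Starting labels for `num_locations` evenly spaced temporal intervals.

    instances: list of (t_s, t_e) pairs in normalized time.
    """
    step = 1.0 / num_locations
    labels = []
    for n in range(num_locations):
        lo, hi = n * step, (n + 1) * step  # assumed interval for location n
        best = 0.0
        for t_s, t_e in instances:
            d = t_e - t_s
            # Starting region r_s = [t_s - d/10, t_s + d/10]
            best = max(best, overlap_ratio(lo, hi, t_s - d / 10, t_s + d / 10))
        labels.append(best)
    return labels


# Hypothetical annotation: one action covering the middle half of the video.
labels = start_labels([(0.25, 0.75)], num_locations=4)
```

Ending labels follow by replacing the starting region with $r_e$.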

Loss of ACM
ACM outputs an action completeness map $p_c$ with two channels. The training loss of ACM combines a regression loss $L_{reg}$ and a binary classification loss $L_{cls}$, computed on the two channels, respectively, where the L2 loss is adopted as $L_{reg}$ and the SI loss proposed in TSI [29] is adopted as $L_{cls}$.

Loss of TBG
TBG outputs the starting and ending probability sequences of the local and global branches, denoted as $P^s_l$, $P^e_l$, $P^s_g$ and $P^e_g$, respectively. We follow BMN [28] and adopt the binary logistic loss $L_{bl}$ as the starting and ending losses, supervising the boundary predictions of both branches with $G_s$ and $G_e$.
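For reference, a class-balanced binary logistic loss of the kind commonly used in BMN-style models can be sketched as below. The binarization threshold of 0.5, the epsilon and the guards against empty classes are assumptions of this sketch rather than values stated in this section.

```python
import math


def binary_logistic_loss(probs, labels, threshold=0.5, eps=1e-6):
    """Weighted binary logistic loss over one boundary probability sequence.

    probs: predicted probabilities in (0, 1); labels: soft labels in [0, 1].
    """
    n = len(probs)
    positives = [1.0 if g > threshold else 0.0 for g in labels]
    num_pos = sum(positives)
    # Balancing weights so that the rare positive locations are not swamped.
    w_pos = n / max(num_pos, 1.0)
    w_neg = n / max(n - num_pos, 1.0)
    total = 0.0
    for p, b in zip(probs, positives):
        total += w_pos * b * math.log(p + eps)
        total += w_neg * (1.0 - b) * math.log(1.0 - p + eps)
    return -total / n


# Hypothetical sequences: one boundary location out of four.
good = binary_logistic_loss([0.9, 0.1, 0.1, 0.1], [1.0, 0.0, 0.0, 0.0])
bad = binary_logistic_loss([0.1, 0.9, 0.9, 0.9], [1.0, 0.0, 0.0, 0.0])
```

The same function would be applied to each of the four sequences $P^s_l$, $P^e_l$, $P^s_g$ and $P^e_g$ with the matching label sequence; how the four terms are weighted is not specified here.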

The Training Objective of TCMNet
The multi-task loss function of TCMNet is a weighted sum of the TBG loss, the ACM loss and the L2 regularization term $L_2(\theta)$. The weight terms $\beta$ and $\lambda$ are set to 1 and 0.0001, respectively, to ensure that the different modules are trained evenly.

Proposal Selection
To ensure the diversity of proposals and guarantee high recall, we only use the local starting and ending probability sequences of TBG for proposal selection. When temporal locations in the probability sequences satisfy (1) local peak of boundary probabilities or (2) probabilities higher than 0.5·max(P), these temporal locations are regarded as the starting and ending locations. Then, we match all starting and ending locations to generate redundant candidate proposals ψ p .
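The two selection conditions and the matching step can be sketched as follows. The strict comparisons for the local peak, the integer location indices and the optional `max_duration` cap are assumptions of this example.

```python
def boundary_candidates(probs, ratio=0.5):
    """Indices that are local peaks or exceed ratio * max(P)."""
    peak_value = max(probs)
    picked = []
    for t, p in enumerate(probs):
        is_peak = 0 < t < len(probs) - 1 and p > probs[t - 1] and p > probs[t + 1]
        if is_peak or p > ratio * peak_value:
            picked.append(t)
    return picked


def match_proposals(start_probs, end_probs, max_duration=None):
    """Pair every candidate start with every later candidate end."""
    starts = boundary_candidates(start_probs)
    ends = boundary_candidates(end_probs)
    proposals = []
    for s in starts:
        for e in ends:
            if e > s and (max_duration is None or e - s <= max_duration):
                proposals.append((s, e))
    return proposals


# Hypothetical local-branch probability sequences.
start_probs = [0.1, 0.8, 0.2, 0.3, 0.25, 0.1]
end_probs = [0.0, 0.1, 0.1, 0.2, 0.9, 0.5]
candidates = match_proposals(start_probs, end_probs)
```

Location 3 in `start_probs` is kept although its probability is well below half of the maximum, because it is a local peak; this is how the first condition preserves recall for weak boundaries.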

Score Fusion and Proposal Suppression
To generate a more reliable confidence score, for each proposal $\phi$, we multiply its boundary probabilities by the confidence score to generate the final confidence score $p_f = p_{start} \cdot p_{end} \cdot p_c$, where $p_{start} = p^s_l \cdot p^s_g$ is the starting probability, $p_{end} = p^e_l \cdot p^e_g$ is the ending probability and $p_c$ is the action completeness score, which is the fusion of the classification score and the regression score, written as $p_c = p^{cls}_c \cdot p^{reg}_c$. Then, we apply Soft-NMS [37] to suppress redundant proposals and retrieve the final high-quality proposals. After the Soft-NMS step, we employ a confidence threshold to get the final sparse candidate proposals.
The ActivityNet-1.3 [38] dataset consists of 19,994 untrimmed videos with annotations for the action proposal task. The dataset has 200 action categories and is divided into training, validation and test sets by a ratio of 2:1:1. The THUMOS14 [39] dataset contains 200 annotated untrimmed validation videos with 20 action categories and 213 annotated untrimmed test videos with 20 action categories. On THUMOS14, we train TCMNet on the validation set and evaluate it on the test set.
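Returning to the score fusion and suppression steps described at the start of this subsection, a minimal sketch is given below. It uses the Gaussian-decay variant of Soft-NMS; the value of `sigma`, the confidence threshold and the example proposals are hypothetical, and the exact Soft-NMS configuration used for TCMNet is not stated here.

```python
import math


def temporal_iou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def fuse_score(ps_local, ps_global, pe_local, pe_global, p_cls, p_reg):
    """Final score: start probability x end probability x completeness score."""
    p_start = ps_local * ps_global
    p_end = pe_local * pe_global
    p_c = p_cls * p_reg
    return p_start * p_end * p_c


def soft_nms(proposals, sigma=0.5):
    """Gaussian Soft-NMS over (start, end, score) tuples.

    Overlapping proposals are not discarded; their scores are decayed
    according to their overlap with each selected proposal.
    """
    remaining = list(proposals)
    kept = []
    while remaining:
        best = max(remaining, key=lambda p: p[2])
        remaining.remove(best)
        kept.append(best)
        remaining = [
            (s, e, score * math.exp(-(temporal_iou((s, e), best[:2]) ** 2) / sigma))
            for s, e, score in remaining
        ]
    return kept


# Hypothetical proposals: two heavily overlapping, one separate.
ranked = soft_nms([(0.0, 10.0, 0.9), (1.0, 10.0, 0.8), (20.0, 30.0, 0.7)])
final = [p for p in ranked if p[2] >= 0.5]  # confidence threshold after Soft-NMS
```

In this example the near-duplicate proposal (1.0, 10.0) is ranked last and falls below the threshold, while the non-overlapping proposal keeps its score.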

Implementation Details
For video representation, we adopt two-stream networks TSN [40] and I3D [41] for feature encoding. During THUMOS14 feature extraction, the frame strides of I3D and TSN are set to 8 and 5, respectively. For ActivityNet-1.3, the sampling frame stride is 16. On ActivityNet-1.3, the feature sequence is rescaled to a length of 100 by linear interpolation. On the THUMOS14 dataset, the length of the sliding window is set to 128 and the overlap ratio is set to 0.5. When training TCMNet, we use Adam for optimization. The batch size is set to 8. The learning rate is set to 0.001 for the first seven epochs, and we decay it to 0.0001 for the other two epochs.
Table 1 illustrates the performance of our proposal generation method compared with other state-of-the-art methods on the validation set of the ActivityNet-1.3 dataset. It should be pointed out that, due to the limitations of experimental equipment, several TAPG methods (DBG, TSI) that we reimplemented on ActivityNet-1.3 did not achieve the results reported in the original papers. As can be seen from the table, our TCMNet outperforms other state-of-the-art proposal generation methods. Specifically, TCMNet outperforms BMN [28] by 0.92% and 1.07% in terms of AR@100 and AUC. In addition, TCMNet improves AUC from 67.93% to 68.17% on the validation set compared to TSI [29]. Additionally, when AN is one, our TCMNet improves AR by 1.12%, from 32.57% to 33.69%. It should be pointed out that action proposal generation focuses on the diversity of the retrieved proposals and judges the performance by the recall of top-N proposals, while the action detection task focuses on the accuracy of the top-N proposals. Therefore, some methods, such as DBG [23], can retrieve the actions with good diversity, but sacrifice the accuracy of top-N proposals, which leads to lower action detection performance. The results in Table 5 also support this point: the performance of DBG on action detection is much lower than that of other methods.

Comparison with State-of-the-Art Methods on THUMOS14
We also compare the performance of our method with other state-of-the-art methods on the THUMOS14 dataset, as shown in Table 2. Due to the excellent performance achieved by I3D and TSN in action recognition tasks, we use them in our TCMNet to extract features. For a fair comparison, we also re-implement BMN [28] and TSI [29] using the same TSN and I3D features through publicly available code. As can be seen from the table, our method using TSN_GTAD or I3D_PGCN video features outperforms BMN [28] and TSI [29] significantly when the proposal number is set within [50,100,200,500,1000]. Specifically, (1) based on the I3D_PGCN features, when the number of proposals varies from 50 to 1000, our method outperforms TSI by 2.09%, 1.63%, 1.58%, 1.18% and 0.93%. (2) Based on the TSN_GTAD features, when the number of proposals varies from 50 to 1000, our method outperforms TSI by 2.39%, 1.29%, 0.84%, 0.72% and 0.40%.

Ablation Study
In this section, we comprehensively evaluate our proposed TCMNet on the THUMOS14 dataset. We use the I3D_PGCN features as the visual feature sequence for the ablation experiments.

Effectiveness and Efficiency of Modules in TCMNet
We conduct ablation studies using different architectural settings to verify the effectiveness and efficiency of each module proposed in TCMNet. The evaluation results shown in Table 3 indicate that: (1) Integrating convolutional filters with different dilation rates effectively achieves different temporal receptive fields optimized for specific-duration actions.
(2) Unlike TSI [29] which employs max pooling for down-sampling, our proposed pooling operation for down-sampling can obtain fine-grained temporal information critical for localizing boundaries. (3) By further utilizing aggregation functions in TBG, deep semantic information of different granularities can be adaptively fused to reduce the impact of noise. (4) Finally, by integrating all the separated modules into an end-to-end framework, we can obtain competitive performance gains.

Study on Channel Scaling Factor r in TBG
Drawing on the idea in SENet [42], we explicitly model the weight of each feature channel through the squeeze-excitation architecture. We then use this weight to enhance useful channels and suppress channels that are not useful for boundary detection. The parameter r in the TBG module is tuned experimentally over the values 1, 2, 4 and 8. In Table 4, we notice that without channel dimension reduction (r = 1), the average recall (AR) under different average numbers of proposals (AN) drops severely, and AR@AN also drops as r exceeds a certain range. A reasonable explanation is that when the value of r is too large, the intermediate representation vector will lose key information, but when r is too small, the action-independent information contained in the intermediate representation vector will dominate. We finally adopt r = 2 by default for all experiments, with which we obtained the best results for AR@AN.

Effectiveness of Locating Actions with Different Durations
To further verify the effectiveness of locating actions with different durations, we follow the details of TSI [29] and conduct several ablation experiments, which are shown in Table 5. We divide the dataset into three groups according to the value of s (s stands for the scale of the ground truth): small-scale actions, in which 0 ≤ s < 0.06; middle-scale actions, in which 0.06 ≤ s < 0.65; and large-scale actions, in which 0.65 ≤ s ≤ 1.0. Each of these subsets has almost the same amount of ground truth to ensure fair comparisons. We then evaluate the methods on each sub-dataset. As can be seen from the table, TCMNet performs better on actions of all three duration groups.
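The grouping rule can be written down directly. In this sketch, s is assumed to be the ratio of the action duration to the video duration; the helper names are hypothetical.

```python
def action_scale(t_start, t_end, video_duration):
    """Scale s of a ground-truth action, assumed to be duration / video length."""
    return (t_end - t_start) / video_duration


def scale_group(s):
    """Assign a scale s in [0, 1] to the small, middle or large group."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("scale must lie in [0, 1]")
    if s < 0.06:
        return "small"
    if s < 0.65:
        return "middle"
    return "large"


# Hypothetical example: a 5-second action in a 100-second video.
group = scale_group(action_scale(10.0, 15.0, 100.0))
```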

Visualization of Qualitative Results
We also visualize qualitative results. The top five proposal predictions of BMN [28] and TCMNet on the ActivityNet-1.3 dataset are shown in Figure 7. The demonstrated canoeing video has three ground-truth action instances. However, due to the excessive learning for long actions, BMN may regard two individual action instances as only one and predict more proposals with a long duration. Additionally, the temporal boundaries of BMN are not accurate enough because it only treats boundaries as local clues and does not pay enough attention to the global context. Compared with BMN, our proposed method can retrieve the three action instances independently with higher overlap and more accurate boundaries.

Temporal Action Proposal Detection
In this section, we put the proposals into a temporal action detection framework to evaluate their detection performance. We adopt Mean Average Precision (mAP) as the evaluation metric for the temporal action detection task. On THUMOS14, mAP is calculated at the tIoU thresholds [0.3:0.1:0.7]. On ActivityNet-1.3, mAP at the tIoU thresholds {0.5, 0.75, 0.95} and the average mAP over the tIoU thresholds [0.5:0.05:0.95] are reported [10].
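The threshold notation [start:step:stop] and the tIoU criterion can be made concrete with a short sketch. It only expands the threshold grids and checks whether one detection matches a ground-truth segment; it is not a full mAP implementation, and the function names are hypothetical.

```python
def tiou_thresholds(start, stop, step):
    """Expand a [start:step:stop] range into an explicit list of thresholds."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred, ground_truths, threshold):
    """A detection counts as correct if it reaches the tIoU threshold with any ground truth."""
    return any(temporal_iou(pred, gt) >= threshold for gt in ground_truths)


thumos_grid = tiou_thresholds(0.3, 0.7, 0.1)
activitynet_grid = tiou_thresholds(0.5, 0.95, 0.05)
```

A detection with tIoU 0.8 against its ground truth is therefore counted as correct at thresholds 0.5 and 0.75 but not at 0.95, which is why mAP drops sharply at the strictest threshold.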
For ActivityNet-1.3, we first use TCMNet to generate a set of action proposals for each video and keep the top 100 proposals for subsequent detection. Then, we adopt the top video-level classification result provided by CUHK [43] as the detection result. The experimental results are shown in Table 6; we can see that our method outperforms TSI by 1.53%, 1.42% and 0.94% when tIoU varies from 0.5 to 0.95 and achieves the mAP of 34.03%. Furthermore, compared to recent methods, TCMNet can achieve state-of-the-art results at tIoU = 0.5 and tIoU = 0.95.
For THUMOS14, we first use TCMNet to generate 200 temporal proposals per video. Then, we use the top two video-level classification results generated by the UntrimmedNet [44] classifier to generate classification results for each proposal. As can be seen from Table 7, TCMNet achieves the best results at tIoU 0.6 (44.8%) and 0.7 (32.1%). Specifically, our TCMNet outperforms TSI by 3.8%, 4.9%, 4.9%, 5.2% and 4.4% when tIoU varies from 0.3 to 0.7. These results indicate that proposals generated by TCMNet are more accurate.
Table 6. Temporal action detection results on the validation set of ActivityNet-1.3 in terms of mAP at different tIoU thresholds. "re" denotes that this method is re-implemented by ourselves.
Table 7. Temporal action detection results on the test set of THUMOS14 in terms of mAP at different tIoU thresholds. "re" denotes that this method is re-implemented by ourselves.

Conclusions
In this paper, we proposed a Temporal Context Modeling Network (TCMNet) for generating temporal action proposals. TCMNet effectively achieved different temporal receptive fields optimized for specific-duration actions by embedding convolutional layers containing different dilation rates. To predict precise action boundaries, the Temporal Boundary Generator (TBG) module improved the local-global complementary architecture in TSI. TBG obtained useful deep semantic information by embedding the proposed pooling operation and achieved an adaptive fusion of semantic features through an aggregation function to reduce noise disturbance. Extensive experiments on ActivityNet-1.3 and THUMOS14 datasets demonstrated the effectiveness of our method in terms of temporal action proposal and detection performance. In the beginning, we considered that the contextual information exploited in previous work was often characterized by the similarity between frames (or proposals) at the semantic feature level, without taking into account the temporal location contextual interactions between frames (or proposals). Temporal location contextual interactions are valuable prior knowledge. Therefore, we tried to embed position encoding in the temporal action proposal generation framework, but we did not achieve the desired effect. A possible reason is that fixed sinusoidal position encoding can only provide relative distance information without direction. In future work, we will try to augment feature representations with directed temporal positional encoding for more precise localization of actions.