Local Attention Sequence Model for Video Object Detection

Abstract: Video object detection still faces several difficulties and challenges. For example, the imbalance of positive and negative samples leads to low information processing efficiency, and detection performance declines under abnormal conditions in video. This paper examines video object detection based on local attention to address these challenges. We propose a local attention sequence model and optimize the parameters and computation of ConvGRU. The resulting model processes spatial and temporal information in videos more efficiently and ultimately improves detection performance under abnormal conditions. Experiments on ImageNet VID show that our method improves mAP by 1.8 points, a relative increase of 5.3%, and the visualization results show that the method adapts to different abnormal conditions, thereby improving the reliability of video object detection.


Introduction
Object detection is a fundamental problem in computer vision and has been widely used in the fields of surveillance, robots, medical intelligence, etc. In recent years, with the rapid popularization of deep convolutional networks, many scholars have also conducted much research on object detection algorithms based on deep convolutional networks [1,2], which has greatly improved the performance of image object detection. In 2013, Girshick proposed R-CNN [3], which used a deep convolutional network to achieve object detection for the first time, and the mAP index on the VOC dataset was approximately doubled when compared to traditional detection methods. Subsequently, Ren proposed Faster R-CNN [4] and designed a region proposal network to extract candidate regions, which improves the detection accuracy and greatly reduces the detection running time. However, the speed of the two-stage detection model based on R-CNN is low. Redmon proposed the YOLO [5] detection framework in 2015, which rasterizes images and predicts the object category and bounding box for each grid at the same time.
Applying such image-based object detectors to the domain of videos, however, is often unsatisfactory due to the deteriorated appearance caused by issues such as motion blur, out-of-focus camera, and rare poses frequently encountered in videos. These problems cannot be effectively solved by relying only on static images. The video can provide context and temporal information containing multiple frames of images. Combining this information can solve the above problems more effectively. Existing methods that leverage temporal information for object detection from videos usually use optical flow to propagate high-level features across frames. Extra optical flow models, e.g., FlowNet [6,7], have to be utilized to establish motion information and achieve better performance, which leads to excessive model parameters and calculations that are not conducive to model deployment.
In addition, the optical flow models establish motion information between local pixels, and it is difficult to model the continuity between high-level semantic features.
Instead of relying on optical flow, we propose an innovative video object detection model based on local attention. Specifically, we design the spatial attention module and local attention sequence model to improve video object detection accuracy and modify ConvGRU (convolutional gated recurrent units) [8] to process video context and temporal information in order to improve object detection reliability.
We conducted extensive experiments on ImageNet VID for video object detection. Our method outperforms the original method in accuracy while still running in real time.
In summary, our contributions are as follows:
(1) We introduce a novel video object detector based on local attention to establish the spatial and temporal correspondence across frames without extra optical flow models.
(2) We propose a spatial attention module and a local attention sequence module and modify ConvGRU to model spatial and temporal appearance and enhance feature representation.
(3) We conduct experiments on ImageNet VID and achieve improved performance.

Image Object Detection
Existing state-of-the-art methods for image object detection mostly follow two paradigms: two-stage and single-stage pipelines. A two-stage pipeline consists of region proposal, region classification, and location refinement. Girshick proposed the R-CNN [3] detection framework, which extracts image features through a convolutional network. Subsequently, Ren proposed Faster R-CNN [4], which introduces a region proposal network (RPN) built on a convolutional network and uses multi-scale anchor boxes to extract candidate regions with higher confidence. Lin proposed FPN [9] to detect objects on feature maps from multiple stages. Compared with two-stage detectors, single-stage methods are faster but less accurate. Redmon proposed YOLO [5], which predicts the categories and bounding boxes on each grid cell simultaneously. Liu proposed SSD [10], which places anchors on feature maps at different depths of a deep convolutional network and then obtains categories and bounding boxes through convolution operations on each of these maps. Lin proposed the focal loss function [11] to address the imbalance between easy and hard examples. In this paper, we use YOLO as our base detector.

Video Object Detection
The T-CNN [12] framework designed the multi-context suppression module and motion-guided propagation module to process context and motion information between adjacent frames and combine tracking algorithms in order to improve the classification accuracy of detection sequences. The Seq-NMS [13] algorithm only incorporates video temporal information in the post-processing operation of image object detection and can significantly improve the performance of video object detection through simple expansion. Zhu designed the FGFA [14] framework, which uses an optical flow model at the feature map level to estimate the motion information between adjacent frames. By combining motion information and adjacent frame features to improve the feature response of the current frame to obtain higher quality detection results, the FGFA framework effectively improves the detection effect of video frames affected by motion blur. However, the content of only part of the key frames in the video shows great changes, and the other adjacent frames have a high degree of correlation. It is not necessary to perform feature fusion on each frame of image. Thus, Zhu [15] only uses the deep convolutional network to extract image features in key frames and combines the optical flow network to fuse the motion information between key frames, while the features of non-key frames are obtained by updating the features of the motion part based on the previous key frame features according to the optical flow network. At the same time, the selection of key frames is adaptively decided based on the quality of the feature map, which steadily improves the detection performance and runs efficiently. Liu [16] combined the ConvLSTM [17] module at the SSD detection feature map level to process spatial and temporal information at the same time and obtained features with higher timing consistency and quality, allowing for improved detection performance. 
Xiao designed the spatial-temporal memory module (STMM) [18] to achieve video object detection. STMM is similar to ConvLSTM and uses a bidirectional recurrent network to process the information of the preceding and following frames at the same time.

Self-Attention
Self-attention is a mechanism first introduced in [19] for machine translation. Jaderberg proposed spatial transformer networks [20], which apply global scaling, rotation, and other transformations to the feature map, making the network invariant to such transformations. Hu proposed the squeeze-and-excitation network [21]. By modeling the correlation between the channels of the convolutional feature map, it assigns each channel a different importance weight, thereby recalibrating the channel features. Through this channel-domain attention mechanism, the network can use global features to select and strengthen the features that matter more for the current target and suppress less important ones, thereby improving the efficiency and performance of the network. Wang proposed non-local neural networks [22], which draw on non-local means filtering in image processing and use the weighted sum of the features at all locations to represent the feature response at a given location, so as to model long-distance feature dependence.

Overview
To make more effective use of the temporal information in video, we build a video object detection framework based on local attention, as shown in Figure 1. Given a video, each frame is first processed by a CNN such as DarkNet53 [23] to extract features. The YOLO detector then predicts object categories and bounding boxes at multiple scales. To reduce the imbalance of positive and negative samples, we propose the spatial attention module to classify foreground objects on multi-scale feature layers. The local attention sequence model is applied after the spatial attention module to obtain a distribution of spatial attention with temporal consistency, thereby improving the performance of video object detection. We also modify ConvGRU to effectively establish the temporal information across frames, providing higher-quality features for the subsequent detector. In the following sections, we describe in detail the proposed spatial attention, local attention, and modified ConvGRU.

Spatial Attention
The spatial attention module obtains the distribution of spatial attention by modeling the correlation of features at various spatial positions and guides the model to pay more attention to the feature regions that are more useful for subsequent tasks. As shown in Figure 2, to obtain the spatial attention distribution, the maximum feature response and the average feature response in each grid cell of the feature map are first obtained through max pooling and average pooling. These two features are then concatenated, and a small convolutional network models the feature correlation in each local area to produce the attention distribution. Finally, the attention distribution and the original feature map are multiplied in the spatial dimension to strengthen the more noteworthy regional features and suppress the remaining unimportant ones.
Formally, let $F$ and $F'$ be the original feature map and the spatial attention feature map, respectively. After obtaining the maximum and average feature responses, we use a small convolutional network to compute the distribution of spatial attention $A$ as
$$A = \sigma\left(W_2 * g\left(W_1 * [\mathrm{MaxPool}(F), \mathrm{AvgPool}(F)]\right)\right), \qquad F' = A \odot F,$$
where $W_1$ and $W_2$ are the parameters of the two-layer convolutional network with 3 × 3 kernel size, $[x, y]$ is the concatenation operation, $*$ is convolution, and $\odot$ is element-wise multiplication. The function $g$ represents the ReLU (rectified linear unit) activation function, and $\sigma$ represents the sigmoid activation function.
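The spatial attention computation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper `conv3x3`, the hidden width of 4 channels, the random weights, and the choice to pool across the channel dimension at each grid cell are all assumptions made for illustration.

```python
import numpy as np


def conv3x3(x, w):
    """Zero-padded 3x3 cross-correlation (hypothetical helper).

    x: (C_in, H, W) input, w: (C_out, C_in, 3, 3) kernel.
    """
    c_out = w.shape[0]
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero padding keeps H x W
    out = np.zeros((c_out, h, wd))
    for dy in range(3):
        for dx in range(3):
            patch = xp[:, dy:dy + h, dx:dx + wd]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out


def spatial_attention(feat, w1, w2):
    """Return the attention map and the reweighted feature map."""
    # Assumption: max/average responses are pooled over channels per cell.
    f_max = feat.max(axis=0, keepdims=True)          # (1, H, W)
    f_avg = feat.mean(axis=0, keepdims=True)         # (1, H, W)
    pooled = np.concatenate([f_max, f_avg], axis=0)  # (2, H, W)
    hidden = np.maximum(conv3x3(pooled, w1), 0.0)    # g = ReLU
    logits = conv3x3(hidden, w2)                     # (1, H, W)
    attn = 1.0 / (1.0 + np.exp(-logits))             # sigma = sigmoid
    return attn, feat * attn                         # broadcast over channels


rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))
w1 = rng.standard_normal((4, 2, 3, 3)) * 0.1  # hidden width 4 is hypothetical
w2 = rng.standard_normal((1, 4, 3, 3)) * 0.1
attn, out = spatial_attention(feat, w1, w2)
print(attn.shape, out.shape)
```

Because the sigmoid keeps every attention value strictly between 0 and 1, the multiplication can only attenuate features; regions with attention close to 1 pass through almost unchanged.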

Local Attention Sequence Model
The distribution of spatial attention changes over time and has a certain continuity. Optical flow networks are usually used to establish such motion information. However, this requires additional deep convolutional networks such as FlowNet, leading to excessive model parameters and computation, which is not conducive to model deployment. In addition, the optical flow network establishes motion correspondence between local pixels, and it is difficult to model the continuity between high-level semantic features. Since the motion between adjacent moments occurs mostly in a local neighborhood, we design the local attention sequence model, which focuses on small-range motion information in the local neighborhood so as to establish the temporal consistency of the spatial attention module.
Specifically, the local attention sequence model consists of two steps. The first step is to obtain the aligned distribution of spatial attention by aggregating the corresponding feature cells with correspondence weights. Given two adjacent frames $F_t$ and $F_{t-1}$, we first compute the affinity between two feature cells at various positions. Then, we compute the normalized correspondence weights in the local area. Finally, we compute the weighted sum of the corresponding feature cells in the local area as the aligned distribution. Here, $C_{x,y}$ represents the affinity matrix, $T_{x,y}$ represents the correspondence weight matrix, which is restricted to the sub-region with stride $k$, and $\hat{A}_t$ represents the aligned distribution of spatial attention.
In the second step, after obtaining the aligned distribution of spatial attention, a small neural network, named the update network, is devised to fuse the two distributions adaptively, with the goal of incorporating the temporal context of videos. As shown in Figure 3, the update network takes the concatenation of the two distributions and obtains an adaptive weight through a single convolutional network. Then, we compute the weighted sum of the two distributions as the final distribution.
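The two steps (alignment, then adaptive fusion) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact formulation: the affinity is taken to be a dot product between feature cells, the normalization is a softmax over a (2k+1) × (2k+1) window (so `k` is treated here as a window radius), and the update network is reduced to a hypothetical 1 × 1 convolution with weights `w` whose sigmoid output mixes the two maps as a convex combination.

```python
import numpy as np


def align_attention(feat_t, feat_prev, attn_prev, k=1):
    """Align the previous attention map to the current frame.

    feat_t, feat_prev: (C, H, W) feature maps; attn_prev: (H, W).
    """
    _, h, w = feat_t.shape
    aligned = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            scores, values = [], []
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        # Affinity: dot product (an assumed choice).
                        scores.append(feat_t[:, y, x] @ feat_prev[:, yy, xx])
                        values.append(attn_prev[yy, xx])
            scores = np.array(scores)
            weights = np.exp(scores - scores.max())  # stable softmax
            weights /= weights.sum()                 # correspondence weights
            aligned[y, x] = weights @ np.array(values)
    return aligned


def fuse(attn_t, aligned, w, b=0.0):
    """Update step: per-cell gate from both maps, then a weighted sum."""
    gate = 1.0 / (1.0 + np.exp(-(w[0] * attn_t + w[1] * aligned + b)))
    return gate * attn_t + (1.0 - gate) * aligned


rng = np.random.default_rng(0)
feat_prev = rng.standard_normal((8, 6, 6))
feat_t = rng.standard_normal((8, 6, 6))
attn_prev = rng.random((6, 6))
attn_t = rng.random((6, 6))
aligned = align_attention(feat_t, feat_prev, attn_prev, k=1)
final = fuse(attn_t, aligned, w=np.array([0.5, -0.5]))
print(aligned.shape, final.shape)
```

Restricting the search to a small window keeps the cost linear in the number of feature cells, which matches the motivation of modeling only small-range motion between adjacent frames.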

Modified ConvGRU
Video object detection has a high demand for real-time performance. We introduce ConvGRU to establish video temporal information. However, ConvGRU has a large number of parameters and a high computational cost, which seriously affects the efficiency of the model. Therefore, we modify the traditional ConvGRU and reduce its parameters and computation so as to improve the video object detection performance without excessively increasing the running time.
The network details of the modified ConvGRU are shown in Figure 4. First, a concatenation layer connects the input state $X_t$ and the hidden state $H_{t-1}$, which makes full use of the spatial and temporal information. However, it also doubles the feature dimension. Therefore, a 1 × 1 convolution is used to compress the feature dimension in order to reduce the computation of subsequent modules, and grouped convolution [24] is then adopted to further reduce the parameters and computation.
Figure 4. Modified ConvGRU.
Formally, let $X$ and $H$ be the current feature map and the hidden feature map, respectively. Here we use ReLU6 as the gating activation function to make the gating unit activation sparse so that more unimportant historical information is forgotten. In the formulation of the temporal information, $Z$ and $R$ represent the update gate and reset gate, respectively; $[x, y]$ is the concatenation operation; $*$ is convolution; and $\odot$ is element-wise multiplication. $W_{1\sim3}$ represent the 1 × 1 convolution parameters, and $W_{\{z,r,h\}}$ represent the grouped convolution parameters.
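To see why the 1 × 1 compression followed by grouped convolution reduces the parameter count, the following sketch compares weight counts for a standard ConvGRU and for the modified layout. The channel width (256), the 3 × 3 kernel, and the group count (4) are hypothetical values chosen for illustration, and biases are ignored, so the ratio it computes is not meant to reproduce the figures reported in the ablation study.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 2D convolution (bias ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel * kernel


def standard_convgru_params(c, kernel=3):
    # Three k x k convolutions (update gate, reset gate, candidate state),
    # each mapping the concatenated 2c channels to c channels.
    return 3 * conv_params(2 * c, c, kernel)


def modified_convgru_params(c, kernel=3, groups=4):
    # Per branch: 1 x 1 compression from 2c to c channels,
    # followed by a grouped k x k convolution on c channels.
    per_branch = conv_params(2 * c, c, 1) + conv_params(c, c, kernel, groups)
    return 3 * per_branch


c = 256  # hypothetical channel width
std = standard_convgru_params(c)
mod = modified_convgru_params(c)
print(std, mod, round(mod / std, 3))  # 3538944 835584 0.236
```

With these assumed settings, the three 1 × 1 convolutions and three grouped convolutions together need roughly a quarter of the weights of the three full 3 × 3 convolutions; a larger group count lowers the ratio further at the cost of less cross-channel mixing.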

Dataset and Setup
We evaluated our framework on the ImageNet VID [25] dataset, which contains objects of 30 classes with fully annotated bounding boxes. The local-attention video object detection model was implemented in the PyTorch framework. The hardware environment of the training server was an Intel Xeon (Skylake) Platinum 8163 2.5 GHz CPU, 32 GB of DDR4 memory, and an NVIDIA V100 16 GB GPU.
The training of the video object detection model consisted of two stages: first, images were used to train the object detection network without the sequence model; then, the sequence model was introduced and trained on video sequences. For data augmentation, we used random scaling and cropping, limited to 1/5 of the image size; random adjustment of exposure and saturation in the HSV color space, limited to a factor of 1.5; and random horizontal flipping with a probability of 50%. During training, we used an SGD optimizer with a momentum coefficient of 0.9, a batch size of 64, an initial learning rate of 0.001, and a weight decay of 0.0005. With the warm-up strategy, the learning rate increased linearly from 0.0001 to the initial learning rate over the first 2000 iterations and was then reduced by a factor of 10 at the 40,000th iteration; training ran for a total of 60,000 iterations.
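The learning rate schedule described above can be written as a small function. This is a sketch of the stated schedule only; the function and parameter names are illustrative and do not come from the authors' code.

```python
def learning_rate(iteration, base_lr=0.001, warmup_start=0.0001,
                  warmup_iters=2000, decay_at=40000, decay_factor=0.1):
    """Learning rate at a given iteration (0-based)."""
    if iteration < warmup_iters:
        # Linear warm-up from warmup_start to base_lr.
        frac = iteration / warmup_iters
        return warmup_start + frac * (base_lr - warmup_start)
    if iteration < decay_at:
        return base_lr
    # Single step decay: divide by 10 from iteration 40,000 onward.
    return base_lr * decay_factor


for it in (0, 1000, 2000, 39999, 40000, 59999):
    print(it, learning_rate(it))
```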

Results
We compared our method with the original YOLO method for video object detection. The results are shown in Table 1, where mAP is the mean average precision, P is the precision, R is the recall, and F1 = 2PR/(P + R). From the comparison of experimental results, the following conclusions can be drawn:
(1) Video object detection based on the local attention sequence model proposed in this paper effectively improves detection performance.
(2) Compared with single-image object detection, mAP increases by 1.8 points after the introduction of the sequence model, a relative increase of 5.3%.
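As a quick check of the arithmetic, the sketch below computes F1 from precision and recall and recovers the relative mAP gain from the absolute gain. The precision and recall values (0.8 and 0.6) are hypothetical; the baseline mAP is derived under the assumption that the 35.57% mAP reported in the conclusions is the score of the full model after the 1.8-point gain.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_gain(new_score, absolute_gain):
    """Relative improvement over the baseline (new_score - absolute_gain)."""
    baseline = new_score - absolute_gain
    return absolute_gain / baseline


print(round(f1_score(0.8, 0.6), 4))                # 0.6857 (hypothetical P, R)
print(round(100 * relative_gain(35.57, 1.8), 1))   # 5.3
```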
To further understand how the local attention sequence model improves the reliability of video object detection, we randomly selected some video object detection results, as shown in Figure 5. The comparison shows that the model with the local attention sequence model better handles difficult cases caused by occlusion during object movement, pose changes, and blurring due to camera movement.

Ablation Study
We conducted an ablation study on ImageNet VID to validate the effectiveness of the proposed modules. The results are shown in Table 2, where mAP is the mean average precision, Time is the inference time, FLOPs is the computational cost of the module, and Parameters is the number of module parameters.
By comparing the results of the ablation experiments, the following conclusions can be drawn: (1) Both the modified ConvGRU and the local attention sequence model improve the performance of video object detection. (2) Compared with the traditional ConvGRU, the modified ConvGRU adds fewer parameters and less computation, with relative increases of 17.3% and 9.5%, respectively, which are equivalent to 24.8% and 30.3%, respectively, of those of the traditional ConvGRU, while the model accuracy is basically the same. (3) Of the two modules, the modified ConvGRU gives the larger performance gain: it increased mAP by 1.09, a relative increase of 3.2%, whereas the local attention sequence model increased mAP by 0.69, a relative increase of 2.1%. (4) However, the local attention sequence model adds relatively few parameters and little computation, with increases of 1.4% and 3.1%, respectively. (5) Using the modified ConvGRU and the local attention sequence model together improves performance the most, increasing mAP by 1.8, a relative increase of 5.3%. Some randomly selected illustrations of the local attention are shown in Figure 6.

Figure 6. Illustration of local attention.

Conclusions
In this paper, we examined video object detection that integrates local attention, focusing on three aspects: (1) we designed a spatial attention module to improve the efficiency and accuracy of object detection; (2) we designed a local attention sequence model to process video context and temporal information more efficiently and to address the low detection performance under abnormal conditions in videos; (3) we modified ConvGRU to establish temporal information more effectively, thereby improving the quality of the video features. We conducted ablation studies on ImageNet VID to examine the effectiveness of our framework in video object detection. The proposed framework achieved 35.57% mAP on ImageNet VID. However, studies focusing on this topic remain scarce, and several related directions are worthy of further research, such as combining video key frames to reduce computation and introducing additional supervision signals to improve the accuracy of the attention distribution.