Non-Local Spatial and Temporal Attention Network for Video-Based Person Re-Identification

Abstract: Given a video containing a person, the video-based person re-identification (Re-ID) task aims to identify the same person in videos captured by different cameras. How to embed the spatial-temporal information of a video into its feature representation is a crucial challenge. Most existing methods fail to make full use of the relationship between frames during feature extraction. In this work, we propose a plug-and-play non-local attention module (NLAM) for frame-level feature extraction. NLAM, based on global spatial attention and channel attention, helps the network determine the location of the person in each frame. In addition, we propose a non-local temporal pooling (NLTP) method for temporal feature aggregation, which can effectively capture long-range and global dependencies among the frames of the video. Our model obtained strong results on different datasets compared to state-of-the-art methods. In particular, it achieved a rank-1 accuracy of 86.3% on the MARS (Motion Analysis and Re-identification Set) dataset without re-ranking, which is 1.4% higher than the previous state-of-the-art method. On the DukeMTMC-VideoReID (Duke Multi-Target Multi-Camera Video Reidentification) dataset, our method also achieved 95% rank-1 accuracy and 94.5% mAP (mean Average Precision).


Introduction
Person re-identification (Re-ID) aims to use computer vision algorithms for cross-camera tracking, which means finding the same person under different cameras. Person Re-ID identifies a probe person in one camera by matching his/her images or videos against those from other cameras, and it has many practical applications, including intelligent surveillance and criminal investigation. Person Re-ID can be divided into image-based and video-based person Re-ID. Image-based person Re-ID has made significant progress in terms of both solutions [1,2] and the construction of large benchmark datasets [3,4]. Recently, more work [5][6][7] has begun to focus on video-based person Re-ID because video data contain richer information than image data. By extracting more spatial and temporal cues from video data, video person Re-ID has the potential to solve some of the challenges faced in image person Re-ID, e.g., the occlusion of pedestrians as they walk.
In the video-based person Re-ID task, the dataset is composed of many consecutive sequences of images rather than static images. In this article, a video is composed of several sequences, and a sequence includes several frames. The critical challenge is to make use of the temporal clues embedded in the sequences. Some previous work [5][6][7] has typically divided this task into two steps. In the first step, image-based convolutional neural
1. Extraction of dynamic features from other CNN inputs, e.g., by optical flow [8].
The third category to which our work belongs is currently dominant in video-based person Re-ID tasks. Most existing methods represent the frame of the video as a feature map and then use an average or maximum pooling across frames to obtain a representation of the input video. However, this approach tends to fail when occlusions are frequent in the video because it processes all images in the video with equal importance. In order to distill the relevant information from a video and weaken the influence of noisy samples, some works have learned the temporal attention score of each frame in a given video by using recurrent neural networks (RNNs) to solve this problem. The limitation of the RNN method is that it requires sequential calculations to be performed. As a result, it is difficult to compute in parallel and make full use of the graphics processing unit (GPU) hardware. Additionally, a single recurrent operation could only calculate the dependency between the current and the latest frame. In general, it is difficult for RNNs to capture long-range dependencies.
In this paper, we propose a non-local spatial and temporal attention network for video-based person Re-ID. We improve the non-local neural network [14] and apply it to the video-based person Re-ID task with excellent results. The novelty of our approach is that we use non-local neural networks to compute long-range spatial and temporal dependencies among video frames. Each attention score depends on all the frames in the video, as shown in Figure 1, not just on the adjacent frames. This gives a better video-level feature representation, treating the video as a whole rather than as merely a few images. Additionally, we apply the improved non-local neural network to the CNN at different levels, so that the features at different levels perform better. We use the non-local attention mechanism for both frame-level feature extraction and temporal aggregation. Our main contributions are three-fold:

1. We propose a plug-and-play non-local attention module (NLAM). It can be inserted into CNN networks for frame-level feature extraction. In the video-based person Re-ID task, it allows the spatial position of the target person in the image to be determined more accurately.
2. We propose a non-local temporal pooling (NLTP) method for temporal feature aggregation. We use it to replace plain average or maximum pooling, which cannot take the relationship between video frames into account.
3. We verified the effectiveness of our two methods on different datasets.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 3 of 15

Figure 1. Some examples of the application of non-local attention in our network. The starting point of the arrow represents one frame in a video, and the picture pointed by the arrow represents another frame of the video. For brevity, we only selected four frames in the figure to show how our model directly finds relevant clues in frames to support its prediction. The 'blue' arrow represents the similarity between the first frame and the remaining frames, and the 'green', 'red', and 'orange' respectively represent the similarity of the second, third, and fourth frames with the remaining frames.

Related Works
Person Re-ID has long been an active field in computer vision. In this section, we review the development of person Re-ID, from image-based to video-based methods. Additionally, we introduce a video-based person Re-ID pipeline.

Image-Based Person Re-ID
The purpose of image-based person Re-ID is to match a given probe image to the same person in a set of images (gallery images) captured by another non-overlapping camera. Existing methods usually complete this work in two steps: (1) extract feature vectors, and (2) calculate the similarity of the two feature vectors. With the continuous development of CNNs [15][16][17][18][19], image features learned by a CNN have replaced hand-crafted features [3,20,21,22] for representing person images. After extracting features from the images, a distance metric is used to calculate the similarity/dissimilarity between the features of two images. Ideally, the distance between two images of the same person should be smaller than the distance between two images of different people. As suggested by Zheng et al. [23], feature vectors can be learned through discriminant learning and metric learning. Discriminant learning uses cross-entropy loss [17,19] to learn the deep features used for identity classification. Metric learning uses triplet loss to increase the distance among classes and reduce the distance within classes. In our work, we use both loss functions to train our network.

Video-Based Person Re-ID
Video-based person Re-ID can be seen as an extension of image-based person Re-ID. Compared to static images, video provides richer information for person Re-ID because it contains both spatial and temporal information. Video-based person Re-ID is also closer to real-world applications. As a result, it has attracted the attention of more researchers in recent years. Some early work [5,24,25] considered frame-level similarities to identify the person. Recently, deep learning methods have been applied to obtain more discriminative video-level features. These methods first train a CNN to extract image features and then aggregate them into video features through average or maximum pooling. McLaughlin et al. [5] proposed a method for extracting temporal information using an RNN and a temporal pooling layer. Following [5], Xu et al. [7] proposed a spatial and temporal attention pooling network (STAPN), which extracts more robust features by calculating attention in the spatial and temporal dimensions. Li et al. [13] proposed a new spatial-temporal attention model to distinguish different body parts automatically.

A Video-Based Person Re-ID Pipeline
In this article, we follow the structure summarized by previous researchers, which is the most commonly used base structure for video-based person Re-ID. It mainly consists of two parts: (1) Feature extraction: this part extracts meaningful abstract spatial representations from video frames using models pre-trained on ImageNet [26,27], such as the residual network (ResNet50 [26]) and the squeeze-and-excitation residual network (SE-ResNet50 [27]). (2) Temporal feature aggregation: in this part, the frame-level features extracted in the previous step are aggregated into video-level features. Gao et al. [28] summarized that feature aggregation methods can be roughly divided into three types: the average temporal pooling (TPavg) operation, temporal attention (TA), and the RNN layer. Subramaniam et al. [29] compared different feature extraction methods and temporal feature aggregation methods. The comparison results are shown in Table 1. Two main conclusions can be drawn from them: First, the choice of the backbone network affects the overall performance of the system, and SE-ResNet50 performs better than ResNet50. Second, TPavg is superior to TA and RNN. Therefore, we chose SE-ResNet50 + TPavg as the baseline of our work.

Table 1. Comparison of feature extraction and temporal aggregation methods on MARS [30] and DukeMTMC-VideoReID (Duke Multi-Target Multi-Camera Video Reidentification) [31]. TPavg, TA, and RNN stand for average temporal pooling, temporal attention, and recurrent convolution network, respectively. The best results are shown in bold.
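The average temporal pooling (TPavg) step of the baseline pipeline can be illustrated with a minimal NumPy sketch. The function name `temporal_average_pooling` and the toy feature values are assumptions made for illustration; in the real pipeline the frame-level features would come from the SE-ResNet50 backbone.

```python
import numpy as np

def temporal_average_pooling(frame_features: np.ndarray) -> np.ndarray:
    """Aggregate frame-level features of shape (N, D) into one (D,) video-level feature."""
    return frame_features.mean(axis=0)

# Toy stand-in for backbone outputs (assumed values): N = 4 frames, D = 3 dimensions.
frame_features = np.array([[1.0, 0.0, 2.0],
                           [3.0, 0.0, 2.0],
                           [1.0, 4.0, 2.0],
                           [3.0, 4.0, 2.0]])
video_feature = temporal_average_pooling(frame_features)
```

Every frame contributes with the same weight 1/N, which is why this form of pooling cannot down-weight occluded or noisy frames.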

Our Approach
In this part, we describe our network structure and the innovations we propose in detail. In Section 3.1, we explain the role of our proposed NLAM in the frame-level feature extraction process. In Section 3.2, we take one sequence of a video as an example to analyze our proposed NLTP method. In Section 3.3, we explain the loss function we adopted. Our overall network architecture is shown in Figure 2. Similar to a standard video-based Re-ID framework, it mainly includes two parts: the feature extractor and temporal feature aggregation. The difference is that we insert our proposed non-local attention module (NLAM) into the feature extractor, and we adopt our proposed non-local temporal pooling (NLTP) method in the temporal feature aggregation part. NLAM is inserted between CNN blocks to extract spatial attention and channel attention. It helps the CNN determine the location of the target person in each frame and reduces the interference caused by occlusion in the image. NLTP improves on the previous temporal pooling method by acquiring temporal attention in a non-local way in the first step and embedding temporal features into video-level features through pooling in the second step. The primary purpose of NLTP is to give a higher weight to frames that are more representative of the entire sequence, thereby obtaining a more robust video-level representation.

Frame-Level Feature Extraction
In frame-level feature extraction, we use the SE-ResNet50 image recognition architecture as the feature extractor. SE-ResNet50 contains five consecutive CNN blocks (one initial convolution block, followed by four successive squeeze-and-excitation (SE) residual blocks). We argue that a CNN alone is not sufficient for extracting the features of an image. With the addition of an attention mechanism, the CNN can extract more critical spatial information from the image, similar to human visual attention. We add NLAM between CNN blocks to obtain a better frame-level feature representation. The overall structure of NLAM is shown in Figure 3.

In the NLAM spatial attention part, we perform the spatial attention calculation on the feature maps output by the preceding CNN block. The input feature tensor $X \in \mathbb{R}^{N \times C \times H \times W}$ is obtained from a sequence of N feature maps of size C × H × W. We aim to exchange spatial information among all frames in the sequence to better locate the target person in the image and reduce the interference of occlusion between frames. Let $x_i$ be sampled from X. First, we reduce the channel dimension of the input feature to $C'$ through three 1 × 1 convolution blocks (a, b, d) to obtain $A, B, D \in \mathbb{R}^{C' \times NHW}$. The transpose of A is multiplied by B to obtain the attention score of every position $x_j$ with respect to position $x_i$, using the embedded Gaussian instantiation. Then, the matrix $M \in \mathbb{R}^{NHW \times NHW}$ of normalized attention scores is used to calculate the response $y_i$ of each position $x_i$ as a weighted average over all positions $x_j$. Finally, Y is restored to the same size as the input X by a 1 × 1 convolution, and the restored result is added to the original feature tensor X to obtain the final result $A_S$. NLAM spatial attention can be formulated as follows:

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} e^{a(x_i)^{\top} b(x_j)}\, d(x_j), \qquad \mathcal{C}(x) = \sum_{\forall j} e^{a(x_i)^{\top} b(x_j)} \tag{1}$$

$$A_S = W_0 Y + X \tag{2}$$

Equation (1) represents the non-local operation, and Equation (2) represents the overall spatial attention. Here, $i, j \in [1, NHW]$ index all locations of each feature map in all frames. The convolution operations are denoted by a, b, and d. $W_0$ restores Y to the same size as the input tensor X. The idea behind the non-local operation is that when extracting features at a specific location at a particular time, the network should consider the spatial and temporal dependencies within the sequence by attending to the non-local context.
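To make the tensor shapes in the spatial attention part concrete, the following is a minimal NumPy sketch, not the authors' implementation. Each 1 × 1 convolution is modeled as a channel-mixing matrix, the embedded Gaussian normalization is written as a row-wise softmax, and the reduced channel count `C_reduced = C // 2`, the random weights, and all function and variable names are assumptions for illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_spatial_attention(X, Wa, Wb, Wd, W0):
    """X: (N, C, H, W). Wa, Wb, Wd: (C', C) stand in for the 1x1 convs a, b, d.
    W0: (C, C') restores the channel dimension before the residual addition."""
    N, C, H, W = X.shape
    # Flatten every position of every frame into one axis of length N*H*W.
    X_flat = X.transpose(1, 0, 2, 3).reshape(C, N * H * W)
    A, B, D = Wa @ X_flat, Wb @ X_flat, Wd @ X_flat      # each (C', NHW)
    M = softmax(A.T @ B, axis=1)                          # (NHW, NHW) attention scores
    Y = M @ D.T                                           # (NHW, C') weighted averages
    restored = (W0 @ Y.T).reshape(C, N, H, W).transpose(1, 0, 2, 3)
    return X + restored, M                                # residual connection gives A_S

rng = np.random.default_rng(0)
N, C, H, W = 2, 4, 3, 2
C_reduced = C // 2                                        # assumed reduction ratio
X = rng.standard_normal((N, C, H, W))
Wa, Wb, Wd = (rng.standard_normal((C_reduced, C)) for _ in range(3))
W0 = rng.standard_normal((C, C_reduced))
A_S, M = nonlocal_spatial_attention(X, Wa, Wb, Wd, W0)
```

Because positions from all N frames share one flattened axis, each row of `M` distributes attention over every location in every frame, which is what lets information flow between frames.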
In the NLAM channel attention part, we pass the feature maps $A_S$ output by the spatial attention part (Figure 3a) through global max pooling (GMP) and global average pooling (GAP) over the width and height, and then through a multi-layer perceptron (MLP) to get the channel importance of each frame. The channel importance vectors of all N frames are then subjected to maximum pooling and average pooling, respectively, in each dimension to estimate the global channel importance. Next, the outputs of maximum pooling and average pooling are added elementwise and passed through a sigmoid activation to obtain the final channel attention feature maps. Finally, the channel attention feature maps and the input feature maps are multiplied elementwise to generate the final output $A_C$ of NLAM, which is used as the input of the (L + 1)-th layer of the CNN network. In the channel attention computation, σ refers to the sigmoid activation function, $A_S$ stands for the NLAM spatial attention output, and $A_C$ refers to the output of NLAM channel attention, which is also the final result of NLAM. GMP and GAP, as in Figure 3b, denote global maximum pooling and global average pooling. Note that the MLP weights $W_1$ and $W_2$ are shared for both inputs, and $W_1$ is followed by the δ (Tanh) activation function.
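The channel attention description admits more than one reading, so the sketch below should be taken as one plausible interpretation rather than the authors' exact formula. It assumes that the GAP branch is averaged across frames and the GMP branch is max-pooled across frames before the two are added; the hidden width of the MLP, the random weights, and the function names are also assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(A_S, W1, W2):
    """A_S: (N, C, H, W). W1: (C_hidden, C) and W2: (C, C_hidden) are the shared MLP weights."""
    def mlp(v):
        # Shared two-layer MLP with Tanh after the first layer.
        return np.tanh(v @ W1.T) @ W2.T
    gap = A_S.mean(axis=(2, 3))                 # (N, C) per-frame global average pooling
    gmp = A_S.max(axis=(2, 3))                  # (N, C) per-frame global max pooling
    avg_branch = mlp(gap).mean(axis=0)          # (C,) averaged over the N frames (assumed)
    max_branch = mlp(gmp).max(axis=0)           # (C,) max over the N frames (assumed)
    weights = sigmoid(avg_branch + max_branch)  # (C,) global channel importance
    # Broadcast the channel weights over frames and spatial positions.
    return A_S * weights[None, :, None, None], weights

rng = np.random.default_rng(0)
A_S = rng.standard_normal((2, 4, 3, 2))         # toy spatial-attention output
W1 = rng.standard_normal((2, 4))
W2 = rng.standard_normal((4, 2))
A_C, channel_weights = channel_attention(A_S, W1, W2)
```

Under this reading a single weight vector is shared by all frames of the sequence, so a channel judged unimportant is suppressed in every frame at once.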

Temporal Aggregation
In video-based person Re-ID, a key challenge is how to combine frame-level features into video-level features that better express the temporal information in the video. In previous works, researchers generally used temporal pooling to perform temporal feature aggregation. Table 1 compares three different temporal aggregation layers (TPavg, TA, RNN) in detail, and shows that temporal pooling performs best. However, temporal pooling naturally ignores the temporal relationship between frames. In non-local attention, each attention score depends on all the frames in the sequence, so it can capture long-range dependencies in the deep neural network well.
In this part, we propose a non-local temporal pooling (NLTP) method for temporal feature aggregation. The specific architecture is shown in Figure 4. Our proposed method applies the non-local attention mechanism to the extracted frame-level features to perform feature aggregation in the temporal dimension. Enhancing the frame-to-frame relationship allows frames with closer relevance to obtain a higher weight, resulting in a more reliable person Re-ID model. Equation (4) represents the entire NLTP process, in which F denotes the frame-level features of a sequence extracted as described in Section 3.1. The result Y of the non-local operation is restored to the same scale as the input F through W (a 1 × 1 convolution), and elementwise addition is performed with F. Finally, average pooling is performed in the temporal dimension to obtain the final output of NLTP.
We assume that a continuous image sequence contains N frames, and $f_i$ ($i \in \{1, 2, \ldots, N\}$) is one of the N frames of the video. We use the pairwise function $h(f_i, f_j)$, whose values over all frame pairs form an N × N matrix, to calculate a scalar value relating the current frame $f_i$ to each frame $f_j$ ($j \in \{1, 2, \ldots, N\}$). There are many different choices for the pairwise function $h(f_i, f_j)$ [14]. In our work, we use the "dot product" pairwise function to compute the correlation between frames.

Recall that each frame is represented as a C × H × W tensor (see Section 3.1). We apply a 1 × 1 convolution to $f_i$ to reduce its channel dimension to $C' = C/2$, which reduces computation. The result is then reshaped to a vector. We use $\theta(f_i)$, $\phi(f_j)$, and $g(f_j)$ to denote three such vectors. The pairwise function is defined as the dot product between $\theta(f_i)$ and $\phi(f_j)$:

$$h(f_i, f_j) = \theta(f_i)^{\top} \phi(f_j)$$

Then, we multiply the output of the pairwise function by 1/N as the normalization operation. We call the normalized result the "attention score", which indicates the influence of each frame $f_j$ on $f_i$. We then compute a "weighted frame feature" $y_i$ using the attention scores and the frames $g(f_j)$:

$$y_i = \frac{1}{N} \sum_{j=1}^{N} h(f_i, f_j)\, g(f_j)$$

Note that since $y_i$ is computed from all frames in the video, it implicitly contains information about the frame $f_i$ and all the other frames $f_j$. To obtain the video-level feature, we simply perform temporal pooling over these weighted frame features and the original frame features. Since the weighted frame features already capture long-range dependencies in the video, the output of the temporal pooling (i.e., the video-level feature) will implicitly capture rich long-range dependencies in the video.
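The NLTP steps described in this section (channel reduction, dot-product pairwise function, 1/N normalization, restoration by a 1 × 1 convolution, residual addition, and temporal average pooling) can be put together in a short NumPy sketch. This is a hedged illustration under assumptions: the 1 × 1 convolutions are modeled as channel-mixing matrices with random weights, and the names `conv1x1`, `nltp`, `W_theta`, `W_phi`, `W_g`, and `W_out` are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: mix the channels of x (N, C_in, H, W) with weight (C_out, C_in)."""
    return np.einsum("oc,nchw->nohw", weight, x)

def nltp(F, W_theta, W_phi, W_g, W_out):
    """F: (N, C, H, W) frame-level features. Returns a (C, H, W) sequence-level feature."""
    N = F.shape[0]
    theta = conv1x1(F, W_theta).reshape(N, -1)   # (N, C'*H*W), one vector per frame
    phi = conv1x1(F, W_phi).reshape(N, -1)
    g = conv1x1(F, W_g)                          # (N, C', H, W)
    scores = (theta @ phi.T) / N                 # (N, N) dot products scaled by 1/N
    Y = (scores @ g.reshape(N, -1)).reshape(g.shape)   # weighted frame features y_i
    enhanced = conv1x1(Y, W_out) + F             # restore channels, residual addition
    return enhanced.mean(axis=0), scores         # average pooling over the temporal axis

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 4, 2, 2))            # N = 4 frames, C = 4, H = W = 2
W_theta, W_phi, W_g = (rng.standard_normal((2, 4)) for _ in range(3))   # C' = C / 2
W_out = rng.standard_normal((4, 2))
sequence_feature, attention_scores = nltp(F, W_theta, W_phi, W_g, W_out)
```

If `W_out` is all zeros, the result reduces to plain average temporal pooling of F, which shows that NLTP extends the TPavg baseline rather than replacing it with an unrelated operation.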

Loss Function
We use Softmax cross-entropy loss and batch triplet loss as the loss function in our work. On the one hand, using these two loss functions allows a fair comparison with the baseline. On the other hand, these two loss functions have proved to be well suited to our work. We randomly select P identities and randomly select K sequences (each containing N frames) from each identity to form a batch. Therefore, a batch contains P × K sequences. The overall loss function combines two terms, $L_{softmax}$ and $L_{triplet}$, which refer to the cross-entropy loss and the batch triplet loss, respectively. The cross-entropy loss encourages the network to classify the P × K sequences to the correct identities. In the cross-entropy loss, $p_i$ and $q_i$ are the ground-truth identity and the prediction of sample i, and B represents all sequences in a batch. Batch triplet loss is generally used to reduce the intra-class distance between sequences and to increase the inter-class distance. The training instances contain an anchor, a positive instance, and a negative instance. The positive instance belongs to the same class as the anchor, and the negative instance belongs to a different class. Let $f_{I_A}$, $f_{I_P}$, $f_{I_N}$ be the video-level descriptors of three different sequences, where $I_A$, $I_P$, and $I_N$ are the anchor, positive, and negative examples, respectively. In the triplet loss, m is the margin between positive and negative samples, and D(i, j) is the distance function between two video-level descriptors i and j. B represents all sequences in a batch, and I denotes the I-th sequence in a batch. The Softmax cross-entropy loss is applied after the fully connected (FC) layer, which produces the identity probabilities. The batch triplet loss is applied to the video-level descriptors to backpropagate the gradients.
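As an illustration of how the two loss terms could be computed for one batch, here is a minimal NumPy sketch. It rests on several assumptions that the text does not confirm: the two terms are summed with equal weight, the batch triplet loss is the batch-hard variant (hardest positive and hardest negative per anchor), and D is the Euclidean distance. The toy batch uses P = 2 identities and K = 2 sequences.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch. logits: (B, num_ids), labels: (B,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def batch_hard_triplet(features, labels, margin=0.3):
    """Batch-hard triplet loss (assumed variant) on video-level descriptors (B, D)."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))                    # (B, B) Euclidean distances
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)    # farthest same-identity sample
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)    # closest other-identity sample
    return np.maximum(0.0, margin + hardest_pos - hardest_neg).mean()

# Toy batch (assumed values): P = 2 identities, K = 2 sequences each.
labels = np.array([0, 0, 1, 1])
features = np.array([[0.0, 0.0], [0.0, 2.0], [1.0, 0.0], [1.0, 2.0]])
logits = np.zeros((4, 2))                                      # uninformative classifier
loss_softmax = cross_entropy(logits, labels)
loss_triplet = batch_hard_triplet(features, labels, margin=0.3)
total_loss = loss_softmax + loss_triplet                       # equal weighting is an assumption
```

In this toy batch every anchor has its hardest positive at distance 2 and its hardest negative at distance 1, so each anchor contributes 0.3 + 2 - 1 = 1.3 to the triplet term.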

Experiment
In this part, we introduce the datasets used in the training process, the evaluation method used after the training is completed, and some parameter settings throughout the experiment. Finally, our experimental results are listed and explained.

Datasets and Evaluation
We evaluated the proposed model on two commonly used video-based person Re-ID datasets: MARS and DukeMTMC-VideoReID.
MARS: The MARS [30] dataset is an extended version of the Market1501 [3] dataset and is also the first large-scale video-based dataset. Since all bounding boxes and tracks are automatically generated, it contains distractors, and each identity may have multiple tracks. It is the largest video-based person Re-ID dataset, with 1261 identities and 20,478 videos, with multiple frames per person captured across six non-overlapping camera views. About half of the identities are used for training, and the other half are used for testing. Additionally, the MARS dataset includes 3248 identities (disjoint from the training and test sets) that are used as distractors.
DukeMTMC-VideoReID: The DukeMTMC-VideoReID [31] dataset is a subset of the DukeMTMC multi-camera dataset [32], which was collected in an outdoor scenario with varying viewpoints, illuminations, backgrounds, and occlusions using eight synchronized cameras. The dataset contains 1404 identities for training and testing and 408 identities as distractors. In total, there are 2196 videos for training and 2636 videos for testing. Each video contains person images sampled every 12 frames. During testing, one video for each ID is used as the query, and the remaining videos are placed in the gallery.
In Table 2, a detailed comparison of the two datasets MARS and DukeMTMC-VideoReID is shown from the following aspects:
• The total number of people included;
• The number of people used for training;
• The number of people used for testing;
• The number of people used as distractors;
• The total number of videos contained in the dataset; and
• The number of cameras used in data collection.

We used the same evaluation indicators as those used in the literature [12,13,30,33]: CMC (cumulative matching characteristics) and mAP (mean average precision). CMC refers to the probability of finding the correct identity among the first k matches based on the retrieval ability of the algorithm. We chose to use CMC when only one gallery instance exists for every identity. We tested the probability of rank-1, rank-5, and rank-20. The mAP metric is used when there are multiple instances of the same identity in the gallery.
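The two metrics can be sketched from a query-by-gallery distance matrix as follows. This is a simplified illustration, not the official evaluation code: it assumes that every gallery entry is a valid candidate, so any camera-based filtering of gallery entries used by the benchmark protocols is omitted, and the function name `evaluate` and the toy distances are hypothetical.

```python
import numpy as np

def evaluate(dist, query_ids, gallery_ids, ranks=(1, 5, 20)):
    """Return CMC scores at the given ranks and mAP for a (num_query, num_gallery) matrix."""
    cmc_hits = {k: 0 for k in ranks}
    average_precisions = []
    for q, row in enumerate(dist):
        order = np.argsort(row)                          # gallery sorted by increasing distance
        matches = gallery_ids[order] == query_ids[q]
        if not matches.any():
            continue                                     # query identity absent from the gallery
        first_hit = int(np.argmax(matches))              # 0-based position of the first correct match
        for k in ranks:
            cmc_hits[k] += first_hit < k
        hit_positions = np.flatnonzero(matches) + 1      # 1-based ranks of all correct matches
        precisions = np.arange(1, len(hit_positions) + 1) / hit_positions
        average_precisions.append(precisions.mean())
    n = len(average_precisions)
    cmc = {k: cmc_hits[k] / n for k in ranks}
    return cmc, float(np.mean(average_precisions))

# Toy example (assumed values): 2 queries, 4 gallery videos.
gallery_ids = np.array([0, 1, 0, 2])
query_ids = np.array([0, 1])
dist = np.array([[0.1, 0.5, 0.9, 0.3],
                 [0.2, 0.4, 0.6, 0.8]])
cmc, mean_ap = evaluate(dist, query_ids, gallery_ids)
```

For the first query the correct matches sit at ranks 1 and 4, giving an average precision of (1/1 + 2/4) / 2 = 0.75; for the second query the single match is at rank 2, giving 0.5, so the mAP of this toy example is 0.625 and the rank-1 score is 0.5.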

Implementation Details
The proposed method was implemented using the PyTorch framework [34]. During training, each sequence consists of N = 8 frames, which differs from the baseline, and the video frames are resized to 256 × 128. During training, we randomly sampled N frames from the video to form an input sequence. In testing, we split the video into several sequences of length N in temporal order. The network was trained using the Adam optimizer. The batch size was set to 32; if this exceeded the GPU memory limit, the batch size was reduced to the largest value that fit in memory. The learning rate was initialized to 0.0001 and multiplied by γ = 0.1 after every 200 epochs, and the network was trained for 800 epochs. The margin of the triplet loss was m = 0.3.
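The frame sampling and learning-rate schedule described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, sorting the sampled indices, sampling with replacement for videos shorter than N frames, and keeping a shorter final test sequence are all assumptions not stated in the text. In PyTorch, the schedule would correspond to something like `torch.optim.lr_scheduler.StepLR` with `step_size=200` and `gamma=0.1`, although the text does not say which scheduler class was used.

```python
import random

N = 8  # sequence length used in the paper


def sample_train_sequence(num_frames, n=N, rng=random):
    """Randomly pick n frame indices from a video (training)."""
    if num_frames >= n:
        indices = rng.sample(range(num_frames), n)
    else:
        # Assumption: short videos are sampled with replacement.
        indices = [rng.randrange(num_frames) for _ in range(n)]
    return sorted(indices)  # assumption: keep the sampled frames in temporal order


def split_test_sequences(num_frames, n=N):
    """Split a video into consecutive sequences of n frame indices (testing).

    Assumption: the last sequence is kept even if it has fewer than n frames.
    """
    return [list(range(start, min(start + n, num_frames)))
            for start in range(0, num_frames, n)]


def learning_rate(epoch, base_lr=1e-4, gamma=0.1, step=200):
    """Step decay: multiply the learning rate by gamma after every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


print(sample_train_sequence(50))
print(split_test_sequences(20))
print([learning_rate(e) for e in (0, 200, 400, 600)])
```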

Results
In our experiments, every video of a person was first divided into multiple sequences of N frames; each sequence was then passed through the network to obtain a sequence-level feature, and the sequence-level features were averaged to obtain a video-level descriptor. We used the L2 distance between descriptors to compute CMC and mAP. The comparative experiments are analyzed below.
Location of the NLAM within the network: We first explored the effect of placing the NLAM at different locations within the network. We inserted an NLAM layer after one or more of the feature-extraction CNN blocks and compared the resulting performance. To keep all other variables fixed, we used NLTP as the temporal aggregation layer. The network was trained and tested on both datasets, MARS and DukeMTMC-VideoReID. The results are shown in Table 3. Table 3 shows that a single NLAM layer performed better when inserted after the deeper blocks (e.g., block3, block4, block5). We therefore tested inserting multiple NLAM layers after the deeper blocks (Table 4) and found that inserting NLAM layers after block4 and block5 achieved the best results.
Different temporal aggregation methods: To evaluate the proposed NLTP feature aggregation method directly, we compared it with the temporal pooling method used in the baseline (Table 1). Both experiments used SE-ResNet50 as the feature extractor with the NLAM module added after the fourth and fifth CNN blocks, so that the aggregation method was the only variable. Table 5 shows the results. Compared with temporal pooling, NLTP achieved better performance; on the MARS dataset in particular, it improved mAP by 0.2% and rank-1 accuracy by 0.5%.
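The evaluation procedure described at the start of this section (averaging sequence-level features into a video-level descriptor and ranking gallery videos by L2 distance) can be sketched as follows. The function names and the two-dimensional toy features are hypothetical; real descriptors come from the network and have many more dimensions.

```python
import math


def video_descriptor(sequence_features):
    """Average the sequence-level feature vectors element-wise."""
    count = len(sequence_features)
    return [sum(column) / count for column in zip(*sequence_features)]


def l2_distance(a, b):
    """Euclidean (L2) distance between two descriptors."""
    return math.dist(a, b)


def rank_gallery(query, gallery):
    """Return gallery keys sorted by increasing L2 distance to the query.

    `gallery` maps a video identifier to its video-level descriptor.
    """
    return sorted(gallery, key=lambda key: l2_distance(query, gallery[key]))


# Toy example: a query video with two sequences, and two gallery videos.
query = video_descriptor([[1.0, 0.0], [3.0, 2.0]])
gallery = {"x": [2.0, 1.5], "y": [10.0, 10.0]}
print(query, rank_gallery(query, gallery))
```

The ranked list produced for each query is the input from which CMC and mAP are computed.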
Effect of different sequence lengths (N): Next, we studied the impact of the number of frames in each sequence on our model. N denotes the sequence length, i.e., the number of frames sampled from a video. We compared the performance of the model on the MARS dataset at N = 2, 4, and 8; the results are shown in Table 6. To keep all other variables fixed, we used SE-ResNet50 with the NLAM module added after the fourth and fifth CNN blocks as the feature extractor and NLTP as the temporal aggregation layer. Table 6 shows that our network performed best when N = 8, which differs from the conclusion of our baseline (where the model performs best when N = 4). On the MARS dataset, rank-1 accuracy improved by 0.8% and mAP improved by 0.9% relative to N = 4. This result is expected because both the NLAM and the NLTP rely on a non-local mechanism. Hence, a longer sequence helps our model capture long-range dependencies and obtain a more robust video-level feature descriptor.
Comparison with state-of-the-art methods: We compared our method with state-of-the-art methods [15,22,28,33,35–38] on the MARS and DukeMTMC-VideoReID datasets. The results are shown in Table 7. Our final model used N = 8, with SE-ResNet50 and the NLAM module added after the fourth and fifth CNN blocks as the feature extractor and NLTP as the temporal aggregation layer. Our proposed model achieved good performance. On the MARS dataset in particular, our method improved CMC rank-1 by 2.3% and mAP by nearly 1.8% compared to our baseline (Table 1). Compared with the state-of-the-art method [29], our method also improved CMC rank-1 by 1.4%. Our model also achieved impressive results on the DukeMTMC-VideoReID dataset.
Compared to the baseline (Table 1), our network improved by 0.9% and 1.3% on mAP and CMC rank-1, respectively. We attribute this improvement to the NLAM in frame-level feature extraction and the NLTP in temporal feature aggregation, which capture global information better and thus yield a more robust feature representation.
Table 7. Comparison of our model with our baseline and a series of state-of-the-art models on the two datasets MARS and DukeMTMC-VideoReID. TP avg = average temporal pooling. NLAM (4,5) denotes adding the NLAM layer after the fourth and fifth CNN blocks.

Conclusions
Video-based person Re-ID is an important task that has received much attention in recent years. In this paper, we proposed a non-local attention module (NLAM) that can be added between CNN blocks for frame-level feature extraction and a non-local temporal pooling (NLTP) method for temporal feature aggregation. Experiments showed that the two proposed methods achieve excellent results on video-based person Re-ID datasets. Compared with most existing methods, the advantage of our proposed network architecture (SE-ResNet50 + NLAM (4,5) + NLTP) is that it better describes the relationship between frames in the video. It attends to the spatial and temporal relationships of all frames in a non-local way and assigns them different weights, thus forming a more accurate representation of the video. The resulting model outperformed the state-of-the-art methods we compared against. Our proposed NLAM and NLTP methods can also be applied to other video-based tasks, such as target tracking and pose estimation.

Conflicts of Interest:
The authors declare no conflict of interest.