Person Re-Identification via Pyramid Multipart Features and Multi-Attention Framework

Abstract: Video-based person re-identification has become quite attractive due to its importance in many vision surveillance problems. It is a challenging topic due to the inter/intra-class changes, occlusion, and pose variations involved. In this paper, we propose a pyramid-attentive framework that relies on multi-part features and multiple attention mechanisms to aggregate features at multiple levels and learn attention-based representations of persons from various aspects. Self-attention is used to strengthen the most discriminative features in the spatial and channel domains and hence capture robust global information. We propose the use of part-relation attention between different multi-granularity feature representations to focus on learning appropriate local features. Temporal attention is used to aggregate temporal features. We integrate the most robust features in the global and multi-level views to build an effective convolutional neural network (CNN) model. The proposed model outperforms previous state-of-the-art models on three datasets. Notably, the proposed model achieves 98.9% top-1 accuracy (a relative improvement of 2.7% over GRL) and 99.3% mAP on PRID2011, and 92.8% top-1 accuracy (a relative improvement of 2.4% over GRL) on iLIDS-VID. We also explore the generalization ability of our model in a cross-dataset setting.


Introduction
Person re-identification is the process of retrieving the best matching person in a video sequence across the views of multiple overlapping cameras. It is an essential step in many important applications such as surveillance systems, object tracking, and activity analysis. Re-identifying a person involves simply assigning a unique identifier to pedestrians captured within multiple camera settings. However, it is very challenging due to occlusion, large intra-class and small inter-class variation, pose variation, viewpoint change, etc.
Person re-identification is conducted on images [1][2][3][4][5][6][7] or videos [8][9][10][11][12][13][14]. Early approaches to this process can be classified into two main branches: either to develop handcrafted features or to develop machine learning solutions for optimizing parameters. Recently, the use of deep learning feature extraction has become very popular due to its success in many vision problems.
Nowadays, videos have gained more attention because they benefit from rich temporal information, unlike static images, which suffer from limited content. Several related works have extracted frame features and aggregated them using maximum or average pooling [11,15], while other studies exploiting temporal information have used temporal aggregation such as long short-term memory (LSTM) or recurrent neural networks (RNNs) [8,9,16,17].

The main contributions of this paper are summarized as follows:
• Developing a multi-level local feature representation to overcome the missing parts and misalignment of a global representation. The integration of multiple local partitions makes the representation more generalizable, as the person is viewed at multiple granularities.
• Proposing a novel way to use self-attention in an image using a block of convolution layers to capture the most generalized information.
• Using channel attention in addition to spatial attention to capture correlations in all directions. This exploits the relation between channels instead of only the spatial aspect.
• Introducing self-part attention between the multiple levels of features to benefit from the relationships between the parts at multiple granularities and produce better representations of each part.
The rest of the paper is organized as follows: The related works are discussed in Section 2; the proposed approach in Section 3; the experimental results in Section 4; and the conclusion and our intended future studies in Section 5.

Related Works
Person re-identification focuses on developing a discriminative feature set to represent a person. Early studies in this area developed handcrafted features such as the local binary pattern (LBP) histogram [24], histogram of gradients (HOG) [25], and local maximal occurrence (LOMO) [26]. Other studies use a combination of different features [27,28].
Subsequently, various deep learning models have been presented and have shown better performance than handcrafted features. For example, Shangxuan et al. [29] combined handcrafted features (ensemble of local features, ELF) with features extracted from CNNs, thereby making the CNN features more robust.
Video-based re-identification research focuses on three topics: feature extraction, temporal aggregation, and the attention function. Feature extraction refers to selecting the best features (global, part-based, or a combination of both) that represent a person. Temporal aggregation is the method by which each frame's features are aggregated to construct video-sequence features. The attention function is used to learn how important features can be strengthened and irrelevant ones suppressed.
Features are extracted from each frame and aggregated using temporal average pooling, as in [30]. An RNN is designed to extract features from each frame and capture information across all time steps to obtain the final feature representation [11]. Two-stream networks, one for motion and the other for appearance, are used together and aggregated with an RNN [14]. Handcrafted features such as color, texture, and LBP are combined with an LSTM to demonstrate the importance of LSTM in capturing temporal information. An LSTM is also used to build a timestamped frame-wise sequence of pedestrian representations that allows discriminative features to accumulate at the deepest node and prevents non-discriminative ones from reaching it [9]. In addition, two Conv-LSTMs are proposed: one for capturing spatial information and the other for capturing temporal information [18].
The global features are combined with six local attributes learned from the predefined RAP dataset [22]. Then, the frames are aggregated by re-weighting each frame using an attribute confidence function [31]. Co-segmentation [32] detects salient features across frames by calculating the correlation between multiple frames of the same tracklet and aggregating the frames using temporal attention. Multi-level spatial pyramid pooling is used to determine important regions in the spatial dimension, and RNNs are used to capture temporal information [16]. A tracklet is divided into multiple snippets, and co-attention is learned between them [12]. The self and collaborative attention network (SCAN) [10] uses a self-attention sub-network (SAN) to select information from frames in the same video, while a collaborative attention sub-network (CAN) obtains across-camera features. A temporal attention model (TAM) is designed to generate weights representing the importance of each frame, and a spatial recurrent model (SRM) is responsible for capturing spatial information and aggregating frames with an RNN [13].
Spatial-Temporal Attention-Aware Learning (STAL) [15] presents a model that extracts global and local features; it is trained on the MPII human pose dataset [33] to learn body joints and uses an attention branch to learn spatial-temporal attention. A relation module (RM) is designed to determine the relation between each spatial position and all other positions, and the temporal relation among all frames is obtained using relation-guided spatial attention (RGSA) and relation-guided temporal refinement (RGTR) [34]. A multi-hypergraph can be defined with multiple nodes, where each node corresponds to a different spatial granularity [35]. Attentive feature aggregation explores features along the channel dimension at different granularities by dividing the channels into groups; each group has a different granularity, and attentive aggregation is applied to each group [36]. Finally, intra/inter-frame attention is proposed for re-weighting each frame [20].
Unlike the use of only spatial and temporal attention in the previous methods, we propose the use of multiple attention mechanisms for mining the feature tensors. In this way, we exploit correlations from multiple directions and the advantages of the channel dimension and part interactions to select important features from various dimensions. Taking advantage of multi-scale approaches, we propose a multi-part pyramid for exploring a person from multiple views that aims to extract discriminative and robust features.

Our Approach
Our model aims to explore pedestrians using pyramid multi-part features with multi-attention (PMP-MA) for video re-identification. The pyramid multi-part features (PMP) represent pedestrians at multiple granularities to capture different details. Multiple attention (MA) is applied to obtain the most important features in multiple domains. The combined approach captures the most robust features across multi-granularity levels and multiple domains. We start with a discussion of the overall architecture, followed by the details of the different model parts.

Overall Architecture
Initially, a group of pedestrian videos X = {x_t}, t = 1:T, where T is the number of frames, is fed into a backbone network based on ResNet50. The output of the backbone is split into three independent branches; each branch represents a different granular level, as shown in Figure 1. Then, we apply self-spatial and channel attention to improve the feature extraction and obtain the best spatial-channel features, and part attention to obtain the relation between parts at multiple granularities. Subsequently, the outcomes of the three branches are concatenated to learn the most robust features for representing a person at multiple levels. After that, we apply temporal attention pooling (TAP) [30] to aggregate the frame-level features and temporal information into a video-sequence feature representation. Finally, we use a classification layer with an intermediate bottleneck layer to generalize the representation among the set of tracklets used in training before the loss function. Two loss functions are used: triplet loss and cross-entropy.
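The overall flow can be summarized with a minimal PyTorch sketch. This is not the authors' implementation: module and variable names are illustrative, the attention modules are left as placeholders, and simple temporal averaging stands in for TAP.

```python
import copy
import torch
import torch.nn as nn
from torchvision.models import resnet50

class PMPMA(nn.Module):
    def __init__(self, num_ids, k=6, feat_dim=1024):
        super().__init__()
        base = resnet50(weights=None)
        # Shared trunk: stem + layer1 + layer2 + the first block of layer3.
        self.trunk = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool,
                                   base.layer1, base.layer2, base.layer3[0])
        # Three independent branches built from the remaining blocks of layer3 (layer4 is dropped).
        rest = nn.Sequential(*list(base.layer3[1:]))
        self.branch_g = copy.deepcopy(rest)    # global branch
        self.branch_k = copy.deepcopy(rest)    # coarse-part branch (K stripes)
        self.branch_2k = copy.deepcopy(rest)   # fine-part branch (2K stripes)
        self.k = k
        self.bottleneck = nn.Linear(feat_dim, 256)   # intermediate bottleneck before the classifier
        self.classifier = nn.Linear(256, num_ids)

    def forward(self, x):                            # x: (N, T, 3, H, W) tracklet clips
        n, t = x.shape[:2]
        f = self.trunk(x.flatten(0, 1))              # frame-level maps, (N*T, C, h, w)
        g = self.branch_g(f).mean(dim=[2, 3])        # global vector per frame, (N*T, C)
        lk = nn.functional.adaptive_avg_pool2d(self.branch_k(f), (self.k, 1)).squeeze(-1)        # (N*T, C, K)
        l2k = nn.functional.adaptive_avg_pool2d(self.branch_2k(f), (2 * self.k, 1)).squeeze(-1)   # (N*T, C, 2K)
        # Self-spatial/channel, part, and temporal attention would refine these features;
        # here the frames are simply averaged over time as a placeholder for TAP.
        g_video = g.view(n, t, -1).mean(dim=1)       # (N, C)
        logits = self.classifier(self.bottleneck(g_video))
        return logits, g_video, lk, l2k

model = PMPMA(num_ids=150)
logits, g_video, lk, l2k = model(torch.randn(2, 4, 3, 384, 192))
```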

Pyramid Feature Extractor
To enhance the representation for person re-identification, the model uses a pyramid feature extractor to represent a pedestrian at multiple granularity levels for each frame. It represents the pedestrian with three granularities: a global representation gl_t, a coarse-grained part representation l^k_t, and a fine-grained part representation l^{2k}_t. The feature map of the global representation is gl = {gl_t | gl_t ∈ R^{h×w×c}}, t = 1:T, where h, w, and c represent the height, width, and channel size of the feature map. The features of the multi-part branches are partitioned into K and 2K horizontal parts, l^k = {l^k_t | l^k_t ∈ R^{k×c}}, t = 1:T, and l^{2k} = {l^{2k}_t | l^{2k}_t ∈ R^{2k×c}}, t = 1:T, where k represents the number of horizontal parts. Initially, the backbone network extracts an initial feature map. Then, it splits into three branches and extracts three different feature maps to represent the same person through separate convolution layers (based on ResNet50). Besides that, it extracts complementary features to use as attention features.
The first branch is the global extractor, which captures the global features of each pedestrian. It represents the whole frame with one vector gl and produces the attention vector p_att1 = {p_att1,t | p_att1,t ∈ R^{K×C}} for part attention. The second branch divides the feature map into K parts to capture the local features of each part, captures the coarse-grained parts of each pedestrian l^k, and represents the frame with K vectors. Besides that, it produces the attention vector g_att1 = {g_att1,t | g_att1,t ∈ R^{h×w×C}} for spatial and channel attention. The third branch divides the map into 2K parts. It captures the fine parts of each pedestrian, represents a frame with 2K vectors l^{2k}, and produces the attention vectors g_att2 = {g_att2,t | g_att2,t ∈ R^{h×w×C}} and p_att2 = {p_att2,t | p_att2,t ∈ R^{2K×C}}. The model extracts global, coarse-part, and fine-part features from the three branches and fuses them to obtain the most robust features.
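A short sketch of the pyramid partitioning itself, assuming each branch's feature map is striped horizontally by adaptive average pooling (shapes follow the definitions above; the function name is hypothetical):

```python
import torch
import torch.nn.functional as F

def pyramid_partition(feat, k=6):
    """feat: (B, C, h, w) frame-level feature map from one branch."""
    gl = feat.mean(dim=[2, 3])                                                  # global vector, (B, C)
    l_k = F.adaptive_avg_pool2d(feat, (k, 1)).flatten(2).transpose(1, 2)        # coarse stripes, (B, K, C)
    l_2k = F.adaptive_avg_pool2d(feat, (2 * k, 1)).flatten(2).transpose(1, 2)   # fine stripes, (B, 2K, C)
    return gl, l_k, l_2k

gl, l_k, l_2k = pyramid_partition(torch.randn(8, 1024, 24, 12), k=6)
print(gl.shape, l_k.shape, l_2k.shape)   # (8, 1024), (8, 6, 1024), (8, 12, 1024)
```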

Self-Spatial Channel Attention
Self-attention is an intra-attention mechanism that re-weights a feature F with an attention score matrix S(x, y) to strengthen the discriminative features. It is designed to learn the correlation between one position and all other positions and can therefore explicitly capture global interdependencies. There are different techniques for calculating the attention score; in this work, the attention score S is computed from the relation and interdependencies of two attention branches using dot-product attention [21]. We compute the tensor-matrix multiplication of x and y, where x and y are two feature attention vectors, and apply the softmax function to obtain S (Equation (1)), as shown in Figure 2. Self-attention is used as a residual block [37], which adds the output of S * F to the original F. Equation (2) expresses the output of the self-attention block f_att, where x, y, and F are feature vectors.
Using three branches satisfies the inputs required for self-attention. Spatial-channel self-attention uses g_att1 and g_att2 to build an attention map that represents the relation between spatial positions and channels. Spatial attention enriches the spatial relationships. Let g_att1 and g_att2 be the attention branches; when rearranged (g_att1, g_att2 ∈ R^{hw×c}), each spatial position is described by its c channels. Hence, the spatial attention score after applying the function S is s1 ∈ R^{hw×hw} (Equation (3)), which represents the relation between each position and all positions. Then, we apply the f_att function to gl (Equation (4)).
At this step, we construct global self-channel attention. This type of attention aims to strengthen the weights of important channels and suppress less important ones. First, we apply a summary layer over the three branches (g_att1, g_att2, and glsp) to sum over all spatial positions and focus on the channel features of each branch. For the global attention branches, we calculate g_avg1, g_avg2, and glsp_avg ∈ R^{c×1} (Equations (5)-(7)) and then compute the channel attention score by applying the function S to g_avg1 and g_avg2 (Equation (8)); the result is the channel score s2 ∈ R^{c×c}, which represents the relation between channels. Finally, we apply the f_att function over the channel score s2 and the third branch glsp_avg to compute the self-spatial with channel attention glsch (Equation (9)). The flow of the whole process is shown in Figure 3, where N is the number of pedestrians in each batch and t is the number of frames describing each pedestrian.
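The following sketch illustrates one plausible reading of Equations (1)-(9): dot-product attention scores with a residual f_att block, applied first over spatial positions and then over channels. Variable names are illustrative, and applying the channel score to the full spatially attended map (rather than to its spatial summary) is an interpretation, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def attention_score(x, y):
    """S(x, y): dot-product attention scores (Equation (1))."""
    return F.softmax(torch.bmm(x, y.transpose(1, 2)), dim=-1)

def f_att(s, feat):
    """Residual self-attention block (Equation (2)): S * F + F."""
    return torch.bmm(s, feat) + feat

def spatial_channel_attention(gl, g_att1, g_att2):
    """gl, g_att1, g_att2: (B, C, h, w) feature and attention maps of the global branch."""
    b, c, h, w = gl.shape
    # Spatial attention: each of the h*w positions is described by its C channels.
    q = g_att1.flatten(2).transpose(1, 2)             # (B, hw, C)
    k = g_att2.flatten(2).transpose(1, 2)             # (B, hw, C)
    s1 = attention_score(q, k)                        # (B, hw, hw), Equation (3)
    glsp = f_att(s1, gl.flatten(2).transpose(1, 2))   # spatially attended map, Equation (4)
    # Channel attention: summarise spatial positions, then relate channels to channels.
    g_avg1 = g_att1.mean(dim=[2, 3]).unsqueeze(-1)    # (B, C, 1), Equations (5)-(7)
    g_avg2 = g_att2.mean(dim=[2, 3]).unsqueeze(-1)    # (B, C, 1)
    s2 = attention_score(g_avg1, g_avg2)              # (B, C, C), Equation (8)
    glsch = f_att(s2, glsp.transpose(1, 2))           # channel-attended map, Equation (9)
    return glsch.reshape(b, c, h, w)

out = spatial_channel_attention(torch.randn(2, 256, 24, 12),
                                torch.randn(2, 256, 24, 12),
                                torch.randn(2, 256, 24, 12))
```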

Self-Part Attention
The local feature extractor is better at learning specific parts than the global one. Pedestrians can be partitioned into several parts, and two local branches are used to view a pedestrian with coarse-grained and fine-grained parts. Part attention is then used to learn the robust features of each coarse part by computing the relationship among the parts in the different branches. It applies the S function over the attention vector from the global branch (p_att1) and the attention from the fine parts (p_att2). The resulting score s3 represents the relation between the two part granularities. Then, the coarse parts with attention (lKpA) are computed by applying f_att over s3 and the K part features. After that, the lKpA vector is rearranged so that each part is represented by (N × t) × c. This enriches each of the K parts with discriminative features, as shown in Figure 4.
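A hedged sketch of one plausible reading of the part attention in Figure 4: the relation s3 between the coarse and fine part-attention vectors re-weights the fine parts, and the result is added residually to the coarse parts. Names and the residual form are assumptions.

```python
import torch
import torch.nn.functional as F

def self_part_attention(l_k, l_2k, p_att1, p_att2):
    """l_k: (B, K, C) coarse parts; l_2k: (B, 2K, C) fine parts;
    p_att1: (B, K, C) part attention from the global branch;
    p_att2: (B, 2K, C) part attention from the fine branch."""
    s3 = F.softmax(torch.bmm(p_att1, p_att2.transpose(1, 2)), dim=-1)  # (B, K, 2K) relation score s3
    lKpA = torch.bmm(s3, l_2k) + l_k                                    # attended coarse parts, (B, K, C)
    return lKpA

lKpA = self_part_attention(torch.randn(4, 6, 1024), torch.randn(4, 12, 1024),
                           torch.randn(4, 6, 1024), torch.randn(4, 12, 1024))
```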

Temporal Attention
After extracting the frame-level features, temporal attention pooling (TAP) [30] is applied to generate the video-level representation. The global features after spatial-channel attention (glsch), each coarse-grained part after part attention (lKpA_p), and each fine-grained part representation (l_2kp) are fed into TAP. Initially, a convolution layer is used to reduce the features from c to c′, which makes them more generalized. Then, further convolution layers with a softmax function are used to calculate the attention scores, as shown in Figure 5. Finally, the aggregation layer takes each pedestrian described by t frames and aggregates the frames into a single generalized feature vector.
where gl_v is the feature vector of the global video representation, lKpA_p is the feature vector of part p in the coarse-grained frame representation, l_2kp is the feature vector of part p in the fine-grained frame representation, l_vp1 is the feature vector of part p in the coarse-grained video representation, and l_vp2 is the feature vector of part p in the fine-grained video representation.
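A minimal sketch of temporal attention pooling in the spirit of TAP [30]: a small convolutional stack produces one score per frame, and the scores weight the temporal aggregation. The layer sizes are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAP(nn.Module):
    def __init__(self, c, c_reduced=256):
        super().__init__()
        self.reduce = nn.Conv1d(c, c_reduced, kernel_size=1)  # c -> c' reduction
        self.score = nn.Conv1d(c_reduced, 1, kernel_size=1)   # one attention logit per frame

    def forward(self, x):                                     # x: (N, T, C) frame-level features
        a = self.score(F.relu(self.reduce(x.transpose(1, 2))))   # (N, 1, T)
        a = F.softmax(a, dim=-1)                                  # temporal attention scores
        return (x * a.transpose(1, 2)).sum(dim=1)                 # attention-weighted sum, (N, C)

video_feat = TAP(c=1024)(torch.randn(8, 4, 1024))   # e.g., the global vector of each of 4 frames
```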

Objective Function
Our objective function L is a combination of the hard triplet loss L_tri and the cross-entropy loss L_ce over both the global features and each local part, as shown in Equation (15):

$$L = L_{tri}(gl_v) + L_{ce}(gl_v) + \sum_{p=1}^{k}\left[L_{tri}(l_{vp1}) + L_{ce}(l_{vp1})\right] + \sum_{p=1}^{2k}\left[L_{tri}(l_{vp2}) + L_{ce}(l_{vp2})\right] \tag{15}$$

Further details of the functions used are as follows:

• Triplet loss (L_tri): the distance between pairs from the same pedestrian is minimized (reducing intra-class variation), while the distance between pairs of different pedestrians is maximized (increasing inter-class variation). We use the hard triplet loss, which selects the hardest examples for the positive and negative pairs, where f_A, f_+, and f_− are the anchor, positive, and negative features, respectively.
• Cross-entropy loss (L_ce): it is used to calculate the classification error between pedestrians, where N is the number of pedestrians, and p_i and q_i are the identity and prediction of sample i, respectively.
For the classification layer used with the cross-entropy function, we use an intermediate bottleneck to reduce the dimensionality and make the representation more generalized.
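The objective above can be sketched as follows, assuming a standard batch-hard triplet loss and PyTorch's cross-entropy; the margin value and helper names are assumptions.

```python
import torch
import torch.nn.functional as F

def hard_triplet_loss(feats, labels, margin=0.3):
    """Batch-hard triplet loss: hardest positive and hardest negative per anchor."""
    dist = torch.cdist(feats, feats)                                      # (B, B) Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist * same.float()).max(dim=1).values                 # farthest same-identity sample
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values  # closest other-identity sample
    return F.relu(hardest_pos - hardest_neg + margin).mean()

def pmp_ma_loss(global_feat, part_feats, global_logits, part_logits, labels):
    """global_feat: (B, C); part_feats / part_logits: lists covering the K + 2K parts."""
    loss = hard_triplet_loss(global_feat, labels) + F.cross_entropy(global_logits, labels)
    for feat, logit in zip(part_feats, part_logits):
        loss = loss + hard_triplet_loss(feat, labels) + F.cross_entropy(logit, labels)
    return loss
```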

Datasets Used
The proposed model was evaluated on four widely used video-based person re-identification datasets, i.e., iLIDS-VID [38], PRID2011 [39], DukeMTMC-VideoReID [40], and motion analysis and re-identification (MARS) [41]. The standard evaluation metrics for the model are the mean average precision (mAP) score and the cumulative matching characteristic (CMC) curve at Rank1, Rank5, Rank10, and Rank20. CMC measures the matching accuracy of a person, while mAP measures the overall retrieval performance. Samples of the datasets used are shown in Figure 6.
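For reference, a simple NumPy sketch of how CMC and mAP can be computed from a query-gallery distance matrix (an illustrative implementation, not the authors' evaluation code):

```python
import numpy as np

def evaluate(dist, query_ids, gallery_ids, ranks=(1, 5, 10, 20)):
    """dist: (num_query, num_gallery) distance matrix, smaller means more similar."""
    query_ids, gallery_ids = np.asarray(query_ids), np.asarray(gallery_ids)
    cmc, aps = np.zeros(max(ranks)), []
    for i, qid in enumerate(query_ids):
        order = np.argsort(dist[i])               # gallery sorted from most to least similar
        matches = gallery_ids[order] == qid       # relevance of each ranked gallery tracklet
        if not matches.any():
            aps.append(0.0)
            continue
        cmc[matches.argmax():] += 1               # rank of the first correct match sets the CMC hit
        precision = np.cumsum(matches) / (np.arange(matches.size) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    cmc /= len(query_ids)
    return {f"Rank{r}": cmc[r - 1] for r in ranks}, float(np.mean(aps))
```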

iLIDS-VID
The iLIDS-VID dataset [38] contains 600 tracklets of 300 persons captured from two non-overlapping cameras. Each tracklet has a length of 23-193 frames. A total of 150 pedestrians were used for training and 150 for testing. This dataset was taken in a crowded airport arrival hall.

PRID2011
The PRID2011 dataset [39] contains 400 tracklets of 200 persons captured from two non-overlapping cameras. A total of 385 pedestrians were captured under camera A and 749 under camera B. Each tracklet has a length of 5-675 frames. Following a previously published protocol [38], we selected tracklets with more than 27 frames; 89 pedestrians were selected for training and 89 for testing. This dataset was taken in an uncrowded outdoor area with varying illumination and viewpoints. For iLIDS-VID and PRID2011, we randomly split the pedestrians into training/testing sets; this split is repeated 10 times, and the averaged accuracies are reported.

DukeMTMC-VideoReID
The DukeMTMC-VideoReID dataset [40] is one of the largest datasets in video-based person re-identification. It contains 2196 tracklets of 702 persons captured for training and 702 tracklets of 702 persons in query, while the gallery contains 2636 tracklets for 1110 persons. Each tracklet has a length of 1-9324 frames. Eight cameras were used to capture this dataset, which was taken in a crowded outdoor area with varying degrees of illumination and occlusion, viewpoints, and backgrounds.

MARS
The MARS dataset [41] is another large dataset in video-based person re-identification. It contains 8298 tracklets of 626 persons for training and 1980 tracklets of 626 persons in the query set, while the gallery contains 9330 tracklets of 626 persons. Each person was captured by at least two cameras. Each tracklet has a length of 2-920 frames. The sequences were extracted using the DPM pedestrian detector [42] and the GMMCP tracker [43]. The videos were taken in a crowded outdoor space with six cameras and have varying viewpoints and complicated occlusion. We used re-ranking [44] as a post-processing step to enhance the results in the test phase, because a person appears multiple times in the gallery rather than only once as in the other datasets; a similar approach is used in [21,32].

Implementation Details
We compared the most efficient deep learning frameworks and selected PyTorch [45] to implement our model; its performance has improved significantly, and it offers off-the-shelf person re-identification libraries. The images were resized to 216 × 423 and randomly cropped to 192 × 384. Then, they were normalized using the RGB mean and standard deviation. We used ResNet50 pre-trained on ImageNet to extract features per frame. Our backbone uses the first two layers and the first part of layer three of ResNet50. The second part of layer three of ResNet50 is duplicated in our three independent branches. We removed the last layer to increase the resolution of the final feature map so that more details could be preserved, which is beneficial for further multi-level learning, and replaced it with our PMP model. We used K = 6 and 2K = 12 for multi-local partitioning; the effects of this choice are discussed in Section 4.6. During training, to form a batch of size N, Pid pedestrians and Seq different sequences for each Pid were selected randomly, and each sequence had T frames; the total number of sequences in each batch is Pid × Seq. We used T = 4 and Seq = 2 for the small datasets (iLIDS-VID and PRID2011) and T = 4 and Seq = 4 for the large-scale datasets (DukeMTMC-VideoReID and MARS); these values are further discussed in Section 4.3. The network was trained using the Adam optimizer [46] with the following hyper-parameters: initial learning rate = 0.0003, weight decay = 5 × 10^-4, gamma = 0.1, h = 24, w = 12, c = 1024, and c′ = 256.
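A sketch of the preprocessing and optimizer setup described above; the hyper-parameter values mirror the text, while the normalization statistics and the scheduler step size are assumptions.

```python
import torch
from torchvision import transforms
from torchvision.models import resnet50

train_transform = transforms.Compose([
    transforms.Resize((423, 216)),        # resize to 216 x 423 (width x height)
    transforms.RandomCrop((384, 192)),    # random crop to 192 x 384
    transforms.ToTensor(),
    # ImageNet statistics assumed for the "RGB mean and standard deviation"
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = resnet50(weights=None)            # stand-in; in practice this is the full PMP-MA network
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)  # step size assumed
```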
The implementation was run on two machines with different GPU specifications (one for the small datasets and one for the large datasets). Table 1 shows the training time for each dataset.

Comparison with State-of-the-Art
In this section, we compare our model with other state-of-the-art models, including Global-Guided Reciprocal Learning (GRL) [47], BICnet [48], Clip similarity [21], Rethink temporal fusion [20], TACAN [12], MG-RAFA [35], MGH [36], SCAN [10], Co-segment [32], STAL [15], STN [19], and Attribute disentangling [31]. Table 2 lists the recent video-based person re-identification models along with a summary of their techniques and years of publication. Table 3 shows the performance of our PMP-MA model versus the most recent video-based person re-identification models on the small datasets. Our model improves Rank1 by 2.7% relative to the best previous R1 (GRL) on PRID2011 (see moderate and hard examples for PRID2011 in Tables S1 and S2 in the Supplementary Material) and by 2.4% relative to the best R1 (GRL) on iLIDS-VID (see moderate and hard examples for iLIDS-VID in Tables S3 and S4 in the Supplementary Material), owing to the improvement from multi-level learning and the use of multiple attention mechanisms instead of one. All the values are in percentages; the values in bold are the best results, and the underlined values are the second-best results. Table 4 presents a comparison of the performances on the large-scale datasets. Our model improves Rank1 by 0.9% relative to the best R1 (BICnet) on DukeMTMC-VideoReID and improves mAP by 2.1% relative to the best mAP (BICnet) on MARS after re-ranking.

Effectiveness of Increasing the Batch Size
Compared to the small datasets (two cameras), the large-scale datasets are more diverse, as they capture people from different views using many cameras (six to eight). For the large datasets, we found that increasing the batch size and the number of instances enhanced the ability to capture large variations in pose and viewpoint, as shown in Table 5. Rank1 and mAP improve by 3.2% and 2.5%, respectively, when using a batch size of 32 with four instances relative to a batch size of eight with two instances on DukeMTMC-VideoReID, and by 7.6% and 6.1%, respectively, on MARS. Increasing the batch size leads to more computation; thus, we could not increase the batch size beyond 32 due to our limited resources. However, Rank1 and mAP are almost saturated on DukeMTMC-VideoReID: Rank1 improves by 2.3% when increasing the batch size from 8 to 16, but by only 0.9% when increasing it from 16 to 32. By contrast, we expect that MARS could be improved further if the batch size were increased beyond 32, as Rank1 increased by 2.3% when increasing the batch size from 8 to 16 and by 3.9% when increasing it from 16 to 32. Conversely, for the small datasets shown in Table 6, increasing the batch size and number of instances did not result in a similar improvement, and the results are almost saturated. For example, in the case of mAP, increasing the batch size from 8 to 32 and the number of instances from 2 to 4 produces only 0.5% extra accuracy. This is because each of these datasets already has only two cameras with a single tracklet per person; thus, with two instances we already capture all the variations. All the values are in percentages.

Cross-Dataset Generalization
Cross-dataset evaluation is a better way of measuring our model's generalization. It evaluates the ability of a system to perform on a dataset different from the training dataset. Each dataset is collected under different visual conditions and viewpoints, and models trained on one dataset often perform poorly on others. To evaluate our model in a more general setting, we used the iLIDS-VID dataset for training and PRID2011 for testing, and compared our results to the STN [19] and RCN [11] models evaluated in the same setting.
The results in Table 7 show that the proposed model generalizes better than the current state-of-the-art models, improving Rank1 by 11.5%. However, an accuracy of around 40% is expected due to the challenges of the cross-dataset setting. Slight improvements are also observed for Rank5, Rank10, and Rank20. Cross-dataset evaluation remains an open research issue; our model achieves better performance, but more optimization is required to enhance its generalization. All the values are in percentages.

Ablation Study
In this section, we show the contribution of each component of the proposed model. Table 8 summarizes the performance of each module when applied to iLIDS-VID. First, we evaluated ResNet50 optimized with triplet loss and cross-entropy without any add-on component as the pre-trained model. Then, it was evaluated with TAP [30]; we found that TAP aggregation improved mAP by 5.9% and Rank1 by 5%. After that, the baseline was evaluated, where the baseline is the pre-trained model with TAP aggregation, an intermediate bottleneck layer in the classification layer before the cross-entropy loss function, and layer 4 of ResNet50 removed. The experiments show that adding the intermediate bottleneck in the classification layer (cross-entropy function) improved the baseline mAP by 9.4% and Rank1 by 9%. The reason is that the bottleneck layer compresses the feature representation to find the best fit and become more generalized. All the values are in percentages.
Secondly, we tested each component of our system in the fine-tuning stage. First, our PMP was added to the baseline. As shown in Table 8, the PMP component achieves an mAP of 89%, 87% Rank1, and 100% Rank10, improving over the baseline by 4.3%, 9.7%, and 4%, respectively. The reason is that the PMP extracts pedestrians with multi-level partitions and fuses them; the multi-level partitioning recovers the fine parts that are usually missed when global features are extracted with one-level partitioning. Next, self-spatial attention was tested in place of the PMP component. The spatial attention component achieves an mAP of 86.1% and 80.3% Rank1, improving over the baseline by 1.4% and 3%. After that, we added channel attention along with spatial attention, which improved the system relative to the baseline with spatial attention only, with mAP increasing by 2.2% and Rank1 by 2%.
We then evaluated our PMP component combined with the self-spatial and channel attention components. This combination achieves an mAP of 92.4%, Rank1 of 90.3%, and Rank5 of 99.3%, improving over the baseline by 7.7%, 14%, and 5%, respectively. Finally, we tested the last component of our model by adding part attention to the spatial and channel attention, which improved the system further, with mAP increasing by 2.9% and Rank1 by 2.5%. Our model clearly achieves better mAP and Rank accuracies, which shows the effectiveness of the PMP-MA framework.

The Effect of Using Different Parts (K)
The number of parts (k) was evaluated by monitoring our model's performance. In particular, we analyzed k = 2, 3, 4, 6, 8, 10, and 12. The results are shown in Figure 7; we found that k = 6 performs best.

Analysis of Part Attention (K)
There are many ways to calculate attention scores; in this work, we used self-part attention (Figure 4) and compared it against using one attention branch (Figure 8). Table 9 lists the effect of self-part attention on the coarse branch. Using self-part attention instead of the single attention branch in component 3 improved mAP by 0.5% and Rank1 by 0.8% on iLIDS-VID over the results of component 2, and by 0.6% and 1.1%, respectively, on the PRID dataset. In comparison to component 1, which has no part attention, it improved mAP by 2.9% and Rank1 by 2.8% on iLIDS-VID and by 2.3% and 4.5%, respectively, on the PRID dataset. Table 9 also shows that using part attention in both the coarse and fine branches, as in component 4, worsens Rank1 by 1.8% and mAP by 2.3% on iLIDS-VID relative to using only coarse part attention, and by 4.5% and 2.3%, respectively, on PRID.
Thus, when we use part attention on the coarse branch, we focus on the important features of the coarse branch and extract fine details from the fine branch. Not using any part attention in either branch fails to capture discriminative part features, as in component 1, whereas using part attention in both branches loses focus on the important coarse part features, as in component 4. It is evident that using part attention only in the coarse-grained branch achieves the best accuracy, as it focuses on discriminative features while preserving fine details.

Conclusions
This paper proposes the PMP-MA extractor for video-based re-identification. A multi-local model can learn both generic and specific (multi-localization) features of each frame. To take full advantage of the pyramid model, we used self-spatial and channel attention, which enabled us to weight each spatial feature and channel to enrich the feature vectors. Multiple local parts were used to learn specific parts at two scales with part attention. The PMP system is complemented with TAP, which extracts temporal information among frames. The evaluation was conducted on four challenging datasets, where the proposed model achieved better performance on three of them: improvements of 2.4% on iLIDS-VID, 2.7% on PRID2011, 0.9% on DukeMTMC-VideoReID, and 11.5% in the cross-dataset setting. The PMP-MA extractor is a well-designed extractor that can extract and fuse robust features from multiple granularities. Potential applications of this approach to other computer vision problems include object tracking and image or video object segmentation. In the future, we plan to add positional encoding so that the model pays more attention to important frames. Moreover, we will try to enhance our pyramid model to reduce the complexity of the system components and remove the need for a GPU with a large VRAM capacity.