A Patch Information Supplement Transformer for Person Re-Identification

Extracting fine-grained features from person images has proven crucial in person re-identification (re-ID). Although research on convolutional neural networks (CNNs) has been very successful in person re-ID, existing CNNs cannot solve the problem of information loss because of their small receptive field and downsampling operations. The multi-head attention modules in the transformer can solve these problems well. However, since dicing operations destroy the spatial correlation between patches, the transformer still loses some local features. In this paper, we propose the patch information supplement transformer (PIT) to extract fine-grained features in the dicing stage. A patch pyramid network (PPN) is introduced to solve the problem of local information loss. This is accomplished by dividing the image into different scales through the dicing operation and adding them together from top to bottom according to the pyramid structure. In addition, we insert a learnable identity information-embedding module (IDE) to reduce the feature bias caused by clothing and camera perspective. Experiments verify the superiority and effectiveness of PIT compared to state-of-the-art methods.


Introduction
Person re-identification [1] is a crucial but challenging task in the fields of multimedia and computer vision, and it has been used in various applications. The purpose of person re-identification is to associate a particular person quickly across different scenes and camera perspectives. The main challenge comes from the difficulty of accurately distinguishing between people with similar appearances. For example, as shown in Figure 1, the clothes of the two women are of the same colour, the postures of the two men have changed, and the occlusions cover many key pedestrian features. Therefore, extracting fine-grained features is crucial to person re-ID [2].
According to the authors in [3,4], CNN-based methods are among the most intriguing in current research. However, research methods based on CNNs still face two challenges. (1) Obtaining rich contextual information in global feature extraction is crucial for person re-ID [5], but due to the convolution and the small receptive field, CNN methods only focus on small local features [6]. (2) Extracting detailed fine-grained features from the person image is also important. However, the convolution and pooling operations of CNNs reduce the resolution of the image, leading to a loss of feature information. This greatly affects the accuracy of fine-grained feature extraction, for example when distinguishing similar appearances [7].

Figure 1. Person re-identification is about identifying the correct person, which is extremely challenging due to similarities in appearance, changes in posture, and occlusion. Failure to consider the identity information of the person can lead to recognition errors. Panels: similar clothing, pose changes, occlusion.
The multi-head attention modules in the transformer [8], together with the removal of convolution and pooling operations, can better solve the above-mentioned critical problems facing CNNs. The reasons are as follows. (1) The multi-head attention modules can capture long-distance relationships and make the model concentrate more on the relationship between different regions. (2) Removing operations such as convolution and pooling reduces feature loss and allows the model to retain more detailed feature information [9]. Given these advantages, we chose the transformer network as the baseline for person re-ID.
Although the transformer has the above-mentioned significant advantages, it still faces tremendous challenges in person re-ID tasks, such as occlusion, lighting, camera perspective, and pose diversity. Therefore, the transformer should be improved. Many attempts have been made to address these problems in CNN-based person re-ID research. For extracting fine-grained features, local part features [10][11][12] and auxiliary features [13,14] have been shown to be essential and effective. However, CNN-based methods extract features from a complete image, while the slicing operation in the transformer cuts the image into independent patches, so directly migrating CNN-based methods into the transformer results in suboptimal recognition [15]. In addition, directly adding a CNN-based side semantic information module to the transformer network cannot make full use of its encoding ability. Therefore, embedding modules that were specifically designed for CNNs into the transformer model is also a big challenge.
We propose the PIT scheme to solve the above-mentioned critical problems. Firstly, we propose a PPN to solve the problem that the correlation between image regions is destroyed by the dicing operation. The PPN divides the image into blocks of different scales by dicing, which are connected and added from top to bottom according to a pyramid structure for further fine-grained feature learning. The PPN structure allows the smallest-scale patch to obtain the feature information of the other image scales, strengthening its fine-grained features. Therefore, the network can extract fine-grained features that are invariant to global perturbations. To the best of our knowledge, the issue of transformer-based identity information encoding has not been investigated [16]. Unlike the semantic information embedding used in CNN-based methods, we designed the IDE for the transformer, effectively integrating identity semantic information through learnable embeddings to reduce the feature bias caused by peripheral information. For instance, the proposed IDE can address the problem of high matching similarity caused by similar clothing.
In summary, this is the first study to design a PPN structure and apply it to person re-ID. The main contributions of this paper are as follows:
1. We designed a PPN structure and applied it with the transformer for person re-ID, addressing the global perturbation caused by spatial dimension segmentation and the resulting poor fine-grained features.
2. We provided an approach to identity information embedding that encodes identity information through learnable embeddings, thus effectively addressing the problem of learned feature bias.

Related Work

Auxiliary Feature Representation Learning
Information interference is the most challenging problem in person re-ID [16]. Auxiliary feature representations have been adopted to learn consistent person re-ID features and to deal with this problem. Common ways of enhancing auxiliary feature representations include additional annotation information [13,14,21,22,23], timing information [24,25], and generated or augmented training samples [26]. Furthermore, some studies have fused representations from multiple levels [4,27,28], requiring extra methods such as body-part detection [4,27] or attention mechanisms [28]. However, most of these methods were designed for CNNs, and the slicing operation in the transformer cuts the image into independent patches, making most of them unsuitable for direct application to the transformer [15]. Thus, an auxiliary feature representation method for transformer-based person re-ID is needed.

Visual Transformer
The transformer [9] was originally proposed and applied in the field of natural language processing (NLP) and has gradually become the mainstream method for NLP tasks. Recently, the transformer has been shown to outperform traditional methods in many vision tasks, such as object detection [29], generative adversarial networks [30], and action recognition [31]. The visual transformer (VIT) [8] was the first transformer to be applied to image vision. It first divides the input image into image patches, then uses a linear projection to map each patch to a 1D vector and adds a learnable 'class label (CLS)', and finally passes the merged 1D vectors to the transformer encoder. TransReID [32] is the first application of VIT in the field of object re-ID. Its patch shuffle operation and side information module were designed to verify the advantages of the transformer on the re-ID task. PAT [27] integrates transformer architectures into CNNs and designs part diversity and part discriminability mechanisms to diversify part discovery. However, it fails to solve the problem of fine-grained feature loss caused by the transformer dicing operations. Thus, obtaining high correlations from scattered patch blocks in the transformer is still an urgent problem. This paper proposes a transformer-based patch information supplementation method to address the problems in existing transformer fine-grained feature extraction schemes.

Methodology
In this section, we present the techniques designed to improve the transformer baseline. As demonstrated in Figure 2, the proposed PIT consists of the baseline, a patch feature extraction network (PPN), and the semantic information encoder (IDE). The first part is the baseline; we use the ViT model as the baseline for the person re-identification task. The details will be introduced in Section 3.1. The second part is the PPN, which obtains fine-grained features that are invariant to global disturbances by addressing the local feature information loss caused by dicing operations. The details are introduced in Section 3.2. The third part is the IDE, which aims to avoid interfering with the generation of information encoding. The same person should have the same identity information encoding. The details will be introduced in Section 3.

Baseline
As shown in Figure 3, the input is an image $x \in \mathbb{R}^{C,H,W}$, where $C$, $H$, and $W$ represent the channel size, height, and width, respectively. We divide it into $N$ fixed-size patches $\{x_p^i \mid i = 1, 2, \cdots, N\}$ through the slicing operation. These patches are flattened into one-dimensional tensors and then projected onto a lower-dimensional space of dimension $D$ using a linear transformation. The resulting feature vectors for each patch are used as inputs for the subsequent layers. In addition, a learnable embedding token denoted as $x_{cls}$ is prepended to the input sequence, with its output serving as the global feature denoted by $f$. Learnable position embeddings are also incorporated to capture spatial information. The input sequence to the transformer layers can be represented as:
$$Z_0 = [x_{cls};\ \mathcal{F}(x_p^1);\ \mathcal{F}(x_p^2);\ \cdots;\ \mathcal{F}(x_p^N)] + \mathcal{P} \quad (1)$$
where $Z_0$ is the input sequence embedding of the baseline, and $\mathcal{P} \in \mathbb{R}^{(N+1) \times D}$ is the position information embedding. $\mathcal{F}$ is the linear mapping of the diced patches $x_p^i$ to $D$ dimensions. Additionally, $l$ transformer layers are used to learn feature representations.
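The construction of the input sequence can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `build_input_sequence`, the embedding size `dim=768`, and the random weights standing in for learned parameters are all assumptions, and the slicing here is non-overlapping.

```python
import numpy as np

def build_input_sequence(image, patch, dim, rng):
    """Return [x_cls; F(x_p^1); ...; F(x_p^N)] + P for one image (hypothetical helper)."""
    c, h, w = image.shape
    n_h, n_w = h // patch, w // patch            # non-overlapping slicing
    # Cut the image into N = n_h * n_w patches and flatten each one.
    patches = image[:, :n_h * patch, :n_w * patch]
    patches = patches.reshape(c, n_h, patch, n_w, patch)
    patches = patches.transpose(1, 3, 0, 2, 4).reshape(n_h * n_w, c * patch * patch)
    # F: linear projection to `dim` dimensions (random weights replace learned ones).
    proj = rng.normal(0.0, 0.02, size=(c * patch * patch, dim))
    x_cls = np.zeros((1, dim))                   # stands in for the learnable class token
    pos = rng.normal(0.0, 0.02, size=(n_h * n_w + 1, dim))  # position embedding
    return np.concatenate([x_cls, patches @ proj], axis=0) + pos

rng = np.random.default_rng(0)
z0 = build_input_sequence(rng.random((3, 256, 128)), patch=16, dim=768, rng=rng)
print(z0.shape)  # (129, 768)
```

For a 256 × 128 image and 16 × 16 patches, the sequence holds 128 patch tokens plus the class token, which is why the position embedding has N + 1 rows.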
Patch Slicing. Patch slicing refers to the process of dividing an input image into several small, equally-sized blocks called "patches". We use a sliding window approach to divide an image of size $H \times W$ into $N$ fixed-size patches. The stride is denoted as $S$, and the patch size is denoted as $P$. The number of patches is
$$N = N_H \times N_W = \left\lfloor \frac{H + S - P}{S} \right\rfloor \times \left\lfloor \frac{W + S - P}{S} \right\rfloor \quad (2)$$
where $\lfloor \cdot \rfloor$ is the floor function and $S$ is set equal to $P$. $N_H$ and $N_W$ represent the number of patches along the height and width, respectively.
Position Embeddings. The image resolution in person re-ID is different from the original resolution of VIT images, so the position embeddings pre-trained on ImageNet cannot be directly loaded into the baseline. Therefore, we introduce bilinear 2D interpolation to handle the input image resolution.
Supervised Learning. We use the ID loss to optimize the baseline. The ID loss is the cross-entropy loss without label smoothing and can be expressed as:
$$\mathcal{L}_{ID} = -\sum_{i=1}^{M} q_i \log(p_i), \qquad q_i = \begin{cases} 1, & y = i \\ 0, & y \neq i \end{cases} \quad (3)$$
where $M$ represents the total number of people, $y$ represents the ground-truth ID label, $q_i$ represents the target probability, and $p_i$ represents the ID prediction logits of class $i$. The smaller the cross-entropy loss, the closer the predicted result is to the true result.
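The patch count and the ID loss can be checked with a short pure-Python sketch. It assumes the usual sliding-window count floor((H + S − P) / S) along each axis and assumes that a softmax is applied to the prediction logits before the logarithm; the function names `num_patches` and `id_loss` are illustrative and do not come from the paper.

```python
import math

def num_patches(height, width, patch, stride):
    """N_H, N_W and N for a sliding window of size `patch` moved by `stride`."""
    n_h = (height + stride - patch) // stride
    n_w = (width + stride - patch) // stride
    return n_h, n_w, n_h * n_w

def id_loss(logits, label):
    """Cross-entropy without label smoothing: -log softmax(logits)[label]."""
    m = max(logits)                                   # subtract the max for stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

print(num_patches(256, 128, patch=16, stride=16))    # (16, 8, 128)
print(id_loss([2.0, 0.5, 0.1], label=0))
```

With the stride equal to the patch size, a 256 × 128 input yields 16 × 8 = 128 patches. Because the target distribution is one-hot, the sum over all identities reduces to the single term of the true class.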

Patch Pyramid Network
Although the transformer encoder provides a larger receptive field, the dicing operation reduces the correlation between patches in the spatial dimension, making it difficult to extract fine-grained features effectively.
As shown in Figure 4, the proposed PPN is a multi-scale patch fine-grained feature extraction network. Given an image $x \in \mathbb{R}^{C,H,W}$, $C$, $H$, and $W$ represent the channel size, height, and width, respectively. Then, we split it into $N$ fixed-size patches $\{x_p^i \mid i = 1, 2, \cdots, N\}$. We explain the details of the PPN below.
Based on convolution operations with four different kernel sizes (2×2, 4×4, 8×8, 16×16), $x$ is divided into patches of various sizes $\{P_0, P_1, P_2, P_3\} \in \mathbb{R}^{C,H,W}$. These operations only alter the size of the patches, so $H$ and $W$ differ from one scale to the next. Then, $P_0$, $P_1$, $P_2$, and $P_3$ are arranged in a pyramid structure, where the top of the pyramid is the largest patch scale $P_3$ and the bottom is the smallest, $P_0$. Starting from the top of the pyramid, the downsampling operator $F_{2D}$ is applied to scale the patches. After downsampling, adjacent patches are summed together. These two operations are repeated until the patches reach the smallest scale, which is the target patch. Accordingly, the object patches $x_p^i \in \mathbb{R}^{C,H,W}$ can be obtained. Moreover, the proposed feature extraction module $F_E$ is applied to $P_1, P_2, P_3 \in \mathbb{R}^{C,H,W}$ before each downsampling operation. It is worth noting that feature extraction is not performed at the smallest patch scale, since a shallow feature extraction operation at that scale affects model accuracy. Finally, the output $x_p^i$ is given by
$$x_p^i = P_0 + F_{2D}(F_E(P_1 + F_{2D}(F_E(P_2 + F_{2D}(F_E(P_3)))))) \quad (5)$$
where $F_{2D}$ represents the downsampling operation with a convolution kernel of 3 × 3.
As shown in Figure 5, the extraction module consists of four parts: the input $P \in \mathbb{R}^{C,H,W}$, the AC module, the AS module, and the output $P' \in \mathbb{R}^{C,H,W}$. The AC module is an attention mechanism module in the channel dimension, and the AS module is an attention mechanism module in the spatial dimension. The two modules work in parallel in the extraction module. They extract channel features and spatial features from the patch block, respectively, and the two results are then added together to obtain the final fine-grained feature.
AC module. The AC module is an adaptive feature extraction module in the channel dimension. Two 2D maps $P_M \in \mathbb{R}^{C,1,1}$ and $P_A \in \mathbb{R}^{C,1,1}$ are generated by the parallel maximum-pooling $F_{Max}$ and average-pooling $F_{Avg}$, which resize the input $P$ from $C, H, W$ to $C, 1, 1$. Then, in order to avoid overcomplicating the model, the obtained 2D feature maps are processed by the squeeze and permute operation $F_S$ to obtain $P_{SM} \in \mathbb{R}^{1,C}$ and $P_{SA} \in \mathbb{R}^{1,C}$. After this, the 1D convolution module $F_{1D}$ with a kernel size of 3 is used to process $P_{SM}$ and $P_{SA}$ separately. This realizes local cross-channel interaction and yields $P_{CM} \in \mathbb{R}^{1,C}$ and $P_{CA} \in \mathbb{R}^{1,C}$. Both are then added together to obtain $P_C \in \mathbb{R}^{1,C}$. $P_{sg} \in \mathbb{R}^{1,C}$ is acquired by applying a sigmoid activation function $F_{sg}$ to $P_C$. Finally, the weight of the channel dimension $P_{weight} \in \mathbb{R}^{C,1,1}$ is obtained after a dimensional upgrade operation $F_{us}$. The operation of the AC module is given by
$$P_{weight} = F_{us}(F_{sg}(F_{1D}(F_S(F_{Max}(P))) + F_{1D}(F_S(F_{Avg}(P))))) \quad (6)$$

The adaptive channel feature $A_C(P)$ can be acquired by multiplying $P_{weight}$ by the input patch $P$, giving
$$A_C(P) = P_{weight} \otimes P \quad (7)$$
AS module. The AS module is an adaptive feature extraction module in the spatial dimension. Two 2D maps $P_M \in \mathbb{R}^{1,H,W}$ and $P_A \in \mathbb{R}^{1,H,W}$ are generated through the adaptive maximum-pooling $F_M$ and adaptive average-pooling $F_A$, producing two $1 \times H \times W$ feature maps from the input $P$. After this, $P_C \in \mathbb{R}^{2,H,W}$ is obtained by concatenating $P_M$ and $P_A$ through a concatenation operation $\oplus$. Then, $P_{Cov} \in \mathbb{R}^{1,H,W}$ is obtained by a convolution $F_{2D}$ with a kernel size of 3×3 that reduces the number of channels in $P_C$ to 1. The adaptive weight $P_{weight}$ of the spatial dimension is acquired by applying a sigmoid activation function $F_{sg}$ to $P_{Cov}$. Accordingly, the operation of the AS module can be computed as
$$P_{weight} = F_{sg}(F_{2D}(F_M(P) \oplus F_A(P))) \quad (8)$$
Similarly, the adaptive spatial feature $A_S(P)$ can be defined as
$$A_S(P) = P_{weight} \otimes P \quad (9)$$
Based on the AC and AS operations, the output $P'$ of the extraction module can be computed as
$$P' = w_1 * A_C(P) + w_2 * A_S(P) \quad (10)$$
where $w_1$ and $w_2$ are learnable weights, updated with the gradient, whose sum is 1.
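The extraction module and the top-down pyramid pass described above can be sketched in NumPy as follows. This is a rough sketch under several assumptions that are not taken from the paper: the pooling in the AS module is read as a channel-wise maximum and mean, 2 × 2 average pooling stands in for the 3 × 3 downsampling convolution $F_{2D}$ of the pyramid, the kernels are random rather than learned, `np.convolve` flips its kernel (unlike the cross-correlation used by deep learning libraries), and all function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(p, kernel):
    """AC module: one weight per channel from max- and average-pooled maps."""
    p_sm = p.max(axis=(1, 2))                      # pooled and squeezed, shape (C,)
    p_sa = p.mean(axis=(1, 2))
    conv = lambda v: np.convolve(v, kernel, mode="same")  # 1D conv across channels
    weight = sigmoid(conv(p_sm) + conv(p_sa))
    return weight[:, None, None] * p               # reshape to (C, 1, 1), then scale P

def conv3x3(x, kernel):
    """Zero-padded 3x3 convolution mapping (2, H, W) to (H, W)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += np.tensordot(kernel[:, i, j], padded[:, i:i + h, j:j + w], axes=1)
    return out

def spatial_attention(p, kernel):
    """AS module: one weight per location from channel-wise max and mean maps."""
    p_c = np.stack([p.max(axis=0), p.mean(axis=0)])  # concatenation, shape (2, H, W)
    return sigmoid(conv3x3(p_c, kernel))[None] * p

def extract(p, k1d, k2d, w1=0.5):
    """Weighted sum of both branches with w2 = 1 - w1, as in Equation (10)."""
    return w1 * channel_attention(p, k1d) + (1.0 - w1) * spatial_attention(p, k2d)

def downsample(x):
    """Stand-in for the downsampling operator: 2x2 average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def pyramid(levels, k1d, k2d):
    """Top-down pass: extract, downsample, add to the next smaller level."""
    acc = levels[0]                                # largest map first
    for smaller in levels[1:]:
        acc = downsample(extract(acc, k1d, k2d)) + smaller
    return acc

rng = np.random.default_rng(0)
k1d, k2d = rng.normal(size=3), rng.normal(size=(2, 3, 3))
p = rng.random((8, 16, 8))
out = extract(p, k1d, k2d)
levels = [rng.random((8, 16, 8)), rng.random((8, 8, 4)), rng.random((8, 4, 2))]
target = pyramid(levels, k1d, k2d)
print(out.shape, target.shape)  # (8, 16, 8) (8, 4, 2)
```

Both attention branches only rescale the input, so the output of `extract` keeps the shape of the patch map, and the smallest level enters the pyramid sum without passing through the extraction module.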

Identity Information Embedding Module
Even after fine-grained features have been obtained, the impact of clothing changes cannot be ignored. In other words, the model may not be able to discriminate between different objects seen from the same angle because of the bias introduced by clothing information. Therefore, the proposed IDE incorporates the identity information of people into the embedding representation to obtain robust features.
Similar to positional embeddings employing learnable embeddings to encode positional information, we insert learnable one-dimensional embeddings to preserve identity information.In particular, as shown in Figure 1, we insert the identity information embedding into the transformer encoder with the patch and position embeddings.Specifically, assuming that there is a total of M person IDs, we initialize the learnable identity information embedding as G ID ∈ R M×D .If the ID of a person is k, the identity information embedding can be expressed as G ID (k).Unlike positional embeddings which vary between patches, identity information embeddings G ID (k) are the same for all patches of an image.
As the identity information embedding, patch embedding, and position embedding are all linearly mapped to a $D$-dimensional space, their input sequences can be directly added for information integration. The IDE can be written as
$$Z_0' = Z_0 + \lambda\, G_{ID}(k) \quad (11)$$
where $Z_0'$ is the input sequence after adding the identity information embedding, $Z_0$ is the original input sequence in the baseline, and $\lambda$ is the weighting parameter of the identity information embedding. The transformer encoding layer can encode embeddings of different distribution types, so these features can be added directly.
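As a minimal sketch of this step, the identity embedding can be looked up in a table with one row per person ID and added to every token of the input sequence. The function name `add_identity_embedding`, the random initialization, the sequence length, and the embedding size are assumptions made for illustration; the value λ = 2.0 is the setting reported in the ablation study, and 751 is the number of training identities in Market-1501.

```python
import numpy as np

def add_identity_embedding(z0, g_id, person_id, lam):
    """Add lam * G_ID(k) to the sequence; the same vector goes to every token."""
    return z0 + lam * g_id[person_id]

rng = np.random.default_rng(0)
num_ids, n_tokens, dim = 751, 129, 768             # assumed sizes
g_id = rng.normal(0.0, 0.02, size=(num_ids, dim))  # stands in for the learnable table
z0 = rng.normal(size=(n_tokens, dim))
z0_ide = add_identity_embedding(z0, g_id, person_id=42, lam=2.0)
print(z0_ide.shape)  # (129, 768)
```

Setting `lam` to 0 recovers the baseline input sequence, which matches the role of λ = 0 in the ablation study.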

Datasets
In order to demonstrate the effectiveness of our methods, experiments were conducted on three widely used holistic re-ID datasets (Market-1501 [17], Duke-MTMC [18], MSMT17 [20]) and one occluded re-ID dataset (Occluded-Duke [19]). Table 1 summarizes the datasets, which are described in detail as follows.
Duke-MTMC [18] has a total of 36,411 images of 1812 people captured by 8 cameras. A total of 16,522 images of 702 people are randomly selected as the training set, 2228 of the remaining images are used as query images, and the other 17,661 images form the gallery used for testing.
MSMT17 [20] is the largest dataset, with 4101 IDs and 126,441 images captured by 15 cameras. The training set has 32,621 images of 1041 identities, and the test set has 93,820 images of 3060 identities. In the test set, 11,659 images are randomly selected as query images.
Occluded-Duke [19] is a split of Duke-MTMC [18] that preserves occluded images and removes overlapping images.It contains 15,618 training images, 17,661 gallery images, and 2210 occlusion query images.

Experimental Strategy and Experimental Environment
All images were resized to 256 × 128 for training and testing. For data augmentation, we employed random horizontal flipping, padding, random cropping, and random erasing. The SGD optimizer was adopted with a momentum of 0.9 and a weight decay of 1 × 10⁻⁴. The initial learning rate and the batch size were set to 0.008 and 64, respectively.
Evaluation Protocols. Cumulative matching characteristic (CMC) curves and mean average precision (mAP) were adopted to evaluate the quality of the different re-ID models. All experiments were performed in single-query mode, and re-ranking was not used to refine the matching results further.
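The two metrics can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: each query comes with a gallery already sorted by decreasing similarity, the filtering of gallery images from the same camera that standard re-ID protocols apply is omitted, and the function names and toy rankings are hypothetical.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """1 if a correct match appears among the first k gallery results, else 0."""
    return int(query_id in ranked_ids[:k])

def average_precision(ranked_ids, query_id):
    """Mean of the precision values at each rank where a correct match occurs."""
    hits, precisions = 0, []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def evaluate(rankings, ks=(1, 5, 10)):
    """rankings: list of (query_id, gallery ids sorted by decreasing similarity)."""
    n = len(rankings)
    cmc = {k: sum(rank_k_hit(r, q, k) for q, r in rankings) / n for k in ks}
    m_ap = sum(average_precision(r, q) for q, r in rankings) / n
    return cmc, m_ap

rankings = [(1, [1, 2, 1, 3]), (2, [3, 2, 1, 2])]
cmc, m_ap = evaluate(rankings, ks=(1, 2))
print(cmc, m_ap)
```

In this toy example the first query is matched at rank 1 and the second only at rank 2, so rank-1 accuracy is 0.5 and rank-2 accuracy is 1.0, while the mAP averages the per-query precision over all correct matches.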

Results of Backbones
In this section, we compare the performances of the backbones. We chose several backbones, including SqueezeNet, ShuffleNet, DenseNet, ResNet50, ResNet101, SEResNet50, SEResNet101, and VIT-B/16, where VIT-B/16 represents the VIT baseline with patch size 16.
As shown in Table 2, there is a big gap in performance between the VIT-B/16 backbone and the CNN-based backbones. On the re-ID benchmark, VIT-B/16 greatly outperforms the CNN-based backbones, exceeding SEResNet101 by at least 11.6% mAP and 8.2% rank-1. This suggests that the multi-head attention mechanism of the transformer can obtain a larger receptive field. Thus, we choose VIT-B/16 as the baseline.

Ablation Study of PPN
As shown in Table 3, we evaluate the superiority of our proposed PPN. Compared to the baseline, the PPN improves performance by +1.6% mAP and +1.1% rank-1. As the number of patch scales $P$ in the PPN increases, the performance of the PPN gradually increases, suggesting that adding larger patches to the PPN helps to extract spatial features in the dicing stage. Furthermore, by comparing the PPN with the PPN without the extraction module, we find that the extraction module improves the model by +0.8% mAP and +0.6% rank-1, suggesting that the extraction module in our PPN helps the model to obtain fine-grained features effectively. The visual attention map in Figure 6 shows that the model obtains more context-aware information and better fine-grained features through the PPN operation, improving the anti-interference ability of the model.

Ablation Study of IDE
Performance Analysis. As shown in Table 4, we assess the utility of the proposed IDE. The experimental results show that our method achieves improvements of 1.7% mAP and 0.8% rank-1, with slight improvements in both rank-5 and rank-10, which are already extremely high, showing that the proposed IDE works well on the baseline. The IDE can separate invariant ID features from semantic features, which is significant for eliminating interference in person re-ID.
Ablation study of λ. As demonstrated in Figure 7, we analysed the effect of the parameter λ on the IDE. In particular, the baseline reached 86.7% mAP and 94.5% rank-1 with λ set to 0. When λ increases, the performance of the IDE improves to 88.4% mAP and 95.3% rank-1 (λ = 2.0), indicating that our method can learn invariant features well. With a further increase in λ, the relative contribution of the feature embedding and position embedding gradually decreases, which degrades the performance of the system.

Ablation Study of PIT
As shown in Table 5, compared to the baseline, our proposed PPN and IDE improve performance by +1.6% mAP/+1.1% rank-1 and +1.9% mAP/+0.9% rank-1, respectively. The ablation study shows that the proposed PIT outperforms the baseline, reaching 63.1% (+2.2%) mAP and 82.3% (+1.3%) rank-1. The experimental results demonstrate that the PPN and IDE complement each other in person re-ID.

Comparison with State-of-the-Art Approaches
As shown in Table 6, we compare our model with the state-of-the-art methods on three holistic benchmarks and one occluded benchmark to show the effectiveness of the proposed PIT. In addition, we also compare against the results of the baseline.
Market1501. Compared with the state-of-the-art methods, the proposed PIT achieves outstanding performance. Specifically, when compared with other transformer-based networks [17,18,33,38], our method obtains the best result of 88.8% (+0.8%) mAP. There is little difference between the results of these methods because the exploration of Market1501 is saturated.
Duke-MTMC. Although the proposed PIT performs close to the other methods, our model still delivers first-rate performance, showing the robustness of the proposed methods. Furthermore, compared to the baseline, our method exhibits better results (+1.9% mAP/+2.0% rank-1), which validates our work on the transformer baseline.
MSMT17. As the largest person re-ID dataset, MSMT17 makes robust feature extraction in complex environments more difficult. Because a greater number of cameras is used to capture people in the MSMT17 dataset, there is a greater diversity of clothing and posture changes among the individuals, leading to suboptimal recognition performance. However, the proposed PIT proved its reliability with good experimental results, outperforming ISE [39] when integrating identity information.
Occluded re-ID. The experimental results show that PIT still achieves a good score of 54.8% mAP in the face of severe occlusion and fewer available features, even though occlusions generate interfering features that lower the recognition performance of the models. Additionally, compared with PAT [27], which aligns body parts, our method achieved better results (+1.2% over PAT [27]) without aligning body parts.

The Matching Visualization Results
We compare the ranking results of the baseline and PIT in Figure 8, and the results indicate that PIT can effectively address the recognition inaccuracy caused by similar clothing, human pose variations, and partial occlusions.The identity information provides useful assistance to the model.

Conclusions
In this paper, we proposed a patch information supplement transformer (PIT) that can handle local information loss. By using the introduced PPN, PIT can counter the decrease in spatial correlation caused by dicing operations and obtain fine-grained features. Furthermore, we designed the IDE to reduce the influence of clothing changes by separating ID-related features from the semantic features. Extensive experimental results for re-ID verify the superiority of PIT and the effectiveness of its components.
However, our model does not perform optimally on occluded re-ID data.It cannot offer a high generalization performance in a randomly occluded environment.We will concentrate on how to better utilize structural information to handle the interference of occlusions in future research.

Figure 2. Framework of the proposed PIT. The PPN (light purple) cuts the image into patches and passes them into the encoder. The identity information is encoded as an embedded representation by the IDE (light yellow). The transformer encoder uniformly encodes the identity information embedding, patch embeddings, and position embeddings.

Figure 3. Framework of the baseline. The output [cls] token marked with * serves as the global feature f. Inspired by [7], we introduce the BNNeck after f.

Figure 4. (a) Framework of the patch pyramid network; (b) framework of the extraction module.

Table 1. Dataset statistics used in this article.

Market-1501 [17] has a total of 32,668 images, consisting of 1501 identities under 6 cameras. It is split into a training set of 12,936 images of 751 identities and a test set of 19,732 images of 750 identities.

Table 2. Comparison of different backbones. VIT-B/16 is the baseline model of this paper, referred to as baseline for short.

Table 3. Ablation study of the patch pyramid network on MSMT17.

Table 4. Ablation study of IDE on Market-1501.

Table 5. Ablation study of PIT on MSMT17.

Table 6. Comparison with the state-of-the-art methods.