Article

Cross-Attention Guided Local Feature Enhanced Multi-Branch Network for Person Re-Identification

School of Information Science and Engineering, Chongqing Jiaotong University, Chongqing 400074, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(8), 1626; https://doi.org/10.3390/electronics14081626
Submission received: 18 March 2025 / Revised: 9 April 2025 / Accepted: 15 April 2025 / Published: 17 April 2025

Abstract

The purpose of person re-identification is to identify and retrieve the same individual across different scenes, viewpoints, and times. As an instance-level recognition problem, person re-identification relies on discriminative features but is not determined by any single salient feature, which requires extracting identity-specific features from multiple perspectives. In this paper, we propose a cross-attention guided local feature enhanced multi-branch network, which comprises a global branch, a partial branch, and an attention channel branch. Guided by attention, the network jointly extracts local and global features to capture discriminative identity cues from multiple aspects. To enable each branch to effectively mine identity information, we design a dedicated module for each branch so that it captures discriminative features. Finally, we conducted extensive experiments on the Market-1501 and CUHK03 datasets and achieved outstanding results.

1. Introduction

In recent years, with the in-depth development of deep learning [1], neural networks have made astonishing progress in many directions, which has also driven the rapid development of person re-identification. With advances in technology and the spread of street cameras, person re-identification has attracted increasing attention and has been applied to security and surveillance [2]. Although person re-identification has achieved impressive results in recent years, matching images of the same individual across cameras remains a significant challenge due to the complexity of the real world. There are still many hurdles to overcome, including variations in posture, privacy constraints, cross-scene variations, and occlusions. These factors add layers of complexity to the task.
At the same time, since human activity is unpredictable, pedestrian images taken from different cameras must be matched correctly, which underpins applications such as monitoring [3], activity analysis [4], and people tracking. Therefore, we need more discriminative image feature representations, and we must exploit various factors to improve the discrimination between features. To address this problem, some methods [5,6] have focused on extracting global features through feature representation learning; Lin et al. [7] reweighted attributes by correcting predictions based on learned attribute dependencies and correlations. However, these methods ignore fine-grained features to a certain extent. In order to obtain more detailed feature representations, further research adds human body information to the network, that is, predicting human posture or human body parts. Some methods [8,9] employ semantic segmentation techniques focused on human parsing and use the parsing results as an auxiliary cue for feature extraction. There are also methods [10,11] that use unsupervised techniques to generate pseudo-labels from unlabeled samples, so that network training does not rely on manual labels. In addition, another approach locates different body parts by estimating key points or utilizing human pose estimation techniques [12]. However, these auxiliary models are pre-trained on different domains, which means that the learning process suffers from significant domain gaps and data bias, and such labeling incurs additional computational cost. In addition, the scale of the training datasets for person ReID is very limited, so the network can only fit rough and obvious features during global feature learning [13], and non-salient but key information is easily ignored.
To address the above issues and enable the network to extract more detailed feature representations, researchers have developed a number of local fine-grained feature representation methods [14]. Sun et al. [15] proposed dividing the feature space into several stripes and correcting the outlier features in each region, thereby capturing local features, which are finally concatenated as the final representation. Zheng et al. [16] proposed a coarse-to-fine pyramid network to extract identification information from different spatial scales in a progressive manner. Global features contribute greatly to the person ReID task, but, if fine-grained features are ignored, the network will fall into a local optimum. Therefore, we speculate that integrating a partial branch into the network will encourage it to attend to global features while taking fine-grained features into account, enhancing the network's identity discrimination. Previous works [13,17,18] have also demonstrated the benefits of multiple branches. Ref. [17] proposed using global and local branches to extract salient features and common features of images, respectively. However, this network is only a simple combination of global and local branches, and the local branches are merely the result of the hard segmentation of features.
Driven by the above factors, we propose a novel approach called the cross-attention guided local feature enhanced multi-branch network, a multi-branch network guided by channel attention. First, we establish three branches in the network, namely the global branch, the local branch, and the attention branch. Similar to most methods, we employ the global branch to capture the overall characteristics of pedestrians, such as their clothing and body shape. What sets our model apart is the inclusion of a feature amplification layer within the global branch, which aids in extracting the overall features of individuals more accurately. However, in the real world, a single distinctive feature cannot be decisive. In other words, a salient feature such as clothing is only a single attribute of a person, not the whole. Typically, we cannot determine a person's identity from clothing alone, because two people wearing the same clothes should not be identified as the same person. Therefore, if only global features are used, the network will inevitably focus on the clothing, which appears most informative for identification; in other words, the network attends to the clothes rather than the person. Obviously, this is not the result we want. To address this problem, we add a local branch to the network. The local branch focuses on capturing fine-grained information within the image, such as the pose of the figure and the texture of the clothing. To assist in extracting crucial fine-grained features from images of people, we employ attention-guided channel features to explore the discriminative information within the image.

2. Related Work

2.1. Holistic Feature Representation Learning

Person re-identification (ReID) is a technique [19] that aims to match images of people captured from non-overlapping camera views. It has gained popularity in intelligent video analysis, is a primary task in many surveillance and security applications, and is receiving increasing attention in computer vision. The goal is to identify target pedestrians in video sequences collected from non-overlapping camera views. Existing ReID methods typically use metric learning [6,7] and deep learning [20,21,22]. Before deep learning dominated the ReID task, hand-crafted features were developed; some methods [23,24] improve color and texture feature extraction by dividing the pedestrian image into stripes or triangle-shaped areas [25,26]. Later, deep learning methods were widely adopted. Liu et al. [27] focused on the comprehensive extraction of multi-scale features, thereby enriching the semantic information of the features. Li et al. [20] designed a multi-scale network to extract contextual information using multi-scale features. However, these methods largely ignore local fine-grained feature extraction [28].

2.2. Local Feature Learning in ReID

With the development of deep learning, stronger feature extraction can better serve the extraction of fine-grained features, which further promotes the use of part features in re-identification. Zhao et al. [21] used a deep network to decompose the human body into several spatial regions, extract features from each, and then concatenate them. Sun et al. [15] proposed dividing the overall feature into several stripes and learning the local features of each stripe region. Wang et al. [13] proposed a multi-branch network in which some branches learn local feature representations, thus enhancing the discrimination of the network. Park et al. [29] introduced a relation module that considers the relationship between a single part and the others, fully exploiting the relationships among parts. Local features can effectively extract part-level information from pedestrians, but they are easily affected by changes in human posture and by occlusion from pedestrians or other objects. Therefore, we also combine fine-grained local features and coarse-grained global features to enhance the feature representation.

2.3. Attention in ReID

Attention mechanisms have achieved very good results in other fields, such as natural language processing [30], object detection [31], and image segmentation. Similarly, attention is advantageous for the ReID task, especially in capturing more discriminative features. The common way to use attention in ReID is to insert separate attention blocks into a deep convolutional ReID model. Li et al. [32] proposed a harmonious attention CNN (HA-CNN) that jointly learns pixel-level soft attention and region-level hard attention to address poor bounding-box localization and pixel-level noise. Yang et al. [33] designed an internal attention network to search for informative and discriminative regions in images of whole bodies or body parts, decomposing attention into spatial attention and channel attention to strengthen the network's focus on pedestrian features. However, attention is, to some extent, coarse and cannot model complex relationships between parts, which leads to the loss of fine-grained information. Chen et al. [34] proposed a method that uses high-order attention modules and a linear polynomial predictor to model high-order semantics within attention. Zhang et al. [35] proposed a relation-aware global attention module that explores global-scope relationships to mine structural information and improve the network's discrimination ability. Chen et al. [22] proposed using global attention and local attention to learn attention-aware features, with global attention decoupling the background and local attention decoupling human body parts, which enhanced the robustness of the network. In this paper, we use channel-level attention to constrain feature channels and assist in extracting finer-grained key human features.

3. Methods

Our proposed approach is an attention-guided global and local joint multi-branch network, as shown in Figure 1. Similar to [18], we use OSNet [36] as the backbone network for fine-grained extraction of human features; compared with ResNet50 [37], it has a much smaller parameter count. We divide OSNet into two parts: the first part, from the first block of OSNet to conv3_0, serves as the backbone of the whole network; the other part, from conv3_1 to the last block, serves as a Block, and each of the three subsequent branch networks holds an independent, non-weight-sharing copy of this Block, as sketched below. In the following subsections, we introduce our proposed global branch, partial branch, and attention channel branch.
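To make the weight-sharing scheme concrete, the following is a minimal sketch under simplifying assumptions: the OSNet stages are replaced by a hypothetical `SimpleStage` placeholder, and only the split into a shared stem plus three independently weighted copies of the remaining blocks is taken from the description above.

```python
# Minimal sketch of the three-branch layout: shared stem, three independent tails.
import copy
import torch
import torch.nn as nn

class SimpleStage(nn.Sequential):
    """Placeholder for a stack of backbone blocks (not the real OSNet stage)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

class ThreeBranchNet(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Shared stem: the first blocks of the backbone, up to the split point.
        self.stem = nn.Sequential(
            SimpleStage(3, 64, stride=2),
            SimpleStage(64, 256, stride=2),
        )
        # Remaining blocks; each branch receives its own copy (no weight sharing).
        tail = SimpleStage(256, feat_dim, stride=2)
        self.global_tail = copy.deepcopy(tail)
        self.part_tail = copy.deepcopy(tail)
        self.attn_tail = copy.deepcopy(tail)

    def forward(self, x):
        shared = self.stem(x)                    # feature shared by all branches
        return (self.global_tail(shared),        # global branch input
                self.part_tail(shared),          # partial branch input
                self.attn_tail(shared))          # attention branch input

if __name__ == "__main__":
    g, p, a = ThreeBranchNet()(torch.randn(2, 3, 384, 128))
    print(g.shape, p.shape, a.shape)
```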
For our global branch, we retain the global branch from a previous work [13], but replace its maximum pooling layer to better enable the network to extract the global features of a person; in other words, we enhance the discriminative power of the coarse features extracted by the global branch. Specifically, we use a feature amplification module in place of the maximum pooling layer. After the pedestrian image is fed into the backbone and reaches conv3_0, we obtain a feature map F_1 with a dimension of 24 × 8 × 512. We then feed F_1 into the remaining layers of OSNet (with unshared weights), which yields a 512-dimensional vector. This vector is passed into the proposed feature amplification module, which enhances the features and produces a more salient 512-dimensional feature vector. It is worth noting that the feature amplification module only changes the internal attributes of the features without changing the feature dimension, so the output is still 512-dimensional. The feature amplification module is introduced in the next section. In addition, similarly to [18], we add a dropout block in the global branch, which takes the initial tensor as input. The dropout module erases the horizontal region with the highest activation, forcing the network to pay attention to regions with lower discriminative power, which naturally increases the robustness of the resulting representation.
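As a concrete illustration of this dropout behavior, the sketch below zeroes, during training, the horizontal strip whose activation is strongest; the strip-height ratio is a hypothetical hyperparameter, and the exact block used in [18] may select the region differently.

```python
# Minimal sketch: erase the most strongly activated horizontal band during training.
import torch
import torch.nn as nn

class HorizontalDropBlock(nn.Module):
    def __init__(self, drop_ratio=0.25):           # drop_ratio is an assumed value
        super().__init__()
        self.drop_ratio = drop_ratio

    def forward(self, x):                           # x: (B, C, H, W)
        if not self.training:
            return x
        b, c, h, w = x.shape
        band = max(1, int(h * self.drop_ratio))
        # Per-row activation strength, averaged over channels and width.
        row_energy = x.detach().abs().mean(dim=(1, 3))            # (B, H)
        # Total energy of every band of `band` consecutive rows.
        band_energy = row_energy.unfold(1, band, 1).sum(dim=2)    # (B, H - band + 1)
        start = band_energy.argmax(dim=1)                         # strongest band per image
        mask = torch.ones_like(x)
        for i in range(b):
            s = int(start[i])
            mask[i, :, s:s + band, :] = 0                         # erase that band
        return x * mask
```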
For the partial branch, like most methods, we divide the feature into N parts, but we find that hard segmentation of the feature, as conducted in previous works, does not provide much benefit for the network. This is because, for the semantic parts of the human body, if hard segmentation is directly applied, it will cause a rupture between parts, making the network ignore the connections between parts, and also damage the semantic features of the human body to some extent. Therefore, to avoid breaking the relationship between parts via the hard segmentation of the stripes, we propose a part fusion module (PFM). We feed the partial branch feature into the PFM as input, and obtain K fused partial features that have absorbed other feature information. In this way, each partial feature is no longer independent, but related to other parts. In Section 3.2, we will explore PFM in detail.
In the attention branch, the feature vector is pooled and decomposed into two 256-dimensional feature vectors, which we feed into the channel attention module [38] to highlight the information most worth attending to. We then use a 1 × 1 convolution to expand the feature representation and obtain two feature vectors, namely â_1 and â_2.
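A minimal sketch of this branch is given below, assuming a squeeze-and-excitation style gate [38] on each 256-dimensional half; the output width of the 1 × 1 expansion convolution (512 here) is an assumption, not a value taken from the paper.

```python
# Minimal sketch of the attention branch head: pool, split, channel-attend, expand.
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-excitation style gating applied to a channel vector."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, v):                        # v: (B, C)
        return v * self.fc(v)                    # reweight channels

class AttentionBranchHead(nn.Module):
    def __init__(self, in_channels=512, out_channels=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.att1 = SEAttention(in_channels // 2)
        self.att2 = SEAttention(in_channels // 2)
        self.expand1 = nn.Conv2d(in_channels // 2, out_channels, kernel_size=1)
        self.expand2 = nn.Conv2d(in_channels // 2, out_channels, kernel_size=1)

    def forward(self, x):                        # x: (B, 512, H, W)
        v = self.pool(x).flatten(1)              # pooled 512-d vector
        v1, v2 = v.chunk(2, dim=1)               # two 256-d halves
        a1 = self.expand1(self.att1(v1)[:, :, None, None]).flatten(1)
        a2 = self.expand2(self.att2(v2)[:, :, None, None]).flatten(1)
        return a1, a2                            # \hat{a}_1 and \hat{a}_2
```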
Finally, we collect all the feature vectors produced by the three branches and supervise them with a triplet loss [39]. At the same time, we process these features with BNNeck [40], which is composed of a batch normalization layer and a fully connected layer. The feature representations processed by the BNNeck blocks are supervised with a cross-entropy loss.
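A minimal sketch of the BNNeck head [40] applied to each branch feature is shown below: the feature before batch normalization feeds the metric loss, and the normalized feature passes through a bias-free classifier for the ID (cross-entropy) loss. Freezing the BN bias is a common BNNeck detail from [40].

```python
# Minimal sketch of a BNNeck head shared by all branch features.
import torch
import torch.nn as nn

class BNNeck(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.bn = nn.BatchNorm1d(feat_dim)
        self.bn.bias.requires_grad_(False)                   # freeze BN bias, as in [40]
        self.classifier = nn.Linear(feat_dim, num_classes, bias=False)

    def forward(self, feat):                                 # feat: (B, feat_dim)
        feat_bn = self.bn(feat)
        logits = self.classifier(feat_bn)
        # `feat` -> metric loss, `logits` -> ID loss, `feat_bn` -> inference feature.
        return feat, feat_bn, logits
```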

3.1. Feature Amplification Module

Unlike other global feature networks, our global branch specifically focuses on highly discriminative features, such as clothing characteristics and body attributes, which are more apparent. Therefore, we constructed a feature amplification module (FAM). As illustrated in Figure 2, the input to the feature amplification module is the global feature F_1, which is subsequently fed into both the max pooling layer and the average pooling layer to obtain g_max and g_avg, respectively.
$g_{max} = \mathrm{maxpool}(F_1), \quad g_{avg} = \mathrm{avgpool}(F_1)$
The max pooling layer is utilized to extract the most salient parts of the features while filtering out background noise. On the other hand, average pooling aims to preserve important information about the individuals, helping us obtain smoother feature maps. Afterward, we perform element-wise addition between g_avg and g_max and feed the result into a convolutional block to integrate the information from both sources. The output is then concatenated with g_max, which keeps the result aligned with the objective of the global branch, namely capturing more salient features, and yields a more robust representation. Finally, we compress the resulting feature vector G, thus obtaining the optimized feature after the feature amplification module.
$G = R_{down}(\mathrm{concat}(g_{max}, \mathrm{Conv}(g_{avg} + g_{max})))$
where Conv consists of a 1 × 1 convolutional block followed by normalization and a ReLU activation function, concat denotes the concatenation operation, and R_down refers to the convolutional layer that reduces the number of channels.
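A minimal sketch of the FAM following the formula above is given below; the channel sizes (512-channel input and output) are assumptions consistent with the branch description.

```python
# Minimal sketch of the feature amplification module:
# G = R_down(concat(g_max, Conv(g_avg + g_max))).
import torch
import torch.nn as nn

class FeatureAmplificationModule(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.maxpool = nn.AdaptiveMaxPool2d(1)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        # "Conv": 1x1 convolution followed by normalization and ReLU.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # R_down: 1x1 convolution reducing the concatenated 2C channels back to C.
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False)

    def forward(self, f):                                  # f: (B, C, H, W)
        g_max = self.maxpool(f)                            # most salient responses
        g_avg = self.avgpool(f)                            # smoother summary
        fused = self.conv(g_avg + g_max)                   # element-wise addition, then Conv
        g = self.reduce(torch.cat([g_max, fused], dim=1))  # concat with g_max, reduce channels
        return g.flatten(1)                                # (B, C) amplified feature
```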

3.2. Cross Fusion Module

In most local methods, the overall features are simply divided into several parts, but these parts are independent of each other and lack interconnection. In our approach, we employ a cross-fusion technique to blend the information of each part's feature representation with that of the other parts. This allows us to establish connections between different parts, enabling a more comprehensive understanding of the overall feature representation. We utilize average pooling to divide the feature F_2 into three parts, namely p_1, p_2, and p_3. As shown in Figure 3, we take these three local feature vectors as input. To combine the characteristic information of p_1 and p_2, we merge p_1 and p_2 to obtain a new feature vector p_0, as formulated below:
$p_0 = \frac{p_1 + p_2}{2},$
Then, we perform cross-fusion on these vectors to connect them with one another. We fuse the newly generated p_0 with p_3, and fuse p_1 with p_2, obtaining three new vectors, namely p̂_1, p̂_2, and p̂_3. It is worth noting that this fusion is achieved with a cross-attention mechanism, which can be represented as follows:
$f_{cross}(X_i, X_j) = \mathrm{Softmax}\left(\frac{Q_i K_j^{T}}{\sqrt{d_k}}\right) V,$
where d_k is the scaling factor, Q_i is the linear projection of X_i, and K_j and V are linear projections of X_j. In the proposed method, the cross-attention mechanism is first applied to p_1 and p_2: p_1 is projected as Q and p_2 as K and V, which yields p̂_1. We then swap p_1 and p_2 and apply the cross-attention mechanism again to obtain p̂_2. Similarly, cross-attention is performed on p_3 and p_0, with p_3 projected as Q and p_0 as K and V, which yields p̂_3.
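The sketch below illustrates this cross-fusion scheme under two assumptions that go beyond the text: each part is treated as a sequence of spatial tokens (the paper only specifies three horizontally pooled parts), and the part height is assumed divisible by three.

```python
# Minimal sketch of the cross fusion module:
# f_cross(X_i, X_j) = softmax(Q_i K_j^T / sqrt(d_k)) V, with Q from X_i and K, V from X_j.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x_i, x_j):                 # x_i: (B, N, dim), x_j: (B, M, dim)
        q, k, v = self.q(x_i), self.k(x_j), self.v(x_j)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v                          # (B, N, dim)

class CrossFusionModule(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.ca12 = CrossAttention(dim)          # p1 attends to p2  -> p1_hat
        self.ca21 = CrossAttention(dim)          # p2 attends to p1  -> p2_hat
        self.ca30 = CrossAttention(dim)          # p3 attends to p0  -> p3_hat

    def forward(self, feat):                     # feat: (B, C, H, W), H divisible by 3
        # Split into three horizontal parts and flatten each into spatial tokens.
        parts = [p.flatten(2).transpose(1, 2) for p in feat.chunk(3, dim=2)]
        p1, p2, p3 = parts
        p0 = (p1 + p2) / 2                       # fused feature of p1 and p2
        p1_hat = self.ca12(p1, p2)
        p2_hat = self.ca21(p2, p1)
        p3_hat = self.ca30(p3, p0)
        # Pool each fused part back into a vector representation.
        return [p.mean(dim=1) for p in (p1_hat, p2_hat, p3_hat)]
```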

3.3. Training Loss

Given a probe image $I \in \mathbb{R}^{H \times W \times C}$, where H, W, and C denote the height, width, and number of channels, respectively, the network produces output features that we supervise with two kinds of losses: the multi-similarity loss [41] (MS loss) and the ID loss, where we use cross-entropy as the ID loss. As shown in Figure 1, the classification embedding computed from each branch feature after BNNeck is used to calculate the ID loss, while the feature before BNNeck is used to calculate the MS loss. Functionally, the ID loss can be presented as follows:
$L_{ID} = - y_i \log \frac{\exp(C_i W_i)}{\sum_{j=1}^{N} \exp(C_j W_j)},$
where W_i is a linear projection matrix, y_i represents the corresponding ground-truth label, and N represents the total number of identities. The MS loss computes pairwise similarities, mines informative pairs, and weights the selected pairs according to their similarity. Functionally, the MS loss can be presented as follows:
$L_{MS} = \frac{1}{m} \sum_{i=1}^{m} \left\{ \log \left[ 1 + \sum_{k \in P} \exp\big( -(S_{ik} - \mu) \big) \right] + \log \left[ 1 + \sum_{k \in N} \exp\big( S_{ik} - \mu \big) \right] \right\},$
where m is the total number of identities, μ is a hyperparameter, P and N denote the sets of positive and negative samples, respectively, and S_ik represents the similarity between samples i and k, which is calculated as follows:
$S_{ij} = \langle f(x_i|\theta), f(x_j|\theta) \rangle,$
where $\langle \cdot , \cdot \rangle$ represents the dot product. Therefore, during the training process, we define the overall objective function for optimization as follows:
$L = \lambda L_{ID} + (1 - \lambda) L_{MS},$
where λ is a hyperparameter used to balance the two losses; in this article, we set it to 0.5. In addition, we use the cosine annealing strategy to update the learning rate. Cosine annealing is commonly used in ReID networks and stimulates the network better than a traditional step-wise learning-rate schedule, thereby improving performance. During the first 10 epochs, the learning rate increases from $6 \times 10^{-5}$ to $6 \times 10^{-4}$, and in the remaining epochs cosine decay is applied to reach a final learning rate of $6 \times 10^{-7}$.
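A minimal sketch of this training objective and schedule is given below, assuming the simplified single-hyperparameter MS loss written above (the full loss of [41] has additional scaling hyperparameters) and a placeholder value for the margin μ.

```python
# Minimal sketch of L = lam * L_ID + (1 - lam) * L_MS and the warm-up + cosine schedule.
import math
import torch
import torch.nn.functional as F

def ms_loss(features, labels, margin=0.5):          # `margin` plays the role of mu (placeholder value)
    """Simplified multi-similarity loss on L2-normalized embeddings."""
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t()                          # S_ij = <f_i, f_j>
    loss = 0.0
    for i in range(sim.size(0)):
        pos = (labels == labels[i]).clone(); pos[i] = False    # positives of anchor i
        neg = labels != labels[i]                              # negatives of anchor i
        pos_term = torch.log1p(torch.exp(-(sim[i][pos] - margin)).sum()) if pos.any() else 0.0
        neg_term = torch.log1p(torch.exp(sim[i][neg] - margin).sum()) if neg.any() else 0.0
        loss = loss + pos_term + neg_term
    return loss / sim.size(0)

def total_loss(logits, features, labels, lam=0.5):
    """L = lam * L_ID + (1 - lam) * L_MS, with cross-entropy as the ID loss."""
    return lam * F.cross_entropy(logits, labels) + (1 - lam) * ms_loss(features, labels)

def learning_rate(epoch, warmup_epochs=10, total_epochs=120,
                  base_lr=6e-4, start_lr=6e-5, final_lr=6e-7):
    """Linear warm-up to base_lr, then cosine decay to final_lr."""
    if epoch < warmup_epochs:
        return start_lr + (base_lr - start_lr) * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t))
```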

4. Results

4.1. Datasets and Evaluation Metrics

During model training, the use of biased data may lead to algorithmic discrimination [42,43]. In this study, we have also considered fairness in both the dataset and evaluation metrics by selecting the currently widely recognized mainstream pedestrian re-identification datasets for training. To assess the reliability of the model, we adopt the cumulative matching characteristics (CMC) at Rank 1 and the mean average precision (mAP) to evaluate performance.
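For reference, the sketch below shows how Rank-1 from the cumulative matching characteristic curve and mAP can be computed from a query-gallery distance matrix; the same-camera filtering applied by standard ReID protocols is omitted for brevity.

```python
# Minimal sketch of the evaluation metrics: CMC Rank-1 and mean average precision.
import numpy as np

def rank1_and_map(dist, q_ids, g_ids):
    """dist: (num_query, num_gallery) distances; q_ids, g_ids: numpy identity labels."""
    rank1_hits, aps = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                    # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(np.float32)
        rank1_hits.append(matches[0])                  # CMC at rank 1
        if matches.sum() == 0:
            continue
        # Average precision: precision at each correct retrieval position.
        cum_hits = np.cumsum(matches)
        precision = cum_hits / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    return float(np.mean(rank1_hits)), float(np.mean(aps))
```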
We evaluated our model using three datasets commonly used in pedestrian re-identification, namely Market-1501 [44], CUHK03 [45], and DukeMTMC-reID [46]. Market-1501 is an image dataset for pedestrian re-identification collected on the campus of Tsinghua University. It contains 32,217 images of 1501 pedestrians captured by six cameras (five high-definition cameras and one low-definition camera) and has become one of the most commonly used datasets in the field. CUHK03 is a large-scale pedestrian re-identification dataset collected by the Chinese University of Hong Kong (CUHK). It contains images of 1467 different individuals captured by five pairs of cameras on the CUHK campus, covering a variety of scenes and lighting conditions. We used the new training/testing protocol proposed in 2017, where the dataset is divided into 767 pedestrians for training and 700 pedestrians for testing. DukeMTMC-reID is an image dataset collected on the Duke University campus in 2017. It comprises more than 36,000 images covering 702 individuals with different identities, providing abundant data for algorithms. The training set contains 16,522 images of 702 pedestrians, and the test set contains 702 pedestrians plus 408 distractor pedestrians, totaling 17,661 images. The query set consists of one image of each of the 702 test pedestrians randomly selected from each camera, for a total of 2228 images.

4.2. Implementation Details

We uniformly resize the input images to 384 × 128 pixels, enlarge them to 105% of this width and height, and then perform data augmentation through random cropping back to 384 × 128 and random horizontal flipping with a probability of 0.5. The model is trained for 120 epochs on Market-1501 and 180 epochs on CUHK03, with a batch size of 48; each batch consists of eight identities with six samples per identity. The parameters are optimized using the Adam optimizer with a weight decay of $5 \times 10^{-4}$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. We implement the network in Python with the PyTorch library and pre-train the backbone model on the ImageNet dataset. We use λ to balance the overall loss of the entire network; in this paper, λ is set to 0.5.
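A minimal sketch of these preprocessing and optimizer settings is shown below; the normalization statistics are the usual ImageNet values, which is an assumption rather than something stated in the paper.

```python
# Minimal sketch of the training preprocessing pipeline and optimizer configuration.
import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((int(384 * 1.05), int(128 * 1.05))),   # enlarge to 105%
    transforms.RandomCrop((384, 128)),                        # random crop back to 384x128
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],          # assumed ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def build_optimizer(model):
    # lr here is the post-warm-up base learning rate from the schedule above.
    return torch.optim.Adam(model.parameters(), lr=6e-4,
                            betas=(0.9, 0.999), weight_decay=5e-4)
```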

4.3. Comparison with State-of-the-Art Methods

In Table 1, we compare our network with other methods. Our model achieved very impressive results on Market-1501 and CUHK03, both in terms of Rank-1 accuracy and mAP. Regarding the results for Market-1501, as shown in Table 1, our method achieves the best mAP and a comparable Rank-1 accuracy among all the state-of-the-art competitors. Among them, TransReID [47] and AAformer [48] are ViT-based methods. Our method achieves a Rank-1 accuracy 1.3% higher than that of NFormer [49], which reflects the superiority of our method. Finally, we provide a visualization of the retrieval results in Figure 4, showing whether the top-K retrieved images are correct matches and thereby reflecting the confidence of the model.

4.4. Ablation Studies

Finally, in order to demonstrate the validity of each branch and its contribution to the network, we evaluated the benefit of introducing each branch individually, as shown in Table 2.
We conducted ablation experiments on the various branches of our network on three different datasets, and the results are shown in Table 2. It is worth noting that the baseline is the network with the attention branch and the local branch removed, namely the global branch alone. It is evident that using all branches together leads to a noticeable improvement in the network's performance. When training with only a single branch, each branch has a different effect on different datasets. For example, on the Market-1501 dataset, using only the partial branch yields the best results, whereas, on CUHK03-D, using only the partial branch is less effective than using only the global branch. It can therefore be anticipated that jointly using these two branches might yield good results. As expected, the results in the lower half of the table show that the combined use of the global and partial branches leads to improvements across all three datasets, surpassing the performance of any single branch.
In addition, we conducted experiments on the λ parameter in the loss function. Varying values of λ have different effects on the training of the network, as illustrated in Figure 5. To compare these effects, we trained the network with three different values: 0.3, 0.5, and 0.7. The experimental results show that, on the CUHK03 dataset, the impact of λ on performance is the most significant, and the best results are achieved when λ is set to 0.3. On the Market1501 dataset, however, the opposite is true, and the best results are obtained when λ is set to 0.7. Therefore, as a compromise between the two datasets, we choose λ = 0.5 as our training parameter.

5. Discussion

In this paper, we propose an attention-driven, global-local joint multi-branch network architecture. The different branches are designed to capture the specific types of features we desire, resulting in a complementary approach. We conducted extensive experiments and obtained strong results that support the superiority of our network. Therefore, we believe that considering person features from multiple perspectives and designing multiple branches to extract features of different scales are effective for person re-identification.
However, it should be noted that the proposed method has certain limitations. On the one hand, as a multi-branch network, it has relatively high requirements for computing resources and may not operate efficiently on low-end hardware, which limits its range of application to some extent. On the other hand, the training data target conventional person re-identification, and the model is rather dependent on the scale and quality of this data; if the data are biased or insufficient, the recognition results may suffer in person re-identification tasks under special scenarios.
Based on the experiments and analyses in this paper, there is still much work to be done. Firstly, we will continuously optimize the network architecture. We will attempt to reduce the dependence on computing resources, simplify the training process, and improve the training efficiency and generalization ability of the model by means of improving the algorithm structure and introducing new technical means. Secondly, in order to further enhance the generalization ability of the model, we will collect more image data of pedestrians in different scenarios and of different types to enrich the diversity of data, enabling the model to learn more comprehensive features and thus strengthening the pedestrian re-identification ability under various complex circumstances.

Author Contributions

Conceptualization, J.Y. and X.W.; methodology, X.W.; software, X.W.; validation, J.Y. and X.W.; formal analysis, X.W.; investigation, X.W.; resources, J.Y.; data curation, X.W.; writing—original draft preparation, X.W.; writing—review and editing, J.Y. and X.W.; visualization, J.Y.; supervision, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Science and Technology Innovation Key R&D Program of Chongqing CSTB2023TIAD-STX0015.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, H.; Yao, M.; Chen, Y.; Xu, Y.; Liu, H.; Jia, W.; Fu, X.; Wang, Y. Manifold-based Incomplete Multi-view Clustering via Bi-Consistency Guidance. IEEE Trans. Multimed. 2024, 26, 10001–10014. [Google Scholar] [CrossRef]
  2. Wang, Y.; Peng, J.; Wang, H.; Wang, M. Progressive learning with multi-scale attention network for cross-domain vehicle re-identification. Sci. China Inf. Sci. 2022, 65, 160103. [Google Scholar] [CrossRef]
  3. Wang, X. Intelligent multi-camera video surveillance: A review. Pattern Recognit. Lett. 2013, 34, 3–19. [Google Scholar] [CrossRef]
  4. Loy, C.C.; Xiang, T.; Gong, S. Multi-camera activity correlation analysis. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1988–1995. [Google Scholar]
  5. Wu, Q.; Dai, P.; Chen, J.; Lin, C.W.; Wu, Y.; Huang, F.; Zhong, B.; Ji, R. Discover cross-modality nuances for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 4330–4339. [Google Scholar]
  6. Zhao, Z.; Liu, B.; Chu, Q.; Lu, Y.; Yu, N. Joint color-irrelevant consistency learning and identity-aware modality adaptation for visible-infrared cross modality person re-identification. Proc. AAAI Conf. Artif. Intell. 2021, 35, 3520–3528. [Google Scholar] [CrossRef]
  7. Lin, Y.; Zheng, L.; Zheng, Z.; Wu, Y.; Hu, Z.; Yan, C.; Yang, Y. Improving person re-identification by attribute and identity learning. Pattern Recognit. 2019, 95, 151–161. [Google Scholar] [CrossRef]
  8. Huang, H.; Chen, X.; Huang, K. Human parsing based alignment with multi-task learning for occluded person re-identification. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar]
  9. Liu, J.; Ni, B.; Yan, Y.; Zhou, P.; Cheng, S.; Hu, J. Pose transferrable person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4099–4108. [Google Scholar]
  10. Peng, J.; Jiang, G.; Wang, H. Adaptive Memorization with Group Labels for Unsupervised Person Re-identification. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 5802–5813. [Google Scholar] [CrossRef]
  11. Cai, B.; Wang, H.; Yao, M.; Fu, X. Focus More on What? Guiding Multi-Task Training for End-to-End Person Search. IEEE Trans. Circuits Syst. Video Technol. 2025. [Google Scholar] [CrossRef]
  12. Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; Sun, J. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6449–6458. [Google Scholar]
  13. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 274–282. [Google Scholar]
  14. Yao, M.; Wang, H.; Chen, Y.; Fu, X. Between/Within View Information Completing for Tensorial Incomplete Multi-view Clustering. IEEE Trans. Multimed. 2024, 27, 1538–1550. [Google Scholar] [CrossRef]
  15. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
  16. Zheng, F.; Deng, C.; Sun, X.; Jiang, X.; Guo, X.; Yu, Z.; Huang, F.; Ji, R. Pyramidal person re-identification via multi-loss dynamic training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8514–8522. [Google Scholar]
  17. Qi, G.; Hu, G.; Wang, X.; Mazur, N.; Zhu, Z.; Haner, M. EXAM: A framework of learning extreme and moderate embeddings for person re-ID. J. Imaging 2021, 7, 6. [Google Scholar] [CrossRef]
  18. Herzog, F.; Ji, X.; Teepe, T.; Hörmann, S.; Gilg, J.; Rigoll, G. Lightweight multi-branch network for person re-identification. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1129–1133. [Google Scholar]
  19. Zheng, L.; Yang, Y.; Hauptmann, A.G. Person re-identification: Past, present and future. arXiv 2016, arXiv:1610.02984. [Google Scholar]
  20. Li, D.; Chen, X.; Zhang, Z.; Huang, K. Learning deep context-aware features over body and latent parts for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 384–393. [Google Scholar]
  21. Zhao, L.; Li, X.; Zhuang, Y.; Wang, J. Deeply-learned part-aligned representations for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3219–3228. [Google Scholar]
  22. Chen, Y.; Wang, H.; Sun, X.; Fan, B.; Tang, C.; Zeng, H. Deep attention aware feature learning for person re-identification. Pattern Recognit. 2022, 126, 108567. [Google Scholar] [CrossRef]
  23. Gheissari, N.; Sebastian, T.B.; Hartley, R. Person reidentification using spatiotemporal appearance. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 2, pp. 1528–1535. [Google Scholar]
  24. Prosser, B.J.; Zheng, W.S.; Gong, S.; Xiang, T.; Mary, Q. Person re-identification by support vector ranking. In Proceedings of the BMVC, Aberystwyth, UK, 31 August–3 September 2010; Volume 2, p. 6. [Google Scholar]
  25. Wang, H.; Yao, M.; Jiang, G.; Mi, Z.; Fu, X. Graph-Collaborated Auto-Encoder Hashing for Multiview Binary Clustering. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 10121–10133. [Google Scholar] [CrossRef] [PubMed]
  26. Wang, H.; Jiang, G.; Peng, J.; Deng, R.; Fu, X. Towards adaptive consensus graph: Multi-view clustering via graph collaboration. IEEE Trans. Multimed. 2022, 25, 6629–6641. [Google Scholar] [CrossRef]
  27. Liu, X.; Zhao, H.; Tian, M.; Sheng, L.; Shao, J.; Yi, S.; Yan, J.; Wang, X. Hydraplus-net: Attentive deep features for pedestrian analysis. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 350–359. [Google Scholar]
  28. Jiang, G.; Peng, J.; Wang, H.; Mi, Z.; Fu, X. Tensorial multi-view clustering via low-rank constrained high-order graph learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5307–5318. [Google Scholar] [CrossRef]
  29. Park, H.; Ham, B. Relation network for person re-identification. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11839–11847. [Google Scholar] [CrossRef]
  30. Yamada, I.; Asai, A.; Shindo, H.; Takeda, H.; Matsumoto, Y. Luke: Deep contextualized entity representations with entity-aware self-attention. arXiv 2020, arXiv:2010.01057. [Google Scholar]
  31. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667. [Google Scholar]
  32. Li, W.; Zhu, X.; Gong, S. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2285–2294. [Google Scholar]
  33. Yang, F.; Yan, K.; Lu, S.; Jia, H.; Xie, X.; Gao, W. Attention driven person re-identification. Pattern Recognit. 2019, 86, 143–155. [Google Scholar] [CrossRef]
  34. Chen, B.; Deng, W.; Hu, J. Mixed high-order attention network for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 371–381. [Google Scholar]
  35. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3186–3195. [Google Scholar]
  36. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3702–3712. [Google Scholar]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  38. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  39. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737. [Google Scholar]
  40. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  41. Wang, X.; Han, X.; Huang, W.; Dong, D.; Scott, M.R. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5022–5030. [Google Scholar]
  42. Buolamwini, J.; Gebru, T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; Friedler, S.A., Wilson, C., Eds.; Volume 81, pp. 77–91. [Google Scholar]
  43. Datta, A.; Swamidass, S.J. Fair-Net: A network architecture for reducing performance disparity between identifiable sub-populations. arXiv 2021, arXiv:2106.00720. [Google Scholar]
  44. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124. [Google Scholar]
  45. Li, W.; Zhao, R.; Xiao, T.; Wang, X. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 152–159. [Google Scholar]
  46. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3754–3762. [Google Scholar]
  47. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15013–15022. [Google Scholar]
  48. Zhu, K.; Guo, H.; Zhang, S.; Wang, Y.; Liu, J.; Wang, J.; Tang, M. Aaformer: Auto-aligned transformer for person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 17307–17317. [Google Scholar] [CrossRef]
  49. Wang, H.; Shen, J.; Liu, Y.; Gao, Y.; Gavves, E. Nformer: Robust person re-identification with neighbor transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7297–7307. [Google Scholar]
  50. Zhang, X.; Luo, H.; Fan, X.; Xiang, W.; Sun, Y.; Xiao, Q.; Jiang, W.; Zhang, C.; Sun, J. Alignedreid: Surpassing human-level performance in person re-identification. arXiv 2017, arXiv:1711.08184. [Google Scholar]
  51. Quan, R.; Dong, X.; Wu, Y.; Zhu, L.; Yang, Y. Auto-reid: Searching for a part-aware convnet for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3750–3759. [Google Scholar]
Figure 1. Overall architecture. In the figure, the first dashed box on the left-hand side of the image contains three non-weight-sharing blocks.
Figure 2. Feature amplification module. In the figure, the green vector G represents the output feature obtained. The ⊕ symbol indicates vector addition, and the ⊗ symbol indicates matrix multiplication.
Figure 3. Cross fusion module. Here, the ⊕ symbol indicates vector addition, and the ⊗ symbol indicates the cross attention mechanism.
Figure 4. Retrieval results of our proposed network on the Market1501 dataset. The left column in the image is the person’s image to be queried, while the right column shows the top ten search results. The images in the green boxes indicate a correct match, while the images in the red boxes indicate an incorrect match.
Figure 5. The influence of the value of λ on network training; the dots represent the values adopted in this experiment.
Table 1. Comparison of the experimental results. The table shows the results of multiple methods on the different datasets, where results in bold underlined fonts are the best results, and results in bold-only fonts are the second best results.
Methods             Market1501        CUHK03-D          DukeMTMC-reID
                    Rank-1    mAP     Rank-1    mAP     Rank-1    mAP
AlignedReID [50]    91.8      79.1    61.5      59.6    82.1      69.7
PCB [15]            92.4      77.3    61.3      54.2    81.9      65.3
HA-CNN [32]         91.2      75.7    41.7      38.6    80.5      63.8
Auto-ReID [51]      94.5      85.1    73.3      69.3    -         -
MHN [34]            95.1      85.0    71.7      65.4    89.1      77.2
OSNet [36]          94.8      84.9    72.3      67.8    88.6      73.5
DAAF-BoT [22]       95.1      87.9    64.9      63.1    87.9      77.9
GCP [29]            95.2      88.9    74.4      69.6    89.7      78.6
MGN [13]            95.7      86.9    66.8      66.0    88.7      78.4
NFormer [49]        94.7      91.1    79.0      76.4    90.6      85.7
TransReID [47]      95.2      88.9    75.1      72.9    90.7      82.0
AAformer [48]       95.4      88.0    78.1      77.2    90.1      80.9
Ours                96.0      90.2    82.6      79.7    90.8      81.9
Table 2. The table shows the ablation experiments we performed for the individual branches. In the table, "Global" represents the baseline network.
Branch              Market-1501       CUHK03-D          DukeMTMC-reID
                    Rank-1    mAP     Rank-1    mAP     Rank-1    mAP
Global (G)          94.7      87.6    78.0      74.7    88.4      74.8
Part (P)            95.1      87.3    73.8      70.6    88.7      77.9
Attention (A)       92.8      84.5    71.5      67.2    85.0      71.7
G and P             95.8      90.1    81.6      79.1    90.6      80.8
G and A             94.6      88.3    78.5      75.2    88.7      78.0
P and A             95.0      88.6    75.3      73.4    89.5      79.3
All                 96.0      90.2    82.6      79.7    90.8      81.9
