1. Introduction
Person Re-identification (Re-ID) aims at cross-camera pedestrian image retrieval: given a query image of a specific pedestrian, the goal is to accurately identify and retrieve images of the same person from a large gallery captured by multiple non-overlapping cameras. In practical applications, Re-ID can track and recognize the movement trajectories of the same pedestrian across different camera views and time points, and it is therefore widely applied in fields such as video surveillance, intelligent security, and pedestrian behavior analysis [1,2,3].
In the field of person re-identification, researchers have adopted various methods to enhance model performance. CNN-based improvements such as OSNet [4] optimize the network structure to reduce complexity, but they are limited by the volume of training data and the constraints of feature representation, which often leads to overfitting. To mitigate this, methods such as Auto-ReID [5] and CDNet [6] leverage Neural Architecture Search to discover more compact and effective model architectures. Additionally, OfM [7] employs a data selection strategy that prioritizes more generalizable data during training, enhancing model performance. However, these methods still face challenges on large-scale datasets.
On the other hand, methods incorporating prior knowledge, such as PCB [8] and SAN [9], enhance local feature representation by partitioning feature maps and extracting features locally. MGN [10] further adopts a multigranularity feature partitioning scheme to improve model expressiveness. However, these methods typically increase model complexity. To simplify the model, methods like CBDB-Net [11] use a dual-branch structure to extract global and local features separately and improve robustness through feature dropout strategies.
Furthermore, attention-based methods, including ABDNet [12] and HOReID [13], introduce attention modules to expand the model’s receptive field and extract more discriminative features. CAL [14] combines attention mechanisms with counterfactual learning to further improve prediction accuracy. Recently, with the widespread application of Transformers in computer vision, methods such as PAT [15] and DRL-Net [16] have emerged in person re-identification, leveraging the strengths of both CNNs and Transformers to further boost performance. However, these methods also bring challenges in terms of computational cost and parameter tuning.
Due to the swift advancements in deep learning technology, the processing of multimodal information has emerged as a focal area of research in computer vision and natural language processing. In this context, how to efficiently fuse visual and textual features to extract more discriminative pedestrian characteristics for achieving higher-level semantic understanding and reasoning has become an important and urgent research topic.
CLIP, proposed by Radford et al. [17], is a groundbreaking vision–language pre-training model trained on a vast number of image–text pairs. It maps images and text into a shared latent space, enabling the model to understand and correlate information from the two modalities, which allows CLIP to perform tasks such as cross-modal retrieval, classification, and generation. Despite its groundbreaking nature, CLIP also has limitations. First, although CLIP excels at understanding and correlating information from images and text, it may struggle with nuanced or context-specific details, because it is trained on a vast and diverse dataset that does not always capture the subtleties of specific contexts or domains. Second, CLIP’s performance depends on the quality and relevance of the image–text pairs used for training: if the pairs are poorly aligned or the text descriptions are inaccurate, the model may not learn effective representations for certain concepts or objects. CoOp, proposed by Zhou et al. [18], optimizes CLIP’s text prompts with a focus on improving zero-shot classification. By generating learnable text prompts and optimizing them, CoOp better adapts the text description to downstream tasks, enabling CLIP to accurately classify new categories without additional training data. CLIP-Reid, proposed by Li et al. [19], combines CLIP’s cross-modal contrastive learning ability with CoOp’s learnable text prompt mechanism. By integrating the advantages of both, it provides a more precise and flexible method for matching images and text, achieving significant performance improvements in person re-identification. However, there is room for improvement: CLIP-Reid primarily uses text information to optimize the image encoder and does not perform a deep fusion of image and text features. To address this limitation and fully exploit the potential connections between image and text features, we introduce the cross-attention mechanism of the Transformer decoder. This mechanism enables the model to dynamically attend to and reference relevant text features when processing image features, effectively fusing the two types of features.
Specifically, we build upon the CLIP-Reid framework by incorporating a Transformer decoder module that includes a cross-attention layer. During training, the image encoder extracts features from images while the text encoder extracts features from text, and both sets of features are fed into the Transformer decoder module. In the cross-attention layer, the model calculates attention weights between image features and text features and, based on these weights, fuses the two types of features. In this way, we generate a joint feature representation that incorporates both image and text information. This joint feature is then combined with the text features through a residual connection, and the result is used to optimize the image encoder.
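The following minimal PyTorch sketch illustrates this fusion step as described above; the module name, feature dimensions, and token counts are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the fusion step: text features act as queries and attend to
# image features via cross-attention, and the result is added back to the text
# features through a residual connection. Names and shapes are illustrative.
class TextImageFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, N_text, dim) -- queries
        # image_feats: (B, N_img, dim)  -- keys and values
        fused, _ = self.cross_attn(
            query=self.norm(text_feats), key=image_feats, value=image_feats
        )
        return text_feats + fused  # residual connection preserves the text prior


# Example usage with random tensors
joint = TextImageFusion()(torch.randn(2, 1, 512), torch.randn(2, 196, 512))
```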
During our experiments, we observed that the inherent limitations of the CLIP model significantly affected re-identification performance. Specifically, the model extracted pedestrian features at too coarse a granularity, making it difficult to accurately capture discriminative identity information, which in turn degraded the joint feature fusion. To mitigate this issue, we adopted the Multistage Channel Spatial Feature Aggregation (MSCSA) module proposed by Wang et al. [20]. We found that this module significantly enhances the model’s sensitivity to pedestrian features and its recognition accuracy. It not only helps the model capture pedestrian identity features in greater detail but also facilitates the effective transmission of these features across different levels, thereby generating text that is better suited for model learning in the first stage and substantially improving the quality and robustness of the joint features in the second stage.
In conclusion, the primary contributions of this paper are summarized as follows:
Inspired by the CLIP-Reid model, this paper proposes an innovative person re-identification method that fuses visual and textual information.
A Multistage Channel Spatial Feature Aggregation (MSCSA) module is introduced into the image encoder to generate more accurate and rich representations of pedestrian features. By leveraging the cross-attention mechanism of the Transformer decoder, the textual features can dynamically guide the visual features, enhancing the model’s ability to locate and recognize the target pedestrian.
Our method has achieved competitive performance on four public datasets, namely MSMT17, Market1501, DukeMTMC, and Occluded-Duke, fully demonstrating its effectiveness and generalization capability.
2. Related Work
In the approach of manually crafting text, researchers can flexibly adjust the description content and style to fit the specific needs of different application scenarios, thereby enhancing the generalization ability and accuracy of models in practical applications. For instance, Yang et al. [21] leveraged readily available diffusion models and image caption generation models to construct a comprehensive image–text dataset. Through careful analysis of the generated text, they predefined an attribute space matching the Market-1501 attribute dataset and automatically annotated 27 different types of attributes for each image–text pair using text keyword matching. They then introduced the APTM framework, which jointly learns attribute and text matching, significantly improving model performance through the complementary training of attribute recognition and text-based person retrieval tasks. On the other hand, Zuo et al. [22] manually annotated a novel ultra-fine-grained text description dataset specifically for text-based person retrieval and proposed a test set that faithfully reflects changes in complex scenarios. They designed an efficient algorithm, CFAM, for ultra-fine-grained text-based person retrieval, achieving finer feature mining by introducing a shared cross-modal granularity decoder and a hard negative matching mechanism. Meanwhile, Zuo et al. [23] employed specific text templates to guide the synthesis of image–text pairs. These templates not only ensured that the generated text descriptions were structured and consistent but also captured subtle characteristics of people, thereby helping the model learn better person representations in subsequent pre-training and downstream tasks. Based on this, they proposed a language–image pre-training framework named PLIP, significantly enhancing the model’s joint understanding of image and text information through three pre-training tasks: semantically fused image colorization, visually fused attribute prediction, and vision–language matching. However, creating high-quality text datasets requires substantial human, material, and financial resources. Moreover, annotators often vary in expertise and annotation standards; some may lack a clear understanding of the purpose and norms of the annotation task, leading to uneven annotation quality. In addition, reviewing and proofreading annotation results is challenging, further increasing data uncertainty.
Yan et al. [24] proposed CFine, a fine-grained information mining framework based on CLIP, which uses multilevel global feature learning modules to mine discriminative details shared across modalities and designs cross-granularity feature refinement and fine-grained correspondence discovery modules to establish cross-modal coarse- and fine-grained correspondences. Ren et al. [25] proposed a novel human–object interaction (HOI) detection method that jointly explores interaction contextual cues from both visual and textual perspectives, within and across triplets. Ye et al. [26] conducted an in-depth analysis of closed-world ReID from three major aspects, summarized open-world ReID from five aspects, and designed a powerful AGW baseline. Gong et al. [27] proposed a lightweight image-angle cross-attention module to improve detection performance in complex scenes. In unsupervised learning, Wang et al. [28] proposed a method for unsupervised person re-identification that obtains global and local features through a dual-branch structure and uses an adaptive information supplementation method based on the k-nearest-neighbor algorithm together with an adaptive foreground enhancement module to improve feature robustness, reduce label noise, and improve pseudo-label accuracy. Tang et al. [29] proposed an unsupervised person re-identification method called MLJT, which refines the training process by predicting multiple pseudo labels, thereby improving the accuracy of person re-identification in multicamera systems.
3. Materials and Methods
3.1. CLIP-Reid
CLIP-Reid is an image–text cross-modal person re-identification model whose core consists of two encoders: an image encoder and a text encoder. The image encoder employs a Convolutional Neural Network (CNN) such as ResNet-50 or a Transformer model such as ViT-B/16, and maps image data into feature vectors within a cross-modal embedding space. The text encoder, built on the Transformer architecture, encodes the learned “fuzzy” text descriptions into the same embedding space.
The workflow of CLIP-Reid is divided into two stages. In the first stage, the model learns “fuzzy” text descriptions for each unique ID using ID-specific learnable tokens. These descriptions follow the template “A photo of a [X][X]…[X] person”, where each [X] represents a learnable text token. During this stage, the pre-trained weights of the image and text encoders remain fixed, and only these learnable tokens are trained.
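As a rough sketch of how such ID-specific learnable tokens can be realized in PyTorch (the class name, token count, and embedding dimension below are illustrative assumptions, not the exact CLIP-Reid implementation):

```python
import torch
import torch.nn as nn

# ID-specific learnable context tokens spliced into the frozen prompt embedding
# "A photo of a [X]...[X] person". Only `ctx` is trained in the first stage.
class IDPrompt(nn.Module):
    def __init__(self, num_ids, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(0.02 * torch.randn(num_ids, n_ctx, dim))

    def forward(self, id_labels, prefix_emb, suffix_emb):
        # prefix_emb: embeddings of "A photo of a"      -> (1, n_prefix, dim)
        # suffix_emb: embeddings of "person." + padding -> (1, n_suffix, dim)
        ctx = self.ctx[id_labels]                        # (B, n_ctx, dim)
        b = ctx.size(0)
        return torch.cat(
            [prefix_emb.expand(b, -1, -1), ctx, suffix_emb.expand(b, -1, -1)], dim=1
        )  # fed to the frozen CLIP text encoder
```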
Given the characteristics of the ReID task, which include relatively small dataset sizes, extremely high requirements for fine-grained appearance features, and the potential presence of multiple image samples of the same ID within a batch, CLIP-Reid makes specific adjustments to the text-to-image loss function. Specifically, for each text token index, CLIP-Reid calculates the cross-entropy with all positive sample images in the batch and takes the average as the final loss. The image-to-text loss function remains unchanged.
The two losses are

$$\mathcal{L}_{i2t}(i) = -\log \frac{\exp\big(s(V_i, T_{y_i})\big)}{\sum_{a=1}^{B} \exp\big(s(V_i, T_{y_a})\big)}, \qquad
\mathcal{L}_{t2i}(y_i) = \frac{-1}{|P(y_i)|} \sum_{p \in P(y_i)} \log \frac{\exp\big(s(V_p, T_{y_i})\big)}{\sum_{a=1}^{B} \exp\big(s(V_a, T_{y_i})\big)},$$

where $s(\cdot,\cdot)$ is the similarity between visual features and textual features, $P(y_i) = \{p \in \{1, \dots, B\} \mid y_p = y_i\}$ is the set of indices for all positive instances of class $y_i$ within the batch of size $B$, and $|P(y_i)|$ denotes its cardinality.
The overall loss function for the first stage is jointly composed of these components:

$$\mathcal{L}_{stage1} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}.$$
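A compact sketch of these two stage-one losses is given below; it assumes a similarity matrix `sim[i, j] = s(V_i, T_{y_j})` over a batch of size B and identity labels `labels`, and is only an illustration of the equations above rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

# Illustration of the stage-one losses: a standard image-to-text contrastive
# loss, and a text-to-image loss averaged over all positives of each identity.
def stage_one_losses(sim, labels):
    B = sim.size(0)
    idx = torch.arange(B, device=sim.device)

    # image-to-text: each image is matched to the text of its own identity
    loss_i2t = F.cross_entropy(sim, idx)

    # text-to-image: average the log-probability over all images in the batch
    # that share the anchor text's identity
    logp_t2i = F.log_softmax(sim.t(), dim=1)                    # rows: texts
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # (B, B) positives
    loss_t2i = -(logp_t2i * pos).sum(dim=1) / pos.sum(dim=1)
    return loss_i2t + loss_t2i.mean()
```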
In the second stage, only the parameters of the image encoder are updated. To achieve better performance, CLIP-Reid introduces the triplet loss and the ID loss, and adopts the label smoothing strategy, as is standard for the ReID task:

$$\mathcal{L}_{id} = \sum_{k=1}^{N} -q_k \log\big(\mathrm{softmax}(z_k)\big), \qquad
q_k = \begin{cases} 1 - \dfrac{N-1}{N}\,\varepsilon, & k = y,\\[4pt] \dfrac{\varepsilon}{N}, & k \neq y, \end{cases}$$

$$\mathcal{L}_{tri} = \max\big(d_p - d_n + \alpha,\; 0\big),$$

where $q_k$ in the target distribution balances between the true class $y$ (weighted by $1 - \frac{N-1}{N}\varepsilon$) and a uniform distribution ($\frac{\varepsilon}{N}$), $\varepsilon$ is a smoothing factor, $z_k$ are the logits for class $k$ predictions, $d_p$ and $d_n$ measure feature distances for positive and negative pairs, respectively, and $\alpha$ is the margin in the triplet loss $\mathcal{L}_{tri}$, enforcing a minimum distance gap between positive and negative pairs.
CLIP-Reid takes advantage of CLIP’s multimodal capabilities by employing the text features obtained during the first stage to compute the cross-entropy loss between images and text. However, unlike in the first stage, the image-to-text (i2t) loss in the second stage is adjusted to use the same label-smoothed target distribution:

$$\mathcal{L}_{i2t}(i) = \sum_{k=1}^{N} -q_k \log \frac{\exp\big(s(V_i, T_{y_k})\big)}{\sum_{a=1}^{N} \exp\big(s(V_i, T_{y_a})\big)}.$$

Ultimately, the overall loss function for the second stage is jointly composed of these components:

$$\mathcal{L}_{stage2} = \mathcal{L}_{id} + \mathcal{L}_{tri} + \mathcal{L}_{i2t}.$$
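The sketch below illustrates these stage-two objectives; it uses PyTorch's built-in label smoothing as a close stand-in for the smoothed target distribution above, together with a batch-hard triplet loss, so details may differ from the exact CLIP-Reid implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative stage-two objectives: smoothed ID loss, batch-hard triplet loss,
# and the smoothed image-to-text loss. Hyperparameter values are illustrative.
def stage_two_losses(logits, img_text_sim, feats, labels, eps=0.1, margin=0.3):
    # ID loss with label smoothing (close stand-in for the q_k distribution above)
    loss_id = F.cross_entropy(logits, labels, label_smoothing=eps)

    # batch-hard triplet loss on pairwise Euclidean distances
    dist = torch.cdist(feats, feats)                                  # (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    d_pos = (dist * same.float()).max(dim=1).values                   # hardest positive
    d_neg = dist.masked_fill(same, float("inf")).min(dim=1).values    # hardest negative
    loss_tri = F.relu(d_pos - d_neg + margin).mean()

    # image-to-text loss against the ID-specific text features, also smoothed
    loss_i2t = F.cross_entropy(img_text_sim, labels, label_smoothing=eps)
    return loss_id + loss_tri + loss_i2t
```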
3.2. Overall Framework
Inspired by the CLIP-Reid model, this paper proposes an innovative fusion-modal person re-identification method based on visual–text matching technology. By improving upon the CLIP-Reid model, we achieve precise guidance and efficient fusion of text with images. This paper carefully designs two training stages to enhance the fusion effect of visual–text matching.
In the first stage, we keep the parameters of the text encoder and image encoder fixed while introducing a multistage channel spatial feature aggregation (MSCSA) module within the image encoder. This step aims to generate more accurate and learnable text descriptions and to train them effectively, as illustrated in Figure 1 (Stage 1).
In the second stage, we utilize the image encoder to extract image features and combine these features with the text descriptions trained in the first stage. This combination is performed within a Transformer decoder, where the cross-attention mechanism models the complex interactions between vision and language. During this process, we also employ residual connections to update the text features, which are then used to further optimize the image encoder, as shown in Figure 1 (Stage 2).
3.3. Multistage Channel Spatial Aggregation Module
A multistage feature aggregation module focusing on both channel and spatial aspects is employed to enhance the extraction of diverse and richer representations of channel and spatial features from different network stages. As illustrated in Figure 2, this module primarily considers two source features derived from the channel–spatial aggregation blocks situated at various stages of the backbone network: the low-level feature map $f_l \in \mathbb{R}^{C \times H \times W}$ before the stage and the high-level feature map $f_h \in \mathbb{R}^{C' \times H' \times W'}$ after the stage, where $C$ denotes the number of channels, $W$ represents the feature width, and $H$ stands for the feature height. The feature aggregation process is primarily achieved by applying a self-attention mechanism, which can accurately capture and integrate feature information from different levels.
Firstly, three 1 × 1 convolutional layers, $W_q$, $W_k$, and $W_v$, are employed to convert the feature $f$ into three compact embeddings: $Q = W_q(f)$, $K = W_k(f)$, and $V = W_v(f)$. Subsequently, we determine the channel similarity matrix $A_c$ by performing matrix multiplication followed by a softmax operation:

$$A_c = \mathrm{softmax}\big(Q K^{\top}\big).$$
Multilevel feature aggregation at the channel level is achieved by restoring the channel dimension through matrix multiplication of $A_c$ and $V$. Subsequently, another 1 × 1 convolutional layer $W_c$ is applied to transform the size of the resulting feature map to match the size of $f_h$. Finally, we obtain the output $f_c$ by adding $f_h$ to it through matrix addition:

$$f_c = W_c\big(A_c V\big) + f_h.$$
Subsequently, the channel-aggregated high-level feature map $f_c$, acquired through the aforementioned procedure, and the low-level feature map $f_l$ are utilized to carry out spatial feature fusion, analogous to the process of multilevel feature aggregation at the channel level. Finally, the output we obtain is as follows:

$$f_{out} = W_{s2}\big(A_s\, W_{s1}(f_c)\big) + f_l,$$

where $W_{s1}$ and $W_{s2}$ represent two convolutional layers with a kernel size of 1 × 1, while $A_s$ denotes the matrix that captures spatial similarity.
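To make the channel-level step concrete, the sketch below implements a simplified version of the aggregation under the notation above; the actual MSCSA block of [20] may differ in details such as how the low- and high-level maps enter the attention computation and where strides are applied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified channel-level aggregation: 1x1 convolutions produce compact
# embeddings, a channel similarity matrix reweights them, and a final 1x1
# convolution restores the channel dimension before a residual addition.
class ChannelAggregation(nn.Module):
    def __init__(self, channels, embed=64):
        super().__init__()
        self.wq = nn.Conv2d(channels, embed, kernel_size=1)
        self.wk = nn.Conv2d(channels, embed, kernel_size=1)
        self.wv = nn.Conv2d(channels, embed, kernel_size=1)
        self.wo = nn.Conv2d(embed, channels, kernel_size=1)

    def forward(self, f):
        b, c, h, w = f.shape
        q = self.wq(f).flatten(2)                        # (B, E, HW)
        k = self.wk(f).flatten(2)                        # (B, E, HW)
        v = self.wv(f).flatten(2)                        # (B, E, HW)
        attn = F.softmax(q @ k.transpose(1, 2), dim=-1)  # (B, E, E) channel similarity
        out = self.wo((attn @ v).view(b, -1, h, w))      # restore channel dimension
        return out + f                                   # residual addition
```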
3.4. Text-Guided Image Module
We feed both visual features and text features into the Transformer decoder, leveraging its built-in cross-attention mechanism to achieve a deep integration between them. During this fusion process, the Transformer decoder can flexibly adjust the importance weights of different regions of the visual features based on the key information in the text features, thereby precisely localizing the target person.
We use the text features $t$ generated in the first stage by the CoOp-style prompt learning as the query of the Transformer decoder, and we select the visual features $v$ from the third stage of ResNet-50 as the key and value of the decoder.
Specifically, we take the text features trained in the first stage and feed them into the Transformer decoder. Within the decoder, the features first pass through a normalization layer to ensure the stability and consistency of the data. The normalized features then enter the self-attention module to integrate and focus the internal information, and the output is added back to the input through a residual connection, which retains the original information while introducing the new features learned by self-attention. The combined features pass through a second normalization layer in preparation for the subsequent cross-attention operation, and the normalized result serves as the query (Q) of the cross-attention. At the same time, the features extracted from the third stage of ResNet-50 and aggregated by the MSCSA module are used as the keys (K) and values (V) of the cross-attention. This cross-attention mechanism realizes the information interaction and fusion between text features and visual features. The output of the cross-attention is again combined with its input through a residual connection and fed into a third normalization layer. This series of normalization, attention, and residual operations together constitutes an effective feature extraction and fusion pipeline. Finally, the output of the third normalization layer enters a multilayer perceptron (MLP) for further processing, and the MLP output is connected through a residual path with the text features learned in the first stage, forming a multimodal feature that combines textual and visual information. This multimodal feature is used to optimize the image encoder, improving the performance and accuracy of the whole system.
This design aims to enable the text features to precisely capture the most relevant visual cues. Drawing inspiration from DenseCLIP [30], we subsequently update the text features through a residual connection:

$$t' = t + \gamma\, t_d,$$

where $t_d$ denotes the output of the Transformer decoder and $\gamma$ serves as an adjustable parameter that governs the magnitude of the residual component. During the initialization phase, we set $\gamma$ to a very small value to ensure that the linguistic prior information in the text features is preserved to the greatest extent possible.
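The following sketch summarizes the text-guided image module described above in PyTorch: pre-norm decoder layers with self-attention on the text features, cross-attention from the text query to the MSCSA-aggregated visual tokens, an MLP, and the $\gamma$-scaled residual update. The layer count, dimensions, and the initial value of $\gamma$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the text-guided image (TGI) module: pre-norm decoder layers and a
# gamma-scaled residual update of the text features. Sizes are illustrative.
class TGILayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text, visual):
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]              # self-attention + residual
        t = self.norm2(text)
        text = text + self.cross_attn(t, visual, visual)[0]   # text queries visual tokens
        return text + self.mlp(self.norm3(text))              # MLP + residual


class TextGuidedImage(nn.Module):
    def __init__(self, dim=256, heads=4, depth=3):
        super().__init__()
        self.layers = nn.ModuleList([TGILayer(dim, heads) for _ in range(depth)])
        self.gamma = nn.Parameter(torch.tensor(1e-4))  # small init preserves the text prior

    def forward(self, text, visual):
        out = text
        for layer in self.layers:
            out = layer(out, visual)
        return text + self.gamma * out                 # residual update of the text features
```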
4. Experimental Results and Analysis
4.1. Dataset and Evaluation Metrics
In line with common practice, we assess our method on four datasets: MSMT17 [31], Market-1501 [32], DukeMTMC-reID [33], and Occluded-Duke [34]. Table 1 offers detailed information about these datasets. To assess the model’s performance, we employ the Cumulative Matching Characteristic (CMC) at Rank-1 (R1) and the mean Average Precision (mAP) metrics.
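For reference, a simplified sketch of how Rank-1 and mAP can be computed from a query–gallery distance matrix is shown below; the standard ReID protocol additionally filters out same-camera matches, which is omitted here for brevity.

```python
import numpy as np

# Simplified Rank-1 (CMC) and mAP computation from a distance matrix.
# dist: (num_query, num_gallery); q_ids, g_ids: identity label arrays.
def rank1_and_map(dist, q_ids, g_ids):
    order = np.argsort(dist, axis=1)                 # gallery sorted per query
    matches = g_ids[order] == q_ids[:, None]         # boolean match matrix

    rank1 = matches[:, 0].mean()                     # top-1 accuracy

    aps = []
    for row in matches:
        hits = np.where(row)[0]                      # ranks of correct matches
        if hits.size == 0:
            continue
        precision = (np.arange(hits.size) + 1) / (hits + 1)
        aps.append(precision.mean())                 # average precision per query
    return float(rank1), float(np.mean(aps))
```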
4.2. Experimental Setup
In the experiments, we used two NVIDIA RTX A6000 GPUs and developed our code with the PyTorch 1.12.0 + cu111 deep learning framework. We inserted the MSCSA module after each of the first three stages of ResNet-50. It is worth noting that, in the channel attention mechanism, the convolution stride of the first module is set to 1, while the stride of the second and third modules is set to 2. In the second stage, the residual scaling parameter $\gamma$ was initialized to a very small value, as described in Section 3.4. We deployed a three-layer Transformer decoder, in which the number of heads in the multihead attention mechanism is 4 and the length of the feature vector is set to 256. Regarding other training details, we kept them consistent with CLIP-ReID to ensure uniformity and comparability in the experimental process. We followed the training strategies, parameter settings, and hyperparameter adjustments of CLIP-ReID so that our improved method could be evaluated and compared on a stable and reliable benchmark. This approach helps us more accurately assess the effectiveness of the proposed method and make fair comparisons with other state-of-the-art methods.
4.3. Analysis of Comparative Experimental Results
To ascertain the efficacy of the approach introduced in this paper, we compared the experimental results of our method with mainstream person re-identification algorithms on four public datasets. The results are shown in Table 2. The superscript asterisk (*) indicates that the input images were resized to a larger resolution.
The table clearly shows that the proposed method outperforms the baseline network on all four public datasets. Specifically, our method exceeds the baseline by 5.4%, 2.7%, 2.6%, and 9.2% in mAP, and by 4.3%, 1.7%, 2.7%, and 11.8% in Rank-1, respectively. This demonstrates that the pedestrian text descriptions generated by the proposed fusion-modality method help the network extract more comprehensive pedestrian information, thereby improving the recognition rate. PromptSG constructs a parameterized inversion network that maps the global embedding of the CLIP visual space to a pseudo token in the text space and integrates it into natural language sentences; through a symmetric supervised contrastive loss, the pseudo token accurately conveys image context and identity details. Although the Rank-1 and mAP metrics of our method do not fully reach the level of such methods on the Market-1501 and MSMT17 datasets, our method achieves the best results on the DukeMTMC dataset. Our aim in this paper is to address specific problems rather than to pursue state-of-the-art (SOTA) performance at all costs.
4.4. Ablation Experiment
To ascertain the efficacy of the Multistage Channel Spatial Feature Aggregation (MSCSA) module and the Text-Guided Image (TGI) module proposed in this paper, we designed a range of comparative experiments on the MSMT17 person re-identification dataset. The specific experimental results are shown in Table 3. The results indicate that, when only the TGI module was introduced without the MSCSA module, the performance of the model did not improve as expected but instead showed a slight decrease.
Specifically, when the TGI module was applied to the CLIP-Reid model alone, the model’s mAP decreased by 0.4% compared to CLIP-Reid. This result may be attributed to the fact that ResNet-50 extracts pedestrian features at too coarse a granularity, making it difficult for the TGI module to accurately capture the discriminative information of pedestrian identity and to fully play its role in guiding the model to learn more discriminative image features.
Detailed data in the table show that when the MSCSA module was applied alone, the model’s mAP and Rank-1 accuracy significantly improved compared to the CLIP-Reid model, with increases of 2.3% and 1.7%, respectively. This result fully demonstrates the effectiveness of the MSCSA module in enhancing the model’s sensitivity to pedestrian features and recognition accuracy, enabling the model to more accurately capture and utilize subtle discriminative information of pedestrian identity.
Furthermore, when both the MSCSA module and the TGI module were introduced simultaneously, the model’s mAP and Rank-1 accuracy again achieved significant improvements compared to the CLIP-Reid model, with increases of 3.1% and 2.0%, respectively. This result not only verifies the role of the TGI module in steering the model towards acquiring more distinctive image features for learning but also demonstrates the powerful efficacy of the two modules working together. By promoting effective fusion and transmission of image and text features at different levels, these two modules jointly contributed to a leap in the performance of the person re-identification task.
We attribute this phenomenon to the fact that CLIP performs only coarse-grained image–text alignment. Using CLIP to extract image features for generating text features therefore introduces noise, which causes the subsequent text-guided image step to attend to background noise and degrades the performance of the model. Furthermore, although the Transformer has excellent global attention capabilities, using the Transformer decoder alone inevitably leads the model to spread its attention broadly across the entire image. From the subsequent visualization analysis, it can be clearly observed that this wide range of attention makes the model susceptible to interference from noise or irrelevant information in the image, which negatively affects its performance and leads to an overall decline. We therefore introduced the MSCSA module and embedded it after each of the first three stages of the ResNet-50 network. The core purpose of this module is to achieve effective fusion of feature information at different levels by adopting a multistage feature aggregation strategy. This design not only greatly enhances the network’s ability to capture fine-grained features but also significantly improves the model’s generalization, enabling the generation of more accurate pedestrian text features. In the second stage, with the help of the more accurate text features produced by the MSCSA module, we can more precisely guide the extraction of image features, thereby obtaining superior multimodal fusion features and improving the performance of the model.
In summary, the results of comparative experiments conducted on the MSMT17 dataset of the MSCSA module and the TGI module proposed in this paper fully prove their effectiveness and necessity. The introduction of these two modules not only improves the recognition accuracy of the model but also brings fresh ideas and directions to the research on person re-identification tasks.
Optimizing the number of aggregation modules can significantly enhance model performance. To investigate the effect of applying the MSCSA module at different stages of ResNet-50, we conducted ablation experiments on the MSMT17 person re-identification dataset; the results are shown in Table 4. The experimental data clearly indicate that, with the introduction of multistage channel spatial aggregated features, the image features become richer, effectively mitigating the interference of noise on the model’s results, thereby producing more precise text features and gradually improving model performance. However, adding more than three modules leads to a decrease in performance. We believe that in the first, second, and third stages the model is already able to capture and utilize key features effectively; when the module is introduced in the fourth stage, the extracted information becomes redundant with that from the first three stages, affecting the final performance. Based on these findings, the model with MSCSA modules added in the first three stages exhibits the best performance. Therefore, unless otherwise specified, we use three MSCSA modules in the model to ensure high performance while avoiding the risk of information redundancy.
The specific parameters of the model are shown in Table 5. Although the integration of the MSCSA and TGI modules increases the complexity of our system and the parameter count rises to 100.70 M, we still maintain excellent inference speed: the inference time for a single image is only about 7.7 ms, which fully meets the requirements of real-time applications.
4.5. Visualization
To vividly showcase the benefits of our method, we randomly selected twelve images from the MSMT17 dataset and conducted an in-depth visualization analysis using Grad-CAM [39] heatmaps. The results are presented in Figure 3. In this figure, column (a) shows the original images, column (b) presents the heatmaps generated by the CLIP-Reid model, and column (c) displays the heatmaps produced by our method.
The intensity of colors in the heatmaps visually reflects the degree of attention the model pays to different image regions: the darker the color, the higher the model’s attention to that region, and the more significant its importance in constructing the final feature representation. By comparing the heatmaps in column (b) with those in column (c) in depth, we can see that the CLIP-Reid model focuses more on local regions during pedestrian recognition, such as the pedestrian’s face and limbs, while neglecting other key details of the pedestrian to some extent. In contrast, our model demonstrates superior performance in pedestrian recognition. It can more accurately focus on the core feature regions of the pedestrian, including the face, body contour, and backpack, thereby comprehensively capturing the pedestrian’s identity features.
It is particularly noteworthy that in images (a02), (a03), (a08), and (a10), the CLIP-Reid model fails to fully notice the important feature of the backpack, while our model can accurately identify and focus on the backpack, further highlighting its advantage in capturing pedestrian features. The presentation of these visualization results not only strongly proves the significant effect of our method in enhancing the model’s ability to capture key information but also fully demonstrates its excellent performance in improving pedestrian re-identification performance.
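For completeness, a minimal Grad-CAM sketch of the kind used to produce such heatmaps is shown below; the choice of target layer, the scoring function, and the hook-based implementation are illustrative assumptions rather than the exact toolchain used in our experiments.

```python
import torch

# Minimal Grad-CAM: weight a convolutional layer's activations by the spatially
# averaged gradients of a score, then ReLU and normalize to obtain a heatmap.
def grad_cam(model, target_layer, image, score_fn):
    store = {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: store.update(act=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

    score = score_fn(model(image))       # e.g. similarity to a query feature
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = store["grad"].mean(dim=(2, 3), keepdim=True)      # GAP over H, W
    cam = torch.relu((weights * store["act"]).sum(dim=1))       # (B, H, W)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-6)     # scale to [0, 1]
    return cam.detach()
```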
In Figure 4, we provide a visualization example on the Occluded-Duke dataset. Through careful observation, we can clearly see that in complex scenes where vehicles partially occlude pedestrians, our model, compared to CLIP-Reid, exhibits more refined pedestrian feature capture and presents more complete and accurate pedestrian contours. This comparison highlights the potential of our model in dealing with occlusion problems.