1. Introduction
Person Re-identification (Re-ID) aims at cross-camera pedestrian image retrieval: given a query image of a specific pedestrian, the goal is to accurately identify and retrieve images of the same person from a large gallery captured by multiple non-overlapping cameras. In practical applications, Re-ID can track and recognize the movement trajectories of the same pedestrian across different camera views and time points, and it is therefore widely applied in fields such as video surveillance, intelligent security, and pedestrian behavior analysis [1,2,3].
In the field of person re-identification, researchers have adopted various methods to enhance model performance. CNN-based improvements such as OSNet [4] optimize the network structure to reduce complexity, but they are limited by the volume of training data and the constraints of feature representation, which often leads to overfitting. To mitigate this, methods such as Auto-ReID [5] and CDNet [6] leverage Neural Architecture Search to discover more compact and effective model architectures. Additionally, OfM [7] employs a data selection strategy that prioritizes more generalizable data during training, enhancing model performance. However, these methods still face challenges on large-scale datasets.
On the other hand, methods incorporating prior knowledge, such as PCB [8] and SAN [9], enhance local feature representation by partitioning feature maps and extracting features locally. MGN [10] further adopts a multigranularity feature partitioning scheme to improve model expressiveness. However, these methods typically increase model complexity. To simplify the model, methods like CBDB-Net [11] use a dual-branch structure to extract global and local features separately and improve robustness through feature dropout strategies.
Furthermore, attention-based methods, including ABDNet [12] and HOReID [13], introduce attention modules to expand the model’s receptive field and extract more discriminative features. CAL [14] combines attention mechanisms with counterfactual learning to further improve prediction accuracy. Recently, with the widespread application of Transformers in computer vision, methods such as PAT [15] and DRL-Net [16] have emerged in person re-identification, leveraging the strengths of both CNNs and Transformers to further boost performance. However, these methods also bring challenges in terms of computational cost and parameter tuning.
Due to the swift advancements in deep learning technology, the processing of multimodal information has emerged as a focal area of research in computer vision and natural language processing. In this context, how to efficiently fuse visual and textual features to extract more discriminative pedestrian characteristics for achieving higher-level semantic understanding and reasoning has become an important and urgent research topic.
CLIP, proposed by Radford et al. [17], is a groundbreaking vision–language pre-training model trained on a vast number of image–text pairs. It maps images and text into a shared latent space, enabling the model to understand and correlate information from the two modalities, which allows CLIP to perform tasks such as cross-modal retrieval, classification, and generation. Despite its groundbreaking nature, CLIP also has limitations. First, although CLIP excels at understanding and correlating information from images and text, it may struggle with nuanced or context-specific details, because it is trained on a vast and diverse dataset that does not always capture the subtleties of specific contexts or domains. Second, CLIP’s performance depends on the quality and relevance of the image–text pairs used for training: if the pairs are poorly aligned or the text descriptions are inaccurate, the model may not learn effective representations for certain concepts or objects. CoOp, proposed by Zhou et al. [18], optimizes CLIP’s text prompts with a focus on improving zero-shot classification. By generating learnable text prompts and optimizing them, CoOp better adapts the text description to downstream tasks, enabling CLIP to accurately classify new categories without additional training data. CLIP-Reid, proposed by Li et al. [19], combines CLIP’s cross-modal contrastive learning ability with CoOp’s learnable text prompt mechanism. By integrating the advantages of both, it provides a more precise and flexible method for matching images and text, achieving significant performance improvements in person re-identification. However, there is room for improvement: CLIP-Reid primarily uses text information to optimize the image encoder and does not perform a deep fusion of image and text features. To address this limitation and fully exploit the potential connections between image and text features, we introduce the cross-attention mechanism of the Transformer decoder. This mechanism enables the model to dynamically attend to and reference relevant text features when processing image features, effectively fusing the two types of features.
Specifically, we build upon the CLIP-Reid framework by incorporating a Transformer decoder module that includes a cross-attention layer. During training, the image encoder extracts features from images while the text encoder extracts features from text, and both sets of features are fed into the Transformer decoder module. In the cross-attention layer, the model calculates attention weights between image features and text features and, based on these weights, fuses the two types of features. In this way, we generate a joint feature representation that incorporates both image and text information. This joint feature is then combined with the text features through a residual connection, and the result is used to optimize the image encoder.
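The following minimal PyTorch sketch illustrates this fusion step as described above; the module name, feature dimensions, and token counts are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the fusion step: text features act as queries and attend to
# image features via cross-attention, and the result is added back to the text
# features through a residual connection. Names and shapes are illustrative.
class TextImageFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, N_text, dim) -- queries
        # image_feats: (B, N_img, dim)  -- keys and values
        fused, _ = self.cross_attn(
            query=self.norm(text_feats), key=image_feats, value=image_feats
        )
        return text_feats + fused  # residual connection preserves the text prior


# Example usage with random tensors
joint = TextImageFusion()(torch.randn(2, 1, 512), torch.randn(2, 196, 512))
```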
During our experiments, we observed that the inherent limitations of the CLIP model significantly affected re-identification performance. Specifically, the model extracted pedestrian features at too coarse a granularity, making it difficult to accurately capture discriminative identity information, which in turn degraded the joint feature fusion. To mitigate this issue, we adopted the Multistage Channel Spatial Feature Aggregation (MSCSA) module proposed by Wang et al. [20]. We found that this module significantly enhances the model’s sensitivity to pedestrian features and its recognition accuracy. It not only helps the model capture pedestrian identity features in greater detail but also facilitates the effective transmission of these features across different levels, thereby generating text that is better suited for model learning in the first stage and substantially improving the quality and robustness of the joint features in the second stage.
In conclusion, the primary contributions of this paper are summarized as follows:
Inspired by the CLIP-Reid model, this paper proposes an innovative person re-identification method that fuses visual and textual information.
A Multistage Channel Spatial Feature Aggregation (MSCSA) module is introduced into the image encoder to generate more accurate and rich representations of pedestrian features. By leveraging the cross-attention mechanism of the Transformer decoder, the textual features can dynamically guide the visual features, enhancing the model’s ability to locate and recognize the target pedestrian.
Our method has achieved competitive performance on four public datasets, namely MSMT17, Market1501, DukeMTMC, and Occluded-Duke, fully demonstrating its effectiveness and generalization capability.
2. Related Work
In the approach of manually crafting text, researchers can flexibly adjust the description content and style to fit the specific needs of different application scenarios, thereby enhancing the generalization ability and accuracy of models in practical applications. For instance, Yang et al. [21] leveraged readily available diffusion models and image caption generation models to construct a comprehensive image–text dataset. Through careful analysis of the generated text, they predefined an attribute space matching the Market-1501 attribute dataset and automatically annotated 27 different types of attributes for each image–text pair using text keyword matching. They then introduced the APTM framework, which jointly learns attribute and text matching, significantly improving model performance through the complementary training of attribute recognition and text-based person retrieval tasks. On the other hand, Zuo et al. [22] manually annotated a novel ultra-fine-grained text description dataset specifically for text-based person retrieval and proposed a test set that faithfully reflects changes in complex scenarios. They designed an efficient algorithm, CFAM, for ultra-fine-grained text-based person retrieval, achieving finer feature mining by introducing a shared cross-modal granularity decoder and a hard negative matching mechanism. Meanwhile, Zuo et al. [23] employed specific text templates to guide the synthesis of image–text pairs. These templates not only ensured that the generated text descriptions were structured and consistent but also captured subtle characteristics of people, thereby helping the model learn better person representations in subsequent pre-training and downstream tasks. Based on this, they proposed a language–image pre-training framework named PLIP, significantly enhancing the model’s joint understanding of image and text information through three pre-training tasks: semantically fused image colorization, visually fused attribute prediction, and vision–language matching. However, creating high-quality text datasets requires substantial human, material, and financial resources. Moreover, annotators often vary in expertise and annotation standards; some may lack a clear understanding of the purpose and norms of the annotation task, leading to uneven annotation quality. In addition, reviewing and proofreading annotation results is challenging, further increasing data uncertainty.
Yan et al. [24] proposed CFine, a fine-grained information mining framework based on CLIP, which uses multilevel global feature learning modules to mine discriminative details shared across modalities and designs cross-granularity feature refinement and fine-grained correspondence discovery modules to establish cross-modal coarse- and fine-grained correspondences. Ren et al. [25] proposed a novel human–object interaction (HOI) detection method that jointly explores interaction contextual cues from both visual and textual perspectives, within and across triplets. Ye et al. [26] conducted an in-depth analysis of closed-world ReID from three major aspects, summarized open-world ReID from five aspects, and designed a powerful AGW baseline. Gong et al. [27] proposed a lightweight image-angle cross-attention module to improve detection performance in complex scenes. In unsupervised learning, Wang et al. [28] proposed a method for unsupervised person re-identification that obtains global and local features through a dual-branch structure and uses an adaptive information supplementation method based on the k-nearest-neighbor algorithm together with an adaptive foreground enhancement module to improve feature robustness, reduce label noise, and improve pseudo-label accuracy. Tang et al. [29] proposed an unsupervised person re-identification method called MLJT, which refines the training process by predicting multiple pseudo labels, thereby improving the accuracy of person re-identification in multicamera systems.
3. Materials and Methods
3.1. CLIP-Reid
CLIP-Reid is an image–text cross-modal person re-identification model whose core consists of two encoders: an image encoder and a text encoder. The image encoder employs a Convolutional Neural Network (CNN) such as ResNet-50 or a Transformer model such as ViT-B/16, and maps image data into feature vectors within a cross-modal embedding space. The text encoder, built on the Transformer architecture, encodes the learned “fuzzy” text descriptions into the same embedding space.
The workflow of CLIP-Reid is divided into two stages. In the first stage, the model learns “fuzzy” text descriptions for each unique ID using ID-specific learnable tokens. These descriptions follow the template “A photo of a [X][X]…[X] person”, where each [X] represents a learnable text token. During this stage, the pre-trained weights of the image and text encoders remain fixed, and only these learnable tokens are trained.
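As a rough sketch of how such ID-specific learnable tokens can be realized in PyTorch (the class name, token count, and embedding dimension below are illustrative assumptions, not the exact CLIP-Reid implementation):

```python
import torch
import torch.nn as nn

# ID-specific learnable context tokens spliced into the frozen prompt embedding
# "A photo of a [X]...[X] person". Only `ctx` is trained in the first stage.
class IDPrompt(nn.Module):
    def __init__(self, num_ids, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(0.02 * torch.randn(num_ids, n_ctx, dim))

    def forward(self, id_labels, prefix_emb, suffix_emb):
        # prefix_emb: embeddings of "A photo of a"      -> (1, n_prefix, dim)
        # suffix_emb: embeddings of "person." + padding -> (1, n_suffix, dim)
        ctx = self.ctx[id_labels]                        # (B, n_ctx, dim)
        b = ctx.size(0)
        return torch.cat(
            [prefix_emb.expand(b, -1, -1), ctx, suffix_emb.expand(b, -1, -1)], dim=1
        )  # fed to the frozen CLIP text encoder
```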
Given the characteristics of the ReID task, which include relatively small dataset sizes, extremely high requirements for fine-grained appearance features, and the potential presence of multiple image samples of the same ID within a batch, CLIP-Reid makes specific adjustments to the text-to-image loss function. Specifically, for each text token index, CLIP-Reid calculates the cross-entropy with all positive sample images in the batch and takes the average as the final loss. The image-to-text loss function remains unchanged.
The two losses are

$$\mathcal{L}_{i2t}(i) = -\log \frac{\exp\big(s(V_i, T_{y_i})\big)}{\sum_{a=1}^{B} \exp\big(s(V_i, T_{y_a})\big)}, \qquad
\mathcal{L}_{t2i}(y_i) = \frac{-1}{|P(y_i)|} \sum_{p \in P(y_i)} \log \frac{\exp\big(s(V_p, T_{y_i})\big)}{\sum_{a=1}^{B} \exp\big(s(V_a, T_{y_i})\big)},$$

where $s(\cdot,\cdot)$ is the similarity between visual features and textual features, $P(y_i) = \{p \in \{1, \dots, B\} \mid y_p = y_i\}$ is the set of indices for all positive instances of class $y_i$ within the batch of size $B$, and $|P(y_i)|$ denotes its cardinality.
The overall loss function for the first stage is jointly composed of these components:

$$\mathcal{L}_{stage1} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}.$$
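A compact sketch of these two stage-one losses is given below; it assumes a similarity matrix `sim[i, j] = s(V_i, T_{y_j})` over a batch of size B and identity labels `labels`, and is only an illustration of the equations above rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

# Illustration of the stage-one losses: a standard image-to-text contrastive
# loss, and a text-to-image loss averaged over all positives of each identity.
def stage_one_losses(sim, labels):
    B = sim.size(0)
    idx = torch.arange(B, device=sim.device)

    # image-to-text: each image is matched to the text of its own identity
    loss_i2t = F.cross_entropy(sim, idx)

    # text-to-image: average the log-probability over all images in the batch
    # that share the anchor text's identity
    logp_t2i = F.log_softmax(sim.t(), dim=1)                    # rows: texts
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # (B, B) positives
    loss_t2i = -(logp_t2i * pos).sum(dim=1) / pos.sum(dim=1)
    return loss_i2t + loss_t2i.mean()
```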
In the second stage, only the parameters of the image encoder are updated. To achieve better performance, CLIP-Reid introduces the triplet loss and the ID loss, and adopts the label smoothing strategy, as is standard for the ReID task:

$$\mathcal{L}_{id} = \sum_{k=1}^{N} -q_k \log\big(\mathrm{softmax}(z_k)\big), \qquad
q_k = \begin{cases} 1 - \dfrac{N-1}{N}\,\varepsilon, & k = y,\\[4pt] \dfrac{\varepsilon}{N}, & k \neq y, \end{cases}$$

$$\mathcal{L}_{tri} = \max\big(d_p - d_n + \alpha,\; 0\big),$$

where $q_k$ in the target distribution balances between the true class $y$ (weighted by $1 - \frac{N-1}{N}\varepsilon$) and a uniform distribution ($\frac{\varepsilon}{N}$), $\varepsilon$ is a smoothing factor, $z_k$ are the logits for class $k$ predictions, $d_p$ and $d_n$ measure feature distances for positive and negative pairs, respectively, and $\alpha$ is the margin in the triplet loss $\mathcal{L}_{tri}$, enforcing a minimum distance gap between positive and negative pairs.
CLIP-Reid takes advantage of CLIP’s multimodal capabilities by employing the text features obtained during the first stage to compute the cross-entropy loss between images and text. However, unlike in the first stage, the image-to-text (i2t) loss in the second stage is adjusted to use the same label-smoothed target distribution:

$$\mathcal{L}_{i2t}(i) = \sum_{k=1}^{N} -q_k \log \frac{\exp\big(s(V_i, T_{y_k})\big)}{\sum_{a=1}^{N} \exp\big(s(V_i, T_{y_a})\big)}.$$

Ultimately, the overall loss function for the second stage is jointly composed of these components:

$$\mathcal{L}_{stage2} = \mathcal{L}_{id} + \mathcal{L}_{tri} + \mathcal{L}_{i2t}.$$
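The sketch below illustrates these stage-two objectives; it uses PyTorch's built-in label smoothing as a close stand-in for the smoothed target distribution above, together with a batch-hard triplet loss, so details may differ from the exact CLIP-Reid implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative stage-two objectives: smoothed ID loss, batch-hard triplet loss,
# and the smoothed image-to-text loss. Hyperparameter values are illustrative.
def stage_two_losses(logits, img_text_sim, feats, labels, eps=0.1, margin=0.3):
    # ID loss with label smoothing (close stand-in for the q_k distribution above)
    loss_id = F.cross_entropy(logits, labels, label_smoothing=eps)

    # batch-hard triplet loss on pairwise Euclidean distances
    dist = torch.cdist(feats, feats)                                  # (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    d_pos = (dist * same.float()).max(dim=1).values                   # hardest positive
    d_neg = dist.masked_fill(same, float("inf")).min(dim=1).values    # hardest negative
    loss_tri = F.relu(d_pos - d_neg + margin).mean()

    # image-to-text loss against the ID-specific text features, also smoothed
    loss_i2t = F.cross_entropy(img_text_sim, labels, label_smoothing=eps)
    return loss_id + loss_tri + loss_i2t
```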
3.2. Overall Framework
Inspired by the CLIP-Reid model, this paper proposes an innovative fusion-modal person re-identification method based on visual–text matching technology. By improving upon the CLIP-Reid model, we achieve precise guidance and efficient fusion of text with images. This paper carefully designs two training stages to enhance the fusion effect of visual–text matching.
In the first stage, we keep the parameters of the text encoder and image encoder fixed while introducing a multistage channel spatial feature aggregation (MSCSA) module within the image encoder. This step aims to generate more accurate and learnable text descriptions and to train them effectively, as illustrated in Figure 1 (Stage 1).
In the second stage, we utilize the image encoder to extract image features and combine these features with the text descriptions trained in the first stage. This combination is performed within a Transformer decoder, where the cross-attention mechanism models the complex interactions between vision and language. During this process, we also employ residual connections to update the text features, which are then used to further optimize the image encoder, as shown in Figure 1 (Stage 2).
3.3. Multistage Channel Spatial Aggregation Module
A multistage feature aggregation module focusing on both channel and spatial aspects is employed to enhance the extraction of diverse and richer representations of channel and spatial features from different network stages. As illustrated in Figure 2, this module primarily considers two source features derived from the channel–spatial aggregation blocks situated at various stages of the backbone network: the low-level feature map $f_l \in \mathbb{R}^{C \times H \times W}$ before the stage and the high-level feature map $f_h \in \mathbb{R}^{C' \times H' \times W'}$ after the stage, where $C$ denotes the number of channels, $W$ represents the feature width, and $H$ stands for the feature height. The feature aggregation process is primarily achieved by applying a self-attention mechanism, which can accurately capture and integrate feature information from different levels.
Firstly, three 1 × 1 convolutional layers, $W_q$, $W_k$, and $W_v$, are employed to convert the feature $f$ into three compact embeddings: $Q = W_q(f)$, $K = W_k(f)$, and $V = W_v(f)$. Subsequently, we determine the channel similarity matrix $A_c$ by performing matrix multiplication followed by a softmax operation:

$$A_c = \mathrm{softmax}\big(Q K^{\top}\big).$$
Multilevel feature aggregation at the channel level is achieved by restoring the channel dimension through matrix multiplication of $A_c$ and $V$. Subsequently, another 1 × 1 convolutional layer $W_c$ is applied to transform the size of the resulting feature map to match the size of $f_h$. Finally, we obtain the output $f_c$ by adding $f_h$ to it through matrix addition:

$$f_c = W_c\big(A_c V\big) + f_h.$$
Subsequently, the channel-aggregated high-level feature map $f_c$, acquired through the aforementioned procedure, and the low-level feature map $f_l$ are utilized to carry out spatial feature fusion, analogous to the process of multilevel feature aggregation at the channel level. Finally, the output we obtain is as follows:

$$f_{out} = W_{s2}\big(A_s\, W_{s1}(f_c)\big) + f_l,$$

where $W_{s1}$ and $W_{s2}$ represent two convolutional layers with a kernel size of 1 × 1, while $A_s$ denotes the matrix that captures spatial similarity.
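To make the channel-level step concrete, the sketch below implements a simplified version of the aggregation under the notation above; the actual MSCSA block of [20] may differ in details such as how the low- and high-level maps enter the attention computation and where strides are applied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified channel-level aggregation: 1x1 convolutions produce compact
# embeddings, a channel similarity matrix reweights them, and a final 1x1
# convolution restores the channel dimension before a residual addition.
class ChannelAggregation(nn.Module):
    def __init__(self, channels, embed=64):
        super().__init__()
        self.wq = nn.Conv2d(channels, embed, kernel_size=1)
        self.wk = nn.Conv2d(channels, embed, kernel_size=1)
        self.wv = nn.Conv2d(channels, embed, kernel_size=1)
        self.wo = nn.Conv2d(embed, channels, kernel_size=1)

    def forward(self, f):
        b, c, h, w = f.shape
        q = self.wq(f).flatten(2)                        # (B, E, HW)
        k = self.wk(f).flatten(2)                        # (B, E, HW)
        v = self.wv(f).flatten(2)                        # (B, E, HW)
        attn = F.softmax(q @ k.transpose(1, 2), dim=-1)  # (B, E, E) channel similarity
        out = self.wo((attn @ v).view(b, -1, h, w))      # restore channel dimension
        return out + f                                   # residual addition
```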
3.4. Text-Guided Image Module
We feed both visual features and text features into the Transformer decoder, leveraging its built-in cross-attention mechanism to achieve a deep integration between them. During this fusion process, the Transformer decoder can flexibly adjust the importance weights of different regions of the visual features based on the key information in the text features, thereby precisely localizing the target person.
We use the text features $t$ generated in the first stage by the CoOp-style prompt learning as the query of the Transformer decoder, and we select the visual features $v$ from the third stage of ResNet-50 as the key and value of the decoder.
Specifically, we take the text features trained in the first stage and feed them into the Transformer decoder. Within the decoder, the features first pass through a normalization layer to ensure the stability and consistency of the data. The normalized features then enter the self-attention module to integrate and focus the internal information, and the output is added back to the input through a residual connection, which retains the original information while introducing the new features learned by self-attention. The combined features pass through a second normalization layer in preparation for the subsequent cross-attention operation, and the normalized result serves as the query (Q) of the cross-attention. At the same time, the features extracted from the third stage of ResNet-50 and aggregated by the MSCSA module are used as the keys (K) and values (V) of the cross-attention. This cross-attention mechanism realizes the information interaction and fusion between text features and visual features. The output of the cross-attention is again combined with its input through a residual connection and fed into a third normalization layer. This series of normalization, attention, and residual operations together constitutes an effective feature extraction and fusion pipeline. Finally, the output of the third normalization layer enters a multilayer perceptron (MLP) for further processing, and the MLP output is connected through a residual path with the text features learned in the first stage, forming a multimodal feature that combines textual and visual information. This multimodal feature is used to optimize the image encoder, improving the performance and accuracy of the whole system.
This design aims to enable the text features to precisely capture the most relevant visual cues. Drawing inspiration from DenseCLIP [30], we subsequently update the text features through a residual connection:

$$t' = t + \gamma\, t_d,$$

where $t_d$ denotes the output of the Transformer decoder and $\gamma$ serves as an adjustable parameter that governs the magnitude of the residual component. During the initialization phase, we set $\gamma$ to a very small value to ensure that the linguistic prior information in the text features is preserved to the greatest extent possible.
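The following sketch summarizes the text-guided image module described above in PyTorch: pre-norm decoder layers with self-attention on the text features, cross-attention from the text query to the MSCSA-aggregated visual tokens, an MLP, and the $\gamma$-scaled residual update. The layer count, dimensions, and the initial value of $\gamma$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the text-guided image (TGI) module: pre-norm decoder layers and a
# gamma-scaled residual update of the text features. Sizes are illustrative.
class TGILayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text, visual):
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]              # self-attention + residual
        t = self.norm2(text)
        text = text + self.cross_attn(t, visual, visual)[0]   # text queries visual tokens
        return text + self.mlp(self.norm3(text))              # MLP + residual


class TextGuidedImage(nn.Module):
    def __init__(self, dim=256, heads=4, depth=3):
        super().__init__()
        self.layers = nn.ModuleList([TGILayer(dim, heads) for _ in range(depth)])
        self.gamma = nn.Parameter(torch.tensor(1e-4))  # small init preserves the text prior

    def forward(self, text, visual):
        out = text
        for layer in self.layers:
            out = layer(out, visual)
        return text + self.gamma * out                 # residual update of the text features
```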
4. Experimental Results and Analysis
4.1. Dataset and Evaluation Metrics
In line with common practice, we assess our method on four datasets: MSMT17 [31], Market-1501 [32], DukeMTMC-reID [33], and Occluded-Duke [34]. Table 1 offers detailed information about these datasets. To assess the model’s performance, we employ the Cumulative Matching Characteristic (CMC) at Rank-1 (R1) and the mean Average Precision (mAP) metrics.
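For reference, a simplified sketch of how Rank-1 and mAP can be computed from a query–gallery distance matrix is shown below; the standard ReID protocol additionally filters out same-camera matches, which is omitted here for brevity.

```python
import numpy as np

# Simplified Rank-1 (CMC) and mAP computation from a distance matrix.
# dist: (num_query, num_gallery); q_ids, g_ids: identity label arrays.
def rank1_and_map(dist, q_ids, g_ids):
    order = np.argsort(dist, axis=1)                 # gallery sorted per query
    matches = g_ids[order] == q_ids[:, None]         # boolean match matrix

    rank1 = matches[:, 0].mean()                     # top-1 accuracy

    aps = []
    for row in matches:
        hits = np.where(row)[0]                      # ranks of correct matches
        if hits.size == 0:
            continue
        precision = (np.arange(hits.size) + 1) / (hits + 1)
        aps.append(precision.mean())                 # average precision per query
    return float(rank1), float(np.mean(aps))
```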
4.2. Experimental Setup
In the experiments, we used two NVIDIA RTX A6000 GPUs and developed our code with the PyTorch 1.12.0 + cu111 deep learning framework. We inserted the MSCSA module after each of the first three stages of ResNet-50. It is worth noting that, in the channel attention mechanism, the convolution stride of the first module is set to 1, while the stride of the second and third modules is set to 2. In the second stage, the residual scaling parameter $\gamma$ was initialized to a very small value, as described in Section 3.4. We deployed a three-layer Transformer decoder, in which the number of heads in the multihead attention mechanism is 4 and the length of the feature vector is set to 256. Regarding other training details, we kept them consistent with CLIP-ReID to ensure uniformity and comparability in the experimental process. We followed the training strategies, parameter settings, and hyperparameter adjustments of CLIP-ReID so that our improved method could be evaluated and compared on a stable and reliable benchmark. This approach helps us more accurately assess the effectiveness of the proposed method and make fair comparisons with other state-of-the-art methods.
4.3. Analysis of Comparative Experimental Results
To ascertain the efficacy of the approach introduced in this paper, we compared the experimental results of our method with mainstream person re-identification algorithms on four public datasets. The results are shown in Table 2. The superscript asterisk (*) indicates that the input images were resized to a larger resolution.
The table clearly shows that the proposed method outperforms the baseline network on all four public datasets. Specifically, our method exceeds the baseline by 5.4%, 2.7%, 2.6%, and 9.2% in mAP, and by 4.3%, 1.7%, 2.7%, and 11.8% in Rank-1, respectively. This demonstrates that the pedestrian text descriptions generated by the proposed fusion-modality method help the network extract more comprehensive pedestrian information, thereby improving the recognition rate. PromptSG constructs a parameterized inversion network that maps the global embedding of the CLIP visual space to a pseudo token in the text space and integrates it into natural language sentences; through a symmetric supervised contrastive loss, the pseudo token accurately conveys image context and identity details. Although the Rank-1 and mAP metrics of our method do not fully reach the level of such methods on the Market-1501 and MSMT17 datasets, our method achieves the best results on the DukeMTMC dataset. Our aim in this paper is to address specific problems rather than to pursue state-of-the-art (SOTA) performance at all costs.
4.4. Ablation Experiment
To ascertain the efficacy of the Multistage Channel Spatial Feature Aggregation (MSCSA) module and the Text-Guided Image (TGI) module proposed in this paper, we designed a range of comparative experiments on the MSMT17 person re-identification dataset. The specific experimental results are shown in Table 3. The results indicate that, when only the TGI module was introduced without the MSCSA module, the performance of the model did not improve as expected but instead showed a slight decrease.
Specifically, when the TGI module was applied to the CLIP-Reid model alone, the model’s mAP decreased by 0.4% compared to CLIP-Reid. This result may be attributed to the fact that ResNet-50 extracts pedestrian features at too coarse a granularity, making it difficult for the TGI module to accurately capture the discriminative information of pedestrian identity and to fully play its role in guiding the model to learn more discriminative image features.
Detailed data in the table show that when the MSCSA module was applied alone, the model’s mAP and Rank-1 accuracy significantly improved compared to the CLIP-Reid model, with increases of 2.3% and 1.7%, respectively. This result fully demonstrates the effectiveness of the MSCSA module in enhancing the model’s sensitivity to pedestrian features and recognition accuracy, enabling the model to more accurately capture and utilize subtle discriminative information of pedestrian identity.
Furthermore, when both the MSCSA module and the TGI module were introduced simultaneously, the model’s mAP and Rank-1 accuracy again achieved significant improvements compared to the CLIP-Reid model, with increases of 3.1% and 2.0%, respectively. This result not only verifies the role of the TGI module in steering the model towards acquiring more distinctive image features for learning but also demonstrates the powerful efficacy of the two modules working together. By promoting effective fusion and transmission of image and text features at different levels, these two modules jointly contributed to a leap in the performance of the person re-identification task.
We attribute this phenomenon to the fact that CLIP performs only coarse-grained image–text alignment. Using CLIP to extract image features for generating text features therefore introduces noise, which causes the subsequent text-guided image step to attend to background noise and degrades the performance of the model. Furthermore, although the Transformer has excellent global attention capabilities, using the Transformer decoder alone inevitably leads the model to spread its attention broadly across the entire image. From the subsequent visualization analysis, it can be clearly observed that this wide range of attention makes the model susceptible to interference from noise or irrelevant information in the image, which negatively affects its performance and leads to an overall decline. We therefore introduced the MSCSA module and embedded it after each of the first three stages of the ResNet-50 network. The core purpose of this module is to achieve effective fusion of feature information at different levels by adopting a multistage feature aggregation strategy. This design not only greatly enhances the network’s ability to capture fine-grained features but also significantly improves the model’s generalization, enabling the generation of more accurate pedestrian text features. In the second stage, with the help of the more accurate text features produced by the MSCSA module, we can more precisely guide the extraction of image features, thereby obtaining superior multimodal fusion features and improving the performance of the model.
In summary, the results of comparative experiments conducted on the MSMT17 dataset of the MSCSA module and the TGI module proposed in this paper fully prove their effectiveness and necessity. The introduction of these two modules not only improves the recognition accuracy of the model but also brings fresh ideas and directions to the research on person re-identification tasks.
Optimizing the number of aggregation modules can significantly enhance model performance. To investigate the effect of applying the MSCSA module at different stages of ResNet-50, we conducted ablation experiments on the MSMT17 person re-identification dataset; the results are shown in Table 4. The experimental data clearly indicate that, with the introduction of multistage channel spatial aggregated features, the image features become richer, effectively mitigating the interference of noise on the model’s results, thereby producing more precise text features and gradually improving model performance. However, adding more than three modules leads to a decrease in performance. We believe that in the first, second, and third stages the model is already able to capture and utilize key features effectively; when the module is introduced in the fourth stage, the extracted information becomes redundant with that from the first three stages, affecting the final performance. Based on these findings, the model with MSCSA modules added in the first three stages exhibits the best performance. Therefore, unless otherwise specified, we use three MSCSA modules in the model to ensure high performance while avoiding the risk of information redundancy.
The specific parameters of the model are shown in Table 5. Although the integration of the MSCSA and TGI modules increases the complexity of our system and the parameter count rises to 100.70 M, we still maintain excellent inference speed: the inference time for a single image is only about 7.7 ms, which fully meets the requirements of real-time applications.
4.5. Visualization
To vividly showcase the benefits of our method, we randomly selected twelve images from the MSMT17 dataset and conducted an in-depth visualization analysis using Grad-CAM [39] heatmaps. The results are presented in Figure 3. In this figure, column (a) shows the original images, column (b) presents the heatmaps generated by the CLIP-Reid model, and column (c) displays the heatmaps produced by our method.
The intensity of colors in the heatmaps visually reflects the degree of attention the model pays to different image regions: the darker the color, the higher the model’s attention to that region, and the more significant its importance in constructing the final feature representation. By comparing the heatmaps in column (b) with those in column (c) in depth, we can see that the CLIP-Reid model focuses more on local regions during pedestrian recognition, such as the pedestrian’s face and limbs, while neglecting other key details of the pedestrian to some extent. In contrast, our model demonstrates superior performance in pedestrian recognition. It can more accurately focus on the core feature regions of the pedestrian, including the face, body contour, and backpack, thereby comprehensively capturing the pedestrian’s identity features.
It is particularly noteworthy that in images (a02), (a03), (a08), and (a10), the CLIP-Reid model fails to fully notice the important feature of the backpack, while our model can accurately identify and focus on the backpack, further highlighting its advantage in capturing pedestrian features. The presentation of these visualization results not only strongly proves the significant effect of our method in enhancing the model’s ability to capture key information but also fully demonstrates its excellent performance in improving pedestrian re-identification performance.
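For completeness, a minimal Grad-CAM sketch of the kind used to produce such heatmaps is shown below; the choice of target layer, the scoring function, and the hook-based implementation are illustrative assumptions rather than the exact toolchain used in our experiments.

```python
import torch

# Minimal Grad-CAM: weight a convolutional layer's activations by the spatially
# averaged gradients of a score, then ReLU and normalize to obtain a heatmap.
def grad_cam(model, target_layer, image, score_fn):
    store = {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: store.update(act=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

    score = score_fn(model(image))       # e.g. similarity to a query feature
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = store["grad"].mean(dim=(2, 3), keepdim=True)      # GAP over H, W
    cam = torch.relu((weights * store["act"]).sum(dim=1))       # (B, H, W)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-6)     # scale to [0, 1]
    return cam.detach()
```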
In Figure 4, we provide a visualization example on the Occluded-Duke dataset. Through careful observation, we can clearly see that in complex scenes where vehicles partially occlude pedestrians, our model, compared to CLIP-Reid, exhibits more refined pedestrian feature capture and presents more complete and accurate pedestrian contours. This comparison highlights the potential of our model in dealing with occlusion problems.