1. Introduction
Person Re-Identification (Re-ID) [1], as a subtask of image retrieval, aims to identify and track specific individuals across non-overlapping camera views under varying perspectives, illumination conditions, and pose variations. This technology holds transformative potential for intelligent surveillance, public safety, and smart infrastructure systems, driving extensive research efforts in recent years [2,3,4]. Traditional Re-ID methodologies predominantly focus on unimodal frameworks leveraging visible-light imagery [5,6,7]. However, visible surveillance cameras suffer severe limitations in low-light and nocturnal environments, where captured images often lack the discriminative color and texture cues essential for accurate identification. In contrast, infrared cameras are inherently robust to ambient illumination, enabling high-fidelity person imaging even in total darkness. Modern surveillance systems increasingly integrate dual-mode sensing capabilities, seamlessly switching between RGB and infrared imaging modalities, thereby motivating the emergence of Visible-Infrared cross-modal person Re-Identification (VI-ReID) [8,9] to enable robust all-weather target recognition and monitoring.
VI-ReID aims to re-identify a target individual across non-overlapping camera views and modality shifts, enabling robust identification despite variations in illumination, pose, and sensor characteristics. It offers distinct advantages over traditional visible-spectrum methods, particularly in scenarios with poor lighting conditions. The complementary nature of visible and infrared cameras enhances system robustness and expands applicability across diverse environments. However, bridging the gap between these two modalities introduces significant challenges. Two primary challenges hinder VI-ReID performance: intra-person variations (e.g., pose, illumination, occlusion) and cross-modal discrepancies arising from the fundamentally distinct imaging mechanisms of the visible and infrared spectra. These discrepancies manifest as significant appearance inconsistencies for the same individual across modalities. For instance, visible-light imagery relies on ambient illumination to capture surface reflectance, while infrared imagery encodes thermal radiation patterns emitted by the human body. These differing mechanisms produce intrinsic differences in texture granularity, chromaticity, and contrast, rendering traditional unimodal feature extraction and matching methodologies ineffective for cross-modal generalization. Furthermore, infrared imagery suffers from impoverished textural and chromatic cues, compounded by vulnerability to environmental noise. These limitations severely impede feature alignment and discriminative representation learning, necessitating innovative solutions to bridge the modality gap.
To address the aforementioned issues, early efforts focused on designing dual-branch architectures that extract features from each modality separately before performing cross-modal fusion. One representative method, AGW [7], introduced a two-stream network with a Weighted Regularization Triplet (WRT) loss, setting a strong baseline for subsequent studies in VI-ReID. The WRT loss was particularly effective in balancing intra-class compactness and inter-class separability under cross-spectral conditions, offering insights into loss function design for heterogeneous data learning. Another line of work explored data augmentation through synthetic image generation. Zhang et al. [10] proposed the Diverse Embedding Expansion Network (DEEN), which generates diverse visual appearances to enrich training samples. This approach not only mitigates the lack of real-world cross-modality data but also enhances the model's robustness against illumination variations, a common challenge in low-light environments. In parallel, multi-channel input strategies have been adopted to exploit complementary information across modalities. For example, MSCMNet [11] incorporated both original and enhanced versions of visible and infrared images, enabling more discriminative feature learning through multi-scale fusion. Despite these advances, methods relying solely on global feature representations often suffer from performance degradation due to misalignment caused by pose variation and occlusion. To tackle this limitation, DDAG [12] introduced a dynamic dual-attentive aggregation mechanism that adaptively refines local features through cropping, slicing, and segmentation, thereby capturing context-aware representations that are more resilient to spatial misalignment. Fang et al. [13] proposed semantic alignment and affinity inference to address part misalignment issues. However, these methods neglect higher-order spatial relationships between body parts. Compressing full-body features into compact vectors risks diluting critical discriminative cues (e.g., distinct facial features or clothing patterns), as these details may be averaged out during dimensionality reduction.
Unlike CNNs that process features in a grid-like manner, Graph Neural Networks (GNNs) enable explicit cross-modal interactions through graph edges, allowing complementary information exchange between infrared and visible features. Moreover, their message passing mechanisms inherently preserve both local details and global contextual features during information propagation. Feng et al. [14] proposed a graph-based architecture that integrates local body-part features with relational reasoning via graph neural networks, effectively modeling both homogeneous and heterogeneous structural dependencies. However, directly applying GNNs to VI-ReID faces several challenges. First, infrared images suffer from poor parsing quality due to low contrast and missing texture details, leading to unreliable graph construction. Second, without proper guidance, information propagation in graphs may cause feature dilution rather than enhancement. Third, the semantic gap between modalities requires careful design of cross-modal graph interactions.
To address these limitations, we propose the Parsing-guided Differential Enhancement Graph Learning (PDEGL) method for VI-ReID. By constructing a Graph Neural Network (GNN), our method explicitly models high-order topological relationships among body components, enabling inter-node information propagation and adaptive feature fusion. This structured relational reasoning empowers the GNN to learn feature representations that are not only highly discriminative but also significantly more robust to the aforementioned cross-modal and intra-class variations. Specifically, PDEGL employs human parsing techniques to decompose pedestrian images into constituent body parts, which are subsequently treated as nodes within a GNN. This paradigm shifts the matching process from direct, often unstable image-feature comparisons to a more robust evaluation of spatial interdependencies among key body parts. The proposed PDEGL method operates through a specially designed dual-branch architecture. The first branch processes global features extracted from the original Visible (VIS) and Infrared (IR) images using a ResNet50 backbone. These features are further refined by our novel Position-sensitive Spatial-Channel Attention (PSCA) module, which dynamically emphasizes discriminative spatial regions and feature channels pivotal for bridging cross-modal discrepancies. The second branch focuses on part-based analysis. While the Self-Correction for Human Parsing (SCHP) [15] network achieves strong performance on visible images, it suffers from inherent limitations in the IR domain, such as low resolution, absence of chromatic information, and poor signal-to-noise ratios, leading to parsing inaccuracies. To address this problem, we propose a Differential Infrared Part Enhancement (DIPE) module, inspired by analog differential amplifiers [16] that suppress common-mode noise while amplifying differential signals. The DIPE module synergistically fuses information from the raw IR image and its initial parsed counterpart through a channel-weighted aggregation mechanism, which rectifies local parsing inaccuracies and ultimately generates more complete and veridical IR human parsing images. Subsequently, we introduce the specially designed Parsing Structural Graph (PSG) module, where the refined parts serve as GNN nodes, with edges defined by an adjacency matrix encoding human anatomical topology. Within each node, pixel connectivity is established via an eight-connected strategy. This graph structure explicitly models the spatial distributions and inter-dependencies of body parts, enabling identity matching through structural consistency analysis rather than appearance-based comparisons. Through the interaction of PSCA, DIPE, and PSG, our framework matches individuals by comparing the structural congruence of their respective part-graphs, yielding a more discriminative and robust person representation across the visible and infrared modalities.
Our main contributions are summarized as follows:
We propose a novel Parsing-guided Differential Enhancement Graph Learning (PDEGL) method, which explicitly constructs interconnected body-component graphs of the target person. By transforming person re-identification into a high-level structural matching problem, PDEGL captures spatial configurations and inter-dependencies among anatomical regions, demonstrating exceptional robustness against inherent appearance variations in cross-modal scenarios.
We introduce an innovative Differential Infrared Part Enhancement (DIPE) module. Inspired by the operational principles of differential amplifiers, the DIPE module synergistically fuses information from the raw infrared image and its initial parsed counterpart via adaptive channel weighting. This mechanism is specifically designed to rectify the local inaccuracies prevalent in infrared human parsing, yielding more veridical part segmentations. Meanwhile, to construct the internal structural relationships among body parts, the proposed Parsing Structural Graph (PSG) module is applied to the DIPE-enhanced IR human parsing results.
We design a Position-sensitive Spatial-Channel Attention (PSCA) module, integrated with the ResNet50 backbone. PSCA dynamically accentuates salient global features and informative channels, effectively mitigating cross-modal discrepancies within the global feature representations.
Extensive experimental evaluations conducted on the challenging SYSU-MM01, RegDB, and LLCM datasets demonstrate that our proposed PDEGL method achieves highly competitive performance compared with state-of-the-art methods.
3. The Proposed Method
In this section, we introduce the proposed Parsing-guided Differential Enhancement Graph Learning (PDEGL) method in detail. First, we introduce the overall structure of the PDEGL method. Subsequently, we elaborate on the design details of the Differential Infrared Part Enhancement (DIPE) module, Parsing Structural Graph (PSG) module, and Position-sensitive Spatial-Channel Attention (PSCA) module. Lastly, we provide the total loss for our PDEGL method.
3.1. Overall Structure
As shown in Figure 1, the proposed PDEGL method adopts a dual-branch network structure consisting of a global feature branch and a parsing graph branch. First, the input Visible (VIS) and Infrared (IR) pedestrian images are decomposed into anatomically meaningful components (e.g., head, arms, and legs) with the Self-Correction for Human Parsing (SCHP) [15] method. The global feature branch extracts global features from the raw images through a ResNet50 backbone, augmented by a novel Position-sensitive Spatial-Channel Attention (PSCA) module. This mechanism dynamically focuses on the most discriminative spatial regions and channel-wise features, enhancing cross-modal feature alignment. Central to our PDEGL framework, the parsing graph branch specializes in part-based graph representation learning. Given that SCHP suffers from parsing inaccuracies in IR images due to their low resolution and absence of chromatic information, we introduce a Differential Infrared Part Enhancement (DIPE) module. Inspired by analog differential amplifiers, DIPE synergistically fuses raw IR inputs with preliminary parsing results via channel-wise weighted aggregation, correcting local parsing errors and refining IR part segmentation. Subsequently, these parsed components serve as nodes in the Parsing Structural Graph (PSG) module. The adjacency matrix encodes the spatial topological relationships between body parts, while intra-node pixel connectivity is established through an eight-connected strategy. This GNN architecture explicitly models the spatial distributions and interdependencies among body components, transforming person re-identification from simplistic feature matching into a robust structural consistency comparison. Finally, a component-level heterogeneous center loss supervises the learning of discriminative cross-modal part graphs, ensuring optimal alignment of VIS and IR embeddings under challenging conditions.
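To make the data flow of the two branches concrete, the following PyTorch-style sketch wires the components together under simplifying assumptions: all submodules (backbone, PSCA, SCHP parser, DIPE, part-node pooling, PSG) are injected as callables, and their interfaces are illustrative rather than taken from a released implementation.
```python
import torch.nn as nn

class PDEGLSketch(nn.Module):
    """Minimal sketch of the dual-branch PDEGL forward pass (interfaces assumed)."""

    def __init__(self, backbone, psca, parser, dipe, to_part_nodes, psg):
        super().__init__()
        self.backbone = backbone            # ResNet50 through layer4
        self.psca = psca                    # Position-sensitive Spatial-Channel Attention
        self.parser = parser                # frozen SCHP human parser
        self.dipe = dipe                    # Differential Infrared Part Enhancement
        self.to_part_nodes = to_part_nodes  # pools parsed regions into (B, K, D) node features
        self.psg = psg                      # Parsing Structural Graph (graph attention)

    def forward(self, img, adj, is_infrared):
        # Global feature branch: backbone features refined by PSCA.
        global_feat = self.psca(self.backbone(img))
        # Parsing graph branch: parse, refine IR parsing with DIPE, then graph reasoning.
        parsed = self.parser(img)
        if is_infrared:
            parsed = self.dipe(img, parsed)
        graph_feat = self.psg(self.to_part_nodes(parsed), adj)
        return global_feat, graph_feat
```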
3.2. Differential Infrared Part Enhancement Module
When directly applying the pretrained SCHP to infrared (IR) pedestrian images, the parsing accuracy often deteriorates significantly due to inherent limitations of IR images, such as low spatial resolution and the absence of chromatic and textural details. Such erroneous IR parsing results subsequently mislead the Graph Neural Network (GNN) learning process, compromising its ability to capture genuine human structural relationships and thereby degrading cross-modal matching performance.
To address this challenge, we propose a Differential Infrared Part Enhancement (DIPE) module, whose core design principle is inspired by differential amplifier circuits, which amplify differential-mode signals while suppressing common-mode noise. Analogously, the DIPE module aims to intelligently integrate low-level spatial information from the raw IR image with the semantic part-aware features generated by the preliminary SCHP parsing. By adaptively amplifying subtle discrepancies between these two inputs and leveraging critical details in the raw IR data to rectify parsing inaccuracies, the module achieves precise correction and enhancement of the initial infrared parsing results.
The DIPE module adopts a dual-stream parallel processing architecture, as illustrated in Figure 2. It takes the raw infrared image I and the SCHP-parsed semantic map M as input, where M encodes semantic label information for 20 anatomical regions (e.g., head, upper body, and pants).
In the feature encoding phase, multi-level shared-weight convolutional layers are first applied to the raw infrared image I and the parsed semantic map M, generating their corresponding feature maps. Bidirectional differential signals are then computed to capture the complementary information between these two streams.
The two differential signals capture complementary information: one highlights regions where the raw IR image contains noise or background clutter absent in the parsed map, while the other identifies body part structures missing from the raw image due to low thermal contrast. The differential signals are then compressed into global descriptors through Global Average Pooling (GAP), and adaptive channel attention weights are generated by a Sigmoid activation function.
These weights dynamically modulate the feature response strength of each channel, enhancing identity-discriminative features while suppressing modality-specific artifacts. Based on this mechanism, the DIPE module further constructs a differential-aware feature map through element-wise addition and channel-wise multiplication of the two streams. Through this cross-enhancement strategy, the original features and the parsing-derived features achieve complementary fusion.
Finally, after channel concatenation, the spatial resolution of the differential-aware feature map is gradually restored through a deconvolution network, which outputs the enhanced infrared human body feature map.
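A minimal PyTorch sketch of this dual-stream flow is given below. The channel counts, layer depths, and the exact fusion order are our assumptions (both inputs are assumed to have the same number of channels); only the overall pipeline of shared encoding, bidirectional differences, channel gating, cross-enhancement, and deconvolution follows the description above.
```python
import torch
import torch.nn as nn

class DIPE(nn.Module):
    """Minimal sketch of Differential Infrared Part Enhancement (dimensions assumed)."""

    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        # Shared-weight encoder applied to both the raw IR image and the parsed map
        # (both inputs are assumed to have `in_ch` channels here).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling
        self.fc = nn.Conv2d(feat_ch, feat_ch, 1)  # produces channel gating logits
        # Deconvolution head that restores spatial resolution after fusion.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_ch, in_ch, 4, stride=2, padding=1),
        )

    def forward(self, ir_img, parsed_map):
        f_i = self.encoder(ir_img)      # features of the raw IR image
        f_m = self.encoder(parsed_map)  # features of the parsed semantic map
        # Bidirectional differential signals.
        d_im = f_i - f_m                # clutter present in IR but absent from the parsing
        d_mi = f_m - f_i                # parts visible in the parsing but weak in the IR image
        # Adaptive channel attention weights from the pooled differential signals.
        w_im = torch.sigmoid(self.fc(self.gap(d_im)))
        w_mi = torch.sigmoid(self.fc(self.gap(d_mi)))
        # Cross-enhancement: element-wise addition plus channel-wise re-weighting.
        e_i = f_i + w_mi * f_m
        e_m = f_m + w_im * f_i
        # Concatenate along channels and restore spatial resolution.
        return self.decoder(torch.cat([e_i, e_m], dim=1))
```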
3.3. Parsing Structural Graph Module
Conventional VI-ReID approaches typically perform feature extraction on holistic body representations. However, this coarse-grained strategy fails to exploit semantic and spatial relationships between anatomical components, making it highly sensitive to pose variations and occlusions. By contrast, we propose the Parsing Structural Graph (PSG) module to construct graph-structured representations from DIPE-enhanced parsing results, explicitly modeling inter-part spatial dependencies and semantic correlations through Graph Neural Networks (GNNs). The PSG module independently processes every input pedestrian sample in a weight-sharing manner, whether derived from RGB or DIPE-enhanced infrared inputs. Specifically, it represents parsed human bodies as graphs where semantically consistent components (e.g., head, torso, upper/lower limbs) serve as nodes, while their inherent anatomical and spatial relationships define edges. This formulation enables the network to generate more robust feature representations by learning relational configurations of human body parts.
As shown in Figure 3, for each parsed pedestrian image we construct an undirected graph whose nodes represent the parsed body semantic components. A spatial adjacency matrix is then formulated to encode the anatomical topological relationships between body parts, in which each node is connected to its set of anatomical neighbors.
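As a small illustration of how such an anatomical adjacency matrix might be assembled (the part indices and edge list below are hypothetical and do not correspond to SCHP's actual label set):
```python
import torch

# Hypothetical anatomical edges: each (i, j) pair links two physically connected parts.
PART_EDGES = [
    (0, 1),   # head  - torso
    (1, 2),   # torso - left arm
    (1, 3),   # torso - right arm
    (1, 4),   # torso - left leg
    (1, 5),   # torso - right leg
]

def build_adjacency(num_parts=6, edges=PART_EDGES):
    """Builds a symmetric binary adjacency matrix from anatomical neighbor pairs."""
    adj = torch.zeros(num_parts, num_parts)
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    return adj
```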
Subsequently, the spatial graph attention branch maps the input feature map into a low-dimensional space to obtain projected feature vectors, from which the spatial graph attention weights are calculated.
The core objective of the spatial graph attention branch is to dynamically assign importance to different parts based on their contextual relationships. By computing feature similarities and constraining the results with the predefined anatomical adjacency matrix, the model learns which part combinations are most critical for identity recognition. The spatial graph features are then obtained by aggregating the spatially flattened feature vectors with these attention weights, modulated by a scaling factor.
Figure 3. The workflow of the PSG module to construct the spatial structural relationships of human body parts.
The channel adjacency matrix models the long-range dependencies between multi-modal feature channels by constructing chain-like connections. A sparse adjacency matrix is built to constrain information propagation to adjacent channels.
Then, the channel graph attention branch models pedestrian features through a channel-wise correlation matrix. This captures dependencies between different feature channels, which enables the model to learn more discriminative feature combinations and adaptively enhance the channels most critical for identity recognition. After applying the channel adjacency constraint, the channel graph attention weights are obtained through normalization, and the channel graph feature map is computed by weighted aggregation.
Through the joint action of the spatial graph attention branch and the channel graph attention branch, both the local spatial dependencies and the channel correlations among human body parts are captured. Spatial-channel graph feature aggregation is then performed: node features are updated by computing attention scores over adjacent nodes, normalizing them, and aggregating the neighbors' information.
After constructing the single-layer graph attention mechanism, multiple parallel and independent attention heads are employed to capture feature relationships in different subspaces. The outputs of all attention heads are concatenated to obtain the final graph feature output.
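The adjacency-constrained, multi-head graph attention described above can be sketched roughly as follows; the channel graph attention branch follows the same pattern along the channel dimension with its chain-like adjacency. The head count, dimensions, and scaled dot-product form are illustrative assumptions, not the paper's exact settings.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedGraphAttention(nn.Module):
    """Sketch of adjacency-constrained multi-head graph attention over part nodes."""

    def __init__(self, in_dim=256, out_dim=256, num_heads=4):
        super().__init__()
        assert out_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = out_dim // num_heads
        self.q = nn.Linear(in_dim, out_dim)
        self.k = nn.Linear(in_dim, out_dim)
        self.v = nn.Linear(in_dim, out_dim)
        self.scale = self.head_dim ** -0.5  # scaling factor for the attention logits

    def forward(self, x, adj):
        # x: (B, K, in_dim) part-node features; adj: (K, K) anatomical adjacency matrix.
        B, K, _ = x.shape

        def split(t):  # (B, K, out_dim) -> (B, heads, K, head_dim)
            return t.view(B, K, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        # Feature-similarity scores between parts.
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale   # (B, H, K, K)
        # Constrain propagation to anatomically adjacent parts (plus self-loops).
        mask = (adj + torch.eye(K, device=x.device)) > 0
        scores = scores.masked_fill(~mask, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, v)                                   # aggregate neighbor information
        # Concatenate the heads to form the final graph feature output.
        return out.transpose(1, 2).reshape(B, K, -1)
```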
3.4. Position-Sensitive Spatial-Channel Attention Module
Due to the inherent differences between imaging modalities in the VI-ReID task, the appearance features of pedestrians captured in the infrared and visible modalities show significant heterogeneity. Although the proposed PDEGL method can learn the structural correspondence among human body parts through the PSG module introduced above, global features remain an important supplement for overall context information. However, global features extracted directly from the ResNet50 backbone cannot fully accommodate the large discrepancies between modalities, which limits their effectiveness in cross-modal matching.
To solve this problem and enhance the representation ability of the global feature branch, we design the Position-sensitive Spatial-Channel Attention (PSCA) module. It enables the attention mechanism to adaptively focus on the most important spatial dependencies and feature channels within different spatial regions of the feature map. The core idea is to decompose the feature map into multiple patches and independently perform attention operations along the spatial and channel dimensions within each patch, thereby implementing a position-aware feature enhancement strategy. The structure of the PSCA module is shown in Figure 4.
Specifically, the similarity between any two locations in the input feature map is first calculated, and the importance weight of each location is then dynamically adjusted based on this similarity to refine the final feature representation. Given an input feature map, a convolutional layer first reshapes it so that its spatial and channel information can be better exploited. The resulting matrices are then uniformly divided into several patches along the column dimension and converted into three tensors, where N denotes the number of image patches. Next, the associations between different spatial locations and between channels are captured by generating, for each patch (indexed by i), a position self-correlation matrix and a channel self-correlation matrix.
The position self-correlation matrix represents the pairwise relationships between different positions in the feature map, and its elements indicate the importance of one position relative to another. Similarly, each element of the channel self-correlation matrix represents the importance of one channel relative to another. The original input feature is then weighted by both the position self-correlation matrix and the channel self-correlation matrix, so that the positional and channel relationships of the input feature map are merged.
This operation fuses the salient features extracted by the channel attention and the spatial attention into a holistically enhanced representation for the current image patch. Finally, the feature maps of all patches are aggregated to obtain the final output of the global feature branch: a global feature map that integrates position-sensitive spatial and channel context information.
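The sketch below illustrates the patch-wise position/channel self-correlation idea in PyTorch; the 1x1 projection, the number of patches, and the residual fusion are our assumptions rather than the paper's exact design.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSCA(nn.Module):
    """Rough sketch of patch-wise position/channel self-correlation (details assumed)."""

    def __init__(self, channels=2048, num_patches=4):
        super().__init__()
        self.num_patches = num_patches
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # reshaping projection

    def forward(self, x):
        # x: (B, C, H, W) backbone feature map.
        x = self.proj(x)
        patches = torch.chunk(x, self.num_patches, dim=3)  # split column-wise into patches
        outs = []
        for p in patches:
            b, c, h, w = p.shape
            flat = p.reshape(b, c, h * w)                                    # (B, C, L)
            # Position self-correlation: importance of one location w.r.t. another.
            pos = F.softmax(torch.bmm(flat.transpose(1, 2), flat), dim=-1)  # (B, L, L)
            # Channel self-correlation: importance of one channel w.r.t. another.
            chn = F.softmax(torch.bmm(flat, flat.transpose(1, 2)), dim=-1)  # (B, C, C)
            # Weight the patch features by both matrices and fuse with a residual.
            pos_out = torch.bmm(flat, pos)          # spatially re-weighted features
            chn_out = torch.bmm(chn, flat)          # channel re-weighted features
            fused = (pos_out + chn_out).reshape(b, c, h, w) + p
            outs.append(fused)
        return torch.cat(outs, dim=3)               # re-assemble the patches
```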
3.5. Loss Function
Achieving component-level alignment and discrimination of cross-modal features is the key to supervising the training of our proposed PDEGL method. This means minimizing, as far as possible, the feature differences between corresponding components of the same pedestrian in different modalities, while enlarging the feature differences between corresponding components of different pedestrians. Specifically, for each identity in the training set we maintain a learnable central prototype for every human body component, whose dimension equals that of a single component feature. These central prototypes are updated along with the network parameters during training. Based on this, we design the Parsing-level Hetero-Center Loss.
In this loss, the enhanced feature of the kth component of the ith sample is the output of the PSG module, the identity label of the ith sample determines the corresponding component centers, and M is a preset margin parameter used to ensure sufficient spacing between the component centers of different identities.
This loss pulls the corresponding components of the same identity toward their shared component center in the feature space, thereby reducing cross-modal differences, and pushes apart the component centers of different identities, thereby enhancing the discriminative power of component-level features. In this way, it effectively optimizes the feature space at the component level, enabling the model to learn fine-grained representations that are robust to modality changes and sensitive to identities.
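Since the exact formula is not reproduced here, the following sketch only captures the described pull-toward-own-center and push-apart-other-centers behavior at the component level; the distance measure, margin value, and term weighting are assumptions.
```python
import torch
import torch.nn.functional as F

def parsing_hetero_center_loss(part_feats, labels, centers, margin=0.3):
    """Hedged sketch of a parsing-level hetero-center style loss.

    part_feats: (B, K, D) part features output by the PSG module.
    labels:     (B,) identity indices.
    centers:    (num_ids, K, D) learnable component prototypes (updated with the network).
    """
    # Pull: aggregate each component toward the shared center of its own identity.
    own_centers = centers[labels]                               # (B, K, D)
    pull = ((part_feats - own_centers) ** 2).sum(dim=-1).mean()

    # Push: keep component centers of different identities at least `margin` apart.
    per_part = centers.transpose(0, 1)                          # (K, num_ids, D)
    dists = torch.cdist(per_part, per_part)                     # (K, num_ids, num_ids)
    n = centers.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=centers.device)
    push = F.relu(margin - dists[:, off_diag]).mean()
    return pull + push
```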
Furthermore, we design a sample classification center loss to enhance the model's global sensitivity to target pedestrians in different modalities while promoting the close aggregation of samples of the same identity.
In addition, a pedestrian identity constraint loss measures the difference between the identity labels predicted by the model and the ground-truth labels using the cross-entropy loss.
Ultimately, the overall loss function used to train our PDEGL method is the weighted sum of the identity constraint loss, the sample classification center loss, and the Parsing-level Hetero-Center Loss.
4. Experiments
4.1. Datasets
We conducted extensive experiments on three widely used VI-ReID datasets, SYSU-MM01 [8], RegDB [9], and LLCM [10], to verify the effectiveness of our proposed PDEGL method. The datasets are shown schematically in Figure 5 and described in detail below.
SYSU-MM01. The SYSU-MM01 dataset contains images in both visible (RGB) and infrared (IR) modalities. The images were captured by six cameras, four RGB and two IR, in diverse indoor and outdoor scenes, including bright indoor environments and dark outdoor environments. In total, 45,863 images of 491 pedestrians were collected, of which 30,071 are RGB images and 15,792 are IR images.
RegDB. The RegDB dataset was collected with a dual-modality acquisition system comprising visible and infrared cameras. It covers 412 pedestrian identities (158 male and 254 female), with 10 visible and 10 infrared images per identity, totaling 8240 images. Regarding capture viewpoint, 156 of the 412 identities were recorded from the front and the remaining 256 from the back.
LLCM. The LLCM dataset contains 7800 RGB images and 7800 IR images of pedestrians, totaling 15,600 images. It provides 390 pedestrian identities, each with 20 images in each of the two modalities. The images were collected by two cameras with overlapping views, with pedestrians walking between them. It is one of the larger cross-modal person re-identification datasets currently available, providing important data support for research in this field.
Evaluation Metrics. We use the Cumulative Matching Characteristic (CMC) curve and mean Average Precision (mAP) as evaluation metrics when comparing the performance of the PDEGL method with other state-of-the-art (SOTA) methods.
4.2. Implementation Details
Our proposed PDEGL method was implemented on an Nvidia 4090 GPU with the PyTorch framework [50]. Following [7,51], a ResNet50 [52] pre-trained on ImageNet was employed as the backbone for global feature extraction. We removed the final fully connected classification layer and used the feature map output by layer4. The input images were resized to a fixed resolution. For data augmentation, we adopted common strategies, including random horizontal flipping and random erasing [Random Erasing Data Augmentation]. In each training batch, 8 pedestrian identities were randomly selected from the dataset, with 8 different images per identity; this sampling scheme is designed to simulate the diversity of pedestrians in real scenes. The entire network was trained end-to-end using the AdamW optimizer with weight decay, and a cosine annealing learning rate schedule was employed over a total of 120 epochs. A warm-up strategy was used for the first 10 epochs, during which the learning rate was linearly increased to its initial value.
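For reference, a typical AdamW plus warm-up and cosine-annealing setup of the kind described above could look like the sketch below; since the exact learning rate and weight decay values are not reproduced in the text, the numbers here are illustrative assumptions.
```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(16, 16)              # stand-in for the PDEGL network
base_lr, warmup_start_lr = 3.5e-4, 3.5e-5    # illustrative values, not the paper's
warmup_epochs, total_epochs = 10, 120

optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=5e-4)

def lr_factor(epoch):
    if epoch < warmup_epochs:
        # Linear warm-up from warmup_start_lr to base_lr over the first 10 epochs.
        lr = warmup_start_lr + (base_lr - warmup_start_lr) * epoch / warmup_epochs
        return lr / base_lr
    # Cosine annealing over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(total_epochs):
    # ... one epoch of PK-sampled training (8 identities x 8 images per batch) ...
    scheduler.step()
```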
4.3. Comparison with State-of-the-Art Methods
In this section, to verify the effectiveness and advancement of the proposed PDEGL method, we compare it with several SOTA methods under the same configuration described in Section 4.2 on the SYSU-MM01, RegDB, and LLCM datasets. The experimental results show that our proposed Parsing-guided Differential Enhancement Graph Learning (PDEGL) method achieves excellent performance on all three datasets.
Comparison on SYSU-MM01. As shown in Table 1, the proposed PDEGL method demonstrates highly competitive performance on SYSU-MM01. In the All search mode, our method achieves 75.2% mAP and 74.1% Rank-1 accuracy. Under the Indoor search setting, PDEGL achieves 86.4% mAP and 82.8% Rank-1 accuracy. This significantly surpasses the second-best method, DEEN, and in particular exceeds PMWGCN, which also adopts a graph learning strategy, by more than 10%, demonstrating the strong ability of the PDEGL method to learn discriminative features in complex indoor environments. This advantage stems from the proposed Differential Infrared Part Enhancement module and the high-order relational modeling based on graph neural networks, which effectively alleviate infrared image parsing errors and modality differences.
Comparison on RegDB. On the RegDB dataset, our proposed PDEGL method achieves 87.2% mAP and 91.3% Rank-1 in the visible-to-infrared (VIS to IR) mode. This is only 0.3% below MMN in Rank-1, and similar results are observed in the infrared-to-visible (IR to VIS) mode. RegDB is characterized by large modality differences but aligned viewpoints. We attribute this strong performance to the proposed Position-sensitive Spatial-Channel Attention (PSCA) module, which effectively achieves local-global alignment of cross-modal features.
Comparison on LLCM. On the challenging LLCM dataset, which features complex pose changes and occlusions, our method demonstrates consistent superiority. The PDEGL method achieves 67.6% mAP and 63.2% Rank-1 in the visible-to-infrared (VIS to IR) mode and 63.4% mAP and 55.2% Rank-1 in the infrared-to-visible (IR to VIS) mode. This is attributed to our graph learning strategy, which explicitly models the spatial topological relationships of human body parts, significantly enhancing the robustness of the network against occlusion and pose changes.
4.4. Ablation Study
In this section, we follow the experimental configuration in Section 4.2 and conduct a systematic ablation study of the proposed components on the SYSU-MM01 and RegDB datasets. We adopt ResNet-50, following the settings of AGW [7], as the baseline. The detailed experimental results are presented in Table 2.
Effectiveness of PSCA. To verify the effectiveness of the PSCA module, we integrated PSCA into the baseline, denoted “+PSCA” in Table 2. In the “VIS to IR” (“IR to VIS”) mode on the RegDB dataset, PSCA improves mAP by 3.3% (4.5%) and Rank-1 by 6.2% (4.9%) over the baseline. Similar improvements are clearly observed on the SYSU-MM01 dataset. These results confirm the effectiveness of PSCA: by establishing a position-channel cooperative perception mechanism, it strengthens the weighting of salient regions in pedestrian images while avoiding feature confusion between modalities, providing an effective optimization path for cross-modal feature learning.
Effectiveness of PSG. To verify the effectiveness of our Parsing Structural Graph (PSG) module, we integrated it with the baseline, denoted “+PSG” in Table 2. PSG treats each parsed human body component as a node in a graph and encodes the spatial connection topology between parts, allowing the target pedestrian to be matched through high-order structural relationships. Under the “All search” (“Indoor search”) mode on the SYSU-MM01 dataset, PSG yields improvements of 11.1% (9.9%) in mAP and 4.2% (3.1%) in Rank-1 over the baseline, which confirms the validity of the PSG module.
Effectiveness of DIPE. To verify the effectiveness of our Differential Infrared Part Enhancement module, we evaluate “+DIPE”, where only DIPE is added to the baseline. As shown in Table 2, DIPE alone achieves consistent improvements: gains of 1.8%/4.0% (mAP/Rank-1) on RegDB VIS to IR and a 6.0% mAP improvement on SYSU-MM01 All search. The full model is shown in the last row as “+PSCA+PSG+DIPE”. Compared with “+PSCA+PSG” under the “VIS to IR” (“IR to VIS”) mode on the RegDB dataset, mAP and Rank-1 improve by 0.9% (1.2%) and 2.6% (1.7%), respectively. The refined parsing serves as better input for the subsequent graph modules, as evidenced by the substantial performance of the full model (75.2% mAP). These results demonstrate that our DIPE module effectively compensates for parsing errors inherent in IR imaging by fusing complementary information from the original and parsed infrared images, significantly improving the quality of the part-level graph representations in our PDEGL method.
Effectiveness of loss function. In this section, we conduct a detailed ablation study on the SYSU-MM01 dataset under the All search mode to verify the validity of the designed loss functions. As shown in Table 3, the proposed PDEGL method performs poorly when using only the identity loss. After introducing the sample classification center loss, the performance of PDEGL improves significantly: mAP and Rank-1 increase by 4.5% and 5.6%, respectively, indicating that the discriminative ability of the model for sample classification is effectively enhanced. We then adopt the designed Parsing-level Hetero-Center Loss to supervise the PDEGL method together with the graph learning strategy, which further improves mAP and Rank-1 by 7.4% and 10.1%, respectively. This confirms that the Parsing-level Hetero-Center Loss reduces the feature differences between corresponding components of the same pedestrian in different modalities while increasing the feature differences between corresponding components of different pedestrians. Finally, in line with our loss design, the best training performance of the PDEGL method is obtained when all three losses are adopted together.
4.5. Visualization
Heatmap visualization. To verify the advantages of the proposed PDEGL method in feature extraction and attention allocation, we compare the attention heatmaps of the baseline method and the proposed method. The comparison results are shown in Figure 6. The baseline's attention over the human trunk is relatively scattered, and it assigns high attention to some unimportant areas, indicating that the baseline has certain limitations in feature extraction and fails to fully focus on the important feature regions of pedestrians. By contrast, our PDEGL method attends more strongly to the important feature regions of pedestrians (such as the head, trunk, and legs), and the highlighted areas in the heatmaps are more concentrated and distinct. We conclude that PDEGL extracts the key features of pedestrians more effectively and reduces attention to unimportant areas.
Retrieval results visualization. The Top-10 retrieval results for the “VIS to IR” and “IR to VIS” modes on the LLCM dataset are shown in Figure 7. Correct identity matches are marked with green boxes, and incorrect matches with red boxes. For the queries, we select four fixed pedestrian samples and retrieve them in both modes. As can be observed from Figure 7, the proposed PDEGL method produces highly accurate matching results, although some incorrect matches remain. Specifically, two different pedestrians with very similar body proportions and postures may be wrongly identified as the same person because their graph structures match closely, as in the first column of the “VIS to IR” mode and the fourth column of the “IR to VIS” mode.
4.6. Robustness Analysis and Limitations
While our PDEGL method achieves competitive performance on standard benchmarks, real-world deployment faces significant challenges beyond controlled datasets. Current VI-ReID datasets lack diversity in weather conditions, severe occlusions, and infrared imaging variations, limiting comprehensive robustness assessment. To understand these practical limitations, we analyze our method's performance under challenging conditions available in the LLCM dataset and compare it with the representative DEEN. The comparative retrieval results are presented in Figure 8.
The scene in the upper rows involves a person wearing a bulky black coat with poor thermal contrast. Our method fails to retrieve correct matches (Ranks 1–6 are incorrect), while DEEN performs better with two correct matches in the top 5. The failure of PDEGL stems from parsing degradation and graph structure ambiguity. Specifically, the low thermal contrast and bulky clothing in the thermal images make accurate body part segmentation extremely challenging, causing DIPE to produce unreliable parsing results. Furthermore, the lack of clear body part boundaries introduces graph structure ambiguity, which diminishes the discriminative power of the constructed graph and forces matching to rely on the overall silhouette rather than on detailed structural cues. In the lower rows, this contrast is even more obvious: the target pedestrian carries a red bag and adopts an unusual posture, which severely disrupts the expected human topological structure and causes our graph-based matching to fail when the candidates lack similar attachments.
Although our PDEGL achieves higher overall performance metrics, as shown in Table 1, DEEN exhibits better robustness under extreme conditions. This can be attributed to its diverse embedding strategy, which expands visual representations through an embedding expansion module and captures appearance variations beyond structural patterns, maintaining robustness when structural cues are unreliable.
We conclude that although structured methods such as PDEGL perform well when body parts are clearly visible and correctly parsed, they become vulnerable when these assumptions are violated. Methods that emphasize appearance diversity and global features exhibit better fail-safe behavior under extreme conditions, and this is where we aim to improve in future work.
5. Conclusions and Future Work
In this paper, we propose a novel Parsing-guided Differential Enhancement Graph Learning (PDEGL) method for Visible-Infrared Person Re-Identification (VI-ReID). The PDEGL framework is built upon a dual-branch architecture that distinctly handles global feature learning and part-based graph analysis. Its core contributions include the Differential Infrared Part Enhancement (DIPE) module for correcting infrared parsing errors, the Parsing Structural Graph (PSG) module for modeling complex inter-part topological relationships, and the Position-sensitive Spatial-Channel Attention (PSCA) module for enhancing global feature discriminability. Extensive evaluations on the SYSU-MM01, RegDB, and LLCM datasets demonstrate that PDEGL achieves state-of-the-art or highly competitive performance, particularly in scenarios with parsing errors and significant appearance variations. These results underscore the efficacy of explicitly modeling structural relationships and enhancing feature quality across modalities. The PDEGL framework offers a robust solution for VI-ReID, advancing the potential for reliable cross-modal person identification in diverse real-world applications.
Looking forward, several important directions remain to be explored. First, collecting and annotating datasets under extreme weather conditions would enable more realistic evaluation of VI-ReID methods. Second, developing adaptive graph topologies that can handle varying body configurations and severe occlusions would improve robustness. Third, incorporating uncertainty estimation mechanisms would help identify unreliable matches in challenging scenarios. Finally, exploring domain adaptation techniques could address the generalization gap between controlled datasets and real-world deployments. These efforts are crucial for transitioning VI-ReID from laboratory success to practical applications.