Graph Sampling-Based Multi-Stream Enhancement Network for Visible-Infrared Person Re-Identification

With the increasing demand for person re-identification (Re-ID), all-day retrieval has become an inevitable trend. Single-modal Re-ID is no longer sufficient to meet this requirement, making multi-modal data crucial in Re-ID. Consequently, the Visible-Infrared Person Re-Identification (VI Re-ID) task has been proposed, which aims to match pairs of person images from the visible and infrared modalities. The significant discrepancy between the two modalities poses a major challenge. Existing VI Re-ID methods focus on cross-modal feature learning and modal transformation to alleviate the discrepancy but overlook the impact of person contour information. Contours exhibit modality invariance, which is vital for learning effective identity representations and for cross-modal matching. In addition, due to the low intra-modal diversity in the visible modality, it is difficult to distinguish the boundaries between some hard samples. To address these issues, we propose the Graph Sampling-based Multi-stream Enhancement Network (GSMEN). First, the Contour Expansion Module (CEM) incorporates the contour information of a person into the original samples, further reducing the modality discrepancy and improving matching stability between image pairs of different modalities. Additionally, to better distinguish cross-modal hard sample pairs during training, an innovative Cross-modality Graph Sampler (CGS) is designed for sample selection before training. The CGS calculates the feature distances between samples from different modalities and groups similar samples into the same batch during training, effectively exploring the boundary relationships between hard classes in the cross-modal setting. Experiments conducted on the SYSU-MM01 and RegDB datasets demonstrate the superiority of the proposed method. Specifically, in the VIS→IR task, our method achieves 93.69% Rank-1 accuracy and 92.56% mAP on the RegDB dataset.


Introduction
Person re-identification (Re-ID) [1][2][3][4][5][6][7] is a complex computer vision task that focuses on matching individuals across non-overlapping camera views. The main objective is to associate images or videos of the same person while keeping incorrect matches to a minimum. Effective Re-ID techniques have significant applications in various domains, such as surveillance, security, and public safety. With the increasing demand for Re-ID, there is a need to match infrared person images captured under challenging lighting conditions with visible person images. Consequently, VI Re-ID [8][9][10][11][12][13] garners significant attention from both industry and academia. Besides the intra-modal variations already present in single-modal Re-ID, a key challenge in the VI Re-ID task is how to reduce the modality discrepancy between visible and infrared images of the same identity. Existing research primarily relies on modal transformation methods. These methods generate cross-modal or intermediate-modal images corresponding to person images to convert heterogeneous modalities into a unified modality, thereby reducing the modal discrepancy. Specifically, Generative Adversarial Networks (GANs) [14] and encoder-decoder structures [15,16] are commonly introduced in these methods. However, the transformation from infrared images to visible images is ill-posed, which may introduce additional noise and fail to generate accurate visible images. Moreover, GAN-based models [17] often overlook the relationships between the global or local features of person images in the VI Re-ID task, leading to limited modal adaptability.
To enhance adaptability to cross-modal challenges, some recent researchers apply modal-shared feature learning to the VI Re-ID task, projecting visible and infrared images into a shared embedding space to achieve cross-modal feature alignment. These approaches can be further divided into global feature learning and local feature learning. Specifically, global feature learning represents a person image as a single feature vector, which is suitable for capturing overall identity information. On the other hand, local feature learning uses a set of feature vectors based on parts or regions to represent the image, allowing the local characteristics of person images to be captured in more detail. In addition, a two-stream convolutional neural network architecture is commonly applied in such methods, combined with loss functions (such as identity loss and triplet loss) as constraints.
Although these methods achieve good results in alleviating the modality discrepancy, they still have certain limitations: (1) Existing modal-shared feature learning methods typically focus on exploring either global or local feature representations and rarely combine the advantages of both. Moreover, because infrared images contain only a single channel reflecting objects' thermal radiation, key features such as color cannot be utilized for cross-modal matching. Directly extracting features from infrared images may also suffer from interference caused by identity-irrelevant information.
(2) These methods all adopt basic sampling techniques, such as random sampling and uniform sampling, which do not consider the relationships and similarities between samples from different modalities. Additionally, in VI Re-ID tasks, the features that can be extracted for infrared-modality retrieval are limited, resulting in numerous similar features among samples from different classes. Consequently, conventional sampling methods struggle to capture the subtle differences between these similar features.
To address the aforementioned two issues, we propose a novel method named the Graph Sampling-based Multi-stream Enhancement Network (GSMEN). It is noteworthy that when humans visually inspect infrared surveillance footage, they rely heavily on contour information. Despite the absence of color and texture features in infrared images, contour and shape information remains clear and visible, as depicted by the contour image in Figure 1a. It is evident that contours exhibit a certain cross-modal invariance between visible and infrared images [18]. Additionally, as contours provide a holistic representation of a person rather than localized features, their global features can better capture characteristic information. This observation motivates us to extend the global features of contours to the local features obtained from modal-shared feature learning, with the aim of enhancing the feature representation capability and reducing the cross-modal discrepancy between visible and infrared modalities. Consequently, the Contour Expansion Module (CEM) is proposed to fuse the contour-enhanced features with local features, resulting in improved matching performance for cross-modal image pairs. Then, to tackle the challenge of exploring the boundaries between hard classes in the VI Re-ID task, an efficient batch sampling technique is introduced, known as the Cross-modality Graph Sampler (CGS). Specifically, CGS constructs nearest neighbor relationship graphs for all classes in the visible and infrared modalities at the beginning of each epoch and then combines them. Subsequently, CGS conducts batch sampling by randomly selecting a sample as the anchor and choosing its top-k nearest neighboring samples of different classes, each class containing the same number of S instances, as illustrated in Figure 1b. Therefore, CGS ensures that the samples within a batch are mostly similar to each other, providing informative and challenging examples for discriminative learning. This sampler aims to explore the boundary relationships between hard classes and enhance the discriminative power of the learned model.
In summary, our contributions in this paper are as follows: • To enhance feature representation and reduce the cross-modal discrepancy between visible and infrared modalities, we propose the Contour Expansion Module (CEM), which combines the global features of contours with the local features obtained from modal-shared feature learning. To the best of our knowledge, this is the first attempt to exploit contour information for the VI Re-ID task.

•
To explore the boundaries between hard classes, we introduce the Cross-modality Graph Sampler (CGS). The sampler constructs nearest neighbor relationship graphs separately for the visible and infrared modalities and then combines them for batch sampling. This sampling strategy ensures that samples within a batch are mostly similar to each other, providing informative and challenging examples for discriminative learning.

•
We conduct experiments on the large-scale VI Re-ID datasets SYSU-MM01 and RegDB. The results demonstrate that our method achieves significant improvements in matching performance and modal adaptability.

Related Work
VI Re-ID must address not only intra-modal differences but also the cross-modal disparities arising from heterogeneous images. Therefore, alleviating cross-modal disparities is crucial, as they can exacerbate existing intra-modal differences.
To tackle these challenges, researchers have attempted modality-shared feature learning approaches [19][20][21][22][23], which focus on extracting discriminative and robust features from heterogeneous modalities for the model's learning process. For instance, Wu et al. [8] introduce the large and challenging benchmark dataset SYSU-MM01 and propose a deep one-stream zero-padding network for RGB-IR image matching. Additionally, Fu et al. [19] present a cross-modality neural architecture search method to enhance the effectiveness of neural network structures for VI Re-ID tasks. Furthermore, Zheng et al. [24] adopt eight attributes as annotation information in PAENet to learn detailed semantic attribute information.
In recent years, image generation-based methods [14,25] have also been applied to the VI Re-ID task, aiming to narrow the gap between visible and infrared modalities by adopting modality auxiliary information. For this purpose, Dai et al. [13] introduce a framework based on Generative Adversarial Networks (GANs) [14] for cross-modal image generation and propose cmGAN for feature learning. Similarly, Wei et al. [26] propose a comprehensive modality generation module that combines features from different modalities to create a new modality, effectively integrating multi-modal information. Additionally, Lu et al. [25] introduce the Progressive Modality-shared Transformer (PMT), which employs grayscale images as an auxiliary modality to enhance the reliability and commonality of visual features across different modalities, addressing the negative effects of modality disparities.
Furthermore, it is worth noting that among the various types of information in cross-modal images (such as color, texture, and contour), contour information is crucial for cross-modal retrieval, as it exhibits strong modality invariance. However, previous researchers did not consider using it as auxiliary information to alleviate modality differences. Therefore, we introduce it as auxiliary enhancement information to improve the image-matching capability in VI Re-ID tasks.

Method
Our VI Re-ID framework is based on a two-stream convolutional neural network [4] but incorporates contour information to enhance the model's cross-modal adaptability. The proposed method is outlined in Figure 2, and the specific steps are as follows: First, the Cross-modality Graph Sampler (Section 3.1) samples the dataset so that categories that are close in distance are placed into the same batch. The obtained samples come from two modalities, visible and infrared, and both are passed through contour extraction to produce their respective contour images. Next, these four types of images (the two original modalities and their corresponding contour images) are separately fed into their corresponding backbone networks. Finally, the resulting sample features and contour features are fused in the Contour Expansion Module (Section 3.2) to mitigate the differences between the visible and infrared modalities.
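To make the data flow described above concrete, the following is a minimal sketch of one training iteration; all names (`backbones`, `cem_fuse`, `criterion`, the batch keys) are hypothetical placeholders rather than the authors' released interface.

```python
import torch

def training_step(batch, backbones, cem_fuse, criterion):
    # batch is assumed to come from the CGS sampler (Section 3.1): neighboring
    # classes only, with visible/infrared images, their contour maps, and labels.
    x_vis, x_inf = batch["vis"], batch["inf"]
    x_visc, x_infc = batch["vis_contour"], batch["inf_contour"]
    labels = batch["labels"]

    # Four streams: the two original modalities and their contour images.
    f_vis, f_inf = backbones["vis"](x_vis), backbones["inf"](x_inf)
    f_visc, f_infc = backbones["visc"](x_visc), backbones["infc"](x_infc)

    # CEM (Section 3.2): fuse contour features with the image features.
    feats = torch.cat([cem_fuse(f_vis, f_visc), cem_fuse(f_inf, f_infc)], dim=0)
    targets = torch.cat([labels, labels], dim=0)
    return criterion(feats, targets)
```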

Contour Expansion Module
The goal of contour detection is to identify pixels in the image that correspond to regions with significant changes in grayscale values. Recently, studies [23,25] have successfully applied contour detection to object detection and semantic segmentation. The challenge in the VI Re-ID task lies in the significant differences between the visible and infrared modalities, while contour information exhibits strong modality invariance. This inspires us to apply contour detection to the VI Re-ID task to alleviate the modal discrepancies. First, the pre-trained SCHP (Self-Correction Human Parsing) model [29] is adopted as the contour detector to segment person contour maps from the images. Taking the visible image as an example, the contour detection of the visible image $X^{vis}$ is $X^{visc} = \sigma(X^{vis})$, where $\sigma(\cdot)$ denotes the contour detector and $X^{visc}$ represents the person contour map generated from the visible image $X^{vis}$.
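For illustration, a person contour map can be obtained from the foreground mask produced by a human-parsing model such as SCHP, or from a plain edge detector; the OpenCV sketch below assumes a binary person mask is already available and is not the exact SCHP post-processing.

```python
import cv2
import numpy as np

def contour_from_mask(person_mask: np.ndarray) -> np.ndarray:
    """Turn a binary person mask (H x W, values in {0, 1}) into a thin contour map."""
    mask = (person_mask > 0).astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    # Morphological gradient = dilation - erosion, i.e., the boundary of the mask.
    return cv2.morphologyEx(mask, cv2.MORPH_GRADIENT, kernel)

def contour_canny(image_gray: np.ndarray) -> np.ndarray:
    """Alternative contour detector compared in Section 4.5: Canny edge detection."""
    return cv2.Canny(image_gray, 100, 200)
```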

Then, the obtained contour information needs to be integrated into the original image information. However, fusion methods [30-32] vary widely. Therefore, this paper investigates the impact of different fusion methods, namely element-wise addition and concatenation, as shown in Figure 3. Specifically, element-wise addition emphasizes employing the contour feature to supplement person image-related semantic information, while element-wise concatenation expands the feature dimension without losing the respective information of the person image and the contour. The specific fusion methods employed in this study are as follows.
Element-wise addition. As shown in Figure 3b,c, we perform feature addition either before the CNN or after Conv Block n (1 ≤ n ≤ 5):

$F^{vism}_{i} = F^{vis}_{i} + F^{visc}_{i},$

where $F^{vism}_{i}$ represents the visible feature after merging the contour feature $F^{visc}_{i}$ into the basic visible feature $F^{vis}_{i}$, and $i \in \{RGB, conv\text{-}1, \dots, conv\text{-}5\}$ indexes the fusion positions shown in Figure 3b,c. Specifically, $F^{vism}_{RGB}$ represents the three-channel result obtained when the fusion is applied directly to the RGB input data.
Element-wise concatenation. As depicted in Figure 3a, we augment the local features by incorporating the global feature of the visible image contours. First, the visible image and the contour image are separately processed through the CNN to obtain features $F^{vis}_{conv\text{-}5} \in \mathbb{R}^{c\times h\times w}$ and $F^{visc}_{conv\text{-}5} \in \mathbb{R}^{c\times h\times w}$, respectively. Next, the output features are subjected to Generalized Mean Pooling for local and global feature pooling, resulting in features $F^{vis}_{local} \in \mathbb{R}^{c\times p}$ and $F^{visc}_{global} \in \mathbb{R}^{c}$:

$F^{vis}_{local} = \mathrm{GEMPooling}\big(F^{vis}_{conv\text{-}5}, (h/p,\ w)\big), \qquad F^{visc}_{global} = \mathrm{GEMPooling}\big(F^{visc}_{conv\text{-}5}, (h,\ w)\big),$

where $\mathrm{GEMPooling}(z, (x, y))$ applies Generalized Mean Pooling [33] to $z$ with a two-dimensional window of height $x$ and width $y$, and $p$ is the number of local body parts in the visible image. Then, 1 × 1 convolutional layers are utilized to adjust the number of feature channels in $F^{vis}_{local}$ and $F^{visc}_{global}$ to C. Finally, by concatenating the local feature $F^{vis}$ of the visible image with its contour image's global feature $F^{visc}$, the new visible feature $F^{vism}$ is obtained:

$F^{vism} = \mathrm{concat}\big(F^{vis}, F^{visc}\big),$

where $\mathrm{concat}(e, f)$ denotes the concatenation of feature $e$ and feature $f$. Considering the comparative experiments in Section 4.5.2, element-wise concatenation accomplishes contour enhancement better than the other variants and is therefore chosen as the fusion method for contour information in CEM.
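The element-wise concatenation variant can be sketched in PyTorch as follows; this is a minimal illustration that assumes ResNet-50 Conv Block 5 features (2048 channels), p = 6 parts, and an output channel number of 512 as reported in the later comparison experiments, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gem_pool(x: torch.Tensor, output_size, p: float = 3.0, eps: float = 1e-6):
    """Generalized Mean Pooling over spatial dimensions to the given output size."""
    return F.adaptive_avg_pool2d(x.clamp(min=eps).pow(p), output_size).pow(1.0 / p)

class ContourConcatFusion(nn.Module):
    """CEM-style fusion: p local parts from the image stream + 1 global contour feature."""
    def __init__(self, in_channels: int = 2048, out_channels: int = 512, parts: int = 6):
        super().__init__()
        self.parts = parts
        self.local_proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.global_proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, f_vis: torch.Tensor, f_visc: torch.Tensor) -> torch.Tensor:
        # f_vis, f_visc: (B, C, H, W) feature maps from Conv Block 5.
        local = gem_pool(f_vis, (self.parts, 1))           # (B, C, p, 1) local parts
        global_ = gem_pool(f_visc, (1, 1))                 # (B, C, 1, 1) contour global
        local = self.local_proj(local).flatten(2)          # (B, out, p)
        global_ = self.global_proj(global_).flatten(2)     # (B, out, 1)
        return torch.cat([local, global_], dim=2)          # (B, out, p + 1)
```

Concatenation is used here because, unlike addition, it keeps the image and contour representations side by side instead of mixing them into a single channel-wise sum.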

Cross-Modality Graph Sampler
Both DG Re-ID [34][35][36][37][38] and VI Re-ID [5,[9][10][11][12] face the challenge of modality differences. Furthermore, conventional sampling methods exhibit significant randomness, making them insufficient for distinguishing the boundaries between hard classes. In contrast, the CGS sampler effectively addresses this limitation by grouping similar samples into the same batch. The details of CGS are introduced below in conjunction with Figure 4.
Before each epoch, we calculate the distances between classes using the latest trained model and then construct a graph encompassing all classes, which allows us to leverage the relationships between classes for informative sampling. To this end, one image per class is randomly chosen to form a smaller sub-dataset. Next, the features $F^{vism} \in \mathbb{R}^{C\times d}$ are extracted through the latest trained model, where C represents the total number of training classes and d is the feature dimension. Subsequently, the pairwise Euclidean distances between all the selected samples are computed from $F^{vism}$, yielding a distance matrix $distv \in \mathbb{R}^{C\times C}$ over all classes:

$distv = \varphi\big(F^{vism}, F^{vism}\big),$

where $\varphi(x, y)$ denotes the pairwise Euclidean distances between the feature vectors in $x$ and $y$. Similarly, the process is applied to the infrared modality sample set:

$disti = \varphi\big(F^{infm}, F^{infm}\big),$

where $F^{infm} \in \mathbb{R}^{C\times d}$ is the feature extracted from the infrared samples. Afterwards, to obtain the neighboring classes across different modalities, the overall class distance matrix $dist$ is obtained by adding $distv$ and $disti$ together:

$dist = distv + disti.$

Later, the top k − 1 nearest neighboring classes of each class c are denoted by N(c) = {x_i | i = 1, . . ., k − 1}, where k is the number of classes to sample in each mini-batch. Subsequently, a graph G = (V, E) is built, with V = {c | c = 1, . . ., C} representing the vertices, where each class corresponds to one node, and E = {(c_1, c_2) | c_2 ∈ N(c_1)} representing the edges. Finally, according to the graph G, we perform random sampling of S instances per class to create a mini-batch containing B = k × S samples for training. This approach allows us to establish connections between classes based on their proximity, enabling informative sampling for our training process.
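A minimal sketch of this sampling procedure is given below; it assumes the per-class features F^vism and F^infm have already been extracted with the latest model, and the instance bookkeeping (sampling with replacement) is a simplification of the description above rather than the released code.

```python
import random
import torch

def build_cgs_batches(f_vism, f_infm, class_to_indices, k=8, s=4):
    """f_vism, f_infm: (C, d) per-class features from the latest model.
    class_to_indices: dict class_id -> list of sample indices for that class.
    Returns a list of mini-batches, each containing k classes x s instances."""
    dist_v = torch.cdist(f_vism, f_vism)            # (C, C) visible class distances
    dist_i = torch.cdist(f_infm, f_infm)            # (C, C) infrared class distances
    dist = dist_v + dist_i                          # cross-modal class distances
    dist.fill_diagonal_(float("inf"))               # exclude the class itself

    num_classes = f_vism.size(0)
    batches = []
    for anchor in random.sample(range(num_classes), num_classes):
        # top (k - 1) nearest neighboring classes of the anchor class
        neighbors = torch.topk(dist[anchor], k - 1, largest=False).indices.tolist()
        batch = []
        for c in [anchor] + neighbors:
            pool = class_to_indices[c]
            batch.extend(random.choices(pool, k=s))  # s instances per class
        batches.append(batch)                        # |batch| = k * s
    return batches
```

In practice, the returned index lists would be wrapped in a batch sampler so that each epoch iterates over batches of nearest-neighbor classes.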

Loss Function
Triplet loss $L_{tri}$ [39] and cross-entropy loss $L_{id}$ are the fundamental losses for image classification tasks. Moreover, the Barlow Twins loss $L_{ssd}$ [40], a self-supervised learning loss, is also introduced into our method to improve its performance. The overall loss is the combination of these three terms.

Datasets
The SYSU-MM01 [8] dataset consists of 491 different identities captured by four visible cameras and two infrared cameras. It encompasses two search modes: All-Search mode and Indoor-Search mode. Specifically, in the All-Search mode, the gallery set comprises all images captured by the visible cameras, allowing researchers to explore scenarios where all available visible cameras are employed for Re-ID tasks. In contrast, the Indoor-Search mode utilizes images from indoor visible cameras as the gallery set and is employed for studying Re-ID tasks in indoor environments. The training set comprises 19,659 visible (VIS) images and 1792 infrared (NIR) images, covering 395 distinct person identities. The test set consists of 3803 infrared images from 96 different person identities, serving as the query set.
The RegDB [41] dataset comprises pairs of images captured by visible and infrared cameras. It contains images of 412 different identities, with each identity having 10 visible images and 10 infrared images. These images are captured by a pair of cameras with overlapping views, providing a comprehensive set of data for evaluation. Additionally, in order to effectively validate various methods, the dataset offers two testing protocols: Infrared-to-Visible (IR-to-VIS) and Visible-to-Infrared (VIS-to-IR).

Evaluation Protocol
To assess performance on both datasets, we employ standard evaluation protocols, using Cumulative Matching Characteristics (CMC) [42] and mean Average Precision (mAP) [43] as evaluation metrics. Specifically, we conduct ten tests and report the average results across these tests.
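As a reference for how these two metrics are computed, the sketch below evaluates a query-gallery distance matrix; standard per-camera filtering and the repeated-trial averaging are omitted for brevity.

```python
import numpy as np

def cmc_map(dist, q_ids, g_ids, max_rank=10):
    """dist: (num_query, num_gallery) distances; q_ids, g_ids: numpy arrays of identity labels."""
    order = np.argsort(dist, axis=1)                      # gallery sorted by ascending distance
    matches = (g_ids[order] == q_ids[:, None]).astype(np.int32)

    cmc = np.zeros(max_rank)
    aps = []
    for row in matches:
        if not row.any():                                 # query identity absent from gallery
            continue
        first_hit = int(row.argmax())
        if first_hit < max_rank:
            cmc[first_hit:] += 1                          # CMC counts a hit at rank >= first_hit
        precisions = row.cumsum() / (np.arange(len(row)) + 1)
        aps.append((precisions * row).sum() / row.sum())  # average precision for this query
    num_valid = len(aps)
    return cmc / num_valid, float(np.mean(aps))           # CMC curve and mAP
```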

Implementation
The proposed method is implemented with the PyTorch deep learning framework and runs on an NVIDIA RTX 3090 GPU. Following existing VI Re-ID methods, a pre-trained ResNet-50 [28] is employed as the backbone network. During training, all images are resized to 288 × 144, and data augmentation techniques (random cropping and random horizontal flipping) [44] are applied.
The training process uses the stochastic gradient descent (SGD) optimizer with a momentum of 0.9. The initial learning rate is set to 0.01, and a warm-up strategy is employed to adjust the learning rate. Specifically, the learning rate is initialized to 0.01 and decayed by a factor of 10 every twenty epochs. Training is stopped after 60 epochs. The number p of local body parts in Formula (3) is set to 6.
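The schedule above can be written out as follows; the warm-up length and the relative weighting of the three loss terms are not stated in the text, so the values used here (a 5-epoch linear warm-up and equal loss weights) are illustrative assumptions.

```python
import torch

def make_optimizer_and_scheduler(model, base_lr=0.01, warmup_epochs=5):
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                 # linear warm-up (assumed length)
            return (epoch + 1) / warmup_epochs
        return 0.1 ** (epoch // 20)               # decay by 10x every twenty epochs

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

def total_loss(l_id, l_tri, l_ssd, w_ssd=1.0):
    # Overall objective: identity + triplet + Barlow Twins terms (weights assumed equal).
    return l_id + l_tri + w_ssd * l_ssd
```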
Comparison with State-of-the-Art Methods
Furthermore, when compared with the current best cross-modal feature learning method, AGMNet [51], our approach also achieves remarkable performance gains. In the All-Search mode, our method surpasses AGMNet by 4.43% in mAP and 3.34% in Rank-1 accuracy. Similarly, in the Indoor-Search mode, our method outperforms AGMNet by 1.83% in mAP and 3.54% in Rank-1 accuracy.
Evaluations on RegDB. In comparison with the best-performing method in the VIS-to-IR mode, our approach demonstrates remarkable superiority. Specifically, our method achieves an mAP that is 4.86% higher in this mode, showcasing its effectiveness in handling cross-modal matching between visible and infrared images.
Similarly, in the IR-to-VIS mode, our method outperforms the best-performing method with an mAP that is 10.67% higher. This result highlights the capability of our approach to effectively address the challenges of cross-modal matching between infrared and visible images.
This impressive performance demonstrates the versatility and effectiveness of our proposed method in handling the VI Re-ID task.

Ablation Study
In this section, a comprehensive ablation study is conducted to evaluate the contributions of the Contour Expansion Module (CEM) and the Cross-modality Graph Sampler (CGS) in our proposed approach. By systematically adding or removing these modules, we investigate their individual impacts on the performance of our model. The results are presented in Table 2, which shows the mAP and Rank-1 accuracy for each experimental setting. First, we establish a baseline model [4] that comprises basic feature extraction and re-ranking with K-reciprocal encoding [52]. In the subsequent analysis, the All-Search evaluation protocol on the SYSU-MM01 dataset [8] is used as the benchmark for comparison. The baseline model achieves an mAP of 70.54% and a Rank-1 accuracy of 72.97%.
In the next step, to assess the influence of the CEM module, we integrate it into the baseline model. The inclusion of the CEM module results in significant performance gains, with mAP and Rank-1 accuracy increasing by 7.74% and 8.51%, respectively. This demonstrates that the CEM module effectively enhances feature representation and reduces the modality discrepancy, contributing to the overall improvement in VI Re-ID performance.
Then, the impact of the CGS module is evaluated by incorporating it into the baseline model. The addition of the CGS module also leads to notable performance improvements, with mAP and Rank-1 accuracy increasing by 2.45% and 1.61%, respectively. The CGS module facilitates informative and challenging sample selection, effectively optimizing the training data and further enhancing the model's discriminative capability.
Finally, we examine the combined effect of both the CEM and CGS modules by integrating them into the baseline model simultaneously. This joint integration yields remarkable performance enhancements, with mAP and Rank-1 accuracy increasing by 10.24% and 9.73%, respectively. The synergistic interplay between CEM and CGS reinforces the feature representation and sample selection aspects, leading to substantial overall improvements in the VI Re-ID task.
The ablation study demonstrates the effectiveness and significance of both the CEM and CGS modules in our proposed approach. The CEM module successfully leverages contour information to enhance feature representation, while the CGS module optimizes the sampling strategy for informative and challenging examples. By understanding the individual contributions of these modules, our study offers valuable insights into the design of a robust and efficient VI Re-ID model.

Comparison Experiments
Comparison Experiment of Sampling Methods
In our comparative experiments on different sampling methods, namely the Random Sampler, the Uniform Sampler, and our proposed Cross-modality Graph Sampler (CGS), we observe significant differences in their performance in Figure 5. Specifically, our CGS sampling method outperforms both Random Sampling and Uniform Sampling by a notable margin. The mAP achieved with CGS is 1.81% higher than Random Sampling and 2.41% higher than Uniform Sampling on the SYSU-MM01 dataset in the All-Search mode. This result clearly demonstrates the superiority of our CGS sampling approach in improving the overall performance of the VI Re-ID task. Compared to other sampling methods that exhibit randomness, CGS leverages the relationships among classes and ensures that instances within a batch are mostly similar, providing informative and challenging examples for discriminative learning. By incorporating such informative sampling, our method is better able to handle cross-modal challenges and effectively capture subtle differences between similar features, leading to the improved performance observed in the experiments (Table 3).

Comparison Experiment of Fusion Methods
In addition, considering Section 3.2, comparative experiments need to be conducted on the different feature fusion methods for contour information, namely element-wise addition and element-wise concatenation (EC). The results are shown in Figure 5a, which help determine which fusion method performs best in the task. Specifically, when our model adopts the element-wise concatenation fusion method, an mAP and Rank-1 accuracy of 72.97% and 70.54% are achieved, outperforming the other fusion methods and showing the best overall fusion performance. Taking into account the above analysis, the element-wise concatenation fusion method is employed for integrating contour information in this paper.

Comparison Experiment of the Contour Detectors
For contour extraction, two options are considered: Canny edge detection and Self-Correction Human Parsing (SCHP). A comparative analysis is therefore conducted, and the specific results are shown in Table 4. Specifically, under the Indoor-Search mode, the SCHP method outperforms the Canny edge detection method with a Rank-1 accuracy improvement of 7.93% and an mAP improvement of 5.65%. These results indicate that the SCHP method is better suited for contour extraction in our approach. Furthermore, before employing the element-wise concatenation fusion method to merge the contour global features with the local features obtained from modal-shared feature learning, the number of output channels of the contour global features also deserves attention. In Figure 5b, we compare different output channel numbers for the contour global features after the 1 × 1 convolution. Our model achieves better performance when the output channel number is set to 512, which indicates that this setting enhances the representation capability of the model for fusing contour information and confirms the significance of selecting an appropriate output channel value for the effective utilization of contour global features.

Qualitative Analysis
In this section, we compare our proposed method with the AGW [4] approach on the SYSU-MM01 dataset. For the comparison, two sample images are selected as query samples, one depicting the frontal view and the other the rear view of individuals. The Rank-10 visualization results are presented in Figure 6. Upon analyzing the results, we observe that our method, with the inclusion of contour information and the utilization of the CGS sampler, effectively improves the retrieval performance. Specifically, in the case of the AGW method, there are two erroneous matches in the rear-view image retrieval, where both erroneous images belong to the same class. In contrast, our method achieves correct Rank-10 results for all samples, indicating the superiority of the CGS sampler in distinguishing between similar samples from different classes. Moreover, when using the frontal view as a query sample, the AGW method shows three incorrect matches. These errors can be attributed to the similarity between the backgrounds of these samples and that of the query sample. However, there are noticeable differences in the contour information between them. After enhancing the contour information in our method, only one matching error is observed, showcasing the significant role of contour assistance in enhancing matching capability.
Overall, these qualitative analyses demonstrate that the integration of contour information and the utilization of the CGS sampler effectively address the challenges posed by modal discrepancies and improve the precision and accuracy of the VI Re-ID task.

Conclusions
In this paper, we propose the Graph Sampling-based Multi-stream Enhancement Network (GSMEN) for the VI Re-ID task. GSMEN integrates global contour features with the local features obtained from modal-shared feature learning, aiming to enhance feature representation and reduce the cross-modal discrepancy. Our approach introduces the Contour Expansion Module (CEM) for fusing contour-enhanced features with local features and the Cross-modality Graph Sampler (CGS) for effective batch sampling. Experimental results on large-scale datasets demonstrate significant improvements in matching performance and modal adaptability. Our contributions include the novel CEM approach and the efficient CGS sampler, which show promising potential for VI Re-ID in various applications.
Author Contributions: Conceptualization, J.J. and J.X.; Data curation, J.J., J.X. and W.Z.; Formal analysis, J.J., J.X., W.Z., R.R., R.W., T.L. and S.X.; Funding acquisition, W.Z.; Investigation, J.J., J.X., W.Z., R.R., R.W., T.L. and S.X.; Methodology, J.J., W.Z. and R.R.; Project administration, W.Z.; Resources, J.J.; Software, J.J.; Supervision, W.Z. and R.R.; Validation, R.W. and T.L.; Writing-original draft, J.J.; Writing-review and editing, W.Z., R.R. and J.X. All authors have read and agreed to the published version of the manuscript.

Figure 1 .
Figure 1. (a) Visible images and infrared images utilize the extended contour information obtained through contour detection to alleviate the modality discrepancy. Consequently, it becomes easier to match the same person between the visible and infrared modalities. (b) Different shapes represent different classes in the dataset. The CGS sampler first selects one class as an anchor. Next, it identifies the top-k nearest neighboring classes based on their distances to the anchor class. These selected neighboring classes are then included in the same batch for training.

Figure 2 .
Figure 2. Framework of GSMEN. (a) CGS divides the samples into N separate batches. (b) The inputs are divided into three categories: visible images, infrared images, and contour images obtained through contour detection from the two modalities. (c) ResNet-50 [27] is introduced as the base backbone network, supplemented with Non-local Attention [28] to enhance feature extraction. (d) The CEM integrates local features from both the visible and infrared modalities with global features from the contour modality.


Figure 3 .
Figure 3. Fusion methods for contour information.


Figure 4 .
Figure 4. Framework of the CGS. The feature distances (distv and disti) for each class of samples in the visible and infrared modalities are obtained based on the latest trained model. Then, they are merged to obtain the cross-modal distances, denoted as dist. Next, the nearest neighbor classes are grouped into the same batch based on the distance dist to complete the sampling process.

Figure 5 .
Figure 5. Training and testing on SYSU-MM01 dataset with All-Search mode.


Table 1 .
Comparing data (%) between our method and other VI Re-ID methods. Red and bold signify the best result, while blue indicates the second-best result.

Table 2 .
Ablation experiment results of our method. Training on SYSU-MM01 dataset. The bold indicates the best result.

Table 3 .
Comparison experiments of different sampling methods. Training on SYSU-MM01 dataset. The bold indicates the best result.


Table 4 .
Comparison experiments of different contour detectors. Training on SYSU-MM01 dataset. The bold indicates the best result.