Cross-Modality Person Re-Identification via Local Paired Graph Attention Network

Cross-modality person re-identification (ReID) aims to retrieve RGB images of a pedestrian given infrared (IR) query images, and vice versa. Recently, some approaches have constructed a graph to learn the relevance of pedestrian images of distinct modalities and thereby narrow the gap between the IR and RGB modalities, but they omit the correlation between paired IR and RGB images. In this paper, we propose a novel graph model called Local Paired Graph Attention Network (LPGAT). It uses the paired local features of pedestrian images from different modalities to build the nodes of the graph. For accurate propagation of information among the nodes, we propose a contextual attention coefficient that leverages distance information to regulate the node-update process. Furthermore, we put forward Cross-Center Contrastive Learning (C3L) to constrain how far local features are from their heterogeneous centers, which is beneficial for learning a complete distance metric. We conduct experiments on the RegDB and SYSU-MM01 datasets to validate the feasibility of the proposed approach.


Introduction
The purpose of person re-identification (ReID) [1][2][3][4] is to match pedestrians across multiple non-overlapping cameras, which can be regarded as a specific person-retrieval task. It is widely applied in smart cities, autonomous driving, security surveillance, and so on. However, most person ReID methods focus on matching pedestrians captured by RGB cameras, and therefore cannot support 24-hour intelligent surveillance. To overcome this limitation, some researchers have turned to cross-modality person ReID.
Cross-modality person ReID [5][6][7][8][9] retrieves RGB pedestrian images from infrared (IR) pedestrian images and vice versa. It not only inherits the challenges of unimodality person ReID, such as the variations in postures, illumination, and camera view, but also possesses a large discrepancy between IR modality and RGB modality. The modality discrepancy results in an unreliable match due to different color and appearance information of IR images and RGB images.
Recently, some cross-modality person ReID methods have been proposed to learn feature representations and metric functions for both IR and RGB images. Regarding feature representations, many methods [10][11][12] extract modality-specific features and modality-shared features by designing dual-stream deep networks. Meanwhile, local features [13][14][15] have also been demonstrated to be effective for cross-modality person ReID. Furthermore, some approaches [16][17][18][19] apply graph convolution layers, which aggregate features from other pedestrian images to enhance the discriminative power of features. They treat each pedestrian image as a node of a graph and update the nodes based on their correlations, as shown in Figure 1a. However, they ignore the relationship between the pairs of IR images and RGB images when constructing the graph, therefore hindering the learning of discriminative features for pedestrian images of different modalities.
Figure 1. The proposed method treats paired local features from different modalities as a node of a graph. We distinguish the pedestrian identities using different colors, with the same color indicating the same pedestrian. Circles and triangles are used to represent the IR and RGB modalities, respectively.
In metric learning, some works [20][21][22] propose various losses to minimize the distance between pedestrian images from different modalities for cross-modal person ReID. These losses target the learning of an embedding space for different modalities in which images of same-identity pedestrians are closer to each other and images of different-identity pedestrians are further away. For this purpose, these methods constrain the distance among the pedestrian images from different modalities [23,24] or the distance among the centers from different modalities [25,26]. However, they overlook the distance between the pedestrian image and its center of different modalities, which results in incomplete distance learning between different modalities.
In this paper, we propose a novel graph network entitled Local Paired Graph Attention Network (LPGAT) for cross-modality person ReID. This approach considers the correlation of paired pedestrian images from different modalities and local information in a unified framework. Specifically, we design the proposed method as a two-stream network in which each branch corresponds to one modality. To learn local features, we uniformly divide the feature maps of each stream in the horizontal direction. We then construct a graph from the local features, where each node is composed of the corresponding local features of paired pedestrian images from different modalities, as illustrated in Figure 1b. Each selected pair must consist of local features from two different modalities. For better propagation of information in the graph, we further propose the contextual attention coefficient, which considers not only the node features but also the relationship between the nodes. Hence, the proposed LPGAT can directly learn the relationship between the paired local features from different modalities and accurately propagate information between them.
Recently, distance constraints among the centers of distinct modalities have proven effective for cross-modality person ReID. However, such constraints do not act on the features that are far from the centers. To overcome this, we propose Cross-Center Contrastive Learning (C3L), which reduces the distance between local features and their heterogeneous centers to narrow the gap between heterogeneous modalities. Combined with the constraint between heterogeneous centers, the proposed C3L helps learn a complete distance metric for IR images and RGB images, and therefore yields discriminative features. The primary contributions are outlined below:

• We propose LPGAT for cross-modality ReID. In contrast to previous approaches that only use pedestrian images from different modalities as the nodes of a graph, LPGAT uses the paired local features from different modalities as the nodes of a graph, thus alleviating the gap between the two modalities.
• We propose C3L to constrain local features and their heterogeneous centers. In contrast to previous methods that only constrain the distance between the centers of different modalities, C3L constrains the features that are far from the center, thus narrowing the gap between heterogeneous modalities.
• We compare the proposed method against state-of-the-art methods using two publicly accessible datasets, RegDB and SYSU-MM01, and our results demonstrate that the proposed method outperforms them.

Cross-Modality Person ReID
To address the challenge of the modality gap for cross-modality person ReID, many methods have been proposed to derive global or local features from heterogeneous pedestrian images. For the global features, Wu et al. [27] put forward the deep zero-padding model to learn complementary information from IR images and RGB images. Then, Ye et al. [20] introduced a dual-stream network to capture the globally shared information of IR images and RGB images. Chen et al. [6] proposed Neural Feature Search (NFS) to automate feature selection, which allows the network to filter the background noise and focus on the important portions of pedestrian images.
Since part-level information about pedestrians is crucial for cross-modality person ReID, several researchers have proposed learning local features from IR images and RGB images. Zhu et al. [21] and Sun et al. [22] developed a deep two-stream framework to capture local features to mitigate modality differences. Zhang et al. [28] proposed Dual-Alignment Part-aware Representation (DAPR) to simultaneously reduce the modality gap and learn discriminant features from the local and global aspects. In this paper, we adopt a two-stream deep network to learn local features, and aggregate paired local features of pedestrian images from different modalities.

Graph Attention Networks
Graph Convolutional Networks (GCNs) [29,30] have been proposed to handle non-Euclidean data. A GCN learns node features by propagating information among nodes and their neighborhoods. GCNs have been widely applied to vision-related tasks, including semantic segmentation [31] and face analysis [32]. Later, the Graph Attention Network (GAT) [33,34] was proposed to aggregate node features using attention weights.
Recently, more and more researchers have combined GCN or GAT with Convolutional Neural Networks (CNNs) for person ReID [16,19,35]. Ye et al. [16] put forward Dynamic Dual Attention Aggregation (DDAG), in which each pedestrian image is regarded as a node of the graph, and the relationship between the node and its neighborhoods is mined. Zhang et al. [19] treated each body part as a node and constructed a graph from a single pedestrian image to alleviate intra-modality variations. In this paper, a graph is constructed using paired local features derived from distinct modalities, and contextual attention coefficients are introduced for better propagation.

Contrastive Learning
The purpose of contrastive learning [36,37] is to learn discriminative features using image pairs, so that similar images are close to each other, while dissimilar ones are far away. It can be applied in both unsupervised and supervised learning, for tasks such as image classification [37] and object detection [38]. Recently, contrastive learning has been used in person ReID to improve the discrimination of features [39,40]. For example, Chen et al. [39] proposed Inter-instance Contrastive Encoding (ICE) to fully explore the relationship between different pedestrian images. Isobe et al. [41] presented the Cluster-wise Contrastive Learning (CCL) algorithm to learn noise-robust features for cross-domain person ReID. Inspired by these applications of contrastive learning in cross-domain ReID, we propose C3L, which reduces the distance between local features and their heterogeneous centers to narrow the modality gap for cross-modality person ReID.

Approach
In this section, an outline of the proposed method is first presented, as depicted in Figure 2. The proposed method contains three key components, namely the Local Feature Extractor, the LPGAT Module, and C3L. Then, we introduce each of them in detail. Finally, we describe how the proposed approach is optimized.
Figure 2. The framework of our approach. We first apply the Local Feature Extractor to obtain the local features from different modalities. Then, we propose the LPGAT module to learn the correlation between the paired local features from different modalities. The same color indicates the same pedestrian, and the circle and the triangle represent IR and RGB modalities, respectively. We also use the proposed C3L to optimize the network, which constrains the distance between the local features and their heterogeneous centers.

Overview
Local Feature Extractor. The Local Feature Extractor is designed as a two-stream network where two individual ResNet-50 [42] are adopted as the backbone. Then, we divide the feature maps output by ResNet-50 horizontally and apply the global average pooling (GAP) to obtain the local features. Afterward, we apply the fully connected (FC) layers to reduce the dimension of local features, where the weights of FC layers are shared.
LPGAT Module. We regard the difference between paired local features of pedestrian images from different modalities as the node of a graph to learn the paired correlation of local features. Then, we update the nodes using the contextual attention coefficient which injects the distance information between the nodes into the process of information propagation.
Cross-Center Contrastive Learning. We introduce C3L to optimize the similarity between the local features and their heterogeneous centers. We then combine C3L with other metric functions to obtain a complete distance metric.

Local Feature Extractor
The Local Feature Extractor possesses two streams, and two individual pre-trained ResNet-50 networks are used as the backbone, where the stride of the convolution operation in the last layer is modified from 2 to 1. We feed the pedestrian images of IR modality and RGB modality into the two streams, respectively. Later, we obtain the feature maps of pedestrian images with the size of W × H × C, where W and H denote the width and the height of feature maps, and C is the number of channels. Afterward, we uniformly split the feature maps into P part-level stripes. We conduct GAP on each part-level stripe to obtain the local feature. The p-th local feature of the i-th RGB image is denoted as $f^{R}_{i,p} \in \mathbb{R}^{C \times 1}$, where p = 1, ..., P. Similarly, the local feature of the IR image is denoted as $f^{I}_{j,p} \in \mathbb{R}^{C \times 1}$. Finally, we apply the FC layers with shared weights to reduce the dimension of the local features from C to D.
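The stripe-and-pool step described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the authors' implementation: the function names `stripe_features` and `shared_fc` are hypothetical, the feature map is a nested list indexed as [row][column][channel], and "shared weights" is assumed here to mean that the same FC parameters are applied to both modalities.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][column][channel], i.e. H x W x C


def stripe_features(feature_map: FeatureMap, num_parts: int) -> List[List[float]]:
    """Split an H x W x C map into `num_parts` horizontal stripes and
    average-pool each stripe into one C-dimensional local feature."""
    height = len(feature_map)
    if height % num_parts != 0:
        raise ValueError("height must be divisible by the number of stripes")
    rows_per_part = height // num_parts
    channels = len(feature_map[0][0])
    local_features = []
    for p in range(num_parts):
        rows = feature_map[p * rows_per_part:(p + 1) * rows_per_part]
        cells = [cell for row in rows for cell in row]  # every spatial position in the stripe
        pooled = [sum(cell[c] for cell in cells) / len(cells) for c in range(channels)]
        local_features.append(pooled)
    return local_features


def shared_fc(feature: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Linear map from C to D dimensions. ASSUMPTION: the same (weight, bias)
    pair is reused for RGB and IR features, which is how sharing is read here."""
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weight, bias)]
```

For instance, if the backbone's overall stride is 16 (an assumption consistent with a ResNet-50 whose last-stage stride is set to 1), a 288 × 144 input gives an 18 × 9 map, so P = 6 would yield stripes of 3 rows each.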

LPGAT Module
Local features have proved robust to variations in viewpoint, pose, and so on [1,13]. Meanwhile, paired features derived from distinct modalities help reduce the modality gap. Hence, we propose to use the paired local features from the IR modality and RGB modality to construct a fully connected graph. Each node of the graph is defined as the difference between a pair of corresponding local features, one from the RGB modality and one from the IR modality. Collecting these nodes yields a fully connected graph, where U is the number of nodes in the graph. Please note that a node can equally be formed by subtracting the local feature of the RGB modality from the local feature of the IR modality.
After obtaining the graph, we need to calculate the attention coefficient that describes the correlation between different nodes. Many cross-modality person ReID approaches [16,17] calculate the attention coefficient between two nodes from their concatenation $[\cdot, \cdot]$, a learnable vector $q \in \mathbb{R}^{2D \times 1}$, and a nonlinear operation performed by LeakyReLU. From Equation (2), we can see that this coefficient directly concatenates the nodes but ignores the relationship between them, which leads to inaccurate information propagation. Thus, we propose the contextual attention coefficient, which introduces a term $k^{p}_{n,m}$ defined from the Euclidean distance $\|\cdot\|_{2}$ between the n-th and m-th nodes and the hyperparameter β. From Equation (4), we can see that a smaller distance between the nodes yields a larger $k^{p}_{n,m}$ and therefore a stronger correlation between the nodes. Hence, the contextual attention coefficient is helpful for accurate information propagation. Please note that when $k^{p}_{n,m}$ is set to 1, the contextual attention coefficient degenerates to the traditional attention coefficient.
With the contextual attention coefficient, each node is represented by aggregating the nodes of the graph according to these coefficients. Finally, to further improve the representation ability, the node is updated using a learnable matrix $w \in \mathbb{R}^{2D \times 2}$ and the ELU activation function φ, which helps learn a stable graph structure. We treat the optimization of LPGAT as a binary classification problem and use the verification loss $L^{p}_{g}$, where $\hat{z}^{p}_{i,j}$ is the predicted probability of the j-th node of $G^{p}_{i}$, and $z^{p}_{i,j}$ is the ground truth of the j-th node of $G^{p}_{i}$. $z^{p}_{i,j} = 1$ indicates that the paired local features in the node share the same identity; otherwise $z^{p}_{i,j} = 0$.
In short, we design each node of the graph in LPGAT as the paired local features derived from distinct modalities to effectively mitigate the discrepancy between the IR modality and RGB modality. Furthermore, we inject the distance information using the contextual attention coefficient to propagate the information between the nodes accurately.
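Read together with the notation used for the verification loss in this section (the j-th node of a graph indexed by i and p), the node definition plausibly takes the form below. The index convention, with one graph per RGB image i and part p and one node per IR image j, is an assumption inferred from that notation and not a quotation of the original equation. The features are taken after the FC reduction, so each node is D-dimensional.

```latex
g^{p}_{i,j} = f^{R}_{i,p} - f^{I}_{j,p}, \qquad
G^{p}_{i} = \left\{\, g^{p}_{i,1}, \ldots, g^{p}_{i,U} \,\right\}
```

Under this reading, the label attached to node $g^{p}_{i,j}$ is 1 exactly when RGB image i and IR image j show the same pedestrian.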
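To make the role of the distance term concrete, the following sketch shows one way a distance-modulated attention step and a binary verification loss could be written. It is an illustration under stated assumptions, not the paper's formulation: the form k = exp(-distance / β), the multiplication of k into the softmax numerator, the LeakyReLU slope of 0.2, and averaging the loss over nodes are all assumptions, and the learnable update matrix and ELU activation are omitted. The text only guarantees that k grows as the distance shrinks and that k = 1 recovers plain attention, which the sketch reproduces by passing `beta=None`.

```python
import math
from typing import List, Optional


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def euclidean(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contextual_attention(nodes: List[List[float]], q: List[float],
                         beta: Optional[float]) -> List[List[float]]:
    """Return a row-normalised attention matrix over `nodes`.

    ASSUMPTION: k = exp(-||g_n - g_m||_2 / beta) scales the exponentiated
    logit before normalisation; beta=None fixes k = 1 (plain attention).
    """
    attention = []
    for g_n in nodes:
        weights = []
        for g_m in nodes:
            # logit = LeakyReLU(q^T [g_n, g_m]); list addition concatenates the nodes
            logit = leaky_relu(sum(qi * xi for qi, xi in zip(q, g_n + g_m)))
            k = 1.0 if beta is None else math.exp(-euclidean(g_n, g_m) / beta)
            weights.append(k * math.exp(logit))
        total = sum(weights)
        attention.append([w / total for w in weights])
    return attention


def aggregate(nodes: List[List[float]], attention: List[List[float]]) -> List[List[float]]:
    """Replace each node by the attention-weighted sum of all nodes."""
    dim = len(nodes[0])
    return [[sum(a * g[d] for a, g in zip(row, nodes)) for d in range(dim)]
            for row in attention]


def verification_loss(probs: List[float], labels: List[int], eps: float = 1e-12) -> float:
    """Binary cross-entropy averaged over nodes (the averaging is an assumption)."""
    total = 0.0
    for p, z in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= z * math.log(p) + (1 - z) * math.log(1.0 - p)
    return total / len(probs)
```

With all-zero `q`, every logit is equal, so any difference between rows of the attention matrix comes from the distance term alone; this makes the effect of β easy to inspect on toy nodes.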

Cross-Center Contrastive Learning
For cross-modality person ReID, learning the distance metric is an effective way to narrow the modality gap. Recently, constraints on the centers of the RGB modality and IR modality have achieved promising performance [21,25,26]. However, they overlook the distance between features and their heterogeneous centers, resulting in some outliers in the learning process, as shown in Figure 3a.
In this paper, we put forward C3L to force the local features to be close to the corresponding heterogeneous centers in the embedding space, so that pedestrian images with the same identity from distinct modalities are gathered together, as shown in Figure 3b. The constraint between the p-th local feature of the i-th RGB image and its heterogeneous center is a contrastive term in which τ > 0 is a scalar temperature parameter, S is the number of identities, and ID(i, V) indicates the identity of the i-th RGB image (the superscript V denotes the visible, i.e., RGB, modality). Here, $c^{I}_{b,p}$ is the center of the p-th local feature of the b-th identity for IR images, computed over the IR images whose identity ID(i, I) is b, where $O_{b}$ is the number of pedestrian images with the b-th identity.
Similarly, the constraint between the p-th local feature of the i-th IR image and its heterogeneous center is defined with $c^{V}_{b,p}$, the heterogeneous center of $f^{I}_{i,p}$. The proposed C3L combines the two constraints. It decreases the distance between the local features of RGB images and their IR centers, and likewise between the local features of IR images and their RGB centers. Hence, it clusters the local features of pedestrian images from different modalities with the same identity.
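A cross-center contrastive objective of this kind can be sketched as follows. This is a hedged illustration only: the use of cosine similarity, the InfoNCE-style softmax over identity centers, taking the center as the per-identity mean, averaging over samples, and summing the two directions are assumptions, and the function names are hypothetical. The default `tau=0.2` mirrors the value reported in the implementation details.

```python
import math
from typing import Dict, List


def modality_centers(features: List[List[float]], identities: List[int]) -> Dict[int, List[float]]:
    """Mean feature per identity within one modality."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for f, b in zip(features, identities):
        if b not in sums:
            sums[b] = [0.0] * len(f)
            counts[b] = 0
        sums[b] = [s + x for s, x in zip(sums[b], f)]
        counts[b] += 1
    return {b: [s / counts[b] for s in sums[b]] for b in sums}


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cross_center_loss(features, identities, other_centers, tau=0.2) -> float:
    """Pull each feature toward the OTHER modality's center of its own identity
    and away from that modality's remaining centers.
    ASSUMPTION: cosine similarity, softmax over centers, mean over samples."""
    losses = []
    for f, b in zip(features, identities):
        logits = {c_id: cosine(f, c) / tau for c_id, c in other_centers.items()}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        losses.append(log_denominator - logits[b])  # -log softmax of the matching center
    return sum(losses) / len(losses)


def c3l(rgb_feats, rgb_ids, ir_feats, ir_ids, tau=0.2) -> float:
    """Sum of the RGB-to-IR-center and IR-to-RGB-center terms for one part index p."""
    ir_centers = modality_centers(ir_feats, ir_ids)
    rgb_centers = modality_centers(rgb_feats, rgb_ids)
    return (cross_center_loss(rgb_feats, rgb_ids, ir_centers, tau)
            + cross_center_loss(ir_feats, ir_ids, rgb_centers, tau))
```

The sketch handles a single part index; in a part-based model it would be evaluated once per stripe, with the centers recomputed from the features available in the current batch.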

Optimization
To learn a complete distance metric, we exploit the proposed C3L together with the heterogeneous center (HC) loss [21]. The HC loss aims to diminish the distance between the centers of the IR modality and RGB modality with the same identity. The HC loss for the p-th local feature is denoted as $L^{p}_{HC}$. Additionally, we employ the cross-entropy loss to optimize the local features, and the cross-entropy loss for the p-th local feature is denoted as $L^{p}_{CE}$. Moreover, we use the verification loss $L^{p}_{g}$ to optimize LPGAT, treating it as a binary classification task. Therefore, the overall loss of the proposed method combines these terms, where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the trade-off parameters that balance the importance of the different losses.
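One way the four per-part terms could be combined is shown below. This is a sketch under assumptions: the symbol $L^{p}_{C^{3}L}$ for the per-part C3L term is introduced here for illustration, and the pairing of each trade-off parameter with a particular loss, as well as the unweighted cross-entropy term and the sum over parts, are guesses that the surrounding text does not confirm.

```latex
L = \sum_{p=1}^{P} \left( L^{p}_{CE} + \lambda_{1} L^{p}_{HC} + \lambda_{2} L^{p}_{C^{3}L} + \lambda_{3} L^{p}_{g} \right)
```

Whatever the exact pairing, the structure is a weighted sum, so setting any one trade-off parameter to zero removes the corresponding constraint while leaving the others unchanged.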

Experiments
In this section, the evaluation protocol and datasets are first presented, followed by showing the implementation details of our experiments. After that, the experimental results are compared with the state-of-the-art methods, and ablation experiments are performed to evaluate the effectiveness of the key components of the presented approaches. Finally, the influence of several important parameters in the proposed method is analyzed.
SYSU-MM01, a massive dataset, is captured by four visible-light cameras as well as two NIR cameras in both outdoor and indoor settings. There are 491 pedestrian identities recorded in this dataset, and each pedestrian is photographed by two different cameras. Furthermore, 11,909 IR images and 22,258 RGB images of 395 identities are contained in the training set. During the testing phase, we perform our experiments on two settings, i.e., indoor search and all search. Each mode has 3803 query IR pedestrian images of 96 identities. Additionally, the gallery set for all-search settings contains 301 randomly selected pedestrian images taken by RGB cameras which are placed in outdoor and indoor environments, while the gallery set for the indoor search settings contains 112 randomly selected pedestrian images taken by RGB cameras which are placed in indoor environments.
RegDB consists of 8240 images from 412 identities, with each identity containing 10 IR images and 10 RGB images. The whole dataset is divided into two halves, used for training and testing, respectively. The training set includes 2060 IR images and 2060 RGB images of 206 identities. As for the test set, there are 2060 query images of 206 identities and 2060 gallery images of 206 identities. Moreover, two evaluation settings are available, namely Thermal-to-Visible (T-V) and Visible-to-Thermal (V-T).

Evaluation Metrics
The Cumulative Matching Characteristic (CMC) curve is a commonly used performance evaluation metric in the person re-identification task. It plots the probability of correctly matching the query image at different ranks. Specifically, the x-axis represents the rank of the retrieved image (i.e., 1st, 2nd, 3rd, etc.), and the y-axis represents the probability of correctly identifying the query image among the top k retrieved images. A larger area under the CMC curve indicates better performance. The mean Average Precision (mAP) is another commonly used performance evaluation metric in the person re-identification task. It averages the per-query average precision over a set of queries and thus considers both the precision and recall of the retrieval results. A higher mAP value indicates better performance.
In this paper, standard CMC and mAP are adopted as evaluation metrics to test the performance of the proposed method.
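The two metrics can be computed from per-query ranked match lists, as in the sketch below. It is a generic illustration with hypothetical function names: each inner list is assumed to be already sorted by decreasing similarity, and dataset-specific protocol details (such as which gallery images are excluded for a given query) are not modeled.

```python
from typing import List, Tuple


def average_precision(ranked_matches: List[bool]) -> float:
    """AP for one query; ranked_matches[k] is True when the gallery item
    at rank k + 1 has the same identity as the query."""
    hits = 0
    precisions = []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)  # precision at each correct retrieval
    return sum(precisions) / hits if hits else 0.0


def cmc_and_map(all_ranked_matches: List[List[bool]], max_rank: int) -> Tuple[List[float], float]:
    """Return (CMC values for ranks 1..max_rank, mAP) over all queries."""
    cmc = [0.0] * max_rank
    aps = []
    for matches in all_ranked_matches:
        if not any(matches):
            continue  # a query with no correct gallery item is skipped
        first_hit = matches.index(True)
        for k in range(first_hit, max_rank):
            cmc[k] += 1  # the query counts as matched at every rank >= its first hit
        aps.append(average_precision(matches))
    if not aps:
        raise ValueError("no query has a matching gallery item")
    return [c / len(aps) for c in cmc], sum(aps) / len(aps)
```

Rank-1 accuracy is the first entry of the returned CMC list, which is the quantity reported alongside mAP in the comparison tables.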

Implementation Details
We first resize each pedestrian image to 288 × 144, then apply random cropping and random horizontal flipping to augment the data. Meanwhile, we set the batch size to 64, where each batch is composed of 4 identities, and each identity consists of 8 RGB pedestrian images and 8 IR pedestrian images. The scalar temperature parameter τ in Equations (8) and (10) is set to 0.2. The hyperparameter β in Equation (4) is set to 2. To balance the importance of the different losses, we set the trade-off parameters $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ to 0.5, 0.4, and 0.5, respectively. We use the stochastic gradient descent (SGD) optimizer to train the proposed method and fix the number of epochs to 60. The initial learning rate is set to 0.01 and decayed to 0.001 after 30 epochs. We adopt the FC layers to reduce the local feature dimension to D = 512, and the number of part-level stripes P is set to 6. In the testing phase, all local features are concatenated as the representation of a pedestrian image.
Comparisons on SYSU-MM01. From Table 1, we can see that the proposed method achieves 61.89% Rank-1 accuracy and 60.12% mAP under the all-search setting, which exceeds NFS [6] and CMAlign [7] in terms of mAP by 4.67% and 5.98%, respectively. It is worth noting that our method exceeds DDAG [16] in Rank-1 accuracy and mAP by 13.0% and 13.4%, respectively. This is because DDAG only uses pedestrian images of different modalities as the nodes of a graph, while our method uses the paired local features as the nodes of a graph. Compared with KSD [51], our method improves Rank-1 accuracy and mAP by 1.3% and 2.3%, respectively. In addition, under the indoor-search setting, our method surpasses WIT [22] by 5.8% and 5.9% in Rank-1 accuracy and mAP, respectively. This is because WIT uses the center constraint to pull images with the same identity to their cross-modality centers, but does not constrain the features that are far from the centers. The proposed C3L overcomes this shortcoming by constraining the distance between local features and their heterogeneous centers. The proposed method models the node of a graph with paired local features from different modalities, which outperforms other GAT models, such as DDAG. Furthermore, our approach yields superior performance to the other center-constrained approaches, i.e., TSLFN+HC [21] and WIT [22], on the indoor-search and all-search settings.
Comparisons on RegDB. We compare our LPGAT model with 13 different methods on the RegDB dataset. From the experimental results in Table 2, LPGAT shows the best performance compared with the other methods. Specifically, the proposed method obtains 89.37% Rank-1 accuracy and 78.74% mAP under the V-T mode, which surpasses the second-best method, i.e., WIT [22], by 4.37% and 2.84% in Rank-1 accuracy and mAP, respectively. In addition, under the T-V mode, our method outperforms DDAG [16] and NFS [6] by 24% and 8.4% in Rank-1 accuracy, respectively, and surpasses them by 19.3% and 5.7% in mAP. These results indicate that our model generalizes well across different scenarios.
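The batch composition and learning-rate schedule stated above can be written down directly. The sketch below is illustrative: the function names are hypothetical, epochs are assumed to be counted from zero, and sampling images with replacement is an assumption made so that identities with fewer than 8 images per modality still fill their quota.

```python
import random
from typing import Dict, List, Tuple


def learning_rate(epoch: int, initial_lr: float = 0.01,
                  decayed_lr: float = 0.001, decay_epoch: int = 30) -> float:
    """Step schedule: initial_lr for epochs 0..decay_epoch-1, then decayed_lr."""
    return initial_lr if epoch < decay_epoch else decayed_lr


def sample_batch(rgb_by_id: Dict[int, List[str]], ir_by_id: Dict[int, List[str]],
                 num_ids: int = 4, per_modality: int = 8,
                 rng=random) -> List[Tuple[str, int, str]]:
    """Pick `num_ids` identities present in both modalities, then
    `per_modality` RGB and `per_modality` IR images for each of them."""
    shared_ids = sorted(set(rgb_by_id) & set(ir_by_id))
    batch = []
    for pid in rng.sample(shared_ids, num_ids):
        # choices() samples with replacement (an assumption, see the lead-in)
        batch += [("rgb", pid, img) for img in rng.choices(rgb_by_id[pid], k=per_modality)]
        batch += [("ir", pid, img) for img in rng.choices(ir_by_id[pid], k=per_modality)]
    return batch
```

With the default arguments a batch holds 4 × (8 + 8) = 64 images, matching the stated batch size. If the D-dimensional part features are the ones concatenated at test time, the final descriptor would have P × D = 6 × 512 = 3072 dimensions.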
In conclusion, the proposed method yields superior performance on the two large-scale datasets, which demonstrates the good generalization capability of our approach.

Ablation Studies
We conduct ablation experiments on SYSU-MM01 under the all-search setting to assess the effectiveness of each key component of our method. The detailed results are presented in Table 3. B represents the baseline, which is implemented by the Local Feature Extractor and optimized by the HC loss and the cross-entropy loss. GAT indicates that each node of the graph is built from the single local feature of a pedestrian image, and LPGAT-k is the LPGAT module without the contextual attention coefficient. From Table 3, several conclusions can be drawn. First, B + GAT exceeds B by 0.7% in Rank-1 accuracy and 1.08% in mAP. This indicates that aggregating the local features of different modalities can improve the discrimination of features. Second, B + LPGAT-k improves on B + GAT, which demonstrates the effectiveness of using the paired local features from different modalities to build a graph. Third, the comparison between B + LPGAT and B + LPGAT-k shows the effectiveness of the contextual attention coefficient, which is beneficial for accurate information propagation. Fourth, B + C3L outperforms B by 4.46% in Rank-1 accuracy and 3.67% in mAP. The proposed C3L helps the deep model learn a complete distance metric by constraining the distance between local features and their heterogeneous centers, narrowing the gap between modalities. Finally, the performance is further improved when combining LPGAT and C3L, which demonstrates that they reinforce each other.

Parameters Analysis
There are several key parameters in the proposed method. We evaluated the effect of different parameter values on all-search mode in SYSU-MM01, and the experimental results can be generalized to other cross-modality person ReID settings.
The impact of the hyperparameter β. We perform experiments with different values of β in Equation (4) to evaluate the performance of the proposed method; the results are shown in Figure 4. From the figure, it can be seen that the performance peaks at β = 2 and drops as β increases further. Therefore, we set β to 2.
The impact of the scalar temperature parameter τ. The scalar temperature parameter τ is an important parameter that controls the range of similarity between the local features and their heterogeneous centers in Equations (8) and (10). The experimental results with different values of τ are shown in Figure 5, where the performance improves as τ increases up to 0.2 and drops when τ > 0.2. Hence, the optimal value of τ is 0.2.
The impact of the trade-off parameters $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$. The trade-off parameters $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ in Equation (12) control the importance of the different losses. To search for their optimal values, we experimentally test different value combinations; for ease of display, we fix two parameters at their optimal values and show the influence of the remaining one. The results are shown in Figure 6, where we can see that the performance is best when $\lambda_{1}$ = 0.5, $\lambda_{2}$ = 0.4, and $\lambda_{3}$ = 0.5.

Visualization
To intuitively verify the effectiveness of our method, we visualize the cosine similarity distribution of cross-modality positive and negative pairs (R-I positive and R-I negative) of B, B + LPGAT, and B + C3L, as shown in Figure 7. From the figure, we can see that the distributions of B + LPGAT and B + C3L are more separated than that of B. This demonstrates that LPGAT and C3L can improve the discrimination of features for cross-modality person ReID.
Figure 7. Cosine similarity distributions of B, B + LPGAT, and B + C3L. The x axis shows the cosine similarity scores between RGB images and IR images, and the y axis shows the frequency statistics of the cosine similarity score.
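The data behind such a plot can be gathered as in the sketch below, which splits all RGB-IR pairs into positives and negatives by identity and bins their cosine similarities. The function names and the fixed bin layout over [-1, 1] are illustrative choices, not details taken from the paper.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pair_similarities(rgb_feats, rgb_ids, ir_feats, ir_ids) -> Tuple[List[float], List[float]]:
    """Cosine similarity of every RGB-IR pair, split by whether identities match."""
    positives, negatives = [], []
    for f_r, id_r in zip(rgb_feats, rgb_ids):
        for f_i, id_i in zip(ir_feats, ir_ids):
            target = positives if id_r == id_i else negatives
            target.append(cosine_similarity(f_r, f_i))
    return positives, negatives


def histogram(values: List[float], bins: int = 20,
              low: float = -1.0, high: float = 1.0) -> List[int]:
    """Count values per equal-width bin; the top edge falls into the last bin."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts
```

The less the positive and negative histograms overlap, the easier it is to separate same-identity from different-identity pairs with a similarity threshold, which is the property the figure is meant to show.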

Conclusions
In this paper, we presented LPGAT for cross-modality person ReID to model the correlation between paired local features derived from distinct modalities. We also proposed the contextual attention coefficient to ensure accurate information propagation on the graph. In addition, we proposed C3L to decrease the modality gap for cross-modality person ReID by constraining the distance between local features and their heterogeneous centers. The results of experiments on two commonly used datasets demonstrate that the proposed approach surpasses the state-of-the-art approaches. In future work, we will extend our approach to video sequences for cross-modality ReID. Considering that practical ReID applications often need to perform multiple tasks simultaneously, such as pedestrian attribute recognition and pose estimation, we will also consider joint multi-task learning to make full use of multimodal information and improve the performance of pedestrian re-identification.

Data Availability Statement: All datasets used for training and evaluating the performance of our proposed approach are publicly available and can be accessed from [27,43].

Conflicts of Interest:
The authors declare no conflict of interest.