Article

A Cross-Modality Person Re-Identification Method Based on Joint Middle Modality and Representation Learning

1 College of Communication and Information Engineering, Xi’an University of Science and Technology, Xi’an 710054, China
2 Xi’an Key Laboratory of Heterogeneous Network Convergence Communication, Xi’an 710054, China
3 Safety Supervision Department, Shaanxi Cuijiagou Energy Co., Ltd., Tongchuan 727000, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(12), 2687; https://doi.org/10.3390/electronics12122687
Submission received: 12 May 2023 / Revised: 10 June 2023 / Accepted: 12 June 2023 / Published: 15 June 2023

Abstract

Modality differences and intra-class differences are currently central research problems in cross-modality person re-identification. In this paper, we propose a cross-modality person re-identification method based on joint middle modality and representation learning. To reduce the modality differences, a middle modality generator maps images of the two modalities into a unified feature space to generate middle modality images, and a two-stream network with parameter sharing extracts the combined features of the original images and the middle modality images. In addition, a multi-granularity pooling strategy combining global and local features improves the representation learning capability of the model and further reduces the modality differences. To reduce the intra-class differences, the model is jointly optimized with a distribution consistency loss, a label smoothing cross-entropy loss, and a hetero-center triplet loss, which shrinks the intra-class distance and accelerates convergence. We validate the method on the publicly available RegDB and SYSU-MM01 datasets. The proposed approach reaches 68.11% mAP in All Search mode on SYSU-MM01 and 86.54% mAP in VtI mode on RegDB, improvements of 3.29% and 3.29% over the baseline, respectively, which demonstrates the effectiveness of the proposed method.

1. Introduction

Person re-identification is widely regarded as an image retrieval problem: given an image of a specific person, computer vision techniques determine whether that person appears under multiple disjoint cameras. Specifically, the query image is compared with person images acquired from different surveillance scenes by computing a feature distance; if the similarity between two pedestrian images exceeds a threshold, they are considered to be the same person, and otherwise they are not. This technology can be combined with person detection, person tracking, and other techniques, and is expected to be widely applied in smart security, intelligent video surveillance, and other fields.
Early person re-identification methods relied mainly on manually extracted discriminative features, which is time-consuming and error-prone and greatly limits the accuracy and real-time performance of person re-identification [1]. In the last decade, person re-identification has attracted considerable interest from academia and achieved notable results, with more than 53 papers published at CVPR 2019 and ICCV 2019 alone [2]. Single-modality person re-identification mainly targets visible-light images; however, at night or in poorly lit scenes, visible-light cameras can hardly obtain clear person images, which makes it difficult for feature extraction networks to extract effective person features, so traditional person re-identification models fail to reach the expected results. In contrast, infrared cameras are insensitive to lighting conditions and can still image in the absence of a light source. Therefore, cross-modality person re-identification based on visible (VIS) and infrared (IR) images has gradually become a research hotspot in this field.
The challenge of cross-modality person re-identification is twofold. Differences in camera viewpoint cause the appearance of a person captured by different cameras, such as background and pose, to vary significantly, which results in intra-class differences between images of the same person. In addition, the different imaging principles of VIS and IR cameras lead to modality differences between VIS and IR images, which pose great challenges for cross-modality person re-identification.
To address the above problems, Wu et al. [3] released the first cross-modality person re-identification dataset, SYSU-MM01, which includes person images of multiple scenes from various angles, enabling models to learn persons under various backgrounds and poses and thus improving generalization. Zhu et al. [4] designed the hetero-center loss (HC) to reduce the intra-class center distance between different modalities. Subsequently, Liu et al. [5] designed the hetero-center triplet loss (HcTri) based on the TriHard loss, which weakens the strong constraint of the traditional triplet by substituting the distance between sample anchors with the distance between sample centers. Ye et al. [6] used a two-stream network to extract features from the two modalities and mapped them into a unified space for metric learning; however, this method is more concerned with the image background and cannot resolve the modality differences. To further reduce the modality differences between VIS and IR images, Liu et al. [5] explored two-stream structures with parameter sharing and demonstrated the effect of parameter sharing on cross-modality person re-identification. Dai et al. [7] and Wang et al. [8] used Generative Adversarial Networks (GANs) to translate VIS images into corresponding IR images and performed feature extraction on the original and generated IR images, reducing the modality differences at both the image and feature levels. However, because of the nonlinear relationship between modalities, GAN-based approaches cannot fully translate an image from one modality into another without changing the identity of the person. Li et al. [9] proposed an X-modality between the two modalities, which preserves the spatial information of the source images, but modality differences still exist between the X-modality and the IR images. Zhang et al. [10] proposed a Middle Modality Generator (MMG), which maps the VIS and IR images into a common space to generate corresponding middle modality images, further reducing the modality differences and improving model performance.
To further address the modality differences and intra-class differences between visible and infrared images, this article improves on MMN [10] and proposes a cross-modality person re-identification method based on joint middle modality and representation learning. The major contributions of this article are summarized as follows:
(1)
We use MMG to generate middle modality images by projecting VIS and IR images into a unified feature space, and then jointly input the middle modality images and the original images to a two-stream parameter sharing network for feature extraction to reduce the modality differences.
(2)
We use a multi-granularity pooling strategy combining global and local features to improve the representational learning capability of the model.
(3)
We jointly optimize the model by combining the distribution consistency loss, label smoothing cross-entropy loss, and hetero-center triplet loss to reduce the intra-class distance and accelerate the model convergence.
This paper first introduces the research background and significance of person re-identification, followed by the research status of single-modality and cross-modality person re-identification. A cross-modality person re-identification method based on joint middle modality and representation learning is then presented, and the middle modality generator, two-stream parameter-sharing network, multi-granularity pooling strategy, and joint loss are elaborated. Finally, comparison and ablation experiments are conducted on two publicly available datasets to verify the effectiveness of each module, and the advantages of the proposed method are demonstrated by comparing it with mainstream methods.

2. Related Work

This section reviews the current research status of single-modality and cross-modality person re-identification. Single-modality person re-identification is introduced in terms of representation learning, metric learning, and generative adversarial networks; cross-modality person re-identification is introduced in terms of metric learning, parameter sharing, and modality transformation.

2.1. Single-Modality Person Re-Identification

Single-modality person re-identification generally means person re-identification based on visible images [11]. The current work related to single-modality person re-identification research is mainly focused on representation learning, metric learning, and generative adversarial networks.
Approaches based on representation learning aim to extract more discriminative feature representations from person images. Early studies focused on extracting global features of pedestrians. Geng et al. [12] used binary classification to judge whether two pedestrian images show the same person and multi-class classification to predict the identity of each person in the two images. As person re-identification developed, researchers gradually recognized the limitations of global features and began to focus on local features. In 2018, Sun et al. [13] proposed the Part-based Convolutional Baseline (PCB), which partitions person images into parts and extracts features separately for each part; it confirmed the importance of local features to a wide range of scholars and stands out as a milestone in the field of person re-identification. Commonly used person image partitioning methods include horizontal segmentation [14], pose estimation [15], and so on.
Metric-learning-based methods map the learned features into a new space and reduce the distance between images of the same pedestrian. Commonly used loss functions include the triplet loss, contrastive loss, TriHard loss, and quadruplet loss. In 2015, Ding et al. [16] proposed the triplet loss to reduce the intra-class distance. In 2016, Varior et al. [17] proposed the contrastive loss to handle the relationship between data pairs in Siamese networks. In 2017, Hermans et al. [18] designed the TriHard loss based on the triplet loss. In the same year, Chen et al. [19] proposed the quadruplet loss, which adds a further negative sample pair and allows the network to learn better features.
Methods based on generative adversarial networks address the problem of limited dataset size from the perspective of data augmentation. Traditional data augmentation relies on operations such as random cropping, random horizontal flipping, and random erasing. Unlike these traditional methods, generative adversarial networks can not only transform a person’s posture and clothing but also perform style transfer between different datasets [20], which greatly helps to improve the generalization ability of models and to address the cross-domain problem in person re-identification.

2.2. Cross-Modality Person Re-Identification

As noted above, single-modality person re-identification targets VIS images, whereas cross-modality person re-identification targets both VIS and IR images. The different imaging principles of VIS and IR cameras result in modality differences between VIS and IR images. Current cross-modality person re-identification methods mainly rely on metric learning, parameter sharing, and modality conversion.
Metric-learning-based methods are similar to their single-modality counterparts in that both essentially reduce the intra-class distance by means of a loss function. In 2020, Zhu et al. [4] designed the hetero-center loss to reduce the intra-class center distance between different modalities, and Liu et al. [5] designed the hetero-center triplet loss based on the TriHard loss to weaken the strong constraint of the traditional triplet.
Parameter-sharing-based methods enable the network to learn in a shared feature space by sharing some of the network layers. In 2020, Liu et al. [5] explored and demonstrated the impact of parameter sharing on cross-modality person re-identification.
The main idea of modality-conversion-based methods is to reduce the modality differences by transforming images of one modality into the other or by unifying them into a middle modality. Dai et al. [7] proposed cmGAN, the first method to introduce generative adversarial training into cross-modality person re-identification to distinguish visible and infrared images. Wang et al. [8] proposed AlignGAN, which uses CycleGAN [21] to translate VIS images into corresponding IR images and then performs feature extraction on the original and generated infrared images. These methods reduce the difference between the infrared and visible modalities to some extent.

3. Methods

To decrease the modality differences and intra-class differences, this paper proposes a cross-modality person re-identification method based on joint middle modality and representation learning. This section begins with an overview of the whole structure of the network. Then, the middle modality generator, a two-stream network with parameter sharing, and a multi-granularity pooling strategy are elaborated. Finally, the distribution consistency loss, label smoothing cross-entropy loss, and hetero-center triplet loss are jointly optimized to improve the performance of the proposed model.

3.1. Overall Network Structure

The overall structure of a cross-modality person re-identification method based on joint middle modality and representation learning is shown in Figure 1.
(1)
A Middle Modality Generator (MMG) [10] maps the VIS and IR images to a unified feature space via an encoder–decoder to generate middle modality images, and a Distribution Consistency Loss (DCL) [10] then constrains the generated middle modality images to have consistent distributions, reducing the difference between images of different modalities.
(2)
We use ResNet50 as the base network of a two-stream network with parameter sharing [5]: the first convolutional layer and the first two residual blocks of ResNet50 serve as modality-specific feature extractors that extract the independent features of each modality, and the last two residual blocks serve as a weight-sharing feature embedder that further reduces the modality differences.
(3)
A Multi-granularity Pooling (MGP) strategy combines global and local features to enhance the correlation between features. Generalized Mean Pooling (GeM) is adopted as the pooling method because it pays more attention to image details, which improves the representation learning capability of the model.
(4)
The distribution consistency loss [10], label smoothing cross-entropy loss [22], and hetero-center triplet loss [5] are combined to optimize the model and reduce the intra-class distance.

3.2. Middle Modality Generator

VIS images have three channels while IR images have a single channel, so matching three-channel VIS images with single-channel IR images is the focus of this section. To reduce the influence of the channel difference, this paper uses two independent encoders to encode the images of the two modalities separately and then generates the middle modality images via a shared decoder, whose structure is shown in Figure 2.
As shown in Figure 2, the middle modality generator consists of an encoder and a decoder and takes as input image pairs of the two modalities with the same label. The input image size is uniformly adjusted to 3 × 384 × 192, where the single-channel infrared image is replicated into three channels to align with the three-channel visible image. The images of both modalities are then fed into their encoders: a Conv2d(3, 1, 1) layer converts the three-channel image into a single-channel image, a Conv2d(1, 1, 1) layer reduces the computation, and a BN layer normalizes the data. The normalized data are then passed to the shared decoder, where a Conv2d(1, 3, 1) layer converts the encoded single-channel image back into a three-channel image, producing a middle modality image with the same label as the visible and infrared images. Examples of visible, infrared, and middle modality images are shown in Figure 3.
In Figure 3, Person1–Person8 denote person identities, VIS denotes the visible image, IR denotes the infrared image, VtM denotes the middle modality image generated from the visible image, and ItM denotes the middle modality image generated from the infrared image.
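As a concrete illustration, the following is a minimal PyTorch sketch of the encoder–decoder structure described above; it assumes 1 × 1 convolutions and batch normalization as stated, and the class and layer names are illustrative rather than taken from the authors’ code.

```python
import torch
import torch.nn as nn

class MiddleModalityGenerator(nn.Module):
    """Sketch of the middle modality generator: two modality-specific
    encoders map 3-channel inputs to a single channel, and a shared
    decoder maps the encoding back to 3 channels."""

    def __init__(self):
        super().__init__()

        def make_encoder():
            # Conv2d(3, 1, 1) -> Conv2d(1, 1, 1) -> BN, as described in the text
            return nn.Sequential(
                nn.Conv2d(3, 1, kernel_size=1),
                nn.Conv2d(1, 1, kernel_size=1),
                nn.BatchNorm2d(1),
            )

        self.encoder_vis = make_encoder()
        self.encoder_ir = make_encoder()
        # Shared decoder: Conv2d(1, 3, 1) restores a 3-channel image
        self.decoder = nn.Conv2d(1, 3, kernel_size=1)

    def forward(self, x_vis, x_ir):
        # x_vis, x_ir: (B, 3, 384, 192); the IR input is assumed to be the
        # single IR channel replicated three times before being passed in
        vtm = self.decoder(self.encoder_vis(x_vis))  # VIS -> middle modality
        itm = self.decoder(self.encoder_ir(x_ir))    # IR  -> middle modality
        return vtm, itm
```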

3.3. The Design of a Two-Stream Network with Parameter Sharing

We input the generated middle modality images together with the original images into a two-stream network with parameter sharing for feature extraction to further reduce the modality differences. Early two-stream networks set up a separate feature extraction branch for each modality to learn the person-related information within it; this structure is shown in Figure 4a. It can reduce the differences between modalities to a certain extent, but it ignores the correlation between modalities and identity samples, which increases the intra-class distance. To address this problem, this paper adopts a two-stream network with parameter sharing. Specifically, ResNet50 is used as the base network of the two-stream network: the first convolutional layer and the first two residual blocks of ResNet50 serve as modality-specific feature extractors that extract the independent features of each modality, and the last two residual blocks serve as a modality-shared feature embedder with parameter sharing; this structure is shown in Figure 4b.
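For reference, a minimal PyTorch sketch of this split is given below, assuming the torchvision ResNet50 implementation; the module names are illustrative, and details such as the classifier head and the stride of the last stage are omitted.

```python
import torch.nn as nn
from torchvision.models import resnet50

class TwoStreamParamSharing(nn.Module):
    """Sketch of the two-stream network: stage 0-stage 2 of ResNet50 are
    modality-specific, while stage 3-stage 4 are shared across modalities."""

    def __init__(self, pretrained=True):
        super().__init__()

        def specific_stages():
            r = resnet50(pretrained=pretrained)
            # stage 0 (conv1/bn1/relu/maxpool) + stage 1 + stage 2
            return nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                 r.layer1, r.layer2)

        self.vis_branch = specific_stages()   # VIS / VtM stream
        self.ir_branch = specific_stages()    # IR / ItM stream
        shared = resnet50(pretrained=pretrained)
        self.shared = nn.Sequential(shared.layer3, shared.layer4)  # stage 3-4

    def forward(self, x_vis, x_ir):
        f_vis = self.shared(self.vis_branch(x_vis))  # modality-shared embedding
        f_ir = self.shared(self.ir_branch(x_ir))
        return f_vis, f_ir
```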

3.4. Multi-Granularity Pooling Strategy

The main task of representation-learning-based methods is to extract a more discriminative feature representation from the person image. Early research focused on extracting global features of persons, i.e., representing each identity with a single global feature vector extracted from the whole image. However, relying only on global features ignores subtle pedestrian cues and makes it difficult for the network to extract more discriminative features. With the development of person re-identification technology, the limitations of global features were gradually recognized, and local features began to receive attention. In 2018, Sun et al. [13] proposed the landmark PCB network, which divides the person image into six horizontal blocks and extracts features for each region separately to obtain more discriminative features. However, this fixed partitioning is likely to split important features across different blocks, which is to some extent detrimental to model learning. In contrast, the Multiple Granularity Network (MGN) [23] uniformly partitions the image in the horizontal direction with a different number of parts in each branch, obtaining multi-granularity person features. Therefore, based on the idea of MGN, we construct the multi-granularity network shown in Figure 5.
The multi-granularity pooling structure has two branches. The upper branch is the global feature branch, which down-samples the output features of the parameter-sharing network with a stride of 2 and learns only global features without any fine-grained partitioning. The lower branch is the local feature branch, which divides the output features of the parameter-sharing network horizontally and evenly into four blocks for fine-grained local feature learning; a minimal sketch of this splitting is given below.
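The following snippet illustrates, under stated assumptions, how a feature map from the shared backbone could be routed to the two branches; the stride-2 down-sampling of the global branch is omitted, and the function name is hypothetical.

```python
import torch

def multi_granularity_split(feat, num_parts=4):
    """Sketch: keep the full (B, C, H, W) map for the global branch and
    split it into `num_parts` horizontal strips for the local branch."""
    global_feat = feat                                  # global branch input
    local_feats = torch.chunk(feat, num_parts, dim=2)   # 4 strips along height
    return global_feat, local_feats
```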
After feature extraction, Global Average Pooling (GAP) is usually used to further reduce the number of parameters in the model. However, GAP focuses on the overall image information, is easily disturbed by background and occlusion, and has difficulty capturing the detailed features of persons. In contrast, Generalized Mean Pooling (GeM) pays more attention to image details. The GeM formula is as follows:
$$f = \left[ f_1, \ldots, f_k, \ldots, f_K \right]^{T}, \qquad f_k = \left( \frac{1}{\left| X_k \right|} \sum_{x \in X_k} x^{p_k} \right)^{\frac{1}{p_k}}$$
where $X_k$ denotes the $k$-th feature map input to the pooling layer and $f$ is the output of the pooling layer; $p_k$ is a pooling parameter learned during backpropagation. When $p_k = 1$, GeM pooling is equivalent to global average pooling; when $p_k$ tends to infinity, GeM pooling approaches global max pooling.
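A compact PyTorch implementation consistent with this formula might look as follows; the initialization p = 3 is a common default rather than a value reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized Mean pooling: p = 1 recovers average pooling and
    large p approaches max pooling; p is learned by backpropagation."""

    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)
        self.eps = eps

    def forward(self, x):
        # x: (B, C, H, W) -> (B, C) pooled descriptor
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p)
        return x.flatten(1)
```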

3.5. The Design of Joint Loss Function

3.5.1. Distributional Consistency Loss

The middle modality generator uses two independent encoders to encode the images of the two modalities separately, so that the VIS and IR images are mapped to a unified feature space, and the middle modality images are then generated by a shared decoder. To keep the distributions of the generated middle modality images consistent, the distribution consistency loss pulls the two middle modality images closer together; it is expressed as follows:
$$L_{dcl} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{mean}\!\left( \left| f\!\left(I_{VtM}^{i}\right) - f\!\left(I_{ItM}^{i}\right) \right| \right)$$
where $N$ is the number of images in each batch during training; $I_{VtM}^{i}$ and $I_{ItM}^{i}$ are the middle modality images generated from the VIS and IR images, respectively; $f(\cdot)$ is the output of the fully connected layer for the two middle modality images; and $\mathrm{mean}(|A - B|)$ denotes averaging the difference between $A$ and $B$.
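A minimal sketch of this loss in PyTorch, assuming the two inputs are the fully connected outputs of the generated middle modality images; the function name is illustrative.

```python
def distribution_consistency_loss(f_vtm, f_itm):
    """Sketch of the distribution consistency loss: mean absolute
    difference between the FC-layer outputs of the VIS- and IR-generated
    middle modality images, averaged over the batch."""
    # f_vtm, f_itm: (N, D) tensors for the two middle modality streams
    return (f_vtm - f_itm).abs().mean()
```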

3.5.2. Label Smoothing Cross-Entropy Loss

In addition to using the DCL to pull the generated middle modality images closer, we also use the label smoothing cross-entropy loss to avoid overfitting. Label smoothing cross-entropy loss is widely used in classification tasks and is expressed as follows:
$$L_{id} = -\sum_{i=1}^{N} q_i \log\left(p_i\right), \qquad q_i = \begin{cases} 1 - \dfrac{N-1}{N}\,\xi, & y = i \\[4pt] \dfrac{\xi}{N}, & y \neq i \end{cases}$$
where $y$ is the true label of the person, $p_i$ is the predicted probability of class $i$, $N$ is the number of person identities, and $\xi$ is the error tolerance, set to 0.1, so that the network is trained with a smoothed target rather than a hard one-hot label.
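The following is a hedged PyTorch sketch of label smoothing cross-entropy consistent with the formula above; function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, targets, xi=0.1):
    """Sketch: the true class gets probability 1 - (N-1)/N * xi and every
    other class gets xi / N, where N is the number of identities."""
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    # build the smoothed target distribution q
    smooth = torch.full_like(log_probs, xi / num_classes)
    smooth.scatter_(1, targets.unsqueeze(1),
                    1.0 - (num_classes - 1) / num_classes * xi)
    return (-smooth * log_probs).sum(dim=1).mean()
```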

3.5.3. Hetero-Center Triplet Loss

In terms of reducing the intra-class distance, the hetero-center triplet loss relaxes the strong constraint of the conventional triplet loss, which leads to better mapping of images from different modalities into the same feature space.
Since the generated middle modality images are used together with the original images for auxiliary network training, each batch has size 4M: the first M images are $I_{VIS}$, the second M are $I_{VtM}$, the third M are $I_{ItM}$, and the fourth M are $I_{IR}$. For the VIS and IR modalities, the hetero-center triplet loss is defined as follows:
$$L_{hc\_tri}(V, I) = L_{hc\_tri}^{V \to I} + L_{hc\_tri}^{I \to V}$$
$$L_{hc\_tri}^{V \to I} = \sum_{i=1}^{M} \left[ \rho + \left\| c_V^i - c_I^i \right\|_2 - \min_{\substack{K \in \{V, I\} \\ j \neq i}} \left\| c_V^i - c_K^j \right\|_2 \right]_{+}$$
$$L_{hc\_tri}^{I \to V} = \sum_{i=3M+1}^{4M} \left[ \rho + \left\| c_I^i - c_V^i \right\|_2 - \min_{\substack{K \in \{V, I\} \\ j \neq i}} \left\| c_I^i - c_K^j \right\|_2 \right]_{+}$$
where $\rho$ is the margin parameter, set to 0.3; $\left\| c_V^i - c_I^i \right\|_2$ denotes the Euclidean distance between the VIS and IR feature centers of the same identity; $\min \left\| c_V^i - c_K^j \right\|_2$ denotes the hardest (most indistinguishable) negative center; and $[z]_+ = \max(z, 0)$ keeps the value inside the brackets when it is positive and is 0 otherwise.
The hetero-center triplet losses between the other modality pairs are computed analogously to $L_{hc\_tri}(V, I)$. The final hetero-center triplet loss used in this paper is expressed as follows:
$$L_{hc\_tri} = L_{hc\_tri}(V, I) + L_{hc\_tri}(V, ItM) + L_{hc\_tri}(I, VtM) + L_{hc\_tri}(VtM, ItM)$$
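For one modality pair, a center-based sketch of this loss could be written as below. It loops over identity centers rather than over the 4M batch layout, sums over anchors as in the equations, and leaves the bookkeeping of the four modality pairs to the caller; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def hetero_center_triplet_loss(feat_a, feat_b, labels, margin=0.3):
    """Sketch for one modality pair: each identity's center in modality A
    should be closer to its center in modality B than to the nearest
    center of any other identity, with margin rho = 0.3. Assumes every
    identity in the batch has samples in both modalities."""
    ids = labels.unique()
    centers_a = torch.stack([feat_a[labels == i].mean(0) for i in ids])
    centers_b = torch.stack([feat_b[labels == i].mean(0) for i in ids])
    all_centers = torch.cat([centers_a, centers_b], dim=0)   # negatives pool

    loss = 0.0
    for anchor_set, pos_set in [(centers_a, centers_b), (centers_b, centers_a)]:
        for i in range(len(ids)):
            pos = (anchor_set[i] - pos_set[i]).norm()         # same id, other modality
            d = (anchor_set[i].unsqueeze(0) - all_centers).norm(dim=1)
            mask = torch.ones_like(d, dtype=torch.bool)
            mask[i] = False                                   # exclude own centers
            mask[i + len(ids)] = False
            neg = d[mask].min()                               # hardest negative center
            loss = loss + F.relu(margin + pos - neg)
    return loss
```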

3.5.4. Joint Loss

This paper adopts the hetero-center triplet loss, the label smoothing cross-entropy loss, and the distribution consistency loss to jointly supervise training and optimize the model. The final joint loss is:
$$L = L_{id} + \lambda_1 L_{hc\_tri} + \lambda_2 L_{dcl}$$
where $\lambda_1$ and $\lambda_2$ denote the weights of $L_{hc\_tri}$ and $L_{dcl}$ and are set to 1 and 0.5, respectively.

4. Experiments

This section first introduces the two publicly available datasets and the experimental environment. Then, the advantages of the proposed approach are demonstrated by comparison with mainstream methods. Finally, comparative experiments are conducted to determine the parameter-sharing scheme, the multi-granularity pooling configuration, and the joint loss, and ablation experiments verify the effectiveness of each module.

4.1. Datasets and Evaluation Metrics

The experiments were conducted on two publicly available datasets for the proposed method in this paper: RegDB [24] and SYSU-MM01 [3].
The RegDB dataset was captured by a pair of VIS and IR cameras. It contains 412 identities, each with 10 VIS images and 10 IR images, for a total of 4120 VIS images and 4120 IR images. Two retrieval modes are evaluated: Visible to Infrared (VtI) and Infrared to Visible (ItV). In this paper, 206 randomly selected identities (4120 VIS and IR images in total) are used as the training set, and the remaining 206 identities (another 4120 images) are used as the test set. The experiment is repeated ten times, and the average of the ten runs is reported as the final result.
SYSU-MM01 was collected at Sun Yat-sen University and is the first dataset released for cross-modality person re-identification. It covers both indoor and outdoor scenes and was captured by four visible-light cameras and two infrared cameras. It contains 491 identities in total; the training set has 22,258 VIS images and 11,909 IR images of 395 identities, and the test set uses 3803 IR images of 96 identities as query images. The dataset provides two retrieval modes: All-search, in which indoor and outdoor images captured by the visible cameras form the gallery, and Indoor-search, in which only indoor images captured by the visible cameras form the gallery. To ensure the reliability of the experiment, both retrieval modes are used under the single-shot setting, in which each identity in the gallery contributes one image.
In this paper, the standard Cumulative Matching Characteristic (CMC) curve and the mean Average Precision (mAP) are used as evaluation metrics.
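For clarity, the sketch below shows how retrieval mAP can be computed from a query–gallery distance matrix; it omits the camera-based filtering used in the standard SYSU-MM01 evaluation protocol, so it is an illustration rather than the official evaluation code.

```python
import numpy as np

def mean_average_precision(dist, q_ids, g_ids):
    """Sketch of retrieval mAP: for each query, rank the gallery by
    ascending distance and average the precision at each true match."""
    aps = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])              # closest gallery items first
        matches = (g_ids[order] == q_ids[i])
        if matches.sum() == 0:
            continue
        hits = np.cumsum(matches)                # number of matches up to rank k
        precision = hits[matches] / (np.flatnonzero(matches) + 1)
        aps.append(precision.mean())
    return float(np.mean(aps))
```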

4.2. Experimental Environment and Parameter Settings

Experimental environment: an NVIDIA GeForce RTX 3090 graphics card (24 GB of video memory) and 43 GB of RAM, running the PyTorch 1.11.0 deep learning framework with Python 3.8 on Ubuntu 20.04.
Parameter settings: before training, the dataset is preprocessed and the training images are resized to 384 × 192. Data augmentation is then applied with random cropping, random horizontal flipping, and random erasing. A P × K sampling strategy is used with P = 8 identities and K = 4 images per identity; the batch size is set to 64, num_workers is set to 4, and the number of training epochs is 80. SGD is used as the optimizer with a momentum of 0.9. A warm-up learning rate strategy is adopted with an initial learning rate of 0.1, and the learning rate at epoch t is set as follows:
$$b(t) = \begin{cases} 0.1 \times \dfrac{t+1}{10}, & 0 \le t < 10 \\ 0.1, & 10 \le t < 20 \\ 0.01, & 20 \le t < 50 \\ 0.001, & t \ge 50 \end{cases}$$
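This schedule can be implemented directly, for example as the following helper (epoch is 0-indexed; the function mirrors the piecewise definition above and the function name is illustrative):

```python
def learning_rate(epoch: int, base_lr: float = 0.1) -> float:
    """Warm-up learning-rate schedule used with SGD."""
    if epoch < 10:
        return base_lr * (epoch + 1) / 10   # linear warm-up
    elif epoch < 20:
        return 0.1
    elif epoch < 50:
        return 0.01
    else:
        return 0.001
```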

4.3. Comparison with State-of-the-Art Methods

We compare the proposed method with mainstream cross-modality person re-identification methods of the last five years on RegDB and SYSU-MM01; the results are shown in Table 1. The compared methods include HCML [6], HSME [25], D2RL [26], AliGAN [8], HC [4], HcTri [5], X-modality [9], AGW [27], DDAG [28], CM-NAS [29], DGTL [30], and FMCNet [31].
The experimental results show that: (1) on the SYSU-MM01 dataset, the proposed method achieves 71.27% Rank-1 and 68.11% mAP in All Search mode and 77.64% Rank-1 and 81.06% mAP in Indoor Search mode; (2) on the RegDB dataset, Rank-1 and mAP reach 94.18% and 86.54% in VtI mode and 91.16% and 83.67% in ItV mode, respectively.
As can be seen from the experimental results, the proposed method outperforms the other methods, which can be explained from the following three aspects:
(1)
The X-modality-based method uses an auxiliary modality to reduce the modality differences, but it generates middle modality images only from VIS images. In contrast, the method proposed in this paper maps both VIS and IR images into a unified space to generate middle modality images, which further reduces the modality differences.
(2)
The main task of representation-learning-based methods is to extract more discriminative features. In this paper, we combine global and local features to improve the representation learning capability of the model, which performs better than the DDAG method that focuses mainly on global features.
(3)
The main task of metric-learning-based methods is to map the learned features into a new space and then reduce the intra-class distance with a loss function. In this paper, the distribution consistency loss, label smoothing cross-entropy loss, and hetero-center triplet loss jointly optimize the model, which offers advantages over methods using only HC or HcTri.

4.4. Ablation Study

4.4.1. Two-Stream Parameter-Sharing Experiments

To address the modality differences between VIS and IR images, a two-stream network with parameter sharing is used for feature extraction from the cross-modality images, with ResNet50 as the base network. Since ResNet50 consists of five stages (stage 0–stage 4), starting the parameter sharing from different stages affects the model performance differently. Therefore, this section conducts parameter-sharing experiments to compare the effects of different sharing schemes and to select the optimal configuration. The results are shown in Table 2, where experiment 1 is the baseline.
The results show that the model performs best when stage 0–stage 2 are used as the modality-specific feature extractor and stage 3–stage 4 are shared as the feature embedder, which effectively reduces the modality differences.

4.4.2. Multi-Granularity Pooling Experiments

Multi-granularity pooling combines global and local features, allowing the model to learn both coarse-grained global features and fine-grained local features, and it has a stronger learning ability than networks that use only local or only global features.
Feature extraction is usually followed by GAP to further reduce the number of parameters. However, GAP focuses on the overall image information and is easily disturbed by the background, whereas GeM focuses more on image details. Experiments were conducted to verify the effectiveness of the multi-granularity strategy and the pooling methods; the results are shown in Table 3, where experiment 1 is the baseline.
The experimental results show that both the multi-granularity strategy and GeM pooling improve the model performance to some extent. The multi-granularity strategy brings a significant improvement on the SYSU-MM01 dataset, while on the RegDB dataset the improvement is small and can even be slightly negative, because the person images in RegDB are relatively unclear and reliable local features are difficult to extract. Overall, evaluating the method on both datasets shows that combining the multi-granularity strategy with GeM pooling effectively improves the representation learning capability and, thus, the model performance.

4.4.3. Joint Loss Function Experiments

As discussed in Section 3.5.3, the hetero-center triplet loss relaxes the strong constraint of the conventional triplet loss and leads to better mapping of different modality images into the same feature space. To verify its effectiveness, we conducted a comparison experiment; the results are shown in Table 4, where experiment 1 is the baseline.
The experimental results show that the model performs best with the joint DCL-LS-HCT loss, indicating that the hetero-center triplet loss effectively reduces the intra-class differences.

4.4.4. Ablation Experiments

The previous three subsections determined the parameter-sharing scheme, the multi-granularity pooling configuration, and the joint loss function through comparative experiments. Building on those results, this section conducts ablation experiments against the baseline model to further verify the effectiveness of each module.
The ablation experiments were performed on the SYSU-MM01 dataset in both All Search and Indoor Search modes, and the results are shown in Table 5. The full method achieves 71.27% Rank-1 and 68.11% mAP in All Search mode, improvements of 3.59% and 3.29% over the baseline, respectively, and 77.64% Rank-1 and 81.06% mAP in Indoor Search mode, improvements of 3.38% and 2.57%, respectively.
Further analysis shows that: (1) both PS and MGP improve the model performance, indicating that parameter sharing and the multi-granularity pooling strategy not only improve the representation learning ability of the model but also reduce the modality differences; (2) the DCL-LS-HCT loss, which combines the distribution consistency loss, label smoothing cross-entropy loss, and hetero-center triplet loss, also improves the baseline, suggesting that it effectively reduces the intra-class differences; (3) when PS, MGP, and DCL-LS-HCT work together, the modality differences and intra-class differences are both effectively reduced and the model performance is greatly improved, which demonstrates the effectiveness of the proposed method.

5. Conclusions

To address the problems of modality differences and intra-class differences in cross-modality person re-identification, this paper jointly inputs the middle modality images and the original images into a two-stream network with parameter sharing for feature extraction and uses a multi-granularity pooling strategy combining global and local features to improve the representation learning ability of the model, which effectively reduces the modality differences. The intra-class differences are reduced with the hetero-center triplet loss, which is combined with the distribution consistency loss and the label smoothing cross-entropy loss to jointly optimize the model. Extensive experiments on publicly available datasets show that the proposed method performs better than existing state-of-the-art (SOTA) methods.
In future work, we will study the following directions in depth:
  • In this paper, the middle modality images and the original images are jointly input to a two-stream parameter-sharing network for feature extraction, and the extracted features are directly concatenated. In the future, we will design a more reasonable parameter-sharing network that focuses on pedestrian feature alignment and avoids introducing noise.
  • In this paper, a multi-granularity pooling strategy combining global and local features is used to improve the representation learning capability of the model. We will further investigate the partitioning of local features and design a better combination of global and local features to further improve the model’s representation learning capability.
  • In this paper, the distribution consistency loss, label smoothing cross-entropy loss, and hetero-center triplet loss are used to jointly optimize the model, which converges slowly during training. In the future, we will optimize the loss function to accelerate convergence.

Author Contributions

Methodology, L.M.; formal analysis, H.G. and Y.L.; writing—original draft preparation, Z.G.; writing—review and editing, L.M.; project administration, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Industry Innovation Chain Project of Shaanxi Key Research and Development Plan (2021ZDLGY07−08).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ming, Z.; Zhu, M.; Wang, X.; Zhu, J.; Cheng, J.; Gao, C.; Yang, Y.; Wei, X. Deep learning-based person re-identification methods: A survey and outlook of recent works. Image Vis. Comput. 2022, 119, 104394. [Google Scholar] [CrossRef]
  2. Yaghoubi, E.; Kumar, A.; Proença, H. Sss-pr: A short survey of surveys in person re-identification. Pattern Recognit. Lett. 2021, 143, 50–57. [Google Scholar] [CrossRef]
  3. Wu, A.; Zheng, W.S.; Yu, H.X.; Gong, S.; Lai, J. RGB-infrared cross-modality person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5380–5389. [Google Scholar]
  4. Zhu, Y.; Yang, Z.; Wang, L.; Zhao, S.; Hu, X.; Tao, D. Hetero-center loss for cross-modality person re-identification. Neurocomputing 2020, 386, 97–109. [Google Scholar] [CrossRef] [Green Version]
  5. Liu, H.; Tan, X.; Zhou, X. Parameter sharing exploration and hetero-center triplet loss for visible-thermal person re-identification. IEEE Trans. Multimed. 2020, 23, 4414–4425. [Google Scholar] [CrossRef]
  6. Ye, M.; Lan, X.; Li, J.; Yuen, P. Hierarchical discriminative learning for visible thermal person re-identification. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, New Orleans, LA, USA, 2–7 February 2018; pp. 7501–7508. [Google Scholar]
  7. Dai, P.; Ji, R.; Wang, H.; Wu, Q.; Huang, Y. Cross-modality person re-identification with generative adversarial training. In Proceedings of the IJCAI: International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; Volume 1, p. 6. [Google Scholar]
  8. Wang, G.; Zhang, T.; Cheng, J.; Liu, S. Rgb-infrared cross-modality person re-identification via joint pixel and feature alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3623–3632. [Google Scholar]
  9. Li, D.; Wei, X.; Hong, X.; Gong, Y. Infrared-visible cross-modal person re-identification with an x modality. In Proceedings of the AAAI conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 4610–4617. [Google Scholar]
  10. Zhang, Y.; Yan, Y.; Lu, Y.; Wang, H. Towards a unified middle modality learning for visible-infrared person re-identification. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 788–796. [Google Scholar]
  11. Sun, Y.; Qi, K.; Chen, W.; Xiong, W.; Li, P.; Liu, Z. Fusional Modality and Distribution Alignment Learning for Visible-Infrared Person Re-Identification. In Proceedings of the 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Prague, Czech Republic, 9–12 October 2022; IEEE: New York, NY, USA, 2022; pp. 3242–3248. [Google Scholar]
  12. Geng, M.; Wang, Y.; Xiang, T.; Tian, Y. Deep transfer learning for person re-identification. arXiv 2016, arXiv:1611.05244. [Google Scholar]
  13. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
  14. Kalayeh, M.M.; Basaran, E.; Gökmen, M.; Kamasak, M.E.; Shah, M. Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1062–1071. [Google Scholar]
  15. Zhao, H.; Tian, M.; Sun, S.; Shao, J.; Yan, J.; Yi, S.; Wang, X.; Tang, X. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1077–1085. [Google Scholar]
  16. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  17. Varior, R.R.; Haloi, M.; Wang, G. Gated siamese convolutional neural network architecture for human re-identification. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 791–808. [Google Scholar]
  18. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737. [Google Scholar]
  19. Chen, W.; Chen, X.; Zhang, J.; Huang, K. Beyond triplet loss: A deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 403–412. [Google Scholar]
  20. Xia, D.; Liu, H.; Xu, L.; Wang, L. Visible-infrared person re-identification with data augmentation via cycle-consistent adversarial network. Neurocomputing 2021, 443, 35–46. [Google Scholar] [CrossRef]
  21. Almahairi, A.; Rajeshwar, S.; Sordoni, A.; Bachman, P.; Courville, A. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 195–204. [Google Scholar]
  22. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  23. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 274–282. [Google Scholar]
  24. Munaro, M.; Fossati, A.; Basso, A.; Menegatti, E.; Van Gool, L. One-shot person re-identification with a consumer depth camera. In Person Re-Identification; Springer: London, UK, 2014; pp. 161–181. [Google Scholar]
  25. Hao, Y.; Wang, N.; Li, J.; Gao, X. HSME: Hypersphere manifold embedding for visible thermal person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8385–8392. [Google Scholar]
  26. Wang, Z.; Wang, Z.; Zheng, Y.; Chuang, Y.Y.; Satoh, S.I. Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 618–626. [Google Scholar]
  27. Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893. [Google Scholar] [CrossRef] [PubMed]
  28. Ye, M.; Shen, J.; Crandall, D.J.; Shao, L.; Luo, J. Dynamic dual-attentive aggregation learning for visible-infrared person re-identification. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVII 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 229–247. [Google Scholar]
  29. Fu, C.; Hu, Y.; Wu, X.; Shi, H.; Mei, T.; He, R. CM-NAS: Cross-modality neural architecture search for visible-infrared person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11823–11832. [Google Scholar]
  30. Liu, H.; Chai, Y.; Tan, X.; Li, D.; Zhou, X. Strong but simple baseline with dual-granularity triplet loss for visible-thermal person re-identification. IEEE Signal Process. Lett. 2021, 28, 653–657. [Google Scholar] [CrossRef]
  31. Zhang, Q.; Lai, C.; Liu, J.; Huang, N.; Han, J. Fmcnet: Feature-level modality compensation for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 7349–7358. [Google Scholar]
Figure 1. The overall structure of a cross-modality person re-identification method based on joint middle modality and representation learning.
Figure 2. Middle modality generator.
Figure 3. VIS, IR, and middle modality images.
Figure 4. (a) Early use of a two-stream network structure; (b) a two-stream network with a parameter-sharing structure.
Figure 5. Multi-granularity pooling network structure.
Table 1. Comparison of the proposed method with mainstream methods on the RegDB and SYSU-MM01 datasets. Bold indicates the best performance. R-1, VtI, and ItV denote Rank-1 accuracy, Visible to Infrared, and Infrared to Visible, respectively.
| Methods | RegDB VtI R-1 | RegDB VtI mAP | RegDB ItV R-1 | RegDB ItV mAP | SYSU-MM01 All Search R-1 | SYSU-MM01 All Search mAP | SYSU-MM01 Indoor Search R-1 | SYSU-MM01 Indoor Search mAP |
|---|---|---|---|---|---|---|---|---|
| HCML [6] | 24.44 | 20.08 | 21.70 | 22.24 | 14.32 | 16.16 | 24.52 | 30.08 |
| HSME [25] | 41.34 | 38.82 | 40.67 | 37.50 | 20.68 | 23.12 | - | - |
| D2RL [26] | 43.40 | 44.10 | - | - | 28.90 | 29.20 | - | - |
| AliGAN [8] | 57.90 | 53.60 | 56.30 | 53.40 | 42.40 | 40.70 | 45.90 | 54.30 |
| HC [4] | - | - | - | - | 56.96 | 54.95 | 59.74 | 64.91 |
| HcTri [5] | 91.05 | 83.28 | 89.30 | 81.46 | 61.68 | 57.51 | 63.41 | 68.17 |
| X-modality [9] | 62.21 | 60.18 | - | - | 49.92 | 50.73 | - | - |
| DDAG [28] | 69.34 | 63.46 | 68.06 | 61.80 | 54.75 | 53.02 | 61.02 | 67.98 |
| AGW [27] | 70.05 | 66.37 | - | - | 47.50 | 47.65 | 54.17 | 62.97 |
| CM-NAS [29] | 84.54 | 80.32 | 82.57 | 78.31 | 61.99 | 60.02 | 67.01 | 72.95 |
| DGTL [30] | 83.92 | 73.78 | 81.59 | 71.65 | 57.34 | 55.13 | 63.11 | 69.20 |
| FMCNet [31] | 89.12 | 84.43 | 88.38 | 83.86 | 66.34 | 62.51 | 68.15 | 74.09 |
| Ours | 94.18 | 86.54 | 91.16 | 83.67 | 71.27 | 68.11 | 77.64 | 81.06 |
Table 2. Parameter-sharing experiments on the RegDB and SYSU-MM01 datasets. Bold indicates the best performance. Stage 0–stage 4 denote the five stages of ResNet50.
| Experiments | Modality-Specific Feature Extractor | Modality-Shared Feature Embedder | RegDB R-1 | RegDB mAP | SYSU-MM01 R-1 | SYSU-MM01 mAP |
|---|---|---|---|---|---|---|
| 1 | stage 0 | stage 1–stage 4 | 89.94 | 83.25 | 67.68 | 64.82 |
| 2 | stage 0–stage 1 | stage 2–stage 4 | 90.24 | 83.36 | 68.85 | 66.15 |
| 3 | stage 0–stage 2 | stage 3–stage 4 | 90.31 | 83.48 | 69.53 | 66.90 |
| 4 | stage 0–stage 3 | stage 4 | 86.86 | 80.22 | 64.32 | 62.54 |
Table 3. Multi-granularity pooling experiments on the RegDB and SYSU-MM01 datasets. Bold indicates the best performance. GAP and GeM denote Global Average Pooling and Generalized Mean Pooling, respectively. “✓” indicates that the method was adopted.
| Experiments | Multi-Granularity | GAP | GeM | RegDB R-1 | RegDB mAP | SYSU-MM01 R-1 | SYSU-MM01 mAP |
|---|---|---|---|---|---|---|---|
| 1 | | ✓ | | 89.94 | 83.25 | 67.68 | 64.82 |
| 2 | | | ✓ | 90.60 | 84.10 | 67.94 | 65.27 |
| 3 | ✓ | ✓ | | 89.68 | 83.24 | 69.47 | 66.37 |
| 4 | ✓ | | ✓ | 90.03 | 83.78 | 68.54 | 66.74 |
Table 4. Joint loss function experiments on the RegDB and SYSU-MM01 datasets. Bold indicates the best performance. DCL-LS-T denotes the joint Distribution Consistency Loss, Label Smoothing Cross-entropy Loss, and Triplet loss; DCL-LS-HCT denotes the joint Distribution Consistency Loss, Label Smoothing Cross-entropy Loss, and Hetero-Center Triplet loss. “✓” indicates that the method was adopted.
| Experiments | DCL-LS-T | DCL-LS-HCT | RegDB R-1 | RegDB mAP | SYSU-MM01 R-1 | SYSU-MM01 mAP |
|---|---|---|---|---|---|---|
| 1 | ✓ | | 89.94 | 83.25 | 67.68 | 64.82 |
| 2 | | ✓ | 91.68 | 83.48 | 69.11 | 65.87 |
Table 5. Ablation study on the SYSU-MM01 dataset. Bold indicates the best performance. Baseline denotes the baseline method, PS denotes Parameter Sharing, MGP denotes Multi-granularity Pooling, and DCL-LS-HCT denotes the joint Distribution Consistency Loss, Label Smoothing Cross-entropy Loss, and Hetero-Center Triplet loss. “✓” indicates that the method was adopted.
| Baseline | PS | MGP | DCL-LS-HCT | All Search R-1 | All Search R-10 | All Search R-20 | All Search mAP | Indoor Search R-1 | Indoor Search R-10 | Indoor Search R-20 | Indoor Search mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | | 67.68 | 94.94 | 98.45 | 64.82 | 74.26 | 98.09 | 99.70 | 78.49 |
| ✓ | ✓ | | | 69.54 | 96.04 | 98.91 | 66.90 | 76.69 | 98.72 | 99.75 | 80.82 |
| ✓ | | ✓ | | 69.47 | 95.95 | 98.86 | 66.37 | 76.18 | 98.14 | 99.63 | 80.24 |
| ✓ | | | ✓ | 69.11 | 95.53 | 98.46 | 65.87 | 75.59 | 98.15 | 99.53 | 79.63 |
| ✓ | ✓ | ✓ | ✓ | 71.27 | 96.11 | 98.72 | 68.11 | 77.64 | 98.16 | 99.42 | 81.06 |

