Visible-Infrared Person Re-Identification: A Comprehensive Survey and a New Setting

Abstract: Person re-identification (ReID) plays a crucial role in video surveillance, where the aim is to search for a specific person across disjoint cameras, and it has progressed notably in recent years. However, visible cameras may not record enough information about a pedestrian's appearance under low illumination. Thermal infrared images, by contrast, can significantly mitigate this issue. Combining visible images with infrared images is therefore a natural trend, although the two are considerably heterogeneous modalities. Several attempts have recently been made at visible-infrared person re-identification (VI-ReID). This paper provides a complete overview of current VI-ReID approaches that employ deep learning algorithms. To align with practical application scenarios, we first propose a new testing setting and systematically evaluate state-of-the-art methods under it. Then, we compare ReID with VI-ReID in three aspects: data composition, challenges, and performance. Based on a summary of previous work, we classify the existing methods into two categories. Additionally, we elaborate on frequently used datasets and metrics for performance evaluation. We give insights into the historical development and summarize the limitations of off-the-shelf methods. Finally, we discuss future directions of VI-ReID that the community should further address.


Introduction
Person re-identification (ReID) is a fundamental building block in various computer vision tasks, such as intelligent surveillance, video analysis [1], and criminal investigation [2]. With the advancement of intelligent monitoring and the enormous expansion of video data in recent years, conventional manual inspection has become insufficient for intricate surveillance scenarios. ReID aims at searching for a given individual across disjoint cameras. Numerous algorithms designed for ReID have been proposed, with impressive results on some publicly available datasets, e.g., 98.1% and 94.5% Rank-1 accuracy on the Market-1501 [3] and DukeMTMC-ReID [4] datasets, respectively [5]. However, the images captured by visible cameras may be unusable in a dark environment. In such a case, infrared imaging equipment, which does not rely on visible light, should be applied. In 2017, Wu et al. [6] first introduced visible-infrared person re-identification (VI-ReID) and proposed a dataset named SYSU-MM01.
As shown in Figure 1a, for a given pedestrian, the images of the corresponding identity (ID) should be matched from the other modality set. In addition to the common challenges, e.g., low resolution, viewpoint change, pose variation, and occlusion, VI-ReID is a demanding problem that also faces a significant modality discrepancy. To improve the practical applicability of VI-ReID, researchers have achieved remarkable progress. We divide existing methods into two categories, non-generative-based and generative-based, as proposed in [7]. As shown in Figure 2a, the non-generative-based model mainly utilizes conventional methods, including feature representation learning and distance metric learning, to maximize the similarity between two images with the same ID and minimize the similarity between two images with different IDs [8][9][10]. In contrast, Figure 2b shows a generative-based model that unifies the modality at the data level, bridging the gap between two heterogeneous modalities [11,12].
To the best of our knowledge, almost all VI-ReID systems have been evaluated under the setting shown in Figure 1a. However, this may not be in line with actual scenes. Taking a visible image V as an example, it may be more similar to some negative visible samples than to positive infrared samples. The existing testing setting removes all visible images from the gallery to avoid this challenge. In this paper, we propose a novel testing setting that is closer to practical scenes. As shown in Figure 1b, instead of containing images from only one modality in the query and gallery, we put visible and infrared images into both the query and the gallery. This setting makes VI-ReID more challenging. Existing works created various two-stream architectures to learn modality-specific information in order to alleviate the cross-modality discrepancy. However, this kind of two-stream network may not extract effective features of visible and infrared images simultaneously in our new setting. Considering the realistic value of this setting, we believe that researchers should pay more attention to it.
In recent years, many excellent review papers have appeared on ReID. For example, Wang et al. [13] considered four different cross-modality application scenarios (low-resolution, infrared, sketch, and text) and then analyzed typical approaches. Ye et al. [8] categorized related works into closed-world ReID and open-world ReID, and proposed a strong baseline named AGW. Leng et al. [14] sorted the papers on open-world ReID by specific application scenario. Inspired by them, we conduct a thorough overview of VI-ReID.
Our contributions are threefold:
• We propose a new testing setting that is closer to practical application scenarios and conduct preliminary experiments to verify the significant challenges of the new setting.
• We compare VI-ReID with ReID in detail and provide a thorough review of VI-ReID techniques, including datasets and performance metrics.
• We summarize the necessary components of networks and discuss possible future directions of VI-ReID.

ReID vs. VI-ReID
Generally, ReID involves only the visible modality, while VI-ReID involves two modalities: visible and infrared. Visible images have three channels containing rich color information, while infrared images contain intensity information in a single channel only. As shown in Figure 3, there is a noticeable modality gap between visible and infrared images [11]. As the inter-modality discrepancy is substantially greater than the intra-modality discrepancy, bridging the modality gap between the two heterogeneous modalities is a major aspect of VI-ReID research. ReID faces only intra-modality challenges, e.g., changes in people's appearance, viewpoint change, and occlusion. In contrast, VI-ReID confronts not only the difficulties that appear in ReID, but also the cross-modality discrepancy. The networks designed for ReID are not suitable for VI-ReID since the solutions to intra- and inter-modality discrepancies are completely different.
To our knowledge, the performance gap between VI-ReID and ReID is also large. For example, ref. [15] achieved 95.7% Rank-1 accuracy on Market-1501, while [10] reached only 70.58% Rank-1 accuracy on the SYSU-MM01 dataset. However, VI-ReID is more valuable in practical application scenarios and deserves more attention.

A New Testing Setting
In a VI-ReID dataset, let $V = \{v_i\}_{i=1}^{N_v}$ and $T = \{t_i\}_{i=1}^{N_t}$ represent the visible and infrared images, respectively, where $N_v$ and $N_t$ denote the number of samples in each modality. Every image has a corresponding ID label $y \in \{Y_i\}_{i=1}^{N_p}$, where $N_p$ denotes the number of IDs. Given a certain image as the query, the purpose of VI-ReID is to match images with the same label from the other modality according to their similarity.
However, this setting is not in line with practical scenarios. Imagine that, given an image of a criminal who has been on the run for several days, we have to search for him across both visible and infrared cameras. The off-the-shelf methods may not be useful in such a case. Instead of containing images from only one modality, as shown in Figure 1b, we set the probe $P = V_p \cup T_p$ and the gallery $G = V_g \cup T_g$ to contain visible and infrared images simultaneously, where $\cup$ denotes union. $V_p$ and $V_g$ are mutually exclusive subsets of $V$, and $T_p$ and $T_g$ are mutually exclusive subsets of $T$. When methods are evaluated with our new setting, images with the same ID and modality as the probe cannot appear in the gallery, which avoids the influence of same-modality ReID. When training with this setting, the mainstream dual-branch network structure may not extract effective features because of the mixed modalities. Features of $P$ and $G$ are first generated by the feature extraction module and are then matched by the feature matching module.
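The gallery rule of the new setting can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the `Sample` record and the `valid_gallery` function are hypothetical names introduced here, and each image is represented only by its ID and modality labels.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    """One image, described only by its labels (hypothetical record type)."""
    pid: int        # person ID
    modality: str   # "visible" or "infrared"


def valid_gallery(query: Sample, gallery: List[Sample]) -> List[Sample]:
    """Return the gallery entries a query may be matched against.

    Gallery images that share BOTH the ID and the modality of the query
    are excluded, so every correct match comes from the other modality.
    """
    return [
        g for g in gallery
        if not (g.pid == query.pid and g.modality == query.modality)
    ]


# Toy mixed gallery G = V_g ∪ T_g with two IDs.
gallery = [
    Sample(1, "visible"), Sample(1, "infrared"),
    Sample(2, "visible"), Sample(2, "infrared"),
]
query = Sample(1, "visible")
kept = valid_gallery(query, gallery)
```

In this toy case the visible image of ID 1 is dropped, while the visible image of ID 2 stays in the gallery as a same-modality distractor. Such distractors are what the existing setting avoids by keeping the gallery single-modality.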

VI-ReID Methods
According to our investigation, all articles on VI-ReID published in mainstream conferences or journals are deep-learning-based. Hence, only deep-learning-based approaches are included in this review. For the non-generative-based model, we subdivide the methods into feature learning, metric learning, and training strategy. For the generative-based model, we subdivide the methods into modality translation and extra modality. Besides, we introduce some methods that use other technologies. We also summarize some algorithms intended for general ReID or other domains that perform well in VI-ReID. Some methods may fit multiple categories; in that case, we assign them to the most suitable one.

Milestones of Existing VI-ReID Studies
VI-ReID has achieved significant progress in a variety of areas thanks to the unwavering efforts of artificial intelligence researchers. We introduce the crucial milestones for VI-ReID along a timeline and present them in Figure 4. Note that a paper is selected as a milestone mainly on the basis of its citations: after dividing the papers into categories, we select the most-cited paper in each category.

Feature Representation Learning
Feature representation learning aims to extract robust and discriminative features that help the VI-ReID system correctly classify images into different fine-grained classes. We review three kinds of feature representation learning strategies.
Global Feature Representation Learning. As far as we know, most existing methods focus on extracting global features. An illustration is shown in Figure 5a. To obtain modality-specific information, Feng et al. [16] established two individual branches for visible and infrared images, respectively. The authors of [17] argued that learning only shared features causes a massive loss of information, which reduces the discriminability of the features. Therefore, they proposed a cross-modality shared-specific feature transfer algorithm. Ye et al. [18] pointed out that consistency at the feature and classifier levels is essential when dealing with modality differences. To learn discriminative representations in each modality, Wei et al. [19] developed an attention-lifting mechanism. Wang et al. [20] exploited spatial and channel information of images to reduce the discrepancy between the two heterogeneous modalities.
In addition, some works extract ID-invariant features by disentanglement to boost performance. To achieve more robust retrieval for VI-ReID, Pu et al. [21] disentangled an ID-discriminable and an ID-ambiguous cross-modality feature subspace. The authors of [22] argued that existing methods do not explicitly discard spectrum information that is unrelated to VI-ReID. As a result, they disentangled the spectrum information in order to maximize invariant ID information while minimizing the influence of spectrum information. Zhao et al. [23] learned color-irrelevant features through color-irrelevant consistency learning and aligned the ID-level feature distributions by ID-aware modality adaptation. Hao et al. [24] confused the two modalities to learn modality-irrelevant representations. In [10], the authors extracted modality-irrelevant features by channel attention-guided instance normalization (IN).
Local Feature Representation Learning. As shown in Figure 5b, compared to the global feature, the local feature is more focused on the differences in details. Lin et al. proposed an attribute-person recognition network to make full use of the information contained in attributes [25]. Hao et al. [26] replaced global features with part-level features so that fine-grained camera-invariant information can be extracted. In [27], the authors proposed an adaptive body partition model for automatically detecting and distinguishing effective component representations. Liu et al. [28] presented a network that jointly learns global and local features to cope with viewpoint change and pose variation. Ye et al. [29] excavated contextual cues at the intra-modality components and cross-modality graph levels. Wang et al. [30] utilized global features and partial features to realize the complement of global information and detailed information. To select useful features, Wei et al. [31] designed a flexible body partition module to distinguish part representations automatically. Zhang et al. concatenated the global feature and local feature to create a more powerful feature descriptor [32]. In [33], aiming to eliminate the interference of background information, the authors exploited the knowledge of human body parts to extract robust features. Wu et al. [10] utilized pattern alignment to discover nuances in different patterns. Zhang et al. [34] also made an attempt to discover semantic differences between contrastive features by cross correlation.
Auxiliary Feature Representation Learning. Ye et al. [35] exploited auxiliary information, including the distribution of cross-modality features and contextual information, to bridge the gap between heterogeneous modalities. In [36], the authors designed camera-based batch normalization (BN) to guarantee an invariant input distribution independent of all cameras.
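The idea behind camera-based normalization can be sketched in a few lines. The sketch below is a simplified assumption rather than the implementation of [36]: it standardizes features with per-camera statistics only, and it omits the learnable scale and shift and the running statistics of a real BN layer. The function name `camera_wise_normalize` and the value of `eps` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List


def camera_wise_normalize(features: List[List[float]],
                          cam_ids: List[int],
                          eps: float = 1e-5) -> List[List[float]]:
    """Standardize each feature dimension using the mean and variance of
    the samples captured by the same camera."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for idx, cam in enumerate(cam_ids):
        groups[cam].append(idx)

    out: List[List[float]] = [[] for _ in features]
    for idxs in groups.values():
        n = len(idxs)
        dim = len(features[idxs[0]])
        mean = [sum(features[i][d] for i in idxs) / n for d in range(dim)]
        var = [sum((features[i][d] - mean[d]) ** 2 for i in idxs) / n
               for d in range(dim)]
        for i in idxs:
            out[i] = [(features[i][d] - mean[d]) / math.sqrt(var[d] + eps)
                      for d in range(dim)]
    return out


# Two cameras whose raw features live on very different scales.
raw = [[1.0], [3.0], [10.0], [30.0]]
cams = [0, 0, 1, 1]
normalized = camera_wise_normalize(raw, cams)
```

After normalization, the samples of both cameras occupy roughly the same range, which is the sense in which the input distribution becomes independent of the camera.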

Metric Learning
The purpose of metric learning is to guide feature representation learning. We will go through some prevalent loss functions and training strategies.
Loss Function Design. Generally, researchers design different loss functions to solve targeted problems based on observed phenomena. VI-ReID is affected by a large cross-modality discrepancy and by intra-modality variations caused by varied camera angles, human postures, etc. As shown in Figure 6a, the identity loss, which is widely used, classifies a sample into the correct class in the training phase. The contrastive loss, shown in Figure 6b, mainly constrains the training of Siamese networks. For instance, Ye et al. [37] proposed a hierarchical cross-modal matching model, which jointly optimized the modality-shared and modality-specific matrices, aiming at the problem of perspective change when different cameras record a person. To minimize the difference between same-modality and cross-modality similarities, Wu et al. [38] guided the learning of cross-modality similarity by same-modality similarity.
Figure 6c shows the triplet loss, which pulls positive sample pairs closer together and pushes negative sample pairs farther apart, so that samples with the same ID form clusters in feature space. The approach constrains the features by a set of triplets to obtain high performance [39]. Wang et al. [40] proposed an improved triplet loss to match a video with an image. Ye et al. [9] proposed a bi-directional dual-constrained top-ranking loss to guide the feature learning objectives. Then, they improved this work by replacing the similarity between two samples with the similarity between a sample and a center [41]. To alleviate the strict constraint of the classical triplet loss, Liu et al. [2] proposed an improved triplet loss that operates center to center instead of instance to instance. Zhang et al. mitigated the modality discrepancy by mapping the heterogeneous representations into a common space [42]. To learn an angularly separable common feature space, Ye et al. [1] constrained the angles between feature vectors. Cai et al. [43] proposed a dual-modality hard mining triplet-center loss (DTCL), which can reduce computational cost and mine hard triplet samples. To eliminate the effect of inconsistent feature distributions in different modalities, Zhang et al. [44] mapped the feature space to an angular space and proposed several loss functions to conduct specific angular metric learning. Figure 6d shows the quadruplet loss, which is an improved version of the triplet loss. It adds a constraint on the relative distance between samples with different IDs. The authors of [45] observed that current approaches primarily combine classification and metric learning to train models that generate discriminative and robust representations, but ignore the relationship between the classification and feature embedding subspaces. Based on this observation, they presented a hyperspherical manifold-embedded network with classification and recognition constraints. Jia et al. [46] utilized similarity transitivity to tackle the problem of mismatching hard positive samples.
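To make the difference between instance-based and center-based triplet constraints concrete, the following sketch computes both on a toy example. It assumes Euclidean distance and a margin of 0.3, neither of which is taken from the cited works, and the function names are hypothetical. It is not the exact loss of [2] or [41], which also define how triplets are sampled within a batch.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 0.3) -> float:
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


def class_centers(features: List[Vector],
                  pids: List[int]) -> Dict[int, List[float]]:
    """Mean feature vector of every person ID."""
    grouped: Dict[int, List[Vector]] = {}
    for feat, pid in zip(features, pids):
        grouped.setdefault(pid, []).append(feat)
    return {
        pid: [sum(col) / len(feats) for col in zip(*feats)]
        for pid, feats in grouped.items()
    }


# Two samples of ID 1 that lie far apart, and one sample of ID 2.
features = [[0.0, 0.0], [2.0, 0.0], [1.5, 0.0]]
pids = [1, 1, 2]
centers = class_centers(features, pids)

# Instance to instance: the distant positive violates the constraint.
loss_instance = triplet_loss(features[0], features[1], features[2])

# Instance to center: the anchor is compared with its own class center
# and with the center of the other ID.
loss_center = triplet_loss(features[0], centers[1], centers[2])
```

In this example the instance-based loss is positive because one positive sample is an outlier, whereas the center-based variant is already satisfied. This illustrates, in a simplified way, why center-based formulations are described as relaxing the strict constraint of the classical triplet loss.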
Training Strategy. To integrate different loss functions into an organic whole, researchers have proposed different training strategies. Dai et al. [47] proposed a generative adversarial training strategy to deal with the lack of discriminative information. Ye et al. [48] observed that existing VI-ReID learning strategies ignore the discriminative information of different modalities. Therefore, they presented a modality-aware collaborative learning strategy to deal with the gap between the two modalities at both the feature level and the classifier level. Zhang et al. [49] proposed a mutual learning module that provides a bidirectional transfer between the two modalities, aiming at extracting useful information from them. Ling et al. [50] argued that most existing methods constrain similarity at the instance or class level only, which is inadequate to make full use of the hidden relationships in cross-modality data. Hence, they proposed multi-constraint similarity learning from instance to instance, instance to class center, and class center to class center. Gao et al. [51] proposed a learning strategy for joint optimization of the single-modality and unified-modality spaces.

Generative-Based Model
The generative-based model mainly utilizes a generative adversarial network (GAN) or an encoder-decoder module to realize mutual translation between the two modalities. Then, ReID methods are applied to reduce the remaining appearance discrepancy.

Modality Translation
In recent years, GAN-based modality translation has gradually become popular. As shown in Figure 7a, modality translation includes infrared to visible and visible to infrared. In contrast, some works disentangled ID-discriminative and ID-excluding factors, and then generated image pairs to extract highly discriminative features.
For infrared to visible, this kind of method can be regarded as image colorization that has been extensively used in various fields [52]. To our knowledge, there is little work in the literature using colorization. Zhong et al. [53] bridged the gap between the two modalities by fusing the features of original infrared images and generated fake visible images. After that, Zhong et al. [11] improved the performance by pixel-wise transformation, which can retain original structure information.
For visible to infrared, Kniaz et al. [54] matched the fake infrared images generated by a GAN with the gallery images to mitigate the modality discrepancy. Wang et al. [12] added a pixel alignment module on top of the feature alignment module of [47] to further reduce the gap between the two modalities. However, Liu et al. [55] argued that methods employing a GAN to generate fake images destroy the structure information of the generated images and introduce plenty of noise. Hence, they replaced the GAN-generated fake images with three-channel grayscale images.
Figure 7. (a) Modality translation: infrared to visible [11,53] and visible to infrared [12,54]; (b) more works generate visible-infrared image pairs, employing their combination [56][57][58][59][60].
For dual translation, as shown in Figure 7b, it encodes the visible and infrared modalities into a consistent space to eliminate the effect of modality style. It then generates fake cross-modality image pairs with the same ID. In 2019, Wang et al. [56] first generated visible-infrared image pairs by disentanglement and mapped them into a unified space. Analogous to [56], the idea of disentanglement is also indicated in [57][58][59][60]. Among them, Choi et al. [57] encoded the prototype and the attribute separately to generate fake images containing invariant features. Meanwhile, [58,59] acquired visible-infrared image pairs by feature disentanglement and [60] added unseen IDs to generate discriminative features based on [58]. In [61], the network extracted appearance invariant features by generating corresponding fake images.

Extra Modal
Aside from modality translation, some works alleviated the modality discrepancy by introducing an additional third modality. In 2020, Li et al. [62] first introduced an "X" modality as the middle modality to eliminate the cross-modality discrepancy. Subsequently, Huang et al. [63] learned the shared features of images from both modalities to guide the generation of extra images. Ye et al. [64] bridged the gap between the two modalities by generating 3-channel grayscale images. Miao et al. [65] introduced two novel relevant modalities to investigate modality invariant representations. In [66], the authors reduced the cross-modality discrepancy by fusing the two modalities. Wei et al. [67] combined information from visible and infrared images to generate syncretic modality, which can help the network extract modality invariant representations. Zhang et al. [68] projected the images from both modalities into a consolidated subspace to mitigate the modality discrepancy.
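A three-channel grayscale image of the kind used in [55,64] can be produced without any generative model. The sketch below is only an illustration: the luminance weights (0.299, 0.587, 0.114) are the common ITU-R BT.601 coefficients and are an assumption here, since the cited works may compute the grayscale values differently. The function name `to_three_channel_gray` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]


def to_three_channel_gray(image: List[List[Pixel]]) -> List[List[Pixel]]:
    """Convert an RGB image (rows of (r, g, b) pixels) to grayscale and
    replicate the gray value over three channels, so that the result has
    the same shape as a visible image but carries no color information."""
    gray_image = []
    for row in image:
        gray_row = []
        for r, g, b in row:
            # Assumed BT.601 luminance weights.
            y = 0.299 * r + 0.587 * g + 0.114 * b
            gray_row.append((y, y, y))
        gray_image.append(gray_row)
    return gray_image


# A 1x3 toy image: pure red, pure green, and white.
rgb = [[(255.0, 0.0, 0.0), (0.0, 255.0, 0.0), (255.0, 255.0, 255.0)]]
gray = to_three_channel_gray(rgb)
```

Because the output keeps three channels, it can be fed to a network designed for visible images while removing the color cue that infrared images lack.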

Other Methods
Besides the aforementioned methods, some works also alleviated the impact of a large modality discrepancy by introducing some other technologies. Almost all existing works bridge the gap between the two modalities by manually designing feature extraction modules. Such a manually designed routine usually requires plenty of domain knowledge and practical experience. Therefore, Fu et al. [69] and Chen et al. [70] proposed a cross-modality neural architecture search method and a neural feature search method, respectively, to automatically realize the process of feature extraction. Inspired by the information bottleneck (IB), Tian et al. [71] designed a new strategy that can preserve sufficient label information while simultaneously getting rid of task-irrelevant details. Liang et al. [72] thought that the high cost of labeling person IDs in datasets greatly limits the development of supervised models. Hence, they proposed an unsupervised homogeneous-heterogeneous approach for the unsupervised visible-infrared problem. In [73], the authors used distance metrics instead of a fully connected layer to learn discriminative features. Ye et al. [74] decomposed three channels of visible images and excavated the relationship between each individual channel and infrared image.
In addition, as the tasks of ReID and VI-ReID are largely the same, some networks designed for ReID or other similar tasks are also valid for VI-ReID. For instance, Ye et al. [8] proposed a strong baseline for ReID, which also shows excellent performance on VI-ReID. Jin et al. [75] restored the information removed by IN to achieve high performance. Methods that solve problems in other related domains can also be applied to VI-ReID. For example, Yang et al. [76] proposed an unsupervised graph alignment method that aligns both data representations and distribution structures across the source and target domains, aiming at general cross-domain visual feature representations. The method of [77] mitigates the negative effects of noisy similarities in cross-modality retrieval by using intra-modality distributions. These methods perform excellently on their corresponding tasks; therefore, we can learn from their ideas.

Summary
From the perspective of method categories, we make the following summaries:
• Different methods have different strengths. Non-generative-based models are dedicated to mitigating the modality gap at the feature level (e.g., [9]), while generative-based models pay more attention to the pixel level (e.g., [11]). Compared to non-generative-based models, generative-based models either lose information or introduce noise when unifying the modalities. However, a generative-based model can avoid the impact of color information. A more detailed summary of mainstream works is shown in Table 1.
• Combining with other techniques is a growing trend. To acquire more discriminative features, some researchers combined this task with universal techniques (e.g., [70]), and there are also methods (e.g., [71]) that treat this issue from a fresh perspective.
• The existing setting is not in line with practical application scenarios. In some cases, the modality discrepancy is larger than the differences among IDs. However, the existing testing settings avoid this challenge by putting only single-modality images into the gallery.

Datasets
We first review two prevalent VI-ReID datasets (RegDB [78] and SYSU-MM01 [6]). Some pedestrian image samples from the two datasets are shown in Figure 8.
RegDB [78] contains 412 different IDs, of which 254 are female and 158 are male, and each ID corresponds to 10 visible images and 10 infrared images. From the samples shown in the first four columns of Figure 8, we can see clear differences in color and exposure between the images captured by the two different cameras. Generally, the dataset is randomly split into two halves for training and testing, respectively, according to the evaluation protocol in [37]. In the testing phase, the images from one modality are used as the query, while the gallery contains the images from the other modality. The final result is the average over 10 repeated trials.
SYSU-MM01 [6] is a public dataset for VI-ReID proposed in 2017. It contains four cameras for capturing visible images and two for capturing infrared images. Cameras 1 and 2 are placed in two bright rooms, and cameras 4 and 5 are placed in bright outdoor scenes to capture visible images. Infrared cameras 3 and 6 are placed in a room and an outdoor scene, respectively, to capture infrared images without light. In total, SYSU-MM01 contains 287,628 visible images and 15,792 infrared images of 491 different IDs. As shown in Figure 8, the images in SYSU-MM01 are unpaired in terms of pose, viewpoint, etc.
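The RegDB protocol described above (a random half split of the IDs, repeated 10 times and averaged) can be sketched as follows. The seeding scheme, the function names, and the assumption that each trial draws a fresh random split are illustrative choices that the text does not specify.

```python
import random
from typing import List, Sequence, Tuple

NUM_IDS = 412      # number of RegDB identities stated above
NUM_TRIALS = 10    # number of repeated trials stated above


def split_ids(num_ids: int, seed: int) -> Tuple[List[int], List[int]]:
    """Randomly split the identity indices into two equal halves."""
    ids = list(range(num_ids))
    random.Random(seed).shuffle(ids)
    half = num_ids // 2
    return sorted(ids[:half]), sorted(ids[half:])


def average(values: Sequence[float]) -> float:
    """Mean of the per-trial scores, i.e., the reported final result."""
    return sum(values) / len(values)


# One (train IDs, test IDs) pair per trial; the seed is hypothetical.
splits = [split_ids(NUM_IDS, seed=trial) for trial in range(NUM_TRIALS)]
train_ids, test_ids = splits[0]
```

A model would be trained and tested once per split, and `average` would then be applied to the 10 resulting Rank-1 (or mAP) values.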

Evaluation Metrics
Evaluation metrics play an important role in assessing the strengths and weaknesses of a system. There are two widely used metrics for VI-ReID: cumulative matching characteristics (CMC) [79] and mean average precision (mAP) [3].
CMC. Rank-r represents the probability that a correct match appears in the top-r search results ranked by confidence. For single shot, this is accurate. However, for multi-shot, CMC [79] cannot accurately represent a model's discriminability, as it only examines the first match of ranked result.
mAP. The other widely used metric, mAP [3], is a more comprehensive measure of the performance of a VI-ReID algorithm. It reflects how close to the top of the ranked list all gallery images with the same ID as the probe appear. Therefore, it can effectively distinguish two algorithms that perform equally well in finding the first match. However, when a hard sample appears, mAP may still have difficulty determining which of two algorithms is better.
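The two metrics can be illustrated on ranked match lists. In the sketch below, each query is represented by a list of booleans ordered by decreasing similarity, where `True` marks a gallery image with the same ID as the query. The function names are hypothetical, and the sketch ignores protocol details such as camera filtering.

```python
from typing import List, Tuple


def rank_r_hit(matches: List[bool], r: int) -> float:
    """1.0 if at least one correct match appears in the top-r results."""
    return 1.0 if any(matches[:r]) else 0.0


def average_precision(matches: List[bool]) -> float:
    """Mean of the precision values measured at each correct match."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def evaluate(all_matches: List[List[bool]], r: int = 1) -> Tuple[float, float]:
    """Rank-r accuracy and mAP averaged over all queries."""
    n = len(all_matches)
    rank_r = sum(rank_r_hit(m, r) for m in all_matches) / n
    mean_ap = sum(average_precision(m) for m in all_matches) / n
    return rank_r, mean_ap


# Two hypothetical algorithms ranking the same gallery for one query.
ranking_a = [True, False, True, False]   # second match at rank 3
ranking_b = [True, False, False, True]   # second match at rank 4

ap_a = average_precision(ranking_a)
ap_b = average_precision(ranking_b)
```

Both rankings achieve a Rank-1 hit, so CMC cannot separate them, while the average precision is higher for `ranking_a` because its second correct match appears earlier.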

Analysis of the State of the Art with Existing Setting
The performance results of state-of-the-art methods on RegDB and SYSU-MM01 are shown in Tables 2 and 3, respectively. From Table 2, we observe that [2] achieves superior performance, with Rank-1/mAP of 91.05%/83.28% for the visible-to-thermal query setting on RegDB. The main improvement comes from two aspects: replacing global-level features with part-level features and utilizing a center-based triplet loss instead of an instance-based triplet loss. As the images with the same ID but different modalities in RegDB are well aligned, the part-level features are more effective. In contrast, the images in SYSU-MM01 are not well aligned. Hence, the boost from part-level features is smaller on SYSU-MM01. Moreover, the improvement in the loss function also plays a key role in performance enhancement. Some other works are also committed to this improvement, e.g., Ye et al. [9] adjusted the instance-to-instance-based triplet loss to an instance-to-class-center-based loss, and the performance on SYSU-MM01 improved significantly.
Table 2. Rank-r accuracy (%) and mAP (%) performance of state-of-the-art methods on RegDB. Bold numbers are the best results.
Table 3. Rank-r accuracy (%) and mAP (%) performance of state-of-the-art methods on SYSU-MM01. Bold numbers are the best results.
As shown in Table 3, MPANet [10] performs best on SYSU-MM01. As the infrared modality contains limited information, the differences among infrared IDs are extremely inconspicuous. Most existing methods deal with the cross-modality discrepancy by proposing novel loss functions or introducing other modalities. In addition to addressing the modality discrepancy, MPANet exploits the nuances among different infrared images to extract more discriminative features.

Results of the State-of-the-Arts with New Setting
To evaluate the newly proposed testing setting, we construct new testing datasets based on RegDB and SYSU-MM01, named RegDB_Mix and SYSU-MM01_Mix, respectively. The reconstructed datasets have the same numbers of identities and images as the original datasets. Rather than putting images of the two modalities into the query and the gallery separately, we mix visible and infrared images, and remove from the gallery the images that have the same modality and identity as the query image. We train and test these approaches on a single NVIDIA Tesla P100 GPU. The other settings are consistent with those in the original papers. The multi-shot results on the two datasets are presented in Table 4. Compared to Tables 2 and 3, we observe that Rank-1 and mAP both decline considerably. Specifically, on RegDB_Mix, CM-NAS [69] achieves a Rank-1 accuracy of 42.50% and an mAP of 41.73%, approximately only half of its Rank-1 and mAP values on RegDB in the visible-to-infrared mode. On SYSU-MM01_Mix, CM-NAS achieves a Rank-1 accuracy of 36.48% and an mAP of 29.51%, a drop of 25.51 percentage points in Rank-1 accuracy and 30.51 percentage points in mAP relative to SYSU-MM01 with the all-search and single-shot modes. To better present the challenges posed by the new setting, we randomly select 10 IDs from the testing set and visualize the distributions of the learned features with t-SNE [84]. Here, we choose AGW [8] to extract features of the selected images. As shown in Figure 9, most of the features extracted by AGW cluster well. However, compared to Figure 9a, many infrared images with different IDs (e.g., blue, red, and yellow) are clustered together in Figure 9b. This means that pedestrian images with a certain ID are more likely to be confused with images of other IDs. In fact, this is also more in line with reality: as infrared images contain less information, they are more difficult for VI-ReID systems to discern.
Figure 9. Visualization of the distributions of features extracted with AGW [8]: (a) testing data with the existing setting; (b) testing data with the new setting. A total of 10 IDs are randomly selected from the testing set of SYSU-MM01. Samples with the same color belong to the same person. The markers "circle" and "square" represent images from the infrared and visible modalities, respectively.

Conclusions and Future Directions
With the increase in functional application requirements, VI-ReID has attracted some researchers' attention. This paper presents a comprehensive survey of VI-ReID. We first compare it with ReID in detail to show the different challenges of VI-ReID. With powerful deep learning techniques, VI-ReID has achieved remarkable progress, and we divide the existing methods into two categories: non-generative-based and generative-based methods. For the non-generative-based model, we analyze the method in terms of feature learning, metric learning, and training strategy. In contrast, the generative-based model applies modality translation to bridge the modality gap. Finally, we describe standard datasets in detail, evaluation metrics, and performance of the state-of-the-art methods on two datasets.
From Tables 2 and 3, we observe that the performance of VI-ReID on the two public datasets has improved considerably in recent years. Meanwhile, the complexity of network architectures has also increased. Among the existing network architectures, feature learning and metric learning are the essential modules. The primary function of feature learning is to extract modality-specific and modality-shared features. Recently, works aiming to extract more effective features, including global-local feature fusion, have become more popular. Distance metric learning, in turn, pulls together features with the same ID and pushes apart features with different IDs.
From the experimental results, we observe the following directions in VI-ReID:
• One-stream network architecture. In terms of the testing protocol, we believe that the new, more practical setting is more valuable to research than the existing setting. Considering that existing two-stream network architectures cannot effectively address the challenges of the new setting, a one-stream network that can extract more robust and effective features of two heterogeneous modalities may become a trend.
• Weakly supervised or self-supervised learning. Considering the difficulty of obtaining a sufficient amount of high-confidence data, we should concentrate on data with no labels or with low label confidence. Approaches that leverage this kind of data to address related issues, such as those of [85,86], are highly advanced in ReID. We believe that numerous works on weakly supervised or self-supervised learning will appear in VI-ReID in the future.
• Transfer learning. As the number of neural networks grows and their structures become more and more complicated, we expect that a neural network can draw on existing resources when facing comparable tasks. Further research on transfer learning, which has been widely used in ReID [87][88][89], may be a promising direction in VI-ReID.
Limitations. First of all, this review draws on the authors' summary of the literature analysis. Although we aim to be objective in the analysis process, we cannot avoid a strong subjective tone. Thus, all descriptions are built on personal opinions. Moreover, this survey classifies the networks according to the criteria in [8]. However, we focus on VI-ReID, which accounts for only a small part of [8]. Finally, this review only covers research results published in mainstream conferences or journals in this field. The main reason is that these articles arguably represent the research methods and research trends in the area sufficiently well.

Acknowledgments:
The authors would like to thank the anonymous reviewers and the editor for their careful reviews and constructive suggestions to help us improve the quality of this paper.

Conflicts of Interest:
The authors declare no conflict of interest.