Feature-Level Camera Style Transfer for Person Re-Identification

Abstract: The person re-identification (re-ID) problem has attracted growing interest in the computer vision community. Most public re-ID datasets are captured by multiple non-overlapping cameras, and the same person may appear dissimilar across camera views due to variations in illumination, viewpoint and posture. These differences, collectively referred to as camera style variance, keep person re-ID a challenging problem. Recently, researchers have attempted to address this problem with generative models. The generative adversarial network (GAN) is widely used for pose transfer or data augmentation to bridge the camera style gap. However, these methods, mostly based on image-level GANs, require substantial computational power to train the generative models. Furthermore, the training of the GAN is separated from that of the re-ID model, which makes it hard to reach a global optimum for both models simultaneously. In this paper, the authors propose to alleviate camera style variance in the re-ID problem by adopting a feature-level Camera Style Transfer (CST) model, which serves as an intra-class augmentation method and enhances model robustness against camera style variance. Specifically, the proposed CST method transfers the camera style-related information of input features while preserving the corresponding identity information. Moreover, the training process can be embedded into the re-ID model in an end-to-end manner, so the proposed approach can be deployed with much lower time and memory cost. The proposed approach is verified on several different person re-ID baselines. Extensive experiments show the validity of the proposed CST model and its benefits for re-ID performance on the Market-1501 dataset.


Introduction
Person re-ID [1] aims to match images of the same person appearing in different camera views. Due to the camera style variance in surveillance systems, images of the same pedestrian often show significant differences in pose, illumination and background, which increase intra-class variation and harm retrieval performance. Thus, camera style variance is a major difficulty in the person re-ID task.
With the rapid development of deep neural networks in the past decade, researchers have explored various deep learning-based methods for person re-ID [2]. Many researchers have tried to solve camera style variance as a part of the optimization problem in a common classification model. Some researchers focus on the innovation of deep learning loss functions. Wang et al. [3,4] study the extension of classification losses based on softmax, while Deng et al. [5] and Hermans et al. [6] investigate the usage of contrastive losses based on hard sample mining. Sun et al. [7] and Chen et al. [8] pay attention to the modification of the base model, which extracts pedestrians' features. However, these methods sidestep the camera style variance that innately exists in re-ID datasets. Therefore, they cannot solve the camera style variance problem fundamentally.
Another prevalent choice is using generative models to perform camera style-guided image-level data augmentation. Benefiting from the augmented dataset, the re-ID model can directly perceive these variations. With the rapid progress of GAN theory [9][10][11][12], many GAN-based practices [5,13,14] have proven helpful in re-ID tasks. However, these methods are coupled with off-the-backbone generative networks that cost tremendous time and memory during training. Taking CamStyle [14] as an example, training the person re-ID model on the extended dataset takes only about one hour, but training the generative model, which in this task consists of a set of CycleGANs [15], takes several days. Furthermore, the introduction of a complex generation network splits the training process of the entire model, which makes it hard to achieve an optimum for both models simultaneously.
In order to overcome camera style variance in the person re-ID problem, a feature-level Camera Style Transfer (CST) approach is proposed to serve as data augmentation with low time and memory cost. The authors propose utilizing an additional camera style feature extractor to learn camera-related information. The proposed CST approach then takes source identity features and target camera style features as input, and it generates features whose camera information is transferred while their identity information is maintained. An Adaptive Batch Normalization (ABN) layer is designed to inject camera information into the original identity feature. Compared with image-level GAN-based methods, the proposed CST model can produce high-quality features accurately and efficiently. Moreover, extensive experiments show that the proposed CST model can adapt to different baselines and metric learning methods, proving the universality of the proposed method. Finally, a lightweight framework is presented, where the generative model is embedded into a person re-ID model in an end-to-end manner. In summary, this paper makes the following main contributions.

• To overcome camera variations, a feature-level Camera Style Transfer (CST) model for person re-ID is proposed, acting as a data augmentation approach that benefits the training process of the re-ID model.
• To achieve camera style transfer while preserving the identity information, an Adaptive Batch Normalization (ABN) layer is designed to inject camera information and produce high-quality features.
• To optimize the generative model and the re-ID model simultaneously, a lightweight framework is proposed to train both models in an end-to-end manner.
The rest of this paper is organized as follows: Related work is discussed in Section 2. The proposed feature-level CST model is described in Section 3. Section 4 introduces the end-to-end training process when implementing the CST model in re-ID baselines. Experimental results are reported and analyzed in Section 5, which are followed by conclusions in Section 6.

Generative Adversarial Network (GAN)
GAN [9] aims at learning a mapping function from a latent space to a target distribution. The training objective function of GAN has been widely discussed and modified since its establishment. Mao et al. propose LSGAN [10] to minimize the Pearson divergence. Zhao et al. [11] treat the discriminator as an energy function that attributes lower energies to regions near the data manifold and higher energies to other regions. Arjovsky et al. propose Wasserstein GAN [16] to stabilize the training process of GAN. Mescheder et al. [12] discuss regularization strategies to stabilize GAN training.

Image Generation
At the same time, remarkable progress has been made in image generation with the development of GAN theory. Radford et al. [17] first bridge the gap between convolutional networks and GANs. Choi et al. [18] propose StarGAN to perform image-to-image translation across multiple domains using only a single model. Karras et al. [19] describe a unique training strategy that generates images progressively from low resolution to high resolution. Image generation can be used in image processing [20,21], and also for data augmentation in computer vision tasks.

Camera Style Transfer
Another notable topic is the use of adversarial training in camera style transfer. Sohn et al. [22] employ a GAN-based network to handle the unsupervised domain adaptation problem in face recognition. Yin et al. [23] propose a center-based feature transfer framework to enhance the face recognition model. Gao et al. [24] introduce a covariance-preserving GAN into the feature space in low-shot learning research. These studies on camera style transfer mainly focus on the face recognition task.

Person Re-Identification
A large family of approaches treat person re-ID as a metric learning problem. Many efforts focus on improving loss functions and the basic structure of feature encoders. Hermans et al. [6] introduce a triplet loss based on hard sample mining into re-ID. Chen et al. [25] use a novel scheme to perform instance-guided context rendering. Sun et al. [7] discuss the advantage of local features through a unique partition strategy. Chen et al. [8] propose the High-Order Attention (HOA) module to model and utilize complex, high-order statistical information in an attention mechanism. These methods mainly regard person re-ID as a classification problem yet neglect camera style variance, which innately exists in the re-ID problem.
Due to the rapid growth of GAN theory in recent years, several researchers have explored using GANs for image-level data augmentation in person re-ID. CycleGAN [15] is applied in some works [5,26] to perform cross-domain image style conversion for domain adaptation problems. StarGAN [18] is adopted to generate pedestrian images with different camera styles due to its flexibility. Liu et al. [27] improve the quality of generated images by decomposing the transformation into a set of sub-transformations. Zheng et al. [13] design a generative network by exploiting and reorganizing the structure information and the appearance information in the dataset. Chen et al. [25] formulate a dual-condition GAN that can enrich person images with contextual variations.
GAN-based methods can be helpful to alleviate camera style variance for the person re-ID task. However, due to the huge computing resources required by the training of GAN, image-level generative models are usually used as a pre-training or data augmentation process in re-ID. The training process of GAN is usually separated from the training of the re-ID model. In order to reduce the time and memory consumption of model training, the authors establish the generative model on a feature level and embed the generative model into the re-ID model in an end-to-end manner. Experimental results show that the proposed feature-level Camera Style Transfer (CST) approach can significantly improve the performance of multiple re-ID baselines with less time and memory cost.

Feature-Level Camera Style Transfer
In the person re-ID problem, camera style variance often leads to differences in viewpoint, illumination, and background, making it hard to determine whether two images captured by different cameras are of the same person or not. Some researchers apply pose-guided image generation as data augmentation. However, the image-level generative model takes a huge amount of time and memory. In addition, the training process of the image generation model is separated from the re-ID model, which makes it hard to achieve optimal performance for both models. To overcome these shortcomings, a GAN-based feature-level Camera Style Transfer (CST) approach is proposed to alleviate the camera style variance in the person re-ID task.

Problem Modeling
The person re-ID problem is related to determining whether the two input images belong to the same person. Traditional methods often consist of feature extraction and metric learning. Camera style variance can harm the representational ability of traditional identity features. This paper proposes extracting an additional camera style feature to provide camera-related information. Furthermore, a generative model is proposed to make camera style transfer to alleviate camera style variance for the re-ID problem.
Identity Feature and Camera Style Feature. Generally speaking, a deep learning-based person re-ID approach consists of two main parts: representative feature extraction and distance metric learning. In a typical approach, a feature extractor E_I takes a pair of person images (x_i, x_j) with labels (y_i, y_j) as input and outputs the extracted identity features (f_i^I, f_j^I). Then, the distance between the features is measured to determine the similarity of images x_i and x_j. If the two images have the same label, i.e., y_i = y_j, the approach should determine that x_i and x_j are of the same person. However, if x_i and x_j were captured by different cameras and thus have different camera labels (c_i ≠ c_j), the camera style information contained in the extracted features makes it hard to make the right decision. Therefore, an extra feature extractor E_C is used to capture features related to camera style (f_i^C, f_j^C). Then, the extracted identity features (f_i^I, f_j^I) and camera style features (f_i^C, f_j^C) are used in a feature-level generative model to complete the camera style transfer of a given feature. The necessity of the extra independent feature extractor E_C is discussed in Section 5.6.1.
The Generative Model for Camera Style Transfer. The generative model aims to perform camera style transfer at the feature level. Thus, it takes a pair of features (f_i^I|y_i, f_j^C|c_j) as input, where f_i^I|y_i denotes the source identity feature with identity label y_i, extracted from source image x_i by extractor E_I. Similarly, f_j^C|c_j denotes the target camera style feature with camera label c_j, obtained from target image x_j by the extra camera style feature extractor E_C. Then, the generator G generates a feature f̃ conditioned on identity label y_i from f_i^I and camera label c_j from f_j^C. The whole generation process can be expressed as:

f̃ | y_i, c_j = G(f_i^I | y_i, f_j^C | c_j),    (1)

The generated feature f̃|y_i, c_j has the same identity label y_i as the source feature f_i^I|y_i and the same camera label c_j as the target feature f_j^C|c_j.

Adversarial Camera Style Feature-to-Feature Translation
The generative model aims to make camera style transfer while preserving input features' identity information. Thus, the goal of the generative model can be divided into two parts: judging the authenticity of generated features and determining whether generated features complete the camera style transfer. The objective functions of the generative model are designed based on these two goals.
GAN Design. To determine whether generated features are real or not, a discriminator D is introduced into the generative model to calculate adversarial losses. Paired with the generator G, the whole generative model plays a min-max game to improve the authenticity of generated features:

min_{θ_G} max_{θ_D} L_adv = E_{f^I ∼ P_I}[D(f^I)] − E_{f̃ ∼ P_G}[D(f̃)],    (2)

where θ_G and θ_D represent the parameters of generator G and discriminator D, f_i^I is the real source feature with identity label y_i and camera label c_i, and f̃ = G(f_i^I, f_j^C) is the generated feature with the same identity label y_i and a different camera label c_j. P_G and P_I denote the distributions of generated features and real features, respectively. Furthermore, WGAN-GP [28] is selected for the adversarial loss, since the authors found its performance relatively stable in the generation objectives. Therefore, a gradient penalty loss is added to enforce the Lipschitz constraint on the entire loss function:

L_gp = E_{f̂ ∼ P_f̂}[(||∇_{f̂} D(f̂)||_2 − 1)^2],

where P_f̂ corresponds to the distribution that samples uniformly along the straight lines between P_G and P_I, and || · ||_2 denotes the L2-norm.

Camera Classification Loss. To ensure that the generator performs accurate camera style transfer toward target features, camera labels can be utilized as supervisory information during GAN training. To be more specific, the real feature f_i^I should preserve its camera label c_i, and the generated feature f̃ should carry the transferred camera label c_j. Following the practice of StarGAN [18], this objective is decomposed into two parts: a camera domain classification loss for real features, L_cam^r, and one for generated features, L_cam^f:

L_cam^r = E_{f^I ∼ P_I}[−log D_cam(c_i | f_i^I)],
L_cam^f = E_{f̃ ∼ P_G}[−log D_cam(c_j | f̃)],

where D_cam is the camera discriminator, and the definitions of the other symbols are the same as shown in Equation (2).

Overall Objective Function. The loss functions above are combined into the total loss functions for D and G, respectively:

L_D = −L_adv + λ_cam L_cam^r + λ_gp L_gp,
L_G = L_adv + λ_cam L_cam^f,

where λ_cam and λ_gp are hyper-parameters that control the contribution of each part to the overall loss function. The impact of these hyper-parameters is studied in Section 5.3.
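These objectives can be sketched numerically. The NumPy snippet below is a hedged illustration, not the authors' implementation: it assumes the critic scores, the gradient norms at interpolated points, and the camera classifier's log-probabilities have already been computed by an autograd framework, and merely combines them with the weights λ_cam and λ_gp.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, grad_norm, logp_cam_real,
                       lam_cam=0.3, lam_gp=1.0):
    """Critic-side objective (sketch): WGAN term + camera cross-entropy on
    real features + gradient penalty. grad_norm holds ||grad D||_2 at points
    interpolated between real and generated features (assumed precomputed)."""
    l_adv = d_fake.mean() - d_real.mean()      # critic pushes real up, fake down
    l_cam_r = -logp_cam_real.mean()            # CE w.r.t. true camera label c_i
    l_gp = ((grad_norm - 1.0) ** 2).mean()     # 1-Lipschitz penalty
    return l_adv + lam_cam * l_cam_r + lam_gp * l_gp

def generator_loss(d_fake, logp_cam_fake, lam_cam=0.3):
    """Generator-side objective (sketch): fool the critic while the generated
    feature is classified as the target camera label c_j. Only the terms that
    depend on G are kept."""
    return -d_fake.mean() - lam_cam * logp_cam_fake.mean()
```

In practice both functions would be evaluated on mini-batches each iteration, with the gradient norm obtained via the framework's autograd on interpolated samples.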

Adaptive Batch Normalization
In order to perform camera style transfer while preserving identity information, this paper proposes a novel Adaptive Batch Normalization (ABN) layer. When trying to add camera style information into feature generation, the authors first tried simply appending a one-hot camera label to the original identity feature. However, the one-hot camera label is merely a label and does not contain enough camera style information for the network to learn from. The authors then tried concatenating the identity feature and the camera feature, but the result was still not ideal. The authors believe the reason is that the network sees the feature as a whole; the identity part and the camera style part are not clearly separated, leading to semantic confusion. Inspired by Adaptive Instance Normalization [29], in which a style code is injected through linear affine parameters in image style transfer, the authors propose an Adaptive Batch Normalization (ABN) layer that can effectively embed camera information into an identity feature and accomplish camera style transfer at the feature level. ABN layers replace the corresponding normalization layers in generator G in a similar yet more concise way:

ABN(x, y) = m_1(y) · (x − μ(x)) / σ(x) + m_2(y),

where μ(·) and σ(·) denote the mean and standard deviation calculations, x represents features that normally pass through the layer, and y represents style features that are mapped into the appropriate dimensions through mapping functions m_1(·) and m_2(·). Note that the normalization layers operate at the feature level, so the related statistics are calculated only along the batch dimension. The ablation study shows the effectiveness of the proposed ABN layer; detailed experimental results are given in Section 5.6.2.
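The ABN computation can be written in a few lines. In this hedged NumPy sketch, the mapping functions m_1 and m_2 are stood in by plain projection matrices (hypothetical stand-ins; the paper implements them via an MLP branch), and statistics are taken along the batch dimension as described.

```python
import numpy as np

def adaptive_batch_norm(x, y, m1, m2, eps=1e-5):
    """ABN sketch: normalize x over the batch dimension, then inject camera
    style through affine parameters predicted from the style feature y.

    x: (N, C) identity features; y: (N, S) camera style features;
    m1, m2: (S, C) projection matrices standing in for the mapping MLPs.
    """
    mu = x.mean(axis=0, keepdims=True)            # batch mean, per channel
    sigma = x.std(axis=0, keepdims=True) + eps    # batch std, per channel
    x_hat = (x - mu) / sigma                      # batch-normalized feature
    gamma = y @ m1                                # style-dependent scale
    beta = y @ m2                                 # style-dependent shift
    return gamma * x_hat + beta
```

Unlike Adaptive Instance Normalization, which normalizes each sample's spatial statistics, the statistics here are computed across the batch, matching the feature-level (non-spatial) setting of the CST generator.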

Network Architecture
The main architectures of generator G and discriminator D are illustrated in Figure 1. Since the whole camera style transfer process takes place at the feature level, convolutional layers are unnecessary, and only fully connected (FC) layers are used in the network. The dimensions of the input identity feature and camera style feature are 2048 and 512, respectively. For generator G, a U-Net-based architecture is adopted as the backbone. The identity branch consists of three parts: the downsampling part, the transition part and the upsampling part. The downsampling part acts as an encoder, reducing the dimension of features, while the upsampling part acts as a decoder with the opposite effect. Batch normalization layers are applied after each FC layer in these two parts. The transition part, namely a series of ABN blocks, plays the key role of fusing the two branches of features and performing the style transfer; the proposed ABN layers are applied after each FC layer in this part. All normalization layers in G are followed by a Leaky ReLU layer (the negative slope is set to 0.2, as below). The camera style branch is a Multi-Layer Perceptron (MLP) that maps camera style information into the ABN layers. For D, a series of FC layers reduce the dimension of input features from 2048 to 1 with a reduction rate of 0.5 per layer. An additional classifier is connected to the 512-dimensional layer for the camera classification loss. All FC layers in D except the last are followed by a Leaky ReLU layer, while normalization layers are not used in D, since they would harm conditional GAN training. The image in the red dotted box in Figure 1 is not generated; it is placed there only to demonstrate that the generated feature f̃ maintains identity information while transferring camera style.
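As a small illustration of the discriminator's shape flow, the FC widths implied by a strict 0.5-per-layer reduction from 2048 down to 1 can be enumerated. The halve-until-1 reading and resulting depth are assumptions here; Figure 1 carries the precise layout.

```python
def disc_dims(in_dim=2048):
    """Enumerate FC layer widths of D under a halve-per-layer schedule.
    The auxiliary camera classifier attaches at the 512-dimensional layer."""
    dims = [in_dim]
    while dims[-1] > 1:
        dims.append(dims[-1] // 2)
    return dims
```

This yields 2048 → 1024 → 512 → … → 2 → 1, with the camera classifier reading from the 512-dimensional activation.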

Figure 1. The structure of generator G and discriminator D of the proposed CST approach. E_I and E_C are the identity feature extractor and camera style feature extractor, respectively.
Through the proposed CST approach, given an input feature, multiple features that maintain the same identity but vary in camera style can be generated. These augmented features can be used to alleviate camera style variance during the training process of the re-ID model.

Using Transferred Feature for Person Re-ID
In this section, the proposed CST approach is implemented in different re-ID models in an end-to-end manner. Through a multi-stage joint training process, the augmented features produced by the CST approach can help to overcome camera style variance and improve the performances of baseline re-ID models.
In order to optimize the feature generation model and the re-ID model simultaneously, the authors design an end-to-end framework which can easily embed the CST approach into any baseline re-ID model. The training process includes three stages: the pre-training stage, the generative model training stage, and the joint training stage.

Pre-Training Stage
The pre-training stage trains the two feature extractors, namely the identity feature extractor E_I and the camera style feature extractor E_C. The training process is illustrated in Figure 2. ResNet50 [30] and ResNet18 pre-trained on ImageNet are used as the backbone networks for E_I and E_C, respectively. For both extractors, the last fully connected (FC) layer in ResNet is replaced by a batch normalization (BN) layer with corresponding dimensions. The two extractors E_I and E_C are trained independently as two classification tasks, while a person identity classifier C_I and a camera style classifier C_C are introduced to calculate the identity loss L_id and camera style loss L_cam as follows:

L_id = E_{f_i^I ∼ P_I}[−log C_I(y_i | f_i^I)],
L_cam = E_{f_j^C ∼ P_C}[−log C_C(c_j | f_j^C)],

where x_i is the input source image with identity label y_i, and x_j is the input target image with camera label c_j. f_i^I = E_I(x_i|y_i) is the extracted identity feature and f_j^C = E_C(x_j|c_j) is the extracted camera style feature, while P_I and P_C are the distributions of the corresponding features.
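Both pre-training losses are plain cross-entropy objectives over classifier logits; a minimal NumPy sketch (the classifier outputs of C_I or C_C are assumed already computed):

```python
import numpy as np

def batch_cross_entropy(logits, labels):
    """Mean cross-entropy over a batch: -log softmax(logits)[label].
    Used identically for L_id (identity labels) and L_cam (camera labels)."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()
```

In the pre-training stage this loss is applied to E_I with identity labels and, separately, to E_C with camera labels, so the two extractors never share gradients.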
After the pre-training stage, extractors E_I and E_C are trained to extract high-quality person identity features f^I and camera style features f^C; these features are used as input in the generative model training stage to train the GAN for camera style transfer.

Generative Model Training Stage
The generative model training stage trains a GAN to accomplish camera style transfer. The network structure is introduced in Section 3.4, and the training process is illustrated in Figure 3. The dotted boxes and lines in the figure indicate that, during this stage, all parameters in the feature extractors E_I and E_C stop updating, and gradient back-propagation is stopped before it reaches the input features f^I and f^C; E_I and E_C only provide input data for the GAN in this stage. The identity feature f^I and camera style feature f^C are fed into the GAN in pairs through a feature reassemble algorithm. To adequately train the CST model, each identity feature should be paired with camera style features of different camera labels. f^I and f^C share the same batch size N, and the number of different camera labels M is much smaller than N. The input image batch can be arranged so that the N camera style features f^C in the same batch cover all M different camera labels. Then, for each of the N identity features f_k^I in a batch, one camera style feature f_j^C of each camera label is chosen to form a feature pair (f_k^I, f_j^C). Thus, the batch size of feature pairs is N × M. The loss functions used in this stage are introduced in Section 3.2. Through this training stage, the generator G is trained to produce the feature f̃|y_i, c_j = G(f_i^I|y_i, f_j^C|c_j). The generated feature f̃ maintains the same identity label y_i as the input source feature f_i^I, while the camera label is transferred to c_j, which is the same as that of the target feature f_j^C.
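The feature reassemble step can be sketched as index bookkeeping: pick one representative style feature per camera label, then pair every identity feature in the batch with each representative. The function and variable names below are illustrative, not the authors':

```python
def reassemble_pairs(cam_labels, num_identity, num_cams):
    """Return (identity_idx, style_idx) pairs so that every identity feature
    is matched with one camera style feature per camera label. Assumes the
    batch was arranged to cover all num_cams labels."""
    rep = {}
    for j, c in enumerate(cam_labels):     # first style feature seen per label
        rep.setdefault(c, j)
    assert len(rep) == num_cams, "batch must cover all camera labels"
    return [(k, rep[c]) for k in range(num_identity)
            for c in sorted(rep)]
```

With batch size N and M camera labels this produces the N × M pairs described above, e.g. N = 64 and M = 6 on Market-1501 yields 384 pairs per batch.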

Joint Training Stage
The joint training stage jointly trains the baseline re-ID model and the generative model. The training process is shown in Figure 4. During this stage, the parameters of the identity feature extractor E_I are unfrozen, so that E_I can be further trained with the help of the augmented data. Because camera classification is a relatively easy task, the camera style feature extractor E_C remains frozen in this stage. The generator G is stable after the previous training stage, so the discriminator D is removed from this stage, and G continues to update with a small learning rate.
During joint training, the identity classification loss is computed on both real and generated features:

L_real = E_{f_i^I ∼ P_I}[−log C_I(y_i | f_i^I)],
L_gen = E_{f̃ ∼ P_G}[−log C_I(y_i | f̃)],

where f_i^I = E_I(x_i|y_i) is the real identity feature, and f̃ = G(f_i^I, f_j^C) is the generated feature. P_I and P_G denote the distributions of real features and generated features, respectively.
For the IDE baseline, a triplet loss is also used to enhance the performance of the re-ID model. The generative model can provide abundant positive and negative samples for hard sample mining. With the help of the generated features, the triplet loss used in this stage can be written as:

L_trip = (1/N) Σ_{i=1}^{N} [ ||f_i − f_p||_2 − ||f_i − f_n||_2 + m ]_+,

where N is the batch size, the positive sample f_p and negative sample f_n are both chosen from the union of real features and generated features, and m is the margin of the triplet loss.
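The triplet term can be sketched directly; anchors, positives and negatives are feature vectors drawn from the union of real and generated features, and the default margin below is illustrative rather than the paper's setting:

```python
import numpy as np

def triplet_loss(anchor, pos, neg, margin=0.3):
    """Mean hinge triplet loss: positives should sit at least `margin`
    closer to the anchor than negatives. Rows of pos/neg may be
    CST-generated features, which enlarges the mining pool."""
    d_ap = np.linalg.norm(anchor - pos, axis=1)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - neg, axis=1)   # anchor-negative distance
    return np.maximum(d_ap - d_an + margin, 0.0).mean()
```

Because each identity feature can be paired with style-transferred copies from every camera, hard positive/negative mining sees many more candidates than the raw batch alone provides.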
The total loss function is as follows:

L_total = L_real + λ_gen L_gen + L_trip,

where L_real and L_gen denote the identity classification losses on real and generated features, L_trip is the triplet loss, and λ_gen controls the contribution of the generated features; the impact of this weight is studied in Section 5.3. Through the above multi-stage training process, the proposed CST approach improves the performance of the baseline re-ID model in two ways: first, the camera style transferred features alleviate the difficulties brought by camera style variance; second, the generated features act as data augmentation, providing more training samples, which is especially helpful for hard sample mining.

Baseline Methods
Three re-ID models are chosen as baselines to verify the effectiveness of the proposed CST approach. ID-discriminative Embedding (IDE) [1] is a well-known person re-ID baseline proposed by Zheng et al. in 2016; it has been widely studied and used in the re-ID task due to its simplicity and effectiveness. The IDE baseline used in this paper is a slightly improved version (using a batch normalization layer instead of the original fully connected block); it outperforms the original IDE baseline, but it is still referred to as IDE in this paper for simplicity. Part-based Convolutional Baseline (PCB) [7] is a re-ID baseline proposed by Sun et al. in 2018, which divides the input image horizontally and outputs part-level features. Multiple Granularity Network (MGN) [31] combines global and partial features obtained by a multi-branch deep network. These three baselines represent models based on global features, partial features, and the combination of global and partial features; thus, they are chosen to evaluate the effectiveness of the proposed CST approach.

Implementation Details
The proposed method is implemented in the PyTorch framework. All experiments are conducted on two NVIDIA 2080 Ti GPUs. For the IDE baseline, input images are resized to 256 × 128 × 3. Random horizontal flip and random erasing (each with probability 0.5) are used for pre-processing, and the batch size is set to 64. The network is optimized within 450 epochs (1-60 for pre-training, 61-360 for generative model training and 361-450 for joint training). The Adam optimizer is adopted for all networks. For G and D, the learning rate starts at 0.001 and is divided by 10 every 100 epochs. For E_I and E_C, the initial learning rate is set to be 0. The performance of the proposed method is evaluated on the Market-1501 [32] dataset. Market-1501 contains 32,668 labeled images of 1501 identities from six camera views. The training set has 12,936 images of 751 identities, from which 1182 images of 80 identities are chosen as the validation set. Images in the validation set are not used in the training stage; they are used for parameter analysis. In the testing stage, 3368 images of the remaining 750 identities serve as query images to retrieve matching persons in a gallery set of 15,913 images.

Impact of Hyper-Parameters
The sensitivities of the hyper-parameters in the loss functions are analyzed in this section: the weight of the camera classification loss λ_cam and the gradient penalty λ_gp, which control the relative importance of different losses when generating camera style transferred features (Section 3.2), and the weight of the generated feature loss λ_gen, which controls the contribution of generated features during joint training (Section 4.3). The experiments are carried out on the validation set, and the results are shown in Figures 5-7. When λ_cam = 0.3, the proposed model reaches its best results. It is worth mentioning that camera style information is not identity-related, so the weight of the camera classification loss λ_cam should not be too large; otherwise, the generator tends to learn excess camera information, which does not exist in the input identity feature and corrupts the training process.
λ_gp is introduced into the network to maintain 1-Lipschitz continuity. The proposed model performs best when λ_gp = 1.0.
The best performance is obtained when λ_gen = 0.5. Assigning an excessive value to λ_gen reduces the result. This suggests that the proposed generative model can act as an effective data augmentation scheme within the appropriate range.
Figure 7. Hyper-parameter analysis of the weight of generated feature loss λ_gen on Market-1501. Best performance achieved when λ_gen = 0.5.

Comparison with Baselines and State-of-the-Arts
Experimental results are shown in Table 1. Despite the strong baselines adopted in this experiment, the proposed methods still achieve remarkable improvement over previous works. Compared to the IDE baseline with Cross Entropy (CE) loss (IDE-CE), introducing CST improves mAP and Rank-1 by +3.9% and +2.2%, respectively. Compared to IDE with triplet loss (IDE-Trip) and IDE with both losses (IDE-CE-Trip), similar increases are observed. The authors further illustrate the performance of the proposed method on the PCB baseline; the proposed CST surpasses the baseline by +1.6% in mAP and +0.9% in Rank-1. For the MGN baseline, the CST approach also outperforms the baseline in both mAP and Rank-1.
Some state-of-the-art methods are also listed in Table 2 for comparison on the Market-1501 dataset. The first group includes methods that do not use generative models for data augmentation, and the second group includes GAN-based methods that use generated or style-transferred images. As can be seen, the proposed CST applied to any of the baselines outperforms most of the other SOTA methods. These quantitative results show that the proposed CST approach is suitable for both global and partial features; thus, it can be applied in multiple person re-ID baselines, and it outperforms state-of-the-art methods. An excerpt of Table 2:

Method        mAP (%)  Rank-1 (%)
LGMANet [33]  82.7     94.0
MTNet [34]    81.5     93.9
RIN [35]      67.6     86.1
RNLSTM [36]   76.9     90.6
Gconv [37]    72.      -

Furthermore, some qualitative comparisons are presented to vividly show the improvement brought by the proposed CST approach. Figure 8 shows the retrieval results of two example query images in the Market-1501 dataset. As can be seen in the figure, the baseline model tends to be affected by camera style-related factors, e.g., posture, background, illumination, etc. The proposed CST can generate a large number of camera style transferred features to alleviate this problem, so that the model overcomes the effect of camera style variance and performs more accurate matching.

Computational Costs Analysis
The proposed feature-level CST is compared with some image-level generation models on the Market-1501 dataset. The time and memory costs as well as the re-ID results are listed in Table 3. The results show that feature-level CST costs much less time and memory than image-level generation models while achieving better re-ID performance.

Necessity of the Independent Camera Style Feature Extractor
When trying to extract camera style features, the authors first tried to add a new branch to the identity feature extractor. However, the re-ID accuracy dropped dramatically, because person identity classification and camera style classification are two different tasks that cannot be fulfilled by the same network: the gradient from the camera style classifier disturbs identity classification. The authors then tried stopping the back-propagation of the camera style classifier to eliminate its influence on identity classification, but the accuracy of camera style classification decreased, because the original feature extractor is trained to extract person identity features, so camera style information may be lost during training. Since the camera style feature is the input of the generative model, its quality is crucial for the CST model. The authors therefore add a lightweight independent feature extractor E_C to the framework. The comparison with the former settings shows that, by introducing an independent feature extractor E_C, the camera classification accuracy reaches almost 100%, and the quality of the identity feature is not affected.
This study is carried out on the IDE baseline; the camera style classification accuracy and re-ID results are listed in Table 4, where 'branch' means adding a branch to the identity feature extractor, 'stop bp' means stopping the back-propagation of the camera classifier, and 'independent' means using the extra independent camera-style feature extractor.
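The 'independent' setting above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: a one-hidden-layer MLP stands in for the lightweight extractor E_C, and the dimensions are assumed; the only given from the dataset is that Market-1501 has 6 cameras.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CAM = 6                 # Market-1501 has 6 cameras
D_IN, D_CAM = 512, 128    # assumed input / camera-feature dimensions

# Hypothetical lightweight extractor E_C: a single hidden layer kept
# completely separate from the identity backbone, so its gradients never
# disturb identity classification.
W1 = rng.standard_normal((D_IN, D_CAM)) * 0.05
W2 = rng.standard_normal((D_CAM, N_CAM)) * 0.05

def E_C(x):
    """Camera-style feature that is later fed to the generative model."""
    return np.maximum(x @ W1, 0.0)          # ReLU

def camera_logits(f_cam):
    """Camera-ID head trained with cross-entropy; near-100% accuracy
    indicates the feature really encodes camera style."""
    return f_cam @ W2

x = rng.standard_normal((4, D_IN))          # pooled image features (stand-in)
f_cam = E_C(x)
print(camera_logits(f_cam).shape)           # (4, 6)
```

Because E_C shares no parameters with the identity extractor, the camera classification loss can be optimized freely without any stop-gradient tricks.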

Effectiveness of ABN Layer
To verify the effectiveness of the proposed ABN layer, mAP and Rank-k results are compared between different camera style transfer strategies, and the results are shown in Table 5, where 'none' means not using camera style information, 'one-hot' represents a one-hot camera label, 'concat' means the concatenation of identity features and camera features, and ABN is the proposed Adaptive Batch Normalization layer.
The study is carried out on the IDE baseline. We can see that the proposed ABN layer performs best among these camera style transfer strategies, showing that the ABN layer can successfully inject camera style information into the identity feature.
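To make the ABN idea concrete, here is a minimal NumPy sketch. The dimensions and the linear predictors for the affine parameters are illustrative assumptions; the point is that, unlike standard batch normalization with fixed learned affine parameters, the scale and shift are predicted per sample from the camera-style feature.

```python
import numpy as np

rng = np.random.default_rng(0)
D_ID, D_CAM = 256, 64     # assumed feature dimensions

# Hypothetical linear predictors mapping the camera-style feature to
# per-sample affine parameters (in practice these would be learned
# end-to-end with the rest of the model).
W_gamma = rng.standard_normal((D_CAM, D_ID)) * 0.05
W_beta = rng.standard_normal((D_CAM, D_ID)) * 0.05

def abn(f_id, f_cam, eps=1e-5):
    """Adaptive Batch Normalization: normalize the identity feature over
    the batch, then re-style it with camera-conditioned scale and shift."""
    mu = f_id.mean(axis=0, keepdims=True)
    var = f_id.var(axis=0, keepdims=True)
    x_hat = (f_id - mu) / np.sqrt(var + eps)
    gamma = 1.0 + f_cam @ W_gamma           # per-sample scale
    beta = f_cam @ W_beta                   # per-sample shift
    return gamma * x_hat + beta

f_id = rng.standard_normal((8, D_ID))       # identity features of a batch
f_cam = rng.standard_normal((8, D_CAM))     # target camera-style features
out = abn(f_id, f_cam)
print(out.shape)                            # (8, 256)
```

With a zero camera-style feature, gamma and beta reduce to 1 and 0 and the layer degenerates to plain normalization, which is why 'none' in Table 5 is a natural baseline for this comparison.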

CST Applied on Different Branches of MGN
To further verify the generalization capability of the proposed CST approach, CST is applied to each network branch of the original MGN baseline; the results are shown in Table 6.
From the results, we can see that the CST approach improves all branches of the MGN re-ID baseline, proving that the CST approach is suitable for both global features and partial features.

Comparisons on RAP Dataset
The RAP dataset [46] is a recently published large-scale dataset that contains 84,928 images of 2589 identities and is designed for both pedestrian attribute recognition and person re-ID tasks. The proposed CST approach is further evaluated on this dataset. Following the dataset split protocol suggested by the authors of the RAP dataset [46] and the training settings suggested by Yaghoubi et al. [47], the experiments are carried out with the MGN baseline [31]. The comparison results are listed in Table 7. The CST approach improves the performance of the MGN baseline and outperforms the SOTA methods; the significant improvement in mAP demonstrates that the proposed CST approach can alleviate camera style variance in general.

Conclusions
In this paper, a feature-level Camera Style Transfer (CST) approach is proposed to alleviate camera style variance in the person re-ID problem. In addition to the traditional re-ID feature extractor, a second feature extractor is used to extract the camera style feature independently. CST then takes a source identity feature and a target camera style feature as input and generates an identity-preserving, camera style transferred feature with a GAN. An Adaptive Batch Normalization (ABN) layer is designed to inject camera style information at the feature level. The authors further propose an end-to-end framework that jointly trains the generative model with the re-ID model for better performance. Experimental results show that the proposed CST approach can be embedded into multiple re-ID baselines and outperforms state-of-the-art methods, proving the effectiveness and generalization capability of the approach.
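The overall data flow summarized above can be sketched end to end. Everything here is an illustrative assumption, not the authors' architecture: random projections stand in for the two CNN extractors, and the generator is reduced to an ABN-style modulation followed by one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
D_ID, D_CAM = 256, 64     # assumed feature dimensions

# Stand-ins for the two extractors (the real ones are CNNs).
W_id = rng.standard_normal((512, D_ID)) * 0.05
W_cam = rng.standard_normal((512, D_CAM)) * 0.05

def extract_identity(x):  # E_ID: identity feature of the source image
    return np.maximum(x @ W_id, 0.0)

def extract_camera(x):    # E_C: camera-style feature of the target image
    return np.maximum(x @ W_cam, 0.0)

# Generator G: inject camera style via per-sample scale/shift, then refine.
W_gamma = rng.standard_normal((D_CAM, D_ID)) * 0.05
W_beta = rng.standard_normal((D_CAM, D_ID)) * 0.05
W_g = rng.standard_normal((D_ID, D_ID)) * 0.05

def generate(f_id, f_cam, eps=1e-5):
    x_hat = (f_id - f_id.mean(axis=0)) / np.sqrt(f_id.var(axis=0) + eps)
    styled = (1.0 + f_cam @ W_gamma) * x_hat + f_cam @ W_beta
    return styled @ W_g

x_src = rng.standard_normal((8, 512))       # source-identity images
x_tgt = rng.standard_normal((8, 512))       # target-camera images
f_t = generate(extract_identity(x_src), extract_camera(x_tgt))
print(f_t.shape)                            # (8, 256)
```

In the full framework these transferred features would be scored by a discriminator and fed back to the re-ID classifier as intra-class augmentation, with all components trained jointly.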