Weakly Supervised Cross-domain Person Re-Identification Algorithm Based on Small Sample Learning

Person Re-identification (Re-ID) algorithms based on deep learning are developing rapidly, and supervised learning methods achieve steady performance improvements by virtue of massive amounts of data. In real environments, however, collecting and labeling data for the current scene is time-consuming and laborious, and the practical performance of the resulting models is often poor. Model design should therefore focus on extracting and abstracting the information contained in the data under limited conditions. In this paper, we focus on the problems of strong data dependence, weak cross-domain capability and low accuracy encountered by Re-ID in weakly supervised scenarios. First, we implement a joint training framework for Re-ID with the help of small sample learning and cross-domain transfer. Second, a residual compensation and fusion attention (RCFA) module is designed, and a basic model framework is built upon it to explore the impact of different insertion positions. Third, to solve the problem of low accuracy caused by insufficient data coverage of small samples, a fusion of shallow and deep features is designed so that the model can perform weighted fusion of shallow detail information and deep semantic information. Finally, by selecting images from different cameras in the Market1501 and DukeMTMC-reID datasets as small samples and introducing the other dataset for joint training, we demonstrate the feasibility of this joint training framework, which can perform weakly supervised cross-domain Re-ID based on small sample data.


Introduction
Person Re-identification (Re-ID) [1] refers to associating and matching specific target pedestrians across devices, times and locations using computer vision techniques. The task is generally viewed as a fine-grained image retrieval problem with constraints. Re-ID can compensate for the limitations of face recognition and of fixed camera views in intelligent security and video surveillance, and can be combined with person detection and person tracking [2] to build complete Re-ID systems. Traditional video surveillance systems have a low degree of intelligence, so criminal investigators not only have to spend a great deal of time and energy, but also inevitably make occasional oversights in their work. In addition, a suspect may confuse investigators by deliberately covering the face or wearing clothing of a different color in order to obstruct tracking. Re-ID technology can, to a certain extent, alleviate the inefficiency and high miss rate of traditional video surveillance. With the construction of smart cities, security needs are increasing day by day, and intelligent monitoring systems face a major development opportunity. As an indispensable part of such systems, Re-ID has become a hot research direction in both academia and industry.
The improvement of computer hardware such as GPUs and the huge amount of data brought by the big data environment have enabled deep-learning-based Re-ID algorithms to develop rapidly. Deep learning Re-ID methods integrate the two modules of feature extraction and metric learning, i.e., the extraction of image features and the similarity comparison of feature vectors are performed within one model. According to the recognition approach, deep-learning-based Re-ID models can be divided into representation models [3] and matching models [4], where representation models treat the Re-ID task as a classification problem and their loss functions include classification loss [5] and verification loss [3], among others.
Typically, fully supervised deep learning models require large amounts of labeled data for training, yet expecting the data to cover all sample characteristics is unrealistic for data-driven model learning. Therefore, model design should rely less on large amounts of labeled data and focus more on extracting and abstracting the information and knowledge contained in the data under limited data conditions. This leads to the concept of weakly supervised learning. Learning from a small amount of labeled data in weakly supervised scenarios is of great value and significance for deploying Re-ID systems in practice. The cross-domain transfer-based approach performs unsupervised learning through domain adaptation: the model is pre-trained in a supervised manner on labeled source data and then adapted to the target domain of unlabeled data, so that, through transfer learning, the domain knowledge of the labeled dataset is transferred to the unlabeled dataset. However, existing Re-ID techniques mostly rely on supervised learning with a large number of samples, which requires a large amount of labeled data for training and therefore cannot be generalized to other scenarios, incurring labor and computational costs that hinder the deployment of Re-ID techniques. Therefore, weakly supervised Re-ID techniques based on small sample learning are needed to achieve cross-domain Re-ID.
In this paper, we focus on the problems of strong data dependence, weak cross-domain capability and low accuracy encountered by Re-ID in weakly supervised scenarios, and implement a deep-learning-based joint training framework for Re-ID with the help of small-sample learning and cross-domain transfer. The framework uses single-camera labeled data as a small-sample training set and introduces a large amount of data from non-target domains as prior knowledge to improve the Re-ID performance of the model through joint training. The main innovations of this paper are as follows:
1. To solve the Re-ID problem in weakly supervised scenes, a joint training framework combining cross-domain transfer learning and small-sample learning is proposed, which can train on both small-sample data and different-domain data, reducing the data collection effort for realistic scenes while still ensuring that the algorithm model is adequately trained.
2. To solve the problem of weak cross-domain capability, and to effectively utilize the introduced non-target-domain data while counteracting the perturbation caused by different data distributions, a plug-in residual compensation and fusion attention (RCFA) module is designed, which can suppress inter-domain differences with only a small increase in computational effort.
3. To further improve the accuracy of the algorithm, a fusion of shallow and deep features is designed so that the model can perform weighted fusion of shallow detail information and deep semantic information, mitigating the learning bias caused by insufficient coverage of small-sample data.
2 Related work

Cross-domain Re-ID
Image style transfer is a transfer learning method in the image domain, first proposed by Gatys et al. in [6]. Because it can effectively alleviate the model generalization problem caused by image style differences, researchers have widely applied it to cross-domain Re-ID tasks. For example, Deng et al. [7] proposed SPGAN, an unsupervised domain adaptive framework consisting of SiaNet and CycleGAN [8]; the samples generated through coordination between SiaNet and CycleGAN not only have the style of the target domain but also retain the underlying identity information. An instance-guided context rendering method is proposed in [9], which transfers source-domain person identities into different target-domain contexts in order to achieve supervised Re-ID in the unlabeled target domain. Yc et al. [10] proposed a new style transfer framework, STReID, which can change the image style while preserving the content information, and then uses both the original and style-transferred images for training. Zhu et al. [11] decomposed person images into foreground, background and style features, and then used these features to synthesize person images with target-domain backgrounds for training. Beyond this, there are many studies aimed at improving the generalization ability of the model. [12] proposed a domain-invariant mapping network (DIMN) to learn the mapping between images and classifiers; it follows a meta-learning pipeline and samples a subset of source-domain training tasks in each training step to make the model domain invariant.

Weakly supervised Re-ID
Based on the problems of supervised and unsupervised Re-ID, researchers have introduced the concept of weakly supervised learning, which combines supervised and unsupervised learning to train effective Re-ID models using only a small amount of labeled data. With the increasing interest in weakly supervised learning, a large number of related research branches have emerged and been attributed to it, so that weakly supervised learning has become a comprehensive research field covering a variety of studies that attempt to build predictive models with weak supervision [13]. Weakly supervised learning, as the name implies, refers to learning when the supervision provided by the data is inadequate, and from this perspective, semi-supervised learning can be seen as the first and most fundamental framework in the field [14].
Semi-supervised learning aims to use both labeled and unlabeled data to accomplish a specific learning task; the concept first appeared in [15]. As the earliest semi-supervised methods, self-learning methods are an iterative mechanism that uses the initially labeled data to train a model to predict some of the unlabeled samples. The most confident predictions are then treated as labels by the current model, providing more training data for the supervised algorithm. The joint training approach [16] provides a similar solution by training two different models on two different views and using the reliable predictions from one view as labels for the other model. Figueira et al. [17] proposed a multi-class learning approach: given any set of features, regardless of their number, dimensionality and descriptors, the method fuses these features and ensures that they are consistent with the classification results. Li et al. [18] proposed a semi-supervised region metric learning method that learns discriminative region-to-point metrics by estimating positive neighborhoods to generate positive regions.
Unsupervised learning does not require labeled data and is therefore more adaptive and robust. Early unsupervised Re-ID mainly learned invariant components through dictionary learning [19], metric learning [20] or saliency analysis [21], which leads to limited discriminability or scalability.
Ye et al. [22] proposed an unsupervised cross-camera label estimation method that builds a sample graph for each camera, iteratively updates the label estimates and sample graphs, and performs cross-camera label association using a dynamic graph matching (DGM) method, addressing the poor quality of feature representations and the noise introduced by cross-view association. Wang et al. [23] proposed a consistent cross-view matching (CCM) framework that uses global camera network constraints to ensure the consistency of matched pairs, together with a cross-view matching strategy that exploits these constraints to explore matching relationships across the camera network, addressing inaccurate matching results for different camera pairs.
For end-to-end unsupervised Re-ID, Fan et al. [24] first assigned pseudo-labels to the target domain of a cross-domain dataset and proposed an iterative clustering model for Re-ID: a convolutional network is first trained on the source domain, features are then extracted on the target domain and clustered by K-Means into a set number of clusters, the model is fine-tuned with the clustering results, and the process is repeated iteratively. The pseudo-label clustering algorithm combining hierarchical clustering and hard-batch triplet loss proposed by Zeng et al. [25] makes full use of the similarity between samples in the target dataset through hierarchical clustering and reduces the influence of hard samples through the hard-batch triplet loss, yielding high-quality pseudo-labels and improving model performance. The TAUDL method proposed by Li et al. [26] trains an end-to-end neural network using unsupervised single-camera trajectory information and then uses this image model to automatically label and learn cross-camera images. Most unsupervised learning methods do not consider the distribution differences between cameras; Xuan et al. [27] iteratively optimized the similarity between cameras by generating pseudo-labels within and between cameras.
In recent years, the performance of Re-ID algorithms based on weakly supervised methods has improved significantly, but there is still a large gap compared with methods based on supervised learning. At present, there are relatively few studies on weakly supervised Re-ID algorithms in academia, and the area is not yet mature. How to transfer the knowledge learned from labeled source datasets to unlabeled target datasets through domain adaptation, so as to achieve higher performance with weakly supervised algorithms, will be the focus of related research.
3 The Proposed Method

Basic model framework
In order to solve the problem of insufficient training data in small sample learning, prior knowledge must be introduced to assist model learning, and different types of prior knowledge have different effects. Introducing a large amount of labeled data from outside the target domain expands the amount of data available for training the model. However, there are differences between domains, and directly using such data for training does not guarantee a positive effect. Therefore, this paper designs a basic framework that ensures the introduced prior knowledge has a positive effect. By enhancing the model's ability to extract domain-invariant features, it can effectively utilize a large amount of labeled data from the non-target domain. The overall structure is shown in Figure 1. The framework uses ResNet-50 [28] as the backbone network and inserts residual compensation and fusion attention (RCFA) modules at different stages to extract domain-invariant features.
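As a concrete reference, the following PyTorch sketch shows one way such plug-in modules could be attached after the ResNet-50 stages; it is an illustration under assumptions, not the authors' released code. The RCFA block itself is only referenced through a factory here, and a possible implementation is sketched in the next subsection.

```python
# Minimal sketch: ResNet-50 backbone with plug-in blocks after selected stages.
import torch.nn as nn
from torchvision.models import resnet50

class ReIDBackbone(nn.Module):
    def __init__(self, rcfa_factory, insert_stages=(1, 2, 3, 4)):
        super().__init__()
        net = resnet50(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])
        channels = [256, 512, 1024, 2048]  # output channels of layer1..layer4
        # Insert an RCFA block after each selected stage; the rest keep identity.
        self.rcfa = nn.ModuleList([
            rcfa_factory(c) if (i + 1) in insert_stages else nn.Identity()
            for i, c in enumerate(channels)
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage, rcfa in zip(self.stages, self.rcfa):
            x = rcfa(stage(x))
            feats.append(self.pool(x).flatten(1))  # f2..f5, later used for fusion
        return x, feats

# Example: model = ReIDBackbone(rcfa_factory=lambda c: nn.Identity())
```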

Residual Compensation and Fusion Attention Module
In order to improve the generalization ability of the Re-ID model, a residual compensation and fusion attention (RCFA) module is designed based on an instance normalization (IN) residual connection structure and a fusion attention (FA) module (as shown in Figure 2).
The FA module helps the model extract more discriminative semantic features at the cost of a small amount of additional computation. C, H and W denote the number of channels, height and width of the feature, respectively. In order to introduce spatial information into the channel dimension, the spatial information maps need to be obtained first. For the input feature F ∈ R^{C×H×W}, Average Pooling and Max Pooling are performed along the channel dimension to aggregate the channel features, obtaining two two-dimensional maps. Average Pooling represents the overall features of a region by its average pixel value and limits the variance of the estimate caused by the restricted neighborhood size; Max Pooling selects the maximum pixel value in a region to represent its overall features, which helps retain the saliency information in the feature map and gives the model some resistance to distortion. Second, to efficiently compute the final fused attention weights, the spatial dimensions of the F_avg and F_max maps need to be compressed. Therefore, Average Pooling and Max Pooling are performed along the spatial axes on the feature maps fused with spatial information, generating two spatial context descriptors that focus, respectively, on the global information and the saliency information that help discriminate pedestrians. Finally, the two spatial context descriptors are combined and fused by a multilayer perceptron (MLP) to obtain the final fused attention weights.
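The following PyTorch sketch shows one plausible realization of such an FA block. The way the spatial maps are injected back into the channels and the way the two descriptors are fed to the MLP are assumptions made for illustration; the description above does not fix these details.

```python
import torch
import torch.nn as nn

class FusedAttention(nn.Module):
    """Sketch of an FA-style block: channel-wise pooling builds spatial maps,
    spatial pooling builds two context descriptors, and an MLP fuses them into
    channel attention weights. Details beyond the text are assumptions."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f):                           # f: (B, C, H, W)
        # Spatial information maps: pool along the channel dimension.
        s_avg = f.mean(dim=1, keepdim=True)         # (B, 1, H, W) global information
        s_max = f.max(dim=1, keepdim=True).values   # (B, 1, H, W) saliency information
        f_spat = f * (s_avg + s_max)                # inject spatial info into channels
        # Spatial context descriptors: pool along the spatial axes.
        d_avg = f_spat.mean(dim=(2, 3))             # (B, C)
        d_max = f_spat.amax(dim=(2, 3))             # (B, C)
        # Combine the two descriptors with an MLP to get fused attention weights.
        w = torch.sigmoid(self.mlp(torch.cat([d_avg, d_max], dim=1)))
        return f * w[:, :, None, None]              # reweight the input feature
```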
In particular, to suppress the effect of inter-domain differences, instance normalization is used to normalize the data distribution and suppress style differences, and the input features x and the normalized features x_IN are fused by a residual connection to compensate for the person-discriminative information lost during the instance normalization calculation. The person features are then further enhanced by the FA module, which computes attention weights and applies them to the features. Figure 3 shows the structure of the RCFA module.
The information carried by the input feature x ∈ R^{C×H×W} includes style (activation mean and variance) and shape (activation spatial structure) information. Instance normalization suppresses the style differences between domains by normalizing the mean and variance in each channel of each sample while keeping the shape information unchanged. The calculation is as follows:

IN(x) = γ · (x − µ(x)) / σ(x) + β

where µ(·) and σ(·) denote the mean and standard deviation computed over the spatial dimensions of each channel, and γ and β are parameters learned by the network during training.
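Building on the FA sketch above, the RCFA block itself can be sketched as follows; again this is only an illustration of the described structure (IN, residual compensation, then FA weighting), not the authors' exact implementation.

```python
import torch.nn as nn

class RCFA(nn.Module):
    """IN suppresses style, the residual connection compensates the identity
    information lost by IN, and FusedAttention (sketched above) reweights."""
    def __init__(self, channels):
        super().__init__()
        self.inorm = nn.InstanceNorm2d(channels, affine=True)  # learnable gamma, beta
        self.fa = FusedAttention(channels)

    def forward(self, x):
        x_in = self.inorm(x)   # style-normalized features
        x = x + x_in           # residual compensation
        return self.fa(x)      # fused-attention reweighting

# Example: model = ReIDBackbone(rcfa_factory=RCFA)
```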

Feature Fusion
In order to solve the problem of Re-ID in weakly supervised scenarios, a joint training framework combining cross-domain transfer learning and small sample learning is proposed. For building a Re-ID model in a real environment, cross-domain transfer learning and small sample learning can effectively reduce the data collection effort and computational cost while ensuring that the model is adequately trained. In many person re-identification studies, the training data are complete, so the model can learn sufficient features from samples of various resolutions. For a person re-identification model, sufficient training data with different resolutions is crucial for generalization: for each image in the training set, having corresponding images with the same content but different resolutions helps the model generalize better. However, to reduce the collection of real-scene data, the small sample learning approach adopted here labels the samples of only one camera. While this greatly reduces the amount of data collection and labeling, it also greatly reduces the number of samples with different resolutions, making it difficult to extract sufficient information during model learning. To solve this problem, this section further improves the basic framework to form the final joint training framework, as shown in Figure 4.
The specific feature fusion process is as follows. The feature maps output after each RCFA module of the basic framework are globally pooled to obtain features f_2, f_3, f_4 and f_5. Features f_2 and f_3 contain similar information, as do f_4 and f_5; in this division, f_2 and f_3 are considered to mainly contain shallow detail information, while f_4 and f_5 mainly contain deep semantic information. In this paper, f_2 and f_3 are weighted and fused according to the weight w_1, and f_4 and f_5 are weighted and fused according to the weight w_2:

f_23 = w_1 · f_2 + (1 − w_1) · f_3
f_45 = w_2 · f_4 + (1 − w_2) · f_5

where the weights w_1 and w_2 are calculated from the triplet loss values. The idea is that the triplet loss function [29] reflects how close the anchor and positive samples are in the feature space and how far apart the anchor and negative samples are; the triplet loss can therefore judge whether the information contained in a feature is sufficiently discriminative.
After obtaining the shallow fusion feature f_23 and the deep fusion feature f_45, the two are weighted to obtain the final fusion feature containing both shallow and deep information:

f = w_ad · f_23 + (1 − w_ad) · f_45

where the weight w_ad is learned adaptively so that the model can allocate the proportions of shallow and deep information during fusion. Finally, the fusion feature is passed through the BNNeck structure, and the triplet loss and ID classification loss are computed respectively to constrain the model convergence process.
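The two fusion steps above can be sketched as follows. The rule for turning the triplet loss values into w_1 and w_2 (here, the branch with the lower loss receives the larger weight) and the sigmoid parameterization of w_ad are assumptions; the sketch also assumes f_2 to f_5 have already been projected to a common dimension.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.w_ad_logit = nn.Parameter(torch.zeros(1))  # adaptive shallow/deep weight

    @staticmethod
    def _pair_weight(loss_a, loss_b):
        # weight for the first feature of a pair: lower loss -> larger weight
        return loss_b / (loss_a + loss_b + 1e-8)

    def forward(self, f2, f3, f4, f5, l2, l3, l4, l5):
        w1 = self._pair_weight(l2, l3)
        w2 = self._pair_weight(l4, l5)
        f23 = w1 * f2 + (1 - w1) * f3          # shallow detail information
        f45 = w2 * f4 + (1 - w2) * f5          # deep semantic information
        w_ad = torch.sigmoid(self.w_ad_logit)  # keep the learned weight in (0, 1)
        return w_ad * f23 + (1 - w_ad) * f45
```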

Loss Function
This paper uses CELoss and Triplet Loss [29] to jointly constrain model training. However, in general, CELoss and Triplet Loss optimize inconsistent embedding spaces, and it is easy for one loss value to decrease while the other increases [30]. Therefore, this paper introduces the BNNeck structure into the basic model framework, as shown in Figure 5.
The purpose of person re-identification is to match, from the image gallery, all images belonging to the same ID as the query image. By the nature of the task, it can be treated as an image classification task in which each person ID is taken as a class and labeled with an ID number. For image classification problems, the Cross-Entropy Loss (CELoss) function is the most commonly used loss function, with two common forms: binary classification and multi-class classification. In the binary case there are only two kinds of prediction results; in person re-identification these correspond to positive and negative samples. The formula is as follows:

L_CE = −[ y_i · log p_i + (1 − y_i) · log(1 − p_i) ]

where y_i denotes the label of sample i (1 for a positive sample, 0 for a negative sample), and p_i denotes the probability that sample i is predicted to be positive. Multi-class classification is an extension of binary classification; in person re-identification, each person ID is treated as a class, and the formula is as follows:

L_CE = − Σ_{c=1}^{M} y_ic · log p_ic

where M denotes the number of categories, i.e. the number of IDs; y_ic takes 1 when the true class of sample i is c and 0 otherwise; and p_ic denotes the probability that sample i is predicted to belong to class c.
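For reference, the multi-class form corresponds to the standard cross-entropy loss in PyTorch; the 751 identity classes (as in Market1501) and the batch size below are only example values.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(24, 751)              # classifier outputs for a mini-batch
labels = torch.randint(0, 751, (24,))      # ground-truth person IDs
ce_loss = F.cross_entropy(logits, labels)  # -sum_c y_ic * log p_ic, batch-averaged
```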
The core idea of the Triplet Loss function is to pull samples with the same ID as close as possible in the feature space while pushing samples with different IDs farther apart. Because the idea is simple and matches the logic of person re-identification well, Triplet Loss has become a commonly used loss function in Re-ID research. The formula is as follows:

L_tri = max( d(a, p) − d(a, n) + margin, 0 )

where a, p and n denote the anchor, positive and negative samples respectively; d is usually the Euclidean distance; and margin is a hyperparameter set in advance.
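A minimal PyTorch illustration of this loss is given below; the margin of 0.3 is a commonly used value and an assumption here, not one stated in the text.

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.3, p=2)  # Euclidean distance, fixed margin
anchor = torch.randn(24, 2048)
positive = torch.randn(24, 2048)                 # same ID as the anchors
negative = torch.randn(24, 2048)                 # different ID from the anchors
tri_loss = triplet(anchor, positive, negative)   # mean of max(d(a,p)-d(a,n)+margin, 0)
```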
This paper introduces the BNNeck structure on the basis of the model framework, as shown in Figure 5. The structure adds a BN layer between the final feature extraction stage and the fully connected layer of the classifier, and initializes both the BN layer and the fully connected layer. In the forward propagation stage, the Triplet Loss is calculated on the feature f; the feature f is then passed through the BN layer to obtain f_BN, which is classified by the fully connected layer, and the output classification probabilities are used to compute the CELoss.
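A sketch of a BNNeck-style head consistent with this description is shown below; freezing the BN bias and dropping the classifier bias follow common BNNeck implementations and are assumptions here.

```python
import torch.nn as nn

class BNNeck(nn.Module):
    def __init__(self, feat_dim, num_ids):
        super().__init__()
        self.bn = nn.BatchNorm1d(feat_dim)
        self.bn.bias.requires_grad_(False)               # common BNNeck choice
        self.classifier = nn.Linear(feat_dim, num_ids, bias=False)

    def forward(self, f):                                # f: fused feature, (B, feat_dim)
        f_bn = self.bn(f)                                # f is used for Triplet Loss
        logits = self.classifier(f_bn)                   # logits are used for CELoss
        return f, f_bn, logits
```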

Experiment Setting

Four datasets are selected for the experiments: Market1501 [31], DukeMTMC-reID [32], CUHK03-NP [33] and the large-scale dataset MSMT17 [34].

First, data preprocessing is performed by resizing all images to 256 × 128 and padding them with 10 px, followed by random horizontal flipping and random cropping back to 256 × 128. In addition, data augmentation methods such as random color jittering and random patching are used to increase sample diversity. In the training phase, 4 pedestrians with 6 images each are randomly selected from the training set, giving a mini-batch size of 24. The initial learning rate is set to 3.5 × 10^-6 and is increased to 3.5 × 10^-4 by warmup at the 2000th iteration, after which the Cosine Annealing mechanism is applied.
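A simple sketch of such a warmup-plus-cosine schedule is given below; the total number of iterations and the minimum learning rate are assumptions, since the text does not state them.

```python
import math

def learning_rate(step, warmup_steps=2000, total_steps=12000,
                  start_lr=3.5e-6, base_lr=3.5e-4, min_lr=3.5e-7):
    if step < warmup_steps:
        # linear warmup from start_lr up to base_lr
        return start_lr + (base_lr - start_lr) * step / warmup_steps
    # cosine annealing from base_lr down to min_lr for the remaining steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

# Each iteration: for g in optimizer.param_groups: g["lr"] = learning_rate(step)
```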

Evaluation indicators
To accurately evaluate model performance, the Rank-1 matching rate and the mean average precision (mAP) are used as evaluation metrics. During testing, the person features of the query and gallery are compared using the cosine distance:

d(Q, G) = 1 − (Q · G) / (‖Q‖ ‖G‖)

where Q is the normalized query person feature and G is the normalized gallery person feature. The resulting cosine distances provide the basis for the subsequent metric calculations.
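For reference, the distance computation can be sketched as follows; the feature dimension and set sizes are placeholder values.

```python
import torch
import torch.nn.functional as F

Q = F.normalize(torch.randn(3, 2048), dim=1)   # 3 normalized query features
G = F.normalize(torch.randn(10, 2048), dim=1)  # 10 normalized gallery features
dist = 1.0 - Q @ G.t()                         # (3, 10) cosine distance matrix
ranking = dist.argsort(dim=1)                  # gallery indices, most to least similar
```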
The mAP measures the performance of the whole model by averaging the average precision over the entire dataset. Suppose that among the N images retrieved by the model for a query, X actually show the person being searched for; the precision can then be calculated as

Precision = X / N.

Extending this from a single position to the n correct matches of the same person, the average precision (AP) of that person is

AP = (1/n) · Σ_{k=1}^{n} Precision_k

where Precision_k is the precision at the position of the k-th correct match. Finally, averaging the AP values over all queried pedestrians gives the mAP:

mAP = (1/Q) · Σ_{q=1}^{Q} AP_q.

Rank-1 indicates the probability, over all query samples, that the first result in the similarity-ranked retrieval list is correct:

Rank-1 = (1/N_q) · Σ_{q=1}^{N_q} f(l_q, l_{g_1})

where f(l_q, l_{g_1}) indicates whether the label of the first search result is consistent with that of the query sample, i.e. f(l_q, l_{g_1}) = 1 if the two labels are the same and 0 otherwise.
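The following sketch computes AP, mAP and Rank-1 from ranked match flags in the simplified form described above; it is an illustration rather than the exact evaluation code used in the experiments.

```python
import numpy as np

def average_precision(ranked_flags):
    """ranked_flags[k] is 1 if the k-th retrieved image shares the query ID, else 0."""
    flags = np.asarray(ranked_flags)
    hits = np.cumsum(flags)
    precisions = hits[flags == 1] / (np.flatnonzero(flags) + 1)  # precision at each hit
    return precisions.mean() if precisions.size else 0.0

queries = [[1, 0, 1, 0], [0, 1, 1, 0]]                  # toy ranked results for two queries
mAP = np.mean([average_precision(q) for q in queries])  # mean AP over all queries
rank1 = np.mean([q[0] for q in queries])                # fraction with a correct top-1 result
```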
Ablation experiments

As shown in Tables 3, 4 and 5, the best Re-ID performance is achieved in all cross-domain scenarios when all components, i.e. the complete RCFA module, are used. Overall, the cross-domain re-identification performance on CUHK03-NP is poorer than on the other target domains, because CUHK03-NP is farther away from the other domains in the feature space. By adding the components one by one and evaluating each of them experimentally, it can be seen that every component effectively improves cross-domain Re-ID performance. For the different cross-domain scenarios, the IN-based residual connection structure is first embedded into the baseline, and the Re-ID results improve significantly: the mAP improvement is smallest, at 1.1%, in the DukeMTMC-reID to CUHK03-NP scenario, and largest, at 6.3%, in the MSMT17 to Market1501 and CUHK03-NP to DukeMTMC-reID scenarios. The other metrics also improve noticeably, indicating that the residual connection effectively normalizes the feature style while preserving discriminative information. The FA module is then introduced to form the complete RCFA module, and the Re-ID performance in most cross-domain scenarios is further improved. These experiments demonstrate that the RCFA module significantly improves the cross-domain performance of the model.

Joint training
Based on the single-camera labeling experiments, non-target-domain data are further added for testing. M(single) + D means that the labeled single-camera data of Market1501 are used as the small sample, the complete DukeMTMC-reID is introduced for joint training, and testing is performed on Market1501; D(single) + M means that the labeled single-camera data of DukeMTMC-reID are used as the small sample, the complete Market1501 is introduced for joint training, and testing is performed on DukeMTMC-reID. The results are shown in Table 7.
As can be seen from the table, for both the Market1501 and DukeMTMC-reID datasets, adding non-target-domain data to any single-camera annotation effectively improves the Re-ID performance of the model, which also verifies the effectiveness of the basic model framework in this joint training scenario.

Conclusion
In this paper, we propose a Re-ID method based on residual compensation and fused attention, the core of which is a plug-in RCFA module. This module effectively improves the robustness of the model and overcomes the perturbation caused by different data distributions, addressing the Re-ID problem of cross-domain transfer. In the cross-domain scenario from Market1501 to DukeMTMC-reID, the mAP metric improves by 6.3% and the Rank-1 metric by 10.1% compared to the baseline. Extensive comparisons with existing attention mechanisms and style normalization methods further verify the improvement brought by each component. In addition, a joint training framework combining cross-domain transfer learning and small-sample learning is proposed, which can train on both small-sample data and different-domain data, effectively reducing the data collection and computational cost of realistic scenarios while ensuring that the algorithm model is adequately trained.
Market1501 was collected by five HD cameras and one ordinary camera on the Tsinghua University campus during the summer. The pedestrians in the dataset are divided into 751 training IDs and 750 query IDs. DukeMTMC-reID contains 36411 images of 1812 pedestrians captured by eight HD cameras; 702 IDs were randomly selected from the dataset and the corresponding 16522 images were used as the training set, while 2228 images from the remaining 702 IDs were used as query images. CUHK03-NP is a new training/test split protocol for CUHK03 that divides the identities into 767 for training and 700 for testing. MSMT17 was collected over different time periods and weather conditions and contains 126441 labeled bounding boxes of 4101 person IDs, of which 32621 bounding boxes of 1041 IDs form the training set and 93820 bounding boxes of 3060 IDs form the test set. The details of each dataset are shown in Table 2.
This section focuses on the single-camera annotation experiments for Market1501 and DukeMTMC-reID. The training set of Market1501 contains 12936 images captured by 6 cameras, and the training set of DukeMTMC-reID contains 16522 images captured by 8 cameras. The number of person IDs and the number of images captured by different cameras differ; their specific distributions are shown in Table 6.

Table 1
Experimental environment.

Table 2
Distribution of datasets.

Table 3
Results of the ablation experimental data when Market1501 is the target domain.

Table 4
Results of the ablation experimental data when DukeMTMC-reID is used as the target domain.

Table 5
Results of the ablation experimental data when CUHK03-NP is used as the target domain.

Table 6
Distribution of different camera data in the dataset.

Table 7
Experimental results of single-camera annotation with the addition of non-target domain data.
Finally, the feasibility of this joint training framework is demonstrated through experiments in different scenarios. For example, when the 2017 images from camera 0 of the Market1501 dataset are selected as the small sample and DukeMTMC-reID data are introduced for joint training, with testing on the Market1501 test set, the joint training framework improves mAP by 22.8% and Rank-1 by 21.4% compared with the ResNet-50 model alone. Our subsequent work will investigate multi-modal joint training methods so that the model can utilize multi-modal data to suppress different interference factors and further improve the training effect.