Margin CosReid Network for Pedestrian Re-Identification

Abstract: This paper proposes a margin CosReid network for effective pedestrian re-identification. To overcome the overfitting, gradient explosion, and loss function non-convergence problems caused by traditional CNNs, the proposed GBNeck model achieves faster convergence, stronger generalization, and more discriminative feature extraction. Furthermore, to enhance the within-class classification ability of the softmax loss function, the margin cosine softmax loss (MCSL) is proposed by introducing a boundary margin that ensures intraclass compactness and interclass separability of the learned deep features, thus building a stronger metric-based learning model for pedestrian re-identification. The effectiveness of the margin CosReid network was verified on the mainstream Market-1501 and DukeMTMC-reID datasets through comparison with other state-of-the-art pedestrian re-identification methods.


Introduction
Given a probe image (query), the goal of pedestrian re-identification is to search a gallery (a database of labeled images) captured by multiple nonoverlapping cameras for images containing the same person [1]. Although it plays an important supervisory role in video-based public pedestrian monitoring, its performance can be seriously affected by occlusions and by changes in target posture, camera angle, and illumination intensity.
To solve the above problems, an effective re-identification technology is expected to identify persons with the same identity under different environments and to distinguish between different people even if their appearances are similar. Therefore, a typical re-identification network should combine feature learning ability for pedestrian feature representation with distance metric learning ability for discriminative feature extraction [2,3]. Traditional studies on pedestrian re-identification are mostly based on handcrafted features, such as scale-invariant feature transform (SIFT) features [4,5] and local maximum occurrence (LOMO) features [6], which have gradually been replaced by neural networks. As the earliest simple neural network, the artificial neural network (ANN) is widely used in classification problems [7] but may decrease accuracy while increasing processing time [8]. The convolutional neural network (CNN), one of the typical feature extraction methods in deep learning, automatically embeds the target feature space from the dataset in the pedestrian re-identification problem. Compared to other CNN structures, ResNet50 is simpler and easier to train and has been widely used in pedestrian re-identification methods [9]. Nevertheless, overfitting, gradient explosion, and loss function non-convergence may occur in these networks, making them less effective on complex re-identification tasks.
Besides a robust feature learning model, an effective distance metric learning theory [10][11][12][13] is also key to maximizing interclass differences and minimizing intraclass differences in the re-identification problem. Some scholars posed pedestrian re-identification as a metric-based ranking problem using a support vector machine (SVM) [14,15], learning a ranking function parameterized by a weight vector that sorts positive sample pairs before negative sample pairs. However, SVMs cannot provide fine-tuned features and may increase running time [8,16], so they have gradually been replaced by other metric-based re-identification methods such as the contrastive loss [13] and triplet loss [10] functions. Traditional losses can only select simple, easily distinguishable sample pairs during training; therefore, Hermans et al. [11] proposed an effective metric-based method for hard sample selection, which nevertheless suffers from convergence problems during loss training and is time-consuming. Ahmed et al. [17] trained a pedestrian re-identification model with the softmax loss [18], which removes the classifier and uses the cosine similarity or Euclidean distance of the last network layer for distance queries. The softmax loss function solves the convergence problem without requiring a minimum sample batch size and succeeds in classification between classes but has difficulty distinguishing differences within a class.
Considering the simple and effective performance of CNNs and the success of the softmax loss function in pedestrian re-identification, we combine a CNN (i.e., feature-based re-identification) and softmax (i.e., metric-based re-identification) to achieve better results. However, the problems with CNNs and softmax mentioned above may hinder the performance of this combination. Therefore, we propose the margin CosReid network in this paper to extract more discriminative features from the CNN and to introduce a margin parameter into softmax, solving their respective problems for effective pedestrian re-identification. The main contributions of this paper are summarized as follows:
• The proposed feature extraction model GBNeck is added behind the backbone network ResNet50 to avoid the overfitting and slow convergence problems caused by traditional CNNs, thus achieving stronger generalization and more discriminative feature learning.
• The proposed margin cosine softmax loss (MCSL) introduces a boundary margin parameter that simultaneously maximizes interclass differences and minimizes intraclass differences, making the model robust to outside interference.
• Our method was tested on Market-1501 [1] and DukeMTMC-reID [19] and performs favorably compared with state-of-the-art pedestrian re-identification methods.

Related Work
Existing research aiming to address the pedestrian re-identification problem mainly focuses on different aspects of the issue, such as developing robust feature learning models and designing discriminative metrics. In this section, we briefly review several related works.

Feature-Based Learning Re-Identification Methods
Traditional feature-based learning methods that handle the appearance variations in re-identification are mostly based on handcrafted features, such as SIFT [4,5] and LOMO [6]. However, handcrafted features struggle to achieve satisfactory performance as re-identification datasets grow. As deep learning has developed, automatically learning feature representations from training data has been applied to the re-identification task, and network structures have become much more complex [20]. Zheng et al. [21] applied CNNs using pedestrians' identities as training labels in the re-identification network. However, relying only on identity information when training the network model often creates an overfitting problem, which leads to poor model generalization. Thus, Lin et al. [22] proposed attribute-person recognition to combine the loss of pedestrian identities with those of multiple attribute identifications as training supervision. Sun et al. [23] proposed a visibility-aware part model (VPM) to capture fine-grained learning features. Moreover, to take full advantage of the strengths of different features, Yang et al. [24] formulated a method for exploring more diverse discriminative visual features using a class activation maps (CAM) augmentation multibranch model and a novel penalty mechanism. Additionally, Chen et al. [25] introduced a pair of complementary attention modules and regularized network diversity, which learns more discriminative features and reduces correlations. To prevent pedestrian pose variations from affecting re-identification accuracy, Zheng et al. [19] first applied the generative adversarial network (GAN) to pedestrian re-identification tasks and proved its effectiveness, and Qian et al. [26] proposed a pose-normalized GAN based on it. To narrow the gap between different datasets, Wei et al. [27] proposed a person transfer GAN to reduce expensive data annotation on the datasets. Sun et al. [28] conducted a special study on changes in pedestrians' viewpoints to obtain diverse feature information. Nevertheless, the above methods may fail when whole-body features are not representative in complex scenes. Therefore, Liu et al. [29] proposed a spatial and temporal features mixture model to make full use of information from individual human body parts.

Metric-Based Learning Re-Identification Methods
Studies that focus only on feature-based learning may have difficulty distinguishing similar appearances, which can be addressed through metric learning methods. Traditional metric-based re-identification methods, such as cross-view quadratic discriminant analysis (XQDA) [6] and the keep-it-simple-and-straightforward metric (KISSME) [30], learn a feature subspace with discriminative ability. Kalarani et al. [8] combined SVM and ANN for data mining analysis to cope with the low accuracy and high processing time of the ANN. To increase the interclass distance, Tang et al. [16] used a hinge loss together with SVM to enhance the generalization ability of the model. To force the distance between dissimilar sample pairs to be greater than that between similar sample pairs, Cheng et al. [31] proposed an improved triplet loss, and Chen et al. [12] proposed a quadruplet loss and introduced new constraints based on it. Because the network selects samples randomly, samples that are too simple lead to poor model generalization, while samples that are too complex lead to gradient explosion during training. Therefore, selecting triplet and quadruplet samples remains a major problem [11,32]. Hermans et al. [11] proposed hard sample mining to solve the poor generalization ability caused by random sample selection. Ahmed et al. [17] applied the softmax loss function to pedestrian re-identification by using the cosine similarity or Euclidean distance of the last network layer for distance queries. Due to its strong classification ability, the softmax function remains widely used in multiclassification applications such as image classification, face recognition, and object detection [33,34]. Guo et al. [35] trained both softmax and center losses in an ANN to classify extracted spectral features. This combination makes intraclass features more compact and interclass features more separable but simultaneously results in parameter redundancy.
Thus, Wen et al. [36] combined the center loss with the softmax loss to supervise a CNN and minimize the deep feature distances within classes. To approximate the classification boundary in metric-based learning re-identification, a linearization method [37] was proposed for deep networks to enhance the robustness of the loss function through the activation function. The margin of the loss function in this method can be imposed on any network layer but may lead to parameter expansion and longer training time.

Margin CosReid Network
The overall framework of the proposed margin CosReid network is shown in Figure 1. First, the input images are fed into the backbone network ResNet50. Then, the proposed GBNeck model is appended for feature extraction. Finally, the pedestrian descriptor is obtained.

Proposed GBNeck
In this paper, we propose the fine-tuned GBNeck model to extract deep pedestrian features. First, to increase the size of the feature map and obtain higher-resolution features, GBNeck removes the last downsampling layer of the backbone network ResNet50. Then, to reduce the number of parameters and integrate global spatial information, a global average pooling (GAP) layer replaces the fully connected (FC) layer behind ResNet50, pooling the 16 × 8 feature map into 1 × 1 to obtain a 2048-dimensional feature vector. Next, each neuron in the FC layer is fully connected to all neurons in the previous layer to integrate the classified local information from the pooling layer.
Furthermore, the batch normalization (BN) layer can speed up training and mitigate exploding gradients [38,39]. Thus, we introduce a BN layer in GBNeck and experimentally find that it also improves the generalization ability of the model. A dropout layer is then introduced to avoid overfitting, improve generalization performance, and act as a regularizer during training [40]. Finally, an additional FC and BN layer is added as the discriminative descriptor to focus the network on the input image and reduce image distortion caused by external factors, yielding 512-dimensional feature vectors for person re-identification. The proposed GBNeck model achieves faster convergence, stronger generalization, and more discriminative feature learning during training.
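The head described above can be summarized in a short PyTorch sketch. This is our own minimal reading of GBNeck (the layer order and sizes are inferred from the description and Table 1; the module name and defaults are ours, not the authors' released code):

```python
import torch
import torch.nn as nn

class GBNeck(nn.Module):
    """Sketch of the GBNeck head: GAP -> FC -> BN -> dropout -> FC -> BN,
    applied to the 16x8 ResNet50 feature map (last downsampling removed)."""
    def __init__(self, in_dim=2048, mid_dim=1024, out_dim=512, p_drop=0.5):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)      # pool 16x8 map to 1x1
        self.fc1 = nn.Linear(in_dim, mid_dim)   # integrate local information
        self.bn1 = nn.BatchNorm1d(mid_dim)      # faster, more stable training
        self.drop = nn.Dropout(p_drop)          # regularization against overfitting
        self.fc2 = nn.Linear(mid_dim, out_dim)  # 512-d pedestrian descriptor
        self.bn2 = nn.BatchNorm1d(out_dim)

    def forward(self, x):                       # x: (B, 2048, 16, 8)
        x = self.gap(x).flatten(1)              # (B, 2048)
        x = self.drop(self.bn1(self.fc1(x)))
        return self.bn2(self.fc2(x))            # (B, 512)

feat = GBNeck()(torch.randn(4, 2048, 16, 8))
print(feat.shape)  # torch.Size([4, 512])
```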

Margin Cosine Softmax Loss (MCSL)
The traditional softmax loss function is widely used as a supervision mechanism in pedestrian re-identification because of its excellent performance in discriminating features between classes, but it has difficulty distinguishing differences within classes [41]. To solve this problem, this paper proposes the margin cosine softmax loss (MCSL), which normalizes the weight and feature vectors and introduces a boundary margin parameter m to maximize interclass differences while minimizing intraclass differences, yielding deeper pedestrian feature embeddings.

Softmax Loss
This section first introduces the common classification loss function, namely, the softmax loss. Given the input feature vector x_i of the i-th training sample and the corresponding label y_i, the traditional softmax loss function is expressed as

L_softmax = −(1/N) Σ_{i=1}^{N} log p_i = −(1/N) Σ_{i=1}^{N} log ( e^{f_{y_i}} / Σ_{j=1}^{C} e^{f_j} ),  (1)

where p_i denotes the posterior probability that x_i is correctly classified, and N and C denote the size of the training sample and the number of categories, respectively. f denotes the activation of the fully connected layer, with weight matrix W ∈ R^{d×n} and offset B ∈ R^n. In Equation (1), f_j = W_j^T x_i + B_j and f_{y_i} = W_{y_i}^T x_i + B_{y_i}, where W_j and W_{y_i} denote the j-th and y_i-th columns of W, respectively. In this paper, B is set to 0 [42], and the activation f_j is then computed as

f_j = ‖W_j‖ ‖x_i‖ cos θ_{j,i},  (2)

where θ_{j,i} (0 ≤ θ_{j,i} ≤ π) is the angle between the weight vector W_j and the feature vector x_i.
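The decomposition in Equation (2) is just the inner product written in polar form; a quick numerical check (NumPy, with toy vectors of our own choosing) makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
W_j = rng.normal(size=5)   # one column of the weight matrix W
x_i = rng.normal(size=5)   # an input feature vector

f_j = W_j @ x_i            # FC activation with the offset B = 0
cos_theta = f_j / (np.linalg.norm(W_j) * np.linalg.norm(x_i))

# Eq. (2): f_j = ||W_j|| * ||x_i|| * cos(theta_{j,i})
assert np.isclose(f_j, np.linalg.norm(W_j) * np.linalg.norm(x_i) * cos_theta)
print(round(float(cos_theta), 3))
```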

Cosine Softmax Loss
L-softmax [43] does not consider the imbalance of the sample distribution; thus, we normalize the weights so that ‖W_j‖ = 1 [33,42,44], treating each category relatively equally during training. The feature vector x_i is also L2-normalized and rescaled to length s, making the posterior probability p_i rely only on the cosine value and improving the resolution ability. The improved loss function, named the cosine softmax loss (CSL), is then defined as [42]

L_CSL = −(1/N) Σ_{i=1}^{N} log ( e^{s cos θ_{y_i,i}} / Σ_{j=1}^{C} e^{s cos θ_{j,i}} ).  (3)

In Equation (3), the features learned in the CSL space can be separated and thus classified correctly.

Proposed Margin Cosine Softmax Loss
Although features of different classes can be well distinguished, those within the same class cannot be separated using CSL. Building on [43], SphereFace [42] normalizes the weight vectors and applies a multiplicative margin that compresses the sample features into a smaller space while also reducing the monotone interval of the cosine function, which makes optimization difficult. To solve this problem, both ArcFace [33] and CosFace [44] introduce additive margins applied only to the terms corresponding to the true class labels. To maximize the distance between classes, we also introduce an additive margin between different categories in the denominator and propose the margin cosine softmax loss (MCSL), defined as

L_MCSL = −(1/N) Σ_{i=1}^{N} log ( e^{s(cos θ_{y_i,i} − m)} / ( e^{s(cos θ_{y_i,i} − m)} + Σ_{j≠y_i} e^{s cos θ_{j,i}} ) ),  (4)

subject to

‖W_j‖ = 1, ‖x_i‖ = 1, cos θ_{j,i} = W_j^T x_i,  (5)

where N and C denote the numbers of training sample batches and dataset categories, respectively. x_i denotes the normalized feature vector of the i-th sample corresponding to label y_i, and W_j denotes the weight vector of category j. θ_{y_i,i} and θ_{j,i} denote the angles between x_i and the weight vectors W_{y_i} and W_j, respectively. Compared with the exp-normalize trick [45], MCSL maps the pedestrian descriptor into the cosine space through a cosine function that has intrinsic consistency with softmax. The hyperparameter m is the boundary margin, whose introduction makes classification stricter by tightening the cosine decision boundary from cos(ϕ_1) > cos(ϕ_2) to cos(ϕ_1) − m > cos(ϕ_2). Here, ϕ_i (i = 1, 2) denotes the angle between a feature vector x and the weight vector of class C_i. A sample x is assigned to class C_1 only if cos(ϕ_1) − m > cos(ϕ_2) in MCSL, which yields a significant improvement in distinguishing features within a class. Moreover, the binary-classification form of MCSL can be extended to other multiclass problems.
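Our reading of Equation (4) can be sketched in a few lines of PyTorch. This is a hedged, CosFace-style implementation of an additive cosine margin, not the authors' released code; the function name, the toy shapes, and the 751-identity weight matrix are our own illustration:

```python
import torch
import torch.nn.functional as F

def mcsl(features, weight, labels, s=30.0, m=0.35):
    """Margin cosine softmax loss as we read Eq. (4): L2-normalize the
    features and the class weight vectors, subtract the additive margin m
    from the target-class cosine, scale by s, then apply softmax cross-entropy."""
    cos = F.normalize(features) @ F.normalize(weight).t()   # (B, C) cosines
    margin = F.one_hot(labels, weight.size(0)).float() * m  # m only on class y_i
    return F.cross_entropy(s * (cos - margin), labels)

feats = torch.randn(8, 512)                     # e.g. 512-d GBNeck descriptors
W = torch.randn(751, 512, requires_grad=True)   # one weight vector per identity
y = torch.randint(0, 751, (8,))
loss = mcsl(feats, W, y)
loss.backward()                                 # gradients flow to W
```

Note that s is fixed rather than learned here, matching the paper's treatment of the scaling parameter.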

Loss Comparison
In this section, the decision boundaries between classes C_1 and C_2 of the three loss functions softmax, CSL, and MCSL are shown in Figure 2. First, in the traditional softmax loss function, an overlapping decision area (denoted by the yellow region in Figure 2a) exists between the decision boundaries of classes C_1 and C_2 (denoted by the purple and blue dotted lines, respectively), defined as ‖W_1‖ cos(ϕ_1) = ‖W_2‖ cos(ϕ_2); samples falling in this overlapping area cannot be assigned to either class.
Based on softmax, CSL performs L2 normalization on the weight vectors W_1 and W_2, making the decision boundary a constant (cos(ϕ_1) = cos(ϕ_2)). As shown in Figure 2b, the boundaries of classes C_1 and C_2 coincide, so samples on or close to this shared line still cannot be assigned to either class. CSL performs well on simple pedestrian classification problems because there is no large overlapping area compared with traditional softmax, but it remains vulnerable to outside interference from similar pairs of negative samples and can confuse two pedestrians with different identities.
To solve the above problems, the boundary threshold m is introduced in the cosine space in the proposed MCSL. The boundary conditions of MCSL are defined as

C_1: cos(ϕ_1) ≥ cos(ϕ_2) + m,  (6)
C_2: cos(ϕ_2) ≥ cos(ϕ_1) + m.  (7)

In Equations (6) and (7), a feature vector belongs to class C_1 only if the minimum value of cos(ϕ_1) is greater than or equal to the maximum value of cos(ϕ_2), and vice versa. Because the decision boundaries of classes C_1 and C_2 are separated far from each other, as shown in Figure 2c, decisions are easier to make with the MCSL function. Therefore, the interclass difference becomes larger and the intraclass distribution becomes more compact, which is strong enough to deal with outside interference and learn more powerful distinguishing features.
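The decision rule in Equations (6) and (7) can be stated as a tiny helper (our illustration; cos1 and cos2 stand for cos(ϕ_1) and cos(ϕ_2)):

```python
def assign_class(cos1, cos2, m=0.35):
    """MCSL-style decision with margin m (Eqs. (6)-(7)): a sample joins a
    class only when its cosine beats the other class's cosine by at least m."""
    if cos1 >= cos2 + m:
        return "C1"
    if cos2 >= cos1 + m:
        return "C2"
    return "undecided"  # inside the margin band between the two boundaries

print(assign_class(0.9, 0.4))  # C1
print(assign_class(0.6, 0.5))  # undecided: 0.6 < 0.5 + 0.35
```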
To compare the three loss functions in more depth, we conduct an experiment on their class activation maps in Figure 3. Color temperature indicates importance in producing gradients associated with the target identities: warmer colors and sharper outlines demonstrate better performance. The rough outlines of the results are drawn with pink dotted curves. In Figure 3a, the outline of MCSL is sharper and more similar in shape to the person in the input image than those of softmax and CSL. In Figure 3b,c, the color within the outline of MCSL is warmer than those of softmax and CSL. Moreover, the red boxes in Figure 3 denote background regions that are mistakenly identified as the target in the outlines of both softmax and CSL but not in that of MCSL, indicating that MCSL outperforms the others in class activation mapping.

Datasets
In our experiments, all algorithms are tested on two datasets: Market-1501 [1] and DukeMTMC-reID [19]. The Market-1501 dataset consists of 32,668 images of 1501 labeled persons captured by six cameras, one of which has a lower resolution. There are 751 identities in the training set and 750 identities in the testing set. The DukeMTMC-reID dataset contains 36,411 images of 1812 persons captured by eight high-resolution cameras. A total of 16,522 images of 702 persons are randomly selected from the dataset as the training set, and the other 702 persons form the testing set, which includes 2228 query images and 17,661 gallery images.
We demonstrate the effectiveness of the proposed method using the cumulative matching characteristics (CMC) at rank-1, rank-5, and rank-10 and the mean average precision (mAP) on the standard datasets. Considering re-identification as a ranking problem, rank-n (n = 1, 5, 10) denotes the probability that the first n identities in the ranked candidate re-identification list contain the correct result, with rank-1 being the most important. The mAP is the mean of the average precisions (APs) over all queries, which considers both the precision and recall of an algorithm and thus provides a more comprehensive evaluation [1]. The higher the values of rank-1, rank-5, rank-10, and mAP, the better the re-identification performance.
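For a single query, rank-n and AP reduce to a few lines; this toy sketch (our own helper names and a hand-made ranking, not the evaluation code used in the paper) shows how the two metrics score one ranked gallery list:

```python
import numpy as np

def rank_n_hit(ranked_ids, query_id, n):
    """CMC rank-n for one query: 1 if a correct gallery identity
    appears among the top-n ranked results, else 0."""
    return int(query_id in ranked_ids[:n])

def average_precision(ranked_ids, query_id):
    """AP for one query: precision evaluated at each position holding
    a correct match, averaged over all correct matches."""
    hits, precisions = 0, []
    for pos, gid in enumerate(ranked_ids, start=1):
        if gid == query_id:
            hits += 1
            precisions.append(hits / pos)
    return float(np.mean(precisions)) if precisions else 0.0

# Toy ranking for query identity 7: correct matches at positions 1 and 4.
ranking = [7, 3, 5, 7, 2]
print(rank_n_hit(ranking, 7, 1))      # 1
print(average_precision(ranking, 7))  # (1/1 + 2/4) / 2 = 0.75
```

The mAP then simply averages these per-query APs over every query in the test set.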

Experimental Results and Analysis
Our approach is implemented using the PyTorch framework with a GTX TITAN X GPU, an Intel i7 CPU, and 128 GB of memory. The backbone network ResNet50 is pretrained on ImageNet [46]. The input images are resized to 256 × 128, the batch size is set to 32, and the number of training iterations is set to 60. The initial learning rate is set to 0.1 and reduced to 0.01 after 20 iterations. The weight decay is set to 0.0005, and the momentum term is set to 0.9.
Table 1 demonstrates the efficiency of GBNeck in Figure 1. Based on the baseline ResNet50, Net-A adds only a global average pooling (GAP) layer; Net-B adds a fully connected (FC) layer based on Net-A; Net-C adds a batch normalization (BN) layer based on Net-B; and the proposed GBNeck adds a dropout layer, an FC layer, and a BN layer based on Net-C. The embedded feature size is 2048 for Net-A, 1024 for Net-B and Net-C, and 512 for GBNeck, where the dropout rate is set to 0.5. As shown in Table 1, the precision of Net-B is higher than that of Net-A because the FC layer captures the key image information to obtain more discriminative features. Net-C performs even better than Net-B, which shows the effective generalization ability of the BN layer. GBNeck achieves the best rank-1 of 91.0% on Market-1501 by combining the dropout and BN layers.
In the proposed GBNeck, the last downsampling layer is removed from the backbone network ResNet50 for higher spatial resolution, which brings a significant improvement [47]. To demonstrate the effect of this removal, two sets of experiments were performed on the influence of the downsampling layer, with results shown in Table 2.
Table 2. Influences on rank-1 and mAP of the downsampling layer. w/ and w/o denote that the downsampling layer is and is not removed, respectively. Bold fonts denote the highest values.
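The optimizer and learning-rate schedule stated above can be sketched as follows (a minimal PyTorch setup under our assumptions; the Linear layer is only a stand-in for the real ResNet50 + GBNeck model):

```python
import torch

model = torch.nn.Linear(512, 751)  # stand-in for ResNet50 + GBNeck
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# lr starts at 0.1 and drops once to 0.01 after 20 iterations (epochs)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[20], gamma=0.1)

for epoch in range(60):  # 60 training iterations, batch size 32 per step
    # ... per-batch forward / backward / optimizer.step() would go here ...
    scheduler.step()

print(round(optimizer.param_groups[0]["lr"], 4))  # 0.01
```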

In addition, we conducted an experiment to show the mean training loss curves over 5 runs of MCSL with the backbone ResNet50 and with GBNeck on the Market-1501 dataset. As seen in Figure 4, GBNeck converges faster than ResNet50.

The Proposed MCSL
In this section, we conduct experiments to evaluate the efficiency of the proposed margin cosine softmax loss (MCSL). Figure 5 shows the mean training loss curves over 5 runs of the three functions, i.e., softmax, CSL, and MCSL, optimized by Adam [48] and stochastic gradient descent (SGD) on the Market-1501 dataset. The training curves of the softmax loss optimized by Adam and SGD decline slowly after the 20th and 15th iterations, respectively. Comparatively, the convergence of CSL in the early stage is accelerated by the cosine space mapping, but it converges slowly after the 25th iteration. The curve of MCSL optimized by Adam oscillates considerably in the final training stage, showing a clear conflict between MCSL and Adam [49]. MCSL optimized by SGD suppresses this inconsistency and achieves high accuracy, as shown in Figure 5, so we use SGD to optimize our training model in this paper.
Experiments were also performed on the selection of the loss parameters s and m in Equation (4). The scaling parameter s in the proposed MCSL controls the degree to which the loss depends on the cosine space; it is fixed rather than adaptively learned, to avoid slow network convergence and difficult optimization. As shown in Figure 6a, if s is set too small, the loss decreases slowly and may not even converge. Therefore, s is set to a larger value in this paper for better performance and faster training loss reduction. The margin parameter m is selected as shown in Figure 6b. We set m to 0.35 in this paper because it yields the highest accuracy on both the Market-1501 and DukeMTMC-reID datasets.

Visualization Result
The visualization result of the proposed method is shown in Figure 7. Different rows represent different pedestrian sample images from the Market-1501 dataset. Images in the first column denote the query images, and the retrieval images are sorted by cosine similarity from 1 to 10. As seen from the sorting order, most of the retrieved images are correct, but some incorrect images remain, marked with red numbers. These failures probably occur because of insufficient image information collected from single-view cameras.

Comparison with State-of-the-Art Methods
The proposed re-identification method was compared with other existing state-of-the-art methods, and the comparison results are shown in Tables 3 and 4. On the Market-1501 dataset, the single-query rank-1 and mAP of our method reach 91.0% and 75.9%, respectively. On DukeMTMC-reID, the single-query rank-1 and mAP of our method reach 81.1% and 64.61%, respectively, which are much higher than those of other methods. Compared with PSE [50], the rank-1 and mAP of our method, which uses no pose information, increase by 2.3% and 6.9%, respectively, on Market-1501 and by 1.3% and 2.61%, respectively, on DukeMTMC-reID. Furthermore, compared with PNGAN [26], which uses a GAN to generate pedestrian images, the rank-1 and mAP of our method increase by 0.6% and 3.3%, respectively, on Market-1501 and by 7.5% and 11.4%, respectively, on DukeMTMC-reID.
To show the improvements brought by the re-ranking method and the multi-query mode on rank-1, rank-5, rank-10, and mAP, we conducted comparison experiments on the proposed MCSL with single-query mode (MCSL+single-query), MCSL with re-ranking (MCSL+RK), MCSL with multi-query mode (MCSL+multi-query), and MCSL with both re-ranking and multi-query mode (MCSL+RK+multi-query). The experimental results in Tables 3 and 4 indicate that using re-ranking and the multi-query mode can further improve the performance of our method.

Conclusions
In this paper, we proposed a margin CosReid network for end-to-end pedestrian re-identification that combines feature-based and metric-based learning. On the one hand, the feature extraction model GBNeck, a typical feature-based learning model, was proposed to overcome the overfitting and slow convergence problems of traditional CNNs. On the other hand, the margin cosine softmax loss (MCSL) was proposed by introducing a boundary margin parameter that simultaneously maximizes interclass differences and minimizes intraclass differences. Our margin CosReid network not only achieves stronger generalization and more discriminative feature learning but also robustly handles outside interference in the metric-based learning model. It was tested on Market-1501 and DukeMTMC-reID and showed superior performance in accuracy and robustness compared with state-of-the-art pedestrian re-identification methods.
Although accuracy saturation is a common phenomenon in deep networks, it still limits the accuracy of the proposed MCSL. In future work, we plan to introduce deep residual learning to reduce accuracy saturation and obtain even higher accuracy.
Author Contributions: X.Y. designed the study, analysed data, and wrote the paper and the revisions. M.G. performed the experiments and analysed data. Y.S. contributed to refining the ideas, and is responsible for the research team. K.D. performed the experiments and analysed data. X.H. collected and analysed the data. All authors have read and agreed to the published version of the manuscript.