Sparse and Low-Rank Joint Dictionary Learning for Person Re-Identification

Abstract: In the past decade, the scientific community has become increasingly interested in person re-identification. It remains a challenging problem because of low-quality images, occlusion between objects, and large changes in lighting, viewpoint and posture, even for the same person. We therefore propose a dictionary learning method that divides the appearance characteristics of pedestrians into a shared part, which captures the similarity between different pedestrians, and a specific part, which reflects unique identity information. During re-identification, removing the shared part of a pedestrian's visual characteristics and considering only the part unique to each person reduces the ambiguity of those characteristics. In addition, considering the structural characteristics of the shared dictionary and the special dictionary, low-rank, $\ell_0$-norm and row-sparsity constraints, rather than their convex-relaxed forms, are introduced into the dictionary learning framework to improve its representation and recognition capabilities. The resulting model is solved by the method of alternating directions. Experimental results on several commonly used datasets show the effectiveness of the proposed method.


Introduction
Pedestrian re-identification (re-ID) aims to identify specific pedestrians across cameras at different locations, that is, to establish the correspondence between people observed in different fields of view. It is a key task in most surveillance and security applications [1][2][3], and has attracted increasing attention from the computer vision community [4,5]. However, in real, complex environments, differences in camera resolution and viewing angle, background and lighting changes, occlusions and changes in pose all adversely affect pedestrian recognition and pose many technical challenges. As a result, there is still a large gap between current person re-identification technology and practical applications.
This research direction has attracted the attention of a large number of scholars and research institutions. Research on person re-identification mainly focuses on two aspects: the representation of pedestrian features [6][7][8][9][10][11][12] and similarity metric learning [13][14][15][16][17][18]. Feature descriptors address how to select visual features with good discrimination and robustness for pedestrian image matching. As high-dimensional visual features usually do not capture the factors that remain invariant under sample variance, a distance metric is introduced into pedestrian re-identification. The main idea of metric learning is that, in the embedded space, the visual characteristics of different pedestrians should be well separated and the visual characteristics of the same pedestrian under different views should be as similar as possible. Since sparse dictionary learning is a special case of metric learning, it has been successfully applied in computer vision fields such as face recognition [19,20], and is now applied to pedestrian re-identification.
In 2010, Cong et al. first introduced dictionary learning into person re-identification [21]. They built a dictionary from one camera and then represented the pedestrians from the other camera sparsely and linearly over that dictionary. Khedher et al. [22] used the SURF descriptor to extract features from each person's images and then constructed a known dictionary using reference SURFs. They used the sparse representation model to learn the coefficients and then determined the identity. Karanam et al. [23] learned a single view-invariant dictionary for different cameras. They also improved the discriminative ability of the dictionary by adding explicit constraints on the sparse codes, which made the Euclidean distance between the coding coefficients of the same pedestrian under different views smaller than that between the coding coefficients of different pedestrians. Jing et al. [24] proposed a novel semi-coupled low-rank discriminant dictionary learning approach for high- and low-resolution images. Karanam et al. [25] proposed a block sparse representation method based on dictionary learning. An et al. [26] used canonical correlation analysis (CCA) to learn a subspace that maximizes the correlation between data from different cameras corresponding to the same people. Then, they jointly learned the dictionaries for each camera view in the CCA subspace. Zhou et al. [27] proposed a novel joint learning framework that unifies representative feature dictionary learning and discriminative metric learning. Xu et al. [28] proposed to separate the images of the same pedestrian observed from different camera views into view-shared components and view-specific components so as to improve the discriminating performance of the learned dictionary. Peng et al. [29] proposed a novel dictionary learning model which divides the dictionary space into three parts corresponding to semantic, latent discriminative and latent background attributes, respectively. Li [30] proposed a discriminative semi-coupled projective dictionary learning (DSPDL) model that employs an efficient projection dictionary learning strategy and jointly learns a pair of dictionaries and a mapping function to model the correspondence of cross-view data. Li et al. [31] proposed a person re-ID method that divides a pedestrian's appearance features into different components. They developed a framework for learning a pair of commonality and specificity dictionaries, while introducing a distance constraint that forces the particularities of the same person to have the same coding coefficients over the specificity dictionary and the coding coefficients of different pedestrians to have a weak correlation. Li et al. [32] considered a novel joint fusion and super-resolution framework based on discriminative dictionary learning. They jointly learned two pairs of low-rank and sparse dictionaries and a conversion dictionary, which are used to represent the low-rank and sparse components of low-resolution images and to reconstruct a high-resolution fused result. However, to characterize sparsity and low rank accurately, it is preferable to impose the sparsity and low-rank constraints directly rather than through approximations or regularizations.
In 2014, Li et al. [33] first used deep learning methods for person re-identification, and since then an increasing number of researchers have combined deep learning with person re-identification. Deep learning can integrate feature extraction and metric learning into a unified learning framework and is mainly focused on extracting global identity features from pedestrian images. He et al. [34] proposed to use the Spatial Pyramid structure to extract sample features. Huang et al. [35] used a deep neural network to learn different representation features for different parts of pedestrian appearance images, and then calculated the similarity of the corresponding parts of the image. Then, three sub-networks were constructed for each part to learn the differences between images, feature maps and spatial changes, and the results of the three sub-networks were combined. Wu et al. [36] introduced a deep architecture that combines Fisher vectors and deep neural networks to learn a mixture of nonlinear transformations of pedestrian images into a deep space where the data can be linearly separated. Tao et al. [37] utilized Cross-view Quadratic Discriminant Analysis (XQDA) metric learning for person recognition in order to achieve simultaneous spatial localization and feature representation. Unlike still images, video sequences contain not only spatial dependencies but also temporal order relationships between frames. Reasonable use of the temporal features of videos can reflect the motion characteristics of pedestrians and improve the recognition accuracy. Therefore, for video-based pedestrian re-identification, the spatiotemporal features of videos are often extracted for recognition. Gao et al. [38] proposed a temporally aligned pooling representation method, which uses the periodic characteristics of walking to divide the video sequence into independent walking cycles, and selects the cycle that best matches the characteristics of a sinusoidal signal to represent the video sequence. Rahmani et al. [39] proposed a deep fully connected neural network that finds the nonlinear transformations connecting a set of views, learned from 2D projections of the dense trajectories of synthetic 3D human models fitted to real motion capture data. Using the spatiotemporal motion characteristics of human walking, Khan et al. [40] proposed a novel view-invariant gait representation based on a deep fully connected neural network for cross-view gait recognition. However, spatiotemporal features are susceptible to factors such as viewing angle, scale and speed. As the number of pedestrians grows substantially, the motion similarity between pedestrians also increases, which greatly reduces the discriminative power of spatiotemporal features. At the same time, the large number of cameras in large datasets increases the pose and motion differences of the same pedestrian. These factors all limit the role of spatiotemporal features in pedestrian re-identification.
In this paper, we propose a new special and shared dictionary learning model with structural constraints, which has stronger interpretability. We divide the learned dictionary into two parts. One is a shared dictionary, which represents features shared by all pedestrians in the camera, such as the same background. The other is a special dictionary, which represents the unique characteristics of each pedestrian. Only the unique part that represents the identity of the pedestrian is then considered in the recognition process, which reduces the ambiguity caused by unnecessary visual factors. The main contributions of the paper are summarized as follows:
(I) The atoms of the shared dictionary, whose features are shared by all people, are strongly correlated, so the shared dictionary should be low rank; we therefore impose the low-rank constraint directly. We also impose an $\ell_0$-norm constraint on the special dictionary, which is strongly sparse and contains only information unique to each person.
(II) In order to better describe the shared information of pedestrians and force the commonality of different pedestrians to have the same coding coefficients over the shared dictionary, we introduce the $\ell_{2,0}$-norm constraint on the coding coefficients $Z_s$.
(III) Due to the $\ell_0$-norm and low-rank constraints, the dictionary learning model with structural constraints is highly nonconvex and computationally NP-hard in general; therefore, we adopt the method of alternating directions to solve it. When dealing with each subproblem, we work directly with the original $\ell_0$-norm and rank constraints instead of their convex relaxed forms.
Numerical experiments performed on several real datasets show that our method is superior to traditional methods, and even better than some deep learning methods on some datasets.
The rest of this paper is organized as follows. The joint dictionary learning model is presented in Section 2, while Section 3 is devoted to the optimization algorithm for the special and shared dictionary learning model and to the re-identification process. In Section 4, the computational experiments are reported. Finally, we conclude the paper and discuss future work in Section 5.

Joint Dictionary Learning Model
Surveillance cameras are generally fixed in place, so the images of all people captured by a camera contain some common elements that do not help recognition. What is useful is the unique part that carries the information about each person. Assume that we have two camera views, a and b, and that $Y_l = [y_{l,1}, y_{l,2}, \ldots, y_{l,N}] \in \mathbb{R}^{m \times N}$ ($l = a, b$) is a set of training samples composed of the images of N individuals from the l-th view. $Y_l$ can be divided into two parts, a person-shared component coded over $D_s$ and a person-specific component coded over $D_u$. Considering the actual situation, we mainly study the dictionary learning model (1) for the person re-identification problem, in which $Z_{a,u}$ and $Z_{b,u}$ are the coding coefficients of the person-specific components under camera views a and b; $Z_s = [z_{s,1}, z_{s,2}, \ldots, z_{s,N}]$ is the coding coefficient of the person-shared components under different camera views. $D_s$ is the dictionary for the person-shared elements, and $D_u$ is the dictionary for the person-specific elements. $\lambda_1$, $\lambda_2$, $\lambda_3$ are penalty parameters, and $s_1$, $s_2$, $k$ are three integers representing the prior information on the upper bounds of the sparsity and the rank, respectively. $\|D_u\|_0$ is the zero norm of $D_u$, i.e., the number of its nonzero elements. $\mathrm{rank}(D_s)$ is the rank of the matrix $D_s$. $\|Z_s\|_{2,0}$ is the zero norm of the rows of the matrix $Z_s$, i.e., the number of its nonzero rows. Different people have different features, so the coding coefficients of different pedestrians should be largely uncorrelated. The same pedestrian looks similar under different cameras, i.e., one person under different views should have the same coding coefficient.
Here, cor denotes the correlation function, and its bounds are given correlation parameters. However, the correlation coefficient is difficult to compute; similar to [19], we transformed the correlation coefficient constraint into a form that is easier to handle. The shared elements play the same role for each pedestrian under the camera, and the shared features are only a small part of all features, so the constraint $\|Z_s\|_{2,0} \le s_1$ was added. Generally, the common part is strongly correlated. For example, two cameras may share part of the same background, and the image background often has a low-rank structure, so the shared dictionary should have a low-rank structure. At the same time, the unique information of each pedestrian is different, so the special dictionary should have a sparse structure. Together, these give $\mathrm{rank}(D_s) \le k$ and $\|D_u\|_0 \le s_2$. The identity information of the same pedestrian under different cameras is the same and should be as similar as possible. Therefore, the same pedestrian should have the same coefficient under different cameras; that is, $Z_{a,u} - Z_{b,u} = 0$.
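The three structural constraints above each define a set onto which a matrix can be projected in closed form. The following is a minimal NumPy sketch of those projections, assuming the standard results (truncated SVD for the rank constraint, hard thresholding for the $\ell_0$ constraint, row selection for the $\ell_{2,0}$ constraint). The function names are hypothetical, and the sketch is not claimed to be the solver used in [42] or [43].

```python
import numpy as np

def project_rank(D, k):
    """Closest matrix (Frobenius norm) with rank at most k, via truncated SVD."""
    U, sing, Vt = np.linalg.svd(D, full_matrices=False)
    sing[k:] = 0.0                      # drop all but the k largest singular values
    return (U * sing) @ Vt

def project_l0(D, s):
    """Keep the s largest-magnitude entries of D and zero the rest."""
    if s >= D.size:
        return D.copy()
    out = np.zeros_like(D)
    if s <= 0:
        return out
    keep = np.argpartition(np.abs(D).ravel(), -s)[-s:]
    out.flat[keep] = D.flat[keep]       # .flat indexes in C order, like ravel()
    return out

def project_row_sparse(Z, s):
    """Keep the s rows of Z with the largest Euclidean norm and zero the rest."""
    if s >= Z.shape[0]:
        return Z.copy()
    out = np.zeros_like(Z)
    if s <= 0:
        return out
    keep = np.argpartition(np.linalg.norm(Z, axis=1), -s)[-s:]
    out[keep] = Z[keep]
    return out

# Illustrative sizes only; they are not taken from the experiments.
rng = np.random.default_rng(0)
D_s = project_rank(rng.standard_normal((8, 6)), k=2)        # rank(D_s) <= 2
D_u = project_l0(rng.standard_normal((8, 6)), s=10)         # ||D_u||_0 <= 10
Z_s = project_row_sparse(rng.standard_normal((6, 5)), s=3)  # ||Z_s||_{2,0} <= 3
```

Each projection returns a feasible point for the corresponding constraint, which is the basic building block of projection-type methods for such nonconvex sets.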

Algorithm Implementation
In this section, we describe the algorithm of dictionary learning and the process of re-identification.

Dictionary Learning Algorithm
First, we present the algorithm for problem (1). Because of the $\ell_0$-norm and low-rank constraints and the properties of the objective function of dictionary learning, we adopted an alternating directions scheme that optimizes one variable at a time with the other variables fixed.
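The idea of updating one block of variables while the others are fixed can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not model (1): it minimizes the unconstrained bilinear objective $\|Y - DZ\|_F^2 + \rho(\|D\|_F^2 + \|Z\|_F^2)$, where the ridge weight $\rho$ and all names are hypothetical. Because each step solves its subproblem exactly, the objective cannot increase from one step to the next.

```python
import numpy as np

def alternating_least_squares(Y, n_atoms, n_iter=20, ridge=1e-3, seed=0):
    """Toy alternating scheme for min ||Y - D Z||_F^2 + ridge (||D||_F^2 + ||Z||_F^2)."""
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    D = rng.standard_normal((m, n_atoms))
    eye = np.eye(n_atoms)
    history = []
    for _ in range(n_iter):
        # Z-step: with D fixed, the problem is a ridge regression in Z.
        Z = np.linalg.solve(D.T @ D + ridge * eye, D.T @ Y)
        # D-step: with Z fixed, the problem is a ridge regression in D.
        D = np.linalg.solve(Z @ Z.T + ridge * eye, Z @ Y.T).T
        objective = (np.linalg.norm(Y - D @ Z) ** 2
                     + ridge * (np.linalg.norm(D) ** 2 + np.linalg.norm(Z) ** 2))
        history.append(objective)
    return D, Z, history

rng = np.random.default_rng(1)
Y = rng.standard_normal((12, 30))
D, Z, history = alternating_least_squares(Y, n_atoms=5)
```

In the constrained model, the same loop structure applies, but the exact least-squares steps are replaced by the constrained subproblems described in the steps that follow.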
(1) Update $D_s$. With $D_u$, $Z_s$, $Z_{a,u}$ and $Z_{b,u}$ fixed, we solved the minimization problem (2) in $D_s$. To facilitate optimization, a relaxation variable for $D_s$ was introduced, and the optimization problem in (2) was rewritten in terms of both variables. The two variables were then updated in turn, each with the other fixed: the first subproblem can be optimized by the Lagrange dual [41], and the second can be solved by the method in [42].
(2) Update $D_u$. With $D_s$, $Z_s$, $Z_{a,u}$ and $Z_{b,u}$ fixed, we solved the minimization problem (6) in $D_u$. For easier calculation, a relaxation variable for $D_u$ was introduced, and the optimization problem in (6) was rewritten as (7). The two variables were again updated in turn, each with the other fixed: the first subproblem can be optimized in the same way as (4), and the second, problem (9), can be solved via the gradient support projection algorithm (GSPA) [43].
(3) Update $Z_s$. With $D_s$, $D_u$, $Z_{a,u}$ and $Z_{b,u}$ fixed, we solved the minimization problem (10) in $Z_s$. Since the $\ell_{2,0}$ norm can be regarded as a special $\ell_0$ norm, this problem can be solved in a way similar to (9).
(4) Update $Z_{a,u}$. With $D_s$, $D_u$, $Z_s$ and $Z_{b,u}$ fixed, we solved the minimization problem (11) in $Z_{a,u}$. This is a smooth convex problem; hence, it has a closed-form solution.
(5) Update $Z_{b,u}$. With $D_s$, $D_u$, $Z_s$ and $Z_{a,u}$ fixed, we solved the minimization problem (13) in $Z_{b,u}$. This is also a smooth convex problem, so it likewise has a closed-form solution.
The complete learning procedure is summarized in Algorithm 1.

Algorithm 1 Dictionary Learning Algorithm
Input: sample matrices $Y_a$, $Y_b$; randomly initialized $D_s$, $D_u$, $Z_s$, $Z_{a,u}$ and $Z_{b,u}$.
1: while not converged do
2:   Fix the other variables and update $Z_s$, $Z_{a,u}$ and $Z_{b,u}$ by solving (10), (11) and (13), respectively.
3:   Update dictionary $D_s$ by solving (3).
4:   Update dictionary $D_u$ by solving (7).
5: end while
Output: dictionaries $D_u$ and $D_s$.
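The closed-form coefficient updates mentioned for $Z_{a,u}$ and $Z_{b,u}$ are typical of ridge-type least-squares problems. As a hedged sketch only, suppose the $Z_{a,u}$ subproblem has the form $\min_Z \|Y_a - D_s Z_s - D_u Z\|_F^2 + \lambda \|Z - Z_{b,u}\|_F^2$; this objective, the weight `lam` and the function name are assumptions for illustration, and the actual subproblems (11) and (13) may contain additional terms. Setting the gradient to zero gives $Z = (D_u^T D_u + \lambda I)^{-1}(D_u^T (Y_a - D_s Z_s) + \lambda Z_{b,u})$.

```python
import numpy as np

def update_specific_codes(Y, D_s, Z_s, D_u, Z_other, lam):
    """Closed-form minimizer of ||Y - D_s Z_s - D_u Z||_F^2 + lam ||Z - Z_other||_F^2.

    Assumed objective for illustration; not necessarily subproblem (11) or (13).
    """
    residual = Y - D_s @ Z_s                          # remove the shared component
    gram = D_u.T @ D_u + lam * np.eye(D_u.shape[1])   # positive definite for lam > 0
    return np.linalg.solve(gram, D_u.T @ residual + lam * Z_other)

# Illustrative sizes only.
rng = np.random.default_rng(0)
m, N, k_s, k_u = 10, 7, 3, 4
Y_a = rng.standard_normal((m, N))
D_s = rng.standard_normal((m, k_s))
Z_s = rng.standard_normal((k_s, N))
D_u = rng.standard_normal((m, k_u))
Z_b = rng.standard_normal((k_u, N))
Z_a = update_specific_codes(Y_a, D_s, Z_s, D_u, Z_b, lam=0.5)
```

Under the same assumption, the $Z_{b,u}$ update is obtained by swapping the roles of the two views.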

Re-Identification
Given the gallery feature vector matrices and the learned dictionaries D s , D u , we propose the following steps to re-identify a person:

1. The coding coefficients of the shared components of pedestrians are obtained first. Here, $\lambda_4$ is a scalar parameter; $Y_l = [Y_a, Y_b]$, $Z_{l,s} = [Z_{a,s}, Z_{b,s}]$ and $Z_{l,u} = [Z_{a,u}, Z_{b,u}]$; $Z_{l,s}$ denotes the dictionary coefficients corresponding to the part shared by the pedestrians under different cameras; $Z_{l,u}$ denotes the dictionary coefficients corresponding to the part unique to each person under different cameras.

2. The dictionary coefficients corresponding to the unique part of each person under different cameras are then obtained, where $\beta_1$ is a scalar parameter.

3. Let $z_{a,u,i}$ and $z_{b,u,j}$ be the dictionary coefficients corresponding to the individual special parts of the i-th pedestrian under camera a and the j-th pedestrian under camera b, respectively. We then computed the Euclidean distance between $z_{a,u,i}$ and $z_{b,u,j}$ for identity matching:
$$\mathrm{sim}(z_{a,u,i}, z_{b,u,j}) = \| z_{a,u,i} - z_{b,u,j} \|_2,$$
where $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, M$.
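The matching step in item 3 can be sketched as follows, assuming the person-specific codes are stored as columns of two matrices. The function names and the choice of camera a as the query side are assumptions made for illustration.

```python
import numpy as np

def pairwise_euclidean(Z_a, Z_b):
    """Distances between columns of Z_a (d x N) and columns of Z_b (d x M)."""
    diff = Z_a[:, :, None] - Z_b[:, None, :]      # shape (d, N, M)
    return np.sqrt((diff ** 2).sum(axis=0))       # shape (N, M)

def rank_candidates(Z_a, Z_b):
    """For each code under camera a, candidate indices under camera b, nearest first."""
    return np.argsort(pairwise_euclidean(Z_a, Z_b), axis=1)

# Synthetic check: camera-a codes are slightly perturbed copies of camera-b codes.
rng = np.random.default_rng(0)
Z_b = rng.standard_normal((16, 5))
Z_a = Z_b + 0.01 * rng.standard_normal((16, 5))
ranking = rank_candidates(Z_a, Z_b)               # ranking[i, 0] is the best match for i
```

A smaller distance means a more likely match, so the first column of `ranking` gives the rank-1 prediction for each query.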

Datasets
In this section, we evaluate the proposed method. All the code was written in MATLAB, and all the computations were performed on a Lenovo IdeaPad running Windows 10 with an Intel(R) Core(TM) i5-6200U CPU @ 2.30 GHz, 2.40 GHz and 4 GB of memory. We empirically validated the method proposed in this paper using five publicly available multi-shot re-ID datasets:
VIPeR [44]: The VIPeR dataset contains 632 pedestrians. Each pedestrian has one image from each of two cameras, i.e., two images per pedestrian.
PRID 2011 [45]: The PRID 2011 dataset contains images of 200 people. These images were taken with two non-overlapping cameras in an uncrowded outdoor environment with significant point-of-view and lighting variations.
QMUL-GRID [46]: The GRID dataset contains 250 pedestrian image pairs. Each pair consists of two images of the same pedestrian seen from different camera views, and all images come from eight non-overlapping cameras installed in a busy subway station. The dataset is challenging due to variations in pose, color and lighting, as well as poor image quality caused by low spatial resolution.
CUHK01 [47]: The CUHK01 dataset contains 971 pedestrians. Each pedestrian has two images from a pair of disjoint cameras. The image quality is relatively good.
CUHK03 [48]: CUHK03 is one of the largest person re-identification datasets, and it contains 1467 pedestrians. It provides two types of data, one obtained from manually labeled pedestrian bounding boxes and the other from automatically detected bounding boxes. The detected CUHK03 is more challenging than the labeled CUHK03 because of incorrectly detected bounding boxes.
In our experiments, we randomly divided VIPeR, PRID 2011, QMUL-GRID, CUHK01 and CUHK03 into two parts, one as the training data and the other as the testing data. More details are given in Table 1. For the experiments on each dataset, the above procedure was repeated 10 times. The average of the 10 measurements was taken as the final experimental result.

Table 1
                     VIPeR   PRID 2011   QMUL-GRID   CUHK01   CUHK03
IDs                  632     200         250         971      1467
TrainIDs             316     100         125         871      1367
TestIDs              316     100         125         100      100
Interfering images   0       549         775         0        0
Labeled images       1264    949         1275        1942     14,096
Detected images      0       0           0           0        14,096

Some pedestrian image pairs selected from the five datasets are presented in Figure 1.
Feature Selection: In our experiments, we used the GOG feature proposed by Matsukawa et al. [11] to represent the original images. The GOG feature describes local regions in an image through a hierarchical Gaussian distribution and shows strong robustness against changes in pedestrian body pose, illumination, background clutter and image quality.
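The evaluation protocol described above (a random split of identities, repeated 10 times and averaged) can be sketched as follows. The helper names and the `dummy_evaluate` placeholder are hypothetical; a real run would train the dictionaries on the training identities and return matching rates on the test identities.

```python
import numpy as np

def split_identities(n_ids, n_train, rng):
    """Randomly split identity indices 0..n_ids-1 into disjoint train and test sets."""
    perm = rng.permutation(n_ids)
    return np.sort(perm[:n_train]), np.sort(perm[n_train:])

def average_over_trials(evaluate, n_ids, n_train, n_trials=10, seed=0):
    """Repeat the random split n_trials times and average the returned scores."""
    rng = np.random.default_rng(seed)
    scores = [evaluate(*split_identities(n_ids, n_train, rng)) for _ in range(n_trials)]
    return np.mean(scores, axis=0)

def dummy_evaluate(train_ids, test_ids):
    """Placeholder scorer (hypothetical): returns the fraction of identities held out."""
    return np.array([len(test_ids) / (len(train_ids) + len(test_ids))])

# VIPeR-style split from Table 1: 632 identities, 316 for training.
train_ids, test_ids = split_identities(632, 316, np.random.default_rng(0))
avg = average_over_trials(dummy_evaluate, n_ids=632, n_train=316)
```

Splitting by identity rather than by image keeps every test pedestrian unseen during training.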

Experiment on VIPeR
To illustrate the effectiveness of our proposed model and method, we compared it with several state-of-the-art re-ID methods: KISSME (2012) [16], DGD (2016) [49], KCVDCA (2017) [50], JDL (2017) [28], JLML (2017) [48], MVLDML (2018) [51], MPML (2019) [52], DIMN (2019) [53], DLA (2020) [31] and VS-SSL (2020) [54]. The numerical results of these methods on the VIPeR dataset are reported in Table 2. From Table 2, the matching rates of our method at ranks 1, 5, 10 and 20 are 51.23%, 80.73%, 90.56% and 95.02%, respectively, and none of the other methods exceeds them. At rank 1 our method matches DIMN, while at rank 5 it is clearly more accurate than DIMN, which also shows the superior performance of our method. The results show that the recognition accuracy of our approach on the VIPeR dataset is better than that of the other existing methods considered.
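The rank-k matching rates reported in the tables are points on the cumulative matching characteristic (CMC) curve. The following is a minimal sketch of how such rates can be computed from a query-by-gallery distance matrix, assuming a single-shot setting in which each query identity appears once in the gallery. The function name and the toy numbers are illustrative and are not results from the paper.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids, max_rank=20):
    """Fraction of queries whose correct match appears within the top r, r = 1..max_rank."""
    order = np.argsort(dist, axis=1)                    # gallery indices, nearest first
    matches = gallery_ids[order] == query_ids[:, None]  # True where the identity matches
    has_hit = matches.any(axis=1)
    first_hit = matches.argmax(axis=1)                  # 0-based rank of the first match
    cmc = np.zeros(max_rank)
    for r in first_hit[has_hit]:
        if r < max_rank:
            cmc[r:] += 1                                # a hit at rank r counts for all ranks >= r
    return cmc / len(query_ids)

# Toy example with three queries and three gallery images.
dist = np.array([[0.1, 0.5, 0.9],    # query 0: correct match is nearest
                 [0.2, 0.6, 0.4],    # query 1: correct match is third nearest
                 [0.7, 0.3, 0.5]])   # query 2: correct match is second nearest
ids = np.array([0, 1, 2])
cmc = cmc_curve(dist, ids, ids, max_rank=3)   # -> [1/3, 2/3, 1.0]
```

The first entry of the curve is the rank-1 matching rate, and the curve is non-decreasing by construction.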

Experiment on QMUL-GRID
The second experiment was conducted on the QMUL-GRID dataset, and our method was compared with several state-of-the-art re-ID methods: MtMCML (2014) [60], LSSCDL (2016) [61], DR-KISS (2016) [11], Multi-HG (2017) [62], SCRWI (2017) [63], JLML (deep-learning) (2017) [48], CSPL+GOG (2018) [64], MPML (2019) [52], SRRTC (2019) [65], DIMN (2019) [53], DLA (2020) [31] and KISS+ (2021) [66]. The numerical results of these methods on QMUL-GRID are reported in Table 4. As shown in Table 4, the deep learning method JLML has the best recognition rate at rank 1 and rank 5, while the KISS+ method has the highest accuracy at rank 10 and rank 20. Our proposed method is second only to these two methods. This is partly because the GRID dataset is relatively small and we used only 125 identities to train our model, while JLML was first pre-trained on the large-scale ImageNet [67], Market-1501 [68] and CUHK03 datasets and then fine-tuned on 125 identities selected from QMUL-GRID. The KISS+ method used an orthogonal basis vector to generate virtual samples to deal with the small sample size problem. Our proposed approach performs better than most of the other traditional non-deep-learning methods, which demonstrates the strong applicability of the proposed method.
From Table 6, we can see that on the labeled CUHK03, our proposed method is second only to the deep learning method Deep-Person at rank 1, rank 5 and rank 10, and second only to the deep learning method JLML at rank 20, and it performs better than some existing traditional methods as well as some deep learning methods. The Deep-Person method applied long short-term memory (LSTM) in an end-to-end fashion to model the pedestrian as a head-to-toe sequence of body parts, and it exploited the complementary information between local and global features to better align with the whole person. From Table 7, we can see that on the detected CUHK03, our proposed method achieved the best recognition accuracy at all ranks, which is better than some existing traditional and deep learning methods, such as GLAD, Deep-Person and Gconv. This demonstrates the effectiveness and competitiveness of our method. From the above experiments on different datasets, it can be seen that our method is the best on some datasets and outperforms the existing traditional methods. Where it is not the best, it is second only to one or two deep learning methods. In conclusion, our proposed method performs well overall.

Conclusions and Discussion
In this paper, we proposed a new special and shared dictionary learning model with structural constraints, including sparse, low-rank and row-sparse constraints. We divided the dictionary into two parts, one to represent features shared by all pedestrians and the other to represent features unique to each individual. Only the unique part that represents the identity of the pedestrian was then considered in the recognition process, which reduces the ambiguity caused by unnecessary visual factors. In order to improve the matching accuracy and to better characterize the structure of the dictionaries, $\ell_0$-norm and low-rank constraints were added directly to the original dictionary learning formulation instead of their convex relaxed forms. We used the method of alternating directions to solve the optimization model, and when solving each subproblem, we also worked directly with the constrained problem. Finally, experiments on different datasets showed that our algorithm achieves high accuracy.
Since the objective function of dictionary learning is bilinear and the $\ell_0$-norm and rank constraints are highly nonconvex and computationally NP-hard, the optimality conditions of the model and the convergence of the algorithm were not established here. It is also difficult to explore the impact of feature extraction methods and compare the performance of metric learning methods at the same time, so our research focused on the comparison of metric learning methods. We think that evaluating the GOG features using view transformation model (VTM) based approaches is worth attempting. These issues will be addressed in our future work.