Person Re-Identification by Discriminative Local Features of Overlapping Stripes

Abstract: The human visual system can recognize a person based on his physical appearance, even if extreme spatio-temporal variations exist. However, the surveillance systems deployed so far fail to re-identify an individual who travels through the non-overlapping fields-of-view of the cameras. Person re-identification (Re-ID) is the task of associating individuals across disjoint camera views. In this paper, we propose a robust feature extraction model named Discriminative Local Features of Overlapping Stripes (DLFOS) that can associate corresponding actual individuals in a disjoint visual surveillance system. The proposed DLFOS model accumulates discriminative features from the local patches of each overlapping stripe of the pedestrian appearance. The concatenation of the histogram of oriented gradients, Gaussian of color, and the magnitude operator of CJLBP brings robustness to the final feature vector. The experimental results show that our proposed feature extraction model achieves a rank@1 matching rate of 47.18% on VIPeR, 64.4% on CAVIAR4REID, and 62.68% on Market1501, outperforming recently reported models from the literature and validating the advantage of the proposed model.


Introduction
Surveillance cameras are mounted in critical geographical locations to ensure public safety and security. The visual scenes captured by these cameras require constant monitoring. Most often, the camera network deployed within a campus contains blind spots due to the non-overlapping fields-of-view [1]. Object trackers [2] fail whenever the object travels through such non-overlapping areas of the surveillance coverage. To monitor the overall trajectory of a person, the tracker needs to employ a robust re-identification (Re-ID) system [3]. Person Re-ID is the task of associating individuals across non-overlapping camera scenes [4,5]. The appearance variability of the same individual, due to non-uniform environmental conditions and the deformable nature of the human body, degrades the performance of the Re-ID system [6]. The viewpoint, object scale, and illumination of the scene vary with the camera mounting position, depth, and light sources of the surveillance environment [6,7]. Moreover, occlusions and background clutter also affect the performance of the Re-ID system [7]. The performance of a Re-ID model depends on the content of the probe and gallery sets. The gallery can be either single-shot or multi-shot, with a single or multiple images of each individual, respectively [8]. Moreover, the gallery images can be acquired online or offline: in the offline case, the gallery includes pre-registered candidates, while in the online setting, gallery candidates are updated over time at various cameras across the network [9].
The probe and gallery images are described using features such as texture, color, shape, and motion. Representing an individual in a camera scene requires robust features, and many such representations [8][9][10][11][12][13][14][18][19][20][21][22] have been reviewed in the last decade. Existing Re-ID techniques depend mostly on the feature extraction and distance metric learning processes. The Re-ID system consists of a gallery and a probe set [11]. The gallery includes sample images of candidates whose identity is already known, while the probe set consists of test images that need to be identified based on the gallery information. The feature extraction process used in person Re-ID approaches is broadly categorized into two classes: hand-crafted and non-hand-crafted.
In the hand-crafted approach, the features to extract from the input image are manually chosen to obtain better accuracy. The drawback of this approach is that it is often fine-tuned toward a particular dataset: it works well for one dataset while its performance degrades on another [8]. The most commonly used features in this approach are color and texture. The spatial structure is divided into strips and grids, and the features extracted from each grid cell are concatenated to obtain a global description. Many hand-crafted feature descriptors have been developed, consisting of various color and texture combinations. In [20], RGB, HSV, and YCbCr have been fused to describe the appearance based on color information within the detected bounding box. Gabor and covariance-related information is used in [23] and [24] to describe the visual cues within the pedestrian appearance. In [25], Symmetry-driven Accumulation of Local Features (SDLF) is proposed, which determines the vertical symmetry axis before feature extraction. The symmetry axis is determined based on chrominance values, and the texture features near the vertical symmetry axis are used to represent the pedestrian appearance. SDLF has low complexity, but its rank-1 matching rate is low [26]. The Weighted Histogram of Overlapping Stripes (WHOS) [27] descriptor extracts a histogram of oriented gradients combined with color features of horizontally overlapping strips. The hierarchical Gaussian descriptor (GOG) reported in [28] employs localized color information to describe the human visual appearance. GOG has better rank-1 accuracy; however, it is highly sensitive to bounding-box alignment problems [29]. The Bio-inspired Features descriptor (gBiCov) [30] combines biologically inspired features with the covariance descriptor. gBiCov has a high facial Re-ID accuracy but a lower matching rate on full-body bounding boxes of pedestrians.
The Local Maximal Occurrence (LOMO) representation, together with a novel distance metric learning approach, Cross-view Quadratic Discriminant Analysis (XQDA), was developed in [21] for Re-ID in non-overlapping camera views. The Discriminative Accumulation of Local Features (DALF) [31] weights the local histograms representing the discriminative positions of the bounding box. A cross-view adaptation framework known as Camera coRrelation Aware Feature augmenTation (CRAFT) is introduced in [7] for person Re-ID. CRAFT adaptively performs feature augmentation by measuring the cross-camera correlation of the visual scenes. Deep learning models, in contrast, automatically decide through backpropagation which features to extract. Various deep neural models reported in the literature [6,32,33] can re-identify individuals in the presence of extreme distortion. Pre-trained models such as AlexNet, CaffeNet, GoogLeNet, VGG networks, ResNet, and SVDNet have also been used as feature extractors for person Re-ID; these models can easily be retrained on the data available in the surveillance setting.
Many researchers have employed distance metric learning in their Re-ID models to improve the rank-1 matching accuracy. In addition to feature extraction, distance metric learning can bring further improvement in the rank-1 performance of the Re-ID system. An overview of distance metric learning approaches is reported in [34]. Instead of using the Euclidean or Bhattacharyya distance, a supervised metric learning approach based on the Mahalanobis distance [34] is used, which keeps features of the same class close while pushing those of different classes apart. Fisher discriminant analysis (FDA) and its local variants LFDA and KLFDA [35] have been developed in [18] to learn and reduce the raw features. The discriminative null space learning (NFST) [13], marginal Fisher analysis (MFA) [36], and kernel-based MFA (KMFA) [35] all employ Fisher optimization criteria to reduce the intra-class and increase the inter-class feature variations. The metric learning methods Keep It Simple and Straightforward MEtric learning (KISSME) [37], Probabilistic Relative Distance Comparison (PRDC) [20], and Pairwise Constrained Component Analysis (PCCA) [38] all learn a Mahalanobis distance under pairwise constraints.
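To make the Mahalanobis-based family concrete, the following is a minimal sketch of a KISSME-style metric: the matrix M is estimated as the difference of the inverse covariances of similar-pair and dissimilar-pair feature differences. The function names and the toy data are illustrative, not from the paper, and a practical implementation would also project M onto the positive semi-definite cone.

```python
import numpy as np

def kissme_metric(X, y):
    """KISSME-style Mahalanobis matrix M = inv(Cov_S) - inv(Cov_D),
    estimated from the outer products of similar (same label) and
    dissimilar (different label) feature-difference pairs."""
    diffs_s, diffs_d = [], []
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            d = X[i] - X[j]
            (diffs_s if y[i] == y[j] else diffs_d).append(np.outer(d, d))
    cov_s = np.mean(diffs_s, axis=0)
    cov_d = np.mean(diffs_d, axis=0)
    eps = 1e-6 * np.eye(X.shape[1])   # regularize for invertibility
    return np.linalg.inv(cov_s + eps) - np.linalg.inv(cov_d + eps)

def mahalanobis(x, z, M):
    """Squared Mahalanobis distance (x - z)^T M (x - z)."""
    d = x - z
    return float(d @ M @ d)

# Toy usage: 10 identities, 4 feature vectors each, 5-D features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.repeat(np.arange(10), 4)
M = kissme_metric(X, y)
```

KISSME avoids iterative optimization entirely, which is why it is often the cheapest of the listed methods to train.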

Proposed Re-ID Model
The Discriminative Local Features of Overlapping Stripes (DLFOS) model employs robust color and texture information to achieve better accuracy. A novel texture descriptor, OMTLBP_M, is fused with HoG and Gaussian of color features to describe a person's appearance with more robustness to variations. The details of the proposed DLFOS are as follows.

Features Extraction
The proposed Discriminative Local Features of Overlapping Stripes (DLFOS) model consists of various color and texture descriptors. The Histogram of Oriented Gradients (HoG) descriptor is concatenated with the Gaussian of color features and the OMTLBP_M in the proposed Re-ID model.
The input bounding box is scaled to 126 × 64 pixels, and horizontal strips of 36 × 64 pixels with 50% overlap are extracted, as shown in Figure 1. The HoG feature [15] vector is extracted for each horizontal strip, as shown in Figure 1a. Each strip is transformed into a grid, where each grid cell has a size of 4 × 4. Each cell of the grid is described by a 32-dimensional feature vector: the nine orientations overall are described with 27 variables, and one truncation variable and four texture variables complete the representation. Let f_1 be the HoG feature vector with 3780 dimensions. The color-naming (CN) feature [39], with an 11-word color vocabulary, is collected from each grid cell of the input segment; let f_2 be the feature vector representing the color naming. The Gaussian of Gaussian (GOG) [28] color feature is extracted for each 7 × 7 patch of every horizontal strip. The RGB, HSV, and YCbCr values of the pixels are used to fit the Gaussian distribution of pixel features, as shown in Equation (2): each patch p of n_p pixels is summarized by a Gaussian G(g; µ_p, Λ_p), where µ_p denotes the mean of the pixel feature g extracted from the patch, Λ_p is its covariance matrix, and |Λ_p| is the determinant of that covariance. The Gaussian collected from each patch is flattened and vectorized according to the geometry of Gaussians; the local Gaussians of each region are then summarized into a region Gaussian, those region Gaussians are flattened in the same manner, and finally the feature vectors belonging to each region are concatenated into a single vector.
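The stripe geometry above can be sketched directly: a 126 × 64 box cut into 36-pixel-high strips with 50% overlap implies an 18-pixel stride, which yields six strips. The function name is illustrative; the HoG, CN, and GOG descriptors would then be computed per strip.

```python
import numpy as np

def overlapping_stripes(img, stripe_h=36, overlap=0.5):
    """Split an (H, W) image into horizontal stripes of height stripe_h
    with the given fractional overlap (50% in the paper)."""
    step = int(stripe_h * (1 - overlap))      # 18-pixel stride for 50% overlap
    return [img[r:r + stripe_h]
            for r in range(0, img.shape[0] - stripe_h + 1, step)]

img = np.zeros((126, 64))                     # bounding box scaled to 126 x 64
stripes = overlapping_stripes(img)
print(len(stripes), stripes[0].shape)         # 6 stripes of 36 x 64
```

The overlap means every image row (except the top and bottom 18) contributes to two stripes, which softens the effect of vertical misalignment between bounding boxes.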
The OMTLBP_M^{riu2}_{P,R} operator [40] is also employed to bring rotation invariance to the proposed model. A sampling value S = 8 along with a set of radii R = {1, 2, 3} is employed to extract OMTLBP_M^{riu2}_{P,R}, as presented in Equation (5). The OMTLBP_M operator computed from each 7 × 7 segment of the corresponding strip yields a 200-dimensional description of each input strip. In Equation (5), ω(x, y) = 0 for x < y and 1 otherwise; m^k_{r,c} is the kth magnitude component at each (r, c) coordinate of the image segment, and µ_m is its mean value; S and R are the sampling value and radius set defined above. The proposed feature extraction model is evaluated with several distance metric learning methods; the best results have been achieved using Cross-view Quadratic Discriminant Analysis (XQDA) [21]. The multi-shot setting is evaluated using SRID [41].
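The magnitude-thresholding step ω(m_k, µ_m) can be illustrated with a simplified single-scale sketch (R = 1, S = 8, no multi-threshold or riu2 rotation-invariant mapping, which the full OMTLBP_M operator adds on top). The function name is hypothetical.

```python
import numpy as np

def magnitude_lbp_code(patch, mu_m):
    """Magnitude-LBP code of the centre pixel of a 3x3 patch (R=1, S=8):
    threshold the absolute centre/neighbour differences m_k against the
    global mean magnitude mu_m, then pack the 8 bits into one code."""
    c = patch[1, 1]
    # 8 neighbours, clockwise from the top-left corner
    nbrs = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
            patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if abs(n - c) >= mu_m else 0 for n in nbrs]   # omega(m_k, mu_m)
    return sum(b << k for k, b in enumerate(bits))

p = np.zeros((3, 3))
p[1, 1] = 10.0                       # all neighbour magnitudes are 10
print(magnitude_lbp_code(p, 5.0))    # every bit set -> 255
```

Mapping each code to its rotation-invariant uniform (riu2) pattern, and repeating at radii 1, 2, and 3, would give the multi-scale, rotation-invariant descriptor the paper uses.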

Experimental Results
The proposed DLFOS model, in combination with XQDA and SRID, is evaluated on the VIPeR, CAVIAR4REID, and Market1501 databases, and the results are summarized in the form of CMC curves and tables as follows.

Dataset
There are several publicly available datasets for person Re-ID [42], developed by researchers working in the field of security and surveillance. These datasets can be categorized into single-shot and multi-shot, depending on the research scenario. Single-shot databases have a single probe image and a single gallery image per person, whereas multi-shot databases have multiple images in both the probe and the gallery sets. The probe and gallery images may differ in camera view angle, illumination, scale, pose, and resolution. The proposed model is evaluated on VIPeR, CAVIAR4REID, and Market1501, samples of which are shown in Figure 2.

VIPeR
The Viewpoint Invariant Pedestrian Recognition (VIPeR) dataset contains 632 pedestrian image pairs. These images were collected from two cameras differing in view angle, pose, and lighting conditions. The images are scaled to 128 × 48 pixels. Performance is assessed by matching each test image from Cam A against the Cam B gallery. Example images from the dataset are shown in Figure 2.
The proposed DLFOS model is evaluated on the VIPeR dataset with numerous combinations of metric learning methods, and the results are shown in Figure 3 and Table 1. Figure 3 presents the matching rate in percent for rank values between 1 and 30. The proposed feature extraction model, in combination with the XQDA metric learning method, provides a 47.18% matching rate at rank-1, outperforming PRDC [20], RPCCA [38], KPCCA [38], KISSME [37], KMFA [35], KLFDA [35], FDA [18], LFDA [18], MFA [36], and NFST [13]. The performance of the proposed DLFOS feature extraction model in comparison with other descriptors is summarized in Figure 4 and the accompanying table.

CAVIAR4REID

The Context-Aware Vision using Image-based Active Recognition (CAVIAR4REID) dataset contains images of pedestrians in a shopping center. It consists of 72 different people with 1221 images taken from two cameras with various views. Of these 72 people, 22 appear in only one camera and 50 in both. Each person has from two to five images, with sizes varying from 17 × 39 to 72 × 144. The proposed DLFOS feature extraction model, with numerous combinations of distance metric learning, is tested on the CAVIAR4REID dataset, and the results are presented in Figure 5 and Table 2. Figure 6 shows the matching rate in percent for rank values between 1 and 25. The proposed feature extraction model, in combination with the XQDA metric learning method, provides a 64.4% matching rate at rank-1. The DLFOS-XQDA achieves a 28.4, 26, 26, 25.6, 22, 18, 13.6, 11.2, 8.8, and 8.0 percentage-point higher rank-1 matching rate than MFA [36], KMFA [35], KPCCA [38], KLFDA [35], NFST [13], PCCA [38], KISSME [37], PRDC [20], FDA [18], and LFDA [18], respectively. The DLFOS-XQDA is also tested with the SRID ranking method in the multi-shot CAVIAR4REID setting.
The results in Table 3 show that the best combinations of descriptor and metric learning are gBiCov-KISSME, Color-Texture-XQDA, WHOS-XQDA, LOMO-KISSME, and GOG-KISSME. The proposed DLFOS-XQDA provides 34, 14, 9.2, 6, and 4.8 percentage points higher matching rates than gBiCov-KISSME, Color-Texture-XQDA, WHOS-XQDA, LOMO-KISSME, and GOG-KISSME, respectively.
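The rank-k matching rates reported throughout come from the standard Cumulative Matching Characteristic (CMC). As a minimal sketch (function name illustrative), given a probe-by-gallery distance matrix:

```python
import numpy as np

def cmc(dist, probe_ids, gallery_ids, max_rank=30):
    """Cumulative Matching Characteristic: cmc[k-1] is the fraction of
    probes whose true identity appears among the k nearest gallery
    entries of the probe-x-gallery distance matrix `dist`."""
    hits = np.zeros(max_rank)
    for i, pid in enumerate(probe_ids):
        order = np.argsort(dist[i])                        # nearest first
        rank = np.where(gallery_ids[order] == pid)[0][0]   # rank of true match
        if rank < max_rank:
            hits[rank:] += 1                               # counts all ranks >= rank
    return hits / len(probe_ids)

# Toy usage: two probes, each closest to its own identity in the gallery.
toy_dist = np.array([[0.1, 0.9],
                     [0.5, 0.2]])
curve = cmc(toy_dist, np.array([0, 1]), np.array([0, 1]), max_rank=2)
print(curve)   # both probes matched at rank 1
```

The rank-1 entry of this curve is the "rank@1 matching rate" quoted in the tables; the full curve is what the CMC figures plot.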

Market1501
The Market1501 database includes a diverse set of test subjects with multiple snapshots from six disjoint cameras. The dataset also contains 2793 false positives from the deformable part model detector, making its bounding boxes more challenging than those of CUHK03. Later, a set of 500K distractors was added to bring further diversity to the database. The authors employed mAP to evaluate Re-ID models on this database. The proposed DLFOS-XQDA combination, when evaluated on the Market1501 database, achieved a 62.68% matching rate at rank-1. Figure 7 presents the percentage matching rate over rank values between 1 and 25. Table 3 shows that the WHOS-NFST, GOG-NFST, Color-Texture-XQDA, and gBiCov-NFST combinations give high matching rates in comparison to other combinations. The DLFOS-XQDA provides 41.96%, 28.15%, 9.89%, and 6% higher rank-1 matching rates than gBiCov-NFST, Color-Texture-XQDA, GOG-NFST, and WHOS-NFST, respectively.
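Since Market1501 has multiple gallery images per identity, its mAP metric averages the per-probe average precision (AP) over all probes. A minimal AP sketch under that convention (function name illustrative):

```python
import numpy as np

def average_precision(dist_row, gallery_ids, pid):
    """AP for one probe: average of precision@k over the ranks k at which
    a gallery image of the true identity `pid` appears in the sorted list."""
    order = np.argsort(dist_row)                         # nearest first
    matches = (gallery_ids[order] == pid).astype(float)
    if matches.sum() == 0:
        return 0.0
    cum_hits = np.cumsum(matches)
    precision_at_k = cum_hits / (np.arange(len(matches)) + 1)
    return float((precision_at_k * matches).sum() / matches.sum())

# Toy usage: true matches at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision(np.array([0.1, 0.2, 0.3]),
                       np.array([1, 0, 1]), pid=1)
```

mAP is then the mean of these AP values over the probe set, which rewards retrieving all of an identity's gallery images early, not just the first one.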

Conclusions
This paper proposes a novel feature extraction model for person re-identification (Re-ID). In this method, we accumulate the distinctive features representing the salient regions of the overall appearance. We offer a fused feature extraction model consisting of HoG, Gaussian of color, and the discriminative novel texture descriptor CJLBP_M. Inspired by the WHOS framework, the detected bounding box is divided into horizontally overlapping strips before the feature extraction process. The selection of rich features in the proposed DLFOS model brings robustness to noise and invariance against variations in illumination, scale, and orientation of the person's appearance. To validate the performance of the proposed DLFOS, we conducted several experiments on three publicly available datasets. The outcome of the experiments indicates that the proposed DLFOS model, along with XQDA metric learning, offers a 5.98%, 4.8%, and 6% higher rank-1 matching rate compared to many recently reported works when tested on the publicly available VIPeR, CAVIAR4REID, and Market1501 databases, respectively. In the future, we would like to employ a feature reduction model to lower the computational cost by reducing the dimensionality of the extracted feature vector.