Person Re-Identification with RGB-D Camera in Top-View Configuration through Multiple Nearest Neighbor Classifiers and Neighborhood Component Features Selection

Person re-identification is an important topic in retail, scene monitoring, human-computer interaction, people counting, ambient assisted living and many other application fields. A dataset for person re-identification TVPR (Top View Person Re-Identification) based on a number of significant features derived from both depth and color images has been previously built. This dataset uses an RGB-D camera in a top-view configuration to extract anthropometric features for the recognition of people in view of the camera, reducing the problem of occlusions while being privacy preserving. In this paper, we introduce a machine learning method for person re-identification using the TVPR dataset. In particular, we propose the combination of multiple k-nearest neighbor classifiers based on different distance functions and feature subsets derived from depth and color images. Moreover, the neighborhood component feature selection is used to learn the depth features’ weighting vector by minimizing the leave-one-out regularized training error. The classification process is performed by selecting the first passage under the camera for training and using the others as the testing set. Experimental results show that the proposed methodology outperforms standard supervised classifiers widely used for the re-identification task. This improvement encourages the application of this approach in the retail context in order to improve retail analytics, customer service and shopping space management.


Introduction
Camera installations are widespread in several domains, from small business and large retail applications, to home surveillance applications, environment monitoring, facility access, sports venues and mass-transit. Identification cameras are widely employed in most public places like malls, office buildings, airports, stations and museums. In these applications, it is desirable to identify different instances or images of the same person, recorded at different moments, as belonging to the same subject. This kind of process, commonly known as "person re-identification" (re-id), has a wide range of applications and is of great commercial value.
Research in people behavior analysis has focused heavily on person re-id during the last decade, which has seen the exploitation of many paradigms and approaches of pattern recognition [1][2][3]. In challenging situations, algorithms need to be robust enough to deal with issues such as widely-varying camera viewpoints and orientations, rapid changes in the appearance of clothing, occlusions, varying poses and various lighting conditions [4,5].
The proposed approach comprises the data recording stage, the pre-processing/feature extraction stage and the classification stage. We have tested the approach using the TVPR dataset [23] with respect to other state-of-the-art classifiers in order to measure the reliability and the effectiveness of our approach. In particular, we propose an ensemble method, named Multiple K-Nearest Neighbor (MKNN), based on the combination of different k-Nearest Neighbor (K-NN) classifiers. The problem of combining different K-NN has been addressed in [25][26][27] respectively for different feature subsets and different distance functions. The main contributions of this work with respect to the existing literature are: (i) the adoption of different distance functions for each single K-NN based on the nature of the feature descriptors, (ii) the introduction of Neighborhood Component Feature Selection (NCFS) for the anthropometric features, (iii) the overall combination method and (iv) the application of this methodology on the TVPR dataset collected by the authors in a previous work [23]. The motivation for the usage of the specific method, i.e., MKNN, arose from the need to exploit the informative power of depth and RGB input, properly combining the different nature of each feature. Although we combined different existing classifiers in an ensemble strategy, the way these classifiers were chosen and combined represents the main advantage of the proposed classification stage.
The experimental results demonstrated the effectiveness of the proposed approach, encouraging its application in public contexts and in different real-world applications (e.g., safety and security in crowded environments, access control), where the top-view configuration allows reducing the problem of occlusions and privacy.
Each K-NN is trained with a different distance function and feature subset. The neighborhood component feature selection is applied to the depth features to find the optimal weights, while cosine distance and Spearman's rank correlation are applied to measure the similarity between two RGB feature points. Instead of the standard majority vote method, we propose a variation of the Bayesian approach for combining the decisions of the different K-NN classifiers. The performance evaluation supports the reliability and the effectiveness of the proposed approach. The MKNN methodology decreases the generalization error compared to the baseline K-NN method, outperforming supervised classifiers used for the re-id task (i.e., K-Nearest Neighbors (K-NN) [28], Decision Tree (DT) [29] and Random Forest (RF) [30,31]).
The paper is organized as follows: Section 2 provides a description of the approaches in the context of re-id (Section 2.1) and the characterization of the TVPR dataset (Section 2.2). Section 3 gives details on the proposed methodology for the feature extraction stage and the machine learning model implemented. Section 4 provides the experimental results and comparison with respect to baseline classifiers. The conclusions and future work in this direction are proposed in Section 5.

Background
This section presents an overview of the main approaches in the context of person re-id. In particular, Section 2.1 provides a review/summary of the literature on person re-id methods, and Section 2.2 gives details on the TVPR dataset for person re-id in a top-view configuration.

Previous Works on Person Re-Identification
Over the past few years, in the field of object recognition, the re-id problem has received considerable attention, and various reviews and surveys are available, pointing out different aspects of this topic [32,33]. Among the proposed approaches, four different classes could be defined, mainly depending on the camera setup and environmental conditions: biometric, geometric, appearance-based and learning approaches.
In the biometric approaches, the different person instances are matched together and are assigned to the same identity by the use of biometric features. The examples adopted in the real situation involve gait, faces, fingerprints, iris scans, and so on [34,35]. They are reliable and effective solutions, but these require a collaborative behavior of the people and suitable sensors. Thus, in the case of low resolution, poor views and a non-collaborative public, as in the case with common settings for surveillance cameras, these techniques are not often applicable.
The geometric approaches occur when more than one camera or sensor simultaneously collects information of the same area, and geometric relations among the fields of view (homographies, epipolar lines, and so on) can be adopted to match the data [18,36,37]. The geometric relations, when available, guarantee strong matches or, at least, a stiff candidate selection.
In the general case, only the appearance of the different items can be adopted [38,39]. In the appearance-based approaches, re-id can be correctly done only if the appearance is preserved among the views. It consists of exploiting dress colors and textures, perceived heights and other similar cues and can be considered a soft-biometric approach. Occlusions, illumination changes, different sensor qualities and different viewpoints are some of the challenging issues that make the appearance-based re-id difficult to implement. In [18], Gray et al. for the first time considered the problem of appearance models for person recognition, reacquisition and tracking. Until then, these problems had been evaluated independently, so they called for metrics that apply to complete systems [40,41]. A standard protocol to compare the results is proposed. This is done using the Cumulative Matching Curve (CMC) and introducing the VIPeR dataset for re-id. In [42], an algorithm was proposed that learns a domain-specific similarity function using an ensemble of local features and the AdaBoost classifier. Features are raw color channels in many color spaces and texture information captured by Schmid and Gabor filters [8]. Background clutter highly affects the descriptors of visual appearance for person recognition, and thus, the background modeling is used in many person re-id approaches [38,43,44].
The re-id has even been reinterpreted as a learning problem. In [45], the authors proposed a discriminative model based on the use of Partial Least Squares (PLS). In [46], a robust Mahalanobis metric for Large Margin Nearest Neighbor classification with Rejection (LMNN-R) was obtained with the use of a metric learning framework. Accordingly, in [47], the authors introduced a metric learning approach that learns a Mahalanobis distance from equivalence constraints derived from target labels. A comparison model aimed to maximize the probability of a pair of correctly matched images having a smaller distance than that of an incorrectly matched pair. The model was introduced as the Probabilistic Distance Comparison (PRDC) approach [48]. In [49], the same authors modeled person re-id as a transfer ranking problem, with the main goal of transferring similarity observations from a small gallery to a larger unlabeled probe set. Camera transfer approaches have also been introduced using images of the same person captured from different cameras to learn the associated metrics [50,51]. The Multiple Component Dissimilarity (MCD) framework was defined in [52] to turn a given appearance-based re-id method into a dissimilarity-based one. A supervised technique based on SVM is the approach presented in [53]. Pairs of similar and dissimilar images and a relaxed RankSVM algorithm [54] were used to rank probe images. The main issue with running RankSVM on large datasets is its very expensive computational load due to a large amount of inequality constraints. The authors in [29] used a decision tree to perform a fast matching between descriptors. In this case, the association of the query to one of the models is done by a voting approach. Dimensionality reduction was performed in [30] on image feature vectors through random projection. Afterwards, they built an ensemble of random forests, trained by feature vectors randomly projected onto different subspaces. 
Random forest was also employed in [31] to learn the similarity function of pairs of person images using color features.
The main differences with our work lie in:
• An RGB-D camera in a top-view configuration is employed, motivated by the enhancement of the applicability of the proposed approach in crowded public environments. The top-view configuration reduces the problem of occlusions and has the advantage of being privacy preserving because a person's face is not recorded by the camera [55]. However, this challenging configuration does not allow one to retrieve features related to the front view, which can be highly discriminative for the subject identification. Hence, the proposed approach, including the feature extraction and the classification stage, was designed according to this challenging setup.
• The ensemble classifier was built taking into account the different nature of each feature. The model ensures a higher interpretability with respect to other black box models, allowing one to localize which features contribute to the final prediction.
• The computation time of the training stage is reasonably fast and would be practically feasible for real-world application.

TVPR Dataset and Related Applications
TVPR (Top View Person Re-identification) dataset (http://vrai.dii.univpm.it/re-id-dataset) for person re-id [23] contains videos of 100 individuals recorded over several days from an RGB-D camera installed in a top-view configuration. The camera was installed on the ceiling of a laboratory at 4 m above the floor and covered an area of 14.66 m² (4.43 m × 3.31 m). The camera was positioned above the surface where the analyses took place (Figure 1). Registrations were made in an indoor scenario, where people passed under the camera installed on the ceiling. A big issue was environmental illumination. In the recording sessions, the illumination condition was not constant, but it varied as a function of the different hours of the day and also depended on natural illumination due to weather conditions. Snapshots of the video acquisitions, in our scenario, are depicted in Figure 2, where examples of person registration with artificial light are given.
Each person during a registration session walked with an average gait within the recording area in one direction and subsequently turned back and repeated over the same route in the opposite direction. This methodology is used for a better split of the TVPR in the training set (the first passage of the person under the camera) and the testing set (when the person passes a second time under the camera).
Although in the previous datasets presented in the literature, data were gathered using the RGB-D technology, they were not actually suitable for our purposes. The main motivating factors for our top-view dataset are due to some related applications that will be described below.
First, the top-view configuration provides reliable and occlusion-free counting of persons, which is crucial in many applications. Most of the previous works can only count moving people from a single camera, and they fail to count still people or to handle situations where occlusions are very frequent and where there is a crowd. Possible applications can be: safety and security in crowded environments, people flow analysis and access control, as well as counting [56][57][58]. The tracking accuracy of top-view cameras outperforms all other tracking methods in crowded environments, with accuracies up to 99%. When there are special security applications or the system is working in usually crowded scenarios, the proposed architecture with the top-view configuration is the only suitable one. Second, the scope of this specific configuration and analysis is also the detection of interactions between people and the environment, with many possible applications in the field of intelligent retail environments, such as shopper analytics, in addition to the field of Human Behavior Analysis (HBA) for Ambient Assisted Living (AAL) [59][60][61][62].
Third, another possible application of this specific top-view configuration is fall detection and HBA in smart homes, from high-reliability fall detection to occlusion-free HBA at home for elders in AAL environments [55,63].
All these applications have relevant outcomes from the current research, with the ability to identify users or shoppers while performing tracking, interaction analysis or HBA. Furthermore, all these scenarios can gather data using low-cost sensors and processing units, ensuring scalability and mass usage. Finally, the proposed architecture can be certified under an EU privacy-by-design approach.
Figure 3 shows the overview of the proposed approach, which comprises the data recording, feature extraction and classification stages.

Pre-Processing and Feature Extraction
The first step involves the processing of the data acquired from the RGB-D camera. The camera captures the depth and color images, both with dimensions of 640 × 480 pixels, at a rate up to approximately 30 fps. The scene/objects are illuminated with structured light based on infrared patterns. People were detected from the top-view configuration using the same algorithm employed in [64].
Seven out of the nine selected features are anthropometric features extracted from the depth image: distance between floor and head, $d_1$; distance between floor and shoulders, $d_2$; area of head surface, $d_3$; head circumference, $d_4$; shoulder circumference, $d_5$; shoulder breadth, $d_6$; thoracic anteroposterior depth, $d_7$. The remaining two color-based features are acquired from the color image. We also define the color descriptor TVH:
$$\mathrm{TVH} = [H^{h}, H^{o}] \qquad (1)$$
and the depth descriptor TVD:
$$\mathrm{TVD} = [d_1, \ldots, d_7] \qquad (2)$$
Finally, TVDH is the signature of a person defined as:
$$\mathrm{TVDH} = [d_1, \ldots, d_7, H^{h}, H^{o}] \qquad (3)$$
Color is an important visual attribute for both computer vision and human perception. It is one of the most widely-used visual features in image/video retrieval. To extract these two features, we used HSV histograms. Local histograms have been largely adopted and have proven to be very effective. The signature of a person is therefore also composed of two color histograms computed for head/hair and outerwear, $H^{h}$ and $H^{o}$, such as in [65], with n = 10 bin quantization for both the H channel and the S channel.
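As an illustration of how the three descriptors could be assembled, the following minimal sketch builds TVD, TVH and TVDH from seven depth features and two pixel regions. It rests on several assumptions that are not stated in the text: pixels are already converted to HSV with H and S scaled to [0, 1], the head/hair and outerwear regions are already segmented, the H and S histograms are computed separately and concatenated, and each histogram is normalized to sum to one. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

N_BINS = 10  # bin quantization stated in the text for the H and S channels


def channel_histogram(values: Sequence[float], n_bins: int = N_BINS) -> List[float]:
    """Normalized histogram of values assumed to lie in [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so that v == 1.0 falls into the last bin.
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


def hs_histogram(pixels: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Concatenate the H and S histograms of a region of (h, s, v) pixels."""
    hues = [p[0] for p in pixels]
    sats = [p[1] for p in pixels]
    return channel_histogram(hues) + channel_histogram(sats)


def build_descriptors(depth_feats, head_pixels, outerwear_pixels):
    """Return (TVD, TVH, TVDH) as flat lists (assumed layout)."""
    if len(depth_feats) != 7:
        raise ValueError("expected the seven anthropometric features d1..d7")
    tvd = list(depth_feats)
    tvh = hs_histogram(head_pixels) + hs_histogram(outerwear_pixels)
    return tvd, tvh, tvd + tvh
```

Under these assumptions TVH has 40 entries (two regions, two channels, ten bins each) and TVDH has 47.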

Classification Stage
The classification stage is depicted in Figure 3. We propose an ensemble classification approach, named Multiple K-Nearest Neighbor (MKNN), where the primary classification stage is represented by different K-NN classifiers according to the nature of the feature descriptors. The overall prediction is performed averaging the computed posterior probability of each K-NN classifier, in order to provide the optimal decision rule.

Predictive Model for TVD Descriptors
Since the TVD descriptors represent anthropometric features, we decided to adopt the 1-norm distance as the distance function of the K-NN model and the well-known Neighborhood Component Feature Selection (NCFS) approach [66] in order to learn the optimal feature weighting vector by minimizing the approximate regularized leave-one-out classification error. The application of NCFS decreases the sensitivity of K-NN to irrelevant features [25]. In order to perform feature selection and decrease overfitting, we further introduce the regularization parameter λ, which controls the magnitude of the weighting vector. The optimal value (i.e., λ = 5 × 10⁻⁴) was selected by a grid search that optimized the macro-F1 score on the validation set. For further explanation about NCFS, the reader can refer to [66,67].
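To make the quantity being minimized more concrete, the sketch below evaluates a regularized leave-one-out error for a given weighting vector. It follows a commonly used NCFS formulation (weighted 1-norm distance with squared weights, an exponential kernel of width sigma, and an L2 penalty on the weights); the kernel width, the averaging over samples and the function name are assumptions, and the exact objective used in [66] may differ in normalization. An optimizer such as gradient descent would then search for the weights that minimize this value.

```python
import math
from typing import Sequence


def ncfs_loss(X: Sequence[Sequence[float]], y: Sequence[int],
              w: Sequence[float], lam: float = 5e-4, sigma: float = 1.0) -> float:
    """Approximate regularized leave-one-out error for feature weights w."""
    n = len(X)
    error = 0.0
    for i in range(n):
        same_class = 0.0  # kernel mass on reference points sharing the label of i
        all_points = 0.0  # kernel mass on all reference points except i itself
        for j in range(n):
            if j == i:
                continue  # leave-one-out: a sample cannot pick itself
            dist = sum(wl * wl * abs(a - b) for wl, a, b in zip(w, X[i], X[j]))
            k = math.exp(-dist / sigma)
            all_points += k
            if y[j] == y[i]:
                same_class += k
        p_correct = same_class / all_points if all_points > 0.0 else 0.0
        error += 1.0 - p_correct
    # Mean leave-one-out error plus the penalty controlled by lambda.
    return error / n + lam * sum(wl * wl for wl in w)
```

With all weights at zero every reference point is equally likely, so the error reduces to the chance level; putting weight on a discriminative feature lowers it, while the penalty discourages large weights on features that do not help.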

Predictive Model for TVH Descriptors
The cosine and the correlation metric are widely used in the literature to measure the similarity among different HSV descriptors [68,69]. Then, we implement two K-NN models with cosine and Spearman rank correlation, respectively, as the distance function.
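The cosine distance between two HSV histogram features is defined as:
$$d_{\cos} = 1 - \frac{\mathrm{TVH}_{test} \cdot \mathrm{TVH}_{train}}{\lVert \mathrm{TVH}_{test} \rVert_2 \, \lVert \mathrm{TVH}_{train} \rVert_2} \qquad (4)$$
while the Spearman rank correlation-based distance is defined as:
$$d_{S} = 1 - \frac{\sum_{k} \left(rg_{\mathrm{TVH}_{test},k} - \overline{rg}_{\mathrm{TVH}_{test}}\right)\left(rg_{\mathrm{TVH}_{train},k} - \overline{rg}_{\mathrm{TVH}_{train}}\right)}{\sqrt{\sum_{k} \left(rg_{\mathrm{TVH}_{test},k} - \overline{rg}_{\mathrm{TVH}_{test}}\right)^{2}} \, \sqrt{\sum_{k} \left(rg_{\mathrm{TVH}_{train},k} - \overline{rg}_{\mathrm{TVH}_{train}}\right)^{2}}} \qquad (5)$$
where $\mathrm{TVH}_{test}$ and $\mathrm{TVH}_{train}$ are converted to ranks $rg_{\mathrm{TVH}_{test}}$ and $rg_{\mathrm{TVH}_{train}}$, while $\overline{rg}_{\mathrm{TVH}}$ is the sample mean of the corresponding ranks.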
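The two distances can be sketched in a few lines of standard-library Python. This is an illustrative implementation of the usual definitions (one minus the cosine similarity, and one minus the Pearson correlation of the ranks), not the authors' code; the tie handling by average ranks and the function names are assumptions.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus the cosine similarity of two histograms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def average_ranks(x: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and x[order[end + 1]] == x[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus the Pearson correlation of the rank-transformed vectors."""
    ru, rv = average_ranks(u), average_ranks(v)
    mean_u, mean_v = sum(ru) / len(ru), sum(rv) / len(rv)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(ru, rv))
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in ru))
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in rv))
    if std_u == 0.0 or std_v == 0.0:
        raise ValueError("Spearman distance is undefined for constant input")
    return 1.0 - cov / (std_u * std_v)
```

Both distances are zero for histograms with the same shape; the Spearman variant depends only on the ordering of the bins, which makes it less sensitive to monotonic changes in bin magnitude.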

Predictive Model for TVDH Descriptors
For the single K-NN model of the TVDH descriptors, we consider the 1-norm metric, to measure the distance between two different TVDH feature vectors.
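A single K-NN model with the 1-norm metric can be sketched as follows. The text does not specify how each K-NN produces class probabilities, so the estimate used here (the fraction of the k nearest neighbors that belong to each class) is an assumption, as are the function names; k = 5 matches the number of neighbors reported in the results.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Sequence


def l1_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1-norm (city block) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def knn_posterior(train_X: Sequence[Sequence[float]],
                  train_y: Sequence[Hashable],
                  query: Sequence[float],
                  k: int = 5,
                  distance: Callable = l1_distance) -> Dict[Hashable, float]:
    """Class frequencies among the k training samples closest to the query."""
    ranked = sorted(range(len(train_X)),
                    key=lambda i: distance(train_X[i], query))
    neighbours = ranked[:k]
    votes = Counter(train_y[i] for i in neighbours)
    return {label: count / len(neighbours) for label, count in votes.items()}
```

Passing a different `distance` callable gives the other single models, so the same routine can serve the TVD, TVH and TVDH classifiers.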

Combiner
We introduce the approach for combining the predictions of the single K-NN models. Assuming $\{y^{p}_{1}, y^{p}_{2}, y^{p}_{3}, y^{p}_{4}\}$ are the predictions of the four K-NN models (TVD, the two TVH models and TVDH) for the unseen sample $x_i$, if we use the majority vote to determine the final label $y^{p}_{i}$, the result will be:
$$y^{p}_{i} = \arg\max_{y} \sum_{l=1}^{4} \delta(y, y^{p}_{l}) \qquad (6)$$
where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise. The Majority Vote (MV) approach does not take into account the posterior probability and does not always provide the best prediction results. The standard Bayesian approach [70] finds the most probable hypothesis $y \in \{1, \ldots, 100\}$ given the observed data $\{y^{p}_{1}, y^{p}_{2}, y^{p}_{3}, y^{p}_{4}\}$:
$$\arg\max_{y} P(y \mid \{y^{p}_{1}, y^{p}_{2}, y^{p}_{3}, y^{p}_{4}\}) \qquad (7)$$
According to Bayes' theorem, the maximally probable hypothesis becomes:
$$\arg\max_{y} P(\{y^{p}_{1}, y^{p}_{2}, y^{p}_{3}, y^{p}_{4}\} \mid y)\,P(y) \qquad (8)$$
The Bayesian approach selects the model with the highest posterior probability and then proceeds as if the selected model had generated the data.
Differently from the Bayesian approach, we compute the average of the posterior probability (i.e., $P(\bar{y})$) of the 4 hypotheses as follows:
$$P(\bar{y}) = \frac{1}{4} \sum_{l=1}^{4} P(y^{p}_{l} \mid y)\,P(y) \qquad (9)$$
and the final prediction is:
$$y^{p}_{i} = \arg\max_{y} P(\bar{y}) \qquad (10)$$
Our ensemble methodology is based on Bayesian Model Averaging (BMA), which is an application of Bayesian inference to the problem of combining the predictions of different classifiers. Although this choice can lead to overfitting in some situations [71], it provides straightforward model choice criteria and less risky predictions [72][73][74]. In contrast, committing to a single selected model ignores the uncertainty in model selection, leading to over-confident inferences and decisions [73].
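The difference between majority voting and the averaging in Equation (9) can be illustrated with a short sketch. It assumes that each classifier exposes a per-class score that stands in for P(y_l^p | y)P(y), for example neighbor frequencies under a uniform prior; this mapping and the function names are assumptions rather than the authors' implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def majority_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def average_posteriors(posteriors: Sequence[Dict[Hashable, float]]) -> Dict[Hashable, float]:
    """Average the per-class scores; a missing class counts as zero."""
    avg = defaultdict(float)
    for post in posteriors:
        for label, p in post.items():
            avg[label] += p / len(posteriors)
    return dict(avg)


def combine(posteriors: Sequence[Dict[Hashable, float]]) -> Hashable:
    """Final label: the class with the highest averaged score."""
    avg = average_posteriors(posteriors)
    return max(avg, key=avg.get)


# Three classifiers weakly prefer "a", one is certain about "b".
posts = [{"a": 0.55, "b": 0.45}] * 3 + [{"b": 1.0}]
hard_labels = [max(p, key=p.get) for p in posts]
print(majority_vote(hard_labels), combine(posts))
```

In this toy case the majority vote returns "a" because it discards how confident each classifier is, while the averaged scores favor "b".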

Results
The baseline results are reported in Section 4.1 in terms of the Cumulative Match Curve (CMC). In Sections 4.2 and 4.3, we show the results of the proposed MKNN approach for re-id classification. We compare the performance of the proposed methodology with respect to single K-NN classifiers and other supervised machine learning algorithms widely used in the re-id literature. We have also compared the computation time of the training stage.

Baseline Results
The baseline performance of the TVPR dataset was evaluated in terms of recognition rate, using the CMC curves, as previously described in [23]. Figure 5 depicts a comparison among the TVH, TVD and TVDH predictors in terms of CMC curves, to compare the ranks returned by using these different descriptors, where the horizontal axis is the rank of the matching score and the vertical axis is the probability of correct identification.
In particular, Figure 5a,b represents respectively the CMC obtained using the TVH and TVD descriptors for three different distances: one-norm (L1 city block), two-norm (Euclidean) and cosine. Figure 5c provides the CMC computed using both TVH and TVD descriptors (i.e., TVDH), while Figure 5d is the averaged CMC over the three considered distances for the color (i.e., average of CMC curves in Figure 5a), depth (i.e., average of CMC curves in Figure 5b) and depth + color (i.e., average of CMC curves in Figure 5c). Although it can be assumed that the best performance was achieved when using the combination of descriptors (TVDH), the contribution of the depth was small, and the CMC curves in Figure 5a,c are very similar. However, the depth information can be informative for the re-id task (see Figure 5b). These baseline results suggest the need for a methodology to combine the different nature of descriptors, exploiting the importance and potential of the depth information. In this context, our approach aimed to exploit the informative power of depth and RGB input, properly combining the different nature of each feature.
Figure 5. (a,b) show respectively the CMC obtained using the TVH and TVD descriptors for three different distances: one-norm (L1 city block, cyan), two-norm (Euclidean, purple) and cosine (green). (c) provides the CMC computed using both the TVH and TVD descriptors (i.e., TVDH), while (d) is the averaged CMC over the three considered distances for the color (i.e., average of CMC curves in (a), purple), depth (i.e., average of CMC curves in (b), orange) and depth + color (i.e., average of CMC curves in (c), green).
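For readers unfamiliar with the CMC, the curve can be computed from the ranked candidate list returned for each probe: the value at rank r is the fraction of probes whose true identity appears among the first r candidates. The sketch below is a generic illustration with a hypothetical function name and toy data, not the evaluation code used for Figure 5.

```python
from typing import Hashable, List, Sequence


def cmc_curve(ranked_ids_per_probe: Sequence[Sequence[Hashable]],
              true_ids: Sequence[Hashable],
              max_rank: int) -> List[float]:
    """Probability of correct identification within each rank 1..max_rank."""
    hits = [0] * max_rank
    for ranked, truth in zip(ranked_ids_per_probe, true_ids):
        top = list(ranked[:max_rank])
        if truth in top:
            first = top.index(truth)
            # A match at position `first` counts for that rank and all later ones.
            for r in range(first, max_rank):
                hits[r] += 1
    return [h / len(true_ids) for h in hits]


# Toy example: the true identities are found at ranks 1, 2 and 3.
ranked = [["a", "b", "c"], ["c", "b", "a"], ["b", "c", "a"]]
truth = ["a", "b", "a"]
curve = cmc_curve(ranked, truth, max_rank=3)
```

The curve is non-decreasing by construction, and its first value is the rank-1 recognition rate.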

Results of the Proposed Approach
We considered the first passage under the camera as the training set and the return to the initial position as the testing set. The dataset was composed of 21,685 instances divided into 11,683 for training and 10,002 for testing. The performance of the proposed MKNN method is reported in Table 1 in terms of macro-F1 score, precision and recall. We also report the results of the single K-NN classifier for each descriptor (i.e., TVH, TVD, TVDH) and each different distance (i.e., cosine, Spearman's rank correlation and one-norm). We have highlighted in bold the single K-NN used for designing the proposed MKNN method. The optimal number of neighbors is five, and it has been chosen since it maximizes the macro-F1 score in the validation set. Additionally, we have reported the results of different combiner approaches (i.e., MV, Bayesian and BMA). The proposed BMA-MKNN approach performed favorably over the other methods.
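The macro-averaged scores give every subject the same weight regardless of how many frames that subject contributes. A minimal sketch of the standard definitions follows (per-class precision, recall and F1, then an unweighted mean over classes); the function name and the convention of scoring undefined ratios as zero are assumptions.

```python
from typing import Hashable, Sequence, Tuple


def macro_scores(y_true: Sequence[Hashable],
                 y_pred: Sequence[Hashable]) -> Tuple[float, float, float]:
    """Return macro-averaged (precision, recall, F1) over the true classes."""
    labels = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with three subjects and two frames each.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
precision, recall, f1 = macro_scores(y_true, y_pred)
```

Because the average is unweighted, a subject that is consistently misclassified lowers the score as much as any other subject, which suits a dataset with 100 identities.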
According to the nature of the descriptors, the cosine distance was the most consistent measure in order to achieve the best performance for the TVH input, while the K-NN with one-norm achieved the best performance considering the TVDH input. The proposed MKNN methodology outperformed all single K-NN classifiers. In particular, the MKNN improved the performance of TVD-KNN, TVH-KNN and TVDH-KNN by 84.44%, 12% and 2.5%, respectively. Figure 6 shows the CMC curve of the MKNN compared with the CMC curves of the single weak learners fed with TVH, TVD and TVDH. The ranking returned by MKNN showed better performance than the single classifiers. This result outlines the advantage of the proposed approach in exploiting the discriminative power of the depth information for the re-id task. In addition, the introduced BMA approach performed favorably over the MV and Bayesian methods.
We summarize in Figure 8 the macro-F1 score for the MKNN and the TVDH-KNN for each class (subject). The macro-F1 score is the same for 32 out of 100 subjects, while the MKNN achieves higher performance than TVDH-KNN in 42 out of 100 subjects. This result suggests how the MKNN (BMA) recognizes 10% of subjects with a higher recognition rate with respect to TVDH-KNN.
The implemented NCFS for the TVD descriptors decreased the generalization error of the standard K-NN classifier while increasing the sparsity, as well as the interpretability, of the model. The corresponding increase of K-NN performance in terms of precision, recall and macro-F1 score can be seen in Table 1. The optimal weighting vector found by the NCFS algorithm is shown in Figure 9. The feature with the highest predictive power is the thoracic anteroposterior depth ($d_7$), while the least relevant TVD descriptors are the distance between floor and shoulders ($d_2$), the area of the head surface ($d_3$) and the shoulder circumference ($d_5$).
Figure 9. The optimal feature weights for TVD descriptors found by the NCFS algorithm.
Table 2 shows the comparison between our approach and standard supervised learning algorithms widely adopted in the re-id scenario such as DT [29], bagged tree, RF [30,31], adaptive boosting (AdaBoost), linear programming boosting (LPBoost) and totally corrective boosting (TotalBoost). The considered inputs for the DT, bagged tree, RF, AdaBoost, LPBoost and TotalBoost classifiers are the TVDH descriptors. The MKNN outperformed all standard methods, achieving an improvement of 76.60%, 3.75%, 18.57%, 43.10%, 69.39% and 36.07% with respect to DT, bagged tree, RF, AdaBoost, LPBoost and TotalBoost, respectively. The K-NN may perform better than DT and RF when the number of training samples is not huge compared to the number of classes. The advantage of our ensemble strategy lies in the way we have built and combined each classifier. In particular, each weak learner was built according to the different nature of the features in order to extract the discriminative information of each subject. Differently from our approach, the other boosting and bagging strategies combined different weak learners in an automatic fashion without taking into account the different descriptors (i.e., TVH and TVD). Table 3 shows the computation time expressed in seconds (s) for the training stage of all methodologies. MKNN (BMA) was reasonably fast and would be practically feasible for the re-id task.

Conclusions and Future Works
In this paper, we describe a method for person re-identification based on features derived from both depth (anthropometric features) and color. Different from other approaches, the experiments were conducted on the TVPR dataset, where the RGB-D images were collected in a top-view setting, reducing the problem of occlusions while preserving privacy [55].
Person recognition is handled by using the proposed ensemble method, named Multiple K-Nearest Neighbor (MKNN), based on the combination of different K-NN classifiers. Each K-NN is built with a different distance function based on the nature of the feature descriptors, and the neighborhood component feature selection is introduced for the anthropometric features. The experimental results demonstrate how the proposed methodology outperforms standard supervised classifiers (i.e., K-NN, DT, bagged tree, RF and boosting methods). Moreover, the computation time analysis of the training stage suggests that the proposed MKNN method is reasonably fast, encouraging the application of the proposed approach for the person re-identification task in the retail scenario. This improvement may be explained by the fact that our approach models and combines the nature and information of the different descriptors (i.e., TVH and TVD), weighting the importance of the anthropometric features. Further investigation will be devoted to improving our approach by extracting other informative features and setting up the proposed approach for the real-time processing of video images in the retail scenario. In the field of retail applications, the long-term goal of this work is to merge the developed re-identification system with an audio framework and the use of other types of RGB-D cameras, such as Time Of Flight (TOF) ones. The system can additionally be integrated as a source of high semantic level information in a networked ambient intelligence scenario, to provide cues for different problems, such as detecting abnormal speed and dimension outliers, alerting one to a possible uncontrolled circumstance. It would also be interesting to evaluate both color and depth images in such a way that the performance of the system does not decrease when the color image is affected by changes in pose and/or illumination.