Headgear Accessories Classification Using an Overhead Depth Sensor

In this paper, we address the generation of semantic labels describing the headgear accessories carried out by people in a scene under surveillance, only using depth information obtained from a Time-of-Flight (ToF) camera placed in an overhead position. We propose a new method for headgear accessories classification based on the design of a robust processing strategy that includes the estimation of a meaningful feature vector that provides the relevant information about the people’s head and shoulder areas. This paper includes a detailed description of the proposed algorithmic approach, and the results obtained in tests with persons with and without headgear accessories, and with different types of hats and caps. In order to evaluate the proposal, a wide experimental validation has been carried out on a fully labeled database (that has been made available to the scientific community), including a broad variety of people and headgear accessories. For the validation, three different levels of detail have been defined, considering a different number of classes: the first level only includes two classes (hat/cap, and no hat/cap), the second one considers three classes (hat, cap and no hat/cap), and the last one includes the full class set with the five classes (no hat/cap, cap, small size hat, medium size hat, and large size hat). The achieved performance is satisfactory in every case: the average classification rates for the first level reaches 95.25%, for the second one is 92.34%, and for the full class set equals 84.60%. In addition, the online stage processing time is 5.75 ms per frame in a standard PC, thus allowing for real-time operation.


Introduction
People detection and human behavior analysis in video-surveillance applications are already classic research topics in computer vision, artificial intelligence, and machine learning areas but are still an open issue, with some commercial solutions performing some basic tasks with reasonable accuracy but still not available for other higher-level understanding tasks. One of the main reasons for this is the context diversity of the in-the-wild real video-surveillance images. Nevertheless, technological evolution as well as machine learning algorithmic contributions facilitate and speed up more and more reliable and robust context aware solutions, making them in order to contribute to solving this part of the problem.
In addition, identity preservation is the other open issue that should be considered in order to obtain the commercial solutions demanded by the everyday most connected and technologically informed world. Time-of-Flight (ToF) cameras, that obtain depth images from the surveilled scenarios, give high-resolution images suitable for the image analysis tasks needed by the surveillance application pursuit, while at the same time preserving the privacy of surveilled people.
Within this context, this work presents a proposal for only using depth information that is acquired using an overhead ToF camera [1,2], to be applied to the classification of different headgear accessories (Throughout the paper, we will use "headgear accessories", "head accessories" or simply "accessories" to refer to different types of hats or caps). The camera placement was selected in zenithal position, as it greatly avoids occlusions (thus facilitating the task of people detection and segmentation), the camera installation is easy, and the image acquisition position contributes to preserving the privacy of the persons in the scene.
The selected accessories are limited by the camera position, which only provides a top-view. Because of this, the proposal allows for detecting and classifying different headgear accessories such as hats, caps, etc. when worn by people in the overhead scene under surveillance. It is worth highlighting the fact that this information is enough for obtaining semantic features of the upper-body of people in the scene, which is useful in the surveillance applications focused on by the scene understanding commercial solution pursuits.
In fact, the interest in classifying people under surveillance by their headgear accessories (caps, hats, etc.), and preserving their identity, is justified as it provides additional semantic information that can be used by higher level processing systems (such as tracking, re-identification, scene understanding, etc.), where a combination of different semantic features is normally used.
Due to this, the classification of human accessories in semantic video analysis tasks has attracted a great interest in recent years within the research community [3], since it allows for semantically describing people for different applications, such as security and biometrics [4,5], surveillance [6], human computer interaction [7], or oriented marketing [8]. Recently, there has been an increasing number of approaches proposed in the scientific literature, which are able to extract human attributes such as gender, ethnicity, age, hairstyle or clothing, focused on in such applications.
The early approaches [8][9][10] in these areas used RGB (Red, Green, Blue) data for extracting such semantic features. For example, the authors of [9] proposed a method for recognizing attributes, such as gender, hair style and different types of clothes (hats, t-shirts, shorts, jeans, long hair, glasses, long sleeves, long pants), in people under variations in viewpoint and pose, based on attribute classifiers and a discriminative model. In [10], the authors proposed a novel hair style retrieval algorithm from face images. The proposal in [8] generates a list of semantic accessories for clothes worn by people in RGB images, using attribute classifiers and Conditional Random Fields. In that proposal, some variations in the viewpoint are allowed, but in most of the images, people are facing the camera. In our work, we contribute to this area by addressing the problem of attribute classification (headgear accessories) using depth data from an overhead ToF camera.
In recent years, most of the approaches have started using RGB-D data, which are less sensitive to indoor ambient conditions, providing depth information in addition to RGB. The proposal in [6] detects fine-grained human attributes (gender, wearing a backpack, talking on a cell phone, wearing glasses, short sleeves, long pants) in surveillance environments. The proposal in [11] allows detecting gender, as well as other attributes such as hair style or clothing [7] in RGB-D images, not only from frontal body views, but also from side and back views. However, none of the previous works use depth information only or top-view images, and thus do not take into account the privacy issues so sensible in surveillance applications. On the other hand, the work in [12] proposes an approach for human attribute analysis, such as gender and whether they have bags with them, based on multi-layer classification, and using top-view RGB camera images. In our case, we still want to avoid the use of RGB data for privacy consideration issues.
All of the cited works are focused on detecting and tracking or counting people, not allowing the obtaining of any semantic attribute from them. Specifically, the authors of [13] propose the use of a ToF depth sensor to allow people tracking, but it only works if people enter the scene well separated, in order to avoid occlusions. The work presented in [14] obtains better results for multiple people, using an approach based on a normalized Mexican Hat Wavelet, but it also fails if people are close to each other (as we showed in [21]). In the proposal in [15], after removing the noise and background, any group of connected depth measurements (blobs) whose height is over 90 cm is detected as a person, thus allowing the easy generation of false positives as the detector criteria is only based on height.
Other works, such as Refs. [16,19], use RGB images for people counting. These works only include a detector, but not a classification stage, so that they cannot discriminate between people and other objects in the scene. On the other hand, the proposals described in [17,18,20] include a classification stage that allows for reducing the number of false positives. More specifically, the works described in [17,20] are based on the human head and shoulders' physical structure for obtaining a person overhead descriptor. These approaches are able to discriminate between people and other elements in the scene, but their detection rates decrease if people are close to each other. Moreover, they suffer from a lack of robustness if people wear complements such as hats, caps, backpacks, etc.
The proposal we describe in this paper not only allows for detecting people from an overhead ToF camera, but is also able to handle scenes with people that are close to each other (being based on our previous proposal [21]), and to discriminate between people wearing different accessories that can have a variety of appearances, such as hats and caps.
The structure of the paper is as follows: Section 2 describes the accessories classification algorithmic proposal; Section 3 covers the experimental work, including the description of the experimental setup, the performance metrics used and the obtained results, both in terms of classification rates and computational cost; and Section 4 draws the main conclusions of the paper.

Accessories Classification Algorithm
As described in the introduction, in this work, we propose an approach for the classification of headgear accessories only using depth information provided by a ToF sensor located in an overhead position. The camera position, which only provides a top-view, allows for preserving the privacy of the users and restricts the type of accessories that can be detected and classified. Our proposal focuses on the design of a robust processing strategy, including the estimation of a meaningful feature vector containing the relevant information that could be extracted from an overhead ToF camera, and its application on the headgear accessories classification. Within this context, the proposed method is based on Principal Components Analysis (PCA), although any classification method could be applied. The classification procedure uses an eight-component feature vector that is related to the geometry of the surface information of the people head and shoulders, designed ad hoc for this application.
A general block diagram of the proposed method is shown in Figure 1. There are three main processes, two offline processes and an online one. In the offline Process 1, some coefficients that are used to normalize the feature vector are estimated. In the offline Process 2, the model estimation is carried out. In our case, five different models are generated: one for people without any accessories, another one for people wearing caps, as well as three more models for people wearing different types (sizes) of hats. In the online process, the recovery error of the feature vector for each model is calculated, and the model with the smallest error is assumed to correspond to the accessory class that the users actually wear.
In the next sections, we describe in detail each stage of the online and offline processes shown in Figure 1. It is worth mentioning that some of the stages, such as the preprocessing of the depth maps or the detection of the person ROIs (Regions of Interest), are common to both (online and offline) processes.

Preprocessing of Depth Maps
The depth maps are captured using a ToF camera fixed to the ceiling at height ℎ from the floor, the camera optical axis being perpendicular to the floor plane. Some of the problems exhibited in depth maps captured by ToF cameras are the great number of invalid pixels that appear in the object borders, the motion artifacts, and the high noise level [22][23][24]. In this work, the invalid pixels are corrected by means of the interpolation of the nearest valid neighboring pixels, using the algorithm proposed in [21]. Moreover, in order to reduce the noise in the depth information, a nine-element mean filter is applied. Next, the depth information is translated to the floor origin, obtaining information about the actual height of each pixel in the scene. The height matrix obtained after the preprocessing is denoted by H.

Detection of a Person ROI
In this work, the body elements of interest are the head and shoulders areas, so that it is necessary to determine which pixels belong to these body parts. To achieve this objective, a region of interest (ROI) is defined as the set of pixels that belong to a person while also having height values within a height range that corresponds to the areas of interest. To determine a person's ROI, the contour detection algorithm to be used has to take into account the sparsity of the depth data, and the conditions imposed by the overhead camera position [21,25]. After that, it is necessary to find the maximum height value (ℎ ) within that contour and, finally, select the pixels whose heights are between ℎ and the minimum height of interest ℎ . Based on anthropometric considerations [26,27], we assume that: ℎ = ℎ − 40 . In what follows, the set of selected pixels will be represented by (ℎ).

Feature Extration and Normalization
In [21], we used a feature vector that described the pixel density associated to the head and shoulders surface within the corresponding ROI. In that work, we used a feature vector composed of six components: five of them related to the visible people surfaces at different heights, and the sixth

Preprocessing of Depth Maps
The depth maps are captured using a ToF camera fixed to the ceiling at height h c from the floor, the camera optical axis being perpendicular to the floor plane. Some of the problems exhibited in depth maps captured by ToF cameras are the great number of invalid pixels that appear in the object borders, the motion artifacts, and the high noise level [22][23][24]. In this work, the invalid pixels are corrected by means of the interpolation of the nearest valid neighboring pixels, using the algorithm proposed in [21]. Moreover, in order to reduce the noise in the depth information, a nine-element mean filter is applied. Next, the depth information is translated to the floor origin, obtaining information about the actual height of each pixel in the scene. The height matrix obtained after the preprocessing is denoted by H.

Detection of a Person ROI
In this work, the body elements of interest are the head and shoulders areas, so that it is necessary to determine which pixels belong to these body parts. To achieve this objective, a region of interest (ROI) is defined as the set of pixels that belong to a person while also having height values within a height range that corresponds to the areas of interest. To determine a person's ROI, the contour detection algorithm to be used has to take into account the sparsity of the depth data, and the conditions imposed by the overhead camera position [21,25]. After that, it is necessary to find the maximum height value (h max ) within that contour and, finally, select the pixels whose heights are between h max and the minimum height of interest h min . Based on anthropometricconsiderations [26,27], we assume that: h min = h max − 40 cm. In what follows, the set of selected pixels will be represented by θ(h).

Feature Extration and Normalization
In [21], we used a feature vector that described the pixel density associated to the head and shoulders surface within the corresponding ROI. In that work, we used a feature vector composed of six components: five of them related to the visible people surfaces at different heights, and the sixth component corresponding to the eccentricity of the person head. These six components are valid to describe the upper part of a person and to detect if a particular object in the scene corresponds to a Sensors 2017, 17, 1845 5 of 15 person or not, but it is not precise enough to be used in the classification of headgear accessories (as they focus on only modeling the upper part of the head and the shoulders, so not being aimed at the intermediate region in which the headgear accessories key elements will be located).
To increase the system robustness in the accessories classification task, we propose a new feature vector that is composed of eight components, which describe the geometrical structure of the head and shoulders areas, and in a more detailed way as compared with [23]: five of them describe the head area geometry, and the other three describe the shoulders geometry. To show an example of the geometrical regions for which the feature vector is providing information, and to allow a proper understanding of how it works, we have generated Figure 2. In this Figure, a side-on portrait view of a person is shown, indicating the division in horizontal slides from which the calculations are being carried out, applying the following procedure: 1.
Calculate the histogram of pixels (depth measurements) along the heights of interest θ(h) (between h max and h min ), with a high enough number of intervals. In this work, 20 intervals were used, as proposed in [21], taking into account geometrical considerations ( Figure 2 shows this division in 20 slices). That is, we count the number of pixels in 20 slices of 2 cm of height, obtaining a vector ϑ = [ϑ 1 . . . ϑ s . . . ϑ 20 ], where ϑ s is the number of pixels in section s (being s = 1, 2, . . . , 20). The highest section of a given person corresponds to s = 1, as shown in Figure 2.
Sensors 2017, 17, 1845 5 of 15 component corresponding to the eccentricity of the person head. These six components are valid to describe the upper part of a person and to detect if a particular object in the scene corresponds to a person or not, but it is not precise enough to be used in the classification of headgear accessories (as they focus on only modeling the upper part of the head and the shoulders, so not being aimed at the intermediate region in which the headgear accessories key elements will be located). To increase the system robustness in the accessories classification task, we propose a new feature vector that is composed of eight components, which describe the geometrical structure of the head and shoulders areas, and in a more detailed way as compared with [23]: five of them describe the head area geometry, and the other three describe the shoulders geometry. To show an example of the geometrical regions for which the feature vector is providing information, and to allow a proper understanding of how it works, we have generated Figure 2. In this Figure, a side-on portrait view of a person is shown, indicating the division in horizontal slides from which the calculations are being carried out, applying the following procedure: 1. Calculate the histogram of pixels (depth measurements) along the heights of interest (ℎ) (between ℎ and ℎ ), with a high enough number of intervals. In this work, 20 intervals were used, as proposed in [21], taking into account geometrical considerations ( Figure 2 shows this division in 20 slices). That is, we count the number of pixels in 20 slices of 2 cm of height, obtaining a vector = [ 1 ⋯ ⋯ 20 ] , where is the number of pixels in section s (being s = 1, 2, …, 20). The highest section of a given person corresponds to s = 1, as shown in Figure 2.  . To do this, we first find the section of the head with the maximum number of pixels in the first three elements of : ℎ_ = max ( 1 , 2 , 3 ). Then, we assume that:

2.
Estimate the first top section of the head, s h_ f irst . To do this, we first find the section of the head with the maximum number of pixels in the first three elements of ϑ: s h_max = argmax(ϑ 1 , ϑ 2 , ϑ 3 ). Then, we assume that: I f s h_max = 1 : s h_ f irst = s h_max I f s h_max > 1 and NT·ϑ s h_ max −1 ≥ ϑ s h_max : s h_ f irst = s h_max − 1 I f s h_max > 1 and NT·ϑ s h_ max −1 < ϑ s h_max : where NT is the number of times that the numbers of pixel in section ϑ s h_max −1 must be higher than the number of pixels in section ϑ s h_max to assume that the top section is ϑ s h_ max −1 . This step is necessary to reduce the error in the person height estimation, due to the noise in the depth maps, and for the use of small accessories or specific hair configurations (such as tied in a tail). The value of NT has been obtained empirically, and it was determined that, with values of NT between 10 and 20, similar results were obtained. The estimated value of the person height without noise, h p , will be obtained as: 3.
Find the five components of the feature vector (ϕ) that describe the head structure. Each component is composed of two consecutive elements of ϑ organized as follows:

4.
Estimate the maximum section of the shoulders, s s_max . Taking into account the human morphology, we assume that s s_max can be found starting at 22 cm from the top of the head, i.e.:

5.
Find the three components of the feature vector (ϕ) that describe the shoulders' structure. Each component is composed of two consecutive elements of ϑ organized as follows: Each component ϕ i corresponds to the number of pixels of a section of 4 cm, so that it depends on the person's height, thus it is necessary to normalize them. 6.
Determine the normalized feature vector (Ψ) that will be obtained dividing the components of the feature vector (ϕ), by the estimated value of its first component ( ϕ 1 ): The next section describes how to estimate the value of the normalization parameter ϕ 1 .

Calculation of Normalization Coefficients
To make ϕ independent of a person's height (h p ), the relationship between h p and ϕ 1 must be determined. For a training sample composed of a set of people with heights between 140 cm and 213 cm, the values of h p and ϕ 1 were calculated (using the steps 1, 2 and 3 described in Section 2.3). From the analysis of the results, we observed a quadratic relationship between them, so that we assumed that: where a 0 , a 1 and a 2 are the coefficients that must be estimated. Using a nonlinear least squares approximation method, the obtained values were: a 0 = 1998.0, a 1 = −24.62 and a 2 = 0.092, which empirically showed to be adequate for the normalization task.

Attribute Model Estimation
In this work, we define five classes (C 1 , C 2 , . . . , C 5 ) in the classification of headgear accessories (see Figure 3). With the objective of simplifying the notation and following the same order in which they appeared previously, these classes will be referred to as α = 1, . . . , 5 (being α = 1 for C 1 . . . α = 5 for C 5 ).
We initially selected a PCA classification strategy for its simplicity and low computational requirements. PCA is a well-known method that has been widely used for representing high-dimensional data in a low-dimensional space [28,29], and also for classification [21,30,31]. However, any other classification method, such as a Multi-Layer Perceptrons (MLPs), Support Vector Machines (SVMs), etc., could have been used, provided that the feature vector contains meaningful information for the target task.
The PCA classifier requires an offline training process to estimate a model for each class α. Therefore, each of the models is composed of an average vector Ψ α , a scatter matrix S α , and a transformation matrix U α : where N α is the number of training vectors that represents the class α.
The matrices U α are formed by the eigenvectors associated with the largest (principal) eigenvalues of the scatter matrices S α . Following the criterion that the average normalized residual quadratic error (RMSE) is higher than 90%, three eigenvectors were finally considered in this work. We initially selected a PCA classification strategy for its simplicity and low computational requirements. PCA is a well-known method that has been widely used for representing highdimensional data in a low-dimensional space [28,29], and also for classification [21,30,31]. However, any other classification method, such as a Multi-Layer Perceptrons (MLPs), Support Vector Machines (SVMs), etc., could have been used, provided that the feature vector contains meaningful information for the target task.
The PCA classifier requires an offline training process to estimate a model for each class . Therefore, each of the models is composed of an average vector ̅ , a scatter matrix , and a transformation matrix : where is the number of training vectors that represents the class . The matrices are formed by the eigenvectors associated with the largest (principal) eigenvalues of the scatter matrices . Following the criterion that the average normalized residual quadratic error (RMSE) is higher than 90%, three eigenvectors were finally considered in this work.

Classification
The classification of each vector corresponding to a person in the scene is described in Algorithm 1, where the feature vector of a person is projected on the transformed space and recovered to the original space for each class. A person is classified as corresponding to the class when the Mahalanobis distance between and ̂ is the smallest one among all of the considered classes.

Classification
The classification of each vector Ψ corresponding to a person in the scene is described in Algorithm 1, where the feature vector Ψ of a person is projected on the transformed space and recovered to the original space for each class. A person is classified as corresponding to the class α when the Mahalanobis distance between Φ α and Φ α is the smallest one among all of the considered classes.

Experimental Setup
In order to provide data for training and evaluating the proposal, we have used part of the GOTPD2 database (available at [32]). GOTPD2 was recorded in an indoor environment using two Kinect v2 devices (Microsoft, Redmon, WA, USA) located at a height of 3.4 m, perpendicularly pointing towards the room floor. The recordings tried to cover a broad variety of conditions, with scenarios comprising: • Single people scenarios, as the target is the classification of head style/accessories configuration, with a total of 25 different persons involved.

•
People without accessories (no hat nor cap) and different hairstyle configurations (short, long and with/without ponytail).

•
People with 21 different headgear accessories, 17 hats (of different sizes) and four caps. Table 1 shows the details of the different accessories used, including information on their size (with the naming conventions indicated in Figure 4).

Experimental Setup
In order to provide data for training and evaluating the proposal, we have used part of the GOTPD2 database (available at [32]). GOTPD2 was recorded in an indoor environment using two Kinect v2 devices (Microsoft, Redmon, WA, USA) located at a height of 3.4 m, perpendicularly pointing towards the room floor. The recordings tried to cover a broad variety of conditions, with scenarios comprising:


Single people scenarios, as the target is the classification of head style/accessories configuration, with a total of 25 different persons involved.  People without accessories (no hat nor cap) and different hairstyle configurations (short, long and with/without ponytail).  People with 21 different headgear accessories, 17 hats (of different sizes) and four caps. Table 1 shows the details of the different accessories used, including information on their size (with the naming conventions indicated in Figure 4).  Table 1, for hats (left) and caps (right). Figure 4. Diagram of the sizes of the accessories used in Table 1, for hats (left) and caps (right).   1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames   1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames  The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames  The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames  The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames 16 1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames   1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames   1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames   1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames  The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames  The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames   1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames  The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames   1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames   1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames  The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames   1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames 16 × 11, 15 × 20, 10 Sensors 2017, 17, 1845 9 of 15  1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames 16 × 14, 26 × 23, 9 Sensors 2017, 17, 1845 9 of 15  1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames   1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames   1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to Figure 4.
The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames 18 × 13, 27 × 25, 14 1 The ruler shown is 30 cm long; 2 Sizes are given in cm as D1_in × D2_in, D1_out × D2_out and h, according to The dataset was divided into two subsets, one for training and the other one for testing. The subsets are fully independent, so that no person nor accessory present in the training subset was present in the testing subset. The columns Training Accessories and Testing Accessories in Table 1 show the accessories used for training and testing, respectively. It can be seen that just one hat/cap per class was used for training, and a variety of them was used for testing. Preliminary experiments were carried out to decide on the amount of training material required, and using a single hat/cap per class for training proved to be good enough, allowing us to use a bigger testing database to increase the statistical significance of the results.
The database contains sequences in which the users were instructed on how to move under the camera in linear directions (parallel or diagonal to the captured area, to allow for proper coverage of the recording area). A top-view of the room, in which the recording took place, and some example images belonging to different classes are shown in Figure 5. In Figure 5a, T0 and T1 are the two overhead ToF cameras used to record the sequences, the color-filled rectangles show the camera field of view at the floor level, and the thick line shows the trajectory followed by the users in the recordings. In Figure 5b, the height values are shown using a color scale representation.
We run preliminary experiments to assess the actual precision of the Kinect devices when estimating the depth values. We found that errors in height estimation were below 1 cm for users in static position, which is consistent with previous reports in the literature (such as [33], where the authors report a maximum error of 4 cm at the maximum range of the sensor, but our measured elements are much nearer the sensor, as we evaluate people standing). Tables 2 and 3 show the details of the training and testing subsets, respectively. #Sequences refers to the number of recorded sequences, #Frames refers to the number of all of the (useful) frames with people present. In Table 3, the number of sequences and frames is shown as T0 + T1, as they refer to the recordings done with each of the two overhead cameras: T0 and T1.
The Level 1 grouping just identifies the three main broad classes, namely Caps, Hats, and none of them (referred to as No hat/cap). In the Level 2 grouping, the classification was further refined for the Hats class, considering their size: large, medium and small. This refinement is somehow fuzzy, as the frontiers between each of these classes is somehow arbitrary, but we want to stress the fact that we are interested in providing additional semantic labels, with reasonable performance, to characterize people in the scene. In our case, we did this grouping mainly considering the external diameters of each hat (see size information in Table 1, considering the size definitions of Figure 4). with people present. In Table 3, the number of sequences and frames is shown as T0 + T1, as they refer to the recordings done with each of the two overhead cameras: T0 and T1.
The Level 1 grouping just identifies the three main broad classes, namely Caps, Hats, and none of them (referred to as No hat/cap). In the Level 2 grouping, the classification was further refined for the Hats class, considering their size: large, medium and small. This refinement is somehow fuzzy, as the frontiers between each of these classes is somehow arbitrary, but we want to stress the fact that we are interested in providing additional semantic labels, with reasonable performance, to characterize people in the scene. In our case, we did this grouping mainly considering the external diameters of each hat (see size information in Table 1, considering the size definitions of Figure 4).

Accessories
Sensor T0 Sensor T1 No hat/no cap

Performance Metrics
To evaluate the performance of our proposal, we will provide the full confusion matrices for the experiments carried out, and the average classification rates. In the case of the confusion matrices, we will also include the confidence interval values for a confidence level of 95%.

Results and Discussion
Several experiments were run in order to assess the capabilities of the proposed system in coping with different requirements, namely as above: • Level 2 classification grouping: in which five classes were considered, i.e., large size hat, medium size hat, small size hat, cap and no hat/cap. • Level 1 classification grouping: in which three classes were considered, i.e., hat, cap and no hat/cap. • Accessory-no accessory classification: in which two classes were considered, i.e., hat/cap, and no hat/cap. Table 4 shows the results for the level 2 classification task. The average classification rate is 84.60%, which may seem a bit low to be usable in a real context, but we have to take into account again that the frontier between hat sizes is somehow arbitrary. This can be clearly seen in the grayed areas that show, for every class, the somehow high confusion between adjacent hat sizes, and between using a cap and wearing nothing on the head (actually, a cap adjusts to the general head shape, except for its frontal shelter/cover).  Table 5 shows the results for the level 1 classification task. The average classification rate in this case is 92.33%. Again, a non-negligible confusion between the cap class and the no hat/cap class can be noticed, for the same reason stated above.  Table 6 shows the results for the hat/cap-no hat/cap classification task. The average classification rate is 95.25%. From Table 6, it is also easy to find out the sensitivity (true positive rate), which equals 95.22%, the specificity (true negative rate), which equals 95.61%, and the F1 score, which equals 95.41%. As a final comment, even if the classification performance may seem relatively low for the level 2 classification grouping, we want to stress that the proposal objective is providing semantic labels to be used by higher level processing systems. In this scenario, even if the classification is not perfect, it can provide additional information that will be useful in the higher levels.

Computational Cost
All of the experiments reported were run on a PC, with an Intel Quad Q9550 2.83 GHz processor (Santa Clara, CA, USA), and 4 GB RAM. The algorithmic proposal has been implemented in C under the LabWindows™/CVI (C for Virtual Instrumentation) Integrated Development Environment, and runs on a Windows 10 operating system (Microsoft, Redmon, WA, USA). The pre-process of the depth maps is the stage that presents the higher computational cost, with an average of 3.75 ms per frame. The rest of processes have a duration of 2 ms per frame, so that the overall processing time per frame is 5.75 ms, thus allowing for real-time operation of the proposal.
Alternative classification strategies could have been applied, but we focused our efforts on the design of a meaningful feature vector containing the relevant information that could be extracted from an overhead ToF camera. The advantage of PCA is its relative simplicity, which would allow its implementation in low power computing architectures. To provide an example, in [31], we compared the PCA based proposal with another based in SVM, for a people detection task also using depth information from an overhead ToF camera. The SVM classifier outperformed the PCA based one, but the later one was 4.5 times faster. The computational simplicity of our proposal would even allow its implementation in low cost platforms such as the Raspberry Pi, which has proved to be able to solve even more complex preprocessing and classification problems (such as the one described in [34], aimed at face detection, also using a PCA based solution).

Conclusions
In this work, we have presented a procedure for real-time classification (processing time is 5.75 ms per frame) of different headgear accessories (hats, caps, no hats/caps), by only using the depth information provided by a ToF camera placed in an overhead position. The use of only depth information, compared with the use of RGB, allows for guaranteeing people's privacy and decreasing the processing time.
Since a dataset fulfilling the requirements of the target application (depth maps acquired with an overhead ToF camera) was not available within the scientific community, a specific dataset has been recorded and made publicly available. This dataset contains depth sequences of 25 different people with short/long hairstyles without accessories (no hat nor cap), and with 21 different accessories (17 different hats and four different caps).
A full description of our algorithm for accessories classification has been provided, including depth data preprocessing, person ROI estimation, feature extraction and normalization, and the final classification stage. The classification process has been carried out using PCA techniques. Within this context, the recorded dataset was divided into fully separated training and testing subsets that have been used to train and evaluate the proposal for different classification tasks (from two to five different classes). The results obtained are satisfactory, achieving an average classification rate of around 92% for the three-class problem (hat, cap, no hat/cap). This average improves when the number of classes is reduced, reaching 95% in the case of classifying people with and without accessories. The performance for the five classes classification task is around 85%, which is mainly due to the fuzzy frontiers between the hat classes considering different sizes. Nevertheless, the results are very good considering the target application of providing additional attributes in the field of semantic surveillance video analysis.
Future work will be focused on further exploiting the information provided by the system, generating additional quantitative features (for example, related to the actual area of the detected accessories), evaluating alternative classification strategies, and also integrating this proposal in a more general semantic video analysis tool, combined with other semantic attribute estimators, specifically oriented to user reidentification in wider areas.