Categorization of Indoor Places Using the Kinect Sensor

The categorization of places in indoor environments is an important capability for service robots working and interacting with humans. In this paper we present a method to categorize different areas in indoor environments using a mobile robot equipped with a Kinect camera. Our approach transforms depth and grey scale images taken at each place into histograms of local binary patterns (LBPs) whose dimensionality is further reduced following a uniform criterion. The histograms are then combined into a single feature vector which is categorized using a supervised method. In this work we compare the performance of support vector machines and random forests as supervised classifiers. Finally, we apply our technique to distinguish five different place categories: corridors, laboratories, offices, kitchens, and study rooms. Experimental results show that we can categorize these places with high accuracy using our approach.


Introduction
An important capability for service robots working in indoor environments is their ability to categorize the different places where they are located. Place categorization has many applications in service robots. It is mainly used in semantic mapping, where acquired maps of the environment are extended with information about the type of each place allowing high level conceptual representations of environments [1][2][3][4][5][6]. In addition, the information about the type of a place can be used as prior or context information to improve the detection of objects [7,8]. Moreover, whenever a robot has information about the type of a place, it can determine the possible actions to be carried out in that area [9][10][11].
In the task of place categorization a robot assigns a label to the place where it is located according to the information gathered with its sensors. The labels assigned by the robot to the different places are usually the same that people would use to refer to those places such as office, kitchen, or laboratory. In this way the communication with humans is improved [12,13].
In this paper we present a new approach to categorize indoor places using an RGB-D sensor, in particular the Kinect camera [14]. The Kinect sensor provides RGB and depth images simultaneously at high rates. Moreover, this sensor is becoming popular in the robotics community due to its low cost. Figure 1 shows the Kinect sensor together with example depth and RGB images taken in a laboratory.
The main idea of our approach consists of transforming the image and depth information from the Kinect camera into feature vectors using histograms of local binary patterns (LBPs) whose dimensionality is reduced using a uniform criterion [15]. In order to obtain LBPs from RGB images they must first be transformed into grey scale images since the LBP operator ignores color information. The goal of this work is to distinguish categories of places, i.e., places with similar structural and spatial properties, and for this reason we have selected a descriptor that does not take color properties into consideration. Previous works on place categorization [16,17] also support the premise of ignoring color information for general categorization of indoor places.
The final feature vectors are combined and used as input to a supervised classifier. In this paper we compare the performance of support vector machines (SVMs) [18] and random forests (RFs) [19] as classification methods. We apply our method to sequences of images corresponding to five different place categories, namely corridors, laboratories, offices, kitchens, and study rooms, and obtain average correct classification rates above 92%. This result demonstrates that it is possible to categorize indoor places using a Kinect sensor with high accuracy. Finally, we show the improvement of our categorization approach when using both modalities simultaneously (depth and grey images) in comparison with single modalities.
The rest of the paper is organized as follows: after presenting related work in Section 2, we introduce the local binary pattern transformation for grey scale and depth images in Section 3. In Section 4 we describe the combined feature vector used to represent the grey scale and depth images corresponding to the same scene. The supervised classifiers used for the categorization are presented in Section 5. We introduce our dataset in Section 6. Finally, experimental results are presented in Section 7.

Related Work
The problem of place recognition by mobile robots has gained much attention during recent years. Some previous works use 2D laser scans to represent different places in the environment. For example, in [20] 2D scans obtained with a laser range finder are transformed into feature vectors representing their geometrical properties. These feature vectors are categorized into several places using Boosting. The work in [21] uses similar feature vectors to represent locations in a Voronoi Random Field. Moreover, in [22] sub-maps from indoor environments are obtained by clustering feature vectors representing the different 2D laser scans. Finally, the work in [23] introduces the classification of a single scan into different semantic labels instead of assigning a single label to the whole scan.
Vision sensors have also been applied to categorize places indoors using mobile robots. In [16] the CENTRIST descriptor is applied to images representing different rooms in several houses. The descriptors are later classified using support vector machines. Moreover, in the PLISS system for place categorization introduced in [17] images are represented by bags of words using the SIFT descriptor. Similar images are grouped together by locating change-points in the sequences. In [7] local and global features from images taken by a wearable camera are classified using a hidden Markov model.
Finally, combinations of different modalities have also been applied to robot place recognition. The work in [24] combines 2D laser scans with visual object detection to categorize places indoors. Moreover, in [25] multiple visual and laser-based cues are combined using support vector machines for recognizing places indoors.
In contrast to these works, we use the new Kinect sensor, which has the advantage of simultaneously providing visual and depth information. We combine grey scale and depth images, which allows us to integrate richer information about the visual appearance and the 3D structure of each place.

Local Binary Patterns
The local binary pattern (LBP) operator introduced in [15,26] was originally used for the analysis and classification of grey scale images. The LBP is a local transformation that encodes the relations between pixel values in a neighborhood of a reference pixel. In the next sections we explain how to calculate the LBP transformation for the RGB and depth images obtained with the Kinect sensor.

LBP Transformation for RGB Images
To apply the LBP transformation to RGB images, they must first be converted into grey scale images, because LBPs ignore color information and work only with intensity values. Then, for each pixel $p_i$ in the grey scale image, we calculate the corresponding LBP value following the approach presented in [15]. In particular, given a pixel $p_i$ with image coordinates $(x_i, y_i)$, we compare its value $v(p_i)$ with the values of its 8-neighboring pixels $p_j \in N_8(p_i)$. For each neighboring pixel $p_j$ we obtain a binary value $b_j \in \{0, 1\}$ indicating whether the value $v(p_i)$ of the reference pixel $p_i$ is bigger than the value $v(p_j)$ of the neighboring pixel $p_j$ as:
$$b_j = \begin{cases} 1 & \text{if } v(p_i) > v(p_j) \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
The binary values in the neighborhood are concatenated into a string in a fixed order. In this work we use a clockwise order starting with the value $v(p_s)$ of the pixel to the right of the center pixel $p_i$, that is, $p_s = (x_i + 1, y_i)$. The obtained binary string is then converted into the corresponding decimal value $d(p_i) \in [0, 255]$. An example of this process is shown in Figure 2. The final LBP is obtained after applying the previous transformation to every pixel in the image, which yields a transformed image $T_{grey}$. Figure 3 (upper row) shows the result of applying the LBP transformation to an RGB image obtained with the Kinect camera.
The LBP operator described above is equivalent to the $LBP_{8,1}$ operator of [15], with the sole difference that we do not interpolate values at the diagonals. Moreover, it is equivalent to the Census Transform presented in [27].
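The transformation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names are hypothetical, the image is a nested list of intensities, and three details are assumptions that the text does not fix, namely that the first neighbor in the string is the most significant bit, that "clockwise" is taken in image coordinates with rows growing downwards, and that border pixels are skipped.

```python
# Neighbour offsets (dx, dy) in clockwise order, starting from the pixel to the
# right of the centre. ASSUMPTION: rows grow downwards, so clockwise on screen
# is right, bottom-right, bottom, bottom-left, left, top-left, top, top-right.
OFFSETS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def lbp_value(image, x, y):
    """Return the LBP code in [0, 255] of the interior pixel at column x, row y."""
    centre = image[y][x]
    code = 0
    for dx, dy in OFFSETS:
        # Bit is 1 when the reference pixel is strictly bigger than the neighbour.
        bit = 1 if centre > image[y + dy][x + dx] else 0
        # ASSUMPTION: the first neighbour becomes the most significant bit.
        code = (code << 1) | bit
    return code


def lbp_transform(image):
    """Apply lbp_value to every interior pixel (ASSUMPTION: borders are skipped)."""
    height, width = len(image), len(image[0])
    return [[lbp_value(image, x, y) for x in range(1, width - 1)]
            for y in range(1, height - 1)]


grey = [
    [10, 10, 10],
    [10, 50, 90],
    [90, 90, 90],
]
print(lbp_transform(grey))  # [[15]], i.e. the binary string 00001111
```

In this toy image the center pixel is brighter than its left and upper neighbors only, so the last four bits of the string are set.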

LBP Transformation for Depth Images
Pixels in depth images provided by the Kinect sensor represent the distance of objects to the sensor (see Figure 1(a)). To obtain the LBP transformation of depth images we apply the same process as for grey images (Section 3.1) but using the depth values. However, since the Kinect camera has a limited working depth range, the pixels representing depth values outside this range appear as undefined values in the corresponding depth image. In addition, we obtain similar undefined values when the camera is pointing at reflective surfaces, or when the pixels represent positions close to the borders of objects. Examples of these cases are presented in Figure 1(a), where undefined pixels are shown in black. To integrate undefined pixels when calculating the LBP transformation, we propose to extend the range of resulting decimal values with the extra value 256 to represent these undefined cases. In addition, when calculating the LBP value for a given pixel in the depth image we also take into account neighboring pixels with undefined values as follows. For a given pixel $p_i$ in the original depth image we assign it the decimal value 256 if its depth value is undefined or there exists some undefined value in its 8-neighborhood $N_8(p_i)$. Otherwise we apply the standard LBP procedure of Section 3.1. Formally:
$$d^{+}(p_i) = \begin{cases} 256 & \text{if } \delta(v(p_i)) \text{ or } \exists\, p_j \in N_8(p_i) : \delta(v(p_j)) \\ d(p_i) & \text{otherwise} \end{cases} \qquad (2)$$
where $\delta(\cdot)$ is an indicator function which returns true when its argument is an undefined value, and false otherwise. The value $d(p_i)$ is the base-10 value obtained by applying the LBP transformation of Section 3.1. The resulting value $d^{+}(p_i)$ is contained in the extended range [0, 256]. After applying this operator to every depth pixel we obtain the transformed image $T_{depth}$. An example of an LBP transformation for a depth image is shown in Figure 3.
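The extended operator for depth images can be sketched as follows. This is a hedged illustration rather than the original code: undefined depth readings are represented here by None, which is an assumption (a real driver may report them as 0 or NaN), and the bit order and neighbor ordering are the same illustrative assumptions as for grey scale images.

```python
UNDEFINED_CODE = 256  # extra value reserved for undefined cases

# Clockwise neighbour offsets (dx, dy) starting at the right-hand pixel
# (ASSUMPTION: image rows grow downwards).
OFFSETS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def depth_lbp_value(depth, x, y):
    """Return the extended LBP code in [0, 256] of an interior depth pixel."""
    centre = depth[y][x]
    neighbours = [depth[y + dy][x + dx] for dx, dy in OFFSETS]
    # ASSUMPTION: None marks an undefined depth reading.
    if centre is None or any(value is None for value in neighbours):
        return UNDEFINED_CODE
    code = 0
    for value in neighbours:
        # Standard LBP bit; first neighbour is the most significant bit (assumed).
        code = (code << 1) | (1 if centre > value else 0)
    return code


depth = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 2.0, 2.0, None],
    [1.0, 1.0, 1.0, 1.0],
]
print(depth_lbp_value(depth, 1, 1))  # 127: all neighbours are defined
print(depth_lbp_value(depth, 2, 1))  # 256: the right-hand neighbour is undefined
```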

Multi-Modal Representation of Places
In our approach places are represented by depth and color images taken by a Kinect camera. In this section we explain how to combine both modalities to obtain a global feature vector which will be later categorized using different supervised methods.
The transformed images $T_{grey}$ and $T_{depth}$ obtained by following the steps of Section 3 are further represented by histograms $H_{grey}$ and $H_{depth}$ respectively. Each bin in these histograms contains the frequency of appearance of the different LBP transformed values. LBPs define local structures in images, and histograms of LBPs represent the distribution of these local structures in the scene. They thus give a general representation of the images, which in our case represent different place categories. Similar histograms may represent different places, but these places should share a similar global structure. This is in fact an advantage in our approach since our objective is to classify places with similar global structure into the same category, e.g., different corridors should be included in the general category "corridor", in the same way that different offices should be detected as pertaining to the same category "office". Histograms of local features have been successfully used in previous works to classify images into different place categories [16,17,28].
In our approach we further reduce the dimensionality of each histogram by selecting a subset of their LBPs using a uniformity measurement U introduced in [15], which indicates the number of transitions between 0/1 values in the circular binary representation of the decimal value d as:
$$U(d) = |b_7 - b_0| + \sum_{i=1}^{7} |b_i - b_{i-1}| \qquad (3)$$
where the different values $b_i$ are obtained following Equation (1) and their positions inside the image are indicated in Figure 2(b). As an example, the uniformity value corresponding to the decimal LBP value d = 236 (binary 11101100) is U(d) = 4 (cf. Figure 2(b)). As explained above, LBPs represent local structure in the image (see Figure 2). Moreover, some of these local structures appear with different frequencies in different places, and also present different discriminative properties. In this paper we want to study the discriminative properties of these different local structures when they are applied to the problem of place categorization. For this purpose we use the uniformity measurement U to select different subsets of LBPs, i.e., different local structures. In the experiments we will see that the selection of subsets of LBPs according to the uniformity measurement U improves the categorization results. A side effect of this selection is the reduction in the dimensionality of the final feature vectors representing different place categories; however, as the experiments will demonstrate, this reduction improves the classification results. We think this is due to the elimination of LBPs with poor discrimination properties for place categorization. For example, when the threshold θ on the uniformity measurement is high we allow LBPs corresponding to local structures with many local changes that can correspond to noise, while low thresholds maintain only more defined local structures such as corners or lines, as in Figure 2(b).
Using the uniformity measurement U, the final histograms are composed of the subsets of bins representing the selected LBPs as:
$$H^{\theta} = \{\, h_d \mid U(d) \le \theta \,\} \qquad (4)$$
where $h_d$ is the bin in the histogram corresponding to LBP value d, and θ is a threshold for the uniformity measurement. Lower values for θ produce histograms with lower dimensionality. As an example, for θ = 2 the resulting histograms have 58 bins, while a value of θ = 4 results in histograms of 198 bins. When the threshold is θ = 8 there is no reduction in the corresponding histograms and they are equivalent to the CENTRIST descriptor, which has been recently introduced for place categorization using visual information [16]. That means that CENTRIST can be seen as a special case of our approach when θ = 8.
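The bin counts quoted above can be checked with a short sketch. It assumes 8-bit codes and counts transitions circularly, i.e., the last bit is compared with the first; the function names are hypothetical.

```python
def uniformity(d, bits=8):
    """Number of circular 0/1 transitions in the `bits`-bit representation of d."""
    b = [(d >> i) & 1 for i in range(bits)]
    # b[i - 1] with i = 0 is b[-1], which closes the circle.
    return sum(b[i] != b[i - 1] for i in range(bits))


def selected_codes(theta):
    """LBP values d in [0, 255] whose uniformity does not exceed theta."""
    return [d for d in range(256) if uniformity(d) <= theta]


print(uniformity(236))  # 4, since 236 is 11101100 in binary
for theta in (2, 4, 8):
    print(theta, len(selected_codes(theta)))  # 58, 198 and 256 bins
```

The number of transitions in a circular binary string is always even, so only the thresholds 0, 2, 4, 6 and 8 produce distinct subsets.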
Finally, the multi-modal feature vector $x_{\theta}$ describing a particular place is obtained by concatenating the reduced histograms corresponding to both modalities:
$$x_{\theta} = \left( H^{\theta}_{grey},\; H^{\theta}_{depth} \right) \qquad (5)$$
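A minimal sketch of this concatenation is given below. Two choices are assumptions made only for illustration: the histograms hold raw counts rather than normalized frequencies, and the extra depth value 256 is kept as one additional bin of the depth histogram. All function names are hypothetical.

```python
from collections import Counter


def uniformity(d, bits=8):
    """Number of circular 0/1 transitions in the binary representation of d."""
    b = [(d >> i) & 1 for i in range(bits)]
    return sum(b[i] != b[i - 1] for i in range(bits))


def reduced_histogram(codes, theta, extra_codes=()):
    """Count LBP values, keeping only bins with U(d) <= theta plus extra_codes."""
    counts = Counter(codes)
    kept = [d for d in range(256) if uniformity(d) <= theta] + list(extra_codes)
    return [counts[d] for d in kept]  # missing values count as 0


def multimodal_vector(t_grey, t_depth, theta):
    """Concatenate the reduced grey and depth histograms of one scene."""
    flat_grey = [d for row in t_grey for d in row]
    flat_depth = [d for row in t_depth for d in row]
    h_grey = reduced_histogram(flat_grey, theta)
    # ASSUMPTION: the undefined-depth value 256 is kept as an extra bin.
    h_depth = reduced_histogram(flat_depth, theta, extra_codes=(256,))
    return h_grey + h_depth


t_grey = [[0, 255], [15, 85]]   # 85 = 01010101 has U = 8 and is dropped
t_depth = [[256, 15], [15, 0]]
x = multimodal_vector(t_grey, t_depth, theta=4)
print(len(x), sum(x))  # 397 bins (198 + 199) holding 7 counted pixels
```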

Classification
The multi-modal feature vector obtained in the previous section is used as input to a supervised method for categorization purposes. In this paper we compare two state-of-the-art classification methods: support vector machines, and random forests.

Support Vector Machines
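The first supervised classification method is based on a support vector machine (SVM) [29,30]. During the training phase, a support vector machine takes as input a set of N feature vectors $x_i$ together with their binary labels $y_i \in \{1, -1\}$. The idea behind SVMs is to find the hyperplane that maximizes the distance between the examples of the two classes. This is done by finding a solution to the optimization problem:
$$\min_{w, b, \xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_i \qquad (6)$$
subject to the condition:
$$y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \qquad (7)$$
where w is the normal to the hyperplane, b is its offset, C > 0 is the penalty applied to the slack variables, $\phi$ is the feature mapping induced by the kernel, and $\xi_i \ge 0$ are slack variables that measure the error in the misclassification of $x_i$. In addition we use the radial basis function (RBF) kernel:
$$K(x_i, x_j) = \exp\left( -\gamma \lVert x_i - x_j \rVert^{2} \right), \quad \gamma > 0 \qquad (8)$$
In the test step new examples $x_t$ are labeled according to:
$$y_t = \operatorname{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(x_i, x_t) + b \right) \qquad (9)$$
where $\alpha_i$ are the coefficients of the training examples obtained in the dual form of the problem. SVMs were originally designed to solve binary classification problems. In the case of multi-class classification different approaches can be used to manage several classes. In our case we apply the "one-against-one" approach [31], which trains one SVM for each pair of categories, resulting in a total of k(k-1)/2 classifiers for k categories.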
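The RBF kernel and the one-against-one voting scheme can be illustrated with the short sketch below. It does not train any SVM: the pairwise classifiers are hypothetical stand-ins (simple functions that return one of their two categories), used only to show how k(k-1)/2 pairwise decisions are combined by voting.

```python
import math
from collections import Counter
from itertools import combinations


def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def one_against_one_predict(x, pairwise_classifiers):
    """Return the category that wins the most pairwise decisions for x."""
    votes = Counter(classify(x) for classify in pairwise_classifiers.values())
    return votes.most_common(1)[0][0]


categories = ["corridor", "kitchen", "laboratory", "office", "study room"]
pairs = list(combinations(categories, 2))
print(len(pairs))  # 10 pairwise classifiers for k = 5 categories

# HYPOTHETICAL stand-ins for trained binary SVMs: each one prefers "office"
# when it is one of its two categories, and its first category otherwise.
dummy_classifiers = {
    pair: (lambda x, pair=pair: "office" if "office" in pair else pair[0])
    for pair in pairs
}
prediction = one_against_one_predict([0.0, 1.0], dummy_classifiers)
print(prediction)  # office, which wins all four of its pairwise contests
```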

Random Forests
The second type of supervised classifier used in this work is the random forest [19]. The idea behind this classifier is to use M classification trees, each of which assigns a label to the input vector x. The final label is obtained by a majority vote over the labels assigned by all trees.
In this approach, each tree is trained as follows. First, using the original training data with N feature vectors, a new training set is created by random sampling of N samples with replacement. Second, during the creation of each node in the tree, a subset of $l \ll L$ features from the total feature vector $x \in \mathbb{R}^{L}$ is randomly selected. Finally, the tree is constructed without pruning. In our approach we use the random forest implementation of WEKA [34], which is based on [19].
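The randomization steps and the final vote can be sketched as follows. This is not the WEKA implementation: tree construction itself is omitted, the trees in the example are hypothetical stand-in functions, and all names are illustrative.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) samples from data with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def random_feature_subset(num_features, subset_size, rng):
    """Pick subset_size distinct feature indices, as done at each tree node."""
    return rng.sample(range(num_features), subset_size)


def forest_predict(trees, x):
    """Majority vote over the labels assigned by all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


rng = random.Random(0)                    # fixed seed, only for repeatability
training_set = list(range(10))            # stand-in for N = 10 feature vectors
sample = bootstrap_sample(training_set, rng)
features = random_feature_subset(58, 3, rng)

# HYPOTHETICAL trees: constant functions standing in for trained trees.
trees = [lambda x: "office", lambda x: "office", lambda x: "corridor"]
label = forest_predict(trees, [0.0])
print(len(sample), len(features), label)
```

Because sampling is done with replacement, a bootstrap sample usually contains repeated examples and leaves some of the original ones out.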

Place Dataset
To test our approach we have created a dataset of places by collecting data in different buildings at the University of Kyushu (this dataset is available at [35]). The dataset contains RGB and depth images acquired by a Kinect sensor which was mounted on a mobile platform at a height of 125 cm. We collected data from five different place categories: "corridor", "kitchen", "laboratory", "office", and "study room". Each category contains RGB and depth images from several places that pertain to that category. For example, the category "laboratory" contains data from four different laboratories. In each place we obtained one sequence of images while controlling the platform manually. The trajectory at each place has a different length and thus contains a different number of images. Table 1 presents a summary of the information contained in the dataset. To record the place data we used the Robot Operating System (ROS) framework on a laptop equipped with an Intel Core i5 processor. In our experiments we simultaneously recorded depth images, 3D point clouds and RGB images. Since the Kinect camera does not provide hardware synchronization of RGB and depth images, we used the closest timestamp to match images of both modalities. The elapsed times between matched depth and RGB images ranged between 5 ms and 10 ms. Examples of RGB and depth images for each place in our dataset are shown in Figure 4.
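Matching images by closest timestamp can be sketched as below. This is an assumed, minimal version of the idea and not the code used to build the dataset: timestamps are integer milliseconds, the depth timestamps are sorted in ascending order, and the values in the example are invented for illustration.

```python
from bisect import bisect_left


def match_by_timestamp(rgb_stamps, depth_stamps):
    """For each RGB timestamp, return the index of the closest depth timestamp.

    depth_stamps must be sorted in ascending order.
    """
    matches = []
    for t in rgb_stamps:
        i = bisect_left(depth_stamps, t)
        # Only the depth stamps just before and just after t can be closest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(depth_stamps)]
        matches.append(min(candidates, key=lambda j: abs(depth_stamps[j] - t)))
    return matches


# HYPOTHETICAL timestamps in milliseconds.
rgb = [0, 33, 67]
depth = [7, 40, 72, 105]
pairs = match_by_timestamp(rgb, depth)
print(pairs)                                           # [0, 1, 2]
print([abs(depth[j] - t) for t, j in zip(rgb, pairs)])  # [7, 7, 5]
```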

Experiments
To evaluate the performance of our approach we conducted several experiments using our dataset of places. To create the different test and training sets for the experiments we applied the following procedure. Each test set was created by randomly selecting one place from each category, i.e., each test set always contains five sequences of grey scale and depth images, each of which corresponds to one category. Example test sets are {corridor 1, kitchen 2, laboratory 4, study room 1, office 2} or {corridor 2, kitchen 2, laboratory 3, study room 2, office 2}. The remaining places are used as training data. The idea behind this selection is that the test sets always contain sequences of places that do not appear in the training set. In this way we test the behavior of our method when applied to previously unseen places. Finally, for each experiment we repeated the previous process 10 times and obtained the average confusion matrices for the five categories.
We first show categorization results using our proposed approach, in which we combine reduced histograms of LBPs for grey scale and depth images that are classified using an SVM. In addition, we compare our approach with results in which the histograms of LBPs are not reduced.
Moreover, we show the improvement of the performance when using the combination of both modalities in comparison with single modalities only. We also present classification results applying spatial pyramids [28], a well-known technique used in computer vision to improve classification results of scenes. Finally, we study the performance of our combined descriptor when used with support vector machines in comparison to random forests. In all the experiments the RGB images were first converted into grey scale.
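The split procedure described above can be sketched as follows. The number of places per category is hypothetical, except for the four laboratories mentioned in the dataset description; the actual numbers are those of Table 1.

```python
import random


def split_places(places_per_category, rng):
    """Hold out one random place per category for testing; train on the rest."""
    train, test = [], []
    for category, places in places_per_category.items():
        held_out = rng.choice(places)
        test.append((category, held_out))
        train.extend((category, p) for p in places if p != held_out)
    return train, test


# HYPOTHETICAL place identifiers; only the four laboratories come from the text.
places = {
    "corridor": [1, 2, 3],
    "kitchen": [1, 2, 3],
    "laboratory": [1, 2, 3, 4],
    "office": [1, 2, 3],
    "study room": [1, 2, 3],
}

rng = random.Random(0)  # fixed seed, only for repeatability
splits = [split_places(places, rng) for _ in range(10)]  # 10 repetitions
train, test = splits[0]
print(len(train), len(test))  # 11 training places and 5 test places
```

Since whole places are held out, no image sequence from a test place is ever seen during training.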

Categorization of Places with Combined Histograms of LBP and SVMs
In the first experiment we study the performance of our approach when using histograms of reduced local binary patterns together with support vector machines. The final combined modality feature vectors x representing each pair of grey and depth images were obtained following the method of Section 4. In addition, we apply different thresholds θ for the uniformity measurement and compare their classification results. As explained above, we repeated the experiment 10 times using different training and test sets. The support vector machines for each of the 10 experiments were trained using RBF kernels whose parameters were found by grid-search (see Section 5.1). Table 2 presents the overall classification results for the 10 experiments. Results are averaged over the 10 experiments and are accompanied by the corresponding standard deviations. As Table 2 suggests, the best results are obtained with threshold θ = 4. In this case not only does the average classification rate improve, but the uncertainty (represented by the standard deviation) is also reduced. When θ = 8 there is no reduction in the histograms of LBPs and the final descriptor is equivalent to CENTRIST [16]. In addition, Figure 5 plots the average correct classification rates for each category. As shown in the plot, the best results are almost always obtained when θ = 4. In particular, the performance greatly improves in the most difficult categories, which are "kitchen" and "study room". Finally, we present the details of the previous experiments using confusion matrices, which indicate the predicted classification for the actual place. The value of each cell in the confusion matrix is the average and standard deviation over the 10 experiments. The confusion matrices for different values of the uniformity threshold θ are shown in Table 3.
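How per-cell averages and standard deviations of confusion matrices can be computed over repeated experiments is sketched below. Two details are assumptions: rows are normalized to rates per actual category, and the sample standard deviation is used. The labels in the example are invented.

```python
from statistics import mean, stdev


def confusion_matrix(actual, predicted, categories):
    """Row-normalised confusion matrix: rows are actual, columns predicted."""
    index = {c: i for i, c in enumerate(categories)}
    counts = [[0] * len(categories) for _ in categories]
    for a, p in zip(actual, predicted):
        counts[index[a]][index[p]] += 1
    # ASSUMPTION: each row is divided by the number of examples of that category.
    return [[n / sum(row) if sum(row) else 0.0 for n in row] for row in counts]


def average_matrices(matrices):
    """Per-cell (mean, sample standard deviation) over several matrices."""
    size = len(matrices[0])
    return [[(mean([m[i][j] for m in matrices]),
              stdev([m[i][j] for m in matrices]))
             for j in range(size)] for i in range(size)]


categories = ["corridor", "office"]
actual = ["corridor", "corridor", "office", "office"]
# HYPOTHETICAL predictions from two repetitions of an experiment.
run_1 = confusion_matrix(actual, ["corridor", "office", "office", "office"], categories)
run_2 = confusion_matrix(actual, ["corridor", "corridor", "office", "corridor"], categories)
summary = average_matrices([run_1, run_2])
print(summary[0][0])  # mean 0.75 for corridor images labeled as corridor
```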

Multiple Modalities vs. Single Modalities
In this section we study the improvement in the categorization of places when using the combined modalities (grey and depth images) in comparison with single modalities only (grey or depth image). We repeated the experiments of the previous section using different data each time: grey images only, depth images only, and grey + depth images. As in the previous section, we used SVMs as classifiers. Figure 6 compares the results obtained with the different modalities.
We also applied spatial pyramids to the data from our previous 10 experiments, using SVMs as classifiers, and compared different modalities and uniformity thresholds. A final summary of categorization results is shown in Table 4, which lists the overall average correct categorization rates and standard deviations for the 10 experiments. The results in Table 4 show that the combination of modalities outperforms single ones in almost all cases. We can also see that the best result in the combined modality is obtained at level 0. Previous literature reported better results when applying spatial pyramids to image categorization. From Table 4 we can see that this is also the case when using individual modalities, i.e., grey scale images or depth images only. However, the combination of both does not improve the categorization at further levels in our particular dataset and experiments. We want to study this behavior in future work.
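The way spatial pyramids enlarge the feature vector can be sketched as follows. The exact pyramid configuration used in the experiments is not detailed here, so this sketch assumes the common scheme in which level l divides the image into 2^l x 2^l cells and the histograms of all cells at levels 0 to L are concatenated without weighting. Function names and the toy input are hypothetical.

```python
def pyramid_cells(width, height, level):
    """Yield (x0, y0, x1, y1) for each cell of the 2**level x 2**level grid."""
    n = 2 ** level
    for r in range(n):
        for c in range(n):
            yield (c * width // n, r * height // n,
                   (c + 1) * width // n, (r + 1) * height // n)


def pyramid_histogram(codes, max_level, kept_codes):
    """Concatenate per-cell histograms of the kept LBP values for all levels."""
    height, width = len(codes), len(codes[0])
    position = {d: i for i, d in enumerate(kept_codes)}
    feature = []
    for level in range(max_level + 1):
        for x0, y0, x1, y1 in pyramid_cells(width, height, level):
            hist = [0] * len(kept_codes)
            for y in range(y0, y1):
                for x in range(x0, x1):
                    d = codes[y][x]
                    if d in position:  # values outside the subset are ignored
                        hist[position[d]] += 1
            feature.extend(hist)
    return feature


codes = [[0] * 4 for _ in range(4)]  # toy 4 x 4 transformed image
feature = pyramid_histogram(codes, max_level=1, kept_codes=[0, 255])
print(len(feature))  # 10 = 2 bins * (1 + 4) cells
```

Under this assumption the dimension is the number of bins multiplied by 1 + 4 + ... + 4^L, so each additional level roughly quadruples the length of the vector.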

Classification Using Random Forests
In this section we evaluate the performance of our approach when using random forests in the categorization step. We compare it with the best results obtained using SVMs with reduced feature vectors, i.e., with uniformity threshold θ = 4. Table 5 shows a summary of this comparison. As we can see, support vector machines outperform random forests at the different levels of spatial pyramids. In this table we can also see that the results using random forests improve as the number of spatial pyramid levels increases; however, we do not observe this behavior when using the multi-class implementation of SVM provided in libsvm [32].

Conclusions
In this paper we have presented a method to classify places in indoor environments using RGB and depth images obtained by a Kinect camera. Our approach uses a combination of both modalities to create a feature vector that is categorized using different supervised methods. Moreover, we have applied the uniformity measurement to reduce the combined feature vectors and to improve the final categorization results. In addition, we compared the categorization results using SVMs and random forests. The results indicated that SVMs are more appropriate for our particular case. Finally, the results of our experiments showed that the combination of depth and image information outperforms the use of single modalities in almost all cases.
In this work, we did not apply any extra reduction of dimensionality in the final combined feature vectors used for categorization. However, when using spatial pyramids at different levels the dimension of the feature vectors grows exponentially with the number of levels, and the application of some reduction technique such as PCA can improve results at subsequent levels [16]. As future work we want to study different methods to further reduce the dimensionality of the feature vectors at different levels and compare these results to the ones presented in this paper. We also want to study new ways of combining vectors from different modalities.