1. Introduction
The prospect of a social robot in every home may be realized within the next two decades. Many researchers in academia and the tech industry are already actively studying and designing prototypes of such robots. The open research objectives are diverse and include, but are not limited to, emotion recognition, perception, pattern recognition (face, object, scene, and voice), and navigation. These robots are expected to be employed as companions to seniors and children, for housekeeping, for surveillance, and so on [1,2]. In order to accomplish such tasks, it is essential that the robot seamlessly recognizes its own location inside the home, much as humans are effortlessly aware of their whereabouts at any instant, e.g., in the kitchen or the living room. This knowledge is a prerequisite for many indoor navigation scenarios and facilitates the robot's movement in the house. This paper is not about designing social robots per se; it addresses one of the many problems that collectively contribute to the efficient operation of such robots, namely knowing the robot's location in the house at any given instant. Classification is a core computer vision problem whereby data streams are categorized into specific classes based on their learned features. The problem has been addressed by different supervised machine learning algorithms [3]. The convolutional neural network (CNN) [4,5] is generally regarded as the state-of-the-art deep learning algorithm for visual tasks, e.g., face recognition and object detection, especially after the pioneering work reported in Reference [6]. This algorithm surpasses conventional machine learning algorithms by integrating feature extraction and classification without requiring careful human design [7].
The main objective of this paper is to identify different household rooms for social robotic applications in houses. The study is part of a larger project of designing social robots to be employed in such environments. The problem of indoor navigation can be addressed by different approaches; here, we propose a CNN solution for real-time implementation on such social robots. We examined several CNN architectures within a home setting. The latest scene dataset, called Places [8], was adopted in the study. We downloaded the five most common indoor classes for houses (bedrooms, dining rooms, kitchens, living rooms, and bathrooms) from the dataset. However, we noted that each class included a sizeable number of irrelevant scenes; we therefore reduced the number of samples by removing unrelated images from each class. We then propose a solution that combines the CNN with multiple binary classifiers, referred to as ECOC [9]. These models were evaluated experimentally on a NAO humanoid robot [10].
The rest of the paper is organized as follows: Section 2 reviews the related literature addressing the room classification problem for robotics using different methods. In Section 3, we briefly review the CNN and ECOC algorithms, beginning with the fundamental components used in most CNN architectures; this section also explains the main idea of the ECOC approach. Section 4 covers all simulation experiments and examines the results; we start with a brief overview of the scene dataset, then present simulation studies of multi-class experiments for several CNN architectures as well as results of the multi-binary classifiers on the best CNN architecture. Section 5 shows the results of real-time experiments with all three models tested on a NAO humanoid robot. The paper concludes with a discussion of the results in Section 6.
2. Related Work
Recognizing different rooms in a home environment based on their specific function is an important problem for social robots; its solution not only facilitates seamless movement from one place to another but is also the basis of all other tasks, including assisting humans inside the house or performing various functions within the robot's overall duties. An interaction with a human might be of the form "Please go to the kitchen and bring a cup of water". This problem has attracted the attention of robotics researchers in the last decade, and several conventional machine learning methods have been employed to address room classification in indoor settings. One of the early studies, reported by Burgard's group in [11], addressed semantic place classification of indoor environments by extracting features from laser range data using the AdaBoost algorithm. The experiments were conducted in a real office environment using sequential binary classifiers to differentiate between room, corridor, doorway, and hallway. It was suggested that the sequential binary AdaBoost classifiers were much more accurate than multi-class AdaBoost. The study was further extended in References [12,13] by extracting features from laser and camera data to classify six different places, including doorways, a laboratory, a kitchen, a seminar room, and a corridor, and by examining the effect of a Hidden Markov Model on the final classification. The same algorithm, i.e., AdaBoost, was trained in Reference [14] using SIFT features of online images for seven different rooms; the performance was examined for different numbers of classes and different possible pairs of classes, where the average accuracy of the binary classifiers was 77%.
Robotics researchers have also employed the well-known support vector machine (SVM) algorithm for the room classification problem using different sensors. In Reference [15], laser data was used to build a hierarchical model, in which the hierarchy was employed for training and testing SVMs to classify 25 living rooms, 6 corridors, 35 bathrooms, and 28 bedrooms. Although this study reported an accuracy of 84.38%, laser data generally do not provide rich information and require substantial processing to extract useful features. In contrast, vision features are used in other studies to train SVMs. In Reference [16], a voting technique was used to combine 3D features with 2D GIST features, and these were used for training SVMs to classify six indoor places: bathrooms, bedrooms, eating places, kitchens, living rooms, and offices. Furthermore, SVM and Random Forest (RF) classifiers were used and compared in Reference [17] to classify five places (corridors, laboratories, offices, kitchens, and study rooms) using RGB-D images from a Kinect sensor. Room detection has also been addressed as an unsupervised learning problem using unlabeled images. In Reference [18], SIFT features and a 3D representation were used to extract convex spaces for clustering images based on similarities. In addition, stereo imagery was used in Reference [19] for room detection and modeling by fusing 2D features with geometry data acquired from pixel-wise stereo to represent 3D scenes; the study was completed by modeling walls, rooms, and doorways using several techniques for feature extraction, depth diffusion, depth segmentation, and clustering in order to detect room functionalities. The problem has also been addressed from different perspectives, such as the study in Reference [20], in which the authors addressed the context-awareness problem for service robots by developing a system that identified 3D objects using online information. As can be noted from previous research, the main drawback is the substantial effort required to extract features. This weakness can be overcome by adopting a convolutional neural network (CNN) algorithm.
The CNN is a category of deep neural network that has demonstrated successful results in the field of computer vision, such as face recognition and object detection. It was proposed by LeCun [5], who introduced the first CNN architecture, called LeNet, in 1998, after several successful attempts since the 1980s [4,21]. There are two main advantages of this algorithm over other machine learning algorithms and conventional fully connected feedforward neural networks. First, a CNN extracts and learns features from raw images without requiring careful engineering design for extracting features in advance [7]. Second, a CNN considers the spatial structure of the image by translating inputs to outputs through shared filters [22]. After the huge improvements in data collection and computer hardware between the 1990s and 2012, AlexNet was introduced [6] for addressing the object detection problem using the ubiquitous ImageNet dataset to classify 1.2 million images into 1000 different classes. Since 2012, many elaborate architectures have been proposed, essentially built on LeCun's early architecture, in order to improve performance on the ImageNet database [23]. However, the impressive progress demonstrated on ImageNet for object classification by these pre-trained models has not translated into the same success for scene classification. Consequently, the first significant dataset of scene-centric images, referred to as Places, was proposed in Reference [8]. In general, indoor scene classification is challenging due to feature similarity across different categories. This problem has been studied with various learning methods as well as with CNNs, which so far have been employed in only a few studies.
In Reference [24], a solution was proposed by designing a model that combined local and global information. The same problem was addressed by applying a probabilistic hierarchical model, which associates low-level features with objects via an object classifier, and objects with scenes via contextual relations [25]. There are also some studies that have employed CNNs for robot learning in indoor environments. Ursic et al. [26] addressed the room classification problem for household service robots; the performance of the pre-trained hybrid-CNN model from Reference [8] was examined on segmented images, i.e., learning through parts, of eight classes from the Indoor67 dataset. The part-based model achieved 85.16% accuracy, close to the 86.45% accuracy of the original hybrid-CNN; however, learning through parts gave much better accuracy on deformed images than the original model. The authors of Reference [27] took advantage of a CNN for scene recognition in laboratory environments, with 89.9% accuracy, to enhance the indoor localization performance of a multi-sensor fusion system using smartphones. Furthermore, the objective of Reference [28] was to find the best retraining approach for a dynamically learning robot in indoor office environments. The paper examined and compared different approaches for adding new features from new images into a learned CNN model, considering accuracy and training time. The images newly added to the feature database were the misclassified ones, selected and corrected by the user, and the authors simulated one of the categories as the new environment. The paper reported that a pre-trained CNN with a KNN classifier was the most appropriate approach for real robots, as it gave reasonable accuracy with the shortest training time. All their experiments were executed on the VidRILO dataset [29] using only its RGB frames, i.e., excluding the depth frames. The methodology of this paper, however, differs from previous studies: we examined several CNN architectures on five categories of indoor room scenes, i.e., bathrooms, bedrooms, dining rooms, kitchens, and living rooms, downloaded from the Places dataset. In addition, these models were examined after cleaning and reducing the number of samples. Furthermore, a combination of the CNN with a multi-binary classifier method called ECOC was proposed and evaluated in order to improve the real-time performance on a NAO humanoid robot.
Error correcting output codes (ECOC) is a decomposition technique proposed by Dietterich and Bakiri for addressing multiclass learning problems [9]. A few studies in the literature have employed ECOC within a CNN architecture, but from a different perspective than the one adopted in this paper. Deng et al. [30] used ECOC to address the target-code issue by replacing the one-hot encoding with a Hamming code in the last layer of the CNN, which helped reduce the number of neurons in that layer; the CNN model and CNN-ECOC, i.e., the CNN with the new target codes, were then trained and evaluated separately and the results were compared. The same problem was addressed in Reference [31] using a different coding scheme, the Hadamard code. ECOC within CNN has also been employed in medical applications [32,33], in which a pre-trained CNN was used only for extracting features, and multiple binary SVM classifiers were then trained and combined with ECOC, referred to as ECOC-SVM. To the best of our knowledge, this is the only reported work for robotics applications that combines and fine-tunes a CNN with ECOC, designing multiple binary CNN classifiers and comparing their performance with a regular multi-class CNN.
4. Dataset & Simulation Experiments
The process of this work can be divided into three phases, as shown in Figure 3. Phase 1 aimed at fine-tuning three different CNN models, i.e., VGG16, VGG19, and Inception V3, on five categories of rooms from the Places205 dataset, in order to select the best model for real experiments on a NAO robot. In addition, all models were examined in this phase after cleaning the dataset by removing all images unrelated to the scenes, and the results were compared before and after cleaning. In phase 2, the goal was to design multi-binary classifiers from the CNN selected in phase 1 and to combine their results through the ECOC algorithm and ECOC-REG. Finally, the goal of phase 3 was to test the selected CNN, CNN-ECOC, and CNN-ECOC-REG through real experiments on a NAO robot and to compare the results. The importance of this phase is to show how these models perform on images captured by robots such as the NAO, whose viewpoint is quite different from that of the images in the existing dataset. Phases 1 and 2 are explained in detail in this section, whereas phase 3 is explained in the next section.
4.1. Scene Dataset
Several scene datasets have been proposed in the literature for addressing object/scene detection or classification problems. Some are small-scale datasets such as the 15-scene dataset, UIUC Sports, and CMU 300, and some are large-scale datasets such as the 80 Million Tiny Images dataset, PASCAL, ImageNet, LabelMe, SUN, and Places [39]. A dataset may contain 3D scenes, such as SUNCG [40], or images of a particular environment with geo-referenced pose information for each image, such as TUM and NavVis [41]. These datasets can be classified into two view types: object-centric datasets, e.g., ImageNet, and scene-centric datasets, e.g., Places [8]. Places is the latest and largest scene-centric dataset, provided by the MIT Computer Science and Artificial Intelligence Laboratory for the purpose of CNN training. It has a repository of around 2.5 million images classified into 205 categories, and for this reason it is called the Places205 dataset. This dataset was later updated and extended with more images classified into 365 categories, referred to as Places365 [42].
Since this project is within the scope of household robotics applications, five categories of images were downloaded from Places205 to address the room-scene classification problem for social robots using CNN models. The five categories are bedroom, dining room, kitchen, living room, and bathroom, which most, if not all, houses have. It should be noted that a corridor category was not available in Places205 or Places365 at the time of this work; this category is important for this research and will be incorporated into the design once it becomes available. A total of 11,600 images per category were used to train the CNN models, of which 20%, i.e., 2320 images per category, were used for validation.
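To make the data preparation concrete, the following is a minimal Keras sketch of the 80/20 training/validation split described above. The directory layout, class-folder names, batch size, and VGG-style preprocessing are illustrative assumptions and are not details specified in the paper.

```python
# Minimal sketch of the 80/20 train/validation split described above.
# Folder names and paths are illustrative, not taken from the paper.
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.vgg16 import preprocess_input

DATA_DIR = "places205_rooms"   # hypothetical layout: one sub-folder per class
CLASSES = ["bathroom", "bedroom", "dining_room", "kitchen", "living_room"]

datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
                             validation_split=0.2)   # 20% held out for validation

train_gen = datagen.flow_from_directory(DATA_DIR, target_size=(224, 224),
                                        classes=CLASSES, class_mode="categorical",
                                        batch_size=32, subset="training")
val_gen = datagen.flow_from_directory(DATA_DIR, target_size=(224, 224),
                                      classes=CLASSES, class_mode="categorical",
                                      batch_size=32, subset="validation")
```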
Cleaning Data
The Places dataset is regarded as very important in the field of computer vision and deep learning. However, some of the downloaded images are problematic for real-time robotic applications and affect the learning process. Therefore, we manually excluded some images from all five categories, based on criteria illustrated by the examples in Figure 4. Table 1 shows the percentage of the data that were deemed irrelevant in each category. After cleaning the data, we noted that the bedroom category retained the most images and the kitchen category the fewest (Table 1). The reader might note the high percentage of irrelevant images in each category, which justifies the need for cleaning the data.
4.2. Multi-Class Room Classification Experiments Using Fine-Tuned Pre-Trained Models
Since our main concern was to recognize only five room classes, we did not require huge amounts of data. Additionally, the features learned by pre-trained models in the literature are relevant to the room classification problem; therefore, fine-tuning pre-trained models was a better strategy for this work than training a CNN model from scratch. Fine-tuning is achieved in two main steps. The first step is to pass the room images through the non-trainable ConvNet in order to extract features, and then use these features to train the new classifier, i.e., the softmax layer. The second step is to retrain the whole network, i.e., the ConvNet and the classifier, with a smaller learning rate, while freezing a few layers of the ConvNet.
All experiments were run on the Graham cluster provided by Compute Canada [43] using the Keras API, a deep learning library written in Python [44]. Several CNN models were fine-tuned for this project, i.e., VGG16, VGG19, and Inception V3, with different numbers of frozen layers. All these CNN models were followed by a similar fully connected (FC) head, which begins with an average pooling layer, followed by a layer of 1024 neurons with the relu activation function, and ends with a logistic layer to predict one of the five classes. Keras provides a compile method with different optimizers for the learning process. In the first stage of fine-tuning, the adam optimizer was used with a learning rate of 0.001, whereas in the second stage we applied an SGD (stochastic gradient descent) optimizer with a learning rate of 0.0001 and a momentum of 0.9. All models were trained for 10 epochs in each stage, with both the original and the cleaned data, and with different numbers of non-trainable layers. It was noticed that training with more epochs did not noticeably improve the final accuracy but considerably lengthened the training process.
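For illustration, the following Keras sketch outlines the two-stage fine-tuning procedure described above for the VGG16 case with 0 frozen layers in the second stage. The classifier head, optimizers, learning rates, momentum, and epoch counts follow the text; the data generators (train_gen and val_gen, as in the earlier data-loading sketch) are assumptions.

```python
# Sketch of the two-stage fine-tuning described above (VGG16, 0 frozen layers
# in stage 2). train_gen / val_gen are assumed from the data-loading sketch.
from keras.applications import VGG16
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model
from keras.optimizers import Adam, SGD

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# New head: average pooling -> 1024 relu neurons -> 5-way softmax layer.
x = GlobalAveragePooling2D()(base.output)
x = Dense(1024, activation="relu")(x)
out = Dense(5, activation="softmax")(x)
model = Model(inputs=base.input, outputs=out)

# Stage 1: freeze the ConvNet and train only the new classifier (adam, lr 0.001).
for layer in base.layers:
    layer.trainable = False
model.compile(optimizer=Adam(lr=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit_generator(train_gen, epochs=10, validation_data=val_gen)

# Stage 2: unfreeze all layers (0 frozen) and retrain with a smaller learning rate.
for layer in base.layers:
    layer.trainable = True
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit_generator(train_gen, epochs=10, validation_data=val_gen)
```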
Table 2 shows that all models trained with the clean data outperform those trained with the full data. The best results from these experiments were obtained by VGG19 and VGG16 with 0 frozen layers using clean data, which give accuracies of 93.61% and 93.29%, respectively.
4.3. Binary-Class Room Classification Experiments Using a Matrix Code
The two best models from phase 1 were VGG19 and VGG16 with all layers fine-tuned. Although this work was carried out on one of the Compute Canada servers, i.e., Graham, many robotic applications must be processed on local machines. Training time is therefore an important factor in this phase, which requires training multiple binary classifiers. For this reason, the model selected for this phase was VGG16 with 0 frozen layers, whose accuracy of 93.29% is very close to the best one. The binary classifiers can be designed by grouping classes based on an exhaustive matrix code, as explained in Reference [9]. The following 5 × 15 matrix is the best for this experiment, as it does not contain any repeated or complemented columns.
Let us take as an example the classifier in column 3 of matrix M, which has the values [1 0 0 1 0]. The first group for this classifier consists of the classes with value zero, i.e., bedrooms, dining rooms, and living rooms. The second group consists of the classes with value one, i.e., bathrooms and kitchens.
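For clarity, the following sketch shows one way to build a 5 × 15 exhaustive code matrix (all distinct, non-trivial, non-complementary binary columns), derive the two class groups for any column, and decode the 15 binary predictions by minimum Hamming distance. The class ordering and column ordering are illustrative assumptions; the actual matrix M used in the experiments is not reproduced here.

```python
# Sketch of a 5-class exhaustive ECOC code and its decoding. The column order is
# illustrative; only the construction principle (no repeated or complemented
# columns) follows the text.
import itertools
import numpy as np

CLASSES = ["bathroom", "bedroom", "dining_room", "kitchen", "living_room"]

# All length-5 binary columns, excluding all-zeros/all-ones and complements,
# leave exactly 15 unique columns (one binary classifier per column).
columns = []
for bits in itertools.product([0, 1], repeat=5):
    col = np.array(bits)
    if col.sum() in (0, 5):
        continue                                   # skip trivial columns
    if any((col == 1 - c).all() for c in columns):
        continue                                   # skip complements
    columns.append(col)
M = np.stack(columns, axis=1)                      # 5 x 15 code matrix, one row per class

# Grouping for one classifier: e.g., a column [1 0 0 1 0] puts bathroom and
# kitchen in the "1" group and the other three classes in the "0" group.
def groups(column_index):
    zeros = [c for c, b in zip(CLASSES, M[:, column_index]) if b == 0]
    ones = [c for c, b in zip(CLASSES, M[:, column_index]) if b == 1]
    return zeros, ones

# ECOC decoding: each binary classifier outputs 0 or 1; the predicted class is
# the row of M with the smallest Hamming distance to the 15-bit output word.
def decode(bit_word):
    distances = (M != np.array(bit_word)).sum(axis=1)
    return CLASSES[int(np.argmin(distances))]
```

With this construction, a test image is assigned the class whose code row is closest to the 15 predicted bits, so a few individual classifier errors can still be corrected.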
Table 3 shows the validation accuracies of all 15 binary fine-tuned VGG16 classifiers. The main advantage of the binary classifiers is their high accuracy, which varies with the chosen grouping and reached 98.5% in this project; the average over all 15 classifiers was 95.37%, which is still higher than the multi-class approach.
4.4. Discussion
One of the most important properties studied by researchers is depth, which is significant in most well-known CNN architectures. It has been reported that the deeper and wider the architecture, the better the features it learns [35]. However, this requires a huge dataset, i.e., millions of samples, in order to avoid overfitting. Unfortunately, this is not always available in robotics applications: in most cases, the number of classes for a specific robotic problem such as room classification is very limited, which means the number of samples may be only in the thousands, i.e., a small dataset. Therefore, a very deep CNN architecture will most likely lead to overfitting, as happened with ResNet [45] in this work, which is why its results are excluded from these experiments. Although the Inception V3 results did not overfit, they were less accurate than the shallower architectures, i.e., VGG16 and VGG19. The reason might be related to the scene-centric nature of the dataset, in which learning hierarchical representations can become more difficult with greater depth. For this work, we had two ways to address this problem for a robotic application: either design a new CNN architecture for a small dataset, or adopt an existing, shallower CNN and improve the robot's decision by integrating another method. The latter solution was preferable, so we adopted VGG16, i.e., the shallowest of the three architectures, and integrated it with ECOC in a way that is practicable for real-time robotic implementation. One could ask why we adopted a CNN in the first place: as discussed before, a CNN extracts and learns features from raw images without requiring careful engineering design, and it has shown superiority over conventional approaches in computer vision.
The binary classifier results for ECOC illustrate the challenge of feature similarity among a house's rooms. Consider the clear "class vs. all" results of classifiers 1, 8, 12, 14, and 15. The most distinguishable rooms are the bathroom and the kitchen, as shown in classifiers 1 and 14, respectively, whereas the other three classes are very similar to one another. There are several reasons, related to the dataset or to the VGG16 architecture, why the model is less accurate on these similar rooms than on the distinguishable ones. The first is the presence of shared objects in different rooms, such as tables or TVs, together with the wide variety of room styles, such as open or closed spaces, or even culture-specific styles, e.g., sleeping arrangements without beds. The second is that a shallower architecture may not be able to differentiate between objects of similar shape in different rooms, e.g., the rectangular shapes of dining tables, coffee tables, and beds. There is therefore a tradeoff between learning deep features and having a small dataset. For our future work, we will consider models that associate scenes with objects for better robot decisions.
6. Conclusions and Future Work
This paper addressed the room classification problem for social robots. The CNN deep learning approach was adopted for this purpose because of its superiority in object detection and classification. Several CNN architectures were examined by fine-tuning them on five room classes of the Places dataset, in order to find the best model for real-life experiments. VGG16 was found to be the most suitable model, with 93.29% validation accuracy after cleaning the dataset by excluding all mislabeled images. In addition, we proposed and examined a combination of the CNN with ECOC, a multi-binary classifier approach, in order to reduce errors in practical prediction. The validation accuracy reached 98.5% for one of the binary classifiers and averaged 95.37% over all binary classifiers. The CNN and the combined CNN and ECOC model in both forms, i.e., CNN-ECOC and CNN-ECOC-REG, were evaluated practically on a NAO humanoid robot. The results show the superiority of the combined model over the regular CNN.
Several challenges should be considered in future work. First, real-time experiments for domestic robots need their own dataset, compatible with the heights of most social robots. For example, images captured by NAO robots are taken at a level of roughly 0.5 m above the floor, which implies that real-time prediction will be negatively affected if the model is trained on the existing datasets. Second, a final prediction based on a single image is practically insufficient for social robots, as the captured images depend on the viewing angle and other image factors, e.g., resolution or lighting. Therefore, combining many images of the same room, or many frames from a video, is important for reaching the right decision. Third, corridors in houses are an important class that should be added in future work. Lastly, detecting and locating doors through the same CNN architecture is significant for the purpose of indoor navigation.