Head Pose Detection for a Wearable Parrot-Inspired Robot Based on Deep Learning

Extensive research has been conducted in human head pose detection systems and several applications have been identified to deploy such systems. Deep learning based head pose detection is one such method which has been studied for several decades and reports high success rates during implementation. Across several pet robots designed and developed for various needs, there is a complete absence of wearable pet robots and head pose detection models in wearable pet robots. Designing a wearable pet robot capable of head pose detection can provide more opportunities for research and development of such systems. In this paper, we present a novel head pose detection system for a wearable parrot-inspired pet robot using images taken from the wearer's shoulder. This is the first time head pose detection has been studied in wearable robots and using images from a side angle. In this study, we used AlexNet convolutional neural network architecture trained on the images from the database for the head pose detection system. The system was tested with 250 images and resulted in an accuracy of 94.4% across five head poses, namely left, left intermediate, straight, right, and right intermediate.


Introduction
The field of robotics has found numerous applications in recent years. Earlier, the predominant use of robots was in performing tedious and repetitive tasks, such as manufacturing and transporting. However, a new generation of robots with more intelligence has been shown to benefit several other industries, including service, medical, and entertainment. In particular, robots capable of interacting with humans with natural behavior have begun to emerge extensively due to their closeness with humans.
Several methods have been explored to establish human-machine interaction (HMI) in the literature. Gesture recognition pertains to recognition of human expressions through their hands, head, and/or body movements. In recent years, gesture recognition has been one of the most focused research areas, where new methods and applications of interacting with medical, service, and entertainment devices have been studied. For example, Greatzel et al. [1] developed a system to replace standard computer mouse operation with hand gestures using a computer vision algorithm. The system is designed to establish non-contact human-computer interaction (HCI), which helps through the face recognition system. Several other robots, such as ROBITA, Robonaut, and Leonardo, use gestures to interact with humans [8].
Head pose detection is one of the gesture recognition techniques used in various applications. It has been used in fields such as robotics [9], computer engineering [10,11], physical science and the health industry [12], natural sciences [13], and industrial and academic areas [14][15][16]. As an illustration, Sileye and Jean-Marc [17] deployed head pose detection using the Hidden Markov Model to recognize the visual focus of attention of participants in meetings. Eric and Mohan [18] presented a vision-based head pose detection and tracking method for monitoring driver awareness. The authors propose this method to monitor driver alertness and head pose orientation while driving. Javier and Patricio [19] proposed head pose detection between robots to decide the next action by an observing robot. In this study, the authors used two robots, one acting as a performer and the other as an observer. The performer robot changes its head pose, which is processed by the observer robot to perform actions. Head pose detection and recognition systems have found wider application areas such as face recognition, action recognition, gait recognition, head recognition, and hand recognition systems [20][21][22][23]. In such systems, several sensors like binary, digital, and depth cameras are used to train and detect postures [24][25][26][27][28][29][30]. In addition, several machine learning feature extraction algorithms and classification methods are implemented for detection and recognition of gestures. For instance, Samina et al. [31] analyzed several feature selection and extraction methods, and presented their effectiveness in achieving high performance of learning algorithms. A real-time tracking system for human pose recognition was proposed by Jalal et al. [32] using ridge body part features, in which a support vector machine (SVM) was used to recognize different poses. In another study, a novel subspace learning algorithm, called discriminant simplex analysis (DSA), was developed by Fu et al. [33], in which the intraclass compactness and interclass separability were measured by distances.
Even though head pose detection has been studied in robotics for many years, there are applications which can be further studied to effectively use a head pose detection method. Pet robots are one of the areas where it can be very useful in establishing human-robot interactions, but they have received little attention in the literature. The applications of pet robots are manifold, ranging across the medical, service, and entertainment industries. For example, a pet robot that we have developed has been used to reduce stress levels of patients [34], improve learning abilities of children [35], and entertain participants [36]. With such multi-industry applicability, pet robots with head pose detection can further improve closeness with humans and create effective human-robot interaction models. Even though several pet robots have been designed and developed for various needs in the literature, there is a complete absence of wearable pet robots and human-robot interaction models in wearable pet robots. With known benefits of wearable pets [37,38], designing a wearable pet robot can provide extensive research and application opportunities in several sectors. In this paper, we present the design and development of a wearable parrot-inspired pet robot, KiliRo, and its human-robot interaction model using vision-based head pose detection. The novelty of this paper is threefold: First, we introduce a new design and development of a wearable parrot-inspired pet robot. Second, we provide the design of a human-robot interaction model for wearable pet robots using a vision-based head pose detection method. Third, we quantitatively demonstrate the success of this system through head pose images captured from the robot wearer's shoulder in five different orientations.
The remainder of this paper is organized as follows: After presenting the system architecture of our KiliRo robot in Section 2, we outline our system consisting of methods for detecting and perceiving head pose orientation of the person wearing the robot (Section 3). In Section 4, we present the experiments involving 1380 images of head poses in five different orientations to validate our approach. Lastly, in Section 5, we conclude this study and discuss future work.

Robot Architecture
The main scope of this research study is to design and develop a wearable pet robot that can mimic the head pose orientation of the wearer. In terms of morphology, the KiliRo robot can be defined as a two-legged wearable robot, having a physical appearance that resembles a parrot.
We considered a set of design constraints in deciding the dimensions of the robot during the concept generation process:
• Head rotation of 180 degrees;
• Operation between 10 °C and 45 °C.
After a series of brainstorming sessions on concept generation and selection, we developed a wearable pet robot, KiliRo, centered on achieving a head design capable of 180 degrees of rotation. The curvature of the robot's leg design was optimized to create a wearable robot design. The dimensions and weight of the robot played a vital role in the wearable robot design, since the robot is carried on the wearer's body. The dimensions of KiliRo-W and the commercial devices, such as servo motors, electronic boards, etc., were chosen to fit the robot design constraints on size and weight. The robot has three parts: head, body, and wings. The neck part connects the head and body. A static tail is attached at the top of the head for aesthetic appeal. The feet were inspired by those of parrots and were modified to adapt to the wearable design, which can fix to the body part. The robot parts were designed to be hollow to minimize the weight and optimize the use of three-dimensional printed materials. The specifications of the mechanical properties of the wearable parrot robot are listed in Table 1. The robot's head was mounted with two servo motors (SG90, manufactured by TowerPro) to provide pitch and yaw motions. Even though the robot can achieve pitch motion on its head, it was not deployed during this study. The robot uses the camera mounted on its head to capture the wearer's head position and processes the images using a Raspberry Pi 3 small computer to detect the head pose and actuate its own head accordingly. The list of hardware used in the robot is presented in Table 2. A TREK Ai-Ball portable Wi-Fi camera was used as the imaging sensor for the KiliRo robot. The Wi-Fi camera can capture images at 30 Hz with a maximum resolution of 640 × 480. The camera possesses a focal range from 20 cm up to infinity with a view angle of 300°.

System Overview
Considering the constraints on power, size, and computational complexity, we explored a monocular vision embedded machine learning framework for head pose recognition. We modeled head pose identification as an object classification problem in the domain of computer vision. A machine learning based object classification paradigm comprises three steps. Feature identification and extraction is the first and foremost of the three, followed by feature description, which translates the extracted feature to a mathematical form where it can be used as an algebraic operand, and finally a classification model trained on numerous feature descriptors is used for the classification. An alternate approach to the above-mentioned classification scheme is the Artificial Neural Network (ANN), which works analogously to the biological brain. The main advantage of the ANN over the typical classification scheme is the absence of a dedicated feature extraction method for classification. An artificial neural network learning model consists of numerous layers stacked one over the other. Each layer consists of several nodes that encapsulate an "activation function". The activation function decides whether the respective neurons should be "fired or not". The information for training and testing is passed into the ANN classification model through the "input layer"; the input layer links to the subsequent hidden layers, and all layers converge to the fully connected layer and finally to the output layer. Both classification schemes mentioned require an enormous amount of training data to ensure the accuracy of the classification system. Typically, an ANN requires comparatively more data to yield a better classification result than an SVM. However, if we consider the computational complexity induced by both the ANN and SVM classification models, the ANN has an upper hand over the SVM because of the absence of complex computation induced by feature extraction.
Furthermore, ANN algorithms are friendlier to implement on a computing device that allows parallel execution. From the perspective of real-time implementation of the classification system, computational complexity is one of the critical challenges to address. In the scenario of a parrot-inspired robot, the development of a classification system should acknowledge the following two factors: First, the system should be realizable with minimal requirements for computation. Second, the system should be realizable with minimal usage of sensors for data acquisition. Since we focus on the development of a wearable robot, the structural weight, energy consumption, size, etc. have to be minimized. Hence, the usage of a small computation device requires the development of a simple and easily deployable head pose detection module. An effective way to address the above-mentioned challenge is the usage of a neural network based on an iterative learning method for classification, since it does not require dedicated feature extraction and description processes.
Extensive comparisons of different classification scenarios on SVM and artificial neural networks are reported in the literature. Support vector machines excel in performance over artificial neural networks in terms of time efficiency and accuracy in classification. However, the performance gap between neural networks and SVM in classification problems is almost negligible. Hence, in developing a head pose detection module for a wearable robot, we made a trade-off between performance and computational complexity. We modeled head pose detection as an image classification problem and used convolutional neural networks. Related research has been reported on head pose orientation estimation. For instance, Rainer [39] used neural networks for head pose estimation to evaluate the participants in a workshop. Voit, Nickel, and Stiefelhagen [40] used the same approach for estimating the pan and tilt orientation on synthetic, high-resolution head images. They also used the neural network method for estimating horizontal head orientation on seminar recordings captured with multiple cameras from different viewing angles.
Even though deep learning based head pose detection is extensively studied, its application on a wearable robot has barely been explored. Identifying the head pose from images taken from the shoulder of the target person is the novelty of this research work. The two major phases involved are head pose database generation and head pose classification. The database generation phase is done by acquiring head pose images from different persons. The images are taken using a camera mounted inside the eye socket of the robot. The user wears the robot on the left shoulder, and multiple images of the user's head are taken.
We remotely accessed the embedded computer using the SSH (Secure Shell) protocol and captured images of the head pose while the user oriented the head towards the left, right, left-intermediate, right-intermediate, and straight positions. Figure 1 explains the various steps involved in the identification scheme using convolutional neural networks. We implemented the classification in C++ linked with the Caffe library on a Linux platform. The training of the neural network was done on a Linux PC with a GPU, and the trained model was stored in the embedded computer (Raspberry Pi). The learning of neural networks demands high computational resources. However, the deployment and testing of a trained neural network does not require high computational power. Hence, we used a workstation with an Nvidia GeForce GT-730 GPU to accelerate the training process and generate the deployment file after the training process. After the training of AlexNet, the deployment file was transferred to the micro-computer of the parrot robot for real-time usage. The flow of the overall system is illustrated in Figure 5.

Database Generation
The process of database generation involves acquisition of the images required for the classification and training of deep neural networks. Here, our intention was to classify the images captured by the robot into five classes: Left, Right, Straight, Left Intermediate, and Right Intermediate. We captured 250 images for each class from five subjects in an indoor setting at our university laboratory with sufficient lighting. It took about 15 min to capture 250 images. We mounted the camera in the parrot robot's eye socket and captured the images of the head pose while a person was wearing the robot. To generate diversity in the database, for each class we acquired images from four different individuals wearing the parrot robot. The number of classes is scalable and not limited to five. In this case, we chose five different head poses that were visually distinct. In total, we collected 1000 images (labeled as 250 images from each class). The KiliRo robot uses an Ai-Ball camera for capturing the images of the head poses. Having a resolution of 640 × 480 (VGA), a focal length of 200 mm, a frame rate of 30 fps, and a view angle of 60 degrees, this camera met the requirements for our head pose detection needs. As the Ai-Ball camera is compact (30 mm in diameter and 35 mm in length) and lightweight (less than 100 g), it was a good choice for our robot. This compact design makes it more suitable for wearable robots with limited on-board computational power. Figure 6 illustrates examples of the five head poses used in our study.
Before generating the image database, all the images are converted to gray scale and resized to a 60 × 60 matrix. For use during the training phase, we computed the image mean for the created dataset. To boost the performance of the neural network, the calculated mean image was subtracted from the individual images in the dataset during the training phase. Subtracting the mean image ensures that our data have a zero mean. For instance, a training dataset X with five images can be represented as:
$$X = \{x_0, x_1, x_2, x_3, x_4\} \qquad (1)$$
where $x_i$ (i can take values 0, 1, 2, 3, 4) represents the individual image in the database.
From (1), the mean of the data is calculated as:
$$\bar{x} = \frac{1}{5} \sum_{i=0}^{4} x_i \qquad (2)$$
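The mean-image subtraction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Caffe-based implementation used in the study: the function names `mean_image` and `subtract_mean` are hypothetical, and tiny 2 × 2 arrays stand in for the 60 × 60 gray scale images.

```python
"""Mean-image subtraction on a toy dataset (illustrative sketch only)."""


def mean_image(images):
    """Return the pixel-wise mean of equally sized gray scale images."""
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / n for c in range(cols)]
            for r in range(rows)]


def subtract_mean(image, mean):
    """Subtract the mean image from a single image, pixel by pixel."""
    return [[p - m for p, m in zip(img_row, mean_row)]
            for img_row, mean_row in zip(image, mean)]


# Hypothetical stand-in data: five 2 x 2 "images" x_0 ... x_4
# (assumption for illustration; the paper uses 60 x 60 images).
dataset = [[[float(i), float(i + 1)], [float(i + 2), float(i + 3)]]
           for i in range(5)]

mean = mean_image(dataset)
centered = [subtract_mean(img, mean) for img in dataset]

# After centering, the pixel-wise mean of the dataset is zero.
centered_mean = mean_image(centered)
```

The same mean image computed on the training set would normally be reused to center images at test time, so that training and deployment see inputs on the same scale.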

Training and Classification
The head pose detection was performed in the classification phase. Convolutional neural networks (CNNs) are made up of multiple convolution and pooling layers followed by one or more fully connected layers. A series of convolution and pooling processes in the hidden layers extracts translation- and rotation-invariant features from the input image. The advantage of CNNs over other classification methods is their simplicity in training and the small number of parameters to consider while training. We used the AlexNet CNN architecture trained on the images from our database to implement the head pose identification system. The AlexNet network consists of five convolution layers and two fully connected layers (Figure 7). In between, there are max pooling layers. Table 3 shows the AlexNet architecture.
The activation function used for the convolution layers was the ReLU (Rectified Linear Unit). Besides ReLU, we used a softmax function at the fully connected layer before the output. The softmax function normalizes the outputs of the units in the fully connected layer to the range between 0 and 1 such that the outputs sum to unity. The network had five outputs corresponding to the five classes (head poses). Hence, the output neuron corresponding to the class of the input image receives the highest softmax value. The neural network architecture given in Table 3 is represented as protocol buffer data. Protocol buffers are a language-neutral, platform-neutral mechanism that enables fast and simple serialization of structured data. Before training, the databases of head pose image sets were converted to the Lightning Memory-Mapped Database (LMDB) format. The neural network was trained on a CPU without GPU support. Training was performed with a batch size of 128 using stochastic gradient descent (SGD). We trained the network for 400 epochs at a learning rate of 0.0001. Figure 8 shows the learning rate vs. each epoch. The loss in the system was calculated by a forward pass of the network. The figure shows the loss during each epoch of the training phase. The mean image generated from the dataset is shown in Figure 9.
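As a minimal sketch of the two activation functions mentioned above, the following Python snippet applies ReLU to single values and softmax to a vector of five output scores. The score values and the order of the class labels are assumptions made for illustration; they are not taken from the trained network.

```python
import math


def relu(x):
    """Rectified Linear Unit: pass positive values, clamp negatives to zero."""
    return max(0.0, x)


def softmax(scores):
    """Map raw scores to values in (0, 1) that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label order, for illustration only.
CLASSES = ["left", "left intermediate", "straight", "right", "right intermediate"]

# Hypothetical raw scores from the last fully connected layer.
scores = [0.5, 2.0, -1.0, 0.1, 1.2]

probs = softmax(scores)
predicted = CLASSES[probs.index(max(probs))]  # class with the highest probability
```

Because softmax preserves the ordering of the scores, the predicted class is simply the one with the largest raw score; the normalization only makes the outputs readable as class probabilities.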

Results
After creating the database for head pose detection, the testing phase was performed. As in the learning phase, the images required for the testing phase were taken with the same camera at the same position on the wearer's shoulder. The illustrations of five head pose orientations in the experimental phase are presented in Figure 10. Table 4 presents the results of the head pose detection system tested with 250 images. Overall, the system resulted in an accuracy of 94.4%.
The results were analyzed by determining the probability of obtaining the result if performance was simply random. For that purpose, data can be presented as a number of correct responses (i.e., turning the head in the same direction as the human) and a number of incorrect responses (i.e., turning the head in the wrong direction). As there are five possible directions, the probability of getting the direction correct by chance is 0.20. For three of the required responses, Class 0, 2, and 3, the robot emitted the correct responses for 50 out of the maximum number of 50 trials. According to a binomial distribution, the probability of obtaining this result by chance is near 0. For images corresponding to Class 2, performance was not flawless, and the robot emitted the correct response on 48 of the 50 trials. For Class 4 images, responses were the worst, with responses correct for only 38 of the 50 trials. However, even in those two cases, the probability of obtaining this result simply due to chance is near 0. Performance is thus significantly better than chance (p < 0.0001) for all responses.
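The chance-level analysis above can be reproduced with a short calculation. The sketch below computes the overall accuracy from the reported per-class counts (50, 50, 50, 48, and 38 correct out of 50 trials each) and the one-sided binomial probability of getting at least that many correct with a chance level of 0.20. The function name `binomial_tail` is a hypothetical helper, and the order of the counts in the list carries no meaning.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k successes."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))


TRIALS = 50    # test images per class
CHANCE = 0.20  # one of five directions picked at random

# Correct responses per class as reported in the text.
correct_per_class = [50, 50, 50, 48, 38]

accuracy = sum(correct_per_class) / (TRIALS * len(correct_per_class))
p_values = [binomial_tail(k, TRIALS, CHANCE) for k in correct_per_class]

print(f"overall accuracy: {accuracy:.3f}")
for k, p in zip(correct_per_class, p_values):
    print(f"{k}/{TRIALS} correct: P(at least {k} by chance) = {p:.3e}")
```

The counts sum to 236 correct responses out of 250, which matches the reported overall accuracy of 94.4%, and every tail probability falls far below the 0.0001 threshold quoted in the text.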

Conclusions
This paper presented a novel head pose detection system for a bio-inspired wearable pet robot, KiliRo, to classify five head poses from the side angle of the wearer. The presented deep learning based head pose detection system reported an overall accuracy of 94.4%. This is the first time head pose detection was demonstrated on a wearable pet robot. Even though the proposed system works well in bright indoor settings, it delivers sub-optimal performance in outdoor and low-light conditions. This problem can be addressed by upgrading the current imaging sensor to an advanced imaging sensor that delivers better output in low-light conditions. Besides, the motion blur induced when the person moves his/her head quickly is another factor that adversely affects the accurate detection of head pose. The future extension of this research work will focus on integrating a camera with a faster refresh rate and better resolution, at the cost of the additional computational complexity introduced in the system. We also aim to enhance the system with more head pose detection abilities for our parrot-inspired pet robot and conduct real-time experiments involving children and adults to evaluate the effects of such a robot in providing companionship. Furthermore, we will work on extending the existing database of five classes to an advanced database with different subclasses to detect the head orientation in a more detailed manner.