Developing Emotion-Aware Human–Robot Dialogues for Domain-Specific and Goal-Oriented Tasks †

Abstract: Developing dialogue services for robots has been promoted nowadays for providing natural human–robot interactions to enhance user experiences. In this study, we adopted a service-oriented framework to develop emotion-aware dialogues for service robots. Considering the importance of the contexts and contents of dialogues in delivering robot services, our framework employed deep learning methods to develop emotion classifiers and two types of dialogue models of dialogue services. In the first type of dialogue service, the robot works as a consultant, able to provide domain-specific knowledge to users. We trained different neural models for mapping questions and answering sentences, tracking the human emotion during the human–robot dialogue, and using the emotion information to decide the responses. In the second type of dialogue service, the robot continuously asks the user questions related to a task with a specific goal, tracks the user’s intention through the interactions and provides suggestions accordingly. A series of experiments and performance comparisons were conducted to evaluate the major components of the presented framework and the results showed the promise of our approach.


Introduction
Researchers and engineers have been building service robots that can interact with people and achieve given tasks. To deploy practical service robots, two major concerns need to be seriously considered, including the system architecture for launching the services and the creation of the service functions. At present, the services are mostly laboring services, in which robots take actions in the physical environment to assist people. However, robots are now expected to play more important roles in providing domain-specific knowledge services and task-oriented services. To deliver these services, robots communicate with users through a natural way of spoken language because conversation is a key instrument for developing and maintaining mutual relationships. Following our previous studies that adopted a service-oriented architecture to develop action-oriented robot services, in this work we presented a trainable framework for modeling emotion-aware human-robot dialogues to provide the aforementioned services.
Among the many choices of supporting software architecture, some researchers have proposed adopting a cloud-based service-oriented architecture (SOA). SOA is an architectural style based on interacting software components, providing services as fundamental units to design, build and compose service-oriented software systems [1]. A service is a function made available by a service provider in order to deliver results to a consumer. Moreover, services are autonomous, platform-independent entities that can be described, published, discovered and loosely coupled. To deploy different kinds of services effectively and efficiently, researchers have proposed linking SOA to a cloud computing environment. In this way, robots are no longer limited by onboard computation, memory and programming, leading to a more intelligent robotic network. Our former work implemented a cloud-based system to support a variety of user-created services [2,3]. To ensure its expandability and shareability, we constructed a service configuration mechanism and deployed the system on ROS (robot operating system, [4]) computing nodes in practice.
The most common way for achieving natural language-based human-robot interaction is to build a dialogue system to be a vocal interactive interface. Essentially, the dialogue system includes a knowledge base (i.e., dataset) with organized domain questions and their corresponding answers and the dialogue service is to design an accurate mapping mechanism that can correctly retrieve answers in response to the users' questions. The system is performed in a question-answering manner, and most traditional approaches are based on hand-crafted rules or templates. Recently, the deep learning-based methods have been successfully employed to infer neural models for question and answer sentences. These neural systems mainly use a sequence to sequence (seq2seq) model as a backbone to perform mappings from entire sequences of words or characters to other sequences, for example [5,6]. In addition to the dialoguing content, emotion plays a significant role in determining the relevance of the answer to a specific question. By integrating emotion information into the applications, a service system can enable its services to automatically adapt to changes in the operational environment, leading to enhanced user experience.
To enhance the service performance and equip the robot with social competences, in this work, we developed an emotion-aware human-robot dialogue framework extended from our previous research presented in [7], with a series of additional experiments and newly developed dialogue services. This extended framework included two types of dialogue services. One was to enable the robot to work as a consultant to provide domain-specific knowledge services. The main focus was on constructing a deep learning model for mapping questions and answer sentences, tracking the human emotion during the human-robot dialogue and using this additional information to determine the relevance of the sentences obtained by the model. The other was to provide task-oriented dialogue services, which have attracted considerable interest due to their broad applicability for assisting users in achieving specific goals (e.g., booking flight tickets or scheduling meetings). To verify the presented approach, we conducted a series of experiments, described below, to evaluate the major system components. The results showed the effectiveness and efficiency of the presented approach.
The remaining part of this paper is arranged as follows. Section 2 provides the research background and reviews the dialogue-related research work. Section 3 describes the framework, including the functional modules of emotion classification and dialogue response selection, and the deep learning techniques used for modeling. Section 4 presents the experimental outcomes and the performance comparisons of the different methods. Finally, Section 5 concludes the paper.

Related Works
As mentioned previously, at present most of the service robot frameworks have been connected to various cloud-computing environments to exploit their large amounts of resources. Among others, the most representative work is RoboEarth [8], driven by an open-source cloud robotics platform [9]. With this platform, the robots can distribute highly loaded computation to the cloud and access the RoboEarth knowledge repository to download required resources. There are also other platforms developed for cloud robotic systems. For example, Pereira et al. proposed the ROSRemote framework [10], which enabled users to work with ROS remotely to create several applications. More extensive surveys were found in [11,12]. More recently, due to the rapid advances of the Internet of Things (IoT), researchers proposed the concept of the Internet of Robot Things (IoRT) to describe a new approach to robotics [13,14]. In this way, smart devices can monitor events, fuse sensor data from a variety of sources and use local and distributed intelligence to determine a best course of action. This expands the ability of service robots, improves a robot's understanding during the human-machine interaction and leads to a more intelligent robotic network. Moreover, to deal with the scalability problem, researchers have started to extend the cloud computing concept for service robots to edge or fog computing to utilize the resources in a more efficient way [15,16].
Instead of investigating issues related to resource allocation and utilization, this work aimed to develop emotion-aware dialogues for a service robot, in which the most important issues were to recognize the emotions from the user utterances and to generate appropriate machine responses. Many methods have been proposed to solve these problems from different perspectives. Because this work adopted deep learning models to address the above two issues, in the following we discuss the most relevant studies with similar computational methods.
In general, using a deep learning-based approach to develop dialogues, responses are generated based on sequence-to-sequence (seq2seq) neural network models, with an objective function of the maximum-likelihood estimation [17]. This model is to take dialogue modeling as learning a mapping between human utterances and machine responses. The focus is on how to generate a suitable response from a corpus to a human utterance. For the training of dialogue models, generative and retrieval-based methods are often used. Although generative methods have the potential to generate sentences of rich content, current generative models often have the disadvantages of lacking coherence and producing unnatural responses. In contrast, though retrieval-based methods are more restricted, they have the advantage of producing informative and fluent responses. Thus, the retrieval-based methods are more practical. As can be observed, retrieval-based methods rely on the exploitation of a large and varied corpus (human-human or human-machine interactions) [18] and deep learning models have been employed to derive mappings (that is, a selection mechanism) between questions and answers (e.g., [5,19]). The basic seq2seq model consists of two recurrent neural networks (RNNs): one works as an encoder to process the input; the other, a decoder to generate the output. With the characteristic of making predictions based on running texts of varying lengths, the long short-term memory networks (LSTMs) are often adopted to train the answer selection mechanism. This model has now been widely applied to conversation generation and most existing works have mainly focused on developing more advanced techniques (such as decoding strategies or network models) to improve the content quality of the responses. Many neural dialogue systems have been constructed based on this design principle. For example, Serban et al. used a hierarchical LSTM network for a conversation application [20], and Wen et al. proposed a task-oriented model to generate the correct answers in response to the needs of the given dialogue [21]. To overcome the problem of overly general (i.e., safe) responses, Wu et al. proposed a hybrid-level encoder-decoder model, which utilized both word-level and character-level features [22]. Although these models, in theory, are better at maintaining the dialogue state using memory components, they require longer training time and excessive searching for hyper-parameters.
In contrast to the above domain-specific dialogue systems that aim to generate fluent and engaging responses, the other type of neural dialogue system that has attracted a lot of attention is task-oriented [23,24]. Task-oriented dialogue systems need to complete a specific task (to achieve a goal), for example, restaurant reservation, by interacting with users (i.e., a response generation process). Existing task-oriented systems can be divided into two categories: modularized pipeline systems and end-to-end single-module systems. The former decompose the task-oriented dialogue task into modularized pipelines to be solved separately, while the latter use an end-to-end model to produce a sequence of output tokens directly to solve the overall task. End-to-end systems are often superior to pipeline systems, due to characteristics such as global optimization and easier adaptation to new domains. In task-oriented dialogue systems, the most critical component is the goal tracker [25]. The system must update the state of the dialogue according to each user query and the user's intent. Given the current dialogue state, the system can then decide how best to respond to the user to accomplish the desired task.
In addition to employing more sophisticated models and advanced tuning mechanisms towards proper response generation, some recent works attempted to augment the emotional information of the neural dialoguing models to generate more meaningful and humanized machine responses. For example, Zhou et al. presented a model that assumed the emotion category of human utterance was known and taken as an additional input to train a model of responses [26]. Sun et al. adopted a LSTM neural network for conversation modeling [27] in which an emotional category label was added to the encoder, which regarded emotional information as an additional source to the conversational model. Moreover, Asghar et al. discussed the feasibility of employing emotion information to help generate diverse responses [28]. They proposed a model of affective response generation to generate sentences conditioned on emotional word embeddings, affective objective functions and diverse beam search. However, these methods only focused on emotional factors while ignoring content relevance, possibly resulting in a decline in the quality and diversity of a response. The integration of emotion and content is still a challenging task for several reasons. The first is that high-quality emotion-labeled data are difficult to obtain in a large-scale corpus because emotions are subjective and difficult to annotate. Moreover, it is difficult to deal with emotions coherently because balancing grammaticality and the expressions of emotions is needed [29].

The Framework
In this work, we adopted a service-oriented robotic framework that could provide various services and resources and develop emotion-aware dialoguing services. This computing platform included two parts: the on-board processors mounted on the robot side (to handle robot functions requiring fast responses, such as those related to perception and actuation) and the computing nodes located on the cloud side to perform highly loaded computing services (such as service planning and deep learning). To realize the proposed design in practice, we configured the framework with ROS to deliver different types of services. Figure 1 illustrates our robotic system architecture and its ROS configuration. As shown, the Graphics Processing Unit (GPU) acceleration virtual machine (VM) and the cloud parallel computing virtual machine are used to support computation. To provide different services on the cloud, we defined different types of computing nodes in the framework. Through the ROS frame protocol, where the management of data interchange is between nodes, the framework could easily combine different services to launch new functions. The module of service planning was described in our previous work [2,3]. Here, we focused on the dialogue module, in which the major functional components were indicated.
Robotics 2020, 9, x FOR PEER REVIEW 5 of 20

Our framework included two types of dialogue services, one for domain-specific dialogues and the other for task-specific (task-oriented) dialogues. In contrast to the open-domain conversation performed by the general purpose chatbots, the domain-specific dialogue presented here aims to provide knowledge services of a certain domain (e.g., finance or insurance) in a question-answering manner between the user and the robot. In contrast, the task-oriented dialogue service was to achieve the specific goal for a certain task (e.g., restaurant recommendation) by conducting the iterative human-robot dialogue to adapt to the user's intention or preference related to the task goal. This type of service is especially important in the coming conversational commerce.
At present, the functions of user identification and emotion recognition are constructed independently from the dialogue model, mainly because of the lack of a dataset containing complete information of a human face, utterance emotion and dialoguing content. The current strategy was that the identified user was assigned to a certain type of user group and the corresponding model was retrieved to perform dialoguing. Then, the candidate sentences produced by the model were re-ranked (based on the recognized emotion) following a set of hand-crafted rules and the sentence with the highest rank was selected as the robot's response. In this work, we only applied the emotion mechanism to the first type of dialogue (i.e., domain-specific) as a representative example of emotion-aware services. The same approach can also be applied to the task-oriented dialogue service. The major components of our framework are described in the following subsections.
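The re-ranking step described above can be illustrated with a minimal sketch. The hand-crafted rules themselves are not listed in this section, so the emotion names, the `tone` tag on each candidate and the bonus values below are all hypothetical placeholders chosen only to show the mechanism: candidates keep their similarity score from the dialogue model, and a rule table adds a bonus when a candidate's tone suits the recognized emotion.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str     # candidate answer sentence
    score: float  # similarity score produced by the dialogue model
    tone: str     # hypothetical tone tag attached to each answer


# Hypothetical hand-crafted rules: (recognized emotion, answer tone) -> bonus.
TONE_BONUS = {
    ("sad", "comforting"): 0.2,
    ("angry", "apologetic"): 0.2,
    ("happy", "cheerful"): 0.1,
}


def rerank(candidates: List[Candidate], user_emotion: str) -> List[Candidate]:
    """Sort candidates by model score plus the rule-based emotion bonus."""
    def adjusted(candidate: Candidate) -> float:
        bonus = TONE_BONUS.get((user_emotion, candidate.tone), 0.0)
        return candidate.score + bonus

    return sorted(candidates, key=adjusted, reverse=True)


candidates = [
    Candidate("Your policy covers this case.", 0.80, "neutral"),
    Candidate("I am sorry to hear that; your policy covers this case.", 0.70, "comforting"),
]
response = rerank(candidates, "sad")[0].text
```

With these illustrative values, the comforting sentence is selected for a sad user even though its raw similarity score is lower, which is the kind of adjustment the re-ranking rules are meant to achieve.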

Text Processing
In addition to the traditional text processing steps to clean and purify texts, we applied semantic rules to perform sentence segmentation. For example, if a sentence contains a contrastive conjunction such as "but" or "although", the emotion of the entire sentence is usually biased toward either the former or the latter clause. To tackle this problem, this study adopted a set of five semantic rules (selected from those proposed in [30]) to perform more precise sentence segmentation. For example, using one of the rules, "If a sentence contains but, disregard all previous sentiment and only take the sentiment of the part after but," the sentence "I really, really, really wanna go, but I can't." is simplified to "I can't". Details of the rules can be found in [30].
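As an illustration of the "but" rule quoted above, the following minimal sketch keeps only the clause after the last standalone "but". It covers this single rule only; the function name and the handling of punctuation are assumptions made for the example, not the implementation used in the study.

```python
import re


def apply_but_rule(sentence: str) -> str:
    """Keep only the part of the sentence that follows the last 'but'."""
    # \b ensures that words such as "butter" are not treated as "but".
    parts = re.split(r"\bbut\b", sentence, flags=re.IGNORECASE)
    if len(parts) == 1:
        return sentence.strip()
    clause = parts[-1].strip(" ,;")
    # Fall back to the full sentence if nothing follows "but".
    return clause or sentence.strip()


simplified = apply_but_rule("I really, really, really wanna go, but I can't.")
```

The emotion classifier would then receive only the simplified clause, so that the sentiment of the discarded clause does not dilute the prediction.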
After the above sentence segmentation, we employed the Natural Language Toolkit (NLTK, [31]) to build a dictionary consisting of more than 7000 words, from which the most frequent stop words were removed. However, because the dialogue dataset used for building classifiers contained some short responses (such as "He?" and "You?"), the list of stop words was not fully applied, so that these responses were not filtered out. In addition, adverbs such as "more", "most" and "very" act as intensifiers of tone in conversation and were therefore retained. Then, a stemming procedure was performed to strip off word endings, reducing the words to a common core or stem.
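The selective use of the stop-word list can be sketched as follows. The stop-word set here is a tiny hypothetical stand-in for the NLTK list, and the length threshold used to protect short responses is an assumption, since the text does not state how the exceptions were implemented; stemming is omitted from the sketch.

```python
# Tiny hypothetical stop-word list standing in for the NLTK English list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "he", "you",
              "very", "more", "most"}
# Intensifying adverbs that are retained even though they are stop words.
INTENSIFIERS = {"more", "most", "very"}


def filter_tokens(tokens, min_len_to_filter=3):
    """Remove stop words, but keep intensifiers and very short responses."""
    lowered = [token.lower() for token in tokens]
    # Assumption: utterances shorter than the threshold (e.g., "You?")
    # are kept intact so that they do not become empty.
    if len(lowered) < min_len_to_filter:
        return lowered
    return [t for t in lowered if t not in STOP_WORDS or t in INTENSIFIERS]


kept = filter_tokens(["The", "movie", "is", "very", "good"])
short = filter_tokens(["You"])
```

In this sketch "very" survives the filter while "the" and "is" are dropped, and the one-word response is returned unchanged.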
As indicated above, our framework adopted deep learning for model training, and an encoding (embedding) scheme was needed to transform the natural language sentences into vector representations. Therefore, once the word processing procedure was completed, the GloVe method (Global Vectors for Word Representation) was employed to map the words into vectors, due to its high training efficiency [32]. GloVe is trained on aggregated global word-word co-occurrence statistics from a corpus. Through the mapping, the words were represented by real-valued vectors, so that words with similar meanings had similar representations. In this study, the words were mapped into vectors of 300 dimensions. GloVe provides pre-trained word vectors with a vocabulary of 400 k words, trained on a corpus of 6 billion tokens. The vectors were used as the input of the training algorithm to build the model.
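The mapping from a tokenized sentence to the network input can be sketched with a toy lookup table. The three-dimensional vectors below are invented for readability (the study used 300-dimensional GloVe vectors), and the zero-vector treatment of unknown words and the zero padding to a fixed length are assumptions of this example.

```python
def embed_sentence(tokens, vectors, dim, max_len):
    """Return a max_len x dim matrix (list of rows) for one sentence."""
    rows = []
    for token in tokens[:max_len]:          # truncate overly long sentences
        # Assumption: out-of-vocabulary words map to the zero vector.
        rows.append(list(vectors.get(token, [0.0] * dim)))
    while len(rows) < max_len:              # pad short sentences with zeros
        rows.append([0.0] * dim)
    return rows


# Hypothetical toy vectors; real GloVe vectors would be loaded from file.
TOY_VECTORS = {"good": [0.1, 0.3, -0.2], "movie": [0.4, -0.1, 0.0]}
matrix = embed_sentence(["good", "movie", "indeed"], TOY_VECTORS, dim=3, max_len=5)
```

Every sentence is thereby turned into a matrix of the same shape, which is what the convolutional and recurrent layers expect as input.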

Learning Emotion Classifiers
In this work, we trained a deep learning network to recognize emotions from the utterances in dialogues. Figure 2 illustrates the model that includes a convolutional neural network (CNN) followed by a long short-term memory network (LSTM). As shown, the inputs are the dialoguing sentences processed and converted to the vectors by the procedure described above. In this network, three convolutional layers with lengths of three, four and five were arranged to extract the local features of the sentences. Then, the features were combined and served as the input of the next learning layer (i.e., LSTM).
Figure 2. The deep learning model used for the emotion recognition.
It is well known that LSTM can alleviate the vanishing gradient problem in gradient-based machine learning methods. However, the problem can still occur when the sentences are very long and the network needs to be deepened. In this work, we adopted LSTM with ReLU (Rectified Linear Unit [33]) to train a better model, as ReLU has been shown to be effective in overcoming the vanishing gradient problem. Moreover, ReLU has the property of sparse activation, which makes the neural network sparse and helps alleviate over-fitting. In the above learning process, the widely adopted gradient descent optimization algorithm Adam [34] was used as the optimizer.
As shown in Figure 2, we used the "Softmax" activation function, which is widely used in deep learning models, to map the outputs of the neurons into the interval (0, 1). In this way, a probability distribution over the possible classes was obtained and the node with the highest probability was selected as the predicted emotion class. To calculate the error between the predicted class and the actual class, a loss function was used and the weights of the deep neural network were updated accordingly. Here, the function "LabelEncoder" of the machine learning tool scikit-learn (https://scikit-learn.org/) was employed to normalize the class labels, which were converted into one-hot codes (a binary matrix), and the loss function "categorical_crossentropy" of the deep learning framework Keras (https://keras.io/) was employed to perform the numerical calculation.
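To make the output stage concrete, the following sketch reimplements the three operations in plain Python rather than calling scikit-learn or Keras: softmax over the output neurons, one-hot encoding of the class label, and the categorical cross-entropy between the two. The emotion names and logit values are hypothetical and serve only as an example.

```python
import math


def softmax(logits):
    """Map raw outputs to a probability distribution over classes."""
    peak = max(logits)                       # subtract max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, num_classes):
    """Encode a class index as a binary vector."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def categorical_cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


# Hypothetical emotion classes, indexed in sorted order.
labels = ["anger", "joy", "sadness"]
label_to_index = {name: i for i, name in enumerate(sorted(labels))}

probs = softmax([0.2, 2.0, -1.0])            # hypothetical network outputs
predicted = max(range(len(probs)), key=probs.__getitem__)
loss = categorical_cross_entropy(one_hot(label_to_index["joy"], 3), probs)
```

Because the target is one-hot, the loss reduces to the negative log-probability assigned to the true class, so a confident correct prediction yields a loss close to zero.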

Learning Dialogue Models
To develop dialogues for the robot, we adopted the neural language model from our previous works [3,35] for training the answer selection mechanism. Figure 3 illustrates our model that included a LSTM network with a CNN network. The LSTM contained memory blocks in the recurrent hidden layer that could store the temporal state of the network. With this characteristic, this model could better capture information over longer time steps to meet our goal.
For training the deep learning model, the sentences were organized as question-answer pairs. The question sentence Q was encoded into an internal vector form $V_Q$ by the word-embedding procedure described above. To enhance the performance, we established the word2vec [36] weights for the entire corpus and used them as the pretrained model of the embedding layer. The output then flowed to the LSTM and CNN layers. In this procedure, for each question Q there was a corresponding positive answer A+ with a very high probability of being the correct answer among all the answers in the dataset (i.e., the confirmed correct answer). As shown in Figure 3, after the embedding layer, an output vector E was obtained and then passed through the LSTM function to derive a hidden vector L. Here, E can be represented as $\{e_1, e_2, \ldots, e_n\}$, $E \in \mathbb{R}^{n \times d}$, in which n is the maximal sentence length and d is the dimension of the embedding; $W_e \in \mathbb{R}^{v \times d}$ is the embedding weight matrix (v is the number of words in the dictionary), e is the vector embedded for word x and $W_L$ is the LSTM weight matrix.
For performance enhancement, we used the gensim package [37] to establish the weights for the entire corpus and used them as the pretrained model of the embedding layer. Through the LSTM layer described above, one can extract the features of word sequences in the sentences. Furthermore, we connected the tensor L (Equation (2)) to a convolutional layer to extract more complicated features for performance enhancement. As indicated in Figure 3, the "MaxPooling" function was performed and the "tanh" function was used to transfer and output the decoding result. The above two functions have been widely used in deep learning models for language processing [38].
In the model training procedure, the question Q, the correct answer A+ and the wrong answer A− (sampled from the answer space) are encoded into vector representations $V_Q$, $V_{A+}$ and $V_{A-}$, respectively, and the similarities between the question and the two answers are calculated separately. The similarity of two vectors is defined by the measure adopted from [38], which has been shown to offer good performance. In this measure, the parameter γ is 1.0 and c is 1, and $V_A$ is a positive or negative answer (i.e., $V_{A+}$ or $V_{A-}$). Then, the distance between the two similarities (meaning the difference between an answer and the ground truth) is compared to a pre-defined margin m. If the distance is less than m, the network parameters are updated; otherwise another negative example is sampled until the distance is less than m (a maximum number of sampling steps is often used to reduce the running time). The above operations were to ensure that the similarity distance (to be minimized) could reach a certain level. The loss function corresponding to the above similarity is as defined in [38].
During the human-robot dialoguing period (i.e., the test phase), this dialogue service calculates the similarity between a question sentence (asked by the user) and each answer sentence (in the knowledge base). A set of answers with the highest similarity scores is selected and they are re-ranked by the pre-defined rules. The first-ranking sentence is then used as the robot's response.
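The training criterion and the retrieval step can be sketched in plain Python. The exact similarity formula is not reproduced in this section, so the sketch assumes a measure with parameters γ and c that multiplies an inverse Euclidean-distance term by a sigmoid of the dot product, together with a margin-based hinge loss; both forms, the margin value 0.1 and the toy vectors are assumptions made for illustration.

```python
import math


def similarity(x, y, gamma=1.0, c=1.0):
    """Assumed similarity: 1/(1+||x-y||) * 1/(1+exp(-gamma*(x.y + c)))."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    dot = sum(a * b for a, b in zip(x, y))
    return (1.0 / (1.0 + dist)) * (1.0 / (1.0 + math.exp(-gamma * (dot + c))))


def hinge_loss(v_q, v_pos, v_neg, margin=0.1):
    """Positive only when sim(Q, A+) - sim(Q, A-) is smaller than margin."""
    return max(0.0, margin - similarity(v_q, v_pos) + similarity(v_q, v_neg))


def rank_answers(v_q, answer_vectors, top_k=3):
    """Return the names of the top_k answers most similar to the question."""
    ranked = sorted(answer_vectors.items(),
                    key=lambda item: similarity(v_q, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]


v_q = [1.0, 0.0]                              # hypothetical encoded question
answers = {"a1": [1.0, 0.0], "a2": [0.5, 0.5], "a3": [-1.0, 0.0]}
best = rank_answers(v_q, answers, top_k=2)
```

A zero loss means the positive answer already beats the sampled negative by at least the margin, in which case no update is needed and another negative example is drawn, mirroring the procedure described above. At test time only `rank_answers` is used, and its output is passed on to the rule-based re-ranking.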

Knowledge Enrichment
In addition to the learning model and method, the dataset with the domain questions and the corresponding answers also played a critical role in dialogue modeling, because a rich dataset represents abundant knowledge for a system to interact with human users. It was thus important to include more knowledge resources to enrich the dataset (meaning better conversation comprehension) for the dialogue system of a service robot. Many strategies can be developed to include more knowledge resources (e.g., external knowledge resources) for dialogue modeling. In this work, we used a language translation system to translate a dataset to achieve knowledge sharing between different languages. This method was especially important for developing human-machine dialogues in a resource-restricted language (one for which very few data are available for model training). Here, we translated a dataset from English to Chinese as an example to investigate the corresponding effect.
Word segmentation was a very important sentence preprocessing step in dialogue modeling with the Chinese dataset. This step determines the word boundaries in a Chinese sentence. Because a sentence can be segmented into different combinations of words, Chinese word segmentation is ambiguous. Several segmentation systems have been proposed for Chinese text. Among others, the most often used are CKIP (http://ckipsvr.iis.sinica.edu.tw/), the Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml) and JIEBA (https://github.com/ldkrsi/jieba-zh_TW). Following a preliminary evaluation, we chose the JIEBA system to perform the word segmentation. Then, the word embedding procedure was performed, in which the Wiki Chinese text documents were used to pre-train the word vectors for performance enhancement, and the same type of dialogue model was trained by the deep learning method described in the above section.

Developing Task-Oriented Dialogues
In addition to the above domain-specific dialogue modeling, this section presents the task-oriented subsystem we developed for the service robot to achieve practical applications with specific goals. As mentioned previously, existing task-oriented dialogue methods can be divided into two categories: modularized pipeline and end-to-end single-module systems. Among others, the hybrid code network (HCN, [24]) is a popular and useful end-to-end framework for developing practical task-oriented dialogue applications. It allows a developer to combine a data-driven learning method with knowledge-based hand-coded rules. This approach can learn an RNN with considerably less training data and express domain knowledge via software and action templates. Therefore, in this work, we adopted a simplified HCN framework with some enhanced functions to develop task-oriented dialogue services for the robot. Figure 4 presents our revised framework for the task-oriented dialogues. We also implemented a restaurant recommendation application as an illustrative example. The goal was to request the robot to make a restaurant reservation for a user, given all of the user's constraints on the location, cuisine, price range, atmosphere and party size, which were derived iteratively from the human-robot dialogue.
The operational flow of our revised HCN included four major phases as illustrated in Figure 4. The first phase, which was mainly for text processing, included three steps to extract different types of features from a user utterance. The first step was to extract the context features (entities to be traced).
Since we used the DSTC dataset (Dialog State Tracking Challenge dataset [39]) for network training, the context features here were the same as the original dataset: atmosphere, cuisine, location, party size and price, each with a value of 0 or 1 (as a placeholder in the entity tracking slot). The second step was to extract the words (bag of words) to be representatives and the one-hot encoding scheme was used to form a word vector. The third step was to perform word embedding and here the word2vec was employed. As shown in the figure, in the second phase the text and entities mentioned were then passed to a module of dialogue state tracking, which grounds and maintains entities. In contrast to the original HCN work, we adopted a deep CNN network and defined label ontology to further improve the state tracking performance (described in Section 3.4.1).
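The first two feature-extraction steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the vocabulary, the entity lexicon and the function names are hypothetical, the simple word lookup stands in for the entity tracking step, and the word2vec embedding step is omitted.

```python
# Entity slots tracked as 0/1 context features (names follow the text above).
ENTITY_SLOTS = ["atmosphere", "cuisine", "location", "party_size", "price"]

# Hypothetical entity lexicon and vocabulary, for illustration only.
ENTITY_VALUES = {
    "atmosphere": {"casual", "romantic"},
    "cuisine": {"italian", "french"},
    "location": {"rome", "paris"},
    "party_size": {"two", "four"},
    "price": {"cheap", "expensive"},
}
VOCAB = ["book", "a", "table", "in", "rome", "for", "two", "cheap", "italian"]


def context_features(tokens, seen):
    """Set a slot flag to 1 once any of its values is mentioned; return all flags."""
    for slot, values in ENTITY_VALUES.items():
        if any(tok in values for tok in tokens):
            seen[slot] = 1
    return [seen[slot] for slot in ENTITY_SLOTS]


def bag_of_words(tokens):
    """Binary bag-of-words vector over the fixed vocabulary."""
    return [1 if word in tokens else 0 for word in VOCAB]


def utterance_features(utterance, seen):
    """Concatenate the bag-of-words vector and the context features."""
    tokens = utterance.lower().split()
    return bag_of_words(tokens) + context_features(tokens, seen)


seen = {slot: 0 for slot in ENTITY_SLOTS}
features = utterance_features("Book a table in Rome for two", seen)
```

The `seen` dictionary persists across turns, so a slot flag stays at 1 after the entity has been mentioned once in the dialogue.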
In the third phase, the results obtained from the above phases were concatenated into a feature vector and a traditional LSTM was adopted for training. As shown in Figure 4, the output of the LSTM model was passed to a dense layer with a softmax activation, in which the output dimension was equal to the number of distinct action templates. The output was a distribution over the action templates. In the fourth phase, the action mask was applied and an action was selected accordingly. Then, the selected action was used to produce a fully formed action. Following the above phases, a recommendation module was developed to revise some entities according to the user's preferences. The details are described in Section 3.4.2.
Figure 4. The framework used for the task-oriented dialogues.

Belief Tracker
The belief tracker (i.e., the dialogue state tracker) is an important component of a dialogue system, in which a dialogue state is a full and temporal representation of each participant's intention. A belief tracker can track what has happened using the system outputs, user utterances and context from previous turns. Through intention estimation, it provides a direct way to validate the system's understanding of the user's goal at each dialogue step.
Traditionally, rule-based systems were built for state tracking, but they can hardly model uncertainty. Recently, researchers have turned to developing neural models to overcome the uncertainty in tracking dialogue states. In task-oriented dialogue systems, end-to-end neural networks have been successfully employed for state tracking via interaction with an external knowledge base. However, in task-oriented dialogues, a state tracker is usually trained from a large amount of manually annotated corpora. Considering the huge effort required for human annotation, we used the available dataset for model training and focused on the model performance.
As indicated above, we adopted a simplified HCN model with a dialogue state tracker. However, in some situations the original state tracker could misjudge ambiguous user utterances or misspelled words and produce incorrect answers. For example, when the original HCN tracker analyzes the user utterance "I'm asking my friend if she wants to do Rome", the word "Rome" (entity value) is wrongly taken as the final location, although in fact the decision has not yet been made. This is because the original tracker uses a string-matching method for entity identification, so such mismatches cannot be corrected. As the neural belief tracker was able to deliver a better performance [40], we adopted this method and used a deep CNN model to solve this problem. In addition, a small ontology was established to ensure the semantic correctness of the mentioned entities (i.e., slot values).

Autoencoder
Following the above dialogue flow, we developed a recommender (as shown in Figure 4) to enhance the system performance and user experience. This module refined some entities in the selected response according to the user's preferences. In this work, we used a deep learning-based method and adopted the deep neural network and autoencoder [41,42] to realize collaborative recommendation.

The autoencoder is a powerful tool for dimensionality reduction and can be regarded as a strict generalization of principal component analysis. It is a network that implements two transformations (encoder and decoder), aiming to reconstruct the inputs in the output layer via a low-dimensional latent space and thereby predict the missing ratings. The learning goal is to minimize the error between the original vector (input) and the transformed vector (output). One of the popular autoencoder-based recommendation models is AutoRec [42]. In this model, denoising techniques are used to discover more robust representations and to avoid learning an identity function. These techniques learn latent representations from corrupted user-item preferences, which can then be used to reconstruct the users' full preferences and to reduce overfitting. In this work, our recommender was developed based on AutoRec. The overall architecture is illustrated in Figure 5, in which the encoder, code layer and decoder are the major parts of the model (enclosed in the dotted-line rectangle). Both the encoder and the decoder consist of feed-forward neural networks with fully connected layers, and the depth of the model was increased (marked as the deep stack) to enhance the corresponding performance.
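The reconstruction idea described above can be illustrated with a minimal sketch. It assumes one fully connected layer each for the encoder and the decoder, untrained random weights and a 1 to 5 rating scale with 0 marking a missing rating; all of these choices and names are illustrative, and the training loop, the denoising step and the deep stack are omitted.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, activation):
    """Apply one fully connected layer followed by an activation function."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def reconstruct(ratings, encoder, decoder):
    """Encode a rating vector into the low-dimensional code, then decode it."""
    code = dense(ratings, encoder, sigmoid)
    return dense(code, decoder, identity)


def masked_mse(ratings, observed, prediction):
    """Mean squared error over observed ratings only; missing entries are ignored."""
    errors = [(r - p) ** 2 for r, p, m in zip(ratings, prediction, observed) if m]
    return sum(errors) / len(errors)


rng = random.Random(0)
n_items, code_size = 6, 2
encoder = make_layer(n_items, code_size, rng)
decoder = make_layer(code_size, n_items, rng)

# One user's ratings; 0 marks an item the user has not rated.
ratings = [5.0, 0.0, 3.0, 0.0, 1.0, 4.0]
observed = [r > 0 for r in ratings]

prediction = reconstruct(ratings, encoder, decoder)
loss = masked_mse(ratings, observed, prediction)
```

After training to minimize this masked loss, the decoder outputs at the unobserved positions would serve as the predicted ratings for items the user has not yet rated.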

Experiments and Results
To evaluate the presented emotion-aware dialoguing service for human-robot interaction, several sets of experimental trials were conducted. As mentioned previously, due to the lack of a dataset with full information on the human face, utterance emotion and dialoguing content, in the experiments we used four datasets to evaluate these modules separately. The evaluations are described in the following subsections.

Performance Metrics
In the experiments, we employed the criteria often used in data classification for performance evaluation, and a five-fold cross-validation strategy was also used. We first measured the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), and then used them to calculate the metrics of accuracy (proportion of correctly predicted instances relative to all predicted instances), precision (proportion of retrieved instances that were relevant), recall (proportion of relevant instances that were retrieved) and F-measure (the combined effect of precision and recall, which often conflict in nature) [43]. The metrics are defined as follows:
$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \quad \text{precision} = \frac{TP}{TP + FP}, \quad \text{recall} = \frac{TP}{TP + FN}, \quad F\text{-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}.$$
In addition to accuracy, to evaluate the performance of the answer selection in the dialogue modeling, we adopted the statistical measure MRR (mean reciprocal rank, the average of the reciprocal ranks of results for a sample of n queries). It is defined as
$$\text{MRR} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\text{rank}_i},$$
where rank_i refers to the rank position of the first relevant document for the i-th query.
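The metrics described above can be computed with a few lines of code. The sketch below uses the standard definitions, with the balanced (F1) form assumed for the F-measure; the function names and the example counts are illustrative and are not taken from the experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where each rank is the position of the first relevant answer."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Illustrative counts and ranks (not experimental results).
accuracy, precision, recall, f_measure = classification_metrics(tp=40, fp=10, tn=45, fn=5)
mrr = mean_reciprocal_rank([1, 2, 4])
```

With these example inputs, the accuracy is 0.85, the precision is 0.8 and the MRR is (1 + 1/2 + 1/4) / 3.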

User Identification
In this work, a cloud-based system was built for a service robot and we configured a ROS framework on top of a Linux OS to connect the sensing camera nodes. Often, a system built with ROS consists of a number of processes on a set of hosts, which are connected at runtime in a peer-to-peer topology. Here, the ROS master was a PC running roscore and serving as the resource center for all the other ROS nodes connected to the network. The cloud parallel computing virtual machine had eight CPUs and 8 GB of memory, and the GPU acceleration virtual machine had eight CPUs, 32 GB of memory and an NVIDIA Tesla K80 GPU.
For user identification, the experiments were conducted to evaluate the performance of face recognition. The goal was to train the robot to recognize human faces in a static manner and we adopted OpenCV (https://opcv.org, an open source computer vision library) to train the classifiers. An online face dataset [44] was used. It included 90 image sets of different persons, in which each set included face images taken from different viewpoints, from 90 to −90 degrees (stepping by 5). The results showed that the trained classifiers performed the best in the recognition of the front face images. The faces in the images could be detected correctly with a reasonable rate of accuracy when the variation of the rotating angle was less than 30 degrees, and the faces could be recognized with a good accuracy if the view angle was within the range of 10 to −10 degrees.

Performance Evaluation
To assess the performance of the emotion recognition module, we adopted the dataset used in [45], which was derived from the Movie Dialog Corpus. The sentences in this dataset were categorized into six classes of emotions: fear, disgust, joy, sadness, anticipation and none (neutral). The deep learning approach described in Section 3.2.2 was employed to train a model for multi-class emotion recognition. In addition, two popular learning methods, the random forest (RF) and the support vector machine (SVM) methods, were used for performance comparison.
For RF and SVM, in addition to the word features extracted by the text-processing procedure, we used the n-gram method to extract more text features from the original data for building the classifiers. N-grams can express the sequence relationships between words, and the unigram, bigram and trigram (n is 1, 2 and 3, respectively) models are often used. After a preliminary test, in this work we used these three models to extract more text features, and the combined feature vectors were used as the input of the two machine learning methods (RF and SVM) to enhance their performance. Figure 6a illustrates the accuracy, precision, recall and F-score for each of the three methods. As can be seen, RF performed the best in all the metrics. The main reason could be that RF is an ensemble machine learning algorithm, and the way it sampled the data for its group of classifiers made it perform better than the others on the imbalanced dataset used here.
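The n-gram feature extraction described above can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and raw counts as feature values; the function names are hypothetical, and the actual feature encoding fed to RF and SVM may differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous word sequences of length n, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(sentence, max_n=3):
    """Count unigram, bigram and trigram features for one sentence."""
    tokens = sentence.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


features = ngram_features("I am not happy")
```

The bigram "not happy" is kept as a single feature here, which is the kind of sequence information that unigram features alone would lose.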
After comparing the three aforementioned methods, we applied two data processing techniques to the dataset, namely semantic rules and data balance, together with the above learning methods to investigate their effects on performance. For the semantic rules, the five rules mentioned in Section 3.2.1 were used to perform more precise sentence segmentation; for data balance, we adopted the scikit-learn tool to produce a set of specific class weights for the different types of emotions.
The results for accuracy, precision, recall and F-score are illustrated in Figure 6b. As can be seen, in general our CNN-LSTM method obtained the best results on all performance metrics. In addition to the data balance effect, the reason for the performance improvement could be that the semantic rules removed the irrelevant words and filtered out their effects on the sentence emotions. Thus, the learning methods were able to focus on the emotions delivered by the most related parts of the sentences to be predicted.
Figure 6. Results of the three machine learning methods: (a) without and (b) with the enhanced techniques of the semantic rules and data balance.

Comparisons with IBM Tone Analyzer
In addition to comparing the different machine learning methods, we evaluated a well-known emotion detection system, IBM Watson's Tone Analyzer (https://natural-language-understanding-demo.ng.bluemix.net/), for further comparison. The emotions considered by the Tone Analyzer were slightly different from those defined in this work, and, also unlike our work, it gave degrees (values) of multiple emotions for an input sentence. To conduct the performance comparison, we projected the two sets of emotions (one for our work and one for the Tone Analyzer) into the well-known emotional valence and arousal space (i.e., V-A space, [46]). In this space, valence indicates the hedonic value (positive or negative), ranging from unpleasant to pleasant, and arousal indicates the emotional intensity, ranging from inactive to active. The valence and arousal dimensions can be projected onto a Euclidean space, where emotions are represented as point-vectors. In this way, the user's emotion can be located in this space as a tuple of valence and arousal values.
In the experiments, we first projected the five classes (annotated in the dataset) into the V-A space to retrieve the corresponding valence and arousal values (based on the emotion positions defined in [46]). For each data item (sentence), we took the positions of the actual (correct) class and the predicted classes in the space and obtained the set of V-A values. Then, the emotion values produced by the trained model were taken as class weights and the weighted sum was derived for that data item. In this way, our method and the Tone Analyzer could be compared.
For each data item, we chose the two closest classes (those with the largest weights) and calculated their weighted distance to represent the distance between the predicted and the actual classes. To compare the performances, we divided the distance into eight intervals and counted the number of data items within each interval. Table 1 presents the results, in which x is the weighted distance. As can be seen, in general the results obtained by the presented method were better than those obtained by the Tone Analyzer for the dataset used.
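One possible reading of this weighted-distance computation is sketched below. The V-A coordinates are hypothetical placeholders rather than the positions defined in [46], and normalizing the two largest weights before taking the weighted sum is an assumption, since the exact formula is not given in the text.

```python
import math

# Hypothetical (valence, arousal) coordinates in [-1, 1]; illustrative only.
VA_POSITIONS = {
    "joy": (0.8, 0.5),
    "anticipation": (0.4, 0.6),
    "fear": (-0.6, 0.8),
    "disgust": (-0.7, 0.3),
    "sadness": (-0.7, -0.4),
}


def predicted_point(scores, top_k=2):
    """Weighted V-A position of the top_k classes with the largest weights."""
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(weight for _, weight in top)
    valence = sum(VA_POSITIONS[name][0] * weight for name, weight in top) / total
    arousal = sum(VA_POSITIONS[name][1] * weight for name, weight in top) / total
    return valence, arousal


def va_distance(scores, actual):
    """Euclidean distance in V-A space between the predicted point and the actual class."""
    valence, arousal = predicted_point(scores)
    true_valence, true_arousal = VA_POSITIONS[actual]
    return math.hypot(valence - true_valence, arousal - true_arousal)


# Illustrative model outputs for one sentence whose annotated class is "joy".
scores = {"joy": 0.6, "anticipation": 0.2, "fear": 0.1, "disgust": 0.05, "sadness": 0.05}
distance = va_distance(scores, "joy")
```

Because both systems are mapped to points in the same space, the same distance can be computed for a classifier that outputs one class distribution and for a tool that reports degrees of several emotions.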

Performance of Training a Dialogue Model
The next set of experiments was to examine the system performance of model training in retrieving (selecting) answers. In this series of experiments, a large dataset was adopted [38]. It was collected from the Insurance Library website that included 12,889 questions and 21,325 answers, after a data preprocessing procedure was performed. This procedure was to remove unsuitable data that could not form the proper input question−answer pairs, to clean the irrelevant terms (such as html tags) and to transfer the text context into internal identifiers (to form the vectors). In the experiments, the above dataset was divided into two parts, in which a part of 2000 questions and a part of 3308 answers were used for testing. The complete experiments of dialoguing were described in our previous work [35], and here we focused on reporting the results most related to the model training for human-robot interaction.
As described in Section 3.3, in the model training phase, a positive and a negative answer were needed for each question sentence to constitute a training instance. However, in a real-world application, the correct answer A+ for a question Q can be determined easily (by the confirmation of the person asking the question), while the wrong answers are often not explicitly specified. Therefore, in the experiments here, all other answers in the dataset were considered candidate wrong answers to Q. To find the most suitable wrong answer A− for each question in the dataset, we used the above model training procedure to perform the preprocessing procedure of the wrong answer selection. Due to the large number of answers, in this work we randomly chose ten (instead of all) answers for each question to perform training, in order to reduce the computational time.
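The wrong-answer selection described above can be sketched as follows. The criterion for the "most suitable" wrong answer is not spelled out in this section, so the sketch assumes it is the sampled candidate that the current model scores highest; the word-overlap scoring function, the answer pool and all names are hypothetical stand-ins for the trained model and the real dataset.

```python
import random


def pick_negative(question, positives, answer_pool, score, rng, n_candidates=10):
    """Sample wrong answers at random and return the one the scorer rates highest."""
    wrong = [answer for answer in answer_pool if answer not in positives]
    candidates = rng.sample(wrong, min(n_candidates, len(wrong)))
    return max(candidates, key=lambda answer: score(question, answer))


def overlap_score(question, answer):
    """Toy stand-in for the trained model: share of question words found in the answer."""
    question_words = set(question.lower().split())
    answer_words = set(answer.lower().split())
    return len(question_words & answer_words) / len(question_words)


# Hypothetical answer pool; the real pool is the full set of answers in the dataset.
topics = ["life insurance", "car insurance", "home insurance", "health plans",
          "retirement", "claims", "premiums", "deductibles", "annuities",
          "disability cover", "travel cover", "pet cover"]
answer_pool = [f"hypothetical answer {i} about {topic}" for i, topic in enumerate(topics)]

question = "Does life insurance cover accidents"
positives = {answer_pool[0]}
negative = pick_negative(question, positives, answer_pool, overlap_score, random.Random(42))
```

Sampling ten candidates instead of scoring the whole pool keeps the cost per training instance constant, which matches the stated motivation of reducing computational time.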
In the learning process, a random shuffling strategy was used to combine the correct and wrong answers for each question into the training data. The model and method presented in Section 3.3 were used for training. Figure 7a illustrates the results for the two performance metrics often used in retrieval-based dialogue modeling, accuracy and MRR. Here, the accuracy was in fact the top-one precision mentioned in other relevant studies; that is, the model's predicted result (i.e., the top-scoring answer) must be exactly the expected one recorded in the dataset. As shown, the LSTM-CNN model achieved its best performance with a correct prediction rate of 0.61 and an MRR of 0.70. The results were similar to those presented in the related study [38], whereas the presented method involved a smaller set of parameters and was more efficient in learning. In addition to the LSTM-CNN model, a traditional embedding model (using only the word embedding technique) was implemented for performance comparison. The results are shown in Figure 7b. As presented, the accuracy of the embedding model was 0.12 and its MRR was 0.21. These results indicated that the LSTM-CNN model was more efficient; it obtained a better result within fewer iterations.
In addition to the performance evaluation of the dialogue modeling, we performed another set of trials to examine the performance of shared knowledge obtained by translating a dataset from a different language. In the experiments, the dataset used in the above set of experiments was translated according to the steps described in Section 3.3, and the same deep learning model and method were used for the training. As mentioned previously, in contrast to English sentences, a Chinese sentence can be segmented into various combinations of words by different segmentation methods, and this often leads to different modeling results. Therefore, before evaluating the performance of the model training, we conducted a set of trials to investigate the effect of two popular segmentation methods, Jieba and HanLP, and the results showed that Jieba performed better than HanLP. We thus chose Jieba segmentation to continue the experiments of model training.
The results (i.e., accuracy and MRR) are presented in Figure 8a. As shown in the figure, the LSTM-CNN model achieved its best performance with an accuracy of 0.54 and an MRR of 0.64. Moreover, the traditional embedding model was implemented for comparison and the results are shown in Figure 8b. Similar to the experiments conducted for the original (untranslated) dataset, the results here indicated that the LSTM-CNN model was more efficient than the traditional embedding method.
Compared to the results obtained from the original dataset, the accuracy declined from 0.61 to 0.54. This indicated that the model built from the translated knowledge (i.e., dataset) could not keep the modeling performance at the same level; nevertheless, the results showed that the translated knowledge was learnable with an acceptable performance and was thus useful in building models for the resource-restricted language. The performance could be further improved if more advanced text translation techniques are applied.

Performance Evaluation of Neural Belief Tracker
As mentioned previously, the belief tracker plays an important role in a goal-oriented dialogue system: it tracks each participant's intention from the successive utterances exchanged between the participant and the robot. In this work, we implemented a deep CNN model as a neural belief tracker. The application task was restaurant recommendation through user-robot dialogue. A set of entities was pre-defined and the robot iteratively interacted with the user to obtain all the missing entity values. The DSTC6 dataset was used for training the tracker. In this task (dataset), five entities were tracked, namely cuisine, location, price range, atmosphere and party size, and the system had to infer their corresponding slot values to make an appropriate recommendation. Table 2 lists the values defined for the entities. To achieve the task, a state tracker was trained for each entity. In the experiments, the performance metrics were the accuracy (the number of correct responses divided by the number of turns) and the loss (here, root mean square error). The training performance for all the entities is presented in Figure 9, in which (a) shows how the accuracy improved during the training process and (b) shows the decreasing loss. As shown in Figure 9, an accuracy of 0.9 was obtained after 200 epochs, and the loss converged to a small value after 75 epochs and approached zero by the end of training (200 epochs).
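The two metrics described above can be made concrete with a short sketch. The function names, the toy cuisine values and the five-turn example are illustrative assumptions, not the authors' evaluation code or the values from Table 2; the sketch only shows accuracy as correct responses divided by turns, and the loss as root mean square error.

```python
import math


def turn_accuracy(predicted, expected):
    """Number of correct responses divided by the number of turns."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


def rmse(outputs, targets):
    """Root mean square error between model outputs and target values."""
    if len(outputs) != len(targets) or not targets:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(o - t) ** 2 for o, t in zip(outputs, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical per-turn outputs of a "cuisine" tracker (values are made up).
predicted = ["italian", "italian", "french", "indian", "indian"]
expected = ["italian", "french", "french", "indian", "indian"]

accuracy = turn_accuracy(predicted, expected)  # 4 of 5 turns are correct
loss = rmse([0.9, 0.2, 0.1], [1.0, 0.0, 0.0])  # toy output/target vectors
```

Because one tracker was trained per entity, the same two functions would be applied separately to each of the five entities.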

Performance Evaluation of Autoencoder
The DSTC6 dataset used in the above section for training the neural belief tracker contained only dialogue information, which could not be used for making a recommendation. Therefore, in this section we adopted another public dataset (i.e., Yelp [47]) to evaluate the recommendation performance of the presented model. The original dataset contained a large number of users and their ratings of a set of shops, and it was highly sparse. To achieve our task of restaurant recommendation and to connect the recommendation module to the dialogue system, we chose the relevant data (69,634 users, 41,019 restaurants and 1,817,955 ratings) to evaluate our approach.
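To give a sense of scale, the sparsity of the selected subset can be worked out from the three counts quoted above. The sketch below is only arithmetic on those counts; the function name is an assumption, and it treats sparsity as the fraction of empty cells in the user-by-restaurant rating matrix.

```python
def rating_sparsity(n_users, n_items, n_ratings):
    """Fraction of user-item cells that carry no rating."""
    cells = n_users * n_items
    if cells <= 0:
        raise ValueError("the rating matrix must have at least one cell")
    return 1.0 - n_ratings / cells


# Counts of the selected restaurant subset, as reported in the text.
sparsity = rating_sparsity(n_users=69_634, n_items=41_019, n_ratings=1_817_955)
print(f"sparsity of the selected subset: {sparsity:.4%}")
```

With these counts, well under 0.1% of the possible user-restaurant pairs have a rating, which is why the recommendation model has to predict almost all of the matrix.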
As mentioned above, we revised the autoencoder model through a set of experimental investigations to enhance its performance. The first phase investigated the effect of the code size (the number of nodes in the code layer). A set of code sizes (32, 64, 128 and 256) was evaluated and the results showed that the model obtained its best performance with a size of 32. After this preliminary test, in the second phase we evaluated the performance of different activation functions commonly used in deep learning models, namely ELU, SELU, ReLU, sigmoid and tanh. The loss (root mean squared error) was employed to measure the prediction performance and the results are shown in Figure 10, in which (a) is the training process and (b) is the corresponding test process. Figure 10 indicates that overfitting occurred in all cases and that ELU obtained the best result. We then performed an additional set of trials on dropout (randomly dropping units of the network during training) and chose a dropout value of 0.8 to alleviate the overfitting. The third phase investigated the effect of the number of hidden layers in the deep network. In this set of experiments, we evaluated five different numbers of layers (2, 4, 6, 8 and 10), and the results are shown in Figure 11. As shown in Figure 11, although the model obtained better training performance with more hidden layers, the additional layers also caused overfitting. We thus chose to use six hidden layers in the final experiments for performance comparison.
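The building blocks named in this paragraph can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' model: the text does not say whether 0.8 is the drop or the keep probability (the sketch treats it as the drop probability), and evaluating the RMSE only over observed ratings is an assumption borrowed from common practice with sparse rating data.

```python
import math
import random


def elu(x, alpha=1.0):
    """ELU activation: identity for positive inputs, smooth saturation below zero."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def dropout(values, drop_prob, rng):
    """Inverted dropout: zero each unit with probability drop_prob, rescale the rest."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep = 1.0 - drop_prob
    return [v / keep if rng.random() >= drop_prob else 0.0 for v in values]


def masked_rmse(predicted, observed):
    """RMSE over observed ratings only; None marks a missing rating (assumption)."""
    errors = [(p - o) ** 2 for p, o in zip(predicted, observed) if o is not None]
    if not errors:
        raise ValueError("at least one observed rating is required")
    return math.sqrt(sum(errors) / len(errors))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [elu(x) for x in [-2.0, -0.5, 0.0, 1.5]]
dropped = dropout(hidden, drop_prob=0.8, rng=rng)  # 0.8 read as drop probability
loss = masked_rmse([4.2, 3.1, 2.0], [5.0, None, 2.0])
```

In a full model these pieces would be stacked into six hidden layers around a 32-node code layer, the configuration selected in the experiments above.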
After conducting the above evaluation steps to determine the network parameters, we compared our enhanced approach with other popular collaborative filtering methods, including the well-known autoencoder AutoRec and the latent factor model NNMF (non-negative matrix factorization [48]), which is one of the best models in the relevant studies. In the experiments, for all three methods, the number of epochs was 100, the code size (for our model and AutoRec) and the number of latent factors (for NNMF) were 32, and the learning rate (for our model and AutoRec) was 0.005. The losses (errors) of the proposed model, the AutoRec model and the NNMF method were 1.0868, 1.4758 and 1.1293, respectively. These results showed that the proposed method outperformed the other two methods and can provide better recommendation performance.
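The size of the gap between the three reported losses is easier to judge as a relative reduction. The sketch below is simple arithmetic on the figures quoted above; the helper name is an assumption.

```python
# Losses (RMSE) reported in the text for the three compared methods.
losses = {"proposed": 1.0868, "AutoRec": 1.4758, "NNMF": 1.1293}


def relative_reduction(baseline, candidate):
    """Share of the baseline loss that the candidate removes."""
    return (baseline - candidate) / baseline


reductions = {
    name: relative_reduction(losses[name], losses["proposed"])
    for name in ("AutoRec", "NNMF")
}
for name, value in reductions.items():
    print(f"loss reduction relative to {name}: {value:.1%}")
```

On these numbers the proposed model lowers the loss by roughly 26% relative to AutoRec and by roughly 4% relative to NNMF, so the margin over NNMF is considerably narrower.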

Discussion
The above experiments evaluated our approach for a service robot to provide knowledge services. In our current design, emotion recognition was constructed separately from dialogue modeling. The dialogue model was trained by a data-driven process with a static dataset, and the emotion classifier was then used to re-rank the sentences selected by the model. This separation has several advantages. First, the emotion recognition and response generation modules can each be built with any effective method available, so the system is more flexible. Second, the reasons why the system generated a given response remain interpretable and open to further analysis. Alternatively, the two subsystems could be integrated into one model to optimize the overall structure and performance, for example by adopting a monolithic model with an attention mechanism that captures emotion as a special context. However, such an integrated system may become relatively difficult to understand and computationally expensive.
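The decoupled design described above, in which an emotion classifier re-ranks the sentences selected by the dialogue model, can be illustrated with a minimal sketch. Everything specific here is hypothetical: the emotion and tone labels, the `PREFERRED_TONES` table, the additive bonus and the `emotion_weight` value are assumptions made for illustration and are not the scoring rule used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    relevance: float  # score assigned by the dialogue (retrieval) model
    tone: str         # hypothetical tone label attached to the response


# Hypothetical mapping from the detected user emotion to suitable response tones.
PREFERRED_TONES = {
    "sad": {"comforting"},
    "angry": {"calming", "comforting"},
    "happy": {"cheerful"},
}


def rerank(candidates, user_emotion, emotion_weight=0.3):
    """Re-order candidates, boosting those whose tone suits the user's emotion."""
    preferred = PREFERRED_TONES.get(user_emotion, set())

    def score(candidate):
        bonus = emotion_weight if candidate.tone in preferred else 0.0
        return candidate.relevance + bonus

    return sorted(candidates, key=score, reverse=True)


candidates = [
    Candidate("Here is the information you asked for.", 0.80, "neutral"),
    Candidate("I am sorry to hear that; here is what may help.", 0.70, "comforting"),
]
ranked = rerank(candidates, user_emotion="sad")
```

Because the classifier only touches the final ordering, either module can be replaced without retraining the other, which is the flexibility argued for above.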
Regarding dialogue modeling, this work trained models by a data-driven process with a static dataset. Therefore, in addition to the learning method, the quality and quantity of the dataset also influenced the overall performance. It was thus important to strengthen the role of knowledge (i.e., the dataset) to infer an enriched domain-specific model. Different strategies can be developed to exploit more knowledge resources, ranging from directly linking the dataset to up-to-date external knowledge bases, through reorganizing the dataset to optimize data use, to more complicated procedures for transferring knowledge between domains. For a resource-restricted language, a straightforward way is to take translated datasets as shared knowledge for modeling. We showed the effect of using translated knowledge: our application case revealed that the translated knowledge was learnable and that the modeling performance remained acceptable, although below that obtained with the original data (accuracy of 0.54 versus 0.61). More advanced language translation techniques can be applied to further improve the performance.
In contrast to non-task-oriented dialogues, a task-oriented dialogue has a specific goal to achieve. The dataset is more focused and thus relatively smaller. In such a system, the most critical component is the goal tracker, which tracks the user's intent during the dialogue to infer the dialogue state. The system can then decide the best response accordingly to achieve the goal (e.g., the recommendations in our experiments). Through the application presented in this work, we have demonstrated that task-oriented dialogues can be practically launched for a manageable task with a clear goal and a constrained dataset. For such dialogues, the model parameters need to be carefully fine-tuned to obtain the best results. When the task becomes complicated or has a high-level (or abstract) goal, more advanced state tracking and inference mechanisms are needed to better understand the users' intentions.
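The control flow of such a goal-oriented dialogue can be sketched as a simple slot-filling loop over the five entities of the restaurant task. This is an assumed simplification: the `turn_slots` dictionaries stand in for the outputs of the trained per-entity trackers, and the slot values and function names are made up for illustration.

```python
# The five entities tracked in the restaurant recommendation task.
ENTITIES = ("cuisine", "location", "price range", "atmosphere", "party size")


def update_state(state, turn_slots):
    """Merge slot values inferred from the latest user turn into the dialogue state."""
    new_state = dict(state)
    for entity, value in turn_slots.items():
        if entity in ENTITIES and value is not None:
            new_state[entity] = value
    return new_state


def next_action(state):
    """Ask about the first missing entity, or recommend once all are filled."""
    for entity in ENTITIES:
        if state.get(entity) is None:
            return ("ask", entity)
    return ("recommend", None)


# Hypothetical tracker outputs for three consecutive user turns.
turns = [
    {"cuisine": "italian", "party size": "four"},
    {"location": "paris"},
    {"price range": "cheap", "atmosphere": "casual"},
]

state = {}
actions = []
for turn_slots in turns:
    state = update_state(state, turn_slots)
    actions.append(next_action(state))
```

In this toy run the robot asks for the location, then for the price range, and issues a recommendation only after the third turn has filled the remaining slots.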

Conclusions
In this work, we presented an emotion-aware dialogue framework for a service robot to achieve natural human-robot communication. To deploy this framework, we adopted a cloud-based service-oriented architecture and developed the emotion recognition module and two types of dialogue modeling modules on it. In the first type of service, the robot worked as a consultant to deliver domain-specific knowledge to users. We employed deep learning methods to train different neural models for mapping questions to answer sentences, tracked the human emotion during the human-robot dialogue and used this additional information to determine the relevance of the sentences obtained by the model. In the second type of dialogue service, task-oriented dialogues were provided to assist users in achieving specific goals. The robot continuously asked the user questions related to the task, tracked the user's intention through the interactions and provided suggestions accordingly. To verify our framework, we conducted a series of experiments to evaluate the major system components. The results confirmed the effectiveness and efficiency of the presented approach. Currently, we are developing knowledge transfer techniques that can extract question-answer pairs from text documents of different domains, in order to automatically enrich the dataset for retrieval-based model training. Moreover, we plan to investigate the use of Kansei engineering with hedge algebras to improve the granularity of the semantic and linguistic analysis of the dialogue sentences. We also plan to integrate the characteristics and preferences of the users into the learning model to achieve personalized dialogues.

Conflicts of Interest:
The authors declare no conflict of interest.