2.2. GROW Model Implemented through the Dialogue Manager
Dialogue Management (DM) is a fundamental component of any Spoken Dialogue System (SDS). It maintains the state of the conversation and manages its flow by determining the action that the system has to perform at each agent turn. For the EMPATHIC project [9], we used an agenda-based management structure based on the RavenClaw [13] dialogue management framework, which, unlike previous plan-based dialogue managers, separates the domain-dependent and the domain-independent components of the conversation. The domain-specific aspects are defined by a dialogue task specification, a tree of dialogue agents. A domain-independent dialogue engine then executes any specified task, using a stack structure to control the dialogue while providing reusable conversational skills, such as error recovery. This approach is suitable for dealing with complex domains while allowing the use of relatively unconstrained natural language.
The DM and the associated strategy implement the coaching model chosen for the project. Coaching has been defined as a result-oriented, systematic process. Coaching generally uses powerful questions so that people discover their own abilities and draw on their own resources. In other words, the role of a coach is to foster change by facilitating the coachee's movement through a self-regulatory cycle [14]. There is evidence that coaching interventions can be effectively applied as a change methodology [15]. One of the most commonly used coaching methodologies is the GROW Model [17]. This model provides a simple methodology and an adaptable structure for coaching sessions. Moreover, its efficiency has been demonstrated within theoretical behavior-change models such as the Transtheoretical Model of Change (TTM) [18]. As a consequence, this coaching model was selected for the EMPATHIC project to be integrated into the DM strategy.
A GROW coaching dialogue consists of four phases which give the model its name: Goals or objectives, Reality, Options and Will or action plan. During the first phase, the dialogue aims at eliciting the specification of the objective that the user wants to achieve, for example, reducing the amount of salt in order to diminish the related risk of hypertension. Then, this goal has to be placed within the personal context in which the user lives, and the potential obstacles need to be identified. In the next phase, the agent's goal is to make the user analyse the options he/she has to achieve the objective within his/her reality. The final goal of the dialogue is then the specification of an action plan that the user will carry out in order to advance towards the goals. In this framework, the DM strategy also involves achieving the goals associated with each of the four stages. First, it will try to get a specific goal from the user, asking something like "Would you like to improve something in your eating habits?". Once the user provides a sentence including his/her goal, the DM will focus on the next stage. Thus, it will try to get information about the context in which the goal has to be achieved, asking something like "How often do you usually go to the grocery store?". In this way the dialogue proceeds until all the stages are completed. This strategy differs from classical task-oriented dialogue systems, in which the user asks something related to the task and the system then tries to obtain additional information, if needed, to be able to provide as accurate a response as possible. In fact, the particular user goal and the related action plan have to be agreed between the virtual agent and the user during the conversation. However, this strategy can still be correctly specified by the RavenClaw domain-dependent trees of dialogue agents mentioned above that define the domain-specific aspects of the dialogue.
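The phase-by-phase strategy described above can be sketched as a simple state machine that advances through the four GROW stages once each stage's goal has been filled. The prompts, slot names, and the trivial "any non-empty reply fills the slot" rule are illustrative assumptions, not the EMPATHIC system's actual wording or logic.

```python
# Illustrative sketch of the GROW-driven DM strategy: the dialogue advances
# through the four phases, each with its own system prompt, moving on once
# the user's reply fills that phase's slot.

GROW_PHASES = ["Goal", "Reality", "Options", "Will"]

PROMPTS = {  # example prompts, not the system's real ones
    "Goal":    "Would you like to improve something in your eating habits?",
    "Reality": "How often do you usually go to the grocery store?",
    "Options": "What could you change to move towards that goal?",
    "Will":    "What specific step will you commit to this week?",
}

class GrowDialogue:
    def __init__(self):
        self.phase_idx = 0
        self.slots = {}  # information gathered per phase

    @property
    def phase(self):
        return GROW_PHASES[self.phase_idx]

    def system_prompt(self):
        return PROMPTS[self.phase]

    def user_reply(self, utterance):
        # In the real system an NLU module decides whether the phase goal
        # was reached; here any non-empty reply fills the slot.
        if utterance.strip():
            self.slots[self.phase] = utterance
            if self.phase_idx < len(GROW_PHASES) - 1:
                self.phase_idx += 1

    def action_plan_ready(self):
        return len(self.slots) == len(GROW_PHASES)

dm = GrowDialogue()
for reply in ["Eat less salt", "I shop twice a week",
              "Buy low-sodium products", "Check labels every purchase"]:
    print(dm.system_prompt())
    dm.user_reply(reply)
print(dm.action_plan_ready())  # → True once all four phases are completed
```

Note how, unlike a classical task-oriented system, the agent (not the user) drives the agenda: each prompt pursues the current coaching stage's goal.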
The EMPATHIC virtual coach is planned to deal with four coaching subdomains: nutrition [20], physical activity [21], leisure [22] and social and family engagement.
2.3. Related Work
While the GROW model serves as a conceptual pillar for developing the dialogue-act taxonomy, we also look to previous approaches for dialogue-act tagging.
Coding a sentence with a set of labels goes back to the speech act theory of Austin [23], which has been the basis for modern data-driven dialogue act theory. Many different dialogue act taxonomies have been proposed to solve the task of assigning dialogue act labels to sentences. They differ not only in the precise set of tags selected, but also in characteristics such as whether the tags are mutually exclusive, their level of detail, and their structure.
Dialogue act taxonomies can be characterized taking into consideration different criteria, such as the following:
Type of communication (i.e., synchronous vs. asynchronous).
Activity type and dialogue domain.
Type of corpora (e.g., speech dialogues, videos, chat).
Types of speech act classification schemes.
Dimensions (unidimensional versus multidimensional annotation).
Annotation tools and annotation procedure.
Books, and other written forms of communication, are asynchronous methods of communication where each message is thought out beforehand. This generally gives written communication a better structure than spoken communication, where doubts, rectifications, and external factors such as noise or user speech characteristics may result in incomplete or fuzzy messages. In this context, the PDTB [24] taxonomy was designed for the annotation of discourse relations between sentences, analyzing the conjunctions used to relate them. The sense tags described in PDTB have a hierarchical structure, but they would not suit our coaching problem, since they are designed to deal with asynchronous communication.
In a human-to-human conversation, these problems are solved by considering the context of the conversation. Dialogue acts need to take into account whether the communication is synchronous or asynchronous. For instance, synchronous communication allows the introduction of clarification intents, where an agent may be instructed to repeat a question or formulate it in a different manner. Such intent tags make no sense for asynchronous methods. The dialogue-act taxonomy introduced in this paper has been conceived for synchronous communication.
While one of the most common applications of act labeling is in the context of human-to-human or agent-to-human conversation, there are other types of activities to which it has been applied [25]; for instance, act labels can be used for text summarization [26]. Similarly, there is a variety of platforms and application domains to which act labeling methods have been applied, such as social networks [26] and the classification of message board posts [27]. The proposal we introduce in this paper is oriented to representing spoken communication between an agent and a human.
Another important difference among dialogue-act taxonomies is the corpora to which they are applied and from which machine learning models are commonly learned. The corpus used is, most of the time, strongly related to the domain in which the dialogue acts are going to be applied, and should therefore be able to capture the particularities of that domain. Initially, the available corpora were mainly created from task-oriented dialogues [28]. More recently, larger corpora have been proposed for training end-to-end dialogue systems [29]. In general, these large corpora are not annotated. For a survey of available corpora for dialogue systems, [31] can be consulted. The corpus used in this paper has the particular characteristic of being obtained from elderly people, a social group for which dialogue corpora are scarcer.
Among the dialogue-act models proposed in the literature, the approach introduced in [4] presents a framework to model dialogues in conversational speech. Its dialogue act taxonomy was first based on a set of tags used for the annotation of discourse structure, and was then modified to make it more relevant for the target corpus (the Switchboard corpus [32]) and the task. However, existing dialogue act taxonomies were not designed for the scenario described in Section 2.2, where a virtual agent is responsible for the development of the conversation. The DAMSL taxonomy [33] was developed primarily for two-agent task-oriented dialogues. Nevertheless, the Empathic taxonomy has some features in common with DAMSL, since the Intent dimension defined for the Empathic taxonomy contains labels related to three out of four DAMSL categories (Information Level, Forward Looking Function, Backward Looking Function).
Our proposal for the Empathic project has aspects in common with DIT++ [34], although the two are designed for different types of interactions. DIT++ is based on traditional task-oriented conversations, whereas Empathic is based on coaching interactions, where the agent is an active member of the conversation in the sense that the coach guides the conversation through the GROW model strategy. Nevertheless, many of the labels present in the taxonomy of general-purpose functions and dimension-specific functions defined in [34] can also be found in the Intent labels defined for Empathic. Another work relevant to our approach is the one recently published in [35], where a hierarchical schema for dialogue representation is proposed. Although the introduced scheme is specifically conceived to support computational applications, it uses a structure of linked units of intent that resembles the hierarchical structure at the core of our proposal.
The common norm for dialogue act annotation is that a single communicative function is assigned to an utterance. However, some works propose multidimensional dialogue act taxonomies in which multiple communicative functions may be assigned to each utterance. DAMSL considers sets of mutually exclusive tags as different dimensions, whereas DIT++ considers each dimension in a multidimensional system as independent, so that it can be addressed separately from the other dimensions. In particular, a 9-dimensional annotation scheme was defined in [3]. Similarly, we use a multidimensional taxonomy which allows us to capture richer semantic information from the dialogues. Considering multiple types of semantic information is a requirement for implementing an agent that should be able to embed the coaching objectives as part of the dialogue strategies. Without the rich information provided by multiple types of tags, it would be very difficult to guide the user towards the satisfaction of the objectives, and to evaluate whether these objectives have been fulfilled. The details of this multidimensional taxonomy are described in Section 3.
Although the multidimensional criterion is similar in both taxonomies, DIT++ is designed for turn labeling, while Empathic focuses on subsentence labeling. This difference forces DIT++ to separate, into two different dimensions, intents that can be found in the same turn, as seen in the dialogue examples in [36]. Labeling subsentences allows the Empathic taxonomy to group aspects found in the DIT++ dimensions defined in [34] (general-purpose functions, dimension-specific functions): a turn is split into subsentences, each having only one intent label, which avoids the problem of having two intent labels in the same turn.
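The combination of multidimensional labels and subsentence-level units can be sketched with a small data structure. The dimension and label names below (`state_goal`, `grow_phase`, `sentiment`, etc.) are illustrative assumptions, not the published Empathic tag set.

```python
# Sketch of multidimensional, subsentence-level annotation: a turn is split
# into subsentences, and each subsentence gets exactly one Intent label plus
# labels in other dimensions (here, GROW phase and sentiment).

from dataclasses import dataclass, field

@dataclass
class SubsentenceAnnotation:
    text: str
    intent: str                                      # exactly one intent each
    other_dims: dict = field(default_factory=dict)   # e.g. GROW phase, sentiment

def annotate_turn(subsentences):
    """Each (text, intent, extra_dimensions) triple yields one annotation."""
    return [SubsentenceAnnotation(t, i, extra) for t, i, extra in subsentences]

# A turn that would need two intent labels under turn-level annotation:
turn = annotate_turn([
    ("I'd like to eat better,", "state_goal",
     {"grow_phase": "Goal", "sentiment": "positive"}),
    ("but I have very little time to cook.", "state_obstacle",
     {"grow_phase": "Reality", "sentiment": "negative"}),
])

print([a.intent for a in turn])
# → ['state_goal', 'state_obstacle']
```

Splitting the turn lets each unit carry a single intent while the remaining dimensions still capture the richer semantics needed by the coaching strategy.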
In addition, an important effort has been carried out to define the ISO 24617-2 standard [6], which includes the nine dimensions defined in DIT++ and reduces the number of communicative functions; these can be specific to a particular dimension, or general-purpose communicative functions applicable in any dimension. The standard also considers different qualifiers for certainty or sentiment. This approach has also been included in our proposal, which can be considered, to some extent, a reduced and GROW-driven adaptation of the main characteristics of DIT++ and ISO 24617-2 for the Empathic purposes. An additional aim of the standard is to produce interoperable annotated dialogue resources. To this end, a set of dialogues from a variety of corpora and dialogue annotation schemes, such as Map Task, Switchboard, TRAINS or DBOX, have been re-annotated under the ISO 24617-2 scheme to build a Dialog Bank [7].
Finally, proposals for dialogue act taxonomies also differ in the annotation tools and procedures used. Humans are better than machines at understanding and annotating dialogue utterances in a detailed manner, because they have more knowledge of intentional behaviour and richer context models [3]. We therefore rely on human annotation procedures to obtain accurate annotations, instead of using automatic methods. We explain the characteristics of our annotation procedure in Section 4.
Regarding the NLU task, having a semantic representation that is both broad-coverage and simple enough to be applicable to several different tasks and domains is challenging. Thus, most NLU system approaches depend on the application and the environment they have been designed for. In this way, targeted NLU systems are based on frames that capture the semantics of a user utterance or query. The semantic parsing of input utterances in NLU typically consists of three tasks: domain classification (what is the user talking about, e.g., "travel"), intent determination (what does the user want to do, e.g., book a hotel room) and slot filling (what are the parameters of this task, e.g., "two bedroom suite near Disneyland") [38]. The domain detection and intent determination tasks have typically been treated as semantic utterance classification problems [39]. Slot filling, instead, has been treated as a sequence classification problem in which semantic class labels are assigned to contiguous sequences of words [41], and is now addressed by bidirectional LSTM/GRU models, among others [42]. A good review of the evolution of NLU is given in [44].
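The three subtasks above can be illustrated with a toy frame-based parser. The keyword rules below merely stand in for the statistical or neural classifiers used in practice, and the slot inventory (`room_type`, `location`) is an assumption made for the example; slot filling is shown with the common BIO-style sequence labels.

```python
# Toy illustration of the three classical NLU subtasks: domain
# classification, intent determination, and slot filling (BIO labels).

def parse(utterance):
    tokens = utterance.lower().split()
    # Domain and intent: utterance-level classification (here keyword-based).
    domain = "travel" if "room" in tokens or "suite" in tokens else "other"
    intent = "book_hotel" if "book" in tokens else "unknown"
    # Slot filling: assign a BIO label to each token in sequence.
    slots = []
    for tok in tokens:
        if tok in ("two", "double"):
            slots.append("B-room_type")
        elif tok in ("bedroom", "suite"):
            slots.append("I-room_type")
        elif tok == "near":
            slots.append("B-location")
        elif slots and slots[-1].endswith("location"):
            slots.append("I-location")
        else:
            slots.append("O")
    return {"domain": domain, "intent": intent, "slots": list(zip(tokens, slots))}

frame = parse("Book two bedroom suite near Disneyland")
print(frame["domain"], frame["intent"])
# → travel book_hotel
```

The resulting frame (domain, intent, filled slots) is the kind of semantic representation a targeted NLU system passes on to the dialogue manager.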
While the NLU employed in this work does perform intent detection and adds entity recognition, as other approaches do, the taxonomy includes intent labels specifically conceived for the GROW model, which has to fulfil additional objectives. For example, the taxonomy has to provide a relationship between the user utterances and the goals of the GROW model, which have to be agreed upon between the user and the virtual agent, and therefore established during the conversation, as mentioned in Section 2.2.