Knowledge-Grounded Chatbot Based on Dual Wasserstein Generative Adversarial Networks with Effective Attention Mechanisms

Featured Application: Core technology for intelligent virtual assistants.

Abstract: A conversation is based on internal knowledge that the participants already know or external knowledge that they have gained during the conversation. A chatbot that communicates with humans by using its internal and external knowledge is called a knowledge-grounded chatbot. Although previous studies on knowledge-grounded chatbots have achieved reasonable performance, they may still generate unsuitable responses that are not associated with the given knowledge. To address this problem, we propose a knowledge-grounded chatbot model that effectively reflects the dialogue context and the given knowledge by using well-designed attention mechanisms. The proposed model uses three kinds of attention: Query-context attention, query-knowledge attention, and context-knowledge attention. In our experiments with the Wizard-of-Wikipedia dataset, the proposed model showed better performance than the state-of-the-art model on a variety of measures.


Introduction
The ultimate goal of natural language processing is to let humans and machines freely communicate with each other. In human-to-human conversations, people communicate based on contextual knowledge (i.e., knowledge obtained from previous utterances) or external knowledge (i.e., knowledge obtained from various media). A knowledge-grounded chatbot is a dialogue system that can communicate by recalling internal and external knowledge (similar to a human). Knowledge-grounded chatbots must understand the language, store internal/external knowledge during the conversations, and respond in various ways based on the stored knowledge. Table 1 lists the differences between the responses of a conventional chatbot (utterances 1-1 and 2-1) and a knowledge-grounded chatbot (utterances 1-2 and 2-2). As shown in Table 1, the conventional chatbot provides general responses such as "I see" and "Yes, it is" because it has no knowledge about "Avengers." However, the knowledge-grounded chatbot provides specific responses (such as "Sure, I like Iron Man" and "Oh! Anthony and Joe Russo directed it") based on external knowledge obtained from the documents retrieved by an information retrieval system. To generate these specific responses, the knowledge-grounded chatbot must possess the following three abilities.

• Knowledge retrieval: A chatbot should be able to search for documents associated with the conversation topics.
• Knowledge extraction: A chatbot should be able to extract knowledge from the retrieved documents.
• Response generation: A chatbot should be able to generate responses that reflect the extracted knowledge.
In open-domain conversations, it is difficult for a chatbot to find documents that are closely associated with the current conversation topic because a multi-turn conversation references many topics. Even if we assume that the chatbot can find some documents associated with the current conversation topic by using a state-of-the-art information retrieval system, it is still difficult for the chatbot to extract knowledge from the search results because the retrieved documents may contain diverse knowledge on the given topic [1]. In addition, it is difficult for the chatbot to generate appropriate responses that reflect the acquired knowledge because the conversation history (i.e., contextual information based on previous utterances) should also be considered. In this study, we set aside the knowledge retrieval problem and focus on knowledge extraction and response generation. In particular, we propose a knowledge-grounded multi-turn chatbot model that generates responses by considering new knowledge from previous utterances and a given document. This paper is organized as follows. In Section 2, we review previous work on generative chatbot models, and in Section 3, we describe the proposed knowledge-grounded chatbot model. In Section 4, we explain the experimental setup and report the experimental results. Section 5 concludes the study.

Related Work
Chatbot models are divided into retrieval-based models and generative models. Retrieval-based models predefine a large number of question-answer pairs, match users' queries against the predefined questions by using well-designed information retrieval models, and return the answers of the matched questions as responses to the users' queries [2,3]. Because retrieval-based models rely on well-curated question-answer pairs prepared in advance, they do not return responses containing grammatical errors. However, their responses are restricted to a predefined set and are not sensitive to changes in users' queries. To overcome these deficiencies, generative models that automatically compose word sequences appropriate for a response have recently been proposed. Most recent studies on generative chatbots primarily use the sequence-to-sequence (Seq2Seq; encoder-decoder) model [4,5]. Generally, Seq2Seq models consist of two recurrent neural networks (RNNs): an encoder for embedding input sentences and a decoder for generating output sentences. However, Seq2Seq-based chatbots often generate safe and short responses such as "I don't know" and "Okay" [6,7]. To overcome this problem, the maximum mutual information (MMI) model and the variational autoencoder (VAE) model have been proposed [6][7][8][9]. To diversify responses, the MMI model replaces the standard objective function (the log-likelihood of a response given a query) with a new objective function based on the mutual information between a response and a query [6]. The VAE model can diversify responses by learning latent variables from inputs and outputs. However, the VAE model suffers from posterior collapse: the decoder is trained to ignore the latent variable because the latent variable is simplified to a standard normal distribution [10,11]. In other words, although real responses follow a very complex distribution, the latent variable of the VAE model collapses because it is fitted to a simple normal prior. Some previous studies partially solved this problem by using adversarial learning [11]. However, adversarial learning over generated responses involves discrete tokens, which are not differentiable and therefore cannot be trained directly [12,13]. To solve this problem, various studies have proposed using adversarial learning not for generating responses but for generating latent variables.
There have been various studies on knowledge-grounded chatbots (i.e., chatbot models that actively acquire and use knowledge for generating responses). The memory-to-sequence (Mem2Seq) model acquires structured knowledge and contextual information through a memory network [14]. The neural knowledge diffusion model automatically acquires knowledge associated with the entities in the previous utterances by looking up a knowledge base [15]. However, existing models have some limitations. First, they require considerable time to build new external knowledge bases. Second, the knowledge included in their responses is restricted to predefined knowledge bases [14][15][16][17][18]. Therefore, most open-domain knowledge-grounded chatbots extract knowledge from unstructured texts and generate responses using the extracted knowledge [19][20][21].

Knowledge-Grounded Chatbot Model
The proposed model comprises four submodules: A context encoder, knowledge encoder, latent variable generator, and response generator, as shown in Figure 1.
The context encoder takes m previous utterances U_{n-m}, ..., U_{n-2}, U_{n-1} (i.e., a dialogue context) and a current utterance U_n (i.e., a user's query) as inputs. Then, it calculates the degrees of association between the previous utterances and the current utterance based on a scaled dot-product attention mechanism [22]. Finally, it generates a context vector. The knowledge encoder takes knowledge K (i.e., a document containing evidence to generate a response) and the current utterance U_n as inputs. Then, it calculates the degrees of association between the knowledge and the current utterance based on a scaled dot-product attention mechanism [22], and it generates a knowledge vector. During training, the latent variable generator takes the context vector and the next utterance U_{n+1} (i.e., a chatbot's response) as inputs. Then, it creates a context vector similar to the RNN-encoded next-utterance vector using an adversarial learning method [23]. Next, it decodes an encoded next-utterance vector using an autoencoder learning method. Finally, the trained latent variable vector z is used at inference time.
To generate various responses, the response generator uses an adversarial learning scheme that makes the response vector decoded by the RNN (i.e., the generated response vector) similar to the encoded response vector (i.e., the gold response vector).

Context Encoder
The context encoder calculates the degrees of association between a dialogue context U_{n-m}, ..., U_{n-2}, U_{n-1} and a current utterance U_n. The current utterance and each previous utterance in the dialogue context are individually encoded with bidirectional gated recurrent units (Bi-GRU; a kind of bidirectional RNN) [24], where x_i denotes the i-th word vector in an utterance; the i-th word is encoded by the forward and backward states of the Bi-GRU, which are concatenated ([ ; ] denotes concatenation) to form h_i. Then, each utterance in the dialogue context is encoded by unidirectional gated recurrent units (Uni-GRU; a kind of unidirectional RNN) [25], where u_j denotes the concatenation of the first and last word vectors of the j-th utterance encoded by the Bi-GRU, and c_j denotes the j-th utterance vector encoded by the Uni-GRU. After encoding the dialogue context and the current utterance, the context encoder computes attention scores A_c between the current utterance and each utterance in the dialogue context by using scaled dot products [22], where C denotes the matrix of the encoded dialogue context, [c_1, c_2, ..., c_m], and H denotes the matrix of the encoded current utterance, [h_1, h_2, ..., h_last]. Here, W_c, W_h1, and W_g are weights; d is a normalizing factor, which is set to 300; and g(x) is a sigmoid gate function. After A_c passes through a self-attention layer (i.e., attention between A_c and itself) [20], the final attention matrix A_qc (called QC-Attention) is obtained.
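The corresponding computation can be sketched as follows. This is a minimal reconstruction that assumes the standard Bi-GRU/Uni-GRU recurrences, a softmax-normalized scaled dot product, and a sigmoid gate applied to the self-attention output; the exact projection and gating layout may differ from the original formulation.

\begin{align*}
\overrightarrow{h}_i &= \overrightarrow{\mathrm{GRU}}\big(x_i,\ \overrightarrow{h}_{i-1}\big), \qquad
\overleftarrow{h}_i = \overleftarrow{\mathrm{GRU}}\big(x_i,\ \overleftarrow{h}_{i+1}\big), \qquad
h_i = \big[\overrightarrow{h}_i\,;\,\overleftarrow{h}_i\big] \\
u_j &= \big[h_1^{(j)}\,;\,h_{\mathrm{last}}^{(j)}\big], \qquad
c_j = \mathrm{GRU}\big(u_j,\ c_{j-1}\big) \\
A_c &= \mathrm{softmax}\!\left(\frac{(C W_c)(H W_{h_1})^{\top}}{\sqrt{d}}\right) H, \qquad
A_{qc} = g(A_c)\odot \mathrm{SelfAttn}(A_c, A_c)
\end{align*}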

Knowledge Encoder
The knowledge encoder calculates the degrees of association between a current utterance U_n and given knowledge K. We assume that the knowledge is represented as unstructured text (a sequence of words). Knowledge K is encoded with a Bi-GRU, as in the context encoder. After encoding the knowledge and the current utterance, the knowledge encoder computes attention scores A_k in the same manner as the context encoder, where H denotes the matrix of the encoded current utterance, [h_1, h_2, ..., h_last], and K denotes the matrix of the encoded knowledge, [k_1, k_2, ..., k_last], in which k_j is the concatenation of the j-th forward and backward states of the Bi-GRU. Here, W_c, W_h2, and W_g are weights; d is a normalizing factor, which is set to 300; and g(x) is a sigmoid gate function. Then, the final attention matrix A_qk (called QK-Attention) is calculated. Finally, the knowledge encoder calculates A_ck (called QC-QK-Attention), which represents the degrees of association between the dialogue context and the knowledge.
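Under the same assumptions, a sketch of the knowledge-side attention is given below; the precise way A_qc and A_qk are combined into A_ck is an assumption based on the description above.

\begin{align*}
A_k &= \mathrm{softmax}\!\left(\frac{(H W_{h_2})(K W_c)^{\top}}{\sqrt{d}}\right) K, \qquad
A_{qk} = g(A_k)\odot \mathrm{SelfAttn}(A_k, A_k) \\
A_{ck} &= \mathrm{softmax}\!\left(\frac{A_{qc}\,A_{qk}^{\top}}{\sqrt{d}}\right) A_{qk}
\end{align*}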

Latent Variable Generator
The latent variable generator is responsible for generating latent variables that help the response decoder generate various responses. To generate latent variables, we adopted an autoencoder model based on the Wasserstein distance [4,11,23]. The conventional VAE dialogue model assumes that latent variables follow a simple prior distribution. However, because responses in the real world follow very complex distributions, we train latent variables using a Wasserstein generative adversarial network (WGAN). Formally, we model two latent variables, z and z̃, where f_θ(·) and g_φ(·) are feed-forward neural networks that generate z and z̃, respectively. The difference between the two networks is whether u_{n+1}, the next utterance (i.e., the chatbot's response) encoded by the Bi-GRU, is used as an input; u_{n+1} is represented by the concatenation of the first and last word vectors encoded by the Bi-GRU. Our goal is to minimize the divergence between z and z̃. Therefore, we make the two distributions similar according to the well-known GAN training scheme [26], in which a discriminator D(·), a feed-forward neural network, is trained to distinguish between z and z̃.
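A sketch of the latent variables and the adversarial objective follows; treating the response-conditioned network g_φ as the one that consumes u_{n+1}, and the exact input concatenation, are assumptions based on the description above. Here, c_n denotes the context vector.

\begin{align*}
z &= f_{\theta}(c_n), \qquad
\tilde{z} = g_{\phi}\big([\,c_n\,;\,u_{n+1}\,]\big) \\
\min_{f_{\theta}}\ \max_{D}\ &\ \mathbb{E}\big[D(\tilde{z})\big] - \mathbb{E}\big[D(z)\big]
\end{align*}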

Response Generator
First, the response generator generates a response by using the concatenation of the encoded current-utterance vector u_n, QC-Attention A_qc, QC-QK-Attention A_ck, and a generated latent variable (i.e., z̃ for training and z for inference) as the initial state of the decoder. Then, it makes the generated response vector similar to u_{n+1}, the gold response vector encoded by the Bi-GRU, using the Wasserstein autoencoder (WAE) process based on WGAN [23].
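As an illustration only, the decoder initialization described above can be written as the following minimal sketch; the variable names, the mean-pooling of the attention matrices, and the tanh projection to the decoder's hidden size are assumptions rather than the implementation used in this work.

```python
import numpy as np

def decoder_initial_state(u_n, a_qc, a_ck, z, w_proj):
    """Build the decoder's initial state from the encoded current utterance
    u_n, the QC-Attention matrix a_qc, the QC-QK-Attention matrix a_ck, and
    the latent variable z.  Mean-pooling the attention matrices and the tanh
    projection (w_proj) are illustrative assumptions."""
    qc = a_qc.mean(axis=0)           # pool QC-Attention to a single vector
    ck = a_ck.mean(axis=0)           # pool QC-QK-Attention to a single vector
    s_0 = np.concatenate([u_n, qc, ck, z], axis=-1)
    return np.tanh(s_0 @ w_proj)     # project to the decoder's hidden size
```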

Implementation and Training
We implemented the proposed model using TensorFlow 1.14.1 [27]. The Bi-GRUs have 300 hidden units in each direction, and the Uni-GRUs have 600 hidden units. The dimensions of QC-Attention, QK-Attention, and QC-QK-Attention are 600. The discriminators of the latent variable generator and the response generator are three-layer feed-forward neural networks (FNNs) with 100 hidden units and rectified linear unit activations [28]. The vocabulary size, word-embedding size, and dialogue context size were set to 47,186, 300, and 3, respectively. All word-embedding vectors were initialized to random values. Responses were generated using a greedy decoding algorithm. The model was trained in three steps [2]. In the first step, we trained the WGAN in the latent variable generator using an adversarial learning method. In the second step, we trained the entire model except the WAE in the response generator. Finally, we trained the WAE in the response generator using an adversarial learning method. When training the discriminators [26], we used a gradient penalty with a coefficient of 10. We used a cross-entropy cost function to maximize the log-probability. In the inference step, we used z as the latent variable.
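The gradient penalty mentioned above follows the WGAN-GP formulation. The following is a minimal TensorFlow 2-style sketch of the critic loss with the penalty coefficient set to 10; the model itself was implemented in TensorFlow 1.14, so this only illustrates the objective, not the actual training code.

```python
import tensorflow as tf

def critic_loss_with_gp(discriminator, z_real, z_fake, gp_weight=10.0):
    """WGAN critic loss with a gradient penalty (coefficient 10).
    `discriminator` is any callable mapping latent vectors to scalar scores;
    z_real / z_fake are batches of latent vectors from the two distributions
    that the discriminator must tell apart."""
    d_real = tf.reduce_mean(discriminator(z_real))
    d_fake = tf.reduce_mean(discriminator(z_fake))
    # Penalize the gradient norm at random interpolations of the two batches.
    eps = tf.random.uniform(tf.stack([tf.shape(z_real)[0], 1]), 0.0, 1.0)
    z_hat = eps * z_real + (1.0 - eps) * z_fake
    with tf.GradientTape() as tape:
        tape.watch(z_hat)
        d_hat = discriminator(z_hat)
    grads = tape.gradient(d_hat, z_hat)
    gp = tf.reduce_mean(tf.square(tf.norm(grads, axis=1) - 1.0))
    return d_fake - d_real + gp_weight * gp
```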

Datasets and Experimental Settings
We evaluated our model on the Wizard-of-Wikipedia dataset [1]. The dataset is used for three tasks: Knowledge prediction (i.e., selecting documents containing proper knowledge from a document collection), response generation (i.e., generating responses using the given knowledge), and an end-to-end task (i.e., both knowledge prediction and response generation). In this study, we focused on response generation. Owing to hardware limitations, we refined the Wizard-of-Wikipedia dataset to use two previous utterances as the dialogue context. Our refined dataset comprises 83,247 utterances for training, 4444 utterances for validation, and 4356 utterances for testing.
We used bilingual evaluation understudy (BLEU) [29,30], perplexity (PPL) [1,31], and bag-of-words (BOW) embedding [32] as performance measures. BLEU measures the ratio of overlapping words between generated responses and gold responses.
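The BLEU score can be written as follows; this is the conventional definition with uniform n-gram weights (the brevity penalty term is omitted here because it is not described in this section):

\[
\mathrm{BLEU} = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n}\log\,\mathrm{precision}_i\right)
\]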
where n is the maximum length of the n-grams, which is commonly set to 4, and precision_i is the word i-gram precision (i.e., the number of correct word i-grams divided by the number of word i-grams in the generated sentence). The precision of BLEU is the average of the BLEU scores of 10 generated sentences per query (in these experiments, the decoder returned 10 candidate sentences per query using a beam search algorithm), and the recall of BLEU is the maximum score among the BLEU scores of the 10 generated sentences per query. As many studies use the unigram F1, we also used a word unigram F1, called the F1-score [1,31]. PPL is a measure for evaluating language models and is commonly used in traditional generative chatbot models that mainly use an RNN decoder. The BOW embedding metric is the cosine similarity of BOW embeddings between generated and gold responses. It consists of three metrics: Greedy [33], average [34], and extrema [35]. In our tests, we report the maximum BOW embedding score among the 10 sampled responses.
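For reference, the three BOW embedding metrics can be computed as in the following minimal sketch, which uses the commonly cited definitions of greedy matching, embedding average, and vector extrema rather than the exact evaluation code used in this work.

```python
import numpy as np

def bow_embedding_metrics(gen_vecs, gold_vecs):
    """Cosine-similarity BOW embedding metrics between a generated and a gold
    response.  Each argument is an array of shape (num_words, embedding_dim)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Average: cosine similarity between the mean word vectors.
    average = cos(gen_vecs.mean(axis=0), gold_vecs.mean(axis=0))

    # Extrema: per dimension, keep the value with the largest magnitude.
    def extrema_vec(v):
        idx = np.argmax(np.abs(v), axis=0)
        return v[idx, np.arange(v.shape[1])]
    extrema = cos(extrema_vec(gen_vecs), extrema_vec(gold_vecs))

    # Greedy: match each word to its most similar counterpart, then symmetrize.
    sims = np.array([[cos(g, r) for r in gold_vecs] for g in gen_vecs])
    greedy = 0.5 * (sims.max(axis=1).mean() + sims.max(axis=0).mean())

    return {"average": average, "extrema": extrema, "greedy": greedy}
```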

Experimental Results
In our first experiment, we evaluated the effectiveness of the knowledge encoder at the architecture level. The results are summarized in Table 2. In Table 2, CE is a model that uses only the context encoder (i.e., the QC-Attention vector), and CE + KE additionally uses the knowledge encoder (i.e., QC-QK-Attention). As shown in Table 2, CE + KE exhibits better performance on all measures except PPL. PPL can be interpreted as the average number of candidate words that a language model considers when generating each word. With a lower PPL, the model has fewer options when generating words and tends to produce general words (e.g., "I know"). Hence, PPL is not a critical measure for chatbots that need to generate various responses. These results indicate that QC-QK-Attention improves the quality of responses.
In the second experiment, we compared the performance of the proposed model with those of state-of-the-art models. For a fair comparison, we downloaded the code provided by the authors of each model from GitHub (https://github.com/lizekang/ITDD) [31]. Then, we trained and tested the models using the refined Wizard-of-Wikipedia dataset. Table 3 compares the performance of these models. In Table 3, Dinan-End2End and Dinan-TwoStage are well-known knowledge-grounded chatbot models proposed by Dinan et al., who released the Wizard-of-Wikipedia dataset [1]. Dinan-End2End performs the end-to-end task, and Dinan-TwoStage performs knowledge prediction and response generation separately. ITDD is a state-of-the-art Transformer-based model that generates a response using a two-pass decoder; for ITDD, we replaced the document with the context. As shown in Table 3, the proposed model exhibits better performance on all measures except PPL. The compared models show lower PPL because they generate more general words than the proposed model does, as shown in Table 4. Table 4 shows an example response from each model. The Dinan-End2End, Dinan-TwoStage, and ITDD models generated answers such as "I know" or "I'm not sure." These general responses help lower PPL. However, the proposed model generated a response that mentioned the names of the band members, reflecting the acquired knowledge.

Table 4. Comparison of the proposed and previous models.

Context
I used to listen to the Rolling Stones a lot when I was a child.
Me too. I can't believe they were formed in London as far back as 1962! What's your favorite song?
I can't even remember, to be honest! Do you know who the band members were?

Gold Response
Mick (of course), Brian Jones, Keith Richards, Bill Wyman, and I don't remember who else.

Dinan-End2End
I know, they were formed in London in 1962.

Dinan-TwoStage
I'm not sure, but I know that the band was formed in 1981 by drummer Lars Ulrich and vocalist James Hetfield.

ITDD
I'm not sure, but I know they were formed in 1962.

Proposed Model
Mick Jagger, Keith Richards, Bill Wyman, Charlie Watts, and Ian Stewart.

Conclusions
We proposed a knowledge-grounded multi-turn chatbot model that effectively reflects newly acquired external knowledge. To generate responses in the context of the conversation history, it uses an attention mechanism between a query and a context in the query-encoding step. To generate responses that reflect external knowledge, it uses a query-knowledge attention mechanism in the knowledge-encoding step. Furthermore, the model effectively mixes the two kinds of attention to consider the degree of association between the dialogue context and the given knowledge. In experiments on the Wizard-of-Wikipedia dataset, the proposed model showed better performance than previous state-of-the-art models and generated responses that reflect external knowledge more effectively. However, the proposed model has a limitation: proper external knowledge must be given in advance. To overcome this, we will study an end-to-end knowledge-grounded chatbot model that searches external documents containing proper knowledge, summarizes the retrieved documents, and extracts proper knowledge from the summaries. As future work, we will also study a method to generalize an input utterance to a shallow semantic form consisting of a set of keywords, tense information, and modality information (e.g., "I'll be able to go there." → "[[I, go, there], future, possibility]"), and a method to use this semantic form as an input to a Seq2Seq model.