Memory-Based Deep Neural Attention (mDNA) for Cognitive Multi-Turn Response Retrieval in Task-Oriented Chatbots

: One of the important criteria used in judging the performance of a chatbot is the ability to provide meaningful and informative responses that correspond with the context of a user’s utterance. Nowadays, the number of enterprises adopting and relying on task-oriented chatbots for proﬁt is increasing. Dialog errors and inappropriate response to user queries by chatbots can result in huge cost implications. To achieve high performance, recent AI chatbot models are increasingly adopting the Transformer positional encoding and the attention-based architecture. While the transformer performs optimally in sequential generative chatbot models, recent studies has pointed out the occurrence of logical inconsistency and fuzzy error problems when the Transformer technique is adopted in retrieval-based chatbot models. Our investigation discovers that the encountered errors are caused by information losses. Therefore, in this paper, we address this problem by augmenting the Transformer-based retrieval chatbot architecture with a memory-based deep neural attention (mDNA) model by using an approach similar to late data fusion. The mDNA is a simple encoder-decoder neural architecture that comprises of bidirectional long short-term memory (Bi-LSTM), attention mechanism, and a memory for information retention in the encoder. In our experiments, we trained the model extensively on a large Ubuntu dialog corpus, and the results from recall evaluation scores show that the mDNA augmentation approach slightly outperforms selected state-of-the-art retrieval chatbot models. The results from the mDNA augmentation approach are quite impressive.


Introduction
Many studies on natural language processing (NLP) agree that chatbots have contributed immensely towards the advancement in information retrieval and exchange between humans and computers. In the vertical domain, the profitability of online transactions in e-commerce, reservations, and marketing firms, etc., is beginning to depend heavily on chatbots to convince people to make purchase of goods and services through interactive and persuasive chats. For this reason, it is important that a chatbot is able to respond coherently and accurately with respect to the context of queries from users. Therefore, we opine that the degree of the accuracy of selected response is the primary criteria to be used in judging the performance of a chatbot. However, modeling a chatbot to consistently select and match accurate responses with respect to the intent and context of the users' input utterances over multiple-turns of a conversation is a challenging task.
To address this challenge, a plethora of research studies have been done on retrieval-based dialog systems. These systems are more commonly adopted in vertical domains unlike in open domains where sequence-to-sequence generative chatbots are more prevalently adopted. Retrieval chatbots enjoy the advantage of possessing rich repositories from which they can select information-rich responses [1,2] that align within the scope of a predefined knowledge base. Whereas the generative-based ones possess so much liberty on how to respond to utterances, and thus, are usually prone to occasionally generate grammatical errors or irrelevant generic responses that may not serve the intents of the user.
Early studies on retrieval chatbots considered only data in single-turn of a dialog for selecting responses from a repository [3][4][5]. In-other-words, this approach entails sole use of the information in the last conversational utterance for matching a response. Meanwhile, outcomes from recent works such as the deep attention matching network (DAM) by Zhou et al. [1], shows that the consideration of information in multiple-turns of dialog prove to offer better performance in-which response selection matching not only considers information in last utterance turn, but also considers previous turns of utterances for matching contextual dependencies, which significantly improve accuracy of output response selection. This approach is based on the notion that human conversational manner often depend on the context and sentiments in the previous turns of utterances containing varying segments of semantic cues.
Recently, most of the state-of-the-art multi-turn retrieval chatbot models adopt the novel Transformer architecture proposed by Vaswani et al. [6] because an attention-based network helps neural systems to expedite the capturing of utterance-response textual dependencies by replacing deep tensors, thus minimising computational cost [7]. However, Transformer-based retrieval chatbot models [1,2,8] encounter the problematic occurrence of the following errors: • logical inconsistency error: where retrieved response candidates are wrong due to logical mismatch • vague or fuzzy response candidate error: where selected responses contains improper details To contribute towards solving the above outlined problems, we propose the memory-based deep neural attention (mDNA). The key purpose of the mDNA is to augment transformer-based retrieval chatbot models to improve response accuracy and logical consistency. The mDNA is a deep neural network (DNN) architecture with a Bi-LSTM encoder integrated with attention mechanism; and in addition, a memory module is implemented in the encoder. The purpose of implementing memory is to capture and store important utterance word embedding so as to offer a cognitive-like ability to the model. In our experiments, we adopted the DAM [1] as a case study representing the transformer-based retrieval chatbot model. We combined the two model architectures (i.e., DAM and mDNA) with an approach similar to unimodal late fusion method [9], as shown in Figure 1, to achieve enhanced performance.
During the response retrieval process, the previous multi-turns of relevant utterance information that could have probably been "forgotten" (i.e., data lost in DAM but retained in mDNA memory) are parsed through the mDNA Bi-LSTM encoder with the attention mechanism to match long-range contextual dependencies. The final response candidates from the DAM and mDNA are combined to undergo a voting process via a softmax function for the final response output selection. Our experiments show that the fusion of two different model architectures increases the number of response candidates. It also increases the probability of the chatbot to select more logically consistent and accurate responses. Thus, this technique drastically minimises the problematic occurrence of fuzzy response candidates and logical response inconsistency errors impairing the performance of most transformer-based retrieval dialog systems.

Related Works
The growing number of research interests in multi-turn retrieval chatbots is an indication that it is a successful approach, but plagued with inherent challenges. Investigative studies done in [1,2,8] and so on, has sufficiently proven the superiority of multi-turn models over the single-turn ones. Therefore, with greater advantages, multi-turn retrieval chatbots are widely being adopted for various end-to-end task completion services [10] in the vertical domain which justifies their importance. Currently the existing efforts improve intuitive matching of utterance-context and semantic dependency to achieve high accuracy in response selection. It has driven recent scholarly works on retrieval chatbots [1,8,11,12] to adopt the use of Transformer attention-based architecture proposed in [6]. Although feed-forward neural networks (FFN) such as CNN are best suited for computer vision tasks, the Transformer attention model architecture has successfully used it for improving utterance textual context and response dependency matching in sequential neural dialog system. In addition, it has also achieved fair results in retrieval-based dialog system models such as the DAM network [1].
While the Transformer is now being hailed as the potential replacement to gated-recurrent neural networks in NLP tasks. Recent novel retrieval chatbot models based on the transformer-attention architecture [1,8,11] have shown to encounter the problematic occurrence of context mismatch and logical inconsistency errors in the selected responses of the models. Although some comparative studies on Transformer and RNN [13] agree that the self-attention and positional encoding attributes of the Transformer give them better performance leverage over the gate-based RNN neural models, more recent studies (e.g., [14]) point out that the multi-head attention mechanism of transformer models can also cause it to suffer loss of sequential information that are important in natural languages. Wang et al. [14] thereby proposed a RNN-enhanced Transformer (R-Transformer) model to address this problem.
However, we argue that by systematically applying memory to gated RNN-based models, it would achieve comparable long-range contextual matching performance such as the Transformer without suffering significant loss of important data. Scholarly works that have investigated the integration of memory in gate-based recurrent neural models include Zhao et al. [11] who used background document to supply external knowledge to retrieval chatbot models (i.e., a memory-like operation) to enrich response coherence with respect to the context of utterances. In other studies, memory support has been added to neural networks for natural language transduction tasks in [15]. Also, in [16], an attempt has been made to replace the use of attention with active memory. Furthermore, Wulamu et al. [17] combined the use of memory networks and attention to achieve improved responses on simple Q&A tasks but, in more complex Q&A tasks, their model performed poorly. Owing to the poor model performance of [17] in handling complex retrieval tasks, the authors consider implementing Bi-LSTM and attention mechanism in their future work. Finally, investigating the effectiveness of augmenting Transformer with RNN-based neural model to forge a single robust ensemble, the research by Amazon's Domhan [18] suggests that one can successfully achieve good performance by such combination of architectures.
In this paper, we describe how we tackled the problems of logical mismatch and context inconsistency errors of multi-turn response selection in Transformer-based retrieval chatbots, by exploiting the collective benefits of memory, gated-RNN, and the Transformer attentional components combined as one unit. We selected the novel DAM network [1] to serve as a case study representing many other similar multi-turn retrieval chatbot models based on the Transformer architecture. By augmenting the DAM model with the mDNA, we achieved optimized performance, with respect to logical consistency and contextual accuracy of output responses.

Model Description
This section briefly describes the interactive roles of the major functional units that make up the mDNA (i.e., the Memory Encoding, Bi-LSTM with Attention, and the matching composition or prediction) as illustrated in Figure 2.

Memory Utterance Embedding
The Transformer-based models can effectively capture and match textual dependencies in utterance and response. However, important relevant word representations in long multi-turn dialogs can easily be lost in the process and thereby impair the quality of the final predicted output responses. On this basis, we formulated the concept of the mDNA to comprise a memory unit for temporary storage of the utterance input so as to retain important contextual utterance information which will be needed consequently for response dependency matching during a conversational session.
Here, we denote the input utterance embedding parsed from word2vec to the memory as m = (m 1 , m 2 , m 3 , . . . , m n ). The response embedding is directly fed into the Bi-LSTM encoder and is denoted as r = (r 1 , . . . , r m ). Then using the pre-trained word embedding table, we generate two input sequences as M = [e m,1 , . . . , e m,n ] and R = [e r,1 , . . . , e r,m ], where e ∈ R d , and d is the dimension of word embedding. As shown in Figure 3, we use soft attention denoted as (α M ) to filter important utterance words vectors of given lengths e m l for storage. We express this as, α M = softmax(W, e m,l ), where W is the weight score, and e m,l is utterance vector representation (m), and the length (l) of each word.

Bi-LSTM Encoding with Attention
Having solved the exploding and vanishing gradient problems of RNN, LSTM has enjoyed a lengthy time-line of being the basic model widely used in NLP tasks due to its capability to capture long-term textual dependencies [19]. Moreover, the enhanced bi-directional variant of the traditional LSTM which comprises of 2 hidden layers in opposing directions enables it to connect past and future input features to the same output in specified time steps, thus, the Bi-LSTM is able to learn fast. Meanwhile, comparing Bi-LSTM to the bi-directional gated recurrent unit (Bi-GRU) [20,21], the study [20] showed that the latter is equally fast; however, we justify our choice of adopting the Bi-LSTM in the mDNA model based on the analysis in [22] that the LSTM is a better option than GRU for performing tasks where deep context understanding is the major consideration of our study.
In this section, we apply the two input sequences of memory and response, M and R, into the Bi-LSTM encoder to represent the tokens of the contextual vectors, and we denote the vectors of the hidden state as m s and r s : where i and j represent the i-th context in the utterance, and the j-th context in the response, respectively.
To determine whether a selected response candidate is appropriate and correlated with the context of the utterance, modeling the relationship between the utterance and the response is an essential step. For example, appropriate responses consider contextual keyword vectors, which can be obtained by modeling the semantic relationship. To this regard, being that not all words contribute equally to the context representation of an utterance, attention is required to ensure that important information are collected for matching composition. In this case, we use a multi-head attention mechanism to align the utterance contexts to the response candidates, and then calculate the semantic relationship at the utterance level. We also applied a soft alignment layer to calculate the attention weights as follows: where e ∈ R w * n , and T is the length of the sequence. For the hidden state of utterance context, i.e., m s i (already encoding the utterance and its contextual meaning), the relevant semantics in the response options are identified and composed using e ij as a vector m d i , more specifically as shown in Equation (4).
where α ∈ R w * n and β ∈ R w * n are the normalized attention weight matrices. The same process is also performed for each utterance and response matching in Equation (5). Hence, we model the utterance level semantic relationship between the aligned utterance pairs, i.e., < m s i , m d i > and < r s i , r d i >. The equation is further expressed as follows: where a matching layer can be used to model some high-order interaction between the vectors m w i and r w i for the utterance and response, respectively. P is a 1-layer multi-layer perception (MLP) with a rectified linear unit (ReLU) activation.

Matching Composition and Prediction
In the process to predict the final response candidate selection, the Bi-LSTM is used to compose the matching vectors in a dense layer as follows: and r m j = BiLSTM(r w , j).
Finally, we get the representations of DAM and mDNA denoted as [ f 1 , f 2 ] to compute the final matching score g(m, r), which is formulated as: where W 2 and b 2 are the learning parameters. The loss function of our model is the negative log likelihood, defined as: where D is the dataset, and y is the label marked in D.

Dataset
The availability and use of a large amount of training data promises an increase in the probability of achieving good performance in neural machine translation tasks. For this reason, we used the Ubuntu Dialogue Corpus [23] to train and analyse the performance of our mDNA model. The Ubuntu Corpus is an English dataset which contains multi-turn dialogues constructed from Ubuntu Internet Relay Chat (IRC) logs. The dataset consists of one million utterance-response pairs, and each pair have a binary label to mark the responses as either positive or negative (as shown in Table 1). In the training set, the ratio of the positive and the negative is 1:1, and 1:9 in the validation and test sets. Table 2 gives the statistics of the training set, validation set, and test set.

Hyperparameter Settings
We implemented the mDNA with Python 3 language, and used word2vec [24] to pre-train the word embedding on training set. The hidden size of the Bi-LSTM layers is set to 300, and to further enable fine-grain deep learning for good performance, the number of training epoch is set to 15 which we arrived at by fine-tuning the model with early stopping mechanism, while we set the mini-batch sizes to 16 for training, testing and validation, respectively, as could be accommodated by the Titan NVIDIA GPU memory resource available in our machine. Furthermore, the initial learning rate is set as 2 × 10 −4 (i.e., 0.0002) used to update the learning parameters by stochastic gradient descent with Adam optimizer [25]. While low learning rate tends to slow down the training speed, it also contributes to steep decrease in the network's losses. Finally, the mDNA model is trained in Tensorflow-GPU environment with Cuda kernel, and the learning parameters W 2 and b 2 in the softmax function uses the default value set by Tensorflow. Being that Ubuntu dialog corpus is large (i.e., comprising of 1,000,000 Q&A pairs), we set the vocabulary size to 100,000; and lastly, we assigned the maximum lengths of utterance input embedding and response input embedding as 350 and 150, respectively.

Results and Evaluation
In this paper, we used the recall R at position k in n number of candidates, denoted as R n @k, as the evaluation metrics, as it is also used in DAM [1], SMN [2], and other novel retrieval chatbot models [20,24,[26][27][28][29] we selected to compare their performance with that of the mDNA model. Here, R n @k is defined as (the number of relevant retrieved response candidates at top-k)/(total n number of retrieved response candidates). We represent this mathematically as below: where TP and FN represent true positive and false negative of retrieved response candidates, respectively. This evaluation metrics formula is suitable for information retrieval tasks, and is popularly used to measure and score the performance of information retrieval systems in terms of the degree at which the selected output response candidates match with the context of the input utterances. In our task, the mDNA model needs to select a response from n number of candidate responses. In this process, if the true response is among the response candidates, it is referred to as the true positive. Figure 4 shows the overview representation plot of R 10 @1, 2, and 5, with respect to different values of epochs which shows that the model achieved a stable training (and learning). Moreover, to further analyze the performance of our model, we selected some representative models based on retrieval dialog systems (that are either based on Transformer or gated-RNN respective), DAM [1], SMN [2], DualEncoder [23], MV-LSTM [26], Match-LSTM [27], Multiview [28], DL2R [29], and BERT Bi-Encoder+CE [20]. We compared the performance of these models to that of the mDNA model; and to perform ablation analysis, we removed the DAM model combined with the mDNA, and referred to it as mDNA sel f . In Table 3, the recall evaluation results show that the mDNA performs competitively better than most of the selected chatbot models trained on Ubuntu Dialogue Corpus that were compared to it. With respect to the baseline DAM model, the performance of the mDNA increased by 3% on R 10 @1, 1.7% on R 10 @2 and 0.6% on R 10 @5, respectively. While the mDNA outperforms all other selected models compared to it, we observe that the recall evaluation scores of the mDNA and BERT Bi-Encoder+CE model [20] are similar, and therefore the performance gap between them is very insignificant. The BERT Bi-Encoder+CE model exploits the benefits of Bi-GRU and Bi-encoder system of BERT-Transformer architecture which enables its performance to be comparable to that of the mDNA. However, the capability of augmenting Transformer-based retrieval chatbot models for performance enhancement is a unique attribute to which the mDNA claims superiority.  Lastly, owing to the important role of memory in the mDNA, we investigated to see if adjustment of the maximum length of utterance embedding is fed into the memory would affect the performance and effectiveness in any way. In this task, we initially set the dimension of utterance memory embedding to 350 and, subsequently, during multiple retraining sessions of the network, we altered the value from 350 to 300, 250, 200, and 150 at different corresponding intervals of the following training epochs: 3, 6, 9, 12, and 15, respectively. After several retraining of the model, we observed that the size of utterance word embedding parsed into the memory does not affect its performance, but instead, altering the values of the epochs directly influences overall accuracy and performance of the model. As the number of epoch increases from a lower value to the max set value, the average accuracy score also increases, as well as the scores of Recall 10 @1, Recall 10 @2, Recall 10 @5, and vice-versa. This scenario is depicted with the plot in Figure 4.

Conclusions
In this paper, we presented the memory-based deep neural attention (mDNA) method and described the concepts of its operational process as well as the benefits of using fusion technique to augment independent retrieval chatbot models that are based on the Transformer architecture, for enhancing response selection accuracy, and consistency in logical context-dependency matching, respectively. As a case study, we selected the DAM network model and hybridised it with the mDNA to demonstrate the effectiveness of achieving better performance. The results obtained show the improvement in performance over some existing state-of-the-art retrieval chatbots. In future works, we will consider enhancing the Bi-LSTM encoder algorithm to support multimodal input-data and fusion (such as text and image) which will necessitate the use of more diversified datasets in the training of the model to further improve the accuracy of response selection.

Conflicts of Interest:
The authors declare no conflict of interests regarding the publication of this article.