Machine Reading Comprehension for Answer Re-Ranking in Customer Support Chatbots

Recent advances in deep neural networks, language modeling and language generation have introduced new ideas to the field of conversational agents. As a result, deep neural models such as sequence-to-sequence, Memory Networks, and the Transformer have become key ingredients of state-of-the-art dialog systems. While those models are able to generate meaningful responses even in unseen situation, they need a lot of training data to build a reliable model. Thus, most real-world systems stuck to traditional approaches based on information retrieval and even hand-crafted rules, due to their robustness and effectiveness, especially for narrow-focused conversations. Here, we present a method that adapts a deep neural architecture from the domain of machine reading comprehension to re-rank the suggested answers from different models using the question as context. We train our model using negative sampling based on question-answer pairs from the Twitter Customer Support Dataset.The experimental results show that our re-ranking framework can improve the performance in terms of word overlap and semantics both for individual models as well as for model combinations.


Introduction
The growing popularity of smart devices, personal assistants, and online customer support systems has driven the research community to develop various new methodologies for automatic question answering and chatbots. In the domain of conversational agents, two general types of systems have become dominant: (i) retrieval-based, and (ii) generative. While the former produce clear and smooth output, the latter bring flexibility and the ability to generate new unseen answers.
In this work, we focus on finding the most suitable answer for a question, where each candidate can be produced by a different system, e.g., knowledge-based, rule-based, deep neural network, retrieval, etc. In particular, we propose a re-ranking framework based on machine reading comprehension [1][2][3] for question-answer pairs. Moreover, instead of selecting the top candidate from the re-ranker's output, we use probabilistic sampling that aims to diversify the agent's language and to up-vote popular answers from different input models. We train our model using negative sampling based on question-answer pairs from the Twitter Customer Support Dataset.
In our experimental setup, we adopt a real-world application scenario, where we train on historical logs for some period of time, and then we test on logs for subsequent days. We evaluate the model using both semantic similarity measures, as well as word-overlap ones such as BLEU [4] and ROUGE [5], which come from machine translation and text summarization.
The remainder of this paper is organized as follows: Section 2 presents some related work in the domain of conversational agents and answer combination. Section 3 describes our framework and the general workflow for answer re-ranking. Section 4 introduces the original dataset and explains how we used in to build a new, task-specific one with negative sampling; it also offers insights about the dialogs and the pre-processing. Section 5 describes our experiments, and gives details about the training parameters. Section 6 presents the performance of each model and discusses the results. Finally, Section 7 concludes and suggests possible directions for future work.

Conversational Agents
The emergence of large conversational corpora such as the Ubuntu Dialog corpus [6], OpenSubtitles [7], CoQA [8] and the Microsoft Research Social Media Conversation Corpus [9] has enabled the use of generative models and end-to-end neural networks in the domain of conversational agents. In particular, sequence-to-sequence (seq2seq) models, which were initially proposed for machine translation [10][11][12], got adapted to become a standard tool for training end-to-end dialogue systems. Early vanilla seq2seq models [13] got quickly extended to model hierarchical structure [14], context [15], and combination thereof [16]. While models were typically trained on corpora such as Ubuntu, some work [17] has also used data from Community Question Answering forums [18]; this means forming a training pair involving a question and each good answer in the corresponding question-answer thread.
More recently, the Transformer, a model without recurrent connections, was proposed [19], demonstrating state-of-the-art results for Machine Translation in various experimental scenarios for several language pairs and translation directions, thus, emerging as a strong alternative to seq2seq methods. The fact that it only uses self-attention makes it a lot faster both at training and at inference time, even though its deep architecture requires more calculations than a seq2seq model, it enables high degree of parallelism, while maintaining the ability to model word sequences through the mechanism of attention and positional embeddings.
In the domain of customer support, it has been shown that generative models such as seq2seq and the Transformer perform better then retrieval-based models, but they fail in the case of insufficient training data [20]. Other works have incorporated intent categories and semantic matching into an answer selection model, which uses a knowledge base as its source [21]. In the insurance domain, Feng et al. [22] proposed a generic deep learning approach for answer selection, based on convolutional neural networks (CNN) [23]. In Li et al. [24] combined recurrent neural networks based on long short-term memory (LSTM) cells [25] and reinforcement learning (RL) to learn without the need of prior domain knowledge.

Answer Combination
Answer combination has been recognized as an important research direction in the domain of customer support chatbots. For example, Qiu et al. [26] used an attentive seq2seq re-ranker to choose dynamically between the outputs of a retrieval-based and a seq2seq model. Similarly, Cui et al. [27] combined a fact database, FAQs, opinion-oriented answers, and a neural-based chit-chat generator, by training a meta-engine that chooses between them.
Answer combination is also a key research topic in the related field of information retrieval (IR). For example, Pang et al. [28] proposed a generic relevance ranker based on deep learning and CNNs [23], which tries to maintain standard IR search engine characteristics, such as exact matching and query term importance, while enriching the results based on semantics, proximity heuristics, and diversification.

Re-Ranking Model
Our re-ranking framework uses a classifier based on QANet [1], a state-of-the-art architecture for machine reading comprehension, to evaluate whether a given answer is a good fit for the target question. It then uses the posterior probabilities of the classifier to re-rank the candidate answers, as shown in Figure 1.

Negative Sampling
Our goal is to distinguish "good" vs. "bad" answers, but the original dataset only contains valid, i.e., "good" question-answer pairs. Thus, we use negative sampling [29], where we replace the original answer to the target question with a random answer from the training dataset. We further compare the word-based cosine similarity between the original and the sampled answer, and, in some rare cases, we turn a "bad" answer into "good" one if it is too similar to the original "good" answer.

QANet Architecture
Machine reading comprehension aims to answer a question by looking to extract a string from a given text context. Here, we use that model to measure the goodness of a given question-answer pair.
The first layer of the network is a standard an embedding layer, which transforms words into low-dimensional dense vectors. Afterwards, a two-layer highway network [30] is added on top of the embedding representations. This allows the network to regulate the information flow using a gated mechanism. The output of this layer is of dimensionality #words × d, where #words is the number of words in the encoded sentence (Note that it differs for the question vs. the answer. See Section 5.1 for more detail.) and d is the input/output dimensionality of the model for all Transformer layers, which is required by the architecture.
We experiment with two types of input embeddings. First, we use 200-dimensional GloVe [31] vectors trained on 27 billion Twitter posts. We compare their performance to ELMo [32], a recently proposed way to train contextualized word representations. In ELMo, these word vectors are learned activation functions of the internal states of a deep bi-directional language model. The latter is built upon a single (embedding) layer, followed by two LSTM [25] layers, which are fed the words from a target sentence in a forward and a backward direction, accordingly. We obtain the final embedding by taking a weighted average over all three layers as suggested in [32].
The embedding encoder layer is based on a convolution, followed by self-attention [19] and a feed-forward network. We use a kernel size of seven, d filters, and four convolutional layers within a block. The output of the layer is f (layernorm(x)) + x, where layernorm is the layer normalization operation [33]. The output again is mapped to #words × d by a 1D convolution. The input and the embedding layers are learned separately for the question and the answer.
The attention layer is a standard module for machine reading comprehension models. We call it answer-to-question (A2Q) and question-to-answer (Q2A) attention, which are also known as context-query and query-context, respectively. Let us denote the output of the encoder for the question as Q and for the answer as A. In order to obtain the attention, the model first computes a matrix S with similarities between each two words for the question and the answer, then the values are normalized using softmax. The similarity function is defined as follows: f (a, q) = W 0 [a; q; a q].
We adopt the notation S = so f tmax(S), which is a softmax normalization over the rows of S, and S = so f tmax(S ) is a normalization over the columns. Then, the two attention matrices are computed as A2Q = S · Q , and Q2A = S · S · C .
The attention layer is followed by a model layer, which takes as input the concatenation of [a; a2q; a a2q; a q2a], where we use small letters to denote rows from the original matrices. For the output layer, we learn two different representations by passing the output of the model layer to two residual blocks, applying dropout [34] only to the inputs of the first one. We predict the output as . The weights are learned by minimizing a binary cross-entropy loss.

Answer Selection
We experimented with two answer selection strategies: (i) max, and (ii) proportional sampling after softmax normalization. The former strategy is standard and it selects the answer with the highest score, while the latter one returns a random answer with probability proportional to the score returned by the softmax, aiming at increasing the variability of the answers.
For both strategies, we use a linear projection applied on the output of the last residual model block, which is shows as "linear block" in Figure 1. We can generalize the latter as follows: o(q, a k ) = W o [M], where M is the concatenation of the outputs of one or more residual model blocks.
We present the formulation of the two strategies, as we introduce the following notation: Ans is the selected utterance by the agent; o(q, a k ) is the output of the model before applying the sigmoid function; q is the original question by the user; A is the set of possible answers that we want to re-rank. Equation (1) shows the selection process in the max case.
We empirically found that the answer selection based on the max strategy does not always perform well. As our experimental results in Tables 3 and 4 show, we can gain notable improvement by using proportional sampling after softmax normalization, instead of always selecting the answer with the highest probability. In our experiments, we model Ans as a random variable that follows a categorical distribution over K = |A| events (candidate answers). For each of the question-answer pairs (q, a), we define the probability p that a is a good answer to q using softmax as shown in Equations (2) and (3). Finally, we draw a random sample from Equation (3) to obtain the best matching answer.

Data
The data and the resources that could be used to train customer support conversational agents are generally very scarce, as companies keep conversations locked on their own proprietary support systems. This is due to customer privacy concerns and to companies not wanting to make public their know-how and the common issues about their products and services. An extensive 2015 survey on available dialog corpora by Serban et al. [35] found no good publicly available dataset for real-world customer support.
In early 2018, this situation changed as a new open dataset for Customer Support on Twitter [36] was made available on Kaggle. It contains 3M tweets and replies for twenty big companies such as Amazon, Apple, Uber, Delta, and Spotify, among others. As customer support topics from different organizations are generally unrelated to each other, we focus only on tweets related to Apple support, which represents the largest number of tweets in the corpus.
We filtered all utterances that redirect the user to another communication channel, e.g., direct messages, which are not informative for the model and only bring noise. Moreover, since answers evolve over time, we divided our dataset into a training and a testing part, keeping earlier posts for training and the latest ones for testing. We further excluded from the training set all conversations that are older then sixty days. For evaluation, we used dialogs from the last five days in the dataset, to simulate a real-world scenario for customer support. We ended up with a dataset of 49,626 question-answer pairs divided into 45,582 for training and 4,044 for testing. Finally, we open-sourced our code for pre-processing and filtering the data, making it available to the research community [37]. Table 1 shows some statistics about our dataset. On the top of the table, we can see that the average number of turns per dialog is under three, which means that most of the dialogues finish after one answer from the customer support. The bottom of the table shows the distribution of the words in the user questions vs. the customer support answers. We can see that answers tend to be slightly longer, which is natural as replies by customer support must be extensive and helpful.

Preprocessing
Since Twitter has its own specifics of writing in terms of both length (by design, tweets have been strictly limited to 140 characters; this constraint has been relaxed to 280 characters in 2017) and style, standard text tokenization is generally not suitable for tweets. Therefore, we used a specialized Twitter tokenizer [38] to preprocess the data. Then, we further replaced shorthand entries such as 'll, 'd, 're, 've, with the most corresponding literary form, e.g., will, would, are, have. We also replaced shortened slang words, e.g., 'bout and 'til, with the standard words, e.g., about and until. Similarly, we replaced URLs with the special word <url>, all user mentions with <user>, and all hashtags with <hashtag>.
Due to the nature of writing in Twitter and the free form of the conversation, some of the utterances contain emoticons and emojis. They are handled automatically by the Twitter tokenizer and treated as a single token. We keep them in their original form, as they can be very useful for detecting emotions and sarcasm, which pose serious challenges for natural language understanding.
Based on the statistics presented in Section 4, we chose to trim the length of the questions and of the answers to 60 and 70 words, respectively.

Training Setup
For training, we use the Adam [39] optimizer with decaying learning rate, as implemented in TensorFlow [40]. We start with the following values: learning rate η = 5 × 10 −4 , exponential decay rate for the 1st and the 2nd momentum β 1 = 0.9 and β 2 = 1.00, and constant for prevention of division by zero = 1 × 10 −7 . Then, we decay the learning after each epoch by a factor of 0.99. We also apply dropout with a probability of 0.1, and L2 weight decay on all trainable variables with λ = 3 × 10 −7 . We train each model for 42K steps with a batch size of 64. We found these values by running a grid search on a dev set (extracted as a fraction of the training data) and using the values suggested in [1], where applicable.
For IR, we use ElasticSearch [41] with English analyzer enabled, whitespace-and punctuation-based tokenization, and word 3-grams. We further use the default BM25 algorithm [42], which is an improved version of TF.IDF. For all training questions and for all testing queries, we append the previous turns in the dialog as context.
For seq2seq, we use a bi-directional LSTM network with 512 hidden units per direction. The decoder has two uni-directional layers connected directly to the bi-directional layer in the encoder. The network takes as input words encoded as 200-dimensional embeddings. It is a combination of pre-trained GloVe [31] vectors for the known words, and a positional embedding layer, learned as model parameters, for the unknown words. The embedding layers for the encoder and for the decoder are not shared, and are learned separately. This separation is due to the words used in utterances by the customers being very different from the posts by the customer support.
For the Transformer, we use two identical layers for the encoder and for the decoder, with four heads for the self-attention. The dimensionality of the input and of the output is d model = 256, and the inner dimensionality is d inner = 512. The input consists of queries with keys of dimensionality d k = 64 and values of the same dimensionality d v = 64. The input and the output embedding are learned separately with sinusoidal positional encoding.

Evaluation Measures
How to evaluate a chatbot is an open research question. As the problem is related to machine translation (MT) and text summarization (TS), which are nowadays also addressed using seq2seq models, researchers have been using MT and TS evaluation measures such as BLEU [4] and ROUGE [5], which focus primarily on word overlap and measure the similarity between the chatbot's response and the gold answer to the user question (here, the answer by the customer support). However, it has been argued [43,44] that such word-overlap measures are not very suitable for evaluating chatbots. Thus, we adopt three additional measures, which are more semantic in nature.
The embedding average [6] constructs a vector for a piece of text by taking the average of the word embeddings of its constituent words. Then, the vectors for the chatbot response and for the gold human answer are compared using the cosine similarity.
The greedy matching was introduced in the context of intelligent tutoring systems [45]. It matches each word in the chatbot's output to the most similar word in the gold human response, where the similarity is measured as the cosine between the corresponding word embeddings, multiplied by a weighting term (which we set to 1), as shown in Equation (4). Since this measure is asymmetric, we also calculate it with the arguments swapped, and then we take the average as shown in Equation (5).
The vector extrema [46] was proposed for dialogue systems. Instead of averaging the word embeddings of the words in a piece of text, it takes the coordinate-wise maximum (or minimum), as shown in Equation (6). Finally, the resulting vectors for the chatbot output and for the gold human answer are compared using the cosine similarity.

Evaluation Results
Below, we first discuss our auxiliary classification task, where the objective is to predict which question-answer pair is "good", and then we move to the main task of answer re-ranking. Table 2 shows the results for the auxiliary task of question-answer goodness classification. The first column is the name of the model. It is followed by three columns showing the type of embedding used, the size of the hidden layer, and the number of heads (see Section 3.2). The last column reports the accuracy. Since our dataset is balanced (we generate about 50% positive, and about 50% negative examples), accuracy is a suitable evaluation measure for this task. The top row of the table shows the performance for a majority class baseline. The following lines show the results for our full QANet-based model when using different kinds of embeddings. We can see that contextualized sentence-level embeddings are preferable to using simple word embeddings as in GloVe or token-level ELMo embeddings. Moreover, while token-level ELMo outperforms GloVe when the size of the network is small, there is no much difference when the number of parameters grows (d model = 128, #Heads = 8).  Table 3 reports the performance of the individual models: information retrieval (IR), Sequence-to-sequence (seq2seq), and the Transformer (see Section 5.3 for more details about these models). In our earlier work [20], we performed these experiments using exactly the same experimental setup. The table is organized as follows: The first column contains the name of the model used to obtain the best answer. The second and the third columns report the word overlap measures: (i) BLEU@2, which uses uni-gram and bi-gram matches between the hypothesis and the reference sentence, and (ii) ROUGE-L [47], which uses Longest Common Subsequence (LCS). The last three columns are for the semantic similarity measures: (i) Embedding Average (Emb Avg) with cosine similarity, (ii) Greedy Matching (Greedy Match), and (iii) Vector Extrema (Vec Extr) with cosine similarity. In the three latter measures, we used the standard pre-trained word2vec embeddings because they are not learned during training, which helps avoid bias, as has been suggested in [43,44].

Answer Selection/Generation: Individual Models
We can see in Table 3 that the seq2seq model outperforms IR by a margin on all five evaluation measures, which is consistent with previous results in the literature. What is surprising, however, is the relatively poor performance for the Transformer, which trails behind the seq2seq model on all evaluation measures. We hypothesize that this is due to the Transformer having to learn more parameters as it operates with higher-dimensional word embeddings. Overall, the Transformer is arguably slightly better than the IR model, outperforming it on three of the five evaluation measures.
The last row of Table 3 is not an individual model; it is our re-ranker applied to the top answers returned by the IR model. In particular, we use QANet with Sentence level ELMo (d model = 128, #Heads = 8). We took the top-5 answer candidates (the value of 5 was found using cross-validation on the training dataset) from the IR model, and we selected the best answer based on our re-ranker's scores. We can see that re-ranking yields improvements for all evaluation measures: +1.18 on BLEU@2, +0.93 on ROUGE_L, +1.12 on Embedding Average, +0.67 on Greedy Matching, and +1.64 in Vector Extrema. These results show that we can get sizable performance gains when re-ranking the top-K predictions of a single model; below we will combine multiple models.

Main Task: Multi-Source Answer Re-Ranking
Next, we combine the top-K answers from different models: IR and seq2seq. We did not include the Transformer in the mix as its output is generative and similar to that of the seq2seq model; moreover, as we have seen in Table 3 above, it performs worse than seq2seq on our dataset. We set K = 2 for the baseline, Random Top Answer, which selects a random answer from the union of the top K answers by the models involved in the re-ranking. For the remaining re-ranking experiments, we use K = 5. We found these values using cross-validation on the training dataset, trying 1-5.
The results are shown in Table 4, where different representations are separated by a horizontal line. The first row of each group contains the name of the model. Then, on the even rows (second, forth, etc.), we show the results from a greedy answer selection strategy, while on the odd rows are the results from an exploration strategy (softmax sampling). Since softmax sampling and random selection are stochastic in nature, we include a 95% confidence interval for them. We can see in Table 4 that QANet with sentence-level ELMo (d model = 128, #Heads = 8) performs best in terms of BLEU@2, ROUGE_L, and Greedy Matching. Note also the correlation between higher results on the auxiliary task (see Table 2) and improvement in terms of word-overlap measures, where we find the largest difference between individual and re-ranked models (+1.5 points absolute over the baseline, and +0.95 over seq2seq in terms of BLEU@2). In terms of semantic similarity, we note the highest increase for Embedding Average (+1.3 over the baseline, and +1.4 over seq2seq), and a smaller one for Greedy Matching (+1.0 over the baseline, and +0.4 over seq2seq), and Vector Extrema (+2.6 over the baseline, and +0.6 over seq2seq).
Overall, the re-ranked models are superior as evaluated on word-matching measures, which is supported by the improvement of BLEU@2 and Embedding Average. The smaller improvement for Greedy Matching and Vector Extrema can be explained by the training procedure for the re-ranking model, which is based on word comparison. However, these two measures focus on keyword similarity between the target and the proposed answers, and generative models are better at this. This is supported by comparing the combined model to IR-BM25, where we see sizable improvements of +1.5 and +2.0 in terms of Greedy Matching and Vector Extrema, respectively.
We can further see in Table 4 that using a stochastic approach to select the best answer yields additional improvements. This strategy accounts for the predicted goodness score for each candidate, thus, enriching the model in two ways. First, implicit voting is used, as duplicate answer candidates are not removed, resulting in higher selection probability of popular answers from different input modules. Second, albeit two answers may have a very different structure, they still can be similar in meaning, leading to very similar scores and promoting only the first one. This behavior can be mitigated by choosing the winner proportionally to its ranking, thus, also introducing diversity in the chatbot's language. This hypothesis is supported by the results in Table 4: compare each model to the corresponding one with softmax selection.

Conclusions and Future Work
We have presented a novel framework for re-ranking answer candidates for conversational agents. In particular, we adopted techniques from the domain of machine reading comprehension [1][2][3] to evaluate the quality of a question-answer pair. Our framework consists of two tasks: (i) an auxiliary one, aiming to fit a goodness classifier using QANet and negative sampling, and (ii) a main task that re-ranks answer candidates using the learned model. We further experimented with different model sizes and two types of embedding models: GloVe [31] and ELMo [32]. Our experiments showed improvements in answer quality in terms of word-overlap and semantics when re-ranking using the auxiliary model. Last but not least, we argued that choosing the top-ranked answer is not always the best option. Thus, we introduced probabilistic sampling that aims to diversify the agent's language and to up-vote the popular answers, while taking their ranking scores into consideration.
In future work, we plan to experiment with different exploration strategies such as Boltzmann exploration (softmax is a degraded version of Boltzmann exploration with τ = 1). -Confident, Upper confidence bound, and other bandit methods [48] to widen the possible context for each answer over time. We see an interesting research direction in applying deep reinforcement learning (i) to improve the answer selection models when applied to unseen questions, and (ii) to account for user feedback and customer support task success.