Machine Reading Comprehension for Answer Re-Ranking in Customer Support Chatbots

Hardalov, Momchil; Koychev, Ivan; Nakov, Preslav

doi:10.3390/info10030082


by Momchil Hardalov ¹,*, Ivan Koychev ¹ and Preslav Nakov ²

¹ Faculty of Mathematics and Informatics, Sofia University, 1164 Sofia, Bulgaria

² Qatar Computing Research Institute, Hamad Bin Khalifa University, 34110 Doha, Qatar

* Author to whom correspondence should be addressed.

Information 2019, 10(3), 82; https://doi.org/10.3390/info10030082

Submission received: 21 January 2019 / Revised: 16 February 2019 / Accepted: 19 February 2019 / Published: 26 February 2019

(This article belongs to the Special Issue Artificial Intelligence—Methodology, Systems, and Applications)


Abstract

Recent advances in deep neural networks, language modeling and language generation have introduced new ideas to the field of conversational agents. As a result, deep neural models such as sequence-to-sequence, memory networks, and the Transformer have become key ingredients of state-of-the-art dialog systems. While those models are able to generate meaningful responses even in unseen situations, they need a lot of training data to build a reliable model. Thus, most real-world systems have used traditional approaches based on information retrieval (IR) and even hand-crafted rules, due to their robustness and effectiveness, especially for narrow-focused conversations. Here, we present a method that adapts a deep neural architecture from the domain of machine reading comprehension to re-rank the suggested answers from different models using the question as a context. We train our model using negative sampling based on question–answer pairs from the Twitter Customer Support Dataset. The experimental results show that our re-ranking framework can improve the performance in terms of word overlap and semantics both for individual models as well as for model combinations.

Keywords: conversational agents; chatbots; machine reading comprehension; question answering; information retrieval; answer re-ranking

1. Introduction

The growing popularity of smart devices, personal assistants, and online customer support systems has driven the research community to develop various new methodologies for automatic question answering and chatbots. In the domain of conversational agents, two general types of systems have become dominant: (i) retrieval-based, and (ii) generative. While the former produce clear and smooth output, the latter bring flexibility and the ability to generate new unseen answers.

In this work, we focus on finding the most suitable answer for a question, where each candidate can be produced by a different system, e.g., knowledge-based, rule-based, deep neural network, retrieval, etc. In particular, we propose a re-ranking framework based on machine reading comprehension [1,2,3] for question–answer pairs. Moreover, instead of selecting the top candidate from the re-ranker’s output, we use probabilistic sampling that aims to diversify the agent’s language and to up-vote popular answers from different input models. We train our model using negative sampling based on question–answer pairs from the Twitter Customer Support Dataset.

In our experimental setup, we adopt a real-world application scenario, where we train on historical logs for some period of time, and then we test on logs for subsequent days. We evaluate the model using both semantic similarity measures, as well as word-overlap ones such as BLEU [4] and ROUGE [5], which come from machine translation and text summarization.

The remainder of this paper is organized as follows: Section 2 presents some related work in the domain of conversational agents and answer combination. Section 3 describes our framework and the general workflow for answer re-ranking. Section 4 introduces the original dataset and explains how we used it to build a new, task-specific one with negative sampling; it also offers insights about the dialogs and the pre-processing. Section 5 describes our experiments, and gives details about the training parameters. Section 6 presents the performance of each model and discusses the results. Finally, Section 7 concludes and suggests possible directions for future work.

2. Related Work

2.1. Conversational Agents

The emergence of large conversational corpora such as the Ubuntu Dialog corpus [6], OpenSubtitles [7], CoQA [8] and the Microsoft Research Social Media Conversation Corpus [9] has enabled the use of generative models and end-to-end neural networks in the domain of conversational agents. In particular, sequence-to-sequence (seq2seq) models, which were initially proposed for machine translation [10,11,12], were adapted to become a standard tool for training end-to-end dialogue systems. Early vanilla seq2seq models [13] were quickly extended to model hierarchical structure [14], context [15], and a combination thereof [16]. While models were typically trained on corpora such as Ubuntu, some work [17] has also used data from Community Question Answering forums [18]; this means forming a training pair from a question and each good answer in the corresponding question-answer thread.

More recently, the Transformer, a model without recurrent connections, was proposed [19]. It demonstrated state-of-the-art results for machine translation in various experimental scenarios, for several language pairs and translation directions, thus emerging as a strong alternative to seq2seq methods. Because it relies only on self-attention, it is considerably faster at both training and inference time: even though its deep architecture requires more calculations than a seq2seq model, it enables a high degree of parallelism, while maintaining the ability to model word sequences through attention and positional embeddings.

In the domain of customer support, it has been shown that generative models such as seq2seq and the Transformer perform better than retrieval-based models, but they fail in the case of insufficient training data [20]. Other works have incorporated intent categories and semantic matching into an answer selection model, which uses a knowledge base as its source [21]. In the insurance domain, Feng et al. [22] proposed a generic deep learning approach for answer selection, based on convolutional neural networks (CNN) [23]. Li et al. [24] combined recurrent neural networks based on long short-term memory (LSTM) cells [25] and reinforcement learning (RL) to learn without the need for prior domain knowledge.

2.2. Answer Combination

Answer combination has been recognized as an important research direction in the domain of customer support chatbots. For example, Qiu et al. [26] used an attentive seq2seq re-ranker to choose dynamically between the outputs of a retrieval-based and a seq2seq model. Similarly, Cui et al. [27] combined a fact database, FAQs, opinion-oriented answers, and a neural-based chit-chat generator, by training a meta-engine that chooses between them.

Answer combination is also a key research topic in the related field of information retrieval (IR). For example, Pang et al. [28] proposed a generic relevance ranker based on deep learning and CNNs [23], which tries to maintain standard IR search engine characteristics, such as exact matching and query term importance, while enriching the results based on semantics, proximity heuristics, and diversification.

3. Re-Ranking Model

Our re-ranking framework uses a classifier based on QANet [1], a state-of-the-art architecture for machine reading comprehension, to evaluate whether a given answer is a good fit for the target question. It then uses the posterior probabilities of the classifier to re-rank the candidate answers, as shown in Figure 1.

3.1. Negative Sampling

Our goal is to distinguish “good” vs. “bad” answers, but the original dataset only contains valid, i.e., “good” question–answer pairs. Thus, we use negative sampling [29], where we replace the original answer to the target question with a random answer from the training dataset. We further compute the word-based cosine similarity between the original and the sampled answer, and, in some rare cases, we turn a “bad” answer into a “good” one if it is too similar to the original “good” answer.
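The sampling procedure can be illustrated with a minimal sketch. It assumes whitespace tokenization, bag-of-words count vectors, and a similarity threshold of 0.9; none of these are specified in the text, and the names `word_cosine` and `build_examples` are hypothetical.

```python
import random
from collections import Counter
from math import sqrt


def word_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def build_examples(pairs, threshold=0.9, seed=0):
    """Return (question, answer, label) triples: one gold and one sampled answer per pair.

    `threshold` is an assumed value; the paper does not report the one it used.
    """
    rng = random.Random(seed)
    answers = [answer for _, answer in pairs]
    examples = []
    for question, gold in pairs:
        examples.append((question, gold, 1))
        sampled = rng.choice(answers)
        # A sampled answer that is nearly identical to the gold one stays "good".
        label = 1 if word_cosine(gold, sampled) >= threshold else 0
        examples.append((question, sampled, label))
    return examples
```

With one sampled answer per gold pair, the resulting set is roughly balanced between positive and negative examples.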

3.2. QANet Architecture

Machine reading comprehension aims to answer a question by extracting a string from a given text context. Here, we use such a model to measure the goodness of a given question–answer pair.

The first layer of the network is a standard embedding layer, which transforms words into low-dimensional dense vectors. Afterwards, a two-layer highway network [30] is added on top of the embedding representations. This allows the network to regulate the information flow using a gating mechanism. The output of this layer is of dimensionality $\#words \times d$, where $\#words$ is the number of words in the encoded sentence (note that it differs for the question vs. the answer; see Section 5.1 for more detail) and $d$ is the input/output dimensionality of the model for all Transformer layers, which is required by the architecture.

We experiment with two types of input embeddings. First, we use 200-dimensional GloVe [31] vectors trained on 2 billion Twitter posts (27 billion tokens). We compare their performance to ELMo [32], a recently proposed way to train contextualized word representations. In ELMo, the word vectors are learned functions of the internal states of a deep bi-directional language model. The latter is built upon a single (embedding) layer, followed by two LSTM [25] layers, which are fed the words from a target sentence in a forward and a backward direction, respectively. We obtain the final embedding by taking a weighted average over all three layers as suggested in [32].

The embedding encoder layer is based on a convolution, followed by self-attention [19] and a feed-forward network. We use a kernel size of seven, d filters, and four convolutional layers within a block. The output of the layer is $f(\mathrm{layernorm}(x)) + x$, where $\mathrm{layernorm}$ is the layer normalization operation [33]. The output again is mapped to $\#words \times d$ by a 1D convolution. The input and the embedding layers are learned separately for the question and the answer.

The attention layer is a standard module for machine reading comprehension models. It computes two attentions, answer-to-question (A2Q) and question-to-answer (Q2A), which are also known as context-query and query-context, respectively. Let us denote the output of the encoder for the question as Q and for the answer as A. In order to obtain the attention, the model first computes a matrix S with the similarities between each pair of words from the question and the answer, and then the values are normalized using softmax. The similarity function is defined as follows: $f(a, q) = W_{0}[a; q; a \odot q]$.

We adopt the notation $\bar{S} = \mathrm{softmax}(S)$, which is a softmax normalization over the rows of S, and $\bar{\bar{S}} = \mathrm{softmax}(S^{\top})$, which is a normalization over the columns. Then, the two attention matrices are computed as $A2Q = \bar{S} \cdot Q^{\top}$ and $Q2A = \bar{S} \cdot \bar{\bar{S}}^{\top} \cdot A^{\top}$.
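To make the shapes concrete, the following sketch computes both attention matrices in plain Python. It is an illustration under stated assumptions rather than the implementation used in the experiments: encodings are stored with one row per word (so the transposes in the formulas disappear), the answer plays the role of the context, the similarity weights `w0` form a single vector of length 3d, and the helper names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(x):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*x)]


def matmul(x, y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def attention(ans, que, w0):
    """ans: n x d answer encodings, que: m x d question encodings, w0: 3d weights."""
    # sim[i][j] = w0 . [a_i; q_j; a_i * q_j], the trilinear similarity.
    sim = [[sum(w * v for w, v in zip(w0, a + q + [x * y for x, y in zip(a, q)]))
            for q in que] for a in ans]
    s_row = [softmax(row) for row in sim]                        # each row sums to 1
    s_col = transpose([softmax(col) for col in transpose(sim)])  # each column sums to 1
    a2q = matmul(s_row, que)                                     # n x d
    q2a = matmul(matmul(s_row, transpose(s_col)), ans)           # n x d
    return a2q, q2a
```

Both outputs have one row per answer word, which is what allows them to be concatenated row-wise with the answer encoding in the next layer.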

The attention layer is followed by a model layer, which takes as input the concatenation $[a; a2q; a \odot a2q; a \odot q2a]$, where we use small letters to denote rows from the original matrices. For the output layer, we learn two different representations by passing the output of the model layer to two residual blocks, applying dropout [34] only to the inputs of the first one. We predict the output as $P(a \mid q) = \sigma(W_{o}[M_{0}; M_{1}])$. The weights are learned by minimizing a binary cross-entropy loss.

3.3. Answer Selection

We experimented with two answer selection strategies: (i) max, and (ii) proportional sampling after softmax normalization. The former strategy is standard and it selects the answer with the highest score, while the latter one returns a random answer with probability proportional to the score returned by the softmax, aiming at increasing the variability of the answers.

For both strategies, we use a linear projection applied on the output of the last residual model block, which is shown as “linear block” in Figure 1. We can generalize the latter as follows: $o(q, a_{k}) = W_{o}[M]$, where M is the concatenation of the outputs of one or more residual model blocks.

To formulate the two strategies, we introduce the following notation: $Ans$ is the utterance selected by the agent; $o(q, a_{k})$ is the output of the model before applying the sigmoid function; q is the original question by the user; A is the set of possible answers that we want to re-rank. Equation (1) shows the selection process in the max case.

$$Ans = \underset{a \in A}{\arg\max}\; o(q, a) \qquad (1)$$

We empirically found that the answer selection based on the max strategy does not always perform well. As our experimental results in Tables 3 and 4 show, we can gain notable improvement by using proportional sampling after softmax normalization, instead of always selecting the answer with the highest probability. In our experiments, we model $Ans$ as a random variable that follows a categorical distribution over $K = |A|$ events (candidate answers). For each of the question–answer pairs (q, a), we define the probability p that a is a good answer to q using softmax as shown in Equations (2) and (3). Finally, we draw a random sample from the distribution in Equation (3) to obtain the selected answer.

$$p \mid q, A \sim \mathrm{softmax}(o(q, a_{1}), \dots, o(q, a_{K})) \qquad (2)$$

$$Ans \mid p \sim \mathrm{Cat}(K, p) \qquad (3)$$
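The two strategies can be sketched as follows, assuming the re-ranker's raw scores are already available as a list of floats aligned with the candidate list. The function names are hypothetical, and the standard-library `random.choices` stands in for the categorical draw.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of raw scores o(q, a_k)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_max(candidates, scores):
    """Equation (1): return the candidate with the highest score."""
    best = max(range(len(candidates)), key=lambda k: scores[k])
    return candidates[best]


def select_sampled(candidates, scores, rng=random):
    """Equations (2) and (3): draw one candidate with probability softmax(scores)."""
    probs = softmax(scores)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

If the same answer text appears several times in `candidates`, its copies each receive probability mass, so its total chance of being drawn is the sum over the copies.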

4. Data

The data and the resources that could be used to train customer support conversational agents are generally very scarce, as companies keep conversations locked on their own proprietary support systems. This is due to customer privacy concerns and to companies not wanting to make public their know-how and the common issues about their products and services. An extensive 2015 survey on available dialog corpora by Serban et al. [35] found no good publicly available dataset for real-world customer support.

In early 2018, this situation changed as a new open dataset for Customer Support on Twitter [36] was made available on Kaggle. It contains 3M tweets and replies for twenty big companies such as Amazon, Apple, Uber, Delta, and Spotify, among others. As customer support topics from different organizations are generally unrelated to each other, we focus only on tweets related to Apple support, which represents the largest number of tweets in the corpus.

We filtered out all utterances that redirect the user to another communication channel, e.g., direct messages, which are not informative for the model and only bring noise. Moreover, since answers evolve over time, we divided our dataset into a training and a testing part, keeping earlier posts for training and the latest ones for testing. We further excluded from the training set all conversations that are older than sixty days. For evaluation, we used dialogs from the last five days in the dataset, to simulate a real-world scenario for customer support. We ended up with a dataset of 49,626 question–answer pairs divided into 45,582 for training and 4044 for testing. Finally, we open-sourced our code for pre-processing and filtering the data, making it available to the research community [37].
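A time-based split of this kind could look like the sketch below. It assumes each record is a (timestamp, question, answer) tuple and that the sixty-day limit is measured back from the newest record in the data; the text does not state the reference point, and `temporal_split` is a hypothetical name.

```python
from datetime import datetime, timedelta


def temporal_split(dialogs, test_days=5, max_age_days=60):
    """Split (timestamp, question, answer) records by time.

    The last `test_days` days form the test set; training keeps only records
    that are at most `max_age_days` older than the newest record (an assumption).
    """
    newest = max(ts for ts, _, _ in dialogs)
    test_start = newest - timedelta(days=test_days)
    train_start = newest - timedelta(days=max_age_days)
    train = [d for d in dialogs if train_start <= d[0] < test_start]
    test = [d for d in dialogs if d[0] >= test_start]
    return train, test
```

Splitting by time rather than at random keeps later conversations out of training, which mirrors a deployment where the model only sees historical logs.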

Table 1 shows some statistics about our dataset. On the top of the table, we can see that the average number of turns per dialog is under three, which means that most of the dialogues finish after one answer from the customer support. The bottom of the table shows the distribution of the words in the user questions vs. the customer support answers. We can see that answers tend to be slightly longer, which is natural as replies by customer support must be extensive and helpful.

5. Experiments

5.1. Preprocessing

Since Twitter has its own specifics of writing in terms of both length (by design, tweets were strictly limited to 140 characters; this constraint was relaxed to 280 characters in 2017) and style, standard text tokenization is generally not suitable for tweets. Therefore, we used a specialized Twitter tokenizer [38] to preprocess the data. Then, we further replaced contracted forms such as ’ll, ’d, ’re, ’ve with the corresponding full forms, e.g., will, would, are, have. We also replaced shortened slang words, e.g., ’bout and ’til, with the standard words, e.g., about and until. Similarly, we replaced URLs with the special word url, all user mentions with user, and all hashtags with hashtag.
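The replacement rules can be approximated with a few regular expressions. The sketch below is a stand-in for illustration only: it does not use the Twitter tokenizer from [38], the mapping tables contain just the examples named above, and the patterns for URLs, mentions, and hashtags are assumptions.

```python
import re

# Hypothetical replacement tables; the paper does not list the full mappings.
CONTRACTIONS = {"'ll": " will", "'d": " would", "'re": " are", "'ve": " have"}
SLANG = {"'bout": "about", "'til": "until"}


def normalize_tweet(text: str) -> str:
    """Replace URLs, mentions, hashtags, slang, and contractions with plain words."""
    text = text.replace("\u2019", "'")  # unify curly apostrophes
    text = re.sub(r"https?://\S+", "url", text)
    text = re.sub(r"@\w+", "user", text)
    text = re.sub(r"#\w+", "hashtag", text)
    for short, full in SLANG.items():
        text = re.sub(re.escape(short) + r"\b", full, text, flags=re.IGNORECASE)
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short) + r"\b", full, text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()
```

Slang is rewritten before contractions so that a leading apostrophe in a slang form is consumed first; forms that are not in either table, such as ’s, are left unchanged.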

Due to the nature of writing in Twitter and the free form of the conversation, some of the utterances contain emoticons and emojis. They are handled automatically by the Twitter tokenizer and treated as a single token. We keep them in their original form, as they can be very useful for detecting emotions and sarcasm, which pose serious challenges for natural language understanding.

Based on the statistics presented in Section 4, we chose to trim the length of the questions and of the answers to 60 and 70 words, respectively.

5.2. Training Setup

For training, we use the Adam [39] optimizer with a decaying learning rate, as implemented in TensorFlow [40]. We start with the following values: learning rate $\eta = 5 \times 10^{-4}$, exponential decay rates for the first and the second moment estimates $\beta_{1} = 0.9$ and $\beta_{2} = 1.00$, and a constant for preventing division by zero $\epsilon = 1 \times 10^{-7}$. Then, we decay the learning rate after each epoch by a factor of 0.99. We also apply dropout with a probability of 0.1, and L2 weight decay on all trainable variables with $\lambda = 3 \times 10^{-7}$. We train each model for 42K steps with a batch size of 64. We found these values by running a grid search on a dev set (extracted as a fraction of the training data) and using the values suggested in [1], where applicable.
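As a small worked example of the schedule, the sketch below computes the learning rate at a given step and the L2 penalty for a list of weights. The number of steps per epoch is not reported, so `steps_per_epoch` is a free parameter, and the penalty is written as lambda times the sum of squared weights, which is one common convention and an assumption here.

```python
def learning_rate(step, steps_per_epoch, base_lr=5e-4, decay=0.99):
    """Learning rate at a given step when decaying once per completed epoch."""
    epochs_done = step // steps_per_epoch
    return base_lr * decay ** epochs_done


def l2_penalty(weights, lam=3e-7):
    """L2 weight-decay term: lam times the sum of squared weights (assumed form)."""
    return lam * sum(w * w for w in weights)
```

With a decay factor of 0.99 the rate shrinks slowly: after ten completed epochs it is still about 90% of its initial value.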

5.3. Individual Models

Following [20], we experiment with three individual models: (i) Information Retrieval-based (IR), (ii) Sequence-to-sequence (seq2seq) and (iii) the Transformer.

For IR, we use ElasticSearch [41] with English analyzer enabled, whitespace- and punctuation-based tokenization, and word 3-grams. We further use the default BM25 algorithm [42], which is an improved version of TF.IDF. For all training questions and for all testing queries, we append the previous turns in the dialog as context.
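For readers unfamiliar with BM25, the sketch below scores tokenized documents against a tokenized query using the textbook form of the formula with the commonly used defaults k1 = 1.2 and b = 0.75. It illustrates the ranking function only; it is not the ElasticSearch implementation, whose internals (for example, length normalization encoding) may differ, and `bm25_scores` is a hypothetical name.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed inverse document frequency; always positive.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Compared with plain TF.IDF, the term-frequency contribution saturates instead of growing linearly, and longer documents are penalized relative to the average length.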

For seq2seq, we use a bi-directional LSTM network with 512 hidden units per direction. The decoder has two uni-directional layers connected directly to the bi-directional layer in the encoder. The network takes as input words encoded as 200-dimensional embeddings. It is a combination of pre-trained GloVe [31] vectors for the known words, and a positional embedding layer, learned as model parameters, for the unknown words. The embedding layers for the encoder and for the decoder are not shared, and are learned separately. This separation is due to the words used in utterances by the customers being very different from the posts by the customer support.

For the Transformer, we use two identical layers for the encoder and for the decoder, with four heads for the self-attention. The dimensionality of the input and of the output is $d_{model} = 256$, and the inner dimensionality is $d_{inner} = 512$. The input consists of queries with keys of dimensionality $d_{k} = 64$ and values of the same dimensionality $d_{v} = 64$. The input and the output embedding are learned separately with sinusoidal positional encoding.

5.4. Evaluation Measures

How to evaluate a chatbot is an open research question. As the problem is related to machine translation (MT) and text summarization (TS), which are nowadays also addressed using seq2seq models, researchers have been using MT and TS evaluation measures such as BLEU [4] and ROUGE [5], which focus primarily on word overlap and measure the similarity between the chatbot’s response and the gold answer to the user question (here, the answer by the customer support). However, it has been argued [43,44] that such word-overlap measures are not very suitable for evaluating chatbots. Thus, we adopt three additional measures, which are more semantic in nature.

The embedding average [6] constructs a vector for a piece of text by taking the average of the word embeddings of its constituent words. Then, the vectors for the chatbot response and for the gold human answer are compared using the cosine similarity.
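A minimal sketch of this measure is given below. It assumes the embeddings are available as a dictionary from token to a list of floats and that out-of-vocabulary tokens are skipped; both are illustrative choices, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is empty or zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_average(tokens, emb):
    """Mean of the vectors of the tokens found in `emb` (unknown tokens are skipped)."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return []
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def emb_avg_similarity(response, gold, emb):
    """Cosine between the averaged vectors of the response and the gold answer."""
    return cosine(embedding_average(response, emb), embedding_average(gold, emb))
```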

The greedy matching was introduced in the context of intelligent tutoring systems [45]. It matches each word in the chatbot’s output to the most similar word in the gold human response, where the similarity is measured as the cosine between the corresponding word embeddings, multiplied by a weighting term (which we set to 1), as shown in Equation (4). Since this measure is asymmetric, we also calculate it with the arguments swapped, and then we take the average as shown in Equation (5).

$$\mathrm{greedy}(u_{1}, u_{2}) = \frac{\sum_{v \in u_{1}} \mathrm{weight}(v) \cdot \max_{w \in u_{2}} \cos(v, w)}{\sum_{v \in u_{1}} \mathrm{weight}(v)} \qquad (4)$$

$$\mathrm{simGreedy}(u_{1}, u_{2}) = \frac{\mathrm{greedy}(u_{1}, u_{2}) + \mathrm{greedy}(u_{2}, u_{1})}{2} \qquad (5)$$
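Equations (4) and (5) translate almost directly into code. The sketch below assumes a dictionary from token to vector, skips out-of-vocabulary tokens, and uses a constant weight of 1 as in our setup; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy(u1, u2, emb, weight=lambda token: 1.0):
    """Equation (4): weighted average of each u1 word's best cosine match in u2."""
    v1 = [(t, emb[t]) for t in u1 if t in emb]
    v2 = [emb[t] for t in u2 if t in emb]
    if not v1 or not v2:
        return 0.0
    num = sum(weight(t) * max(cosine(v, w) for w in v2) for t, v in v1)
    den = sum(weight(t) for t, _ in v1)
    return num / den


def sim_greedy(u1, u2, emb):
    """Equation (5): average of the two directions, which makes the measure symmetric."""
    return (greedy(u1, u2, emb) + greedy(u2, u1, emb)) / 2
```

The asymmetry is easy to see on a toy case: a two-word response matched against a one-word reference scores lower in one direction than in the other, and the symmetric version averages the two.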

The vector extrema [46] was proposed for dialogue systems. Instead of averaging the word embeddings of the words in a piece of text, it takes the coordinate-wise maximum (or minimum), as shown in Equation (6). Finally, the resulting vectors for the chatbot output and for the gold human answer are compared using the cosine similarity.

$$\mathrm{extrema}(u_{i}) = \begin{cases} \max u_{i}, & \text{if } \max u_{i} \geq |\min u_{i}| \\ \min u_{i}, & \text{otherwise} \end{cases} \qquad (6)$$
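Equation (6) is applied independently to each embedding coordinate. The sketch below assumes a dictionary from token to vector and skips out-of-vocabulary tokens; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is empty or zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vector_extrema(tokens, emb):
    """For each coordinate, keep the value with the largest magnitude across words."""
    vecs = [emb[t] for t in tokens if t in emb]
    out = []
    for column in zip(*vecs):
        hi, lo = max(column), min(column)
        out.append(hi if hi >= abs(lo) else lo)
    return out


def extrema_similarity(response, gold, emb):
    """Cosine between the extrema vectors of the response and the gold answer."""
    return cosine(vector_extrema(response, emb), vector_extrema(gold, emb))
```

Because only the most extreme value per coordinate survives, words whose vectors lie close to the origin contribute little to the resulting representation.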

6. Evaluation Results

Below, we first discuss our auxiliary classification task, where the objective is to predict which question–answer pair is “good”, and then we move to the main task of answer re-ranking.

6.1. Auxiliary Task: Question–Answer Goodness Classification

Table 2 shows the results for the auxiliary task of question–answer goodness classification. The first column is the name of the model. It is followed by three columns showing the type of embedding used, the size of the hidden layer, and the number of heads (see Section 3.2). The last column reports the accuracy. Since our dataset is balanced (we generate about 50% positive, and about 50% negative examples), accuracy is a suitable evaluation measure for this task. The top row of the table shows the performance for a majority class baseline. The following lines show the results for our full QANet-based model when using different kinds of embeddings. We can see that contextualized sentence-level embeddings are preferable to using simple word embeddings as in GloVe or token-level ELMo embeddings. Moreover, while token-level ELMo outperforms GloVe when the size of the network is small, there is not much difference when the number of parameters grows ($d_{model} = 128$, $\#Heads = 8$).

6.2. Answer Selection/Generation: Individual Models

Table 3 reports the performance of the individual models: information retrieval (IR), Sequence-to-sequence (seq2seq), and the Transformer (see Section 5.3 for more details about these models). In our earlier work [20], we performed these experiments using exactly the same experimental setup. The table is organized as follows: The first column contains the name of the model used to obtain the best answer. The second and the third columns report the word overlap measures: (i) BLEU@2, which uses uni-gram and bi-gram matches between the hypothesis and the reference sentence, and (ii) ROUGE-L [47], which uses Longest Common Subsequence (LCS). The last three columns are for the semantic similarity measures: (i) Embedding Average (Emb Avg) with cosine similarity, (ii) Greedy Matching (Greedy Match), and (iii) Vector Extrema (Vec Extr) with cosine similarity. In the three latter measures, we used the standard pre-trained word2vec embeddings because they are not learned during training, which helps avoid bias, as has been suggested in [43,44].

We can see in Table 3 that the seq2seq model outperforms IR by a margin on all five evaluation measures, which is consistent with previous results in the literature. What is surprising, however, is the relatively poor performance for the Transformer, which trails behind the seq2seq model on all evaluation measures. We hypothesize that this is due to the Transformer having to learn more parameters as it operates with higher-dimensional word embeddings. Overall, the Transformer is arguably slightly better than the IR model, outperforming it on three of the five evaluation measures.

The last row of Table 3 is not an individual model; it is our re-ranker applied to the top answers returned by the IR model. In particular, we use QANet with sentence-level ELMo ($d_{model} = 128$, $\#Heads = 8$). We took the top-5 answer candidates (the value of 5 was found using cross-validation on the training dataset) from the IR model, and we selected the best answer based on our re-ranker’s scores. We can see that re-ranking yields improvements for all evaluation measures: +1.18 on BLEU@2, +0.93 on ROUGE-L, +1.12 on Embedding Average, +0.67 on Greedy Matching, and +1.64 on Vector Extrema. These results show that we can get sizable performance gains when re-ranking the top-K predictions of a single model; below we will combine multiple models.

6.3. Main Task: Multi-Source Answer Re-Ranking

Next, we combine the top-K answers from different models: IR and seq2seq. We did not include the Transformer in the mix as its output is generative and similar to that of the seq2seq model; moreover, as we have seen in Table 3 above, it performs worse than seq2seq on our dataset. We set $K = 2$ for the baseline, Random Top Answer, which selects a random answer from the union of the top K answers by the models involved in the re-ranking. For the remaining re-ranking experiments, we use $K = 5$. We found these values using cross-validation on the training dataset, trying 1–5.

The results are shown in Table 4, where different representations are separated by a horizontal line. The first row of each group contains the name of the model. Then, on the even rows (second, fourth, etc.), we show the results from a greedy answer selection strategy, while the odd rows show the results from an exploration strategy (softmax sampling). Since softmax sampling and random selection are stochastic in nature, we include a 95% confidence interval for them.

We can see in Table 4 that QANet with sentence-level ELMo ($d_{model} = 128$, $\#Heads = 8$) performs best in terms of BLEU@2, ROUGE-L, and Greedy Matching. Note also the correlation between higher results on the auxiliary task (see Table 2) and improvement in terms of word-overlap measures, where we find the largest difference between individual and re-ranked models (+1.5 points absolute over the baseline, and +0.95 over seq2seq in terms of BLEU@2). In terms of semantic similarity, we note the highest increase for Embedding Average (+1.3 over the baseline, and +1.4 over seq2seq), and a smaller one for Greedy Matching (+1.0 over the baseline, and +0.4 over seq2seq), and Vector Extrema (+2.6 over the baseline, and +0.6 over seq2seq).

Overall, the re-ranked models are superior as evaluated on word-matching measures, which is supported by the improvement of BLEU@2 and Embedding Average. The smaller improvement for Greedy Matching and Vector Extrema can be explained by the training procedure for the re-ranking model, which is based on word comparison. However, these two measures focus on keyword similarity between the target and the proposed answers, and generative models are better at this. This is supported by comparing the combined model to IR-BM25, where we see sizable improvements of +1.5 and +2.0 in terms of Greedy Matching and Vector Extrema, respectively.

We can further see in Table 4 that using a stochastic approach to select the best answer yields additional improvements. This strategy accounts for the predicted goodness score for each candidate, thus enriching the model in two ways. First, implicit voting is used, as duplicate answer candidates are not removed, resulting in a higher selection probability for popular answers from different input modules. Second, although two answers may have a very different structure, they can still be similar in meaning, leading to very similar scores, of which the max strategy promotes only the first one. This behavior can be mitigated by choosing the winner proportionally to its ranking, thus also introducing diversity in the chatbot’s language. This hypothesis is supported by the results in Table 4: compare each model to the corresponding one with softmax selection.

7. Conclusions and Future Work

We have presented a novel framework for re-ranking answer candidates for conversational agents. In particular, we adopted techniques from the domain of machine reading comprehension [1,2,3] to evaluate the quality of a question–answer pair. Our framework consists of two tasks: (i) an auxiliary one, aiming to fit a goodness classifier using QANet and negative sampling, and (ii) a main task that re-ranks answer candidates using the learned model. We further experimented with different model sizes and two types of embedding models: GloVe [31] and ELMo [32]. Our experiments showed improvements in answer quality in terms of word-overlap and semantics when re-ranking using the auxiliary model. Last but not least, we argued that choosing the top-ranked answer is not always the best option. Thus, we introduced probabilistic sampling that aims to diversify the agent’s language and to up-vote the popular answers, while taking their ranking scores into consideration.

In future work, we plan to experiment with different exploration strategies such as Boltzmann exploration (softmax is a degenerate case of Boltzmann exploration with $\tau = 1$), $\epsilon$-Confident, Upper Confidence Bound, and other bandit methods [48] to widen the possible context for each answer over time. We see an interesting research direction in applying deep reinforcement learning (i) to improve the answer selection models when applied to unseen questions, and (ii) to account for user feedback and customer support task success.

Author Contributions

Conceptualization, M.H., P.N. and I.K.; Investigation, M.H.; Writing–original draft, M.H.; Writing–review and editing, P.N. and I.K.

Funding

This research is partially supported by Project UNITe BG05M2OP001-1.001-0004 funded by the OP “Science and Education for Smart Growth”, co-funded by the EU through the ESI Funds.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Yu, A.W.; Dohan, D.; Luong, M.T.; Zhao, R.; Chen, K.; Norouzi, M.; Le, Q.V. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In Proceedings of the 2018 International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
2. Seo, M.; Kembhavi, A.; Farhadi, A.; Hajishirzi, H. Bi-directional attention flow for machine comprehension. In Proceedings of the 2017 International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
3. Chen, D.; Fisch, A.; Weston, J.; Bordes, A. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1870–1879.
4. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318.
5. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
6. Lowe, R.; Pow, N.; Serban, I.; Pineau, J. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, Czech Republic, 2–4 September 2015; pp. 285–294.
7. Lison, P.; Tiedemann, J. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, Portorož, Slovenia, 23–28 May 2016.
8. Reddy, S.; Chen, D.; Manning, C.D. CoQA: A conversational question answering challenge. arXiv, 2018; arXiv:1808.07042.
9. Microsoft Research Social Media Conversation Corpus. Available online: http://research.microsoft.com/convo/ (accessed on 20 February 2019).
10. Luong, T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421.
11. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112.
12. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv, 2014; arXiv:1409.0473.
13. Vinyals, O.; Le, Q.V. A Neural Conversational Model. arXiv, 2015; arXiv:1506.05869.
14. Serban, I.V.; Sordoni, A.; Bengio, Y.; Courville, A.C.; Pineau, J. Hierarchical neural network generative models for movie dialogues. arXiv, 2015; arXiv:1507.04808.
15. Sordoni, A.; Galley, M.; Auli, M.; Brockett, C.; Ji, Y.; Mitchell, M.; Nie, J.Y.; Gao, J.; Dolan, B. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA, 31 May–5 June 2015; pp. 196–205.
16. Sordoni, A.; Bengio, Y.; Vahabi, H.; Lioma, C.; Grue Simonsen, J.; Nie, J.Y. A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, Melbourne, Australia, 18–23 October 2015; pp. 553–562.
17. Boyanov, M.; Nakov, P.; Moschitti, A.; Da San Martino, G.; Koychev, I. Building Chatbots from Forum Data: Model Selection Using Question Answering Metrics. In Proceedings of the International Conference Recent Advances in Natural Language Processing, Varna, Bulgaria, 2–8 September 2017; pp. 121–129.
18. Nakov, P.; Hoogeveen, D.; Màrquez, L.; Moschitti, A.; Mubarak, H.; Baldwin, T.; Verspoor, K. SemEval-2017 Task 3: Community Question Answering. In Proceedings of the 11th International Workshop on Semantic Evaluation, Vancouver, BC, Canada, 3–4 August 2017; pp. 27–48.
19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
20. Hardalov, M.; Koychev, I.; Nakov, P. Towards Automated Customer Support. In Proceedings of the 18th International Conference on Artificial Intelligence: Methodology, Systems, and Applications, Varna, Bulgaria, 12–14 September 2018; pp. 48–59.
21. Li, Y.; Miao, Q.; Geng, J.; Alt, C.; Schwarzenberg, R.; Hennig, L.; Hu, C.; Xu, F. Question Answering for Technical Customer Support. In Proceedings of the 7th CCF International Conference, NLPCC 2018, Hohhot, China, 26–30 August 2018; pp. 3–15.
22. Feng, M.; Xiang, B.; Glass, M.R.; Wang, L.; Zhou, B. Applying deep learning to answer selection: A study and an open task. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, Scottsdale, AZ, USA, 13–17 December 2015; pp. 813–820.
23. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Back-propagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551.
24. Li, X.; Li, L.; Gao, J.; He, X.; Chen, J.; Deng, L.; He, J. Recurrent reinforcement learning: A hybrid approach. In Proceedings of the 2016 International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.
25. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
26. Qiu, M.; Li, F.L.; Wang, S.; Gao, X.; Chen, Y.; Zhao, W.; Chen, H.; Huang, J.; Chu, W. AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 498–503.
27. Cui, L.; Huang, S.; Wei, F.; Tan, C.; Duan, C.; Zhou, M. SuperAgent: A Customer Service Chatbot for E-commerce Websites. In Proceedings of the Association for Computational Linguistics 2017, System Demonstrations, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 97–102.
28. Pang, L.; Lan, Y.; Guo, J.; Xu, J.; Xu, J.; Cheng, X. DeepRank: A New Deep Architecture for Relevance Ranking in Information Retrieval. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, Singapore, 6–10 November 2017; pp. 257–266.
29. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 3111–3119.
30. Srivastava, R.K.; Greff, K.; Schmidhuber, J. Highway networks. arXiv, 2015; arXiv:1505.00387.
31. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
32. Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; pp. 2227–2237.
33. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv, 2016; arXiv:1607.06450.
34. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
35. Serban, I.V.; Lowe, R.; Henderson, P.; Charlin, L.; Pineau, J. A Survey of Available Corpora For Building Data-Driven Dialogue Systems: The Journal Version. Dialog. Discourse 2018, 9, 1–49.
36. Customer Support on Twitter. Available online: http://www.kaggle.com/thoughtvector/customer-support-on-twitter (accessed on 18 February 2019).
37. Codebase of Towards Automated Customer Support. Available online: https://github.com/mhardalov/customer-support-chatbot (accessed on 12 February 2019).
38. Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; McClosky, D. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MA, USA, 22–27 June 2014; pp. 55–60.
39. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 2015 International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
40. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. Tensorflow: A system for large-scale machine learning. OSDI 2016, 16, 265–283.
41. ElasticSearch. Available online: http://www.elastic.co/products/elasticsearch (accessed on 20 February 2019).
42. Robertson, S.; Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr. 2009, 3, 333–389.
43. Liu, C.W.; Lowe, R.; Serban, I.; Noseworthy, M.; Charlin, L.; Pineau, J. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 2122–2132.
44. Lowe, R.; Noseworthy, M.; Serban, I.V.; Angelard-Gontier, N.; Bengio, Y.; Pineau, J. Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1116–1126.
45. Rus, V.; Lintean, M. A Comparison of Greedy and Optimal Assessment of Natural Language Student Input Using Word-to-Word Similarity Metrics. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, Montreal, QC, Canada, 7 June 2012; pp. 157–162.
46. Forgues, G.; Pineau, J.; Larchevêque, J.M.; Tremblay, R. Bootstrapping dialog systems with word embeddings. In Proceedings of the NIPS Workshop on Modern Machine Learning and Natural Language Processing, Montreal, QC, Canada, 12 December 2014.
47. Lin, C.Y.; Och, F.J. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In Proceedings of the 42nd Annual Conference of the Association for Computational Linguistics, Barcelona, Spain, 21–26 July 2004; pp. 605–612.
48. Sutton, R.S.; Barto, A.G. Introduction to Reinforcement Learning; MIT Press: Cambridge, MA, USA, 1998.

Figure 1. Our answer re-ranking framework, based on the QANet architecture.

Table 1. Statistics about our dataset. (Reprinted by permission from Springer Nature: Springer Lecture Notes in Computer Science (Hardalov, M.; Koychev, I.; Nakov, P. Towards Automated Customer Support [20]), 2018).

	Questions	Answers
Avg. # words	21.31	25.88
Min # words	1.00	3.00
1st quantile (# words)	13.00	20.00
Mode (# words)	20.00	23.00
3rd quantile (# words)	27.00	29.00
Max # words	136.00	70.00
Overall
# question–answer pairs		49,626
# words (in total)		26,140
Min # turns per dialog		2
Max # turns per dialog		106
Avg. # turns per dialog		2.6
Training set: # of dialogs		45,582
Testing set: # of dialogs		4,044

Table 2. Auxiliary task: question–answer goodness classification results.

Model	Embedding Type	d_model	Heads	Accuracy
Majority class	–	–	–	50.52
QANet	GloVe	64	4	80.58
QANet	GloVe	64	8	82.88
QANet	GloVe	128	8	83.42
QANet	ELMo (token level)	64	4	82.92
QANet	ELMo (token level)	64	8	83.88
QANet	ELMo (token level)	128	8	83.48
QANet	ELMo (sentence level)	64	8	84.09
QANet	ELMo (sentence level)	128	8	85.45

Table 3. Main task: performance of the individual models.

	Word Overlap		Semantic Similarity
Model	BLEU@2	ROUGE_L	Emb Avg	Greedy Match	Vec Extr
Transformer [20]	12.43	25.33	75.35	30.08	39.40
IR-BM25 [20]	13.73	22.35	76.53	29.72	37.99
seq2seq [20]	15.10	26.60	77.11	30.81	40.23
QANet on IR (Individual)	14.92 ± 0.13	23.30 ± 0.12	77.47 ± 0.06	30.63 ± 0.06	39.63 ± 0.06

Table 4. Main task: re-ranking the top K = 5 answers returned by the IR and the seq2seq models.

	Word Overlap		Semantic Similarity
Model	BLEU@2	ROUGE_L	Emb Avg	Greedy Match	Vec Extr
Random Top Answer	14.52 ± 0.12	23.41 ± 0.12	77.21 ± 0.06	30.24 ± 0.07	38.25 ± 0.20
QANet+GloVe
d = 64, h = 4	15.18	24.13	78.38	31.14	40.85
Softmax	15.81 ± 0.09	24.53 ± 0.05	78.32 ± 0.08	31.10 ± 0.03	40.51 ± 0.12
d = 64, h = 8	15.41	23.62	78.48	30.97	40.81
Softmax	15.90 ± 0.06	24.39 ± 0.03	78.38 ± 0.04	31.11 ± 0.02	40.66 ± 0.06
d = 128, h = 8	15.94	24.59	78.29	31.19	40.63
Softmax	16.04 ± 0.08	24.71 ± 0.06	78.36 ± 0.07	31.20 ± 0.07	40.70 ± 0.05
QANet+ELMo (Token)
d = 64, h = 4	15.23	23.48	78.25	30.77	40.22
Softmax	15.77 ± 0.15	24.44 ± 0.09	78.27 ± 0.03	31.06 ± 0.05	40.46 ± 0.11
d = 64, h = 8	15.30	23.41	78.54	30.97	40.19
Softmax	15.86 ± 0.07	24.40 ± 0.06	78.36 ± 0.08	31.11 ± 0.04	40.49 ± 0.05
d = 128, h = 8	15.24	23.59	78.34	30.90	40.19
Softmax	15.89 ± 0.08	24.56 ± 0.10	78.33 ± 0.66	31.11 ± 0.05	40.40 ± 0.05
QANet+ELMo (Sentence)
d = 64, h = 8	15.48	23.88	78.44	30.96	40.33
Softmax	16.00 ± 0.14	24.50 ± 0.33	78.34 ± 0.10	31.13 ± 0.08	40.56 ± 0.09
d = 128, h = 8	15.64	24.13	78.52	31.14	40.63
Softmax	16.05 ± 0.06	24.81 ± 0.08	78.40 ± 0.07	31.20 ± 0.06	40.58 ± 0.03

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
