MOLI: Smart Conversation Agent for Mobile Customer Service

Human agents in technical customer support provide users with instructional answers to solve tasks that would otherwise cost considerable time, money, and effort. Developing a dialogue system in this domain is challenging due to the broad variety of user questions. Moreover, user questions are noisy (for example, they contain spelling mistakes), redundant, and expressed in many different ways. In this work, we introduce MOLI, a conversational system that solves customer questions by providing instructional answers from a knowledge base. Our approach combines models for question type and intent category classification with slot filling and a back-end knowledge base for filtering and ranking answers, and uses a dialog framework to actively query the user for missing information. For answer ranking, we find that sequential matching networks and neural multi-perspective sentence similarity networks clearly outperform baseline models, achieving a 43% error reduction. The end-to-end P@1 (precision at top 1) of MOLI was 0.69 and the customer satisfaction was 0.73.


Introduction
For many companies, customers can seek support through multiple channels, such as web pages, Facebook, or apps. According to research by the China Information Industry Network (CNII), customer service and support is a sizable and growing market, both globally and in China. In response to the strong demand in our company and in the market, we developed MOLI, our smart customer service system for mobile devices.
"My Wi-Fi is not working anymore!" Most mobile device users have probably faced this or a similar question in the past. Solving such questions is the task of customer support agents (CSAs). For frequent questions and user intents, for which solutions often exist in the form of user guides and a question-answer knowledge base (QA-KB), this is a repetitive and time-consuming process. Automating such conversations would significantly reduce the time CSAs have to invest in solving common questions, time which they could then spend on more complex or previously unseen customer problems [1].
Recent advances in dialog systems have led to successful applications in domains such as restaurant [2] and flight bookings [3], providing a convenient way for users to interact with backend services and knowledge bases in natural language, via speech or text-based input. Developing a dialog system for technical customer support presents additional challenges due to the broad variety of topics and tasks that need to be handled. The task is made even more challenging by the fact that the dialogs are often noisy and contain grammatical errors and incomplete user turns. They also refer to concepts and entities not recognized by standard NER tools (e.g., devices, components). Due to the non-technical background of most customers, problem descriptions can be ambiguous, too colloquial compared with the more formalized, technical QA knowledge base texts, and may omit information that is necessary to identify a unique and correct solution. CSAs therefore often query customers for contextual information, such as device model and mobile carrier, in order to identify the exact issue and to select a good answer.
With the work described in this paper, we aim to automate the task of matching instructional answers from a QA-KB to the queries that users describe in online support chats. We describe an approach to conversational question answering in the little-explored domain of technical customer support. Our approach selects the best answer from a QA-KB in a dialog-oriented fashion, using intent classification to narrow down answer candidates and, in particular, to proactively query the user for missing information.

Related Work
Existing work on dialog systems in the customer service domain has focused on dialog modeling [4] for answering Ubuntu OS-related questions, and on single-turn question answering in the insurance domain [5]. Both studies show that it is in principle possible to handle longer conversations in an unsupervised fashion and to answer complex questions with the help of a noisy training set and an unstructured knowledge source. Lowe et al. [4] use a large corpus of support conversations in the operating system domain to train an end-to-end dialog system for answering customer questions. Their results suggest that end-to-end trained systems can achieve good performance but perform poorly on dialogs that require specific domain knowledge which the model possibly never observed.
In contrast, in order to cope with the limited amount of training data, we adopt a classical classification approach followed by semantically matching a user question to a set of results from a QA-KB. The work of Feng et al. [5] focuses on answer matching and selection for a spoken question answering system. The authors show that a standard CNN (Convolutional Neural Network)-based approach with a large number of filters can achieve good performance on this task, but it is limited to single-turn question-response conversations.
Most previous work on sentence similarity modeling focuses on feature engineering. Several types of features have proved useful, including: (1) string-based features, including n-gram features [6] and features used in machine translation evaluation [7]; (2) knowledge-based features, using lexical resources (e.g., WordNet) as in the work of Fern and Stevenson [8]; (3) syntax-based features, such as modeling the divergence of dependency relations between two sentences in the work of Das and Smith [9]; (4) corpus-based features, using distributional models such as latent semantic analysis [10]. Recent work has moved from hand-crafted features to distributed representations under neural network architectures. Collobert and Weston [11] trained convolutional neural networks jointly for multiple tasks in NLP (Natural Language Processing) with shared weights. Kalchbrenner et al. [12] used dynamic k-max pooling in a convolutional neural network to better model sentences of varying length. Kim [13] proposed several improvements to the convolutional architecture of Collobert and Weston, including the use of fixed word vectors and varying sizes for convolution windows. Hu et al. [14] used convolutional neural networks to combine sentence modeling in a layer-by-layer composition method. A variety of other models have been proposed for similarity tasks (Weston et al. [15], Huang et al. [16], Andrew et al. [17]). Tai et al. [18] proposed a tree-based LSTM (Long Short-Term Memory).
In our approach we follow the architecture proposed by He et al. [19] to handle the ambiguity and variability of linguistic expression when modeling sentence semantic similarity. Their work proposes a multi-perspective model for comparing sentences. The authors first use a CNN model to extract features at multiple levels of granularity and then compute multiple similarity metrics to measure sentence similarity. Wu et al. [20] propose a sequential matching network (SMN), which matches two sentences in context at multiple levels of granularity and distills important matching information from each pair with convolution and pooling operations, followed by a recurrent network to model sentence relationships.
In contrast to standard QA tasks such as extractive QA [21] and open-domain answer selection [22], the task we address focuses on identifying instructional answers that match the user's question and its context, and hence cannot simply be gleaned from a text corpus. Although question answering and dialog systems have recently received a lot of attention, work combining both is still sparse.

Problem Formalization
The goal of our approach is to identify the answer $a_i$ from a corpus of $N$ QA pairs $QA = \{(q_1, a_1), \ldots, (q_N, a_N)\}$ that best matches the user's question as expressed by a sequence of user turns $T = t_1, \ldots, t_{|T|}$. Here $q_i$ is a representative question, such as "How to setup email", which prototypically stands for other questions that can be answered by $a_i$. As Figure 1 illustrates, QA pairs as well as user turns are expressed as free text.
For each pair $(q_i, a_i)$, there exists metadata, recorded in the form of properties $p_{i,1}, \ldots, p_{i,m}$. Properties describe the context to which a given QA pair applies, such as a specific device name and operating system version. They typically take a single value from a limited set of possible values.
At each turn $t_k$ during the conversation, our system estimates the utility $u_{i,k} \in \mathbb{R}_{\geq 0}$ of answer $a_i$ given the current context $C = \langle t_1, \ldots, t_k, P_k \rangle$, where $P_k$ is a list of relevant properties that have already been identified (filled "slots").

Dialog System Overview
Figure 2 gives an overview of the architecture of the dialog system. Customers can interact with the system either by entering free text or by selecting one of a set of predefined choices suggested by the system. Choices can be of different types: frequently asked questions at the start of a conversation or, for instance, product names if the system queried for the customer's device in the previous turn. Upon receiving a user turn $t_k$, the natural language understanding (NLU) component transforms $t_k$ into a structured representation. It first performs sentence segmentation and tokenization of the input using the Stanford CoreNLP toolkit [23], and then determines the question type, namely how-to or others. In addition, the NLU component performs intent classification to identify the user intent, which represents a set of similar questions. The intent is used to narrow down the candidate standard QA pairs from the QA-KB. The slot filling (SF) component then identifies entities such as product names and attributes, which are linked to concepts in a product knowledge graph (KG). SF is based on a combination of template and sequence classification approaches. After question type classification, intent classification, and slot filling, the system determines the specific question the user is actually concerned with. Because the SF component combines template and sequence classification approaches, it has high accuracy but low recall. We therefore add a semantic matching component to improve recall without decreasing accuracy.
The main task of the dialog manager (DM) is to maintain the dialogue context (semantic frame), which encodes all available structured information, and to decide on the next system action. If the semantic frame is not complete, that is, if it lacks a slot value such as product information, the DM can re-ask, confirm, or clarify in order to update the dialogue context and keep the semantic frame unambiguous and complete. If a product name is not detected, the DM may also query the customer database for the most recent product purchased by this customer. Given the question type, the intent category, and the values of already filled slots, the DM retrieves the list of potential QA pairs by querying the QA-KB. It then either asks the customer for more information to fill empty slots (which are specified by the properties of the QA pairs) in order to narrow down this list, or it runs the semantic matching component to rank the remaining QA pairs. The DM can also ask for confirmation or disambiguation of a slot filler extracted by the NLU component. DM action choices are passed through a template-based natural language generation (NLG) component or are converted into option choices represented by buttons in the user interface. At any time during the conversation, the system or the user may choose to refer the conversation to a human support agent.
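The decision between asking for a missing slot and ranking the remaining candidates can be illustrated with a minimal sketch. The `QAPair` class, the function names, the example device "moto x", and the rule "ask whenever more than one candidate remains" are hypothetical choices made for illustration; the text above does not specify the exact policy.

```python
from dataclasses import dataclass, field


@dataclass
class QAPair:
    question: str
    answer: str
    intent: str
    # Properties describe the context a QA pair applies to (hypothetical keys).
    properties: dict = field(default_factory=dict)


def candidates(kb, intent, slots):
    """Keep QA pairs of the given intent whose properties agree with all filled slots."""
    return [
        qa for qa in kb
        if qa.intent == intent
        and all(qa.properties.get(name, value) == value for name, value in slots.items())
    ]


def next_action(kb, intent, slots, max_to_rank=1):
    """Return ("ask", slot_name) or ("rank", candidate_list). Assumed policy."""
    cands = candidates(kb, intent, slots)
    if len(cands) > max_to_rank:
        # Look for an unfilled property that still distinguishes the candidates.
        for name in sorted({k for qa in cands for k in qa.properties}):
            values = {qa.properties.get(name) for qa in cands}
            if name not in slots and len(values) > 1:
                return ("ask", name)
    # Nothing left to ask: hand the candidates to semantic matching.
    return ("rank", cands)


kb = [
    QAPair("How to setup email", "A1", "Email", {"device": "moto g3"}),
    QAPair("How to setup email", "A2", "Email", {"device": "moto x"}),
    QAPair("How to pair Bluetooth", "A3", "Bluetooth", {"device": "moto g3"}),
]
first = next_action(kb, "Email", {})
second = next_action(kb, "Email", {"device": "moto g3"})
```

In this toy knowledge base the first call asks for the device, and the second call, with the device slot filled, passes a single candidate on to ranking.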

Question Type Classification
Question type classification is the entry point of the NLU component and is therefore important for the performance of the whole system. Here, we define n-gram information as a local semantic feature and long-range dependency relationships between words or phrases as a global structure feature. Most existing question type classification models either learn little structure information or rely on pre-defined structures, which degrades performance and generalization capability. To address this issue, we propose a sandwich neural network (SNN) that learns semantic and structure representations automatically.
SNN contains four parts: a first LSTM layer, a CNN and pooling layer, a second LSTM layer, and a concatenation and loss layer. The first LSTM layer was inspired by DSCNN [24] and adjusts the word representations to take the context into account. The CNN and pooling layer learns a local n-gram semantic representation. The second LSTM layer was inspired by C-LSTM [25]: the filter maps obtained after convolution serve as a high-level phrase representation, which is fed into the following LSTM to learn a long-dependency structure representation. The final concatenation and loss layer concatenates these two representations into a new one and computes the loss through cross-entropy. Figure 3 shows the architecture of our model, where the CNN sits between two LSTM layers like the filling of a sandwich.
In the following, we describe SNN in detail. Our model's first layer consists of LSTM networks which process different versions of word embeddings. For every version of word embeddings, there is a corresponding LSTM network whose input $x_t \in \mathbb{R}^d$ is the $d$-dimensional word embedding of $w_t$. The LSTM layer produces a hidden state representation $h_t \in \mathbb{R}^d$ at each time step. The hidden state representations are set as the output of the LSTM layers: $h^{(i)} = [h^{(i)}_1, h^{(i)}_2, \ldots, h^{(i)}_s]$ for $i = 1, 2, \ldots, c$, where $c$ is the number of embedding versions and $s$ is the sequence length.

CNN Layer
The second layer is a CNN. To utilize multiple kinds of word embeddings, we apply a filter $F \in \mathbb{R}^{c \times d \times l}$, where $l$ is the size of the convolution window. The $i$-th version of the word embeddings produces the hidden state sequence $h^{(i)}$, which forms one channel of the feature map. These channels are then stacked into a $c$-channel feature map $X \in \mathbb{R}^{c \times d \times s}$.
Afterwards, the filter $F$ is convolved with the window vectors ($l$-grams) at each position to generate a feature map $c \in \mathbb{R}^{s-l+1}$; $c_k$ is the element of the feature map $c$ for the window vector $X_{k:k+l-1}$ at position $k$, and it is computed from the element-wise product $X_{k:k+l-1} \odot F$, where $\odot$ denotes element-wise multiplication. The $n$ feature maps generated from $n$ filters can be rearranged through column vector concatenation to form a new representation, $W = [c_1; c_2; \ldots; c_n]$. Each row $W_j$ of $W \in \mathbb{R}^{(s-l+1) \times n}$ is the feature representation generated from the $n$ filters for the window vector at position $j$. These successive higher-level representations are then fed into the last LSTM layer.
In addition, a max-over-time pooling layer is added after the convolutional neural network. The pooling result of the feature map $c$ is $\hat{c} = \max\{c\}$. The pooling results of the $n$ feature maps are used as our local semantic representation $se \in \mathbb{R}^n$: $se = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_n]$.
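To make the shapes concrete, the following minimal sketch computes one feature map per filter and applies max-over-time pooling, using plain Python lists. It is a simplification under stated assumptions: each window response is the plain sum of the element-wise product, and any bias term or non-linearity the actual model may use is omitted. The function names are hypothetical.

```python
def feature_map(X, F):
    """X: c x d x s input, F: c x d x l filter.

    Returns the s - l + 1 window responses, each the sum of the
    element-wise product of the filter with one window (assumption:
    no bias and no non-linearity).
    """
    c, d, s = len(X), len(X[0]), len(X[0][0])
    l = len(F[0][0])
    responses = []
    for k in range(s - l + 1):
        total = 0.0
        for ch in range(c):
            for dim in range(d):
                for off in range(l):
                    total += X[ch][dim][k + off] * F[ch][dim][off]
        responses.append(total)
    return responses


def local_semantic_representation(X, filters):
    """Max-over-time pooling: one scalar per filter, so se has n entries."""
    return [max(feature_map(X, F)) for F in filters]


# Toy input with c = 1 channel, d = 1 dimension, s = 4 positions.
X = [[[1.0, 2.0, 3.0, 4.0]]]
filters = [[[[1.0, 1.0]]], [[[1.0, -1.0]]]]  # n = 2 filters, window size l = 2
se = local_semantic_representation(X, filters)
```

With $s = 4$ and $l = 2$, each feature map has $s - l + 1 = 3$ entries, and `se` has one entry per filter.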

Second LSTM Layer
We set the dimension of this LSTM layer to the number of filters $n$, which allows for an easy and fair fusion later, and we use the last hidden state of the LSTM as the global structure representation $st \in \mathbb{R}^n$.

Concatenation and Loss Layer
We thus obtain the local semantic representation $se$ and the global structure representation $st$. We then concatenate $se$ and $st$ to get the sentence representation and compute the loss through cross-entropy.

Intent Category Classification
Identifying the correct intent can reduce the number of candidate QA pairs significantly. Currently, the data set contains 60 intents, such as "Bluetooth", "Screen Unlock", and "Google Account".
The intent category classifier estimates the probability $p(I \mid t_k)$, where $I$ represents the intent. Our baseline approaches are a GBDT (Gradient Boosting Decision Tree) and a linear SVM (Support Vector Machine). For feature extraction, $t_k$ was first tokenized, followed by stop-word removal and transformation into a bag-of-words representation. The features were unigrams and bigrams weighted by term frequency-inverse document frequency (TF-IDF). We also implemented a bidirectional LSTM (BiLSTM) model. In this model, each word $w_i \in t_k$ was represented by an embedding $e_i \in \mathbb{R}^d$ obtained from pre-trained distributed word representations $E = [e_1, \ldots, e_W]$. The BiLSTM output was passed to a fully connected layer followed by a ReLU (Rectified Linear Unit) non-linearity and softmax normalization, which yields $p(I \mid t_k)$.
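The classification head described above (fully connected layer, ReLU, softmax) can be sketched in a few lines. The weight matrix `W`, the bias `b`, and the vector `h` standing in for the BiLSTM output are hypothetical toy values; the BiLSTM itself is not modeled here.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    # Subtract the maximum for numerical stability.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def intent_probabilities(h, W, b):
    """Fully connected layer, then ReLU, then softmax over intents."""
    logits = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    return softmax(relu(logits))


h = [1.0, -1.0]                              # stand-in for the BiLSTM output
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]     # three toy intents
b = [0.0, 0.0, 0.0]
probs = intent_probabilities(h, W, b)
```

The output is a probability distribution over the toy intents; the predicted intent is the index with the largest probability.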

Semantic Matching
We assume that the QA pair with the highest semantic similarity to the question expressed in turn $t_k$ (and previous turns) will be of the highest utility to the user. After question type and intent category classification, we obtain an initial set of candidates, $QA_{init}$, by retrieving all QA pairs from the knowledge base that are relevant to the question type and intent category. Following a common information retrieval approach, we then use a pairwise scoring function $S(q_i, a_i, C)$ to sort $QA_{init}$ by utility, where $(q_i, a_i) \in QA_{init}$.

TF-IDF
TF-IDF stands for term frequency-inverse document frequency. Our first baseline computes $S$ with a TF-IDF weighted bag-of-words representation of $q_i$, $a_i$ and $t_k$, estimating semantic relatedness by the cosine similarity $\cos(v_i, v_k)$ between the feature vector of the QA pair, $v_i$, and that of the user turn, $v_k$.
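A minimal sketch of this baseline, using only the standard library, is given below. The whitespace tokenizer, the smoothed IDF formula, and fitting the IDF statistics on the candidates together with the user turn are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with a smoothed IDF (assumed variant)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(turn, qa_texts):
    """Return candidate indices sorted by cosine similarity to the user turn."""
    vectors = tfidf_vectors(qa_texts + [turn])
    query = vectors[-1]
    scores = [cosine(v, query) for v in vectors[:-1]]
    return sorted(range(len(qa_texts)), key=lambda i: scores[i], reverse=True)


qa_texts = ["how to setup email account", "how to pair bluetooth headset"]
order = rank("my email account is not working", qa_texts)
```

Here each entry of `qa_texts` would be the text of one QA pair; the toy turn shares the tokens "email" and "account" only with the first candidate, which is therefore ranked first.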

WMD
The second baseline leverages the semantic information of distributed word representations [26]. To this end, we replace the tokens in $q_i$, $a_i$ and $t_k$ with their respective embeddings and then compute the Word Mover's Distance (WMD) [27] between the embeddings.

SMN
In addition to the baselines, we use a sequential matching network (SMN) [20], which treats semantic matching as a classification task. The SMN first represents $q_i$, $a_i$ and $t_k$ by their respective sequences of word embeddings $E_i$ and $E_k$, before encoding both separately with a recurrent network, in this case a gated recurrent unit (GRU) [28]. A word-word similarity matrix $M_w$ and a sequence-sequence similarity matrix $M_s$ are constructed from $E_i$ and $E_k$, and important matching information is distilled into a matching vector $v_m$ via a convolutional layer followed by max-pooling. $v_m$ is further projected using a fully connected layer followed by a softmax.

MPCNN
In this section, we present a model that incorporates multi-info and context information of user questions into a multi-perspective CNN (MPCNN) to perform question paraphrase identification. The architecture of the model is shown in Figure 4. Our model has two identical subnetworks that process $t_k$ and $q_i a_i$ in parallel after the context has been encoded by a GRU.

(1) Multi-info
In our data, $t_k$ is quite long, whereas $q_i$ in our QA-KB is short and contains less information. The answer $a_i$, in contrast, is quite long and contains information related to $t_k$. In this work, we therefore concatenate $q_i$ and $a_i$ from the QA-KB and compute $S(q_i a_i, t_k)$. User queries usually concern a specific product, but related standard questions for different products may be identical in the QA-KB. For example, in Figure 1, "moto g3" is the name of a mobile device. For the same question, a different product name would influence the matching result. We therefore replace all specific mobile names with the same word, "Mobile". In this paper, we use the product-KB and a CRF (Conditional Random Field) algorithm to recognize the mobile name in $t_k$. The left part of Figure 5 shows the structure of the product-KB. In the product-KB, every mobile has surface names which are mined from the chat log.
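The replacement of product mentions by a shared placeholder can be sketched as a dictionary lookup over surface names. The function name and the regular-expression matching strategy are hypothetical; the actual system uses the product-KB together with a CRF, which is not modeled here.

```python
import re


def normalize_products(text, surface_names, placeholder="Mobile"):
    """Replace every known surface name with a shared placeholder token."""
    if not surface_names:
        return text
    # Try longer names first so that "moto g3" wins over "moto g".
    names = sorted(surface_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


surface_names = ["moto g3", "moto g"]
normalized = normalize_products("How to setup email on my Moto G3", surface_names)
```

After normalization, questions that differ only in the product name map to the same token sequence, so the product no longer affects the matching score.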
The product-KB can hardly contain all mobiles and their surface names, so we use a CRF as a supplement to recognize mobile names in the input user question. The CRF uses two levels of features, character-level n-grams and word-level n-grams. The maximum character-level n-gram length is 6 and the maximum word-level n-gram length is 3. Using the multi-info from the product-KB together with the answer information improves the precision of semantic matching.

(2) Context Multi-Perspective CNN
After obtaining the multi-info, the inputs of our neural model are $t_k$ and $q_i a_i$. Given a user query $t_k$ and a response candidate $q_i a_i$, the model looks up an embedding table and represents $t_k$ and $q_i a_i$ as $t_k = [e_{u,1}, e_{u,2}, \ldots, e_{u,L}]$ and $q_i a_i = [e_{s,1}, e_{s,2}, \ldots, e_{s,L}]$ respectively, where $e_{u,j}, e_{s,j} \in \mathbb{R}^d$ are the embeddings of the $j$-th word of $t_k$ and $q_i a_i$ respectively, and $L$ is the maximum length of the two input sequences. Before being fed into the multi-perspective CNN, $t_k$ is transformed into hidden vectors $conM_{t_k}$ by a GRU. Suppose that $conM_{t_k} = [h_{u,1}, h_{u,2}, \ldots, h_{u,L}]$ are the hidden vectors of $t_k$; then $h_{u,i}$ is defined by the standard GRU recurrence

$$z_i = \sigma(W_z e_{u,i} + U_z h_{u,i-1})$$
$$r_i = \sigma(W_r e_{u,i} + U_r h_{u,i-1})$$
$$\tilde{h}_{u,i} = \tanh(W_h e_{u,i} + U_h (r_i \odot h_{u,i-1}))$$
$$h_{u,i} = z_i \odot \tilde{h}_{u,i} + (1 - z_i) \odot h_{u,i-1}$$

where $h_{u,0} = 0$, $z_i$ and $r_i$ are an update gate and a reset gate respectively, $\sigma(\cdot)$ is a sigmoid function, and $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, $U_h$ are parameters. Because $q_i a_i$ is not a sequential sentence, the model only obtains context information for $t_k$, learning its long-term dependencies with the GRU.
$conM_{t_k}$ and $q_i a_i$ are then processed by the same neural networks. We apply both word-level and embedding-level convolutional filters. Word-level filters operate over sliding windows while considering the full dimensionality of the word embeddings, like typical temporal convolutional filters. Embedding-level filters focus on information at a finer granularity and operate over sliding windows of each dimension of the word embeddings. Embedding-level filters can find and extract information from individual dimensions, while word-level filters can discover broader patterns of contextual information. Using both kinds of filters allows the model to extract richer information.
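The GRU encoding of $t_k$ can be sketched as follows. This is a minimal illustration assuming the standard GRU formulation with the six parameter matrices named in the text, no bias terms, and the convention that the update gate weights the candidate state; the dictionary keys and function names are hypothetical.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(e, h_prev, p):
    """One GRU step; p holds the matrices Wz, Wr, Wh, Uz, Ur, Uh."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], e), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], e), matvec(p["Ur"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], e), matvec(p["Uh"], gated))]
    # Assumed convention: z weights the candidate, (1 - z) the previous state.
    return [zi * ci + (1.0 - zi) * hi for zi, ci, hi in zip(z, cand, h_prev)]


def encode(embeddings, p, hidden_size):
    """Return the hidden vectors for a sequence, starting from a zero state."""
    h = [0.0] * hidden_size
    states = []
    for e in embeddings:
        h = gru_step(e, h, p)
        states.append(h)
    return states


# Toy parameters: embedding size 2, hidden size 1, all weights zero.
params = {name: [[0.0, 0.0]] for name in ("Wz", "Wr", "Wh")}
params.update({name: [[0.0]] for name in ("Uz", "Ur", "Uh")})
step = gru_step([1.0, 2.0], [1.0], params)
states = encode([[1.0, 2.0], [3.0, 4.0]], params, hidden_size=1)
```

With all-zero weights, both gates equal 0.5 and the candidate state is 0, so a single step halves the previous hidden state, which makes the recurrence easy to check by hand.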
For every output vector of a convolutional filter, the model converts it to a scalar with a pooling layer. Pooling helps a convolutional model retain the most prominent and prevalent features, which improves the robustness of our model. Max pooling is a widely used pooling layer, which applies the max operation over the input vector and returns the maximum value. In addition to max pooling, our model also uses min pooling and mean pooling.
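As a small illustration, the three pooling operations reduce one filter output vector to three scalars. The function name and the dictionary layout are hypothetical.

```python
def pool(vector):
    """Reduce a filter output vector to max, min, and mean scalars."""
    return {
        "max": max(vector),
        "min": min(vector),
        "mean": sum(vector) / len(vector),
    }


pooled = pool([1.0, -2.0, 4.0])
```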

Dialogue Manager
We manage the conversation with a finite state machine (FSM). Besides the "start" and "close" states, we set up eight intermediate states: "Init", "SlotFull", "SlotNotFull", "SlotClarify", "IntentVerify", "DeliverAnswer", "WaitUserInput" and "ErrorHandling", as shown in Figure 6. The meaning of each state is explained in Table 1. To make the transition logic between states clear, arrows in Figure 6 indicate the possible next states.
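A state machine of this kind can be sketched as an enumeration of states plus a table of allowed transitions. The state names follow the text, but the transition table below is entirely hypothetical, since the actual transitions are only given in Figure 6.

```python
from enum import Enum


class State(Enum):
    START = "Start"
    INIT = "Init"
    SLOT_FULL = "SlotFull"
    SLOT_NOT_FULL = "SlotNotFull"
    SLOT_CLARIFY = "SlotClarify"
    INTENT_VERIFY = "IntentVerify"
    DELIVER_ANSWER = "DeliverAnswer"
    WAIT_USER_INPUT = "WaitUserInput"
    ERROR_HANDLING = "ErrorHandling"
    CLOSE = "Close"


# Hypothetical transitions for illustration only (not taken from Figure 6).
TRANSITIONS = {
    State.START: {State.INIT},
    State.INIT: {State.WAIT_USER_INPUT},
    State.WAIT_USER_INPUT: {
        State.INTENT_VERIFY, State.SLOT_NOT_FULL, State.SLOT_FULL,
        State.ERROR_HANDLING, State.CLOSE,
    },
    State.INTENT_VERIFY: {State.WAIT_USER_INPUT, State.SLOT_NOT_FULL, State.SLOT_FULL},
    State.SLOT_NOT_FULL: {State.SLOT_CLARIFY, State.WAIT_USER_INPUT},
    State.SLOT_CLARIFY: {State.WAIT_USER_INPUT, State.SLOT_FULL},
    State.SLOT_FULL: {State.DELIVER_ANSWER},
    State.DELIVER_ANSWER: {State.WAIT_USER_INPUT, State.CLOSE},
    State.ERROR_HANDLING: {State.WAIT_USER_INPUT, State.CLOSE},
    State.CLOSE: set(),
}


class DialogFSM:
    def __init__(self):
        self.state = State.START

    def move(self, target):
        """Move to the target state, rejecting transitions not in the table."""
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {target.value}")
        self.state = target


fsm = DialogFSM()
for nxt in (State.INIT, State.WAIT_USER_INPUT, State.SLOT_NOT_FULL,
            State.SLOT_CLARIFY, State.SLOT_FULL, State.DELIVER_ANSWER, State.CLOSE):
    fsm.move(nxt)
```

Keeping the transitions in a single table makes it straightforward to compare the implementation against a state diagram and to reject moves that the diagram does not allow.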

Experiments and Discussion
In this section, we evaluate our approaches for question type classification, intent category classification, and semantic matching, as well as the end-to-end performance of MOLI. We first introduce the data set, the QA-KB, and the product-KB, and then present the experimental results.

Data Set, QA-KB and Product-KB
The chat transcript data set mainly consists of first-contact transcripts, in which the customer's question or intent is explicitly stated. Each transcript includes the full text of the chat, speaker ids for each message, a product id and, optionally, a question type and an intent category assigned by the customer service agent.
From this corpus, we extracted a data set of 80,216 user turns, which were manually labeled with question type information by the CSAs. Table 2 shows the distribution over question types in the data set. Out of the 30,593 how-to questions, 6,808 have an intent category assigned; the distribution over the top 30 categories (out of 60) is shown in Figure 7.
The KB stores the QA pairs and their relevant products. Figure 5 shows the structure of our KB. The left part contains the parameters of the mobile products and the right part contains the QA pairs. In the current version, the KB includes 20 mobile products and 242 standard QA pairs. In total, our KB includes more than 150,000 triples.

Question Type Classification
We first split the data set into training and test sets with an 80/20 ratio. We use 300-dimensional GloVe word embeddings [29]. Hyper-parameter selection was done on the training set via five-fold cross validation, and results averaged over multiple runs are reported on the test set. Table 3 shows the evaluation results. As we can see, SNN clearly outperforms the baseline models. For example, in the sentence "I want to get support on the steps of factory model setting", the structure "support on ... steps" contributes a lot to the classification. Relying on only one part of this structure ("support" or "steps") may lead to misclassification.

Intent Category Classification
Here, too, we first split the data set into training and test sets with an 80/20 ratio. Hyper-parameter selection is done on the training set via five-fold cross validation, and results averaged over multiple runs are reported on the test set. For the BiLSTM model we use 300-dimensional GloVe word embeddings [29]. Table 4 shows the evaluation results on the data set. The SVM baseline even outperforms the BiLSTM model. From the per-category results of the SVM in Table 5, we find that some categories (e.g., "Google account" and "transfer from previous device") achieve a disproportionately lower performance. For example, "Google account" is often confused with "reset", as a Google account is generally a main topic when trying to reset a device (e.g., an Android smartphone). It is also noteworthy that "subsidy unlock", "bootloader unlock" and "screen lock" are frequently confused. This is best illustrated by the example "Hi i need pin for unlock red to my moto g", which has the true category "Subsidy Unlock" but is categorized as "screen lock". Without knowledge about the mobile phone and contracts domain, it is very difficult to understand that the customer is referring to a "pin" (subsidy unlock code) for "red" (mobile service provider) and not to the actual PIN code for unlocking the phone. This example also illustrates a common problem in smart customer support, where users unfamiliar with the domain are not able to describe their information need in domain-specific terminology.

Semantic Matching
For all models except TF-IDF, we use 300-dimensional GloVe word embeddings [29]. To obtain negative samples, for each $t_k$ we randomly selected five standard queries with the same intent and five standard queries with different intents. To alleviate the impact of unbalanced training data, we oversampled positive samples. As the standard questions $q_i$ of most QA pairs $(q_i, a_i)$ are usually shorter than 10 tokens, we also evaluate the impact on model performance of adding the answer $a_i$ as additional context (up to 500 characters) to $q_i$. Table 6 shows the P@1 of each model on our data. We see that MPCNN and MPCNN_GRU (MPCNN with a Gated Recurrent Unit) outperform the unsupervised baseline approaches, with a 43% error reduction achieved by the MPCNN_GRU model. Intuitively, it makes sense to provide the models with additional context that can be used to learn a better representation of semantic similarity. The P@1 of the SMN is much lower than that of the MPCNN models, and only slightly higher than that of the unsupervised models.
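The two quantities reported here can be computed as in the following sketch. It assumes that "error" means $1 - \text{P@1}$ and that error reduction is measured relative to the baseline error; the numbers in the example are toy values, not results from the paper.

```python
def precision_at_1(ranked_lists, gold):
    """Fraction of queries whose top-ranked candidate is the gold answer."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if ranked and ranked[0] == g)
    return hits / len(gold)


def relative_error_reduction(p_baseline, p_model):
    """Relative reduction of the error 1 - P@1 (assumed definition)."""
    baseline_error = 1.0 - p_baseline
    model_error = 1.0 - p_model
    return (baseline_error - model_error) / baseline_error


p_at_1 = precision_at_1([["a", "b"], ["b", "a"], ["c"]], ["a", "a", "c"])
reduction = relative_error_reduction(0.5, 0.75)
```

In the toy example, two of three queries have the gold answer at rank 1, and raising P@1 from 0.5 to 0.75 halves the error.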

The Importance of Intent Classification for Semantic Matching
Question intent classification is an important step to narrow down candidate answers. In this section, we compare against baseline models to highlight the effectiveness of intent classification. The baseline models use the same networks as MPCNN and MPCNN_GRU but without intent classification, so they match against all QA pairs directly. Table 7 shows that semantic matching with intent classification achieves higher precision than the baseline models.

End-to-End Performance
In this section, we present the end-to-end performance of MOLI and compare MOLI with a baseline system to highlight the effectiveness of each component. In detail, we list the baseline performance, and then the performance with question type classification, intent category recognition, and semantic matching improved individually, as described in Sections 5.1-5.3. The corresponding methods are SNN, SVM, and MPCNN_GRU, so we name these systems baseline_QT_SNN, baseline_IC_SVM, and baseline_SM_MPCNN_GRU. Finally, we show the performance with all of the above components improved together. In addition, to demonstrate the effectiveness of semantic matching, we designed the MOLI-SM model, which removes the semantic matching component from MOLI. Table 8 shows the P@1 and feedback score of each system. The feedback score is calculated from user actions. At the end of a session, a feedback mechanism lets the user grade the recommended answer on a five-level scale. If the score is four or five, we consider the answer useful for the user. The results in the table show that our improvements were useful.
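One plausible reading of the feedback score is the fraction of rated sessions in which the answer was judged useful. The sketch below follows that reading; the aggregation into a fraction is an assumption, since the text only states that ratings of four or five count as useful.

```python
def feedback_score(ratings, useful_threshold=4):
    """Fraction of 1-5 ratings at or above the threshold (assumed definition)."""
    if not ratings:
        raise ValueError("no ratings collected")
    useful = sum(1 for r in ratings if r >= useful_threshold)
    return useful / len(ratings)


score = feedback_score([5, 4, 3, 1])
```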

Conclusions
In this paper, we described our smart customer service system MOLI and the NLP techniques it builds on. We presented a first approach to conversational question answering in the complex and little-explored domain of technical customer support. Our approach matches a user's question with the most relevant answer from a knowledge base. It does so in a conversational manner, by asking for and clarifying required information if necessary. Our approach incorporates several separate models to determine an answer. Most notably, it performs question type classification, intent classification for a data set with 60 intent categories, slot filling, and semantic answer matching. We observe that while supervised models, both neural and standard ones such as decision trees and SVMs, perform reasonably well on the individual tasks, there is still room for improvement. As many previous authors have shown in other domains, such models can benefit from joint training and end-to-end task modeling.
Our experiments were conducted with a data set of noisy, real-world chat transcripts, which we plan to make available to the community in the near future. Future research directions include end-to-end, joint modeling of the question type classification, intent classification, slot filling, and semantic matching subtasks, as well as updating the dialogue manager to account for nested, non-linear conversations and to maintain multiple dialog hypotheses.

Figure 1. Example chat about an "email account setup".

Figure 5. Structure of the product knowledge base (KB) and question-answer (QA) KB.

Figure 6. Finite state machine for task-oriented dialog.

Figure 7. Distribution of intent categories (top 30) for user questions.

Table 1. Definition of states.
No. | State | Definition
3 | SlotFull | All necessary slots are filled and verified.
4 | SlotNotFull | Not all necessary slots are filled or verified.
5 | SlotClarify | Some necessary slots need to be verified by the user or the knowledge base.
6 | IntentVerify | Some user intents need to be verified.
7 | DeliverAnswer | The FSM is delivering an answer to the user.
8 | WaitUserInput | The FSM accepts the user's input.
9 | ErrorHandling | Deals with errors that occur during state transitions.
10 | End | Terminates the task.

Table 2. Question type statistics.

Table 3. Question type classification results.

Table 4. Intent category classification results for user questions.

Table 5. Intent category classification results for user questions, top 10 categories.

Table 6. Semantic matching results for user questions.

Table 7. Semantic matching results on baseline for user questions.