Reliable Classification of FAQs with Spelling Errors Using an Encoder-Decoder Neural Network in Korean

Abstract: To resolve lexical disagreement problems between queries and frequently asked questions (FAQs), we propose a reliable sentence classification model based on an encoder-decoder neural network. The proposed model uses three types of word embeddings: fixed word embeddings for representing domain-independent meanings of words, fine-tuned word embeddings for representing domain-specific meanings of words, and character-level word embeddings for bridging lexical gaps caused by spelling errors. It also uses class embeddings to represent domain knowledge associated with each category. In experiments with an FAQ dataset about online banking, the proposed embedding methods improved sentence classification performance. In addition, the proposed model outperformed the comparison model, achieving an accuracy of 0.810 in the classification of 411 categories.


Introduction
Frequently asked questions (FAQs) in commercial services based on social media (e.g., chatbots for online banking) accommodate both customer needs and business requirements. As a tool for information access, most commercial services provide customers with a keyword search. However, keyword search sometimes performs poorly in FAQ retrieval because of lexical disagreements between users' queries and the predefined questions in an FAQ set, as shown in Figure 1. In Figure 1, the lexical disagreements are caused by using different words with the same meaning (e.g., remittance vs. bank transfer) and by using misspelled words (e.g., remittance vs. remitence). To resolve these lexical disagreement problems, most FAQ retrieval systems expand keywords by looking up synonym dictionaries, thereby bridging the lexical gaps between different words with the same meaning. However, they cannot cope with the lexical disagreement problem caused by spelling errors because it is impossible to pre-construct a synonym dictionary containing all misspelled keywords. Recently, FAQ classification models based on deep learning have been proposed because they can cluster semantically or lexically similar words through distributed representation schemes such as word embeddings and character embeddings. In this paper, we propose an FAQ classification model based on an encoder-decoder neural network with multiple word embedding vectors instead of keyword search methods. To increase FAQ classification performance, the proposed model adopts class embeddings that include domain knowledge of each FAQ category.

Previous Works
Initial sentence classification models based on deep learning were n-gram models using convolutional neural networks (CNNs) [1][2][3][4][5]. The authors of [3] proposed a CNN architecture using diverse versions of pre-trained static word vectors and variable-size convolution filters. It was shown in [2] that simple convolutions of word n-grams could improve the performance of sentence classification by fine-tuning pre-trained static word vectors such as Word2Vec [6]. These n-gram models were effective in exploring the regional syntax of words, but they could not account for order-sensitive situations in which the order of words is critical to the meaning of a sentence. To overcome this problem, the authors of [7] proposed a classification model that combines a recurrent neural network (RNN) with a CNN. Then, some studies demonstrated that sub-word units such as character n-grams could improve the performance of downstream natural language processing (NLP) tasks [8][9][10][11][12][13]. The authors of [12] proposed a part-of-speech tagging model based on an RNN in which each word is represented by a combination of Korean alphabet embeddings to make the model robust to typing errors. The authors of [13] proposed a character-level CNN model for text classification and showed that it could achieve state-of-the-art or competitive results. In addition, the authors of [14] demonstrated that domain embeddings (i.e., embeddings of predefined categories) could improve the performance of large-scale domain classification. Recently, bidirectional encoder representations from transformers (BERT) was proposed [15]; it is a deeply bidirectional, unsupervised language representation that is pre-trained on a large plain-text corpus. BERT has shown state-of-the-art performance in many downstream NLP tasks, such as classification, sequence labeling, and span prediction, by learning task-specific vectors through fine-tuning.
In sentence classification tasks such as sentiment analysis and semantic textual similarity analysis, BERT also outperformed the previous state-of-the-art models.

FAQ Classification Using an Encoder-Decoder Neural Network
To make the proposed model robust to lexical disagreements, the embedding layer consists of three types of embedding vectors: fixed word embedding vectors, fine-tuned word embedding vectors, and character-level word embedding vectors produced by a CNN. We expect the fixed word embedding vectors to represent the domain-independent meaning of each word and the fine-tuned word embedding vectors to represent its domain-specific meaning. For example, we hope that "transfer" has the domain-independent meaning "move something" and the domain-specific meaning "send money" in a banking domain. We also expect the character-level word embedding vectors to alleviate lexical disagreement problems caused by spelling errors. For example, we hope that the misspelled word "remitence" has a vector representation similar to that of "remittance."
In Figure 2, $W_0$ and $E_0$ are [CLS] (a special symbol added in front of every input example) and the embedding of [CLS], respectively. For $i \geq 1$, $W_i$ is the i-th word in a sentence, and $E_i$ is its embedding vector, formed by concatenating the three types of word embedding vectors.
In Figure 3, $e_i$, $\hat{e}_i$, and $e_i^{char}$ are the fixed word embedding vector, the fine-tuned word embedding vector, and the character-level word embedding vector of the i-th of the n words in an input sentence $S$ (i.e., an input query or a predefined question in an FAQ set), respectively. $e_i^{char}$ is generated by a CNN, as shown in the following equation:

$$e_i^{char} = \mathrm{CNN}(c_1, c_2, \ldots, c_l) \tag{1}$$

where $c_j$ is the j-th of the l characters in a word $w_i$. In this paper, a character refers to a letter of the Korean alphabet, called a jamo. The final word embedding vector $E_i$ is the concatenation of $e_i$, $\hat{e}_i$, and $e_i^{char}$, as shown in the following equation:

$$E_i = [e_i ; \hat{e}_i ; e_i^{char}] \tag{2}$$

To supplement the word embedding vectors with contextual information, we adopt an encoder-decoder neural network in which the word embedding vectors are encoded by a transformer's encoder [16]. The output $T_i$ of the transformer's encoder is computed by a multi-head scaled dot-product self-attention mechanism, as shown in the following equations:

$$Q = K = V = E \tag{3}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{4}$$

where $E$ is one among the n word embedding vectors, and $Q$, $K$, and $V$ are a query, a key, and a value for calculating attention, respectively. $d_k$ is the size of $E$ and is used to scale the dot products. In Equation (4), the query, key, and value are the same vectors according to Equation (3). This case is called self-attention: it relates different positions of a single sequence $E_1, E_2, \ldots, E_n$ to compute a representation of that sequence. Self-attention has been used successfully in various NLP tasks, such as machine translation, machine reading comprehension, and abstractive document summarization. The query, key, and value are first linearly transformed into N heads, and each head is then entered into Equation (4). The self-attention is therefore calculated N times, which is why it is called multi-headed. The first output $T_0$ of the transformer's encoder (the final output vector of the special [CLS] token) is passed through a fully connected neural network (FNN) and then used as the initial value of the RNN decoder, which is implemented by a gated recurrent unit (GRU) [17] with Luong's encoder-decoder attention mechanism [18]. Figure 4 shows the RNN decoder with Luong's encoder-decoder attention mechanism in detail.
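The input side of the model can be illustrated with a small, hedged sketch in plain Python. It is not the authors' implementation: `to_jamo` assumes that jamo are obtained by Unicode NFD decomposition (the paper does not say how words are split), the CNN of Equation (1) is omitted, and `self_attention` uses a single head without the learned linear projections, so it only mirrors Equations (2) to (4) with Q = K = V = E. All function names and the toy vector sizes are illustrative.

```python
import math
import unicodedata


def to_jamo(word):
    """Split Hangul syllables into jamo via Unicode NFD (an assumed preprocessing step)."""
    return list(unicodedata.normalize("NFD", word))


def concat_embeddings(e_fixed, e_tuned, e_char):
    """Equation (2): E_i is the concatenation of the three word embedding vectors."""
    return list(e_fixed) + list(e_tuned) + list(e_char)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def self_attention(E):
    """Equations (3)-(4) with Q = K = V = E: one head, no learned projections."""
    d_k = len(E[0])
    outputs = []
    for query in E:
        # Scaled dot product between this position and every position.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in E
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors (here the same vectors as the keys).
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, E)) for j in range(d_k)]
        )
    return outputs


# Toy 3-word sentence with sizes 2 + 2 + 3 instead of the paper's 100 + 100 + 300.
fixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tuned = [[0.5, 0.5], [0.2, 0.1], [0.0, 0.3]]
char = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1], [0.0, 0.0, 1.0]]
E = [concat_embeddings(f, t, c) for f, t, c in zip(fixed, tuned, char)]
T = self_attention(E)
```

Each row of `T` is a convex combination of the rows of `E`, which is the sense in which self-attention mixes contextual information into every word position.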
As shown in Figure 4, each attention weight $a_i$ is derived from the inner product between each output $T_i$ of the transformer's encoder and the first hidden state $h_o$ of the RNN decoder. The attention weights indicate how strongly each output $T_i$ is associated with the hidden state $h_o$. Then, the context vector $c$ is constructed as the weighted sum of the outputs $T_i$, using the attention weights $a_i$.
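The attention step described above can be sketched as follows. This is a minimal illustration under assumptions rather than the paper's code: the text only says that the weights come from inner products, so the softmax normalization (as in the usual dot-product form of Luong attention) is assumed, and the function name `luong_dot_attention` is hypothetical.

```python
import math


def luong_dot_attention(encoder_outputs, hidden):
    """Return attention weights a_i and the context vector c.

    encoder_outputs: list of vectors T_i from the transformer's encoder.
    hidden: the decoder hidden state h_o (same dimensionality as each T_i).
    """
    # Inner product between each encoder output and the decoder hidden state.
    scores = [sum(t * h for t, h in zip(T_i, hidden)) for T_i in encoder_outputs]
    # Softmax normalization (assumed; not spelled out in the text).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Context vector: weighted sum of the encoder outputs.
    dim = len(hidden)
    context = [
        sum(a * T_i[j] for a, T_i in zip(weights, encoder_outputs))
        for j in range(dim)
    ]
    return weights, context


weights, context = luong_dot_attention([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
```

In this toy call the first encoder output is aligned with the hidden state, so it receives the larger weight and dominates the context vector.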
Finally, the RNN decoder generates an output vector $V_o$ using the FNN-encoded input sentence $\mathrm{FNN}(T_0)$, the start symbol <S>, and the context vector $c$, as shown in the following equation:

$$V_o = \mathrm{GRU}(\mathrm{FNN}(T_0), \langle S \rangle, c) \tag{5}$$

To supplement the output vector $V_o$ with domain-specific knowledge, we adopt the domain embedding scheme proposed in [14]. We define one class embedding vector per FAQ category, as shown in the following equation:

$$V_{C_t} = \frac{1}{N_t} \sum_{k=1}^{N_t} e_k \tag{6}$$

where $V_{C_t}$ is the class embedding vector, calculated as the average of the word embedding vectors $e_k$ in the sentences belonging to the t-th FAQ category, and $N_t$ is the number of those word embedding vectors. Finally, to classify input sentences into FAQ categories, we use an FNN. The vector of inner products between the output vector $V_o$ of the RNN decoder and the class embedding matrix $V_C$ is used as the input vector of the FNN.
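A small sketch may help make the class embedding step concrete. It assumes that every word has a fixed word embedding vector available in a lookup table and stores one row per category (the paper stores $V_C$ as a 100 × 411 matrix, i.e., one column per category). The names `class_embedding_matrix` and `class_scores` and the toy data are illustrative only.

```python
def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def class_embedding_matrix(sentences_by_category, word_vectors):
    """One class embedding per category: the mean of the word embedding
    vectors of all words in the sentences of that category."""
    matrix = []
    for sentences in sentences_by_category:
        vectors = [word_vectors[w] for sentence in sentences for w in sentence]
        matrix.append(average(vectors))
    return matrix


def class_scores(v_o, V_C):
    """Inner products between the decoder output and every class embedding.
    In the paper, this score vector is the input of the final FNN."""
    return [sum(a * b for a, b in zip(v_o, row)) for row in V_C]


# Toy example: 2 categories, 2-dimensional word vectors.
word_vectors = {"send": [1.0, 0.0], "money": [0.0, 1.0], "password": [1.0, 1.0]}
sentences_by_category = [
    [["send", "money"]],
    [["password"], ["password", "money"]],
]
V_C = class_embedding_matrix(sentences_by_category, word_vectors)
scores = class_scores([1.0, 0.0], V_C)
```

With 411 categories, `scores` would be a 411-dimensional vector, one inner product per FAQ category.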

Data Sets and Experimental Settings
We collected an FAQ dataset (10,495 pairs of FAQs about online banking). The FAQ dataset is a set of users' queries manually annotated with FAQ categories. The queries had many spacing errors and spelling errors because they were collected from a real mobile app service. An FAQ in the dataset consists of, on average, 23.3 eumjeols (Korean syllables) and contains, on average, 0.7 typo-like spelling errors and spacing errors. The number of FAQ categories was 411. Table 1 shows a sample of the FAQ dataset.

Sentence (Korean) | Sentence (English) | ID of FAQ Category
[...] | How to change the password | 7
비번 바꾸는 법 | How to change PW | 7

Figure 5 shows a histogram of the data distribution over all 411 categories. Figure 6 shows the distribution of FAQ categories according to the number of queries included in each FAQ category.
As shown in Figures 5 and 6, 63% of FAQ categories included fewer than six queries. To evaluate the proposed model, we randomly divided the FAQ dataset into a training set, a validation set, and a test set at a ratio of 8:1:1. As the evaluation measure, we used accuracy.
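The evaluation protocol can be sketched in a few lines. This is only an illustration under assumptions: the paper does not state the random seed, whether the split was stratified, or how remainders were assigned, so the `seed` parameter and the integer rounding below are choices made for the example.

```python
import random


def split_8_1_1(examples, seed=0):
    """Randomly split examples into training, validation, and test sets (8:1:1)."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10  # integer arithmetic avoids float rounding
    n_valid = len(items) // 10
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test


def accuracy(gold, predicted):
    """Fraction of queries whose predicted FAQ category matches the gold one."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


train, valid, test = split_8_1_1(range(10495))
```

Under these rounding choices, the 10,495 queries would be divided into 8396 training, 1049 validation, and 1050 test examples.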
To implement the proposed model, we pre-trained GloVe [19] embeddings on 20 GB of Korean news articles and used them as the word embedding vectors in Equation (2). The vocabulary size of the GloVe embeddings was 210,867. We initialized the character-level embedding vectors with random values; their vocabulary size was 132. We set the sizes of the embedding vectors (i.e., $e_i$, $\hat{e}_i$, and $e_i^{char}$) to 100, 100, and 300, respectively, and the size of the class embedding matrix (i.e., $V_C$) to 100 × 411. We set the hidden size, the attention head size, and the number of layers in the transformer's encoder to 500, 12, and 6, respectively. We set the hidden size of the GRU to 100. The model was optimized with Adam [20] at a learning rate of 0.00005, and the learning rate was halved whenever the performance on the validation set did not improve. The dropout rate was set to 0.2, and the mini-batch size was set to 64 sentences. We set the learning rate, the dropout rate, and the mini-batch size empirically to obtain the best performance.
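The learning-rate rule can be illustrated with a short sketch. The paper does not say how often validation performance was checked or how "did not improve" was measured, so the version below assumes one check per epoch, compares against the best validation accuracy seen so far, and halves immediately (no patience). The function name `halve_on_plateau` is hypothetical.

```python
def halve_on_plateau(valid_accuracies, initial_lr=0.00005):
    """Return the learning rate in effect after each validation check.

    The rate is halved whenever the validation accuracy fails to exceed
    the best value seen so far (an assumed reading of the rule).
    """
    lr = initial_lr
    best = float("-inf")
    history = []
    for acc in valid_accuracies:
        if acc > best:
            best = acc  # improvement: keep the current learning rate
        else:
            lr *= 0.5   # no improvement: halve the learning rate
        history.append(lr)
    return history


schedule = halve_on_plateau([0.50, 0.60, 0.60, 0.70, 0.65])
```

In this toy run the rate is halved after the third and fifth checks, where the validation accuracy fails to beat the best value so far.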

Experimental Results
The first experiment was to evaluate the effectiveness of the proposed embedding methods by comparing the performance changes, as shown in Table 2. In Table 2, the baseline model (WordEmbed) uses fixed GloVe embeddings as input vectors. CharEmbed, TunedEmbed, and ClassEmbed refer to the character-level word embeddings, the fine-tuned word embeddings, and the class embeddings that are proposed in this paper, respectively. As shown in Table 2, the proposed embedding methods contributed to increasing the performance of FAQ classification.
The second experiment was to compare the proposed model with the previous models, as shown in Table 3. In Table 3, CNN is the sentence classification model based on a CNN [2], in which pre-trained word vectors are converted into feature maps by convolution operations based on multiple filters. OKAPI is the Okapi BM25 retrieval model [21], a state-of-the-art ranking function used in document retrieval. BERT-Multilingual is a multilingual version of BERT [15] that is pre-trained on a large multilingual text corpus, including Korean. In our experiments, BERT-Multilingual was fine-tuned for 15 epochs on the FAQ dataset. As shown in Table 3, the proposed model outperformed both the well-known sentence classification model and the keyword search model.
The last experiment was to compare the performance changes of the proposed model according to the size of training data, as shown in Figure 7.
In Figure 7, FAQ-n indicates the FAQ categories that contain n queries (i.e., n training examples). The parenthesized values indicate the number of FAQ categories associated with each FAQ-n in the test data. The figure shows that the proposed model needed at least five training examples per FAQ category to obtain an accuracy of more than 0.8.
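The grouping behind an analysis of this kind can be sketched as follows. This is a hedged illustration, not the authors' evaluation script: it assumes that n is the number of training queries of the gold category and that accuracy is computed over the test queries in each group. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def accuracy_by_training_count(train_labels, test_gold, test_pred):
    """Map n -> accuracy over test queries whose gold category has
    exactly n training queries (the FAQ-n groups)."""
    train_counts = Counter(train_labels)
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in zip(test_gold, test_pred):
        n = train_counts[gold]
        total[n] += 1
        if gold == pred:
            correct[n] += 1
    return {n: correct[n] / total[n] for n in sorted(total)}


per_group = accuracy_by_training_count(
    train_labels=[1, 1, 1, 2, 2, 3],
    test_gold=[1, 1, 2, 3],
    test_pred=[1, 2, 2, 1],
)
```

Plotting such a dictionary against n would show how accuracy changes with the number of training queries per category.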

Conclusions
We proposed a high-performance sentence classification model based on an encoder-decoder model with an attention mechanism. To bridge the lexical gaps between users' queries and FAQs, we used three types of word embeddings (fixed word embeddings, fine-tuned word embeddings, and character-level word embeddings) as inputs to the transformer's encoder. To supplement domain knowledge associated with categories, we added class embeddings to the outputs of the RNN decoder. In the experiments with the FAQ dataset, the proposed model outperformed the comparison models, and the proposed embedding methods contributed to improving the performance of sentence classification. However, the proposed model performed poorly on FAQ categories that contain only a small number of training examples. To mitigate this problem, we need to adopt pre-trained language models such as BERT and XLNet [22] as encoders. In the future, we will try to combine the proposed model with a chatbot model for assisting online banking customers. To that end, we will study a method for returning a nil category so that the chatbot model can generate proper responses when users' queries are not associated with any of the predefined FAQ categories.