An Empirical Study on Deep Neural Network Models for Chinese Dialogue Generation

Abstract: The task of dialogue generation has attracted increasing attention due to its diverse downstream applications, such as question-answering systems and chatbots. Recently, deep neural network (DNN)-based dialogue generation models have achieved superior performance over conventional models based on statistical machine learning. However, although an enormous number of state-of-the-art DNN-based models have been proposed, there is a lack of detailed empirical comparative analysis of them on open Chinese corpora. As a result, relevant researchers and engineers may find it hard to get an intuitive understanding of the current research progress. To address this challenge, we conducted an empirical study of state-of-the-art DNN-based dialogue generation models on various Chinese corpora. Specifically, extensive experiments were performed on several well-known single-turn and multi-turn dialogue corpora, including KdConv, Weibo, and Douban, to evaluate a wide range of dialogue generation models based on the symmetrical architectures of Seq2Seq, RNNSearch, transformer, generative adversarial nets, and reinforcement learning, respectively. Moreover, we paid special attention to the effect of the prevalent pre-trained models on the quality of dialogue generation. Their performances were evaluated by four widely-used metrics in this area: BLEU, perplexity, distinct, and rouge. Finally, we report a case study to show example responses generated by these models separately.


Introduction
Text generation is a core task in artificial intelligence and natural language processing, which is essential for many popular downstream applications, such as the chitchat-style dialogue system for human-machine conversations. Compared with traditional statistical machine learning methods [1,2], methods based on deep neural networks (DNNs) [3][4][5] have achieved superior performance because they can effectively leverage massive data to capture high-level feature representations.

(Example query from the corpus: 你之前去过乌鲁木齐吗？ "Have you been to Urumqi before?")

Despite the dominant success of DNN-based dialogue generation systems shown in the literature, it is interesting to further compare the performances of different DNN structures on real-world corpora under the same evaluation metrics and to study their differences, which has seldom been explored. This is also important to both researchers and software engineers, because they need practical and detailed guidelines on the performances of different deep learning architectures and pre-trained language models on the task of Chinese dialogue generation. To fill this gap in open-domain dialogue generation, reference [7] conducted an empirical investigation of pre-trained transformer language models. However, they only evaluated the architectures of the pre-trained language model and Seq2Seq and did not comprehensively evaluate the effects of different architectures. Moreover, to the best of our knowledge, there is no empirical comparative study of different DNN structures on open Chinese corpora for the dialogue generation task.
To that end, in this paper, we present an empirical study on the performances and impacts of symmetrical DNN architectures (i.e., Seq2Seq, RNNSearch, transformer, generative adversarial nets, reinforcement learning, and pre-trained language models) for the task of Chinese dialogue generation on three typical single-turn and multi-turn Chinese dialogue corpora: KdConv, Weibo, and Douban. We chose six specific and typical models using the aforementioned techniques, which have also been verified to achieve superior performance in the literature. Detailed evaluation results based on four metrics, BLEU, rouge, perplexity, and distinct, are reported for the generated results. Furthermore, we analyze the results and try to disclose the mechanisms that cause the differences.
The contributions of this paper can be summarized as follows:
• We conducted extensive experiments to compare the performances of six typical DNN architectures for the dialogue generation task on three open Chinese corpora.
• We performed a case study to help understand their performances in an intuitive manner.
• We analyzed the mechanisms behind the different performances of these models and provide practical recommendations for model selection.

Background
In this section, we introduce the different DNN architectures that have recently dominated the dialogue generation task.

Seq2Seq
Seq2Seq is built on the encoder-decoder framework [8,9]. Given a source sequence X = (x_1, x_2, ..., x_T) with T words and a target sequence Y = (y_1, y_2, ..., y_T'), the model maximizes the generation probability of Y conditioned on X: p(y_1, y_2, ..., y_T' | x_1, x_2, ..., x_T). The encoder reads X word by word and compresses it into a context vector c produced by an RNN:

h_t = f(x_t, h_{t-1}),

where h_t is the hidden state at time t and f is a non-linear function, such as an LSTM or GRU cell; the context vector c is typically the hidden state corresponding to the last word, h_T. The decoder is an RNN that additionally conditions on the context vector c. The probability distribution p_t over candidate words at each time step is calculated as

p_t = p(y_t | y_1, ..., y_{t-1}, c) = g(y_{t-1}, s_t, c),

where s_t is the hidden state of the decoder RNN at time t and y_{t-1} is the word at time t-1 in the response sequence. The Seq2Seq objective function is the log-likelihood of the target sequence, sum_t log p(y_t | y_1, ..., y_{t-1}, c), maximized over the training corpus.
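As a minimal illustration of this factorization, the log-probability of a whole response is just the sum of the per-step log-probabilities produced by the decoder. The sketch below uses hand-made toy distributions, not a trained model:

```python
import numpy as np

def sequence_log_prob(step_distributions, target_ids):
    """Log-probability of a target sequence under the Seq2Seq factorization
    p(Y|X) = prod_t p(y_t | y_1..y_{t-1}, c): sum the per-step log-probs."""
    return float(sum(np.log(dist[tok])
                     for dist, tok in zip(step_distributions, target_ids)))

# Toy decoder output: distributions over a 3-word vocabulary at two time steps.
dists = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
print(sequence_log_prob(dists, [0, 1]))  # log(0.7) + log(0.8)
```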

RNNSearch
To improve Seq2Seq performance, RNNSearch introduces the attention mechanism. Instead of a single shared context vector, each word y_i in the response depends on its own context vector c_i, since each word in y may be related to a different part of x. Specifically, c_i is the weighted average of the encoder hidden states h_1, ..., h_T:

c_i = sum_{j=1}^{T} a_{i,j} h_j,

where the attention weight a_{i,j} is computed by

a_{i,j} = exp(e_{i,j}) / sum_{k=1}^{T} exp(e_{i,k}), with e_{i,j} = g(s_{i-1}, h_j),

where g is a multilayer perceptron that scores how well the decoder state s_{i-1} matches the encoder state h_j.
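This attention computation can be sketched in NumPy as follows. The weight matrices W_s and W_h and the vector v stand in for the MLP g; all values are random and purely illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, W_s, W_h, v):
    """Bahdanau-style attention sketch: score each encoder state h_j against
    the previous decoder state s_{i-1} with a small MLP g, normalize the
    scores with softmax, and return the weighted average c_i."""
    scores = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h) for h in H])
    a = softmax(scores)   # attention weights a_{i,j}, summing to 1
    return a, a @ H       # context vector c_i

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))   # 4 encoder hidden states, dimension 8
s = rng.normal(size=8)        # previous decoder state s_{i-1}
W_s, W_h = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
v = rng.normal(size=8)
a, c = attention_context(s, H, W_s, W_h, v)
print(a.sum(), c.shape)
```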

Transformer
Transformer abandons the recurrent network structure of RNNs and models a piece of text entirely with attention mechanisms. The most important module of the encoding unit is the self-attention module, which can be described as:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension. To extend the ability of the model to focus on different locations and to increase the representation learning capacity of the attention subspaces, the transformer adopts a "multi-head" mode that can be expressed as:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).
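The two formulas can be sketched in NumPy as below. This is a toy: real implementations share learned weights across calls and add the output projection W^O, masking, and residual connections, all omitted here:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head(X, heads, d_head, rng):
    """Multi-head sketch: project X into per-head subspaces, attend in each
    subspace, then concatenate the heads (output projection omitted)."""
    outs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.normal(size=(X.shape[1], d_head)) for _ in range(3))
        outs.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))   # 5 tokens, model dimension 16
Y = multi_head(X, heads=4, d_head=4, rng=rng)
print(Y.shape)
```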

Generative Adversarial Nets
Generative adversarial nets [8,10,11] train two competing networks: a generator, trained to make the generated results as close to the real ones as possible, and a discriminator, trained to determine whether the data come from the real distribution or from the distribution learned by the generator.
Suppose the generation model is G(z|θ_g), where z is random noise; G converts this noise into a data sample x. Taking the text problem as an example, we can treat the output of G as a text and D as a discriminant model. For any input x, the output of D(x|θ_d) is a real number in the range [0, 1], which gives the probability of this text being real. Let p_r and p_g represent the distribution of real text and the distribution of generated text, respectively. The objective function of the discriminant model is:

max_D E_{x~p_r}[log D(x)] + E_{x~p_g}[log(1 - D(x))].

The goal of the generation model, conversely, is to make the discrimination model unable to distinguish real text from generated text. The objective function of the generation model is expressed as:

min_G E_{x~p_g}[log(1 - D(x))].

Then the whole optimization objective is the minimax game:

min_G max_D E_{x~p_r}[log D(x)] + E_{z}[log(1 - D(G(z)))].
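The two objectives can be illustrated numerically. The sketch below uses hand-picked discriminator scores rather than actual networks, and returns negated objectives as losses to minimize:

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator objective: maximize E[log D(x)] + E[log(1 - D(G(z)))].
    Returned negated, as a loss to minimize."""
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def g_loss(d_fake):
    """Generator objective (non-saturating form): maximize E[log D(G(z))]."""
    return -np.mean(np.log(d_fake))

# Toy discriminator scores in (0, 1) for real and generated samples: the
# discriminator is confident here, so its loss is low and the generator's high.
real, fake = np.array([0.9, 0.8]), np.array([0.2, 0.1])
print(d_loss(real, fake), g_loss(fake))
```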

Reinforcement Learning
We regard the generated text as actions that are taken according to a policy defined by an encoder-decoder model. A policy search then optimizes the network parameters to maximize the expected future reward. In dialogue generation, the main reinforcement learning methods are policy gradient methods and Q-learning.
Q-learning is the basic value-based algorithm. Let Q be the value function; then the optimal value function of the Q-learning algorithm satisfies the Bellman equation:

Q*(s, a) = (B Q*)(s, a),

where the Bellman operator B can be described as

(B Q)(s, a) = E[r(s, a) + γ max_{a'} Q(s', a')].

On the other hand, policy gradient methods directly search for a neural-network-parameterized policy π_θ that maximizes the expected cumulative reward J(θ) = E_{π_θ}[R]. The easiest approach to obtain the policy gradient estimator is the score-function (REINFORCE) estimator, which can be defined as:

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a|s) R].
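The score-function estimator can be illustrated on a toy bandit problem. This sketch makes simplifying assumptions not in the text: a single state, known deterministic rewards, and a plain softmax policy over three actions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_step(theta, rewards, rng, lr=0.5):
    """One REINFORCE update: sample an action from the softmax policy
    pi_theta, then ascend grad log pi_theta(a) * r, the score-function
    estimate of the policy gradient."""
    probs = softmax(theta)
    a = rng.choice(len(theta), p=probs)
    grad_log = -probs
    grad_log[a] += 1.0   # gradient of log softmax(theta)[a] w.r.t. theta
    return theta + lr * rewards[a] * grad_log

rng = np.random.default_rng(0)
theta = np.zeros(3)
rewards = np.array([0.0, 0.0, 1.0])   # action 2 is the only rewarded action
for _ in range(200):
    theta = reinforce_step(theta, rewards, rng)
print(softmax(theta))   # policy mass shifts toward action 2
```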

Pre-Training Language Model
Let a text sequence X = (x_1, x_2, ..., x_T) and a sequence Y = (y_1, y_2, ..., y_T') denote the input context and the target response, respectively. The conditional probability p(x_t | x_0:t-1) is modeled by a probability distribution over the vocabulary given the linguistic context x_0:t-1. The context x_0:t-1 is encoded by a neural encoder f_enc(·), and the conditional probability is

p(x_t | x_0:t-1) = g_LM(f_enc(x_0:t-1)),

where g_LM(·) is the prediction layer. On a large corpus, we can use maximum likelihood estimation (MLE) to train the whole network: X and Y are first concatenated, and the prediction loss over the whole target response sequence is used as the loss function. The loss term for predicting the dialogue context X is the negative log-likelihood, -sum_t log p(x_t | x_0:t-1).
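The MLE loss for next-token prediction can be sketched as follows; random logits stand in for the output of the prediction layer g_LM:

```python
import numpy as np

def lm_loss(logits, targets):
    """Mean negative log-likelihood for next-token prediction: at each
    position t, the prediction layer produces logits over the vocabulary,
    and MLE minimizes -sum_t log p(x_t | x_{0:t-1})."""
    logits = logits - logits.max(axis=-1, keepdims=True)       # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -float(log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))        # 6 positions, vocabulary of 10
targets = rng.integers(0, 10, size=6)    # gold next tokens
print(lm_loss(logits, targets))
```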

Related Work
In the field of natural language processing, dialogue generation has attracted more and more attention from researchers. The meta-dialog system (MDS) of [12] combines the advantages of meta-learning approaches and human-machine collaboration, since existing end-to-end dialog systems perform less effectively when data are lacking. Reference [13] proposed a novel multi-turn dataset to facilitate research on conversation reasoning. Existing techniques for building an open-domain dialogue system can be categorized into several groups; besides end-to-end models, we briefly review three related lines of work: generative adversarial nets, reinforcement learning, and pre-trained language models.
Recent studies have shown that end-to-end task-oriented dialog systems can achieve remarkable success. References [14,15], based on the end-to-end framework, achieved productive performance in dialogue generation; reference [16] improves the quality of the selected response for a retrieval-based dialog system built on an end-to-end architecture. We introduce end-to-end dialogue architectures in terms of three structures: Seq2Seq, RNNSearch, and transformer.

• Seq2Seq: The first group learns the response generation model under the simple Seq2Seq architecture.
Reference [17] used multi-layer LSTM to map the input sequence to a fixed-dimensional vector and then used another deep-layer LSTM to decode the target sequence from the vector, with high automation and flexibility. Reference [18] introduced HRED, which uses a hierarchical codec architecture to model all contextual sentences. Reference [19] used an extended encoder-decoder that provides encoding of context and external knowledge.
• RNNSearch: The second group is based on the basic sequence-to-sequence architecture with an attention frame [20,21]. Reference [22] proposed a new multi-turn dialogue generation model that uses a self-attention mechanism to capture long-distance dependence. Reference [23] proposed a weighted sequence (WSeq) attention model for HRED, which uses cosine similarity to measure the degree of correlation. Reference [24] introduced an attention mechanism into HRED and proposed a new hierarchical recurrent attention network (HRAN).
• Transformer: The last group sits at the peak of attention architectures with the transformer framework. Reference [25] proposed an NMT model via multi-head attention, which inspired many follow-up works. Reference [26] proposed an incremental transformer with a deliberation decoder to solve the task of document-grounded conversations. Reference [27] proposed a transformer-based model to address multi-turn open-domain dialogue grounded in unstructured text facts.
• Generative adversarial nets: Reference [28] proposed a sequence generation method, SeqGAN, to effectively train generative adversarial nets for structured sequence generation via policy gradient. For poem and spoken-language generation, SeqGAN showed excellent performance in generating creative sequences. Reference [29] proposed a new generative adversarial nets (GAN) variant for dialog generation, introducing an approximate embedding layer into GAN and adding an adversarial discriminator to improve the diversity of answers. Reference [30] addressed the problems of inconsistent personality across conversations and averaged user personality by conditioning generation on a controllable agent persona rather than on prior conversations of a target actor.
• Reinforcement learning: Reference [4] showed how to introduce deep reinforcement learning into chatbot dialogue to model future reward. Reference [31] presented budget-conscious scheduling (BCS) to make the best use of a small, fixed amount of user interactions (the budget) for learning task-oriented dialogue agents, extending deep Dyna-Q (DDQ). Reference [32] proposed a new model-based reinforcement learning approach, discriminative deep Dyna-Q (D3Q), for task-completion dialogue policy learning. Reference [33] presented another reinforcement learning framework, switch-based active deep Dyna-Q (Switch-DDQ), for the same task.
• Pre-trained language models: Recently, pre-trained language models such as GPT-2 [34] and BERT [35] have achieved enormous success in various NLP tasks, including machine translation, text classification, summarization, and question answering. Academics are also working to adapt these language models to the task of dialogue generation. Typically, reference [5] conducted experimental analyses of the performance of language models on dialogue generation. Reference [6] proposed a relevance-promoting language model that incorporates a topic inference component into the language model to conduct diverse and informative dialogue generation. Reference [36] presented an end-to-end monolithic neural model for goal-oriented dialogues using GPT-2. Reference [37] proposed a new dialogue generation framework that uses pre-training to support various conversations, such as chit-chat, knowledge-grounded dialogues, and conversational question answering.

Evaluation Metrics
We adopt BLEU [42], distinct [43], rouge [44], and perplexity [3] as the evaluation metrics to measure the quality of the generated responses. For BLEU, we employ the values of BLEU-1 to BLEU-4, and for rouge we show the values of rouge-1/2/L. Intuitively, the higher the BLEU and rouge scores, the more n-gram overlap between the generated and reference responses, and thereby the better the performance. To be more specific, BLEU-N is formally defined as:

BLEU-N = BP · exp(sum_{n=1}^{N} W_n log P_n), with BP = min(1, e^{1 - R/C}),

where P_n is the modified n-gram precision, W_n = 1/N is the uniform weight, and R and C represent the lengths of the reference response and the predicted response, respectively. With the continuous shift from unigrams to 4-grams, BLEU-4 is applied more and more in machine translation and dialogue systems. Rouge-N is formally defined as:

Rouge-N = sum_{S in refs} sum_{gram_n in S} Count_match(gram_n) / sum_{S in refs} sum_{gram_n in S} Count(gram_n),

where n is the length of the n-gram and Count_match(gram_n) is the maximum number of n-grams co-occurring in the candidate response and the set of reference responses.
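As an illustration, the BLEU computation can be sketched for a single sentence pair. This is a simplified, smoothed variant for exposition, not the exact scorer used in the experiments:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU-N sketch: geometric mean of the modified n-gram
    precisions P_n with uniform weights W_n = 1/N, multiplied by the
    brevity penalty BP = min(1, e^{1 - R/C})."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("我 之前 去过 乌鲁木齐".split(), "我 之前 去过 乌鲁木齐".split()))  # identical → 1.0
```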
We used the values of distinct-1 to distinct-4 to evaluate the diversity of the generated responses. Distinct-N is defined as:

Distinct-N = Count(unique n-grams) / Count(n-grams),

where Count(unique n-grams) is the number of distinct n-grams in the reply and Count(n-grams) is the total number of n-grams in the reply. The larger the distinct-N, the higher the diversity of the generation. Finally, perplexity is a well-established performance metric for generative dialogue models.
Perplexity explicitly measures the ability of the model to account for the syntactic structure of the dialogue and of each utterance; lower perplexity indicates a better model. For a model with parameters θ and a dataset containing n triplets {(u_1^n, u_2^n, u_3^n)}, we define word perplexity as:

PPL = exp(-(1/N_w) sum_n log p_θ(u_1^n, u_2^n, u_3^n)),

where N_w is the number of tokens in the whole dataset. In dialogue, the distribution of the words in the next utterance is highly multi-modal, i.e., there are many possible answers, which makes perplexity especially suitable, because it always measures the probability of regenerating the exact reference utterance.
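Distinct-N and word perplexity can be computed as follows. This is a simplified sketch: the perplexity function takes per-token log-probabilities as given rather than running a model:

```python
import math

def distinct_n(tokens, n):
    """Distinct-N: number of unique n-grams divided by total n-grams."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

def perplexity(token_log_probs):
    """Word perplexity: exp(-(1/N_w) * sum log p(w)); lower is better."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

reply = "我 喜欢 喜欢 喜欢 旅游".split()       # a repetitive reply
print(distinct_n(reply, 1))                    # 3 unique unigrams / 5 total = 0.6
print(perplexity([math.log(0.25)] * 8))        # uniform p=0.25 per token → 4.0
```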

Comparison Models
In this section, we briefly introduce the models that we aim to study.

• HRED (the code implementation can be found at https://github.com/julianser/hed-dlg): HRED [18] is a hierarchical RNN-based encoder-decoder constructed for multi-turn dialogue generation tasks. The encoder RNN maps each utterance in the dialogue to an utterance vector. A higher-level context RNN keeps track of past utterances by iteratively processing each utterance vector. The next-utterance prediction is performed by an RNN decoder, which takes the hidden state of the context RNN and produces a probability distribution over the tokens of the next utterance.
• ReCoSa (the code and data are available at https://github.com/zhanghainan/ReCoSa): ReCoSa [22] is a model based on the attention mechanism. First, a word-level LSTM encoder is executed to obtain an initial representation of each context utterance. A self-attention mechanism is then used to update both the context and the masked response representations. Finally, the attention weights between each context and response representation are calculated and used for further decoding.

• RL-based model (the code can be found at https://github.com/liuyuemaicha/Deep-Reinforcement-Learning-for-Dialogue-Generation-in-tensorflow): This model uses reinforcement learning to model future reward in chatbot dialogue. It simulates the conversation between two virtual agents and uses the policy gradient method to reward sequences that show three useful conversational properties: informativeness, strong coherence, and ease of answering (related to the forward-looking function).
• GAN-AEL (the code and models are available at https://github.com/lan2720/GAN-AEL): GAN-AEL [29] is a GAN framework that models single-turn short-text conversations. GAN-AEL trains a Seq2Seq network together with a discriminative classifier, which measures the difference between human-generated and machine-generated responses, and introduces an approximate embedding layer to solve the non-differentiability caused by sampling-based output decoding in the Seq2Seq generation steps.
• BigLM-24 (the code and models are available at https://github.com/lipiji/Guyu): This is a language model with both pre-training and fine-tuning procedures [26]. BigLM-24 is a typical GPT-2 model with 345 million parameters (1024 dimensions, 24 layers, 16 heads). During training, we employ maximum likelihood estimation (MLE) to conduct the parameter learning. In the inference stage, various decoding strategies, such as greedy search, beam search, truncated top-k sampling, and nucleus sampling, are employed to generate the response text.
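The truncated top-k and nucleus sampling strategies mentioned above can be sketched as filters over the next-token distribution. This is an illustrative NumPy sketch, not the actual Guyu implementation:

```python
import numpy as np

def top_k_filter(probs, k):
    """Truncated top-k sampling: keep the k most likely tokens, renormalize,
    then sample from the truncated distribution."""
    idx = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[idx] = probs[idx]
    return out / out.sum()

def nucleus_filter(probs, p):
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[:int(np.searchsorted(cum, p)) + 1]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(top_k_filter(probs, 2))        # mass only on the two most likely tokens
print(nucleus_filter(probs, 0.75))   # keeps tokens until cumulative mass >= 0.75
```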

Setup
We are interested in models that perform robustly across a diverse set of tasks. To this end, we used the same hyperparameters as those in the original papers. We ran these models on 4 Tesla K80 GPUs and saved the best model on the validation set for testing.

Results and Analysis
Tables 2-4 and Figures 2 and 3 present the detailed evaluation results for all the models on the KdConv, Weibo, and Douban datasets, respectively.
Which models generate results with better relevance? • As shown in Table 2, on the KdConv dataset, according to BLEU and perplexity, the models rank from best to worst as follows: pre-training-based models, transformer-based models, RNNSearch-based models, RL-based models, Seq2Seq-based models, and GAN-based models. As given in Table 3, however, on the Weibo dataset we can observe from the results that RL-based models performed better than RNNSearch-based models, and GAN-based models performed better than Seq2Seq-based models. As given in Table 4, on the Douban dataset, the six architectures rank as follows: pre-training-based models, transformer-based models, RNNSearch-based models, RL-based models, Seq2Seq-based models, and GAN-based models; among them, Seq2Seq-based and GAN-based models performed similarly.
How about diversity? • As given in Figure 2, pre-training benefits performance on diversity (distinct). Figure 3 also shows that pre-training-based models obtain better rouge scores than the other models on all three datasets. Transformer-based models also perform well. This can be attributed to the fact that pre-trained models are flexible enough to handle various downstream dialogue generation tasks.
Are larger models helpful?
• As shown in Tables 2-4, the pre-training-based model obtained lower perplexity. Compared with the plain Seq2Seq framework, attention mechanisms improve performance, and transformer-based models also obtained better performances on most of the datasets. GAN-based models performed worse in dialogue generation, and RL-based models require further research. Overall, pre-training-based models obtained better performances on the automatic evaluation metrics than the other models on all three datasets; this phenomenon was more evident on the multi-turn datasets of Douban and KdConv.
How about high-quality training data?
• Models trained on high-quality conversation datasets also perform better. From the results, we see that models on KdConv usually outperform those on Douban and Weibo on the metrics of BLEU, perplexity, distinct, and rouge. The reason may be that KdConv provides more useful information than the other datasets.

Case Study
We conducted a case study to demonstrate how these models respond to a given query. As shown in Table 5, the red keywords indicate context relevant to the response, and the blue text reflects the diversity and consistency of the generated responses. Seq2Seq-based models tend to generate responses that are only slightly relevant to the context. After introducing the attention mechanism, RNNSearch-based and transformer-based models can generate context-aware responses containing words such as "天气" (weather), "生活" (life), and "新疆" (Xinjiang). RL-based models and generative adversarial nets can generate keywords related to the query, but the semantics are often inconsistent with the query, and the quality of the generated statements remains unstable. In contrast, responses generated by the pre-trained language-model-based models show better quality, achieving both high relevance and diversity. This demonstrates the ability of large-scale language models in the decoding phase. We also conducted bot-to-bot interaction experiments on KdConv via the pre-trained language-model-based model; sample results are shown in Table 6. We let the two bots interact for four rounds and extracted the demonstration. It should be noted that there was no two-robot scene setting in our work; we simply managed the context memory to generate the next utterance. Therefore, we can observe that the topic drifts over time.

Discussion
We presented an empirical study on the performances and impacts of six different DNN architectures for dialogue generation. Encoder-decoder-based models are strong baselines and consistently produce better outcomes on automatic evaluation metrics. We argue that introducing a pre-trained language model into the encoder-decoder framework may further enhance the performance significantly. Unless the difficulty of passing gradients from the discriminative model back to the generative model can be effectively solved, GAN-based models will continue to suffer in dialogue generation performance. RL-based models can capture the long-term influence of a generated response in an ongoing dialogue; however, the design of the reward function depends on experience, the training process is difficult, and the stability of the generated sentences is not good enough.
We found that the text generated on the Weibo and Douban datasets was more random, with topics often drifting. The reason comes down to the distribution of the data: they are open chat datasets, so for the same question, the datasets contain multiple topics. On the KdConv dataset, dialogue performance in the fields of music, travel, and film is better, because KdConv mainly contains data from these three fields. This indicates that the quality of dialogue generation is strongly related to content and data distribution. Nevertheless, some serious issues remain in the generated results, such as grammatical errors and topic drift. More seriously, the chatbots sometimes generate counter-factual or offensive content. Hence, for each architecture, better model structures, training paradigms, and decoding strategies need to be investigated and built in the future.

Conclusions and Future Work
In this paper, we presented an empirical exploration of the performances of different architectures on the task of open-domain Chinese dialogue generation. A wide range of experiments were carried out on typical single-turn and multi-turn conversational datasets. We reported the detailed values of the automatic evaluation metrics BLEU, perplexity, distinct, and rouge for the generated dialogue. We found that text generated by the large-scale pre-trained model is superior to that of the other models in terms of evaluation metrics, and that an attention mechanism can significantly improve the performance of dialogue generation tasks. We also reported a case study to show example responses generated by these models, analyzed the reasons behind their different performances, and provided practical recommendations for model selection.
In future work, we will compare more architectures, such as VAE [45,46], and will select 3-5 models for the evaluation of each architecture. Meanwhile, more automatic metrics (embedding-based matching, METEOR, etc.), as well as human evaluation, will be added in the future.