Research on Modeling and Analysis of Generative Conversational System Based on Optimal Joint Structural and Linguistic Model

Generative conversational systems, consisting of a neural-network-based structural model and a linguistic model, have long been an attractive research area. However, such systems tend to generate single-turn responses that lack diversity and informativeness. For this reason, this paper develops the conversational system further by modeling and analyzing a joint structural and linguistic model. Firstly, we establish a novel dual-encoder structural model based on a new Convolutional Neural Network architecture and strengthened attention with intention, which can effectively extract the features of variable-length sequences and then mine their deep semantic information. Secondly, a linguistic model combining maximum mutual information with a foolish punishment mechanism is proposed. Thirdly, the conversational system built on the joint structural and linguistic model is analyzed and discussed. Finally, to validate the effectiveness of the proposed method, several models are tested, evaluated and compared with respect to Response Coherence, Response Diversity, Length of Conversation and Human Evaluation. The comparative results show that the proposed method effectively improves the response quality of the generative conversational system.


Introduction
Along with the rapid development of artificial intelligence, generative conversational systems based on joint structural and linguistic models are receiving increasing attention and are being applied in a range of robotic scenarios. Generative conversational systems are able to generate conversational responses actively, and they are not limited by conversation content. This offers benefits in many areas of daily life, such as the home, hospitals and entertainment venues.
Conversational systems are composed of a neural-network-based structural model and a linguistic model. The structural model mainly performs feature extraction and semantic understanding on input sequences, while the linguistic model determines the probability that an output sequence exists by assigning a probability distribution to an output sequence of length m. The response quality of the system, in aspects such as diversity, informativeness and multi-turn capability, is greatly influenced by the choice of structural and linguistic models. However, common and foolish responses are often generated when responses are predicted with a general statistical linguistic model. Linguistic models based on Maximum Mutual Information (MMI), Mutual Information (MI), Pointwise Mutual Information (PMI) and Term Frequency-Inverse Document Frequency (TF-IDF) have therefore been derived to increase the coherence between the input sequence and the system response. For example, linguistic models based on MMI avoid responses that enjoy unconditionally high probability, biasing instead towards responses specific to the given input [15]. Linguistic models based on MI avoid responses that enjoy high probability but are ungrammatical or incoherent [16]. Linguistic models incorporating a TF-IDF term avoid nonspecific responses [2]. Similarly, linguistic models based on PMI predict a noun as a keyword reflecting the main gist of the response, in order to generate a response containing that keyword [8]. These studies of the coherence between the input sequence and the system response increase informativeness to some extent, but foolish responses remain unavoidable in testing. Therefore, a linguistic model based on MMI and a foolish punishment mechanism is proposed.
To comprehensively improve the response quality of the conversational system with respect to both the structural model and the linguistic model, an attention-with-intention-based structural model and a TF-IDF-based linguistic model were combined in [2]. That joint model first modeled intention across turns using an RNN, then incorporated an attention model conditioned on the representation of intention, and finally avoided generating non-specific responses by incorporating an IDF term in the linguistic model. A structural model based on forward and backward neural networks and a linguistic model based on PMI were also combined [8]: PMI was first used to predict a keyword, and the structural model then generated a response containing that keyword. These joint models improved the informativeness of system responses by combining a developed structural model with a linguistic model. Therefore, in order to improve the response quality of conversational systems in terms of diversity, informativeness and multi-turn capability, a novel joint model is established in this paper, combining the dual-encoder structural model with the linguistic model. The theoretical model is also validated experimentally against baseline models.
To address the lack of diversity, informativeness and multi-turn capability, a joint model is presented in this paper. In Section 2, a novel dual-encoder structural model based on the new CNN and strengthened attention with intention is established. In Section 3, the linguistic model based on MMI and the foolish punishment mechanism is established. In Section 4, the experiments on the generative conversational system based on the joint structural and linguistic model are set up. In Section 5, comparisons are drawn between the joint model and the baseline models.

Model Architecture
In this section, a novel dual-encoder model structure based on the new CNN and strengthened attention with intention (SAWI-DCNN) is proposed, in which a CNN, rather than an RNN, is used to capture long-term context. First, the pre-processed input sequences are processed in encoder 1, as shown in Figure 1, while the previous target tokens are processed in encoder 2. Second, the output sequence of encoder 1 receives attention at the strengthened attention layer, where the attention distribution is affected by the state of encoder 2, including the conversational intention [2,17,18]. Finally, the attention output and the output of encoder 2 are combined to generate the predicted target token at the fully connected layer. In Figure 1, ① is the input pre-processing layer; ② is the dual-encoder layer (encoder 1: left; encoder 2: right); ③ is the conversational intention layer; ④ is the strengthened attention layer; and ⑤ denotes the fully connected layers.

Input Pre-Processing
The input sequence in turn $k$ is embedded as $E^{(k)} = [e_1^{(k)}, \dots, e_m^{(k)}]$, where $e_m^{(k)} \in \mathbb{R}^f$ represents the embedding vector at position $m$ during the $k$-th conversation turn. The features and deep semantics of the embedded vectors $E^{(k)}$ are extracted and mined in the conversational system model. However, deeply hidden semantics can only be excavated with difficulty when context is discarded across interactions. Conversely, too much noise is brought into the conversational system when the context is included in its entirety. Thus, the input contributed by the response of the previous turn is controlled in encoder 1 in order to increase the perception of the conversational environment and improve the interaction turns. The updated input vectors $E^{(k)}_{\mathrm{new}}$ can be defined as

$$E^{(k)}_{\mathrm{new}} = E^{(k)} + f\big(e_E^{(k)}, e_Y^{(k-1)}\big), \qquad (1)$$

where $e_E^{(k)}, e_Y^{(k-1)} \in \mathbb{R}^f$ are the sentence-level embedded vectors [19] of the input sequence of the current turn $k$ and the output sequence of the previous turn $k-1$, respectively. Note that the result of $f(\cdot)$ is a biased vector, which is able to control the input contributed by the previous output sequence.
Thus, Equation (1) can be rewritten with position information included. When embedded vectors are input into a CNN, multiple vectors are convolved simultaneously by the convolution kernels, and the sense of order of the vectors decreases as the number of convolution layers increases. For this reason, the absolute position is embedded into the input sequence in order to preserve the temporal order of the vectors and enable the model to perceive which part of the input sequence is being processed. The joint embedding vector is expressed as

$$S^{(k)} = E^{(k)}_{\mathrm{new}} + P^{(k)},$$

where $S^{(k)} \in \mathbb{R}^{m \times f}$ is the joint input vector and $P^{(k)} \in \mathbb{R}^{m \times f}$ is the absolute position embedding.
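As a concrete illustration, the following PyTorch sketch builds the joint input $S^{(k)}$ from token embeddings, a learned absolute position table, and a previous-turn bias. This is a minimal sketch, not the paper's exact implementation: the bias function $f(\cdot)$ is approximated by a linear layer over mean-pooled sentence-level vectors, and all names are placeholders.

```python
import torch
import torch.nn as nn

class InputPreprocessor(nn.Module):
    """Builds the joint input S^(k) = E_new^(k) + P^(k) described above."""

    def __init__(self, vocab_size, f=512, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, f)   # token embeddings E^(k)
        self.pos_emb = nn.Embedding(max_len, f)      # absolute positions P^(k)
        self.bias = nn.Linear(2 * f, f)              # stand-in for f(e_E, e_Y)

    def forward(self, tokens, prev_output_tokens):
        E = self.tok_emb(tokens)                     # (m, f)
        # Sentence-level vectors: mean-pooled embeddings of the current input
        # and the previous-turn output (one simple sentence-embedding choice).
        e_E = E.mean(dim=0)
        e_Y = self.tok_emb(prev_output_tokens).mean(dim=0)
        context_bias = self.bias(torch.cat([e_E, e_Y]))  # biased vector f(.)
        E_new = E + context_bias                     # broadcast over positions
        positions = torch.arange(tokens.size(0))
        return E_new + self.pos_emb(positions)       # joint input S^(k)
```

For example, `InputPreprocessor(30_000)(torch.tensor([4, 9, 2]), torch.tensor([7, 1]))` returns a (3, 512) joint input matrix.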

Dual-Encoder
The dual encoder consists of stacked convolution blocks, which include the new CNN, Gated Linear Units (GLU) [9], residual connections [20], and scaling factors. The outputs of the convolution blocks are represented as $h^{(k,l)} = [h_1^{(k,l)}, \dots, h_m^{(k,l)}] \in \mathbb{R}^{m \times f}$ in encoder 1 and encoder 2, respectively. Each convolution kernel in the new CNN is parameterized as $W \in \mathbb{R}^{w}$ with bias $b_w \in \mathbb{R}$, where $w$ is the kernel width. The input vectors $S^{(k)}$ are mapped to output vectors $Y \in \mathbb{R}^{2m \times f}$ through the new CNN, so that the output vectors have twice the dimensionality of the input vectors.
The information flow of the output $Y = [A, B] \in \mathbb{R}^{2m \times f}$ of the new CNN can be controlled by a GLU, which provides a linear path for gradient flow and alleviates the gradient problem caused by nonlinear gating. Thus, the gated linear unit is added to the convolution blocks:

$$\mathrm{GLU}(Y) = A \otimes \sigma(B),$$

where $A$ and $B$ are the inputs to the gating nonlinearity, $A, B \in \mathbb{R}^{m \times f}$; $\otimes$ refers to element-wise multiplication; the dimension of the output $\mathrm{GLU}(Y) \in \mathbb{R}^{m \times f}$ is half the size of $Y$; and the information flow $A$ related to the current context is controlled by the gates $\sigma(B)$. Meanwhile, in order to enable the conversational system to further mine deep semantic information in a conversational environment, the conversational intention vector $Z^{(k)} \in \mathbb{R}^f$ is added to encoder 2 as a bias on the convolution output.
Residual connections from the input of each convolution block to the linear gating output are added to avoid the degradation caused by network depth. In addition, scaling factors $\mu$ are added to the convolution blocks to preserve the input variance at the beginning of training. Thus, the output of the convolution block can be expressed as

$$h^{(k,l)} = \mu \left( \mathrm{GLU}\big(W^{(l)} \ast h^{(k,l-1)} + b^{(l)}\big) + h^{(k,l-1)} \right),$$

where $h^{(k,l-1)}$ are the outputs of the $(l-1)$-th convolution block in encoder 1 and encoder 2, respectively; meanwhile, the scaling factor $\mu$ is a hyperparameter that satisfies $\mu = \sqrt{0.5}$. In the test, the distribution of the target sequence token is predicted at the top level of the fully connected layers through the linguistic model based on MMI and the foolish punishment mechanism, as shown in Section 3.
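A minimal sketch of one such convolution block, assuming a standard `nn.Conv1d` producing $2f$ channels for the GLU split; the exact kernel parameterization in the paper may differ.

```python
import math
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One convolution block: conv -> GLU -> (+ intention bias) -> residual,
    scaled by sqrt(0.5), following the description above."""

    def __init__(self, f=512, kernel_width=3):
        super().__init__()
        # The convolution maps f channels to 2f so the GLU can split its
        # output into a value part A and a gate part B.
        self.conv = nn.Conv1d(f, 2 * f, kernel_width, padding=kernel_width // 2)

    def forward(self, h, intention=None):
        # h: (batch, m, f); Conv1d expects (batch, channels, length)
        y = self.conv(h.transpose(1, 2)).transpose(1, 2)   # (batch, m, 2f)
        A, B = y.chunk(2, dim=-1)
        out = A * torch.sigmoid(B)                         # GLU: A (x) sigma(B)
        if intention is not None:                          # encoder-2 bias Z^(k)
            out = out + intention
        return (out + h) * math.sqrt(0.5)                  # residual + scaling mu
```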

1-D Dynamic Convolutional Neural Networks (DCNN)
Since the dimension of the input vectors is reduced when convolution and pooling are performed by a CNN, it is difficult to increase the number of CNN layers when dealing with variable-length input sequences. Therefore, a new Convolutional Neural Network architecture is proposed, consisting of a one-dimensional wide convolution layer, a dynamic k-max pooling layer, a flattening layer, a dropout layer, and a recurrent fully connected layer. As shown in Figure 2, one-dimensional wide convolution operations are adopted [21]. This ensures that the vectors of the whole variable-length input sequence, including the edge words, are convolved by the convolution kernels, generating a non-empty feature map $c$. Two-channel, multi-kernel convolution is used in order to improve convolution speed and obtain more features; the convolution kernels are defined with a width of one dimension. In addition, the dropout layer is used for regularization, preventing over-fitting and divergence of the prediction. Meanwhile, in order to align the variable-length vectors of the input and output sequences, a recurrent fully connected layer is proposed. The recurrent fully connected layer is similar to the fully connected layer in an RNN, and its dimension is defined as an integer multiple of the input token vector. Finally, the output is generated by sliding the fully connected layer. The wide convolution can be expressed as

$$c = f(M \ast S + b_m),$$

where $M \in \mathbb{R}^m$ is a convolution kernel; $b_m \in \mathbb{R}$ is a bias; $S \in \mathbb{R}^{s \times f}$ are the input vectors; $c \in \mathbb{R}^{(s+m-1) \times f}$ is the feature map produced by the convolution operation; and $f(\cdot)$ is an activation function.

The dimensions of the vectors after wide convolution vary with the lengths of different input sequences, since the edge vectors are expanded by zero filling when the input sequence is convolved by the convolution kernels. Thus, the dimension of the convolved feature map is larger than that of the input sequence vectors. A one-dimensional dynamic k-max pooling process is therefore defined in order to align the output vector state with the input sequence vectors at each moment. The pooling parameter is defined as

$$k = s,$$

where $s$ is the length of the input sequence.
The pooled feature map produced by the single-channel convolution and pooling operations is represented as $C_{\max} \in \mathbb{R}^{s \times f}$, where the values of the feature map retain their source positions, with subscripts arranged in ascending order.
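A minimal sketch of the wide convolution and order-preserving dynamic k-max pooling, under the assumption $k = s$ stated above; the shared depthwise single-channel kernel is a simplification.

```python
import torch
import torch.nn.functional as F

def wide_conv_kmax(S, kernel, bias=0.0, k=None):
    """Wide 1-D convolution followed by order-preserving k-max pooling.

    S: (s, f) input vectors; kernel: (m,) one-dimensional convolution
    kernel. Returns a (k, f) feature map; k defaults to s so the output
    stays aligned with the input length."""
    s, f = S.shape
    m = kernel.numel()
    k = k or s
    # Zero-pad the edges so every token, including the edge words, is
    # covered: the wide convolution yields s + m - 1 positions.
    x = S.t().unsqueeze(0)                                # (1, f, s)
    w = kernel.view(1, 1, m).expand(f, 1, m).contiguous() # shared kernel
    c = F.conv1d(x, w, padding=m - 1, groups=f)           # (1, f, s + m - 1)
    c = torch.tanh(c + bias).squeeze(0).t()               # feature map c
    # k-max pooling per feature column, keeping the selected positions in
    # their original (ascending) order.
    idx = c.topk(k, dim=0).indices.sort(dim=0).values     # (k, f)
    return c.gather(0, idx)
```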

Centralizing Intention
The attention weights of the input sequence can be distributed at each step using attention models, and, according to the attention distribution, the semantic information of the input sequence can be further understood by the conversational system. The attention distribution over the encoder 1 output state can be affected not only by the previous output state of encoder 2, but also by the conversational intention [2,7,16], just as for a human being. Conversational intention represents the conversation context and the primary motivation of the conversation. However, the role of conversational intention in conversational responses is not immediately obvious, mainly because additional noise should make no contribution to the distribution of attention. Thus, to reduce the redundancy of intention caused by the increase in conversation turns, a dynamic model of the intention vector is established, and forgetting gates are added to the model. Hence, the final dynamic model of the intention vector can be expressed as

$$Z^{(k)} = \tanh\big(f_t \, Z^{(k-1)} + h_S^{(k,\mathrm{top})}\big),$$

where $Z^{(k)} \in \mathbb{R}^f$ is the intention vector of the $k$-th turn; $\tanh(\cdot)$ refers to the tanh operation; and $f_t \in \mathbb{R}^{f \times f}$ is a forgetting gate that controls the previous intention, expressed as

$$f_t = \sigma\big(h_S^{(k,\mathrm{top})} W_t + b_t\big),$$

where $W_t \in \mathbb{R}^{1 \times f}$ is a transformation matrix; $b_t \in \mathbb{R}^{f \times f}$ is a bias; and $h_S^{(k,\mathrm{top})} \in \mathbb{R}^f$ is a sentence-level vector of the encoder output at the $k$-th turn, which can be expressed as

$$h_S^{(k,\mathrm{top})} = \frac{1}{m} \sum_{i=1}^{m} h_i^{(k,\mathrm{top})},$$

where $h_i^{(k,\mathrm{top})}$ are the output vectors of the top-layer convolution block in encoder 1.
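The following sketch implements the intention update under the shape assumptions above; the outer-product gate is one composition consistent with the stated dimensions of $W_t$ and $b_t$, not necessarily the paper's exact form.

```python
import torch

def update_intention(Z_prev, h_sent, W_t, b_t):
    """Dynamic intention update with a forgetting gate.

    Z_prev: (f,) previous intention Z^(k-1); h_sent: (f,) sentence-level
    encoder vector h_S^(k,top) (mean of the top-layer outputs). W_t is
    (1, f) and b_t is (f, f), as stated in the text, so the outer product
    h_sent W_t gives the (f, f) gate f_t."""
    f_t = torch.sigmoid(torch.outer(h_sent, W_t.squeeze(0)) + b_t)  # (f, f)
    return torch.tanh(f_t @ Z_prev + h_sent)                        # Z^(k)
```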

Intensity-Strengthening Attention
Because the attention weights are distributed according to the contribution of each token in the sequence, and the sum of the attention weights is 1, the effect of a single attention head [22] becomes weaker and weaker as the input sequence grows. Indeed, the distribution of a single attention becomes more scattered, and individual weights can even approach zero when the input sequence is long. An intensity-strengthening attention method is therefore proposed in order to address the problems of overly small attention weights and partially excessive attention.
To preserve more context for the current state of encoder 2, the previous output sequence is convolved. The features of the output sequence at the current time are

$$h_Y^{(k,\mathrm{top})} = \big[h_1^{(k,\mathrm{top})}, \dots, h_{i-1}^{(k,\mathrm{top})}\big],$$

where $h_j^{(k,\mathrm{top})}$ are the output vectors of the top-layer convolution block in encoder 2. The current state $s_i^{(k)}$ of encoder 2 consists of the features of the output sequence and the previously predicted target token $g_{i-1}^{(k)}$. The query and key vectors of each attention head are computed as

$$d_i^{(k)} = W_h^Q s_i^{(k)}, \qquad k_j = W_h^K h_j^{(k,\mathrm{top})},$$

where $W_h^Q \in \mathbb{R}^{d_x \times f}$ and $W_h^K \in \mathbb{R}^{d_x \times f}$ are transformation matrices. Therefore, the input $C_i^{(k)}$ to the connection layer can be expressed as the superposition of the head outputs,

$$C_i^{(k)} = \sum_h \sum_j a_{ij}^{(h)} \, W_h^V z_j,$$

where $W_h^V \in \mathbb{R}^{d_x \times f}$ is a transformation matrix, $z_j$ is the $j$-th output of encoder 1, and $a_{ij}^{(h)}$ are the attention weights of head $h$. The overall intensity of attention is enhanced through this superimposed attention, which reduces the effects of both distraction and inattention. The output of encoder 1 contains the context and location information of the input sequence; similarly, the output state of encoder 2 includes the context, the previously predicted target token, and the intention information. Therefore, in the calculation of the attention distribution, the results are determined by all of the above information.
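A sketch of the superimposed attention: head outputs are summed rather than concatenated, so the overall attention intensity grows with the number of heads. The dimensions and the dot-product scoring are assumptions where the text is ambiguous.

```python
import torch
import torch.nn as nn

class StrengthenedAttention(nn.Module):
    """Superimposed multi-head attention: head outputs are summed rather
    than concatenated, so the overall attention intensity is enhanced."""

    def __init__(self, f=512, d_x=64, n_heads=8):
        super().__init__()
        self.WQ = nn.ModuleList(nn.Linear(f, d_x, bias=False) for _ in range(n_heads))
        self.WK = nn.ModuleList(nn.Linear(f, d_x, bias=False) for _ in range(n_heads))
        self.WV = nn.ModuleList(nn.Linear(f, f, bias=False) for _ in range(n_heads))

    def forward(self, dec_state, enc_out):
        # dec_state: (f,) current encoder-2 state s_i^(k);
        # enc_out: (m, f) encoder-1 outputs z_1..z_m.
        C = torch.zeros_like(dec_state)
        for WQ, WK, WV in zip(self.WQ, self.WK, self.WV):
            q = WQ(dec_state)                       # query d_i^(k), (d_x,)
            keys = WK(enc_out)                      # keys, (m, d_x)
            a = torch.softmax(keys @ q, dim=0)      # attention weights, (m,)
            C = C + a @ WV(enc_out)                 # superimpose the heads
        return C                                    # context C_i^(k)
```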

Linguistic Model Based on MMI and FPM
To guarantee the existence of the output sequence, the probability of the predicted target sequence needs to be estimated by the linguistic model. A linguistic model based on MMI, which can improve the response coherence of the conversational system and reduce the generation of irrelevant responses, is adopted in this paper to estimate the probability of the output sequence [6]. Nevertheless, foolish responses such as "I don't know" and "what?" are still unavoidable during testing. Therefore, a foolish punishment mechanism (FPM) is added to the MMI-based linguistic model to reduce the number of foolish responses. The weighted language-model probability of the output sequence, used for the general response punishment, is defined as
$$U(\hat{Y}) = \prod_{n=1}^{N} p\big(\hat{y}_n \mid \hat{y}_1, \hat{y}_2, \hat{y}_3, \cdots, \hat{y}_{n-1}\big) \cdot g(n), \qquad (20)$$

where $\lambda$ is a hyperparameter for the general response punishment; $g(n)$ restricts the punishment to the first $\gamma$ tokens; and $n$ is the index of the target token generated at time $n$.

The predicted target tokens are punished by calculating the probability of the foolish responses $Y$, as predicted from the previous output sequence $\hat{Y}$. For example, the current target token is predicted based on the previous output sequence as input; the target token is then compared with the foolish responses. If the predicted target token is similar to the foolish response tokens, the token is regarded as a foolish target token. According to the comparison results, the probability of the predicted target tokens being foolish response tokens is obtained, and this probability is used as the punishment for foolishness. Ten sequences $Y$ of foolish responses such as "I don't know" and "I have no idea", which are often generated by the general model, are built manually. Although the system generates more categories of foolish responses than the manually built sequences, these responses are similar to the established foolish responses. Therefore, the foolish punishment function is defined as

$$P_{\mathrm{fool}}(\hat{Y}) = \frac{1}{N_Y} \sum_{Y} \frac{1}{N_y} \sum_{n=1}^{N_y} p\big(y_n \mid \hat{Y}\big),$$

where $N_Y$ is the number of foolish responses and $N_y$ is the number of tokens in the foolish response $Y$. Meanwhile, the final objective function is defined as

$$\hat{Y}^{*} = \arg\max_{\hat{Y}} \Big\{ \log p\big(\hat{Y} \mid X\big) - \lambda_1 \log U(\hat{Y}) - \lambda_2 P_{\mathrm{fool}}(\hat{Y}) \Big\},$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters, both set to 0.25.

In the test, the generative conversational system needs to sample the predicted target tokens so as to maximize the probability of the output sequence, and the Beam Search algorithm [23] is often adopted. Beam Search is a graph-searching algorithm that can quickly find a near-optimal output sequence. However, it is prone to generating erroneous responses during sampling; for example, the traditional Beam Search algorithm is easily dominated by previously sampled tokens and large local probabilities, so the correct response sequence may not be produced. Therefore, in this paper, the Diverse Beam Search algorithm [24] is used to predict target tokens, as it improves the diversity of output sequences by sampling on the basis of grouping within the Beam Search algorithm.
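To make the punishment concrete, the sketch below scores a candidate's similarity to the hand-built foolish responses with smoothed sentence-level BLEU (the similarity measure is an assumption; Section 5 applies a BLEU threshold of 0.5 for the same purpose) and subtracts a $\lambda_2$-weighted penalty when reranking candidates.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

FOOLISH = [
    "i don't know".split(),
    "i have no idea".split(),
    # ... the remaining hand-built foolish responses go here
]

def foolish_penalty(candidate_tokens):
    """Mean similarity of a candidate to the predefined foolish responses."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([y], candidate_tokens, smoothing_function=smooth)
              for y in FOOLISH]
    return sum(scores) / len(scores)

def rerank(candidates, model_scores, lam2=0.25):
    """Subtract a lam2-weighted foolishness penalty from each candidate's
    model score and return the best remaining candidate."""
    rescored = [(score - lam2 * foolish_penalty(cand.split()), cand)
                for score, cand in zip(model_scores, candidates)]
    return max(rescored)[1]
```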

Datasets and Training
The OpenSubtitles (OSDb) dataset, an open-domain dataset containing 60M scripted lines spoken by movie characters [25], is used in these experiments. 301,000 question-answer pairs are randomly selected, of which 300,000 are used for training and 1000 are sampled for testing. The dual encoder uses 512 hidden units, and all embedding vectors have a dimensionality of 512; the same dimensionality is adopted for the linear-layer mapping between the embedding size and the hidden layers. A learning rate of 0.001 and a mini-batch size of 256 are used; the filter widths are set to 3 and 5, respectively, and 3 convolution blocks are stacked in both encoders. The model is trained with mini-batches by back-propagation, using the Adam optimizer for gradient descent.
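The training setup can be summarized as follows; the model object is a stand-in for the SAWI-DCNN of Section 2, and only the hyperparameters stated above are taken from the paper.

```python
import torch
import torch.nn as nn

# Training setup with the stated hyperparameters. The model here is a
# stand-in module; substitute the SAWI-DCNN described in Section 2.
VOCAB, EMBED_DIM, BATCH_SIZE = 30_000, 512, 256
KERNEL_WIDTHS, NUM_BLOCKS = (3, 5), 3          # filter widths, stacked blocks

model = nn.Sequential(nn.Embedding(VOCAB, EMBED_DIM),
                      nn.Linear(EMBED_DIM, VOCAB))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, lr 0.001

def train_step(batch_inputs, batch_targets):
    """One mini-batch of back-propagation."""
    optimizer.zero_grad()
    logits = model(batch_inputs)                            # (B, T, VOCAB)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), batch_targets.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```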

Automatic Evaluations
Automatic evaluation of response quality is an open and difficult problem in the conversational field [19,26]. Although automatic evaluation methods exist for machine translation, such as Bilingual Evaluation Understudy (BLEU) and METEOR, these metrics do not correlate strongly with human evaluations of conversational systems and have been rejected by many scholars for conversational evaluation [19]. Influenced by the automatic evaluation of multi-turn capability and response diversity proposed by Li [16,27], in which the degree of response diversity is calculated from the number of distinct unigrams in the generated responses, and inspired by conversational targets, we propose two automatic evaluation criteria, response diversity and response coherence, in order to indirectly reflect the relationship between system responses and real responses.
Response Coherence: the proposed measure of response coherence is the cosine similarity between the question and the system response, computed over embeddings using the greedy matching method [18]. The similarities between questions and responses are calculated for randomly sampled test examples, and the mean over the samples is taken. The greater the similarity, the greater the coherence between question and response.
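A sketch of the embedding-based greedy matching score, computed symmetrically in both directions (the symmetric averaging is a common convention and an assumption here):

```python
import numpy as np

def greedy_match(q_vecs, r_vecs):
    """One direction of greedy matching: for each token vector in q_vecs,
    take the best cosine similarity to any token vector in r_vecs."""
    q = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
    r = r_vecs / np.linalg.norm(r_vecs, axis=1, keepdims=True)
    return (q @ r.T).max(axis=1).mean()

def response_coherence(q_vecs, r_vecs):
    """Symmetric greedy-matching score between a question and a response,
    each given as an (n_tokens, dim) array of word embeddings."""
    return 0.5 * (greedy_match(q_vecs, r_vecs) + greedy_match(r_vecs, q_vecs))
```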
Response Diversity: although BLEU [28] has been shown to be unsuitable for evaluating the coherence between system responses and human judgments [19], the idea behind BLEU is to calculate the similarity between two sequences. Response diversity is therefore calculated by a modified use of BLEU, evaluated over pairs of candidate responses instead of between system responses and real responses [15,16]. The candidates used in the test are generated by the Diverse Beam Search algorithm. BLEU is computed pairwise over the candidates and averaged; the multiple candidates generated at each step are defined as one sample, and the mean BLEU of a sample is taken as its response diversity. Samples are drawn randomly, and response diversity is calculated during the test. Response diversity is greater when the mutual similarity is weaker.
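The diversity measure can be sketched as the complement of mean pairwise BLEU over a sample's candidates; reporting it as 1 − BLEU is an assumed convention, since the text only states that weaker similarity means greater diversity.

```python
from itertools import combinations
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def response_diversity(candidates):
    """1 - mean pairwise BLEU over one sample's candidate responses, so a
    weaker mutual similarity yields a higher diversity score."""
    smooth = SmoothingFunction().method1
    pairs = list(combinations([c.split() for c in candidates], 2))
    mean_bleu = sum(sentence_bleu([a], b, smoothing_function=smooth)
                    for a, b in pairs) / len(pairs)
    return 1.0 - mean_bleu
```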
Length of the Conversation: Li et al. [16] proposed a method for evaluating the number of turns in a conversation: a conversation ends when a foolish response like "I don't know" is generated, or when two consecutive responses are highly overlapping. In the test, this method is adopted to determine the length of a conversation, in which eight interactions are defined as one turn.
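A sketch of this stopping rule; the overlap threshold and the Jaccard word-overlap measure are assumptions.

```python
FOOLISH = {"i don't know", "what?"}   # subset of the predefined responses

def conversation_ended(prev_response, response, overlap_threshold=0.8):
    """Stopping rule of Li et al. [16]: a conversation ends on a foolish
    response or when two consecutive responses overlap heavily."""
    if response.lower().strip() in FOOLISH:
        return True
    a = set(prev_response.lower().split())
    b = set(response.lower().split())
    overlap = len(a & b) / max(len(a | b), 1)   # Jaccard word overlap
    return overlap >= overlap_threshold
```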

Human Evaluation
Although the response quality of the system can be indirectly reflected by the coherence, diversity and length of the conversation, the relationship between system responses and real responses cannot be determined by a simple linear superposition of these measures. Therefore, the currently popular method of human evaluation is used for a comprehensive evaluation.
To improve the quality of human evaluation, 500 data points are randomly collected from the test questions and responses, and the system responses and baseline model responses are labeled by five volunteers. Meanwhile, the five-grade interpretation criteria proposed by Zhang et al. [29] are adopted as the labeling criteria:

1. The response is not fluent or is logically incorrect;
2. The response is fluent, but irrelevant to the question, including irrelevant regular responses;
3. The response is fluent and only weakly related to the question, but it can answer the question;
4. The response is fluent and strongly related to the question;
5. The response is fluent, strongly related to the question, and close to human language.
In the test, 1000 samples were randomly collected in order to calculate the response coherence, response diversity, and length of the conversation. The results can be seen in Table 1, and sample responses are shown in Table 2. As the data show, the system responses are diverse and coherent. In addition, the model tends to generate short responses; meanwhile, foolish responses may be produced as the length of the question increases.

Table 2. Sample inputs (e.g., "What are you doing?") and the corresponding system responses.

The generated responses were compared with the predefined foolish responses; when the BLEU value was greater than 0.5, the response was considered to be a foolish one. As can be seen from the data in Table 3, compared with SAWI-DCNN without FPM, the joint SAWI-DCNN has a strong inhibitory effect on foolish responses.

Table 3. Foolish responses evaluation (%).

Models                      Foolishness
SAWI-DCNN                   8%
SAWI-DCNN (without FPM)     26%

The Diverse Beam Search algorithm was used to sample the predicted target tokens and select the candidates with the greatest likelihood. Some of the sampling results are shown in Table 4. It can be seen that SAWI-DCNN tends to generate high-quality responses, whereas foolish responses like "I don't know what you are talking about" and "what?" are easily produced by LSTM+Attention and CNN+Attention.

The responses of SAWI-DCNN and the baseline models were sampled randomly and evaluated by humans. The results are shown in Table 5, where the labels (1)-(5) correspond to the grades of the five-grade interpretation criteria. For example, grade 1 corresponds to "The response is not fluent or is logically incorrect", and grade 2 corresponds to "The response is fluent, but irrelevant to the question, including irrelevant regular responses". Values are the percentage of responses in the collected sample falling into each grade; the larger the ratio, the more prone the model is to producing responses with the corresponding feature. The quality of a model can be judged from its response distribution over the five grades, i.e., the higher the quality of the model's responses, the more the distribution shifts towards the higher grades. The parameter AVE is the average grade of the responses, calculated from the response distribution of the samples and the corresponding weights. As can be seen from the data in Table 5, high-grade responses are more easily generated by SAWI-DCNN than by the baseline models, and its higher average grade score indicates a trend towards higher-quality responses.

Conclusions
In this paper, a generative conversational system was investigated based on a structural model and a linguistic model. The structural model was established based on the new CNN and strengthened attention with intention; similarly, the linguistic model was established based on MMI and FPM. Both were combined to form a conversational system. Different models were tested and evaluated under automatic evaluation and human evaluation. The results of the automatic evaluation were observed and compared in terms of response diversity, response coherence, and length of the conversation. Meanwhile, the results of the proposed method were also observed and compared under human evaluation in terms of comprehensive response quality. Finally, from these comparative results, it can be concluded that the proposed joint model significantly improves the response quality of the conversational system. This work paves the way for generative conversational systems, in which the optimal combination of a structural model and a linguistic model is the key to improving the response quality of the system.