Knowledge Interpolated Conditional Variational Auto-Encoder for Knowledge Grounded Dialogues

Abstract: In Knowledge Grounded Dialogue (KGD) generation, explicitly modeling the instance-level variety of knowledge specificity and seamlessly fusing knowledge with the dialogue context remain challenging. This paper presents an innovative approach, the Knowledge Interpolated conditional Variational auto-encoder (KIV), to address these issues. In particular, KIV introduces a novel interpolation mechanism to fuse two latent variables that independently encode the dialogue context and the grounded knowledge. This distinct fusion of context and knowledge in the semantic space enables the interpolated latent variable to guide the decoder toward generating more contextually rich and engaging responses. We further explore deterministic and probabilistic methodologies to ascertain the interpolation weight, which captures the level of knowledge specificity. Comprehensive empirical analysis conducted on the Wizard-of-Wikipedia and Holl-E datasets verifies that the responses generated by our model outperform those of strong baselines, with notable improvements observed in both automatic metrics and manual evaluation.


Introduction
End-to-end neural conversation models have shown significant promise and attracted considerable attention from both academia and industry [1][2][3][4][5][6]. Nonetheless, these conventional conversation models often struggle to generate informative and engaging responses due to their limited capacity to retain and leverage background knowledge [7,8]. To overcome this knowledge-absence issue, Knowledge Grounded Dialogue (KGD) generation has recently been proposed, in which responses are generated by simultaneously referring to both the background knowledge and the dialogue context [7,[9][10][11][12][13][14][15][16][17]. The objective is to facilitate engaging and in-depth conversations while avoiding the inclusion of non-factual information.
Existing methodologies in KGD can be broadly classified into extractive and generative models. Extractive models typically view KGD as a reading comprehension task [18] or a document question-answering challenge [19], employing established models such as BiDAF [20]. However, since these models primarily focus on extracting knowledge snippets, they fail to provide the engaging and natural responses characteristic of human conversation. To mitigate these limitations, attention has turned toward integrating external factoid knowledge into generative dialogue models [21][22][23][24]. Most of these works focus on selecting appropriate knowledge and incorporating the chosen knowledge during response generation via copy mechanisms [10,25] or memory networks [26]. However, it is crucial to note that responses exhibit varying degrees of knowledge specificity: some responses rely heavily on external knowledge, while others depend predominantly on the query, owing to differing dialogue contexts. Despite this, existing generative models do not explicitly model the instance-level variety of knowledge specificity in KGD.
The Conditional Variational Auto-Encoder (CVAE) has emerged as an effective model for integrating information from multiple sources within a latent space, showing promise in the domain of response generation [27]. Previous research has utilized CVAE by incorporating a latent variable that is conditioned on the concatenation of dialogue context and additional knowledge information such as dialog acts [28], persona sentences [29] or even images [30]. However, existing CVAE models encounter the issue of representation entanglement [31]. This issue arises due to the single latent variable in CVAE, making it challenging for the model to learn informative and interpretable representations for context and knowledge simultaneously. This issue can lead to inconsistent and unnatural knowledge-grounded responses by the CVAE models.
In this work, we build upon the strengths of CVAE to address the challenges of KGD. To enable the explicit capture of knowledge specificity by CVAE and to overcome its representation entanglement problem, we present a novel Knowledge Interpolated conditional Variational auto-encoder (KIV). This innovative approach introduces two separate latent variables to model dialogue context and external knowledge independently. Unlike prior models that concatenate multiple latent information sources, our approach uses linear interpolation to seamlessly integrate the latent variables associated with context and knowledge, where the interpolation weight corresponds to the level of knowledge specificity. In addition to proposing a deterministic method for obtaining the interpolation weight, we explore a probabilistic interpolation method. This approach views the interpolation weight as a latent variable and models it using a Logistic-Normal distribution. Our proposed probabilistic interpolation method empowers the model to robustly emulate the process of adaptively leveraging background knowledge in response generation.
We evaluate the effectiveness of our proposed model on the Wizard-of-Wikipedia [7] and Holl-E [10] datasets. Both evaluations confirm that our model significantly outperforms the vanilla CVAE and other existing KGD models. Additionally, qualitative analyses demonstrate that the interpolated latent variable successfully controls the knowledge specificity of the generated responses, further offering human-interpretable meaning representations.
In summary, our contributions to the field are as follows: • We introduce a novel Knowledge Interpolated conditional Variational auto-encoder (KIV) for knowledge grounded dialogue generation. This approach utilizes two distinct latent variables for context and knowledge and fuses them by linear interpolation; • We explore deterministic and probabilistic methodologies for obtaining the interpolation weight that signifies the level of knowledge utilization; • A series of extensive experiments are conducted to validate the effectiveness of our proposed model. These experiments further illustrate the interpretability of our interpolation methodologies.

Knowledge Grounded Dialogue
Prior research on Knowledge Grounded Dialogue shows that extractive models often generate more suitable responses than their generative counterparts [8,10]. Nevertheless, generative models can produce more captivating responses that resemble natural human dialogue [21,32]. The primary focus of most KGD generative models is to learn external knowledge representations, often through neural memory networks [7,8] or intricate attention mechanisms [23,[33][34][35]. Ghazvininejad et al. [9] took a different approach, encoding the dialogue history and documents separately to imbue responses with facts from the external world. Other researchers, including Yang et al. [16], Chen et al. [36], Wang et al. [37], Zhou et al. [38], and Li et al. [39], have integrated knowledge graph representations into the response generation process. A few works concentrate on seamlessly integrating external knowledge with the dialogue context. For instance, Li et al. [40] introduce a two-pass decoding strategy for document grounded conversations. Wu et al. [41] define knowledge identification as finding the knowledge in an extensive document that is relevant to the user's current query within the conversation context. The work in [15] introduces a KGD model for document-grounded dialogue generation that combines multi-source heterogeneous knowledge: a commonsense knowledge network enriched with named entities together with a domain-specific factual knowledge base, both of which improve utterance understanding and yield more informed, contextually appropriate responses. Ye et al. [32] employ a CVAE model to jointly represent context and knowledge within a unified latent variable.
Contrasting with previous research, our work enhances knowledge fusion methodologies in two distinct ways: (1) By employing interpolating latent variables, we facilitate a fusion of knowledge and context that is both interpretable and controllable. (2) We innovatively manage the fusion ratio of instance-variety knowledge by using deterministic and probabilistic interpolation weight schemes, allowing for dynamic control over the process.

Latent Space Interpolation
Latent-space interpolation is a widely adopted technique for evaluating generative latent variable models, typically employed to verify the effective generalization of a generative model [42]. This latent space representation encapsulates all critical information necessary to depict the original data's features. The model learns the data features and simplifies its representation to facilitate easier analysis. This process is integral to Representation Learning [43], a collection of methods designed to enable a system to discern the requisite representations for feature detection from raw data.
Interpolation has traditionally been used to bolster the robustness and effectiveness of representation in supervised learning [44], and to improve semi-supervised learning models [45]. Within text generation, interpolation is commonly employed to demonstrate that generative models can effectively learn smooth latent representations [46,47]. Unlike previous work, Gao et al. [48] incorporated interpolation into their model to promote seamless transitions between two sub-modules. However, the interpolation weight in their model was randomly chosen. In contrast, our model utilizes the interpolation weight as a control variable for knowledge specificity, which needs to be estimated in our model.

CVAE for Knowledge-Grounded Dialogue
The task of knowledge-grounded dialogue generation can be framed as follows: given a dialogue context c = (c_1, c_2, ..., c_|c|) comprising |c| turns of conversation, and a segment of knowledge text k = (k_1, k_2, ..., k_|k|) containing |k| words, the goal is to generate a response y = (y_1, y_2, ..., y_|y|) that aligns with the provided context and is informed by the knowledge text. This is achieved by maximizing the probability p(y|c, k). A practical approach to this problem is the Conditional Variational Autoencoder (CVAE) framework [27], which approximates the distribution of the response y conditioned on the context c and the knowledge k by introducing a latent variable z.
The training objective of CVAE can be formulated as maximizing the Evidence Lower Bound (ELBO):

L_CVAE = E_{q_φ(z|y,c,k)}[log p_θ(y|z,c,k)] − KL(q_φ(z|y,c,k) || p_θ(z|c,k))    (1)

In the above formula, KL denotes the Kullback-Leibler divergence. p_θ(y|z,c,k) acts as the decoder, reconstructing the response y from the latent variable z together with the context c and knowledge k; q_φ(z|y,c,k) serves as the inference model, approximating the true posterior; and p_θ(z|c,k) is the prior model, from which the latent variable is sampled. Here θ and φ are the parameters of the generation (decoder and prior) model and the inference model, respectively.
The CVAE model above employs a single latent variable to encode information from both the context and the knowledge source. A typical implementation concatenates c and k as input to the encoder and samples the latent variable z ∼ q_φ(z|[c, k]) [28,29], as illustrated in Figure 1a. However, it has been noted that, in the absence of explicit supervision, the solitary latent variable of the standard CVAE fails to learn disentangled representations that accurately reflect the distinct latent structures of the different sources [31]. This limitation significantly impedes the model's ability to improve performance and interpretability by exploiting variational latent variables.
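For Gaussian priors and posteriors, the KL term in the ELBO above has a closed form. The following is a minimal pure-Python sketch (the function name and diagonal-Gaussian parameterization are ours, for illustration only):

```python
import math

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal
    Gaussians, summed over dimensions (the KL term in the ELBO)."""
    kl = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        kl += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return kl

# When the posterior equals the prior, the KL term vanishes.
print(gaussian_kl([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0]))  # → 0.0
```

In practice this penalty keeps the inference model close to the prior while the reconstruction term pulls it toward encoding the response.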

Knowledge Interpolated Conditional Variational Auto-Encoder (KIV)
To address the representation entanglement issue of the standard CVAE, we introduce our KIV model, which explicitly disentangles the latent variable z into two new variables z_k and z_c. Using two independent variational encoders, KIV first encodes the given context and knowledge separately. Having acquired the disentangled latent variables z_k and z_c, we blend them by linear interpolation, which allows smooth transitions between latent variables in an interpretable and controllable way. Specifically, z_k and z_c are linearly interpolated using a weight factor λ ∈ (0, 1), thereby constructing a 1-simplex:

z = λ z_k + (1 − λ) z_c    (2)

We denote the interpolation weight corresponding to knowledge as λ, so that the interpolation weight of the context is 1 − λ. The weight λ signifies the knowledge specificity and relevance of the response, thus facilitating better interpretation and connection of empirical findings. By substituting the interpolated latent variable into Equation (1) and applying the convexity of the KL-divergence, we can derive a new ELBO as the training objective:

L_KIV = E_{q_φ}[log p_θ(y|z,c,k)] − λ KL(q_φ(z_k|k,y) || p_θ(z_k|k)) − (1 − λ) KL(q_φ(z_c|c,y) || p_θ(z_c|c))    (3)

A detailed derivation showing that L_KIV still serves as a valid lower bound of log p(y|c,k) is provided in Appendix A.
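The linear interpolation described above is a convex combination of the two latent vectors. A minimal pure-Python sketch (the function name is ours):

```python
def interpolate(z_k, z_c, lam):
    """Convex combination z = lam * z_k + (1 - lam) * z_c, lam in (0, 1):
    lam weights the knowledge latent, 1 - lam the context latent."""
    assert 0.0 < lam < 1.0, "interpolation weight must lie strictly in (0, 1)"
    return [lam * a + (1.0 - lam) * b for a, b in zip(z_k, z_c)]

# With lam = 0.5 the interpolated latent sits halfway between the two.
print(interpolate([1.0, 0.0], [0.0, 1.0], 0.5))  # → [0.5, 0.5]
```

As lam moves toward 1, the decoder is steered by the knowledge latent; toward 0, by the context latent.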
As illustrated in Figure 2, the proposed KIV model is composed of context/knowledge encoders and a response decoder, the specifics of which are expanded upon in the following section. The interpolation methods will be introduced in the subsequent section.

Context and Knowledge Encoders
The context and knowledge latent variables z_c and z_k are derived from two variational neural modules that share the same architecture but use different parameters. Taking the knowledge latent variable z_k as an example, its variational neural module f^k consists of a posterior network f^k_pst and a prior network f^k_pri. The prior distribution is a factorized normal distribution p_θ(z_k|k) ∼ N(µ^k_pri, σ^k_pri), parameterized by the prior network f^k_pri:

[µ^k_pri, log(σ^k_pri)²] = f^k_pri(k)    (4)

In the training phase of our model, k and y jointly define the posterior distribution:

[µ^k_pst, log(σ^k_pst)²] = f^k_pst([k; y])    (5)

where f^k_pst and f^k_pri are multi-layer perceptrons with tanh activation functions. The knowledge representation k and response representation y are obtained by extracting the final hidden states of two bidirectional GRU encoders.
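Sampling from these parameterized Gaussians is typically done with the reparameterization trick, so gradients flow through the mean and log-variance. A pure-Python sketch (names ours; the `eps` argument is exposed only to make the example deterministic):

```python
import math
import random

def reparameterize(mu, log_var, eps=None):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), where
    sigma = exp(0.5 * log_var); the standard reparameterization trick."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]

# With eps fixed to zero the sample collapses to the mean.
print(reparameterize([0.3, -0.2], [0.0, 0.0], eps=[0.0, 0.0]))  # → [0.3, -0.2]
```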
Likewise, we can obtain the parameterized prior distribution p θ (z c |c) and posterior distribution q φ (z c |c, y) of the context latent variable z c via its variational neural module f c . The context representation c is obtained by extracting the final hidden state from a hierarchical GRU encoder [3]. This method uses a word-level GRU network for each utterance and then feeds the outputs of the word-level GRU's last hidden state into an utterance-level GRU network.

Response Decoder
Upon obtaining z_c and z_k, an interpolation network produces the interpolated latent variable z as described in Equation (2). The decoder, a GRU-based recurrent neural network f_dec, maintains a hidden state h^dec_t at each step. The initial hidden state is set from the latent variable z through an MLP: h^dec_0 = MLP(z). To fully exploit the context c, the knowledge k, and the latent information in z, we introduce a mixture-of-decoders mechanism, inspired by the Mixture-of-Softmaxes (MoS) trick proposed by [49], to output the probability of a response. The mechanism incorporates three decoding modules corresponding to different sources of the model output:

p(y_t = w_x | h^dec_t) = Σ_m π^m_t · exp(w_x · o^m_t) / Σ_{x'} exp(w_{x'} · o^m_t),  m ∈ {k, c, ck}    (6)

where w_x is the embedding vector of word w_x, and o^m_t, m ∈ {k, c, ck}, are three output vectors corresponding to knowledge, context, and a mixture of knowledge and context information. The dimension of o^m_t matches that of w_x. The term π^m_t is the mixture weight of the m-th component, subject to the constraint Σ_m π^m_t = 1. We define the output vectors o^m_t as follows:

o^k_t = W_k [h^dec_t; k],  o^c_t = W_c [h^dec_t; c],  o^ck_t = W_ck [h^dec_t; c; k]    (7)

where W_k, W_c, W_ck are weight matrices that transform their inputs into vectors with the same dimension as the word embedding vector. The mixture weight π^m_t is computed by

π^m_t = exp(w_{π,m} · h^dec_t) / Σ_{m'} exp(w_{π,m'} · h^dec_t)

where w_{π,m} is a trainable weight vector.
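The mixture-of-softmaxes idea above can be sketched with toy logits; in the actual model each module's vocabulary logits come from the dot products w_x · o_t^m, and the mixture logits from the trainable weights, but the combination step is the same (function names and toy values are ours):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mixture_of_softmaxes(logits_per_module, mixture_logits):
    """Each module m yields a softmax over the vocabulary; the final
    distribution is the pi-weighted sum, pi = softmax(mixture_logits)."""
    pis = softmax(mixture_logits)
    dists = [softmax(l) for l in logits_per_module]
    vocab = len(logits_per_module[0])
    return [sum(pi * d[x] for pi, d in zip(pis, dists)) for x in range(vocab)]

# Three modules (knowledge, context, mixed) over a 3-word toy vocabulary.
p = mixture_of_softmaxes(
    [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]],
    [0.0, 0.0, 0.0])
print(round(sum(p), 6))  # → 1.0
```

Because each component is itself a distribution and the mixture weights sum to 1, the output is always a valid probability distribution.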

Interpolation of Latent Variables
We propose two methods to compute the interpolation weight λ. In the first, we treat λ as a deterministic variable and pre-compute it based on the relatedness between the response and the knowledge during the training phase; at test time, we substitute λ with a value predicted by a neural network, as depicted in Figure 1b. However, the same or similar context and knowledge can yield suitable responses at different levels of knowledge specificity, i.e., for multiple λ values. To account for this uncertainty and variability, our second approach models the interpolation weight as a random variable following a distribution, jointly trained and inferred with z_k and z_c, as illustrated in Figure 1c.

Deterministic Interpolation Weight
One direct method to acquire the interpolation weight λ is to pre-compute it as a relatedness measure between the response and the knowledge. Specifically, λ indicates the relative usage ratio of knowledge versus context. We propose calculating λ as the relative tf-idf similarity between the response and the context/knowledge:

λ = sim(y, k) / (sim(y, k) + sim(y, c))    (8)

where sim(·, ·) denotes the tf-idf similarity between two text segments. During testing, the ground-truth response is unavailable; hence we employ an MLP with a sigmoid output to predict λ. To train this MLP, we construct training data whose inputs are the concatenation of the context and knowledge representations:

λ̂ = Sigmoid(MLP([c; k]))

During training, to bridge the discrepancy between the predicted weight and the ground truth, we minimize the Mean Squared Error (MSE) loss L^MSE_λ between λ̂ and λ computed using Equation (8). We refer to the KIV model that uses the deterministic interpolation weight as KIV_d; its objective is:

L^d_KIV = L_KIV + L^MSE_λ    (9)
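The relative-similarity weight can be illustrated with a simplified stand-in for tf-idf: plain bag-of-words cosine similarity (function names and the cosine simplification are ours, not the paper's exact implementation):

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity over bag-of-words counts, a simplified
    stand-in for the tf-idf similarity sim(., .)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def interpolation_weight(response, knowledge, context):
    """lambda = sim(y, k) / (sim(y, k) + sim(y, c)): the share of the
    response's similarity attributable to the knowledge."""
    sk = cosine_sim(response, knowledge)
    sc = cosine_sim(response, context)
    return sk / (sk + sc) if (sk + sc) > 0 else 0.5

# A response copied mostly from the knowledge gets a weight near 1.
lam = interpolation_weight(
    "the movie won three oscars",
    "the movie won three oscars in 1998",
    "what do you think about it")
print(lam > 0.9)  # → True
```

The fallback of 0.5 when both similarities are zero is our own choice for the degenerate case.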

Probabilistic Interpolation Weight
To adaptively utilize background knowledge for response generation, emulating human-like knowledge-grounded conversational behavior, we assume that the interpolation weight λ follows a Logistic-Normal distribution. This distribution is known for its flexibility in approximating the Dirichlet distribution, and it can effectively capture correlations between the components of probability vectors [50]. Each response is generated by sampling an interpolation weight from this Logistic-Normal distribution:

λ ∼ LogisticNormal(µ_λ, σ_λ)    (10)

where µ_λ and σ_λ are the parameters of the Logistic-Normal distribution. To parameterize p(λ|µ_λ, σ_λ), we obtain the posterior and prior latent variables by reparameterizing µ_λ, σ_λ as follows:

[µ^pst_λ, log(σ^pst_λ)²] = f^λ_pst([c; k; y]),  [µ^pri_λ, log(σ^pri_λ)²] = f^λ_pri([c; k])

The posterior and prior networks f^λ_pst and f^λ_pri are MLPs with tanh activation functions. To draw a sample from the reparameterized Logistic-Normal distribution, we first draw a sample from the normal distribution N(µ_λ, σ_λ) and then apply the logistic function to map it into the Logistic-Normal distribution space. The KL-divergence between the posterior and prior weight distributions can be computed with the closed-form formula for Gaussian distributions [50]. Injecting the probabilistic interpolation weight λ, the posterior distribution during training factorizes as:

q_φ(z, λ | c, k, y) = q_φ(λ | c, k, y) · q_φ(z_k | k, y) · q_φ(z_c | c, y)    (11)

The prior distribution p_θ(z, λ | c, k) decomposes in the same way. Substituting the factorized distributions into Equation (3), the final objective can be rewritten as:

L^p_KIV = L_KIV − KL(q_φ(λ | c, k, y) || p_θ(λ | c, k))    (12)

The KIV model with the above objective is denoted KIV_p, as shown in Figure 1c.
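Drawing a weight from a Logistic-Normal distribution amounts to sampling a Gaussian and squashing it with the logistic (sigmoid) function. A pure-Python sketch (function name ours):

```python
import math
import random

def sample_logistic_normal(mu, sigma, rng=random):
    """Sample lambda ~ LogisticNormal(mu, sigma): draw eps ~ N(mu, sigma),
    then map it into (0, 1) with the logistic function."""
    eps = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-eps))

# With sigma = 0 the draw is deterministic: lambda = sigmoid(mu).
print(sample_logistic_normal(0.0, 0.0))  # → 0.5
```

Because the sigmoid maps the whole real line into (0, 1), every sampled weight is a valid interpolation coefficient.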

Model Training
In the scenario of deterministic interpolation weights, we use the loss function L^d_KIV defined in Equation (9) for training. When employing probabilistic interpolation weights, the training objective is defined as in Equation (12). To mitigate the issue of posterior collapse in response generation, we implement techniques such as the KL annealing trick and the bag-of-words loss, as proposed in previous work [28,46].

Experiments and Analysis
This section is organized as follows: First, in Section 6.1, we discuss experimental settings, covering the dataset, baseline models, and metrics. Subsequently, in Section 6.2, we delve into the implementation specifics of the proposed model. Finally, we present and analyze our experimental results in Sections 6.3-6.5.

Experimental Settings
The programming environment uses Python 3.7.16 and CUDA Toolkit 11.7, with a Tesla V100 32 GB GPU for accelerated computation. The required packages are installed via pip: torch 1.13.1 for deep learning, numpy for numerical computation, spacy with the en-core-web-trf model for natural language processing, and pandas 1.3.5 for data manipulation and analysis.

Dataset
We evaluated our model on two commonly used public benchmark datasets for the knowledge grounded dialog system, Wizard-of-Wikipedia (WoW) [7] and Holl-E [10].
WoW is an open-domain knowledge-grounded dialogue dataset that uses Wikipedia passages as the source of background knowledge and provides fine-grained annotations of the selected knowledge. Its test set is divided into two subsets: Test Seen and Test Unseen. Test Seen contains 3619 conversation turns on topics overlapping with those in the training set, whereas Test Unseen includes 3689 turns on topics never encountered in the training or validation sets. In total, 68,931/3686/7308 conversations are used for training/validation/testing. On each test set, we assess the proposed model and the baselines under two scenarios: (1) the ground-truth knowledge is given; (2) the knowledge is predicted by a separately trained knowledge selection model. To ensure a fair comparison in the predicted-knowledge setting, our proposed model and all baselines utilize the knowledge selected by a pre-trained transformer memory network, the same knowledge selection module employed in the two-stage generative model [7]. Holl-E is a dataset specialized to the movie domain, built from a diverse range of sources including plots, comments, and movie reviews from various websites. It provides two versions of the test set: a Single-reference test and a Multi-reference test. The Single-reference test contains one annotated response per conversation, while the Multi-reference test includes multiple human-annotated ground-truth knowledge snippets and corresponding responses for each instance. In total, 7228/930/913 dialogues are used for training/validation/testing.

Baselines
We compare our proposed models with the following baselines and one variant: • HRED [3]: A general knowledge-free model that encodes the context at two hierarchical levels. • CVAE [32]: This model can be considered a modified kg-CVAE model [28], as depicted in Figure 1a. • GTTP [51]: Based on HRED, this model incorporates grounded knowledge through a copying mechanism, enabling it to copy phrases from the knowledge at the appropriate decoding step. • TMem [7]: The Transformer Memory Network first concatenates the representations of context and knowledge and then employs a transformer-based framework to generate knowledge grounded responses. • SKT [25]: A sequential latent variable model that captures the knowledge selection process in multi-turn dialogue generation. • KIV_c: A variant of our proposed model that directly concatenates the context latent variable z_c and the knowledge latent variable z_k to condition response generation, without any interpolation; i.e., Equation (2) is replaced with z = [z_k; z_c], with the encoders and decoder remaining the same.

Metrics
We employ four types of automatic metrics to evaluate our proposed model and the baseline models: Perplexity (PPL); three embedding metrics, namely Embedding Average (AVE), Embedding Extreme (EXT), and Greedy Matching (GRY); Distinct-1 (Dist-1) and Distinct-2 (Dist-2); and our proposed knowledge-engagement metric ∆BLEU_k. PPL: The exponentiation of the word-level entropy, describing how well the generative model predicts the expected responses. It does not directly capture coherence: a low perplexity indicates accurate prediction but cannot guarantee coherent text, which depends on factors such as transitions, readability, consistent topics, and logical structure. Nor does perplexity capture diversity, i.e., variation and novelty in the generated text. AVE, EXT, and GRY: Rather than using n-gram overlap-based metrics such as Bilingual Evaluation Understudy (BLEU) or Recall-Oriented Understudy for Gisting Evaluation (ROUGE), we report three word embedding-based similarity metrics [52] to capture the semantic alignment between generated responses and the ground truth: Embedding Average (AVE), Embedding Extreme (EXT), and Greedy Matching (GRY).
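The perplexity described above is the exponential of the average negative log-probability the model assigns to the reference tokens. A minimal pure-Python sketch (function name and toy probabilities are ours):

```python
import math

def perplexity(token_probs):
    """PPL = exp of the average negative log-probability assigned to the
    reference tokens; lower values mean better next-token prediction."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that spreads mass uniformly over a 100-word vocabulary
# has perplexity 100 on any reference text.
print(round(perplexity([0.01] * 5)))  # → 100
```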
AVE calculates the average similarity between consecutive sentence embeddings in the generated text, indirectly reflecting coherence and diversity. By incorporating contextual information from preceding words, a higher score indicates that the generated text aligns with the overall topic or theme.
EXT scores the most unique or extreme word embeddings, rewarding text that deviates from typical language patterns. It emphasizes diversity but may penalize consistency: a low EXT score suggests repetitive or redundant sentences, while a very high score can indicate incoherence.
GRY identifies the most pertinent sentences within a larger body of text, rewarding consistency by selecting segments highly relevant to a given query. While emphasizing coherence, greedy matching tends to favor closely associated sentences, potentially leading to a dearth of diversity in the generated text.

Dist-1 and Dist-2 [53]:
We use Dist-1 and Dist-2 to evaluate the diversity of responses; they are the ratios of unique unigrams and bigrams, respectively, to the total number of generated n-grams. A higher distinct value suggests a greater range of content. However, in certain instances, higher distinctness can reduce coherence, as it may introduce unrelated ideas that disrupt the overall flow and cohesiveness of the text. ∆BLEU_k: We use ∆BLEU_k to measure knowledge engagement, i.e., whether the knowledge is articulated diversely and engagingly, based on the word overlap between responses and the provided knowledge. We first compute BLEU_k(ŷ) as the average BLEU score, treating the generated response ŷ as the hypothesis and the given knowledge as the reference. An exceptionally high BLEU_k(ŷ) signifies an excessive copy of external knowledge, making the response less engaging. We then treat BLEU_k(y) of the human-written gold response y as the ground truth and compute the average absolute difference between BLEU_k(ŷ) and BLEU_k(y), namely ∆BLEU_k:

∆BLEU_k = (1/N) Σ_{i=1}^{N} |BLEU_k(ŷ_i) − BLEU_k(y_i)|

where N is the number of samples in the test set, and y_i and ŷ_i are the gold and generated responses for the same context and knowledge. A lower ∆BLEU_k indicates that the knowledge-copy ratio of the generated responses is closer to that of the ground truths, i.e., better knowledge engagement.
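The ∆BLEU_k computation can be sketched with a simplified stand-in for BLEU, here plain unigram precision against the knowledge text (function names and the unigram simplification are ours, not the paper's exact metric):

```python
def unigram_precision(hypothesis, reference):
    """Simplified stand-in for BLEU_k: the fraction of hypothesis
    unigrams that also appear in the reference (knowledge) text."""
    hyp, ref = hypothesis.lower().split(), set(reference.lower().split())
    return sum(w in ref for w in hyp) / len(hyp) if hyp else 0.0

def delta_bleu_k(generated, gold, knowledge):
    """Average absolute gap between the knowledge-copy ratios of the
    generated and gold responses, mirroring the Delta-BLEU_k idea."""
    n = len(generated)
    return sum(abs(unigram_precision(g, k) - unigram_precision(y, k))
               for g, y, k in zip(generated, gold, knowledge)) / n

# A generated response that copies the knowledge verbatim scores far
# from a gold response that paraphrases it, yielding a large gap.
score = delta_bleu_k(
    ["the film was released in 1998"],
    ["it came out in 1998"],
    ["the film was released in 1998 by warner"])
print(score)
```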

Implementation Details
The vocabulary size in our model is limited to 20,000, covering 95.75% of the words in the dataset. The embedding matrix, shared between the encoder and the decoder, occupies 22.89 MB; we initialize it with pre-trained 300-dimensional GloVe word embeddings. The encoder occupies 42.69 MB and the decoder 58.24 MB. Single-layer bidirectional GRU [54] networks are used for the knowledge/response encoders and the word-level network of the context encoder, while a single-layer unidirectional GRU is employed for the utterance-level network of the context encoder. Another single-layer GRU is used for the decoder. The dimension of all GRU hidden states is set to 512, and the size of the latent variables to 128. The inference and prior networks are single-layer feed-forward networks with tanh activation functions.
We apply Layer Normalization when training the decoder. All weights are initialized with the Xavier method [55]. The model is trained end-to-end with the Adam optimizer [56], a learning rate of 10^-4, and gradient clipping at 1. During text generation we use greedy decoding, together with the KL-annealing strategy, in which the annealing weight is increased by 10^-5 after each batch update, varying from 0 to 1.
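The KL-annealing schedule above grows linearly and saturates at 1; with a 10^-5 increment per batch it reaches 1 after 100,000 updates. A minimal sketch (function name and the step-based formulation are ours):

```python
def kl_weight(step, total_steps=100_000):
    """Linear KL-annealing schedule: the KL term's weight grows linearly
    with the batch-update step and is capped at 1 (equivalent to adding
    1e-5 per step when total_steps is 100,000)."""
    return min(1.0, step / total_steps)

print(kl_weight(50_000))   # → 0.5
print(kl_weight(200_000))  # → 1.0
```

Early in training the KL penalty is nearly off, letting the decoder learn to use the latent variable before the prior-matching pressure ramps up.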

Metric-Based Evaluation
As shown in Table 1, Test Seen evaluates a model's performance on topics it encountered during training, measuring its recall and application of learned patterns, while Test Unseen assesses its ability to generalize to new, unseen topics. Gold knowledge refers to the annotated ground-truth knowledge, which serves as a benchmark for measuring model performance; predicted knowledge refers to the knowledge chosen by the knowledge selection module during response generation. Our experiment evaluates how well each model incorporates the given knowledge into its responses, using the averaged BLEU_k score as the measure. Numbers in bold represent the best results for the corresponding measure.
Automatic evaluations in the Test Seen setting reveal that the responses generated by KIV_d and KIV_p are considerably more coherent and relevant than those produced by all baseline models, as indicated by PPL and the word embedding-based similarity metrics. Regarding diversity metrics, KIV_p outperforms all others with one exception: its Dist-1,2 is equal to or slightly lower than SKT's under both Gold and Predicted Knowledge. SKT's EXT score is also higher than our model's, consistent with the Dist-1 result that the diversity of SKT's generations is slightly better than ours. However, an emphasis on a higher EXT score may introduce incoherence, since the model may produce more unique word embeddings. Given these results, we believe SKT places great emphasis on diversity in the knowledge selection of conversations. Regarding the ∆BLEU_k indicator, SKT adopts a copy mechanism to maximize the effect of knowledge on response generation, showing a slight improvement in knowledge engagement.
In the scenario of Test Unseen, KIV d and KIV p demonstrate a similar pattern as Test Seen in terms of relevance and diversity metrics. These results substantiate our assertion that interpolating two latent variables, conditioned on knowledge and context, aids in generating more appropriate and informative responses. A comparison between interpolation fusion methods (KIV d,p ) and non-interpolation fusion (KIV c ) reveals that interpolation significantly enhances the quality of responses. This indicates that interpolation is more appropriate for fusing knowledge and context in knowledge grounded conversations.
Our analysis of the knowledge engagement metric ∆BLEU_k shows that KIV_p achieves the lowest score among all models except SKT, whose score is slightly lower. This indicates that interpolating latent variables with probabilistic weights can adaptively learn the joint representation of context and knowledge, leading to more engaging responses with respect to the incorporated knowledge.
Similar results were also observed on Holl-E in Table 2. Dist 1,2 for KIV p is lower than for KIV c,d , suggesting that KIV p copies more knowledge, while KIV c,d generate a greater range of content that may not come from the knowledge base. The AVG score of KIV c,d is higher than that of KIV p , suggesting that the former generates consecutive sentences that are more similar to each other. The higher EXT scores of KIV c,d indicate that their outputs may be less coherent, since they may produce more unique or extreme word embeddings; the SKT model has the highest EXT score in our experiments. The higher GRY score of KIV c,d identifies the most pertinent sentences chosen from the knowledge base and therefore indicates stronger coherence, whereas the lower score of KIV p may reflect more diversity in the generated sentences. The PPL score of KIV p is much better than that of KIV c,d , indicating more accurate text prediction, and the BLEU k scores of KIV p are lower than those of KIV c,d , suggesting the model is better at engaging the knowledge, i.e., producing a closer knowledge copy. Our experiments show similar results under the Multiple Reference setting. We did not calculate SKT's ∆BLEU k score, since the ∆BLEU k score of KIV p is already nearly perfect. In addition, following previous work [10], we compute the scores for the multi-reference dataset by taking the maximum score over the multiple reference responses. Since Dist 1,2 is computed from the repeated n-grams of the generated responses rather than from the multiple reference responses, we exclude Dist 1,2 here. The numbers in bold represent the best results for the corresponding measure.
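The max-over-references scoring used for the multi-reference setting can be sketched as follows. This is a minimal illustration, not the paper's actual evaluation code: it uses clipped unigram precision as a stand-in for the full BLEU implementation.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Modified (clipped) n-gram precision of a candidate against one reference."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = [tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)]
    if not cand:
        return 0.0
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Each candidate n-gram is credited at most as often as it appears in the reference.
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / len(cand)

def max_over_references(candidate, references, n=1):
    """Score a generated response against every reference and keep the maximum,
    as done for the multi-reference evaluation."""
    return max(ngram_precision(candidate, r, n) for r in references)
```

For example, a response matching two of its three unigrams in the best reference receives a score of 2/3, regardless of how poorly it matches the other references.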

Human Evaluation
In addition to the automated evaluation, we also carried out a human evaluation on the WoW dataset to gauge the quality of responses produced by our model and the baseline models. We hired three professional annotators to assess the generated responses based on four criteria, which fall into two categories: (1) Coherence (C)/Fluency (F): whether the response is coherent with the dialogue context and fluent to read. (2) Knowledge Correctness (KC)/Knowledge Diversity (KD): whether the response is consistent with the provided knowledge and presents relevant knowledge diversely and engagingly. For both the Test Seen and Test Unseen settings, each annotator rated 100 randomly sampled responses generated by each model on a five-point scale (1-5).
Human evaluation results are listed in Table 3. We observe that GTTP is a robust baseline on all metrics except Knowledge Diversity (KD), since GTTP tends to copy the whole sentence of the given knowledge. The proposed KIV d,p achieve the best performance on most metrics; KIV p scores slightly below SKT on Knowledge Diversity but outperforms the remaining baselines by a substantial margin. This result verifies that the proposed interpolation-based models, especially KIV p , can present relevant knowledge in an exciting and engaging way while keeping the response coherent with the context. The numbers in bold represent the best results for the corresponding measure.
6.5. Qualitative Analysis
6.5.1. Impact of Interpolation Weight
We first investigate the relationship between the interpolation weight λ employed in our model and the ratio of knowledge utilization in the generated responses. We group the test samples into separate bins according to their λ value and calculate the averaged knowledge BLEU score, BLEU K , for responses within each bin. The relationship between λ and the averaged BLEU K is illustrated in the line plot in Figure 3. The x-axis represents the values of λ, ranging from 0 to 1, while the y-axis represents the averaged BLEU K ; each data point represents the average BLEU K for responses falling within a specific range of λ values. When λ = 0, the output is determined solely by the context variable z c , with no knowledge-specific influence; this can be interpreted as a response that relies solely on the immediate context and does not make use of any external knowledge. When λ = 1, the output is determined solely by the knowledge-specific variable z k , without any contribution from the context variable z c , meaning the response is generated solely from the acquired knowledge, disregarding the context. Intermediate values of λ (between 0 and 1) blend both variables, enabling smooth transitions and combinations of knowledge and context.
The line plot in Figure 3 shows the relationship between λ and the averaged BLEU K (knowledge BLEU score) for different experimental settings. It indicates that as the interpolation weight λ increases, the BLEU K also tends to increase, implying a positive correlation between knowledge utilization (represented by λ) and the quality of responses (measured by BLEU K ). This suggests that the learned interpolation weight can effectively represent the knowledge utilization ratio, as higher values of λ indicate a stronger influence of knowledge in generating responses.
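The binning analysis behind Figure 3 can be sketched in a few lines. This is a simplified illustration (the bin count is an assumption, not the paper's setting): test samples are grouped into equal-width λ bins over [0, 1] and the BLEU K scores in each bin are averaged.

```python
def average_by_lambda_bin(samples, num_bins=10):
    """Group (lambda, bleu_k) pairs into equal-width bins over [0, 1]
    and average the BLEU_K scores inside each bin.

    samples: iterable of (lam, bleu_k) with 0 <= lam <= 1.
    Returns a list of per-bin averages; empty bins yield None.
    """
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for lam, bleu_k in samples:
        # lam == 1.0 falls into the last bin rather than out of range.
        idx = min(int(lam * num_bins), num_bins - 1)
        sums[idx] += bleu_k
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Plotting these per-bin averages against the bin centers yields the λ-versus-BLEU K curve described above.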
The line plot helps visualize the trend that increasing the emphasis on knowledge (higher λ) leads to improved knowledge utilization and better quality responses. It provides evidence that the interpolation weight λ plays a crucial role in controlling the balance between knowledge and context in the model's responses. It demonstrates that, as the model assigns higher weights to the knowledge variable, it generates responses with better utilization of knowledge, leading to improved BLEU K scores.

Figure 4a presents an example of responses generated by different models given the same context and knowledge. As can be observed, our proposed model tends to generate more reasonable and engaging responses by considering both the context and the provided knowledge. We also notice that KIV c leans towards addressing the query by delivering relevant knowledge, whereas KIV p is more inclined to create more diverse and indirect responses, utilizing the knowledge as a conversation guide. Figure 4b,c display responses sampled with different interpolation weights λ in KIV d and KIV p . The learned latent variable in our model facilitates a smooth transition from solely relying on the query to parroting the external knowledge verbatim, demonstrating that the model can integrate knowledge and context at a semantic level within the latent space.

Knowledge:
country music often consists of ballads and dance tunes with generally simple forms, folk lyric and harmonies accompanied by mostly string instruments.

QUERY:
sometimes music from different bands sound the same.

RESPONSES:

HRED:
i am sure country music is a lot of music.

CVAE:
country music done by mostly string instruments.

GTTP:
country music often consists of ballads and dance tunes.

TMem:
it is fascinating country music often consists of ballads.

KIVc :
do you like country music of ballads and dance tunes with generally simple forms.

KIVd :
the main reason is most country music are with simple forms, lyric and harmonies.

KIVp:
i think country music are very popular . it consists of dance tunes of simple forms, lyric …

[Figure 4b,c residue: responses sampled at different interpolation weights λ; the recoverable sample at λ = 0.6 reads: "i love it too. it's a classic. it's a classic song, it's usually ballads and dance tunes."]

More Generated Examples
In this section, we present additional examples of responses generated by our model and the baseline models. In Figure 5, we include three new topics: heavy metal music, the Chevrolet Corvette, and coffee. As can be seen, both KIV d and KIV p exhibit the ability to generate responses that are not only more relevant but also more engaging.
In Figure 6, we have included a multi-turn dialogue generated by our model KIV p . During the multi-turn dialogue initiated by the wizard, the topic of "Science Fiction" was introduced, leading to an exploration of various aspects of the genre. As the conversation progressed into its second round, the model smoothly transitioned the focus towards political and philosophical issues within science fiction. At this point, the human participant engaged by discussing time travel and mentioning the Harry Potter movies. Recognizing the significance of Harry Potter as a book that has had a transformative impact, the model suggested it as further reading on the topic, which the human enthusiastically accepted. This seamless exchange of multiple turns created a highly engaging and enjoyable conversational experience.

Conclusions
This paper introduces KIV, a novel interpolation-based CVAE designed to generate knowledge-grounded responses. This approach incorporates two distinct latent variables for modeling context and knowledge. These latent variables are seamlessly integrated via linear interpolation, where the interpolation weight is tied to the degree of knowledge specificity. Specifically, we propose two interpolation strategies: deterministic interpolation, which uses semantic similarity as the interpolation weight, and probabilistic interpolation, which treats the interpolation weight as a probabilistic variable sampled from a Logistic Normal distribution. Based on both automatic and human evaluations, our experimental results demonstrate that both interpolation strategies outperform strong baselines in relevance and fluency. Moreover, probabilistic interpolation significantly enhances knowledge engagement and diversity. Future research will concentrate on exploring more sophisticated methods for interpolation weight optimization and on handling multimodal data such as images or sound.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Derivation of Lower Bound
In this section we show that L_KIV is a lower-bound estimate of log p(y|c, k). The loss function of the CVAE [27] can be formulated as maximizing the Evidence Lower Bound (ELBO) of the log-likelihood log p(y|c, k):

L_CVAE = −KL(q_φ(z|y, c, k) ∥ p_θ(z|c, k)) + E_{z∼q_φ(z|y,c,k)}[log p_θ(y|c, k, z)] ≤ log p_θ(y|c, k).
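The inequality follows from the standard ELBO decomposition; in LaTeX notation (a reconstruction of the standard CVAE argument, using the paper's symbols):

```latex
\log p_\theta(y \mid c, k)
  = \mathbb{E}_{z \sim q_\phi(z \mid y, c, k)}\big[\log p_\theta(y \mid c, k, z)\big]
    - \mathrm{KL}\big(q_\phi(z \mid y, c, k) \,\|\, p_\theta(z \mid c, k)\big)
    + \mathrm{KL}\big(q_\phi(z \mid y, c, k) \,\|\, p_\theta(z \mid y, c, k)\big)
  \;\ge\; \mathcal{L}_{\mathrm{CVAE}},
```

since the last KL term, the gap between the approximate posterior and the true posterior, is non-negative and dropping it can only decrease the right-hand side.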
To verify that the loss function L_KIV remains a lower bound of log p(y|c, k), we only need to show that L_KIV ≤ L_CVAE.
Therefore, L_KIV is a lower-bound estimate of the log-likelihood log p(y|c, k).