Exploring Bi-Directional Context for Improved Chatbot Response Generation Using Deep Reinforcement Learning

Abstract: The development of conversational agents that can generate relevant and meaningful replies is a challenging task in the field of natural language processing. Context and predictive capabilities are crucial factors that humans rely on for effective communication. Prior studies have a significant limitation: they do not adequately consider the relationship between utterances in a conversation when generating responses. A commonly used approach relies only on the information of the current utterance to generate the corresponding response, and as such does not take advantage of the context of a multi-turn conversation. This study addresses this limitation by proposing a novel method that comprehensively models the contextual information of the current utterance for response generation. Unlike other studies, our proposal uses a bi-directional context in which the historical direction helps the model remember information from earlier in the conversation, while the future direction enables the model to anticipate the impact of its response afterward. We combine a Transformer-based sequence-to-sequence model with a reinforcement learning algorithm to achieve this goal. Experimental results demonstrate the effectiveness of the proposed model through qualitative evaluation of generated samples, in which the proposed model improves the average BLEU score by 24% and the average ROUGE score by 29% compared to the baseline model. The proposed model also improves the average BLEU score by 5% to 151% compared with previous related studies.


Introduction
Conversational agents aim to enable computers to communicate with humans by automatically generating a response to each user input. Researchers have been working on developing robust, scalable, and context-aware systems for a long time. One of the key challenges they are focused on is generating meaningful and consistent dialogue responses that align with the conversation history, which this paper also addresses. While previous studies have indicated that incorporating additional information can enhance the accuracy of models, it is still uncertain how contextual information impacts a conversation as a whole, including its connection to future outcomes and how different contextual factors affect the conversation. Our goal is to describe and understand how contextual factors contribute to model performance improvement. In this study, we examine, explore, and exploit the contexts humans use in everyday decision making. Based on the principle that the more information the better, we also present different generation-based models that utilize context information in a conversation. This study proposes a Deep Reinforcement Learning model for analyzing the influence of different contextual information on responses in a conversation and how to combine them to improve a conversation's coherence and consistency.
Existing chatbot systems can be classified into rule-based, information retrieval-based, and generation-based systems. Rule-based methods rely on a knowledge base of predefined question-answer pairs. When an input is received, the system matches it to the most similar question pattern in the knowledge base and retrieves the corresponding answer [1]. However, these predefined rules take time to expand and maintain. Additionally, these systems respond to user questions in a stereotyped manner without utilizing the conversation history.
Another method based on the Information Retrieval (IR) approach is available [2][3][4]. This approach is suitable when a large set of conversations is available. The principle of IR-based models is to select the corresponding response based on a given input from a dataset (containing a set of conversation or question-answer pairs). However, a disadvantage of IR-based models is their limited ability to understand semantic differences between input contexts. Additionally, IR-based models can only respond within the dataset, making them suitable for query-type lookups rather than forming a conversation.
Recently, generation-based approaches have attracted much attention in NLP, especially in machine translation and chatbots. These approaches treat a conversation as a source-to-target problem. Seq2Seq models based on LSTM have shown excellent performance in generating responses, such as in [5][6][7]. The strength of this approach lies in its ability to generate answers in a generalized way that is not dependent on predetermined rules. The mechanism of this method mimics human thinking in two stages: the first stage of encoding (i.e., understanding) the question and the second stage of generating the corresponding answer. Therefore, most current research continues this approach and focuses on developing deep neural models to improve the quality of conversations, which is also the approach in our work. However, most Seq2Seq-based models are trained on single-turn conversations, making them incapable of handling long-context conversations. Additionally, due to numerous generic responses in the training dataset, Seq2Seq models tend to generate generic responses regardless of the input [8][9][10].
In reality, humans use contextual reasoning to make daily decisions. Context refers to the collection of texts surrounding a word used in a sentence or phrase of interest. Some studies have used previous utterances to generate current ones or reinforcement learning methods to keep track of the conversation's history, such as [11,12]. However, these studies typically only considered a partially observed context when generating responses at the current position. For a chatbot model to be effective, it must consider the logical relationship between turns in a conversation. Therefore, the chatbot must consider the surrounding contexts in the ongoing conversation when generating a response. For example, in Table 1, there is a dialogue between a boy and his mother. In this conversation, the answer in turn 7 is based on the information "bus" mentioned in an earlier turn (turn 3). This example highlights the importance of a chatbot managing the flow of information and connecting utterances coherently and meaningfully throughout the entire conversation. This study aims to use contextual information to generate responses effectively. At position i in a multi-turn conversation, the task of conversation modeling is to generate a response for the following position i + 1. In this proposed conversation model, we use two contexts: the left and right contexts of the current utterance. The left-context stores information from previous utterances, providing historical context for the current one. The right-context, on the other hand, indicates how the current utterance may affect future utterances. We use the left-context by integrating multiple preceding utterances from position i backward into the Seq2Seq structure to gather contextual information and generate the corresponding response for position i + 1. The more challenging task is how to use the right-context. When the model generates a response at position i + 1, there is no utterance available at subsequent positions. So how can we exploit the right-context during the training process? To exploit the right contextual information, we consider the conversation as a Markov decision process like [13][14][15]. Then, we apply the reinforcement learning technique to manage it. Instead of stopping the generation of utterances at position i + 1, we utilize the result to generate utterances at subsequent positions such as i + 2, i + 3, and so on. The model is designed to simultaneously optimize the entire utterance sequence, including positions i + 1, i + 2, and beyond. This is achieved using a combination of a deep learning model and a reinforcement learning mechanism, known as a deep reinforcement learning model. Using this model, we have incorporated both the left-context and right-context of the current utterance into the model's training process.
Unlike recent studies that often focus on designing the architecture of models for generating responses, our goal is to describe and understand contextual factors that can improve the model's performance in generating responses. In this study, we take a more general approach, using both the left-context and right-context of the current utterance when generating the subsequent response. Our model aims to generate responses in a multi-turn conversation and is a potent combination of the Seq2Seq architecture and a reinforcement learning algorithm. Seq2Seq is capable of leveraging historical information as a left-context. In contrast, the reinforcement learning algorithm leverages the future impact of the current response. This optimization process helps the model generate appropriate responses that align with the previous conversation and guide its future impact. The contributions of this paper are summarized as follows:

•
Using historical information in the conversation for the Seq2Seq model can improve consistency in the conversation. When combined with future contextual information, the Seq2Seq model can evaluate the current response based on long-term objectives that the Reinforcement Learning (RL) algorithm will accumulate. This method enables the chatbot to maintain a conversation that adheres to a specific objective. The primary objective of utilizing bi-directional context is to ensure the conversation stays on track toward the desired goal. Our aim is to develop a new conversational model that takes advantage of effective context to improve the coherence and consistency of a conversation.

•
RL techniques require a forward-looking function, which is an essential component that scores the quality of each response. In order to enhance the training of models for achieving the desired goal, we have introduced two additional forward-looking functions.

•
Our proposed method for training conversational agents involves using deep reinforcement learning algorithms that leverage action spaces obtained from simulating conversations between two pre-trained virtual agents.

Related Work
In this section, we provide a brief overview of the models proposed in recent years, including their respective strengths and limitations. Pattern matching and machine learning are the two main approaches to developing a conversational model based on the applied techniques.
Weizenbaum from the Massachusetts Institute of Technology (MIT) developed ELIZA, one of the earliest rule-based chatbots [16]. ELIZA is a simple program that uses pre-defined rules to communicate with users in natural language. It searches for keywords in the user's input text and analyzes them using predefined rules to generate a response. PARRY is a chatbot extension of ELIZA, created at Stanford [17], and has many improvements over ELIZA. PARRY simulates a patient with schizophrenia and generates responses based on the user's emotional state and the previous response. From 1995 to 2000, the Artificial Intelligence Markup Language (AIML) was developed to build a knowledge base for chatbot systems using pattern-matching techniques. ALICE was the first chatbot to use the AIML language, and its knowledge base contains around 41,000 patterns, in contrast to ELIZA's 200 rules. However, even with this massive knowledge base, ALICE is still not intelligent enough to generate human-like responses. The weakness of pattern-matching methods is that they often produce robotic and repetitive responses, lacking the naturalness of actual human interactions. Therefore, this approach is better suited to a single-turn conversation where the response can be selected from a database, and the user is only interested in the final response. In practice, we need chatbots to generate appropriate and goal-oriented responses, ensuring the conversation remains coherent and meaningful within the given context.
Unlike chatbot systems based on the pattern matching method, chatbots that use Machine Learning can extract information from user input using NLP techniques and learn from conversations. They are not limited to using pre-defined rules set by the user. The core idea of Machine Learning is to train a model with labeled question-answer pairs provided by humans, which maps the relationships between inputs and responses from the training data. One of the earliest generations of chatbots was developed using Statistical Machine Translation (SMT) by [18]. The SMT-based approach treats conversational responses as a language translation problem, where the mapping rules between input and output are learned from training data. Based on this idea, some previous studies proposed SMT models for building chatbots [18]. They utilized SMT models by taking the user's utterance X as input and generating the response Y using the translation method. The language model P(Y) was constructed by counting the frequency of n-grams in the training data. The probability of a response sentence given an input sentence, denoted as P(Y|X), forms the basis for generating reasonable phrases in a conversation. A translation table was also created based on the training data during the training process. The Moses phrase-based decoder used the translation table to select the best response for the input sentence. In their experiments, the input and response were in the same language, and the data consisted of status updates on Twitter. However, the researchers found that mapping responses in a conversation was more complex than translating between two languages and unsuitable for multi-turn conversations.
Recognizing user messages and generating appropriate responses can enrich communication in chatbot systems. Recently, text generation tasks based on the Seq2Seq neural network model have attracted much attention in the field of neural dialogue generation [6]. Their models use two recurrent neural networks (RNNs). The first network encodes the input sentence to a context vector, and the second decodes the vector to generate the desired response. In recent studies, generating responses based on Seq2Seq models has achieved significant improvements in various applications, from machine translation [19][20][21] and text summarization [22][23][24] to chatbots [5,6,25]. Seq2Seq-based models are widely used in response generation because they usually achieve better performance than earlier models. However, they have several drawbacks in multi-turn conversations. They learn by maximizing the probability of generating a response based on the previous dialogue turn using maximum likelihood estimation (MLE). Moreover, MLE may have difficulty estimating the specific targets of chatbot systems. Furthermore, the training dataset includes numerous generic responses, which can cause Seq2Seq models to generate common and boring responses like "I don't know" regardless of the input [8]. These types of responses can lead to the termination of the conversation or fall into an endless loop after three turns [26]. Serban et al. recently introduced a hierarchical neural model that captures relationships between turns in a conversation [5,11]. They improved their model by training it on a dataset of question-answer pairs and pre-trained embeddings. Additionally, Li et al. [27] proposed using maximum mutual information (MMI) instead of the MLE objective function for response generation tasks to increase response diversity. Although this improvement may generate more appropriate responses, it is still challenging to achieve the goal of a chatbot: simulating human-to-human interactions by providing informative responses to engage users.
Another research direction considers dialogues as a Markov Decision Process (MDP) [13][14][15] and uses reinforcement learning techniques to address related issues. RL is a general-purpose framework for sequential decision making and is typically described as an agent that interacts with unknown environments. Many tasks, such as generation, reasoning, information extraction, and dialogue, can be formulated as sequential decision making. Therefore, in recent years, deep reinforcement learning has garnered a lot of attention in NLP [28]. In chatbot systems, RL-based models consider a conversation as a sequential decision process that operates over its state space, action set, and strategy to address the challenges of multi-turn dialogues in previous models. These models define a dialogue as either a Markov Decision Process (MDP) [13,14] or a Partially Observable Markov Decision Process (POMDP) [29][30][31]. Using reinforcement learning (RL), they monitor the state transition process, take appropriate actions (utterances), and obtain information from the user [28]. The authors of [26] simulated a conversation between two virtual agents, evaluated action sequences using policy gradient methods, and presented rewards for three useful dialogue attributes: informativeness, coherence, and ease of answering. Another study suggested a method based on reinforcement learning to create a chatbot using a generation model that generates sequences for a task-oriented model [32]. The experiments showed that this method results in more natural conversations that more efficiently accomplish task objectives.
In a recent study [15], the authors developed a conversational system for learning a policy by incorporating three reward functions. The first reward function evaluates the similarity between previous utterances and the topic presentation. The second one measures semantic coherence using mutual information between the generated response and previous utterances. The final reward function encourages the model to produce grammatically correct and fluent responses. Chen et al. [33] proposed an actor-critic model to implement deep reinforcement learning (DRL). The model was trained in parallel using data collected from various dialogue tasks and tested on 18 tasks from PyDial [34]. The results showed that the model achieved robust learning efficiency. Another RL-based approach [35] proposed Offline RL, which can train dialogue agents using static datasets. Their experiments showed that this method generated conversations that could help complete tasks better for specific purposes.
Unlike other works focusing on improving the model architecture of chatbot systems based on deep learning, this study uses reinforcement learning to exploit bi-directional context in a multi-turn conversation. Our goal is to learn contextual features from human-to-human conversations by combining the strengths of Seq2Seq models with the advantages of RL. We construct the model using RL to achieve various goals, such as avoiding overfitting by integrating multiple context constraints in RL and enhancing the conversation's long-turn coherence and consistency.

The Proposed Model
Firstly, we provide a formal description of the problem. Most existing work on chatbots studies response generation for single-turn conversation, which only considers the immediately preceding utterance. However, this is not how people typically converse. In practice, human conversations often consist of multiple turns rather than just a single turn. In this research, we treat the problem in a multi-turn scenario where the dataset consists of multiple conversations. Each conversation comprises a sequence of multiple turns, denoted as s_0, s_1, s_2, ..., s_{n-1}, s_n, where s_t and s_{t+1} represent successive turns between two agents. The task of building a chatbot model can be viewed as a source-to-target mapping problem, where the model learns mapping rules between source utterances and their corresponding suitable target responses from massive training data. We split the training dataset into n pairs (s_t, s_{t+1}), t = 1, ..., n, where (s_t, s_{t+1}) represents the t-th pair consisting of an input and its corresponding target. In a conversation system, s_t and s_{t+1} denote the source utterance and the target response, respectively. Our model builds on the Transformer-based BERT architecture. This architecture is designed to model sequences of data, such as natural language text, and has been used in a variety of natural language processing tasks, including machine translation [37,38], language modeling [39], and chatbots [40]. In order to understand the relationship between two sentences, the BERT training process also utilizes next sentence prediction. A pre-trained model with this understanding is relevant for tasks like question answering. During training, the model receives input pairs of sentences and learns to predict if the second sentence is the next one in the original text.
Figure 1 shows an overview of the BERT architecture. The input text is first tokenized and embedded using BERT's token embedding layer. The position embedding layer adds positional information to the input tokens, allowing the model to distinguish the order of the tokens. At a high level, the BERT block is a stack of multiple Transformer encoder blocks, each of which is composed of a multi-head attention layer and a position-wise feed-forward layer (shown in the right part of the figure). The multi-head attention layer allows the model to weigh the importance of different words in the input text based on their relevance to each other. In contrast, the feed-forward layer transforms the weighted input into a fixed-length vector representation. This process is repeated for each Transformer block in the stack, and the output embedding layer generates the final vector representation of the input text, which is then used for downstream tasks such as text classification or question answering. In this study, we use the pre-trained BERT model for our downstream task.
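The attention computation described above can be illustrated with a minimal, single-head sketch in pure Python. This is an illustration only: the token vectors are toy two-dimensional stand-ins, and a real BERT layer additionally applies learned query/key/value projection matrices and uses many heads in parallel.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Each query attends over all keys; the softmax weights then mix
    the value vectors into one context vector per query.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Self-attention: three toy token vectors attend to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx = attention(tokens, tokens, tokens)
print(len(ctx), len(ctx[0]))  # one mixed vector per input token
```

Each output row is a relevance-weighted average of all token vectors, which is exactly the "weigh the importance of different words" behavior described above.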

Integrating Left-Context with BERT2BERT
Humans can control the flow of information during a long-term conversation. However, most current models are incapable of handling the history of a conversation, which is why those models usually generate repetitive and generic responses. Our model solves this problem by using a left-context, denoted as c_L, which serves as contextual information from the conversation history. This context is defined as a sequence of k consecutive previous utterances s_{t-k}, ..., s_{t-2}, s_{t-1} at turn t. Based on the principle that more information is better, this left-context can provide additional information during the training process.
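Constructing c_L amounts to a sliding window over the conversation history. The following sketch is illustrative (the function name and toy conversation are our own, not part of the proposed system):

```python
def left_context(turns, t, k):
    """Collect up to k utterances preceding turn t as the left-context c_L."""
    return turns[max(0, t - k):t]

conversation = ["Hi!", "Hello, how are you?", "Fine, thanks.", "Where are you going?"]
c_L = left_context(conversation, t=3, k=2)
print(c_L)  # ['Hello, how are you?', 'Fine, thanks.']
```

When fewer than k previous utterances exist (early in a conversation), the window simply shrinks.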
Our proposed model for generating historical contextual responses is a Transformer-based encoder-decoder model. The encoder-decoder model has been proven to improve performance on many tasks [41,42]. However, such models require a massive dataset for pre-training before fine-tuning for a desired task. Recently, a study showed that skipping the cost of the pre-training process and warm-starting from a pre-trained encoder allows the Transformer-based encoder-decoder model to achieve competitive results in text generation tasks [43][44][45]. Inspired by their study, we use the BERT2BERT architecture [43] and warm-start the encoder and decoder with the BERT-based checkpoint [46]. The proposed model takes an input message s_t = {w_1^t, w_2^t, ..., w_{|s_t|}^t} and its left-context c_L as input. This sequence is then fed into an encoder, which is a stack of BERT-based encoder blocks. Each block consists of a self-attention layer and two feed-forward layers, as shown in Figure 2. The encoder maps the input sequence [c_L, s_t] to a contextualized encoded vector X_BERT, as follows:

X_BERT = Encoder([c_L, s_t])

The architecture of the decoder block is similar to that of the encoder block (as shown in Figure 3). However, the decoder block is conditioned on the contextualized encoded vector X_BERT. Each decoder block is more extensive than an encoder block, consisting of a self-attention layer, two feed-forward layers, and a cross-attention layer that obtains contextual information from the vector X_BERT. In addition, a linear layer called the LM Head is included on top of the last decoder block, which maps the output vectors to the logit vectors L.
During the training process, the decoder maps the contextualized encoded vector X_BERT and a target sequence s_{t+1} to logit vectors L. The probability distribution of the target sequence s_{t+1} is factorized into conditional distributions of the next word using the chain rule:

p(s_{t+1} | s_t, c_L) = ∏_{i=1}^{|s_{t+1}|} p(w_i^{t+1} | w_1^{t+1}, ..., w_{i-1}^{t+1}, X_BERT)

The logits define the distribution of the target sequence s_{t+1} conditioned on the input sequence s_t through a softmax operation. As a result, the distribution of each generated token is given by the softmax of the logit vector as follows:

p(w_i^{t+1}) = softmax(l_i) = exp(l_i) / Σ_j exp(l_j)

where l_i is the score of the i-th token in the logit vector L. We define s_{t+1} = {w_1^{t+1}, w_2^{t+1}, ..., w_{|s_{t+1}|}^{t+1}} as the ground-truth output for a given input sequence s_t. The training objective aims to minimize the following cross-entropy (CE) loss:

L_CE = − Σ_{i=1}^{|s_{t+1}|} log p(w_i^{t+1} | w_1^{t+1}, ..., w_{i-1}^{t+1}, X_BERT)

As discussed, although encoder-decoder-based models can generate meaningful responses, they still do not fully meet the objectives of a chatbot. Most of these models are trained to generate the best response from a single utterance, making them suitable only for single-turn conversation systems. Our architecture proposes a solution to this problem by integrating the left-context into the encoder-decoder models, allowing the system to consider the previous utterances in the dialogue. However, traditional models are still short-sighted in predicting responses in multi-turn conversations because they ignore the potential impact on the future of the dialogue.
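The softmax and cross-entropy computations above reduce to a few lines. The sketch below is purely illustrative, using a toy three-word vocabulary and hand-picked logits; real training operates on batched tensors produced by the decoder.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(logit_rows, target_ids):
    """Average negative log-likelihood of the target token at each step."""
    total = 0.0
    for logits, t in zip(logit_rows, target_ids):
        total += -math.log(softmax(logits)[t])
    return total / len(target_ids)

# Two decoding steps over a toy 3-word vocabulary; targets are word indices.
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
targets = [0, 1]
loss = cross_entropy(logits, targets)
print(round(loss, 4))
```

Minimizing this loss pushes the probability mass of each step toward the ground-truth token, which is exactly the MLE objective the surrounding text criticizes as short-sighted for multi-turn dialogue.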

Bi-Directional Context Using Deep Reinforcement Learning
To solve the above problems, we propose formulating the conversation as a reinforcement learning problem and using long-term rewards to optimize response generation. Moreover, to gain insight from the long-context success of a conversation, the proposed model provides an utterance generation model conditioned on the impact of a generated response in an ongoing dialogue. We define the right-context as capturing the influence of utterances in the future. We first predict the best response corresponding to the historical context and then fine-tune the model with a desirable goal conditioned on the future context c_R.
That goal can be achieved with a Reinforcement Learning algorithm through a Markov Decision Process (MDP). An MDP is a framework for solving decision problems by interacting with the environment to reach desired goals [47], and it is used to solve decision-making problems sequentially. It consists of an agent, comprising a learner and a decision-maker, and the environment, encompassing all external factors. An MDP is defined by a collection of states S, a set of actions A, a transition function P, and a reward function R. Given an MDP (S, A, P, R), the model is trained to find a policy π that solves the problem. From an algorithmic perspective, a policy is a conditional probability distribution over the set of actions A. During the interaction, the agent takes an action a according to a policy π, and the environment updates to a new state based on the agent's action a. More specifically, the agent interacts with the environment at each of a sequence of discrete time steps t = 0, 1, 2, .... At time t, each pair of the current state s_t ∈ S and the action a_t ∈ A(s) creates a transition tuple (s_t, a_t, r_{t+1}, s_{t+1}), with s_{t+1} being the next state; the agent also receives a numerical reward r_{t+1} ∈ R. The environment and agent produce a sequence s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, ..., where r_t and s_t are defined by discrete probability distributions conditioned on the previous state and action. At time t, the transition probability to the next state s_{t+1}, given particular values of the preceding state s_t and action a_t, is defined by p(s_{t+1}, r_{t+1} | s_t, a_t) for all s_t, s_{t+1} ∈ S, r_{t+1} ∈ R, and a_t ∈ A(s).
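The transition tuples (s_t, a_t, r_{t+1}, s_{t+1}) can be made concrete with a tiny simulation. The states, actions, and reward rule below are invented placeholders for illustration, not the model's actual dialogue states or reward functions:

```python
import random

random.seed(0)  # deterministic for reproducibility

def step(state, action):
    """Toy environment: next state appends the action; reward is its length."""
    next_state = state + [action]
    reward = float(len(action))
    return next_state, reward

state = ["hello"]
trajectory = []
for t in range(3):
    # A stand-in policy: pick an action uniformly at random.
    action = random.choice(["ok", "tell me more", "bye"])
    next_state, reward = step(state, action)
    trajectory.append((tuple(state), action, reward, tuple(next_state)))
    state = next_state

print(len(trajectory))  # three (s_t, a_t, r_{t+1}, s_{t+1}) tuples
```

The resulting list of tuples is precisely the sequence s_0, a_0, r_1, s_1, a_1, r_2, ... described in the text.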
In the dialogue task, given an input s_t, we need to find a response s_{t+1} that optimizes a measure of semantic relevance to s_t. RL can be used to evaluate each part of the conversation as it is generated. We can define a conversation as an MDP (S, A, P, R) and solve it using RL. The corresponding MDP (S, A, P, R) is defined as follows [28]: the set of states S is defined as the conversation history; A is the set of generated responses with which the system can reply to the user; and P is the transition probability function trained using the BERT-based encoding p_RL(s_{t+1} | s_t, c_L), where c_L is the left-context at time t. The reward function, denoted as R, represents the forward-looking reward received for each chosen action and plays a crucial role in achieving a successful conversation. Operationalizing R can guide the model toward the desired goal.

Reward Definition
Our work proposes two rewards to generate desired responses and ensure the conversation's responses serve a specific goal. An emerging problem with encoder-decoder-based models is that they often generate highly probable responses that are nevertheless incoherent or irrelevant to the dialogue history. To avoid inappropriate responses in a conversation, we use consecutive turns to define our forward-looking functions. By utilizing the mutual information between the action a and the preceding utterance in the conversation, we can ensure the appropriateness of the generated responses. Let r_1 denote the reward obtained for each action, and let h_t and h_{t+1} denote the representations obtained from the agents for two consecutive turns s_t and s_{t+1}. The cosine similarity between h_t and h_{t+1} gives the first reward at the current state s_t:

r_1 = cos(h_t, h_{t+1}) = (h_t · h_{t+1}) / (||h_t|| ||h_{t+1}||)

In addition to the first reward, we suggest a second reward to encourage the model to contribute new information in each turn. This reward leverages the intra-linguistic relations in the sentence to detect changes in the content, thereby helping to maintain coherence and sustainability in the conversation. To achieve this purpose, we define the second reward, denoted as r_2, as the minimum cumulative distance between the words in turn s_t and all the words in the subsequent turn s_{t+1}:

r_2 = (1 / N_{s_t}) Σ_p min_q ||w_p − w_q||

where N_{s_t} denotes the number of tokens in sentence s_t, and w_p, w_q are embedding vectors of the words in sentences s_t and s_{t+1}, respectively. The final reward for an action a_t at the current state s_t is a weighted sum of the two rewards:

r(a_t | s_t) = λ_1 r_1 + λ_2 r_2

where λ_1 + λ_2 = 1. We set λ_1 = 0.5 and λ_2 = 0.5.
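The two rewards and their weighted combination can be computed directly from the definitions above. The sketch below is illustrative only: the vectors are toy stand-ins for the agents' turn representations and word embeddings, and the function names are our own.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (reward r_1: coherence)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def min_cum_distance(words_t, words_t1):
    """Average distance from each word of s_t to its closest word in s_(t+1)
    (reward r_2: how much new content the next turn introduces)."""
    total = 0.0
    for wp in words_t:
        total += min(math.dist(wp, wq) for wq in words_t1)
    return total / len(words_t)

def reward(h_t, h_t1, words_t, words_t1, lam1=0.5, lam2=0.5):
    """Weighted sum r = lambda_1 * r_1 + lambda_2 * r_2."""
    r1 = cosine(h_t, h_t1)
    r2 = min_cum_distance(words_t, words_t1)
    return lam1 * r1 + lam2 * r2

# Toy turn representations and word embeddings.
h_t, h_t1 = [1.0, 0.0], [1.0, 1.0]
wt = [[0.0, 0.0]]
wt1 = [[3.0, 4.0], [1.0, 0.0]]
print(round(reward(h_t, h_t1, wt, wt1), 4))
```

With λ_1 = λ_2 = 0.5, a response scores well when it is both semantically close to the previous turn (high r_1) and lexically distinct from it (non-trivial r_2).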

Conversation Simulation
The main idea of the proposed model is to simulate a conversation by allowing two chatbots to communicate with each other, as shown in Figure 4. While the pre-trained encoder-decoder allows the model to generate responses that are coherent with the conversation history, using RL enhances the model's ability to generate responses optimized for long-term goals. We simulate the conversation as follows. In the first step, we obtain an input sentence with the conversation history as contextual information c_L from the training dataset and feed it to the first agent. The first agent encodes the inputs into a vector representation and decodes it to generate a response s_t for the next turn. The second agent updates the state of the simulation by combining the conversation history with the output s_t. It then encodes this new state into a representation and decodes it into a new response, which is fed back to the first agent, and the process repeats. At the end of the simulation, the right-context c_R is the sequence of k consecutive utterances {s_{t+1}, s_{t+2}, ..., s_{t+k}} to the right of the generated response s_t. The transition probability distribution π is initialized from the pre-trained BERT-based model and represents the policy:

π(s_{t+1} | s_t, c_L) = p_RL(s_{t+1} | s_t, c_L)
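The two-agent simulation loop can be sketched as follows. This is a schematic only: the agents here are trivial string-producing stubs standing in for the two pre-trained encoder-decoder models, and the function names are our own.

```python
def agent_a(history):
    """Stub for the first pre-trained agent: reacts to the last utterance."""
    return f"A{len(history)}: what about '{history[-1]}'?"

def agent_b(history):
    """Stub for the second pre-trained agent."""
    return f"B{len(history)}: regarding '{history[-1]}', I agree."

def simulate(seed_utterance, k):
    """Let the two agents alternate for k turns; the k simulated
    utterances form the right-context c_R of the seed utterance."""
    history = [seed_utterance]
    agents = [agent_a, agent_b]
    for turn in range(k):
        reply = agents[turn % 2](history)  # alternate speakers
        history.append(reply)              # reply becomes part of the state
    return history[1:]

c_R = simulate("I like taking the bus.", k=4)
print(len(c_R))  # k simulated future utterances
```

Each reply is appended to the shared history before the other agent responds, mirroring the state update described above.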

Response Generation
Given the current state of the conversation s_t and its left-context c_L, the policy generates a list of candidate responses A. In reinforcement learning, the agent and environment interact over a sequence of actions in a conversation, and the goal of the agent is to maximize the expected reward of its actions through this interaction [48]:

maximize_θ ∑_{[a_t, …, a_{t+k}] ∈ c_R} π_θ(a_t, …, a_{t+k}) r(a_t, …, a_{t+k})    (11)

where a_t is the generated response at turn t, θ is the set of parameters of the model, c_R is the right-context obtained in the simulation process, and r(a_t, …, a_{t+k}) is the cumulative discounted reward associated with the sequence of utterances a_t, …, a_{t+k}. When the simulator reaches the end of the conversation, it estimates the reward based on the specific goals, computed by

r(a_t, …, a_{t+k}) = ∑_{i=0}^{k} γ^i r(a_{t+i})    (12)

where γ is the discount factor, which adjusts the importance of rewards over time in the reinforcement learning algorithm; γ is a real value in [0, 1] that indicates how important future rewards are to the current state. We adopt a curriculum learning strategy: we simulate the dialogue for k turns and use the policy gradient method to find parameters that maximize the expected future reward. The objective we aim to maximize is the expected cumulative reward:

J(θ) = E_{[a_t, …, a_{t+k}] ∼ π_θ} [ r(a_t, …, a_{t+k}) ]

Finally, the derivative of the loss function can be written as an expectation:

∇_θ J(θ) = E_{[a_t, …, a_{t+k}] ∼ π_θ} [ ∇_θ log π_θ(a_t, …, a_{t+k}) r(a_t, …, a_{t+k}) ]    (17)

Fine-tuning then proceeds as in steps 8–12 of Algorithm 1:

8: Run policy π_θ and obtain response s_t.
9: Run the simulator to obtain the sequence of sentences s_t, …, s_{t+k}, with s_t ∼ π_θ.
10: Observe the sequence and calculate the reward according to Equation (12).
11: Calculate the loss according to Equation (17) and update the parameters of the model.
12: end for
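The discounted reward and the REINFORCE-style loss referenced in Equations (12) and (17) can be illustrated with a small numeric sketch. The per-turn rewards and log-probabilities below are invented placeholders; a real implementation would take them from the reward functions and the decoder's token probabilities.

```python
# Sketch of the cumulative discounted reward (Eq. (12)) and the
# single-rollout REINFORCE loss whose gradient matches the expectation
# in Eq. (17). All numeric inputs are illustrative placeholders.

GAMMA = 0.9  # discount factor γ ∈ [0, 1]

def discounted_reward(step_rewards, gamma=GAMMA):
    # Eq. (12): r(a_t, ..., a_{t+k}) = sum_i gamma^i * r(a_{t+i})
    return sum(gamma ** i * r for i, r in enumerate(step_rewards))

def policy_gradient_loss(log_probs, step_rewards, gamma=GAMMA):
    # REINFORCE: loss = -sum_t log pi_theta(a_t) * R for one sampled
    # rollout; its gradient is a Monte Carlo estimate of Eq. (17).
    R = discounted_reward(step_rewards, gamma)
    return -sum(lp * R for lp in log_probs)

rewards = [1.0, 0.5, 0.25]        # placeholder per-turn rewards
log_probs = [-0.1, -0.2, -0.3]    # placeholder log pi_theta(a_t)
R = discounted_reward(rewards)    # 1.0 + 0.45 + 0.2025 = 1.6525
loss = policy_gradient_loss(log_probs, rewards)
```

Minimizing this loss with gradient descent increases the probability of rollouts that received a high discounted reward.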

Description of Dataset
Our experiments utilized the DailyDialog dataset [50] to evaluate our models. The dataset consists of a wide variety of dialogues from daily communication, with the three largest categories being Relationship (33.33%), Ordinary Life (28.26%), and Work (14.49%). The dataset was built to cover various topics in daily life and contains 13,118 multi-turn dialogues. To ensure the dataset's consistency with real-life experiences, Li et al. [50] invited individuals to engage in social activities (Relationship), discuss recent events (Ordinary Life), and talk about work-related topics (Work).
The DailyDialog dataset serves multiple purposes, including enhancing social bonding. It contains rich emotions and is manually labeled to ensure high quality, and the dialogues cover various daily scenarios such as holidays, shopping, and restaurants. In contrast to social-media datasets such as the Twitter Dialog Corpus [18] and Chinese Weibo [51], the language in DailyDialog is human-written and often focuses on a specific topic; we therefore prefer this dataset for its formal writing style. The primary objective of this dataset is to provide a high-quality multi-turn dialogue corpus, which distinguishes it from most existing dialogue datasets.
As discussed above, reinforcement learning treats each multi-turn dialogue as a Markov Decision Process, in which the agent learns to determine the next action to take in the environment to complete the conversation according to specific criteria. Based on these characteristics, we chose this dataset for the proposed model.
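The MDP view of a multi-turn dialogue can be made concrete with a small sketch: the state is the conversation so far and the action is the next utterance. The dialogue lines below are invented examples, not drawn from DailyDialog.

```python
# Sketch of framing a multi-turn dialogue as MDP transitions:
# state = conversation history so far, action = the next utterance.
# Utterance texts are invented examples, not taken from DailyDialog.

dialogue = [
    "Do you want to grab lunch?",
    "Sure, where should we go?",
    "How about the new noodle place?",
    "Sounds great, see you at noon.",
]

def dialogue_to_transitions(turns):
    """Yield (state, action) pairs: state = history, action = next turn."""
    return [(tuple(turns[:i]), turns[i]) for i in range(1, len(turns))]

transitions = dialogue_to_transitions(dialogue)
```

Each pair is one decision point for the agent; an n-turn dialogue yields n−1 such transitions.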

Quantitative Evaluation
It is widely recognized that evaluation plays an essential role in developing conversational agents. We evaluate dialogue generation systems based on two criteria: one demonstrates a reasonable correlation with human judgment on the response generation task, while the other measures the consistency and coherence of the utterances in a conversation.
For the first metric, we use the BLEU (Bilingual Evaluation Understudy) score [52]. This metric is based on string matching: it compares consecutive n-grams of the generated response with the consecutive n-grams of the reference sentence and counts the number of matches with weighted scores. The BLEU score measures how many words of a generated response overlap with a reference response, and it is widely used to evaluate dialogue quality [27,53]. A higher BLEU score indicates that the generated response is more similar to the reference and is more likely to be rated as human-like.
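The clipped n-gram matching at the core of BLEU can be illustrated with a minimal sketch. Note that the full metric as defined by Papineni et al. [52] additionally combines several n-gram orders geometrically and applies a brevity penalty, both omitted here.

```python
# Simplified sketch of BLEU's modified (clipped) n-gram precision.
# Candidate matches of an n-gram are capped by its count in the reference.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
p1 = modified_precision(cand, ref, 1)  # unigram precision: 5/6
p2 = modified_precision(cand, ref, 2)  # bigram precision: 3/5
```

Clipping prevents a candidate from scoring highly by repeating a common reference word many times.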
For the second metric, we use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score, which has recently been employed to evaluate dialogue quality as well [54]. It encompasses multiple measures that assess the quality of a response by comparing it to human-generated references, using n-gram counts, word sequences, and word pairs to measure the similarity of the chatbot-generated output to the reference text.
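Where BLEU is precision-oriented, the ROUGE-N variant is recall-oriented: it asks what fraction of the reference's n-grams appear in the generated response. A minimal sketch (omitting ROUGE-L and the precision/F1 variants):

```python
# Simplified sketch of ROUGE-N recall: the fraction of reference
# n-grams that also occur in the candidate, with clipped counts.
from collections import Counter

def rouge_n(candidate, reference, n=1):
    def grams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
r1 = rouge_n(cand, ref, 1)  # unigram recall: 5/6
```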

Designing and Evaluating Experimental Models
Firstly, we deployed a baseline model based on BERT2BERT, which is widely chosen as a baseline in text generation tasks [43]. We then designed different experimental models to demonstrate the effectiveness of using different contexts for response generation in the conversational model. To this end, we designed three models as follows:

• BERT LC (only left-context based on BERT): This model addresses the shortcoming of traditional generative models that generate responses based only on the current sentence, without considering information from previous conversation turns. It reuses historical information to predict responses, exploiting contextual information from prior turns in the conversation.

• BERT RC (only right-context with BERT and reinforcement learning): Encoder-decoder architectures have been proven to yield good results in text generation tasks. However, they still lack the ability to process information by considering its impact on the future. To address this limitation, this approach leverages the strengths of the BERT2BERT method to capture the semantic information of the previous conversation and combines it with the power of reinforcement learning to manage the information flow of future conversation turns. We experimented with different context sizes (i.e., different numbers of utterances on the left and right sides of the context) to determine the most appropriate context length for these data. Our best model is then compared with the latest relevant studies.
Results are presented in Table 2. From these results, we make the following observations:

• We used the BLEU and ROUGE scores to measure how well the generated conversations correlate with human judgment. All three of our models improved the BLEU score compared to the baseline model, although the difference between the baseline and the model with only left-context is slight. BERT is a state-of-the-art language model that uses a Transformer-based architecture to learn contextual relationships between words in a text corpus, and one of its main advantages is its ability to integrate contextual information into its representation of text. Adding more contextual information may therefore yield only limited further gains.

• When using a single context, the left-context-based model improved the average BLEU score by 5% and the average ROUGE score by 7% over the baseline, while the right-context-based model achieved corresponding improvements of 8% and 10%. When using both the left-context and the right-context, the RL-based model generated more coherent outputs than the baseline model, improving the average BLEU score by 12% and the average ROUGE score by 13%. Figure 5 visually represents the scores of the four experimental models we trained. In general, using contextual information, whether left-context or right-context, helps the model better capture the mutual information among utterances in a conversation. Among the four experiments, the proposed model with full context achieved the highest BLEU and ROUGE scores because it considers not only history but also information in future utterances.
In addition, we evaluated our proposed model with simulated conversations of 1, 3, 5, and 7 turns (as shown in Table 3); these results are visually represented in Figures 6 and 7. Compared with the baseline, our best model increased the average BLEU score by 24% and the average ROUGE score by 29%. The simulated conversation is used for model training with reinforcement learning, which performs better as the length of the conversation increases; however, the model converged quickly at turn 5. Although BLEU and ROUGE scores are widely used for evaluating dialogue quality, they are mainly designed for comparing individual sentences rather than entire dialogues and can become noisy when the compared sequences are long. Moreover, in reinforcement learning, the reward function reflects the success of a model in achieving its goals. Our two reward functions are based only on word embeddings rather than on the sentence level, so they may not be strong enough to capture contextual information in long conversations. We see this as a reason why the RL algorithm may not remain effective beyond a certain threshold determined by the length of the simulated dialogue (MDP).
We also compare the proposed methods with relevant current methods (shown in Table 4). HRED [55] is a hierarchical model based on the encoder-decoder: it uses previous responses and encodes all past information to generate the most probable next token, and its experiments demonstrate improvements over earlier models. COHA [56] builds a model using emotion states and captures expressions from predefined emotions, showing that it can generate contextually and emotionally appropriate responses. Recently, the pre-trained dialogue generation model PLATO [57] has used hidden vectors to determine inherent features, applying attention mechanisms to combine contextual information and characteristic features; it showed an improvement over previous studies. Our proposed model shows a significant improvement, especially on higher-order BLEU indices such as BLEU2, BLEU3, and BLEU4. This indicates that the use of context in our model helps generate more accurate and coherent responses, so our proposed model can potentially improve the quality of automated conversation systems. Moreover, while the BLEU scores of earlier models drop sharply from BLEU1 to BLEU2, BLEU3, and BLEU4, our model's scores remain comparatively stable. This finding confirms that our model generates accurate responses with natural, human-like tendencies appropriate to the conversation content.
To conclude our analysis, we present a comprehensive summary and comparison of the proposed model's efficacy against our experimental models and relevant studies, relative to the baseline model (as shown in Table 5). The results in Table 5 clearly demonstrate that models built within a contextualization framework exhibit a significant enhancement in the coherence and consistency of a conversation. The extensive experimental results also provide compelling evidence for the effectiveness of the proposed methods in generating responses that are both reasonable and coherent. It can thus be inferred that building models using a context-based approach is a promising strategy for improving conversational coherence and consistency. These experimental findings provide strong support for this claim and highlight the potential benefits of adopting a context-driven approach in natural language processing.

Conclusions and Future Work
In this study, we developed a novel conversational model that leverages effective context to improve the coherence and consistency of conversations. Our model is based on a Transformer-based sequence-to-sequence model, which utilizes BERT to encode the current utterance's left-context. By employing a reinforcement learning strategy and building the corresponding reward function, we incorporate the right-context of conversations during training to enhance the generation model. The proposed model captures the flow of conversation and the relationships between utterances more effectively by utilizing both left-context and right-context. The left-context helps the system keep track of the conversation's history, including the topics discussed and any pertinent information mentioned earlier, assisting the system in generating responses appropriate to the current state of the conversation. The right-context, on the other hand, enables the system to anticipate the conversation's direction and generate more forward-looking responses.
Experimental results have shown that our proposals effectively improve the quality of generated responses. When utilizing a single context, the left-context-based model increased the average BLEU score by 5% and the average ROUGE score by 7% compared to the baseline, while the right-context-based model achieved improvements of 8% and 10%, respectively. When we employed both left and right contexts, the RL-based models generated more coherent results than the baseline model; specifically, our best model increased the average BLEU score by 24% and the average ROUGE score by 29%. We also compared the outcomes of our proposed models with those of other studies in the literature; our model consistently outperformed them, by 5% to 151% on the average BLEU score.

Theoretical Implications
Our proposal is a novel approach that considers contextual factors to improve the conversational model. Humans employ contextual information in everyday decision making, so mimicking this in chatbot systems is a sensible idea. We performed multiple experiments to show that incorporating both left- and right-contexts is more effective than using either one separately, and significantly more advantageous than not utilizing context at all. Based on our experiments, having more context can be beneficial when training a chatbot system: it can help the system better understand the user's intent and provide more accurate and relevant responses.
Moreover, we also proposed an approach that resolves issues with traditional neural network models in conversation response generation by integrating Transformer-based Seq2Seq models and RL. The experimental results demonstrate that our proposed model improves significantly over current relevant studies. From an academic perspective, there is still considerable room for research in this direction, which is well worth in-depth exploration by interested researchers.

Practical Implications
Building conversational agents that generate appropriate and meaningful responses is a challenging problem in the field of natural language processing. Moreover, consistency is important in chatbot conversations: it helps ensure that the chatbot's responses align with the user's purpose. Context and the ability to anticipate are among the critical factors humans use in daily communication.
Unlike earlier studies, we provide additional information to the chatbot by exploring contextual factors in the conversation. These factors help the chatbot generate appropriate responses that have a clear purpose and align with the context of the conversation. A consistent chatbot experience helps users feel comfortable and confident when using the chatbot, leading to a positive user experience and increasing the likelihood of the user returning to the chatbot.

Future Work
The use of BLEU and ROUGE scores for sentence comparison in chatbot systems has been widely debated due to uncertainty about their correlation with human judgments of response quality. Although BLEU and ROUGE have been extensively used for evaluating dialogue quality, they are primarily designed for comparing sentences rather than dialogues, so these scores can become noisy when the compared sequences are long. Better automatic evaluation metrics are needed for the future development of dialogue systems, and developing chatbots with human-like thinking capabilities remains challenging.
Building on the improved results of this study, we will design additional rewards based on the characteristics of human decision making. Such rewards can guide chatbot behavior to be more in line with human expectations, ultimately improving the quality of the chatbot and making conversations feel more natural. We also plan to enrich our model with a Large Language Model (LLM), such as GPT, and to incorporate popular RL techniques to build self-learning conversational agents. With these enhancements, chatbots can become more effective tools for communication, customer service, and other purposes.

Figure 1 .
Figure 1. The architecture of the pre-trained BERT model.

Figure 2 .
Figure 2. The architecture of the encoder. The encoder is initialized from multiple BERT blocks, each composed of a multi-head attention and a feed-forward network.

Figure 3 .
Figure 3. The architecture of the decoder, in which each BERT block has cross-attention layers added between the multi-head attention and the feed-forward network.

Figure 4 .
Figure 4. Bi-directional context for response generator using deep reinforcement learning.

Algorithm 1 summarizes the method used to model the bi-directional context chatbot:

Algorithm 1 DRL-Chat
Require: Input sequence (X), ground-truth output sequence (Y), and conversation history (C_L).
Pre-training Policy with Left-context:
1: Initialize the policy model π_θ based on pre-trained BERT.
2: for number of training iterations do
3: Run encoding on X, Y, C_L and obtain a contextualized encoded vector X_BERT.
4: Run decoding by feeding X_BERT to the decoder and obtain a response Ŷ.
5: Calculate the loss according to Equation (4) and update the parameters.
6: end for
Fine-tuning Policy with Right-context:
7: for number of training iterations do
8: Run policy π_θ and obtain response s_t.

• BERT FC (full context with BERT and reinforcement learning): This model combines the left-context and right-context to take advantage of both. The left-context allows the model to leverage the information already exchanged in the conversation, while the right-context helps the model capture the influence of the information flow on future conversation turns.
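The two training phases of Algorithm 1 can be sketched as a control-flow skeleton. `StubPolicy` below is a hypothetical stand-in for the BERT2BERT policy that merely counts updates; it is not the authors' implementation, and only makes the pre-train-then-fine-tune structure concrete.

```python
# Runnable skeleton of Algorithm 1's two phases with a stub policy.
# The stub counts updates so the control flow is observable.

class StubPolicy:
    def __init__(self):
        self.supervised_updates = 0
        self.rl_updates = 0

    def pretrain_step(self, x, y, c_left):
        # Steps 3-5: encode (X, Y, C_L), decode, supervised loss update.
        self.supervised_updates += 1

    def finetune_step(self, seed, k):
        # Steps 8-11: simulate k turns, score with Eq. (12), update via Eq. (17).
        self.rl_updates += 1

def train(policy, dataset, iterations, k):
    for _ in range(iterations):                # Pre-training (steps 2-6)
        for x, y, c_left in dataset:
            policy.pretrain_step(x, y, c_left)
    for _ in range(iterations):                # Fine-tuning (steps 7-12)
        policy.finetune_step(dataset[0], k)

policy = StubPolicy()
data = [("hi", "hello", []), ("how are you?", "fine", ["hi", "hello"])]
train(policy, data, iterations=3, k=5)
```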

Figure 5 .
Figure 5. Chart showing the performance of the experiments measured by BLEU score (a) and ROUGE score (b).

Figure 6 .
Figure 6. Chart showing the performance with different lengths of simulated conversation, measured by BLEU score (a) and ROUGE score (b).

Figure 7 .
Figure 7. Average BLEU and ROUGE scores with different lengths of simulated conversation.

Table 1 .
A conversation between two people in which the last turn requires information from earlier in the conversation. Two consecutive turns in the dialogue are denoted s_t and s_{t+1}. Each user utterance s_t = {w_1^t, w_2^t, …, w_{|s_t|}^t} is paired with an output sequence s_{t+1} = {w_1^{t+1}, w_2^{t+1}, …, w_{|s_{t+1}|}^{t+1}} that needs to be predicted, where w_k^t represents the k-th word of utterance s_t.

Table 2 .
Summarization results of different models.

Table 3 .
Model performance using different lengths of simulated conversation.

Table 4 .
Comparison of proposed model with recent studies.

Table 5 .
Rate of increase in BLEU score compared to our best proposed model.