Variational Reward Estimator Bottleneck: Towards Robust Reward Estimator for Multidomain Task-Oriented Dialogue

Abstract: Despite the effectiveness of adversarial training approaches to multidomain task-oriented dialogue systems, adversarial inverse reinforcement learning of the dialogue policy frequently fails to balance the performance of the reward estimator and the policy generator. During optimization, the reward estimator frequently overwhelms the policy generator, resulting in excessively uninformative gradients. We propose the variational reward estimator bottleneck (VRB), a novel and effective regularization strategy that constrains unproductive information flows between the inputs and the reward estimator. The VRB focuses on capturing discriminative features by exploiting an information bottleneck on the mutual information. Quantitative analysis on a multidomain task-oriented dialogue dataset demonstrates that the VRB significantly outperforms previous studies.


Introduction
While deep reinforcement learning (RL) has emerged as a viable solution for complicated and high-dimensional decision-making problems [1], including games such as Go [2], chess [3], checkers [4], and poker [5,6], robotic locomotion [7,8], autonomous driving [9,10], and recommender systems [11,12], determining an effective reward function remains a challenge, especially in multidomain task-oriented dialogue systems. Many recent studies have struggled in sparse-reward environments and have employed handcrafted reward functions as a workaround [13][14][15][16]. However, such approaches are typically incapable of guiding the dialogue policy toward user goals. For instance, as shown in Figure 1, the user cannot attain the goal because the system (S1) that exploits the handcrafted rewards ends the dialogue session too early. Moreover, the user goal frequently changes as the dialogue progresses.
Because of these problems, systems that exploit handcrafted rewards fail to assimilate user goals and to guide users toward them, resulting in low performance. Humans, in contrast, judge the dialogue context with a well-defined reward function in their minds and generate appropriate responses even in multidomain circumstances.
Inverse reinforcement learning (IRL) [18,19] and MaxEnt-IRL [17] tackle the problem of automatically recovering the reward function and using it to generate optimal behavior. Although generative adversarial imitation learning (GAIL) [20], which applies the GAN framework [21], has shown that the discriminator can be defined as a reward function, GAIL fails to generalize and to recover the reward function. Adversarial inverse reinforcement learning (AIRL) [22] extends GAIL to take advantage of disentangled rewards. Guided dialogue policy learning (GDPL) [23] uses the AIRL framework to construct the reward estimator for multidomain task-oriented dialogues. However, these approaches often encounter difficulties in balancing the performance of the reward estimator and the policy generator, and they produce excessively uninformative gradients.
In this paper, we propose the variational reward estimator bottleneck (VRB), a novel and effective regularization algorithm. The VRB uses the information bottleneck [24][25][26] to constrain unproductive information flows between dialogue internal representations and the state-action pairs of the reward estimator, thereby ensuring highly informative gradients and robustness. Our experiments show that the VRB achieves state-of-the-art (SOTA) performance on a multidomain task-oriented dataset. Returning to the example in Figure 1, a system (S2) that uses well-specified rewards can guide the user to the goal, whereas S1 cannot.
The remainder of this paper is organized as follows: Section 2 presents brief background to set the stage for our model. Section 3 describes the proposed method in detail along with its mathematical formulation. Section 4 outlines the experimental setup, whereas Section 5 presents the experiments and the results thereof. Section 6 provides a discussion and the conclusions of this study.

Dialogue State Tracker
The dialogue state tracker (DST) [27][28][29], which takes the dialogue action $a$ and the dialogue history as input, updates the dialogue state $x$ and the belief state $b$ for each slot. For example, as shown in Figure 2, the DST observes the user goal, namely where the user aims to go. At dialogue turn $t$, the dialogue action is represented as slot-value pairs (e.g., Attraction: (area, centre), (type, concert hall)). Given the dialogue action, the DST encodes the dialogue state $x_t$ for the current turn.
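As a concrete illustration, the sketch below shows a dictionary-based per-slot belief-state update driven by a dialogue action; the slot names and the flat dictionary representation are illustrative assumptions, not the actual DST used in our system.

```python
from typing import Dict, List, Tuple

# A dialogue action is a domain plus a list of (slot, value) pairs,
# e.g., ("Attraction", [("area", "centre"), ("type", "concert hall")]).
DialogueAction = Tuple[str, List[Tuple[str, str]]]


def update_belief_state(belief: Dict[str, str], action: DialogueAction) -> Dict[str, str]:
    """Overwrite the tracked value of every slot mentioned in the new action."""
    domain, slot_values = action
    updated = dict(belief)
    for slot, value in slot_values:
        updated[f"{domain}-{slot}"] = value
    return updated


# Example: the user asks for a concert hall in the centre.
belief_state: Dict[str, str] = {}
belief_state = update_belief_state(
    belief_state, ("Attraction", [("area", "centre"), ("type", "concert hall")])
)
print(belief_state)  # {'Attraction-area': 'centre', 'Attraction-type': 'concert hall'}
```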

User Simulator
Mimicking diverse, human-like actions is essential for training task-oriented dialogue systems and for evaluating these models automatically. The user simulator $\mu(a_u, t_u \mid x_u)$ [30,31] in Figure 2 extracts the dialogue action $a_u$ corresponding to the dialogue state $x_u$, and $t_u$ indicates whether the user goal is achieved during the conversation. Note that the DST and the user simulator cannot meet the user goal in the absence of well-defined reward estimation.
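Viewed abstractly, a user simulator is a callable that maps the user-side state to the next user action and a termination flag. The interface and the toy rule below are hypothetical illustrations only, not the agenda-based or VHUS simulators themselves.

```python
from typing import Any, Dict, Protocol, Tuple


class UserSimulator(Protocol):
    """µ(a_u, t_u | x_u): map the user-side state x_u to a user action a_u and a done flag t_u."""

    def __call__(self, user_state: Any) -> Tuple[Any, bool]:
        ...


def toy_simulator(user_state: Dict[str, Any]) -> Tuple[Dict[str, Any], bool]:
    """Toy rule: keep requesting goal slots until every one of them has been informed."""
    missing = [slot for slot in user_state["goal_slots"] if slot not in user_state["informed"]]
    action = {"request": missing}
    terminal = len(missing) == 0  # t_u is True once the user goal is satisfied
    return action, terminal
```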

Policy Generator
The policy generator [32,33] encourages the dialogue policy $\pi_\theta$ to determine the next action that maximizes the reward function $\hat{r}_{\zeta,\psi}(x_t, a_t, x_{t+1})$, using the clipped surrogate objective of PPO [33]:

$$ J(\theta) = \mathbb{E}_t\!\left[ \min\!\left( \xi_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(\xi_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t \right) \right], \qquad \hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l\,\delta_{t+l}, $$

where $\delta_t = \hat{r}_t + \gamma V_\theta(x_{t+1}) - V_\theta(x_t)$ is the TD residual [34], $\xi_t(\theta) = \frac{\pi_\theta(a_t \mid x_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid x_t)}$, and $V_\theta$ is the state-value function. $\epsilon$ and $\lambda$ are hyperparameters. The reward function $\hat{r}_{\zeta,\psi}$ can be simplified as $\hat{r}_{\zeta,\psi}(x_t, a_t, x_{t+1}) = f_{\zeta,\psi}(x_t, a_t, x_{t+1})$, where $f_{\zeta,\psi}$ is the reward estimator, which is defined as follows [22]:

$$ f_{\zeta,\psi}(x_t, a_t, x_{t+1}) = g_\zeta(x_t, a_t) + \gamma\,h_\psi(x_{t+1}) - h_\psi(x_t). $$
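The following PyTorch-style sketch illustrates the GAE advantage computation and the clipped surrogate loss described above. The tensor shapes and function signatures are illustrative assumptions, not our exact implementation.

```python
import torch


def gae_advantages(rewards, values, next_values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation from TD residuals δ_t = r_t + γV(x_{t+1}) − V(x_t)."""
    deltas = rewards + gamma * next_values - values
    advantages = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective: min(ξ_t(θ) Â_t, clip(ξ_t(θ), 1−ε, 1+ε) Â_t)."""
    ratio = torch.exp(log_probs - old_log_probs)        # ξ_t(θ)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```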

Notations on MDP
To represent inverse reinforcement learning (IRL) as a Markov decision process (MDP), we consider a tuple $M = (\mathcal{X}, \mathcal{A}, T, R, \rho_0, \gamma)$, where $\mathcal{X}$ is the state space and $\mathcal{A}$ is the action space. The transition probability $T(x_{t+1} \mid x_t, a_t)$ defines the distribution of the next state $x_{t+1}$ given the state $x_t$ and the action $a_t$ at time-step $t$. $R(x_t, a_t)$ is the reward function of the state-action pair, $\rho_0$ is the distribution of the initial state $x_0$, and $\gamma$ is the discount factor. The stochastic policy $\pi(a_t \mid x_t)$ maps a state to a distribution over actions. Supposing we are given an optimal policy $\pi^*$, the goal of IRL is to estimate the reward function $R$ from the trajectory $\tau = \{x_0, a_0, x_1, a_1, \ldots, x_T, a_T\} \sim \pi^*$. However, building an effective reward function is challenging, especially in a multidomain task-oriented dialogue system.
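To make the notation concrete, the following minimal sketch stores a trajectory τ as a list of state-action pairs and evaluates the discounted cumulative reward under a given reward function R; the data layout is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Trajectory:
    """τ = {x_0, a_0, x_1, a_1, ..., x_T, a_T}, stored as (state, action) pairs."""
    steps: List[Tuple[Any, Any]]


def discounted_return(traj: Trajectory,
                      reward_fn: Callable[[Any, Any], float],
                      gamma: float = 0.99) -> float:
    """Compute Σ_t γ^t R(x_t, a_t) along the trajectory."""
    return sum((gamma ** t) * reward_fn(x, a) for t, (x, a) in enumerate(traj.steps))
```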

Reward Estimator
The reward estimator [23], an essential component of multidomain task-oriented dialogue systems, evaluates dialogue state-action pairs at dialogue turn $t$ and estimates the reward that is used to guide the dialogue policy toward the user goal. Based on MaxEnt-IRL [17], each dialogue session $\tau$ in a set of human dialogue sessions $\mathcal{D} = \{\tau_1, \tau_2, \ldots, \tau_H\}$ can be modeled as a Boltzmann distribution that does not exhibit additional preferences for any dialogue session:

$$ p_\zeta(\tau) = \frac{1}{Z} \exp\!\left( R_\zeta(\tau) \right), $$

where $Z$ is a partition function, $\zeta$ is a parameter of the reward function, and $R_\zeta$ denotes the discounted cumulative reward. To imitate human behaviors, the reward estimator should learn the distribution of human dialogue sessions using the KL divergence loss

$$ \mathcal{L}_f(\zeta, \psi) = \mathrm{KL}\!\left( p_{\mathcal{D}}(\tau) \,\|\, p_\zeta(\tau) \right) - \mathrm{KL}\!\left( \pi_\theta(\tau) \,\|\, p_\zeta(\tau) \right) = -\,\mathbb{E}_{\tau \sim \mathcal{D}}\!\left[ R_\zeta(\tau) \right] + \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R_\zeta(\tau) \right] - H(\mathcal{D}) + H(\pi_\theta), $$

where $H(\mathcal{D})$ is the entropy of the human dialogue distribution and $H(\pi_\theta)$ is the entropy of the dialogue policy $\pi_\theta$. The reward estimator maximizes the entropy, which amounts to maximizing the likelihood of the observed dialogue sessions. Therefore, the reward estimator learns to discern between human dialogue sessions $\mathcal{D}$ and dialogue sessions that are generated by the dialogue policy.
Note that $H(\mathcal{D})$ and $H(\pi_\theta)$ do not depend on the parameters $\zeta$ and $\psi$. Thus, the reward estimator can be trained using gradient-based optimization as follows:

$$ \nabla_{\zeta,\psi}\,\mathcal{L}_f = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_{\zeta,\psi} R_{\zeta,\psi}(\tau) \right] - \mathbb{E}_{\tau \sim \mathcal{D}}\!\left[ \nabla_{\zeta,\psi} R_{\zeta,\psi}(\tau) \right], \qquad R_{\zeta,\psi}(\tau) = \sum_t \gamma^t f_{\zeta,\psi}(x_t, a_t, x_{t+1}). $$
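In code, this update amounts to raising the estimated reward on human sessions and lowering it on policy-generated sessions. The sketch below assumes pre-encoded batch tensors and a generic f_net module; it is a simplified form, not our exact training code.

```python
import torch


def reward_estimator_loss(f_net, human_batch, policy_batch):
    """Push f_{ζ,ψ} up on human state-action pairs and down on policy-generated ones.

    human_batch / policy_batch: tensors of (x_t, a_t, x_{t+1}) features for each pair.
    """
    f_human = f_net(human_batch)    # f on human dialogue sessions D
    f_policy = f_net(policy_batch)  # f on sessions sampled from π_θ
    return -(f_human.mean() - f_policy.mean())


# One gradient step (the optimizer and batches are assumed to exist):
# loss = reward_estimator_loss(f_net, human_batch, policy_batch)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```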

Variational Reward Estimator Bottleneck
The variational information bottleneck [24][25][26] is an information-theoretic approach that restricts unproductive information flow between the discriminator and its inputs. Inspired by this approach, we propose a regularized objective that constrains the mutual information between encoded original inputs and state-action pairs, thereby ensuring highly informative internal representations and a robust adversarial model. Our proposed method trains an encoder that is maximally informative regarding human dialogues.
To this end, we employ a stochastic encoder and an upper bound constraint on the mutual information between the dialogue states $X$ and the latent variables $Z$, $I(X, Z) \le I_c$, and construct the reward estimator from the encoded terms $D_g(z_g)$ and $D_h(z_h)$, based on GANs [21], GAN-GCL [35], and AIRL [22]. $D_g$ represents the encoded disentangled reward approximator with the parameter $\zeta$, and $D_h$ is the encoded shaping term with the parameter $\psi$. The stochastic encoder $E(z \mid x_t, x_{t+1})$ maps states to a latent distribution $z$: $E(z \mid x_t) = \mathcal{N}(\mu_E(x_t), \Sigma_E(x_t))$. $r(z) = \mathcal{N}(0, I)$ is a standard Gaussian, and $I_c$ stands for the enforced upper bound on the mutual information.
To optimize $\mathcal{L}_{f,E}(\zeta, \psi)$ under this constraint, the VRB introduces a Lagrange multiplier $\phi$:

$$ \mathcal{L}_{f,E}(\zeta, \psi, \phi) = \mathcal{L}_f(\zeta, \psi) + \phi \left( \mathbb{E}_{x}\!\left[ \mathrm{KL}\!\left( E(z \mid x) \,\|\, r(z) \right) \right] - I_c \right), $$

where the mutual information between the dialogue states $X$ and the latent variables $Z$ is bounded as

$$ I(X, Z) = \int p(x, z) \log \frac{p(x, z)}{p(x)\,p(z)} \, dx \, dz \;\le\; \mathbb{E}_{x}\!\left[ \mathrm{KL}\!\left( E(z \mid x) \,\|\, r(z) \right) \right]. $$

With this objective, the VRB minimizes the mutual information with the dialogue states so as to focus on discriminative features. The VRB also minimizes the KL divergence with the human dialogues while maximizing the KL divergence with the generated dialogues, thereby distinguishing effectively between samples from the dialogue policy and human dialogues. Our proposed model is summarized in Algorithm 1.
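The sketch below isolates the bottleneck machinery: a diagonal-Gaussian stochastic encoder, the KL term that upper-bounds I(X, Z), and a dual gradient step that adjusts the multiplier toward the constraint I_c. Layer sizes, the sampling scheme, and the update rule are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn


class StochasticEncoder(nn.Module):
    """E(z|x) = N(µ_E(x), Σ_E(x)) with a diagonal covariance."""

    def __init__(self, state_dim: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_std = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.body(x)
        mu, log_std = self.mu(h), self.log_std(h)
        z = mu + torch.randn_like(mu) * log_std.exp()  # reparameterized sample
        return z, mu, log_std


def kl_to_standard_normal(mu, log_std):
    """KL(N(µ, σ²) ‖ N(0, I)): the variational upper bound on I(X, Z)."""
    return 0.5 * (mu.pow(2) + (2 * log_std).exp() - 2 * log_std - 1).sum(dim=-1).mean()


def dual_update(phi: float, kl_value: float, i_c: float, step: float = 1e-4) -> float:
    """Raise the multiplier when the KL term exceeds I_c, lower it otherwise."""
    return max(0.0, phi + step * (kl_value - i_c))
```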

Algorithm 1 Algorithm of Variational Reward Estimator Bottleneck
1: Initialize the dialogue policy generator π_θ and the reward estimator f_{ζ,ψ}
2: for i ← 0 to N do
3:     Obtain random samples from the human dialogue corpus D
4:     Gather dialogue sessions using the user simulator µ(a_u, t_u | x_u) and the policy generator π_θ(a|x)
5:     Encode the dialogue sessions using the stochastic encoder E(z | x_t, x_{t+1})
6:     Update the reward estimator f_{ζ,ψ} by optimizing L_{f,E}(ζ, ψ)
7:     Estimate the reward function r̂_{ζ,ψ} for each state-action pair
8:     Update the state-value function V(x) and the dialogue policy π_θ given the reward r̂_{ζ,ψ}
9: end for

Dataset Details
We evaluated our method on the Multi-Domain Wizard-of-Oz dataset [36] (MultiWOZ), a large-scale corpus containing approximately 10,000 multidomain, multiturn conversational dialogues. MultiWOZ consists of 7 distinct task-oriented domains, 24 slots, and 4510 slot values. The dialogue sessions were randomly divided into training, validation, and test sets; the validation and test sets contain 1000 sessions each.

Models Details
We used the agenda-based user simulator [30] and the VHUS-based user simulator [31]. The policy network π_θ and the value network V are MLPs with two hidden layers, and g_ζ and h_ψ are MLPs with one hidden layer each. We used the ReLU activation function and the Adam optimizer for all MLPs and trained our model on a single NVIDIA GTX 1080 Ti GPU. Detailed hyperparameters are shown in Table 1. We compare the proposed method with the following previous studies: GP-MBCM [37], ACER [38], PPO [33], ALDM [39], and GDPL [23]. GP-MBCM [37] trains a number of policies on different datasets based on the Bayesian committee machine [40]. ACER [38] introduces importance-weight truncation with bias correction for sample efficiency. PPO [33] is an effective algorithm that attains robust and data-efficient performance using only first-order optimization. ALDM [39] presents an adversarial learning method that learns dialogue rewards directly from dialogue samples. GDPL [23] is the current SOTA model, which builds its dialogue reward estimator on IRL.
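A minimal PyTorch sketch of the networks described above; the hidden width and the input/output dimensions are placeholders (see Table 1 for the actual hyperparameters).

```python
import torch.nn as nn


def mlp(in_dim: int, out_dim: int, hidden: int, num_hidden_layers: int) -> nn.Sequential:
    """Build an MLP with ReLU activations, as used for every network below."""
    layers, dim = [], in_dim
    for _ in range(num_hidden_layers):
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)


state_dim, action_dim, hidden = 340, 170, 128  # placeholder dimensions only

policy_net = mlp(state_dim, action_dim, hidden, num_hidden_layers=2)  # π_θ
value_net = mlp(state_dim, 1, hidden, num_hidden_layers=2)            # V
g_net = mlp(state_dim + action_dim, 1, hidden, num_hidden_layers=1)   # g_ζ
h_net = mlp(state_dim, 1, hidden, num_hidden_layers=1)                # h_ψ
```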

Evaluation Details
To evaluate the performance of these models, we introduce four metrics: (i) Turns: we record the average number of dialogue turns between the user simulator and the dialogue agent. (ii) Match rate: we analyze whether the booked entities match the corresponding constraints in the multidomain environment.
For instance, in Figure 2, the entertainment request should be matched with a concert hall in the centre. The match rate ranges from 0 to 1 and scores 0 if an agent is unable to book the entity. (iii) Inform F1: we test the ability of the model to inform all of the requested slot values. For example, as shown in Figure 1, the price range, food type, and area should be informed if the user wishes to visit a high-end Cuban restaurant in Cambridge. (iv) Success rate: a dialogue session scores 0 or 1; it scores 1 only if all requested information is presented and every entity is booked successfully (the last two metrics are sketched in code below). Table 2 presents the empirical results on both simulators and MultiWOZ. In the agenda-based setting, we observe that our proposed method achieves a new SOTA performance.
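For reference, a simplified form of the inform F1 and success computations over sets of requested and informed slots is sketched below; the actual evaluator also checks the booked entities against the database, and the slot names are illustrative.

```python
from typing import Set


def inform_f1(requested: Set[str], informed: Set[str]) -> float:
    """F1 between the slots the user requested and the slots the system informed."""
    if not requested or not informed:
        return 0.0
    tp = len(requested & informed)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(informed), tp / len(requested)
    return 2 * precision * recall / (precision + recall)


def dialogue_success(requested: Set[str], informed: Set[str], all_entities_booked: bool) -> int:
    """A session scores 1 only if every requested slot is informed and every entity is booked."""
    return int(requested <= informed and all_entities_booked)


# Example: the user asked for the price range, food type, and area.
print(inform_f1({"pricerange", "food", "area"}, {"pricerange", "food"}))       # 0.8
print(dialogue_success({"pricerange", "food"}, {"pricerange", "food"}, True))  # 1
```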

Experimental Results of Agenda-Based User Simulators
Note that an outstanding model should obtain high scores in every metric, not just a single one: for a dialogue to be regarded as successfully completed, every request should be informed precisely, thereby guiding the dialogue toward the user goal. Although GDPL achieves the highest score in Inform F1, our proposed model behaves more human-like with respect to Turns, coming closest to the human score of 7.37, and provides more accurate slot values and matched entities than the other methods.

Experimental Results of VHUS-Based User Simulators
In the VHUS setting, on the other hand, PPO behaves more human-like in Turns but exhibits greater difficulty in providing accurate information, whereas our model does not because it constrains unproductive information flows. The results in Table 3 demonstrate that our proposed model outperforms the existing models, providing more definitive information than the other methods. As in the agenda-based setting, our model also shows the best performance in the VHUS-based setting, which demonstrates that our methodology, reflecting human-like characteristics, is highly effective.

Verification of Robustness
As shown in Figures 3 and 4, to evaluate the robustness of the models, we repeat the experiments more than 30 times for each model and visualize the results using violin plots. The experimental results show that our proposed method outperforms PPO in every metric, despite some negative outliers, and has a much lower standard deviation than PPO. An example dialogue session comparing the VRB and PPO is available in Table 4.

Conclusions
In this paper, we present a novel and effective regularization method, the variational reward estimator bottleneck (VRB), for multidomain task-oriented dialogue systems. The VRB includes a stochastic encoder, which enables the reward estimator to be maximally informative, and an information bottleneck regularizer, which constrains unproductive information flows between the reward estimator and its inputs. The quantitative results show that the VRB achieves new SOTA performance on two different user simulators and a multiturn, multidomain task-oriented dialogue dataset. Despite these improvements, training the dialogue policy in the VHUS setting remains a hurdle to overcome; we leave this for future work.