A Knowledge-Enhanced Hierarchical Reinforcement Learning-Based Dialogue System for Automatic Disease Diagnosis

Abstract: Deep reinforcement learning is a key technology for diagnosis-oriented medical dialogue systems, which determine the type of disease from the patient's utterances. Existing dialogue models for disease diagnosis perform poorly because of the large number of symptoms and diseases. In this paper, we propose a knowledge-enhanced hierarchical reinforcement learning model for strategy learning in a medical dialogue system for disease diagnosis. The hierarchical strategy alleviates the problem of a large action space in reinforcement learning. In addition, the knowledge enhancement module integrates a learnable disease–symptom relation matrix and a medical knowledge graph into the hierarchical strategy to achieve a higher diagnosis success rate. Experiments on a medical dialogue dataset for automatic disease diagnosis show that the proposed model is effective.


Introduction
Intelligent medical technology is attracting growing attention because it can relieve physicians' work pressure and improve their efficiency. It has achieved excellent results in fields such as medical text summarization [1][2][3], medical QA [4][5][6] and biomedical information extraction [7][8][9]. Machine learning models are now widely used in disease diagnosis. Siddhartha et al. [10] and Finale et al. [11] achieved promising results on disease recognition using electronic medical records and supervised learning models.
With the development of deep learning techniques, task-oriented dialogue has been widely used in restaurant reservation [12], movie reservation [13] and online shopping [14]. In the medical realm, scholars have proposed dialogue models for automatic disease diagnosis. Specifically, disease diagnosis is regarded as a Markov decision process, and a dialogue model collects symptoms by interacting with the patient, which reduces the great effort of building an electronic medical record for each disease. Such a dialogue system not only provides convenience for patients, but also offers a preliminary diagnosis to support doctors' consultations.
Policy learning is the key technology for dialogue-based disease diagnosis, and it is also widely used in task-oriented dialogue systems [15][16][17]. However, previous methods have the following drawbacks.
Firstly, most existing models are based on single-layer reinforcement learning strategies that treat all diseases and their associated symptoms equally. Wei et al. [18] regard the symptom acquisition process over multiple rounds of consultation between the agent and the patient as a Markov decision process and train the agent with a reinforcement learning algorithm. However, when the number of diseases and symptoms is large, a single-layer strategy that mixes symptom inquiry actions and disease diagnosis actions gives the agent an excessively large action space, which lowers the diagnosis success rate.
Secondly, when the agent selects symptoms for inquiry, existing methods may not pre-classify the possible symptoms in the current state, so more irrelevant symptoms are asked about.
Thirdly, a few approaches [19,20] integrate a two-level hierarchical strategy into dialogue strategy learning. The high-level strategy consists of a model called the master, which is responsible for triggering a low-level model. The low-level strategy consists of several symptom checkers and a disease classifier. Although these approaches adopt hierarchical reinforcement learning, they ignore the medical knowledge and disease-symptom relationships that are closely related to the diagnosis task, which brings in irrelevant symptoms and may harm the diagnosis success rate.
In this paper, we propose KNHRL, a hierarchical reinforcement learning (HRL) model that integrates medical knowledge and disease-symptom relations into a dialogue model for disease diagnosis. Compared with the previous HRL model for disease diagnosis, KNHRL incorporates a learnable disease-symptom relation matrix and a knowledge graph to assist the agent in decision-making. By incorporating co-occurrence probabilities between symptoms, the model can quickly and comprehensively ask about implicit symptoms that are relevant to the known symptom information, rather than asking about irrelevant symptoms. Knowledge of the relationship between diseases and symptoms further supports the accuracy of the diagnosis. Moreover, KNHRL performs pre-classification before the low-level strategy makes decisions, separating the action of asking about symptoms from the action of diagnosing a disease. In this way, the agent can collect symptoms that are more likely to be associated with the disease the user is suffering from. The major contributions of this paper can be summarized as follows.

• We incorporate the learned medical knowledge into the low-level strategy of an HRL model, which can further improve the symptom matching rate and the diagnosis success rate.

• Inspired by the process of a doctor's consultation in real life, we leverage a classifier to feed the user's disease probabilities into the system state, and propose a new decision-making method that considers the medical knowledge graph and the learned disease-symptom relation matrix.

• The proposed KNHRL model outperforms strong baseline methods on a publicly available medical dialogue dataset for automatic disease diagnosis.

Related Work
Hierarchical reinforcement learning (HRL) methods decompose a huge action space and have been applied in visual navigation, natural language processing, recommendation systems, video description generation and other everyday domains [21][22][23][24][25][26][27][28]. For a four-legged robot path tracking task, Jain et al. [29] took full advantage of the hierarchical structure and temporal decoupling of HRL to use different state representations for the upper and lower controllers. The model separated the concerns of position estimation and motion control to ensure the reusability of the lower-level strategies. In a multi-goal-oriented task for an 18-degree-of-freedom robot, Li et al. [30] pre-trained skills that could achieve simple goals and then learned to plan over these skills.
Budzianowski et al. [31] utilized the strong transfer ability of HRL to build a cross-domain dialogue system, which learned shareable information in similar subdomains of different main domains to train a general underlying policy. Saha et al. [32,33] leveraged the HRL framework to learn a multi-intent dialogue policy. Their algorithm introduced emotion-based immediate rewards into the basic rewards of the dialogue system, making the question-answering robot adaptive so as to maximize user satisfaction. Saleh et al. [34] devised a variational sequence model that no longer considered only word-level information, but built a reward model at the discourse level to give the model a more global view.
Reinforcement learning (RL) has become the mainstream method for automatic disease diagnosis in dialogues [35][36][37]. Wei et al. [18] leveraged a DQN to select additional symptoms to ask about during conversations with patients, which could greatly improve the accuracy of diagnosis. Hou et al. [36] proposed a multi-level reward RL-based model that could improve both the performance and the speed of convergence. Teixeira et al. [37] customized the RL settings by leveraging the dialogue data. Existing hierarchical reinforcement learning strategies [19] usually ignore the knowledge and disease-symptom relationships that are closely related to the diagnosis task, which harms the diagnosis success rate.
In addition, there is work on knowledge enhancement for diagnosis. Xu et al. [38] proposed a Knowledge Routing Dialogue System (KR-DS), which embedded rich medical knowledge into topic switching in the dialogue management module to assist agent decision-making. Liu et al. [39] introduced a supervised diagnostic model (a mapping between symptoms and diseases) into the external environment, thereby improving the agent's ability to collect symptoms that are more helpful for diagnosis. However, these models do not effectively incorporate the knowledge-enhanced disease-symptom relation into HRL models.

Model Overview
The task of reinforcement learning is to learn how to take actions based on the current environmental state in order to maximize the expected return. In RL-based models for automatic diagnosis, the action space of the agent is $A = D \cup S$, where $D$ is the set of all diseases and $S$ is the set of all symptoms associated with these diseases. Given the state $s_t$ at turn $t$, the agent takes an action according to its policy, $a_t \sim \pi(a \mid s_t)$, and receives an immediate reward $r_t = R(s_t, a_t)$ from the environment. If $a_t \in S$, the agent chooses a symptom to ask the user about, and the user responds with True/False/Unknown. If $a_t \in D$, the agent informs the user of the corresponding disease as the diagnosis result and the dialogue session is terminated, marked as a success or a failure according to the correctness of the diagnosis.
The Markov Decision Process is introduced to simplify the model. It assumes that the state transition model has the Markov property, meaning that the transition of states depends solely on the current state. Consequently, the problem of reinforcement learning can be formulated as a Markov Decision Process.
The disease diagnosis model can be expressed as a Markov Decision Process $M = \langle S, A, R, P, \gamma \rangle$. Here $S = \{S^h\} \cup \{S^l_i \mid i = 1, \ldots, n_l\}$ is the set of all states, where $S^h$ is the state of the agent in the high-level strategy (dubbed the high-level agent), $S^l_i$ is the state of the agent in the $i$th low-level strategy (dubbed the low-level agent), and $n_l$ is the number of low-level agents. Likewise, $A = \{A^h\} \cup \{A^l_i \mid i = 1, \ldots, n_l\}$ is the set of all actions, where $A^h$ is the high-level agent action and $A^l_i$ is the $i$th low-level agent action. $R$ is the collection of dialogue rewards, $P$ is the state transition model, and a policy $\pi$ maps each state in $S$ to a distribution over the actions in $A$. $\gamma$ is the discount rate used to compute the Q value function. The goal of the model is to optimize the Markov Decision Process $M$ and find the policy $\pi$ that maximizes the cumulative discounted reward over all $\langle S, A \rangle$.
This paper proposes a knowledge-enhanced hierarchical reinforcement learning model, KNHRL, for the disease diagnosis task. To reduce the action space, KNHRL divides the strategies of the disease diagnosis task into two levels, namely a high-level strategy and a low-level strategy. This idea is inspired by hospital consultations in the real world.
Figure 1 shows the framework of the KNHRL model. The high-level agent receives the current state $s_t$ and selects a low-level agent to talk to the user simulator for symptom collection. The low-level strategy consists of multiple agents, each responsible for collecting the symptoms relevant to a different group of diseases. Each low-level agent consists of a disease classifier and a deep Q network (DQN) with knowledge embeddings. When a doctor asks about a patient's symptoms, they first consider which disease the patient may have and then ask about the symptoms related to that disease. Following this process, before using the knowledge-embedded DQN for decision-making, we first use the disease classifier to obtain the probability distribution over the diseases of the low-level agent in the current state, which assists the subsequent DQN in decision-making. Doctors also combine their own medical experience and knowledge when asking about symptoms, so on top of the basic DQN we add information on past diseases, dependencies between symptoms, and disease-symptom knowledge graphs, all of which can be learned during training. We call the resulting model the knowledge-embedded DQN.
The low-level agent collects symptoms through a dialogue with a user simulator, which gives feedback (True/False/Unknown) about the symptoms asked by the agent. The model rewards the agent based on this feedback; this is the low-level reward. The low-level agent then decides whether to continue symptom collection according to the reward obtained. If it continues, it updates its policy and selects further symptoms to ask the user simulator about. When the low-level agent stops collecting symptoms, the accumulated low-level rewards become the reward of the high-level agent, which updates its policy accordingly and then either selects a low-level agent to collect symptoms or chooses the disease classifier to make a diagnosis.

Knowledge Construction
This paper leverages disease-symptom relation modules and medical knowledge graphs to assist decision-making in the low-level policies. The medical knowledge graph is constructed from diseases and their related symptoms; Figure 2 shows an example. In Figure 2, the blue entities are diseases and the green entities are symptoms. Each edge between a disease entity and a symptom entity carries two weights: the symptom probability given the disease, $P(sym \mid dis)$, and the disease probability given the symptom, $P(dis \mid sym)$. The two probabilities are calculated from the occurrences of diseases and symptoms in the dataset, and together they form a disease-symptom relation matrix.
The elements of the dis_sym matrix are the symptom probabilities given the disease, $P(sym \mid dis)$, and the elements of the sym_dis matrix are the disease probabilities given the symptom, $P(dis \mid sym)$. Note that each medical knowledge graph is built from the diseases and related symptoms that one low-level agent is responsible for. In this paper, a total of nine medical knowledge graphs are established, each with a corresponding disease-symptom relation matrix.
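The two edge weights can be estimated by simple counting. The following is a minimal sketch, assuming the dataset can be read as a list of (disease, symptom set) records; the record format, the function name and the toy disease names are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def conditional_probabilities(records):
    """Estimate P(sym|dis) and P(dis|sym) from co-occurrence counts.

    records: list of (disease, set_of_symptoms) pairs (hypothetical format).
    Returns two dicts keyed by (disease, symptom).
    """
    disease_count = Counter()
    symptom_count = Counter()
    pair_count = Counter()
    for disease, symptoms in records:
        disease_count[disease] += 1
        for symptom in symptoms:
            symptom_count[symptom] += 1
            pair_count[(disease, symptom)] += 1

    # P(sym|dis): share of records with this disease that show the symptom.
    p_sym_given_dis = {
        (d, s): n / disease_count[d] for (d, s), n in pair_count.items()
    }
    # P(dis|sym): share of records with this symptom that have the disease.
    p_dis_given_sym = {
        (d, s): n / symptom_count[s] for (d, s), n in pair_count.items()
    }
    return p_sym_given_dis, p_dis_given_sym


# Toy records, invented for illustration only.
records = [
    ("flu", {"fever", "cough"}),
    ("flu", {"fever"}),
    ("asthma", {"cough", "wheezing"}),
]
p_sd, p_ds = conditional_probabilities(records)
```

Pairs that never co-occur are simply absent from the dictionaries, which corresponds to a missing edge in the knowledge graph.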
We can learn the disease-symptom relation matrix from the dataset. Note that the relation matrix is also built per low-level agent. The disease-symptom relation matrix is a concatenation of the disease-disease matrix (denoted dis_dis), the disease-symptom matrix (dis_sym), the symptom-disease matrix (sym_dis), and the symptom-symptom matrix (sym_sym), as given by Formulas (1)-(3). Formulas (1) and (2) concatenate along the first dimension, and Formula (3) concatenates along the 0th dimension. Since the diseases in each low-level agent belong to the same department, this paper does not consider relationships between diseases; that is, the dis_dis matrix is set to a zero matrix of size $n_{dis} \times n_{dis}$.
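The concatenation order described above can be sketched as follows. This assumes that dis_dis and dis_sym form the upper block row and sym_dis and sym_sym the lower one, which is the only arrangement whose shapes line up; the helper names and the toy values are illustrative.

```python
def hconcat(left, right):
    """Concatenate two matrices (lists of rows) along dimension 1."""
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def build_relation_matrix(dis_sym, sym_dis, sym_sym):
    """Assemble the (n_dis + n_sym) x (n_dis + n_sym) relation matrix."""
    n_dis = len(dis_sym)
    # Disease-disease relations are not modelled, so this block is all zeros.
    dis_dis = [[0.0] * n_dis for _ in range(n_dis)]
    upper = hconcat(dis_dis, dis_sym)  # joined along dimension 1
    lower = hconcat(sym_dis, sym_sym)  # joined along dimension 1
    return upper + lower               # joined along dimension 0


# Toy example with 2 diseases and 3 symptoms (values are made up).
dis_sym = [[0.8, 0.1, 0.0], [0.2, 0.0, 0.9]]
sym_dis = [[0.7, 0.3], [1.0, 0.0], [0.0, 1.0]]
sym_sym = [[1.0, 0.2, 0.0], [0.2, 1.0, 0.1], [0.0, 0.1, 1.0]]
relation_matrix = build_relation_matrix(dis_sym, sym_dis, sym_sym)
```

The result is square, with one row and one column per disease followed by one per symptom.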

Deep Reinforcement Learning Model for Disease Diagnosis
DQN-based models are among the most popular methods for disease diagnosis. In automatic disease diagnosis, the main elements of a DQN-based model are the current state $s_t$, the policy $\pi$, the current action $a_t$, and the immediate reward $r_t$. The current state $s_t$ is the concatenation of the 3-dimensional one-hot vectors $z_i$ of all symptoms, and each dimension of $z_i$ represents a different status of the symptom: $z_i = (1, 0, 0)$ means that the patient has the symptom (True), $z_i = (0, 1, 0)$ means that the patient does not have the symptom (False), and $z_i = (0, 0, 1)$ means that the patient does not know whether they have the symptom (Unknown). For symptoms not yet asked about by the agent, $z_i = (0, 0, 0)$. Therefore, the current state $s_t$ contains not only the information of the current round, but also the previous actions of the agent and the patient and the symptom information collected so far. Under this definition, the current state $s_t$ can be expressed as Formula (4): $s_t = [z_1, z_2, \ldots, z_{n_s}]$,
where $n_s$ is the number of symptoms. The policy $\pi$ describes the behavior of the agent. Given the current state $s_t$, the policy can be expressed as $\pi(a \mid s_t)$, which gives the probability distribution over all possible agent actions in state $s_t$. The current action $a_t$ is the action of the agent obtained according to the policy $\pi(a \mid s_t)$ in the current state $s_t$, as expressed in Formula (5): $a_t \sim \pi(a \mid s_t)$.
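The state encoding described above can be illustrated with a short sketch. The function name, the dictionary-based input and the symptom names are assumptions made for the example.

```python
# One-hot codes for the three answer types; unasked symptoms stay all-zero.
STATUS_TO_ONE_HOT = {
    "True": (1, 0, 0),
    "False": (0, 1, 0),
    "Unknown": (0, 0, 1),
}
NOT_ASKED = (0, 0, 0)


def encode_state(symptoms, answers):
    """Concatenate one 3-dimensional vector per symptom, in a fixed order.

    symptoms: ordered list of all symptom names.
    answers: dict mapping already-asked symptoms to "True"/"False"/"Unknown".
    """
    state = []
    for symptom in symptoms:
        state.extend(STATUS_TO_ONE_HOT.get(answers.get(symptom), NOT_ASKED))
    return state


symptoms = ["fever", "cough", "headache", "nausea"]
answers = {"fever": "True", "cough": "False", "headache": "Unknown"}
state = encode_state(symptoms, answers)
```

With $n_s$ symptoms the state has $3 n_s$ entries, so the four symptoms above give a 12-dimensional vector.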
The action space $A$ of the agent is the union of the set of all diseases $D$ and the set of their associated symptoms $S$, that is, $A = D \cup S$. The immediate reward $r_t$ is the reward obtained from the user simulator when the agent, in state $s_t$, takes action $a_t$ according to the policy $\pi$; it is used to update the policy.
Given the above elements, disease diagnosis with a deep reinforcement learning model proceeds as follows. In state $s_t$, the agent selects an action $a_t$ according to the policy $\pi$. The agent follows an $\epsilon$-greedy policy when selecting an action: with probability $1 - \epsilon$ it chooses the optimal action, and with probability $\epsilon$ it chooses an action at random. When $a_t \in S$, the agent asks the user simulator about a symptom, and the user simulator returns feedback (True/False/Unknown) and the corresponding reward. Based on the feedback, the agent sets the value at the position of that symptom in $s_t$ and updates the policy to select the next action. When $a_t \in D$, the agent informs the user simulator of a disease, the dialogue is judged a success or a failure according to whether the informed disease is correct, and the agent receives the corresponding reward and continues to update the policy.
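The $\epsilon$-greedy rule can be written in a few lines. This is a generic sketch rather than the paper's implementation; it assumes the optimal action is the one with the highest estimated Q value, and the function name and the `rng` parameter are illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen by the epsilon-greedy rule.

    With probability epsilon a uniformly random action is returned,
    otherwise the action with the highest Q value.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


q_values = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
random_action = epsilon_greedy(q_values, epsilon=1.0)
```

Setting `epsilon=0.0` always exploits, and `epsilon=1.0` always explores.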
The goal of the agent is to find a policy that maximizes the expected cumulative discounted reward (the optimal policy). The Q value function gives the expected reward obtained by selecting action $a_t$ according to policy $\pi$ in state $s_t$. It is calculated as in Formula (6), where $Q^\pi(s_{t+1}, a_{t+1} \mid \theta')$ is the Q function of the target network, $\theta$ is the parameter of the current network, $\theta'$ is the parameter of the target network obtained from the previous iteration, and $\gamma \in [0, 1]$ is the discount factor. When $\gamma = 0$, only the reward of the current round is considered. When $\gamma = 1$, the rewards of the current round and of subsequent rounds are treated equally. When $\gamma \in (0, 1)$, the reward of the current round is weighted more heavily than those of subsequent rounds. Since the agent seeks the policy that maximizes the cumulative discounted reward, the optimal Q value function $Q^*$ is the maximum of the Q value function over all policies, namely $Q^*(s_t, a_t) = \max_\pi Q^\pi(s_t, a_t \mid \theta)$. A policy $\pi^*$ whose Q value function attains this maximum for all states and actions is called the optimal policy.
Notably, the DQN parameterizes the policy, so the policy is updated by training the DQN. In each iteration, the DQN takes the current state as input and outputs the Q value computed by the current network. The parameter $\theta$ is updated at each training iteration by minimizing the error between the Q value computed by the current network and the Q value of the target network (that is, the Q value obtained from the Bellman equation).
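The training signal described above can be sketched with the standard one-step DQN target. This is the textbook form, assumed here because the paper's own formula is not reproduced in this text; the `done` flag for terminal transitions and the function names are likewise assumptions.

```python
def td_target(reward, next_q_values, gamma, done):
    """One-step Bellman target computed from the target network's Q values.

    next_q_values: target-network Q values for every action in the next state.
    done: True if the dialogue ended, in which case nothing is bootstrapped.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def mse_loss(targets, predictions):
    """Mean squared error between Bellman targets and current-network Q values."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


target = td_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.9, done=False)
terminal_target = td_target(reward=-2.0, next_q_values=[0.5, 2.0], gamma=0.9, done=True)
loss = mse_loss([2.0, 0.0], [1.0, 0.0])
```

In a full implementation the loss would be minimized by gradient descent on the current network's parameters while the target network is held fixed between updates.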

High-Level Strategy for HRL
In the dataset for dialogue-based disease diagnosis constructed by Liao et al. [19], the diseases are divided into nine subsets by department, and each subset contains ten diseases. The subsets are mutually disjoint, and their relationship can be written as $D = D_1 \cup D_2 \cup \cdots \cup D_9$ with $D_i = \{d_1, d_2, \ldots, d_{10}\}$, where $D_i$ represents the disease set of the $i$th subset and $d_k$ represents the $k$th disease in the $i$th disease subset. In hierarchical reinforcement learning (HRL), each agent in the low-level policy is responsible for collecting the symptoms associated with one disease subset, and the high-level policy is responsible for selecting which low-level agent should work. Informing the user simulator of the diagnosed disease is carried out by a disease classifier, which sits at the same level as the low-level agents and is likewise selected by the high-level policy.
According to the task of the high-level policy, the action space of the high-level agent is given by Formula (10): $A^h = \{l_i \mid i = 1, \ldots, 9\} \cup \{d\}$, where $l_i$ is the $i$th agent in the low-level policy and $d$ is the disease classifier. After receiving the current state $s_t$, the high-level agent selects an action $a^h_t$ according to the current policy $\pi^h$. $a^h_t$ is a 10-dimensional vector (nine entries for the low-level agents and one for the disease classifier) that indicates which low-level agent is selected for symptom collection, or that the disease classifier is selected to inform the disease. When the high-level agent triggers a low-level agent, it proceeds to the next step only after that low-level agent has finished its work. After the low-level agent finishes, the rewards it received from the user simulator in each round are accumulated as the reward of the high-level agent, called the high-level reward. It is calculated by Formula (11), where $t'$ indexes the dialogue rounds of the agent in the low-level policy, $T$ is the total number of dialogue rounds of that agent, $\gamma_h$ is the discount factor, $r^l_{t+t'}$ is the reward that the low-level agent gets from the user simulator in the corresponding round, and $r^{dl}_t$ is the reward from the user simulator for the disease classifier. The goal of the high-level agent is to maximize the expected cumulative discounted high-level reward. The Q value function represents the expected reward of the high-level agent, and its Bellman equation can be written as Formula (12), where $\theta_h$ is the parameter of the current high-level policy network, $s_{t+1}$ is the next dialogue state observed by the high-level agent after taking action $a^h_t$ according to the policy $\pi^h$ in state $s_t$, $a^h_{t+1}$ is the action taken by the high-level agent in $s_{t+1}$, and $\gamma_h$ is the discount factor. The high-level policy network is a three-layer DQN, and its parameters $\theta_h$ are updated during training by reducing the mean squared error between the Q value computed by the current network and the Q value of the target network obtained from the Bellman equation. This mean squared error is therefore used as the loss function of the high-level policy network, as shown in Formula (13). The first term in the squared difference is the Q value of the target network obtained from the Bellman equation, and the second term is the Q value computed by the current network.
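The accumulation of low-level rewards into the high-level reward can be sketched as below. The exact form of the paper's formula is not reproduced in this text, so the discounting convention (round $t'$ weighted by $\gamma_h^{t'}$, starting at $t' = 1$) is an assumption, as are the function and parameter names.

```python
def high_level_reward(chose_classifier, low_level_rewards, classifier_reward, gamma_h):
    """Reward credited to the high-level agent for one of its actions.

    chose_classifier: True if the high-level action was the disease classifier.
    low_level_rewards: per-round rewards collected by the selected low-level agent.
    classifier_reward: reward returned by the user simulator for the diagnosis.
    """
    if chose_classifier:
        return classifier_reward
    # Assumed convention: the reward of low-level round k is discounted by gamma_h ** k.
    return sum(gamma_h ** k * r for k, r in enumerate(low_level_rewards, start=1))


undiscounted = high_level_reward(False, [1, -1, 1], 0, gamma_h=1.0)
discounted = high_level_reward(False, [1, 1], 0, gamma_h=0.5)
diagnosis = high_level_reward(True, [], 22, gamma_h=0.9)
```

With $\gamma_h = 1$ this reduces to a plain sum of the low-level rewards.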

Low-Level Strategy for Knowledge Enhanced Decision-Making
The low-level agent is triggered by the high-level agent and is responsible for collecting symptoms by talking to the user simulator. Figure 3 shows the process in which the high-level agent selects low-level agents. In Figure 3, $l_1$, $l_5$ and $dl$ are actions of the high-level agent: $l_1$ and $l_5$ indicate that the high-level agent has selected the first and fifth low-level agents, respectively, and $dl$ indicates that it has selected the disease classifier for diagnosis. Taking the first low-level agent as an example, $a^1_k$ is its action in the $k$th dialogue round. When the low-level agent asks about the same symptom again or the number of dialogue rounds reaches the specified upper bound, its work ends. The rewards obtained by the low-level agent in each round of dialogue (low-level rewards) are accumulated and returned to the high-level agent, which then makes its next selection. The disease set covered by the $i$th low-level agent is $D_i$, and the associated symptom set $S_i$ is its action space. Next, we illustrate how medical knowledge assists the low-level agents in decision-making, as shown in Figure 4.
In Figure 4, $s^i_t$ is the current state extracted by the $i$th low-level agent from $s_t$. The extraction proceeds as follows: when the low-level agent $l_i$ is selected by the high-level agent, the high-level agent passes the current state $s_t$ to $l_i$. $l_i$ then extracts from $s_t$ the entries of the symptoms belonging to the diseases it is responsible for and takes them as its own current state. The extraction is shown in Formula (14): $s^i_t = [z^i_1, z^i_2, \ldots, z^i_{n_{S_i}}]$, where $n_{S_i}$ is the number of symptoms associated with the diseases in the $i$th low-level agent and $z^i_k$ is the one-hot vector of the $k$th symptom in the $i$th low-level agent. We want to preliminarily screen the diseases that the patient may have in the current state, so that the low-level agent more readily collects symptoms related to the likely diseases, which improves the diagnosis success rate. Therefore, we design a disease pre-classification module for each low-level agent, which contains a disease classifier consisting of a two-layer MLP. Specifically, before the DQN makes a decision, the current state is input into the disease classifier, which outputs a 10-dimensional vector $dl_i$ representing the predicted probability score of each disease in the agent. We concatenate the output vector $dl_i$ with the current state $s^i_t$ according to Formula (15). The result $S^i_t$ is the new current state containing disease information, and feeding it into the DQN yields the original action $a^{i_d}_t$ of the low-level agent in the current state $S^i_t$, as shown in Formula (16).
We want to use the relation information between diseases and symptoms in the dialogue history as "experience" to assist decision-making in the current state. Therefore, we design a relation module to capture this "experience". The relation module contains a matrix $R \in \mathbb{R}^{|A_i| \times |A_i|}$, where $A_i = D_i \cup S_i$, which learns the relation between each symptom and disease during training. Specifically, the original action $a^{i_d}_t$ obtained by the low-level agent is multiplied by the relation matrix $R$, as shown in Formula (17): $a^{i_r}_t = a^{i_d}_t \cdot R$.
$a^{i_r}_t$ is the action of the low-level agent augmented by the relation matrix; each of its elements is a weighted sum of the original action scores, with weights taken from the relation matrix. The matrix $R$ is initialized with the relation matrix (relation_matrix) established in Section 4, which contains the dependencies between diseases and symptoms in the dataset. During model training, $R$ learns the dependencies between diseases and symptoms from the dialogues between the low-level agent and the user simulator through backpropagation.
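The relation-matrix augmentation amounts to a vector-matrix product. The sketch below assumes the original action is a row vector of scores over $A_i$ that is multiplied from the left onto $R$; the function name and the toy numbers are illustrative.

```python
def apply_relation(action_scores, relation):
    """Multiply a row vector of action scores by the relation matrix R.

    Element j of the result is the sum over i of action_scores[i] * relation[i][j],
    i.e. a weighted sum of the original scores.
    """
    n = len(action_scores)
    return [
        sum(action_scores[i] * relation[i][j] for i in range(n))
        for j in range(n)
    ]


original_action = [1.0, 2.0]
relation = [[0.0, 1.0],
            [0.5, 0.0]]
augmented_action = apply_relation(original_action, relation)
```

In training, the entries of `relation` would be learnable parameters initialized from the dataset statistics rather than fixed constants.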
We also aim to simulate how doctors combine their own medical knowledge in real-world diagnosis, so a medical knowledge graph module is designed to assist the agent in making decisions. In Section 4, we established a disease-symptom medical knowledge graph for each low-level agent. When the $i$th low-level agent works, the weight matrices $P_i(dis \mid sym)$ and $P_i(sym \mid dis)$ on the edges of the $i$th medical knowledge graph are used to compute the weight matrix of the medical knowledge graph module.
According to the definition of conditional probability, Formulas (18) and (19) can be obtained: $P_i(dis) = P_i(dis \mid sym) \times P_i(sym)$ (18) and $P_i(sym) = P_i(sym \mid dis) \times P_i(dis)$ (19), where the symptom probability $P_i(sym)$ is the final desired weight matrix. Both $P_i(dis)$ and $P_i(sym)$ are unknown, but the prior probability of symptoms can be obtained from the dataset. The disease probability $P_i(dis)$ is therefore first calculated using the prior probability of symptoms; that is, Formula (18) can be rewritten as Formula (20): $P_i(dis) = P_i(dis \mid sym) \times P_i^{prior}(sym)$ (20), where $P_i^{prior}(sym) \in \mathbb{R}^{n_{S_i}}$ is the prior probability of symptoms and $n_{S_i}$ is the number of symptoms corresponding to the $i$th low-level agent. For symptoms that have already been collected in the current state $s^i_t$, the prior is set as follows: if the patient has the symptom (the user simulator responded True), it is set to 1; if the patient does not have the symptom (the response was False), it is set to −1; if the patient does not know (the response was Unknown), it is set to the value calculated from the user goals in the dataset. For symptoms that have not been collected in the current state $s^i_t$, the prior is also set to the value calculated from the user goals in the dataset. Formula (21) gives this dataset-based prior: $P_i^{prior}(sym_m) = n(S^{i,m}_{true}) / n_i$ (21),
where $n(S^{i,m}_{true})$ is the number of user goals in the data of the $i$th low-level agent in which the $m$th symptom is true, and $n_i$ is the number of user goals in the $i$th low-level agent. After obtaining the prior probability of the symptoms, the symptom probability $P_i(sym)$ can be obtained by Formulas (19) and (20).
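The two-step computation of the symptom weights can be sketched as below. It assumes that the products in the formulas are matrix-vector products, with $P_i(dis \mid sym)$ stored as an $n_{dis} \times n_{sym}$ matrix and $P_i(sym \mid dis)$ as an $n_{sym} \times n_{dis}$ matrix; these shapes, the function names and the toy values are assumptions for illustration.

```python
def symptom_prior(symptoms, answers, dataset_prior):
    """Prior per symptom: 1 if confirmed, -1 if denied, else the dataset frequency."""
    prior = []
    for symptom in symptoms:
        answer = answers.get(symptom)
        if answer == "True":
            prior.append(1.0)
        elif answer == "False":
            prior.append(-1.0)
        else:  # "Unknown" or not asked yet
            prior.append(dataset_prior[symptom])
    return prior


def matvec(matrix, vector):
    """Plain matrix-vector product on lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def symptom_probability(p_dis_given_sym, p_sym_given_dis, prior):
    """First estimate P(dis) from the prior, then P(sym) from P(dis)."""
    p_dis = matvec(p_dis_given_sym, prior)
    return matvec(p_sym_given_dis, p_dis)


symptoms = ["fever", "cough"]
prior = symptom_prior(symptoms, {"fever": "True"}, {"fever": 0.5, "cough": 0.25})
p_dis_given_sym = [[0.5, 1.0], [0.5, 0.0]]  # rows: diseases, columns: symptoms
p_sym_given_dis = [[1.0, 1.0], [0.5, 0.0]]  # rows: symptoms, columns: diseases
p_sym = symptom_probability(p_dis_given_sym, p_sym_given_dis, prior)
```

Because the prior can be −1 or 1, the resulting values act as weights rather than normalized probabilities.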
The obtained symptom probability is multiplied element-wise by the current state, and the result is fed into the DQN, as shown in Formula (22),
where $\odot$ stands for element-wise multiplication and $a^{i_k}_t$ is the action selected by the low-level agent after enhancement by the medical knowledge graph. The final action of the low-level agent is the sum of the above three actions: $a^i_t = a^{i_d}_t + a^{i_r}_t + a^{i_k}_t$. When the low-level agent takes an action, the user simulator gives a reply and the corresponding reward according to the symptom the agent asked about, and the dialogue moves to the next state. Since the action of the low-level agent is symptom collection, during both training and prediction the index of the predicted action is checked against $n_{S_i}$. If the predicted action index is not less than $n_{S_i}$, the task of the current low-level agent is terminated directly. We call the reward received by the low-level agent the low-level reward. Thus, the goal of the low-level agent is to find a policy that maximizes the expected cumulative discounted low-level reward. The Bellman equation of the $i$th low-level agent can be expressed as Formula (24), where $\gamma^l_i$ is the discount factor of the $i$th low-level agent. The low-level policy network is a three-layer DQN, and its network parameters $\theta_i$ are optimized by minimizing the loss function of the network. The mean squared error between the current network of the low-level policy and the target network is used as the loss function, as shown in Formula (25).

User Simulator
The user simulator is the component that talks to the agent, and it holds the user goals in the dataset. In each simulated dialogue, the user simulator samples a user goal for model training, and the explicit symptoms in the user goal are used to initialize the dialogue state. For each symptom the agent asks about, the user simulator gives feedback according to the symptom information in the user goal. For a symptom that is True in the user goal, the user simulator sets the corresponding symptom in the state to (1, 0, 0) and gives a +1 reward. For a symptom that is False in the user goal, it sets the corresponding symptom in the state to (0, 1, 0) and gives a −1 reward. For a symptom that is Unknown in the user goal, or that does not appear in the user goal at all, it sets the corresponding symptom in the state to (0, 0, 1) and gives a 0 reward. Notably, when the agent asks about a symptom it has already asked about, or the maximum number of dialogue turns is reached, a −2 reward is given and the dialogue with the agent ends. For a disease announced by the agent, the diagnosis is judged successful and a +22 reward is given when it matches the disease label in the user goal; otherwise, it is judged a failure and a −44 reward is given.
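The feedback rules above fit in a small function. This is a minimal sketch using the reward values quoted in the text; the dictionary-based user goal, the turn-counting convention (`turn >= max_turns`) and the function names are assumptions.

```python
def simulator_feedback(symptom, user_goal, asked, turn, max_turns):
    """Return (one_hot, reward, done) for one symptom inquiry.

    user_goal: dict mapping symptom names to "True"/"False"/"Unknown".
    asked: set of symptoms the agent has already asked about.
    """
    # A repeated question or an exhausted turn budget ends the dialogue.
    if symptom in asked or turn >= max_turns:
        return None, -2, True
    # Symptoms missing from the user goal are treated like "Unknown".
    status = user_goal.get(symptom, "Unknown")
    if status == "True":
        return (1, 0, 0), 1, False
    if status == "False":
        return (0, 1, 0), -1, False
    return (0, 0, 1), 0, False


def diagnosis_reward(predicted_disease, true_disease):
    """+22 for a correct diagnosis, -44 otherwise."""
    return 22 if predicted_disease == true_disease else -44


user_goal = {"fever": "True", "cough": "False"}
hit = simulator_feedback("fever", user_goal, asked=set(), turn=0, max_turns=20)
miss = simulator_feedback("cough", user_goal, asked=set(), turn=1, max_turns=20)
absent = simulator_feedback("rash", user_goal, asked=set(), turn=2, max_turns=20)
repeat = simulator_feedback("fever", user_goal, asked={"fever"}, turn=3, max_turns=20)
```

The caller is expected to add each asked symptom to `asked` and to write the returned one-hot vector into the dialogue state.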

Experimental Data and Settings
We use the synthetic dialogue dataset for disease diagnosis proposed by Liao et al. [19]. It contains user goals based on patient self-descriptions and conversations with physicians. The dataset is built from the SymCat database, which contains disease-symptom relations. Of the 21 groups (departments) of diseases and related symptoms defined by the International Classification of Diseases, the nine most representative groups were selected to generate user goals for disease diagnosis. Within each department, the 10 diseases with the highest incidence rate were selected.
This paper uses the experience replay mechanism [20] to train the high-level and low-level policy networks. Specifically, during training, the "experiences" of the high-level policy network $(s_t, a^h_t, r^h_t, s_{t+1})$ and of the low-level policy network $(s_t, a^l_t, r^l_t, s_{t+1})$ are stored in their respective buffers $B_h$ and $B_l$. The capacities of the two buffers are fixed, and each round of training draws a mini-batch of experiences from the buffer. The current network is evaluated after each round of training, and whenever it achieves the best performance so far, the buffer is flushed. Note that the two networks are not trained synchronously: the low-level policy network is trained once for every 10 rounds of high-level policy network training.
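The replay scheme described above can be sketched as follows. This is a minimal sketch rather than the authors' code: `ReplayBuffer`, `flush_if_best` and `training_schedule` are hypothetical names, and the capacity and batch size are left to the caller because the text does not give them.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience buffer (hypothetical helper)."""

    def __init__(self, capacity):
        # With maxlen set, the oldest experiences are dropped when full.
        self.data = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.data.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Draw a mini-batch without replacement (fewer items if the buffer is small).
        return random.sample(list(self.data), min(batch_size, len(self.data)))

    def flush(self):
        self.data.clear()


def flush_if_best(buffer, score, best_score):
    """Flush the buffer when the evaluated network is the best so far."""
    if score > best_score:
        buffer.flush()
        return score
    return best_score


def training_schedule(rounds, low_every=10):
    """Yield which network is trained: 'low' once per `low_every` 'high' rounds."""
    for r in range(1, rounds + 1):
        yield "high"
        if r % low_every == 0:
            yield "low"
```

Flushing on a new best score discards experiences generated by older, weaker policies, so later mini-batches come from the improved policy.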

Baseline Models
Flat-DQN is a single-layer policy model proposed by Wei et al. [18], which treats all diseases and their related symptoms equally; KR-DQN [38] also treats all diseases and their associated symptoms equally; REFUEL is a single-layer policy reinforcement learning model proposed by Peng et al. [35] that combines reward shaping and feature rebuilding; GAMP [40] is a single-layer reinforcement learning model optimized with a policy gradient framework based on generative adversarial networks; HRL-pretrained [41] is a hierarchical reinforcement learning model that pre-trains the low-level policies and then trains the high-level policy; HRL [19] is a hierarchical reinforcement learning model that uses a disease classifier to separate symptom collection from disease diagnosis. The above baselines are all reinforcement learning models for disease diagnosis.
In addition, SVM is selected as the multi-class classification baseline, and two experiments are designed on it: SVM-ex, trained only with explicit symptoms, and SVM-ex-im, trained with both explicit and implicit symptoms. Since the deep reinforcement learning models for diagnosis-oriented dialogue all initialize their states with explicit symptoms and can at best recover the implicit symptoms through inquiry, the results of SVM-ex-im can be regarded as an upper bound on the performance of deep reinforcement learning models on the synthetic dialogue dataset.
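The difference between the two SVM settings lies only in which symptoms reach the classifier. The sketch below illustrates this with a hypothetical feature encoder; the function name `encode`, the goal layout, and the 1 / −1 / 0 encoding are assumptions for illustration, since the text does not describe the SVM input format.

```python
def encode(goal, vocabulary, use_implicit):
    """Encode a user goal as a fixed-length symptom vector.

    Assumed encoding: 1 = symptom present, -1 = symptom absent,
    0 = symptom not mentioned in the visible part of the goal.
    """
    symptoms = dict(goal["explicit"])
    if use_implicit:
        # SVM-ex-im also sees the implicit symptoms.
        symptoms.update(goal["implicit"])
    return [
        (1 if symptoms[s] else -1) if s in symptoms else 0
        for s in vocabulary
    ]


# Hypothetical user goal and symptom vocabulary.
goal = {"explicit": {"cough": True},
        "implicit": {"fever": True, "rash": False}}
vocabulary = ["cough", "fever", "rash", "nausea"]

svm_ex_features = encode(goal, vocabulary, use_implicit=False)
svm_ex_im_features = encode(goal, vocabulary, use_implicit=True)
```

Under this encoding, SVM-ex-im receives every symptom in the goal at once, whereas a dialogue agent starts from the SVM-ex view and has to uncover the rest by asking.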

Experimental Results and Analysis
We use success rate, average number of dialogue turns and matching rate as the metrics to evaluate model performance. Each session between the agent and the user simulator ends with the agent notifying the user simulator of a disease. If the notified disease is consistent with the disease label of the user goal, the session is recorded as a success. The success rate is the ratio of the number of successful sessions to the total number of sessions. The average number of dialogue turns is the mean number of turns per session. The matching rate is the symptom matching rate, calculated as the ratio of the number of implicit symptoms in the user goal that the agent inquired about to the total number of symptoms inquired about in a session.
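The three metrics can be computed from logged sessions as in the sketch below. The session layout (dictionary keys) and the function name `evaluate` are hypothetical, and pooling the matching rate over all sessions rather than averaging per-session ratios is an assumption, since the text does not say how sessions are aggregated.

```python
def evaluate(sessions):
    """Return (success_rate, average_turns, matching_rate).

    Each session is a dict with hypothetical keys:
      'predicted', 'label' : notified disease and ground-truth disease
      'turns'              : number of dialogue turns
      'inquired'           : symptoms the agent asked about
      'implicit'           : implicit symptoms in the user goal
    """
    n = len(sessions)
    success_rate = sum(s["predicted"] == s["label"] for s in sessions) / n
    average_turns = sum(s["turns"] for s in sessions) / n
    # Inquired symptoms that are implicit symptoms of the user goal.
    hits = sum(len(set(s["inquired"]) & set(s["implicit"])) for s in sessions)
    asked = sum(len(set(s["inquired"])) for s in sessions)
    matching_rate = hits / asked if asked else 0.0
    return success_rate, average_turns, matching_rate


# Hypothetical log of two sessions.
sessions = [
    {"predicted": "flu", "label": "flu", "turns": 4,
     "inquired": ["cough", "fever", "rash"], "implicit": ["cough", "fever"]},
    {"predicted": "cold", "label": "flu", "turns": 2,
     "inquired": ["nausea"], "implicit": ["cough"]},
]
success_rate, average_turns, matching_rate = evaluate(sessions)
```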
In Table 1, the results of the KR-DQN, REFUEL and GAMP models are reproduced on this dataset. For the remaining baselines, we adopt the results reported in recently published papers [40]. Note that the results for KNHRL are the average of three runs with the same experimental settings on this dataset. In Table 1, KNHRL performs better than SVM-ex, and the success rates of the other deep reinforcement learning models are also higher than that of SVM-ex. This shows that, when only explicit symptoms are given initially, deep reinforcement learning models perform better on disease diagnosis than the multi-class classification model. Among the other models in Table 1, DQN, KR-DQN, REFUEL and GAMP are deep reinforcement learning models with single-layer policies. REFUEL and GAMP introduce additional mechanisms to optimize the reinforcement learning model on top of the basic DQN; KR-DQN adds medical knowledge to the basic DQN to assist decision-making, and its disease diagnosis success rate is higher than that of DQN, which indicates that medical knowledge can improve the performance of disease diagnosis models.
Compared with KR-DQN, KNHRL is greatly improved in terms of success rate and matching rate, which demonstrates the need for a hierarchical strategy when the number of diseases and symptoms is large. HRL-pretrained and HRL are deep reinforcement learning models with hierarchical strategies, and KNHRL clearly outperforms both. Note that, compared with HRL, the current state-of-the-art hierarchical reinforcement learning model, KNHRL improves the success rate and matching rate by 5.4% and 22.8%, respectively. This result shows that medical knowledge plays an important role in disease diagnosis, especially in improving the symptom matching rate. The performance of KR-DQN is inferior to that of HRL-pretrained and HRL, which indicates that, for disease diagnosis with a large number of diseases and symptoms, the hierarchical strategy contributes more to model performance than medical knowledge alone.
In Table 1, KNHRL outperforms all baseline models in success rate and comes closest to the upper bound (SVM-ex-im) on deep reinforcement learning performance for this synthetic dataset. Compared with KR-DQN and HRL, KNHRL achieves a large improvement in matching rate. However, the average number of dialogue turns of KNHRL is higher than that of the other baselines, which may be because the hierarchical strategy and medical knowledge bring more information to the model. Reducing the number of dialogue turns without lowering the success rate and matching rate will be a key issue for future work.
Figure 5 shows the learning curves of the KNHRL model and the reproduced KR-DQN model on the synthetic dataset, i.e., how the success rate of each model changes during learning. Both models are run for 3000 simulated dialogues. The learning curve of KNHRL reaches a plateau at about 1500, while that of KR-DQN reaches a plateau at about 2000, which shows that KNHRL learns faster than KR-DQN. KNHRL also reaches a higher disease diagnosis success rate than KR-DQN.

Further Analysis
To verify that each component of KNHRL contributes to the performance improvement, this paper conducts ablation experiments, as shown in Table 2. The result of each ablation experiment is the average of three runs with the same settings. In Table 2, -dl is the result obtained by removing the disease classifier in the low-level strategy from the complete model; -rel is the result obtained by removing the relation module; -kg is the result obtained by removing the medical knowledge graph; -hrl is the result obtained without the hierarchical strategy; -all is the result obtained by removing all of the above modules. In Table 2, every ablated model performs worse than the full model in terms of success rate and matching rate, which verifies the effectiveness of all components of KNHRL. In addition, the success rate and matching rate of -hrl are lower than those of -rel and -kg, which further indicates that, when the number of diseases and symptoms is large, the hierarchical strategy plays a greater role in improving performance on the disease diagnosis task. Note that the success rate and matching rate of -hrl are both higher than those of the KR-DQN model in Table 1, which shows that the knowledge embedding method in KNHRL is better than that in KR-DQN.

Conclusions
This paper proposes KNHRL, a knowledge-enhanced hierarchical reinforcement learning model for automatic disease diagnosis in medical dialogue systems. Building on the hierarchical reinforcement learning strategy, a medical knowledge graph is incorporated into each low-level agent to assist decision-making, and a learnable relationship matrix and a disease classifier further assist the low-level agent in choosing actions. The effectiveness of KNHRL is validated on a publicly available dataset for disease diagnosis. In future work, we hope to collect a real-world medical dialogue dataset for disease diagnosis and further verify the performance of the KNHRL model.

Limitations
This work mainly focuses on a knowledge-enhanced hierarchical reinforcement learning model in the medical dialogue system for disease diagnosis. We have identified two key limitations that can be further examined in future research. The first limitation is that the KNHRL model tends to have a relatively high average number of dialogue turns due to the hierarchical strategy and the incorporation of medical knowledge, which provide the model with more information. In the future, a key research focus will be on reducing the number of dialogue turns without compromising the success rate of diagnosis and symptom matching. The second limitation is that, due to the limited availability of real diagnostic datasets, we used an artificially synthesized dialogue dataset for disease diagnosis. In future work, we aim to collect a real-world medical dialogue dataset specifically designed for disease diagnosis, and to use it to validate and improve the performance of the KNHRL model.

Ethics Statement
This paper aims to investigate hierarchical reinforcement learning-based approaches for automatic disease diagnosis, with the objective of reducing the burden on doctors and promoting the advancement of automatic diagnosis systems. It is crucial to emphasize that the proposed methods are designed solely for research purposes and are not suitable for direct clinical application due to the potential risks associated with the misuse of automatic diagnosis systems. Furthermore, the dataset used in our experiments is synthetic; therefore, there are no issues related to ethics and privacy concerns.

Figure 2. An excerpt of the medical knowledge graph.

Figure 3. The working process of the hierarchical model.

Figure 4. The structure of the knowledge-enhanced low-level policy.

Figure 5. Learning curves of KNHRL and KR-DQN on the synthetic dataset.

Table 1. Evaluation results of KNHRL and other baselines on the synthetic dataset.

Table 2. Evaluation results of the ablation experiments.