Article

Complex Question Decomposition Based on Causal Reinforcement Learning

School of Information and Communication, National University of Defense Technology, Wuhan 430019, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(7), 1022; https://doi.org/10.3390/sym17071022
Submission received: 21 April 2025 / Revised: 15 June 2025 / Accepted: 16 June 2025 / Published: 29 June 2025

Abstract

Complex question decomposition is an important research topic in the field of natural language processing (NLP). It refers to decomposing a compound question containing multiple ontologies and classes into simple questions, each containing only a single attribute or entity. Most previous studies focus on how to generate simple questions with a single attribute or entity but pay little attention to the order in which the simple questions are generated, which may lead to inaccurate decomposition or longer execution times. In this study, we propose a new method based on causal reinforcement learning, which combines the advantages of state-of-the-art reinforcement learning methods and causal inference methods. Compared with previous methods, causal reinforcement learning can find the generation order of sub-questions more accurately and thus decompose complex questions better. In particular, prior knowledge is extracted using the counterfactual method in causal reasoning and integrated into the policy network of the reinforcement learning model, and the reward rules of reinforcement learning are designed from the perspective of symmetry (positive reward and negative punishment), thereby guiding the intelligent body to choose the sub-question with a greater benefit and a lower decomposition risk. We compare the proposed method with baseline methods on three datasets. The experimental results show that the performance of our method improves by 5–10% over the baseline methods on Hits@n (n = 1, 3, 10), which demonstrates the effectiveness of the proposed method.

1. Introduction

Real-world questions are often complex. A practical question combines multiple sub-questions, and each sub-question may involve multiple attributes or classes, with multi-hop reasoning, logical operations, aggregation, and other characteristics [1]. For example, the question “What movies did the 38th Governor of California appear in?” requires multiple hops, the question “What was the first movie in which Linda Hamilton and Arnold Schwarzenegger starred together?” involves aggregation, and a practical question often combines both, such as “What is the first film that the 38th Governor of California and Linda Hamilton co-starred in?” How to automatically answer such questions is a key research direction in the field of natural language processing (NLP). The basic idea is to accurately decompose these complex questions into sequences of simple sub-questions and to combine the sub-question sequences and answer sequences to obtain the final answer, as shown in Figure 1.
For the input complex question, we must first decompose it into a sequence of sub-questions [2], such as the input complex question “What is the first movie that the 38th Governor of California and Linda Hamilton co-starred in?” As depicted in Figure 1, the sequence of sub-questions can be as follows:
  • Sub-question 1: “Who is the 38th Governor of California?”
  • Sub-question 2: “What movies has Arnold Schwarzenegger been in?”
  • Sub-question 3: “Among films such as The Terminator, The Running Man, Predator, Terminator 2: Judgment Day, True Lies, The Expendables, and Terminator: Dark Fate (due to space limitations, not all films starring Arnold Schwarzenegger are listed here), which ones did Linda Hamilton appear in?”
  • Sub-question 4: “Which of the three movies The Terminator, Terminator 2: Judgment Day, and Terminator: Dark Fate was released first?”
The greater the complexity of the question, the more sub-questions must be decomposed and the more possible orderings there are for generating them. It is therefore natural to ask whether the order of sub-question generation affects the accuracy and efficiency of answering the question.
In fact, to answer complex questions correctly and efficiently, the order in which the sub-questions are generated is very important [3], as in the above example: choosing a different generation order would increase the computational cost or yield a wrong answer. Recent studies have used various approaches to achieve sub-question ordering, such as reinforcement learning and large language models [4,5,6]. Reinforcement learning can effectively identify the connections between sub-questions, but the next-hop selection of the intelligent body is not reliable enough. A large language model achieves high accuracy in answering complex questions, but its training cost is high and it is prone to hallucinations. Based on this, we introduce the counterfactual method of causal inference into reinforcement learning, which can enhance the rationality of the intelligent body’s next-hop selection while maintaining the advantages of reinforcement learning, further improving the accuracy of the sub-question ordering. First, however, we need to explain why the order of sub-question generation is closely related to answering the question.
Let us return to the way humans answer complex questions. When answering a complex question, humans also decompose it into simple sub-questions, and there may be many ways to order these sub-questions. When humans consider how to order them, they first consider the dependencies between the sub-questions. For example, in the question above, to answer the final complex question, they must answer the sub-question “Who is the 38th Governor of California?”. Although this sub-question does not need to be answered first, it must be answered before its answer is combined with Linda Hamilton’s filmography. Beyond the dependencies between sub-questions, humans also consider the difficulty of answering each sub-question and the amount of information it provides for answering subsequent sub-questions. Humans weigh all of these factors to find the best order in which to answer a complex question. If the answer to each sub-question is clear, the final answer can be obtained through an efficient chain of reasoning. Therefore, if we can optimize the generation order of sub-questions, machines can answer complex questions more accurately and understand humans better. Inspired by the way humans answer complex questions, if we integrate prior knowledge into the decomposition process, we can improve the accuracy of the sub-question generation order and, in turn, the accuracy of answering complex questions. Since reinforcement learning methods have advantages in analyzing sub-question dependencies, and the counterfactual method can effectively extract prior knowledge, we combine counterfactual inference with reinforcement learning to achieve complex question decomposition.
The main contributions of this study are as follows:
(1)
We propose a method to determine the difficulty of answering a question using causal inference. The method aggregates the answers of the sub-questions decomposed from a complex question, uses the counterfactual method to obtain the ratio of the number of next-hop entities that enable the intelligent body to obtain the maximum reward in an iteration to the total number of next-hop entities selected by the intelligent body, and converts this ratio into a weight value that supports the subsequent combination with the reinforcement learning framework.
(2)
We propose a complex question decomposition framework, including a knowledge graph embedding module, sub-question generation and ordering module, and a sub-question-answering module, among which the sub-question generation and ordering module is the focus of this paper.
(3)
We design the sub-question generation and ordering module based on causal inference and reinforcement learning, model the dependencies between the sub-questions through the multi-head attention mechanism, and determine the ordering of the sub-questions based on the weight value or the neural network.
(4)
Experiments demonstrate that the performance of our method is improved by 5–10% compared with the baseline method on Hits@n (n = 1, 3, 10), and the ablation experiments verify that the proposed model is effective compared with the pure reinforcement learning model.
The rest of this paper is organized as follows. Section 2 presents related works. The details of the model are described in Section 3. Section 4 investigates the details of the experiment and analyzes the results. Section 5 summarizes this paper and introduces future research directions.

2. Related Works

Since the method proposed in this paper combines causal inference and reinforcement learning and applies them to complex question decomposition, this section introduces related work on reinforcement learning, causal inference, causal reinforcement learning, and complex question decomposition.

2.1. Reinforcement Learning

Reinforcement learning is one of the main paradigms of machine learning. Its basic idea is to set rewards so that the intelligent body learns, through interaction with the environment, the optimal strategy that maximizes the reward and thereby achieves specific goals [7,8]. Reinforcement learning generally has no direct supervisory information, and intelligent bodies must continually interact with the environment to obtain the optimal strategy through trial and error [9]. Depending on whether the environment is learned and modeled, reinforcement learning can be divided into model-free learning and model-based learning [10].
Model-free learning does not learn or model the environment but directly uses the information given by the environment. Representative methods include the Q-learning method [11], which determines the action selection by evaluating the action value; the Deep Q Network method [12], which determines the action selection by predicting the action value through a neural network; State-Action-Reward-State-Action (SARSA) [13]; and its variants [14].
Model-based learning constructs a model of the environment through learning and then predicts the environment’s feedback to obtain the optimal strategy. Deep probabilistic inference for learning control is proposed in [15]: a probabilistic dynamic model of the environment is obtained with a Bayesian neural network, and the optimal strategy is determined through the model’s predictions. In addition, some scholars have proposed combining the advantages of model-based and model-free algorithms, such as the Actor–Critic algorithm proposed by Mnih et al. [16], in which the Actor chooses the action based on the probability of the next action, and the Critic accelerates the learning process based on the value of the action.
Few studies have applied reinforcement learning to complex question decomposition, and the performance of existing methods is limited. The main reason is that the intelligent body relies too heavily on the neural network when selecting the next-hop entity and therefore cannot effectively use the prior knowledge implicit in the question.

2.2. Causal Inference

Causal inference mainly determines the relationship between cause variables and outcome variables; it is divided into two main research directions: causal relationship discovery and causal effect estimation [17,18,19].
Causality discovery mainly involves finding the causal relationships between variables and determining the causal structure based on observed data. Ogarrio et al. [20] proposed a greedy fast causal inference (GFCI) method that combines the greedy equivalence search (GES) algorithm [21] and the fast causal inference (FCI) algorithm [22]: the correlations between variables are determined by the GES algorithm, and the causal directions between variables are determined by the FCI algorithm. Runge et al. [23] improved the classical Peter–Clark (PC) algorithm [24] and divided causal relationship discovery into two steps. The first step deletes irrelevant variables through an independence test based on the idea of the PC algorithm. The second step determines the causal directions between variables through the momentary conditional independence (MCI) test using the results of the first step.
The causal effect estimation determines the influence of causal variables on outcome variables when the causal structure is known, which can generally be divided into intervention and counterfactual reasoning. Vansteelandt and Daniel [25] trained a regression model under supervision to estimate the overall average causal effect of the data after the intervention. Counterfactual reasoning [26] mainly updates the value of the noise term in the causal structural equation based on the observed data, then replaces the causal variable in the original equation with the counterfactual variable to obtain a new equation after replacement, and calculates the value of the outcome variable based on the new equation and the updated noise term to obtain the value of the causal effect.
The decomposition of complex questions studied in this paper draws on the basic idea of the counterfactual method and guides the generation order of sub-questions through counterfactual results.

2.3. Causal Reinforcement Learning

Causal reinforcement learning builds on the reinforcement learning framework and introduces the theory of causal inference, which makes the intelligent body select actions more efficiently by conveying causal information to it [27,28,29]. Based on whether the causal structure is known or not, causal reinforcement learning methods can be divided into two categories. One is causal reinforcement learning with a known causal structure, which can directly guide the intelligent body to learn through the pre-existing causal structure. For example, Liao et al. [30] considered unobserved confounding factors, combined Instrumental Variables (IVs) to construct a Markov Decision Process (MDP), and proposed a Confounded Markov Decision Process–Instrumental Variables (CMDP-IVs) approach. To solve the contextual bandit problem, Subramanian et al. [31] proposed a new contextual bandit learning method that queries causal structures to obtain causal information and then guides the actions of intelligent bodies during training. In the field of reinforcement learning, the contextual bandit problem is an extension of the multi-armed bandit problem. In the multi-armed bandit problem, the intelligent body selects one of several arms of a slot machine to pull and receives a +1 or −1 reward. In the contextual bandit problem, the intelligent body faces multiple slot machines, the state determines which machine it faces at the current moment, and the intelligent body must learn to choose actions based on the state. Huang et al. [32] obtained Action-Sufficient Representations (ASRs) by representing the action effectiveness of the intelligent body and modeling the environment; the environment and the ASRs are estimated by a variational autoencoder to guide the intelligent body’s actions. Bica et al. [33] improved the interpretability of their method by estimating the causal effects of the different actions of the intelligent body and then learning how human experts make decisions.
Research on causal reinforcement learning algorithms has made great progress in recent years, but it has not yet been applied to complex question decomposition research.

2.4. Complex Question Decomposition

Complex question decomposition is the key link to answer complex questions [34], and the methods of decomposition can generally be divided into two categories.
The first category comprises decomposition methods based on machine learning; the basic idea is to extract the question features step by step and generate the decomposed sub-questions based on rules or templates. For example, in several studies [35,36,37], the complex questions to be decomposed are divided into two cases: parallel and nested. For example, “Are you going to finish your homework first and then play the game, or play the game first and then do your homework?” is a parallel-structure question, and “Do you know why he refused your offer of help?” is a nested-structure question. In the parallel case, the decomposed sub-questions are independent of each other, and in the nested case, they must be decomposed in a specific order. In particular, the entities and attributes in complex questions are identified by a sub-question identification module and turned into sub-questions by a question generator. Zheng et al. [38] proposed a method that decomposes complex questions using templates built from a knowledge graph and a text corpus. In this method, the decomposition templates are obtained by enumerating each sub-question, and the result is obtained by matching the sub-questions to the templates based on type and order. Min et al. [39] proposed a method that finds a partition point to decompose a complex question. In this method, complex questions are divided into three types: dependence, intersection, and comparison. The dependence type means that the second sub-question can be answered only when the answer to the first sub-question is known, the intersection type means that the sub-question contains multiple conditions, and the comparison type means that the sub-question needs to compare different entity attributes. After the sub-questions are decomposed according to the three types, the most appropriate decomposition is obtained through a scorer. Several studies [40,41] have proposed joint decomposition methods, which use graph models to consider the results of word segmentation and syntactic analysis simultaneously.
The second category comprises decomposition methods based on deep learning. The basic idea is to optimize the decomposition process without distinguishing the decomposition steps. For example, Khot et al. [42] proposed a text modular network (TMN) framework that decomposes complex questions into sub-questions using several sub-models composed of transformer structures. Fu et al. [43] proposed a relation extractor–reader and comparator (RERC) framework, in which a relation extractor generates sub-questions, a reader answers the generated sub-questions, and a comparator combines the answers of all sub-questions to obtain the final answer. Shin and Lee [44] proposed a decomposition method based on a query graph, which generates sub-questions through dependency parsing, searches the graph library for the subgraphs corresponding to the sub-questions, scores the semantic matching from a global perspective, combines the subgraphs with the highest scores into a complete query graph, and merges the intermediate results to obtain the final answer. Zhang et al. [45] proposed a reinforcement learning decomposition method, which designs rewards based on the contribution of sub-questions to the entire question and the possibility of wrong answers to sub-questions, and guides intelligent bodies to dynamically choose the order of answering sub-questions. Lin et al. [46] proposed a Reward Remodeling reinforcement learning decomposition method, which estimates unobserved rewards using a pre-trained one-hop embedding model to reduce the effect of low-quality rewards on intelligent bodies. Das [47] proposed the MINERVA algorithm, which combines neural networks with reinforcement learning methods to cast the decomposition of complex questions as multi-hop path selection.
Among the above methods, only the method based on reinforcement learning considered the order of answering sub-questions, so its accuracy is higher than other methods. The model proposed in this study is an improvement on the framework of reinforcement learning. We use the counterfactual method to obtain prior knowledge to solve the problem that the next-hop selection of the intelligent body in the reinforcement learning model is not reasonable enough. Thus, the accuracy of the sub-question generation order is improved.

3. Proposed Method

This section describes our proposed causal reinforcement learning model. First, it defines the complex question decomposition problem. Then, it presents the general framework of our model. Next, it describes how to obtain the optimal sub-question sequence through the sub-question ordering module. Finally, it presents the framework of the reinforcement learning model.

3.1. Problem Definition

First, we must formally define the process of complex question decomposition. As mentioned above, a complex question is a question that contains multiple entities and multiple types of relationships. In line with the current mainstream methods in the field of NLP, we choose to model the decomposition of complex questions on top of an existing knowledge graph. We choose the knowledge graph because the goal of this paper is to optimize the order of sub-questions, and the knowledge graph and its related tools allow us to generate sub-questions efficiently so that we can focus on optimizing their order. Moreover, as one of the mainstream representations in the field of NLP, the knowledge graph can represent entities and relationships well and is widely applicable, which makes it convenient to develop other models on this basis. Therefore, we first represent the knowledge graph as G = {E, R}, where E denotes the set of entities and R denotes the set of relations. A triple in the knowledge graph can be represented as (e_h, r, e_t) ∈ G, where e_h denotes the head entity, e_t denotes the tail entity, and r denotes the relationship between e_h and e_t.
The decomposition process of a complex question can be represented as a set of entities and relationships {e_1, …, e_n; r_1, …, r_n} together with a reasoning tree, where e_i, i ∈ (1, n), and r_j, j ∈ (1, n), denote the entities and relationships extracted from the complex question to be decomposed, respectively. The external nodes in the reasoning tree are sub-questions, and the internal nodes (nodes with at least one child node) are functions for processing sub-questions, such as functions for answering sub-questions, functions for connecting sub-question answers with other entities, and functions for comparing sub-question answers. Because the answers to the sub-questions may be correlated, our goal is to optimize the order in which the sub-questions are answered by capturing the additional information generated by this correlation. The answer to a sub-question can be expressed in the following form: e_i, (e_i, r_i^1, e_o^1), (e_o^1, r_i^2, e_o^2), …, (e_o^{|R_i|−1}, r_i^{|R_i|}, e_o^{|R_i|}), where R_i represents the set of entity–relation pairs consisting of entity e_i and relations r_j, e_o^t, t ∈ (1, |R_i|), denotes the answer of the previous step, and r_i^1, r_i^2, …, r_i^{|R_i|} denotes the set of all relations connected with e_i, e_o^1, e_o^2, …, e_o^{|R_i|}. Owing to the incompleteness of the knowledge graph, for some i with e_i ∈ E, the relation set R_i may not be fully contained in R.
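To make the definitions above concrete, the following is a minimal illustrative Python sketch (not the authors' code) of the knowledge graph G = {E, R} stored as triples and the entity/relation set extracted from the complex question of Figure 1; all names are assumptions introduced only for illustration.

```python
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self, triples):
        # triples: iterable of (head_entity, relation, tail_entity), i.e., (e_h, r, e_t) in G
        self.triples = set(triples)
        self.out_edges = defaultdict(list)          # e_h -> [(r, e_t), ...]
        for e_h, r, e_t in self.triples:
            self.out_edges[e_h].append((r, e_t))

    def neighbors(self, entity):
        """All (relation, tail_entity) pairs leaving `entity`."""
        return self.out_edges.get(entity, [])

# Entities and relations that might be extracted from the example complex question.
question_entities = ["38th Governor of California", "Linda Hamilton"]
question_relations = ["governor_of", "starred_in", "release_date"]
```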

3.2. Overall Architecture

The overall framework of the proposed method is shown in Figure 2. We first extract the entities and relationships {e_1, …, e_n; r_1, …, r_n} contained in the complex question to be decomposed. The obtained entities and relationships correspond to subgraphs of the existing knowledge graph: the starting point of each subgraph is an entity e_i, i ∈ (1, n), and the subgraphs contain the relations r_j, j ∈ (1, n). Then, we transform the decomposition process of the complex question into a collaborative query process over the subgraphs and transform the query of each subgraph into the construction of a computation tree through the sub-question generation and ordering module, in which we must specify which sub-questions are included in the complex question and the order of answering them. Next, based on the answer order of the sub-questions, we obtain the answers to the sub-questions through the sub-question-answering module and update the sub-questions in the computation tree with these answers. After the whole decomposition process reaches the external nodes of the computation tree, we obtain the answers to the sub-questions and to the complex question.
The sub-question generation and ordering module is the core of the method proposed in this study and mainly comprises the following parts, as shown in Figure 3. From the overall architecture, we can observe that the input of the sub-question generation and ordering module is the mapping of the complex question onto the subgraphs, and the output is the ordered sub-questions and the answer to the complex question. The sub-question generation and ordering process is divided into four steps: subgraph embedding, sub-question association, intelligent body action selection, and sub-question ordering with complex question answer generation. Subgraph embedding transforms the subgraph mapping into a vector through an encoding network and multi-head attention encoding, for which many mature encoding methods are currently available. The sub-question association and intelligent body action selection are the key points of the method proposed in this study, which mainly obtain the importance of sub-questions by modeling the interaction relationships among sub-questions and applying the counterfactual method. We determine the sub-question ordering based on the sub-question importance obtained in the previous step and obtain the complex question answer through the sub-question-answering module. Through the above steps, we integrate the counterfactual method in causal inference into the next-hop selection strategy of the intelligent body in the reinforcement learning model. Compared with a plain reinforcement learning model, our model can use the prior knowledge obtained by the counterfactual method to select the next-hop entity more accurately, thereby improving the accuracy of the sub-question generation order.

3.3. Causal Reinforcement Learning Model

The causal reinforcement learning model continuously expands the subgraph through the intelligent body to achieve sub-question ordering. We mainly describe the model from five aspects: the action space, state space, reward, causal policy network, and objective function.

3.3.1. Action Space

The action space refers to the set of all triples connected to the entities where the intelligent body is located, starting from the current state (current subgraph). The action space at time t can be expressed as follows: A_t = {(e_s, r̂_t, e_t) | e_s ∈ {g_t^i}_{i=1}^n, (e_s, r̂_t, e_t) ∈ G}, where e_s denotes the entity at which the intelligent body is located, r̂_t denotes an outgoing edge of e_s, e_t denotes the next-hop entity, and g_t^i denotes a subgraph at the current moment. We want the intelligent body to be able to obtain information not only about the sub-question corresponding to the current subgraph but also from other sub-questions. Therefore, we must fuse the answers to each sub-question so that the intelligent body can make a better choice.
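As an illustration only (building on the KnowledgeGraph sketch in Section 3.1), the action space A_t could be enumerated as follows; the frontier set {g_t^i} is abstracted into the list of entities the intelligent body currently occupies.

```python
def action_space(kg, frontier_entities):
    """A_t: all triples (e_s, r, e_t) in G whose head e_s is an entity the
    intelligent body currently occupies on one of the subgraphs."""
    return [(e_s, r, e_t)
            for e_s in frontier_entities
            for r, e_t in kg.neighbors(e_s)]
```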

3.3.2. State Space

Because complex question decomposition is a collaborative search over multiple subgraphs, the state space should contain all subgraphs and the corresponding sub-questions, which can be expressed as follows: S_t = [(g_t^1; g_t^2; …; g_t^m), Q_t], where Q_t denotes the set of sub-questions at time t, the initialization is denoted by Q_0 = [q_1, …, q_m], and m denotes the number of sub-questions that can be decomposed under the initial condition. Given the intelligent body's actions, we represent state transitions as a probability matrix: P(s_{t+1} = s′ | s_t = s, a_t = a). In this section, we formalize the representation of subgraphs, formalize the relationship between sub-questions through the relationship between subgraphs, and formalize the decomposition process of complex questions through the aggregation of sub-questions.
  • Subgraph Representation
The main purpose of the subgraph representation is to obtain the contribution of each subgraph to the sub-question, so we represent the subgraph as a weighted sum of a series of entity nodes. The weight of a node represents the contribution of the node to the sub-question. Suppose that subgraph g_t^i has k nodes at time t, that is, g_t^i = {e_t^1, …, e_t^k}. We use ConvE [44] to encode the entity nodes in the subgraph, thus obtaining the semantic information implied by each entity. By calculating the similarity matrix between the subgraph at time t and the corresponding sub-question, we can obtain the contribution of each node, which is expressed in the following form:
L = g_t^i · Q_t,    A_{Q_t} = softmax(L),    h_t^i = g_t^i · A_{Q_t}
where A_{Q_t} denotes the weight of each relation contained in Q_t in the subgraph and h_t^i denotes the contextual attention representation corresponding to Q_t. By applying the above form, we can obtain a representation of each subgraph (an illustrative sketch of the computations in this subsection is given after the state space representation). Next, we must find the connections between the subgraphs and update the sub-questions based on the complex question decomposition process. These two parts are described in detail in the following two subsections: the subgraph interaction and sub-question update.
2.
Subgraph Interaction
This subsection focuses on how to find and formalize the connections between subgraphs. It has been introduced that there are relationships between sub-questions, and determining these relationships must be completed through the steps of the prior knowledge representation, subgraph relationship modeling, and complex question reduction.
Prior knowledge extraction: This step imitates the common-sense reasoning of the human brain to identify the relationships between sub-questions. Using the counterfactual method in causal inference theory together with the sub-question-answering module, we can estimate the difficulty of answering each sub-question and thereby quantify its complexity. In particular, when the intelligent body faces multiple sub-questions and their corresponding subgraphs at time t, we obtain prior knowledge such as the complexity of each sub-question through the counterfactual definition and the sub-question-answering module and record it by constructing the prior knowledge table f, which is expressed in the following form:
V[Q_t][Q_{t+1}^i] = f[Q_t][Q_{t+1}^i] / Σ_i f[Q_t][Q_{t+1}^i]
where V denotes the prior knowledge, which represents the importance set of all sub-questions at time t + 1, and f[Q_t][Q_{t+1}^i] denotes the number of times that the prior knowledge “sub-question i has lower complexity” is obtained through counterfactual reasoning.
Sub-question relationship modeling: Because the subgraphs corresponding to each sub-question may not be adjacent in the knowledge graph, we must model the relationship between the cross-regional subgraphs. In particular, we use the multi-head self-attention mechanism to obtain the relationship between subgraphs, and the relationship between subgraph i and subgraph j is expressed in the following form:
α_{ij}^p = exp(τ · W_q^p h_i (W_k^p h_j)^T) / Σ_{e∈E_i} exp(τ · W_q^p h_i (W_k^p h_e)^T),    h_i = σ(Concat[Σ_{j∈E_i} α_{ij}^p W_v^p h_j])
where p denotes the index of the head in the multi-head self-attention mechanism; W_q, W_k, and W_v denote the query, key, and value projection matrices of the multi-head self-attention mechanism; τ denotes a learnable coefficient; α denotes the probability distribution of the weights; E_i denotes the set of entities contained in subgraph i; Concat denotes the concatenation function; and σ represents the softmax function. Using the above formula, we can transfer the information of each subgraph to the others so that the intelligent body can obtain global information before making a decision and continue to decompose the question and expand the corresponding subgraph in the correct order.
Complex question reduction: Complex question reduction means that when a complex question has been partially decomposed, the sub-questions that were decomposed in previous steps do not need to be considered in the subsequent decomposition. For example, in the previous question “What was the first film co-starred by the 38th Governor of California and Linda Hamilton?”, after the sub-question “Who was the 38th Governor of California?” is answered, it can be reduced and no longer considered in the subsequent decomposition. The process can be formally expressed as follows:
Q_{t+1} = Q_t − γ · r̂_t
where γ denotes the similarity between each relationship in r̂_t and g_t. We reduce the weight of the questions in Q_t (especially the sub-questions that have already been decomposed) by reducing the importance of their relationships, which enables us to focus more on the sub-questions still to be decomposed.
3.
State Space Representation
Using the above process, we can represent the state space as follows: at time t, the state space is S_t = [(h_t^1; h_t^2; …; h_t^n), Q_t], where (h_t^1; h_t^2; …; h_t^n) denotes the association of the subgraphs as a whole. We denote the end nodes at this time by E_t = (e_t^1, e_t^2, …, e_t^n).
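The following is a minimal, illustrative Python/numpy sketch (not the authors' implementation) of the three computations in this subsection: the subgraph representation attention, the counterfactual prior-knowledge table f and its normalisation into V, and the multi-head self-attention between subgraph representations. All dimensions, the normalisation of f, and the helper names are assumptions.

```python
import numpy as np
from collections import defaultdict

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def subgraph_representation(g, q):
    """Subgraph representation: L = g . Q, A = softmax(L), h = g . A.
    g: (k, d) entity-node embeddings of subgraph g_t^i (e.g., from ConvE);
    q: (m, d) embeddings of the relations in the sub-question Q_t."""
    L = g @ q.T                    # (k, m) similarity matrix
    A = softmax(L, axis=0)         # contribution of each node per relation
    h = g.T @ A                    # (d, m) weighted sum of entity nodes
    return h

class PriorKnowledgeTable:
    """Counterfactual prior knowledge: f counts how often counterfactual
    reasoning judged sub-question i to be the lower-complexity next choice;
    V normalises the counts into weights (assumed form of the equation above)."""
    def __init__(self):
        self.f = defaultdict(lambda: defaultdict(int))

    def record(self, state_key, subq_id):
        self.f[state_key][subq_id] += 1

    def weights(self, state_key):
        counts = self.f[state_key]
        total = sum(counts.values())
        return {i: c / total for i, c in counts.items()} if total else {}

def multi_head_subgraph_attention(H, Wq, Wk, Wv, tau=1.0):
    """Cross-subgraph multi-head self-attention over the n subgraph
    representations H: (n, d); Wq, Wk, Wv: (P, d, d) per-head projections."""
    P = Wq.shape[0]
    heads = []
    for p in range(P):
        Q, K, V = H @ Wq[p].T, H @ Wk[p].T, H @ Wv[p].T
        alpha = softmax(tau * (Q @ K.T), axis=-1)   # (n, n) attention weights
        heads.append(alpha @ V)                     # (n, d) per-head output
    return np.concatenate(heads, axis=-1)           # (n, P*d) concatenated heads
```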

3.3.3. Reward

To ensure that the intelligent body performs well in different environments, we design rewards for the positive and negative effects in the decomposition of complex questions from the perspective of symmetry. The positive effect refers to the utility degree of the sub-question selected by the intelligent body, and the negative effect refers to the risk degree of the failure of the decomposition of the complex question caused by the sub-question selected by the intelligent body. Note that the failure here includes both the decomposition error and the fact that the decomposed sub-question sequence is not the optimal sub-sequence. The principle of our reward design is to make the intelligent body choose the sub-question with the best decomposition effect as far as possible, while avoiding errors in the decomposition of complex questions, so that the intelligent body’s choice can achieve a balance between benefits and risks.
  • Positive Effect Reward Design
We measure the utility of a sub-question for the decomposition of the question from two aspects: one is whether it can support the decomposition of the complex question, and the other is whether it can help other sub-questions reduce their search space. The first aspect is easy to design and can be determined by whether the question can finally be decomposed. The second aspect is determined by whether the decomposition process enables more subgraphs to obtain answers. Based on the above two principles, we set the reward for correctly and efficiently decomposing a complex question to +1, which is expressed in the following form: R(s_T) = z / (1 + λy), where z denotes the number of subgraphs, λ denotes a learnable coefficient, and y denotes the number of interactions between subgraphs.
2.
Negative Effect Reward Design
The negative effect is also measured from two aspects: one is that the intelligent body sub-question selection leads to the generation of uncertainty in the complex question decomposition process, which is determined by the variance; the other is that the intelligent body selection sub-question error leads to the incomplete decomposition of the complex question, which is achieved by setting the penalty value to −1. We use the risk rate to represent the negative effect, which is defined in the following form:
P(ans = wrong | S = s, Q = q) = P(ans = wrong, Q = q | S = s) / P(Q = q | S = s)
where ans denotes the final answer to a complex question and can be obtained from the output of the neural network.
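The following is a hedged sketch of how the symmetric reward above might be computed; the exact form of the positive reward (combining the number of subgraphs z, the number of subgraph interactions y, and the learnable coefficient λ) and the probability estimates are assumptions read off the text, not the authors' code.

```python
def positive_reward(z, y, lam):
    # Assumed form of R(s_T) for a correct and efficient decomposition,
    # combining the number of subgraphs z and subgraph interactions y.
    return z / (1.0 + lam * y)

def failure_penalty():
    # Symmetric penalty when the selected sub-question breaks the decomposition.
    return -1.0

def risk_rate(p_wrong_and_q_given_s, p_q_given_s):
    # P(ans = wrong | S = s, Q = q), computed from the conditional
    # probabilities estimated by the neural network.
    return p_wrong_and_q_given_s / p_q_given_s
```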

3.3.4. Target Function

We determine the target function as follows:
J(θ) = E_{q∈Q}[ E_{τ∼π_θ}[R(s_T | q)] − k · var(R(s_T | q)) ]
where θ denotes the parameters of the policy network, τ denotes a trajectory of reinforcement learning, q denotes a sub-question in the set Q, k ∈ (0, 1) denotes a small positive number, and var denotes the function for calculating the variance. Through the above function, we can quantify the reward that the intelligent body receives for executing an action sequence under the action strategy π_θ.
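As a small illustration, the variance-penalised objective could be estimated from sampled trajectories as follows; this is a sketch under the assumption that rollout rewards are collected per sub-question q, with k the small positive coefficient in the equation above.

```python
import numpy as np

def objective_estimate(rollout_rewards, k=0.1):
    # rollout_rewards: terminal rewards R(s_T | q) of trajectories sampled
    # from the policy pi_theta for one sub-question q.
    r = np.asarray(rollout_rewards, dtype=float)
    return r.mean() - k * r.var()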

3.3.5. Policy Network

We combine question complexity recognition with a neural network to construct the policy network and guide the intelligent body to choose the most efficient action, achieving a fast and accurate decomposition of complex questions. In particular, we first set a threshold φ and compare it with the maximum value in the prior knowledge set V[Q_t][Q_{t+1}^i]: if this maximum value is greater than φ, the intelligent body chooses the action corresponding to the sub-question with the maximum value; otherwise, the action of the intelligent body is determined by the neural network.
When the action of the intelligent body is determined by the neural network, we use a Long Short-Term Memory (LSTM) network to encode the historical state information in the following form: p_0 = LSTM(0, [R_0, H_0]), p_t = LSTM(p_{t−1}, a_{t−1}), where H_0 denotes the initial state. The neural-network portion of the policy can be represented as follows:
π_θ(a_t | s_t) = σ( A_t × W_2 ReLU( W_1 [ [h_t^1; …; h_t^n]; p_t; R_t ] ) )
where W_1 and W_2 denote weight matrices and σ denotes the softmax function. The parameters can be updated by the following stochastic gradient:
∇_θ J(θ) = ∇_θ Σ_{t=1}^{T} ( R(s_T | q) − k · var(R(s_t | q)) ) log π_θ(a_t | s_t)
Therefore, we can express the overall policy network in the following form:
π_θ(a_t | s_t) = V[Q_t][Q_{t+1}^i],  if max(V[Q_t][Q_{t+1}^i]) ≥ φ;
π_θ(a_t | s_t) = σ( A_t × W_2 ReLU( W_1 [ [h_t^1; …; h_t^n]; p_t; R_t ] ) ),  otherwise.
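A minimal sketch of the threshold-gated policy above, assuming the prior weights come from the PriorKnowledgeTable sketch in Section 3.3.2 and that the neural policy scores have already been computed by the LSTM-based network; the sampling step and the index convention are illustrative.

```python
import numpy as np

def select_action(prior_weights, actions, neural_policy_scores, phi=0.05):
    """prior_weights: dict {sub-question index: V[Q_t][Q_{t+1}^i]};
    actions: candidate actions, one per candidate sub-question;
    neural_policy_scores: unnormalised pi_theta(a_t | s_t) over `actions`."""
    if prior_weights and max(prior_weights.values()) >= phi:
        best = max(prior_weights, key=prior_weights.get)   # act on the prior
        return actions[best]
    probs = np.asarray(neural_policy_scores, dtype=float)
    probs = probs / probs.sum()                            # fall back to the NN
    return actions[np.random.choice(len(actions), p=probs)]
```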

4. Experiments

4.1. Datasets

We construct and use three datasets to test the proposed method. The Complex Network Question Dataset (CNQD) was constructed by Talmor et al. [47], who built more complex query questions on the basis of the dataset used by Yih et al. [48]. The WC-2014-C dataset was constructed from the 2014 Football World Cup dataset used by Zhang et al. [49], following the construction of the CNQD: samples are drawn from the 2014 Football World Cup dataset and complex SPARQL queries are created automatically. SPARQL is a query language for RDF/OWL; from the perspective of the knowledge graph, the query process can be regarded as subgraph matching, which is mainly implemented in Python 3.9.2. Complex questions are generated, and the connection questions used to build the dataset are filtered out. An example of a question corresponding to such a SPARQL query is “What are Schwarzenegger’s films with a score greater than 6?”. The above work is mainly performed manually, so the dataset creation is time-consuming and costly. FB15K-C is constructed similarly to the WC-2014-C dataset but is based on FB15K (derived from Freebase, a large multi-domain dataset created by Google). We divide the training set and the test set in a ratio of about 3:1. Before dividing the training set and the test set, we reduce the number of samples in the majority class by undersampling to achieve as balanced a sample distribution as possible. We present the number of triples for the training and test sets of the three datasets in Table 1.
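The following is an illustrative sketch of the dataset preparation described above (undersampling the majority class, then a roughly 3:1 train/test split); the grouping of triples by class and the random seed are assumptions.

```python
import random

def undersample_and_split(triples_by_class, train_ratio=0.75, seed=0):
    rng = random.Random(seed)
    minority = min(len(v) for v in triples_by_class.values())
    balanced = []
    for triples in triples_by_class.values():
        balanced.extend(rng.sample(triples, minority))  # undersample majority classes
    rng.shuffle(balanced)
    cut = int(train_ratio * len(balanced))
    return balanced[:cut], balanced[cut:]               # train, test
```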

4.2. Baseline Method

The overall architecture of the proposed method comprises two main parts: the sub-question generation and ordering module and the sub-question-answering module, as shown in Figure 2. This study aims to optimize the sequence of the sub-question generation, not to design the sub-question-answering module. Therefore, in the choice of the baseline method, we use a method similar to our question-answering module as a comparison. We use Reward Remodeling (RR) [46] and MINERVA [47] as baseline methods. On the one hand, these two methods are selected as the baseline method because the question-answering module of their method is the same as ours. On the other hand, they are based on the improved reinforcement learning method to complete the question decomposition, which is the same idea as our method. During the experiment, the baseline method can generate a computation tree from a complex question and skip the case where there is a dependency between sub-questions.

4.3. Evaluation Index

We use Hits@n, a common evaluation index in the field of complex question decomposition based on knowledge graphs, to evaluate the performance of the proposed and baseline methods. This index refers to the average proportion of triples whose rank is less than or equal to n in link prediction. We choose Hits@n as the evaluation index because it is well suited to evaluating the accuracy of ranking results, and the key point of this paper is the accuracy of the generated sub-question ordering. Precision, recall, and F1 mainly evaluate the accuracy and completeness of prediction results, which are not closely related to this paper, so they are not included in the evaluation indices. Generally, n is 1, 3, or 10, and the larger the Hits@n value, the better the performance of the method. In this study, n takes the values of 1, 3, and 10.
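For reference, Hits@n can be computed from the rank of the correct answer for each test query as in the short sketch below.

```python
def hits_at_n(ranks, n):
    # ranks: 1-based rank of the correct entity for each test query
    return sum(1 for r in ranks if r <= n) / len(ranks)

# e.g., hits_at_n([1, 4, 2, 12, 3], n=3) == 0.6
```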

4.4. Model Training

We trained our model on a GTX 1080Ti, and Table 2 shows the training time and the number of training iterations for each dataset. Based on our hardware performance and dataset sizes, we set the hyperparameters as follows: 10 epochs, a batch size of 32, and a learning rate of 0.01. In addition, we adopt Global Vectors (GloVe), a 300-dimensional pre-trained word embedding model, for the word representation, and we update the network parameters during training using the Adam optimizer.
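The reported hyperparameters correspond to a configuration like the following sketch; the PyTorch wiring is an assumption for illustration, not the authors' code.

```python
import torch

config = {"epochs": 10, "batch_size": 32, "lr": 0.01, "embedding_dim": 300}  # GloVe-300

def make_optimizer(model):
    # Adam optimizer over the policy-network parameters, as described above.
    return torch.optim.Adam(model.parameters(), lr=config["lr"])
```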

4.5. Experimental Results and Analysis

We present the results of the experiments in Table 3, with the best result for each metric on each dataset highlighted. Our method achieves the best performance on all three datasets. Next, this section analyzes and explains the experimental results through an error rate and entropy analysis, a risk rate analysis, the threshold setting, and an ablation study.

4.5.1. Error Rate and Entropy Analysis

To analyze why the performance of the proposed method is better than that of the baseline methods, we first analyze the average error rate of the proposed and baseline methods in the process of complex question decomposition, as shown in Figure 4. The average error rate is calculated by dividing the number of erroneous steps by the total number of steps in the complex question decomposition. It can be observed from the figure that the single-step error rate of the proposed method is lower than that of the baseline methods, especially at the beginning. The specific reasons are explained in detail in the subsequent analysis. As the decomposition progresses, the error rate of the proposed method gradually increases to approximately that of the baseline methods, indicating that the proposed method initially selects the least risky sub-question for decomposition. The increase in the error rate may have two causes: first, low-quality prior knowledge may lead to the accumulation of errors; second, the parameters of the neural network must learn not only the order of answering questions but also the answers to the sub-questions. Therefore, even if the model tries the correct order, the error rate may increase due to wrong answers to sub-questions.
We further analyze why the proposed method performs better than the baseline methods through the entropy of the answer distributions of the sub-questions. The calculation formula is H(S) = −Σ_{i=1}^{n} p(s_i) log p(s_i), where p(s_i) represents the probability that the answer to the sub-question is s_i. Figure 5 shows the entropy of the answer distributions of the three sub-questions over 10–100 iterations of the proposed model on the CNQD. We can observe that the model continuously reduces the uncertainty of the sub-questions still to be decomposed by answering the sub-questions of the previous step. (H(S1) − H(S2)) / (H(S2) − H(S3)) is the ratio of the entropy difference between sub-question 1 and sub-question 2 to the entropy difference between sub-question 2 and sub-question 3. We choose this entropy difference ratio because we want to verify whether our method can continuously reduce the uncertainty of the next sub-question in the decomposition through iterations; reducing the uncertainty of the next sub-question means that our method can find the correct sub-question generation order. Figure 6 shows the variation in the above ratio with the number of iterations. If the ratio increases, the entropy difference between sub-question 2 and sub-question 3 becomes smaller, which means that the uncertainty in answering the next sub-question is reduced through iteration. From Figure 6, we can observe that as the number of iterations increases, the value of (H(S1) − H(S2)) / (H(S2) − H(S3)) shows an overall increasing trend, indicating that the model has learned to decompose the most profitable sub-question first.
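The entropy values and the ratio tracked in Figures 5 and 6 can be computed as in the following sketch; the answer distributions themselves are assumed to be available from the sub-question-answering module.

```python
import math

def entropy(probs):
    # H(S) = -sum_i p(s_i) log p(s_i) over a sub-question's answer distribution
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_drop_ratio(h1, h2, h3):
    # (H(S1) - H(S2)) / (H(S2) - H(S3)): larger values mean the uncertainty
    # about the next sub-question keeps shrinking as decomposition proceeds.
    return (h1 - h2) / (h2 - h3)
```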

4.5.2. Risk Rate Analysis

This section analyzes the reasons the proposed method performs better than the baseline methods from two aspects: the average risk rate of the selected sub-questions and the change in the risk rate of each sub-question over the decomposition steps. For the selection of sub-questions, as shown in Figure 7, the proposed model selects the sub-question with the lowest risk rate for decomposition in the first step, that is, at the beginning of the decomposition, which effectively reduces the error rate of the model's first decomposition step. In addition, Figure 8 shows how the risk rates of the three sub-questions change from Step 1 to Step 3: as the decomposition proceeds, the risk rate of sub-question 3, which initially has the largest risk rate, decreases continuously, so the overall risk rate remains at a relatively controllable level.

4.5.3. Threshold Setting and Ablation Study

Experiments on the FB15K-C dataset are conducted to verify the effectiveness of the proposed method compared with the general reinforcement learning method. Through the experiment, we found that the maximum weight of the relationship in the FB15K-C dataset is 0.1, so we set three thresholds (0.01, 0.05, and 0.1) for the experiment. We still use Hits@1, Hits@3, and Hits@10 to evaluate, and the experimental results are shown in Figure 9.
In general, if the value of the HITS@n of different methods differs by more than 0.02, the performance is considered to be different, and if it differs by more than 0.05, the performance is considered to be significantly different. It can be observed from the figure that the prior knowledge obtained by causal reasoning improves the performance of the reinforcement learning method. It is noteworthy that because the general reinforcement learning method does not involve the threshold, the values of HITS@1, HITS@3, and HITS@10 are the same under each threshold condition. In addition, the performance of the method increases first and then decreases with the increasing threshold, so the selection of the threshold also affects the performance of the method.
Since our method contains the threshold parameter and the reinforcement learning method does not, to prove the rationality of the ablation study, we perform a t-test on the FB15K-C dataset. To avoid bias from a particular data partition, we repeat the random partitioning of the training and test sets N (N = 30) times. We still use Hits@n (n = 1, 3, 10) as the performance evaluation index and set the threshold in our method to 0.05. The specific steps of the t-test are as follows. First, the performance results of our method and the plain reinforcement learning method are recorded as A_i and B_i, respectively, and a paired sample set (A_1, B_1), (A_2, B_2), …, (A_N, B_N) is formed to calculate each pairwise difference D_i = A_i − B_i. Second, we perform a Shapiro–Wilk test on the D_i and obtain p > 0.05, which satisfies the premise of the t-test. Next, we construct two hypotheses: the null hypothesis H_0, that there is no significant difference between the performance of our method and the plain reinforcement learning method, and the alternative hypothesis H_1, that there is a significant difference between them. Then we calculate the statistic t = D̄ / (s_D / √N), where D̄ = (1/N) Σ_{i=1}^{N} D_i and s_D = sqrt( (1/(N−1)) Σ_{i=1}^{N} (D_i − D̄)² ), and consult the t distribution with N − 1 degrees of freedom to obtain p = 0.01. Since p < 0.05 and the 95% confidence interval does not include 0, the null hypothesis is rejected, and it can be concluded that our method differs significantly from the plain reinforcement learning method. The results are shown in Table 4.
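A compact sketch of the paired-test procedure described above, using scipy for both the Shapiro–Wilk normality check and the paired t-test; the score arrays are the Hits@n results of the two methods over the N = 30 random splits.

```python
import numpy as np
from scipy import stats

def paired_significance(a_scores, b_scores, alpha=0.05):
    d = np.asarray(a_scores) - np.asarray(b_scores)   # D_i = A_i - B_i
    _, p_normal = stats.shapiro(d)                    # premise of the t-test
    t_stat, p_value = stats.ttest_rel(a_scores, b_scores)
    return {"shapiro_p": p_normal, "t": t_stat, "p": p_value,
            "significant": p_value < alpha}
```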

5. Conclusions

In this study, we propose a new method for decomposing complex questions by incorporating causal reasoning into reinforcement learning, which improves the accuracy of complex question decomposition by optimizing the order of sub-question generation. The proposed method designs a module that extracts prior knowledge using the counterfactual method in causal reasoning and integrates it into the policy network of the reinforcement learning model, thus providing more accurate guidance for the selection of sub-questions. In addition, we find the connections between subgraphs through the knowledge graph and a multi-head attention mechanism and design a reward mechanism covering both positive and negative effects. The threshold setting affects our method possibly because, if the threshold is set too high, the noise increases, which degrades the performance of the method. To keep the three indicators at a good level, we set the threshold to about 0.05 for the FB15K-C dataset; for other datasets, the appropriate threshold must likewise be determined through experiments to improve the overall performance of the method. The experimental results show the effectiveness of the proposed method, which improves the accuracy of complex question decomposition by optimizing the order of sub-question generation.
Although the proposed method achieves good results, there is still much room for further optimization and improvement. In subsequent studies, we will further optimize the input of the method so that the input form is not limited to the knowledge graph but can be input through text, voice, and other multimodal forms. In addition, we plan to further optimize the neural network so that it can better complete the task of answering questions.

Author Contributions

D.L.: conceptualization, writing—original draft, writing—reviewing and editing, methodology, validation, formal analysis. Y.L.: supervision, writing—reviewing and editing. J.W.: supervision, data procession, formal analysis. W.Z.: supervision, writing—editing. G.Z.: supervision, writing—editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset and code generated during the current study are not publicly available because they form part of an ongoing study, but they can be obtained from the corresponding author upon reasonable request.

Acknowledgments

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NLP: Natural language processing
TMN: Text modular network
RERC: Relation extractor–reader and comparator
GFCI: Greedy fast causal inference
GES: Greedy equivalence search
FCI: Fast causal inference
PC: Peter–Clark
MCI: Momentary conditional independence
CGTST: Causality-gated time series transformer
SARSA: State-Action-Reward-State-Action
MDP: Markov Decision Process
IV: Instrumental Variables
CMDP-IV: Confounded Markov Decision Process–Instrumental Variables
ASRs: Action-Sufficient Representations
LSTM: Long Short-Term Memory

References

  1. Jun, F.; Yan, L.; Ting, H.T. A survey of complex question decomposition methods in question answering system. Comput. Eng. Appl. 2022, 17, 22–33. [Google Scholar]
  2. Wei, S.Y.; Gong, C. Complex question answering Method of interpretable knowledge map based on graph matching network. J. Comput. Res. Dev. 2021, 12, 2673–2683. [Google Scholar]
  3. Bin, S.; Zhi, C.K.; Tao, L.S. Intelligent understanding of intention of complex questions for medical consultation. J. Chin. Inf. Process. 2023, 37, 112–120. [Google Scholar]
  4. Zhang, Y.N.; Cheng, X.; Zhang, Y.F. Learning to order sub-questions for complex question answering. arXiv 2019, arXiv:1911.04065. [Google Scholar]
  5. Fazili, B.; Goswami, K.; Modani, N. GenSco: Can Question Decomposition based Passage Alignment improve Question Answering? arXiv 2024, arXiv:2407.10245. [Google Scholar]
  6. Rosset, C.; Qin, G.; Feng, Z. Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents. arXiv 2024, arXiv:2402.17896. [Google Scholar]
  7. Yi, F.; Fu, W.; Liang, H. Model-based reinforcement learning: A survey. In Proceedings of the 18th ICEB, Guilin, China, 2–6 December 2018. [Google Scholar]
  8. Wang, H.N.; Liu, N.; Zhang, Y.Y. Deep reinforcement learning: A survey. Front. Inf. Technol. Electron. Eng. 2020, 12, 1726–1744. [Google Scholar] [CrossRef]
  9. Moerland, T.M.; Broekens, J.; Plaat, A. Model-based reinforcement learning: A survey. Found. Trends Mach. Learn. 2023, 16, 101–118. [Google Scholar] [CrossRef]
  10. Yang, L.M.; Ke, X.; Qiang, S.Z. A review of research on multi-agent reinforcement learning algorithms. J. Front. Comput. Sci. Technol. 2024, 4, 1101–1123. [Google Scholar]
  11. Watkins, C.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  12. Mnih, V.; Silver, D.; Graves, A. Playing Atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  13. Singh, S.; Jaakkola, T.; Littman, M. Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 2000, 38, 287–308. [Google Scholar] [CrossRef]
  14. Fortunato, M.; Azar, M.; Piot, B. Noisy networks for exploration. In Proceedings of the 6th ICLR, Vancouver, BC, Canada, 1–4 May 2018. [Google Scholar]
  15. Gal, Y.; McAllister, R.; Rasmussen, C.E. Improving PILCO with Bayesian neural network dynamics models. In Proceedings of the 33rd ICML, New York, NY, USA, 19–24 June 2016. [Google Scholar]
  16. Mnih, V.; Badia, A.P.; Mirza, M. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd ICML, New York, NY, USA, 19–24 June 2016. [Google Scholar]
  17. Shen, X.P.; Ma, S.S.; Vemuri, P. Challenges and opportunities with causal discovery algorithms: Application to Alzheimer’s pathophysiology. Sci. Rep. 2020, 1, 2975–2982. [Google Scholar] [CrossRef]
  18. Guo, R.C.; Cheng, L.; Li, J.D. A survey of learning causality with data: Problems and methods. ACM Comput. Surv. 2020, 53, 3397269. [Google Scholar] [CrossRef]
  19. Nogueira, A.R.; Gama, J.; Ferreira, C.A. Causal discovery in machine learning: Theories and applications. J. Dyn. Games 2021, 3, 203–231. [Google Scholar] [CrossRef]
  20. Ogarrio, J.M.; Spirtes, P.; Ramsey, J. A hybrid causal search algorithm for latent variable models. In Proceedings of the 8th PGM, Lugano, Switzerland, 6–9 September 2016. [Google Scholar]
  21. Chickering, D.M. Optimal structure identification with greedy search. J. Mach. Learn. Res. 2003, 3, 507–554. [Google Scholar]
  22. Spirtes, P.L.; Meek, C.; Richardon, T.S. Causal inference in the presence of latent variables and selection bias. In Proceedings of the 11th UAI, Montreal, QC, Canada, 18–20 August 1995. [Google Scholar]
  23. Runge, J.; Nowack, P.; Kretschmer, M. Detecting and quantifying causal associations in large nonlinear time series datasets. Sci. Adv. 2019, 5, 115–125. [Google Scholar] [CrossRef] [PubMed]
  24. Affeldt, S.; Isambert, H. Robust reconstruction of causal graphical models based on conditional 2-point and 3-point information. In Proceedings of the 31th UAI, Amsterdam, The Netherlands, 12–16 July 2015. [Google Scholar]
  25. Vansteelandt, S.; Daniel, R.M. On regression adjustment for the propensity score. Stat. Med. 2014, 23, 4053–4072. [Google Scholar] [CrossRef]
  26. Danilo, J.; Danihelka, I.; George, P. Causally correct partial models for reinforcement learning. arXiv 2020, arXiv:2020.02836v1. [Google Scholar]
  27. Zhi, H.D.; Jing, J.; Guo, D.L. Causal reinforcement learning: A survey. arXiv 2023, arXiv:2307.01452. [Google Scholar]
  28. Zeng, Y.; Rui, C.; Fu, S. A survey on causal reinforcement learning. arXiv 2023, arXiv:2302.05209. [Google Scholar] [CrossRef] [PubMed]
  29. Yue, S.; Wen, Z.; Chang, S. Causality in reinforcement learning control: The state of the art and prospects. Acta Autom. Sin. 2023, 49, 661–677. [Google Scholar]
  30. Liao, Z.; Fu, Z.; Yang, Y. Instrumental variable value iteration for causal offline reinforcement learning. arXiv 2021, arXiv:2102.09907. [Google Scholar]
  31. Subramanian, C.; Ravindran, B. Causal contextual bandits with targeted interventions. In Proceedings of the 10th ICLR, Online, 25–29 April 2022. [Google Scholar]
  32. Huang, B.; Lu, C.; Le, J. Action-sufficient state representation learning for control with structural constraints. In Proceedings of the 39th ICML, Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
  33. Bica, I.; Jarrett, D. Learning what if explanations for sequential decision-making. In Proceedings of the 9th ICLR, Online, 3–7 May 2021. [Google Scholar]
  34. Feng, S.X.; Ru, L.; Li, L.X. A span-based target-aware relation model for frame-semantic parsing. ACM Trans. Asian Low. Resour. Lang. Inf. Process. 2023, 22, 9001–9024. [Google Scholar]
  35. Kalyanpur, A.; Patwardhan, S.; Boguraev, B. Fact-based question decomposition for candidate answer re-ranking. In Proceedings of the 20th ACM CIKM, New York, NY, USA, 24–28 October 2011. [Google Scholar]
  36. Kalyanpur, A.; Patwardhan, S.; Boguraev, B. Fact-based question decomposition in DeepQA. IBM J. Res. Dev. 2012, 3, 133–145. [Google Scholar] [CrossRef]
  37. Kalyanpur, A.; Patwardhan, S.; Boguraev, B. Parallel and nested decomposition for factoid questions. In Proceedings of the 13th EACL, Philadelphia, PA, USA, 23–27 April 2012. [Google Scholar]
  38. Zheng, W.G.; Yu, J.X.; Zou, L. Question answering over knowledge graphs: Question understanding via template decomposition. Proc. VLDB Endow. 2018, 11, 1373–1386. [Google Scholar] [CrossRef]
  39. Min, S.; Zhong, V.; Zettlemoyer, L. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 20th EACL, Florence, Italy, 3–5 July 2019. [Google Scholar]
  40. Yan, H.; Qiu, X.P.; Huang, X.J. A graph-based model for joint Chinese word segmentation and dependency parsing. Trans. Assoc. Comput. Linguist. 2020, 8, 78–92. [Google Scholar] [CrossRef]
  41. Wu, L.Z.; Zhang, M.S. Deep graph-based character-level Chinese dependency parsing. Inst. Electr. Electron. Eng. 2021, 29, 1329–1339. [Google Scholar] [CrossRef]
  42. Khot, T.; Khashabi, D.; Richardson, K. Text modular networks: Learning to decompose tasks in the language of existing models. In Proceedings of the 2021 NAACL, Online, 6–11 June 2021. [Google Scholar]
  43. Fu, R.L.; Wang, H.; Zhang, X.J. Decomposing complex questions makes multi- hop QA easier and more interpretable. In Proceedings of the 2021 EMNLP, Punta Cana, Dominican Republic, 7–11 November 2021. [Google Scholar]
  44. Shin, S.; Lee, K. Processing knowledge graph-based complex questions through question decomposition and recomposition. Inf. Sci. 2020, 523, 234–244. [Google Scholar] [CrossRef]
  45. Lin, X.V.; Socher, R.; Xiong, C. Multi-hop knowledge graph reasoning with reward shaping. In Proceedings of the 2018 EMNLP, Brussels, Belgium, 2–4 November 2018. [Google Scholar]
  46. Das, R. Go for a walk and arrive at the answer-reasoning over paths in knowledge bases using reinforcement learning. In Proceedings of the 6th ICLR, Vancouver, BC, Canada, 1–3 May 2018. [Google Scholar]
  47. Talmor, A.; Berant, J. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 NAACL-HLT, New Orleans, LA, USA, 1–6 June 2018. [Google Scholar]
  48. Yih, M.; Richardson, C.; Meek, M. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 2018 ACL, Berlin, Germany, 15–20 July 2018. [Google Scholar]
  49. Zhang, L.; Winn, J.M.; Tomioka, R. Gaussian attention model and its application to knowledge base embedding and question answering. arXiv 2016, arXiv:1611.02266. [Google Scholar]
Figure 1. The influence of the order of generating sub-questions on the accuracy and efficiency of answering questions.
Figure 2. The overall structure of the method.
Figure 3. Module architecture for sub-question generation and ordering.
Figure 4. Error rate per step.
Figure 5. Entropy value of sub-question answer distribution in iterative process.
Figure 6. Ratio of sub-question entropy decrease in iteration process.
Figure 7. The average risk rate for each decomposition step in the iterative process.
Figure 8. The average risk rate of each sub-question decomposed in the iteration process.
Figure 9. Experimental results with different thresholds on the FB15K-C dataset.
Table 1. Number of training and test set triples.

Dataset       Training Triples    Test Triples
CNQD          7734                1475
WC-2014-C     6209                1881
FB15K-C       5000                1660
Table 2. The time consumed for training and the number of iterations.

                        CNQD    WC-2014-C    FB15K-C
Time (hours)            5.5     4.5          4
Number of iterations    100     80           70
Table 3. The performance of our method and the baseline methods on the three datasets.

Models      CNQD                          WC-2014-C                     FB15K-C
            Hits@1   Hits@3   Hits@10     Hits@1   Hits@3   Hits@10     Hits@1   Hits@3   Hits@10
Ours        0.552    0.618    0.719       0.558    0.602    0.671       0.212    0.257    0.383
RR          0.479    0.525    0.628       0.418    0.531    0.615       0.179    0.238    0.266
MINERVA     0.496    0.558    0.667       0.441    0.552    0.639       0.191    0.242    0.287
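For clarity, Hits@n in Table 3 is the standard ranking measure: the fraction of test questions whose gold answer appears among the model's top n ranked candidates. A minimal sketch of how it can be computed is given below; the function name and data layout are illustrative assumptions, not taken from the paper.

```python
def hits_at_n(ranked_answers, gold_answers, n):
    """Fraction of test items whose gold answer appears in the top n candidates.

    ranked_answers: list of candidate lists, one per test question, best first.
    gold_answers:   list of reference answers, aligned with ranked_answers.
    """
    hits = sum(1 for ranked, gold in zip(ranked_answers, gold_answers)
               if gold in ranked[:n])
    return hits / len(gold_answers)

# Hypothetical usage: each reported value corresponds to one call with n = 1, 3, or 10.
# hits1 = hits_at_n(ranked_answers, gold_answers, 1)
# hits10 = hits_at_n(ranked_answers, gold_answers, 10)
```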
Table 4. The t-test results.

Index                       Ours            Separate Reinforcement Learning    D_i
Experimental times          30              30                                  0
Mean value of Hits@1        0.63 ± 0.02     0.58 ± 0.03                         0.05
Mean value of Hits@3        0.77 ± 0.03     0.75 ± 0.02                         0.02
Mean value of Hits@10       0.84 ± 0.03     0.82 ± 0.03                         0.02
t                           -               -                                   6.708
p                           -               -                                   0.01
95% confidence interval     -               -                                   (0.021, 0.039)
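The comparison in Table 4 is over 30 repeated runs of each method. A minimal sketch of how such a significance test and confidence interval could be computed is shown below; it assumes a paired comparison over matched runs and uses scipy for illustration, neither of which is stated in the paper.

```python
import numpy as np
from scipy import stats


def paired_comparison(scores_ours, scores_baseline, confidence=0.95):
    """Paired t-test and confidence interval for the mean per-run score difference."""
    scores_ours = np.asarray(scores_ours)
    scores_baseline = np.asarray(scores_baseline)
    diff = scores_ours - scores_baseline

    # Paired t-test on the per-run differences.
    t_stat, p_value = stats.ttest_rel(scores_ours, scores_baseline)

    # Confidence interval for the mean difference, based on the t distribution.
    sem = stats.sem(diff)
    ci = stats.t.interval(confidence, len(diff) - 1, loc=diff.mean(), scale=sem)
    return t_stat, p_value, ci

# Hypothetical usage: scores_ours and scores_baseline hold the 30 per-run Hits@1 values.
# t_stat, p_value, ci = paired_comparison(scores_ours, scores_baseline)
```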