Abstract
Causal discovery is central to human cognition, and learning directed acyclic graphs (DAGs) is its foundation. Recently, many nature-inspired meta-heuristic optimization algorithms have been proposed to serve as the basis for DAG learning. However, a single meta-heuristic algorithm requires specific domain knowledge and empirical parameter tuning and cannot guarantee good performance in all cases. Hyper-heuristics provide an alternative methodology to meta-heuristics, enabling multiple heuristic algorithms to be combined and optimized to achieve better generalization ability. In this paper, we propose a multi-population choice function hyper-heuristic to discover the causal relationships encoded in a DAG. This algorithm provides a reasonable way of combining structural priors or possible expert knowledge with swarm intelligence. Under a linear structural equation model (SEM), we first identify partial v-structures through partial correlation analysis and use them as structural priors for the subsequent nature-inspired swarm intelligence search; the same partial correlation analysis is also used to restrict the search space. Experimental results on six standard networks demonstrate the effectiveness of the proposed methods compared to earlier state-of-the-art methods.
1. Introduction
Causal discovery from observable data is described by Judea Pearl as one of the seven important tasks and tools for moving toward a strong artificial intelligence society. It is widely used in medicine [,,], biology [], environmental science [], and other fields. Currently, there are two types of causal modeling based on DAGs: Bayesian networks and SEMs. Bayesian networks operate on discrete data, modeling the relationships between causal variables as probabilistic relationships. In contrast, SEMs operate on continuous data, assuming the data follow a specified distribution to interpret the causal relations. Consequently, causal discovery methods based on SEMs can, under suitable assumptions, guarantee unique identifiability of the causal structure. The classical SEMs proposed thus far include the linear non-Gaussian acyclic model (LiNGAM) [], additive noise model (ANM) [], post-nonlinear model (PNL) [], and information-geometric causal inference (IGCI) [].
There are two main approaches for learning a DAG: constraint-based and score-based. Constraint-based approaches, such as the well-known PC [], utilize conditional independence (CI) tests to search for a Markov equivalence class of causal graphs and do not need to assume any kind of causal mechanism. Therefore, they can be easily extended to address more complex problems. However, high-order CI tests are time-consuming and unreliable with limited samples. Score-based approaches, which use a scoring function to estimate the quality of DAGs and then search for a DAG with the highest score, are currently the most widely utilized method. However, the number of DAGs contained in the search space increases exponentially with the number of nodes. Exact methods become infeasible because they must traverse the entire search space, so an increasing number of heuristic methods have been proposed for this task. Examples include K2 [], A* [], and GES [], but they often become trapped in local optima. To escape local optima, nature-inspired meta-heuristic optimization algorithms have attracted increasing attention for DAG learning. These nature-inspired optimization algorithms include the genetic algorithm (GA) [], evolutionary programming [], ant colony optimization [], cuckoo optimization [], water cycle optimization [], particle swarm optimization (PSO) [,], artificial bee colony (ABC) algorithms [], bacterial foraging optimization (BFO) algorithms [], and firefly algorithms (FAs) []. Although these optimization algorithms have achieved relatively good results, they still face the following challenges:
- As suggested by the no-free-lunch theorem, a single meta-heuristic algorithm cannot meet the different needs of various practical problems and cannot guarantee good performance in all cases.
 - For large DAGs, the global search ability of the meta-heuristic algorithm is insufficient, the algorithm can easily fall into local optima, and the convergence accuracy is not high.
 
Hybridization, i.e., the combination of different meta-heuristics or of components of meta-heuristics, can exploit the differences and complementarities of each heuristic to improve the performance of DAG learning, and many recent results in the scientific literature support this notion. Unlike the hybridization of meta-heuristics, hyper-heuristics represent a hybridization approach in which heuristics are used to choose or generate heuristics for solving combinatorial optimization problems. Recently, hyper-heuristics have been successfully applied to many practical problems in various fields, including the traveling salesman problem [], the vehicle routing problem [], the knapsack problem [], and T-way testing []. According to the literature on these applications, hyper-heuristic methods increase the abstraction level of heuristic algorithms and can achieve better generalization ability, so that satisfactory solutions can be obtained at a small cost. Given the excellent performance of the hyper-heuristic approach, designing hyper-heuristic algorithms is a topic worthy of study for DAG learning.
There are two main hyper-heuristic categories: heuristic selection and heuristic generation. A selection hyper-heuristic, which is the focus of our study, designs a high-level strategy to select low-level heuristics with the best performance in the search process. In this paper, we develop a hyper-heuristic with a choice function as the high-level strategy, and low-level heuristics are derived from the operators of several nature-inspired optimization algorithms. To further improve the search performance of hyper-heuristics, several common heuristic algorithm search strategies are also adopted. First, we learn from several hybrid algorithms that can reduce the size of the search space. Hybrid algorithms, such as MMHC [] and PCS [], are combinations of constraint-based approaches and score-based approaches. The most common strategy is to reduce the size of the search space through a constraint-based approach and then perform a search. For linear SEMs, we consider using partial correlation analysis to obtain a more compact search space, while partial v-structures are also identified as a structural prior and then integrated into the search process as an alternative or supplement to expert prior knowledge. Second, we learn multi-population strategies from swarm intelligence algorithms that enhance global search capabilities and reduce the likelihood of falling into local optima.
The main contributions of this paper are summarized as follows:
- We propose a novel method to mine conditional independence information and determine the v-structure through partial correlation analysis, and demonstrate that this method is correct in both theory and practice. In the partial correlation analysis, two restricted search spaces are obtained, and the low-level heuristics can select the appropriate search space to improve efficiency.
 - We select the components of the existing heuristic algorithm to build the low-level algorithm library. To enhance the global search capability of large-scale DAGs, we redesign the global search operator. In addition, we design a search space switching operator for the global search operator. In the first stage, the global search operator works in the restricted search space to improve efficiency, and in the second stage, it works in the complete search space to improve accuracy.
 - We propose a multi-population choice function hyper-heuristic to provide sufficient coverage of the search space, and various groups communicate with each other through an immigration operator. To solve the problem that there is an order of magnitude difference between the fitness change and running time in DAG learning problems, we modify the choice function to balance the attention between them.
 
2. Related Works
Scholars have been exploring DAG learning for more than 40 years, and Constantinou divides the research results during these years into four main research directions: ideal data, continuous optimization, weakening faithfulness, and knowledge fusion [].
1. Ideal data. In the first direction, DAG structures are constructed using various causal discovery algorithms and optimization algorithms for datasets that are ideally unbiased and satisfy causal sufficiency and faithfulness. These algorithms are based on combinatorial optimization and form two solution directions: constraint-based approaches and score-based approaches. The specific implementation process of the constraint-based approach is divided into two steps: the first step involves conducting CI tests on variables, and the second step involves learning the global structure or local structure based on the CI test results. The most classic global structure discovery method is the PC [] algorithm, which has the advantage of low time complexity but at the cost of stability. Therefore, Colombo and Maathuis proposed the PC-stable [] algorithm, which effectively eliminates the order dependence in the process of skeleton determination and edge orientation. It is also a constraint-based algorithm widely recognized and used by scholars in recent years, and it is used as a comparison algorithm in this paper. In addition, Spirtes et al. proposed the FCI [] algorithm in response to the existence of hidden variables or confounding factors in research questions. Some scholars have improved it in recent years, such as RFCI [] and FCI+ []. The local structure discovery method focuses on learning Markov blankets (MBs) in a DAG. The best-known method for local structure discovery is IAMB [], which uses conditional mutual information to determine the order in which individual variables are incorporated into the MB. Later, improved versions were proposed: Inter-IAMB and Fast-IAMB. In addition, Tsamardinos and Aliferis et al. proposed the MMPC, HITON-PC, and SI-HITON-PC [] algorithms to discover MBs. Note that the performance of constraint-based approaches, which employ statistical tools to test conditional independence in the empirical joint distribution, may be severely limited by the hypothesis tests they use. Score-based approaches can be divided into approximate approaches and exact approaches according to whether they can obtain the global optimal solution. With well-defined scores, such as the Bayesian Information Criterion (BIC), the Minimum Description Length (MDL), and the Bayesian Dirichlet equivalence (BDe), score-based approaches turn causal discovery problems into combinatorial optimization problems. Based on this, several exact methods for solving combinatorial optimization problems, such as dynamic programming [], branch-and-bound [], and integer linear programming [], have been applied to DAG learning. Due to the poor scalability of exact approaches, approximate approaches have gained great popularity. HC [], which uses a greedy strategy and different operators to search the neighborhood of the current DAG and update the optimal structure until the termination condition is reached, is the most classical approximate learning algorithm in DAG space. However, this algorithm can easily fall into local optima. Therefore, many nature-inspired optimization algorithms have emerged in recent years, among which PSO [], ABC [], and GAs [] are widely used meta-heuristic algorithms, and many versions of these algorithms have been proposed. Although these meta-heuristic algorithms have achieved relatively good results, their search and generalization abilities still need to be improved.
Therefore, this paper adopts a hyper-heuristic approach to DAG learning to obtain stronger search and generalization abilities compared to a single heuristic approach. To the best of our knowledge, hyper-heuristic methods have not been applied to DAG learning.
2. Continuous optimization. Most of the causal structures output by traditional constraint-based approaches and score-based approaches belong to the Markov equivalence class. To solve the problem of the Markov equivalence class, the method of introducing SEMs into causal models is receiving increasing attention. In SEMs, if some additional assumptions are made about the functional and/or parametric forms of the underlying true data-generating structure, then one can exploit asymmetries to identify the direction of causality. For example, Shimizu et al. [,] first proposed an estimation method based on independent component analysis (ICA) for LiNGAM, which can uniquely identify the complete DAG by exploiting the non-Gaussian properties of the data. For nonlinear data, Hoyer et al. [] proposed the ANM to infer causality based on the assumption of independence between cause variables and noise variables. Compared with the ANM, Zhang et al. [] proposed PNL, which describes the data generation process more generally. Janzing et al. [] started from the perspective of information geometry and made causal inferences based on information entropy. In 2018, Zheng et al. first proposed NOTEARS [], which formulates the structure learning problem as a continuous optimization problem by introducing a smooth characterization of acyclicity. This method makes it possible to use gradient updating to acquire large-scale learning and online learning abilities. On this basis, an increasing number of machine learning methods, such as neural networks [], reinforcement learning [], and autoencoders [], have been introduced into this field. In addition, some updated versions of NOTEARS have also been proposed recently, such as NO TEARS+ [], NO BEARS [], and NO FEARS []. However, NOTEARS and its variants still lack a theoretical analysis of the unique identification of this model []. Moreover, in our experiments, NOTEARS sometimes failed to return a DAG.
3. Weakening faithfulness. Traditional causal faithfulness is a very demanding requirement, and researchers are constantly trying to relax the faithfulness requirements placed on the data distribution and independence tests in order to improve the robustness of models. Unlike the PC algorithm, which is based on complete causal faithfulness, the CPC [] algorithm uses weaker adjacency faithfulness and directed faithfulness in the v-structure determination phase. Zhang and Spirtes believe that this weak faithfulness hypothesis can also be applied in the skeleton determination stage, so triangular faithfulness has been proposed []. In addition, Cheng et al. proposed TPDA, which requires a stronger faithfulness hypothesis (monotone faithfulness).
4. Knowledge fusion. Expert knowledge is often used to assist in DAG modeling, and the integration of expert knowledge is divided into two methods: soft constraints and hard constraints. The former guides or intervenes in the learning process, while the latter forces the final learning outcome to meet certain conditions. For hard constraints, De Campos and Castellano et al. took the lead in modifying HC and PC algorithms to make the learning results meet the given edge constraints []. Later, De Campos proposed an improved B&B algorithm [], which supports predetermining the direction of some edges before learning. Borboudakis and Tsamardinos first proposed applying this constraint to PC and FCI algorithms to improve the accuracy of the edge orientation phase []. For soft constraints, different initial search graphs or restricted search spaces can also be regarded as soft constraints of score search algorithms, such as MMHC [] and PC-PSO []. In our algorithm, partial correlation is used to mine structural priors as a supplement or alternative to expert knowledge to guide the search process.
In summary, we consider introducing a hyper-heuristic method guided by expert knowledge to improve the search performance of causal discovery algorithms.
3. Background
3.1. DAG Model
A graph $G = (V, E)$ represents a joint distribution $P(X)$ as a factorization over n variables $X = \{x_1, \dots, x_n\}$, using n corresponding nodes $V$ and connecting edges $E$, where $e_{ij} \in E$ indicates an edge between $x_i$ and $x_j$. If all the edges are directed and there are no cycles, we have what is known as a DAG.
Definition 1 
(v-structure). In a DAG, if there are two distinct adjacent nodes of X on a simple path, and both of them are parents of X, then these three nodes form a v-structure, and node X is called a collider. Otherwise, we call X a noncollider [].
Definition 2 
(d-separation). Two nodes $x_i$ and $x_j$ are d-separated by a set Z if every simple path from $x_i$ to $x_j$ is blocked by Z. Note that a simple path is blocked if there is at least one noncollider on the path in Z or if at least one collider on the path and all its descendants are not in Z [].
3.2. SEMs and Partial Correlation
An SEM is a set of equations describing the value of each node $x_i$ in X as a function $f_i$ of its parents $\mathrm{Pa}(x_i)$ and a random disturbance term $\varepsilon_i$:

$$x_i = f_i\left(\mathrm{Pa}(x_i),\, \varepsilon_i\right), \quad i = 1, \dots, n \qquad (1)$$
where the functions $f_i$ can be defined as linear or nonlinear. If we restrict them to be linear, a linear SEM takes the following form:

$$x_i = \sum_{x_j \in \mathrm{Pa}(x_i)} w_{ij}\, x_j + \varepsilon_i, \quad i = 1, \dots, n \qquad (2)$$
Definition 3 
(Partial correlation). The partial correlation coefficient between two nodes $x_i$ and $x_j$, given a set of conditions Z, denoted as $\rho_{x_i x_j \cdot Z}$ or simply $\rho_{ij \cdot Z}$, is the correlation of the residuals $r_i$ and $r_j$ resulting from the least-squares linear regression of $x_i$ on Z and of $x_j$ on Z, respectively [].
The most common method for calculating the partial correlation coefficient relies on inverting the correlation matrix R of X. Given $P = R^{-1}$, the partial correlation coefficient can be efficiently computed according to Equation (3):

$$\rho_{ij \cdot Z} = -\frac{P_{ij}}{\sqrt{P_{ii}\, P_{jj}}} \qquad (3)$$
In particular, the full partial correlation between two nodes $x_i$ and $x_j$ means that the set of conditions Z is equal to $X \setminus \{x_i, x_j\}$, whereas the set of conditions Z corresponding to a local partial correlation is a subset of $X \setminus \{x_i, x_j\}$.
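For concreteness, the following sketch (our own illustration, not the authors' implementation; the function name `full_partial_correlations` is ours) computes the full partial correlation matrix of Equation (3) with NumPy:

```python
import numpy as np

def full_partial_correlations(data):
    """Full partial correlations rho_{ij.Z}, with Z = all remaining variables.

    data: (m, n) array holding m samples of n variables.
    Implements Equation (3): with P = R^{-1},
    rho_{ij.Z} = -P_ij / sqrt(P_ii * P_jj).
    """
    R = np.corrcoef(data, rowvar=False)   # correlation matrix R of X
    P = np.linalg.pinv(R)                 # (pseudo-)inverse, robust to near-singular R
    d = np.sqrt(np.diag(P))
    rho = -P / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho
```

For example, `full_partial_correlations(X)[0, 1]` is the full partial correlation between the first two variables given all the others.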
Theorem 1. 
When the data are generated by linear SEMs, if the random disturbance term $\varepsilon_i$ has constant variance and is uncorrelated, partial correlation analysis can be used as a criterion for the CI test [].
Theorem 2. 
If the sample size (denoted as m) of a given dataset generated by linear SEMs is sufficiently large, the test statistic t concerning the partial correlation coefficient approximately follows a t-distribution with $m - |Z| - 2$ degrees of freedom []:

$$t = \rho_{ij \cdot Z}\,\sqrt{\frac{m - |Z| - 2}{1 - \rho_{ij \cdot Z}^{2}}} \qquad (4)$$
Definition 4 
(Bayes factor). Given a dataset D, the Bayes factor for a null hypothesis $H_0$ over an alternative hypothesis $H_1$, denoted as $BF_{01}$, can be written according to Equation (5):

$$BF_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)} \qquad (5)$$
For partial correlation analysis, the Bayes factor provides an index of preference for one hypothesis over another that is more intuitive in interpretation than the traditional p value. Since p values are often misunderstood and misused, the Bayes factor is used as a significance test for partial correlation analysis in this paper []. The reason that the Bayes factor is not commonly used is that it is inconvenient to calculate. In this paper, because the partial correlation coefficient approximately follows a t-distribution, the Bayes factor can be directly computed using an approximation algorithm.
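Equation (4) and the Bayes factor test can be sketched as follows. The paper's exact approximation formula is not reproduced here; the code uses a standard BIC-based approximation of $BF_{01}$ derived from the t statistic, offered only as an illustrative stand-in:

```python
import numpy as np

def partial_corr_t(rho, m, z_size):
    """t statistic of Equation (4); degrees of freedom df = m - |Z| - 2."""
    df = m - z_size - 2
    return rho * np.sqrt(df / (1.0 - rho ** 2)), df

def bf01_from_t(t, df, m):
    """BIC-style approximation of BF_01 (H0: zero partial correlation).

    Uses the nested-regression identity RSS0/RSS1 = 1 + t^2/df, giving
    BF_01 ~= sqrt(m) * (1 + t^2/df)^(-m/2); small values favour dependence.
    This is one common approximation, not necessarily the paper's formula.
    """
    return np.exp(0.5 * np.log(m) - 0.5 * m * np.log1p(t ** 2 / df))
```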
3.3. Scoring Function
The scoring function used in this paper is the BIC, which is composed of the goodness of fit of a model and the penalty for model complexity. The BIC score is defined as:

$$\mathrm{BIC}(G) = \sum_{i=1}^{n}\left[-\mathrm{NLL}\!\left(x_i \mid \mathrm{Pa}(x_i), \hat{\theta}_i\right) - \frac{k_i}{2}\log m\right] \qquad (7)$$

where $\mathrm{NLL}(\cdot)$ denotes the negative log-likelihood used to evaluate the goodness of fit of a model, $\frac{k_i}{2}\log m$ denotes the penalty for model complexity, $\hat{\theta}_i$ denotes the maximum likelihood estimate of the parameters for node $x_i$, and $k_i$ is the number of estimated parameters for node $x_i$, which is equal to the number of its parents.
Theorem 3. 
In a linear SEM, the best linear unbiased estimator of the parameters is the ordinary least-squares estimator if the random disturbance term $\varepsilon_i$ has a mean of zero and constant variance and is uncorrelated.
In this paper, the least-squares method was used for parameter estimation; it determines the best-fitting parameters by minimizing the sum of squared residuals. Therefore, the negative log-likelihood for node $x_i$ can be computed according to Equation (8),

$$\mathrm{NLL}\!\left(x_i \mid \mathrm{Pa}(x_i), \hat{\theta}_i\right) = \frac{m}{2}\left[\log\!\left(2\pi\hat{\sigma}_i^{2}\right) + 1\right], \qquad \hat{\sigma}_i^{2} = \frac{1}{m}\left\|y - x\,\hat{\theta}_i\right\|^{2} \qquad (8)$$

where $\hat{\theta}_i$ is equal to the parameter vector estimated by the least-squares method and can be computed according to Equation (9),

$$\hat{\theta}_i = \left(x^{\top}x\right)^{-1}x^{\top}y \qquad (9)$$

where y denotes the vector of observations on $x_i$ and x denotes the matrix of observations on its parents.
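To make Equations (7)-(9) concrete, the sketch below scores one node's candidate parent set from the least-squares residuals. The Gaussian form of the likelihood and the variable names are our assumptions, consistent with Theorem 3 rather than copied from the authors' code.

```python
import numpy as np

def node_bic(data, i, parents):
    """BIC contribution of node i for a candidate parent set (higher is better).

    data: (m, n) sample matrix; parents: list of column indices of Pa(x_i).
    Combines the OLS estimate (Equation (9)), a Gaussian negative
    log-likelihood (Equation (8)), and the complexity penalty of Equation (7).
    """
    m = data.shape[0]
    y = data[:, i]
    if parents:
        X = data[:, parents]
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # theta = (x'x)^{-1} x'y
        resid = y - X @ theta
    else:
        resid = y - y.mean()                            # no parents: fit the mean only
    sigma2 = max(float(resid @ resid) / m, 1e-12)       # ML estimate of the noise variance
    nll = 0.5 * m * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return -nll - 0.5 * len(parents) * np.log(m)

def bic_score(data, parent_sets):
    """Total BIC of a DAG given as {node index: list of parent indices}."""
    return sum(node_bic(data, i, ps) for i, ps in parent_sets.items())
```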
4. Methodology
In this section, we first propose a new method called structural priors by partial correlation (SPPC), where the key idea is to use partial correlation analysis to mine conditional independence information. Next, this conditional independence information is integrated into the hyper-heuristic algorithm as a structural prior.
4.1. SPPC
Due to the equivalence of zero partial correlation and CI for linear SEMs, the goal of the SPPC algorithm is to use partial correlation analysis to narrow the search space as much as possible in addition to identifying partial v-structures. The SPPC algorithm starts with an empty graph and consists of three stages: full partial correlation, local partial correlation, and identification of v-structures. Through these three stages, we can obtain the global search space (GSS), local search space (LSS), and v-structure (V). The pseudocode is shown in Algorithm 1, and we explain each stage in more detail in the following paragraphs. The three stages can be summarized as follows:
- For any two nodes $x_i$ and $x_j$, add an edge if the full partial correlation coefficient is significantly different from zero.
 - For every edge in the undirected graph built in step 1, we perform a local partial correlation analysis that looks for a d-separating set Z. If the partial correlation vanishes, we consider this edge to be a spurious link caused by v-structure effects and then remove it.
 - For every edge that is removed in step 2, we find the colliders contained in Z. If node U is a collider, we add two edges (both pointing into U) to V.
 
Algorithm 1: SPPC
In the first stage, we perform a full partial correlation analysis and reconstruct a Markov random field. In the full partial correlation analysis, if the Bayes factor $BF_{01}$ is less than a given threshold, we consider the two nodes to be correlated and connect them in the GSS and LSS. In contrast, if $BF_{01}$ is greater than the threshold, we consider the two nodes to be uncorrelated. If the data satisfy the faithfulness assumption, the GSS derived from the identified undirected graph may resemble a moral graph. Therefore, we treat the GSS as the primary search space to ensure the completeness of the search space. Unfortunately, in the GSS, all parents of colliders are connected, and the v-structures are transformed into triangles. It should be noted that these spurious links caused by v-structure effects have a more severe negative impact on the search process compared to other erroneous edges. When dealing with large-scale problems, the GSS cannot effectively alleviate the inefficiency of the search algorithm, which easily falls into local optima. Fortunately, the partial correlation coefficient for the CI test is easy to calculate, even when the size of the condition set is large. Therefore, to improve search efficiency, we consider further mining conditional independence information in the second stage.
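As an illustration of the first stage, the sketch below builds the undirected GSS by thresholding the Bayes factor. It is our own sketch: the threshold value `mu` and the Bayes factor approximation are placeholders, not the paper's exact settings, and it assumes more samples than variables.

```python
import numpy as np
from itertools import combinations

def build_global_search_space(data, mu=1.0 / 3.0):
    """Stage 1 of SPPC (illustrative): undirected GSS from full partial correlations.

    An edge (i, j) is added when the Bayes factor for a zero partial
    correlation falls below the threshold mu, i.e. when the data favour
    dependence. The LSS starts out as a copy of the GSS.
    """
    m, n = data.shape                                 # requires m > n for a positive df
    P = np.linalg.pinv(np.corrcoef(data, rowvar=False))
    d = np.sqrt(np.diag(P))
    rho = -P / np.outer(d, d)                         # Equation (3)
    df = m - (n - 2) - 2                              # |Z| = n - 2 remaining nodes
    gss = np.zeros((n, n), dtype=bool)
    for i, j in combinations(range(n), 2):
        t = rho[i, j] * np.sqrt(df / (1.0 - rho[i, j] ** 2))              # Equation (4)
        bf01 = np.exp(0.5 * np.log(m) - 0.5 * m * np.log1p(t ** 2 / df))  # illustrative BF
        if bf01 < mu:                                 # dependent -> connect the nodes
            gss[i, j] = gss[j, i] = True
    return gss
```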
In the second stage, our goal is to find a set Z that blocks all simple paths between two nodes $x_i$ and $x_j$. Obviously, the exhaustive method is inefficient and undesirable. Therefore, heuristic strategies are usually used to find such a cut set. For example, a two-phase algorithm [] utilized a heuristic method based on monotone faithfulness that employs the absolute value of partial correlation as a decision criterion. However, monotone faithfulness is sometimes a bad assumption []. In this paper, we propose a new heuristic strategy to determine a d-separating set, and the pseudocode is shown in Algorithm 2. To illustrate how our heuristic strategy works, some relevant concepts are briefly introduced.
        
Algorithm 2: Local partial correlation
Theorem 4. 
For any two nodes $x_i$ and $x_j$, if there is no edge between them, we can determine a d-separating set by choosing nodes only from either $MRF(x_i)$ or $MRF(x_j)$ []. Here, $MRF(x_i)$ denotes the Markov random field of node $x_i$.
This theorem enables us to perform local partial correlation analysis on small sets, which makes the estimation results more efficient and stable. Next, the most important task of a heuristic strategy is to find an appropriate metric to tightly connect the CI test with d-separation.
Definition 5 
(Simple path). A simple path is an adjacency path that does not contain duplicate nodes [].
According to the definition of d-separation, for any two nodes $x_i$ and $x_j$, if there exists a cut set Z that makes the two nodes conditionally independent, we can eventually find it by blocking all the simple paths between the two nodes.
Definition 6 
(Active path). For any two nodes $x_i$ and $x_j$, given a simple path U between the two nodes, the path U is blocked by Z if and only if at least one noncollider on U is in Z or at least one collider on U and all of its descendants are not in Z []. If path U is not blocked by Z, we call path U an active path on Z.
Definition 7 
(Open simple path). For any two nodes $x_i$ and $x_j$, and given a set of conditions Z, for a node $z_k$ in Z, if the partial correlations between $x_i$ and $z_k$ and between $x_j$ and $z_k$ (conditioned on the remaining nodes of Z) are both significantly different from zero, we refer to the simple paths from $z_k$ to $x_i$ and from $z_k$ to $x_j$ as open. In this case, $z_k$ is said to have an open simple path (OSP) to $x_i$ and $x_j$ on Z.
Notably, an OSP is different from an active path in a directed graph because we cannot determine whether node $z_k$ is a noncollider or a collider. The initial set of OSP nodes is the intersection of the Markov random fields of the two nodes.
Theorem 5. 
For any two nodes $x_i$ and $x_j$, given a set of conditions Z, if there is no edge between the two nodes in the underlying graph and $\rho_{ij \cdot Z}$ is significantly different from zero, then all active paths on Z with colliders must satisfy that every collider and all of its descendants contain a node (denoted as $z_k$) that belongs to Z and has an OSP to $x_i$ and $x_j$ on Z.
Proof of Theorem 5. 
If $\rho_{ij \cdot Z}$ is significantly different from zero, there must be at least one active path between $x_i$ and $x_j$ on Z; denote the set of such paths as U. For any path in U, denoted as path u, according to the definition of the active path, we know that all the noncolliders on u are not in Z and every collider satisfies that itself or at least one of its descendants is in Z. For any collider in u, let $z_k$ denote the node that satisfies the above condition. We can easily construct an active path based on path u between $x_i$ and $z_k$ on $Z \setminus \{z_k\}$, and similarly between $x_j$ and $z_k$. Therefore, we can consider that the partial correlations between $x_i$ and $z_k$ and between $x_j$ and $z_k$ are both significantly different from zero, and then $z_k$ has an OSP to $x_i$ and $x_j$ on Z. □
If path u does not contain a collider, we cannot block this path by removing nodes. Therefore, our heuristic strategy is to start with an initial set that contains the d-separation set and then block the simple paths by gradually removing nodes that have OSPs to $x_i$ and $x_j$. Throughout the process, as each node with an OSP is removed, we observe the number of remaining nodes that have OSPs, and this number is used as a criterion to determine which node to delete. In this paper, we greedily choose the node with the lowest value for removal. When no node in the conditional set has an OSP, the search stops.
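The greedy strategy can be written schematically as follows. This is our reading of Algorithm 2, not a verbatim reproduction; `is_dependent` stands for the partial correlation significance test, and the choice of the initial conditioning set follows Theorem 4.

```python
def osp_nodes(i, j, Z, is_dependent):
    """Nodes of Z with an open simple path (OSP) to both i and j (Definition 7)."""
    return {z for z in Z
            if is_dependent(i, z, Z - {z}) and is_dependent(j, z, Z - {z})}

def find_separating_set(i, j, Z0, is_dependent):
    """Greedy local partial correlation analysis between nodes i and j.

    Z0: initial conditioning set, e.g. the smaller Markov random field of the
    two nodes (Theorem 4). Returns (Z, True) when the partial correlation
    vanishes and (Z, False) when no OSP node is left to remove.
    """
    Z = set(Z0)
    while True:
        if not is_dependent(i, j, Z):              # partial correlation vanishes: d-separated
            return Z, True
        candidates = osp_nodes(i, j, Z, is_dependent)
        if not candidates:                         # no node in Z has an OSP: stop searching
            return Z, False
        # greedily remove the node whose deletion leaves the fewest remaining OSP nodes
        best = min(candidates,
                   key=lambda z: len(osp_nodes(i, j, Z - {z}, is_dependent)))
        Z = Z - {best}
```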
In the third stage, our task is to orient some edges correctly by detecting v-structures. For each edge removed in the local partial correlation analysis, we find the colliders contained in Z. If node U is a collider, we add two edges to V.
4.2. Proposed Multi-Population Choice Function Hyper-Heuristic
Hyper-heuristics are high-level methodologies that perform a search over the space formed by a set of low-level heuristics when solving optimization problems. In general, a hyper-heuristic contains two levels—a high-level strategy and low-level heuristics—and there is a domain barrier between the two. The former comprises two main stages: a heuristic selection strategy and a move acceptance criterion. The latter involves a pool of low-level heuristics, the initial solution, and the objective function (often also the fitness or cost function). The working principle of hyper-heuristics is shown in Figure 1.
      
    
Figure 1. Generic structure of a traditional hyper-heuristic model.
4.2.1. The High-Level Strategy
Various combinations of heuristic selection strategies and move acceptance criteria have been reported in the literature. Classical heuristic selection strategies include choice functions, nature-inspired algorithms, multi-armed bandit (MAB)-based selection, and reinforcement learning, while move acceptance criteria include only improvement, all moves, simulated annealing, and late acceptance. In this article, we use the “choice function, accept all moves” combination as the high-level strategy, which evaluates the performance score (F) of each LLH using three different measurements: $f_1$, $f_2$, and $f_3$. The specific calculation method is shown in Equation (10):

$$F_t(h_j) = \phi_t f_1(h_j) + \phi_t f_2(h_k, h_j) + \delta_t f_3(h_j) \qquad (10)$$
Parameter $f_1$ reflects the previous performance of the currently selected heuristic, $h_j$. The value of $f_1$ is evaluated using Equation (11),

$$f_1(h_j) = \sum_{n} \phi^{\,n-1}\,\frac{I_n(h_j)}{T_n(h_j)} \qquad (11)$$

where $I_n(h_j)$ is the change in solution quality produced by $h_j$ and is set to 0 when the solution quality does not improve, and $T_n(h_j)$ is the time taken by $h_j$.
Parameter $f_2$ attempts to capture any pairwise dependencies between heuristics. The value of $f_2$ is calculated for the current heuristic $h_j$ when employed immediately following $h_k$, using Equation (12),

$$f_2(h_k, h_j) = \sum_{n} \phi^{\,n-1}\,\frac{I_n(h_k, h_j)}{T_n(h_k, h_j)} \qquad (12)$$

where $I_n(h_k, h_j)$ is the change in solution fitness and $T_n(h_k, h_j)$ is the time taken by both heuristics. Similarly, $I_n(h_k, h_j)$ is set to 0 when the solution does not improve.
Parameter $f_3$ captures the time elapsed since the heuristic $h_j$ was last selected. The value of $f_3$ is evaluated using Equation (13):

$$f_3(h_j) = \tau(h_j) \qquad (13)$$

where $\tau(h_j)$ denotes the elapsed time since $h_j$ was last called.
The value range of parameters $\phi$ and $\delta$ is (0,1), and both are initially set to 0.5. If the solution quality improves, $\phi$ is rewarded heavily by being assigned the highest value (0.99), whereas $\delta$ is harshly punished by being assigned the lowest value (0.01). If the solution quality deteriorates, $\phi$ decreases linearly, and $\delta$ increases by the same amount. The values of both parameters are calculated using Equations (14) and (15):

$$\phi_t = \begin{cases} 0.99, & \text{if the solution quality improves} \\ \max\{\phi_{t-1} - 0.01,\; 0.01\}, & \text{otherwise} \end{cases} \qquad (14)$$

$$\delta_t = 1 - \phi_t \qquad (15)$$
For each LLH, the respective values of F are computed using the same parameters $\phi$ and $\delta$. The setting scheme of these two weight parameters makes the intensification component the dominating factor in the calculation of F while ensuring the diversification of the heuristic search process. However, in DAG learning problems, there is usually an order of magnitude difference between the fitness change and the running time. As a result, the balance between the intensification component and the diversification component cannot be guaranteed. To solve this problem, we record the fitness change and the running time whenever the current heuristic increases the score of the optimal structure. Then, all the previously recorded fitness changes are linearly transformed into the interval $[a\,m,\; b\,m]$, where m represents the average running time of all the calls. Coefficients a and b are used to balance the fitness change and running time, taking values of 0.1 and 0.2, respectively, in this article.
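A compact sketch of the selection mechanism follows (our own illustration of Equations (10)-(15); the rescaling of the recorded fitness changes described above is omitted for brevity, and measuring $f_3$ in wall-clock seconds is an assumption):

```python
import time

class ChoiceFunction:
    """Modified choice function for heuristic selection (Equations (10)-(15))."""

    def __init__(self, n_heuristics):
        self.phi, self.delta = 0.5, 0.5                # initial weights
        self.f1 = [0.0] * n_heuristics                 # individual performance
        self.f2 = {}                                   # pairwise performance
        self.last_used = [time.time()] * n_heuristics  # for f3: time since last selection
        self.prev = None                               # previously applied heuristic

    def select(self):
        """Return the index of the LLH with the largest score F (Equation (10))."""
        now = time.time()
        scores = [self.phi * self.f1[h]
                  + self.phi * self.f2.get((self.prev, h), 0.0)
                  + self.delta * (now - self.last_used[h])
                  for h in range(len(self.f1))]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, h, improvement, elapsed):
        """Update phi, delta, f1 and f2 after applying heuristic h.

        improvement: positive fitness change (0 when the solution did not improve);
        elapsed: running time of the call, in seconds.
        """
        if improvement > 0:
            self.phi, self.delta = 0.99, 0.01          # reward intensification (Eqs. (14)-(15))
        else:
            self.phi = max(self.phi - 0.01, 0.01)      # punish linearly
            self.delta = 1.0 - self.phi
        gain = improvement / max(elapsed, 1e-9)
        self.f1[h] = self.phi * self.f1[h] + gain      # discounted recursion for Eq. (11)
        if self.prev is not None:
            key = (self.prev, h)
            self.f2[key] = self.phi * self.f2.get(key, 0.0) + gain   # Eq. (12)
        self.last_used[h] = time.time()
        self.prev = h
```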
4.2.2. The Low-Level Heuristics
In this section, we introduce the 13 operators that make up the low-level algorithm library, which is primarily derived from several nature-inspired meta-heuristic optimization algorithms. For example, we decompose the BNC-PSO algorithm into three operators: the mutation operator, cognitive personal operator, and cooperative global operator. In addition, we modify these operators as follows. First, the mutation operator works on the GSS to improve its efficiency. Second, the acceleration coefficient of the cooperative global operator increases linearly from 0.1 to 0.5 to avoid premature convergence.
For the BFO algorithm, we choose only two operators: the chemotactic operator and the elimination and dispersal operator. Addition, deletion, and reversion operators are the three candidate moves that each bacterium can select in the chemotactic process, and for large DAG learning, these local operations can be blind and inefficient. Therefore, we restrict these three operations to manipulating the parent set of a single node. Specifically, the addition operation continuously adds possible parents to the selected node to improve the score, and its search space is the GSS. Correspondingly, the deletion operation and the reversion operation perform sequential deletion or parent–child transformation of the parent set of the selected node to improve the score. The elimination and dispersal operator is a global search operator, and we redesign its restart scheme to act only on the parent set of a selected node. For a selected node, we perform a local restart of the optimal structure in the population as the bacterial elimination and dispersal operation. First, we remove all the parent nodes of the selected node, calculate the score at this point, and record the structure at this point as the starting point for the restart. Second, in the search space, an addition chemotaxis operation is performed on the selected node to find a potential parent set, which is subsequently sorted by partial correlation values. Note that we do not update the starting-point structure in this step. Third, we add the nodes of the potential parent set one by one to the selected node, and if a node can improve the score, we add both the node itself and its parent and update the structure. Fourth, for the parent nodes that have been added, we greedily remove the one that has the greatest negative impact on the score and update the structure until the score cannot be improved further. The node with the greatest negative impact is identified by the deletion chemotactic operation. Finally, we perform a reversion operation. The startup of the elimination and dispersal operator is controlled by the parameter $P_{ed}$, which increases linearly from 0.1 to 1 and is computed using Equation (16),

$$P_{ed} = 0.1 + 0.9\,\frac{L}{L_{\max}} \qquad (16)$$

where L represents the number of iterations in which the global maximum score did not improve, and $L_{\max}$ represents the maximum number of iterations allowed without increasing the global maximum score.
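A minimal sketch of how the startup of the operator might be decided, treating $P_{ed}$ from Equation (16) as a trigger probability (this reading is our assumption):

```python
import random

def elimination_dispersal_triggered(L, L_max, rng=random.random):
    """Fire the elimination-and-dispersal operator with probability P_ed.

    P_ed rises linearly from 0.1 (L = 0) to 1 (L = L_max), as in Equation (16).
    """
    p_ed = 0.1 + 0.9 * min(L, L_max) / L_max
    return rng() < p_ed
```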
We decompose the ABC algorithm into three operators: worker bees, onlooker bees, and scout bees. Worker bees and onlooker bees, as local search operators, continue to work on the GSS. We redesign the scout bees to accommodate large-scale DAG learning. For a selected node, we perform a local restart of the optimal structure in the population. First, we record the parent set of the selected node. Second, a parent node is selected, and a parent–child transformation is performed with the selected node. Third, the addition, deletion, and reversion chemotaxis operations are performed successively. If the score of the new structure is higher than the score of the optimal structure, the structure is updated as a new starting point. Finally, we return to step 2 and continue until all parent nodes have been tested. The startup of the scout is controlled by a limit parameter: a scout is triggered when the individual best score does not improve for a given number of consecutive iterations.
Inspired by the moth–flame optimization algorithm, we randomly arrange the individual historical optimal solutions as flames and design moths to fly around them, which is equivalent to moths learning from the flames. The learning mode is the same as that of the BNC-PSO algorithm. Similarly, we adopt the learner phase from the teaching–learning-based optimization algorithm. In the current generation, each student is randomly assigned a collaborator to learn from if they are better than themselves, with the learning mode aligned with that of the BNC-PSO algorithm.
To make efficient use of structural priors, an expert knowledge operator is designed. In this operator, a fixed proportion of individuals are selected to be guided by expert knowledge or a structural prior, i.e., all identified v-structures are given. For large-scale DAG learning, an insufficient sample size often leads to overfitting problems. To reduce the complexity of the model, a pruning operator is designed to remove all edges whose contribution to the score is less than a threshold. This threshold is shared by all operators as the basis for judging whether the score has improved. In addition, a more efficient neighborhood perturbation operator is designed to operate on the LSS.
4.2.3. Framework of Our Algorithm
In this section, we describe the workflow of our proposed multi-population choice function hyper-heuristic (MCFHH) algorithm, the framework of which is shown in Algorithm 3. The MCFHH algorithm starts by randomly generating the initial valid population, which is obtained by performing several local hill-climbing operations on V. Next, we divide the population evenly into several groups, and each group runs its own choice function individually. Our algorithm terminates when the optimal score does not improve for a preset number of successive generations or when the maximum number of allowed iterations is reached. In addition, we introduce the migration operator and the search space switching operator when running the algorithm.
The migration operator runs only after a certain number of iterations, which we set to the minimum of 100 and N. In the migration operation, we record the optimal structure of each subgroup and then swap the best with the worst. To avoid inbreeding, we use the inbreeding rate as a parameter to limit migration operations. For DAG learning, we measure the inbreeding rate using the Hamming distance between the optimal individual of each subgroup and the globally optimal individual. In this paper, if the Hamming distance is less than 4, we assume that the optimal individual of the subgroup and the globally optimal individual are close relatives. The inbreeding rate is defined as the number of close relatives of the globally optimal individual divided by the number of subgroups, which, in this paper, is limited to no more than 0.6.
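The migration check can be sketched as follows (our own helper, not the authors' code; DAGs are assumed to be stored as 0/1 adjacency matrices, and the Hamming distance counts differing entries):

```python
import numpy as np

def hamming(dag_a, dag_b):
    """Hamming distance between two 0/1 adjacency matrices."""
    return int(np.sum(dag_a != dag_b))

def migration_allowed(subgroup_bests, global_best, max_rate=0.6, kinship=4):
    """Allow migration only while the inbreeding rate stays at or below max_rate.

    A subgroup's best individual counts as a close relative of the global
    best when their Hamming distance is less than `kinship` (4 in this paper);
    the inbreeding rate is the fraction of subgroups whose best is a close relative.
    """
    relatives = sum(hamming(b, global_best) < kinship for b in subgroup_bests)
    return relatives / len(subgroup_bests) <= max_rate
```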
For large-scale DAG learning, the GSS cannot guarantee the completeness of the search space when the sample size is insufficient. Therefore, we introduce a search space switching operator. The search space switching operator is executed only once, when the number of iterations without an increase in the highest score reaches a preset threshold. After execution, all global search operators operate within the complete search space (CSS) to correct errors caused by possible incompleteness of the GSS, and the count of iterations without an increase in the highest score is reset. This switching scheme is a balanced strategy that improves efficiency in the early stage of the algorithm and accuracy in the late stage.
          
Algorithm 3: MCFHH
5. Experiments
In this section, several existing competitive algorithms and networks are selected to test the performance of the MCFHH algorithm. The following algorithms were selected for comparison: the PC-stable, LiNGAM, PCS, BNC-PSO, and NOTEARS algorithms (https://github.com/xunzheng/notears, accessed on 16 May 2024). We added structural prior knowledge, including the initial population and the GSS, to the BNC-PSO algorithm in this paper. All the experiments were implemented and executed on a computer running Windows 10 with an AMD 1.7 GHz CPU and 16 GB of memory. NOTEARS was implemented in Python 3.10.5, and the other algorithms were implemented in MATLAB R2020a.
5.1. Networks and Datasets
In our experiments, six networks were selected from the BNLEARN repository (https://www.bnlearn.com/bnrepository/, accessed on 16 May 2024), and a summary of these networks is shown in Table 1.
       
    
Table 1. Summary of networks.
The datasets used in the experiments were generated by linear SEMs. Three different SEMs were designed, including the linear Gaussian model and the linear non-Gaussian model, each following the generating equation

$$x_i = \sum_{x_j \in \mathrm{Pa}(x_i)} w_{ij}\, x_j + \varepsilon_i$$

where $w_{ij}$ denotes the edge weight and $\varepsilon_i$ denotes the random disturbance term. For SEM1, the weight $w_{ij}$ is drawn from a Gaussian distribution, and the random disturbance term also follows a Gaussian distribution. Thus, SEM1 follows a multivariate Gaussian distribution and is a linear Gaussian model. For SEM2, the weight $w_{ij}$ is drawn from a uniform distribution, and the random disturbance term follows a Gaussian distribution. Thus, SEM2 also follows a multivariate Gaussian distribution and is a linear Gaussian model. For SEM3, the weight is drawn from a Gaussian distribution, and the random disturbance term follows a uniform distribution. Thus, SEM3 is a linear non-Gaussian model.
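For reproducibility, data can be generated along the topological order of the network; the sketch below is our own, and the distribution parameters (means, variances, and uniform ranges) are placeholders because the paper does not list them here.

```python
import numpy as np

def simulate_linear_sem(dag, m, sem=1, seed=0):
    """Sample m cases from the linear SEM x_i = sum_j w_ij x_j + eps_i.

    dag: (n, n) 0/1 adjacency matrix with dag[j, i] = 1 meaning x_j -> x_i,
         assumed to be in topological order (parents precede children).
    sem: 1 = Gaussian weights + Gaussian noise, 2 = uniform weights +
         Gaussian noise, 3 = Gaussian weights + uniform noise.
    """
    rng = np.random.default_rng(seed)
    n = dag.shape[0]
    if sem == 2:
        W = rng.uniform(0.5, 2.0, size=(n, n))        # placeholder weight range
    else:
        W = rng.normal(1.0, 1.0, size=(n, n))         # placeholder weight distribution
    W = W * dag                                       # keep weights only on true edges
    X = np.zeros((m, n))
    for i in range(n):                                # topological order assumed
        eps = (rng.uniform(-1.0, 1.0, size=m) if sem == 3
               else rng.normal(0.0, 1.0, size=m))
        X[:, i] = X @ W[:, i] + eps
    return X
```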
5.2. Performance Evaluation of the MCFHH Algorithm
The parameters of the MCFHH algorithm are listed in Table 2, and the parameters of the other algorithms are the best values from the corresponding literature. To evaluate the performance of these algorithms, the following metrics were used:
       
    
Table 2. Parameters of the MCFHH.
- BIC: the BIC score of the final output structure.
 - SBS: the BIC score of the standard network.
- AD: the number of arcs incorrectly added over all trials.
 - DD: the number of arcs incorrectly deleted over all trials.
 - RD: the number of arcs incorrectly reversed over all trials.
 - RET: the execution time of the restriction phase.
 - SET: the execution time of the search phase.
 - F1: the F1 score of the final output structure.
 
The first performance metric is the BIC (higher is better), representing the score of the final output structure. The calculation method of the BIC was introduced in Section 3.3. The SBS represents the score of the original network, which is a fixed reference value based on the sample data. AD, DD, and RD are used to evaluate the structural errors of the learning result, representing the number of incorrectly added edges, incorrectly deleted edges, and incorrectly reversed edges, respectively, in the final output network compared to the original network. RET and SET represent the execution times of the restriction phase (SPPC) and search phase (MCFHH), respectively. The F1 score (higher is better) is calculated as $F1 = \frac{2PR}{P + R}$, where P represents precision and R represents recall.
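These structural metrics can be computed directly from the learned and true adjacency matrices; the helper below is our own sketch (precision and recall are taken over directed edges):

```python
import numpy as np

def structural_metrics(learned, true):
    """Return (AD, DD, RD, F1) for two directed 0/1 adjacency matrices."""
    learned, true = np.asarray(learned), np.asarray(true)
    ad = int(np.sum((learned == 1) & (true == 0) & (true.T == 0)))       # incorrectly added arcs
    dd = int(np.sum((true == 1) & (learned == 0) & (learned.T == 0)))    # incorrectly deleted arcs
    rd = int(np.sum((true == 1) & (learned == 0) & (learned.T == 1)))    # incorrectly reversed arcs
    tp = int(np.sum((learned == 1) & (true == 1)))                       # correctly directed arcs
    precision = tp / max(int(learned.sum()), 1)
    recall = tp / max(int(true.sum()), 1)
    f1 = 2.0 * precision * recall / max(precision + recall, 1e-12)
    return ad, dd, rd, f1
```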
First, extensive experiments were conducted on six standard networks and three different linear SEMs to verify that our proposed algorithm is effective and robust. In our experiment, for each of the networks, we randomly sampled four datasets with 1000, 3000, 5000, and 10,000 cases. We report the mean and standard deviation of the evaluation indicators of 10 runs. Table 3, Table 4 and Table 5 present the results of the experiments on each dataset.
       
    
Table 3. Performance of MCFHH algorithm on different datasets for SEM1. Bold denotes that BIC is greater than SBS.
       
    
Table 4. Performance of MCFHH algorithm on different datasets for SEM2. Bold denotes that BIC is greater than SBS.
       
    
Table 5. Performance of MCFHH algorithm on different datasets for SEM3. Bold denotes that BIC is greater than SBS.
It can be seen in Table 3, Table 4 and Table 5 that for all the datasets, the standard deviations of the BIC, AD, DD, and RD are all 0 after multiple runs, indicating no variation in the results of the MCFHH algorithm across multiple runs. At the same time, the mean value of the BIC is no lower than that of the SBS across all datasets. The above results fully demonstrate the stable convergence performance of the MCFHH algorithm. Regardless of the size of the network, as long as a network with a higher score exists in the search space, the algorithm has the ability to find it. Regarding structural errors, we can observe that for datasets with structural errors in the output structure, the BIC surpassed the SBS (highlighted in bold in the table). The reason for this may be that the data cannot fully reflect the network structure’s characteristics. In addition, for all three SEMs, our algorithm yielded structures with stable F1 scores, which shows that the MCFHH algorithm is a robust DAG learning algorithm, whether applied to Gaussian or non-Gaussian models. In terms of execution time, RET increased very little and SET did not increase significantly as the sample size increased, indicating that our algorithm can handle large sample sizes. However, with the increase in the size of the network, SET increased much faster than RET. The reason for this is that the second-stage search on the CSS increased the time cost. Next, we considered whether and under what circumstances the search space switching operator should be removed to save time. In theory, adding the search space switching operator can reduce the dependence of the algorithm on the sample size, which can also be seen in the insensitivity of each performance index to the sample size. Therefore, for small sample data, the search space switching operator may be an important guarantee for accuracy. Accordingly, we report the performance of the MCFHH algorithm after removing the search space switching operator when the sample size was sufficient (10,000).
As shown in Table 6, for the four networks—alarm, win95pts, munin, and pigs—the same structure could still be output after deleting the search space switching operator. For the hepar2 and andes networks, although the same structure could not be output, the maximum coefficient of variation (standard deviation divided by the mean) of the output structure on the three SEMs was 0.08% and 0.04%, respectively. Therefore, when the sample size was sufficient, deleting the search space switching operator still reliably produced a high-score structure. However, for the relatively complex hepar2 and andes networks, it was difficult to guarantee the integrity of the GSS, even with a sufficient sample size, and the edges not covered by the GSS caused structural errors, which the search space switching operator aimed to correct by increasing the coverage of the search space. Regarding running time, Table 6 shows that except for the alarm and win95pts networks, the removal of the search space switching operator significantly reduced the time cost. For hepar2, munin, andes, and pigs, the average SET reduction rates were 50%, 84%, 72%, and 95%, respectively. In summary, in the first stage, the search space switching operator used the constraint method to limit the search space to improve search efficiency, and in the second stage, it corrected the structural errors caused by the incomplete search space by extending the coverage of the search space. In practice, we would likely face a trade-off between accuracy and speed.
       
    
Table 6. Performance of MCFHH algorithm without the switching operator.
5.3. Comparison with Other Algorithms
The performance of these comparison algorithms depends on the sample size. For a fair comparison, the sample size was uniformly set to 1000. Due to the serious impact of the search space switching operator on the performance of our algorithm, the algorithm that deletes the search space switching operator was also compared as a new algorithm, denoted as MCFHH1. Obviously, MCFHH1 represents the performance of our algorithm in the worst-case scenario.
Table 7 and Table 8 show the comparison results of the F1 scores and BIC scores, respectively, between our proposed algorithm and other algorithms in different SEMs. In these comparisons, MCFHH consistently outperformed the others (highlighted in bold in the table), which shows that our proposed algorithm is accurate and robust in linear SEMs. To further demonstrate the performance of our algorithm, we compared only MCFHH1 with other algorithms. The comparison of the BIC and F1 scores confirms the conclusion that MCFHH1 > BNC-PSO > PCS > NOTEARS > PC > LiNGAM, which verifies that our algorithm maintains the reliability of the search even in the worst-case scenario.
       
    
Table 7. F1 scores of the algorithms for different SEMs. Bold denotes the F1 score that was the best found amongst all methods. “-” indicates that no result is displayed.
       
    
Table 8. BIC scores of the algorithms for different SEMs. Bold denotes the BIC score that was the best found amongst all methods. “-” indicates that no result is displayed.
Like the MCFHH1 algorithm, PCS also uses partial correlation to limit the search space. Its restrictions are more relaxed, so the coverage of its search space is wider. In theory, it is easier to search for a structure with a higher score. However, by comparing the BIC scores of PCS and MCFHH1, we found that the BIC scores of MCFHH1 were not lower than those of PCS on 12 datasets, most of which were concentrated on large-scale networks, such as munin, andes, and pigs. These results indicate that MCFHH1 has a stronger global search capability than PCS. Compared to MCFHH1, NOTEARS achieved higher BIC scores on 6 out of 18 datasets, while it failed to produce any output seven times. This means that NOTEARS cannot reliably output results. In addition, the performance of both the constraint-based method (PC-stable) and the structural-asymmetry-based method (LiNGAM) was significantly worse compared to our method, especially on the andes network.
Figure 2 illustrates the BIC scores with respect to the number of iterations for six networks, and the results after the algorithm stopped are indicated by dotted lines. As shown in Figure 2, three algorithms improved the quality of the solutions at the beginning of the search process, but BNC-PSO converged faster than MCFHH and MCFHH1. This phenomenon became more obvious on the last four networks at larger scales. With the increase in the number of iterations, the convergence speeds of the three algorithms tended to be the same. For the hepar2 and win95pts networks, we can clearly observe that the MCFHH algorithm continued to find structures with higher scores after the BNC-PSO and MCFHH1 algorithms converged. In addition, on the hepar2 and andes networks, the convergence accuracy of BNC-PSO was significantly lower than that of MCFHH and MCFHH1. This shows that the BNC-PSO algorithm cannot guarantee good performance in all cases. By comparing BNC-PSO and MCFHH1, we found that the latter achieved the highest BIC scores across all datasets and the highest F1 scores on 14 out of 18 datasets, with an order of magnitude difference in the BIC scores between the two on the andes network. Therefore, we can conclude that the latter has better generalizability and search capability than the former.
      
    
Figure 2. Convergence of the BIC scores of the three algorithms on the six networks.
In summary, our algorithm increases population diversity by combining and optimizing a variety of nature-inspired heuristics, thereby increasing convergence accuracy at the cost of convergence speed. In our algorithm, the completeness of the search space guarantees convergence accuracy, but the complete search space greatly increases the time cost. Therefore, finding ways to limit the search space as much as possible while ensuring its completeness will be a direction for improving the performance of our algorithm. Overall, regardless of whether the data are Gaussian or non-Gaussian, our algorithm can stably output a structure that is closer to true causality. Our algorithm uses the constraint method to reduce the difficulty of the search method and uses the search method to correct the errors caused by the constraint method. To some extent, the advantages of the two types of methods are absorbed, and the defects of both methods are compensated for.
6. Conclusions and Future Research
In this paper, structural priors are obtained using the SPPC algorithm and integrated into the score search process to improve search efficiency. We prove the correctness and validity of the SPPC in theory. To make effective use of this prior knowledge, we devised a hyper-heuristic method called MCFHH to discover causality under linear SEMs. The experimental results show that the proposed method has better generalizability and search capability. Compared to state-of-the-art methods, it outputs structures that are closer to real causality. Additional efforts will be required to expand our work. In this paper, we have only proposed a hybrid approach under linear SEMs, and we intend to further investigate this hybrid method for both discrete and nonlinear problems. We will also develop better hyper-heuristic algorithms.
Author Contributions
Conceptualization, Y.D.; methodology, Y.D.; software, Y.D.; validation, Y.D.; formal analysis, Y.D. and Z.W.; investigation, Y.D.; resources, Y.D.; data curation, Y.D. and Z.W.; writing—original draft preparation, Y.D.; writing—review and editing, Y.D.; visualization, Y.D.; supervision, X.G.; project administration, Y.D. and Z.W.; funding acquisition, X.G. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (61573285), the Fundamental Research Funds for the Central Universities, China (No. G2022KY0602), and the key core technology research plan of Xi’an, China (No. 21RGZN0016).
Institutional Review Board Statement
Not applicable.
Data Availability Statement
The true networks of all eight datasets are known, and they are publicly available (http://www.bnlearn.com/bnrepository, accessed on 10 May 2024).
Acknowledgments
I have benefited from the presence of my supervisor and classmates. I am very grateful to my supervisor Xiaoguang Gao who gave me encouragement, careful guidance, and helpful advice throughout the writing of this thesis.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
      
| ABC | Artificial bee colony | 
| BF | Bayes factor | 
| BFO | Bacterial foraging optimization | 
| BIC | Bayesian information criterion | 
| CI | Conditional independence | 
| CSS | Complete search space | 
| DAG | Directed acyclic graph | 
| GSS | Global search space | 
| LLH | Low-level heuristics | 
| LSS | Local search space | 
| MB | Markov blanket | 
| MCFHH | Multi-population choice function hyper-heuristic | 
| NLL | Negative log-likelihood | 
| OSP | Open simple path | 
| PSO | Particle swarm optimization | 
| SEM | Structural equation model | 
| SPPC | Structural priors by partial correlation | 
References
- Larsson, S.C.; Butterworth, A.S.; Burgess, S. Mendelian randomization for cardiovascular diseases: Principles and applications. Eur. Heart J. 2023, 44, 4913–4924. [Google Scholar] [CrossRef] [PubMed]
 - Michoel, T.; Zhang, J.D. Causal inference in drug discovery and development. Drug Discov. Today 2023, 28, 17. [Google Scholar] [CrossRef] [PubMed]
 - Pavlovic, M.; Al Hajj, G.S.; Kanduri, C.; Pensar, J.; Wood, M.E.; Sollid, L.M.; Greiff, V.; Sandve, G.K. Improving generalization of machine learning-identified biomarkers using causal modelling with examples from immune receptor diagnostics. Nat. Mach. Intell. 2024, 6, 15–24. [Google Scholar] [CrossRef]
 - Corander, J.; Hanage, W.P.; Pensar, J. Causal discovery for the microbiome. Lancet Microbe 2022, 3, E881–E887. [Google Scholar] [CrossRef] [PubMed]
 - Runge, J.; Gerhardus, A.; Varando, G.; Eyring, V.; Camps-Valls, G. Causal inference for time series. Nat. Rev. Earth Environ. 2023, 4, 487–505. [Google Scholar] [CrossRef]
 - Shimizu, S.; Hoyer, P.O.; Hyvärinen, A.; Kerminen, A. A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 2006, 7, 2003–2030. [Google Scholar]
 - Hoyer, P.O.; Janzing, D.; Mooij, J.M.; Peters, J.; Schölkopf, B. Nonlinear causal discovery with additive noise models. In Proceedings of the Advances in Neural Information Processing Systems 21—Proceedings of the 2008 Conference, Vancouver, BC, Canada, 8–11 December 2008; pp. 689–696. [Google Scholar]
 - Zhang, K.; Wang, Z.K.; Zhang, J.J.; Schölkopf, B. On Estimation of Functional Causal Models: General Results and Application to the Post-Nonlinear Causal Model. ACM Trans. Intell. Syst. Technol. 2016, 7, 22. [Google Scholar] [CrossRef]
 - Janzing, D.; Mooij, J.; Zhang, K.; Lemeire, J.; Zscheischler, J.; Daniusis, P.; Steudel, B.; Schölkopf, B. Information-geometric approach to inferring causal directions. Artif. Intell. 2012, 182, 1–31. [Google Scholar] [CrossRef]
 - Spirtes, P.; Glymour, C.; Scheines, R. Causation, Prediction, and Search, 2nd ed.; MIT Press: Cambridge, MA, USA, 2001. [Google Scholar]
 - Cooper, G.F.; Herskovits, E. A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 1992, 9, 309–347. [Google Scholar] [CrossRef]
 - Yuan, C.; Malone, B. Learning Optimal Bayesian Networks: A Shortest Path Perspective. J. Artif. Intell. Res. 2013, 48, 23–65. [Google Scholar] [CrossRef]
 - Chickering, D.M. Optimal structure identification with greedy search. J. Mach. Learn. Res. 2003, 3, 507–554. [Google Scholar]
 - Lee, J.; Chung, W.Y.; Kim, E. Structure learning of Bayesian networks using dual genetic algorithm. IEICE Trans. Inf. Syst. 2008, 91, 32–43. [Google Scholar] [CrossRef]
 - Cui, G.; Wong, M.L.; Lui, H.K. Machine learning for direct marketing response models: Bayesian networks with evolutionary programming. Manag. Sci. 2006, 52, 597–612. [Google Scholar] [CrossRef]
 - Gámez, J.A.; Puerta, J.M. Searching for the best elimination sequence in Bayesian networks by using ant colony optimization. Pattern Recognit. Lett. 2002, 23, 261–277. [Google Scholar] [CrossRef]
- Askari, M.B.A.; Ahsaee, M.G. Bayesian network structure learning based on cuckoo search algorithm. In Proceedings of the 6th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS), Shahid Bahonar University of Kerman, Kerman, Iran, 28 February–2 March 2018; pp. 127–130. [Google Scholar]
 - Wang, J.Y.; Liu, S.Y. Novel binary encoding water cycle algorithm for solving Bayesian network structures learning problem. Knowl.-Based Syst. 2018, 150, 95–110. [Google Scholar] [CrossRef]
 - Sun, B.D.; Zhou, Y.; Wang, J.J.; Zhang, W.M. A new PC-PSO algorithm for Bayesian network structure learning with structure priors. Expert Syst. Appl. 2021, 184, 11. [Google Scholar] [CrossRef]
 - Gheisari, S.; Meybodi, M.R. BNC-PSO: Structure learning of Bayesian networks by Particle Swarm Optimization. Inf. Sci. 2016, 348, 272–289. [Google Scholar] [CrossRef]
 - Ji, J.Z.; Wei, H.K.; Liu, C.N. An artificial bee colony algorithm for learning Bayesian networks. Soft Comput. 2013, 17, 983–994. [Google Scholar] [CrossRef]
 - Yang, C.C.; Ji, J.Z.; Liu, J.M.; Liu, J.D.; Yin, B.C. Structural learning of Bayesian networks by bacterial foraging optimization. Int. J. Approx. Reason. 2016, 69, 147–167. [Google Scholar] [CrossRef]
 - Wang, X.C.; Ren, H.J.; Guo, X.X. A novel discrete firefly algorithm for Bayesian network structure learning. Knowl.-Based Syst. 2022, 242, 10. [Google Scholar] [CrossRef]
 - Pandiri, V.; Singh, A. A hyper-heuristic based artificial bee colony algorithm for k-Interconnected multi-depot multi-traveling salesman problem. Inf. Sci. 2018, 463, 261–281. [Google Scholar] [CrossRef]
 - Wang, Z.; Liu, J.L.; Zhang, J.L. Hyper-heuristic algorithm for traffic flow-based vehicle routing problem with simultaneous delivery and pickup. J. Comput. Des. Eng. 2023, 10, 2271–2287. [Google Scholar] [CrossRef]
 - Drake, J.H.; Özcan, E.; Burke, E.K. A Case Study of Controlling Crossover in a Selection Hyper-heuristic Framework Using the Multidimensional Knapsack Problem. Evol. Comput. 2016, 24, 113–141. [Google Scholar] [CrossRef] [PubMed]
 - Zamli, K.Z.; Din, F.; Kendall, G.; Ahmed, B.S. An experimental study of hyper-heuristic selection and acceptance mechanism for combinatorial t-way test suite generation. Inf. Sci. 2017, 399, 121–153. [Google Scholar] [CrossRef]
 - Tsamardinos, I.; Brown, L.E.; Aliferis, C.F. The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn. 2006, 65, 31–78. [Google Scholar] [CrossRef]
 - Yang, J.; Li, L.; Wang, A.G. A partial correlation-based Bayesian network structure learning algorithm under linear SEM. Knowl.-Based Syst. 2011, 24, 963–976. [Google Scholar] [CrossRef]
 - Kitson, N.K.; Constantinou, A.C.; Guo, Z.G.; Liu, Y.; Chobtham, K. A survey of Bayesian Network structure learning. Artif. Intell. Rev. 2023, 56, 8721–8814. [Google Scholar] [CrossRef]
 - Colombo, D.; Maathuis, M.H. Order-Independent Constraint-Based Causal Structure Learning. J. Mach. Learn. Res. 2014, 15, 3741–3782. [Google Scholar]
 - Ogarrio, J.M.; Spirtes, P.; Ramsey, J. A Hybrid Causal Search Algorithm for Latent Variable Models. JMLR Workshop Conf. Proc. 2016, 52, 368–379. [Google Scholar]
 - Tsamardinos, I.; Aliferis, C.F.; Statnikov, A. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 August 2003; pp. 673–678. [Google Scholar]
 - Koivisto, M.; Sood, K. Exact Bayesian structure discovery in Bayesian networks. J. Mach. Learn. Res. 2004, 5, 549–573. [Google Scholar]
 - de Campos, C.P.; Ji, Q. Efficient Structure Learning of Bayesian Networks using Constraints. J. Mach. Learn. Res. 2011, 12, 663–689. [Google Scholar]
 - Cussens, J.; Järvisalo, M.; Korhonen, J.H.; Bartlett, M. Bayesian Network Structure Learning with Integer Programming: Polytopes, Facets and Complexity. J. Artif. Intell. Res. 2017, 58, 185–229. [Google Scholar] [CrossRef]
 - Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
 - Shimizu, S.; Inazumi, T.; Sogawa, Y.; Hyvärinen, A.; Kawahara, Y.; Washio, T.; Hoyer, P.O.; Bollen, K. DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model. J. Mach. Learn. Res. 2011, 12, 1225–1248. [Google Scholar]
 - Zheng, X.; Aragam, B.; Ravikumar, P.; Xing, E.P. DAGs with NO TEARS: Continuous Optimization for Structure Learning. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 2–8 December 2018. [Google Scholar]
 - Yu, Y.; Chen, J.; Gao, T.; Yu, M. DAG-GNN: DAG Structure Learning with Graph Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
 - Wang, X.; Du, Y.; Zhu, S.; Ke, L.; Chen, Z.; Hao, J.; Wang, J. Ordering-Based Causal Discovery with Reinforcement Learning. In Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021; pp. 3566–3573. [Google Scholar]
 - Zhang, M.H.; Jiang, S.L.; Cui, Z.C.; Garnett, R.; Chen, Y.X. D-VAE: A Variational Autoencoder for Directed Acyclic Graphs. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Zheng, X.; Dan, C.; Aragam, B.; Ravikumar, P.; Xing, E.P. Learning Sparse Nonparametric DAGs. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), Online, 26–28 August 2020; pp. 3414–3424. [Google Scholar]
 - Lee, H.C.; Danieletto, M.; Miotto, R.; Cherng, S.T.; Dudley, J.T. Scaling structural learning with NO-BEARS to infer causal transcriptome networks. In Proceedings of the Pacific Symposium on Biocomputing, Fairmont Orchid, HI, USA, 3–7 January 2020; pp. 391–402. [Google Scholar]
 - Wei, D.; Gao, T.; Yu, Y. DAGs with no fears: A closer look at continuous optimization for learning Bayesian networks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020. [Google Scholar]
 - Kaiser, M.; Sipos, M. Unsuitability of NOTEARS for Causal Graph Discovery when Dealing with Dimensional Quantities. Neural Process. Lett. 2022, 54, 1587–1595. [Google Scholar] [CrossRef]
 - Ramsey, J.; Spirtes, P.; Zhang, J. Adjacency-faithfulness and conservative causal inference. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, UAI 2006, Cambridge, MA, USA, 13–16 July 2006; pp. 401–408. [Google Scholar]
 - Zhang, J.; Spirtes, P. Detection of unfaithfulness and robust causal inference. Minds Mach. 2008, 18, 239–271. [Google Scholar] [CrossRef]
 - de Campos, L.M.; Castellano, J.G. Bayesian network learning algorithms using structural restrictions. Int. J. Approx. Reason. 2007, 45, 233–254. [Google Scholar] [CrossRef]
 - Correia, A.H.C.; de Campos, C.P.; van der Gaag, L.C. An Experimental Study of Prior Dependence in Bayesian Network Structure Learning. In Proceedings of the 11th International Symposium on Imprecise Probabilities—Theories and Applications (ISIPTA), Ghent, Belgium, 3–6 July 2019; pp. 78–81. [Google Scholar]
- Borboudakis, G.; Tsamardinos, I. Scoring and searching over Bayesian networks with causal and associative priors. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), Bellevue, WA, USA, 12–14 July 2013; pp. 102–111. [Google Scholar]
 - Wang, Z.X.; Chan, L.W. Learning Bayesian Networks from Markov Random Fields: An Efficient Algorithm for Linear Models. ACM Trans. Knowl. Discov. Data 2012, 6, 31. [Google Scholar] [CrossRef]
 - Chén, O.Y.; Bodelet, J.S.; Saraiva, R.G.; Phan, H.; Di, J.R.; Nagels, G.; Schwantje, T.; Cao, H.Y.; Gou, J.T.; Reinen, J.M.; et al. The roles, challenges, and merits of the p value. Patterns 2023, 4, 22. [Google Scholar] [CrossRef] [PubMed]
 - Wang, Z.; Chan, L. An efficient causal discovery algorithm for linear models. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 1109–1117. [Google Scholar]
 - Cheng, J.; Greiner, R.; Kelly, J.; Bell, D.; Liu, W.R. Learning Bayesian networks from data: An information-theory based approach. Artif. Intell. 2002, 137, 43–90. [Google Scholar] [CrossRef]
 
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


