The FEDHC Bayesian network learning algorithm

The paper proposes a new hybrid Bayesian network learning algorithm, termed Forward Early Dropping Hill Climbing (FEDHC), devised to work with either continuous or categorical variables. Further, the paper manifests that the only implementation of MMHC in the statistical software \textit{R}, is prohibitively expensive and a new implementation is offered. Further, specifically for the case of continuous data, a robust to outliers version of FEDHC, that can be adopted by other BN learning algorithms, is proposed. The FEDHC is tested via Monte Carlo simulations that distinctly show it is computationally efficient, and produces Bayesian networks of similar to, or of higher accuracy than MMHC and PCHC. Finally, an application of FEDHC, PCHC and MMHC algorithms to real data, from the field of economics, is demonstrated using the statistical software \textit{R}.


Introduction
Learning the causal relationships among variables using non-experimental data is of high importance in many scientific fields, such as economics and econometrics 1 . When the aim is particularly to recreate the causal mechanism that generated the data, graphical models, such as causal networks and Bayesian Networks (BNs) 2 are frequently employed. The advantages of BNs include simultaneous variable selection, among all variables and hence detection of conditional associations between variables. On a different route BNs form the scheme for synthetic population generation (Sun and Erath, 2015) and have been used synergetically with agent based models Dragicevic, 2009, 2013).
BNs enjoy applications to numerous fields, but the focus of the current paper is on economics related fields applications, such as production economics (Hosseini and Barker, 2016), macroeconomics (Spiegler, 2016) and environmental resource economics (Xue et al., 2017). Applications of BNs can be found also in financial econometrics (Mele, 2017) banking and finance (Chong and Kluppelberg, 2018), credit scoring (Leong, 2016), insurance (Sheehan et al., 2017) and customer service (Cugnata et al., 2014) to name a few. Despite the plethora of applications of BNs, not many BN algorithms exist, and most importantly fewer are publicly available in free software environments, such as the statistical software R. The Max-Min Hill Climbing (MMHC) (Tsamardinos et al., 2006) is an example of a widely used BN learning algorithm 3 that is publicly available, in the R package bnlearn (Scutari, 2010). PC Hill Climbing (PCHC) 1 For a general definition of causality specifically in economics and econometrics see Hoover (2017). 2 Also known as Bayes networks, belief networks, decision networks, Bayes(ian) models or probabilistic directed acyclic graphical models. 3 The relevant paper is one of the classic papers in the Artificial Intelligence field and has received more than 1,870 citations according to google.scholar by July 2022. (Tsagris, 2021) is a recently suggested hybrid algorithm that is also publicly available, in the R package pchc (Tsagris, 2021b). (Tsagris, 2021) showed that when the sample size is at the order of hundreds of thousands MMHC's implementation in the R package bnlearn requires more than a day with continuous data, even with 50 variables. On the contrary, PCHC is a computationally efficient and scalable BN learning algorithm (Tsagris, 2021). With modern technology and vast data generation, the computational cost is a considerable parameter. Every novel algorithm must be computationally efficient and scalable to large sample sizes. Seen from the green economy point of view, this cost also has an economic and environmental impact; a faster algorithm will produce results in a more timely manner, facilitating faster decision making, consuming less energy and hence reducing its carbon footprint.
Moving along those lines this paper proposes a new computationally highly efficient algorithm termed Forward with Early Dropping Hill Climbing (FEDHC) that is publicly available in the R package pchc. FEDHC shares common ideas with PCHC and MMHC. It applies the Forward Backward with Early Dropping (FBED) variable selection algorithm (Borboudakis and Tsamardinos, 2019) to each variable as a means of skeleton identification, followed by a Hill Climbing (HC) scoring phase. FEDHC can handle millions of observations in just a few minutes and retains similar or better accuracy levels than PCHC and MMHC. With continuous data FEDHC performs fewer errors than PCHC but the converse is true with categorical data. FEDHC further enjoys the same scalability property as PCHC, its computational cost is proportional to the sample size of the data. Increasing the sample size by a factor increases the execution time by the same factor. Finally, a new, computationally efficient, implementation of MMHC is offered that is also publicly available in the R package pchc.
The choice of the BN learning algorithm is not only computational cost dependent, but also quality dependant. Regardless of the algorithm used, the quality of the learned BN can be significantly affected by outliers. Robustness to outliers is an important aspect that, surprisingly enough, has not attracted substantial research attention in the field of BN learning. Kalisch and Bühlmann (2008) were the first to propose a robustified version of the PC algorithm by replacing the empirical standard deviations with robust scale estimates. Cheng et al. (2018) on the other hand, removed the outliers but their algorithm is only applicable to BNs with a known topology. Robustification of the proposed FEDHC algorithm takes place by adopting techniques from the robust statistics literature. The key concept is to identify and remove the outliers prior to applying FEDHC.
The rest of the paper is structured as follows. Preliminaries regarding BNs that will assist in making the paper comprehensible and a brief presentation of PCHC and MMHC algorithms are unveiled in Section 2. The FEDHC is introduced in Section 3 along with its robustified version 4 that will be shown to be remarkable and utterly insensitive to outliers. Theoretical properties and computational details of FEDHC and the conditional independence tests utilised for continuous and categorical data are delineated in the same section. Section 4 contains Monte Carlo simulation studies comparing FEDHC to PCHC and MMHC in terms of accuracy, computational efficiency and number of tests performed. Section 5 illustrates the FEDHC, PCHC and MMHC on two real cross-sectional datasets using the R package pchc. The first dataset contains continuous data on the credit history for a sample of applicants for a type of credit card (Greene, 2003). The second dataset contains categorical data on the household income plus some more demographic variables. Ultimately, Section 6 contains the conclusions drawn from this paper.

Preliminaries
Graphical models or probabilistic graphical models express visually the conditional (in)dependencies between random variables (V i , i = 1, . . . , D). Nodes (or vecrtices) are used to represent the variables V i and edges between the nodes, for example V i − V j , indicate relationship between the variables V i and V j . Directed graphs are graphical models that contain arrows, instead of edges, indicating the direction of the relationship, for example V i → V j . The parents of a node V i are the nodes whose direction (arrows) points towards V i . Consequently, node V i is termed child of those nodes and the common parents of that nodes are called spouses. Directed acyclic graphs (DAG) are stricter in the sense that they impose no cycles on these directions. For any path between V i and V j , In other words, the natural sequence or relationship between parents and children or ancestors and descendants is mandated. Figure 1: An example of a DAG. Nodes V 1 and V 2 are the parents of V 3, whose children are nodes V 4 and V 5. V 2 is the spouse of V 1 (and vice versa, V 1 is the spouse of V 2).

Bayesian Networks
Assume there is a collection V of D variables whose joint distribution P is known. The BN 5 (Pearl, 1988;Spirtes et al., 2000) B = G, V arises from linking P to G through the Markov condition (or Markov property), which states that each variable is conditionally independent of its non-descendants given its parents. By using this condition, the joint distribution P can be factorised as where Pa(V i ) denotes the parents set of V i in G.
If G entails only conditional independencies in P and all conditional independencies in P are entailed by G, based on the Markov condition, then G, P and G are faithful to each other, and G is a perfect map of P (Spirtes et al., 2000).
The BN whose edges can be interpreted causally is called causal BN, an edge V i → V j exists if V i is a direct cause of V j . A necessary assumption made by the algorithms under study is causal sufficiency; there are no latent (hidden, non observed) variables among the observed variables V.
If there is no edge between V i and V j the node V k is called unshielded collider. In Figure 1 the triplet (V 1 , V 3 , V 2 ) is a Vstructure as there is no edge between V 1 and V 2 and hence node V 3 is an unshielded collider. The unshielded collider V k implies that V i and V j ae independependent conditioning on V k , provided that the faithfulness property holds true (Spirtes et al., 2000). Conversely, the triplet of nodes ( Figure 1 is such an example). The Λ-structure implies that V i and V j are conditionally independent given V k .
Two or more BNs are called Markov equivalent if and only if they have the same skeleton and the same V-structures (Verma and Pearl, 1991). The set of all Markov equivalent BNs forms the Markov equivalence class that can be represented by a complete partially DAG, which in addition to directed edges contains undirected edges 6 .

Classes of BN learning algorithms
BN learning algorithms are typically constraint-based, score-based or hybrid. Constraint-based learning algorithms, such as PC 7 (Spirtes and Glymour, 1991) and FCI (Spirtes et al., 2000) employ conditional independence (CI) tests to discover the structure of the network (skeleton), and then orient the edges by repetitively applying orientation rules. On the contrary, scorebased methods (Cooper and Herskovits, 1992;Heckerman et al., 1995;Chickering, 2002), assign a score on the whole network and perform a search in the space of BNs to identify a highscoring network. Hybrid algorithms, such as MMHC (Tsamardinos et al., 2006) and PCHC (Tsagris, 2021), combine both aforementioned methods; they first perform CI tests to discover the skeleton of the BN and then employ a scoring method to direct the edges in the space of BNs.

PCHC and MMHC algorithms
PCHC's skeleton identification phase is the same as that of the PC algorithm (Tsagris, 2021). The phase commences with all pairwise unconditional associations and removes the edge between ordered pairs which are not statistically significantly associated. Subsequently, CI tests are performed with the cardinality of the conditioning set (denoted by k) increasing by 1 at a time. At every step, the conditioning set consists of subsets of the neighbours, adjacent to each variable V i (adj(G, V i )). This process is repeated until no edge can be removed. Spirtes et al. (2000) suggested three heuristics to select the pairs of variables and the order is crucial as it can yield erroneous results. The optimal is, for a given variable V i , to first test those variables V that are least probabilistically dependent on V i , conditional on those subsets of variables that are most probabilistically dependent on V i . Note that the pairs are first ordered according to the third heuristic of Spirtes et al. (2000) and so the order of selection of the pairs is deterministic. Hence, the skeleton identification phase is independent of the order at which the variables are located in the dataset (Tsagris, 2019).
MMHC's skeleton identification phase performs a variable selection process for each variable (call it target variable, V i ), described as follows. A search for its statistically significantly associated variables V s is performed via unconditional statistical tests. The associations are stored and the variable with the highest association (V j ) is chosen, an edge is added between this V i and V j and all non statistically significant variables are excluded from further consideration. In the second step, all CI tests between the target variable and previously identified variables, conditional on the previously selected variable are performed (V i ⊥ ⊥ V m |V j , m = i, j) and the non statistically significant variables are neglected. The previously stored associations are updated, for each variable, the minimum between the old and the new variables is stored. The variable with the highest association (Max-Min heuristic) is next selected. In subsequent steps, while the set of the selected variables increases, the conditioning set does not, as its cardinality is at most equal to k 8 . Upon completion, a backward phase, in the same spirit as the forward applies to remove falsely detected variables.
This variable selection process is repeated for all variables. The edges detected remain only if they were identified by all variables. If for example, V j was found to associated with V i , but V i was not found to be associated with V j , then no edge between V i and V j will be added.
A slightly modified version of MMHC's skeleton identification phase is implemented in the R package pchc. The backward phase is not performed in order to make the algorithm faster. To distinguish between them, bnlearn's implementation will be denoted by MMHC-1 and pchc's implementation will be denoted by MMHC-2 hereafter.
The orientation of the discovered edges takes place in the second, Hill Climbing (HC) scoring, phase of PCHC and MMHC and is the same phase employed by FEDHC as well.

The FEDHC BN learning algorithm
Similarly to PCHC and MMHC, the skeleton identification phase of FEDHC relies on a variable selection algorithm. Thus, prior to introducing the FEDHC algorithm the Forward Backward with Early Dropping (FBED) variable selection algorithm (Borboudakis and Tsamardinos, 2019) is briefly presented.

The FBED variable selection algorithm
In the classical forward selection algorithm all available predictor variables are constantly used and their statistical significance is tested at each step. Assuming that out of 10, 000 predictor variables only 10 are selected. This implies that almost 10, 000 × 10 regression models must be fitted and the same amount of statistical tests must be executed. The computational cost is tremendous rendering this computationally expensive algorithm impractical and hence prohibitive. Borboudakis and Tsamardinos (2019) introduced the FBED algorithm as a speed-up modification of the traditional forward selection algorithm coupled with the backward selection algorithm (Draper and Smith, 1998). FBED relies on the Early Dropping heuristic to speed up the forward selection. The heuristic drops the non statistically significant predictor variables at each step, thus removes them from further consideration resulting in a computationally dramatically cheaper algorithm, that is presented in Algorithm 1.

Skeleton identification phase of the FEDHC algorithm
The skeleton identification phase of the FEDHC algorithm is the one presented in Algorithm 2, but it must be stressed that the backward phase of FBED is not performed so as to reduce the computational cost. The FBED algorithm (Algortihm 1) for each variable (call it target variable, V i ). This variable selection process is repeated for all variables. The edges detected remain only if they were identified by all variables. If for example, V j was found to associated with V i , but V i was not found to be associated with V j , then no edge between V i and V j will be added.
Algorithm 1 The FBED variable selection algorithm 1: Input: A response variable y and a set of D predictor variables V. 2: Let S = ∅ denote the set of selected variables. 3: Perform all regression models of y on each g. a linear model y = a + bV i + e, and retain only the statistically significant predictor variables V sig . 4: Choose V j from V sig that has the highest association, add it in S and use that to perform all regression models of y on the V j and each V , y ∼ f (V j , V ), where ∈ V, with = j and again retain only the statistically significant predictor variables, thus reducing |V sig | and increasing |S|. 5: Repeat until no predictor variable is left, i.e. V sig = ∅. 6: This process can be repeated k times, using all neglected predictor variables, where k is a pre-defined number, until |S| cannot further increase. 7: Perform a backward selection phase attempting to remove the non statistically significant predictor variables. 8: Return S.
Algorithm 2 Skeleton identification phase of the FEDHC algorithm 1: Input: Data set on a set of D variables V. 2: Let the adjacency matrix G be full of zeros. 3: Repeat for all variables V i , i = 1, . . . , D 4: Perform the FBED algorithm in Algorithm 1, excluding the backward phase, and return

Hill Climbing phase of the FEDHC algortihm
The first phase of FEDHC, MMHC and of PCHC is to discover any possible edges between the nodes using CI tests. In the second phase, a search for the optimal DAG is performed, where edges turn to arrows or are deleted towards maximisation of a score metric. This scoring phase performs a greedy HC search in the space of BNs, commencing with an empty graph (Tsamardinos et al., 2006). The edge deletion or direction reversal that leads to the largest increase in score, in the space of BNs 9 , is applied and the search continues recursively. The fundamental difference from standard greedy search is that the search is constrained to the orientation of the edges discovered by the skeleton identification phase 10 .
Tabu search is such an iterative local searching procedure adopted by Tsamardinos et al. (2006) for this purpose. Its performance is enhanced by using a list where the last 100 structures explored are stored, while searching in the neighborhood of each solution. The search is also capable of escaping from local optima, in which normal local search techniques often get stuck. Instead of applying the best local change, the best local change that results in a structure not on the list is performed in an attempt to escape local maxima (Tsamardinos et al., 2006). This change may actually reduce the score. When a number of changes (10-15) occur without an increase in the maximum score ever encountered during search, the algorithm terminates. The overall best scoring structure is then returned.
The Bayesian Information Criterion (BIC) (Schwarz, 1978) is a frequent score used for continuous data, while other options include the multivariate normal log-likelihood, the Akaike Information Criterion (AIC) and the Bayesian Gaussian equivalent 11 (Geiger and Heckerman, 1994) score. The Bayesian Dirichlet equivalent (BDE) (Buntine, 1991), the BDe uniform score (BDeu) (Heckerman et al., 1995), the multinomial log-likelihood score (Bouckaert, 1995) and the MDL score (Suzuki, 1993;Lam and Bacchus, 1994) are four scoring options for discrete data.
The combination of the FBED algorithm during the skeleton identification phase with the HC scoring method forms the FEDHC algorithm. Interestingly enough, the skeleton identification phase of FEDHC performs substantially fewer statistical tests than PCHC and MMHC.

Prior knowledge
All BN learning algorithms are agnostic of true relationships among the input data. It is customary though for practitioners and researchers to have prior knowledge of the necessary directions (forbidden or not) of some of the relationships among the variables. For instance, variables such as sex or age cannot be caused by any economic or demographic variables. Economic theory (or theory from any other field) can further assist in improving the quality of the fitted BN by imposing or forbidding directions among some variables. All the prior information can be inserted into the scoring phase of the aforementioned BN learning algorithms leading to less errors and more realistic BNs.

Theoretical properties of FEDHC
The theoretical properties and guarantees of MMHC and PCHC can be found in Tsamardinos et al. (2006) and Tsagris (2021), respectively. As for the FEDHC, while there is no theoretical guarantee of the skeleton identification phase of FEDHC, Borboudakis and Tsamardinos (2019) showed that running FBED with two repeats recovers the MB of the response variable if the joint distribution of the response and the predictor variables can be faithfully represented by a BN. When used for BN learning though, FBED need not be be run more than once for each variable. In this case FBED, similarly to MMHC, will identify the children and parents of a variable V i , but not the spouses of the children (Borboudakis and Tsamardinos, 2019) as this is not necessary during the skeleton identification phase. When FBED is run on the children of the variable V i it will again identify the children's parents who are the spouses of the variable V i . Hence, upon completion of the FEDHC algorithm will have identified the MB of each variable.
Additionally, the early dropping heuristic does not only reduce the computational time but also reduces the problem of multiple testing, in some sense (Borboudakis and Tsamardinos, 2019). When FBED is run only once (as in the current situation), in the worst-case scenario, it is expected to select about α · D variables (where α is the significance level) since all other variables will be dropped in the first (filtering) phase. However, their simulation studies showed that FBED was selecting fewer false positives than expected and the authors' recommendation is to reduce the number of runs to further limit the number of falsely selected variables, a strategy FEDHC follows by default.
Similar to MMHC, the FEDHC is a local learning algorithm, and hence during the HC phase the overall score is decomposed (Tsamardinos et al., 2006) exploiting the Markov property of BNs (1). The local learning has several advantages (see Tsamardinos et al. (2006)) and the scores (BDe, BIC., etc.) are locally consistent (Chickering, 2002).

Robustification of the FEDHC algorithm for continuous data
It is not uncommon for economic datasets to contain outliers, observations with values far from the rest of the data. Income is such an example that contains outliers, but if outliers appear only in that variable their effect will be minor. The effect of outliers is propagated when they exist in more variables and in order to mitigate their effect, they must be identified in the multivariate space. If these outliers are not detected or not treated properly, BN learning algorithms will yield erroneous results. FEDHC will employ the Reweighted Minimum Covariance Determinant (RMCD) (Rousseeuw, 1985;Rousseeuw and Driessen, 1999) as a means to identify outliers and remove them 12 .
The RMCD estimator is computed in two stages. In the first stage, a subset of observations h (n/2 ≤ h < n) is selected such that the covariance matrix has the smallest determinant and a robust mean vector is also computed. The second stage, is a re-weighting scheme that increases the efficiency of the estimator, while preserving its high-breakdown properties. A weight w i = 0 is given to observations whose first-stage robust distance exceeds a threshold value, otherwise the weight is w i = 1 (i = 1, . . . , n) is given. Using the re-weighted robust covariance matrix and mean vector, robust Mahalanobis distances are computed and proper cut-off values are required to detect the outliers. Those cut-off values are based on the following accurate approximations (Cerioli, 2010;Cerchiello and Giudici, 2016) where w = n i=1 w i , and Be and F denote the Beta and F distributions respectively. The observations whose Mahalananobis distance d 2 i(RM CD) exceeds the 97.5% quantile of either distribution (Be or F ) are considered to be outliers and are hence removed from the dataset. The remainder of the dataset, assumed to be outlier free, will be used by FEDHC to learn the BN.
The default value for h is [(n+p+1)/2], where [.] denotes the largest integer. This value was proven to have the highest breakdown point (Hubert and Debruyne, 2010), but low efficiency on the other hand. Changing h yields an estimator with lower robustness properties and increases the computational cost of the RMCD estimator. For these reasons, this is the default value used inside the robustification process of the FEDHC algorithm.
The case of n < p and n p (very high dimensional case) in general can be treated in a similar way, by replacing the RMCD estimator with the high dimensional MCD approach of Ro et al. (2015).

Monte Carlo simulation studies
Extensive experiments were conducted on simulated data to investigate the quality of estimation of FEDHC compared to PCHC and MMHC-2. MMHC-1 participated in the simulation studies only with categorical data and not with continuous data because it is prohibitively expensive. The continuous data were generated by synthetic BNs that contained a various number of nodes, p = (20, 30, 50), with an average of 3 and 5 neighbours (edges) for each variable. For each case 50 random BNs were generated with Gaussian data and various sample sizes. Categorical data were generated utilising two famous (in the BN learning community) realistic BNs and the sample sizes were left varying. The R packages MXM (Tsagris and Tsamardinos, 2019) and bnlearn were used for the BN generation, and the R packages pchc and bnlearn were utilised to apply the FEDHC, PCHC, MMHC-2 and the MMHC-1 algorithms respectively. All simulations were implemented in a desktop computer with Intel Core i5-9500 CPU at 3.00GHz with 48 GB RAM and SSD installed.
The metrics of quality of the learned BNs were the structural Hamming distance (SHD) (Tsamardinos et al., 2006) of the estimated DAG from the true DAG 13 , the computational cost and the number of tests performed during the skeleton identification phase and the total duration of the algorithm. PCHC and the MMHC-1 algorithms have been implemented in C++, henceforth the comparison of the execution times is not really fair at the programming language level. FEDHC and MMHC-2 have been implemented in R (skeleton identification phase) and C++ (scoring phase).

Synthetic BNs with continuous data
The procedure used to generate data for X is summarised in the steps below. Let X be a variable in G and P a(X) be the parents of X in G.
2. In case P a(X) is empty, X is sampled from the standard normal distribution. Otherwise, X = f (P a(X)) = β 0 + i β i P a i (X) + X , a linear function 14 depending on X, where X is generated from a standard normal distribution.
The average number of connecting edges (neighbours) was set to 3 and 5. The higher the number of edges is, the denser the network is and the harder the inference becomes. The sample sizes considered were n = (100, 500, 1, 000, 5000, 1 × 10 4 , 3 × 10 4 , 5 × 10 4 , 1 × 10 5 , 3 × 10 5 , 5 × 10 5 , 1 × 10 6 , 3 × 10 6 , 5 × 10 6 ). Figures 2, 3, 4 and 5 present the SHD and the number of CI tests performed of each algorithm for a range of sample sizes (in log-scale). With 3 neighbours on average per node, the differences in the SHD are noticeably rather small, yet FEDHC achieves lower numbers. With 5 neighbours on average though the differences are more significant and increasing with increasing sample sizes. As for the number of CI tests executed during the skeleton identification phase, FEDHC is the evident winner as it executes up to 6 times less tests, regardless of the average neighbours.

Robustified FEDHC for continuous data
Examination of the robustified FEDHC contains two axes of comparison; the outlier-free and the outliers present cases. At first the performances of the raw and the robustified FEDHC in the outlier-free case are evaluated. Figures 6 and 7 signify that there is no loss in the accuracy when using the robustified FEDHC over the raw FEDHC. Computationally speaking though, the raw FEDHC is significantly faster than the robustified FEDHC. For small sample sizes the robustified FEDHC can be 10 times higher, while for large sample sizes its cost can be double or only 5% higher than that of the raw FEDHC.
The performances of the raw FEDHC and of the robustified FEDHC in the presence of extreme outliers are evaluated next. The BN generation scheme is the one described in Section 4.1 with the exception of having added a 5% of outlying observations. The considered sample sizes are smaller, as although FEDHC is computationally efficient, it becomes really slow in the presence of outliers.
The results presented in Figures 8 and 9 evidently show the gain when using the robustified FEDHC over the raw FEDHC. The SHD of the raw FEDHC increases by as little as 100% 13 This is defined as the number of operations required to make the estimated graph equal to the true graph. Instead of the true DAG, the Markov equivalence graph of the true BN; that is some edges have been un-oriented as their direction cannot be statistically decided. The transformation from the DAG to the Markov equivalence graph was carried out using Chickering's algorithm (Chickering, 1995).
14 In general this can represent any (non-linear) function.  up to 700% with 3 neighbours on average and 50 variables. The duration of the raw FEDHC increases substantially 15 . The raw FEDHC becomes up to more than 200 times slower than the robustified version with hundreds of thousands of observations and 50 variables. This is attributed to the HC phase of the raw FEDHC which consumes a tremendous amount of time.
The outliers produce noise that escalates the labour of the HC phase.

Realistic BNs with categorical data
The f (P a(X)) function utilised in the continuous data case relies on the β coefficients. The larger the magnitude of their values, the stronger the association between the edges becomes and hence it becomes easier to identify them. For BNs with categorical data one could apply the same generation technique and then discretize the simulated data. To avoid biased or optimistic estimates favoring one or the other method, two real BNs with categorical data were utilised to simulate data. These are a) the Insurance BN, used for evaluating car insurance risks (Beinlich et al., 1989), that consists of 27 variable (nodes) and 52 (directed) edges and  b) the Alarm BN, designed to provide an alarm message system for patient monitoring and consists of 37 variables and 45 (directed) edges. The R package bnlearn contains a few thousand categorical instantiations from these BNs, but for the purpose of the simulation studies more instantiations were generated using the same package. The sample sizes considered were n = (1 × 10 4 , 2 × 10 4 , 5 × 10 4 , 1 × 10 5 , 2 × 10 5 , 5 × 10 5 , 1 × 10 6 , 2 × 10 6 , 5 × 10 6 ). Figure 10 shows the SHD and the number of CI tests executed by each algorithm against the sample size. The MMHC-1 has evidently the poorest performance in both axes of comparison. Our implementation (MMHC-2) performs substantially better but the overall winner is the PCHC. FEDHC on the other hand performs better than MMHC-1 yet is the second best option.

Scalability of FEDHC
The computational cost of each algorithm was also measured appearing in Figure 11 as a function of the sample size. The empirical slopes of all lines in Figure 11 are nearly equal to 1 indicating that the scalability of FEDHC, PCHC, and MMHC-2 is linear in the sample size. Hence, the computational cost of all algorithms increases linearly with respect to the sample size. For  any percentage-wise increase in the sample size, the time increases by the same percentage. Surprisingly enough, the computational cost of MMHC-1 was not evaluated for categorical data case because similarly to the continuous data case it was too high to evaluate. It is surprising though that the computational cost of FEDHC is similar to that of PCHC and MMHC-2. In fact the skeleton identification phase requires about the same amount of time and it requires only 8 seconds with 5 million observations. The scoring phase is the most expensive phase of the algorithms, absorbing 73%-99% of the total computation time. Regarding FEDHC and MMHC-2, since the initial phase of both has been implemented in R, this can be attributed to the fact that the calculations of the partial correlation in FEDHC are heavier than those in MMHC-2 because the conditioning set in the former can grow larger than the conditioning set in MMHC-2 which is always bounded by a pre-specified value k. Thus, MMHC-2 performs more but computationally lighter calculations than FEDHC.

Expenditure data
The first example concerns a dataset with continuous variables containing information on the monthly credit card expenditure of individuals was used. It is the Expenditure dataset (Greene, 2003) and is publicly accessible via the R package AER (Kleiber and Zeileis, 2008).
The dataset contains information about 1,319 observations (10% of the original data set) on the following 12 variables. Whether the application for credit card was accepted or not (Card), the number of major derogatory reports, (Reports), the age in years plus twelfths of a year  (Age), the yearly income in $10,000 (Income), the ratio of monthly credit card expenditure to yearly income (Share), the average monthly credit card expenditure (Expenditure), whether the person owns their home or they rent (Owner), whether the person is self employed or not (Selfemp), the number of dependents + 1 (Dependents), the number of months living at current address (Months), the number of major credit cards held (Majorcards) and the number of active credit accounts (Active). The R package AER contains the data and must be loaded and be processed for the algorithms to run.
According to FEDHC ( Figure 12)(a), the age of the individual affects their income, the number of months they have been living at their current address, whether they own their home or not, and the ratio of their monthly credit card expenditure to their yearly income. The only variables associated with the number of major derogatory reports (Reports) are whether the consumer's application for credit card was accepted or not (Card) and the number of active credit accounts (Active). In fact these two variables are parents of Reports as the arrows are (a) 3 neighbours.
(d) 5 neighbours. Figure 9: Ratios of SHD and computational cost against log of sample size for 50 and 100 dimensions with 3 neighbours and 5 neighbours on average. The ratios depict the errors and computational cost of the raw FEDHC relatively to the robustified FEDHC with 5% outliers.
directed towards it. A third advantage of BNs is that they provide a solution to the variable selection problem. The parents of the variable Majorcards (number of major credit cards held) are Card (whether the application for credit card was accepted or not) and Income (yearly income in $10,000), its only child is Active (number of active credit accounts) and and it only spouse (parent of Active) is Owner (whether the consumer owns their home). The collection of those parents, children and spouses form the Majorcards' MB. That is, any other variable does not increase the information on the number of major credit cards held by the consumer. For any given variable one can straightforwardly obtain (and visualise) its MB which can be used for the construction of the appropriate linear regression model. Figure 12 contains the BNs using both implementations of MMHC, the PCHC and the FEDHC algorithms fitted to the expenditure data with the variables sorted in a topological order (Chickering, 1995), a tree-like structure. The BIC of the BN learned by MMHC-1 and MMHC-2 are equal to −32171.75 and −32171.22, and for the PCHC and FEDHC are both equal to −32171.75. This is an indication that all four algorithms produced BNs of nearly the same quality. On a closer examination of the graphs one can detect some differences between the  algorithms. For instance Age is directly related to Active according to PCHC and MMHC-2 but not according to FEDHC and MMHC-1. Further, all algorithms have identified the Owner as the parent of Income and not vice-versa. This is related to the prior knowledge discussed in Section 3.4 and will be examined in the next categorical example dataset.
This example further demonstrates the necessity of prior knowledge. BN learning algorithms fit a model to the data ignoring the underlying truthfulness, ignoring the relevant economic theory. Economic theory can be used as prior knowledge to help mitigate the errors and lead to more truthful BNs. The exclusion of the blacklist argument (forbidden directions) would yield some irrational directions, for instance that occupation of age might affect sex or marriage affects age, simply because these directions could increase the score. Finally, BNs are related to synthetic population generation where the data are usually categorical. This task requires the specification of the joint distribution of the data and BNs accomplish this (Sun and Erath, 2015). Based on the Markov condition 1, the joint distribution can be written down explicitly allowing for synthetic population generation in a sequential order. One commences by generating values for education and sex. Using these two variables, values for occupation. These values, along with income and age can be used to generate for the marital status and so on.

Conclusions
This paper proposed to combine the first phase of the FBED variable selection algorithm with the HC scoring phase leading to a new hybrid algorithm, termed FEDHC. Additionally, a new implementation of the MMHC algorithm was provided. Finally the paper presented robustified (against outliers) versions of FEDHC, PCHC and MMHC. The robustified version of FEDHC was shown to be even nearly 40 times faster than the raw version and yielded BNs of higher quality, when outliers were present. Simulation studies manifested that in terms of computational efficiency, FEDHC is comparable to PHCHC and along with MMHC-2 FEDHC was able to fit BNs to continuous data with sample sizes at the order of hundreds of thousands in a few seconds and at the order of millions in a few minutes. It must be highlighted though that the skeleton identification phase of FEDHC and MMHC-1 have been implemented in R and not in C++. Additionally, FEDHC was always executing significantly fewer CI tests than its competitors. Ultimately, in terms of accuracy, FEDHC outperformed is competitors with continuous data, and it was more accurate than or on par with MMHC-1 and MMHC-2 with categorical data, but less accurate than PCHC.
The rationale of MMHC and PCHC is to perform variable selection to each node and then apply a HC to the resulting network. On the same spirit, Meinshausen and Bühlmann (2006) used LASSO for variable selection with the scopus of constructing an un-directed network. The HC phase could be incorporated in the graphical LASSO to learn the underlying BN. Broadly speaking, the combination of a network learning phase with a scoring phase can yield hybrid algorithms. Other modern hybrid methods for BN learning include Kuipers et al. (2020) on hybrid structure learning and sampling. They combine constraint-based pruning with MCMC inference schemes (also to improve the overall search space) and find a combination that is relatively efficient with relatively good performance. The constraint-based part is interchangeable and could connect well with MMHC, PCHC, or FEDHC.
FEDHC is not the first algorithm that has outperformed MMHC. Recent algorithms include PCHC (Tsagris, 2021), the SP algorithm for Gaussian DAGs (Raskutti and Uhler, 2018) and the NOTEARS (Zheng et al., 2018). The algorithms of Zhang et al. (2012) and Chalupka et al. (2018) were also shown to outperform MMHC in the presence of latent confounders, not examined here. The advantage of the latter two is that employ non-parametric tests, such as kernel CI test, thus allowing for non-linear relationships. BNs that detect non-linear relationships among the variable, such as the algorithms proposed by Zhang et al. (2012) and Chalupka et al. (2018) is what this paper did not cover. Further, our comparative analysis was only with MMHC (Tsamardinos et al., 2006) and PCHC (Tsagris, 2021) due to their close relationship with FEDHC.
Future research includes a comparison of all algorithms in terms of more directions. For instance, a) the effect of Pearson and Spearman CI tests and the effect of X 2 and G 2 CI tests, b) the effect of the outliers, c) the effect of the scoring methods (Tabu search and HC) the effect of the average neighbours (network density), and e) the effect of the number of variables on the quality of the BN learned by either algorithm. These directions can be used to numerically evaluate the asymptotic properties of the BN learning algorithms with tens of millions of observations. Another interesting direction is the incorporation of fast non-linear CI tests, such as the distance correlation (Székely et al., 2007;Székely and Rizzo, 2014;Huo and Székely, 2016;Shen et al., 2022). The distance correlation could be utilized during the skeleton identification of the FEDHC mainly because it performs fewer CI tests than its competitors.
Continuous data with 3 neighbours on average.
Continuous data with 5 neighbours on average. Categorical data. Figure 11: Scalability of the algorithms with respect to the sample size for some selected cases. The results for the other cases convey the same message and are thus not presented. The left column refers to the skeleton identification phase whereas the right column to both phases.    X 2 test of independence for categorical data Alternatively, one could use the Pearson X 2 test statistic that has the same properties as the G 2 test statistic (3). The drawback of X 2 is that it cannot be computed when E ij|l = 0. On the contrary, G 2 is computed in such cases since lim x→0 x log x = 0. For either aforementioned test, when |Z| is the empty set, both tests examine the unconditional association between variables X and Y 18 .

Permutation based p-values
The aforementioned test statistics produce asymptotic p-values. In case of small sample sizes computationally intensive methods like permutations might be preferable. With continuous variables for instance, when testing for unconditional independence the idea is to distort the pairs multiple times and each time calculate the relevant test statistic. For the conditional independence of X and Y conditional on Z the partial correlation is computed from the residuals of two linear regression models, X ∼ Z and Y ∼ Z. In this instance, the pairs of the residual vectors are distorted multiple times. With categorical variables, this approach is more complicated and care must be taken so as to retain the row and column totals of the resulting contingency tables. For either case, the p-value is then computed as the proportion of times the permuted test statistics exceed the observed test statistic that computed using the original data. Permutation based techniques have shown to improve the quality of BNs (Tsamardinos and Borboudakis, 2010) in small sample sized cases. On the contrary, the FEDHC algorithm aims at making inference on datasets with large sample sizes, for which asymptotic statistical tests are valid and reliable enough to produce correct decisions.

B: Computational details of FEDHC
With continuous data, the correlation matrix is computed once and utilised throughout the skeleton identification phase. FEDHC returns the correlation matrix and the matrix of the pvalues of all pairwise associations that is useful in a second run of the algorithm with a different significance level. This is a significant advantage when BNs have to fit to large scale datasets and the correlation matrix can be given as an input to FEDHC to further reduce FEDHC's computational cost. The partial correlation coefficient is given by where R X,Y is the correlation between variables X and Y , R X,Z and R Y,Z denote the correlations between X & Z and Y & Z. A = R −1 X,Y,Z , with A denoting the sub correlation matrix of variables X, Y, Z and A i,j symbolises the element in the i-row and j-th column of matrix A.
The CI tests executed during the initial phase compute the logarithm of the p-value, instead of the p-value itself, to avoid numerical overflows observed with a large test statistic that produces a p-value equal to 0. Additionally, the computational cost of FEDHC's first phase can be further reduced via parallel programming.
It is also possible to store the p-values of each CI test for future reference. When a different significance level must be used, this will further decrease the associated computational cost of the skeleton identification phase in a second run. However, as will be exposed in section 4.4, the cost of this phase is really small (a few seconds) even for millions of observations. The largest 18 For a practical comparison between the two tests based on extensive simulation studies see Alenazi (2020). portion of this phase's computational cost is attributed to the calculation of the correlation matrix, which can be passed into subsequent runs of the algorithm.
Finally, Tsagris (2021) disregarded the potential of applying the PC-orientation rules (Spirtes and Glymour, 1991;Spirtes et al., 2000) prior to the scoring phase as a means of improving the performance of FEDHC and MMHC and this is not pursued any further.

C: The R package pchc
The package pchc was first launched in R in July 2020 and initially contained the PCHC algorithm. It now includes the FEDHC and MMHC-2 algorithms, functions for testing (un)conditional independence with continuous and categorical data, data generation, BN visualisation and utility functions. It imports the R packages bnlearn, Rfast and the built-in package stats. pchc is distributed as part of the CRAN R package repository and is compatible with MacOS-X, Windows, Solaris and Linux operating systems. Once the package is installed and loaded > install.packages("pchc") > library (pchc) it is ready to use without internet connection. The signature of the function fedhc along with a short explanation of its arguments is displayed below.
> fedhc(x, method = "pearson", alpha = 0.05, robust = FALSE, ini.stat = NULL, + R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL) • x: A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix using. Note, that for the categorical case data, the numbers must start from 0. No missing data are allowed.
• method: If you have continuous data, this must be "pearson" (default value) or "cat" if you have categorical data though. With categorical data one has to make sure that the minimum value of each variable is zero. The function g2test from the package Rfast and the relevant functions work that way.
• alpha: The significance level for assessing the p-values. The default value is 0.05.
• robust: If outliers are be removed prior to applying the FEDHC algorithm this must be set to TRUE.
• ini.stat: If the initial test statistics (univariate associations) are available, they can passed to this argument.
• R: If the correlation matrix is available, pass it here.
• restart: An integer, the number of random restarts. The default value is 10.
• score: A character string, the label of the network score to be used in the algorithm. If none is specified, the default score is the Bayesian Information Criterion for continuous data sets. The available score for continuous variables are: "bic-g" (default value), "loglikg", "aic-g", "bic-g" or "bge". The available score categorical variables are: "bde", "loglik" or "bic".
• blacklist: A data frame with two columns (optionally labeled "from" and "to"), containing a set of forbidden directions.
• whitelist: A data frame with two columns (optionally labeled "from" and "to"), containing a set of must-add directions.
The output of the fedhc function is a list including: • ini: A list including the output of the fedhc.skel function.
• dag: A "bn" class output, a list including the outcome of the Hill-Climbing phase. See the package bnlearn for more details.
• scoring: The highest score value observed during the scoring phase.
• runtime: The duration of the algorithm.