Article

CL-NOTEARS: Continuous Optimization Algorithm Based on Curriculum Learning Framework

1 Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China
2 National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(17), 2640; https://doi.org/10.3390/math12172640
Submission received: 8 July 2024 / Revised: 22 August 2024 / Accepted: 23 August 2024 / Published: 25 August 2024
(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract

Causal structure learning plays a crucial role in the current field of artificial intelligence, yet existing causal structure learning methods are susceptible to interference from data sample noise and often become trapped in local optima. To address these challenges, this paper introduces a continuous optimization algorithm based on the curriculum learning framework: CL-NOTEARS. The model utilizes the curriculum loss function during training as a priority evaluation metric for curriculum selection and formulates the sample learning sequence of the model through task-level curricula, thereby enhancing the model’s learning performance. A curriculum-based sample prioritization strategy is employed that dynamically adjusts the training sequence based on variations in loss function values across different samples throughout the training process. The results demonstrate a significant reduction in the impact of sample noise in the data, leading to improved model training performance.

1. Introduction

Causal structure learning (CSL) is a crucial component of causal discovery and inference and aims to identify a directed acyclic graph (DAG) that represents causal relationships among variables. The accuracy of the DAG structure directly impacts the precision of subsequent causal inference. In previous studies, experts primarily constructed DAGs through domain knowledge and randomized controlled experiments, which are subjective and time-consuming. With the explosion of data volume, significant attention has been given to learning causal structures purely from observational data [1,2]. However, due to the high-dimensional feature space and substantial time costs, traditional CSL is computationally challenging, especially in high-dimensional and continuous scenarios. CSL finds widespread applications in real-world scenarios, including genetic analysis [3], financial risk prediction [4], biomedical image processing [5], fault diagnosis [6], telemarketing call filtering [7], and more. However, data-driven CSL commonly encounters the following issues: (1) massive data volume leads to exponential computational complexity with increasing feature dimensions, rendering it an NP-hard problem [8] and (2) inconsistent data quality, where sample noise significantly affects the effectiveness of CSL.
CSL helps reveal causal relationships within data, enhancing the interpretability and predictive capabilities of models. Understanding causal structures in complex systems is crucial for causal reasoning and decision-making. Existing CSL methods primarily fall into three categories: (1) Constraint-based (CB) methods, such as the IC (inductive causation) [9] and PC (Peter and Clark) algorithms [10], mainly rely on conditional independence (CI) and faithfulness constraints. These methods suffer from equivalence class issues in the output structure, preventing them from fully capturing complete causal information. (2) Score-based (SB) methods focus on designing scoring functions and optimizing searches. These methods formalize possible network structures and node parameters into linear descriptions, enhancing the intelligence of learning algorithms and problem-solving capabilities, but they reach certain limits in individual problem domains. (3) Gradient-based (GB) methods have also been formulated. Zheng et al. [11] developed a continuous optimization method for learning causal structures, overcoming the computational cost of non-cyclic constraints. This method models traditional combinatorial optimization problems as continuous optimization problems, improving the precision and efficiency of structure learning. It has been extended to handle intervention data, confounding factors, and time series. Using likelihood-based objectives, Ng et al. [12] formulated the problem as an unconstrained optimization problem involving only soft constraints, while Yu et al. [13] developed an equivalent representation of DAGs, allowing for continuous optimization in DAG space without the need for equality constraints. However, these methods are mainly based on structural equation modeling (SEM) equations of specific structures, and the accuracy of learned structures depends on the quality of the modeled data. As the number of sample nodes increases, algorithm performance quickly deteriorates due to the influence of data noise.
Motivated by this situation, this work explores how to mitigate the impact of sample data noise on algorithms and proposes a continuous optimization algorithm based on curriculum learning (CL), termed CL-NOTEARS. CL, introduced by Bengio in 2009 at ICML [14], mimics human cognitive learning by gradually transitioning from simpler to more complex samples. The ultimate goal is to guide the model’s learning direction, reduce the influence of sample noise, and make machine learning models more efficient and accurate. Over the subsequent years, many researchers have developed CL strategies tailored to specific applications such as weakly supervised object localization, object detection, and neural machine translation [15]. Despite its success across various domains, mainstream research on CL remains relatively limited.
Based on the deep guidance and denoising mechanisms provided by CL, this work constructs a framework for curriculum sample learning aimed at mitigating the impact of sample noise on algorithms. By formulating a curriculum loss function, the framework adaptively adjusts the model’s sample learning sequence. The weights of the final causal diagram are dynamically updated based on the causal structure learned at different curriculum stages. The overall framework is illustrated in Figure 1.
Specifically, the contributions of this work are as follows:
  • To evaluate the difficulty of learning data samples, this work proposes a similarity clustering method based on sample features. This method integrates similarity and entropy values among various sample features to classify samples into distinct types.
  • To alleviate the interference of data noise, this work designs a loss function to measure the learning difficulty of samples for curriculum-level selection. In each curriculum stage, candidate samples are used as new learning samples for structural learning. The optimal candidate samples are then selected as new curriculum samples for the next stage based on the measurement of the loss function value during the learning process. The order of model sample learning is adaptively adjusted according to the loss function value, leading to the final optimal network structure.
  • This work dynamically adjusts the weights of learned edges based on the results from different stages of the curriculum, filtering to obtain the final causal structure.
  • Experiments conducted on multiple synthetic datasets with varying scales and on a real dataset demonstrate that CL-NOTEARS exhibits good generalization ability and higher accuracy across networks with different scales and complexities.
Table 1 summarizes the notations and descriptions.

2. Related Work

Mining and inferring causal relationships are crucial for causal reasoning and decision-making in complex systems, especially in biomedical research, disease diagnosis, and public policy evaluation. The learned causal structure is represented as a DAG and is denoted as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_d\}$ represents the set of nodes (features), and $E = \{e_1, e_2, \ldots, e_m\}$ is the set of edges, representing causal relationships between different variables. The primary objective of causal discovery is to recover the underlying causal structure and the associated conditional probability distributions. Methods for learning causal structures from observational data mainly fall into three categories:
CB methods: CB methods are represented as constraint satisfaction problems and primarily explore causal skeletons by testing conditional independencies among different variables. They then orient edges from the skeleton to Markov equivalence classes. Some noteworthy algorithms in this category include the IC [9], PC [10], and IAMB algorithms [16]. The IC algorithm searches for separation sets in the node set for each pair of variables; if there are no separation sets, the variables remain associated. The PC algorithm laid the foundation for CB methods and was preceded by the SGS algorithm [17]. The PC algorithm relies mainly on efficient CI tests, assuming that causal edges between non-conditionally independent nodes should be retained. It has been effective for discovering causal relationship structures with high-dimensional sparse connections. However, its output depends on the variable processing order, which becomes more pronounced in high-dimensional environments, leading to highly variable results. Consequently, many researchers have studied and improved the PC algorithm. Li et al. [18] alleviated the problem of high-order CI tests by introducing the FEPC algorithm. Additionally, some scholars addressed the instability caused by the PC algorithm’s dependency on node order and proposed the PC-stable algorithm [19] and the PC-parallel algorithm [20]. CB methods can handle a wider range of data types and distributions, and they boast high computational efficiency, making them highly interpretable. However, the accuracy of the learning process depends on the number of CI tests performed and the size of the constraint set. CB methods are sensitive to CI testing and data noise, and higher-order dependencies are unreliable for large-scale networks and complex data. Therefore, some research has introduced enhanced CI tests, including kernel-based causal learning methods such as the kernel-based Hilbert–Schmidt norms [21] and the kernel-based conditional independence test (KCI) [22]. However, these approaches require the assumption that random variables in different statistical tests are independent of each other, and research on such methods is still in its early stages. CB methods are based on the Markov assumption and require the faithfulness assumption to infer CI relationships between variables for causal analysis, but they cannot provide the direction of edges within Markov equivalence classes.
SB methods: SB methods primarily assess the fit between the network structure and data using scoring functions and employ search algorithms to find the optimal structure. Common scoring functions include BIC [23], BDeu [24], MDL [25], and AIC [26]. SB methods mainly consist of two parts: scoring functions and search algorithms. The scoring function measures the fit between the structure and sample data; the better the fit, the higher the score obtained. The search algorithm is then used to find the network structure with the highest score, essentially seeking a network structure G that satisfies:
$$G^{*} = \operatorname*{arg\,max}_{G \in \mathcal{G}} \; \mathrm{Score}(G \mid D).$$
Many studies have proposed using greedy searches and heuristic searches to solve combinatorial problems and improve algorithm performance, with examples including genetic algorithms [27], particle swarm optimization [28], ant colony optimization [29], and bee algorithms [30]. However, the heuristic nature of these algorithms often leads to local optima. In addition to these heuristic search algorithms, some methods transform combinatorial optimization problems into continuous optimization problems.
GB methods: From the perspective of the data distribution characteristics of the causal mechanism, some scholars model data generation methods based on the structural equation model and transform the combinatorial optimization problem into a continuous optimization problem using gradient-based approaches through a smooth score function and a smooth representation of acyclicity, as represented by the NOTEARS algorithm [11]. The NOTEARS algorithm models the data through a weighted adjacency matrix and a smooth score, aiming to find the optimal structure while satisfying the acyclicity constraint. Subsequently, DAG-GNN [13] and GAE [31] extended the NOTEARS idea to nonlinear scenarios, assuming that all variables undergo a common nonlinear transformation. GraN-DAG [32] models variable distributions using multilayer perceptrons (MLPs) and constructs equivalent weighted adjacency matrices. Zhu et al. [33] utilized reinforcement learning algorithms to search for the best-scoring DAG, resulting in training times that far exceed those of other algorithms due to the exploratory nature of reinforcement learning.
During the training of learning algorithms based on observational data, a common phenomenon emerges: there are significant disparities in performance as the sample size increases and with variations in sampling capabilities. It is evident that both the quality and scale of the data have a considerable impact on algorithm performance. The objective is to explore a hierarchical learning framework that involves designing sample prioritization evaluation metrics to partition samples and to develop curriculum loss functions for the model training process. These steps aim to enable adaptive adjustments to the sequence of training data samples throughout the model training process.

3. Problem Formulation

Let the DAG be represented as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_d\}$ represents the set of nodes, with d being the number of nodes. The set of edges is $E = \{(v_i, v_j) \mid i, j = 1, 2, \ldots, d\}$, where $(v_i, v_j)$ denotes an edge from node $v_i$ to node $v_j$. Each node $v_i$ is associated with a random variable $X_i$. Probability models associated with G assume the Markov condition: given its parent nodes, $X_i$ is independent of its non-descendant nodes. The joint probability distribution of the data can therefore be decomposed into the product of the conditional probabilities of each individual node given its parent nodes, following the probability chain rule:
$$P(X) = \prod_{i=1}^{d} P\!\left(X_i \mid \mathrm{Pa}(X_i)\right),$$
where $P(X_i \mid \mathrm{Pa}(X_i))$ represents the conditional probability of $X_i$ given its parent nodes, with $\mathrm{Pa}(X_i) = \{X_k \mid (v_k, v_i) \in E\}$.
Considering the limitations of traditional SEM, where some variables can only be observed and measured empirically, and inspired by recent advancements in continuous optimization, we model X via a SEM defined by the weighted adjacency matrix W R d × d . Consequently, we operate not in the discrete space but in R d × d , which is the continuous space of d × d real matrices.
Let $A(W) \in \{0, 1\}^{d \times d}$ be the binary matrix such that $A(W)_{ij} = 1 \Leftrightarrow w_{ij} \neq 0$ and zero otherwise. Then $A(W)$ defines the adjacency matrix of a directed graph $G(W)$. In addition to the graph $G(W)$, $W = [w_1 \mid w_2 \mid \cdots \mid w_d]$ defines the linear SEM through $X_j = w_j^{T} X + z_j$, where $X = (X_1, X_2, \ldots, X_d)$ is a random vector and $z = (z_1, z_2, \ldots, z_d)$ is an arbitrary noise vector. Assuming z is mutually independent with no unobserved confounding factors, the linear SEM is thus fully defined.
Zheng et al. [11] proved that the matrix $W \in \mathbb{R}^{d \times d}$ is a DAG if and only if $h(W) = \mathrm{tr}\!\left(e^{W \circ W}\right) - d = 0$ holds. Here, $\circ$ denotes the Hadamard product, and $e^{W \circ W}$ is the matrix exponential. The resulting continuous constrained optimization problem is:
$$\min_{W \in \mathbb{R}^{d \times d}} \; \frac{1}{2s} \left\| X - XW \right\|_F^2 + \lambda \|W\|_1 \quad \text{subject to} \quad h(W) = \mathrm{tr}\!\left(e^{W \circ W}\right) - d = 0,$$
where $\|\cdot\|_1$ represents the $\ell_1$ norm, which promotes sparsity in the matrix. Specifically, $\|W\|_1 = \|\mathrm{vec}(W)\|_1$, resulting in the regularized score function, with $\lambda$ as the regularization coefficient. Here, $\frac{1}{2s}\|X - XW\|_F^2$ is the least-squares objective, where $\|\cdot\|_F$ denotes the Frobenius norm, i.e., the square root of the sum of the squared absolute values of the matrix elements.
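For illustration, a minimal NumPy/SciPy sketch of the regularized least-squares score and the acyclicity function h(W) defined above is given below; the function and variable names are ours, and the data matrix X is assumed to have s rows (samples) and d columns (variables).

```python
import numpy as np
import scipy.linalg as slin

def least_squares_loss(W, X, lam=0.1):
    """Regularized LS score: (1 / (2s)) * ||X - XW||_F^2 + lam * ||W||_1."""
    s = X.shape[0]
    residual = X - X @ W
    return 0.5 / s * np.sum(residual ** 2) + lam * np.abs(W).sum()

def h(W):
    """Acyclicity measure h(W) = tr(exp(W o W)) - d, zero iff W encodes a DAG."""
    d = W.shape[0]
    return np.trace(slin.expm(W * W)) - d

# Example: a 3-variable chain X1 -> X2 -> X3 satisfies the acyclicity constraint.
W_chain = np.array([[0.0, 0.8, 0.0],
                    [0.0, 0.0, 0.5],
                    [0.0, 0.0, 0.0]])
X = np.random.randn(100, 3)
print(h(W_chain), least_squares_loss(W_chain, X))
```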

4. Continuous Optimization Based on CL

The section begins with constructing a curriculum sample set partitioning model using Gaussian mixture kernel clustering and designing a CSL mechanism based on CL. The constrained problem is then reformulated as an optimization problem using the augmented Lagrangian method (ALM). The final step involves detailing the process of updating causal relationships with dynamic coefficient weighting. The overall framework is illustrated in Figure 2.

4.1. Similarity-Based Dataset Classification

Gaussian distributions have wide practical significance in manufacturing and scientific experiments, as the probability distributions of many random variables can be approximated using Gaussian distributions. For a random variable X j , a Gaussian distribution with mean μ and variance σ 2 is given by:
$$P(X_j \mid \Theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(X_j - \mu)^2}{2\sigma^2}\right).$$
Here, $\Theta = \{\mu, \sigma^2\}$ represents the parameters. $P(X_j \mid \Theta)$ describes the probability density function of $X_j$ and indicates how likely different values of $X_j$ are under the Gaussian distribution. Gaussian mixture models offer strong probabilistic interpretations and advantages, such as robustness to noise, and have been widely applied. Leveraging the characteristics of causal learning sample modeling, Gaussian mixture models can cluster the samples that need to be learned, dividing the data sample set into different learning difficulty levels. Assuming the learning sample set is $D = \{D_1, D_2, \ldots, D_s\}$, where $D_k \in \mathbb{R}^d$ (i.e., each sample $D_k$ is a vector of real numbers in d-dimensional space) and each $D_k$ is drawn from a mixture density composed of n Gaussian components, the probability density function is given by:
$$F(D_k; \Theta) = \sum_{i=1}^{n} \beta_i \, f(D_k \mid V_i, \Sigma_i),$$
where β i , V i , and  Σ i represent the mixture coefficient, mean vector, and covariance matrix of the i t h component, respectively; f ( D k V i , Σ i ) is the Gaussian density function of the i t h component. Due to the involvement of numerous features in causal sample data that belong to high-dimensional space, traditional clustering methods are difficult to apply directly. Peng et al. [34] introduced information entropy into an extended Gaussian mixture model to address subspace clustering of high-dimensional data, specializing in the covariance matrix of Gaussian vectors:
$$\Sigma_i = \sigma_i^2 \, \mathrm{diag}\!\left(g_{i1}^{-1}, g_{i2}^{-1}, \ldots, g_{id}^{-1}\right),$$
where $\mathrm{diag}(g_{i1}^{-1}, g_{i2}^{-1}, \ldots, g_{id}^{-1})$ represents a diagonal matrix with diagonal elements $g_{i1}^{-1}, g_{i2}^{-1}, \ldots, g_{id}^{-1}$, $g_{ij}$ represents the relevance of feature j to cluster i, and $\sigma_i^2$ represents the local variance of cluster i. The probability density function of the i-th Gaussian component is:
$$f(D_k \mid V_i, \sigma_i^2, g_i) = \prod_{j=1}^{d} \sqrt{\frac{g_{ij}}{2\pi\sigma_i^2}} \; \exp\!\left(-\frac{1}{2\sigma_i^2} \sum_{j=1}^{d} g_{ij} \left(V_{ij} - D_{kj}\right)^2\right),$$
where V i j is the mean of feature j for the i t h cluster, D k j is the j t h feature of data k, and  g i j adjusts the relevance of feature j for the i t h cluster.
By minimizing the Kullback–Leibler (KL) divergence [34] to estimate the parameters $\Phi = \{\beta_i, V_i, \sigma_i^2, g_i\}_{1 \le i \le n}$, we obtain:
$$G(\mu_k, \beta_i, V_i, \sigma_i^2, g_i) = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{s} \mu_{ik} \left[ -\log \beta_i + \sum_{j=1}^{d} \left( \frac{g_{ij}}{2\sigma_i^2} \left(V_{ij} - D_{kj}\right)^2 - \log \sqrt{\frac{g_{ij}}{2\pi\sigma_i^2}} \right) + \log \mu_{ik} \right],$$
where $\mu_k = (\mu_{1k}, \mu_{2k}, \ldots, \mu_{nk})$ is a vector composed of fuzzy membership values.
Building on the research conducted by Peng et al. [34], this work employs the Gaussian mixture kernel clustering method to group causal sample datasets with high-dimensional features. This approach results in curriculum sample sets characterized by high inter-feature similarity, as illustrated in Figure 3. Given that the dataset is modeled using SEM, it is posited that samples with higher similarity tend to exhibit higher quality and have a more significant impact on CSL algorithms. Consequently, the quality of the partitioned curriculum sample sets directly influences the effectiveness of subsequent adaptive training processes.
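As a rough illustration of this partitioning step, the sketch below substitutes scikit-learn's standard GaussianMixture for the entropy-weighted mixture model of Peng et al. [34]; the component labels are then used to split the sample set D into n candidate subsets. Function and variable names are ours, not the authors' code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def partition_candidates(D, n_components, seed=0):
    """Cluster the s x d sample matrix D into n candidate subsets by feature
    similarity; a plain diagonal-covariance Gaussian mixture stands in for the
    entropy-weighted mixture model described in Section 4.1."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=seed)
    labels = gmm.fit_predict(D)
    return [D[labels == i] for i in range(n_components)]

# Example: split 1000 synthetic samples over 10 variables into 5 candidate sets.
D = np.random.randn(1000, 10)
candidates = partition_candidates(D, n_components=5)
print([c.shape for c in candidates])
```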

4.2. The Self-Training Curriculum Framework

CL [35] refers to a sequence of training criteria over T training steps, $C = \{C_1, C_2, \ldots, C_T\}$, where each criterion $C_t$ involves a reweighting of the target training distribution $P(D_i)$:
$$C_t(D_i) \propto O_t(D_i)\, P(D_i), \qquad \forall \text{ example } D_i \in \text{training set } D,$$
such that the following three conditions are satisfied:
  • The entropy of distributions gradually increases:
    $H(C_t) < H(C_{t+1})$.
  • The weight for any example increases:
    $O_t(D_i) < O_{t+1}(D_i)$.
  • The model finally learns all the samples:
    $C_T(D_i) = P(D_i)$.
CL is a training strategy for machine learning models that involves using a curriculum to guide the learning process. In Equation (10), it is suggested that the diversity and informativeness of the training set should gradually increase. This means that in later stages, the probability of sampling more challenging examples is increased through reweighting. Equation (11) indicates that more training samples are gradually introduced, thereby enlarging the training set. Finally, Equation (12) signifies that at the end of the process, all sample weights become uniform, and training is conducted on the complete target training set.
Some scholars [36,37] elucidate that CL, from the perspectives of optimization and data distribution, involves progressive learning from “clean” (low-noise) samples to samples with increasing noise. This approach encourages models to focus more on easier data and less on harder data, thereby leading to noise reduction.
Inspired by the CL to mitigate the impact of noise in samples, a CL mechanism is employed to facilitate the adaptive training of these samples. Based on their performance during training, samples are categorized into curriculum samples and candidate samples. Through iterative training, the performance of candidate samples is continually updated, and the next stage of curriculum samples is determined accordingly. Ultimately, the entire set of training samples D is learned.
To achieve adaptive training of curriculum samples, this work designs a curriculum loss function based on acyclicity constraints to evaluate the performance of different samples during the training process. This paper primarily investigates the linear SEM and the LS loss $l(W, X) = \frac{1}{2s}\|X - XW\|_F^2$. Although the results are applicable to any smooth loss function defined on $\mathbb{R}^{d \times d}$, extensive research has focused on the statistical properties of the LS loss in scoring DAGs. It has been shown that minimizing the LS loss can recover the true DAG with high probability in finite samples, including in the high-dimensional regime ($d \gg s$). Thus, for both Gaussian SEMs [38,39] and non-Gaussian SEMs [40], the LS loss is consistent.
$$loss(X_{H_t^i}) = \begin{cases} l(W_t, X_{H_t^i}), & 1 \le i \le n,\ t = 1, \\ \Delta l(W_t, X_{H_t^i}, X_{C_{t-1}}), & 1 \le i \le n,\ t > 1, \end{cases}$$
where $X_{C_t}$ represents the curriculum sample for the t-th stage, $X_{H_t^i}$ represents the i-th candidate sample of the t-th stage candidate set, and $W_t$ denotes the weighted adjacency matrix of the t-th stage.
The loss for each candidate set in the first curriculum stage is computed using the LS loss $l(W_t, X_{H_t^i})$. For subsequent curriculum stages, the difference loss $\Delta l(W_t, X_{H_t^i}, X_{C_{t-1}})$ is defined as the difference between the LS loss of the current candidate set and the LS loss of the curriculum sample from the previous stage:
$$l(W_t, X_{H_t^i}) = \frac{1}{2s} \left\| X_{H_t^i} - X_{H_t^i} W_t \right\|_F^2,$$
$$\Delta l(W_t, X_{H_t^i}, X_{C_{t-1}}) = \frac{1}{2s} \left\| X_{H_t^i} - X_{H_t^i} W_t \right\|_F^2 - l(W_t, X_{C_{t-1}}).$$
The model adjusts the order of sample learning and updates the curriculum samples and candidate samples based on the losses obtained during the training process. The sample with the minimum loss is selected as the curriculum sample for the next stage, and this sample is removed from the candidate set.
$$X_{C_t} = \operatorname*{arg\,min}_{1 \le i \le n} \, loss(X_{H_t^i}), \qquad X_{H_t} = X_{H_{t-1}} \setminus X_{C_{t-1}}, \qquad 1 \le t \le T.$$
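The selection rule above can be sketched as follows; ls_loss is the plain LS loss from the stage losses defined above (without the regularization term), candidates is the list of remaining candidate subsets, and X_prev is the previous stage's curriculum sample (None at t = 1). This is an illustrative reading of the rule under our own naming, not the authors' reference code.

```python
import numpy as np

def ls_loss(W, X):
    """Plain least-squares loss (1 / (2s)) * ||X - XW||_F^2 for one sample set."""
    return 0.5 / X.shape[0] * np.sum((X - X @ W) ** 2)

def select_curriculum_sample(W_t, candidates, X_prev=None):
    """Pick the candidate subset with the smallest (difference) loss as the next
    curriculum sample and remove it from the candidate pool."""
    def stage_loss(X_cand):
        if X_prev is None:                                    # first stage: plain LS loss
            return ls_loss(W_t, X_cand)
        return ls_loss(W_t, X_cand) - ls_loss(W_t, X_prev)    # later stages: difference loss
    idx = min(range(len(candidates)), key=lambda i: stage_loss(candidates[i]))
    remaining = [c for i, c in enumerate(candidates) if i != idx]
    return candidates[idx], remaining
```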
In each stage, the goal is to learn the causal structure G t of that stage from the curriculum sample X C t :
$$\min_{W_t \in \mathbb{R}^{d \times d}} \; l(W_t, C_t) + \lambda \|W_t\|_1 \quad \text{subject to} \quad h(W_t) = 0.$$
Inspired by NOTEARS, the augmented Lagrangian method (ALM) is employed to transform constrained equations into a series of unconstrained problems. The ALM constructs the scoring function L to optimize the given constrained formulation. The corresponding augmented Lagrangian formula is as follows:
$$L_\rho(W_t, C_t, \alpha) = l(W_t, C_t) + \lambda \|W_t\|_1 + \alpha\, h(W_t) + \frac{\rho}{2}\, h(W_t)^2.$$
Here, $\rho > 0$ denotes the penalty parameter, and $\alpha$ represents the estimate of the Lagrange multiplier. As $\rho$ tends to infinity, minimizing $L_\rho$ necessitates satisfying $h(W_t) = 0$. By gradually increasing $\rho$, the augmented Lagrangian is minimized, and the Lagrange multiplier $\alpha$ is updated to converge to the optimality conditions. In the special case where $\lambda = 0$, the nonsmooth term vanishes, and the problem reduces to unconstrained smooth minimization, which can be solved using the L-BFGS approximation algorithm [41]. When $\lambda > 0$, the problem can be approximately solved using the proximal quasi-Newton (PQN) method [42].
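A schematic of the ALM outer loop is sketched below for the smooth special case λ = 0 mentioned above, so that SciPy's L-BFGS-B can be applied directly to the inner minimization (gradients are omitted and approximated numerically, so this is only practical for small d). The penalty-increase and multiplier-update heuristics are standard choices for this kind of scheme, not necessarily the paper's exact settings, and all names are ours.

```python
import numpy as np
import scipy.linalg as slin
import scipy.optimize as sopt

def h_func(W):
    """Acyclicity measure h(W) = tr(exp(W o W)) - d."""
    return np.trace(slin.expm(W * W)) - W.shape[0]

def solve_stage(X, h_tol=1e-8, rho_max=1e16, max_outer=50):
    """ALM outer loop for one curriculum stage (lambda = 0, so the inner
    problem is smooth and handed to L-BFGS-B)."""
    s, d = X.shape
    w = np.zeros(d * d)
    rho, alpha, h_val = 1.0, 0.0, np.inf

    def objective(w_flat):
        W = w_flat.reshape(d, d)
        loss = 0.5 / s * np.sum((X - X @ W) ** 2)
        hW = h_func(W)
        return loss + alpha * hW + 0.5 * rho * hW ** 2

    for _ in range(max_outer):
        w = sopt.minimize(objective, w, method="L-BFGS-B").x
        h_new = h_func(w.reshape(d, d))
        if abs(h_new) > 0.25 * abs(h_val):   # insufficient progress on the constraint
            rho = min(rho * 10.0, rho_max)
        alpha += rho * h_new                 # multiplier (dual) update
        h_val = h_new
        if abs(h_val) <= h_tol or rho >= rho_max:
            break
    return w.reshape(d, d)
```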

4.3. Pruning

To manage the curriculum iteration process, edge weights are dynamically updated to identify causal edges for generating new samples in subsequent rounds. New causal edges are added to the previously established edge set. After each training update, a probability threshold ( γ = 0.5 ) is applied to determine whether to retain or discard these edges. Figure 4 illustrates this process: (a) shows the causal graph for the first two curriculum stages, (b) represents the t t h curriculum stage, and (c) displays the updated result based on (a) and (b). The density of the dashed lines in the figure reflects the probability of the causal edges. To iteratively learn the causal structure across curriculum stages, edge weights and scores are computed.
  • Set the edge weights ($B_t$, $M_t$, $N_t$):
    $$B_t = E_1 + E_2 + \cdots + E_t, \qquad M_t = E_t - E_{t-1}, \qquad N_t = E_{t-1} - E_t,$$
    where $E_t$ represents the weight matrix of the edges in the t-th stage. Specifically: (1) $B_t$ is the weight matrix of the edges that appear in or before the t-th stage; (2) $M_t$ is the weight matrix of edges newly appearing in the current curriculum stage; and (3) $N_t$ is the weight matrix of edges that appeared in previous curriculum stages and have since disappeared. The weight of the initial curriculum stage is set to $\frac{t-1}{t}$, where t is the number of curriculum stages, and subsequent stages are weighted in decreasing order of stage number, because the initial curriculum stage contains the least data noise and its learned edges are therefore more reliable.
  • Compute the score $S_{\Delta BIC}$. BIC [23] uses the log-likelihood to measure how well the structure fits the data, assuming the samples are independent and identically distributed. The BIC scoring function is given by:
    $$BIC(G \mid D) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} m_{ijk} \log \theta_{ijk} - \frac{1}{2} \sum_{i=1}^{n} q_i (r_i - 1) \log s,$$
    where s is the number of samples, $q_i$ is the number of configurations of the parent set of $X_i$, $r_i$ is the number of values of $X_i$, $m_{ijk}$ is the number of samples in which the parent set of $X_i$ takes its j-th configuration and $X_i$ takes its k-th value, and $\sum_{i=1}^{n} q_i (r_i - 1)$ denotes the number of parameters in the network. The conditional probability is estimated as $\theta_{ijk} = \frac{m_{ijk}}{m_{ij}}$, with $0 \le \theta_{ijk} \le 1$ and $\sum_k \theta_{ijk} = 1$.
    Whether the edges in $M_t$ and $N_t$ are retained is determined by comparing BIC scores:
    $$S_{\Delta BIC} = \begin{cases} 1, & BIC(G' \mid D) - BIC(G \mid D) > 0, \\ 0, & BIC(G' \mid D) - BIC(G \mid D) \le 0, \end{cases}$$
    where G represents the DAG composed of the variables $X = \{X_1, X_2, \ldots, X_d\}$ and $G'$ is the DAG with the additional edges from $M_t$ or $N_t$.
Based on the above-mentioned setting, the weight matrix for the t t h round, W t , is given by:
$$W_t = B_t + S_{\Delta BIC} \cdot M_t + S_{\Delta BIC} \cdot N_t.$$
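The update above can be sketched as follows; how the two BIC scores are obtained is left to a Bayesian network scoring routine and is passed in as plain numbers here, γ = 0.5 is the retention threshold mentioned earlier, and the function and parameter names are illustrative rather than the authors' implementation.

```python
import numpy as np

def update_weight_matrix(B_t, M_t, N_t, bic_with_edges, bic_without_edges, gamma=0.5):
    """Combine persistent edge weights (B_t) with newly appearing (M_t) and
    disappearing (N_t) edge weights, gated by the BIC comparison S_dBIC, then
    threshold at gamma to decide which edges survive this curriculum round.

    bic_with_edges / bic_without_edges are the BIC scores of the graph with and
    without the candidate edges; computing them is left to a BN scoring routine."""
    s_dbic = 1.0 if (bic_with_edges - bic_without_edges) > 0 else 0.0
    W_t = B_t + s_dbic * M_t + s_dbic * N_t
    A_t = (np.abs(W_t) > gamma).astype(int)   # binary adjacency after pruning
    return W_t, A_t
```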

4.4. The CL-NOTEARS Algorithm

Based on the analysis above, the CL-NOTEARS algorithm was developed. The algorithm involves three steps: (1) obtaining the initial candidate dataset using a Gaussian mixture kernel clustering method (lines 4–5), (2) computing the loss values for each candidate dataset at various curriculum stages and updating the curriculum dataset based on comparisons (lines 6–15), and (3) updating the weight matrix and iteratively refining the causal structure throughout the CL process (lines 16–17). The pseudocode for the CL-NOTEARS algorithm is provided in Algorithm 1.
Algorithm 1 CL-NOTEARS
Require:
 1: D = {D_1, D_2, ..., D_s}: the dataset;
 2: T: the number of curriculum stages;
 3: n: the number of Gaussian components;
 4: Initialize the weight matrix W with dimension d × d.
Ensure:
 DAG: the causal graph
 5: Generate an initial set of candidates: X_{H_1} ← Gauss(D, n)
 6: Update the weight matrix W using Equation (18) and calculate the loss of the candidate datasets using Equation (14)
 7: while t < T do
 8:   if t = 1 then
 9:     Update the curriculum dataset: X_{C_1} = argmin_{i ∈ [1, n]} loss(X_{H_1}^i)
10:     Update the candidate dataset: X_{H_2} = X_{H_1} \ X_{C_1}
11:   else
12:     Update the weight matrix W using Equation (18)
13:     Calculate the loss of the candidate datasets using Equation (15)
14:     Update the curriculum dataset: X_{C_t} = argmin_{i ∈ [1, n]} Δloss(X_{H_t}^i; X_{C_{t-1}})
15:     Update the candidate dataset: X_{H_t} = X_{H_{t-1}} \ X_{C_{t-1}}
16:   end if
17:   Calculate the score S_ΔBIC
18:   Update the causal graph matrix W using Equation (21)
19: end while

5. Experiments

To encompass more general and challenging cases in the CSL scenario, this work addresses the following three research questions:
  • R Q 1 : In scenarios involving linear Gaussian (LG) noise and linear non-Gaussian (LN) noise, is the CL-NOTEARS algorithm superior to other models?
  • R Q 2 : Are the curriculum mechanism and continual update strategy effective for CSL?
  • R Q 3 : In real-world scenarios, can the CL-NOTEARS algorithm still reduce noise in the data and provide a greater advantage?

5.1. General Settings

All the experiments were performed on a computer equipped with an Intel(R) Core(TM) i7-1165G7 CPU at 2.80 GHz and 16 GB of memory; the development environment was PyCharm 2021.1.2.
Datasets: Following previous research, we generated six types of synthetic datasets to answer RQ1. For RQ2, we chose four types of LN noise datasets. For RQ3, we utilized real BN data from Sachs.
  • Simulated data: The work employed two graph sampling schemes, Erdős–Rényi (ER) and scale-free (SF) [43], varying across four aspects: the data generation process, the number of nodes, edge sparsity, and graph type. The data generation process used LG/LN additive noise models (ANMs). For the LG model, data were sampled as $X_j = A^T X_{pa(j)} + N_j$, with independent standard Gaussian noise $N_j$. For the LN model, the work compared two noise distributions: additive noise $N_j$ from either an exponential or a uniform distribution with uniformly sampled variances. Each dataset type was randomly sampled according to the ER or SF scheme, and graphs with $d \in \{10, 30, 50\}$ nodes and $\{1.5d, 2d, 3d\}$ edges were considered (a minimal data-generation sketch follows below, after the pointer to Table 2).
  • Real BN data: The work adopts a real BN dataset—specifically, the Sachs dataset. Sachs is a multivariate proteomic dataset that is widely used in causal discovery. It contains measurements of various proteins and phospholipids found in human immune system cells, including 11 phosphorylated proteins and phospholipids (PKC, PKA, P38, Jnk, Raf, Mek, Erk, Akt, Plcg, PIP2, and PIP3).
The details are shown in Table 2.
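As a rough sketch of the linear Gaussian data generation described above (ER variant only), the snippet below samples a random DAG with networkx and simulates the linear SEM with standard Gaussian noise; the edge-weight range and the omission of the scale-free variant are simplifications under our own assumptions, not the paper's exact settings.

```python
import numpy as np
import networkx as nx

def simulate_er_linear_gaussian(d=10, edge_factor=2, s=1000, seed=0):
    """Sample an Erdos-Renyi DAG with roughly edge_factor * d edges and draw
    s observations from the corresponding linear Gaussian SEM."""
    rng = np.random.default_rng(seed)
    p = 2.0 * edge_factor / (d - 1)               # expected number of edges ~ edge_factor * d
    G = nx.gnp_random_graph(d, p, seed=seed)
    W = np.zeros((d, d))
    for u, v in G.edges():                        # orient low -> high index to stay acyclic
        i, j = sorted((u, v))
        W[i, j] = rng.uniform(0.5, 2.0) * rng.choice([-1.0, 1.0])
    X = np.zeros((s, d))
    for j in range(d):                            # ancestral sampling in topological order
        X[:, j] = X @ W[:, j] + rng.standard_normal(s)
    return W, X

W_true, X = simulate_er_linear_gaussian(d=10, edge_factor=2, s=1000)
print(W_true.shape, X.shape)
```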
Evaluation index: The work evaluates the quality of learned network structures using the proposed CL-NOTEARS approach through the following measurements:
  • Structural Hamming distance (SHD): The SHD is a standard distance for comparing graphs via their (binary) adjacency matrices; every edge that is missing, extra, or reversed with respect to the target graph counts as an error. The smaller the SHD, the fewer incorrect edges are learned, and the better the learning effect.
  • True positive rate (TPR): In binary classification, the TPR, also known as sensitivity or recall, measures the proportion of actual positive cases that are correctly identified by a classification model. It is calculated as the number of true positive predictions divided by the sum of true positive and false negative predictions.
  • False discovery rate (FDR): The FDR is the ratio of all discoveries that are incorrect or reversed in direction. The FDR helps determine potential false discoveries. The smaller the FDR, the lower the error rate of the learned network structure, and the better the algorithm’s performance.
Using the SHD, TPR, and FDR to measure the benefits of DAG provides a comprehensive evaluation, assessing the network structure, classification accuracy, and performance of feature selection from different perspectives.
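A simple sketch of how these three metrics can be computed from binary adjacency matrices is given below; it follows one common convention in which a reversed edge counts once toward the SHD and as a false discovery, which may differ in detail from the exact evaluation code used in the paper.

```python
import numpy as np

def shd_tpr_fdr(A_true, A_est):
    """SHD, TPR, and FDR for two binary adjacency matrices."""
    A_true = np.asarray(A_true, dtype=bool)
    A_est = np.asarray(A_est, dtype=bool)
    true_pos = np.sum(A_est & A_true)                 # correctly oriented edges
    reversed_ = np.sum(A_est & ~A_true & A_true.T)    # estimated in the wrong direction
    extra = np.sum(A_est & ~A_true & ~A_true.T)       # edges absent from the true graph
    missing = np.sum(A_true & ~A_est & ~A_est.T)      # true edges not recovered either way
    shd = int(extra + missing + reversed_)
    tpr = true_pos / max(A_true.sum(), 1)
    fdr = (reversed_ + extra) / max(A_est.sum(), 1)
    return shd, tpr, fdr
```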
Baseline: On these datasets, the work compared CL-NOTEARS against several classical and state-of-the-art methods, including PC [10], NOTEARS [11], DAG-GNN [13], GraN-DAG [32], and DirectLiNGAM [44]. Implementations of these baseline methods are available in the pyAgrum, causal-learn, and gCastle packages. Since PC may output undirected edges, we follow MCSL [45] and count an undirected edge as correct if the true graph contains a directed edge between the same pair of nodes.

5.2. Comparison of Linear Models with Gaussian Noise

This section evaluates the proposed methods using LG models with equal-variance Gaussian noise and LiNGAM [46] data models, where the true DAGs are known to be identifiable [47]. The parameters $h \in \{1.5, 2, 3\}$ and $d \in \{10, 30, 50\}$ are used to generate the observed data.
In this section, we first present a comparison between CL-NOTEARS and state-of-the-art methods using synthetic datasets generated by linear SEMs.
Table 3 and Table 4 show the variations in FDR and SHD for the CL-NOTEARS model on ER- and SF-sampled datasets of the LG ANM, respectively. Figure 5 illustrates the TPR performance of CL-NOTEARS across these six classes of datasets. The following observations can be made: (1) PC performs poorly due to the dense graphs in the generated data. (2) Both NOTEARS and DAG-GNN exhibit high performance in causal discovery. (3) GraN-DAG performs worse because it uses a two-layer feedforward neural network to model causal relationships, which limits its ability to effectively learn the ideal causal structures. (4) When d = 10, the CL-NOTEARS algorithm does not show significant superiority over DAG-GNN, whereas when d = 50, CL-NOTEARS demonstrates strong advantages. Moreover, with the same number of nodes, CL-NOTEARS exhibits increasing superiority as the number of edges grows. This is because, as both nodes and edges increase, the network becomes more complex, introducing relatively more noise into the sampled datasets. CL-NOTEARS leverages the curriculum learning framework for noise reduction, allowing it to learn relatively accurate causal structures even on noisier datasets, thus demonstrating significant advantages. (5) For each node-size dataset, the FDR of the CL-NOTEARS algorithm stays close to zero, confirming that the CL framework effectively mitigates the impact of data noise and reduces the occurrence of erroneous edges.

5.3. Comparison of Linear Models with Non-Gaussian Noise

LN data models were evaluated with the noise type set to exponential and uniform distributions. For both cases, $h \in \{2, 3\}$ and $d \in \{10, 30\}$, generating 2000 observational instances per dataset. Four causal learning algorithms were compared: the traditional CB method PC, the GB methods DAG-GNN and GraN-DAG, and the NOTEARS algorithm.
For ER and SF graphs of SEM with LN noise, the experimental results are presented in Table 5 and Table 6. The results show that DAG-GNN significantly outperforms the other algorithms. This is due to its ability to capture complex interactions and nonlinear transformations among features in linear models under non-Gaussian noise. In contrast, the CL-NOTEARS model performs less effectively with non-Gaussian noise compared to Gaussian noise. This is because the initial partitioning of the CL-NOTEARS algorithm relies on Gaussian mixture models, making it sensitive to the clustering effects of non-Gaussian noise.

5.4. The Effect of Different Curriculum Difficulty Levels on the Learning Process

This work revisited Bengio's CL prototype model, discussed in the preceding sections, and explored various challenges in the curriculum mechanism by considering three cases:
  • Curriculum mechanism: a basic setup that progresses from easy to difficult, where learning begins with simple sample examples and gradually advances to more complex ones.
  • Anti-curriculum mechanism: under anti-curriculum conditions, two mechanisms based on the original CL-NOTEARS model architecture were designed: (1) initially learning from difficult sample tasks and then transitioning from simple to complex samples in subsequent learning phases and (2) learning progresses from complex samples to simple samples throughout the entire process.
  • Random curriculum mechanism: learning from samples of varying difficulties selected randomly.
The specific design is illustrated in Figure 6.
Based on these three curriculum mechanisms, the CL-NOTEARS algorithm was integrated and compared with the NOTEARS algorithm, which lacks a curriculum mechanism. Observational data were generated with $h \in \{2, 3\}$ and $d \in \{10, 30, 50\}$, with each dataset containing 2000 instances.
Table 7 shows the FDR and SHD performances for different curriculum mechanisms on the datasets, while Figure 7 visually demonstrates the causal relationships captured by the different curriculum mechanisms. Observations indicate: (1) curriculum ≥ anti-curriculum 1 ≥ anti-curriculum 2; the first anti-curriculum mechanism performs better than the second one, as it only changes the difficulty of the initial learning task, resulting in less disturbance from noise compared to the second mechanism. (2) The random curriculum mechanism exhibits more variable performance compared to the curriculum mechanism. Figure 7 shows that the errors associated with the random curriculum mechanism are significantly larger and essentially encompass the range of errors found with the curriculum mechanism. This variability is due to the random sampling approach, which introduces randomness into the learning process, occasionally leading to better outcomes than those achieved with the curriculum mechanism.

5.5. The Impact of the Number of Curriculum Stages on Algorithms

The number of curriculum stages in the algorithm influences the size of the learning sample set at different stages, which in turn affects the accuracy of separating the sample set at various noise levels. To investigate the effect of the number of curriculum stages on the algorithm, four settings were considered: $T \in \{5, 10, 15, 20\}$. Observational data were generated with $h \in \{2, 3\}$ and $d \in \{10, 30\}$. Each dataset, consisting of 2000 sample instances, was randomly sampled according to the ER scheme from a true DAG. Evaluation metrics included TPR and SHD. Each experimental result was averaged over five sets of data, and the results are summarized in Table 8.
It can be seen from Table 8 that the best performance occurs when the number of curriculum stages is 10. Based on the experimental data, we speculate that if the number of curriculum stages is too large, the sample division becomes too fine, making it difficult for the algorithm to differentiate between noise and signal in small samples, which results in poor learning performance. Conversely, if the number of curriculum stages is too small, the sample division is too coarse, the sample set learned in each stage is noisier, and the model fails to balance noise differences effectively. Therefore, when dividing curriculum stages, the size of each stage should be adjusted based on the size of the sample set for that stage.

5.6. Real Data

The Sachs dataset [48], which contains 11 nodes and 17 edges, is widely used in the study of graphical models. The expression levels of proteins and phospholipids in this dataset can be used to uncover the underlying protein signaling network. This dataset is a common benchmark in graph modeling; its experimentally annotated network is widely accepted by the biological community, and the data encompass both observational and interventional samples. The observational subset includes m = 853 samples and is used to discover causal structures. The work also uses Gaussian process regression to model the causal relationships and to calculate the score. Since the real network structure is sparse, even an empty graph attains an SHD of only 17. Detailed results of the estimated graphs constructed by the different algorithms, including the total number of edges, the number of correct edges, and the SHD, are shown in Table 9. The PC algorithm, which outputs many undirected edges, is not included in this comparison.
As shown in Table 9, the CL-NOTEARS algorithm demonstrates a significant advantage over other methods, achieving an optimal value of 13. Although the CL-NOTEARS algorithm identifies the same number of correct edges as the NOTEARS algorithm, it exhibits a substantial reduction in the SHD and a marked decrease in the number of erroneous edges. This improvement is attributed to the CL framework, which mitigates the impact of data noise on the algorithm, thereby significantly reducing the number of erroneous edges detected.

6. Conclusions

The work proposes a continuous optimization algorithm based on the CL framework, called CL-NOTEARS. This algorithm addresses the challenge of sample learning by evaluating the performance of the loss function on different sample sets, transforming the combinatorial optimization problem in discrete space into a numerical continuous optimization problem within various curriculum frameworks. Additionally, a dynamic weight adaptation method is employed for different curriculum stages to adjust the learning network’s structure. This reduces the impact of data noise on CSL algorithms and helps prevent the algorithm from falling into local optima. Extensive experiments on various synthetic datasets have demonstrated the effectiveness of the proposed approach.
In the future, we plan to further investigate batch adaptive mechanisms and noise-check evaluation methods in CSL to further reduce the impact of data noise on the algorithm and explore the universality of this mechanism for other algorithms.

Author Contributions

Conceptualization, H.H. and Y.Z.; methodology, K.L.; software, L.L.; validation, X.L., K.L., and H.Z.; formal analysis, K.X.; investigation, Y.Z.; resources, H.H.; data curation, K.L.; writing—original draft preparation, K.L.; writing—review and editing, Y.Z.; visualization, K.X.; supervision, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data and source code for CL-NOTEARS can be anonymously accessed at the following location: https://github.com/moonastar/CL_Notears.git (accessed on 12 June 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BN	Bayesian Network
CL	Curriculum Learning
DAG	Directed Acyclic Graph
CSL	Causal Structure Learning
CB	Constraint-Based
SB	Score-Based
GB	Gradient-Based
CI	Conditional Independence
SEM	Structural Equation Modeling
KCI	Kernel-based Conditional Independence test
PQN	Proximal Quasi-Newton
LS	Least Squares
TPR	True Positive Rate
FDR	False Discovery Rate
SHD	Structural Hamming Distance
LG	Linear Gaussian
LN	Linear Non-Gaussian
ER	Erdős–Rényi
SF	Scale-Free

References

  1. Squires, C. Causal Structure Learning: A Combinatorial Perspective. Found. Comput. Math. 2023, 23, 1781–1815. [Google Scholar] [CrossRef] [PubMed]
  2. Zhou, F.; He, K.; Ni, Y. Causal discovery with heterogeneous observational data. In Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, Eindhoven, The Netherlands, 1–5 August 2022; Volume 180, pp. 2383–2393. [Google Scholar]
  3. Wang, L.; Chignell, M.; Jiang, H.; Lokuge, S.; Mason, G.; Fotinos, K.; Katzman, M. Discovering the Causal Structure of the Hamilton Rating Scale for Depression Using Causal Discovery. In Proceedings of the 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), Athens, Greece, 27–30 July 2021; pp. 1–4. [Google Scholar] [CrossRef]
  4. Wan, C.X. Financial causal sentence recognition based on BERT-CNN text classification. J. Supercomput. 2022, 78, 6503–6527. [Google Scholar] [CrossRef]
  5. Umesh Kumar Lilhore, M.P.; Algarni, A.D. Hybrid Model for Detection of Cervical Cancer Using Causal Analysis and Machine Learning Techniques. Comput. Math. Methods Med. 2022, 2022, 4688327. [Google Scholar] [CrossRef]
  6. Xu, J.; Wang, Q.; Zhou, J.; Zhou, H.; Chen, J. Improved Bayesian network-based for fault diagnosis of air conditioner system. Int. J. Metrol. Qual. Eng. 2023, 14, 10. [Google Scholar] [CrossRef]
  7. Qin, W.; Zhang, H.; Hong, R.; Lim, E.P.; Sun, Q. Causal Interventional Training for Image Recognition. IEEE Trans. Multimed. 2023, 25, 1033–1044. [Google Scholar] [CrossRef]
  8. Chickering, D.M. Learning Bayesian networks is NP-complete. Learn. Data Artif. Intell. Stat. V 1996, 112, 121–130. [Google Scholar] [CrossRef]
  9. Neuberg, L.G. CAUSALITY: MODELS, REASONING, AND INFERENCE, by Judea Pearl, Cambridge University Press, 2000. Econom. Theory 2003, 19, 675–685. [Google Scholar] [CrossRef]
  10. Spirtes, P.; Glymour, C.; Scheines, R. Causation, Prediction, and Search; The MIT Press: Cambridge, MA, USA, 2001. [Google Scholar] [CrossRef]
  11. Zheng, X.; Aragam, B.; Ravikumar, P.; Xing, E.P. DAGs with NO TEARS: Continuous optimization for structure learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Red Hook, NY, USA, 2–8 December 2018; pp. 9492–9503. [Google Scholar]
  12. Ng, I.; Zhu, S.; Chen, Z.; Fang, Z. A Graph Autoencoder Approach to Causal Structure Learning. arXiv 2019, arXiv:1911.07420. [Google Scholar]
  13. Yu, Y.; Chen, J.; Gao, T.; Yu, M. DAG-GNN: DAG Structure Learning with Graph Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Proceedings of Machine Learning Research. Chaudhuri, K., Salakhutdinov, R., Eds.; Curran Associates, Inc.: Mount Kisco, NY, USA, 2019; Volume 97, pp. 7154–7163. [Google Scholar]
  14. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML’09, New York, NY, USA, 14–18 June 2009; pp. 41–48. [Google Scholar] [CrossRef]
  15. Platanios, E.A.; Stretcu, O.; Neubig, G.; Poczos, B.; Mitchell, T. Competence-based Curriculum Learning for Neural Machine Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 1162–1172. [Google Scholar] [CrossRef]
  16. Tsamardinos, I.; Aliferis, C.; Statnikov, A. Algorithms for Large Scale Markov Blanket Discovery. FLAIRS 2003, 2, 376–381. [Google Scholar]
  17. Spirtes, P.; Glymour, C.; Scheines, R. From probability to causality. Philos. Stud. 1991, 64, 1–36. [Google Scholar] [CrossRef]
  18. Li, Y.; Yang, Y.; Zhu, X.; Yang, W. From probability to causality. Wuhan Univ. J. Nat. Sci. 2015, 20, 214–220. [Google Scholar] [CrossRef]
  19. Colombo, D.; Maathuis, M.H. Order-Independent Constraint-Based Causal Structure Learning. J. Mach. Learn. Res. 2014, 15, 3741–3782. [Google Scholar]
  20. Le, T.D.; Hoang, T.; Li, J.; Liu, L.; Liu, H.; Hu, S. A fast PC algorithm for high dimensional causal discovery with multi-core PCs. IEEE/ACM Trans. Comput. Biol. Bioinform. 2016, 16, 1483–1495. [Google Scholar] [CrossRef]
  21. Sun, X.; Janzing, D.; Schölkopf, B.; Fukumizu, K. A kernel-based causal learning algorithm. In Proceedings of the 24th International Conference on Machine Learning, ICML’07, Corvallis, OR, USA, 20–24 June 2007; pp. 855–862. [Google Scholar] [CrossRef]
  22. Zhang, K.; Peters, J.; Janzing, D.; Schölkopf, B. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI’11, Arlington, Barcelona, Spain, 14–17 July 2011; pp. 804–813. [Google Scholar]
  23. David Heckerman, D.G.; Chickering, D.M. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn. 1995, 20, 197–243. [Google Scholar] [CrossRef]
  24. Hiramatsu, K.; Matsumiya, Y.; Kitada, S. Introduction of Suitable Stock-recruitment Relationship by a Comparison of Statistical Models. Fish. Sci. 1994, 60, 411–414. [Google Scholar] [CrossRef]
  25. Bouckaert, R.R. Probalistic Network Construction Using the Minimum Description Length Principle. In Proceedings of the European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty, ECSQARU ’93, Granada, Spain, 8–10 November 1993; pp. 41–48. [Google Scholar]
  26. Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
  27. Gheisari, S.; Meybodi, M. BNC-PSO: Structure learning of Bayesian networks by Particle Swarm Optimization. Inf. Sci. 2016, 348, 272–289. [Google Scholar] [CrossRef]
  28. Wang, T.; Yang, J. A heuristic method for learning Bayesian networks using discrete particle swarm optimization. Knowl. Inf. Syst. 2010, 24, 269–281. [Google Scholar] [CrossRef]
  29. Daly, R.; Shen, Q. Learning Bayesian network equivalence classes with Ant Colony optimization. J. Artif. Int. Res. 2009, 35, 391–447. [Google Scholar] [CrossRef]
  30. Yang, C.; Gao, H.; Yang, X.; Huang, S.; Kan, Y.; Liu, J. BnBeeEpi: An Approach of Epistasis Mining Based on Artificial Bee Colony Algorithm Optimizing Bayesian Network. In Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019; pp. 232–239. [Google Scholar] [CrossRef]
  31. Chickering, D.M. Optimal structure identification with greedy search. J. Mach. Learn. Res. 2003, 3, 507–554. [Google Scholar] [CrossRef]
  32. Lachapelle, S.; Brouillard, P.; Deleu, T.; Lacoste-Julien, S. Gradient-Based Neural DAG Learning. arXiv 2019, arXiv:1906.02226. [Google Scholar]
  33. Zhu, S.; Ng, I.; Chen, Z. Causal Discovery with Reinforcement Learning. arXiv 2019, arXiv:1906.04477. [Google Scholar]
  34. Peng, L.; Zhang, J. An entropy weighting mixture model for subspace clustering of high-dimensional data. Pattern Recognit. Lett. 2011, 32, 1154–1161. [Google Scholar] [CrossRef]
  35. Wang, X.; Chen, Y.; Zhu, W. A Survey on Curriculum Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4555–4576. [Google Scholar] [CrossRef] [PubMed]
  36. Weinshall, D.; Cohen, G.; Amir, D. Curriculum Learning by Transfer Learning: Theory and Experiments with Deep Networks. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Proceedings of Machine Learning Research. Dy, J., Krause, A., Eds.; International Machine Learning Society (IMLS): Cambridge, MA, USA, 2018; Volume 80, pp. 5238–5246. [Google Scholar] [CrossRef]
  37. Gong, T.; Zhao, Q.; Meng, D.; Xu, Z. Why Curriculum Learning & Self-Paced Learning Work in Big/Noisy Data: A Theoretical Perspective. Big Data Inf. Anal. 2017, 1, 111–127. [Google Scholar]
  38. van de Geer, S.; Bühlmann, P. 0 -penalized maximum likelihood for sparse directed acyclic graphs. Ann. Stat. 2013, 41, 536–567. [Google Scholar] [CrossRef]
  39. Aragam, B.; Amini, A.A.; Zhou, Q. Learning Directed Acyclic Graphs with Penalized Neighbourhood Regression. arXiv 2015, arXiv:1511.08963. [Google Scholar]
  40. Loh, P.L.; Bühlmann, P. High-dimensional learning of linear causal networks via inverse covariance estimation. J. Mach. Learn. Res. 2014, 15, 3065–3105. [Google Scholar]
  41. Byrd, R.H.; Lu, P.; Nocedal, J.; Zhu, C. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM J. Sci. Comput. 1995, 16, 1190–1208. [Google Scholar] [CrossRef]
  42. Zhong, K.; Yen, I.E.H.; Dhillon, I.S.; Ravikumar, P. Proximal quasi-Newton for computationally intensive 1-regularized M-estimators. In Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2, NIPS’14, Cambridge, MA, USA, 8–13 December 2014; pp. 2375–2383. [Google Scholar]
  43. Barabási, A.L. Scale-Free Networks: A Decade and Beyond. Science 2009, 325, 412–413. [Google Scholar] [CrossRef] [PubMed]
  44. Shimizu, S.; Inazumi, T.; Sogawa, Y.; Hyvärinen, A.; Kawahara, Y.; Washio, T.; Hoyer, P.O.; Bollen, K. DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model. J. Mach. Learn. Res. 2011, 12, 1225–1248. [Google Scholar]
  45. Ng, I.; Zhu, S.; Fang, Z.; Li, H.; Chen, Z.; Wang, J. Masked Gradient-Based Causal Structure Learning. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM), Hartford, CT, USA, 28–30 April 2022; pp. 424–432. [Google Scholar] [CrossRef]
  46. Peters, J.; Bühlmann, P. Identifiability of Gaussian structural equation models with equal error variances. Biometrika 2013, 101, 219–228. [Google Scholar] [CrossRef]
  47. Shimizu, S.; Hoyer, P.O.; Hyvärinen, A.; Kerminen, A. A Linear Non-Gaussian Acyclic Model for Causal Discovery. J. Mach. Learn. Res. 2006, 7, 2003–2030. [Google Scholar]
  48. Sachs, K.; Perez, O.D.; Pe’er, D.; Lauffenburger, D.A.; Nolan, G.P. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science 2005, 308, 523–529. [Google Scholar] [CrossRef]
Figure 1. The overall framework of the CL-NOTEARS model mainly consists of three parts: the initial data clustering design, curriculum design, and training model design.
Figure 2. CL-NOTEARS model framework.
Figure 3. Data classification based on similarity: the dataset is divided into n candidate datasets by the Gaussian mixed kernel clustering model.
Figure 4. Dynamic iterative update process of cause and effect diagrams.
Figure 5. TPR values on a Gaussian linear dataset on an ER graph.
Figure 6. The design of the curriculum difficulty in the sample examples.
Figure 7. Dynamic iterative update process of cause and effect diagrams.
Table 1. Symbols and descriptions.
Symbol | Description
G = (V, E) | A DAG, where V is the set of nodes/features and E is the set of edges.
V | V = {v_1, v_2, ..., v_d} is the set of nodes (features).
E | E = {e_1, e_2, ..., e_m} is the set of edges.
d, m | The number of nodes (features) and edges in G, respectively.
D | D = {D_1, D_2, ..., D_s} is the learning instance/sample dataset.
s | The number of samples/instances in D.
X | X = (X_1, X_2, ..., X_d) is a random vector.
X_i | The random variable associated with node v_i.
Pa(X_i) | The parent nodes of X_i.
W | The weighted adjacency matrix.
W^t | The weighted adjacency matrix at the t-th stage.
T | The number of curriculum stages.
R | The set of real numbers.
A(W) | A(W) ∈ {0, 1}^{d×d} is the adjacency matrix of the directed graph G(W).
λ | The regularization parameter for the ℓ1-regularization term.
z | z = (z_1, z_2, ..., z_d) is an arbitrary noise vector.
h(W) | h(W) = tr(e^{W∘W}) - d = 0 is the single smooth equality (acyclicity) constraint.
n | The number of Gaussian components/clusters.
μ, σ² | The mean and variance of a Gaussian distribution.
Θ, Φ | Θ = {μ, σ²} and Φ = {β_i, V_i, σ_i², g_i}, 1 ≤ i ≤ n, are the parameters.
β_i, V_i, Σ_i | The mixture coefficient, mean vector, and covariance matrix, respectively, of the i-th component.
D_k^j | The j-th feature of sample k.
g_ij | The relevance of feature j to cluster i.
V_ij | The mean of feature j for the i-th cluster.
μ_k | μ_k = (μ_1k, μ_2k, ..., μ_nk) is a vector composed of fuzzy membership values.
C | C = {C_1, C_2, ..., C_T} is the set of curriculum/training criteria.
C_t | The curriculum/training criterion at the t-th stage.
H(·), O(·), P(·) | The entropy, weight, and distribution, respectively.
X_{C_t} | The curriculum samples for the t-th stage.
X_{H_t}^i | The i-th candidate sample of the t-th stage candidate set.
G_t | The DAG learned in the t-th stage.
ρ, α | The penalty parameter and the estimate of the Lagrange multiplier.
γ | The probability threshold.
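Two of the Table 1 quantities are easy to misread when flattened, so here is a small sketch, under the usual NOTEARS-style conventions, of how h(W) and A(W) can be computed. This is illustrative code, not the authors' implementation; the edge-weight tolerance is an assumption.

```python
# Sketch of the Table 1 quantities h(W) and A(W).
# h(W) = tr(exp(W ∘ W)) - d is the smooth acyclicity constraint used by
# NOTEARS-style methods; it equals 0 iff the graph of W is a DAG.
import numpy as np
from scipy.linalg import expm

def acyclicity(W: np.ndarray) -> float:
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)      # W * W is the Hadamard product

def adjacency(W: np.ndarray, tol: float = 1e-8) -> np.ndarray:
    """A(W) in {0,1}^{d x d}: 1 where an edge weight is non-negligible."""
    return (np.abs(W) > tol).astype(int)

if __name__ == "__main__":
    W_dag = np.array([[0.0, 1.5], [0.0, 0.0]])   # 1 -> 2 only: acyclic
    W_cyc = np.array([[0.0, 1.5], [0.7, 0.0]])   # 1 <-> 2: contains a cycle
    print(acyclicity(W_dag), acyclicity(W_cyc))  # ~0.0 vs. > 0
    print(adjacency(W_dag))
```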
Table 2. Summary of the synthetic and real datasets.
RQ | Dataset | Data Generation Process | Features | Edges | Instances
RQ1 | ER-1.5 | LG/LN | 10/30/50 | 15/45/75 | 1000
RQ1 | ER-2 | LG/LN | 10/30/50 | 20/60/100 | 1000
RQ1 | ER-3 | LG/LN | 10/30/50 | 30/90/150 | 1000
RQ1 | SF-1.5 | LG/LN | 10/30/50 | 15/45/75 | 1000
RQ1 | SF-2 | LG/LN | 10/30/50 | 20/60/100 | 1000
RQ1 | SF-3 | LG/LN | 10/30/50 | 30/90/150 | 1000
RQ2 | ER-2 | LG | 10/30/50 | 20/60/100 | 1000
RQ2 | ER-3 | LG | 10/30/50 | 30/90/150 | 1000
RQ2 | SF-2 | LG | 10/30/50 | 20/60/100 | 1000
RQ2 | SF-3 | LG | 10/30/50 | 30/90/150 | 1000
RQ3 | Sachs | Real | 11 | 17 | 853
ER and SF: graph sampling schemes; instances: number of samples needed for learning.
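The ER-k/SF-k settings in Table 2 follow the common synthetic benchmark recipe: sample a DAG with roughly k·d edges, then draw 1000 samples from a linear SEM. The sketch below is an approximation of that recipe for the Gaussian-noise (LG) case; the networkx generators and the edge-weight range (0.5, 2.0) are assumptions, not the authors' exact script.

```python
# Sketch of the Table 2 data-generation process: sample an ER or SF DAG with
# roughly k*d edges, then draw s samples from a linear Gaussian SEM X = XW + Z.
# Assumptions: networkx generators and the weight range are stand-ins.
import networkx as nx
import numpy as np

def random_dag(d: int, k: float, graph_type: str, rng) -> np.ndarray:
    if graph_type == "ER":
        g = nx.gnm_random_graph(d, int(k * d), seed=int(rng.integers(10**6)))
    else:  # "SF": scale-free via preferential attachment (edge count approximate)
        g = nx.barabasi_albert_graph(d, max(1, int(round(k))),
                                     seed=int(rng.integers(10**6)))
    B = np.triu(nx.to_numpy_array(g), 1)          # orient edges low -> high index: a DAG
    signs = rng.choice([-1.0, 1.0], size=B.shape)
    weights = rng.uniform(0.5, 2.0, size=B.shape)
    return B * signs * weights                    # weighted adjacency matrix W

def sample_linear_gaussian(W: np.ndarray, s: int, rng) -> np.ndarray:
    d = W.shape[0]
    X = np.zeros((s, d))
    for j in range(d):                            # index order is a topological order
        X[:, j] = X @ W[:, j] + rng.normal(size=s)   # Gaussian noise (LG setting)
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = random_dag(d=10, k=2, graph_type="ER", rng=rng)   # ER-2 with 10 nodes
    X = sample_linear_gaussian(W, s=1000, rng=rng)
    print(W.shape, X.shape)
```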
Table 3. FDR and SHD on ER graphs with linear SEMs, Gaussian noise, and 10/30/50 nodes.
Nodes | Algorithm | ER-2 FDR (↓) | ER-2 SHD (↓) | ER-3 FDR (↓) | ER-3 SHD (↓)
d = 10 | PC | 0.35 ± 0.11 | 9.60 ± 2.19 | 0.49 ± 0.15 | 19.80 ± 7.66
d = 10 | NOTEARS | 0.06 ± 0.09 | 3.20 ± 2.68 | 0.05 ± 0.07 | 7.40 ± 2.67
d = 10 | DAG-GNN | 0.06 ± 0.08 | 1.80 ± 2.95 | 0.08 ± 0.07 | 3.20 ± 2.86
d = 10 | Gran-DAG | 0.29 ± 0.08 | 13.80 ± 3.11 | 0.30 ± 0.07 | 19.00 ± 1.87
d = 10 | DirectLiNGAM | 0.48 ± 0.19 | 15.40 ± 8.45 | 0.49 ± 0.14 | 21.20 ± 7.63
d = 10 | Ours | 0.00 ± 0.00 | 1.80 ± 1.30 | 0.07 ± 0.03 | 5.60 ± 2.07
d = 30 | PC | 0.49 ± 0.13 | 50.60 ± 14.87 | 0.61 ± 0.07 | 97.00 ± 10.27
d = 30 | NOTEARS | 0.03 ± 0.03 | 5.00 ± 1.87 | 0.14 ± 0.10 | 26.60 ± 18.01
d = 30 | DAG-GNN | 0.12 ± 0.03 | 8.80 ± 3.03 | 0.17 ± 0.11 | 25.80 ± 16.39
d = 30 | Gran-DAG | 0.38 ± 0.14 | 53.20 ± 12.38 | 0.66 ± 0.05 | 104.60 ± 13.90
d = 30 | DirectLiNGAM | 0.71 ± 0.09 | 72.40 ± 16.40 | 0.69 ± 0.04 | 130.20 ± 16.47
d = 30 | Ours | 0.00 ± 0.01 | 2.60 ± 0.89 | 0.02 ± 0.02 | 9.00 ± 2.83
d = 50 | PC | 0.42 ± 0.14 | 65.80 ± 28.42 | 0.64 ± 0.09 | 169.60 ± 19.82
d = 50 | NOTEARS | 0.08 ± 0.07 | 16.00 ± 9.17 | 0.11 ± 0.09 | 34.80 ± 22.87
d = 50 | DAG-GNN | 0.09 ± 0.07 | 16.00 ± 13.69 | 0.12 ± 0.03 | 32.20 ± 10.13
d = 50 | Gran-DAG | 0.49 ± 0.17 | 106.00 ± 21.71 | 0.71 ± 0.05 | 194.40 ± 28.16
d = 50 | DirectLiNGAM | 0.70 ± 0.07 | 149.60 ± 51.16 | 0.77 ± 0.06 | 254.60 ± 23.27
d = 50 | Ours | 0.02 ± 0.03 | 8.00 ± 4.41 | 0.03 ± 0.02 | 14.40 ± 7.02
The bold values represent the optimal values within the methods, with the values following ± indicating the standard deviation of the method's performance across five iterations of training.
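For readers unfamiliar with the metrics in Tables 3–9, the sketch below computes TPR, FDR, and SHD from binary adjacency matrices using definitions that are standard in the causal-discovery literature (reversed edges count against FDR and SHD). It is a generic sketch rather than the exact evaluation script used in this paper.

```python
# Sketch of the evaluation metrics in Tables 3-9, computed from binary
# adjacency matrices (entry 1 = directed edge). Generic definitions, not the
# authors' evaluation code.
import numpy as np

def structure_metrics(A_true: np.ndarray, A_est: np.ndarray) -> dict:
    tp = np.sum((A_est == 1) & (A_true == 1))                     # correctly oriented edges
    fp = np.sum((A_est == 1) & (A_true == 0) & (A_true.T == 0))   # extra edges
    rev = np.sum((A_est == 1) & (A_true == 0) & (A_true.T == 1))  # reversed edges
    missing = np.sum((A_true == 1) & (A_est == 0) & (A_est.T == 0))
    pred = max(int(np.sum(A_est)), 1)
    true_edges = max(int(np.sum(A_true)), 1)
    return {
        "TPR": tp / true_edges,              # higher is better (↑)
        "FDR": (fp + rev) / pred,            # lower is better (↓)
        "SHD": int(fp + missing + rev),      # lower is better (↓)
    }

if __name__ == "__main__":
    A_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
    A_est = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]])           # one edge reversed
    print(structure_metrics(A_true, A_est))  # TPR 0.5, FDR 0.5, SHD 1
```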
Table 4. FDR and SHD on SF graphs with linear SEMs, Gaussian noise, and 10/30/50 nodes.
Nodes | Algorithm | SF-2 FDR (↓) | SF-2 SHD (↓) | SF-3 FDR (↓) | SF-3 SHD (↓)
d = 10 | PC | 0.34 ± 0.14 | 9.00 ± 3.55 | 0.50 ± 0.10 | 18.00 ± 3.24
d = 10 | NOTEARS | 0.07 ± 0.07 | 0.60 ± 1.14 | 0.11 ± 0.11 | 7.40 ± 5.13
d = 10 | DAG-GNN | 0.02 ± 0.03 | 0.60 ± 0.89 | 0.06 ± 0.07 | 2.60 ± 2.70
d = 10 | Gran-DAG | 0.26 ± 0.19 | 13.20 ± 2.17 | 0.22 ± 0.18 | 16.20 ± 3.96
d = 10 | DirectLiNGAM | 0.58 ± 0.28 | 16.00 ± 8.25 | 0.60 ± 0.04 | 20.80 ± 1.10
d = 10 | Ours | 0.00 ± 0.00 | 0.40 ± 0.55 | 0.10 ± 0.08 | 6.80 ± 3.56
d = 30 | PC | 0.54 ± 0.09 | 52.00 ± 6.43 | 0.60 ± 0.07 | 84.40 ± 6.73
d = 30 | NOTEARS | 0.01 ± 0.01 | 2.00 ± 1.22 | 0.09 ± 0.11 | 16.40 ± 15.26
d = 30 | DAG-GNN | 0.19 ± 0.10 | 18.80 ± 10.85 | 0.18 ± 0.05 | 23.80 ± 6.14
d = 30 | Gran-DAG | 0.34 ± 0.21 | 52.20 ± 8.04 | 0.46 ± 0.18 | 80.20 ± 10.18
d = 30 | DirectLiNGAM | 0.66 ± 0.07 | 70.80 ± 8.93 | 0.71 ± 0.08 | 120.60 ± 15.53
d = 30 | Ours | 0.01 ± 0.01 | 1.20 ± 1.22 | 0.08 ± 0.08 | 13.00 ± 11.55
d = 50 | PC | 0.52 ± 0.11 | 98.10 ± 41.26 | 0.65 ± 0.10 | 154.40 ± 18.39
d = 50 | NOTEARS | 0.01 ± 0.01 | 3.00 ± 1.58 | 0.01 ± 0.01 | 4.20 ± 4.15
d = 50 | DAG-GNN | 0.23 ± 0.10 | 38.60 ± 15.61 | 0.20 ± 0.10 | 55.40 ± 23.09
d = 50 | Gran-DAG | 0.42 ± 0.16 | 92.60 ± 8.76 | 0.60 ± 0.23 | 162.40 ± 25.59
d = 50 | DirectLiNGAM | 0.69 ± 0.02 | 140.75 ± 2.87 | 0.76 ± 0.07 | 238.20 ± 30.07
d = 50 | Ours | 0.01 ± 0.01 | 1.40 ± 1.14 | 0.00 ± 0.00 | 3.80 ± 2.77
The bold values represent the optimal values within the methods. The values following ± indicate the standard deviation of the method's performance across five iterations of training.
Table 5. TPR and SHD on ER/SF graphs with linear SEMs, exponential noise, and 10/30 nodes.
Nodes | Algorithm | ER-2 TPR (↑) | ER-2 SHD (↓) | ER-3 TPR (↑) | ER-3 SHD (↓)
d = 10 | PC | 0.59 ± 0.20 | 12.60 ± 9.94 | 0.35 ± 0.09 | 21.80 ± 4.82
d = 10 | NOTEARS | 0.89 ± 0.11 | 2.40 ± 2.07 | 0.79 ± 0.04 | 7.20 ± 0.84
d = 10 | DAG-GNN | 0.94 ± 0.10 | 3.60 ± 5.13 | 0.94 ± 0.06 | 5.20 ± 3.71
d = 10 | Gran-DAG | 0.22 ± 0.15 | 13.40 ± 6.66 | 0.11 ± 0.08 | 24.20 ± 5.64
d = 10 | Ours | 0.86 ± 0.08 | 3.00 ± 1.22 | 0.73 ± 0.09 | 9.40 ± 2.88
d = 30 | PC | 0.51 ± 0.08 | 47.00 ± 11.77 | 0.27 ± 0.03 | 90.80 ± 8.23
d = 30 | NOTEARS | 0.91 ± 0.30 | 7.60 ± 3.29 | 0.89 ± 0.05 | 12.00 ± 7.59
d = 30 | DAG-GNN | 0.94 ± 0.05 | 16.60 ± 13.52 | 0.88 ± 0.06 | 42.00 ± 26.41
d = 30 | Gran-DAG | 0.11 ± 0.04 | 53.80 ± 8.89 | 0.15 ± 0.05 | 83.80 ± 7.29
d = 30 | Ours | 0.94 ± 0.05 | 5.40 ± 4.51 | 0.91 ± 0.04 | 10.40 ± 5.27
Nodes | Algorithm | SF-2 TPR (↑) | SF-2 SHD (↓) | SF-3 TPR (↑) | SF-3 SHD (↓)
d = 10 | PC | 0.61 ± 0.17 | 8.40 ± 2.79 | 0.41 ± 0.14 | 16.80 ± 0.35
d = 10 | NOTEARS | 0.85 ± 0.22 | 3.80 ± 5.50 | 0.82 ± 0.11 | 6.20 ± 3.12
d = 10 | DAG-GNN | 0.98 ± 0.03 | 0.20 ± 0.45 | 0.97 ± 0.04 | 1.60 ± 2.08
d = 10 | Gran-DAG | 0.44 ± 0.14 | 12.20 ± 3.56 | 0.35 ± 0.13 | 17.20 ± 1.64
d = 10 | Ours | 0.91 ± 0.06 | 1.80 ± 1.40 | 0.83 ± 0.07 | 4.20 ± 1.64
d = 30 | PC | 0.53 ± 0.12 | 43.00 ± 13.56 | 0.35 ± 0.06 | 76.60 ± 8.14
d = 30 | NOTEARS | 0.97 ± 0.02 | 1.40 ± 1.14 | 0.93 ± 0.07 | 8.60 ± 10.45
d = 30 | DAG-GNN | 0.87 ± 0.08 | 15.00 ± 23.10 | 0.84 ± 0.11 | 39.20 ± 19.87
d = 30 | Gran-DAG | 0.31 ± 0.07 | 49.40 ± 6.47 | 0.22 ± 0.04 | 89.40 ± 15.54
d = 30 | Ours | 0.98 ± 0.02 | 1.20 ± 1.10 | 0.95 ± 0.01 | 5.40 ± 1.67
The bold values represent the optimal values within the methods, with the values following ± indicating the standard deviation of the method's performance across five iterations of training.
Table 6. TPR and SHD on ER/SF graphs with linear SEMs, uniform noise, and 10/30 nodes.
Nodes | Algorithm | ER-2 TPR (↑) | ER-2 SHD (↓) | ER-3 TPR (↑) | ER-3 SHD (↓)
d = 10 | PC | 0.48 ± 0.20 | 16.20 ± 6.50 | 0.44 ± 0.15 | 19.20 ± 6.10
d = 10 | NOTEARS | 0.61 ± 0.11 | 8.80 ± 2.86 | 0.48 ± 0.06 | 17.00 ± 2.65
d = 10 | DAG-GNN | 0.92 ± 0.11 | 1.80 ± 2.49 | 0.86 ± 0.12 | 6.00 ± 4.31
d = 10 | Gran-DAG | 0.48 ± 0.18 | 10.40 ± 2.50 | 0.35 ± 0.20 | 21.60 ± 7.16
d = 10 | Ours | 0.68 ± 0.08 | 8.00 ± 1.22 | 0.51 ± 0.05 | 15.00 ± 2.00
d = 30 | PC | 0.50 ± 0.13 | 48.80 ± 16.12 | 0.27 ± 0.06 | 92.00 ± 10.75
d = 30 | NOTEARS | 0.76 ± 0.05 | 14.80 ± 3.27 | 0.61 ± 0.06 | 48.80 ± 7.26
d = 30 | DAG-GNN | 0.91 ± 0.06 | 11.20 ± 5.40 | 0.94 ± 0.04 | 16.00 ± 10.77
d = 30 | Gran-DAG | 0.28 ± 0.08 | 48.40 ± 12.68 | 0.22 ± 0.11 | 95.60 ± 30.14
d = 30 | Ours | 0.80 ± 0.03 | 13.60 ± 0.81 | 0.73 ± 0.04 | 30.20 ± 4.87
Nodes | Algorithm | SF-2 TPR (↑) | SF-2 SHD (↓) | SF-3 TPR (↑) | SF-3 SHD (↓)
d = 10 | PC | 0.58 ± 0.05 | 9.20 ± 0.84 | 0.39 ± 0.09 | 17.40 ± 1.95
d = 10 | NOTEARS | 0.76 ± 0.18 | 4.00 ± 3.09 | 0.76 ± 0.09 | 6.40 ± 2.19
d = 10 | DAG-GNN | 0.86 ± 0.14 | 4.40 ± 4.39 | 0.98 ± 0.04 | 0.80 ± 1.4
d = 10 | Gran-DAG | 0.44 ± 0.26 | 11.20 ± 4.44 | 0.30 ± 0.09 | 20.40 ± 3.29
d = 10 | Ours | 0.85 ± 0.08 | 3.00 ± 1.87 | 0.78 ± 0.11 | 6.00 ± 2.23
d = 30 | PC | 0.52 ± 0.11 | 44.80 ± 13.95 | 0.31 ± 0.07 | 79.80 ± 8.41
d = 30 | NOTEARS | 0.82 ± 0.12 | 12.40 ± 9.34 | 0.79 ± 0.08 | 21.40 ± 8.56
d = 30 | DAG-GNN | 0.91 ± 0.09 | 11.40 ± 10.36 | 0.84 ± 0.12 | 33.00 ± 20.96
d = 30 | Gran-DAG | 0.25 ± 0.07 | 63.00 ± 15.87 | 0.21 ± 0.05 | 80.40 ± 16.02
d = 30 | Ours | 0.89 ± 0.02 | 6.60 ± 1.34 | 0.85 ± 0.07 | 14.80 ± 6.50
The bold values represent the optimal values within the methods, with the values following ± indicating the standard deviation of the method's performance across five iterations of training.
Table 7. FDR and SHD of different curriculum mechanisms on ER/SF graphs with linear SEMs and 10/30/50 nodes.
Nodes | Mechanism | ER-2 FDR (↓) | ER-2 SHD (↓) | ER-3 FDR (↓) | ER-3 SHD (↓)
d = 10 | Curriculum | 0.03 ± 0.05 | 3.00 ± 2.31 | 0.07 ± 0.03 | 6.80 ± 2.49
d = 10 | Anti-curriculum 1 | 0.07 ± 0.07 | 3.80 ± 2.32 | 0.06 ± 0.06 | 7.00 ± 2.34
d = 10 | Anti-curriculum 2 | 0.12 ± 0.11 | 3.80 ± 3.12 | 0.13 ± 0.05 | 10.60 ± 2.51
d = 10 | Random curriculum | 0.05 ± 0.07 | 3.20 ± 3.12 | 0.06 ± 0.04 | 5.60 ± 2.97
d = 10 | No curriculum | 0.06 ± 0.09 | 3.20 ± 2.68 | 0.05 ± 0.07 | 7.40 ± 2.67
d = 30 | Curriculum | 0.00 ± 0.01 | 2.60 ± 0.89 | 0.02 ± 0.02 | 9.00 ± 2.83
d = 30 | Anti-curriculum 1 | 0.05 ± 0.02 | 8.60 ± 3.5 | 0.11 ± 0.06 | 23.40 ± 9.72
d = 30 | Anti-curriculum 2 | 0.14 ± 0.11 | 16.80 ± 11.8 | 0.13 ± 0.08 | 26.00 ± 11.58
d = 30 | Random curriculum | 0.09 ± 0.14 | 12.60 ± 15.32 | 0.13 ± 0.12 | 22.40 ± 18.02
d = 30 | No curriculum | 0.03 ± 0.03 | 5.00 ± 1.87 | 0.14 ± 0.10 | 26.60 ± 18.01
d = 50 | Curriculum | 0.02 ± 0.03 | 8.00 ± 4.41 | 0.03 ± 0.02 | 14.40 ± 7.02
d = 50 | Anti-curriculum 1 | 0.04 ± 0.02 | 11.20 ± 4.08 | 0.14 ± 0.12 | 45.40 ± 31.30
d = 50 | Anti-curriculum 2 | 0.09 ± 0.09 | 18.80 ± 15.34 | 0.11 ± 0.06 | 35.80 ± 15.89
d = 50 | Random curriculum | 0.04 ± 0.04 | 7.60 ± 5.98 | 0.02 ± 0.02 | 11.00 ± 5.24
d = 50 | No curriculum | 0.08 ± 0.07 | 16.00 ± 9.17 | 0.11 ± 0.09 | 34.80 ± 22.87
Nodes | Mechanism | SF-2 FDR (↓) | SF-2 SHD (↓) | SF-3 FDR (↓) | SF-3 SHD (↓)
d = 10 | Curriculum | 0.00 ± 0.00 | 0.40 ± 0.55 | 0.10 ± 0.08 | 6.80 ± 3.56
d = 10 | Anti-curriculum 1 | 0.02 ± 0.05 | 0.80 ± 1.30 | 0.10 ± 0.15 | 5.80 ± 5.54
d = 10 | Anti-curriculum 2 | 0.01 ± 0.02 | 1.40 ± 1.67 | 0.07 ± 0.13 | 5.00 ± 4.53
d = 10 | Random curriculum | 0.01 ± 0.02 | 0.60 ± 1.34 | 0.05 ± 0.05 | 4.40 ± 2.97
d = 10 | No curriculum | 0.07 ± 0.07 | 0.60 ± 1.14 | 0.11 ± 0.11 | 7.40 ± 5.13
d = 30 | Curriculum | 0.01 ± 0.01 | 1.20 ± 1.22 | 0.08 ± 0.08 | 13.00 ± 11.55
d = 30 | Anti-curriculum 1 | 0.01 ± 0.02 | 1.20 ± 1.78 | 0.06 ± 0.05 | 9.80 ± 6.46
d = 30 | Anti-curriculum 2 | 0.01 ± 0.02 | 1.20 ± 1.78 | 0.10 ± 0.06 | 16.00 ± 8.71
d = 30 | Random curriculum | 0.01 ± 0.01 | 0.40 ± 0.55 | 0.02 ± 0.01 | 4.00 ± 2.00
d = 30 | No curriculum | 0.01 ± 0.01 | 2.00 ± 1.22 | 0.09 ± 0.11 | 16.40 ± 15.26
d = 50 | Curriculum | 0.01 ± 0.01 | 0.98 ± 0.02 | 0.00 ± 0.00 | 3.80 ± 2.77
d = 50 | Anti-curriculum 1 | 0.03 ± 0.06 | 5.80 ± 11.86 | 0.04 ± 0.05 | 12.60 ± 9.53
d = 50 | Anti-curriculum 2 | 0.05 ± 0.09 | 6.60 ± 11.99 | 0.04 ± 0.04 | 12.00 ± 9.00
d = 50 | Random curriculum | 0.01 ± 0.01 | 1.40 ± 1.52 | 0.01 ± 0.01 | 6.00 ± 4.06
d = 50 | No curriculum | 0.01 ± 0.01 | 3.00 ± 1.58 | 0.01 ± 0.01 | 4.20 ± 4.15
The bold values represent the optimal values within the methods, with the values following ± indicating the standard deviation of the method's performance across five iterations of training. 1 Description of anti-curriculum method 1. 2 Description of anti-curriculum method 2.
Table 8. TPR and SHD of different curriculum stages on an ER graph with linear SEM and 10/30 nodes.
Nodes | Curriculum Stages | ER-2 TPR (↑) | ER-2 SHD (↓) | ER-3 TPR (↑) | ER-3 SHD (↓)
d = 10 | T = 5 | 0.85 ± 0.09 | 4.00 ± 2.45 | 0.72 ± 0.12 | 10.00 ± 4.47
d = 10 | T = 10 | 0.90 ± 0.07 | 3.00 ± 2.31 | 0.81 ± 0.06 | 6.80 ± 2.49
d = 10 | T = 15 | 0.76 ± 0.08 | 6.20 ± 2.86 | 0.63 ± 0.09 | 12.60 ± 4.04
d = 10 | T = 20 | 0.82 ± 0.10 | 5.40 ± 3.21 | 0.63 ± 0.10 | 13.00 ± 3.39
d = 30 | T = 5 | 0.87 ± 0.05 | 10.20 ± 4.21 | 0.81 ± 0.12 | 30.20 ± 20.22
d = 30 | T = 10 | 0.97 ± 0.06 | 6.40 ± 5.68 | 0.96 ± 0.02 | 12.20 ± 3.42
d = 30 | T = 15 | 0.84 ± 0.09 | 17.40 ± 11.65 | 0.80 ± 0.08 | 32.00 ± 14.97
d = 30 | T = 20 | 0.85 ± 0.05 | 15.20 ± 7.53 | 0.81 ± 0.13 | 29.40 ± 17.34
The bold values represent the optimal values within the methods, with the values following ± indicating the standard deviation of the method's performance across five iterations of training.
Table 9. Comparison of different algorithms on the Sachs network.
Metric | NOTEARS | DAG-GNN | Gran-DAG | CL-NOTEARS
Total Edges | 20 | 15 | 10 | 13
Correct Edges | 6 | 6 | 5 | 6
SHD | 19 | 16 | 13 | 13
The bold values represent the optimal values within the methods.