Article

Constructing a Gene Regulatory Network Based on a Nonhomogeneous Dynamic Bayesian Network

1
College of Artificial Intelligence and Big Data, Hefei University, Hefei 230031, China
2
Anhui Province Urban Infrastructure Big Data Technology Application Engineering Laboratory, Hefei University, Hefei 230031, China
*
Author to whom correspondence should be addressed.
Electronics 2022, 11(18), 2936; https://doi.org/10.3390/electronics11182936
Submission received: 5 August 2022 / Revised: 12 September 2022 / Accepted: 13 September 2022 / Published: 16 September 2022
(This article belongs to the Special Issue Pattern Recognition and Machine Learning Applications)

Abstract

Since the regulatory relationships between genes are usually non-stationary, the homogeneity assumption cannot be satisfied when modeling with dynamic Bayesian networks (DBNs), and it should therefore be relaxed. Various methods that combine multiple changepoint processes with DBNs have been proposed for this purpose. When a non-homogeneous dynamic Bayesian network is used to model a gene regulatory network, the changepoints of the gene data must be inferred. Based on this analysis, this paper first proposes a data-based birth move (ED-birth move), which uses the information in the data to infer the changepoints: the greater the Euclidean distance between the means of the data in the two components on either side of a data point, the more likely that data point is to be selected as a new changepoint. In brief, the probability of selecting a changepoint is proportional to the Euclidean distance between the means on both sides of the data point. Furthermore, an improved Markov chain Monte Carlo (MCMC) method is proposed that uses Pearson correlation coefficients (PCCs) to sample the parent node set: the larger the absolute value of the Pearson correlation coefficient between the data of two nodes, the more likely the node is to be sampled as a parent. Compared with other classical models on Saccharomyces cerevisiae data, synthetic data, RAF pathway data, and Arabidopsis data, the PCCs-ED-DBN proposed in this paper improves the accuracy of gene network reconstruction and further improves the convergence and stability of the modeling process.

1. Introduction

With the rapidly decreasing cost of genome sequencing technology and the accelerated acquisition of biological experimental data, one of the key challenges in systems biology is to deduce gene regulatory networks from gene expression data. Gene regulatory networks are of great significance in biological development, maintenance of homeostasis, and the occurrence and development of diseases [1,2,3,4]. Although a large number of known regulatory relationships in organisms have been documented in various databases, they are still far from the number of interactions and complex relationships that actually exist in biological systems. Experiments are generally able to measure the abundance of elements, but it is difficult to directly discover the complex relationships among them [5]. Structural learning of dynamic Bayesian networks (DBNs) plays an important role in the construction of gene regulatory networks [6]. The traditional (homogeneous) dynamic Bayesian network models assume the network parameters to stay constant across time. This can lead to biased results and wrong conclusions, as cellular regulatory processes can change over time. Although there have been various methods to relax the homogeneity assumption of the undirected graphical model [7,8], relaxing this restriction in DBN is still a popular research topic [9,10,11,12]. Various authors have proposed a combination of multiple changepoint processes and DBNs to relax the homogeneity assumption of DBNs [13,14]. Each time series segment is delimited by two changepoints. The parameters of DBNs are node specific, so the conditional probability of the parameters varies from segment to segment. In certain regularity conditions, the outstanding advantage of the above methods is the parameter independence and conjugacy of the prior; the parameters can be integrated out in the closed form in the likelihood. 
Therefore, the inference task reduces to sampling the network structure and the number and locations of the changepoints from the posterior distribution, which can be carried out with reversible jump Markov chain Monte Carlo (RJMCMC) [15,16,17,18].
Early on, the Bayesian regression model (BR-DBN) proposed by Lèbre et al. became the basic probabilistic model for non-homogeneous DBNs [19]. However, the disadvantage of the BR-DBN model is that the network structure varies from segment to segment, which leads to overfitting and inflated inference uncertainty for short time series. Grzegorczyk et al. proposed several variants of BR-DBN in which the network structure is shared by all segments and only the parameters change [20,21,22]. However, these combinations of changepoint processes with DBNs have a limitation: data points from different segments must be assigned to different components. If the true allocation scheme for eight time points is [11223311], a changepoint model can only approximate it as [11223344]. Unlike CPS-DBN with changepoints, MIX-DBN can assign data points to components without this restriction [23,24]. However, it ignores the temporal order of the data points, even though adjacent data points are more likely to be assigned to the same component than distant ones.
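The allocation restriction can be illustrated with a short sketch. The helper names `labels_from_changepoints` and `uses_recurring_component` are hypothetical and are not taken from any of the cited models; they only show that an allocation induced by changepoints never returns to an earlier component, whereas a free (mixture-style) allocation may.

```python
def labels_from_changepoints(n_points, changepoints):
    """Label time points 1..n_points by segment.

    `changepoints` lists the (1-based) positions at which a new segment
    starts. Every changepoint opens a fresh component, so labels never repeat.
    """
    starts = set(changepoints)
    labels, current = [], 1
    for position in range(1, n_points + 1):
        if position in starts:
            current += 1
        labels.append(current)
    return labels


def uses_recurring_component(labels):
    """Return True if some component reappears after having been left."""
    seen, previous = set(), None
    for label in labels:
        if label != previous and label in seen:
            return True
        seen.add(label)
        previous = label
    return False


changepoint_labels = labels_from_changepoints(8, [3, 5, 7])
free_labels = [1, 1, 2, 2, 3, 3, 1, 1]  # allocation a mixture model may produce
print(changepoint_labels)                     # [1, 1, 2, 2, 3, 3, 4, 4]
print(uses_recurring_component(free_labels))  # True
```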
Subsequently, Grzegorczyk et al. proposed a non-homogeneous DBN with a hidden Markov model between changepoints (HMM-DBN). HMM-DBN considers the temporal order of the data points without restricting how they are allocated [25]. First, HMM-DBN introduces two pairs of new complementary MCMC moves (the Gibbs sampler move and the complementary inclusion move) to improve the allocation sampler; second, it assumes a first-order hidden Markov dependency structure for changepoint inference. Building on HMM-DBN, this paper uses the latent prior knowledge hidden in the data to improve the accuracy of changepoint and network inference, and thereby the accuracy and stability of the network structure and the convergence of the model.
Based on the above, this paper first explores the relationship between a time point being a changepoint and the time-series data of its component. We assume that the larger the Euclidean distance between the data means on both sides of a time point, the more likely that point is to be a changepoint. This idea is applied to the changepoint birth move to make the proposed changepoints more plausible, and RJMCMC is then used to sample the allocation of the time points. Second, the relationship between the Pearson correlation coefficient of the node data and the presence of an edge between nodes is examined; we assume that the higher the Pearson correlation coefficient of the node data, the more likely an edge exists. Together, these changes improve the accuracy and stability of the network structure and the convergence of the model.

2. Bayesian Regression Model

A non-homogeneous DBN is an extension of a DBN in processing nonstationary time series data. A traditional dynamic Bayesian network generally contains two critical assumptions [26].
(1) First-order Markov hypothesis: Assuming that the edges between nodes cannot span a time component, the value of a node at time $t$ is only related to the value of other nodes at that time and the node at time $t-1$.
(2) Homogeneity hypothesis: The stable distribution of time data points generated by a homogeneous Markov chain requires that the model’s structure and parameters cannot change over time.
However, in practice, most time-series data are nonstationary, and the homogeneity assumption described above cannot be satisfied. Traditional dynamic Bayesian networks therefore cannot model nonstationary data adequately. To deal with nonstationary time series data, a changepoint process is added to the traditional dynamic Bayesian network. That is, $m$ changepoints are added to a time series of length $T$, dividing it into $k$ components.
The hierarchical structure of the non-homogeneous DBN proposed in this paper is shown in Figure 1, and the regression equation is:
$$y_{g,k} = X_{\pi_g,k}^{T} w_{g,k} + \varepsilon_{g,k} \tag{1}$$
This holds in each component $k$ of the non-homogeneous dynamic Bayesian network, where $g = 1, \ldots, N$ and $N$ is the number of nodes; $y_{g,k}$ is the observation vector of node $g$ assigned to component $k$; $w_{g,k}$ contains the regression coefficients of the regression model; $\pi_{g,k}$ is the set of parent nodes of node $g$ in component $k$; $X_{\pi_g,k}^{T}$ is the observation matrix of the parent node set of node $g$ in component $k$; and $\varepsilon_{g,k}$ is the noise of the regression model, with mean 0 and variance $\sigma_g^2$ (Table 1 shows the actual meaning of each symbol). Then, the regression model likelihood is:
$$P(y_{g,k} \mid X_{\pi_g,k}, w_{g,k}, \sigma_g^2) = \mathcal{N}(y_{g,k} \mid X_{\pi_g,k}^{T} w_{g,k}, \sigma_g^2 I) \tag{2}$$
A Gaussian prior is imposed on $w_{g,k}$, and conjugate gamma priors are imposed on $\sigma_g^{-2}$ and $\delta_g^{-1}$:
$$P(w_{g,k} \mid \sigma_g^2, \delta_g) = \mathcal{N}(w_{g,k} \mid 0, \delta_g \sigma_g^2 I) \tag{3}$$
$$P(\delta_g^{-1} \mid A_\delta, B_\delta) = \mathrm{Gam}(\delta_g^{-1} \mid A_\delta, B_\delta) = \frac{B_\delta^{A_\delta}}{\Gamma(A_\delta)} \left(\delta_g^{-1}\right)^{A_\delta - 1} e^{-B_\delta \delta_g^{-1}} \tag{4}$$
$$P(\sigma_g^{-2} \mid A_\sigma, B_\sigma) = \mathrm{Gam}(\sigma_g^{-2} \mid A_\sigma, B_\sigma) = \frac{B_\sigma^{A_\sigma}}{\Gamma(A_\sigma)} \left(\sigma_g^{-2}\right)^{A_\sigma - 1} e^{-B_\sigma \sigma_g^{-2}} \tag{5}$$
The level-2 hyperparameters $A_\delta, B_\delta, A_\sigma, B_\sigma$ are fixed. Then, samples can be generated from the posterior distribution $P(w_{g,1}, \ldots, w_{g,K_g}, \delta_g, \sigma_g^2 \mid D)$ through Gibbs sampling [22].
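To make the hierarchy concrete, the following minimal sketch draws one component from the priors and the regression likelihood described above. The function name `sample_segment` and the default hyperparameter values are assumptions chosen for illustration, not settings reported in this paper. Note that Python's `random.gammavariate` is parameterized by shape and scale, whereas the gamma densities above use a rate, so the scale is the reciprocal of $B$.

```python
import random


def sample_segment(regressor_rows, a_delta=2.0, b_delta=0.2,
                   a_sigma=1.0, b_sigma=1.0, rng=random):
    """Draw (delta, sigma2, w, y) for one component.

    regressor_rows[j][t] is the value of regressor j at time t
    (a row of ones can be included as an intercept).
    Hyperparameter defaults are illustrative assumptions.
    """
    # Gamma priors on the inverse quantities; scale = 1 / rate.
    delta = 1.0 / rng.gammavariate(a_delta, 1.0 / b_delta)
    sigma2 = 1.0 / rng.gammavariate(a_sigma, 1.0 / b_sigma)
    # w ~ N(0, delta * sigma2 * I): independent coefficients.
    w = [rng.gauss(0.0, (delta * sigma2) ** 0.5) for _ in regressor_rows]
    # y_t = sum_j w_j * x_{j,t} + noise with variance sigma2.
    n_time = len(regressor_rows[0])
    y = []
    for t in range(n_time):
        mean_t = sum(w[j] * regressor_rows[j][t] for j in range(len(w)))
        y.append(mean_t + rng.gauss(0.0, sigma2 ** 0.5))
    return delta, sigma2, w, y


rows = [[1.0] * 5, [0.1, 0.2, 0.3, 0.4, 0.5]]  # intercept row plus one parent
delta, sigma2, w, y = sample_segment(rows, rng=random.Random(1))
```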
Assuming that the time data points have been allocated, $V_g$ is known. Then, the conditional distributions of $\delta_g^{-1}$ and $w_{g,k}$ can be obtained as:
$$\delta_g^{-1} \mid w_{g,k}, \sigma_g^2 \sim \mathrm{Gam}\left(A_\delta + \frac{K_g(|\pi_g| + 1)}{2},\; B_\delta + \frac{1}{2}\sigma_g^{-2} \sum_{k=1}^{K_g} w_{g,k}^{T} w_{g,k}\right) \tag{6}$$
$$w_{g,k} \mid y_{g,k}, X_{\pi_g,k}, \sigma_g^2, \delta_g \sim \mathcal{N}\left(\left(\delta_g^{-1} I + X_{\pi_g,k} X_{\pi_g,k}^{T}\right)^{-1} X_{\pi_g,k}\, y_{g,k},\; \sigma_g^2 \left(\delta_g^{-1} I + X_{\pi_g,k} X_{\pi_g,k}^{T}\right)^{-1}\right) \tag{7}$$
where $K_g$ is the maximum number of states allocated by node $g$, and $|\pi_g|$ is the number of parent nodes of node $g$. The inverse variance hyperparameter $\sigma_g^{-2}$ can also be sampled from its conditional distribution:
$$\sigma_g^{-2} \mid y_g, V_g, X_{\pi_g,k}, \delta_g \sim \mathrm{Gam}\left(A_\sigma + \frac{T-1}{2},\; B_\sigma + \frac{1}{2}\sum_{k=1}^{K_g} y_{g,k}^{T} \left(I + \delta_g X_{\pi_g,k}^{T} X_{\pi_g,k}\right)^{-1} y_{g,k}\right) \tag{8}$$
Keeping the parent node set $\pi_g$ and the component vector $V_g$ fixed, MCMC sampling according to Equation (9) and Algorithm 1 can generate samples from the posterior distribution, using Equations (6)–(8) to update the hyperparameters.
$$P(w_{g,k}, \delta_g, \sigma_g^2 \mid D) \propto \prod_{g} P(\delta_g)\, P(\sigma_g^2) \prod_{k} P(w_{g,k} \mid \delta_g, \sigma_g^2)\, P(y_{g,k} \mid X_{\pi_g,k}, \sigma_g^2, w_{g,k}) \tag{9}$$
Algorithm 1: Pseudocode for updating the signal-to-noise ratio hyperparameter $\delta_g$
For each node $g = 1, \ldots, N$
Input: $\pi_g$, $V_g$, $\delta_g^{i-1}$
Output: $\delta_g^{i}$
MCMC iteration: $i-1 \to i$
① Sample a concrete variance hyperparameter $\sigma_g^{i}$ from $\sigma_g^{-2} \mid y_g, V_g, X_{\pi_g,k}, \delta_g^{i-1}$, Equation (8).
② Sample the regression parameter vectors $w_{g,k}^{i}$ from $w_{g,k} \mid y_{g,k}, X_{\pi_g,k}, \sigma_g^{i}, \delta_g^{i-1}$, Equation (7), and set $w_{g,k}^{i} = \{w_{g,1}^{i}, \ldots, w_{g,K_g}^{i}\}$.
③ Sample a new SNR hyperparameter $\delta_g^{i}$ from $\delta_g^{-1} \mid w_{g,k}^{i}, \sigma_g^{i}$, Equation (6), and output $\delta_g^{i}$.
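The three sampling steps can be sketched for the simplest special case, a node with an empty parent set, so that each design matrix is a single row of ones (the intercept) and each $w_{g,k}$ is a scalar segment mean. This is an illustrative simplification under stated assumptions, not the implementation used in this paper: the function name `gibbs_sweep` and the default hyperparameter values are hypothetical, the number of observations `n_obs` plays the role of $T-1$, and the quadratic form in the inverse-variance update is evaluated with the Sherman–Morrison identity, $y^{T}(I + \delta \mathbf{1}\mathbf{1}^{T})^{-1} y = y^{T}y - \delta (\sum_t y_t)^2 / (1 + \delta n)$.

```python
import random


def gibbs_sweep(segments, delta, a_sigma=0.005, b_sigma=0.005,
                a_delta=2.0, b_delta=0.2, rng=random):
    """One sweep over (sigma2, w, delta) for a node without parents.

    segments[k] holds the observations assigned to component k.
    Hyperparameter defaults are illustrative assumptions.
    """
    n_obs = sum(len(y) for y in segments)

    # Step 1: sample the inverse variance (gamma with shape and rate).
    quad = 0.0
    for y in segments:
        total = sum(y)
        quad += sum(v * v for v in y) - delta * total ** 2 / (1.0 + delta * len(y))
    inv_var = rng.gammavariate(a_sigma + n_obs / 2.0, 1.0 / (b_sigma + quad / 2.0))
    sigma2 = 1.0 / inv_var

    # Step 2: sample each segment coefficient from its Gaussian conditional.
    weights = []
    for y in segments:
        precision = 1.0 / delta + len(y)          # delta^-1 + X X^T with X = ones
        mean = sum(y) / precision                 # (delta^-1 + X X^T)^-1 X y
        weights.append(rng.gauss(mean, (sigma2 / precision) ** 0.5))

    # Step 3: sample the SNR hyperparameter; |pi_g| = 0 here.
    shape = a_delta + len(segments) * (0 + 1) / 2.0
    rate = b_delta + sum(w * w for w in weights) / (2.0 * sigma2)
    new_delta = 1.0 / rng.gammavariate(shape, 1.0 / rate)
    return sigma2, weights, new_delta


rng = random.Random(0)
segments = [[0.1, -0.1, 0.05, -0.05] * 5, [5.1, 4.9, 5.05, 4.95] * 5]
delta = 1.0
for _ in range(200):
    sigma2, weights, delta = gibbs_sweep(segments, delta, rng=rng)
```

With parents present, the scalar operations become matrix operations on $X_{\pi_g,k}$, but the order of the three updates stays the same.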

3. PCCs-ED-DBN Model

The above inference of the SNR hyperparameter $\delta_g$ assumes that the network structure $M$ and the component vectors $V_g$ are fixed; in fact, these need to be inferred. In this section, the inference of the network structure and the component vectors is described in two parts. First, PCCs-ED-DBN infers the network structure $M$ based on the PCCs of the data points, assuming fixed component vectors. Second, PCCs-ED-DBN infers the component vectors $V_g$ based on the Euclidean distances of the data points.

3.1. Network Structure M Inference Based on PCCs of Data Points

When inferring the network structure, it is still assumed that the component vector $V_g$ is fixed, and the probability distribution of the network structure $M = \{\pi_1, \ldots, \pi_N\}$ is set as:
$$P(M) = \prod_{g=1}^{N} P(\pi_g) \tag{10}$$
Inferring the parent node set of each node $g$ yields the entire network structure. For each node, the conditional probability of its parent node set is:
$$P(\pi_g \mid D, V_g, \delta_g) \propto P(y_g, V_g \mid X_{\pi_g,k}, \delta_g) \tag{11}$$
According to Equation (12), Metropolis–Hastings (M-H) keeps $\delta_g$ and $V_g$ fixed and moves from the current parent node set $\pi_g^{i-1}$ to a new set $\pi_g^{\circ}$. The move is accepted with probability:
$$A(\pi_g^{i-1} \to \pi_g^{\circ}) = \min\left(1,\; \frac{P(y_g, V_g \mid X_{\pi_g^{\circ},k}, \delta_g)}{P(y_g, V_g \mid X_{\pi_g^{i-1},k}, \delta_g)} \times \frac{P(\pi_g^{\circ})}{P(\pi_g^{i-1})} \times \frac{|S(\pi_g^{i-1})|}{|S(\pi_g^{\circ})|}\right) \tag{12}$$
If accepted, set $\pi_g^{i} = \pi_g^{\circ}$; otherwise, $\pi_g^{i} = \pi_g^{i-1}$.
This paper introduces the Pearson correlation coefficient [27] to explore the causal relationship between nodes. $X_i, X_j$ represent nodes, and $\lambda$ represents the slack variable; in this paper, $\lambda = 1$. When the parent node is sampled by the Markov chain Monte Carlo sampling method, a node with a high Pearson correlation coefficient is more likely to be sampled. $S(\pi_g)$ is obtained according to Equation (13). Algorithm 2 describes the pseudocode of M-H sampling:
$$R(X_i, X_j) = \lambda\, \frac{\sum_{t=1}^{T} (X_{i,t} - \bar{X}_i)(X_{j,t} - \bar{X}_j)}{\sqrt{\sum_{t=1}^{T} (X_{i,t} - \bar{X}_i)^2}\, \sqrt{\sum_{t=1}^{T} (X_{j,t} - \bar{X}_j)^2}} \tag{13}$$
Algorithm 2: Pseudocode for updating the parent node sets $\pi_g$
For each node $g = 1, \ldots, N$
Input: $\delta_g$, $V_g$, $\pi_g^{i-1}$
Output: $\pi_g^{i}$
MCMC iteration: $i-1 \to i$
 ① Get the system of parent sets $S(\pi_g^{i})$:
 Randomly select a node $X_j$, compute $R(X_g, X_j) = \lambda\, \frac{\sum_{t=1}^{T} (X_{g,t} - \bar{X}_g)(X_{j,t} - \bar{X}_j)}{\sqrt{\sum_{t=1}^{T} (X_{g,t} - \bar{X}_g)^2}\, \sqrt{\sum_{t=1}^{T} (X_{j,t} - \bar{X}_j)^2}}$, and draw $a = \mathrm{rand}(1)$,
   if $a < R(X_g, X_j)$: (i) add the node $X_j$ to $\pi_g^{i-1}$
   else: (ii) delete the node $X_j$ from $\pi_g^{i-1}$
   (iii) exchange a node $u \in \pi_g^{i-1}$ for a node $v \notin \pi_g^{i-1}$.
 Randomly select a new candidate parent set $\pi_g^{\circ}$ from $S(\pi_g^{i})$
 ② Accept the candidate with the Metropolis–Hastings acceptance probability, Equation (12). If accepted, set $\pi_g^{i} = \pi_g^{\circ}$; otherwise, set $\pi_g^{i} = \pi_g^{i-1}$. Output: $\pi_g^{i}$
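The correlation-guided part of step ① can be sketched as follows. The function names `pearson` and `propose_parent_set` are hypothetical, and the sketch makes two assumptions that go beyond the pseudocode: it uses the absolute value of the coefficient (as described in the abstract), and it treats a zero-variance series as having zero correlation. It covers only the add/delete proposal; the exchange move and the Metropolis–Hastings acceptance step are omitted.

```python
import math
import random


def pearson(x, y, lam=1.0):
    """Pearson correlation coefficient of two equal-length series,
    scaled by the slack variable lam (1 in this paper)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: constant series carry no correlation
    return lam * cov / (norm_x * norm_y)


def propose_parent_set(target, data, parents, rng=random):
    """Propose a new parent set for `target` by adding or deleting one node.

    data maps node names to their time series; parents is the current set.
    A randomly chosen node is added with probability |R| and removed otherwise.
    """
    candidates = sorted(name for name in data if name != target)
    node = rng.choice(candidates)
    r = abs(pearson(data[target], data[node]))
    proposal = set(parents)
    if rng.random() < r:
        proposal.add(node)
    else:
        proposal.discard(node)
    return proposal


series = {"g": [1.0, 2.0, 3.0, 4.0], "a": [2.0, 4.0, 6.0, 8.0]}
proposal = propose_parent_set("g", series, set(), rng=random.Random(0))
```

Strongly correlated nodes are therefore proposed as parents more often, while the acceptance step still decides whether the proposal is kept.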

3.2. Component Vector Vg Inference Based on Euclidean Distance of Data Points

In the above sampling process, the component vector $V_g$ is assumed to be fixed, but in practice $V_g$ needs to be sampled. Figure 2 shows a non-homogeneous dynamic Bayesian network with two changepoints that divide the series into three components, namely, $V_g = (1, 1, 2, 2, 3, 3)$. The network structure is assumed to be the same in all components, while the parameters differ.

3.2.1. Component Transition

The component transition of the time data point is determined by the birth move, death move, and inclusion and exclusion move of the changepoint. The following describes the component transition in detail, and Figure 3 gives a specific example.
Birth move: Randomly select a component k, randomly select one of the data points allocated to component k, and reallocate the data points allocated to component k to a known new component.
Death move: Randomly select two components, k = 1 and k = 3, and assign the data points of component k = 3 to component k = 1.
Inclusion and exclusion move: It is recommended to reassign the component vector entries $(V_g(3), V_g(4)) = (2, 2)$ to $k = 1$. This is because the surrounding time points $V_g(1), V_g(2)$ and $V_g(5), V_g(6), V_g(7)$ are all assigned to the state $k = 1$.
Therefore, if the potential prior knowledge in the data can be fully mined, the positions of the changepoints are more likely to be found accurately, that is, a correct component vector $V_g$ is inferred for node $g$, which ultimately improves the accuracy of the inferred network structure and the stability of the model.

3.2.2. Birth Move Based on the Euclidean Distance

The experimental results show that the Euclidean distance between the means on both sides of a changepoint is generally larger than that for a non-changepoint. Based on this finding, when the Euclidean distance between the means on both sides of a data point is large, that point may be a real changepoint. This paper therefore proposes an ED-birth move in which the probability of proposing a data point as a changepoint is proportional to the Euclidean distance between the means on both sides of that data point. Algorithm 3 shows the ED-birth move algorithm flow.
Algorithm 3: Pseudocode for changepoint birth move detection based on the Euclidean distance of data points
Input: The component vector $V_g$ of the current node $g$ and the maximum number of changepoints $k_{max}$
Output: $V_g$, $k_{max}$
① for $k_g \in V_g$
  for $k_0 \in k_g$
  $u = \mathrm{rand}(0,1)$, $d = \left| \frac{\sum_{i=1}^{k_0} y_{g,i}}{k_0} - \frac{\sum_{i=k_0}^{L} y_{g,i}}{L - k_0 + 1} \right|$
  if $u < d$
   $g\_k_0 = k_0$;
   break;
  end
 end
② Change the component of all data points with state $k_g$ after $g\_k_0$ to a new component $k_g^{new} = k_{max} + 1$, update $V_g$ and $k_{max}$, and calculate the acceptance rate $b_k$.
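The scan in step ① can be sketched as below. The names `mean_gap` and `ed_birth_position` are hypothetical, and the sketch rests on assumptions not fixed by the pseudocode: the series of a single node is one-dimensional (so the Euclidean distance reduces to an absolute difference), the two sides of a candidate position are taken as disjoint, and the values are assumed to be scaled so that the distance can be compared with a uniform draw on [0, 1].

```python
import random


def mean_gap(values, k0):
    """Absolute difference between the mean of the first k0 points and the
    mean of the remaining points (a one-dimensional Euclidean distance)."""
    left, right = values[:k0], values[k0:]
    return abs(sum(left) / len(left) - sum(right) / len(right))


def ed_birth_position(values, rng=random):
    """Scan candidate split positions in order and return the first one whose
    mean gap exceeds a uniform draw, or None if no position is accepted."""
    for k0 in range(1, len(values)):
        if rng.random() < mean_gap(values, k0):
            return k0
    return None


segment = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
gaps = [mean_gap(segment, k0) for k0 in range(1, len(segment))]
position = ed_birth_position(segment, rng=random.Random(0))
```

For this toy segment the gap is largest at the true split after the fourth point, so that position is always accepted if the scan reaches it; earlier positions can still be accepted with smaller probability, which is why the reversible jump acceptance step remains necessary.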
$b_k$, $d_k$, and $r_k$ represent the acceptance rates of the birth move, the death move, and the inclusion and exclusion move, respectively. The RJ-MCMC algorithm steps for updating the changepoints are shown in Algorithm 4.
Algorithm 4: Pseudocode of RJ-MCMC sampling of changepoints based on the Euclidean distance of data points
Input: The component vector $V_g$ of the current node $g$, the maximum number of changepoints $k_{max}$, and the network $M$
Output: $V_g$, $k_{max}$
  ① For each sampling process, calculate $b_k$, $d_k$, $r_k$ based on the current number of changepoints $k_{max}$
  ② Gibbs sampler move
  $A = \mathrm{rand}(0,1)$
  If $A < b_k$: birth move according to Algorithm 3
  If $A < d_k$: death move
  If $A < r_k$: inclusion and exclusion move
  ③ Output: $V_g$, $k_{max}$
The whole algorithm flow of the non-homogeneous DBN with multiple changepoints based on PCCs and the Euclidean distance of data points is shown in Algorithm 5.
Algorithm 5: MCMC sampling pseudocode for the PCCs-ED-DBN model
Input: Current state of the MCMC sampler: $M^{i-1}$, $K_g^{i-1}$, $V_g^{i-1}$, $\delta_g^{i-1}$
Output: New MCMC state: $M^{i}$, $K_g^{i}$, $V_g^{i}$, $\delta_g^{i}$
  ① Keep the current $M^{i-1}$, $V_g^{i-1}$ fixed, and update $\delta_g^{i-1}$ to $\delta_g^{i}$ according to Algorithm 1.
  ② Keep the current $V_g^{i-1}$ and $\delta_g^{i}$ fixed, and update $M^{i-1}$ to $M^{i}$ according to Algorithm 2.
  ③ Keep the current $\pi_g^{i}$, $K_g^{i-1}$, $\delta_g^{i}$ fixed, and update $V_g^{i-1}$ to $V_g^{i}$ according to Algorithm 4.

4. Empirical Results

4.1. Evaluation Standard

4.1.1. Convergence Evaluation Criteria

Assume that the number of MCMC iterations is $I$, that the burn-in length is $burn\_in$, and that $net_{n,j}^{i} = 1$ indicates that the edge $n \to j$ is present at iteration $i$; otherwise, $net_{n,j}^{i} = 0$. Perform $Q$ independent replicates of MCMC sampling, and draw a scatterplot with the $average\_edge\_scores_{n,j}$ values as the vertical axis and the $average\_edge\_scores_{n,j}$ values as the horizontal axis.
$$edge\_scores_{n,j}^{q} = \frac{\sum_{i=burn\_in+1}^{I} net_{n,j}^{i}}{I - burn\_in} \tag{14}$$
$$average\_edge\_scores_{n,j} = \frac{\sum_{q=1}^{Q} edge\_scores_{n,j}^{q}}{Q} \tag{15}$$
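As a minimal sketch of these two quantities, the functions below compute per-run edge scores after discarding the burn-in samples and then average them over independent runs. The function names and the list-of-adjacency-matrices data layout are assumptions made for illustration.

```python
def edge_scores(samples, burn_in):
    """Fraction of post-burn-in samples in which each edge is present.

    samples is a list of adjacency matrices (lists of 0/1 rows), one per
    MCMC iteration; the first burn_in samples are discarded.
    """
    kept = samples[burn_in:]
    n = len(kept[0])
    return [[sum(net[r][c] for net in kept) / len(kept) for c in range(n)]
            for r in range(n)]


def average_edge_scores(runs, burn_in):
    """Average the per-run edge scores over independent MCMC runs."""
    per_run = [edge_scores(samples, burn_in) for samples in runs]
    n = len(per_run[0])
    return [[sum(score[r][c] for score in per_run) / len(per_run)
             for c in range(n)] for r in range(n)]


empty = [[0, 0], [0, 0]]
edge = [[0, 1], [0, 0]]          # edge from node 0 to node 1
run_a = [empty, empty, edge, edge]
run_b = [empty, empty, edge, empty]
print(average_edge_scores([run_a, run_b], burn_in=2)[0][1])  # 0.75
```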

4.1.2. Network Structure Accuracy Evaluation Criteria

$M_{n,j} = 1$ indicates that there is an edge $n \to j$, while $M_{n,j} = 0$ indicates that there is no edge $n \to j$. Define $E_\xi$ as the set of all edges whose posterior probability $e_{n,j} \in [0, 1]$ exceeds the threshold $\xi$. Calculate the true positives $TP_\xi$, false positives $FP_\xi$, and false negatives $FN_\xi$ for each $E_\xi$. Plot a precision-recall (PR) curve with $P_\xi$ as the ordinate and $R_\xi$ as the abscissa. A larger area under the PR curve (PR-AUC) [28] indicates better network reconstruction accuracy.
$$P_\xi = \frac{TP_\xi}{TP_\xi + FP_\xi} \tag{16}$$
$$R_\xi = \frac{TP_\xi}{TP_\xi + FN_\xi} \tag{17}$$
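A short sketch of the precision and recall computation at a single threshold is given below. The function name `precision_recall` and the dictionary layout are hypothetical, and returning a precision of 1.0 when no edge exceeds the threshold is a convention assumed here, not one stated in this paper.

```python
def precision_recall(scores, true_edges, xi):
    """Precision and recall of the edge set whose score exceeds xi.

    scores maps (parent, child) pairs to posterior edge probabilities;
    true_edges is the set of edges in the reference network.
    """
    tp = fp = fn = 0
    for edge, score in scores.items():
        predicted = score > xi
        actual = edge in true_edges
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


scores = {("a", "b"): 0.9, ("b", "c"): 0.6, ("a", "c"): 0.2}
truth = {("a", "b"), ("a", "c")}
print(precision_recall(scores, truth, 0.5))  # (0.5, 0.5)
print(precision_recall(scores, truth, 0.8))  # (1.0, 0.5)
```

Sweeping the threshold over [0, 1] and collecting the resulting pairs gives the points of the PR curve.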

4.1.3. Criteria for Model Stability

Assume that the accuracy of the network structure obtained after $i$ MCMC iterations, denoted as $AUC_{i,p}$, can be calculated. Perform $P$ independent experiments to obtain different $AUC_{i,p}$, and then calculate the variance of all $AUC_{i,p}$, denoted as $V_i$. A smaller variance means that the network structures inferred in the independent experiments are similar, i.e., the model is more stable. Draw a variance iteration curve with $V_i$ as the ordinate and $i$ as the abscissa. The stability of the network structure can be measured by comparing the curves.
$$V_i = \frac{\sum_{p=1}^{P} AUC_{i,p}}{P} \tag{18}$$

4.2. Experimental Results

4.2.1. Saccharomyces Cerevisiae

The Saccharomyces cerevisiae data come from a small network of five gene nodes designed by Cantone et al. [29]. The authors measured the expression levels of these genes in vivo by real-time quantitative polymerase chain reaction over 37 time points. During the experiment, Cantone et al. changed the carbon source from galactose to glucose. There are 16 measurements in galactose and 21 measurements in glucose, and the observed value of each node $g$ is recorded. Because the washing step performed when the carbon source is changed introduces errors, the two first measurement values are removed to obtain a 5 × 35 data set. Figure 4 shows the network structure of Saccharomyces cerevisiae.
It can be seen from Figure 5 that when the number of MCMC iterations is 10,000, the edge scores simulated by 20 independent MCMC simulations are almost the same, and the convergence is almost reached. With the same number of iterations, the convergence of the PCCs-ED-DBN model is better than that of the HMM-DBN.
In the experiment, this paper follows the hyperparameter settings of Grzegorczyk et al. The number of MCMC iterations is set to 10,000, and the sampling result is saved at every iteration, so 10,000 network structures are obtained. One hundred independent MCMC runs yield 100 network structure accuracies, and their average is used as the final network structure accuracy (PR-AUC), as shown in Figure 6.
Figure 6 shows that the non-homogeneous DBNs (PCCs-ED-DBN, HMM-DBN, CPS-DBN, MIX-DBN) [23,24,25] achieve higher network reconstruction accuracy than the homogeneous DBN (HOM-DBN). The PR-AUC value of PCCs-ED-DBN is about 15% higher than that of HOM-DBN, and 12%, 6%, and 4% higher than those of the other non-homogeneous dynamic Bayesian networks (MIX-DBN, CPS-DBN, and HMM-DBN, respectively).
The homogeneous dynamic Bayesian network (HOM-DBN) follows the Markov assumption: during modeling, the regulatory network does not change with time, and the regulation strength obeys the same distribution throughout. However, when the living environment of the organism changes, it is unrealistic to assume that the distribution of the gene regulation strength remains unchanged. The non-homogeneous dynamic Bayesian networks (PCCs-ED-DBN, HMM-DBN, CPS-DBN, MIX-DBN) combine the DBN with multiple changepoint processes to construct a regulatory network with the same network structure but different parameter distributions in different components. In this way, the model better reflects the actual course of biological development, and its network reconstruction ability is better.
Figure 7a shows the network structure accuracy of PCCs-ED-DBN and HMM-DBN for different numbers of MCMC samples. Figure 7b compares the variances of the network structure accuracy, and Table 2 gives some specific numerical comparisons. Figure 7 and Table 2 show that PCCs-ED-DBN achieves better network structure accuracy than HMM-DBN. Moreover, for the same number of MCMC samples, the variance of the network structure accuracy inferred by PCCs-ED-DBN is smaller than that of HMM-DBN, so the model is more stable than HMM-DBN.
In addition, the ED-birth move is also applied to the globally coupled NH-DBN [21] and partially coupled EWC NH-DBN [30] models for comparative experiments. Again, this paper follows the hyperparameter settings of Grzegorczyk et al. The number of MCMC iterations is set to 20,000, and the sampling result is saved at every iteration, so 20,000 network structures are obtained. Five hundred independent MCMC runs yield 500 network structure accuracies, and their average is used as the final network structure accuracy (PR-AUC), as shown in Figure 8a.
Figure 8a shows that non-homogeneous dynamic Bayesian networks that use the ED-birth move obtain network structures of higher accuracy from MCMC sampling. The effect is more obvious for the EWC NH-DBN model, whereas the improvement in network structure accuracy (PR-AUC) for the globally coupled NH-DBN is not significant.
Figure 8b shows the network structure accuracy of the ED-birth move for different numbers of MCMC samples (globally coupled NH-DBN). Figure 8c compares the variances of the network structure accuracy, and Table 3 gives some specific numerical comparisons.
Figure 8 and Table 3 show that, in the globally coupled NH-DBN, the ED-birth move achieves better network structure accuracy than the birth move. For the same number of MCMC samples, the variance of the network structure accuracy inferred with the ED-birth move is smaller than that with the birth move, so the model is more stable.

4.2.2. Synthetic Yeast Data

This paper generated synthetic yeast data with $K = 4$ segments. Comparative experiments between HMM-DBN [25] and PCCs-ED-DBN are performed using this dataset.
We analyzed the experimental results of the synthetic yeast dataset under the HMM model. Figure 9a shows the average AUC score, and Figure 9b shows the change in the AUC difference as the data point increases. With the increase in data points, PCCs-ED-DBN has better results for the detection of changepoints.

4.2.3. Gene Regulatory Network in Arabidopsis

Plants are well-suited experimental systems to study the mechanistic basis of developmental dynamics, given that they are more amenable to in vivo manipulation than, for example, animals. Constructing the Arabidopsis gene regulatory network is an active research topic [31,32,33]. Figure 10 shows that the convergence achieved by the PCCs-ED-DBN model with 50,000 MCMC iterations is approximately the same as that achieved by the HMM-DBN model with 200,000 MCMC iterations. This means that, to achieve the same convergence, PCCs-ED-DBN saves more than half of the time overhead compared with HMM-DBN. Figure 11 shows the Arabidopsis gene regulatory network inferred using the PCCs-ED-DBN model, containing the edges with marginal probability greater than 0.5. Since the gene regulatory network of Arabidopsis has not been fully documented in the biological literature, the network construction accuracy cannot be calculated. However, known edges given in some biological literature are marked with bold lines in Figure 11 (GI→CCA1 [34], GI→TOC1 [34], ELF3→TOC1 [35], ELF3→CCA1 [35], ELF3→PRR9 [36], TOC1→LHY [37], LHY→TOC1 [37], ELF4→PRR9 [38]).

4.2.4. Simulated Data from the RAF Pathway

Figure 12 shows the RAF protein signaling pathway described by Sachs et al. [39], which consists of 11 proteins (pip3, plcg, pip2, pkc, p38, raf, pka, jnk, mek, erk, and akt); the edges represent protein interactions. Figure 13 shows the experimental comparison of network reconstruction accuracy on the dataset provided by Marco Grzegorczyk [25]. Compared with CPS-DBN and MIX-DBN, the PR-AUC value of PCCs-ED-DBN is improved significantly. However, in data 1, data 2, and data 3, the PR-AUC values of PCCs-ED-DBN were only 2%, 3%, and 4% higher than those of HMM-DBN, respectively. In data 4, the increase was more obvious, about 8%.

4.2.5. Time Overhead

Compared with HMM-DBN, PCCs-ED-DBN has improved network reconstruction accuracy, convergence, and stability, but this inevitably adds additional time overhead. Table 4 gives a comparison of the additional time overhead during the fourth part of the experiment. The simulation platform is ① Processor Intel Core i5-9500, CPU 3.0 GHz. ② Installed memory (RAM) 8 GB. ③ Hard disk: 1 TB. ④ Tool MATLAB R2018b.

5. Conclusions

This paper makes two improvements compared to the HMM-DBN model. First, the changepoint sampling method based on the Euclidean distance of data points proposed in this paper fully mines the prior knowledge between data points. Second, we explore the causal relationship between gene expression data and the Pearson correlation coefficient between genes and apply this relationship to the selection of candidate parent nodes. In addition, the advantages of the PCCs-ED-DBN can be described in detail from the following three aspects.
Network reconstruction accuracy:
On the Saccharomyces cerevisiae dataset, the PR-AUC value of PCCs-ED-DBN is about 15% higher than that of the homogeneous dynamic Bayesian network (HOM-DBN), and 12%, 6%, and 4% higher than those of the other non-homogeneous dynamic Bayesian networks (MIX-DBN, CPS-DBN, and HMM-DBN, respectively). On the four datasets of the RAF pathway, the PR-AUC value of PCCs-ED-DBN is more than 10% higher than that of MIX-DBN and CPS-DBN; compared with HMM-DBN, the improvement is only 2%, 3%, and 4% in data_1, data_2, and data_3, and 8% in data_4.
Convergence:
On the Saccharomyces cerevisiae data and the Arabidopsis data, PCCs-ED-DBN converges better than HMM-DBN, and the improvement is most obvious on the Arabidopsis data: the convergence of HMM-DBN with 200,000 MCMC iterations is basically the same as that of PCCs-ED-DBN with 50,000 MCMC iterations. Although PCCs-ED-DBN takes more time per iteration than HMM-DBN, it can still reduce the total time consumption by more than half.
Model stability:
The network reconstruction accuracy (PR-AUC) inferred in multiple independent MCMC simulations is experimentally verified, and the variance of PCCs-ED-DBN is smaller than that of HMM-DBN, which means that the model proposed in this paper is more stable. Finally, the ED-birth move proposed in this paper is applied to the coupled model (Globally Coupled NH-DBNs, EWC NH-DBNs) in the experiment, and the network reconstruction accuracy is also improved, but the improvement effect is not as good as that of the uncoupled model. This is because coupling parameters are added to the coupled model. Through the action of the coupling parameters, the regression parameters in the coupled components can influence each other, thereby adjusting the regression parameters in the components. This means that even if the component assignment deviates from the actual situation, it is still possible to infer regression parameters that are close to the actual situation.
This paper only proposes a method to find the changepoints using the Euclidean distance. In future work, we hope to exploit the underlying prior knowledge of the data more fully to infer the component vectors. The convergence of MCMC sampling is also a topic worthy of study. We hope that the methods explored in the future can improve the convergence of the model and allow the convergence to be proved mathematically.

Author Contributions

software, Q.Z. writing—original draft preparation, J.Z.; writing—review and editing, C.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the following grants: National Natural Science Foundation of China (General Program) 61772321, Natural Science Foundation of Hefei 2021035, Hefei University Graduate Innovation and Entrepreneurship Program (21YCXL25,21YCXL18).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ajmal, H.B.E.; Madden, M.G. Dynamic Bayesian Network Learning to Infer Sparse Models from Time Series Gene Expression Data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021. [Google Scholar] [CrossRef] [PubMed]
  2. Che, D.; Guo, S.; Jiang, Q.; Chen, L. PFBNet: A Priori-Fused Boosting Method for Gene Regulatory Network Inference. BMC Bioinform. 2020, 21, 308. [Google Scholar] [CrossRef] [PubMed]
  3. Shafiee Kamalabad, M.; Grzegorczyk, M. A New Bayesian Piecewise Linear Regression Model for Dynamic Network Reconstruction. BMC Bioinform. 2021, 22, 196. [Google Scholar] [CrossRef]
  4. Timmermann, T.; González, B.; Ruz, G.A. Reconstruction of a Gene Regulatory Network of the Induced Systemic Resistance Defense Response in Arabidopsis Using Boolean Networks. BMC Bioinform. 2020, 21, 142. [Google Scholar] [CrossRef] [PubMed]
  5. Zhao, M.; He, W.; Tang, J.; Zou, Q.; Guo, F. A Comprehensive Overview and Critical Evaluation of Gene Regulatory Network Inference Technologies. Brief. Bioinform. 2021, 22, bbab009. [Google Scholar] [CrossRef] [PubMed]
  6. Friedman, N.; Linial, M.; Nachman, I.; Pe’Er, D. Using Bayesian Networks to Analyze Expression Data. J. Comput. Biol. 2000, 7, 601–620. [Google Scholar] [CrossRef]
  7. Talih, M.; Hengartner, N. Structural Learning with Time-Varying Components: Tracking the Cross-Section of Financial Time Series. J. R. Stat. Soc. Ser. B 2005, 67, 321–341. [Google Scholar] [CrossRef]
  8. Xuan, X.; Murphy, K. Modeling Changing Dependency Structure in Multivariate Time Series. In Proceedings of the 24th International Conference on Machine Learning, New York, NY, USA, 20–24 June 2007. [Google Scholar]
  9. Lebre, S. Stochastic Process Analysis for Genomics and Dynamic Bayesian Networks Inference. Master’s Thesis, Université d’Evry-Val d’Essonne, Évry-Courcouronnes, France, 2007. [Google Scholar]
  10. Robinson, J.; Hartemink, A. Non-stationary Dynamic Bayesian Networks. In Advances in Neural Information Processing Systems 21 (NIPS 2008); Curran Associates Inc.: Red Hook, NY, USA, 2008. [Google Scholar]
  11. Robinson, J.W.; Hartemink, A.J.; Ghahramani, Z. Learning Non-Stationary Dynamic Bayesian Networks. J. Mach. Learn. Res. 2010, 11, 3647–3680. [Google Scholar]
  12. Kolar, M.; Song, L.; Xing, E. Sparsistent Learning of Varying-Coefficient Models with Structural Changes. Adv. Neural Inf. Processing Syst. 2009, 22, 1006–1014. [Google Scholar]
  13. Aderhold, A.; Husmeier, D.; Grzegorczyk, M. Statistical Inference of Regulatory Networks for Circadian Regulation. Stat. Appl. Genet. Mol. Biol. 2014, 13, 227–273. [Google Scholar] [CrossRef]
  14. Shafiee Kamalabad, M.; Heberle, A.M.; Thedieck, K.; Grzegorczyk, M. Partially Non-Homogeneous Dynamic Bayesian Networks Based on Bayesian Regression Models with Partitioned Design Matrices. Bioinformatics 2019, 35, 2108–2117. [Google Scholar] [CrossRef] [PubMed]
  15. Ahmed, A.; Xing, E.P. Recovering Time-Varying Networks of Dependencies in Social and Biological Studies. Proc. Natl. Acad. Sci. USA 2009, 106, 11878–11883. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Dong, M.; He, D. A Segmental Hidden Semi-Markov Model (HSMM)-Based Diagnostics and Prognostics Framework and Methodology. Mech. Syst. Signal Processing 2007, 21, 2248–2266. [Google Scholar] [CrossRef]
  17. Dondelinger, F.; Lebre, S.; Husmeier, D. Heterogeneous Continuous Dynamic Bayesian Networks with Flexible Structure and Inter-Time Segment Information Sharing. In International Conference on Machine Learning (ICML); Furnkranz, J., Joachims, T., Eds.; Omnipress: Haifa, Israel, 2010; pp. 303–310. [Google Scholar]
  18. Dondelinger, F.; Lèbre, S.; Husmeier, D. Non-Homogeneous Dynamic Bayesian Networks with Bayesian Regularization for Inferring Gene Regulatory Networks with Gradually Time-Varying Structure. Mach. Learn. 2013, 90, 191–230. [Google Scholar] [CrossRef]
  19. Lèbre, S.; Becq, J.; Devaux, F.; Stumpf, M.P.; Lelandais, G. Statistical Inference of the Time-Varying Structure of Gene-Regulation Networks. BMC Syst. Biol. 2010, 4, 130. [Google Scholar] [CrossRef]
  20. Grzegorczyk, M.; Husmeier, D. Non-Homogeneous Dynamic Bayesian Networks for Continuous Data. Mach. Learn. 2011, 83, 355–419. [Google Scholar] [CrossRef]
  21. Grzegorczyk, M.; Husmeier, D. Regularization of Non-Homogeneous Dynamic Bayesian Networks with Global Information-Coupling Based on Hierarchical Bayesian Models. Mach. Learn. 2013, 91, 105–154. [Google Scholar] [CrossRef]
  22. Grzegorczyk, M.; Husmeier, D. A Non-Homogeneous Dynamic Bayesian Network with Sequentially Coupled Interaction Parameters for Applications in Systems and Synthetic Biology. Stat. Appl. Genet. Mol. Biol. 2012, 11, 1–62. [Google Scholar] [CrossRef]
  23. Grzegorczyk, M.; Husmeier, D.; Edwards, K.D.; Ghazal, P.; Millar, A.J. Modelling Non-stationary Gene Regulatory Processes with a Non-homogeneous Bayesian Network and the Allocation Sampler. Bioinformatics 2008, 24, 2071–2078. [Google Scholar] [CrossRef]
  24. Grzegorczyk, M.; Husmeier, D. Modelling Non-stationary Gene Regulatory Processes with a Non-homogeneous Dynamic Bayesian Network and the Change Point Process. In Proceedings of the 6th International Workshop on Computational Systems Biology, Aarhus, Denmark, 10–12 June 2009. [Google Scholar]
  25. Grzegorczyk, M. A Non-homogeneous Dynamic Bayesian Network with a Hidden Markov Model Dependency Structure among the Temporal Data Points. Mach. Learn. 2016, 102, 155–207. [Google Scholar] [CrossRef]
  26. Grzegorczyk, M.; Husmeier, D. Non-stationary Continuous Dynamic Bayesian Networks. In Advances in Neural Information Processing Systems 22 (NIPS 2009); Curran Associates Inc.: Red Hook, NY, USA, 2009. [Google Scholar]
  27. Cohen, I.; Juang, Y.; Chen, J.; Benesty, J. Pearson Correlation Coefficient. In Noise Reduction in Speech Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–4. [Google Scholar]
  28. Davis, J.; Goadrich, M. The Relationship between Precision-Recall and Roc Curves. In Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, 25–29 June 2006. [Google Scholar]
  29. Cantone, I.; Marucci, L.; Iorio, F.; Ricci, M.A.; Belcastro, V.; Bansal, M.; Santini, S.; di Bernardo, M.; di Bernardo, D.; Cosma, M.P. A Yeast Synthetic Network for In Vivo Assessment of Reverse-Engineering and Modeling Approaches. Cell 2009, 137, 172–181. [Google Scholar] [CrossRef] [PubMed]
  30. Shafiee Kamalabad, M.; Grzegorczyk, M. Non-Homogeneous Dynamic Bayesian Networks with Edge-Wise Sequentially Coupled Parameters. Bioinformatics 2020, 36, 1198–1207. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  31. Aluru, M.; Shrivastava, H.; Chockalingam, S.P.; Shivakumar, S.; Aluru, S. EnGRaiN: A Supervised Ensemble Learning Method for Recovery of Large-Scale Gene Regulatory Networks. Bioinformatics 2021, 38, 1312–1319. [Google Scholar] [CrossRef] [PubMed]
  32. Dávila-Velderrain, J.; Caldú-Primo, J.L.; Martínez-García, J.C.; Álvarez-Buylla Roces, M.A. Gene Regulatory Network Dynamical Logical Models for Plant Development. In Plant Systems Biology; Springer: Berlin/Heidelberg, Germany, 2022; Volume 2395, pp. 59–77. [Google Scholar]
  33. Monga, I.; Randhawa, V.; Dhanda, S.K. Connecting the Dots: Using Machine Learning to Forge Gene Regulatory Networks from Large Biological Datasets. At the Intersection of GRNs: Where System Biology Meets Machine Learning. In Machine Learning and Systems Biology in Genomics and Health; Springer: Berlin/Heidelberg, Germany, 2022; pp. 103–121. [Google Scholar] [CrossRef]
  34. Miwa, K.; Serikawa, M.; Suzuki, S.; Kondo, T.; Oyama, T. Conserved Expression Profiles of Circadian Clock-related Genes in Two Lemna Species Showing Long-day and Short-day Photoperiodic Flowering Responses. Plant Cell Physiol. 2006, 47, 601–612. [Google Scholar] [CrossRef]
  35. Dixon, L.E.; Knox, K.; Kozma-Bognar, L.; Southern, M.M.; Pokhilko, A.; Millar, A.J. Temporal Repression of Core Circadian Genes Is Mediated through EARLY FLOWERING 3 in Arabidopsis. Curr. Biol. 2011, 21, 120–125. [Google Scholar] [CrossRef]
  36. Chow, B.Y.; Helfer, A.; Nusinow, D.A.; Kay, S.A. ELF3 Recruitment to the PRR9 Promoter Requires Other Evening Complex Members in the Arabidopsis Circadian Clock. Plant Signal. Behav. 2012, 7, 170–173. [Google Scholar] [CrossRef]
  37. Locke, J.C.W.; Kozma-Bognár, L.; Gould, P.D.; Fehér, B.; Kevei, É.; Nagy, F.; Turner, M.S.; Hall, A.; Millar, A.J. Experimental Validation of a Predicted Feedback Loop in the Multi-Oscillator Clock of Arabidopsis Thaliana. Mol. Syst. Biol. 2006, 2, 59. [Google Scholar] [CrossRef]
  38. Herrero, E.; Kolmos, E.; Bujdoso, N.; Yuan, Y.; Wang, M.; Berns, M.C.; Uhlworm, H.; Coupland, G.; Saini, R.; Jaskolski, M.; et al. EARLY FLOWERING4 Recruitment of EARLY FLOWERING3 in the Nucleus Sustains the Arabidopsis Circadian Clock. Plant Cell 2012, 24, 428–443. [Google Scholar] [CrossRef]
  39. Sachs, K.; Perez, O.; Pe’Er, D.; Lauffenburger, D.A.; Nolan, G.P. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science 2005, 308, 523–529. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Hierarchy of PCCs-ED-DBN.
Figure 2. Example of a non-homogeneous dynamic Bayesian network with two changepoints.
Figure 3. Component transition example.
Figure 4. The network structure of Saccharomyces cerevisiae.
Figure 5. Saccharomyces cerevisiae edge convergence scatter plot under HMM-DBN and PCCs-ED-DBN.
Figure 6. Accuracy comparison among different models on Saccharomyces cerevisiae dataset.
Figure 7. PR-AUC and variance under HMM-DBN and PCCs-ED-DBN. The line graph in panel (a) shows the relationship between the network reconstruction accuracy in terms of PR-AUC and the number of MCMC iterations. The line graph in panel (b) shows model stability in terms of PR-AUC variance versus the number of MCMC iterations.
Figure 8. PR-AUC and variance under different models. Panel (a) shows the network reconstruction accuracy in terms of PR-AUC scores when the proposed ED-birth move is incorporated into the Globally Coupled NH-DBN and the EWC NH-DBN. Panel (b) shows the relationship between the network reconstruction accuracy in terms of PR-AUC and the number of MCMC iterations under the Globally Coupled NH-DBN. Panel (c) shows model stability in terms of PR-AUC variance under the Globally Coupled NH-DBN.
Figure 9. PR-AUC of synthetic dataset. Panel (a) shows the network reconstruction accuracy in terms of PR-AUC scores at different data point lengths. Panel (b) shows the difference in network reconstruction accuracy in terms of PR-AUC scores at different data point lengths.
Figure 10. Scatter plot of Arabidopsis edge convergence under HMM-DBN and PCCs-ED-DBN.
Figure 11. Arabidopsis gene regulatory network inferred by the PCCs-ED-DBN model.
Figure 12. RAF pathway.
Figure 13. Accuracy comparison of different models on four RAF pathway datasets.
Table 1. Hyperparameters and symbols.
| Symbol | Explanation |
| --- | --- |
| $g$ | The $g$-th network node, $g = 1, \ldots, N$ |
| $K_g$ | The number of components for node $g$ |
| $k$ | The $k$-th time component ($k = 1, \ldots, K_g$) |
| $\varepsilon_{g,k}$ | The noise parameter for the $k$-th component of node $g$ |
| $M$ | The network structure, $M = \{\pi_1, \ldots, \pi_g\}$ |
| $\delta_g$ | The signal-to-noise hyperparameter for node $g$, see (4) |
| $\sigma_g^2$ | The noise variance hyperparameter for node $g$, see (5) |
| $\pi_g$ | The parent node set of node $g$ |
| $w_{g,k}$ | The interaction parameter vector for the $k$-th component of node $g$ |
| $y_{g,k}$ | The target values of node $g$ in component $k$ |
| $X_{\pi_g,k}$ | The design matrix for component $k$ of node $g$ |
| $A_\delta, B_\delta$ | The level-2 hyperparameters of the Gamma prior for $\delta_g^{-1}$ |
| $A_\sigma, B_\sigma$ | The level-2 hyperparameters of the Gamma prior for $\sigma_g^{-2}$ |
| $S_{\pi_g}$ | The set of candidate parent nodes |
Table 2. The specific values of network structure variance under different models.
| Iteration | 200 | 400 | 600 | 800 | 1000 | 1200 | 1500 | 2000 | 2500 | 3000 | 5000 | 10,000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HMM-DBN | 0.0070 | 0.0062 | 0.0050 | 0.0038 | 0.0030 | 0.0024 | 0.0026 | 0.0018 | 0.0016 | 0.0010 | 0.0008 | 0.0004 |
| PCCs-ED-DBN | 0.0051 | 0.0028 | 0.0026 | 0.0013 | 0.0010 | 0.0009 | 0.0008 | 0.0007 | 0.0006 | 0.0006 | 0.0005 | 0.0001 |
Table 3. The specific values of network structure variance (Globally Coupled NH-DBN).
| Iteration | 100 | 200 | 300 | 500 | 1000 | 1500 | 2000 | 3000 | 4000 | 8000 | 10,000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Birth | 0.0136 | 0.0141 | 0.0133 | 0.0124 | 0.0107 | 0.0086 | 0.0080 | 0.0063 | 0.0047 | 0.0036 | 0.0028 |
| ED-birth | 0.0131 | 0.0134 | 0.0135 | 0.0124 | 0.0101 | 0.0081 | 0.0062 | 0.0055 | 0.0038 | 0.0035 | 0.0021 |
Table 4. Time overhead comparison between HMM-DBN and PCCs-ED-DBN.
| Data | Iterations | HMM-DBN | PCCs-ED-DBN |
| --- | --- | --- | --- |
| Saccharomyces cerevisiae | 10,000 | 315 s | 325 s |
| Arabidopsis | 50,000 | 2706 s | 2757 s |
| RAF pathway (data_1) | 50,000 | 5371 s | 5497 s |
| RAF pathway (data_2) | 50,000 | 5364 s | 5488 s |
| RAF pathway (data_3) | 50,000 | 5370 s | 5480 s |
| RAF pathway (data_4) | 50,000 | 5399 s | 5501 s |
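To put the running times of Table 4 in relative terms, the short sketch below computes the extra time of PCCs-ED-DBN as a percentage of the HMM-DBN time. The timings are copied from Table 4; the helper name `relative_overhead` is only illustrative. On these numbers the overhead is roughly 2 to 3 percent on every dataset.

```python
# (dataset, HMM-DBN seconds, PCCs-ED-DBN seconds), copied from Table 4
timings = [
    ("Saccharomyces cerevisiae", 315, 325),
    ("Arabidopsis", 2706, 2757),
    ("RAF pathway data_1", 5371, 5497),
    ("RAF pathway data_2", 5364, 5488),
    ("RAF pathway data_3", 5370, 5480),
    ("RAF pathway data_4", 5399, 5501),
]


def relative_overhead(baseline_s, proposed_s):
    """Extra running time of the proposed model, in percent of the baseline."""
    return 100.0 * (proposed_s - baseline_s) / baseline_s


overheads = {name: relative_overhead(base, prop) for name, base, prop in timings}
for name, pct in overheads.items():
    print(f"{name}: +{pct:.2f}%")
```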
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Zhang, J.; Hu, C.; Zhang, Q. Constructing a Gene Regulatory Network Based on a Nonhomogeneous Dynamic Bayesian Network. Electronics 2022, 11, 2936. https://doi.org/10.3390/electronics11182936
