Network Inference from Gene Expression Data with Distance Correlation and Network Topology Centrality

: The reconstruction of gene regulatory networks based on gene expression data can effectively uncover regulatory relationships between genes and provide a deeper understanding of biological control processes. Non-linear dependence is a common problem in the regulatory mech-anisms of gene regulatory networks. Various methods based on information theory have been developed to infer networks. However, the methods have introduced many redundant regulatory relationships in the network inference process. A recent measurement method called distance correlation has, in many cases, shown strong and computationally efﬁcient non-linear correlations. In this paper, we propose a novel regulatory network inference method called the distance-correlation and network topology centrality network (DCNTC) method. The method is based on and extends the Local Density Measurement of Network Node Centrality (LDCNET) algorithm, which has the same choice of network centrality ranking as the LDCNET algorithm, but uses a simpler and more efﬁcient distance correlation measure of association between genes. In this work, we integrate distance correlation and network topological centrality into the reasoning about the structure of gene regulatory networks. We will select optimal thresholds based on the characteristics of the distribution of each gene pair in relation to distance correlation. Experiments were carried out on four network datasets and their performance was compared.


Introduction
Systems biology is not only an emerging field, but more importantly, it represents a new approach to biological research [1,2]. In the past, the structure of gene regulatory networks (GRNs) was inferred from experimental interventions, but such experiments required considerable time and cost. With the rapid development of high-throughput technologies, a large number of research studies have generated a large amount of gene expression data [3,4], which has made it possible to infer gene regulatory networks from these expression data based on computational methods. In recent years, network inference based on computational methods has become one of the most important goals in the postgenomic era. To this end, the "Dialogue on Reverse Engineering Evaluation and Methods" challenge aims to stimulate researchers to develop new and efficient arithmetics [5].
Much progress has been made in inferring GRNs' structure from gene expression data. In the early days, Boolean networks [6,7] were popular in GRN inferencing, where the states of genes were represented by Boolean variables, and interactions between genes were represented by Boolean functions, which determined the states of genes on top of some other regulatory gene states. At present, information-theoretic approaches are increasingly being used for reconstructing GRNs. Several mutual information (MI)-based methods have been successfully applied to infer GRNs, such as the relevance network (REL), context likelihood of relatedness (CLR), and the ARACNE and minimum redundancy network (MRNET). The REL algorithm [8] calculates MI values between genes and then infers interactions based on the threshold values. The CLR algorithm [9] extends the REL algorithm, which infers interactions based on scores derived from the background distribution of mutual information. For both the REL and CLR algorithms, it is easy to introduce indirect interactions, which can lead to more false edges. In order to eliminate indirect interactions, Margolin et al. proposed the ARACNE algorithm [10] based on data-processing inequality, which takes into account indirect interactions in interaction triangles. The MRNET algorithm [11] by Meyer is a network inference algorithm using a feature selection strategy, in which an iterative search process is applied to select direct interactions. PCA-CMI [12] measures the dependence between genes through conditional mutual information(CMI), successfully differentiated direct interaction, and indirect association. CMI2NI [13] calculated the mutual information between two genes when given the third gene by calculating the Kullback-Leibler divergence between the hypothetical distributions at the boundary between the two genes. Although MI can characterize nonlinear dependence, MI's calculation usually requires probability or density estimation, which often requires assumptions. However, it is not accurate because the distribution of gene expression data is uncertain.
Recently, a novel and simple statistic of dependency relationships, distance correlation [14], has emerged, which is sensitive to any deviation from independent behaviors, such as nonlinear or non-monotone dependent structures [15]. Better than MI, the distance correlation statistics were calculated very merely without any distribution assumptions. Recently, an approach has been proposed to infer GRNs based on distance correlation. It successfully combined distance correlation with existing CLR algorithms and MRNET algorithms to raise the accuracy of the GRNs [16].
GRNs, or more generally, biochemical networks are sparse, meaning that a gene is regulated by a small number of genes relative to the total number of genes in the network. A range of sparsification-based features have now been proposed to infer GRNs from gene expression data [17,18]. Currently, we can sparse networks by using graph properties in the network [19,20]. In the existing studies, most algorithms only consider the connections in the gene expression data. Still, they do not include the known graph attributes in the reasoning process, which reduces these methods' prediction veracity rates and restricts their availability in practice [21]. A large number of studies have shown a hierarchical, scale-free nature of biological networks [5,22]. This attribute makes most nodes in the network sparsely connected, where a few positively associated genes account for most of the interaction, which are hub nodes. The hub node is a node with high network centrality. The connection rules between nodes reflect the relative position information between nodes to some extent. Recently, the LDCNET algorithm has successfully merged MI and network topology centrality for reconstructing GRNs [23].
In this article, we develop a novel method, namely DCNTC, which incorporate the distance correlation and network topology centrality into GRNs inferring algorithms and test the performance. Our approach adopts a novel estimation of measurement, and the sparse (scale-free) structure of the gene regulatory network is used to calculate the network topology centrality. Compared with the traditional methods, we use the latest distance correlation statistics to measure the nonlinear relationship, and combine the network topology centrality to infer the gene regulation network. At the same time, we select the optimal threshold according to the distribution characteristics of the values calculated by the distance correlation of each gene pair. Real data from the SOS DNA repair network and DREAM-simulated data suggest that the DCNTC algorithm can improve the GRNs' inference accuracy.

Methods
In this section, the definitions of distance correlation and network topology centrality will be reviewed, as well as the algorithm of the DCNTC for inferring GRNs.

Distance Correlation
Distance correlation provides a new approach to the problem of testing the joint independence of random vectors [14,24]. The energy package in R provides the calculation function of distance correlation [25]. Distance correlation [14] was raised as an innovative method to detect the dependence. The key idea is to calculate the difference between the joint eigenfunction and the product of its marginal eigenfunction in a special, weighted L2 space. Specifically, for random variables (X, Y), denote an innovative method of (X, Y) by f (X,Y) , and its marginal eigenfunction f (X) and f (Y) . The distance covariance between X and Y is defined as the root of the following equation: where p and q are the dimensions of X and Y, respectively, and w(t, s) is the weight function given by (C p C q |t| q+1 p ) −1 with C p = π 1+p /2Γ(1 + p)/2 and C q = π1 + q/2Γ((1 + q)/2). By standardizing the distance covariance, the distance correlation can be defined as,

Network Topology Centrality
Network centrality is a network topology feature used to measure nodes' importance because it can effectively evaluate the network position relative to other nodes in a local scope. Therefore, network topology centrality can be applied to assess the significance of nodes in a network. Commonly used network centralities are closeness centrality, degree centrality, and betweenness centrality. In the network, the node's importance is usually calculated by its degree centrality of the node. Degree centrality was formally defined as the count of links on a node, which is often known as an analytical method of how nodes can be affected by flow into a given network [26]. In an undirected graph, the node's degree is the count of other nodes to which it is connected. In a graph G with n nodes, we commonly use the adjacency matrix A = [a µυ ] to describe the connectivity between nodes. Define the adjacency matrix A, where a µυ = 1, µ is adjacent to υ, and a µυ = 0, µ is not adjacent to υ or µ = υ. Degree centrality is based on the number of connected edges of node υ as the importance of node υ. The calculation of node centrality is as follows: where a µυ is the element in the adjacency matrix A, and V NB(υ) is the adjacency subgraph of υ.

GRNs Inference with DCNTC Algorithm
In this paper, we propose an algorithm DCNTC for inferring the structure of a regulatory network based on network topology centrality and distance correlation, which is based on the network properties and correlation between genes. The algorithm consists of three main parts: (1) Initialisation of the regulatory relationships, (2) calculation of the network topology centrality and optimisation of the ranking, and (3) inference of the regulatory network structure.

Initialization of Regulatory Relationships
The first step in the DCNTC algorithm is to initialize and pre-treat the regulatory relationships between the genes. As distance correlation is an effective way to quantitatively describe non-linear relationships between genes, a distance correlation matrix M is constructed for the input gene expression data based on Equation (2). The greater the value of the elements in this matrix, the greater the likelihood that the gene pair to which it is addressed has a regulatory relationship. Considering the high noise level of gene expression data and the sparse nature of the regulatory network, elements of the M-matrix need to be pre-processed to eliminate some of the redundant regulatory relationships before the network structure can be inferred. This is usually done by setting a fixed threshold value. When the value of an element in the matrix is greater than the given threshold value, there is a preliminary regulatory relationship between the two genes to which the element corresponds; when the value of an element in the matrix is less than the threshold value, there is no regulatory relationship between the two genes to which the element corresponds, and the matrix element is set to zero.
In the regulatory relationship matrix, we considered that there was no regulatory relationship between pairs of genes with small distance correlations, which can also be described as redundant relationships. In the process of selecting the threshold for initial de-redundancy, we further analysed the distribution of distance correlation values for each gene pair in the different datasets, as shown in Figure 1. In Figure 1, we separated the range of distance correlations (0-1) in steps of 0.1 and counted the ratio of how many values there were in each interval across the different datasets. We found a single peak in the distance correlation distribution plot. To initially remove redundant relationships from the gene regulatory network, we chose the left boundary value of the interval where the peak was located as the threshold value θ for redundancy removal. This allowed us to initially remove redundant relationships from the initially obtained regulatory relationship matrix by threshold θ.

Calculation of the Network Topology Centrality and Optimization of the Ranking
The algorithm then uses the pre-processed gene regulatory relationships and the matrix M to calculate and rank the network topology centrality of each gene. Given the simple and easy-to-implement nature of node centrality, the algorithm uses it as a measure of the neutrality of the network topology of each gene. In this paper, distance correlation is used to measure the regulatory relationships between genes, so that the degree centrality of gene g υ in the network G can be expressed as follows: where d c represents the given cut-off distance. Essentially, the value of DC(υ) is equal to the number of genes for which the value associated with gene g υ distance exceeds the given cut-off distance d c . From Equation (4), it can be seen that the topological centrality of a gene network is influenced by the value of the cut-off d c . The magnitude of the value is directly related to the calculation of the topological centrality of the gene network. Specifically, for the updated regulatory relationship matrix M, M ij represents the initial regulatory relationship values for genes i and j. The sequence M ij (i ≤ j) is ranked downwards, and the resulting sequence is We set the d c value from the αT regulatory relationship value in the sequence, where the parameter α is set to 20% by default. This strategy uses statistical means to obtain the truncation distance values, so the values obtained are more scientifically valid.
Once the value of the cut-off d c has been determined, the centrality of the node is calculated for each gene in turn, according to Equation (4). From this process, it can be seen that there may be cases where two or more genes have the same node centrality value. In order to effectively distinguish the importance of genes with the same nodal centrality, we consider the re-sequencing of all genes with the same nodal centrality. The ranking process takes into account the nodal centrality of all the genes directly adjacent to the target gene, and measures the importance of the target gene according to the following formula: where V NB(µ) represents the set of nodes whose distance from node µ is more correlated than d c . The larger the value of this formula, the greater the importance of the target gene. For different genes with the same node centrality, we calculate the corresponding values according to Equation (5), where the gene with the higher score is positioned ahead of the gene with the lower score, and the new standard sequence q i i = 1, 2, ...n is obtained.

Inference of the Regulatory Network Structure
In order to construct the complete gene regulatory network, we will select regulatory genes for each gene based on the final sequencing results obtained. According to the scale-free nature of the network, genes with high network centrality will be linked to genes with low network centrality, so that in sequence a, genes with lower order are linked by genes with higher order. Additionally, given the sparsity of biological networks, we limit the number of regulatory genes selected for the target gene to one, meaning that we select only the gene with the greatest distance correlation to the target gene as the regulatory gene. For a network with n genes, this indirectly limits the number of correct edges predicted to, at most, n − 1. In summary, the computational process can be implemented by the following function: where X k is the gene whose position precedes gene X i in the standard sequence q i i = 1, 2, ...n. This function allows us to select the genes with the greatest distance correlation to gene X i as regulatory genes for gene X i from among the genes whose sequence position precedes gene X i . When the regulatory genes for all the genes are available, we can then construct them into a complete regulatory network structure. The complete algorithm implementation flow is shown in Algorithm 1.

Algorithm 1 DCNTC algorithm
Input: Microarray data G = g 1 , g 2 , ...g n , the threshold θ Output: A gene network 1: Initialize Q← ∅ 2: Construct a distance correlation matrix M according to Equation (2) 3: Adjust matrix M using the threshold θ 4: for each gene g c : 1 to n do do 5: compute the DC value of g c according to Equation (4) 6: Rank the genes g c ∈ G according to the DC value in descending order and store in Q 7: for each gene g c :1 to n do do 8: select a regulatory gene of g c using Equation (6) from Q 9: return result

Results
In this section, we describe extensive experiments evaluating the performance of the proposed method. Four regulatory network datasets were used in the experiments. Our proposed method was compared with four network inference algorithms based on information theory: CLR, ARACNE, MRNET, LDCNET, and two methods based on distance correlation: REL-DC and MRNET-DC. For the ARACNE algorithm, we chose the default threshold to discriminate the final regulatory relationship. The CLR and MR-NET algorithms were implemented using the optimal threshold selection method from article [27]. The code for all the algorithms was implemented in R and Matlab, respectively. The distance dependence was calculated using the existing dcor function in R. The REL, CLR, ARACNE, and MRNET algorithms were implemented using the existing R package MINET [28]; the LDCNET and DCNTC algorithms were edited and implemented in Matlab.
All of the experiments were performed on four network datasets, including simulated and real data. The datasets can be obtained from previous studies.
The DREAM3-10 gene dataset [5], contains 10 samples for 10 genes. It is from the DREAM ("Dialogue for Reverse Engineering Assessments and Methods") project, and represents a yeast gene network. The true network is composed of 10 nodes and 10 edges.
The DREAM3-50 gene dataset [5], contains 50 samples for 50 genes. It also belongs to the DREAM project and represents a yeast gene network. The true network is composed of 50 nodes and 77 edges.
The DREAM3-100 gene dataset [5], contains 100 samples for 100 genes. It also belongs to the DREAM project and represents a yeast gene network. The true network is composed of 100 nodes and 166 edges.
SOS [29,30], contains nine samples for nine genes. It is an SOS DNA repair network in Escherichia coli. The true network is composed of 9 nodes and 24 edges.
To contrast the reasoning methods objectively, it is necessary to measure their performance quantitatively. The predictive results are defined as follows: false-positive (FP), true-positive (TP), false-positive rate (FPR), true-positive rate (TPR), positive predictive value (PPV), accuracy (ACC), and Matthews coefficient constant (MCC) [31]. Mathematically, they are defined by: where TP, FP, TN, and FN are the elements in the formula. These are accounts of truepositives, false-positives, true-negatives, and false-negatives, respectively.

Results for DREAM3 Challenge Network
To test the prediction effect of the DCNTC algorithm on simulated data, DREAM challenge data, which were widely used, will be used as test data to reconstruct the GRNs [32]. DREAM3 is one of many DREAM Challenge subprojects that provide users with baseline data and control networks to test and evaluate the regulatory network model's effectiveness. We validated our approach on the DREAM3 dataset, where the sizes of the yeast knockout gene expression data were 10, 50, and 100, respectively.
In the first, we tested the DCNTC algorithm on the yeast gene expression data of network size 10 and a sample size of 10. Figure 2 indicates the network extrapolated from gene expression data in different ways. Figure 2A is the real GRN with 10 nodes and 10 edges chosen from an experimentally validated network in yeast genomes. Figure 2B is the network derived from gene expression data with the LDCNET algorithm. The edges of solid dotted black lines are correctly deduced, and the edges of red dotted lines are incorrect. On the edge of false inferences, G2-G9 is a redundant edge, whereas edge G3-G5 is unfound. Figure 2C shows the network deduced from the gene expression data, which is the network inferred from us. Clearly, in our network, the inexistent regulations of G2-G9 were successfully removed. The performance data for each of the seven algorithms is shown in Table 1.  In the second, we tested the DCNTC algorithm on the yeast gene expression data of network size 50 and a sample size of 50. The real network consists of 50 nodes and 77 edges. In Table 2, we can find that the REL algorithm predicts more accurate edges, but this algorithm also expects a lot of wrong edges. The MRNET algorithm can only predict a tiny number of edges among the four algorithms. In comparison, among the four algorithms, the DCNTC algorithm has higher accuracy (ACC) and lower FPR. In the third, we tested the DCNTC algorithm on yeast gene expression data with a network size of 100 and a sample size of 100. The whole network consists of 100 nodes and 166 edges. In Table 3, we can find that the MRNET algorithm predicts more accurate boundaries, but this algorithm also expects a lot of unfair advantages. In comparison, among the four algorithms, the DCNTC algorithm has higher accuracy (ACC) and lower FPR. On the simulated dataset, the algorithm in this paper performed better in all aspects, mainly because only genes with the greatest distance correlation to the target gene were selected when picking regulatory genes for the target gene. This controls the introduction of false-positive edges to a certain extent. Compared to the other algorithms, the ARACNE algorithm removes the introduction of redundant edges to some extent by removing the loops present in the gene regulatory network; the MRNET algorithm uses a maximum correlation and minimum redundancy strategy to select regulatory genes, which can predict more true-positive edges, but also introduces the most false-positive edges in humans, leading to a decrease in prediction performance. In summary, our algorithm can better control the introduction of redundant edges, thus improving the prediction accuracy.

Result for SOS Network in E. coil
The SOS network [29] is a signal pathway in the DNA repair system. It is inferred from real gene expression data and is frequently used to test the effectiveness of network inference methods. Here, we test the DCNTC on the network of E. coli. Table 4 shows the results of the seven methods applied to the SOS network in the E. coli dataset. The results show that the method outperforms all other methods except MRNET and LDCNET in terms of MCC. Although our method did not identify the most correct edges, it produced the least redundant edges of most methods. Furthermore, the validity of the model on the SOS network was not satisfactory compared to previous experiments on other datasets. The main reason for this finding is related to the characteristics of the SOS network and is due to two main factors: firstly, noise in the real data can be taken into account, making the calculation of node centrality inaccurate and leading to biases in the relative positions between gene regulation, resulting in fewer true-positive edges being predicted; secondly, the SOS network consists of nine nodes and 22 edges. In the final construction of the network, the algorithm selects only the connected edges with the most important relationships to the target gene regulation, which indirectly limits the count of boundaries in the prediction network to eight. Although the experimental results are not ideal, Table 4 shows that our method is no more effective than any other test method other than MRNET and LDCNET.

Discussion
In this paper, we fused simple distance correlation and network topological centrality into a structural inference algorithm for GRNs and tested its performance. We selected simulated and real data sets commonly used in gene regulatory network construction algorithms and compared them with existing mutual information-based and distance-correlation-based algorithms. The algorithm can better control the introduction of redundant relationships by ranking the network topological centrality and selecting the maximum distance correlation as the regulatory gene. Therefore, it is conducive to the extension of large-scale network applications, but with many nodes in large-scale networks, there may be more genes with the same node importance when measuring node importance, which may require consideration of higher-order neighbourhood information to measure node importance. As the number of genes increases, it can also lead to inaccuracies and difficulties when inferring the network.
Node centrality is an important metric to characterise the criticality of a node. Many node importance metrics have been proposed, including degree centrality, median centrality, proximity centrality, and eigenvector centrality. The degree centrality is the simplest metric to characterise the importance of a node. Gabrys et al. found that in scale-free or exponential networks, there are only a small number of nodes of large degree, and this are of high importance [33]. Median centrality and proximity centrality are good measures of the importance of nodes in terms of network connectivity, but are computationally complex due to the need to know global information about the network in advance, and are not suitable for large and complex networks. Fabian et al. investigated the extent to which different centrality measures (degree, strength, tightness, mediation, and eigenvectors) recover potential causal interactions in directed acyclic graphs [34,35], whereas feature vector centrality shows powerful effects in measuring the importance of nodes. In summary, the algorithm in the paper has some limitations. In the next study, we will try to reconstruct the gene regulatory network using different approaches to compute the centrality of the network topology.
The DCNTC algorithm has a clear advantage over the distance-based REL algorithm and the distance-based MRNET algorithm in every performance indicator. The algorithm is similar to the LDCNET algorithm in that both have a central network node location. In combination with the distance-dependent initial network redundancy and network topology centrality, the DCNTC algorithm can predict the structure of the GRN more accurately.

Conclusions
In this article, we proposed a novel algorithm DCNTC for inferring GRNs from gene expression data, taking into account the sparse structure of GRNs and the nonlinear dependence. In this method, the nonlinear dependence is represented by distance correlation (without assuming the probability distribution) between this gene pair. The sparse (scalefree) control structure of the gene regulatory network is used to calculate the network topology centrality and test the predictive performance of the algorithm on four data sets, which can effectively predict the structure of the gene regulatory network. However, the algorithm has certain limitations. How could we further confirm the direction of the gene regulatory network structure and its centrality in exploiting the network topology? Further research is still needed in the precise construction of gene regulatory networks.