Estimation of Gene Regulatory Networks from Cancer Transcriptomics Data

: Cancer is a genetic disease in which multiple genes are perturbed. Thus, information about the regulatory relationships between genes is necessary for the identiﬁcation of biomarkers and therapeutic targets. In this review, methods for inference of gene regulatory networks (GRNs) from transcriptomics data that are used in cancer research are introduced. The methods are classiﬁed into three categories according to the analysis model. The ﬁrst category includes methods that use pair-wise measures between genes, including correlation coefﬁcient and mutual information. The second category includes methods that determine the genetic regulatory relationship using multivariate measures, which consider the expression proﬁles of all genes concurrently. The third category includes methods using supervised and integrative approaches. The supervised approach estimates the regulatory relationship using a supervised learning method that constructs a regression or classiﬁcation model for predicting whether there is a regulatory relationship between genes with input data of gene expression proﬁles and class labels of prior biological knowledge. The integrative method is an expansion of the supervised method and uses more data and biological knowledge for predicting the regulatory relationship. Furthermore, simulation and experimental validation of the estimated GRNs are also discussed in this review. This review identiﬁed that most GRN inference methods are not speciﬁc for cancer transcriptome data, and such methods are required for better understanding of cancer pathophysiology. In addition, more systematic methods for validation of the estimated GRNs need to be developed in the context of cancer biology.


Introduction
Cancer is a genetic disease that involves perturbation of gene regulatory networks (GRNs) caused by various mechanisms, such as copy number alteration, abnormal methylation status, abnormal protein configuration, and post-transcriptional dysregulation [1][2][3][4]. Although driver gene mutation information is crucial for the estimation of the genetic etiology of cancer, it is becoming increasingly evident that many genes are involved in cancer pathophysiology, which appears to disrupt GRNs [5]. In this context, the identification of information regarding gene regulation in cancer tissues is expected to provide invaluable information for the development of anticancer agents or cancer management strategies.
Previously, efforts aimed toward the identification of GRNs focused only on a small number of genes [6]. Currently, with the advent of high-throughput technologies, such as microarray and next-generation sequencing (NGS), tens of thousands of gene expressions are examined simultaneously [7,8]. A microarray is constructed on a slide on which probes for the hybridization of complementary DNAs for whole messenger RNAs (mRNAs) from one type of cell are implanted. The mRNAs obtained from sample tissues are tagged with fluorescent dyes so that a scanner can identify variations in the photosignals of the dyes as the abundance of the mRNAs. After scanning of the signals, gene expressions are summarized into a matrix of expression profiles in which rows and columns indicate genes and samples, respectively. Instead of utilizing hybridization, NGS platforms determine the amount of mRNA through direct sequencing of mRNAs.
With the advent of high-throughput transcriptomics data, a gene co-expression network (GCN) is the most widely used method for inference of the regulatory relationship between genes [14][15][16]. This is based on the assumption that genes showing similar expression patterns have a higher probability of sharing regulatory mechanisms for their expression. In the GCN, nodes represent genes and edges indicate a co-expression relationship defined by statistical inference. Pearson's correlation coefficient (CC) is used for the construction of a GCN (Equation (1)).
In the above equation, x and y represent expression vectors of different genes in a gene expression matrix of n samples, and x and y are means of the expression vectors. The significance of the co-expression is measured using the t distribution. The interpretation of significant co-expression includes the following scenarios. First, one gene of the significant co-expression gene pair has a direct effect on the expression of the other gene. Second, both genes of the significant co-expression gene pair have significant interactions with the other gene and the interaction of the co-expression genes is indirect. Third, both genes of the significant co-expression gene pair are regulated by another genes (Figure 1). significant co-expression gene pair has a direct effect on the expression of the oth Second, both genes of the significant co-expression gene pair have significant inte with the other gene and the interaction of the co-expression genes is indirect. Thi genes of the significant co-expression gene pair are regulated by another genes (Fi GCNs have revealed functional implications of GRNs in cancers. In particula combined with an enrichment test using biological knowledge, such as pathway ontology (GO), GCNs enabled identification of essential biological mechanisms consistent with the current knowledge about the molecular biology of cancer d ment and progression [14]. For example, Aggarwal et al. identified a gastrome, w functional module defined by meta-analysis of co-expression analysis of four d gastric cancer datasets [15]. In this analysis, they clustered the network of signifi expressions and identified the functional implications of the clusters using GO ment, which is tested based on the hypergeometric distribution. This gastrome co of genes that are related to cell proliferation, adhesion, and immune responses, w closely related to cancer pathophysiology. Moreover, they found a novel interac tween PLA2G2A and EphB2. Although this study identified both functional modu regulatory relationships between individual gene pairs, most other studies only r functional modules of cancers using co-expression analysis [17][18][19]. In addition t the CC for the estimation of GCNs, some modifications are applied for better est of co-expression relationships [20][21][22][23][24][25][26][27][28][29][30][31][32][33][34]. Weighted gene co-expression analysis (W is one of the most widely used methods. It uses signed co-expression measure b genes x and y using the following equation. This produces a weighted adjacency matrix (axy), and the weighted topologic lap measure is applied to define clusters based on the degree of co-expression. In research, WGCNA has been used to identify functional modules that are related to pathophysiology and prognosis. Yang et al. found that prognostic genes of four are enriched in modules identified using WGCNA [34]. In an analysis of large-sca expression microarray data of breast cancer, the WGCNA identified gene modu are related to clinical parameters and found novel prognostic genes [21]. Indee studies that apply WGCNA are aimed at exploring prognostic and biomarker gen GCNs have revealed functional implications of GRNs in cancers. In particular, when combined with an enrichment test using biological knowledge, such as pathway or gene ontology (GO), GCNs enabled identification of essential biological mechanisms that are consistent with the current knowledge about the molecular biology of cancer development and progression [14]. For example, Aggarwal et al. identified a gastrome, which is a functional module defined by meta-analysis of co-expression analysis of four different gastric cancer datasets [15]. In this analysis, they clustered the network of significant coexpressions and identified the functional implications of the clusters using GO enrichment, which is tested based on the hypergeometric distribution. This gastrome consisted of genes that are related to cell proliferation, adhesion, and immune responses, which are closely related to cancer pathophysiology. Moreover, they found a novel interaction between PLA2G2A and EphB2. Although this study identified both functional modules and regulatory relationships between individual gene pairs, most other studies only reported functional modules of cancers using co-expression analysis [17][18][19]. In addition to using the CC for the estimation of GCNs, some modifications are applied for better estimation of co-expression relationships [20][21][22][23][24][25][26][27][28][29][30][31][32][33][34]. Weighted gene co-expression analysis (WGCNA) is one of the most widely used methods. It uses signed co-expression measure between genes x and y using the following equation.
The signed correlations represent the distance between gene pairs, and the adjacency matrix is constructed using the signed correlations with the power of the soft thresholding parameter β. a xy = (s xy ) β This produces a weighted adjacency matrix (a xy ), and the weighted topological overlap measure is applied to define clusters based on the degree of co-expression. In cancer research, WGCNA has been used to identify functional modules that are related to cancer pathophysiology and prognosis. Yang et al. found that prognostic genes of four cancers are enriched in modules identified using WGCNA [34]. In an analysis of large-scale gene expression microarray data of breast cancer, the WGCNA identified gene modules that are related to clinical parameters and found novel prognostic genes [21]. Indeed, most studies that apply WGCNA are aimed at exploring prognostic and biomarker genes. Only a few studies are focused on the identification of gene regulatory relationships that are associated with the pathophysiology of cancer [29].
Information-theoretic measures have been applied in the construction of GCNs. Mutual information (MI) is a measure for the determination of co-expression relationships [35,36]. It represents the mutual dependency between two random variables (Equation (4)). In Equation (4), p(x,y) is the joint distribution of variables x and y, and p(x) and p(y) are the marginal distribution of each variable, respectively. The MI is determined by the ratio of the joint distribution versus product of the marginal distributions. There- fore, zero MI indicates that the two variables are independent, and high MI means strong dependency between the variables.
)dxdy (4) The MI-based CGN identifies highly connected hub genes of colon cancer [35]. In an analysis of prostate cancer data, the MI-based CGN also detected gene regulatory relationships of 454 interactions that had not been discovered previously [36].

Differential Co-Expression Network
Differential co-expression networks identify gene pairs that show significantly different co-expression between conditions, whereas the GCN methods are focused on identifying genes that have similar expression patterns in the gene expression data. The conditions might be different phenotypes or experimental conditions. In many cases, the differential co-expression is measured using Fisher's Z transformation of the CC. The difference of the statistics between the conditions is tested using the standard normal distribution. The differential co-expression reveals the mechanistic relationship between two gene pairs [37,38]. In cancer research, differential co-expression has been used to identify functional interactions that are conserved in the development of cancers [39], the regulatory mechanism of estrogen receptors in breast cancer [40], and prognostic gene modules in ovarian cancer [41]. For example, differentially co-expressed genes of several ovarian cancer data identified highly interconnected gene modules that showed significant associations with survival [41].
In the estimation of GRNs using pairwise measures, it should be noted that direct estimation of the regulatory relationship between two genes is not possible, and the findings rather provide clues about the regulatory relationship. Therefore, additional steps, such as the application of functional annotation, are required. However, even functional annotation provides the same results for the regulatory relationships. Therefore, post hoc analysis of co-expression appears to be performed more, especially in cancer research.

Estimation of Co-Expression Using the Gaussian Graphical Model
For the inference of GRNs, a partial correlation coefficient (PCC) has been applied for estimation of the causal relationship between two genes [42][43][44][45][46][47][48][49][50][51][52][53][54][55][56][57]. A partial correlation indicates that the relationship between the two variables is influenced by other variables in a dataset. A significant partial correlation, which indicates that the degree of the partial correlation is within the rejection region of the null hypothesis in the statistical testing, indicates that the relationship between the two variables is independent of the other remaining variables. In multivariate data, the PCC is estimated using covariance matrix inversion or the following equation, which uses the regression coefficients of least square regression between the target variables and the remaining variables ( Figure 2).
In the above equations, i and j indicate two genes, ω indicates the inverse of the covariance matrix, and β is a vector of regression coefficients of a regression model, where genes except gene i are explanatory variables and gene i is a response variable. The superscripts of β indicate the response variable genes. When the PCCs are estimated between all variables and summarized in a network form, it is also called a Gaussian graphical model (GGM).
In the above equations, i and j indicate two genes, ω indicates the inverse of the covariance matrix, and β is a vector of regression coefficients of a regression model, where genes except gene i are explanatory variables and gene i is a response variable. The superscripts of β indicate the response variable genes. When the PCCs are estimated between all variables and summarized in a network form, it is also called a Gaussian graphical model (GGM). In gene expression data, the high-dimensional property of transcriptomics data requires additional remedies. In matrix inversion, Moore-Penrose pseudoinverse and bootstrap aggregation is applied to overcome the shortcomings of a small sample size [56]. The GeneNet package showed a substantial performance in simulation data analysis and identified the genes that are relevant to cancer pathophysiology in an analysis of breast cancer microarray data [56]. In the estimation of PCCs with a linear regression model, gene expression data are problematic because the number of genes is far greater than that of samples, which makes ordinary least-squares estimation for linear regression problematic. To solve this problem, regularized regressions including the partial least square, ridge regression, lasso, and two-stage adaptive lasso methods are applied [55]. In this study, performance was compared in terms of network characteristics rather than biological implications. Although the GGM is applied in cancer research, most studies are focused on In gene expression data, the high-dimensional property of transcriptomics data requires additional remedies. In matrix inversion, Moore-Penrose pseudoinverse and bootstrap aggregation is applied to overcome the shortcomings of a small sample size [56]. The GeneNet package showed a substantial performance in simulation data analysis and identified the genes that are relevant to cancer pathophysiology in an analysis of breast cancer microarray data [56]. In the estimation of PCCs with a linear regression model, gene expression data are problematic because the number of genes is far greater than that of samples, which makes ordinary least-squares estimation for linear regression problematic. To solve this problem, regularized regressions including the partial least square, ridge regression, lasso, and two-stage adaptive lasso methods are applied [55]. In this study, performance was compared in terms of network characteristics rather than biological implications. Although the GGM is applied in cancer research, most studies are focused on modification of the GGM, and cancer data are used for validation of the developed methodologies [43][44][45][47][48][49][50][51][52][53][54][55][56][57]. Modifications of the GGM include identification of multiplicative interaction that hinders inference of partial correlation [48], bootstrapping inference of a network structure with control of the false discovery rate of edges using mixture model [49], fast inference of the GGM using scaled lasso regression [44], and combing a greedy search algorithm to identify the network structure of the GGM in high-dimensional data [57]. Whereas the GGM is inferred without considering the difference in phenotypes, differential networks of the GGM are estimated between conditions [47,52,53]. The methods identify GGM structures that are different between phenotypes or conditions.

Inference of Gene Regulatory Networks Using the Bayesian Network Model
A Bayesian network (BN) is a graphical model that is used for the inference of causal relationships between variables. Due to its algorithmic characteristics, the BN has been applied to the inference of GRNs from high-throughput transcriptomics data [58]. If the data have K variables, they are represented as K nodes in the BN. In the BN, the joint probability of all variables can be estimated using the following equation.
In Equation (7), pa k indicates the parental nodes of node k. The BN has a directed acyclic graph structure: a node is called a child node when the node is pointed to by another node, and the pointing node is called a parent node ( Figure 2). As in Equation (7), the probability of the nodes is influenced only by their parent nodes, and the other nodes do not influence the estimation of the joint distribution. This is equivalent to a factorization process of multiple variables. That is, only the parent nodes (variables) are considered in the estimation of the probability of a node, even when many nodes other than the parent nodes are present. For the construction of a BN, the conditional probability should be estimated according to the network structure (parent-child relationship), and inference of the network structure is accomplished with search algorithms.
In cancer research, the BN has been used for the identification of regulatory relationships between genes [58][59][60][61][62][63][64][65][66][67][68][69][70]. Kunkle et al. used the BN to determine the key genes that are critical for the development of astrocytoma [70]. They used 646 differentially expressed genes (DEGs) in astrocytoma from a meta-analysis of 10 microarray datasets. The BN structure was estimated using the DEGs and key genes, which are important genes for predicting the astrocytoma status, using the concept of Markov blanket. The BN was also applied to identify the causal genes that are responsible for the malignant transformation of cancer cells [59]. The DEGs were computed using the microarray data of normal and cancer cells and the BN was constructed using the DEGs. The BN was validated using independent cancer data, which showed consistent GO enrichment in the discovery and validation data. In another study, the BN was used to identify the structure of the GRN in breast cancer metastasis to the bone, brain, and lung using five microarray datasets [68]. Candidate genes for BN construction were selected as the top correlated genes with metastasis status, and the BNs for each metastatic site were inferred. Several genes (POLR2JA, SPTLC1, ILK, ALDH3B1, HEY1, KCNF1, and UVRAG) were found to be causal genes for the metastasis of breast cancer; in particular, estrogen receptor was significantly associated with bone and lung metastasis [68]. Another study using the BN identified 10 genes encoding novel regulators of the G1-S cell cycle transition, which was also validated using small interfering RNA (siRNA) experiments. Moreover, TR1B1 acts on NF-κB and AP-1 sites, which results in cyclin D1 and primes the sensitization of cells to TRAIL-induced apoptosis [64]. In addition to estimation of GRNs between genes, the BN has also been applied to the inference of GRNs related to clinical parameters, such as lymph node metastasis and survival. Gevaert et al. identified network structures having regulatory relationships between genes and clinical outcomes [69]. Of the genes in the BN, the Markov blanket selected those genes that have a causal relationship with the clinical outcomes. Furthermore, the BN showed a better performance in the classification of breast cancer prognosis.

Estimation of Gene Regulatory Networks Using the Markov Network Model
The Markov network or Markov random field (MRF) is an algorithm for identifying direct relationships between genes [71]. Similar to the BN, a Markov network factorizes the joint distribution p(x) using a subset of genes. However, unlike the BN, it uses an In Equation (8), C indicates cliques of a Markov network and Ψ is a potential function. The joint distribution is a product of potential functions over maximal cliques of the network. Furthermore, Z is a normalization constant estimated by sum of all potential functions.
In cancer research, the Markov network is applied to identify regulatory relationships between genes. Wei et al. proposed a local MRF approach that identifies DEGs and their dependencies [72]. They used the Gamma-Gamma model which is a mixture of gamma distribution for the determination of DEGs, and the local MRF with the selected DEGs was estimated using maximum likelihood estimation. A simulation study revealed that their model outperformed the other benchmark methods in the identification of DEGs. In an analysis of gene expression microarray data of breast cancer, the model found prognosisassociated DEGs, such as CAV1, WNT4, and CCR7. In addition, it found a dependency structure between the DEGs and the other selected genes, which have regulatory potential [72]. In cancer research, the MRF approach is frequently applied to the integration of heterogeneous data, including various omics data from different experimental platforms and prior knowledge that is gathered from publications and databases. These studies are introduced in the following section.

Integrative Approach for Estimation of Gene Regulatory Networks
The abovementioned methods estimate regulatory relationships by only using transcriptomics data, which can be classified as unsupervised approaches because any other information is not used except transcriptomics data. In cancer research, supervised methods that integrate information other than transcriptomics data are used for the inference of GRNs. Such information includes epigenomic, TF binding site, and protein-protein interaction information. In this review, these approaches are categorized into two classes. The first approach is a supervised method that uses regression or classification models for the inference of GRNs. The second one is an integrative approach that applies heterogeneous data regardless of whether the inference model for the GRN is supervised or unsupervised.

Supervised Approach for GRN Estimation
The general framework of the supervised approach used in the estimation of GRNs includes two approaches ( Figure 3). First, gene expression profiles are extracted from the transcriptomics data. Second, the expression profiles of one gene (or genes) are regressed on the expression profiles of the other gene (regression approach) or the regulatory relationships between genes are presented as class labels in the supervised classification model (classification approach). In the classification approach, information for regulatory relationships (e.g., TF and its DNA binding site) between genes is integrated in the model, and expression profiles or any information extracted from the profiles, such as distances or correlations, are used as input data for the model. The classification approach is based on the development of a model that predicts whether a regulatory relationship is present between genes [81,82]. For this purpose, gene expression data and other information are used, such as information regarding TF and its binding site, protein-protein interaction, and knockout experiments. Although these methods are validated using non-cancer gene expression datasets, it is worth validating them using cancer transcriptomics data considering their superior performance compared with other methods.

Integration of Heterogeneous Data Sources for Gene Regulatory Network Estimation
Transcriptional regulation of genes is known to comprise several steps that are involved in various kinds of biological phenomena. Therefore, it is expected that the integration of information regarding the biological processes will be advantageous for the In the regression approach, tree-based boosting models are applied for the estimation of GRNs [73][74][75][76]. The development of a tree model is equivalent to a feature selection process, and if the selected genes are regressed on the target gene, links between the tree model are regarded as regulatory relationships. As the boosting process generally improves the performance of the tree model, most studies using the tree model adopt such processes. However, the model appears to be applied infrequently to cancer research; thus, relevant publications are hard to find. Another regression approach uses a structural equation model (SEM) [77][78][79][80]. The SEM includes a path analysis model, which uses a directed acyclic graph for the representation of independent variables. Thus, the network structure is determined after application of the path analysis model [80]. Moreover, latent variables are accepted so that any unobservable biological effects can be modeled. Whereas the SEM has been applied for the inference of GRNs in studies on various diseases, it has been less frequently used in cancer research. Considering that the network structure is determined and represented as a regression equation, the result of the SEM is easy to interpret and allows for easy identification of regulatory relationships between genes.
The classification approach is based on the development of a model that predicts whether a regulatory relationship is present between genes [81,82]. For this purpose, gene expression data and other information are used, such as information regarding TF and its binding site, protein-protein interaction, and knockout experiments. Although these methods are validated using non-cancer gene expression datasets, it is worth validating them using cancer transcriptomics data considering their superior performance compared with other methods.

Integration of Heterogeneous Data Sources for Gene Regulatory Network Estimation
Transcriptional regulation of genes is known to comprise several steps that are involved in various kinds of biological phenomena. Therefore, it is expected that the integration of information regarding the biological processes will be advantageous for the inference of GRNs using gene expression data. Several applications utilize the abovementioned methodologies for the integrative analysis of gene expression data to infer a more accurate GRN by combining heterogeneous biological information and/or multiomics data.
Of the methods, the BN is inherently appropriate for the integration of information other than gene expression data because it can be integrated as priors of interaction between genes in the network structure learning [83]. In addition to biological information, multiomics data can also be integrated into the process of GRN structure learning using the BN [84]. The integration of biological knowledge and/or heterogeneous omics data can be applied to the Markov network and GGM. The "gene regulatory network inference accuracy enhancement" algorithm uses the concept of meta-GRNs that pose a weight between genes using prior knowledge, and the weight is integrated into the inference of a Markov network for the GRN [85]. In another study, the Markov network was modified to infer the network structure by combining data having different distributions (FuseNet) [86]. When the FuseNet method was used to perform a combined analysis of RNA sequencing and somatic mutation data for breast cancer, it identified a network structure that was relevant to the pathophysiology of cancer. The GGM has also been modified to integrate biological knowledge. In the modified version of the GGM, lasso is frequently used for the estimation of partial correlations between genes [87][88][89][90]. Most of these methods, including weighted graphical lasso [87], information-incorporated GGM, prior lasso [88], graphical adaptive lasso [89], and fused pathway graphical lasso [90], use a weight matrix that assigns weights to the links between the genes based on prior biological information. These methods have been validated for gene expression data of liver, lung, ovarian, and breast cancers and were found to perform better than the benchmark methods.

Comparative Analysis Using the Simulation Method
GRNs include over tens of thousands of genes, and a comparative analysis of GRN inference methods needs systematic approaches for unbiased comparison of these methods. Probably the best practical solution is the application of simulation data where a true positive regulatory relationship is already determined. In fact, most studies that develop novel methods use simulation data and compare the preexisting methods and newly developed ones. Whereas ad hoc simulation models have been applied independently across studies, there are computational tools that generate simulation data using statistical or mathematical models for gene expressions. These tools are classified into two categories. First, differential equation models are used, and the data are generated based on predefined parameters [91][92][93][94][95]. GeneNetWeaver and Stochastic Gene Networks Simulator are tools that use differential equation models [91,93], and NetBenchmark provides several different simulation tools for benchmark analysis [92]. Second, specific models are used in the generation of simulation data. For example, in the GeneNet package, a GGM-based function is implemented, and users can generate data using their own parameters, including the number of genes, samples, and regulatory relationships between specific genes [56]. In many studies, simulation data were generated using a GGM or BN model, and several parameters were combined to generate simulation data with different characteristics. SynTReN simulates gene expression data from different types of network topology models [96].

Application of Prior Biological Knowledge Obtained from External Databases
For faster and more efficient validation of results from GRN analysis, evidence from databases can be applied; one of the most frequently used databases is the GO database. GO is a nomenclature system that provides information about the functions of genes [97]. A GO term is mapped to multiple genes that are associated with the same biological processes, molecular functions, or cellular locations. GO is frequently used for the validation of coexpression modules that are defined using expression similarities between genes. If the genes in the module have similar functions, then certain GO terms linked to the genes are overrepresented in comparison to what would be attributable to random chance alone. The test of significant overrepresentation of GO terms is called an enrichment test, which uses chi-square or hypergeometric distributions for testing the significance of the enrichment of GO terms in the module. The enrichment test can be applied to the validation of a GRN, especially when a gene is linked to multiple genes or to a group of genes interconnected by the regulatory relationship in the estimated GRN. As it is assumed that genes involved in similar biological processes tend to be linked to each other in a GRN [98], the application of an enrichment test can be an efficient strategy for the computational evaluation of a GRN.
Information about intergene regulations from pathway databases, such as KEGG [99], WikiPathways [100], Small Molecule Pathway Database [101], PathBank [102], Pathway Commons [103], Reactome [104], GeneMANIA [105], and MSigDB [106], provides another platform for the evaluation of estimated GRNs. These databases collect information through manual curation of publication data and/or data analysis of open functional genomics data, and they contain regulatory relationships between genes. This information is applied to the validation of GRNs as gold-standard relationships. In cancer, however, it is possible that aberrant genetic circuit information is not included in the pathway databases. Fortunately, information related to cancer-specific pathways is available in KEGG, Reactome, WikiPathways, and MSigDB. This should be considered when the estimated GRN is evaluated using knowledge obtained from external biological databases.
Other databases used in the validation of GRNs provide biological knowledge about TFs and their binding site and epigenomic control of gene expression. TRANSFAC is one of the most popular databases for information regarding TFs and their binding sites [107]. As mentioned earlier, the database is frequently applied for the supervised estimation of GRNs. The JASPAR database has a similar functionality to TRANSFAC, and its data content is limited to six taxonomic groups [108]. Other databases such as hTFtarget [109], TRRUST [110], and TF2DNA [111] provide similar biological knowledge for humans and/or mice. In addition to the TF databases, integrated databases combining heterogeneous information about gene regulation are available. For example, the European Bioinformatics Institute maintains a database of information regarding gene regulation as part of the Ensembl database system [112]. The RegNetwork includes TF-microRNA relationships [113], and the Gene Transcription Regulation Database integrates processed epigenomic data of DNase I hypersensitive site sequencing and chromatin immunoprecipitation sequencing [114]. The dbCoRC database provides information about super-enhancers that are predicted using H3K27ac ChIP-seq data [115].

Experimental Technologies for the Validation of GRNs
As more than tens of thousands of genes exist in the human genome, experimental validation of the regulatory relationships between all genes is hard to accomplish. Even if fewer genes are selected, it is still difficult to simultaneously validate the regulatory relationships between several genes. For the experimental validation of regulatory relationships, genes that show the most significant results or high connectivity (e.g., measured using CC, PCC, or any other metrics) are selected and analyzed in experiments. Moreover, for this purpose, gene perturbation experiments are used [116]. The basic assumption of such validation is that the genes are estimated to have regulatory relationships if their expressions are changed after the perturbation of a certain gene.
Overexpression of one gene and observation of the other genes that are estimated to have regulatory relationships are widely used methods for the validation of regulatory relationships. DNA transfection and subsequent polymerase chain reaction experiments for the identification of gene expression changes allow researchers to decide whether a gene is indeed related to the changes in expressions of the other genes that are known to be regulated by that gene. Using siRNA is another method for perturbation of gene expression [117]. The length of siRNA ranges from 21 to 23 nucleotides, and it is easy to use compared with the previously mentioned transfection method. If an siRNA acts on one gene, then the expression of other genes that have regulatory relationships with that gene tends to change. Recently, clustered regularly interspaced short palindromic repeats (CRISPR) technology has been applied to gene perturbation experiments. CRISPR technology knocks down gene expression through complementary base-pairing of a single guide RNA. Similar to siRNA experiments, regulatory relationships between genes can be identified by gene perturbation using CRISPR technology [118].

Discussion
In this review, GRN inference and validation methods for gene expression data were surveyed. The widely used pairwise measures are CC and MI, and their interpretation is performed with GO in many cases. Since significant co-expression does not guarantee a regulatory relationship, it should be validated with other information. As multivariate measures, the GGM, BN, and MRF are frequently applied. The concepts of the models are different, but they consider multiple variables when determining regulatory relationships between two variables, and inference of regulatory relationships is possible. The supervised approach is performed with the application of linear regression or classification models, and biological knowledge is integrated through weights in the regression model or into a classification model as information for deciding whether genes have a regulatory relationship or not.
Considering the methodologies used in the publications that are reviewed here, there are issues that should be considered in the GRN analysis of cancer transcriptomics data. First, although several GRN inference methods have been developed, few methods that can be used for identifying cancer-specific GRNs are available. Most of the methods are applicable to transcriptomics data regardless of whether the data are generated from cancerous or non-cancerous samples. As the rewiring of GRNs is a characteristic feature of cancer, the deviation from normal GRNs should be estimated in studies on GRNs using cancer transcriptomics data. Second, because canonical knowledge of gene regulatory interactions may not be valid in cancer, it is necessary to provide cancer-specific genetic regulatory evidence, which needs to be collected in databases, and a testing procedure deciding whether the estimated GRN is valid given information from the databases. Third, even cancer of the same type or stage can have a heterogeneous nature, which indicates the necessity of developing methods for the inference of heterogeneous GRNs from cancer transcriptomics data. Finally, more research about integrative methods for multiomics data is required for accurate estimation of GRNs in cancer.

Conflicts of Interest:
The author declares no conflict of interest.