A main direction in Systems Biology is detecting and studying the complex relationships between different molecules in the cell. To this end, network modeling has been used extensively to analyze the interactions between genes, mRNAs, proteins or metabolites [1], as well as other entities, such as diseases [2] or drugs [4]. This approach has given rise to the field of Network Medicine, which analyzes complex diseases that can concurrently affect many genes [6]. To study cell mechanisms, an abundance of large-scale gene expression experiments have been conducted using microarray or RNA sequencing (RNA-Seq) techniques, and the data are available via publicly accessible databases. The gene regulatory network (GRN) inference problem refers to reconstructing a network consisting of interactions between transcription factors (TFs) and their target genes. TFs are proteins that bind to DNA and regulate the expression of genes, i.e., they can activate or inhibit transcription.
De novo GRN inference, namely constructing a network based only on gene expression data, has attracted substantial research interest. To this end, a plethora of algorithms using various mathematical and computational methods has been developed over the last two decades. Initial efforts focused on finding expression similarities via correlation (e.g., WGCNA [9]) or mutual information (MI) (e.g., ARACNE [10], CLR [11]) and, more recently, on variations of them, such as sparse correlation [12], conditional MI [13] and partial information decomposition [15]. Several studies have modeled the gene transcription process using linear [16] or non-linear [18] ordinary differential equations or stochastic differential equations [20]. Other approaches used Boolean networks [21], statistical/probabilistic methods, for instance, Gaussian graphical models [22] and Bayesian networks [23], and regression analysis (such as linear regression [27], Lasso regression [28] and least angle regression [30]). Another category relies on machine learning methods, for example, support vector machines (SVM) (e.g., SIRENE [31]), random forest (e.g., GENIE3 [32], Jump3 [33]), XGBoost [34] and neural networks [36]. Finally, methods for more specific problems have been developed, such as a method using deep neural networks on microscopy images recording spatial gene expression [37] or a method to jointly learn GRNs in different species using orthology and Bayesian inference [38]. Several reviews are available on the topic [39], showing that each method makes different assumptions and takes advantage of different biological characteristics, and thus can be most effective on specific data or problems. In [42], an extensive comparison was performed, which concluded that performance is highly variable across datasets, in line with the “no free lunch” theorem.
Moreover, GRN reconstruction algorithms can be categorized based on two characteristics, locality and supervision [41]. Regarding locality, algorithms can be characterized as global if the same approach is applied to all genes and as local if specific characteristics of each node are taken into account. An example of locality improving network inference results is the pair of unsupervised algorithms ARACNE [10] and CLR [11]. In ARACNE, pair-wise MI is first calculated, and a network pruning step then eliminates indirect connections, while in CLR, an adaptive background correction step is performed before pruning to keep interactions that are important for both connected nodes. Another widely used local GRN algorithm is GENIE3 [32], which handles each gene separately. Specifically, for each target gene, all other genes (or a subset of genes if a list of TFs is given) are considered as candidate regulators, and a random forest is trained using the target’s expression as output and the regulators’ expression as input. Subsequently, the variable importance measure of the trained model is used to rank the potential regulators of the target gene.
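The pruning and background-correction steps described above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the published ARACNE or CLR implementations: the MI values are made up, the names `dpi_prune`, `clr_scores` and `tolerance` are hypothetical, the pruning uses the data processing inequality on gene triplets, and the background correction uses one common formulation in which each MI value is z-scored against the MI distributions of both genes.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical symmetric MI matrix for genes A, B, C, D (values are made up).
# A-B, B-C and C-D interact directly; the A-C similarity is indirect (via B).
GENES = ["A", "B", "C", "D"]
MI = [
    [0.00, 0.90, 0.50, 0.05],
    [0.90, 0.00, 0.80, 0.05],
    [0.50, 0.80, 0.00, 0.60],
    [0.05, 0.05, 0.60, 0.00],
]

def dpi_prune(mi, tolerance=0.0):
    """ARACNE-style pruning: drop edge (i, j) if some third gene k has
    higher MI with both i and j, i.e. (i, j) is the weakest edge of a triplet."""
    n = len(mi)
    kept = set()
    for i, j in combinations(range(n), 2):
        indirect = any(
            mi[i][j] < min(mi[i][k], mi[j][k]) * (1.0 - tolerance)
            for k in range(n) if k not in (i, j)
        )
        if mi[i][j] > 0 and not indirect:
            kept.add((i, j))
    return kept

def clr_scores(mi):
    """CLR-style background correction: z-score each MI value against the
    MI distributions of both genes, then combine the two z-scores."""
    n = len(mi)
    background = []
    for i in range(n):
        row = [mi[i][j] for j in range(n) if j != i]
        background.append((mean(row), pstdev(row)))
    scores = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        z = []
        for g in (i, j):
            mu, sd = background[g]
            # Negative z-scores are clipped to zero.
            z.append(max(0.0, (mi[i][j] - mu) / sd) if sd > 0 else 0.0)
        scores[i][j] = scores[j][i] = math.hypot(z[0], z[1])
    return scores

edges = dpi_prune(MI)
print(sorted((GENES[i], GENES[j]) for i, j in edges))
clr = clr_scores(MI)
print(round(clr[0][1], 2), round(clr[0][3], 2))
```

On this toy matrix, the pruning keeps the chain A-B, B-C, C-D and discards the indirect A-C edge, while the CLR-style score for A-D is zero because that MI value is below the background of both genes.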
Supervision refers to the inclusion of prior knowledge to improve modeling. Hence, algorithms can be divided into supervised and unsupervised. Considering that human gene expression data contain measurements of about 20,000 genes, usually in a few hundred samples, this constitutes a “large p small n” problem. Thus, inferring a biologically meaningful GRN relying solely on gene expression data is an extremely hard computational task. Therefore, supervised methods have emerged, which can provide more accurate results, since embedding a priori knowledge in the form of experimentally validated interactions can lead to the exclusion of spurious interactions between biologically unrelated genes, despite possible expression profile similarity [43]. Examples include [44], where functional associations were used as priors to solve an optimization problem, and [45], which used network motifs to learn probabilistic graphical models. Of great interest is the machine-learning category because, by their nature, these methods are based on supervised learning algorithms. A characteristic example is the SVM-based method SIRENE [31], which solves a classification problem separately for each TF to determine whether a gene is its target or not. SIRENE requires as input a list of known TFs and their targets as positive examples, while, due to the absence of negative examples, a cross-validation scheme is used on the unknown genes, considering a data subset as non-interacting examples. Finally, classification is performed using the expression of the unknown genes to predict their category (targets or non-targets).
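The “large p small n” issue mentioned above can be illustrated with a small simulation. This is a sketch with assumed, scaled-down numbers (2,000 simulated genes and 10 samples rather than the sizes quoted in the text). All profiles are independent random noise, yet some gene still correlates strongly with the target purely by chance, which is the kind of spurious similarity that prior knowledge helps to exclude.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)
n_samples, n_genes = 10, 2000  # assumed toy sizes: many genes, few samples

# Every profile is independent noise, so no true relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
genes = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_genes)]

best = max(abs(pearson(g, target)) for g in genes)
print(f"strongest |r| among {n_genes} unrelated genes: {best:.2f}")
```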
Ensemble tree methods, i.e., random forest and XGBoost, have been successfully applied to a wide range of Systems Biology problems, but in most cases in an unsupervised mode. In detail, random forest models have been trained in order to obtain variable importance measurements and select the most discriminative variables; for instance, to rank single-nucleotide polymorphisms (SNPs) [46] or microRNAs [47] according to their relationship with a disease, to detect differentially expressed pathways between two conditions [48] and for GRN inference in the GENIE3 method [32], as previously described. Similarly, XGBoost has been used to classify subpathways and select the most discriminative ones with variable importance [49]. For GRN inference, GRNBoost2 [34] follows the same approach as GENIE3, simply replacing random forest with XGBoost, while BiXGBoost [35] uses the same concept in two directions to select both the best regulators and the best targets for each gene. However, from a machine learning perspective, there are two distinct phases: model training on labeled data and prediction on new unlabeled data. In the aforementioned applications, models are trained to obtain the variable importance measurement, but they are not used for prediction. Therefore, the machine-learning algorithms are not exploited to their full potential. A notable exception, where the trained regression random forest model is actually used for prediction, is the prediction of new gene targets of microRNAs [50].
Since GRN inference is concerned with prediction, we were motivated to create an appropriate training and testing approach that benefits from the generalization abilities of machine-learning methods. In this study, we present a local supervised method named XGBoost for gene regulatory networks (XGRN), aiming to model a biological network’s interactions and predict new similar interactions utilizing gene expression profiles. Specifically, each previously known interaction is represented with a regression XGBoost model built on the expression profiles of the two interactors. Using the trained model, we predict the gene expression of the second interactor with other genes as input, and then we compare the prediction with the actual values to infer whether similar patterns are obtained. If so, these other genes could be possible interactors. In the case of GRN reconstruction, based on some known TF-target gene interactions, our method predicts other possible target genes of the TFs. The proposed method was applied to benchmark microarray data and a real single-cell RNA-Seq (scRNA-Seq) dataset, achieving very high performance compared to other methods.
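The train-then-predict scheme outlined above can be sketched as follows. This is an illustration under several assumptions, not the XGRN implementation: a simple 1-D nearest-neighbour regressor stands in for XGBoost so that the example needs no external library, the model is assumed to take the known target's profile as input and the TF's profile as output, the comparison between predicted and actual values is assumed to be a mean squared error, and all expression data are simulated. The names `fit_knn_regressor`, `mse` and the candidate labels are hypothetical.

```python
import random

def fit_knn_regressor(x_train, y_train, k=3):
    """Stand-in for the XGBoost regressor: 1-D k-nearest-neighbour regression."""
    pairs = list(zip(x_train, y_train))

    def predict(x_new):
        preds = []
        for x in x_new:
            nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
            preds.append(sum(y for _, y in nearest) / len(nearest))
        return preds

    return predict

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

rng = random.Random(1)
n_samples = 50

# Simulated profiles: the known target follows the TF with some noise.
tf = [rng.uniform(0, 10) for _ in range(n_samples)]
known_target = [2 * v + rng.gauss(0, 0.5) for v in tf]
candidates = {
    "true_target": [2 * v + rng.gauss(0, 0.5) for v in tf],  # same regulation
    "unrelated": [rng.uniform(0, 20) for _ in range(n_samples)],
}

# Train one model for the known interaction (assumed direction: target -> TF).
model = fit_knn_regressor(known_target, tf)

# Feed each candidate gene as input and compare the prediction with the TF.
errors = {name: mse(model(profile), tf) for name, profile in candidates.items()}
ranking = sorted(errors, key=errors.get)
print(ranking)
```

A candidate whose profile reproduces the TF's expression through the trained model (low error) is ranked as a more likely target than a gene with an unrelated profile.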
In this study, we presented XGRN, a local supervised method that aims to model known interactions of a gene network and to predict new similar interactions. Specifically, exploiting gene expression data in combination with some known TF-target gene interactions, other candidate target genes of each TF are predicted. To recap, this is performed by training an XGBoost regression model on TF and target gene expression and applying the trained models to other genes’ profiles to infer whether these candidate target genes result in patterns similar to those of the known target. In contrast to most unsupervised de novo GRN reconstruction methods, where each gene-gene combination is examined, resulting in an $N \times N$ matrix (where $N$ is the number of genes), here previously validated biological interactions are used, enabling us to focus model training only on TFs, which are a small percentage of the genes. It is important that our method is local and focuses on each TF separately, since it has been shown that GRNs are sparse [57] and scale-free [58], namely some TFs have many targets, while most have few specific targets. Therefore, we can adapt to each TF’s characteristics. The independent modeling of each interaction is also a key feature for users who would like to focus only on a specific interaction subset, for example, a TF of interest or a specific pathway.
Regarding supervision, our results confirmed the statement that supervised methods can help to increase performance [43]. In particular, in D54, which is the largest dataset, the best unsupervised method provided an AUROC of 54%, which is of little use for prediction since it is only marginally higher than the chance level of 50%. Furthermore, it has been shown that several older GRN methods do not perform well on scRNA-Seq data [59]. Hence, it is important to test a method not only on benchmark DREAM datasets but also on real RNA-Seq experimental data.
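To make the 50% chance level concrete, the AUROC can be read as the probability that a randomly chosen true edge receives a higher score than a randomly chosen non-edge. The following minimal sketch computes it by pairwise comparison; the edge scores and labels are made-up values for illustration, and the function name `auroc` is hypothetical.

```python
def auroc(scores, labels):
    """Probability that a random true edge scores above a random non-edge
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical ground truth (1 = true edge) and two sets of edge scores.
labels = [1, 1, 0, 0, 1, 0]
informative = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
uninformative = [0.5] * 6  # identical scores carry no ranking information

print(auroc(informative, labels))    # 8 of 9 pairs ordered correctly
print(auroc(uninformative, labels))  # 0.5, the chance level
```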
The core concept of XGRN resembles the supervised learning performed by SIRENE for GRN inference, where a binary classification problem is solved separately for each TF to predict whether a gene is its target or not, based on the expression profiles of known targets [31]. SIRENE requires as input a set of TFs and their targets as positive examples, while, in the absence of negative examples, a cross-validation scheme is used on the unknown genes. Note that in this approach, the regulator profile is not utilized. An advantage of using regression, instead of classification as in SIRENE, is that we can utilize both the target and the regulator expression profiles. Moreover, this scheme overcomes the absence of negative examples, avoiding the assumption that the absence of an interaction in a dataset can be interpreted as a negative training example.
Interestingly, our method is a generic framework that can be implemented using any regression method. Here, we used XGBoost, a recent, high-performing method that builds a complex regression model able to capture various non-linear functions. We note that gene expression experiments can contain inherent noise; therefore, we would like to avoid overfitting a model [60]. Ensemble tree algorithms, such as random forest and XGBoost, help towards building a more generalized model when configured with many trees and a small maximum depth for each tree. In addition, machine-learning algorithms, such as tree-based ones, are purely data-driven and model-free. Namely, no assumptions are made about the distribution of the variables or the relationships between them, unlike regression methods based on a specific mathematical model [27]. Moreover, tree-based regression is not affected by the absolute expression level (high or low). Finally, there are few parameters to be fine-tuned, and they have a small effect on the quality of results. Thus, there is no need for an exhaustive search for optimal values, which could additionally lead to overfitting to the training data.
Notably, our method takes the directionality of the interaction into account, which is a desired characteristic in TF-target networks, as well as in other cases, such as cellular pathways. If we switch the input and output, we model the relationship “a gene is targeted by a TF” and set as testing input the profiles of other TFs to detect whether they target this gene. Results were similar in this reverse case; thus, for clarity, we presented here only the first direction. A limitation of our method is that we cannot predict new targets for TFs without any known gene targets. However, we showed that even if only a small number of relationships between TFs and target genes are known, the proposed method can accurately recover the network structure. This is very important since we do not know whether biological networks are close to their complete form or not, especially for less studied organisms.
In conclusion, XGRN can deliver reliable results from a biological point of view, providing output networks very similar to the ground truth. We confirmed that supervised methods combining expression data with network structure can outperform unsupervised ones. The proposed approach of training regression models on known interacting node pairs provided accurate predictions, demonstrating its effectiveness. The high performance was achieved by employing XGBoost for regression, a recent model-free method. In general, accurate computational tools can not only help biological data analysis but can also be used as a first step before designing an experiment, providing indicative results for later experimental validation and reducing cost by pursuing only the most promising directions. Furthermore, we believe that a gene expression prediction approach can be extremely valuable for various applications beyond network reconstruction. In the future, we plan to apply this method to other interaction data, such as protein–protein interactions (PPIs) or pathways. Algorithms integrating these different information types are very important for an advanced comprehension of cellular mechanisms. Finally, recent research focus has shifted to network-based differential gene expression, such as pathways and subpathways [61]. Thus, we aim to adapt the proposed method for differential gene expression detection by using, in testing, the expression profile of the same gene in different conditions.