Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features

Random Forest (RF) is a widely used machine learning method with good performance on classification and regression tasks. It works well under low sample size situations, which benefits applications in the field of biology. For example, gene expression data often involve much larger numbers of features (p) compared to the size of samples (n). Though the predictive accuracy using RF is often high, there are some problems when selecting important genes using RF. The important genes selected by RF are usually scattered on the gene network, which conflicts with the biological assumption of functional consistency between effective features. To improve feature selection by incorporating external topological information between genes, we propose the Graph Random Forest (GRF) for identifying highly connected important features by involving the known biological network when constructing the forest. The algorithm can identify effective features that form highly connected sub-graphs and achieve equivalent classification accuracy to RF. To evaluate the capability of our proposed method, we conducted simulation experiments and applied the method to two real datasets—non-small cell lung cancer RNA-seq data from The Cancer Genome Atlas, and human embryonic stem cell RNA-seq dataset (GSE93593). The resulting high classification accuracy, connectivity of selected sub-graphs, and interpretable feature selection results suggest the method is a helpful addition to graph-based classification models and feature selection procedures.


Introduction
With the widespread use of high-throughput technologies, more gene expression datasets are available for studying clinical outcomes in diseases. Accurate classification models based on gene expression data are essential for identifying disease mechanisms and potentially designing specific treatments and medical plans. Identifying genes that determine different subtypes is essential for understanding the underlying biological mechanisms of disease development and discovering new drug targets [1]. Therefore, developing new methods to predict disease outcomes and identifying important features that reveal biological mechanisms is of great interest. However, the challenge of analyzing gene expression datasets lies in their structure, which often contains tens of thousands of genes and only a few hundred observations.
Tree-based methods such as random forest and LightGBM provide a natural solution to this problem, requiring little parameter tuning or data transformation [2]. Random forest, in particular, provides an inherent measure of feature importance, making it a popular choice for feature selection in the biological field [3]. Random forest algorithms have been shown to achieve high classification accuracy in distinguishing benign breast cancer from malignant [4], detecting biomarkers for prostate cancer progression based on DNA methylation data [5], and identifying abnormal pap smear cervical cell images [6]. Random We have developed a new method called Graph Random Forest (GRF) to effectively integrate graph information into the random forest framework. The architecture of the model is shown in Figure 1. The key idea behind GRF is to embed graph information in the process of building a decision tree. When establishing a decision tree, we consider features in the neighborhood of any number of hops to the head-splitting node. The head node for each decision tree is determined through a data-driven approach, allowing for a flexible and robust model that leverages the underlying graph structure. The source code is available as a public repository on Github (https://github.com/tianlq-prog/GRF (accessed on 26 June 2023)).

Evaluation of Feature Importance
In addition to predicting labels on testing data, it is crucial to identify features that significantly contribute to the classification and help uncover biological mechanisms. To this end, we leverage the Gini importance metric used in the random forest and adapt it for GRF.
In our implementation, we construct a random forest with c i decisions using features in the vicinity of each node i ∈ V, making it easy to compute feature importance by aggregating results from all forests. To compute feature importance, we initialize a vector I = [imp 1 , imp 2 , . . ., impp] T with zeros for each element. For each node i with non-zero c i , we build a random forest F i with c i trees using nodes in Neighbor(v, k). From F i , we obtain the Gini importance j represents the Gini importance of node j estimated from forest F i . We update I as imp j = impj + imp . This process allows us to access the feature importance for all variables, enabling us to select high-ranking features.

Details of Model Setting
The training of the GRF algorithm consists of two parts. In the first part, we trained a simple random forest where each tree had a depth of one. Using this random forest fitted on the training dataset, we recorded the number of occurrences of each node as a head-splitting node and embedded this information in GRF. Then we trained GRF on the training dataset with 500 trees and default parameter values in the second part. We did not fine-tune for the best model on specific data as our primary interest was in evaluating the ability to find important variables and studying the clustering properties of the selected sub-graphs.

Simulation Setting
To simulate disease classification, we conducted a series of experiments using gene expression data and a gene network. We compared the performance of GRF with that of the standard random forest. The simulation study aimed to examine whether GRF could identify an effective sub-graph more accurately.

Synthetic Data Generation
In our simulation studies, we generated a scale-free graph for p features using the Barab'asi-Albert (BA) model [17], which captures the degree distribution of biological networks characterized by a power law with parameter m. We set the power law parameter to 0.5 in our simulations. We then used the generated feature network to calculate the shortest distances between all pairs of vertices, which we represented a matrix D ∈ R p×p . Finally, we derived the covariance matrix Σ for the p features using the formula: The diagonal elements of Σ were one since the distance from one node to itself was zero. With the feature graph and covariance matrix Σ, we simulated expression data X = [x 1 , x 2 , . . ., x n ] T with p features. A multivariate Gaussian distribution with mean zero was used to generate X, for each sample x i , In this way, two features tended to have more significant covariance if they were close to each other. We adopted a strategy that favored significant covariance between two features if they were close to each other in the generated network. This approach allowed us to select a subset of true predictors that generated the outcome Y, corresponding to different disease outcomes. To identify potential core features, we first ranked the features according to their node degree, denoting it as d 1 ≥ d 2 ≥, . . ., ≥ n p . We then used an average strategy to calculate the average degree in three steps asd i = (d i + d i−1 + d i−2 )/3, i = 3, 4, . . ., p. We determined the change point as the first feature whose degree d t decreased more than 10% compared to the averaged value, which was (d t − d t−1 )/d t > −0.01. Finally, we chose potential core features with a node degree more significant than d t . Our approach considered the scales and characteristics of different graphs, which might vary, and prioritized the identification of features with significantly higher node degrees.
In the simulation study, we conducted experiments with randomly chosen core features from the pool of potential core features. To expand the clique starting from the core features, we iteratively chose at most m neighboring features each time. To avoid the selected sub-graph becoming too dense, we assigned an attenuation rate to the parameter m. This process produced a collection of true predictors S, which formed a sub-graph. We denoted the size of S as p 0 , and we sampled the parameter β = (β 1 , β 2 , . . ., β p 0 ) T from a uniform distribution ranging from 0.1 to 0.8, with some of the values set to negative. We generated the output Y using a generalized linear model as follows: where σ was the logistic link function or absolute link function, To evaluate the performance of our proposed method GRF, we conducted a comparative study with RF using expression data X consisting of 4000 features and 500 samples. In real gene data, the true predictors are typically a small fraction of the total number of genes, typically around 0.5%. To simulate this scenario, we generated Y with varying numbers of true predictors, namely 30, 60, 90, 120, 150, 180, and 210. In addition to the number of true predictors, we also considered different sub-graph shapes. While a clustered sub-graph is typically formed by one clique, in some cases, it may appear as two cliques with minimal connection between them. Therefore, we evaluated the performance of our method on two different clustering shapes of the sub-graph, namely, one core and two cores. In the case of two core features, we ensured that the distance between the two core features was no less than four to obtain two distinct cliques. We used logistic and absolute link functions to test the GRF method's effectiveness under different relationship assumptions between X and Y.

Real Datasets
For our study, we obtained gene expression data from two different datasets: the Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO). Specifically, we used data from the TCGA for two common subtypes of lung cancer-lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). These datasets were accessed through the TCGA website (https://www.cancer.gov/ccg/research/genome-sequencing/tcga (accessed on 26 June 2023)). In addition, we utilized the human embryonic stem cell RNA-seq dataset (GSE93593) from the GEO database (https://www.ncbi.nlm.nih.gov/geo/query/ acc.cgi?acc=GSE93593 (accessed on 26 June 2023)).

Simulation Results
In our simulation study, we first trained a simple random forest with one-depth trees to determine head nodes in GRF and then built up 500 graph-embedded decision trees. For the vanilla RF model, we also set the number of trees to 500, with a maximum depth of 10 for each tree.
In our simulation, we conducted experiments using Python as our programming language. For each simulation setting, we generated ten datasets and then applied GRF and RF. Each dataset was randomly divided into a training set and a testing set using a 7:3 ratio. The model was trained on the training set and made predictions on the testing set. For each experiment, the computation time for GRF was around 7 s on a Linux workstation with a 5950× CPU, 128 Gb RAM, and a GTX 3060 GPU. In the simulation study, we were access to the specific list of true predictors contributing to the output Y. To evaluate the classification models, we employed test accuracy as the performance metric. This measure quantifies the classification ability of the models by comparing their predictions against the ground truth labels. To estimate the power of estimated feature importance, we utilized two widely recognized metrics: Area Under the Curve (AUC) and Precision-Recall AUC (PR-AUC). These metrics were calculated based on the estimated feature importance values and the ground truth effective feature list. The AUC provides a comprehensive measure of overall model performance, while PR-AUC focuses on the relationship between precision and recall. By incorporating both metrics, we aimed to obtain a comprehensive assessment of feature importance. Figure 2 shows the results with logistic link function. The first row demonstrates the performance with a true sub-graph extended from one core node, and the second row shows the results of a true sub-graph extended from two core nodes. The error bar represents the estimated standard deviation estimated from ten experiments. Figure 2a,d show the test accuracy using RF and GRF with hopping steps 1, 2, and 3. In the case of one core node, GRF with hopping step 3 had the highest test accuracy when the number of true predictors was less than 100, and GRF 2 was higher when the size of true predictors grew larger. RF achieved the highest score in the setting with 210 true predictors. In the case of two core nodes, GRF with hopping steps 2 and 3 achieved the highest score in more than half of the cases. In general, when the hopping step was set to one, the test accuracy was lower compared to random forest. However, when the hopping step was set to two, both methods showed similar performance. Notably, when the hopping step was increased to three, our method outperformed random forest in terms of performance. Overall, there is minimal variation in test accuracy among the different methods, and an upward trend in prediction accuracy is evident as the number of true predictors increases. The second column and third column of Figure 2 are AUC and PR-AUC with different sizes of core nodes. The estimated feature importance and a list with true predictors as one and other features as 0 were used for calculating the two metrics. The results of AUC using two/three hopping step GRF in one core simulation were mostly higher than 0.9, which was a huge improvement compared to the value in RF, which is around 0.6-0.75. Simulation results using the absolute link function are shown in Figure 3, which shows similar patterns. In general, we observed that a higher hopping step resulted in better feature selection performance. Overall, the simulation results verify that GRF has the capability to identify more significant features without sacrificing prediction accuracy.
To explore the property of the sub-graph, which consists of high-ranking features, we selected the top 100 most important features and extracted their connections from the simulated network. Figure 4 shows the sub-graph density, number of connected components, and size of the largest connected component in the setting with the logistic link function. The two rows correspond to settings of the true sub-graph extended from one or two core nodes. Figure 4a,d show the graph density, which is defined as the ratio between the number of edges and the number of all possible edges. A larger density indicates a more connected graph. Results show that GRF has a larger density in all experiments. On the contrary, sub-graphs generated using GRF are highly connected and have large connected components. When the hopping step equals two or three, the largest connected component contains more than 80 nodes in different experimental settings. Results of using the absolute link function are shown in Figure 5, which demonstrate similar patterns.  Concerning the robustness in reproducibility of feature selection using GRF, we were curious whether selected features were stable across different model training times. To explore this property, we simulated a dataset with 500 samples, 4000 features, and 210 true features that determined the classes Y using a logistic link function. A GRF with hopping step two was used here. We repeated the training 20 times and recorded the top 100 features each time. In the 20 sets of most important features, 58 appeared more than 14 times, and 86 appeared more than ten times. Among the 58 features that appeared more than 14 times, 53 were on the list of true predictors, and the other five features were within the neighborhood of one step to true predictors. Among the 86 features that appeared more than ten times, 75 were true predictors, and the others were all around the true predictors in one step. For the union of the 20 lists of selected features, 46.4% of them were true predictors, and 89.3% of them were within one step neighborhood of true predictors.

Non-Small Cell Lung Cancer Data Results
We applied our GRF method to distinguish two types of most common subtypes of lung cancer-lung adenocarcinoma (LUAD) [18] and lung squamous cell carcinoma (LUSC) [19] from The Cancer Genome Atlas. Lung cancer is one of the deadliest cancers nowadays. However, it is still unclear how these two subtypes differ in biological mechanisms, and they are still treated equally as non-small cell lung cancer (NSCLC). We tried to identify the differences between these two subtypes and analyze the biological mechanisms using GRF. The LUAD dataset consisted of an expression matrix with 23,032 miRNA expressions in 524 patients, and the LUSC dataset contained an expression matrix with 23,652 miRNA expressions in 496 observations. Combining the two datasets of different subtypes using overlapped features and selecting the largest connected component of miRNA network obtained from HINT [15], we eventually obtained an expression matrix with 9819 miRNA features of 1020 samples. The label for LUAD was marked as one and LUSC as zero, correspondingly. We used the processed data for downstream analysis.
Using the data with their corresponding network, we conducted GRF and RF for comparison. When training the model, we split the dataset into the training set and testing set with the ratio of 7:3. We used GRF with a hopping step of two and evaluated the performance based on an averaged score. For each method, we conducted twenty experiments and summarized the performances. The first row in Table 1 shows that the two methods had good performance on the classification task with high predicting accuracy. The accuracy of the testing dataset for GRF and RF was around 0.94, indicating a different expression pattern of miRNA. It was reasonable to distinguish the two subtypes of lung cancer using this dataset. From the result, we figured out that further accuracy improvement is difficult and meaningless since the performance almost reached an upper bound. Though our proposed GRF had a tiny gap in accuracy compared to RF, GRF had advantages in identifying potentially important features which had been proved in the simulation study. Moreover, features selected by GRF were more likely to fall into cliques on the original graph and could generate a sub-graph with fewer connected components. A highly connected graph was preferred because of its similarity to the ones found in experimental studies. By choosing 100 features with the highest importance scores, we generated the sub-graph shown in Figure 6a. Table 1 also exhibits selected sub-graph properties through GRF and RF. The second row is the number of connected components, and the third row represents the size of the largest connected component. The fourth and fifth rows are the average shortest distance and the average distance in the largest components between each node. The results illustrate that the sub-graph selected by GRF had fewer connected components and possessed a larger connected component with a reasonable distance.  Functional analysis of genes selected by GRF was conducted by testing the enrichment of gene ontology (GO) biological processes using the clusterProfiler package [20]. The biological processes with p-values less than 0.01 and adjusted p-values less than 0.05 were considered significant pathways in our study. The top 15 GO terms are shown in Table 2. Twelve of the 100 selected genes belonged to the regulation of DNA metabolic process, which was the most important GO term. In addition, 'DNA ligation', 'somatic DNA recombination', and 'regulation of DNA biosynthesis processes' were among the top terms. Changes in DNA function and damage affect cell proliferation and differentiation, which may influence cancer progression [21]. Meanwhile, overexpression of CDK2 and CDK16 in LUSC has been proved to cause abnormal regulation of cell cycle and promote cell proliferation. These effects may increase the malignant potential of the tumor and lead to a faster growth speed compared to the progression of LUAD [22,23]. The second-most important term was 'telomere maintenance', and there were many terms among the top ones related to telomeres, such as 'telomere organization', 'negative regulation of telomere maintenance via telomerase', and 'telomere elongation control'. Limiting telomeres from shortening is one of the significant mechanisms by which cancer cells gain resistance to inhibition. The ability to maintain telomeres above a critical length represents the degree of cell deterioration [24]. Cancer cells can resist death and realize the immortality of replication through activating the telomerase [25]. LUAD and LUSC have different expression levels of telomere-related genes, so telomere maintenance genes are considered to be potential biomarkers for two subtypes. At the same time, a vaccine against telomerase named GV1001 has been proved to be beneficial to immunotherapy for NSCLC patients [26].
The remaining significant GO terms include 'T cell proliferation and activation', 'immune response modulation', 'embryonic morphological development', 'endoderm development', and so on. Studies on immune-related genes (IRGs) have found that T-cell receptor signaling expresses differently in two subtypes [27]. The MHC molecule, which is crucial for antigen processing in immune responses, and chemokine, which guides cell migration, are found to be inhibited more rapidly in LUSC. These observations confirm that LUSC grows faster by suppressing the immune system. HOX gene encoding is an important transcription factor in the embryonic development and differentiation of adult cells. Recent studies have shown that HOXA1 is significantly up-regulated and hypermethylated in LUAD [28]. The genes Hh and ErbB are found to be strongly correlated with two subtypes, which are related to lung development [29]. Hh maintains stem cells, responds to injury, and affects the formation of bronchial numbers. ErbB can cause defects of type II epithelial cells in the alveolar lining and reduce branching morphogenesis in embryos by affecting the expression level of anti-EGF antisense oligonucleotides. The whole table containing all GO terms for functional analysis using GRF and RF can be found in https://github.com/tianlq-prog/GRF/blob/main/Supplementary.pdf (accessed on 26 June 2023).

Human Embryonic Stem Cell Data Results
We also applied GRF on the human embryonic stem cell RNA-seq dataset from GEO (GSE93593) [30]. The dataset contained 23,045 genes from 1733 observations and their corresponding clinic information, including doublecortin (DCX) status and days of culture. We obtained the gene network from the HINT database [15]. After screening the genes in the HINT database and selecting the largest connected component of the network, 12,215 genes were finally selected. A log transformation was conducted on the expression value of each gene. Our goal was to explore the relationship between gene expression and the status of DCX, whether positive or negative. Therefore, the task became a binary classification problem.
Doublecortin (DCX) is a microtubule-associated protein expressed explicitly by immature neurons in embryonic and adult cortical structures. It is necessary for neuron migration and differentiation and is closely related to the development of the central nervous system. Since the expression of DCX changes two weeks before the appearance of new neurons, the richness of neurogenesis in the brain cannot be directly quantified. So, DCX is a powerful tool for identifying early and immature neurons. Therefore, research on the transient expression of DCX to help understand the development of the nervous system has received extensive attention.
We tested GRF and RF on a human embryonic stem cell dataset with twenty repeated experiments. The computation time for each experiment using GRF on the workstation was around 100 s. The classification accuracy results are shown in the first row of Table 3. The mean accuracy was high for each method, and GRF had slightly lower accuracy than RF. However, as shown in Table 3, the sub-graphs generated by GRF with the top 100 most important genes had higher connected properties than the ones using RF. Specifically, when using GRF, the number of connected components was much smaller, and the sub-graphs had larger connected components with an average value achieving 67.  1 The values in brackets correspond to the standard deviations. Figure 6b is the sub-graph of the top 100 genes with the highest averaged importance score using GRF. GO enrichment analysis was performed on the sub-graph, and the top 10 pathways are shown in Table 4. The top GO term was 'positive regulation of epithelial to mesenchymal transition'. In addition, 'homotypic cell-cell adhesion' and 'epithelial cell differentiation' were among the top terms. Epithelial cells have regular cell-cell contacts and adhesion to surrounding cellular structures, thus can avoid the separation of individual cells [31]. However, during embryonic development, cells need to migrate to adjacent tissues to form new organs, and tissues [32], so quiescent epithelial cells undergo epithelial-mesenchymal transition (EMT), thereby differentiate into motile mesenchymal cells [33] and possess the invasive ability. The response to EMT comes from stromal cells such as fibroblasts and mesenchymal stem cells. These stromal cells secrete a series of heterotypic signals, growth factors, platelet-derived growth factor (PDGF), and epidermal growth factor (EGF) [34]. This also explained the appearance of 'platelet aggregation', 'regulation of hematopoiesis', and 'platelet activation' in the top GO terms. Also, many of the genes involved in 'lung development,' 'respiratory system development,' and 'air duct development' are part of the response to growth factor stimulation, leading to the significance of these terms. The second most significant GO term was 'synaptic organization', and other important terms related to it included 'axon development' and 'axogenesis'. Synapses, responsible for transmitting information between neurons and target cells, play an essential role in nervous system development. The fetal brain begins to develop from the third week of gestation [35], neural precursor cells divide and form neurons and glia. Furthermore, the number of synapses keeps increasing in the first few years of life [36]. The fifth-ranked GO term was 'transmembrane receptor protein serine/threonine kinase signaling pathway'. The serine-threonine kinase Akt plays a central role in integrating cellular responses to growth factors [37], and it has been proven to maintain cellular integrity and protect 'tagged' from exposure. It is also involved in the phagocytic disposition of cells, in which it promotes neuronal and vascular survival and prevents induction of programmed cell death [38]. Overall, in the DCX classification task, GRF could identify important and easily interpretable sub-graphs.

Discussion
In this study, we proposed a novel method, Graph Random Forest (GRF), which integrates graph information into the random forest framework to improve the accuracy and interpretability of the model. We applied GRF to simulation data and two real datasets: one involving the classification of two subtypes of lung cancer and the other involving the classification of doublecortin status in human embryonic stem cells. Our results demonstrated that GRF had comparable classification accuracy to the traditional random forest method while generating a sub-graph with higher connectedness and providing more interpretable feature importance scores.
Our proposed method leverages the underlying graph structure of the data, which captures the dependencies and interactions between the features. By incorporating the graph information into the decision tree construction process, GRF can effectively identify and utilize the most informative features while reducing noise and irrelevant features. Our results showed that GRF was able to identify potentially important features that were more likely to fall into cliques on the original graph and could generate a sub-graph with fewer connected components, which are more similar to those found in experimental studies.
The biological insights obtained from our experiments provide evidence for the effectiveness of GRF in identifying important features and their interactions. For example, in the lung cancer dataset, GRF identified genes related to DNA metabolic processes and telomere maintenance, which are known to be associated with cancer progression and resistance to inhibition. In the human embryonic stem cell dataset, GRF identified genes related to epithelial-mesenchymal transition, axon development, and synapse organization, which are critical for nervous system development. These findings demonstrate the ability of GRF to uncover important biological mechanisms and potential biomarkers that could be useful for further experimental validation.
In summary, our proposed method, GRF, provides a powerful tool for analyzing complex data with graph structure. By incorporating graph information into the random forest framework, GRF can effectively improve the accuracy and interpretability of the model, as demonstrated by our experiments on lung cancer and human embryonic stem cell datasets. We believe that our method can also be widely applied to other fields, such as social network analysis and image analysis, where the data have an inherent graph structure.

Conclusions
We presented a new version of random forest which embeds graph information. It has the ability to identify important features with the property of high connectivity in the original information graph. A simulation study verified that our method could find features with higher AUC and PR-AUC without losing much classification accuracy. Two real applications showed the power of selection, which helps reveal meaningful biological mechanisms.