# A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Random Forest and Iterative Random Forest Methods

#### 2.2. Implementation of iRF in C++

#### 2.3. iRF-LOOP: iRF Leave One Out Prediction

#### 2.4. Big Data: Showing the Scale of iRF with Arabidopsis thaliana SNP Data

#### 2.5. Using iRF-LOOP to Create Predictive Expression Networks

#### 2.6. Comparison of R to C++ Code

#### 2.7. Computational Resources

## 3. Results

#### 3.1. Comparison of the R to C++ Code

#### 3.2. Scaling Results for Big Data: Arabidopsis thaliana SNPs to Gene Expression

#### 3.3. Predictive Expression Networks

## 4. Discussion

## 5. Software Availability

## Supplementary Materials

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
iRF | Iterative Random Forest |
iRF-LOOP | iRF Leave One Out Prediction |
RF | Random Forest |
MPI | The Message Passing Interface |
HPC | High-Performance Computing |
GO | Gene Ontology |
SNP | Single Nucleotide Polymorphism |
eQTL | Expression Quantitative Trait Loci |
X-AI | Explainable Artificial Intelligence |
RIT | Random Intersection Trees |

**Figure 1.** The diagram shows the iRF-LOOP process for a set of expression profiles, which produces a Predictive Expression Network. Each gene is independently treated as the target of an iRF run, with all other genes as predictors. iRF assigns an importance score to each predictor gene, and that score serves as the weight of the network edge between the predictor and the target. The importance scores are then combined into an edge matrix, which provides a value for each possible connection and from which a network can be generated. Generally, the weights are thresholded at some value, determined through other means, and only edges with large enough weights are included in the final network. Because a prediction is inherently directional, the edges are weighted and are not likely to be symmetric.
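The loop described in this caption can be sketched in a few lines of Python. This is only an illustration of the leave-one-out structure, not the C++ implementation: the function `standin_importance` is a hypothetical placeholder that scores predictors by normalized absolute Pearson correlation, where a real run would call iRF, and the gene names and expression values are invented for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def standin_importance(predictors, target):
    """ASSUMPTION: placeholder for an iRF run. Returns one score per predictor,
    normalized so the scores for a given target sum to 1."""
    raw = {gene: abs(pearson(values, target)) for gene, values in predictors.items()}
    total = sum(raw.values())
    return {gene: (s / total if total else 0.0) for gene, s in raw.items()}

def loop_edge_matrix(expr, importance_fn=standin_importance):
    """Treat each gene as the target once, with all other genes as predictors.
    Returns a dict mapping (predictor, target) to an edge weight."""
    edges = {}
    for target in expr:
        predictors = {g: v for g, v in expr.items() if g != target}
        for pred, score in importance_fn(predictors, expr[target]).items():
            edges[(pred, target)] = score
    return edges

def threshold_edges(edges, top_fraction):
    """Keep only the highest-weighted fraction of edges (at least one edge)."""
    keep = max(1, int(len(edges) * top_fraction))
    ranked = sorted(edges.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:keep])

# Invented expression profiles: one list of sample values per gene.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}
edges = loop_edge_matrix(expr)
network = threshold_edges(edges, 0.5)
```

Because each target's scores are normalized separately, the weight for `("geneA", "geneB")` generally differs from the weight for `("geneB", "geneA")`, which mirrors the asymmetry noted in the caption.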

**Figure 2.** Each of these graphs shows the total run time as the number of threads increases. Both the R code and the C++ code were run on Summit. Note that for 5000 trees, the R implementation failed to complete with fewer than four threads.

**Figure 3.** These graphs show a different orientation of the data from Figure 2. Each graph shows the total run time as the number of trees increases, while the number of features and the number of threads stay constant. Because the 5000-tree runs did not complete with the R code on 1, 2, or 3 threads, those graphs are missing points.

**Figure 4.** The graph shows the run times for four different numbers of compute nodes, each completing 1000 trees for the 1.7 million SNPs.

**Figure 5.** The graph shows the run time for five different feature sizes, on a single CPU of a standard MacBook Pro laptop. Each point represents the average of three runs. A linear regression was fit, with the equation shown. The fit is not perfect, but it is sufficient to indicate that the run time increases approximately linearly with the number of features.
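A fit of this kind can be reproduced with ordinary least squares on the per-size averages. The sketch below is a minimal illustration; the timing values are invented for the example and are not the measurements plotted in the figure, and the helper names `mean` and `fit_line` are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared

# HYPOTHETICAL run times in seconds, three runs per feature count.
runs = {
    1000: [10.2, 9.8, 10.0],
    2000: [20.5, 19.9, 20.2],
    4000: [41.0, 39.6, 40.3],
    8000: [79.5, 81.2, 80.4],
    16000: [163.0, 160.1, 161.8],
}
features = sorted(runs)
avg_times = [mean(runs[f]) for f in features]  # each point is the average of three runs
slope, intercept, r2 = fit_line(features, avg_times)
print(f"time ~= {slope:.5f} * features + {intercept:.2f} (R^2 = {r2:.4f})")
```

An R^2 close to 1 on the averaged points is the kind of evidence the caption refers to when it calls the scaling approximately linear.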

**Figure 6.** Graph (**a**) provides the total run time for the C++ code on Summit, with various tree and thread counts, for 40,000 features. Graph (**b**) provides a comparison of the C++ code on Summit and Titan, two HPC systems. For both graphs, run time is in seconds.

**Figure 7.** The network shown is a small example from the iRF Predictive Expression Network overlaid with the GO process network. The nodes represent the genes. The black edges represent the iRF edges, which are directed from the feature to the predicted target. The colored edges represent different GO associations between genes, meaning that the connected genes share a GO term. Using the provided GO edge weights, this network has an intersect score of 0.0714, which comes from connections DE and FE, the connections that have both an iRF edge and a GO edge.
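One way to read the intersect score is as the total GO edge weight over gene pairs that are also joined by an iRF edge. The sketch below works under that assumption, which the caption suggests but does not spell out. The function name `intersect_score`, the choice to ignore iRF edge direction when matching pairs, and the weights are all hypothetical, so the result does not reproduce the 0.0714 in the figure.

```python
def intersect_score(irf_edges, go_edges):
    """ASSUMED definition: sum the GO weights of every GO edge whose two genes
    are also connected by an iRF edge in either direction.

    irf_edges: iterable of (feature, target) pairs (directed).
    go_edges:  iterable of (gene_a, gene_b, weight) triples (undirected); a pair
               may appear more than once if the genes share several GO terms.
    """
    irf_pairs = {frozenset(edge) for edge in irf_edges}
    return sum(w for a, b, w in go_edges if frozenset((a, b)) in irf_pairs)

# Hypothetical toy network with nodes A-F.
irf_edges = [("A", "B"), ("B", "C"), ("D", "E"), ("F", "E")]
go_edges = [("E", "D", 0.05), ("E", "F", 0.02), ("A", "C", 0.10)]

score = intersect_score(irf_edges, go_edges)
print(f"intersect score: {score:.4f}")
```

In this toy case only the D-E and F-E pairs carry both kinds of edge, so only their GO weights contribute; the A-C GO edge is ignored because no iRF edge joins A and C directly.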

**Figure 8.** Graph (**a**) shows the null distribution histogram (blue) and the iRF network score (red) for the top 10% of edges. Graph (**b**) shows the null distribution histogram (blue) and the iRF network score (red) for the top 0.1% of edges. Please note that the x-axis is different for the two graphs. Each distribution was calculated from 1000 random permutations.
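A permutation null of this kind can be sketched as follows. The caption does not say what was permuted, so this example assumes that gene labels of the GO network are shuffled while the iRF edges stay fixed. It also assumes a simple overlap score (total GO weight on pairs that also have an iRF edge). The names `shared_weight` and `null_distribution` and all data values are hypothetical.

```python
import random
import statistics

def shared_weight(directed_edges, weighted_pairs):
    """ASSUMED score: total weight of pairs that also appear as a directed edge."""
    pairs = {frozenset(e) for e in directed_edges}
    return sum(w for a, b, w in weighted_pairs if frozenset((a, b)) in pairs)

def null_distribution(directed_edges, weighted_pairs, n_perm=1000, seed=0):
    """Score n_perm random relabelings of the weighted network's nodes.
    ASSUMPTION: node-label shuffling is the permutation scheme."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    nodes = sorted({n for e in directed_edges for n in e}
                   | {n for a, b, _ in weighted_pairs for n in (a, b)})
    scores = []
    for _ in range(n_perm):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(nodes, shuffled))
        permuted = [(relabel[a], relabel[b], w) for a, b, w in weighted_pairs]
        scores.append(shared_weight(directed_edges, permuted))
    return scores

# Hypothetical toy network with nodes A-F.
irf_edges = [("A", "B"), ("B", "C"), ("D", "E"), ("F", "E")]
go_edges = [("E", "D", 0.05), ("E", "F", 0.02), ("A", "C", 0.10)]

observed = shared_weight(irf_edges, go_edges)
null = null_distribution(irf_edges, go_edges, n_perm=1000, seed=0)
null_mean = statistics.mean(null)
null_sd = statistics.stdev(null)
print(f"observed {observed:.4f}, null mean {null_mean:.4f}, null s.d. {null_sd:.4f}")
```

The observed score is then compared against the histogram of `null`, in the same way the red line is compared against the blue histogram in each panel.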

**Figure 9.** The null distribution histogram of the iRF network is shown in blue, with the network score in red. The co-expression null distribution is shown in orange, with the corresponding network score also in orange.

**Table 1.** The table provides the graph results for the four thresholded Predictive Expression Networks (PENs), as well as the co-expression (COEX) comparison network. The listed mean and standard deviation are for the corresponding null distributions, as pictured in Figure 8 for the PENs. The p-values for the listed t-statistics were effectively zero.

Network | Nodes | Edges | Intersect Score | Null Dist Mean | Null Dist s.d. | t-Statistic |
---|---|---|---|---|---|---|
0.1% PEN | 26,617 | 57,112 | 59.74 | 0.9831 | 0.2597 | 226.27 |
1% PEN | 38,758 | 563,887 | 213.28 | 9.6930 | 0.8720 | 233.47 |
5% PEN | 39,349 | 2,795,636 | 484.07 | 48.1309 | 2.0784 | 209.74 |
10% PEN | 39,349 | 5,846,200 | 692.08 | 100.5038 | 2.9316 | 201.79 |
0.1% COEX | 6261 | 312,030 | 34.91 | 7.7701 | 1.5668 | 17.32 |
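The listed t-statistics appear to be consistent with the intersect score standardized against its null distribution, that is, (score − null mean) / null s.d. This is an observation from the table values, not a formula stated in the caption. The short check below recomputes that quantity from the rounded table entries; the function name `standardized_score` is hypothetical.

```python
def standardized_score(score, null_mean, null_sd):
    """Distance of the observed score from the null mean, in null s.d. units."""
    return (score - null_mean) / null_sd

# (network, intersect score, null mean, null s.d., listed t-statistic) from Table 1.
rows = [
    ("0.1% PEN", 59.74, 0.9831, 0.2597, 226.27),
    ("1% PEN", 213.28, 9.6930, 0.8720, 233.47),
    ("5% PEN", 484.07, 48.1309, 2.0784, 209.74),
    ("10% PEN", 692.08, 100.5038, 2.9316, 201.79),
    ("0.1% COEX", 34.91, 7.7701, 1.5668, 17.32),
]

for name, score, mu, sd, listed in rows:
    computed = standardized_score(score, mu, sd)
    print(f"{name}: computed {computed:.2f}, listed {listed:.2f}")
```

Small differences in the last digit are expected because the table entries are themselves rounded.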

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Cliff, A.; Romero, J.; Kainer, D.; Walker, A.; Furches, A.; Jacobson, D. A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks. *Genes* **2019**, *10*, 996.
https://doi.org/10.3390/genes10120996
