# Optimal microRNA Sequencing Depth to Predict Cancer Patient Survival with Random Forest and Cox Models


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Overview of the Methodology

#### 2.2. Cox Model with Elastic Net Penalty and Random Survival Forest: The Link between Genetic and Survival Data

#### 2.2.1. Cox Proportional Hazards Model with Elastic Net Penalty

#### 2.2.2. Random Survival Forest

#### 2.3. Prediction Performance Metrics

- $\widehat{\mathrm{RS}}_{i}=\widehat{\boldsymbol{\beta}}^{T}\boldsymbol{X}_{i}$ for the Cox model, with $\widehat{\boldsymbol{\beta}}$ the estimator of the coefficients and $\boldsymbol{X}_{i}$ the gene expression vector of patient $i$.
- $\widehat{\mathrm{RS}}_{i}=\frac{1}{\mathrm{Card}(\mathrm{T})}\sum_{j\in \mathrm{T}}\widehat{H}\left(t_{j}\mid \boldsymbol{X}_{i}\right)$ for the random survival forest, with $\mathrm{Card}(\mathrm{T})$ the cardinality of $\mathrm{T}$, $\widehat{H}\left(t\mid \boldsymbol{X}_{i}\right)$ the estimated cumulative hazard function at time $t$ for patient $i$, and $\mathrm{T}$ the set of times at which the hazard function is estimated.
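The two risk scores defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cox_risk_score` and `rsf_risk_score` and all numeric values are hypothetical, and the cumulative hazard values are assumed to have already been produced by a fitted random survival forest at the evaluation times in $\mathrm{T}$.

```python
from typing import Sequence


def cox_risk_score(beta: Sequence[float], x: Sequence[float]) -> float:
    """Cox risk score: the linear predictor beta^T x for one patient."""
    if len(beta) != len(x):
        raise ValueError("beta and x must have the same length")
    return sum(b * v for b, v in zip(beta, x))


def rsf_risk_score(cumulative_hazards: Sequence[float]) -> float:
    """RSF risk score: mean of H(t_j | x) over the evaluation times t_j in T."""
    if not cumulative_hazards:
        raise ValueError("need at least one evaluation time")
    return sum(cumulative_hazards) / len(cumulative_hazards)


# Hypothetical values, for illustration only.
beta_hat = [0.5, -1.0, 0.0]   # estimated Cox coefficients (assumed)
x_i = [2.0, 1.0, 3.0]         # expression vector of patient i (assumed)
chf_i = [0.1, 0.3, 0.8]       # estimated H(t_j | x_i) at three times (assumed)

print("Cox risk score:", cox_risk_score(beta_hat, x_i))
print("RSF risk score:", rsf_risk_score(chf_i))
```

Under both definitions a larger score corresponds to a higher predicted risk, so the two models can be compared with the same rank-based metrics.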

#### 2.4. The Cancer Genome Atlas and E-MTAB-1980 Datasets

#### 2.5. Integration of miRNA-seq Data Together with Clinical Data

#### 2.6. Degradation of miRNA-seq Data

#### 2.6.1. Subsampling of miRNA-seq Data

#### 2.6.2. Reduction of the Number of Patients in the Training Dataset

## 3. Results

#### 3.1. Library Sizes of mRNA-seq Data Are Ten Times Larger Than the Ones of miRNA-seq Data

#### 3.2. C-Index Highlighted Noticeable Prediction Differences between Cox and Random Survival Forest Models for Eight out of Twenty-Five Cancers

#### 3.3. mRNA-seq Data Provides Slightly Better Prediction Performance Than miRNA-seq Data for Most of the 11 Investigated Cancers

#### 3.4. miRNA-seq Data Improves Predictions over Clinical Data Alone for Most of the Investigated Cancers

#### 3.5. Shallow Tumor miRNA or mRNA Sequencing Keeps Survival Prediction Performance for Many Cancers

#### 3.6. Models Trained with Fewer Patients Do Not Degrade Prognosis for Most of the Investigated Cancers

#### 3.7. Very Small Sequencing Depth Is Responsible for the Performance Loss

- (1) The number of detected genes decreases as the subsampling rate increases (Supplementary Figure S12A, [38]), and only the expression of the most highly expressed genes can still be measured (Supplementary Figure S12B). Genes with a low level of expression may nonetheless carry significant predictive power; when they go undetected, overall predictive performance diminishes.
- (2) More generally, the signal-to-noise ratio decreases for all genes: the standard deviation of the measurements scales as $\sqrt{N}$, with $N$ the number of aligned reads per gene, so the relative noise scales as $1/\sqrt{N}$ and grows as reads are removed.
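Both effects can be reproduced with a small simulation. The sketch below assumes that subsampling keeps each aligned read independently with probability $1/\delta$ (binomial thinning). That is an assumption of this example and not necessarily the procedure used in the study. The per-gene counts and the helper names `subsample_counts` and `relative_noise` are hypothetical.

```python
import math
import random


def subsample_counts(counts, fold, rng):
    """Binomial thinning: keep each aligned read with probability 1/fold."""
    keep = 1.0 / fold
    return [sum(1 for _ in range(c) if rng.random() < keep) for c in counts]


def relative_noise(n_reads):
    """Counting noise: sd ~ sqrt(N), so sd / N = 1 / sqrt(N) (infinite if N = 0)."""
    return 1.0 / math.sqrt(n_reads) if n_reads > 0 else float("inf")


# Hypothetical per-gene read counts for one sample, from high to low expression.
counts = [20000, 2000, 200, 20, 2]
rng = random.Random(0)  # fixed seed so the run is reproducible

for fold in (1, 10, 100, 1000):
    thinned = subsample_counts(counts, fold, rng)
    detected = sum(1 for c in thinned if c > 0)
    noise_top = relative_noise(thinned[0])
    print(f"fold={fold:>4}  detected genes={detected}  "
          f"relative noise (top gene)={noise_top:.3f}")
```

As the fold reduction grows, weakly expressed genes drop to zero reads first, and the relative noise of the genes that remain detectable increases.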

#### 3.8. Prognostic Performances Follow Similar Trend after Subsampling When Tested on an Independent Dataset

## 4. Discussion

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Bartel, D.P. Metazoan microRNAs. Cell **2018**, 173, 20–51.
2. Peng, Y.; Croce, C.M. The role of MicroRNAs in human cancer. Signal Transduct. Target. Ther. **2016**, 1, 15004.
3. Chu, A.; Robertson, G.; Brooks, D.; Mungall, A.J.; Birol, I.; Coope, R.; Ma, Y.; Jones, S.; Marra, M.A. Large-scale profiling of microRNAs for the cancer genome atlas. Nucleic Acids Res. **2016**, 44, e3.
4. Capula, M.; Mantini, G.; Funel, N.; Giovannetti, E. New avenues in pancreatic cancer: Exploiting microRNAs as predictive biomarkers and new approaches to target aberrant metabolism. Expert Rev. Clin. Pharmacol. **2019**, 12, 1081–1090.
5. Cox, D.R. Regression models and life-tables. J. R. Stat. Soc. Ser. B (Methodol.) **1972**, 34, 187–202.
6. Jardillier, R.; Chatelain, F.; Guyon, L. Bioinformatics Methods to Select Prognostic Biomarker Genes from Large Scale Datasets: A Review. Biotechnol. J. **2018**, 13, 1800103.
7. Zou, H.; Hastie, T. Regularization and variable selection via the elastic-net. J. R. Stat. Soc. **2005**, 67, 301–320.
8. Jardillier, R.; Koca, D.; Chatelain, F.; Guyon, L. Prognosis of lasso-like penalized Cox models with tumor profiling improves prediction over clinical data alone and benefits from bi-dimensional pre-screening. BMC Cancer **2022**, 22, 1045.
9. Probst, P.; Wright, M.N.; Boulesteix, A. Hyperparameters and tuning strategies for random forest. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. **2019**, 9, e1301.
10. Ishwaran, H.; Kogalur, U.B.; Blackstone, E.H.; Lauer, M.S. Random survival forests. Ann. Appl. Stat. **2008**, 2, 841–860.
11. Wright, M.N.; Ziegler, A.; König, I.R. Do little interactions get lost in dark random forests? BMC Bioinform. **2016**, 17, 145.
12. Kourou, K.; Exarchos, T.P.; Exarchos, K.P.; Karamouzis, M.V.; Fotiadis, D.I. Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. **2015**, 13, 8–17.
13. Milanez-Almeida, P.; Martins, A.J.; Germain, R.N.; Tsang, J.S. Cancer prognosis with shallow tumor RNA sequencing. Nat. Med. **2020**, 26, 188–192.
14. Breslow, N. Contribution to the Discussion of the Paper by D.R. Cox. J. R. Stat. Soc. B **1972**, 34, 216–217.
15. Friedman, J.H.; Hastie, T.; Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. **2010**, 33, 1–22.
16. Breiman, L. Random Forests. Mach. Learn. **2001**, 45, 5–32.
17. Wright, M.N.; Ziegler, A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J. Stat. Softw. **2017**, 77, 1–17.
18. Harrell, F.E., Jr.; Lee, K.L.; Mark, D.B. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Med. **1996**, 15, 361–387.
19. Pencina, M.J.; D’Agostino, R.B. Overall C as a measure of discrimination in survival analysis: Model specific population value and confidence interval estimation. Stat. Med. **2004**, 23, 2109–2123.
20. Gerds, T.A.; Schumacher, M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom. J. **2006**, 48, 1029–1040.
21. Schroder, M.S.; Culhane, A.C.; Quackenbush, J.; Haibe-Kains, B. survcomp: An R/Bioconductor package for performance assessment and comparison of survival models. Bioinformatics **2011**, 27, 3206–3208.
22. Mogensen, U.B.; Ishwaran, H.; Gerds, T.A. Evaluating Random Forests for Survival Analysis Using Prediction Error Curves. J. Stat. Softw. **2012**, 50, 1–23.
23. Liu, J.; Lichtenberg, T.; Hoadley, K.A.; Poisson, L.M.; Lazar, A.J.; Cherniack, A.D.; Kovatich, A.J.; Benz, C.C.; Levine, D.A.; Lee, A.V.; et al. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell **2018**, 173, 400–416.e11.
24. Robinson, M.D.; Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. **2010**, 11, R25.
25. Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics **2010**, 26, 139–140.
26. Ritchie, M.E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. **2015**, 43, e47.
27. Sato, Y.; Yoshizato, T.; Shiraishi, Y.; Maekawa, S.; Okuno, Y.; Kamura, T.; Shimamura, T.; Sato-Otsubo, A.; Nagae, G.; Suzuki, H.; et al. Integrated molecular analysis of clear-cell renal cell carcinoma. Nat. Genet. **2013**, 45, 860–867.
28. Volkmann, A.; De Bin, R.; Sauerbrei, W.; Boulesteix, A.-L. A plea for taking all available clinical information into account when assessing the predictive value of omics data. BMC Med. Res. Methodol. **2019**, 19, 162.
29. López de Maturana, E.; Alonso, L.; Alarcón, P.; Martín-Antoniano, I.A.; Pineda, S.; Piorno, L.; Calle, M.L.; Malats, N. Challenges in the Integration of Omics and Non-Omics Data. Genes **2019**, 10, 238.
30. De Bin, R.; Boulesteix, A.-L.; Benner, A.; Becker, N.; Sauerbrei, W. Combining clinical and molecular data in regression prediction models: Insights from a simulation study. Briefings Bioinform. **2019**, 21, 1904–1919.
31. Robinson, D.G.; Storey, J.D. subSeq: Determining Appropriate Sequencing Depth Through Efficient Read Subsampling. Bioinformatics **2014**, 30, 3424–3426.
32. Tarazona, S.; García-Alcalde, F.; Dopazo, J.; Ferrer, A.; Conesa, A. Differential expression in RNA-seq: A matter of depth. Genome Res. **2011**, 21, 2213–2223.
33. Bass, A.J.; Robinson, D.G.; Storey, J.D. Determining sufficient sequencing depth in RNA-Seq differential expression studies. bioRxiv **2019**.
34. Ricketts, C.J.; De Cubas, A.A.; Fan, H.; Smith, C.C.; Lang, M.; Reznik, E.; Bowlby, R.; Gibb, E.A.; Akbani, R.; Beroukhim, R.; et al. The Cancer Genome Atlas Comprehensive Molecular Characterization of Renal Cell Carcinoma. Cell Rep. **2018**, 23, 313–326.e5.
35. Ternès, N.; Rotolo, F.; Heinze, G.; Michiels, S. Identification of biomarker-by-treatment interactions in randomized clinical trials with survival outcomes and high-dimensional spaces. Biom. J. Biom. Z. **2017**, 59, 685–701.
36. Wei, H.; Zhang, J.-J.; Tang, Q.-L. MiR-638 inhibits cervical cancer metastasis through Wnt/beta-catenin signaling pathway and correlates with prognosis of cervical cancer patients. Eur. Rev. Med. Pharmacol. Sci. **2017**, 21, 5587–5593.
37. Roelants, C.; Pillet, C.; Franquet, Q.; Sarrazin, C.; Peilleron, N.; Giacosa, S.; Guyon, L.; Fontanell, A.; Fiard, G.; Long, J.A.; et al. Ex-vivo treatment of tumor tissue slices as a predictive preclinical method to evaluate targeted therapies for patients with renal carcinoma. Cancers **2020**, 12, 232.
38. Sims, D.; Sudbery, I.; Ilott, N.E.; Heger, A.; Ponting, C.P. Sequencing depth and coverage: Key considerations in genomic analyses. Nat. Rev. Genet. **2014**, 15, 121–132.
39. Kalbfleisch, J.D.; Prentice, R.L. The Statistical Analysis of Failure Time Data; Wiley: New York, NY, USA, 2011.
40. Tibshirani, R. The lasso method for variable selection in the cox model. Stat. Med. **1997**, 16, 385–395.

**Figure 1.** C-index obtained for different fold reduction factors and percentages of patients in the training dataset for KIRC (ccRCC, TCGA) with the Cox model. (**A**) Median C-index for different degradations of both sequencing depth (x axis) and percentage of patients (y axis) in the training dataset for miRNA-seq data. The horizontal box highlights the case where the full $80\%$ of patients is used and corresponds to (**B**), whereas the vertical box focuses on the full available library size and corresponds to (**C**). (**B**) C-index for different fold reduction factors for miRNA-seq (gray boxplots) and mRNA-seq data (median values, in red) with $80\%$ of the patients in the training dataset. Above is the p-value of a one-sided Wilcoxon test compared to no subsampling (i.e., $\delta =1$). (**C**) C-index for different percentages of patients in the training dataset for miRNA-seq (light gray boxplots) and mRNA-seq data (median values, in red) with the original TCGA sequencing depth. Above is the p-value compared to the full dataset (i.e., $80\%$). Red: mRNA-seq; gray boxplots: miRNA-seq. In each case, we computed the C-indices by 10 repetitions of 5-fold cross-validation. ***: $p\le 0.001$, **: $p\le 0.01$, *: $p\le 0.05$, +: $p<0.1$, n.s.: $p\ge 0.1$.

**Figure 2.** C-index as a function of sequencing depth reduction, tested on the E-MTAB-1980 dataset and on a TCGA subset, for mRNA profiling in ccRCC. In dark gray, performance measured with the C-index calculated on the E-MTAB-1980 dataset after training on an 80% subsample of the TCGA dataset (the procedure is repeated to obtain 50 C-indices). In blue, the test is performed on the remaining 20% of TCGA data (median C-index). ***: $p\le 0.001$, +: $p<0.1$, n.s.: $p\ge 0.1$.

**Table 1.** Characteristics of the 11 cancers investigated. We computed the C-indices with 10 repetitions of 5-fold cross-validation for both the Cox-elastic net model (EN) and the random survival forest (RF). Datasets are ordered by decreasing median C-index computed with the Cox-elastic net model.

Cancer | n Patients | p miRNA | Censoring Rate | 3-Year Survival | C-Index (EN) | C-Index (RF) |
---|---|---|---|---|---|---|
UVM | 77 | 536 | 0.73 | 0.74 | 0.81 | 0.83 |
ACC | 77 | 518 | 0.65 | 0.75 | 0.80 | 0.84 |
KIRP | 269 | 486 | 0.84 | 0.87 | 0.79 | 0.82 |
MESO | 85 | 519 | 0.14 | 0.19 | 0.70 | 0.69 |
KIRC | 508 | 462 | 0.66 | 0.75 | 0.70 | 0.66 |
LGG | 506 | 548 | 0.62 | 0.56 | 0.70 | 0.69 |
CESC | 288 | 542 | 0.76 | 0.72 | 0.68 | 0.59 |
LIHC | 355 | 540 | 0.65 | 0.62 | 0.67 | 0.66 |
PRAD | 486 | 470 | 0.81 | 0.80 | 0.66 | 0.59 |
LUAD | 483 | 529 | 0.63 | 0.61 | 0.66 | 0.60 |
UCEC | 532 | 554 | 0.83 | 0.83 | 0.61 | 0.64 |

**Table 2.** Maximum miRNA-seq library size reduction before prediction performance decreases, the corresponding median sequencing depth (in thousands of aligned reads), and the prediction metric degraded first, for the Cox model and the 11 investigated cancers. IBS: integrated Brier score.

Cancer | UVM | ACC | KIRP | MESO | KIRC | LGG | CESC | LIHC | PRAD | LUAD | UCEC |
---|---|---|---|---|---|---|---|---|---|---|---|
Fold reduction | 1000 | 1000 | 100 | 100 | 10 | 10 | 2 | 10 | 5 | <1 | 10,000 |
Median library size (in 1000 reads) | 5 | 6 | 60 | 50 | 200 | 700 | 2000 | 500 | 900 | >5000 | 1 |
Metric degraded first | C-index | both | both | C-index | both | IBS | C-index | both | C-index | both | C-index |

**Table 3.** Maximum mRNA-seq library size reduction before prediction performance decreases, the corresponding median sequencing depth (in thousands of aligned reads), and the prediction metric degraded first, for the Cox model and the 11 investigated cancers. IBS: integrated Brier score.

Cancer | UVM | ACC | KIRP | MESO | KIRC | LGG | CESC | LIHC | PRAD | LUAD | UCEC |
---|---|---|---|---|---|---|---|---|---|---|---|
Fold reduction | 100 | 1000 | 100 | 100 | 10 | 100 | 2 | 10 | 10 | 10 | 10 |
Median library size (in 1000 reads) | 400 | 40 | 400 | 500 | 5000 | 500 | 20,000 | 5000 | 5000 | 4000 | 2000 |
Metric degraded first | C-index | both | IBS | both | IBS | both | C-index | IBS | both | IBS | both |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Jardillier, R.; Koca, D.; Chatelain, F.; Guyon, L.
Optimal microRNA Sequencing Depth to Predict Cancer Patient Survival with Random Forest and Cox Models. *Genes* **2022**, *13*, 2275.
https://doi.org/10.3390/genes13122275
