Optimal microRNA Sequencing Depth to Predict Cancer Patient Survival with Random Forest and Cox Models

Jardillier, Rémy; Koca, Dzenis; Chatelain, Florent; Guyon, Laurent

doi:10.3390/genes13122275

Open AccessArticle

Optimal microRNA Sequencing Depth to Predict Cancer Patient Survival with Random Forest and Cox Models

by

Rémy Jardillier

^1,2,

Dzenis Koca

¹,

Florent Chatelain

^2,*,† and

Laurent Guyon

^1,*,†

¹

Univ. Grenoble Alpes, CEA, Inserm, IRIG, BioSanté U1292, BCI, 38000 Grenoble, France

²

Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-Lab, Institute of Engineering University Grenoble Alpes, 38000 Grenoble, France

^*

Authors to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Genes 2022, 13(12), 2275; https://doi.org/10.3390/genes13122275

Submission received: 27 October 2022 / Revised: 18 November 2022 / Accepted: 23 November 2022 / Published: 2 December 2022

(This article belongs to the Special Issue Small RNA Bioinformatics)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

(1) Background: tumor profiling enables patient survival prediction. The two essential parameters to be calibrated when designing a study based on tumor profiles from a cohort are the sequencing depth of RNA-seq technology and the number of patients. This calibration is carried out under cost constraints, and a compromise has to be found. In the context of survival data, the goal of this work is to benchmark the impact of the number of patients and of the sequencing depth of miRNA-seq and mRNA-seq on the predictive capabilities for both the Cox model with elastic net penalty and random survival forest. (2) Results: we first show that the Cox model and random survival forest provide comparable prediction capabilities, with significant differences for some cancers. Second, we demonstrate that miRNA and/or mRNA data improve prediction over clinical data alone. mRNA-seq data leads to slightly better prediction than miRNA-seq, with the notable exception of lung adenocarcinoma for which the tumor miRNA profile shows higher predictive power. Third, we demonstrate that the sequencing depth of RNA-seq data can be reduced for most of the investigated cancers without degrading the prediction abilities, allowing the creation of independent validation sets at a lower cost. Finally, we show that the number of patients in the training dataset can be reduced for the Cox model and random survival forest, allowing the use of different models on different patient subgroups.

Keywords:

sequencing depth; cancer; microRNA; survival; Cox model; random survival forest model

1. Introduction

microRNAs (miRNAs) are near 22-nucleotide long RNAs repressing protein-coding gene expression at post-transcriptional level [1]. miRNAs have been shown to be implicated in various steps of carcinogenesis: initiation, propagation and metastasis [2]. The cancer genome atlas (TCGA) project has provided microRNA sequencing on thousands of tumor samples over 33 cancer types, together with patient follow-up [3]. miRNAs represent promising biomarkers to predict patient survival in cancer [4]. TCGA datasets are extremely valuable to build survival models with tumor miRNA expression, and contain large enough cohorts for many tumor types to evaluate the predictions. A semi-parametric and popular model to link patient survival with genomics variables, dealing with censored data and assuming proportional hazards, has been proposed by D.R. Cox [5]. Classically, in the case of high-dimensional datasets, a penalty term is used to constrain the coefficients of the model, and to select only a subset of genes. Different forms of penalties exist [6], but we will focus on the elastic net penalty [7] in this paper, as we have recently shown that they provide similar performances [8]. More recently, non-parametric machine learning algorithms have been proposed and adapted to deal with survival data, including random survival forest [9]. Random survival forest potentially offers more flexibility, as it does not assume any proportionality between hazards, and takes into account non-linear effects and interactions between variables [10,11].

Building and evaluating efficient models from miRNA expression, applicable in clinics, requires to build large datasets from patient cohorts. It implies the recruitment of many patients, and the tumor miRNA profiling at a high enough sequencing depth. Increasing the number of patients and/or the sequencing depth means increasing the cost, but may not lead to direct improvement of the prediction performance of the models. Moreover, validation datasets remain scarce and expensive to build [12]. Thus, a compromise has to be found. For tumor mRNA profiling, P. Milanez-Almeida et al. [13] showed that the sequencing depth could be decreased by typically two orders of magnitude for TCGA datasets for most cancers when using the Cox model with elastic net penalty. They use C-index and p-value from single variable Cox model as prediction performance metrics. They argue that the saved cost could be used to increase the number of patients and/or to perform longitudinal studies.

The goal of the present work is to investigate the required miRNA and mRNA sequencing depth together with the number of patients in the training dataset to build optimal performance models according to well-established metrics (i.e., C-index and integrated Brier score), both with the classical Cox model with elastic net penalty and random survival forest. Additionally, we have validated our results with an independent cohort.

2. Materials and Methods

2.1. Overview of the Methodology

Supplementary Figure S2 shows the methods as a flowchart. Briefly, sequencing data from TCGA are down-sampled (see Section 2.6.1), and then a cross-validation procedure is performed (as detailed in Section 2.3). Learning of the model is performed on

80 %

of the patients, eventually further reduced (see Section 2.6.2), and testing the performance on the remaining

20 %

is performed using the C-index and the Integrated Brier Score (IBS). Subsampling is performed both on the training set and on the testing set for TCGA datasets, which aims to answer the question “can we reduce both the sequencing depth for learning models and for patients in hospital?”.

2.2. Cox Model with Elastic Net Penalty and Random Survival Forest: The Link between Genetic and Survival Data

2.2.1. Cox Proportional Hazards Model with Elastic Net Penalty

Let T denote the survival time (also called the ‘time-to-event’). The Cox model [5] is widely used in medicine to link covariates to survival data through the hazard function, defined for all time instants

t > 0

,

h (t) = lim_{h \to 0} \frac{P (t \leq T < t + h | T \geq t)}{h}

, which represents the instantaneous death probability per unit of time. In the Cox model, the hazard function for patient i is modeled as follows:

h (t; X^{i}) = h_{0} (t) exp (β^{T} X^{i}),

where

h_{0} (t)

is the baseline hazard function,

X^{i} = {(X_{1}^{i}, \dots, X_{p}^{i})}^{T}

the vector of covariates for patient i (here as mRNA or miRNA expression, with p the number of coding or miRNA genes), and

β = {(β_{1}, \dots, β_{p})}^{T}

the vector of associated coefficients. We define

Δ_{i}

to be the associated status, as 1 for death and 0 for censoring. The vector of coefficients

β

can be estimated by maximizing the Cox pseudo-likelihood, as proposed by Breslow [14].

The elastic net methodology [7] consists of the addition of a penalty term to the log-pseudo-likelihood

l (β)

before the maximization:

\hat{β} (E N) = \underset{β}{argmax} l (β) - λ ({α | | β | |}_{1} + \frac{1 - α}{2} {| | β | |}_{2}^{2})

In a previous work, we advised equally Lasso, Elastic Net and Ridge as penalization methods, as they provide equal model performance [8]. The Lasso selects fewer genes, leading to a more parsimonious model. In the present work, we chose the Elastic Net penalty as it is becoming the most popular penalization for Cox models.

We used the R package glmnet [15] to estimate Cox model with elastic net penalty. In the followig, ‘Cox model’ refers to ‘Cox model with elastic net penalty’. For more details on the Cox model and the choice of the hyperparameters used for penalties (i.e.,

α = 0.3

), we refer the reader to Supplementary Figure S1 and Materials.

2.2.2. Random Survival Forest

Random forest, introduced by Breiman, is a classical ensemble algorithm for regression and classification whose principle is to build multiple decision trees and create a forest [16]. Results are averaged over all the trees. H. Ishwaran et al. then extended the classical random forest algorithm to survival analysis with censored data [10]. At each node of each tree, m explanatory variables are randomly chosen, and the variable that best separates patients into two groups according to their survival curve is retained. Tree depth is controlled by a threshold on the minimum number of patients in the node. Random survival forest has the advantage of possibly taking into account non-linear effects and interactions between variables [10,11].

We used the R package tuneRanger [9] to learn random survival forest for survival data, a package based on the ranger package [17] but with a fast implementation for tuning the number m of variables randomly drawn at each node. We used default hyperparameters suggested by the authors (i.e., patients used to build a tree chosen with boostrapping, at least 3 patients in a terminal node, 50 trees in a forest, log-rank test as splitting rule,

\sqrt{p}

as starting value for tuning m, with p the total number of genes), and the function tuneMtryFast. To decrease computation time for mRNA datasets, we only retain the 2500 genes with the highest association with survival according to likelihood tests in single-variable Cox models (i.e., we learn one single-variable Cox model for each gene to compute the p-values).

2.3. Prediction Performance Metrics

As schemed in Supplementary Figure S2, we estimate the prediction performance of the models by 10 repetitions of K-fold cross-validation (

K = 5

). We learn a model (i.e., Cox model or random survival forest) on a training dataset (

\frac{4}{5}

of the patients), and we define a risk score from this estimation for each patient of the testing dataset (

\frac{1}{5}

of the patients). The risk score (RS) is defined for a given patient i as the sum of

β_{j} X_{j}^{i}

for the Cox model (

X_{j}

corresponds to the expression of gene j), and the mean of the estimated cumulative hazard function for the random survival forest:

${\hat{RS}}_{i} = {\hat{β}}^{T} X^{i}$ for the Cox model, with $\hat{β}$ the estimator of the coefficients, and $X^{i}$ the gene expression vector for patient i.
${\hat{RS}}_{i} = \frac{1}{Card (T)} \sum_{j \in T} \hat{H} (t_{j} | X_{i})$ for random survival forest, with ‘Card’ the cardinal function, $\hat{H} (t | X_{i})$ the estimated cumulative hazard function at time t for patient i, and T the times at which the hazard function is estimated.

This procedure allows to assess prediction performance by computing the C-index and the Integrated Brier Score (IBS), as defined below. Then, at the end of the procedure, 50 C-indices and 50 IBS are computed for each method.

The C-index allows the discrimination ability of a model to be assessed by quantifying the proportion of patient pairs for whom risk scores are in good agreement with their survival data. For two patients i and k with risk scores

{RS}_{i}

and

{RS}_{k}

, and with survival times

T_{i}

and

T_{k}

, the C-index is defined as

C = P (T_{i} < T_{k} | {RS}_{i} > {RS}_{k})

. A C-index of 1 indicates perfect agreement, and a C-index of

\frac{1}{2}

corresponds to random chance agreement. We took the estimator of the C-index given by [18] and theorized by [19].

The Brier Score [20] measures the average squared distance between the observed survival status and the predicted survival probability at a particular time t. It is always a number between 0 and 1, with 0 being the best possible value. We used the IBS that integrates the Brier Score between 0 and the maximum event time of the test set, and divides this quantity by the maximum integration time. Then, while the C-index measures the ability of a model to rank patients according to their risks, the IBS estimates the ability of a model to predict survival probabilities along time. The IBS is a global performance metric that assesses both discrimination and calibration. These two metrics are widely used to estimate prediction performance in practice and are complementary.

We used the R packages survcomp [21] to compute the C-index, and pec for the IBS [22].

2.4. The Cancer Genome Atlas and E-MTAB-1980 Datasets

Cancer acronyms, as provided by the TCGA consortium, are available in Supplementary Table S1. First, we included cancers available in TCGA for which there were more than 75 patients with miRNA-seq and survival data. Then, we followed recent formal recommendations [23] to exclude the PCPG cancer that has too few death events and the SKCM cancer that has a high ratio of metastatic samples sequenced. We used overall survival as the disease-outcome, except when the authors recommend the use of progression-free interval (BRCA, LGG, PRAD, READ, TGCT, THCA and THYM). After these two steps, we retained 25 cancers. Finally, we computed the C-index and IBS estimates after running the Cox model or random survival forest applied on the miRNA profiles for these 25 cancers as schemed in Supplementary Figure S2. To focus on cancers for which the sequencing data convey prognostic values, we decided to retain only the datasets for which the median C-index is significantly higher than 0.6 for at least one of the algorithms (i.e., Cox model or random survival forest) according to a one-sided Wilcoxon test at level 0.05. At the end of this procedure, we retained 11 cancers (Table 1, Supplementary Figure S3). Supplementary Table S2 shows the median miRNA sequencing depth and the number of patients for the 25 cancers. There is no statistical difference between the retained 11 cancers and the remaining 14 cancers, both in sequencing depth (p = 0.17, Wilcoxon test) and the number of patients (p = 0.85).

We used the Broad GDAC FIREHOSE utility (https://gdac.broadinstitute.org/ (accessed on 11 December 2020)) to obtain clinical, miRNA-seq, and mRNA-seq datasets. We applied a Trimmed Mean of M-values (TMM) procedure to correct for between sample variance [24]. We first used the calcNormFactors function of package EdgeR [25] to compute a normalization factor for each patient, and we then applied the voom function of the limma package to compute log2-CPM data corrected with the normalization factors computed earlier [26]. We then standardized the expression of each gene both in the training dataset and the testing datasets using the mean and standard deviation values among patients of the training data.

To confirm the impact of sub-sampling on survival metrics in an external validation cohort, we acquired the processed E-MTAB-1980 dataset [27] from ArrayExpress (https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1980/ (accessed on 25 November 2019)). Survival metrics were calculated as previously described, while using E-MTAB-1980 as a testing dataset. We standardized the testing dataset independently from the training dataset, by using the mean and standard deviation of the testing dataset. This dataset gathers transcriptomic and clinical data from 101 patients of a Japanese cohort. The transcriptomic data were acquired with a microarray technology, which is a continuous technique, so we have not applied any down-sampling. The goals with this independent dataset are: “can the approach with cross-validation on the TCGA data be validated with an independent dataset?”, and “to which extent down-sampling affects the model quality” (keeping unchanged the testing set).

2.5. Integration of miRNA-seq Data Together with Clinical Data

We verified whether the tumor miRNA profiles added predictive value to the clinical data [28]. Different strategies exist for integration of miRNA-seq and clinical data [29,30]. In order to avoid the dilution of the few clinical parameters among all the miRNA covariates, we added the risk scores computed with miRNA-seq data alone (

{RS}_{m i R N A}

) to ones computed using classical clinical features (age, gender, grade, T, N, M), when available (

RS = β . {RS}_{m i R N A} + \sum_{l} β_{l} . {Clin}_{l}

, where

{Clin}_{l}

is the

l^{t h}

clinical variable). T is a score which stands for the extent of the tumor, N for the extent of spread to the lymph nodes, and M for the presence of metastasis. We did not include gender for sex-specific cancers (CESC, UCEC, PRAD). Age is available for all cancers, and we specify whether the other variables are available in Supplementary Figures S4 and S5.

To emphasize if the miRNA-seq data added prediction value over clinical data for both Cox model and random survival forest, we performed a one-sided Wilcoxon signed rank test for each of the 11 cancers studied. We considered a difference significant when the p-value corrected with Benjamini-Hochberg method is below 0.05, even though this is purely indicative as discussed below.

2.6. Degradation of miRNA-seq Data

2.6.1. Subsampling of miRNA-seq Data

“Sequencing depth” is defined here as the sum of the number of aligned reads per patient, and can vary according to the patients, and is equivalent to the notion of “library size”. These two nomenclatures will be used interchangeably in the following text.

To reduce the sequencing depth, we used a subsampling method [31]. The key parameter to calibrate fold reduction is the proportion of subsampling,

ε \in (0, 1]

. For each count data (i.e., number of reads)

R_{i j}

obtained for a patient i and a gene j, a subsampled count data of a proportion

ε

, noted

{\tilde{R}}_{i j}

, is drawn according to a binomial distribution of parameters

R_{i j}

and

ε

:

{\tilde{R}}_{i j} \sim Binom (R_{i j}, ε),

for each patient

i = 1, \dots, n

and each gene

j = 1, \dots, p

.

Thus, the closer the parameter

ε

is to 0, the smaller the read depth: a proportion of

ε

(e.g., 0.01) corresponds to a subsampling of the sequencing data by a factor

δ = \frac{1}{ε}

(e.g., 100). In this study, we examine the effect of 1 (no subsampling), 10, 100, 1000 and 10,000 subsampling factors

δ

. We then draw saturation curves, which are the evaluation metrics (i.e., C-index, IBS), performed on the test set which is not subsampled, as a function of the subsampling factor

δ

[32,33].

2.6.2. Reduction of the Number of Patients in the Training Dataset

To study the impact of the number of patients on prediction capabilities, we artificially decreased the percentage x of patients in the learning dataset. In this study, we chose

x = 10, 20, \dots, 80 %

. However, to ensure that the C-index and IBS are not biased, the testing dataset is always composed of

20 %

of patients.

3. Results

3.1. Library Sizes of mRNA-seq Data Are Ten Times Larger Than the Ones of miRNA-seq Data

The library sizes are equivalent between the 25 cancers, with a few exceptions, and are distributed around

5 \times 10^{6}

reads for miRNAs, and

5 \times 10^{7}

for mRNAs (Supplementary Figure S6). Recall here that only aligned reads, and not raw reads, are taken into account. The sequencing depth for mRNA datasets is therefore higher than that of the miRNAs by a factor of 10 on average, which is not surprising as it spans on 40 times more genes. There are thus, on average, 4 times more aligned reads per gene for miRNAs than for mRNAs. The lengths of genes are also very different between mRNAs and miRNAs. Also, there is no particular relationship between the sequencing depth chosen for the mRNAs and for the miRNAs between the different cancers. Note that for the LAML cancer, we observe a lower sequencing depth than for the other cancers: the median sequencing depth is 720,000 reads for LAML, 2.5 million for KIRC and 7.5 million for LGG (Supplementary Figure S6).

3.2. C-Index Highlighted Noticeable Prediction Differences between Cox and Random Survival Forest Models for Eight out of Twenty-Five Cancers

Using miRNA-seq datasets and according to the C-index metric, the Cox model shows better prediction than the random survival forest for KIRC, CESC, PRAD, LUAD, HNSC, and to a lesser extent LIHC (Supplementary Figure S3A). Conversely, random survival forest shows better predictions for KIRP, THYM, THCA, and to a lesser extent UCEC. For the other cancers, we did not observe clear differences. Noticeably, while the Cox model is not able to capture any prediction abilities for THCA (i.e., median C-index of 0.46), random survival forest exhibits a median C-index of 0.62. However, if we choose the IBS as the prediction metric, random survival forest shows better prediction than Cox except for LGG (Supplementary Figure S3B). We discuss this difference observed between the metrics below.

3.3. mRNA-seq Data Provides Slightly Better Prediction Performance Than miRNA-seq Data for Most of the 11 Investigated Cancers

In this section, we use a Wilcoxon signed-rank test to highlight the situation in which there exist differences, and we corrected the 11 p-values computed with the Benjamini-Hochberg procedure.

The median C-indices reached with the Cox model are higher with mRNA-seq data than miRNA-seq data for all selected cancers except LUAD. More precisely, the C-indices are higher with mRNA-seq data for 8 cancers (ACC, KIRP, MESO, KIRC, LGG, CESC, PRAD, UCEC), and higher with miRNA-seq data only for LUAD (Supplementary Figure S7A). When using IBS as a metric, the median IBS obtained from 5 cancers (UVM, KIRC, LIHC, LUAD, UCEC) was lower compared to the median IBS obtained with mRNA-seq data. Additionally, overall IBS obtained by using mRNA-seq data was lower in cases of KIRP, MESO, LGG, and CESC, and higher while using miRNA-seq data of UVM, KIRC and LIHC. Comparable results were obtained using random survival forest as prediction model. Overall, mRNA-seq data provides better predictions, but the absolute differences remain small.

3.4. Mirna-seq Data Improves Predictions over Clinical Data Alone for Most of the Investigated Cancers

For 8 cancers (UVM, ACC, MESO, LGG, CESC, LIHC, PRAD, LUAD) out of the 11 studied, the addition of miRNA-seq data to generic clinical data significantly improved the C-index compared to clinical data alone for the Cox model (Supplementary Figure S4A, integration of clinical and miRNA data is described Section 2.5). Similarly, for random survival forest, the median C-index is improved for 6 cancers (UVM, ACC, MESO, LGG, LIHC and PRAD, Supplementary Figure S5A). When using the IBS as the performance metric, the difference are often not as clear: predictions appear better for 5 cancers when taking tumor miRNA profiles into account in the Cox model (KIRP, MESO, KIRC, LGG, and CESC, Supplementary Figure S4B), and for 5 cancers in the random survival forest model (ACC, KIRP, MESO, LGG, and LIHC, Supplementary Figure S5B).

Overall, the addition of miRNA-seq data to classical clinical data improves prediction performance as assessed by C-index and/or IBS for all the 11 cancers investigated but UCEC with the Cox model. This performance drops to 7 cancers with random survival forest; KIRC, CESC, LUAD, UCEC do no show improvement. The use of miRNA-seq data seems not as interesting as clinical data alone to build predictive risk scores for UCEC. However, we have included this cancer because RNA-seq data can be used in other contexts as part of a survival model: stratifying patients according to transcriptomic profiles [34], identifying predictive markers of response to treatments [35], identifying potential therapeutic targets [36].

3.5. Shallow Tumor miRNA or mRNA Sequencing Keeps Survival Prediction Performance for Many Cancers

We consider that the sequencing depth can be reduced if and only if none of the two prediction metrics (i.e., C-index or IBS) is degraded at level 0.05 according to a one-sided Wilcoxon test. Figure 1 shows the C-index as a function of miRNA (or mRNA) library size reduction and/or number of patients subsampling for the kidney cancer subtype KIRC (corresponding to clear cell renal cell carcinoma, ccRCC). We highlight this cancer subtype for the following reasons: the dataset contains many patient data (Table 1), the prediction performance is quite high (C = 0.7), the availability of an independent dataset (Section 3.8), and as we are more specialized on this subtype [37]. For this cancer, it appears that both the number of patients and the sequencing depth could have been decreased while keeping similar prediction capacity. More precisely, using 60% of the patients (

n = 304

) in the training set leads to similar model performance, even though a small but noticeable performance decrease is noticed with IBS in the Cox model (Supplementary Figure S9). Also, decreasing the miRNA and mRNA sequencing depth by one order of magnitude has no measurable consequences. However, both the number of patients and the sequencing depth should not be decreased to their maximum extents altogether, as shown by the shape of the color map (Figure 1A).

Table 2 and Table 3 summarize the maximum lowering of sequencing depth without affecting prediction performance. For most cancers, reducing the sequencing depth for miRNAs and mRNAs leads to similar performance, and the possible library size reduction is correlated between miRNAs and mRNAs when chosen as covariates in the Cox model (Supplementary Figure S10). For mRNAs, one order of magnitude reduction or more is permitted for 11 cancers but CESC, which only tolerates a

50 %

reduction (∼20,000,000 aligned reads). For miRNAs, there are 3 exceptions: CESC again, which also tolerates a

50 %

reduction (∼2,000,000 aligned reads), PRAD with an

80 %

reduction (∼1,000,000 aligned reads), and LUAD which do not tolerate any sequencing depth reduction, and may even show improved performance with an increase in library size (≥5,000,000 aligned reads). For cancers tolerating 500,000 aligned reads in mRNAs or less, the sequencing depth could be reduced at least for one order of magnitude in miRNAs. Thus, mRNA sequencing data might inform of the required sequencing depth for miRNAs.

For random survival forest and mRNA-seq data, the sequencing depth of all cancers can be reduced by a factor 100 without degrading the C-index and the IBS (Supplementary Table S3 and Figure S11). This fold reduction corresponds to median sequencing depth of about 500,000 aligned reads. For KIRC and LGG the sequencing depth can be even more reduced, by a factor 1000 (∼50,000 aligned reads). The results are more heterogeneous for miRNA-seq data as the sequencing depth can be reduced by a factor 1000 for CESC (∼5000 aligned reads) down to 5 for KIRP and MESO (∼1,000,000 aligned reads). Noticeably, for CESC, the possible fold reduction is much larger for random survival forest than for the Cox model for both miRNAs and mRNAs. We hypothesized that these differences are the consequence of a better C-index obtained with the Cox model than with random survival forest (supplementary Figures S3A, S10 and S11).

3.6. Models Trained with Fewer Patients Do Not Degrade Prognosis for Most of the Investigated Cancers

For the Cox model and miRNA-seq data, the number of patients in the training dataset can be reduced for 9 of the 11 cancers, at least for a small proportion (Supplementary Table S4). The two exceptions concern PRAD and LUAD, for which diminishing the number of patients in the training set decreases the prediction performance. We obtained comparable results for mRNA-seq data, except for UVM and UCEC which require more patients to achieve maximum performance—for UVM with similar C-index between miRNAs and mRNAs, but for UCEC with better performance with miRNAs (Supplementary Figure S7). Surprisingly, random survival forest need less patients in the training dataset to achieve optimal prediction performance. This result makes it possible to consider stratifying patients into subgroups and to learn models separately for each subgroup.

3.7. Very Small Sequencing Depth Is Responsible for the Performance Loss

When reducing sequencing depth, we automatically reduce the number of detected genes. We define a non-coding/coding gene as ‘detected’ if its CPM-normalized expression level is greater than 1 for at least 1% of the patients in the training dataset. Two hypotheses can be put forward to explain the decrease in predictive abilities induced by the subsampling of sequencing depth:

(1): the number of detected genes decreases as the subsampling rate increases (Supplementary Figure S12A, [38]), and only the level of expression of the most highly expressed genes can be measured (Supplementary Figure S12B). However, genes with a low level of expression may have significant predictive power and go undetected, which would diminish overall predictive capabilities.
(2): more generally, the signal-to-noise ratio decreases for all genes (the standard deviation of the measurements varies in $\sqrt{N}$ , with N the number of aligned reads per gene).

To test the first hypothesis, we compared the C-indices obtained with all miRNAs, and those obtained only with the most expressed genes, using the Cox model. This corresponds on average to a 2-fold decrease in the number of predictors (i.e., 210 miRNAs on average detected after subsampling of the miRNA-seq data by a factor of 10,000 for all cancers studied). In these two scenarios, the sequencing data keep the same read depth per gene, but only the number of genes taken into account differs. For miRNAs, reducing the dimension by keeping only the 210 most expressed genes does not significantly impact prediction capabilities (i.e., C-index and IBS) for the 11 cancers studied, except for CESC and LUAD with the C-index (p-value

< 0.001

, one-sided Wilcoxon signed-rank test with Benjamini-Hochberg correction), and to a lesser extent for LIHC with the C-index (p-value

< 0.05

, Supplementary Figure S13). The first hypothesis is therefore not verified.

To test the second hypothesis, we calculated the C-indices obtained after subsampling by a factor of 10,000 (i.e., 210 genes on average are detected, and the count data are subsampled), and those obtained with the same 210 genes (on average) but without subsampling, also with the Cox model. For all 11 cancers, the median C-index obtained after subsampling is lower than that obtained without subsampling but with the same predictors (Supplementary Figure S13). For the IBS, we observed the same results, except for UVM, LUAD and UCEC. The second hypothesis is verified: the subsampling induces a decrease in the signal-to-noise ratio of the RNA-seq count data explaining the decrease of the prediction if a strong subsampling is applied. It is also interesting to notice that for most cancers, selecting half of the most expressed predictors does not affect prediction performance.

We obtained comparable results for random survival forest (Supplementary Figure S14) and for mRNA-seq data (data not shown to avoid too many supplementary figures, but can easily reproduced with the R scripts shared on github).

3.8. Prognostic Performances Follow Similar Trend after Subsampling When Tested on an Independent Dataset

To evaluate to which extent these results can be used in practice, we checked whether the models learned on TCGA, with an incremental decrease in sequencing depth, also do not degrade the predictions on an independent dataset. However, we were unable to find a dataset comparable to TCGA with both microRNA tumor profiling and patient survival, so we chose to demonstrate the reproducibility on mRNA profiling. We chose an mRNA profiling dataset measured with microarray in order to improve generalizability, and keeping ccRCC for the reasons detailed Section 3.5. Figure 2 shows that first the C-index is higher when the test set is the independent one, and second that the trend is comparable for both test sets (E-MTAB-1980 and TCGA datasets). More precisely, reducing the mRNA sequencing depth by a factor of 10 or even 100 on the training set does not affect the C-index performance. The E-MTAB-1980 dataset also provide improved IBS performance and smaller variability. We hypothesize that the smaller variability is due to the fact that the E-MTAB-1980 test set remains the same, as compared to the TCGA test set gathering the 20% of remaining patients, not used in the training set, and so the difference in performance is more sensitive (Supplementary Figures S15 and S16). Halving the number n of patients in the training set (from 80% to 40%, i.e., from

n = 406

to 203) increases the IBS measured on the E-MTAB-1980 dataset in a modest manner (from a median of 0.112, 95% confidence interval of the median [0.11;0.114] to 0.133 [0.125;0.14]), whereas only 10% (

n = 51

) of the patients lead to low IBS performance (0.205 [0.195;0.214]). This information is useful when trying to improve the model by clustering the patients into refined cancer subtypes, and further applying different Cox model in each of these subtypes: while 2 subtypes are reasonable for ccRCC on this large TCGA cohort, more than 5 may lead to too large IBS values only because of the small number of patients in each learning subset.

4. Discussion

Our work shows the benefit of using tumor mRNA or miRNA profiling to predict patient survival. We thus estimated the optimal sequencing depth and number of patients to achieve this prediction, using Cox and random forest models. We considered that the sequencing depth can be reduced if both C-index and IBS are not significantly degraded. This choice is subjective and can easily be adapted with the R script provided. For example, if discrimination among patients at risk is the only important aspect for a particular study, the C-index should be considered as the only prediction metrics. Second, as the same initial set of patients is used in multiple Monte-Carlo runs (10 repetitions of 5-fold cross-validation) to estimate prediction performance, the 50 metrics (i.e., C-index or IBS) are not independent. Besides, the p-values can be reduced toward 0 when small differences are observed by increasing the number of repetitions. The computed p-values are thus only indicative, but helps the readability of the graphics.

Thus, because of the methodology used, we obtain comparable results with slight differences with [13] about shallow tumor mRNA sequencing to predict patient survival with Cox model. We have extended it to miRNA-sequencing, and to random forest survival.

Then, estimation of the IBS with random survival forest is direct in the sense that the survival function S is an output of the algorithm for patients of the testing dataset. However, it is not the case with the Cox model as the baseline hazard function

h_{0}

is not estimated in the pseudo-likelihood. This function

h_{0}

is estimated with the Breslow estimator [14]. This dissimilarity in the way individual survival functions are estimated may explain the large advantage on prediction performance for random survival forest as compared to the Cox model (Supplementary Figure S3B). More investigations are still needed to better understand this observation.

C-index and IBS metrics do not always provide identical evaluation of the performance of the models. While the C-index only accounts for the ordering of the patients depending on the risk score, the IBS also evaluates the base line

h_{0} (t)

. We hypothesize that the quality of the inference of the baseline is responsible for the different behavior of both metrics. For specific goals, one of the metric could be favored. For example, to identify patients at risk, we encourage to focus on the C-index. In other cases, and as the two metrics are complementary, we advise to consider both metrics.

This work is the first one to focus on the sequencing depth of miRNA profiling for survival prediction, and could serve as a proxy to calibrate future experiments. Lung adenocarcinoma (LUAD) showed a noticeable difference with other cancers as tumor miRNAs better predict patient survival than mRNAs. This may indicate the particular role of miRNAs in this tumor type, and would be worth to further investigate.

Finally, the number of patients required in the training dataset is lower for random survival forest than for the Cox model (Section 3.6). This result is surprising as random forest has more degrees of freedom, and further work is needed to investigate this point.

5. Conclusions

In this work, we present a methodology and results on the possibility of (i) reducing sequencing costs to, for example, create validation datasets, and (ii) reducing the number of samples in training datasets to stratify patients into subgroups in the context of prediction of survival in cancer. Cox model and random survival forest provide comparable C-indices for some cancers but not all (e.g., the Cox model outperforms random survival forest for CESC and miRNA-seq data; the contrary being true for THCA). However, with IBS as the performance metric, the performance are better with random survival forest. We also pointed out that mRNA-seq data provide slightly better performance than miRNA-seq data on average, with the noticeable exception of lung adenocarcinoma (LUAD). Integration of miRNA-seq data with clinical data allows to improve predictions over clinical data alone for most of the 11 investigated cancers. Importantly, we demonstrated that sequencing depth of miRNA-seq and mRNA-seq can be reduced without degrading prediction performance for most of the 11 cancers retained in a cancer, data (i.e., miRNAs or mRNAs) and metric (i.e., C-index or IBS) dependent manner, thus allowing the reduction of sequencing cost to create independent validation datasets. Finally, we demonstrated that the number of patients in the training dataset can be reduced for both miRNA and mRNA data without degrading the prediction performance for the Cox model, and in a larger extent for random survival forest. Finally, our results were confirmed on an independent dataset.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/genes13122275/s1, Figure S1. Deviance and number of genes selected for different values of

α

for KIRP; Figure S2. Procedure for the evaluation of prediction performances; Figure S3. Boxplot of the C-indices (A) and of the IBS (B) for the Cox model with elastic net penalty (blue) and random forest (orange); Figure S4. C-indices (A) and IBS (B) obtained with clinical data alone (red), miRNA-seq data alone (blue), and both clinical and miRNA-seq data (purple) for the 11 cancers investigated and the Cox model with elastic net penalty; Figure S5. C-indices (A) and IBS (B) obtained with clinical data alone (red), miRNA-seq data alone (blue), and both clinical and miRNA-seq data (purple) for the 11 cancers investigated and the random survival forest procedure; Figure S6. Distribution of the library size for miRNA-seq data (A), and median library size for mRNA-seq and miRNA-seq data(B) for the 25 cancers of TCGA; Figure S7. Boxplot of the C-indices (A) and of the IBS (B) for the Cox model with elastic net penalty for miRNA-seq (blue) and mRNA-seq (lightblue) data; Figure S8. Boxplot of the C-indices (A) and of the IBS (B) for random survival forest for miRNA-seq (orange) and mRNA-seq (yellow) data; Figure S9. IBS obtained for different fold reduction factors and percentage of patients in the training dataset for KIRC (ccRCC, TCGA) with the Cox model; Figure S10. Distribution of C-indices obtained with the Cox model after fold reduction of miRNA-seq data or mRNA-seq data for the 11 investigated cancers; Figure S11. Distribution of C-indices obtained with the random survival forest model after fold reduction of miRNA-seq data or mRNA-seq data for the 11 investigated cancers; Figure S12. Number of genes (miRNAs) detected for different fold reduction (A) and level of expression (log2-CPM) of the genes detected without subsampling (blue) and after subsampling by a factor 10,000 for KIRP; Figure S13. C-indices (A) and IBS (B) obtained after subsampling by a factor 10,000 (green), with the same miRNAs (∼200 most expressed) but without subsampling (orange), and with all the miRNAs (∼500) and without subsampling (blue) for the Cox model with elastic net penalty; Figure S14. C-indices (A) and IBS (B) obtained after subsampling by a factor 10,000 (green), with the same miRNAs (∼200 most expressed) but without subsampling (orange), and with all the miRNAs (∼500) and without subsampling (blue) for random survival forest; Figure S15. C-index obtained for different fold reduction factors and percentage of patients in the training dataset for KIRC (ccRCC, TCGA) with the Cox model, assessed on the E-MTAB-1980 dataset; Figure S16. IBS obtained for different fold reduction factors and percentage of patients in the training dataset for KIRC (ccRCC, TCGA) with the Cox model, assessed on the E-MTAB-1980 dataset; Table S1. Acronym of the TCGA cancers; Table S2. Median microRNA sequencing depth and number of patients in the TCGA cohort; Table S3. Maximum fold reduction without degradation of the C-index and the IBS, corresponding median sequencing depth (thousands of reads), and prediction metric degraded first for random survival forest and for miRNA-seq data (A) and mRNA-seq data (B) for the 11 investigated cancers; Table S4. Minimum Proportion of patients needed in the training dataset without degradation of the C-index and the IBS and corresponding number of patients for the 11 investigated cancers for miRNA-seq data and the Cox model (A), miRNA-seq data and random survival forest (B), mRNA-seq data and the Cox model (C), and mRNA-seq data and random survival forest (D). References [15,39,40] are cited in the supplementary materials.

Author Contributions

Conceptualization and methodology, L.G., F.C. and R.J.; data analysis, R.J. with the help of D.K. and L.G. and supervision of L.G. and F.C.; writing—original draft preparation, R.J.; writing—review and editing, L.G., D.K. and F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This article was developed in the framework of the Grenoble Alpes Data Institute, supported by the French National Research Agency under the Investissements d’avenir programme (ANR-15-IDEX-02).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

This study uses publicly available human datasets. The consent from patients are gathered with the original studies.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://gdac.broadinstitute.org/ https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1980/. The R script to reproduce the results presented in this article is available at https://github.com/remyJardillier/Survival_seq_depth (all accessed on 24 October 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

Bartel, D.P. Metazoan micrornas. Cell 2018, 173, 20–51. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Peng, Y.; Croce, C.M. The role of MicroRNAs in human cancer. Signal Transduct. Target. Ther. 2016, 1, 15004. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Chu, A.; Robertson, G.; Brooks, D.; Mungall, A.J.; Birol, I.; Coope, R.; Ma, Y.; Jones, S.; Marra, M.A. Large-scale profiling of microRNAs for the cancer genome atlas. Nucleic Acids Res. 2016, 44, e3. [Google Scholar] [CrossRef] [PubMed]
Capula, M.; Mantini, G.; Funel, N.; Giovannetti, E. New avenues in pancreatic cancer: Exploiting microRNAs as predictive biomarkers and new approaches to target aberrant metabolism. Expert Rev. Clin. Pharmacol. 2019, 12, 1081–1090. [Google Scholar] [CrossRef]
Cox, D.R. Regression models and life-tables. J. R. Stat. Soc. Ser. B (Methodol.) 1972, 34, 187–202. [Google Scholar] [CrossRef]
Jardillier, R.; Chatelain, F.; Guyon, L. Bioinformatics Methods to Select Prognostic Biomarker Genes from Large Scale Datasets: A Review. Biotechnol. J. 2018, 13, 1800103. [Google Scholar] [CrossRef]
Zou, H.; Hastie, T. Regularization and variable selection via the elastic-net. J. R. Stat. Soc. 2005, 67, 301–320. [Google Scholar] [CrossRef] [Green Version]
Jardillier, R.; Koca, D.; Chatelain, F.; Guyon, L. Prognosis of lasso-like penalized Cox models with tumor profiling improves prediction over clinical data alone and benefits from bi-dimensional pre-screening. BMC Cancer 2022, 22, 1045. [Google Scholar] [CrossRef]
Probst, P.; Wright, M.N.; Boulesteix, A. Hyperparameters and tuning strategies for random forest. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019, 9, e1301. [Google Scholar] [CrossRef] [Green Version]
Ishwaran, H.; Kogalur, U.B.; Blackstone, E.H.; Lauer, M.S. Random survival forests. Ann. Appl. Stat. 2008, 2, 841–860. [Google Scholar] [CrossRef]
Wright, M.N.; Ziegler, A.; König, I.R. Do little interactions get lost in dark random forests? BMC Bioinform. 2016, 17, 145. [Google Scholar] [CrossRef] [Green Version]
Kourou, K.; Exarchos, T.P.; Exarchos, K.P.; Karamouzis, M.V.; Fotiadis, D.I. Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. 2015, 13, 8–17. [Google Scholar] [CrossRef] [Green Version]
Milanez-Almeida, P.; Martins, A.J.; Germain, R.N.; Tsang, J.S. Cancer prognosis with shallow tumor RNA sequencing. Nat. Med. 2020, 26, 188–192. [Google Scholar] [CrossRef]
Breslow, N. Contribution to the Discussion of the Paper by D.R. Cox. J. R. Stat. Soc. B 1972, 34, 2016–2017. [Google Scholar]
Friedman, J.H.; Hastie, T.; Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
Wright, M.N.; Ziegler, A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J. Stat. Softw. 2017, 77, 1–17. [Google Scholar] [CrossRef] [Green Version]
Harrell, F.E., Jr.; Lee, K.L.; Mark, D.B. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Med. 1996, 15, 361–387. [Google Scholar] [CrossRef]
Pencina, M.J.; D’Agostino, R.B. Overall C as a measure of discrimination in survival analysis: Model specific population value and confidence interval estimation. Stat. Med. 2004, 23, 2109–2123. [Google Scholar] [CrossRef]
Gerds, T.A.; Schumacher, M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom. J. 2006, 48, 1029–1040. [Google Scholar] [CrossRef]
Schroder, M.S.; Culhane, A.C.; Quackenbush, J.; Haibe-Kains, B. survcomp: An R/Bioconductor package for performance assessment and comparison of survival models. Bioinformatics 2011, 27, 3206–3208. [Google Scholar] [CrossRef] [PubMed]
Mogensen, U.B.; Ishwaran, H.; Gerds, T.A. Evaluating Random Forests for Survival Analysis Using Prediction Error Curves. J. Stat. Softw. 2012, 50, 1–23. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Liu, J.; Lichtenberg, T.; Hoadley, K.A.; Poisson, L.M.; Lazar, A.J.; Cherniack, A.D.; Kovatich, A.J.; Benz, C.C.; Levine, D.A.; Lee, A.V.; et al. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell 2018, 173, 400–416.e11. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Robinson, M.D.; Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010, 11, R25. [Google Scholar] [CrossRef] [Green Version]
Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26, 139–140. [Google Scholar] [CrossRef] [Green Version]
Ritchie, M.E.; Belinda, P.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015, 43, e47. [Google Scholar] [CrossRef]
Sato, Y.; Yoshizato, T.; Shiraishi, Y.; Maekawa, S.; Okuno, Y.; Kamura, T.; Shimamura, T.; Sato-Otsubo, A.; Nagae, G.; Suzuki, H.; et al. Integrated molecular analysis of clear-cell renal cell carcinoma. Nat. Genet. 2013, 45, 860–867. [Google Scholar] [CrossRef]
Volkmann, A.; De Bin, R.; Sauerbrei, W.; Boulesteix, A.-L. A plea for taking all available clinical information into account when assessing the predictive value of omics data. BMC Med. Res. Methodol. 2019, 19, 162. [Google Scholar] [CrossRef]
López de Maturana, E.; Alonso, L.; Alarcón, P.; Martín-Antoniano, I.A.; Pineda, S.; Piorno, L.; Calle, M.L.; Malats, N. Challenges in the Integration of Omics and Non-Omics Data. Genes 2019, 10, 238. [Google Scholar] [CrossRef] [Green Version]
De Bin, R.; Boulesteix, A.-L.; Benner, A.; Becker, N.; Sauerbrei, W. Combining clinical and molecular data in regression prediction models: Insights from a simulation study. Briefings Bioinform. 2019, 21, 1904–1919. [Google Scholar] [CrossRef]
Robinson, D.G.; Storey, J.D. subSeq: Determining Appropriate Sequencing Depth Through Efficient Read Subsampling. Bioinformatics 2014, 30, 3424–3426. [Google Scholar] [CrossRef]
Tarazona, S.; García-Alcalde, F.; Dopazo, J.; Ferrer, A.; Conesa, A. Differential expression in RNA-seq: A matter of depth. Genome Res. 2011, 21, 2213–2223. [Google Scholar] [CrossRef] [Green Version]
Bass, A.J.; Robinson, D.G.; Storey, J.D. Determining sufficient sequencing depth in RNA-Seq differential expression studies. bioRxiv 2019. [Google Scholar] [CrossRef]
Ricketts, C.J.; De Cubas, A.A.; Fan, H.; Smith, C.C.; Lang, M.; Reznik, E.; Bowlby, R.; Gibb, E.A.; Akbani, R.; Beroukhim, R.; et al. The Cancer Genome Atlas Comprehensive Molecular Characterization of Renal Cell Carcinoma. Cell Rep. 2018, 23, 313–326.e5. [Google Scholar] [CrossRef] [Green Version]
Ternès, N.; Rotolo, F.; Heinze, G.; Michiels, S. Identification of biomarker-by-treatment interactions in randomized clinical trials with survival outcomes and high-dimensional spaces. Biom. J. Biom. Z. 2017, 59, 685–701. [Google Scholar] [CrossRef] [Green Version]
Wei, H.; Zhang, J.-J.; Tang, Q.-L. MiR-638 inhibits cervical cancer metastasis through Wnt/beta-catenin signaling pathway and correlates with prognosis of cervical cancer patients. Eur. Rev. Med. Pharmacol. Sci. 2017, 21, 5587–5593. [Google Scholar] [CrossRef]
Roelants, C.; Pillet, C.; Franquet, Q.; Sarrazin, C.; Peilleron, N.; Giacosa, S.; Guyon, L.; Fontanell, A.; Fiard, G.; Long, J.A.; et al. Ex-vivo treatment of tumor tissue slices as a predictive preclinical method to evaluate targeted therapies for patients with renal carcinoma. Cancers 2020, 12, 232. [Google Scholar] [CrossRef] [Green Version]
Sims, D.; Sudbery, I.; Ilott, N.E.; Heger, A.; Ponting, C.P. Sequencing depth and coverage: Key considerations in genomic analyses. Nat. Rev. Genet. 2014, 15, 121–132. [Google Scholar] [CrossRef]
Kalbeisch, J.D.; Prentice, R.L. The Statistical Analysis of Failure Time Data; Wiley: New York, NY, USA, 2011. [Google Scholar]
Tibshirani, R. The lasso method for variable selection in the cox model. Stat. Med. 1997, 16, 385–395. [Google Scholar] [CrossRef]

Figure 1. C-index obtained for different fold reduction factors and percentage of patients in the training dataset for KIRC (ccRCC, TCGA) with the Cox model. (A) Median C-index for different degradation of both sequencing depth (x axis) and percentage of patients (y axis) in the training dataset for miRNA-seq data. Horizontal box highlights the case where all

80 %

of patients are used and corresponds to (B), whereas vertical box focuses on the full available library size and corresponds to (C). (B) C-index for different fold reduction factors for miRNA-seq (gray boxplots) and mRNA-seq data (median values, in red) with

80 %

of the patients in the training dataset. Above is the p-value of a one-sided Wilcoxon test compared to no subsampling (i.e.,

δ = 1

). (C) C-index for different percentages of patients in the training dataset for miRNA-seq (light gray boxplots) and mRNA-seq data (median values, in red) with original TCGA sequencing depth. Above is the p-value compared to the full dataset (i.e.,

80 %

). red, mRNA-seq; gray boxplots, miRNA-seq. In each case, we computed the C-indices by 10 repetitions of 5-fold cross validation. ***:

p \leq 0.001

, **:

p \leq 0.01

, *:

p \leq 0.05

, +:

p < 0.1

, n.s.:

p \geq 0.1

.

Figure 1. C-index obtained for different fold reduction factors and percentage of patients in the training dataset for KIRC (ccRCC, TCGA) with the Cox model. (A) Median C-index for different degradation of both sequencing depth (x axis) and percentage of patients (y axis) in the training dataset for miRNA-seq data. Horizontal box highlights the case where all

80 %

of patients are used and corresponds to (B), whereas vertical box focuses on the full available library size and corresponds to (C). (B) C-index for different fold reduction factors for miRNA-seq (gray boxplots) and mRNA-seq data (median values, in red) with

80 %

of the patients in the training dataset. Above is the p-value of a one-sided Wilcoxon test compared to no subsampling (i.e.,

δ = 1

). (C) C-index for different percentages of patients in the training dataset for miRNA-seq (light gray boxplots) and mRNA-seq data (median values, in red) with original TCGA sequencing depth. Above is the p-value compared to the full dataset (i.e.,

80 %

). red, mRNA-seq; gray boxplots, miRNA-seq. In each case, we computed the C-indices by 10 repetitions of 5-fold cross validation. ***:

p \leq 0.001

, **:

p \leq 0.01

, *:

p \leq 0.05

, +:

p < 0.1

, n.s.:

p \geq 0.1

.

Figure 2. C-index as a function of sequencing depth reduction tested on the E-MTAB-1980 dataset and TCGA subset for mRNA profiling in ccRCC. In dark gray, performance measured with the C-index calculated on the E-MTAB-1980 dataset, after training on an 80% sub-sample of the TCGA dataset (the procedure is repeated to obtain 50 C-indices). In blue, the test is performed on the remaining 20% of TCGA data (median C-index). ***:

p \leq 0.001

, +:

p < 0.1

, n.s.:

p \geq 0.1

.

Figure 2. C-index as a function of sequencing depth reduction tested on the E-MTAB-1980 dataset and TCGA subset for mRNA profiling in ccRCC. In dark gray, performance measured with the C-index calculated on the E-MTAB-1980 dataset, after training on an 80% sub-sample of the TCGA dataset (the procedure is repeated to obtain 50 C-indices). In blue, the test is performed on the remaining 20% of TCGA data (median C-index). ***:

p \leq 0.001

, +:

p < 0.1

, n.s.:

p \geq 0.1

.

Table 1. Characteristics of the 11 cancers investigated. We computed the C-indices with 10 repetitions of 5-fold cross-validation for both the Cox-elastic net model (EN) and random survival forest (RF). Datasets are ordered according to their median C-index computed with Cox-elastic net model (decreasing order).

Cancer	n Patients	p miRNA	Censoring Rate	Survival— 3 Years	C-Index— EN	C-Index— RF
UVM	77	536	0.73	0.74	0.81	0.83
ACC	77	518	0.65	0.75	0.8	0.84
KIRP	269	486	0.84	0.87	0.79	0.82
MESO	85	519	0.14	0.19	0.7	0.69
KIRC	508	462	0.66	0.75	0.7	0.66
LGG	506	548	0.62	0.56	0.7	0.69
CESC	288	542	0.76	0.72	0.68	0.59
LIHC	355	540	0.65	0.62	0.67	0.66
PRAD	486	470	0.81	0.8	0.66	0.59
LUAD	483	529	0.63	0.61	0.66	0.6
UCEC	532	554	0.83	0.83	0.61	0.64

Table 2. Maximum miRNA-seq library size reduction before the decreasing of prediction performance, corresponding median sequencing depth (in thousands of aligned reads), and prediction metric degraded first, for the Cox model, and the 11 investigated cancers.

Cancer	UVM	ACC	KIRP	MESO	KIRC	LGG	CESC	LIHC	PRAD	LUAD	UCEC
Fold reduction	1000	1000	100	100	10	10	2	10	5	< 1	10,000
Median library size (in 1000 reads)	5	6	60	50	200	700	2000	500	900	> 5000	1
Metric degraded first	C-index	both	both	C-index	both	IBS	C-index	both	C-index	both	C-index

Table 3. Maximum mRNA-seq library size reduction before the decreasing of prediction performance, corresponding median sequencing depth (in thousands of aligned reads), and prediction metric degraded first, for the Cox model, and the 11 investigated cancers.

Cancer	UVM	ACC	KIRP	MESO	KIRC	LGG	CESC	LIHC	PRAD	LUAD	UCEC
Fold reduction	100	1000	100	100	10	100	2	10	10	10	10
Median library size (in 1000 reads)	400	40	400	500	5000	500	20,000	5000	5000	4000	2000
Metric degraded first	C-index	both	IBS	both	IBS	both	C-index	IBS	both	IBS	both

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Jardillier, R.; Koca, D.; Chatelain, F.; Guyon, L. Optimal microRNA Sequencing Depth to Predict Cancer Patient Survival with Random Forest and Cox Models. Genes 2022, 13, 2275. https://doi.org/10.3390/genes13122275

AMA Style

Jardillier R, Koca D, Chatelain F, Guyon L. Optimal microRNA Sequencing Depth to Predict Cancer Patient Survival with Random Forest and Cox Models. Genes. 2022; 13(12):2275. https://doi.org/10.3390/genes13122275

Chicago/Turabian Style

Jardillier, Rémy, Dzenis Koca, Florent Chatelain, and Laurent Guyon. 2022. "Optimal microRNA Sequencing Depth to Predict Cancer Patient Survival with Random Forest and Cox Models" Genes 13, no. 12: 2275. https://doi.org/10.3390/genes13122275

APA Style

Jardillier, R., Koca, D., Chatelain, F., & Guyon, L. (2022). Optimal microRNA Sequencing Depth to Predict Cancer Patient Survival with Random Forest and Cox Models. Genes, 13(12), 2275. https://doi.org/10.3390/genes13122275

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Optimal microRNA Sequencing Depth to Predict Cancer Patient Survival with Random Forest and Cox Models

Abstract

1. Introduction

2. Materials and Methods

2.1. Overview of the Methodology

2.2. Cox Model with Elastic Net Penalty and Random Survival Forest: The Link between Genetic and Survival Data

2.2.1. Cox Proportional Hazards Model with Elastic Net Penalty

2.2.2. Random Survival Forest

2.3. Prediction Performance Metrics

2.4. The Cancer Genome Atlas and E-MTAB-1980 Datasets

2.5. Integration of miRNA-seq Data Together with Clinical Data

2.6. Degradation of miRNA-seq Data

2.6.1. Subsampling of miRNA-seq Data

2.6.2. Reduction of the Number of Patients in the Training Dataset

3. Results

3.1. Library Sizes of mRNA-seq Data Are Ten Times Larger Than the Ones of miRNA-seq Data

3.2. C-Index Highlighted Noticeable Prediction Differences between Cox and Random Survival Forest Models for Eight out of Twenty-Five Cancers

3.3. mRNA-seq Data Provides Slightly Better Prediction Performance Than miRNA-seq Data for Most of the 11 Investigated Cancers

3.4. Mirna-seq Data Improves Predictions over Clinical Data Alone for Most of the Investigated Cancers

3.5. Shallow Tumor miRNA or mRNA Sequencing Keeps Survival Prediction Performance for Many Cancers

3.6. Models Trained with Fewer Patients Do Not Degrade Prognosis for Most of the Investigated Cancers

3.7. Very Small Sequencing Depth Is Responsible for the Performance Loss

3.8. Prognostic Performances Follow Similar Trend after Subsampling When Tested on an Independent Dataset

4. Discussion

5. Conclusions

Supplementary Materials

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI