Whole Blood mRNA Expression-Based Prognosis of Metastatic Renal Cell Carcinoma

The Memorial Sloan Kettering Cancer Center (MSKCC) prognostic score is based on clinical parameters. We analyzed whole blood mRNA expression in metastatic clear cell renal cell carcinoma (mCCRCC) patients and compared it to the MSKCC score for predicting overall survival. In a discovery set of 19 patients with mRCC, we performed whole transcriptome RNA sequencing and selected 18 candidate genes for further evaluation based on associations with overall survival and statistical significance. In an independent validation set of 47 patients with mCCRCC, transcript expression of the 18 candidate genes was quantified using a customized NanoString probeset. Cox regression multivariate analysis confirmed that two of the candidate genes were significantly associated with overall survival. Higher expression of BAG1 [hazard ratio (HR) 0.14, p < 0.0001, 95% confidence interval (CI) 0.04–0.36] and NOP56 (HR 0.13, p < 0.0001, 95% CI 0.05–0.34) was associated with better prognosis. A prognostic model incorporating expression of BAG1 and NOP56 into the MSKCC score improved prognostication significantly over a model using the MSKCC prognostic score only (p < 0.0001). Prognostication using whole blood mRNA gene profiling in mCCRCC is feasible and should be prospectively confirmed in larger studies.

The next module that we employed for downstream analysis was Cufflinks v2.2.1, which assembles transcripts, estimates their abundances, and tests for differential expression and regulation in the above-mentioned RNA-Seq samples [4][5][6][7]. These programs rely on the accepted_hits.bam file generated by TopHat2. The command used with this module was: 'time cufflinks -L selected_label -p 4 $TOPHATDIR/accepted_hits.bam', where -L is a Cufflinks option that labels transcript fragments with the prefix "selected_label". We ran with 4 threads (-p 4) and used the output from TopHat2 as the input. We also used cuffmerge, a script provided with Cufflinks, to combine novel and known isoforms and maximize overall assembly quality, as stated in the manual [7].
The final step was to run cuffdiff v2.2.1 to generate differential gene expression [8]. Cuffdiff calculates gene expression for all the samples and provides information about the statistical significance of the changes reported between samples [4]. The command used was: 'time cuffdiff -o cuffdiff_out -b $GENOME/hg38.fa -p 8 -L G1,I1,P1 -u $MERGE/merged_asm/merged.gtf $SAMPLES', where merged.gtf is the output from cuffmerge and $SAMPLES are all the accepted_hits.bam files for the conditions and replicates. Cuffmerge merges the transcript fragments assembled by Cufflinks for each sample into a single comprehensive assembly, written as a merged GTF file [4][8]. The labels correspond to the three conditions Good (G1), Intermediate (I1), and Poor (P1); these labels are used throughout this paper. The Biomarker Discovery RNA-seq (BMD_RNA-seq) pipeline workflow showing how the above modules are used is illustrated in Figure S1. This workflow does not use any scripting language for communication between modules. This simplifies the swapping, elimination, or addition of modules and follows the work reported in the literature [4].
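As an illustration of how the $SAMPLES argument is organized, the following minimal Python sketch assembles a cuffdiff argument list using only the options quoted above. Cuffdiff expects the BAM files of one condition to be joined by commas and the conditions to be given as separate arguments, in the same order as the -L labels. The function name and all file paths are hypothetical, and the sketch only builds the command; it is not the procedure used in this study, which did not rely on a scripting language.

```python
import shlex


def cuffdiff_command(merged_gtf, genome_fasta, bams_by_condition,
                     out_dir="cuffdiff_out", threads=8):
    """Build the cuffdiff argument list for labelled conditions.

    bams_by_condition maps a condition label (e.g. "G1") to the list of
    accepted_hits.bam files of its replicates. Insertion order is kept,
    so the labels and the sample groups stay aligned.
    """
    labels = ",".join(bams_by_condition)  # e.g. "G1,I1,P1"
    # Replicates of one condition are comma-separated in a single argument.
    sample_groups = [",".join(bams) for bams in bams_by_condition.values()]
    return [
        "cuffdiff",
        "-o", out_dir,          # output directory
        "-b", genome_fasta,     # reference used for bias correction
        "-p", str(threads),     # number of threads
        "-L", labels,           # condition labels
        "-u",                   # multi-read correction
        merged_gtf,             # merged assembly from cuffmerge
        *sample_groups,
    ]


if __name__ == "__main__":
    # Hypothetical file layout, for illustration only.
    command = cuffdiff_command(
        "merged_asm/merged.gtf",
        "hg38.fa",
        {
            "G1": ["G1_rep1/accepted_hits.bam", "G1_rep2/accepted_hits.bam"],
            "I1": ["I1_rep1/accepted_hits.bam"],
            "P1": ["P1_rep1/accepted_hits.bam"],
        },
    )
    print(shlex.join(command))
```

Returning a list rather than a single string keeps each path as one argument, so paths containing spaces would not need manual quoting.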
All the data generated by this workflow were analyzed in the R environment with the cummeRbund package v2.10.0 [9], which renders cuffdiff output graphically. The following conventions were followed. All significant genes were obtained using the getSig() function with an alpha value of 0.05 [9]. Transcript abundances were measured in fragments per kilobase of transcript per million fragments mapped (FPKM). A fragment corresponds to a single cDNA molecule and is represented by a pair of reads, one from each end [7]. In addition, the base 2 log of the fold change between sample y and sample x, the uncorrected p-value, and the false discovery rate (FDR)-adjusted p-value were computed [4][5][6][7][8]. Cuffdiff reports statistical significance based on whether p is greater than the FDR after applying the Benjamini-Hochberg correction [4][5][6][7][8]. Genes were selected by comparing the level of gene expression and the statistical significance [10]. Figure S2 shows a volcano plot illustrating the pairwise comparisons between the three conditions for all samples, including all replicates and all genes. The red dots mark the genes that were considered significant when comparing fold change versus significance (-log p-values).
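The quantities named above can be sketched in a few lines of Python. This is a simplified illustration, not the Cuffdiff or cummeRbund implementation: Cuffdiff estimates abundances with a statistical model rather than from raw counts, and the function names, the example numbers, and the use of a strict "adjusted p < alpha" cutoff are assumptions made here for clarity.

```python
import math


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


def log2_fold_change(fpkm_x, fpkm_y):
    """Base 2 log of the fold change of sample y relative to sample x."""
    return math.log2(fpkm_y / fpkm_x)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_significant(p_values, alpha=0.05):
    """Flag tests whose adjusted p-value falls below alpha (assumed cutoff)."""
    return [q < alpha for q in benjamini_hochberg(p_values)]


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(fpkm(500, 2_000, 20_000_000))
    print(log2_fold_change(2.0, 8.0))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A volcano plot such as the one in Figure S2 places the log2 fold change on the horizontal axis and the negative log of the p-value on the vertical axis, so genes with both large changes and small p-values appear in the upper corners.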
To select gene markers that can discriminate between conditions, the approach used by Cembrowski et al. was adopted [10]. A gene was considered X-fold enriched in a given condition, relative to the other conditions, when the FPKM value reported by Cuffdiff was at least X-fold greater for all corresponding pairwise comparisons (e.g., for gene A to be X-fold enriched in the G1 condition relative to the I1 and P1 conditions, FPKM_{A,G1} > X·FPKM_{A,I1} and FPKM_{A,G1} > X·FPKM_{A,P1}). The genes with the largest enrichment fold that also met statistical significance as defined by Cuffdiff were selected for profiling as candidate gene markers for the three conditions previously defined. These top genes were compared across the three conditions, and 18 genes were selected as gene markers to differentiate between conditions.
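The enrichment criterion can be written as a short Python sketch. The gene names and FPKM values below are hypothetical, the function names are introduced here for illustration, and the statistical significance filter applied by Cuffdiff is not reproduced.

```python
def enrichment_fold(fpkm_by_condition, target):
    """Smallest ratio of the target FPKM to the FPKM of any other condition.

    A gene is X-fold enriched in `target` only if this limiting ratio
    exceeds X, because every pairwise comparison has to pass.
    """
    target_value = fpkm_by_condition[target]
    ratios = []
    for condition, value in fpkm_by_condition.items():
        if condition == target:
            continue
        # A gene absent from the other condition is treated as
        # infinitely enriched (an assumption of this sketch).
        ratios.append(float("inf") if value == 0 else target_value / value)
    return min(ratios)


def enriched_genes(fpkm_table, target, fold):
    """Genes whose FPKM in `target` exceeds fold x FPKM in every other condition."""
    return [
        gene
        for gene, by_condition in fpkm_table.items()
        if all(
            by_condition[target] > fold * value
            for condition, value in by_condition.items()
            if condition != target
        )
    ]


# Hypothetical FPKM values for the three conditions.
EXAMPLE_FPKM = {
    "geneA": {"G1": 40.0, "I1": 8.0, "P1": 5.0},
    "geneB": {"G1": 12.0, "I1": 10.0, "P1": 2.0},
    "geneC": {"G1": 3.0, "I1": 3.5, "P1": 30.0},
}

if __name__ == "__main__":
    print(enriched_genes(EXAMPLE_FPKM, "G1", 4))
    print(enrichment_fold(EXAMPLE_FPKM["geneA"], "G1"))
```

In this example geneB is not 4-fold enriched in G1 even though it is 6-fold higher than in P1, because the comparison with I1 fails; requiring all pairwise comparisons to pass is what makes a marker specific to one condition.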

Validation of RNAseq data (NanoString)
The nCounter Digital Analyzer was used to count individual fluorescent barcodes to quantify gene expression. This technology is based on two probes: a capture probe linked to a biotin molecule and a reporter probe linked to a color-coded molecular marker. These probes hybridize to a complementary target mRNA using specific sequences from the genes of interest. These sequences are normally 100 bp in length. See Table S2 for the gene positions and target sequences used in this study. The level of expression of the targeted genes was measured by image counting based on four different colors. The count corresponds to the number of times a particular gene was detected [11]. We used 100 ng of total RNA isolated from fresh-frozen samples. The detailed protocol for mRNA quantification followed the manufacturer's recommendations, which are available at http://www.nanostring.com/uploads/Manual_Gene_Expression_Assay.pdf/ under http://www.nanostring.com/applications/subpage.asp?id=343. In addition, all the data generated with this technology were analyzed using the nCounter Digital Analyzer software, available at http://www.nanostring.com/support/ncounter/ [12].