LncRNA Profiling Reveals That the Deregulation of H19, WT1-AS, TCL6, and LEF1-AS1 Is Associated with Higher-Risk Myelodysplastic Syndrome

Simple Summary Although lncRNAs have been increasingly recognized as regulators of hematopoiesis, only several studies addressed their role in myelodysplastic syndrome (MDS). By genome-wide profiling, we identified lncRNAs deregulated in various groups of MDS patients. We computationally constructed lncRNA-protein coding gene networks to associate deregulated lncRNAs with cellular processes involved in MDS. We showed that expression of H19, WT1-AS, TCL6, and LEF1-AS1 lncRNAs associate with higher-risk MDS and proposed processes related with these transcripts. Abstract Background: myelodysplastic syndrome (MDS) is a hematopoietic stem cell disorder with an incompletely known pathogenesis. Long noncoding RNAs (lncRNAs) play multiple roles in hematopoiesis and represent a new class of biomarkers and therapeutic targets, but information on their roles in MDS is limited. Aims: here, we aimed to characterize lncRNAs deregulated in MDS that may function in disease pathogenesis. In particular, we focused on the identification of lncRNAs that could serve as novel potential biomarkers of adverse outcomes in MDS. Methods: we performed microarray expression profiling of lncRNAs and protein-coding genes (PCGs) in the CD34+ bone marrow cells of MDS patients. Expression profiles were analyzed in relation to different aspects of the disease (i.e., diagnosis, disease subtypes, cytogenetic and mutational aberrations, and risk of progression). LncRNA-PCG networks were constructed to link deregulated lncRNAs with regulatory mechanisms associated with MDS. Results: we found several lncRNAs strongly associated with disease pathogenesis (e.g., H19, WT1-AS, TCL6, LEF1-AS1, EPB41L4A-AS1, PVT1, GAS5, and ZFAS1). Of these, downregulation of LEF1-AS1 and TCL6 and upregulation of H19 and WT1-AS were associated with adverse outcomes in MDS patients. Multivariate analysis revealed that the predominant variables predictive of survival are blast count, H19 level, and TP53 mutation. Coexpression network data suggested that prognosis-related lncRNAs are predominantly related to cell adhesion and differentiation processes (H19 and WT1-AS) and mechanisms such as chromatin modification, cytokine response, and cell proliferation and death (LEF1-AS1 and TCL6). In addition, we observed that transcriptional regulation in the H19/IGF2 region is disrupted in higher-risk MDS, and discordant expression in this locus is associated with worse outcomes. Conclusions: we identified specific lncRNAs contributing to MDS pathogenesis and proposed cellular processes associated with these transcripts. Of the lncRNAs associated with patient prognosis, the level of H19 transcript might serve as a robust marker comparable to the clinical variables currently used for patient stratification.

transcripts and 17,535 PCG mRNAs. The Agilent Low Input Quick Amp Labeling Kit was used for sample preparation (RNA input was set up to 200 ng) according to the manufacturer's recommendation. The hybridized arrays were scanned using an Agilent Microarray Scanner C. Microarray probes were initially mapped to the GRCh37/hg19 genome using the NovoAlign program (Novocraft Technologies, Selangor, Malaysia) and reannotated according to the UCSC Genome Browser (http://genome.ucsc.edu). Raw data were extracted using Agilent Feature Extraction Software. Quality control, quantile normalization, and filtering were performed with the Bioconductor project in the R statistical environment using the limma package. Differentially expressed lncRNAs and PCGs were identified using the empirical Bayesian method implemented in the R limma package. Multiple testing correction was performed to compute the false discovery rate (FDR) using the Benjamini-Hochberg method. To visualize the differential expression data, expression heatmaps were designed using MeV v4.3.2 software [4] and hierarchical clustering of the data was performed using average linkage and Pearson distance. The raw and normalized data have been deposited in the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database under accession number GSE145733.
For the normalization of the raw CT data of lncRNAs and PCGs, we tested several known reference genes (B2M, GAPDH, GUSB, HPRT1, TUBB, UBC, and YWHAZ). The stability of the genes was compared using the web-based tool RefFinder that integrates four major currently available computational programs (geNorm, Normfinder, BestKeeper, and the comparative delta-Ct method) [5]. Based on the results from this optimization procedure ( Figure S1), the RT-qPCR data were finally normalized to the HPRT1 reference gene and further processed by the 2 -ΔΔCT method [6].

Mutational Screening and Data Analysis
The TruSight Myeloid Sequencing Panel Kit (Illumina, San Diego, CA, USA) containing 568 amplicons in 54 genes associated with myeloid malignancies was used for the mutational screening of patients from the discovery cohort. The amplicon library was constructed according to the manufacturer's recommendations. After library purification and subsequent normalization on beads, quantification was performed by the Kapa Library Quantification Kit (Kapa Biosystems, Wilmington, MA, USA). The libraries were pooled and 2x150 bp paired-end sequenced with Rapid SBS Kit V2 chemistry on an Illumina HiSeq 2500 instrument. FASTQ files were subjected to initial quality control by FastQC. Adaptor trimming was performed by Trimmomatic, and low-quality sequences were removed by Illuminaclip. The remaining reads were aligned to the human genome hg19 using BWA-MEM. Variants were detected by LoFreq v2.1.3.1. and annotated using Variant Effect Predictor (Ensembl). The clinical significance of each variant was verified in several genomic databases (UCSC, COSMIC, ExAC, and PubMed). The arbitrary cut off was set at 5% of variant allele frequency (VAF).

Statistical Analysis
Statistical analyses were performed using GraphPad Prism 7 (GraphPad Software, La Jolla, CA, USA) and SPSS software (IBM, Armonk, NY, USA). A nonparametric Mann-Whitney test was used to compare transcript levels and clinical parameters between two groups of samples. The Spearman rank test was performed to assess the correlation of continuous variables. The survival distributions for overall survival (OS) and progression-free survival (PFS) were estimated using the Kaplan-Meier method, and the differences were compared using the log-rank test. For the determination of the optimum cut-off values of transcript levels, we computed the p-values with the log-rank test on a dense net local computation and defined the cut-off points using Gaussian mixture models where the obtained p-values were divided into two components. For multivariate analysis, we estimated a Cox proportional hazards regression model with the Min-Max method for normalizing the data. The backward likelihood method was applied for the reduction of variables. The differences were considered statistically significant when p < 0.05.

Pathway Analysis
The functional changes related to deregulations in gene expression were assessed using gene set enrichment analysis (GSEA) [7].

LncRNA-PCG Coexpression Networks
The network analysis directly stems from the network-based lncRNA module functional annotation method introduced by Liu et al. [8]. First, we identified differentially expressed lncRNAs and PCGs (FDR < 0.05) and constructed a correlation matrix for these transcripts. The correlation was calculated for all the lncRNA-PCG pairs, and the absolute value of the Pearson correlation coefficient represented each pair. Second, non-negative matrix factorization (NMF) was used to extract modules from the correlation matrix. In particular, standard factorization based on the Kullback-Leibler divergence was employed [9]. The factorization was run multiple times for different numbers of modules and with different random seeds to compute the initial values for the factor matrices to avoid improper local minima of the objective function. The Frobenius norm of the factorization residual matrix served as the factorization objective function. Then, each module was functionally annotated. All the PCGs were mapped to the corresponding GO terms, and the terms with at least two corresponding PCGs were kept. GO enrichment analysis served to annotate individual modules, and Fisher's exact test was used to calculate the score for the individual terms. Eventually, the representative cores of the individual modules were plotted. In each plot, 4 lncRNAs and 13 PCGs with the highest module membership visually represented the module. Edges connect those nodes whose absolute correlation exceeds the median module correlation (weak) and its third quartile (strong). The GO terms with p < 0.01 associated with these modules are listed in the output tables. The network analysis was carried out in the R statistical environment with the limma, NMF, GSEABase and iGraph packages. Figure S1. Stability of the selected reference genes potentially applicable for RT-qPCR normalization. The tested genes (B2M, GAPDH, GUSB, HPRT1, TUBB, UBC, and YWHAZ) were ranked using the web-based tool RefFinder, which integrates four major currently available computational programs (geNorm, Normfinder, BestKeeper, and the comparative delta-Ct method). Based on the rankings from each program, RefFinder assigns an appropriate weight to an individual gene and calculates the geometric mean of their weights for the overall final ranking [5].   [11], Ross et al. [12], Eppert et al. [13], Graham et al. [14], Lim et al. [15], and Massarweh et al. [16].    [23], Diaz et al. [24], Haddad et al. [18], Piccaluga et al. [25], Jaatinen et al. [11], and Kumar et al. [26].