Gene Expression Signature Associated with Clinical Outcome in ALK-Positive Anaplastic Large Cell Lymphoma

Simple Summary Anaplastic large cell lymphomas associated with ALK translocation have a good outcome after CHOP treatment; however, the 2-year relapse rate remains at 30%. Microarray gene-expression profiling, high throughput RT-qPCR, and RNA sequencing of 48 ALK-positive anaplastic large cell lymphoma (ALK+ ALCL) samples obtained at diagnosis enable the identification of genes associated with clinical outcome. More particularly, our molecular signatures indicate that the FN1 gene, a matrix key regulator, might also be involved in the prognosis and the therapeutic response in anaplastic lymphomas. Abstract Anaplastic large cell lymphomas associated with ALK translocation have a good outcome after CHOP treatment; however, the 2-year relapse rate remains at 30%. Microarray gene-expression profiling of 48 samples obtained at diagnosis was used to identify 47 genes that were differentially expressed between patients with early relapse/progression and no relapse. In the relapsing group, the most significant overrepresented genes were related to the regulation of the immune response and T-cell activation while those in the non-relapsing group were involved in the extracellular matrix. Fluidigm technology gave concordant results for 29 genes, of which FN1, FAM179A, and SLC40A1 had the strongest predictive power after logistic regression and two classification algorithms. In parallel with 39 samples, we used a Kallisto/Sleuth pipeline to analyze RNA sequencing data and identified 20 genes common to the 28 genes validated by Fluidigm technology—notably, the FAM179A and FN1 genes. Interestingly, FN1 also belongs to the gene signature predicting longer survival in diffuse large B-cell lymphomas treated with CHOP. Thus, our molecular signatures indicate that the FN1 gene, a matrix key regulator, might also be involved in the prognosis and the therapeutic response in anaplastic lymphomas.


Introduction
Anaplastic large cell lymphoma (ALCL) is a rare type of T-cell lymphomas, accounting for approximately 3% of adult non-Hodgkin lymphomas and 10 to 20% of childhood lymphomas [1]. Systemic ALK-positive ALCLs (ALK + ALCL), associated with the translocation of the Anaplastic Lymphoma Kinase (ALK) oncogene, are considered a distinct entity in the WHO classification [1,2]. Chemotherapy treatments are based on cyclophosphamide, vincaalkaloids, doxorubicin, and corticosteroids in both adults and children, and high-dose methotrexate in children. ALK + ALCL tumors have a better outcome than other aggressive non-Hodgkin lymphomas, with a 5-year overall survival (OS) rate of 70% for adults and >90% for children [3][4][5][6][7][8][9][10]; however, the 2-year relapse rate remains at 30% [3][4][5][6][7][8]10,11]. To develop patient-tailored therapy strategies, we first need to be able to stratify patients according to risk factors. Several prognostic factors have been recently described for paediatric ALK + ALCLs, including the detection of minimal disseminated disease (MDD) [12], in bone marrow or blood combined with antibody titers against ALK [13][14][15][16]. The histological subtype variant (versus the common morphology) is also associated with the prognosis in ALK + ALCLs, at least in children [17]. However, the stratification of patients according to these prognostic factors has yet to be validated in randomized trials. We profiled gene expression in pre-treatment biopsies from non-relapsing and relapsing patients with ALK + ALCL to provide an additional indicator that could help to identify patients with a high risk of relapse and those of low risk who could benefit from a therapy reduction. Several techniques were used to identify differentially expressed genes, i.e., micro-arrays and RNA-sequencing. Then, Fluidgim technology and the Kallisto/Sleuth pipeline helped us to cross-validate candidate genes.

Patient Characteristics and Tumor Samples
The diagnosis of ALK + ALCL was based on morphologic and phenotypic criteria, as described in the 2001 and 2008 WHO classifications [1,2]. Histopathological and immunostaining results were reviewed by a national (the French Lymphopath Network) or international panel of pathologists [17]. Only cases with at least 50% lymph node involvement, assessed by CD30 staining frozen biopsies, and good RNA integrity (≥7) were selected from our tumor bank. The cohort consisted of 48 systemic ALK + ALCL tumor samples obtained at the time of diagnosis between 1994 and 2009 (Tables 1 and S1). The median follow-up was 58 months (4.8 years). Eighteen additional cases of systemic ALK + ALCL with available frozen material at the time of the diagnosis were retrieved from our tumor bank and used as an independent validation cohort. The patients were all treated with intensive chemotherapy, most of them according to the ALCL99 protocol and stratified on clinical factors [18]. Others were treated according to malignant histiocytosis protocols (HM89 and HM91) [19] or with ACVBP (doxorubicin, cyclophosphamide, vindesine, bleomycin, and prednisone). Patient samples were obtained after informed consent in accordance with the Declaration of Helsinki, and approval was received from the relevant ethics committees. All samples were stored at the «CRB Cancer des Hôpitaux de Toulouse» collection. In accordance with French law, the CRB cancer collection has been declared to the Ministry of Higher Education and Research (DC 2009-989) and a transfer agreement has been obtained (AC-2008-820) after approbation by ethical committees. Clinical and biological annotations of the samples have been declared to the CNIL (Comité National Informatique et Libertés). Abbreviations: IPI, international prognostic index; LDH, lactate dehydrogenase; Visceral involvement: lung, liver, spleen; CNS, central nervous system; MDD, minimal disseminated disease; PFS, progression free survival; CI, confidence interval; p, p value; RR, Relative Risk. † : groups defined by the following criteria: ≥ or < median age (12.5 years). † † : Missing Data: St-Jude Stage n = 14, Ann Arbor Stage n = 1, IPI score n = 14, LDH n = 5, peripheral lymph nodes n = 17, mediastinum n = 17, spleen n = 1, liver n = 1, lungs n = 1, other visceral involvement n = 1, skin lesions n = 2, clinical high risk group n = 6, bone lesions n = 1, bone marrow involvement n = 3, CNS involvement n = 1, soft tissue mass n = 1, CD3 n = 4, MDD n = 21.

Microarrays
Two µg of total RNA from 48 samples were used for hybridization to HG-U133Plus 2.0 GeneChips (54,675 probe sets; Affymetrix, Santa Clara, CA, USA), as previously reported [20]. For each outcome group, gene expression data were extracted and normalized using the GCRMA method [21,22] with the gcrma package for Bioconductor 3.14 (http://bioconductor.org, accessed on 26 September 2021). Then, the data were filtered (using the genefilter package) to eliminate probe sets whose expression values were too low and that could therefore be difficult to reproduce using very sensitive methods such as quantitative RT-PCR (RT-qPCR) [23]. Thus, only probes with normalized log2-transformed expression levels higher or equal to 5 within at least one outcome group were considered. Finally, a differential analysis was carried out using the Empirical Bayes method with the limma package [24], and the list of genes significantly discriminating between relapsing and non-relapsing groups was retained with a False Discovery Rate (FDR) [25] adjusted p-value of <0.05 and a fold change (FC) of at least ±2. Overrepresented biological functions and pathways (biological processes, cellular components and molecular functions) that were associated with the differentially expressed genes were assessed using the GOstats [26] package in Bioconductor.

RNA-Sequencing Data
From the 48 patient biopsies, 39 (18 relapsing and 21 non-relapsing) were retained for RNA-sequencing analysis. After ribodepletion (NEBNext ® rRNA Depletion HMR kit from NEB), RNA-seq libraries were prepared using NEBNext ® Ultra™ II Directional RNA Library Prep Kit for Illumina ® (NEB) and sequenced with Novaseq 6000 (ILLUMINA). The libraries' preparations were realized following the manufacturer's recommendations then sequenced to obtain 2 × 200 million 150-base reads per sample.

Validation of Microarray Signature Using High-Throughput Quantitative PCR Method
The oligonucleotide primer pairs used for the qPCR were designed with PrimerBLAST (http://www.ncbi.nlm.nih.gov/tools/primer-blast/, accessed on 26 September 2021) to target the CDS region of the variants detected by the selected Affymetrix probe sets. Primer Tms were calculated using Schildkraut and Lifson's 1965 salt-correction formula and Breslauer's 1986 table of thermodynamic parameters. The primer design was performed to avoid genomic DNA (gDNA) amplification. gDNA amplification was controlled during the primer validation and in the high-throughput qPCR by adding a positive control of gDNA (G147, Promega ® , Charbonnières-les-Bains, France) and by a valid prime assay, which accurately corrects all reactions in BioMark Array for signals derived from gDNA [27]. Primer sequences are reported in the Table S2.
PCR specificity was verified by assessing the melting curves of each amplification product. Primer efficiency has been tested on a pool of samples by standard qPCR (Table S2) prior to high-throughput qPCR. All qPCR assays were performed in duplicate. After a pre-amplification of cDNA, validation of the differentially expressed genes was performed using 96.96 Dynamic Arrays for the BioMark™ system (Fluidigm CorporatioSan Francisco, CA, USA) [23] according to manufacturer's instructions. An initial data analysis was performed with the Fluidigm real-time PCR analysis software using the linear derivative baseline correction, a quality correction set to 0.65, and the User (Detectors) Cycle Threshold. The cq (quantification cycle) ranged from 6.7 to 22.7 which signed for a successful experiment [28]. The cts for undetectable targets were set at 31. The mean expression of MLN51 and TBP, selected as the best housekeeping genes using Genorm ® and Normfinder ® with the R package NormqPCR, was used as a normalization factor to calculate ∆Cq values (1): [∆Cq gene of interest = mean duplicate Cq gene of interest − mean duplicate (Cq MLN51 , CqTBP)] The −∆Cq values were used for heatmap and boxplot (Beeswarm package, https:// rdrr.io/cran/beeswarm/man/beeswarm.html, accessed on 26 September 2021) generation by using the R software (version 3.1.2). The validation of the microarray signature was conducted using ∆Cq values after an assessment for, first, an adjusted p-value from the Wilcoxon test, followed by a Benjamini-Hoechberg correction lower than 0.05, then a Pearson's correlation between high-throughput qPCR and microarray data greater than 0.7.

Clinical Outcome Based on High-Throughput RT-qPCR Data
The validation of microarray signatures was carried out using ∆Cq values after assessments for p-values from a Wilcoxon test followed by a Benjamini-Hoechberg correction. The selection criteria were a p-value lower than 0.05 and a Pearson's correlation between high-throughput RT-qPCR and microarray data greater than 0.7. A two-step scheme to select the genes best discriminating between outcomes was established using ∆Cq values. The first step involved two complementary methods based on distinct approaches that reach the same goal [29]: Random Forest (RF, using the random Forest package [30], n = 500 trees) and Partial Least Squares Discriminant Analysis (PLS-DA, using the DiscriMiner package, http://cran.r-project.org/web/packages/DiscriMiner/index.html, accessed on 26 September 2021). For RF, 70% of the cohort (34 cases) formed a training set and the remaining 14 tumors formed the test set. Each set had approximately the same proportion of relapsing and non-relapsing cases as the whole cohort. A PLS-DA algorithm was associated with leave-on-out cross-validation. We selected the top five genes from each method, ranked by significance (using the Gini index and VIP [variable importance for the projection] index, respectively). These index values represent a quantitative statistical parameter ranking genes according to their ability to discriminate between the two outcome groups. Selected genes were then used to develop a logistic regression model with a backward selection method using relapse as the outcome variable.

Transcripts Quantification and Differential Expression Analysis
The Kallisto v0.44.0 pseudo-alignment method [31] was used to quantify transcript abundances directly from the raw RNA-seq FASTQ files. This method, based on the pseudo alignment for rapid and accurate quantification, was performed with a 100 bootstrap value, using a transcriptome index constructed from the Ensembl project's transcriptome v91. Spring Cloud Sleuth version 0.30.0 [32] was then used within R for differential expression analysis at the gene level (gene mode = TRUE) with an aggregation of the transcript abundances by Ensembl's gene ID (aggregation_column = 'ens_gene'). Poorly covered genes (read count <10 in more than half of the samples) were removed before any further analysis. Genes were then defined as differentially expressed (DE) depending on the corrected p-value (qval, adjusted p-values using the Benchamini-Hochberg method) from the Sleuth statistical test. We tested both the Wald test (WT) and the likelihood ratio test (LRT), which is more stringent.

Clinical and Pathological Characteristics of Patients
Among the 48 patients (Tables 1 and S1 and ref [33]), 31 were male and 17 were female. Most patients were children or young adults less than 22 years (n = 39). The median age at diagnosis was 12.5 years (range: 2-50 years). According to the Ann Arbor classification, 30 patients had advanced stage III or IV disease, and 18 had localised stage I or II disease. Twenty-two tumors were classified as common type and 26 as morphologic variants. The ALK gene was fused to the NPM gene in 44 tumors and to the TPM3 gene in the other cases, which corresponded to the different ALK staining patterns [17]. After front-line multiagent chemotherapy, 45 patients achieved complete remission. Three patients progressed during treatment (median: 7.2 months; range: 2.4-16.5 months), and 23 patients relapsed within 16.5 months of diagnosis: these were all assigned to the relapsing group. Twentytwo remained disease-free after a period of at least three years and were included in the non-relapsing group.

Molecular Signatures from Microarray Data Associated with Clinical Outcome
Based on microarray data, a supervised method was used to find the most significant differentially expressed genes between relapsing and non-relapsing tumors. Using a significance level of corrected p-value <0.05 and a cut-off fold change of ±2 ( Figure 1A), we generated a list of 47 significantly discriminating genes (61 probes), using the 14,388 probe sets that had a log2-transformed expression level ≥5 within at least one group ( Figure 2A, Table 2, orange columns). Among the 47 genes, 14 genes were overexpressed in the relapsing group while 33 genes were overexpressed in the non-relapsing group ( Figure 2A, Table 2, orange columns).  Heatmap of microarray data showing the 47 deregulated genes in "relapsing" (n = 26, dark grey) compared to "nonrelapsing" samples (n = 22, light grey). Each column represents a sample, and each row a probe set or gene. The expression level of each probe set was standardized by subtracting that probe set's mean expression from its expression value and then dividing this by the standard deviation across all the samples. This scaled expression value, designated as the row Z-score, was plotted using a red-blue color scale with red indicating high expression and blue indicating low expression.

Identification of a Minimum Set of Genes Associated with Clinical Outcome
Random Forest (RF) and Partial Least Squares Discriminant Analysis (PLS-DA) are two powerful tools for analysing microarray data. Because these two algorithms can highlight essential variables in a dataset, we used them as classification algorithms on high-throughput RT-qPCR data to identify the minimum set of genes whose expression in primary tumors is associated with clinical outcome ( Figure 1A). Using RF analysis, the optimal gene classifier consisted of five genes: EMP1, SCL40A1, ITGB7, SULF1, and FAM179A, ranked according to their variable importance in the model (Figures 1B and 3A). PLS-DA algorithms also gave an optimal gene classifier consisting of five genes in rankorder: FAM179A, MYOF1, SCL40A1, FN1, and PLAU ( Figures 1B and 3B). Therefore, RF and PLS-DA selected a total of 8 genes that could help classify relapsing and non-relapsing patients ( Figures 1B and 3C). We then tried to reduce the number of genes even more. Using a logistic regression on the ∆Cq expression from using high-throughput Fluidigm ® technology with these 8 genes, we identified a set of 3 genes ( Figure 1B, Table S4): FN1/fibronectin 1, FAM179A (family with sequence similarity 179, member A), and SCL40A1/ferroportin-1. For these three genes, data generated by microarray, Fluidigm ® , and standard RT-qPCR showed an excellent consistency (R 2 > 0.89, Figure S1). Overexpressions of the 3 genes are validated in relapse groups using an independent cohort (n = 18, Figure S2). Finally, since all are located on chromosome 2 (2q34, 2p23.2 and 2q32, respectively), we verified that their differential expression was not related to the gain or deletion of their loci by high-resolution CGH array. Box plots and strip-charts showing high-throughput qPCR quantification of the 8 selected genes in "relapsing" (grey, n = 26) and "non-relapsing" (white, n = 22) samples. Statistical significance was calculated using the Wilcoxon test followed by a Benjamini and Hoechberg correction. Expressions are given as (−∆Cq).

Transcripts Quantification with Pseudo-Alignment and Differential Expression Analysis by Total RNA-Sequencing
To find a gene's signature from another transcriptomic technique, 18 relapsing and 21 non-relapsing tumors over the 48 patient biopsies (39/48 samples) were sequenced. Using full RNA 150-bp paired-end sequencing data (median of 507 million reads per patient), gene expression was quantified with Kallisto, a fast pseudoalignment-based method used to obtain transcript quantification from RNA sequencing data [31]. Genes differentially expressed (DE) between relapsing and non-relapsing conditions were selected with Sleuth, which is a program for the differential expression analysis of RNA-Seq experiments for which transcript abundances have been quantified with Kallisto [32]. With a corrected p-value <0.02, 214 genes were found as DE between the two groups (relapse and no relapse) with the statistical Wald Test (WT, Figure 1C) and 62 with the more stringent Likelihood Ratio Test (LRT) (Table S5), which is a statistical test of the goodness-of-fit between two models. We finally retained the Wald Test's most extensive list for further analysis because it gives a 'beta' value (size effect) that can be compared to logFC. Thus, finally, 168 genes having an absolute log2 FC between relapse and no-relapse groups greater than 0.5 and a p-value lower than 0.02 [34] were selected ( Figure S3). After

Transcripts Quantification with Pseudo-Alignment and Differential Expression Analysis by Total RNA-Sequencing
To find a gene's signature from another transcriptomic technique, 18 relapsing and 21 non-relapsing tumors over the 48 patient biopsies (39/48 samples) were sequenced. Using full RNA 150-bp paired-end sequencing data (median of 507 million reads per patient), gene expression was quantified with Kallisto, a fast pseudoalignment-based method used to obtain transcript quantification from RNA sequencing data [31]. Genes differentially expressed (DE) between relapsing and non-relapsing conditions were selected with Sleuth, which is a program for the differential expression analysis of RNA-Seq experiments for which transcript abundances have been quantified with Kallisto [32]. With a corrected p-value < 0.02, 214 genes were found as DE between the two groups (relapse and no relapse) with the statistical Wald Test (WT, Figure 1C) and 62 with the more stringent Likelihood Ratio Test (LRT) (Table S5), which is a statistical test of the goodness-of-fit between two models. We finally retained the Wald Test's most extensive list for further analysis because it gives a 'beta' value (size effect) that can be compared to logFC. Thus, finally, 168 genes having an absolute log2 FC between relapse and no-relapse groups greater than 0.5 and a p-value lower than 0.02 [34] were selected ( Figure S3). After in-tersecting these 168 DE genes with the 47 significantly discriminating genes previously found with the microarray technique (p value < 0.05), 20 common genes were highlighted (ANTXR1, CTHRC1, DCN, FAM179A/TOGARAM2, FAP, FN1, FRMD6, GLT8D2, INHBA,  IRS1, ITGB7, MYO1F, NT5E, PLAU, PBXIP1, PTPN22, SLC39A14, SULF1, THBS2, and LTBP2) (beta value/WT "log2FC" estimator greater than 0.5 and p value < 0.02; Table 2: green columns, red lines) including 5 and 15 genes overexpressed in relapse and norelapse groups, respectively ( Figures S4 and S5). On these 20 genes, 18 (ANTXR1, CTHRC1,  DCN, FAM179A/TOGARAM2, FAP, FN1, FRMD6, GLT8D2, INHBA, IRS1, ITGB7, MYO1F,  NT5E, PLAU, PTPN22, SLC39A14, SULF1, and THBS2) have also been validated with high-throughput Fluidigm ® technology ( Figure S5). Among them, the FAM179A and FN1 genes were already selected after logistic regression on the ∆Cq expression from the high-throughput Fluidigm ® technology data (Table S4).

Discussion
Although systemic ALK + ALCL are highly chemosensitive tumors, with a 5-year OS rate of 80%, 30% usually experience relapse within the year following the end of treatment. Moreover, these "early" relapses are associated with a bad prognosis [35]. In the present study, we sought to identify a molecular signature that was associated with clinical outcome (relapse/progression versus non-relapse) in systemic ALK + ALCL. From a cohort of 48 tumor samples obtained at diagnosis, our supervised analysis based on micro-array data identified 47 genes that significantly discriminated the two groups. Twenty of them were also found to be differentially expressed by RNA sequencing, supporting their biological significance.
In the microarray molecular signature of the relapsing group, the most significant p-values included the overexpression of six genes (FAM179A, ITGB7, MYOF1, SLC40A1 or Ferroportin-1, PTPN22, and ROGDI). Many of the genes that were overrepresented and up-regulated in this group were implicated in the regulation of the immune response and in T-cell activation and proliferation. For the non-relapsing group, INHBA, GPC6, SULF1, FN1, PLAU, and FAP were the top six genes overexpressed with the most significant p-values. Eight of them were also differentially expressed using RNA-seq analysis (FAM179A, ITGB7, MYOF1, and PTPN22 in the relapse group, and FAP, FN1, and INHBA, SULF1 in the norelapse group). Within the genes overexpressed in this non-relapsing microarray signature, there was a statistically significant overrepresentation of genes involved in extracellular matrix (ECM) deposition and organization. The ECM is a highly dynamic structure which is constantly being remodeled and, in the appropriate context, might restrain malignant tumor progression. Although excessive ECM deposition could hinder the diffusion of therapeutic agents [36] and play a role in cell adhesion-mediated drug resistance [37], proteases secreted by tumor cells and/or cells of the micro-environment could lead to its structure breakdown and influence the tumor cell response to chemotherapy. Furthermore, although proteases have long been considered as cancer-promoting factors, recent studies have revealed that they can also elicit tumor-suppressive effects through the stimulation of apoptosis or the inhibition of angiogenesis [38]. This ECM signature probably reflects a strong ECM deposition that could be associated with a peculiar tumor microenvironment less favorable for tumor cells. Interestingly, 19 of the 33 overexpressed genes in the microarray non-relapsing signature and 13 out of the 16 genes in the RNA sequencing non-relapsing signature also belong to the "stromal-1 signature" (including FN1) associated with a better EFS and OS in diffuse large B-cell lymphomas (DLBCL) treated by CHOP or R-CHOP [39]. Thus, our molecular signatures point out that the ECM could be involved in the prognosis and the therapeutic response in ALCL, as it has already been suggested in DLBCL.

Conclusions
We have identified a minimum set of genes whose expression could help to predict clinical outcome at diagnosis. Using two different classification algorithms, we identified 8 genes to be the most powerful at discriminating between tumors that did or did not experience relapse. Intersecting data from microarrays, high-throughput Fluidigm, and RNA-sequencing, this number of genes was further reduced to FAM179A and FN1. As FN1 is an ECM key regulator, we suggest that it might be involved in the prognosis and therapeutic response in ALCL, as already suggested in DLBCL.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/cancers13215523/s1, Figure S1: Correlation (R 2 ) between gene expression levels in samples, determined by microarray or standard RT-qPCR and compared to high-throughput RT-qPCR, Figure  S2: Validation of the 3-genes associated with relapse using Cohort B samples with standard RT-qPCR, Figure S3: Venn diagram showing intersection between the 168 differentially expressed (DE) genes (Sleuth DE_WT) by RNA sequencing, the 47 significantly discriminating genes by the microarray technique and the 29 DE genes validated by high-throughput qPCR, Figure S4: Five genes overexpressed in Relapse group by RNA sequencing, Figure S5: Fifteen genes overexpressed in No-Relapse group by RNA sequencing. Table S1: Overview of clinical ALCL cases used in microarray (all cases; n = 48) and RNA sequencing (bold cases; n = 39) technologies, Table S2: List of high throughput and standard qPCR primers, Table S3: Functional Enrichment Analysis of the differentially-expressed genes between the relapsing and the no relapsing groups using microarray data, Table S4: Mean gene expression values and fold-changes of the 3-gene classifier measured by high-throughput RT-qPCR, standard RT-qPCR and RNA sequencing in all specimens., Table S5: With a corrected p-value < 0.02, 62 genes were found as differentially expressed between the two groups (relapse and no relapse) with the stringent Likelihood Ratio Test (LRT).  Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.