2.1. QC of the MMRC Dataset for Integrative Network Analysis
Gene expression profiles (
n = 304, GSE26760), CNV profiles (
n = 254, GSE26849), and associated clinical data for the MMRC study were downloaded from the GEO database [
49]. Sample-labeling errors, including sample mislabeling, swapping, duplication, and contamination, frequently occur in such multi-omics datasets [
51,
52]. Thus, it is critical to perform extensive QC to identify and correct such errors before integrating gene expression and CNV profiles for further analysis.
In MM, genomic alterations are common [
49], and gene expression variations are strongly associated with such alterations [
53]. In the MMRC dataset, the expression levels of 8182 genes were significantly associated with CNVs that contained the respective genes in cis form (cis-regulation), with a Benjamini–Hochberg multiple testing corrected
p-value < 0.01. Probabilistic multi-omics data matcher (proMODMatcher), a computational approach to identify and correct sample-labeling errors in multiple types of omics data [
51,
52], was applied to match mRNA and CNV profiles in the MMRC datasets. Among 246 pairs of gene expression and CNV profiles sharing a patient name, 10 profile pairs were not self-matched (i.e., the mRNA and CNV data annotated as coming from the same sample were not correlated; see Supplementary Materials, Table S1a). Moreover, we detected six pairs of gene expression and CNV profiles that were cross-matched (i.e., the mRNA profile of one patient was significantly correlated with the CNV profile of another patient; see
Supplementary Materials, Table S1b). In total, 252 (246 self-matched and six cross-matched) pairs of gene expression and CNV profiles were used in the network reconstruction process.
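The self-match and cross-match logic described above can be illustrated with a minimal sketch. This is not the proMODMatcher algorithm, which is probabilistic; it is a simplified stand-in that assumes each sample has an expression vector and a CNV vector over the same ordered set of cis-regulated genes, and it uses a hypothetical correlation cutoff (`min_r`). The sample identifiers and values are toy data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def match_profiles(expr, cnv, min_r=0.5):
    """Classify each expression profile by its best-correlated CNV profile.

    expr, cnv: dict mapping sample ID -> values over the same cis genes.
    min_r is a hypothetical cutoff, not a value taken from the paper.
    """
    self_matched, cross_matched, unmatched = [], [], []
    for sample_id, e in expr.items():
        scores = {cnv_id: pearson(e, c) for cnv_id, c in cnv.items()}
        best = max(scores, key=scores.get)
        if scores[best] < min_r:
            unmatched.append(sample_id)
        elif best == sample_id:
            self_matched.append(sample_id)
        else:
            cross_matched.append((sample_id, best))
    return self_matched, cross_matched, unmatched

# Toy data: the CNV profiles labeled "B" and "C" have been swapped.
expr = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1], "C": [1, 3, 2, 4]}
cnv = {"A": [0.1, 0.2, 0.3, 0.4], "B": [0.1, 0.3, 0.2, 0.4], "C": [0.4, 0.3, 0.2, 0.1]}
print(match_profiles(expr, cnv))
```

In this toy case, sample A is self-matched, while B and C are reported as a reciprocal cross-match, which is the pattern expected from a sample swap.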
To identify the source of these sample-labeling errors in the cross-matched mRNA-CNV profile pairs (whether the gene expression or CNV profiles were mislabeled), the samples’ metadata (clinical annotations) and information inferred from mRNA and CNV profiles were compared. Based on the expression levels of the X-chromosome specific gene XIST and Y-chromosome specific gene RPS4Y1, MMRC samples were clustered into three groups to achieve a clear separation of samples into sex-specific groups (Male Sex, Female Sex, and No Call groups; left panel in
Supplementary Materials, Figure S1). There were inconsistencies between annotated sex/gender in the metadata and inferred sex based on the gene expression data for three samples: MMRC0021, MMRC0197, and MMRC0312 (
Supplementary Materials, Table S1a). Sex was not inferred for samples in the third group, which contained profiles in which the expression levels of both XIST and RPS4Y1 were too low due to X- or Y-chromosome loss in MM cells (left-bottom group in the left panel in Supplementary Materials, Figure S1). Immunoglobulin isotype can also be inferred from gene expression profiles. Comparison between the annotated and inferred isotypes revealed one inconsistent sample for light chain isotype (MMRC0207) and two for the heavy chain isotype (MMRC0039 and MMRC0312;
Supplementary Materials, Table S1a). Analysis of hyperdiploidy estimated based on CNV profiles [
54] and annotated hyperdiploidy yielded one inconsistency for sample MMRC0442 (
Supplementary Materials, Table S1a). By comparing the clinically annotated and inferred metadata, the source of labeling errors was unambiguously identified for a few cross-matched profiles. For example, the mRNA and CNV profiles of MMRC0441 and MMRC0442 were cross-matched, and the labeling errors were likely due to a sample swap in CNV profiles based on the consistency with metadata. Similarly, mRNA and CNV profiles of MMRC0312 and MMRC0404 were cross-matched. The sex inconsistency between annotated and inferred sex of MMRC0312 indicated that a sample swap occurred in the mRNA expression profiles (
Supplementary Materials, Figure S1).
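The comparison of annotated and expression-inferred sex can be sketched as follows. The study clustered samples on XIST and RPS4Y1 expression; the sketch below replaces clustering with a single fixed cutoff for simplicity. The cutoff value, the sample identifiers, and the expression values are all hypothetical.

```python
def infer_sex(xist, rps4y1, cutoff=6.0):
    """Infer sex from XIST and RPS4Y1 expression.

    cutoff is a hypothetical log2-expression threshold (an assumption,
    not a value from the paper). Samples with both genes low, or both
    high, are left as "No Call".
    """
    xist_high = xist >= cutoff
    rps4y1_high = rps4y1 >= cutoff
    if xist_high and not rps4y1_high:
        return "Female"
    if rps4y1_high and not xist_high:
        return "Male"
    return "No Call"

def find_sex_mismatches(samples):
    """Return IDs whose annotated sex disagrees with the inferred sex."""
    mismatches = []
    for sample_id, (annotated, xist, rps4y1) in samples.items():
        inferred = infer_sex(xist, rps4y1)
        if inferred != "No Call" and inferred != annotated:
            mismatches.append(sample_id)
    return mismatches

# Toy samples: (annotated sex, XIST expression, RPS4Y1 expression)
toy_samples = {
    "S1": ("Female", 9.1, 3.2),
    "S2": ("Male", 3.0, 8.7),
    "S3": ("Male", 9.4, 3.1),    # annotated Male, expression suggests Female
    "S4": ("Female", 3.3, 3.0),  # both genes low: No Call, not flagged
}
print(find_sex_mismatches(toy_samples))  # ['S3']
```

Samples in the No Call group are deliberately excluded from the mismatch list, mirroring the text's statement that sex was not inferred for them.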
After sample-labeling error correction, 304 additional cis-associations (8486 in total) were identified between gene expression levels and the respective cis-CNVs at a multiple testing adjusted p-value < 0.01. These results confirm that sample-labeling errors occur frequently in large, complex datasets and that proMODMatcher can efficiently identify and correct them, improving power and accuracy in subsequent analyses.
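For readers unfamiliar with the Benjamini–Hochberg correction used to call cis-associations, a minimal sketch of the adjustment follows. The p-values in the example are toy numbers, not values from the MMRC dataset, and the function name is illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values for five hypothetical gene-CNV pairs
raw = [0.001, 0.008, 0.039, 0.041, 0.9]
adj = benjamini_hochberg(raw)
n_significant = sum(p < 0.01 for p in adj)
print(adj, n_significant)
```

With these toy inputs only the first pair passes the adjusted threshold of 0.01, even though two raw p-values are below 0.01.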
2.2. Multiple Myeloma Molecular Causal Network (M3CN)
CNVs often occur in large blocks, and genes residing in the same block likely share common CNV profiles across many samples. Because shared CNVs affect the expression levels of genes in cis, the expression levels of these genes are likely to be correlated [
55,
56,
57]. For example, the most significant cis-CNV regulated genes (of the 8394 genes identified at false discovery rate (FDR) value < 0.01) were enriched in chr1q (
p-value = 2.9 × 10⁻⁵⁴), and these genes were co-expressed (
Supplementary Materials, Figure S2). To distinguish gene co-expression due to biological regulations of core molecular and cellular functions from co-expression that is more an artifact of genomic co-localizations to common CNV regions, we included CNV data as nodes in the construction of the molecular causal MM networks, using our previously described Bayesian network reconstruction algorithm [
58,
59]. The input to the reconstruction algorithm comprised 7920 informative genes (mean expression level > 4.8 and variance of expression level > 0.4; see Supplementary Materials, Table S2), of which 3724 were cis-regulated by CNVs (Supplementary Materials, Table S3). In total, 11,644 nodes (7920 for gene expression and 3724 for CNVs) were included in constructing the M3CN using RIMBANet [
50] (detailed in Methods). The network reconstruction process searches for a structure and associated parameters that can best explain the given data, which can be decomposed into a series of substructures (Methods). Given a potential regulation between nodes X and Y (i.e., X and Y are strongly associated), the joint probability P(X, Y) can be represented as the structure X→Y, P(X)P(Y|X); the structure Y→X, P(Y)P(X|Y); or the structure in which X and Y are both regulated by a third node Z (Figure 2A). Even though there is a directed edge between X and Y, the structures X→Y and Y→X are Markov equivalent (i.e., they have the same probability given the data, so that they are statistically indistinguishable). However, when cis-CNV nodes are included, serving as a source of perturbation acting on X and Y, the structures X→Y and Y→X (
Figure 2B) are no longer equivalent, so that potential causal relationships between X and Y can be inferred unambiguously. In addition, when conditioning on a given CNV, gene expression correlations due to chromosome co-localization can be filtered out. For example, of the 140,283 pairs (X, Y) of genes that were cis-regulated by CNVs, located on the same chromosome, and associated at a multiple testing adjusted p < 0.01, only 49% remained associated after conditioning on CNV_x (or CNV_y) (X, Y | CNV_x) at the same multiple testing adjusted p-value < 0.01, demonstrating the need to integrate CNV data in order to identify true biological regulation.
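The effect of conditioning on a CNV can be illustrated with a first-order partial correlation. This is only a sketch: the conditional association test actually used in the study is not specified here, and partial correlation is assumed as a simple stand-in. The data are constructed so that two genes, X and Y, are correlated solely because both track a shared CNV.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data: a shared CNV plus gene-specific noise that is mutually
# uncorrelated and uncorrelated with the CNV.
cnv = [1, 1, -1, -1, 1, 1, -1, -1]
noise_x = [1, -1, 1, -1, 1, -1, 1, -1]
noise_y = [1, 1, 1, 1, -1, -1, -1, -1]
gene_x = [2 * c + e for c, e in zip(cnv, noise_x)]
gene_y = [2 * c + e for c, e in zip(cnv, noise_y)]

print(round(pearson(gene_x, gene_y), 3))                # marginal correlation
print(abs(round(partial_corr(gene_x, gene_y, cnv), 3)))  # magnitude after conditioning
```

Here the marginal correlation between the two genes is 0.8, but it vanishes once the shared CNV is conditioned on, which is the kind of co-localization artifact the text describes filtering out.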
The resulting M3CN consisted of 9102 interactions between nodes representing variations in gene expression levels. To assess the quality of the constructed network, M3CN was compared with public network or pathway databases that are not MM specific. Nevertheless, a significant number of regulations in M3CN overlapped with interactions in pathway databases, supporting the biological relevance of the network. For example, over 30% of the regulations captured by the M3CN overlapped with regulatory relationships represented in the KEGG and Hallmark databases (
Supplementary Materials, Table S4), indicating that M3CN recapitulated known biological pathways. Causal links in M3CN were also compared with top correlated gene pairs derived from a more basic co-expression analysis. Among the top 9102 gene pairs ranked by Pearson’s correlation coefficient between their expression profiles, correlations were more often driven by genomic co-localization than by coherent regulation of biological processes (
Supplementary Materials, Figure S3).
The M3CN revealed many known features of MM. For example, the IgH gene [
4] is one of the most frequently translocated genes in MM. The subnetwork associated with IgH is composed mostly of IgH family genes or genes located in the same region as IgH (14q32.33). However, this subnetwork is almost completely isolated from the rest of M3CN, with only a single node (the gene MIR8071-2) connecting it to the rest of the network (
Supplementary Materials, Figure S4). This suggests that the pattern of gene expression variation driven by IgH and its corresponding subnetwork may have a limited impact on the rest of M3CN. Translocations between the immunoglobulin heavy chain locus and oncogene loci, including
CCND1,
CCND3,
MAF,
FGFR3, and
MMSET (
WHSC1) commonly occur in MM patients [
60]. Further, MM patients can be divided into TC1-5 molecular subtypes based on translocations and
CCND1-3 expression levels [
61]. Subtype-specific signatures were derived based on GSE13591 [
62]. Putative key regulators were inferred for the TC subtype-specific signatures, including
CCND1 and
WHSC1 as key regulators for the TC1 and TC4 subtype-specific signatures, respectively. TC1-3 subtypes had one of the D-cyclin genes,
CCND1-3, highly expressed, and the CDK4/6-Rb pathway [
63] activated. The subnetworks for the TC1-3 subtype-specific signatures all significantly overlapped with each other (e.g., the subnetwork for TC1 overlapped the TC2 and TC3 subnetworks, with TC1 2.2-fold enriched for TC2,
p-value = 5.7 × 10⁻²⁵, and 6.2-fold enriched for TC3, p-value = 1.6 × 10⁻³⁹), consistent with the observations that MM patients of TC1-3 subtypes shared similar survival patterns [
64]. In contrast,
WHSC1 (
MMSET) was upregulated in the TC4 subtype;
WHSC1 regulates the histone methylation of MM cells [
65], which in turn regulates cell proliferation. The subnetwork for the TC4-specific signature was distinct from the subnetworks for the TC1-3 specific signatures (overlaps were not significant), consistent with the observations that MM patients of the TC4 subtype had a worse prognosis than those of the TC1-3 subtypes [
64].
At the global level, two highly connected genes, AGPS (Alkylglycerone Phosphate Synthase) and ATRX (Alpha Thalassemia/Mental Retardation Syndrome, X-Linked), each directly regulated dozens of genes (41 and 32, respectively; Supplementary Materials, Figure S4). AGPS is a metabolic enzyme, a critical component in the synthesis of ether lipids, and is up-regulated across multiple types of aggressive human cancer cells and primary tumors [
66]. Multiple studies show that lipid metabolism plays a critical role in MM tumorigenesis and progression [
67]. Previous studies have also shown the potential of AGPS as a therapeutic target of cancer, and multiple AGPS inhibitors are in development [
68]. ATRX is a chromatin remodeling protein whose main function is the deposition of the histone variant H3.3. A recent study showed that ATRX is a potential mutational driver in MM [
69].
2.3. MM Prognostic Signature Genes in the M3CN
Eight large prognostic gene expression signatures were collected from the literature, with the number of genes across these eight signatures ranging from 15 to 92 (
Table 1). The overlap of genes among these different signatures was limited (
Supplementary Materials, Table S5). For example, only one gene,
BIRC5, appeared in more than two signatures (Hose_50, Shaughnessy_70 and Kuiper_92 (EMC-92)). Some of the signatures were enriched in specific chromosomal locations (
Supplementary Materials, Table S6). For instance, 14 genes from Shaughnessy_70 and 13 genes from Burington_92 were located on chromosome 1 (Fisher’s exact test p = 5.54 × 10⁻⁵ and 2.70 × 10⁻², respectively), consistent with the frequently described chromosome 1 aberration in MM [
70].
These prognostic gene signatures were projected onto M3CN to infer the key regulators of each signature. It is of note that only a fraction of genes in each signature were included in the M3CN (
Table 1), as only the most informative genes (i.e., genes that were detectably expressed and varied across samples) were included in the M3CN construction procedure (detailed in Methods). Our network analysis identified 15 potential key regulators for three of the eight prognostic signatures (
Supplementary Materials, Table S7, detailed in Methods), including
NOP16 and
CECR5 for Shaughnessy_70;
MELK,
TPX2, and
NCAPG2 for Kuiper_92; and
CDK1,
DTL, and eight other genes for the Hose_50 signature.
TPX2 is known to regulate
AURKA and interacts with
RHAMM [
71], which is known to correlate with centrosome amplification and with poor prognosis in MM [
72]; the
AURKA inhibitor is a potential treatment for MM [
73].
CDK1 is one of the key regulators, and its inhibition has potential as an anti-cancer treatment [
74]. Cyclin-dependent kinase inhibitors have been shown to induce cell cycle arrest and eventual apoptotic cell death in MM cells [
75].
Using the 15 putative key regulators that we identified as regulating the prognostic gene signatures, we extracted a subnetwork consisting of 178 nodes and 218 interactions (
Supplementary Materials, Tables S8 and S9) from M3CN (
Figure 3, detailed in Methods) with the 15 key regulators as seeds, referred to as the prognostic subnetwork hereafter. Even though different prognostic signatures overlapped with each other sparsely at the individual gene level (
Supplementary Materials, Table S5), five of the eight prognostic signatures we examined were significantly enriched in this subnetwork (
Table 1). For example, 28, 16, and 14 genes in Hose_50, Kuiper_92, and Shaughnessy_70, respectively, were in the prognostic subnetwork (corresponding to Fisher’s exact test
p-values: 2.20 × 10⁻¹⁶, 2.10 × 10⁻¹³, and 1.45 × 10⁻¹⁴, respectively). Other smaller signatures, such as Kassambara_22 and Reme_19, were also enriched in this subnetwork (p = 4.80 × 10⁻⁷ and 1.55 × 10⁻⁷, respectively).
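The enrichment of a signature in the subnetwork can be sketched as a one-sided Fisher's exact test, which for an overlap count is equivalent to the upper tail of a hypergeometric distribution. The background size and counts in the example are toy numbers chosen for easy checking; they are not the values used in the study, and the function names are illustrative.

```python
from math import comb

def enrichment_pvalue(universe, subnetwork, signature, overlap):
    """P(X >= overlap) when `signature` genes are drawn from `universe`
    genes, of which `subnetwork` belong to the subnetwork."""
    total = comb(universe, signature)
    upper = min(subnetwork, signature)
    tail = sum(
        comb(subnetwork, k) * comb(universe - subnetwork, signature - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def fold_enrichment(universe, subnetwork, signature, overlap):
    """Observed overlap divided by the overlap expected by chance."""
    expected = signature * subnetwork / universe
    return overlap / expected

# Toy example: 10 background genes, 4 in the subnetwork,
# a 3-gene signature with 2 genes falling in the subnetwork.
print(enrichment_pvalue(10, 4, 3, 2))
print(fold_enrichment(10, 4, 3, 2))
```

Exact integer arithmetic via `math.comb` avoids the floating-point underflow that can occur when very small p-values are computed from factorial ratios.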
The prognostic subnetwork was enriched for genes involved in the cell cycle and metabolic processes (
Supplementary Materials, Table S10, detailed in Methods). More specifically, 42 out of 178 genes are involved in the cell cycle (fold enrichment = 8.4, FDR = 1.1 × 10⁻²²), and 25 in mitotic cell cycle process (FDR = 7.81 × 10⁻¹⁸). Other top pathways include DNA replication (20 genes, FDR = 1.42 × 10⁻¹⁵), cellular macromolecule biosynthetic process (20 genes, FDR = 8.8 × 10⁻⁵), and cellular response to DNA damage stimulus (16 genes, FDR = 1.75 × 10⁻⁸). The results are consistent with the essential roles cell cycle genes are known to play in MM progression, and the therapeutic implications [
76].
To assess the prognostic value of the prognostic subnetwork and of other prognostic signatures in the literature, we applied them to the Multiple Myeloma Research Foundation (MMRF) CoMMpass cohort (Methods), which was not used in constructing M3CN or training other prognostic signatures. In addition to the above prognostic signatures, we also included a four-gene signature [
48] that was developed based on the MMRF-CoMMpass RNAseq dataset for comparison. Hierarchical clustering was applied to the CoMMpass RNAseq data based on genes in each signature, and the clustering result suggested that MM patients could be divided into three groups in general (
Figure 4A and
Supplementary Materials, Figure S5). Then, k-means clustering (k = 3) was applied to the data (
Figure 4B and
Supplementary Materials, Figure S5), and the three groups were compared in terms of overall survival (OS) (
Figure 4C and
Figure 5) and progression free survival (PFS) (
Figure 4D and
Figure 6). The patient groups defined by the prognostic subnetwork had significantly different survival (log-rank test p-values = 1.8 × 10⁻¹² and 1.7 × 10⁻¹⁰), and the hazard ratios (HRs) between high- and low-risk groups (4.7 and 3.3 for OS and PFS, respectively) were the best among all signatures tested (
Table 2).
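The log-rank comparison of survival between patient groups can be sketched as follows. This is a simplified two-group version (for example, high-risk versus low-risk), whereas the analysis above compared three clusters; the survival times and event indicators are toy values, and the function name is illustrative. The p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x/2)).

```python
from math import erfc, sqrt

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    times_*: follow-up times; events_*: 1 if the event (death or
    progression) was observed, 0 if censored.
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    Assumes at least one event time with more than one subject at risk.
    """
    records = [(t, e, 0) for t, e in zip(times_a, events_a)]
    records += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in records if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n = sum(1 for rt, _, _ in records if rt >= t)              # at risk, both groups
        n_a = sum(1 for rt, _, g in records if rt >= t and g == 0)  # at risk, group A
        d = sum(1 for rt, e, _ in records if rt == t and e)         # events at time t
        d_a = sum(1 for rt, e, g in records if rt == t and e and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    return chi2, erfc(sqrt(chi2 / 2))

# Toy cohort: group A fails early, group B fails late, no censoring.
chi2, p = logrank_test([1, 2], [1, 1], [3, 4], [1, 1])
print(round(chi2, 4), round(p, 4))
```

With only four patients the toy comparison does not reach significance; the sketch is meant to show the mechanics of the test rather than to reproduce the reported p-values.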