genes

: We propose a computational framework for selecting biologically plausible genes identiﬁed by clustering of multi-omics data that reveal patients’ similarity, thus giving researchers a more comprehensive view on any given disease. We employ spectral clustering of a similarity network created by fusion of three similarity networks, based on mRNA expression of immune genes, miRNA expression and DNA methylation data, using SNF_v2.1 software. For each cluster, we rank multi-omics features, ensuring the best separation between clusters, and select the top-ranked features that preserve clustering. To ﬁnd genes targeted by DNA methylation and miRNAs found in the top-ranked features, we use chromosome-conformation capture data and miRNet 2.0 software, respectively. To identify informative genes, these combined sets of target genes are analyzed in terms of their enrichment in somatic/germline mutations, GO biological processes/pathways terms and known sets of genes considered to be important in relation to a given disease, as recorded in the Molecular Signature Database from GSEA. The protein–protein interaction (PPI) networks were analyzed to identify genes that are hubs of PPI networks. We used data recorded in The Cancer Genome Atlas for patients with acute myeloid leukemia to demonstrate our approach, and discuss our ﬁndings in the context of results in the literature.


Introduction
The advent of new technologies makes multi-omics (mRNA and micro-RNA expression, DNA methylation) and mutational profiles routinely available for each individual patient.These data could potentially give clinicians and researchers a more comprehensive view on any given disease and allow them to more accurately assess existing similarities/differences between patients in terms of their omics, mutational profiles and other (clinical) features.Identification of patients' similarities/differences, especially for highly heterogeneous diseases, finding informative features that contribute to these similarities and, most importantly, suggesting plausible biological interpretation for observed similarities is invaluable for a better understanding of the molecular basis of a given disease and patients' survival trajectories.It may also lead to the discovery of novel therapeutic and diagnostic targets for optimized treatment and disease management for specific groups of patients.
In this paper, we focus on finding informative and biologically plausible features contributing to the similarities of patients, as identified by clustering of multi-omics profiles of patients with Acute Myeloid Leukemia (AML), known to be a highly heterogeneous disease from both a biological and clinical point of view [1].
Genes 2023, 14, 1795 3 of 16 2019) database for patients with AML.First, the similarity network fusion (SNF, [8]) was used to identify patients' fused similarity network based on three types of multi-omics data: mRNA expression of immune genes, miRNA expression and DNA methylation.Signatures based on immune genes have been validated before and proved to be important for survival prediction in several types of cancers (see, for example, [10]).Second, we used spectral clustering to identify the optimal partition of the resulting fused network into homogeneous patients' subgroups/clusters.To identify informative features contributing to similarity of patients' multi-omics profiles, for each cluster we ranked multi-omics features, ensuring the best separation of a given cluster from the remaining ones.For each cluster, we selected top-ranked features that, when used for creating fused similarity network and spectral clustering, ensure 95% accuracy, i.e., classify 95% of patients into the same clusters as in the original clustering based on all features.Immune genes found in the selected top-ranked features in each cluster were combined with immune genes that may be targeted by methylated DNA loci and miRNAs found among those selected top-ranked features.For the former, we used chromosome-conformation capture (Hi-C) data [11]; for the latter, the miRNet2.0[12] software tools were used.To identify subsets of immune genes contributing to the similarity/homogeneity of patients within each cluster and investigate the plausibility of these genes playing a role in AML, these combined sets of immune genes were explored in terms of their enrichment in somatic/germline mutations and GO biological processes/pathways terms.The Molecular Signature Database from GSEA was used to assess the enrichment of these combined sets in several known gene sets often considered to be important in relation to AML.The PPI analysis of the sets of immune genes was performed with the aim of identifying genes that are hubs of PPI networks and, when dysregulated, may influence the largest number of interacting proteins.We discussed our findings in the context of results published so far, although detailed biological interpretation of the identified genes was beyond the scope of this paper.
The NCBI Reference Sequence Database (RefSeq; https://www.ncbi.nlm.nih.gov/refseq/, accessed on 20 November 2019) was used to identify genes and their positions in the GRCh37/hg19 assembly.In total, there were 51,537 gene transcripts of 26,368 genes, henceforth referred to as RefSeq genes.
The dataset of somatic mutations in 200 AML patients was downloaded from TCGA database (https://www.cancer.gov/ccg/research/genome-sequencing/tcga,accessed on 20 November 2019).It contained a total of 2749 mutations in 2275 RefSeq genes and genomic transcripts including 1072 mutations in 871 immune genes.
Chromosome conformation capture (Hi-C) data for all-trans retinoic acid (ATRA)induced HL-60 cells, frequently used as a model of leukemia cell differentiation as reported in Li et al. (2018) [11], were downloaded from the GEO database (four datasets; accession number GSE93997).These datasets contained ~175 million interactions which we have subsequently binned into 40 K regions and counted the frequency of interactions between these 40 K regions.To identify the strongest intra-chromosomal interactions, distributions of interacting frequencies were analyzed individually for each chromosome.For each chromosome, only interactions (and corresponding 40 K bins) recorded in all four datasets and residing within the tails of these distributions that correspond to 5% of all interactions were considered to be the strongest intra-chromosomal interactions.For inter-chromosomal interactions, only interactions and their corresponding bins found in all four datasets were deemed to be significant and used in further analysis.Note that the frequency of intraand inter-chromosomal interactions between two loci is inversely proportional to their closeness within the cell nucleus.
DNA methylation, miRNA expression and mRNA gene expression data were downloaded from TCGA database (https://portal.gdc.cancer.gov/repository,accessed on 20 November 2019) for a cohort of 200 AML patients.Methylation levels at 27,578 CpG loci across the genome were recorded as conventional β values.The mRNA expression of 51,537 genes and gene transcripts was measured in terms of FPKM (fragments per kilobase of transcript per million mapped reads) values that consider the length of the gene sequence to account for biases in the sequencing process.For miRNA expression, RPM (reads per million mapped reads) values that take into account the number of reads mapped to miRNAs and the total number of miRNAs (which was 1881), were downloaded.Patients that exhibited greater than 20% of data missing in a single omics data type were removed.Further, features (either methylation loci, miRNAs or genes/gene transcripts) were omitted if they exhibited greater than 20% missing data across all patients and data types.The remaining missing values were imputed using a K-nearest-neighbor approach (K = 20) based on weighted averages; all data were further normalized using log2 and Z-transform.The resulting datasets for 112 patients consisted of methylation levels at 24,889 loci, expression of 415 miRNAs and 51,537 genes/gene transcripts.
Positions of LD blocks, genes and mutations were binned into corresponding 40 K region(s).When required, genomic positions of features were 'lifted over' to the GRCh37/hg19 assembly, using the Lift Genome Annotation program available at https://genome.ucsc.edu/cgi-bin/hgLiftOver (accessed on 20 November 2019).
Clinical data for each of 112 patients were downloaded from TCGA database and is summarized in Table S1.Note that all datasets were downloaded between October and December 2019 when this study commenced.

Similarity Network Fusion (SNF)
Similarity Network Fusion (SNF) implements a sophisticated method of integrating multi-omics data based on message-passing theory [8].SNF initially constructs similarity matrices for each available data type (e.g., mRNA and miRNA expression, DNA methylation) using the Euclidean distance function and a scaled exponential similarity kernel.Then, SNF uses a K-nearest-neighbor approach to form sparse local similarity matrices that encode the local community structure of the patient similarity network corresponding to each similarity matrix.It assumes that local similarities are more reliable information sources than similarities between patients that do not share the same local neighborhood, since those similarities form weak connections in the patient similarity network.The integration stage involves iteratively updating the full similarity matrices using the local similarity information encoded in the sparse kernel matrices.This is performed using a non-linear combination method that has its roots in message-passing and label propagation theory [18].The final similarity matrices were shown to converge [8] after a suitable number of iterations (experiments have shown that 20 iterations suffice in most cases).The data integration stage is finalized by forming a similarity matrix whose entries are defined as the average of the corresponding entries in each of the final similarity matrices.
A comparison of several multi-omics integration approaches performed by Tini [19] demonstrated that SNF is the most robust method for analyzing complex datasets.It exhibits a unique ability to detect complementary signals and is robust to noise (as it uses reliable local similarity information throughout the integration stage).Furthermore, since the local information (which represents strong similarity connections) is propagated to all similarity matrices in the integration stage, strong similarities are emphasized in the final fused matrix.Hence, SNF exhibits the unique property that it up-weights strong connections and down-weights weak connections in the patient similarity network, which often results in more clearly defined and well-separated patient communities that can easily be captured using spectral graph partitioning techniques.
Following data integration, SNF employs spectral clustering using the NCut algorithm [20] to determine the optimal partition of the network into homogeneous patient subgroups.Spectral clustering partitions the similarity network by using graph-theoretical properties of the similarity network.In turn, NCut aims to minimize between-cluster similarity whilst maximizing the similarity within the same cluster (for more detailed definitions see, e.g., [8]).To compute the optimal number of clusters, we used the EigenGap approach and RotationCost, adopted by SNF.

Identifying Informative Features Contributing to Cluster Separation
Following Wang et al. [8], we ranked features according to their significance in each cluster using a ranking similar to the Normalized differential expression (NDE) index suggested by Tusher [21].A feature in the kth cluster was considered to be significant if its average value in that cluster was different from the average value calculated across all other clusters: Here µ(Cl k ) and µ Cl k (Var(Cl k ) and Var Cl k ) are average values (variances) of a given feature in cluster k and the rest of the patients, respectively; m k and n are the number of patients in cluster k and the entire dataset, respectively.
In total, 33,114 features (415 miRNA expression, 7810 immune gene expressions and DNA methylation levels of 24,889 loci) available for 112 patients were considered for ranking.Finally, we constructed a list of features in descending order of their corresponding NDE values.Features at the top of this list were considered to be more important/informative for separating patients in the kth cluster from the rest of the patients.

Other Software Packages Used
The software package miRNet2.0 a miRNA-centric network visual analytics platform-described in Chang [12] and available at https://www.mirnet.ca/miRNet/home.xhtml(accessed on 20 November 2019), was used to identify potential target genes of miRNAs.Enrichment analyses were performed using Metascape (https://metascape.org/gp/index.html,accessed on 20 November 2019).A few selected gene sets corresponding to the hallmark interferon IFN-α (M5911) and IFN-γ (M5913) response and inflammatory response (M5932) were downloaded from the Molecular Signature Database [22,23], available via the GSEA website (https://www.gsea-msigdb.org/gsea/msigdb/,accessed on 20 November 2019).Protein-protein interaction data from the STRING database (https://string-db.org/,accessed on 20 November 2019) were used to create PPI networks for selected gene sets.Node degree distributions were used to identify the top 5% (1%) genes with the highest number of interactions with other proteins; these genes were considered to be hubs of the networks.

SNF Analysis
First, patient similarity matrices (networks) for each multi-omics dataset-DNA methylation, mRNA expression of immune genes and miRNA expression-were fused together using SNF_v2.1 software and then clustered using spectral clustering by the NCut al-Genes 2023, 14, 1795 6 of 16 gorithm [8].The optimal number of clusters for each data type was assessed by the RotationCost and EigenGap approaches and varied between two and four.Supplementary Figure S1 shows the results for patient clustering using single omics datasets.
It is apparent that these clusters (with exception of a few) exhibit relatively low withincluster similarity.For mRNA gene expression, no visible patient clusters were identified (Supplementary Figure S1a,b).For DNA methylation and miRNA expression, one and two patient clusters, respectively, with higher within-cluster similarity were emerging (Supplementary Figure S1c,d).The between-cluster similarity was generally lower than within-cluster similarity, suggesting some degree of separation between those clusters with respect to a single omics dataset.
Further, patient similarity networks for each data type were fused together using the SNF_v2.1 software [8], with default parameter settings (α = 0.5, T = 20 and K = 20).The fused similarity network was subjected to spectral clustering by NCut.The number of clusters was set to five.This number of clusters was found as an optimal number among first-and second-best choices by the RotationCost and EigenGap approaches, respectively.The heatmap in Figure 1 displays the results of the patient clustering (see also Table 1 for cluster sizes and their characteristics).The heatmap shows a generally good cluster separation and exhibits lower between-cluster similarity as compared to the clustering obtained using single omics datasets.

SNF Analysis
First, patient similarity matrices (networks) for each multi-omics dataset-DNA methylation, mRNA expression of immune genes and miRNA expression-were fused together using SNF_v2.1 software and then clustered using spectral clustering by the NCut algorithm [8].The optimal number of clusters for each data type was assessed by the RotationCost and EigenGap approaches and varied between two and four.Supplementary Figure S1 shows the results for patient clustering using single omics datasets.
It is apparent that these clusters (with exception of a few) exhibit relatively low within-cluster similarity.For mRNA gene expression, no visible patient clusters were identified (Supplementary Figure S1a,b).For DNA methylation and miRNA expression, one and two patient clusters, respectively, with higher within-cluster similarity were emerging (Supplementary Figure S1c,d).The between-cluster similarity was generally lower than within-cluster similarity, suggesting some degree of separation between those clusters with respect to a single omics dataset.
Further, patient similarity networks for each data type were fused together using the SNF_v2.1 software [8], with default parameter settings (α = 0.5, T = 20 and K = 20).The fused similarity network was subjected to spectral clustering by NCut.The number of clusters was set to five.This number of clusters was found as an optimal number among first-and second-best choices by the RotationCost and EigenGap approaches, respectively.The heatmap in Figure 1 displays the results of the patient clustering (see also Table 1 for cluster sizes and their characteristics).The heatmap shows a generally good cluster separation and exhibits lower between-cluster similarity as compared to the clustering obtained using single omics datasets.Cluster 4 is the smallest (11 patients) and most homogeneous cluster; it has the strongest within-cluster similarity in the network.Further, all patients in this cluster characterized as the FAB M3 AML subtype, i.e., acute promyelocytic leukemia (see Table 1).According to the ELN risk categories, every patient in Cluster 4 (with one exception) exhibits a favorable prognosis.Further, only 11 patients in the entire dataset were characterized as the FAB M5 subtype (acute monocytic leukemia); all these patients were found in Cluster 2 (Table 1).Twenty out of 25 patients in this cluster were classified as having intermediate/normal prognosis according to the ELN risk categories.Mutations in the NPM1c gene were found in almost half of patients in Clusters 2 (14/25) and 5 (9/20) (see Supplementary Table S2).Interestingly, the FAB M3 subtype was detected in the absence of any cytogenetic/chromosomal lesion information during data integration and clustering, which suggests the involvement of other biological processes/features in this disease subtype.Other clusters, however, did not significantly associate with either a particular category of existing classification systems or exhibit any distinct cytogenetic or immunophenotypic homogeneity (see Supplementary Table S2), suggesting the involvement of other biological modalities in defining patients' similarity.However, it was noted that patients from poor-risk and favorable-risk categories were rarely clustered together (Table 1).To explore it further, the survival analysis was performed.

Survival Analysis
To find out whether the survival curves are different for patient clusters identified using spectral clustering of the fused network, the Cox log-rank test was performed.The test indicated that there is evidence to suggest that there is a significant (p = 0.00082) difference between the survival profiles of patients in the clusters found (Figure 2).The Kaplan-Meier curve for Cluster 4 showed the highest survival probability.This aligns well with the ELN risk classification of patients in this cluster; all patients in Cluster 4 (with one exception) exhibit a favorable prognosis.Approximately 83% of patients in Cluster 3 were classified as having intermediate/normal and favorable prognosis according to the ELN risk classification, whereas all patients in Cluster 1 are classified as having poor or intermediate/normal prognosis.In general, there is a good agreement between the rankings of survival probabilities and the proportion of patients in each cluster having favorable prognosis (Spearman correlation = 0.82), although the p-value = 0.08859 is slightly higher than the 5% level of significance.
Genes 2023, 14, x FOR PEER REVIEW 8 of 16 subtype.Other clusters, however, did not significantly associate with either a particular category of existing classification systems or exhibit any distinct cytogenetic or immunophenotypic homogeneity (see Supplementary Table S2), suggesting the involvement of other biological modalities in defining patients' similarity.However, it was noted that patients from poor-risk and favorable-risk categories were rarely clustered together (Table 1).To explore it further, the survival analysis was performed.

Survival Analysis
To find out whether the survival curves are different for patient clusters identified using spectral clustering of the fused network, the Cox log-rank test was performed.The test indicated that there is evidence to suggest that there is a significant (p = 0.00082) difference between the survival profiles of patients in the clusters found (Figure 2).The Kaplan-Meier curve for Cluster 4 showed the highest survival probability.This aligns well with the ELN risk classification of patients in this cluster; all patients in Cluster 4 (with one exception) exhibit a favorable prognosis.Approximately 83% of patients in Cluster 3 were classified as having intermediate/normal and favorable prognosis according to the ELN risk classification, whereas all patients in Cluster 1 are classified as having poor or intermediate/normal prognosis.In general, there is a good agreement between the rankings of survival probabilities and the proportion of patients in each cluster having favorable prognosis (Spearman correlation = 0.82), although the p-value = 0.08859 is slightly higher than the 5% level of significance.

Informative Feature Selection
To find features that make the largest contribution to the patients' survival/risk category prediction, as captured by clustering of the fused similarity network, a ranked list of all 33,114 features comprising 415 miRNA, 7810 immune genes/transcripts and 24,889 DNA methylation loci was constructed for each cluster using the NDE index (see Section 2).The values of the NDE index, which, by definition, indicate the potential of each feature to separate patients of a given cluster from the rest of the patients, decrease dramatically even within the top 100-200 ranked features.For many features, especially at the bottom of these rankings, the NDE values are identical, showing that these features make similar, albeit small, contributions to cluster separation.We found that using the top 125 features from each cluster (613 unique features in total across all clusters) allowed us to produce clustering similar (with 95% accuracy) to the one obtained using all 33,114 features.Therefore, we speculate that these top-ranked 125 features from each cluster make the most

Informative Feature Selection
To find features that make the largest contribution to the patients' survival/risk category prediction, as captured by clustering of the fused similarity network, a ranked list of all 33,114 features comprising 415 miRNA, 7810 immune genes/transcripts and 24,889 DNA methylation loci was constructed for each cluster using the NDE index (see Section 2).The values of the NDE index, which, by definition, indicate the potential of each feature to separate patients of a given cluster from the rest of the patients, decrease dramatically even within the top 100-200 ranked features.For many features, especially at the bottom of these rankings, the NDE values are identical, showing that these features make similar, albeit small, contributions to cluster separation.We found that using the top 125 features from each cluster (613 unique features in total across all clusters) allowed us to produce clustering similar (with 95% accuracy) to the one obtained using all 33,114 features.Therefore, we speculate that these top-ranked 125 features from each cluster make the most contribution to cluster separation and could be considered as the most important and informative ones.
The majority (92-98%) of features appearing among the top 125 in each cluster were DNA methylation loci.To find immune genes that could be potential targets of the observed DNA methylation (i.e., their expression may be affected by DNA methylation either via intra-or inter-chromosomal interaction between methylated loci and gene promoter/gene), intra-and inter-chromosomal Hi-C interactions data (see Section 2 for details) were used.For each cluster, we compiled a list of immune genes that could be potential targets of methylation.Further, immune genes listed in TCGA database as potential targets of methylation loci, typically residing within close proximity to the methylation sites but not found following the stringent definition of "strong" intra-chromosomal interaction (see Section 2), were added to the corresponding lists.
Between six and 15 miRNAs were also found among top-ranked features.Immune genes potentially targeted by these miRNAs were found using miRNet [12] and added to the compiled lists of genes.Immune genes found in the top 125 most informative features were also kept.
Using the selection procedures described above, the sets of 1542, 1927, 1328, 1720 and 2782 immune genes (SIGs for short) that may play a role in AML were identified in Clusters 1 to 5, respectively (see Supplementary Table S3).The overall number of different immune genes (including target genes) across all five clusters was 4692, i.e., 60% of all immune genes (7810) used in this study.Note that it is unlikely that all the genes identified using inter-and intra-chromosomal interactions between 40 K fragments play a role in AML; some of them may be simply "bystanders", which happened to occur within the same 40 K regions harboring genes that may play a role in the development of AML.Although a systematic biological interpretation of genes in the identified SIGs was beyond the scope of this paper, we explored their enrichment in GO terms and genes from curated datasets, presence of somatic mutations, occurrence of genes prone to germline mutations (as identified by the GWA study of unrelated AML patients) and presence of genes being hubs of PPI network, with the aim of identifying biologically plausible features contributing to patients' similarity.

Enrichment Analysis
For each of five SIGs identified by selection procedures described above, their enrichment in GO biological processes and pathways terms was performed using Metascape.The SIGs were found to be enriched in 'cytokine signaling in immune system' reactome GO term; p-values corrected for multiple testing (i.e., q-values) were between ~10 −96 and 10 −41 in all five clusters.The SIGs were also enriched in 'signaling by interleukin' reactome term (10 −76 < q < 10 −31 ), 'apoptotic signaling pathway' GO term (10 −76 < q < 10 −40 ) and many others; for the full lists, see Supplementary Table S4.
Using the Molecular Signature Database from GSEA, we downloaded curated sets M5911, M5913 and M5932 containing genes that are up-regulated in response to α and γ interferon proteins and defining inflammatory response in AML, respectively.
The numbers of immune genes shared between SIGs for each cluster and gene lists corresponding to GO terms or curated datasets analyzed are summarized in Table 2. Some of the immune genes were present in several SIGs (i.e., the total number of unique genes is not a sum of genes in all clusters), but the majority of them were unique to each cluster.

Germline and Somatic Mutations in Immune Genes
A dataset of somatic mutations together with their potential target genes was downloaded from TCGA database and consisted of 2749 mutations occurring or residing within close proximity to 2275 RefSeq genes.The most frequently mutated gene was DNMT3A, with 33 mutations in 30 out of 200 AML patients.The next most frequently mutated genes were immune genes TP53 and FLT3; 11 and 13 patients had mutations in these genes, respectively.The majority of mutated genes were harboring between one and four somatic mutations per patient.Out of 7810 immune genes used in this study, 871 genes were harboring 1072 mutations.In a subset of 112 patients selected for this study, 680 somatic mutations were recorded in 581 immune genes.
The dataset of 1754 LD blocks [17] encompassing germline mutations (SNPs) identified by GWA studies [16] in patients unrelated to TCGA AML cohort were found to reside within 1124 distinct 40 K bins.In total, 712 RefSeq genes including 242 immune genes were either harboring or residing within LD blocks corresponding to at least one SNP showing genome-wide significance (p < 5 × 10 −6 ).
We found that immune genes were enriched in both germline and somatic mutations (Fisher's Exact Test, p < 2.2 × 10 −16 ) as compared to non-immune genes.
It is known that chromatin interactions could bring together genes and germline mutations that are not necessarily within the same LD block.We used Hi-C data to identify possible intra-and inter-chromosomal regions strongly (as described in Section 2) interacting with regions harboring germline mutations.For intra-chromosomal interactions, the cut-off interaction frequency corresponding to approximately 5% of the strongest interactions across various chromosomes was found to be 100.Genes residing within these interacting regions were identified using the list of 26,368 RefSeq genes and gene transcripts (including 7810 immune genes) that are recorded in the GRCh37/hg19 assembly.All gene positions were binned into 40 K regions to align them with the Hi-C data.We found that the proportion of immune genes, either harboring/residing within LD block of SNPs or targeted by germline mutations remotely via inter-or intra-chromosomal interactions, was significantly higher (Fisher's Exact Test; see Table 3 for corresponding p-values) for immune genes than non-immune genes.Note that germline mutations were not recorded for TCGA AML cohort; these observations are purely speculative and were made based on the GWAS data reported in [17].Nevertheless, we hypothesized that germline mutations reported in [17] as reaching genomewide significance level are highly likely to be present in TCGA AML cohort.Therefore, all genes either mutated or targeted by these germline mutations were considered as affected by germline mutations.The number of immune genes harboring somatic and germline mutations or that could be affected by germline mutations via chromatin interactions are summarized in Table 4.
A small number of genes in each cluster (apart from Cluster 4) could potentially harbor both germline and somatic mutations.The list of these genes is given in Table 4. 1 genes that could be affected by both germline and somatic mutation are shown in bold.

Identification of Hub Genes
PPI data from the STRING database were used to create PPI networks for the four identified SIGs.The selected gene set for Cluster 5 was too big (2781 genes) for the STRING analysis.Around 96% of immune genes in each SIGs were involved in PPI, with the average number of interactions per gene varying between 22 and 36 (Table 5).Genes with the top 1% of interactions listed in Table 5 were considered to be important hubs in the corresponding PPI networks.Four genes-ACTB, AKT1, TP53 and VEGFA-appeared to be hubs of all four networks.Unique cluster-specific hub genes were also identified (see Table 5).

Gene symbols
Genes with top 1% of interactions

Discussion
In this study we used the similarity network fusion approach [8] to integrate multiomics profiles of 112 AML patients, comprising 24,889 DNA methylation loci, mRNA expression of 7810 immune genes and expression of 415 miRNAs, into a fused similarity network, which was subsequently subjected to spectral clustering.The optimal number of clusters was found to be five.Patients' characteristics by cluster are given in Table 1.Analysis of the resulting clustering showed no strong correlation with either FAB or ELN subtypes/risk categories, apart from Cluster 4 in which all 11 patients were characterized as FAB M3 subtype, with ten patients having a favorable prognosis according to ELN.It is likely that patients in this cluster tend to have a favorable prognosis due to the success of all-trans retinoic acid (ARTA) treatment at targeting the PML::RARA gene fusion [24], which is distinctly present (positive) in three patients that have been tested for this gene fusion in this cluster.It was noted that patients with poor and favorable prognosis according to the ELN risk categories rarely appear in the same cluster.Note that it was impossible to compare the results with the recently proposed revised ELN classification [25]; these scores were not available for the cohort of patients analyzed.Subtyping of AML patients according to the WHO classification [3] by using differentiation markers was also not possible, since for the majority of patients these data were not available.
Despite only modest agreement between cluster assignment and the ELN risk category, the survival analysis shows a significant (p = 0.00082) difference between the survival profiles of patients in different clusters (Figure 2).Although the correlation between the rankings of survival probabilities and proportions of patients in each cluster having a favorable prognosis was noted (Spearman correlation = 0.82), the p-value = 0.08859 did not reach the 5% level of significance.Nevertheless, one may speculate that patients' survival/risk category is largely dependent on the similarities captured by clustering of the fused similarity network that integrates DNA methylation, mRNA expression of immune genes and expression of miRNA.Therefore, the resulting fused similarity network could be used in prediction of survival/risk categories for new patients, using, e.g., the procedure outlined in [8].
We suggested a way of reducing the number of features (immune genes, DNA methylation loci and miRNAs) used in constructing the fused similarity network and subsequent spectral clustering.For every cluster, we ranked each of 33,114 features with respect to their NDE index, reflecting the potential of a corresponding feature to separate patients of a given cluster from other clusters.It appeared that the use of the top 125 features from each cluster is sufficient to obtain a fused similarity network and clustering similar (with 95% accuracy) to the one obtained using all omics data available.We speculated that these top 125 features from each cluster make the most contribution to cluster separation and could be considered as the most important and informative feature in patient clustering and survival prediction.With a small loss of accuracy, this smaller fused similarity network could be used for survival/risk category prediction as outlined in [8].
A small number of immune genes (between three and ten) were found among topranked features.They included the ALDH1A1 gene (found in Cluster 1) that emerges as a significant risk factor in AML [26].The HNMT gene (Cluster 2) was found among six amino acid metabolism-related genes that correlate with the immune microenvironment and could be predictors of the prognosis and immunotherapy response of AML patients [27].The MPO gene (Cluster 3) was one of the five immune genes in a model for predicting non-M3 AML patients' treatment outcome.Note that none of the patients in Cluster 3 are classified as FAB M3 group (see Table 1).The full list of immune genes found among top-ranked features is given in Supplementary Table S3 and their role in AML from a biological point of view could be further explored.
Epigenetic control of gene expression plays a pivotal role in determining the biological behavior of cells.DNA methylation is one such epigenetic mechanism.Not surprisingly, between 92% and 98% of features in the top 125 for each cluster were DNA methylation loci.It is known that chromatin interactions could move distal regulatory elements, promoters or the transcription start sites to the proximity required for transcription regulation of certain genes.To identify these proximal and distal targets of methylation, the set of target genes listed in TCGA database was combined with immune genes found in regions that have strong intra-and inter-chromosomal interactions with regions harboring methylation loci.
MiRNAs are known to regulate most cellular processes and are considered promising therapeutic targets for cancer and other diseases.A small number of miRNAs (between six and 15) was found in the top-ranked features.Their potential target genes were identified using miRNet2.0software [12] for four clusters and added to the compiled lists of genes.
The resulting sets of genes (SIGs) were enriched in several GO biological processes/ pathways terms (see Supplementary Table S4).Comparison with curated sets containing genes that up-regulated in response to α interferon proteins (M5911), γ interferon proteins (M5913) and defining inflammatory response in AML (M5932) shows that there are overlaps with selected immune genes.Further investigation and biological interpretation of these common genes may guide the finding of features responsible for survival differences between patients in the five clusters obtained.
Further, we created PPI networks for four SIGs using the STRING database.Node degree distributions were used to identify the top 1% genes with the highest number of interactions with other proteins; these hub genes were considered to be important for a given network.Their dysregulation may influence the expression of proteins interacting with these hub genes.Four genes-ACTB, AKT1, TP53 and VEGFA-appeared to be hubs of all four networks.It is known that these genes play crucial roles in the development and progression of AML, and dysregulation of these genes contributes to the proliferation, survival and chemoresistance of leukemic cells.Their aberrant expression or mutations were found to serve as important prognostic markers, influencing clinical outcomes in AML patients (see, e.g., [26][27][28][29][30]).Interestingly, all these genes were found to be targeted by various miRNAs occurring among top-ranked features, and, in some cases, by the same miRNAs within the same cluster.For example, in Cluster 1 the ACTB and AKT1 genes are targets of the hsa-mir-192, whereas the TP53 and VEGFA genes are potential targets of the hsa-mir-106a.Moreover, all six somatic mutations recorded in the TP53 gene occur in patients assigned to Cluster 1.In Cluster 4, for example, three genes-ACTB, AKT1 and VEGFA-could be targeted by hsa-mir-1-3p.In addition, in all five clusters the AKT1 gene could be dysregulated by remotely acting germline mutations (possibly occurring within an enhancer) identified by GWA studies in an unrelated cohort of patients.Further understanding of "one-to-many" or "many-to-many" relationship between miRNAs and their targets is challenging but can guide the development of targeted therapies and personalized treatment approaches for AML.
Several cluster-specific hub genes identified, such as CASP3, CTNNB1, HIF1A, IL1B, JUN, MAPK1, UBA52 and UBC, are known to be involved in leukemogenesis.Dysregulation of these genes affects critical cellular processes including apoptosis, cell adhesion, transcriptional regulation, angiogenesis, inflammation and intracellular signaling.
We identified a small number of genes in each cluster (apart from Cluster 4, in which only one patient harboring a somatic mutation was found) that could potentially mutate both in soma and germline (see Table 4).The latter mutations were identified in an unrelated cohort of patients with AML [17] via GWA studies as reaching genome-wide significance level (p < 5 × 10 −6 ).Among these genes are the three oncogenes KRAS, NRAS (Cluster 2) and GLI3 (Cluster 3).In turn, the RUNX1 gene (Clusters 1 and 2) could play a dual role, being both an oncogene and tumor suppressor [31] and is considered to be an important player in AML.The platelet endothelial aggregation receptor 1 (PEAR1) gene (Clusters 1 and 2) could be another candidate for a tumor suppressor gene in AML [32].One may speculate that inactivation of both alleles in these tumor suppressor genes leads to the change of a cell's phenotype in agreement with the "two hit hypothesis" [33].Further inspection of the other genes from this list is required to establish their tumor suppressor potential.

Conclusions
In this paper, we proposed a computational framework for finding informative and biologically plausible features that may contribute to patient similarity as identified by various clustering procedures used for patient stratification.In this study, we used spectral clustering of fused similarity networks.The advantage of using similarity network fusion is its robustness to noise, because it up-weights strong connections and down-weights weak connections in the patients' neighborhood similarity network.
The use of this approach was demonstrated on data available for patients with acute myeloid leukemia.Further, we suggested a way of reducing the number of features used in clustering.Chromosome-conformation capture data, somatic/germline mutations and knowledge of protein-protein interactions were used to understand the observed clustering.This computational framework has the potential to guide researchers in finding plausible explanations for the features found as informative in patient' stratification with respect to any meaningful clustering.Further experimental/biological validation is required in order to establish the functional significance of the genes found.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes14091795/s1, Figure S1: Heatmap of patients' clustering results using; Table S1: Patients' characteristics; Table S2: Immunophenotypic characteristics of patients by cluster; Table S3: Full list of immune genes identified; Table S4: Enrichment of SIGs in GO biological processes and pathways terms.

Figure 1 .
Figure 1.Heatmap of fused similarity network following spectral clustering into five clusters.Darker colors indicate higher similarity.

Figure 2 .
Figure 2. The Kaplan-Meier survival curves for each of five patient clusters identified.

Figure 2 .
Figure 2. The Kaplan-Meier survival curves for each of five patient clusters identified.
Figure 1.Heatmap of fused similarity network following spectral clustering into five clusters.Darker colors indicate higher similarity.

Table 2 .
The number of immune genes shared between SIGs for each cluster and gene lists corresponding to GO terms and selected curated datasets.

Table 3 .
Number of genes harboring or targeted by remotely acting germline mutations.

Table 4 .
Immune genes in SIGs that harbor somatic and germline mutations or could be affected by germline mutations via chromatin interactions.

Table 5 .
Immune genes in SIGs involved in protein-protein interactions.(Note that the SIG of Cluster 5 was too big for analysis by STRING.)