Systems Biology Approaches Reveal Potential Phenotype-Modifier Genes in Neurofibromatosis Type 1

Neurofibromatosis type (NF1) is a syndrome characterized by varied symptoms, ranging from mild to more aggressive phenotypes. The variation is not explained only by genetic and epigenetic changes in the NF1 gene and the concept of phenotype-modifier genes in extensively discussed in an attempt to explain this variability. Many datasets and tools are already available to explore the relationship between genetic variation and disease, including systems biology and expression data. To suggest potential NF1 modifier genes, we selected proteins related to NF1 phenotype and NF1 gene ontologies. Protein–protein interaction (PPI) networks were assembled, and network statistics were obtained by using forward and reverse genetics strategies. We also evaluated the heterogeneous networks comprising the phenotype ontologies selected, gene expression data, and the PPI network. Finally, the hypothesized phenotype-modifier genes were verified by a random-walk mathematical model. The network statistics analyses combined with the forward and reverse genetics strategies, and the assembly of heterogeneous networks, resulted in ten potential phenotype-modifier genes: AKT1, BRAF, EGFR, LIMK1, PAK1, PTEN, RAF1, SDC2, SMARCA4, and VCP. Mathematical models using the random-walk approach suggested SDC2 and VCP as the main candidate genes for phenotype-modifiers.


Introduction
Neurofibromatosis type 1 (NF1) is a disease with a worldwide birth incidence of 1 in 2500 and a prevalence of at least 1 in 4000 [1]. The main clinical features are café-au-lait spots, axillary and inguinal freckling, cutaneous and subcutaneous neurofibromas, and Lisch nodules, occurring in almost every NF1 patient. Other less -ommon characteristics are scoliosis, macrocephaly, learning disabilities, plexiform neurofibromas, and multiple other benign and malignant tumors [2,3]. Inter-familial and intra-familial variability in NF1 is extensive: cutaneous neurofibromas may vary in number from dozens to thousands; about 30% to 50% of patients are affected by large plexiform neurofibromas; only about Main steps of the present study. The scheme shows the main steps in chronological order to identity potential NF1 phenotype-modifier genes. Gene Ontology (GO) 1 ; Human Phenotype Ontology (HPO) 2 ; Gene Expression Omnibus (GEO) 3 ; The Cancer Genomic Atlas (TCGA) 4.

Gene and Phenotype Ontologies Analyses
Gene Ontology (GO) describes a biological domain considering three aspects: molecular function, cellular component, and biological process. The Human Phenotype Ontology (HPO) provides a standardized vocabulary of phenotypic abnormalities encountered in human diseases. NF1 GO biological processes and NF1 HPO were analyzed by two coworkers individually. The chosen ontologies are listed in Table S1. HPO selection provided 1697 genes related to NF1 phenotype (OMIM 162200), whilst GO filter yielded 1449 genes included in the same ontologies previously selected for the NF1 gene. When comparing both strategies, it was observed that HPO and GO analysis had 265 genes in common ( Figure  2a). To assemble a network of the ontologies' selected genes, we used the STRING tool, observing protein-protein interactions (PPI) that were previously described by experimental assays. The separate networks generated for GO and HPO analyses are represented in Figure S1 and Figure S2, respectively. A Figure 1. Main steps of the present study. The scheme shows the main steps in chronological order to identity potential NF1 phenotype-modifier genes. Gene Ontology (GO) 1 ; Human Phenotype Ontology (HPO) 2 ; Gene Expression Omnibus (GEO) 3 ; The Cancer Genomic Atlas (TCGA) 4 .

Gene and Phenotype Ontologies Analyses
Gene Ontology (GO) describes a biological domain considering three aspects: molecular function, cellular component, and biological process. The Human Phenotype Ontology (HPO) provides a standardized vocabulary of phenotypic abnormalities encountered in human diseases. NF1 GO biological processes and NF1 HPO were analyzed by two coworkers individually. The chosen ontologies are listed in Table S1. HPO selection provided 1697 genes related to NF1 phenotype (OMIM 162200), whilst GO filter yielded 1449 genes included in the same ontologies previously selected for the NF1 gene. When comparing both strategies, it was observed that HPO and GO analysis had 265 genes in common (Figure 2a). To assemble a network of the ontologies' selected genes, we used the STRING tool, observing protein-protein interactions (PPI) that were previously described by experimental assays. The separate networks generated for GO and HPO analyses are represented in Figure S1 and Figure S2, respectively. A combined network, comprising NF1 direct interactions (first neighbors) for both GO and HPO strategies is available in Figure 2b. i.e., they have the potential to impact the whole network even when having few connections (bottleneck nodes). Thus, nodes can be visualized in four categories: (1) large/dark-orange nodes = hub/bottleneck nodes; (2) large/blue nodes = hub/non-bottleneck nodes; (3) small/dark-orange nodes = non-hub/bottleneck nodes; and (4) small/blue nodes = non-hub/non-bottleneck hubs. NF1 is represented by the yellow node on the right side of AKT1.
According to this strategy, AKT1 presents the highest levels of betweenness and closeness centrality. However, despite NF1 being a highly connected protein in the network evaluated, it presented low levels of betweenness and closeness centrality, as can be observed by the node size (small) and color (light yellow, compared to the dark orange elements). Hence, we aimed to evaluate HPO and GO networks separately using the same approach to minimize the possibility of overlooking potential phenotypemodifier genes, as described in the following section.

Forward and Reverse Genetics Strategies
As mentioned before, GO and HPO databases are related, respectively, to the gene function and phenotype association. Hence, the observations of their independent networks, previously represented in Figure S1 and Figure S2, were based on the forward and reverse genetics concepts.
When evaluating betweenness and closeness centrality by the forward genetics strategy (GO network), six genes were selected: AKT1, RAF1, LIMK1, BRAF, EGFR, and PTEN. In the other analysis, the reverse genetics approach (evaluating the HPO network) provided six genes as well: PAK1, VCP, AKT1, SMARCA4, RAF1, and PTEN. Together, the strategies provided nine candidate genes for neurofibromatosis phenotype modifiers. Besides, the network communities were evaluated, and AKT1 was identified as the network main hub, whilst the RAF1 gene had the highest score for authority. The authority score estimates the importance of the node itself, and the hub score measures this importance  Figure 3. Betweenness centrality and closeness analysis of the STRING network previously generated in Figure 2b using GO and HPO. Large nodes have a more central role in communication among other nodes (hub/hub-like nodes), i.e., more connections. The darker orange the nodes are, the faster information flows towards the central node; i.e., they have the potential to impact the whole network even when having few connections (bottleneck nodes). Thus, nodes can be visualized in four categories: (1) large/dark-orange nodes = hub/bottleneck nodes; (2) large/blue nodes = hub/nonbottleneck nodes; (3) small/dark-orange nodes = non-hub/bottleneck nodes; and (4) small/blue nodes = non-hub/non-bottleneck hubs. NF1 is represented by the yellow node on the right side of AKT1.

Forward and Reverse Genetics Strategies
As mentioned before, GO and HPO databases are related, respectively, to the gene function and phenotype association. Hence, the observations of their independent networks, previously represented in Figures S1 and S2, were based on the forward and reverse genetics concepts.
When evaluating betweenness and closeness centrality by the forward genetics strategy (GO network), six genes were selected: AKT1, RAF1, LIMK1, BRAF, EGFR, and PTEN. In the other analysis, the reverse genetics approach (evaluating the HPO network) provided six genes as well: PAK1, VCP, AKT1, SMARCA4, RAF1, and PTEN. Together, the strategies provided nine candidate genes for neurofibromatosis phenotype modifiers. Besides, the network communities were evaluated, and AKT1 was identified as the network main hub, whilst the RAF1 gene had the highest score for authority. The authority score estimates the importance of the node itself, and the hub score measures this importance based on the other nodes which are linked to the main hub. Despite the network statistics having provided these candidates, we wanted to observe whether or not their expression was altered in the absence of NF1. Therefore, the next step was designed to conduct gene expression analyses to evaluate this hypothesis.

Differential Gene Expression Networks
To comprehend the differential gene expressions (DGEs) of the candidate phenotype-modifier genes, we performed secondary expression analysis on the data available in two public repositories: The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO).
Using the GEO database, NF1 knockdown and knockout assays were selected: GSE14038 and GSE115406. The log fold-change (logFC) and false discovery rate (FDR) values for the ten potential modifier genes in each dataset are presented in Table S2.
In TCGA, we selected seven different types of tumors for which samples with nonsense mutations in NF1 were available. We evaluated tumors that presented NF1 nonsense mutations against tumors with wildtype NF1. The logFC and adjusted p-values, after the FDR correction, for the seven tumors evaluated and ten candidate genes are available in Table S3. Despite the somatic origins of these tumors, Cancers 2020, 12, 2416 6 of 23 we believe this information is valuable in order to check how NF1-loss could affect the global gene expression in a tissue/site-specific way, and check for signatures more related to a certain phenotype.
For both TCGA and GEO assays, few genes were evidenced to be significantly differentially expressed, demonstrating that the expression of all the candidates, and for the NF1 gene, is strictly regulated. We did not identify a variable expression profile between the tumors evaluated, and knockout assays. Therefore, we performed other systems biology analysis with the PhenomeScape application v.1.0.4 from Cytoscape software, assembling a complex network (Figure 1, step 3). For that, we used as input (I) the ontologies selected from HPO database ( Figure 1, step 1); (II) the network generated in STRING tool, as also mentioned in step 1 ( Figure 1); and (III) expression data from studies selected from GEO database (Figure 1, step 2). The resulting upregulated genes (overexpressed) were presented in red and the downregulated (lower expression) in green. These networks are available in Figures S3-S7. NF1 is downregulated (lower expression) in the knockdown and knockout studies, and upregulated (overexpressed) in the evaluation of malignant tumors when compared to benign neurofibromas. NF1 is absent from the network when its expression is not significantly altered in the expression dataset.
Besides the genes previously selected in the forward and reverse genetics analysis, when evaluating the complex networks, we observed that the SDC2 gene also had its expression altered when NF1 was affected. Furthermore, SDC2 is the first neighbor of NF1, which means both genes share a direct protein-protein interaction. Table 1 shows the final list of the 10 genes selected as potential phenotype-modifiers in this study and summarizes the strategies by which they were found. We then generated a complex network comprising all the candidate phenotype-modifier genes selected so far, and the NF1 phenotypes they are related to ( Figure 4); the phenotypes were provided by the PhenomeScape tool, according to the data available in HPO database. Finally, we used a mathematical model to evaluate whether one of those genes could be a stronger candidate as a phenotype-modifier than the others, which is described in the following section.

Systems Biology Approaches Reveal 10 NF1 Phenotype-Modifier Candidate Genes
Cancers 2020, 12, x FOR PEER REVIEW 0 of 27 Figure 4. A complex network evaluation comprising the 10 candidate phenotype-modifier genes and the NF1 phenotypes they are related to. The ten candidate NF1 phenotype-modifier genes suggested by our analysis are represented with the NF1 phenotypes they are related. Yellow nodes: genes selected with the forward genetics strategy; red nodes: genes selected with the reverse genetics strategy; purple nodes: genes selected by forward and reverse genetics strategies; green node: gene selected by evaluating the differential gene expression in the complex network. Blue squares: phenotypes provided by the PhenomeScape tool, according to the associations presented in the HPO database.

Random Walk Analysis
A random walk is a mathematical model known as a random process. It is based on the idea that a gene (node) is an imaginary particle that performs a succession of random steps (interactions) in a Figure 4. A complex network evaluation comprising the 10 candidate phenotype-modifier genes and the NF1 phenotypes they are related to. The ten candidate NF1 phenotype-modifier genes suggested by our analysis are represented with the NF1 phenotypes they are related. Yellow nodes: genes selected with the forward genetics strategy; red nodes: genes selected with the reverse genetics strategy; purple nodes: genes selected by forward and reverse genetics strategies; green node: gene selected by evaluating the differential gene expression in the complex network. Blue squares: phenotypes provided by the PhenomeScape tool, according to the associations presented in the HPO database.

Random Walk Analysis
A random walk is a mathematical model known as a random process. It is based on the idea that a gene (node) is an imaginary particle that performs a succession of random steps (interactions) in a network [32]. Our aim was to evaluate whether these random steps could lead the gene to the phenotype, which was set as neurofibromatosis type 1 (OMIM 162200). For this goal, we performed the random walk analysis with the RandomWalkRestartMH package in R v.3.6.2.
According to this mathematical model, the NF1 gene only had to take "one step" (one interaction) to reach the phenotype OMIM 162200. The genes SDC2 and VCP had to take only two steps (two interactions) ( Figure 5), whilst the other eight candidates needed more interactions to cause the syndrome ( Figure S8). Genes LIMK1 and PAK1 also needed a higher number of potential interactions, and hence more steps, than the others.

Literature Review and Genomic Databases Evaluation
To check for genetic variants already described in our 10 candidate genes, we looked at two databases: The Genome Aggregation Database (gnomAD v. 2.1.1), which spans 125,748 exomes and 15,708 genomes from unrelated individuals; and ClinVar, which aggregates information about genomic variation and its relationship to human health. For gnomAD, we found a total of 11,211 variants ( Table 2). SMARCA4 and EGFR have the highest numbers of variants (2575 and 2000, respectively); SDC2 and PTEN the lower (335 and 456, respectively). With this analysis, we confirmed all the candidates as potential phenotype-modifier genes. However, the results using this model pointed to SDC2 and VCP as being more directly connected to the NF1 phenotype.

Literature Review and Genomic Databases Evaluation
To check for genetic variants already described in our 10 candidate genes, we looked at two databases: The Genome Aggregation Database (gnomAD v. 2.1.1), which spans 125,748 exomes and 15,708 genomes from unrelated individuals; and ClinVar, which aggregates information about genomic variation and its relationship to human health. For gnomAD, we found a total of 11,211 variants ( Table 2). SMARCA4 and EGFR have the highest numbers of variants (2575 and 2000, respectively); SDC2 and PTEN the lower (335 and 456, respectively).
In ClinVar, germline variants were reported for all genes (N = 5216), with the exception of SDC2. Almost half (46.4%) of them were submitted as variants of uncertain significance (VUS). EGFR has the highest rate of VUS (67.3%), while all the 36 LIMK1 variants are classified as benign/likely benign (B/LB). On the opposite way, PTEN and BRAF have the highest number of pathogenic/likely pathogenic (P/LP) variants, corresponding to 32.5% and 23.9% of all reported variants, respectively (Table 3). Finally, we also explored TCGA and Genomics Evidence Neoplasia Information Exchange (GENIE) datasets to check for tumor samples harboring both genetic alterations in one of our candidate genes and NF1. Due to lack of samples or to the higher mutational and clinical heterogeneity, we managed to make reasonable assumptions only for AKT1, VCP, and SDC2. More details are presented in the discussion section.

Discussion
It is evident that genetic variants in NF1 do not act alone to determinate disease phenotype. Many factors may contribute to disease variability, including environmental factors, the occurrence of epigenetic alterations, and somatic second hits in NF1-associated tumors. The accumulation of somatic NF1 mutations is much more difficult to evaluate since each tumor needs to be sequenced individually, but it may be responsible for some level of NF1 variability. Other symptoms, like delayed mental development, are less influenced by second hit mutations. Genetic modifiers in a single locus or the interaction between several genes may suppress or enhance disease severity, including genes involved in the pathways other than the NF1-Ras-mTOR pathway. There is evidence that genetic modifiers explain a major fraction of phenotypic variation in NF1 [16]. A few genes and their variants have already been described as phenotype modifiers in literature and were reviewed and summarized in Table 4, but they are still insufficient to explain all the variability found in NF1 patients.
Recently, a review pointed out the main methods with which to discover novel phenotype-modifier genes in Mendelian diseases and formulate hypotheses about other pathways than Ras-NF1 that could be phenotype modifiers [44]. The most used methods to select candidate modifier genes are whole-genome sequencing, genome-wide association studies, and experimental approaches using animal models. Other studies also select candidate modifier genes using differential gene expression analysis [20]. These strategies have proved their value in identifying a few phenotype-modifier genes to date; however, they have some disadvantages, such as the high costs involved, being time-consuming, the use and maintenance of animal models, and the confounding factors in studies with selected NF1 patients [18].
One of these limitations was observed in our expression analysis, for which a few candidate genes were actually differentially expressed. Furthermore, differential expression analysis using TCGA tumor samples (Table S2) generated distinct results when compared to GEO controlled knockdown/knockout experiments (Table S3). Expression may depend on the tumor heterogeneity, i.e., the number of cells that are actually not expressing functional NF1, and its location, since the expression profiles of NF1 and related genes are tissue-dependent. Gene expression is by far the most common analysis among multi-omics studies. Despite that, in many studies it is not possible to obtain a clear scenario of the biological mechanisms disrupted in a disease by evaluating only the mRNA levels. Known disease genes are often not differentially expressed in affected individuals, once the mutations may only alter the protein function or post-translational mechanisms. As a consequence, much information contained in transcriptomics datasets are ignored, demanding an alternate strategy to evaluate these multi-omics assays [45]. On the other hand, the differential gene expression networks also allowed the selection of the SDC2 gene as a new candidate, once its expression was altered when NF1 was affected. This example highlighted the need to evaluate the multi-omics data in a more integrated and multidisciplinary perspective [46].  Network analysis makes it possible to combine different multi-omics studies, a strategy that has been applied in several personalized medicine studies evaluating genetic syndromes with phenotypic variability [47][48][49]. Recently, many projects and consortiums were created, gathering a huge amount of public results with germline and somatic mutation databases, transcriptomics, proteomics, and metabolomics data that can now be evaluated with a systems biology tools [44]. The analysis proposed here can be seen as an optimization in the search for candidate genes acting as phenotype modifiers in NF1, which can later be confirmed by the more robust molecular and functional assays. A brief description of the potential phenotype modifiers found in this study and their variants is provided below to show mechanistic insights and facilitate experimental studies. They are also summarized in Tables 1-3, respectively.
The ontologies selection was an important step in our analysis, especially because NF1 is important in several molecular mechanisms. If not filtered, our analysis could be later compromised by evidencing genes that are not so deeply associated to neurofibromatosis type 1, but which are more frequently studied in cancer (i.e., TP53 and developmental genes). This strategy has also been applied in studies that do not identify (or do not have access to) differentially expressed genes [50,51]. For Human Phenotype Ontology (HPO), our workflow was based on a deep phenotyping strategy (computational analysis of detailed, individual clinical abnormalities) [29], according to the heterogeneity of neurofibromatosis type 1 visualized in our patients. We also aimed to avoid ontologies related to congenital anomalies outside the NF1 spectrum of phenotypes, especially the ones that could have led to embryo lethality or severe impairments (major malformations) that would have been diagnosed before NF1. Embryo development is a critical, stepwise controlled process that can be disrupted by genetic or environmental factors, such as maternal infections or exposures [52]. Since the data on NF1 expression in the embryonic period are scarce, and we do not have information about the maternal genome or environment, we focused on the functional anomalies (more related to the fetal period) that are better characterized in neurofibromatosis type 1.
In our forward strategy, we found the epidermal growth factor receptor (EGFR) that acts upstream of NF1 in the Ras signaling pathway. EGFR belongs to a family of receptor tyrosine kinases that are anchored to the cytoplasmic membrane. EGFR is frequently over-activated in cancer and studies have shown that it is not expressed by normal Schwann cells but it is overexpressed in subpopulations of NF1 mutant Schwann cells [53]. The great involvement of this gene in different cancers and its differential expression patterns in NF1-enriched tissues may indicate that the occurrence of minor variants in this gene could act as phenotype modifiers in NF1. There are 2000 EGFR variants registered in the gnomAD database, 34.1% of them missense mutations. In ClinVar, most of the 199 catalogued germline variants are VUS (N = 134). Among the four P/LP variants, one (C326F) was related to Cowden syndrome. Variants in promoter and UTR regions might also have a potential as phenotype modifiers, since animal models have already shown that high levels of EGFR expression modify the initiation of neurofibromas, increasing their numbers [54].
AKT1, BRAF, LIMK1, PTEN, and RAF1 genes were also suggested by the forward strategy. They encode proteins that act downstream of NF1 in the Ras signaling pathway. Phosphoinositide 3-kinase PI3K/AKT is one of the most frequently activated pathways in cancer. This activation may occur through mutation of multiple genes, including PTEN, PIK3R1, and mTORC1 [55]. AKT1 presented the highest levels of betweenness and closeness centrality in systems biology network analysis, demonstrating itself to be the most relevant gene in the information flux among our selected genes (Figure 3).
AKT1 germline mutations are mainly associated with Cowden syndrome, characterized by the appearance of hamartomas, and an increased risk of developing multiple cancers, especially breast cancer [56]. One particular pathogenic variant, E17K, is reported to be linked to 22 different conditions in ClinVar. This variant was also found in gnomAD in a European individual. To explore how this alteration could modify the phenotype when co-occurring with NF1 mutations, we looked at mutational and clinical data deposited in the GENIE database (v7.0). When excluding AKT1-mutated patients and considering only NF1 mutations with more deleterious effects (nonsense, frameshift, and splicing variants), the most frequent cancer types are non-small cell lung cancer (12%), glioma (10%), and melanoma (9%). However, when grouping samples with both AKT1-E17K and NF1 mutations, breast cancer becomes the most prevalent, corresponding to 73% of all tumors. The link between breast cancer susceptibility and NF1 alterations was already established [18]. However, the mechanism that leads to this specific phenotype remains to be elucidated and AKT1 emerges as a strong candidate.
PI3K-Akt pathway activity is negatively regulated by phosphatase and tensin homolog protein (PTEN) [57]. PTEN is a tumor suppressor and its inactivation has a role in plexiform neurofibroma tumorigenesis and progression to high-grade peripheral nerve sheath tumors in the context of NF1 loss in Schwann cells, which is a very variable symptom in NF1, and may also participate in the mechanism of tumorigenesis of other tumors related to NF1 [58,59]. There are not many variants catalogued for PTEN in gnomAD (N = 456), but 1546 were already reported in ClinVar, with 35% still being classified as VUS. The phenotypes related to PTEN variants are, as expected, similar to AKT1, including Cowden syndrome and hereditary breast and ovarian cancer syndrome. Considering the importance of both AKT1 and PTEN in tumorigenesis, variants in the corresponding genes, not necessarily pathogenic, could act as a modifier of NF1 disease and need to be further investigated.
In contrast, BRAF studies in NF1 patients were already conducted. BRAF gene encodes a protein belonging to the RAF family of serine/threonine protein kinases. This protein plays a role in regulating the MAPK/ERK signaling pathway, which affects cell division, differentiation, and secretion. Germline mutations in BRAF were previously associated with cardiofaciocutaneous, Noonan, and Costello syndromes [60]. In ClinVar, 334 germline variants were already submitted, 80 being classified as P/LP and 117 of the remainder as VUS. A recent study analyzed a cohort of 100 patients clinically suspected of NF1 and identified 73 NF1 mutations and two BRAF novel variants. The clinical features of NF1 patients with co-occurrence of NF1-BRAF mutations were severe, and BRAF variants may have a synergistic role in determining NF1 phenotype [61].
Another member of the same family, the RAF1 gene, was suggested by the forward strategy. The RAF1 gene encodes a serine/threonine kinase protein that functions downstream of RAS and activates MEK1 and MEK2. In GENIE, RAF1 mutations are observed in various cancers. LEOPARD and Noonan syndromes were already associated in ClinVar with 6 and 28 pathogenic RAF1 variants, respectively. The other 197 variants were still classified as VUS and included conditions such as other rasopathies, chordoma, and retinoblastoma.
Finally, by the forward strategy, LIMK1 was suggested as a phenotype modifier. The N-terminal domain of LIM kinase 1 (LIMK1) regulates actin dynamics, affects cell adhesion and migration by phosphorylating cofilin, and negatively regulates the Rac1/Pak1/LIMK1/cofilin pathway [62]. NF1 is an upstream regulator of LIMK1 by acting on cofilin phosphorylation. When NF1 is mutated, this pathway is affected, possibly influencing neuronal development and cognitive deficits associated with the disease [62,63]. We found 1133 LIMK1 variants in gnomAD, but only 36 reported as germline in ClinVar, all of them classified as benign/likely benign. That may indicate that mutated LIMK1 alone is not pathogenic, but it does not exclude a combined effect of NF1 variants acting as a phenotype modifier.
By the reverse strategy, AKT1, PTEN, and RAF1 were also suggested, reinforcing the possible roles of these genes in NF1 phenotypes. Additionally, PAK1, SMARCA4, and VCP were found. The kinase PAK1 is a Rac/CDC42-dependent serine/threonine kinase that acts by activating several kinases such as RAF, ERK, and LIMK1, and other related pathways by activating TGFα and VEGF. PAK1 is required for the malignant growth of RAS transformants in NF1 neurofibrosarcoma cell lines [64,65]. There are not many PAK1 variants registered in gnomAD (N = 774) and only six in ClinVar. However, two LP variants (Y131C and Y429C) reported in ClinVar were associated with an intellectual developmental disorder with macrocephaly, seizures, and speech delay, phenotypes that are reported in NF1 patients.
SMARCA4 is a central component of the switch/sucrose-non-fermentable (SWI/SNF) chromatin remodeling complex. Inactivating mutations and loss of expression in several components of this complex have been implicated in carcinogenesis [66][67][68]. Thus, variants in one of these genes might influence the NF1 phenotype. SMARCA4 has the highest number of variants in both gnomAD and ClinVar among our candidate genes: 2575 and 2310, respectively. Peripheral nerve sheath tumors were already reported in patients with the syndrome carrying SMARCB1 mutations, which belongs to the same family of SMARCA4 [14]. Loss of SMARCB1 was also related to Schwannoma, another phenotype found in NF1 despite being more frequent in NF2 [69].
The valosin-containing protein (VCP) appeared as a strong candidate for being an NF1 phenotype modifier by our random-walk analysis. VCP gene is associated with the multisystem degenerative autosomal dominant disorder of inclusion body myopathy with Paget disease of bone and frontotemporal dementia (IBMPFD) and mutations were related to 1-2% of amyotrophic lateral sclerosis cases [70]. However, missense mutations in VCP, and low-effect and low-penetrant mutations in this gene, have controversial roles in causing disease. Neurofibromin interacts with VCP through its Leucine-Rich Domain (LRD)-domain [71]. Patients with NF1 who have mutations in the LRD coding region were described to be more prone to developing cognitive deficits than those with mutations elsewhere in the NF1 gene [72]. In the same study, it was observed that point mutations in the LRD coding region in the NF1 gene abolished the ability of NF1 to interact with VCP, while VCP mutants were shown to have reduced affinity for NF1. Interestingly, non-disease-associated polymorphisms in the LRD region of the NF1 gene may increase the risk of an IBMPFD patient developing dementia. In the same way, polymorphisms in the VCP gene that code for domains that interact with NF1 might influence the NF1 phenotype. These data obtained from literature research reinforce the accuracy of our systems biology analysis and random-walk mathematical model, which pointed VCP as a strong NF1 phenotype modifier candidate.
It would not be a surprise if VCP variants were found co-occurring with NF1 alterations, especially the ones in ATPAse domains 1 and 2 (D1 and D2), responsible for interacting with neurofibromin's LRD-domain. For example, in ClinVar only two variants are reported as pathogenic D1/D2, one in each domain. On the other hand, 34 remain as VUS, 23 in D1, and 11 in D2. In gnomAD, more than half of the (N = 491) cataloged VCP variants are located in D1 and D2 domains. Looking at TCGA, there are few samples (N = 28) with NF1 mutations co-occurring with variants in VCP D1/D2 domains, most of them (47.2%) from uterine corpus endometrial carcinomas.
The last gene that was suggested by our analysis is SDC2, which was found by differential gene expression networks and pointed to as a strong candidate by our mathematical model. The heparan sulfate part of SDC2 interacts with extracellular matrix proteins and growth factors to act as an adhesion molecule and as a coreceptor [73]. Variants in this gene might be associated with autism spectrum disorder [74]. Interestingly, some studies showed a higher frequency of autism spectrum disorder in NF1 children, and this is a variable condition in NF1 that might be influenced by variants in other genes [75]. Despite SDC2 emerging as a strong candidate in our study, only 335 variants were found in the gnomAD database and none in ClinVar, suggesting a highly conserved gene. However, the lack of SDC2 in ClinVar may merely reflect its absence from gene panels used for diagnostic purposes. On the other hand, TCGA somatic samples carrying both NF1 and SDC2 mutations are scarce (N = 14/10,437), most of them (57.1%) also related to uterine corpus endometrial carcinomas. This finding does not exclude the gene as a phenotype modifier, but variants co-occurring with NF1 alterations might be a rare event.
The literature search for variants and functions of the candidate modifier genes identified by our strategy shows that this is an economical and accurate way to filter and select genes that would be further validated by experimental assays. As a perspective, variants in the ten genes selected by our strategy will be searched in NF1 patients with different symptoms. Many strategies could be used to subsequently evaluate and validate the selected genes. One of them is to identify variants in these genes or other nearby genes and genotype those variants in NF1 patients with different symptoms and control populations, followed by statistical methods to identify correlations with the phenotype. Moreover, in-vitro and in-vivo studies are also useful for validating previously selected genes, focusing on CRISPR/Cas9 assays to induce partially and complete loss of the proteins. Our candidate genes could be also included in commercial gene panels with a low impact on their coast, which would help to feed public databases such as ClinVar.
One obvious limitation of the present study is the lack of proper validation for the candidate phenotype-modifier genes using benchwork. Hence, the results obtained must be evaluated with caution. Experimental validation is necessary and strongly recommended before clinical extrapolation. However, our purpose was to provide a new look in the strategies for evaluation of neurofibromatosis using the huge amount of data already available in shared public-curated datasets. For example, the protein-protein interactions identified by us were previously validated by several in vitro assays and here combined in a network. Together with our network analysis, we also performed random walk, a robust mathematical model that has been applied in the analysis of biological multiplex heterogeneous networks [76,77]. We hope this complex and robust systems biology approach will help to better understand the neurofibromatosis type 1 and its phenotypic variation.

Selection of NF1 Ontologies
The complete list of NF1 gene ontologies (GO) and phenotype ontologies were obtained in the AmiGO and Human Phenotype Ontology (HPO) databases, respectively. In modifier studies, the selection of which phenotypes to study is a key step, and in NF1, several phenotypic features are time-dependent [19]. Then, we selected from both lists (GO and HPO) the ontologies related to the less frequent and variable characteristics and not necessarily time-dependent, presented by NF1 patients, as reported in the literature and in our clinical experience in the Oncogenetics clinics of Hospital de Clínicas de Porto Alegre [16,78,79]. For example, cutaneous neurofibromas are common and may occur in up to 99% of NF1 patients in a cohort; this is a variable characteristic, mainly in the number of neurofibromas, but it is less variable when considering their presence in NF1 patients. Thus, we focused on ontologies related to characteristics that occur in a smaller number of patients to try to explain the variability of less common but more aggressive NF1 symptoms, such as breast cancer, delayed mental development, plexiform neurofibromas, and facial dysmorphism. The processes and phenotypes selected for the analysis involved NF1 and NF1-related signaling pathways, such as the MAPK cascade and regulation of the Ras pathway, considering the upper ontology in the hierarchy of each database. Hence, some ontologies were not selected because there was an upper term in the hierarchy that encompassed these ontologies. It is worth mentioning that we followed a guide for the correct selection of ontologies to try to limit the bias introduced by the choice of terms and keywords [80]. The processes were evaluated by two independent researchers and selected for subsequent analysis when both researchers considered it relevant.

Systems Biology Analysis
Networks were generated using STRING database v.11, comprising protein-protein interactions (PPI) for Homo sapiens. Only experimental interactions were selected, with a minimum required interaction score set in 0.400 (default). The assembled networks were transferred to Cytoscape v.3.7.2 software, with which the network statistics was obtained. Big nodes represented proteins with high betweenness centrality scores and warm colors comprised proteins with high closeness centrality measures.
Comparison between networks was performed with DyNet application for Cytoscape v.3.7.2. Complex networks comprising HPO selected phenotypes, gene expression, and PPI networks were assembled using the PhenomeScape app, also in Cytoscape v.3.7.2, using the default settings.

Gene Expression Evaluation
RNA-seq and microarray secondary analysis were performed using studies selected in Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases. For the GEO studies, we looked for NF1 knockdown or knockout assays and selected only the ones performed in human tumor cells. The data extraction was performed manually, and the robust multiarray averaging (RMA) normalization was applied using oligo or affy R packages (R v.3.6.2). The differentially expressed genes were obtained using the limma package (R v.3.6.2).
Firstly, for TCGA, we selected seven somatic tumors with nonsense mutations in NF1 (NF1-ns): bladder urothelial carcinoma (BLCA), brain lower grade glioma (LGG), breast invasive carcinoma (BRCA), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), colon adenocarcinoma (COAD), glioblastoma multiforme (GBM), and pheochromocytoma and paraganglioma (PCPG). Despite tumors having a somatic origin, this information can be useful to check how alterations in NF1 could affect the global gene expression in a specific tissue/tumor (phenotype). We then compared these samples against wild type NF1 tumors to check which genes were differentially expressed only in the NF1-ns group. The gene expression analysis for TCGA data was performed by extracting the data with TCGAbiolinks package and evaluating differential gene expression with edgeR package. All analyses were performed in R v.3.6.2.

Random Walk Analysis
The heterogeneous networks comprising both the genes and phenotypes selected were assembled in the RandomWalkRestartMH package, in R v.3.6.2, and the random walk analysis was performed with the same package.

Database Research
Other databases consulted to obtain data for the potential NF1 phenotype-modifier genes were: (1) BioGrid, for curated protein interactions; (2) PINOT tool, for literature data on curated protein interactions; (3) STRING database, for protein-protein interactions; and (4) The Human Reference Protein Interactome (HuRI), for the binary protein-protein interactions.

Variant Datasets
To explore variants in our candidate genes already reported in the general population or with clinical significance, we consulted The Genome Aggregation Database (gnomAD) v 2.1.1 and the ClinVar archive. For gnomAD, variants were classified according to its annotation. In ClinVar, only variants reported by at least one submitter as a germline were considered and classified according to their interpretations.
Finally, an additional analysis was performed consulting the 79,720 tumor samples made available by the AACR Project GENIE and the 10,967 samples from the TCGA PanCancer Atlas studies, using the cBioPortal for Cancer Genomics. Samples were filtered according to their mutational status: NF1-mutated patients including only nonsense, frameshift, and splicing; and patients with selected variants in our candidate genes, if available. Then, the clinical data were accessed and confronted with the mutational status to check which cancer types were predominant when NF1 was exclusively altered and when NF1 variant co-occurred with variants in our candidate genes.

Conclusions
We presented here a not yet explored systems biology strategy to investigate NF1 phenotype modifiers. The public availability of multi-omics datasets makes possible the use of robust tools to generate complex networks including protein-protein interactions, differential expression data, and phenotypes, reinforced by mathematical models such as random-walk. Combining all these strategies, we found 10 candidate genes as potential NF1 phenotype modifiers. Resources and time may be scarce to carry out association studies and systems biology analyses makes possible to better explain the genetic heterogeneity of this complex syndrome. Our results must be interpreted cautiously in clinical application and may guide further in-vitro and in-vivo validation studies, saving time and financial resources. The approach presented here may guide further in-vitro and in-vivo validation studies, saving time and financial resources.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-6694/12/9/2416/s1. Figure S1: Network for GO analysis. Figure S2: Network for HPO analysis. Figure S3: A complex network generated using GEO MPNST dataset, protein-protein interactions, and HPO ontologies as input data. Figure S4: A complex network generated using GEO neurofibroma dataset, protein-protein interactions, and HPO ontologies as input data. Figure S5: A complex network generated using GEO NF1-shRNA, protein-protein interactions, and HPO ontologies as input data. Figure S6: A network generated using GEO CRISPR-induced knockout of NF1, protein-protein interactions, and HPO ontologies as input data. Figure S7: A network generated using GEO neuroblastoma cell line, protein-protein interactions, and HPO ontologies as input data. Figure S8: Random walk analysis calculating the minimum steps (interactions) that a candidate gene (node) takes to reach the neurofibromatosis 1 phenotype. Table S1: Selected NF1 ontologies, Table S2: NF1 knockdown and knockout assays in GEO database, Table S3: Tumors with NF1 nonsense mutations from TCGA database, Table S4: List of genes and proteins previously described as NF1 phenotype modifiers in the literature.