The Participation of the Intrinsically Disordered Regions of the bHLH-PAS Transcription Factors in Disease Development

The basic helix–loop–helix/Per-ARNT-SIM (bHLH-PAS) proteins are a family of transcription factors regulating expression of a wide range of genes involved in different functions, ranging from differentiation and development control by oxygen and toxins sensing to circadian clock setting. In addition to the well-preserved DNA-binding bHLH and PAS domains, bHLH-PAS proteins contain long intrinsically disordered C-terminal regions, responsible for regulation of their activity. Our aim was to analyze the potential connection between disordered regions of the bHLH-PAS transcription factors, post-transcriptional modifications and liquid-liquid phase separation, in the context of disease-associated missense mutations. Highly flexible disordered regions, enriched in short motives which are more ordered, are responsible for a wide spectrum of interactions with transcriptional co-regulators. Based on our in silico analysis and taking into account the fact that the functions of transcription factors can be modulated by posttranslational modifications and spontaneous phase separation, we assume that the locations of missense mutations inducing disease states are clearly related to sequences directly undergoing these processes or to sequences responsible for their regulation.


bHLH-PAS Proteins
The basic helix-loop-helix/Per-ARNT-SIM (bHLH-PAS) proteins are an important class of transcription factors (TFs) responsible for the regulation of developmental and physiological events occurring in mammals [1]. Representatives of this family perform a wide spectrum of functions, starting with the Aryl hydrocarbon receptor (AHR) acting as receptor for environmental stimuli including highly toxic dioxins [2] to Clock and Aryl hydrocarbon receptor nuclear translocator-like protein 1 (ARNTL1, Bmal1) regulating circadian rhythms of the organism [3], and to Hypoxia inducible factor 1α (Hif-1α) [4], acting as a specific oxygen sensor in cells. In hypoxia conditions, Hif-1α trans-locates from cytoplasm to the nucleus, binds to the Aryl Hydrocarbon Receptor Nuclear Translocator (ARNT), and induces the expression of genes related to angiogenesis, cell proliferation, glucose, and iron metabolism [5]. The incorrect control of these processes is commonly connected with the genesis of many diseases, including cancer, strokes, and heart diseases [4].  [12]; green indicates the bHLH domain, purple indicates PAS domains, and blue indicates PAS-associated C-terminal (PAC), respectively, (B) crystal structure of the heterodimeric NPAS3-ARNT complex with Hypoxia Response Element (HRE) DNA (PDB: 5SY7) [13]. The bHLH domain, responsible for DNA binding, is colored in green, whereas PAS-Domain Containing Protein 1 (PAS1) and PAS2 domains are colored in purple.
Being biologically active, IDRs and intrinsically disordered proteins (IDPs) do not possess unique stable tertiary structures in physiological conditions [16], thereby contradict the fundamental paradigm of biochemistry and structural biology stating that the unique function of a protein results directly from its unique tertiary structure [17]. Currently, more than 20-30% of eukaryotic proteins have been found to present features of IDPs, and over 70% of proteins involved in signal transduction cascades have long IDRs. IDPs were identified as important elements in a wide range of biological processes, such as cell cycle, cell differentiation, regulation of transcription, mRNA processing, and apoptosis control [18][19][20].
The lack of a defined structure is critical for IDP and IDR functionalities [19]. Interestingly, IDRs found in bHLH TFs were proposed to contribute directly to the evolution of complex multicellularity [21]. The conformational plasticity allows IDPs/IDRs to interact with several unrelated proteins/ligands, with such binding promiscuity seeming to be highly useful for the molecular recognition processes [22]. For this reason IDPs are commonly involved in one-to-many and in many-to-one interactions and can function as hub proteins responsible for the cross-talk of different pathways [23]. Often, IDRs contain Molecular Recognition Features (MoRFs), which are interaction-prone segments of protein disorder exhibiting molecular recognition and binding functions and facilitating interactions with physiological partners. MoRFs undergo a disorder-to-order transition as a result of interaction with specific partners and such binding-induced folding allows them to perform various biological functions [24]. Their extended conformation and low compactness make IDPs excellent targets for post-translational modifications (PTMs) and proteolytic degradation, which are typical means activity regulation in proteins [25].
IDPs/IDRs were shown to play an important role in the formation of self-assembled, membrane-less organelles (MLOs) through liquid-liquid phase separation (LLPS). Interestingly, although in some cases PPI could lead to LLPS formation, there are also instances where LLPS may prevent protein interactions [26][27][28]. In the context of TFs, it is very interesting to consider the putative role of LLPS in fast cellular responses to external stimuli [29]. The ability of protein to undergo the LLPS process may be regulated by a wide spectrum of PTMs and alternative splicing [30]. Recently, we discussed the disordered character of bHLH TFs and their propensities to LLPS [31]. Experimental data have provided evidence that MyoD belonging to bHLH TFs family, and disordered regions of TFs, such as Oct4 and Brd4, can form liquid condensates [32]. Regulation of the circadian clock by BMAL1 also partially occurs in discrete nuclear foci resembling phase separated droplets [33]. Proteome-wide analyses of disease-related mutations have shown that gain or loss of post-translational modification sites might contribute to various human diseases. Importantly, most PTMs are found in IDRs. In addition, more than 80% of proteins considered as responsible for oncogenesis in humans are enriched in IDRs [34]. The ability of IDR-containing proteins to form multivalent, weak, and transient interactions underlie the ability of particular proteins to undergo LLPS. IDRs are often depleted in hydrophobic residues; however these residues can represent adhesive elements in phase-separating IDRs and mediate condensation upon changes in temperature [26]. In turn, repetitively distributed, highly, but oppositely, charged regions, short motifs such as YG/S-, FG-, RG-, GY-, KSPEA-, SY-, and Q/N-rich regions might be engaged in the formation of the multivalent interactions between condensate components [35]. Highly charged and flexible IDRs are in fact frequently identified as scaffold proteins and undergo spontaneous LLPS. Furthermore, they are essential for the structural integrity of a condensate [36].
As IDRs are suggested as the most important regulatory regions for proteins, we were interested in finding out if there is a pattern of the distribution of disease associated missense mutations among ordered and disordered regions in bHLH-PAS protein family members. Are the missense mutations observed more frequently in IDRs prone to PTMs, LLPS or aggregation?
To address this problem, we decided to analyze the known aa missense mutations listed in the HuVarBase database and to compare their localizations with the localizations of documented PTMs (PhosphoSitePlus database) and predicted MoRFs (Anchor server), simultaneously with the in silico analysis of protein's LLPS (catGranule and PScore servers) and amylogenic propensity (Waltz predictor). Based on the results, we assume that most of the disease-associated missense mutations are localized in IDRs of analyzed and selected bHLH-PAS family representatives.
The aim of this work was to produce a foundation for future experimental studies dedicated to the analysis of the effects of mutations affecting bHLH-PAS TFs' functionality.

AhR and AhRR
AhR, best known as a mediator of environmental pollutant toxicity, also contributes to the proper functioning of the liver, cardiovascular, immune, and reproductive systems [37]. AhR is also related to normal skin formation during fetal development and to pathological states such as epidermal wound healing and skin carcinogenesis [38]. Recently, AhR has been recognized as an important modulator of diseases driven by immune/inflammatory processes [39]. The ligand-bound AhR trans-locates to the nucleus, where it mediates the biological response to toxins resulting in wasting syndrome, hepatotoxicity, teratogenesis, and tumor promotion [2]. Activation of AhR was linked to chronic kidney and cardiovascular diseases [37]. The overexpression and constitutive AhR activation have been assigned to various types of tumors [40] including brain tumors, such as gliomas, meningiomas, medulloblastomas, and neuroblastomas [41]. Furthermore, AhR activation is linked to renal damage, diabetic nephropathy, and urinary system-associated cancers [37]. AhR can heterodimerize with ARNT to function as a co-regulator of the estrogen signaling pathway mediated by the estrogen receptor (ER) [42] and is considered as responsible for the connection between inflammation process and breast cancer [43].
Interestingly, AhR self-regulates its activity by activation of the repressor, AhRR. In comparison to AhR, present in most tissues, AhRR is characterized by high tissue specificity. The highest concentration of this protein was observed in the testis, lung, spleen, heart, and kidney [44]. The repressor competes with AhR for binding to the ARNT and forms an inactive AhRR/ARNT heterodimer [43]. AhRR is not able to bind to AhR ligands because it does not possess the PAS2 domain in the N-terminal region. Additionally, AhRR contains the C-terminal trans-repression domain (instead of the transactivation domains in the AhR C-terminus), that allows binding of the corepressors involved in a negative regulatory loop [45]. Zudaire et al. [46] demonstrated downregulation of AhRR expression in human malignant tissues of different anatomical origin, such as colon, breast, lung, stomach, cervix, and ovary. Genetic polymorphisms of AhRR were also related to enhanced susceptibility to advanced endometriosis [47,48]. Interestingly, it was observed that AhRR splice variant is able to inhibit transcription activated by Hif-1, which is essential for cancer progression [49].

Single Minded Protein (SIM)
The mammalian SIM exists as two homologs that are encoded by two different genes: SIM1 and SIM2, with a high level of amino acid identity shared by their N-terminal parts (90% identity in the bHLH and PAS domains), and a high level of diversity in their Cterminal parts [50]. While SIM1 is responsible for the activation of specific genes' expression, SIM2 is defined as a transcription inhibitor. The opposite transcriptional effect results from the presence of two repression domains within the SIM2 C-terminal sequence [51,52]. This example confirms the importance of the C-terminal region for the functions and activities of bHLH-PAS proteins [12]. SIM1 dimerizes with ARNT and activates transcription of specific genes related to the development, terminal differentiation, and post-development functioning of neuronal cells, especially in the paraventricular nucleus of the hypothalamus (PVN) [53]. Importantly, PVN is responsible for several autonomic processes, including response to stress, metabolism control, growth, reproduction and appetite regulation [53]. Since the SIM1 plays a role in the long-term regulation of food intake and energy expenditure [54], its reduced activity is manifested phenotypically as profound obesity and increased linear growth. The weight gain is connected to high food consumption, since measured energy expenditure is usual [54]. It was shown that SIM1 haploinsufficiency in mice induces hyperphagia (abnormally increased appetite for consumption of food) [55] leading to obesity and developmental abnormalities of the brain [56]. It has been shown that transgenic mice with overexpressed SIM1 are resistant to diet-induced obesity, which supports a post developmental, physiologic role for SIM1 in feeding regulation [57]. Induced SIM1 overexpression contributes to decreased food intake [58].

Hypoxia Inducible Factor 2α (Hif-2α)
Functional hypoxia inducible factors are heterodimers comprising one of the three known α subunits regulated by oxygen (Hif-1α, Hif-2α and Hif-3α), and constitutively expressed ARNT (known also as Hif-1β) [59]. For the first time, Hif-2α, also known as endothelial PAS-1 protein (EPAS1), was isolated from the endothelial cells [60]. Hif-2α shares approximately 50% sequence identity with the ubiquitously expressed Hif-1α, and the activities of both proteins are regulated by oxygen level. Under normoxic conditions, two proline residues of the oxygen-dependent degradation domain located in the C-termini of Hif-1α/Hif-2α are hydroxylated and targeted to the ubiquitin-proteasome (26S) pathway for degradation. Additionally, hydroxylation of the arginine residues prevents protein interaction with coactivator protein p300 [61]. Similar to Hif-1α, Hif-2α was shown to induce the expression of genes stimulating cell cycle progression, proliferation, apoptosis promotion, autophagy and angiogenesis [59]. Furthermore, Hif-2α regulates erythropoietin level and is involved in embryonic development and metastasis [62,63]. Interestingly, Hif-2α is localized within the nucleus in the form of puncta, whereas Hif-1α is distributed homogeneously in the nucleus. Distinct subnuclear localizations of both proteins were proposed to contribute to the different regulations and activities of these two TFs [64]. Importantly, Hif-2α shuttling is regulated by phosphorylation [65]. Some studies of kidney cancer suggested an oncogenic role for Hif-2α, which is in contrast to Hif-1α that manifested tumor suppressor properties [66]. Missense mutations within the bHLH and PAS domains of Hif-1α/Hif-2α proteins have been linked to pathogenesis of various cancers, such as stomach adenocarcinomas, endometrial carcinomas, brain gliomas, lung adenocarcinomas, hepatocellular carcinomas and skin melanomas [61]. The Gly537 residue located close to the primary oxygenation site is conserved among all known Hif-2α proteins, whereas mutation of this residue results in the familial erythrocytosis characterized by an increased number of red blood cells. The familial erythrocytosis symptoms are headaches, dizziness, nosebleeds, and shortness of breath. Additionally, an excess of red blood cells increases the risk of developing abnormal blood clots [67].

Neuronal PAS-Domain Containing Protein 4 (NPAS4)
Initially, it was shown that the NPAS4 protein is expressed and acts mainly in the nervous system [68]. However, later studies have shown that NPAS4 is also expressed in β cells of pancreatic islets, which significantly affects pancreatic cells. In this case, NPAS4 expression is induced by endoplasmic reticulum stressors and prevents the death of β-cells [69,70]. In the nervous system, NPAS4 is responsible for the regulation of the development of GABAergic inhibitory neurons [71]. NPAS4 was shown to be able to inhibit seizure attacks in pilocarpine-induced epileptic rats [72]. Importantly, increased levels of NPAS4 expression have been linked to brain protection in focal and generalized ischemic strokes of the brain, where it prevented necrosis and led to cell apoptosis [73,74]. It was also shown that NPAS4 is involved in the structural plasticity of the nervous system and plays an important role in the formation of long-term memory. Its expression is highly induced during the learning process [75,76]. Interestingly, NPAS4 overexpression can reverse tau protein aggregation [77]. Finally, NPAS4 expression was also detected in endothelial cells, where, similar to pancreatic β-cells, it promoted pro-angiogenic cell functions, such as migration or sprout formation [78].
For human NPAS4, a second isoform of NPAS4 comprising residues 1-234 (only bHLH and PAS-1 domains) with V234G substitution was proposed. However, there is no evidence for this isoform at the protein translation level, and its function is not known [79]. To date, only few dimerization partners for NPAS4 have been identified, such as ARNT and ARNT2, which are the general partners for the class I bHLH-PAS TFs in the brain [80] and the melanoma-associated antigen D1 (MAGED1), which is expressed ubiquitously in both developing and adult tissues, but is particularly abundant in the brain. MAGED1 participates in various signaling pathways, including apoptosis and differentiation of the neuronal precursors, periodicity stabilization in the circadian rhythm, and learning and memory formation [81]. As shown, NPAS4 developmental downregulation in the prefrontal cortex caused behavioral abnormalities observed in neurodevelopmental disorders, such as schizophrenia and autism [82]. NPAS4 was also linked to a number of other serious psychiatric disorders, including depression, Huntington's disease, Down syndrome, and various neurodegenerative diseases (e.g., Alzheimer's disease) [77].

Aryl Hydrocarbon Receptor Nuclear Translocator 2 (ARNT2) and BMAL1
ARNT2 is a representative of the class II bHLH-PAS TFs. It is constitutively expressed and acts as general heterodimerization partner for multiple class I bHLH-PAS members, including SIM1 [83] and NPAS4 [84,85]. In contrast to the ARNT, which is expressed equally in all tissues and interacts with a wide spectrum of physiological partners (ARNT is indispensable for AHR and Hif signaling) [86], ARNT2 is expressed mainly in the brain (in the developing central nervous system (CNS)), kidney, urinary tract, and embryos [87,88]. ARNT2 deficiency leads to secondary microcephaly within the first few months of human life with a specific frontal and temporal lobe hypoplasia [89]. Secondary microcephaly indicates a progressive neurodegenerative condition caused by a decreased number of dendritic connections and/or reduced neuron activity [90]. The hypothalamic insufficiency can cause obesity, diabetes, and is often combined with pituitary hormone deficiency [89]. The latter seems to be consistent with a key role of ARNT2 in the development of specific neurosecretory neurons in the human hypothalamus [89]. Some ARNT2 mutants are also considered as causing hyperphagic obesity, diabetes, and hepatic steatosis [91]. ARNT2 was also shown to act as an important component of a protein complex located at a node of the TF network controlling glioblastoma cell aggressiveness [92].
BMAL1, together with its heterodimerization partner CLOCK, creates the core of the regulatory mechanism of mammalian circadian rhythms. The C-terminally located TAD of BMAL1 acts as a regulatory hub interacting with positive/negative transcriptional regulators in a circadian time-dependent manner to control the activation state of CLOCK-BMAL1 dimer [93]. The conformational switch of TAD is caused by cis/trans isomerization around a highly conserved W624-P625 imide bond [94]. BMAL1 polySUMOylation leads to its ubiquitination and binding of CREB-binding protein (CBP) that potentiates its transcriptional activity. Formation of nuclear bodies containing BMAL1/CBP provides transcriptionally active sites for target genes [33] and supports our thesis about the putative role of BMAL1 in LLPS formation. Similar to other bHLH-PAS TFs, BMAL1 is a shuttling protein [95]. Its localization signal activities are regulated by PTMs, e.g., phosphorylation [96]. BMAL1 was also shown to stimulate the translation process by interacting with the translational machinery in the cytosol, which was possible only after S42 phosphorylation [97]. Geyfman et al. [98] reported that the circadian variations in DNA sensitivity to UVB-induced damage depended on BMAL1 activity that directly connects circadian mechanisms with the epidermal carcinogenesis.

Results
To date, the structural characterization of bHLH-PAS TFs was limited to their bHLH-PAS regions, whereas no structural information is available for their C-terminal regions. This lack of structural knowledge can be explained by the difficulties associated with the expression and purification of the full-length proteins, caused by the presence of long disordered C-termini. We have discussed this research area in detail previously [15]. Curiously, all previously published data on the analysis of the missense mutations linked to cancers were limited to the bHLH-PAS domains of the selected bHLH-PAS members (Hif-1α and Hif-2α) [61].
Taking into account the connection of bHLH-PAS TFs with some serious disorders discussed in the previous sections, we asked a question about the localizations of known missense mutations associated with various diseases within the entire proteins, including their IDRs.

AhR and AhRR
According to the PhosphoSitePlus, most of the documented PTMs ( Figure 2(Aa)) are located within the disordered regions of AhR, which are predicted at the short N-terminal fragment preceding the bHLH domain (residues 1-26), the linker between PAS1 and PAS2 domains (residues 182-274) and a long C-terminal region of the protein (residues 387-848) ( Figure 2(Ab,c)). In these regions, the presence of MoRFs was also predicted (Figure 2(Ab)). In contrast, all the regions corresponding to the conserved domains were predicted as highly ordered (Figure 2(Ab,c)), which is typical for bHLH-PAS proteins. The missense mutations in IDRs are linked mainly to large intestine cancer (T199P, P260L, N505S, T507I, P838S), soft tissue cancer (R554K), thyroid cancer (V570I), kidney cancer (E488K), and liver cancer (P18L) (Supplementary Materials). Importantly, results of the NetPhos 3.1 server prediction suggest many more phosphorylation sites (the most common PTM) in AhR than documented, for example, in the 100-200 aa region (Supplementary Materials).
The proximities of missense mutations (see Figure 2(Ac)) to the locations of known PTM sites (see Figure 2(Aa)) in some cases seem to be crucial for disease development. Prediction of the LLPS propensity resulted in a positive maximal score in the C-terminal fragment (residues 500-600) in the region enriched in the disease associated mutations (Figure 2(Ad)). The additional local positive maximum is observed in the linker between bHLH and the PAS domain, which is also predicted as locally disordered.
In the case of AhRR, all documented PTM sites ( Figure 2(Ba)) and all MoRFs (Figure 2(Bb)) are located in IDRs (Figure 2(Bc)). Importantly, AhRR undergoes many rather uncommon PTMs, such as SUMOylation (see Figure 2(Ba), green points). AhRR, as transcription repressor, does not possess ligand binding PAS2 domain and is predicted as highly disordered not only at the N-and C-termini (residues 1-27 and 183-700), but also in the linker between the bHLH and PAS domains (residues 82-111) (Figure 2(Bc)). AhRR possesses a defined ordered structure only in the middle of the bHLH domain and in the entire PAS domain. LLPS propensity analysis shows a positive maximum for the central part of the protein (approximately residues 340-440) ( Figure 2(Ad)) surrounded by various PTM sites. Furthermore, another maximum coincides with the segment of the disordered C-terminus. We can observe that AhRR C-terminus and the linker between its bHLH and PAS domains are enriched in the disease-associated mutations in reference to the ordered bHLH and PAS domains. Diseases associated with the mutations are represented mainly by intestine cancer (I226V, R230C, R285W, A300T, T419I, R485W, R491W, G494S, V553M, and D645H), skin cancer (P283S, A301V, and G427E), prostate cancer (R491Q and D645H) and liver cancer (C545F and A674S). The other single mutations are connected to endometrium cancer (A371T), CNS cancer (P189A), and esophagus cancer (G612S) (see Supplementary Materials). In the case of AhRR also, the NetPhos 3.1 server predicted more phosphorylation sites than documented (Supplementary Materials).

SIM1 and SIM2
According to the disorder predictions, most of the documented PTMs ( Figure 3(Aa)) and all predicted MoRFs (Figure 3(Ab)) of SIM1 are located at the long C-terminus (residues 336-766) (Figure 3(Ac)). An additional disordered region is predicted in the linker between the bHLH and PAS1 domains (residues 64-76) (Figure 3(Ac)). Prediction of phosphorylation sites by NetPhos resulted in positive scores for many sites along the whole protein (Supplementary Materials). bHLH and PAS domains, as well as several short regions observed in the C-terminus of SIM1 (residues 450-500 and 700-740, Figure 3(Ac)) are predicted as more ordered. Importantly, the short ordered regions in the middle of disordered C-termini were described as characteristic for bHLH-PAS class I proteins [15]. All the disease-associated missense mutations are located within the long C-terminus (Figure 3(Ac), Supplementary Materials). Prediction of the LLPS propensity resulted in local maxima in the linker region between the bHLH and PAS domains, the linker region between the PAS1 and PAS2 domains, and in the N-terminal region of the C-terminus (residue 390). The segment between residues 350-400 deserves special attention. It is predicted not only as highly disordered and possessing a local maximum of the LLPS propensity, but is also enriched in the PTM sites. What is more, many disease-associated mutations are reported in this region. According to HuVarBase, SIM1 missense mutations are linked mainly to skin cancer (H394Y, H402Y, D424N, S428F, S454L, R471Q, R493C, R550C, P588L, S603F, P661L, and R665C). The other diseases are lung cancer (R192H, G392R, E530K, A570G, N650Y, and S701C), breast cancer (P352T and A494T), liver cancer (H559Q, G448C, and Q704H), large intestine cancer (L217P, A371V, C472W, R548Q, and S663L), stomach cancer (S541L), hematopoietic and lymphoid tissue cancer (G408R and T481M), CNS cancer (P539R), esophagus cancer (E725K), and Schaaf-Yang syndrome (Q704L) (Supplementary Materials).
As demonstrated [53], the SIM1 mutation located in the C-terminus (p.G715V) leads to a novel SIM1 variant presenting reduced transcriptional activity. An ab initio hybrid model generated by Blackburn et al. [53] localized the p.G715 residue to the long IDR, directly in a small helix that is facing towards the solvent. The discussed helix is determined in our predictions as a local minimum in the disorder profiles generated by all predictors used in this study (Figure 3(Ac)), which is surrounded by highly disordered regions. Such a result is characteristic for motifs acting as the molecular recognition elements/features (MoREs/MoRFs), representing short interaction-prone segments that can undergo disorderto-order transition upon binding to specific partners [104]. The substitution of G to V at this position increases the local hydrophobicity and may affect helix function and stability. This mutation could alter affinities for cofactors binding, regulatory functions and proteins structure, which can modulate the SIM1 target gene regulation [53].
In the case of SIM2, most of the documented PTM sites (Figure 3(Ba)), similar to the predicted MoRFs (Figure 3(Bb)), are placed along the long, highly disordered C-terminus (residues 336-667) (Figure 3(Bc)). The only modification documented for this protein is phosphorylation. Similar to previously analyzed bHLH-PAS TFs, most of the missense, disease-associated mutations are observed within the long IDRs or short, local disordered regions (Figure 3(Bc)). Predicted LLPS propensity shows a local maximum in the linker between bHLH and PAS (residues 54-76), which is also predicted as disordered. Curiously, although this region does not possess experimentally determined PTM sites, NetPhos predictor [105] suggests many putative phosphorylation sites are located in this region

Hif-2α
For Hif-2α, most of the documented PTM sites (Figure 4(Aa)) and MoRFs (Figure 4(Ab)) are placed along the long C-terminus (residues 348-870) and within the linker between the bHLH and PAS1 domains (residues 48-83), both predicted as IDRs (Figure 4(Ac)). Similarly, most of the missense mutations in the Hif-2α sequence are located within the disordered C-terminus and the linker between the bHLH and PAS1 domains (residues 48-83) (Figure 4(Ab,c)). Interestingly, some of the Hif-2α documented PTMs are observed in the region comprising the PAS1 domain (see Figure 4(Aa,b)). This can be explained by the significantly higher local structural flexibility of regions surrounding this domain, in comparison to those of AhR or SIM proteins. Hif-2α is highly targeted by phosphorylation and ubiquitination, which can easily affect the life-time of the protein. Predicted LLPS profile contains many maxima throughout the entire protein length (Figure 4(Ad)). Importantly, these regions coincide with the predicted disordered fragments. Hif-2α missense mutations are mostly linked to familial erythrocytosis (A410T, M535V, M535T, G537R,  G537W, F540L, F608L, S703A, T766P, P785T, I789V

NPAS4
NPAS4 is one of the immediate early genes (IEGs) that can activate mechanisms related to the first defense against many cellular stresses [106]. Importantly, IEGs are regulated by a specific stimulus with no need for a de novo protein synthesis [107]. To date, there is only one documented NPAS4 modification-phosphorylation (Figure 4(Ba)) located in the bHLH domain, in the region where a locally disordered fragment of the sequence begins (between bHLH and PAS1 domains) (Figure 4(Bb,c)); however, NetPhos predictions showed many putative phosphorylation sites on the entire length of the protein (Supplementary Materials). Results of the disorder prediction indicated the presence of the long IDR in the C-terminal part of the protein (residues 318-802) and additional short IDRs within the N-terminal part of NPAS4, comprising bHLH and PAS domains, especially in the PAS1/PAS2 linker (residues 145-202) and less clearly in the bHLH/PAS1 linker (residues 54-69) (Figure 3(Bc)). Interestingly, the sites with high LLPS propensities (Figure 3(Bd)) mostly coincide with the IDRs. An exception is the central part of a protein (approximately residues 350-600) with a low LLPS potential and a high probability of being disordered. Similar to the protein sequences analyzed previously, disease-associated missense mutations of the NPAS4 sequence are located within IDRs, mostly predicted also as presenting a putative ability for LLPS formation. Especially interesting is the part of the C-terminus (residues 550-700) predicted as IDR with a high LLPS propensity which contains many described point mutations. NPAS4 missense mutations are linked predominantly to liver cancer (R150L, P194L, Q332K, P405L, Q547H, I639V, D647N, P679L, S683I, and S747F), skin cancer (R145C, P194S, D419N, L455F, P533S, P533L, S544N, T558I,  D716N, E725K, and D730N

ARNT2 and BMAL1
To compare different classes of bHLH-PAS TFs, we conducted analysis similar to that previously described for class I proteins, for ARNT2 and BMAL1-two representatives of the class II bHLH-PAS proteins. For ARNT2, documented PTMs ( Figure 5(Aa)) and MoRFs ( Figure 5(Ab)) are located within the N-and C-terminal regions predicted as highly disordered ( Figure 5(Ac)). However, predicted phosphorylation sites are uniformly distributed along the protein (Supplementary Materials). The long, predicted as highly disordered linker between PAS1 and PAS2 domains ( Figure 5(Ab,c)) contains short MoRFs (see Figure 5(Ab)). The high structural flexibility of the central part of this protein, which is much higher in comparison with the previously described class I members, could explain the ability of class II proteins to serve as an interaction partner for different class I proteins. Most of the missense mutations in the protein sequence are located within the C-terminus and within other regions predicted as disordered ( Figure 5(Ac)). Prediction of the LLPS propensity generated many maxima spread over the entire protein length ( Figure 5(Ad)). This seems to be a characteristic property of the class II bHLH-PAS TFs. Again, LLPS positive regions overlap with the disordered fragments. ARNT2 diseaseassociated missense variants are linked to large intestine cancer (A28V, R47C, R240K, P579S, and T602M), skin cancer (S458L and P529S), CNS cancer (Y430N), lung cancer (A25T and V683L), liver cancer (D191G and G710A), hematopoietic and lymphoid tissue cancer (H543R), pancreas cancer (P269S) and stomach cancer (G31R) (Supplementary Materials).
In the case of BMAL1 almost all documented PTM sites ( Figure 5(Ba)) are distributed along the long C-terminus (residues 445-626), N-terminus (residues 1-71), and the linker between PAS1 and PAS2 domains (residues 216-325). However, similar to several other bHLH-PAS TFs, NetPhos predicts many phosphorylation sites uniformly distributed along the protein (Supplementary Materials). Predicted MoRFs occur also within the N-and C-terminal regions of BMAL1 ( Figure 5(Bb)). All these fragments are predicted as highly disordered ( Figure 5(Bc)). Importantly, the long disordered region in the middle part of BMAL1, characteristic of the class II factors, is observed ( Figure 5(Bc)). For both BMAL1 and ARNT2, MoRFs were predicted within the N-terminal region ( Figure 5(Bb)). All these features distinguish class II proteins and suggest their specific characteristics that allow them to interact with a wide spectrum of partners from the class I. In contrast to all previously analyzed bHLH-PAS proteins, no disease-associated missense mutation was reported in the disordered C-terminal region of BMAL1. Instead, missense mutations accumulated in the disordered N-terminal part ( Figure 5(Bc)). This was unexpected, since the C-terminal TAD plays important roles in the mammalian clock regulation [94]. Importantly, acetylation of BMAL1 K537 was shown to be indispensable for circadian rhythmicity [108], suggesting the possibility that not all mutations responsible for disease development are known. LLPS propensity analysis revealed the presence of potential regions capable of phase separation in the N-and C-termini in accordance with the IDR prediction. BMAL1 seems to have a wider spectrum of PTMs (phosphorylation, ubiquitination, acetylation, and SUMOylation) in comparison to ARNT2. BMAL1 disease-associated missense mutations are linked predominantly to large intestine cancer (D22N, S27Y, R37C, R37H, R244Q, and V260A). The other related diseases are esophagus cancer (E62Q), genital tract cancer (E65K), thyroid cancer (H66P and C249R), skin cancer (P234H), cervix cancer (S246C), pancreas cancer (P292T), stomach cancer (T224S), breast cancer (T140S), and liver cancer (Q4L) (Supplementary Materials). Finally, we evaluated the presence of the amylogenic regions in selected bHLH-PAS TFs ( Figure 6). Our analysis revealed that all of the selected proteins were predicted to contain short amylogenic regions. Interestingly, most of these regions were located in Nand C-terminal regions of the defined domains, presenting higher flexibility. These regions show local N-terminal increase/C-terminal decrease of predicted disorder score in the corresponding intrinsic disorder profiles (see .

Discussion
Functional analysis of proteins at the crossroads between the different signaling pathways and, simultaneously, interacting with multiple partners (hub proteins), has proven that the intrinsically disordered nature of the interacting regions is indispensable [23]. Additionally, the DNA-binding proteins in eukaryotes were shown to be significantly enriched in disordered domains [110]. As aforementioned, bHLH-PAS proteins act as essential TFs via their binding to DNA and interacting with many physiological partners.
The results of our analysis confirm a high intrinsic disorder content of the bHLH-PAS TFs, especially in their long C-terminal regions. Additionally, short IDRs located in the region preceding the bHLH domain and in the linker between PAS domains can also be distinguished.
Utilizing the HuVarBase data in combination with the in silico analysis of selected representatives of the bHLH-PAS family allowed us to show that missense mutations associated with diseases are located mostly within predicted IDRs. For most of the analyzed proteins (AhRR, SIM1, Hif-2α, and NPAS4), we also predicted high propensities for LLPS in their putative IDRs. Furthermore, predicted mutations are often located at or in close proximity to the residues undergoing PTMs (Table 1). Table 1. Summary of AHR, AHRR, SIM1/2, Hif-2α, NPAS4, ARNT2 and BMAL1 mutations, disorder scores, and PTM and LLPS analyses. Protein mutations (based on HuVarBase) are arranged in order. Disorder scores are determined by mean predicted intrinsic disorder score (PIDS mean ). Ordered regions (PIDS mean ≤ 0.15), flexible (i.e., with 0.15 < PIDS mean ≤ 0.5), and disordered (PIDS mean ≥ 0.5) regions are indicated by blue, pink, and red colors, respectively. Closely located documented PTMs (PhosphoSitePlus, distance < 12aa) are listed. PTM sites coinciding with mutation sites are highlighted in yellow. Abbreviations: ac-acetylation, m-methylation, p-phosphorylation, sm-sumoylation, ub-ubiquitylation. Predicted LLPS is marked with '+', '+local' for local maxima of predicted LLPS and '++' for global maximum. Residues predicted as disordered, with close mutation sites and LLPS positive score are highlighted in gray.
We also observed that the G/A substitution (for example, SIM1/A570G and ARNT2/ G710A) could influence the folding propensity of the corresponding region, since glycine is a known helix-breaker, whereas alanine favors α-helix formation. Some mutations could obviously change the physico-chemical properties of a polypeptide chain. For example, E/K substitution causes the change of the sign of the amino acid residue charge (for example: AHR/E488K, SIM1/E155K, SIM2/E106K, Hif-2α/E82K, NPAS4/E724K, ARNT2/E72K or BMAL1/E65K). In other cases, however, for example for R/K (AHR/R554K, ARNT2/R240K) or L/I/V (AHR/V570I, AHRR/I226V, SIM1/V326I, SIM2/V76I, SIM2/L283V, NPAS4/I639V, ARNT2/V110I, BMAL1/V162I), substitution impact was not so obvious, though such substitution also resulted in a deleterious effect. An example would be the K537R mutation of BMAL1, which prevented acetylation of this protein and resulted in inhibition of transcriptional repression important for the rhythmicity of circadian clock [108]. Another example is given by the V304I mutation of the bHLH-PAS family member, NPAS3. In fact, V304I was identified as an NPAS3 missense variant associated with psychiatric disorders. Although the V304I mutation located in the PAS linker did not alter the protein's molecular function, mutation in the disordered region of NPAS3 led to the aggregation of this protein, which resulted in schizophrenia [111,112]. This has led us to hypothesize that some mutations could impact IDRs, thus promoting their misfolding and aggregation. Amyloid structures are widespread in nature for beneficial purposes, such as the formation of functional amyloids. However, misfolding and aggregation can lead to the formation of toxic amyloids often associated with the appearance of aberrant interactions of oligomeric intermediates with endogenous cellular components [113] resulting in disease development. Interestingly, although some proteins containing long IDRs were shown to have a propensity toward aggregate formation, it was also proposed that this aggregation tendency could be due to the aggregation-prone properties of the structured regions of the aggregating proteins [114]. In line with recent studies [115], we hypothesize that, in some cases, mutations could lead to the enhanced protein aggregation by modulating the exposure of the aggregation-prone regions.
Functionalities of IDPs and proteins containing IDRs usually rely on their abilities to interact with other proteins to form complexes and finally to organize PPI networks. This ensures the connection of different signaling pathways and promotes the creation of larger networks [116]. Protein interactivity can be evaluated using a publicly available computational platform STRING, which integrates all the information on PPIs, complements it with computational predictions and returns a PPI network showing all possible PPIs of a query protein(s) [117]. STRING-generated visualization of the internal interactome of selected bHLH-PAS members is presented in Figure 7. In line with earlier studies, Figure 7 shows that the bHLH-PAS proteins can interact with each other forming a rather well-linked PPI network. Figure 7. STRING-based interactome between selected representatives of bHLH-PAS transcription factor (TF) proteins (an internal protein-protein interaction network (PPI)). In the corresponding STRING-generated network, the nodes correspond to proteins, whereas the edges show predicted or known functional associations. Seven types of evidence are used to build the corresponding network, where they are indicated by the differently colored lines: a green line represents neighborhood evidence; a red line-the presence of fusion evidence; a purple line-experimental evidence; a blue line-co-occurrence evidence; a light blue line-database evidence; a yellow line-text mining evidence; and a black line-co-expression evidence.
Since bHLH-PAS TFs usually function as hub proteins at the intersections of many signaling pathways, a high binding promiscuity is extremely important for their activities.
Therefore, we used STRING to study the engagement of the bHLH-PAS TFs in interactions with the proteins forming the first shell of the resulting interactome. In this analysis, a confidence level of 0.5 was used. Figure 8 represents the resulting interactome that includes 432 nodes (proteins) connected by 8235 edges (interactions between proteins). Therefore, this interactome is characterized by an average node degree of 38.1 and shows an average local clustering coefficient of 0.589. Here, the average local clustering coefficient is a measure that defines how close neighbors of a given network are to forming a complete clique (i.e., a network, where each node, also known in graph theory as a vertex, is adjacent to each other vertex in the network). Therefore, the local clustering coefficient is equal to 1 if every neighbor connected to a given node N i is also connected to every other node within the neighborhood, and it is equal to 0 if no node that is connected to a given node N i connects to any other node that is connected to N i . The expected number of interactions for the set of proteins of the network of this size is 3516 indicating that this PPI network centered at the bHLH-PAS TFs has significantly more interactions than expected (PPI enrichment p-value is <10 −16 ). Here, PPI enrichment p-value is a reflection of the fact that query proteins in the analyzed PPI network have more interactions among themselves than what would be expected for a random set of proteins of similar size, drawn from the genome. It was pointed out that such an enrichment indicates that the proteins are at least partially biologically connected, as a group. We also used STRING to investigate the interactivity of individual bHLH-PAS TFs. The corresponding results are presented in the Supplementary Materials and clearly illustrate that all these TFs are promiscuous binders interacting with large numbers of specific partners.
The functionalities of IDPs and IDRs may depend on the abilities of such regions to undergo a disorder to order transition after binding [118]. Disease-associated missense mutations were most often found in PPI-controlling regions [119], known as MoRFs [34]. This indicates that pathogenesis may be associated with the wrong MoRF conformation after a missense mutation occurs. Recently, it was shown that the transition of the peptide mimicking a MoRF to a conformation with pronounced α-helical structure could be distorted by an amino acid substitution with proline as a helix breaker [120]. Activities of MoRFs responsible for PPI or protein localization are also regulated by PTMs, which may induce protein conformational changes. If so, the missense mutations of the residues serving as PTM targets can serve as important sites involved in disease induction after substitution [121].
The activities of bHLH-PAS TFs depend on nucleocytoplasmic shuttling, occurring as the result of interactions with proteins responsible for nuclear export/import. Nuclear localization signal (NLS) or nuclear export signal (NES) sequences were defined in the bHLH and PAS domains as well as in the C-terminal unstructured region of AhR. Ctermini of Hif-1α and Hif-2α also contain conserved NLS and NES sequences. For SIM2 the C-terminal region cytoplasmic localization was documented [122]. Finally, we have previously demonstrated the presence of overlapping NES and NLS in the C-terminal region of NPAS4 [123]. PTMs, such as phosphorylation, especially those taking place in close proximity to the NLS/, were shown to regulate the intracellular distribution of proteins via activation/deactivation of the localization motifs [124]. This suggests that the disease-associated missense mutations located in the C-termini of bHLH-PAS TFs could affect the NLS/NES activities by substitutions of residues in a signal sequence itself, or by substitutions of residues located close to the signal sequence that are important for this signal's activity.
It was shown that cells organize many biochemical processes in specific compartments known as MLOs originating as a result of LLPS. In the nucleus, LLPS is responsible for formation of nucleoli, paraspeckles, and Cajal bodies created by factors regulating, among other processes, chromatin remodeling, transcription, and RNA processing. Such LLPS-driven MLOs can serve as rapid recyclers/reactive storage facilities, which supply or sequester TFs [125]. Altered phase separation affects the disassembly of protein condensates, resulting in their accumulation, which could lead to pathological processes [126]. Interestingly, LLPS of a disease-causing mutant of heterogeneous nuclear ribonucleoprotein A1 (hnRNPA1, D262V) was shown to promote fibrillization of this protein, whereas MLO containing the wild type protein did not [127]. Pathological neurodegeneration related to age or disease and protein aggregation have been also linked to LLPS-driven processes [26]. Proteins containing long IDRs represent an abundant class of macromolecules that can phase separately under physiological conditions. IDRs do not have stable 3D structures and often contain repeated sequence elements providing the basis for multivalent weakly adhesive intermolecular interactions responsible for LLPS formation [128]. Recently, we discussed bHLH TFs as factors putatively engaged in the formation of LLPS during transcription process [31]. We propose that the aberrant regulation of LLPS processes by disease-associated bHLH-PAS variants with specific missense mutations could result in disease development. Obviously, computational results reported in our study require experimental validation. However, they generate testable hypotheses, and therefore these data provide an important foundation for future studies dedicated to the analysis of the effects of mutations in ordered regions, on conformational changes affecting PPIs and the propensities to make LLPS.
To search disease-associated mutations, we have reviewed the literature and analyzed the Human Variants Database (HuVarBase) https://www.iitm.ac.in/bioinfo/huvarbase/ mas18srch.php, (accessed on 11 March 2021) [101]. HuVarBase is a comprehensive database on human genome variants reported in the databases, such as Humsavar (Human polymorphisms and disease mutations), 1000 Genomes (genetic variants occurring at least in 1% of studied populations), SwissVar (portal to search variants in Swiss-Prot entries of the UniProt Knowledgebase), ClinVar (aggregates information about genomic variation and its relationship to human health), and COSMIC (the Catalogue Of Somatic Mutations In Cancer).
Many IDPs and IDRs include disorder-based interaction motifs such as molecular recognition features (MoRFs) [104,[140][141][142] that can undergo binding-induced folding and are utilized by IDPs/IDRs in formation of various complexes and assemblages. Such disorder-based binding sites were predicted by an ANCHOR algorithm [100].
We used the PhophoSitePlus database (https://www.phosphosite.org/homeAction, (accessed on 11 March 2021)) to take a look at the known experimentally documented PTM sites [99], and Waltz predictor (trained on a large set of experimentally characterized amyloid forming peptides) for detection of putative amylogenic regions in proteins [109] (https://waltz.switchlab.org/, (accessed on 11 March 2021)). Settings used for Waltz prediction were "Best Overall Performance" and pH 7.0.
We evaluated protein interactivity using a publicly available computational platform STRING (https://string-db.org/, (accessed on 11 March 2021)) which is an online database that integrates a variety of types of information on protein-protein interactions (PPIs), and complements this with computational predictions and produces a PPI network showing all possible PPIs based on a query protein(s) [117].

Conclusions
In this study, we conducted extensive analyses of the presence of IDRs and LLPS propensities combined with the analyses of human polymorphism and PTM databases, and the results have led us to conclude that most of the disease-associated missense mutations occur in IDRs of analyzed bHLH-PAS family members, which are located in close proximity to the regions important for LLPS regulation, or susceptible to PTMs. Changes in the PTM patterns can affect protein interaction network, protein stability or protein shuttling regulation. Importantly, mutations can also impact propensities for protein aggregation. All such variations can modify protein functions and induce specific disease states. Unfortunately, to date few experimental studies have been conducted concerning the structural characterization of bHLH-PAS IDRs and LLPS of these proteins. This can be explained by difficulties with the expression of proteins containing long IDRs. In the current study, we used available in silico predictors and databases to summarize the current state of knowledge. However, a better understanding of structure and function dependency cannot be achieved without in vivo and/or in vitro experimental data. Therefore, we emphasize the need for conducting further experimental research in these directions, as one of the most importantly future tasks that can enable us to open new perspectives and to gain a better understanding of the roles of LLPS and IDRs in bHLH-PAS TF functioning and development of various diseases.

Conflicts of Interest:
The authors declare no conflict of interest.