Ancient Evolutionary Origin of Intrinsically Disordered Cancer Risk Regions

Cancer is a heterogeneous genetic disease that alters the proper functioning of proteins involved in key regulatory processes such as cell cycle, DNA repair, survival, or apoptosis. Mutations often accumulate in hot-spots regions, highlighting critical functional modules within these proteins that need to be altered, amplified, or abolished for tumor formation. Recent evidence suggests that these mutational hotspots can correspond not only to globular domains, but also to intrinsically disordered regions (IDRs), which play a significant role in a subset of cancer types. IDRs have distinct functional properties that originate from their inherent flexibility. Generally, they correspond to more recent evolutionary inventions and show larger sequence variations across species. In this work, we analyzed the evolutionary origin of disordered regions that are specifically targeted in cancer. Surprisingly, the majority of these disordered cancer risk regions showed remarkable conservation with ancient evolutionary origin, stemming from the earliest multicellular animals or even beyond. Nevertheless, we encountered several examples where the mutated region emerged at a later stage compared with the origin of the gene family. We also showed the cancer risk regions become quickly fixated after their emergence, but evolution continues to tinker with their genes with novel regulatory elements introduced even at the level of humans. Our concise analysis provides a much clearer picture of the emergence of key regulatory elements in proteins and highlights the importance of taking into account the modular organisation of proteins for the analyses of evolutionary origin.


Introduction
Most human genes are thought to have an extensive and very deep evolutionary history. In line with the thought "Nature is a tinkerer, not an inventor" [1], major human gene families date back to the earliest Eukaryotic evolutionary events, or even beyond. The very oldest layers of human genes encode metabolically, structurally, or otherwise essential proteins that typically go back to unicellular evolutionary stages. Mutations to this core biochemical apparatus can prove disruptive to all aspects of cellular life, and indeed, there are known mutational targets associated with genome stability and cancer. In contrast to these "caretaker" genes, a more novel set of genes have emerged at the transition to a multicellular stage. These "gatekeeper" proteins are involved in cell-to-cell communication, especially in early embryonic development and tissue regeneration. Gatekeeper genes that control cell division are among the best known cancer-associated oncogenes and tumor suppressors [2].
In order to establish the evolutionary origins of cancer genes, Domazet-Loso and Tautz carried out a systematic analysis based on phylostratigraphic tracking [3]. By correlating the evolutionary origin of genes with particular macroevolutionary transitions, they found that a major peak connected to the emergence of cancer genes corresponds to the level where multicellular animals have emerged. However, many cancer genes have a more ancient origin and can be traced back to unicellular organisms. These trends seem to apply to the appearance of disease genes [4] and novel genes in general as well [5]. These studies were based on the evolutionary history of the founder domains. However, new genes can also be generated by duplication either in whole or from part of existing genes, when the duplicate copy of a gene becomes associated with a different phenotype to its paralogous partner. This mechanism can also influence the emergence of disease genes [5].
By taking advantage of the flux of cancer genome data, several new proteins have been identified to play a direct role in driving tumorigenesis during recent years [6]. One of the key signatures of cancer drivers is the presence of mutation hotspot regions, where many different patients might show a similarly recurrent pattern of mutations [7]. These hotspots are typically located within well-folded, structured domains. However, many cancer associated proteins have a complex modular architecture, incorporating not only globular domains, but also intrinsically disordered segments, which can also be sites of cancer mutations. In our recent work, we systematically collected disordered regions that are directly targeted by cancer mutations and analyzed their basic functional and system level properties. [8]. While only a relatively small subset of such disordered cancer drivers was identified, their mutations can be the main driver event in certain cancer types. These disordered regions can function in a variety of ways including post-transcriptional modification sites (PTMs), linear motifs, linkers, and larger sized functional modules typically involved in binding to macromolecular complexes. These disordered cancer drivers have a characteristic functional repertoire and increased interaction potential, and their perturbation can give rise to all ten hallmarks of cancer independently of ordered drivers [8].
In general, owing to the lack of structural constraints, disordered segments show more evolutionary variability [9]. In particular, linear motifs can easily emerge to a previously non-functional region of protein sequence by only a few mutations, or disappear as easily, leaving little trace after millions or billions of years [10]. However, elements fulfilling a critical regulatory function might linger on for a longer time. So far, the evolutionary origin of intrinsically disordered regions that have a critical function proven by a human disease association has not been analyzed.
In the current study, we studied the evolutionary origin of disordered cancer risk regions. For this, we used a dataset of cancer driving proteins in which cancer mutations specifically targeted intrinsically disordered regions [8]. We retrieved phylogeny data from the ENSEMBL Compara database. Using a novel conservation and phylogenetic-based strategy, we determined the evolutionary origin not only at the gene level, but also at the region level. In addition, we also investigated the emergence mechanism of disordered cancer risk regions and how evolutionary constraints, selection, and gene duplications events influenced the fate of these examples. Finally, we presented interesting case studies that demonstrate the ancient evolutionary origin of these examples and the continuing evolution of their genes built around the critical conserved functional module.

Dataset
We used a subset of the previously identified disordered cancer risk regions [8]. These regions were identified based on genetic variations collected from the COSMIC database [11] using the method that located specific regions that are enriched in cancer mutations [7]. Disorder status of these regions was verified based on experimental data collected from dedicated databases and from the literature when available, or based on consensus disorder prediction methods [8]. Mapping was not feasible for CDKN2A isoform (Tumor suppressor ARF), because it was not present in the ENSEMBL database we used in our study), hence this protein was excluded from the further analyses. Proteins in which both disordered and ordered cancer regions were identified were filtered out in order to be able to focus clearly on the disordered regions. Regions that were primarily mutated by in-frame insertion and deletion and contained less than 15 missense mutations were also excluded because of our conservation calculation method (see below). Finally, histone proteins were merged, keeping the single entry of HIST1H3B. Ultimately, we obtained a list of 36

Evolutionary Framework
In this work, we calculated the evolutionary origin of cancer risk regions within our dataset of disordered proteins. Our approach focused on the age of orthologous gene families, instead of focusing on the evolutionary origin of founder domains. Assignment of age of human gene families (origin) was carried out using the ENSEMBL genome browser database. To identify the origin of individual human gene families, we fetched the phylogenies and analysed the evolutionary supertrees built by the pipeline of the ENSEMBL Compara multi-species comparisons project [12,13]. The used release (99) of the project contained 282 reference species including 277 vertebrata, 4 eumetazoa, and 1 opisthokonta (S. cerevisiae) species. Note that, in these phylogenies, the most ancient node can be the ancestor of yeast. The origin of the gene family was identified by taking the taxonomy level of the most ancient node of the phylogenetic supertrees. Taxonomy levels were broken into major nested age categories (mammals, vertebrates, eumetazoa, opisthokonta), similarly to previous studies [14].
To define the evolutionary origin of regions, we built a customized pipeline that included collecting and mapping mutations from COSMIC database to ENSEMBL entries, constructing multiple sequence alignments of protein families, and mapping the cancer regions among orthologs and paralogs. According to the ENSEMBL supertrees, protein sequences of human paralogs (including the cancer gene) and their orthologs were queried from the database using the Rest API function. Then, multiple sequence alignments of the corresponding sequences were created with MAFFT (default settings) [15]. On the basis of the sequence alignments, cancer regions were mapped onto the sequences. In the mapping step, cancer regions were considered as functional units (linear motifs, linkers, disordered domains) and borders of the regions were defined according to this. When the highly mutated regions covered only a single residue, it was extended to cover the known functional linear motif or using its sequence neighbourhood. On this basis, the subset of paralogs, in which the mapped cancer region was found to be conserved, was identified.
Next, the set of sequences containing regions that showed evolutionary similarity to the mutated regions were identified among the collected orthologs and paralogs. Conservation of the regions among paralogs was evaluated relying on two strategies, by calculating the similarity of mutated positions in the cancer risk regions (see below) and based on HMM profiles. This consideration was taken into account in order to reduce the chance of false conservation interpretation arising from the difficulty of aligning disordered proteins. The HMM profiles were built from conserved cancer regions of vertebrate model organisms using the HMMER (version 3.3) method [16]. The identified region hits were manually checked to minimize the chance of false positives or negatives. Next, we identified the evolutionarily most distant relative in which the cancer region was declared to be conserved. As a result, the origin of the region could differ from the origin of the orthologous gene family, when paralogue sequences that contained the conserved motif had a more ancient origin. Basically, we treated the cancer risk regions as the founder of the family. The taxonomy level of this ortholog was defined as the level in which the cancer region emerged in the common ancestor of this ortholog and H. sapiens.

Region Conservation
Within the identified cancer risk region, some of the positions could be more heavily mutated and are likely to be more critical for the function of this region. We took this into account when calculating the region conservation. Mutations for each position collected from the COSMIC database were mapped to the corresponding ENSEMBL human entry. On the basis of the sequence alignment corresponding to the cancer risk regions, we identified the positions that were similar to the reference sequence. Two positions were considered similar when the substitution score was non-negative according to the BLOSUM62 substitution matrix. A given cancer region was considered to be conserved between homologs, when the conserved residues carried more than 50% of missense mutations.

Positive Selection: Selectome and McDonald and Kreitman (MK) Test Results
For each entry in our dataset, we collected information about positive selection using the Selectome database (current version 6) [17]. This database contains collected sites of positive selection detected on a single branch of the phylogeny using the systematic branch-site test of the CODEML algorithm from the PAML [18] phylogenetic package version 4b. The ratio of non-synonymous and synonymous substitutions (ω) can be interpreted as a measurement of selective pressure indicating purifying (ω values < 1), neutral (ω values = 1), or positive (ω values > 1) selection. In our work, positions under positive selection that have a posterior probability higher than 0.9 were extracted from the database and mapped onto our gene set.
However, the branch-site model generally cannot detect species-specific positive selection. Potential cases of human-specific positive selection may be detected effectively by comparing divergence to polymorphism data, as in the McDonald and Kreitman (MK) test. Human-specific positive selection detected by MK test previously calculated [19] was mapped onto our dataset of disordered cancer genes.

Evolutionary Origin of Genes and Regions
Altogether, we collected 36 cancer risk regions of 32 disordered proteins and investigated the evolutionary origin at the level of genes and regions. The age estimation of disordered cancer genes was obtained using the last common ancestor of descendants using the ENSEMBL supertrees, which includes phylogeny of gene families returning not only individual gene history, but also relationships of ancient paralogs and their history (see Material and Methods). Using this strategy instead of analysing the evolution of individual genes or simply the emergence of the founder domain, we could define the origin of regions more precisely, even the ancient ones, without introducing any bias of overprediction of origins. However, some ambiguity still remained and was manually checked (Supplementary Materials 1). The genes were traced back to opisthokonta (in accordance with the ENSEMBL database) and divided into four major phylostratigraphic groups, which are associated with the emergence of unicellular, multicellular organisms, vertebrates, and mammals.
Previous results identified the level of eumetazoa as the main age for the emergence of cancer associated proteins [3]. We observed a similar trend in the case of disordered cancer proteins. Specifically, we found that 21 disordered cancer proteins, the majority of cases, have emerged at the level of eumetazoa ( Figure 1). Fourteen cases were found to be even more ancient and could be traced back to single cell organisms, at least to opisthokonta. The only protein that emerged more recently, at the level of vertebrates, was CD79B, the B-cell antigen receptor complex-associated protein β chain. Its appearance is in agreement with the birth of many immune receptors [20] and is assumed to be driven by the insertion of transposable elements.

Position Conservation
Overall, these results point to the ancient evolutionary origin of disordered regions involved in In around half of the cases (21), the emergence of the mutated region was the same as the emergence of the protein ( Figure 1). Strikingly, these included five cases (EIF1AX, HIST1H3B, MLH1, RPS15, SMARCB1) where not only the gene/protein, but also the region primarily mutated in human cancers were very ancient and could be traced back to unicellular organisms. Fifteen regions with Eumetazoa and one with Vertebrata origin could be traced back to the same level as their corresponding gene. However, in several cases, the emergence of the region was a more recent event compared with the emergence of the gene. Of these, eight regions emerged at the Eumetazoa and seven at the Vertebrate level. In general, there was only one level difference between the emergence of the gene and the region at this resolution. The only exception was SETBP1. In this case, the region itself emerged at the vertebrate level. However, the gene could be traced back to opisthokonta level, although the eumetazoa origin cannot be completely ruled out (see Supplementary Materials 1). Overall, many of the disordered regions were more recent evolutionary inventions compared with the origin of their genes, and date back to the common ancestors of eumetazoans or vertebrates. Nevertheless, the ancestors of all of the regions were already present from the vertebrate level.

Position Conservation
Overall, these results point to the ancient evolutionary origin of disordered regions involved in cancer, not only at the gene level, but also at the region level. To take a closer look, we also calculated the conservation of individual positions within the regions based both in terms of homologous substitutions and identity. The results show that these residues are highly conserved even compared with the conservation of the whole sequence ( Figure 2). Here, 86% of the regions have more than 0.8 average conservation value even based on identities ( Figure 2A). Among the cases with the four lowest values, the conservation of VHL, CALR, and APC, which all correspond to relatively longer segments, was still relatively high. The only outlier was BCL2. In this case, the mutations are distributed along the N-terminal, encompassing the highly conserved BH4 motif, as well as the linker region between the BH4 and C-terminal part, which is conserved only in mammals ( Figure S1). with the conservation of the whole sequence ( Figure 2). Here, 86% of the regions have more than 0.8 average conservation value even based on identities ( Figure 2A). Among the cases with the four lowest values, the conservation of VHL, CALR, and APC, which all correspond to relatively longer segments, was still relatively high. The only outlier was BCL2. In this case, the mutations are distributed along the N-terminal, encompassing the highly conserved BH4 motif, as well as the linker region between the BH4 and C-terminal part, which is conserved only in mammals ( Figure  S1). Next, we investigated how this average value is altered when only the highly mutated positions are considered. We repeated that analysis for positions that had at least 15   We also collected sites of potential positive selection mapped onto our genes based on the Selectome database [17], which provides information on likely molecular selection both at the level of the evolutionary branch and the sequence position based on the ratio of non-synonymous and synonymous substitutions (ω). According to these results, positive selection affected only three genes on the human lineage in our dataset, CALR, CTNNB1, and VHL. All of these selections could be We also collected sites of potential positive selection mapped onto our genes based on the Selectome database [17], which provides information on likely molecular selection both at the level of the evolutionary branch and the sequence position based on the ratio of non-synonymous and synonymous substitutions (ω). According to these results, positive selection affected only three genes on the human lineage in our dataset, CALR, CTNNB1, and VHL. All of these selections could be mapped onto the vertebrates division with multiple positions (see Material and Methods) ( Table 1). However, these positions showed limited overlap with the mutated regions. In the case of CTNNB1, none of the positions under selection overlapped with the cancer mutated region. In the case of CALR, there was only a single position under selection within the cancer risk region, but it was not directly targeted by cancer mutations. In the case of VHL, six positions were detected with selective pressure and five of them were situated within the significantly mutated region. However, none of them corresponded to a highly mutated residue.
Taking advantage of an earlier analysis [19], we also analyzed if there was any human specific positive selection. As the ω based approach can not be used without uncertainty to identify human-specific positive selection, this work relied on the McDonald and Kreitman (MK) test, which compares the divergence to polymorphism data using closely related species, such as human and chimp. There was only a single entry in our database, ESR1, that showed human specific evolutionary changes (see case studies).

Contribution of Duplications to the Emergence of Disease Risk Regions
Gene duplications often drive the appearance of a novel function through the process called neofunctionalization. In these cases, after a duplication event, one copy may acquire a novel, beneficial function that becomes preserved by natural selection. Here, we have evaluated whether the emergence of disordered cancer regions corresponds to such neofunctionalization events. For this analysis, we collected paralog sequences and evaluated if there were regions present in these sequences that showed clear evolutionary similarity to the cancer mutated region.
The evolutionary history of many genes is quite complex and can involve multiple duplication events. We focused on the level where the cancer regions emerged and distinguished the following scenarios based on the relationship between the duplication and the presence of the region among the paralogs. The first scenario corresponds to duplication induced neofunctionalization. In this case, an ancient cancer region emerged directly after a given gene duplication and became preserved in only one of the branches that appeared after the duplication ( Figure 3A). There are two basic scenarios in which the duplication cannot be directly linked with the emergence of the regions. One possible scenario is when both branches contain the region, which indicates that the region must have emerged before the duplication ( Figure 3B). The other possible scenario is when the region emerged at a later evolutionary stage after a duplication, and duplication cannot be directly linked to neofunctionalization ( Figure 3B). Surprisingly, the duplication induced neofunctionalization was much less common than we expected, with only seven cases showing this behaviour. One example for this scenario is presented by the β-catenin family, where the degron motif [21] based cancer risk region that emerged after duplication is present only on the branch of β-catenin and junctional plakoglobin (JUP). In contrast, we found that 23 regions have evolved by de novo emergence, which seemed to be the dominant mechanisms for the emergence of the analyzed cancer mutated disordered regions ( Figure 4A). For example, ID3 underwent multiple duplications, but all paralogs contain the cancer risk region, which indicates that the region emerged prior to the duplication. Another example is ESR1, in which case the paralogs were born at the level of eumetazoa; however, this event is not directly linked to the emergence of the cancer region, which appeared only at the level of the ancient vertebrates. In addition, there were two singletons in our dataset, RPS15 and SMARCB1, which did not have any detectable paralogs. In the cases of ASXL1, CCND3, SETBP1, and the first region of CARD11, the evolutionary scenarios could not be unambiguously established. These six examples formed the "Other" group. Surprisingly, the duplication induced neofunctionalization was much less common than we expected, with only seven cases showing this behaviour. One example for this scenario is presented by the β-catenin family, where the degron motif [21] based cancer risk region that emerged after duplication is present only on the branch of β-catenin and junctional plakoglobin (JUP). In contrast, we found that 23 regions have evolved by de novo emergence, which seemed to be the dominant mechanisms for the emergence of the analyzed cancer mutated disordered regions ( Figure 4A). For example, ID3 underwent multiple duplications, but all paralogs contain the cancer risk region, which indicates that the region emerged prior to the duplication. Another example is ESR1, in which case the paralogs were born at the level of eumetazoa; however, this event is not directly linked to the emergence of the cancer region, which appeared only at the level of the ancient vertebrates. In addition, there were two singletons in our dataset, RPS15 and SMARCB1, which did not have any detectable paralogs. In the cases of ASXL1, CCND3, SETBP1, and the first region of CARD11, the evolutionary scenarios could not be unambiguously established. These six examples formed the "Other" group. We also analyzed if additional duplication events occurred after the emergence of regions and whether the novel paralogues retained the regions. There are basically three scenarios that can occur: (i) the region is preserved without any further duplications; (ii) the region spreads and becomes preserved in all of the novel duplicates; (iii) partial loss scenario, that is, the region is preserved in some duplicates, but is lost in others. Our results show that the most common evolutionary fate is the second one ( Figure 4B). In 29 cases, at least one duplication that inherited the region can be observed after the emergence of the cancer region. In contrast, only five regions were not duplicated. Some ancient cases, such as MLH1 and USP8, are also included among the non-duplicated ones, which means that the reason for the lack of duplications is not the short evolutionary time. The partial loss scenario was observed in only two cases, in the case of VHL and NFE2L2. For instance, in the case of VHL, there was a relatively recent gene duplication at the level of mammals. While the N-terminal segment is present on both paralogs (VHL and VHLL), the C-terminal segment is only present in VHL, but was lost from VHLL. In a similar fashion, NFE2L2 underwent a more recent gene duplication at the level of vertebrates, but the newly emerged paralog did not retain the two linear motifs that are primarily targeted by cancer mutations.

MLH1
One of the most ancient examples in our dataset corresponds to MLH1 (MutL Homolog 1), an essential protein in DNA mismatch repair (MMR). As one of the classic examples of a caretaker function, mutations of MLH1 can lead to cancer by increasing the rate of single-base substitutions and frameshift mutations [22]. Several positions of MLH1 are mutated in people with Lynch syndrome, also known as hereditary nonpolyposis colorectal cancer (HNPCC). However, according to the COSMIC database of somatic cancer mutations, the most common mutation of MLH1 is V384D. Mutational studies of V384D using yeast assays and in vitro MMR assay did not indicate a strong phenotype, but still showed a limited decrease of MMR activity [23]. However, it was shown that the (mostly germline) V384D variant is clearly associated with increased colorectal cancer susceptibility [24], and it is highly prevalent in HER2-positive luminal B breast cancer [25].
MLH1 is an ancient protein that is present from bacteria to humans. It has a highly conserved domain organization that involves ordered N-and C-terminal domains connected by a disordered linker [26] (Figure 5). This underlines the functional importance not only of the structured domains, but also of the connecting disordered region. In our previous work, we identified the region from We also analyzed if additional duplication events occurred after the emergence of regions and whether the novel paralogues retained the regions. There are basically three scenarios that can occur: (i) the region is preserved without any further duplications; (ii) the region spreads and becomes preserved in all of the novel duplicates; (iii) partial loss scenario, that is, the region is preserved in some duplicates, but is lost in others. Our results show that the most common evolutionary fate is the second one ( Figure 4B). In 29 cases, at least one duplication that inherited the region can be observed after the emergence of the cancer region. In contrast, only five regions were not duplicated. Some ancient cases, such as MLH1 and USP8, are also included among the non-duplicated ones, which means that the reason for the lack of duplications is not the short evolutionary time. The partial loss scenario was observed in only two cases, in the case of VHL and NFE2L2. For instance, in the case of VHL, there was a relatively recent gene duplication at the level of mammals. While the N-terminal segment is present on both paralogs (VHL and VHLL), the C-terminal segment is only present in VHL, but was lost from VHLL. In a similar fashion, NFE2L2 underwent a more recent gene duplication at the level of vertebrates, but the newly emerged paralog did not retain the two linear motifs that are primarily targeted by cancer mutations.

MLH1
One of the most ancient examples in our dataset corresponds to MLH1 (MutL Homolog 1), an essential protein in DNA mismatch repair (MMR). As one of the classic examples of a caretaker function, mutations of MLH1 can lead to cancer by increasing the rate of single-base substitutions and frameshift mutations [22]. Several positions of MLH1 are mutated in people with Lynch syndrome, also known as hereditary nonpolyposis colorectal cancer (HNPCC). However, according to the COSMIC database of somatic cancer mutations, the most common mutation of MLH1 is V384D. Mutational studies of V384D using yeast assays and in vitro MMR assay did not indicate a strong phenotype, but still showed a limited decrease of MMR activity [23]. However, it was shown that the (mostly germline) V384D variant is clearly associated with increased colorectal cancer susceptibility [24], and it is highly prevalent in HER2-positive luminal B breast cancer [25].
MLH1 is an ancient protein that is present from bacteria to humans. It has a highly conserved domain organization that involves ordered N-and C-terminal domains connected by a disordered linker [26] (Figure 5). This underlines the functional importance not only of the structured domains, but also of the connecting disordered region. In our previous work, we identified the region from 379 to 385 to be significantly mutated [7], which is located within the disordered segment. Recently, it was shown that the linker can regulate both DNA interactions and enzymatic activities of neighboring structured domains [27]. In agreement with the linker function, both the composition and length of this intrinsically disordered region (IDR) are critical for efficient MMR. Overall, most of the linker shows relatively low sequence conservation, however, the identified cancer risk region is highly conserved from across all eukaryotic sequences ( Figure 5), in an island-like manner. Although the exact function of this region is not known, the strong evolutionary conservation indicates a highly important function, not yet explored in detail.
379 to 385 to be significantly mutated [7], which is located within the disordered segment. Recently, it was shown that the linker can regulate both DNA interactions and enzymatic activities of neighboring structured domains [27]. In agreement with the linker function, both the composition and length of this intrinsically disordered region (IDR) are critical for efficient MMR. Overall, most of the linker shows relatively low sequence conservation, however, the identified cancer risk region is highly conserved from across all eukaryotic sequences ( Figure 5), in an island-like manner. Although the exact function of this region is not known, the strong evolutionary conservation indicates a highly important function, not yet explored in detail.

VHL
VHL, the Von Hippel-Lindau disease tumor suppressor protein possesses an E3 ligase activity. It plays a key role in cellular oxygen sensing by targeting hypoxia-inducible factors for ubiquitylation and proteasomal degradation. To carry out its function, VHL forms a complex with elongin B, elongin C, and cullin-2 and the RING finger protein RBX1 [28,29]. VHL has an α-domain (also known as the VHL-box, residues 155 to 192) that forms the principal contacts with elongin C, and a larger β-domain (residues 63 to 154) that directly binds the proline hydroxylated substrate, HIF1α. The positions mutated across various types of cancers cover a large part of the protein, including both the α and β domains. While these regions form a well-defined structure in complex with elongin B, elongin C, and cullin-2, they are disordered in isolation and rapidly degraded [30].
The VHL gene emerged de novo at the level of Eumetazoa together with HIFα and PHD, the other key components of the hypoxia regulatory pathway. However, more recently, the gene underwent various evolutionary events. The VHL gene showed slightly higher evolutionary variations compared with other cancer risk regions (Figure 2). Some positions, including K171, showed signs of positive selection at the level of Sarcopterygii, which might implicate the occurrence of an important evolutionary event. It was shown that the SUMO E3 ligase PIASy interacts with VHL and induces VHL SUMOylation on lysine residue 171 [31]. VHL also undergoes ubiquitination on K171 (and K196), which is blocked by PIASy. In the proposed model of the dynamic regulation of VHL, the interaction of VHL with PIASy results in VHL nuclear localization, SUMOylation, and stability for blocking ubiquitylation of VHL. Meanwhile, PIASy dissociation with VHL or

VHL
VHL, the Von Hippel-Lindau disease tumor suppressor protein possesses an E3 ligase activity. It plays a key role in cellular oxygen sensing by targeting hypoxia-inducible factors for ubiquitylation and proteasomal degradation. To carry out its function, VHL forms a complex with elongin B, elongin C, and cullin-2 and the RING finger protein RBX1 [28,29]. VHL has an α-domain (also known as the VHL-box, residues 155 to 192) that forms the principal contacts with elongin C, and a larger β-domain (residues 63 to 154) that directly binds the proline hydroxylated substrate, HIF1α. The positions mutated across various types of cancers cover a large part of the protein, including both the α and β domains. While these regions form a well-defined structure in complex with elongin B, elongin C, and cullin-2, they are disordered in isolation and rapidly degraded [30].
The VHL gene emerged de novo at the level of Eumetazoa together with HIFα and PHD, the other key components of the hypoxia regulatory pathway. However, more recently, the gene underwent various evolutionary events. The VHL gene showed slightly higher evolutionary variations compared with other cancer risk regions (Figure 2). Some positions, including K171, showed signs of positive selection at the level of Sarcopterygii, which might implicate the occurrence of an important evolutionary event. It was shown that the SUMO E3 ligase PIASy interacts with VHL and induces VHL SUMOylation on lysine residue 171 [31]. VHL also undergoes ubiquitination on K171 (and K196), which is blocked by PIASy. In the proposed model of the dynamic regulation of VHL, the interaction of VHL with PIASy results in VHL nuclear localization, SUMOylation, and stability for blocking ubiquitylation of VHL. Meanwhile, PIASy dissociation with VHL or attenuation of VHL SUMOylation facilitates VHL nuclear export, ubiquitylation, and instability. This dynamic process of VHL with reversible modification acts in concert to inhibit HIF1α [32].
A novel acidic repeat region appeared at the N-terminal region of the protein at the level of Sarcopterygii, and this region underwent further repeat expansion in the lineage leading up to humans ( Figure 6). These GxEEx repeats are generally thought to confer additional regulation to the long isoform of VHL (translated from the first methionine), with a number of putative (USP7) or experimentally detected (p14ARF) interactors [33]. Although poorly studied, this repetitive region also seems to harbour casein kinase 2 (CK2) phosphorylation as well as proteolytic cleavage sites, regulating VHL half-life (consistent with a deubiquitinase, such as USP7 binding role) [34]. As a result of a recent gene duplication, the human genome even encodes a VHL-like protein (VHLL), which has lost the C-terminal segment including the α domain. Consequently, VHLL cannot nucleate the multiprotein E3 ubiquitin ligase complex. Instead, it was suggested that VHLL functions as a dominant-negative VHL to serve as a protector of HIF1α [35]. This example demonstrates that, while the basic cancer risk region remains largely unchanged during evolution, additional regulatory mechanisms can emerge to further fine-tune the function of the protein. attenuation of VHL SUMOylation facilitates VHL nuclear export, ubiquitylation, and instability. This dynamic process of VHL with reversible modification acts in concert to inhibit HIF1α [32]. A novel acidic repeat region appeared at the N-terminal region of the protein at the level of Sarcopterygii, and this region underwent further repeat expansion in the lineage leading up to humans ( Figure 6). These GxEEx repeats are generally thought to confer additional regulation to the long isoform of VHL (translated from the first methionine), with a number of putative (USP7) or experimentally detected (p14ARF) interactors [33]. Although poorly studied, this repetitive region also seems to harbour casein kinase 2 (CK2) phosphorylation as well as proteolytic cleavage sites, regulating VHL half-life (consistent with a deubiquitinase, such as USP7 binding role) [34]. As a result of a recent gene duplication, the human genome even encodes a VHL-like protein (VHLL), which has lost the C-terminal segment including the α domain. Consequently, VHLL cannot nucleate the multiprotein E3 ubiquitin ligase complex. Instead, it was suggested that VHLL functions as a dominant-negative VHL to serve as a protector of HIF1α [35]. This example demonstrates that, while the basic cancer risk region remains largely unchanged during evolution, additional regulatory mechanisms can emerge to further fine-tune the function of the protein.

ESR1
Estrogen receptor 1 (ESR1) is a member of the nuclear hormone receptor family with eumetazoan origin. The most common mutation in both primary and tamoxifen therapy associated samples corresponds to a single mutation (K303R). This single site emerged more recently (Figure 7) and is located in a rather complex switch region adjacent to the ligand-binding domain ( Figure S2). The highly mutated K303 of ESR1 (more than 200 K303R missense mutations are seen in COSMIC) is a part of a motif-based molecular switch region involving several mutually exclusive PTMs. At positions 302, 303, and 305, methylation by SET7/9, acetylation by p300, and phosphorylation by PKA or PAK1 were observed in previous studies, respectively [36][37][38][39][40]. Our results show that this region is conserved only in Sarcopterygii, which indicates a relatively young evolutionary origin of the switching mechanism. However, while the methylation and acetylation sites are well conserved, the phosphorylation motif appears to be specific only to H. sapiens. We came to this conclusion because R300 and K302 as well as L306 are required for the protein kinase A (PKA) phosphorylation consensus and the oncogenic mutation K303R is expected to turn this region into an even better PKA substrate [41,42]. Curiously, these residues are not found in any other mammal, supposing species specific adaptive changes.

ESR1
Estrogen receptor 1 (ESR1) is a member of the nuclear hormone receptor family with eumetazoan origin. The most common mutation in both primary and tamoxifen therapy associated samples corresponds to a single mutation (K303R). This single site emerged more recently (Figure 7) and is located in a rather complex switch region adjacent to the ligand-binding domain ( Figure S2). The highly mutated K303 of ESR1 (more than 200 K303R missense mutations are seen in COSMIC) is a part of a motif-based molecular switch region involving several mutually exclusive PTMs. At positions 302, 303, and 305, methylation by SET7/9, acetylation by p300, and phosphorylation by PKA or PAK1 were observed in previous studies, respectively [36][37][38][39][40]. Our results show that this region is conserved only in Sarcopterygii, which indicates a relatively young evolutionary origin of the switching mechanism. However, while the methylation and acetylation sites are well conserved, the phosphorylation motif appears to be specific only to H. sapiens. We came to this conclusion because R300 and K302 as well as L306 are required for the protein kinase A (PKA) phosphorylation consensus and the oncogenic mutation K303R is expected to turn this region into an even better PKA substrate [41,42]. Curiously, these residues are not found in any other mammal, supposing species specific adaptive changes.
Comparison of substitutions and polymorphic sites is a powerful approach to identify specific changes in a pair of closely related species, like H. sapiens and chimpanzee. Relying on this approach, 198 of 9785 analyzed genes were identified to show human-specific changes including ESR1 [19]. In ESR1, there are three more changes besides R300 and K306 (L44, Q502, S559) between H. sapiens and chimp that are also thought to be adaptive substitutions according to the MK test. Phosphorylation of S559 was experimentally identified, suggesting this residue is also a H. sapiens specific PTM [43,44], but there is no specific data in the literature about the biological function of L44 and Q502. Yet, we know that phosphorylation of S305 allows the increase of estrogen sensitivity by external stimuli other than steroids, and permits ESR1 activity even when the canonical estrogen effect is completely blocked by tamoxifen [40,42]. In mice, ESR1 activity is essential for the estrogen effect and normal estrous episodes [45,46]. Although we lack information, we theorize that this human-specific signaling crosstalk might somehow be connected to the continuous menstrual cycle of H. sapiens (quite unusual among mammals), or some other human-specific reproductive adaptation. Comparison of substitutions and polymorphic sites is a powerful approach to identify specific changes in a pair of closely related species, like H. sapiens and chimpanzee. Relying on this approach, 198 of 9785 analyzed genes were identified to show human-specific changes including ESR1 [19]. In ESR1, there are three more changes besides R300 and K306 (L44, Q502, S559) between H. sapiens and chimp that are also thought to be adaptive substitutions according to the MK test. Phosphorylation of S559 was experimentally identified, suggesting this residue is also a H. sapiens specific PTM [43,44], but there is no specific data in the literature about the biological function of L44 and Q502. Yet, we know that phosphorylation of S305 allows the increase of estrogen sensitivity by external stimuli other than steroids, and permits ESR1 activity even when the canonical estrogen effect is completely blocked by tamoxifen [40,42]. In mice, ESR1 activity is essential for the estrogen effect and normal estrous episodes [45,46]. Although we lack information, we theorize that this human-specific signaling crosstalk might somehow be connected to the continuous menstrual cycle of H. sapiens (quite unusual among mammals), or some other human-specific reproductive adaptation.

Discussion
In our study, we aimed to estimate the evolutionary origin of disordered regions that are specifically targeted in cancer. Intrinsically disordered protein regions play essential roles in a wide-range of biological processes and can function as linear motifs, linkers, or other intrinsically disordered domain-sized segments [47]. They are integral parts of many cancer associated proteins and, in a smaller number of cases, they can also be the direct targets of cancer driving mutations. In general, IDRs are believed to be of more recent evolutionary origin, and exhibit higher rates of

Discussion
In our study, we aimed to estimate the evolutionary origin of disordered regions that are specifically targeted in cancer. Intrinsically disordered protein regions play essential roles in a wide-range of biological processes and can function as linear motifs, linkers, or other intrinsically disordered domain-sized segments [47]. They are integral parts of many cancer associated proteins and, in a smaller number of cases, they can also be the direct targets of cancer driving mutations. In general, IDRs are believed to be of more recent evolutionary origin, and exhibit higher rates of evolutionary variations compared with that of folded globular domains [9]. However, this is not what we see in the case of disordered cancer genes. Instead, we observed that cancer-targeted disordered regions are extremely conserved with deep evolutionary origins, which underlines their critical function. The two main ages for emergence of disordered cancer genes can be linked to unicellular organisms and the emergence of multicellularity, in agreement with the result of phylostratigraphic tracking of cancer genes in general [3].
One of the most unexpected findings of our study is the examples of disordered cancer genes that can be traced back to unicellular organisms. Mechanistically, the group of cancer genes that emerged in unicellular organisms were suggested to play a caretaker role and contribute to tumorigenesis by increasing mutation rates and genome instability. In contrast, cancer genes that emerged at the level of multicellularity were suggested to typically have a gatekeeper function and promote tumour progression directly by changing cell differentiation, growth, and death rates [48]. MLH1 is one of the best characterized examples of a gene with a caretaker function [49]. It is involved in mismatch repair (MMR) of DNA bases that have been misincorporated during DNA replication. Thus, disruptive mutations of MLH1 greatly increase the rate of point mutations in genes and underline various inherited forms of cancer. However, the most commonly seen alterations in patients are located in the flexible internal linker. Mutational studies indicate that this highly conserved segment might not be directly involved in MMR, but likely has an important, currently uncharacterized function. The other ancient examples are also involved in basic cellular processes, however, they are associated with a broader set of functions. HIST1H3B, SMARCB1, and SETBP1 are involved in epigenetic regulation and their mutations can alter gene expression patterns [50,51]. Mutations of EIF1AX and RPS15 are likely to perturb translation events [52,53]. However, SRSF2, which is responsible for orchestrating splicing events, can also have a global influence on cellular states [54]. Therefore, the caretaker function is also a subject of evolution and some of its components emerged as a result of more recent evolutionary events.
A clear novelty of our approach is to focus at the origin of sub-gene elements; that is, regulatory regions, modules, and domains, instead of full genes. The genes can be built around founder genes that have an extremely ancient origin, but their biological function and regulation can change fundamentally during subsequent evolution. In several cases, the origin of the cancer mutated region was substantially more recent than the origin of the gene. Nevertheless, after their emergence, disordered cancer regions were fixated rapidly and showed little variations afterwards. However, their evolution at the gene level was not set in stone and there are several indications that this process continues indefinitely. In several cases, the cancer genes underwent gene duplications, further regulatory regions were added, or fine-tuned by changing some of the less critical positions. We highlighted a fascinating case when such an event occurred when our species, H. sapiens, separated from its primate relatives.
In general, the rate of gene duplications is very high (0.01 per gene per million years) over evolution, which provides the source of emergence of evolutionary novelties [55]. According to the general view, paralogs go through a brief period of relaxed selection directly after duplications-this time ensures the acquisition of novelties-and subsequently experience strong purifying selection, preserving the newly developed function. However, our results showed that only a few disordered cancer regions have emerged in a duplication induced manner and the vast majority of disordered cancer regions emerged de novo, independent of duplications. The evolution of disordered regions is better described by the ex-nihilo motif theory, which is based on the rapid disappearance and emergence of linear motifs by the change of only a few residues within a given disordered protein segment [10]. This evolutionary phenomenon is commonly observed in the case of linear motifs, for example, in the case of NFE2L2. This protein carries a pair of crucial linear motifs that have emerged in the ancient eumetazoa, but are not preserved in the most recent duplicates. In an evolutionary biology aspect, our results suggest that the evolution of functional novelties in the case of disordered region mediated functions requires a more complex model.
Exploring the evolutionary origin of cancer genes is an important step to understand how this disease can emerge. This knowledge can also have important implications of how their regulatory networks are disrupted during tumorigenesis and can be incorporated into developing improved treatment options [56]. In this work, we focused on a subset of cancer genes that belong to the class of intrinsic disordered proteins, which rely on their inherent flexibility to carry out their important functions. While the selected examples represent only a small subset of cancer genes, they are highly relevant for several specific cancer types [8]. In general, disordered proteins are evolutionarily more variable compared with globular proteins, however, the disordered cancer risk regions showed remarkable conservation with ancient evolutionary origin, highlighting their importance in core biological processes. Nevertheless, we found several examples where the region specifically targeted by cancer mutations emerged at a later stage compared with the origin of the gene family. Our results highlight the importance of taking into account the complex modular architecture of cancer genes in order to get a more complete understanding of their evolutionary origin.