Multiple polyvalency provided by intrinsically disordered segments is a key feature of postsynaptic scaffold proteins

The postsynaptic density, a key regulator of the molecular events of learning and memory, is composed of an elaborate network of interacting proteins capable of dynamic reorganization. Despite our growing knowledge of specific proteins and their interactions, atomic-level details of its full three-dimensional structure and its rearrangements are still largely elusive. In this work we addressed the extent and possible role of intrinsic disorder in postsynaptic proteins in a detailed in silico analysis. Using a strict consensus of predicted intrinsic disorder and a number of other protein sets as controls, we show that postsynaptic proteins are particularly enriched in disordered segments. Although the number of interacting partner proteins is not exceptionally large, the estimated diversity of the combinations of putative complexes is high in postsynaptic proteins.


Introduction
The postsynaptic density (PSD) is a characteristic part of excitatory chemical synapses. It forms a disk-shaped electrodense entity about 200-800 nm wide and 30-50 nm thick beneath the postsynaptic membrane. It is composed of a dense network of proteins and provides a complex link between the intracellular parts of membrane receptors and adhesion molecules and the cytoskeleton (1). In-cell studies strongly suggest that the PSD is a highly organized molecular network, possibly with spatially distinct 'nanodomains' with functional relevance such as AMPA receptor positioning and anchoring (2,3,4). The emerging view is that the PSD is constantly remodeled and has characteristic dynamics, manifested not only by the addition and elimination of components over time but also by dynamic restructuring. This strongly suggests the presence of a number of intermolecular interaction sites with an elaborate and thoroughly regulated distribution of occupied, unavailable and available partner binding sites with a high level of redundancy. The PSD can most likely be imagined as a supramolecular machine capable of integrating and transmitting signals via dynamic reorganization. The underlying mechanisms most likely include competitive binding events, allostery and cooperativity at the level of proteins, and the network might act as an amplifier and integrator of these. Recent experimental observations suggest the possible role of phase transitions in both post- and presynaptic processes (5,6).
Protein intrinsic disorder, defined as the lack of a stable three-dimensional structure in a protein or a segment of it, is considered a key factor in mediating weak yet specific protein:protein interactions. The principal working model for intrinsically disordered regions (IDRs) in this respect is folding upon binding, i.e. the full or partial (both in terms of locality and residual flexibility) ordering of a disordered region when forming a complex with a binding partner (7). Besides other roles, such segments can act as binding sites for structural domains, recognition sites for posttranslational modifications such as phosphorylation, and flexible linkers between globular domains contributing to the structural versatility of the complexes formed (8). Disordered segments have been shown to play important roles in scaffold proteins (9) and signaling pathways (7,10), and to be at key positions within interaction networks (8). In particular, the synaptosome has been listed as an intracellular component associated with the presence of intrinsic protein disorder (11). Multivalent proteins containing a number of similar interaction sites along with multiple domains have been suggested to be key for phase separation in protein complexes (12).
In this study we investigated whether intrinsic disorder might play an important role in PSD proteins with respect to the organization, versatility and the ability for dynamic rearrangement of the PSD. Whereas there are domain types, such as the PDZ domain, that have been described to be abundant in postsynaptic proteins, we are not aware of any detailed analysis of the role of disorder in the synaptome. Using the human proteome available in UniProt, we identified IDRs using a consensus of different prediction methods and filtering out oligomeric fibrillar regions (13) to extract segments likely to be intrinsically disordered under cellular conditions.

Results and Discussion
Proteins with the highest number of IDRs and of protein:protein interaction segments identified with ANCHOR (14) were found to be associated with Gene Ontology terms related to synaptic transmission and neural development. Selecting proteins by length or by overall percentage of residues in IDRs yields fewer such associated terms. We note that the most prominent GO categories are related to histone methylation, chromatin remodeling and other nuclear processes, as well as cytoskeletal functions. As a characteristic and relatively concise example, results of the GO-SLIM cellular component analysis are shown in Table 1. To get further insight into the relevance of these observations, we have investigated protein sets derived from the human proteome reference set in UniProt. These comprise four sets defined in SynaptomeDB (15), sets defined based on the Gene Ontology (GO) terms 'chemical synaptic transmission' and 'postsynaptic density', and a small set obtained from the UniProt website with a simple search for 'postsynaptic scaffold'. Reference sets include the full human proteome, the immunome as a set of primarily globular multidomain proteins (16), proteins involved in signal transduction pathways (17), and cytoskeletal proteins (18). The latter two sets were chosen because of their connection with the function and organization of the synaptome. We have also included histone methylases, known to have high IDR content (19), and a wider list of proteins in the nucleus (20), where phase separation has also been observed (21). All calculated descriptors for each protein along with the functional sets are available in Table S1. Basic properties of the sets are summarized in Tables S2 and S3. We calculated various descriptors related to the presence and functional role of IDRs, including the length, percentage of residues in IDRs, number of IDRs and number of binding segments identified with ANCHOR. 
In addition, the number of interacting partner proteins in the ComPPI database (22) and in the BioPlex 2.0 dataset (23) was also considered. To assess the potential of the proteins to be engaged in multivalent interactions, we introduced a descriptor called 'diversity of potential interactions', DPI, defined as: Where Ne and Nd denote the number of linear motifs based on the Eukaryotic Linear Motif (ELM) resource (24) and domain types, whereas ELMe and DOMd correspond to the number of ELM sites of type e and domains of type d, respectively. Only predicted ELM sites located in the identified disordered regions and of classes LIG/DOC were considered. We are aware that ELM prediction and also simple domain counts clearly overestimate the number of actual binding sites, but we use DPI as a comparative rather than an absolute measure related to the upper limit of potential interaction sites in the protein sets examined. Postsynaptic proteins tend to be longer on average than other proteins in the human proteome, except presynaptic ones and histone methylases, although the standard deviation is large in all categories (Table S4). This length difference is also maintained in proteins with high disorder content. The extent of overall disorder and the number of disordered segments are also large in postsynaptic proteins, as well as in nuclear and cytoskeletal proteins and the histone methylase set. Only among histone methylases is the percentage of proteins with at least 5 intrinsically disordered segments larger than in postsynaptic scaffold proteins. PSD scaffold proteins also exhibit a high percentage of residues in ANCHOR-defined segments, and again only histone methylases contain more sequences with a high number of ANCHOR segments (Table S4). The descriptor that distinguishes postsynaptic scaffold proteins and histone methylases from other proteins and sets is the DPI measure. 
This trend is robust as it is present in all proteins and when considering sequences with high disorder content only (Table S4). We note that the differences mainly stem from DPIdisorder, with DPIdomain showing an observable contribution to the DPI of postsynaptic scaffold proteins.
It should be noted that these trends are also observable if we omit proteins assigned to multiple functional sets (Figure S2), or consider a nonredundant part of the human proteome (Figure S4) or only proteins in SwissProt (Figure S5). For the latter two analyses, results are shown for all proteins allowing overlapping functional sets, as otherwise the number of proteins in some of the sets would be extremely low (Table S2).
We have also investigated the binding regions in orthologous proteins. Orthologs of synaptic proteins were identified in 4 species: Western lowland gorilla (Gorilla gorilla gorilla), Sumatran orangutan (Pongo abelii), common chimpanzee (Pan troglodytes) and house mouse (Mus musculus). We found that the predicted binding regions are largely conserved among these species.
Table S5 contains correlations between selected descriptors, focusing on the overall disorder content, number of disordered regions and diversity of potential interactions (DPI). Somewhat surprisingly, the number of disordered regions correlates only moderately, if at all, with the overall disorder percentage. There is, not unexpectedly, a relatively high correlation when all sequences, including those with negligible disorder content, are considered, but analysis of proteins with at least 20 or 40% overall disorder content reveals that this stems from the trivial connection that zero disorder content implies the absence of any disordered regions. In largely disordered proteins, i.e. those with over 40% of residues in disordered regions, there is practically no correlation between these two descriptors. The DPI descriptor is, in most datasets, also uncorrelated with overall disorder content, again with the exception of the trivial cases where proteins with no or very low disorder content are also considered. Only proteins associated with presynaptic vesicles (VESCLE) and histone methylases (HKMT52) show meaningful correlation (above 0.7) between overall disorder content and DPI. In contrast, the number of disordered regions shows remarkable correlation with DPI in most groups. With the exception of the VESCLE and immunome (IMMNME) groups, this correlation is maintained in the proteins with high disorder content. 
These observations suggest that the number of disordered regions, rather than overall disorder content, is key in determining the versatility of interactions in most proteins investigated.
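The effect of zero-disorder proteins on the correlation can be illustrated with a minimal sketch. It assumes a Pearson correlation coefficient (the text does not state which coefficient was used) and uses invented toy values; the field names `disorder_pct` and `n_idrs` and the helper `correlation_above` are hypothetical and introduced only for illustration.

```python
from math import sqrt, nan


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return nan  # undefined when one descriptor is constant
    return sxy / sqrt(sxx * syy)


def correlation_above(proteins, x_key, y_key, min_disorder_pct=0.0):
    """Correlate two descriptors over proteins at or above a disorder threshold."""
    subset = [p for p in proteins if p["disorder_pct"] >= min_disorder_pct]
    if len(subset) < 2:
        return nan
    return pearson([p[x_key] for p in subset], [p[y_key] for p in subset])


# Toy data (invented): three ordered proteins plus four highly disordered ones.
toy = [
    {"disorder_pct": 0, "n_idrs": 0},
    {"disorder_pct": 0, "n_idrs": 0},
    {"disorder_pct": 0, "n_idrs": 0},
    {"disorder_pct": 50, "n_idrs": 3},
    {"disorder_pct": 60, "n_idrs": 1},
    {"disorder_pct": 70, "n_idrs": 1},
    {"disorder_pct": 80, "n_idrs": 3},
]

r_all = correlation_above(toy, "disorder_pct", "n_idrs")
r_high = correlation_above(toy, "disorder_pct", "n_idrs", min_disorder_pct=40)
print(f"all proteins: {r_all:.2f}, >=40% disorder: {r_high:.2f}")
```

In this toy set the correlation over all proteins is about 0.77, whereas it vanishes once only proteins with at least 40% disorder are kept, mirroring the trivial contribution of fully ordered proteins described above.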

Conclusions
We propose that the high number of IDRs and the associated large DPI is an important aspect of providing the versatility of PSD complexes both across cell types and in different cellular states. We denote this feature multiple polyvalency, which is also a likely key feature of proteins and protein complexes capable of phase separation, as observed for PSD proteins, synapsin and a number of complexes involved in transcription.

Data sets
The UniProt database was used as the sole source of all protein sequences investigated in the study. In the case of specific protein sets, the UniProt IDs of the individual proteins were extracted and used thereafter. As the main data source, the human proteome set of UniProt  (19). All sets are assigned a 6-character identifier (Table S1). It should be noted that many proteins belong to more than one set (Table S2). For some analyses, proteins assigned to multiple sets were excluded; this is always indicated in the text.
Lists of interacting protein pairs were taken from either of two sources: the ComPPI database and the comprehensive BioPlex 2.0 experiment. Orthologs were collected from 2 different databases, eggNOG (27) and OMA (28). Comparing the 2 databases, eggNOG yields a slightly higher number of related proteins. Since there are no contradictions between the 2 databases, the union of the orthologs identified in the two databases was used. OMA uses a special identifier, whereas eggNOG uses Ensembl IDs for the sequences. Both were converted into UniProt IDs.
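The merging of the two ortholog sources can be sketched as follows. This is only an illustration under stated assumptions: the input dictionaries, the identifier-mapping tables and every identifier shown are hypothetical placeholders, not actual eggNOG, OMA or UniProt records, and identifiers without a mapping are simply skipped.

```python
def merge_orthologs(eggnog, oma, ensembl_to_uniprot, oma_to_uniprot):
    """Union of orthologs from two sources after conversion to UniProt IDs.

    eggnog / oma map a human protein ID to a set of ortholog IDs in the
    native identifier scheme of the respective database (hypothetical input).
    """
    merged = {}
    for source, id_map in ((eggnog, ensembl_to_uniprot), (oma, oma_to_uniprot)):
        for human_id, native_ids in source.items():
            # Convert native IDs; IDs lacking a mapping are dropped here.
            mapped = {id_map[i] for i in native_ids if i in id_map}
            merged.setdefault(human_id, set()).update(mapped)
    return merged


# Placeholder identifiers, invented for illustration only.
eggnog = {"HUMAN_A": {"ENS_1", "ENS_2"}, "HUMAN_B": {"ENS_3"}}
oma = {"HUMAN_A": {"OMA_1"}, "HUMAN_C": {"OMA_2"}}
ensembl_to_uniprot = {"ENS_1": "UP_X", "ENS_2": "UP_Y", "ENS_3": "UP_Z"}
oma_to_uniprot = {"OMA_1": "UP_X", "OMA_2": "UP_W"}

orthologs = merge_orthologs(eggnog, oma, ensembl_to_uniprot, oma_to_uniprot)
```

Because both sources are converted to the same identifier space before the union is taken, an ortholog reported by both databases is counted once.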

Disorder prediction & analysis
Disorder prediction was done in a way that aims to minimize false positive hits. First, the consensus of two disorder prediction methods, IUPred (29) and VSL2B (30), was determined. In the second step, all residues predicted to be in oligomeric fibrillar motifs were eliminated from the set of disordered residues (31). Oligomeric fibrillar motifs were determined using a permissive prediction, namely, residues predicted to form coiled coils by either COILS (32) or Paircoil2 (33), single alpha helices as identified using FT_CHARGE (34), or collagen triple helices as identified by HMMER (35) with the Pfam (36) HMM for collagen (ID: PF01391.13). The remaining set of residues is considered to be a good representation of segments being disordered under cellular conditions. Binding regions within disordered segments were identified with ANCHOR (14). It should be noted that the ANCHOR-predicted set of regions is much more restricted than the set of IDRs identified with the methodology described above, and cannot be considered a subset of it.
Thus, the two sets rather complement each other in the sense that they arise from different concepts.
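The two-step filtering described above can be sketched on per-residue boolean masks. The sketch assumes that each predictor's output has already been converted to one True/False value per residue (the score cutoffs used for that conversion are not part of this example), and the `min_len` parameter is a hypothetical addition, since no minimum segment length is stated in the text.

```python
def consensus_disorder(iupred, vsl2b, fibrillar_masks):
    """Residues disordered in both predictors and in no fibrillar motif.

    fibrillar_masks is a list of per-residue masks (e.g. coiled coil,
    single alpha helix, collagen); a residue flagged by any of them is
    removed, which corresponds to the permissive fibrillar prediction.
    """
    n = len(iupred)
    fibrillar = [any(mask[i] for mask in fibrillar_masks) for i in range(n)]
    return [bool(iupred[i] and vsl2b[i] and not fibrillar[i]) for i in range(n)]


def segments(mask, min_len=1):
    """Return half-open (start, end) intervals of consecutive True residues."""
    out, start = [], None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                out.append((start, i))
            start = None
    if start is not None and len(mask) - start >= min_len:
        out.append((start, len(mask)))
    return out


# Toy masks for a 10-residue sequence (invented values).
iupred = [True, True, True, True, False, False, True, True, True, True]
vsl2b = [True, True, True, False, False, True, True, True, True, True]
coiled_coil = [False] * 8 + [True, True]

mask = consensus_disorder(iupred, vsl2b, [coiled_coil])
idrs = segments(mask)  # two segments: residues 0-2 and 6-7 (0-based)
```

The number of IDRs per protein is then `len(idrs)`, and the percentage of residues in IDRs is `100 * sum(mask) / len(mask)`.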
The number of linear motifs participating in actual binding events was estimated using the regular expressions listed in the ELM database. We are aware that this treatment of ELM data results in massive overprediction of binding motifs, but we use the obtained results only in a comparative context. Besides using all ELM classes, a subset consisting of the LIG and DOC motif classes was also used. We calculated the overall number of motifs and the number of different motif types per protein, as well as a diversity measure termed 'diversity of potential interactions', DPI, defined as: Where Ne and Nd denote the number of ELM classes and domain types, whereas ELMe and DOMd correspond to the number of ELM sites of type e and domains of type d, respectively.
Only predicted ELM sites located in the identified disordered regions and of classes LIG/DOC were considered. We are aware that ELM prediction and also simple domain counts clearly overestimate the number of actual binding sites, but we use DPI as a comparative rather than an absolute measure related to the upper limit of potential interaction sites in the protein sets examined.
Figure S1. Top: Overview of the data sets used in the study. Bottom: schematic workflow for identifying IDRs excluding those in predicted fibrillar regions.
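Counting regular-expression motif hits restricted to disordered regions and to the LIG/DOC classes can be sketched as below. The motif names and patterns are invented placeholders, not entries of the ELM resource, and two choices are assumptions of this sketch rather than statements from the text: overlapping matches are counted separately, and a hit is kept only if it lies entirely within a disordered segment.

```python
import re
from collections import Counter

# Hypothetical motif classes and patterns, for illustration only.
MOTIFS = {
    "LIG_EXAMPLE_1": r"P..P",
    "DOC_EXAMPLE_1": r"[RK].L",
    "MOD_EXAMPLE_1": r"S.E",
}


def count_motifs_in_idrs(sequence, idrs, motifs, prefixes=("LIG", "DOC")):
    """Count motif hits per class that fall entirely within an IDR.

    idrs is a list of half-open (start, end) intervals, 0-based.
    """
    counts = Counter()
    for name, pattern in motifs.items():
        if not name.startswith(prefixes):
            continue  # keep only the selected motif classes
        # A capturing group inside a lookahead reports overlapping matches.
        regex = re.compile(f"(?=({pattern}))")
        for m in regex.finditer(sequence):
            start, end = m.start(1), m.end(1)
            if any(s <= start and end <= e for s, e in idrs):
                counts[name] += 1
    return counts


sequence = "MPAAPKTLGGSPEPQQPRSLA"  # invented toy sequence
idrs = [(0, 6), (12, 21)]          # invented disordered segments

counts = count_motifs_in_idrs(sequence, idrs, MOTIFS)
total_sites = sum(counts.values())  # overall number of motif sites
n_types = len(counts)               # number of different motif types
```

Per-class counts of this kind correspond to the quantities ELMe referred to in the DPI definition; the DPI formula itself is not reproduced here.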

Gene ontology analysis
The human proteome dataset was filtered according to the UniProt IDs listed in the human reference protein set of the PANTHER database (37), obtained from the PANTHERDB web site (ftp://ftp.pantherdb.org/sequence_classifications/current_release/PANTHER_Sequence_Classification_files/PTHR13.1_human, last modified 02/01/2018), resulting in 19781 proteins. In this case, to avoid any bias originating from multiple consideration of UniProt entries that were merged after the release of the PANTHER list, only the primary accessions were considered.
The average and standard deviation of specific descriptors (length, percentage of residues in IDRs, number of IDRs, number of ANCHOR sites) of the proteins in this list were calculated, and proteins with a descriptor value higher than the average plus two standard deviations were analyzed with PANTHER accessed through the AmiGO2 website (amigo.geneontology.org/amigo) (38).
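The selection of proteins with outstanding descriptor values can be sketched as follows. The sketch assumes the sample standard deviation (the text does not specify sample versus population), and the protein identifiers and values are invented for illustration.

```python
from statistics import mean, stdev


def high_value_proteins(values, n_sd=2.0):
    """IDs whose descriptor exceeds mean + n_sd * standard deviation.

    values maps a protein ID to one descriptor (e.g. number of IDRs).
    """
    mu = mean(values.values())
    sd = stdev(values.values())  # sample SD; an assumption of this sketch
    cutoff = mu + n_sd * sd
    return sorted(pid for pid, v in values.items() if v > cutoff)


# Invented toy descriptor values (e.g. number of IDRs per protein).
n_idrs = {
    "PROT_01": 1, "PROT_02": 2, "PROT_03": 2, "PROT_04": 3, "PROT_05": 2,
    "PROT_06": 1, "PROT_07": 3, "PROT_08": 2, "PROT_09": 2, "PROT_10": 20,
}

selected = high_value_proteins(n_idrs)
```

In this toy set only the single protein with 20 IDRs exceeds the cutoff; the resulting ID list is what would be submitted to the enrichment analysis.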