Occurrence of Ordered and Disordered Structural Elements in Postsynaptic Proteins Supports Optimization for Interaction Diversity

The human postsynaptic density is an elaborate network comprising thousands of proteins, playing a vital role in the molecular events of learning and the formation of memory. Despite our growing knowledge of specific proteins and their interactions, atomic-level details of their full three-dimensional structure and their rearrangements are mostly elusive. Advancements in structural bioinformatics enabled us to depict the characteristic features of proteins involved in different processes aiding neurotransmission. We show that postsynaptic protein-protein interactions are mediated through the delicate balance of intrinsically disordered regions and folded domains, and this duality is also imprinted in the amino acid sequence. We introduce Diversity of Potential Interactions (DPI), a structure and regulation based descriptor to assess the diversity of interactions. Our approach reveals that the postsynaptic proteome has its own characteristic features and these properties reliably discriminate them from other proteins of the human proteome. Our results suggest that postsynaptic proteins are especially susceptible to forming diverse interactions with each other, which might be key in the reorganization of the postsynaptic density (PSD) in molecular processes related to learning and memory.


Introduction
Synaptic signal transduction is an elaborate process that not only provides the means of transmitting the excited state of one neuron to another but also plays an important role in basic phenomena underlying neural development, learning and memory [1,2]. The basic cellular phenomenon of memory, long-term potentiation (LTP), results in the strengthening of synaptic connections. The postsynaptic density (PSD) is a key structure in this process, as an essential part of excitatory chemical synapses. It is composed of a dense network of proteins and provides an intricate link between the intracellular parts of membrane receptors and adhesion molecules and the cytoskeleton [3]. PSD composition and organization changes during development and the individual history of the particular neuron [4,5], leading to morphological differences between PSDs in different brain regions [6]. Recent results point to a reorganization of the PSD during sleep by the exchange of Homer protein isoforms and contributing to synaptic scaling [7,8]. The emerging view is that the PSD is continuously remodeled and has a characteristic dynamics that is manifested not only by the addition and elimination of components over time but also by dynamic restructuring while retaining its composition [9][10][11]. From a structural point of view, interactions can be formed via a diverse range of structural elements. Protein intrinsic disorder is defined by the lack of ability of a particular protein/segment to adopt a stable three-dimensional structure. Disordered segments can play many different roles, including the mediation of protein-protein interactions (PPIs), during which they may get fully or partially folded (folding upon binding [12]) or retain a considerable degree of disorder (fuzzy complexes [13]). Disordered segments have been shown to play important roles in the postsynaptic scaffold (PSC) proteins [14], signaling pathways [15][16][17] and be at key positions within interaction networks [18]. The organization of synaptic proteins also exhibits a high amount of ordered domain interactions, resulting in macromolecular assemblies occupying the PSD. The highly organized assembly of PSD strongly suggests the presence of several intermolecular interacting sites with an elaborate and thoroughly regulated distribution of occupied and unavailable or available partner binding sites with a high level of redundancy. The PSD can most likely be imagined as a supramolecular association capable of integrating and transmitting signals via reorganization. The underlying mechanisms likely include competitive binding events, allostery and cooperativity tightly regulated through post-translational modifications (PTMs) [5,19], short linear motifs [20] and alternative splicing [21].
Recently, the role of liquid-liquid phase separation (LLPS) in the organization of the PSD has been suggested [22,23]. It has been shown that appropriate combinations of selected PSD proteins result in the formation of droplets [24,25]. The presence or absence of specific proteins and/or binding sites was shown to substantially influence the phase separation properties of PSD proteins [25]. To date, LLPS is most extensively studied for RNA-binding nuclear proteins, where RNA is an important component of the supramolecular associations exhibiting phase separation. It should be noted that RNA molecules are abundant in dendrites as a pool for in situ translation [26]. A key feature of proteins capable of LLPS is multivalency, the presence of multiple partner binding sites that can be folded domains or linear motifs in intrinsically disordered regions (IDRs) [27]. Bioinformatics analysis of phase separation is rather laborious, as the first databases containing proteins or regions driving LLPS are just being developed [28]; protein-protein interactions can be reliable investigated using various tools and databases, which may open prospects to the hallmark of LLPS [29,30].
Most current descriptions of human postsynaptic proteome characterize PSD proteins using experimental procedures [31] or literature-based collection extended with gene annotation services [32]. To our knowledge, there is no comprehensive computational analysis focusing on the structural organization of PSD proteins. In this paper, we use a wide range of bioinformatics methods to describe the structural characteristics of PSD proteins, focusing on features that may directly contribute to the synaptic plasticity through the formation of (PPIs).

Datasets
The human proteome was downloaded from UniProt [33] (2018_June release). For synaptic proteins, the SynaptomeDB [32] database and its classification of protein localization were used, defining four sets: postsynaptic, presynaptic, presynaptic active zone and vesicle-associated proteins. An additional PSD-related set was defined based on a simple search on the UniProt website with the term 'postsynaptic scaffold human'. Proteins of the immunome were extracted from the Immunome Knowledge Base [34]. The list of nuclear proteins was taken from the supplementary material of Frege et al. [35], whereas the set of histone methylases were extracted from the accompanying data of Lazar et al. [36]. Lists of interacting protein pairs were taken from the BioPlex 2.0 database [37]. obtain various amino acid scales [38]. To reduce the number of features, the ward.d2 function of R was used for K-means clustering [39], then based on the elbow method 5, clusters were selected to represent the features (Supplementary Figure S1). The following amino acid scales were selected randomly,  each from one of the clusters: NAKH900104, KHAG800101, JUNJ780101, HUTJ700103, FAUJ880103  (for explanation see Supplementary Table S1).

Annotation and Prediction of Protein Properties
Disorder prediction was made in a way that aims to minimize false positive hits arising from the inclusion of oligomeric fibrillar motifs [40]. First, the consensus of two disorder prediction methods, IUPred [41] and VSL2B [42] were determined. In the second step, all residues predicted to be in oligomeric fibrillar motifs were eliminated from the set of disordered residues [40]. Oligomeric fibrillar motifs were determined using a permissive prediction; namely, residues to form coiled coils as predicted either by COILS [43] or Paircoil2 [44], single α-helices as identified using FT_CHARGE [45] or collagen triple helix as obtained by HMMER [46] with the Pfam HMM for collagen (ID: PF01391.13) segments in Pfam [47]. The remaining set of residues is considered to be a good representation of segments being disordered under cellular conditions.
Binding regions within disordered segments were predicted with Anchor [41], low complexity regions with SEG [48], transmembrane (TM) regions with CCTOP [49,50] and signal peptides were predicted using SignalP [51]. Linear motifs participating in binding events were retrieved from the Eukaryotic Linear Motif (ELM) database [52], considering only "LIG/DOC" classes. The ELM prediction was also utilized, considering only those hits that passed the various filters as described by Gouw et al. [52]. Protein domains were taken from UniProt (2018_June release), PTMs were downloaded from PhosphoSitePlus [53] and were also predicted by NetPhos [54]. Venn diagrams were created using Venny [55].
All calculated values for each protein are available in Supplementary Table S1.

Statistical Analysis
We calculated the mean and standard deviation of these features on all datasets listed in Section 2.1. Furthermore, to compare the statistics of the proteome to Synaptome, PSD and PSC we also used exclusive sets. Due to the unbalanced distribution of proteins in various datasets, we used bootstrapping and randomly sampled all datasets 1000 times. The size of these samples was derived from the smallest dataset: 80% of protein belonging to PSC was used as reference. The calculations were used to show the enrichment of features in different datasets compared to the proteome, by applying logarithmic scale on their proportion. Means and variances for different groups are available in Supplementary Table S2. To further confirm the significance of the differences in the distributions of features, we also performed Kolmogorov-Smirnov tests for all sets versus the full human proteome (i.e., we compared the distribution of features between the proteome and the different subsets). This data is available in Supplementary Table S3. We also counted the number of PSD and non-PSD proteins with given features above and below their respective mean values (calculated for the proteome) and performed X-square tests (with Yates correction) on the contingency tables resulted from this procedure (Supplementary Table S4).
To investigate the co-occurrence of ordered and disordered structural elements in proteins of the PSD, we counted proteins with different individual structural entities and their combinations. Significance of this data is confirmed by randomly selecting 80% of data 1000 times (as described above) and calculating the mean and standard deviations. Whenever the mean ± (1, 2 and 3 fold) standard deviations did not overlap, we accepted the result to be significant with p < 0.32, p < 0.05 and p < 0.01, respectively (Supplementary Table S5).
Similar analysis was used to analyze the possible connection between PPIs and PTMs (Supplementary Table S6).
For Diversity of Potential Interactions (DPI, see Section 3.5) and PPI values of the different subsets Spearman's rank correlation was calculated.

Machine Learning
As a first step, we preprocessed the data used for machine learning by removing homologous sequences and assigning labels. For redundancy filtering CD-HIT [56] was used in an incremental manner, filtering identical proteins to 90, 70, 50 and finally to 40% identity. The remaining sequences were used to train the predictor. Labels were assigned based on the SynaptomeDB annotation.
For machine learning, a feed forward neural network was developed, using one hidden layer with 40 neurons and stochastic gradient descent. Due to the non-proportional data for training and testing, bootstrap aggregating (BAGGING) was used. In each step, 10 down-sampled sets were created and used to calibrate the Artificial Neural Networks (ANNs). For the final prediction, the results of individual ANNs were aggregated and weighted based on their reliability (defined as the output neuron probability). Benchmarking was done using 10-fold cross-validation and independent datasets.
Two different predictors were built: the first uses all annotations presented in Supplementary  Table S1. The second predictor only uses features that can be derived from the amino acid sequence and do not depend on the annotation of different databases (shown with a grey background in Supplementary Table S1). Training and testing data are available in Supplementary Table S7.

Datasets
In this paper, we investigated three nested subsets of proteins to the human proteome (21,766 proteins): all proteins from the synaptome (1891 proteins), proteins localizing into the postsynaptic density (1761 proteins) and postsynaptic scaffold proteins (51 proteins). Three additional control datasets were defined: histone methylases (52 proteins), proteins from the nucleus (180 proteins) and the immunome (834 proteins).
According to our definition (derived from SynaptomeDB), there is considerable overlap between the synaptic proteome and proteins of the PSD, and their characteristics are on par in every case. Therefore, conclusions drawn for PSD are valid for the synaptome too, even if it is not explicitly declared.

Sequences Feature a Mix of Disorder and Order Promoting Properties
General sequence properties often help to gain insight about structural features of proteins. As a very first step we compared the length distribution of proteins in the PSD to other proteins of the proteome: on average proteins in the PSD are longer (the average length of proteins is 524 and 705 residue in the proteome and in the PSD, respectively) ( Figure 1, Supplementary Figure S2). We also calculated the average amino acid content of proteins. On the one hand, these values seem similar and their variances are high; therefore, these properties cannot be used alone to reliably distinguish proteins in different localizations. On the other hand, these differences are mostly significant and according to the distribution of structural domains (see Section 3.3) it is clear that in some level, the function and structure of proteins are imprinted in their amino acid sequence: PSD proteins have some preference to include more charged residues (~28% and~25%, in the PSD and in the proteome, respectively) and to avoid cysteines (~2% and~3%, in the PSD and in the proteome, respectively). These properties alone may promote disorder content; however, on some level, the lower proline content (~5% and~6%, in the PSD and in the proteome, respectively) suggest globular domains may also emerge. Low complexity regions (LCRs) often, but not always, coincide with intrinsically disordered segments: interestingly proteins in PSD are not enriched in such regions (see Kolmogorov-Smirnov test: Supplementary Table S3; X-square test: Supplementary Table S4). These results may hint a balanced presence of intrinsically disordered regions (IDRs) and ordered globular domains in the PSD, however to unambiguously prove our hypothesis specialized predictions and analysis were performed (see Section 3.3). PSC proteins are a subset of PSD proteins with somewhat distinct features. We observed their average length is even higher compared to proteins from PSD, and almost the double of proteins of the proteome (on average 970 residue length). In contrast to the PSD, they contain a lot of IDRs. Other characteristics of PSC proteins are on par with PSD proteins.

PSD Proteins Exhibit a Diverse Range of Structured Elements
As sequence data indicate differences compared to the proteome, various prediction methods were utilized to reveal different types of structural elements. We classified segments as disordered regions and ordered globular domains; then, we extended this classification to groups often falling outside the classical "globular-disordered" partition: coiled-coils, a structural element already described in various PSD proteins and transmembrane segments assumed to play important role in transmitting information outside the cell. On the one hand, PSD proteins operate with a similar amount of disorder content as other proteins from the proteome (Figure 1). On the other hand, PSD proteins contain a large number of coiled coils and structured domains. We observed not only that the number of structured domains is higher, but flexible linker regions are also generally shorter in proteins of the PSD, allowing denser placements of domains along the polypeptide chain (Supplementary Figure S3). Note that our stringent structure prediction pipeline (see Methods) has an emphasis on discriminating disordered and coiled-coil regions. Therefore, the commonly occurring cross predictions between them are expected to only moderately bias these results. Another distinctive feature is the distribution of transmembrane proteins: proteins composed of seven transmembrane helices are greatly missing from the PSD (Supplementary Figure S4). PSC proteins are a subset of PSD proteins with somewhat distinct features. We observed their average length is even higher compared to proteins from PSD, and almost the double of proteins of the proteome (on average 970 residue length). In contrast to the PSD, they contain a lot of IDRs. Other characteristics of PSC proteins are on par with PSD proteins.

PSD Proteins Exhibit a Diverse Range of Structured Elements
As sequence data indicate differences compared to the proteome, various prediction methods were utilized to reveal different types of structural elements. We classified segments as disordered regions and ordered globular domains; then, we extended this classification to groups often falling outside the classical "globular-disordered" partition: coiled-coils, a structural element already described in various PSD proteins and transmembrane segments assumed to play important role in transmitting information outside the cell. On the one hand, PSD proteins operate with a similar amount of disorder content as other proteins from the proteome (Figure 1). On the other hand, PSD proteins contain a large number of coiled coils and structured domains. We observed not only that the number of structured domains is higher, but flexible linker regions are also generally shorter in proteins of the PSD, allowing denser placements of domains along the polypeptide chain (Supplementary Figure S3). Note that our stringent structure prediction pipeline (see Methods) has an emphasis on discriminating disordered and coiled-coil regions. Therefore, the commonly occurring cross predictions between them are expected to only moderately bias these results. Another distinctive feature is the distribution of transmembrane proteins: proteins composed of seven transmembrane helices are greatly missing from the PSD (Supplementary Figure S4). PSC proteins are on par with PSD proteins according to most of these features; however, in a more extreme way, utilizing even more coiled coils and domains. As expected from sequence characteristics, they utilize a larger amount of IDRs ( Figure 1).
Another notable trend is the mutual presence of ordered and disordered segments in PSD proteins (Table 1). While the PSD exclusive proteome outnumbers PSD proteins in single-domain proteins (i.e., containing a single coiled-coil, intrinsically disordered segment, a single domain or embedded in the membrane) or in proteins where multiple IDRs or other elements appear independently, in the PSD these elements seem to emerge in a more coordinated fashion, likely to promote more diverse possibilities for PPI formation (also see significance test in Supplementary Table S5).

The Overlap between Protein-Protein Interactions and Post-Translational Modifications Hint a Tightly Regulated Protein Network in PSD
PSD proteins operate with a high amount of PPIs ( Figure 1, Supplementary Figure S5), compared to the human proteome. Interestingly, proteins from the PSD are not particularly enriched in IDRs, a commonly used element to mediate PPIs. This is also reflected in the relatively low number of predicted disordered binding regions. The extremely high amount of phosphorylation and ubiquitination sites suggest that post-translational regulation may play an essential role in PSD proteins.
As the increased number of IDRs and binding regions suggest, PSC proteins may use different mechanisms to promote PPIs. In the case of phosphorylation and ubiquitination, the trends are similar to PSD proteins. However, methylation and acetylation values are more close to the proteome used as a reference.
Since we do not have information about the specific binding regions and the interacting partners, we tried to characterize the nature of these interactions by observing their co-occurrence with proteins containing IDRs and PTMs (i.e., they emerge in the same protein, but they do not necessarily overlap). We assessed the mutual presence of PPIs, PTMs and protein disorder in the proteins of the PSD. Protein disorder for PPI formation may be somewhat less frequently employed mechanism in the PSD compared to the human proteome (i.e.,~49% and~53% of interacting proteins do not contain any IDR at all in the proteome and the PSD, respectively). However, PTMs may regulate the emerging PPIs more tightly (i.e., only~98% compared to~92% of proteins participating in interactions have phosphorylation or ubiquitination site in the PSD and the proteome, respectively) ( Figure 2). All these differences between the proteome and the PSD are significant according to mean and standard variation values (Supplementary Table S6).
the PSD compared to the human proteome (i.e., ~49% and ~53% of interacting proteins do not contain any IDR at all in the proteome and the PSD, respectively). However, PTMs may regulate the emerging PPIs more tightly (i.e., only ~98% compared to ~92% of proteins participating in interactions have phosphorylation or ubiquitination site in the PSD and the proteome, respectively) ( Figure 2). All these differences between the proteome and the PSD are significant according to mean and standard variation values (Supplementary Table S6). Alternative splicing can also contribute to regulation, by inclusion or exclusion of binding regions and rewiring PPI networks. PSD and PSC proteins both seem to utilize this mechanism with an increased amount of isoforms compared to the human proteome ( Figure 1).

Potential of the Proteins to Be Engaged in Multivalent Interactions
Proteins in the PSD are capable of establishing a high amount of interactions, which heavily depends on the number of basic building blocks: coiled-coils can multimerize to build protein complexes, while IDRs, transmembrane proteins and globular domains also can form inter-or intramolecular interactions. To estimate the bias caused by overlapping definitions, we calculated the overlap between these elements: we found that less than 1% of them overlap on residue level. We defined Diversity of Potential Interaction (DPI) as a composite descriptor and introduced the following elements in the equation: where DOM is the number of domains, CC10 is the number of coiled-coil regions longer than 10 residues, IDR10 is the number of IDRs longer than ten residues, and TMH is the number of transmembrane segments, Ph and Ub are the number of phosphorylations and ubiquitination sites along the protein sequence and ELM is the number of short linear motifs aiding binding processes (LIG/DOC classes of the ELM database). These features and the calculated DPI have somewhat better distinctive properties than residue distribution (Supplementary Figures S6-S13) and applying Spearman's rank correlation indicates a significant relationship between PPIs and DPI (S. correlation = 0.35; p-value < 0.01). Using DPI as measure PSD proteins are on par with PSC proteins, as expected from the number of interactions (see Figures 1 and 3), while the proteome has rather lower values in both cases. Alternative splicing can also contribute to regulation, by inclusion or exclusion of binding regions and rewiring PPI networks. PSD and PSC proteins both seem to utilize this mechanism with an increased amount of isoforms compared to the human proteome ( Figure 1).

Potential of the Proteins to Be Engaged in Multivalent Interactions
Proteins in the PSD are capable of establishing a high amount of interactions, which heavily depends on the number of basic building blocks: coiled-coils can multimerize to build protein complexes, while IDRs, transmembrane proteins and globular domains also can form inter-or intramolecular interactions. To estimate the bias caused by overlapping definitions, we calculated the overlap between these elements: we found that less than 1% of them overlap on residue level. We defined Diversity of Potential Interaction (DPI) as a composite descriptor and introduced the following elements in the equation: DPI = ln(DOM) + ln(CC10) + ln(IDR10) + ln(TMH) + ln(Ph + Ub) + ln(ELM) where DOM is the number of domains, CC10 is the number of coiled-coil regions longer than 10 residues, IDR10 is the number of IDRs longer than ten residues, and TMH is the number of transmembrane segments, Ph and Ub are the number of phosphorylations and ubiquitination sites along the protein sequence and ELM is the number of short linear motifs aiding binding processes (LIG/DOC classes of the ELM database). These features and the calculated DPI have somewhat better distinctive properties than residue distribution (Supplementary Figures S6-S13) and applying Spearman's rank correlation indicates a significant relationship between PPIs and DPI (S. correlation = 0.35; p-value < 0.01). Using DPI as measure PSD proteins are on par with PSC proteins, as expected from the number of interactions (see Figures 1 and 3), while the proteome has rather lower values in both cases.
To have a better picture of how these descriptors behave, we also included three control sets with distinct features: immunome as a set of primarily globular multidomain proteins, namely histone methylases, known to have high IDR content and a more comprehensive list of proteins from the nucleus, shown to operate with a high amount of PPIs. We calculated the Spearman's rank correlation between the descriptors and the average number of interactions in the selected protein sets. DPI fairly correctly estimates the potential to the formation of complexes (correlation: 0.71; p-value < 0.1).
with distinct features: immunome as a set of primarily globular multidomain proteins, namely histone methylases, known to have high IDR content and a more comprehensive list of proteins from the nucleus, shown to operate with a high amount of PPIs. We calculated the Spearman's rank correlation between the descriptors and the average number of interactions in the selected protein sets. DPI fairly correctly estimates the potential to the formation of complexes (correlation: 0.71; pvalue < 0.1).

Sequential and Structural Features Discriminate PSD Proteins from Other Proteins
Although some of the above-discussed trends indicate that PSD proteins have distinctive properties, to demonstrate the prediction power of these features we established an artificial neural network (ANN). The input of the ANN consists of the calculated features discussed above. The output is a binary classification defining the localization of the protein (i.e., localizing in the PSD/not). The distribution of the positive and negative labels was not proportional; therefore we used bootstrap aggregation on the data. Briefly, this means that random samples of the data are used to train several predictors, and the final result is provided by aggregating the various outputs. The approach also helps to deal with the relatively noisy data, as in some cases it is expected that the same proteins localize inside and outside the PSD too. The algorithm achieved moderately high accuracy by incorporating all of the features presented in previous chapters (Area Under Curve (AUC): 0.84) ( Figure 4 and Table 2).
Some of the characteristics used as input are database dependent, reflecting a bias in our knowledge; therefore, it is hard to assess how the ANN works on proteins with more/less annotation.

Sequential and Structural Features Discriminate PSD Proteins from Other Proteins
Although some of the above-discussed trends indicate that PSD proteins have distinctive properties, to demonstrate the prediction power of these features we established an artificial neural network (ANN). The input of the ANN consists of the calculated features discussed above. The output is a binary classification defining the localization of the protein (i.e., localizing in the PSD/not). The distribution of the positive and negative labels was not proportional; therefore we used bootstrap aggregation on the data. Briefly, this means that random samples of the data are used to train several predictors, and the final result is provided by aggregating the various outputs. The approach also helps to deal with the relatively noisy data, as in some cases it is expected that the same proteins localize inside and outside the PSD too. The algorithm achieved moderately high accuracy by incorporating all of the features presented in previous chapters (Area Under Curve (AUC): 0.84) (Figure 4 and Table 2).
Some of the characteristics used as input are database dependent, reflecting a bias in our knowledge; therefore, it is hard to assess how the ANN works on proteins with more/less annotation. To overcome this problem, a second predictor was built, in which the feature space was reduced to contain only those characteristics that can be estimated based on the amino acid sequence of the proteins (intrinsic features). In some cases, annotations were replaced by prediction (e.g., phosphorylation). Although the performance of the prediction decreased, these features alone still reliably discriminate PSD proteins from others (AUC: 0.76). Besides AUC, we also calculated the Matthew Correlation Coefficient and Balanced Accuracy. We observed similar values during the cross-validation and on the independent dataset. The results suggest the selected structural features are descriptive for PSD proteins and the prediction method is quite robust regarding both the features and sample sets.
contain only those characteristics that can be estimated based on the amino acid sequence of the proteins (intrinsic features). In some cases, annotations were replaced by prediction (e.g., phosphorylation). Although the performance of the prediction decreased, these features alone still reliably discriminate PSD proteins from others (AUC: 0.76). Besides AUC, we also calculated the Matthew Correlation Coefficient and Balanced Accuracy. We observed similar values during the cross-validation and on the independent dataset. The results suggest the selected structural features are descriptive for PSD proteins and the prediction method is quite robust regarding both the features and sample sets.

Discussion
Synaptic plasticity is facilitated by proteins of the PSD, most plausibly by using their functional repertoire provided by the ability to reorganize PPIs dynamically. Our results indicate that PSD and especially PSC proteins have an increased potential to form diverse interactions and their features can be used to discriminate PSD proteins from the rest of the proteome.
General sequence properties are often used to characterize particular protein subsets, as they open prospects to structural features. Grouping amino acids based on physicochemical properties

Discussion
Synaptic plasticity is facilitated by proteins of the PSD, most plausibly by using their functional repertoire provided by the ability to reorganize PPIs dynamically. Our results indicate that PSD and especially PSC proteins have an increased potential to form diverse interactions and their features can be used to discriminate PSD proteins from the rest of the proteome.
General sequence properties are often used to characterize particular protein subsets, as they open prospects to structural features. Grouping amino acids based on physicochemical properties highlights exciting trends. The increased number of charged residues would hint a higher extent of intrinsic disorder; however, the low amount of prolines seems to counterbalance this effect. PSC proteins have similar characteristics; however, with higher proline content that may promote intrinsic disorder.
In a recent study, protein interactions were analyzed based on the structural state of participating partners, revealing major differences between "classical" protein disorder, and complexes formed by various partners. Three basic types of complexes were distinguished: autonomous folding and independent binding (i.e., the binding of two or more ordered proteins), coupled folding and binding (where an ordered protein stabilize an IDP partner) and mutual synergistic folding (interactions formed exclusively by disordered proteins) [57]. According to the sequence analysis, the amino acid content of the constitutive partners in the different groups have their own characteristic, and they also differ from those described in "classical" disordered protein papers. Besides sequence analysis, differences were also shown in the bound structure: distinct groups exhibit different secondary structures, unique residue/atomic level interaction features and energy properties. They also have different biological roles in different subcellular localizations. The overall amino acid content of PSC proteins shows high correlation with the group composed of proteins going through coupled binding and folding. In contrast, proteins from PSD are more close to the "mutual synergistic folding" (MSF) class (Supplementary Figure S14). We also noted that the predicted IDR content of PSD proteins lags behind that of the proteome (Figure 1). These observations raise the question whether PSD proteins utilize a different flavor of intrinsic disorder to establish interactions. A possible scenario might be that such flexible regions are overlooked by disorder prediction methods, as they lacked MSF protein sets for training. We note that coiled coils can be regarded as a subclass of MSF complexes; however, our pipeline explicitly detects them and they do not show enrichment compared to PSC proteins ( Figure 1). Thus, it is plausible that MSF mechanisms other than the coiled-coil formation also contribute to complex formation in the PSD.
The structural organization of the PSD utilizes a balanced distribution of different types of elements promoting PPIs. Coiled-coils, domain-domain interactions, intrinsically disordered regions and transmembrane proteins all contribute to the formation of protein complexes in the PSD. Notably, in PSD proteins, IDRs are often paired with other structural elements, in contrast to the full proteome, where this association is much less frequent. Although the common presence of TM segments and IDRs shows an exception (also see [58]), other combinations between IDRs and ordered segments are significantly higher. By exhibiting a diverse range of structural segments as binding regions PSD proteins likely involved in a wide range of different types of interactions to maintain the PPI network of the PSD.
Besides statistics presented in the results, detailed structure-function studies provide experimental evidence on how PPIs are formed and maintained in the PSD. Below, we discuss different aspects of PSD organization by using specific well-characterized proteins as demonstrative examples.
The Shaker channel is a voltage-dependent potassium channel responsible for conducting depolarizing potassium currents when the membrane potential increases [59]. The C-terminal tail of the channel is in random coil state and contains a PDZ domain recognition motif. The motif assists the interaction with the PSD-95 scaffolding protein, helping the channel to cluster at unique membrane sites, which is important for the proper assembly and functioning of the synapse [60]. Anchor [61] can relatively accurately predict such interactions, also described as coupled binding and folding [12]. Both amino acid content (Supplementary Figure S14) and Anchor predictions indicate that PSC proteins are enriched in disordered binding regions relative to the proteome, while the number of unique partners lags behind this enrichment (Figure 1 bottom panel). The stacking of the WW domain and MAPK binding motifs was described earlier in the case of TANC protein [62], also supporting the idea that PSC proteins "allow" their partners to pick from multiple unoccupied sites to fine-tune their function. In contrast, this enrichment is less pronounced in PSD and does not follow the enrichment in the number of interacting partners. These results theorize PSD proteins either more heavily utilize competitive binding with alternative partners binding to the same region, or promote different interaction modes. Since our former suggestion is rather hard to assess using computational methods, we collected other possible interaction modes between proteins.
The activity-regulated cytoskeleton-associated protein (Arc) is a crucial regulator of long-term synaptic plasticity. The protein is highly modular and contains a flexible C-terminal tail [63]. Arc oligomerization is aided by the terminal IDRs, leading to the assembly of a structure reminiscent to an HIV capsid. This way, Arc can encapsulate RNA and can mediate their transfer, which was shown to play a role in cell to cell communication in the nervous system [64]. Coiled-coils are also important oligomerization motifs, linking two or more partners with high specificity and a wide range of stability [65]. The PSD protein Homer can form high-order complexes with a diverse range of partners. A tetrameric form resulting in a coiled-coil develops in a two-step process: first, the C-terminal 70 residues of the proteins form a parallel dimer, then two Homer dimers serve as a base for the antiparallel tetramer structure [66]. This arrangement is long enough to connect elements through the thickness of the PSD, serving direct connection between the plasma membrane and intracellular proteins through EVH1 domains binding to various scaffolding partners [67]. The mechanisms mentioned above share the characteristic feature of forming complexes by natively unfolded regions.
One can argue whether computational tools cross predict IDRs and coiled-coils, leading to a bias in our observations. To overcome this problem, our pipeline uses coiled coils as a filter when predicting IDRs. However, the lack of abundance of IDRs in PSD proteins could be explained by the fact, that we classified them as coiled coils. Calculating the overlap between IDR and coiled-coil containing proteins, and proteins forming interactions confirmed that this is not the case (Supplementary Figure S15), IDR and coiled-coil containing proteins do not dominate PPIs in PSD proteins.
PPIs depend on structural elements providing binding sites; however, they are also intensively regulated through the spatiotemporal control in the cell, often using post-translational modifications to precisely modulate the properties of the protein.
PSD-95 is a scaffolding protein in the synaptome, and the Serine at 561 position was shown to be subject to phosphorylation and work as a molecular switch. Phosphomimetic (Ser → Asp) mutation promotes an intramolecular interaction between the guanylate kinase (GK) and the SH3 domains, inducing a highly dynamic, yet closed conformation with buried binding sites. In contrast, mutating Ser561 to Alanine leads to a stable open conformation, facilitating the interaction between PSD-95 and its partners [68]. Moreover, such a simple mechanism as phosphorylation is often used cumulatively to produce an electrostatic effect [69] to fine-tune PPIs [70] or in a combinatorial way, where specific patterns can be responsible for regulating synaptic activity [71].
Additional regulation phenomena may provide further functional diversity. Using alternative splicing, different isoforms displaying characteristic sets and distributions of interaction sites may carry out distinct functions, as it was described in the case of PDZ-containing proteins [21]. Short Linear Motifs also play a role by mediating protein-protein interactions to contribute to a plethora of functions. Their short length and structural flexibility provide plasticity to emerge by convergent evolution and fine-tune interaction networks [72].
The multivalent nature of PSD proteins can also lead to additional structure-related phenomena: the SynGAP postsynaptic protein forms a coiled-coil trimer, shortly followed by a C-terminal binding motif interacting with the third PDZ domain of PSD-95. After complex formation, the assembly goes through liquid-liquid phase separation. The enrichment of SynGap and PSD-95 was shown to highly correlate with synaptic activity [24], advocating the critical role of the formation of membraneless organelles in synaptic processes [23].
Due to the high variance of presented features, they cannot be used alone to describe PSD proteins. However, we assumed that using their combination may reveal certain aspects of PSD proteins. For this purpose, we introduced descriptors to assess the presence and distribution of different elements in PSD proteins and their possible role in interactions. Since PPI formation may occur through many distinct structural elements, we included several of these to thoroughly catch many possible aspects of protein complex formation. Using Diversity of Potential Interactions (DPI), including structural and post-translational regulation members, the potential of a protein to form distinct interactions can be adequately defined. Verification with different control sets confirmed that it can be used directly on proteins sets to estimate their tendency to establish elaborate protein networks.
Our protein-level descriptions are all focused on PPIs, and thus it is not trivial whether they are discriminative features of PSD proteins, or they are only relevant in the context of specific protein complexes. Machine learning is a commonly used tool to discriminate protein subsets when the number of features and their variance is too high to overlook. Our results demonstrate the prediction power of the presented sequential, structural and regulation features. Considering the noise present in the datasets (as some proteins may be localized in the PSD and other locations too, moreover the possible false classification of source databases also adds a bias), the prediction is remarkably accurate. Applying structure based machine learning algorithm may enhance synaptome database development by expediting data collection and reducing manual effort.
In conclusion, we suggest that postsynaptic proteins, and in particular postsynaptic scaffold proteins, are capable of forming diverse kinds of interactions with their partners that we propose to play a key role in the functional organization of the postsynaptic density and its dynamic rearrangements upon stimuli. We also found this ability is imprinted in the amino acid sequence and can be used to discriminate proteins with propensity to form a high number of interactions, or using machine learning to distinguish the PSD proteome from other proteins.  Figure S1: K-means evaluation of various numbers of clusters of AAIndex. Figure S2: Protein length distribution (blue: PSD, red: human proteome). Figure S3: Flexible linker length distribution compared to the human proteome (blue: synaptome, yellow: PSD, red: PSC). Figure S4: Transmembrane helix distribution (blue: PSD, red: human proteome). Figure S5: Distribution of the number of interacting partners (blue: PSD, red: human proteome). Figure S6: Distribution of protein lengths in all datasets. Figure S7: Distribution of intrinsically disordered residue content in all datasets. Figure S8: Distribution of coiled-coils residue content in all datasets. Figure S9: Distribution of transmembrane residue content in all datasets. Figure S10: Distribution of the number of globular domains in all datasets. Figure S11: Distribution of the number of phosphorylation sites in all datasets. Figure S12: Distribution of the number of ubiquitination sites in all datasets. Figure S13: Distribution of the DPI values in all datasets. Figure S14: A: Correlation of amino acid content between interaction classes based on the structural state of participating partners and postsynaptic proteins. Columns: Autonomous folding and independent binding (i.e., the binding of two or more ordered proteins), coupled folding and binding (where an ordered protein stabilize an IDP partner) and mutual synergistic folding (interactions formed exclusively by disordered proteins), No folding, no binding (i.e., the "classical" disordered definition). Rows: Postsynaptic Scaffold proteins, and proteins from the postsynaptic density. Color scales from red (negative correlation) to green (positive correlation). B: Change of amino acid content of the group mutual synergistic folding (blue) and PSD proteins (red) compared to the proteome. C: Change of amino acid content of the group coupled binding and folding (blue) and PSC proteins (red) compared to the proteome. Figure S15: Overlap of coiled coils, IDRs and PPIs in PSD proteins. Table S1: Calculated properties and labels of all proteins used in the study. Table S2: Mean and standard deviation of calculated properties in different groups. Top: sets derived from other sources (i.e., SynaptomeDB, Uniprot etc.). Middle: Exclusive human proteome datasets. Bottom: Values calculated by downsampling and bootstrapping default sets 1000 times. Table S3: P-values of Kolmogorov Smirnov tests calculated between the proteome and different sets on all features. Table S4: Contingency tables and calculated p-values of chi-square tests. The independent observations are: (I) PSD/nonPSD (II) feature is above/below of the mean calculated on the proteome. Table S5: Co-occurence of structral formations in PSD/nonPSD proteins. Means and standard deviations were calculated for selected combinations. Table S6: Co-occurence of PPI promoting features in the PSD and proteome proteins. Means and standard deviations were calculated for all combinations. Table S7: List of proteins used to train and test the Artificial Neural Network.