A Review of Fifteen Years Developing Computational Tools to Study Protein Aggregation

The presence of insoluble protein deposits in tissues and organs is a hallmark of many human pathologies. In addition, the formation of protein aggregates is considered one of the main bottlenecks to producing protein-based therapeutics. Thus, there is a high interest in rationalizing and predicting protein aggregation. For almost two decades, our laboratory has been working to provide solutions for these needs. We have traditionally combined the core tenets of both bioinformatics and wet lab biophysics to develop algorithms and databases to study protein aggregation and its functional implications. Here, we review the computational toolbox developed by our lab, including programs for identifying sequential or structural aggregation-prone regions at the individual protein and proteome levels, engineering protein solubility, finding and evaluating prion-like domains, studying disorder-to-order protein transitions, or categorizing non-conventional amyloid regions of polar nature, among others. In perspective, the succession of the tools we describe illustrates how our understanding of the protein aggregation phenomenon has evolved over the last fifteen years.


Introduction
Proteins are prevalent macromolecules in living organisms and are essential to most biological functions. The establishment of functional native intra- and interchain interactions is a key feature of protein biology that controls protein folding, binding, and activity [1,2]. Proteins navigate several conformational states in the crowded environment of living cells in pursuit of a free-energy minimum, which can correspond to a monomeric state or a wide range of assemblies [3,4]. Intracellular assemblies come in a variety of forms, from multi-component, dynamic, and reversible biomolecular condensates to irreversible protein clumps. In this latter case, the original protomers undergo partial or global unfolding, and native contacts are replaced by non-native intermolecular interactions [5][6][7], resulting in the formation of non-structured amorphous aggregates or highly ordered amyloid fibrils characterized by a cross-β conformation [8,9].
A wide range of pathologies, including neurodegenerative diseases such as Alzheimer's and Parkinson's and non-neuronal localized or systemic amyloidoses, are all closely linked to the formation of protein aggregates [10][11][12][13]. However, despite this association with disease, aggregation propensity is a general property of polypeptide chains [14]. This results from the physicochemical requirements to form native interactions overlapping with the molecular determinants driving aggregation [15,16]. Therefore, aggregation-prone regions are ubiquitous in proteins and preserved throughout proteomes over millions of years of evolution [17]. Under this constant and inevitable pressure of aggregation, proteins have evolved to adjust their solubilities to the maximum necessary to function in their natural context [18].
The protein quality control system continuously monitors the balance between protein aggregation and solubility in vivo [19,20]. However, events such as genetic mutations [21,22], post-translational modifications [23], or the breakdown of proteostasis [24] place proteins out of their usual environment and favor the initiation of non-native contacts leading to aggregation. A similar situation is faced during the biotechnological production, purification, and storage of proteins [25][26][27], where they are exposed to solvent conditions divergent from those in the cell at concentrations that are many orders of magnitude higher than their biological levels [28]. Proteins have not evolved to be soluble in these conditions [29,30]; as a result, they precipitate in tissues in human disorders or during the development of therapeutic proteins.
Rationalization of the causes underlying protein aggregation prompted the development of tools that can predict protein solubility, diagnose the impact of mutations or chemical modifications in disease, and assist the engineering of optimized protein-based drugs. In 2007, we developed Aggrescan, one of the first protein aggregation prediction algorithms [31]. As our lab has a combined theoretical/experimental character, one of the features of this pioneering program was that it exploited validated biophysical data to generate its aggregation propensity scale [32]. This has been a constant of the different tools we have developed over the years, in a journey that has taken us from identifying the regions responsible for aggregation in individual protein sequences [33] to describing the aggregation properties of the totality of human protein structures [34] (Figure 1).

The present review attempts to illustrate the contribution of this Spanish group to the prediction of protein aggregation, describing the characteristics of the different developed algorithms and databases and providing our view on the state of the art of this field.

Computational Tools to Study Protein Aggregation
The present section provides a brief overview of three different computational tools specifically developed to predict aggregation propensity from polypeptide sequences, identify aggregation-prone regions in globular proteins, and evaluate mutations' impact on aggregation.

Aggrescan: Prediction of "Hot Spots" of Aggregation in Polypeptides
In 2007, we developed Aggrescan [31], a web-based tool for predicting aggregation-prone regions in polypeptide sequences. Aggrescan implements an aggregation propensity scale for natural amino acids derived from in vivo experiments performed with the β-amyloid peptide [32]. Specifically, we used a model consisting of the Aβ42 peptide fused to the green fluorescent protein (GFP), in which GFP fluorescence acts as a reporter of aggregation of the fusion protein. We mutated the middle position (Phe19) of the central hydrophobic cluster of Aβ42 to each of the other 19 natural amino acids and analyzed the fluorescence levels of the different Aβ42-GFP fusions upon expression in Escherichia coli. Variants with increased aggregation propensity showed decreased fluorescence levels, as they interfered with the proper folding of the fluorescent protein. The derived experimental data were employed to parametrize the algorithm behind Aggrescan [31][32][33]35].
Thus, Aggrescan assumes that protein aggregation is nucleated and driven by specific short sequence stretches that are exposed to the solvent, known as aggregation-prone regions (APRs) or "hot spots". The latest Aggrescan implementation is available online (http://bioinf.uab.es/aggrescan/ (accessed on 12 January 2023)) and allows users to evaluate single or multiple protein sequences. In both cases, the polypeptide sequence(s) must be provided in FASTA format as input. Then, the program determines the aggregation propensity values for each individual amino acid, based on the experimentally derived scale, generating an aggregation profile where APRs can be identified. It also calculates the average score of the sequence using a sliding window; this value provides an estimation of the overall aggregation propensity of the protein of interest.
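The profile-and-hot-spot idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the Aggrescan implementation: the per-residue values in `TOY_SCALE` are hypothetical placeholders (not the published experimental scale), and the window size, threshold, and minimum stretch length are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical per-residue propensities for illustration only;
# these are NOT the published Aggrescan scale values.
TOY_SCALE: Dict[str, float] = {
    "I": 1.8, "F": 1.7, "V": 1.6, "L": 1.4, "Y": 1.0, "W": 1.0, "M": 0.9,
    "C": 0.6, "A": -0.1, "T": -0.2, "S": -0.3, "P": -0.4, "G": -0.5,
    "H": -1.0, "Q": -1.2, "N": -1.3, "E": -1.7, "K": -1.8, "R": -1.8, "D": -1.9,
}


def smoothed_profile(seq: str, scale: Dict[str, float] = TOY_SCALE,
                     window: int = 5) -> List[float]:
    """Average the per-residue values over a centered sliding window.

    Windows are truncated at the sequence termini.
    """
    half = window // 2
    values = [scale[aa] for aa in seq]
    profile = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        chunk = values[lo:hi]
        profile.append(sum(chunk) / len(chunk))
    return profile


def hot_spots(profile: List[float], threshold: float = 0.0,
              min_len: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of runs above the threshold."""
    spots = []
    start = None
    # A sentinel below any threshold closes a run that reaches the C-terminus.
    for i, value in enumerate(profile + [float("-inf")]):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            if i - start >= min_len:
                spots.append((start, i))
            start = None
    return spots


if __name__ == "__main__":
    sequence = "KKKKKKKK" + "IVFLIVF" + "KKKKKKKK"
    profile = smoothed_profile(sequence)
    print(hot_spots(profile))
    print(sum(profile) / len(profile))  # crude whole-sequence average
```

Smoothing before thresholding means that a single hydrophobic residue flanked by charged ones does not register as a hot spot, whereas a contiguous hydrophobic stretch does.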
Aggrescan is a simple and fast algorithm incorporated into different protein aggregation and stability prediction pipelines. In particular, in 2013, it was implemented in AMYLPRED2, a consensus predictor that integrates analysis performed by 11 top-tier tools to identify aggregation-prone regions in proteins [36]. Ultimately, Aggrescan has been used for many different experimental applications, including the characterization of individual biomedically relevant human proteins and their mutants [37][38][39] or the comparative analysis of the aggregation propensity of entire proteomes [40,41].

Aggrescan 3D: A Server for Prediction of Aggregation Propensity in Protein Structures and Rational Design of Protein Solubility
The establishment of intermolecular contacts driven by solvent-exposed APRs has proven to be a successful concept in predicting protein aggregation in the context of newly formed proteins or IDPs. However, for folded proteins, the detected APRs are usually located within hydrophobic cores, inaccessible regions, or highly stable secondary structures, whose exposure or β-sheet conversion is thermodynamically prevented [42,43]. Typically, globular proteins aggregate by the spatial clustering on the protein surface of hydrophobic amino acids that are often non-contiguous in sequence, forming structural APRs (STAPs) [44], by local or global structural destabilization [45], or by stochastic fluctuations that lead to the exposure of previously buried APRs [46]. Therefore, weighting a protein's spatial context becomes necessary to understand the forces that lead to its aggregation, a task that sequence-based prediction methods cannot undertake.
To overcome these limitations, in 2015, we developed the Aggrescan 3D (A3D) algorithm (http://biocomp.chem.uw.edu.pl/A3D2/ (accessed on 12 January 2023)) [47]. A3D makes use of Aggrescan's aggregation propensity scale and projects it onto the three-dimensional context of a protein structure. This novel algorithm modulates each residue's aggregation propensity by accounting for its surface exposure and summing the contributions from proximal residues (i.e., within a 5 Å or 10 Å radius) according to their intrinsic aggregation propensity, distance, and exposed area, while disregarding the contribution of non-exposed amino acids [47]. This makes the study of protein aggregation using structural models accessible to non-experts and significantly reduces the number of false positive hits compared to linear prediction methods.
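The structural correction described above can be illustrated with a short sketch. The code below is not the published A3D equation: the exponential distance weighting, the exposure cutoff, and the `Residue` fields are assumptions made purely for illustration. It only shows the general idea that buried residues are ignored and exposed neighbors within a radius contribute according to their propensity, exposure, and distance.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Residue:
    intrinsic: float                 # sequence-based propensity (hypothetical value)
    rsa: float                       # relative solvent accessibility, 0 (buried) to 1
    xyz: Tuple[float, float, float]  # representative atom coordinates in angstroms


def structural_scores(residues: List[Residue], radius: float = 10.0,
                      rsa_cutoff: float = 0.1) -> List[float]:
    """Exposure-weighted score per residue plus neighbor contributions.

    Residues below `rsa_cutoff` are treated as buried: they score zero and
    contribute nothing to their neighbors. The exp(-d / radius) weighting is
    an illustrative assumption.
    """
    scores = []
    for i, ri in enumerate(residues):
        if ri.rsa < rsa_cutoff:
            scores.append(0.0)
            continue
        total = ri.intrinsic * ri.rsa
        for j, rj in enumerate(residues):
            if j == i or rj.rsa < rsa_cutoff:
                continue
            d = math.dist(ri.xyz, rj.xyz)
            if d <= radius:
                total += rj.intrinsic * rj.rsa * math.exp(-d / radius)
        scores.append(total)
    return scores


if __name__ == "__main__":
    toy = [
        Residue(1.0, 1.0, (0.0, 0.0, 0.0)),    # exposed, hydrophobic
        Residue(2.0, 0.0, (3.0, 0.0, 0.0)),    # buried, ignored
        Residue(1.0, 0.5, (5.0, 0.0, 0.0)),    # partially exposed neighbor
        Residue(-1.0, 1.0, (50.0, 0.0, 0.0)),  # exposed, charged, far away
    ]
    print(structural_scores(toy))
```

In this toy input the buried residue has the highest intrinsic propensity yet does not affect any score, which is the behavior that distinguishes a structure-aware prediction from a purely sequence-based one.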
The first A3D implementation was equipped with the FoldX force field to calculate the structural impact of mutations [48] and CABS-flex [49], a coarse-grained molecular dynamics simulator to estimate the proteins' most dominant structural fluctuations in the near-native ensemble. The integration of both approaches under the same pipeline allowed A3D to model and estimate mutation impact on stability and aggregation propensity. Using this strategy, it was possible to explain the mechanism of human β2-microglobulin aggregation, which entails a severe complication for long-term hemodialysis patients. Aggregation-prone mutants in this protein tend to expose STAPs, which are protected in non-aggregating variants.
Due to their high computational cost, CABS-flex simulations were first restricted to small proteins. In 2019, the initial A3D release was updated to version 2.0 [50,51], which allowed studies on large biomolecules such as antibodies, protein fibers, or multi-chain protein complexes [50]. In addition, A3D 2.0 included other significant improvements, such as the automatic engineering of more soluble yet stable protein variants and a RESTful service to incorporate the server into bioinformatic pipelines. This last algorithm version has been incorporated into a cost-effective routine tool specifically developed for designing and optimizing multimeric protein materials [52]. A standalone version of A3D 2.0 was recently released [53], which avoids dependence on erratic internet connections and addresses privacy concerns.
Notably, in July 2019, the Spanish Biophysical Society (SBE) designated the paper describing the A3D 2.0 method as a highlighted paper. Since its launch, A3D has aided the community in multiple experimental efforts, such as redesigning proteins for biotechnological approaches and engineering protein-based nanostructures [52,[54][55][56]. For instance, A3D has allowed the in silico redesign of one of the more soluble GFP variants [54]. The A3D-assisted redesign of this protein is shown in Figure 2. Other A3D applications included the study of the impact on the aggregation of pathogenic [37,57,58] and non-pathogenic protein variants [54,[59][60][61], the analysis of the binding of antibacterial proteins to membranes [62], the understanding of chaperone client recognition [63], the assistance with neglected tropical disease vaccine development [64], and the modeling of viral protein evolution throughout the SARS-CoV-2 pandemic [65].

Figure 2. A3D-assisted redesign of the green fluorescent protein (GFP). (A) Example of an A3D plot generated for the redesign of GFP. The red arrows point to aggregation-prone residues identified in a globular context. (B) Automatic mutations mode tab depicting three energetically favorable solubilizing mutations (indicated with brown squares). Note that A3D predicts three structural aggregation-prone regions (colored in red) that are solubilized (colored in blue) when applying a triple mutation to Lys residues. This example was generated using a previously solved structure of GFP (PDB code, 2B3Q, chain a).

A3D Database: Structure-Based Predictions of Protein Aggregation for the Human Proteome
In 2021, a comprehensive database containing highly accurate structure predictions for the human proteome was published [66]. These predictions were computed using AlphaFold (AF), a deep-learning neural network model developed by Jumper and coworkers [67]. As discussed above, A3D exploits the structural information from atomic models to identify surface-exposed aggregation-prone patches. We have exploited A3D to compute the aggregation propensity of the entire human proteome in the AF database. These data have been compiled in the A3D Database, which includes the precalculated A3D predictions for 23,391 human proteins [34]. This database is the first to compile aggregation predictions for protein structures at this large scale and is freely available at http://biocomp.chem.uw.edu.pl/A3D2/hproteome (accessed on 12 January 2023).
The first release of this database included interesting features from the more recent implementation of A3D, such as the capacity to predict the effect of selected mutations on protein stability and aggregation propensity, as well as to propose optimal solubility-enhancing mutations for every compiled human protein. Each entry of the A3D database includes a detailed description of the structure-based aggregation propensity for the protein of interest. The A3D database also incorporates user-friendly graphical tools for protein structure visualization and interpretation. Examples of potential applications include studying the impact of genetic mutations and engineering the solubility of pharmaceutically relevant human proteins, including antibodies, replacement enzymes, and growth factors.

Computational Tools to Study Prion-like Proteins
Prions are a particular class of amyloids that can propagate their misfolded conformation. These proteins have unique compositional features that have been exploited to develop dedicated bioinformatics tools capable of identifying novel pathological and functional polypeptides with prion-like properties. Herein, we discuss the features of four different algorithms developed by our group to study prions and prion-like proteins.

PrionScan: An Online Database of Predicted Prion Domains in Complete Proteomes
In 2014, we developed PrionScan (http://webapps.bifi.es/prionscan (accessed on 12 January 2023)) as an open-source database of organized and up-to-date predictions for putative prion-forming proteins for all the publicly available proteomes from all taxonomic subdivisions [68][69][70][71]. The PrionScan algorithm has been developed based on the assumption that prion propensity is determined by the composition of protein sequences [71,72]. Previously developed algorithms primarily focused on identifying amyloidogenic regions in pathogenic proteins based on local structural and primary sequence characteristics. However, most of these programs were not suited to analyze prion behavior since, globally, prion domains do not share the sequential characteristics common to disease-associated β-sheet amyloids [73].
PrionScan was designed to identify and score prion regions based on the compositional bias of prionogenic regions as deduced from an extensive set of experimentally validated prion and non-prion sequences from yeast. These data were exploited to build and train a probabilistic model that uses the statistical significance of individual amino acid propensities to detect Q/N-rich prion-like regions in all UniProtKB annotated proteomes [68,71,72,74]. In addition to storing information on putative prion proteins, PrionScan provides a function to predict prion regions in sequences not reported in public databases [68,71,75]. The data generated for a prediction comprise the sequence and localization of the highest-scoring putative prion domain and additional information about the protein, such as the Gene Ontology (GO) Terms and cross-references to other databases [68].
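A composition-based scan of this kind can be sketched as a log-odds table plus a sliding window. The code below is an illustrative assumption, not the PrionScan model: the training sequences are toy strings, the pseudocount and window size are arbitrary, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds(positives: List[str], background: List[str],
             pseudo: float = 1.0) -> Dict[str, float]:
    """Per-residue log-odds of occurring in the positive set vs. the background."""
    pos = Counter("".join(positives))
    bg = Counter("".join(background))
    pos_total = sum(pos[a] for a in AMINO_ACIDS) + pseudo * len(AMINO_ACIDS)
    bg_total = sum(bg[a] for a in AMINO_ACIDS) + pseudo * len(AMINO_ACIDS)
    return {
        a: math.log(((pos[a] + pseudo) / pos_total) / ((bg[a] + pseudo) / bg_total))
        for a in AMINO_ACIDS
    }


def best_window(seq: str, table: Dict[str, float],
                window: int = 60) -> Tuple[int, float]:
    """Return (start, score) of the highest-scoring window of the given size."""
    window = min(window, len(seq))
    running = sum(table[a] for a in seq[:window])
    best_start, best_score = 0, running
    for start in range(1, len(seq) - window + 1):
        # Slide by one: add the entering residue, drop the leaving one.
        running += table[seq[start + window - 1]] - table[seq[start - 1]]
        if running > best_score:
            best_start, best_score = start, running
    return best_start, best_score


if __name__ == "__main__":
    # Toy training data, for illustration only.
    table = log_odds(["QNQNQNYGQQNN"], ["ACDEFGHIKLMNPQRSTVWY" * 3])
    query = "KDEKLAKDEL" + "QNQNQNQNQN" + "KDEKLAKDEL"
    print(best_window(query, table, window=10))
```

Reporting only the single best window mirrors the description above, where a prediction comprises the highest-scoring putative prion domain and its location.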
PrionScan has been used to understand the functions of prion/prionogenic proteins and how their interaction networks have a substantial impact on gene regulation [76], as well as to identify regions driving liquid-liquid phase separation (LLPS) [77]. Recently, we have applied PrionScan to identify and characterize novel prion-like proteins in more than 800 bacterial proteomes, suggesting that the presence of prion-like proteins is a common feature of different prokaryotic genomes [70].

pWaltz and PrionW: Identification of Prion-like Protein Domains
In 2015, we launched the pWaltz algorithm (http://bioinf.uab.es/pWALTZ/ (accessed on 12 January 2023)) [78]. This predictor was inspired by the Waltz amyloid prediction strategy [79], but employed a lower detection threshold to identify milder amyloids and used a larger sliding window that fitted the size of the minimum transmissible β-fold described at that time [80,81]. As described, prion-like conversion was initially thought to be driven by compositional features alone [82,83]. However, in 2010 a seminal study by Toombs and co-workers using the yeast prion Sup35 suggested that certain stretches of the prion domain may play a driving role in its transition [84]. Surprisingly, their results indicated that these regions do not exhibit bias for residues overrepresented in yeast prions, such as asparagines (N) and glutamines (Q). On the contrary, hydrophobic residues were favored, while charged residues and prolines (P) harmed prion formation [84]. These biases were reminiscent of those used by pure amyloid predictors [79], which sparked the idea that prion conversion or propagation could rely on particular amyloid-like contributions. We further realized that previously reported prion or prion-like domains (PrLDs) had short stretches of mild amyloid propensity, and their mutation could explain observed differences in prion-like conversion. Based on these observations, we developed the pWaltz algorithm that could discriminate Q/N-rich domains with and without prion activity with higher accuracy than the compositional-only prediction methods available at the time [78].
Since its release, pWaltz has been applied, coupled to different PrLD boundary prediction algorithms, to detect soft amyloid cores in yeast and human prion-like proteins [85,86], to identify the first bacterial prion [87] and prion candidates in the malaria parasite [88], to evaluate mutation impact on prion-like protein aggregation [89], to understand the aggregation of human prion-like proteins [90], or to describe the mechanism of Med15 and TBP aggregation from initial coiled-coil conformations [60,91].
In 2015, we implemented PrionW (http://bioinf.uab.cat/prionw/ (accessed on 12 January 2023)), a prion prediction algorithm that works with complete protein sequences, as it identifies the compositional context and the structural features needed for prion conversion [92]. PrionW first runs a disorder prediction over the input sequence, and those stretches deemed disordered are evaluated for a minimum Q/N enrichment. Then, the best candidate sequence is evaluated with the pWaltz algorithm, and the selected PrLD and soft amyloid core are presented. We employed PrionW to analyze the complete yeast proteome, demonstrating that it recalls bona fide prion proteins with high accuracy. Over the past years, PrionW has helped scientists to study the evolution of telomeric-associated proteins in Candida albicans strains [93] and to select yeast prion-like transcription factors that co-aggregate with Swi1 in its prion state, explaining another layer of how the prion phenotype changes gene expression patterns [94]. PrionW has also been used to investigate the role of the pathogenic human SFPQ protein in Alzheimer's and Creutzfeldt-Jakob diseases [95], to understand the evolution of prions in fungal species [96], and to study the evolution of mammalian meiotic proteins [97], and it has been proposed as a predictor of prion-like proteins capable of LLPS [77,98].
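The three-stage workflow described above (disorder filter, Q/N enrichment check, amyloid core scan) can be outlined as a small pipeline. This is a structural sketch only: the disorder predictor and the core scorer are passed in as callables because the actual predictors are not reproduced here, and the `min_qn` threshold, the `core_len` window, and all names are assumptions for illustration.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    domain_start: int   # start of the disordered, Q/N-rich segment
    domain_end: int     # end of that segment (exclusive)
    core_start: int     # start of the best-scoring core window
    core_score: float


def qn_fraction(segment: str) -> float:
    """Fraction of Gln and Asn residues in a segment."""
    if not segment:
        return 0.0
    return (segment.count("Q") + segment.count("N")) / len(segment)


def find_candidate(
    seq: str,
    disordered_segments: Callable[[str], List[Tuple[int, int]]],
    core_score: Callable[[str], float],
    min_qn: float = 0.25,   # hypothetical enrichment threshold
    core_len: int = 21,     # hypothetical core window size
) -> Optional[Candidate]:
    """Keep disordered, Q/N-enriched segments and scan them for the best core."""
    best: Optional[Candidate] = None
    for start, end in disordered_segments(seq):
        segment = seq[start:end]
        if len(segment) < core_len or qn_fraction(segment) < min_qn:
            continue
        for i in range(len(segment) - core_len + 1):
            score = core_score(segment[i:i + core_len])
            if best is None or score > best.core_score:
                best = Candidate(start, end, start + i, score)
    return best


if __name__ == "__main__":
    # Toy stand-ins for the real predictors, for illustration only.
    whole_sequence = lambda s: [(0, len(s))]
    count_tyr = lambda w: float(w.count("Y"))
    query = "QNQNQNQNQN" + "YYYYY" + "QNQNQNQNQN"
    print(find_candidate(query, whole_sequence, count_tyr, core_len=5))
```

Passing the predictors as arguments keeps the filtering logic testable on its own and makes it explicit which parts of the sketch are placeholders.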

AMYCO: A Server for Prediction of the Impact of Mutations on the Aggregation Propensity of Prion-like Proteins
In 2017, the first extensive mutational study addressing the aggregation of a human prion-like protein in vivo was reported [99]. It studied the ribonucleoprotein hnRNPA2, whose aggregation is associated with the development of Amyotrophic Lateral Sclerosis (ALS) and multisystem proteinopathy [100]. This pioneering work provided a robust experimental framework to evaluate the determinants driving pathogenic PrLDs' aggregation. We used it to demonstrate that an equation that simultaneously considers the effects of mutations on PrLDs' composition and localized amyloid propensity best predicted the impact of amino acid substitutions on the intracellular aggregation of functional yeast prions and human disease-linked proteins [100][101][102]. The derived amino acid scoring system was implemented in 2019 into the publicly available AMYCO (combined AMYloid and COmposition-based prediction of prion-like propensity) algorithm [89].
AMYCO (http://bioinf.uab.es/amycov04/ (accessed on 12 January 2023)) is a web server that allows the fast, automated, and graphical evaluation of the effect of mutations on the aggregation properties of prion-like proteins [89]. At the time of its release, its performance was better than that of previous state-of-the-art predictors. Since its publication, AMYCO has been used to gain insights into prion evolution, especially the appearance and conservation of prion-protective or -enhancing mutations in different mammals [103][104][105][106] and birds [107]. It has also been used to identify prion disease-related somatic mutations in the prion gene from cancer patients [108] or to rationalize the effect of point mutations in the hnRNPDL gene on the onset of a rare type of muscular atrophy [109].

SGnn: A Server for the Prediction of Prion-like Domains Recruitment to Stress Granules upon Heat Stress
As a representative example, SGnn has been recently used by Harrison and coworkers to predict whether ortholog sequences from metazoans and plants can be recruited into SGs [113].

Computational Tools to Study Intrinsically Disordered Proteins (IDPs)
Intrinsically Disordered Proteins (IDPs) have primary structures that combine low mean hydrophobicity and high net charge. The absence of a driving force for compaction and electrostatic repulsions causes the proteins populating this sequence space to present extended conformations in which amino acids are highly exposed to the solvent. Thus, solution conditions, including the pH, significantly impact the structure adopted by disordered protein regions. Here, we introduce a set of bioinformatics tools we developed to provide a framework to study IDPs' properties in a context-dependent manner.

DispHred and DispHScan: Predicting Protein Disorder as a Function of pH
In 2020, we released DispHred (https://ppmclab.pythonanywhere.com/DispHred (accessed on 12 January 2023)) [114]. This tool was specifically developed to study the effect of pH on the order-disorder transitions of proteins possessing low secondary structure content. This server uses the Henderson-Hasselbalch equation to calculate the protein's net charge and the pH-dependent hydrophobicity scale developed by Zamora et al. [115]. First, we validated the utility of this novel pH-dependent hydropathy scale, building up a dataset of experimentally validated disordered and single-chain folded proteins. Their associated net charge and hydropathy scores were computed and represented in charge-hydropathy plots, which were then used to assess the disorder-predicting potential of this representation. Receiver Operating Characteristic (ROC) analysis was performed on these plots, indicating high performance compared to the traditional Guy and Kyte-Doolittle hydrophobicity scales [116]. Afterward, the model was tested to predict pH-dependent order-to-disorder transitions. To do so, we used seven disordered proteins and peptides whose pH-dependent conformations had been validated experimentally. Using Support Vector Machines (SVMs), a linear boundary condition was defined. This classification system correctly discriminated folded and disordered proteins, avoiding overfitting and providing a margin of uncertainty near the boundary condition line.
DispHred is ready to use through its freely available web server implementation, which gives users access to the order-to-disorder transition analysis of individual sequences. Among the variables of the analysis, the user can choose the sliding window size, the starting and ending pH of the interval, and the pH step used. The results page presents tabular and graphical data of the DispH score for the protein at every given pH. A score over 0 indicates that the protein is folded, while negative scores indicate that the protein is unfolded at the given pH.
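The net charge component of such an analysis follows directly from the Henderson-Hasselbalch equation and can be sketched as below. This is not the DispHred code and does not compute the DispH score: the pKa values are approximate textbook-style figures assumed for illustration (the server may use different ones), and the function names and the default pH interval and step are hypothetical.

```python
from typing import Dict, List, Tuple

# Approximate pKa values assumed for illustration only.
PKA_BASIC: Dict[str, float] = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_ACIDIC: Dict[str, float] = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PKA_N_TERMINUS = 9.0
PKA_C_TERMINUS = 2.0


def _protonated_fraction(ph: float, pka: float) -> float:
    """Henderson-Hasselbalch: fraction of a group in its protonated form."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def net_charge(seq: str, ph: float) -> float:
    """Net charge of a peptide at a given pH from independent ionizable groups."""
    # Basic groups carry +1 when protonated; acidic groups carry -1 when deprotonated.
    charge = _protonated_fraction(ph, PKA_N_TERMINUS)
    charge -= 1.0 - _protonated_fraction(ph, PKA_C_TERMINUS)
    for aa in seq:
        if aa in PKA_BASIC:
            charge += _protonated_fraction(ph, PKA_BASIC[aa])
        elif aa in PKA_ACIDIC:
            charge -= 1.0 - _protonated_fraction(ph, PKA_ACIDIC[aa])
    return charge


def charge_profile(seq: str, ph_start: float = 2.0, ph_end: float = 12.0,
                   step: float = 1.0) -> List[Tuple[float, float]]:
    """Net charge at each pH from ph_start to ph_end (inclusive) in fixed steps."""
    n_steps = int(round((ph_end - ph_start) / step))
    return [
        (round(ph_start + i * step, 6), net_charge(seq, ph_start + i * step))
        for i in range(n_steps + 1)
    ]


if __name__ == "__main__":
    for ph, charge in charge_profile("ACDEHKRY"):
        print(f"pH {ph:4.1f}  net charge {charge:+.2f}")
```

Treating every ionizable group as independent is the usual simplification behind this calculation; it ignores local environment effects on pKa, which is one reason the values above should be read as rough estimates.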
DispHred has been used to predict pH-dependent order transitions in amphiphilic peptides to study their self-assembly [117] and disorder-to-order transitions in redox or alkaline environments in viral IDPs [118]. It has also been used in biomedicine to study the ordered state of possible bioactive peptides as a function of pH [119] and the effect of pH on the binding of drugs to the disordered regions of Human Serum Albumin [120].

SolupHred: A Server to Predict the pH-Dependent Aggregation of Intrinsically Disordered Proteins
Biophysicists have long been interested in predicting aggregation from protein sequences in defined conditions [79,122,123]. However, the protein microenvironment is highly dynamic, and the aggregation of polypeptides is influenced by external factors such as pH [124,125]. This influence is especially relevant for IDPs, whose lack of defined three-dimensional conformation makes them more susceptible to environmental fluctuations [126].
SolupHred (https://ppmclab.pythonanywhere.com/SolupHred (accessed on 12 January 2023)) represented the first aggregation predictor for IDPs to incorporate the effect of pH in its core [127]. In order to develop the predictive model, we engineered three different variants of the measles virus phosphoprotein (PNT) displaying different net charges and isoelectric points (pI). Interestingly, we discovered that not only the net charge but also the lipophilicity depended on the solution pH [128]. The SolupHred algorithm implements this evidence into an empirical equation based on the assumption that pH-dependent aggregation in IDPs is determined by both charge and lipophilicity. SolupHred successfully recapitulated the aggregation propensities of disease-linked proteins such as alpha-synuclein [129], islet amyloid polypeptide [130], Aβ40 [131], or tau [132] at different pH levels.
The SolupHred web server accepts one or multiple sequences and predicts solubility either in a pH interval or at a specific pH. After submission, it provides a solubility profile in the selected pH range, indicating the 10% maximum and 10% minimum solubilities (Figure 3). SolupHred can be used as a fast, cost-effective method to optimize experimental conditions, purification, and storage of IDPs, as well as for conducting large-scale analyses of pH-dependent IDP aggregation. The server has been used to study the correlation between solubility and LLPS in low-complexity regions of proteins implicated in neuronal diseases [122].
Figure 3. The SolupHred web server. (A) SolupHred's input interface requires the entry of one or many disordered regions in FASTA format and the selected pH range for the study. Alternatively, users can also predict solubility at a specific pH. (B) The results provide a table with the most relevant predictions such as 10% maximum and 10% minimum solubilities. Moreover, a graphical representation with the solubility profile in the pH range is also shown with the specific solubility scores. The results can be downloaded through different links in the desired file format.

CARs-DB: A Database of Cryptic Amyloidogenic Regions in Intrinsically Disordered Proteins
IDPs, lacking a defined secondary structure, were considered devoid of pro-aggre-

Discussion
Due to its importance in biomedical research and the biotechnological sector, protein aggregation has changed over the past decades from a virtually unexplored study subject to a hot research topic. We have seen ground-breaking scientific advancements, and now, we have a profound mechanistic view of how aggregation occurs. Computational tools, such as the ones developed by our lab and described here, have contributed significantly to this knowledge, helping to direct experimental efforts to elucidate the molecular pathways behind disorders related to protein aggregation. Additionally, they have accelerated the development of engineered protein variants with enhanced solubility and stability, reducing the time and money needed to produce therapeutic proteins.
All our programs, and the large majority of those developed by our colleagues, are freely available to the public in the format of a web server and/or as an executable file. In addition, all of them contain help files with detailed information for users, including a general description of the tool and relevant usage information. In silico approximations, such as the ones we detail here, are gradually being included in many wet laboratories' routines as a cost-effective way to design experimental pipelines. In this way, the primary articles describing the algorithms discussed in this review have collectively received >1350 citations as of 22nd December, according to Google Scholar.
The different algorithms capture distinct aspects of protein aggregation and are intended for diverse applications. In a way, the timeline shown in Figure 1 reflects how the interests of the field have evolved over time and how the integration of experimental biophysical data and predictions has allowed us, as a community, to address challenges of increasing complexity. Aggregation was initially thought to be a purely stochastic and thus unpredictable phenomenon; the realization that, like folding, aggregation was somehow imprinted in the sequence [141] opened an avenue for the rationalization of aggregation reactions at the proteome scale. With sequence-based predictors available, new types of protein sequences very soon attracted the community's interest: those belonging to prion and prion-like proteins. It was immediately evident that the sequence space of archetypical amyloids and prion-like proteins only partially overlapped, and a new generation of algorithms was generated. That effort was worthwhile because these algorithms, or those directly derived from them, are currently being used to study the propensity of proteins to form part of the fashionable membraneless organelles [142]. The need for different scales when dealing with different sequence sets already indicated that the amyloid sequence space was far broader than previously believed. It is now clear that highly soluble sequences with minimal aliphatic content and/or high net charge can form amyloids [139]. The idea that low solubility and aggregation propensity are interchangeable qualities was at the heart of most initial algorithms; new programs and databases are revisiting this idea to fish for sequences in this new amyloid terrain.
Once the intrinsic sequence factors were clarified for the different aggregation flavors, extrinsic factors had to be considered. These include viscosity, temperature, pH, ionic concentration, protein concentration, solvent identity, and interactions with other molecules. The absence of rigorous experimental data spanning all potential variable combinations for a group of sequentially unrelated proteins has been the fundamental obstacle to developing systems that can incorporate the protein microenvironment in their predictions. However, as we illustrate here, the first attempts to incorporate parameters such as the solution pH in the prediction pipeline are bearing fruit, particularly for IDPs, whose properties are especially sensitive to the solution conditions.
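As a generic illustration of how solution pH can enter a sequence-based calculation, the following minimal sketch estimates the net charge of a sequence at a given pH with the Henderson-Hasselbalch equation. It is not the implementation of any of the tools reviewed here; the pKa values and the function name `net_charge` are illustrative assumptions, and real predictors use their own parameter sets and additional terms.

```python
# Minimal sketch: pH-dependent net charge of a protein sequence.
# The pKa values below are illustrative assumptions, not those of any specific tool.
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}             # groups that are +1 when protonated
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}    # groups that are -1 when deprotonated
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Estimate the net charge of `sequence` at `ph` (Henderson-Hasselbalch)."""
    seq = sequence.upper()
    # N-terminal amine: fraction protonated contributes a positive charge.
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    # C-terminal carboxylate: fraction deprotonated contributes a negative charge.
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for residue in seq:
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


if __name__ == "__main__":
    # Hypothetical example peptide; scan a few pH values.
    peptide = "MKDEHRYCAG"
    for ph in (3.0, 5.0, 7.0, 9.0):
        print(f"pH {ph:.1f}: net charge {net_charge(peptide, ph):+.2f}")
```

Scanning the pH in this way shows how the charge state, and therefore the electrostatic repulsion that opposes self-assembly, can change markedly over a few pH units. This is one reason why predictions for IDPs are expected to be sensitive to solution conditions.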
In addition to the intrinsic sequence, one should consider other factors when studying the aggregation of globular proteins, including stability, conformation, cooperativity, surface solubility, and dynamics. Structure-based algorithms were born to deal with all these parameters automatically. For a long time, however, the application of these tools was limited to a relatively small region of the protein universe: proteins for which a high-resolution structure exists or a model could be confidently constructed. With the advent of programs such as AlphaFold [67], this limitation has been overcome, and databases containing accurate protein aggregation predictions for the complete set of globular proteins in a given proteome are already available online [34].
The time has come for artificial intelligence (AI) to enter the aggregation prediction arena [143]. The application of this technology requires the availability of a large number of biophysical studies that can feed it. Unfortunately, the acquisition of biochemical (stability, pH-dependence of conformational changes) and biophysical data (type of condensation or aggregation) is seen as a low-value objective. We should remember that AI successes such as AlphaFold would not have been possible without an extraordinarily well-curated database of protein structures [144]. Building a consortium that can generate a coherent set of information related to protein aggregation is now, more than ever, a necessity, given the growing impact of protein aggregation-related diseases on our society.

Biophysica 2023, 3, 3
Figure 1. Overview of the computational tools developed during the last 15 years by Ventura's Lab. Red squares indicate aggregation-related predictors and databases, orange squares refer to prion-like domain resources, and blue squares indicate tools dedicated to studying intrinsically disordered proteins.