The Budapest Amyloid Predictor and Its Applications

The amyloid state of proteins is widely studied, with relevance to neurology, biochemistry, and biotechnology. In contrast with nearly amorphous aggregation, the amyloid state has a well-defined structure, consisting of parallel and antiparallel β-sheets in a periodically repeated formation. Understanding of the amyloid state is growing with the development of novel molecular imaging tools, such as cryogenic electron microscopy. Sequence-based amyloid predictors have been developed, mainly using artificial neural networks (ANNs) as the underlying computational technique. Even for a good neural-network-based predictor, it is very difficult to identify the attributes of the input amino acid sequence that determine the decision of the network. Here, we present a linear Support Vector Machine (SVM)-based predictor for hexapeptides with accuracy higher than 84%, i.e., it is at least as good as the best published ANN-based tools. Unlike those of artificial neural networks, the decisions of linear SVMs are much easier to analyze, and from a good predictor we can infer rich biochemical knowledge. In the Budapest Amyloid Predictor webserver, the user inputs a hexapeptide, and the server outputs a prediction for the input and for each of the 6 × 19 = 114 distance-1 neighbors of the input hexapeptide.


Introduction and motivation
The primary structure of proteins is characterized by their amino acid sequence. While the primary structure determines the spatial folding of a protein and, consequently, all of its chemical and biological properties, inferring those properties from the amino acid sequence is a very difficult task. Here we consider amyloid predictors: tools that tell us whether a given amino acid sequence has the propensity to become amyloid. Amyloids are misfolded protein aggregates (Horváth et al., 2019; Taricska et al., 2020), which, in contrast with unstructured aggregates, have a well-defined structure comprising parallel β-sheets (Takács et al., 2019; Takacs and Grolmusz, 2020). Amyloids are present in numerous organisms: for example, in healthy human pituitary secretory granules (Maji et al., 2009), in the immune system of certain insects (Falabella et al., 2012), in the silkmoth chorion and some fish choria (Iconomidou and Hamodrakas, 2008), and in human amyloidoses and several neurodegenerative diseases (Soto et al., 2006).
Most recently, by analogy with the naturally occurring anti-herpes activity of β-amyloids, synthetic amyloid peptides were developed that act as amyloidogenic aggregation cores in certain viral proteins with high specificity (Michiels et al., 2020). In this way, new amyloid-based antiviral pharmaceuticals may be developed in the near future: the specific aggregation cores turn the viral proteins into insoluble amyloids. Consequently, potential amyloidogenicity may have direct pharmaceutical relevance.
Sequence-based amyloid predictors help in understanding and exploiting the amyloid state of proteins: instead of difficult, costly, and slow wet-laboratory tests, we can run the predictor on thousands or millions of inputs to assess the amyloidogenicity of proteins. A very recent review (Santos et al., 2020) covers sequence-based amyloid predictors that apply different strategies, such as AGGRESCAN (Conchillo-Sole et al., 2007), Zyggregator (Tartaglia and Vendruscolo, 2008), netCSSP (Kim et al., 2009), and APPNN (Familia et al., 2015), among others.
In the last several years, peptides of six amino acids have become a model for studying amyloid formation (Beerten et al., 2015; Louros et al., 2020b,a). The reason for this is twofold. First, ample evidence shows the biological relevance of amyloid-forming hexapeptides (Hauser et al., 2011; Tenidis et al., 2000; Reches and Gazit, 2004; Iconomidou et al., 2006; Beerten et al., 2015). Second, one can form 20^6 = 64 million hexapeptides from the 20 amino acids, which is a large (but not too large) and rich space of model molecules, whose structures are less complex and, therefore, easier to handle than those of larger model spaces.
The APPNN predictor applies a machine-learning approach: it is trained on 296 hexapeptides, selected from various sources, and then predicts whether a given hexapeptide is amyloidogenic or not. For longer sequences, it screens sliding windows of six amino acids along the polypeptide chain to predict whether they would form amyloid structures.
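The sliding-window screening described above can be sketched as follows. This is only an illustration of the windowing idea, not the APPNN implementation: the function names, the `predict` callback, and the example string are all assumptions introduced here.

```python
def hexapeptide_windows(sequence, width=6):
    """Return all overlapping windows of the given width, in sequence order."""
    sequence = sequence.upper()
    # A sequence shorter than `width` yields an empty list.
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


def scan_sequence(sequence, predict, width=6):
    """Apply a hexapeptide predictor (any callable: str -> label) to every window.

    `predict` is a hypothetical stand-in for whichever hexapeptide
    classifier is available; it is not part of any published tool.
    """
    return [(start, window, predict(window))
            for start, window in enumerate(hexapeptide_windows(sequence, width))]


# Arbitrary illustrative string of one-letter amino acid codes.
windows = hexapeptide_windows("KLVFFAED")
```

A sequence of length n yields n − 5 windows, so the cost of screening grows linearly with the chain length.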
In this contribution, we present a Support Vector Machine (SVM) predictor for hexapeptides, with better accuracy (84%) than most of the neural-network-based tools (see Familia et al., 2015, for a tabular comparison of the accuracy of those tools). The main advantages of our new predictor are its (i) simplicity, (ii) free on-line availability, and (iii) easy applicability for inferring location-dependent amyloidogenic properties of amino acids, as we describe below.
We note that neural-network-based predictors are neither simple nor easy to apply, and inferring the causality of their classifications is a very difficult task.

Materials and Methods
For the construction of the Budapest Amyloid Predictor, we have applied an artificial intelligence tool, the Support Vector Machine architecture (Cortes and Vapnik, 1995). In Support Vector Machines, n + m data points are represented by n + m vectors, each of k dimensions, x_1, x_2, ..., x_n and y_1, y_2, ..., y_m, and the goal is to find a hyperplane that optimally separates the x and the y data points. Usually, the dataset is partitioned into a training and a testing subset: the first is used in the construction of the SVM, the second for testing.
We have used the Waltz database (Beerten et al., 2015; Louros et al., 2020b) of 1415 hexapeptides, annotated as amyloidogenic (514 peptides) or non-amyloidogenic (901 peptides). The annotation was made by Thioflavin-T binding assays and literature search (Beerten et al., 2015; Louros et al., 2020b); consequently, it is based on experimental evidence. As in (Familia et al., 2015), two vectorial representations of the hexapeptides were considered. The first is the simple translation of the 20 amino acid names into vectors: each amino acid is mapped to a 0-1 vector of length 20, with a single 1-coordinate identifying the amino acid (called the orthogonal representation). This way, a hexapeptide is described by a 120-dimensional 0-1 vector.
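The orthogonal representation can be illustrated with a short sketch. The alphabetical ordering of the one-letter codes and the function name are assumptions made here; any fixed ordering of the 20 amino acids gives an equivalent encoding.

```python
# 20 standard amino acids in one-letter code; the ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_hexapeptide(peptide):
    """Encode a hexapeptide as a 120-dimensional 0-1 vector (6 blocks of 20)."""
    peptide = peptide.upper()
    if len(peptide) != 6:
        raise ValueError("expected exactly six amino acids")
    vector = []
    for aa in peptide:
        block = [0] * len(AMINO_ACIDS)
        # str.index raises ValueError for letters outside the alphabet.
        block[AMINO_ACIDS.index(aa)] = 1
        vector.extend(block)
    return vector


z = one_hot_hexapeptide("AAEEAA")
```

Each block of 20 coordinates contains exactly one 1, so every encoded hexapeptide has exactly six non-zero coordinates.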
The second representation is based on the AAindex: here, each amino acid corresponds to a 553-dimensional vector, and a hexapeptide to a 6 × 553 = 3318-dimensional vector.
From the 1415 hexapeptides (514 amyloids, 901 non-amyloids) found in the Waltz database, we randomly selected 158 amyloid and 309 non-amyloid hexapeptides for the test set (roughly 33%). We used the remaining hexapeptides for training our linear SVM. We used the LinearSVC class of the scikit-learn Python library (Pedregosa et al., 2011) for constructing the classifier.
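A minimal sketch of such a training run is shown below. It does not reproduce the actual experiment: the data are synthetic stand-ins (random one-hot vectors labeled by a hidden random linear rule, not the Waltz peptides), and the stratified split, the random seeds, and the LinearSVC hyperparameters are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples, n_positions, n_letters = 1415, 6, 20

# Synthetic stand-in data: random one-hot "hexapeptides" (NOT the Waltz set).
letters = rng.integers(0, n_letters, size=(n_samples, n_positions))
X = np.zeros((n_samples, n_positions * n_letters))
rows = np.repeat(np.arange(n_samples), n_positions)
cols = (np.arange(n_positions) * n_letters + letters).ravel()
X[rows, cols] = 1.0

# Synthetic labels from a hidden random linear rule (an assumption for the demo).
hidden_w = rng.normal(size=X.shape[1])
y = (X @ hidden_w > 0).astype(int)

# Hold out 467 samples (158 + 309 in the text), keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=467, stratify=y, random_state=0
)

clf = LinearSVC(C=1.0, max_iter=10000)  # hyperparameters are assumptions
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions
w = clf.coef_.ravel()                 # weight vector of the separating hyperplane
b = clf.intercept_[0]                 # scalar offset
```

After fitting, `coef_` and `intercept_` hold the weight vector and the offset of the separating hyperplane, which is what makes a linear SVM directly inspectable.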
The orthogonal representation yielded approximately 80% accuracy, while the AAindex-based representation yielded a higher accuracy (above 84%); because of this, we have chosen the second, AAindex-based representation in what follows.
The classifier simply computes the sign of w • z + b for the 3318-dimensional vector z corresponding to a hexapeptide, where w is a 3318-dimensional weight vector and b is a scalar. If this sign is positive, then the prediction is "amyloidogenic"; otherwise it is "non-amyloidogenic".
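The decision rule can be written out in a few lines. The function name and the toy numbers below are assumptions for illustration; the real w and z have 3318 coordinates.

```python
def predict_label(w, z, b):
    """Linear SVM decision: the sign of w . z + b selects the class."""
    if len(w) != len(z):
        raise ValueError("w and z must have the same dimension")
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return "amyloidogenic" if score > 0 else "non-amyloidogenic"


# Toy three-dimensional vectors (placeholders, not trained weights).
label_pos = predict_label([1.0, -2.0, 0.5], [1.0, 0.0, 1.0], b=-0.5)
label_neg = predict_label([1.0, -2.0, 0.5], [0.0, 1.0, 0.0], b=-0.5)
```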

Implementation and Usage
The Budapest Amyloid Predictor webserver is available at https://pitgroup.org/bap/. The user inputs a hexapeptide as six capital letters, and the server returns the prediction for the query, plus the predictions for all 114 (= 6 × 19) neighbors of the query at Hamming distance 1. If the hexapeptide is listed in the Waltz database, then the word "known" appears next to the prediction; otherwise, the word "predicted" appears.
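The neighbor count can be checked with a short enumeration. This sketch is not the webserver's code; the alphabet ordering and the function name are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 one-letter codes; ordering is an assumption


def hamming_neighbors(peptide):
    """List all peptides that differ from `peptide` in exactly one position."""
    peptide = peptide.upper()
    neighbors = []
    for pos, original in enumerate(peptide):
        for aa in AMINO_ACIDS:
            if aa != original:
                neighbors.append(peptide[:pos] + aa + peptide[pos + 1:])
    return neighbors


neighbors = hamming_neighbors("AAEEAA")
n_neighbors = len(neighbors)        # 6 positions x 19 substitutions
n_hexapeptides = len(AMINO_ACIDS) ** 6  # size of the whole hexapeptide space
```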

The Amyloid Effect Matrix
One of the greatest advantages of the SVM prediction is that we can easily see the reasons behind the decision of the model. The following matrix sheds light on the details of the decision of the SVM. Clearly, representing every amino acid by a 553-dimensional vector is highly redundant, since we have only 20 amino acids: that is, only 20 different 553-dimensional vectors exist in this representation. Therefore, we can write, with ℓ = 553:

$$ w \cdot z + b = \sum_{i=1}^{6\ell} w_i z_i + b = \sum_{j=1}^{6} \sum_{i=(j-1)\ell+1}^{j\ell} w_i z_i + b \tag{1} $$

For each fixed j = 1, 2, ..., 6, the ℓ = 553 values z_i in the second sum are determined by the j-th amino acid of the hexapeptide, and this way, all the possible 6 × 20 = 120 second sums (for six positions and 20 amino acids) can be pre-computed. Table 1 lists these pre-computed values; the 6 values of j correspond to the columns, the amino acids to the rows.

Clearly, the value of (1) can now be computed by adding exactly one item from each column, determined by the first, second, ..., sixth amino acid of the hexapeptide, plus the value of b = 1.083. For example, one can easily classify the hexapeptide AAEEAA by computing the sign of (−0.26 − 0.32 − 0.43 − 0.30 − 0.43 − 0.22 + 1.083) = −0.88, that is, −1, which predicts that AAEEAA is not amyloidogenic.
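The pre-computation can be sketched as follows. The function names, the toy feature length of 5 (instead of 553), and the random weights are assumptions for illustration only; the six AAEEAA contributions and b = 1.083 are the values quoted in the text above.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is an assumption


def effect_matrix(w, features, n_positions=6):
    """Collapse w into table[aa][j]: contribution of amino acid aa at position j."""
    ell = len(next(iter(features.values())))
    return {
        aa: [sum(w[j * ell + i] * f[i] for i in range(ell)) for j in range(n_positions)]
        for aa, f in features.items()
    }


def score_from_table(table, b, peptide):
    """Add one table entry per position, plus the offset b."""
    return sum(table[aa][j] for j, aa in enumerate(peptide)) + b


# Toy check with random placeholder features and weights (not AAindex data).
random.seed(0)
ell = 5
features = {aa: [random.uniform(-1, 1) for _ in range(ell)] for aa in AMINO_ACIDS}
w = [random.uniform(-1, 1) for _ in range(6 * ell)]
b_toy = 0.25
peptide = "AAEEAA"
z = [x for aa in peptide for x in features[aa]]
direct = sum(wi * zi for wi, zi in zip(w, z)) + b_toy
via_table = score_from_table(effect_matrix(w, features), b_toy, peptide)

# Worked example with the six contributions quoted in the text for AAEEAA.
quoted = [-0.26, -0.32, -0.43, -0.30, -0.43, -0.22]
aaeeaa_score = sum(quoted) + 1.083
aaeeaa_label = "amyloidogenic" if aaeeaa_score > 0 else "non-amyloidogenic"
```

The direct dot product and the table lookup agree up to floating-point rounding, which is why the 6 × 20 matrix carries the full decision of the classifier.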
By observing Table 1, one can easily derive an amyloidogenicity order of the amino acids for each position from 1 through 6: we just sort each column in increasing order and substitute the amino acids for the values; the result is shown in Table 2. In Table 2, the amyloidogenicity decreases from left to right. Naturally, proline, the "structure breaker", appears mostly at the right end, but not in every row: in row 5 it is in position 12. This shows a remarkable difference in the amyloidogenicity order of the six positions of the hexapeptides.
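Deriving a per-position order from such a matrix is a one-line sort, sketched below. The three-row, two-column table holds placeholder numbers that are not taken from Table 1. Listing the largest value first is an assumption: it matches the stated left-to-right decrease in Table 2, given that a positive score means "amyloidogenic".

```python
def rank_amino_acids(table, position):
    """Order amino acids at one position from the largest to the smallest value."""
    return sorted(table, key=lambda aa: table[aa][position], reverse=True)


# Placeholder values for three amino acids and two positions (NOT from Table 1).
toy_table = {
    "A": [0.1, -0.3],
    "P": [-0.9, 0.2],
    "V": [0.5, -0.1],
}

order_pos1 = rank_amino_acids(toy_table, 0)
order_pos2 = rank_amino_acids(toy_table, 1)
```

In this toy table the order differs between the two positions, which is the kind of position dependence the text describes for proline.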