A Peptides Prediction Methodology for Tertiary Structure Based on Simulated Annealing

The Protein Folding Problem (PFP) is a big challenge that has remained unsolved for more than fifty years. This problem consists of obtaining the tertiary structure or Native Structure (NS) of a protein knowing its amino acid sequence. The computational methodologies applied to this problem are classified into two groups, known as Template-Based Modeling (TBM) and ab initio models. In the latter methodology, only information from the primary structure of the target protein is used. In the literature, Hybrid Simulated Annealing (HSA) algorithms are among the best ab initio algorithms for PFP; the Golden Ratio Simulated Annealing (GRSA) family is a group of these algorithms designed for peptides. In contrast, algorithms designed with TBM use information from the target protein's primary structure together with information from similar or analogous proteins. This paper presents the GRSA-SSP methodology, which uses a secondary structure prediction to build an initial model and refines it with HSA algorithms. Additionally, we compare the performance of each GRSAX-SSP algorithm with that of its corresponding GRSAX. Finally, our best GRSAX-SSP algorithm is compared with PEP-FOLD3, I-TASSER, QUARK, and Rosetta, showing that it is competitive for small peptides but not for the largest peptides.


Introduction
Proteins or polypeptides are macromolecules built from amino acids (aa) and are mainly responsible for the functionality of living beings. Proteins are essential elements because every protein has a specific function related to its unique three-dimensional structure, named the Native Structure (NS). All proteins consist of a polymer chain of aa; chains with a small number of aa are named peptides. Peptides are of significant importance to the scientific community because of their multiple applications, for instance, in pharmaceutical research [1][2][3][4], drug design [5][6][7], diagnosis [8][9][10], and therapy [11,12]. Obtaining the NS of a protein from its amino acid sequence could therefore bring benefits to human beings.
The PFP has been identified as an important problem since the research teams of Kendrew and Perutz obtained the tertiary structures of the myoglobin and hemoglobin molecules, respectively [13,14]. These studies established the relation between function and structure. The PFP consists of obtaining the three-dimensional structure of a protein with the lowest Gibbs free energy, that is, its thermodynamically stable three-dimensional conformation [15].
The PFP is considered an NP-hard problem [16]. Thus, presumably, no known exact algorithm can solve it in polynomial time. In other words, the execution time grows [...] The HSA algorithms previously mentioned obtained excellent results for small proteins or peptides. However, as the number of aa increases, the number of variables (torsional angles of the aa) also increases, and the computational time needed to explore the solution space becomes considerable. As a result, the PFP area needs new approaches to obtain better solutions for large peptides or proteins.
This paper proposes the GRSA-SSP methodology, which combines GRSA algorithms with Secondary Structure Prediction (SSP). For a given chain of aa representing a peptide or a protein, GRSA-SSP performs two processes: (a) it obtains a first protein prediction from the secondary structure of the amino acid sequence; (b) it refines this prediction by using GRSA family algorithms.
These two processes are performed in several steps described in this paper. The algorithm used in the second phase of GRSA-SSP can be any of the GRSA family algorithms. We name these hybrid algorithms GRSAX-SSP, where X identifies the GRSA algorithm. We evaluate our methodology using the RMSD and TM-score metrics [29]. Additionally, experimentation is performed with a set of forty-five peptide instances and a set of six mini proteins, and the results are compared with the most popular algorithms in the literature, such as PEP-FOLD3 [28], I-TASSER [30,31], Rosetta [24,32], and QUARK [25,33].
The paper's organization is as follows: first, we present the introduction to PFP and HSA algorithms. Then, in the Background section, we review the Protein Folding Problem definition and some relevant research in the literature, and we explain the GRSA family of algorithms. In the next section, we describe the GRSA-SSP methodology. In the Results section, we present experimentation comparing the GRSA algorithms with those of the literature; also, we analyze the presented methodology's performance. Finally, the conclusions of this research are presented.

Background
The PFP is a significant multidisciplinary problem that has been investigated for over half a century [34]. Different scientific areas, for example, computer science, bioinformatics, and molecular biology, have studied this problem, and three questions in particular need to be answered [34]:

• What is the physical code by which an amino acid sequence dictates the NS?
• Why do proteins fold very quickly in nature, while in silico they fold comparatively slowly?
• Is there an algorithm that predicts the protein structure from the amino acid sequence?
This paper is related to the last question. We propose different strategies to obtain the tertiary structure (NS) using GRSA family algorithms and secondary structure prediction. As mentioned before, finding new algorithms for the PFP is significant not only because of its potential applications but also because it is an NP-hard problem [16], and the number of combinations that an algorithm must explore defines a very large solution space.

Definition of Ab-Initio and Force Fields
Ab initio modeling can be defined as an optimization problem in which the Gibbs free energy is the objective function f to be minimized. The problem is defined as follows: given a sequence of amino acids a1, a2, ..., an, every amino acid has an associated set of angles σ1, σ2, ..., σm, where each index denotes a particular dihedral angle; minimizing the energy function f(σ1, σ2, ..., σm) then provides the best tertiary structure or NS. Energy functions (force fields) are used to determine the energy of a protein structure [35]; some examples are AMBER [36], CHARMM [37], ECEPP/2, and ECEPP/3 [38]. The potential energy of ECEPP/2 is given by Equation (1), which is calculated in vacuo for intramolecular energies only, and this is the energy function to be minimized [39].
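$$E_{total} = \sum_{j>i}\left(\frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}\right) + 332\sum_{j>i}\frac{q_i q_j}{\varepsilon\, r_{ij}} + \sum_{j>i}\left(\frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}}\right) + \sum_{n} U_n\left(1 \pm \cos(k_n \phi_n)\right) \qquad (1)$$
where: $r_{ij}$ is the distance in Å (angstroms) between the atoms i and j; $A_{ij}$, $B_{ij}$, $C_{ij}$, and $D_{ij}$ are the parameters of the empirical potentials; $q_i$ and $q_j$ are the partial charges on the atoms i and j, respectively; $\varepsilon$ is the dielectric constant; $U_n$ is the energetic torsion barrier of rotation about the bond n; and $k_n$ is the multiplicity of the torsion angle $\phi_n$.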
In this paper, we use the potential energy of ECEPP/2 as the objective function while exploring the conformational space; the protein structure whose energy has been minimized is accepted as the prediction.
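As an illustration of how such an objective function is evaluated, the following is a minimal sketch that assumes the usual four-term ECEPP/2-style form: a 12-6 Lennard-Jones term, a Coulomb term scaled by 332, a 12-10 hydrogen-bond term, and a torsion term. The function names, the argument layout, and the default dielectric constant of 2 are assumptions made for this example; they are not taken from the SMMP code used in the paper.

```python
import numpy as np

# Assumed conversion factor so that charges in e and distances in Å give kcal/mol.
COULOMB_FACTOR = 332.0


def _arr(x):
    """Convert any sequence to a float array so that powers broadcast correctly."""
    return np.asarray(x, dtype=float)


def lennard_jones(r, A, B):
    """12-6 term summed over the supplied atom pairs."""
    r, A, B = _arr(r), _arr(A), _arr(B)
    return float(np.sum(A / r**12 - B / r**6))


def electrostatic(r, qi, qj, eps=2.0):
    """Coulomb term; eps is the dielectric constant (default is an assumption)."""
    r, qi, qj = _arr(r), _arr(qi), _arr(qj)
    return float(COULOMB_FACTOR * np.sum(qi * qj / (eps * r)))


def hydrogen_bond(r, C, D):
    """12-10 hydrogen-bond term summed over the supplied pairs."""
    r, C, D = _arr(r), _arr(C), _arr(D)
    return float(np.sum(C / r**12 - D / r**10))


def torsion(U, k, phi, sign):
    """Torsion term; sign holds +1 or -1 per torsion, phi is in radians."""
    U, k, phi, sign = _arr(U), _arr(k), _arr(phi), _arr(sign)
    return float(np.sum(U * (1.0 + sign * np.cos(k * phi))))


def total_energy(lj, el, hb, tor, eps=2.0):
    """Sum of the four terms; each argument is a tuple of per-term arrays."""
    return (lennard_jones(*lj) + electrostatic(*el, eps=eps)
            + hydrogen_bond(*hb) + torsion(*tor))
```

In a real force field the pair lists for the Lennard-Jones and hydrogen-bond terms are disjoint and the parameters come from published tables; here they are simply passed in by the caller.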

Computational Approaches for PFP
The CASP organization has classified PFP models into two main groups:
Group 1: Template-based modeling (TBM). In this group, we find algorithms that use biological information obtained from the secondary structure of the target protein, homology, and fragments of other proteins. These algorithms have achieved good results in predicting protein structures in the CASP [32,40,41]. TBM involves several strategies; some of the most common are homology [42,43], threading [44], and fragment assembly [30,45].
Group 2: Ab initio. This prediction approach classically refers to the determination of the NS using only the aa sequence information. Ab initio algorithms have achieved good PFP results, but only for small proteins with fewer than 120 residues [46]. Ab initio modeling is the most challenging approach because it uses the amino acid sequence as its only information. Finding an optimal solution with ab initio methods is very difficult for big proteins because the solution space is enormous.
These two groups can be applied to small proteins or peptides (between 5 and 50 aa) [28,47]. There are successful studies of protein prediction using SA [48][49][50] or Monte Carlo algorithms with Metropolis-Hastings [26,27]. Monte Carlo algorithms are also applied to the inverse protein folding problem, whose objective is to find a sequence given a structure [51,52]. This paper focuses on the classical PFP, which consists of finding the functional structure given an aa sequence.
Rosetta is a de novo protein structure prediction approach that builds models of the tertiary structure using the primary structure and secondary structure predictions. The algorithm uses local sequences to produce local structures (fragments) that form a template of the target protein. The fragments are then assembled randomly using a Monte Carlo simulated annealing algorithm. Finally, the fitness of individual conformations is evaluated with a scoring function derived from known protein structures. However, only peptides longer than 27 aa can be provided as input [32].
Another PFP approach is I-TASSER (Iterative Threading ASSEmbly Refinement). It has four principal parts: template generation using a multi-threading method, a fragment assembly method, a refinement process, and final model selection with annotation tools. I-TASSER aligns the target sequence and divides it into aligned regions, obtained with LOMETS [53,54], and nonaligned regions, for which a Monte Carlo algorithm is used. In the last step, functions are annotated from the structural models obtained, using the BioLiP [55] database of ligand-protein interactions. I-TASSER predicts protein structures from 10 to 1500 amino acids [31].
PEP-FOLD3 has a framework to predict the tertiary structure of peptides using de novo structure modeling. The process of predicting structure consists of three stages. Firstly, for a peptide amino acid sequence, a support vector machine is applied to predict the structural alphabet of fragments. Secondly, several models are generated using series of states and refined by a Monte Carlo algorithm. Finally, the five best conformations are selected [28].
Another approach is QUARK [33], an ab initio strategy for predicting protein structures in the range of 20 to 200 aa. In QUARK, small fragment structures are carefully selected and assembled along the target sequence using a Monte Carlo algorithm.
SAINT2 is a fragment-based de novo structure prediction approach that has been successfully compared with the CASP12 approaches [56]. It consists of a sequence-to-structure pipeline divided into four principal sections: (a) secondary structure prediction, where PSI-PRED [57] is applied, (b) torsion angle prediction using SPINE-X [58], (c) a fragment library built with the Flib package, and (d) residue-residue contact prediction applying metaPSICOV [59]. Finally, the highest-scoring model is selected. In our methodology, sections (a) and (b) are applied, as shown in Figure 1.

The GRSA Family Algorithms
The SA algorithm is inspired by the physical annealing process of metals [60,61]. It has been applied successfully to many NP-hard problems [20], including the PFP. SA employs the Metropolis algorithm to explore the solution space efficiently and obtain a good solution to an optimization problem. Algorithm 1 shows the pseudocode of SA. The parameters T_i and T_f define the initial and final temperatures, respectively, and the α parameter is the cooling factor. In the Metropolis cycle, new solutions are generated by a perturbation function. To accept or reject a new solution, an acceptance criterion based on the Boltzmann distribution is applied (lines 11 to 14). The SA algorithm runs until the final temperature, T_f, is reached. The SA algorithm source code is available at https://github.com/DrJuanFraustoSolis/SimulatedAnnealing.git (accessed on 28 April 2021).
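The classical SA loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Fortran implementation: the function name, the fixed Metropolis cycle length, and the caller-supplied `energy` and `perturb` callables are all hypothetical choices for the example.

```python
import math
import random


def simulated_annealing(energy, perturb, s0, t_initial, t_final,
                        alpha=0.95, metropolis_len=50, rng=None):
    """Minimize `energy` starting from `s0` with geometric cooling T <- alpha * T.

    energy(s) returns a float; perturb(s, rng) returns a new candidate solution.
    """
    rng = rng or random.Random(0)
    current, e_current = s0, energy(s0)
    best, e_best = current, e_current
    t = t_initial
    while t > t_final:
        # Metropolis cycle: a fixed number of perturbations at this temperature.
        for _ in range(metropolis_len):
            candidate = perturb(current, rng)
            e_candidate = energy(candidate)
            delta = e_candidate - e_current
            # Boltzmann acceptance: always accept improvements, and accept
            # worse candidates with probability exp(-delta / T).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, e_current = candidate, e_candidate
                if e_current < e_best:
                    best, e_best = current, e_current
        t *= alpha  # cooling step
    return best, e_best
```

For the PFP, the solution would be the vector of torsion angles and `energy` the force-field evaluation; the sketch keeps both abstract so that it can be tried on any small objective function.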
However, when the solution space is very large, the exploration performed by SA takes a long time to obtain optimal solutions. Thus, new algorithms are necessary. The GRSA algorithm was proposed for this purpose and has been successfully applied to different NP problems [62,63], including the PFP [18]. The main characteristics of GRSA are a cooling scheme in which the temperature decreases through T_fp temperature cuts calculated with the golden number (φ), and a stop criterion that reduces the cost of exploration (Algorithm 2). GRSA has a structure similar to that of the SA algorithm (lines 4 to 16). The difference is that GRSA calculates the T_fp temperature cuts (five cuts are recommended) and associates with each cut an α parameter in the range [0.7, 1] (the usual highest value is 0.95); the intermediate α values in this range are obtained with an increment δ from the lowest to the highest α value (in this case, δ = 0.05). These α values are associated with each temperature cut (line 17). The algorithm adjusts the cooling speed of the temperature (lines 18 to 23), and thus the execution time decreases. Finally, to avoid wasting time at low temperatures, where the quality of the result no longer improves, a stop criterion based on the least-squares method was implemented (lines 24 to 29). This stop criterion detects stochastic equilibrium over a number of Metropolis cycles. We measure the slope m (a global variable) of the linear regression of the energy over these cycles. In this regression, we used the coordinates (E_i, i), where i is in the range [2, κ_max]; in our case, κ_max = 5. Equilibrium is reached when m, calculated by Equation (2), is close to zero.
Equation (2) can be rewritten as Equation (3), where κ is the number of Metropolis cycles used to measure the slope, i is the index of each Metropolis cycle, and E_i is the energy at iteration i. Evaluating m with Equation (2) does not add significant execution time; the summations in Equation (3) [...]
The EGRSA (Algorithm 4) hybridizes GRSA with evolutionary techniques. It uses an evolutionary perturbation (EGRSApert) in the GRSA phase (line 7), where a genetic algorithm is applied. EGRSA starts with a set of individuals generated to determine the initial solution, denoted S_i. Then, in the Metropolis cycle, S_i is perturbed by EGRSApert to generate new solutions, and the best individual of the population is selected as the solution S_j (lines 9 and 10). Like GRSA, EGRSA applies a stop criterion based on the least-squares method [64,65] (see Algorithm 2.1; lines 24 to 29). Algorithm 5 presents the EGRSApert function, where an individual is a set of dihedral angles [ϕ1, ψ1, χ1, ω1, ϕ2, ψ2, χ2, ω2, ..., ϕn, ψn, χn, ωn] and a population is a set of individuals. Crossover and mutation operators are applied to generate new solutions through the perturbation function. Finally, when the number of generations is reached, the best individual of the population is selected. The EGRSA algorithm source code is available at https://github.com/DrJuanFraustoSolis/EGRSA.git (accessed on 28 April 2021).
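The golden-ratio cooling scheme and the least-squares stop criterion shared by GRSA and EGRSA can be sketched as follows. Several points are assumptions made for illustration: that each temperature cut is the previous one divided by the golden number, that the α values are evenly spaced between the lowest and highest value, and that Equations (2) and (3) are the standard least-squares slope and its closed form for equally spaced cycle indices i = 1, ..., κ. The function names are hypothetical.

```python
import math

GOLDEN = (1.0 + math.sqrt(5.0)) / 2.0  # golden number, about 1.618


def golden_temperature_cuts(t_initial, n_cuts=5):
    """Assumed scheme: successive cuts T_fp obtained by dividing by the golden number."""
    cuts, t = [], t_initial
    for _ in range(n_cuts):
        t /= GOLDEN
        cuts.append(t)
    return cuts


def alpha_schedule(alpha_min=0.75, alpha_max=0.95, n_cuts=5):
    """One cooling factor per cut, evenly spaced (step 0.05 with the defaults)."""
    step = (alpha_max - alpha_min) / (n_cuts - 1)
    return [round(alpha_min + i * step, 10) for i in range(n_cuts)]


def slope_general(energies):
    """Least-squares slope of energy versus cycle index i = 1..k."""
    k = len(energies)
    idx = range(1, k + 1)
    s_i = sum(idx)
    s_ii = sum(i * i for i in idx)
    s_e = sum(energies)
    s_ie = sum(i * e for i, e in zip(idx, energies))
    return (k * s_ie - s_i * s_e) / (k * s_ii - s_i ** 2)


def slope_closed_form(energies):
    """Same slope after substituting the closed forms of sum(i) and sum(i^2)."""
    k = len(energies)
    s_e = sum(energies)
    s_ie = sum(i * e for i, e in enumerate(energies, start=1))
    return (12.0 * s_ie - 6.0 * (k + 1) * s_e) / (k ** 3 - k)


def in_equilibrium(energies, tol=1e-3):
    """Stop criterion: the regression slope is close to zero (tol is illustrative)."""
    return abs(slope_closed_form(energies)) < tol
```

The closed form avoids recomputing the index sums at every check, since for a fixed number of cycles only the two energy-dependent sums change.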

Stop-criterion excerpt from the algorithm listing (lines 24 to 27): if T_k ≤ T_fpn then m = Equilibrium(E); if m ≈ ε then T_k = T [...]
The GRSA2 algorithm [23] is a hybridization of GRSA with the CRO algorithm [66]. GRSA2 (Algorithm 6) is an enhancement of GRSA and has the same structure as the algorithms reviewed above. It differs in two respects: the perturbation phase, which applies decomposition and soft collision (line 8), and the acceptance criterion (lines 10 to 14). Algorithm 7 shows the perturbation process implemented in the GRSA2pert function. GRSA2 uses two soft collisions (unimolecular and intermolecular). This algorithm has been applied only to the PFP, with a set of 19 peptides, and compared with the I-TASSER and PEP-FOLD3 approaches, obtaining outstanding results in the case of peptides [23]. The GRSA2 algorithm source code is available at https://github.com/DrJuanFraustoSolis/GRSA2.git (accessed on 28 April 2021).

GRSA-SSP Methodology
In this section, we present the GRSA-SSP methodology (Figure 1). This methodology has two main processes:
(a) The prediction of the torsion angles (initial solution) from the secondary structure, which corresponds to stages 1 to 3 in Figure 1.
(b) The refinement of the solution obtained from the secondary structure. This is performed with the GRSA algorithms shown in stage 4 (Figure 1).
The GRSA-SSP methodology has an input (amino acid sequence), an output (tertiary structure prediction), and four stages: (1) secondary structure prediction, (2) torsion angles prediction, (3) template construction, and (4) refinement by GRSAX algorithms. Next, we explain each of these stages:
Input (Amino acid sequence). The amino acid sequences are taken as input.
(1) Secondary structure prediction. The secondary structure corresponding to the amino acid sequence is predicted using PSI-PRED [57]. This algorithm generates a sequence profile with PSI-BLAST [67] and predicts the state of each residue: helix (H), strand (E), or coil (C). PSI-PRED calculates the probability of each possible state and defines the most likely structure.
(2) Torsion angles prediction. The secondary structure prediction is essential for this stage, where SPINE-X is used to obtain the torsion angles (ϕ, ψ, and ω) of each amino acid. This process relies on the Position-Specific Score Matrix and physical parameters [58]. SPINE-X applies artificial neural networks to obtain the best predictions for the target proteins.
(3) Model construction. In this stage, the torsion angles (the variables) are used to construct a template as the initial solution S_i = [ϕ1, ψ1, χ1, ω1, ϕ2, ψ2, χ2, ω2, ..., ϕn, ψn, χn, ωn], where the subscripts 1 to n index the amino acids and n depends on the length of the amino acid sequence of the target protein. The torsion angles define the backbone of the peptide on which the refinement will be performed.
(4) Refinement by GRSAX. Once the previous stages have constructed the peptide template, we can apply a GRSAX algorithm such as GRSA (renamed GRSA1), EGRSA (renamed GRSAE), or GRSA2, as well as the classical SA (GRSA0). The GRSAX algorithms are tested individually to compare which one obtains the best tertiary structure of the target peptide. Once the energy and the three-dimensional structure are obtained, the structure is evaluated with the RMSD and TM-score [29] metrics.
Output. The GRSAX-SSP algorithm obtains the tertiary structure prediction.
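Stage 3 (model construction) can be illustrated with a short sketch that flattens per-residue predicted angles into the initial solution vector handed to the refinement stage. The `ResidueAngles` type, the single χ angle per residue, and the default ω of 180 degrees are simplifying assumptions for this example; real residues have a variable number of side-chain angles, which is why the number of variables per peptide is not simply four times its length.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResidueAngles:
    """Predicted torsion angles of one residue, in degrees (hypothetical container)."""
    phi: float
    psi: float
    chi: float
    omega: float = 180.0  # assumed default: trans peptide bond


def wrap_angle(angle: float) -> float:
    """Map any angle in degrees into the interval [-180, 180)."""
    return ((angle + 180.0) % 360.0) - 180.0


def build_initial_solution(residues: List[ResidueAngles]) -> List[float]:
    """Flatten per-residue angles into [phi1, psi1, chi1, omega1, phi2, ...]."""
    solution: List[float] = []
    for res in residues:
        solution.extend(
            wrap_angle(a) for a in (res.phi, res.psi, res.chi, res.omega)
        )
    return solution
```

A vector built this way is the kind of starting point that a GRSAX algorithm would then perturb and evaluate with the energy function.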

Results
We implemented the following GRSAX-SSP algorithms with the proposed methodology: (a) GRSA0-SSP using classical SA [19], (b) GRSA1-SSP using the original GRSA [21], (c) GRSAE-SSP using EGRSA [22], and (d) GRSA2-SSP using GRSA2 [23]. All of them follow the methodology presented in Figure 1. The peptides in this experimentation have 9 to 49 amino acids, and the number of variables (torsion angles) for each peptide in this data set is in the range [49, 304]. We chose this set because these instances (peptides) were used before in the literature. This set was also useful for comparing the GRSA2-SSP algorithm with the top-performing CASP approaches that can be used for small peptides: I-TASSER, PEP-FOLD3, QUARK, and Rosetta, which are among the best algorithms in the CASP competition. To distinguish the GRSAX-SSP algorithms from those that apply only ab initio, we name the latter GRSAX. Table 1 presents the set of 45 instances, taken from [23,28,68,69] and sorted by the number of variables; a PDB code identifies each peptide. In the experimentation, each GRSAX-SSP algorithm was executed 30 times to validate the results. The ECEPP/2 energy function, which is the objective function of our optimization algorithms, is evaluated with the SMMP framework [38]. An analytical tuning [20] was performed to obtain the initial and final temperatures for each instance. In GRSA0-SSP, the α value is 0.95, and the temperature range has zero golden sections. The GRSA1-SSP, GRSAE-SSP, and GRSA2-SSP algorithms use the same cooling scheme, with α values from 0.75 to 0.95 and five golden ratio sections, as determined by experimentation [21][22][23].
The GRSAX-SSP algorithms were executed on one of the terminals of the Ehecatl cluster at TecNM/IT Ciudad Madero, which has the following characteristics: Intel® Xeon® processor at 2.30 GHz, 64 GB of memory (4 × 16 GB DDR4-2133), and the Linux CentOS operating system. The algorithms were implemented in Fortran.
We used the minimum energy, the RMSD, and the TM-score to evaluate the results; the latter two are metrics of structural quality used for PFP algorithms: (a) The RMSD is a structural measure between the native structure and the one predicted by GRSAX-SSP or by classical SA (named here GRSA0). An RMSD close to zero indicates a structure of excellent quality; the larger the RMSD, the worse the quality. (b) The TM-score is also used to measure the similarity between two structures. A TM-score greater than 0.5 indicates good similarity between the two structures, and the tested one has the same fold. Otherwise, when the TM-score is lower than 0.5, the target peptide has a different fold [29].
The TM-score can be calculated using TM-align [70] (an algorithm to obtain the best structural alignment between two proteins) or with the classical formulation [29]. In this paper, we use the classical formulation of the TM-score.
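The two structural metrics can be sketched with NumPy as follows. This is an illustrative simplification under stated assumptions: both structures are given as matched (N, 3) coordinate arrays (for example, Cα atoms), the superposition is the single least-squares (Kabsch) fit rather than the search over superpositions used by the reference TM-score program, and the distance scale d0 is floored at 0.5 Å for short chains. The function names are hypothetical.

```python
import numpy as np


def kabsch_superpose(P, Q):
    """Return P optimally rotated and translated onto Q (least-squares fit)."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    H = Pc.T @ Qc                      # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])         # avoid improper rotations (reflections)
    R = Vt.T @ D @ U.T
    return Pc @ R.T + Q.mean(axis=0)


def rmsd(P, Q):
    """Root-mean-square deviation after optimal superposition."""
    diff = kabsch_superpose(P, Q) - np.asarray(Q, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


def tm_score_single_fit(P, Q):
    """TM-score formula evaluated on one Kabsch superposition (a lower bound)."""
    Q = np.asarray(Q, dtype=float)
    L = len(Q)
    # Assumed handling of short chains: floor d0 at 0.5 Å.
    d0 = max(0.5, 1.24 * (L - 15) ** (1.0 / 3.0) - 1.8) if L > 15 else 0.5
    d = np.linalg.norm(kabsch_superpose(P, Q) - Q, axis=1)
    return float(np.mean(1.0 / (1.0 + (d / d0) ** 2)))
```

Because the score is length-normalized through d0, it is less sensitive than the RMSD to a few badly placed residues, which is one reason both metrics are reported together.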
GRSAX-SSP algorithms start from a model determined by the secondary structure, which is then refined to obtain a better prediction. The results are compared with the ab initio GRSAX algorithms, which use only the amino acid sequence as information. Figures 2-5 show average results for energy (kcal/mol), RMSD, and TM-score for each peptide. The numbers on the x-axis represent the instances or peptides of Table 1, and each instance is a set of torsional angles X = [ϕ1, ψ1, χ1, ω1, ϕ2, ψ2, χ2, ω2, ..., ϕn, ψn, χn, ωn] associated with its amino acids. We averaged the results of 30 executions for comparison.
Figure 2 shows that GRSA0-SSP behaves better than GRSA0 (classical SA). Note that GRSA0-SSP obtained the lowest energy in all the peptides. Regarding the RMSD, GRSA0-SSP is more stable in the small instances (1 to 16), and in the remaining instances the behavior of both algorithms is equal. For the TM-score, the behavior is, in general, similar. In conclusion, implementing this methodology in GRSA0-SSP yields slightly improved results for these instances.
Figure 3 presents the comparison of GRSA1-SSP versus GRSA1 with the same metrics over the 45 instances evaluated. In terms of energy, RMSD, and TM-score, the performance of GRSA1-SSP is equivalent to that of GRSA1.
Figure 3 presents the comparison of GRSA1-SSP versus GRSA1 with the same metrics over the 45 instances evaluated. In terms of energy, RMSD, and TM-score, the performance of GRSA1-SSP is equivalent to that of GRSA1.
In Figure 5, we present the comparison of GRSA2 versus GRSA2-SSP. Note that the results obtained in every instance are remarkable and show the superiority of GRSA2-SSP in the metrics of energy, RMSD, and TM-score. In this case, we applied the GRSA-SSP methodology to improve the behavior of the classical GRSA2 algorithm.
Finally, in Figure 6, we present the comparison of the GRSAX-SSP family of algorithms. We observe that GRSA2-SSP obtains the best values in several instances, outperforming the other algorithms. Therefore, GRSA2-SSP shows the best behavior among the algorithms with secondary structure prediction.
Furthermore, Figure 7 presents the computational time of the GRSAX-SSP family of algorithms. GRSA2-SSP has the best time behavior, with lower values than the other algorithms in most of the instances.
Table 2 presents the results obtained by GRSA2-SSP. For each instance, we show the best TM-score and its RMSD. Additionally, we calculated the average of the RMSD and TM-score for the five best predictions.
Complementing the results, we determined the standard deviation (std) of the RMSD and TM-score for the five best predictions and included the predominant type of secondary structure: A (mainly alpha), B (mainly beta), and N (mainly none). This classification as A, B, and N is based on the secondary structure predominating in each peptide [27,68,69,71,72]. We sort Table 2 by the number of amino acids to compare the best results obtained by GRSA2-SSP with the best algorithms of the literature. This comparison is presented in Figures 9-11.
Note: PDB code (Instance), number of amino acids (aa), SS is the predominant secondary structure type: beta strand (B), alpha-helix (A) and none (N), TM 1 = TM-score.
Figure 8 shows the GRSA2-SSP algorithm performance with instances classified by secondary structure.
We show that the GRSA2-SSP algorithm has the best behavior in alpha structure instances, evaluated with TM-score in Figure 8a and RMSD in Figure 8b. These values are the best obtained using TM-score and their RMSD. In Figure 8c,d, we present the TM-score average for the five best predictions and their RMSD average.
In Figures 9-11, we present the behavior of the GRSA2-SSP algorithm and compare it with the results obtained from the approaches PEP-FOLD3, I-TASSER, QUARK, and Rosetta. We divided the dataset of Table 1 into three groups of 15 instances; groups 1, 2, and 3 contain instances 1-15, 16-30, and 31-45, respectively. We compared these groups using the metrics RMSD, TM-score, GDT-TS [73], and TM-score (classical), and we present the best TM-score, the average of the five best predictions of the TM-score, and their RMSD. Additionally, we present the GDT-TS average and TM-score average.
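The quantities reported in this comparison can be sketched in Python as follows. The code assumes per-residue distances measured after one fixed superposition, whereas GDT-TS as normally computed searches over many superpositions; the function names `gdt_ts` and `summarize_top`, the (TM-score, RMSD) tuple layout, and the use of the sample standard deviation are all assumptions made for illustration.

```python
from statistics import mean, stdev


def gdt_ts(distances, target_length, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT-TS in [0, 1]: mean fraction of residues within 1, 2, 4 and 8 angstroms."""
    fractions = [
        sum(1 for d in distances if d <= cutoff) / target_length
        for cutoff in cutoffs
    ]
    return sum(fractions) / len(cutoffs)


def summarize_top(predictions, k=5):
    """Summarize the k best predictions, ranked by TM-score (highest first).

    `predictions` is a list of (tm_score, rmsd) tuples, one per model.
    At least two predictions are needed for the standard deviation.
    """
    top = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    tms = [p[0] for p in top]
    rmsds = [p[1] for p in top]
    return {
        "best_tm": tms[0],
        "best_rmsd": rmsds[0],      # RMSD of the model with the best TM-score
        "tm_mean": mean(tms),
        "tm_std": stdev(tms),       # assumption: sample standard deviation
        "rmsd_mean": mean(rmsds),
        "rmsd_std": stdev(rmsds),
    }


if __name__ == "__main__":
    # Placeholder values for illustration only; they are not results from the paper.
    models = [(0.2, 5.0), (0.6, 2.0), (0.4, 3.0), (0.5, 2.5), (0.3, 4.0), (0.1, 6.0)]
    print(summarize_top(models))
    print(gdt_ts([0.5, 1.5, 3.0, 7.0], 4))
```

Ranking by TM-score and then reporting the RMSD of the same models keeps the two metrics paired, which matches the "best TM-score and their RMSD" wording used above.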
In Figure 9, we present the comparison of the first group. We observed that GRSA2-SSP behaves similarly to I-TASSER and PEP-FOLD3, although in this group of small peptides PEP-FOLD3 is slightly better than our algorithm and I-TASSER when GDT-TS is compared (Figure 9e). Overall, our algorithm is competitive in this group. Rosetta and QUARK were not included in this comparison because the minimum numbers of amino acids they can predict are 27 and 20, respectively.
Figure 10 compares the second group of 16 to 30 amino acids, using the best and the five best predictions according to the TM-score metric, their RMSD, and the GDT-TS average. In this comparison, we added the QUARK results for the second group of instances; Rosetta was omitted because it is unable to predict most of the instances of this group.
In Figure 10a, we observe very similar behavior among GRSA2-SSP, PEP-FOLD3, I-TASSER, and QUARK. Note that in this figure, GRSA2-SSP and PEP-FOLD3 obtain the best prediction. In Figure 10c, when the five best predictions are compared, I-TASSER obtains the best results, followed by PEP-FOLD3 and GRSA2-SSP. Additionally, when the RMSD average is compared (Figure 10d), I-TASSER is the best, followed by PEP-FOLD3 and GRSA2-SSP. Finally, in Figure 10e, when GDT-TS is compared, GRSA2-SSP performs similarly to PEP-FOLD3, I-TASSER, and QUARK. According to this figure, GRSA2-SSP and I-TASSER obtained a similar average.
Figure 11 compares the third group of 31 to 49 amino acids with the five best results obtained using the TM-score metric, their RMSD, and GDT-TS.
This comparison adds the Rosetta approach because it can process the number of aa in this group. As we observe, the best algorithm is I-TASSER, followed by Rosetta, QUARK, PEP-FOLD3, and finally GRSA2-SSP.
The 45 instances evaluated in the experimentation above show that applying the secondary structure results and refining them with the GRSAX algorithms enhances the performance in energy, RMSD, and TM-score. Specifically, when GRSA2-SSP is compared with PEP-FOLD3, I-TASSER, QUARK, and Rosetta, we observed that our algorithm performs well in small instances (Groups 1 and 2). In the largest instances, however, our algorithm is not the best, but it is competitive.
We carried out a second experimentation with six mini-proteins (5wll, 5lo2, 5up5, 5uoi, 2ki0, and 2kik) presented in Table 3. The mini-proteins come from the de novo protein design field [74][75][76][77][78]. This data set was proposed to observe the behavior of our best algorithm in these kinds of instances. We applied the same evaluation to all the algorithms as in the first experimentation, using the RMSD, TM-score, and GDT-TS metrics. Table 4 shows the results of all the algorithms in this data set. Evaluating them with TM-score and GDT-TS, we observe that the best algorithms were Rosetta, I-TASSER, and GRSA2-SSP, which achieved the best results 3, 2, and 1 times, respectively. Additionally, evaluating with the RMSD, the best algorithms were again Rosetta, I-TASSER, and GRSA2-SSP, but this time each obtained the best result in two instances: (5uoi, 2kik), (2ki0, 5up5), and (5wll, 5lo2), respectively. As a result, we can say that Rosetta is the best algorithm, followed by I-TASSER and GRSA2-SSP.

Conclusions
In this paper, we present the GRSA-SSP methodology for the Protein Folding Problem applied to peptides. The objective of this problem is to predict the functional three-dimensional structure of a protein. The algorithms developed with this methodology are GRSA0-SSP, GRSA1-SSP, GRSAE-SSP, and GRSA2-SSP. The main relevance of the GRSA2-SSP algorithm is that it produces very good results in the case of peptides; specifically, it is similar to or better than the algorithms Rosetta, PEP-FOLD3, QUARK, and I-TASSER for the small and medium peptides, according to the experimentation presented. These latter algorithms have traditionally been among the best of the CASP competition; besides, they use modern machine learning techniques such as artificial neural networks.
We compared the algorithms developed with the original algorithms GRSA0, GRSA1, GRSAE, and GRSA2; we used a data set of 45 instances for this comparison. We showed that the hybrid algorithms produced with the GRSA-SSP methodology outperform the original ones. For this comparison, we used the metrics Energy, RMSD, TM-score, and execution time. We observed that the best of all these algorithms is GRSA2-SSP formulated with the proposed methodology.
We made a second evaluation comparing the GRSA2-SSP algorithm with the best state-of-the-art algorithms, for which we selected PEP-FOLD3, I-TASSER, QUARK, and Rosetta. We used the same data set of 45 instances, divided into three groups from small to large peptides. The experimentation shows that for groups 1 and 2, GRSA2-SSP performs as well as these algorithms. We observe that for the first group PEP-FOLD3 was the best, followed by GRSA2-SSP, while in the second group, the best algorithm was I-TASSER, followed by GRSA2-SSP and PEP-FOLD3. Finally, in the third group, the best algorithm was Rosetta, followed by I-TASSER. Additionally, we present an analysis of GRSA2-SSP results for each type of secondary structure, obtaining a better behavior with alpha structures.
Furthermore, we assessed GRSA2-SSP with a second data set of six instances named mini-proteins. The GRSA2-SSP results were compared with PEP-FOLD3, I-TASSER, QUARK, and Rosetta. The best algorithms in this data set were Rosetta, I-TASSER, and GRSA2-SSP, which achieved the best TM-score and GDT-TS in 3, 2, and 1 instances, respectively. However, each of the three took first place twice when RMSD was evaluated. As a result, the best of these algorithms for this data set is Rosetta, followed by I-TASSER and GRSA2-SSP.
We conclude that GRSAX-SSP algorithms enhance the original GRSA algorithms. The best of them is GRSA2-SSP, which achieves very good results, matching or surpassing the best state-of-the-art approaches for peptides of up to thirty amino acids. Finally, we note that the main advantage of our methodology is that it is simpler than the most powerful approaches of the literature.