HIGA: A Running History Information Guided Genetic Algorithm for Protein–Ligand Docking

Protein-ligand docking is an essential part of computer-aided drug design, and it identifies the binding patterns of proteins and ligands by computer simulation. Though Lamarckian genetic algorithm (LGA) has demonstrated excellent performance in terms of protein-ligand docking problems, it can not memorize the history information that it has accessed, rendering it effort-consuming to discover some promising solutions. This article illustrates a novel optimization algorithm (HIGA), which is based on LGA for solving the protein-ligand docking problems with an aim to overcome the drawback mentioned above. A running history information guided model, which includes CE crossover, ED mutation, and BSP tree, is applied in the method. The novel algorithm is more efficient to find the lowest energy of protein-ligand docking. We evaluate the performance of HIGA in comparison with GA, LGA, EDGA, CEPGA, SODOCK, and ABC, the results of which indicate that HIGA outperforms other search algorithms.


Introduction
Drug molecular design, as a new drug research method and means, has achieved a lot of theoretical and practical research findings [1][2][3][4]. Protein-ligand docking is a typical method for structure-based drug discovery and design, the aim of which is to find the best ligand conformation of a ligand against a protein receptor target with the lowest energy [5][6][7][8][9]. The progress of X-ray diffraction technology of biological macromolecules provides us with more important structures of proteins and ligands. These structures can be used as targets for bioactive substances to control diseases in animals and plants, and they allow for people to understand the biological mechanisms of active substances simply [10][11][12]. The rapid development of computer technology has promoted the further development of molecular drug design effectively and significantly reduced the cost of drug research. For the docking process, the ligands are placed at the active site of the receptors, then the position and orientation of the ligands are adjusted by some binding and complementary principles, and then the optimal binding modes are obtained finally.
A search algorithm and a scoring function are the basic tools of a docking method for solving the two goals above. The scoring function is used to evaluate the binding conformation of ligands and receptors that were predicted by computer simulation [13][14][15][16][17]. In the docking process, it is necessary to obtain the binding affinity accurately as the basis for optimization. The scoring function not only provides a fast method for evaluating the binding affinity, but it also assists a docking in efficiently exploring the binding space of a ligand [18,19]. The score function can be directly used as the fitness function of optimization algorithms.
The search algorithm aims to identify the optimal binding mode between ligands and receptors, including the location of small ligands relative to proteins and conformational changes in molecules [20,21]. (1) 3ptb β-trypsin/ben (benzamidine) β-Trypsin, isolated from the pancreas of pigs, sheep, as well as cattle, is used as a protease. Benzamidine, an inhibitor, is generally utilized in suppressing proteolysis of proteins.
(4) 1phg cytochrome P450-cam/hem (protoporphyrin IX) Cytochrome P450-cam, participating in metabolism of exogenous, as well as endogenous substances, is a superfamily of heme-thiolate proteins. Protoporphyrin IX, a purple brown crystalline powder, can dissolve in methanol, while is not soluble in ether, chloroform, acetone, or water.
(6) 1stp streptavidin/btn (biotin) Streptavidin, a protein obtained from streptomyces, harbors a similar biological features with affinity. Biotin, a member of B vitamins, plays a critical role in normal metabolism of proteins as well as fats.
Thrombin is a formless, white to gray, freeze-dried powder, and 4qq is regarded as a non-polymer suppressor.
(10) 1hri human rhinovirus/s57 Human rhinovirus causes the majority of human common cold. s57 is a member of imidazole.
(11) 1hvr protease/xk2 Protease, an enzyme widely found in animals as well as plants, is capable of catalyzing protein catabolism. xk2, a small molecule inhibitor, is able to decrease or even prohibit chemical reaction rate.
(12) 4hmg hemagglutinin/sia (sialic acid) Hemagglutinin is the cause of coagulation of erythrocytes. Sialic acids, generated at terminal sugars, are members of acidic monosaccharides.
(13) 1cdg cyclodextrin glycosyl transferase/mal (maltose) Cyclodextrin glycosyltransferase, a bacterial enzyme, is able to produce cyclodextrins. Maltose, made of starch, as well as malt, is widely utilized as nutrient as well as culture medium. (14) 1htf HIV-1 protease/g26 HIV-1 protease is capable of separating newly-generated polyproteins into individual peptides. g26, a non-polymer suppressor, is an amide with easily oxidizable and highly reactive perssad.
(16) 1tmn thermolysin/nas (2-naphthalenesulfonic acid) Thermolysin, a biological component, is featured by a more rapid hydrolysis of hydrophobic amino acids. 2-naphthalenesulfonic acid, white powder or crystal, can dissolve in water but not in alcohol, which is widely adopted in organic synthesis.

Comparison of Energy and RMSD
The primary objective of our experiment is to find the lowest energy. The values of the lowest energy, calculated by the semi-empirical free energy force field [15], is the most important criterion to evaluate the performance of the compared algorithms. Root-mean-square positional deviation (RMSD) is also the commonly used standard to evaluate the molecular docking results. RMSD compares the optimal docking structure with the experimentally measured actual structure. If the RMSD is smaller than a given threshold 2.0 Å after docking, then the docking can be considered successful. Each algorithm runs one time in Dataset 1. The success of the docking is recorded, Average RMSD (all cases) and Average RMSD (RMSD < 2 Å) are calculated (Table 1). For the number of success cases and the Average RMSD, HIGA is obviously superior to other algorithms. Each algorithm runs twenty times in Dataset 2, the lowest energy and RMSD are recorded, and the results are showed in Table 2. For the lowest energy of the sixteen complexes, HIGA finds the twelve lowest values, EDGA finds the two lowest values, CEPGA and SODOCK find the one lowest value respectively, and GA, LGA, ABC do not find any of the lowest values. Although HIGA does not find the lowest energy in 1aha, 1ett, 4hmg, and 1htf, the energy values found are still promising. For example, the energy −16.15 kcal mol −1 found by HIGA is close the lowest energy −16.23 kcal mol −1 found by EDGA in 1aha; the energy −21.17 kcal mol −1 found by HIGA is close the lowest energy −21.79 kcal mol −1 found by SODOCK in 1htf. The number of the lowest RMSD that is found by HIGA, GA, LGA, EDGA, CEPGA, ABC SODOCK is 7, 2, 1, 2, 1, 1 and 2, respectively. HIGA has no absolute advantage in finding the lowest RMSD, but it is better than the other algorithms. In conclusion, the best search method is HIGA with regard to its average performance. Table 1. Results of success case and average root-mean-square positional deviation (RMSD).

Cluster Analysis of Docked Conformations
After twenty times docking of each complex in Dataset 2, twenty docked conformations are obtained. These conformations exhaustively compared to one another to determine similarities, and they are clustered if they are similar enough. The range of the formed clusters is 0 to 20. The clusters are ranked in order of increasing energy, by the lowest energy in each cluster. Rank 1 is the lowest energy cluster, it refers to how often the structure with the lowest energy is found. The concentration of the clusters and the docked structures in rank 1 can reflect the stability of the algorithm. The cluster analysis is performed in Dataset 2, and the results are shown in Table 3. The mean of the number of clusters found is lowest for HIGA (3.72), followed by CEPGA (4.04), EDGA (4.24), LGA (4.58), SODOCK (10.30), ABC (6.52), and finally GA (13.04). The mean of the number of docked structures in rank 1 is 17.04 for HIGA, 16.20 for CEPGA, 15.92 for EDGA, 15.72 for LGA, 11.82 for SODOCK, 8.70 for ABC, and 8.00 for GA. Hence, the most reliable search method is HIGA.

Convergence Analysis
Convergence means that the convergence curve of the objective solution tends to be stable after several iterations. The convergence diagrams of seven different algorithms for solving different test cases of Dataset 2 are shown in Figure 2. The number of iterations is 3000, 6000, 9000, 12,000, 15,000, 18,000, 21,000, 24,000 and 27,000, respectively, and these values are used as the horizontal axis of the convergence diagrams. The energy of each algorithm under different iteration times is calculated as the vertical axis. At the early stage of each algorithm, the energy value decreases as the number of iterations increases. But, in the later stage, the energy values of some algorithms tend to be fixed. This phenomenon, which is caused by the decrease of population diversity and the loss of evolutionary capacity, is called premature convergence. For example, LGA, EDGA, and SODOCK are prematurely convergent after iterating 15,000 times in 2mcp; GA and ABC are prematurely convergent after iterating 18,000 times in 2mcp; LGA, EDGA, SODOCK, and ABC are prematurely convergent after iterating 21,000 times in 6rnt. From these graphs, it can be concluded that HIGA is superior to other algorithms regarding preventing premature and solution quality.

Data Distribution Analysis
The data distribution can reflect the concentration of the data and the stability of the algorithm. We calculate the minimum, the first quartile, the median, the third quartile, and the maximum of the energy values of each PDB, and then we use the five statistical quantities to draw the box plots ( Figure 3). The median, which is not affected by the extreme data, is suitable as a centralized trend value. It is evident that the median energy of HIGA is lower than that of the other algorithms. The first quartile is the upper boundary of the box, the third quartile is the lower boundary of the box, and the data distribution is concentrated or dispersed and can be determined by observing the box. It can be seen that the data distribution of HIGA is the most concentrated. The points outside the maximum and minimum are the outliers, and the outliers have an undesirable consequence of data distribution. For example, the outlier of GA in 1glq; the outlier of LGA in 3hvt; the outlier of EDGA in 1stp; the outlier of SODOCK in 1cdg; the outlier of ABC in 4hmg. HIGA and CEPGA have no outliers. It can be concluded that HIGA is a stable method for protein-ligand docking.

Execution Time Analysis
The execution time of the seven compared algorithms for solving different complexes of Dataset 2 are shown in Table 4. The time is recorded by how many seconds per run. There a direct relation between the execution time and the problem complexity. In addition to a few complexes, the execution time of the algorithms increases as the number of rotatable bonds increases. As seen from the table, the execution time of GA is the fastest, followed by LGA, EDGA, CEPGA, HIGA, ABC, and SODOCK. The best performance algorithm HIGA in previous experiments does not play an advantage in the time performance. However, the execution time of HIGA is not greatly increased when compared to the fastest algorithm GA. This can demonstrate that HIGA does not raise the performance at the cost of increasing the execution time.

Comparison Based on the Hypothesis Test
We use the hypothesis test to determine the difference between each algorithm in Table 5. Five of the best values obtained by each algorithm for every PDB are taken, and the critical value of hypothesis test is set to 0.05. When comparing algorithm A to algorithm B, we can conclude the algorithm A is significantly different from the algorithm B if the p-value is less than 0.05. The significant difference demonstrates that the algorithm A is superior to the algorithm B. As seen from the table, HIGA, EDGA, and CEPGA are better than GA, LGA, SODOCK, and ABC in most of the PDBs. Furthermore, HIGA is better than EDGA and CEPGA according to the p-value. We can make a conclusion from the statistical analysis that HIGA is significantly better than the other algorithms.

Running History Information Guided Genetic Algorithm
HIGA is an LGA-based hybrid search algorithm that is designed for protein-ligand docking. GA is an algorithm for simulating natural evolution process, and it generates solutions to optimise problems with the assistance of mutation, selection, and crossover. In each iteration of the genetic algorithm, the individual selection is made according to the fitness of the individual in the feasible region of the problem. Thus, a new and better approximate solution is generated. GA is simple, easy to realize, and has high robustness, so it has been successfully applied to solve the protein-ligand docking problem.
LGA, which combines GA with local search method, searches the potential energy surface rapidly by GA and optimises the potential energy surface by local search method.
LGA is proved to be an effective method for the protein-ligand docking, but the solutions that are obtained by LGA are unstable because of its randomness. Therefore, HIGA is proposed to overcome the drawback. Three core mechanisms are used to enhance the performance of HIGA for docking problems, and they are CE crossover, ED muation, and binary space partitioning (BSP) tree. CE crossover uses history information to improve the randomness of crossover operation. ED muation can guide the solutions to evolve in a better direction. BSP tree can memorize evaluated solutions that it has visited so that the diversity of solutions is maintained and the waste of computer resources is reduced. Taking the three techniques into consideration, the novel algorithm is guided to find some promising solutions by the running history information. Figure 4 is the block diagram of HIGA, and the grey parts are the three innovative mechanisms, which are newly added into LGA. HIGA first initializes the population randomly, and then obtains the next population by CE crossover, ED mutation, BSP tree, local search, selection and fitness evaluation. The process is iterated until a preset termination condition is reached. proved to be an effective method for the protein-ligand docking, but the solutions that are obtained by LGA are unstable because of its randomness. Therefore, HIGA is proposed to overcome the drawback. Three core mechanisms are used to enhance the performance of HIGA for docking problems, and they are CE crossover, ED muation, and binary space partitioning (BSP) tree. CE crossover uses history information to improve the randomness of crossover operation. ED muation can guide the solutions to evolve in a better direction. BSP tree can memorize evaluated solutions that it has visited so that the diversity of solutions is maintained and the waste of computer resources is reduced.
Taking the three techniques into consideration, the novel algorithm is guided to find some promising solutions by the running history information. Figure 4 is the block diagram of HIGA, and the grey parts are the three innovative mechanisms, which are newly added into LGA. HIGA first initializes the population randomly, and then obtains the next population by CE crossover, ED mutation, BSP tree, local search, selection and fitness evaluation. The process is iterated until a preset termination condition is reached.

CE Crossover
Since the parents of elite individuals are not preserved in the crossover process of LGA, The chances of getting better solutions are reduced by the subsequent crossover operation. CE crossover is proposed to solve the problem in this section. This new crossover can preserve the parents of the elite individuals with the best solution so that the good genes can be extended to improve the quality of the next-generation population. Table 6 shows the pseudocode of CE crossover. In the strategy, Mfather and Mmother are introduced, in which Mfather and Mmother represent the parents of the elite individual, e. The individual with current optimum fitness is selected as the elite individual at current iteration, and Mfather and Mmother are saved. The preserved individuals and the individuals of next-generation form a new population. The genes of these preserved individuals are excellent, and the possibility that they continue to reproduce individuals with good genes is greater. For each iteration, the number of elite individuals is set up to a certain percentage of the number of individuals, and the default is ten percent.

CE Crossover
Since the parents of elite individuals are not preserved in the crossover process of LGA, the chances of getting better solutions are reduced by the subsequent crossover operation. CE crossover is proposed to solve the problem in this section. This new crossover can preserve the parents of the elite individuals with the best solution so that the good genes can be extended to improve the quality of the next-generation population. Table 6 shows the pseudocode of CE crossover. In the strategy, M father and M mother are introduced, in which M father and M mother represent the parents of the elite individual, e. The individual with current optimum fitness is selected as the elite individual at current iteration, and M father and M mother are saved. The preserved individuals and the individuals of next-generation form a new population. The genes of these preserved individuals are excellent, and the possibility that they continue to reproduce individuals with good genes is greater. For each iteration, the number of elite individuals is set up to a certain percentage of the number of individuals, and the default is ten percent.
Find the historial optimal individual m 0 03.
If the fitness of current individual m i < m 0 then 04. m 0 = m i 05. e = m 0 06.
preserve M father and M mother 07.
End if 08. M father , M mother and e ⊂ next population 09. End for

ED Mutation
In LGA, some genes of the individuals are random changes by mutation operation, which results in the search direction of the algorithm is also random and aimless. At the early stage of this algorithm, the randomness plays a very positive guiding effect for global search. It is because the optimal solution in which the direction is unknown in the case of no search experience. With the continuous iteration of the algorithm, the group search experience is accumulating, the direction is gradually clear, and the search space starts to converge gradually. This means that the current best solution is inevitably abandoned, even the global best solution. Therefore, ED mutation is proposed to optimize the search direction. The pseudocode of ED mutation is shown in Table 7. Where m i is the current solution; θ and δ are random numbers; β is a particular adjustable parameter, M optimum is the historic optimal solution; and, M sub represents is the historic suboptimal solution; M max is the maximum solution; M min is the minimum solution. In the formula, the historic optimal solution and the historic suboptimal solution are used to ensure the correctness of the optimal solution direction. This mechanism is similar to the addition of vectors, and the addition can ensure that the algorithm evolves towards a better direction. Table 7. Pseudocode of ED mutation.

Algorithm: ED Mutation
Input: (1) a population with n indiviudals, (2) balance factor β. Output: a population after ED mutation 01. For i: = 1 to n do 02. Find the optimal solution M optimum and the historic suboptimal solution M sub 03. If θ < β then 04.

BSP Tree
The BSP tree that has been applied in non-revisiting genetic algorithm (NrGA) is used to store the evaluated solutions, and it divides the search space in accordance with the cumulative distribution of the solutions being evaluated. The BSP tree has only one root node at the beginning, and this node represents the entire search space. Each node that is subsequently inserted into BSP tree represents a subspace. If a parent node has two child nodes (l and r), the subspaces that are represented by the two child nodes are disjoint, and the sum of the two spaces is the subspace corresponding to the parent node. BSP tree is different from with Octree employed in QuickVina-W [41]. BSP tree stores all of the solutions that the algorithm has searched before, while Octree stores high-quality history points which are the output of last iteration of local optimization from all of the searching threads during the runtime. Table 8 shows the pseudo code for the working principle of BSP tree. Where RF is the revisit flag; and, d is the distance between two nodes. The position of each previous individual generated by the algorithm is recorded in a node of the tree. When the new generated individual m visits the node l or r, their positions are checked. If m = l or m = r, m is revisited and RF is 1. If the solution is revisited, it mutates to a nearest unvisited neighbor subspace. BSP tree can guarantee the location of all solutions is different so that the diversity of the individuals is maintained and the sample space of GA is increased.  If (m = l) or (m = r) then 06. RF = 1 07.
Insert a child node to Curr_node that records m 17.
Creat a new child node by mutating 19.
End if 20. End if

Conclusions
The article presents HIGA, which combines binary space partitioning (BSP), ED muation, and CE crossover to extend the power of the LGA-based algorithm. CE crossover amplifies the probability of repeated use of the individual with good genes, so that HIGA is more suitable to the changes of the environment evolution. ED mutation provides a guarantee for the evolutionary direction of HIGA, and its concept originates from the property of vector addition. By using the BSP tree, the search algorithm can not only remove revisits, but also guide to search for the next unvisited position. We have compared the performance of HIGA, GA, LGA, EDGA, CEPGA, SODOCK, and ABC, the results of which indicate that HIGA outperforms that the other algorithms, suggesting that HIGA can enhance the power of AutoDock to protein-ligand docking.