An Efficient ABC_DE_Based Hybrid Algorithm for Protein–Ligand Docking

Protein–ligand docking is a process of searching for the optimal binding conformation between the receptor and the ligand. Automated docking plays an important role in drug design, and an efficient search algorithm is needed to tackle the docking problem. To tackle the protein–ligand docking problem more efficiently, An ABC_DE_based hybrid algorithm (ADHDOCK), integrating artificial bee colony (ABC) algorithm and differential evolution (DE) algorithm, is proposed in the article. ADHDOCK applies an adaptive population partition (APP) mechanism to reasonably allocate the computational resources of the population in each iteration process, which helps the novel method make better use of the advantages of ABC and DE. The experiment tested fifty protein–ligand docking problems to compare the performance of ADHDOCK, ABC, DE, Lamarckian genetic algorithm (LGA), running history information guided genetic algorithm (HIGA), and swarm optimization for highly flexible protein–ligand docking (SODOCK). The results clearly exhibit the capability of ADHDOCK toward finding the lowest energy and the smallest root-mean-square deviation (RMSD) on most of the protein–ligand docking problems with respect to the other five algorithms.


Introduction
The development of new drugs is costly and inefficient, so it is urgent to apply new theoretical methods and new technologies to improve it. Computer aided drug design (CADD) is developed gradually under the strong impetus of this social demand [1,2]. CADD takes advantage of advanced multidisciplinary technology, methods, and achievements, and has been a necessary basic tool for drug design. Its application shortens the process of drug research and reduces the cost of drug discovery. Protein-ligand docking, as an important part of CADD, is a computer simulation to predict the binding pose when the three-dimensional structures of protein receptors and ligands are known [3][4][5][6]. The purpose of protein-ligand docking is to find the conformation with the lowest energy when a ligand binds the active region of a receptor. Protein-ligand docking has been widely used in understanding molecular biological functions, predicting drug toxicity, screening virtual drugs and so on [7][8][9][10]. For the main process of protein-ligand docking, all possible active sites of the receptor are first detected. Then, the active site that binds to the ligand is determined, the position of the ligand is constantly adjusted, and different conformations are obtained according to the complementary principle of docking. The results are sorted by a specific scoring function, and the binding pose with the lowest energy score is finally found.
A scoring function [11][12][13][14][15][16] and a search algorithm [17][18][19] are the necessary tools of a docking method for solving the two goals above. The scoring function is used to evaluate the affinity between the receptor and the ligand for each conformation [20]. Scoring functions express the geometric complementarity and the energy strength of the interaction based on the physicochemical characteristics of the amino acids in contact with each other [21]. In the docking process, it is necessary to obtain the binding affinity accurately as the basis for optimization. Furthermore, the scoring function can also effectively help the docking to explore the binding space of the ligand. The scoring function can be directly used as the fitness function of the search algorithm.
The purpose of the search algorithm is to identify the optimal binding pose between the receptor and the ligand. The performance of the search algorithm directly affects the efficiency of molecular docking. The ideal search algorithm should be able to enumerate all possible binding poses between ligands and receptors, but this is difficult to achieve because the search space involved in molecular docking is huge. Many evolutionary computation methods have been presented for solving protein-ligand docking problems [22][23][24][25][26][27][28]: for example, simulated annealing (SA) [29], genetic algorithm (GA) [30], Lamarckian genetic algorithm (LGA) [31], running history information guided genetic algorithm (HIGA) [32], and swarm optimization for highly flexible protein-ligand docking (SODOCK) [33]. These methods have been applied to solve docking problems, but they have some drawbacks. As two standard evolutionary algorithms, SA and GA are not combined with local searches, so they can no longer search better solutions in the neighborhood of the current solution and the diversity of solutions cannot be maintained.
LGA is a hybrid of GA and local search, HIGA is an improved algorithm based on running history information, and SODOCK is a hybrid of particle swarm optimization (PSO) and local search. Although LGA, HIGA, and SODOCK have local searches, their main disadvantage is the high probability of becoming trapped by local optima because of their single search strategy. Therefore, developing a more efficient and reliable search algorithm is necessary.
To solve the protein-ligand docking problem more efficiently, an efficient ABC_DE_based hybrid algorithm for protein-ligand docking (ADHDOCK) is proposed in the article. The novel method is a hybrid of artificial bee colony algorithm [34] and differential evolution algorithm [35]. Artificial bee colony (ABC) and differential evolution (DE) are two practical evolutionary computation methods, and they have been widely used in various applications. In ADHDOCK, these two algorithms share the same population and execute in parallel. An adaptive population partition (APP) mechanism incorporated into ADHDOCK is applied to automatically partition the population into two subpopulations and allocate them to ABC and DE. The reasonable allocation of computing resources makes the novel method make better use of the advantages of ABC and DE and prevents the single use of one algorithm from falling into the local optima.
The environment and the scoring function of AutoDock 4.2.6 are used as the experimental platform in the article [36]. AutoDock, as a free and open source molecular docking program, is one of the most commonly used docking software, and it provides great convenience for researchers to develop new algorithms on the existing program. AutoDock first forms a box by using the amino acid residues around the active site of the receptor. Subsequently, the program scans with different types of atoms as probes, calculates the energy of the grid, and searches for the ligand within the range of the box. Finally, every complex is scored according to the different conformation, orientation, and position of the ligand. To test the power of ADHDOCK, we performed an experiment on a set of protein-ligand complexes from PDBbind 2017 [37]. The performance of ADHDOCK, ABC, DE, LGA, HIGA, and SODOCK is compared in these datasets. Computer simulation results reveal that ADHDOCK is superior to the other methods regarding obtained energy and root-mean-square deviation (RMSD), success rate, convergence performance, data distribution, and hypothesis test.

Data Preparation and Parameter Setting
Fifty X-ray crystallographic complexes are randomly chosen from PDBbind 2017 [37] to compare the capability of the different docking methods. One rigid protein receptor and one flexible ligand are prepared before docking. Through AutoDock's file format, input files of both protein and ligand are preserved. The preparation process of proteins is as below: (1) get rid of ligands, the water molecule and metal ion not being included in binding sites; (2) restore the residues of missing atoms; (3) mix hydrogen with all atoms, integrate nonpolar hydrogen atoms, and allot partial charges; and (4) assign the parameter of solvent. The input files of the ligand are gained by the following process: (1) acquire the ligand atom coordinates from PDB files; (2) mix hydrogen with all atoms, integrate nonpolar hydrogen atoms, and allot the partial charge; and (3) define the torsions and a rigid root of the ligand. The file preparation stage of a molecule is carried out through AutoDock Tools. The degrees of freedom include three parameters denote the translation of the ligand relative to a specified center, a quaternion represents the orientation of the ligand with four parameters, and T torsion parameters where T is the number of rotatable bonds.
The semi-empirical free energy force field [13] is used as the scoring function in the experiment. The force field includes six pair-wise evaluations (V) and an estimate of the conformational entropy lost upon binding (∆S conf ): where L represents the ligand and P represents the protein. Each of the pair-wise energetic terms is expressed as the sum of dispersion/repulsion in which the parameters are based on the Amber force field, hydrogen bonding, electrostatics, and desolvation.
The docking power of six algorithms (ADHDOCK, ABC, DE, LGA, HIGA and SODOCK) is compared. For each tested algorithm, the number of iterations was 27,000 and the number of individuals was 100. The other parameters were set in accordance with previous studies [31][32][33][34][35], and the details are shown in Table 1.

Comparison of Energy and Root-Mean-Square Deviation (RMSD)
The main purpose of our experiment is to find the lowest energy because the docked energy values are the most important criterion to evaluate the performance of the tested algorithms. RMSD is an important index to evaluate the protein-ligand re-docking algorithms, and it is obtained by comparing the re-docking result with the real crystallographic structure of the complex. The smaller RMSD of a docked conformation is considered to be the more accurate solution to the docking problem and the stronger search capability of the used algorithm. A docking can be considered successful if the RMSD is smaller than a given threshold 2.0 Å after docking. Each method is run thirty times independently for each protein-ligand complex. The conformation obtained from the first run is called the first predicted conformation. The best conformation with the lowest energy value obtained during all thirty runs under the condition of RMSD < 2.0 Å is called the best predicted conformation, and the docking results of RMSD ≥ 2.0 Å are not counted. The success cases, the average RMSD (all cases) and the average RMSD (RMSD < 2.0 Å) of the first predicted conformations are calculated and recorded in Table 2. The lowest energy and RMSD of the best predicted conformation are recorded, and the results are shown in Table 3. The bold fonts are used to highlight which algorithm wins in each complex. The rate of the lowest energy and the smallest RMSD found by different tested algorithms is clearly shown in Figure 1.

Convergence Analysis
Convergence means that, after iterating for several times, the convergence curve of the target solution is likely to be steady. Figure 2 shows the convergence diagrams of the six tested algorithms for solving some representative complexes. The number of iterations is 3000, 6000, 9000, 12,000, 15,000, 18,000, 21,000, 24,000, 27,000 and 30,000, respectively, and the values are utilized as the horizontal axis of the convergence diagrams. Under different times of iteration, the energy value of each algorithm is referred to as the vertical axis. With the increase of the number of iterations, the power grows at the early stage of each algorithm. However, the power of certain algorithms is likely to be fixed in the later stage due to the loss of evolutionary ability and the reduction of population diversity. This phenomenon is referred to as premature convergence.

Data Distribution Analysis
The data distribution can reflect the algorithm stability and the data concentration. We calculate the median, the first quartile, the third quartile, the minimum, and the maximum of the energy values of each complex, and then we apply the five statistical quantities to obtain the box plots ( Figure 3). The median is suitable as a centralized trend value and not impacted by the extreme data. The median is an important measure of whether the distribution of data is dispersed or concentrated. Moreover, the first quartile is the upper boundary of the box and the third quartile is the lower boundary and the size of the box also reflects the concentration of the data. The dots outside the minimum value and maximum value are called outliers, and these outliers have a negative result on data distribution. Figure 3. Box plots of six tested algorithms.

Hypothesis Test
To demonstrate whether the algorithm is more applicable to solve the protein-ligand problem, we adopt the hypothesis test with the confidence level of 0.05 in the section. Comparing Algorithm 1 with Algorithm 2, when the p-value between them is less than 0.05, it shows that Algorithm 1 is significantly better than Algorithm 2. For every complex, ten best values solved by each tested algorithm are taken out as experimental data. The results of the comparison of ADHDOCK with ABC, DE, LGA, HIGA and SODOCK are shown in Table 4.

Discussion
In this study, we demonstrated the superiority of ADHDOCK by many experiments. The number of success cases and the average RMSDs are recorded in Table 1. ADHDOCK finds forty-four success cases in the fifty complexes for the first predicted conformation, and the success rate of the algorithm is higher than that of ABC, DE, LGA, HIGA and SODOCK. In addition, the average RMSD (all cases) and the average RMSD (RMSD < 2.0 Å) with their respective standard deviations of ADHDOCK are the smallest compared with the other tested algorithms. It can be seen in Table 2 that ADHDOCK is well capable of finding the lowest energy values on forty-four out of fifty complexes. HIGA searches for the lowest values on six out of fifty complexes. For the smallest RMSD of the fifty complexes, ADHDOCK finds thirty-five smallest values, HIGA finds eight smallest values, LGA finds three smallest values, SODOCK finds two smallest values, and ABC and DE find one smallest value each. The experimental results indicate that ADHDOCK can solve protein-ligand docking problems more effectively and accurately than the five other comparative algorithms.
For the convergence analysis, ABC is prematurely convergent after iterating 18,000 times in 3tpb. DE is prematurely convergent after iterating 21,000 times in 1aha. The figure shows that, when compared with other algorithms on preventing premature convergence and enhancing solution quality, ADHDOCK is better. Moreover, for this case, as shown in 1aha and 1stp, it is obvious that the convergent trajectory of ADHDOCK is far away from others. For 3ptb and 4dfr, the convergent trajectories of ADHDOCK and HIGA are similar, but ADHDOCK is better than HIGA with the iteration of the optimization process. For 1hri, 4hmg, 1htf and 1tmn, ADHDOCK does not show the advantage in the early iteration, but the method gets better solutions than the other algorithms in the late iteration. From the above analyses, we can arrive at a conclusion that ADHDOCK significantly outperforms the other five comparative algorithms.
It is apparent in the box plots shown in Figure 3 that the median energy of ADHDOCK is lower than that of the other tested algorithms. The size of the boxes of ADHDOCK is the shortest in most complexes. It is evident that ADHDOCK has the most concentrated data distribution among the six test algorithms. For the outliers, ADHDOCK has no outliers, which indicates that the algorithm can reduce the randomness of the evolutionary process. It can be concluded from the above analysis that ADHDOCK is stable for protein-ligand docking.
As shown in Table 3, ADHDOCK is significantly better than other tested algorithms in thirty-nine out of fifty complexes. In 1mrg, 3hvt, 1cdg, and 1rds, ADHDOCK is significantly better than four tested algorithms. In 1phg, 1eap, and 1lic, ADHDOCK is significantly better than three tested algorithms. In 6rnt and 1hri, ADHDOCK is significantly better than two tested algorithms. In 1nco and 1hpv, ADHDOCK is significantly better than one tested algorithm. We can conclude from the results of the hypothesis tests that the performance of ADHDOCK is the best.

Framework of ADHDOCK
ADHDOCK is a hybrid search algorithm based on ABC [34] and DE [35], and it is designed for protein-ligand docking. The block diagram of ADHDOCK is shown in Figure 4. ADHDOCK consists of three main modules. (1) ABC module: The population in the module is evolved according to ABC algorithm. (2) DE module: The population in the module is evolved by using DE algorithm.
(3) Adaptive population partition (APP) module: The population is partitioned based on the partition rate. In the following sections, the three main parts are described in detail.

ABC Module
ABC [34] is an optimization method that simulates the foraging behavior of the bee colony. The method consists of three kinds of bees: employed bees, onlooker bees and scout bees. The solution to the problem to be optimized is considered the food source, and the richer the food, the better the quality of the solution. The employed bees correspond to the food source they collect, and they store the information about the food source and share it with other bees at a certain probability. The number of employed bees is equal to the number of food sources, as one employed bee is related to only one food source. The onlooker bees observe the dance of the employed bees in the hive to determine which food source to choose. The scout bees randomly search for new food sources near the hive. The basic structure of the ABC algorithm can be split into employed bee stage, onlooker bee stage and scout stage.
At the employed bee stage, each employed bee hunts for a new food source near the existing food source where x k is a randomly selected food source; t is the iteration number; and ϕ is a random real number. The fitness between the new food source and the existing food source is compared, and the one with greater fitness is retained. The selection can be defined as where f it() is the fitness function. At the onlooker bee stage, each onlooker bee makes a selection according to the information of the food source. The probability that an onlooker bee chooses a food source can be calculated by It is clear from Equation (5) that the solution with a greater fitness has a higher probability of being chosen by an onlooker bee. If the onlooker bee has selected a food source, it generates a new solution by Equation (3) and evaluates the fitness by Equation (4).
At the scout bee stage, if the employed bee fails to improve the quality of the solution after a given number of attempts limit are achieved, the employed bee becomes a scout bee and the solution it owns is abandoned. The number of employed bees and the number of onlooker bees each account for half of the population size, and the number of scout bees is selected as one.
Subsequently, the evolutionary rate of ABC module is calculated, and the equation is given below: where f it(x i ) is the fitness of an individual x i at iteration t. f it(x i (t))-f it((t-1)) represents the difference between the fitness of the current iteration and that of the previous iteration, and the difference is called evolution velocity. An iteration with a higher evolutionary rate is considered to be more accurate in the search direction of the algorithm. The accuracy of the search direction affects the possibility of finding a better solution in subsequent iterations. Figure 5 gives the pseudocode of ABC module.

DE Module
DE [35] is an efficient global optimization algorithm that simulates biological evolution. Each individual in a population corresponds to a possible solution, and the swarm intelligence generated by mutual cooperation and competition between individuals guides the direction of optimization search. The algorithm starts with a random initial population and gets the optimal solution by iterative mutation, crossover and selection.
A mutated individual is generated by where x r1 , x r2 , and x r3 are three individuals randomly selected from the previous generation; t denotes the iteration number; and F is the scalar number. The vector difference between x r1 and x r2 is calculated, and then the vector difference is multiplied by F and added to the individual x r3 to be mutated. A test individual is obtained by crossing the mutant individual and the predetermined target individual. The crossover operation is described as where rand(0,1) generates a random number from 0 to 1; and CR is the probability of crossover, and it is a constant between 0 and 1. The next generation of individuals is selected by where f it() is the fitness function, which generally takes the objective function to be optimized as the fitness function. If the fitness of the test individual is better than that of the current individual, the old individual is replaced by the new individual in the next iteration, otherwise the old individual is still preserved. After executing the three basic operations of DE, the evolution rate is calculated according to Equation (6). The pseudocode of DE module is shown in Figure 6.

APP Module
The population is randomly divided into two different subpopulations. The number of individuals in Subpopulation 1 P1 and Subpopulation 2 P2 is M 1 = M × PR and M 2 = M − M 1 , where M is the total number of the population; and PR is the proportion of individuals in Subpopulation 1 called partition rate, which is set to 0.5 as the initial value. After the partition, the individuals of Subpopulation 1 evolve according to ABC module and the individuals of Subpopulation 2 evolve according to DE module.
When an evolution is completed, the emergence probability of new elite individuals in each subpopulation is called elite probability, and the elite probability is expressed according to Equation (10).
where M z is the size of a subpopulation; x i is an individual belongs to the subpopulation at iteration t; E z is the number of elite individuals in the subpopulation; and e j is an elite individual belonging to the subpopulation at current iteration. z is 1 or 2 and represents Subpopulation 1 and Subpopulation 2, respectively. The total size of the population M = M 1 + M 2 . The total size of the elite individuals E = E 1 + E 2 . The E elite individuals are found at the same time as fitness evaluation, and E is typically set to 20% of M. The elite probability can reflect the search capability of ABC module and DE module. The more elite individuals there are in this iteration, the greater the probability of continuing to reproduce elite individuals in the next iteration.
Then, the subpopulations are combined and the partition rate PR is calculated by Equation (11).
The above partition strategy is called APP. APP adaptively partitions the population and allocates ABC module and DE module in each iteration according to the search situations of these two modules. If the elite probability is simply taken as the basis of the population partition, there will be a situation in which one module accounts for the majority or all of the population and the other module evolves slowly or ceases to evolve because of the lack of individual resources. This is because elite individuals are more likely to reproduce elite individuals. The introduction of the evolutionary rate is a good solution to the situation. The evolutionary rate determines the evolutionary direction of the entire population rather than elite individuals. The evolutionary rate of a module is reduced if the individuals of the module have little impact on the entire population, which can prevent unreasonable partition. The process of partitioning the population is continuously carried out with the iteration of the algorithm until the algorithm reaches a specific terminating condition. Figure 7 gives the pseudocode of APP module.

Hybrid Search of ADHDOCK
ADHDOCK is a hybrid algorithm consisting of ABC module, DE module, and APP module for solving protein-ligand docking problems. In ABC module, the population in the module is evolved according to ABC algorithm. ABC is a specific application of swarm intelligence, and it has a faster convergence rate. Through the local optimization behavior of each artificial bee, the global optimal solution is eventually revealed in the population. In DE module, the population in the module is evolved according to DE algorithm. DE simulates the process of biological evolution, and the individuals who adapt to the environment are preserved after repeated iterations. Compared with GA, the global search strategy based on population is the same, but the complexity of genetic operation of DE is reduced by the simple mutation operation based on difference and the survival strategy of one-to-one competition. ABC and DE have applications in the fields of function optimization, data mining, pattern recognition and so on. The main drawback of these two algorithms is the high probability of falling into local optima, which leads to premature convergence of the algorithms and the high energy values obtained for protein-ligand docking. The reason for the shortcoming is that each algorithm uses a single search strategy.
ADHDOCK applies APP module to combine ABC module and DE module effectively. APP module can reasonably allocate the computational resources of the population in each iteration process. In the initialization of ADHDOCK, the population is partitioned into two subpopulations with the same number of individuals, and the two subpopulations are allocated to ABC module and DE module. After an iterative process of ABC and DE, the two subpopulations are combined. The combined population is repartitioned and then reallocated according to APP module. If one module performs well in this iteration, the module will get more individuals than the other module. The specific usage of APP module is described in the previous chapters. The above process is repeated until a predefined termination condition is reached. If the number of individuals in one module is increasing with the iteration of the algorithm, it can be considered that the module does not fall into local optima. ADHDOCK adopts the search strategies of ABC and DE, and the double search strategy of the novel algorithm can reduce the probability of falling into local optima.

Conclusions
Protein-ligand docking method has a pivotal part in the field of drug research and provides an effective tool for the discovery and optimization of leading compounds. Studying the interaction between small molecules and protein macromolecules and identifying the target of small molecules in the organisms can help to find a new breakthrough for the development of new drugs. The article presents ADHDOCK, which combines ABC module, ED module, and APP module to solve the protein-ligand docking problem. APP module is responsible for partitioning the population so that computer resources can be used more reasonably. ABC module and ED module execute in parallel under the guidance of APP module, their search capability is maximized. To demonstrate the advantages of ADHDOCK, we also compared the performance of the novel algorithm with that of ABC, DE, LGA, HIGA and SODOCK. Our results indicate that ADHDOCK is superior to the other tested algorithms for the docking problem, in terms of docked energy and RMSD, success rate, convergence performance, data distribution, and hypothesis test.