Advances and Challenges in Protein-Ligand Docking

Molecular docking is a widely-used computational tool for the study of molecular recognition, which aims to predict the binding mode and binding affinity of a complex formed by two or more constituent molecules with known structures. An important type of molecular docking is protein-ligand docking because of its therapeutic applications in modern structure-based drug design. Here, we review the recent advances of protein flexibility, ligand sampling, and scoring functions—the three important aspects in protein-ligand docking. Challenges and possible future directions are discussed in the Conclusion.


Introduction
Molecular recognitions including enzyme-substrate, drug-protein, drug-nucleic acid, protein-nucleic acid, and protein-protein interactions play important roles in many biological processes such as signal transduction, cell regulation, and other macromolecular assemblies. Therefore, determination of the binding mode and affinity between the constituent molecules in molecular recognition is crucial to understanding the interaction mechanisms and to designing therapeutic interventions. Due to the difficulties and economic cost of the experimental methods for determining the structures of complexes, computational methods such as molecular docking are desired for predicting putative binding modes and affinities. In molecular docking, based on the protein structures, thousands of possible poses of association are tried and evaluated; the pose with the lowest energy score is predicted as the "best match", i.e., the binding mode. Since Kuntz and colleagues' pioneering work [1], significant progress has been made in docking research to improve the computational speed and accuracy. Among them, protein-ligand docking is a particularly vibrant research area because of its importance to structure-based drug design [2][3][4][5][6][7][8] and will be the subject of the present review.
A protein-ligand docking program consists of two essential components, sampling and scoring. Sampling refers to the generation of putative ligand binding orientations/conformations near a binding site of a protein and can be further divided into two aspects, ligand sampling and protein flexibility. Scoring is the prediction of the binding tightness for individual ligand orientations/conformations with a physical or empirical energy function. The top orientation/conformation, namely the one with the lowest energy score, is predicted as the binding mode. Here, we will review the recent advances in protein-ligand docking on three important aspects: protein flexibility, ligand sampling, and scoring function, as illustrated in Figure 1. Challenges and future directions will also be discussed.

Soft Docking
Soft docking is the simplest method which considers protein flexibility implicitly. It works by allowing for a small degree of overlap between the ligand and the protein through softening the interatomic van der Waals interactions in docking calculations [15,16]. The advantages of soft docking are its computational efficiency and easiness for implementation. However soft docking can account for only small conformational changes.

Side-Chain Flexibility
Many of the early attempts to incorporate certain protein conformational changes into molecular docking focused on side-chain flexibility, in which backbones are kept fixed and side-chain conformations are sampled. One of the earliest studies is the ligand docking algorithm developed by Leach, in which discrete side-chain flexibility is included by using a rotamer library [17]. Since then, researchers have proposed many improved techniques to incorporate continuous or discrete side-chain flexibility in ligand docking [18][19][20][21][22][23][24].

Molecular Relaxation
The third type of methods account for protein flexibility by firstly using rigid-body docking to place the ligand into the binding site and then relaxing the protein backbone and side-chain atoms nearby. Specifically, the initial rigid-body docking allows for atomic clashes between the protein and the placed ligand orientations/conformations in order to consider the protein conformational changes. Then, the formed complexes are relaxed or minimized by Monte Carlo (MC), Molecular Dynamic simulations, or other methods [25,26]. The advantage of the molecular relaxation method is the inclusion of certain backbone flexibility in addition to the side-chain conformational changes. However, compared to the side-chain flexibility methods, the relaxation method is more demanding for the scoring function because it involves not only the side-chain movement but also the more challenging backbone sampling, thereby inaccuracies in the scoring function may lead to artifacts (e.g., improper backbone torsions) in the relaxed protein conformations. Moreover, the relaxation method is time-consuming.

Docking of Multiple Protein Structures
The most widely-used type of methods for incorporating protein flexibility utilize an ensemble of protein structures to represent different possible conformational changes [9][10][11][12][13][14]. One of the earliest studies was done by Knegtel et al. [27], in which an averaged energy grid was constructed by combining the energy grids generated from individual experimentally-determined protein structures using a weighting scheme, followed by standard ligand docking. Osterberg et al. extended the method to AutoDock [28] with a larger ensemble consisting of 21 different conformations of the HIV-1 protease [29]. The averaging nature of the method may miss the geometric accuracy of the protein. Claussen et al. developed a docking program FlexE to dock ligands into an ensemble of protein structures [30], in which the similar segments of the protein structures in the ensemble are aligned and merged while the dissimilar segments are used to combinatorially create new possible protein conformations for docking. In Wei et al.'s algorithm [31], a protein was decomposed into a rigid part and several flexible parts according to the crystal structures of the protein in the ensemble. For a given ligand placement, only the best-fit local conformer was kept for each flexible part in the protein, assuming the flexible regions move independently. The selected local conformers were joined with the rigid part to form the best-fit protein conformation. Compared to FlexE, this algorithm is significantly faster and scales linearly rather than exponentially with the protein flexibility. However, the algorithm may rely on the quality of ligand orientational/conformational sampling, namely whether proper initial ligand placement is included. Huang and Zou developed a fast ensemble docking algorithm by treating the protein conformational ensemble as an additional dimension to the traditional six degrees of freedom (three translational plus three rotational) for ligand energy optimization [32,33]. The algorithm is almost as fast as single docking whereas keeping the accuracy of sequential docking. The ensemble docking algorithm is not used for generating new protein structures, but instead for selecting the induced-fit structure from a given protein ensemble. Following a similar procedure, Abagyan and colleagues expanded Huang and Zou's algorithm to create ICM's ensemble docking algorithm, referred to as four-dimensional (4D) docking [34]. In addition to experimental structures such as NMR structures or crystal structures bound of the protein with different ligands, ensembles of protein conformations generated by molecular dynamic simulations, Monte Carlo simulations, or structure prediction have also been used to account for protein flexibility [35][36][37][38][39][40][41][42][43].

Ligand Sampling
Ligand sampling is the most basic element in protein-ligand docking. Given a protein target, the sampling algorithm generates putative ligand orientations/conformations (i.e., poses) around the chosen binding site of the protein. The binding site can be the experimentally determined active site, a dimer interface or other site of interest. Ligand sampling is the most successful area being developed in protein-ligand docking. Roughly, there are three types of ligand sampling algorithms: shape matching, systematic search, and stochastic algorithms.

Shape Matching
The shape matching method is one of the simplest sampling algorithms which is often used in the early stages of the docking process or in the first step of other more advanced ligand sampling methods. It places the ligand using the criterion that the molecular surface of the placed ligand must complement the molecular surface of the binding site on the protein. The six degrees of freedom (three translational and three rotational) of the ligand allow for many possible ligand binding orientations. Therefore, how to quickly place the ligand in the binding site with a good shape complementarity is the goal of the shape matching algorithm. Examples of docking programs that use shape matching include DOCK [1], FRED [44], FLOG [45], EUDOC [46], LigandFit [47], Surflex [48], MS-DOCK [49], and MDock [32,33]. The major advantage of shape matching is its computational efficiency. However, the conformation of the ligand is normally fixed during the shape matching process. Therefore, flexible-ligand docking with the shape matching method is usually performed by docking an ensemble of pre-generated ligand conformations into the protein [50], followed by merging/reranking of the docked poses from different docking runs according to their energy scores (see below).

Systematic Search
Systematic search algorithms are normally used for flexible-ligand docking, which generate all possible ligand binding conformations by exploring all degrees of freedom of the ligand. There are three types of systematic search methods: exhaustive search, fragmentation and conformational ensemble.
The most straightforward systematic algorithms are exhaustive search methods, in which flexible-ligand docking is performed by systematically rotating all possible rotatable bonds of the ligand at a given interval. Despite its sampling completeness for ligand conformations, the number of the combinations can be huge with the increase of the rotatable bonds. Therefore, to make the docking process practical, geometric/chemical constrains are normally applied to the initial screening of ligand poses, and the filtered ligand conformations are further subject to the more accurate refinement/optimization procedures. Glide [51,52] and FRED [44] are two typical examples of this type of hierarchical sampling methods.
In fragmentation methods, the ligand is first divided into different rigid parts/fragments. Then, the ligand binding conformation is incrementally grown by placing one fragment at a time in the binding site or by docking all the fragments into the binding site and linking them covalently. DOCK [53], LUDI [54], FlexX [55], ADAM [56], and eHiTs [57] are example docking programs that use fragmentation methods.

Stochastic Algorithms
In stochastic algorithms, ligand binding orientations and conformations are sampled by making random changes to the ligand at each step in both the conformational space and the translational/rotational space of the ligand, respectively. The random change will be accepted or rejected according to a probabilistic criterion. There are four types of stochastic algorithms: Monte Carlo (MC) methods, evolutionary algorithms (EA), Tabu search methods, and swarm optimization (SO) methods.
In a Monte Carlo method, the probability to accept a random change is calculated by using the following Boltzmann probability function: where E 0 and E 1 stand for the energy scores of the ligand before and after the random change, respectively, k B is the Boltzmann constant, and T is the absolute temperature of the system. The docking programs that use the MC methods include DockVision [61], ICM [18], QXP [62], Prodock [63], and MCDOCK [64]. Evolutionary algorithms (EAs) search for the correct ligand binding mode based on the idea from the evolutionary process in biological systems. The most popular type of EAs is the genetic algorithms (GAs). GOLD [65,66], AutoDock [28], DIVALI [67], DARWIN [68], MolDock [69], PSI-DOCK [70], FLIPDock [42], Lead finder [71], and EADock [72] are the examples that have implemented evolution algorithms.
In Tabu search methods, the probability of acceptance depends on the previously explored areas in the conformational space of the ligand. The random change will be rejected if the RMSD between the current ligand binding conformation and any of the previously recorded solutions is less than a cutoff; otherwise, the random change will be accepted. Example docking programs are PRO LEADS [73] and PSI-DOCK [70].
Swarm optimization (SO) algorithms attempt to find an optimal solution in a search space by modeling swarm intelligence. In the method, movements of a ligand mode through the search space are guided by the information of the best positions of its neighbors. Examples of docking programs that use swarm optimization algorithms include SODOCK [74], Tribe-PSO [75], PSO@Autodock [76], and PLANTS [77].

Scoring Functions
The scoring function is a key element of a protein-ligand docking algorithm, because it directly determines the accuracy of the algorithm [78][79][80][81][82]. Speed and accuracy are the two important aspects of a scoring function. An ideal scoring function would be both computationally efficient and reliable. Numerous scoring functions have been developed in the past decades and can be grouped into three basic categories according to their methods of derivation: force field, empirical, and knowledge-based scoring functions.

Force Field Scoring Functions
Force field (FF) scoring functions [28,83,84] are based on decomposition of the ligand binding energy into individual interaction terms such as van der Waals (VDW) energies, electrostatic energies, bond stretching/bending/torsional energies, etc., using a set of derived force-field parameters such as AMBER [85] or CHARMM [86,87] force fields. One of the major challenges in FF scoring functions is how to account for the solvent effect. The simplest method is to use a distance-dependent dielectric constant ε(r ij ) such as the force field scoring function in DOCK [84]: where r ij stands for the distance between protein atom i and ligand atom j, A ij and B ij are the VDW parameters, and q i and q j are the atomic charges. ε(r ij ) is usually set to 4r ij , reflecting the screening effect of water on electrostatic interactions. The most rigorous FF methods are to treat water molecules explicitly such as FEP and TI (see [88] for review). However, these methods, together with their simplified approaches such as LIEPROFEC, and OWFEG are computationally expensive [88]. To reduce the computational expense, accelerated methods have been developed while preserving the reasonable accuracy by treating water as a continuum dielectric medium. The Poisson-Boltzmann/surface area (PB/SA) models [89][90][91][92][93][94][95][96][97][98][99] and the generalized-Born/surface area (GB/SA) models [100][101][102][103][104][105][106][107][108][109][110][111] are typical examples of such implicit solvent models.
In addition to the challenge on solvent effect, how to accurately account for entropic effect is an even more severe challenge for FF scoring functions. Moreover, whether the free energy of ligand binding can be decomposed into a linear combination of individual interaction terms without calculating the partition function ("ensemble average") also remains in question, referred to as the nonadditive problem.

Empirical Scoring Functions
In empirical scoring functions, the binding energy score of a complex is calculated by summing up a set of weighted empirical energy terms such as VDW energy, electrostatic energy, hydrogen bonding energy, desolvation term, entropy term, hydrophobicity term, etc.
where {∆G i } represent individual empirical energy terms, and the corresponding coefficients {W i } are determined by reproducing the binding affinity data of a training set of protein-ligand complexes with known three-dimensional structures, using least squares fitting [112][113][114][115][116][117][118][119][120][121]. Compared to the force field scoring functions, the empirical scoring functions are normally much more computationally efficient due to their simple energy terms. However, the general applicability of an empirical scoring function depends on the training set due to the nature of its fitting to known binding affinities of its training set. With the rapid increase in the number of crystal structures of diverse protein-ligand complexes with known binding affinities, a general empirical scoring function could be developed by training on the binding constants of thousands of protein-ligand complexes. GlideScore [51,52], PLP [119,120], SYBYL/F-Score [55], LigScore [118], LUDI [115,117], SCORE [116], X-Score [121], ChemScore [114], MedusaScore [122], AIScore [123], and SFCscore [124] are examples of empirical scoring functions.

Knowledge-Based Scoring Functions
The potential parameters of knowledge-based scoring functions are directly derived from the structural information in experimentally determined protein-ligand complexes [125][126][127][128]. The principle behind knowledge-based scoring functions is the potential of mean force [129], which is defined by the inverse Boltzmann relation [130][131][132][133] where k B is the Boltzmann constant, T is the absolute temperature of the system, ρ(r) is the number density of the protein-ligand atom pair at distance r in the training set, and ρ * (r) is the pair density in a reference state where the interatomic interactions are zero. After the potential parameters w(r) are derived, the energy of ligand binding for a given complex is simply the sum of the interaction terms for all the protein-ligand atom pairs in the complex. Based on the early idea of Tanaka and Scheraga [125], a number of knowledge-based scoring functions have been developed for protein-ligand interactions. Compared to the force field and empirical scoring functions, the knowledge-based scoring functions offer a good balance between accuracy and speed. Namely, because the potentials in Equation. (4) are extracted from a large number of structures rather than attempting to reproduce the known affinities by fitting, the knowledge-based scoring functions are relatively robust and general [134][135][136][137]. Their pairwise characteristic also enables the scoring process to be as fast as empirical scoring functions.
As the ideal reference state is inaccessible for complicated systems like proteins [132], one major challenge for knowledge-based scoring functions is the calculation for the afore-mentioned reference state; based on the methods knowledge-based scoring functions can be classified into three categories: traditional atom-randomized reference state, corrected reference state, and circumventing the reference state.
Traditional methods to approximate the reference state are randomization of the atoms in the training set. Examples include DrugScore [138,139], SMoG [140,141], BLEEP [142,143], GOLD/ASP [144], MScore [145], and KScore [146]. The disadvantage of the atom-randomization approximation is the neglection of the effects of excluded volume, interatomic connectivity, etc. [132]. Methods like PMF [134,135] and DFIRE [147] have introduced correction terms for the reference state. Yet, binding mode prediction and virtual database screening are main problems for most knowledge-based scoring functions as a result of the reference state problem. To circumvent the long-standing reference state problem, Huang and Zou have developed a physics-based iterative method and derived the ITScore scoring function, which has been extensively tested with multiple benchmarks for binding mode prediction, affinity prediction and virtual screening [136,137,148].
Other challenges for knowledge-based scoring functions include extension of the pairwise interactions to many-body interactions to account for hydrogen bonding and other directional interactions, development of an accurate method for entropic calculations [148], etc.

Consensus Scoring
Consensus scoring is not really a specific type of scoring function but a technique in protein-ligand docking [149]. It improves the probability of finding a correct solution by combining the scoring information from multiple scoring functions in hopes of balancing out the errors of the individual scoring functions. Therefore, the main issue in consensus scoring is how to make the combination rule for individual scores so that the true binders can be discriminated from others according to the consensus rule [150,151]. MultiScore and X-Cscore are two examples of consensus scoring methods [121,152].

Clustering and Entropy-Based Scoring Methods
In addition to consensus scoring, another technique to improve the performances of scoring functions is clustering-based scoring methods, which incorporate the entropic effects by dividing generated ligand binding modes into different clusters [169][170][171]. The entropic contribution in each cluster is measured by the configurational space covered by the ligand poses or the number of the ligand poses in the cluster. One restriction in clustering-based scoring methods is that its performance depends on the ligand sampling protocol that is used, i.e., it is docking program-dependent. These methods in combination with ligand conformational sampling using AutoDock have significantly improved binding mode prediction [148,[169][170][171].

Conclusion and Discussions
We have reviewed three important aspects of protein-ligand docking: protein flexibility, ligand sampling, and scoring functions. Rapid advances in the last two decades have almost solved the ligand sampling issue. Although equal or even more efforts have been paid to scoring function development, entropy and desolvation effects remain the two major challenging issues for current scoring functions, particularly for the force field scoring functions. Speed and accuracy are the two important characteristics of a scoring function. Because of the rapid increase in computing power, how to improve the accuracy is the future direction for scoring function development. In contrast to ligand sampling and scoring functions which have been extensively studied for more than two decades, protein flexibility has only been addressed recently because of the difficulty resulting from the enormous degrees of freedom and the limitation of the computing power. The development of computational methods for protein flexibility is still in its infancy and thereby remains one of the major future directions in protein-ligand docking. Finally, how to evaluate different docking programs and scoring functions is another active area [153][154][155][156][157]. Although many comparison studies for docking and scoring have been published [158][159][160][161][162][163][164][165][166], publicly available docking benchmarks such as DUD (http://dud.docking.org/) [167,168] and CSAR (http://www.csardock.org/) are extremely valuable for systematic and consistent evaluation and improvement of new and existing docking algorithms.