Prediction of Positions of Active Compounds Makes It Possible To Increase Activity in Fragment-Based Drug Development

We have developed a computational method that predicts the positions of active compounds, making it possible to increase activity as a fragment evolution strategy. We refer to the positions of these compounds as the active position. When an active fragment compound is found, the following lead generation process is performed, primarily to increase activity. In the current method, to predict the location of the active position, hydrogen atoms are replaced by small side chains, generating virtual compounds. These virtual compounds are docked to a target protein, and the docking scores (affinities) are examined. The hydrogen atom that gives the virtual compound with good affinity should correspond to the active position and it should be replaced to generate a lead compound. This method was found to work well, with the prediction of the active position being 2 times more efficient than random synthesis. In the current study, 15 examples of lead generation were examined. The probability of finding active positions among all hydrogen atoms was 26%, and the current method accurately predicted 60% of the active positions.


Method
Starting from an active fragment (seed compound) and the 3D structure of the target protein, we try to predict which atom should be modified chemically to increase the activity. In the current study, only hydrogen atoms and fluorine atoms were modified. While there are numerous possible chemical modifications, only a limited number of side chains (fewer than 80) were used in the current study. A set of virtual compounds was generated from the active fragment by artificial chemical modification, and a subsequent docking study was carried out to rank the virtual compounds according to the docking score. The modified position of the top-ranked virtual compound was predicted as the active position. The details of this method are described below.
All hydrogen atoms and fluorine atoms of the active fragment were replaced by side chains, one by one. The side chains are small groups (methyl, ethyl, phenyl, etc.) and their derivatives. Three sets of side chains (A, B, and C) were prepared, consisting of 78, 38, and 25 side chains, respectively. These side chains, which are summarized in Figure 1, are small hydrocarbons with up to two aromatic rings, and they do not include heteroatoms. The side chains were selected manually and arbitrarily. They are also listed in the supporting information.
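The one-by-one substitution described above can be sketched as a simple enumeration. This is only an illustration under stated assumptions: the atom labels and side chain labels are hypothetical placeholders, and the function is not part of BindMol. It gives the upper bound on the number of virtual compounds before any compounds are discarded for intra-molecular conflicts.

```python
from itertools import product

def enumerate_virtual_compounds(replaceable_atoms, side_chains):
    """Pair every replaceable atom (H or F) with every side chain.

    Each pair stands for one virtual compound carrying a single substitution.
    """
    return list(product(replaceable_atoms, side_chains))

# Hypothetical seed fragment with three hydrogens and one fluorine.
seed_atoms = ["H1", "H2", "H3", "F1"]
# 25 placeholder labels, matching only the size of side chain set C.
set_c = [f"C{i:02d}" for i in range(1, 26)]

virtual = enumerate_virtual_compounds(seed_atoms, set_c)
print(len(virtual))  # 4 atoms x 25 side chains = 100
```

With the 78 side chains of set A, the same fragment would give at most 4 x 78 = 312 virtual compounds.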

Figure 1.
Side chain sets A, B, and C. All sets consist of compounds 1-9 and their derivatives. Set A: For compounds 1-9, R or one of the Ri groups is directly attached to the active fragment and the other Ri groups are replaced by H. In addition, for compounds 8-9, R or one of the Ri groups is replaced by -CH2-R, -CH2-CH2-R, or -CH=CH-R (R is the active fragment) and the other Ri groups are replaced by H. Set B: For compounds 1-9, R or one of the Ri groups is directly attached to the active fragment and the other Ri groups are replaced by H. In addition, for compounds 8-9, R or one of the Ri groups is replaced by -CH2-R or -CH2-CH2-R (R is the active fragment) and the other Ri groups are replaced by H. Set C: For compounds 1-9, R or one of the Ri groups is directly attached to the active fragment and the other Ri groups are replaced by H. In addition, for compounds 8-9, R or one of the Ri groups is replaced by -CH2-R (R is the active fragment) and the other Ri groups are replaced by H.
These side chains were introduced into the active fragment by the in-house program BindMol. If the attached side chain came into contact with an atom of the seed compound (intra-molecular atomic conflict), that compound was not generated. The atomic coordinates of each generated virtual compound were optimized by an energy minimization calculation in vacuum. The Cosgene/myPresto program was used for energy minimization with a general AMBER force field, and the dielectric constant was set to 4R, where R is the inter-atomic distance [20]. The atomic charges were calculated by the Gasteiger method [21,22].
The protein-compound docking simulation was performed with the Sievgene/myPresto program [23]. Each generated virtual compound was docked to the target protein by the flexible docking method, and the affinity of each virtual compound was evaluated by the docking score. The docking pocket of each protein was indicated by the coordinates of the original ligand. Hydrogen atoms were added to the coordinates by tplgene/myPresto. The atomic charges of the proteins were the same as those of AMBER parm99 [24]. For flexible docking, the Sievgene program generated up to 100 conformers for each compound, and a 120 x 120 x 120 grid was used for scoring. The atomic coordinates of the target protein were fixed. The protonation states of the proteins and compounds were the dominant ionic forms at pH 7. Finally, the virtual compounds were sorted according to their docking scores. The modified position of the top-ranked compound among all virtual compounds is the predicted active position.
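To make the distance-dependent dielectric concrete, the following minimal sketch evaluates a pairwise Coulomb term with the dielectric constant set to 4R. It is not the Cosgene implementation; the function name is hypothetical, and the conversion constant (about 332.06 kcal Å mol^-1 e^-2) is a commonly used approximate value assumed here.

```python
# Approximate Coulomb conversion constant in kcal*angstrom/(mol*e^2); assumed value.
COULOMB_CONSTANT = 332.06

def coulomb_energy_ddd(q_i, q_j, r, scale=4.0):
    """Coulomb energy (kcal/mol) with a distance-dependent dielectric eps = scale * r.

    q_i, q_j: partial charges in units of e; r: inter-atomic distance in angstroms.
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    eps = scale * r  # dielectric constant grows linearly with distance
    return COULOMB_CONSTANT * q_i * q_j / (eps * r)

# Opposite unit charges 2 angstroms apart: eps = 8, so the energy is 332.06 / 16 in magnitude.
print(coulomb_energy_ddd(1.0, -1.0, 2.0))
```

Because the dielectric constant itself is proportional to R, the electrostatic term falls off as 1/R^2 rather than 1/R, which damps long-range interactions in the vacuum calculation.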
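The final ranking step can be sketched as follows. The sketch assumes that lower (more negative) docking scores indicate better affinity, which the text does not state explicitly; the record layout, labels, and scores are hypothetical.

```python
def rank_positions(docked, lower_is_better=True):
    """Return modified positions ordered by their best-scoring virtual compound.

    docked: iterable of (position, side_chain, score) tuples.
    lower_is_better: score convention; an assumption, not taken from the paper.
    """
    ordered = sorted(docked, key=lambda rec: rec[2], reverse=not lower_is_better)
    positions = []
    for position, _side_chain, _score in ordered:
        if position not in positions:  # keep only the first (best) hit per position
            positions.append(position)
    return positions

# Hypothetical docking results for a seed compound with three hydrogens.
example = [
    ("H1", "methyl", -4.1),
    ("H2", "phenyl", -6.3),
    ("H2", "ethyl", -5.0),
    ("H3", "phenyl", -5.8),
]
ranking = rank_positions(example)
predicted_active_position = ranking[0]  # position of the top-ranked compound
top_two_positions = ranking[:2]         # top-ranked and second-ranked positions
print(predicted_active_position, top_two_positions)
```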
The seed compounds were taken from the literature [1,40]. The current procedure was applied to 15 target proteins. The target names are summarized in Table 1, along with the number of virtual compounds generated for each target. Figure 2 shows the seed compounds of these target proteins, together with their known active positions and the active positions predicted by the current calculation. The probability of predicting accurate active positions is summarized in Table 2. On average, the probability of finding an active position among all hydrogen atoms by random prediction was 26.32%. In contrast, the current method correctly predicted 60.0% of the active positions when side chain set A was used, which is approximately two times more efficient than a random selection of active positions. When the second-ranked predicted position was considered in addition to the top-ranked position, the probability of finding an active position by random prediction was 45.71%, whereas the current method predicted 66.67%, 46.67%, and 46.67% of the active positions when side chain sets A, B, and C were used, respectively. These values were higher than the probability obtained by random prediction, but the advantage of the current method was no longer significant.
The prediction accuracy increased with the number of attached side chains. The prediction accuracy obtained with set A was better than that with sets B and C. Thus, the prediction accuracy should be improved by increasing the number or the variety of side chains.
The virtual side chains used (sets A, B, and C) were not hydrophilic groups but hydrophobic hydrocarbon groups. In the fragment evolution process, hydrophobic groups are usually added to the active fragment compound to increase the activity [40]. Although the possible variations of chemical modification are infinite, it appears reasonable to use simple hydrophobic groups for chemical modification.
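The comparison with random prediction can be checked with a short calculation. The sketch assumes that the two-position random baseline is obtained by treating the two picks as independent draws with the average single-pick probability; this is an assumption about how the reported figure was derived, although it does reproduce the reported 45.71%. The function names are hypothetical.

```python
def top_k_random(p_single, k):
    """Probability that at least one of k independent random picks is an active position."""
    return 1.0 - (1.0 - p_single) ** k

def enrichment(method_pct, random_pct):
    """Ratio of the method's success rate to the random success rate."""
    return method_pct / random_pct

random_top1 = 26.32  # reported average probability (%) for a single random pick
random_top2 = 100.0 * top_k_random(random_top1 / 100.0, 2)
print(round(random_top2, 2))  # 45.71, matching the reported two-position baseline

# Enrichment of set A over random for the top-ranked position.
print(round(enrichment(60.0, random_top1), 2))  # 2.28

# Enrichment when the top two positions are considered (sets A, B, C).
for name, pct in [("A", 66.67), ("B", 46.67), ("C", 46.67)]:
    print(name, round(enrichment(pct, random_top2), 2))
```

Under this reading, the enrichment falls from roughly 2.3 for the top-ranked position to roughly 1.5 (set A) or 1.0 (sets B and C) once the second-ranked position is included.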

Multiple target structures were used
In addition to the single target protein structure, multiple structures of each target protein were examined. These structures were extracted from the PDB. The protein structures used are summarized in Table 2. Each protein was prepared for docking in the same manner as described in the Method section. The docking scores for all protein structures were merged, and the virtual compounds were re-ranked based on the docking score. The results are summarized in Table 2. When side chain set A was used, the prediction accuracy was 66%, which is the same value obtained from the single target protein structure.
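The merge-and-re-rank step can be sketched as pooling the scores from every structure into one list and sorting it. The sketch assumes that lower docking scores are better and that scores from different structures are directly comparable; the structure names, labels, and scores are hypothetical.

```python
def merge_and_rerank(scores_by_structure, lower_is_better=True):
    """Pool docking results from several protein structures and sort them by score.

    scores_by_structure: dict mapping a structure name to a list of
    (position, side_chain, score) tuples.
    Returns a list of (score, structure, position, side_chain), best first.
    """
    pooled = [
        (score, structure, position, side_chain)
        for structure, records in scores_by_structure.items()
        for position, side_chain, score in records
    ]
    pooled.sort(key=lambda rec: rec[0], reverse=not lower_is_better)
    return pooled

# Hypothetical results for the same virtual compounds docked to two structures.
example = {
    "structure_1": [("H1", "methyl", -4.0), ("H2", "phenyl", -5.5)],
    "structure_2": [("H1", "methyl", -6.1), ("H2", "phenyl", -5.0)],
}
pooled = merge_and_rerank(example)
predicted_position = pooled[0][2]  # position of the best compound over all structures
print(predicted_position)
```

In this toy case the prediction changes from H2 (structure_1 alone) to H1 once structure_2 is included, which illustrates how a single favorable score from one structure can dominate the merged ranking.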

Ranking of true lead compounds
To estimate the limitations of the prediction accuracy, the true lead compounds were added to the virtual compounds. A single target protein structure was used. The compounds were docked to the target protein, and these compounds were ranked according to the docking score. If the docking scores are accurate, the true lead compounds should be ranked at the first positions. The results are summarized in Table 3. The true lead compounds appeared at the first rank with a probability of 60%, while this probability would be 2.8% by random selection. The docking study actually worked, but the prediction was not perfect. This 60% probability should be considered the upper limit of the current prediction method.

Table 3.
Target                 PDB ID   Virtual compounds (a)   True lead compounds   Rank (b)
Akt                    2UZT     1361                    50                    41
Bcl-XL                 1YS1     727                     1                     1
CDK2                   1VYZ     900                     41                    1
DNHA                   2NM2     634                     16                    1
ERK2                   2OJG     1192                    33                    334
HSP90                  1BYQ     699                     4                     1
IMPDH                  1NF7     497                     28                    1
Janus kinase           3JY9     778                     1                     343
KDR                    1T46     805                     48                    1
Lactate dehydrogenase  1ARZ     704                     3                     1
MetAP2                 1YW7     1381                    108                   1
MMP12                  1Y93     393                     1                     8
NADP                   2F10     867                     28                    298
PDE4                   1MKD     1102                    8                     303
Urokinase              1ETF     754                     27                    1
Average                                                 2.80%                 60.00%
(a) number of generated virtual compounds; (b) rank of the virtual compound that precisely predicts the true active position
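The two averages reported for Table 3 can be reproduced from the per-target values. The sketch below reads the fourth numeric column of Table 3 as the number of true lead compounds per target; this is an interpretation rather than something the text states, but it reproduces the reported 2.80% when the random probability is taken as that number divided by the number of generated virtual compounds.

```python
# (target, virtual compounds, assumed number of true leads, rank), copied from Table 3.
table3 = [
    ("Akt", 1361, 50, 41), ("Bcl-XL", 727, 1, 1), ("CDK2", 900, 41, 1),
    ("DNHA", 634, 16, 1), ("ERK2", 1192, 33, 334), ("HSP90", 699, 4, 1),
    ("IMPDH", 497, 28, 1), ("JanusKinase", 778, 1, 343), ("KDR", 805, 48, 1),
    ("Lactatedehydrogenase", 704, 3, 1), ("MetAP2", 1381, 108, 1),
    ("MMP12", 393, 1, 8), ("NADP", 867, 28, 298), ("PDE4", 1102, 8, 303),
    ("Urokinase", 754, 27, 1),
]

# Random baseline: chance that a randomly chosen top compound is a true lead,
# averaged over targets (assumes the column interpretation described above).
random_pct = 100.0 * sum(leads / n for _, n, leads, _ in table3) / len(table3)

# Observed success rate: fraction of targets where the listed rank is 1.
first_rank_pct = 100.0 * sum(1 for row in table3 if row[3] == 1) / len(table3)

print(round(random_pct, 2), first_rank_pct)
```

Nine of the 15 targets have a rank of 1, which gives the 60% figure quoted in the text.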

Discussion
On average, the probability of finding an active position among all hydrogen atoms was 26.32%, and the current method predicted 60.0% of the active positions. Considering that the accuracy of cross-docking by Sievgene is only 25%, the prediction accuracy of the current method is high. Our previous study showed that virtual screening of fragments by docking is difficult, but that virtual screening becomes easier if a virtual side chain is added to the fragment compound [10]. That is, the accuracy is improved by the addition of a virtual side chain to fragment compounds. In the current study, the prediction accuracy would likewise have been improved by the addition of virtual side chains to the active fragment compound.
The prediction accuracy obtained with side chain set A was much better than that with sets B and C. The difference between the sets was that the side chains of set A included a C=C structure. This structure mimics that of amide or ester structures. Since the major contribution to the Sievgene docking score is the ASA term and the electrostatic interaction is less important, the size and shape of the group or compound are important in the Sievgene docking score [10].
The prediction accuracy reached 60%, but no higher. Even when multiple structures were used, the prediction accuracy was not improved. In in-silico drug screening, the ensemble docking method has been used to account for protein flexibility, such as induced fit. In an ensemble docking study, many protein structures are prepared and each structure gives an in-silico drug screening result. We can obtain many screening results, but only a limited number of them are reliable. How to select the reliable screening results from the many results is a serious problem. Ensemble docking studies have shown that the docking score does not consistently provide reliable results or true hit (active) compounds [41][42][43][44]. The same phenomenon probably occurred in the current study, and the docking scores were not good enough to predict the active positions.
Since comprehensive chemical modification is almost impossible, the reported active positions probably represent only some of the true active positions. Consequently, like the probability of true active positions, the prediction accuracy is likely to be underestimated. That is, even when a true active position is predicted, the prediction is judged to be a failure if that position has not been reported (i.e., it is not included in the registered active positions). The reported chemical modifications are also restricted by synthetic accessibility, so the analysis in the current study is somewhat ambiguous.

Conclusions
We have developed a computational method that predicts the positions of seed compounds that should be chemically modified in the fragment evolution method. In the current method, to predict the active position, all hydrogen atoms are replaced by small side chains. Three sets of side chains were prepared manually. The resulting virtual compounds were docked to a target protein, and the docking scores (affinities) were examined. The hydrogen atom that gave the virtual compound with the best affinity was taken to be the active position that should be replaced to generate a lead compound. This method worked well: the prediction of the active position was two times more efficient than random synthesis. In the current study, 15 examples of lead generation were examined. The probability of finding an active position among all hydrogen atoms was 26%, and the current method predicted 60% of the active positions.