Development of Computational Approaches with a Fragment-Based Drug Design Strategy: In Silico Hsp90 Inhibitors Discovery

A semi-exhaustive approach and a heuristic search algorithm use a fragment-based drug design (FBDD) strategy for designing new inhibitors in an in silico process. A deconstruction reconstruction process uses a set of known Hsp90 ligands for generating new ones. The deconstruction process consists of cutting off a known ligand in fragments. The reconstruction process consists of coupling fragments to develop a new set of ligands. For evaluating the approaches, we compare the binding energy of the new ligands with the known ligands.


Introduction
Rational drug design remains a challenging problem for computational methods developed in bioinformatics and medicinal chemistry. In this sense, computational generative methods have begun to show promising results for the design problem. Principally, computer-aided drug discovery/design (CADD) has played a significant role in developing therapeutically important small molecules [1]. In general, the computational methods used in research of drug development are broadly classified into two groups: (i) ligand-based methods or (ii) structure-based methods [2].
The ligand-based methods use only ligand information for predicting activity depending on its similarity/dissimilarity to previously known active ligands. When the structural information of the protein is sufficient, it uses structure-based approaches. In other words, it is a simulation of the protein-ligand interactions. These interactions are essential for biochemical functionality and are present in all biochemical roles [3]. According to the structure-based technique, the development of a new compound consists of two stages [4]. The first one consists of the exploration of an ample conformational space representing various potential binding modes, and the second one of obtaining an accurate prediction of the interaction energy associated with each of the predicted binding conformations [5]. In this context, the molecular docking programs perform these tasks through a cyclical process, in which specific scoring functions evaluate the ligand conformation. This process is carried out recursively until converging to a minimum energy solution [5][6][7].
From the point of view of searching for active compounds on one or more therapeutic targets using compound libraries, virtual screening (VS) based on molecular docking is an efficient method to recover new successful compounds in drug discovery. On the other hand, fragment-based drug design (FBDD) has emerged as a robust approach to identify small molecules that bind to a wide range of therapeutic targets; this approach has led to efficient progress in developing new drugs [8][9][10][11][12]. Experimental and computational FBDD has established itself as an effective alternative to more traditional methods such as high-throughput screening (HTS) [13,14]. However, in computational FBDD, they have not yet utilized the power of three-dimensional (3D) structural information for this reason. Successful fragment making often requires co-structures of the fragments bound to their target proteins and various biophysical and biochemical assays to track potency and efficacy. We have incorporated molecular docking to measure the affinity for new ligands obtained from our method.
Additionally, the binding energy of the entire molecule with the target can be considered the sum of individual binding energy between the fragments and the target. Despite this affirmation, identifying suitable fragments that bind to the neighboring binding sites is an obstacle. An essential advantage of the FBDD approach is that once the detailed interaction within the pocket is experimentally validated and clearly understood, it could provide a huge opportunity to design potent drug-like compounds [12]. The construction of a fragment library is the first step for FBDD. Several factors should be considered, including [12]: (i) the difference between fragments and hits or leads, wherein [15] propose a rule-of-three (RO3) for the construction of a fragment (molecular weight is < 300, cLogP is ≤3); (ii) the size of the fragment library; (iii) structural diversity of the fragment library; (iv) the solubility of fragments; and (v) the drug-likeness of fragments. We used the FBDD strategy called the deconstruction-reconstruction approach (DRA) in this work, which has gained attention in recent years [16]. The first step of DRA is to deconstruct known ligands into several fragments obtaining a fragment library. A size of 500 to 3000 components is recommended [17]. The second step is to reconstruct these fragments selected from the relevant fragment library into a new scaffold. Classic guidelines should govern the reconstruction stage, such as Lipinski's rule of five and Veber's rules to maintain suitable drug-like physicochemical properties, including molecular weight and volume [18,19]. For many years the DRA has been explored, showing it being helpful for target proteins. The main reason is that many reported ligands can provide robust data to create novel and promising scaffolds for further development.
This work presents two computational approaches aimed to decrease the searching time for potential high-quality ligands for Hsp90 inhibitors. The two approaches use a fragment-based drug design (FBDD) strategy for designing new ligands. A semiexhaustive search strategy and a heuristic search algorithm that focuses on decreasing the computational required by the first approach. We have constructed a library of several Hsp90 ligands that have already been addressed [20][21][22][23][24] to evaluate our method. However, the problem of finding acceptable ligands with a high inhibition level remains unsolved. Currently, a database of more than 70 Hsp90 ligands [25] is available. We use this database for the experiments.

Semi-Exhaustive Approach
The semi-exhaustive approach follows an FBDD technique. Mainly, we use a DRA for the computations of new ligands. In Figure 1, we can observe the stages of the algorithm. In the following, we explain each step: Stage 1. Deconstruction. A set of N KL Hsp90 ligands, called known ligands (KL), is selected for the fragmentation stage. Each ligand is cut off in sub-ligands considering the following constraints extracted from RO3 [15]: (i) the molecular weight must be <300; (ii) cLogP must be ≤3 ; (iii) the number of hydrogen bond donors must be ≤3 and (iv) the number of hydrogen bond receptors must be ≤3 [26]. Let N KL be the number of known ligands. Each ligand L i , ∀i = 1, . . ., N KL is fragmented in m i sub-ligands (fragments). Then a total of N FL = ∑ N i=1 m i fragments are generated from KL. Then the number of sub-ligands stored in the folder FL is N FL . Stage 2. Duplicity and Cleaning. For each sub-ligand obtained in the first stage, N FL , we apply a duplicity check process. The process compares each sub-ligand with the rest of sub-ligands in the set to check if a duplication exists. For this goal, we use the Tanimoto coefficient, where the similarity between each pair of sub-ligands is computed [27]. If there are duplicated sub-ligands, they are deleted. Let D F be the number of duplicated sub-ligands, therefore the number of sub-ligands now is N SL = N FL − D F . It is important to note that if a sub-ligand appears several times, all their copies are deleted, and just one remains.  Stage 3. Sub-Ligands. The folder SL* contains all sub-ligands obtained in the previous stage. Each sub-ligand that includes an asterisk identifier indicates that it contains a broken bond. Another folder called SL contains each sub-ligand without the asterisk identifiers. In this set, we have the same sub-ligands from SL*, but without the identifiers. We recall that each of these folders contain N SL sub-ligands.

KL
Stage 4. Reconstruction. The folder SL* contains the header of each sub-ligand, i.e., the sub-ligand with the asterisks identifying broken bonds. The folder SL contains the body of the sub-ligands, i.e., the sub-ligands without the asterisk identifiers.
The process of union between sub-ligands starts taking a header from set SL* and a body from set SL. Let consider two sub-ligands, sub-ligand A (from SL*) and B (from SL).
The body of sub-ligand B bonds into the first asterisk identifier on the header of subligand A. Then, if asterisk identifiers remain in the sub-ligand A, a random sub-ligand from SL is coupled. This last step is repeated until no free identifiers remain in A. The process is repeated for each sub-ligand B in SL with each sub-ligand A in SL*. Each new ligand obtained from the union process is stored in the folder NL. Let recall that we have N SL sub-ligands in each folder, SL* and SL, then the reconstruction process will generate N 2 SL new ligands. If a new ligand does not satisfy Lipinski's Rule of Five [18] and Veber's rules [19], then it is rejected. Let R NL be the number of new ligands which are rejected, then the total of new ligands is N NL = N 2 SL − R NL . Stage 5. Duplicity and Cleaning. As in Stage 2, a duplicity and cleaning process is applied in the same way. Each new ligand is compared with the rest of the ligands in set NL, checking for duplication. If so, then all the duplicated new ligands are deleted, and just one remains. Let D NL be the number of duplicated new ligands, then the number of new ligands now is N = N NL − D NL . Therefore the total number of new ligands obtained from known ligands is: where N FL is the total fragments from KL library, D F is the number of deleted fragments due to duplicity, D NL is the number of new ligands deleted from duplicity, and R NL is the number of rejected new ligands. Stage 6. Validation. The validation stage consists of computing the binding energy (BE) of each new ligand stored in NL. The process consists in calculating the interaction between the Hsp90 protein and a specific ligand. Then, considering several possible docking positions, the BE is computed. If the size of a ligand is larger than the grid specified for docking with the Hsp90, the evaluation is not possible. These ligands remain valid but not available. From all docking positions, it considers the most negative energy. The molecular docking software Autodock [28] is used in this stage.
Deconstruction example. The ligand PU3 is presented in Figure 2. If this ligand is cut off (deconstruction), then a total of five sub-ligands can be obtained. In Figure 2 the corresponding sub-ligands are presented as well. The open-source software OpenBabel [29] is used to convert a ligand, from PDB (protein data bank) format to SMILES (simplified molecular-input line-entry system) format [30]. The line notation of SMILES format allows to work the molecular representation of the ligands in a simpler manner. The reading of files with chemical structures in this format becomes easier to handle than with PDB format. We use the SMILES format from stages 1 to 5 (deconstruction, clean/duplicity, sub-ligands/reconstruction, and clean/duplicity). The open-source software PyMol [31] is used for the visualization.
Automated docking was used to locate new compound appropriate binding orientations and conformations into the Hsp90 binding pocket. The structure of Hsp90 used in molecular docking was obtained from X-ray crystal data in RCSB Protein Data Bank (PDB ID: 5LQ9). The molecular docking calculations were carried out by using the AutoDock4 software. The protein structure was prepared by removing water, adding Kollman charges and polar hydrogen. First, the residues around 6Å of the co-crystallized ligand were considered the binding site for docking calculations. Note that this box centered in the co-crystallized ligand involved two characteristic regions: (i) Site 1, corresponding to ATP-binding site and (ii) Site 2, which represents an internal region (see Figure 3). All the molecules from the data set were docked into the active site of Hsp90. Then, a grid map was generated by the auxiliary program AutoGrid4 using x, y, and z coordinate to find a small but more representative grid center for the active site. The grid dimensions were set to 45 × 45 × 45 points, with a grid spacing of 0.375Å. The number of docking runs was set to 100. The population in the genetic algorithm was 150. After docking, the 100 docked poses were clustered into groups with RMS deviations lower than 1Å. Among the entire cluster of complexes predicted by AutoDock4, the most populated cluster conformation, together with the lowest energy conformation for the most active compound docked to the receptor. The computation equipment used in the experiments is an Intel Core i7 with 3.2 GHz of processing on Ubuntu 16.04.

Heuristic Approach
We propose a heuristic-based approach to find new high-quality ligands from the sub-ligands obtained from the deconstruction process shown in Figure 1 in reduced computation times. This process works as a replacement for the reconstruction process shown in stage 4 in Figure 1.
The objective is to reduce the time involved in the reconstruction process using a local search approach capable of quickly constructing high-quality new ligands. Next, we present the representation and evaluation function used in our algorithm. Also, it shows the main structure of the local search approach and the search operators used.

Representation
A list of fragments represents a solution. Each solution contains a unique header and one or more bodies. The number of bodies depends on the number of available bonds (asterisk identifiers) in the header structure. Figure 4 shows an example of the representation of a solution. The solution contains the header in the first position, followed by two bodies. It uses SMILES format for the elements in the list. In the lower area, the two-dimensional model is shown as well.

Evaluation Function
The evaluation starts checking the feasibility of the list of fragments and then evaluates its quality. Feasibility is a filter according to the Lipinski's rules to determine that each new ligand is valid in a pharmacological environment [18]. Once validated, it computes the energy of the new ligand using the AutoDock Vina [32] software and the same conditions described in Section 2.1. The number of positions to test is considered a parameter of the algorithm (pos).

General Structure
Algorithm 1 shows the structure of the local search algorithm proposed here. It corresponds to a simulated annealing-based approach that uses two repairing operators to generate a new ligand at each iteration. It can accept worse quality solutions based on a probability controlled by a cooling schedule. This cooling schedule allows more deteriorations in the early stages of the search process and reduces their acceptance at the end of the process. The algorithm requires the parameter values for reduction factor (α) and initial and final temperature (initial_temp and f inal_temp).
The algorithm performs a fixed number of times the whole simulated annealing procedure. This number of repetitions is required to change the header structure of the current solution during the process. The parameter max_tries controls the number of repetitions. In each step, a local search process iterates from the given initial temperature to the final temperature (lines 5 to 22). At each iteration, temperature is reduced by α factor (line 21).
The approach implements two movements: change and swap. At each iteration, the process searches for the best combination of bodies for a given header. Hence, the first movement selects a random body from the solution and changes it by a different body available in the library (line 6). If the movement generates an improvement in the energy of the current ligand, a second movement is applied to evaluate the distribution of these bodies in the available bonds of the header (line 9). Movements are explained in Sections 2.2.5 and 2.2.6.
The algorithm performs the simulated annealing cooling scheme to accept worsening solutions only during the use of the change operator (line 14). It is expected that this operator generates more significant changes in the structure of solutions, and hence, it is allowed to generate deteriorations in some possible ligands as well.

Initial Solution Generation
In this step, the algorithm performs a competition between random solutions. First, it generates a solution choosing its header structure from the available ones. From its available n b bonds, the algorithm selects n b bodies to complete a new ligand. The selection of these bodies is random.

Change Movement
This movement consists of changing one of the bodies in the current solution. A random body from the current solution is selected and replaced by a new random body. To reduce the neighborhood of this movement, we use the Tanimoto indicator to determine the similarity between bodies. A Tanimoto similarity higher than 0.0 defines that two bodies belong to the same neighborhood. Figure 5 shows an example of the application of this movement. An input ligand and its corresponding output are in the upper and lower part, respectively. Each one is shown with its SMILES format and its two-dimensional representation. Moreover, it also shows the two-dimensional structure of both corresponding ligands on the right of the figure. In this example, the first body is replaced by a new body randomly selected using its Tanimoto similarity neighborhood from the SL set.

Swap Movement
This movement changes the order of the locations of bodies in the available header's bonds. This step occurs after applying the change operator to the bodies. The goal of this movement is to improve the quality of the solution only if a promising new ligand was found. It is important to remark that the movement is not applied if the current solution contains only one body. Figure 6 shows an example of the use of this movement. It shows the same information as in Figure 5. This input ligand corresponds to the output ligand obtained in the change movement example. The movement only changes the location of two bodies, but this can lead to essential changes in the final ligand structure. It shows these changes in the right part of Figure 6 with the two-dimensional representation of the resulting ligands.

Results
An Hsp90 ligand library with 70 components is selected and used [25]. As a preliminary check, we evaluate the binding energy (BE) for each ligand. The BE is the best binding energy from 25 docking positions. The worst energy level found is 9.2, and the best one is −11.3. Two ligands (a hydroxy-indazole and an amino-quinazoline) reach the best value. The average BE is −9.2, and the 92% of ligands in the library have energy levels lower than −6.9 and closer to 40% lower than −10.
The experiment design is the following: (i) it considers eight groups of ligands (eight different KL libraries), where each one has a different number of ligands, (ii) for each group, the semi-exhaustive process obtains a new ligand library, (iii) for each group, the heuristic approach obtains a new ligand library, and (iv) the BE is computed for each new ligand from both libraries.
Each ligand from the KL library belongs to a particular family: (A) resorcinol, (B) hydroxy-indazole, and (C) "others" (including aminothienopyridine, 6-hydroxyindole, 7-imidazopyridine, 2-aminopyridine, adenine, 7-azaindole, aminopyrrolopyrimidine, benzamide, and aminoquinazoline). For more details see Tables 1-3.  [25].  Table 2. Hsp90 inhibitors: hydroxy-indazole [25]. Table 3. Hsp90 inhibitors: "others" [25]. The criterion for defining the groups is to keep diversity between the ligands, except group 8, where the selection is from the two main families (resorcinol and hydroxyindazole). For instance, group 1 contains a ligand from the resorcinol family, a ligand from the hydroxy-indazole family, and three ligands from others. As the number of ligands increases in the groups, the diversity remains. The main goal of the groups definition is that every ligand belongs at least to one group. For each group, the number of ligands from the resorcinol and hydroxy-indazole families together is similar to the number of ligands from "others". Group 8 is the exception, where there are no ligands from "others". Figure 7 shows the Tanimoto matrix indicating the similarity level between ligands in the library grouped by families. We can observe that there is a high similarity between ligands from the same family. A low similarity between ligands from different families can be observed as well. As mentioned before, the incorporation of each ligand in a group is focused in keeping a high diversity in terms of similarity. Table 4 presents the eight groups representing the KL libraries for the experiments.  Table 4. Groups for the experiment. Each ligand belongs to some family [25]. The ligand structure specification is in Tables 1-3. The bold, cursive or underlined ligand belongs to the resorcinol, hydroxy-indazole or "others" family respectively.  Table 5 shows the results of the semi-exhaustive approach performance. It shows the number of known ligands (KL), the best binding energy obtained from KL, the number of fragments (sub-ligands) generated, the number of new ligands (NL), the best binding energy from NL, and the total computational time for each group.

Group KL Library
Interestingly, the semi-exhaustive approach always obtained new ligands that show better binding energy than the best in the corresponding KL library. We can confirm that the deconstruction-reconstruction method can generate high-quality new ligands for the Hsp90. The semi-exhaustive approach obtains a minimum improvement of around 1% for group 8 and a maximum of 40% for group 1. Moreover, it obtains an average improvement of around 16% on binding energy.
From Table 5 it is also interesting to notice that the number of new ligands and the execution time do not necessarily grow exponentially according to the size of the KL library of each group. As expected, this is due to the semi-exhaustive nature of the approach and its duplicity and cleaning stages. Group 1 required only 89.1 minutes to obtain a total of 581 new ligands reconstructed from a library of 5 known ligands. Meanwhile group 8 required around to 3,660 minutes to obtain 51,514 new ligands from a group of 40 known ligands.
The execution time increases with the number of new ligands, but the binding energy does not show the same behavior. Probably, it is because the process can reconstruct new high-quality ligands from fragments available in ligands belonging to different families, but not necessarily from a large KL library. Figure 8 shows the structure of the best new ligand obtained by the semi-exhaustive approach in each group. For simplicity, Table 6 shows each structure in the SMILES format.   Table 7 summarizes the results obtained by the heuristic search approach. It shows the average best binding energy, the standard deviation, and the best binding energy obtained from five executions using different random seeds. It also shows the average computational time and standard deviation of these executions.
Interestingly, the best binding energy obtained by the heuristic algorithm is always close to the results obtained by the semi-exhaustive approach.
The heuristic approach shows a stable behavior in terms of execution time. For group 8, five executions can take around 200 min, 6% of the time required by the semi-exhaustive approach. An advantage of the heuristic approach is being able to work with large KL libraries. This method can be used as an explorative supporting system before the application of the semi-exhaustive or exhaustive version of the first approach. Figure 9 shows the structure of the best new ligand obtained with the heuristic approach in each group. For simplicity, Table 8 shows each new ligand structure in the SMILES format. Figure 10 shows the eight best docking position obtained for each group by the semi-exhaustive approach and the heuristic search approach.  Table 8. New ligands with best binding energy using the heuristic search approach.

Discussions
According to the results, the best binding energy from each group shows an improvement compared to the best energy from the known ligands library (−11.3). Group 8 obtained the worst improvement, while groups 4 and 7 got the highest improvements using the semi-exhaustive approach. We can hypothesize that the variety of the group in terms of the families to which the known ligands belong can influence the binding energy improvement. We can observe that the binding energies obtained in the experiments of group 8 are of lower quality compared to groups 4 and 7. Group 8 is composed exclusively by ligands from the two leading families, while groups 4 and 7 consider ligands from all families. Then, a high variety of fragments and new ligands can be expected during the deconstruction and reconstruction processes. Group 1 obtained the best improvement with an increment of 40% between the best binding energy of the known ligands and the new ligands. Group 8 obtained the worst improvement with just 1%. According to the new ligands obtained with the semi-exhaustive approach, groups 4 and 6 got the same new ligand. However, the binding energy is quite different because the computation of the binding energy uses an approximation method. The final value is not necessarily the same but close. All ligands obtained with the semi-exhaustive approach are structurally similar. A 67% presents slight structural variations.
The heuristic approach obtains high-quality results with low computational times. This approach can be used for studying more extensive and diverse KL libraries. If the new ligands have promising results, then it can be explored exhaustively with the first approach.
According to the computational time, the trend is linear, as we can observe in Figure 11 with the semi-exhaustive approach. However, the computational time with the heuristic search is lower. This result is significant because it allows us to explore a higher number of known ligands.
It is essential to notice that the presented approaches are computational strategies that aim to decrease the search time of new high-quality ligands. Hence, only computational tools have validated the results obtained. A biological evaluation will be part of the future work.

Conclusions
This work presents a semi-exhaustive approach and a heuristic search algorithm based on the deconstruction-reconstruction method for drug design for the Hsp90 inhibition. The results are promising, looking at a future work where we can explore larger set of ligands. The critical point is that the reconstruction algorithm is a stage where we can use several computational techniques, generating a high level of diversity of new ligands. An important challenge is that the problem can scale quickly. This could be addressed using multi-core programming or GPU computing.
The focus explored in this work is obtaining a more significant number of ligands and getting a higher quality of the new ligands by using combinatorial methods. Another critical point is to propose an improved reconstruction process that generates even more diverse molecular structures than the ones developed in this work. An intuitive idea leads us to think that ligands with various combinatorial structures can induce new ones with higher quality in terms of binding energy. This approach can scale to other aspects and other pathologies.
The deconstruction-reconstruction method allows generating high-quality new ligands based on fragments. Independently of the reconstruction technique, the results show that it is always possible to find promising new ligands. Furthermore, our experiments show that the binding energy obtained with the heuristic search is close to the semiexhaustive approach. However, the computational time used with the heuristic approach is considerably lower than the semi-exhaustive. Then, we recommend the use of heuristic search techniques for finding promising new ligands. These techniques allow obtaining high-quality binding energy in acceptable computational times.