Combinatorial Pharmacophore-Based 3D-QSAR Analysis and Virtual Screening of FGFR1 Inhibitors

The fibroblast growth factor/fibroblast growth factor receptor (FGF/FGFR) signaling pathway plays crucial roles in cell proliferation, angiogenesis, migration, and survival. Aberration in FGFRs correlates with several malignancies and disorders. FGFRs have proved to be attractive targets for therapeutic intervention in cancer, and it is of high interest to find FGFR inhibitors with novel scaffolds. In this study, a combinatorial three-dimensional quantitative structure-activity relationship (3D-QSAR) model was developed based on previously reported FGFR1 inhibitors with diverse structural skeletons. This model was evaluated for its prediction performance on a diverse test set containing 232 FGFR inhibitors, and it yielded a SD value of 0.75 pIC50 units from measured inhibition affinities and a Pearson’s correlation coefficient R2 of 0.53. This result suggests that the combinatorial 3D-QSAR model could be used to search for new FGFR1 hit structures and predict their potential activity. To further evaluate the performance of the model, a decoy set validation was used to measure the efficiency of the model by calculating EF (enrichment factor). Based on the combinatorial pharmacophore model, a virtual screening against SPECS database was performed. Nineteen novel active compounds were successfully identified, which provide new chemical starting points for further structural optimization of FGFR1 inhibitors.

considers a single target or a single target-ligand complex [17]. Compared with structure-based model, ligand-based pharmacophore is more frequently used, which extracts common chemical features from aligned compound structures interacting with the same target, based on the hypothesis that compounds interacting with the same protein target may share similar chemical structure and physicochemical properties [18,19]. The pivotal issues of the ligand-based model are the modeling of ligand flexibility, the alignment methods of molecules and the selection of training set. Different pharmacophore models could be derived from different training sets because it is easily affected by the type of the ligand, the site of the dataset and chemical diversity [17].
QSAR model, which quantifies the correlation between structures of a series of compounds and biological activities, is based on the hypothesis that compounds with similar structures or physiochemical properties have similar activities [20]. The development of a QSAR model involves a series of consecutive steps, including: (1) Collect ligands with known activity with the same target; (2) Extract descriptors representing the molecule; (3) Select best descriptors from a larger set of descriptors; (4) Map the molecular descriptors into the biological activity; and (5) Internal and external validation of the QSAR model [21]. Compared with classical QSAR method using fragment-based descriptors such as electronic, hydrophobic and steric features, 3D-QSAR model is based on 3D descriptors such as various geometric, physical characteristics and quantum chemical descriptors, which are useful in describing the ligand-receptor interactions [22]. Statistical tools such as multivariable linear regression analysis (MLR), principal component analysis (PCA) and partial least square analysis (PLS) can be used for linear QSAR modeling, while there are also many non-linear models established using neural network, Bayesian neural network and others machine learning techniques. To validate the QSAR model, internal cross validation is used and to calculate the cross validated R 2 or Q 2 . Besides that, external validation is also necessary, which is performed by predicting the activity of external test sets not used for model building [20]. These 3D-QSAR models have done well when the molecules used for model developing are based on congeneric series of molecules [23][24][25].
Since QSAR model is more suitable for congeneric series of molecules, developing a single model for various chemical structures like inhibitors of FGFR with diverse core skeletons is challenging. In an attempt to design novel chemical entities with FGFR1 inhibition activities, in this study, we performed pharmacophore modeling and 3D-QSAR analyses for different chemical classes of FGFR1 inhibitors, and developed a combinatorial pharmacophore-based 3D QSAR model. For each class of inhibitors, 4-point, 5-point and 6-point pharmacophore hypotheses were generated, respectively, using the active compounds in a training set. Then, alignment conformations of all compounds were generated for subsequent 3D-QSAR modeling [26]. The model was validated by an external test set and a decoys set. Furthermore, virtual screening protocol based on the resulted combinatorial pharmacophore model was performed against a commercial chemical database. Finally, several compounds were short-listed as potential FGFR1 inhibitors by experimental evaluation with biological assay of a fraction of top ranked compounds. Figure 1 shows the flowchart of developing the combinatorial 3D-QSAR model. In each group of inhibitors, hundreds of pharmacophore hypotheses that would be subsequently used for QSAR modeling were generated. The final 3D-QSAR model of each group was selected based on a series of statistic parameters, including R 2 , F, p and stability. R 2 is the mean value of square of the correlation coefficient of regression. F is the ratio of model variance to the observed activity variance and a larger F indicates a more statistically significant regression. p is significance level of variance ratio and smaller values represent a greater degree of confidence. Stability value reflects the stability of the model predictions with changes in the training set composition. Therefore, an ideal QSAR model should have large R 2 , small standard deviation (SD), large F, small p and large stability. Table 1 lists statistic parameters of the combinatorial QSAR model. The R 2 and stability value of all groups are above 0.8 and 0.5, respectively, suggesting that all the components of the combinatorial model are stable and present good correlation between the experimental and predicted values.  It should be noted that the predictive ability of the whole dataset rather than each subset is of more interests. Figure 2 shows the scatter plots of the whole training set and test set. The R 2 of training set and test set are 0.81 and 0.53, respectively, showing that our model could well reflect the linear relationship of the experimental activities and the predicted ones. Additionally, each individual QSAR model was used to predict the entire test set of 232 compounds and the prediction performance of these individual QSAR models were compared to that of the combinatorial QSAR model. As listed in Table 2, the R 2 and the SD of the combinatorial QSAR model is higher and lower than any other single QSAR model, respectively, suggesting that the combinatorial QSAR model has advantage over any single QSAR models when predicting the whole test set. Besides, all the R 2 of each individual model on classified subsets are larger than that on the whole test set. The potential reason is that QSAR models are more effective for congeneric series of molecules and the predictive ability decreases as the structural or physicochemical properties of new molecules varies from that in the training set [20]. This phenomenon suggests that the combinatorial QSAR model is more appropriate for activity prediction of structurally diversified compounds. Based on their similarity to each group of training data, our method predicts a variety of molecules using different QSAR models, which successfully avoid the above disadvantages.  Take group 1 as an example, the predictive 3D-QSAR model and the associated four-point hypothesis are shown in Figure 3, which contains one hydrogen bond acceptor and three aromatic ring features. The core structure of this group is oxindole, which is one of the first discovered structural category of FGFR inhibitors [15]. The ketonic oxygen of the reference ligand (BindingDB50279269) was mapped to the hydrogen bind acceptor, while three phenyl groups mapped to aromatic ring features.  Figure 3B shows the 3D-QSAR model containing two favorable (domain a, domain b) and two unfavorable (domain c, domain d) domains. The favorable and unfavorable interactions are combined effects of different interaction types such as H-bond donor, hydrophobic/non-polar and so on. In order to understand whether a specific feature is favored or disfavored to activity, we decomposed the interaction shown in Figure 3B into the effects from ( Figure 3C) H-bond donor, ( Figure 3D) hydrophobic/non-polar and ( Figure 3E) electron-withdrawing. Given the reference ligand BindingDB50279269, we may find that two H-bond donors, N-1 of the oxindole and the N on the side chain of C-3 oxindole, are just located close to the H-bond donor favorable regions, which may explain the good activity of the compound. Moreover, a survey of existing crystal structures (PDB: 1AGW and PDB 1FGI) [27] revealed that the hydrogen bond interactions are indeed present at these regions. The favorable domain a and unfavorable domain d shown in Figure 3B correspond to hydrophobic interactions, by comparing the contour maps of Figure 3B,D. The electron-withdrawing effects shown in Figure 3E suggested that the methyl acetate group on the C-6 of oxindole structure is beneficial to activity. Figure 4 illustrates favorable and unfavorable interactions of several compounds mapped to the 3D-QSAR model. Compound BindingDB50421033 ( Figure 4A) well occupies two favorable domains and has fewer interactions at unfavorable domains, which shows the highest activity among the training set. Compound BindingDB504210112 ( Figure 4D), which, taking up two unfavorable domains (unfavorable domain c and d), is the least active molecule. Intuitively, activity will decrease when molecules take up the unfavorable domain c, as reflected by comparing BindingDB50421015 ( Figure 4B) and BindingDB50421012 ( Figure 4D). The absence of favorable domain a leads to the activity decrease in comparison of BindingDB50421015 ( Figure 4B) and BindingDB4811 ( Figure 4C). Sun et al. [28] reported that a substitution of electron-withdrawing groups on the phenyl ring of the oxindole can improve the inhibitory activity, which is consistent with the conclusion that the domain b has a positive contribution for maintaining the activity. A decoy set of 7897 compounds including 232 inhibitors was used to further evaluate the ability of this combinatorial model to identify actives from a relatively large dataset. As shown in Table 3, the maximum EF values of all groups appear at 1%-2%, meaning that when we screen the database, true positive compounds can be efficiently recognized among the top ranked compounds. Figure 5 shows the EF curve of the combinatorial QSAR model against the whole decoy dataset. The curve shows a peak when the percent of database screened is less than 5%, illustrating that our model is suitable for screening potential actives from a large database.  One application of our model is to perform virtual screening to find hits with desired activity from a large chemical database. After screening the SPECS database, 100 compounds were purchased for biological assay, and those with potential PAINS (Pan Assay Interference Compounds) structures [29] were excluded. Figure 6 illustrates the results of experimental assays. The selected compounds were firstly screened at 50 μM concentrations in vitro FGFR1 kinase assays, and 19 of them displayed ≥50% inhibition of FGFR1 at 50 μM compound concentration. Then, the 19 compounds were further tested at 10 μM compound concentration, and five compounds display ≥40% FGFR1 inhibition, i.e., compound 56, 75, 88, 91, 97. Table 4 shows the structure and the inhibition ratios of these active compounds. To investigate the structural novelty of the hit compounds, we further compared the similarity between the hits and the all the reported active FGFR1 compounds from BindingDB. For each hit compound, the structure of the most similar known FGFR1 inhibitor in BindingDB (and the corresponding FCFP4 similarity) is list in Table 4. The low similarity values highlighted that the hits identified in this study are of excellent structural novelty. In the end, the five compounds display ≥40% FGFR1 inhibition were further evaluated to determine their IC50 values against FGFR1 kinase. Among them, compounds 88, 91 and 97 rapidly precipitated at higher concentrations due to poor solubility. Compounds 56 and 75 showed IC50 values of 7.9 and 55.5 μM, respectively.

Dataset
A reasonable dataset is of great importance for model establishment and several criteria should be satisfied: (i) As many as possible compounds should be included to ensure the statistical significance; (ii) The range of biological activity should be wide enough; and (iii) The range of biological activity of the training set and the test set should be same or comparable [30]. The dataset used in this study was collected from the BindingDB database [31], which contains 1174 entries of experimentally measured IC50 of FGFR1 inhibition. Finally, a database containing 713 compounds was obtained after filtering redundancy. Then, the dataset was randomly divided into a training set (481 compounds) and a test set (232 compounds) using Discovery Studio 3.0 in a ratio of 2:1. As shown in Figure 7, the biological activity range and median value of training set and test set are similar, suggesting that the dataset we used for QSAR modeling was reasonable. In view of the structural diversity of FGFR1 inhibitors, it is unpractical to develop a single model to interpret all types of structure-activity relationships. To address the issue, the compounds in training set were firstly divided into different groups according to their core skeletons. If the number of compounds of a group was less than 15, the group was not taken into consideration due to less statistical significance. Finally, eight groups of FGFR1 inhibitors were generated (Table 5). BindingDB IDs of compounds in training set and test set are provided in Supplementary file 1.

Conformation Analysis
Pharmacophore hypotheses were generated using the Phase 3.1 implemented in the Maestro 9.0 software package (Schrödinger, LLC). Phase translates the ligands into bit strings by applying a tree-based partitioning algorithm and distinguishes multiple binding modes using a bi-directional clustering approach [32]. Since pharmacophore modeling requires all-atom 3D structures to represent the active form of the inhibitor, it is crucial to consider a variety range of conformations so as to increase the possibility of finding the one close to the natural bound structure. Training ligands were firstly cleaned up to ensure the structures are in 3D, and the count ions and water molecules are excluded. Additionally, parameters were set to retain specified chiralities, generate a maximum of 32 stereoisomers and ionize at target pH 7.4. Once the clean-up ligand structures were generated, a conformational search was carried out using the ConfGen module of Maestro with default parameters, to generate a set of conformers for each structure. Potential energy calculation was carried out using the OPLS_2005 force field. An RMSD cutoff of 1.00 Å was used for eliminating redundant conformations.

Generate Common Pharmacophore Hypothesis (CPH)
This procedure contains two steps: (1) Create sites and (2) Find common pharmacophores. In order to develop reliable QSAR models, a training set should contain a widely range of biological activity including high active, moderate active and inactive molecules. The high active molecules were used for pharmacophore modeling, while all of molecules were for QSAR modeling and testing. In this study, pIC50 values greater than 7.0 were defined as highly active molecules, pIC50 values less than 5.0 were defined as inactive molecules and the remaining in-between values were considered to be moderately active. Pharmacophore sites for ligands including hydrogen bond acceptor (A), hydrogen bond donor (D), hydrophobic group (H), negatively charged group (N), positively charged group (P) and aromatic ring (R) were generated in the step of creating site.
In the step of finding common pharmacophores, pharmacophores of all actives were examined and those containing identical features with similar spatial arrangements were grouped together. In this section, CPHs with 4, 5 and 6 pharmacophore sites were considered. For most of the groups, all the actives must be matched to the CPH. However, if no CPH was found, the number of actives that must be matched would be reset as a number that is less than the total but higher than two-third of the total number of actives. With a tree-based partitioning technique, CPHs were searched according to their inter-site distance.

Score Hypotheses
The function of score hypothesis step measures the quality of alignment and provides a ranking of different hypotheses. The top 10% of the pharmacophore hypotheses were kept according to their survival score, which was combined by alignment score, vector score and volume score with specific weights. Ligands with vector scores above 0.5 Å and RMSD values of inter-site distances to the reference below 1.2 Å were retained. Then, the retained hypotheses were clustered to eliminate those that resemble each other or have very similar scores. One single representative for each cluster was retained for the followed QSAR modeling.

Build QSAR Models
The 3D-QSAR modeling can be classified into atom-based and pharmacophore-based types, in which the structure of ligand is represented by van der Waals models of atoms and pharmacophore features with a specified radius, respectively. The atom-based QSAR model may work well in the case of compounds that contain only limited structural flexibility or have some common structure features, whereas the pharmacophore-based QSAR models may be more appropriate when the structures are highly flexible and exhibit significant diversity [32]. In this study, an atom-based 3D-QSAR model was constructed for each group of compounds as the compounds share the same or similar scaffolds. There are six atom types in atom-based model: hydrogen-bond donor (D), hydrophobic or nonpolar (H), negative ionic (N), positive ionic (P), electron-withdrawing (includes hydrogen-bond acceptors, W), and miscellaneous (X). Each atom of a compound is assigned to a specific atom type and represented by a sphere with the van der Waals radius [19]. Then, cubes with 1 Å grid were defined to cover the space occupied by these aligned conformations, and then assigned to zero or ones based on whether the cube was occupied by atoms or sites. Thus, a molecule can be represented by a string of zeros and ones and treated as a pool of independent variables. Finally, QSAR models were built by adding PLS factors to these independent variables [32]. Here, the PLS factor was set to three to avoid over-fitting. A regression coefficient was assigned to each bit to facilitate the identification of specific chemical features that favorable or unfavorable to the activity. A leave-one-out (LOO) cross validation analysis was performed to evaluate the predictive ability of QSAR model. Finally, a series of 3D-QSAR models were generated.

Construct Combinatorial 3D-QSAR Model
In each group of inhibitors, multiple pharmacophore-based 3D-QSAR model candidates were generated. A filtering procedure was then performed to retain a single QSAR model for each group, based on several statistic parameters, including R 2 , F, p and stability. Finally, the resulted eight pharmacophore-based QSAR models were used in combination when predicting the activity of a new compound. Since there are eight different QSAR models, a test compound can obtain different prediction values. Here, only one appropriate QSAR model was selected for one test compound, which was determined by the chemical similarity between the test compound and the training compounds used to establish the model. Given a test compound, it was compared to each compound in every group of the training set to calculate Tanimoto coefficient (Tc) of the pairwise chemical similarity. The averaged Tc values of 8 groups was then sorted, and the corresponding QSAR model of the group yielding the highest average Tc value was selected to predict the activity of the test compound.

Model Validation
The combinatorial QSAR model was firstly validated by an external test set with known activities.
To further evaluate our model for virtual screening, namely, the ability of the combinatorial QSAR model in assigning high ranks to the known actives, a decoy dataset composed of 7665 compounds was collected from the SPECS database (http://www.specs.net) in a proportion of 36:1 to the test set, using DecoyFinder [33]. Of note here is that DecoyFinder may introduce potential biases for the ligand-based virtual screening, and it should be compared with other structure-based approaches with caution [34,35]. Compounds in the decoy dataset were also divided into 8 groups based on the similarity comparison described above. Each structure was subjected to ligand preparation using Ligprep and conformations generation using the ConfGen. The corresponding QSAR model was used to predict the activity using the "find matches to hypothesis" option, which finds matches from a database to a selected hypothesis and calculates activity if the hypothesis has a QSAR model. We then calculated enrichment factors (EF) of the decoy set to evaluate the ability of identifying actives from inactives. EF is calculated by Equation (1) where Hitss is the number of actives in the selected front fraction of the ranked list, Hitst is the quantities of actives in database, Ns is the total number of compounds in the selected fraction of the database and Nt is the quantities of compounds in database.

Pharmacophore-Based Virtual Screening and 3D-QSAR Analysis
In this study, the combinatorial 3D-QSAR model was used to virtually screen the commercial chemical SPECS database. Firstly, the database was filtered by Lipinski rule of 5 to ensure that the selected compounds have better drug-like properties; Secondly, the compounds were prepared using the same procedures in pharmacophore modeling, and each of them was assigned to a specific group for pharmacophore mapping. The mapped compounds were extracted for activity prediction using the combinatorial 3D-QSAR model; Finally, the top ranked 100 compounds with predicted activity more than 5.0 were purchased for experimental evaluation with enzyme assay.

Enzyme Assay
Enzyme-linked immunosorbent assay (ELISA) was used for enzyme assay in this study. Immunoplates were coated with 125 mL/well of 20 mg/mL of enzyme reaction substrate Poly for 12-16 h at 37 °C. Wells were washed with 200 mL of T-PBS and dried. The reaction was initiated by the addition of 50 μL FGFR1 kinase domains recombinant protein in a shaker after ATP buffer (49 μL) and diluted samples (1 μL) being loaded and samples incubated at 37 °C for one hour. The reaction was terminated by adding 100 μL PY99 antibody dilution to each well and incubated in a shaker for 0.5 h at 37 °C. The precipitations were washed three time with T-PBS. Then, 100 μL of HRP-conjugated goat anti-mouse antibodies were added and incubated in the same way. Following three washing cycles in T_PBS, the chromogenic reaction was induced by 100 mL/well of 2 mg/mL OPD developing solution, and the reaction was allowed to proceed for 1-10 min in the dark at 25 °C. The reaction was stopped with 50 μL of 2 M H2SO4 and absorbance at 490 nm was measured with VERSA max. The IC50 values were calculated by fitting with the parameter of the Hill equation.

Conclusions
FGFR1 is an important therapeutic target for various malignancies and disorders, and it has attracted a lot of attention from pharmaceutical companies and researchers to find novel FGFR1 inhibitors. In this study, we developed an effective strategy for identifying novel FGFR1 inhibitors using pharmacophore-based virtual screening and 3D-QSAR analyses. Various FGFR1 inhibitors were categorized into different groups based on their core structures, and each group was used for pharmacophore and 3D-QSAR modeling. Then, the resulted models were combined to construct a combinatorial 3D-QSAR model, where a new molecule must be assigned to a specific group based on similarity searching prior to the activity prediction using a particular 3D-QSAR model. The performance of the combinatorial 3D-QSAR model has been validated using an external test set and a large decoy dataset. Subsequently, the SPECS database was screened by multiple pharmacophores and 3D-QSAR predictions. Nineteen hits exhibited more than 50% FGFR1 inhibition at 50 μM concentration. Among the nineteen compounds, two compounds appeared over 50% inhibition of FGFR1 and five compounds displayed over 40% FGFR1 inhibition at 10 μM concentration. IC50 values of compound 56 and compound 75 were 7.9 and 55.5 μM, respectively. Although the activities are not strong compared to known FGFR1 inhibitors, these compounds possess novel scaffolds not previously reported as FGFR1 inhibitors. Overall, the presented method is a useful alternative to traditional virtual screening methods, and the obtained active compounds provide new chemical starting points for further structural optimization of FGFR1 inhibitors.