Combining a QSAR Approach and Structural Analysis to Derive an SAR Map of Lyn Kinase Inhibition

Lyn kinase, a member of the Src family of protein tyrosine kinases, is mainly expressed by various hematopoietic cells, neural and adipose tissues. Abnormal Lyn kinase regulation causes various diseases such as cancers. Thus, Lyn represents, a potential target to develop new antitumor drugs. In the present study, using 176 molecules (123 training set molecules and 53 test set molecules) known by their inhibitory activities (IC50) against Lyn kinase, we constructed predictive models by linking their physico-chemical parameters (descriptors) to their biological activity. The models were derived using two different methods: the generalized linear model (GLM) and the artificial neural network (ANN). The ANN Model provided the best prediction precisions with a Square Correlation coefficient R2 = 0.92 and a Root of the Mean Square Error RMSE = 0.29. It was able to extrapolate to the test set successfully (R2 = 0.91 and RMSE = 0.33). In a second step, we have analyzed the used descriptors within the models as well as the structural features of the molecules in the training set. This analysis resulted in a transparent and informative SAR map that can be very useful for medicinal chemists to design new Lyn kinase inhibitors.


Introduction
Many signaling pathways transmit extracellular signals by altering the phosphorylation state of tyrosine residues. Phosphorylation of proteins in which tyrosine amino acid residue is phosphorylated by tyrosine kinases by the addition of a covalently bound phosphate group of ATP (adenosine triphosphate) [1], accounts only 0.1% of total protein phosphorylation in mammals. However, tyrosine kinases play a key role in the regulation of many biological phenomena such as cell proliferation, differentiation and motility. There are two families of tyrosine kinases: receptor tyrosine kinases (RTK) and non-receptor tyrosine kinases (NRTK) [2].
The existence of multiple conformations of kinases (active and non-active state) and the structural diversity of the ATP-binding site as well as the activation loop provide different strategies for designing inhibitors. Some inhibitors, by binding into the active site of the receptor tyrosine kinase, block the In the present study, using known ligands and their inhibitory activities against Lyn kinase, we constructed and validated predictive models using QSAR approach. We have also analyzed the selected molecular descriptors and the structural fragments of the inhibitors to draw a SAR map for the inhibition of Lyn kinase that can be used to build new and potentially active inhibitors.

Dataset Source and Preparation
A set of 440 molecules with their two-dimensional atomic coordinates and IC 50 were fetched from the BindingDB database [33]. This set was reduced to 176 molecules by applying a set of filters: (1) Lipinski's rule (number of hydrogen-bond donors less than 5, number of hydrogen acceptor less than 10, molecular weight less than 500 and log P less than 5; and the sum of donors and acceptors (N + O) less than 10) [34], (2) filtering out duplicates (we kept only the molecule that has the highest IC 50 ) and (3) removing all the molecules without reported IC 50 for Lyn kinase.
We randomly distributed the 176 molecules into two subsets: 123 molecules (70%) represent the training set to derive and validate internally the models and 53 molecules (30%) for the test set, to perform external validation and assessment of its extrapolation capacity to new data (Supplementary Materials). The targeted biological property of our QSAR study is IC 50 (concentration of the ligand that induces 50% of the inhibition of the enzyme activity). The IC 50 values have been converted to molar units pIC 50 (defined as −log 10 IC 50 ). The distributions of pIC 50 values (max = 4.34; min = 9.30) within the training and test set reproduced their distributions within the whole set.

Calculation of Molecular Descriptors
A set of 184 two-dimensional molecular descriptors were calculated by Molecular Operating Environment (MOE) package (version 2008.10, Chemical Computing Group, Montreal, Canada) [35]. These descriptors cover different classes of molecular parameters such as chemical constitution, topology, geometry and electrostatic properties, wave function, potential energy surface or some combination of these items for a given chemical structure.

Diversity Analysis
The structural diversity of the data was defined by using Principal Component Analysis (PCA) which is a powerful approach for exploring high-dimensional data [36]. We calculated the principles components (PC) using JMP (14.0.1) package [37] for a data matrix p × n dimension where n = 176 inhibitors and p = 184 descriptors.

Descriptors Selection
The 184 molecular structural descriptors for the 123 inhibitors of the training set data have been reduced sequentially using two phases: (1) We first used variable importance calculated from Partial Least-Squares (PLS) method where we excluded all the descriptors that have a Variable Importance less than 1, (2) in a second step, the resulting descriptors were submitted to a stepwise forward selection.

Model Development and Validation
To build our models, we used a training set of 172 molecules selected randomly from the initial data set. To fit the physico-chemical properties of the training set to the pIC 50 values, we used Generalized Linear Model (GLM) as a linear discriminant analysis method [38] and Artificial Neuronal Network (ANN), with feedforward backpropagation to train the model, as a nonlinear method [29,39]. Both methods are implemented in JMP software package [37]. For the GLM, we generated our models using a normal distribution, a unitary link function, and Maximum likelihood as estimation method. For ANN models, we varied the hidden layer size from range 3 to 12 neurons and used 10-fold cross-validation repeated 10 times as an internal validation method process to validate the models. For the external validation of the model, we used a test set containing 53 molecules selected randomly from the initial data set.

Domain of Applicability of the Models
To assess the reliability of the QSAR model for prediction purposes, we defined a domain of its applicability using a Mahalanobis distance-based approach [40]. The Mahalanobis distance to the training set is calculated for each molecule to be predicted. This distance, compared to Euclidian distance, accounts for the covariance among variables [40]. We have implemented a python program implementing the Mahalanobis distance algorithm as defined below.
In general, if → x = [x 1 , x 2 , . . . , x p ] T and → µ = [µ 1 , µ 2 , . . . , µ i ] T are multivariate data-observations drawn from a set of p variables with a p × i covariance matrix C, then the Mahalanobis distance DM between them is defined as: where DM 2 = Mahalanobis distance; x = Vector of data; µ = Vector of mean values of independent variables; C −1 = Inverse Covariance matrix of independent variables and T = Transposed matrix.

Diversity Analysis
During the split of initial data (all data set) into training set of 123 molecules and a test set of 53 molecules, we ensured that the distribution of pIC 50 value remains the same in the training and test sets as in the initial data set ( Figure 1).

Domain of Applicability of the Models
To assess the reliability of the QSAR model for prediction purposes, we defined a domain of its applicability using a Mahalanobis distance-based approach [40]. The Mahalanobis distance to the training set is calculated for each molecule to be predicted. This distance, compared to Euclidian distance, accounts for the covariance among variables [40]. We have implemented a python program implementing the Mahalanobis distance algorithm as defined below.
In general, if ⃗ = [x1, x2, ..., xp] T and µ �⃗ = [µ1, µ2, ..., µi] T are multivariate data-observations drawn from a set of p variables with a p × i covariance matrix C, then the Mahalanobis distance DM between them is defined as: where DM 2 = Mahalanobis distance; x = Vector of data; µ = Vector of mean values of independent variables; C −1 = Inverse Covariance matrix of independent variables and T = Transposed matrix

Diversity Analysis
During the split of initial data (all data set) into training set of 123 molecules and a test set of 53 molecules, we ensured that the distribution of pIC50 value remains the same in the training and test sets as in the initial data set ( Figure 1). The PCA analysis of the molecular descriptors space explained 56.34% of the global information of the original space (PC1: 35.4%; PC2: 12.8% and PC3: 8.14%). This analysis showed that the molecules in the training set and the test set were distributed homogeneously in the PCA space resulting in a good structural diversity in the data (Figure 2). This is in agreement with the different chemotypes represented in the initial data as shown in table 1.  The PCA analysis of the molecular descriptors space explained 56.34% of the global information of the original space (PC1: 35.4%; PC2: 12.8% and PC3: 8.14%). This analysis showed that the molecules in the training set and the test set were distributed homogeneously in the PCA space resulting in a good structural diversity in the data (Figure 2). This is in agreement with the different chemotypes represented in the initial data as shown in Table 1.

Descriptors Pertinence
The initial descriptor pool number (184 descriptors) was first reduced by eliminating out the descriptors with constant and near constant values. PLS was then used to further reduce the number of descriptors according to variable importance in the model. In fact, the PLS model resulted in a coefficient of determination R 2 of 0.72 and a cross-validated coefficient of determination q 2 of 0.63. When the variable importance threshold was set to the unit value, only 80 descriptors were retained. After using a stepwise forward selection procedure, the set of descriptors was further reduced to 35 descriptors that were then subjected to the data modeling step with the aim to find the best fit between the descriptors and the inhibitory activities of the molecules. These descriptors account for 7 different molecular categories as defined in Table 2.

Descriptors Pertinence
The initial descriptor pool number (184 descriptors) was first reduced by eliminating out the descriptors with constant and near constant values. PLS was then used to further reduce the number of descriptors according to variable importance in the model. In fact, the PLS model resulted in a coefficient of determination R 2 of 0.72 and a cross-validated coefficient of determination q 2 of 0.63. When the variable importance threshold was set to the unit value, only 80 descriptors were retained. After using a stepwise forward selection procedure, the set of descriptors was further reduced to 35 descriptors that were then subjected to the data modeling step with the aim to find the best fit between the descriptors and the inhibitory activities of the molecules. These descriptors account for 7 different molecular categories as defined in Table 2.

Descriptors Pertinence
The initial descriptor pool number (184 descriptors) was first reduced by eliminating out the descriptors with constant and near constant values. PLS was then used to further reduce the number of descriptors according to variable importance in the model. In fact, the PLS model resulted in a coefficient of determination R 2 of 0.72 and a cross-validated coefficient of determination q 2 of 0.63. When the variable importance threshold was set to the unit value, only 80 descriptors were retained. After using a stepwise forward selection procedure, the set of descriptors was further reduced to 35 descriptors that were then subjected to the data modeling step with the aim to find the best fit between the descriptors and the inhibitory activities of the molecules. These descriptors account for 7 different molecular categories as defined in Table 2.

Descriptors Pertinence
The initial descriptor pool number (184 descriptors) was first reduced by eliminating out the descriptors with constant and near constant values. PLS was then used to further reduce the number of descriptors according to variable importance in the model. In fact, the PLS model resulted in a coefficient of determination R 2 of 0.72 and a cross-validated coefficient of determination q 2 of 0.63. When the variable importance threshold was set to the unit value, only 80 descriptors were retained. After using a stepwise forward selection procedure, the set of descriptors was further reduced to 35 descriptors that were then subjected to the data modeling step with the aim to find the best fit between the descriptors and the inhibitory activities of the molecules. These descriptors account for 7 different molecular categories as defined in Table 2.
The selected descriptors cover the main structural features of the molecules needed for their biological activity. In fact, the physico-chemical properties such as logP, logS, MR, apol TPSA, logP and Subdivided Surface Areas represent molecular features that could explain the bioavailablity of the drugs. Pharmacophoric features, connectivity and shape indices as well as partial charge properties are features that represent the mode of interaction of drugs with their targeted receptor. Finally, atom and bond accounts and adjacency and distance matrix descriptors are features representing the topology as well as the geometry of the molecules.  The BCUT descriptors using atomic contribution to molar refractivity (using the Wildman and Crippen SMR method) instead of partial charge

QSAR Model Derivation and Validation
Using GLM approach to fit the 35 selected descriptors to the pIC 50 values of the training set resulted in a weak predictive power as judged by the correlation coefficient between experimental and predicted values of R 2 = 0.65 and a Mean Square Error RMSE = 0.64. When the model is applied to the test set, the correlation coefficient drops down to a value of R 2 = 0.39 and RMSE = 0.85. Consequently, GLM was not able to provide neither a predictive model for the molecules in the training set for the inhibition of Lyn kinase nor an extrapolation power to molecules used in the test set. The GLM model was not capable of predicting the pIC 50 value of the Lyn kinase inhibitors even if the descriptor selection step was done using the statistical procedure "stepwise forward selection procedure". This is due to the fact that the stepwise forward selection procedure uses multiple linear regression method to score the selected set of descriptors and it is not intended to derive a robust predictive model. Again, when used with the GLM, the combination of the descriptors in a linear way did not results in a predictive QSAR model.
When ANN approach was used, the derived model showed good predictive performance for the training set and good extrapolation to new and unseen molecules of the test set. In fact, several ANN models were built by varying the size of the hidden layer by increasing the number of neurons from 3 to 12 (Table 3). The predictive capacity of the model increased with the size of the hidden layer and reached a plateau when the number of neurons exceeded the value of 9. The model using 9 neurons in the hidden layer presented the best fit and the best cross-validated results as judged by the cross-validated correlation coefficient and the root mean squared error (R T 2 = 0.92, RMSE T = 0.29, R v 2 = 0.90 and RMSE V = 0.32). This model was applied to predict the molecules in the test set (Table 4) and resulted in a very good correlation coefficient between the experimental and predicted values of pIC 50 and root mean-squared error (R Ts 2 = 0.91 and RMSE Ts = 0.33) (Figure 3).
The model derivation and validation step resulted in a very good QSAR model using the ANN approach while the GLM approach was not able to derive useful models. This is explained by the fact that the training set contains high structural molecular diversity and high nonlinear underlying relationships between the structural variations and the biological activities of the models that only a nonlinear approach as ANN was able to conceptualize.   The model derivation and validation step resulted in a very good QSAR model using the ANN approach while the GLM approach was not able to derive useful models. This is explained by the fact

Applicability Domains of QSAR Models
To define the domain of applicability of the derived and validated QSAR model, we have used the Mahalanobis distance as a distance-based metric approach. This method calculates a distance between each molecule to be predicted (molecules in the test set) and the closest molecule in the training set. Any molecule above a threshold distance is considered to be unpredictable by the model or predictable with low confidence. When applied to the training set, the most distant molecule of the rest of the molecules is at a Mahalonbis distance of 9. When using a threshold value of 9, only seven molecules in the test set were distant from the training set ( Figure 4). This analysis showed that most of the test set molecules can be safely predicted by the model as judged by the Mahalanobis distance. most of the test set molecules can be safely predicted by the model as judged by the Mahalanobis distance.

Structure-Activity Relationship Map Derivation
Based on our selection of the most pertinent descriptors used in the QSAR model and the structural analysis of the molecules in the training set Table 1), we tried to derive a SAR map that explains Lyn kinase inhibition and also the predctions from the selected descriptors used in the QSAR model ( Figure 5). Indeed, most of the active molecules in the training set hold in one of their extrimities a planar bicyclic aromatic system that can be heterocyclic or not. This feature is represented by the number of double bond descriptor (b_count) correlated to aromatic planar rings and the number of donors and acceptors of hydrogen bonds (a_acc, lip_acc, a_don, lip_don) which lead to heterocyclic rings. Another common structural element in the active molecules is a central aromatic ring wich can again be reprensented by the number of the double bonds descriptor. This part of the molecule is usualy linked to the plan bicyclic system by a flexible linker. The flexibility is encoded in the kier_flex descriptor. A third common strutural element of the active molecules is an aromatic ring system localized at the other extrimity of the molecule. The three aromatic rings (planar heterocyle, central aromatic ring and an aromatic system being opposite of the first aromatic system) can be also encoded by the lipophilicity descripor (logP). With all these aromatic rings, the majority of active molecules present a high molecular volume that is encoded in the molar refractivity descripto (MR). Finnaly, the heterocyclic ring system as well as the number of donors and acceptors of hydrogen bonds are at the origin of the polarizability of the molecule which is encoded in the polar surface area descriptos (TPSA, apol and PEOPE).
Overall, considering the common structural features and some of the selected and used descriptors in the QSAR model, we could suggest a SAR map for the inhibition of the Lyn kinase as follows: (1) a planar and heterocyclic ring system that holds hydrogen bond donnors and acceptors, (2) a Linker to keep the flexibility of the molecule, (3) an hydrophobic and aromatic central part, (4) a lipophilic and aromatic ring system.
The derived SAR map can be found when analysing the structure of some published Lyn kinase inhibitiors. Indeed, a Lyn kinase inhibiotors (INNO-406, Nilotinib), with IC50 of 220 nM, was reported in the work of Horio et al. [42]. This compound bears a pyridinyl group as hydrogen bonding region, an amino group as a linker and a central substitued benzyl group as the hydrophobic region. Kim et

Structure-Activity Relationship Map Derivation
Based on our selection of the most pertinent descriptors used in the QSAR model and the structural analysis of the molecules in the training set Table 1), we tried to derive a SAR map that explains Lyn kinase inhibition and also the predctions from the selected descriptors used in the QSAR model ( Figure 5). Indeed, most of the active molecules in the training set hold in one of their extrimities a planar bicyclic aromatic system that can be heterocyclic or not. This feature is represented by the number of double bond descriptor (b_count) correlated to aromatic planar rings and the number of donors and acceptors of hydrogen bonds (a_acc, lip_acc, a_don, lip_don) which lead to heterocyclic rings. Another common structural element in the active molecules is a central aromatic ring wich can again be reprensented by the number of the double bonds descriptor. This part of the molecule is usualy linked to the plan bicyclic system by a flexible linker. The flexibility is encoded in the kier_flex descriptor. A third common strutural element of the active molecules is an aromatic ring system localized at the other extrimity of the molecule. The three aromatic rings (planar heterocyle, central aromatic ring and an aromatic system being opposite of the first aromatic system) can be also encoded by the lipophilicity descripor (logP). With all these aromatic rings, the majority of active molecules present a high molecular volume that is encoded in the molar refractivity descripto (MR). Finnaly, the heterocyclic ring system as well as the number of donors and acceptors of hydrogen bonds are at the origin of the polarizability of the molecule which is encoded in the polar surface area descriptos (TPSA, apol and PEOPE).
Overall, considering the common structural features and some of the selected and used descriptors in the QSAR model, we could suggest a SAR map for the inhibition of the Lyn kinase as follows: (1) a planar and heterocyclic ring system that holds hydrogen bond donnors and acceptors, (2) a Linker to keep the flexibility of the molecule, (3) an hydrophobic and aromatic central part, (4) a lipophilic and aromatic ring system.
The derived SAR map can be found when analysing the structure of some published Lyn kinase inhibitiors. Indeed, a Lyn kinase inhibiotors (INNO-406, Nilotinib), with IC 50 of 220 nM, was reported in the work of Horio et al. [42]. This compound bears a pyridinyl group as hydrogen bonding region, an amino group as a linker and a central substitued benzyl group as the hydrophobic region. Kim et al. obtained a Lyn kinase inhibitor (PCI-32765) [43], with an IC 50 of 200 nM, showing an aminopyrimidine moity playing the role of the hydrogen bonding region, a benzyl group as the aromatic moity linked to another benzyl group presenting the hydrophobic region. In the work of Goldberg et al. a reported Lyn kinase inhibitor (BDBM50218682), with an IC 50 of 230 nM, presented an aminopyridin moity as the hydrogen bonding region, an amid bond as the linker, a central aromatic fused cycle, and a substitued benzamid part playing the role of the hydrophobic region [44].
al. obtained a Lyn kinase inhibitor (PCI-32765) [43], with an IC50 of 200 nM, showing an aminopyrimidine moity playing the role of the hydrogen bonding region, a benzyl group as the aromatic moity linked to another benzyl group presenting the hydrophobic region. In the work of Goldberg et al. a reported Lyn kinase inhibitor (BDBM50218682), with an IC50 of 230 nM, presented an aminopyridin moity as the hydrogen bonding region, an amid bond as the linker, a central aromatic fused cycle, and a substitued benzamid part playing the role of the hydrophobic region [44].

Conclusions
Several physicochemical, topological and electronic molecular descriptors combined with a GLM method and ANN method were used for modeling and predicting Lyn kinase inhibitors. The best model was determined based on the values of correlation coeficcient between the experimental and the predicted pIC50 values for the training set and the test set. The ANN model obtained the highest prediction performance. The selected descriptors involved in the ANN model as well as the structural features of the training set were analyzed together to draw an informative SAR map that was in good agreement with published Lyn kinase inhibitors.
Overall, this study demonstrates that the machine learning method combined to molecular parameters can be used for in silico prediction of Lyn kinase inhibition and that the selected descriptors together with structural features derived from the training molecules can serve to build an SAR map to explain Lyn kinase inhibition.

Conclusions
Several physicochemical, topological and electronic molecular descriptors combined with a GLM method and ANN method were used for modeling and predicting Lyn kinase inhibitors. The best model was determined based on the values of correlation coeficcient between the experimental and the predicted pIC 50 values for the training set and the test set. The ANN model obtained the highest prediction performance. The selected descriptors involved in the ANN model as well as the structural features of the training set were analyzed together to draw an informative SAR map that was in good agreement with published Lyn kinase inhibitors.
Overall, this study demonstrates that the machine learning method combined to molecular parameters can be used for in silico prediction of Lyn kinase inhibition and that the selected descriptors together with structural features derived from the training molecules can serve to build an SAR map to explain Lyn kinase inhibition.