Many signaling pathways transmit extracellular signals by altering the phosphorylation state of tyrosine residues. Tyrosine phosphorylation, in which a tyrosine kinase covalently attaches a phosphate group from ATP (adenosine triphosphate) to a tyrosine residue [1], accounts for only 0.1% of total protein phosphorylation in mammals. Nevertheless, tyrosine kinases play a key role in regulating many biological processes, such as cell proliferation, differentiation and motility. There are two families of tyrosine kinases: receptor tyrosine kinases (RTKs) and non-receptor tyrosine kinases (NRTKs) [2
The existence of multiple kinase conformations (active and inactive states) and the structural diversity of the ATP-binding site and the activation loop offer different strategies for designing inhibitors. Some inhibitors bind to the active site of a receptor tyrosine kinase and block the signal transduction triggered by the binding of certain growth factors (EGF, FGF, Gas6…) to their receptors (EGFR, FGFR, AXL…), thereby suppressing growth factor activity. These inhibitors are often used to prevent tumor growth because many cellular tyrosine kinases are encoded by proto-oncogenes, and they represent the most frequent mechanism of oncogenesis in human cancer [1
The Src family is the largest and one of the most important families of non-receptor tyrosine kinases. It is considered for targeted therapies because Src family members are essential intermediaries in signal transduction and can interact with a variety of growth factors, proliferation factors, and regulators of gene expression (migration, adhesion, differentiation, angiogenesis, invasion, immune function and G-protein-coupled receptors) [3]. The Src family comprises 11 related kinases, each with specific functions and domains: Blk, Fgr, Fyn, Hck, Lck, Lyn, c-Src, c-Yes, Yrk, Frk (also known as Rak) and Srm. Some members are present exclusively in certain cell types, such as breast, colon, lung, hematopoietic, adipocyte, hepatocyte and lymphoid cells, as well as skeletal cells [6
Src signaling pathways are among the leading contributors to cancer, and Src inhibitors are key to halting many forms of tumorigenesis. This is why most FDA-approved protein kinase inhibitors are directed against the activation of pathways controlled by Src family tyrosine kinases (STKs), including those involved in cell division and survival [7
The non-receptor tyrosine kinase Lyn is a member of the Src family [9]. Lyn plays an important role in the regulation of a variety of epithelial and hematopoietic cells, including the regulation of innate and adaptive immune responses, hematopoiesis, responses to growth factors and cytokines, integrin signaling, responses to DNA damage and genotoxic agents, as well as drug resistance [9]. It is a critical regulator of several cellular processes in many human cancer cells. According to various studies, overexpression of the Lyn gene is highly correlated with the development and progression of several tumors, such as esophageal adenocarcinoma [12], prostate cancer (castration-resistant prostate cancer) [13], pancreatic cancer [15], cervical cancer [16] and breast cancer [17], and it can also cause hepatic fibrosis [19
Some studies have shown that Lyn is overactive in hematological malignancies, including chronic myelogenous leukemia, B-cell chronic lymphocytic leukemia [20], Burkitt lymphoma [21], and acute lymphoblastic leukemia (ALL), the most common cancer diagnosed in children [22]. It has also been shown that Lyn inhibition is a promising treatment for resistant lymphoma [23]. Lyn is also involved in resistance to nilotinib [25]. Zardan et al. suggested that Lyn is a critical regulator of androgen receptor (AR) expression and activity, particularly under androgen-deprived conditions [14]. He et al. found that Lyn plays an important role in the development and progression of glioblastoma, the most aggressive brain tumor [27]. Developing new Lyn kinase inhibitors is therefore an important therapeutic approach for diseases in which Lyn is heavily involved.
In recent decades, medicinal chemists identifying and developing new drugs have benefited from rational drug design, thanks to chemoinformatics and molecular modeling approaches. Quantitative structure-activity relationship (QSAR) modeling is a chemoinformatics methodology that allows medicinal chemists to correlate variations in the biological response to a ligand with variations in its structure.
QSAR has been widely used in drug discovery research in recent years [28]. Its central idea is to link, through a mathematical function, several properties or molecular descriptors (topological, electronic, physico-chemical parameters…) to the activity of a set of molecules [31]. The resulting relationship takes the form of a mathematical model that can be used to predict the activity of new or existing molecules once their structural properties are known. These predictions can also be used to prioritize the synthesis of a small set of potentially active molecules. However, the QSAR approach suffers from the fact that the predictive models, often described as black boxes, can be very difficult for medicinal chemists to use directly during design, when their main objective is to establish the structure-activity relationship (SAR) map of the molecules under investigation [32]. It is also hard to use such models to give chemists directions for modifying the molecules to improve a biological property of interest.
In the present study, using known ligands and their inhibitory activities against Lyn kinase, we constructed and validated predictive models using a QSAR approach. We also analyzed the selected molecular descriptors and the structural fragments of the inhibitors to draw a SAR map for Lyn kinase inhibition that can be used to design new and potentially active inhibitors.
3. Results and Discussion
3.1. Diversity Analysis
When splitting the full data set into a training set of 123 molecules and a test set of 53 molecules, we ensured that the distribution of pIC50 values remained the same in the training and test sets as in the full data set (Figure 1).
Principal component analysis (PCA) of the molecular descriptor space explained 56.34% of the global information of the original space (PC1: 35.4%; PC2: 12.8%; PC3: 8.14%). This analysis showed that the molecules of the training set and the test set were distributed homogeneously in the PCA space, indicating good structural diversity in the data (Figure 2). This is in agreement with the different chemotypes represented in the initial data, as shown in Table 1.
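One way to keep the pIC50 distribution comparable between the two sets is an activity-stratified split. The sketch below is only an illustration under stated assumptions: the binning scheme, the 30% test fraction, the function name `stratified_split` and the synthetic pIC50 values are all hypothetical and are not taken from the study.

```python
import random
from statistics import mean


def stratified_split(activities, test_fraction=0.3, n_bins=5, seed=0):
    """Split molecule indices so that each activity bin feeds both sets.

    Molecules are ranked by activity, cut into n_bins consecutive bins,
    and the same fraction of each bin is sent to the test set.
    """
    rng = random.Random(seed)
    order = sorted(range(len(activities)), key=lambda i: activities[i])
    bin_size = -(-len(order) // n_bins)  # ceiling division
    train, test = [], []
    for start in range(0, len(order), bin_size):
        bin_idx = order[start:start + bin_size]
        rng.shuffle(bin_idx)
        n_test = round(len(bin_idx) * test_fraction)
        test.extend(bin_idx[:n_test])
        train.extend(bin_idx[n_test:])
    return sorted(train), sorted(test)


# Synthetic pIC50 values for 176 hypothetical molecules (not the study data).
_rng = random.Random(42)
pic50 = [_rng.uniform(4.0, 9.0) for _ in range(176)]

train_idx, test_idx = stratified_split(pic50)
train_mean = mean(pic50[i] for i in train_idx)
test_mean = mean(pic50[i] for i in test_idx)
print(len(train_idx), len(test_idx), round(train_mean, 2), round(test_mean, 2))
```

Comparing the mean (or a histogram) of pIC50 in each subset is a quick check that the split has not skewed the activity range of either set.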
3.2. Descriptors Pertinence
The initial pool of 184 descriptors was first reduced by eliminating descriptors with constant or near-constant values. PLS was then used to further reduce the number of descriptors according to their variable importance in the model. The PLS model gave a coefficient of determination R2 of 0.72 and a cross-validated coefficient of determination q2 of 0.63. With the variable importance threshold set to 1, only 80 descriptors were retained. A stepwise forward selection procedure then reduced the set to 35 descriptors, which were carried into the data modeling step with the aim of finding the best fit between the descriptors and the inhibitory activities of the molecules. These descriptors fall into 7 different categories, as defined in Table 2.
The selected descriptors cover the main structural features of the molecules needed for their biological activity. Physico-chemical properties such as logP, logS, MR, apol, TPSA and the subdivided surface areas represent molecular features that could explain the bioavailability of the drugs. Pharmacophoric features, connectivity and shape indices, and partial charge properties represent the mode of interaction of the drugs with their target receptor. Finally, atom and bond counts and the adjacency and distance matrix descriptors represent the topology and the geometry of the molecules.
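The first filtering step of this section, the removal of constant and near-constant descriptors, can be sketched as a simple variance filter. This is a minimal illustration, not the procedure used in the study: the variance cutoff, the function name `drop_near_constant` and the toy descriptor table are assumptions, and the later PLS and stepwise selection steps are not reproduced here.

```python
from statistics import pvariance


def drop_near_constant(table, min_variance=1e-3):
    """Keep only descriptors whose variance across molecules exceeds the cutoff.

    table maps a descriptor name to its list of values, one per molecule.
    The cutoff is scale-dependent, so descriptors would normally be put on
    comparable scales before applying it.
    """
    return {name: values for name, values in table.items()
            if pvariance(values) > min_variance}


# Toy descriptor table for four hypothetical molecules.
descriptors = {
    "logP": [1.2, 3.4, 2.2, 4.1],
    "n_const": [2, 2, 2, 2],              # constant: carries no information
    "near_const": [0.5, 0.5, 0.5, 0.501],  # near-constant
}

kept = drop_near_constant(descriptors)
print(sorted(kept))
```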
3.3. QSAR Model Derivation and Validation
Fitting the 35 selected descriptors to the pIC50 values of the training set with the GLM approach gave weak predictive power, as judged by the correlation coefficient between experimental and predicted values (R2 = 0.65) and the root mean square error (RMSE = 0.64). When the model was applied to the test set, performance dropped to R2 = 0.39 and RMSE = 0.85. Consequently, GLM provided neither a predictive model for the training set molecules nor the ability to extrapolate to the molecules of the test set. This held even though the descriptors had been selected with a statistical procedure, stepwise forward selection. The reason is that stepwise forward selection uses multiple linear regression to score the selected set of descriptors and is not intended to derive a robust predictive model. In short, combining the descriptors linearly with the GLM did not result in a predictive QSAR model.
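The two statistics quoted above can be computed as in the following sketch. It assumes that R2 is the squared Pearson correlation between experimental and predicted pIC50 values, which is how the text describes it; the exact formula used by the authors is not stated, and the example values are invented for illustration.

```python
from math import sqrt
from statistics import mean


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    mo, mp = mean(observed), mean(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    return sqrt(mean((o - p) ** 2 for o, p in zip(observed, predicted)))


# Invented pIC50 values, for illustration only.
experimental = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 6.5, 7.5, 8.5]
print(r_squared(experimental, predicted), rmse(experimental, predicted))
```

Note that a prediction that is systematically shifted, as in this toy example, still has a perfect correlation while its RMSE is not zero, which is why the two statistics are reported together.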
With the ANN approach, the derived model showed good predictive performance on the training set and good extrapolation to the unseen molecules of the test set. Several ANN models were built by increasing the number of neurons in the hidden layer from 3 to 12 (Table 3). The predictive capacity of the model increased with the size of the hidden layer and reached a plateau beyond 9 neurons. The model with 9 neurons in the hidden layer gave the best fit and the best cross-validated results, as judged by the correlation coefficient and the root mean squared error for the training set (R2 = 0.92, RMSE = 0.29) and for cross-validation (R2 = 0.90, RMSE = 0.32). Applied to the molecules of the test set (Table 4), this model gave a very good correlation between experimental and predicted pIC50 values and a low root mean squared error (R2 = 0.91 and RMSE = 0.33) (Figure 3).
The model derivation and validation step resulted in a very good QSAR model with the ANN approach, whereas the GLM approach was not able to derive useful models. This is explained by the fact that the training set combines high structural diversity with strongly nonlinear relationships between the structural variations and the biological activities of the molecules, which only a nonlinear approach such as ANN was able to capture.
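To make concrete what varying the hidden layer from 3 to 12 neurons means, the sketch below builds an untrained network with one hidden layer and counts its adjustable weights. The architecture details are assumptions, since the text does not specify them: a single hidden layer with tanh activation, one bias per neuron and a single linear output. The function names are hypothetical and no training is performed.

```python
import math
import random


def init_network(n_inputs, n_hidden, seed=0):
    """Random weights for one tanh hidden layer and a single linear output."""
    rng = random.Random(seed)
    # Each hidden neuron has one weight per input plus a bias (last element).
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    # The output neuron has one weight per hidden neuron plus a bias.
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def predict(network, descriptors):
    """Forward pass: descriptors -> hidden activations -> predicted activity."""
    hidden, output = network
    activations = [
        math.tanh(w[-1] + sum(wi * xi for wi, xi in zip(w, descriptors)))
        for w in hidden
    ]
    return output[-1] + sum(wi * ai for wi, ai in zip(output, activations))


def n_parameters(network):
    """Total number of adjustable weights and biases."""
    hidden, output = network
    return sum(len(w) for w in hidden) + len(output)


# Scan the hidden layer sizes mentioned in the text, with 35 input descriptors.
for n_hidden in range(3, 13):
    print(n_hidden, n_parameters(init_network(35, n_hidden)))
```

Under these assumptions, each additional hidden neuron adds 37 adjustable parameters, so the model capacity grows quickly with the size of the hidden layer.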
3.4. Applicability Domains of QSAR Models
To define the applicability domain of the derived and validated QSAR model, we used the Mahalanobis distance as a distance-based metric. This method calculates the distance between each molecule to be predicted (the molecules of the test set) and the closest molecule in the training set. Any molecule beyond a threshold distance is considered unpredictable by the model, or predictable only with low confidence. Within the training set, the molecule most distant from the rest lies at a Mahalanobis distance of 9. With a threshold value of 9, only seven molecules of the test set were distant from the training set (Figure 4). This analysis showed that most of the test set molecules can be safely predicted by the model, as judged by the Mahalanobis distance.
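The distance check described above can be sketched as follows. This is a simplified illustration restricted to two descriptors so that the covariance matrix can be inverted by hand; the use of the training-set covariance, the function names and the toy coordinates are assumptions, and only the threshold of 9 comes from the text.

```python
from math import sqrt
from statistics import mean


def covariance_2d(points):
    """Sample covariance (sxx, sxy, syy) of a list of (x, y) points."""
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    n = len(points) - 1
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return sxx, sxy, syy


def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance between points a and b for a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = a[0] - b[0], a[1] - b[1]
    # d^2 = [dx dy] S^-1 [dx dy]^T with the closed-form 2x2 inverse of S.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(d2)


def in_domain(query, training, threshold=9.0):
    """True if the nearest training molecule lies within the threshold."""
    cov = covariance_2d(training)
    nearest = min(mahalanobis_2d(query, t, cov) for t in training)
    return nearest <= threshold


# Toy descriptor coordinates for five hypothetical training molecules.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
print(in_domain((0.6, 0.4), training), in_domain((10.0, 10.0), training))
```

With the full set of descriptors the same idea applies, but the covariance matrix would be inverted numerically rather than in closed form.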
3.5. Structure-Activity Relationship Map Derivation
Based on the most pertinent descriptors used in the QSAR model and a structural analysis of the molecules in the training set (Table 1), we derived a SAR map that explains Lyn kinase inhibition and also rationalizes the predictions made from the selected descriptors (Figure 5). Most of the active molecules in the training set carry at one of their extremities a planar bicyclic aromatic system, which may or may not be heterocyclic. This feature is represented by the number of double bonds descriptor (b_count), which correlates with planar aromatic rings, and by the numbers of hydrogen bond donors and acceptors (a_acc, lip_acc, a_don, lip_don), which point to heterocyclic rings. Another common structural element of the active molecules is a central aromatic ring, which can again be represented by the number of double bonds descriptor. This part of the molecule is usually connected to the planar bicyclic system by a flexible linker, whose flexibility is encoded in the kier_flex descriptor. A third common structural element is an aromatic ring system located at the other extremity of the molecule. The three aromatic systems (the planar heterocycle, the central aromatic ring and the aromatic system at the opposite end) can also be encoded by the lipophilicity descriptor (logP). With all these aromatic rings, most active molecules have a high molecular volume, which is encoded in the molar refractivity descriptor (MR). Finally, the heterocyclic ring system and the hydrogen bond donors and acceptors are at the origin of the polarizability of the molecule, which is encoded in the polar surface area descriptors (TPSA, apol and PEOE).
Overall, considering the common structural features and some of the descriptors selected for the QSAR model, we suggest the following SAR map for Lyn kinase inhibition: (1) a planar heterocyclic ring system bearing hydrogen bond donors and acceptors, (2) a linker that preserves the flexibility of the molecule, (3) a hydrophobic, aromatic central part, and (4) a lipophilic aromatic ring system.
The derived SAR map can be recognized in the structures of some published Lyn kinase inhibitors. A Lyn kinase inhibitor (INNO-406, Nilotinib) with an IC50 of 220 nM was reported by Horio et al. [42]. This compound bears a pyridinyl group as the hydrogen bonding region, an amino group as the linker and a central substituted benzyl group as the hydrophobic region. Kim et al. obtained a Lyn kinase inhibitor (PCI-32765) [43] with an IC50 of 200 nM, in which an aminopyrimidine moiety plays the role of the hydrogen bonding region and a benzyl group, the aromatic moiety, is linked to another benzyl group that provides the hydrophobic region. Goldberg et al. reported a Lyn kinase inhibitor (BDBM50218682) with an IC50 of 230 nM that presents an aminopyridine moiety as the hydrogen bonding region, an amide bond as the linker, a central fused aromatic ring system, and a substituted benzamide part playing the role of the hydrophobic region [44
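For comparison with the pIC50 scale used by the QSAR model, the three IC50 values quoted above can be converted with the standard definition pIC50 = -log10(IC50 in mol/L). The helper name `pic50_from_nm` is hypothetical; the conversion is only a convenience and is not part of the reported workflow.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 (-log10 of molar IC50)."""
    return -math.log10(ic50_nm * 1e-9)


# IC50 values (nM) quoted in the text for the three published inhibitors.
for ic50 in (220, 200, 230):
    print(ic50, round(pic50_from_nm(ic50), 2))
```

On this scale the three inhibitors fall within a narrow range of roughly 6.6 to 6.7 pIC50 units.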