Computational Prediction of Inhibitors and Inducers of the Major Isoforms of Cytochrome P450

Human cytochrome P450 enzymes (CYPs) are heme-containing monooxygenases. This superfamily of drug-metabolizing enzymes is responsible for the metabolism of most drugs and other xenobiotics. The inhibition of CYPs may lead to drug–drug interactions and impair the biotransformation of drugs. CYP inducers may decrease the bioavailability and increase the clearance of drugs. Based on the freely available databases ChEMBL and PubChem, we have collected over 70,000 records containing the structures of inhibitors and inducers together with the IC50 values for the inhibitors of the five major human CYPs: 1A2, 3A4, 2D6, 2C9, and 2C19. Based on the collected data, we developed (Q)SAR models for predicting inhibitors and inducers of these CYPs using GUSAR and PASS software. The developed (Q)SAR models could be applied for assessment of the interaction of novel drug-like substances with the major human CYPs. The created (Q)SAR models demonstrated reasonable accuracy of prediction. They have been implemented in the web application P450-Analyzer that is freely available via the Internet.


Introduction
Biotransformation of xenobiotics (in other words, drug metabolism) can be described as the biological transformation of external to organism lipophilic nonpolar molecules into more hydrophilic polar metabolites, which in turn are easily excreted from the body. Biotransformation has a significant effect on the pharmacokinetics of most drugs. From the chemical point of view, drug metabolism reactions can be divided into two large categories: oxidative reactions (phase I) and conjugation reactions (phase II) [1]. The human cytochrome P450 enzyme (CYP) family is the main phase I enzymes and contains 57 isoenzymes. CYPs metabolize approximately two thirds of known drugs in humans, with 80% of this process occurring by five isoenzymes-1A2, 2C9, 2C19, 2D6, and 3A4 [2]. Safety is a serious problem for many launched drugs, especially for patients taking multiple medications [1,3]. Adverse drug reactions (ADRs) are among the top 10 leading causes of death, and it is estimated that one hundred thousand deaths per year are attributed to ADRs [1]. ADRs caused by drug-drug interactions (DDIs) can lead to early termination of drug development or even to the withdrawal of drugs from the market; astemizole, cerivastatin, cisapride, terfenadine, and mibefradil are some examples of drugs withdrawn from the market. It is known that most DDIs are mediated through CYP inhibition and induction [3,4]. Such DDIs are manifested by the effect of one drug (the perpetrator drug) on the biotransformation of others (the victim drugs); this may be a slowdown in the case of the inhibition of CYPs or an acceleration in the case of the induction of CYPs; nevertheless, the pharmacological action of coadministered drugs may be altered. In some cases, DDIs can cause more than tenfold increases or decreases in victim drug exposure, with dangerous health effects [5]. Thus, the FDA's guidance documents prescribe the preliminary screening of drugs for interaction with CYPs [6].
In vitro and in vivo investigations are performed to study the safety and side effects of drugs. Computational (in silico) approaches take much less time to evaluate a large number of compounds both for the known drugs and for new, not yet synthesized drug-like substances. Recently, various computational approaches have been used to create structureactivity relationship (SAR) classification models for predicting drug activity in relation to the interaction with CYPs; a comprehensive description of the methods for DDIs in silico prediction is presented in the review [7]. In silico predictions of interaction with CYPs are made using two approaches: ligand-based and structure-based. Ligand-based predictions of inhibition can be performed by two categories of methods: regression (predicted using IC 50 or K i values of CYP) and classification (predicted using the category of CYP inhibitory potency). The induction prediction for some CYPs can be performed indirectly by an assessment of the xenobiotic's interaction with nuclear receptors: the aryl hydrocarbon receptor (AhR), the constitutive androstane receptor (CAR), and the pregnane X receptor (PXR). Structure-based approaches to predicting inhibition of CYPs use CYP structures, docking simulations, and/or molecular dynamics and were not applied for the creation of web services due to the complexity of implementation. Some of the created ligand-based SAR models were implemented as freely available web applications, which predict, based on the structural formula of compounds, their ability to inhibit CYP isoforms important for drug metabolism. They are PreADMET [8], pkCSM [9], SwissADME [10], WhichCyp [11], CYPlebrity [12], vNN-ADMET [13], AdmetSAR [14], and SuperCYPsPred [15]. Despite the good predictive accuracy of these applications, they do not predict CYP-inducing activity and IC 50 values of the inhibitors. To overcome these limitations, we created the freely available web application P450-Analyzer, which predicts inhibition (including IC 50 values) and directly predicts the induction of CYPs with reasonable accuracy.

Results
Using GUSAR [16] and PASS [17] software based on the data from ChEMBL and PubChem database, we created (Q)SAR models for inhibitors and SAR models for inducers of the most important drug-metabolizing isoform of CYPs. 320 QSAR models were built based on the data from ChEMBL by GUSAR using different sets of QNA (Quantitative Neighborhoods of Atoms) and MNA (Multilevel Neighborhoods of Atoms) descriptors and the RBF-SCR (Radial-Basis Function and Self-Consistent Regression) algorithm for each training set with the information about the appropriate isoforms of CYPs. Only those QSAR models in which R 2 exceeded 0.6 and Q 2 exceeded 0.5 were selected in the consensus models. The 5-fold cross-validation (5-fold-CV) procedure was used to validate the accuracy of the prediction of QSAR models. The initial datasets were sorted by the ascending mode of pIC 50 values. After that, the sets were divided into five parts, which were used for the 5-Fold-CV procedure. As a result, different five training and five external test sets for IC 50 data were created for each isoform. The prediction accuracy of the created consensus QSAR models along with the characteristics of the training sets are represented in Table 1. Table 1 shows that Q 2 values in QSAR models for the CYPs 1A2, 2C9, and 3A4 exceed 0.6. The standard deviations between the predicted and experimental data for almost all studied isoforms of CYPs were less than 0.6 and close to 0.5, excluding CYP 1A2 (0.625), which reflects the acceptable accuracy of prediction. RMSE values given on the test sets during 5-fold cross-validation vary from 0.565 (2C19) to 0.682 (1A2). They are less 0.7 and may be considered as applicable. R 2 values given on the test sets during 5-fold crossvalidation (R 2 5-Fold-CV ) exceed 0.5 for 1A2, 2C9, and 3A4. R 2 5-Fold-CV for 2C19 and 2D6 are less 0.5, and the prediction results of such models should be treated with caution. Most compounds from the test sets during 5-fold cross-validation fall into the applicability domain (AD). The percent of compounds in AD exceed 95% for all isoforms. For all CYP isoforms in the training sets, the pIC 50 value ranges are characterized by a significant width and include both inactive and very active compounds. The mean values of pIC 50 are close to 5, which is a threshold used by medicinal chemists to make a division between the active and inactive compounds [18]. Therefore, the created QSAR models may be used in web applications for estimating the degree of activity of potential inhibitors of the CYPs. Taking into account the mean values and the values of standard deviation (Table 1), one may conclude that compounds with prediction results, including pIC 50 values exceeding 6, may be considered as potential inhibitors of the appropriate CYPs and should by experimentally tested.
Fifteen classification SAR models, depending on the training sets, for prediction of inhibitors for appropriate isoforms of CYPs were built using PASS software. In this case, in addition to a dataset from ChEMBL, we used a dataset from PubChem. The accuracy of prediction for the created SAR models and sources of the training sets are represented in Table 2. The "Total" column represents models, which were built from both ChEMBL and PubChem resources. To calculate the quality of the SAR model, we used the Invariant Accuracy of Prediction (IAP) (which is numerically similar to ROC AUC), calculated by leave-one-out cross-validation procedure (LOO CV) and 20-fold CV. As one may see, for both ChEMBL and PubChem training sets, the best models were obtained for CYP1A2. The combined training set allowed us to improve the quality only for 2D6; in all other cases, the models with PubChem data were better. The created (Q)SAR models were implemented in a new web application (P450-Analyzer), which is freely available at http://www.way2drug.com/p450-analyzer/ (accessed on 16 August 2022). It provides the ability to draw the structures in Marvin Applet, to input SMILES or drug name, or to use the initially prepared MOL file. The prediction results include three tables (Figure 1). The first table is a result of the QSAR models' application, which represents the numerical estimation of pIC 50 . The highest predicted pIC 50 values represent the most potent inhibitory activity of compounds for the appropriate CYP isoforms. The second table presents a result of classification models to predict the belonging of inhibitors of CYP isoforms. The table includes the name of the isoenzymes, Pa and Pi values, which are the probabilities "to be inhibitors" and "to be non-inhibitors", respectively. For our web application, we used models obtained from both datasets ("Total" column in Table 2). The third table represents the selected activities from the PASSOnline (http://www.way2drug.com/passonline/) (accessed on 16 August 2022); these activities are responsible for the belonging of compounds to CYP inducers. The training set of PASS software is collected from various sources, including commercially available databases, and currently contains over 1.5 million compounds. PASS predicts more than eight thousand activities with an average accuracy about 0.93, estimated in leave-one-out cross-validation. The accuracy of prediction for the selected activities related to CYP inducers is represented in Table 3.
which represents the numerical estimation of pIC50. The highest predicted pIC50 values represent the most potent inhibitory activity of compounds for the appropriate CYP isoforms. The second table presents a result of classification models to predict the belonging of inhibitors of CYP isoforms. The table includes the name of the isoenzymes, Pa and Pi values, which are the probabilities "to be inhibitors" and "to be non-inhibitors", respectively. For our web application, we used models obtained from both datasets ("Total" column in Table 2). The third table represents the selected activities from the PASSOnline (http://www.way2drug.com/passonline/) (accessed on 16 Aug. 2022); these activities are responsible for the belonging of compounds to CYP inducers. The training set of PASS software is collected from various sources, including commercially available databases, and currently contains over 1.5 million compounds. PASS predicts more than eight thousand activities with an average accuracy about 0.93, estimated in leave-one-out cross-validation. The accuracy of prediction for the selected activities related to CYP inducers is represented in Table 3.   In order to demonstrate how the web application may be used (Figure 1), we took Rifampicin, which is a well-known inducer of CYPs [19] as a case study. In Figure 1, one can see three tables: table A represents the numerical estimation of pIC50, and tables B and C represent a probabilistic assessment of belonging to inhibitors or inducers of CYPs, respectively. We chose the default threshold Pa>Pi for displaying possible activities. As one can see in Figure 1, in table B, there are no activities, meaning that no inhibitory activity was predicted with Pa>Pi; therefore, Rifampicin could not be classified as a CYP inhibitor. This result coincides with the predicted result of pIC50 values (table A), which are less than 6, and it means that Rifampicin is not a CYP inhibitor. There are three activities in table C. So, Rifampicin is correctly predicted as the inducer of the CYPs 3A4, 2C9, and 2C19. It should be emphasized that Rifampicin was excluded from the training set before the prediction.

Discussion
CYP-mediated metabolism represents a major route of elimination of many drugs; therefore, information about CYP inhibition and induction is very important to prevent DDIs. We created the web application (P450-Analyzer), which allows estimating the possibility for a drug-like compound to be an inhibitor or inducer of CYPs. Unlike the other existing computational tools, our web application provides an estimation of the IC 50 values, which allows classifying compounds as strong, moderate, and weak CYP inhibitors. Thus, with the P450-Analyzer, it is possible to perform a more thorough analysis of the potential DDIs. In addition, the P450-Analyzer allows estimating the inducing activity of compounds for five of the most important drug-metabolizing isoforms of CYPs.

ChEMBL and PubChem
Similar to the CYPlebrity [12], we used the ChEMBL and PubChem databases as a resource for extraction of structures and experimental data for compounds studied on inhibition of CYPs 1A2, 3A4, 2D6, 2C9, and 2C19. The selection criteria were the same as in the publication [12].
From PubChem, we took bioassay 1851 dataset [20], which was downloaded as CSV table; then, SMILES notations for the selected compounds were retrieved by querying PubChem's download services. Compounds with "Pubchem_activity_outcome" = "active" supported by a "Complete curve" were assigned to the "inhibitors" class. Compounds with "Pubchem_activity_outcome" = "inactive" were marked as "non-inhibitors".
On the basis of ChEMBL, we created two types of training sets. The first type was used to build classification models and contains qualitative information; the compounds were classified as inhibitors or noninhibitors, based on IC 50 values and percentage of inhibition. The entries with IC 50 lower than 10,000 nM were defined as inhibitors if the field "standard relation" containing any of the signs "=", "≤", or "<". The entries with IC 50 exceeding 20,000 nM were defined as noninhibitors if the field "standard relation" containing one of signs "=", "≥", or ">". In addition, the entries with percentage of inhibition > 50% were defined as inhibitors and otherwise as noninhibitors. The conflicting data (if the same structure was classified as inhibitor and noninhibitor) were removed from the training set.
The second type of training sets was used to build QSAR models and contained quantitative data on enzyme inhibition. For these types of training sets, only IC 50 values with the field "standard relation" containing sign "=" were considered. The data in the training sets for double records of the same structures were recalculated as median values. IC50 values in nM were converted to pIC 50 = −log 10 (M) . After the data selection and filtration of the training sets containing structures and appropriate information (pIC 50 or belonging to inhibitors), the SDfiles were created for the appropriate CYP isoforms. Then, these training sets were used in PASS and GUSAR software to build the appropriate (Q)SAR models. Characteristics of the training sets for development of classification models are represented in Table 4. The intersections of the training sets in "Total" model are shown in Figure 2. As can be seen, the compounds presented in the sets usually inhibit only one isoenzyme; 133 compounds inhibit all isoenzymes. Among the noninhibitors, most compounds do not inhibit any of the enzymes. Additionally, "Total" model chemical space of inhibitors and noninhibitors was shown by molecular weight as X-axis and logP, calculated by RdKit, as Y-axis (see Figure  3). Additionally, "Total" model chemical space of inhibitors and noninhibitors was shown by molecular weight as X-axis and logP, calculated by RdKit, as Y-axis (see Figure 3).

GUSAR
GUSAR software [16,21] was used to create the QSAR models predicting the inhibition of CYP isoforms on the basis of the structural formula of compounds. GUSAR is based on the representation of the compound structure by QNA and MNA descriptors [16,17] A combined RBF-SCR algorithm was used to build the relationships between the descriptors and biological activity [22]. To improve the accuracy of prediction, GUSAR allows to create a consensus model; the final predicted value for each activity/end point is estimated by including the weighted average of the predicted values from the set of QSAR models [23]. The value obtained from each model is weighted by the similarity value calculated in the estimation of its applicability domain. For applicability domain calculation in GUSAR, three different approaches are used for each model: similarity, leverage, and accuracy assessment [23].

PASS
PASS software [17] predicts the profile of biological activity based on advanced naïve Bayes classifier. The molecular structures of compounds are described by Multilevel Neighborhoods of Atom (MNA) descriptors. Accuracy of prediction is estimated in PASS by leave one-out cross-validation procedure using the IAP criterion. IAP is a probability

GUSAR
GUSAR software [16,21] was used to create the QSAR models predicting the inhibition of CYP isoforms on the basis of the structural formula of compounds. GUSAR is based on the representation of the compound structure by QNA and MNA descriptors [16,17] A combined RBF-SCR algorithm was used to build the relationships between the descriptors and biological activity [22]. To improve the accuracy of prediction, GUSAR allows to create a consensus model; the final predicted value for each activity/end point is estimated by including the weighted average of the predicted values from the set of QSAR models [23]. The value obtained from each model is weighted by the similarity value calculated in the estimation of its applicability domain. For applicability domain calculation in GUSAR, three different approaches are used for each model: similarity, leverage, and accuracy assessment [23].

PASS
PASS software [17] predicts the profile of biological activity based on advanced naïve Bayes classifier. The molecular structures of compounds are described by Multilevel Neighborhoods of Atom (MNA) descriptors. Accuracy of prediction is estimated in PASS by leave one-out cross-validation procedure using the IAP criterion. IAP is a probability that is randomly selected from an independent test set; positive and negative examples will be correctly classified [14]. The PASS output is a ranked list of activities with estimates the probabilities "to be active" Pa and "to be inactive" Pi. The activities with Pa > Pi are considered as possible. The percent of new MNA descriptors for the tested molecule is used in the web application for assessment of the applicability domain; the higher the percent of new MNA descriptors, the less the molecule structure is appropriate for the model. The tested molecules with up to 25% of new MNA descriptors are in the applicability domain [24].