MetStabOn—Online Platform for Metabolic Stability Predictions

Metabolic stability is an important parameter to be optimized during the complex process of designing new active compounds. Tuning this parameter with the simultaneous maintenance of a desired compound’s activity is not an easy task due to the extreme complexity of metabolic pathways in living organisms. In this study, the platform for in silico qualitative evaluation of metabolic stability, expressed as half-lifetime and clearance was developed. The platform is based on the application of machine learning methods and separate models for human, rat and mouse data were constructed. The compounds’ evaluation is qualitative and two types of experiments can be performed—regression, which is when the compound is assigned to one of the metabolic stability classes (low, medium, high) on the basis of numerical value of the predicted half-lifetime, and classification, in which the molecule is directly assessed as low, medium or high stability. The results show that the models have good predictive power, with accuracy values over 0.7 for all cases, for Sequential Minimal Optimization (SMO), k-nearest neighbor (IBk) and Random Forest algorithms. Additionally, for each of the analyzed compounds, 10 of the most similar structures from the training set (in terms of Tanimoto metric similarity) are identified and made available for download as separate files for more detailed manual inspection. The predictive power of the models was confronted with the external dataset, containing metabolic stability assessment via the GUSAR software, leading to good consistency of results for SMOreg and Naïve Bayes (~0.8 on average). The tool is available online.


Introduction
During the drug design process, attention is initially placed on obtaining the desired affinity with the appropriate receptors. However, failures of compounds at later stages of drug development are connected with other unfavorable physicochemical, pharmacokinetic, or toxic properties. The proper evaluation of these properties in silico is therefore just as important as the development of computational tools for accurate activity predictions [1][2][3][4].
A number of parameters can be set on the basis of which drug-like potential of the compounds is evaluated. One of the most popular groups of these properties are Lipinski's Rule of Five, which are one of the simplest and usually the first filters applied to disqualify compounds with an unfavorable physicochemical profile [5]. It is also important to provide proper compound solubility (both in water due to solubility in the fluids in the organism, and in non-polar solvents due to the provision of penetration of biological membranes, as well as the provision of proper equilibrium between the solubility in these two environments [6]), determine the ionization potential [7] and assure that the compound can penetrate the gut-blood and blood-brain barrier (in the case of drugs acting within the central nervous system) [8]. The permeability through the biological membranes is vital not only from In this study, we focus on one of the compound's parameters-metabolic stability. Usually, the tools for ADMET properties evaluation do not consider this property, as its predictions are very difficult due to the extreme complexity of the metabolic stability phenomenon and the great number of factors influencing this parameter. On the other hand, metabolic stability is a very important parameter, as the compound requires sufficient time to trigger the desired pharmacological response before its decomposition and moreover, the formed metabolites might not only be unable to provide the biological action of interest, but they can be toxic. For the evaluation of metabolic stability on the basis of experimental data, the most often used are half-lifetime (T 1/2 ) data produced in assays using liver microsomes. There were already several machine learning-based approaches to construct general QSAR models for prediction of this parameter [46][47][48][49] and several other studies based on narrower chemical space of selected classes of compounds [50][51][52][53][54].
Here, we present a freely available online tool for metabolic stability predictions expressed as T 1/2 or clearance. The tool uses ligand-based methodology and six machine learning methods to evaluate the compound stability, with separate models constructed for various species and assays preformed on liver microsomes and plasma. Additionally, an analysis of available metabolic stability data was performed and the most similar compounds from the training set are provided for each of the submitted structures, enabling manual analysis of the results and comparisons. The outcome of the constructed tool was compared with the external National Cancer Institute (NCI) dataset containing a GUSAR-based evaluation of metabolic stability [35]. The tool and all obtained results are freely available at http://skandal.if-pan.krakow.pl/met_stab_pred/.

Materials and Methods
The data for the construction of the tool for metabolic stability predictions were collected from the ChEMBL database version 23 [55]. All records with the T 1/2 and intrinsic clearance parameters reported were downloaded. Data preprocessing involved the following steps: selection of in vitro assays performed on liver microsomes or plasma, selection of records with standard unit "hr" for T 1/2 and "mL·min −1 ·g −1 ", and division into separate sets referring to human, rat and mouse experiments. Due to the lack of sufficient number of entries for clearance data obtained via plasma-based assays, for plasma data, only records for T 1/2 were gathered. The number of data points present in the respective datasets is gathered in Table 2. The compounds were represented with the use of the 1-and 2-dimensional PaDEL-Descriptors [56] (1d2d descriptors) and Extended Fingerprint [57] (ExtFP) from the same software package. These forms of representation were chosen after initial studies performed on a series of long-chain aryl piperazines, where several fingerprints implemented in PaDEL (Extended Fingerprint, MACCS Fingerprint [58], Klekota-Roth Fingerprint [59], Graph Fingerprint [57], PubChem Fingerprint [60], and Substructure Fingerprint [61]) and descriptor sets were tested. Three dimensional descriptors were not included due to the relatively high fraction of compounds for which they could not be calculated due to errors in their generation.
The constructed tool predicts the numerical value of metabolic stability with the predictive model based on the application of the two types of machine learning algorithms, regression and classification. From the first group of methods, one algorithm was used: SMOreg, which is a modification of the very popular and efficient algorithm Support Vector Machine (SVM) [62] into Sequential Minimal Optimization (SMO) [63] and adjusted for performing regression tasks. Additionally, five classification algorithms were used: SMO, Random Forest [64], Naïve Bayes [65], k-nearest neighbor (IBk) [66], and decision tree J48 [67]. However, in order to enable easier interpretability of the outcome of regression experiments, compounds are also divided into three classes according to metabolic stability values-low, medium, and high-and the results are colored accordingly. For each of the analyzed structures, the 10 most similar compounds from the training set (Tanimoto metric [68], topological fingerprint from RDKit package [69]) are found and provided in separate files for manual inspection (the particular chemical structure is provided only once, and the median half-lifetime value is given). Structures for analysis online can be submitted as a structure data file (sdf) or drawn using the MarvinJS [70] plugin. Regardless of the way the query is submitted, all the structures are shown in the results.

The Importance of Separate Models for Different Species
Separate models were constructed for different species. This approach was applied due to the relatively high differences in results of the in vitro tests for some of the compounds, despite the similar overall distribution of data points for human, rat and mouse data with the majority of very unstable compounds (     For all the compounds presented in the Figure 4, there were substantial differences in the results of metabolic stability examinations based on different species. For example, for compound CHEMBL214957, the human and mouse-based data were quite consistent (with stability experiments outcome being 10.05 and 9.57 hr, respectively); however, for experiments, in which rat liver microsomes were used, the obtained T 1/2 value was equal to 3.27. For compound CHEMBL2335990, more consistency was observed for rat and mouse-based experiments with the stability values of 0.93 and 0.55 hr, respectively, whereas in experiments using human microsomes, the obtained T 1/2 value was equal to 7.25 hr. For the other two compounds presented in Figure 4, that is CHEMBL3108858 and CHEMBL2346736, no similarities between any two experiments were observed, and the T 1/2 values varied from 0.5 to 5.5 hr.
Therefore, the data referring to human, rat and mouse-based experiments were not mixed, and separate models were constructed for each of these experimental conditions.
In order to provide a more general picture of variations in experimental results of metabolic stability for different models (human, rat, mouse), all compounds for which the half-lifetimes were provided for all the models were identified and standard deviations (σ) given by the following equation were calculated: where x i is half-lifetime value measured for a particular model; µ-mean value of all three measures (human, mouse, rat) provided for a particular compound The obtained values of standard deviations were presented in Figure 5 (the CHEMBL identifiers of compounds with metabolic stability data for human, rat, and mouse models with standard deviation values are provided in the Supplementary Materials). The analysis of the histogram in Figure 5 indicates and confirms the relatively high variation of data. Out of 249 structures containing data referring to all three experimental conditions (human, rat, mouse) to which Figure 5 refers, for over 16% of them, standard deviation of half-lifetimes values was higher than 0.5, leading to swap of metabolic classes. Moreover, for 172 structures, it was above 0. Additionally, although for datasets containing human-based data, the percentage of datapoints with non-zero standard deviation was equal to 8% for mouse-based datasets, which contained much less records, the percentage of non-zero data was equal to 20%. All these results support the construction of separate predictive models for human, rat, and mouse-based data.
A scheme of all the predictive approaches that were used in the study is presented in Figure 6. Taking into account the combination of all datasets used, prediction algorithms and compounds representation, the total number of predictive models provided as a result of the study is equal to 108; the online version of the tool for metabolic stability predictions includes 36 models referring to experimental data produced on liver microsomes and reported as T 1/2 . Remaining models are available at http://skandal.if-pan.krakow.pl/met_stab_pred/ and can directly be used within the WEKA software [71].

Evaluation of Models in 10-Fold Cross-Validation
Ten-fold cross-validation (CV) studies were performed for optimization and evaluation of the constructed predictive models. WEKA implementation [71] was used for all the algorithms used in the study.
The division into metabolic stability classes was applied, and evaluation parameters were subsequently calculated: overall accuracy, AUROC [72] in case of classification models, and recall and precision for each class.
The cutoffs for metabolic stability class division were as follows (T 1/2 expressed in hours): The number of compounds belonging to each stability class are presented in Table 3. The class distribution was not uniform and in all cases the highest number of compounds belonged to the group of low metabolic stability. The highest variation was observed for rat and mouse models, with 62% and 60% of compounds exhibiting low values of half-lifetimes, respectively. For both of these cases, compounds of high stability constituted 9% of the whole dataset, and the fraction of medium stability compounds in the whole dataset was equal to 29% and 31% for rat and mouse models, respectively. For human models, the compounds were more uniformly distributed among classes with 44% of compounds belonging to low and medium stability class, and the remaining 12% of datapoints referred to compounds characterized by high metabolic stability.
The values of evaluating parameters obtained in 10-fold CV studies are presented in Table 4 for T 1/2 liver microsomes data, all results for the remaining datasets are present in the Supplementary Materials.
The parameters values above 0.7 are presented in bold. In order to facilitate the results interpretation, the data were also presented in the respective figures (Figure 7 for T 1/2 liver microsomes; remaining data in Supplementary Materials).  In general, the values of evaluating parameters are high and show that the constructed models are capable of making a valid evaluation of metabolic stability expressed as T 1/2 . The general observation is that 1d2d descriptors provided higher values of evaluating parameters than ExtFP. It is also visible that there are high variations for recall and precision values depending on the metabolic stability class considered. However, for the great majority of cases, low stability class was the one that led to the highest recall and precision values. However, the highest number of cases belonged to the low stability class in all human, mouse and rat datasets. Relatively, the highest recall values were obtained for rat models, for Random Forest and SMO models (0.935 and 0.903, respectively) with compounds represented by 1d2d descriptors; for ExtFP representation, the recall values were on similar level for these methods: 0.903, and 0.904, respectively. On the other hand, it was IBk (for 1d2d descriptors) and Naïve Bayes (for ExtFP) that provided the highest precision values: 0.848 and 0.841, respectively.
Taking into account the overall accuracy of predictions, the SMO, IBk and Random Forest methods were the only ones that consistently provided overall accuracy over 0.7. For human models, it was 0.739 for SMO and 0.728 for Random Forest (for 1d2d descriptors and ExtFP representations, respectively) that were the highest overall accuracies; however, for mouse and rat data, it was SMO that provided the best performance of predictive models for both compounds representations, with values varying from 0.743 to 0.777. The other parameter that provided information on the general performance of the model was AUROC, which in general adopted higher values than overall accuracy, reaching values close to 0.9: 0.886 and 0.881 for Random Forest models constructed on human data, 0.912 and 0.906 for Random Forest models that used mouse data, and 0.872 and 0.848 for Random Forest model built on rat data (for 1d2d descriptors and ExtFP representations, respectively).

Comparison of the Constructed Tool Outcome with the Predictions on the External Dataset
The outcome of the constructed tool was compared with the external NCI dataset containing GUSAR-based evaluation of metabolic stability [35]. The predictions provided there were only binary (stable/unstable). After the removal of compounds with errors, and those containing heavy atoms, such as Pb, Ag, Se, Te (leading to errors in descriptors calculations), the predictions with models produced on T 1/2 liver microsomes data were carried out (as the dataset originally contained such data). The results obtained via these two approaches were compared by accuracy indicating the fraction of the same predictions (Table 5). The comparison of the predictions obtained via the constructed tool with the GUSAR predictions indicate great dependence of the results on the machine learning algorithm applied. The most consistent predictions with the output of the GUSAR software were provided by the SMOreg and Naïve Bayes algorithms. As the GUSAR predictions were binary (the compounds were evaluated only as stable or unstable) and our tool evaluates compounds as low, medium or high stability, two approaches of dealing with records assigned to medium stability class were applied: such entries were removed before the accuracy calculation, or the medium class assignments were shifted to the high stability class, as being more populated. With the removal of records assigned to medium stability class, the most consistent predictions with the GUSAR software outcome were obtained with the use of the SMOreg algorithm with 1d2d descriptors used for compounds representation (accuracy of 0.89). Surprisingly, the application of SMOreg with ExtFP for compounds representation led to high inconsistency of the constructed tool outcome with the GUSAR-based evaluation with the accuracy of 0.39. The GUSAR software output was also in line with the Naïve Bayes predictions, for both 1d2d descriptors and ExtFP compounds representation, with accuracies of 0.81 and 0.72, respectively. In general, for sets with medium class predictions removed, the application of 1d2d descriptors for compounds representation led to more consistent results with the GUSAR software than ExtFP (for all algorithms but IBk, the accuracies were much higher for the former compounds representation, by from 0.13 for J48, through 0.38 for Random Forest, to 0.50 for SMOreg). The advantage of the 1d2d descriptors representation over ExtFP was not visible for the situation, when the medium class predictions were manually shifted to the group of records referring to high stability compounds. For the most consistent with the GUSAR predictions, SMOreg, the difference in accuracy between these two forms of representations was only 0.01 (0.78 vs. 0.77). For Naïve Bayes, which also produced results consistent with the GUSAR software, the difference was equal to 0.05 (accuracy of 0.77 vs. 0.72 for 1d2d descriptors and ExtFP, respectively). Random Forest predictions were in line with the output of the GUSAR software only for the 1d2d descriptors (0.71 accuracy for this representation vs. 0.54 for ExtFP).

Example Results
A screenshot from the example output of the tool is presented in Figure 8. The summary describes the compound representation used, the predictive model applied, the number of input compounds and the number of compounds assigned to a particular metabolic stability class. Detailed results are gathered in a table with the structure and simplified molecular-input line-entry system (SMILES) of a compound, the predicted value of half-lifetime and the metabolic stability class to which a compound was assigned. Additionally, in order to perform a more detailed analysis, the 10 most similar compounds from the training set (in terms of Tanimoto metric-based similarity) can be downloaded for each of the analyzed structures (due to the computational resources limitations, this option is available for a maximum number of 100 structures submitted at one time). As the training sets contained all available data, in order to prevent situations in which the number of structures for manual inspection is very restricted, the particular chemical structure was listed only once, and in case of multiple metabolic stability entries available, median values of half-lifetimes were provided. The abovementioned data can be downloaded separately for each of the analyzed compounds or as one zipped directory.

Materials and Methods
The models' parameters were optimized with the set of values gathered in Table 6. For each model, the value that provided the lowest overall accuracy was selected. The conditions selected for each model are shown in Table 7.
The overall accuracy, recall and precision were calculated using the following equations: overall accuracy = number of correct predictions number of all predictions (2) where TP is the number of instances correctly assigned to a particular class, FN is the number of instances belonging to particular class incorrectly assigned to another one, and FP is the number of instances incorrectly assigned to particular class.

Conclusions
In summary, a tool for the qualitative evaluation of metabolic stability expressed as half-lifetime was constructed. It uses regression and classification tools to provide the assignment of a compound to a particular stability class (low, medium, high), 1d2d descriptors and ExtFP for compound representation, and the SMOreg, Random Forest, SMO, IBk, Naïve Bayes and J48 machine learning algorithms for making predictions. The tool is freely available online and allows for the submission of structures via sdf files or through drawing. Separate predictive models were constructed for human, rat and mouse data, and for data obtained in experiments using liver microsomes and plasma, as well as for data with metabolic stability expressed as T 1/2 and clearance. A detailed retrospective analysis and the application of the constructed model to the external dataset proved the usefulness of the developed tool. The tool can be very useful in designing new potential drugs, and in enabling a fast initial evaluation of a compound's metabolic stability.