Predictivity Approach for Quantitative Structure-Property Models. Application for Blood-Brain Barrier Permeation of Diverse Drug-Like Compounds

The goal of the present research was to introduce a statistical predictivity approach for structure-based prediction models. The approach was applied to the domain of blood-brain barrier (BBB) permeation of diverse drug-like compounds. For this purpose, 15 statistical parameters and their associated 95% confidence intervals, computed on a 2 × 2 contingency table, were defined as measures of predictivity for binary quantitative structure-property models. The predictivity approach was applied to a set of 437 diverse molecules, 122 with measured BBB permeability and 315 classified as active or inactive. A training set of 81 compounds (~2/3 of the 122 compounds, assigned randomly) was used to identify the model, and a test set of 41 compounds was used as the internal validation set. The molecular descriptor family on vertices cutting was the computation tool used to generate and calculate structural descriptors for all compounds. The identified model was assessed using the predictivity approach and compared with one previously reported model. The best identified classification model had an accuracy of 69% in the training set (95%CI [58.53–78.37]) and of 73% in the test set (95%CI [58.32–84.77]). The predictive accuracy obtained on the external set was 73% (95%CI [67.58–77.39]). The classification model proved better at classifying inactive compounds (specificity of ~74% [59.20–85.15]) than active compounds (sensitivity of ~64% [48.47–77.70]) in the training and external sets. The overall accuracy of the previously reported model seems not to be statistically significantly better than that of the identified model (~81% [71.45–87.80] in the training set, ~93% [78.12–98.17] in the test set and ~79% [70.19–86.58] in the external set).
In conclusion, our predictivity approach allowed us to characterize the model obtained on the investigated set of compounds and to compare it with a previously reported model. According to the obtained results, the model reported here should be chosen if correct classification of inactive compounds is desired, and the previously reported model should be chosen if correct classification of active compounds is the priority.

Introduction
The blood-brain barrier (BBB), a complex membranous system of brain capillary endothelial cells, pericytes, astrocytes, and nerve endings, plays an essential role in maintaining the homeostasis of the central nervous system (CNS) by blocking the movement of molecules [1]. Determining blood-brain barrier penetration is crucial in assessing a compound's suitability as a central nervous system drug [2]. As life expectancy increases and neurological pathologies become more frequent, there is a need to identify, rapidly and in a cost- and resource-effective manner, potentially adverse effects of drugs acting on CNS and non-CNS targets [3,4].
Quantitative structure-activity/property relationship models support the "fail fast, fail cheap" model [5] in the analysis of the link between the structure of compounds and the associated activity/property. Different techniques have been used in BBB modeling. One of the earliest predictions of BBB permeation was presented by Young et al. [6] as a linear relationship between logBB and ΔlogP (histamine H2 receptor agonists).
Crivori et al. used descriptors from 3D molecular fields to estimate the BBB permeation and identified a model able to correctly predict 90% of the permeation data [7]. Narayanan and Gunturi used a systematic variable selection and modeling method based on prediction on a sample of 88 BBB compounds and identified, as best performing, one model with three descriptors and one model with six descriptors, the latter with higher performance [8]. These models proved to have a success ratio of 82% in predicting the BBB + external data set. Statistical characteristics of their best models are presented in Equation 1 and Equation 2.
where R = correlation coefficient, R_loo = leave-one-out correlation coefficient, F = F-value, se = standard error of estimate, j = number of descriptors in the model, n = sample size. Subramanian and Kitchen [9] found that logP, polar surface area and some electrotopological indices are able to provide an accurate predictive model for logBB (logarithm of the brain to plasma concentration ratio). Linear regression and multivariate genetic partial least squares approaches were applied, and the obtained model proved to have a success rate higher than 70% for active compounds and almost 60% for inactive compounds [9]. Subramanian and Kitchen concluded that the prediction consensus was not able to significantly improve the discrimination between molecules active and inactive on the central nervous system.
Goodwin and Clark analyzed and presented the main problems of in silico prediction of blood-brain barrier penetration: the quality of the available measured data, and the prediction uncertainty and relevance of predictive models [10]. They pointed out the usefulness of local and global models, as well as the importance of accurate experimental data, by highlighting some success stories [11][12][13].
Non-linear approaches have also been used to predict the distribution of compounds based on different states (neutral, cationic and anionic) of the compounds distributed into three different compositions (lipid, protein and water) [14]. The statistical characteristics of the predictive model for the distribution of compounds were as follows [14]: All data: R (correlation coefficient) = 0.906, se (standard error) = 0.326, n (sample size) = 160; Training set: R = 0.908, se = 0.320, n = 139; and Test set: R = 0.903, se = 0.297, n = 21.
Klon reviewed the computational models of central nervous system penetration according to the type of variable of interest (quantitative for logBB models and qualitative for binary models) [15]. He proposed the permeability surface product and the fraction unbound in the brain as appropriate metric endpoints [15]. However, owing to the availability of experimental data, the blood:brain ratio is still used for in silico modeling [16,17].
Many models for the prediction of logBB are now available in the literature. However, how can the best model be identified? How can different models be compared with one another? This manuscript introduces a new classification model, based on multiple linear regression, for the domain of blood-brain barrier modeling. A series of 15 statistical parameters is introduced as a diagnostic tool for a binary logBB model as well as for the comparison of different classification models. The study aimed to present a new approach to assessing the predictivity of a structure-based prediction model, an aim that was accomplished.

Results
The multiple linear regression (MLR) model that met as many criteria as possible and proved to perform best is presented in Equation 3 (a: MLR equation; b: statistical characteristics of the MLR model; c: statistical characteristics of the model in leave-one-out analysis).
where Ŷ_logBB = property estimated by the MDFV model; TLgFAIDI (X1), GAmIAaDI (X2), TAgFIADL (X3), and TAgPIADL (X4) = members of MDFV; the values in round brackets allow us to obtain the lower (subtraction) and upper (addition) confidence boundaries for the slope parameters; R = correlation coefficient; R² = determination coefficient; s_est = standard error of estimate; n_tr = sample size of the training set; F_est (p) = F-value of the model (p-value); t = t-value; R²_loo = leave-one-out cross-validation squared correlation coefficient; s_loo = standard error of predicted; F_loo = F-value of the leave-one-out cross-validation model; values in [] = 95% confidence interval; r = Pearson correlation coefficient between the property observed and that estimated by the model; r_sQ = semi-quantitative correlation coefficient; ρ = Spearman rank correlation coefficient; τ_a, τ_b, τ_c = Kendall's correlation coefficients; Γ = Gamma correlation coefficient.
The descriptors' contributions to the logBB of the investigated compounds are as follows: Two descriptors (TAgFIADL and TAgPIADL) proved to correlate significantly with each other (perfect concordance among all seven correlation methods), but neither of them correlates significantly with the observed property. No other statistically significant correlations could be identified between descriptors or between descriptors and logBB. The Durbin-Watson statistic was computed as a measure of autocorrelation. A value of 2.108 was obtained for the model presented in Equation 3.
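The Durbin-Watson statistic mentioned above can be illustrated with a minimal sketch. It assumes the standard definition (sum of squared successive differences of the residuals divided by the residual sum of squares); the function name and the residuals below are hypothetical and are not taken from the model in Equation 3.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic of a sequence of regression residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Numerator: squared differences between successive residuals.
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    # Denominator: residual sum of squares.
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator


# Hypothetical residuals (observed minus estimated logBB), for illustration only.
example_residuals = [0.5, -0.5, 0.5, -0.5]
dw = durbin_watson(example_residuals)
```

The statistic ranges from 0 to 4, and values close to 2 (such as the 2.108 reported here) are conventionally read as the absence of first-order autocorrelation in the residuals.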
A concordance (also known as accuracy) of 69% was obtained for the training set after transformation of the observed and estimated logBB into dichotomous variables. The concordance computed separately for the compounds observed as active and for those observed as inactive is known as sensitivity and specificity, respectively.
The prediction ability of the model presented in Equation 3 was investigated on the test set. The obtained statistical characteristics are presented in Equation 4.
where se_pred = standard error of predicted; n_ts = sample size of the test set; F_pred = F-value of predicted. The concordance between the observed and predicted property when classification was applied to the test set was 73% (accuracy, see Table 1). The ability of the classification model, which proved not to be a very good model in terms of goodness-of-fit, was analyzed for the training, test and external sets with the defined diagnostic parameters. The results are presented in Table 1. The comparison with the previously reported models (Equation 1 and Equation 2) showed that the classification model is neither better nor worse in terms of goodness-of-fit.
The proposed statistical parameters were applied as diagnostic tools for the model presented in Equation 2 and the results are presented in Table 2.

Discussion
In silico modeling has been revolutionized along with the development and improvement of computers and information technologies [18,19]. Chemoinformatics, bioinformatics, combinatorial chemistry [20], high throughput screening [21], virtual screening, de novo design [22], and structure-based drug design [23][24][25] are approaches frequently used in drug discovery. The study aimed to present a new approach to assessing the predictivity of a structure-based prediction model, an aim that was accomplished. A predictive model was developed from a family of structural descriptors (molecular descriptors family on vertices cut) using the multiple linear regression method. The best performing MLR model was identified as the one meeting a series of criteria [26], and its performance was assessed using statistical parameters computed on the 2 × 2 contingency table.
The model with the highest correlation coefficient, the highest Fisher parameter, the lowest standard error of estimate, and the smallest possible number of significant parameters was chosen (see Equation 3). All four descriptors used by the model made a significant contribution to the explanation of BBB permeation, as can be observed from Equation 3. The analysis of the best performing model in terms of the descriptors' contribution to the property (blood-brain barrier permeation of drug-like compounds) revealed the following:
▪ almost 61% of the variation of BBB permeability could be explained by the linear relationship with structure-based descriptors; the interaction between property and structure is mediated through bonds (topology) and space (geometry), as given by the first letter in the descriptor's name;
▪ the penetrability of drug-like compounds proved to be related to the electronic affinity (A, second letter in the descriptor's name) and the melting point under normal temperature and pressure conditions (L) of BBB compounds; the structure on property scale proved to be of identity (I, last letter in the descriptor's name) and logarithm (L) type.
The obtained model proved to be a reliable model with a reasonable goodness-of-fit, given that the sample was so heterogeneous. The absence of statistically significant correlation between descriptors and the Durbin-Watson statistic (based on its value, the presence of autocorrelation was ruled out [27,28]) supported the reliability of the MLR model. The results obtained in leave-one-out cross-validation showed that the model could have predictive ability on external data since the difference between the determination coefficients was 0.07 [29]. Based on the results of the leave-one-out cross-validation analysis, we expected the performance of the model on the test set to differ from that on the training set by no more than 12% in terms of the determination coefficient. The test set, comprising 41 drug-like compounds with known blood-brain barrier permeation, was analyzed for its prediction and penetration abilities. As expected, the determination coefficient was smaller than the one obtained on the training set, but the two proved not to be significantly different since the associated 95% confidence intervals overlap.
The goodness-of-fit of our model (Equation 3) was compared with two previously reported models [8] and proved neither better nor worse in terms of the correlation coefficient. However, the following should be taken into account and should give weight to the model presented in Equation 3:
▪ The number of descriptors used by our model is 4 (Equation 3), while the number of descriptors used by the best performing previously reported model is 6 (Equation 2).
▪ The number of compounds in the training set was almost the same, but the compounds included were not identical. It is well known that if some compounds are included in or excluded from the analysis, similar MLR equations could be obtained but with some changes in the parameters.
The quality criteria used to determine whether a compound would be included in the sample were as follows:
▪ reliable experimental data (compounds with different experimental values obtained by applying the same protocol were not included);
▪ compound identity (one compound was included whenever identical compounds were identified);
▪ normality of experimental data.
Furthermore, the ability of the obtained model (Equation 3) to correctly classify blood-brain barrier permeation was tested on two samples of compounds: the test set and the external set (compounds used in neither the training nor the test set). This analysis was carried out after transformation of the observed blood-brain barrier permeation into a dichotomous variable; the interpretation of the obtained statistical parameters (see Table 1) revealed the following:
▪ The dependence between classification and observed permeation found for all three sets of compounds showed that the model has abilities in estimation as well as in prediction.
▪ The total fraction of compounds correctly classified proved to be almost identical in the training and test sets. Even if the accuracy of the classification model was smaller in the training set than in the test and external sets, the accuracies proved not to be significantly different since their confidence intervals overlapped.
▪ The error rate (the fraction of compounds misclassified) varied from 27% to 31%, with a higher value obtained in the training set than in the test and external sets.
▪ A valid classification model is one that is able to classify correctly as many compounds as possible. Thus, for a good classification model, the 95% confidence interval of the prior proportional probability of an active compound is expected to overlap the confidence interval of the post-test probability of classification as active. The prior probability of the active class and the post-test probability of classification (where the test is our classification model) support the ability of the model in classification. The smallest difference for the active class of compounds was seen in the training set; the same was observed for the inactive class of compounds.
▪ The classification model proved to have its highest ability to identify a true active compound out of all the active compounds in the test set. Since the confidence intervals associated with sensitivity overlap in the training and test sets, the model has the same ability to identify the true active compounds in these sets. The sensitivity of the classification model on the external set showed that it is not appropriate to use this model to classify active BBB permeation compounds, since the false negative rate is almost 60% (there is a 2/5 chance of correctly classifying an active BBB compound).
▪ The highest ability in the classification of inactive compounds was obtained in the external set (~86%). This ability seems not to be significantly different in the training, test and external sets since the associated 95% confidence intervals overlap.
▪ The highest positive predictivity was obtained in the training set; it refers to the ability of our classification model to correctly assign a compound as active out of all compounds assigned as active. As expected from the analysis of sensitivity, the positive predictivity of our model was significantly smaller when the classification model was applied to the external set (~56%, but the confidence interval did overlap with the confidence interval of positive predictivity obtained on the training set).
▪ The highest value of negative predictivity was obtained in the test set. The negative predictivity proved not to be significantly different among the three investigated sets.
▪ The smallest value of the probability of wrong classification as an active compound was obtained in the training set, while the highest value was obtained in the external set (these two probabilities proved significantly different).
▪ The smallest value of the probability of wrong classification as an inactive compound was obtained in the test set. No statistically significant difference in terms of the probability of wrong classification was identified when all three sets of compounds were analyzed.
▪ The odds of correct classification in the group of active compounds divided by the odds of incorrect classification in the group of inactive compounds proved to be almost identical in the training and external sets. Even if the value of the odds ratio (OR) obtained in the test set is higher than the values obtained in the training or external sets, the ability of our classification model in terms of OR proved not to be significantly different in this set (the confidence intervals overlap).
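The criterion used throughout this comparison (two estimates are treated as not significantly different when their 95% confidence intervals overlap) can be sketched in a few lines. The helper name is hypothetical; the intervals are the accuracy intervals of the Equation 3 model quoted in the abstract.

```python
def intervals_overlap(a, b):
    """Return True when the closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi


# 95% confidence intervals of accuracy (%) for the Equation 3 model, as given in the abstract.
accuracy_ci = {
    "training": (58.53, 78.37),
    "test": (58.32, 84.77),
    "external": (67.58, 77.39),
}

# Compare every pair of sets once.
names = list(accuracy_ci)
pairwise_overlap = {
    (first, second): intervals_overlap(accuracy_ci[first], accuracy_ci[second])
    for i, first in enumerate(names)
    for second in names[i + 1:]
}
```

Interval overlap is a conservative rule of thumb: non-overlapping intervals indicate a difference, whereas overlapping intervals do not by themselves establish that a formal test would find none.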
The differences in the performance of our classification model on the training, test and external sets could be explained by the distribution of active and inactive compounds in the sets (active compounds accounted for ~48% of the training set, ~46% of the test set and ~30% of the external set). If the percentage of active compounds in the external set is constrained to be close to that in the training and test sets, the significant differences in the parameters would be reduced and the classification model could have the same ability in the external set as in the training and test sets.
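The dependence of the predictive values on the proportion of active compounds can be illustrated with Bayes' rule. The sketch below is an assumption-laden illustration, not the computation behind Table 1: the function name is hypothetical, sensitivity (0.40) and specificity (0.86) are rounded from the external-set values quoted in the text, and they are held fixed while only the proportion of active compounds changes.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Rounded external-set values from the text (assumed fixed for this illustration).
SENSITIVITY = 0.40
SPECIFICITY = 0.86

# ~30% active compounds (external set) versus ~48% (training set).
ppv_external, npv_external = predictive_values(SENSITIVITY, SPECIFICITY, 0.30)
ppv_balanced, npv_balanced = predictive_values(SENSITIVITY, SPECIFICITY, 0.48)
```

Under these assumptions the positive predictivity is about 0.55 at 30% active compounds, in line with the ~56% reported for the external set, and rises to about 0.73 at 48% active compounds. Sensitivity and specificity themselves do not depend on the class proportions, so this sketch accounts only for the shift in the predictive values.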
The same statistical parameters proposed to diagnose the classification model were also computed for the previously obtained model presented in Equation 2, in order to be used as comparison parameters. These parameters could be used to compare different models obtained on the same class of compounds but not on identical sets. The comparative analysis of the statistical parameters obtained for the proposed classification model (see Table 1) and for the previously reported model (see Table 2) revealed the following:
▪ The interdependence between the observed and estimated/predicted class (active/inactive) was proved statistically for both models (Equation 3, Table 1 and Equation 2, Table 2), but the correlation coefficients in the 2 × 2 contingency table proved to be higher for the model presented in Equation 2.
▪ The higher accuracy for all three sets supports the use of the model presented in Equation 2 in the classification of active and inactive BBB compounds. The accuracy seems not to be statistically significantly different when the model from Equation 3 is compared with the model from Equation 2, since the associated 95% confidence intervals overlap.
▪ The under-classification as well as the over-classification seems to have smaller values for the model from Equation 2 than for the model from Equation 3, but since the associated 95% confidence intervals overlap, these differences do not seem to be statistically significant.
▪ The percentage of compounds correctly assigned as active out of all those assigned as active proved to be higher for the model presented in Equation 2 than for the model presented in Equation 3. Since the 95% confidence intervals overlap, these percentages are not statistically significantly different. The same observation also holds for the percentage of compounds correctly assigned as inactive out of all those assigned as inactive.
▪ The previously reported model seems to have the smallest probabilities of wrong classification as active/inactive compounds. However, given the overlap of the associated 95% confidence intervals, these differences may not be statistically significant.
The proposed four-descriptor model demonstrates its abilities in the estimation and prediction of BBB penetration of drug-like compounds. The "best model" approach could be questionable in terms of goodness-of-fit, but the proposed four-descriptor model proved to be good in certain applicability domains, as shown above. Moreover, a model with four descriptors may perform worse than a model with six descriptors, but experience may show that the model with four descriptors is more stable when the training data change. Consequently, the idea followed in the paper was to provide a tool for assessing models from certain points of view, and to let users select the classifier that best fits their chosen applicability domain.
The goodness-of-fit of our model is similar to that of other models published in the specialty literature when similar sample sizes were used in modeling (n_training = 329, R²_training = 0.52, n_test = 141, R² = 0.54, n_external = 174, R² = 0.65 [17]). The classification abilities of a qSAR/qSPR model could be tested using a series of parameters computed on a 2 × 2 contingency table. The model's ability to correctly classify BBB drug-like compounds, as well as the fraction of compounds misclassified, proved not to be significantly different when all three sets were compared. The difference identified in under-classification for the external set, where the false negative rate proved to be significantly higher than in the training and test sets, could be explained by the different percentage of active compounds in the external set compared with the training and test sets. Our classification model performs better if the prior proportional probabilities of the active and inactive classes are closer to each other in the investigated sets of compounds (see Table 1). Our model could be applied to classify BBB penetration of drug-like compounds and would provide more accurate classification of inactive compounds if the prior proportional probability of the active class is close to the one obtained in the training set. Our classification model could be refined by clustering the observed penetration and obtaining models for each cluster, since the training, test and external sets are comprised of drug-like compounds with heterogeneous structures.
The presented approach introduced a new concept in the assessment of structure-based drug design: the assessment of the link between the structure of the compound and BBB penetrability through a series of parameters able to show the performance of the model in terms of accuracy, sensitivity, specificity, positive predictivity, etc. This approach is relevant to situations in which classification of compounds as active/inactive is desired. The model with the highest accuracy, sensitivity and specificity should be chosen when more than one qSAR/qSPR model is available. The structure to be investigated should be similar to those on which each of the models was obtained.

Classification Model-Predictivity Approach
The ability of the identified model to classify active and inactive BBB compounds (an observed property greater than or equal to 0 identifies an active compound; otherwise the compound was considered inactive) was assessed using appropriate statistical methods. Concordance, defined as the identical classification of a compound based on the observed and the estimated/predicted property, was summarized as a percentage with an associated 95% confidence interval.
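The dichotomization rule and the concordance defined above can be sketched as follows. The function names and the logBB values are hypothetical; only the threshold (logBB greater than or equal to 0 identifies an active compound) is taken from the text.

```python
def is_active(log_bb, threshold=0.0):
    """A compound is active when its logBB is greater than or equal to the threshold."""
    return log_bb >= threshold


def concordance(observed, estimated, threshold=0.0):
    """Fraction of compounds given the same class by the observed and estimated logBB."""
    if len(observed) != len(estimated) or not observed:
        raise ValueError("observed and estimated must be non-empty and of equal length")
    agreements = sum(
        is_active(obs, threshold) == is_active(est, threshold)
        for obs, est in zip(observed, estimated)
    )
    return agreements / len(observed)


# Hypothetical logBB values, for illustration only (not taken from Table 4).
observed_log_bb = [0.35, -0.82, 0.00, -0.15, 1.02]
estimated_log_bb = [0.10, -0.40, -0.05, 0.20, 0.75]
fraction_concordant = concordance(observed_log_bb, estimated_log_bb)
```

In this example three of the five compounds receive the same class from both values, so the concordance is 0.6; the third compound shows that an observed logBB of exactly 0 counts as active.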
The performances of the classification model were assessed in training, test, and external sets. The external set is comprised of 315 different drug-like compounds classified as active (71) or inactive (244) BBB compounds. The compounds from the external set were taken from [30] (see Supplementary Material).
The parameters presented in Table 3 were used to assess the classification model. Some parameters were defined by Cooper et al. [31] while others were adapted from medical diagnosis studies [32]. The associated confidence intervals under binomial distribution assumption [33,34] were computed for each parameter [35].
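As an illustration of how such parameters can be obtained from a 2 × 2 contingency table, the sketch below computes a subset of them together with 95% confidence intervals. Several points are assumptions: the Wilson score interval is used as one common binomial-based interval (the exact method of [33,34] is not restated in this section), the odds ratio interval uses the usual log-scale (Woolf) approximation, the parameter names are inferred from the Discussion, and the counts are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wilson_interval(successes, n, z=Z_95):
    """Wilson score interval for the binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def predictivity(tp, fp, fn, tn):
    """Proportions from a 2 x 2 table, each returned as (estimate, (lower, upper))."""
    n = tp + fp + fn + tn
    counts = {
        "accuracy": (tp + tn, n),
        "error_rate": (fp + fn, n),
        "prior_probability_active": (tp + fn, n),
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "false_negative_rate": (fn, tp + fn),
        "false_positive_rate": (fp, tn + fp),
        "positive_predictivity": (tp, tp + fp),
        "negative_predictivity": (tn, tn + fn),
    }
    return {name: (k / m, wilson_interval(k, m)) for name, (k, m) in counts.items()}


def odds_ratio(tp, fp, fn, tn, z=Z_95):
    """Odds ratio (tp/fn)/(fp/tn) with a log-scale interval; all cells must be positive."""
    if min(tp, fp, fn, tn) <= 0:
        raise ValueError("all four cells must be positive")
    estimate = (tp * tn) / (fp * fn)
    se_log = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    return estimate, (estimate * math.exp(-z * se_log), estimate * math.exp(z * se_log))


# Hypothetical counts for illustration only; they are not taken from Table 1 or Table 2.
table = dict(tp=30, fp=10, fn=20, tn=40)
results = predictivity(**table)
or_estimate, or_ci = odds_ratio(**table)
```

Each entry of `results` pairs a point estimate with its interval, so two models or two sets of compounds can be compared parameter by parameter.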

Classification Model as Comparison Tool
The proposed statistical approach and the associated significance levels were also computed for the model presented in Equation 2 in order to compare it with the model introduced in the present manuscript (Equation 3).

Sub-sections 4.3. and 4.4. show how the model presented in Equation 3 was obtained. The model presented in Equation 3 was compared with previously reported models (Equation 1 and Equation 2) in terms of goodness-of-fit using Steiger's Z test [36] at a significance level of 5%.

Table 3. Parameters for the characterization of prediction.

Datasets and BBB Permeation Property
A sample of drug-like compounds with known blood-brain barrier permeation (known logBB; the blood-brain distribution is expressed as the ratio of the steady-state molar concentration of a compound in the brain to that in the blood) was identified for inclusion in the analysis [8,[37][38][39]. The quality criteria used to include a compound in the sample were as follows:
▪ reliable experimental data (compounds with different experimental values obtained by applying the same protocol were not included);
▪ compound identity (one compound was included whenever identical compounds were identified);
▪ normality of experimental data.
Two databases were used to search for the structures of the compounds: PubChem (http://pubchem.ncbi.nlm.nih.gov/, the compound ID is CID followed by a number in Table 4) and ChemSpider (http://www.chemspider.com/, the compound ID is CSID followed by a number in Table 4). HyperChem 8.0 was used to draw the compounds that were not found in the PubChem or ChemSpider databases.
The compound name, ID (CID for compounds taken from the PubChem database or CSID for compounds taken from the ChemSpider database) and the observed property expressed on a logarithmic scale are presented in Table 4. The set of compounds in Table 4 was randomly split into a training and a test set, with ~2/3 of the compounds in the training set. The randomization method was implemented so as to ensure the normal distribution of the observed property in both sets. Descriptive statistics and normality test results for the training and test sets are presented in Table 5.
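One way in which such a constrained random split could be implemented is sketched below. This is a hypothetical reconstruction and not the authors' program: it reshuffles until the Jarque-Bera statistic of both subsets falls below the 5% chi-square critical value with two degrees of freedom (5.991). The function names, the seed, the retry limit and the synthetic values are all assumptions.

```python
import random
from statistics import NormalDist

CHI2_CRITICAL_DF2_05 = 5.991  # chi-square critical value, 2 degrees of freedom, alpha = 0.05


def jarque_bera(values):
    """Jarque-Bera statistic n/6 * (S^2 + (K - 3)^2 / 4) from sample moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("values have zero variance")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


def split_with_normality(values, train_fraction=2 / 3, seed=0, max_tries=1000):
    """Random split that is retried until normality is not rejected in either subset."""
    rng = random.Random(seed)
    n_train = round(len(values) * train_fraction)
    for _ in range(max_tries):
        shuffled = list(values)
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        if (jarque_bera(train) < CHI2_CRITICAL_DF2_05
                and jarque_bera(test) < CHI2_CRITICAL_DF2_05):
            return train, test
    raise RuntimeError("no split satisfied the normality criterion")


# Synthetic logBB-like values: 122 evenly spaced normal quantiles (not the data of Table 4).
values = [NormalDist(mu=0.0, sigma=0.8).inv_cdf((i + 0.5) / 122) for i in range(122)]
train_set, test_set = split_with_normality(values)
```

With 122 values and a training fraction of 2/3, the split yields 81 training and 41 test values, matching the set sizes used here. The chi-square approximation of the Jarque-Bera statistic is rough for samples of this size, so this check should be read as a screening step rather than a definitive test.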

Molecular Descriptors Calculation
HyperChem 8.0 was used to optimize the geometry of the compounds by means of a home-made program [40]. A series of home-made programs were used to perform the following tasks: (1) transform the *.sdf and *.mol files into *.hin files; (2) identify invalid compounds; (3) optimize the geometry of compounds; (4) calculate the molecular descriptors; (5) assign the compounds to the training or test set; (6) select valid descriptors (Jarque-Bera value higher than critical value for the observed activity, identity analysis and inter-correlation analysis); (7) perform multiple linear regression.
The Molecular Descriptor Family on Vertices approach (MDFV, [41]) was used to calculate the structural descriptors. The calculation of MDFV members is based on candidate fragments obtained using cutting atoms (as vertices cut) on the matrix representation of the molecular graph. A series of home-made PHP programs were developed to compute the MDFV values. The programs run on an IntraNet network on a FreeBSD server, and the results for previously investigated datasets are available online at http://l.academicdirect.org/Chemistry/SARs/MDFV/ (password provided on request). The calculation of descriptors for new data sets of compounds could be made upon request. A total of 831 descriptors proved to be valid and were used to identify the best performing model.
The model that met the following criteria was considered the best classification MLR model [42]: highest explanation of the observed logBB (highest correlation coefficient); smallest number of MDFV descriptors; lowest standard error of estimate; highest F-value and smallest associated p-value; smallest difference between the correlation coefficient and the leave-one-out correlation coefficient, and between the F-value and associated p-value of the model and those of the leave-one-out analysis. SPSS 16.0 was used to investigate multi-collinearity of the descriptors in the MLR (multiple linear regression) model, auto-correlation and homoscedasticity.
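The leave-one-out criterion can be illustrated with a small, self-contained sketch. It is not the SPSS or PHP workflow described above: the model is fitted by ordinary least squares through the normal equations, the leave-one-out coefficient is taken as 1 − PRESS / total sum of squares (one common definition, assumed here), and the descriptor values are hypothetical and noise-free.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear descriptors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(xs, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] from the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefficients, x):
    """Value estimated by the fitted model for one descriptor vector."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], x))


def r2_leave_one_out(xs, ys):
    """1 - PRESS / SS_total, refitting the model with each compound left out in turn."""
    press = 0.0
    for i in range(len(ys)):
        coefficients = fit_mlr(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - predict(coefficients, xs[i])) ** 2
    mean = sum(ys) / len(ys)
    ss_total = sum((y - mean) ** 2 for y in ys)
    return 1 - press / ss_total


# Hypothetical descriptor values and a noise-free property y = 1 + 2*x1 - x2.
descriptors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
property_values = [1 + 2 * x1 - x2 for x1, x2 in descriptors]
r2_loo = r2_leave_one_out(descriptors, property_values)
```

Because the example property is an exact linear function of the descriptors, the leave-one-out coefficient is 1 up to rounding; with real data, the gap between the fitted and the leave-one-out determination coefficients is the quantity the selection criteria aim to keep small.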

Conclusions
The proposed predictivity approach could be used in the diagnosis of structure-based models (quantitative structure-property relationships or quantitative structure-activity relationships) and could also be seen as a tool for choosing the proper model for the assessment of new compounds. This approach is able to identify the model with the highest ability to identify active or inactive compounds. The best model could be considered the one with the highest accuracy, specificity and sensitivity, as well as the smallest false-negative and false-positive rates and the smallest probabilities of wrong classification as active or inactive compounds.
With regard to the blood-brain barrier permeation domain, the model presented in this manuscript proved to have high abilities in the correct classification of inactive compounds (~86% of inactive compounds from the external validation set of 315 compounds were correctly classified as inactive). The previously reported model proved to have high abilities in the correct classification of active compounds (~76% of active compounds from the external validation set of 92 compounds were correctly classified as active). Therefore, the model reported here should be chosen if the correct classification of inactive compounds is desired, and the previously reported model should be chosen if the correct classification of active compounds is the priority.