Prediction Models for Brain Distribution of Drugs Based on Biomimetic Chromatographic Data

The development of high-throughput approaches for the valid estimation of brain disposition is of great importance in the early drug screening of drug candidates. However, the complexity of brain tissue, which is protected by a unique vasculature formation called the blood–brain barrier (BBB), complicates the development of robust in silico models. In addition, most computational approaches focus only on brain permeability data without considering the crucial factors of plasma and tissue binding. In the present study, we combined experimental data obtained by HPLC using three biomimetic columns, i.e., immobilized artificial membranes, human serum albumin, and α1-acid glycoprotein, with molecular descriptors to model brain disposition of drugs. Kp,uu,brain, as the ratio between the unbound drug concentration in the brain interstitial fluid to the corresponding plasma concentration, brain permeability, the unbound fraction in the brain, and the brain unbound volume of distribution, was collected from literature. Given the complexity of the investigated biological processes, the extracted models displayed high statistical quality (R2 > 0.6), while in the case of the brain fraction unbound, the models showed excellent performance (R2 > 0.9). All models were thoroughly validated, and their applicability domain was estimated. Our approach highlighted the importance of phospholipid, as well as tissue and protein, binding in balance with BBB permeability in brain disposition and suggests biomimetic chromatography as a rapid and simple technique to construct models with experimental evidence for the early evaluation of CNS drug candidates.


Introduction
The brain disposition of drug candidates has been one of the major issues in the pharmaceutical industry in recent decades. The development of novel compounds targeting the central nervous system (CNS) is becoming more and more essential as a consequence of the increasing incidence of neurological and neurodegenerative disorders (e.g., depression, Parkinson disease, and Alzheimer disease). The high attrition rate and failures in the field of CNS agents are current challenges that pharmaceutical companies have to face [1,2]. From this perspective, assessment of brain distribution in the early stages of the drug development process is crucial for novel CNS drug candidates. The most important obstacle for drug delivery in the brain is the blood-brain barrier (BBB), which is formed by the endothelial cells lining the brain micro-vessels [3][4][5]. The presence of tight junctions between the endothelial cells leads to limited fenestration, making passive diffusion the dominant transport pathway inside the brain, while the paracellular path is of much lower importance. Moreover, the brain transport of certain drugs may be facilitated or restricted by the presence of membrane transport proteins in the BBB [6].
immobilization on the silica skeleton of the column. It is well-established, however, that protonated bases are more strongly retained on AGP columns, in agreement with the same binding preference of AGP in solution [47]. In light of the above considerations, HPLC-based biomimetic properties reflect the major factors which govern most biological processes, e.g., passive diffusion and non-specific binding to phospholipids, tissue, or plasma proteins. Accordingly, their combination may be suitable for the estimation of composite pharmacokinetic data. Valko et al. used the weighted sum of IAM and HSA retention to estimate the volume of distribution and tissue binding in different organs, including brain tissue binding [39].
In the present study, we investigated the performance of biomimetic properties in combination with molecular descriptors in the development of 'hybrid' models with experimental evidence for the thorough analysis of composite K p,uu,brain data, and the distinct brain disposition components, K p,brain , f u,brain , and V u,brain . The aim was to bridge the gap with pure in silico prediction models and to contribute to the perception of these crucial experimental end-points for CNS-acting drug candidates. For this purpose, in-house retention factors, determined for a number of pharmaceutical compounds on IAM, has, and AGP stationary phases, were used together with physicochemical/molecular descriptors. Lipophilicity, expressed as octanol-water partition coefficients (logP) or distribution coefficients at pH 7.4 (logD 7.4 ) and pH 5.0 (logD 5.0 ), was incorporated in the pool of descriptors, and its implementation in the models was compared to that of IAM retention. Models were constructed by applying multiple linear regression (MLR) and partial least squares (PLS) analysis. MLR can provide simple models that are easily used by medicinal chemists, while PLS, contributing complementary and supporting MLR models, may permit a deeper insight into the factors underlying brain disposition of drugs. Attention was given to model validation in respect to robustness and applicability domain to offset the drawback that rather limited datasets were analyzed.

Results and Discussion
In the present study, we collected experimental biomimetic chromatographic data, previously obtained in our laboratory, for a set of 55 pharmaceutical compounds belonging to a wide range of pharmacological classes. (For further details, see the Section 3 and Table S1.) The investigated compounds are small molecules within a molecular weight (MW) range of ca 129 to 513 Da ( Figure 1a). The dataset includes neutral, basic, and acidic compounds; however, most molecules contain one basic ionizable group (Figure 1b).
For further statistical analysis, f u,brain values were converted to the thermodynamic constant K b [55] according to the relevant equation included in Section 3. Its logarithmic form logK b was used throughout.

Data Overview with Unsupervised Data Analysis
To obtain an overview of the dataset, unsupervised principal component analysis (PCA) was performed using the pool of descriptors and the response variables as an X matrix. A 5-component PCA model was generated with R 2 = 0.674 and Q 2 = 0.462. The score plot of the first two components shows a uniform distribution of the data in all four quartiles, with one drug, candesartan, lying outside the Hoteling T 2 ellipse ( Figure 2a). As depicted by the coloring of the scores based on the logk wIAM values, IAM retention is a dominant parameter of the data distribution. The loadings plot reflects the correlations among the chromatographic data, the molecular and physicochemical descriptors, and the experimental brain distribution data in a multilayered fashion (Figure 2b). The coloring of the variables corresponds to their grouping according to hierarchical clustering based on the five principal components. LogBB and logV u,brain are located in the same quartile with chromatographic factors and lipophilicity parameters and belong to the same hierarchical clustering group. Interestingly, logK p,uu,brain is also incorporated into the same group, but its loading value in the first component is approximately zero. LogK b is located in the opposite quartile in high proximity to hydrogen bonding parameters.

Data Overview with Unsupervised Data Analysis
To obtain an overview of the dataset, unsupervised principal component analysis (PCA) was performed using the pool of descriptors and the response variables as an X matrix. A 5-component PCA model was generated with R 2 = 0.674 and Q 2 = 0.462. The score plot of the first two components shows a uniform distribution of the data in all four quartiles, with one drug, candesartan, lying outside the Hoteling T 2 ellipse ( Figure 2a). As depicted by the coloring of the scores based on the logkwIAM values, IAM retention is a dominant parameter of the data distribution. The loadings plot reflects the correlations among the chromatographic data, the molecular and physicochemical descriptors, and the experimental brain distribution data in a multilayered fashion (Figure 2b). The coloring of the variables corresponds to their grouping according to hierarchical clustering based on the five principal components. LogBB and logVu,brain are located in the same quartile with chromatographic factors and lipophilicity parameters and belong to the same hierarchical clustering group. Interestingly, logKp,uu,brain is also incorporated into the same group, but its loading value in the first component is approximately zero. LogKb is located in the opposite quartile in high proximity to hydrogen bonding parameters. experimental logK p,uu,brain values collected from the literature; (e) Correlation scatter plot between logBB and logK p,uu,brain values; (f) Box plot showing the distribution of the experimental logf u,brain values collected from the literature; (g) Box plot showing the distribution of the experimental logV u,brain values collected from the literature; (h) Correlation scatter plot between logf u,brain and logV u,brain values.

Modeling logK p,uu,brain
LogK p,uu,brain is considered to be the most relevant measure of the rate and extent of drug delivery in the brain, and as such, it was examined first. MLR analysis gave poor statistics (R < 0.5). Polarity descriptors, e.g., tPSA, display the strongest negative correlation (Figure 3a). Poor correlation was observed between logK p , uu,brain and logk wIAM , logP or logD7.4. However, a relatively better correlation was found between logD 5.0 and IAM retention at pH 5.0 ( Figure 3a).

Modeling logKp,uu,brain
LogKp,uu,brain is considered to be the most relevant measure of the rate and extent of drug delivery in the brain, and as such, it was examined first. MLR analysis gave poor statistics (R < 0.5). Polarity descriptors, e.g., tPSA, display the strongest negative correlation ( Figure 3a). Poor correlation was observed between logKp,uu,brain and logkwIAM, logP or logD7.4. However, a relatively better correlation was found between logD5.0 and IAM retention at pH 5.0 ( Figure 3a).  The above findings were further supported by Partial Least Squares analysis. PLS analysis offers a number of advantages, tolerating intercorrelated variables and missing values. Variables are treated simultaneously to extract the principal components as their linear combinations. Using the entire pool of descriptors and performing variable selection according to the variable importance to projection (VIP) criterion, no satisfactory PLS models could be obtained, with R 2 and Q 2 not exceeding 0.52 and 0.32, respectively. In  The above findings were further supported by Partial Least Squares analysis. PLS analysis offers a number of advantages, tolerating intercorrelated variables and missing values. Variables are treated simultaneously to extract the principal components as their linear combinations. Using the entire pool of descriptors and performing variable selection according to the variable importance to projection (VIP) criterion, no satisfactory PLS models could be obtained, with R 2 and Q 2 not exceeding 0.52 and 0.32, respectively. In agreement with MLR, polarity terms (TPSA) show the strongest impact with VIP values > 1 and negative contribution. Biomimetic properties and lipophilicity have positive contributions, albeit with considerably lower impacts. In agreement with MLR results, logD 5.0 and IAM retention at pH 5.0 show higher influence. In fact, VIP values decrease in the order of: logD 5.0 > logk wIAM,5.0 > logD 7.4 > logk 10,HSA > logk wAGP > logP > logk wIAM,7.4 ( Table S3). The better performance of IAM retention and lipophilicity at pH 5.0 may be explained by the recently reported evidence of the interference of the endo-lysosomal system (lysosomes' pH 5.0-4.5) in vitro blood-brain barrier models [56].
Inspection of the plots of observed vs. predicted logK p,uu,brain and of DModY plots (distance of observations to the Y model) revealed two strong outliers, quinidine and theophylline, present in any PLS model (figure not shown). Upon exclusion of these drugs from statistical analysis, acceptable PLS models were obtained. Using biomimetic properties and lipophilicity in combination with computational descriptors, a 1-component PLS model In Figure 3b,c the plot of observed vs. predicted logK p , uu,brain values generated by PLS Model 4 and the coefficients of the original variables are depicted, respectively. The predicted logK p , uu,brain values are presented in Table S4. The applicability domain was defined using DModY plot, and the doubled average DModY value, equal to 0.822 × 2 = 1.644, was used as a critical value. Two drugs, acetaminophen and midazolam, exceed the critical value ( Figure S1).
PLS Model 4 was validated by permutation tests ( Figure S2) and by dividing the dataset into a training set and a test set, using 4 different training/test sets (see Section 3). The external validation demonstrated robust statistics except for 1 iteration, in which the drug acyclovir was included in the test set (Table 1). In Figure S3, a representative plot of observed vs. predicted values is presented. In fact, the construction of PLS Model 4 indicates that biomimetic properties and lipophilicity are less crucial in logK p , uu,brain modeling, as shown also by the failure to generate successful MLR models. Previous studies have likewise identified the lack of strong correlations between logK p,uu,brain and physicochemical descriptors, especially lipophilicity, while hydrogen bonding has been found to be the strongest contributor [11,20]. Loryan, et al. reported a PLS model with two descriptors related to polarity (tPSA) and hydrogen bonding capacity [14].
Considering the complex nature of K p,uu,brain as the outcome of opposing brain disposition end-points and plasma binding [11] (see relevant equations in Section 3), modeling K p,brain , expressed as logBB, f u,brain , and V u,brain , can indirectly assist in the exploration of logK p,uu,brain . In the next sections, the generation and validation of the MLR and PLS models of these distinct brain disposition end-points are discussed.
External validation of Equations (1) and (2) was performed by dividing the dataset into a training set and a test set. This procedure was repeated up to six times for each equation. Practically speaking, the same equations were generated with comparable statistical data ( Table 2). In all cases, observed vs. predicted values showed a 1:1 correlation. In Figure S5, a representative plot of observed vs. predicted values from Equations (1) and (2) is illustrated.

Partial Least Squares Models
Initially, PLS analysis was performed considering both biomimetic properties and lipophilicity in the pool of descriptors, so as to rank their impact in modeling logBB. Polarity descriptors (TPSA and dipole moment) were found to have the highest VIP values (VIP = 1.2). AGP retention showed equally high impact, followed by logD 7.4 (VIP = 1.05), and IAM retention (VIP = 0.95). HSA retention and logP had a lower influence in the model with VIP < 0.9.
Keeping AGP retention as the most influential parameter and including either logk wIAM (PLS Model 5, Figure 4a The applicability domain was defined by the critical value DModY being 0.767 × 2 = 1.534 for Model 5 and 0.789 × 2 = 1.58 for Model 6 ( Figure S6). A total of 4 drugs, i.e., atenolol, candesartan, haloperidol, and maprotiline, were outside the AD of PLS Model 5. The same drugs, plus indomethacin, were outside the AD of PLS Model 6 ( Figure S6).
Considering the large percentage of missing logk wAGP values, and in order to confirm its essential contribution, PLS analysis was repeated including only drugs with available logk wAGP data. The generated models proved the crucial contribution of AGP retention ( Figure S7).
The contribution of logk wAGP in logBB was further confirmed in MLR models for the drugs with available data: The applicability domain was defined by the critical value DModY being 0.767 × 2 = 1.534 for Model 5 and 0.789 × 2 = 1.58 for Model 6 ( Figure S6). A total of 4 drugs, i.e., atenolol, candesartan, haloperidol, and maprotiline, were outside the AD of PLS Model 5. The same drugs, plus indomethacin, were outside the AD of PLS Model 6 ( Figure S6). Considering the large percentage of missing logkwAGP values, and in order to confirm its essential contribution, PLS analysis was repeated including only drugs with available logkwAGP data. The generated models proved the crucial contribution of AGP retention ( Figure S7).
The contribution of logkwAGP in logBB was further confirmed in MLR models for the drugs with available data: The high impact of AGP retention may be related to the presence of a strong negative charge in AGP stationary phases due to their high content in sialic acid. Brain cell membranes are also the most anionic and have their lipids mostly exposed, thus explaining the reason that lipophilic cationic compounds are more prone to cross the BBB.
PLS Models 5 and 6 were validated with permutation tests ( Figure S8) and by external validation upon dividing the dataset into 5 different training and test sets. Practically speaking, the same models were generated by an external validation procedure, with comparable statistical data (Table 2), with the exception of iteration, including quinidine (PLS Model 5) and morphine (PLS Model 6). In all cases, observed vs. predicted values showed a 1:1 correlation. In Figure S9, a representative plot of observed vs. predicted values is illustrated. The predicted logBB values from the MLR and PLS models are provided in Table S5.
The positive sign of the coefficient of NO indicated that a higher hydrogen bond acceptor potential is not favorable for unspecific hydrophobic tissue and plasma protein binding, increasing the amount of the unbound fraction. It should be mentioned that logk w,IAM and logk 10ACN,HSA showed a considerable degree of intercorrelation (r = 0.824). To overcome the collinearity problem, and considering the low differentiation of regression coefficients of logk w,IAM and logk 10ACN,HSA in Equation (5) All drugs were within the AD ( Figure S10). Satisfactory correlation was also obtained with AGP retention factors for 20 drugs with available data: Evidently, all of the regression coefficients in Equation (11) had negative signs, except NO, which contributed positively. Three drugs were outside the AD, as shown in Figure S10.
External validation of Equations (7) and (11) was performed by dividing the dataset into the training set and the test set, using 5 different test sets (Table 3). In all training sets, observed vs. predicted values showed a 1:1 correlation. In Figure S11, a representative plot of observed vs. predicted values from Equations (7) and (11) is illustrated.  Biomimetic properties and lipophilicity were the most crucial parameters (VIP> 1), with decreasing importance following the order: logk AGP > logk wIAM > logP> logk 10HSA > logD 7.4 ( Figure S12). In accordance with the MLR equations, the three biomimetic properties and lipophilicity had negative contributions, while the hydrogen bonding acceptor descriptor NO had a positive effect. More to the point, logP showed a higher impact in respect to logD 7.4 as a better lipophilicity expression for hydrophobic binding. Three bulk descriptors, CMR, arC6, and nPSA, included in the model had a negative sign ( Figure S12 As illustrated in Figure 5, the same signs were kept in PLS Model 8 for biomimetic properties and NO, while a balance was observed between the non-polar descriptors, with nPSA having a positive sign.

Partial Least Squares Models
According to the DModY criterion (critical value = 0.812 × 2 = 1.624), three drugs, i.e., fluoxetine, ranitidine, and theophylline were outside of the AD of the model ( Figure S13 As illustrated in Figure 5, the same signs were kept in PLS Model 8 for biomimetic properties and NO, while a balance was observed between the non-polar descriptors, with nPSA having a positive sign.  Five drugs-acetaminophen, atenolol, neostigmine, propranolol, and theophylline-are the beyond AD, with DModY being higher than the critical value (2 × 0.768) ( Figure S13).
The logK b PLS Models 8 and 9 were validated with permutation tests ( Figure S14), and upon dividing the dataset into 5 different training and test sets. Robust statistical data were obtained in all cases (Table 3). Representative plots of observed vs. predicted logK b values from external validation are presented in Figure S15. The predicted logK b values from the MLR and PLS models are presented in Table S6.
Back calculation of f u,brain values using the predicted logK b were successfully correlated with the corresponding experimental values, approximating a 1:1 correlation (Table S7).

Correlation between fraction unbound and unbound volume of distribution in the brain
The negative correlation of logV u,brain with logf u,brain, shown also in Figure 1h The high quality of Equation (13) permits the safe prediction of logV u,brain from logf u,brain . We further attempted to construct models for the unbound volume of distribution using biomimetic properties, lipophilicity, and computational descriptors although, in this case, the dataset was limited, including only 17 drugs with available experimental data.

Multiple Linear Regression
Direct correlation of logV u,brain with logk w,IAM led to Equation (14) All compounds were within the AD of Equation (17) ( Figure S16). The combination of logP with the fraction protonated at pH7.4, both with a positive contribution, has previously been reported for a PLS model of the apparent volume of distribution [58].
Owing to the reduced dataset, test sets for the external validation of Equations (15) and (17) contained 4 to 6 compounds. Robust statistical data were obtained for validated equations, except when metformin was included in the test set, for both Equations (15) and (17) ( Table 4). In Figure S17, a representative plot of observed vs. predicted values from Equations (15) and (17) is illustrated:  According to the DModY criterion (critical value: 0.802 × 2 = 1.604), 2 drugs (metformin and propranolol) were outside the AD of the model ( Figure S18).  In agreement with the MLR models, IAM and AGP retention were the most influential variables with a positive contribution, while HSA retention factors were not included in the final model. Most polar descriptors had a negative effect, and the opposite was true for non-polar descriptors.
The use of logP in place of biomimetic properties led to a three-component model with satisfactory statistics although they had inferior cross-validation results (Figure 6c,d) According to the DModY criterion (critical value 1.524), 2 drugs (pindolol and propranolol) were outside the AD of the model ( Figure S18).
The lower Q 2 and higher RMSEEcv indicate inferior predictability, reflected also in the higher intercepts of the corresponding permutation tests ( Figure S19). The PLS models were also validated by external validation, dividing the dataset into two training and test sets. In Table 4, the statistical data of the new models are given. In Figure S20, representative plots of observed vs. predicted logV u,brain are provided. The predicted logV u,brain values from the MLR and PLS models are presented in Table S8.
Considering the potential of Equation (13) for safe predictions of logV u,brain , we used Equation (13) in order to extend the dataset. Drugs with predicted logV u,brain were used as a blind test set. As illustrated in Figure S21, the blind test set was well accommodated in the model with the exception of two drugs, clomipramine and fluoxetine, which were strong outliers. Excluding clomipramine and fluoxetine, the combination of the blind test set with the training set led to PLS Model 12, which was practically the same as PLS Model 10:  Figure S22).

Dataset and Chromatographic Data
A dataset of 55 pharmaceutical compounds, belonging to a wide range of pharmacological classes, was used in the present investigation. The dataset exhibited adequate structural diversity, consisting of acidic, neutral, basic, and zwitterionic molecules. The compounds had been previously studied in our laboratory with respect to their retention profiles in 3 biomimetic chromatographic columns, namely an IAM.PC.DD2 (Regis Technology, Morton Grove, IL, USA), a ChromTech CHIRAL-HSA column (50 mm × 4 mm i.d.), and a ChromTech CHIRAL-AGP column (50 × 4 mm i.d.), under experimental conditions as described in the corresponding references [23,24,27,29,45].
For IAM chromatography, retention factors logk wIAM corresponding to a 100% aqueous mobile phase were used. PBS was used as a buffer at two pH values, i.e., the physiological pH 7.4 (data labeled as logk wIAM ) and pH 5.0 (data labeled as logk wIAM5 . 0 ), with the latter being associated with intestinal absorption and lysosomal trapping [59]. In the case of HSA chromatography, isocratic logk values, measured in the presence of 10% acetonitrile (ACN), were used (logk 10,HSA ), as they showed a highly significant 1:1 correlation with plasma protein binding data in our previous study [45]. Logk 10,HSA data were available for 37 compounds in the dataset. For the remaining compounds, logk 10ACN,HSA values were calculated based on the highly significant equation reported in the same study [45]. Logk wAGP retention factors corresponding to a 100% aqueous mobile phase were available for 28 drugs. Thus, they could be used in the MLR models only for restricted datasets. However, their performance could be evaluated in PLS models since this type of analysis tolerates missing values. All chromatographic indices are included in Table S1.

Brain Disposition Data
Experimental K p,uu,brain values measured in rat brain tissue were collected for 22 compounds [10][11][12]60,61] and converted to the logarithmic form. Experimentally determined K p,brain values for 42 compounds were obtained from the literature and were converted to logBB. They preferably referred to rat studies in order to achieve homogeneity in the data [19,31,32]. Experimental f u,brain values were available from the literature for 39 compounds in the dataset [10,[50][51][52][53]62]. When more than one value was available, the average was calculated. Most values were determined on rats. Since there are only small inter-species differences, values from species other than rats were included (seven cases) for compounds where rat values were not provided. The f u,brain values were converted to the thermodynamic constant K b according to the following equation [55]: Kb = fu, brain 1.001 − fu, brain The denominator was set to 1.001-f u,brain in the case of compounds exhibiting an f u,brain value equal to unity. The decimal logarithm of K b , logK b , was used for the development of the models. V u,brain values, determined using the brain slice method, were collected from the literature for 17 compounds in the dataset and converted to the corresponding logarithm, logV u,brain [10,12].
K p,brain , K p,uu,brain , f u,brain , and V u,brain values are presented in Table S2. They span a sufficient range and are evenly distributed, including drugs with different brain distribution profiles. K p,uu,brain can be derived from K p,brain by combining the unbound fractions of the drug in the plasma (f u,p ) and in the brain (f u,brain ) or the unbound brain volume of distribution, V u,brain [7]: Kp, uu, brain ∼ Kp, brain Vu, brain × fu, plasma V u,brain shows an inverse relation to Vu,brain shows an inverse relation to fu,brain approximating under circumstances [63]. fu,brain∼[1/Vu,brain]range 0-1.

Physicochemical and Molecular Descriptors
A pool of 121 descriptors, including 1D, 2D and 3D descriptors, was calculated using appropriate software. The software ADME Boxes 3.0 was used to calculate the topological polar surface area (TPSA), hydrogen bond acceptor and donor sites (HBA and HBD), the rotatable bonds (RB), the number of ionizable groups, and the fraction of molecular species at pH 7.4 (with the fraction of negatively charged species being F − 7.4 and that of positively charged species being F + 7.4 ). The Abraham's solvation parameters (hydrogen bond acidity (A), hydrogen bond basicity (B), excess molar refraction (E), McGowan's volume (V), and dipolarity/polarizability (S)) were calculated with the module ABSOLV implemented using the same software. Topological indices and electrotopological state indices [64] were computed with Molconn-Z (v4.12 eduSoft, LC, La Jolla, CA 92037 USA) software Hyperchem v.5.0/Chemplus v.1.6 software (HYPERCUBE Inc., Waterloo, ON, Canada) was used for the calculation of 3D descriptors. Molecular size descriptors, energy parameters, and dipole moment were calculated using the lowest energetic conformation. Descriptors based on essential structural characteristics, such as the total number of rings (nRings), the number of phenyl rings, the number of heteroatoms inside the rings (Nr, Or, Sr), the number of heteroatoms outside of the rings (Nnr, Onr, and Snr), or the sum of them (N, O, and S), the number of different halogens (Cl, F, I, and Br) and the sum of them (halogen), and the number of double bonds between carbon atoms were derived manually.

Statistical Analysis
Multiple linear regression analysis (MLR) was performed using SPSS v.22.0 software. Variable selection was performed by applying the stepwise algorithm, first in groups of descriptors, classified according to their physicochemical content, in order to exclude the less-relevant variables for the final selection. Variables with zero or very small variance were excluded, as were collinear variables, considering a correlation coefficient lower than 0.8. The models were evaluated by considering the values of R, R 2 , s (standard deviation), and F-test. For the significance of each individual variable, a t-test value of |t|≥ 2 was considered. The applicability domain (AD) of the regression equations was defined by calculating the critical leverage h* according to the formula h* = 3 (p + 1)/n, where p is the number of parameters, and n is the number of compounds [65].
Models were validated by dividing the dataset into different training and test sets of 5 to 9 drugs, depending on the overall sample size. The test sets were randomly selected, taking however into account to cover all four quartiles of the PCA scores plot. Subsequently, each model derived by the training set was used to predict the response variable of the corresponding test set. The R, R 2 , and s for both the training and test sets were considered and compared to the original model. In addition, models were evaluated by the proximity of the relation of observed vs. predicted values to a 1:1 correlation, reflected by a slope close to 1 and an intercept close to 0.
Multivariate data analysis was performed using Simca-P 14.1 (Sartorius Stedim Biotech, Umeå, Sweden,). Prior to analysis, data were scaled to unit variance. Principal component analysis (PCA) was performed, considering all columns of the table as X variables. PCA is a projection method resulting in dimensional reduction. The principal components, derived through projection, represent the new (latent) variables that summarize the information included in the initial set of descriptors. A PCA scores plot provides a useful data overview, which was also considered for the test set selection. Partial least squares analysis (PLS), a regression extension of PCA, was applied in order to construct prediction models. The variable selection was based on the values of the variable's influence to projection (VIP), the weight (w) in the loadings plot, and the size of coefficients. From this perspective, variables with variables with VIP < 0.8, low weight in the loading plot or low coefficient. Moreover, between descriptors encoding the same information, those which performed better were chosen. It should be mentioned, however, that PLS, as a projection method, can tolerate intercorrelated variables, as compared to multiple linear regression. The predictive ability and robustness of the models was evaluated using cross-validation as an internal validation according to the seven-fold option of Simca-P. The sum of the squared differences between the measured response and the predicted value of the omitted data, defined as predicting residuals sum of squares (PRESS), is used to calculate the cross-validated correlation coefficient, Q 2 : Q 2 = 1 − PRESS/SSY, with SSY representing the variation in Y after mean centering and scaling. Permutation tests (100 permutations) were applied by randomly re-ordering the response variables, and the newly derived R 2 and Q 2 were plotted against the degree of correlation between the permuted and the original data. The models were validated by external test sets, as already described for multiple linear regression. The root mean square error of prediction (RMSEP) was considered as an index of the predictability of the models. The applicability domain was defined using the double of the average value of the distance of observation to model Y (DModY) as the critical value.

Conclusions
In the present study, we combined bio-chromatographic retention factors, determined on IAM, has, and AGP stationary phases, with computed descriptors to develop MLR/PLS models for the rapid estimation of drug brain disposition. Our aim was to suggest a novel 'hybrid' modelling approach which combines high-throughput experimental accuracy with theoretical descriptors. At the same time, the well-investigated information content of the individual biomimetic properties [24,29,44,45] permitted a deeper insight into the underly-ing mechanisms of biological processes. In this sense, diverse aspects of drug disposition in the CNS were explored, including the modeling of experimental logK p,uu,brain , logBB , f u,brain , and V u,brain data, upon the application of two statistical techniques, considered as contributing in a complementary way. MLR led to simple models that are easy to use by the medicinal chemist, whereas PLS served to strengthen the MLR models and to further scrutinize the biological issues inherent in brain disposition measurements.
LogK p,uu,brain , as the outcome of contradictory factors related to permeability and tissue and plasma binding could not be efficiently modeled by biomimetic properties and lipophilicity, while computational descriptors alone were sufficient for model construction. Yet, PLS analysis revealed the greater effect of IAM, HSA, and AGP retention factors, as well as logD, if measured at pH 5.0, supporting the potential interference of the endolysosomic system in vitro brain penetration models. In regard to logBB, which measures permeability to CNS, traditional lipophilicity and IAM retention showed equal performance with positive contributions. However, in the cases of both logK b and logV ubrain , which also depend on binding, IAM retention performed better than lipophilicity. More to the point, a lesser number of total descriptors was required in the corresponding PLS models, reflecting the higher information content of IAM retention. HSA retention was found to be important in logK b modeling, while AGP retention influenced all brain disposition data, confirming the role of basicity as an essential CNS-drug-like characteristic. Polarity and hydrogen bond descriptors proved crucial in all models, with an opposite effect in regard to biomimetic properties and lipophilicity.
In view of the above findings, biomimetic properties are well-justified in modeling distinct brain disposition data as they reflect the major factors which govern biological processes, e.g., passive diffusion and binding. More to the point, IAM retention performs equally well or better than traditional lipophilicity. Hence, biomimetic chromatography can be suggested as a rapid and simple tool for the early evaluation of CNS drug candidates, permitting the construction of evidence-based 'hybrid' models and bridging the gap with in silico modeling, built solely on theoretical descriptors.
The 'hybrid 'complementary models constructed in this study can be further applied to, and validated with, a wider range of pharmaceutical compounds. Combined with relevant plasma protein binding models also based on biomimetic properties [45], they can serve as a sound basis for exploring the composite brain disposition end-points.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/molecules27123668/s1: Figure S1: DmodY bar plot for PLS Model 4 of logKp,uu,brain, including computational descriptors; Figure S2: Permutation test (based on 100 permutations) for PLS Model 4 of logKp,uu,brain; Figure S3: Representative plot of observed vs. predicted logKp,uu,brain values from the external validation of PLS Model 4; Figure S4: Williams plots of standardized residuals vs. the leverage; Figure S5: Representative plot of observed vs. predicted logBB values from external validation of (a) Equation (1) Figure S7: PLS models of logBB prediction including only the compounds with experimental logkwAGP values; Figure S8: Permutation tests for the PLS models of logBB prediction; Figure S9: Representative plot of observed vs. predicted logBB values from the external validation of the PLS model; Figure S10: Williams plots of standardized residuals vs. the leverage; Figure S11: Representative plot of observed vs. predicted logKb values from external validation of (a) Equation (7) Figure S14: Permutation tests for the PLS models of logKb prediction; Figure S15: Representative plot of observed vs. predicted logKb values from the external validation of the PLS model; Figure S16: Williams plots of standardized residuals vs. the leverage; Figure S17: Representative plot of observed vs. predicted logVu,brain values from the external validation of (a) Equation (15) and (b) Equation (17); Figure S18: (a) DmodY bar plot for PLS Model 10 of logVu,brain prediction, including IAM retention factors; (b) DmodY bar plot for PLS Model 11 of logVu,brain prediction, including logP; Figure S19: Permutation tests for the PLS models of logVu,brain prediction; Figure S20: Representative plot of observed vs. predicted logVu,brain values from the external validation of the PLS model; Figure S21: Plot of observed vs. predicted logVu,brain values from PLS Model 12; Figure S22: DmodY bar plot for PLS Model 12 of logVu,brain. Table S1: Compounds included in the dataset: pharmaceutical classification, chromatographic data, and lipophilicity parameters; Table S2: Brain disposition data, i.e., Kp,brain, Kp,uu,brain, fu,brain, and Vu,brain for the investigated compounds; Table S3: Variable Importance to Projection (VIP) values in PLS Model 4 for logKp,uu,brain on the basis of biomimetic properties, lipophilicity, and computational descriptors; Table S4: Observed and predicted logKp,uu,brain values by PLS Model 4, based on computational descriptors; Table S5: Experimental and predicted logBB values by MLR and PLS models; Table S6: Experimental and predicted logKb values by MLR and PLS models; Table S7: Observed vs. predicted fu,brain correlation: fu,brain = a × fu,brain,pred + b; Table S8: Experimental and predicted logVu,brain values by MLR and PLS models.