Predicting Value of Binding Constants of Organic Ligands to Beta-Cyclodextrin: Application of MARSplines and Descriptors Encoded in SMILES String

: The quantitative structure–activity relationship (QSPR) model was formulated to quantify values of the binding constant (lnK) of a series of ligands to beta–cyclodextrin ( β -CD). For this purpose, the multivariate adaptive regression splines (MARSplines) methodology was adopted with molecular descriptors derived from the simpliﬁed molecular input line entry speciﬁcation (SMILES) strings. This approach allows discovery of regression equations consisting of new non-linear components (basis functions) being combinations of molecular descriptors. The model was subjected to the standard internal and external validation procedures, which indicated its high predictive power. The appearance of polarity-related descriptors, such as XlogP, conﬁrms the hydrophobic nature of the cyclodextrin cavity. The model can be used for predicting the a ﬃ nity of new ligands to β -CD. However, a non-standard application was also proposed for classiﬁcation into Biopharmaceutical Classiﬁcation System (BCS) drug types. It was found that a single parameter, which is the estimated value of lnK, is su ﬃ cient to distinguish highly permeable drugs (BCS class I and II) from low permeable ones (BCS class II and IV). In general, it was found that drugs of the former group exhibit higher a ﬃ nity to β -CD then the latter group (class III and IV).


Model Development Using MARSplines
In this work, a MARSplines [40] methodology was applied as implemented in STATISTICA 12 [53]. This methodology leads to the following general regression formula, where F i is the regression factor and a i are the regression parameters: The left-hand side of this equation stand for the response variable, which is confronted with experimental values. Here it is defined by the value of the natural logarithm of a binding constant quantifying affinity of a ligand toward β-CD determined experimentally. The MARSplines method is the procedure designated for finding the analytical formula based on descriptors and so-called knots. Such relationships are termed basis functions. The values represent splitting of the set of values into sub-regions treated with alternative mathematical formula. The number of basis functions and factors in the model is controlled at an arbitrary level for balancing between accuracy and complexity of the model. To avoid model overfitting, the final model undergoes inspection of the regression coefficients by removing such factors for which statistical significance is not reached (p > 0.05). Additionally, the contribution to the model of each factor is inferred from the values of standardized regression coefficients (β i ). Only such factors are included in the final model for which β i > 0.09. Furthermore, the model was refined, internally validated and characterized in terms of fitting criteria using QSARINS software [54][55][56]. As a result of this procedure, the model was simplified by selecting the most important variables using a genetic algorithm (GA).
The simplest factor generated by the MARSplines procedure has a form identical to the classical QSPR approach and is expressed simply as multiplication of descriptor values by a coefficient, whose value is optimized for maximizing correlations between computed and estimated response values. The main improvement, however, comes from accounting for non-linearity by direct inclusion of more complex basis functions combined into factors. Hence, an advantage of the QSPR model formulation using the MARSplines procedure is the benefit of formally being in the multiple linear regression (MLR) format by including non-linear properties of considered datasets. Hence, the golden standard QSPR model development and validation procedures can be directly applied [41,42,57].

Results and Discussion
There are two main reasons that justify the efforts of the obtained model building. The first is obviously of substantive nature for deriving a model that is as accurate as possible and characterized by a low cost of applications. Hence, the screening of new potential ligands, or comparing a leading compound of an API and derivatives suggested by a drug design procedure, represent the immediate value of the obtained model. There is also a methodological reason for exploring the landscape of potential application in the chemistry domain of the MARSplines procedure. This is not explored deeply enough bearing in mind its high potential, effectiveness and ease of use.

Findings
Based on the MARSplines algorithm, the following descriptors were included in the model (Table 1): XLogP and Wlambda2.unity (source: BlueDesc); carbonTypes.8 (source: CDK), MLFER_A, AATS6m, AATS4i and PNSA-3 (source: PADEL); the tensor product of PEOEVSA9 descriptor vectors denoted as PEOEVSA9*PEOEVSA9; and, the vector sum of Chiv1 parameter denoted as Chiv1plusChiv1 (source: BioTriangle). Taking into account the relatively large training set population (n = 187), the number of variables seems to be reasonable fulfilling the general rules of acceptable QSPR model complexity, as documented in Table 1 and Figure 1. Internal validation, fitting criteria and external validation parameters, including R 2 (determination coefficient), R adj 2 (adjustment determination coefficient), F (Fisher ratio), SD (standard deviation), MAE (mean absolute error), MAPE (mean absolute percentage error), RMSE (root-mean-square error), PRESS (predicted residual error sum of squares) and K xx (descriptors' global correlation measure) [58,59], suggest that the model is well fitted to the training set and, most importantly, the external test set examples were well predicted. The results of external validation are presented in Figure 1. As one can see, the proposed model is characterized by high determination coefficients. Interestingly, R 2 , MAE and MAPE values are even slightly better for the external test set (0.936, 0.44, 9.3%, respectively) than for the training set (0.907, 0.49, 15.4%, respectively). This suggests that the model complexity is optimal. It is worth mentioning that the over-fitting problem should be taken into account when analyzing the quality of QSPR models, especially those that are non-linear. In the case of overly complex models, the training set data are exceptionally well fitted, but the test set prediction quality is far inferior. It is worth mentioning that the MARSplines protocol implemented in the STATISTICA software prevents overfitting by taking advantage from of the generalized cross validation (GCV) algorithm, which reduces the model to be as simple as possible.  Some of the parameters used in the model, such as like XLogP and MLFER_A, are quite intuitive and their physical meaning can be easily explained. The appearance of the hydrophilicity measure, namely the group contribution logP parameter (XlogP), confirms the role of the hydrophobic nature of the cyclodextrin cavity, while MLFER_A is the Abraham solubility parameter expressing the acidity. The role of polarity in β-CD molecular complexes formation was emphasized by PNSA-3 (charged partial surface area index [60]) and BioTriangle interaction descriptor PEOEVSA9*PEOEVSA9. This latter feature was calculated based on the MOE-type parameter involving the contributions of surface area and partial charge [61]. Another feature calculated using the BioTriangle platform, namely Chiv1plusChiv1, is associated with the Chiv1 descriptor belonging to the atomic valence connectivity indices class [62,63]. Of note, these descriptors were widely used in solving quite similar QSAR problems associated with target-ligand binding [64][65][66][67][68]. In the MARSplines model, there was also one topological descriptor characterizing carbon type (carbonTypes.8) [69] and the appearance of two autocorrelation indices, AATS6m and AATS4i [69]. Autocorrelation descriptors are probably one of the most extensively used quantitative structure-activity relationship/quantitative structure property relationship (QSAR/QSPR) descriptors Although the physical meaning of these parameters is not straightforward, our previous studies showed that this broad class of descriptors was found to be useful in the modelling of the affinity of compounds in the solid state [29,57].
the BioTriangle platform, namely Chiv1plusChiv1, is associated with the Chiv1 descriptor belonging to the atomic valence connectivity indices class [62,63]. Of note, these descriptors were widely used in solving quite similar QSAR problems associated with target-ligand binding [64][65][66][67][68]. In the MARSplines model, there was also one topological descriptor characterizing carbon type (carbonTypes.8) [69] and the appearance of two autocorrelation indices, AATS6m and AATS4i [69]. Autocorrelation descriptors are probably one of the most extensively used quantitative structureactivity relationship/quantitative structure property relationship (QSAR/QSPR) descriptors Although the physical meaning of these parameters is not straightforward, our previous studies showed that this broad class of descriptors was found to be useful in the modelling of the affinity of compounds in the solid state [29,57].

Comparison to Existing Models
Comparison of the determination coefficient calculated for the obtained model with two regression models reported in recent years is presented in Table 2. Although these models were generated using similar datasets, it should be taken into account that depending on the validation procedure, different results can be obtained. Nevertheless, the correlation coefficients are lower or approximately equal to the MARSplines model. This suggests that the proposed approach is a good alternative for β-CD calculation. The major advantage of calculating molecular descriptors from the SMILES code is low computational cost. However, the proposed QSPR model has some limitations associated with ignoring the geometrical features of molecular complexes, such as conformation and solvation effects. These effects can be included using optimized 3D structures. Furthermore, it should be taken into account that in some cases, the stoichiometry of β-CD complexes is not 1:1 [70][71][72]. In such cases, molecular modelling methods such as molecular-dynamics docking or quantum-chemical binding constant calculations are more appropriate than the proposed approach.

Exemplary Model Applications
The obvious application of the model provided by Equation (1) and Table 1 relates to its predictive power. Hence, it is possible to anticipate, before actual measurement, the probable affinity of the considered API toward β-CD. There is, of course, a limitation due to the applicability domain. For example, there are no organic and metalo-organic salts in the model. Hence, it is very unlikely that the model helps in situations where drugs are prepared in such forms. However, many drugs are, in principle, treatable by the model and at least the rational selection of the candidates for experimental measurements can be advised.
It is also possible to suggest alternative, less obvious applications of the formulated MARSplines model. For example, in the Biopharmaceutical Classification System (BCS) it is assumed that two measures such as solubility and permeability can be used for grouping drugs in respect of their bioavailability. In Table 3 this classification is shown [73]. Of note, cyclodextrins and their solubilizing abilities have been discussed in the context of BCS classification [74,75]. Table 3. Biopharmaceutical Classification System (BCS) [73] using solubility and permeability as qualitative criterions.

High Solubility Low Solubility
High permeability Class I This class comprise compounds characterized by good absorption profiles.

Class II
The bioavailability is directly related to the dissolution behavior.

Low permeability
Class III The active pharmaceutical ingredient (API) is soluble, however absorption profile is dependent on limited permeation behavior.
Class IV The API is characterized by very low bioavailability.
Hence, for proper bioavailability assessment, both water solubility and permeability must be known. It is interesting to see if there is any correlation between the BCS class of a given drug and its estimated affinity toward β-CD. For this purpose, information about the BCS classification was collected for 300+ drugs [76]. Those that are found to be outside of the applicability domain were excluded from the analysis. For the remaining drugs, the values of the molecular descriptors were collected. This, in turn, allowed for application of the MARSplines model and prediction of lnK values. The obtained results are presented in Figure 2 and Table S2. As can be seen from Figure 2, those APIs exhibiting good permeability (Class I and II) are characterized by higher affinity to cyclodextrin. This is understandable since the cyclodextrin cavity is rather hydrophobic, like for lipid biological barriers. The most important message coming from Figure 2 is that Class I and II have very similar distributions to each other and, at the same time, are distinct from Class III and IV. Indeed, application of a statistical non-parametrical test revealed that the medians are statistically the same (p = 0.27) for Class I and II but either combination with remaining classes reached statistical significance (p < 0.001). Similarly, the analysis of Classes III and IV versus the other two classes consistently confirms that low permeability can be distinguished from high values by predicted drug affinity to β-CD. In order to turn this qualitative conclusion into a practically useful formula, a second MARSplines model was formulated. However, the target of the modeling this time was the classification into low and high permeability cases. Hence, only one quantitative parameter was used for classification model formulation, namely, computed values of lnK. As a dependent value, the binary flag for permeability was declared. The obtained formulae are provided below:

Conclusions
Compounds exhibiting high symmetry, such as fullerenes or nanotubes, have been used in various branches of medicine and pharmacy, including drug delivery [77][78][79]. This also applies to cyclodextrins, which due to their specific shape features, have been widely used to increase API solubility. In this work, the QSPR model of the binding constant of different compounds to β-CD was developed based on the MARSplines methodology, and molecular descriptors were derived from the SMILES code. The internal and external validation indicated good accuracy of the model. The appearance of polarity-related descriptors, such as XlogP, indicated the hydrophobic nature of the cyclodextrin cavity, which is consistent with the nature of cyclodextrins. It is well known that the hydrophilicity/hydrophobicity of a drug can be used for evaluation of the drugs' permeability. Therefore, the model was used for predicting affinity to β-CD of exemplary compounds belonging to different BCS classes. As was established, APIs exhibiting high permeability (I and II BCS Class) are generally characterized by higher lnK values than compounds revealing low permeability (class III and IV). This shows that β-CD complexation seems to offer an alternative for complex and expensive experimental permeability modeling studies.  If the value of the first equation for dependent variable ClassA is higher compared to the value provided by the second equation, then high permeability is predicted. This means that the analyzed drug belongs to Class I or II of the BCS. On the contrary situation, when ClassA < ClassB, then low permeability is predicted by the model and, consequently, the given drug should belong to Class III or IV of the BCS. It is interesting to note that such a simple model has quite an acceptable predictive power. Proper qualification of high permeability occurred in 88% of cases, with only 12% of misclassified drugs. The low permeability was classified with slightly lower precision of 73%, with 27% of failure. These observations indicate the potential applicability of binding constants for evaluating permeability.

Conclusions
Compounds exhibiting high symmetry, such as fullerenes or nanotubes, have been used in various branches of medicine and pharmacy, including drug delivery [77][78][79]. This also applies to cyclodextrins, which due to their specific shape features, have been widely used to increase API solubility. In this work, the QSPR model of the binding constant of different compounds to β-CD was developed based on the MARSplines methodology, and molecular descriptors were derived from the SMILES code. The internal and external validation indicated good accuracy of the model. The appearance of polarity-related descriptors, such as XlogP, indicated the hydrophobic nature of the cyclodextrin cavity, which is consistent with the nature of cyclodextrins. It is well known that the hydrophilicity/hydrophobicity of a drug can be used for evaluation of the drugs' permeability. Therefore, the model was used for predicting affinity to β-CD of exemplary compounds belonging to different BCS classes. As was established, APIs exhibiting high permeability (I and II BCS Class) are generally characterized by higher lnK values than compounds revealing low permeability (class III and IV). This shows that β-CD complexation seems to offer an alternative for complex and expensive experimental permeability modeling studies.