Predicting the Surface Tension of Deep Eutectic Solvents: A Step Forward in the Use of Greener Solvents

Deep eutectic solvents (DES) are an important class of green solvents that have been developed as an alternative to toxic solvents. However, the large-scale industrial application of DESs requires fine-tuning their physicochemical properties. Among others, surface tension is one of such properties that have to be considered while designing novel DESs. In this work, we present the results of a detailed evaluation of Quantitative Structure-Property Relationships (QSPR) modeling efforts designed to predict the surface tension of DESs, following the Organization for Economic Co-operation and Development (OECD) guidelines. The data set used comprises a large number of structurally diverse binary DESs and the models were built systematically through rigorous validation methods, including ‘mixtures-out’- and ‘compounds-out’-based data splitting. The most predictive individual QSPR model found is shown to be statistically robust, besides providing valuable information about the structural and physicochemical features responsible for the surface tension of DESs. Furthermore, the intelligent consensus prediction strategy applied to multiple predictive models led to consensus models with similar statistical robustness to the individual QSPR model. The benefits of the present work stand out also from its reproducibility since it relies on fully specified computational procedures and on publicly available tools. Finally, our results not only guide the future design and screening of novel DESs with a desirable surface tension but also lays out strategies for efficiently setting up silico-based models for binary mixtures.


Introduction
The last two decades have witnessed a significant shift in the design and development of new chemicals for large-scale industrial applications. One of such efforts has been driven towards the replacement of flammable and environmentally hazardous substances with green and sustainable solvents. More benign solvents, even if dispensed in a large amount into the environment, are known to produce less harmful effects [1][2][3]. Deep eutectic solvents (DES) represent a class of "green solvents" with tremendous potential to replace conventional toxic chemicals. Indeed, apart from having a wide range of applications, DESs exhibit much less environmental toxicity, even when compared to their predecessor, ionic liquids (ILs) [4,5]. Thus, it is not surprising that the emergence of DESs at the beginning of this century drew considerable attention from the scientific community, as is confirmed by the growing number of publications related to DESs in the last two decades [4,5]. DESs may Economic Co-operation and Development (OECD) [30] to expand their overall applicability as well as statistical reliability [31]. Moreover, in order to address the requirement of robust validation strategies applicable for the binary mixtures, we have recently designed an open-source standalone Python-based tool "QSAR-Mx" (freely available to download at https://github.com/ncordeirfcup/QSAR-Mx, last accessed on 28 April 2022) [28]. This work extensively utilizes such a tool to set up the predictive QSPR models for probing the surface tensions of DESs. Therefore, the scope of the current work goes beyond the development of QSPR models by proposing and comparatively testing novel methodologies that can be utilized in the future towards reliably predicting the properties of binary mixtures.

Dataset Collection and Splitting
The dataset employed for the development of the QSPR models was adopted from our previously published work on DES surface tension estimation [14]. It contains 553 data points, compiled from 112 different binary and ternary DESs of diverse compositions. However, here we solely focused on binary DESs and therefore the current dataset was reduced to 530 data points pertaining to 99 different binary DESs. This dataset was combined with an additional set comprising 89 data points that were collected from measurements reported in the literature since 2020 [32][33][34][35] and that, thus, were not included in our previously thermodynamic model. The final dataset comprises 619 unique data points coming from 113 different types of binary DESs. It is worth noting that the experimental surface tension of each DES in the dataset was measured at atmospheric pressure in a large temperature range (278. 15-358.15 K), rendering the temperature an important independent variable to consider in the QSPR modeling for understanding how it influences such physical property.
Predictive validation is a required but delicate task in any QSPR modeling-i.e., to assess model adequacy to new mixtures, and it is related directly to the dataset division scheme adopted. In fact, as shown by Muratov et al. [36], the random division of the original dataset for validation purposes is unacceptable since it can lead to unreliable QSPR models and to an over-optimistic estimation of their predictive performance. The authors have thus proposed and described in detail different validation strategies for the QSPR modeling of mixtures [36]. In this work, two such validation strategies were utilized to search for the most predictive QSPR models, namely: the mixtures-out (MO) and compounds-out (CO) schemes. Briefly, in the MO scheme, mixtures of the modeling set are distributed among the training and the test sets without repetition. By contrast, in the CO scheme, at least one chemical of the dataset is never placed in the training set. Naturally, these validation strategies are only applicable to binary mixtures and require some guidance to follow. Due to the complexity of the data matrices, any random MO-or CO-based division scheme may not yield the most predictive model since variables selection depends largely on the training set [28]. Even though the CO-based validation is considered to be the most robust strategy [36], it may give rise to underfitted models with poor statistical quality. At the same time, while the MO-based validation is less robust, this strategy definitely provides more meaningful solutions than any random data distributions or other validation division schemes such as the points-out one proposed by the same authors [28,[36][37][38].
As referred to above, we have recently developed a Python-based tool named QSAR-Mx [28], specifically devised to address and automate some crucial steps involved in the QSPR modeling of binary mixtures. A detailed description of the functionalities of this tool can be found in its instruction manual (accessible from https://github.com/ncordeirfcup/ QSAR-Mx, last accessed on 28 April 2022). Essentially, QSAR-Mx lets the user generate multiple MO-and CO-based data distributions and then develop models with the latter and select the most predictive ones based on their statistical metrics. Firstly, the user should choose two parameters-i.e., seed and interval, for generating the MO-and CO-based data distributions. In the MO division scheme, QSAR-Mx detects unique mixtures present in the dataset and then sorts them by the number of occurrences in descending order. From the sorted list, the mixtures are grouped according to the seed (the starting point for selection) and interval values given. The unique mixtures selected are then incorporated into the test set. Likewise, in the CO scheme, the QSAR-Mx tool begins by sorting the unique chemicals that belong to component-1 of the mixtures, followed by sorting them in descending order and lastly, by choosing some chemicals based upon the maximum values of the seed and interval chosen. This procedure is then replicated for the unique chemicals belonging to component-2 of the mixtures. The unique chemicals selected are then placed in the test set. One should notice however here that the following QSPR models were always derived with generated data distributions in which the training set size was always greater than the test set size and, simultaneously, the size of the latter was at least 15% of the former. It should be also mentioned that the QSAR-Mx tool has been slightly modified since our previous work [28] because we found that the MO-based data distributions vary from one run to another. The new version of QSAR-Mx is now able to generate the same data distributions (i.e., MO-based training and test sets) every time, independently of the seed and interval given, leading thus to more reproducible modeling results.
In this work, to begin with, we divided the whole dataset into a modeling set and an external validation set (535 and 84 data points, respectively), using the CO-based division scheme with values for the seed and interval of 3 and 4, respectively. The modeling dataset was subsequently divided into training and test sets by MO-and CO-based schemes, setting both the maximum seed and maximum interval as 6. The DESs in the training set coming from the two schemes were employed separately for the development of the QSPR models, and those from the test sets were only used to test such models. The DESs in the external validation set were utilized for extra validation of the final most predictive QSPR models found. Details about the investigated DESs along with their experimental surface tension values, and corresponding references are given in Table S1 of the Supplementary Materials.

Mixture Descriptors
Due to the unique nature of binary DESs, the calculation of their descriptors requires additional steps to take into account the specificity of each component as well as the molar fractions. Here, we resorted to the strategy previously suggested by Oprisiu et al. [37], in which the descriptors are initially calculated for each component and then modified on the basis of 'mixture descriptors weighted by molar fraction' formulas. As such, two types of modified descriptors (from now on, referred to as WM descriptors), i.e., D pmix and D nmix , were computed by the following formulas: where D i stands for the descriptor of each component i (i = 1, 2) and x i for the respective molar fraction in the mixture. Both formulas have already been successfully applied to generate predictive models for various properties of DESs [9,28,38] and implemented in the QSAR-Mx tool. Basically, QSAR-Mx includes two methods for calculating such mixture descriptors. Starting from the descriptors previously obtained for each component (D 1 and D 2 ), 'Method-1' calculates only the D pmix descriptors whereas 'Method-2' provides both the D pmix and D nmix descriptors. In the present work, we used both of the aforementioned methods separately and then performed a comparative analysis to elucidate the method that provides the best solution as far as the predictivity of the overall model is concerned.
To start with, the 3D structures of each DES component were obtained by inputting the SMILES (Simplified Molecular Input Line Entry Specification) strings into the application MarvinView (https://docs.chemaxon.com/display/docs/marvinview.md, last accessed on 15 March 2022) and subsequently standardized by the ChemAxon Standardizer tool with the following options: strip salts, aromatize, neutralize and add explicit hydrogen atoms [39]. Here, we have resorted to the 0D-2D descriptors available in the Dragon software [40] for describing each DES component. Actually, 3D descriptors were excluded due to the high computational effort required for structure optimization of each component, especially for large datasets, and the fact that those may also give rise to misleading information not ensuring reliable property prediction by 3D QSPR [28,41]. Finally, along with the WM descriptors, three independent variables were also included, namely: the measuring temperature, T (in K), the presence/absence of chlorine ions, and the presence/absence of bromine ions. The importance of temperature for the modeling was referred to before. Note in addition that only the cationic part was considered during the calculation of WM descriptors for the DES HBA component. Hence, two binary indicator variables (i.e., the presence (1) or absence (0) of halide ions) were required to be included to account for the anionic part of the HBA components.

Modeling Techniques and Evaluation
As to the modeling techniques, we started by opting for a regression-based approach like in our previous work [28] thanks to its easy interpretation but also high reliability. Specifically, the regression coefficients were obtained by the multiple linear regression analysis (MLR) implemented in our in-house QSAR-Mx tool and by selecting the variables through the sequential forward selection (SFS) algorithm using the Sequential Feature Selector module of Mlxtend (http://rasbt.github.io/mlxtend/) [42]. The following different conditions were applied for scoring the SFS selection: (a) determination coefficient (R 2 ), no cross-validation; (b) negative mean absolute error (NMAE), no cross-validation; (c) negative mean Poisson deviance (NMPD), no cross-validation; (d) determination coefficient (R 2 ), five-fold cross-validation (CV) or ten-fold CV.
Yet, a correlation cutoff of 0.95 and variance cutoff of 0.001 were always set to discard highly intercorrelated and near-constant descriptors. Additionally, the selection of the optimal number of descriptors for the MLR models was controlled by the %MAE LOO reduction policy, also implemented in QSAR-Mx. The %MAE LOO reduction scheme guarantees that no new descriptor is added to the model during feature selection if its inclusion does not reduce the leave-one-out (LOO) cross-validation and mean absolute error (MAE LOO ) by at least 5% of the previous model. As such, this policy guarantees that the optimal number of descriptors is present in the model and, at the same time, that models generated with multiple model development strategies may be compared from a neutral ground [28].
To check if higher accuracy could be achieved when estimating the surface tension of DESs, non-linear models were also developed using five different machine learning (ML) techniques, i.e.: (i) k-Nearest Neighbors (k-NN), (ii) Random Forests (RF), (iii) Support Vector Machines (SVM), (iv) Neural Network Multilayer Perceptron (NN-MLP), and (v) Gradient boosting (GB) [43][44][45][46][47]. Such ML-based models were set up by resorting to the tools available in the Scikit-learn programs (https://scikit-learn.org/stable/) with QSAR-Mx (last accessed on 28 April 2022) and for each of them, hyperparameter tuning was performed by varying their crucial parameters (see the list in Table S2). The best parameters for a given ML estimator were determined by a 5-fold cross-validation scheme using the same training sets as before. In the same manner, the external predictivity of the promising non-linear models found was firstly accessed and further validated using the same test and external sets.
As described above, QSAR-Mx generates multiple models based on three types of inputs provided by the user, namely: (i) descriptor calculation strategy (Method-1 or -2), (ii) dataset division schemes (MO-or CO-based data-division), and (iii) scoring conditions. No matter the model generation strategy followed, any QSPR regression model requires an evaluation with robust diagnostic tools to assess and compare its acceptability as well as quality over other models.
In this work, the internal predictivity of the developed QSPR regression models was primarily checked by statistical parameters, such as the MAE LOO and Q 2 LOO (LOO crossvalidation R 2 ) [48,49]. Keeping in mind the importance of the compounds-out validation, we have recently introduced two new statistical parameters based on the so-called leavechemical-out (LCO) cross-validation, which is conceptually similar to the well-known leave-many-out CV but more effective whenever dealing with mixtures [28]. These new parameters, i.e., Q 2 LCO and MAE LCO , are mostly important for the MO-based data distributions. Indeed, even though the latter often produce more predictive models than the CO-based data distributions, their predictivity remains questionable due to the lack of CO-based validation. Both Q 2 LCO and MAE LCO have actually helped us monitor the model performance upon the removal of each component (belonging either to HBA or to HBD) one by one from the training set with further model redevelopment using the remaining components. A detailed description of these two parameters can be found in our previous work [28]. Importantly, while assessing the quality of the models, the difference between Q 2 LOO and Q 2 LCO should also be evaluated. In fact, a large discrepancy between the values of the latter suggests that the mixtures based on one or more components are not predicted well enough by the QSPR model. Besides the above-mentioned statistics, the internal predictivity of the final regression models was also evaluated by using scaled r m 2 validation metrics, such as r m 2 (LOO) and ∆r m 2 (LOO) [50]. Basically, r m 2 metrics are based on the correlation between the observed and predicted values, with and without setting to zero the intercept of the least square regression lines. In addition, the AARD calculated for each data distribution was also used for checking the overall errors of the derived models.
Although not quite common in QSPR modeling, the latter allowed us to compare the quality of our QSPR models with that of previously reported thermodynamic models [29]. To access the external predictive ability of the models, similar statistical validation metrics were also employed, i.e., the mean absolute error for the test set or external validation set (MAE test and MAE ext ) and the variance explained for external prediction (R 2 Pred ) [44] along with the scaled r m 2 (test) , ∆r m 2 (test) , r m 2 (ext) , and ∆r m 2 (ext) metrics [50]. Other aspects that deserve special attention are the absence of highly collinear descriptors and the lack of chance correlations in the final derived models. Highly collinear variables were simply checked by inspecting the cross-correlation matrix of the models' descriptors. On the other hand, the Y-randomization technique identifies models with chance correlations, using the cR P 2 parameter [51], after the sequence of the response vector has been randomly modified. Here, the procedure was repeated 1000 times, and new models were developed with the randomly reordered responses employing the same set of variables. The uniqueness of the final regression model and its lack of chance correlations is confirmed by the value obtained for cR P 2 , which should be closer to one [51]. Finally, apart from inspecting the models' robustness and predictivity, one should also define their applicability domain (AD), that is, the response and chemical structure space for which the models form reliable predictions without extrapolating. In this work, the AD of the developed models was determined by the leverage approach [52], which renders a measure of the similarity of a particular substance from all other substances (distance between its descriptor values and the average for all descriptor values). So, one can plot the standardized residuals against the leverage values for each DES of the several sets. From such a plot, the so-called William's plot [52], we were able to identify the response and structural DES outliers. All plots shown in the present work were conceived with Matplotlib [53].

Consensus Modeling
The task here is to explore whether the overall quality of predictions for new substances might be improved by an "Intelligent" selection of multiple models. Towards that end, the most predictive QSPR models derived were subjected to consensus modeling, using the software tool Intelligent Consensus Predictor (freely available through the web https: //dtclab.webs.com/software-tools, last accessed on 23 March 2022) developed by Roy et al. [54]. The four strategic techniques of this tool were applied, namely: Consensus Models (CM) 0-3, just as in our previous work [28], and as fully described in the work by the authors [54]. In short, CM0 is the simplest strategy and consists in computing the arithmetic average of predicted response values from all input individual models. In contrast, CM1 is based on the simple arithmetic average of predictions from all qualified individual models. CM2 corresponds to weighted average predictions from all qualified models, formerly giving appropriate weightage to those models. Finally, CM3 applies compound-wise predictions based on the best selection coming from the qualified models. Independently of the consensus modeling methodology, our main purpose was thus to combine multiple statistically robust models to improve the predictivity over the external validation set. Figure 1 depicts the workflow chart followed in the present QSPR modeling, which was mostly carried out using our recently developed tool, QSAR-Mx. As can be seen, all the involved steps and methodology employed to cope with the major goal of this work are shown, i.e., to build reliable predictive QSPR regression models from the compiled data that could be used to estimate the surface tension of DESs. stances might be improved by an "Intelligent" selection of multiple models. Towards that end, the most predictive QSPR models derived were subjected to consensus modeling, using the software tool Intelligent Consensus Predictor (freely available through the web https://dtclab.webs.com/software-tools, last accessed on 23 March 2022) developed by Roy et al. [54]. The four strategic techniques of this tool were applied, namely: Consensus Models (CM) 0-3, just as in our previous work [28], and as fully described in the work by the authors [54]. In short, CM0 is the simplest strategy and consists in computing the arithmetic average of predicted response values from all input individual models. In contrast, CM1 is based on the simple arithmetic average of predictions from all qualified individual models. CM2 corresponds to weighted average predictions from all qualified models, formerly giving appropriate weightage to those models. Finally, CM3 applies compoundwise predictions based on the best selection coming from the qualified models. Independently of the consensus modeling methodology, our main purpose was thus to combine multiple statistically robust models to improve the predictivity over the external validation set. Figure 1 depicts the workflow chart followed in the present QSPR modeling, which was mostly carried out using our recently developed tool, QSAR-Mx. As can be seen, all the involved steps and methodology employed to cope with the major goal of this work are shown, i.e., to build reliable predictive QSPR regression models from the compiled data that could be used to estimate the surface tension of DESs.  In total, 258 models were set up by varying data splitting schemes, descriptor calculation methods (Method-1 or Method-2) and SFS-MLR modeling. Among these, 136 models pertained to the MO-based division scheme, whereas the remaining 112 models were generated with the CO-based division scheme. The overall predictive quality of each of these regression models was judged by means of the average value computed for the statistical parameters Q 2 LOO , Q 2 LCO and R 2 Pred . Essentially, the two parameters-Q 2 LOO and R 2 Pred -account for the internal and external predictivity of the QSPR models, respectively. Nevertheless, the parameter Q 2 LCO was also included to ensure that the most predictive models do not suffer overfitting due to bias towards some specific components of the binary DES mixtures. Naturally, the higher the average value obtained from these three parameters is, the more predictive the model is. Considering this, we selected the top 15 unique models for further processing. A summary of the statistical results of these models is given in Table 1. Interestingly, out of these 15 models, 14 were derived from MO-based data distributions, and only one model arose from CO-based distributions. Undoubtedly, that clearly shows that MO-based data distributions are more likely to produce more predictive models in comparison to the CO-based data distributions since the latter provides a more rigorous validation strategy.

Model Calibration and Evaluation
As referred to before, in the entire model-building process, the data distributions were varied for the selection of the most predictive models. Therefore, the generated test sets serve as a validation set to estimate the external predictivity of the models but, at the same time, as calibration sets for the selection of the best models. In contrast, the external validation sets (containing 84 data points) were treated as the 'true validation set' for assessing the external predictivity of the models. The latter was built with the CO-based data-distribution scheme, thus holding a significant challenge to the generated models as far as their external predictivity is concerned. A comparison of the predictivity of the top 15 models is shown in Table 2.
As can be clearly observed from Table 2, only a few models show satisfactory predictions against the external validation set. Nevertheless, six of these models had R 2 Pred values greater than 0.50, as well as average %AARD values lower than 12. Moreover, three models, namely M09, M10 and M12, supplied the most satisfactory predictivity towards such external validation set with R 2 Pred > 0.65 and %AARD < 10. Therefore, these three models were considered the best models obtained for predicting the surface tension of DESs. Remarkably, M10, the only CO-based model included in the top 15, emerged as one of the most predictive models. Still, on the basis of overall predictivity, M12 was selected as the best individual QSPR model, even taking into account its slightly lower internal predictivity, compared to M10, and its slightly lower external predictivity towards the test set, as compared to M09. Even so, M12 afforded a balanced prediction against all three sets with an average %AARD value of 7.126, which is lower than that obtained for the other two models. At the same time, model M12 provides the best solution if the average value of Q 2 LOO , Q 2 LCO and R 2 Pred (against the two validation sets) is considered. In fact, for M12, this average value was found to be 0.859, while for M09 and M10, the average values were estimated as 0.820 and 0.831, respectively.
In summary, the best predictive model found for the DESs' surface tension (a sixvariable equation, model M12) can be expressed as detailed below, while the meaning of the selected WM descriptors is given in Table 3.    a P_VSA-like descriptors stand for the van der Waals surface area (VSA) with a particular property (P), in this case, the molar refractivity (MR) [57]. b Chemically Advanced Template Search (CATS) descriptors expressly designed to identify scaffold hops [58]. c I-states are based on the Kier-Hall atomic electronegativity modified by the number of σ bonds, number of hydrogen atoms, number of electrons in π orbitals, and number of lone pair electrons [55,56].
In this equation, X pmix and X nmix stand for WM descriptors of the type D pmix in line with Equation (1) and D nmix following Equation (2), respectively, T is the temperature (in K) under which the surface tension has been measured, and σ is the surface tension (in mN/m).
A summary of the extended statistical results for model M12 is given in Table 4. The determination coefficient values (R 2 = 0.916 and R 2 Adj = 0.915), the sample size (N tr = 360), the Fisher ratio (F = 642.4), but especially the high ratio between the number of data points to adjustable variables (ρ = 60) [59] are indicative of the model's statistical significance and fitness. Model M12 also provides a satisfactory internal and external predictivity as follows from the cross-validation, r m 2 and R 2 Pred metrics values (see Table 4). Moreover, built with only six descriptors, it led to %AARD values of 5.805, 11.155 and 4.418 against the training, test and validation sets, respectively. The model prediction ability was further checked by analyzing the relative deviations (%RD = 100*(σ Pred − σ exp )/σ exp ) between the predicted and experimental DES surface tension values for all three sets. As Figure 2 shows, model M12 performs more accurately regarding the training and external validation sets than the test set. Yet, the latter also demonstrates a normal behavior considering the shape of the RD distribution according to the proposed model, also displayed in Figure 2. This histogram plot clearly depicts that most of the RD error values are within ±20% and that those are normally distributed, suggesting that the model estimations are not biased.   Figure 3 shows the plot of the predicted surface tensions obtained from model M12 vs. the observed experimental ones. As seen, the majority of the data points are sufficiently close to the diagonal line, denoting the model's reliability and soundness of its predictions. Indeed, the model's performance is even better than that of the previously developed thermodynamic model for the DES surface tension [14], which, despite having fewer data points (a total of 530 data points, considering only the 99 unique binary DES), led to %AARD values of 8.87 and 14.81 for the training and test sets, respectively. However, the purpose and outcomes of the current QSPR modeling are different from that of any thermodynamic model, as the former demands several different conditions to be satisfied, apart from validation, to establish the statistical robustness of the model. For example, so far, we have demonstrated the acceptable results on the reliability of the QSPR model M12, but it is also important to inspect the non-intercollinearity among any two of its descriptors. The latter was found to be 0.238, indicating that the variables included in the model are indeed not interrelated to each other. Furthermore, the model was itself checked for its uniqueness by the Y-based randomization technique, which was performed by scrambling the endpoint responses for the training set. The high value obtained for cRP 2 (=0.908) implies that the model is not correlated by chance. Another crucial aspect is related to the applicability domain of the model that here was assessed by analyzing the Williams plot (plot of standardized residual vs. leverages). As seen in Figure 3, eight data points from the training set and thirteen from the test set can be considered structural outliers of the model, but no structural outliers were found in the external validation set. Interestingly, most of these structural outliers were well predicted by the model and were thus retained, as previously suggested by Gramatica et al. [49]. In addition, only twelve data points of the entire dataset were found to be response outliers, which also proves the high predictive accuracy of the model [60].  Figure 3 shows the plot of the predicted surface tensions obtained from model M12 vs. the observed experimental ones. As seen, the majority of the data points are sufficiently close to the diagonal line, denoting the model's reliability and soundness of its predictions. Indeed, the model's performance is even better than that of the previously developed thermodynamic model for the DES surface tension [14], which, despite having fewer data points (a total of 530 data points, considering only the 99 unique binary DES), led to %AARD values of 8.87 and 14.81 for the training and test sets, respectively. However, the purpose and outcomes of the current QSPR modeling are different from that of any thermodynamic model, as the former demands several different conditions to be satisfied, apart from validation, to establish the statistical robustness of the model. For example, so far, we have demonstrated the acceptable results on the reliability of the QSPR model M12, but it is also important to inspect the non-intercollinearity among any two of its descriptors. The latter was found to be 0.238, indicating that the variables included in the model are indeed not interrelated to each other. Furthermore, the model was itself checked for its uniqueness by the Y-based randomization technique, which was performed by scrambling the endpoint responses for the training set. The high value obtained for cR P 2 (=0.908) implies that the model is not correlated by chance. Another crucial aspect is related to the applicability domain of the model that here was assessed by analyzing the Williams plot (plot of standardized residual vs. leverages). As seen in Figure 3, eight data points from the training set and thirteen from the test set can be considered structural outliers of the model, but no structural outliers were found in the external validation set. Interestingly, most of these structural outliers were well predicted by the model and were thus retained, as previously suggested by Gramatica et al. [49]. In addition, only twelve data points of the entire dataset were found to be response outliers, which also proves the high predictive accuracy of the model [60].

Model Interpretation
In our previous investigation on density [28] we observed that, in spite of providing less mechanistic interpretability, graph-based topological descriptors often help in characterizing the physicochemical properties. In the present work, a number of topological descriptors were also proven to be significant for describing the surface tension of DESs. Figure 4 shows the relative importance of each descriptor of model M12, estimated on the basis of the absolute value of its regression coefficients. As can be observed, the WM descriptor MATS5snmix was found to have the highest importance and besides, it is the only Dnmix type descriptor in the model. Being derived from graph-based topological descriptors, MATS5dnmix points out that the differences in topological geometry of the DESs' components may play a significant role in the surface tension of these solvents. The Dpmix type WM descriptor CATS2D_02_ANpmix is the second most influencing descriptor of the model. Chemically Advanced Template Search (CATS) descriptors are a useful group of descriptors that account for the topological distance among scaffold features in the molecules [58]. CATS2D_02_AN, in particular, means that

Model Interpretation
In our previous investigation on density [28] we observed that, in spite of providing less mechanistic interpretability, graph-based topological descriptors often help in characterizing the physicochemical properties. In the present work, a number of topological descriptors were also proven to be significant for describing the surface tension of DESs. Figure 4 shows the relative importance of each descriptor of model M12, estimated on the basis of the absolute value of its regression coefficients.

Model Interpretation
In our previous investigation on density [28] we observed that, in spite of providing less mechanistic interpretability, graph-based topological descriptors often help in characterizing the physicochemical properties. In the present work, a number of topological descriptors were also proven to be significant for describing the surface tension of DESs. Figure 4 shows the relative importance of each descriptor of model M12, estimated on the basis of the absolute value of its regression coefficients. As can be observed, the WM descriptor MATS5snmix was found to have the highest importance and besides, it is the only Dnmix type descriptor in the model. Being derived from graph-based topological descriptors, MATS5dnmix points out that the differences in topological geometry of the DESs' components may play a significant role in the surface tension of these solvents. The Dpmix type WM descriptor CATS2D_02_ANpmix is the second most influencing descriptor of the model. Chemically Advanced Template Search (CATS) descriptors are a useful group of descriptors that account for the topological distance among scaffold features in the molecules [58]. CATS2D_02_AN, in particular, means that As can be observed, the WM descriptor MATS5s nmix was found to have the highest importance and besides, it is the only D nmix type descriptor in the model. Being derived from graph-based topological descriptors, MATS5d nmix points out that the differences in topological geometry of the DESs' components may play a significant role in the surface tension of these solvents. The D pmix type WM descriptor CATS2D_02_AN pmix is the second most influencing descriptor of the model. Chemically Advanced Template Search (CATS) descriptors are a useful group of descriptors that account for the topological distance among scaffold features in the molecules [58]. CATS2D_02_AN, in particular, means that the acceptor and negatively charged groups are separated by a small topological distance (=2). In this case, higher values of this descriptor are found to be negatively correlated to the surface tension. Descriptor BLTF96 pmix appears as the third most important descriptor in the model. Unlike the first two descriptors of topological nature, this descriptor is based on an important molecular property-lipophilicity [55,56]. Since this descriptor belongs to the D pmix type, it may be inferred that higher lipophilicity of the components would trigger higher surface tension for the DESs. Apart from lipophilicity, another wellknown physicochemical property-dipole moment-was also found to have important contributions in ascertaining the DES surface tension. The importance of the dipole moment is derived from the presence of descriptor Eig02_EA(dm) pmix . The fifth most important descriptor belongs to the class of P_VSA descriptors, which represent the amount of van der Waals surface area (VSA) having a property (P) in a certain range [56]. In the case of the descriptor P_VSA_MR_6 pmix , the property is the molar refractivity (MR) at a larger range (bin size 6). The positive relation of P_VSA_MR_6 pmix with the dependent property is highly significant as it suggests that increased MR (i.e., polarizability) within the van der Waals surface of each component contributes towards a higher surface tension for the respective DESs. Finally, the last descriptor of the model is the temperature of surface tension measurements, T. As expected, with increasing temperature the surface tension is found to decrease, which fits well with the experimental findings. Still, to further check how model M12 actually addresses the influence of temperature, we randomly selected six DESs with a range of surface tension values. From Figure 5, it can be clearly seen that both experimental and predicted properties followed the same trend, i.e., the surface tension gradually decreases as the temperature is increased. the acceptor and negatively charged groups are separated by a small topological distance (=2). In this case, higher values of this descriptor are found to be negatively correlated to the surface tension. Descriptor BLTF96pmix appears as the third most important descriptor in the model. Unlike the first two descriptors of topological nature, this descriptor is based on an important molecular property-lipophilicity [55,56]. Since this descriptor belongs to the Dpmix type, it may be inferred that higher lipophilicity of the components would trigger higher surface tension for the DESs. Apart from lipophilicity, another well-known physicochemical property-dipole moment-was also found to have important contributions in ascertaining the DES surface tension. The importance of the dipole moment is derived from the presence of descriptor Eig02_EA(dm)pmix. The fifth most important descriptor belongs to the class of P_VSA descriptors, which represent the amount of van der Waals surface area (VSA) having a property (P) in a certain range [56]. In the case of the descriptor P_VSA_MR_6pmix, the property is the molar refractivity (MR) at a larger range (bin size 6). The positive relation of P_VSA_MR_6pmix with the dependent property is highly significant as it suggests that increased MR (i.e., polarizability) within the van der Waals surface of each component contributes towards a higher surface tension for the respective DESs. Finally, the last descriptor of the model is the temperature of surface tension measurements, T. As expected, with increasing temperature the surface tension is found to decrease, which fits well with the experimental findings. Still, to further check how model M12 actually addresses the influence of temperature, we randomly selected six DESs with a range of surface tension values. From Figure 5, it can be clearly seen that both experimental and predicted properties followed the same trend, i.e., the surface tension gradually decreases as the temperature is increased.

Non-Linear Models
Albeit M12 emerged as the most accurate linear model in estimating the DES surface tension, the question that still remains is whether a non-linear model based on its descriptors might have better performance. Table 5 shows a statistical summary of the performance of the non-linear models resulting from applying the five different machine

Non-Linear Models
Albeit M12 emerged as the most accurate linear model in estimating the DES surface tension, the question that still remains is whether a non-linear model based on its descriptors might have better performance. Table 5 shows a statistical summary of the performance of the non-linear models resulting from applying the five different machine learning techniques, i.e.: k-NN, RF, SVM, NN-MLP and GB [49][50][51][52][53]. It can be observed that most of these ML techniques failed to produce predictive models and their results are not accurate either. Still, the SVM technique yields a predictive model and thus has the highest performance among the other ML techniques, though both the internal and the external predictivity of the latter remain inferior to the linear model. These results indicate that for the selected set of descriptors, the multiple linear regression-based model has the best accuracy in estimating the surface tension of DESs as well as sufficient predictivity that cannot be achieved with other model development techniques.

Consensus Modeling
Finally, we applied the intelligent consensus modeling [54] to see whether the surface tension predictions for the external validation set could be improved. To do so, sets of the three most predictive linear models-M09, M10 and M12-were subjected to consensus predictions in different combinations, namely: (a) C1-based using models M09, M10 and M12; (b) C2-based using models M10 and M12; (c) C3-based using models M09 and M10; and (d) C4-based with models M09 and M12. In each case, the modeling dataset containing 535 data points was treated as the training set whereas the external validation set was used to check the external predictivity of the consensus model. The results of all consensus modeling attempts are presented in Table 6. Interestingly, the resulting models C1 and C4 lead to similar predictivities. Yet, none of the later consensus models display an external predictivity considerably better than that of the best individual model, M12. The R 2 Pred and %AARD values obtained for consensus model C1 are 0.864 and 4.459, respectively, and similarly for C3 (i.e., 0.854 and 4.393). As can be seen, both C1 and C3 may therefore be projected as alternative models to M12.
However, let us mainly focus on C3, since it reveals that M09 and M10 may indeed work as complementary models for each other towards improving the external predictivity. Details about the M09 and M10 models are provided in Table S3 of the SI.
Since M09 was developed with the same data distribution as M12, these two models have four descriptors in common, namely: CATS2D_02_AN pmix , P_VSA_MR_6 pmix , BLTF96 pmix , and T. Obviously, these four descriptors have a high significance in describing the surface tension of DESs. Significantly, CATS2D_02_AN pmix , which was found to be the second most important descriptor of M12, appears to be the most influential descriptor of M09. It undoubtedly indicates that this descriptor may be considered the most crucial descriptor in predicting the surface tension of DESs. Presumably, due to the similarity between M12 and M09, consensus modeling with these two models failed to provide any better solution. Model M10, in contrast, is established as a unique model because, save for T, none of its descriptors is found either in M09 or in M12. Most likely due to this reason, its combination with the other two models produces good consensus models. Unlike models M09 and M12, model M10 yields are slightly higher but still have acceptable intercollinearity between descriptors, with a maximum R 2 value of 0.713. The selected descriptors for models M09 and M10 are described in detail in Table S4 of the SI.

Conclusions
The present work aimed to establish a systematic and well-designed QSPR modeling for predicting the surface tension of a wide range of DESs, following the OECD guidelines. Towards such aim, the largest surface tension data bank of binary DESs known to date, comprising 619 data points from 113 unique DESs of various families was gathered from the literature. In addition, special emphasis was put on employing robust validation strategies for setting up the QSPR models. In so doing, the best QSPR models were set up with multiple data distributions resulting from MO-and CO-data splitting schemes along with a weighted mixture type of descriptors, using our in-house open access tool QSAR-Mx. After considering several statistical parameters, the top three individual linear regression models stand out for their accuracy and robustness. The most predictive individual linear model was however selected based on the predictivity towards the external validation set.
Similar to our previous study on the development of a thermodynamic model [14], the surface tension of DESs was found to be a particularly difficult property to predict. This may be related to the challenging nature of the accurate experimental estimation of surface tensions, associated with the presence of surface-active impurities and differences in the measuring protocols [33]. Nevertheless, our most predictive individual QSPR regression model (i.e., model M12) yielded a satisfactory overall %AARD value (=7.126), especially when compared to the aforementioned thermodynamic model (%AARD = 10.31, considering only the binary DESs therein) [14]. This model depicted also the structural and physicochemical features related to the surface tension of DESs. Just as in our previously developed model for the density of DESs [28], graph-based topological descriptors were found to be highly useful in this respect. Some physicochemical factors, such as the lipophilicity, polarizability and dipole moment of the DESs' components, were found to be responsible for ruling their surface tensions. We also attempted to generate consensus models based on the top three individual linear models. Interestingly, consensus models based on the two other best individual models-M09 and M10-were found to be equally predictive towards the external validation set.
Overall, this work definitely provides valuable information about the structural and physicochemical features required for predicting the surface tension of binary DESs. At the same time, it also lends important guidelines to set up predictive and validated linear interpretable QSPR models for the various properties of binary mixtures. The high predictivity of the models ensures that these models may be used on the industrial scale to at least predict the surface tension of the DESs that are newly developed or under-developed to assess their suitability as an industrial solvent. The models may also be used to screen a large number of DESs (obtained from databases) to predict the DESs with desirable surface tension properties. What is more, all the proposed models are easily reproducible since they rely on fully specified computational procedures and were built with non-commercial software tools. Finally, both the individual and consensus models developed in this work shall help the future screening as well as the design of new sustainable DES, with major time and cost savings.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/molecules27154896/s1, Table S1. List of all investigated DESs and experimental surface tension data. (XLSX); Table S2. Hyperparameter tuning of the different machine learning techniques. (PDF); Table S3. Detailed description of the M09 and M10 models. (PDF); Table S4  Data Availability Statement: All the data files pertaining to the QSPR modeling are available from the authors. The training, test and external datasets were taken from cited publications, and the DES chemical structures along with the collected surface tension data are provided in the Supporting Information (Table S1). Dragon 7.0, MarvinView, and Standardizer were used in this study under academic license (see Material and methods section). Three other open source software tools were also used in this study, namely: QSAR-Mx, a Python-based tool developed by the authors that is available to download at https://github.com/ncordeirfcup/QSAR-Mx (last accessed on 28 April 2022); Mlxtend, a Python library of useful tools that is accessible from https://rasbt.github.io/ mlxtend/; scikit-learn, a Python library of useful machine learning tools that is accessible from https://scikit-learn.org/stable/; and Intelligent Consensus Predictor, a Java-based tool available through the web https://dtclab.webs.com/software-tools (last accessed on 23 March 2022)(see Material and methods section).