Article

Duality of Simplicity and Accuracy in QSPR: A Machine Learning Framework for Predicting Solubility of Selected Pharmaceutical Acids in Deep Eutectic Solvents

Department of Physical Chemistry, Faculty of Pharmacy, Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Toruń, Kurpińskiego 5, 85-950 Bydgoszcz, Poland
*
Author to whom correspondence should be addressed.
Molecules 2025, 30(22), 4361; https://doi.org/10.3390/molecules30224361
Submission received: 30 September 2025 / Revised: 7 November 2025 / Accepted: 10 November 2025 / Published: 11 November 2025

Abstract

We present a systematic machine learning study of the solubility of diverse pharmaceutical acids in deep eutectic solvents (DESs). Using an automated Dual-Objective Optimization with Iterative feature pruning (DOO-IT) framework, we analyze a solubility dataset compiled from the literature for ten pharmaceutically important carboxylic acids and augment it with new measurements for mefenamic and niflumic acids in choline chloride- and menthol-based DESs, yielding N = 1020 data points. A data-driven multi-criteria measure is applied for final model selection among all collected accurate and parsimonious models. This three-step procedure enables extensive exploration of the model hyperspace and effective selection of models that combine high accuracy, simplicity, and persistence of the descriptors selected during model development. The dual-solution landscape clarifies the trade-off between complexity and cost in QSPR for DES systems and shows that physically meaningful energetic descriptors can replace or enhance explicit COSMO-RS predictions depending on the application.

1. Introduction

Carboxylic acids and their derivatives play an essential role in pharmacy as both active pharmaceutical ingredients and structural motifs in drug design [1,2,3]. Their carboxyl group enables proton dissociation, hydrogen bonding, and interactions with plasma proteins, which affect absorption, distribution, and overall bioavailability [4,5]. Among them, mefenamic and niflumic acids are representative nonsteroidal anti-inflammatory drugs (NSAIDs) [6,7], used clinically for pain and inflammation [8,9,10,11,12,13]; however, both exhibit limited aqueous solubility that complicates formulation and necessitates solubilization strategies [14,15,16]. Other acids included in this study, such as flufenamic [17,18,19], ibuprofen and ketoprofen [20,21,22,23,24], probenecid [25,26], and naturally occurring phenolic acids like ferulic, caffeic, p-coumaric, and syringic acids [5,27,28,29,30,31,32,33,34,35,36,37,38], span a broad chemical and functional space relevant to pharmaceutical and green solvent research.
Solubility remains a key physicochemical parameter influencing drug efficacy, processing, and environmental behavior [39,40]. Poorly soluble compounds require formulation innovations, increasing development costs [41,42]. Because solubility depends on multiple thermodynamic and molecular factors, including temperature, polymorphism, and solute–solvent interactions [43,44,45,46,47], predictive computational approaches are increasingly valuable. Among the various techniques used for the solubility enhancement of APIs [41,48,49], deep eutectic solvents (DESs) are particularly interesting and promising. DESs are a flexible and increasingly studied class of liquid systems formed by mixing appropriate hydrogen bond donors and acceptors, which leads to a significant reduction in the melting point relative to the starting components [50,51,52,53]. DESs are distinguished by a number of properties useful in a pharmaceutical context: low volatility (which promotes safety and reduces emissions), a wide spectrum of polarity and acidity, significant “tunability” through component selection, and the ability to solubilize compounds of various chemical natures [54,55]. In pharmacy, DESs are being researched as extraction media for isolating natural products, as potential carriers for formulations that enhance the solubility and bioavailability of APIs, and as so-called THEDES (therapeutic deep eutectic solvents), i.e., systems in which the components themselves may have biological activity or facilitate drug stabilization and delivery [56,57,58,59]. At the same time, the vast combinatorial space of possible components and molar ratios necessitates the use of accelerated testing methodologies and predictive tools, as an empirical review of all variants is costly and time-consuming.
In this context, computational solubility prediction, especially using QSPR and advanced machine learning (ML) methods, is becoming a strategic tool for both pharmaceutical development and green solvent design [60,61,62,63]. QSPR models establish mathematical relationships between molecular structure and macroscopic properties, providing interpretable links between descriptors and solubility behavior [64,65]. Modern ML algorithms, such as neural networks, graph-based approaches, support vector machines, and ensemble strategies, extend this paradigm by capturing complex, non-linear patterns and multidimensional interactions that are often inaccessible to traditional regression models [66,67,68,69]. Neural networks, in particular, can integrate heterogeneous representations including physicochemical descriptors, molecular fingerprints, graph embeddings, and solvent composition features, allowing them to learn transferable correlations across chemically diverse datasets [70,71,72]. Recent studies demonstrate that such ML-based approaches can successfully predict aqueous and non-aqueous solubilities of drug-like molecules, outperforming empirical correlations and physics-based models when trained on sufficiently broad and curated datasets [73,74,75]. Beyond accuracy, however, the practical value of predictive modeling lies in its ability to guide experiment planning, identify anomalous data, and support rational solvent selection. For this purpose, hybrid strategies that combine physically meaningful descriptors, derived from quantum-chemical or COSMO-RS computations, with data-driven optimization provide a balanced route toward robust, interpretable, and generalizable models. Such hybrid approaches are particularly relevant for complex solvent systems like deep eutectic solvents (DESs), where solute–solvent interactions are governed by multiple competing forces and explicit simulation is computationally demanding. 
The integration of COSMO-RS-derived descriptors with machine learning optimization therefore enables data-efficient exploration of the chemical space and identification of the key physicochemical drivers of solubility. At the same time, the issue of balancing model interpretability and predictive power remains central. Reliable solubility modeling requires rigorous feature selection, cross-validation, and uncertainty estimation to avoid overfitting and ensure transferability to new chemical systems.
The purpose of this work was to create a predictive model for estimating the solubility of various pharmaceutically active carboxylic acids. The model was developed by combining COSMO-RS-derived molecular descriptors with machine learning methods, based on our newly established DOO-IT (Dual-Objective Optimization with ITerative feature pruning) framework. New experimental data for mefenamic acid and niflumic acid were obtained for this study, which were supplemented with solubility values found in the literature for a number of acids used in the pharmaceutical realm. The constructed models were thoroughly validated, and their performance was discussed, highlighting their effectiveness and the potential for generalization.

2. Results and Discussion

2.1. Solubility Measurements of Mefenamic Acid and Niflumic Acid

Figure 1 summarizes the mole fraction solubilities of mefenamic acid (xMEF) and niflumic acid (xNIF) at 25 °C in choline chloride- and menthol-containing DESs, with full numerical values given in Table S1 (xMEF) and Table S2 (xNIF). Across all systems, xMEF spans 1.38 × 10⁻⁴ to 1.40 × 10⁻² and xNIF spans 2.38 × 10⁻⁴ to 2.11 × 10⁻², and menthol-based DESs generally afford higher solubility than their choline chloride analogs for both compounds. The highest xMEF is observed for Men/TRG 1:3 (1.40 × 10⁻²), with Men/TEG also giving elevated values, while the highest xNIF appears for Men/GLY 1:1 (2.11 × 10⁻²) and remains high in Men/TRG 1:1–1:3. Within the choline chloride series, ChCl/DEG 1:1 provides the top xMEF (6.49 × 10⁻³), and ChCl/TRG 1:1 gives the top xNIF (1.47 × 10⁻²). Considering the hydrogen bond donors (HBDs), TRG and TEG are associated with the largest solubilities in the menthol series, GLY is particularly favorable for xNIF at 1:1, and DEG stands out for xMEF in the choline chloride series. The effect of the HBD fraction is system-specific: for example, xMEF increases from 3:1 to 1:3 in Men/TRG, xNIF maximizes at 1:1 in Men/GLY, and values decrease beyond 1:1 in ChCl/DEG (xMEF) and ChCl/TRG (xNIF).

2.2. Identifying an Optimal Predictive Model via the DOO-IT Framework

The DOO-IT framework, which aims to find the most accurate and parsimonious machine learning model, was applied to the solubility dataset collected from our previous studies and augmented with the new measurements of pharmaceutical acids in deep eutectic solvents provided in this work. In total, N = 1020 points characterized the solubility of ten pharmaceutical acids in choline chloride- and menthol-based deep eutectic solvents. The DOO-IT pipeline was run 50 times, enabling the determination of statistically significant populations of dominating models. The results of applying the entire three-pillar pipeline are provided in Figure 2, which visualizes the outcome of the DOO-IT model selection workflow and is central to model development.
The most significant finding illustrated in Figure 2 is that the two descriptor sets lead to distinct “basins of excellence,” highlighted by the points marked in red. This discovery is the “duality” referred to in the title of this article: rather than a single “best” model, we present a strategic choice between two high-performing yet fundamentally different modeling approaches. Set 2 utilizes both energetic contributions and σ-potential distributions, and this extended source of information results in a slightly simpler model with eight descriptors (MAETEST = 0.0893 ± 0.0116, R²TEST = 0.968 ± 0.052). On the other hand, using set 1 exclusively for model development led to an only slightly worse model (MAETEST = 0.1054 ± 0.0082, R²TEST = 0.944 ± 0.015) at the cost of one additional descriptor (nine in total). This outcome indicates that distinct, scientifically meaningful descriptor combinations can achieve competitive performance via different mechanisms. Either model can therefore be regarded as optimal, depending on the intended task. The nine-descriptor model captures the main drivers of solubility in terms of energetic contributions only. Augmenting the model with additional information on the charge density contributions to σ-potentials both simplifies the model and makes it more robust. This duality is an important finding, suggesting a nuanced interplay between the information used for model development and the resulting balance of accuracy, parsimony, and stability. The robustness and usefulness of the DOO-IT framework rely not only on its ability to extract a valuable model from the vast universe of potential models but also on its ability to tailor that model to a given dataset and descriptor combination using objective criteria and full automation, which prevents bias.
The first basin, located at eight descriptors, uses the following descriptors, listed in order of decreasing importance (values in brackets): EvdW,API (0.69), ΔHBA1 (0.66), log(xAPICOSMO) (0.55), μAPI (0.51), ΔHH1 (0.46), ΔHH4 (0.43), EHB,API (0.36), and ΔHH2 (0.32). These descriptors represent the most fundamental and universal properties of the organic acid molecules that dictate their behavior in DESs. Hence, for the studied set of data, the dispersion contribution of the API and the relative hydrophobicity values are the dominant contributions to the solubility. The relative acceptability in the neighboring σ-potential region (ΔHBA1) is also important. The model performance is provided in Figure 3. The plots highlight the inherent trade-off between accuracy and model complexity. The collection of non-dominated solutions forms a clear Pareto front (dark purple points), which represents the best accuracy attainable at any given level of complexity. Models lying to the right of this front (gray points) are inferior, since a simpler and more accurate alternative exists. Coloring the points along the Pareto front according to the nu hyperparameter reveals a consistent pattern: lower nu values yield simpler models with reduced SV ratios, whereas higher nu values give rise to more complex models characterized by larger SV ratios.
The second high-performance model uses the following nine descriptors: ΔEMisfit (1.46), ΔEvdW (1.15), log(xAPICOSMO) (0.94), EvdW,API (0.72), EHB,API (0.71), Etot,API (0.66), μAPI (0.55), ΔEtot (0.47), and EMisfit,DES (0.47). This set of descriptors reveals the complex nature of solute–DES interactions under saturated conditions and the necessity of extending the core features of the former model with the additional polarity and hydrogen-bonding capacity of the solute and solvent. Hence, the extended nine-descriptor solution integrates a broader set of energetic and interaction terms to appropriately capture solvation phenomena. A detailed presentation, along with the model hyperparameters, is given in Figure 4.
It is imperative to contextualize these findings within a crucial methodological framework. The dataset exclusively comprises organic acids of pharmaceutical relevance, including well-known compounds such as ketoprofen, ibuprofen, ferulic acid, probenecid, caffeic acid, p-coumaric acid, syringic acid, and flufenamic acid. A significant and unresolved challenge in modeling such systems is determining the dissociation state of acidic solutes. This is because their intrinsic acidity, quantified by pKa, is strongly modulated by the complex, non-aqueous environment of deep eutectic solvents (DESs). Lacking a reliable method to determine the precise ionization state of each acid in every unique DES, we adopted a necessary and consistent simplification: all molecular descriptors were calculated for the neutral, non-dissociated forms of the molecules. Despite this crude simplification, the remarkable accuracy of the resulting models is particularly noteworthy. It strongly suggests that the fundamental physicochemical properties of the parent molecule are the dominant drivers of solubility and that our modeling framework is robust enough to capture these governing relationships despite the simplification of the solute’s ionization state.

3. Materials and Methods

3.1. Materials

Mefenamic acid and niflumic acid (both ≥97%, Sigma-Aldrich, St. Louis, MO, USA) were used as received. The hydrogen bond acceptors were choline chloride (ChCl, CAS 67-48-1, ≥99%) and menthol (Men, CAS 89-78-1, ≥98.5%), and the hydrogen bond donors comprised ethylene glycol (ETG, CAS 107-21-1), diethylene glycol (DEG, CAS 111-46-6), triethylene glycol (TEG, CAS 112-27-6), tetraethylene glycol (TRG, CAS 112-60-7), glycerol (GLY, CAS 56-81-5), 1,2-propanediol (P2D, CAS 57-55-6), and 1,3-butanediol (B3D, CAS 107-88-0); all polyols/polyethers were obtained from Sigma-Aldrich with stated purity ≥ 99%. Methanol (analytical grade, CAS 67-56-1; Chempur, Piekary Śląskie, Poland) was used for sample handling where applicable. Unless otherwise specified, all chemicals were employed without additional purification.

3.2. Solubility Measurement Procedure

A similar methodology to [76] was employed, adapted here for mefenamic acid (MEF) and niflumic acid (NIF). Each DES was prepared at the target molar ratio by gentle heating with stirring until a clear single phase formed, then equilibrated to 25 °C. Pre-equilibrated aliquots were spiked with an excess of MEF or NIF, sealed, and agitated isothermally for 24 h at approximately 60 rpm. After equilibration, suspensions were maintained at 25 °C and supernatants were withdrawn, filtered through 0.22 μm PTFE syringe filters, and analyzed by UV–Vis. Spectra were collected from 200 to 500 nm in quartz cuvettes; analytical wavelengths were set at the absorption maxima (λMEFmax = 351 nm; λNIFmax = 344 nm). Concentrations obtained from UV–Vis were converted to mole fraction solubilities (xMEF or xNIF) using molar masses and the density of each saturated solution; densities were determined gravimetrically at 25 °C. All measurements were performed in triplicate.
Calibration curves were established for each compound using methanolic stock solutions and serial dilutions. The characteristics for MEF were as follows: calibration range of 0.002 to 0.078 mg mL⁻¹, slope of 28.265, intercept of −0.010, linearity of R² = 0.9993, LOD = 0.00261 mg mL⁻¹, and LOQ = 0.00790 mg mL⁻¹. The characteristics for NIF were as follows: calibration range of 0.005 to 0.090 mg mL⁻¹, slope of 18.808, intercept of −0.006, linearity of R² = 0.9994, LOD = 0.00272 mg mL⁻¹, and LOQ = 0.00825 mg mL⁻¹.
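The data reduction described in this section can be illustrated with a minimal sketch. It assumes that the calibration line has the form A = slope × c + intercept with c in mg mL⁻¹, and that the DES is treated as a single pseudo-component with an averaged molar mass. The function names, the dilution factor, and any density or molar mass passed in are illustrative assumptions, not values or code from this study.

```python
def concentration_from_absorbance(absorbance, slope, intercept, dilution=1.0):
    """Invert a linear calibration A = slope * c + intercept.

    Returns the concentration (mg/mL) of the undiluted sample; `dilution`
    is a hypothetical dilution factor applied before the UV-Vis reading.
    """
    return (absorbance - intercept) / slope * dilution


def mole_fraction(c_mg_per_ml, density_g_per_ml, m_solute, m_solvent):
    """Convert a mass concentration into a solute mole fraction.

    Works on 1 mL of saturated solution: the solution mass equals the
    density, and the solvent mass is what remains after subtracting the
    dissolved solute. Molar masses are in g/mol; `m_solvent` is assumed to
    be the averaged molar mass of the solute-free DES.
    """
    mass_solute = c_mg_per_ml / 1000.0            # g of solute in 1 mL
    mass_solvent = density_g_per_ml - mass_solute  # g of solvent in 1 mL
    n_solute = mass_solute / m_solute
    n_solvent = mass_solvent / m_solvent
    return n_solute / (n_solute + n_solvent)
```

For example, with the MEF calibration parameters reported above (slope 28.265, intercept −0.010), an absorbance reading would first be passed to `concentration_from_absorbance`, and the result, together with the measured density of the saturated solution, to `mole_fraction`.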

3.3. COSMO-RS Computations

Application of the COSMO-RS framework [77,78,79,80,81] requires appropriate representation of molecular diversity. This is achieved by performing conformational analysis prior to the determination of any thermodynamic properties. For this purpose, the default protocol was applied, taking advantage of the COSMOconf (version 2023, BIOVIA COSMOlogic)/TURBOMOLE (version 7.7, 2023, TURBOMOLE GmbH) tandem for the generation of the most representative structures for all solutes and solvent molecules. The applied protocol is consistent with previously published schemes [82,83,84]. For each molecule, up to ten low-energy conformations were determined for both gas and condensed phases, the latter accounting for solvent effects under the conductor-like screening model. The resulting “cosmo” and “energy” files were generated using the BP_TZVPD_FINE_24.ctd parameter set, essential for thermodynamic calculations in COSMOtherm, which requires application of the RI-BP/TZVP//TZVPD-FINE level of theory.

3.4. Molecular Descriptors

Two distinct sets of molecular descriptors were generated using the COSMO-RS theory. The first set of descriptors comprised interaction energies from solubility calculations [82,83,84]. While the standard iterative solubility protocol is typically sufficient, it frequently yields erroneous predictions of complete miscibility for highly soluble solutes in DES systems [85,86,87,88]. For these cases, complete solid–liquid equilibrium (SLE) calculations were required. Requisite thermodynamic fusion data for the solid solutes, including melting temperature, Tm, and enthalpy of fusion, ΔHfus, were obtained by averaging the available literature values [89]. The heat capacity of fusion was approximated as constant, ΔCp,fus ≈ ΔSfus ≈ ΔHfus/Tm. The resulting Gibbs free energy of fusion values, ΔGfus = ΔHfus − TΔSfus, utilized in the calculations are provided in the Supplementary Materials. The COSMO-RS output files yielded five primary descriptors for each solute: total intermolecular interaction energy, Eint,API; its constituent electrostatic misfit, Emisfit,API; hydrogen bonding, EHB,API; and van der Waals, EvdW,API, contributions, as well as the chemical potential, μAPI. Analogous descriptors for the DES (Eint,DES, Emisfit,DES, EHB,DES, EvdW,DES, and μDES) were calculated as the sum of the individual DES components, weighted by their respective molar fractions in the solute-free mixture. Relative descriptors, defined as the difference between the solute and DES values, were also included. Additionally, the solubility values computed with COSMO-RS, log(xAPICOSMO), were included.
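The arithmetic behind the fusion term and the DES-level descriptors can be sketched as follows. This is only an illustration under stated assumptions: the default temperature of 298.15 K mirrors the 25 °C measurements, the relative descriptor is taken as the solute value minus the DES value, and the weights are renormalized so that molar ratios can be passed directly. The function names are hypothetical.

```python
def gibbs_energy_of_fusion(dh_fus, t_m, t=298.15):
    """Gibbs free energy of fusion, dG = dH - T * dS, with dS = dH / Tm.

    dh_fus in J/mol, temperatures in K. The default T is an assumption
    matching the measurement temperature of 25 degrees C.
    """
    ds_fus = dh_fus / t_m
    return dh_fus - t * ds_fus


def des_descriptor(component_values, mole_fractions):
    """Mole-fraction-weighted sum of a descriptor over the DES components.

    The weights are renormalized, so molar ratios such as (1, 3) are
    accepted as well as true mole fractions.
    """
    total = sum(mole_fractions)
    return sum(v * x for v, x in zip(component_values, mole_fractions)) / total


def relative_descriptor(api_value, des_value):
    """Relative descriptor, assumed here to be solute minus DES."""
    return api_value - des_value
```

At T = Tm the fusion term vanishes, as expected for a solid in equilibrium with its melt, which is a quick sanity check on the sign convention.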
Apart from this, the final set of molecular descriptors was augmented with values derived from σ-potential distributions. The standard COSMO-RS output consisted of 61 data points covering the charge density range from −0.03 to +0.03 e/Å². Consistent with prior machine learning applications, these data were reduced by averaging values over intervals of 0.005 e/Å². This process resulted in a 12-step function defining three characteristic regions of the σ-potential: hydrogen bond donor (HBD1–4, from −0.03 to −0.01 e/Å²), hydrophobicity (HH1–4, from −0.01 to +0.01 e/Å²), and acceptability (hydrogen bond acceptor, HBA1–4, from +0.01 to +0.03 e/Å²). Consequently, four descriptors were generated for each region, leading to twenty-four descriptors of this type for the solute, the solvent, and the relative difference between them.
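A minimal sketch of the σ-potential reduction is given below. It assumes that the 61 points lie on an evenly spaced grid, so that each 0.005 e/Å² interval covers five consecutive points and the final interval also absorbs the closing grid point. How the shared interval edges were actually handled is not stated in the text, so this assignment is an assumption, as is the function name.

```python
REGIONS = ("HBD", "HH", "HBA")  # donor, hydrophobic, acceptor regions


def bin_sigma_potential(mu_values, n_bins=12):
    """Average a 61-point sigma-potential into a 12-step function.

    Returns a dict keyed HBD1..HBD4, HH1..HH4, HBA1..HBA4, ordered from
    the most negative to the most positive charge density.
    """
    if len(mu_values) != 61:
        raise ValueError("expected 61 sigma-potential points")
    per_bin = (len(mu_values) - 1) // n_bins  # 5 grid points per interval
    descriptors = {}
    for b in range(n_bins):
        start = b * per_bin
        # Assumption: the last interval also takes the final grid point.
        stop = start + per_bin if b < n_bins - 1 else len(mu_values)
        chunk = mu_values[start:stop]
        name = f"{REGIONS[b // 4]}{b % 4 + 1}"
        descriptors[name] = sum(chunk) / len(chunk)
    return descriptors
```

Applying the function to the solute and to the mole-fraction-weighted solvent profile, and then subtracting the two results key by key, would give the relative descriptors such as ΔHBA1 or ΔHH1 used in the models.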

3.5. Dataset

The values of molecular descriptors were computed for all components of studied systems, including pharmaceutical acids and DES constituents. The set of solutes included compounds for which new solubility measurements were included in this paper, namely mefenamic acid and niflumic acid. In addition, the values of the already published solubility data of ketoprofen, ibuprofen, ferulic acid, probenecid, caffeic acid, p-coumaric acid, syringic acid, and flufenamic acid in DESs were included [76,82,83,90]. In total, the dataset comprised N = 1020 mole fractions at saturated conditions in choline chloride-, betaine-, and menthol-based DESs with a variety of proportions of different HBA counterparts. Both dry and water-diluted systems were included if available. All data, including solubility values, fusion data, computed solubility, and all molecular descriptors, are available in the Supplementary Materials.

3.6. Machine Learning Protocol

3.6.1. Core Algorithm and Data Preprocessing

The machine learning workflow was centered on the nu-support vector regression (nuSVR) algorithm [91], chosen for its demonstrated ability to effectively model complex, non-linear relationships often present in QSPR studies [92,93,94]. To handle these non-linearities, the Radial Basis Function (RBF) kernel was selected. The RBF kernel is a powerful and flexible choice, capable of mapping features into an infinite-dimensional space, which allows it to model intricate decision boundaries while requiring the tuning of only a single parameter: gamma. The optimization of the nuSVR hyperparameters was conducted as follows. The regularization parameter C and the nu parameter were directly optimized. The kernel coefficient gamma, which dictates the influence of each support vector, was optimized via a guided, data-driven strategy. For each optimization cycle, a baseline gamma_base value was heuristically determined from the median pairwise squared Euclidean distance of the training data subset [95,96]. The optimizer then refined this anchor by searching for an optimal logarithmic scaling factor. This approach focused the search on a physically relevant scale, enhancing optimization efficiency. Prior to model training, two standard preprocessing steps were performed. First, the full dataset (N = 1020) was partitioned into a training set (80%) and a held-out test set (20%). The split was seeded with a random number drawn anew for every run, so that different splits were explored across runs while each individual split remained reproducible from its seed. Second, all molecular descriptors in the training set were standardized by removing the mean and scaling to unit variance using the StandardScaler from scikit-learn [97,98]. As SVR algorithms are sensitive to feature scaling, this step ensured that no single descriptor disproportionately influenced the model due to its magnitude. The same scaling transformation was subsequently applied to the test set.
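The preprocessing and the gamma anchoring can be illustrated with a small dependency-free sketch. The exact formula for gamma_base is not given in the text; the code assumes the common median heuristic, gamma_base = 1 / median(squared distance), and a base-10 scaling factor. The standardization mimics what StandardScaler does (population standard deviation, zero-variance columns left unscaled) but is written out by hand here; all function names are illustrative.

```python
import statistics
from itertools import combinations


def fit_scaler(train_rows):
    """Column means and population standard deviations of the training set."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply training-set statistics to any set of rows (train or test)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def gamma_base(rows):
    """Assumed median heuristic: reciprocal of the median squared distance."""
    d2 = [sum((a - b) ** 2 for a, b in zip(r1, r2))
          for r1, r2 in combinations(rows, 2)]
    return 1.0 / statistics.median(d2)


def gamma_from_scale(base, log10_scale):
    """Refine the anchor with a logarithmic (base-10, assumed) scaling factor."""
    return base * 10.0 ** log10_scale
```

In such a setup the optimizer would search over `log10_scale` (for instance within a few orders of magnitude around zero) rather than over gamma itself, which keeps the search centered on the length scale of the standardized training data.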

3.6.2. Dual-Objective Optimization Protocol

To explicitly manage the inherent trade-off between model accuracy and simplicity, a dual-objective optimization (DOO) strategy was implemented using the Optuna framework (v. 3.2) [99,100,101]. The TPE sampler within Optuna was configured to simultaneously minimize two competing objectives, which were evaluated using a 5-fold cross-validation scheme on the training data. The first objective was predictive accuracy, quantified by the Mean Absolute Error (MAE). The second objective was model complexity, quantified by the mean support vector (SV) ratio. The SV ratio was calculated for each fold as the number of support vectors divided by the number of training samples in that fold, providing an intrinsic measure of complexity for nuSVR models. A model with a lower SV ratio was considered more parsimonious.
Hence, the outcome of dual-objective optimization is a set of solutions forming a Pareto front. This front consists exclusively of non-dominated solutions. A solution is considered non-dominated if no other solution exists that is superior in one objective without being inferior in the other. In other words, to improve a non-dominated solution with respect to one objective, a trade-off in the form of a degradation in the other objective must be accepted. Conversely, a dominated solution is one for which at least one other solution exists that offers better performance in one objective while being no worse in the other, making it an objectively suboptimal choice.
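The notion of dominance defined above can be made concrete with a short sketch. It uses the standard definition for two minimized objectives, here interpreted as (MAE, SV ratio) pairs; the function names and the numerical values in the usage note are illustrative only.

```python
def dominates(a, b):
    """True if solution a dominates b when all objectives are minimized.

    a must be no worse than b in every objective and strictly better in
    at least one of them.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the non-dominated subset of a list of objective tuples."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]
```

For hypothetical candidates (0.10, 0.9), (0.12, 0.6), (0.15, 0.4), (0.13, 0.7), and (0.20, 0.4), the first three form the front: the fourth is beaten on both objectives by (0.12, 0.6), and the fifth is no simpler than (0.15, 0.4) while being less accurate.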

3.6.3. Iterative Model Refinement and Candidate Selection

The framework employs an iterative backward pruning methodology to integrate feature selection directly into the optimization process. This automatic procedure therefore relies on both Dual-Objective Optimization and Iterative feature pruning (DOO-IT). The procedure begins with the complete descriptor set. A full DOO is executed, producing a Pareto front of non-dominated models. From this front, a single candidate model for the current iteration is selected, governed by the one standard error (1-SE) rule [102,103]. This involves identifying the most accurate model on the front and defining a performance threshold based on its standard error; the simplest model (lowest SV ratio) within this threshold is then chosen. Once a candidate is selected, its features are ranked based on permutation importance with 10 repeats [104]. The least impactful descriptor is then eliminated, and the procedure repeats with a new, full DOO on the reduced feature set. This cycle continues until a specified minimum number of features is reached, generating a series of robust, parsimonious candidate models at each level of complexity. The independent runs were performed 50 times to explore the hyperparameter space over a variety of train–test splits.
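One pruning iteration can be sketched as below. The sketch assumes that each Pareto-front candidate carries its per-fold cross-validation MAE values and its mean SV ratio, and that the standard error is the sample standard deviation of the fold errors divided by the square root of the number of folds. The dictionary keys and function names are hypothetical and do not reproduce the authors' implementation.

```python
import statistics


def select_one_se(candidates):
    """Pick the simplest candidate within one standard error of the best.

    Each candidate is a dict with 'fold_mae' (list of per-fold MAE values)
    and 'sv_ratio' (mean support vector ratio, lower means simpler).
    """
    def mean_mae(c):
        return statistics.fmean(c["fold_mae"])

    best = min(candidates, key=mean_mae)
    se = statistics.stdev(best["fold_mae"]) / len(best["fold_mae"]) ** 0.5
    threshold = mean_mae(best) + se
    eligible = [c for c in candidates if mean_mae(c) <= threshold]
    return min(eligible, key=lambda c: c["sv_ratio"])


def prune_least_important(importances):
    """Drop the descriptor with the lowest permutation importance.

    `importances` maps descriptor name to importance; the remaining names
    are returned in their original order.
    """
    weakest = min(importances, key=importances.get)
    return [name for name in importances if name != weakest]
```

Repeating these two steps, with a fresh dual-objective optimization before each selection, until the minimum number of features is reached yields one candidate model per descriptor count.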

3.6.4. Final Model Selection Using a Multi-Criteria Selection Scheme

In our previous paper [105], the final model selection was performed by plotting the corrected Akaike Information Criterion (AICc) [105,106,107,108] against the number of descriptors. However, the AICc is not well established for nuSVR regressors and introduces ambiguity. Hence, the theoretical criterion was replaced in this paper with a rigorous data-driven multi-criteria framework that objectively selects the single best model from the family of candidates generated by the iterative procedure, balancing fit and parsimony while ensuring robust predictive performance and chemical interpretability. After completing all the independent runs of the DOO-IT procedure, which generated numerous candidate models across varying descriptor complexities through 40 independent 80/20 training–test splits, a three-tiered selection strategy was applied. First, architectural optimization identified the optimal descriptor count through stability analysis, selecting models that consistently appeared across multiple runs while maintaining performance within one standard error of the minimum test MAE. This approach was inspired by the one standard error rule but was enhanced with empirical stability thresholds (≥30% frequency per descriptor count), ensuring parsimonious model selection without relying on theoretically problematic information criteria. For the final model deployment, a composite scoring system was used that balanced predictive accuracy (50% weight), explanatory power (30% weight via R²), and generalization capability (20% weight via train–test performance gaps). From the architecturally optimal descriptor count, the specific model instance was selected that maximized the composite score while demonstrating high descriptor stability, that is, prioritizing molecular features consistently appearing across independent runs. 
This dual emphasis on both model architecture and specific feature set ensured that the deployed model not only preserved predictive performance but also embodied chemically meaningful and reproducible descriptor combinations. The final model was validated through comprehensive residual analysis, applicability domain assessment, and external validation where available, providing a transparent, empirically grounded foundation for practical solubility prediction in pharmaceutical and chemical development applications.
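The 50/30/20 composite score described above can be sketched as follows. The text does not state how the three terms are put on a common scale, so the sketch assumes min-max normalization across the candidate models, with MAE and the train–test gap inverted so that a higher score is always better. The dictionary keys, the function names, and the use of the absolute MAE difference as the gap are assumptions.

```python
def _minmax(values, higher_is_better):
    """Scale values to [0, 1]; invert when lower raw values are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def composite_scores(models, weights=(0.5, 0.3, 0.2)):
    """Weighted score from accuracy, R2, and train-test gap.

    Each model is a dict with 'mae_test', 'r2_test', and 'mae_train'.
    The default weights follow the 50/30/20 split stated in the text.
    """
    acc = _minmax([m["mae_test"] for m in models], higher_is_better=False)
    r2 = _minmax([m["r2_test"] for m in models], higher_is_better=True)
    gap = _minmax([abs(m["mae_test"] - m["mae_train"]) for m in models],
                  higher_is_better=False)
    w_acc, w_r2, w_gap = weights
    return [w_acc * a + w_r2 * r + w_gap * g for a, r, g in zip(acc, r2, gap)]
```

The model instance with the highest score among those at the architecturally optimal descriptor count would then be the deployment candidate, subject to the descriptor stability check.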
The DOO-IT framework was implemented as a fully automated pipeline using Python 3.10 [109] with the scikit-learn [110], Optuna [101], and pandas [111] libraries. To rigorously assess solution stability, the entire procedure was repeated fifteen independent times. Each dual-objective optimization within this process was configured to run for 2000 trials, ensuring a comprehensive exploration of the solution space.

4. Conclusions

This study addresses the challenge of solubility prediction, a problem of central importance in pharmaceutical and green chemistry research. Accurate predictive models provide a powerful tool to reduce experimental workload, accelerate drug development pipelines, and enable the rational design of novel solvent systems such as DESs that combine efficiency with environmental compatibility. The Dual-Objective Optimization with Iterative feature pruning (DOO-IT) framework was applied for this task.
This study demonstrates that stability analysis of the DOO-IT framework uncovers not a single global optimum but two distinct regions of predictive excellence for modeling the solubility of pharmaceutical acids in deep eutectic solvents. On one side of the solution landscape lies an ultra-parsimonious eight-descriptor model that combines high predictive performance with minimal computational cost. This model integrates COSMO-RS logarithmic solubility as an anchor descriptor, which enables it to correct systematic deviations at solubility extremes and deliver near-perfect agreement with experimental values. By revealing two complementary “basins of excellence,” our analysis highlights the versatility of the DOO-IT framework in identifying multiple scientifically meaningful optima that balance accuracy, parsimony, and interpretability. The findings also extend our previous works, where a single model was sufficient to describe a narrower chemical space. In the present, more diverse dataset, the appearance of dual optimal regimes underscores the importance of tailoring model complexity to the scope of the prediction task. Taken together, these results suggest a pragmatic two-tiered strategy for future studies of solubility in deep eutectic solvents and related systems. Initial high-throughput screening can be effectively performed with the COSMO-RS-free parsimonious model, while subsequent high-precision evaluation of promising candidates can benefit from the expanded descriptor set that incorporates COSMO-RS calculations. This workflow balances efficiency with accuracy, making it possible to explore broader chemical spaces without sacrificing predictive reliability. 
Looking forward, validating the transferability of both models to other classes of solvents, as well as developing ensemble or adaptive strategies that dynamically combine parsimonious and high-performance regimes, will further enhance the applicability of this approach in green chemistry and pharmaceutical design.
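The two-tiered strategy described in these conclusions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `two_tier_screen`, the toy models, and the candidate labels are hypothetical stand-ins, and the only idea taken from the text is that a cheap COSMO-RS-free model screens all candidates before a costlier model re-scores a shortlist.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A "model" is any callable mapping a descriptor vector to a predicted log10(x).
Model = Callable[[Sequence[float]], float]


def two_tier_screen(
    candidates: Dict[str, Tuple[Sequence[float], Sequence[float]]],
    fast_model: Model,
    precise_model: Model,
    top_k: int,
) -> List[Tuple[str, float]]:
    """Screen all candidates with the cheap model, then re-score the best top_k.

    `candidates` maps a solvent-system label to a pair
    (cheap_descriptors, full_descriptors); both are hypothetical inputs.
    """
    # Tier 1: rank every candidate with the inexpensive model.
    coarse = sorted(
        ((name, fast_model(cheap)) for name, (cheap, _full) in candidates.items()),
        key=lambda item: item[1],
        reverse=True,  # higher log10(x) means higher predicted solubility
    )
    shortlist = [name for name, _score in coarse[:top_k]]

    # Tier 2: re-evaluate only the shortlist with the costlier model.
    refined = [(name, precise_model(candidates[name][1])) for name in shortlist]
    return sorted(refined, key=lambda item: item[1], reverse=True)


# Toy stand-ins for two trained regressors (illustrative only, not fitted models).
def toy_fast(d: Sequence[float]) -> float:
    return -2.0 + 0.5 * d[0]


def toy_precise(d: Sequence[float]) -> float:
    return -2.0 + 0.5 * d[0] + 0.1 * d[1]


toy_candidates = {
    "DES-A": ([1.0], [1.0, 0.0]),
    "DES-B": ([3.0], [3.0, -2.0]),
    "DES-C": ([2.0], [2.0, 4.0]),
}
ranking = two_tier_screen(toy_candidates, toy_fast, toy_precise, top_k=2)
```

In this toy case the second tier reorders the shortlist, which is the situation where the extra cost of the richer descriptor set would pay off.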

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/molecules30224361/s1. Table S1. Mole-fraction solubility of mefenamic acid in choline chloride- or menthol-containing deep eutectic solvents (DESs). Table S2. Mole-fraction solubility of niflumic acid in choline chloride- or menthol-containing deep eutectic solvents (DESs). Table S3. List of descriptors used for model development. Table S4. List of references for the solubility data [76,82,83,90].

Author Contributions

Conceptualization, P.C.; methodology, P.C., T.J. and M.P.; software, P.C.; validation, P.C., T.J. and M.P.; formal analysis, P.C., T.J., J.G., A.K. and M.P.; investigation, P.C., T.J., J.G., A.K. and M.P.; resources, P.C., T.J. and M.P.; data curation, P.C., T.J. and M.P.; writing—original draft preparation, P.C., T.J. and M.P.; writing—review and editing, P.C., T.J. and M.P.; visualization, P.C., T.J. and M.P.; supervision, P.C.; project administration, P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed to the corresponding author.

Acknowledgments

We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH, WCSS) for providing computer facilities and support within computational grant no. PLG/2025/018825.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lamberth, C.; Dinges, J. Different Roles of Carboxylic Functions in Pharmaceuticals and Agrochemicals. In Bioactive Carboxylic Compound Classes: Pharmaceuticals and Agrochemicals; Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim, Germany, 2016; pp. 1–11. [Google Scholar]
  2. Bharate, S.S. Carboxylic Acid Counterions in FDA-Approved Pharmaceutical Salts. Pharm. Res. 2021, 38, 1307–1326. [Google Scholar] [CrossRef] [PubMed]
  3. Bagby, M.O.; Johnson, R.W.; Daniels, R.W.; Contrell, R.R.; Sauer, E.T.; Keenan, M.J.; Krevalis, M.A. Carboxylic Acids. In Encyclopedia of Chemical Technology, 5th ed.; John Wiley & Sons: Hoboken, NJ, USA, 2003. [Google Scholar] [CrossRef]
  4. Ballatore, C.; Huryn, D.M.; Smith, A.B. Carboxylic Acid (Bio)Isosteres in Drug Design. ChemMedChem 2013, 8, 385. [Google Scholar] [CrossRef]
  5. Kumar, N.; Goel, N. Phenolic acids: Natural versatile molecules with promising therapeutic applications. Biotechnol. Rep. 2019, 24, e00370. [Google Scholar] [CrossRef]
  6. Arfeen, M.; Srivastava, A.; Srivastava, N.; Khan, R.A.; Almahmoud, S.A.; Mohammed, H.A. Design, classification, and adverse effects of NSAIDs: A review on recent advancements. Bioorg. Med. Chem. 2024, 112, 117899. [Google Scholar] [CrossRef]
  7. Panchal, N.K.; Prince Sabina, E. Non-steroidal anti-inflammatory drugs (NSAIDs): A current insight into its molecular mechanism eliciting organ toxicities. Food Chem. Toxicol. 2023, 172, 113598. [Google Scholar] [CrossRef] [PubMed]
  8. Cimolai, N. The potential and promise of mefenamic acid. Expert Rev. Clin. Pharmacol. 2013, 6, 289–305. [Google Scholar] [CrossRef]
  9. Mustafa, H.; Daud, S.; Sheraz, S.; Bibi, M.; Ahmad, T.; Sardar, A.; Fazal, T.; Khan, A.; Abid, O.U.R. The Chemistry and Bioactivity of Mefenamic Acid Derivatives: A Review of Recent Advances. Arch. Pharm. 2025, 358, e70004. [Google Scholar] [CrossRef] [PubMed]
  10. Jahromi, B.N.; Tartifizadeh, A.; Khabnadideh, S. Comparison of fennel and mefenamic acid for the treatment of primary dysmenorrhea. Int. J. Gynecol. Obstet. 2003, 80, 153–157. [Google Scholar] [CrossRef]
  11. Balderas, E.; Ateaga-Tlecuitl, R.; Rivera, M.; Gomora, J.C.; Darszon, A. Niflumic acid blocks native and recombinant T-type channels. J. Cell. Physiol. 2012, 227, 2542–2555. [Google Scholar] [CrossRef]
  12. Nakano, T.; Inoue, H.; Fukuyama, S.; Matsumoto, K.; Matsumura, M.; Tsuda, M.; Matsumoto, T.; Aizawa, H.; Nakanishi, Y. Niflumic Acid Suppresses Interleukin-13–induced Asthma Phenotypes. Am. J. Respir. Crit. Care Med. 2006, 173, 1216–1221. [Google Scholar] [CrossRef]
  13. Abdelbari, M.A.; El-Gazar, A.A.; Abdelbary, A.A.; Elshafeey, A.H.; Mosallam, S. Investigating the potential of novasomes in improving the trans-tympanic delivery of niflumic acid for effective treatment of acute otitis media. J. Drug Deliv. Sci. Technol. 2024, 98, 105912. [Google Scholar] [CrossRef]
  14. Adam, A.; Schrimpl, L.; Schmidt, P.C. Some Physicochemical Properties of Mefenamic Acid. Drug Dev. Ind. Pharm. 2000, 26, 477–487. [Google Scholar] [CrossRef]
  15. Takács-Novák, K.; Szoke, V.; Völgyi, G.; Horváth, P.; Ambrus, R.; Szabó-Révész, P. Biorelevant solubility of poorly soluble drugs: Rivaroxaban, furosemide, papaverine and niflumic acid. J. Pharm. Biomed. Anal. 2013, 83, 279–285. [Google Scholar] [CrossRef]
  16. Ullah, I.; Baloch, M.K.; Ullah, I.; Mustaqeem, M. Enhancement in aqueous solubility of Mefenamic acid using micellar solutions of various surfactants. J. Solution Chem. 2014, 43, 1360–1373. [Google Scholar] [CrossRef]
  17. Guinamard, R.; Simard, C.; Del Negro, C. Flufenamic acid as an ion channel modulator. Pharmacol. Ther. 2013, 138, 272. [Google Scholar] [CrossRef]
  18. Madhavan, M.; Hwang, G.C.C. Design and evaluation of transdermal flufenamic acid delivery system. Drug Dev. Ind. Pharm. 1992, 18, 617–626. [Google Scholar] [CrossRef]
  19. Chi, Y.; Li, K.; Yan, Q.; Koizumi, S.; Shi, L.; Takahashi, S.; Zhu, Y.; Matsue, H.; Takeda, M.; Kitamura, M.; et al. Nonsteroidal Anti-Inflammatory Drug Flufenamic Acid Is a Potent Activator of AMP-Activated Protein Kinase. J. Pharmacol. Exp. Ther. 2011, 339, 257–266. [Google Scholar] [CrossRef] [PubMed]
  20. Moses, V.S.; Bertone, A.L. Nonsteroidal anti-inflammatory drugs. Vet. Clin. N. Am. Equine Pract. 2002, 18, 21–37. [Google Scholar] [CrossRef] [PubMed]
  21. Vane, J.R.; Botting, R.M. Mechanism of Action of Nonsteroidal Anti-inflammatory Drugs. Am. J. Med. 1998, 104, 2S–8S. [Google Scholar] [CrossRef] [PubMed]
  22. Ghanim, A.M.; Girgis, A.S.; Kariuki, B.M.; Samir, N.; Said, M.F.; Abdelnaser, A.; Nasr, S.; Bekheit, M.S.; Abdelhameed, M.F.; Almalki, A.J.; et al. Design and synthesis of ibuprofen-quinoline conjugates as potential anti-inflammatory and analgesic drug candidates. Bioorg. Chem. 2022, 119, 105557. [Google Scholar] [CrossRef]
  23. Maleškić Kapo, S.; Rakanović-Todić, M.; Burnazović-Ristić, L.; Kusturica, J.; Kulo Ćesić, A.; Ademović, E.; Loga-Zec, S.; Sarač-Hadžihalilović, A.; Aganović-Mušinović, I. Analgesic and anti-inflammatory effects of diclofenac and ketoprofen patches in two different rat models of acute inflammation. J. King Saud Univ.-Sci. 2023, 35, 102394. [Google Scholar] [CrossRef]
  24. Wang, Y.; Han, Q.; Zhang, H.; Yan, Y. Evaluation of the binding interactions of p-acetylaminophenol, aspirin, ibuprofen and aminopyrine with norfloxacin from the view of antipyretic and anti-inflammatory. J. Mol. Liq. 2020, 312, 113397. [Google Scholar] [CrossRef]
  25. García-Rodríguez, C.; Mujica, P.; Illanes-González, J.; López, A.; Vargas, C.; Sáez, J.C.; González-Jamett, A.; Ardiles, Á.O. Probenecid, an Old Drug with Potential New Uses for Central Nervous System Disorders and Neuroinflammation. Biomedicines 2023, 11, 1516. [Google Scholar] [CrossRef]
  26. Robbins, N.; Koch, S.E.; Tranter, M.; Rubinstein, J. The history and future of probenecid. Cardiovasc. Toxicol. 2012, 12, 1–9. [Google Scholar] [CrossRef]
  27. Robbins, R.J. Phenolic acids in foods: An overview of analytical methodology. J. Agric. Food Chem. 2003, 51, 2866–2887. [Google Scholar] [CrossRef]
  28. Shimsa, S.; Mondal, S.; Mini, S. Syringic acid: A promising phenolic phytochemical with extensive therapeutic applications. R&D Funct. Food Prod. 2024, 1, 1–14. [Google Scholar]
  29. Boz, H. p-Coumaric acid in cereals: Presence, antioxidant and antimicrobial effects. Int. J. Food Sci. Technol. 2015, 50, 2323–2328. [Google Scholar] [CrossRef]
  30. Pei, K.; Ou, J.; Huang, J.; Ou, S. p-Coumaric acid and its conjugates: Dietary sources, pharmacokinetic properties and biological activities. J. Sci. Food Agric. 2016, 96, 2952–2962. [Google Scholar] [CrossRef] [PubMed]
  31. Al Jitan, S.; Alkhoori, S.A.; Yousef, L.F. Phenolic Acids from Plants: Extraction and Application to Human Health. Stud. Nat. Prod. Chem. 2018, 58, 389–417. [Google Scholar] [CrossRef]
  32. Dong, X.; Huang, R. Ferulic acid: An extraordinarily neuroprotective phenolic acid with anti-depressive properties. Phytomedicine 2022, 105, 154355. [Google Scholar] [CrossRef] [PubMed]
  33. Ali, S.A.; Saifi, M.A.; Pulivendala, G.; Godugu, C.; Talla, V. Ferulic acid ameliorates the progression of pulmonary fibrosis via inhibition of TGF-β/smad signalling. Food Chem. Toxicol. 2021, 149, 111980. [Google Scholar] [CrossRef] [PubMed]
  34. Sroka, Z.; Cisowski, W. Hydrogen peroxide scavenging, antioxidant and anti-radical activity of some phenolic acids. Food Chem. Toxicol. 2003, 41, 753–758. [Google Scholar] [CrossRef]
  35. Khan, F.A.; Maalik, A.; Murtaza, G. Inhibitory mechanism against oxidative stress of caffeic acid. J. Food Drug Anal. 2016, 24, 695–702. [Google Scholar] [CrossRef]
  36. Cizmarova, B.; Hubkova, B.; Bolerazska, B.; Marekova, M.; Birkova, A. Caffeic acid: A brief overview of its presence, metabolism, and bioactivity. Bioact. Compd. Health Dis. 2020, 3, 74–81. [Google Scholar] [CrossRef]
  37. Srinivasulu, C.; Ramgopal, M.; Ramanjaneyulu, G.; Anuradha, C.M.; Suresh Kumar, C. Syringic acid (SA)—A Review of Its Occurrence, Biosynthesis, Pharmacological and Industrial Importance. Biomed. Pharmacother. 2018, 108, 547–557. [Google Scholar] [CrossRef] [PubMed]
  38. Ogut, E.; Armagan, K.; Gül, Z. The role of syringic acid as a neuroprotective agent for neurodegenerative disorders and future expectations. Metab. Brain Dis. 2022, 37, 859–880. [Google Scholar] [CrossRef]
  39. Savjani, K.T.; Gajjar, A.K.; Savjani, J.K. Drug solubility: Importance and enhancement techniques. ISRN Pharm. 2012, 2012, 195727. [Google Scholar] [CrossRef]
  40. Martínez, F.; Jouyban, A.; Acree, W.E. Pharmaceuticals solubility is still nowadays widely studied everywhere. Pharm. Sci. 2017, 23, 1–2. [Google Scholar] [CrossRef]
  41. Jain, S.; Patel, N.; Lin, S. Solubility and dissolution enhancement strategies: Current understanding and recent trends. Drug Dev. Ind. Pharm. 2015, 41, 875–887. [Google Scholar] [CrossRef]
  42. Rashid, M.; Malik, M.Y.; Singh, S.K.; Chaturvedi, S.; Gayen, J.R.; Wahajuddin, M. Bioavailability Enhancement of Poorly Soluble Drugs: The Holy Grail in Pharma Industry. Curr. Pharm. Des. 2019, 25, 987–1020. [Google Scholar] [CrossRef]
  43. Bhattachar, S.N.; Deschenes, L.A.; Wesley, J.A. Solubility: It’s not just for physical chemists. Drug Discov. Today 2006, 11, 1012–1018. [Google Scholar] [CrossRef] [PubMed]
  44. Coltescu, A.R.; Butnariu, M.; Sarac, I. The importance of solubility for new drug molecules. Biomed. Pharmacol. J. 2020, 13, 577–583. [Google Scholar] [CrossRef]
  45. Chaturvedi, K.; Shah, H.S.; Nahar, K.; Dave, R.; Morris, K.R. Contribution of Crystal Lattice Energy on the Dissolution Behavior of Eutectic Solid Dispersions. ACS Omega 2020, 5, 9690–9701. [Google Scholar] [CrossRef] [PubMed]
  46. Censi, R.; Di Martino, P. Polymorph Impact on the Bioavailability and Stability of Poorly Soluble Drugs. Molecules 2015, 20, 18759–18776. [Google Scholar] [CrossRef]
  47. Chmiel, K.; Knapik-Kowalczuk, J.; Paluch, M. How does the high pressure affects the solubility of the drug within the polymer matrix in solid dispersion systems. Eur. J. Pharm. Biopharm. 2019, 143, 8–17. [Google Scholar] [CrossRef]
  48. Singh, D.; Bedi, N.; Tiwary, A.K. Enhancing solubility of poorly aqueous soluble drugs: Critical appraisal of techniques. J. Pharm. Investig. 2018, 48, 509–526. [Google Scholar] [CrossRef]
  49. Mahesha, B.S.; Sheeba, F.R.; Deepak, H.K. A comprehensive review of green approaches to drug solubility enhancement. Drug Dev. Ind. Pharm. 2025, 51, 659–669. [Google Scholar] [CrossRef]
  50. Smith, E.L.; Abbott, A.P.; Ryder, K.S. Deep Eutectic Solvents (DESs) and Their Applications. Chem. Rev. 2014, 114, 11060–11082. [Google Scholar] [CrossRef]
  51. Hansen, B.B.; Spittle, S.; Chen, B.; Poe, D.; Zhang, Y.; Klein, J.M.; Horton, A.; Adhikari, L.; Zelovich, T.; Doherty, B.W.; et al. Deep Eutectic Solvents: A Review of Fundamentals and Applications. Chem. Rev. 2021, 121, 1232–1285. [Google Scholar] [CrossRef]
  52. El Achkar, T.; Greige-Gerges, H.; Fourmentin, S. Basics and properties of deep eutectic solvents: A review. Environ. Chem. Lett. 2021, 19, 3397–3408. [Google Scholar] [CrossRef]
  53. Paiva, A.; Craveiro, R.; Aroso, I.; Martins, M.; Reis, R.L.; Duarte, A.R.C. Natural Deep Eutectic Solvents—Solvents for the 21st Century. ACS Sustain. Chem. Eng. 2014, 2, 1063–1071. [Google Scholar] [CrossRef]
  54. Liu, Y.; Friesen, J.B.; McAlpine, J.B.; Lankin, D.C.; Chen, S.N.; Pauli, G.F. Natural Deep Eutectic Solvents: Properties, Applications, and Perspectives. J. Nat. Prod. 2018, 81, 679–690. [Google Scholar] [CrossRef]
  55. Omar, K.A.; Sadeghi, R. Physicochemical properties of deep eutectic solvents: A review. J. Mol. Liq. 2022, 360, 119524. [Google Scholar] [CrossRef]
  56. Emami, S.; Shayanfar, A. Deep eutectic solvents for pharmaceutical formulation and drug delivery applications. Pharm. Dev. Technol. 2020, 25, 779–796. [Google Scholar] [CrossRef]
  57. Shah, P.A.; Chavda, V.; Hirpara, D.; Sharma, V.S.; Shrivastav, P.S.; Kumar, S. Exploring the potential of deep eutectic solvents in pharmaceuticals: Challenges and opportunities. J. Mol. Liq. 2023, 390, 123171. [Google Scholar] [CrossRef]
  58. Kalantri, S.; Vora, A. Eutectic solutions for healing: A comprehensive review on therapeutic deep eutectic solvents (TheDES). Drug Dev. Ind. Pharm. 2024, 50, 387–400. [Google Scholar] [CrossRef]
  59. Abdelquader, M.M.; Li, S.; Andrews, G.P.; Jones, D.S. Therapeutic deep eutectic solvents: A comprehensive review of their thermodynamics, microstructure and drug delivery applications. Eur. J. Pharm. Biopharm. 2023, 186, 85–104. [Google Scholar] [CrossRef] [PubMed]
  60. Raevsky, O.A.; Grigorev, V.Y.; Polianczyk, D.E.; Raevskaja, O.E.; Dearden, J.C. Aqueous Drug Solubility: What Do We Measure, Calculate and QSPR Predict? Mini-Rev. Med. Chem. 2019, 19, 362–372. [Google Scholar] [CrossRef]
  61. Fowles, D.J.; Connaughton, B.J.; Carter, J.W.; Mitchell, J.B.O.; Palmer, D.S. Physics-Based Solubility Prediction for Organic Molecules. Chem. Rev. 2025, 125, 7057–7098. [Google Scholar] [CrossRef]
  62. Boobier, S.; Hose, D.R.J.; Blacker, A.J.; Nguyen, B.N. Machine learning with physicochemical relationships: Solubility prediction in organic solvents and water. Nat. Commun. 2020, 11, 5753. [Google Scholar] [CrossRef] [PubMed]
  63. Ghanavati, M.A.; Ahmadi, S.; Rohani, S. A machine learning approach for the prediction of aqueous solubility of pharmaceuticals: A comparative model and dataset analysis. Digit. Discov. 2024, 3, 2085–2104. [Google Scholar] [CrossRef]
  64. Liu, P.; Long, W. Current Mathematical Methods Used in QSAR/QSPR Studies. Int. J. Mol. Sci. 2009, 10, 1978–1998. [Google Scholar] [CrossRef] [PubMed]
  65. Toropov, A.A.; Toropova, A.P. QSPR/QSAR: State-of-Art, Weirdness, the Future. Molecules 2020, 25, 1292. [Google Scholar] [CrossRef] [PubMed]
  66. Palmer, D.S.; O’Boyle, N.M.; Glen, R.C.; Mitchell, J.B.O. Random forest models to predict aqueous solubility. J. Chem. Inf. Model. 2007, 47, 150–158. [Google Scholar] [CrossRef] [PubMed]
  67. Ahmad, W.; Tayara, H.; Shim, H.J.; Chong, K.T. SolPredictor: Predicting Solubility with Residual Gated Graph Neural Network. Int. J. Mol. Sci. 2024, 25, 715. [Google Scholar] [CrossRef]
  68. Zheng, T.; Mitchell, J.B.O.; Dobson, S. Revisiting the Application of Machine Learning Approaches in Predicting Aqueous Solubility. ACS Omega 2024, 9, 35209–35222. [Google Scholar] [CrossRef]
  69. Ulrich, N.; Voigt, K.; Kudria, A.; Böhme, A.; Ebert, R.U. Prediction of the water solubility by a graph convolutional-based neural network on a highly curated dataset. J. Cheminform. 2025, 17, 55. [Google Scholar] [CrossRef]
  70. Wu, Y.-c.; Feng, J.-w. Development and Application of Artificial Neural Network. Wirel. Pers. Commun. 2018, 102, 1645–1656. [Google Scholar] [CrossRef]
  71. Panapitiya, G.; Girard, M.; Hollas, A.; Sepulveda, J.; Murugesan, V.; Wang, W.; Saldanha, E. Evaluation of Deep Learning Architectures for Aqueous Solubility Prediction. ACS Omega 2022, 28, 40. [Google Scholar] [CrossRef]
  72. Corso, G.; Stark, H.; Jegelka, S.; Jaakkola, T.; Barzilay, R. Graph neural networks. Nat. Rev. Methods Prim. 2024, 4, 17. [Google Scholar] [CrossRef]
  73. Kim, Y.; Jung, H.; Kumar, S.; Paton, R.S.; Kim, S. Designing solvent systems using self-evolving solubility databases and graph neural networks. Chem. Sci. 2024, 15, 923–939. [Google Scholar] [CrossRef]
  74. Tosca, E.M.; Bartolucci, R.; Magni, P. Application of artificial neural networks to predict the intrinsic solubility of drug-like molecules. Pharmaceutics 2021, 13, 1101. [Google Scholar] [CrossRef]
  75. Wang, S.; Di, J.; Wang, D.; Dai, X.; Hua, Y.; Gao, X.; Zheng, A.; Gao, J. State-of-the-Art Review of Artificial Neural Networks to Predict, Characterize and Optimize Pharmaceutical Formulation. Pharmaceutics 2022, 14, 183. [Google Scholar] [CrossRef]
  76. Cysewski, P.; Jeliński, T.; Kukwa, O.; Przybyłek, M. From Molecular Interactions to Solubility in Deep Eutectic Solvents: Exploring Flufenamic Acid in Choline-Chloride- and Menthol-Based Systems. Molecules 2025, 30, 3434. [Google Scholar] [CrossRef]
  77. Klamt, A. Conductor-like screening model for real solvents: A new approach to the quantitative calculation of solvation phenomena. J. Phys. Chem. 1995, 99, 2224–2235. [Google Scholar] [CrossRef]
  78. Klamt, A. COSMO-RS: From Quantum Chemistry to Fluid Phase Thermodynamics and Drug Design, 1st ed.; Elsevier: Amsterdam, The Netherlands, 2005; ISBN 9780444519948. [Google Scholar]
  79. Klamt, A.; Eckert, F.; Hornig, M.; Beck, M.E.; Bürger, T. Prediction of aqueous solubility of drugs and pesticides with COSMO-RS. J. Comput. Chem. 2002, 23, 275–281. [Google Scholar] [CrossRef] [PubMed]
  80. Klamt, A.; Eckert, F.; Arlt, W. COSMO-RS: An alternative to simulation for calculating thermodynamic properties of liquid mixtures. Annu. Rev. Chem. Biomol. Eng. 2010, 1, 101–122. [Google Scholar] [CrossRef] [PubMed]
  81. Dassault Systèmes. COSMOtherm, Version 24.0.0; BIOVIA: San Diego, CA, USA, 2024.
  82. Jeliński, T.; Przybyłek, M.; Różalski, R.; Romanek, K.; Wielewski, D.; Cysewski, P. Tuning Ferulic Acid Solubility in Choline-Chloride- and Betaine-Based Deep Eutectic Solvents: Experimental Determination and Machine Learning Modeling. Molecules 2024, 29, 3841. [Google Scholar] [CrossRef]
  83. Cysewski, P.; Jeliński, T.; Przybyłek, M.; Mai, A.; Kułak, J. Experimental and Machine-Learning-Assisted Design of Pharmaceutically Acceptable Deep Eutectic Solvents for the Solubility Improvement of Non-Selective COX Inhibitors Ibuprofen and Ketoprofen. Molecules 2024, 29, 2296. [Google Scholar] [CrossRef]
  84. Cysewski, P.; Jeliński, T.; Przybyłek, M. Exploration of the Solubility Hyperspace of Selected Active Pharmaceutical Ingredients in Choline- and Betaine-Based Deep Eutectic Solvents: Machine Learning Modeling and Experimental Validation. Molecules 2024, 29, 4894. [Google Scholar] [CrossRef]
  85. Cordova, I.W.; Teixeira, G.; Ribeiro-Claro, P.J.A.; Abranches, D.O.; Pinho, S.P.; Ferreira, O.; Coutinho, J.A.P. Using Molecular Conformers in COSMO-RS to Predict Drug Solubility in Mixed Solvents. Ind. Eng. Chem. Res. 2024, 63, 9565–9575. [Google Scholar] [CrossRef]
  86. Vilas-Boas, S.M.; Abranches, D.O.; Crespo, E.A.; Ferreira, O.; Coutinho, J.A.P.; Pinho, S.P. Experimental solubility and density studies on aqueous solutions of quaternary ammonium halides, and thermodynamic modelling for melting enthalpy estimations. J. Mol. Liq. 2020, 300, 112281. [Google Scholar] [CrossRef]
  87. Freire, M.G.; Carvalho, P.J.; Santos, L.M.N.B.F.; Gomes, L.R.; Marrucho, I.M.; Coutinho, J.A.P. Solubility of water in fluorocarbons: Experimental and COSMO-RS prediction results. J. Chem. Thermodyn. 2010, 42, 213–219. [Google Scholar] [CrossRef]
  88. Miller, M.B.; Chen, D.-L.; Luebke, D.R.; Johnson, J.K.; Enick, R.M. Critical Assessment of CO2 Solubility in Volatile Solvents at 298.15 K. J. Chem. Eng. Data 2011, 56, 1565–1572. [Google Scholar] [CrossRef]
  89. Acree, W.; Chickos, J.S. Phase Transition Enthalpy Measurements of Organic and Organometallic Compounds. Sublimation, Vaporization and Fusion Enthalpies from 1880 to 2010. J. Phys. Chem. Ref. Data 2010, 39, 043101. [Google Scholar] [CrossRef]
  90. Cysewski, P.; Jeliński, T. Optimization, thermodynamic characteristics and solubility predictions of natural deep eutectic solvents used for sulfonamide dissolution. Int. J. Pharm. 2019, 570, 118682. [Google Scholar] [CrossRef]
  91. Schölkopf, B.; Smola, A.J.; Williamson, R.C.; Bartlett, P.L. New support vector algorithms. Neural Comput. 2000, 12, 1207–1245. [Google Scholar] [CrossRef]
  92. Yao, X.J.; Panaye, A.; Doucet, J.P.; Zhang, R.S.; Chen, H.F.; Liu, M.C.; Hu, Z.D.; Fan, B.T. Comparative study of QSAR/QSPR correlations using support vector machines, radial basis function neural networks, and multiple linear regression. J. Chem. Inf. Comput. Sci. 2004, 44, 1257–1266. [Google Scholar] [CrossRef]
  93. Shi, Y. Support vector regression-based QSAR models for prediction of antioxidant activity of phenolic compounds. Sci. Rep. 2021, 11, 8806. [Google Scholar] [CrossRef]
  94. Cherkasov, A.; Muratov, E.N.; Fourches, D.; Varnek, A.; Baskin, I.I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y.C.; Todeschini, R.; et al. QSAR modeling: Where have you been? Where are you going to? J. Med. Chem. 2014, 57, 4977–5010. [Google Scholar] [CrossRef]
  95. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
  96. Muandet, K.; Fukumizu, K.; Sriperumbudur, B.; Schölkopf, B. Kernel mean embedding of distributions: A review and beyond. Found. Trends Mach. Learn. 2017, 10, 1–141. [Google Scholar] [CrossRef]
  97. Scikit-Learn Developers. StandardScaler—Scikit-Learn 1.7.2 Documentation. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html (accessed on 18 September 2025).
  98. Scikit-Learn Developers. 7.3.1. Standardization, or Mean Removal and Variance Scaling—Scikit-Learn User Guide. Available online: https://scikit-learn.org/stable/modules/preprocessing.html (accessed on 18 September 2025).
  99. Optuna Developers. Multi-Objective Optimization with Optuna—Optuna Documentation (Stable). Available online: https://optuna.readthedocs.io/en/stable/tutorial/20_recipes/002_multi_objective.html (accessed on 18 September 2025).
  100. Yanase, T. Announcing Optuna 3.2. Optuna Blog (Medium). Available online: https://medium.com/optuna/announcing-optuna-3-2-cfbfbe104d5f (accessed on 18 September 2025).
  101. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019; ACM: New York, NY, USA, 2019; pp. 2623–2631. [Google Scholar]
  102. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Routledge: Abingdon, UK, 2017; ISBN 9781351460491. [Google Scholar]
  103. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
  104. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  105. Hurvich, C.M.; Tsai, C.-L. Regression and Time Series Model Selection in Small Samples. Biometrika 1989, 76, 297–307. [Google Scholar] [CrossRef]
  106. Cysewski, P.; Jeliński, T.; Przybyłek, M.; Gliniewicz, N.; Majkowski, M.; Wąs, M. Navigating the Deep Eutectic Solvent Landscape: Experimental and Machine Learning Solubility Explorations of Syringic, p-Coumaric, and Caffeic Acids. Int. J. Mol. Sci. 2025, 26, 10099. [Google Scholar] [CrossRef]
  107. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed.; Springer: New York, NY, USA, 2002. [Google Scholar] [CrossRef]
  108. Portet, S. A primer on model selection using the Akaike Information Criterion. Infect. Dis. Model. 2020, 5, 111–128. [Google Scholar] [CrossRef] [PubMed]
  109. Python Software Foundation. Python 3.10 Documentation. Available online: https://docs.python.org/3.10/ (accessed on 18 September 2025).
  110. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. Available online: https://www.jmlr.org/papers/v12/pedregosa11a.html (accessed on 18 September 2025).
  111. Pandas Development Team. Pandas-Dev/Pandas: Pandas, Version 2.3.0; Zenodo: Geneva, Switzerland, 2025. [CrossRef]
Figure 1. Experimental mole fraction solubility of (A) mefenamic acid and (B) niflumic acid in choline chloride- or menthol-containing deep eutectic solvents (DESs) measured at 25 °C.
Figure 2. Stability analysis of the DOO-IT model selection workflow for predicting the solubility of pharmaceutical acids in DESs. The figure illustrates the variation in MAE (mean absolute error) values obtained for the pre-selected models based on Pareto fronts generated for every run covering the whole span of descriptors (16 for set 1 and 28 for set 2). Dots represent values obtained for the train subsets, and the solid line shows the mean test values with standard deviations. The points marked with a red triangle or diamond are the models that passed the multi-criteria selection framework. These models represent two alternative “basins of excellence” corresponding to either nine (set 1) or eight (set 2) descriptors, balancing accuracy, parsimony, and stability.
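The kind of summary plotted in Figure 2 (mean test error with its standard deviation for each descriptor count) can be sketched in a few lines of Python. This is an illustrative assumption about the bookkeeping, not the DOO-IT code: the helper names and all numbers below are invented for the example and are not values from the study.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between experimental and estimated values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def summarize_stability(
    test_mae_by_size: Dict[int, List[float]]
) -> Dict[int, Tuple[float, float]]:
    """Map each descriptor count to (mean test MAE, sample standard deviation).

    `test_mae_by_size` holds one test MAE per repeated run for each model size.
    """
    return {
        n: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for n, values in sorted(test_mae_by_size.items())
    }


# Made-up test MAE values for three hypothetical model sizes.
toy_runs = {4: [0.15, 0.20, 0.25], 5: [0.10, 0.12, 0.11], 6: [0.09, 0.10, 0.11]}
summary = summarize_stability(toy_runs)
```

A low mean combined with a small spread across runs is what the caption describes as a stable region; a size with a low mean but large spread would be a weaker candidate.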
Figure 3. Dual-objective optimization and tentative model selection for the eight-descriptor model. The figure shows the balance between predictive accuracy (CV MAE) and model complexity (SV ratio). Each point corresponds to a distinct nuSVR model, while the Pareto front (dark purple points, shaded by the nu hyperparameter) marks the set of non-dominated solutions. The final model (trial 2269, orange star) was chosen according to the one standard error (1-SE) rule, which selects the least complex model falling within the 1-SE performance band (green region) of the best-performing candidate (trial 1684, red diamond). The parity plot in the right panel of the figure shows the agreement between the experimental (exp) and estimated (est) values of logarithmic mole fraction solubility (log(x)) for the selected optimal model. The following model parameters were found from the DOO-IT framework: nu = 0.25437165084768737, C = 33.95751326829903, and log10_gamma_scale = 0.4526771971941057.
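The selection logic named in this caption (a Pareto front of non-dominated trials followed by the 1-SE rule) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the `Trial` record, the way the standard error is stored, and all toy numbers are hypothetical, and the 1-SE band is taken as the best CV MAE plus that trial's own standard error.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trial:
    number: int
    cv_mae: float     # mean cross-validated MAE (lower is better)
    cv_mae_se: float  # standard error of that mean (assumed available per trial)
    sv_ratio: float   # fraction of training points kept as support vectors


def dominates(a: Trial, b: Trial) -> bool:
    """True if `a` is no worse than `b` in both objectives and better in one."""
    no_worse = a.cv_mae <= b.cv_mae and a.sv_ratio <= b.sv_ratio
    better = a.cv_mae < b.cv_mae or a.sv_ratio < b.sv_ratio
    return no_worse and better


def pareto_front(trials: List[Trial]) -> List[Trial]:
    """Keep the trials that no other trial dominates (both objectives minimized)."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]


def one_se_choice(front: List[Trial]) -> Trial:
    """Pick the least complex trial whose CV MAE lies within 1 SE of the best."""
    best = min(front, key=lambda t: t.cv_mae)
    threshold = best.cv_mae + best.cv_mae_se
    eligible = [t for t in front if t.cv_mae <= threshold]
    return min(eligible, key=lambda t: t.sv_ratio)


# Invented trials for illustration only.
toy_trials = [
    Trial(1, 0.100, 0.010, 0.60),  # most accurate, but many support vectors
    Trial(2, 0.105, 0.012, 0.40),  # within 1 SE of trial 1 and simpler
    Trial(3, 0.130, 0.011, 0.25),  # simplest, but outside the 1-SE band
    Trial(4, 0.120, 0.015, 0.55),  # dominated by trial 2
]
front = pareto_front(toy_trials)
chosen = one_se_choice(front)
```

Here trial 2 is selected: it gives up a little accuracy relative to the best candidate but stays inside the 1-SE band while using fewer support vectors, which mirrors the trade-off the caption describes.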
Figure 4. Dual-objective optimization and tentative model selection for the nine-descriptor model. The figure shows the balance between predictive accuracy (CV MAE) and model complexity (SV ratio). Each point corresponds to a distinct nuSVR model, while the Pareto front (dark purple points, shaded by the nu hyperparameter) marks the set of non-dominated solutions. The final model (trial 1042, orange star) was chosen according to the one standard error (1-SE) rule, which selects the least complex model falling within the 1-SE performance band (green region) of the best-performing candidate (trial 1889, red diamond). The parity plot in the right panel of the figure shows the agreement between the experimental (exp) and estimated (est) values of logarithmic mole fraction solubility (log(x)) for the selected optimal model. The following model parameters were found from the DOO-IT framework: nu = 0.257891586205508, C = 57.25987010428314, and log10_gamma_scale = 0.936163953105186.
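For readers who want to reuse the hyperparameters quoted in this caption, the sketch below shows one way they might be turned into keyword arguments for a nu-support-vector regressor. Two points are assumptions and not statements from the article: that an RBF kernel is used, and that log10_gamma_scale is the base-10 logarithm of a multiplier applied to the conventional "scale" value 1 / (n_features × feature variance). The helper name `nusvr_kwargs` is hypothetical.

```python
from typing import Dict

# Hyperparameters quoted in the Figure 4 caption (nine-descriptor model).
reported = {
    "nu": 0.257891586205508,
    "C": 57.25987010428314,
    "log10_gamma_scale": 0.936163953105186,
}


def nusvr_kwargs(
    params: Dict[str, float], n_features: int, feature_variance: float = 1.0
) -> Dict[str, object]:
    """Translate the reported parameters into nu-SVR keyword arguments.

    ASSUMPTION: gamma = 10**log10_gamma_scale / (n_features * feature_variance).
    With standardized descriptors the variance is close to 1, hence the default.
    """
    base_gamma = 1.0 / (n_features * feature_variance)
    return {
        "kernel": "rbf",  # ASSUMPTION: the kernel is not stated in the caption
        "nu": params["nu"],
        "C": params["C"],
        "gamma": (10.0 ** params["log10_gamma_scale"]) * base_gamma,
    }


kwargs = nusvr_kwargs(reported, n_features=9)
```

With scikit-learn installed, the resulting dictionary could be passed as `sklearn.svm.NuSVR(**kwargs)`; whether this reproduces the published model depends on the assumptions above and on using the same descriptors and scaling.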
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cysewski, P.; Jeliński, T.; Giniewicz, J.; Kaźmierska, A.; Przybyłek, M. Duality of Simplicity and Accuracy in QSPR: A Machine Learning Framework for Predicting Solubility of Selected Pharmaceutical Acids in Deep Eutectic Solvents. Molecules 2025, 30, 4361. https://doi.org/10.3390/molecules30224361
