QSPR Studies on Aqueous Solubilities of Drug-Like Compounds

A rapidly growing area of modern pharmaceutical research is the prediction of aqueous solubility of drug-sized compounds from their molecular structures. There exist many different reasons for considering this physico-chemical property as a key parameter: the design of novel entities with adequate aqueous solubility brings many advantages to preclinical and clinical research and development, allowing improvement of the Absorption, Distribution, Metabolization, and Elimination/Toxicity profile and “screenability” of drug candidates in High Throughput Screening techniques. This work compiles recent QSPR linear models established by our research group devoted to the quantification of aqueous solubilities and their comparison to previous research on the topic.


Introduction
Nowadays it is generally recognized that an ideal drug, besides being pharmacologically active, should additionally possess certain features regarding its bioavailability and its toxicological profile [1][2][3][4][5]. Absorption, Distribution, Metabolization, and Elimination/Toxicological (ADME/Tox) in silico OPEN ACCESS filters constitute widely employed tools to determine whether it is probable or not for a drug candidate to reach its site of action or elicit toxic effects at its therapeutic dose. Moreover, modern approaches developed in the pharmaceutical industry for a rational molecular design have moved the ADME/Tox evaluations to the early stages of drug development, where an optimal activity of the compound is sought [6].
The degree of absorption of a substance depends simultaneously on dose, solubility, and permeability, and the exploration of large databases containing orally bioavailable drugs led to the formulation of the widely-used Lipinski "rule of five" for compounds absorbed through the gastrointestinal barrier via passive diffusion [7]. These simple rules state that oral bio-availability is likely to occur if at least three of the following rules are obeyed: molecular weight below 500; no more than five hydrogen bond donors and less than 10 hydrogen bond acceptors; and a calculated logarithm of the partition coefficient of the compound between water and octanol (log P) below 5.
The empirical conditions to satisfy Lipinski´s rule and display good oral bioavailability involve a balance between the aqueous solubility of a compound and its ability to diffuse passively through the different biological barriers. Aqueous solubility governs both the rate of dissolution of the compound and the maximum concentration reached in the gastrointestinal fluid. However, excessively polar compounds would result problematic at the stage of passing through the various biological barriers. Furthermore, it is known that aqueous solubility constitutes an important parameter in Medicinal Chemistry for the following reasons: soluble compounds are associated to shorter metabolization and elimination times, thus leading to lower probability of adverse effects and bioaccumulation [1,2,8], and most pre-clinical tests involve solubilization of the drug being tested in hydrophilic solvents [9,10]. Accurate activity measurements can be obtained only if the substance is sufficiently soluble (above the detection limit of the assay). Otherwise, an active compound may appear to be inactive due to insufficient solubility rather than inadequate potency [4,5].
The aqueous solubility of a given chemical entity can be obtained by experimental determination, although this usually presents some difficulties [2,3]. The traditional "shake flask" assay for measuring solubility is an equilibrium (thermodynamical) assay in which the solid is mixed vigorously with an aqueous buffer for a long period of time. This approach requires a fairly large amount of sample (1 -2 mg) and is time-demanding (24 -72 hours or more to do properly). Kinetic solubility measurements, in miniaturized methods such as Nephelometry [11], require little starting material but involve a reliable DMSO stock solution and multiple repeats to achieve accuracy. Furthermore, kinetic and thermodynamic solubility measurements are not interchangeable: they rely on fundamentally different physical properties to assess solid-state and solvation interactions and thus should be approached and interpreted with both caution and a detailed understanding of their strengths and limitations [12]. Obviously, it is not feasible to measure the solubility when no samples of compounds are available, while the times required for these assays are not compatible with the new High Throughput Screening technologies.
This background explains the great interest of developing theoretical models to predict aqueous solubility directly from structure. Consequently, a high number of theoretical models have been proposed in the past to predict aqueous solubilities, ranging from the early studies of Amidon et al. in 1975 [9] to several approaches including thermodynamic calculations, Group Contribution Methods and Quantitative Structure-Property Relationships (QSPR) [8,[13][14][15][16].

Some Different in silico Methods for Solubility Estimation
The simplest definition for aqueous solubility (S, mol·L -1 ) in a given solvent is the maximum amount of the most stable crystalline form of the compound that can remain in solution in a given volume of the solvent at a given temperature and pressure under thermodynamic equilibrium [12]. This equilibrium balances the energy of the intermolecular interactions between solvent and solute molecules against the energy of solvent and solute molecules interacting intramolecularly with each other. For an ionizable compound, solubility without reference to pH and ionization constant pK a is meaningless, while for any compound under analysis the specific solid state (amorphous or crystalline state) and solvent/s used is central for determining the solubility. It is also possible to distinguish different precise definitions of the term solubility [1].
The interaction between water and drug has been intensively studied previously and reviewed in ref. [2]. A typically employed empirical method to estimate solubility is based on easily obtained measurements, combining log P and melting point (MP) data by using the "General Solubility Equation" (GSE) [17][18][19]. Surprisingly, despite of its relative simplicity this equation has impressive accuracy as demonstrated in several studies [20][21][22], and this fact has led to the proposal of improved versions of the GSE model for adjusting large data sets of compounds [23][24][25][26]. The log P parameter provides an estimate of the strength of the interaction of the compound with water, while most common log P estimation programs are fragment based and empirical, such as CLOGP (Daylight Chemical Information Systems) and ACD/logD (Advanced Chemistry Development, Inc.). The main drawback of this method appears when it involves compounds having very high melting points (the sample decomposes before melting) or very low or very high log P values [15,27]. Other empirical methods were also reported, although sharing the common disadvantage that all of them require the experimental measurement of some terms defined in the equation [28,29].
The energetics of a compound in water can be assessed through a model of solvation, by resorting to Molecular Simulation in a statistical thermodynamical-like approach. Jorgensen and Duffy [30] employed Monte Carlo simulation with solute embedded in a bath of rigid water molecules to derive cohesive properties that can be used to predict solubility. However, this sort of calculations is quite computationally demanding for each different solute. A completely different approach to simulation is the Cellular Automata [31], were solvent and solute are represented by cells on a grid while their movements are governed by their immediate neighbors and a set of transition rules. The occupancy patterns of the cells change at each step, and many steps are involved. Such a kind of simulation offers intriguing insights into the dissolution process, i.e. formation of mobile cavities within the solid solute, but is not as useful as Monte Carlo in quantitative work. An alternative to the simulation of a large ensemble of particles focuses on a single solute molecule that is modeled in more detail, being based on electronic structure methods of Quantum Mechanics. Within this framework the solvent, which polarizes the molecule and is itself polarized by the solute, can be approximated as a continuous dielectric (Cramer-Truhlar approach) [32]. An alternative modeling of the solvent embeds both solute and solvent in a perfect conductor to calculate their polarization charge densities in the COSMO-RS (COSMOlogic GmbH and Co. KG) quantum chemical approach, leading to a chemical potential for the system that enables to estimate the solubility [33]. Despite of this, Quantum Mechanics methods are much slower than Monte Carlo simulations and result unsuitable for the analysis of large datasets of compounds. Table 1 summarizes different classes of methods to predict aqueous solubility data [1].  [34][35][36] are easy to apply, relying solely on the sum of contributions of each molecular structure fragment to the aqueous solubility. The basic assumption of this approach is the transferability concept for a group; if this hypothesis does not hold, then GCM can be corrected with experimental data when available to achieve better predictions. The methods proposed by Nirmalakhandan et al. [37], Suzuki et al. [38], Kuhne et al. [39], Lee et al. [40], and Klopman et al. [14,41] belong to this category. Among all these methods, only Klopman's method is a pure and general group contribution model without using additional experimental parameters.
Although GCM have a simple and practical implementation, some common drawbacks of this methodology are the following: a) they require a large data set to obtain a contribution of each functional group; b) in its basic form (without corrections) it cannot model isomeric structures; c) they may contain a "missing fragment" problem, which means that if a compound contains a missing fragment which can be defined by the group contribution model, its aqueous solubility cannot be precisely predicted; d) there are not always measured data available to extend these methods to strange compounds such as molecules containing fused aromatic rings or to organometallic compounds. Since the final estimated GCM value assigned to the aqueous solubility of a compound involve that it change from the solid phase to a new one (liquid), this makes it harder to separate the contributions of individual parts of the molecule to the whole process. Nevertheless, GCM is a fast method for estimating aqueous solubility on large data sets of compounds and can produce reasonably accurate results.

Predicting Solubility through Linear Regression Based QSPR-QSAR
In the realms of the Quantitative Structure Property-Activity Relationships theory (QSPR-QSAR), a physicochemical or biological property of a compound is assumed to be a unique consequence of its molecular structure [42][43][44]. Therefore, a model is employed to predict the property by means of structural descriptors or numerical variables that capture different constitutional, topological, geometrical or electronic characteristics of the molecular structure in consideration. These molecular descriptors can be readily calculated through mathematical formulae obtained from several theories, such as the Chemical Graph Theory, Information Theory, Quantum Mechanics, etc. [45,46] The hypotheses involved in QSPR-QSAR analyzes have proven in the past to function quite well for a wide spectrum of properties/activities of interest.
QSPR-QSAR models enable property estimation for substances that have yet not been tested for different reasons, such as instability, toxicity, or simply because their measurement requires too much time. In terms of economy, these studies allow the rational use of the available resources present in the laboratory or even a plant, avoiding performing expensive and unnecessary experimental determinations. With respect to their moral aspects, the QSPR-QSAR analyses applied to Toxicology have achieved great importance in the virtual screening of the toxic potential of compounds before their synthesis [47], and thus represent an effective alternative that reduces animal testing in biological assays. In drug discovery, both the prediction with QSAR-QSPR of ADMET properties [48] and the oral bioavailability of compounds [49,50] were conveniently addressed. Finally, from the theoretically point of view, the model can illuminate the mechanisms of physicochemical properties or biological activities of the compounds.
It is well known that a single descriptor is unable to carry all the structural information of a molecule, and one has to search for the best descriptors among the more than a thousand available in the literature, that are the most representative/descriptive parameters for the particular modeled property [51][52][53]. There exist various standard statistical methods that constitute a common practice for QSPR-QSAR model design, such as linear: Multivariable Linear Regression (MLR) [54], Principal Component Analysis (PCA) [55], Genetic Algorithms [56], Replacement Method [57], and non-linear methods: Artificial Neural Networks (ANN) [58], or Support Vector Machines [59]. The main advantage of developing linear models compared to non-linear ones is the fact that the former suffer less from the over-fitting (over-training) problem [60,61], they are more general and can transparently reveal the effect of the structural variables present in the model upon the property being modeled, thus making it possible to suggest cause/effect relationships.

The Proposal of Descriptors Based on Lipinski Rules for Modeling Aqueous Solubilities
One of our recent QSPR studies analyzing aqueous solubilities employs MLR for establishing the connection between the solubility values of 148 heterogeneous organic chemicals and their molecular structure, represented through a new set of physically interpretable descriptors [62]. The correct representation of the molecular structure of drug like compounds through molecular descriptors in every QSPR-QSAR study is of crucial importance. The set of descriptors introduced here is characterized by involving in a single number several of the parameters described by the Lipinski rules [7]. The proposed Lipinski based descriptors are based on combinations of the detour index (dd) from Chemical Graph Theory (derived as the half sum of the elements of the Detour Matrix -DD) [63] together with molecular features such as the number of H donors (D), the number of H acceptors (A) and the number of heteroatoms (H) present in the structure: where the 0.1 term in the D/D definition is introduced only to prevent dividing by zero, considering that several of the studied compounds do not have any H donor functional group. The above descriptor definitions take into consideration many literature reports which demonstrate linear, polynomial and exponential correlations between dd and the boiling point of alkanes, cycloalkanes and aromatic compounds [64][65][66][67][68]. Since the boiling point of compounds from homologous series usually correlates well with molecular weight (MW), we have investigated the relationship between the dd and the MWs of the 148 compounds used for the present study. Inspection of the correlation between dd and MW pushed us to explore possible relationships between the square and cubic roots of dd and the MW. It is noticeable that cubic root of dd, in the first place, and square root of dd, in the second, display quite better linear correlations with the molecular weight of the 148 structures (R = 0.932 and R = 0.918, in that order). This is an indication of very good correlation, specially noticing the structural diversity of the dataset.
It is clear then that the Detour Index may be an appropriate descriptor to explain the differences in the aqueous solubility values that could be explained through the molecular weight of compounds. It can also characterize other molecular properties such as the degree of ramification and cyclization. , based on the better correlation between the squares and cubic roots of dd and MW compared to that between dd and MW. The physicochemical meaning of these descriptors is immediate. MW is directly correlated with dd, and the solubility tends to decrease, in homologous series, when MW increases. The more H donor and acceptors present in the molecule the more water soluble the compound will be. If no H donor or acceptor is present in the molecule, the water solubility would be jeopardized or even non existent (as is the case of alkanes). Therefore, the defined descriptors will take high values in compounds with slight aqueous solubility, while they will tend to infinite in non-soluble compounds.
We proceeded to search for a QSPR solubility model that minimizes the S parameter subjected to the condition of combining at least one of the proposed molecular descriptors reflecting the Lipinski rules together with those calculated with the Dragon software [69]. The application of the Replacement Method (RM) variable subset selection technique [57,70,71] to the available pool with D = 1,367 descriptors leads to an optimal relationship over 100 compounds that, in terms of the best predictive power of the equation measured via the calibration and the l-n%-o parameters [72] and the least number of variables involved, contains six molecular descriptors of different type: where the absolute errors of the regression coefficients are given in parentheses and R is the correlation coefficient, F is the Fisher ratio and p is the significance of the model. Quite good estimations can be achieved with this QSPR model in many cases, considering the heterogeneous nature of the training set of molecules extracted from Merck Index 13 th [73]. About 99% of these compounds are "drug-like", satisfying Lipinski´s rule.
Equation (2) involves different molecular descriptors that can be classified as follows: two of the proposed absorption-based descriptors: D/A and D/B 1/2 ; a Radial Distribution Function (RDF): RDF040e, RDF-4.0/weighted by atomic Sanderson electronegativities [74]; a GETAWAY descriptor: H3p, H autocorrelation of lag 3/weighted by atomic polarizabilities [75]; an Atom-Centred Fragment: C-006, the number of CH 2 RX functional groups [X: heteroatom (O, N, S, P, Se or halogens), R: any group linked through carbon] [76]; and a geometrical descriptor: QXXe, Qxx COMMA2 value/weighted by atomic Sanderson electronegativities [77]. A next step in the present analysis was to further validate the predictive power of the QSPR solubility model found by predicting the log(Sol) values in a test set containing 48 organic compounds, thus demonstrating that it is possible to achieve good estimations in many situations.

A QSPR Designed upon a Balanced Aqueous Solubility Data Set
It has been pointed out that solubility modeling efforts have suffered from some basic concerns, among them: training sets that are not drug-like, lack of structural diversity, unknown experimental error, incorrect tautomers or structures, neglect of ionization and crystal packing effects, oversampling of compounds with low molecular weight and range in solubility data that is not pharmaceutically relevant [2,4]. Another study conducted by our research group [78] tries to answer some of the previous issues, since it is developed from a structural diverse training set composed by drug-like compounds with more than half the dataset presenting solubility values below 1 mg·mL -1 . Note that low solubility compounds are actually the ones one would like to be able to predict accurately, since they have higher probability of presenting difficulties in pre-clinic and clinic assays and formulation stages. Therefore, the QSPR Theory was employed for analyzing the aqueous solubility exhibited at 298 K by 145 diverse drug-like organic compounds. The molecular set was split into a 97-compound training set (train) and a 48-compounds test set (val), selecting the members of each set in such a way to share similar structural characteristics of the compounds. Additionally, an external molecular set (test set 21) that was not involved during the model design, and composed of 21 well-known compounds found in many solubility prediction papers, was also employed [2,14], in order to further examine the model's validation.
In this work, most of the drugs that comprise the training and test sets meet several drug-likeness criteria. More than 99% of the data set observes the Lipinski-rule criteria for estimating drug oral bioavailability [7], while more than 93% fulfill the Veber et al. rule [79]. More than 99% of the dataset also meets more general criteria for evaluating drug-likeness extracted from several recent publications: [80][81][82] 100 ≤ molecular weight ≤ 800 g·mol -1 ; log P ≤ 7; number of H bond acceptors ≤ 10; number of H bond donors ≤5; rotatable bonds ≤15; halogen atoms ≤7; alkyl chains ≤ (CH 2 ) 6 CH 3 ; no perfluorinated chains: CF 2 CF 2 CF 3 ; no large rings (i.e. with more than seven members); no presence of atoms other than C, O, N, S, P, F, Cl, Br, I, Na, K, Mg, Ca or Li and; presence of at least one N or O atom. Moreover, low molecular weight compounds are not over-represented in this molecular set. All the molecular structures are drawn in Figure 1.  The structural diversity of the training set is assessed through calculation of the average Tanimoto intermolecular distances (based on atom pairs) for all the possible pairs of structures that could be derived from the training set. For this purpose the PowerMV software provided by the National Institute of Statistical Sciences was used [83]. According to the results, the average Tanimoto intermolecular distance for the training set is 0.781 with a S of 0.412, which confirms the high structural diversity of the training set. Figure 2 shows a histogram representing the distribution of the 166 aqueous solubilities under study, which suggests that the experimental sample is normally distributed over more than four logarithmic units and can thus be employed in regression analysis.
The initial conformations of the drug compounds are obtained by means of the "model build" modulus of the HyperChem package [84]. After that, the structures of the compounds are firstly preoptimized with the Molecular Mechanics Force Field (MM+) procedure included in the Hyperchem, and the resulting geometries are further refined by means of the Semi-Empirical Method PM3. More than a thousand DRAGON [69] theoretical descriptors are simultaneously explored including definitions of all classes, by means of the linear variable subset selection approach Replacement Method (RM) [57,70,71]. The application of the RM method on the training set of 97 heterogeneous drugs leads to the following satisfactory three-descriptors relationship: The predictive power of the QSPR is satisfactory, as revealed by its stability upon the inclusion or exclusion of compounds, as measured by the loo parameters R loo = 0.849 and S loo = 0.971, and by the more severe test of higher percentage of compounds exclusion R l-10%-o = 0.809 and S l-10%-o = 1.090. These results are in the range of a validated model: R l-n%-o must be greater than the value of 0.50, according to the specialized literature [86]. Furthermore, the predictive capability of the so established equation is demonstrated by its performance in the test set val, leading to R val = 0.848 and S val = 0.899. Finally, after analyzing 5,000,000 cases for y-randomization [87], the smallest S value obtained using this procedure was 1.650, a poorer value when compared to the one found considering the true calibration (S = 0.903). In this way, the robustness of the model could be assessed, showing that the calibration was not a fortuitous correlation and therefore results in a structure-activity relationship.
As can be appreciated from the derived QSPR, different definitions of descriptors are needed to correctly represent the structures for the drug-like heterogeneous compounds. After a proper standardization [88] of the orthogonal descriptors present in Equation (3), it is feasible to assign a greater importance to those variables that exhibit larger absolute standardized coefficients. The most important structural factor of the model is the topological descriptor X1sol, the solvation connectivity index chi-1 proposed by Zefirov and Palyulin in 1991 [89]. It has the following general formula when calculated for hydrogen-and fluorine-depleted molecular graphs: where m is the order of index; summation is over all sub-graphs of order m;  Table. The second important descriptor involved in Eq. (3) corresponds to MLOGP, the Moriguchi octanol-water partition coefficient [90]: this reveals that a compound's hydrophobicity plays a crucial role in explaining the aqueous solubility data. Finally, the contribution of a 3D-Radial Distribution Function [74] RDF060u helps to improve the predictive power of the QSPR. Such a kind of molecular descriptor defined for an ensemble of atoms may be interpreted as the probability distribution of finding an atom in a spherical volume of certain radius, incorporating different types of atomic properties in order to differentiate the nature and contribution of atoms to the property being modelled. For the case of RDF060u, the sphere radius is of 6.0 angstroms and no atomic property is employed, thus characterizing the molecular size.
The application of the developed structure-property relationship to the classical test set 21, whose data are considered "unknown" and that do not participate during the model development (as is the case of test set val), leads to a square root mean quadratic residual (rms) of 1.202. The statistical quality achieved on this test set is comparable to that obtained by the previously reported models for aqueous solubilities in Table 2, and the main advantage here is that only three molecular descriptors are employed to model the physical property, leading to a favorable ratio N/d = 7. This equation results in a superior predictive quality than that obtained by the GCM of Klopman (rms = 1.213) involving 34 parameters [14], and also outperforms the MLR of Yan (rms = 1.286) using 40 parameters [91]. To conclude the present analysis, the chemical information encoded by only three theoretical molecular descriptors of the one-, two-, and three-types participating in a linear QSPR model enabled to explain the variation of the experimental aqueous solubilities in a satisfactory extent, and allowed a proper characterization of structurally heterogeneous drug-like organic compounds from both the training and test sets. The QSPR designed involved molecular descriptors that have a quite direct interpretation, and this relationship proved to have general applicability. The statistical parameters of the proposed model compare fairly well with others published previously based on the GCM methodology.

Conclusions
In this review we have analyzed the possibility of establishing quantitative structure-aqueous solubility relationships for drug-like compounds, and compared our recently developed linear QSPR method with others reported in the literature. Such kinds of linear equations are demonstrated to work quite well both for the training and validation stages of the model, and can in principle be used for the in silico prediction of physicochemical properties. Two different strategies can be adopted for correlating the structure and the solubility of compounds: (a) the proposal of novel descriptors posing some kind of physical interpretation, as it is the case for the Lipinski's "rule of five" descriptors taking into account the bio-availability of drugs, or (b) the use of any kind of constitutional, topological, geometrical, or electronic descriptors for adjusting to the experimental solubility data. In both cases, it results of considerable importance the appropriate selection of a balanced set of chemical compounds that considers structural diversity, known experimental errors, correct tautomers or structures, consideration of ionization and crystal packing effects, range in solubility data that is pharmaceutically relevant, and that avoid the over-sampling of compounds with low molecular weight.