Abraham Model Solute Descriptors for Favipiravir: Case of Tautomeric Equilibrium and Intramolecular Hydrogen-Bond Formation

Introduction
Advances in computational science have led to the development of machine learning methods capable of predicting the chemical, biological and thermodynamic properties of a wide range of organic compounds based on training sets that contain large volumes of preferably "high-quality" experimental data. Machine learning algorithms scan the data sets, remove obvious outliers, and search for patterns between the known observations and the input parameters, which may be deduced from molecular structure considerations. The published models have provided reasonably accurate estimates of the standard molar Gibbs energies and enthalpies of formation of organic compounds [1-4], water-to-octanol partition coefficients [5-7], ADMET (absorption, distribution, metabolism, excretion and toxicology) properties of potential drug candidates [8-10], gas-chromatographic retention indices/factors on select polar and mid-polar stationary phases [11], logarithms of acid dissociation constants (pKa) [12,13] and enthalpies of solvation of a variety of compounds dissolved both in water and in approximately thirty organic mono-solvents [14-16].
Our interest in machine learning primarily focuses on the different methods available on the internet to estimate the six Abraham model solute descriptors and on the data sets used in training the various models. Solute descriptors describe the dissolved solute's ability to interact with its immediate solubilizing environment. The A and B descriptors measure the ability of the dissolved solute to act as a hydrogen-bond donor and acceptor, respectively; the E solute descriptor denotes the excess molar refraction of the given solute referenced to that of a linear alkane of comparable molecular size; the L descriptor is the logarithm of the solute's experimental gas-to-hexadecane partition coefficient at 298.15 K; the S descriptor represents a combination of the polarizability and the electrostatic polarity of the solute; and the V descriptor is the calculated McGowan molecular volume of the solute. Knowledge of the solute descriptors enables one to estimate the molar solubility ratios of the given solute molecule in more than 130 different organic mono-solvents for which predictive Abraham model expressions have been published.
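Because V is the only descriptor obtained purely by calculation, a short sketch may help illustrate it. The snippet below applies the McGowan scheme: sum the atomic increments, subtract a fixed correction for every bond, and divide by 100. The atomic increments, the bond correction and the function name `mcgowan_volume` do not appear in this paper; they are the commonly tabulated McGowan values and are treated here as assumptions that should be verified against the original tabulation before use.

```python
# Atomic McGowan increments in cm^3/mol (assumed literature values; verify before use).
MCGOWAN_INCREMENTS = {"C": 16.35, "H": 8.71, "N": 14.39, "O": 12.43, "F": 10.48}
BOND_CORRECTION = 6.56  # cm^3/mol subtracted per bond, irrespective of bond order


def mcgowan_volume(atom_counts, n_rings):
    """Return the V descriptor in units of (cm^3/mol)/100.

    atom_counts: dict mapping element symbol to number of atoms (hydrogens included)
    n_rings:     number of rings in the molecule
    """
    n_atoms = sum(atom_counts.values())
    # For a connected molecule, bonds = atoms - 1 + rings.
    n_bonds = n_atoms - 1 + n_rings
    atom_sum = sum(MCGOWAN_INCREMENTS[el] * n for el, n in atom_counts.items())
    return (atom_sum - BOND_CORRECTION * n_bonds) / 100.0


# Favipiravir, C5H4FN3O2, one ring.
favipiravir_v = mcgowan_volume({"C": 5, "H": 4, "F": 1, "N": 3, "O": 2}, n_rings=1)
print(round(favipiravir_v, 3))
```

Under these assumed increments the result agrees, to within rounding, with the V = 0.9669 quoted later for favipiravir. Because the calculation depends only on the molecular formula and ring count, tautomers necessarily share the same V value.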
We are particularly interested in how well each method estimates the experiment-based solute descriptor values for organic compounds with structural features that may not be adequately represented in the data set(s) used to train the model. As far as we can surmise, the training set for the solute descriptors was a private set of numerical values provided by Abraham in April 2016 for inclusion on the UFZ-LSER website [17]. Our private database is more extensive in that it includes not only the compounds and descriptor values from the earlier Abraham data set, but also values for many additional compounds that we have calculated over the past seven years.
Our previous experience with internet-available machine learning methods, as well as with select group contribution methods, is that the software programs often overestimate the A solute descriptor of compounds that exhibit intramolecular hydrogen-bond formation. The compounds whose descriptor values are to be estimated are specified through their canonical SMILES codes, which give the arrangement of atoms within the given molecule. The inputted SMILES code does not contain all of the subtle structural features within the molecule that determine its chemical and biological properties. For example, the estimated solute descriptors from the group contribution method available on the UFZ-LSER website [17], E = 2.34; S = 2.46; A = 1.28; B = 1.14; V = 1.8615; and L = 11.352, were obtained using the canonical SMILES code, C1=CC=C2C(=C1)C(=O)C3=C(C=CC(=C3C2=O)O)O for 4,5-dihydroxyanthraquinone-2-carboxylic acid. The group contribution and machine learning estimation methods of Chung and coworkers [14] yielded solute descriptor sets of (E = 2.32; S = 2.37; A = 1.44; B = 0.96; V = 1.8615; and L = 11.368) and (E = 2.49; S = 2.17; A = 1.11; B = 0.87; V = 1.8615; and L = 11.327), respectively. The estimated A solute descriptor from each of the three methods fell in the range of A = 1.11 to A = 1.44, which would reflect the single -COOH and two -OH functional groups contained within the molecule. Analysis of the measured solubility data of 4,5-dihydroxyanthraquinone-2-carboxylic acid dissolved in 11 organic solvents yielded a different set of descriptor values: E = 2.340, S = 2.195, A = 0.755, B = 0.596, V = 1.8615, and L = 11.073 [18]. The much smaller experiment-based A solute descriptor of A = 0.755 is comparable to the experiment-based values observed for substituted benzoic acids with a single -COOH group. The two phenolic hydrogens on the -OH functional groups are likely involved in intramolecular H-bond formation with the lone pairs of electrons on the oxygen atom of the neighboring >C=O group, as neither hydroxyl proton appears to contribute to the overall H-bond donating character. Two additional dihydroxyanthraquinone compounds, 1,4-dihydroxyanthraquinone and 1,8-dihydroxyanthraquinone, had experiment-based A solute descriptor values of zero [19], again suggesting that the hydroxyl protons were not available to engage in H-bond formation with the surrounding solvent molecules. Much larger estimated A solute descriptor values of A = 0.82 [17] were calculated for both of the dihydroxyanthraquinone compounds using their canonical SMILES codes. The SMILES code failed to properly capture the solute's ability to engage in intramolecular H-bond formation.
Cui et al. [20] recently reported mole fraction solubilities for favipiravir (more formally named 6-fluoro-3-hydroxypyrazine-2-carboxamide) dissolved in dichloromethane, butanone, acetonitrile, N,N-dimethylformamide, three alkyl acetate (methyl acetate, ethyl acetate and butyl acetate) and five alcohol (methanol, ethanol, 1-propanol, 2-propanol and 1-butanol) solvents in the temperature range of 293.15 K to 333.15 K. Favipiravir is not only an important antiviral agent approved by several countries for the treatment of influenza and COVID-19 [21], but it also exhibits keto-enol tautomerism with the possibility of forming intramolecular hydrogen bonds. Figure 1 depicts the two tautomeric forms of the molecule suggested by both DFT quantum-chemical computational studies and molecular spectroscopic measurements [22]. The enol tautomer was found to be substantially more stable in the gas phase and in most organic mono-solvents. The fluorine substituent on the pyrazine ring, along with the presence of the carboxamide group, further stabilizes the enol tautomer. The tautomeric proton in the enol form is part of a strong intramolecular hydrogen bond, while the NH proton in the keto form is available for interaction with strong proton acceptor solvents like water [23]. X-ray crystallographic studies provided further support for intramolecular H-bond formation in the case of the enol tautomer [24]. In our search of the published literature, we found no evidence of intramolecular H-bond formation in the case of the keto tautomer; however, the number of published papers is limited. As primary and secondary amides can act as H-bond donors [25], we include, in Figure 1, a possible intramolecular hydrogen-bonded species for the keto tautomer.
The keto-enol tautomeric equilibrium exhibited by favipiravir, combined with the possibility of intramolecular hydrogen-bond formation, provides the opportunity for us to further test the ability of the existing machine learning and group contribution methods to predict the experiment-based solute descriptors for compounds not included in each method's training data set. In the current study, we determined the solute descriptors for favipiravir based on the experimental solubility data reported by Cui et al. [20]. The calculated experiment-based solute descriptors are then compared to the estimated values obtained using the software available on the UFZ-LSER [17] and MIT [26] websites. As noted above, the spectroscopic measurements and quantum-chemical computation studies indicate that the enol tautomer is the most stable form of favipiravir, so we are particularly interested in the estimated solute descriptors that the two software programs provide based on the SMILES code of the enol. We have also included in our comparisons the solute descriptors for the keto tautomer in order to ascertain how large the difference is between the solute descriptors of the two tautomeric forms. SMILES codes are generated from the given molecular structure; therefore, it is imperative that the most stable tautomeric form of favipiravir be used in the estimations. Several published papers have depicted the molecular structure in the enol form [20,27,28], while other publications have contained only the keto tautomer [29-31]. Scifinder Scholar [32] also depicts the structure as the keto tautomer, with the chemical name being given as 6-fluoro-3,4-dihydro-3-oxo-2-pyrazinecarboxamide. It is easy to see how an inexperienced researcher might simply input whichever molecular structure was found in the course of literature reading.

Solute Descriptor Calculations
Calculation of Abraham model solute descriptors may provide indirect evidence to reinforce the earlier DFT quantum-chemical calculations [22] regarding which tautomeric form of favipiravir is present in organic solvents, and whether the molecule exhibits intramolecular hydrogen-bond formation. In instances where intramolecular hydrogen bonding occurs, the experiment-based A solute descriptor value is expected to fall significantly below what would be expected based solely on structural considerations. The Abraham model solute descriptors are easily calculated by regressing the experimental logarithms of the molar solubility ratios, $\log (C_{S,\mathrm{organic}}/C_{S,\mathrm{water}})$ and $\log (C_{S,\mathrm{organic}}/C_{S,\mathrm{gas}})$, in accordance with Equations (1) and (2) [18,19,33-35]:

$$\log (C_{S,\mathrm{organic}}/C_{S,\mathrm{water}}) = c + e\,E + s\,S + a\,A + b\,B + v\,V \qquad (1)$$

$$\log (C_{S,\mathrm{organic}}/C_{S,\mathrm{gas}}) = c + e\,E + s\,S + a\,A + b\,B + l\,L \qquad (2)$$

where the three subscripts ("organic", "water" and "gas") inside the logarithmic terms on the left-hand side of both mathematical expressions identify the phase to which the given solute molar concentration pertains. The capitalized alphabetical characters represent the solute descriptors that were defined in the manuscript's Introduction. The lowercase alphabetical characters that precede each solute descriptor refer to the complementary solvent properties, which are not the focus of the current study. Readers are directed to several review articles and book chapters for a more detailed discussion of the Abraham model [36-41].
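To make the structure of the Abraham model correlations concrete, the following sketch evaluates the water-to-solvent expression (Equation 1) for a single solvent system. It is only an illustration under stated assumptions: the function name `abraham_log_ratio` is hypothetical, and the wet 1-octanol coefficients are the widely quoted literature values for the water-to-1-octanol partition coefficient rather than values copied from Table 2, which remains the authoritative source. The favipiravir descriptors are those from the reanalysis with an assumed E = 1.152 reported later in this section.

```python
def abraham_log_ratio(coeffs, descriptors, size_term):
    """Evaluate c + e*E + s*S + a*A + b*B + x*X.

    size_term is "V" for the water-to-solvent correlation (Equation 1)
    or "L" for the gas-to-solvent correlation (Equation 2).
    """
    value = coeffs["c"]
    for lower, upper in (("e", "E"), ("s", "S"), ("a", "A"), ("b", "B")):
        value += coeffs[lower] * descriptors[upper]
    return value + coeffs[size_term.lower()] * descriptors[size_term]


# Assumed literature coefficients for the wet 1-octanol/water system (check Table 2).
WET_OCTANOL_WATER = {"c": 0.088, "e": 0.562, "s": -1.054,
                     "a": 0.034, "b": -3.460, "v": 3.814}

# Experiment-based descriptors from the reanalysis with assumed E = 1.152.
FAVIPIRAVIR = {"E": 1.152, "S": 1.321, "A": 0.280,
               "B": 0.648, "V": 0.9669, "L": 5.342}

log_p_calc = abraham_log_ratio(WET_OCTANOL_WATER, FAVIPIRAVIR, "V")
print(f"back-calculated log P = {log_p_calc:.2f} (measured log P = 0.72)")
```

The same function handles the gas-to-solvent correlation by passing a coefficient dictionary that contains an "l" entry and selecting "L" as the size term.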
The calculated numerical values of the solute descriptors should provide an indication of how favipiravir exists in solution: as the enol tautomer, the keto tautomer or an intramolecular hydrogen-bonded species. The A and/or B descriptor values will likely be the most informative, as these are the ones most directly related to the hydrogen-bonding character. The calculation of the solute descriptors requires that the published mole fraction solubilities be expressed as molar solubilities. The conversion is accomplished by dividing the numerical values of $x_{S,\mathrm{organic}}$ by the ideal molar volume of the saturated solution (i.e., $C_{S,\mathrm{organic}} \approx x_{S,\mathrm{organic}}/V_{\mathrm{saturated\ solution}}$). The molar volume of the saturated solution is a mole-fraction-weighted average of the molar volumes of favipiravir and the organic mono-solvent (i.e., $V_{\mathrm{saturated\ solution}} = x_{S,\mathrm{organic}} V_{\mathrm{solute}} + (1 - x_{S,\mathrm{organic}}) V_{\mathrm{solvent}}$). The molar volume of favipiravir was estimated to be $V_{\mathrm{solute}}$ = 0.09963 L mol$^{-1}$. The calculated logarithms of the molar solubilities in the 12 organic solvents, along with the logarithm of the measured water-to-octanol partition coefficient (log P = 0.72 [42]), are tabulated in Table 1.
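The mole fraction to molarity conversion described above can be sketched in a few lines. The favipiravir molar volume is taken from the text; the mole fraction solubility and the methanol molar volume used in the example are illustrative placeholders, not values from Cui et al. [20] or from this study, and the function name `molar_solubility` is hypothetical.

```python
import math

V_SOLUTE = 0.09963  # L/mol, estimated molar volume of favipiravir (from the text)


def molar_solubility(x_solute, v_solvent, v_solute=V_SOLUTE):
    """Convert a mole fraction solubility to mol/L assuming ideal volume additivity."""
    if not 0.0 < x_solute < 1.0:
        raise ValueError("mole fraction must lie strictly between 0 and 1")
    # Ideal molar volume of the saturated solution (L per mole of mixture).
    v_solution = x_solute * v_solute + (1.0 - x_solute) * v_solvent
    return x_solute / v_solution


x_example = 5.0e-3   # hypothetical mole fraction solubility
v_methanol = 0.0407  # L/mol, approximate molar volume of methanol near 298 K (assumed)

c_example = molar_solubility(x_example, v_methanol)
print(f"C = {c_example:.4f} mol/L, log C = {math.log10(c_example):.3f}")
```

For sparingly soluble solutes the solute term in the denominator is small, so the result is close to the mole fraction divided by the solvent molar volume.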
Listed in Table 2 are the numerical values of the equation coefficients for the Abraham model $\log (C_{S,\mathrm{organic}}/C_{S,\mathrm{water}})$ and $\log (C_{S,\mathrm{organic}}/C_{S,\mathrm{gas}})$ correlations for the 12 organic mono-solvents in which the favipiravir solubility was measured. Coefficients for the practical water-to-1-octanol partition coefficient and its accompanying "practical" gas-to-1-octanol partition coefficient correlations are given in the rows labeled 1-octanol (wet). In a practical (direct) partitioning experiment, the solute is distributed between an aqueous phase saturated with 1-octanol and an organic phase saturated with water. Equations for the remaining five alcohol solvents, butanone, acetonitrile, N,N-dimethylformamide and the three alkyl acetate solvents pertain to "dry" anhydrous organic mono-solvents. The Abraham model equations for dichloromethane were obtained by combining the experimental molar solubility data, the gas-to-liquid partition coefficients calculated from the infinite dilution activity coefficients and the practical water-to-dichloromethane partition coefficients. The small quantity of water in the dichloromethane phase in the practical partitioning system did not appear to affect the organic solvent's solubilizing properties.

Of the three estimation methods considered, the group contribution method on the MIT website provides the best set of estimated values, at least as far as the A solute descriptor is concerned. The B solute descriptor value, however, is significantly overestimated. Each set of estimated solute descriptors was used to predict the experimental molar solubility ratios. Predictions were made by letting the $\log C_{S,\mathrm{water}}$ and $\log C_{S,\mathrm{gas}}$ values float to their "optimized" values for each of the six estimated sets of solute descriptors (three methods applied to each of the two tautomers). None of the six sets of estimated solute descriptors was able to predict the observed values to a standard error of less than 0.40 log units.
As part of our computations, we reanalyzed the experimental values in Table 1 using an assumed value of E = 1.152, which is the estimated value for the enol tautomer based on the group contribution method available on the MIT website [26]. The reanalysis yielded values of S = 1.321; A = 0.280; B = 0.648; V = 0.9669; L = 5.342, which were only slightly different from those obtained by assuming an E descriptor value of E = 1.040. Irrespective of which of the two numerical values is assumed for the E solute descriptor, one still obtains much smaller A and B descriptor values than expected. We attribute the smaller values to intramolecular hydrogen-bond formation, which is in accordance with the earlier DFT quantum-chemical calculations described in the published paper by Antonov [23].
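The regression with a fixed E value can be outlined as a linear least-squares problem. The sketch below is not the authors' actual workflow; it assumes that E and V are held fixed, that S, A, B and L are solved for together with floating $\log C_{S,\mathrm{water}}$ and $\log C_{S,\mathrm{gas}}$ values, and that both Abraham model correlations are stacked into one system. The function name `fit_descriptors`, the randomly generated solvent coefficients and the two "true" aqueous and gas-phase concentrations are hypothetical placeholders used only to show that the fit recovers known inputs; they are not the Table 2 coefficients or measured data.

```python
import numpy as np


def fit_descriptors(log_c_org, coef_p, coef_k, E, V):
    """Least-squares fit of S, A, B, L, log C_water and log C_gas.

    coef_p columns: c, e, s, a, b, v  (Equation 1, one row per solvent)
    coef_k columns: c, e, s, a, b, l  (Equation 2, one row per solvent)
    E and V are held fixed.
    """
    log_c_org = np.asarray(log_c_org, dtype=float)
    coef_p = np.asarray(coef_p, dtype=float)
    coef_k = np.asarray(coef_k, dtype=float)
    n = log_c_org.size
    zeros, ones = np.zeros(n), np.ones(n)

    # Unknown vector: [S, A, B, L, log C_water, log C_gas].
    rows_p = np.column_stack([coef_p[:, 2], coef_p[:, 3], coef_p[:, 4], zeros, ones, zeros])
    rhs_p = log_c_org - coef_p[:, 0] - coef_p[:, 1] * E - coef_p[:, 5] * V
    rows_k = np.column_stack([coef_k[:, 2], coef_k[:, 3], coef_k[:, 4], coef_k[:, 5], zeros, ones])
    rhs_k = log_c_org - coef_k[:, 0] - coef_k[:, 1] * E

    design = np.vstack([rows_p, rows_k])
    target = np.concatenate([rhs_p, rhs_k])
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)

    residuals = target - design @ solution
    dof = target.size - solution.size
    std_error = float(np.sqrt(residuals @ residuals / dof))
    names = ["S", "A", "B", "L", "log_c_water", "log_c_gas"]
    return dict(zip(names, map(float, solution))), std_error


# Synthetic self-consistency check with placeholder coefficients (NOT Table 2).
rng = np.random.default_rng(0)
true = {"S": 1.321, "A": 0.280, "B": 0.648, "L": 5.342,
        "log_c_water": -1.0, "log_c_gas": -9.0}  # last two values are hypothetical
E_FIXED, V_FIXED = 1.152, 0.9669
coef_p = rng.normal(size=(12, 6))
coef_k = rng.normal(size=(12, 6))
desc_p = np.array([1.0, E_FIXED, true["S"], true["A"], true["B"], V_FIXED])
log_c_org = true["log_c_water"] + coef_p @ desc_p
desc_k = np.array([0.0, E_FIXED, true["S"], true["A"], true["B"], true["L"]])
# Choose the Equation 2 intercepts so both equations describe the same log C_org.
coef_k[:, 0] = log_c_org - true["log_c_gas"] - coef_k @ desc_k

fitted, std_error = fit_descriptors(log_c_org, coef_p, coef_k, E_FIXED, V_FIXED)
print({name: round(value, 3) for name, value in fitted.items()})
```

Repeating the fit with a different fixed E value, as was done in the reanalysis, only changes the right-hand side of the system, which is one way to see why the remaining descriptors shift only slightly.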
In conclusion, our intent is to point out that the predictive methods for solute descriptors cannot properly account for intramolecular H-bond formation if the phenomenon was not learned during training, or if the input parameters (which, for the three models considered here, were the canonical SMILES codes) do not alert the method that intramolecular H-bond formation occurs. The data sets need to include a sufficient number of compounds exhibiting intramolecular hydrogen-bond formation for proper learning to take place. In the current paper, we have provided one additional example of an organic molecule that can exhibit intramolecular H-bond formation, in addition to possible keto-enol tautomerism. Tautomerism is important in the drug discovery process, as it has been reported that approximately one-fifth of the molecules in drug discovery data sets exhibit tautomerism [44]. It is our hope that freely available group contribution and machine learning methods will reach the point where users can identify molecules that exhibit intramolecular H-bond formation as part of the structural input parameters and/or features. Based on our past experiences, we identify molecules that are likely candidates for intramolecular hydrogen-bond formation as those that contain H-bond donor and acceptor sites in sufficiently close proximity, such that the hydrogen-bond formation results in either a five- or six-membered ring. Users of such predictive methods should also be aware that poor predictions might result if structural information for the incorrect tautomeric form is used as an input parameter.
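The five- or six-membered ring criterion mentioned above lends itself to a simple topological screen. The following sketch is a rough illustration rather than a validated tool: it counts the bonds on the shortest heavy-atom path between a donor atom and an acceptor atom and flags pairs for which closing the ring through the donor hydrogen would give five or six members. The atom labels, the hand-built connectivity of the favipiravir enol tautomer, the choice of only the carboxamide oxygen as acceptor, and the function names are all assumptions made for the example. The check ignores geometry, so a flagged pair is only a candidate.

```python
from collections import deque


def bond_distance(adjacency, start, goal):
    """Number of bonds on the shortest path between two atoms (breadth-first search)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom == goal:
            return dist
        for neighbor in adjacency[atom]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def intramolecular_hbond_candidates(adjacency, donors, acceptors):
    """Return (donor, acceptor, ring_size) for pairs closing a 5- or 6-membered ring.

    Ring size = heavy atoms on the donor-to-acceptor path (bonds + 1) plus the
    donor hydrogen, i.e. bonds + 2.
    """
    hits = []
    for donor in donors:
        for acceptor in acceptors:
            if donor == acceptor:
                continue
            bonds = bond_distance(adjacency, donor, acceptor)
            if bonds is not None and bonds + 2 in (5, 6):
                hits.append((donor, acceptor, bonds + 2))
    return hits


# Heavy-atom skeleton of the favipiravir enol tautomer (labels are illustrative).
BONDS = [("N1", "C2"), ("C2", "C3"), ("C3", "N4"), ("N4", "C5"), ("C5", "C6"),
         ("C6", "N1"), ("C2", "C7"), ("C7", "O8"), ("C7", "N9"), ("C3", "O10"),
         ("C6", "F11")]
adjacency = {}
for a, b in BONDS:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

donors = ["O10", "N9"]   # hydroxyl O-H and carboxamide N-H
acceptors = ["O8"]       # carboxamide carbonyl oxygen only, for brevity

print(intramolecular_hbond_candidates(adjacency, donors, acceptors))
# → [('O10', 'O8', 6)]
```

With this input the hydroxyl-to-carbonyl pair is flagged as closing a six-membered ring, which corresponds to the intramolecular hydrogen bond drawn for the enol tautomer in Figure 1, while the carboxamide N-H and its own carbonyl oxygen are too close to qualify.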

Figure 1. The top two molecular structures represent the enol and keto tautomers of favipiravir. The bottom two molecular structures depict potential intramolecular hydrogen-bond formation, which is shown by the dashed lines between the hydrogen and oxygen atoms.


Table 1. Logarithms of the published experimental molar solubilities, $\log C_{S,\mathrm{organic}}$, and logarithm of the water-to-1-octanol partition coefficient, log P, for favipiravir dissolved in select organic solvents at 298.15 K.

Table 2. Abraham Model Equation Coefficients in Equation (1) and Equation (2) for Various Processes.