Application of a General Computer Algorithm Based on the Group-Additivity Method for the Calculation of Two Molecular Descriptors at Both Ends of Dilution: Liquid Viscosity and Activity Coefficient in Water at Infinite Dilution

The application of a commonly used computer algorithm based on the group-additivity method for the calculation of the liquid viscosity coefficient at 293.15 K and the activity coefficient at infinite dilution in water at 298.15 K of organic molecules is presented. The method is based on the complete breakdown of the molecules into their constituting atoms, which are further subdivided by their immediate neighborhood. A fast Gauss–Seidel fitting method using experimental data from the literature is applied to calculate the atom groups’ contributions. Plausibility tests have been carried out on each of the calculations using a ten-fold cross-validation procedure, which confirms the excellent predictive quality of the method. The goodness of fit (Q²) and the standard deviation (σ) of the cross-validation calculations for the viscosity coefficient, expressed as log(η), were 0.9728 and 0.11, respectively, for 413 test molecules, and for the activity coefficient log(γ)∞ the corresponding values were 0.9736 and 0.31, respectively, for 621 test compounds. The present approach has proven its versatility in that it enabled the simultaneous evaluation of the liquid viscosity of normal organic compounds as well as of ionic liquids.


Introduction
Many computational methods exist for the prediction of physico-chemical properties of organic compounds, such as those derived from (quantum-)theoretical considerations, multiple linear regression approaches based on correlations with further properties of interest, cluster analysis, principal component analysis or group-additivity methods. Among these, the group-additivity method has gained increasing interest in recent years due to its wide-ranging applicability for the evaluation of numerous molecular descriptors. Recently, two papers [1,2] demonstrated its versatility in that a single computer algorithm using a radical form of the atom-groups additivity method was able to reliably predict ten molecular descriptors: heats of combustion, solvation, sublimation and vaporization, entropy of fusion, partition coefficient logPo/w, solubility logSwater, refractivity, polarizability and toxicity. The availability of the experimental values of the liquid viscosity coefficient (η) and the activity coefficient at infinite dilution in water log(γ)∞ of several hundred organic compounds from various literature references gave reason to try to extend the atom-groups additivity approach described in [1] to these two descriptors, which coincidentally are both at the extreme ends of dilution.
The viscosity is an important property of liquid compounds; knowledge of it is required in particular for the bulk transport of liquids as well as in the field of ionic liquids.
While most of the group definitions are self-explanatory, group No. 3 requires some additional explanation: in drawings of compounds such as imidazolium (or guanidinium, for that matter) the positive charge is usually assumed to be localized on one of the nitrogen atoms, which inherently implies an asymmetrical charge distribution in these molecules where there is none. This creates an ambiguity in truly asymmetrical cases where one or more of these nitrogen atoms carry additional, different substituents: on which nitrogen atom should the positive charge then be positioned? The best answer is given by quantum-theoretical calculations, e.g., by the extended Hückel MO (EHMO) method [10], which show that the positive charge is in fact essentially centered on the carbon atom between the nitrogen atoms (see Figure 1). This is also true for analogous compounds carrying alkyl substituents at the nitrogen atoms (which would be represented by the atom group No. 4 in Table 1). Accordingly, the representation of, e.g., the 2-methylimidazolium ion applied in the present group-additivity calculations has the positive charge assigned to the carbon atom at position 2, which in turn is bound to the two neighboring nitrogens by aromatic bonds. (Analogously, the positive charge of the guanidinium ion would be assigned to the central carbon atom, which is bound to each of the three nitrogen atoms by aromatic bonds.)
Following the calculation procedure described in [1], the computer algorithm breaks down the molecule to be evaluated into its constituting atom groups and checks for their occurrence in the respective group-parameters table generated earlier. For the molecule to be eligible for descriptor evaluation, the algorithm ensures not only that each of the molecule's atom groups is found in the group-parameters table but also that each of the groups found is "valid", i.e., that it has been represented in the preceding parameters-evaluation process by at least three independent molecules with known experimental descriptor values. Provided that these two requirements are fulfilled, the descriptor calculation follows the general Equation (1), where Y is the descriptor, $a_i$ and $b_j$ are the contributions, $A_i$ is the number of occurrences of the ith atom group, $B_j$ is the number of occurrences of the jth special group and C is a constant:

$$Y = \sum_i a_i A_i + \sum_j b_j B_j + C \qquad (1)$$

For each of the two descriptors presented, a separate group-parameters table has been prepared. The evaluation of the group contributions according to the detailed description in [1] was immediately followed by a plausibility test based on a ten-fold cross-validation procedure, wherein it was ensured that each of the compounds has been introduced alternately as a test and as a training sample. The results are collected in rows A to H at the end of each parameters table. The correlation diagrams and histograms in the respective sections below show the results of the training and cross-validation calculations in black and red colors, respectively.
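The eligibility check and the summation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ChemBrain implementation: it reads Equation (1) from the symbol definitions in the text as the sum of the atom-group terms, the special-group terms and the constant C, and the group names, contribution values and the dictionary layout are invented for the example.

```python
from collections import Counter

MIN_MOLECULES = 3  # a group counts as "valid" only if at least three molecules backed it


def predict_descriptor(atom_groups, params, special_groups=None,
                       special_params=None, constant=0.0):
    """Return Y as sum(a_i*A_i) + sum(b_j*B_j) + C, or None if not eligible.

    atom_groups    : list of atom-group names found in the molecule
    params         : {name: {"contribution": a_i, "molecules": count}} (assumed layout)
    special_groups : {name: occurrences B_j} (optional)
    special_params : {name: contribution b_j} (optional)
    """
    total = constant
    for group, occurrences in Counter(atom_groups).items():
        entry = params.get(group)
        # Requirement 1: the group is in the table; requirement 2: it is "valid".
        if entry is None or entry["molecules"] < MIN_MOLECULES:
            return None
        total += entry["contribution"] * occurrences
    for group, occurrences in (special_groups or {}).items():
        if special_params is None or group not in special_params:
            return None
        total += special_params[group] * occurrences
    return total


# Hypothetical parameters table; the numbers are made up for illustration.
table = {
    "CH3": {"contribution": 0.5, "molecules": 10},
    "CH2": {"contribution": 0.25, "molecules": 8},
    "rare": {"contribution": 1.0, "molecules": 2},  # fewer than three molecules
}
print(predict_descriptor(["CH3", "CH2", "CH2", "CH3"], table, constant=-1.0))
print(predict_descriptor(["CH3", "rare"], table))  # None: group is not valid
```

Returning `None` rather than a partial sum mirrors the rule in the text that a molecule is excluded from the calculation as soon as one of its atom groups is missing or insufficiently represented.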
In the calculation processes of the two group-parameters tables it turned out that for an optimal viscosity-coefficient prediction the second summand in Equation (1) was not needed as there was no special group required, whereas for the prediction of the activity coefficient log(γ)∞ the best value for the constant C was zero.
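To make the fitting step more concrete, the following Python sketch applies Gauss–Seidel iteration to the least-squares normal equations of a small occurrence matrix. It is an assumption about how such a fit can be set up, not a reproduction of the procedure detailed in [1]; the function name, the toy data and the stopping tolerance are all hypothetical. A constant C could be modeled as an extra column of ones and special groups as extra columns, which is how the two observations above (no special groups for viscosity, C = 0 for the activity coefficient) would show up in such a setup.

```python
def fit_contributions(rows, targets, sweeps=500, tol=1e-12):
    """Least-squares fit of contributions x with rows * x ~ targets.

    rows    : one list per molecule holding the occurrence counts of each group
    targets : experimental descriptor values, one per molecule
    Solves the normal equations (R^T R) x = R^T y by Gauss-Seidel sweeps.
    Assumes every group occurs at least once and the columns are independent.
    """
    n = len(rows[0])
    matrix = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    rhs = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(n)]
    x = [0.0] * n
    for _ in range(sweeps):
        largest_change = 0.0
        for i in range(n):
            # Use the newest values of all other unknowns (Gauss-Seidel update).
            rest = sum(matrix[i][j] * x[j] for j in range(n) if j != i)
            new_value = (rhs[i] - rest) / matrix[i][i]
            largest_change = max(largest_change, abs(new_value - x[i]))
            x[i] = new_value
        if largest_change < tol:
            break
    return x


# Toy data: four molecules, two atom groups, values generated from (0.5, 0.25).
occurrences = [[2, 1], [2, 2], [2, 3], [1, 4]]
values = [1.25, 1.5, 1.75, 1.5]
fitted = fit_contributions(occurrences, values)
print([round(v, 6) for v in fitted])
```

Because the normal-equations matrix is symmetric and positive definite whenever the columns are linearly independent, the Gauss–Seidel sweeps converge without any pivoting, which is one reason the scheme is attractive for large, sparse parameter tables.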
Looking at the rightmost column of the group-parameters tables, which shows the number of molecules representing a given atom group, one may notice that some of the atom groups are represented by fewer than three molecules. These atom groups are therefore not applicable for descriptor predictions; nevertheless, they have been left in the parameters tables for potential future use in this ongoing project. As the parameters tables show, calculations have been restricted to molecules containing the elements H, B, C, N, O, P, S, Si and/or halogen.

General Remarks
(1) Cross-validation data in the following figures are superimposed in red.
(2) Generally, compounds whose experimental values deviated from prediction by more than three times the cross-validated standard error have been excluded from group-parameters calculations and have been collected in a list of outliers.
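The exclusion rule in remark (2) can be illustrated with a short Python sketch. The definitions used here are assumptions, since the text does not spell them out: Q² is taken as one minus the ratio of the residual to the total sum of squares, and the standard error as the root-mean-square deviation between experimental and cross-validated values. The function names and the sample numbers are hypothetical.

```python
from math import sqrt


def cross_validated_stats(experimental, predicted):
    """Return (Q2, sigma) for cross-validated predictions (assumed definitions)."""
    n = len(experimental)
    mean_exp = sum(experimental) / n
    ss_res = sum((e - p) ** 2 for e, p in zip(experimental, predicted))
    ss_tot = sum((e - mean_exp) ** 2 for e in experimental)
    return 1.0 - ss_res / ss_tot, sqrt(ss_res / n)


def flag_outliers(experimental, predicted, sigma, factor=3.0):
    """Indices of compounds deviating by more than factor * sigma."""
    return [i for i, (e, p) in enumerate(zip(experimental, predicted))
            if abs(e - p) > factor * sigma]


# Made-up values; sigma = 0.11 is the viscosity figure quoted in the abstract.
exp_values = [1.0, 2.0, 3.0, 4.0]
cv_values = [1.05, 2.5, 3.0, 4.0]
print(flag_outliers(exp_values, cv_values, sigma=0.11))
```

In practice the flagged compounds would be removed and the parameters refitted, so the standard error used as the threshold is itself a product of the cleaned data set.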
The correlation diagram in Figure 2 reveals very good agreement between the training and cross-validation results, confirmed by the close similarity of the correlation coefficients R² and Q² (lines B and F in Table 2). The corresponding histogram in Figure 3 exhibits a slightly distorted Gaussian bell curve, the maximum of which is shifted by 0.02 towards the negative deviations (indicating smaller experimental values than predicted), which might be ascribed to the relatively small number of experimental data.
Of particular interest is the question as to how well the prediction of the viscosity of ionic liquids performs. For 15 of the presently 33 ionic liquids, for which experimental data were available, predictions were possible. Their log(η) ranged between 1.951 and 4.3732; hence, in Figure 2 they are all positioned at the upper half of the correlation diagram. Evidently, their data points are in excellent conformance with those of the "normal" compounds, which may be surprising considering the additional interactive forces acting between their ionic moieties, but these extra effects are inherently considered in the assigned atom-groups parameters listed in Table 1. Nevertheless, five out of the 33 ionic liquids had to be removed from calculations as their deviation exceeded prediction by far more than three times the cross-validated standard deviation. They are collected in the list of outliers, available in the Supplementary Materials.
How do these results compare with the prediction methods published earlier? Quantitative structure-activity relationship (QSAR) techniques, described in [7], applied to a set of 237 compounds and using 18 physical properties as input into multiple linear as well as partial least squares regression calculations, yielded correlation coefficients of 0.933 and 0.931, respectively, and corresponding standard errors of 0.144 and 0.146. Later, a quantitative structure-property relationship (QSPR) study [6], based on 361 compounds and using five molecular structural descriptors including electrostatic and quantum chemical properties, resulted in a correlation coefficient of 0.854 and a standard error of 0.22. The multiple linear regression and artificial neural network (ANN) back-propagation methods, outlined in [4], based on 361 compounds and nine physical and structural descriptors, yielded correlation coefficients of 0.92 and 0.93, respectively, and corresponding standard errors of 0.17 and 0.16 units. In a later paper [5], the same authors presented slightly better results with a set of 440 compounds, using the same ANN approach and input descriptors, which produced correlation coefficients for the training, validation and test sets of 0.956, 0.932 and 0.884, respectively, with corresponding standard errors of 0.122, 0.134 and 0.148 units. Comparison of these results with the data collected at the bottom of Table 2 shows that none of the cited prediction methods achieved the accuracy of the present approach. Beyond this, the present method even allows a reliable prediction of the viscosity coefficient at 20 °C simply by hand, using paper and pencil, Table 2 and Equation (1). The only drawback is the condition that each atom group in a given molecule must be found in the table and that it is preferably represented by three or more molecules (shown in the rightmost column).
A scan of the database of currently 30,125 compounds, which can be viewed as representative for the entire structural coverage of chemicals, reveals that at present this is the case for about 39% of all compounds, due to the relatively small experimental basis of only 501 compounds.

Activity Coefficient at Infinite Dilution in Water
Generally, the activity coefficient γ∞ has been published in its logarithmic form log(γ)∞ and has been measured at 298.15 K. In the cases where γ∞ itself or its natural logarithm was cited, the data have been converted into their decimal logarithm. In addition, only values measured at or reduced to 298.15 K have been considered. Primary sources of experimental data have been the collective reports mentioned earlier [8,9]. Additional data have been found for 1-propoxypropan-2-ol [73], several alkyl and alkenyl alcohols and alkylbenzenes [74], valeric and crotonic aldehyde [75], variously substituted benzoic acids [76,77], naphthoic acids [78,79], isatin [80], 2-cyanoguanidine [81], florfenicol [82], thiamphenicol [83] and various sulfonamides [84,85]. In total, the number of compounds with experimental log(γ)∞ data amounted to 709, of which 34 turned out to be outliers (a list of them is available in the Supplementary Materials), as their experimental values differed from prediction by more than three times the cross-validated standard error. The remaining 675 compounds represented 113 atom groups, of which 75 have been defined as valid for predictions (see line A of Table 3). A number of trial calculations, which tentatively included or excluded certain special groups, revealed that consideration of alkanes and unsaturated hydrocarbons (special groups 115 and 116 in Table 3) as separate entities significantly improved the values of the correlation coefficient R² (from 0.9621 to 0.9788) as well as the corresponding standard error (from 0.37 to 0.27), whereas the inclusion of intramolecular hydrogen bonds (special group 114) only had a minor effect, probably due to the small number of only six examples. Nevertheless, in view of future data input this latter group has been left in the parameters table.
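For reference, the conversion of literature values cited as natural logarithms into the decimal form used throughout this section is the standard change of base, written out here as a reminder rather than as a statement of how the authors processed the data:

```latex
\log_{10}\gamma^{\infty} \;=\; \frac{\ln\gamma^{\infty}}{\ln 10} \;\approx\; 0.4343\,\ln\gamma^{\infty}
```

A value cited directly as γ∞ is simply replaced by its decimal logarithm.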
The correlation diagram in Figure 4 shows very good agreement between the training and cross-validation test values, which is reflected in the very similar values of R² and Q². The intercept and slope of the regression line confirm that in this case a constant C is not required in the prediction calculations pursuant to Equation (1). Due to the fairly limited number of samples, on the other hand, the histogram in Figure 5 does not exhibit a perfect Gaussian bell curve but at least its maximum is reasonably well centred at the zero deviation point.

Comparison of the present results with those published in earlier articles [8,9] reveals that they lie in the same range of prediction accuracy. Abraham's method, described in [8], is based on five descriptors (excess molar refractivity, dipolarity/polarizability, overall or summation hydrogen bond acidity and basicity, and the McGowan volume) and yielded a correlation coefficient R² of 0.977 and a leave-one-out cross-validation correlation coefficient Q² of 0.976, with corresponding standard errors of 0.284 and 0.29, respectively, for 655 structurally diverse compounds. The ant-colony optimization method, outlined in [9], limited to 105 hydrocarbons and based on four topological descriptors and the refractivity, resulted in a correlation coefficient R² of 0.9893 and a standard error of 0.3996 for the calibration set, and a Q² of 0.9891 and a standard error of 0.3865 for the prediction set. The main advantage of the present method lies in its ease of use in that, just as in the previous subsection, only a simple 2D drawing is needed to find all of the compound's atom groups and then sum up their contributions according to Table 3. In addition, for hydrocarbons, each carbon atom would contribute according to entry 115 or 116 in Table 3. The only disadvantage of the present approach lies in its limited range of molecules for which log(γ)∞ is calculable, due to the relatively small number of "valid" atom groups as a result of the limited number of experimental data, a weakness, however, which is gradually being remedied by the input of further experimental data in this ongoing project. At present, the log(γ)∞ value has been evaluated for 51% of the compounds of the current database.

Conclusions
Ease of use and reliability of the predictions were the goals of the present work. While the former was in the hands of the method developer, the latter depended heavily on the experimental data provided by the countless scientific publishers. The present results, together with those outlined in the previous publications [1,2], demonstrate the enormous versatility of the atom-groups additivity method, particularly when the radical breakdown of the molecules is applied as described. Including the present ones, the following 13 molecular descriptors can be calculated at once (some of them indirectly) in a split second on a desktop computer: the heats of combustion, formation, solvation, sublimation and vaporization, the entropy of fusion, the partition coefficient logPo/w, the solubility logSwater, the refractivity, the polarizability, the toxicity against the protozoan Tetrahymena pyriformis and, as has been demonstrated here, the viscosity coefficient log(η) and the activity coefficient log(γ)∞. The radical breakdown of the molecules inevitably leads to a large number of particularized atom groups and thus excludes from any calculation those molecules for which not all atom groups have a defined contribution. This disadvantage is well compensated, on the one hand, by the accuracy of prediction for those compounds for which calculation is possible, in most cases even by the simple paper-and-pencil approach of finding the atom groups in a given molecule and summing up their contributions, and, on the other hand, by the enablement of a standardized computer algorithm, which allows a simple extension of each of the atom-groups parameters lists upon the input of any further experimental data, which in turn would extend the scope of calculable molecular structures. The reliability of the predictions, however, only increases with the accuracy of any future input.
The present work is part of an ongoing project called ChemBrain IXL available from Neuronix Software (www.neuronix.ch, Rudolf Naef, Lupsingen, Switzerland).
Supplementary Materials: The following files are available online. The list of compounds, their experimental and calculated data and 3D structures of the viscosity-coefficient calculations are available under the names of "S1. Experimental and Calculated Viscosity-Data Table.doc" and "S2. Compounds List of Viscosity Calculations.sdf". A list of their outliers has been added under the name of "S3. Compounds List of Viscosity Outliers.xls". The set of experimental and calculated data of activity coefficients calculations is available under the name of "S4. Experimental and Calculated Activity-Coefficient-Data Table.doc", the corresponding list of compounds under the name of "S5. Compounds List of Activity-Coefficient Calculations.sdf" and the respective outliers list under the name of "S6. Compounds List of Activity-Coefficient Outliers.xls". The figures are available as tif files and the tables as doc files under the names given in the text.