Using Variable and Fixed Topological Indices for the Prediction of Reaction Rate Constants of Volatile Unsaturated Hydrocarbons with OH Radicals

Volatile organic compounds (VOCs) play an important role in different photochemical processes in the troposphere. In order to predict their impact on ozone formation processes a detailed knowledge about their abundance in the atmosphere as well as their reaction rate constants is required. The QSPR models were developed for the prediction of reaction rate constants of volatile unsaturated hydrocarbons. The chemical structure was encoded by constitutional and topological indices. Multiple linear regression models using CODESSA software was developed with the RMS(CV) error of 0.119 log units. The chemical structure was encoded by six topological indices. Additionally, a regression model using a variable connectivity index was developed. It provided worse cross-validation results with an RMS(CV) error of 0.16 log units, but enabled a structural interpretation of the obtained model. We differentiated between three classes of carbon atoms: sp2-hybridized, non-allylic sp3-hybridized and allylic sp3-hybridized. The structural interpretation of the developed model shows that most probably the most important mechanisms are the addition to multiple bonds and the hydrogen atom abstraction at allylic sites.


Introduction
The reactions of volatile organic compounds (VOCs) of biogenic and anthropogenic origin play an important role in different atmospheric photochemical processes.The major degradation pathway of VOCs in the atmosphere is reaction with hydroxyl radicals.Although extensive experimental work has been carried out during the last several years on measurement of kinetic parameters, experimental reaction rate constants are only available for approximately 500 VOCs [1][2][3].In order to reduce analysis time and costs of such measurements it would be useful to develop a prediction model to estimate these values.Recently, several quantitative structure property relationship (QSPR) prediction models were developed to predict reaction rate constants for the reaction of OH radicals with different organic species.These models estimated reaction rate constants based on empirical fragment contribution technique [4][5][6], bond dissociation energy [7][8][9], NMR chemical shift data [10], ionization potentials [11][12][13], molecular orbital calculations [14][15][16][17][18][19][20][21] and various structural descriptors [22][23][24][25][26][27].A comprehensive overview [28] of these method and their partial evaluations were recently published [29].These methods offered models with, at least, moderate prediction capabilities.However, their use was mostly restricted because of a limited knowledge about the reaction pathways, limited databases with experimental molecular properties, or due to the extensive computations necessary for ab initio molecular orbital calculations.All these drawbacks are bypassed in QSPR models using molecular structural descriptors.
Constitutional, topological, geometric, electronic, and quantum chemical indices have been already used for the prediction of different chemical or physical properties [30,31].Among them topological indices represent a very promising group of structural descriptors, not only because they are easy to calculate, but because in most cases they offer good correlations with the modeled property.Wiener [32] and Platt [33] did the pioneering work in the field of topological indices in late 1940s, however it was not until the early 1970's that the idea of representing chemical structures by graph invariants was resurrected and further developed by several authors [34][35][36][37][38].Because of their ability to yield good prediction models [39,40], in more recent years we have seen a rapid expansion of novel molecular descriptors derived from molecular graphs.The question can be raised whether there is a true need for hundreds of topological descriptors even though they may represent different structural features of the molecule in view that many of them are highly interrelated.Typically half a dozen descriptors are used to obtain satisfactory regressions.An alternative to this was suggested a decade ago [41,42] but only recently has been re-employed.Instead of using a large pool of descriptors to extract half a dozen of topological indices for the prediction models, a single index or very few indices involving variable parameters were constructed and optimized during the modeling procedure [43,44].We should also mention here the work of Pogliani [45], who used limited combinations of dozen generalized connectivity indices also involving variable parameters.
In this work we have tested the coding abilities of topological indices in prediction of gas-phase reaction rate constants for the reaction of different organic compounds with OH radical.A multiple linear regression (MLR) model was used as a modeling technique.In addition, the prediction ability of the variable connectivity index (χ 1 f ) was tested.The final prediction results were compared with previously published data.

Data set (Unsaturated Hydrocarbons)
In this study we considered a data set taken from the literature [26], which contains 58 unsaturated acyclic and cyclic organic compounds with three to ten carbon atoms (Table 1).The reaction rate constants were obtained at a temperature of 298 K.The listed numerical values are the negative logarithm of the reaction rate constant (k OH ) reported in cm 3 molecule -1 s -1 .The test set compounds are marked by asterisk.We have selected the same test set as in the literature [26] in order to compare prediction capabilities of the models.All molecular structures were created using HyperChem TM .By using the CODESSA software [46,47] 76 invariants were calculated for the structures, which were either informational or topological molecular descriptors.These descriptors contain information about the connectivity between atoms, molecular branching and molecular symmetry, and are sensitive to variation in shape between molecules of similar size.
The CODESSA software was also used for the selection of best subset of structural descriptors by minimizing the error in cross-validation using MLR model.Descriptors with no variation between structures, descriptors that did not cover the whole modeling space, and descriptors with very low coefficient of determination when used as single variable were omitted from the further consideration.We excluded descriptors giving the coefficient of determination (r 2 ), where r is the coefficient of the regression, smaller than 0.01.Subsequently we also eliminated highly inter-correlated descriptors on the basis that they represent duplication.Thus, if the squared pair-wise correlation coefficient for two descriptors exceeded 0.99, one of the descriptors was omitted.Both conditions for omitting descriptors (the low coefficient of determination and the high inter-correlation) are somewhat arbitrary, and as it is known may filter out otherwise useful descriptors [48].As a result the selected combinations that under such conditions CODESSA generate may not be unique, but will produce acceptable regression equations.
The descriptors that passed the selection process were sorted in the decreasing order of r 2 when used in a simple linear regression model.In this way the best ten descriptors were selected.To these pre-selected descriptors a new descriptor was added from the set of the remaining descriptors if it did not show pair-wise correlation higher than 0.8.Finally the ten best two parameter models showing the highest Fisher ration in the two-parameter models were selected and used as the working set.New descriptors were added until an MLR model with the prescribed number of variables was obtained.During each step of the formation of the MLR model a new descriptor is added to the working set if it does not correlate with the descriptors already included and produces higher normalized Fisher ratio F. Again we accepted a descriptor if r 2 is below 0.8 and if F new > n/(n+1)•F old , where n is the number of descriptors in the new working set.In this way through stepwise addition procedure we arrived to the final ten correlations with the highest r 2 .The derived correlations were tested for their cross-validation capabilities by leave-one-out cross validation procedure.The model with the highest q 2 (squared correlation coefficient in cross-validation) was selected as the best n-parameter MLR model.In order to test significance of descriptors selected to the prediction model a t-test and partial-F statistics were calculated for each parameter of the model.A more detailed description of stepwise selection procedure can be found in CODESSA manual [47].Prediction capability of final models was evaluated by using test set.
We feel that very low limits for elimination that were chosen, that is, one parameter and pair-wise correlation coefficient were less than 0.01 and greater than 0.8, respectively, offer considerable assurance that important descriptors would not been omitted in the selection process.As mentioned it was already reported in the literature that linear correlation of highly correlated parameters, which individually do not correlate well with the studied property, can yield a very good correlation [48].But at the same time the limits of elimination are strict enough that the chances of obtaining insignificant coefficients in MLR model are low.

The variable connectivity index (χ 1 f )
Only rarely is a single descriptor capable of describing sufficiently well the molecular structure to allow a simple regression to be used for the modeling of chosen property.As a rule, additional descriptors are needed to get regression equations if one hopes to obtain satisfactory prediction of molecular properties.In constructing MLR basically two alternative approaches are possible.One can apply statistical methods to select information-rich significant descriptors from a large pool of descriptors or one can use a smaller set of well-selected structurally related descriptors that form a basis that sufficiently well covers the structure-property space for molecules and properties considered.The first approach, which does not require a critical examination of descriptors, is often used, but it offers a limited interpretation of the derived regression model.The second approach, based on combinations of structurally related descriptors received a limited attention [49,50], except perhaps when confined to the connectivity index [35] ( 1 χ) and the higher order connectivity indices m χ [51], including also the valence connectivity indices of Kier and Hall [38,52].Such an approach enables less cumbersome structural interpretation of the regression analysis.Both these approaches, the use of a pool of descriptors and the use of basis descriptors, could dramatically improve the quality of the regression analysis if the selected descriptors were designed specifically for the particular application.One of the possibilities is to modify or adjust an already available descriptor for the particular property.
One of apparent disadvantages of the original connectivity index 1 χ is its inability to differentiate between different types of atoms and bonds.Kier and Hall recognized a need for the modification of the connectivity indices to describe molecules with heteroatoms.Similar modifications have been introduced since then for other graph theoretical descriptors that were initially designed for hydrocarbons [53].In these modifications one assigns different weights to heteroatoms and multiple bonds.However, when prescribing relative contribution of heteroatoms or multiple bonds such approaches involve some degree of arbitrary decisions.Although in a number of applications the valence connectivity indices gave better results than would be otherwise the case, in some cases discrimination of heteroatoms by the use of valence connectivity indices produce even worse results than the results based on the ordinary connectivity index.An example, already illustrated in the book of Kier and Hall [38] was the case of water solubility of aliphatic ethers, which is represented better with the simple connectivity index 1 χ (that does not differentiate between carbon and oxygen atoms) than the valence connectivity index 1 χ v .Dozens of similar correlations involving selected properties of ethers and alcohols have been listed in brief review on the developments involving the connectivity index after its 25 years of use [54].At that time apparently not much attention was given to this anomaly, which implied that different properties might require different optimal weights to be used when discriminating among heteroatoms.It was only with the development of the variable connectivity index that this has been recognized, even though such results could have been expected because individual heteroatoms influence different properties of the molecule in different ways.
The idea behind the variable connectivity index, where contributions of different atoms or different kind of bonds are varied during the optimization process, is to allow relative contributions of different atoms and bonds to be adjusted so to produce the smallest standard error for the regression.Hence, instead of screening a large pool of descriptors in order to select few for describing regression, variable molecular descriptors allow one to construct relatively small number of relevant descriptors, from which the best are selected.Calculation of the variable connectivity index, for which we use symbol with superscript f to suggest that it is a function rather than a constant, hence (χ 1 f ), for a hypothetical linear molecule YX 3 Z (Figure 1) is illustrated in Table 2. Y  X  X  X  Z First we augment the adjacency matrix of a graph by introduction of non-zero diagonal entries.The variables x, y, and z on the diagonal of the matrix will alter the atomic contributions to the connectivity index.In general one can assign individual weights to every atom present in the molecule, but this would lead to a large number of variables that need to be determined.Such an optimization would require large data sets, and besides being computationally intensive, it may produce non-general models in which there may be less significant differences among structurally similar molecular fragments.To avoid these problems one should try to construct models in which structurally similar fragments (atoms and bonds in similar immediate environment, in particular) should be described by the same variable parameters, which will consequently maintain the number of variables rather low.
As has been illustrated in the literature [41][42][43][44][55][56][57][58][59][60][61][62][63][64], by using variable descriptors the quality of correlations can be significantly improved.Equally important is that use of variable descriptors may allow a novel structural interpretation of the results when the optimized weights could be related to other physico-chemical properties of compounds.For instance, the connectivity index is a bond additive quantity in which the bond contributions are given by (m•n) -1/2 .As a consequence primary, secondary, tertiary, and quaternary carbon atoms make different contributions to bond-additive properties.Here m and n represents row sums in adjacency matrix of the atoms, which are forming bonds.By a close evaluation of the influence of optimized weights on the connectivity index we can get information on which structural features of the molecule may play more important role for the particular structure-property study.

Search for optimal weights in χ 1 f and evaluation of quality of the models
Optimal values of weights used in calculation of χ 1 f were obtained using a Simplex [65] optimization procedure.At the start of optimization one assigns to various weights in the augmented adjacency matrices random values, which however are subject to certain limits to avoid occurrence of complex numbers.The Simplex algorithm was used to find the weights that would minimize the standard error in a linear correlation log k OH = a f(χ 1 f ) + b.As is known the Simplex optimization may rest in any of present local minima, rather than total minimum.In order to reduce chance of selecting local minima, the simplex optimization was performed several times starting with different set of random numbers.
The generality of the obtained model was evaluated by leave-one-out cross validation procedure.The root-mean-squared (RMS) error of test set served as an estimate of prediction capabilities of the linear regression model created from optimized connectivity index.

Results and Discussion
Hydroxyl radical reactions are grouped into four reaction classes: (1) hydrogen atom abstraction, (2) addition to multiple bonds, (3) addition to aromatic rings, and (4) reaction to aromatic rings [1,6].Since all 58 compounds contained one or more C=C double bonds, all reactions belong to second class of reactions with OH radicals, that is, the radical addition to multiple bonds.Training set compounds were used during model creation.

RMS retrieved vs. experimental predicted vs. experimental RMS
At the beginning 76 informational and topological descriptors for each compound were calculated using CODESSA software.After elimination of descriptors that did not satisfy our criteria on being of sufficient interest we were left with a pool of 34 descriptors to be tested in the step-wise selection for selection of optimal combination of descriptors.The MLR models with up to 10 parameters were selected based on the best cross-validation capabilities obtained by leave-one-out cross-validation procedure.The influence of the number of selected descriptors of the MLR model on r 2 , q 2 , and RMS CV values is shown in Figure 2.
We can see that r 2 and q 2 increase from about 0.4 and 0.35 respectively, the values for single descriptor, to the asymptotic values close to 0.90 and 0.85, respectively as the number of descriptors has increased from one to ten.In the same figure we also illustrated the decrease of the RMS CV values.
The error bars in Figure 2 for RMS CV represent calculated standard deviation of estimates.Though the increase of both r 2 and q 2 and the decrease of RMS CV continues until we use nine descriptors, we can see that probably only the first half a dozen descriptors make the dominant contributions.It would be difficult to argue that the small improvements in q 2 and RMS CV values beyond the contributions of the first six descriptors are significant in view of possible fluctuations in the calculations if experimental values are slightly altered (with experimental errors limits).We can see a significant improvement in RMS CV until six descriptors are included in the model.Fluctuations beyond this point are within one standard deviation limit suggesting that RMS CV becomes approximately constant from that point on.Additional descriptors would lead to overfitting of the model and will decrease the signal to noise ratio for the model.Taken into account these conditions we have decided to use 6-parameter MLR model for the prediction of log k OH values.The selected structural descriptors together with coefficients of the a 6-parameter MLR model are shown in Table 3.The regression parameters presented in Table 4 were obtained from the MLR model constructed with 53 unsaturated hydrocarbons, which are present in the training set.We arrived at the result that a 6-dimensional vector, the components of which were six topological indices, characterizes a chemical structure.Most of the selected topological indices encode information on molecular size and the branching pattern.This is in particular the case with the Randić index, the Kier & Hall indices and Kier's kappa shape index.Indirectly these indices store information on steric properties of molecules, which are one of the determining factors of the reaction rate.In order to test the significance of individual descriptor a t-test and partial F factor were calculated.Both tests show significance of all descriptors included in model.
The cross-validation capability of the developed six-descriptor MLR model was tested by leaveone-out cross-validation procedure.The RMS CV error was found to be 0.119 log units.On the other hand test set was used to determine prediction ability of developed model.The obtained RMS value was 0.097 log units.The graph of the calculated versus the experimental log k OH values is shown in Figure 3. Visual inspection of the plot shows 1,3,5-cycloheptatriene (compound 23 in Table 1) as a potential outlier.The difference between the experimental log (k OH ) and the calculated value for this compound exceeds 0.45 log units.Its removal visibly reduced the RMS error (about 17 %).Because the studied data set was already quite small the likely outlier was not removed from the study.The large resulting error need not to be due to an experimental nature but due to limitations of coding capabilities of descriptors used.Another reason to keep the outlier was that we wanted to compare our model with other QSPR studies on the same data set [6,16,24,26], where 1,3,5-cycloheptatriene was included in the modeling procedure.Finally without a further close scrutiny it cannot be decided whether the large error accompanying outliers of experimental nature or is due to limitations of coding capabilities of descriptors used.
A direct comparison of calculated results with the literature data is difficult.The models discussed in the literature used residual sum of squares (RSS) for the model validation.Some of the predicted values represent true predicted values, but on the other hand the remaining data are actually retrieved values, that is, they were used for the model creation as well as prediction.Therefore the calculated RSS value slightly overestimates the prediction abilities of the model.In order to compare our results to three types of models reported by Bakken and Jurs [26] where also classical training/test set procedure was used, the RSS value as well as the RMS values for training and test set as well were calculated for the first 57 compounds.Corresponding values are 0.101, 0.0965 and 0.569 log units, respectively.Our six-descriptor MLR had similar training capabilities to the five descriptors MLR model of Jurs and Bakken.On the other hand the RSS value and RMS prediction value compare favorably with the literature MLR model, where corresponding values were 0.676 and 0.139 log units, respectively.
Introduction of non-linear modeling technique, like artificial neural networks (ANN), improved the reported calculation capabilities of log k OH .Both reported ANN models with RMS CV errors of 0.074 and 0.065 log units outperformed our MLR model.In both ANN cases five descriptors were used for encoding the chemical structure.
We want finally to compare calculation ability of our MLR model with the simple regression model based on a single variable molecular descriptor (χ 1 f ).We assigned four weights to diagonal elements of the connectivity matrix for the contribution of cyclic or acyclic sp 2 -hybridized or sp 3 hybridized carbon atoms.In this way we have differentiated the influence of cyclic structures as well as between contributions of CC single and CC double bonds.Using the training set, weights were optimized by the Simplex method to get the lowest RMS value for the retrieved data.The changes of RMS value during optimization procedure are shown in Figure 4. No. of cycles

RMS
The optimization of χ 1 f changes the calculation performance of regression model substantially.The r 2 , q 2 and cross-validated RMS improved from 0.399, 0.340, and 0.23 log units, respectively, when ordinary connectivity index (χ 1 ) was used, to 0.716, 0.692, and 0.16 log units, respectively when the variable index was used.At the same time RMS error for the prediction set also decreased from 0.16 to 0.11 log units.The calculated -log k OH values are shown in Figure 5.The calculation performance of the single variable connectivity index is better than MLR model with two descriptors and slightly worse than when three descriptors were used for model creation.Although the calculation ability of variable connectivity index is worse than the best MLR model, its biggest advantage is the possibility of a structural interpretation of the model.
During the creation of the first model carbon atoms were differentiated into four classes, that is, acyclic sp 3 -hybridized, acyclic sp 2 -hybridized, cyclic sp 3 -hybridized and cyclic sp 2 -hybridized.Their corresponding optimal weights were 9.5039 .10 4 , 3.6066 .10 4 , 21.788 .10 4 and 2.5304 .10 4 , respectively (Model 1 in Table 1).We can see that the smallest influence on compounds reactivity have both cyclic and acyclic sp 3 -hybridized carbon atoms.On the other hand cyclic sp 2 -hybridized carbon atoms show a slightly higher reactivity that the corresponding atoms in acyclic structure, which is again in agreement with the experimental observations.Alkenes can undergo halogenation with Cl 2 or Br 2 either by a heterolytic mechanism (resulting in addition which yields vicinal dihaloalkanes) or by a homolytic mechanism (resulting in substitution in allylic position).As an example for the latter type of reaction, when isobutene and one of the two above halogens react at 400ºC, the substitution product is methallyl halide.A milder reagent for allylic bromination is N-bromosuccnimide.
Similarly, oxidation of alkenes with oxygen ("autoxidation") proceeds homolytically by substitution affording allylic hydroperoxides.The reaction with OH radicals also follows the same course starting with hydrogen abstraction from an allylic position.
The reason for this regioselectivity favoring substitution in an allylic position is the different bond dissociation energy (BDE) for various types of C-H bonds.The approximate BDE values (in kJ/mol) are: (i) 445 for sp 2 C-H bonds; (ii) 400 for non-allylic sp 3 C-H bonds; and (iii) 360 for allylic sp 3  C -H bonds.These energy values are due to the remaining lone electron after hydrogen atom abstraction, which would be: (i) a σ-electron in an sp 2 -hybridized orbital; (ii) a π-electron in the nonhybridized p-orbital; and, respectively, (iii) a π-electron in a delocalized molecular orbital extending over three atoms with sp 2 -hybridzation.Of course, the allylic hydrogen atom with the lowest BDE will be abstracted by a halogen atom or an OH free radical.
Therefore we considered that allylic versus non-allylic positions are more important than cyclic/acyclic constitution, and we differentiated in model 2 between three classes of carbon atoms: sp 2 -hybridized, non-allylic sp 3 -hybridized, and allylic sp 3 -hybridized.The optimized weights for corresponding C atoms were 6.2731 .10 4 , 5.0000 .The r 2 , q 2 and cross-validated RMS improved slightly to 0.721, 0.694 and 0.1598.Allthough the improvements are not significant, we reduced the number of weights from four to three.At the same time by choosing different parametrisation of weights we have gained additional information about nature of the reaction.We can see that sp 2 -hybridized C atoms are still the most important contributors to overall reactivity, but the contribution of allylic sp 3 -hybridized carbon atoms is closer to sp 2hybridized than to non-allylic sp 3 -hybridized.This means that most probably the addition to multiple bonds is still most important mechanism but hydrogen atom abstraction at the allylic site can not be neglected.

Conclusions
The aim of this work was to investigate the coding capabilities of constitutional and topological descriptors for a QSPR study for predicting gas-phase reaction rate constants of organic compounds with OH radical.A very simple chemical representation was selected in order to enable even a nonspecialist in theoretical chemistry to use these predictive models.The data set contained unsaturated hydrocarbons where OH radical addition is the dominant reaction pathway.
Initially, MLR model was developed for the prediction of log k OH values from molecular structure.Chemical structure was presented by six-dimensional vector that contained classical topological indices.The cross-validation capability of MLR model was tested by leave-one-out cross-validation procedure.RMS CV error was 0.119 log units.The obtained prediction results were compared with previous literature studies.Our model gave better prediction results than reported MLR model, but was outperformed by non-linear models where artificial neural networks were used.
Beside standard structural indices, the coding capability of a single variable connectivity index was tested.The single variable connectivity index provided better cross-validation results than the MLR models with two descriptors.The biggest advantage of variable connectivity index is possibility of its structural interpretation.Carbon atoms were differentiated into four classes.Both cyclic and acyclic sp 3 -hybridized carbon atoms have the smallest influence on reactivity.The cyclic sp 2 -hybridized carbon atoms show slightly higher reactivity that the corresponding atoms in acyclic structure.
Additional model using variable connectivity index was developed where carbon atoms were differentiated between three classes: sp 2 -hybridized, non-allylic sp 3 -hybridized, and allylic sp 3hybridized.model with just three weights give the same calculation results than the model with four weights.Additionally, the structural interpretation of the model shown that most probably the addition to multiple bonds is most important mechanism but hydrogen atom abstraction at the allylic site cannot be neglected

Figure 2 .
Figure 2. The influence of the number of selected parameters of the MLR model on r 2 , q 2 and RMS CV values

Figure 3 .
Figure 3. Calculated vs. experimental -log(k OH ) values obtained by six parameter MLR model

Figure 5 .
Figure 5. Calculated vs. experimental -log(k OH ) values using variable connectivity index (Model 1) 10 5 and 9.6063 .10 4 , respectively.The calculated -log k values are shown in Fig 6 and as Model 2 in Table 1.

Table 2 .
Augmented adjacency matrix for the hypothetical compound YX 3 Z