Impersonality of the Connectivity Index and Recomposition of Topological Indices According to Different Properties

The connectivity index chi can be regarded as the sum of bond contributions. In this article, boiling point (bp)-oriented contributions for each kind of bond are obtained by decomposing the connectivity indices into ten connectivity character bases and then doing a linear regression between bps and the bases. From the comparison of bp-oriented contributions with the contributions assigned by chi, it can be found that they are very similar in percentage, i.e. the relative importance of each particular kind of bond is nearly the same in the two forms of combinations (one is obtained from the regression with boiling point, and the other is decided by the constructor of the chi index). This coincidence shows an impersonality of chi on bond weighting and may provide us another interpretation of the efficiency of the connectivity index on many quantitative structure-activity/property relationship (QSAR or QSPR) results. However, we also found that chi's weighting formula may not be appropriate for some other properties. In fact, there is no universal weighting formula appropriate for all properties/activities. Recomposition of some topological indices by adjusting the weights upon character bases according to different properties/activities is suggested. This idea of recomposition is applied to the first Zagreb group index M(1) and a large improvement has been achieved.


Introduction
Since the first topological index, the W index [1], was developed and found useful for correlating properties with chemical structure, more and more chemists have come to appreciate its merits and have proposed novel topological indices to construct better QSAR/QSPR models. They compiled numerical characterizations of the chemical structure of molecules by means of various graph matrices, distances, and walk and path counts. A large variety of mathematical operations were then applied to these numerical molecular characters, giving novel topological indices (TIs). Different operators applied to one character can even produce several topological indices, and in this way more than 400 TIs have been proposed. They have contributed greatly to the widespread use of QSAR/QSPR models, but this indiscriminate proliferation of topological indices has drawn criticism from skeptics of the approach in chemistry: "This disorientation in the search of such molecular descriptors often produces no good correlation with any property, and looks very convoluted in their definition" [2]. As a result, among the hundreds of existing topological indices, only a small number are widely used by QSAR/QSPR researchers.
In 1975, Randić proposed the connectivity index χ under the initial name "branching index" [3]. Within a short time Kier and Hall had recognized its merits: not only its ability to describe molecular branching, but also its good correlation with many physical and biochemical properties. They demonstrated its use for a wide range of compounds and properties [4,5]. The connectivity index remains the most widely used topological index. Some researchers have studied why it is so popular and have drawn conclusions about its success. For instance, Randić attributed it to its greater weighting of terminal CC bonds and lesser weighting of internal CC bonds [6]. By partitioning several well-known topological indices into bond-additive terms, it was found that better regressions resulted when terminal CC bonds contributed more and internal bonds contributed less. This conclusion explains, to some extent, the correlation between some properties and molecular structure. However, it is more a qualitative interpretation than a quantitative one. Given the various available weighting formulae for a bond (m, n), such as $(mn)^{-1}$, $(mn)^{-1/2}$, $(mn)^{-1/3}$, ..., which is the best, and why $(mn)^{-1/2}$? Using connectivity character bases, we propose a novel interpretation of the success of the connectivity index χ in terms of the impersonality of its bond weighting formula.
In 1991, the variable connectivity index was proposed by Randić [7] as an alternative to Kier and Hall's valence connectivity index [8] for the characterization of heterosystems in QSPR studies. The difference between the two is that the variable connectivity index uses optimized vertex weights while the valence connectivity index uses fixed vertex weights. The main advantage of the variable connectivity index lies in the fact that different molecular properties require distinct optimal parameters. There is no universal valence connectivity index that would apply to all properties of heteroatomic structures, but the variable connectivity index can adjust to the individual requirements of different molecules and molecular properties. For the same reason, more general variable topological indices are considered in the present work. Using topological character bases, some TIs can be recomposed by adjusting the weights on the character bases according to different properties/activities. This recomposition allows the character bases to realize their full potential for property description. As an example, the first Zagreb group index $M_1$ [9] is recomposed according to different properties, and the optimized $M_1$ index shows a large improvement in building regression models.

The connectivity character base set
The connectivity index was first proposed to parallel the relative magnitudes of boiling points in smaller alkanes [3]. After 27 years, this index is still the most widely used of all TIs [10], with its well-known definition:
$$\chi = \sum_{\text{edges } (i,j)} (v_i v_j)^{-1/2}$$
where $(v_i, v_j)$ denotes the pair of vertex degrees on the two sides of an edge (bond) and (i, j) the indices of the two atoms forming the bond. For example, for 2,2-dimethylhexane (Figure 1) we obtain:
$$\chi = 3\,(1 \cdot 4)^{-1/2} + (4 \cdot 2)^{-1/2} + 2\,(2 \cdot 2)^{-1/2} + (2 \cdot 1)^{-1/2} = 3.5607$$
Figure 1. The partitioning of the connectivity index of 2,2-dimethylhexane into bond contributions.
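The definition above can be checked numerically with a minimal sketch. The function name `connectivity_index`, the edge-list representation and the atom numbering of 2,2-dimethylhexane are illustrative assumptions, not part of the original work.

```python
from collections import defaultdict

def connectivity_index(edges):
    """Connectivity index chi of a hydrogen-suppressed graph given as an edge list."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Each bond contributes (v_i * v_j)^(-1/2).
    return sum((degree[a] * degree[b]) ** -0.5 for a, b in edges)

# 2,2-dimethylhexane: main chain C1..C6, two extra methyl groups (7, 8) on C2.
# The numbering is an assumption made for this example.
EDGES_22DMH = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (2, 7), (2, 8)]

print(round(connectivity_index(EDGES_22DMH), 4))  # 3.5607
```

The three (1, 4) bonds contribute 0.5 each, the (2, 4) bond about 0.3536, the two (2, 2) bonds 0.5 each and the (1, 2) bond about 0.7071.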
Since the χ index is a bond-additive mathematical invariant, its calculation can be divided into two steps: first classify each bond into an (m, n) bond type according to the valences of the two atoms forming the bond, and then obtain χ by summing the contributions $(mn)^{-1/2}$ over all bonds of the hydrogen-suppressed molecular graph. For saturated hydrocarbons, there are 10 kinds of CC bonds: {(1, 1), (1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4), (4, 4)}. Accordingly, we can define the connectivity character base set {$\chi_{11}$, $\chi_{12}$, $\chi_{13}$, $\chi_{14}$, $\chi_{22}$, $\chi_{23}$, $\chi_{24}$, $\chi_{33}$, $\chi_{34}$, $\chi_{44}$}, where $\chi_{mn}$ is the count of (m, n) bonds. For 2,2-dimethylhexane, the connectivity character bases are {$\chi_{11}$-0, $\chi_{12}$-1, $\chi_{13}$-0, $\chi_{14}$-3, $\chi_{22}$-2, $\chi_{23}$-0, $\chi_{24}$-1, $\chi_{33}$-0, $\chi_{34}$-0, $\chi_{44}$-0}. According to this definition, the χ index can be calculated as:
$$\chi = \sum_{1 \le m \le n \le 4} (mn)^{-1/2}\, \chi_{mn}$$
From this definition, the connectivity index χ can be viewed as a linear combination of the ten character bases, in which the weight assigned to each base is $(mn)^{-1/2}$.
To examine the impersonality of the assigned weight $(mn)^{-1/2}$, we collected 530 boiling points (bps) of all the saturated hydrocarbons (from acyclic to polycyclic) with carbon numbers from 2 to 10 [11] for the calculation of the bp-oriented weights. Numerical values of the bp-oriented weights were calculated by least squares, i.e. the contributions of the different bonds were determined by a linear regression with the bases as variables and the bps as response. Because the regression coefficients in a linear model indicate the relative importance of the variables to the response, they can be regarded as the bp-oriented weights on the character bases. During pretreatment, we found that base $\chi_{11}$ is non-zero only for ethane; to avoid a singularity in the linear regression, ethane was deleted from the data set, and $\chi_{11}$ was ignored in the construction of the linear model. The bp-oriented weights (regression coefficients) obtained from the remaining 9 bases and 529 boiling points are listed in Table 1.
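The two-step procedure and the least-squares fit can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names (`character_bases`, `chi_from_bases`, `fit_weights`) are hypothetical, the regression includes an intercept (the text does not say whether one was used), and the boiling-point data set is not reproduced here.

```python
from collections import Counter, defaultdict
import numpy as np

# The ten (m, n) bond types of saturated hydrocarbons, in a fixed order.
BOND_TYPES = [(m, n) for m in range(1, 5) for n in range(m, 5)]

def character_bases(edges):
    """Step 1: count the bonds of each (m, n) type in a hydrogen-suppressed graph."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    counts = Counter(tuple(sorted((degree[a], degree[b]))) for a, b in edges)
    return [counts.get(t, 0) for t in BOND_TYPES]

def chi_from_bases(bases):
    """Step 2: weight each base by (mn)^(-1/2) and sum."""
    return sum(c * (m * n) ** -0.5 for c, (m, n) in zip(bases, BOND_TYPES))

def fit_weights(base_matrix, prop):
    """Least-squares weights on the bases for one property.

    Assumption: an intercept column is added; drop it if the model has none.
    Columns that are zero for every molecule should be removed beforehand.
    """
    y = np.asarray(prop, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(base_matrix, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1:]

# 2,2-dimethylhexane (atom numbering is an assumption of this example).
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (2, 7), (2, 8)]
bases = character_bases(edges)
print(bases)                            # [0, 1, 0, 3, 2, 0, 1, 0, 0, 0]
print(round(chi_from_bases(bases), 4))  # 3.5607
```

Given a matrix of bases (one row per alkane) and the measured property values, `fit_weights` returns the intercept and the property-oriented weights that play the role of the regression coefficients in Table 1.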
It is hard to make a fair comparison from Table 1 alone because the two sets of weights are on different scales. To eliminate the influence of scale, the percentage of each weight is calculated by dividing it by the sum of all weights. For example, if the original weights are 2, 4, 6, 8 for bases 1, 2, 3, 4, each weight is divided by the sum 2+4+6+8=20, giving 2/20, 4/20, 6/20, 8/20 as the percentages for bases 1, 2, 3, 4, respectively. After normalization, the relative importance of each connectivity character base to the boiling point can be expressed clearly as a percentage (Table 2). A plot of the normalized χ weights vs. the normalized bp-oriented weights (Reg. Coef.) is given in Figure 2(a). Table 2 and Figure 2(a) show that the χ weights are very close to the bp-oriented weights obtained from the linear regression on boiling point. Thus the weights $(mn)^{-1/2}$ assigned to the corresponding bonds appear to be property-oriented rather than subjectively chosen by the index's constructor, and this interesting coincidence exhibits the impersonality of the χ weights. Here the impersonality of a topological index means that the construction of the index not only represents the subjective understanding of its proposer but also reflects a close relationship with some property or activity. We will now try to show that this impersonality of χ's weighting formula is an important reason for its great success.
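The normalization step is simple enough to sketch directly. The function name `normalize` is an illustrative assumption; the first call reproduces the worked example from the text, and the second applies the same step to the nine χ weights that remain after $\chi_{11}$ is dropped.

```python
def normalize(weights):
    """Express each weight as a fraction of the sum of all weights."""
    total = sum(weights)
    return [w / total for w in weights]

# Worked example from the text: weights 2, 4, 6, 8 on bases 1..4.
print(normalize([2, 4, 6, 8]))  # [0.1, 0.2, 0.3, 0.4]

# Normalized chi weights (mn)^(-1/2) for the nine bond types other than (1, 1).
bond_types = [(m, n) for m in range(1, 5) for n in range(m, 5) if (m, n) != (1, 1)]
chi_percentages = normalize([(m * n) ** -0.5 for m, n in bond_types])
```

The same function can be applied to a set of regression coefficients so that both sets of weights are compared on a common scale.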
Let us now examine χ and two other topological indices. All three are constructed on the same character bases but differ greatly in how well they describe properties. One is the second Zagreb group index $M_2$ [7], which assigns the weight m × n to base $\chi_{mn}$:
$$M_2 = \sum_{1 \le m \le n \le 4} (mn)\, \chi_{mn}$$
On the other hand, suppose another topological index, $\chi_{\mathrm{inv}}$, is designed on the same χ character bases. Like χ, this index assigns greater weights to terminal CC bonds and lesser weights to internal CC bonds, while $M_2$ does the opposite. Applying these three different weighting formulae to the connectivity character bases gives χ, $M_2$ and $\chi_{\mathrm{inv}}$. The boiling points of the 529 alkanes are then used as the property for regression. Among the three indices, χ fits this property best, with a standard deviation (SD) of 7.6788 and R of 0.9802; next comes $\chi_{\mathrm{inv}}$ with SD = 21.7452 and R = 0.8278; last is $M_2$ with SD = 34.1617 and R = 0.4724. To find out why χ's weighting formula is the most effective, the weight percentages of the other two indices are also listed in Table 2. Plots of the assigned weights vs. the regression coefficients for these two indices are shown in Figures 2(b) and 2(c).
Figure 2 shows clearly that the bond contributions assigned by $M_2$ are quite different from the bp-oriented (Reg. Coef.) ones, that $\chi_{\mathrm{inv}}$ is closer but still deviates somewhat, and that the agreement in (a) is nearly perfect. At the same time, the only difference among the three TIs lies in their weighting formulae. That is to say, different combinations of the same connectivity character bases lead to very different regression results. From this comparison, it seems that the regression performance of such topological indices in QSPR models is directly related to the degree of agreement between the weights on the character bases and the regression coefficients.
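That the indices differ only in their weighting formula can be made concrete with a short sketch. The helper `weighted_index` and the dictionary layout are assumptions made for illustration; only χ and $M_2$ are evaluated, since the weighting formula of $\chi_{\mathrm{inv}}$ is not given in the text above.

```python
# Connectivity character bases of 2,2-dimethylhexane (non-zero entries only).
BASES_22DMH = {(1, 2): 1, (1, 4): 3, (2, 2): 2, (2, 4): 1}

def weighted_index(bases, weight):
    """Linear combination of bond-type counts with a chosen weighting formula."""
    return sum(count * weight(m, n) for (m, n), count in bases.items())

chi = weighted_index(BASES_22DMH, lambda m, n: (m * n) ** -0.5)  # connectivity index
m2 = weighted_index(BASES_22DMH, lambda m, n: m * n)             # second Zagreb index

print(round(chi, 4), m2)  # 3.5607 30
```

Swapping the `weight` argument is the only change needed to move from one index to another on the same bases.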

The recomposition of TIs according to different properties
In the previous section we showed the impersonality of χ's weighting formula when boiling point is considered. In this section, six properties are collected for further investigation: boiling point at normal pressure [11], GC retention index (RI) [12], vapor pressure (VapP) at 25 °C [13][14][15], density [16], refraction constant (Refra) [16,17] and critical pressure (CP) [16,18]. Except for the boiling points and retention indices, which can easily be found in the references, all the property values used are presented in Table 3. The connectivity index χ, the second Zagreb group index $M_2$ and the connectivity character bases are used to build relationships with these properties. The R values of the corresponding linear models are listed in Table 4.
Note to Table 3: the digits following n show the number of carbons in the straight chain; the digits following c show the number of carbons in the ring of the cyclic alkanes; m, e, p and ip represent methyl, ethyl, propyl and isopropyl, respectively; digits in front of these characters denote the positions of the substituents, and digits behind them denote the number of these substituents [11].
From Table 4 it can be seen that the connectivity index χ gives satisfactory descriptions (R > 0.98) for half of the properties. On the other hand, for density and refraction constant, the $M_2$ index performs slightly better than the χ index. Since the only difference between the definitions of χ and $M_2$ is the weights assigned to the connectivity character bases, at least two conclusions can be drawn: (1) not only the character bases that describe the molecular structure, but also the weighting formula applied to the bases, are important in the construction of a topological index, and (2) there is no single weighting formula for the same character bases that serves regressions for different properties equally well. In fact, Randić proposed the variable connectivity index because the connectivity index may need some flexibility to accommodate the variability that arises when different properties of the same compounds are considered. According to the motivation of the variable connectivity index, any set of preselected "rules" that fix the relative weights for heteroatoms in topological indices may suit some molecular properties better but will equally fail for several others [19]. For the same reason, a more general form of variable topological index is considered in the present work. Such an index can adjust its weighting formula on the character bases to the individual requirements that different molecules and different properties may have. The idea is to retain the character bases for the description of molecular structure while allowing reasonable changes to the weights according to the property. The character bases can thus realize their full potential for property description. The numerical values of the weights on the character bases are selected to minimize the standard error of the regression. The last row of Table 4 shows that, after rational adjustment of the weights, the connectivity character bases always give a satisfactory relationship with the properties, while the fixed indices do not.
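The general variable index described here can be written compactly. The notation below is an illustrative assumption: $b_{ik}$ is the value of character base k for molecule i, $P_i$ is the measured property of molecule i, and the intercept c is included only for generality, since the text does not state whether one is used.

```latex
% A property-specific index is a weighted sum of the character bases b_k:
\mathrm{TI}^{(P)} = \sum_{k} w_k^{(P)}\, b_k
% with the weights chosen by least squares for the property P:
\bigl(c, w^{(P)}\bigr) = \arg\min_{c,\,w} \sum_{i} \Bigl( P_i - c - \sum_{k} w_k\, b_{ik} \Bigr)^{2}
```

A fixed index such as χ or $M_2$ corresponds to choosing the weights $w_k$ in advance, independently of the property; the variable form lets the data for each property determine them.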
In fact, some existing topological indices that can be viewed as weighted combinations, such as the first Zagreb group index $M_1$ and the W index, may be improved using this method. In this section, the $M_1$ index is optimized by adjusting the weights on its character bases according to the property. The $M_1$ index is defined as a sum of atom contributions:
$$M_1 = \sum_{i=1}^{n} v_i^{2}$$
where $v_i$ is the vertex degree of the ith carbon atom and n is the number of carbon atoms in the molecule. For example, the calculation of the $M_1$ index of 2,2-dimethylhexane (Figure 3) according to this definition is:
$$M_1 = 1^2 + 4^2 + 2^2 + 2^2 + 2^2 + 1^2 + 1^2 + 1^2 = 32$$
There are 4 kinds of carbon atoms in saturated alkanes according to the vertex degree, so we can define the $M_1$ character base set {$m_1$, $m_2$, $m_3$, $m_4$}, where $m_i$ is the count of carbon atoms with vertex degree i. Using the bases, the $M_1$ index can be rewritten as:
$$M_1 = \sum_{j=1}^{4} j^{2}\, m_j$$
For 2,2-dimethylhexane, the $M_1$ character bases are {$m_1$-4, $m_2$-3, $m_3$-0, $m_4$-1}, so we obtain:
$$M_1 = 1^2 \cdot 4 + 2^2 \cdot 3 + 3^2 \cdot 0 + 4^2 \cdot 1 = 32$$
From the above definition, the $M_1$ index can also be viewed as a weighted combination, where the weight assigned to base $m_j$ is $j^2$. In the following, the same properties are used to show the efficiency of the optimized $M_1$ index when some flexibility is allowed in the weights. All the regression results, expressed as correlation coefficients (R), are presented in Table 5.
To obtain a rational weighting formula for the $M_1$ character bases, boiling point is used here. In the regression with 530 boiling points, it is interesting to note that the normalized coefficients of the four $M_1$ character bases are (0.2331, 0.2805, 0.2703, 0.2161), which are close to a proportion of 4:5:5:4, or percentages of (0.2222, 0.2778, 0.2778, 0.2222). Although this proportion is obtained from multivariate linear regression, it is almost independent of the number of alkanes used in the correlation. For simplicity, the weights (4, 5, 5, 4) are assigned to carbon atoms with vertex degrees of (1, 2, 3, 4), respectively. The optimized $M_1$ index can then be written as:
$$M_1^{\mathrm{rev}} = 4 m_1 + 5 m_2 + 5 m_3 + 4 m_4$$
These integer weights offer a straightforward interpretation of the atom contributions to the property. The definition of the $M_1$ index assumes a monotonic increase of the atom contribution with the vertex degree. However, the relationship between bp and the $M_1$ bases does not bear this out: the atom contribution does not increase steadily with the vertex degree, and the relationship between them looks more like a quadratic function. It is surprising that a small change in the weighting formula brings such a large improvement in regression accuracy: the standard deviation (SD) is 30.5612 for the original $M_1$ index and 7.8028 for the revised $M_1$ index, even smaller than the 7.9742 obtained for χ. The correlation coefficients between the revised $M_1$ index and the 6 properties are also included in Table 5.
Although different character bases are considered, Table 4 and Table 5 suggest similar results: (1) After rational adjustment of the weights according to the property, the recomposed index can greatly improve the accuracy of the regressions, even when the original index performs unsatisfactorily. (2) The "fixed" indices give good relationships with at most some of the properties, whereas the bases with adjusted weights give the best descriptions of all six properties. (3) The χ and $M_1^{\mathrm{rev}}$ indices are constructed for better description of boiling points; both are found to have better relationships with the second and third properties and worse relationships with the last three properties. This interesting finding indicates that the first three properties may share a common requirement on the bond/atom contributions.
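The original and revised $M_1$ indices can be compared on the 2,2-dimethylhexane example with a minimal sketch. The function names and the atom numbering are illustrative assumptions; the weights (1, 4, 9, 16) and (4, 5, 5, 4) are those given in the text.

```python
from collections import Counter, defaultdict

def m1_bases(edges):
    """Counts (m1, m2, m3, m4) of carbon atoms with vertex degree 1..4."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    counts = Counter(degree.values())
    return [counts.get(j, 0) for j in (1, 2, 3, 4)]

def m1_index(bases):
    """Original first Zagreb index: base m_j carries the weight j^2."""
    return sum(j * j * m for j, m in zip((1, 2, 3, 4), bases))

REVISED_WEIGHTS = (4, 5, 5, 4)  # integer weights suggested by the bp regression

def m1_revised(bases, weights=REVISED_WEIGHTS):
    """Recomposed index: same bases, property-oriented weights."""
    return sum(w * m for w, m in zip(weights, bases))

# 2,2-dimethylhexane (atom numbering is an assumption of this example).
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (2, 7), (2, 8)]
bases = m1_bases(edges)
print(bases, m1_index(bases), m1_revised(bases))  # [4, 3, 0, 1] 32 35
```

Passing a different `weights` tuple to `m1_revised` gives the recomposed index for another property, once its weights have been fitted.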

Conclusions
Considering that different properties need different weights, Randić introduced the variable connectivity index to search for optimal weights in heterosystems. Similarly, different character bases may offer different advantages in different problems, and the numerical values of the weights on the character bases depend on the property used. By selecting proper character bases, we can optimize topological indices not only for alkanes but also for molecules containing heteroatoms. The approach can thus be viewed as an extension of the variable connectivity index. Outstanding topological indices tend to combine a direct characterization of the molecular structure with rational operators on the characters. The extended variable topological indices proposed in the present work use the character bases to interpret the molecular structure and apply a simple linear operator to them, so they combine the advantages of easy interpretation and good property description.

Figure 3. The partitioning of the $M_1$ index of 2,2-dimethylhexane into atom contributions.

Table 3. Alkanes and their properties.

Table 4. R values between χ, $M_2$, the connectivity character bases and the properties.

Table 5. R values between $M_1$, the revised $M_1$, the $M_1$ character bases and the properties.