Chromatographic Retention Times of Polychlorinated Biphenyls: from Structural Information to Property Characterization

The paper presents a unitary approach of the use of a Molecular Descriptors Family in structure-property/activity relationships, particularly in modelling the chromatographic retention times of polychlorinated biphenyls. Starting from molecular structure, viewed as a graph, and considering the bonds and bond types, atom types and often the 3D geometry of the molecule, a huge family of molecular descriptors called MDF was calculated. A preliminary selection of MDF members was done by simple linear regression (LR) against the measured property. The best fitted MDF subset is then submitted to multivariate linear regression (MLR) analysis in order to find the best pairs of MDF members that produce a reliable QSPR (Quantitative Structure-Property Relationship) model. The predictive capability was finally tested by randomly splitting of data into training and test sets. The best obtained models are presented and the results are discussed.


Introduction
Polychlorinated biphenyls (PCBs), organic compounds with 1 to 10 chlorine atoms attached to biphenyl, have the general chemical formula C 12 H 10-x Cl x . First manufactured by Monsanto in 1929, the PCBs production was banned in the 1970 th due to the high toxicity of most PCBs (209) and mixtures [1]. PCBs were used as insulating fluids for industrial transformers and capacitors, and are known as persistent organic pollutants. Even if the production of the PCBs was stopped, they still have an influence on the human [2][3][4] and animal [5] health due to their accumulation in the environment.
Moreover, the toxicity and carcinogenicity of PCBs could be related to mechanistic studies of their truncated analogue vynil chloride [6]. Ecological and toxicological aspects of polychlorinated biphenyls (PCBs) in the environment are under investigation due to their worldwide distribution [7][8][9][10].
Starting with the 20 th century, several mathematical approaches, that link chemical structure and property/activity in a quantitative manner, have been introduced [11]. Nowadays, quantitative structure-property/activity relationships (QSPRs/QSARs) are currently used in pharmaceutical chemistry, toxicology and other related fields [12].
The family of molecular descriptors MDF, designed by treating the interactions among fragments of a molecular structure with the formalism of electrostatic fields and potentials, and molecular topology as well, was developed and tested in QSPR/QSAR studies [26][27][28][29].
The aim of the present study was to investigate the ability of our MDF in modelling the retention times of 209 polychlorinated biphenyls.

Polychlorinated Biphenyls (PCBs)
The relative response times of all PCBs obtained by using temperature-programmed, highresolution gas chromatography on a capillary column of SE-54, reported by Mullin et al. [30] served as experimental data in this study.
Molecular structure of PCBs was drawn by using HyperChem software [31] and their 3D geometry optimised at the Extended Hückel level of theory. These calculations also provided partial charges of atoms inside the molecules. The output files *.hin files, which store the information about topology, geometry and charge distribution of the PCBs, represented the primary data for the generation of the molecular descriptors family.

Methodology of using Molecular Descriptors Family in QSPR/QSAR
Our MDF implements three criteria of fragmentation, related to pairs of atoms, in order to generate molecular fragments. Let i and j be the atoms forming a pair. The criteria are as follows: (a) A minimal fragment is that one containing only the atom i, while a maximal fragment will contain all the atoms connected to i, excluding the atom j.
A Szeged fragment is the set of vertices located closer to i than j (a distance-based criterion), the distance d(i, k) being lesser than d(k, j), and (c) A Cluj fragment is generated by excluding the path from i to j (except its terminal points) and then applying the above Szeged criterion. Every MDF member is named with seven ordered case sensitive letters: lMfOIpd, every letter encoding an operator, as follows.
The 7 th letter (d) encodes the distance metric and is either `g` (geometric) or `t` (topological). The 6 th letter (p) encodes the atomic property and can be `M` (mass), `Q` (charge), `C` (cardinality), `E` (electronegativity), `G` (group electronegativity), or `H` (number of attached hydrogens). The 5 th letter (I) encodes the interaction descriptor (involving two participants): The 4 th letter (O) encodes the type of overlapping interactions, which are either scalar (`R`, `r`, `M`, `m`) or vectorial (`D`, `d`). The 3 rd letter (f) encodes the fragmentation algorithm and can be: `m` (minimal), `M` (maximal), `D` (Szeged, distance based), and `P` (Cluj, shortest paths based). The 2 nd letter (M) encodes overlapping fragmental descriptors, which are of the type: sized group ( `m`, smallest; `M`, largest; `n`, smallest absolute; `N`, largest absolute); averaged group ( `S`, sum; `A`, average over all values; `a`, S divided by the number of all fragments; `B`, average first by atom group and then by the whole molecule; `b`, by bond); geometric group (`P` ,multiplication; `G`, geometric mean, by fragments; `g`, adjusted G; `F`, geometric mean by atom group and then by the whole molecule; `f``, by bond); harmonic group (`s`, harmonic mean, `H` harmonic mean, by fragments, and similarly to above `h`, `I`, and `i`).
MDF values enter in QSPR/QSAR modelling after a transformation (linearization procedure), one of: `I` (identity), `i` (inverse), `A` (absolute), `a` (inverse of absolute), `L` (logarithm of absolute), `l` (logarithm), which are encoded by the 1 st letter.
MDF use a genetic algorithm for QSPR/QSAR modelling (genetic algorithms are a particular class of evolutionary algorithms, being categorized as global search heuristics [32]). The peculiarities of the genetic algorithm used are: -Step 1 (implies inheritance and mutation). To the solution domain (2×6×24×6×4×19 MDF members) having the genetic representation with six letters words) are applied the linearization procedure from above, when every descendent is obtained from a parent (inheritance) through a transformation (mutation). Six times more (than parents) descendants are obtained. In this step, the fitness function is defined as "have real and distinct values". A number of 490030 descendants dye due to mutation on PCB data set (remaining 297938 descendants, having genetic representation with seven letters words now). -Step 2 (implies selection). To the solution domain (MDF descendants from Step 1) a bias procedure (selection) is applied. In this step, the fitness function is defined as "have distinct first nine digits of determination coefficient with measured property". For PCBs data set, only 99806 members pass selection. From this solution domain another selection is made: best descriptor (which correlates the best with measured property (for PCBs result being presented in Eq (1)). -Step 3 (implies crossover). Pairs of MDF members are crossover in order to obtain models with two descriptors. Two fitness functions are used here: "have better determination coefficient" and "have better cross-validation leave-one-out score". The result for PCBs data set is given in Eq(2).

Computational Details
The MDF is calculated by a set of original programs written in PHP (Pre Hypertext Processor, [33]) and stored into a MySQL database [34] under a FreeBSD server [35]. This set of programs completes the MDF generation task. The programs create tables, insert, drop, delete, and select grants on `MDF` database ( Figure 1). All programs run in a directory with the name of the set of selected compounds (actually, PCB). The first program, a_mdf_prepare.php, orders the molecules, contained as *.hin files in a `data` subdirectory, in the same ordering as the measured property, contained in a `property.txt` file. The names of *.hin files and corresponding property are used to create a temporary `PCB_tmpx` procedure) it stores thousands records into the `PCB_tmpx` table.
The third program, c_mdf_linearize.php, completes the `PCB_data`, `PCB_xval`, and `PCB_yval` tables with linearized MDF members and statistical parameters. Note that, only real and distinct values are stored into the database. The fourth program, d_mdf_bias.php applied a bias procedure for data reduction. Finally, the fifth program, e_mdf_order.php, re-arrange the data from the `PCB_xval`, and `PCB_yval` tables in descending order of the squared correlation coefficient. When the task is complete, the fifth program writes in the `ready` table a record with the set name ( Figure 2). The QSPR/QSAR finding procedure is made by a client programs built in Delphi programming language [36]. Bivariate correlations are performed, one with any other MDF members.
A client program ( Figure 3) connects the `MDF` database, query the ready tables all together, for the ready set (now PCB set is ready), and runs for finding the best QSAR/QSPR model. Every new better QSPR/QSAR is stored into a table called `qspr_qsar`, within the same `MDF` database.

Results and Discussion
The above described procedure was used for finding the best QSPR model of the PCBs relative chromatographic retention times.
In monovariate correlation, the best MDF QSPR model was provided by the iIDRwHg MDF member, Eq(1): 16  where Ŷ 1d = estimated retention time by MDF-SAR equation with one descriptor; iIDRwHg = molecular descriptor; R 2 = square correlation coefficient; 95%CI R = 95% confidence interval for correlation coefficient; Q 2 cv-loo = cross-validation leave-one-out score. The quality of statistics is given by R 2 (the square correlation coefficient), StErr (standard error of estimate), F (Fisher parameter) and p (type I error, or α error). The cross-validation leave-one-out score is given as Q 2 . Clearly, the model shows a good predictability. The type I error of the model from Eq(1) is very small, showing a very small error of rejecting the null hypothesis when it is actually true.
About ninety-eight percents of variation in PCBs chromatographic retention time can be explained by its linear relation with a single MDF member, iIDRwHg, which accounts for the actual geometry (by the geometric distance operator (`g`)) and the number of directly bonded hydrogen atoms (`H`).
The best model with two descriptors was: Ŷ 2d = -5.96 + 0.024·ISDmsHt-1.02·lADrtHg where Ŷ 2d = estimated retention time by MDF-SAR equation with two descriptors. The multi-colinearity analysis shown that the two descriptors used by Eq(2) rather inter-related (R 2 (ISDmsHt, lADrtHg) = 0.944) and each of them (R 2 (Y, ISDmsHt) = 0.907; rank = 12614; R 2 (Y, lADrtHg) = 0.973; rank = 277) are not the best descriptor in monovariate regression model (see Eq(1)). The ISDmsHt descriptor is built by a topological distance operator (`t`) while lADrtHg takes into account the genuine distance (`g`). Both of them consider the directly bonded hydrogen atom (`H`). The topological description explains more than 90% of the variance, the remaining 9.7% being completed by the information on molecular geometry.
The plot corresponding to Eq(2) is illustrated in Figure 4. The values of the best descriptors in uni and bivariate regressions (Eq(1)&(2)), the experimental and estimated chromatographic retention time, and residuals for the PCBs set are listed in Table 1.
The accuracy of description is extremely high, even as the set of molecules is quite large. The excellent model (Eq(2)), derived for such a large set, is by itself a test of predictive ability. Indeed, if various ratios training/testing selections were considered, the quality of statistics remained very high ( Table 2).  (2) CRT: Experimental As it can be observed from Table 2, the lowest R 2 is about 0.996 in both training and test sets, which demonstrates the ability of (ISDmsHt, lADrtHg) MDF pair to described the PCBs relative retention time (Eq(2)). Note that, R 2 exceeds the upper bond of the confidence interval of Eq(2) in almost 20% of cases and is less then the lower bond in other 20% of cases. In the test set, in four cases the values of Q 2 were greater than the upper confidence boundary.
By analysing of the obtained models (Eq(1) and Eq (2)) in the light of the previously reported models, it can be observed that with a single exception ( [25], p = 0.3528) out of three the model with one descriptor -Eq(1) -did not obtains a greater squared correlation coefficient compared with models reported in the references [22] and [24] (the differences are of -0.0064 [22], and of -0.0043 [24] respectively).

Conclusions
The MDF methodology provides excellent QSPR models, with good stability and predictive ability. It has the disadvantage to be time consuming (it calculates a huge pool of molecular descriptors and provides exhaustive mono-and bivariate regressions) but this is compensated by the high quality of the QSPR models.
Thus, the variance of chromatographic retention time of PCBs is 99.7% explained by two molecular descriptors, showing us that the property is related with geometry and topology, as well as with directly bounded hydrogen's of PCBs.
The selection of the MDF members from a huge family offers not only a QSPR model, but also a strong instrument to investigate the structural causality of a measured property. Thus, the chromatographic property of PCBs is determined by the molecular topology, geometry and the nonchlorinated (i.e., the remained hydrogenated) positions on the PCB structure.