The Quantitative Structure-Mutagenicity Relationship of Polycylic Aromatic Hydrocarbon Metabolites

Quantitative structure-activity relationships (QSAR s) for benz[a]anthracene (BA) mutagens using 73 descriptors were searched. The mu tagenicity data was obtained from Ames assays for Mycobacterium vanbaalenii, Mycobact erium gilvum and Mycobacterium flavescens strains. These data were fitted using a mutagenicity-cytotoxicity competition model which defines the mutagenic potencies of BA m etabolites, and include oxides, phenols, quinones, and dihydrodiols. The QSAR equat ions were derived using the molecular descriptor set (charged partial surface area, spati al, hermodynamic and electronic descriptors) and semi-empirical energetic and charg e descriptors. Genetic function approximation was used to reduce and fit independen t variables, including linearand quadratic-based functions. Multiple QSAR equations were generated and a separate QSAR equation was chosen and evaluated for each strain u sing conventional r2, F-test, and crossvalidated r2. Each strain exhibited its own charact eris ic descriptors.


Methods
We used a set of well-documented mutagenicity data obtained from investigations on several microbial species strains [19][20][21][22][23]. Mutagenicity and cytotoxicity are assumed to be competitive and using a similar model to Myers et al. [40], these two processes are described in (1) [41].
(Sd + B)e -td (1) Mutagenic behavior is described as the linear function Sd + B, whereby S represents the slope of the plot (number of revertants vs. dose applied), d is the dose of the mutagen (nmoles), and B is the background number of revertants.The exponential function describes cytotoxicity and thus, t is the cytotoxicity parameter.The data were fit using non-linear least squares.The value reported for mutagenicity was the slope of the estimated function at dose = 0 nmol of BA compound, given by S -Bt.The slope of the model (S -Bt) at d = 0 nmol is calculated as revertants nmol -1 and log(1 + revertants nmol -1 ) is defined as the mutagenic potency (MP).The mutagenic potencies for strains M. vanbaalenii(MV), M. gilvum (MG) and M. flavescens (MF) are MPmv, MPmg and MPmf, respectively.
The molecular structures of BA metabolites were constructed using the Cerius2 package [42] and optimized using the universal force field (UFF) generator [43], together with the charge equilibration (QEq) method [44].The routine for energy minimization ends upon completion of 10 iterations of the steepest-descents and 500 of the Newton-Raphson minimization algorithms, or if the average tolerance gradient is ≤ 0.001.The active compound is always assumed to take an energetically-stable configuration.In order to relate physicochemical properties to the molecular geometry (e.g.shadow indices), the longest, next longest and shortest axes were defined as the x-, y-and z-axes, respectively (Figure 1).In order to prepare the training set, energetically-optimized molecular structures were aligned to the core BA moiety using Cartesian coordinates, overlayed by the reference compound (BA), which is the moiety common to all the molecules (Figure 1A).Maximum Common Sub Group (MCSG; a rigid-body fitting method) was used to superimpose each structure onto BA.MCSG considers molecules as points and lines, then identifies patterns using graph theory.This enables identification of the largest subset of atoms that share commonality with the shape of the reference compound.Figure 1B demonstrates that MCSG superimposed the structures successfully.The QSAR equation was derived using Cerius2 and MOPAC (AM1) descriptors (charge and energy) and the set comprised 73 descriptors.This type of training molecule set does not easily mach to the classical Hansch analysis [45][46].However, typical electronic parameters such as the Hammett parameter σ may be responsible for substituent effects at more that one position in the aromatic ring.The 29 PAH molecules that were used, which contains substituents in positions 1 through 12, for the training set have a large common moiety (BA).Detailed insight at the molecular level can be obtained using charge and energetic descriptors such as total energy and ionic potential (AM1).In order to investigate the effect of charge in each position, the AM1 charges on each BA substitution position were adopted.MOPAC was also used to calculate the highest occupied molecular orbital, lowest unoccupied molecular orbital energy, dipole moment and heat of formation.Information about the surface area features of the molecules was provided by using Jurs descriptors, which are structural parameters that combine information on molecular surface area and partial atomic charge to derive charged partial surface area (CPSA) descriptors [47].30 CPSA descriptors were adopted in order to examine the relative importance of polar and nonpolar interactions between the compounds and DNA, as well as potential intermolecular interactions which might occur during mutagenesis.Thus, they are expected to provide molecular information on both charge and shape.Hydrophobicity was described using the Ghose and Crippen's partition coefficient (AlogP) and molar refractivity (MolRef), and these were correlated with membrane interactions, molecular size, and polarizability [48].At the beginning of this study, ten shadow indices, desolvation free energy for water and octanol, density, molecular volume, molecular surface area, radius of gyration, sum of atomic polarizability, and superdelocalizability were included for QSAR.In order to maintain the clarity of the QSAR equation, all topological descriptors from Cerius2 were excluded.
Genetic Function Approximation (GFA) was performed using linear and quadratic functions.A genetic algorithm (GA) was employed to select the variables for the least squares regression that was used to derive QSAR for the BA metabolites.Using a correlation matrix, a total of 73 descriptors were examined in order to eliminate those with a high degree of correlation.The objective in reducing the variables was to be able to assess the dimensionality of the data and interpret QSAR using the smallest number of variables for which there was no loss of information.In addition to biological insights, these variable reductions are very important for rapid and effective development of the GFA.Thus, the GA performs descriptor filtering and reduction and least squares regression derives the QSAR equation.Given that the GFA will generate multiple models, each QSAR equation must be evaluated using an Ftest, in order to determine significance and correlation coefficient to activity.The proper combination of descriptors was derived by lack of fit (LOF) [36][37] and is defined as LSE/{1 -(c+dp)/M}, whereby c is the number of non-constant basis functions i.e., descriptors, p is the total number of features (coefficients + knots), M is the number of training molecules, and d is a user-adjusted smoothing parameter.Using the smoothing parameter (d) = 1 often generates good results [36][37].The evaluation of the equation was done using least squares regression calculating cross-validated r 2 .
In order to avoid convergence upon a suboptimal solution, the GFA was performed five times using different initial populations, then the optimum QSAR result was chosen as the equation for each strain.The number of descriptors for the GFA was either fixed or varied from 2 to 6. Mutation of the equation was performed by 50% addition of new terms for child equations.For a population of 100, GFA evolution exhibited good convergence after 20000 crossovers in nonlinear two through four term QSAR equations, and after 30000 crossovers in nonlinear five and six term equations.
For a regression model, r 2 was used to describe the fitness of data and fitness is considered to improve as r 2 approaches 1.The sum of the squared deviations of dependent variables (SD) is described by (Y obs -Y mean ) 2 and the predicted sum of squares (PRESS), by (Y pred -Y obs ) 2 .The crossvalidated correlation coefficient (r cv 2 ) is defined as 1 -PRESS/SD and it used to evaluate the predictive power of the QSAR equations that were generated.Each molecule was eliminated from the training set and cross-validated r cv 2 was calculated using the predicted values for the missing molecule.In addition, the F value represents a measure of the statistical significance of the regression model and the number of descriptors that are used for derivation of the model has a greater influence than the SD.The standard deviation s, measures the quality of fitness and the number of degrees of freedom and takes into consideration the number of objects and variables.

Results and Discussion
Mutagenic data are presented in Table 1.The following correlation values were obtained for the three mutagenic activity data sets: 0.8509 (MV and MG); 0.8972 (MV and MF); and 0.8814 (MG and MF).MV and MF exhibit a high level of collinearity and although there are differences, in general the correlations are high.Four, six and ten negative mutagenic potencies were determined for the strains MV, MG and MF, respectively, suggesting cytotoxicity exerts a dominant effect over mutagenicity.However, it is likely that mutagenic potency exists to some degree.In Table 1, there were no large negative values and the biggest negative value was -0.4125.The standard deviations obtained for each strain were as follows: MV, 0.7435; MG, 0.6554; and MF, 0.6612.The strains are discussed individually below.Several metabolic activation steps may be required before a BA derivative becomes mutagenic and mutagenic potency may be described using a number of physicochemical parameters.Initially, a simple linear regression was performed and no single descriptor correlated strongly with the mutagenic potency of MV.The descriptor that exhibited the highest correlation to mutagenic potency was Apol (r 2 = 0.1909), and the Jurs descriptors also ranked relatively high (Figure 2).However, it is clear that inclusion of more than one term is required for the derivation of a statistically reliable QSAR for BA mutagens.In general, a better statistical result is obtained from inclusion of more terms.However, increasing numbers of terms leads to greater difficulty in interpreting the QSAR equation.In addition, GFA generates many statistically plausible equations that vary in number of terms and coefficients.In respect to both r 2 and r cv 2 , nonlinear QSAR equations were found to produce superior results to linear equations for all strains.Here, nonlinear multi-term QSAR equations were considered to provide the best fitting QSAR equations for all three strains, and these were filtered by r 2 and r cv 2 .In order to determine the minimum number of terms to obtain a statistically relevant result, the statistical threshold (r 2 = 0.9, r cv 2 = 0.7, Ftest = pass) was calculated and the number of terms increased from two until all the statistical criteria were satisfied.The statistical evaluation of results demonstrated that the six-term QSAR equation produced good convergence and achieved a satisfactory statistical threshold (Table 2).The QSAR equation obtained for the strain MV is as follows: MPmv = 14.4005 -0.7204(S YL -9.9056) 2 -0.2697SXL + 3.6901E LUMO + 7.5766Q6 -717.4359FPSAThe predicted values and residuals are presented in Figure 3.The standard deviation is small, but not in comparison to that of the training set.As Apol has been reduced, it is not included in equation 2. Two S YL terms have been reduced to one offset quadratic term, therefore, it appears as five-term equation.S XL and S YL are spatial terms that describe the importance of the lengths of the BA x and y axes.The values of descriptors are provided in Table 3 and indicate that the optimal y-axis length is 8.7659 Å as there are longer and shorter values than 8.7328 Å.In contrast, it would appear that a BA derivative with a short x-axis has greater mutagenic potency.In addition, mutagenicity increases when the charge at position 6 and the E LUMO are less negative.The fractional charged partial surface area (FPSA 3 ) is defined as the {sum of (partial positive solvent accessible surface area) × (total positive charge)}/(total molecular surface area).Molecules with a low overall positive charge and a large surface area exhibit enhanced mutagenic potency.Interestingly, no hydrophobicity-related term survived.
Correlation between descriptors is presented in Figure 4.No high correlations were observed between QSAR descriptors in MV and each variable provides an independent contribution to mutagenicity.The standard deviation (s) is sufficiently small.Average values and standard deviations of descriptors are presented in Table 3.Although the standard deviation of FPSA 3 is small, sensitivity is compensated by the large coefficient value (equation 2).In simple linear regression, Apol is the highest collinear descriptor (r 2 = 0.3119) for mutagenic potency, followed by the Jurs descriptors.Although this result is similar to that obtained for MV, the QSAR of MG is much cleared than for the other strains.As may be observed from Table 2, MG provides a better result statistically than either MV (equation 2) or MF (equation 4), and only five independent variables were needed to derive the QSAR equation.The QSAR equation for MG was derived using the same statistical threshold and methods as for MV, and is presented below.It would appear that mutagenic potency is increased by the presence of a high charge at position 8 on BA, whereas it is decreased by a high charge on position 7, and thus, descriptors compensating for electrostatic contributions are important (most charges are negative; Table 3).The highly negative charges on position 10 of the Q10 BA moiety correlate with an increase in mutagenic potency.The relative positive charge (RPCG) is defined as (charge of the most positively charged atom) / (sum of total positive charge).Thus, it is the most positively charged atom, rather than the total molecular positive charge, that is significant for this QSAR.The surface weighted charged partial surface area (WPSA2) is defined as {(total charge weighted partial positive surface area) × (total molecular surface area)}/ 1000 and in MG, provides a negative contribution to mutagenic potency.Thus, it would appear that molecular charge is the key to mutagenic potency in this strain.There was no significant correlation between descriptors in MG (Figure 4).
The same procedures were used for MF as for MV and MG.Simple linear regression generated a similar result to the other two strains.However, the correlation of Apol with mutagenic potency was higher (r 2 = 0.4599).Convergence of independent variables was derived using the following six-term equation: MPmf = -9.0704-5.2261Q5 + 6.6285Q6 -6.1e-05Area 2 + 0.0564PNSA 1 -1.4675ELUMO 2 + 7.8115S XYF (4) r 2 = 0.9299 r cv 2 = 0.7063 s = 0.3141 F = 19.3196It would appear that mutagenicity is increased by a small molecular surface area descriptor (Area).The partial negative surface area (PNSA) is defined as the sum of the solvent-accessible surface area of all negatively charged atoms and negatively charged atoms on the molecule surface contribute positively to its mutagenic potency.Mutagenic potency should be increased by a large negative charge at position 5 and a positive charge at 6. S XYF contributes positively to mutagenic potential, and it is defined as the fraction of the area of the molecular shadow in the XY plane, divided by the area of the enclosing rectangle.Mutagenicity is increased by a decreasingly negative E LUMO , and this finding is in agreement with the QSAR equation for MV.
It is clear that for simple linear regression, Apol is the most important single descriptor for all three strains.However, it is not included in any of the QSAR equations, even though the MF QSAR equation demonstrates a moderately high correlation (r 2 = 0.4599).In contrast, Jurs descriptors contribute to all three QSAR equations, although they were found second to Apol, indicating that they are collinear and provide more sensitive and detailed information.
Many variables were considered and the strains exhibit similar and characteristic patterns of behavior.The inclusion of a number of variables is indicative of the metabolic activation steps and variety of biological factors that are required in order for BA derivatives to become mutagenic.An important step in this process is the formation of BADE.
Outliers that have arisen through experimental or calculation errors may be removed in order to improve the statistical results and reduce the number of QSAR descriptors.One outlier (compound 12) is present in the MV data and following its removal, 28 training molecules remain and are described using the QSAR equation ( 5 In compound 12, the residual of the predicted value is larger than for any other compound.However, its mutagenic potency is still found to be higher than the other BA derivatives.
Compounds 10 and 25 are outliers in the MG strain and following their omission, statistical evaluation indicates that a four-term equation satisfies the statistical criteria.Whereas the residual of compound 10 is large, that of compound 25 is smaller that some non-outliers and as one would generally expect outliers to have larger residuals than non-outliers, this result may seem surprising.However, outliers were removed in order to obtain the best statistical result and accordingly, the residuals of outliers can be smaller than those of non-outliers.
For MG, an offset quadratic function for Area was used and the four-term QSAR equation was reduced to three terms, as follows: Thus, only three descriptors are needed for MPmg.The molecular surface area descriptor (Area) describes the van der Waals area of a molecule.On average its value for the 29 compounds was 323.4018 and all values were smaller than 363.117.This implies that the smaller the value of Area, the higher the mutagenic potency.Mutagenic potency is decreased by a high charge at position 7 on BA and therefore, Q7 is an important descriptor in compensating for the electrostatic contribution.
Although RPCG is included in this equation, the outliers are primarily dependent upon the value of Area.After elimination of outliers, the QSAR equation for MF may differ greatly.It is: MPmf (4, 5, 16, 25, and 29) = 0.5236 + 0.0019(Area -356.8530) 2 + 0.0032F h2o 2 -29.8473RPCG (7) r 2 = 0.9266 r cv 2 = 0.8147 s = 0.3435 F = 17.6755 The major difference in this equation is that charges are no longer included.The average Area value for the 29 compounds is 323.4018 and all values are less than 363.1170.Thus, it appears that a smaller value of Area correlates to a higher mutagenic potency.The desolvation free energy for waterA (F h2o ) descriptor appears to be discriminating in this strain.This term is not based on conformation, but on the connectivity of the atoms and it is derived from the hydration shell model developed by Hopfinger [48].Since F h2o is negative and is included in the QSAR equation as a quadratic, increasingly negative values of this descriptor correspond to increased mutagenic potency.There are five outliers for this strain but only compounds 4 and 16 exhibit large residuals.Area and F h2o demonstrate collinearity (r = 0.840, Fig. 4) indicating that molecular surface area is a major factor for determining mutagenicity in the MF strain.
Although there are a relatively large number (5; compounds 4, 5, 16, 25, and 29) of outliers, upon their removal the QSAR equation satisfies statistical criteria.Equation 4 contained a number of descriptors that resulted from the large number of outliers for this strain.

Figure 2 .
Figure 2. Simple linear regression for (A) r 2 and (B) F-test in MV, (C) r 2 and (D) F-test in MG, and (E) r 2 and (F) F-test in MF.

Figure 3 .
Figure 3.The correlation of observed and calculated mutagenic potencies for (A) MV, (B) MG, and (C) MF.

Figure 4 .
Figure 4.The correlation matrix presentation of QSAR Descriptors.

Table 1 .
Mutagenic potency data of BA derivatives in Mycobacterium species* Mutagenic potency is defined as the logarithm of the dose response curve slope (number of histidine revertants relative to the concentration of mutagen in nmol). *

Table 2 .
Statistical evaluation of QSAR equations