This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

In the present work, support vector machine (SVM) and multiple linear regression (MLR) techniques were used for quantitative structure–property relationship (QSPR) studies of the retention time (t_{R}) in standardized liquid chromatography–UV–mass spectrometry of 67 mycotoxins (aflatoxins, trichothecenes, roquefortines and ochratoxins), based on molecular descriptors calculated from the optimized 3D structures. The most relevant descriptors were selected by applying missing value, zero and multicollinearity tests (with a cutoff value of 0.95) followed by genetic algorithm variable selection, and QSPR models were then built with the MLR and SVM methods. The robustness of the models was characterized by statistical validation and by their applicability domain (AD), which was investigated using a Williams plot. The predictions of the MLR and SVM models are in good agreement with the experimental values. The correlation and predictability, measured by r^{2} and q^{2}, are 0.931 and 0.932, respectively, for SVM and 0.923 and 0.915, respectively, for MLR. The effects of the different descriptors on the retention times are described.

Fungi are major plant and insect pathogens, but they are not nearly as important as agents of disease in vertebrates,

Studies have shown that a number of mycotoxins have carcinogenic properties. Some of them are clearly DNA-reactive and for others DNA reactivity may not be the mode of action. When the endpoint is cancer,

Mycotoxins usually enter the body via ingestion of contaminated foods, but inhalation of toxigenic spores and direct dermal contact are also important routes. Mycotoxins occurring in food commodities are secondary metabolites of a range of filamentous fungi, which can contaminate food or food crops throughout the food chain. Although many hundreds of fungal toxins are known, a more limited number are generally considered to play an important part in food safety and for these a range of analytical methods have been developed [

Microfungi are a rich source of chemical diversity [

Since the chemical diversity within the microfungi is very high, almost all types of chemical structure can be expected in an extract, e.g., small acids, alcohols, ketones, alkaloids, anthraquinones and cyclic peptides. To cope with this broad range of chemical structures, most methods are based on reversed-phase liquid chromatography combined with diode array detection (DAD) and atmospheric pressure ionization [electrospray ionization (ESI) and atmospheric pressure chemical ionization (APCI)] mass spectrometry (MS). Nearly all methods use water–acetonitrile gradient elution on reversed-phase C_{18} and C_{8} columns, although methods for very polar and highly ionized components, using perfusion chromatography and hydrophilic interaction chromatography, have been described [

However, only a few reports have investigated the quantitative correlation between the molecular parameters and the property of retention time of mycotoxins [

After the calculation of molecular descriptors, many different chemometrics methods, such as multiple linear regression (MLR), partial least squares regression (PLS), different types of artificial neural networks (ANN), genetic algorithms (GAs), and support vector machine (SVM) can be employed to derive correlation models between the molecular structures and properties. As a new and powerful modeling tool, support vector machine (SVM) has gained much interest in pattern recognition and function approximation applications recently. In bioinformatics, SVMs have been successfully used to solve classification and correlation problems. SVMs have also been applied in chemistry, for example, the prediction of retention index of protein [

A total of 54 descriptors were calculated by the ChemOffice software. By applying missing value, zero and multicollinearity tests with a cutoff value of 0.95, followed by variable selection by genetic algorithm, the number of descriptors was reduced to 22. A stepwise regression routine was then used to develop the linear model for the prediction of the retention time of mycotoxins from the calculated structural descriptors. The best linear model contained four molecular descriptors. The regression coefficients of the descriptors, mean effects and variance inflation factors (VIF) are listed in

Positive values of the regression coefficients show that the indicated descriptors contribute positively to the value of t_{R}, whereas negative values indicate that the greater the value of the descriptor, the lower the value of t_{R}. In other words, increasing the electronic energy (ElcE), dipole length (DPLL) or Lowest Unoccupied Molecular Orbital energy (LUMO) will decrease t_{R}, whereas increasing C logP will increase the t_{R} of the compounds.
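The sign effects described above can be illustrated with a minimal Python sketch of the four-descriptor linear model. The coefficients and intercept are the rounded values from the model table of this article; the function name `predict_tr` and the descriptor values are hypothetical and serve only to show the direction of each effect.

```python
# Rounded coefficients and intercept taken from the model table of this article.
COEFFICIENTS = {
    "C logP": 2.6951,
    "ElcE": -0.0002,
    "DPLL": -1.091,
    "LUMO": -1.6922,
}
INTERCEPT = 3.1912


def predict_tr(descriptors):
    """Return the MLR estimate of t_R (min) for one compound."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * descriptors[name] for name in COEFFICIENTS
    )


# Hypothetical descriptor values (not from the article), used only for illustration.
base = {"C logP": 2.0, "ElcE": -40000.0, "DPLL": 3.0, "LUMO": -0.5}
more_hydrophobic = dict(base, **{"C logP": 3.0})
higher_lumo = dict(base, LUMO=0.5)

# A positive coefficient raises t_R, a negative coefficient lowers it.
print(predict_tr(more_hydrophobic) > predict_tr(base))
print(predict_tr(higher_lumo) < predict_tr(base))
```

Because the published coefficients are rounded (ElcE in particular), such a sketch should not be expected to reproduce the tabulated predictions exactly.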

A comparison of the mean effects of the descriptors appearing in the MLR model shows that the ElcE of the molecules has the largest effect on the t_{R} of the compounds. The mean effect of a descriptor is the product of its mean and its regression coefficient in the MLR model [
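As a small illustration of this definition, the following Python sketch computes a mean effect and, conversely, back-calculates the descriptor means implied by the tabulated mean effects and coefficients. The helper names are hypothetical, and the implied means are only approximate because the published coefficients are rounded.

```python
# Rounded coefficients and mean effects as listed in the model table.
COEFFICIENTS = {"C logP": 2.6951, "ElcE": -0.0002, "DPLL": -1.091, "LUMO": -1.6922}
REPORTED_MEAN_EFFECT = {"C logP": 5.0, "ElcE": 8.0, "DPLL": -3.875, "LUMO": 0.594}


def mean_effect(coefficient, values):
    """Mean effect = regression coefficient x mean of the descriptor values."""
    return coefficient * sum(values) / len(values)


# Descriptor means implied by the table (mean effect / coefficient).
implied_mean = {
    name: REPORTED_MEAN_EFFECT[name] / COEFFICIENTS[name] for name in COEFFICIENTS
}

# Descriptor with the largest absolute mean effect.
largest = max(REPORTED_MEAN_EFFECT, key=lambda name: abs(REPORTED_MEAN_EFFECT[name]))
print(largest, implied_mean)
```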

Based on the variance inflation factor (VIF) values of the four descriptors shown in

After establishing the MLR model, a support vector machine model was built on the same subset of descriptors so that the performance of the two methods could be compared. As with other multivariate statistical models, the performance of SVM for regression depends on the combination of several parameters. They are capacity parameter

Satisfied with the robustness of the QSPR model developed using the training set, we applied the QSPR model to an external data set of 17 mycotoxins comprising the test set. The predicted results are given in _{R} values for the test set for both models is significant.

The statistical parameters calculated for the MLR and SVM models are represented in _{R} and their electronic and thermodynamic descriptors, while the SVM model built on the same set of descriptors showed better predictive ability than the MLR model. SVM performs better overall because it embodies the structural risk minimization principle and converges to the global optimum rather than to a local optimum.

Once a QSPR model is obtained, another crucial problem is the definition of its applicability domain (AD). For any QSPR model, only the predictions for chemicals falling within its AD can be considered reliable and not model extrapolations. There are several methods for defining the AD of QSPR models [

where

where ^{*}) suggested that the compound was very influential on the model. Secondly, it presented the Euclidean distances of the compounds to the model measured by the cross-validated standardized residuals. The cross-validated standardized residuals greater than three standard deviation (

The Williams plot for the presented SVM model is shown in ^{*} of 0.3. For making predictions, predicted t_{R} data must be considered reliable only for those compounds that fall within this AD on which the model was constructed. It can be seen from ^{*}). These erroneous predictions could probably be attributed to wrong experimental data rather than to molecular structures [
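The leverage equations did not survive in this text, so the sketch below relies on the commonly used definitions, which is an assumption: the leverage of compound i is h_i = x_i^T (X^T X)^{-1} x_i and the warning leverage is h* = 3(p + 1)/n, with p descriptors and n training compounds. With p = 4 and n = 50 (67 compounds minus the 17 test compounds) this gives the reported h* of 0.3. The helper names and the toy matrix are hypothetical, and an intercept column is assumed.

```python
def warning_leverage(n_descriptors, n_training):
    """Warning leverage h* = 3(p + 1)/n (commonly used definition, assumed here)."""
    return 3.0 * (n_descriptors + 1) / n_training


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def leverages(x_train, x_query):
    """Return h_i = x_i^T (X^T X)^-1 x_i for every row of x_query."""
    k = len(x_train[0])
    xtx = [[sum(row[a] * row[b] for row in x_train) for b in range(k)] for a in range(k)]
    xtx_inv = invert(xtx)
    result = []
    for x in x_query:
        tmp = [sum(x[a] * xtx_inv[a][b] for a in range(k)) for b in range(k)]
        result.append(sum(t * xi for t, xi in zip(tmp, x)))
    return result


# Toy design matrix: intercept column plus one hypothetical descriptor.
toy = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
h = leverages(toy, toy)
h_star = warning_leverage(4, 50)
print(h, h_star)
```

In a Williams plot, each h_i would be plotted against the standardized residual of the same compound, and compounds with h_i above h* or residuals beyond three standard deviations would be inspected.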

By interpreting the descriptors in the regression model, it is possible to gain some insight into the factors that are likely to govern the retention time of mycotoxins. Because all the descriptors in the final model jointly describe the same property, each descriptor, through its coefficient, accounts for a definite share of the variance in that property. Interpreting the descriptors as a combined set is therefore more informative than considering any single descriptor in isolation. Of the four descriptors, C logP is thermodynamic and LUMO, DPLL and ElcE are electronic descriptors.

The octanol/water partition coefficient (C logP) characterizes the hydrophobicity of the compounds. C logP values can be calculated from the molecular structure by summation of fragment values, which captures the nature of the hydrophobic regions of the molecule separately from the hydrophilic regions. In other words, it can be estimated from the hydrophobic contributions of the chemical groups present in complex molecules [_{R} increases as well. In reversed-phase chromatography, compounds with higher hydrophobicity interact more strongly with the nonpolar stationary phase, which leads to larger t_{R} values.

The other descriptors (LUMO, DPLL and ElcE) are electronic, and their regression coefficients are negative, which means that t_{R} decreases as they increase. Electronic parameters are considered important in the establishment of QSAR models and help to quantify different types of intermolecular and intramolecular interactions, as these interactions are usually responsible for the properties of chemical and biological systems [_{R} of the compounds. The ElcE is the total electronic energy given in electron volts at 0 °C [_{R} and LUMO indicates that t_{R} increases as the magnitude of the LUMO index decreases. The present results reinforce previous findings [

The data set for this investigation was extracted from a work reported by Nielsen _{R} were used as the dependent variables.

The molecular structures of the data set were sketched using the ChemDraw Ultra module of the CS ChemOffice 2005 molecular modeling software version 9, supplied by Cambridge Software Company. Each molecule was “cleaned up” and energy minimization was performed using Allinger’s MM2 force field; further geometry optimization was done using the semiempirical AM1 (Austin Model 1) Hamiltonian and PM3 methods by default on the 3D structure of the molecules. A total of 54 molecular descriptors of differing types, based on the 3D structures, were calculated to describe the structural diversity of the compounds. The calculated descriptors account for three important classes of molecular properties, (a) thermodynamic, (b) electronic and (c) steric, as they represent the possible molecular interactions that determine the retention time of the studied molecules.

After the calculation of molecular descriptors, any descriptor that could not be calculated for one or more compounds in the data set (missing value test) was rejected in the first step. Descriptors that contained a value of zero for all the compounds were also removed (zero test). In order to minimize the effect of collinearity and to avoid redundancy, we used a multicollinearity test with a cutoff value of 0.95 and subsequently discarded 10 parameters. Finally, a total set of 44 remaining descriptors was obtained and used to select the optimal subset of descriptors that have a significant contribution to the t_{R} property.
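The three screening tests can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually implemented by the authors: the function names and the toy descriptor table are hypothetical, and the rule of keeping the first descriptor of a highly correlated pair is an assumption, since the text does not say which member of a pair was discarded.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def prescreen(table, cutoff=0.95):
    """Apply missing value, zero and multicollinearity tests.

    `table` maps a descriptor name to its values over all compounds;
    None marks a value that could not be calculated.
    """
    kept = {}
    for name, values in table.items():
        if any(v is None for v in values):      # missing value test
            continue
        if all(v == 0 for v in values):         # zero test
            continue
        if any(abs(pearson(values, other)) > cutoff for other in kept.values()):
            continue                            # multicollinearity test
        kept[name] = values
    return kept


# Hypothetical descriptor table for four compounds.
toy = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.1],    # almost perfectly correlated with A
    "C": [0.0, 0.0, 0.0, 0.0],    # zero for all compounds
    "D": [1.0, None, 2.0, 3.0],   # missing value
    "E": [4.0, 1.0, 3.0, 2.0],
}
kept = prescreen(toy)
print(list(kept))
```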

The basic strategy of QSPR analysis is to find optimum quantitative relationships between the molecular descriptors and the desired property, which can then be used for the prediction of the property from molecular structures alone. One of the most important problems in QSPR studies is to select the optimal subset of descriptors that have a significant contribution to the desired property. The genetic algorithm is a well-accepted method for solving this kind of problem.

After correlation analysis of the descriptors, we applied MLR analysis to the molecular descriptors that resulted from the genetic algorithm (GA) variable selection procedure. The GA applied in this paper uses a binary representation as the coding technique for the given problem; the presence or absence of a descriptor in a chromosome is coded by 1 or 0. The GA performs its optimization by variation and selection via the evaluation of the fitness function (RMSECV). The algorithm used in this paper is an evolution of the algorithm described in Ref. [
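The binary coding and the genetic operators can be sketched in Python as shown below. The population size, mutation rate, initial fraction of selected terms, maximum number of generations and double cross-over are taken from the GA parameter table of this article; everything else is an assumption made for illustration. In particular, `toy_fitness` is a stand-in for RMSECV (lower is better), the truncation selection scheme is hypothetical, and the window width and convergence criterion are not implemented.

```python
import random

N_DESCRIPTORS = 44      # descriptors remaining after pre-screening
POPULATION_SIZE = 64    # GA parameter table
MUTATION_RATE = 0.005   # GA parameter table
MAX_GENERATIONS = 100   # GA parameter table
INITIAL_FRACTION = 0.20 # "Initial term%" in the GA parameter table


def random_chromosome(rng):
    """1 = descriptor present, 0 = descriptor absent."""
    return [1 if rng.random() < INITIAL_FRACTION else 0 for _ in range(N_DESCRIPTORS)]


def double_crossover(rng, a, b):
    """Swap the segment between two random cut points."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def mutate(rng, chromosome, rate=MUTATION_RATE):
    """Flip each bit with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


def toy_fitness(chromosome):
    """Stand-in for RMSECV: distance from an arbitrary target subset."""
    target = {0, 5, 11, 30}
    selected = {i for i, g in enumerate(chromosome) if g}
    return len(selected ^ target)


def run_ga(seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(POPULATION_SIZE)]
    for _ in range(MAX_GENERATIONS):
        population.sort(key=toy_fitness)
        parents = population[: POPULATION_SIZE // 2]  # keep the better half
        children = []
        while len(children) < POPULATION_SIZE - len(parents):
            a, b = rng.sample(parents, 2)
            for child in double_crossover(rng, a, b):
                children.append(mutate(rng, child))
        population = parents + children[: POPULATION_SIZE - len(parents)]
    return min(population, key=toy_fitness)


best = run_ga(seed=0)
print(sum(best), toy_fitness(best))
```

In an actual application, the fitness function would fit a regression model on the selected descriptors and return its cross-validated error.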

Finally, descriptor-screening methods were used to select the most relevant descriptors to establish the models for prediction of the molecular property. Here, the stepwise regression method was used to choose the subset of the molecular descriptors.

After the descriptor was selected, multiple linear regression (MLR)[

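Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + … + b_{n}X_{n}

In this equation, Y is the dependent variable (here t_{R}), X_{1}−X_{n} are the molecular descriptors, b_{1}−b_{n} are the regression coefficients of those descriptors and b_{0} is the intercept.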

The foundation of support vector machines (SVM) has been developed by Vapnik, and they are gaining popularity due to many attractive features and promising empirical performance [

Compared to other neural network regressors, there are three distinct characteristics when SVM are used to estimate the regression function. First of all, SVM estimate the regression using a set of linear functions that are defined in a high dimensional space. Second, SVM carry out the regression estimation by risk minimization where the risk is measured using Vapnik’s

In support vector regression (SVR), the basic idea is to map the data _{i}_{i}^{n}_{i}_{i}

where Φ(

In _{SVMs}_{ɛ}

Finally, by introducing Lagrange multipliers (α_{i}, α_{i}^{*}) and exploiting the optimality constraints, the decision function is given by

Based on the Karush-Kuhn-Tucker (KKT) conditions of quadratic programming, only a number of coefficients (α_{i} − α_{i}^{*}) will assume nonzero values, and the data points associated with them are referred to as support vectors. In _{i}_{i}
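Since the SVR equations did not survive in this text, the following Python sketch uses the standard textbook formulation, which is an assumption: the ε-insensitive loss, a regularized risk of the form C·(mean loss) + ½‖w‖², and a decision function f(x) = Σ (α_{i} − α_{i}^{*}) K(x_{i}, x) + b. The radial basis function kernel, the function names and all numerical inputs except ε = 0.01 and C = 4 (taken from the figure captions) are hypothetical.

```python
import math


def epsilon_insensitive_loss(observed, predicted, epsilon):
    """Errors smaller than epsilon cost nothing; larger errors cost the excess."""
    return max(0.0, abs(observed - predicted) - epsilon)


def regularized_risk(weights, observed, predicted, capacity, epsilon):
    """C * mean epsilon-insensitive loss + 0.5 * ||w||^2 (standard form, assumed)."""
    n = len(observed)
    empirical = sum(
        epsilon_insensitive_loss(d, y, epsilon) for d, y in zip(observed, predicted)
    ) / n
    return capacity * empirical + 0.5 * sum(w * w for w in weights)


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel (an assumed, commonly used choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b over the support vectors."""
    return bias + sum(
        coef * rbf_kernel(sv, x, gamma) for sv, coef in zip(support_vectors, dual_coefs)
    )


EPSILON = 0.01   # value quoted in the figure captions
CAPACITY = 4.0   # value quoted in the figure captions

print(epsilon_insensitive_loss(10.0, 10.005, EPSILON))
print(regularized_risk([0.5, -0.5], [10.0, 12.0], [10.5, 12.0], CAPACITY, EPSILON))
```

Only the training points with nonzero (α_{i} − α_{i}^{*}) enter `svr_predict`, which is why the prediction depends on the support vectors alone.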

The overall performances of SVM models were evaluated in terms of root mean square error (RMSE), which was defined as below:

where _{k}_{k}

The predictive power of the models was assessed using the standard error of prediction (SEP) and the relative error of prediction (REP %), calculated as follows:

where _{i}_{i}
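The error measures can be illustrated with a short Python sketch applied to the test-set values listed in the comparison table of this article. The formulas used here are assumptions, because the original equations did not survive in this text: RMSE divides the summed squared errors by n, SEP divides by n − 1, and REP expresses the RMSE as a percentage of the mean experimental value. With these definitions the sketch closely reproduces the tabulated RMSEP, SEP and REP values.

```python
import math

# Experimental and predicted t_R (min) of the 17 test-set compounds,
# copied from the comparison table of this article.
EXPERIMENTAL = [5.1, 6.6, 7.4, 8.59, 10.33, 10.51, 11.28, 13.69, 14.15,
                15.03, 15.56, 17.0, 18.02, 18.6, 20.0, 21.12, 21.6]
MLR_PREDICTED = [4.97, 6.91, 7.03, 8.88, 9.44, 11.43, 12.03, 11.51, 11.48,
                 14.52, 14.61, 14.29, 15.7, 18.91, 22.66, 22.61, 20.74]
SVM_PREDICTED = [5.03, 7.99, 8.35, 10.08, 10.25, 12.0, 12.37, 11.74, 12.53,
                 15.18, 14.79, 15.08, 16.37, 19.39, 22.11, 20.43, 19.84]


def sum_squared_error(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def rmse(observed, predicted):
    """Root mean square error (divisor n)."""
    return math.sqrt(sum_squared_error(observed, predicted) / len(observed))


def sep(observed, predicted):
    """Standard error of prediction (divisor n - 1, assumed)."""
    return math.sqrt(sum_squared_error(observed, predicted) / (len(observed) - 1))


def rep(observed, predicted):
    """Relative error of prediction in percent of the mean observed value."""
    mean_observed = sum(observed) / len(observed)
    return 100.0 * rmse(observed, predicted) / mean_observed


for label, predicted in (("MLR", MLR_PREDICTED), ("SVM", SVM_PREDICTED)):
    print(label, round(rmse(EXPERIMENTAL, predicted), 3),
          round(sep(EXPERIMENTAL, predicted), 3),
          round(rep(EXPERIMENTAL, predicted), 3))
```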

All calculations in this work were carried out by using Matlab (V 7.1, The Mathworks, Inc.) and the SVM toolbox developed by Gunn [

The main goal in QSPR studies is to obtain a model with the highest predictive ability. In order to evaluate the predictive ability of our QSPR model, we used the method described by Golbraikh and Tropsha [^{2}_{test}

where _{predtest}_{Test}_{m}^{2} using the following equation [

where ^{2}_{0}^{2} is the squared correlation coefficient for regression without using y-intercept and the regression equation was _{2}_{0}^{2} between experimental and predicted values for the external test set compounds were calculated using the Regression tool of the Analysis ToolPak in Excel. If the r_{m}^{2} value for a given model is >0.5, it indicates good external predictability of the developed model.
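The r_{m}^{2} criterion can be checked numerically with the Python sketch below. The equation itself did not survive in this text, so the commonly used form r_{m}^{2} = r^{2}(1 − √(r^{2} − r_{0}^{2})) is assumed; the function names are hypothetical. The inputs are the R^{2} and (R^{2} − R_{0}^{2})/R^{2} values from the statistical parameters table, and the assumed formula reproduces the tabulated r_{m}^{2} values to within rounding.

```python
import math

# R^2, (R^2 - R0^2)/R^2 and r_m^2 as listed in the statistical parameters table.
REPORTED = {
    "MLR": {"r2": 0.923, "relative_difference": 0.001, "r_m2": 0.894},
    "SVM": {"r2": 0.931, "relative_difference": 0.0118, "r_m2": 0.833},
}


def r0_squared(r2, relative_difference):
    """Recover r0^2 from r^2 and (r^2 - r0^2)/r^2."""
    return r2 * (1.0 - relative_difference)


def r_m_squared(r2, r0_2):
    """r_m^2 = r^2 * (1 - sqrt(|r^2 - r0^2|)) (commonly used form, assumed)."""
    return r2 * (1.0 - math.sqrt(abs(r2 - r0_2)))


def passes_criteria(r2, relative_difference, r_m2):
    """Relative difference below 0.1 and r_m^2 above 0.5, as stated in the text."""
    return relative_difference < 0.1 and r_m2 > 0.5


for label, values in REPORTED.items():
    r0_2 = r0_squared(values["r2"], values["relative_difference"])
    r_m2 = r_m_squared(values["r2"], r0_2)
    print(label, round(r_m2, 3),
          passes_criteria(values["r2"], values["relative_difference"], r_m2))
```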

The values of

where _{i}_{i}^{2} − r_{0}^{2}/r^{2}] and [r^{2} − r_{0}^{2}′/r^{2}] are less than 0.1 (stipulated value)[_{0}^{2} and _{0}^{2}′ are correlation coefficient of regression between the predicted and experimental property of compounds in the test set and

To further check the inter-correlation of the descriptors, variance inflation factor (VIF) analysis was performed. The VIF value is calculated from 1/(1 − r^{2}), where r^{2} is the multiple correlation coefficient of one descriptor regressed on the remaining molecular descriptors. If the VIF value is larger than 10, the information of the descriptor could be hidden by correlation with the other descriptors [
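The VIF relation can be illustrated with a few lines of Python. The sketch applies 1/(1 − r^{2}) and its inverse to the VIF values listed in the model table; the function names are hypothetical, and the threshold of 10 is the one quoted in the text.

```python
# VIF values of the four descriptors as listed in the model table.
REPORTED_VIF = {"C logP": 1.006, "ElcE": 1.246, "DPLL": 1.556, "LUMO": 1.287}
VIF_THRESHOLD = 10.0


def vif(r_squared):
    """VIF = 1 / (1 - r^2), r^2 from regressing one descriptor on the others."""
    return 1.0 / (1.0 - r_squared)


def implied_r_squared(vif_value):
    """Invert the relation: r^2 = 1 - 1/VIF."""
    return 1.0 - 1.0 / vif_value


for name, value in REPORTED_VIF.items():
    print(name, round(implied_r_squared(value), 3), value < VIF_THRESHOLD)
```

A VIF of 10 corresponds to r^{2} = 0.9, so all four descriptors in the table lie far below the level at which multicollinearity would be a concern.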

In recent years, attention has been paid to QSAR/QSPR methods as an interesting complement, or even an alternative, to expensive and time-consuming laboratory measurements. In this paper, new QSPR models have been developed for predicting the t_{R} of a diverse set of mycotoxins from the molecular structure alone. We have compared two modeling methods, MLR and SVM, on the same data set. The results show that both methods can model the relationship between t_{R} and the electronic and thermodynamic descriptors; on the same set of descriptors, SVM produced a model with better predictive ability than MLR. SVM exhibits the better overall performance because it embodies the structural risk minimization principle and converges to the global optimum rather than to a local optimum. Model validation indicates that the presented model is valid and can be used to predict the t_{R} of mycotoxins with an accuracy approximating that of experimental t_{R} determination. Moreover, the mechanism of the model was interpreted, and its applicability domain was defined. It can reasonably be concluded that the proposed model can be expected to predict t_{R} for new organic compounds, or for other organic compounds for which experimental values are unknown, provided they fall within the applicability domain. Additionally, the presented method can provide some insight into which structural features are related to the t_{R} of organic compounds.


The selection of the optimal epsilon for SVM (C = 4).

The selection of the optimal capacity factors for SVM (ɛ = 0.01).

t_{R} estimated by MLR (top panel) and SVM (bottom panel) modeling versus experimental t_{R}.

Williams plot of standardized residual

Details of the constructed QSPR model.

Descriptor | Coefficient | Mean effect | VIF |
---|---|---|---|
C logP | 2.6951(±0.2248) | 5 | 1.006 |
ElcE | −0.0002(±0.0001) | 8 | 1.246 |
DPLL | −1.091(±0.2981) | −3.875 | 1.556 |
LUMO | −1.6922(±0.5521) | 0.594 | 1.287 |
Constant | 3.1912(±1.7569) | _ | _ |

C logP = The octanol/water partition coefficient

ElcE = Electronic energy

DPLL = Dipole length

LUMO = Lowest Unoccupied Molecular Orbital energy

VIF = Variance inflation factors

Correlation matrix for MLR model.

|  | t_{R} | C logP | ElcE | DPLL | LUMO |
|---|---|---|---|---|---|
| t_{R} | 1 |  |  |  |  |
| C logP | 0.821263 | 1 |  |  |  |
| ElcE | −0.21234 | 0.05977 | 1 |  |  |
| DPLL | −0.07144 | 0.004813 | −0.32903 | 1 |  |
| LUMO | −0.12041 | −0.05044 | 0.000773 | −0.45025 | 1 |

Comparison of experimental and predicted values of t_{R} for prediction set by MLR and SVM models.

No. | Exp. t_{R} (min) | MLR pred. t_{R} (min) | MLR RE (%) | SVM pred. t_{R} (min) | SVM RE (%) |
---|---|---|---|---|---|
21 | 5.1 | 4.97 | 2.55 | 5.03 | 1.37 |
4 | 6.6 | 6.91 | −4.7 | 7.99 | −21.06 |
23 | 7.4 | 7.03 | 5 | 8.35 | −12.84 |
41 | 8.59 | 8.88 | −3.38 | 10.08 | −17.35 |
3 | 10.33 | 9.44 | 8.62 | 10.25 | 0.77 |
38 | 10.51 | 11.43 | −8.75 | 12 | −14.18 |
24 | 11.28 | 12.03 | −6.65 | 12.37 | −9.66 |
27 | 13.69 | 11.51 | 15.92 | 11.74 | 14.24 |
34 | 14.15 | 11.48 | 18.87 | 12.53 | 11.45 |
13 | 15.03 | 14.52 | 3.39 | 15.18 | −1 |
25 | 15.56 | 14.61 | 6.11 | 14.79 | 4.95 |
37 | 17 | 14.29 | 15.94 | 15.08 | 11.29 |
11 | 18.02 | 15.7 | 12.87 | 16.37 | 9.16 |
46 | 18.6 | 18.91 | −1.67 | 19.39 | −4.25 |
65 | 20 | 22.66 | −13.3 | 22.11 | −10.55 |
29 | 21.12 | 22.61 | −7.05 | 20.43 | 3.27 |
55 | 21.6 | 20.74 | 3.98 | 19.84 | 8.15 |

The statistical parameters obtained by applying the MLR and SVM methods to the prediction set.

Parameters | MLR | SVM |
---|---|---|
RMSEP | 1.504 | 1.341 |
REP (%) | 10.902 | 9.719 |
SEP | 1.551 | 1.382 |
q^{2} | 0.915 | 0.932 |
R^{2} | 0.923 | 0.931 |
(R^{2}-R_{0}^{2})/R^{2} | 0.001 | 0.0118 |
(R^{2}-R′_{0}^{2})/R^{2} | 0.0108 | 0.0011 |
r_{m}^{2} | 0.894 | 0.833 |
k | 0.996 | 0.891 |
k′ | 0.926 | 1.045 |
NDS | 4 | 4 |

REP = Relative error of prediction.

SEP = Standard error of prediction.

NDS = Number of descriptors.

Experimental retention time (t_{R}) of 67 compounds.

No. | Compound | t_{R} (min) | No. | Compound | t_{R} (min) |
---|---|---|---|---|---|
1 | Aflatoxicol I | 12.45 | 9 | Austocystin A | 21.57 |
2 | Aflatoxin B_{1} | 11.50 | 10 | Averufin | 25.65 |
3 | Aflatoxin B_{2} | 10.33 | 11 | 5-Methoxysterigmatocystin | 18.02 |
4 | Aflatoxin B_{2}α | 6.60 | 12 | Dihydroxysterigmatocystin | 17.70 |
5 | Aflatoxin G_{1} | 10.16 | 13 | Methoxysterigmatocystin | 15.03 |
6 | Aflatoxin G_{2} | 8.97 | 14 | Sterigmatocystin | 18.91 |
7 | Aflatoxin G_{2}α | 5.00 | 15 | Norsolorinic acid | 31.08 |
8 | Aflatoxin M_{1} | 7.21 | 16 | Parasiticol | 10.73 |
17 | Nivalenol | 1.27 | 27 | HT-2 Toxin | 13.69 |
18 | Fusarenone X | 2.35 | 28 | T-2 Toxin | 17.06 |
19 | Deoxynivalenol | 1.54 | 29 | Acetyl-T-2 toxin | 21.12 |
20 | 3-Acetyldeoxynivalenol | 5.21 | 30 | Trichodermin | 16.13 |
21 | 15- | 5.10 | 31 | Trichodermol | 9.69 |
22 | Scirpentriol | 1.82 | 32 | 7-α-Hydroxytrichodermol | 2.59 |
23 | 15-Acetoxyscirpenol | 7.40 | 33 | Verrucarol | 2.89 |
24 | Diacetoxyscirpenol | 11.28 | 34 | 4,15-Diacetylverrucarol | 14.15 |
25 | 3α-Acetyldiacetoxyscirpenol | 15.56 | 35 | Trichothecin | 16.29 |
26 | Neosolaniol | 3.19 | 36 | Trichothecolone | 3.63 |
37 | Trichoverrol A | 10.16 |  |  |  |
38 | Agroclavine-I | 17.00 | 51 | Ergotamin | 19.60 |
39 | Auranthine | 10.51 | 52 | Fumigaclavine C | 21.40 |
40 | Aurantiamine | 10.49 | 53 | Marcfortine A | 19.59 |
41 | Aurantioclavine | 14.30 | 54 | Marcfortine B | 17.39 |
42 | Chanoclavine-I | 8.59 | 55 | Meleagrin | 18.90 |
43 | Costaclavine | 17.00 | 56 | Oxalin | 21.60 |
44 | Cyclopenin | 11.60 | 57 | Pyroclavine | 14.81 |
45 | Cyclopenol | 6.20 | 58 | Roquefortine C | 20.50 |
46 | Cyclopeptin | 12.05 | 59 | Roquefortine D | 6.09 |
47 | Dihydroergotamin | 18.60 | 60 | Rugulovasine A and B | 8.43 |
48 | Elymoclavine | 5.34 | 61 | Secoclavine | 20.40 |
49 | Epoxyagroclavine-I | 10.00 | 62 | α-Ergocryptin | 19.20 |
50 | Ergocristine | 25.10 |  |  |  |
63 | Ochratoxin α | 5.60 | 66 | Ochratoxin B-ethyl ester | 19.41 |
64 | Ochratoxin A-methyl ester | 22.49 | 67 | Ochratoxin α-methyl ester | 16.16 |
65 | Ochratoxin B-methyl ester | 20.00 |  |  |  |

Parameters of genetic algorithm (GA).

Parameter | Value |
---|---|
Cross-validation | Random subset |
Number of subsets | 4 |
Population size | 64 |
Mutation rate | 0.005 |
Window width | 2 |
Initial term% | 20% |
Maximum generation | 100 |
Convergence (%) | 50 |
Cross-over | Double |