This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
A contingency of observed antimicrobial activities measured for several compounds
Plant extracts, including oils, have been used as therapeutics since ancient times and have been revisited with increasing frequency in recent years. Important medical effects of plant extracts have been identified over time (antioxidant, antimicrobial [
Quantitative Structure-Activity Relationships (QSARs) are mathematical models obtained by applying different statistical approaches to correlate biological activity and/or physical or chemical properties of active compounds with descriptors derived from structure and/or properties [
Jirovetz
The contingency of antimicrobial effects of compounds, oils and mixtures on bacteria was investigated to identify the probability distribution function along the bacteria series. The Uniform distribution was rejected at the beginning of the analysis because it gave unreasonable estimates of the population parameters. The remaining three discrete distributions were compared on several measures of agreement. The percentage of rejections according to Fisher’s Chi-Square global statistic for each identified probability distribution function, by class (compounds, oils, mixtures), is shown in
Statistical parameters and estimates of the population properties under assumption of Poisson distribution are presented in
Assuming the Poisson distribution (as the FCS value from
Two requirements were imposed in identifying the proper transformation of the Poisson parameter λ: the absence of outliers and the presence of normality at a significance level of 5%. The global FCS distribution statistic indicated that the Poisson parameter more likely follows a Lognormal distribution (statistics: K–S = 0.1315; p_{K–S} = 0.7948; A–D = 0.3874; Crit_{A–D5%} = 2.5018 (critical value for the Anderson-Darling test); C–S_{df = 2} = 0.9403; p_{C–S} = 0.6249).
The Eugenol compound was identified as an outlier with Grubbs’ test (Z = 3.178, Z_{critical–5%} = 2.7338). After natural logarithm transformation of the Poisson parameters, seen as the overall antimicrobial activity of the investigated compounds, no outlier was identified (the highest Z value was 2.528; Z_{critical–5%} = 2.758) and the normality hypothesis of the ln(λ) values could not be rejected (
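The outlier screening described above can be illustrated with a minimal sketch. It assumes that the Grubbs statistic is the largest absolute deviation from the mean divided by the sample standard deviation, and that it was computed on the 21 Poisson parameters listed in the table of statistical parameters; the names `LAMBDA`, `grubbs_z` and `Z_CRIT` are illustrative and not taken from the original analysis. The critical value is the one quoted in the text rather than being recomputed.

```python
import statistics

# Poisson parameters (lambda) of the 21 compounds, copied from the table of
# statistical parameters and population properties in this article.
LAMBDA = {
    "Citral": 14.125, "Geraniol": 13.750, "Geranyl formate": 8.875,
    "Geranyl acetate": 8.200, "Geranyl butyrate": 8.714,
    "Geranyl tiglate": 11.625, "Neral": 13.500, "Nerol": 11.250,
    "Nerol acetate": 7.333, "Neryl butyrate": 10.714,
    "Neryl propanoate": 10.714, "Citronellal": 14.600,
    "Citronellyl formate": 12.143, "Citronellyl acetate": 7.286,
    "Citronellyl butyrate": 8.167, "Citronellyl isobutyrate": 8.200,
    "Citronellyl propionate": 14.333, "Hydroxycitronellal": 18.750,
    "Rose oxide": 12.800, "Eugenol": 28.250, "Sulfametrole": 19.200,
}

Z_CRIT = 2.7338  # 5% critical value quoted in the text (not recomputed here)


def grubbs_z(values):
    """Return (index, Z) for the value farthest from the mean.

    Z is the absolute deviation from the mean in units of the sample
    standard deviation (n - 1 in the denominator).
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return idx, abs(values[idx] - mean) / sd


names = list(LAMBDA)
idx, z = grubbs_z(list(LAMBDA.values()))
print(f"most extreme value: {names[idx]}, Z = {z:.3f}, outlier: {z > Z_CRIT}")
```

Under these assumptions the most extreme compound is Eugenol, with a Z value close to the reported 3.178.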
Sulfametrole (CID = 64939) proved to be influential in the model obtained based on Dragon descriptors (training set,
The overall correlation between Dragon descriptors obtained for whole data set (
The results of regression analysis with Dragon descriptors provided the equation presented in
where Ŷ = ln(λ) estimated by
The abilities in estimation (training set) and prediction (test set) of the model from
No leverage was identified when the SAPF descriptors were investigated (
The overall correlation between SAPF descriptors obtained for whole data set (
The results of regression analysis with SAPF descriptors relating ln(λ) with compounds structure by using the entire training set is presented in
where Ŷ = ln(λ) estimated/predicted by
No statistically significant difference was identified when the goodness-of-fit in the training and test sets was compared for the model presented in
The search for the best-fitting two-descriptor linear regression model in the joined pool of SAPF and Dragon descriptors retrieved the same model as the one from
The parameters defined in the Material and Method section were used to compare the QSAR-Dragon model with the QSAR-SAPF model. The residuals, defined as the difference between the observed value and the value calculated by the identified models, are presented in
Two compounds were randomly chosen as the external set. The predictions closest to the observed values were obtained by the QSAR-SAPF model (
Steiger’s test was used to identify if there are any statistically significant differences in terms of correlation coefficient between the models from
The antimicrobial effects of chemical compounds on bacteria and fungi species were analyzed with regard to their probability distribution function. In addition, a structure-activity relationship analysis able to describe the effect of chemical compounds on the entire population of bacteria and fungi species was successfully conducted.
The analysis of
Thus, it was already proven [
The analysis of distribution on bacteria and fungi species revealed the following:
Compounds series:
○ Without any exception, the antimicrobial effects of all investigated compounds proved to follow Poisson distribution. Moreover, the hypothesis that any compound has a Poisson distribution of antimicrobial activity on bacteria population could not be rejected by FCS statistics (FCS statistics = 28.79,
○ Negative binomial distribution was rejected by 55% of compounds while Binomial distribution was rejected in 70% of cases. Negative binomial distribution, also known as the Pascal distribution or Pólya distribution, is a twin of Poisson distribution [
Oils and mixture series:
○ Negative Binomial distribution cannot be rejected for oils. Moreover, Negative Binomial distribution for oils had a higher likelihood than Poisson distribution (
○ Negative Binomial distribution cannot be rejected for mixtures either. Moreover, Negative Binomial distribution for mixtures had also higher likelihood than Poisson distribution (
○ The above-presented facts suggest that, in the case of oils and mixtures, the factors of the antibacterial activity are not completely separated when the oil/mixture name is taken as the factor; this appears to be because the Negative Binomial distribution often occurs when a convolution/superposition of Poisson distributions characterizes the observed data [
Overall, every investigated compound, oil and mixture proved to have an antimicrobial effect that follows the Poisson distribution on the studied bacteria and fungi species. The Poisson parameter λ varied from 7.286 (Citronellyl acetate) to 28.250 (Eugenol) and represents the mean and variance of the inhibition zone of the compound/oil/mixture on the investigated species. The obtained parameter of the Poisson distribution proved able to characterize the overall antimicrobial activity (both mean and variance equal the Poisson parameter λ,
The structure-activity relationships between the compounds’ structure and the overall antimicrobial effect on the bacteria population, as well as the suitability of a pool of descriptors (SAPF and Dragon approaches) for estimating and predicting the overall antimicrobial activity, were furthermore investigated.
A QSAR model with two descriptors that proved its abilities in estimation and prediction was identified for each approach after the split of compounds into training (13 compounds), test (7 compounds) and external (2 compounds) sets. Normal distribution of the observations was assured through natural logarithm transformation (
The analysis of the QSAR-Dragon model revealed the following:
One compound proved to be influential in the model (CID = 64939,
Two descriptors were able to describe the linear relation with the overall antimicrobial activities of the investigated compounds. One descriptor (piID) belongs to the walk and path counts class and relates to the conventional bond order ID number, while the second descriptor (R3m+) relates to the maximal autocorrelation of lag 3 weighted by mass. According to the associated coefficients, R3m+ had a higher contribution to the model than the piID descriptor, but its contribution lies at the significance threshold (5.8% compared to the imposed 5% significance level).
QSARDragon model proved to be statistically significant (
Multicollinearity is not present in the model since the tolerance value 0.1 <
The model proved its abilities in estimation (
Unfortunately, the external abilities in prediction were far from the expected abilities. The trend is significantly far from the expected line
The abilities in estimation (training set) proved not statistically different from the abilities in prediction (test set), since a probability of 0.3598 was obtained in the comparison.
The analysis of the QSAR-SAPF model revealed the following:
The values of the SAPF descriptors associated with the compounds showed that no compound had a significant influence on the model (all leverage values were lower than the threshold of 0.41,
SAPF model proved statistically significant (
According to descriptors from
Multicollinearity was not identified in the QSAR-SAPF model, even though a statistically significant correlation between the descriptors exists (the tolerance values were higher than 0.1 and smaller than 1, and the variance inflation factors (VIF) were smaller than 10).
The model proved its abilities in estimation (
The external abilities in prediction proved to be close to the expected abilities for the QSAR-SAPF model (
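The multicollinearity check mentioned in the list above can be sketched for the two-descriptor case. With exactly two predictors, the tolerance of each descriptor reduces to 1 − r², where r is the Pearson correlation between the descriptors, and VIF is its reciprocal. The descriptor values below are hypothetical placeholders, not the SAPF or Dragon values of the study, and the function name `tolerance_vif` is illustrative.

```python
import math


def tolerance_vif(x1, x2):
    """Tolerance and VIF for a model with exactly two descriptors.

    With two predictors, regressing one on the other gives R^2 = r^2,
    so tolerance = 1 - r^2 and VIF = 1 / tolerance.
    """
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r = s12 / math.sqrt(s11 * s22)
    tol = 1.0 - r ** 2
    return tol, 1.0 / tol


# Hypothetical descriptor values, for illustration only.
d1 = [1.0, 2.0, 3.0, 4.0]
d2 = [1.0, 3.0, 2.0, 4.0]
tol, vif = tolerance_vif(d1, d2)
print(f"tolerance = {tol:.2f}, VIF = {vif:.2f}")
# Rule of thumb quoted in the text: tolerance above 0.1 and VIF below 10.
print("multicollinearity flagged:", not (tol > 0.1 and vif < 10))
```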
The comparison of the identified models revealed the following:
The Dragon model has slightly better abilities in estimation than the SAPF model, but the difference proved not statistically significant. The determination coefficient obtained in the training set and in leave-one-out analysis was higher than that of the SAPF model by 0.068 and 0.145, respectively. Moreover, the abilities in prediction seem to be better for the SAPF model than for the Dragon model (a difference of 0.211, not statistically significant
The SAPF model systematically obtained the smallest values of the parameters presented in
The analysis of predictive power of the models demonstrated that SAPF model had significantly higher power of prediction (
Furthermore, the mean of residuals for the training, external and external + test sets proved not statistically different from zero when the SAPF model was analyzed. Fisher’s predictive power identified a statistically significant difference from zero for the residuals obtained by the Dragon model in both training and test sets (9 compounds) (
The model with a higher concordance between observed and estimated/predicted could be considered the best model. The analysis of concordance correlation coefficient revealed a substantial strength of agreement for training set but a very poor agreement in test set for Dragon model. A moderate strength of agreement was obtained by SAPF model in both training and test sets (
Steiger’s test was not able to identify any statistically significant difference between the Dragon and SAPF models regarding goodness-of-fit, either in the training set or in the external set.
It can be concluded based on the facts presented above that the SAPF model is a reliable, valid (internally as well as externally) and stable model useful in characterization of overall antimicrobial activity on investigated compounds, both in terms of estimation and prediction.
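The gap in predictive power between the two models can be checked against the residual table with a short sketch. It assumes the common definition Q²_F1 = 1 − Σ(y − ŷ)² / Σ(y − ȳ_TR)², with the sums taken over the test and external compounds pooled together (nine compounds) and ȳ_TR taken as the mean of the 13 observed training values; both choices are assumptions made here because they reproduce the reported figures, not a statement of the authors' exact procedure.

```python
# Observed ln(lambda) of the 13 training compounds (residual table, column Y).
Y_TRAIN = [1.9924, 2.1001, 2.1041, 2.1832, 2.4204, 2.4968, 2.5494,
           2.6210, 2.6479, 2.6741, 2.6810, 2.9312, 2.9549]

# Test set (7 compounds) followed by the external set (2 compounds).
Y_OBS = [2.1041, 2.1650, 2.3716, 2.4532, 2.6027, 2.6626, 3.3411,
         1.9859, 2.3716]
Y_DRAGON = [2.0070, 1.9271, 1.9271, 1.8661, 2.7061, 2.4108, 2.7843,
            2.1432, 2.2688]
Y_SAPF = [2.2012, 2.2830, 2.7847, 2.4642, 2.6006, 2.6207, 3.3685,
          2.0053, 2.2889]


def q2_f1(y_obs, y_pred, y_train):
    """Predictive power relative to the mean of the training observations."""
    mean_train = sum(y_train) / len(y_train)
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_ref = sum((o - mean_train) ** 2 for o in y_obs)
    return 1.0 - press / ss_ref


q2_dragon = q2_f1(Y_OBS, Y_DRAGON, Y_TRAIN)
q2_sapf = q2_f1(Y_OBS, Y_SAPF, Y_TRAIN)
print(f"Q2_F1 Dragon = {q2_dragon:.3f}, Q2_F1 SAPF = {q2_sapf:.3f}")
```

Under these assumptions the values come out near 0.21 for the Dragon model and 0.84 for the SAPF model, in line with the reported 0.2121 and 0.8436; only the latter exceeds the 0.6 threshold quoted for an accurate prediction.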
The aim and objectives of the research have been achieved. The antimicrobial effect proved to follow the Poisson distribution and its parameter was furthermore used to identify those descriptors from Dragon and SAPF pools able to characterize the link between compounds and overall antimicrobial activity. Two newly developed models were found statistically valid. However, which of these QSAR models is better? The analysis of applicability domain of the models obtained in training sets was able to identify based on the values of descriptors one structurally influential compound in training set for Dragon model. According to the obtained results, one compound was withdrawn from further analysis in Dragon modeling. Dragon model was created based on 12 compounds in training set while the SAPF model was created based on 13 compounds in training set. Graphical representation of observed
The antimicrobial effects of twenty-two compounds, eight oils and two mixtures on gram-positive and gram-negative bacteria (
Since all inhibition zones expressed in mm are integers, a search for a discrete distribution was conducted, with the Uniform, Binomial, Negative Binomial and Poisson distributions as alternatives (other alternatives were excluded due to lack of fit with the observed data). Kolmogorov-Smirnov (KS) [
The whole pool (matrix) of data was analyzed first, and none of the above distribution functions gave an acceptable (higher than 5%) agreement with the observations. This could be explained by the heterogeneity of the chemicals/oils/mixtures.
In order to obtain the PDF of antimicrobial effects of compounds, oils and mixtures on bacteria and fungus population, rows of experimental values were analyzed as independent samples. A number of five observations in sample qualified the sample for estimation of the distribution parameters, and the analysis was conducted using maximum likelihood estimation (MLE) [
Population statistics of the identified PDF can be seen as an estimator of overall antimicrobial activity of the investigated compound on the bacteria and fungi population. The series of the population statistics for all investigated compounds was furthermore subject of a structureactivity relationship search intended to relate the overall antimicrobial effect with compounds’ structure.
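As a minimal sketch of the estimation step, the maximum likelihood estimate of the Poisson parameter is simply the mean of the observed inhibition zones. The sketch assumes that entries marked NIO in the inhibition-zone table are treated as missing, which is consistent with the n column of that table; the function name `poisson_mle` and the use of `None` for NIO are illustrative choices.

```python
import math


def poisson_mle(zones):
    """Return (lambda_hat, n): the Poisson MLE and the number of observations.

    For a Poisson sample the MLE of lambda is the sample mean.
    Entries equal to None (NIO in the table) are skipped.
    """
    observed = [z for z in zones if z is not None]
    return sum(observed) / len(observed), len(observed)


# Rows copied from the inhibition-zone table (mm); None stands for NIO.
citral = [15, 23, 11, 9, 10, 8, 9, 28]
nerol_acetate = [8, None, 7, 7, 7, 8, 7, None]

for name, row in (("Citral", citral), ("Nerol acetate", nerol_acetate)):
    lam, n = poisson_mle(row)
    # ln(lambda) is the dependent variable Y used in the QSAR models.
    print(f"{name}: n = {n}, lambda = {lam:.3f}, ln(lambda) = {math.log(lam):.4f}")
```

For Citral this gives λ = 14.125 with n = 8, and for Nerol acetate ln(λ) = 1.9924, matching the tabulated values.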
The molecular modeling study was conducted at the PM3 semi-empirical level of theory [
A series of homemade programs was used to perform the following tasks: ■ automate the transformation of the *.sdf or *.mol files into *.hin files; ■ prepare the compounds for modeling (run HyperChem v.8.0 [
The molecular descriptors for the chemical compounds were calculated using homemade software that implemented the Structural Atomic Property Family [
The SAPF approach is a method that cumulates atomic properties at the molecular level. The approach used a localization of the molecular center using a metric, an atomic property (C = cardinality (number of heavy atoms), H = Hydrogen bonds (number of Hydrogen atoms), M = atomic mass (relative units), E = electronegativity (on Pauling scale [
Linear regression models (additive models) were used to search for a structure-activity relationship between the overall antimicrobial effects as the dependent variable and structural descriptors (from the SAPF approach and Dragon software) as independent variables.
Kolmogorov-Smirnov, Anderson-Darling, and Chi-Square statistics [
Regression analysis was employed to select the candidate models and the following criteria were used: highest goodness-of-fit, smallest number of descriptors and absence of collinearity between descriptors [
A complete randomization approach was applied to split the compounds into training (~2/3 of the compounds, 13 compounds), test (7 compounds: geranyl acetate, geranyl butyrate, geranyl tiglate, neral, neryl butyrate, citronellyl propionate, and eugenol) and external (2 compounds: citronellyl acetate and neryl propanoate) sets.
The training set was used to identify the model, the test set to validate the model, and the external set to assess the model’s external predictive power. The predictive power of the identified models is supported by the applied strategy: the models were not built on measured data, which are subject to measurement errors. Instead, the QSAR models were constructed with population estimates (represented by the Poisson parameter) that are less affected by errors. Thus, the QSAR models reflect the behavior of a compound on bacteria and fungi, not its behavior on a particular bacterium/fungus.
In order to assess the applicability domain of the obtained models, two approaches were involved on the full model with identified descriptors in the training sets [
The model diagnostics was carried out using statistical parameters presented in
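The error-based diagnostics can be sketched as follows. The formulas are the standard textbook definitions of RMSE, MAE and MAPE (the latter as a fraction rather than a percentage) and are assumed here, since the formula column of the parameter table is not available in this text. The example uses the two external compounds and their SAPF predictions from the residual table.

```python
import math


def rmse(y, y_hat):
    """Root-mean-square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mape(y, y_hat):
    """Mean absolute percentage error, expressed as a fraction."""
    return sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


# External set (CID 9017 and 5365982): observed ln(lambda) and SAPF predictions.
y_ext = [1.9859, 2.3716]
y_ext_sapf = [2.0053, 2.2889]

print(f"RMSE = {rmse(y_ext, y_ext_sapf):.4f}")
print(f"MAE  = {mae(y_ext, y_ext_sapf):.4f}")
print(f"MAPE = {mape(y_ext, y_ext_sapf):.4f}")
```

RMSE is never smaller than MAE; the two coincide only when all absolute errors are equal, which is why RMSE > MAE is read as a sign of variation in the errors.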
The comparison of the models was performed using Steiger’s
The antimicrobial activity of the investigated oils, compounds and mixtures on the series of bacteria and fungi was shown to follow the Poisson distribution.
Two newly developed QSAR models, with Dragon and with SAPF descriptors, were found to be statistically significant internally. Although the Dragon model had a higher goodness-of-fit, it proved unacceptable in terms of prediction power. The SAPF model proved acceptable: its prediction power was reliable, valid and stable in external validation analysis, with good overall performance in the test set and in the test and external sets.
The study was supported by European Social Fund, Human Resources Development Operational Program, project number 89/1.5/62371 through a fellowship for L. Jäntschi. The funder had no role in study design, data collection, analysis and interpretation of data, in the writing of the report or in the decision to submit the article for publication.
Results of probability distribution functions analysis. X: Compounds (
Williams plot (training set): Dragon descriptors.
Observed
Williams plots (training set): SAPF descriptors.
Observed
SAPF descriptors (v = value, ln = natural logarithm, V = vector, T = topology, G = geometry, x, y, z = geometric atomic coordinates, i = atom, refD = modality to calculate coordinates (from average), refP = modality to calculate coordinates (from property center formula), t = topological atomic coordinate).
Statistical parameters and population properties.
Compound/oil/mixture (CID)  λ  Mode  Mean  Var  StDev  Skew  EKurt  Median

Citral (638011)  14.125  14  14.125  14.125  3.758  0.266  0.071  13.457 
Geraniol (637566)  13.750  13  13.750  13.750  3.708  0.270  0.073  13.082 
Geranyl formate (5282109)  8.875  8  8.875  8.875  2.979  0.336  0.113  8.207 
Geranyl acetate (1549026)  8.200  8  8.200  8.200  2.864  0.349  0.122  7.531 
Geranyl butyrate (5355856)  8.714  8  8.714  8.714  2.952  0.339  0.115  8.046 
Geranyl tiglate (5367785)  11.625  11  11.625  11.625  3.410  0.293  0.086  10.957 
Neral (643779)  13.500  13  13.500  13.500  3.674  0.272  0.074  12.932 
Nerol (643820)  11.250  11  11.250  11.250  3.354  0.298  0.089  10.582 
Nerol acetate (1549025)  7.333  7  7.333  7.333  2.708  0.369  0.136  6.664 
Neryl butyrate (5352162)  10.714  10  10.714  10.714  3.273  0.306  0.093  10.046 
Neryl propanoate (5365982)  10.714  10  10.714  10.714  3.273  0.306  0.093  10.046 
Citronellal (7794)  14.600  14  14.600  14.600  3.821  0.262  0.068  13.932 
Citronellyl formate (7778)  12.143  12  12.143  12.143  3.485  0.287  0.082  11.475 
Citronellyl acetate (9017)  7.286  7  7.286  7.286  2.699  0.370  0.137  6.617 
Citronellyl butyrate (8835)  8.167  8  8.167  8.167  2.858  0.350  0.122  7.498 
Citronellyl isobutyrate (60985)  8.200  8  8.200  8.200  2.864  0.349  0.122  7.531 
Citronellyl propionate (8834)  14.333  14  14.333  14.333  3.786  0.264  0.070  13.665 
Hydroxycitronellal (7888)  18.750  18  18.750  18.750  4.330  0.231  0.053  18.083 
Rose oxide (27866)  12.800  12  12.800  12.800  3.578  0.280  0.078  12.132 
Eugenol (3314)  28.250  28  28.250  28.250  5.315  0.188  0.035  27.583 
Sulfametrole (64939)  19.200  19  19.200  19.200  4.382  0.228  0.052  18.533 
Citronella  9.750  9  9.750  9.750  3.122  0.320  0.103  9.082 
Geranium Africa  13.250  13  13.250  13.250  3.640  0.275  0.075  12.582 
Geranium Bourbon  12.500  12  12.500  12.500  3.536  0.283  0.080  11.832 
Geranium China  13.625  13  13.625  13.625  3.691  0.271  0.073  12.957 
Helichrysum  10.667  10  10.667  10.667  3.266  0.306  0.094  9.999 
Palmarosa  11.625  11  11.625  11.625  3.410  0.293  0.086  10.957 
Rose  12.750  12  12.750  12.750  3.571  0.280  0.078  12.082 
Verbena  16.500  16  16.500  16.500  4.062  0.246  0.061  15.833 
Tetracycline hydrochloride  15.143  15  15.143  15.143  3.891  0.257  0.066  14.476 
Ciproxin  26.000  26  26.000  26.000  5.099  0.196  0.038  25.333 
λ = Parameter of Poisson distribution; Var = variance; StDev = standard deviation; Skew = skewness; EKurt = Excess Kurtosis.
QSAR Residuals: Dragon
Set  CID  Y  Ŷ_{Dragon}  Res_{Dragon}  Ŷ_{SAPF}  Res_{SAPF} 

Training  1549025  1.9924  2.0070  −0.0146  2.0761  −0.0836 
Training  8835  2.1001  2.0564  0.0437  2.1461  −0.0460 
Training  60985  2.1041  2.0768  0.0273  2.0553  0.0488 
Training  5282109  2.1832  2.2596  −0.0764  2.3267  −0.1435 
Training  643820  2.4204  2.6106  −0.1902  2.7127  −0.2923 
Training  7778  2.4968  2.4132  0.0835  2.2816  0.2151 
Training  27866  2.5494  2.5905  −0.0411  2.4957  0.0538 
Training  637566  2.6210  2.6106  0.0104  2.7127  −0.0917 
Training  638011  2.6479  2.7061  −0.0582  2.6042  0.0437 
Training  8842  2.6741  2.6435  0.0307  2.5713  0.1029 
Training  7794  2.6810  2.6929  −0.0118  2.6430  0.0380 
Training  7888  2.9312  2.7346  0.1966  2.8638  0.0674 
Training  64939  2.9549  2.8674  0.0875  
Test  1549026  2.1041  2.0070  0.0971  2.2012  −0.0971 
Test  5355856  2.1650  1.9271  0.2379  2.2830  −0.1180 
Test  5352162  2.3716  1.9271  0.4445  2.7847  −0.4132 
Test  5367785  2.4532  1.8661  0.5870  2.4642  −0.0111 
Test  643779  2.6027  2.7061  −0.1034  2.6006  0.0021 
Test  8834  2.6626  2.4108  0.2518  2.6207  0.0418 
Test  3314  3.3411  2.7843  0.5568  3.3685  −0.0274 
External  9017  1.9859  2.1432  −0.1572  2.0053  −0.0194 
External  5365982  2.3716  2.2688  0.1028  2.2889  0.0827 
CID = compound identification number; Y = observed ln(λ) value; Ŷ = estimated/predicted value; Res = residuals; Dragon = model from
Results of comparison: QSAR-Dragon model
Parameter (Abbreviation)  Dragon–  SAPF–

Root-mean-square error (RMSE)  0.2314  0.1357
Mean absolute error (MAE)  0.1582  0.0967
Mean Absolute Percentage Error (MAPE)  0.0628  0.0403
Standard error of prediction (SEP)  0.2371  0.0628
Relative error of prediction (REP%)  9.2964  5.4523
Predictive Power of the Model
Q^{2}_{F1}  0.2121  0.8436
Q^{2}_{F2}  0.2041  0.8421
Q^{2}_{F3}  n.a.  0.7742
ρ_{cTR}  0.9457  0.9063
ρ_{cTS}  0.4885  0.9219
Fisher’s Predictive Power  TS  EX  TS + EX  TS  EX  TS + EX
7  2  9  7  2  9
3.1148  −0.2095  2.5071  −1.5344  0.6198  −1.2830
0.0104  0.4343  0.0230  0.0879  0.3234  0.1234
= test set include also external compounds; ρ_{c} = concordance correlation coefficient; TR = training set; TS = test set;
accuracy = 0.9985, precision = 0.9471;
accuracy = 0.7357, precision = 0.6639;
accuracy = 0.9956, precision = 0.9103;
accuracy = 0.9867, precision = 0.9344;
= external set (two compounds);
= training and external sets.
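The accuracy and precision pairs in the footnotes above multiply to the tabulated concordance coefficients (for example 0.9985 × 0.9471 ≈ 0.9457), which matches the usual decomposition of Lin's concordance correlation coefficient into a precision term (Pearson correlation) and an accuracy term (bias correction factor). A minimal sketch of that decomposition follows; it assumes Lin's definition with population (1/n) moments, and the function name `concordance` is illustrative.

```python
import math


def concordance(y_obs, y_pred):
    """Return (rho_c, precision, accuracy) following Lin's decomposition.

    precision = Pearson correlation coefficient
    accuracy  = bias correction factor
    rho_c     = precision * accuracy
    """
    n = len(y_obs)
    mo, mp = sum(y_obs) / n, sum(y_pred) / n
    soo = sum((a - mo) ** 2 for a in y_obs) / n
    spp = sum((b - mp) ** 2 for b in y_pred) / n
    sop = sum((a - mo) * (b - mp) for a, b in zip(y_obs, y_pred)) / n
    denom = soo + spp + (mo - mp) ** 2
    precision = sop / math.sqrt(soo * spp)
    accuracy = 2.0 * math.sqrt(soo * spp) / denom
    rho_c = 2.0 * sop / denom
    return rho_c, precision, accuracy


# Test-set compounds from the residual table: observed values and SAPF predictions.
y_ts = [2.1041, 2.1650, 2.3716, 2.4532, 2.6027, 2.6626, 3.3411]
y_ts_sapf = [2.2012, 2.2830, 2.7847, 2.4642, 2.6006, 2.6207, 3.3685]

rho_c, precision, accuracy = concordance(y_ts, y_ts_sapf)
print(f"rho_c = {rho_c:.4f} = precision {precision:.4f} x accuracy {accuracy:.4f}")
```

The printed value is only illustrative: it need not match the tabulated ρ_c exactly, because the compounds and variance convention used for the published figure are not stated in this text.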
Compounds, oils and mixtures: inhibition zones (mm).
n  

1  Citral (638011)  15  23  11  9  10  8  9  28  8 
2  Geraniol (637566)  15  12  15  12  11  10  10  25  8 
3  Geranyl formate (5282109)  10  9  7  8  8  7  7  15  8 
4  Geranyl acetate (1549026)  10  8  7  NIO  NIO  7  NIO  9  5 
5  Geranyl butyrate (5355856)  10  11  7  NIO  9  7  7  10  7 
6  Geranyl tiglate (5367785)  17  10  11  9  8  8  15  15  8 
7  Neral (643779)  15  20  10  6  12  10  10  25  8 
8  Nerol (643820)  11  8  10  10  10  7  7  27  8 
9  Nerol acetate (1549025)  8  NIO  7  7  7  8  7  NIO  6 
10  Neryl butyrate (5352162)  25  8  8  8  NIO  8  8  10  7 
11  Neryl propanoate (5365982)  17  10  NIO  7  8  9  10  14  7 
12  Citronellal (7794)  25  18  NIO  9  NIO  7  14  NIO  5 
13  Citronellyl formate (7778)  18  20  10  8  9  7  NIO  13  7 
14  Citronellyl acetate (9017)  10  6  NIO  6  7  6  7  9  7 
15  Citronellyl butyrate (8835)  8  8  NIO  NIO  8  7  8  10  6 
16  Citronellyl isobutyrate (60985)  8  10  9  7  NIO  NIO  7  NIO  5 
17  Citronellyl propionate (8834)  15  20  NIO  NIO  10  15  11  15  6 
18  Hydroxycitronellal (7888)  20  20  23  16  17  15  14  25  8 
19  Rose oxide (27866)  8  10  NIO  11  7  NIO  NIO  28  5 
20  Eugenol (3314)  30  30  28  28  25  25  28  32  8 
21  Sulfametrole (64939)  27  27  11  23  NIO  8  NIO  NIO  5 
32  Citronellol (8842)  25  18  NIO  8  NIO  7  NIO  NIO  4 
22  Citronella  10  10  7  10  7  7  7  20  8 
23  Geranium Africa  16  12  10  10  10  9  11  28  8 
24  Geranium Bourbon  13  12  8  12  10  10  10  25  8 
25  Geranium China  20  13  14  9  9  9  10  25  8 
26  Helichrysum  20  13  8  NIO  9  NIO  7  7  6 
27  Palmarosa  8  13  12  9  11  10  10  20  8 
28  Rose  20  15  10  10  8  9  10  20  8 
29  Verbena  27  25  10  13  10  12  10  25  8 
30  Tetracycline hydrochloride  15  22  11  13  15  10  20  NIO  7 
31  Ciproxin  35  33  22  25  32  10  25  NIO  7 
SA =
Statistical parameters used to assess QSAR models.
Parameter (Abbreviation)  Formula [ref]  Remarks 

Root-mean-square error (RMSE) 

RMSE > MAE → variation in the errors exists 
Mean absolute error (MAE) 
 
Mean Absolute Percentage Error (MAPE) 

MAPE ~ 0 → perfect fit 
Standard error of prediction (SEP) 

Lower values indicate a good model 
Relative error of prediction (REP%) 

Lower values indicate a good model 
Concordance analysis (ρ_{c}) 

Strength of agreement [ 
Predictive Power of the Model Prediction is considered accurate if the predictive power of the model is > 0.6 [ 

Prediction power relative to mean value of observable in training set 

Prediction power relative to mean value of observable in test set  

Overall prediction weighted by test set sample size relative to observable weighted by mean of observed value in training set weighted by sample size in training set  
Predictive Power: Fisher’s approach 

Evaluate if the mean of residual is statistically different by the expected value (0) 
y_{i} = observed ln(λ) for i^{th} compound; ŷ_{i} = estimated/predicted ln(λ) by model from
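Fisher's approach to predictive power, described in the table above as a check of whether the mean residual differs from the expected value of 0, can be sketched as a one-sample t statistic. The sketch assumes t = mean / (sd / √n) with the sample standard deviation; the p-values are not computed because they require the Student t distribution, which is outside the Python standard library. The residuals are the test-set values from the residual table.

```python
import math
import statistics


def residual_t(residuals):
    """One-sample t statistic for H0: the mean residual equals 0."""
    n = len(residuals)
    mean = statistics.mean(residuals)
    sd = statistics.stdev(residuals)  # sample standard deviation (n - 1)
    return mean / (sd / math.sqrt(n))


# Test-set residuals (7 compounds) copied from the residual table.
res_dragon = [0.0971, 0.2379, 0.4445, 0.5870, -0.1034, 0.2518, 0.5568]
res_sapf = [-0.0971, -0.1180, -0.4132, -0.0111, 0.0021, 0.0418, -0.0274]

t_dragon = residual_t(res_dragon)
t_sapf = residual_t(res_sapf)
print(f"t (Dragon, TS) = {t_dragon:.3f}, t (SAPF, TS) = {t_sapf:.3f}")
```

Under these assumptions the statistics come out close to 3.11 for the Dragon model and −1.53 for the SAPF model, in agreement with the test-set (TS) entries of the comparison table.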