3.1. LLNA data set
As seen in Table 1
, the LLNA data set was randomly divided into a training set with 76 sensitizers and 32 non-sensitizers and a test set with 43 sensitizers and 11 non-sensitizers. Based on the training set, 123 molecular descriptors remained after preprocessing according to Section 2.3. Then, SVM combined with PSO algorithm was implemented. The PSO was set with 30 particles and 100 iterations. In each evaluation, a descriptor got a hit if this descriptor was selected. The descriptor selected more often will get more hits. Ten-fold cross validation was carried out against the training set, and the highest cross validation accuracy was used to determine the most appropriate set of molecular descriptors. In the end, six out of 123 molecular descriptors (listed in Table 2
) were selected. The two most important parameters of SVM were also determined, i.e., C
=15.81 and σ
=14.13. The highest classification accuracy of cross validation against the training set is 83.33%. The classification accuracies, sensitivities and specificities of the training set and test set were all shown in Table 3
Although skin sensitization is a complex toxicological phenomenon and its biomolecular processes have not been fully understood, previous studies have indicated the ability of active agents to cause immune response is related to skin permeability and the production of immunological conjugates with endogenous macromolecules [5
]. Chemical reactivity, molecular size and skin permeability are important determinants for skin sensitization [27
]. Known from Table 2
, nCL represents the number of chlorine atoms in the molecule. The specific atom or group may be related to the combination or reaction of chemicals with skin protein. MAXDN and MATS1e characterize the molecular electronic structure, which may influence the electrostatic interactions between the chemical and protein. The binding of a chemical with skin proteins is considered as the rate-determing step for skin sensitization induction, where the chemical behaves as an electrophile and the protein as a nucleophile [28
]. MATS2m and BELm1 are descriptors concerning molecular size, which may influence the skin penetration of compounds. The active agents causing skin sensitization are relatively small molecules. MLOGP, Moriguchi octanol-water partition coefficient, is an indicator of hydrophobic properties, which has been correlated to transport properties of a molecule, long-range ligand-receptor recognition and subsequent binding [26
] has also investigated the original LLNA data set (including 132 sensitizers and 46 non-sensitizers) with logistic regression and the DEREK expert system. The classification results for the training set with logistic regression and prediction results for the whole data set with DEREK reported in Ref. [5
] are shown in Tables 4
, respectively. For rationality, the reported results with logistic regression were compared to the classification results of this paper for the training set, while the reported results with DEREK were compared to the prediction results of this paper for the test set. Seen from Tables 4
, the SVM classifier combined with PSO algorithm in this paper improved the results greatly, especially for the classification specificity. It has been explained in Ref. [5
] that the very low specificity was resulted from the substantially unbalanced size of samples, i.e., the ratio of sensitizers largely overwhelming that of the non-sensitizers. However, the specificity in this paper attained 87.50% (50.00% with logistic regression in Ref. [5
]) and 72.73% (32.60% with DEREK in Ref. [5
]) for the training set and test set, respectively. The experimental and estimated skin sensitivities are listed in Table 6
3.2. GPMT Data Set
For the GPMT data set, the same procedures as for the LLNA data set were carried out. After preprocessing according to Section 2.3, 127 molecular descriptors remained. Then, the SVM algorithm combined with PSO was implemented, and five molecular descriptors (listed in Table 7
) were selected from the remaining 127 molecular descriptors. Two SVM parameters, i.e., C
=45.63 and σ
=1.90 were determined by 10-fold cross validation based on the training set. The total accuracy of the 10-fold cross validation is 90.16%. For the training set, the sensitizers were all classified correctly, and five non-sensitizers were mistaken as sensitizers. According to Equations (10)
, the classification accuracy, sensitivity and specificity are 91.80%, 100.00% and 64.29%, respectively. For the test set, only one sensitizer and two non-sensitizers were wrongly assigned. Table 8
lists the statistical parameters.
From Tables 3
, we can see that the wrong classification ratio of non-sensitizers is larger than that of sensitizers for both LLNA and GPMT data sets. The unbalanced ratio (nearly 1:3) of non-sensitizers to sensitizers may be responsible for the worse classification performances for non-sensitizers than those for sensitizers. It is indicated that the non-sensitizers may be prone to be falsely predicted as sensitizers. On the contrary, if the dataset contains many more non-sensitizers than sensitizers, the QSPR model will also give biased classification results. For example, Roberts et al.
] have validated the TIMES-SS (TI
imulator) expert system platform used for predicting skin sensitization with 40 chemicals, consisting of 24 non-sensitizers and 16 sensitizers. TIMES-SS was able to predict non-sensitizers reasonably well (the prediction accuracy of 87.5%), while it predicted sensitizers very pooly (the prediction accuracy of 56.0%).
From Table 7
, the selected five molecular descriptors come from four blocks of descriptors. nDB is a constitutional descriptor indicating the number of double bonds. The π-bond electrons in double bonds are more active than the σ-bond electrons in single bonds, therefore, the molecule with more double bonds will have larger electronic cloud deformability and may be prone to combine with the target. Five kinds of chemical reaction mechanistic domain [29
] have been proposed, including Michael acceptors, SN
2 electrophiles, SN
Ar electrophiles, Schiff base electrophiles and acyl transfer electrophiles. According to Roberts [31
], some compounds containing an electron-deficient double bond can be confidently assigned as Michael acceptors or pro-Michael acceptors. EEig07d and EEig14d are both descriptors related to molecular dipole moments, which indicate the molecular polarity and are closely related to the hydrophobic properties and skin permeability of molecules. O-057, the number of phenol/enol/carboxyl OHs, may be responsible for the molecular polarity and hydrogen bond, which has relationship with the combination or reaction of compounds with specific group of the receptor. As described in Ref. [31
], some aromatic compounds with two meta
hydroxyl groups may follow two possible mechanisms: reaction with molecular oxygen to introduce a hydroxyl group either ortho
to the original hydroxyl groups, and directly binding to protein via attack of a protein-centered radical. Infective-80 is a descriptor derived from biological experiment, which may reflect directly the biochemical effect of skin sensitization induction.
] has also investigated the original GPMT data set (including 82 sensitizers and 23 non-sensitizers) with logistic regression and two expert systems TOPKAT and DEREK. The reported classification results for the training set with logistic regression in Ref. [5
] are listed in Table 9
, and the reported prediction results for the whole GPMT data set with TOPKAT and DEREK are shown in Table 10
. For comparison, the reported results with logistic regression were compared to the classification results of this paper for the training set, while the reported results with TOPKAT and DEREK were compared to the prediction results of this paper for the test set. As seen from Tables 9
, the prediction performances of the method in this paper are far superior to those in the literature, especially for the specificity. However, expert systems such as DEREK and TIMES-SS have been recently improved by modifing the alerts describing the skin sensitization potential or considering more mechanisitic knowledge and new rules for chemicals [28
]. Therefore, better results than those of previously reported in the literature may be achieved if the prediction is conducted with the improved expert systems. Golla et al.
] built neural network models using 25, 25 and 22 molecular descriptors for LLNA, GPMT and BgVV data sets with 358, 307 and 251 compounds, respectively. The classification accuracies for the above mentioned three data sets are 90%, 95% and 90%, respectively. Compared with Ref. [27
], this paper obtained the classification accuracy (for the training set) of 95.37% for LLNA data set and 91.80% for GPMT data set using only five or six molecular descriptors. The estimated skin sensitivities were listed in Table 6
Seen from Table 6
, there are ten inconsistent values in the experimental results of 92 compounds with both LLNA and GPMT data. The accuracy (or concordance) of experimental results obtained from two different test procedures for assessing skin sensitization is 89.13%. In 1999, the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM), with support from the National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM), validated the experimental procedures by comparing LLNA data for 97 chemicals to the available GPMT data, and also obtained an accuracy of 89% [27
]. From the above analysis, we may roughly assume that the experimental accuracy of skin sensitization is close to 89%. The prediction accuracies (for the test set) of LLNA and GPMT data sets using PSO optimized SVM in this paper were 88.89% and 90.32% respectively, which are in good agreement with the experimental accuracy.