Int. J. Mol. Sci. 2009, 10(7), 3237-3254; doi:10.3390/ijms10073237

Prediction of Skin Sensitization with a Particle Swarm Optimized Support Vector Machine
Hua Yuan 1,2,3, Jianping Huang 4 and Chenzhong Cao 1,2,3,*
1 Key Laboratory of Theoretical Chemistry and Molecular Simulation of Ministry of Education, Hunan University of Science and Technology, Xiangtan 411201, China
2 Hunan Provincial University Key Laboratory of QSAR/QSPR, Xiangtan 411201, China
3 School of Chemistry and Chemical Engineering, Hunan University of Science and Technology, Xiangtan 411201, China
4 Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed; Tel. +86-732-829-0045
Received: 25 May 2009 / Accepted: 24 June 2009 / Published: 17 July 2009


Skin sensitization is the most commonly reported occupational illness, causing much suffering to a wide range of people. Identification and labeling of environmental allergens is urgently required to protect people from skin sensitization. The guinea pig maximization test (GPMT) and murine local lymph node assay (LLNA) are the two most important in vivo models for identification of skin sensitizers. In order to reduce the number of animal tests, quantitative structure-activity relationships (QSARs) are strongly encouraged in the assessment of skin sensitization of chemicals. This paper has investigated the skin sensitization potential of 162 compounds with LLNA results and 92 compounds with GPMT results using a support vector machine. A particle swarm optimization algorithm was implemented for feature selection from a large number of molecular descriptors calculated by Dragon. For the LLNA data set, the classification accuracies are 95.37% and 88.89% for the training and the test sets, respectively. For the GPMT data set, the classification accuracies are 91.80% and 90.32% for the training and the test sets, respectively. The classification performances were greatly improved compared to those reported in the literature, indicating that the support vector machine optimized by particle swarm in this paper is competent for the identification of skin sensitizers.
Keywords: skin sensitization; guinea pig maximization test; murine local lymph node assay; support vector machine; particle swarm optimization

1. Introduction

With the fast development of industry, agriculture and medicine, humans are exposed to more and more exogenous chemicals, some of which may result in allergic contact dermatitis after accidental or deliberate skin contact. The medical condition of allergic contact dermatitis is known as skin sensitization, which is associated with an alteration of the immune system. According to statistics from the U.S. Bureau of Labor, occupational contact dermatitis is the most commonly reported non-trauma related category of occupational illnesses in the United States [1]. The total annual losses due to occupational skin diseases were estimated to amount to over one billion dollars [2]. Therefore, identification and labelling of environmental allergens is urgently requested by consumer organizations, industry, and governmental agencies to protect people from skin sensitization. In the new European Union (EU) chemical policy REACH, information on skin sensitization potential will have to be provided for any chemicals manufactured or imported in amounts above 1 tonne/year [3].
The guinea pig maximization test (GPMT) and murine local lymph node assay (LLNA) are the two most commonly used in vivo models for identification of skin sensitizers. GPMT combines the use of intradermal administration of the compound with and without Freund’s complete adjuvant (FCA), and occluded topical application of the compound a week later [4]. The result of GPMT relies on subjective evaluation of a group of animals and is usually expressed in dichotomous (sensitizer/nonsensitizer) form. The murine local lymph node assay defines skin sensitization hazard as a function of the ability of the test chemical to provoke lymphocyte proliferation in lymph nodes draining the site of topical application. A substance is classified as a sensitizer if it induces a stimulation index of three or greater at one or more test concentrations. The EC3 value, the estimated concentration required to produce a threefold stimulation, is continuous and indicates the relative skin sensitizing potency of chemicals, but the majority of published LLNA data nowadays are also in dichotomous form [5]. Due to the huge number of chemicals with unknown skin sensitization potential, exhaustive animal testing of such chemicals is costly and raises ethical concerns. Therefore, the use of alternative methods such as quantitative structure-activity relationships (QSARs) is strongly encouraged in order to reduce the number of animal tests. Several legislative initiatives have recently emerged to further develop and increase the acceptance of QSARs in assessment of skin sensitization of chemicals [6] and much work has been reported [7–10].
This paper aimed to build a classifier to distinguish skin sensitizers from non-sensitizers based on various compounds with LLNA and GPMT results. When compared to other classification techniques such as discriminant analysis [11], random forest methods [9] and artificial neural networks [12], the support vector machine (SVM) has proven advantageous in handling classification tasks with high-dimensional data points. However, the input features (molecular descriptors here) of SVM play a very important role in the classification performance. Not all the molecular descriptors are equally important for a specific classification. Many of them may be redundant or irrelevant. If SVM is implemented without feature selection, the input space is very large and noisy, which will impair the performance of the SVM. The particle swarm optimization (PSO) algorithm is a swarm intelligence method for optimization problems and has been widely applied to feature selection. Compared with other descriptor selection approaches, such as the genetic algorithm (GA) and recursive feature elimination (RFE), PSO is not only much simpler in concept and more computationally efficient [13], but it also exhibits advantages in solving many kinds of optimization problems featuring nonlinearity and nondifferentiability, multiple optima, and high dimensionality [14,15]. Lin et al. [16] conducted a thorough study on the performance of PSO as a parameter determination and feature selection technique for SVM. The results, based on about ten different data sets, confirmed that SVM+PSO outperforms SVM+GA and SVM without feature selection. Therefore, this paper investigates the potential of the support vector machine in combination with the particle swarm optimization algorithm for feature selection in addressing the problem of identification of sensitizers.

2. Materials and Methods

2.1. Data Set

The Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) issued the LLNA results of 209 compounds, for some of which GPMT results were also available. All the experimental data were obtained within the “spirit” of Good Laboratory Practice guidelines. Fedorowicz [5] removed the inorganic salts, natural products and polymers from the ICCVAM list and developed a data set of 178 organic compounds, although it still contains 16 sodium salts which cannot be processed by the Dragon software used to calculate molecular descriptors in this work. Thus, a total of 162 compounds were used in this paper after the sodium salts were excluded. These compounds belong to a number of chemical classes, including alkanes, aromatic hydrocarbons, alcohols, amines, acids, esters and so on. All 162 compounds have LLNA results, which indicate 119 sensitizers and 43 non-sensitizers. Furthermore, 92 of the 162 compounds also have GPMT data, indicating 71 sensitizers and 21 non-sensitizers. For convenience, these two data sets with LLNA and GPMT results are denoted as the LLNA data set and the GPMT data set, respectively. For each data set, two thirds of the compounds were randomly assigned to the training set, and the remainder formed the test set. The information on each data set is shown in Table 1.

2.2. Calculation of Molecular Descriptors

Molecular descriptors characterizing molecular structure were calculated in Dragon 5.4 [17]. Twenty blocks of molecular descriptors were embodied in Dragon package. In this paper, only 926 descriptors contained in blocks 1–10, 17–18 and 20 were calculated, with no consideration of 3D descriptors. These descriptors consisted of constitutional descriptors, topological descriptors, walk and path counts, connectivity indices, information indices, 2D autocorrelations, edge adjacency indices, BCUT descriptors, topological charge indices, eigenvalue-based indices, functional group counts, atom-centered fragments, and molecular properties.

2.3. Preprocessing of Molecular Descriptors

In order to remove noisy, irrelevant and redundant information, the 926 calculated molecular descriptors were preprocessed by eliminating: 1) those having the same value for more than 90% of the compounds; 2) those having high correlation coefficients (>0.85) with other descriptors.
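The two elimination rules can be illustrated with a minimal Python sketch. It is not the authors' implementation: the text does not say which member of a correlated pair is discarded or whether the sign of the correlation matters, so the sketch assumes a greedy pass that keeps the first descriptor encountered and compares the absolute Pearson coefficient. The function names are hypothetical.

```python
import math
from collections import Counter


def near_constant(column, max_share=0.90):
    """True if a single value accounts for more than max_share of the entries."""
    most_common_count = Counter(column).most_common(1)[0][1]
    return most_common_count / len(column) > max_share


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def filter_descriptors(columns, max_share=0.90, max_corr=0.85):
    """columns maps descriptor name -> list of values (one per compound).

    Assumption: descriptors are visited in input order and a descriptor is
    dropped if it correlates strongly with one that was already kept.
    """
    kept = []
    for name, values in columns.items():
        if near_constant(values, max_share):
            continue  # rule 1: same value for more than 90% of the compounds
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # rule 2: highly correlated with a retained descriptor
        kept.append(name)
    return kept
```

Because the pass is greedy, the set of retained descriptors depends on the input order; a different tie-breaking rule would give a different but equally valid reduced set.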
Since these molecular descriptors characterize structural information from many perspectives, their magnitudes vary widely. In order to prevent the descriptors in greater numeric ranges from outweighing those in smaller numeric ranges, the original descriptors were scaled to the range [0, 1] using the min-max normalization method [18] prior to the next feature selection step with the particle swarm optimization (PSO) algorithm. Min-max normalization was realized according to Equation (1):
$$V' = \frac{V - \min}{\max - \min} \qquad (1)$$
where min and max are the minimum and maximum values of a descriptor, and V and V’ represent the descriptor before and after scaling, respectively.
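Equation (1) can be sketched in a few lines of Python. The split into a fitting step and an application step is an assumption of this sketch (the text does not state whether the test set was scaled with the training-set extremes), and the function names are hypothetical.

```python
def fit_min_max(values):
    """Return the (min, max) pair of one descriptor column."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with V' = (V - min) / (max - min).

    Assumption: a constant column (max == min) is mapped to 0.0 to avoid a
    division by zero; such columns would normally be removed beforehand.
    """
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Usage: scale one descriptor column to the range [0, 1].
lo, hi = fit_min_max([2.0, 4.0, 6.0])
scaled = apply_min_max([2.0, 4.0, 6.0], lo, hi)  # [0.0, 0.5, 1.0]
```

If the extremes are taken from the training set only, values of new compounds may fall slightly outside [0, 1].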

2.4. Particle Swarm Optimization (PSO) Algorithm

Particle swarm optimization is a population-based meta-heuristic algorithm that simulates social behavior such as bird flocking and fish schooling. Since its introduction by Kennedy and Eberhart [19] in 1995, PSO has been continuously developed and widely applied to optimization problems due to its reduced memory requirements and fast convergence [20,21]. Like evolutionary algorithms, PSO performs searches using a population (swarm) of individuals (particles) that are updated from iteration to iteration to find an optimal solution. Each particle, representing a potential solution, is treated as a point in a D-dimensional space and its status is characterized by its position and velocity. The position vector (xi) and velocity vector (vi) for particle i in a D-dimensional space can be represented as xi ={xi1, xi2,..., xiD} and vi ={vi1, vi2,..., viD}, respectively. Each particle keeps track of the personal best position pi ={pi1, pi2,..., piD} it has achieved so far and the global best position pg ={pg1, pg2,..., pgD} that has been found by the neighbouring particles in the swarm. At each iteration, the pi and pg vectors are combined to update the velocity of particle i along each dimension, and the velocity is then used to compute the new position of that particle, as given in Equations (2) and (3).
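$$v_{id}(\mathrm{new}) = w\, v_{id}(\mathrm{old}) + c_1 r_1 \left( p_{id} - x_{id} \right) + c_2 r_2 \left( p_{gd} - x_{id} \right) \qquad (2)$$
$$x_{id}(\mathrm{new}) = x_{id}(\mathrm{old}) + v_{id}(\mathrm{new}), \quad d = 1, 2, \ldots, D \qquad (3)$$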
where w is an inertia weight which contributes to balance the global search and local search; c1 and c2 are two positive constants indicating the cognition learning factor and the social learning factor, respectively; r1 and r2 are random numbers uniformly distributed in the range [0, 1]. The iteration is terminated if the minimum error criterion (fitness) is attained or the number of iterations reaches the predetermined limit.
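A single update of one particle according to Equations (2) and (3) might look as follows in Python. The default values of w, c1 and c2 are placeholders chosen for illustration, since the text does not report the values used, and drawing fresh r1 and r2 for every dimension is likewise an assumption. The function name is hypothetical.

```python
import random


def pso_step(x, v, p_best, g_best, w=0.7, c1=2.0, c2=2.0, rng=random):
    """Return the new (position, velocity) of one particle.

    x, v      : current position and velocity, lists of length D
    p_best    : best position found so far by this particle
    g_best    : best position found so far by the swarm
    w, c1, c2 : inertia weight and learning factors (illustrative defaults)
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()  # uniform in [0, 1)
        vd = (w * v[d]
              + c1 * r1 * (p_best[d] - x[d])   # cognition term
              + c2 * r2 * (g_best[d] - x[d]))  # social term
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v
```

The sketch does not clamp positions or velocities; a practical implementation would usually restrict both to a fixed interval.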
Although the basic PSO algorithm presented above was originally designed for continuous problems, attempts have been made to extend it to discrete optimization problems, where the particle position is composed of a set of bits that are either ‘1’ or ‘0’, indicating whether or not a feature is selected [22]. Most modified algorithms lose the consistent form of evolution exhibited by the continuous particle swarm algorithm. This paper designed a discrete PSO that simulates the continuous PSO, in which the position and velocity of a particle are updated in continuous space. Only when a new candidate position is passed to the fitness function is it transformed from the continuous space to the discrete space. Supposing that the particle position values are limited to the interval [0, 1], the conversion can be accomplished by mapping values falling in the interval [0, 0.5) to 1, and other values to 0.
In general, the number of molecular descriptors selected for QSAR modeling is considered one of the important factors responsible for overfitting of QSAR models. Fewer molecular descriptors are generally preferred, so a punishment factor is often used in the fitness evaluation expression of PSO [15]. When the number of candidate molecular descriptors is very large, it is inefficient to use the traditional PSO algorithm directly for feature selection. The probability that only a few descriptors are selected at a given iteration may be very small, because the number of descriptors selected during the evolution process approximately follows a normal distribution, which is centered near half of the candidates when the [0, 0.5) mapping is used. In order to improve the computing efficiency, the conversion of values from continuous space to discrete space is adjusted by mapping each value falling in the interval [0, 0.05) to 1, and other values to 0. Thus, each descriptor has a probability of about 1/20 of being selected, and only about 1/20 of all descriptors are selected in each iteration. The probability that only a few descriptors are selected is thereby increased dramatically.
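The continuous-to-discrete conversion can be sketched as a threshold mapping. The expected-count helper assumes that position values are uniformly distributed on [0, 1], which is the assumption implicit in the 1/20 figure above; both function names are hypothetical.

```python
def to_mask(position, threshold=0.05):
    """Map a continuous position to a 0/1 selection mask.

    A descriptor is selected (1) when its position value lies in
    [0, threshold); all other values map to 0.
    """
    return [1 if 0.0 <= value < threshold else 0 for value in position]


def expected_selected(n_descriptors, threshold=0.05):
    """Expected number of selected descriptors, assuming uniform positions."""
    return n_descriptors * threshold


# With threshold 0.5 about half of the candidates are selected on average;
# with threshold 0.05 only about one in twenty is.
mask = to_mask([0.01, 0.05, 0.5, 0.049, 0.0])  # [1, 0, 0, 1, 1]
```

Under the same uniformity assumption, 123 candidate descriptors would give about 6 selected descriptors per evaluation with the 0.05 threshold, compared with about 61 with the 0.5 threshold.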

2.5. Support Vector Machine (SVM)

SVM is an emerging and powerful machine learning algorithm proposed by Vapnik and co-workers in 1995 [23]. It has been extensively applied to various classification problems due to its high accuracy and its lesser proneness to overfitting than other machine learning methods. Instead of traditional empirical risk minimization, SVM achieves structural risk minimization, which results in the good generalization and avoids being trapped in local optima.
The basic theory of SVM has been presented in many references. Here only a brief description is given. A set of training points (compounds) are denoted as (xi, yi), 1≤ i ≤ N, where N is the number of the training points; xi is the vector corresponding to data point i represented by a set of molecular descriptors in D-dimension space; yi is the class label taking value either +1 or −1. If the two classes are linearly separable, there exists a hyperplane that can separate the set by leaving all the vectors of the same class on the same side. The ultimate aim of SVM classification is to find an optimal separating hyperplane (OSH) as the decision surface to separate two classes of patterns with maximal margin. The optimal hyperplane H is defined mathematically by Equation (4)
$$w \cdot x + b = 0 \qquad (4)$$
where w is the weight vector normal to the separating hyperplane and b is the threshold. SVM constructs two parallel hyperplanes (H1 and H2), one on each side of the separating hyperplane, and maximizes the distance between them. The vectors situated on these two hyperplanes are called support vectors, which are used to define the separating hyperplane. Any points that fall on or above H1 belong to class +1, and any data points that fall on or below H2 belong to class −1, which can be represented as follows:
$$w \cdot x_i + b \geq +1 \quad \text{for } y_i = +1; \ \text{class 1 (sensitizer)} \qquad (5)$$
$$w \cdot x_i + b \leq -1 \quad \text{for } y_i = -1; \ \text{class 2 (non-sensitizer)} \qquad (6)$$
The distance from the hyperplane to any point on H1 is 1/||w||, where ||w|| is the Euclidean norm of w. The margin of the separating hyperplane is calculated as 2/||w||. The OSH has the largest margin among separating hyperplanes and is obtained from the constrained optimization $\min_{w} \|w\|^2$ subject to inequalities (5) and (6). After the determination of w and b, the classification can be realized by Equation (7):
$$\operatorname{sign}(w \cdot x + b) \qquad (7)$$
In most cases, the data are not linearly separable, and no linear OSH exists in the original space. Therefore, the data are nonlinearly mapped into a high-dimensional feature space where linear separation can be performed. The transform can be done by using a kernel function $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. The Gaussian radial basis function, $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)$, is one of the commonly used kernel functions. A linear support vector machine is then applied to this feature space, and the decision function is given as follows:
$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} y_i \alpha_i K(x_i, x) + b \right) \qquad (8)$$
where the coefficients αi and b are determined by maximizing the following Lagrange expression:
$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (9)$$
where $\alpha_i \geq 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$. The above equation can be solved numerically using quadratic programming techniques under the Karush-Kuhn-Tucker (KKT) conditions to obtain the Lagrange multipliers αi, together with w and b.
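Once the multipliers and the threshold are known, the Gaussian kernel and the decision function of Equation (8) can be evaluated as in the following sketch. It only illustrates the prediction step; solving the quadratic program is left to a library such as libsvm. Assigning class +1 when the sum is exactly zero is an assumption, and the function names are hypothetical.

```python
import math


def rbf_kernel(a, b, sigma):
    """Gaussian radial basis function exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def decide(x, support_vectors, labels, alphas, b, sigma):
    """Return +1 (sensitizer) or -1 (non-sensitizer) for a descriptor vector x.

    support_vectors, labels and alphas are parallel lists holding x_i, y_i
    and alpha_i for the training points with non-zero multipliers.
    """
    total = sum(
        y * alpha * rbf_kernel(sv, x, sigma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if total + b >= 0 else -1  # assumption: ties go to class +1
```

With this form of the kernel, the width σ corresponds to the libsvm parameter gamma through gamma = 1/(2σ²), because libsvm writes the same kernel as exp(−gamma·||u − v||²).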
Two parameters C and σ are very important to the performance of SVM. Parameter C represents the penalty cost, which influences the classification outcome. Parameter σ affects the partitioning outcome in the feature space. Ten-fold cross validation procedure was implemented to obtain the appropriate C and σ.
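A generic k-fold cross-validation loop of the kind used to compare candidate (C, σ) pairs is sketched below. The classifier is passed in as a callable, so the sketch does not depend on any particular SVM library; the interleaved fold assignment without shuffling and the pooled accuracy over all held-out compounds are assumptions of this sketch, and the function names are hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Partition the indices 0..n_samples-1 into k interleaved folds."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def cross_val_accuracy(samples, labels, train_and_predict, k=10):
    """Pooled cross-validation accuracy in percent.

    train_and_predict(train_x, train_y, test_x) must return one predicted
    label per element of test_x. Each fold is held out once while the
    remaining folds are used for training.
    """
    correct = 0
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        train_x = [samples[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        test_x = [samples[i] for i in held_out]
        predictions = train_and_predict(train_x, train_y, test_x)
        correct += sum(1 for i, p in zip(held_out, predictions) if p == labels[i])
    return 100.0 * correct / len(samples)
```

A parameter search would call `cross_val_accuracy` once for each candidate (C, σ) pair, with `train_and_predict` wrapping an SVM trained with those values, and keep the pair with the highest accuracy.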

2.6. Implementation

The PSO algorithm and related programs were implemented in the Java programming language, running on the Java (TM) 2 Runtime Environment, Standard Edition (build 1.5.0_02-b09). The Java package of libsvm (version 2.8) [24] used in this work is freely available online.

2.7. Assessment of Results

In order to evaluate the prediction performance of SVM models, we define and compute the classification accuracy, sensitivity and specificity by the methods reported in Ref. [25]. The formulations are as follows:
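$$\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \times 100\% \qquad (10)$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100\% \qquad (11)$$
$$\text{Specificity} = \frac{TN}{TN + FP} \times 100\% \qquad (12)$$
In Equations (10)–(12), TP is the number of true positives (sensitizers classified as sensitizers); FN, the number of false negatives (sensitizers classified as non-sensitizers); TN, the number of true negatives (non-sensitizers classified as non-sensitizers); and FP, the number of false positives (non-sensitizers classified as sensitizers).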
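Equations (10)–(12) translate directly into a short Python function. The confusion-matrix counts in the usage example are not taken from a table; they are inferred from the percentages and error counts reported for the GPMT training set in Section 3.2 (all sensitizers correct, five non-sensitizers misclassified) and should be read as an illustration. The function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) in percent."""
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Inferred counts for the GPMT training set: 47 sensitizers, all correct,
# and 14 non-sensitizers, of which 5 were classified as sensitizers.
acc, sens, spec = classification_metrics(tp=47, fn=0, tn=9, fp=5)
print(f"{acc:.2f} {sens:.2f} {spec:.2f}")  # prints "91.80 100.00 64.29"
```

Reporting sensitivity and specificity alongside accuracy matters here because the data sets contain roughly three sensitizers for every non-sensitizer, so a high accuracy alone can hide poor performance on the minority class.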

3. Results and Discussion

3.1. LLNA data set

As seen in Table 1, the LLNA data set was randomly divided into a training set with 76 sensitizers and 32 non-sensitizers and a test set with 43 sensitizers and 11 non-sensitizers. Based on the training set, 123 molecular descriptors remained after preprocessing according to Section 2.3. Then, SVM combined with the PSO algorithm was implemented. The PSO was set with 30 particles and 100 iterations. In each evaluation, a descriptor received a hit if it was selected, so descriptors selected more often accumulated more hits. Ten-fold cross validation was carried out on the training set, and the highest cross validation accuracy was used to determine the most appropriate set of molecular descriptors. In the end, six out of 123 molecular descriptors (listed in Table 2) were selected. The two most important parameters of SVM were also determined, i.e., C = 15.81 and σ = 14.13. The highest classification accuracy of cross validation on the training set is 83.33%. The classification accuracies, sensitivities and specificities of the training set and test set are shown in Table 3.
Although skin sensitization is a complex toxicological phenomenon and its biomolecular processes are not fully understood, previous studies have indicated that the ability of active agents to cause an immune response is related to skin permeability and the production of immunological conjugates with endogenous macromolecules [5,26]. Chemical reactivity, molecular size and skin permeability are important determinants for skin sensitization [27]. As listed in Table 2, nCL represents the number of chlorine atoms in the molecule. The specific atom or group may be related to the combination or reaction of chemicals with skin protein. MAXDN and MATS1e characterize the molecular electronic structure, which may influence the electrostatic interactions between the chemical and protein. The binding of a chemical with skin proteins is considered the rate-determining step for skin sensitization induction, where the chemical behaves as an electrophile and the protein as a nucleophile [28]. MATS2m and BELm1 are descriptors concerning molecular size, which may influence the skin penetration of compounds. The active agents causing skin sensitization are relatively small molecules. MLOGP, the Moriguchi octanol-water partition coefficient, is an indicator of hydrophobic properties, which has been correlated to the transport properties of a molecule, long-range ligand-receptor recognition and subsequent binding [26].
Fedorowicz [5] has also investigated the original LLNA data set (including 132 sensitizers and 46 non-sensitizers) with logistic regression and the DEREK expert system. The classification results for the training set with logistic regression and the prediction results for the whole data set with DEREK reported in Ref. [5] are shown in Tables 4 and 5, respectively. For a fair comparison, the reported results with logistic regression were compared to the classification results of this paper for the training set, while the reported results with DEREK were compared to the prediction results of this paper for the test set. As seen from Tables 4 and 5, the SVM classifier combined with the PSO algorithm in this paper improved the results greatly, especially for the classification specificity. It has been explained in Ref. [5] that the very low specificity resulted from the substantially unbalanced sample sizes, i.e., the sensitizers greatly outnumbering the non-sensitizers. However, the specificity in this paper reached 87.50% (50.00% with logistic regression in Ref. [5]) and 72.73% (32.60% with DEREK in Ref. [5]) for the training set and test set, respectively. The experimental and estimated skin sensitivities are listed in Table 6.

3.2. GPMT Data Set

For the GPMT data set, the same procedures as for the LLNA data set were carried out. After preprocessing according to Section 2.3, 127 molecular descriptors remained. Then, the SVM algorithm combined with PSO was implemented, and five molecular descriptors (listed in Table 7) were selected from the remaining 127 molecular descriptors. Two SVM parameters, i.e., C = 45.63 and σ = 1.90, were determined by 10-fold cross validation based on the training set. The total accuracy of the 10-fold cross validation is 90.16%. For the training set, the sensitizers were all classified correctly, and five non-sensitizers were mistaken as sensitizers. According to Equations (10)–(12), the classification accuracy, sensitivity and specificity are 91.80%, 100.00% and 64.29%, respectively. For the test set, only one sensitizer and two non-sensitizers were wrongly assigned. Table 8 lists the statistical parameters.
From Tables 3 and 8, we can see that the misclassification ratio of non-sensitizers is larger than that of sensitizers for both the LLNA and GPMT data sets. The unbalanced ratio (nearly 1:3) of non-sensitizers to sensitizers may be responsible for the worse classification performance for non-sensitizers than for sensitizers. This indicates that non-sensitizers may be prone to be falsely predicted as sensitizers. Conversely, if the data set contains many more non-sensitizers than sensitizers, the QSPR model will also give biased classification results. For example, Roberts et al. [28] have validated the TIMES-SS (TImes MEtabolism Simulator) expert system platform used for predicting skin sensitization with 40 chemicals, consisting of 24 non-sensitizers and 16 sensitizers. TIMES-SS was able to predict non-sensitizers reasonably well (prediction accuracy of 87.5%), while it predicted sensitizers very poorly (prediction accuracy of 56.0%).
As shown in Table 7, the five selected molecular descriptors come from four blocks of descriptors. nDB is a constitutional descriptor indicating the number of double bonds. The π-bond electrons in double bonds are more active than the σ-bond electrons in single bonds. Therefore, a molecule with more double bonds will have a more deformable electron cloud and may be prone to combine with the target. Five kinds of chemical reaction mechanistic domain [29,30] have been proposed, including Michael acceptors, SN2 electrophiles, SNAr electrophiles, Schiff base electrophiles and acyl transfer electrophiles. According to Roberts [31], some compounds containing an electron-deficient double bond can be confidently assigned as Michael acceptors or pro-Michael acceptors. EEig07d and EEig14d are both descriptors related to molecular dipole moments, which indicate the molecular polarity and are closely related to the hydrophobic properties and skin permeability of molecules. O-057, the number of phenol/enol/carboxyl OHs, may be responsible for the molecular polarity and hydrogen bonding, which are related to the combination or reaction of compounds with specific groups of the receptor. As described in Ref. [31], some aromatic compounds with two meta hydroxyl groups may follow two possible mechanisms: reaction with molecular oxygen to introduce a hydroxyl group either ortho or para to the original hydroxyl groups, and direct binding to protein via attack of a protein-centered radical. Infective-80 is a descriptor derived from biological experiment, which may directly reflect the biochemical effect of skin sensitization induction.
Fedorowicz [5] has also investigated the original GPMT data set (including 82 sensitizers and 23 non-sensitizers) with logistic regression and two expert systems, TOPKAT and DEREK. The reported classification results for the training set with logistic regression in Ref. [5] are listed in Table 9, and the reported prediction results for the whole GPMT data set with TOPKAT and DEREK are shown in Table 10. For comparison, the reported results with logistic regression were compared to the classification results of this paper for the training set, while the reported results with TOPKAT and DEREK were compared to the prediction results of this paper for the test set. As seen from Tables 9 and 10, the prediction performances of the method in this paper are far superior to those in the literature, especially for the specificity. However, expert systems such as DEREK and TIMES-SS have recently been improved by modifying the alerts describing the skin sensitization potential or by considering more mechanistic knowledge and new rules for chemicals [28,32]. Therefore, better results than those previously reported in the literature may be achieved if the prediction is conducted with the improved expert systems. Golla et al. [27] built neural network models using 25, 25 and 22 molecular descriptors for LLNA, GPMT and BgVV data sets with 358, 307 and 251 compounds, respectively. The classification accuracies for these three data sets are 90%, 95% and 90%, respectively. Compared with Ref. [27], this paper obtained a classification accuracy (for the training set) of 95.37% for the LLNA data set and 91.80% for the GPMT data set using only five or six molecular descriptors. The estimated skin sensitivities are listed in Table 6.
As seen from Table 6, there are ten inconsistent values in the experimental results of the 92 compounds with both LLNA and GPMT data. The accuracy (or concordance) of the experimental results obtained from the two different test procedures for assessing skin sensitization is therefore 89.13%. In 1999, the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM), with support from the National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM), validated the experimental procedures by comparing LLNA data for 97 chemicals to the available GPMT data, and also obtained an accuracy of 89% [27]. From the above analysis, we may roughly assume that the experimental accuracy of skin sensitization is close to 89%. The prediction accuracies (for the test set) of the LLNA and GPMT data sets using PSO optimized SVM in this paper were 88.89% and 90.32%, respectively, which are in good agreement with the experimental accuracy.
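The concordance figure quoted above follows directly from the counts given in the text, ten disagreements among 92 compounds:

$$\text{Concordance} = \frac{92 - 10}{92} \times 100\% = \frac{82}{92} \times 100\% \approx 89.13\%$$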

4. Conclusions

This paper has investigated skin sensitization for the LLNA and GPMT data sets using a particle swarm optimized support vector machine. The classification accuracies, sensitivities and specificities for both data sets were all satisfactory and largely improved compared to those obtained by logistic regression and the expert systems reported in the literature. This study has confirmed that the quantitative structure-activity relationship approach can be a promising complement to animal testing in the area of hazard identification, provided that a reasonable QSAR model has been constructed. The SVM classifier built in this paper can be used to assess the skin sensitization of environmental chemicals.


This work is financially supported by the National Natural Science Foundation of China (No. 20772028) and the Provincial Natural Science Foundation of Hunan (No. 06JJ2002).

References and Notes

  1. Nonfatal illness. Worker Health Chartbook; DHHS(NIOSH) Publication, DHHS: Cincinnati, OH, USA, 2000; no. 2002–120. [Google Scholar]
  2. Lushniak, BD. The importance of occupational skin diseases in the United States. Int. Arch. Occup. Environ. Health 2003, 76, 325–330. [Google Scholar]
  3. Regulation (EC) No. 1907/2006 of the European Parliament and of the Council of 18 December 2006, concerning the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), establishing a European Agency, amending Directive 1999/45/EC and Repealing Council Regulation (EEC) No. 793/93 and Commission Regulation (EC) No. 1488/94 as well as Council Directive 76/769/EEC and Commission Directives 91/155/EEC, 93.67/EEC, 93/105/EC and 2000/21/EC.
  4. Andersen, KE; Frankild, S. Allergic contact dermatitis. Clin. Dermatol 1997, 15, 645–654. [Google Scholar]
  5. Fedorowicz, A; Singh, H; Soderholm, S; Demchuk, E. Structure-activity models for contact sensitization. Chem. Res. Toxicol 2005, 18, 954–969. [Google Scholar]
  6. European Centre for Ecotoxicology and Toxicology of Chemicals (ECETOC). Workshop on regulatory acceptance of (Q)SARs for human health and environmental endpoints, Setubal, Portugal, March 4–6, 2004.
  7. Ren, YY; Liu, HX; Xue, CX; Yao, XJ; Liu, MC; Fan, BT. Classification study of skin sensitizers based on support vector machine and linear discriminant analysis. Anal. Chim. Acta 2006, 572, 272–282. [Google Scholar]
  8. Estrada, E; Patlewicz, G; Chamberlain, M; Basketter, D; Larbey, S. Computer-aided knowledge generation for understanding skin sensitization mechanisms: The TOPS-MODE approach. Chem. Res. Toxicol 2003, 16, 1226–1235. [Google Scholar]
  9. Li, S; Fedorowicz, A; Singh, H; Soderholm, SC. Application of the random forest method in studies of local lymph node assay based skin sensitization data. J. Chem. Inf. Model 2005, 45, 952–964. [Google Scholar]
  10. Li, Y; Pan, D; Liu, J; Kern, PS; Gerberick, GF; Hopfinger, AJ; Tseng, YJ. Categorical QSAR models for skin sensitization based upon local lymph node assay classification measures Part 2: 4D-Fingerprint three-state and two-2-state logistic regression models. Toxicol. Sci 2007, 99, 532–544. [Google Scholar]
  11. Ren, S; Schultz, TW. Identifying the mechanism of aquatic toxicity of selected compounds by hydrophobicity and electrophilicity descriptors. Toxicol. Lett 2002, 129, 151–160. [Google Scholar]
  12. Mosier, PD; Jurs, PC; Custer, LL; Durham, SK; Pearl, GM. Predicting the genotoxicity of thiophene derivatives from molecular structure. Chem. Res. Toxicol 2003, 16, 721–732. [Google Scholar]
  13. Fourie, PC; Groenwold, AA. Particle swarms in size and shape optimization. In Proceedings of the International Workshop on Multidisciplinary Design Optimization, Pretoria, South Africa, August 7–10, 2000; pp. 97–106.
  14. Al-kazemi, B; Mohan, CK. Multi-phase discrete particle swarm optimization. Fourth International Workshop on Frontiers in Evolutionary Algorithms, North Carolina, USA, March 8–13, 2002.
  15. Huang, JP; Ma, GL; Muhammad, I; Cheng, YY. Identifying P-glycoprotein substrates using a support vector machine optimized by a particle swarm. J. Chem. Inf. Model 2007, 47, 1638–1647. [Google Scholar]
  16. Lin, SW; Ying, KC; Chen, SC; Lee, ZJ. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Syst. Appl 2008, 35, 1817–1824. [Google Scholar]
  17. Todeschini, R; Consonni, V; Mauri, A; Pavan, M. Dragon 5.4; Milano Chemometrics and QSAR Research Group; University of Milano-Bicocca: Milan, Italy, 2006. [Google Scholar]
  18. Han, J; Kamber, M. Data Mining: Concepts and Techniques, 2nd ed; Morgan Kaufmann: San Francisco, CA, USA, 2006. [Google Scholar]
  19. Kennedy, J; Eberhart, RC. Particle swarm optimization. Proceedings of the IEEE conference on Neural Networks 1995, 4, 1942–1948. [Google Scholar]
  20. Shen, Q; Jiang, JH; Jiao, CX; Huan, SY; Shen, GL; Yu, RQ. Optimized partition of minimum spanning tree for piecewise modeling by particle swarm algorithm. QSAR studies of antagonism of angiotensin II antagonists. J. Chem. Inf. Comput. Sci 2004, 44, 2027–2031. [Google Scholar]
  21. Jiang, M; Luo, YP; Yang, SY. Stochastic convergence analysis and parameter selection of the standard particle swarm optimization algorithm. Inform. Process. Lett 2007, 102, 8–16. [Google Scholar]
  22. Shen, Q; Shi, WM; Kong, W; Ye, BX. A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification. Talanta 2007, 71, 1679–1683. [Google Scholar]
  23. Vapnik, VN. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995. [Google Scholar]
  24. Chang, CC; Lin, CJ. LIBSVM: A library for support vector machines. Software available at, accessed April, 2001.
  25. Li, Y; Tseng, YJ; Pan, D; Liu, J; Kern, PS; Gerberick, GF; Hopfinger, AJ. 4D-Fingerprint categorical QSAR models for skin sensitization based on the classification of local lymph node assay measures. Chem. Res. Toxicol 2007, 20, 114–128. [Google Scholar]
  26. Kubinyi, H. QSAR: Hansch Analysis and Related Approaches; VCH Verlagsgesellschaft mbH: Weinheim, Germany, 1993; p. 11. [Google Scholar]
  27. Golla, S; Madihally, S; Robinson, RL, Jr; Gasem, KAM. Quantitative structure–property relationship modeling of skin sensitization: A quantitative prediction. Toxicol. in Vitro 2009, 23, 454–465. [Google Scholar]
  28. Roberts, DW; Patlewicz, G; Dimitrov, SD; Low, LK; Aptula, AO; Kern, PS; Dimitrova, GD; Comber, MIH; Phillips, RD; Niemelä, J; Madsen, C; Wedebye, EB; Bailey, PT; Mekenyan, OG. TIMES-SS - A mechanistic evaluation of an external validation study using reaction chemistry principles. Chem. Res. Toxicol 2007, 20, 1321–1330. [Google Scholar]
  29. Aptula, AO; Roberts, DW. Mechanistic applicability domains for non-animal based toxicological endpoints. General principles and application to reactive toxicity. Chem. Res. Toxicol 2006, 19, 1097–1105. [Google Scholar]
  30. Aptula, AO; Patlewicz, G; Roberts, DW. Skin sensitization: reaction mechanistic applicability domains for structure-activity relationships. Chem. Res. Toxicol 2005, 18, 1420–1426. [Google Scholar]
  31. Roberts, DW; Patlewicz, G; Kern, PS; Gerberick, F; Kimber, I; Dearman, RJ; Ryan, CA; Basketter, DA; Aptula, AO. Mechanistic applicability domain classification of a local lymph node assay dataset for skin sensitization. Chem. Res. Toxicol 2007, 20, 1019–1030. [Google Scholar]
  32. Langton, K; Patlewicz, GY; Long, A; Marchant, CA; Basketter, DA. Structure–activity relationships for skin sensitization: recent improvements to Derek for Windows. Contact Dermatitis 2006, 55, 342–347. [Google Scholar]
Table 1. The composition of LLNA and GPMT data sets.
Class | LLNA Tr a | LLNA Te b | LLNA total | GPMT Tr a | GPMT Te b | GPMT total
Sensitizer (+) | 76 | 43 | 119 | 47 | 24 | 71
Non-sensitizer (−) | 32 | 11 | 43 | 14 | 7 | 21
a Tr represents the training set;
b Te represents the test set.
Table 2. The six most important molecular descriptors selected by PSO for SVM classification of the LLNA data set.
Descriptor symbol | Descriptor block | Definition
nCL | Constitutional descriptors | Number of chlorine atoms
MAXDN | Topological descriptors | Maximal electrotopological negative variation
MATS1e | 2D autocorrelations | Moran autocorrelation, lag 1, weighted by atomic Sanderson electronegativities
MATS2m | 2D autocorrelations | Moran autocorrelation, lag 2, weighted by atomic masses
BELm1 | Burden eigenvalues | Lowest eigenvalue n. 1 of Burden matrix, weighted by atomic masses
MLOGP | Molecular properties | Moriguchi octanol-water partition coeff. (logP)
Table 3. Performance of the SVM classifier combined with PSO for skin sensitization of the LLNA data set.
Data set | TP | FN | TN | FP | Accuracy | Sensitivity | Specificity
Training set | 75 | 1 | 28 | 4 | 95.37% | 98.68% | 87.50%
Test set | 40 | 3 | 8 | 3 | 88.89% | 93.02% | 72.73%
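As a cross-check on Table 3, the three percentages can be recomputed from the four count columns. The following is a minimal Python sketch, assuming those columns are, in order, true positives, false negatives, true negatives and false positives; this ordering is inferred here because it reproduces the reported values, and the function name `classification_metrics` is purely illustrative.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) as percentages.

    Assumed column order of the count columns in Table 3: TP, FN, TN, FP.
    """
    total = tp + fn + tn + fp
    accuracy = 100.0 * (tp + tn) / total   # correctly classified / all compounds
    sensitivity = 100.0 * tp / (tp + fn)   # sensitizers correctly identified
    specificity = 100.0 * tn / (tn + fp)   # non-sensitizers correctly identified
    return accuracy, sensitivity, specificity


# Counts copied from the LLNA rows of Table 3.
llna_counts = {
    "Training set": (75, 1, 28, 4),
    "Test set": (40, 3, 8, 3),
}

for name, counts in llna_counts.items():
    acc, sens, spec = classification_metrics(*counts)
    print(f"{name}: accuracy {acc:.2f}%, sensitivity {sens:.2f}%, specificity {spec:.2f}%")
# Training set: accuracy 95.37%, sensitivity 98.68%, specificity 87.50%
# Test set: accuracy 88.89%, sensitivity 93.02%, specificity 72.73%
```

Under the same assumption, the sensitizer counts (TP + FN) and non-sensitizer counts (TN + FP) recovered this way match the LLNA training and test compositions listed in Table 1.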
Table 4. Comparison of the classification performance of this paper with that reported in previous studies on the training set of the LLNA data set.
Method | Accuracy | Sensitivity | Specificity
Logistic regression [5] | 83.20% | 94.70% | 50.00%
This paper | 95.37% | 98.68% | 87.50%
Table 5. Comparison of the prediction performance of this paper with that reported in previous studies on the test set of the LLNA data set.
Method | Accuracy | Sensitivity | Specificity
This paper | 88.89% | 93.02% | 72.73%
Table 6. The investigated compounds and their experimental and estimated skin sensitivities.
No. | Compound | CAS No. | LLNA exp. | LLNA pred. | GPMT exp. | GPMT pred.
1 | Propylene glycol | 57-55-6 | −1 | −1 | −1 | −1
2 a | Hexane | 110-54-3 | −1 | −1 | - | -
3 b | Lactic acid | 50-21-5 | −1 | −1 | −1 | −1
5 b | Resorcinol | 108-46-3 | −1 | 1 | −1 | 1
7 | Ethyl methanesulfonate | 62-50-0 | −1 | −1 | - | -
10 | 4-Aminobenzoic acid | 150-13-0 | −1 | −1 | −1 | −1
11 b | 4-Hydroxybenzoic acid | 99-96-7 | −1 | −1 | −1 | −1
12 b | Salicylic acid | 69-72-7 | −1 | −1 | −1 | −1
13 a | 2-Hydroxypropyl methacrylate | 923-26-2 | −1 | −1 | −1 | −1
14 b | Tartaric acid | 87-69-4 | −1 | −1 | −1 | −1
15 | Methyl salicylate | 119-36-8 | −1 | −1 | −1 | −1
16 a | Geraniol | 106-24-1 | −1 | 1 | −1 | −1
20 a | Sulfanilamide | 63-74-1 | −1 | −1 | −1 | 1
22 a | di-2-Furanylethanedione | 492-94-4 | −1 | −1 | - | -
24 a | Dimethyl isophthalate | 1459-93-4 | −1 | −1 | −1 | −1
27 | Phthalic acid diethyl ester | 84-66-2 | −1 | −1 | - | -
28 a b | 5,5-Dimethyl-3-(mesyloxymethyl)dihydro-2(3H)-furanone | 154750-22-8 | −1 | −1 | −1 | 1
30 | N’-(4-methylcyclohexyl)-N-(2-chloroethyl)-N-nitrosourea | 13909-09-6 | −1 | −1 | - | -
31 | 3-(Benzenesulfonyloxymethyl)-5,5-dimethyldihydro-2(3H)-furanone | 154750-24-0 | −1 | −1 | - | -
32 | Benzoyloxy-3,5-benzene dicarboxylic acid | 102059-70-1 | −1 | −1 | 1 | 1
33 b | 5,5-Dimethyl-3-(tosyloxymethyl)-dihydro-2(3H)-furanone | 154060-50-1 | −1 | −1 | −1 | −1
34 | 5,5-Dimethyl-3-(methoxybenzenesulfonyloxymethyl)dihydro-2(3H)-furanone | 154750-23-9 | −1 | −1 | 1 | 1
35 | 3-(Chlorobenzenesulfonyloxymethyl)-5,5-dimethyldihydro-2(3H)-furanone | 154750-28-4 | −1 | −1 | - | -
36 b | 5,5-Dimethyl-3-(nitrobenzenesulfonyloxymethyl)dihydro-2(3H)-furanone | 154750-29-5 | −1 | −1 | 1 | 1
37 b | Octadecylmethane sulfonate | 31081-59-1 | −1 | −1 | 1 | 1
38 a | α-Trimethylammonium-4-tolyloxy-4-benzenesulfonate | 264869-81-0 | −1 | −1 | 1 | 1
39 a | Hydrocortisone | 50-23-7 | −1 | −1 | - | -
42 a b | Streptomycin | 57-92-1 | −1 | −1 | 1 | 1
43 a | Neomycin | 1405-10-3 | −1 | −1 | −1 | −1
44 b | Ethylenediamine | 107-15-3 | 1 | 1 | 1 | 1
45 a | β-Propiolactone | 57-57-8 | 1 | 1 | - | -
46 a | Pyridine | 110-86-1 | 1 | 1 | - | -
47 a | 2,3-Butanedione | 431-03-8 | 1 | 1 | - | -
48 a | Aniline | 62-53-3 | 1 | 1 | 1 | 1
49 | N,N-dimethyl-1,3-propanediamine | 109-55-7 | 1 | 1 | 1 | 1
53 a | p-Xylene | 106-42-3 | 1 | 1 | - | -
55 a | 3-Phenylenediamine | 108-45-2 | 1 | 1 | 1 | 1
56 a | 4-Phenylenediamine | 106-50-3 | 1 | 1 | 1 | 1
58 b | 3-Aminophenol | 591-27-5 | 1 | 1 | 1 | 1
59 a | 2-Aminophenol | 95-55-6 | 1 | 1 | 1 | 1
61 | Methyl methanesulfonate | 66-27-3 | 1 | 1 | - | -
62 | 2-Hydroxyethyl acrylate | 818-61-1 | 1 | 1 | 1 | 1
63 a | N-Ethyl-N-nitrosourea | 759-73-9 | 1 | 1 | - | -
65 a b | 4-Methylcatechol | 452-86-8 | 1 | 1 | 1 | 1
66 | Dimethyl sulfate | 77-78-1 | 1 | 1 | - | -
67 a | 5,5-Dimethyl-3-methylenedihydro-2(3H)-furanone | 29043-97-8 | 1 | 1 | 1 | 1
68 a | Butyl glycidyl ether | 2426-08-6 | 1 | 1 | 1 | 1
69 a | Cinnamic aldehyde | 104-55-2 | 1 | 1 | 1 | 1
71 a | Benzoyl chloride | 98-88-4 | 1 | 1 | 1 | 1
73 a | Phthalic anhydride | 85-44-9 | 1 | 1 | 1 | 1
76 a b | 5-Chloro-2-methyl-4-isothiazolin-3-one | 26172-55-4 | 1 | 1 | 1 | 1
77 a | 4-Nitroso-N,N-dimethylaniline | 138-89-6 | 1 | 1 | 1 | 1
79 b | Citral | 5392-40-5 | 1 | 1 | 1 | 1
80 | Diethyl sulfate | 64-67-5 | 1 | 1 | - | -
81 a | 2-Methyl-4,5-trimethylene-4-isothiazolin-3-one | 82633-79-2 | 1 | 1 | 1 | 1
82 a | 1-Ethyl-3-nitro-1-nitrosoguanidine | 4245-77-6 | 1 | 1 | - | -
86 b | Dihydroeugenol | 2785-87-7 | 1 | 1 | 1 | 1
87 b | 2-Mercaptobenzothiazole | 149-30-4 | 1 | 1 | 1 | 1
88 | Benzyl bromide | 100-39-0 | 1 | 1 | - | -
89 | 4-Nitrobenzyl chloride | 100-14-1 | 1 | 1 | 1 | 1
90 a | Hydroxycitronellal | 107-75-5 | 1 | 1 | 1 | 1
92 a | Nonanoyl chloride | 764-85-2 | 1 | 1 | - | -
93 | 3,5,5-Trimethylhexanoyl chloride | 36727-29-4 | 1 | 1 | 1 | 1
95 a | 5-Methyleugenol | 186743-25-9 | 1 | 1 | - | -
98 a | 5,5-Dimethyl-3-(thiocyanatomethyl)-dihydro-2(3H)-furanone | 154750-32-0 | 1 | 1 | 1 | 1
100 a | Benzene-1,3,4-tricarboxylic anhydride | 552-30-7 | 1 | 1 | 1 | 1
102 a | Ethylene glycol dimethacrylate | 97-90-5 | 1 | 1 | −1 | −1
103 a b | Phenyl benzoate | 93-99-2 | 1 | 1 | 1 | 1
104 b | 2,4-Dinitrochlorobenzene | 97-00-7 | 1 | 1 | 1 | 1
105 b | 5,5-Dimethyl-3-(bromomethyl)dihydro-2(3H)-furanone | 154750-20-6 | 1 | 1 | 1 | 1
107 | Propyl gallate | 121-79-9 | 1 | 1 | 1 | 1
109 b | 4-Nitrobenzyl bromide | 100-11-8 | 1 | 1 | 1 | 1
110 b | Hexyl cinnamic aldehyde | 101-86-0 | 1 | 1 | 1 | 1
111 b | Isophorone diisocyanate | 4098-71-9 | 1 | 1 | 1 | 1
113 a | 3-Methoxyphenylbenzoate | 5554-24-5 | 1 | 1 | - | -
115 a | 1-Bromoundecane | 693-67-4 | 1 | 1 | - | -
116 | 3-Acetylphenyl benzoate | 139-28-6 | 1 | 1 | 1 | 1
117 b | Tetramethyl thiuram disulfide | 137-26-8 | 1 | 1 | 1 | 1
118 b | Benzoyl peroxide | 94-36-0 | 1 | 1 | 1 | 1
119 a | Picryl chloride | 88-88-0 | 1 | 1 | 1 | 1
120 a b | 1-Bromododecane | 143-15-7 | 1 | 1 | 1 | 1
121 | Methylene diphenyl diisocyanate | 101-68-8 | 1 | 1 | 1 | 1
122 a | 1-Chloromethylpyrene | 1086-00-6 | 1 | 1 | - | -
123 a | Benzopyrene | 50-32-8 | 1 | 1 | - | -
126 a | 1-Bromotridecane | 765-09-3 | 1 | 1 | - | -
127 a | Dodecyl methanesulfonate | 51323-71-8 | 1 | 1 | 1 | 1
128 | Methyl dodecanesulfonate | 2374-65-4 | 1 | 1 | 1 | 1
129 a | 12-Bromo-1-dodecanol | 3344-77-2 | 1 | 1 | - | -
132 a | Dodecylthiosulfonate | 127089-67-2 | 1 | 1 | 1 | 1
136 a | Hexadecanoyl chloride | 112-67-4 | 1 | 1 | - | -
138 a | 1-Bromotetradecane | 112-71-0 | 1 | 1 | - | -
139 | 12-Bromododecanoic acid | 73367-80-3 | 1 | 1 | - | -
141 | Octyl gallate | 1034-01-1 | 1 | 1 | - | -
144 b | Oxazolone | 1564-29-0 | 1 | 1 | 1 | 1
145 b | Abietic acid | 514-10-3 | 1 | 1 | 1 | 1
146 a | Octadecanoyl chloride | 112-76-5 | 1 | 1 | - | -
147 b | 1-Bromohexadecane | 112-82-3 | 1 | 1 | 1 | 1
148 | 2-Bromotetradecanoic acid | 10520-81-7 | 1 | 1 | - | -
149 | Methyl hexadec-2-ene sulfonate | 54612-23-6 | 1 | 1 | 1 | 1
151 a | 1-Bromoheptadecane | 3508-00-7 | 1 | 1 | - | -
152 a | 1-Iodotetradecane | 19218-94-1 | 1 | 1 | - | -
154 | Penicillin G | 61-33-6 | 1 | 1 | 1 | 1
158 a | 1-Iodooctadecane | 629-93-6 | 1 | 1 | - | -
159 b | Imidazolidinyl urea | 39236-46-9 | 1 | 1 | 1 | 1
160 | Dimethyl sulfostearate | 99785-70-3 | 1 | 1 | - | -
161 b | Sulfanilic acid | 121-57-3 | −1 | −1 | 1 | 1
162 a | Isononanoyloxybenzene sulfonate | 109363-00-0 | 1 | 1 | 1 | 1
a Compounds making up the test set of the LLNA data set;
b Compounds making up the test set of the GPMT data set;
c “-” denotes no GPMT data available.
Table 7. Five molecular descriptors selected by PSO for SVM classification of the GPMT data set.
Descriptor symbol | Descriptor block | Definition
nDB | Constitutional descriptors | Number of double bonds
EEig07d | Edge adjacency indices | Eigenvalue 07 from edge adj. matrix weighted by dipole moments
EEig14d | Edge adjacency indices | Eigenvalue 14 from edge adj. matrix weighted by dipole moments
O-057 | Atom-centred fragments | Phenol / enol / carboxyl OH
Infective-80 | Molecular properties | Ghose-Viswanadhan-Wendoloski antiinfective-like index at 80%
Table 8. Performance of the SVM classifier combined with PSO for skin sensitization of the GPMT data set.
Data set | TP | FN | TN | FP | Accuracy | Sensitivity | Specificity
Training set | 47 | 0 | 9 | 5 | 91.80% | 100.00% | 64.29%
Test set | 23 | 1 | 5 | 2 | 90.32% | 95.83% | 71.43%
Table 9. Comparison of the classification performance of this paper with that reported in previous studies on the training set of the GPMT data set.
Method | Accuracy | Sensitivity | Specificity
Logistic regression | 87.60% | 98.80% | 47.80%
This paper | 91.80% | 100.00% | 64.29%
Table 10. Comparison of the prediction performance of this paper with that reported in previous studies on the test set of the GPMT data set.
Method | Accuracy | Sensitivity | Specificity
This paper | 90.32% | 95.83% | 71.43%
Int. J. Mol. Sci. EISSN 1422-0067 Published by MDPI AG, Basel, Switzerland