Prediction of Skin Sensitization with a Particle Swarm Optimized Support Vector Machine

Skin sensitization is the most commonly reported occupational illness, causing much suffering to a wide range of people. Identification and labeling of environmental allergens is urgently required to protect people from skin sensitization. The guinea pig maximization test (GPMT) and murine local lymph node assay (LLNA) are the two most important in vivo models for identification of skin sensitizers. In order to reduce the number of animal tests, quantitative structure-activity relationships (QSARs) are strongly encouraged in the assessment of skin sensitization of chemicals. This paper has investigated the skin sensitization potential of 162 compounds with LLNA results and 92 compounds with GPMT results using a support vector machine. A particle swarm optimization algorithm was implemented for feature selection from a large number of molecular descriptors calculated by Dragon. For the LLNA data set, the classification accuracies are 95.37% and 88.89% for the training and the test sets, respectively. For the GPMT data set, the classification accuracies are 91.80% and 90.32% for the training and the test sets, respectively. The classification performances were greatly improved compared to those reported in the literature, indicating that the support vector machine optimized by particle swarm in this paper is competent for the identification of skin sensitizers.


Introduction
With the fast development of industry, agriculture and medication, human are exposed to more and more exogenous chemicals, some of which may result in allergic contact dermatitis after accidental or deliberate skin contact. The medical condition of allergic contact dermatitis is known as skin sensitization, which is associated with an alteration of the immune system. According to statistics from the U.S. Bureau of Labor, occupational contact dermatitis is the most commonly reported non-trauma related category of occupational illnesses in the United States [1]. The total annual losses due to occupational skin diseases were estimated to amount to over one billion dollars [2]. Therefore, identification and labelling of environmental allergens is an urgent request from consumer organizations, industry, and governmental agencies to protect people from skin sensitization. In the new European Union (EU) chemical policy REACH, information on skin sensitization potential will have to be provided for any chemicals manufactured or imported in amounts above 1 tonne/year [3].
The guinea pig maximization test (GPMT) and murine local lymph node assay (LLNA) are the two most commonly used in vivo models for identification of skin sensitizers. GPMT combines the use of intradermal administration of the compound with and without Freund's complete adjuvant (FCA), and occluded topical application of the compound a week later [4]. The result of GPMT relies on subjective evaluation of a group of animals and is usually expressed as the dichotomous (sensitizer/nonsensitizer) form. The murine local lymph node assay defines skin sensitization hazard as a function of the ability of the test chemical to provoke lymphocyte proliferation on lymph nodes draining the site of topical application. A substance is classified as a sensitizer if it induces a threefold stimulation index (EC3) or greater at one or more test concentrations. The value of EC3 is continuous and indicates the relative skin sensitizing potency of chemicals, but the majority of published LLNA data nowadays are also in the dichotomous form [5]. Due to the huge number of chemicals with unknown skin sensitization potential, exhaustive animal testing of such chemicals is costly and raises ethical concerns. Therefore, the use of other alternative methods such as quantitative structure-activity relationships (QSARs) is strongly encouraged in order to reduce the number of animal tests. Several legislations have recently emerged to further develop and increase the acceptance of QSARs in assessment of skin sensitization of chemicals [6] and much work has been reported [7][8][9][10].
This paper aimed to build a classifier to distinguish skin sensitizers from non-sensitizers based on various compounds with LLNA and GPMT results. When compared to other classification techniques such as discriminant analysis [11], random forest methods [9] and artificial neural networks [12] the support vector machine (SVM) has been proven advantageous in handling classification tasks in cases of high dimensionality of data points. However, the input features (molecular descriptors here) of SVM play a very important role in the classification performance. Not all the molecular descriptors are equally important for a specific classification. Many of them may be redundant or irrelevant. If SVM is implemented without feature selection, the dimension of the input space is very large and non-clean, which will impair the performance of the SVM. The particle swarm optimization (PSO) algorithm is a swarm intelligence method for optimization problems and has been widely applied to feature selection. Compared with other descriptor selection approaches, such as the genetic algorithm (GA) and recursive feature elimination (RFE), PSO is not only much simpler in concept and more computationally efficient [13], but it also exhibits advantages in solving many kinds of optimization problems featuring nonlinearity and nondifferentiability, multiple optima, and high dimensionality [14,15]. Lin et al. [16] conducted a thorough study on the performance of PSO as a parameter determination and feature selection technique for SVM. The results based on about ten different data sets adequately confirmed that the performance of SVM+PSO outperforms that of SVM+GA and SVM without feature selection. Therefore, this paper investigates the potential of the support vector machine in combination of the particle swarm optimization algorithm for feature selections in addressing the problem of identification of sensitizers.

Data Set
The Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) issued the LLNA results of 209 compounds, and for some of which, the GPMT results were also available. All the experimental data were obtained within the "spirit" of Good Laboratory Practice guidelines. Fedorowicz [5] cleared out the inorganic salts, natural products and polymers from the ICCVAM list and developed a data set of 178 organic compounds, although it still contains 16 sodium salts which cannot be processed by the Dragon software used to calculate molecular descriptors in this work. Thus, a total 162 compounds were used in this paper after sodium salts were excluded. These compounds pertain to a number of chemical classes, including alkanes, aromatic hydrocarbons, alcohols, amines, acids, esters and so on. All 162 compounds have LLNA results, which indicate 119 sensitizers and 43 non-sensitizers. Furthermore, 92 of 162 compounds also have GPMT data, indicating 71 sensitizers and 21 non-sensitizers. For convenience of expression, the above two data sets with LLNA and GPMT results were denoted as LLNA data set and GPMT data set, respectively. For each data set, two thirds of compounds were randomly assigned as the training set, and the leftovers composed the test set. The information of each data set is shown in Table 1.

Calculation of Molecular Descriptors
Molecular descriptors characterizing molecular structure were calculated in Dragon 5.4 [17]. Twenty blocks of molecular descriptors were embodied in Dragon package. In this paper, only 926 descriptors contained in blocks 1-10, 17-18 and 20 were calculated, with no consideration of 3D descriptors. These descriptors consisted of constitutional descriptors, topological descriptors, walk and path counts, connectivity indices, information indices, 2D autocorrelations, edge adjacency indices, BCUT descriptors, topological charge indices, eigenvalue-based indices, functional group counts, atom-centered fragments, and molecular properties.

Preprocessing of Molecular Descriptors
In order to delete the noisy, irrelevant and redundant information, the calculated 926 molecular descriptors were preprocessed by eliminating: 1) those having same values for greater than 90% of the compounds; 2) those having high correlation coefficients (>0.85) with other descriptors.
Since these molecular descriptors characterize the structural information from extensive perspectives, their magnitudes are quite various. In order to prevent the descriptors in greater numeric ranges from outweighing those in smaller numeric ranges, the original descriptors were scaled to the range [0, 1] using min-max normalization method [18] prior to the next feature selection step with particle swarm optimization (PSO) algorithm. Min-max normalization was realized according to Equation (1): where min and max are the minimum and maximum values of a descriptor, V and V' represent the descriptor before and after scaling, respectively.

Particle Swarm Optimization (PSO) Algorithm
Particle swarm optimization is a population-based meta-heuristic algorithm that simulates social behavior such as bird flocking and fish schooling. Since introduced originally by Kennedy and Eberhart [19] in 1995, PSO has been continuously developed and widely applied to solving optimization problems due to its reduced memory requirements and fast convergence [20,21]. Like evolutionary algorithms, PSO performs searches using a population (swarm) of individuals (particles) that are updated from iteration to iteration to find an optimal solution. Each particle, representing a potential solution, is treated as a point in a D-dimension space and its status is characterized by its position and velocity. The position vector (x i ) and velocity vector (v i ) for particle i in a D-dimension space can be represented as L = that has been found by other neighboured particles in the swarm. At each iteration, p i and p g vector are combined to update the velocity of particle i along each dimension, and the velocity is then used to adjust the new position for that particle as given in Equations (2) and (3).
where w is an inertia weight which contributes to balance the global search and local search; c 1 and c 2 are two positive constants indicating the cognition learning factor and the social learning factor, respectively; r 1 and r 2 are random numbers uniformly distributed in the range [0, 1]. The iteration is terminated if the minimum error criterion (fitness) is attained or the number of iterations reaches the predetermined limit.
Although the basic PSO algorithm presented above was originally designed for continuous problems, attempts have been made to extend it to discrete optimization issues, where the particle position is composed of a set of bits that contain either '1' or '0', indicating being selected or not [22]. Most modified algorithms lose their consistent form or way of evolution exhibited in the continuous particle swarm algorithm. This paper designed the discrete PSO simulating the continuous PSO, where the position and velocity of a particle were updated in continuous space. Only when the new candidate position is passed to the fitness function, it is transformed from the continuous space to the discrete space. Supposing that the particle position values is limited to interval [0, 1], conversion can be accomplished by mapping the values hitting in the interval [0, 0.5) to 1, and other values to 0.
In general, the number of molecular descriptors selected for QSAR modeling is considered one of the important factors responsible for overfitting of QSAR models. Fewer molecular descriptors are generally preferred, so a punishment factor is often used in the fitness evaluation expression of PSO [15]. When the number of the candidate molecular descriptors is very large, it is inefficient to use traditional PSO algorithm directly for feature selection. The probability that only several descriptors are selected at each iteration may be very small because the number of descriptors selected in the evolution process obeys normal distribution. In order to improve the computing efficiency, the conversion of values from continuous space to discrete space is adjusted by mapping each value hitting in the interval [0, 0.05) to 1, and other values to 0. Thus, each descriptor has a probability of 1/20 of being selected and only about 1/20 of all descriptors are selected in each iteration. The probability of only several descriptors being selected will be increased dramatically.

Support Vector Machine (SVM)
SVM is an emerging and powerful machine learning algorithm proposed by Vapnik and co-workers in 1995 [23]. It has been extensively applied to various classification problems due to its high accuracy and its lesser proneness to overfitting than other machine learning methods. Instead of traditional empirical risk minimization, SVM achieves structural risk minimization, which results in the good generalization and avoids being trapped in local optima.
The basic theory of SVM has been presented in many references. Here only a brief description is given. A set of training points (compounds) are denoted as (x i , y i ), 1 ≤ i ≤ N, where N is the number of the training points; x i is the vector corresponding to data point i represented by a set of molecular descriptors in D-dimension space; y i is the class label taking value either +1 or -1. If the two classes are linearly separable, there exists a hyperplane that can separate the set by leaving all the vectors of the same class on the same side. The ultimate aim of SVM classification is to find an optimal separating hyperplane (OSH) as the decision surface to separate two classes of patterns with maximal margin. The optimal hyperplane H is defined mathematically by Equation (4) w·x where w is the weight vector normal to the separating hyperplane, b is the threshold. SVM constructs two parallel hyperplanes (H 1 and H 2 ) on each side of the maximal separating hyperplane that maximizes the distance between the two parallel hyperplanes. The vectors situated on two hyperplanes are called support vectors, which are used to define the separating hyperplane. Any points that fall on or above H 1 belong to class +1, and any data points that fall on or below H 2 belong to class -1, which can be represented as follows: ; class 2 (non-sensitizer) The distance from the hyperplane to any point on H 1 is 1/ w , where w is the Euclidean norm of w. The margin of the separating hyper-plane is calculated as 2/ w . The OSH has the largest margin among separating hyper-planes with the constrained optimization 2 min w w subject to inequalities (5) and (6). After the determination of w and b, the classification can be realized by Equation (7): In most cases, the data are not linearly separable, where no linear OSH exists in the current dimensional space. Therefore, the data are nonlinearly mapped into a high-dimensional feature space where linear separation can be performed. The transform can be done by using a kernel . Gaussian radial basis function, is one of the commonly used kernel functions. Linear support vector machine is then applied to this feature space, and the decision function is given as follows: where the coefficients i α and b are determined by maximizing the following Lagrange expression: where 0 ≥ Two parameters C and σ are very important to the performance of SVM. Parameter C represents the penalty cost, which influences the classification outcome. Parameter σ affects the partitioning outcome in the feature space. Ten-fold cross validation procedure was implemented to obtain the appropriate C and σ.

Implementation
The PSO algorithm and related programs were implemented in the Java programming language, running on the Java (TM) 2 Runtime Environment, Standard Edition (build 1.5.0_02-b09). The Java package of libsvm (version 2.8) [24] used in this work is freely available online.

Assessment of Results
In order to evaluate the prediction performance of SVM models, we define and compute the classification accuracy, sensitivity and specificity by the methods reported in Ref. [25]. The formulations are as follows: In Equations (10)- (12), TP is the number of true positives (sensitizers); FN, the number of false negatives (non-sensitizers); TN, the number of true negatives; and FP, the number of false positives.

LLNA data set
As seen in Table 1, the LLNA data set was randomly divided into a training set with 76 sensitizers and 32 non-sensitizers and a test set with 43 sensitizers and 11 non-sensitizers. Based on the training set, 123 molecular descriptors remained after preprocessing according to Section 2.3. Then, SVM combined with PSO algorithm was implemented. The PSO was set with 30 particles and 100 iterations. In each evaluation, a descriptor got a hit if this descriptor was selected. The descriptor selected more often will get more hits. Ten-fold cross validation was carried out against the training set, and the highest cross validation accuracy was used to determine the most appropriate set of molecular descriptors. In the end, six out of 123 molecular descriptors (listed in Table 2) were selected. The two most important parameters of SVM were also determined, i.e., C=15.81 and σ =14.13. The highest classification accuracy of cross validation against the training set is 83.33%. The classification accuracies, sensitivities and specificities of the training set and test set were all shown in Table 3.
Although skin sensitization is a complex toxicological phenomenon and its biomolecular processes have not been fully understood, previous studies have indicated the ability of active agents to cause immune response is related to skin permeability and the production of immunological conjugates with endogenous macromolecules [5,26]. Chemical reactivity, molecular size and skin permeability are important determinants for skin sensitization [27]. Known from Table 2, nCL represents the number of chlorine atoms in the molecule. The specific atom or group may be related to the combination or reaction of chemicals with skin protein. MAXDN and MATS1e characterize the molecular electronic structure, which may influence the electrostatic interactions between the chemical and protein. The binding of a chemical with skin proteins is considered as the rate-determing step for skin sensitization induction, where the chemical behaves as an electrophile and the protein as a nucleophile [28]. MATS2m and BELm1 are descriptors concerning molecular size, which may influence the skin penetration of compounds. The active agents causing skin sensitization are relatively small molecules. MLOGP, Moriguchi octanol-water partition coefficient, is an indicator of hydrophobic properties, which has been correlated to transport properties of a molecule, long-range ligand-receptor recognition and subsequent binding [26].  Fedorowicz [5] has also investigated the original LLNA data set (including 132 sensitizers and 46 non-sensitizers) with logistic regression and the DEREK expert system. The classification results for the training set with logistic regression and prediction results for the whole data set with DEREK reported in Ref. [5] are shown in Tables 4 and 5, respectively. For rationality, the reported results with logistic regression were compared to the classification results of this paper for the training set, while the reported results with DEREK were compared to the prediction results of this paper for the test set. Seen from Tables 4 and 5, the SVM classifier combined with PSO algorithm in this paper improved the results greatly, especially for the classification specificity. It has been explained in Ref. [5] that the very low specificity was resulted from the substantially unbalanced size of samples, i.e., the ratio of sensitizers largely overwhelming that of the non-sensitizers. However, the specificity in this paper attained 87.50% (50.00% with logistic regression in Ref. [5]) and 72.73% (32.60% with DEREK in Ref. [5]) for the training set and test set, respectively. The experimental and estimated skin sensitivities are listed in Table 6.

GPMT Data Set
For the GPMT data set, the same procedures as for the LLNA data set were carried out. After preprocessing according to Section 2.3, 127 molecular descriptors remained. Then, the SVM algorithm combined with PSO was implemented, and five molecular descriptors (listed in Table 7) were selected from the remaining 127 molecular descriptors. Two SVM parameters, i.e., C=45.63 and σ =1.90 were determined by 10-fold cross validation based on the training set. The total accuracy of the 10-fold cross validation is 90.16%. For the training set, the sensitizers were all classified correctly, and five non-sensitizers were mistaken as sensitizers. According to Equations (10) - (12), the classification accuracy, sensitivity and specificity are 91.80%, 100.00% and 64.29%, respectively. For the test set, only one sensitizer and two non-sensitizers were wrongly assigned. Table 8 lists the statistical parameters.
From Tables 3 and 8, we can see that the wrong classification ratio of non-sensitizers is larger than that of sensitizers for both LLNA and GPMT data sets. The unbalanced ratio (nearly 1:3) of non-sensitizers to sensitizers may be responsible for the worse classification performances for non-sensitizers than those for sensitizers. It is indicated that the non-sensitizers may be prone to be falsely predicted as sensitizers. On the contrary, if the dataset contains many more non-sensitizers than sensitizers, the QSPR model will also give biased classification results. For example, Roberts et al. [28] have validated the TIMES-SS (TImes MEtabolism Simulator) expert system platform used for predicting skin sensitization with 40 chemicals, consisting of 24 non-sensitizers and 16 sensitizers. TIMES-SS was able to predict non-sensitizers reasonably well (the prediction accuracy of 87.5%), while it predicted sensitizers very pooly (the prediction accuracy of 56.0%).  Table 7, the selected five molecular descriptors come from four blocks of descriptors. nDB is a constitutional descriptor indicating the number of double bonds. The π-bond electrons in double bonds are more active than the σ-bond electrons in single bonds, therefore, the molecule with more double bonds will have larger electronic cloud deformability and may be prone to combine with the target. Five kinds of chemical reaction mechanistic domain [29,30] have been proposed, including Michael acceptors, S N 2 electrophiles, S N Ar electrophiles, Schiff base electrophiles and acyl transfer electrophiles. According to Roberts [31], some compounds containing an electron-deficient double bond can be confidently assigned as Michael acceptors or pro-Michael acceptors. EEig07d and EEig14d are both descriptors related to molecular dipole moments, which indicate the molecular polarity and are closely related to the hydrophobic properties and skin permeability of molecules. O-057, the number of phenol/enol/carboxyl OHs, may be responsible for the molecular polarity and hydrogen bond, which has relationship with the combination or reaction of compounds with specific group of the receptor. As described in Ref. [31], some aromatic compounds with two meta hydroxyl groups may follow two possible mechanisms: reaction with molecular oxygen to introduce a hydroxyl group either ortho or para to the original hydroxyl groups, and directly binding to protein via attack of a protein-centered radical. Infective-80 is a descriptor derived from biological experiment, which may reflect directly the biochemical effect of skin sensitization induction.
Fedorowicz [5] has also investigated the original GPMT data set (including 82 sensitizers and 23 non-sensitizers) with logistic regression and two expert systems TOPKAT and DEREK. The reported classification results for the training set with logistic regression in Ref. [5] are listed in Table 9, and the reported prediction results for the whole GPMT data set with TOPKAT and DEREK are shown in Table 10. For comparison, the reported results with logistic regression were compared to the classification results of this paper for the training set, while the reported results with TOPKAT and DEREK were compared to the prediction results of this paper for the test set. As seen from Tables 9 and 10, the prediction performances of the method in this paper are far superior to those in the literature, especially for the specificity. However, expert systems such as DEREK and TIMES-SS have been recently improved by modifing the alerts describing the skin sensitization potential or considering more mechanisitic knowledge and new rules for chemicals [28,32]. Therefore, better results than those of previously reported in the literature may be achieved if the prediction is conducted with the improved expert systems. Golla et al. [27] built neural network models using 25, 25 and 22 molecular descriptors for LLNA, GPMT and BgVV data sets with 358, 307 and 251 compounds, respectively. The classification accuracies for the above mentioned three data sets are 90%, 95% and 90%, respectively. Compared with Ref. [27], this paper obtained the classification accuracy (for the training set) of 95.37% for LLNA data set and 91.80% for GPMT data set using only five or six molecular descriptors. The estimated skin sensitivities were listed in Table 6. Table 9. Comparison the classification performances of this paper with those reported in previous studies on the training set of GPMT data set.

Accuracy
Sensitivity Specificity Seen from Table 6, there are ten inconsistent values in the experimental results of 92 compounds with both LLNA and GPMT data. The accuracy (or concordance) of experimental results obtained from two different test procedures for assessing skin sensitization is 89.13%. In 1999, the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM), with support from the National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM), validated the experimental procedures by comparing LLNA data for 97 chemicals to the available GPMT data, and also obtained an accuracy of 89% [27]. From the above analysis, we may roughly assume that the experimental accuracy of skin sensitization is close to 89%. The prediction accuracies (for the test set) of LLNA and GPMT data sets using PSO optimized SVM in this paper were 88.89% and 90.32% respectively, which are in good agreement with the experimental accuracy.

Conclusions
This paper has investigated the skin sensitization against LLNA and GPMT data sets by particle swarm optimized support vector machine. The classification accuracies, sensitivities and specificities for both data sets were all satisfactory and largely improved compared to those obtained by logistic regression and the expert systems reported in the literature. This study has confirmed that the quantitative structure-activity relationship approach can be a promising complement to animal testing in the area of hazard identification only if a reasonable QSAR model has been constructed. The SVM classifier built in this paper can be used to assess the skin sensitization for environmental chemicals.