Article

Additive SMILES-Based Carcinogenicity Models: Probabilistic Principles in the Search for Robust Predictions

by Andrey A. Toropov 1,2,*, Alla P. Toropova 1,2 and Emilio Benfenati 2
1 Institute of Geology and Geophysics, 100041, Khodzhibaev St. 49, Tashkent, Uzbekistan
2 Istituto di Ricerche Farmacologiche Mario Negri, 20156, Via La Masa 19, Milano, Italy
* Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2009, 10(7), 3106-3127; https://doi.org/10.3390/ijms10073106
Submission received: 14 May 2009 / Revised: 23 June 2009 / Accepted: 2 July 2009 / Published: 8 July 2009
(This article belongs to the Special Issue Recent Advances in QSAR/QSPR Theory)

Abstract

Optimal descriptors calculated with the simplified molecular input line entry system (SMILES) have been used to model carcinogenicity as continuous values (logTD50). These descriptors are calculated from correlation weights of SMILES attributes, which are obtained by the Monte Carlo method. A considerable subset of these attributes is rare, and the use of rare attributes can lead to overtraining. Their influence can be avoided by fixing their correlation weights to zero. A threshold, limS, has been defined to identify rare attributes: limS is the minimum number of structures in the training (subtraining) set that must contain an attribute for it to be accepted as usable. If an attribute is present in fewer than limS structures, it is considered “rare” and is not used. Two systems of building up models were examined: (1) the classic training-test system; (2) the balance of correlations for the subtraining and calibration sets (together they form the original training set; the calibration set imitates a preliminary test set). Three random splits into subtraining, calibration, and test sets were analysed. Comparison of the two systems has shown that the balance of correlations gives more robust prediction of carcinogenicity for all three splits (split 1: r²test = 0.7514, stest = 0.684; split 2: r²test = 0.7998, stest = 0.600; split 3: r²test = 0.7192, stest = 0.728).


1. Introduction

Carcinogenicity is an important endpoint from a toxicological point of view, and quantitative structure–activity relationships (QSAR) are a tool for modeling this endpoint [1–3]. Usually, QSAR analysis is based on molecular descriptors calculated from molecular graphs [3,4]. However, the simplified molecular input line entry system (SMILES) [5–7] has become a promising alternative to molecular graphs in QSAR analysis [8–11], owing to the growth of Internet-accessible databases in which molecular structures are given in SMILES notation [15,16]. The present study aimed to estimate the ability of SMILES-based optimal descriptors to serve as a tool for QSAR analysis of the carcinogenicity of non-congeneric chemicals.

2. Materials and Methods

Carcinogenicity data: Experimental values for carcinogenicity were taken from publicly available data sources, and the chemical structures were further checked [17]. Carcinogenicity is expressed as the potency dose that induces cancer in rats (TD50, in mg/kg body weight). These values were converted into mmol/kg body weight, and -log(TD50) was used as the endpoint for the modelling. Initially, 401 chemicals were extracted from [17]; these are the substances for which numerical carcinogenicity data are available in [17].
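The unit conversion and log transform described above can be illustrated with a short sketch. The function name and the molar-mass argument are illustrative assumptions rather than part of the original workflow, and a base-10 logarithm is assumed.

```python
import math


def neg_log_td50(td50_mg_per_kg: float, molar_mass_g_per_mol: float) -> float:
    """Convert TD50 from mg/kg to mmol/kg and return -log10 of the result."""
    # mg divided by g/mol gives mmol, so the body-weight basis is unchanged.
    td50_mmol_per_kg = td50_mg_per_kg / molar_mass_g_per_mol
    return -math.log10(td50_mmol_per_kg)
```

With this convention a more potent carcinogen (a smaller TD50) receives a larger endpoint value.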
However, this set of 401 compounds contains eight outliers (Table 1): for these compounds the difference between the experimental value of -logTD50 and the value calculated by our approach is more than double the standard error (2s). The high symmetry and the presence of the N-nitroso group probably lead to the unusual behaviour of these substances. These compounds were removed; thus, 393 compounds were examined in this study. The SMILES notations used in this study were taken from [18].
We randomly split these 393 chemicals three times into subtraining (n=165), calibration (n=167), and test (n=61) sets. The -log(TD50) values in these sets range from about −2 to 5 logarithmic units. Below, these splits are denoted Split1, Split2, and Split3 (the Supplementary Materials contain lists of these splits).
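A random three-way split of this kind can be sketched as follows. The function name, the seed handling, and the use of Python's `random` module are assumptions made for illustration; the source does not describe how the randomization was actually performed.

```python
import random


def random_split(items, sizes=(165, 167, 61), seed=0):
    """Shuffle the items and cut them into three consecutive groups."""
    items = list(items)
    if sum(sizes) != len(items):
        raise ValueError("the three sizes must add up to the number of items")
    random.Random(seed).shuffle(items)
    n_sub, n_cal, _ = sizes
    subtraining = items[:n_sub]
    calibration = items[n_sub:n_sub + n_cal]
    test = items[n_sub + n_cal:]
    return subtraining, calibration, test
```

Calling the function with three different seeds yields three independent splits of the same 393 compounds.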
A modification of the descriptor that was used for modeling bee toxicity [10] is the tool for QSAR analysis of carcinogenicity. This descriptor is calculated as follows:
$$\mathrm{DCW}(\mathrm{limS}) = \mathrm{CW}(dC) + \sum_k \mathrm{CW}({}^{1}SA_k) + \sum_k \mathrm{CW}({}^{2}SA_k) + \sum_k \mathrm{CW}({}^{3}SA_k) \qquad (1)$$
where 1SAk, 2SAk, and 3SAk are SMILES attributes that contain one, two, and three SMILES elements, respectively. A SMILES element can consist of one symbol (e.g., ‘C’, ‘c’, ‘N’, ‘S’), two symbols (e.g., ‘Cl’, ‘Br’), three symbols (‘C=O’), or four symbols (‘[O−]’). The order of elements in the representation of 2SAk or 3SAk is defined by the ASCII order of the characters. In other words, only one version of an AB-sequence or ABC-sequence is possible in the list of SMILES attributes (not AB together with BA, or ABC together with CBA).
The dC is the difference between the number of ‘C’ (capital letter) and the number of ‘c’ (lowercase letter) in the given SMILES notation. For example, this global SMILES attribute is denoted as ‘!001’ if dC=N(‘c’) – N(‘C’)=1, and as ‘!-02’ if dC=−2. The CW(dC) is the correlation weight of dC. The symbol ‘C’ (capital letter) represents a carbon atom in the sp3 configuration, and the symbol ‘c’ (lowercase letter) represents a carbon atom in the sp2 configuration. Thus, dC is a measure of the presence of rigid and flexible fragments in the molecular architecture. The examined substances include chlorine-containing compounds, and the symbol ‘Cl’ gives an additional ‘C’. Chlorine is not a rigid fragment in a molecular system, so we calculated dC taking into account the ‘C’ from chlorine atoms. Table 2 contains an example of the representation of a SMILES by its set of SMILES attributes.
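The attribute extraction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' software: the tokenizer treats bracket atoms and the halogens ‘Cl’ and ‘Br’ as single elements and every other character as its own element (so ‘C=O’ is not merged into one element), the canonical order of a pair or triple is taken as the smaller of the forward and reversed sequences, and dC follows the verbal definition (number of ‘C’ minus number of ‘c’, with the ‘C’ of ‘Cl’ included). All function names are hypothetical.

```python
import re
from collections import Counter

# Assumed tokenizer: bracket atoms and Cl/Br are single elements,
# every other character is an element of its own.
_ELEMENT = re.compile(r"\[[^\]]*\]|Cl|Br|.")


def smiles_elements(smiles):
    """Split a SMILES string into a list of SMILES elements."""
    return _ELEMENT.findall(smiles)


def canonical(elements):
    """Keep only one of AB/BA (or ABC/CBA): the smaller in code-point order."""
    forward = tuple(elements)
    return min(forward, forward[::-1])


def smiles_attributes(smiles):
    """Count one-, two-, and three-element attributes of a SMILES string."""
    elements = smiles_elements(smiles)
    counts = Counter()
    for size in (1, 2, 3):
        for start in range(len(elements) - size + 1):
            counts[canonical(elements[start:start + size])] += 1
    return counts


def d_c(smiles):
    """Raw character counts, so the 'C' of 'Cl' is included, as the text states."""
    return smiles.count("C") - smiles.count("c")
```

For acetaldehyde, `smiles_attributes("CC=O")` contains the one-element attribute `("C",)` twice and a single three-element attribute `("C", "=", "O")`.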
CW(dC), CW(1SAk), CW(2SAk), and CW(3SAk) are the correlation weights of the above SMILES attributes. By means of the Monte Carlo method one can calculate numerical values of these weights that give the maximal determination coefficient (the square of the correlation coefficient, r²) for the training set. However, overtraining will most probably result, i.e., an excellent model for the training set will be accompanied by a poor model for the test set. In order to avoid overtraining one can use the correlation balance [11], i.e., split the available chemicals into three sets: subtraining, calibration, and external test set. This approach gave a reasonable result for the toxicity of 61 compounds [11]; however, for the carcinogenicity of 393 compounds it is not enough. Combining the correlation balance with blocking of rare SMILES attributes [10] can improve the model. Rare attributes are blocked by the following scheme: if the number of SMILES in the training (subtraining) set that contain the SMILES attribute SA* is less than limS, the correlation weight of SA* is fixed at zero, CW(SA*)=0.
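The blocking rule and the resulting descriptor can be sketched as below. The data layout is an assumption for illustration: each training structure is represented by the collection of attributes it contains, and a structure to be scored is represented by a mapping from attribute to its number of occurrences, so that every occurrence contributes its correlation weight to the sum.

```python
from collections import Counter


def active_attributes(training_attribute_sets, lim_s):
    """Attributes contained in at least lim_s training SMILES (not blocked)."""
    frequency = Counter()
    for attributes in training_attribute_sets:
        frequency.update(set(attributes))  # count each SMILES at most once
    return {a for a, n in frequency.items() if n >= lim_s}


def dcw(attribute_counts, weights, active):
    """Sum of correlation weights; blocked attributes contribute zero."""
    return sum(
        weights.get(attribute, 0.0) * n
        for attribute, n in attribute_counts.items()
        if attribute in active
    )
```

Raising `lim_s` shrinks the active set, which is the trade-off between overtraining and loss of information discussed in the text.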
Without rare attributes the model becomes better for the external test set. However, if limS is too large, the predictive potential of the model decreases, because a small number of active SMILES attributes cannot provide a high-quality model. Thus, the central point of the modeling system is the selection of the most efficient limS. The general scheme of the construction of optimal SMILES-based descriptors by the correlation balance method is represented in Figure 1.
This system can be denoted as the [Subtraining-Calibration-Test] system. The model can be satisfactory if N111, i.e., the number of active (not blocked) attributes that are present in the subtraining, calibration, and test sets, is as large as possible. The more traditional, “classic” approach is the construction of the model using the united training set to predict the endpoint for an external test set. This system can be denoted as the [Training-Test] system. This model can be satisfactory if N101, i.e., the number of active attributes that are present in both the training and test sets, is as large as possible.
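The counts N111 and N101 can be obtained with simple set operations. In this sketch each chemical set is assumed to be a list of per-structure attribute collections, and the helper names are hypothetical.

```python
def attributes_present(attribute_sets):
    """Union of the attributes found in any structure of one chemical set."""
    present = set()
    for attributes in attribute_sets:
        present.update(attributes)
    return present


def count_shared_active(active, *chemical_sets):
    """Number of active attributes that occur in every given chemical set."""
    shared = set(active)
    for chemical_set in chemical_sets:
        shared &= attributes_present(chemical_set)
    return len(shared)
```

N111 corresponds to `count_shared_active(active, subtraining, calibration, test)`, and N101 to `count_shared_active(active, subtraining + calibration, test)`.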
The correlation weights were calculated by Monte Carlo optimization. The [Training-Test] system is based on correlation weights that provide the maximum of the correlation coefficient between DCW(limS) and log(TD50) for the training set. The [Subtraining-Calibration-Test] system is based on correlation weights that provide the maximum of a target function (TF) calculated as follows [11–14]:
$$\mathrm{TF} = D(\mathrm{subtraining}) + D(\mathrm{calibration}) - \mathrm{ABS}\left[D(\mathrm{subtraining}) - D(\mathrm{calibration})\right] \times 0.1 \qquad (2)$$
where D(subtraining) and D(calibration) are the determination coefficients between DCW(limS) and log(TD50) for the subtraining and calibration sets, respectively. Thus, the optimization for both systems has been carried out by the same algorithm [11], but with different target functions.
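A minimal sketch of the target function follows, assuming the balance-of-correlations form in which the absolute difference between the two determination coefficients is subtracted as a penalty with weight 0.1. The function and argument names are illustrative.

```python
def target_function(d_subtraining, d_calibration, penalty=0.1):
    """Reward high determination coefficients, penalize their imbalance."""
    imbalance = abs(d_subtraining - d_calibration)
    return d_subtraining + d_calibration - imbalance * penalty
```

Under this assumption, two sets with equal determination coefficients of 0.75 score 1.5, while 0.8 and 0.7 score 1.49, so a model that fits the subtraining set at the expense of the calibration set is ranked lower.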
For each attribute SA, the start value of CW(SA) is set to 1 ± 0.01*random, where random is a generator of random values in the range (0, 1). The regular order of attribute numbers (i.e., 1, 2, 3, 4, 5,…) is replaced by a random sequence (e.g., 3, 1, 5, 2, 4,...). A starting value of the target function (TF1) is calculated. Following the generated random sequence, each attribute correlation weight CWi is modified with the algorithm:
  1. DCWi:=0.5*CWi; Eps:=0.1*DCWi;
  2. Calculation of TF1; CWi:=CWi + DCWi;
  3. Calculation of TF2, after modifying CWi;
  4. If TF2 > TF1 then TF1:=TF2; go to 2;
  5. CWi:=CWi - DCWi;
  6. DCWi:= −0.5*DCWi;
  7. If absolute value (DCWi) > Eps then go to 2.
Steps 1–7 are then carried out for all CWs; one such pass is an epoch of the optimization. The optimal number of epochs was established by computational experiment (Table 3). This number is 10 (Figure 2).
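The seven steps and the epoch loop can be sketched as follows. This is an illustration under assumptions, not the authors' program: the target function is passed in as a callable `tf`, Eps is taken as an absolute value so that the stopping rule also works for negative weights, and a cap on the number of moves (`max_moves`) is added as a safeguard. Both of these details are not in the source.

```python
import random


def optimize_weight(weights, i, tf, max_moves=1000):
    """Steps 1-7 for one correlation weight; returns the best TF found."""
    step = 0.5 * weights[i]            # step 1: DCWi
    eps = abs(0.1 * step)              # assumption: Eps kept non-negative
    best = tf(weights)                 # TF1
    for _ in range(max_moves):         # assumption: cap on the number of moves
        if abs(step) <= eps:           # step 7: stop once the step is small
            break
        previous = weights[i]
        weights[i] = previous + step   # step 2: move the weight
        trial = tf(weights)            # step 3: TF2
        if trial > best:               # step 4: keep the move, try it again
            best = trial
        else:
            weights[i] = previous      # step 5: undo the move
            step *= -0.5               # step 6: reverse and halve the step
    return best


def optimize(n_attributes, tf, n_epochs=10, seed=0):
    """Run epochs in which every weight is visited once in random order."""
    rng = random.Random(seed)
    # Start values of 1 +/- 0.01*random, as described in the text.
    weights = [1 + rng.choice((-1, 1)) * 0.01 * rng.random()
               for _ in range(n_attributes)]
    order = list(range(n_attributes))
    best = tf(weights)
    for _ in range(n_epochs):
        rng.shuffle(order)
        for i in order:
            best = optimize_weight(weights, i, tf)
    return weights, best
```

Because a move is kept only when it raises the target function, the value returned by `optimize` can never be lower than the value at the start weights.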

3. Results

Computational experiments (Figure 3, Table 4) have shown that the [Subtraining-Calibration-Test] system gives preferable results in comparison with the [Training-Test] system for all three splits. Thus, the correlation balance (i.e., the [Subtraining-Calibration-Test] system) improves the QSAR model of log(TD50). It is the second successful experiment using the correlation balance for QSAR analyses [11].
A useful characteristic of these models is W%=N111/Nact, where N111 is the number of non-blocked attributes that occur in the subtraining, calibration, and test sets, and Nact is the total number of attributes that are not blocked for a given limS. There is a correlation between W% and the determination coefficient for the test set (Figure 4, Table 4). One can see from the results that good prediction occurs if W% is higher than 80 (except for [Subtraining-Calibration-Test] for Split3: in this case W%=78).
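Since W% is quoted on a 0–100 scale, the ratio N111/Nact is read here as a percentage. The sketch below makes that assumption explicit; the arguments are assumed to be the sets of attributes found in each chemical set, and the function name is hypothetical.

```python
def w_percent(active, subtraining, calibration, test):
    """Share (in percent) of active attributes present in all three sets."""
    shared = set(active) & set(subtraining) & set(calibration) & set(test)
    return 100.0 * len(shared) / len(active)
```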
The model obtained in the first probe of the Monte Carlo optimization for Split1 with limS=4 is as follows:
$$\log(\mathrm{TD}_{50}) = -0.5981\,(\pm 0.0074) + 0.1118\,(\pm 0.0004) \times \mathrm{DCW}(4) \qquad (3)$$
  • n=165, r²=0.7622, s=0.685, F=522 (subtraining set)
  • n=167, r²=0.7620, s=0.734, F=528 (calibration set)
  • n=61, r²=0.7541, s=0.682, F=181 (test set)
  • Y-scrambling [19,20] for the test set (Nshifting=300 [20]) gave r²scrambling=0.0996
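The model and its F values can be checked numerically with the sketch below. The negative intercept is taken from the values tabulated in the Supplementary Materials (for example, DCW(4) = −1.6442255 corresponds to a calculated value of −0.782), and the F formula is the standard one for a linear model with a single descriptor, which is an assumption about how F was obtained. Function names are illustrative.

```python
def predict_log_td50(dcw, intercept=-0.5981, slope=0.1118):
    """One-descriptor linear model for Split1, limS=4."""
    return intercept + slope * dcw


def f_statistic(r2, n):
    """F for a linear model with one descriptor: r2 * (n - 2) / (1 - r2)."""
    return r2 * (n - 2) / (1.0 - r2)
```

With these definitions, `round(f_statistic(0.7622, 165))` gives 522, in agreement with the value listed for the subtraining set.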
Figure 5 shows the model calculated with Equation 3 graphically. The Supplementary Materials contain numerical data on the experimental values and the values calculated with Equation 3 (Split1 with limS=4). Table 5 contains numerical data on the correlation weights of SMILES attributes obtained in three probes of the Monte Carlo optimization.

4. Discussion

One can see that the statistical characteristics of this model are reasonably good. As additional validation we have calculated the Y-scrambling criterion by randomly shifting the carcinogenicity values [16,17]. If after the shifting (300 exchanges, as recommended in Ref. [17]) the correlation coefficient is less than 0.2, the correlation of the model can be classified as not being a chance correlation. Thus, the Y-scrambling has shown that Equation 3 gives a robust prediction (not a chance correlation) for the test set.
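A Y-scrambling check of this kind can be sketched as follows. The exchange scheme (300 random pairwise swaps of endpoint values) is an interpretation of the "shifting" described above, and the function names and the seed argument are assumptions.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def y_scrambled_r2(x, y, n_exchanges=300, seed=0):
    """Swap random pairs of y values, then recompute r2 against x."""
    rng = random.Random(seed)
    scrambled = list(y)
    for _ in range(n_exchanges):
        i = rng.randrange(len(scrambled))
        j = rng.randrange(len(scrambled))
        scrambled[i], scrambled[j] = scrambled[j], scrambled[i]
    return r_squared(x, scrambled), scrambled
```

A scrambled value far below the r² of the unscrambled data indicates that the original correlation is unlikely to be due to chance.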
In our previous study we examined different equations for the carcinogenicity model, but only one split into subtraining, calibration, and test sets [15]. Examination of three splits indicates that good results occur for all three (Table 4). Thus, we expect that the present model is more robust, also considering the Y-scrambling test.
One can see from Table 5 that there are three categories of SMILES attributes: category 1 is the set of SMILES attributes with correlation weights greater than zero in all three probes of the Monte Carlo optimization; category 2 is the set of SMILES attributes with correlation weights less than zero in all three probes; category 3 is the set of SMILES attributes with inconsistent values, i.e., with both positive and negative correlation weights across the three probes. We can say that category 1 contains promoters of logTD50 increase, category 2 contains promoters of logTD50 decrease, and category 3 contains attributes with an unclear influence on logTD50.
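The three categories amount to a sign test over the probes, which can be sketched as below. The function name and the returned labels are illustrative; a weight of exactly zero is treated here as "unclear", which is an assumption.

```python
def classify_attribute(weights_by_probe):
    """Assign an attribute to a category from its weights in each probe."""
    if all(w > 0 for w in weights_by_probe):
        return "increase"   # category 1: promoter of logTD50 increase
    if all(w < 0 for w in weights_by_probe):
        return "decrease"   # category 2: promoter of logTD50 decrease
    return "unclear"        # category 3: inconsistent sign across probes
```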
The !-02, #, Cl, S, [N+], and [O−] SMILES elements are promoters of logTD50 increase, and thus of carcinogenicity. However, it is necessary to take into account the value of the correlation weight as well as the number of occurrences of the given attribute in the subtraining set. Taking this into account, one can see that the strongest promoters of logTD50 increase are Cl (the number of Cl in the subtraining set is 61, and the range of its correlation weights in three probes is 2.19–3.19) and [O−] (the number of [O−] in the subtraining set is 26, and the range of its correlation weights in three probes is 5.92–6.96).
A similar analysis can be done for the promoters of logTD50 decrease. For instance, the number of brackets ‘(’ in the subtraining set is 708 and the range of correlation weights of the bracket is from −1.366 to −1.686; the number of ‘=’ in the subtraining set is 77 and the range of its correlation weights is from −1.866 to 2.144. Table 6 contains examples of compounds that contain the mentioned SMILES attributes. Thus, the analysis of the correlation weights of SMILES attributes can help in searching for agents of the carcinogenicity phenomenon.
An important feature of our model is that SMILES attributes are used to predict quantitative values, and not only as a tool for binary classification (carcinogenic or not). Our model, which provides continuous values, can be used for risk assessment calculations, where a dose is necessary.
The applicability domain for these models can be defined from a probabilistic point of view: one can estimate the carcinogenic potential of a compound if the SMILES of this compound does not contain rare SMILES attributes. A stronger definition of the applicability domain can be formulated by taking into account the roles of the attributes (as promoters of logTD50 increase or decrease): one can estimate the carcinogenic potential of a compound if its SMILES contains solely apparent promoters of logTD50 increase and/or decrease (without SMILES attributes with an unclear role).
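Both definitions of the applicability domain reduce to set checks, as in the following sketch. The argument names are hypothetical: `active` is the set of non-rare attributes for the chosen limS, and `unclear` is the set of attributes whose correlation weights change sign between probes.

```python
def in_applicability_domain(attributes, active, unclear=frozenset(), strict=False):
    """Return True if a compound's SMILES attributes fall inside the domain."""
    attributes = set(attributes)
    if not attributes <= set(active):
        return False                  # contains a rare (blocked) attribute
    if strict and attributes & set(unclear):
        return False                  # stronger definition: no unclear roles
    return True
```

The `strict=True` variant corresponds to the stronger definition and accepts fewer compounds than the basic one.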

5. Conclusions

  • Optimal descriptors calculated by the Monte Carlo method can provide reasonable prediction of the carcinogenicity log(TD50).
  • Blocking of rare SMILES attributes can improve the statistical quality of the prediction. Splits into subtraining, calibration, and test sets, as well as splits into training and test sets, influence the statistical characteristics of the models. In our case, these characteristics are similar for the three splits examined in this study.
  • The correlation balance, i.e., the [Subtraining-Calibration-Test] system, gave models that are better than the models obtained with the more traditional [Training-Test] system.

Supplementary Materials

Table 1. Three splits into subtraining, calibration, and test sets, which were studied.
No. | CAS No (Split1) | CAS No (Split2) | CAS No (Split3)
Subtraining set
1 | 75-07-0 | 75-07-0 | 75-07-0
2 | 60-35-5 | 60-35-5 | 60-35-5
3 | 34627-78-6 | 53-96-3 | 53-96-3
4 | 4075-79-0 | 7008-42-6 | 7008-42-6
5 | 53-96-3 | 79-06-1 | 79-06-1
6 | 79-06-1 | 3688-53-7 | 107-13-1
7 | 107-13-1 | 81-49-2 | 3688-53-7
8 | 3688-53-7 | 3775-55-1 | 81-49-2
9 | 81-49-2 | 99-57-0 | 3775-55-1
10 | 3775-55-1 | 117-79-3 | 99-57-0
11 | 712-68-5 | 97-56-3 | 121-88-0
12 | 99-57-0 | 10589-74-9 | 117-79-3
13 | 121-88-0 | 140-57-8 | 2432-99-7
14 | 117-79-3 | 1912-24-9 | 10589-74-9
15 | 60142-96-3 | 115-02-6 | 115-02-6
16 | 2432-99-7 | 17967-53-9 | 17967-53-9
17 | 10589-74-9 | 50-32-8 | 71-43-2
18 | 17967-53-9 | 3296-90-0 | 92-87-5
19 | 30516-87-1 | 542-88-1 | 50-32-8
20 | 71-43-2 | 2475-45-8 | 14504-15-5
21 | 92-87-5 | 75-27-4 | 2475-45-8
22 | 50-32-8 | 51333-22-3 | 74-96-4
23 | 14504-15-5 | 3068-88-0 | 3068-88-0
24 | 3296-90-0 | 63-25-2 | 63-25-2
25 | 85-68-7 | 56-23-5 | 56-23-5
26 | 3068-88-0 | 120-80-9 | 60391-92-6
27 | 331-39-5 | 305-03-3 | 305-03-3
28 | 63-25-2 | 77439-76-0 | 37087-94-8
29 | 56-23-5 | 37087-94-8 | 5131-60-2
30 | 305-03-3 | 95-83-0 | 75-88-7
31 | 37087-94-8 | 150-68-5 | 50892-23-4
32 | 75-88-7 | 10473-70-8 | 108-90-7
33 | 50892-23-4 | 1897-45-6 | 107-30-2
34 | 65089-17-0 | 102-50-1 | 150-68-5
35 | 108-90-7 | 80-08-0 | 126-99-8
36 | 107-30-2 | 50-29-3 | 1897-45-6
37 | 150-68-5 | 53-43-0 | 102-50-1
38 | 126-99-8 | 853-23-6 | 120-71-8
39 | 1897-45-6 | 63019-65-8 | 80-08-0
40 | 102-50-1 | 16338-97-9 | 853-23-6
41 | 120-71-8 | 720-69-4 | 16338-97-9
42 | 1163-19-5 | 95-80-7 | 720-69-4
43 | 853-23-6 | 96-12-8 | 96-12-8
44 | 16338-97-9 | 10318-26-0 | 10318-26-0
45 | 720-69-4 | 106-93-4 | 106-93-4
46 | 4106-66-5 | 1717-00-6 | 106-46-7
47 | 96-12-8 | 107-06-2 | 107-06-2
48 | 10318-26-0 | 62-73-7 | 101-90-6
49 | 106-93-4 | 56-53-1 | 3276-41-3
50 | 7572-29-4 | 101-90-6 | 119-84-6
51 | 106-46-7 | 5803-51-0 | 5803-51-0
52 | 105-55-5 | 59-35-8 | 91-93-0
53 | 3276-41-3 | 55738-54-0 | 60-11-7
54 | 91-93-0 | 121-69-7 | 59-35-8
55 | 4164-28-7 | 26049-69-4 | 513-37-1
56 | 513-37-1 | 513-37-1 | 106-89-8
57 | 106-89-8 | 106-89-8 | 150-69-6
58 | 150-69-6 | 140-88-5 | 16301-26-1
59 | 16301-26-1 | 64-17-5 | 57497-29-7
60 | 75-21-8 | 16301-26-1 | 75-21-8
61 | 117-81-7 | 57497-29-7 | 86386-73-4
62 | 110559-84-7 | 75-21-8 | 69112-98-7
63 | 86386-73-4 | 96724-44-6 | 110-00-9
64 | 69112-98-7 | 86386-73-4 | 67730-11-4
65 | 93957-54-1 | 363-17-7 | 56-40-6
66 | 98-01-1 | 3570-75-0 | 87-68-3
67 | 56-40-6 | 110-00-9 | 319-84-6
68 | 319-84-6 | 98-01-1 | 67-72-1
69 | 67-72-1 | 67730-11-4 | 26049-70-7
70 | 18774-85-1 | 56-40-6 | 122-66-7
71 | 26049-70-7 | 87-68-3 | 53-95-2
72 | 122-66-7 | 67-72-1 | 129-43-1
73 | 53-95-2 | 680-31-9 | 96724-45-7
74 | 129-43-1 | 26049-70-7 | 13743-07-2
75 | 96724-45-7 | 53-95-2 | 71752-70-0
76 | 71752-70-0 | 84545-30-2 | 100643-96-7
77 | 100643-96-7 | 100643-96-7 | 76180-96-6
78 | 76180-96-6 | 76180-96-6 | 115-11-7
79 | 115-11-7 | 15503-86-3 | 542-56-3
80 | 542-56-3 | 115-11-7 | 54-85-3
81 | 303-34-4 | 542-56-3 | 303-34-4
82 | 76956-02-0 | 54-85-3 | 108-78-1
83 | 148-82-3 | 303-34-4 | 148-82-3
84 | 149-30-4 | 76956-02-0 | 149-30-4
85 | 5834-17-3 | 108-78-1 | 934-00-9
86 | 934-00-9 | 148-82-3 | 298-81-7
87 | 298-81-7 | 60-56-0 | 598-55-0
88 | 598-55-0 | 5834-17-3 | 55-80-1
89 | 21638-36-8 | 298-81-7 | 21638-36-8
90 | 63412-06-6 | 1634-04-4 | 63412-06-6
91 | 598-57-2 | 21340-68-1 | 14026-03-0
92 | 33868-17-6 | 21638-36-8 | 598-57-2
93 | 443-48-1 | 63412-06-6 | 76014-81-8
94 | 39801-14-4 | 14026-03-0 | 64091-91-4
95 | 50-07-7 | 76014-81-8 | 90-94-8
96 | 3771-19-5 | 64091-91-4 | 2385-85-5
97 | 2243-62-1 | 90-94-8 | 39801-14-4
98 | 139-94-6 | 39801-14-4 | 50-07-7
99 | 99-59-2 | 50-07-7 | 58139-48-3
100 | 2122-86-3 | 58139-48-3 | 2243-62-1
101 | 2578-75-8 | 389-08-2 | 139-94-6
102 | 53757-28-1 | 2243-62-1 | 99-59-2
103 | 24554-26-5 | 91-59-8 | 91-23-6
104 | 600-24-8 | 139-94-6 | 600-24-8
105 | 1836-75-5 | 99-59-2 | 1836-75-5
106 | 607-57-8 | 59-87-0 | 607-57-8
107 | 75-52-5 | 75198-31-1 | 555-84-0
108 | 38777-13-8 | 36133-88-7 | 38777-13-8
109 | 83335-32-4 | 4812-22-0 | 83335-32-4
110 | 89911-78-4 | 555-84-0 | 89911-79-5
111 | 96806-35-8 | 51-75-2 | 89911-78-4
112 | 56222-35-6 | 38777-13-8 | 96806-35-8
113 | 760-60-1 | 83335-32-4 | 760-60-1
114 | 937-25-7 | 89911-78-4 | 937-25-7
115 | 75881-22-0 | 96806-35-8 | 13256-11-6
116 | 38347-74-9 | 760-60-1 | 75881-22-0
117 | 64005-62-5 | 937-25-7 | 38347-74-9
118 | 1133-64-8 | 13256-11-6 | 91308-70-2
119 | 51542-33-7 | 38347-74-9 | 1133-64-8
120 | 60599-38-4 | 1133-64-8 | 60599-38-4
121 | 62-75-9 | 55-18-5 | 62-75-9
122 | 156-10-5 | 62-75-9 | 156-10-5
123 | 10595-95-6 | 156-10-5 | 20917-49-1
124 | 20917-49-1 | 42579-28-2 | 42579-28-2
125 | 42579-28-2 | 86451-37-8 | 86451-37-8
126 | 86451-37-8 | 70415-59-7 | 70415-59-7
127 | 26921-68-6 | 16219-98-0 | 55984-51-5
128 | 70415-59-7 | 59-89-2 | 16219-98-0
129 | 16219-98-0 | 5632-47-3 | 614-00-6
130 | 614-00-6 | 930-55-2 | 59-89-2
131 | 59-89-2 | 81795-07-5 | 5632-47-3
132 | 26541-51-5 | 3096-50-2 | 100-75-4
133 | 611-23-4 | 101-80-4 | 930-55-2
134 | 303-47-9 | 60102-37-6 | 26541-51-5
135 | 3096-50-2 | 62-44-2 | 611-23-4
136 | 60102-37-6 | 60-80-0 | 303-47-9
137 | 62-44-2 | 77-09-8 | 3096-50-2
138 | 77-09-8 | 7227-91-0 | 77-09-8
139 | 7227-91-0 | 842-07-9 | 7227-91-0
140 | 90-43-7 | 50-33-9 | 50-33-9
141 | 51-03-6 | 122-60-1 | 90-43-7
142 | 29069-24-7 | 51-03-6 | 51-03-6
143 | 50-24-8 | 1955-45-9 | 1955-45-9
144 | 671-16-9 | 29069-24-7 | 29069-24-7
145 | 1120-71-4 | 816-57-9 | 57-57-8
146 | 57-57-8 | 75-56-9 | 13010-07-6
147 | 13010-07-6 | 599-79-1 | 81-54-9
148 | 51-52-5 | 2318-18-5 | 2425-85-6
149 | 2425-85-6 | 10048-13-2 | 480-54-6
150 | 480-54-6 | 18883-66-4 | 2318-18-5
151 | 94-59-7 | 96-09-3 | 10048-13-2
152 | 2318-18-5 | 95-06-7 | 18883-66-4
153 | 10048-13-2 | 23031-25-6 | 95-06-7
154 | 18883-66-4 | 127-18-4 | 116-14-3
155 | 96-09-3 | 116-14-3 | 109-99-9
156 | 95-06-7 | 509-14-8 | 509-14-8
157 | 127-18-4 | 139-65-1 | 52-24-4
158 | 109-99-9 | 62-56-6 | 139-65-1
159 | 62-56-6 | 68-76-8 | 88-19-7
160 | 88-19-7 | 538-23-8 | 68-76-8
161 | 68-76-8 | 88-06-2 | 76-25-5
162 | 76-25-5 | 96-18-4 | 75-25-2
163 | 75-25-2 | 2489-77-2 | 137-17-7
164 | 51-79-6 | 51-79-6 | 51-79-6
165 | 88-12-0 | 593-60-2 | 88-12-0
Calibration set
1 | 18523-69-8 | 18523-69-8 | 18523-69-8
2 | 7008-42-6 | 34627-78-6 | 34627-78-6
3 | 2835-39-4 | 4075-79-0 | 4075-79-0
4 | 760-56-5 | 107-13-1 | 760-56-5
5 | 82-28-0 | 1162-65-8 | 82-28-0
6 | 119-34-6 | 760-56-5 | 712-68-5
7 | 121-66-4 | 82-28-0 | 119-34-6
8 | 97-56-3 | 712-68-5 | 121-66-4
9 | 61-82-5 | 119-34-6 | 97-56-3
10 | 115-02-6 | 60142-96-3 | 60142-96-3
11 | 103-33-3 | 61-82-5 | 61-82-5
12 | 88133-11-3 | 25843-45-2 | 1912-24-9
13 | 271-89-6 | 30516-87-1 | 103-33-3
14 | 542-88-1 | 88133-11-3 | 25843-45-2
15 | 2475-45-8 | 71-43-2 | 30516-87-1
16 | 75-27-4 | 92-87-5 | 88133-11-3
17 | 74-96-4 | 271-89-6 | 271-89-6
18 | 51333-22-3 | 14504-15-5 | 3296-90-0
19 | 106-99-0 | 2784-94-3 | 542-88-1
20 | 75-65-0 | 106-99-0 | 2784-94-3
21 | 60391-92-6 | 75-65-0 | 51333-22-3
22 | 115-28-6 | 115-28-6 | 106-99-0
23 | 101-79-1 | 101-79-1 | 75-65-0
24 | 77439-76-0 | 5131-60-2 | 85-68-7
25 | 5131-60-2 | 75-88-7 | 115-28-6
26 | 593-70-4 | 65089-17-0 | 101-79-1
27 | 54749-90-5 | 107-30-2 | 77439-76-0
28 | 52214-84-3 | 126-99-8 | 65089-17-0
29 | 637-07-0 | 52214-84-3 | 593-70-4
30 | 123-73-9 | 637-07-0 | 10473-70-8
31 | 50-18-0 | 120-71-8 | 52214-84-3
32 | 80-08-0 | 123-73-9 | 637-07-0
33 | 50-29-3 | 50-18-0 | 123-73-9
34 | 63019-65-8 | 1163-19-5 | 50-18-0
35 | 95-80-7 | 4106-66-5 | 50-29-3
36 | 56654-52-5 | 56654-52-5 | 1163-19-5
37 | 1717-00-6 | 7572-29-4 | 63019-65-8
38 | 91-94-1 | 106-46-7 | 95-80-7
39 | 107-06-2 | 91-94-1 | 56654-52-5
40 | 62-73-7 | 111-46-6 | 1717-00-6
41 | 685-91-6 | 3276-41-3 | 7572-29-4
42 | 111-46-6 | 119-84-6 | 91-94-1
43 | 56-53-1 | 94-58-6 | 62-73-7
44 | 119-84-6 | 91-93-0 | 685-91-6
45 | 94-58-6 | 65176-75-2 | 111-46-6
46 | 5803-51-0 | 60-11-7 | 56-53-1
47 | 65176-75-2 | 551-92-8 | 94-58-6
48 | 60-11-7 | 123-91-1 | 65176-75-2
49 | 59-35-8 | 57-63-6 | 551-92-8
50 | 551-92-8 | 150-69-6 | 26049-69-4
51 | 26049-69-4 | 100-41-4 | 123-91-1
52 | 123-91-1 | 96-45-7 | 13256-06-9
53 | 13256-06-9 | 117-81-7 | 57-63-6
54 | 57-63-6 | 110559-84-7 | 140-88-5
55 | 140-88-5 | 38434-77-4 | 64-17-5
56 | 64-17-5 | 69112-98-7 | 100-41-4
57 | 57497-29-7 | 93957-54-1 | 96-45-7
58 | 100-41-4 | 556-52-5 | 117-81-7
59 | 96-45-7 | 517-28-2 | 96724-44-6
60 | 96724-44-6 | 118-74-1 | 110559-84-7
61 | 38434-77-4 | 319-84-6 | 38434-77-4
62 | 363-17-7 | 122-66-7 | 363-17-7
63 | 110-00-9 | 306-83-2 | 93957-54-1
64 | 67730-11-4 | 129-43-1 | 3570-75-0
65 | 556-52-5 | 33389-36-5 | 556-52-5
66 | 517-28-2 | 71752-70-0 | 517-28-2
67 | 118-74-1 | 5208-87-7 | 118-74-1
68 | 87-68-3 | 21416-87-5 | 680-31-9
69 | 680-31-9 | 53-86-1 | 26049-68-3
70 | 26049-68-3 | 86315-52-8 | 306-83-2
71 | 306-83-2 | 78-59-1 | 33389-36-5
72 | 13743-07-2 | 3778-73-2 | 5208-87-7
73 | 33389-36-5 | 143-50-0 | 21416-87-5
74 | 5208-87-7 | 5989-27-5 | 84545-30-2
75 | 84545-30-2 | 77500-04-0 | 53-86-1
76 | 53-86-1 | 149-30-4 | 15503-86-3
77 | 15503-86-3 | 57-39-6 | 86315-52-8
78 | 86315-52-8 | 934-00-9 | 78-59-1
79 | 54-85-3 | 150-76-5 | 3778-73-2
80 | 78-59-1 | 598-55-0 | 143-50-0
81 | 3778-73-2 | 55-80-1 | 5989-27-5
82 | 143-50-0 | 70-25-7 | 76956-02-0
83 | 5989-27-5 | 129-15-7 | 57-39-6
84 | 108-78-1 | 63642-17-1 | 60-56-0
85 | 57-39-6 | 452-86-8 | 150-76-5
86 | 60-56-0 | 56-49-5 | 1634-04-4
87 | 150-76-5 | 101-14-4 | 70-25-7
88 | 1634-04-4 | 838-88-0 | 129-15-7
89 | 21340-68-1 | 598-57-2 | 63642-17-1
90 | 70-25-7 | 33868-17-6 | 98-85-1
91 | 63642-17-1 | 443-48-1 | 452-86-8
92 | 98-85-1 | 3771-19-5 | 56-49-5
93 | 452-86-8 | 139-13-9 | 101-14-4
94 | 56-49-5 | 2578-75-8 | 838-88-0
95 | 101-14-4 | 531-82-8 | 33868-17-6
96 | 838-88-0 | 24554-26-5 | 443-48-1
97 | 101-61-1 | 91-23-6 | 315-22-0
98 | 76014-81-8 | 98-95-3 | 3771-19-5
99 | 64091-91-4 | 600-24-8 | 389-08-2
100 | 2385-85-5 | 1836-75-5 | 59-87-0
101 | 315-22-0 | 607-57-8 | 75198-31-1
102 | 58139-48-3 | 67-20-9 | 2122-86-3
103 | 389-08-2 | 75-52-5 | 36133-88-7
104 | 91-59-8 | 551-88-2 | 2578-75-8
105 | 139-13-9 | 5522-43-0 | 24554-26-5
106 | 59-87-0 | 607-35-2 | 4812-22-0
107 | 75198-31-1 | 16813-36-8 | 602-87-9
108 | 36133-88-7 | 89911-79-5 | 98-95-3
109 | 4812-22-0 | 92177-50-9 | 67-20-9
110 | 602-87-9 | 56222-35-6 | 51-75-2
111 | 91-23-6 | 55090-44-3 | 75-52-5
112 | 98-95-3 | 75881-20-8 | 551-88-2
113 | 67-20-9 | 75881-22-0 | 5522-43-0
114 | 555-84-0 | 684-93-5 | 607-35-2
115 | 51-75-2 | 55556-92-8 | 16813-36-8
116 | 551-88-2 | 82018-90-4 | 92177-50-9
117 | 607-35-2 | 75881-18-4 | 75896-33-2
118 | 16813-36-8 | 91308-70-2 | 56222-35-6
119 | 89911-79-5 | 91308-69-9 | 55090-44-3
120 | 92177-50-9 | 51542-33-7 | 75881-20-8
121 | 96806-34-7 | 60599-38-4 | 684-93-5
122 | 55090-44-3 | 924-16-3 | 55556-92-8
123 | 13256-11-6 | 1116-54-7 | 82018-90-4
124 | 684-93-5 | 621-64-7 | 75881-18-4
125 | 92177-49-6 | 10595-95-6 | 91308-69-9
126 | 55556-92-8 | 614-95-9 | 64005-62-5
127 | 82018-90-4 | 20917-49-1 | 51542-33-7
128 | 75881-18-4 | 26921-68-6 | 1116-54-7
129 | 91308-70-2 | 55984-51-5 | 55-18-5
130 | 91308-69-9 | 614-00-6 | 621-64-7
131 | 1116-54-7 | 68107-26-6 | 10595-95-6
132 | 55-18-5 | 78246-24-9 | 26921-68-6
133 | 621-64-7 | 303-47-9 | 78246-24-9
134 | 55984-51-5 | 14698-29-4 | 14698-29-4
135 | 68107-26-6 | 13752-51-7 | 101-80-4
136 | 78246-24-9 | 1825-21-4 | 13752-51-7
137 | 5632-47-3 | 50-24-8 | 60102-37-6
138 | 14698-29-4 | 671-16-9 | 62-44-2
139 | 101-80-4 | 1120-71-4 | 842-07-9
140 | 13752-51-7 | 57-57-8 | 122-60-1
141 | 1825-21-4 | 13010-07-6 | 50-24-8
142 | 842-07-9 | 51-52-5 | 671-16-9
143 | 50-33-9 | 81-54-9 | 1120-71-4
144 | 122-60-1 | 2425-85-6 | 816-57-9
145 | 1955-45-9 | 127-47-9 | 51-52-5
146 | 816-57-9 | 480-54-6 | 127-47-9
147 | 81-54-9 | 18559-94-9 | 18559-94-9
148 | 127-47-9 | 533-31-3 | 533-31-3
149 | 18559-94-9 | 77-46-3 | 96-09-3
150 | 599-79-1 | 811-97-2 | 77-46-3
151 | 533-31-3 | 40548-68-3 | 127-18-4
152 | 77-46-3 | 109-99-9 | 811-97-2
153 | 23031-25-6 | 52-24-4 | 40548-68-3
154 | 116-14-3 | 62-55-5 | 62-55-5
155 | 40548-68-3 | 789-61-7 | 789-61-7
156 | 509-14-8 | 141-90-2 | 141-90-2
157 | 52-24-4 | 88-19-7 | 62-56-6
158 | 62-55-5 | 76-25-5 | 88-06-2
159 | 789-61-7 | 75-25-2 | 42011-48-3
160 | 141-90-2 | 137-17-7 | 95-63-6
161 | 137-17-7 | 95-63-6 | 2489-77-2
162 | 95-63-6 | 55-63-0 | 55-63-0
163 | 55-63-0 | 126-72-7 | 126-72-7
164 | 126-72-7 | 66-22-8 | 66-22-8
165 | 108-05-4 | 108-05-4 | 108-05-4
166 | 75-02-5 | 75-02-5 | 75-02-5
167 | 2832-40-8 | 2832-40-8 | 2832-40-8
Test set
1 | 29611-03-8 | 29611-03-8 | 29611-03-8
2 | 1162-65-8 | 57-06-7 | 1162-65-8
3 | 57-06-7 | 2835-39-4 | 57-06-7
4 | 38514-71-5 | 38514-71-5 | 2835-39-4
5 | 140-57-8 | 121-88-0 | 38514-71-5
6 | 1912-24-9 | 121-66-4 | 140-57-8
7 | 25843-45-2 | 2432-99-7 | 33372-39-3
8 | 33372-39-3 | 103-33-3 | 75-27-4
9 | 2784-94-3 | 33372-39-3 | 869-01-2
10 | 869-01-2 | 74-96-4 | 331-39-5
11 | 120-80-9 | 85-68-7 | 120-80-9
12 | 95-83-0 | 869-01-2 | 95-83-0
13 | 10473-70-8 | 331-39-5 | 54749-90-5
14 | 117-10-2 | 60391-92-6 | 117-10-2
15 | 1192-28-5 | 50892-23-4 | 1192-28-5
16 | 53-43-0 | 108-90-7 | 53-43-0
17 | 79-43-6 | 593-70-4 | 4106-66-5
18 | 101-90-6 | 54749-90-5 | 79-43-6
19 | 55738-54-0 | 117-10-2 | 105-55-5
20 | 121-69-7 | 1192-28-5 | 55738-54-0
21 | 106-88-7 | 79-43-6 | 121-69-7
22 | 13073-35-3 | 685-91-6 | 4164-28-7
23 | 398-32-3 | 105-55-5 | 106-88-7
24 | 32852-21-4 | 4164-28-7 | 13073-35-3
25 | 3570-75-0 | 13256-06-9 | 398-32-3
26 | 67730-10-3 | 106-88-7 | 32852-21-4
27 | 26049-71-8 | 13073-35-3 | 98-01-1
28 | 21416-87-5 | 398-32-3 | 67730-10-3
29 | 77500-04-0 | 32852-21-4 | 18774-85-1
30 | 55-80-1 | 67730-10-3 | 26049-71-8
31 | 129-15-7 | 18774-85-1 | 77500-04-0
32 | 14026-03-0 | 26049-71-8 | 5834-17-3
33 | 90-94-8 | 26049-68-3 | 21340-68-1
34 | 531-82-8 | 96724-45-7 | 101-61-1
35 | 51325-35-0 | 13743-07-2 | 91-59-8
36 | 62-23-7 | 98-85-1 | 139-13-9
37 | 5522-43-0 | 101-61-1 | 53757-28-1
38 | 75896-33-2 | 2385-85-5 | 531-82-8
39 | 75881-20-8 | 315-22-0 | 51325-35-0
40 | 88208-16-6 | 2122-86-3 | 62-23-7
41 | 91308-71-3 | 53757-28-1 | 96806-34-7
42 | 53609-64-6 | 51325-35-0 | 92177-49-6
43 | 924-16-3 | 602-87-9 | 88208-16-6
44 | 40580-89-0 | 62-23-7 | 91308-71-3
45 | 614-95-9 | 96806-34-7 | 53609-64-6
46 | 100-75-4 | 75896-33-2 | 924-16-3
47 | 930-55-2 | 92177-49-6 | 40580-89-0
48 | 81795-07-5 | 88208-16-6 | 614-95-9
49 | 60-80-0 | 91308-71-3 | 68107-26-6
50 | 75-56-9 | 64005-62-5 | 81795-07-5
51 | 22571-95-5 | 53609-64-6 | 1825-21-4
52 | 811-97-2 | 40580-89-0 | 60-80-0
53 | 139-65-1 | 100-75-4 | 75-56-9
54 | 538-23-8 | 26541-51-5 | 94-59-7
55 | 88-06-2 | 611-23-4 | 599-79-1
56 | 96-18-4 | 90-43-7 | 22571-95-5
57 | 42011-48-3 | 94-59-7 | 23031-25-6
58 | 2489-77-2 | 22571-95-5 | 538-23-8
59 | 66-22-8 | 42011-48-3 | 96-18-4
60 | 593-60-2 | 75-01-4 | 593-60-2
61 | 75-01-4 | 88-12-0 | 75-01-4
Table 2. Experimental and calculated with Eq. 3 log(TD50): split1, limS=4, first probe of the Monte Carlo method optimization.
CAS NoSMILESDCW(4)ExprCalc
Subtraining set
75-07-0CC=O−1.6442255−0.541−0.782
60-35-5CC(N)=O2.4339941−0.484−0.326
34627-78-6CC(=O)OC(C=C)c1ccc2OCOc2c18.97234290.9450.405
4075-79-0O=C(C)Nc1ccc(cc1)c2ccccc216.82548902.2531.283
53-96-3CC(=O)NC1C=CC2=C3C=CC=CC3=CC2=C123.70419672.2632.052
79-06-1C=CC(N)=O6.15533071.2780.090
107-13-1C=CC#N0.43636470.497−0.549
3688-53-7O=[N+]([O−])c2ccc(/C=C(\c1ccco1)C(N)=O)o219.72191160.9261.607
81-49-2O=C2c1ccccc1C(=O)c3c2c(N)c(Br)cc3Br10.60827850.9180.588
3775-55-1Nc1nnc(o1)c2oc(cc2)[N+]([O−])=O19.52860091.7281.585
712-68-5Nc1nnc(s1)c2oc(cc2)[N+]([O−])=O21.47650442.5061.803
99-57-0Nc1cc(ccc1O)[N+]([O−])=O8.7582849−0.7360.381
121-88-0Nc1ccc(cc1O)[N+]([O−])=O8.75828490.1430.381
117-79-3Nc2ccc3C(=O)c1ccccc1C(=O)c3c210.42678800.3440.568
60142-96-3NCC1(CC(=O)O)CCCCC1−3.0738151−1.533−0.942
2432-99-7O=C(O)CCCCCCCCCCN−2.3822437−0.737−0.864
10589-74-9CCCCCN(N=O)C(N)=O28.99998252.4622.644
17967-53-9CC(C)[N+](\[O−])=N/C(C)C41.16178964.6864.004
30516-87-1CC1=CN(C(=O)NC1=O)C2CC(/N=[N+]=[N−])C(CO)O2−2.1058355−1.637−0.834
71-43-2c1ccccc13.2902364−0.335−0.230
92-87-5Nc1ccc(cc1)c2ccc(N)cc215.61179932.0271.147
50-32-8c1cc2c3ccc4cccc5ccc(cc2cc1)c3c4528.99216742.4212.643
14504-15-5NC(=O)Cc2c([O−])on[n+]2Cc1ccccc14.7178230−0.260−0.071
3296-90-0OCC(CBr)(CBr)CO10.15829690.3730.538
85-68-7O=C(OCc1ccccc1)c2ccccc2C(=O)OCCCC9.6647886−0.5220.482
3068-88-0O=C1CC(C)O17.02290650.7950.187
331-39-5Oc1ccc(/C=C/C(=O)O)cc1O2.7387186−0.217−0.292
63-25-2CNC(=O)Oc2cccc1ccccc1212.84979711.1540.839
56-23-5ClC(Cl)(Cl)Cl12.28695931.8270.776
305-03-3O=C(O)CCCc1ccc(cc1)N(CCCl)CCCl26.85648222.5312.404
37087-94-8CC1CC(C)CN(C1)S(=O)(=O)c2cc(C(=O)O)c(Cl)cc222.76151291.8351.947
75-88-7ClCC(F)(F)F11.17558310.1330.651
50892-23-4Cc2cccc(Nc1cc(Cl)nc(SCC(=O)O)n1)c2C19.08949051.8711.536
65089-17-0Cc2cccc(Nc1cc(Cl)nc(SCC(=O)NCCO)n1)c2C17.88673081.7521.402
108-90-7Clc1ccccc10.8393418−0.341−0.504
107-30-2COCCl21.70588321.1661.829
150-68-5Clc1ccc(NC(=O)N(C)C)cc17.66681780.1810.259
126-99-8C=C(Cl)C=C0.3865032−0.150−0.555
1897-45-6Clc1c(C#N)c(Cl)c(C#N)c(Cl)c1Cl−1.4730550−0.931−0.763
102-50-1Nc1ccc(OC)cc1C3.4796799−0.535−0.209
120-71-8Nc1cc(C)ccc1OC6.77958180.1460.160
1163-19-5Brc2c(Oc1c(Br)c(Br)c(Br)c(Br)c1Br)c(Br)c(Br)c(Br)c2 Br0.8059877−0.542−0.508
853-23-6CC(=O)OC2CCC3(C)C4CCC1(C)C(CCC1=O)C4CC=C 3C223.15435171.0221.991
16338-97-9C=CCN(CC=C)N=O26.57879090.5712.373
720-69-4O=[N+]([O−])c1ccc(o1)c2nc(N)nc(N)n215.51227432.1141.136
4106-66-5Nc1ccc2c3ccccc3oc2c118.80062391.8691.504
96-12-8BrC(CBr)CCl24.26647402.9602.115
10318-26-0OC(C(O)CBr)C(O)C(O)CBr21.07381021.5661.758
106-93-4BrCCBr13.45114852.0920.906
7572-29-4ClC#CCl24.12607541.4232.099
106-46-7Clc1ccc(Cl)cc14.3705653−0.642−0.109
105-55-5CCNC(=S)NCC12.95208770.7410.850
3276-41-3O=NN1CC=CCO117.20550320.1001.325
91-93-0COc1cc(ccc1/N=C=O)c2ccc(\N=C=O)c(OC)c21.4504491−0.740−0.436
4164-28-7CN(C)[N+]([O−])=O20.13484992.2171.653
513-37-1C/C(C)=C\Cl19.46805080.4551.578
106-89-8ClCC1CO115.69273491.4951.156
150-69-6CCOc1ccc(cc1)NC(N)=O4.9437759−0.474−0.045
16301-26-1[O−]\[N+](CC)=N\CC29.85735803.6672.740
75-21-8C1CO14.16779640.316−0.132
117-81-7CCC(CCCC)COC(=O)c1ccccc1C(=O)OCC(CC)CCCC−2.5987356−0.263−0.889
110559-84-7O=C(NCC(C)=O)N(CC)N=O25.18843512.9812.218
86386-73-4OC(Cn1cncn1)(Cn2cncn2)c3ccc(F)cc3F6.03219420.5790.076
69112-98-7NC(=O)N(CCF)N=O24.30333263.0342.119
93957-54-1O=C(O)CC(O)CC(O)/C=C/c2c(c1ccccc1n2C(C)C)c3ccc (F)cc314.69926450.5171.045
98-01-1O=Cc1ccco16.3091859−0.8520.107
56-40-6NCC(=O)O−1.3873663−2.534−0.753
319-84-6ClC1C(Cl)C(Cl)C(Cl)C(Cl)C1Cl18.22696491.4141.440
67-72-1ClC(Cl)(Cl)C(Cl)(Cl)Cl14.72436880.6311.048
18774-85-1CCCCCCN(N=O)C(N)=O28.17221942.5292.552
26049-70-7NNc1nc(cs1)c2ccc(cc2)[N+]([O−])=O20.65207591.8671.711
122-66-7N(Nc1ccccc1)c2ccccc218.12094961.5181.428
53-95-2CC(=O)N(O)C1C=CC2=C3C=CC=CC3=CC2=C123.98968502.3842.084
129-43-1O=C3c1ccccc1C(=O)c2c3cccc2O2.87571630.380−0.277
96724-45-7O=C(NCC)N(N=O)CCO21.76304902.4581.835
71752-70-0O=C(N)N(N=O)CCCO17.26244782.1771.332
100643-96-7O=C2Nc1ccc(cc1C2(C)C)C=3CCC(=O)NN=322.03124692.1071.865
76180-96-6Nc3nc2c(ccc1ncccc12)n3C21.34681332.3881.788
115-11-7C=C(C)C0.3184457−1.801−0.562
542-56-3CC(C)CON=O8.31625830.2800.332
303-34-4CC(C)(O)C(O)(C(C)OC)C(=O)OCC1=CCN2CCC(OC(=O)C(\C)=C\C)C1231.49702063.0242.923
76956-02-0OCc3nc(NCCCOc2cc(CN1CCCCC1)ccc2)n(C)n312.5193481−0.1250.802
148-82-3O=C(O)C(N)Cc1ccc(cc1)N(CCCl)CCCl41.98956543.5124.096
149-30-4S=C1Nc2ccccc2S14.9343382−0.313−0.046
5834-17-3COc1cc2c3ccccc3oc2cc1N15.34246540.8661.117
934-00-9COc1cccc(O)c1O−1.64263530.459−0.782
298-81-7COc1c3occc3cc2C=CC(=O)Oc1212.90077180.8240.844
598-55-0NC(=O)OC2.74759760.123−0.291
21638-36-8O=[N+]([O−])c2ccc(/C=N/N1CC(C)NC1=O)o215.36575071.6491.120
63412-06-6O=C(N(C)N=O)c1ccccc125.36116741.7062.237
598-57-2[O−][N+](=O)CN9.49663390.6410.464
33868-17-6N#CN(C)N=O24.87213772.2492.183
443-48-1Cc1ncc([N+]([O−])=O)n1CCO2.0019404−0.501−0.374
39801-14-4ClC13C5(Cl)C2(Cl)C4C(Cl)(C(Cl)(Cl)C12Cl)C3(Cl)C4(Cl)C5(Cl)Cl23.64052902.5442.045
50-07-7NC(=O)OCC3C=1C(=O)C(N)=C(C)C(=O)C=1N4CC2NC2C34OC46.67932455.5094.621
3771-19-5O=C(O)C(C)(C)Oc1ccc(cc1)C3CCCc2ccccc2315.04737111.4511.084
2243-62-1Nc2cccc1c2cccc1N6.31069250.3570.107
139-94-6O=C(Nc1ncc(s1)[N+]([O−])=O)NCC6.79425960.2180.161
99-59-2Nc1cc(ccc1OC)[N+]([O−])=O15.71943160.4941.159
2122-86-3O=C1NN=C(O1)c2oc(cc2)[N+]([O−])=O17.16165111.3601.321
2578-75-8O=C(C)Nc1nnc(s1)c2ccc(o2)[N+]([O−])=O19.60667021.4591.594
53757-28-1[O−][N+](=O)c1ccc(o1)c2cscn217.51094451.4071.360
24554-26-5O=CNc1nc(cs1)c2ccc(o2)[N+]([O−])=O18.11821611.7501.428
600-24-8CC(CC)[N+]([O−])=O10.7916363−0.4430.608
1836-75-5Clc2cc(Cl)ccc2Oc1ccc(cc1)[N+]([O−])=O5.0065304−0.170−0.038
607-57-8[O−][N+](=O)C1C=CC2=C3C=CC=CC3=CC2=C133.11415902.8703.104
75-52-5[O−][N+](C)=O19.95521210.1791.633
38777-13-8CC(C)Oc1ccccc1OC(=O)N(C)N=O30.11632212.8162.769
83335-32-4FC(F)(F)CCCN(CCCC(F)(F)F)N=O20.60032202.5511.705
89911-78-4O=NN(CCO)CC(O)CO23.82147391.4392.065
96806-35-8O=C(NCCCl)N(N=O)CC(C)O30.01470372.3802.758
56222-35-6CC(O)CN(CCO)N=O22.42271381.1811.909
760-60-1CC(C)CN(N=O)C(=O)N24.42059691.4872.132
937-25-7O=NN(C)c1ccc(F)cc131.53888192.7812.928
75881-22-0CN(CCCCCCCCCC)N=O21.57022252.2011.813
38347-74-9O=C1OCCN1N=O16.75061502.4791.275
64005-62-5O=NN(CCCCC)C(=O)OCC29.81142612.2702.735
1133-64-8O=NN2CCCCC2c1cccnc125.28794391.2062.229
51542-33-7CN(N=O)C(=O)Nc1nc2ccccc2s128.63238032.3202.603
60599-38-4O=C(C)CN(CC(=O)C)N=O28.72338862.5082.613
62-75-9CN(C)N=O28.93494312.8882.637
156-10-5O=Nc2ccc(Nc1ccccc1)cc218.9993753−0.0061.526
10595-95-6CCN(C)N=O32.71683923.2443.060
20917-49-1O=NN1CCCCCCC123.67973663.5752.049
42579-28-2O=C1NC(=O)CN1N=O19.29615240.4691.559
86451-37-8CN(N=O)CC(O)CO21.13886722.3171.765
26921-68-6CN(N=O)CCO25.97295651.9072.306
70415-59-7CN(N=O)CCCO20.25878091.8521.667
16219-98-0O=NN(C)c1ccccn124.98183462.8072.195
614-00-6O=NN(C)c1ccccc126.33162392.9822.346
59-89-2O=NN1CCOCC121.40196493.0281.795
26541-51-5O=NN1CCSCC124.37425481.3902.127
611-23-4Cc1ccccc1N=O10.74149830.3780.603
303-47-9O=C(O)C(Cc1ccccc1)NC(=O)c2cc(Cl)c3CC(C)OC(=O)c3c2O27.39692143.5932.465
3096-50-2CC(=O)Nc2ccc3c1ccccc1C(=O)c3c29.82043731.5850.500
60102-37-6CN1CCC2OC(=O)C3(CC(C)C(C)(O)C(=O)OCC(=CC1)C2=O)OC3C22.85414222.6171.957
62-44-2CCOc1ccc(cc1)NC(C)=O6.9861875−0.8430.183
77-09-8Oc1ccc(cc1)C3(OC(=O)c2ccccc23)c4ccc(O)cc46.3745131−0.4520.115
7227-91-0CN(C)/N=N/c1ccccc115.50757091.8101.136
90-43-7Oc2ccccc2c1ccccc14.7845818−0.134−0.063
51-03-6CCCc1cc2OCOc2cc1COCCOCCOCCCC8.6169629−0.2720.365
29069-24-7ClCCN(CCCl)c1ccc(cc1)CCCC(=O)OCC(=O)C5(O)CCC4C3CCC2=CC(=O)C=CC2(C)C3C(O)CC45C26.47918781.5272.362
50-24-8OCC(=O)C4(O)CCC3C2CCC1=CC(=O)C=CC1(C)C2C(O)CC34C24.19470472.3722.107
671-16-9CC(C)NC(=O)c1ccc(CNNC)cc115.05745551.7421.085
1120-71-4O=S1(=O)CCCO16.05643421.5030.079
57-57-8O=C1CCO110.55714791.6930.582
13010-07-6N/C(=N/[N+]([O−])=O)N(CCC)N=O36.68855482.1263.504
51-52-5S=C1NC(CCC)=CC(=O)N110.20347311.0940.543
2425-85-6[O−][N+](=O)c3cc(C)ccc3N\N=C1\c2ccccc2C=CC1=O−6.2022200−0.581−1.292
480-54-6O=C1OCC3=CCN2CCC(OC(=O)C(/CC(C)C1(O)CO)=C\C)C234.9508078−0.390−0.045
94-59-7C=CCc1ccc2OCOc2c15.7835004−0.4340.048
2318-18-5O=C1OC2CCN(C)CC=C(COC(=O)C(C)(O)C(C)C\C1=C\C)C2=O23.68030072.3322.049
10048-13-2Oc2cccc3Oc1c4C5C=COC5Oc4cc(OC)c1C(=O)c2335.89816103.3293.415
18883-66-4OC1OC(CO)C(O)C(O)C1NC(=O)N(C)N=O39.12853582.4403.776
96-09-3c1ccccc1C2CO27.79201770.3360.273
95-06-7C=C(Cl)CSC(=S)N(CC)CC8.23492520.9330.323
127-18-4Cl/C(Cl)=C(\Cl)Cl15.87073940.2151.176
109-99-9C1CCCO13.2078794−0.752−0.239
62-56-6NC(N)=S11.0889155−0.1120.642
88-19-7Cc1ccccc1S(=O)(N)=O−2.1709891−1.364−0.841
68-76-8O=C1C=C(C(=O)C(=C1N2CC2)N3CC3)N4CC440.65492144.6623.947
76-25-5OCC(=O)C54OC(C)(C)OC5CC3C2CCC1=CC(=O)C=CC1(C)C2(F)C(O)CC34C43.79162853.9144.298
75-25-2BrC(Br)Br5.8279398−0.4090.053
51-79-6NC(=O)OCC5.12552090.334−0.025
88-12-0O=C1CCCN1C=C16.79599850.9671.280
Calibration set
18523-69-8C\C(C)=N\Nc1ncc(s1)c2ccc(o2)[N+]([O−])=O15.79628871.6441.168
7008-42-6CN3c2c(c(cc1OC(C)(C)C=Cc12)OC)C(=O)c4ccccc3423.98045922.8042.083
2835-39-4CC(C)CC(=O)OCC=C4.97357330.063−0.042
760-56-5NC(=O)N(CC=C)N=O21.90121332.5781.850
82-28-0O=C3c1ccccc1C(=O)c2c3ccc(C)c2N17.34503960.6031.341
119-34-6O=[N+]([O−])c1cc(N)ccc1O9.7056304−0.3020.487
121-66-4[O−][N+](=O)c1cnc(N)s110.45596360.5130.571
97-56-3Cc2cc(/N=N/c1ccccc1C)ccc2N14.92142061.7461.070
61-82-5Nc1nncn18.47644660.9270.350
115-02-6N#[N+]\C=C(/[O−])OCC(N)C(=O)O16.69058342.3391.268
103-33-3N(=N/c1ccccc1)\c2ccccc213.66838070.8790.930
88133-11-3Nc1nc(c(CCOCC)c2ncnn12)c3ccccc36.2415469−0.2860.100
271-89-6c1cccc2occc127.3775200−0.5550.227
542-88-1ClCOCCl26.32078014.5072.345
2475-45-8Nc3ccc(N)c2C(=O)c1c(N)ccc(N)c1C(=O)c238.20160020.2350.319
75-27-4BrC(Cl)Cl15.66046460.3541.153
74-96-4BrCC7.1144910−0.1360.197
51333-22-3OCC(=O)C53OC(OC5CC2C1CCC4=CC(=O)C=CC4(C)C1C(O)CC23C)CCC24.62887833.1702.155
106-99-0C=CC=C−2.4847352−0.683−0.876
75-65-0CC(C)(C)O−1.16168020.060−0.728
60391-92-6O=C(N)N(N=O)CC(=O)O22.41203881.5331.908
115-28-6ClC2(Cl)C1(Cl)C(Cl)=C(Cl)C2(Cl)C(C1C(=O)O)C(=O)O10.50262630.9790.576
101-79-1Clc2ccc(Oc1ccc(N)cc1)cc213.56877730.7670.919
77439-76-0ClC=1C(=O)OC(O)C=1C(Cl)Cl24.34233522.5722.123
5131-60-2Nc1ccc(Cl)c(N)c15.3139941−0.344−0.004
593-70-4ClCF10.59599920.3960.587
54749-90-5OC1OC(CO)C(O)C(O)C1NC(=O)N(CCCl)N=O31.98961133.9232.978
52214-84-3ClC2(Cl)CC2c1ccc(OC(C)(C)C(=O)O)cc120.01583902.1231.640
637-07-0Clc1ccc(OC(C)(C)C(=O)OCC)cc13.83496460.157−0.169
123-73-9C\C=C\C=O4.97031231.222−0.042
50-18-0O=P1(NCCCO1)N(CCCl)CCCl19.99309442.0721.637
80-08-0Nc1ccc(cc1)S(=O)(=O)c2ccc(N)cc210.15908641.0450.538
50-29-3Clc1ccc(cc1)C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl9.09433780.6220.419
63019-65-8CC(=O)N(C(C)=O)C2C=CC=C1c3ccccc3C=C1220.41978541.1451.685
95-80-7Nc1cc(N)c(C)cc14.63735991.694−0.080
56654-52-5O=C(NCCCC)N(CCCC)N=O21.94503601.6721.855
1717-00-6CC(Cl)(Cl)F−3.0596285−1.653−0.940
91-94-1Nc1ccc(cc1Cl)c2ccc(N)c(Cl)c213.57273070.9550.919
107-06-2ClCCCl20.70995201.0901.717
62-73-7COP(=O)(OC)O\C=C(\Cl)Cl22.69992041.7251.940
685-91-6CCN(CC)C(C)=O12.04802631.1150.749
111-46-6OCCOCCO−5.3166083−1.194−1.192
56-53-1Oc1ccc(cc1)C(\CC)=C(\CC)c2ccc(O)cc220.15023023.0801.655
119-84-6O=C1CCc2ccccc2O1−2.0947876−1.302−0.832
94-58-6CCCc1ccc2OCOc2c17.05323010.0600.190
5803-51-0COc2ccc(cc2/C=C/c1ccc(N)cc1)OC18.26404472.5491.444
65176-75-2COc5c(OC)cc(O)c2c5Oc1c3C4C=COC4Oc3cc(OC)c1C2=O27.20138513.0242.443
60-11-7CN(C)c2ccc(/N=N/c1ccccc1)cc225.01888991.8332.199
59-35-8O=[N+]([O−])c1ccc(o1)c2nc(C)cc(C)n223.65230302.1982.046
551-92-8O=[N+]([O−])c1cnc(C)n1C11.87990280.9190.730
26049-69-4CN(C)Nc1nc(cs1)c2ccc(o2)[N+]([O−])=O32.08873632.7932.989
123-91-1C1COCCO13.8263120−0.481−0.170
13256-06-9CCCCCN(CCCCC)N=O20.82939921.6651.731
57-63-6Oc3cc4CCC2C(CCC1(C)C2CCC1(O)C#C)c4cc323.60260613.1712.041
140-88-5C=CC(=O)OCC−0.4544704−0.075−0.649
64-17-5CCO−2.0986452−2.296−0.833
57497-29-7[O−]\[N+](CC)=N\C35.52869343.6693.374
100-41-4CCc1ccccc1−0.2763213−1.612−0.629
96-45-7S=C1NCCN115.69875091.0991.157
96724-44-6O=NN(CC)C(=O)NCCO24.01479782.4902.087
38434-77-4N#CN(CC)N=O27.93263121.4302.525
363-17-7FC(F)(F)C(=O)NC1C=CC2=C3C=CC=CC3=CC2=C121.89340392.2331.850
110-00-9c1ccco111.14413382.2350.648
67730-11-4Cc1cccn2c3nc(N)ccc3nc1217.75337341.6261.387
556-52-5OCC1CO18.20281791.2380.319
517-28-2Oc2cc3CC4(O)COc1c(O)c(O)ccc1C4c3cc2O1.6535029−0.520−0.413
118-74-1Clc1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl18.07293501.8681.422
87-68-3Cl/C(Cl)=C(/Cl)\C(\Cl)=C(/Cl)Cl5.08002680.598−0.030
680-31-9CN(C)P(=O)(N(C)C)N(C)C24.83946823.7172.179
26049-68-3NNc1nc(cs1)c2oc(cc2)[N+]([O−])=O20.46856161.8511.690
306-83-2ClC(Cl)C(F)(F)F5.3073777−1.190−0.005
13743-07-2NC(=O)N(N=O)CCO22.30363872.7371.895
33389-36-5O=[N+]([O−])c1ccc(s1)c2nc(NCCO)c3ccccc3n220.51246442.2281.695
5208-87-7C=CC(O)c1ccc2OCOc2c15.61640870.9860.030
84545-30-2FC(F)(F)C\N=C(/N)Nc1ccn(CCCCC(N)=O)n1−1.1528528−0.582−0.727
53-86-1Clc1ccc(cc1)C(=O)n3c2ccc(cc2c(CC(=O)O)c3C)OC20.86469782.4931.735
15503-86-3[O−][N+]13CC=C2COC(=O)[C@@](O)(CO)[C@H](C)C/C(=C\C)C(=O)OC(CC1)C2323.01748702.7101.975
86315-52-8CS(=O)c3ccc(c1nc2cnccc2n1)c(OC)c312.42213770.6100.791
54-85-3O=C(NN)c1ccncc16.9743574−0.0390.182
78-59-1O=C1C=C(C)CC(C)(C)C17.5076649−0.9420.241
3778-73-2O=P1(NCCCl)OCCCN1CCCl22.86386962.5481.958
143-50-0O=C2C1(Cl)C3(Cl)C5(Cl)C1(Cl)C4(Cl)C2(Cl)C3(Cl)C4(Cl)C5(Cl)Cl20.45169772.2191.688
5989-27-5CC1=CCC(CC1)C(C)=C4.0566866−0.175−0.145
108-78-1Nc1nc(N)nc(N)n13.6878686−0.765−0.186
57-39-6CC1CN1P(=O)(N2CC2C)N3CC3C27.20971541.6842.444
60-56-0S=C1NC=CN1C14.26531062.0010.997
150-76-5Oc1ccc(OC)cc10.9123388−0.724−0.496
1634-04-4CC(C)(C)OC5.6530612−0.9010.034
21340-68-1Clc1ccc(cc1)c2ccc(OC(C)(C)C(=O)OC)cc217.68062841.8051.379
70-25-7O=[N+]([O−])\N=C(\N)N(C)N=O28.67251812.2632.607
63642-17-1NC(CCCNC(=O)N(C)N=O)C(=O)O28.85704382.4432.628
98-85-1CC(O)c1ccccc1−0.4434130−0.574−0.648
452-86-8Cc1cc(O)c(O)cc10.8376489−0.301−0.504
56-49-5Cc2ccc3cc1c5ccccc5ccc1c4CCc2c3428.41798892.7382.579
101-14-4Nc2ccc(Cc1ccc(N)c(Cl)c1)cc2Cl10.32473751.1410.556
838-88-0Cc2cc(Cc1ccc(N)c(C)c1)ccc2N14.64400721.4871.039
101-61-1CN(C)c2ccc(Cc1ccc(cc1)N(C)C)cc219.35907191.1911.566
76014-81-8OC(CCCN(C)N=O)c1cccnc130.84902163.3082.851
64091-91-4O=C(CCCN(C)N=O)c1cccnc129.29496893.3172.677
2385-85-5ClC53C1(Cl)C4(Cl)C2(Cl)C1(Cl)C(Cl)(Cl)C5(Cl)C2(Cl)C3(Cl)C4(Cl)Cl27.11090462.4892.433
315-22-0O=C1OCC3=CCN2CCC(OC(=O)C(C)C(C)(O)C1(C)O)C2326.13566892.5392.324
58139-48-3O=[N+]([O−])c1ccc(s1)c3nc(N2CCOCC2)c4ccccc4n322.40254591.8331.907
389-08-2O=C(O)C2=CN(CC)c1nc(C)ccc1C2=O2.94898050.063−0.268
91-59-8Nc1ccc2ccccc2c14.77740070.366−0.064
139-13-9OC(=O)CN(CC(=O)O)CC(=O)O1.4031307−0.967−0.441
59-87-0O=[N+]([O−])c1ccc(/C=N/NC(N)=O)o115.41598121.4531.125
75198-31-1O=[N+]([O−])c1ccc(o1)c2cnc3ccccn2316.20491511.2271.214
36133-88-7[O−][N+](=O)c1ccc(o1)c2nc(CNC(C)=O)on212.43001260.6270.792
4812-22-0CC\C=C(/CC)[N+]([O−])=O13.72732301.1740.937
602-87-9[O−][N+](=O)c1ccc2CCc3cccc1c239.45621861.3610.459
91-23-6COc1ccccc1[N+]([O−])=O−2.42350180.992−0.869
98-95-3[O−][N+](=O)c1ccccc111.54159340.6840.692
67-20-9O=[N+]([O−])c2ccc(/C=N/N1CC(=O)NC1=O)o29.08977660.1650.418
555-84-0O=[N+]([O−])c2ccc(/C=N/N1CCNC1=O)o211.48257191.6300.686
51-75-2ClCCN(C)CCCl30.52030754.1372.814
551-88-2CCC(CC)[N+]([O−])=O12.85778470.6940.839
607-35-2[O−][N+](=O)c1cccc2cccnc1212.50008501.2490.799
16813-36-8O=C1NC(=O)N(N=O)CC121.49669563.1631.805
89911-79-5O=NN(CC(C)O)CC(O)CO26.18359973.5232.329
92177-50-9OC(CNCC(C)=O)C(O)N=O22.49079163.6991.916
96806-34-7O=C(NCCCl)N(N=O)CCO28.04199472.7402.537
55090-44-3CN(CCCCCCCCCCCC)N=O20.63352872.6291.709
13256-11-6CN(CCc1ccccc1)N=O26.11367944.2162.321
684-93-5NC(=O)N(C)N=O25.26562533.0462.227
92177-49-6O=C(N=O)CCNCCO15.42123291.9101.126
55556-92-8O=NN1CC=CCC121.45775693.2711.801
82018-90-4FC(F)(F)CN(CC)N=O20.71709611.7921.718
75881-18-4CC1CN(N=O)CC(C)N1C39.37781213.0183.804
91308-70-2CC(O)CN(CC=C)N=O26.63427902.2162.380
91308-69-9C=CCN(N=O)CCO19.84602942.4231.621
1116-54-7OCCN(N=O)CCO16.17313201.6271.210
55-18-5CCN(CC)N=O25.07209083.5862.205
621-64-7CCCN(CCC)N=O28.55925052.8452.595
55984-51-5CC(=O)CN(C)N=O26.51546393.8292.366
68107-26-6CN(CCCCCCCCCCC)N=O21.10187561.9561.761
78246-24-9O=NN2CCCC2c1c[n+]([O−])ccc121.57488232.3441.814
5632-47-3O=NN1CCNCC122.94481851.1181.967
14698-29-4O=C(O)C2=CN(CC)c1cc3OCOc3cc1C2=O7.16127430.1940.203
101-80-4Nc1ccc(cc1)Oc2ccc(N)cc216.46246831.3231.242
13752-51-7S=C(SN1CCOCC1)N2CCOCC211.70325240.4370.710
1825-21-4Clc1c(OC)c(Cl)c(Cl)c(Cl)c1Cl16.70765471.0531.270
842-07-9O=C3C=Cc1ccccc1/C3=N\Nc2ccccc28.10548870.9270.308
50-33-9O=C3C(CCCC)C(=O)N(c1ccccc1)N3c2ccccc23.0841934−0.575−0.253
122-60-1c2ccc(OCC1CO1)cc25.84855390.5330.056
1955-45-9O=C1OCC1(C)C4.2138276−0.324−0.127
816-57-9NC(=O)N(CCC)N=O22.61194321.5411.930
81-54-9O=C2c1ccccc1C(=O)c3c2c(O)cc(O)c3O11.7313515−0.4230.713
127-47-9CC=1CCCC(C)(C)C=1/C=CC(\C)=C\C=C\C(\C)=C\COC(C)=O12.37168850.4200.785
18559-94-9OCc1cc(ccc1O)C(O)CNC(C)(C)C3.22474430.777−0.238
599-79-1O=S(=O)(Nc1ccccn1)c3ccc(N\N=C2/C=CC(=O)C(=C2)C(=O)O)cc32.6830187−0.601−0.298
533-31-3Oc1ccc2OCOc2c15.4856314−0.9900.015
77-46-3O=S(=O)(c1ccc(NC(C)=O)cc1)c2ccc(NC(C)=O)cc214.41565500.7771.014
23031-25-6Oc1cc(cc(O)c1)C(O)CNC(C)(C)C6.0019117−0.2600.073
116-14-3F/C(F)=C(\F)F2.7326963−0.029−0.293
40548-68-3O=NN1CCCCO118.83423470.6791.508
509-14-8O=[N+]([O−])C([N+]([O−])=O)([N+](=O)[O−])[N+]([O−])=O19.96218052.6421.634
52-24-4S=P(N1CC1)(N2CC2)N3CC328.45998963.0622.584
62-55-5CC(N)=S8.42513250.8150.344
789-61-7NC=3Nc2c(ncn2C1CC(O)C(CO)O1)C(=S)N=325.33201222.1302.234
141-90-2O=C1C=CNC(=S)N117.68658101.0321.379
137-17-7Cc1cc(C)c(N)cc1C0.35233840.605−0.559
95-63-6Cc1cc(C)c(C)cc13.1538748−1.559−0.245
55-63-0O=[N+]([O−])OC(CO[N+]([O−])=O)CO[N+](=O)[O−]8.57309070.0940.360
126-72-7BrCC(Br)COP(=O)(OCC(Br)CBr)OCC(Br)CBr21.60720562.2601.818
108-05-4CC(=O)OC=C0.4943662−0.598−0.543
75-02-5C=CF−0.60096100.362−0.665
2832-40-8O=C2C=CC(C)=C\C2=N\Nc1ccc(NC(C)=O)cc15.1582118−0.149−0.021
Test set
29611-03-8O=C2Oc1c4C5C=COC5Oc4cc(OC)c1C=3CCC(O)C2=337.24196685.1023.566
1162-65-8O=C2Oc1c4C5C=COC5Oc4cc(OC)c1C=3CCC(=O)C2=337.96900874.9913.647
57-06-7C=CC\N=C=S0.52829300.014−0.539
38514-71-5Nc1nc(cs1)c2oc(cc2)[N+]([O−])=O16.15356171.5581.208
140-57-8CC(C)(C)c1ccc(OCC(C)OS(=O)OCCCl)cc16.97643030.5390.182
1912-24-9Clc1nc(NCC)nc(NC(C)C)n111.22268370.8330.657
25843-45-2[O−]\[N+](C)=N\C32.46819993.2013.032
33372-39-3O=[N+]([O−])c1ccc(s1)c2nc(N(CCO)CCO)c3ccccc3n222.02600182.0601.864
2784-94-3CNc1ccc(cc1[N+]([O−])=O)N(CCO)CCO−0.0656579−0.439−0.605
869-01-2O=C(N)N(CCCC)N=O25.35107632.4482.236
120-80-9Oc1ccccc1O−0.01511940.114−0.600
95-83-0Nc1cc(Cl)ccc1N8.8914505−0.1760.396
10473-70-8Clc1ccc(NC(=O)N(C)C)cc17.66681781.5120.259
117-10-2Oc3cccc2C(=O)c1cccc(O)c1C(=O)c232.2118706−0.009−0.351
1192-28-5O\N=C1\CCCC15.45177840.3850.011
53-43-0O=C2CCC1C3CC=C4CC(O)CCC4(C)C3CCC12C13.14197210.5380.871
79-43-6ClC(Cl)C(=O)O6.7872450−0.0960.161
101-90-6c1ccc(cc1OCC2CO2)OCC3CO317.12817121.7691.317
55738-54-0CN(C)CNc2nnc(/C=C/c1ccc(o1)[N+]([O−])=O)o29.78624311.0960.496
121-69-7CN(C)c1ccccc18.7503763−0.0130.380
106-88-7CCC1CO15.3651210−0.4840.002
13073-35-3OC(=O)C(N)CCSCC15.60370321.5171.146
398-32-3O=C(C)Nc1ccc(cc1)c2ccc(F)cc222.03274702.3561.865
32852-21-4O=CNNc1nc(C)cs16.12124281.0380.086
3570-75-0O=CNNc1nc(cs1)c2ccc(o2)[N+]([O−])=O22.43321601.7011.910
67730-10-3Nc1ccc2nc3ccccn3c2n113.03555780.6390.859
26049-71-8NNc1nc(cs1)c2ccc(N)cc216.34834772.3021.230
21416-87-5O=C2CN(CC(C)N1CC(=O)NC(=O)C1)CC(=O)N210.54661601.3990.581
77500-04-0Cc1nc3c(nc1)ccc2c3nc(N)n2C23.27944482.1092.005
55-80-1CN(C)c2ccc(/N=N/c1cc(C)ccc1)cc227.20828261.8632.444
129-15-7[O−][N+](=O)c3c(C)ccc2C(=O)c1ccccc1C(=O)c2311.13181170.4990.646
14026-03-0CC1CCCCN1N=O22.12304240.9871.875
90-94-8CN(C)c1ccc(cc1)C(=O)c2ccc(cc2)N(C)C15.47548971.6771.132
531-82-8O=C(C)Nc1nc(cs1)c2ccc(o2)[N+]([O−])=O22.44937221.1531.912
51325-35-0O=[N+]([O−])c1ccc(o1)c2nc(NC(C)=O)nc(NC(C)=O)n218.78100921.3371.502
62-23-7O=[N+]([O−])c1ccc(cc1)C(=O)O11.6562372−0.2350.705
5522-43-0[O−][N+](=O)c4ccc1ccc2cccc3ccc4c1c2314.20903491.8710.990
75896-33-2OC1CCN(N=O)C127.82762362.1622.513
75881-20-8CN(CCCCCCCCCCCCCC)N=O19.69683492.1921.604
88208-16-6O=NN(CC=C)CC(O)CO24.14290952.2882.101
91308-71-3C=CCN(CC(=O)C)N=O29.34652302.6282.683
53609-64-6CC(O)CN(CC(C)O)N=O24.78483962.2832.173
924-16-3CCCCN(CCCC)N=O30.89568532.3602.856
40580-89-0O=NN1CCCCCCCCCCCC115.84095461.2901.173
614-95-9O=NN(CC)C(=O)OCC26.05398003.2092.315
100-75-4O=NN1CCCCC123.08648841.9021.983
930-55-2O=NN1CCCC121.02034002.0981.752
81795-07-5CC1SC(C)SC(C)N1N=O22.50259372.6001.918
60-80-0O=C2C=C(C)N(C)N2c1ccccc113.3054506−0.8150.889
75-56-9CC1CO111.0792966−0.1070.641
22571-95-5CC(C)C(O)(C(C)O)C(=O)OCC1=CCN2CCC(OC(=O)C(\C)=C\C)C1223.37462302.3002.015
811-97-2FCC(F)(F)F−4.6111637−2.467−1.114
139-65-1Nc1ccc(cc1)Sc2ccc(N)cc210.65563321.7660.593
538-23-8O=C(CCCCCCC)OC(COC(=O)CCCCCCC)COC(=O)CCCCCCC−6.8111993−1.067−1.360
88-06-2Clc1cc(Cl)cc(Cl)c1O4.0429184−0.312−0.146
96-18-4ClCC(Cl)CCl22.11250912.0381.874
42011-48-3O=C(Nc1nc(cs1)c2ccc(o2)[N+]([O−])=O)C(F)(F)F13.78070061.6560.943
2489-77-2CN(C)C(=S)NC21.07496200.6611.758
66-22-8O=C1C=CNC(=O)N17.6978985−0.7770.263
593-60-2BrC=C6.20376310.7620.095
75-01-4C=CCl15.18724441.0101.100

Acknowledgments

The authors thank the Marie Curie Fellowship (the contract ID 39036, CHEMPREDICT) and the EC project CAESAR (Project no. 022674 (SSPI)) for financial support.

References and Notes

  1. Benfenati, E; Benigni, R; Demarini, DM; Helma, C; Kirkland, D; Martin, TM; Mazzatorta, P; Ouedraogo-Arras, G; Richard, AM; Schilter, B; Schoonen, WG; Snyder, RD; Yang, C. Predictive Models for Carcinogenicity and Mutagenicity: Frameworks, State-of-the-Art, and Perspectives. J. Environ. Sci. Health C Environ. Carcinog. Ecotoxicol. Rev. 2009, 27, 57–90.
  2. Benigni, R; Netzeva, T; Benfenati, E; Bossa, C; Franke, R; Helma, C; Hulzebos, E; Marchant, C; Richard, A; Woo, Y-T; Yang, C. The expanding role of predictive toxicology: An update on the (Q)SAR models for mutagens and carcinogens. J. Environ. Sci. Health C 2007, 25, 53–97.
  3. Benigni, R. Structure-activity relationship studies of chemical mutagens and carcinogens: Mechanistic investigations and prediction approaches. Chem. Rev. 2005, 105, 1767–1800.
  4. Contrera, JF; MacLaughlin, P; Hall, LH; Kier, LB. QSAR modeling of carcinogenic risk using discriminant analysis and topological molecular descriptors. Curr. Drug Dis. Technol. 2005, 2, 55–67.
  5. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
  6. Weininger, D; Weininger, A; Weininger, JL. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 1989, 29, 97–101.
  7. Weininger, D. SMILES. 3. DEPICT. Graphical depiction of chemical structures. J. Chem. Inf. Comput. Sci. 1990, 30, 237–243.
  8. Vidal, D; Thormann, M; Pons, M. LINGO, an efficient holographic text based method to calculate biophysical properties and intermolecular similarities. J. Chem. Inf. Model. 2005, 45, 386–393.
  9. Toropov, AA; Benfenati, E. Optimisation of correlation weights of SMILES invariants for modelling oral quail toxicity. Eur. J. Med. Chem. 2007, 42, 606–613.
  10. Toropov, AA; Benfenati, E. Additive SMILES-based optimal descriptors in QSAR modelling bee toxicity: Using rare SMILES attributes to define the applicability domain. Bioorg. Med. Chem. 2008, 16, 4801–4809.
  11. Toropov, AA; Rasulev, BF; Leszczynski, J. QSAR modeling of acute toxicity by balance of correlations. Bioorg. Med. Chem. 2008, 16, 5999–6008.
  12. Toropov, AA; Toropova, AP. QSAR Modeling of Mutagenicity Based on Graphs of Atomic Orbitals. Internet Electron. J. Mol. Des. 2002, 1, 108–114.
  13. Marino, DJG; Peruzzo, PJ; Castro, EA; Toropov, AA. QSAR Carcinogenic Study of Methylated Polycyclic Aromatic Hydrocarbons Based on Topological Descriptors Derived from Distance Matrices and Correlation Weights of Local Graph Invariants. Internet Electron. J. Mol. Des. 2002, 1, 115–133.
  14. Peruzzo, PJ; Marino, DJG; Castro, EA; Toropov, AA. QSPR Modeling of Lipophilicity by Means of Correlation Weights of Local Graph Invariants. Internet Electron. J. Mol. Des. 2003, 2, 334–347.
  15. Available online: http://chem.sis.nlm.nih.gov/chemidplus/.
  16. Available online: http://webbook.nist.gov/chemistry/.
  17. Available online: http://www.epa.gov/ncct/dsstox/sdf_cpdbas.html/.
  18. Toropov, AA; Toropova, AP; Benfenati, E; Manganaro, A. QSAR modelling of carcinogenicity by balance of correlations. Mol. Divers. 2009, in press.
  19. Mazzatorta, P; Smiesko, M; Piparo, E; Benfenati, E. QSAR model for predicting pesticide aquatic toxicity. J. Chem. Inf. Model. 2005, 45, 1767–1774.
  20. Fatemi, MH; Haghdadi, M. Quantitative structure-property relationship prediction of permeability coefficients for some organic compounds through polyethylene membrane. J. Mol. Struct. 2008, 886, 43–50.
Figure 1. General scheme of construction of the optimal SMILES-based descriptors by means of the correlation balance method. Phase 1. The definition of the general list of SMILES attributes (limS=0). N111 is the number of attributes that are present in the subtraining, calibration, and test sets. If limS=0, N111 is relatively low. Phase 2. The definition of the most productive limS value, 0 < limS* < ∞; this value gives the maximum of N111, i.e., of the number of SMILES attributes that are present in the subtraining, calibration, and test sets.
Figure 2. Results of the computational experiments used to establish the preferable number of epochs of the Monte Carlo optimization (Nepoch). Triangles indicate curves for the test sets. Black circles denote the subtraining set. White circles denote the calibration set.
Figure 3. Comparison of the [subtraining-calibration-test] system and the [training-test] system for three splits.
Figure 4. Correlations between the determination coefficient for test set and W% for the three splits (see data from Table 4).
Figure 5. Graphical representation of the model for logTD50 calculated with Equation 3.
Table 1. The list of outliers of the QSAR models calculated with SMILES-based optimal descriptors.
| Number | CAS | Chemical name |
|---|---|---|
| 1 | 606-20-2 | 2,6-Dinitrotoluene |
| 2 | 57497-34-4 | Z-Methyl-O,N,N-azoxyethane |
| 3 | 17608-59-2 | N-Nitrosoephedrine |
| 4 | 15973-99-6 | Di(N-nitroso)-perhydropyrimidine |
| 5 | 61034-40-0 | 1-Nitroso-4-benzoyl-3,5-dimethylpiperazine |
| 6 | 99-80-9 | N,4-Dinitrosomethylaniline |
| 7 | 55557-00-1 | N,N-Dinitrosohomopiperazine |
| 8 | 86-30-6 | N-Nitrosodiphenylamine |
Table 2. Example of definition of SMILES attributes (unused positions are indicated by dots).
1SkCW(1Sk)2SkCW(2Sk)dCCW(dC)
C...........−0.0156855
O=C...........−2.8475657O=C.C.......0.0!-02........1.2190257
SMILES=“CC=O”; CAS= 75-07-0; DCW= −1.6442255.
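As a consistency check on this example, and assuming the additive scheme in which DCW is simply the sum of the correlation weights of the attributes extracted from the SMILES, the four weights listed in Table 2 (one of them fixed at zero) add up to the reported descriptor value. The assignment of each weight to a particular attribute is not reproduced here; only the arithmetic is shown.

```latex
\mathrm{DCW}(\texttt{CC=O})
  = (-0.0156855) + (-2.8475657) + 0.0 + 1.2190257
  = -1.6442255
```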
Table 3. Results of the computational experiments used to establish the number of epochs of the Monte Carlo optimization, Nepoch.
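[Subtraining-Calibration-Test] system

| Split | Nepoch | r² (subtraining) | r² (calibration) | r² (test) |
|---|---|---|---|---|
| Split-1 | 5 | 0.5850 | 0.6043 | 0.5513 |
| Split-1 | 10 | 0.7629 | 0.7675 | 0.7601 |
| Split-1 | 15 | 0.7939 | 0.8006 | 0.7187 |
| Split-1 | 20 | 0.8154 | 0.8243 | 0.6827 |
| Split-1 | 25 | 0.8300 | 0.8262 | 0.6076 |
| Split-2 | 5 | 0.5947 | 0.6017 | 0.7347 |
| Split-2 | 10 | 0.7195 | 0.7190 | 0.8011 |
| Split-2 | 15 | 0.7551 | 0.7538 | 0.7870 |
| Split-2 | 20 | 0.7732 | 0.7719 | 0.7659 |
| Split-2 | 25 | 0.7839 | 0.7834 | 0.7538 |
| Split-3 | 5 | 0.6673 | 0.6303 | 0.6548 |
| Split-3 | 10 | 0.7656 | 0.7669 | 0.7519 |
| Split-3 | 15 | 0.8077 | 0.8080 | 0.7205 |
| Split-3 | 20 | 0.8436 | 0.8428 | 0.6288 |
| Split-3 | 25 | 0.8562 | 0.8581 | 0.5503 |

[Training-Test] system

| Split | Nepoch | r² (training) | r² (test) |
|---|---|---|---|
| Split-1 | 5 | 0.6255 | 0.6003 |
| Split-1 | 10 | 0.7761 | 0.7098 |
| Split-1 | 15 | 0.8124 | 0.6579 |
| Split-1 | 20 | 0.8386 | 0.5826 |
| Split-1 | 25 | 0.8521 | 0.5158 |
| Split-2 | 5 | 0.6028 | 0.7397 |
| Split-2 | 10 | 0.7396 | 0.7719 |
| Split-2 | 15 | 0.7687 | 0.7705 |
| Split-2 | 20 | 0.7872 | 0.7452 |
| Split-2 | 25 | 0.7985 | 0.7123 |
| Split-3 | 5 | 0.6328 | 0.6559 |
| Split-3 | 10 | 0.7682 | 0.7127 |
| Split-3 | 15 | 0.8109 | 0.6397 |
| Split-3 | 20 | 0.8368 | 0.5378 |
| Split-3 | 25 | 0.8519 | 0.4573 |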
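The trend in Table 3 can be checked with a few lines of Python. The sketch below only transcribes the test-set r² values from the table and reports, for each system and split, the Nepoch at which the test-set r² is largest; the names `EPOCHS`, `TEST_R2`, and `best_epoch` are illustrative and are not part of the authors' software.

```python
# Test-set r^2 versus N_epoch, transcribed from Table 3.
EPOCHS = [5, 10, 15, 20, 25]

TEST_R2 = {
    ("subtraining-calibration-test", "Split-1"): [0.5513, 0.7601, 0.7187, 0.6827, 0.6076],
    ("subtraining-calibration-test", "Split-2"): [0.7347, 0.8011, 0.7870, 0.7659, 0.7538],
    ("subtraining-calibration-test", "Split-3"): [0.6548, 0.7519, 0.7205, 0.6288, 0.5503],
    ("training-test", "Split-1"): [0.6003, 0.7098, 0.6579, 0.5826, 0.5158],
    ("training-test", "Split-2"): [0.7397, 0.7719, 0.7705, 0.7452, 0.7123],
    ("training-test", "Split-3"): [0.6559, 0.7127, 0.6397, 0.5378, 0.4573],
}


def best_epoch(r2_values, epochs=EPOCHS):
    """Return (N_epoch, r2) for the largest test-set r^2 in the series."""
    best_index = max(range(len(epochs)), key=lambda i: r2_values[i])
    return epochs[best_index], r2_values[best_index]


if __name__ == "__main__":
    for (system, split), values in TEST_R2.items():
        n_epoch, r2 = best_epoch(values)
        print(f"{system:30s} {split}: N_epoch={n_epoch}, r2_test={r2:.4f}")
```

In every series the test-set r² peaks at Nepoch = 10 and declines afterwards, while the fitted-set r² values in Table 3 keep increasing, which is the overtraining pattern the table is meant to expose.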
Table 4. Average statistical characteristics of the QSAR model of carcinogenicity (logTD50) for three splits into the subtraining, calibration, and test sets with the limS values of 0–10. For the best models three attempts of the Monte Carlo optimization together with average values are presented, for other models only average values are shown.
SPLIT1

[Subtraining-Calibration-Test] system

Subtraining set, n=165; calibration set, n=167; test set, n=61.

| limS | Nact | R² (subtraining) | s (subtraining) | F (subtraining) | R² (calibration) | s (calibration) | F (calibration) | R² (test) | s (test) | F (test) | SAk distribution: W% | SAk distribution: N111 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 797 | 0.8731 | 0.500 | 1125 | 0.8805 | 0.619 | 1217 | 0.5769 | 0.893 | 81 | 42 | 333 |
| 1 | 622 | 0.8807 | 0.485 | 1203 | 0.8821 | 0.621 | 1235 | 0.5319 | 0.942 | 67 | 50 | 314 |
| 2 | 407 | 0.8275 | 0.583 | 783 | 0.8268 | 0.703 | 789 | 0.6305 | 0.832 | 101 | 70 | 285 |
| 3 | 321 | 0.7801 | 0.658 | 579 | 0.7806 | 0.730 | 588 | 0.7102 | 0.732 | 145 | 79 | 255 |
| 4–1 | 266 | 0.7622 | 0.685 | 522 | 0.7620 | 0.734 | 528 | 0.7541 | 0.682 | 181 | 82 | 217 |
| 4–2 | | 0.7593 | 0.689 | 514 | 0.7592 | 0.746 | 520 | 0.7483 | 0.692 | 175 | | |
| 4–3 | | 0.7643 | 0.682 | 529 | 0.7647 | 0.729 | 536 | 0.7519 | 0.678 | 179 | | |
| average | | 0.7619 | 0.685 | 522 | 0.7619 | 0.736 | 528 | 0.7514 | 0.684 | 178 | | |
| 5 | 233 | 0.7247 | 0.737 | 429 | 0.7241 | 0.770 | 433 | 0.7387 | 0.711 | 167 | 85 | 197 |
| 6 | 203 | 0.6901 | 0.781 | 363 | 0.6888 | 0.814 | 365 | 0.7129 | 0.738 | 148 | 86 | 174 |
| 7 | 182 | 0.6704 | 0.806 | 332 | 0.6710 | 0.830 | 337 | 0.6541 | 0.812 | 112 | 84 | 153 |
| 8 | 164 | 0.6528 | 0.827 | 307 | 0.6530 | 0.844 | 311 | 0.7015 | 0.753 | 139 | 87 | 142 |
| 9 | 152 | 0.6356 | 0.847 | 284 | 0.6348 | 0.864 | 287 | 0.6378 | 0.822 | 105 | 84 | 128 |
| 10 | 139 | 0.6178 | 0.868 | 263 | 0.6218 | 0.875 | 271 | 0.6788 | 0.777 | 126 | 84 | 117 |
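Figure 4 relates the test-set determination coefficient to W%. As a rough numerical illustration of that relationship, the sketch below computes the Pearson correlation coefficient between W% and the test-set R² for Split 1 of the [Subtraining-Calibration-Test] system, using the values of Table 4 for limS = 0 to 10 (for limS = 4 the average over the three Monte Carlo runs is used). The variable and function names are illustrative assumptions, and the script is not the procedure used to draw Figure 4.

```python
from math import sqrt

# Split 1, [Subtraining-Calibration-Test] system, limS = 0..10 (Table 4).
W_PERCENT = [42, 50, 70, 79, 82, 85, 86, 84, 87, 84, 84]
R2_TEST = [0.5769, 0.5319, 0.6305, 0.7102, 0.7514, 0.7387,
           0.7129, 0.6541, 0.7015, 0.6378, 0.6788]


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    r = pearson_r(W_PERCENT, R2_TEST)
    print(f"Pearson r between W% and test-set R2 (Split 1): {r:.3f}")
```

The coefficient is clearly positive for this split: models built with a larger share of attributes common to all three sets tend to predict the test set better.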
[Training-Test] system

Training set, n=332; calibration set, n=0; test set, n=61.

| limS | Nact | R² (training) | s (training) | F (training) | R² (test) | s (test) | F (test) | SAk distribution: W% | SAk distribution: N101 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 797 | 0.8868 | 0.472 | 2593 | 0.5429 | 1.002 | 71 | 47 | 376 |
| 1 | 777 | 0.8851 | 0.475 | 2542 | 0.5418 | 0.984 | 71 | 46 | 356 |
| 2 | 542 | 0.8602 | 0.524 | 2032 | 0.6042 | 0.910 | 91 | 61 | 330 |
| 3 | 432 | 0.8313 | 0.576 | 1626 | 0.5575 | 0.917 | 74 | 72 | 309 |
| 4 | 385 | 0.8109 | 0.610 | 1417 | 0.5628 | 0.910 | 76 | 75 | 289 |
| 5 | 344 | 0.8007 | 0.626 | 1327 | 0.5913 | 0.871 | 86 | 78 | 267 |
| 6–1 | 312 | 0.7902 | 0.642 | 1243 | 0.6744 | 0.769 | 122 | 82 | 255 |
| 6–2 | | 0.7875 | 0.646 | 1223 | 0.7138 | 0.721 | 147 | | |
| 6–3 | | 0.7843 | 0.651 | 1200 | 0.6947 | 0.744 | 134 | | |
| average | | 0.7873 | 0.647 | 1222 | 0.6943 | 0.745 | 135 | | |
| 7 | 288 | 0.7788 | 0.659 | 1162 | 0.6579 | 0.789 | 114 | 83 | 238 |
| 8 | 268 | 0.7659 | 0.678 | 1080 | 0.6677 | 0.777 | 121 | 85 | 227 |
| 9 | 246 | 0.7363 | 0.720 | 922 | 0.6853 | 0.757 | 129 | 84 | 207 |
| 10 | 234 | 0.7224 | 0.739 | 859 | 0.6909 | 0.750 | 133 | 84 | 196 |
SPLIT2

[Subtraining-Calibration-Test] system Split2

Subtraining set, n=165; calibration set, n=167; test set, n=61.

| limS | Nact | R² (subtraining) | s (subtraining) | F (subtraining) | R² (calibration) | s (calibration) | F (calibration) | R² (test) | s (test) | F (test) | SAk distribution: W% | SAk distribution: N111 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 797 | 0.8743 | 0.507 | 1134 | 0.8737 | 0.540 | 1142 | 0.4630 | 1.055 | 51 | 42 | 337 |
| 1 | 632 | 0.8740 | 0.507 | 1131 | 0.8736 | 0.551 | 1140 | 0.5003 | 0.995 | 59 | 51 | 320 |
| 2 | 425 | 0.8377 | 0.576 | 841 | 0.8367 | 0.580 | 846 | 0.5919 | 0.820 | 86 | 67 | 286 |
| 3 | 335 | 0.8048 | 0.632 | 673 | 0.8041 | 0.633 | 678 | 0.5862 | 0.861 | 84 | 78 | 261 |
| 4 | 284 | 0.7843 | 0.664 | 593 | 0.7842 | 0.663 | 600 | 0.7042 | 0.711 | 141 | 84 | 239 |
| 5 | 247 | 0.7458 | 0.721 | 478 | 0.7448 | 0.728 | 482 | 0.7627 | 0.671 | 190 | 87 | 214 |
| 6–1 | 224 | 0.7315 | 0.741 | 444 | 0.7314 | 0.748 | 449 | 0.7937 | 0.604 | 227 | 84 | 189 |
| 6–2 | | 0.7234 | 0.752 | 426 | 0.7234 | 0.760 | 431 | 0.7922 | 0.605 | 225 | | |
| 6–3 | | 0.7384 | 0.731 | 460 | 0.7384 | 0.740 | 466 | 0.8136 | 0.593 | 258 | | |
| average | | 0.7311 | 0.741 | 444 | 0.7310 | 0.749 | 449 | 0.7998 | 0.600 | 236 | | |
| 7 | 195 | 0.6978 | 0.786 | 376 | 0.7007 | 0.781 | 386 | 0.7318 | 0.657 | 161 | 84 | 164 |
| 8 | 178 | 0.6878 | 0.799 | 359 | 0.6880 | 0.801 | 364 | 0.7223 | 0.682 | 153 | 82 | 146 |
| 9 | 158 | 0.6659 | 0.826 | 325 | 0.6692 | 0.831 | 334 | 0.7104 | 0.709 | 145 | 84 | 133 |
| 10 | 149 | 0.6472 | 0.849 | 299 | 0.6550 | 0.847 | 313 | 0.6970 | 0.723 | 136 | 84 | 125 |
[Training-Test] system Split2

Training set, n=332; calibration set, n=0; test set, n=61.

| limS | Nact | R² (training) | s (training) | F (training) | R² (test) | s (test) | F (test) | SAk distribution: W% | SAk distribution: N101 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 797 | 0.8922 | 0.468 | 2734 | 0.4665 | 1.013 | 52 | 47 | 372 |
| 1 | 785 | 0.8950 | 0.462 | 2815 | 0.4711 | 1.029 | 53 | 46 | 360 |
| 2 | 546 | 0.8740 | 0.506 | 2290 | 0.5329 | 0.887 | 67 | 61 | 335 |
| 3 | 442 | 0.8456 | 0.561 | 1807 | 0.5767 | 0.845 | 81 | 71 | 315 |
| 4 | 388 | 0.8194 | 0.606 | 1497 | 0.6130 | 0.805 | 94 | 76 | 296 |
| 5 | 350 | 0.8122 | 0.618 | 1428 | 0.5802 | 0.873 | 82 | 79 | 278 |
| 6 | 321 | 0.8103 | 0.621 | 1412 | 0.6074 | 0.840 | 92 | 83 | 267 |
| 7 | 287 | 0.7848 | 0.662 | 1204 | 0.6689 | 0.753 | 120 | 86 | 247 |
| 8 | 263 | 0.7594 | 0.700 | 1042 | 0.7345 | 0.655 | 164 | 87 | 229 |
| 9–1 | 243 | 0.7397 | 0.728 | 938 | 0.7472 | 0.653 | 174 | 89 | 216 |
| 9–2 | | 0.7370 | 0.732 | 925 | 0.7862 | 0.602 | 217 | | |
| 9–3 | | 0.7456 | 0.720 | 967 | 0.7604 | 0.642 | 187 | | |
| average | | 0.7408 | 0.726 | 943 | 0.7646 | 0.632 | 193 | | |
| 10 | 228 | 0.7294 | 0.742 | 890 | 0.7502 | 0.655 | 178 | 86 | 196 |
SPLIT3
Subtraining set, n=165; Calibration set, n=167; Test set, n=61
[Subtraining-Calibration-Test] system, Split 3
limS | Nact | R² (subtraining) | s (subtraining) | F (subtraining) | R² (calibration) | s (calibration) | F (calibration) | R² (test) | s (test) | F (test) | W% | N111
0 | 797 | 0.8690 | 0.518 | 1084 | 0.8909 | 0.516 | 1353 | 0.5794 | 0.929 | 82 | 42 | 332
1 | 614 | 0.8742 | 0.508 | 1134 | 0.8946 | 0.513 | 1402 | 0.5995 | 0.896 | 89 | 50 | 309
2 | 402 | 0.8266 | 0.597 | 778 | 0.8331 | 0.614 | 826 | 0.6748 | 0.800 | 122 | 69 | 278
3–1 | 324 | 0.7963 | 0.647 | 637 | 0.7982 | 0.633 | 652 | 0.7176 | 0.729 | 150 | 78 | 254
3–2 | | 0.7919 | 0.654 | 620 | 0.7937 | 0.639 | 635 | 0.6969 | 0.758 | 136 | |
3–3 | | 0.7930 | 0.652 | 624 | 0.7944 | 0.641 | 637 | 0.7431 | 0.698 | 171 | |
average | | 0.7937 | 0.651 | 627 | 0.7954 | 0.638 | 642 | 0.7192 | 0.728 | 152 | |
4 | 264 | 0.7439 | 0.725 | 474 | 0.7462 | 0.703 | 485 | 0.6992 | 0.765 | 138 | 85 | 224
5 | 227 | 0.7127 | 0.768 | 404 | 0.7136 | 0.738 | 411 | 0.6900 | 0.774 | 133 | 86 | 195
6 | 198 | 0.6945 | 0.792 | 371 | 0.7013 | 0.756 | 388 | 0.6899 | 0.770 | 133 | 86 | 171
7 | 181 | 0.6790 | 0.812 | 345 | 0.6843 | 0.780 | 358 | 0.6995 | 0.758 | 137 | 85 | 154
8 | 159 | 0.6432 | 0.856 | 294 | 0.6493 | 0.815 | 306 | 0.7061 | 0.749 | 142 | 84 | 134
9 | 147 | 0.6219 | 0.881 | 268 | 0.6533 | 0.820 | 311 | 0.6934 | 0.775 | 134 | 84 | 123
10 | 140 | 0.5952 | 0.911 | 240 | 0.6269 | 0.849 | 277 | 0.6300 | 0.842 | 101 | 83 | 116
Training set, n=332; Calibration set, n=0; Test set, n=61
[Training-Test] system, Split 3
limS | Nact | R² (training) | s (training) | F (training) | R² (test) | s (test) | F (test) | W% | N101
0 | 797 | 0.8930 | 0.457 | 2756 | 0.5532 | 1.009 | 73 | 47 | 377
1 | 776 | 0.8932 | 0.457 | 2763 | 0.5529 | 0.996 | 73 | 46 | 356
2 | 540 | 0.8699 | 0.504 | 2209 | 0.5998 | 0.922 | 89 | 61 | 327
3 | 434 | 0.8349 | 0.568 | 1674 | 0.5908 | 0.896 | 88 | 72 | 311
4 | 388 | 0.8220 | 0.590 | 1528 | 0.6068 | 0.865 | 92 | 75 | 291
5 | 348 | 0.8030 | 0.620 | 1346 | 0.6650 | 0.796 | 117 | 78 | 272
6–1 | 320 | 0.7773 | 0.660 | 1152 | 0.7017 | 0.751 | 139 | 82 | 261
6–2 | | 0.7942 | 0.634 | 1273 | 0.6967 | 0.761 | 136 | |
6–3 | | 0.7834 | 0.651 | 1193 | 0.7171 | 0.735 | 150 | |
average | | 0.7850 | 0.648 | 1206 | 0.7051 | 0.749 | 141 | |
7 | 288 | 0.7598 | 0.685 | 1045 | 0.6807 | 0.778 | 126 | 84 | 241
8 | 271 | 0.7637 | 0.679 | 1067 | 0.6520 | 0.817 | 112 | 85 | 229
9 | 244 | 0.7318 | 0.724 | 901 | 0.6833 | 0.778 | 127 | 86 | 210
10 | 232 | 0.7288 | 0.728 | 887 | 0.6826 | 0.781 | 127 | 84 | 196
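The F and W% columns of Table 4 can be cross-checked with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated in this section: that the model is a one-descriptor least-squares regression (so F = R²(n − 2)/(1 − R²)), and that W% is the percentage of active attributes shared by all sets (100·N111/Nact or 100·N101/Nact). Both assumptions reproduce the tabulated values for single Monte Carlo runs; the function names are illustrative only.

```python
def f_statistic(r2: float, n: int) -> float:
    """Fisher F-ratio for a one-descriptor linear model fitted to n compounds.

    Assumption: logTD50 is regressed on a single descriptor, so the
    regression has 1 degree of freedom and the residual has n - 2.
    """
    return r2 * (n - 2) / (1.0 - r2)


def shared_attribute_percent(n_shared: int, n_active: int) -> float:
    """Assumed meaning of W%: share of active attributes found in every set."""
    return 100.0 * n_shared / n_active


# Split 2, [Subtraining-Calibration-Test] system, run 6-1, test set (n=61).
print(round(f_statistic(0.7937, 61)))              # table reports F = 227
print(round(shared_attribute_percent(189, 224)))   # table reports W% = 84

# Split 3, [Subtraining-Calibration-Test] system, run 3-1, test set (n=61).
print(round(f_statistic(0.7176, 61)))              # table reports F = 150
print(round(shared_attribute_percent(254, 324)))   # table reports W% = 78
```

The "average" rows are better compared against the mean of the three per-run F values than against F recomputed from the averaged R², since F is not linear in R².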
Table 5. Correlation weights for the calculation of DCW(4) with Equation 1. N(Subtr), N(Calib), and N(Test) are the numbers of occurrences of a given SMILES attribute in the subtraining set, calibration set, and test set, respectively. The rare attributes are omitted.
SMILES attribute (SA) | CW(SA) probe 1 | CW(SA) probe 2 | CW(SA) probe 3 | N(Subtr) | N(Calib) | N(Test)

dC
!-01........2.75222742.87046153.5346711540
!-02........1.21902572.12779101.87906801093
!-03........6.67843898.03117597.127195815103
!-04........1.43261021.67022251.934079017228
!-05........3.96710554.03449244.17296359116
!-06........5.85646375.87940126.44097548117
!-07........5.49704755.16112405.2308474530
!-08........9.12959239.51223289.0035813431
!-21........−1.63832481.87819620.0037831400
!000........3.62718214.68944053.7495506671
!002........1.56032601.74506111.4951171441
!003........−1.2514096−1.3248590−1.1256941582
!004........0.73597261.06435221.24502581181
!005........0.97028170.62406361.21442601394
!006........4.15430294.93388304.9975361752
!007........−3.7770327−3.5039029−3.0945823433
!010........0.5049355−0.26364350.3157527683
!012........3.25112133.24715784.5049864631

1SAk

#...........3.37062943.37398772.0643948530
(...........−1.6866726−1.3666396−1.5485382708780260
/...........−0.49134260.1880630−1.097573317244
1...........−1.4970879−0.8440743−0.077165922222288
2...........−0.1050677−1.1891334−1.113832913013248
3...........−1.34333400.0456678−0.1828115606020
4...........3.49548703.15621073.545344720188
5...........2.81280373.38999591.690208610104
=...........−1.8660845−2.1441609−1.8865449777923
C...........−0.01568550.04535250.2198595765736290
Br..........0.53271810.27793440.84549382381
Cl..........2.98385902.19069703.1890603618513
F...........−0.4680666−1.04259520.283649215198
O=C.........−2.8475657−2.4376628−2.9332073332113
O=..........0.73693720.00378051.408639814013247
N...........1.12275011.19829651.419364019620176
O...........−1.2649109−0.4408418−0.150149913814345
S...........2.37127142.62517602.531356513127
[N+]........1.93456891.65437713.0457447263112
[O−]........5.92509005.62305646.9653600263212
[...........−2.1531745−2.8080919−1.9966710460
\...........3.35658923.43388132.902741414297
c...........−0.03572640.03731810.0419142653679247
n...........−0.6564241−0.1251570−1.4184164374423
o...........−1.0665085−1.3777640−0.248547016127
s...........−0.0527175−0.9991370−1.0040993767

2SAk

(...(.......−0.0735964−0.1751550−0.497043218284
/...(.......−0.9972903−1.5270762−0.74797997102
1...(.......2.37334522.49757652.4084744374515
2...(.......0.06081360.1227718−0.184879214156
2...1.......5.75295096.62879317.4713129562
3...(.......−1.6283079−0.2528134−2.0499858630
3...2.......−1.5915226−1.5268568−1.9365673463
=...(.......1.84683332.30798392.437785114173
=...1.......2.5039102−0.5608527−0.0325572751
=...2.......−3.2847055−2.5042239−2.0658154752
=...3.......3.43891110.82780033.5625409654
C...#.......−0.1836222−0.9385750−0.4107562630
C...(.......−0.75738510.02801620.7466835443456163
C.../.......1.12453590.00424430.887318913132
C...1.......−0.45662620.0357389−1.3911620747330
C...2.......0.31150031.09118900.8401947464712
C...3.......3.65348363.46780532.986649940228
C...4.......−0.7152591−0.7967591−1.250257517135
C...5.......3.69098074.42293824.30158221076
C...=.......−0.5319093−0.5807285−0.45692539810129
C...C.......−0.4098212−0.6663667−0.4722713244211113
Br..(.......−1.2467411−0.7804139−0.96766022470
Br..C.......5.80393946.66015915.9721683951
Cl..(.......−0.2165917−0.6443513−0.73890156810411
Cl..C.......6.87686667.68395707.434334117185
F...(.......0.2020867−0.0538118−0.1868874242212
O=C.(.......0.7311485−1.62570290.21889341885
O=C.1.......4.37781604.78093404.0011131961
O=..(.......−0.5612999−1.5272716−1.145433717715860
O=..1.......−2.5019413−3.3715192−4.0028237420
N...#.......−3.8725309−4.4992215−3.7843832420
N...(.......0.06662450.7453778−0.128967414016556
N.../.......0.81333230.0606093−0.18938419122
N...1.......1.87448681.03354961.5038557231710
N...2.......1.49791321.49599011.4961647693
N...=.......−1.31574190.1537739−0.388289812165
N...C.......1.40512380.98274101.0619180637024
N...O=......6.12702917.30000584.8170313393413
N...N.......3.19224983.50131504.13216881488
O...(.......−0.1195150−0.1976562−0.783881110611131
O...1.......−0.7620380−1.4388311−1.936180319135
O...2.......−2.5618134−3.2668394−2.8747322953
O...C.......1.04441541.03397260.9105754909629
S...(.......−0.74797410.4990928−0.0132133784
S...=.......1.50090450.6752299−0.2807681572
S...C.......0.2470117−1.2535030−0.5349209614
[N+](.......3.25168211.65243301.3748828403717
[O−](.......−0.4532482−0.8221590−1.3547359394818
[O−][N+]....0.26167080.6284804−1.2536848562
\...(.......0.2506876−0.8700648−1.12682545111
\...C.......2.17103292.62623431.761937511266
\...N.......−3.1201815−3.8706242−3.03259704123
c...(.......0.3275817−0.1910585−0.534331118323894
c...1.......0.51277810.17149800.823651919620475
c...2.......0.11399691.43315932.250992712912238
c...3.......1.50453721.43754140.1592669415015
c...4.......0.93915820.24513760.760577291010
c...C.......−1.64592580.0580657−0.433324015191
c...Cl......−1.9973422−2.7517912−3.6905785573
c...N.......1.08964080.18973910.7548234261912
c...O.......2.43311561.25159970.917850322186
c...c.......−0.2252497−0.6284915−0.8624749316305106
n...(.......0.8765637−0.94010230.24943191186
n...1.......1.62354551.27653701.8586068161510
n...2.......2.03033853.97971453.50099426118
n...3.......4.12958733.55762184.1220666463
n...c.......2.31017151.55999871.7164852254017
o...(.......−3.9990305−2.9953875−3.7613936896
o...1.......5.88759627.18571776.8437068552
o...2.......−0.6199000−0.1044473−0.2658548765
o...c.......5.57256054.25490843.0289124831
s...1.......0.94071521.97653252.0040980667
s...c.......−0.49600140.00053520.0003709426

3SAk

(...C...(...3.99802863.56306342.42088009510242
(...Br..(...0.00047540.5262619−1.7321140930
(...Cl..(...0.6233843−0.37774701.356961129445
(...F...(...1.92165252.49942531.13375121195
(...O=..(...1.25866840.91936780.2775945716828
(...N...(...2.15491411.55008751.3741232294011
(...O...(...1.39215860.55907690.374169233348
(...[N+](...2.25435514.44915884.03234771585
(...[O−](...−1.5354823−1.4737180−3.377972419239
(...c...(...−1.0930229−0.93422190.433212712180
/...C...(...4.49708184.00482873.0000737540
1...C...(...4.21425252.81618853.185027916165
1...O...(...2.25284100.37976611.4952742420
1...c...(...0.93351290.52880830.8716499183512
2...C...(...−2.2544539−1.0919167−2.642706710134
2...c...(...2.99724223.44172853.8400379292812
2...c...1...1.99607340.5619955−0.0932258782
2...o...(...1.01105281.49732811.0603210444
3...C...(...−2.2814415−2.8156671−1.9977117761
3...C...2...6.49807358.00200208.2529271400
3...c...(...−0.21713791.05780161.2483687994
3...c...2...5.25020234.22265544.7487134872
4...C...(...1.7464204−0.6210801−0.2842473640
=...C...3...7.00299587.56416766.7549811840
=...C...1...2.99994562.84749094.25459861284
=...C...(...1.57130760.95215380.530994318182
=...C.../...5.24958005.49940846.0049990672
=...N.../...5.81193465.59549446.3789368782
C...(...C...0.5513380−0.2378199−1.0864076696424
C...(...1...−1.1216831−3.3768328−2.49519099103
C...(...=...6.25359765.15874974.43609259113
C...(...(...−0.3146880−1.6918223−1.283472411223
C.../...(...−3.0637430−1.8160468−2.9987839541
C...1...C...5.49546613.91895504.61402808105
C...1...(...1.37185291.30793331.47950978131
C...1...=...0.24768561.31694670.9333222641
C...2...(...−0.6449419−0.8430901−0.7370108590
C...2...C...5.99653795.99665336.0017257861
C...2...=...0.0028591−0.8747150−0.3764369752
C...3...(...6.50450566.24779665.7492216530
C...3...=...5.62316096.00029165.1293498532
C...3...C...−3.5000076−2.9954231−3.00283091123
C...4...C...−3.0021372−4.5016046−2.9968826411
C...=...1...0.43315262.65824341.9050556751
C...=...(...−2.4953051−0.6283873−0.74731938111
C...=...C...1.60935402.79759192.1897927333411
C...=...3...0.29718091.5013043−0.9639603532
C...=...2...1.62756373.71722581.7773288740
C...C...3...5.07461574.61190024.59088801685
C...C...=...−0.4018420−1.0583619−1.097781836265
C...C...1...1.87759370.93840771.3828674312717
C...C...2...0.7079969−0.18791120.627496221193
C...C...(...1.01230780.79243980.404730310910942
C...C...4...−1.4024533−0.7803070−0.3715353752
C...C...C...−0.04284020.1882515−0.1515135775859
C...Br..(...1.62980402.50385041.0267176410
C...Cl..(...−1.4962815−0.4993582−1.2516847441
C...N...1...−0.2691219−0.3795019−1.0000894881
C...N...(...1.44225040.97311360.4339815363518
C...O...2...5.12364724.08956983.5121038531
C...O...(...3.25151622.28434082.5780816283212
C...O...C...4.37416984.31050413.06346858104
C...O...1...2.87337893.12673192.96734391393
C...\...C...−2.8461979−3.8759584−2.2789129481
C...c...2...6.00068604.53157155.2468516450
C...c...1...2.43569120.43517201.247287210131
Br..(...C...2.06158551.14375991.9040185460
Br..C...(...1.49819690.62564060.0009660730
Cl..(...(...−1.2075807−0.6362162−1.9255180960
Cl..(...C...−1.1526609−0.2476848−1.814782527324
Cl..(...Cl..0.50492083.25394410.6886852470
Cl..C...C...−0.00145860.0039516−1.24643699102
Cl..c...1...−0.25339021.62956262.2512123463
F...(...C...1.68637541.50151670.5346605582
F...(...(...−1.7457403−2.1139078−1.1982770684
O=C.1...C...5.12799702.86698623.6914477431
O=..(...C...0.81071090.7780984−0.4033174926831
O=..1...C...−3.7510578−4.1255552−4.1222938420
O=..N...(...9.543518310.063654310.031589924286
N...#...C...−4.5000318−4.5014930−4.5004055410
N...(...N...1.19168030.99835791.628820112102
N...(...1...−0.1264931−0.75381140.0454142570
N...(...C...3.50188362.28227483.0336305556230
N...(...O=..−2.3138510−2.3145054−1.503109223145
N...(...O=C.−1.4990141−0.8704228−1.2494712652
N...1...C...2.69140722.56472752.718463812134
N...2...C...−0.4978517−0.00000511.0021092561
N...C...(...−0.8104915−0.4341890−0.968401325248
N...C...C...−1.2520104−0.7226801−1.000815922266
N...O=..(...2.67159721.62721713.655078411111
N...N...1...0.00425931.49701381.1825822533
N...N...O=..4.24598703.74661013.22214991065
N...N...(...4.75367405.40552584.6294702642
N...c...2...−3.6269860−2.8792235−4.2532265531
N...c...1...−0.18992510.23385720.2627307201511
O...(...O=..−0.62193161.43958271.253259419175
O...(...C...0.9395840−0.43473000.4467587524017
O...(...(...11.504053312.004931811.9989881441
O...(...O=C.4.93689824.99940004.7472261720
O...C...1...−0.49874160.49781810.9413176443
O...C...C...−2.6559832−3.2412835−3.2024047353710
O...C...(...−0.1201364−0.3106109−1.2477221273110
O...c...1...−2.7623334−2.0920952−2.12420601483
O...c...2...−1.4980614−3.5286748−3.6223778760
S...C...C...−0.00344081.50428030.7537994402
[N+](...C...9.25390669.44173557.5006239641
[N+](...2...5.41093754.62731273.2821267634
[N+](...O=..−0.37877900.31094360.1872916472
[O−](...[N+]−3.8743809−1.4388021−1.245318118229
[O−](...O=..−4.0585677−2.6242009−2.557735915116
[O−][N+](...−3.5045096−1.4982980−2.4892797552
\...C...=...−1.3136029−1.8755430−1.28544924111
\...C...(...−3.5018378−4.4994516−3.8096741572
c...(...[O−]3.99921704.06123264.49784794103
c...(...c...1.75238752.59212350.9359654241913
c...(...Br..1.33923410.53407791.18890341700
c...(...C...1.00020100.2472155−0.2478234194113
c...(...Cl..1.18255972.50394250.908897815237
c...(...O...1.15539930.91074021.840108310346
c...(...N...−0.4647652−0.51092500.5049629134111
c...(...1...3.25463951.68264381.7521385171710
c...(...O=..2.00087862.90446032.817204915177
c...(...F...2.56157272.24838472.9956341602
c...1...O...0.31572180.24600290.0026575733
c...1...C...−0.3426326−0.4053179−0.058701310104
c...1...(...4.12914723.73856174.501010215176
c...1...c...2.52702014.12545911.6392878646924
c...2...c...3.18346743.57652332.5649848464110
c...2...O...−2.1902681−0.8169002−1.5600959650
c...2...C...2.05999803.31660752.0508240642
c...2...(...−3.4837706−2.7454921−2.2497303522
c...3...c...0.56719530.32803752.557746514156
c...C...C...−0.4968006−1.0630716−0.7464899460
c...N...(...4.12129305.12808793.2342653933
c...O...(...7.87950418.75295008.5448576520
c...O...C...0.49697601.06028891.00359941080
c...c...2...−1.0046754−0.7477295−1.3148992595820
c...c...c...−0.9189229−1.1362229−0.988610317114850
c...c...1...0.96844041.09616871.026785911110136
c...c...3...−1.4056269−2.8661586−1.749054818246
c...c...4...−1.24983000.62789680.4997288554
c...c...(...−0.5592802−0.7452834−0.28311838711045
c...n...1...0.40372740.75451821.8635708898
n...1...c...1.14461621.19062160.13688851188
n...c...c...−4.4951810−4.4955509−4.25004685111
n...c...(...−1.7475062−0.9098866−0.0016730101313
o...(...c...1.9983265−0.30772481.1610603555
o...1...(...−0.8795536−0.8151611−1.4961309432
s...1...(...3.00073593.31262242.8719278556
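As an illustration of how the weights in Table 5 might be used, the following minimal sketch sums correlation weights over a list of SMILES attributes. It assumes that the descriptor is purely additive over the 1SAk, 2SAk, and 3SAk attributes and that any attribute absent from the table (a rare attribute) contributes zero; the dC term and the exact form of Equation 1 are not reproduced here. The weights are the probe 1 values from Table 5, while the attribute list and the names `CW_PROBE1` and `dcw` are hypothetical.

```python
# Probe 1 correlation weights copied from Table 5 (a small excerpt only).
CW_PROBE1 = {
    "C...........": -0.0156855,
    "Cl..........": 2.9838590,
    "N...........": 1.1227501,
    "C...C.......": -0.4098212,
    "Cl..C.......": 6.8768666,
    "N...C.......": 1.4051238,
    "Cl..C...C...": -0.0014586,
    "N...C...C...": -1.2520104,
}


def dcw(attributes, weights=CW_PROBE1):
    """Sum correlation weights over the attributes of one molecule.

    Attributes that are missing from the weight table are treated as
    rare, i.e. their correlation weight is fixed to zero.
    """
    return sum(weights.get(attribute, 0.0) for attribute in attributes)


# Hypothetical attribute list for a short Cl-C-C-N chain fragment.
example = [
    "Cl..........", "C...........", "C...........", "N...........",
    "Cl..C.......", "C...C.......", "N...C.......",
    "Cl..C...C...", "N...C...C...",
]
print(f"DCW = {dcw(example):.7f}")

# An attribute that is not listed (rare) leaves the sum unchanged.
print(dcw(example + ["Zz..........."]) == dcw(example))
```

Treating missing attributes as zero mirrors the role of limS described in the abstract: an attribute that occurs fewer than limS times in the subtraining set has no influence on the descriptor.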
Table 6. Examples of compounds that contain promoters of an increase or decrease in logTD50.
CAS | SMILES | logTD50
148-82-3 | O=C(O)C(N)Cc1ccc(cc1)N(CCCl)CCCl | 3.512
16301-26-1 | [O-]\[N+](CC)=N\CC | 3.667
1163-19-5 | Brc2c(Oc1c(Br)c(Br)c(Br)c(Br)c1Br)c(Br)c(Br)c(Br)c2Br | −0.542*
91-93-0 | COc1cc(ccc1/N=C=O)c2ccc(\N=C=O)c(OC)c2 | −0.740*
* Aromatic bonds are indicated in SMILES by lower-case 'c'; thus '=' indicates local double bonds that are not part of aromatic fragments.
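The distinction made in the footnote can be seen by counting characters in the last two SMILES strings of Table 6. The sketch below is a deliberately naive character count, not a SMILES parser: it assumes that every lower-case 'c' is an aromatic carbon, which holds for these two strings but not for bracket atoms such as [Sc]. The function name is illustrative.

```python
def bond_markers(smiles: str) -> dict:
    """Count aromatic carbons ('c') and explicit double bonds ('=').

    Naive character counting; valid only for SMILES without bracket
    atoms whose symbols contain a lower-case 'c'.
    """
    return {
        "aromatic_c": smiles.count("c"),
        "explicit_double": smiles.count("="),
    }


# Raw strings keep the backslash bond markers intact.
deca_bde = r"Brc2c(Oc1c(Br)c(Br)c(Br)c(Br)c1Br)c(Br)c(Br)c(Br)c2Br"
diisocyanate = r"COc1cc(ccc1/N=C=O)c2ccc(\N=C=O)c(OC)c2"

print(bond_markers(deca_bde))      # two aromatic rings, no '=' at all
print(bond_markers(diisocyanate))  # two aromatic rings plus the N=C=O groups
```

Both compounds contain twelve aromatic carbons, yet only the diisocyanate carries the '=' attribute, so a correlation weight attached to '=' reflects non-aromatic double bonds only.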

Share and Cite

MDPI and ACS Style

Toropov, A.A.; Toropova, A.P.; Benfenati, E. Additive SMILES-Based Carcinogenicity Models: Probabilistic Principles in the Search for Robust Predictions. Int. J. Mol. Sci. 2009, 10, 3106-3127. https://doi.org/10.3390/ijms10073106
