Additive SMILES-Based Carcinogenicity Models: Probabilistic Principles in the Search for Robust Predictions

Optimal descriptors calculated with the simplified molecular input line entry system (SMILES) have been utilized in modeling of carcinogenicity as continuous values (logTD50). These descriptors can be calculated using correlation weights of SMILES attributes calculated by the Monte Carlo method. A considerable subset of these attributes includes rare attributes. The use of these rare attributes can lead to overtraining. One can avoid the influence of the rare attributes if their correlation weights are fixed to zero. A function, limS, has been defined to identify rare attributes. The limS defines the minimum number of occurrences in the set of structures of the training (subtraining) set, to accept attributes as usable. If an attribute is present less than limS, it is considered “rare”, and thus not used. Two systems of building up models were examined: 1. classic training-test system; 2. balance of correlations for the subtraining and calibration sets (together, they are the original training set: the function of the calibration set is imitation of a preliminary test set). Three random splits into subtraining, calibration, and test sets were analysed. Comparison of abovementioned systems has shown that balance of correlations gives more robust prediction of the carcinogenicity for all three splits (split 1: rtest2=0.7514, stest=0.684; split 2: rtest2=0.7998, stest=0.600; split 3: rtest2=0.7192, stest=0.728).


Introduction
Carcinogenicity is an important endpoint from a toxicological point of view and quantitative structure -activity relationships (QSAR) are a tool for modeling this endpoint [1][2][3]. Usually, the QSAR analysis is based on molecular descriptors, calculated from molecular graphs [3,4]. However, the simplified molecular input line entry system (SMILES) [5][6][7] has become a prospective alternative to molecular graphs in QSAR analysis [8][9][10][11], owing to an expansion of the databases available via the Internet with molecular structures given in SMILES notation [15,16]. The present study aimed to estimate the ability of the SMILES-based optimal descriptors to be a tool for QSAR analysis of carcinogenicity of non-congeneric chemicals.

Materials and Methods
Carcinogenicity data: Experimental values for carcinogenicity were taken from publicly available data sources and further checked for chemical structures [17]. Carcinogenicity is expressed as the potency dose that induces cancer in rats (TD 50 , in mg/kg body weight). These values have been converted into mmol/kg body weight. The -log(TD 50 ) was examined as endpoint for the modelling. Initially, 401 chemicals have been extracted from [17]. These compounds were selected as substances with numerical data on the carcinogenicity available from [17].
However, this set (401 compounds) contains eight outliers (Table 1): for these compounds the difference between experimental and calculated (by our approach) value of -logTD50 is more than the double the standard error (2s). Probably the high symmetry and the presence of the N-nitroso group can lead to the unusual behaviour of these substances. These compounds were removed. Thus, 393 compounds were examined in this study. SMILES notations which were used in this study have been taken from [18].
We randomly split these 393 chemicals three times into training (n=165), calibration (n=167) and test (n=61) sets. The range of -log(TD 50 ) values for these sets is about from -2 to 5 logarithmic units. Below, these splits are denoted the Split1, Split2, and Split3 (The Supplementary Materials contain lists of these splits).
The modification of the descriptor that was used for modeling bee toxicity [10] is the tool for QSAR analysis of the carcinogenicity. This descriptor is calculated as follows: DCW(limS) = CW(dC) + Σ CW( 1 SA k ) + Σ CW( 2 SA k ) + Σ CW( 3 SA k ) (1) where 1 SA k , 2 SA k , 3 SA k are SMILES attributes. 1 SA k , 2 SA k , and 3 SA contain one, two, and three SMILES elements, respectively. The SMILES element can be one (e.g., 'C', 'c', 'N', 'S', etc.), two (e.g., 'Cl', 'Br', etc.), three ('C=O'), and four symbols ('[O-]'). The order of elements in depiction of the 2 SA k or 3 SA k is defined by the ASCII characters. In other words only one version of AB-sequence or ABC-sequence is possible in the list of the SMILES-attributes (not AB together with BA, or ABC together with CBA).

86-30-6 N-Nitrosodiphenylamine
The dC is the difference of the number of 'C' (capital letter) in the given SMILES notation minus the number of 'c' (lowercase letter) in the given SMILES notation. For example, this global SMILES attribute is denoted as '!001', if dC=N('c') -N('C')=1, and as '!-02' if the dC = −2. The CW(dC) is the correlation weight of the dC. The symbol "C" (capital letter) is the representation of a carbon atom in the sp 3 configuration. The symbol "c" (lowercase letter) is the representation of a carbon atom in sp 2 configuration. Thus, the dC is a measure of presence of rigid and flexible fragments in molecular architecture. The examined substances contain chlorine that gives an additional 'C'. The chlorine is not rigid fragment in molecular system and we have calculated the dC taking into account the 'C' from chlorine atoms. Table 2 contains an example of the representation of SMILES by the set of SMILES attributes. The CW(dC), CW( 1 SA k ), CW( 2 SA k ), and CW( 3 SA k ) are correlation weights of the above SMILES attributes. By means of the Monte Carlo method one can calculate numerical data for these weights which give maximal value of determination coefficient (square of the correlation coefficient, r 2 ) for the training set. However, most probably overtraining will result, i.e., an excellent model on the training set will be accompanied by a poor model for the test set. In order to avoid overtraining one can use the correlation balance [11], i.e., split the available chemicals into three sets: subtraining, calibration, and external test set. This approach gave reasonable result for the case of toxicity of 61 compounds [11], however for carcinogenicity of 393 compounds it is not enough. The use of the correlation balance and blocking of rare SMILES attributes [10] can improve the model. The blocking of rare attributes can be done by the scheme: if the number of SMILES from the training (subtraining) set which contain the SMILES attribute SA* is less than the limS, the correlation weight of the SA* should be fixed equal to zero, CW(SA*)=0.
Without rare attributes the model becomes better for the external test set. However, if limS is too large, the predictive potential of the model decreases, because the low number of active SMILES attribute cannot provide a high quality model. Thus, the central point of the system of modeling is the selection of the most efficient limS. The general scheme of the construction of optimal SMILES-based descriptors by the correlation balance method is represented in Figure 1.
This system can be denoted as a [Subtraining-Calibration-Test] system. The model can be satisfactory if the N 111 , i.e., the number of active (not blocked) attributes which are present in subtraining, calibration, and test sets, is as large as possible. The more traditional, "classic" approach is the construction of the model using united training set to predict the endpoint for an external test set. This system can be denoted as [Training-Test] system. This model can be satisfactory if the N 101 , i.e., the number of active attributes which are present in both the training and test set is as large as possible.
Then, steps of 1-7 are carried out for all CWs (the epoch of the optimization). By computational experiment the optimal number of the epochs has been established ( Table 3). This number is 10 ( Figure 2).

Results
Computational experiments ( Figure 3, Table 4) have shown that [Subtraining-Calibration-Test] system gives preferable results in comparison with the [Training-Test] system for all three splits. Thus the correlation balance (i.e., [Subtraining-Calibration-Test] system) improves QSAR model of log(TD50). It is the second successful experiment using the correlation balance for the QSAR analyses [11].    A useful characteristic of these models is W%=N 111 /Nact, where N 111 is the number of non blocked attributes which take place in subtraining, calibration, and test set; N act is the total number of attributes which are not blocked for a given limS. There is a correlation between W% and the determination coefficient for the test set ( Figure 4, Table 4). One can see from the results that good prediction ocurrs if the W% is higher than 80 (excepting [Subtraining-Calibration-Test ] for the Split3: in this case W%=78).
The model obtained in the first probe of the Monte Carlo optimization for the split1 with limS=4 is calculated as follows: -log(TD 50 ) = -0.5981 ( 0.0074) + 0.1118 (0.0004) * DCW(4)  [19,20] for the test set (N shifting =300 [20]) gave r 2 scrambling =0.0996 Figure 5 shows the model calculated with Equation 3, graphically. The Supplementary Materials contains numerical data on the experimental and calculated values with Equation 3 (split1 with limS=4). Table 5 contains numerical data on the correlation weights of SMILES attributes obtained in three probes of the Monte Carlo optimization.  Figure 4. Correlations between the determination coefficient for test set and W% for the three splits (see data from Table 4).

Discussion
One can see that the statistical characteristics of this model are reasonably good. As additional validation we have calculated Y-scrambling criterion, randomly shifting the carcinogenicity values [16,17]. If after the shifting (300 exchanges recommended in Ref. [17]) the correlation coefficient is less than 0.2, the correlation of our model can be classified as not chance correlation. Thus, the Y-scrambling has shown that the Equation 3 gives robust prediction (not chance correlation) for the test set.
In our previous study we examined different equations for the carcinogenicity model, and only one split into the subtraining, calibration and test set [15]. Examination of three splits indicates that good results occur for all three splits (Table 4). Thus, we expect that the present model is more robust, also considering the Y-scrambling test.
One can see from Table 5 that there are three categories of SMILES attributes: category 1 is the set of SMILES attributes with the correlation weight more than zero in all three probes of the Monte Carlo optimization; category 2 is the set of SMILES attributes with the correlation weight less than zero in all three probes; category 3 is the set of SMILES attributes with non consistent values, which have both correlation weights more than zero and correlation weights less zero in the three probes of the optimization. We can say that the category 1 contains promoters of logTD 50 increase; category 2 contains promoters of logTD 50 decrease; category 3 contains attributes with unclear influence on logTD 50 .
The !-02, #, Cl, S, [N+], and [O-] SMILES elements are promoters of logTD50 increase, thus of carcinogenicity. However it is necessary to take into account the value of correlation weight as well as the number of the given attribute in the subtraining set. Taking this into account, one can detect that the strongest promoters of the logTD50 increase are Cl (number Cl in the subtraining set is 61, the range of correlation weights of the Cl in three probes is 2. 19 -3.19) and [O-] (the number of [O-] in the subtraining set is 26, the range of correlation weights in three probes is 5.92 -6.96).
A similar analysis can be done for the promoters of logTD 50 decrease. For instance, the number of bracket s'(' in the subtraining set is 708 and the range of correlation weights of bracket is from -1.366 till -1.686; the number of '=' in the subtraining set is 77 and the range of correlation weight is from -1.866 till 2.144. Table 6 contains examples of compounds, which contain the mentioned SMILES attributes. Thus, the analysis of the correlation weights of SMILES attributes can help in searching for agents of the carcinogenicity phenomenon. An important feature of our model is that SMILES attributes are used for the QSAR predicted values and not only as tool for a binary classification (carcinogenic or not). Our model, which provides continuous values, can be used for risk assessment calculations, where a dose is necessary.  The applicability domain for these models can be defined from a probabilistic point of view: one can estimate the carcinogenic potential of compound if the SMILES of this compound does not contain rare SMILES attributes. A stronger definition of the applicability domain can be formulated taking into account the roles of the attributes (as promoters of logTD 50 increase/decrease): thus, one can estimate the carcinogenic potential of a compound if the SMILES of the compound contains solely apparent promoters of logTD 50 increase and/or decrease (without of SMILES attributes with unclear role).

-
Optimal descriptors calculated by the Monte Carlo method can provide reasonable prediction for the carcinogenicity log(TD50). -Blocking of rare SMILES attributes can improve statistical quality of the predicting. Splits into subtraining, calibration and test sets, as well splits into the training and test sets have influence to statistical characteristics of the models. In our case, in three splits examined in this study these characteristics are similar. - The correlation balance, i.e., the [Subtraining-Calibration-Test] system gave models which are better in comparison with models obtained with the more traditional [Training-Test] system.