Article

Semi-Correlations for Building Up a Simulation of Eye Irritation

by Andrey A. Toropov, Alla P. Toropova *, Alessandra Roncaglioni and Emilio Benfenati
Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Via Mario Negri 2, 20156 Milano, Italy
* Author to whom correspondence should be addressed.
Toxics 2023, 11(12), 993; https://doi.org/10.3390/toxics11120993
Submission received: 30 October 2023 / Revised: 24 November 2023 / Accepted: 2 December 2023 / Published: 6 December 2023
(This article belongs to the Special Issue Advances in Computational Toxicology and Their Exposure)

Abstract

The OECD recognizes that data on a compound’s ability to cause eye irritation are essential for the assessment of new compounds on the market. In silico models are frequently used to provide information when experimental data are lacking. Semi-correlations can be useful for building categorical models of eye irritation. Semi-correlations are latent regressions that can be used when the endpoint is expressed by two values: 1 for an active molecule and 0 for an inactive molecule. The regression line is based on the descriptor values, which serve to distribute the data into four classes: true positive, true negative, false positive, and false negative. The numbers of compounds in these classes are used to calculate the statistical criteria for assessing the predictive potential of the categorical model. In our model, the descriptor is the sum of what are termed correlation weights, which are defined by optimization using the Monte Carlo method. The target function of the optimization is related to the determination coefficient and the mean absolute error for the training set. Our model gives results that are better than those previously reported for the same endpoint.

1. Introduction

The Organization for Economic Cooperation and Development (OECD) recognizes that data on a compound’s ability to cause eye irritation are essential in the assessment of new compounds. The OECD has adopted three methods for assessing eye irritation [1,2]. When experimental values are lacking, models for the prediction of the adverse impact of chemicals, built using various representations of the molecular structure [7], are essential to modern theoretical chemistry [3,4,5,6]. Solving environmental problems by simulating endpoints that indicate hazardous or beneficial properties of substances is now common practice. There are two types of simulation aimed at solving such problems: the regression approach, where endpoints are expressed as numerical values on a continuous scale, and the categorical approach, where the endpoint takes one of two values representing active (1) or inactive (0) molecules. Semi-correlation is one of the possible approaches for building a categorical model [8,9]. It should be noted, however, that semi-correlations are latent regression models. The result is a somewhat unusual correlation, since the abscissa axis (if we continue the analogy with the usual regression model) displays only two values: 0 to mark an inactive compound and 1 to mark an active compound. There may be variations such as −1 and +1 to indicate inactive and active compounds, respectively. While the y-axis contains a wide range of values, their informational content boils down to how they are located relative to 0.5 (if the observations are coded as 0 and 1) or relative to 0 (if the observations are coded as −1 and +1).
For regression models built using the Monte Carlo technique, meaning through self-controlled random processes, a mechanistic interpretation of the model can be presented in probabilistic terms, partially dependent on the distribution of the molecular features extracted from the SMILES. This interpretation is revealed through several runs of procedures to find the so-called correlation weights provided for each of the above-mentioned molecular features of the transmitted SMILES. In this case, the descriptor on which the model is based is a simple sum of these correlation weights calculated by the Monte Carlo method.
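The idea of finding correlation weights through a random search can be illustrated with a minimal sketch. It is not the CORAL implementation: the function names (`pearson`, `descriptor`, `optimize_weights`), the perturbation step, the toy molecules, and the use of a single correlation coefficient as the objective are all assumptions made for illustration.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def descriptor(tokens, weights):
    """Descriptor of one molecule: the sum of the weights of its features."""
    return sum(weights[t] for t in tokens)

def optimize_weights(molecules, labels, epochs=15, step=0.5, seed=0):
    """Random perturbation search: keep a change only if the correlation improves."""
    rng = random.Random(seed)
    features = sorted({t for mol in molecules for t in mol})
    weights = {f: 1.0 for f in features}

    def score():
        values = [descriptor(m, weights) for m in molecules]
        return pearson(values, labels)

    start = best = score()
    for _ in range(epochs):
        for f in features:
            old = weights[f]
            weights[f] = old + rng.uniform(-step, step)
            trial = score()
            if trial > best:
                best = trial        # accept the perturbed weight
            else:
                weights[f] = old    # reject and restore
    return weights, start, best

# Hypothetical toy data: token lists and binary activity labels.
molecules = [["C", "C", "Br"], ["C", "Br"], ["C", "C", "n"],
             ["C", "n", "n"], ["C", "C", "C", "Br"], ["C", "n"]]
labels = [1, 1, 0, 0, 1, 0]
weights, start, best = optimize_weights(molecules, labels)
```

Because a perturbation is kept only when the objective improves, the final correlation can never be lower than the starting one; different seeds generally give different weights, which is why several runs are compared.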
Three classes of molecular features can be distinguished when the procedure for calculating correlation weights is repeated several times. The first contains molecular features that receive positive correlation weights in all runs of the Monte Carlo optimization. Despite the probabilistic nature of this process, most of these molecular features are very likely to favor an increase in the value of the endpoint studied in the regression. The second class contains molecular features that receive a negative correlation weight in every run of the Monte Carlo optimization. Molecular features of this kind will, with fairly high probability, contribute to a decrease in the value of the endpoint in question. Finally, the third class contains molecular features for which both positive and negative correlation weights are encountered across several Monte Carlo optimizations. In principle, by comparing how many times positive and negative weights are observed, one could assign these molecular features to either the first or the second class. This kind of comparison requires many runs of the optimization procedure; in practice the number of runs is limited, and in those circumstances it is simpler and more logical to treat features of this third class as having no clear role in raising or lowering the value of the endpoint in question.
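The three-class scheme can be written down as a short sketch. The function name `classify_features`, the class labels, and the numerical weights are hypothetical; only the sign-based rule follows the text.

```python
def classify_features(runs):
    """Assign each feature a role from the signs of its weights over all runs.

    runs: list of dicts mapping a SMILES feature to its correlation weight,
    one dict per Monte Carlo optimization run.
    """
    features = set().union(*runs)
    roles = {}
    for feature in sorted(features):
        values = [run[feature] for run in runs if feature in run]
        in_all_runs = len(values) == len(runs)
        if in_all_runs and all(v > 0 for v in values):
            roles[feature] = "increase"   # first class: always positive
        elif in_all_runs and all(v < 0 for v in values):
            roles[feature] = "decrease"   # second class: always negative
        else:
            roles[feature] = "unclear"    # third class: mixed signs or missing
    return roles

# Hypothetical weights from three runs.
runs = [
    {"Br": 0.4, "n": -0.3, "#": 0.10},
    {"Br": 0.1, "n": -0.5, "#": -0.20},
    {"Br": 0.3, "n": -0.1, "#": 0.05},
]
roles = classify_features(runs)
```

Treating a feature that is missing from some run as "unclear" is a design choice of this sketch, not something the text prescribes.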
There are also cases where a model (ordinary regression or semi-correlation) is built on exclusively positive correlation weights (every feature ends up in the first class) or exclusively negative correlation weights (every feature ends up in the second class). In such cases, the only way to understand the contributions of the molecular features is to compare the correlation weights and the frequencies of the corresponding molecular features extracted from the SMILES. Since semi-correlations, as already noted, are special cases of regression models, the situations listed here also apply to them. Thus, one can distinguish three classes of molecular features for semi-correlations as well. Therefore, for an endpoint with only two values (active/inactive), one can apply techniques tested on conventional regression models, where the endpoint has a range of values.
Here, the semi-correlations are used to develop a categorical model of eye irritation. The CORAL software (http://www.insilico.eu/coral, accessed on 25 November 2023) is a tool for building semi-correlations [8,9].

2. Materials and Methods

The data on eye irritation from 5220 chemicals (including 3874 positive and 1346 negative) were taken from the literature [10]. To develop the models, the data were randomly divided into four subsets with approximately equal numbers of chemicals: these are referred to as the active and passive training sets, calibration set, and validation set. The traditional training set is structured in three subsets (active and passive training sets and a calibration set). Special interactions take place in this trio. The active training set is intended for building the initial model, with correlation weights that force the experimental and calculated values of the endpoint to correlate. The passive training set checks how suitable the resulting correlation weights are for the molecules distributed in the passive training set (i.e., absent in the active training set). Last, the calibration set checks for the absence of overtraining (meaning a situation where a very good correlation on the training sets is accompanied by a “fall” in the coefficient of determination for the calibration set). Thus, it is at this stage that the final model is optimized. The validation set is not used in the process of model building and provides the statistics for substances not used in the training sets.
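A random division into four subsets of approximately equal size can be sketched as follows. The function name `split_four` and the use of a seeded shuffle are assumptions; the actual splitting procedure in CORAL may differ, and the set sizes reported for the real splits are only approximately equal.

```python
import random

def split_four(items, seed=0):
    """Shuffle the items and deal them into four sets of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    names = ["active_training", "passive_training", "calibration", "validation"]
    # Every fourth item, starting at offset i, goes to the i-th set.
    return {name: shuffled[i::4] for i, name in enumerate(names)}

# Compound indices stand in for the 5220 chemicals of the data set.
sets = split_four(range(5220))
```

Changing `seed` yields a different split, which is how a group of splits for the system of self-consistent models could be generated in this sketch.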

2.1. Optimal SMILES-Based Descriptors

The simplified molecular input-line entry system (SMILES) is a representation of the molecular structure [7]. The optimal descriptor is a sum of correlation weights of SMILES atoms for eye irritation, calculated by the Monte Carlo method described in our previous work [11]:
$$DCW(T, N) = \sum_{k} CW(S_k) \tag{1}$$
Here, S_k is a SMILES attribute [7], that is, one symbol or a group of symbols that cannot be considered separately (e.g., ‘Cl’, ‘Br’, ‘[N+]’), and CW(S_k) is its correlation weight (CW). The physical meaning of SMILES fragments is described in the literature [7,8,9] as well as on the Internet (https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html, accessed on 15 October 2023). T (=1) is the threshold defining active SMILES atoms (those with a frequency greater than T in the active training set); N (=15) is the number of iterations of the optimization by the Monte Carlo method. Table 1 contains an example of calculating the optimal SMILES-based descriptor.
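The descriptor calculation can be sketched in a few lines. The regular expression used to split a SMILES string into attributes and the rule that unknown attributes contribute zero are assumptions of this sketch (CORAL's own tokenization may differ); the two weights are the values listed in Table 1.

```python
import re

# Bracketed atoms and two-letter halogens are kept whole; everything else
# is taken one character at a time (a simplifying assumption).
SMILES_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def smiles_atoms(smiles):
    """Split a SMILES string into its attributes S_k."""
    return SMILES_ATOM.findall(smiles)

def dcw(smiles, weights):
    """Sum of correlation weights; attributes without a weight add 0.0."""
    return sum(weights.get(atom, 0.0) for atom in smiles_atoms(smiles))

# Correlation weights for 'C' and '#' as given in Table 1.
weights = {"C": -0.8947, "#": 2.5018}
value = dcw("CC#CC", weights)
```

For 2-butyne (CC#CC) the sum is about −1.077, in line with the value of −1.0769 given in Table 1 up to rounding.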

2.2. Model of Eye Irritation

The model of eye irritation is defined as
$$Y = C_0 + C_1 \times DCW(T, N)$$
$$\mathrm{Category}(\mathrm{SMILES}) = \begin{cases} 1\ (\text{active}), & \text{if } Y \geq 0.5 \\ 0\ (\text{inactive}), & \text{if } Y < 0.5 \end{cases}$$
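As a brief illustration, the model and the 0.5 cut-off can be applied to the descriptor value from Table 1 (−1.0769) with the coefficients reported for split 1 (C0 = 1.091, C1 = 0.03088). The function name `predict` is hypothetical.

```python
def predict(dcw_value, c0, c1, threshold=0.5):
    """Return the latent value Y and the category (1 = active, 0 = inactive)."""
    y = c0 + c1 * dcw_value
    category = 1 if y >= threshold else 0
    return y, category

# Descriptor value from Table 1 and the split 1 coefficients.
y, category = predict(-1.0769, c0=1.091, c1=0.03088)
```

Here Y is roughly 1.06, which lies above 0.5, so the compound is assigned to the active class.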

2.3. Monte Carlo Optimization

Equation (1) needs the numerical data for the CW, calculated by the Monte Carlo optimization. Here two target functions (TF0 and TF1) for the Monte Carlo optimization are examined:
$$TF_0 = r_{AT} + r_{PT} - \left| r_{AT} - r_{PT} \right| \times 0.1$$
$$TF_1 = TF_0 + IIC_C \times 0.5$$
Here, r_AT and r_PT are the correlation coefficients between the observed and predicted endpoints for the active and passive training sets, respectively. IIC_C is the index of ideality of correlation [12], calculated with data on the calibration set as follows:
$$IIC_C = r_C \times \frac{\min\left({}^{-}MAE_C,\ {}^{+}MAE_C\right)}{\max\left({}^{-}MAE_C,\ {}^{+}MAE_C\right)}$$
$$\min(x, y) = \begin{cases} x, & \text{if } x < y \\ y, & \text{otherwise} \end{cases}$$
$$\max(x, y) = \begin{cases} x, & \text{if } x > y \\ y, & \text{otherwise} \end{cases}$$
$${}^{-}MAE_C = \frac{1}{{}^{-}N} \sum_{\Delta_k < 0} \left| \Delta_k \right|, \quad {}^{-}N \text{ is the number of } \Delta_k < 0$$
$${}^{+}MAE_C = \frac{1}{{}^{+}N} \sum_{\Delta_k \geq 0} \left| \Delta_k \right|, \quad {}^{+}N \text{ is the number of } \Delta_k \geq 0$$
$$\Delta_k = \mathrm{observed}_k - \mathrm{calculated}_k$$
r_C is the correlation coefficient between the observed and calculated values of the endpoint on the calibration set; the subscript ‘C’ indicates the calibration set. Observed and calculated are the corresponding values of Y used to define the categories (active/inactive).
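The target functions can be sketched in code, reading them as TF0 = r_AT + r_PT − |r_AT − r_PT| × 0.1 and TF1 = TF0 + IIC_C × 0.5, with IIC_C equal to r_C multiplied by the ratio of the smaller to the larger of the two mean absolute errors (for negative and non-negative residuals). The function names, the toy data, and the choice to return 0.0 when one group of residuals is empty are assumptions of this sketch.

```python
def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def iic(observed, calculated):
    """Index of ideality of correlation for one set of compounds."""
    deltas = [o - c for o, c in zip(observed, calculated)]
    negative = [abs(d) for d in deltas if d < 0]
    non_negative = [abs(d) for d in deltas if d >= 0]
    if not negative or not non_negative:
        return 0.0  # assumption: the index is treated as 0 when a group is empty
    mae_neg = sum(negative) / len(negative)
    mae_pos = sum(non_negative) / len(non_negative)
    r = pearson(observed, calculated)
    return r * min(mae_neg, mae_pos) / max(mae_neg, mae_pos)

def tf0(r_at, r_pt):
    """Rewards high correlations and penalizes their disagreement."""
    return r_at + r_pt - abs(r_at - r_pt) * 0.1

def tf1(r_at, r_pt, iic_c):
    """TF0 extended with the calibration-set index of ideality."""
    return tf0(r_at, r_pt) + iic_c * 0.5

# Hypothetical calibration data: binary observations and latent values.
observed = [1, 0, 1, 0]
calculated = [0.9, 0.2, 0.7, 0.1]
iic_c = iic(observed, calculated)
```

In this toy case the two mean absolute errors are 0.15 and 0.20, so the index equals 0.75 times the correlation coefficient; the more symmetric the errors around the regression line, the closer the index is to r_C.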

2.4. The System of Self-Consistent Models

The system of self-consistent models [11,13,14] for five random splits into the training (visible) and validation (invisible) sets confirms the good predictive potential of the models. The training set here is divided into active training, passive training, and calibration sets. Thus, the difference between models reflects the difference in training sets. However, the key attribute of the system of self-consistent models is the unified method for the validation of these models: the i-th model has its own i-th validation set. The validation sets are far from identical (Table S1, Supplementary Materials). This means that multiple conditions are explored, that each result is representative of a case obtained by chance, and that the results should be evaluated jointly.
The measure of self-consistency is the average value and dispersion of the Matthews correlation coefficient (MCC) on different validation sets. The corresponding computational experiments are represented by the matrix:
$$\begin{pmatrix} M_1(V_1){:}V_1 \to MCC_{v_{11}} & \cdots & M_5(V_5){:}V_1 \to MCC_{v_{51}} \\ \vdots & \ddots & \vdots \\ M_1(V_1){:}V_5 \to MCC_{v_{15}} & \cdots & M_5(V_5){:}V_5 \to MCC_{v_{55}} \end{pmatrix}$$
M_i is the i-th model; V_j is the list of compounds employed as the validation set in the case of the j-th split; MCC_{v_ij} is the Matthews correlation coefficient obtained when the i-th model is applied to the j-th validation set.
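The self-consistency measure can be sketched as the mean and spread of all entries of such a matrix. The function name `self_consistency`, the use of the sample standard deviation as the "dispersion", and the 2 × 2 matrix of values are assumptions for illustration only; they are not results from this study.

```python
from statistics import mean, stdev

def self_consistency(mcc_matrix):
    """Mean and sample standard deviation over all model/validation-set pairs."""
    values = [v for row in mcc_matrix for v in row]
    return mean(values), stdev(values)

# Hypothetical matrix: entry [i][j] is the MCC of model i on validation set j.
mcc_matrix = [
    [0.80, 0.70],
    [0.75, 0.85],
]
average, spread = self_consistency(mcc_matrix)
```

A small spread relative to the average indicates that the models behave similarly on each other's validation sets.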

3. Results

The models for five random splits are the following:
Y = 1.091 + 0.03088 × DCW(1,15)
Y = 1.111 + 0.03304 × DCW(1,15)
Y = 1.075 + 0.02636 × DCW(1,15)
Y = 1.129 + 0.02883 × DCW(1,15)
Y = 1.107 + 0.04178 × DCW(1,15)
Table 2 sets out the statistical characteristics of the models for five random splits. One can see that these characteristics are quite similar for all five random splits. The average value of MCC for the validation sets is 0.8891 ± 0.0153. The numbers of the optimized parameters in the Monte Carlo optimization process are 31, 28, 30, 30, and 31, respectively, for splits 1–5.
$$\mathrm{Sens\ (sensitivity)} = \frac{TP}{TP + FN}$$
$$\mathrm{Spec\ (specificity)} = \frac{TN}{TN + FP}$$
$$\mathrm{Acc\ (accuracy)} = \frac{TP + TN}{TP + FP + FN + TN}$$
$$\mathrm{MCC\ (Matthews\ correlation\ coefficient)} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
Table 2 indicates that the evolution of the model enables us to optimize it. The statistics of the active and passive subsets (already very good) are even better for the validation set. The optimized model, as in the calibration set, is expected to provide similar statistics when tested on new substances, as in the case of the validation set. In fact, the values for the calibration and validation sets are quite similar. This provides evidence that the model can be expected to give good results when used for new substances. As described above, the training set is not balanced because there are more active (irritant) substances than inactive ones. Under these conditions, accuracy may give a biased picture of the model’s performance. The most represented class is commonly predicted more easily, and indeed in our case the sensitivity values are higher than the specificity values. For a better evaluation of the overall performance of the model on unbalanced datasets, the MCC is more appropriate; MCC ranges from −1 to 1, and good values are those above 0.5.
Table 3 lists the correlation weights of the SMILES atoms for splits 1–5. Many of the SMILES’ attributes are rare, meaning they are present only in a few substances, and their roles in the five splits are different. This implies that these controversial features can hardly be considered as a reliable basis for statistical conclusions. In our case, these attributes are not used in the simulation.
The correlation weights and even their lists are different for the five random splits, but the statistical quality of the models is quite good and close. To explain this, we have to bear in mind that the data set, which is large, with more than 5000 substances, is sufficient to generate several good models, which are not identical, because in our exercise we split these substances randomly. Each model has an adequate, good statistical basis, but depending on the composition of the training set, different features may be extracted.
Table 1 gives an example of the calculation of the property value Y for a simple SMILES.
As noted above, to obtain a basis for determining a mechanistic interpretation, it is necessary to carry out several runs of optimizing the correlation weights of the molecular features extracted from the SMILES. Table 4 gives the results of such tests using splits 1–5. Table 5 indicates that for the system of models under consideration, there are promoters of both “increase/decrease” in the activity. There is a certain analogy in the lists but also differences. Table 6 contains a comparison of the similarities and differences in the mentioned lists.
There are two factors that allow for an evaluation of a molecular feature as a potential promoter of the increase in the activity in question. It should be noted that it is more appropriate to talk not so much about an “increase” as about the “probability of activity” for each specific substance. These factors are the frequency of the occurrence and stability of the correlation weight. For example, bromine (Br) is present in all the lists of promoters of an increase in the likelihood of activity, while nitrogen (n, i.e., nitrogen in an aromatic ring, according to the SMILES nomenclature) is present in all the lists of molecular features reducing the probability of activity for the five splits. The frequency of phosphorus and fluorine is too low to assess them as potential promoters of activity or inactivity (Table 4 and Table 5). We can therefore see whether a certain molecular feature has been selected in all five splits, which is the preferable situation. The value of the coefficient and its consistency is another important feature. If the coefficient is close to 0, its role is lower. If the coefficient is variable, we expect greater uncertainty on its role.
For instance, bromine ‘Br’ contributes to eye irritation, and the coefficients span from 0.11 to 0.42. Silicon and boron are other features consistently contributing to eye irritation with relatively high coefficients. The triple bond (represented by the symbol #) is a feature selected in all the splits, but the coefficients are lower: 0.002 (almost negligible), 0.02 (quite small), and 0.26. In addition, in some splits, there are molecular features with very high coefficients, but these findings are not replicated in other splits. This is the case of charged nitrogen ‘[N+]’, with coefficients of 0.87 and 1.02 in two of the five splits, while it has not been selected in the third. This is probably because of the particular composition of the splits, with certain substances present in the calibration and validation sets.
Similar considerations hold for the molecular features that reduce eye irritation.
A plus denotes the presence of a molecular feature in the list of promoters for an increase or decrease of the “probability of activity”.
Table 5 presents the results of our computer experiments, which set out to identify the molecular fragments of the SMILES that influence the likelihood of a substance affecting the eyes. It can be seen that some features have a ‘convincing’ influence on the effects of substances on the eyes, such as bromine ‘Br’, silicon ‘[Si]’, and nitrogen ‘N’ and ‘n’. There are also ‘fragments that may have such an influence’, depending on the distribution of the substances into the training and validation sets; these include the triple bond ‘#’, phosphorus ‘P’, sulfur ‘S’, and some others.
The applicability domain in each split was defined from the statistical defects [15]. The Supplementary Materials section contains this data on split 1 (Table S2). This is not a ‘strict’ way of determining the applicability domain. Outliers for this approach are “suspicious” molecules that have a significant number of rare molecular features.
Table 6 compares the statistical quality of the different categorical models of eye irritation, also considering those published in the recent literature.
The difference between the model-building system considered here and the more generally accepted ones is the use of a structured training set that includes three functionally different groups of compounds. The active and passive training sets complement each other, postponing the onset of overtraining. The third functional component (the calibration set) should catch the moment when overtraining starts. However, overtraining may not occur at all owing to the use of the index of ideality of correlation. In other words, the feasibility of using the index of ideality of correlation to improve the statistical quality of models on external validation sets, already shown for various substances and endpoints [12,16,17], is demonstrated once again.

4. Discussion

The disadvantage of the scheme considered here (system of self-consistent models) is the need to compare a sufficiently large amount of digital data, to divide all the available substances into four sets. Furthermore, only considering the various groups of splits into training and validation sets can give an idea of the genuine predictive potential of a particular approach. The approach used here to generate the splits is through a random process. If we consider a random process as a sequence of changes in random variables, then, in the case under consideration, the random variables are the determination coefficients and their standard deviations. We get two levels of random processes. The first level of change in the specified quantities for one random process is building a model for a fixed division into the training and validation sets. The second level is the consideration of a certain group of random processes for different splits into the training and validation sets. The second level is designed to answer the following questions. Question 1: Is it possible to build up a model with our approach to selected parameters? Question 2: How reproducible are the statistical characteristics of the model in the chosen parameterization for various distributions into the training and validation sets? Question 3: If the statistical characteristics are reproducible, what is their variance?
By knowing the answers to these questions, one can assess how reliable the model is. The self-consistency of the models does not signify their identity. Each model is different, as we discussed above, but each model may be appropriate. Even the proportions in the distribution of available data into the active/passive training, calibration, and validation sets are not subject to any restrictions. However, the mentioned proportions, according to many computational experiments, are the best when using equivalent numbers of compounds for each of the four mentioned sets.
The main methodological novelty applied in this study is the improved exploitation of the statistical parameters governing the modelling task. It is possible by chance to get an apparently good model, but unfortunately the results will not be replicated when applying the model to new substances. The algorithms that we implemented in the past and then applied here are useful to reduce the risk of overfitting the models. The approach at the basis for this, as explained above, relates to the algorithms of the index of ideality of correlation and the system of self-consistent models. This approach necessitates splitting into different subsets, which means added complexity, as discussed above. However, at the same time, this provides the opportunity to optimize the parameters of the final model, such as the coefficients, addressing the calibration sets. This approach is rewarding in terms of the performance on the validation set, as documented above.
Protecting animals from cruelty is one of the main motivations for the development of QSAR in general, as well as using QSAR to estimate eye irritation [18]. To this end, the development of binary models that give a forecast in the form of “active”—“inactive” is encouraged. There are developments aimed at building multivariate categorical models of eye irritation obtained through decision trees for individual substances [18], as well as for mixtures [19]. Any models for eye irritation are very unstable in cases of significant structural variations in molecules [6]. To overcome these difficulties, it is necessary, on the one hand, to improve (expand) the training databases and, on the other hand, to improve methods for validation of the predictive potential of the models [20,21]. Taking these circumstances into account, one of the most representative databases was selected to test the considered approach, and the self-consistency of the models was studied to search for ways to improve the criteria of the predictive potential of the models. The simplicity is on the verge of primitiveness (since only the simplest features of molecular structures are involved), which is the basis for the development of models of eye irritation. This, combined with the usual statistical assessment on the principle of comparing average values and their variances, perhaps is the main cause of the good statistical quality of the models (note that complication almost always leads to overtraining). The involvement of the correlation ideality index somewhat complicates the stochastic Monte Carlo optimization process. However, this complication is justified here and in other works where the mentioned index was used [11,12,13,14,17].
We note that this model starts with substances classified into two classes: active or inactive. Thus, there is no information about the potency of the effect. For other endpoints, for instance skin sensitization, a finer granularity of the effects has been introduced. For eye irritation specifically, it would be useful in the future to distinguish between effects of different severity, including chemical burns (irritation versus corrosion).

5. Conclusions

Building models using the structurization of the compounds selected for the training set into three functionally different subsets can effectively improve the predictive potential of the categorical models for eye irritation using semi-correlations. The unusual impact of the index of ideality of correlation on the simulation process allows for an improvement in the statistical quality of the model for the external validation set based on the good results of the calibration set. This improves the predictive potential of the categorical models. Using this methodology, we obtained good models for eye irritation. The model is relatively simple since it only requires the SMILES structure of the substances without calculating the molecular descriptors. The model’s interpretation of the molecular parameters is simple as well since it points out which atom or simple molecular feature indicates the increase or decrease of the activity. However, it should be noted that these recommendations are of a qualitative (non-quantitative) nature.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/toxics11120993/s1, Table S1: Confirmation of differences for used splits; Table S2: Technical data of model for split #1.

Author Contributions

Conceptualization, A.A.T., A.P.T., A.R. and E.B.; data curation, A.A.T., A.P.T., A.R. and E.B.; writing—original draft preparation, A.A.T., A.P.T., A.R. and E.B.; writing—review and editing, A.A.T., A.P.T., A.R. and E.B.; supervision, A.R. and E.B.; project administration, E.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. OECD. Test No. 437: Bovine Corneal Opacity and Permeability Test Method for Identifying (i) Chemicals Inducing Serious Eye Damage and (ii) Chemicals Not Requiring Classification for Eye Irritation or Serious Eye Damage; OECD Publishing: Paris, France, 2009.
  2. OECD. Test No. 438: Isolated Chicken Eye Test Method for Identifying (i) Chemicals Inducing Serious Eye Damage and (ii) Chemicals Not Requiring Classification for Eye Irritation or Serious Eye Damage; OECD Publishing: Paris, France, 2017.
  3. Tsakovska, I.; Saliner, A.G.; Netzeva, T.; Pavan, M.; Worth, A.P. Evaluation of SARs for the prediction of eye irritation/corrosion potential-structural inclusion rules in the BfR decision support system. SAR QSAR Environ. Res. 2007, 18, 221–235.
  4. Saliner, A.G.; Patlewicz, G.; Worth, A.P. A review of (Q)SAR models for skin and eye irritation and corrosion. QSAR Comb. Sci. 2008, 27, 49–59.
  5. Luechtefeld, T.; Maertens, A.; Russo, D.P.; Rovida, C.; Zhu, H.; Hartung, T. Analysis of draize eye irritation testing and its prediction by mining publicly available 2008–2014 REACH data. Altex 2016, 33, 123–134.
  6. Verheyen, G.R.; Braeken, E.; Van Deun, K.; Van Miert, S. Evaluation of existing (Q)SAR models for skin and eye irritation and corrosion to use for REACH registration. Toxicol. Lett. 2017, 265, 47–52.
  7. Weininger, D. Smiles. 3. Depict. Graphical Depiction of Chemical Structures. J. Chem. Inform. Comput. Sci. 1990, 30, 237–243.
  8. Toropova, A.P.; Toropov, A.A. CORAL: Binary classifications (active/inactive) for drug-induced liver injury. Toxicol. Lett. 2017, 268, 51–57.
  9. Toropova, A.P.; Toropov, A.A.; Benfenati, E. Semi-correlations as a tool to model for skin sensitization. Food Chem. Toxicol. 2021, 157, 112580.
  10. Wang, Q.; Li, X.; Yang, H.; Cai, Y.; Wang, Y.; Wang, Z.; Li, W.; Tang, Y.; Liu, G. In silico prediction of serious eye irritation or corrosion potential of chemicals. RSC Adv. 2017, 7, 6697–6703.
  11. Toropova, A.P.; Toropov, A.A.; Roncaglioni, A.; Benfenati, E. The System of Self-Consistent Models: QSAR Analysis of Drug-Induced Liver Toxicity. Toxics 2023, 11, 419.
  12. Kumar, P.; Kumar, A.; Sindhu, J. Design and development of novel focal adhesion kinase (FAK) inhibitors using Monte Carlo method with index of ideality of correlation to validate QSAR. SAR QSAR Environ. Res. 2019, 30, 63–80.
  13. Toropov, A.A.; Toropova, A.P.; Roncaglioni, A.; Benfenati, E. The system of self-consistent semi-correlations as one of the tools of cheminformatics for designing antiviral drugs. New J. Chem. 2021, 45, 20713–20720.
  14. Toropova, A.P.; Toropov, A.A.; Roncaglioni, A.; Benfenati, E. The system of self-consistent models for vapour pressure. Chem. Phys. Lett. 2022, 790, 139354.
  15. Toropov, A.A.; Toropova, A.P.; Raitano, G.; Benfenati, E. CORAL: Building up QSAR models for the chromosome aberration test. Saudi J. Biol. Sci. 2019, 26, 1101–1106.
  16. Stoičkov, V.; Stojanović, D.; Tasić, I.; Šarić, S.; Radenković, D.; Babović, P.; Sokolović, D.; Veselinović, A.M. QSAR study of 2,4-dihydro-3H-1,2,4-triazol-3-ones derivatives as angiotensin II AT1 receptor antagonists based on the Monte Carlo method. Struct. Chem. 2018, 29, 441–449.
  17. Ahmadi, S. Mathematical modeling of cytotoxicity of metal oxide nanoparticles using the index of ideality correlation criteria. Chemosphere 2020, 242, 125192.
  18. Basant, N.; Gupta, S.; Singh, K.P. A three-tier QSAR modeling strategy for estimating eye irritation potential of diverse chemicals in rabbit for regulatory purposes. Regul. Toxicol. Pharmacol. 2016, 77, 282–291.
  19. Abraham, M.H.; Gola, J.M.R.; Cometto-Muñiz, J.E. An assessment of air quality reflecting the chemosensory irritation impact of mixtures of volatile organic compounds. Environ. Int. 2016, 86, 84–91.
  20. Geerts, L.; Adriaens, E.; Alépée, N.; Guest, R.; Willoughby, J.A., Sr.; Kandarova, H.; Drzewiecka, A.; Fochtman, P.; Verstraelen, S.; Van Rompay, A.R. CON4EI: Evaluation of QSAR models for hazard identification and labelling of eye irritating chemicals. Toxicol. Vitr. 2018, 49, 90–98.
  21. Silva, A.C.; Borba, J.V.V.B.; Alves, V.M.; Hall, S.U.S.; Furnham, N.; Kleinstreuer, N.; Muratov, E.; Tropsha, A.; Andrade, C.H. Novel computational models offer alternatives to animal testing for assessing eye irritation and corrosion potential of chemicals. Artif. Intell. Life Sci. 2021, 1, 100028.
Table 1. An example of the model calculation (CAS = 503-17-3; SMILES = CC#CC).

| Sk | CW(Sk) | Frequency in Active Training Set | Frequency in Passive Training Set | Frequency in Calibration Set |
|---|---|---|---|---|
| C | −0.8947 | 1199 | 1177 | 1130 |
| C | −0.8947 | 1199 | 1177 | 1130 |
| # (triple bond) | 2.5018 | 67 | 70 | 77 |
| C | −0.8947 | 1199 | 1177 | 1130 |
| C | −0.8947 | 1199 | 1177 | 1130 |
| Sum of CW(Sk) | −1.0769 | | | |

Y = 1.091 + 0.03088 × (−1.0769) = 1.057 (active because Y is larger than 0.5).
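
The calculation in Table 1 can be retraced with a few lines of Python. This is only a minimal sketch: the weights and the two regression coefficients are copied from Table 1 and the line above, while the tokenizer, the function names, and the rule that unknown symbols contribute zero are illustrative assumptions, not part of the published software.

```python
import re

# Correlation weights copied from Table 1 (only the two symbols needed here).
CW = {"C": -0.8947, "#": 2.5018}

# Intercept and slope of the semi-correlation line quoted with Table 1.
C0, C1 = 1.091, 0.03088

# Hypothetical tokenizer: bracket atoms, two-letter halogens, then single characters.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def smiles_atoms(smiles):
    """Split a SMILES string into the symbols Sk used as descriptor terms."""
    return TOKEN.findall(smiles)


def dcw(smiles, weights):
    """Sum the correlation weights; unknown symbols add zero (an assumption)."""
    return sum(weights.get(tok, 0.0) for tok in smiles_atoms(smiles))


def predict(smiles, weights=CW, threshold=0.5):
    """Return Y and the class label (True = active when Y exceeds the threshold)."""
    y = C0 + C1 * dcw(smiles, weights)
    return y, y > threshold


y, active = predict("CC#CC")
print(smiles_atoms("CC#CC"), round(dcw("CC#CC", CW), 4), round(y, 3), active)
```

With the rounded weights printed in Table 1 the sum comes out at about −1.0770 and Y at about 1.058, which agrees with the tabulated −1.0769 and 1.057 up to rounding of the printed coefficients.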
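Table 2. The statistical characteristics of the models for eye irritation for five random splits.

| Split | Set * | Sens | Spec | Acc | MCC | TP | TN | FP | FN | All |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 0.9303 | 0.5459 | 0.8166 | 0.5332 | 868 | 214 | 178 | 65 | 1325 |
| 1 | P | 0.9193 | 0.5407 | 0.7974 | 0.5117 | 809 | 226 | 192 | 71 | 1298 |
| 1 | C | 0.9856 | 0.8842 | 0.9654 | 0.8897 | 1026 | 229 | 30 | 15 | 1300 |
| 1 | V | 0.9804 | 0.8917 | 0.9614 | 0.8839 | 1000 | 247 | 30 | 20 | 1297 |
| 2 | A | 0.9247 | 0.5639 | 0.8122 | 0.5397 | 847 | 234 | 181 | 69 | 1331 |
| 2 | P | 0.9461 | 0.6250 | 0.8480 | 0.6267 | 843 | 245 | 147 | 48 | 1283 |
| 2 | C | 0.9941 | 0.9163 | 0.9781 | 0.9321 | 1008 | 241 | 22 | 6 | 1277 |
| 2 | V | 0.9858 | 0.9130 | 0.9707 | 0.9099 | 1038 | 252 | 24 | 15 | 1329 |
| 3 | A | 0.9322 | 0.5427 | 0.8140 | 0.5342 | 852 | 216 | 182 | 62 | 1312 |
| 3 | P | 0.9381 | 0.5661 | 0.8285 | 0.5639 | 849 | 214 | 164 | 56 | 1283 |
| 3 | C | 0.9800 | 0.8780 | 0.9568 | 0.8753 | 980 | 259 | 36 | 20 | 1295 |
| 3 | V | 0.9867 | 0.8545 | 0.9594 | 0.8734 | 1041 | 235 | 40 | 14 | 1330 |
| 4 | A | 0.8990 | 0.5633 | 0.7860 | 0.5002 | 783 | 249 | 193 | 88 | 1313 |
| 4 | P | 0.9237 | 0.5807 | 0.8093 | 0.5533 | 823 | 259 | 187 | 68 | 1337 |
| 4 | C | 0.9953 | 0.9558 | 0.9884 | 0.9595 | 1060 | 216 | 10 | 5 | 1291 |
| 4 | V | 0.9943 | 0.9310 | 0.9828 | 0.9414 | 1041 | 216 | 16 | 6 | 1279 |
| 5 | A | 0.9147 | 0.5523 | 0.7960 | 0.5145 | 826 | 243 | 197 | 77 | 1343 |
| 5 | P | 0.9198 | 0.5381 | 0.7894 | 0.5109 | 791 | 240 | 206 | 69 | 1306 |
| 5 | C | 0.9922 | 0.9826 | 0.9905 | 0.9683 | 1020 | 226 | 4 | 8 | 1258 |
| 5 | V | 0.9972 | 0.9826 | 0.9947 | 0.9815 | 1080 | 226 | 4 | 3 | 1313 |
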
* A = active training set; P = passive training set; C = calibration set; V = validation set; TN = true negative; TP = true positive; FN = false negative; FP = false positive; All is the number of compounds in a set.
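
The four criteria in Table 2 follow from the confusion counts by the standard definitions. The sketch below recomputes the first row (split 1, active training set); the function name is hypothetical, and the counts are assigned as TP = 868, TN = 214, FP = 178, FN = 65, which is the assignment under which the standard formulas reproduce the reported values.

```python
from math import sqrt


def classification_stats(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"Sens": sens, "Spec": spec, "Acc": acc, "MCC": mcc}


# Counts for split 1, set A (assignment of columns is an assumption, see above).
stats = classification_stats(tp=868, tn=214, fp=178, fn=65)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

The same function can be applied to any other row of the table to check that the counts sum to the set size and match the listed criteria.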
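Table 3. The correlation weights for SMILES atoms that are used to calculate Y in the case of splits 1–5.

Split 1

| Sk | CW(Sk) | Frequency in Active Training Set | Frequency in Passive Training Set | Frequency in Calibration Set | Sk is Active |
|---|---|---|---|---|---|
| # | −0.0018 | 67 | 70 | 77 | TRUE |
| ( | 0.4051 | 1136 | 1122 | 1058 | TRUE |
| / | 0.0 | 0 | 1 | 0 | FALSE |
| 1 | 0.1432 | 1080 | 1066 | 1058 | TRUE |
| 2 | −0.0537 | 526 | 529 | 452 | TRUE |
| 3 | 0.2177 | 244 | 262 | 226 | TRUE |
| 4 | −0.1914 | 112 | 123 | 108 | TRUE |
| 5 | 0.4467 | 33 | 41 | 34 | TRUE |
| 6 | 0.3766 | 7 | 15 | 12 | TRUE |
| 7 | 0.0835 | 5 | 6 | 4 | TRUE |
| 8 | 0.4559 | 2 | 3 | 3 | TRUE |
| 9 | 0.0 | 0 | 1 | 1 | FALSE |
| = | 0.4918 | 827 | 852 | 791 | TRUE |
| B | −0.4305 | 4 | 2 | 10 | TRUE |
| C | 0.3553 | 1199 | 1177 | 1130 | TRUE |
| F | 0.4496 | 104 | 91 | 118 | TRUE |
| Br | 0.3882 | 70 | 56 | 115 | TRUE |
| I | 0.3796 | 24 | 34 | 32 | TRUE |
| Cl | 0.3870 | 185 | 172 | 195 | TRUE |
| N | 0.2175 | 601 | 597 | 514 | TRUE |
| O | −0.2029 | 1010 | 995 | 939 | TRUE |
| P | −0.1273 | 10 | 24 | 9 | TRUE |
| S | 0.1072 | 139 | 137 | 140 | TRUE |
| [123I] | 0.0 | 0 | 1 | 0 | FALSE |
| \ | 0.0 | 0 | 1 | 0 | FALSE |
| [C−] | 0.0 | 0 | 0 | 1 | FALSE |
| [Ge] | 0.0 | 1 | 0 | 1 | FALSE |
| [N+] | 0.4829 | 66 | 76 | 98 | TRUE |
| [O−] | −0.2889 | 70 | 76 | 99 | TRUE |
| [Se] | 0.0 | 0 | 0 | 1 | FALSE |
| [Si] | 0.4429 | 5 | 8 | 12 | TRUE |
| [Sn] | −0.1457 | 3 | 1 | 1 | TRUE |
| [n+] | 0.1103 | 5 | 1 | 2 | TRUE |
| [nH] | −0.2015 | 51 | 39 | 39 | TRUE |
| c | 0.1066 | 921 | 909 | 940 | TRUE |
| n | −0.1502 | 213 | 214 | 202 | TRUE |
| o | −0.3202 | 21 | 17 | 23 | TRUE |
| s | −0.3532 | 30 | 26 | 43 | TRUE |
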
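Split 2

| Sk | CW(Sk) | Frequency in Active Training Set | Frequency in Passive Training Set | Frequency in Calibration Set | Sk is Active |
|---|---|---|---|---|---|
| # | −0.0156 | 69 | 60 | 80 | TRUE |
| ( | −0.4088 | 1152 | 1118 | 1036 | TRUE |
| / | 0.0 | 0 | 0 | 1 | FALSE |
| 1 | −0.4359 | 1107 | 1061 | 991 | TRUE |
| 2 | −0.4951 | 525 | 501 | 437 | TRUE |
| 3 | 0.0739 | 254 | 245 | 227 | TRUE |
| 4 | 0.1088 | 117 | 99 | 118 | TRUE |
| 5 | 0.3121 | 37 | 33 | 27 | TRUE |
| 6 | −0.3768 | 9 | 13 | 6 | TRUE |
| 7 | 0.1469 | 4 | 5 | 2 | TRUE |
| 8 | 0.0 | 0 | 3 | 1 | FALSE |
| 9 | 0.0 | 0 | 1 | 0 | FALSE |
| = | 0.3761 | 840 | 837 | 801 | TRUE |
| B | −0.1819 | 6 | 4 | 7 | TRUE |
| C | 0.0391 | 1205 | 1164 | 1112 | TRUE |
| F | 0.3489 | 90 | 104 | 127 | TRUE |
| Br | 0.4237 | 69 | 56 | 97 | TRUE |
| I | −0.3217 | 25 | 46 | 25 | TRUE |
| Cl | 0.3727 | 183 | 180 | 184 | TRUE |
| N | −0.2051 | 619 | 569 | 521 | TRUE |
| O | −0.3873 | 1012 | 997 | 941 | TRUE |
| P | −0.2234 | 12 | 19 | 6 | TRUE |
| S | 0.4145 | 157 | 125 | 123 | TRUE |
| [123I] | 0.0 | 0 | 0 | 1 | FALSE |
| [18F] | 0.0 | 0 | 0 | 1 | FALSE |
| \ | 0.0 | 0 | 0 | 1 | FALSE |
| [C−] | 0.0 | 0 | 1 | 0 | FALSE |
| [Ge] | 0.0 | 0 | 0 | 2 | FALSE |
| [N+] | −0.1110 | 82 | 81 | 75 | TRUE |
| [O−] | 0.3385 | 82 | 84 | 78 | TRUE |
| [Si] | 0.2676 | 8 | 9 | 15 | TRUE |
| [Sn] | 0.0 | 1 | 2 | 0 | FALSE |
| [n+] | 0.0 | 0 | 5 | 4 | FALSE |
| [nH] | −0.0460 | 41 | 39 | 40 | TRUE |
| c | −0.4981 | 953 | 895 | 868 | TRUE |
| n | 0.1179 | 220 | 191 | 176 | TRUE |
| o | 0.2430 | 21 | 25 | 15 | TRUE |
| s | 0.1486 | 36 | 23 | 31 | TRUE |
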
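Split 3

| Sk | CW(Sk) | Frequency in Active Training Set | Frequency in Passive Training Set | Frequency in Calibration Set | Sk is Active |
|---|---|---|---|---|---|
| # | 0.2631 | 67 | 60 | 77 | TRUE |
| ( | −0.3882 | 1109 | 1092 | 1105 | TRUE |
| / | 0.0 | 1 | 0 | 2 | FALSE |
| 1 | −0.4166 | 1072 | 1048 | 1042 | TRUE |
| 2 | −0.2890 | 511 | 509 | 471 | TRUE |
| 3 | 0.1485 | 245 | 254 | 252 | TRUE |
| 4 | 0.1888 | 115 | 118 | 117 | TRUE |
| 5 | −0.3245 | 35 | 35 | 38 | TRUE |
| 6 | 0.3899 | 13 | 14 | 10 | TRUE |
| 7 | 0.3989 | 6 | 6 | 7 | TRUE |
| 8 | 0.0 | 1 | 4 | 4 | FALSE |
| 9 | 0.0 | 1 | 3 | 0 | FALSE |
| = | −0.3768 | 838 | 825 | 824 | TRUE |
| B | 0.1439 | 4 | 2 | 7 | TRUE |
| C | −0.4266 | 1179 | 1140 | 1145 | TRUE |
| F | −0.1818 | 98 | 104 | 113 | TRUE |
| Br | 0.1097 | 68 | 58 | 96 | TRUE |
| I | −0.4305 | 30 | 33 | 34 | TRUE |
| Cl | −0.0516 | 180 | 185 | 189 | TRUE |
| N | 0.3385 | 605 | 544 | 545 | TRUE |
| O | −0.0065 | 988 | 979 | 975 | TRUE |
| P | −0.3230 | 13 | 16 | 11 | TRUE |
| S | 0.3219 | 141 | 131 | 126 | TRUE |
| [123I] | 0.0 | 0 | 1 | 0 | FALSE |
| [18F] | 0.0 | 0 | 1 | 0 | FALSE |
| \ | 0.0 | 1 | 0 | 2 | FALSE |
| [Ge] | 0.0 | 1 | 0 | 1 | FALSE |
| [N+] | 0.1284 | 76 | 75 | 82 | TRUE |
| [O−] | −0.4385 | 78 | 76 | 84 | TRUE |
| [Si] | 0.4327 | 13 | 7 | 14 | TRUE |
| [Sn] | 0.0 | 1 | 2 | 3 | FALSE |
| [V] | 0.0 | 1 | 0 | 0 | FALSE |
| [n+] | −0.0209 | 2 | 1 | 3 | TRUE |
| [nH] | −0.1779 | 34 | 44 | 44 | TRUE |
| c | 0.4635 | 918 | 881 | 932 | TRUE |
| n | −0.2132 | 210 | 198 | 200 | TRUE |
| o | −0.3341 | 29 | 22 | 17 | TRUE |
| s | −0.4718 | 35 | 29 | 39 | TRUE |
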
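Split 4

| Sk | CW(Sk) | Frequency in Active Training Set | Frequency in Passive Training Set | Frequency in Calibration Set | Sk is Active |
|---|---|---|---|---|---|
| # | 0.2930 | 58 | 59 | 81 | TRUE |
| ( | −0.4336 | 1157 | 1151 | 1068 | TRUE |
| / | 0.0 | 0 | 1 | 0 | FALSE |
| 1 | 0.1723 | 1102 | 1128 | 991 | TRUE |
| 2 | −0.3311 | 566 | 559 | 421 | TRUE |
| 3 | 0.0101 | 286 | 258 | 214 | TRUE |
| 4 | −0.1253 | 132 | 114 | 103 | TRUE |
| 5 | 0.0920 | 41 | 38 | 27 | TRUE |
| 6 | −0.1034 | 16 | 12 | 8 | TRUE |
| 7 | 0.4945 | 6 | 7 | 5 | TRUE |
| 8 | 0.0289 | 4 | 3 | 2 | TRUE |
| 9 | 0.0 | 1 | 1 | 2 | FALSE |
| = | 0.4146 | 879 | 868 | 789 | TRUE |
| B | −0.0762 | 2 | 2 | 11 | TRUE |
| C | −0.0174 | 1180 | 1212 | 1129 | TRUE |
| F | 0.3847 | 88 | 104 | 124 | TRUE |
| Br | 0.4093 | 57 | 60 | 104 | TRUE |
| I | −0.2812 | 28 | 28 | 31 | TRUE |
| Cl | −0.4663 | 191 | 193 | 187 | TRUE |
| N | 0.3680 | 650 | 648 | 467 | TRUE |
| O | 0.2090 | 1022 | 1043 | 944 | TRUE |
| P | −0.1516 | 19 | 11 | 13 | TRUE |
| S | 0.4176 | 146 | 155 | 111 | TRUE |
| [123I] | 0.0 | 0 | 0 | 1 | FALSE |
| [18F] | 0.0 | 0 | 1 | 0 | FALSE |
| \ | 0.0 | 0 | 1 | 0 | FALSE |
| [C−] | 0.0 | 0 | 0 | 1 | FALSE |
| [Ge] | 0.0 | 0 | 1 | 0 | FALSE |
| [N+] | 0.1838 | 74 | 69 | 100 | TRUE |
| [O−] | −0.1966 | 75 | 71 | 105 | TRUE |
| [Se] | 0.0 | 0 | 0 | 1 | FALSE |
| [Si] | 0.0827 | 6 | 8 | 19 | TRUE |
| [Sn] | 0.2362 | 3 | 1 | 2 | TRUE |
| [V] | 0.0 | 1 | 0 | 0 | FALSE |
| [n+] | 0.0 | 1 | 3 | 7 | FALSE |
| [nH] | 0.3438 | 42 | 43 | 37 | TRUE |
| c | 0.0892 | 948 | 962 | 869 | TRUE |
| n | 0.3388 | 198 | 207 | 198 | TRUE |
| o | −0.2745 | 27 | 19 | 8 | TRUE |
| s | −0.2667 | 30 | 36 | 34 | TRUE |
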
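Split 5

| Sk | CW(Sk) | Frequency in Active Training Set | Frequency in Passive Training Set | Frequency in Calibration Set | Sk is Active |
|---|---|---|---|---|---|
| # | −0.1927 | 58 | 68 | 90 | TRUE |
| ( | 0.1698 | 1169 | 1136 | 1036 | TRUE |
| / | 0.0 | 1 | 1 | 0 | FALSE |
| 1 | −0.1953 | 1111 | 1084 | 1005 | TRUE |
| 2 | 0.0434 | 571 | 555 | 417 | TRUE |
| 3 | 0.4368 | 268 | 270 | 207 | TRUE |
| 4 | 0.4982 | 123 | 108 | 117 | TRUE |
| 5 | 0.4949 | 35 | 31 | 36 | TRUE |
| 6 | −0.2375 | 10 | 10 | 9 | TRUE |
| 7 | 0.4610 | 6 | 4 | 4 | TRUE |
| 8 | 0.1974 | 3 | 2 | 3 | TRUE |
| 9 | 0.0551 | 2 | 0 | 1 | TRUE |
| = | −0.2514 | 861 | 864 | 779 | TRUE |
| B | −0.3615 | 3 | 3 | 7 | TRUE |
| C | −0.0744 | 1223 | 1177 | 1094 | TRUE |
| F | −0.0703 | 92 | 107 | 110 | TRUE |
| Br | 0.4524 | 49 | 67 | 101 | TRUE |
| I | −0.3837 | 32 | 27 | 25 | TRUE |
| Cl | 0.1357 | 206 | 182 | 182 | TRUE |
| N | −0.2709 | 646 | 642 | 489 | TRUE |
| O | −0.0999 | 1038 | 1006 | 926 | TRUE |
| P | −0.3508 | 9 | 22 | 10 | TRUE |
| S | −0.0258 | 144 | 146 | 126 | TRUE |
| [123I] | 0.0 | 0 | 1 | 0 | FALSE |
| [18F] | 0.0 | 0 | 1 | 0 | FALSE |
| \ | 0.0 | 1 | 1 | 0 | FALSE |
| [C−] | 0.0 | 1 | 0 | 0 | FALSE |
| [Ge] | −0.2719 | 2 | 0 | 0 | TRUE |
| [N+] | 0.1633 | 62 | 78 | 113 | TRUE |
| [O−] | −0.4126 | 62 | 81 | 118 | TRUE |
| [Se] | 0.0 | 0 | 1 | 0 | FALSE |
| [Si] | −0.4059 | 11 | 10 | 11 | TRUE |
| [Sn] | 0.0 | 1 | 0 | 4 | FALSE |
| [V] | 0.0 | 1 | 0 | 0 | FALSE |
| [n+] | 0.0 | 1 | 3 | 7 | FALSE |
| [nH] | 0.0709 | 34 | 34 | 50 | TRUE |
| c | −0.2732 | 942 | 939 | 891 | TRUE |
| n | −0.1253 | 221 | 239 | 179 | TRUE |
| o | 0.0797 | 21 | 19 | 20 | TRUE |
| s | −0.1934 | 34 | 38 | 25 | TRUE |
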
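Table 4. The roles of the different molecular features, which provide the basis for the mechanistic interpretation in terms of the three classes of molecular features.

Split 1, Class 1

| No. | Molecular Feature | CWs Probe 1 | CWs Probe 2 | CWs Probe 3 | CWs Probe 4 | CWs Probe 5 | NA * | NP | NC | d |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Br | 3.2723 | 1.3463 | 1.7811 | 4.0953 | 3.1221 | 70 | 56 | 115 | 0.0004 |
| 2 | [O−] | 1.7365 | 0.6665 | 0.5790 | 2.1466 | 0.7586 | 70 | 76 | 99 | 0.0002 |
| 3 | # | 3.8341 | 1.9513 | 1.2741 | 0.8575 | 4.4929 | 67 | 70 | 77 | 0.0001 |
| 4 | 7 | 3.2718 | 1.5793 | 1.7186 | 2.7828 | 1.6371 | 5 | 6 | 4 | 0.0002 |
| 5 | [Si] | 3.4661 | 3.0762 | 2.1314 | 1.2030 | 5.8817 | 5 | 8 | 12 | 0.0004 |
| 6 | [n+] | 6.8173 | 3.3466 | 4.4202 | 4.5583 | 8.5185 | 5 | 1 | 2 | 0.0008 |
| 7 | B | 4.7752 | 1.3899 | 2.2479 | 3.1393 | 2.0450 | 4 | 2 | 10 | 0.0008 |
| 8 | [Sn] | 9.0484 | 7.1590 | 8.8854 | 8.6893 | 12.0496 | 3 | 1 | 1 | 0.0006 |

Split 1, Class 2

| No. | Molecular Feature | CWs Probe 1 | CWs Probe 2 | CWs Probe 3 | CWs Probe 4 | CWs Probe 5 | NA * | NP | NC | d |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C | −0.6605 | −0.3548 | −0.5480 | −0.8370 | −0.5159 | 1199 | 1177 | 1130 | 0.0000 |
| 2 | ( | −0.3212 | −0.4346 | −0.0967 | −0.0620 | −0.9159 | 1136 | 1122 | 1058 | 0.0000 |
| 3 | 1 | −1.6947 | −0.3518 | −0.2589 | −0.8020 | −1.4060 | 1080 | 1066 | 1058 | 0.0000 |
| 4 | N | −2.7232 | −1.0325 | −2.1581 | −1.6193 | −3.0509 | 601 | 597 | 514 | 0.0001 |
| 5 | 2 | −2.7013 | −1.9285 | −1.4587 | −2.0087 | −3.3464 | 526 | 529 | 452 | 0.0001 |
| 6 | 3 | −1.7331 | −0.6932 | −1.1344 | −2.0032 | −2.5845 | 244 | 262 | 226 | 0.0001 |
| 7 | n | −1.4273 | −1.2580 | −0.5318 | −1.2792 | −1.4411 | 213 | 214 | 202 | 0.0000 |
| 8 | Cl | −0.8715 | −0.4565 | −0.6725 | −1.6398 | −0.1631 | 185 | 172 | 195 | 0.0001 |
| 9 | S | −2.5577 | −0.9998 | −1.7481 | −1.2768 | −1.8630 | 139 | 137 | 140 | 0.0000 |
| 10 | s | −4.8609 | −2.5041 | −0.2666 | −0.7957 | −2.6770 | 30 | 26 | 43 | 0.0003 |
| 11 | o | −4.3739 | −1.2544 | −1.9536 | −2.0792 | −4.2302 | 21 | 17 | 23 | 0.0002 |
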
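Split 2, Class 1

| No. | Molecular Feature | CWs Probe 1 | CWs Probe 2 | CWs Probe 3 | CWs Probe 4 | CWs Probe 5 | NA * | NP | NC | d |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F | 1.4318 | 0.4211 | 0.6747 | 0.6329 | 0.7588 | 90 | 104 | 127 | 0.0002 |
| 2 | [N+] | 0.8732 | 0.6956 | 1.7182 | 0.1880 | 1.2008 | 82 | 81 | 75 | 0.0000 |
| 3 | [O−] | 0.5347 | 0.1252 | 2.4039 | 0.9312 | 0.5493 | 82 | 84 | 78 | 0.0000 |
| 4 | # | 2.8296 | 2.5711 | 3.7341 | 2.4279 | 2.4808 | 69 | 60 | 80 | 0.0002 |
| 5 | Br | 3.1784 | 2.1593 | 2.2501 | 1.5241 | 1.3780 | 69 | 56 | 97 | 0.0003 |
| 6 | P | 3.2709 | 2.9828 | 2.9690 | 1.3972 | 2.1455 | 12 | 19 | 6 | 0.0005 |
| 7 | [Si] | 2.2949 | 4.3398 | 3.4141 | 1.9805 | 2.3602 | 8 | 9 | 15 | 0.0004 |
| 8 | B | 2.9435 | 2.3674 | 3.1590 | 1.8198 | 1.4780 | 6 | 4 | 7 | 0.0003 |
| 9 | 7 | 4.6291 | 3.5557 | 3.3704 | 1.7851 | 2.7409 | 4 | 5 | 2 | 0.0004 |

Split 2, Class 2

| No. | Molecular Feature | CWs Probe 1 | CWs Probe 2 | CWs Probe 3 | CWs Probe 4 | CWs Probe 5 | NA * | NP | NC | d |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C | −0.4048 | −0.7736 | −0.3752 | −0.3051 | −0.3977 | 1205 | 1164 | 1112 | 0.0000 |
| 2 | ( | −0.6309 | −0.1135 | −0.3682 | −0.1438 | −0.2676 | 1152 | 1118 | 1036 | 0.0000 |
| 3 | N | −2.2544 | −2.4720 | −2.1661 | −1.1186 | −1.4333 | 619 | 569 | 521 | 0.0001 |
| 4 | 2 | −3.5296 | −2.3461 | −3.1806 | −1.5277 | −1.8080 | 525 | 501 | 437 | 0.0001 |
| 5 | 3 | −2.1478 | −1.6436 | −1.7882 | −1.2203 | −1.0766 | 254 | 245 | 227 | 0.0000 |
| 6 | n | −0.6723 | −1.2947 | −0.7581 | −0.6257 | −0.8781 | 220 | 191 | 176 | 0.0001 |
| 7 | Cl | −2.0880 | −1.0896 | −1.1427 | −1.1099 | −0.7095 | 183 | 180 | 184 | 0.0000 |
| 8 | S | −1.0411 | −1.2982 | −0.4788 | −0.5819 | −0.6556 | 157 | 125 | 123 | 0.0001 |
| 9 | 4 | −0.3482 | −0.2583 | −1.2454 | −0.3755 | −0.6052 | 117 | 99 | 118 | 0.0001 |
| 10 | [nH] | −1.1206 | −1.8879 | −1.5457 | −0.9128 | −0.9076 | 41 | 39 | 40 | 0.0000 |
| 11 | s | −3.4152 | −1.4488 | −2.6476 | −1.7933 | −0.6432 | 36 | 23 | 31 | 0.0002 |
| 12 | o | −2.1937 | −4.0809 | −0.1840 | −0.2794 | −0.7211 | 21 | 25 | 15 | 0.0003 |
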
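Split 3, Class 1

| No. | Molecular Feature | CWs Probe 1 | CWs Probe 2 | CWs Probe 3 | CWs Probe 4 | CWs Probe 5 | NA * | NP | NC | d |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | [N+] | 1.0164 | 0.9943 | 1.3058 | 0.2875 | 0.9219 | 76 | 75 | 82 | 0.0000 |
| 2 | Br | 1.1498 | 1.3565 | 0.9857 | 2.3486 | 3.1166 | 68 | 58 | 96 | 0.0003 |
| 3 | # | 1.3830 | 1.6466 | 1.0097 | 3.3559 | 3.7289 | 67 | 60 | 77 | 0.0001 |
| 4 | 6 | 1.5681 | 0.4792 | 1.2337 | 1.3321 | 0.3680 | 13 | 14 | 10 | 0.0002 |
| 5 | P | 1.7308 | 1.4137 | 1.7629 | 2.1308 | 1.5255 | 13 | 16 | 11 | 0.0002 |
| 6 | [Si] | 0.5803 | 0.9440 | 1.1057 | 3.2908 | 3.0645 | 13 | 7 | 14 | 0.0003 |
| 7 | 7 | 0.7650 | 0.9308 | 0.1770 | 1.6979 | 2.0971 | 6 | 6 | 7 | 0.0001 |
| 8 | [n+] | 0.9517 | 1.4603 | 0.8602 | 0.9493 | 3.3557 | 2 | 1 | 3 | 0.0005 |

Split 3, Class 2

| No. | Molecular Feature | CWs Probe 1 | CWs Probe 2 | CWs Probe 3 | CWs Probe 4 | CWs Probe 5 | NA * | NP | NC | d |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C | −0.4496 | −0.2875 | −0.4625 | −0.2968 | −0.3722 | 1179 | 1140 | 1145 | 0.0000 |
| 2 | ( | −0.0166 | −0.1634 | −0.1578 | −0.3955 | −0.5715 | 1109 | 1092 | 1105 | 0.0000 |
| 3 | N | −1.8463 | −1.1613 | −1.4371 | −1.4782 | −1.8784 | 605 | 544 | 545 | 0.0000 |
| 4 | 2 | −1.7822 | −1.1166 | −1.1200 | −2.4887 | −2.7523 | 511 | 509 | 471 | 0.0000 |
| 5 | 3 | −0.8123 | −0.7781 | −1.0673 | −1.4683 | −2.2162 | 245 | 254 | 252 | 0.0000 |
| 6 | n | −0.8070 | −0.4791 | −0.7562 | −0.4099 | −0.4217 | 210 | 198 | 200 | 0.0000 |
| 7 | Cl | −0.6271 | −0.1657 | −1.2458 | −0.4402 | −0.8417 | 180 | 185 | 189 | 0.0000 |
| 8 | S | −0.8774 | −0.9040 | −0.5881 | −0.8678 | −1.3786 | 141 | 131 | 126 | 0.0001 |
| 9 | [nH] | −0.5969 | −0.6678 | −1.4561 | −0.8938 | −1.4308 | 34 | 44 | 44 | 0.0001 |
| 10 | I | −1.3131 | −0.0016 | −1.9577 | −1.2033 | −1.6649 | 30 | 33 | 34 | 0.0001 |
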
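Split 4, Class 1

| No. | Molecular Feature | CWs Probe 1 | CWs Probe 2 | CWs Probe 3 | CWs Probe 4 | CWs Probe 5 | NA * | NP | NC | d |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | # | 3.0592 | 0.4275 | 1.7748 | 2.8090 | 2.3299 | 58 | 59 | 81 | 0.0002 |
| 2 | Br | 2.5508 | 3.0172 | 2.7507 | 1.7440 | 1.2147 | 57 | 60 | 104 | 0.0003 |
| 3 | 5 | 0.0405 | 1.6685 | 0.4280 | 0.6631 | 0.9388 | 41 | 38 | 27 | 0.0002 |
| 4 | P | 0.9044 | 2.9989 | 2.5214 | 1.5331 | 1.1623 | 19 | 11 | 13 | 0.0003 |
| 5 | 6 | 0.6599 | 0.5148 | 1.3528 | 0.3958 | 0.6901 | 16 | 12 | 8 | 0.0003 |
| 6 | [Si] | 2.2987 | 1.5032 | 2.2809 | 3.0191 | 1.7738 | 6 | 8 | 19 | 0.0006 |
| 7 | 8 | 1.2541 | 2.2047 | 3.6143 | 1.6287 | 0.9238 | 4 | 3 | 2 | 0.0003 |
| 8 | [Sn] | 4.6790 | 10.8882 | 7.4711 | 6.7199 | 4.0803 | 3 | 1 | 2 | 0.0005 |

Split 4, Class 2

| No. | Molecular Feature | CWs Probe 1 | CWs Probe 2 | CWs Probe 3 | CWs Probe 4 | CWs Probe 5 | NA * | NP | NC | d |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C | −0.4159 | −0.8717 | −0.7904 | −0.4031 | −0.3871 | 1180 | 1212 | 1129 | 0.0000 |
| 2 | ( | −0.0114 | −0.2922 | −0.2456 | −0.1328 | −0.0042 | 1157 | 1151 | 1068 | 0.0000 |
| 3 | O | −0.6308 | −0.2245 | −0.3892 | −0.0529 | −0.3235 | 1022 | 1043 | 944 | 0.0000 |
| 4 | c | −0.0389 | −0.1040 | −0.4182 | −0.1399 | −0.2246 | 948 | 962 | 869 | 0.0000 |
| 5 | N | −1.2370 | −2.7788 | −2.5813 | −1.2900 | −0.7878 | 650 | 648 | 467 | 0.0002 |
| 6 | 2 | −2.1037 | −3.3300 | −3.0946 | −1.7167 | −1.0090 | 566 | 559 | 421 | 0.0001 |
| 7 | 3 | −1.9136 | −2.5553 | −1.4174 | −1.5132 | −0.8398 | 286 | 258 | 214 | 0.0001 |
| 8 | n | −0.9712 | −1.8482 | −2.3349 | −0.7493 | −0.7850 | 198 | 207 | 198 | 0.0000 |
| 9 | Cl | −1.1802 | −1.5431 | −1.5226 | −0.8968 | −0.5393 | 191 | 193 | 187 | 0.0000 |
| 10 | S | −0.9590 | −1.8499 | −1.1759 | −0.8525 | −0.7025 | 146 | 155 | 111 | 0.0001 |
| 11 | [nH] | −1.3662 | −1.0136 | −1.0906 | −0.8572 | −0.4932 | 42 | 43 | 37 | 0.0001 |
| 12 | I | −2.4835 | −4.0180 | −2.4387 | −2.0084 | −1.2164 | 28 | 28 | 31 | 0.0001 |
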
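Split 5, Class 1

| No. | Molecular Feature | CWs Probe 1 | CWs Probe 2 | CWs Probe 3 | CWs Probe 4 | CWs Probe 5 | NA * | NP | NC | d |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F | 1.1998 | 1.4944 | 0.2246 | 0.5633 | 1.2588 | 92 | 107 | 110 | 0.0001 |
| 2 | Br | 4.0569 | 3.8501 | 1.1048 | 2.3848 | 3.4291 | 49 | 67 | 101 | 0.0004 |
| 3 | [Si] | 4.0676 | 4.8714 | 1.0975 | 3.1740 | 2.9609 | 11 | 10 | 11 | 0.0001 |
| 4 | 6 | 1.0129 | 1.5077 | 1.8156 | 0.7851 | 1.7448 | 10 | 10 | 9 | 0.0000 |
| 5 | P | 2.5960 | 1.5276 | 0.9044 | 1.5002 | 2.1112 | 9 | 22 | 10 | 0.0005 |
| 6 | 7 | 1.7390 | 3.6003 | 0.9580 | 1.2172 | 3.0213 | 6 | 4 | 4 | 0.0002 |
| 7 | 9 | 5.0624 | 2.5242 | 2.8513 | 4.9072 | 3.0430 | 2 | 0 | 1 | 0.0010 |
| 8 | [Ge] | 9.7682 | 6.3697 | 3.5542 | 5.5588 | 4.3939 | 2 | 0 | 0 | 1.0000 |

Split 5, Class 2

| No. | Molecular Feature | CWs Probe 1 | CWs Probe 2 | CWs Probe 3 | CWs Probe 4 | CWs Probe 5 | NA * | NP | NC | d |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C | −0.4755 | −0.4418 | −0.3983 | −0.6964 | −0.5647 | 1223 | 1177 | 1094 | 0.0000 |
| 2 | ( | −0.6192 | −0.5926 | −0.1694 | −0.2048 | −0.4470 | 1169 | 1136 | 1036 | 0.0000 |
| 3 | 1 | −2.5197 | −0.1753 | −0.9520 | −0.2585 | −1.7510 | 1111 | 1084 | 1005 | 0.0000 |
| 4 | N | −2.5202 | −1.5783 | −1.4051 | −2.0748 | −2.0134 | 646 | 642 | 489 | 0.0001 |
| 5 | 2 | −3.3695 | −2.4920 | −1.0677 | −2.1892 | −2.7241 | 571 | 555 | 417 | 0.0001 |
| 6 | 3 | −1.8686 | −2.3009 | −1.0730 | −1.1007 | −1.7777 | 268 | 270 | 207 | 0.0001 |
| 7 | n | −1.3633 | −0.5961 | −0.6129 | −1.0973 | −1.1798 | 221 | 239 | 179 | 0.0001 |
| 8 | [nH] | −2.6989 | −3.0141 | −1.4830 | −0.6964 | −2.4150 | 34 | 34 | 50 | 0.0002 |
| 9 | o | −2.4007 | −2.4849 | −1.9940 | −3.0845 | −2.4639 | 21 | 19 | 20 | 0.0000 |
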
* NA, NP, and NC are the frequencies of a molecular feature in active training, passive training, and calibration set, respectively.
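
The grouping in Table 4 can be expressed as a sign test over the five probes of the optimization. The sketch below is an assumption drawn from the tabulated values, not a rule stated in the table itself: Class 1 is taken as features whose correlation weights are positive in every probe, Class 2 as features with negative weights in every probe, and Class 3 as everything else. The function name and the mixed-sign example are hypothetical.

```python
def feature_class(probe_weights):
    """Assign a feature to class 1, 2 or 3 from the signs of its probe weights."""
    if all(w > 0 for w in probe_weights):
        return 1  # assumed promoter of increase in the probability of activity
    if all(w < 0 for w in probe_weights):
        return 2  # assumed promoter of decrease in the probability of activity
    return 3      # mixed or zero weights: no stable role


# Rows copied from Table 4, split 1.
br_split1 = [3.2723, 1.3463, 1.7811, 4.0953, 3.1221]
c_split1 = [-0.6605, -0.3548, -0.5480, -0.8370, -0.5159]

# Made-up weights, used only to show the mixed-sign case.
mixed_example = [0.12, -0.40, 0.33, -0.05, 0.21]

print(feature_class(br_split1), feature_class(c_split1), feature_class(mixed_example))
```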
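Table 5. Frequency of occurrence of different molecular features in the lists of promoters of increase (Class 1) or decrease (Class 2) in the probability of activity, observed for five random splits.

| Class | Molecular Feature | Split 1 | Split 2 | Split 3 | Split 4 | Split 5 |
|---|---|---|---|---|---|---|
| Class 1 | Br | + | + | + | + | + |
| | [O−] | + | + | | | |
| | # | + | + | + | + | |
| | 7 | + | + | + | | + |
| | [Si] | + | + | + | + | + |
| | [n+] | + | | + | | |
| | B | + | + | + | | |
| | [Sn] | + | | | + | |
| | P | | + | + | + | + |
| | F | | + | | | + |
| | [N+] | | + | + | | |
| Class 2 | C | + | + | + | + | + |
| | ( | + | + | + | + | + |
| | 1 | + | | | | + |
| | N | + | + | + | + | + |
| | 2 | + | + | + | + | + |
| | 3 | + | + | + | + | + |
| | n | + | + | + | + | + |
| | Cl | + | + | + | + | |
| | S | + | + | + | + | |
| | s | + | + | | | |
| | o | + | + | | | + |
| | 4 | | + | | | |
| | [nH] | | + | + | + | + |
| | I | | | + | + | |
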
Table 6. Statistical characteristics of models of eye irritation for the external validation set.

| Method | Accuracy | Sensitivity | Specificity | References |
|---|---|---|---|---|
| Support vector machine | 0.938 | 0.967 | 0.856 | [10] |
| Artificial neural networks | 0.929 | 0.960 | 0.845 | [10] |
| Average values on three splits | 0.964 | 0.984 | 0.886 | In this work |
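
The table does not say which three splits enter the average. As a check under that caveat, the sketch below averages the validation-set (V) rows of splits 1–3 from Table 2, which happen to reproduce the last row of Table 6 to three decimals. The choice of splits is an assumption, and the dictionary layout and function name are illustrative.

```python
from statistics import mean

# Validation-set (V) rows of Table 2 for splits 1-3.
validation = {
    1: {"Sens": 0.9804, "Spec": 0.8917, "Acc": 0.9614},
    2: {"Sens": 0.9858, "Spec": 0.9130, "Acc": 0.9707},
    3: {"Sens": 0.9867, "Spec": 0.8545, "Acc": 0.9594},
}


def average(metric, splits=(1, 2, 3)):
    """Mean of one criterion over the chosen splits (assumed to be 1-3)."""
    return mean(validation[s][metric] for s in splits)


for metric in ("Acc", "Sens", "Spec"):
    print(metric, round(average(metric), 3))
```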
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Toropov, A.A.; Toropova, A.P.; Roncaglioni, A.; Benfenati, E. Semi-Correlations for Building Up a Simulation of Eye Irritation. Toxics 2023, 11, 993. https://doi.org/10.3390/toxics11120993
