Semi-Correlations for Building Up a Simulation of Eye Irritation

The OECD recognizes that data on a compound's ability to cause eye irritation are essential for the assessment of new compounds on the market. In silico models are frequently used to provide information when experimental data are lacking. Semi-correlations can be useful for building categorical models of eye irritation. Semi-correlations are latent regressions that can be used when the endpoint is expressed by two values: 1 for an active molecule and 0 for an inactive molecule. The regression line is based on the descriptor values, which serve to distribute the data into four classes: true positive, true negative, false positive, and false negative. The counts in these classes are used to calculate the statistical criteria for assessing the predictive potential of the categorical model. In our model, the descriptor is the sum of what are termed correlation weights. These are defined by optimization using the Monte Carlo method. The target function of the optimization is related to the determination coefficient and the mean absolute error for the training set. Our model gives results that are better than those previously reported for the same endpoint.


Introduction
The Organization for Economic Cooperation and Development (OECD) recognizes that data on a compound's potential to cause eye irritation are essential in the assessment of new compounds. The OECD has adopted three methods for assessing eye irritation [1,2]. When experimental values are lacking, models for predicting the adverse impact of chemicals are essential to modern theoretical chemistry [3][4][5][6], using various representations of the molecular structure [7]. Solving environmental problems by simulating endpoints that indicate hazardous or beneficial properties of substances is now common practice. There are two types of simulation aimed at solving these problems: the regression approach, where endpoints are expressed as numerical values on a continuous scale, and categorical simulation, where the endpoint takes one of two values representing active (1) or inactive (0) molecules. Semi-correlation is one of the possible approaches for building a categorical model [8,9]. It should be noted, however, that semi-correlations are latent regression models. The result is a somewhat unusual correlation, since the abscissa axis (to continue the analogy with the usual regression model) displays only two values: 0 to mark an inactive compound and 1 to mark an active compound. There may be variations, such as −1 and +1 to indicate inactive and active compounds, respectively. While the y-axis contains a wide range of values, their informational content boils down to how they are located relative to 0.5 (if the observations are 0 and 1) or relative to 0 (if the observations are −1 and +1). For regression models built using the Monte Carlo technique, that is, through self-controlled random processes, a mechanistic interpretation of the model can be presented in probabilistic terms, partially dependent on the distribution of the molecular features extracted from the SMILES. This interpretation is revealed through several runs of the procedure that finds the so-called correlation weights for each of the above-mentioned molecular features of the SMILES. In this case, the descriptor on which the model is based is a simple sum of these correlation weights calculated by the Monte Carlo method.
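The way a semi-correlation turns continuous model outputs into categories can be illustrated with a minimal sketch. It assumes a 0/1 endpoint and the 0.5 cut-off mentioned above; the function name `confusion_counts`, the rule that a value exactly at the cut-off counts as active, and the example numbers are illustrative assumptions, not part of the CORAL software.

```python
from collections import Counter


def confusion_counts(observed, calculated, threshold=0.5):
    """Count TP, TN, FP, FN for a 0/1 endpoint from continuous model outputs.

    observed   -- experimental categories (1 = active, 0 = inactive)
    calculated -- continuous values produced by the regression-like model
    threshold  -- cut-off separating predicted active from predicted inactive
                  (assumption: values equal to the cut-off count as active)
    """
    counts = Counter({"TP": 0, "TN": 0, "FP": 0, "FN": 0})
    for obs, calc in zip(observed, calculated):
        predicted_active = calc >= threshold
        if obs == 1:
            counts["TP" if predicted_active else "FN"] += 1
        else:
            counts["FP" if predicted_active else "TN"] += 1
    return dict(counts)


# Hypothetical values, for illustration only.
observed = [1, 1, 0, 0, 1]
calculated = [0.9, 0.4, 0.2, 0.7, 0.55]
print(confusion_counts(observed, calculated))
```

For a −1/+1 coding of the endpoint, the same function could be called with `threshold=0` and the test `obs == 1` left unchanged.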
Three classes of molecular features can be distinguished when the procedure for calculating correlation weights is repeated several times. The first contains molecular features that receive positive correlation weights in all runs of the Monte Carlo optimization. Despite the probabilistic nature of this process, most of these molecular features are extremely likely to be favorable factors for increasing the values of the endpoint studied in the regression. The second class contains molecular features that receive a negative correlation weight in each run of the Monte Carlo optimization. Molecular features of this nature will, with fairly significant probability, turn out to be factors that contribute to a decrease in the value of the endpoint in question. Finally, the third class contains molecular features for which both positive and negative correlation weights are encountered over several Monte Carlo optimizations. In principle, by comparing how many times positive weights and how many times negative weights are observed, one can assign these molecular features to either the first or the second class. This kind of observation requires many runs of the optimization procedure, whereas in practice the number of runs is limited. In those circumstances, it is simpler and more logical to treat features of this third class as molecular features with no clear role in raising or lowering the values of the endpoint in question.
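The three-class scheme described above can be sketched as follows. This is a minimal illustration, assuming that each optimization run is stored as a dictionary mapping a molecular feature to its correlation weight; the function name `classify_features`, the class labels, and the weights are hypothetical. Features missing from some runs are assigned to the third class here, which is an assumption of the sketch.

```python
def classify_features(runs):
    """Assign each molecular feature to one of three classes.

    runs -- list of dicts, one per Monte Carlo run, mapping feature -> weight
    """
    features = set().union(*runs)  # all features seen in any run
    classes = {}
    for feature in sorted(features):
        weights = [run[feature] for run in runs if feature in run]
        seen_in_all_runs = len(weights) == len(runs)
        if seen_in_all_runs and all(w > 0 for w in weights):
            classes[feature] = "promoter of increase"
        elif seen_in_all_runs and all(w < 0 for w in weights):
            classes[feature] = "promoter of decrease"
        else:
            classes[feature] = "unclear role"
    return classes


# Hypothetical correlation weights from two runs, for illustration only.
runs = [
    {"Br": 0.30, "n": -0.20, "#": 0.10},
    {"Br": 0.10, "n": -0.50, "#": -0.05},
]
print(classify_features(runs))
```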
There are also cases when a model (ordinary regression or semi-correlation) is built on exclusively positive correlation weights (where all features end up in the first class) or exclusively negative correlation weights (all in the second class). In such cases, the only way to understand the contributions of the molecular features is to compare the correlation weights and the frequencies of the corresponding molecular features extracted from the SMILES. Since semi-correlations, as already noted, are special cases of regression models, the situations listed here also apply to them. Thus, one can distinguish three classes of molecular features for semi-correlations. Therefore, for an endpoint with only two values (active/inactive), one can apply techniques tested on conventional regression models, where the endpoint has a certain range of values.
Here, semi-correlations are used to develop a categorical model of eye irritation. The CORAL software (http://www.insilico.eu/coral, accessed on 25 November 2023) is a tool for building semi-correlations [8,9].

Materials and Methods
The data on eye irritation for 5220 chemicals (3874 positive and 1346 negative) were taken from the literature [10]. To develop the models, the data were randomly divided into four subsets with approximately equal numbers of chemicals: the active training set, the passive training set, the calibration set, and the validation set. The traditional training set is thus structured into three subsets (the active and passive training sets and the calibration set), which interact in a specific way. The active training set is intended for building the initial model, with correlation weights that force the experimental and calculated values of the endpoint to correlate. The passive training set checks how suitable the resulting correlation weights are for the molecules in the passive training set (i.e., molecules absent from the active training set). Last, the calibration set checks for the absence of overtraining (a situation where a very good correlation on the training sets is accompanied by a fall in the coefficient of determination for the calibration set). Thus, it is at this stage that the final model is optimized. The validation set is not used in the process of model building and provides the statistics for substances not used in the training sets.
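A random division into four subsets of approximately equal size can be sketched as below. This is an illustrative sketch only: the function name `split_four`, the seed handling, and the round-robin assignment after shuffling are assumptions, and the sketch does not balance active and inactive compounds across subsets, since the text does not state whether the original splits were stratified.

```python
import random


def split_four(items, seed=0):
    """Randomly divide items into four subsets of (nearly) equal size."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    names = ["active_training", "passive_training", "calibration", "validation"]
    # Every fourth item of the shuffled list goes to the same subset.
    return {name: shuffled[i::4] for i, name in enumerate(names)}


# Compound identifiers 0..5219 stand in for the 5220 chemicals.
split = split_four(range(5220), seed=1)
for name, subset in split.items():
    print(name, len(subset))
```

Repeating the call with different seeds gives different splits, which is the kind of variation that the system of self-consistent models is meant to examine.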

Optimal SMILES-Based Descriptors
The simplified molecular input-line entry system (SMILES) is a representation of the molecular structure [7]. The optimal descriptor is a sum of the correlation weights of SMILES atoms for eye irritation, calculated by the Monte Carlo method described in our previous work [11]:

$$\mathrm{DCW}(T, N) = \sum_{k} \mathrm{CW}(S_k) \qquad (1)$$

where CW(S_k) is the correlation weight (CW) of a SMILES attribute [7]; S_k is a SMILES atom, that is, one symbol or a group of symbols that cannot be considered separately (e.g., 'Cl', 'Br', '[N+]'); the physical meaning of SMILES fragments is available in the literature [7][8][9] as well as on the Internet (https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html, accessed on 15 October 2023); T (=1) is the threshold defining active SMILES atoms (those with a frequency of more than T in the active training set); N is the number of iterations of the optimization by the Monte Carlo method (N = 15). Table 1 contains an example of calculating the optimal SMILES-based descriptor.
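The calculation of a descriptor as a sum of correlation weights over SMILES atoms can be sketched as follows. The tokenizer is a deliberate simplification (bracketed groups, 'Cl', 'Br', otherwise single characters) and is not the CORAL parser. The function names, the weights, and the reading of "frequency" as the number of training SMILES containing an attribute are all assumptions made for illustration. Rare attributes are simply skipped, so they contribute nothing to the sum.

```python
import re
from collections import Counter

# Simplified tokenizer (assumption): bracketed groups, two-letter halogens,
# otherwise one character per SMILES atom.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def smiles_atoms(smiles):
    """Split a SMILES string into SMILES atoms."""
    return TOKEN.findall(smiles)


def active_attributes(training_smiles, threshold=1):
    """Attributes occurring in more than `threshold` training SMILES."""
    freq = Counter()
    for smiles in training_smiles:
        freq.update(set(smiles_atoms(smiles)))  # count each SMILES once
    return {attr for attr, n in freq.items() if n > threshold}


def descriptor(smiles, weights, active):
    """Sum of correlation weights over the active SMILES atoms."""
    return sum(weights.get(a, 0.0) for a in smiles_atoms(smiles) if a in active)


# Hypothetical training SMILES and correlation weights, for illustration only.
training = ["CCBr", "CCl", "c1ccccc1Br"]
active = active_attributes(training, threshold=1)
weights = {"C": 0.25, "Br": 0.5, "[N+]": 1.0}
print(sorted(active), descriptor("C[N+](C)Br", weights, active))
```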

Model of Eye Irritation
The model of eye irritation is defined as

Monte Carlo Optimization
Equation (1) needs the numerical data for the CW, which are calculated by the Monte Carlo optimization. Here, two target functions (TF_0 and TF_1) for the Monte Carlo optimization are examined. In these functions, r_AT and r_PT are the correlation coefficients between the observed and predicted endpoints for the active and passive training sets, respectively, and IIC_C is the index of ideality of correlation [12]. IIC_C is calculated with data on the calibration set, where r_C is the correlation coefficient between the observed and calculated values of the endpoint on the calibration set; the subscript 'C' indicates the calibration set. "Observed" and "calculated" are the corresponding values of y used to define the categories (active/inactive).
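For orientation, the index of ideality of correlation can be sketched as it is commonly defined in the literature on this index [12]: the correlation coefficient multiplied by the ratio of the smaller to the larger of two mean absolute errors, computed separately for negative and non-negative residuals (observed minus calculated). The sketch below follows that general definition and is not a reproduction of the equations of this paper. The function names, the return value of 0.0 when all residuals have the same sign, and the example data are assumptions, and the weighting of the terms inside TF_0 and TF_1 is not reproduced.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def iic(observed, calculated):
    """Index of ideality of correlation on one set of compounds."""
    residuals = [o - c for o, c in zip(observed, calculated)]
    negative = [abs(d) for d in residuals if d < 0]
    non_negative = [abs(d) for d in residuals if d >= 0]
    if not negative or not non_negative:
        return 0.0  # assumption: undefined ratio is treated as zero
    mae_neg = sum(negative) / len(negative)
    mae_pos = sum(non_negative) / len(non_negative)
    ratio = min(mae_neg, mae_pos) / max(mae_neg, mae_pos)
    return pearson(observed, calculated) * ratio


# Hypothetical calibration-set values, for illustration only.
observed = [1, 1, 0, 0]
calculated = [0.9, 1.3, 0.1, -0.1]
print(iic(observed, calculated))
```

The ratio penalizes models whose errors are systematically larger on one side of the regression line, which is why the index can differ markedly from the plain correlation coefficient.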

The System of Self-Consistent Models
The system of self-consistent models [11,13,14] for five random splits into the training (visible) and validation (invisible) sets confirms the good predictive potential of the models.
The training set here is divided into the active training, passive training, and calibration sets. Thus, the difference between models reflects the difference in the training sets. However, the key attribute of the system of self-consistent models is the unified method for the validation of these models: each i-th model has its own i-th validation set. The validation sets are far from identical (Table S1, Supplementary Materials). This means that multiple conditions are explored, each obtained by chance, so the results are representative of a set of cases and should be evaluated jointly.
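The measure of self-consistency is the average value and dispersion of the Matthews correlation coefficient (MCC) on different validation sets. The corresponding computational experiments are represented by the matrix:

$$
\begin{array}{c|cccc}
 & V_1 & V_2 & \cdots & V_5 \\ \hline
M_1 & \mathrm{MCCv}_{11} & \mathrm{MCCv}_{12} & \cdots & \mathrm{MCCv}_{15} \\
M_2 & \mathrm{MCCv}_{21} & \mathrm{MCCv}_{22} & \cdots & \mathrm{MCCv}_{25} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
M_5 & \mathrm{MCCv}_{51} & \mathrm{MCCv}_{52} & \cdots & \mathrm{MCCv}_{55}
\end{array}
$$

where M_i is the i-th model; V_j is the list of compounds employed as the validation set in the case of the j-th split; and MCCv_ij is the Matthews correlation coefficient obtained when the i-th model is applied to the j-th validation set.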
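Summarizing such a matrix by its average and dispersion can be sketched as below. The sketch assumes that the matrix is stored as a list of rows (one row per model, one column per validation set) and that dispersion is measured by the sample standard deviation; the function name `self_consistency` and the MCC values are hypothetical.

```python
from statistics import mean, stdev


def self_consistency(mcc_matrix):
    """Summarize a matrix where mcc_matrix[i][j] is the MCC of model i
    on validation set j.

    Returns (per_model, overall): per_model is a list of (mean, stdev)
    pairs, one per row; overall is the (mean, stdev) over all entries.
    """
    per_model = [(mean(row), stdev(row)) for row in mcc_matrix]
    all_values = [value for row in mcc_matrix for value in row]
    return per_model, (mean(all_values), stdev(all_values))


# Hypothetical MCC values for three models and three validation sets.
matrix = [
    [0.90, 0.86, 0.88],
    [0.87, 0.91, 0.85],
    [0.89, 0.84, 0.90],
]
per_model, overall = self_consistency(matrix)
print(per_model)
print(overall)
```

A small dispersion relative to the mean suggests that the statistical quality does not depend strongly on which split produced the model.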

Results
The models for five random splits are the following:

Table 2 sets out the statistical characteristics of the models for five random splits. One can see that these characteristics are quite similar for all five random splits. The average value of MCC for the validation sets is 0.8891 ± 0.0153. The numbers of parameters optimized in the Monte Carlo optimization process are 31, 28, 30, 30, and 31 for splits 1-5, respectively.

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (18)$$

Table 2 indicates that the evolution of the model enables us to optimize it. The statistics, already very good for the active and passive training sets, are even better for the validation set. The model optimized on the calibration set is expected to provide similar statistics when tested on new substances, as is the case for the validation set. In fact, the values for the calibration and validation sets are quite similar. This provides evidence that the model can be expected to give good results when used for new substances. As described above, the training set is not balanced because there are more toxic substances. In these conditions, the use of accuracy may provide a biased picture of the model's performance. The most represented class is commonly predicted more easily, and indeed in our case the sensitivity values are higher than the specificity values. For a better evaluation of the overall performance of the model on unbalanced datasets, the MCC is more appropriate; MCC ranges from −1 to 1, and good values are those above 0.5.

Table 3 lists the correlation weights of the SMILES atoms for splits 1-5. Many of the SMILES attributes are rare, meaning they are present in only a few substances, and their roles in the five splits are different. This implies that these controversial features can hardly be considered a reliable basis for statistical conclusions. In our case, these attributes are not used in the simulation. The correlation weights, and even their lists, are different for the five random splits, but the statistical quality of the models is good and similar across splits. To explain this, we have to bear in mind that the data set, with more than 5000 substances, is large enough to generate several good models, which are not identical because the substances were split randomly. Each model has an adequate statistical basis, but depending on the composition of the training set, different features may be extracted. Table 1 gives an example of the calculation of the property value Y for a simple SMILES.
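The criteria discussed above can be computed from the four counts as in the following minimal sketch. Only the sensitivity formula appears in the surviving text; specificity, accuracy, and MCC follow their standard textbook definitions. The function name `classification_metrics`, the convention of returning 0.0 when a denominator is zero, and the counts in the example are assumptions.

```python
import math


def _ratio(numerator, denominator):
    """Safe division; returns 0.0 when the denominator is zero (assumption)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical unbalanced counts: many actives, few inactives.
print(classification_metrics(tp=95, tn=10, fp=15, fn=5))
```

With counts like these, accuracy stays high because the majority class dominates, while specificity and MCC expose the weak prediction of the minority class.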
As noted above, to obtain a basis for a mechanistic interpretation, it is necessary to carry out several runs of optimizing the correlation weights of the molecular features extracted from the SMILES. Table 4 gives the results of such tests using splits 1-5. Table 5 indicates that for the system of models under consideration, there are promoters of both an increase and a decrease in the activity. There is a certain analogy in the lists but also differences. Table 6 contains a comparison of the similarities and differences in the mentioned lists. Two factors allow a molecular feature to be evaluated as a potential promoter of an increase in the activity in question: the frequency of its occurrence and the stability of its correlation weight. It should be noted that it is more appropriate to talk not so much about an "increase" as about the "probability of activity" for each specific substance. For example, bromine (Br) is present in all the lists of promoters of an increase in the likelihood of activity, while nitrogen (n, i.e., nitrogen in an aromatic ring, according to the SMILES nomenclature) is present in all the lists of molecular features reducing the probability of activity for the five splits. The frequency of phosphorus and fluorine is too low to assess them as potential promoters of activity or inactivity (Tables 4 and 5). We can therefore see whether a certain molecular feature has been selected in all five splits, which is the preferable situation. The value of the coefficient and its consistency is another important feature. If the coefficient is close to 0, its role is smaller. If the coefficient is variable, we expect greater uncertainty about its role.
For instance, bromine 'Br' contributes to eye irritation, and the coefficients span from 0.11 to 0.42. Silicon and boron are other features consistently contributing to eye irritation with relatively high coefficients. The triple bond (represented by the symbol #) is a feature selected in all the splits, but the coefficients are lower: 0.002 (almost negligible), 0.02 (quite small), and 0.26. In addition, in some splits, there are molecular features with very high coefficients, but these findings are not replicated in other splits. This is the case for charged nitrogen '[N+]', with coefficients of 0.87 and 1.02 in two of the five splits, while it has not been selected in the third. This is probably because of the particular composition of the splits, with certain substances present in the calibration and validation sets.
Similar considerations hold for the molecular features that reduce eye irritation.
A plus denotes the presence of a molecular feature in the list of promoters for an increase or decrease of the "probability of activity".
Table 5 presents the results of our computer experiments, which set out to identify the molecular fragments of the SMILES that influence the likelihood of a substance affecting the eyes. It can be seen that some features consistently act as indicators of such effects, for example bromine 'Br', silicon '[Si]', and nitrogen ('N' and 'n'). There are also fragments that may have such an influence depending on the distribution of the substances into the training and validation sets, for instance the triple bond '#', phosphorus 'P', sulfur 'S', and some others.
The applicability domain in each split was defined from the statistical defects [15]. The Supplementary Materials section contains these data for split 1 (Table S2). This is not a 'strict' way of determining the applicability domain. Outliers for this approach are "suspicious" molecules that have a significant number of rare molecular features.
Table 6 compares the statistical quality of the different categorical models of eye irritation, also considering those published in the recent literature.
The difference between the model-building system considered here and the more generally accepted ones is the use of a structured training set that includes three functionally different groups of compounds. The active and passive training sets complement each other, postponing the onset of overtraining. The third functional component (the calibration set) should catch the moment when overtraining starts. However, overtraining may not occur at all due to the use of the index of ideality of correlation. In other words, the feasibility of using the index of ideality of correlation to improve the statistical quality of models for external validation sets, already demonstrated for various substances and endpoints [12,16,17], is confirmed once again.

Discussion
The disadvantage of the scheme considered here (the system of self-consistent models) is the need to compare a fairly large amount of numerical data and to divide all the available substances into four sets. Furthermore, only by considering several splits into training and validation sets can one get an idea of the genuine predictive potential of a particular approach. The splits used here are generated through a random process. If we consider a random process as a sequence of changes in random variables, then, in the case under consideration, the random variables are the determination coefficients and their standard deviations. We get two levels of random processes. The first level is the change in the specified quantities during one random process, that is, building a model for a fixed division into the training and validation sets. The second level is the consideration of a group of random processes for different splits into the training and validation sets. The second level is designed to answer the following questions. Question 1: Is it possible to build a model with our approach for the selected parameters? Question 2: How reproducible are the statistical characteristics of the model in the chosen parameterization for various distributions into the training and validation sets? Question 3: If the statistical characteristics are reproducible, what is their variance?
Knowing the answers to these questions, one can assess how reliable the model is. The self-consistency of the models does not mean that they are identical. Each model is different, as discussed above, but each model may be appropriate. Even the proportions in the distribution of the available data into the active training, passive training, calibration, and validation sets are not subject to any restrictions. However, according to many computational experiments, the best results are obtained with equal numbers of compounds in each of the four sets.
The main methodological novelty applied in this study is the improved exploitation of the statistical parameters governing the modelling task. It is possible to obtain an apparently good model by chance, but its results will not be replicated when the model is applied to new substances. The algorithms that we implemented in the past and applied here are useful for reducing the risk of overfitting the models. As explained above, the approach rests on the index of ideality of correlation and the system of self-consistent models. This approach requires splitting the data into different subsets, which means added complexity, as discussed above. At the same time, however, it provides the opportunity to optimize the parameters of the final model, such as the coefficients, using the calibration sets. This approach is rewarding in terms of the performance on the validation set, as documented above.
Protecting animals from cruelty is one of the main motivations for the development of QSAR in general, as well as for using QSAR to estimate eye irritation [18]. To this end, the development of binary models that give a forecast in the form of "active"/"inactive" is encouraged. There are developments aimed at building multivariate categorical models of eye irritation obtained through decision trees for individual substances [18], as well as for mixtures [19]. Any models for eye irritation are very unstable in cases of significant structural variations in molecules [6]. To overcome these difficulties, it is necessary, on the one hand, to improve (expand) the training databases and, on the other hand, to improve methods for validating the predictive potential of the models [20,21]. Taking these circumstances into account, one of the most representative databases was selected to test the considered approach, and the self-consistency of the models was studied to search for ways to improve the criteria of the predictive potential of the models. The approach underlying these models of eye irritation is simple to the point of primitiveness, since only the simplest features of molecular structures are involved. This, combined with the usual statistical assessment based on comparing average values and their variances, is perhaps the main cause of the good statistical quality of the models (note that complication almost always leads to overtraining). The involvement of the index of ideality of correlation somewhat complicates the stochastic Monte Carlo optimization process. However, this complication is justified here and in other works where the mentioned index was used [11][12][13][14]17].
We note that this model starts with substances classified into two classes: active or inactive. Thus, there is no information about the potency of the effect. For other endpoints, for instance skin sensitization, a finer granularity of effects has been introduced. For eye irritation specifically, it would be useful in the future to distinguish between effects of different severity, including chemical burns (irritation versus corrosion).

Conclusions
Building models by structuring the compounds selected for the training set into three functionally different subsets can effectively improve the predictive potential of categorical models for eye irritation based on semi-correlations. The unusual impact of the index of ideality of correlation on the simulation process allows for an improvement in the statistical quality of the model for the external validation set based on the good results for the calibration set. This improves the predictive potential of the categorical models. Using this methodology, we obtained good models for eye irritation. The model is relatively simple since it only requires the SMILES structure of the substances, without calculating molecular descriptors. The model's interpretation is simple as well, since it points out which atom or simple molecular feature indicates an increase or decrease of the activity. However, it should be noted that these recommendations are of a qualitative (non-quantitative) nature.

Table 2. The statistical characteristics of the models for eye irritation for five random splits.

Table 3. The correlation weights for SMILES atoms that are used to calculate Y in the case of splits 1-5.

Table 4. The roles of the different molecular features that provide the basis for the mechanistic interpretation in terms of the three classes of molecular features.
* NA, NP, and NC are the frequencies of a molecular feature in the active training, passive training, and calibration sets, respectively.

Table 5. Frequency of occurrence of different molecular features in the lists of promoters of an increase or decrease in the probability of activity observed for five random splits.

Table 6. Statistical characteristics of models of eye irritation for the external validation set.