In Silico Simulation of Impacts of Metal Nano-Oxides on Cell Viability in THP-1 Cells Based on the Correlation Weights of the Fragments of Molecular Structures and Codes of Experimental Conditions Represented by Means of Quasi-SMILES

A simulation of the effect of metal nano-oxides at various concentrations (25, 50, 100, and 200 µg/mL) on cell viability in THP-1 cells (%), based on data on the molecular structure of the oxide and its concentration, is proposed. The simplified molecular input-line entry system (SMILES) was used to represent the molecular structure. So-called quasi-SMILES extend the usual SMILES with special codes for experimental conditions (here, concentration). The approach based on building up models using quasi-SMILES is self-consistent, i.e., the predictive potential of the group of models obtained by random splits into training and validation sets is stable. The Monte Carlo method was used as the basis for building up this group of models. The CORAL software was used to carry out the Monte Carlo calculations. The average determination coefficient for the five different validation sets was R² = 0.806 ± 0.061.


Introduction
Nano-safety assessments are often conducted in live organisms, including fish, mice, and rats [1,2]. However, since the European Union and US regulatory authorities consider the development of alternative animal-free testing strategies to be the most important challenge for the future chemical risk assessment of nano-materials, interest in developing in silico approaches to this task has increased considerably [3]. The lack of structured and systematized databases remains a factor that hinders the development of methods for simulating the physicochemical and biochemical behaviour of nano-materials [4–10]. Nevertheless, work on methods for assessing nano-safety is being carried out, and the number of such studies is growing [11–22]. Nano-safety assessments are in high demand and cover a wide variety of nano-materials that are increasingly penetrating the everyday life of modern society. One of the main directions of these studies is the development of models of the environmental consequences of using nano-substances in industry, medicine, and everyday life.
The first attempts to develop in silico approaches to this problem were based on the set of molecular descriptors developed for traditional substances (organic, inorganic, coordination). At the same time, calculated molecular descriptors were combined with experimentally determined numerical data on various physicochemical and biochemical characteristics of nano-materials to develop in silico models of the properties of nano-materials [7].
The development of a special format for presenting data on nano-materials is another concept for building in silico models for nano-materials. This format is abbreviated ISA-TAB-Nano (Investigation/Study/Assay Tabular) [8].
A convenient compromise between the need to have expensive experimental data on nano-materials and the need to quickly evaluate a rapidly expanding list of nano-materials in practical use is the "read-across" approach [9].
Finally, the quasi-SMILES method is an effective way to construct models of the physicochemical and biochemical behaviour of nano-materials in the absence of systematized databases [23–35]. To a first approximation, the method consists of two steps. First, a list of conditions (for example, concentrations of reagents) and circumstances (for example, the presence of certain chemical elements) is made, and each of them is designated by a special code. Second, the correlation contribution of each code to a stochastic model of a given endpoint is evaluated using the Monte Carlo method.
The advantages of using quasi-SMILES are the convenience of formulating problems for in silico modelling and the clarity of the results obtained. The disadvantage of this approach is a significant variance in the results, so practical reliability can be achieved only by conducting a large number of stochastic computer experiments. It should be noted that the index of ideality of correlation and the correlation intensity index have not previously been used in building such models.
Here, the possibility of using the above-mentioned approach to simulate the impact of metal nano-oxides (at different concentrations) on cell viability in THP-1 cells, expressed as a percentage, was examined. The calculations described here were carried out with the CORAL software (http://www.insilico.eu/coral, accessed on 10 January 2023, Italy).

Models
The computational experiments with five random splits gave models characterized by quite close predictive potential (average determination coefficient R² = 0.806 ± 0.061). Table 1 shows the statistical characteristics of the models. Figure 1 shows the graphical representation of the model for cell viability in THP-1 cells observed for split 1.
* A = active training set; P = passive training set; C = calibration set; V = validation set; n = the number of quasi-SMILES in a set; R² = the determination coefficient; CCC = the concordance correlation coefficient; IIC = the index of ideality of correlation; CII = the correlation intensity index; Q² = cross-validated (leave-one-out) R²; RMSE = root mean squared error; F = Fisher F-ratio; NCW = the number of parameters involved in the Monte Carlo optimization.
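Three of the criteria listed in the note to Table 1 can be illustrated with a minimal Python sketch. It assumes that R² is the squared Pearson correlation between observed and calculated values and that CCC is Lin's concordance correlation coefficient with population (1/n) variances; the function names are illustrative and do not come from CORAL.

```python
import math


def r2(obs, calc):
    """Squared Pearson correlation between observed and calculated values."""
    n = len(obs)
    mo, mc = sum(obs) / n, sum(calc) / n
    sxy = sum((o - mo) * (c - mc) for o, c in zip(obs, calc))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((c - mc) ** 2 for c in calc)
    return sxy * sxy / (sxx * syy)


def ccc(obs, calc):
    """Lin's concordance correlation coefficient (population variances)."""
    n = len(obs)
    mo, mc = sum(obs) / n, sum(calc) / n
    sxy = sum((o - mo) * (c - mc) for o, c in zip(obs, calc)) / n
    sxx = sum((o - mo) ** 2 for o in obs) / n
    syy = sum((c - mc) ** 2 for c in calc) / n
    return 2 * sxy / (sxx + syy + (mo - mc) ** 2)


def rmse(obs, calc):
    """Root mean squared error."""
    return math.sqrt(sum((o - c) ** 2 for o, c in zip(obs, calc)) / len(obs))
```

A constant offset between observed and calculated values leaves R² at 1 but lowers CCC, which is why the two criteria are reported side by side.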

Mechanistic Interpretation
Having the numerical data on the correlation weights of the codes applied in quasi-SMILES, as observed in several runs of the Monte Carlo optimization, one is able to detect three categories of these codes:
I. Codes that have a positive value of the correlation weight in all runs. These are promoters of endpoint increase;
II. Codes that have a negative value of the correlation weight in all runs. These are promoters of endpoint decrease;
III. Codes that have both negative and positive values of the correlation weight in different optimization runs. These codes have an unclear role (one cannot classify these features as promoters of endpoint increase or decrease).
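The three categories can be sketched in a few lines of Python. The input layout (a dictionary mapping each code to its list of correlation weights, one per run) and the code names in the usage example are hypothetical; a weight of exactly zero in any run is treated here as "unclear", which is an assumption.

```python
def classify_codes(weights_by_run):
    """Assign each code a role from the sign of its weights across runs."""
    roles = {}
    for code, weights in weights_by_run.items():
        if weights and all(w > 0 for w in weights):
            roles[code] = "promoter of increase"
        elif weights and all(w < 0 for w in weights):
            roles[code] = "promoter of decrease"
        else:
            roles[code] = "unclear"
    return roles


# Hypothetical weights from three optimization runs.
example = {
    "code_a": [0.4, 1.1, 0.7],
    "code_b": [-0.9, -0.2, -1.5],
    "code_c": [0.3, -0.6, 0.1],
}
roles = classify_codes(example)
```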
In the analysis of cell viability, promoters of decrease are of practical significance. Table 2 shows the collection of promoters of decrease in cell viability.
Table 2. Promoters (↓) of decreased cell viability in THP-1 cells, according to computational experiments with five random splits.

Applicability Domain
The applicability domain for the described model, calculated with Equation (1), is defined by the so-called statistical defects of quasi-SMILES codes [36]. The percentage of outliers according to this criterion equals 27%, 13%, 17%, 10%, and 13% for splits 1 to 5, respectively.

Discussion
In this study, only one additional parameter was available for model development in addition to the molecular structure (transmitted via SMILES), namely the concentration of metal oxide nano-particles. Nevertheless, the results obtained are, in fact, quite reliable models of cell viability in THP-1 cells.
It should be noted that the present approach makes it quite easy to improve the predictive potential of the model if additional experimental data are available that can be represented as additional codes for the quasi-SMILES extension. There are examples of works where representative lists of codes for quasi-SMILES are applied in practice [36,37]. Thus, simulation by means of the quasi-SMILES technique offers both simplicity and universality. Consequently, quasi-SMILES can find numerous applications as a tool for developing models of phenomena influenced by an eclectic set of factors.
It is possible to use the optimal descriptors considered here in conjunction with classical descriptors developed based on information theory ideas, physicochemical parameters (solubility, density, octanol/water distribution coefficient), biochemical characteristics (toxicity, drug effects), or the invariants of the molecular graph (multigraph). The above abilities of the quasi-SMILES technique are especially convenient for a situation related to non-standard objects for the simulation, such as mixtures, peptides, and nano-materials.
No less interesting are the prospects for developing the objective functions used for the Monte Carlo optimization described here. So far, objective functions based on correlations have been studied; instead of correlations, however, they could be based on the entropy values of fuzzy sets generated by various divisions of the available data into training and verification subsets.
Like most stochastic approaches, the quasi-SMILES technique makes it possible to analyse existing experimental data, but the possibilities for extrapolating the considered approach are limited. In other words, this approach can be useful only for situations close to those that have been studied in detail in a direct experiment. At the same time, work with experimentally determined data sets can be used for the inverse problem, that is, the selection of experimental characteristics that are promising or, on the contrary, useless, according to the number of available experimental states of the data system under study.
Supplementary materials contain input files for the five splits examined here, together with the CORAL method used in this work.

Data
In [3], the impact of metal oxide nano-particles on cell viability in THP-1 cells was tested at eight concentrations (0, 3.1, 6.2, 13, 25, 50, 100, and 200 µg/mL). Non-zero effects of these nano-particles on cell viability in THP-1 cells were observed only from a concentration of 25 µg/mL upwards. Only non-zero effects were used to build the model. Under such circumstances, the total number of situations (oxide-concentration-cell viability) equals 120. Each situation is represented by a quasi-SMILES. These quasi-SMILES are distributed into four special sub-sets: (i) active training set; (ii) passive training set; (iii) calibration set; and (iv) validation set. Each sub-set contains about 25% of the total list of quasi-SMILES. Five random splits were examined here as a basis to build up the model of cell viability in THP-1 cells.
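The distribution into four sub-sets of about 25% each, repeated for five random splits, can be sketched as follows. This is a minimal illustration using Python's standard random module, not the splitting procedure implemented in CORAL; the seeds and sub-set names are assumptions.

```python
import random

SUBSETS = ("active_training", "passive_training", "calibration", "validation")


def random_split(items, seed):
    """Shuffle the items and deal them into four near-equal sub-sets."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return {name: shuffled[i::len(SUBSETS)] for i, name in enumerate(SUBSETS)}


# 120 situations, identified here simply by their index; five random splits.
splits = [random_split(range(120), seed) for seed in range(1, 6)]
```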
Each of the above sets has a defined task. The active training set is used to build the model. Molecular features extracted from the quasi-SMILES of the active training set are involved in the Monte Carlo optimization, which provides the correlation weights for these features that give the maximal value of the target function. The target function is calculated using the descriptors (the sums of the correlation weights) and the endpoint values on the active training set. The task of the passive training set is to check whether the model obtained for the active training set is satisfactory for quasi-SMILES that were not involved in the active training set. The calibration set should detect the start of overtraining (overfitting). The optimization must stop if overtraining starts. After the optimization procedure has stopped, the validation set is used to assess the predictive potential of the obtained model. Figure 2 demonstrates the generalized scheme of constructing a quasi-SMILES for an arbitrary situation. Figure 3 shows the general scheme of applying the codes of quasi-SMILES (Qk) to calculate the optimal descriptor for an arbitrary situation. Table 3 contains split 1 for the total list of quasi-SMILES, together with the experimental and calculated values of cell viability in THP-1 cells.
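The construction of a quasi-SMILES for one situation can be sketched as the SMILES of the oxide followed by a code for the concentration. The concentration codes and the SMILES string below are hypothetical placeholders chosen for illustration; the actual codes used in this work are those given in Figure 2 and Table 3.

```python
# Hypothetical codes for the four concentrations (µg/mL) used in the model.
CONCENTRATION_CODES = {25: "%11", 50: "%12", 100: "%13", 200: "%14"}


def quasi_smiles(smiles, concentration):
    """Append the code of the experimental condition to the SMILES string."""
    return smiles + CONCENTRATION_CODES[concentration]


# Illustrative oxide SMILES at 50 µg/mL (an assumed example, not taken from Table 3).
example = quasi_smiles("O=[Zn]", 50)
```

Each (oxide, concentration) pair thus yields one string, so one oxide tested at four concentrations contributes four quasi-SMILES.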

Optimal Descriptor
The optimal descriptor is the sum of the correlation weights of the quasi-SMILES codes obtained by the Monte Carlo method (Figure 3):
$$\mathrm{DCW}(T, N) = \sum_{k} \mathrm{CW}(Q_k)$$
The values of the optimal descriptor serve as the basis for the model of cell viability, calculated by the formula
$$\text{Cell viability (\%)} = C_0 + C_1 \times \mathrm{DCW}(T, N) \qquad (1)$$
where $\mathrm{CW}(Q_k)$ is the correlation weight of the code $Q_k$, and $C_0$ and $C_1$ are regression coefficients. The optimal descriptor depends on the style of the Monte Carlo optimization. T and N are parameters of the optimization procedure. T is a threshold applied to define rare codes; if T = 1, codes absent from the active training set are rare. The rare codes are not involved in the modelling process (their correlation weights are zero). N is the number of epochs in the Monte Carlo optimization.
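The calculation of the optimal descriptor and of the endpoint can be sketched as below. The sketch assumes that a code is rare when it occurs in fewer than T quasi-SMILES of the active training set and that the model is a one-variable linear regression on the descriptor; all names, weights, and coefficients are hypothetical.

```python
def optimal_descriptor(codes, weights, counts, threshold=1):
    """Sum the correlation weights of the codes; rare codes contribute zero."""
    return sum(
        weights.get(code, 0.0)
        for code in codes
        if counts.get(code, 0) >= threshold
    )


def predict(codes, weights, counts, c0, c1, threshold=1):
    """Endpoint as a linear function of the optimal descriptor."""
    return c0 + c1 * optimal_descriptor(codes, weights, counts, threshold)


# Hypothetical correlation weights and active-training-set frequencies.
weights = {"O": 0.5, "=": -0.25, "[Zn]": 1.0, "%12": -2.0}
counts = {"O": 3, "=": 3, "[Zn]": 3, "%12": 0}  # "%12" is absent, hence rare
codes = ["O", "=", "[Zn]", "%12"]
descriptor = optimal_descriptor(codes, weights, counts)
viability = predict(codes, weights, counts, c0=10.0, c1=4.0)
```

With T = 1 the weight of the absent code is ignored, so the descriptor is 0.5 − 0.25 + 1.0 = 1.25 and the predicted value is 10 + 4 × 1.25 = 15.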

Monte Carlo Method
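Equation (1) needs the numerical values of the above correlation weights. Monte Carlo optimization is the tool used to calculate those correlation weights. Two target functions for the Monte Carlo optimization are examined here. In these target functions, $r_{AT}$ and $r_{PT}$ are the correlation coefficients between the observed and predicted endpoint values for the active and passive training sets, respectively, and IIC is the index of ideality of correlation [33,34].
The IIC is calculated using data from the calibration set as follows:
$$\mathrm{IIC} = r_C \times \frac{\min({}^{-}\mathrm{MAE}_C,\ {}^{+}\mathrm{MAE}_C)}{\max({}^{-}\mathrm{MAE}_C,\ {}^{+}\mathrm{MAE}_C)}$$
$${}^{-}\mathrm{MAE}_C = \frac{1}{{}^{-}N}\sum_{\Delta_k < 0} |\Delta_k|, \qquad {}^{+}\mathrm{MAE}_C = \frac{1}{{}^{+}N}\sum_{\Delta_k \geq 0} |\Delta_k|$$
$$\Delta_k = \mathrm{observed}_k - \mathrm{calculated}_k$$
Here $r_C$ is the correlation coefficient for the calibration set, ${}^{-}N$ and ${}^{+}N$ are the numbers of negative and non-negative $\Delta_k$, and $\mathrm{observed}_k$ and $\mathrm{calculated}_k$ are the corresponding values of the endpoint.
The correlation intensity index (CII), similar to the above IIC, was developed as a tool to improve the quality of the Monte Carlo optimization aimed at building up QSPR/QSAR models. The CII is calculated as follows:
$$\mathrm{CII} = 1 - \sum_{k} \mathrm{Protest}_k$$
$$\mathrm{Protest}_k = \begin{cases} R_k^2 - R^2, & \text{if } R_k^2 - R^2 > 0 \\ 0, & \text{otherwise} \end{cases}$$
$R^2$ is the determination coefficient for a set that contains n substances. $R_k^2$ is the determination coefficient for the n − 1 substances of the set after removal of the k-th substance. Hence, if $(R_k^2 - R^2)$ is larger than zero, the k-th substance is an "oppositionist" for the correlation between the experimental and predicted values of the set. A small sum of "protests" means a more "intensive" correlation.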
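The two indices can be sketched in Python as follows. The sketch assumes the definitions commonly given for the IIC and CII in the CORAL literature: the IIC scales the correlation coefficient by the ratio of the smaller to the larger of the mean absolute errors for negative and non-negative residuals, and the CII subtracts from 1 the sum of positive leave-one-out gains in R². The handling of empty residual groups and of a perfect fit is an assumption of this sketch.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def iic(observed, calculated):
    """Index of ideality of correlation for one set."""
    r = pearson(observed, calculated)
    deltas = [o - c for o, c in zip(observed, calculated)]
    negative = [abs(d) for d in deltas if d < 0]
    positive = [abs(d) for d in deltas if d >= 0]
    mae_neg = sum(negative) / len(negative) if negative else 0.0
    mae_pos = sum(positive) / len(positive) if positive else 0.0
    larger = max(mae_neg, mae_pos)
    if larger == 0:  # perfect fit: no error asymmetry to penalize (assumption)
        return r
    return r * min(mae_neg, mae_pos) / larger


def cii(observed, calculated):
    """Correlation intensity index: 1 minus the sum of 'protests'."""
    r2_all = pearson(observed, calculated) ** 2
    protests = 0.0
    for k in range(len(observed)):
        obs_k = observed[:k] + observed[k + 1:]
        calc_k = calculated[:k] + calculated[k + 1:]
        gain = pearson(obs_k, calc_k) ** 2 - r2_all
        if gain > 0:  # removing item k improves the correlation
            protests += gain
    return 1.0 - protests
```

When negative and non-negative residuals have the same mean absolute error, the IIC equals the correlation coefficient; any asymmetry lowers it.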
The Monte Carlo optimization aims to maximize the target function. The target function TF1 [37] is based on the application of two new criteria of predictive potential: the index of ideality of correlation [33,34] and the correlation intensity index [38,39].
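The idea of finding correlation weights by stochastic optimization can be sketched with a generic random hill-climbing loop. This is not the CORAL procedure: the target here is simply the Pearson correlation between the descriptor and the endpoint on a single set, without the passive training set, the calibration set, the IIC, or the CII, and the step size, seed, and toy data are assumptions.

```python
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def target(weights, samples):
    """Simplified target: correlation of the summed weights with the endpoint."""
    descriptors = [sum(weights[c] for c in codes) for codes, _ in samples]
    endpoints = [y for _, y in samples]
    return pearson(descriptors, endpoints)


def optimize(samples, epochs=20, step=0.5, seed=0):
    """Perturb one weight at a time; keep a change only if the target grows."""
    rng = random.Random(seed)
    codes = sorted({c for code_list, _ in samples for c in code_list})
    weights = {c: rng.uniform(-1.0, 1.0) for c in codes}
    best = target(weights, samples)
    history = [best]
    for _ in range(epochs):
        rng.shuffle(codes)
        for code in codes:
            old = weights[code]
            weights[code] = old + rng.uniform(-step, step)
            trial = target(weights, samples)
            if trial > best:
                best = trial
            else:
                weights[code] = old  # reject the move
        history.append(best)
    return weights, history


# Toy data: (list of codes, endpoint value); codes and values are invented.
samples = [(["A", "x"], 90.0), (["A", "y"], 70.0),
           (["B", "x"], 60.0), (["B", "y"], 35.0)]
weights, history = optimize(samples)
```

Because a move is accepted only when the target increases, the recorded target values never decrease from one epoch to the next.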

Conclusions
The quasi-SMILES technique gives quite satisfactory models for cell viability in THP-1 cells: the predictive potential of the corresponding models was shown to be reproducible across different splits into training and validation sets. There is variation in the statistical characteristics of these models; however, this variation is not too large. In other words, the results can be assessed as acceptable for practical use. In addition, it is confirmed that the predictive potential of the models can be improved by applying the index of ideality of correlation and the correlation intensity index.

Conflicts of Interest:
The authors declare no conflict of interest.