Development of Self-Consistency Models of Anticancer Activity of Nanoparticles under Different Experimental Conditions Using Quasi-SMILES Approach

Algorithms for simulating the anticancer activity of nanoparticles under different experimental conditions toward the cell lines A549 (lung cancer), THP-1 (leukemia), MCF-7 (breast cancer), Caco2 (cervical cancer), and hepG2 (hepatoma) have been developed using the quasi-SMILES approach. This approach is suggested as an efficient tool for the quantitative structure–property/activity relationship (QSPR/QSAR) analysis of the above nanoparticles. The model studied here is built up using the so-called vector of ideality of correlation, whose components are the index of ideality of correlation (IIC) and the correlation intensity index (CII). The epistemological component of this study is the development of methods for the registration, storage, and effective use of experimental situations, convenient for the experimentalist, in order to control the physicochemical and biochemical consequences of using nanomaterials. The proposed approach differs from traditional QSPR/QSAR models in the following respects: (i) not molecules but experimental situations available in a database are considered; in other words, an answer is offered to the question of how to change the design of an experiment in order to achieve the desired values of the endpoint being studied; and (ii) the user is able to select a list of controlled conditions available in the database that can affect the endpoint and to evaluate how significant the influence of the selected controlled experimental conditions is.


Introduction
Knowledge is the basis of all actions aimed at improving people's lives and the evolution of civilization as a whole. However, knowledge has internal contradictions. For example, in order to manage or even observe complex processes, it is necessary first to study the available information, which is the embodiment of the corresponding knowledge. If the knowledge is not structured, learning to use these disordered facts or skills becomes quite expensive and difficult in several respects, such as the long period needed to learn and the expensive equipment and software required. The main function of the quasi-SMILES concept examined here is to search for reasonably simple ways to study complex phenomena, and the simulation of the physicochemical and biochemical behavior of nanomaterials is quite a complex phenomenon.
Strangeness is one of the manifestations of reality about which it is difficult to speak clearly. Nevertheless, strangeness is often a property of things that are new, unexpected, or important. For the implementation of any activity, an economy of thinking is necessary. In practice, such savings can be achieved in various ways. Sorting is perhaps the simplest element of the thought-economy process: it consists of selecting the most informative of the data obtained under experimental conditions in the laboratory, in production, and in interaction with environmental circumstances (climate, epidemics, economic crises). However, sorting does not provide any guidance for decision making at the stage when the choice of priorities is made; models are needed at this stage, since having a model of a process can help one to manage it. The complexity of the choice is considerable because usefulness and harm can change places: harm can become a benefit, and the benefit can turn out to be harmful. For instance, the toxicity of nanoparticles is considered a useful quality because it can be used for good aims; however, left unattended, this toxicity can harm or even kill humans and animals. The list of nanoparticles is expanding exponentially, and the number of types of toxicity is by no means small. Obviously, under such circumstances, it is impossible to evaluate experimentally, in a timely manner, all nanoparticles that are used or can be used. However, the assessment of the physicochemical and biochemical behavior of a significant number of new nanoparticles using databases on already studied nanoparticles is quite feasible.
The key points in the development of such models are the need to reduce the memory and logic requirements of the users of the models. In other words, developers of models should provide user-friendly means of evaluating new nanoparticles. At the same time, it is highly desirable and important that such models consider the effect of possible changes in the corresponding directions for the experimental use of nanoparticles [1].
Quantitative structure–property/activity relationships (QSPRs/QSARs) are a well-known approach to establishing models of different endpoints considered as a mathematical function of molecular structure. A successful QSAR analysis is possible if and only if: (i) there is a large enough number of compounds with a clear definition of the congeneric features corresponding to the molecules; (ii) there is a hypothesis on how and which molecular features affect the endpoint (topological architecture, 3D configurations, quantum mechanical interactions, etc.); and (iii) the predictive potential of the model can be checked [2,3]. However, the QSPR/QSAR paradigm is widely applied to relatively traditional substances, such as organic, inorganic, and metal-organic chemicals and polymers. Attempts to use this paradigm for nanomaterials, on the other hand, face a rather complex situation. First, there are only small databases of experimentally measured basic endpoints, such as thermodynamic parameters and/or biochemical effects; in other words, selecting a series of nanomaterials with experimental data is a problem in itself. Second, the huge number of atoms in the majority of nanomaterials lessens the usefulness of traditional molecular descriptors: their values become insensitive to small molecular modifications.
There is an urgent need to clarify the approaches and methodology for measuring the biochemical potential of engineered nanomaterials. In fact, this is a problem of adapting computational and experimental approaches oriented toward "traditional" substances for application to nanomaterials. The possibility of employing computational approaches like nano-QSAR or nano-read-across to predict nanomaterial hazards based on some "standard" databases is attractive from a financial point of view. The attractiveness from an ethical point of view is also clear (minimal animal tests). Many research studies have endeavored to investigate the eco-toxicological hazards of engineered nanomaterials. However, little is known regarding nanomaterials' actual environmental risks, combining hazard and exposure data on a planetary scale [1].
It has been assumed that strangeness and research activity rarely intersect. However, when they meet, they either reinforce or disregard each other. For example, modelling, one of the most important and complex areas of research, can be summed up in the short aphorism "All models are wrong, but some are useful" [4].
Systematization of knowledge related to nanomaterials has become necessary due to the fast growth of applications of these "unusual" substances. Systematization involves various aspects of research activity. The development of approaches that allow for the simulation of different characteristics of nanomaterials, including their interactions with other species, is one of them. There are many methods to perform such simulations. One of the possible approaches is to carry simulations out using the so-called quasi-SMILES [5][6][7][8][9][10][11][12][13] approach. The traditional simplified molecular input line entry system (SMILES) [14] allows the molecular architecture to be represented via a sequence of symbol-codes. At the same time, the quasi-SMILES approach gives us the possibility of representing the experimental conditions or even any arbitrary eclectic data related to the behavior of nanomaterials via symbol-codes. Figure 1 displays the general scheme for the simulation of the biological effects of nanoparticles. This scheme was used to build up the models described below.

Data
The dataset used in this study includes measurements of the half-maximal effective (EC50), inhibitory (IC50), and lethal (LC50) concentration toxicity endpoints toward the cell lines A549 (lung cancer), THP-1 (leukemia), MCF-7 (breast cancer), Caco2 (cervical cancer), and hepG2 (hepatoma) under different experimental conditions (various nanoparticles, sizes, exposure times) for human cells. The indicated conditions and circumstances were represented by the special codes listed in Table 1. These codes are used for the construction of the quasi-SMILES that represent the above measurements of the toxicity of the studied nanoparticles [15].

The listed codes for quasi-SMILES make it possible to describe constructively the available experimental situations for developing models that predict the results of varying the codes (i.e., of varying the experiment). The system described can assess the statistical significance of individual experimental conditions (i.e., the above codes for quasi-SMILES). In other words, concentration values, exposure times, impacted objects, nanoparticle sizes, and other factors are taken into consideration to simulate the behavior of nanoparticles. After removing duplicates, the source [15] contains 935 measurements, representing data related only to human cells. The total set studied here includes 102 measurements. These data were randomly split into an active training set (≈25%), a passive training set (≈25%), a calibration set (≈25%), and a validation set (≈25%). The advantages of considering a structured training set (divided into an active training set, passive training set, and calibration set) are described in the literature [16]. Five such splits, each involving a different distribution of the data among the considered sets, are used to assess the reproducibility of the approach considered here for creating models [17].
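As an illustration of how experimental situations can be encoded, the sketch below builds quasi-SMILES strings from condition codes and performs the fourfold random split described above. This is a minimal Python sketch: the code letters, condition names, and helper functions (`to_quasi_smiles`, `split_four_ways`) are hypothetical and do not reproduce the actual codes of Table 1.

```python
import random

# Illustrative code table (hypothetical; the real codes are those of Table 1):
# each (condition, value) pair maps to a single symbol.
CODES = {
    ("nanoparticle", "TiO2"): "A",
    ("nanoparticle", "ZnO"): "B",
    ("cell_line", "A549"): "C",
    ("cell_line", "MCF-7"): "D",
    ("size", "small"): "E",
    ("size", "large"): "F",
    ("exposure_h", "24"): "G",
    ("exposure_h", "48"): "H",
}

def to_quasi_smiles(conditions):
    """Concatenate the codes of all experimental conditions into one string."""
    return "".join(CODES[c] for c in conditions)

def split_four_ways(records, seed=1):
    """Randomly distribute records among the active training, passive
    training, calibration, and validation sets (~25% each)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    q = len(shuffled) // 4
    return {
        "active": shuffled[:q],
        "passive": shuffled[q:2 * q],
        "calibration": shuffled[2 * q:3 * q],
        "validation": shuffled[3 * q:],
    }

qs = to_quasi_smiles([("nanoparticle", "TiO2"), ("cell_line", "A549"),
                      ("size", "small"), ("exposure_h", "24")])
print(qs)  # -> ACEG
```

Changing the `seed` argument produces a different random split, which is how several splits like the five used here could be generated.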

Optimal Descriptor
The optimal descriptor is the sum of the correlation weights of the quasi-SMILES codes obtained by the Monte Carlo method using the CORAL software (http://www.insilico.eu/coral, accessed on 29 May 2023). The values of the optimal descriptor serve as the basis for the model of the half-maximal concentration (HMC) (i.e., EC50, IC50, or LC50), calculated by the formula:

HMC = C0 + C1 × DCW(T, N) (1)

where DCW(T, N) is the optimal descriptor, i.e., the sum of the correlation weights of the quasi-SMILES codes (Table 1), and C0 and C1 are regression coefficients. The optimal descriptor depends on the selected method of Monte Carlo optimization of the correlation weights. T and N are parameters of the optimization procedure. T is a threshold applied to define rare codes; if T = 1, codes that are absent in the active training set are rare. Rare codes are not involved in modelling (their correlation weights are zero). N is the number of epochs of the Monte Carlo optimization.
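The descriptor-based prediction can be sketched as follows, assuming Equation (1) has the usual CORAL form HMC = C0 + C1 × DCW(T, N). The correlation weights, code letters, and function names below are illustrative assumptions; in CORAL the weights come from N epochs of Monte Carlo optimization, not from hand-set values.

```python
from collections import Counter

def blocked_codes(active_training_qs, all_codes, T=1):
    """Codes occurring fewer than T times in the active training set are
    'rare' and blocked: their correlation weights are fixed at zero."""
    counts = Counter(c for qs in active_training_qs for c in qs)
    return {c for c in all_codes if counts[c] < T}

def dcw(qs, weights, blocked):
    """DCW(T, N): sum of the correlation weights of the codes of one
    quasi-SMILES; blocked (rare) codes contribute nothing."""
    return sum(weights[c] for c in qs if c not in blocked)

def predict_hmc(qs, weights, blocked, c0, c1):
    """Equation (1): HMC = C0 + C1 * DCW(T, N)."""
    return c0 + c1 * dcw(qs, weights, blocked)

# illustrative correlation weights (hypothetical values)
W = {"A": 0.5, "B": -0.3, "C": 0.2, "D": 0.4}
train = ["AAC", "ABC", "AC"]        # active training set: "D" never occurs
blocked = blocked_codes(train, W, T=1)
print(blocked)                                              # -> {'D'}
print(round(predict_hmc("ABCD", W, blocked, 1.0, 2.0), 6))  # -> 1.8
```

Note how the code "D", absent from the active training set, is blocked at T = 1 and therefore contributes nothing to the prediction.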

Optimization of Correlation Weights
The optimal descriptors are calculated using the correlation weights obtained by the Monte Carlo optimization [16,17]. Two target functions of the optimization are compared here:

TF1 = rAT + rPT − |rAT − rPT| × 0.1 (2)

TF2 = TF1 + (IIC + CII) × 0.5 (3)

where rAT and rPT are the correlation coefficients between the experimental and predicted values for the active and passive training sets, respectively; IIC is the index of ideality of correlation [15–17]; and CII is the correlation intensity index [15–17]. Figure 2 contains examples of the optimization history with target functions TF1 and TF2; the figure demonstrates graphically the advantage of the target function TF2.
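The target functions and the IIC can be sketched as below. This is a hedged illustration rather than the CORAL implementation: TF1 and TF2 follow the forms commonly reported for CORAL (Equations (2) and (3)), and the IIC is computed here as the correlation coefficient scaled by the balance between the negative- and positive-residual mean absolute errors; the exact constants and numerical details of the original code may differ.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

def iic(obs, calc):
    """Index of ideality of correlation on one set: r scaled by the ratio of
    the smaller to the larger of the negative-/positive-residual MAEs
    (an assumption about the exact recipe)."""
    neg = [o - c for o, c in zip(obs, calc) if o - c < 0]
    pos = [o - c for o, c in zip(obs, calc) if o - c >= 0]
    if not neg or not pos:
        return 0.0
    mae_neg = sum(abs(e) for e in neg) / len(neg)
    mae_pos = sum(abs(e) for e in pos) / len(pos)
    return pearson_r(obs, calc) * min(mae_neg, mae_pos) / max(mae_neg, mae_pos)

def tf1(r_at, r_pt, const=0.1):
    """Equation (2): reward high, balanced correlations on both training sets."""
    return r_at + r_pt - abs(r_at - r_pt) * const

def tf2(r_at, r_pt, iic_cal, cii_cal, weight=0.5):
    """Equation (3): TF1 plus the vector of ideality of correlation
    (IIC + CII, evaluated on the calibration set)."""
    return tf1(r_at, r_pt) + (iic_cal + cii_cal) * weight
```

In a Monte Carlo run, the correlation weights would be perturbed and a perturbation kept whenever it increases the chosen target function.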

Mechanistic Interpretation
If Monte Carlo optimization is carried out several times, some components of the optimized correlation weights will have positive values in all optimization trials. Such correlation weights indicate the fragments of quasi-SMILES that are promoters of an increase in the studied endpoint. At the same time, some of the correlation weights will have only negative values; these indicate the fragments of quasi-SMILES that are promoters of a decrease in the simulated endpoint. Correlation weights with alternating values (positive in some runs of the Monte Carlo optimization and negative in others) have no mechanistic interpretation for the models under consideration.
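The sign-stability analysis described above can be sketched as follows; the run data and the function name are illustrative assumptions.

```python
def classify_codes(runs):
    """runs: list of dicts {code: correlation weight}, one dict per
    independent Monte Carlo optimization.  A code is a promoter of increase
    if its weight is positive in every run, a promoter of decrease if
    negative in every run, and uninterpretable if the sign alternates."""
    codes = set().union(*runs)
    out = {}
    for code in codes:
        ws = [run.get(code, 0.0) for run in runs]
        if all(w > 0 for w in ws):
            out[code] = "promoter of increase"
        elif all(w < 0 for w in ws):
            out[code] = "promoter of decrease"
        else:
            out[code] = "no interpretation"
    return out

# three hypothetical optimization runs
runs = [{"A": 0.4, "B": -0.2, "C": 0.1},
        {"A": 0.6, "B": -0.1, "C": -0.3},
        {"A": 0.5, "B": -0.3, "C": 0.2}]
print(classify_codes(runs))
# A is always positive, B always negative, C alternates in sign
```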


Applicability Domain
The applicability domain for the described model is defined via the so-called statistical defects of the codes used in quasi-SMILES. The defect dk of a code Sk can be calculated as follows:

dk = |P(Sk) − P′(Sk)|/(N(Sk) + N′(Sk)) + |P(Sk) − P″(Sk)|/(N(Sk) + N″(Sk)) + |P′(Sk) − P″(Sk)|/(N′(Sk) + N″(Sk)) (4)

where P(Sk), P′(Sk), and P″(Sk) are the probabilities of Sk in the active training set, passive training set, and calibration set, respectively, and N(Sk), N′(Sk), and N″(Sk) are the frequencies of Sk in those sets. The statistical defect of a quasi-SMILES (Dj) is calculated as

Dj = Σ dk (k = 1, …, NA) (5)

where NA is the number of non-blocked codes in the quasi-SMILES. A quasi-SMILES falls in the applicability domain if Dj is less than twice the average statistical defect over the active training set.

Table 2 contains an example of the model of biological activity related to different experimental situations represented via quasi-SMILES (split 1, target function TF2). However, since the statistical characteristics of a model can vary for different splits into the training and validation sets, it is necessary to consider a system of several different splits. Two CORAL methods are applied here for five random splits. The first CORAL method is Monte Carlo optimization with a target function that uses neither the vector of ideality of correlation nor the correlation weights of fragments of local symmetry (Equation (2)). This method gives the models represented in Table 3.
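A hedged sketch of the statistical-defect computation described above: dk combines pairwise differences of the probabilities of code Sk across the active training, passive training, and calibration sets, and Dj sums dk over the non-blocked codes of one quasi-SMILES. The data below and the in-domain criterion (twice the mean defect of the active training set) are illustrative assumptions.

```python
from collections import Counter

def code_stats(qs_list):
    """Frequencies N(Sk) and probabilities P(Sk) of the codes in one set."""
    n = Counter(c for qs in qs_list for c in qs)
    total = sum(n.values())
    p = {c: cnt / total for c, cnt in n.items()}
    return n, p

def defect(code, stats):
    """dk for one code: pairwise probability differences over the three sets,
    each damped by the summed frequencies of the pair."""
    (na, pa), (np_, pp), (nc, pc) = stats
    pairs = [((pa, na), (pp, np_)), ((pa, na), (pc, nc)), ((pp, np_), (pc, nc))]
    d = 0.0
    for (p1, n1), (p2, n2) in pairs:
        denom = n1.get(code, 0) + n2.get(code, 0)
        if denom:
            d += abs(p1.get(code, 0.0) - p2.get(code, 0.0)) / denom
    return d

def quasi_smiles_defect(qs, stats, blocked=frozenset()):
    """Dj: sum of code defects over the non-blocked codes of one quasi-SMILES."""
    return sum(defect(c, stats) for c in qs if c not in blocked)

# illustrative sets of quasi-SMILES (codes are hypothetical)
active = ["AAC", "ABC", "AC"]
passive = ["AB", "BC", "AAC"]
calib = ["ABC", "AC", "BB"]
stats = [code_stats(s) for s in (active, passive, calib)]

# in-domain criterion: Dj below twice the mean defect of the active training set
d_mean = sum(quasi_smiles_defect(q, stats) for q in active) / len(active)
print(quasi_smiles_defect("ABC", stats) < 2 * d_mean)  # -> True
```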

Results
* A = active training set; P = passive training set; C = calibration set; V = validation set; n = the number of samples in a set; R² = determination coefficient; CCC = concordance correlation coefficient; IIC = index of ideality of correlation; CII = correlation intensity index; Q² = leave-one-out cross-validated R²; RMSE = root mean squared error; F = Fisher F-ratio.
The second CORAL method is the Monte Carlo optimization with target function calculated by Equation (3) with the use of the vector of ideality of correlation together with the correlation weights of fragments of local symmetry. This method gives the models represented in Table 4.
One can see that the statistical characteristics of the models obtained with the second method are better than those obtained with the first. This is evidenced by the higher average determination coefficient for the validation set (cf. Tables 3 and 4).

Discussion
The most popular traditional QSAR modelling approach can be formulated as follows: (i) selection of a group of available and convenient descriptors; (ii) defining a model using training-set substances; and (iii) validating the model using external validation-set substances. One can formulate several questions related to the optimization of this approach. For example, how will the model's statistical quality change with the next division into training and testing samples? How can one avoid overfitting (i.e., a situation where a good model for the training set becomes a bad model for external substances)? How can one estimate the probability of obtaining a satisfactory and reliable model? In fact, the approach under consideration attempts to solve these problems using original idealizations, assumptions, and limitations.
Much excellent research is dedicated to nano-topics; nevertheless, even a simple question, e.g., whether a nanomaterial can be assessed using software, is quite ambiguous. The results of different estimations can vary depending on the personal experience of the expert conducting the study, and one cannot guarantee the reproducibility of these assessments.
Perhaps the main and convenient (from the user point of view) idealization of the considered approach is that instead of searching for sources of numerous descriptors, it is supposed to use "artificial" optimal descriptors, which can be tuned to correlate with the endpoint of interest. This assumption may not be correct. In this case, the approach under consideration is unsuitable for such a task, and a useful alternative approach to solve the task becomes necessary. However, there are cases where the approach discussed here has been useful [6][7][8][9][10][11][12].
The approach considered here has various advantages. First, one can use arbitrary data: there is no a priori knowledge before the experiment about whether such data can improve the model or not. The instability of the values of the correlation weights is a reliable indicator of the uselessness of the tested factor; conversely, stability is a significant indicator of the influence of the factor on the predictive potential of the model. Secondly, this approach makes it easy to change the set of correlation-weighted factors, thus radically changing the model. This facilitates fast evaluation of the benefits of various hypotheses related to the optimal list of factors involved in the model development process.
The universality of the approach provides the user with ample opportunities to choose a set of basic factors for developing a model. However, overextension of such a set can lead to useless models that are excellent for the training set of samples but completely unsuitable for external sets of similar samples. Given this circumstance, it is difficult to formulate rules for dividing the available data into the active training, passive training, calibration, and external validation sets. It seems reasonable to assume that each of the four mentioned sets has the same significance; therefore, the distribution of available samples should be approximately the same, i.e., about 25% of the data for each set (Tables 2 and 3).
The presented approach is similar to incremental methods based on the selection of suitable contributions from individual parts of molecules to describe or model the physicochemical property or biological activity of interest [18,19]. The main common feature of the described approach with the additive scheme is that in both cases, the modeled endpoint is considered as the sum of the contributions of some participants in the model-building process. The difference between the mentioned approaches lies in the fact that for the traditional additive scheme, the set of participants is constant. At the same time, for the quasi-SMILES approach, it is possible to vary the number and quality of participants in the model-building process. For example, theoretically, the user of the quasi-SMILES method can eliminate the correlation weights reflecting particle size by reducing the number of Monte Carlo optimization parameters.
On the other hand, the user can expand the brutto formulas by representing the corresponding metal oxides with traditional SMILES (e.g., representing Al2O3 via an ionic SMILES built from [Al+3] and oxide fragments), thereby increasing the number of optimized parameters. Of course, such changes do not guarantee an improvement in the predictive potential of the model, but they do provide the user with extended opportunities in the search for a model of the phenomenon and perhaps even stimulate the user's creative activity. Another important although hidden point is that the considered approach allows the user to identify and discard those quasi-SMILES fragments that are non-informative due to their low prevalence in the training samples and/or in the general array of available data. This is done automatically through the appropriate selection of the threshold described above (i.e., parameter T in Equation (1)). Since QSAR is actually a random event [20] associated with and determined by the distribution of available data in training and control samples, this option is very useful because it allows one to go from so-called "naive cross-validation" to "two-step cross-validation" [21]. The difference is as follows: naive cross-validation is the result of a single distribution into the training and validation sets, whereas two-step cross-validation is the result of considering and analyzing multiple random distributions into the training and validation sets.
A very significant component of models built on optimal descriptors using quasi-SMILES codes is the optimization procedure based on the Monte Carlo method. The choice of the target function is the key to the success of such Monte Carlo calculations. The index of ideality of correlation (IIC) [22] has turned out to be very useful for improving the target functions of the Monte Carlo optimizations used to construct optimal descriptors calculated with both SMILES and quasi-SMILES codes. The majority of phenomena in the natural sciences are complex, and idealization (or simplification) is one of the most common approaches to studying them: the ideal gas, ideal solution, ideal crystal, and ideal symmetry [22]. Ideal correlation is likewise a very attractive variant of correlation in general. The main idea of ideal correlation, expressed through the IIC, is a correlation with forced minimization of the mean absolute error (MAE). It should be noted, however, that the application of the IIC improves the statistical characteristics for the calibration and validation sets while reducing the correlation coefficient for the training sets. This is a paradoxical situation. Nevertheless, from a practical point of view, it is preferable to overtraining (i.e., the situation in which excellent statistical quality for the training set is accompanied by poor statistical quality for the validation set). An analysis of the graphical representations of this paradox, observed for various geometric configurations of points on 'experiment vs. calculation' plots, shows that such idealization is not always possible; fortunately, however, it is possible in the majority of cases [23].
Another useful invention for improving the predictive potential of models based on quasi-SMILES codes is the correlation intensity index (CII) [17]. Data on a group of quasi-SMILES (e.g., the calibration or validation set) with experimental and predicted values of an endpoint make it possible to estimate the contribution of each quasi-SMILES to the correlation between the experimental and calculated endpoint values. A negative effect of removing a quasi-SMILES means it is a 'supporter' of the correlation; a positive effect of removing it means it is an 'oppositionist' of the correlation. The sum of these effects yields the CII.
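The leave-one-out procedure just described can be sketched as follows. This is a hedged illustration: squared correlation is used as the measure of fit, which is an assumption about the exact published definition, and only the positive ('oppositionist') effects are accumulated against the index.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

def cii(obs, calc):
    """1 minus the summed positive effects of removing single points:
    a point whose removal raises the squared correlation is an
    'oppositionist', and its effect counts against the index."""
    r2_all = pearson_r(obs, calc) ** 2
    protests = 0.0
    for k in range(len(obs)):
        r2_k = pearson_r(obs[:k] + obs[k + 1:], calc[:k] + calc[k + 1:]) ** 2
        if r2_k > r2_all:  # removing point k improves the fit
            protests += r2_k - r2_all
    return 1.0 - protests

# a perfect correlation has no oppositionists; an outlier lowers the CII
print(round(cii([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]), 3))  # -> 1.0
print(cii([1.0, 2.0, 3.0, 10.0], [1.0, 2.0, 3.0, 4.0]) < 1.0)     # -> True
```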

Conclusions
The present study demonstrated that the quasi-SMILES technique gives statistically robust models of the half-maximal concentrations for the five cell lines. We showed that the statistical quality is well reproduced over five random splits of the available data into a structured training set (i.e., the active training, passive training, and calibration sets) and an external validation set. Such an approach is tested and recommended for various applications of the quasi-SMILES methodology. Paradoxically, the vector of ideality of correlation, which is the sum of the described IIC and CII, improves the predictive potential of the studied models but to the detriment of their statistical quality on the training set. The described approach can easily be adapted to simulate other experimental situations and endpoints for nanomaterials and other substances (mixtures, polymers, peptides, proteins).
A quasi-SMILES approach describing experimental situations can be modified both by feedback (i.e., depending on the results obtained) and purely heuristically, in accordance with spontaneous ideas for which statistical expertise is possible. Thus, quasi-SMILES is a simple and versatile approach for modelling experimental situations not yet implemented in practice. The indices IIC and CII can not only improve Monte Carlo optimization but can also serve as indicators of the predictive potential of various models (Table S1 lists the quasi-SMILES of splits 1–5 with experimental and calculated values of the half-maximal concentrations for the five cell lines; Table S2 includes the numerical data on the corresponding correlation weights for the elements of quasi-SMILES).