QSPR/QSAR: State-of-Art, Weirdness, the Future

The ability of quantitative structure–property/activity relationships (QSPRs/QSARs) to serve epistemological processes in the natural sciences is discussed. There are some contradictions in the research results in this area; sometimes, these should be classified as paradoxes or weirdness. These points are often ignored. Here, they are listed and briefly commented on. In addition, hypotheses on the future evolution of QSPR/QSAR theory and practice are suggested. In particular, the possibility of extending the QSPR/QSAR problematic by searching for the “statistical similarity” of different endpoints is suggested and illustrated by an example for endpoints that are relatively distant from one another, namely (i) mutagenicity, (ii) anticancer activity, and (iii) blood–brain barrier.


Introduction
Every science encounters internal and external contradictions. Correlations have many times served as a key to the interpretation of various phenomena. The expansion of the information available for analysis (e.g., the search space of "traditional" substances has been extended by nanomaterials) leads to the following question: is a correlation useful, or is it better to try to define causality [1]?
Apparently, the answer to this question remains incomplete. At the first stage, Wiener established the first correlations for physicochemical endpoints [2][3][4]. Later, other authors [5][6][7][8][9] continued the stream of similar studies. At the second stage (after a pause of almost twenty years), Hansch and Fujita [10] established the first correlations for biochemical endpoints.
Quantitative structure–property/activity relationships (QSPRs/QSARs) are a relatively new field of the natural sciences. A large group of aims is associated with the QSPR/QSAR technique; the main ones are probably the following: (i) prediction of the physicochemical behavior of various substances in industry and of their further ecological impact [2][3][4][5][6][7][8][9]; (ii) prediction of the biochemical behavior of various substances in ecological and medicinal aspects [10]; and (iii) selection of substances that can be prospective candidates for a defined role [11,12].
The results of traditional experiments depend on the properties of substances, masses, radiation, heat capacity, and electronic, physicochemical, and biochemical conditions, as well as on porosity, the zeta potential of nanomaterials, time of exposure, irradiation, darkness, etc. Computational experiments related to QSPR/QSAR depend on "information conditions" (available datasets) and "statistical conditions" (diversity of substances in the datasets), as well as on the preferences of the user.
Wiener carried out the pioneering works on the correlation "molecular structure–macro-effect of a substance" in the 1940s [2][3][4]. This was the start of QSPR/QSAR history; in other words, it was the first stage of the evolution of QSPR/QSAR theory and practice.
The main task of QSPR/QSAR in this period was to establish a correlation between an endpoint and a descriptor for a set of substances. Criteria of quality of those models were (i) the total number
1. Be associated with a defined endpoint of regulatory importance;
2. Take the form of an unambiguous algorithm;
3. Have a clear domain of applicability;
4. Be associated with appropriate measures of goodness of fit, robustness, and predictivity;
5. Have a mechanistic interpretation.
Later, these principles were renamed the OECD (Organization for Economic Co-operation and Development) principles (http://www.oecd.org/chemicalsafety/risk-assessment/37849783.pdf). The OECD principles open the second stage of QSPR/QSAR history: "not only to establish a correlation, but to check up the predictive potential of the correlation."
A person who would like to apply a model will hardly be pleased by the necessity to obtain a group of descriptors via hard-to-understand software and then to carry out calculations with other hard-to-understand software that provides multiple linear regression analysis, artificial neural networks, or something else. Attempts to solve the problems related to these "unpleasant peculiarities" of QSPR/QSAR have been made. However, these attempts have revealed three points of weirdness.

The First Weirdness of QSPR/QSAR
The distribution of the available data for QSPR/QSAR analyses into training and validation sets can be done in various manners [50,51]. The distribution has a key influence on the statistical quality of QSPR/QSAR models [52,53]. Here, one can see the first weirdness in modern QSPR/QSAR research: the majority of models are based on solely one distribution of the available data into the training and validation sets.
According to many authors, a rational split into training and validation sets gives better statistical results for the validation set than a random split [54]. However, experiments confirm that a split that is successful for one approach can be unsuccessful for another [55][56][57][58][59]. For example, three different splits (Table 1) into training and validation sets of 87 anticancer inhibitors [60] give models with different predictive abilities (Table 2).
An examination of several splits decreases the probability of "chance correlations": a single good correlation can easily be a chance correlation, whereas three (five, six, seven, ...) good correlations can hardly all be chance correlations.
The described experiment confirms that successful and unsuccessful splits exist. The split that is excellent for the 3D-QSAR approach (Split 1) is poor for the 2D approaches, i.e., the models calculated by Equation (1) or Equation (2). However (Table 2), Split 2 is excellent (or at least successful) for Method 1, whereas Split 3 is excellent (or at least successful) for Method 2.
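The dependence of validation statistics on the split can be illustrated with a minimal sketch. It is not the procedure used for Tables 1 and 2: the data are synthetic, the model is a one-descriptor least-squares line, and the splits are random. The function names, the 70% training fraction, and the seeds are assumptions made only for this illustration.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares slope and intercept for a one-descriptor model."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def r2(y_obs, y_pred):
    """Determination coefficient of predictions against observations."""
    my = statistics.fmean(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - my) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot

def validation_r2_over_splits(x, y, n_splits=5, train_fraction=0.7, seed=0):
    """Refit the model on several random splits; return the validation R2 of each."""
    rng = random.Random(seed)
    indices = list(range(len(x)))
    n_train = int(train_fraction * len(x))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(indices)
        train, valid = indices[:n_train], indices[n_train:]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        pred = [slope * x[i] + intercept for i in valid]
        scores.append(r2([y[i] for i in valid], pred))
    return scores

# Synthetic data (an assumption): 87 "compounds" with a noisy linear relation.
data_rng = random.Random(42)
descriptor = [data_rng.uniform(0.0, 10.0) for _ in range(87)]
endpoint = [0.5 * d + 2.0 + data_rng.gauss(0.0, 1.0) for d in descriptor]

scores = validation_r2_over_splits(descriptor, endpoint)
print([round(s, 3) for s in scores])
print("min:", round(min(scores), 3), "max:", round(max(scores), 3))
```

Reporting the whole list of validation values, rather than the best one, is the point of the exercise: a model that looks good on only one split is a candidate for a chance correlation.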

The Second Weirdness of QSPR/QSAR
The number of statistical characteristics intended to measure the predictive potential of a model is gradually increasing (Table 3), despite the apparent attractiveness of a small number of criteria of predictive potential for practical applications.
On the one hand, the diversity of criteria of predictive potential is a tool to improve the quality of QSPR/QSAR models. On the other hand, this situation sometimes causes uncertainty in the choice of the best model. In other words, contradictions in the recommendations of various criteria force the researcher to search for the truth (i.e., the best choice) in a greater maze of possibilities.
Table 3. Statistical criteria of the predictive potential for quantitative structure–property/activity relationship (QSPR/QSAR) models.
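How two criteria can recommend different models is easy to show on a toy case. The sketch below uses three widely used measures (the squared Pearson correlation, the mean absolute error, and Lin's concordance correlation coefficient); whether each of them appears in Table 3 is an assumption, and the observed and predicted values are invented for the illustration.

```python
import statistics

def pearson_r2(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    mo, mp = statistics.fmean(obs), statistics.fmean(pred)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    return sxy ** 2 / (sxx * syy)

def mae(obs, pred):
    """Mean absolute error."""
    return statistics.fmean(abs(o - p) for o, p in zip(obs, pred))

def ccc(obs, pred):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(obs)
    mo, mp = statistics.fmean(obs), statistics.fmean(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred)) / n
    var_o = sum((o - mo) ** 2 for o in obs) / n
    var_p = sum((p - mp) ** 2 for p in pred) / n
    return 2.0 * cov / (var_o + var_p + (mo - mp) ** 2)

observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# Model A: perfectly correlated with the experiment, but shifted by +1.5.
model_a = [o + 1.5 for o in observed]
# Model B: unbiased, but with scatter around the experimental values.
model_b = [1.6, 1.5, 3.7, 3.4, 5.6, 5.4]

for name, pred in (("A", model_a), ("B", model_b)):
    print(name,
          "r2 =", round(pearson_r2(observed, pred), 3),
          "MAE =", round(mae(observed, pred), 3),
          "CCC =", round(ccc(observed, pred), 3))
```

In this toy case the squared correlation prefers model A, whereas the mean absolute error and the concordance coefficient prefer model B, which is the kind of contradiction described above.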

The Third Weirdness of QSPR/QSAR
Naturally, the contribution of the molecular structure is of key importance to an endpoint. However, any biological activity is a mathematical function of many different conditions and circumstances. In other words, toxicity or a pharmaceutical effect is caused not only by molecular structure, but also by physicochemical conditions (e.g., temperature, humidity) and circumstances (noise/silence, illumination/darkness). One can disagree with this postulate, but the majority of QSPR/QSAR models have been built without taking into account anything besides molecular structure.

Discussion
The above-mentioned unpleasant peculiarities and weirdness interact. To avoid the unpleasant peculiarities, one should build a model free of the above weirdness, namely: (i) one should study several different splits (into training and validation sets); (ii) one should select a group of criteria of predictive potential that agree with each other; and (iii) one should take into account all conditions that affect the corresponding endpoint (not only molecular structure). However, these actions are not enough to solve all problems.
Unfortunately, there are other problems. Fortunately, there are other solutions. The hierarchy of problems in the field of modeling various endpoints has not been established. One group of researchers believes that the validation of a model is of key importance. Another group believes that the main result is the statistical quality of a model. A third group concentrates on mechanistic interpretation. Curiously, non-standard tasks and solutions also exist, and sometimes these are very important. Examples are given below.

Multi-target QSAR Models
A limitation of almost all QSAR models is that they predict the biological activity for only one endpoint. In other words, a traditional QSAR gives a model for the biological activity of drugs against only one parasite species [96], one species of virus [97], or one type of cancer [98]. The so-called multi-target QSAR has been suggested as a tool to build models for several endpoints [96][97][98].
Apparently, this conception has attractive advantages, since it provides the user with an extended list of information (i.e., expected numerical data for groups of endpoints that affect the phenomenon under consideration, e.g., therapeutic effect, inhibition, biocide potential, etc.). Nonetheless, traditional approaches serve as the basis for building multi-target QSARs, e.g., multiple regression [99], partial least squares (PLS) [100], artificial neural networks (ANN) [101][102][103], and random forest [104].
It should be noted that interest in research dedicated to multi-target QSAR in drug discovery has gradually increased during the past decade, whereas interest in general QSAR in drug discovery has been approximately constant. Figure 1 confirms this situation.

Similarity of Endpoints
As noted in the previous section, the simultaneous examination of two endpoints is an attractive approach in QSPR/QSAR analysis. In addition to multi-target QSAR, the similarity of endpoints may be a heuristic tool for checking biochemical knowledge [105][106][107]. The similarity/dissimilarity of endpoints can be expressed via correlation weights of molecular features extracted from SMILES [105]. In principle, the spectrum of physicochemical conditions with a clear impact on biochemical endpoints (toxicity, therapeutic potential) is able to provide hints for establishing the similarity (dissimilarity) of two endpoints relevant to drug discovery, toxicity, risk assessment, and other fields.
The similarity of endpoints has been analyzed in the literature [105]. The task is to extract the molecular features involved in the modeling process that play an analogous role in the corresponding models of (i) mutagenicity; (ii) blood-brain barrier permeation; and (iii) anticancer activity.

Mutagenicity
The endpoint for QSAR analysis is the mutagenic potential. The mutagenic potential in Salmonella typhimurium TA98 + S9 microsomal preparation is represented by the natural logarithm of R (lnR), where R is the number of revertants per nanomole.

Anticancer Activity
The endpoint considered here is IC50 which represents the concentration of the agent necessary to reduce cell viability by 50% against Murine P388 Leukemia (in vitro cytotoxic activity). The endpoint is expressed on a logarithmic scale (pIC50).

Blood-Brain Barrier (BBB)
The database for BBB permeation (n = 291) is taken from the literature [105]. The QSAR models for the above-listed endpoints are based on a descriptor calculated from SMILES attributes. Here, the global SMILES attributes are BOND, NOSP, HALO, and PAIR. S_k and SS_k are local SMILES attributes. Tables 4 and 5 contain comments on these attributes. CW(BOND), CW(NOSP), CW(HALO), CW(PAIR), CW(S_k), and CW(SS_k) are the correlation weights of the above-listed attributes.

Table 4. Simplified molecular input-line entry system (SMILES) attributes applied to build up a model.
S_k: One symbol, or two symbols that cannot be examined separately in SMILES (e.g., Cl, Br, etc.).
*) Only "(" is used, not ")"; **) Symbols in SS_k are placed according to their ASCII codes, in order to avoid the wrong interpretation of AB and BA as non-equivalent features.
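The way a descriptor can be assembled from these attributes is sketched below. The sketch rests on several assumptions that are not confirmed by the text above: the descriptor is taken to be a plain sum of the correlation weights of all attributes present in a molecule, the closing bracket is mapped onto the opening one, SS_k is taken to be a pair of neighboring S_k symbols, and the codes of the global attributes are supplied by the caller because Table 5 is not reproduced here. The function names and all numerical weights are hypothetical.

```python
import re

# Two-letter halogens come first so that "Cl" and "Br" stay single tokens.
# Simplification: every other character is treated as its own symbol.
TOKEN_PATTERN = re.compile(r"Cl|Br|.")

def local_attributes(smiles):
    """Return the lists of S_k and SS_k attributes of a SMILES string."""
    # Assumption: ')' is mapped onto '(' so that only '(' is used.
    s_k = ["(" if t == ")" else t for t in TOKEN_PATTERN.findall(smiles)]
    # Neighboring symbols are sorted by character code: AB and BA coincide.
    ss_k = [tuple(sorted(pair)) for pair in zip(s_k, s_k[1:])]
    return s_k, ss_k

def dcw(smiles, global_codes, weights):
    """Sum the correlation weights of all attributes (assumed additive form).

    Attributes that have no entry in `weights` contribute zero.
    """
    s_k, ss_k = local_attributes(smiles)
    attributes = list(global_codes) + s_k + ss_k
    return sum(weights.get(a, 0.0) for a in attributes)

# Hypothetical correlation weights, invented for the illustration.
weights = {
    "HALO00000000": 1.0,   # code string used here only as a placeholder
    "C": 0.5,
    "O": -0.25,
    ("C", "C"): 0.25,
    ("(", "O"): 0.125,
}

print(local_attributes("CC(=O)O"))
print(dcw("CC(=O)O", ["HALO00000000"], weights))
```

In practice the correlation weights are obtained by an optimization procedure rather than set by hand, and the endpoint is then modeled as a function of the descriptor value.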
The scheme for estimating the similarity and dissimilarity of the above-mentioned endpoints, demonstrated in Table 6, is adapted from [105].
Table 6. Definition of similarities of models for mutagenicity, anticancer activity, and blood-brain barrier (BBB). Here, model-1 is denoted m1 and model-2 is denoted m2. The "m1.1" means the first run of optimization for endpoint 1. Each plus denotes a promoter of an increase for endpoints (#1 or #2). Each minus denotes a promoter of a decrease for endpoints (#1 or #2).
Table 7 contains numerical measures of the similarity and dissimilarity of the corresponding endpoints (Table 6). The similarity of endpoints defined according to the suggested scheme can become the beginning of "a next generation" of QSPR/QSAR evolution.
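The logic of such a comparison can be sketched as follows. A feature is treated as a promoter of increase if its correlation weight is positive in every optimization run, and as a promoter of decrease if it is negative in every run; features with unstable signs are left out. The counting of features with the same and with opposite roles is only an illustration and is not claimed to reproduce the numerical measures of Table 7. The function names, the features, and the weights are hypothetical.

```python
def stable_role(weights_per_run):
    """Return '+' or '-' if the sign is the same in all runs, else None."""
    if all(w > 0 for w in weights_per_run):
        return "+"
    if all(w < 0 for w in weights_per_run):
        return "-"
    return None

def compare_endpoints(runs_1, runs_2):
    """Split shared features into those with the same and the opposite role.

    runs_1 and runs_2 map a feature to its correlation weights over several
    optimization runs for endpoint 1 and endpoint 2, respectively.
    """
    same, opposite = [], []
    for feature in sorted(set(runs_1) & set(runs_2)):
        role_1 = stable_role(runs_1[feature])
        role_2 = stable_role(runs_2[feature])
        if role_1 is None or role_2 is None:
            continue  # unstable sign in at least one model: not informative
        (same if role_1 == role_2 else opposite).append(feature)
    return same, opposite

# Hypothetical correlation weights from three runs for each endpoint.
runs_m1 = {"C": [0.5, 0.7, 0.2], "O": [-0.3, -0.1, -0.6],
           "N": [0.4, -0.2, 0.1], "=": [0.9, 0.8, 0.3]}
runs_m2 = {"C": [0.1, 0.3, 0.6], "O": [0.2, 0.5, 0.4],
           "N": [0.3, 0.2, 0.1], "=": [0.6, 0.2, 0.4]}

same, opposite = compare_endpoints(runs_m1, runs_m2)
print("same role:", same)
print("opposite role:", opposite)
print("share of same-role features:", len(same) / (len(same) + len(opposite)))
```

Requiring the same sign in all runs is a deliberately strict choice: it discards features whose role depends on the particular optimization run and keeps only those that can reasonably be interpreted.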

Gender-Oriented QSAR Models
Usually, the categorization of eco-toxic effects is related to different animals (fishes, birds, and insects). However, at least for animals, a categorization related to sex may also be useful from practical and theoretical points of view. QSAR models of carcinogenicity built separately for male and female rats can have wide applications in both agriculture and theoretical biochemistry [69]. The matrix used to recognize the difference between the corresponding models is built by a scheme analogous to the one applied for Table 6. The promoters of the increase of pTA50 have been examined separately for male and female rats. In both cases, the symbol '1' means a stable positive correlation weight, whereas the symbol '0' means a stable negative correlation weight. Table 8 contains the results of these computational experiments. Examples of molecular features that act in a different manner for male rats and female rats are the following: (i) BOND1010000; (ii) HALO00000000; (iii) NNC-C . . . 303.; and (iv) NNC-C . . . 321. This information can be useful, e.g., for developers of the corresponding biocides. Algorithms able to generate gender-oriented models may have wider applications (e.g., drug design).

The Simplicity or the Efficiency: Which is Better?
QSAR should be assessed as a surrogate for a real experiment: a QSAR aims to estimate an endpoint value. However, it is naive to expect an adequate prediction of the physicochemical and biochemical behavior of an arbitrary substance by means of a QSPR/QSAR model.
Despite this thesis, QSPR/QSAR has become an integral part of modern science as a tool to detect "fuzzy tendencies" in the behavior of groups of substances. This fact logically echoes the theory of fuzzy sets [108]. This is not surprising, as fuzzy set theory has had success in solving some problems of QSPR/QSAR analysis [109][110][111].
One can distinguish two components in the great variety of QSAR studies: (i) "extensive" studies and (ii) "intensive" studies. The aim of "extensive" studies is to integrate the results of applying current approaches to practical tasks. The aim of "intensive" studies is to develop new conceptions of QSPR/QSAR analysis. Naturally, a small part of the results of the "intensive" studies gradually becomes a tool of robust "extensive" studies.

Conclusions
The evolution of the field of QSPR/QSAR has two components: intensive and extensive. The intensive component is responsible for developing the quality and epistemological potential of various QSPR/QSAR approaches. Multi-target QSAR is a promising field in the evolution of QSAR theory and practice. Other promising components of the "intensive" evolution of QSPR/QSAR are (i) applying fuzzy set theory; (ii) developing statistical methods to detect the similarity of biochemical endpoints; and (iii) extending the "input data" for QSPR/QSAR by taking into account experimental conditions and circumstances.