Giving Molecules an Identity. On the Interplay Between QSARs and Partial Order Ranking

The interplay between 'noise-deficient' QSAR and Partial Order Ranking, including analysis of average linear ranks, constitutes an effective tool in giving substances which have not been investigated experimentally an identity by comparison with experimentally well-characterized, structurally similar compounds. It is disclosed that experimentally well-characterized compounds may serve as substitutes for highly toxic compounds in experimental studies without exhibiting the same extreme toxicity, while from an overall viewpoint they exhibit analogous environmental characteristics.


Introduction
The lack of data for the vast majority of existing chemicals is well known and constitutes a significant problem in relation to, e.g., risk assessment. According to the European Commission, the minimum data required for evaluating the chemicals were available for only approximately 14% of the HPV (High Production Volume) chemicals on the EINECS list, comprising 100,116 entries. For approximately 21% of the compounds no data at all concerning their potential impact on the environment and human health were found [1]. In a study by the Danish EPA [2] it was concluded that even in major sources of test data, information on selected ecotoxicological effects could only be found for a very limited number of the compounds on the EINECS list (acute toxic effect: 10.5%, reproductive damage: 2.2%, genetic damage: 3.2%, carcinogenic effect: 1.6%, effect on the aquatic environment: 3.5%). Since intensive experimental evaluations of chemicals are rather costly [3, and references therein], QSAR derived data for physico-chemical as well as toxicological endpoints appear as an attractive alternative. However, although the lack of data can be remedied to a certain extent through QSAR modeling, this only allows the single molecules to be characterized by single parameters, such as solubility, octanol-water partitioning, vapor pressure, and biodegradation and bioaccumulation potential. Establishing an identity for a given molecule, e.g., as a potential PBT substance, requires taking several parameters into account simultaneously, i.e., Persistence, Bioaccumulation and Toxicity.
The present study demonstrates the advantageous use of so-called "noise-deficient" QSARs, developed using data from experimentally well-characterized compounds as the training set, as a preprocessing tool to derive the desired endpoints for substances where experimental data are not available. Subsequently, these endpoints are applied as descriptors in establishing a partial ordering of combined sets of compounds, thereby giving the compounds that have not been investigated experimentally an identity by comparison with structurally related, experimentally well-characterized compounds [4,5].

QSAR
In the present study the end-points are generated through QSAR modeling, the EPI Suite being the primary tool [6]. To generate new linear "noise-deficient" QSAR models, EPI generated values for, e.g., log Sol, log $K_{OW}$, log VP and log HLC are further treated by estimating the relationships between the EPI generated data and available experimental data [7] for a series of experimentally well-characterized compounds in the training set. The general formula for the end-points, $D_i$, to be used is
$$D_i = a_i \times D_{EPI} + b_i \qquad (1)$$
where $D_{EPI}$ is the EPI generated end-point value and $a_i$ and $b_i$ are constants. The log $K_{OW}$ values generated in this way are subsequently used to generate log BCF values according to the Connell formula [8]
$$\log BCF = 6.9 \times 10^{-3} \times (\log K_{OW})^{4} - 1.85 \times 10^{-1} \times (\log K_{OW})^{3} \ldots \qquad (2)$$
The model was somewhat modified. Thus, a linear decrease of log BCF with log $K_{OW}$ was assumed in the range 1 < log $K_{OW}$ < 2.33, with log BCF = 0.5 for log $K_{OW}$ ≤ 1, the latter value being in accordance with BCFWin [6]. Subsequently, data for compounds that have not been characterized are calculated based on these formulae and the appropriate EPI generated data.
In the present study a training set consisting of up to 65 organophosphorus (OP) insecticides is applied. Due to the lack of experimental data on the biodegradation of the training set compounds, the above procedure was not applicable to the biodegradation potential, BDP3. Thus, data on BDP3 are used as estimated by the appropriate modules in the EPI Suite.
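The linear relation between EPI generated and experimental end-point values described in this section can be sketched as follows. This is a minimal illustration that assumes an ordinary least-squares fit, which the text does not specify; the function names `fit_linear` and `recalibrate` and all numerical values are hypothetical and are not taken from the study.

```python
def fit_linear(x, y):
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def recalibrate(d_epi, a, b):
    """Apply D = a * D_EPI + b to one EPI generated end-point value."""
    return a * d_epi + b


# Hypothetical training data (NOT from the study): EPI generated and
# experimental log HLC values for four well-characterized compounds.
log_hlc_epi = [-8.0, -7.0, -6.0, -5.0]
log_hlc_exp = [-7.5, -6.5, -5.5, -4.5]

a, b = fit_linear(log_hlc_epi, log_hlc_exp)

# End-point for a compound without experimental data, from its EPI value.
estimate = recalibrate(-6.5, a, b)
```

Once the constants have been estimated on the training set, the same pair (a, b) is reused for every compound that lacks experimental data, so that all descriptors entering the ranking stem from one model per end-point.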

Partial Order Ranking
The theory of partial order ranking is presented elsewhere [9] and its application in relation to QSAR is presented in previous papers [10][11][12][13]. In brief, Partial Order Ranking is a simple principle, which a priori includes "≤" as the only mathematical relation. If a system is considered which can be described by a series of descriptors $p_i$, a given compound A, characterized by the descriptors $p_i(A)$, can be compared to another compound B, characterized by the descriptors $p_i(B)$, through comparison of the single descriptors. Thus, compound A will be ranked higher than compound B, i.e., B ≤ A, if at least one descriptor for A is higher than the corresponding descriptor for B and no descriptor for A is lower than the corresponding descriptor for B. If, on the other hand, $p_i(A) > p_i(B)$ for descriptor i and $p_j(A) < p_j(B)$ for descriptor j, A and B will be denoted incomparable, i.e., no rank can be established between them and they cannot be assigned a mutual order. In mathematical terms this can be expressed as
$$B \le A \iff p_i(B) \le p_i(A) \ \text{for all } i \qquad (3)$$
Obviously, if all descriptors for A are equal to the corresponding descriptors for B, i.e., $p_i(B) = p_i(A)$ for all i, the two compounds will have identical rank and will be considered as equivalent. It further follows that if A ≤ B and B ≤ C then A ≤ C.
In partial order ranking, in contrast to standard multidimensional statistical analysis, neither assumptions about linearity nor any assumptions about distribution properties are made. In this way partial order ranking can be considered a non-parametric method. Thus, there is no preference among the descriptors. However, due to the simple mathematics outlined above, it is obvious that the method a priori is rather sensitive to noise, since even minor fluctuations in the descriptor values may lead to non-comparability or reversed ordering. The graphical representation of the partial ordering is often given in a so-called Hasse diagram [14][15][16][17]. In practice the partial order rankings are done using the WHasse software [17].
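The pairwise comparison rule and the derivation of the lines drawn in a Hasse diagram can be sketched as below. This is an illustrative sketch only and not the WHasse implementation; the names `compare` and `hasse_edges` and the descriptor values are hypothetical.

```python
def compare(p_a, p_b):
    """Compare two compounds by their descriptor tuples.

    Returns '>' if A ranks above B, '<' if A ranks below B,
    '=' if they are equivalent and '||' if they are incomparable.
    """
    if len(p_a) != len(p_b):
        raise ValueError("descriptor tuples must have equal length")
    a_higher = any(x > y for x, y in zip(p_a, p_b))
    a_lower = any(x < y for x, y in zip(p_a, p_b))
    if a_higher and a_lower:
        return "||"
    if a_higher:
        return ">"
    if a_lower:
        return "<"
    return "="


def hasse_edges(descriptors):
    """Cover relations (upper, lower): pairs with no element ranked in between."""
    names = list(descriptors)
    above = {(u, v) for u in names for v in names
             if compare(descriptors[u], descriptors[v]) == ">"}
    return sorted((u, v) for (u, v) in above
                  if not any((u, w) in above and (w, v) in above for w in names))


# Hypothetical descriptor values for four compounds (three descriptors each).
data = {"A": (3, 3, 3), "B": (2, 3, 1), "C": (1, 2, 2), "D": (1, 1, 1)}

relation_bc = compare(data["B"], data["C"])  # '||': B higher in two, lower in one
edges = hasse_edges(data)
# edges == [('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')]
```

The relation A > D is not drawn as a separate line because it follows by transitivity from A > B and B > D. The sketch also shows the noise sensitivity mentioned above: changing the third descriptor of B from 1 to 2 turns the incomparable pair B, C into the comparable pair B > C.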

Linear extensions
The number of incomparable elements in the partial ordering may obviously constitute a limitation in the attempt to rank, e.g., a series of chemical substances based on their potential environmental or human health hazard. To a certain extent this problem can be remedied through the application of the so-called linear extensions of the partial order ranking [18,19]. A linear extension is a total order in which all comparabilities of the partial order are reproduced [9,16]. Due to the incomparabilities in the partial order ranking, a number of possible linear extensions correspond to one partial order. If all possible linear extensions are found, a ranking probability can be calculated, i.e., based on the linear extensions the probability that a certain compound has a certain absolute rank can be derived. It is likewise possible to calculate the average ranks of the single elements in a partially ordered set [20,21]. The average rank is simply the average of the ranks in all the linear extensions. On this basis the most probable rank for each element can be obtained, leading to the most probable linear rank of the substances studied.
The average rank of the single compounds in the Hasse diagram is obtained by applying the simple empirical relation recently reported by Brüggemann et al. [22]. The average rank of a specific compound, $c_i$, can be obtained by the simple relation
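For very small sets the average ranks can be obtained directly from this definition by enumerating every linear extension. The sketch below is a brute-force illustration under the assumption that rank 1 denotes the top position; it is not the procedure used in the study, its cost grows factorially with the number of elements, and the name `average_ranks` and the example order are hypothetical.

```python
from itertools import permutations


def average_ranks(elements, above):
    """Average rank of each element over all linear extensions.

    `above` is a set of (upper, lower) pairs of the partial order;
    rank 1 is the top position. Returns (ranks, number_of_extensions).
    """
    totals = {e: 0 for e in elements}
    count = 0
    for order in permutations(elements):  # order[0] is the top element
        position = {e: i + 1 for i, e in enumerate(order)}
        # Keep only total orders that reproduce every comparability.
        if all(position[u] < position[v] for u, v in above):
            count += 1
            for e in elements:
                totals[e] += position[e]
    if count == 0:
        raise ValueError("relations are cyclic: no linear extension exists")
    return {e: totals[e] / count for e in elements}, count


# Hypothetical partial order: A above B and C, both above D; B and C incomparable.
ranks, n_ext = average_ranks(
    ["A", "B", "C", "D"],
    {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")},
)
# Two linear extensions exist (A,B,C,D and A,C,B,D), so B and C each
# obtain the average rank 2.5.
```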

$$Rk_{av}(c_i) = (N+1) - (S(c_i)+1) \times \frac{N+1}{N+1-U(c_i)} \qquad (4)$$
where N is the number of elements in the diagram, $S(c_i)$ the number of successors to $c_i$ and $U(c_i)$ the number of elements being incomparable to $c_i$ [22].
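As a small worked illustration, the empirical relation can be evaluated as sketched below. The function name `approx_average_rank` and the four-element example are hypothetical; the example assumes a diagram with one top element, two mutually incomparable middle elements and one bottom element.

```python
def approx_average_rank(n, successors, incomparable):
    """Rk_av = (N+1) - (S+1)*(N+1)/(N+1-U); rank 1 is the top of the diagram."""
    if n < 1 or successors < 0 or incomparable < 0:
        raise ValueError("counts must be non-negative and n at least 1")
    if successors + incomparable > n - 1:
        raise ValueError("S + U cannot exceed the number of other elements")
    return (n + 1) - (successors + 1) * (n + 1) / (n + 1 - incomparable)


# Hypothetical diagram with N = 4 elements.
top = approx_average_rank(4, successors=3, incomparable=0)     # 5 - 4*5/5 = 1.0
middle = approx_average_rank(4, successors=1, incomparable=1)  # 5 - 2*5/4 = 2.5
bottom = approx_average_rank(4, successors=0, incomparable=0)  # 5 - 1*5/5 = 4.0
```

With this orientation a low value of the average rank corresponds to a position high in the diagram. Since U can be at most N - 1, the denominator N + 1 - U is never smaller than 2.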

Results and Discussion
The basic idea of using partial order ranking for giving molecules an identity is illustrated in Figure 1. Let us assume that a suite of 10 compounds has to be evaluated and that the evaluation should be based on three pre-selected criteria, e.g., persistence, bioaccumulation and toxicity. Let the resulting Hasse diagram be the one depicted in Figure 1A. If the three descriptors representing biodegradation, bioaccumulation and toxicity, respectively, are applied so that the more persistent, the more bioaccumulating and the more toxic a substance is, the higher in the diagram it is found, Figure 1A discloses that the compounds in the top level, i.e., compounds 1, 3, 4, 7 and 8, can on a cumulative basis be classified as the environmentally more problematic of the 10 compounds studied with respect to their PBT characteristics, whereas compound 10, which is found at the bottom of the diagram, is the least hazardous. Subsequently, a compound characterized solely by QSAR derived data, X, can be introduced in order to give this new compound an identity, e.g., in an attempt to elucidate the environmental impact of X. Adopting the 10 compounds discussed above and the corresponding Hasse diagram (Figure 1A), we then introduce the compound X. The revised Hasse diagram, now including 11 compounds, is visualized in Figure 1B. It is immediately disclosed that compound X has now obtained an identity in comparison to the originally well-characterized compounds, as it is evaluated as less environmentally harmful than compounds 4 and 7, but more harmful than compound 10. Thus, through the partial order ranking the compound X has obtained an identity in the scenario with regard to its potential environmental impact.
To illustrate the above, an example from our current study on the physico-chemical characteristics of OP compounds, with special emphasis on chemical warfare nerve agents such as the G-agents, like Tabun, Sarin and Soman, and V-agents, like VX, shall be used [4,5]. In the present study we shall focus on the aqueous persistence of OP insecticides and known and potential nerve agents as expressed through the solubility (Sol), the biodegradation potential (BDP) and the Henry's Law Constant (HLC), the latter being derived based on the EPI values as given by HenryWin [6].
As mentioned, the EPI Suite [6] has been the primary tool for QSAR modeling, the single EPI generated values for log Sol, log $K_{OW}$, log VP and log HLC being further treated to generate new linear "noise-deficient" QSAR models, cf. eqn. 1 [4].
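The way a new compound X is placed relative to the reference compounds can be sketched as follows. The descriptor values are hypothetical and are not read from Figure 1; they are merely chosen so that X ends up below compounds 4 and 7 and above compound 10, as in the text. The function names `dominates` and `locate` are likewise illustrative.

```python
def dominates(p, q):
    """True if p ranks above q: no descriptor lower and at least one higher."""
    return (all(x >= y for x, y in zip(p, q))
            and any(x > y for x, y in zip(p, q)))


def locate(new, references):
    """Sort reference compounds by their relation to the new compound."""
    result = {"above": [], "below": [], "equivalent": [], "incomparable": []}
    for name, p in references.items():
        if dominates(p, new):
            result["above"].append(name)
        elif dominates(new, p):
            result["below"].append(name)
        elif tuple(p) == tuple(new):
            result["equivalent"].append(name)
        else:
            result["incomparable"].append(name)
    return result


# Hypothetical (persistence, bioaccumulation, toxicity) scores.
references = {
    "1": (5, 1, 2),
    "4": (5, 4, 4),
    "7": (4, 5, 3),
    "10": (1, 1, 1),
}
x = (3, 3, 3)  # compound X, characterized by QSAR derived data only

identity = locate(x, references)
# identity["above"] == ['4', '7'], identity["below"] == ['10'],
# identity["incomparable"] == ['1']
```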

[Figure 2. Axes: log HLC (EPI) and log HLC (Exp)]
The generated end-points are subsequently used to generate partial order rankings of the 65 OP insecticides together with the 16 known and potential nerve agents, taking two or more descriptors into account simultaneously. As 81 compounds in total are included in the subsequent ranking procedure, the resulting Hasse diagrams may seem somewhat confusing. Figure 3 depicts the Hasse diagram disclosing the mutual ranking of the compounds according to their aqueous persistence, i.e., simultaneously bringing the solubility (log Sol), the biodegradation potential for ultimate biodegradation (BDP3) and Henry's Law Constant (log HLC) into play. From Figure 3 it can be seen that the nerve agent VX is located at the same level as the compounds 61 (Anilofos), 71 (Azinphos methyl), 194 (Chlorfenvinphos), 217 (Chlorpyriphos methyl), 296 (Dialifos), 319 (Dicrotophos), 372 (Ditalimfos), 705 (Monocrotophos), 795 (Phosalone), 798 (Phosmet), 799 (Phosphamidon) and 869 (Pyraclofos), in addition to the Russian version of VX (RVX) and the potential nerve agent AmMe (Amiton methyl).
A priori the location of the compounds on the same level in the Hasse diagram suggests that these compounds are close in their overall characteristics based on the set of descriptors used, i.e., solubility, biodegradation potential and Henry's Law Constant. However, a further analysis is necessary in order to disclose how close these compounds actually are. For this analysis the concept of average rank [4,5,22,23] was adopted. Thus, it is assumed that if the average ranks, $Rk_{av}$, of two compounds are close, the two compounds will on an average basis display similar characteristics as determined by the set of descriptors applied. In Table 1 the average ranks for the above-mentioned OPs are given together with the minimum acute oral toxicity and acute percutaneous toxicity, respectively, in both cases for rats [7]. It is immediately seen that although the compounds were placed on the same level in the Hasse diagram, the true identity of the single compounds is disclosed only through the analysis of the average linear rank. In the present case VX ($Rk_{av}$ = 5.3), which in the present context is the unknown compound, achieves an identity that can be compared to Phosphamidon ($Rk_{av}$ = 6.2) as the closest counterpart. Thus, with regard to aqueous persistence, the above combined QSAR and partial order ranking analysis indicates that VX and Phosphamidon will display close to identical behavior. This further means that Phosphamidon, within the set of compounds included in the investigation, appears as the optimal substitute for VX in experimental studies where aqueous persistence is a crucial parameter. It is noted that the acute oral toxicity associated with Phosphamidon is approximately 200 times lower than that of VX, and in the case of acute percutaneous toxicity Phosphamidon appears to be nearly 4000 times less toxic than VX.
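The selection of the closest counterpart from the average ranks can be sketched as a nearest-value search. Only the values for VX (5.3) and Phosphamidon (6.2) are taken from the text; the entries "reference A" and "reference B" and their ranks are hypothetical placeholders, since the remaining values of Table 1 are not reproduced here, and the name `closest_counterpart` is illustrative.

```python
def closest_counterpart(target, average_rank):
    """Name of the compound whose average rank is nearest to that of `target`."""
    others = {k: v for k, v in average_rank.items() if k != target}
    if not others:
        raise ValueError("at least one candidate besides the target is required")
    return min(others, key=lambda k: abs(others[k] - average_rank[target]))


average_rank = {
    "VX": 5.3,             # from the text
    "Phosphamidon": 6.2,   # from the text
    "reference A": 9.8,    # hypothetical placeholder
    "reference B": 14.1,   # hypothetical placeholder
}

substitute = closest_counterpart("VX", average_rank)  # 'Phosphamidon'
```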

Conclusions
The present study has demonstrated how 'unknown' compounds may obtain an identity by comparison with structurally related, experimentally well-characterized compounds. The identity can be established by a close interplay between so-called "noise-deficient" QSARs, in the present study generated using the EPI Suite as the modeling onset, and partial order ranking. The generated physico-chemical end-points are used as descriptors in a partial order based ranking followed by an analysis of the average linear rank. It is suggested that experimentally well-characterized compounds may serve as substitutes for highly toxic compounds, such as the nerve agent VX, in experimental studies without exhibiting the same extreme toxicity, while from an overall viewpoint exhibiting analogous environmental characteristics.

Figure 1. Illustrative Hasse diagram of A: 10 compounds using three descriptors and B: the same 10 compounds plus one new compound X.

Figure 2. Visualization of the EPI-based modified QSAR modeling of log HLC based on 49 OP insecticides.

Figure 3. Hasse diagram displaying the aqueous persistence of the 65 OP insecticides (white/red) and 16 nerve agents (yellow/blue). The numbers correspond to the numbering of the OP insecticides in the FADINAP database [7].

Table 1. Average ranks for the aqueous persistence as determined by the solubility, the biodegradation potential and the Henry's Law Constant for a series of OP insecticides and VX (the compound ID refers to the FADINAP database, cf. the above text; na: not available).