2.2. Timişoara Spectral-SAR Model
Since QSAR models seek correlations between the structures of the (congener) molecules concerned and their measured (or otherwise evaluated) activities, it is natural to treat the structural part of the problem within quantum theory and its formalisms. The present approach relies on a few quantum features:
○ Any molecular structural state (dynamical, since it undergoes interactions with organisms) may be represented by a |ket〉 state vector in the abstract Hilbert space, following the 〈bra|ket〉 Dirac formalism [14]. Such states may be represented by any reliable molecular index; in this study we use hydrophobicity |LogP〉, polarizability |POL〉, and total optimized energy |Etot〉, restricting ourselves to the so-called Hansch parameters, which are usually employed to account for the diffusion, electrostatic, and steric effects, respectively, of molecules acting on organisms’ cells.
○ The (quantum) superposition principle ensures that linear combinations of molecular states map onto a resulting state, here interpreted as the bio-, eco- or toxicological activity, e.g., |Y〉 = |Y0〉 + CLogP |LogP〉 + CPOL |POL〉 + ..., where |Y0〉 is the free or unperturbed activity (when all other influences are absent).
○ The orthogonality of quantum states is a crucial condition ensuring that superimposed molecular states generate a new molecular state (here quantified as the organism activity). Analytically, the orthogonality condition is expressed through the 〈bra|ket〉 scalar product of two states (molecular indices): if it evaluates to zero, i.e., 〈bra|ket〉 = 0, the states are said to be orthogonal (zero-overlapping) and the associated molecular descriptors are considered independent, and therefore suitable as eigenstates (of a spectral decomposition) of the resulting activity state, weighted by the degree to which their molecular indices enter the activity correlation. Further details on the scalar product and related properties are given in Appendix A1; in what follows, the Spectral-based SAR correlation method (hereafter Spectral-SAR) is summarized.
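As a minimal illustration of the orthogonality condition above, the following Python sketch treats two descriptor states as plain real-valued columns (one entry per molecule) and tests whether their scalar product vanishes. The function names, the tolerance, and the numerical values are hypothetical and serve only as an illustration; they are not data from this study.

```python
import math
from typing import List

Vector = List[float]


def braket(u: Vector, v: Vector) -> float:
    """Scalar product <u|v> of two real-valued descriptor columns."""
    return sum(a * b for a, b in zip(u, v))


def are_orthogonal(u: Vector, v: Vector, tol: float = 1e-12) -> bool:
    """True when <u|v> is zero within the (assumed) absolute tolerance."""
    return math.isclose(braket(u, v), 0.0, abs_tol=tol)


# Hypothetical descriptor columns for four molecules.
x0 = [1.0, 1.0, 1.0, 1.0]              # unit column
x1 = [1.0, 2.0, 3.0, 4.0]              # a raw descriptor
x1_centered = [-1.5, -0.5, 0.5, 1.5]   # the same descriptor minus its mean

print(are_orthogonal(x0, x1))           # the raw descriptor overlaps the unit column
print(are_orthogonal(x0, x1_centered))  # the centered descriptor does not
```

In this toy case the raw descriptor overlaps the unit column, whereas removing its mean makes the two columns orthogonal.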
Note that, since molecular states are usually represented by ket vectors, which generalize ordinary (classical) vectors, all formalisms are developed accordingly. In this regard, the bra-ket formalism is more than a simple notation: it distinguishes between the dual and direct spaces to which the bra and ket vectors respectively belong, with insightful consequences for the space-time evolution of a system, a matter not conveyed by simple classical vector notation. Moreover, it is not a complication of reality but a close representation of it: the molecular descriptors belong to a given molecular state, which has to be included as a component of the quantum (ket) vectors carrying the specific structural information, a feature not fulfilled by simple classical vectors. Therefore, the adopted vectorial formalism goes beyond simple notation: each time we write a ket vector represented by a structural index, we in fact refer to a generalized electronic (hyper-molecular) state, defined as the global state collecting the values of one descriptor for all congener molecules concerned.
Now, a set of N molecules studied against an observed/recorded/measured biological activity is represented by means of their M structural indicators (the states); all the N × M input information may be expressed by the column vectors of Table 1 and correlated according to the generic scheme of Equations (1a)–(1d):
where the vector |X0〉 = |1 1 ... 1〉 (N unit entries) was added to account for the free activity term.
In order for Equation (1b) to represent a reliable model of the given activities, the assumed hyper-molecular states (indices) should constitute an orthogonal set, a constraint with a consistent quantum mechanical basis, as described above. However, unlike other important studies addressing this problem [15–17], the present Spectral-SAR [7] assumes the prediction error vector to be orthogonal to all others:
since it is not known a priori, before any correlation is made. Moreover, Equations (1a), (1b), and (1c) imply that the prediction error vector has to be orthogonal to all known descriptors (states) of the predicted activity:
thereby ensuring the reliability of the present ket-state approach. In other terms, conditions (1c) and (1d) agree with Equation (1a) in the sense that the prediction error vector and the predicted activity |YPRED〉 (with all its constituent descriptor states) belong to disjoint (thus orthogonal) Hilbert (sub)spaces; moreover, one can say that the Hilbert space of the observed activity |YOBS〉 may be decomposed into independent predicted and error Hilbert subspaces of states.
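For orientation, the scheme described in the preceding paragraphs can be written compactly as follows. This is a sketch read off from the surrounding description only: the coefficient symbols b_k and the label chosen for the error vector are assumptions, and the numbering merely mirrors the way the text cites these relations.

```latex
\begin{align}
|Y_{OBS}\rangle  &= |Y_{PRED}\rangle + |\text{prediction error}\rangle \tag{1a}\\
|Y_{PRED}\rangle &= b_0|X_0\rangle + b_1|X_1\rangle + \dots + b_M|X_M\rangle \tag{1b}\\
\langle Y_{PRED}\,|\,\text{prediction error}\rangle &= 0 \tag{1c}\\
\langle X_i\,|\,\text{prediction error}\rangle &= 0, \qquad i = 0, 1, \dots, M \tag{1d}
\end{align}
```

In this reading, (1d) follows from (1a)–(1c) once the orthogonality in (1c) is required to hold for any choice of the coefficients b_k in (1b).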
Therefore, within the Timişoara Spectral-SAR procedure the very first step consists of orthogonalizing the prediction error with respect to the predicted activity and its predictor states; the remaining algorithm does not seek to minimize the errors but to produce the ideal correlation between |YPRED〉 and the given descriptors.
Next, the Gram-Schmidt orthogonalization scheme is applied to construct the appropriate set of descriptors by means of the established iteration [16,18,19]:
providing the orthogonal correlation:
Remarkably, while available studies dedicated to the orthogonality problem usually stop at this stage, Spectral-SAR uses it to solve the originally sought correlation of Equation (1b), with the prediction error vector orthogonal to the predicted activity and to all its predictor states of Table 1. This is achieved by grouping Equations (2) and (3) so that the system of all descriptors of Table 1 is written in terms of the orthogonal descriptors:
According to a well-known algebraic theorem, the system (4) has a nontrivial solution if and only if the associated extended determinant vanishes; the Spectral-SAR determinant thus takes the form [7]:
Now, when the determinant of Equation (5) is expanded along its first column and the result is rearranged so that |YPRED〉 stands on the left side and the remaining states/indicators on the right side, the sought QSAR solution of the initial observed-predicted correlation problem of Equation (1a) is obtained as the Spectral-SAR vectorial expansion (whence the name “spectral”). The prediction error vector no longer needs to be minimized, since this stage is absorbed in its orthogonality with respect to the predicted activity.
In fact, the Spectral-SAR procedure uses a double conversion: a forward one, from the given problem of Equations (1a)–(1d) to the orthogonal one of Equation (3), in which the error vector does not appear; and a backward one, from the orthogonal to the real descriptors, by expanding the determinant (5) of the system (4) as the QSAR solution.
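The double conversion can be sketched numerically as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the backward step is carried out by back-substitution of the Gram-Schmidt relations rather than by the symbolic expansion of the determinant (5), the overlaps 〈Ω_k|Y_PRED〉 are replaced by 〈Ω_k|Y_OBS〉 (equal when the error vector is orthogonal to every descriptor), and all function names and data are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Scalar product <u|v> of two real-valued columns."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(xs: List[Vector]) -> Tuple[List[Vector], List[List[float]]]:
    """Orthogonalize the descriptors |X_0>, ..., |X_M> (classical scheme).

    Returns the orthogonal vectors |Omega_k> and the coefficients
    r[k][i] = <X_k|Omega_i> / <Omega_i|Omega_i> for i < k.
    """
    omegas: List[Vector] = []
    r: List[List[float]] = []
    for x in xs:
        coeffs = [dot(x, o) / dot(o, o) for o in omegas]
        omega = list(x)
        for c, o in zip(coeffs, omegas):
            omega = [w - c * oi for w, oi in zip(omega, o)]
        omegas.append(omega)
        r.append(coeffs)
    return omegas, r


def spectral_fit(xs: List[Vector], y_obs: Vector) -> List[float]:
    """Coefficients b_k of |Y_PRED> = sum_k b_k |X_k>."""
    omegas, r = gram_schmidt(xs)
    # Forward step: expansion of the activity on the orthogonal basis.
    b = [dot(o, y_obs) / dot(o, o) for o in omegas]
    # Backward step: |Omega_k> = |X_k> - sum_{i<k} r[k][i] |Omega_i>,
    # substituted from the last descriptor down to the first.
    for k in range(len(xs) - 1, -1, -1):
        for i in range(k):
            b[i] -= b[k] * r[k][i]
    return b


# Hypothetical data: unit column |X_0>, one descriptor |X_1>, four molecules.
xs = [[1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0]]
y_obs = [1.1, 1.9, 3.2, 3.8]
b = spectral_fit(xs, y_obs)
y_pred = [sum(bk * x[n] for bk, x in zip(b, xs)) for n in range(len(y_obs))]
error = [yo - yp for yo, yp in zip(y_obs, y_pred)]
print(b)                               # coefficients on the real descriptors
print([dot(x, error) for x in xs])     # overlaps with the error vector
```

For this toy data the coefficients coincide with the ordinary least-squares intercept and slope, and the error vector has vanishing overlap with both descriptors up to rounding.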
It is worth stressing that the present QSAR/Spectral-SAR equations are derived entirely from the (analytical) determinant (5) and are not computationally restricted to the inverse matrix product prescribed by the conventional statistical Pearson approach [20]. Moreover, the Spectral-SAR algorithm is invariant with respect to the order of descriptors chosen in the orthogonalization procedure, providing equivalent determinants no matter how its rows are re-derived, an improvement not previously achieved by other available orthogonalization techniques [15,17].
Besides the effectiveness of the S-SAR methodology in reproducing the conventional multi-linear QSAR analysis [7,21], one of its advantages is the possibility of introducing the so-called (vectorial) norms (see Appendix A1) associated with either experimental (measured or observed) or predicted (computed) activities:
They provide a unique assignment of a number to a specific type of correlation, i.e., a final quantification of the models. Moreover, the activity norm given in Equation (6) opens the possibility of replacing the classical statistical correlation factor [21]:
with a new index of correlation, the so-called algebraic S-SAR correlation factor (or R-algebraic, abbreviated RA), defined as the ratio of the predicted to the observed norms [22,23]:
It has the meaning of the realization probability with which a given predicted model approaches the observed activity across all of the employed molecules (in the hyper-molecular states of activities); see Appendix A2.
With this interpretation the algebraic correlation conceptually departs from the statistical one: the latter accounts for the degree to which each computed individual molecular activity approaches the mean activity of the N molecules, while the former evaluates the (hyper-molecular) degree of overlap of the predicted and observed activity norms (viewed as the “amplitudes” of the intensity of the molecule-organism interaction). In this respect, the algebraic analysis seems better suited to environmental studies, in which the global rather than the local effect of a series of toxicants on specific species and organisms is evaluated.
In fact, this new correlation factor compares the vectorial length of the predicted activity with that of the measured one, and thus indicates the extent to which a computed property or activity approaches the “length” of the observed quantity.
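The comparison of the two correlation factors can be sketched in a few lines of Python. The sketch assumes that the norm is the Euclidean length of the activity column and that the statistical factor takes the common form R = sqrt(1 − SSE/SST), with SSE the sum of squared prediction errors and SST the sum of squared deviations of the observed activities from their mean; the function names and the data are hypothetical.

```python
import math
from typing import List


def norm(y: List[float]) -> float:
    """Euclidean norm || |Y> || = sqrt(<Y|Y>) of an activity column."""
    return math.sqrt(sum(v * v for v in y))


def algebraic_r(y_obs: List[float], y_pred: List[float]) -> float:
    """Ratio of the predicted to the observed activity norms (RA)."""
    return norm(y_pred) / norm(y_obs)


def statistical_r(y_obs: List[float], y_pred: List[float]) -> float:
    """Assumed statistical factor R = sqrt(1 - SSE/SST)."""
    mean = sum(y_obs) / len(y_obs)
    sse = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    sst = sum((o - mean) ** 2 for o in y_obs)
    return math.sqrt(max(0.0, 1.0 - sse / sst))


# Hypothetical observed activities and a linear-model prediction.
y_obs = [1.1, 1.9, 3.2, 3.8]
y_pred = [1.09, 2.03, 2.97, 3.91]
print(algebraic_r(y_obs, y_pred), statistical_r(y_obs, y_pred))
```

Under these assumptions, and when the error vector is orthogonal to the predicted activity, the squared norms satisfy ‖Y_PRED‖² = ‖Y_OBS‖² − SSE, so that RA² = 1 − SSE/‖Y_OBS‖². Since ‖Y_OBS‖² is never smaller than SST, RA cannot fall below R in this setting.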
It has already been shown that the algebraic correlation factor of Equation (8) systematically furnishes higher and more insightful values than its statistical counterpart [21,24], which advances it as the ideal tool for correlation analysis on a narrow interval of data, where the statistical meaning is naturally lost.
Moreover, in terms of the “quantum spectral” formalism, one can say that the algebraic investigation provides the “excited” states of an activity model, while the statistical approach deals with the “ground state” or lower states of correlation. Consequently, for completeness, a proper search for structure-activity models should include both of these stages of molecular SAR modeling.
Going further towards extracting the mechanistic information from the Spectral-SAR norms and correlation factors, we can advance the so-called least path principle:
applied to successively connected models with different correlation dimensions: it starts from one dimension, with a single-structural-indicator correlation, say A1, and proceeds to the models with the maximum number of correlation factors, say AM (i.e., containing M indicators; see Table 1) [7–10]. Since each of these models is now characterized by its predicted activity norm ‖|YPRED〉‖ along with the algebraic (RA) and/or statistical (R) correlation factors, the elementary paths of Equation (9) are constructed as the Euclidean measure between two consecutive models (endpoints) [7–10,22–24]:
It is noteworthy that the formal Equation (9) has to be read as searching for the combination of paths on the left side that gives the minimum value on the right side; it serves as the tool for deciding the hierarchy among all (ergodic) possible end-point-linked paths, with the important consequence of picturing the mechanistic and causal evolution of the structural influences that trigger the observed effects.
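The search for the least path can be sketched as an exhaustive enumeration. The Python sketch below assumes that each model is a point (predicted-activity norm, correlation factor), that the elementary path between two consecutive models is the Euclidean distance between these points, and that a complete path picks one model at each correlation dimension. The model labels and numerical values are hypothetical and are not results of this study.

```python
import math
from itertools import product
from typing import Dict, List, Tuple

Model = Tuple[float, float]  # (predicted-activity norm, correlation factor)


def distance(a: Model, b: Model) -> float:
    """Assumed Euclidean measure between two consecutive models."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def least_path(levels: List[Dict[str, Model]]) -> Tuple[Tuple[str, ...], float]:
    """Enumerate all paths with one model per level; return the shortest."""
    best_names: Tuple[str, ...] = ()
    best_length = math.inf
    for names in product(*levels):  # iterating a dict yields its keys
        length = sum(
            distance(levels[i][names[i]], levels[i + 1][names[i + 1]])
            for i in range(len(names) - 1)
        )
        if length < best_length:
            best_names, best_length = names, length
    return best_names, best_length


# Hypothetical models with one, two, and three descriptors.
levels = [
    {"Ia": (5.0, 0.90), "Ib": (5.2, 0.95)},
    {"IIa": (5.3, 0.96), "IIb": (5.1, 0.93)},
    {"III": (5.4, 0.99)},
]
print(least_path(levels))
```

The exhaustive enumeration is adequate for the handful of models arising from a few descriptors; the number of candidate paths grows as the product of the level sizes.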
This methodology has been successfully applied in ecotoxicology [7,8,24] and in modeling the behavior of species interactions within a test battery [23], and it promises to furnish an adequate framework for the present (and future) interspecies analysis as well.