Reproduction is permitted for noncommercial purposes.
A novel quantitative structure-activity (property) relationship model, namely Spectral-SAR, is presented in an exclusively algebraic way, replacing the old-fashioned multi-regression one. The actual SSAR method interprets structural descriptors as vectors in a generic data space that is further mapped into a full orthogonal space by means of the Gram-Schmidt algorithm. Then, by coordinate transformation between the data and orthogonal spaces, the SSAR equation is given in a simple determinant form for any chemical-biological interactions under study. While proving to give the same analytical equation and correlation results as standard multivariate statistics, the actual SSAR frame allows the introduction of the spectral norm as a valid substitute for the correlation factor, while also having the advantage of designing the various related SAR models through the introduced “minimal spectral path” rule. An application is given performing a complete SSAR analysis upon the
In chemistry, the first systematic correlations come from Lavoisier’s law of conservation of mass and energy, followed by Dalton’s conception of structural matter. Nevertheless, Mendeleyev was the first to place structure-activity relationships (SARs) at the centre of chemistry with his vision of the periodic table [
Yet, the current problem of science is to organize the huge amount of experimental information into comprehensive equations with predictive value. At this point, quantitative structure-activity relationship (QSAR) methods seem to offer the best key for unifying chemical and biological interactions into a single in vivo-in vitro content [
However, although the main purpose of QSAR studies is to find the structural parameters that best correlate with the observed activity/property of the interactions, a multitude of methods for attaining this goal have appeared. They strive to quantify the causes in such a way that they are reflected in the measurement with maximal accuracy or minimal error. Phenomenologically, these methods can be grouped into “classical” [
In short, classical QSAR approaches take as descriptors the structural indices that directly reflect the electronic structures of the tested chemical compounds. As such, they assume that the biological activity depends on factors describing lipophilicity (e.g. LogP, surfaces), electronic effects (e.g. Hammett constants, polarization, localization of charges), and steric effects (e.g. Taft indices, Verloop indices, topological indices, molecular mass, total energy at optimized molecular geometry) [
A step forward is made when 3-dimensional structures are characterized by entry indices. For instance, the MTD (minimal topological difference) [
Still, statistically, it was found that multiple linear regression can be used only when a large number of compounds is available to explore the structural combinations. Under these circumstances, the next QSAR category in
Nevertheless, having several solutions for deciding among thousands of products from millions of libraries, together with hundreds of descriptors, opens the problem of their further relevance and classification. With this we arrive at the heart of a QSAR analysis: the orthogonal problem. Statistically, this term was interpreted as referring to descriptors whose values form a basis set with little intercorrelation. In practice, data reduction techniques such as PCA (principal component analysis) [
Another way of interpreting orthogonality was to produce an orthogonal space by transforming the original basis set of descriptors into an orthogonal one, by searching for inter-regression equations between them [
Under these circumstances, the third attempt at interpreting the orthogonal problem takes the scalar product as the main vehicle for delivering the QSAR solution in a completely algebraic way, thus furnishing the so-called Spectral-SAR (SSAR) technique in
The field of ecotoxicology was chosen as an application, where various combined SSAR-Hansch models are constructed for describing the toxicity of 26 xenobiotics on the
The basic problem of structure-activity relationship analysis can be formulated as follows: given a set of measured activities of a certain series of (say
In
Therefore, the SAR problem becomes quantitative since the set of fixed parameters is determined so that the errors in activity evaluation are minimized. This way, the
However, this “Holy Grail” property of a QSAR equation opens the issue of significance and statistical relevance of the values considered in
Usually, the QSAR problem is solved in the so-called “normal” or “standard” way, briefly described in what follows. Firstly, the
Note that, generally, each activity evaluation is assumed to be accompanied by a different error, i.e. the values
However, since the following matrices are introduced
the system (
Hence, the minimization of the error vector
Put in vectorial terms, the solution of the supradimensional system (
Finally, one uses the following theorem [
This means that we can consider
It is worth noting that while solution (
Despite this, the “normal” or “standard” QSAR procedure is already implemented in various software packages nowadays. Still, it is worth exploring an alternative way that may offer both conceptual and computational advantages. The so-called “spectral” algorithm, presented below, stands as such a new perspective, belonging to the “orthogonal QSAR” methods of
The key concept in SAR discussion regards the independence of the considered structural parameters in
The idea is to transform the columns of structural data of
The analytical procedure unfolds in three simple steps.
Basically,
Moreover, since the columns are now considered as vectors in data space we are looking for the “spectral” decomposition of the activity vector 
The next step is to construct a vectorial algorithm so that the residual vector 
To achieve the minimal errors in (
However, before applying it effectively one has to introduce the generalized scalar product through the basic rule:
giving out a real number from two arbitrary
Briefly, remember that the orthogonal condition requires that the scalar product of type (
Choose
Then, by picking 
so that 〈Ω_{0}|Ω_{1}〉 = 0, ensuring so far that |Ω_{0}〉 and |Ω_{1}〉 are orthogonal.
Next, repeating steps i. and ii. above until the vectors |Ω_{0}〉, |Ω_{1}〉, …, Ω
so that the vector Ω
Step (iii) is repeated and extended until the last orthogonal predictor vector Ω
Therefore, grounded on the Gram-Schmidt recipe, the starting predictor vectorial basis {
Within the constructed orthogonal space, the vector activity 
Note that the residual vector in
This way, the Gram-Schmidt algorithm, by its specific orthogonal recursive rules, absorbs or transforms the minimization condition of errors in (
At this point, since there is no residual vector remaining in (
a condition assured by the very nature of the vectors from the constructed orthogonal basis.
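The orthogonalization recipe and the orthogonality condition it guarantees can be sketched numerically. The following is a minimal illustration, assuming the generalized scalar product is the ordinary sum of component-wise products and using a small made-up set of data-space vectors (the names `scalar_product`, `gram_schmidt` and the toy columns are hypothetical, not taken from the paper):

```python
import numpy as np

def scalar_product(u, v):
    """Generalized scalar product, assumed here to be the sum of component-wise products."""
    return float(np.dot(u, v))

def gram_schmidt(columns):
    """Classical Gram-Schmidt: return orthogonal (not normalized) vectors."""
    omegas = []
    for x in columns:
        x = np.asarray(x, dtype=float)
        omega = x.copy()
        for prev in omegas:
            # Remove from x its component along each previously built vector
            omega -= scalar_product(x, prev) / scalar_product(prev, prev) * prev
        omegas.append(omega)
    return omegas

# Toy data-space basis: the unity vector plus two illustrative descriptor columns
x0 = np.ones(4)
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 4.0, 9.0, 16.0])
omega = gram_schmidt([x0, x1, x2])

# Every pair of distinct orthogonal vectors has a vanishing scalar product
for k in range(len(omega)):
    for l in range(k):
        print(k, l, scalar_product(omega[k], omega[l]))
```

The vectors are deliberately left unnormalized, so each expansion coefficient would later be divided by the squared norm of its own basis vector.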
As such, each coefficient comes out as the scalar product of its specific predictor vector with the activity vector (
With coefficients given by expressions of type (
However, this goal is easily achieved through the final stage of the present SAR algorithm. It consists of going back from the orthogonal to the initial basis of data through the system of coordinate transformations:
While the first equation of (
Finally, the system (
this being the condition consecrated by the theorem according to which
It is worth noting that the minimization of residual errors was unnecessarily complicated in previous orthogonalization approaches [
Moreover, the ordering problem in all previous orthogonal descriptors’ methods [
It is now clear that once expanded, observing its first column, the determinant (
However, although they differ in mathematical procedure, both the standard and Spectral-SAR approaches give similar results due to the theorem that states that [
When combining
in close agreement with the previous normal one, see
With these considerations one would prefer the present Spectral-SAR approach when solving QSAR problems in chemistry and related molecular fields. Nevertheless, in order to also demonstrate a practical advantage of the Spectral-SAR scheme, a specific application, with relevance to ecotoxicological studies, is presented in the next section.
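The agreement between the two routes can be checked numerically. The sketch below is one possible reading of the algorithm, not the authors' implementation: it orthogonalizes the data columns, expands the activity in the orthogonal basis, and then transforms the coefficients back to the original basis by solving a small triangular system instead of expanding the determinant. For brevity it uses only the first eight rows of the compound table and its first two descriptor columns; since the column headers did not survive, the names `d1` and `d2` are placeholders, and `spectral_sar_coefficients` is a hypothetical helper.

```python
import numpy as np

def spectral_sar_coefficients(X, y):
    """Coefficients b of y ~ X @ b via Gram-Schmidt plus back-transformation.

    X is an (N, M+1) array whose first column is the unity vector.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_cols = X.shape[1]
    omega = np.zeros_like(X)   # orthogonal vectors, stored column by column
    R = np.eye(n_cols)         # X[:, k] = omega[:, k] + sum_{j<k} R[j, k] * omega[:, j]
    for k in range(n_cols):
        omega[:, k] = X[:, k]
        for j in range(k):
            R[j, k] = (X[:, k] @ omega[:, j]) / (omega[:, j] @ omega[:, j])
            omega[:, k] -= R[j, k] * omega[:, j]
    # Expansion coefficients of the activity vector in the orthogonal basis
    w = np.array([(y @ omega[:, k]) / (omega[:, k] @ omega[:, k])
                  for k in range(n_cols)])
    # Back to the data basis: omega @ w = X @ b with X = omega @ R, hence R @ b = w
    return np.linalg.solve(R, w)

# First eight rows of the compound table (activity, first two descriptor columns)
activity = np.array([-2.67, -1.99, -1.43, -1.75, -1.46, -0.21, -0.23, -0.06])
d1 = np.array([-0.27, 0.08, 0.94, 1.01, 1.64, 1.76, 1.26, 2.23])
d2 = np.array([3.25, 5.08, 8.75, 8.2, 10.04, 11.07, 11.79, 12.91])
X = np.column_stack([np.ones_like(d1), d1, d2])

b_spectral = spectral_sar_coefficients(X, activity)
b_normal, *_ = np.linalg.lstsq(X, activity, rcond=None)  # standard least squares
print(b_spectral)
print(b_normal)
```

Both routes minimize the same residual, so the two coefficient vectors should coincide up to floating-point error; the coefficients obtained from this eight-row subset are not expected to match the equations reported for the full 26-compound series.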
For more than a decade the European Union institutions, e.g. Organization for Economic Cooperation and Development (OECD) through its Registration, Evaluation, and Authorization of Chemicals (REACH) management system [
Nevertheless, in order to best accomplish such a goal, both a conceptual and a computational strategy need to be adopted. As such, while, for instance, a certain set of parameters has been identified for environmental studies, i.e. bioaccumulation, chemical degradation (aqueous and gas phase), biodegradation, soil sorption, and ecotoxicity, two major aspects have been identified for QSAR analyses, namely the quality and the chemical domain of the QSAR [
Concerning the parameters to be evaluated, they are analytically transposed into the so-called
On the other hand, a useful QSAR model has to satisfy selection criteria in order to be validated.
From the statistical point of view, the ratio of data points to the number of variables should be greater than or equal to 5 (the so-called Topliss-Costello rule [
As descriptors, those directly related to the molecular structure of the chemical are preferable. It is worth noting here that quantum chemical parameters have an advantage over those of topological nature; still, the quantum parameters to be used have to be relatively easily obtainable; for instance, those based on ground state or valence state properties of compounds are preferable to those based on transition-state calculations [
If descriptors are taken from experiments, the experimental conditions must be specified. Nevertheless, the best models predicting ecotoxic effects have to be mechanistically interpretable, so that the structure-activity correlation permits reconstruction or prediction of the basic phenomena that take place at the molecular level.
Regarding outliers, they have to be treated with caution, as they are not necessarily outside the chemical domain but depend on the QSAR model (i.e. on the correlated descriptors) employed [
Based on the previous criteria, for a QSAR analysis to be well conducted, a compromise between breadth (variety) and depth (representability) across the existing chemicals within that domain has to be considered.
This way, the twofold process of dissimilarity and similarity based selection is achieved [
After all, it is widely recognized that ecotoxic action is a multivariate process in which xenobiotics produce immediate and long-term effects due to various transformation products. Therefore, a QSAR approach may provide information on the bio-uptake (i.e. the key process) through the selected descriptors, which can be integrated in an expert system of toxicity prediction.
However, with a view to designing an ecotoxicological mechanistic battery for different species on QSAR grounds, the first stage of unicellular organism level is undertaken here.
We often think of unicellular organisms as having a simple, primitive structure. This is definitely an erroneous view when applied to the ciliates; they are probably the most complex of all unicellular organisms.
Unlike multicellular organisms, which have cells specialized for performing the various body functions, single-celled organisms must perform all these functions with a single cell, and so their structure may be much more complex than the cells of larger organisms.
Movement, sensitivity to the environment, water balance, and food capture must all be accomplished with the machinery in a single cell [
Many of these single-celled organisms feed by engulfing smaller organisms directly into temporary intracellular vacuoles. These food vacuoles circulate in a characteristic manner within the cells while enzymes are secreted into them for digestion [
However, from the taxonomic point of view they are classified downwards, from kingdom to species, as:
However, it is worth restricting the discussion to ciliates only, since they include about 7500 known species of some of the most complex single-celled organisms ever, as well as some of the largest free-living protists; a few genera may reach two millimeters in length. They are abundant in almost every environment with liquid water: ocean waters, marine sediments, lakes, ponds, and rivers, and even soils. Because individual ciliate species vary greatly in their tolerance of pollution, the ciliates found in a body of water can be used to gauge the degree of pollution quickly.
More specifically, ciliates are classified on the basis of cilia arrangement, position, and ultrastructure. Such work now involves electron microscopy and comparative molecular biology to estimate relationships.
In the most recent classification of ciliates, the group is divided into eight classes:
Nevertheless, the unicellular organisms most frequently studied through QSAR toxicological analysis are from the Tetrahymena genus of ciliated protozoa. All species of the genus Tetrahymena are morphologically very similar; they display multiple nuclei: a diploid micronucleus found only in conjugating strains and a polyploid macronucleus present in all strains, which is the site of gene expression during vegetative growth, see
Tetrahymena species are very common in aquatic habitats and are nonpathogenic, have a short generation time and can be grown to high cell density in inexpensive media [
The earliest classifications were based on morphological and ecological data. At this level the presence or absence of a caudal cilium was regarded as an important character. Later, three morphological species complexes were distinguished: the pyriformis complex with smaller, bacterivorous species and fewer somatic kineties; the rostrata complex with larger parasitic or histophagous species, more somatic kineties, and the ability to form resting cysts; and the patula complex with species that undergo microstome-macrostome transformation. Within the complexes, particularly the pyriformis complex, species are distinguishable by their mating capacity and/or isozyme patterns. Finally, another approach based on the degree of parasitism was suggested. Since, the
Accordingly,
Quite often, despite the tendency to submit a large class of descriptors to a QSAR analysis, this is not the best strategy [
More focused studies in ecotoxicology, and especially regarding
While hydrophobicity describes the penetration power of the xenobiotics through biological membranes, the other descriptors to be considered reflect the electronic and specific interaction between the ligand and the target site of the receptor.
Moreover, it was convincingly argued that the classical Hammett constant can be successfully rationalized by a pure structural index as the energy of the lowest unoccupied molecular orbital (
thus also providing enough information from transport, electronic affinity and specific interaction at the molecular level, respectively.
However, in the present study, besides considering
Then, the steric descriptor is chosen here, for simplicity, as the total molecular energy (
Under these circumstances the ecotoxic activity to
It is worth mentioning that the ratio of the number of compounds to the number of descriptors used complies with the above Topliss-Costello rule, and that both chemical variability and congenericity are fulfilled since most of the compounds reflect phenolic toxicity.
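The ratio can be verified with a few lines of arithmetic. This is a small sketch, assuming the series of 26 compounds and models with one, two or three descriptors, as suggested by the compound table; the function name is hypothetical:

```python
def satisfies_topliss_costello(n_compounds: int, n_descriptors: int,
                               min_ratio: float = 5.0) -> bool:
    """True when the data-points-to-variables ratio is at least min_ratio."""
    return n_compounds / n_descriptors >= min_ratio

n_compounds = 26  # size of the xenobiotics series
for n_descriptors in (1, 2, 3):
    ratio = n_compounds / n_descriptors
    print(n_descriptors, round(ratio, 2),
          satisfies_topliss_costello(n_compounds, n_descriptors))
```

Even the largest model considered here keeps the ratio above 8, so the rule holds with some margin.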
The standard QSAR analysis of data of
as correlation factor, standard error of estimate and Fisher index, respectively, in terms of the total number of residues, measuring the spreading of the input activities with respect to their estimated counterparts,
and the total sum of squares,
measuring the dispersion of the measured activities around their average:
while the number of compounds and descriptors were fixed to
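The three indices can be computed directly from the measured and predicted activities. The sketch below assumes the usual textbook definitions (the equations themselves did not survive in this copy): r = sqrt(1 − SR/SQ), s = sqrt(SR/(N − M − 1)) and F = ((N − M − 1)/M)(SQ/SR − 1), with SR the sum of squared residues, SQ the total sum of squares, N the number of compounds and M the number of descriptors. The function names and the four-point toy data are illustrative only.

```python
import math

def qsar_statistics(measured, predicted, n_descriptors):
    """Return (r, s, F) from measured and predicted activities."""
    n = len(measured)
    mean = sum(measured) / n
    sr = sum((a - p) ** 2 for a, p in zip(measured, predicted))  # sum of squared residues
    sq = sum((a - mean) ** 2 for a in measured)                   # total sum of squares
    dof = n - n_descriptors - 1                                   # degrees of freedom
    r = math.sqrt(1.0 - sr / sq)
    s = math.sqrt(sr / dof)
    f = (dof / n_descriptors) * (sq / sr - 1.0)
    return r, s, f

def fisher_from_r(r, n_compounds, n_descriptors):
    """Fisher index rewritten in terms of r, since SQ/SR - 1 = r^2 / (1 - r^2)."""
    dof = n_compounds - n_descriptors - 1
    return (dof / n_descriptors) * r ** 2 / (1.0 - r ** 2)

# Made-up toy numbers, only to exercise the formulas
r, s, f = qsar_statistics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], n_descriptors=1)
print(r, s, f)

# Consistency check against the tabulated three-descriptor model (r = 0.9409, N = 26)
print(fisher_from_r(0.9409, 26, 3))
```

Under these assumed definitions, the last call gives a value close to the Fisher index of 56.598 listed for the model with r = 0.941, which suggests the definitions match those used for the table.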
Before attempting a mechanistic analysis of the results, let us apply the SSAR techniques to the same data of
More explicitly, in
Remarkably, one may easily note the striking similarity of the equations in
However, conceptually, SSAR achieves a degree of novelty with respect to normal QSAR in that the spectral equation is given in terms of vectors rather than variables. Such a feature marks a fundamental achievement since this way we can deal at once with all available data (of activity and descriptors) within a generalized vectorial space. Consequently, we may also use the spectral norm of the activity,
as the general tool by means of which various models can be compared, regardless of their dimensionality and multilinear degree, since they all reduce to a single number. This could help fulfill QSAR's old dream of providing a conceptual basis for the comparison of various models and end points, thereby becoming a true science. Moreover, while also accurately reproducing the statistics of the standard QSAR, the actual SSAR permits the introduction of an alternative way of computing correlation factors by using the above spectral norm concept. As such, the so-called algebraic SSAR correlation factor is defined as the ratio of the spectral norm of the predicted activity to that of the measured one:
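This ratio can be evaluated from numbers given elsewhere in the paper. The sketch below assumes the spectral norm is the Euclidean norm of the activity vector; it takes the 26 measured activities from the compound table and the seven predicted spectral norms from the norm table (variable names are illustrative):

```python
import math

# Measured activities of the 26 xenobiotics, copied from the compound table
activity = [-2.67, -1.99, -1.43, -1.75, -1.46, -0.21, -0.23, -0.06, -0.14,
            0.94, -0.2, -0.27, 0.12, 0.05, 0.55, 0.53, 1.54, 0.53, 1.08,
            1.3, 0.87, 1.8, 1.76, 2.05, 1.66, 2.66]

# Predicted spectral norms of the seven SSAR models, copied from the norm table
predicted_norms = [3.86176, 6.22803, 6.0607, 6.24858, 6.32297, 6.43641, 6.44557]

def spectral_norm(vector):
    """Euclidean norm: square root of the scalar product of a vector with itself."""
    return math.sqrt(sum(c * c for c in vector))

measured_norm = spectral_norm(activity)
r_algebraic = [norm / measured_norm for norm in predicted_norms]

print(round(measured_norm, 5))
print([round(value, 5) for value in r_algebraic])
```

With these inputs the ratios agree with the tabulated algebraic correlation factors to within rounding of the table entries, which supports reading the spectral norm as the plain Euclidean norm.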
Applying
The findings in
In other words, one can say that in an algebraic sense the SSAR furnishes systematically higher correlation factors than the standard QSAR does. This feature is also depicted in
Nevertheless, we should note at this point that while a certain model does not satisfy the correlation factor criteria for being validated, i.e.
Indeed, both within standard QSAR and SSAR approaches all models except (
Next, aiming to see whether the obtained models can provide us with a mechanistic model of the chemical-biological interaction of the tested xenobiotics on
In this respect,
Therefore, we may formulate the
In our case, according to the enounced minimum spectral path rule, the diagram of
Whenever the primary route is inhibited, the second hierarchy of action follows by excluding the models previously involved and is based on the same least action principle. The second initial model is chosen as the one nearest to the first on the spectral norm scale. Then, from all equivalent paths, the next step is made toward the closest neighbor in the spectral norm sense.
The second hierarchy results along the endpoints path
If the secondary route is somehow repressed as well, the third way of ecotoxicological action of
It is not surprising that the application of minimal action principles to the spectral activity norms furnishes many, albeit ordered, ways in which chemical-biological interactions are present in nature. This is in accordance with the heuristic truth that Nature reserves the privilege of developing many paths to achieve an action. The present SSAR approach offers these new possibilities of hierarchical modelling of activities, whereas the statistical analysis appears to be limited to single choices. Nevertheless, further work has to be performed by applying the SSAR method and its minimal spectral path principle to many species and classes of compounds in order to better validate the present results and algorithm.
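The path values listed in the synopsis table can be explored numerically. The sketch below rests on an assumption that is not stated in the surviving text: that the value of a path is the Euclidean distance between its end-point models in the plane spanned by the spectral norm and the correlation factor. The model labels did not survive either, so the models are addressed by their column position (0 to 6) in the norm table, and `path_length` is a hypothetical helper.

```python
import math

# (spectral norm, statistic r, algebraic r) for each column of the norm table;
# indices 0..6 are placeholders for the lost model labels.
models = [
    (3.86176, 0.53905, 0.56521),
    (6.22803, 0.90759, 0.91154),
    (6.0607, 0.88193, 0.88705),
    (6.24858, 0.91074, 0.91455),
    (6.32297, 0.92214, 0.92543),
    (6.43641, 0.9395, 0.94204),
    (6.44557, 0.9409, 0.94338),
]

def path_length(start, end, which="statistic"):
    """Assumed path value: end-point distance in the (norm, correlation) plane."""
    column = 1 if which == "statistic" else 2
    d_norm = models[end][0] - models[start][0]
    d_corr = models[end][column] - models[start][column]
    return math.hypot(d_norm, d_corr)

for start, end in [(0, 6), (2, 6)]:
    print(start, end,
          round(path_length(start, end, "statistic"), 5),
          round(path_length(start, end, "algebraic"), 5))
```

Under this assumption, the end points (0, 6) and (2, 6) give values that agree with the two groups of tabulated path values (about 2.615/2.611 and 0.3894/0.3890) to within rounding. That supports the assumed distance formula but does not by itself identify which intermediate models each path visits.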
Aiming to solve part of the many challenges posed by QSAR and its applications, with a view to generating a mechanistic-causal vision of the data recorded (measured or computed), the current paper introduces both a new analytical SAR modelling algorithm (the so-called Spectral-SAR method) and its associated minimum spectral action principle, following the activity norm of the models generated. As such, four possible branches of QSAR expertise were identified, namely those based on the so-called classical (of Hansch type), 3-dimensional (of CoMFA or MTD type), decisional (of genetic algorithm type) and orthogonal (of PCA type) methods, all proposing to furnish an appropriate analytical model for structure-chemical property or biological activity correlations. In this context the orthogonality problem was especially addressed, since the considered descriptors have to be as little collinear as possible in order to eliminate redundancies. Although many QSAR approaches make use of algorithms that separate or transform initial non-orthogonal data into an orthogonal space in search of a better correlation, many of them provide no significant improvement over the standard QSAR least-squares recipe. Instead, the present endeavor puts forth the orthogonal space (in the Gram-Schmidt sense) only as an intermediate one, in order to obtain from it the spectral expansion of the concerned activity and descriptors as vectors in a high-dimensional space. This way, through more algebraically transparent transformations, the spectral structure-activity relationships (SSAR) are formulated as a viable alternative to the previous standard QSAR method. The actual SSAR approach also provides the framework in which the spectral norm can be formulated, assigning a single number to any SAR problem that encodes all the information of a model, including the statistics.
However, the spectral norm permits the spectral formulation of the minimal action principle applicable among various tested models. As such, the ecotoxicology of the
Generic world of the quantitative structure-activity/property relationships, QSA(P)R, through classical, 3D, decisional and orthogonal methods of multivariate analysis of the chemical-biological interactions. In the scheme, MSD-MTD, CoMFA, and PCA stand for “minimal steric difference-minimal topological difference”, “comparative molecular field analysis” and “principal component analysis”, respectively.
Generic mapping of data space containing the vectorial sets {|X〉, |O〉} into orthogonal basis {|Ω(X)〉, |Ω(O)〉}.
Illustration of the oral region of
Norm correlation spectral space of the statistical and algebraic correlation factors against the spectral norm of the predicted SSAR models of
Spectral-structural models, designed through the rules of minimal Spectral-SAR paths of
Synopsis of the basic SAR descriptors.
…  …  
…  …  
⋮  ⋮  ⋮  ⋮  ⋮  ⋮ 
…  … 
The spectral (vectorial) version of SAR descriptors of

 

 
 
 
…  …  
1  …  …  
1  …  …  
⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮ 
1  …  … 
The series of the xenobiotics of those toxic activities
No.  Compound  1〉  

 
Name  Formulae  
methanol  CH_{3}OH  −2.67  1  −0.27  3.25  −11622.9  
ethanol  C_{2}H_{5}OH  −1.99  1  0.08  5.08  −15215.4  
butan-1-ol  C_{4}H_{9}OH  −1.43  1  0.94  8.75  −22402.8  
butanone  C_{4}H_{8}O  −1.75  1  1.01  8.2  −21751.8  
pentan-3-one  C_{5}H_{10}O  −1.46  1  1.64  10.04  −25344.6  
phenol  C_{6}H_{5}OH  −0.21  1  1.76  11.07  −27003.1  
aniline  C_{6}H_{5}NH_{2}  −0.23  1  1.26  11.79  −24705.9  
3-cresol  CH_{3}C_{6}H_{4}OH  −0.06  1  2.23  12.91  −30597.6  
4-methoxyphenol  OHC_{6}H_{4}OCH_{3}  −0.14  1  1.51  13.54  −37976.3  
2-hydroxyaniline  OHC_{6}H_{4}NH_{2}  0.94  1  0.98  12.42  −32095.4  
benzaldehyde  C_{6}H_{5}CHO  −0.2  1  1.72  12.36  −29946.9  
2-cresol  CH_{3}C_{6}H_{4}OH  −0.27  1  2.23  12.91  −30597.2  
3,4-dimethylphenol  C_{6}H_{3}(CH_{3})_{2}OH  0.12  1  2.7  14.74  −34190.8  
3-nitrotoluene  CH_{3}C_{6}H_{4}NO_{2}  0.05  1  0.94  13.98  −42365.1  
4-chlorophenol  C_{6}H_{5}OCl  0.55  1  2.28  13  −35307.6  
2,4-dinitroaniline  C_{6}H_{3}(NO_{2})NH_{2}  0.53  1  −1.75  15.22  −63030.2  
2-methyl-1,4-naphthoquinone  C_{11}H_{8}O_{2}  1.54  1  2.39  20.99  −49768.3  
1,2-dichlorobenzene  C_{6}H_{4}Cl_{2}  0.53  1  3.08  14.29  −36217.2  
2,4-dinitrophenol  C_{6}H_{3}(NO_{2})OH  1.08  1  1.67  14.5  −65318  
1,4-dinitrobenzene  C_{6}H_{4}N_{2}O_{4}  1.3  1  1.95  13.86  −57926.7  
2,4-dinitrotoluene  C_{7}H_{6}(NO_{2})_{2}  0.87  1  2.42  15.7  −61520.7  
2,6-di-tert-butyl-4-methylphenol  C_{15}H_{23}OH  1.8  1  5.48  27.59  −59316.5  
2,3,5,6-tetrachloroaniline  C_{6}H_{3}NCl_{4}  1.76  1  3.34  19.5  −57920.2  
pentachlorophenol  C_{6}Cl_{5}OH  2.05  1  −0.54  20.71  −68512.4  
phenylazophenol  C_{12}H_{10}N_{2}O  1.66  1  4.06  22.79  −55488.9  
pentabromophenol  C_{6}Br_{5}OH  2.66  1  5.72  24.2  −66151.5 
QSAR equations through the standard multilinear routine of the Statistica package [
Model  Variables  QSAR Equation  r  s  F 


0.539  1.15  9.834  

0.908  0.574  112.15  

0.882  0.644  84.015  

0.911  0.58  55.930  

0.922  0.54  65.339  

0.939  0.478  86.503  

0.941  0.48  56.598 
Spectral structure activity relationships (SSAR) through determinants of
Models  Vectors  SSAR Equation 

 
 
 
 
 
 

The predicted spectral norm, the statistic and the algebraic correlation factors of the SSAR models of
 
3.86176  6.22803  6.0607  6.24858  6.32297  6.43641  6.44557 

0.53905  0.90759  0.88193  0.91074  0.92214  0.9395  0.9409 

0.56521  0.91154  0.88705  0.91455  0.92543  0.94204  0.94338 
Synopsis of the statistic and algebraic values of paths connecting the SSAR models of
Path  Value  

 
Statistic  Algebraic  
2.61485  2.61132  
2.61485  2.61132  
2.61485  2.61132  
0.389359  0.388969  
0.389359  0.388969  
0.389359  0.388969 
MVP wishes to thank Prof. Adrian Chiriac from the Chemistry Department of the West University of Timişoara for his permanent stimulation towards SAR unification of theoretical and experimental chemistry and for the many key papers provided from his collection to complete the background study for this project. AML gratefully credits Prof. Vasile Ostafe from the Chemistry Department of the West University of Timişoara for encouraging her along the line of SAR applications in ecotoxicology. The authors would also like to express their sincere gratitude to Prof. Mark Cronin from the School of Pharmacy and Chemistry of Liverpool, to Dr. Bono Lučić from the Rugjer Bošković Institute of Zagreb, and to all those who, through their kind correspondence and supply of references in recent years, inspired many of the present SAR and ecotoxicology issues. We also thank our colleague Cristian Chiş from “Babel Center” in Timişoara for the careful reading of the manuscript. Last but not least, MVP and AML address particular appreciation to the Romanian National Council of Scientific Research in Universities – CNCSIS for the Grants AT/54/20062007 and TD/140/2007, respectively.
*** About systematic classification of