The Use of the Evenness of Eigenvalues of Similarity Matrices to Test for Predictivity of Ecosystem Classifications

We suggest using the evenness E(λ) of the eigenvalues of similarity matrices, corresponding to different hierarchical levels of ecosystem classifications, to test the correlation (or predictivity) between biological communities and environmental factors, as an alternative to analysis of variance (parametric or non-parametric). The advantage over traditional methods is that similarity matrices can be obtained from any kind of data (including mixed and missing data) by indices such as those of Goodall and Gower. The significance of E(λ) is calculated by permutation techniques. An example of the application of E(λ) is given using a data set describing plant community types (beech forests of the Italian peninsula).


Introduction
The separation between the classes of a classification, in terms of the features used for the classification itself or of features not used for it, can be evaluated by parametric methods based on univariate or multivariate analysis of variance (ANOVA or MANOVA) [1,2] and by non-parametric ones [3]. Among these, Feoli and Bressan in 1972 [4] proposed a simple index, called the index of "individualization", given by the ratio between the average similarity within one class of sampling units and the average similarity that this class has with the other classes present in the study area: the higher the index, the higher the separation of that class from the other classes. Orlóci [1] presented the index (let us call it INDI) in a form whose values range between 0 and 1. It is obtained by subtracting from 1 the ratio between the average similarity between the classes and the average similarity within the classes (i.e., INDI = 1 − B/W, where B is the average between-class, or among-class, similarity and W is the average within-class similarity). In 1988, Biondini et al. [5] proposed to test the separation between classes by an index based on the Euclidean metric (a method based on "sum of squares") that uses the average within-class sum of squares (δ), introducing the methods known as multiple response permutation procedure (MRPP) and its randomized block design analogue (MRBP). In their proposal the best partition (classification) is the one with the lowest value of δ. Later, Clarke and Anderson proposed, respectively, the methods known as Analysis Of SIMilarity (ANOSIM) [6] and PERmutational Multivariate ANalysis Of Variance (PERMANOVA) [7], in which the ratio between the between-class and within-class dissimilarity averages, or the "between (or among)/within (residual) sum of squares", can be based on any kind of similarity function, as suggested in [4]. In all these methods the significance of the class separation is tested by permutation techniques [8,9]. The advantage of ANOSIM and PERMANOVA with respect to the method of Biondini et al. [5] lies in the fact that both can be applied to any kind of data (mixed and missing data) if similarity is measured by suitable functions, e.g., those of Gower or Goodall [1,2].

The idea of using the evenness of the eigenvalues of a similarity matrix, rather than the indices based on the "ratio between/within average similarities" (or between/within sum of squares), is not new. It was already published by Feoli et al. [10] in 2009, and long before that it was implemented, together with INDI [1], in the software MATEDIT [11,12]. This was done to find the optimal classification among the sets of classifications obtained by MATEDIT, i.e., the one that would offer the best separation between the classes according to Occam's razor. This rule, also called the principle of ontological economy, principle of parsimony, or principle of simplicity [13,14,15], states: "Entities are not to be multiplied beyond necessity." In practical terms, Occam's razor is satisfied when every class of a classification can be easily distinguished from the other classes on the basis of peculiar features.
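As a small illustration of the index INDI = 1 − B/W recalled above, the following Python sketch computes it from a similarity matrix and a vector of class labels. The function name `indi` and the toy matrix are hypothetical, and the choice to average W over pairs of distinct objects only (excluding self-similarities) is an assumption, since the text does not specify it.

```python
import numpy as np

def indi(S, labels):
    """INDI = 1 - B/W for a similarity matrix S and one class label per object."""
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    same_class = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    # Assumption: W averages pairs of distinct objects in the same class.
    W = S[same_class & off_diagonal].mean()
    # B averages all pairs of objects belonging to different classes.
    B = S[~same_class].mean()
    return 1.0 - B / W

# Hypothetical toy matrix: two classes of sizes 3 and 2.
labels = [0, 0, 0, 1, 1]
S = np.array([
    [1.0, 0.9, 0.8, 0.2, 0.1],
    [0.9, 1.0, 0.7, 0.3, 0.2],
    [0.8, 0.7, 1.0, 0.1, 0.2],
    [0.2, 0.3, 0.1, 1.0, 0.9],
    [0.1, 0.2, 0.2, 0.9, 1.0],
])
print(round(indi(S, labels), 3))
```

With this definition INDI approaches 1 when the between-class similarity is negligible compared with the within-class similarity, and equals 0 when the two averages coincide.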
Two theorems of matrix algebra support the evenness of eigenvalues as a suitable index of class separation [16]. According to the first theorem, each disjoint submatrix of a given matrix has its own independent set of eigenvalues (and eigenvectors); according to the second theorem, an N × N matrix with all entries equal to 1 has only one non-zero eigenvalue, which is equal to N.
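The two theorems can be checked numerically. The sketch below, written with NumPy, builds a matrix of ones and a block-diagonal matrix for a hypothetical perfect crisp classification with class sizes 3 and 2, and then compares the entropy of the eigenvalue proportions with the entropy of the class proportions. The helper `shannon` and all variable names are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def shannon(p):
    """Shannon entropy of a vector of proportions (zero terms are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Second theorem: an N x N matrix of ones has one non-zero eigenvalue, equal to N.
N = 5
eig_ones = np.sort(np.linalg.eigvalsh(np.ones((N, N))))[::-1]

# First theorem: a matrix made of disjoint blocks carries the eigenvalues of
# each block. Hypothetical class sizes 3 and 2, with similarity 1 within
# classes and 0 between them (a perfect crisp classification).
sizes = [3, 2]
S = np.zeros((N, N))
start = 0
for n in sizes:
    S[start:start + n, start:start + n] = 1.0
    start += n
eig_blocks = np.sort(np.linalg.eigvalsh(S))[::-1]

# Entropy of the eigenvalue proportions versus entropy of the class proportions.
lam = eig_blocks[eig_blocks > 1e-9]
H_lambda = shannon(lam / lam.sum())
H_classes = shannon(np.array(sizes) / N)

print(np.round(eig_ones, 6), np.round(eig_blocks, 6))
print(np.isclose(H_lambda, H_classes))
```

In this toy case the non-zero eigenvalues of the block matrix are the class sizes themselves, which is why the two entropies coincide.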
From these two theorems it is easy to deduce that, in the case of a perfect crisp classification, the entropy [17] of the eigenvalues, H_k(λ), of a similarity matrix S (N × N), with k representing the number of classes of elements, would be equal to H_k, i.e., the entropy of the proportions of the classes:

$$H_k = -\sum_{j=1}^{k} \frac{n_j}{N} \log \frac{n_j}{N} \tag{1}$$

with j = 1, ..., k, n_j indicating the number of elements in the j-th class and N the total number of elements. Therefore, the ratio

$$D = \frac{H_k(\lambda)}{H_k} \tag{2}$$

where

$$H_k(\lambda) = -\sum_{i} \frac{\lambda_i}{\sum_{m} \lambda_m} \log \frac{\lambda_i}{\sum_{m} \lambda_m} \tag{3}$$

with λ_i indicating the i-th positive eigenvalue of the similarity matrix S (k × k), where the entries are the sums of similarity values within the classes and the sums of similarity values between the classes, would represent an index of class separation ranging between 0 and 1. It is 0 when the matrix is full of 1s, i.e., there is no separation between the classes; it is 1 when the matrix presents fully disjoint submatrices with the values of within similarity all equal to 1. If we do not consider the importance of the proportions of the classes, but just their within and between similarity, we should use the following formula:

$$E(\lambda) = \frac{H_k(\lambda)}{\log k} \tag{4}$$

In this case E(λ) is the evenness of the eigenvalues of the similarity matrix S (k × k) where the scores are the average similarity within the classes and the average similarity between the classes (this index also ranges between 0 and 1). Both D and E(λ) are indices measuring class separation that can be tested by permutation techniques.
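A minimal sketch of how E(λ) could be computed is given below. It assumes that the k × k class matrix holds the mean of each within-class and between-class block of S, that the within-class means include the self-similarities on the diagonal (the text does not say whether they are excluded), and that at least two classes are present. The names `class_matrix` and `evenness` are hypothetical.

```python
import numpy as np

def class_matrix(S, labels):
    """k x k matrix of average within-class (diagonal) and between-class similarity."""
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Assumption: each block mean includes every cell, diagonal cells too.
    return np.array([[S[np.ix_(labels == a, labels == b)].mean()
                      for b in classes] for a in classes])

def evenness(S, labels):
    """E(lambda): entropy of the positive eigenvalues of the class matrix over log k."""
    C = class_matrix(S, labels)
    k = C.shape[0]                      # requires k >= 2, otherwise log k = 0
    lam = np.linalg.eigvalsh(C)         # C is symmetric when S is symmetric
    lam = lam[lam > 1e-12]              # keep positive eigenvalues only
    p = lam / lam.sum()
    H = -(p * np.log(p)).sum()
    return float(max(0.0, H) / np.log(k))

# Two limiting cases for a hypothetical partition into classes of sizes 3 and 2.
labels = np.array([0, 0, 0, 1, 1])
crisp = (labels[:, None] == labels[None, :]).astype(float)  # disjoint blocks of 1s
uniform = np.ones((5, 5))                                   # no separation at all

print(evenness(crisp, labels), evenness(uniform, labels))
```

The two limiting cases reproduce the behaviour described in the text: fully disjoint blocks give an evenness of 1, and a matrix full of 1s gives 0.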
In the present paper, we consider only E(λ) because we regard the size of the classes as irrelevant when we are interested in measuring their separation. We suggest a new application of E(λ) that consists in testing how predictive a classification, based on a set of features A, is with respect to another set of features B that has not been used to obtain the classification.

Applications of E(λ)
2.1. Summary of the Rationale
E(λ) was primarily suggested to test the separation between classes at different hierarchical levels of a classification T on the basis of the set A of features that has been used to obtain T (e.g., [10]). In the present paper we suggest using E(λ) to test whether the k classes of a given classification are significantly separated when they are described by a set B of external features (biotic or abiotic, explanatory or non-explanatory variables), i.e., features that have not been used to obtain T. If the external features show significant separation between the classes, it would mean that their effect on the features used for the classification is significant or, vice versa, that the effect of the features used for the classification on these external variables is significant. In other words, the higher E(λ) is, the higher the correlation between A and B, i.e., the higher the predictivity of the classification based on A with respect to the set B, or vice versa. We can apply E(λ) to any given similarity matrix S (N × N) based on all the h variables of B and to each of the h similarity matrices S_i (N × N), with i = 1, ..., h, obtained by using the single i-th feature of the set B to test the separation of the k classes of a given classification of N objects. In the latter case, we get a correlation between the set A and the single i-th feature of B.
The exercise could be repeated for different hierarchical levels of T in order to discover, or to define, which level is the most predictive with respect to the whole set B or to its single features. E(λ) differs from INDI and from MRPP, MRBP, ANOSIM and PERMANOVA because it uses the spectrum of positive eigenvalues and not what Anderson [7] calls the "pseudo F statistic". E(λ) is not a "pseudo thing": it is mathematically clear and, thanks to the two theorems mentioned above, represents an index sensitive to the overall structure of the data set under study. Permutation techniques allow the significance of E(λ) to be tested by recalculating it a large number of times after permuting the scores of the similarity matrix S (N × N) within and among the submatrices corresponding to the k classes of a given classification. The ratio between the number of permuted values of E(λ) greater than the observed one and the total number of permutations gives an estimate of the probability of obtaining a separation at least as strong as the observed one under the null hypothesis of non-separation; the lower this ratio, the stronger the evidence for rejecting the null hypothesis.
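The permutation test can be sketched as follows. Note the swapped-in technique: instead of permuting the scores of S directly, this sketch permutes the class labels of the objects, which reassigns whole rows and columns of S to classes; this is a common way of implementing such tests and is an assumption here, not necessarily the procedure used by the authors. The function names, the toy data, the number of permutations and the use of "greater than or equal to" when counting are also illustrative choices.

```python
import numpy as np

def evenness(S, labels):
    """E(lambda) from the k x k matrix of average within/between-class similarity."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    C = np.array([[S[np.ix_(labels == a, labels == b)].mean()
                   for b in classes] for a in classes])
    lam = np.linalg.eigvalsh(C)
    lam = lam[lam > 1e-12]              # positive eigenvalues only
    p = lam / lam.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(classes)))

def permutation_test(S, labels, n_perm=999, seed=0):
    """Return the observed E(lambda) and the share of permutations reaching it."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = evenness(S, labels)
    count = 0
    for _ in range(n_perm):
        # Assumption: shuffle class labels rather than individual matrix cells.
        if evenness(S, rng.permutation(labels)) >= observed:
            count += 1
    return observed, count / n_perm

# Hypothetical data: ten objects measured on one variable, two distinct groups.
x = np.array([0.0, 0.1, 0.2, 0.1, 0.05, 5.0, 5.1, 5.2, 5.05, 5.15])
d = np.abs(x[:, None] - x[None, :])
S = 1.0 - d / d.max()
labels = [0] * 5 + [1] * 5

observed, p_value = permutation_test(S, labels)
print(observed, p_value)
```

A small p_value indicates that random assignments of the objects to the classes rarely produce an evenness as high as the observed one.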

Data
The example is based on the data matrix given in Table 1. It describes 10 vegetation types of beech forests of Central and South Italy [18] by eight ecological indicator values of Landolt [19], representing eight environmental factors. The 10 vegetation types were obtained by clustering methods using the species as features (features of set A) (see [18] for references). The 10 vegetation types belong to two phytosociological associations: Aquifolio-Fagetum (AQ) and Trochiscantho-Fagetum (TF).

E(λ) was applied in order to answer the following two questions: (a) Are the two plant associations, as defined by species (set A), significantly separated in the space defined by environmental factors (set B)? (b) Which environmental factors of set B are more correlated with the two associations?
The similarity matrix S (10 × 10) for the 10 vegetation types was obtained as the complement to 1 of the Euclidean distance, after having transformed all the Euclidean distances d_ij according to the following formula:

$$d'_{ij} = \frac{d_{ij} - d_{\min}}{d_{\max} - d_{\min}} \tag{5}$$

where d_min and d_max are, respectively, the minimum and maximum Euclidean distances in the dissimilarity matrix.
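The construction of the similarity matrix can be sketched as below, assuming that the transformation is a min-max rescaling of the distances, d' = (d − d_min)/(d_max − d_min), and that d_min and d_max are taken over the whole distance matrix, diagonal included (so d_min = 0). The function name and the small data table are hypothetical and are not the values of Table 1.

```python
import numpy as np

def similarity_from_table(X):
    """Similarity as the complement to 1 of min-max rescaled Euclidean distances."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances
    # Assumption: minimum and maximum taken over all cells, diagonal included.
    d_min, d_max = d.min(), d.max()
    # Undefined when all objects are identical, since d_max equals d_min.
    return 1.0 - (d - d_min) / (d_max - d_min)

# Hypothetical table: 4 objects described by 3 indicator values.
X = np.array([
    [2.9, 3.1, 3.0],
    [3.0, 3.0, 3.2],
    [2.5, 2.8, 3.6],
    [2.4, 2.7, 3.7],
])
S = similarity_from_table(X)
print(np.round(S, 3))
```

Under these assumptions every object has similarity 1 with itself and the most distant pair of objects has similarity 0.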
To answer question (a) we calculated E(λ) with the similarity matrix S (10 × 10), grouping the 10 vegetation types according to the two associations. To answer question (b) we measured the separation between the two associations in terms of each single environmental factor. In this way E(λ) is used as an alternative to the Kruskal-Wallis test (i.e., a univariate non-parametric analysis of variance [3]), which we calculated just for comparison with E(λ). We obtained eight similarity matrices S_i (N × N) by comparing the 10 vegetation types on the basis of each of the eight environmental factors. In this case, too, we used formula (5). These eight factors could have been combined in several ways to test the capacity of their combinations to separate the classes, but we did not undertake such an exercise, which would not add anything to the aim of the paper.
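The single-factor analysis can be sketched as follows: one similarity matrix is built per variable and E(λ) is computed for each of them with the same class labels. The data are hypothetical (six objects, two variables, two classes) and are not those of Table 1; the function names, the min-max rescaling of the distances and the inclusion of diagonal cells in the class averages are assumptions.

```python
import numpy as np

def evenness(S, labels):
    """E(lambda) from the k x k matrix of average within/between-class similarity."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    C = np.array([[S[np.ix_(labels == a, labels == b)].mean()
                   for b in classes] for a in classes])
    lam = np.linalg.eigvalsh(C)
    lam = lam[lam > 1e-12]              # positive eigenvalues only
    p = lam / lam.sum()
    return float(max(0.0, -(p * np.log(p)).sum()) / np.log(len(classes)))

def single_factor_similarity(x):
    """Similarity matrix S_i for one variable (1-D Euclidean distance, rescaled)."""
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])
    return 1.0 - (d - d.min()) / (d.max() - d.min())

# Hypothetical table: column 0 differs between the classes, column 1 does not.
X = np.array([
    [1.0, 2.0],
    [1.1, 3.0],
    [0.9, 2.5],
    [3.0, 2.5],
    [3.1, 2.0],
    [2.9, 3.0],
])
labels = [0, 0, 0, 1, 1, 1]

E_per_factor = [evenness(single_factor_similarity(X[:, i]), labels)
                for i in range(X.shape[1])]
print(E_per_factor)
```

In this toy case the first variable yields a high evenness and the second a value near 0; each value could then be tested by permutation in the same way as the multivariate one.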

Results
The results related to question (a) confirm that the separation between the two associations in the space defined by all environmental factors is significant. E(λ) is highly significant when the two associations are described by all eight environmental factors (E(λ) = 0.824, p < 0.00001). This means that the two associations, defined on the basis of floristic data, occupy two different community niches that are well separated from the environmental point of view.
The answer to question (b) is given in Table 2. This table also shows the average values of the environmental factors for the two associations. The two tests, E(λ) and KW, fully agree in showing the features that significantly separate the two associations. The correlation between E(λ) and KW is very high and significant (Table 3). Temperature and dispersion (i.e., the dimension of the soil particles) are the only two environmental factors that are highly significant. This means that these two environmental factors influence the floristic composition of the two associations very strongly. We can therefore conclude that Aquifolio-Fagetum is more thermophilic than Trochiscantho-Fagetum and that the latter association is more related to soils with higher dispersion values than the soils of the former association.

Discussion and Conclusions
The example shows how, with similarity matrices, we can analyze in detail the relationships between different kinds of environmental factors and the states of ecological systems. In this case the states are defined by types of communities; however, they could be represented by any type of sampling unit. It is easy to understand that E(λ) could be used as an index to measure the indicator values of species or other biological features, since the higher E(λ) is for a given feature, the higher the separation or distinctiveness of the classes as described by that feature. However, we do not enter into such a discussion here and we refer the interested reader to [20] (and references therein) for a review of other indices proposed to test the indicator values. In the example we used a very simple data matrix describing types of plant communities by features not used for their definition. We chose this simple data set because the results can easily be discussed and compared with the results of studies already completed with the same data set and with consolidated knowledge about this type of vegetation system (i.e., beech forests; for references see [18]). The results indicate the environmental factors differentiating the two associations. The pattern of significance is in agreement with current knowledge about the two associations and provides useful details. However, a discussion of the relationships between the biological features (species or other features) of vegetation and the environmental factors, which could be interesting for specialists in vegetation science, is outside the scope of this paper.
We can conclude by saying that the evenness E(λ) of the eigenvalues of similarity matrices, which are a necessary step for the majority of ordination methods [21] thanks to the spectral decomposition theorem [16], can be considered a simple, useful tool to investigate the predictive value of a classification with respect to external variables. This would help to find the variables that are explanatory and that support the given classification in terms of Occam's razor.

Table 1. Description of 10 vegetation types of beech forests of Central Italy by the indicator values corresponding to environmental factors according to Landolt [19]. F = Humidity, R = Reaction, N = Nutrients, H = Humus, D = Dispersion, L = Light, T = Temperature, C = Continentality.

Table 2. Results of the application of the Kruskal-Wallis test (KW) and of E(λ) obtained from the similarity matrices S_i (10 × 10) corresponding to the single environmental factors (variables from Humidity to Continentality), considering the classification of vegetation types into the two associations: AQ = Aquifolio-Fagetum and TF = Trochiscantho-Fagetum. Under the symbols of the associations are the average values of the eight variables; pKW is the probability of the KW test, pE(λ) is the probability of the evenness test. Significant differences are marked in bold.

Table 3. Correlation between the Kruskal-Wallis test (KW) and E(λ) in Table 2; p means probability of the tests.