Correlations in Compositional Data without Log Transformations

This article proposes a method for determining the p-value of correlations in compositional data, i.e., data that arise as a result of dividing original values by their sum. Data organized in this way are typical for many fields of knowledge, but there is still no consensus on methods for interpreting correlations in such data. In the second decade of the new millennium, almost all newly emerging methods for solving this problem have come to be based on the log transformation of the data. The method proposed here involves no log transformations. We return to the early stages of attempts to solve the problem and rely on the negative shifts of correlations in the multinomial distribution. In modeling the data, we use a hybrid method that combines the hypergeometric distribution with a distribution of any other law. During our work on the calculation method, we found that the number of degrees of freedom in compositional data is discrete only when all normalizing sums are equal and that it decreases when the sums are unequal, becoming a continuously varying quantity. Estimation of the number of degrees of freedom and of the strength of its influence on the magnitude of the shift in the distribution of correlation coefficients is the basis of the proposed method.


Introduction
The emergence in the new millennium of large databases, such as, for example, the Human Microbiome Project, has caused a marked increase in research activity aimed at finding methods for the statistical processing of data organized as fractions summing to a constant (proportions to 1, percent to 100, etc.). Following the established tradition in the English-language literature, we will call such data compositional, or simply a composition, consisting of k parts y_i = x_i / ∑_{j=1}^k x_j, where x_i ≥ 0.
Such normalization results in a closed structure, generating a special relationship between variance and covariance (var and cov). Since the proportions sum to a constant, over the whole composition ∑_{i=1}^k var(y_i) = −2 ∑_{i<j} cov(y_i, y_j), and in every single case var(y_i) = −∑_{j≠i} cov(y_i, y_j). This was demonstrated in [1], where it was also found that when a composition consists of variables with the same parameters, the zero correlation (hereafter r_0) is not zero but the value (1 − k)^{−1}.
A construction for r_0 that does not depend on the number k and takes into account only the proportion magnitudes was independently derived in [2,3]. In the latter paper, the derivation is based on the multinomial and Dirichlet laws of distribution. Since in such distributions the variance is proportional to the value E(y_i)(1 − E(y_i)), where E is the mathematical expectation, substituting the corresponding values of var and cov into Pearson's linear correlation formula yields the construction

r_0 = −√( E(y_i)E(y_j) / [(1 − E(y_i))(1 − E(y_j))] ),  i ≠ j,

or, through the original values, into

r_0 = −√( ∑_{l=1}^n x_{il} ∑_{l=1}^n x_{jl} / [(N − ∑_{l=1}^n x_{il})(N − ∑_{l=1}^n x_{jl})] ),  (1)

where N = ∑_{i=1}^k ∑_{l=1}^n x_{il} is the sum of all values comprising the composition, and n is the sample size.
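As a minimal illustration (the function name and setup are ours, not the paper's), the null correlation of Equation (1) can be computed directly from the expected proportions; with equal parts it reduces to the value (1 − k)^{−1} noted above:

```python
import math

def null_correlation(p_i, p_j):
    # Expected null correlation r_0 between two parts of a composition,
    # where p_i = E(y_i) and p_j = E(y_j) are the expected proportions.
    return -math.sqrt(p_i * p_j / ((1 - p_i) * (1 - p_j)))

# For a composition of k parts with equal parameters, p = 1/k,
# and the formula reduces to 1/(1 - k):
k = 5
print(null_correlation(1 / k, 1 / k))  # -0.25, i.e. 1/(1 - 5)
```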
However, a year earlier, in [2], Equation (1) was derived in a fundamentally different way: from a composition initially represented by a set of variables distributed according to the same law and having equal means and variances, a rearranged composition consisting of variables with different parameters is created by simple summation of different numbers of the initial variables. As can be seen, there is no binding to any particular law here, but according to the central limit theorem, it is implied by default that summation produces the effect of approximation to the normal distribution.
It would seem that, knowing the null correlation, all that remains is to find a way to subtract its value. However, researchers are looking for other ways to solve the problem. At first, only the applicability of Equation (1) was questioned for cases where the variance values in the raw data clearly exceed the mean values and do not differ from each other in proportion to the value E(y_i)(1 − E(y_i)) (if we are not mistaken, this question was first raised in [4] (p. 692)). Indeed, any simulation in which the original variables x_i with different parameters are not generated by one of the distributions conjugate to the multinomial, or which does not follow the summation process described above, will lead to deviations from the values expected by Equation (1). This is particularly pronounced when the parameter differences are large. One such case is specifically discussed in Section 4. As for the distrust of large values of variance, in hybrid distributions the variance can exceed the expectation by an unlimited number of times, and proportionality to the value E(y_i)(1 − E(y_i)) can be detected with certainty only when analyzing long series.
If at first only the limits of the applicability of Equation (1) were discussed, then after the appearance of J. Aitchison's works [5,6], in which an obligatory logarithmic transformation of the variables is introduced, the question is generally forgotten. And this is due not only to the obvious impossibility of using Equation (1) after the log transformation. The composition is deprived of the right to generate its specific values at all, since J. Aitchison also introduces the postulate of "subcompositional coherence", which requires that the measured quantities remain equal when the number of proportions forming a composition is changed. Correlation coefficients in normalized data inevitably change in such cases, so in the end J. Aitchison, faithful to this postulate, simply refused to use correlation analysis, which cannot be said of all his followers. Ignoring J. Aitchison's opinion that the correlation coefficient is uninformative, the authors of a number of works [7][8][9][10][11][12][13][14][15][16] nevertheless propose their own methods for its estimation. This is not surprising, since the postulate of unchangeability of values has no theoretical justification and refers to a problem related not to the specifics of calculations but to the choice of material for processing. Either the researcher is mistaken in defining the boundaries of the system, or he or she deliberately considers various variants of its possible boundaries, or simply compares the relations within the system with those formed in one of its subsystems. The values here must necessarily be different; otherwise, we would have to assume that, e.g., the extinction of the dinosaurs should not have affected the nature of interactions between the other members of the animal kingdom.
The effectiveness of correlation estimation methods based on logarithmic transformations, in which the dependence of the null correlation on the size of the proportions is lost, raises serious doubts (see Section 2 for details). For this reason, seeing no clear theoretical justification for the inevitability of the logarithmic transformation (the absence of such justification, not only for the above postulate but also for other special prescriptions in the methodology initiated by J. Aitchison's works, is directly stated in [17] (p. 831)), we also see no serious theoretical grounds for refusing to accept the negative value of null correlation determined by Equation (1).
The key point of our method is the determination of the number of degrees of freedom. In Section 3.1, the inevitability of their loss in the case of different sizes of the normalizing sums is justified, a method of estimating these losses is proposed, and the value obtained as a result of this estimation is used to adjust Equation (1). This adjustment is necessary to achieve a satisfactory correspondence to the empirical averages obtained in the distribution of correlation coefficients calculated between the proportions of the vector y_i. Hereinafter, we will refer to the adjusted value of the null correlation as the shift in the mathematical expectation, or simply the shift.
Finding the shift becomes the starting point for transforming asymmetric empirical distributions into distributions with a median equal to zero, which, in turn, after correction of the shift subtraction procedure, also performed by means of the value of degrees of freedom, opens the way to the solution of the main problem: to restore the possibility of interpreting the results using standard methods. The corresponding equations are introduced in Section 3.2. In Section 3.3, it is demonstrated that the bounds within which our method shows reasonably accurate results coincide clearly with the bounds within which the neutrality of the proportion vectors is revealed. Further, some features of hybrid models (Section 3.4), the problem of zero-value redundancy (Section 3.5), and the differences between the Pearson correlation and the Spearman and Kendall rank correlations (Section 3.6) are considered. The final part (Section 4) discusses the boundaries of the applicability of the method.

Materials and Methods
The main source of material for our work, aimed at finding an optimal method for estimating correlations in compositional data, is simulation, in the course of which long series of correlation coefficients are created. Under various experimental conditions, we observe and record features of their distribution, and on this basis we try to create a mathematical model of the phenomenon under study that would allow us to predict, with satisfactory accuracy, the probability of occurrence of any value.
In the simulation, the vector x_i is specified in three ways. The first uses the urn model with sampling without replacement, which results in the multivariate hypergeometric distribution. In the second, each element of the vector is generated independently, which makes it possible to study the behavior of correlations under any distribution laws, both discrete and continuous. In the third, the first two are merged. As a result of the merging, the classical urn with balls of different colors representing units of different variables is replaced by an urn that, instead of balls giving a value of 1, contains balls with a value varying within a given range and according to a given law. Such a hybrid model, in our opinion, agrees much better than the plain urn model with cases in which the objects of observation not only once appeared and settled in the studied areas of space but subsequently also multiplied in them. Different degrees of success in reproduction in such cases should inevitably lead to an increase in the initially low variance.
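The hybrid model can be sketched as follows (our own illustration, not the paper's code; the ball-value law, here exponential, is only an example of "a given law"):

```python
import random

def hybrid_urn_draw(counts, n_draw, rng):
    # Urn with counts[i] balls of colour i; drawing without replacement
    # gives the multivariate hypergeometric distribution for the counts.
    urn = [colour for colour, c in enumerate(counts) for _ in range(c)]
    drawn = rng.sample(urn, n_draw)
    totals = [0.0] * len(counts)
    for colour in drawn:
        # Hybrid step: instead of contributing 1, each drawn ball carries
        # a value generated by some chosen law (exponential, as an example).
        totals[colour] += rng.expovariate(1.0)
    return totals

rng = random.Random(0)
sample = hybrid_urn_draw([50, 30, 10, 5], 40, rng)
```

Replacing `rng.expovariate(1.0)` with the constant 1 recovers the plain urn model.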
Examples of all noted simulation methods can be found in the "simulation" and "process" files (see Supplementary Materials).
The range of laws that, independently and/or when inserted into the hypergeometric distribution, give results predictable by our method is delineated in Section 3.3. Methods based on logarithmic transformations of variables do not pass the test when distributions of this range are used. For example, one way of estimating proportionality, ρ_p (the "proportionality correlation coefficient"), which, according to the authors of this method [9,10], should be considered a suitable alternative to standard correlation coefficients, gives approximately the same erroneous estimates when comparing logarithms of plain relative values, and estimates with enhanced errors when taking logarithms of relative values (absolute values can also be taken here) divided by their geometric mean (in terms of log-ratio analysis, this transformation is referred to as clr, the centered log ratio). If the composition consists of variables with the same parameters, this normalized version behaves about the same as the standard coefficients. However, if the composition contains variables with large and small mathematical expectations, its behavior is inverted with respect to that of the standard coefficients: instead of small negative shifts of zero values from the center of variation, large negative shifts occur, and instead of large negative shifts, small negative or positive shifts occur. This is a serious drawback of all methods that use the clr transformation, independent of how the zero-value problem is solved in them.
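For reference, the clr transformation discussed above can be sketched as follows (a minimal version that assumes strictly positive parts):

```python
import math

def clr(parts):
    # Centered log ratio: the log of each part divided by the geometric
    # mean of all parts; the transformed values always sum to zero.
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

transformed = clr([0.5, 0.3, 0.15, 0.05])
# The components sum to zero up to floating-point rounding.
```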
Especially surprising is the SparCC method [7], often used in microbiological research and also included as part of other algorithms [18]. This method periodically goes beyond ±1 or does not work at all, generating negative variance (again, the larger the difference between the sizes of the variables, the more often this happens, although the clr transformation is not used here). Among all methods, the "based on symmetric balances" method [15] can be distinguished: it does eliminate the shift and distributes the transformed correlation coefficients on the Student's scale, but it does so, again, only when the composition consists of equal variables and there is no loss of degrees of freedom (i.e., all values of ∑_{i=1}^k x_i are the same; see below for details). As the differences in the variables increase, so do the differences in the results, quickly reaching catastrophic sizes. Let us show this with two examples where the raw data were obtained by summing 50, 30, 10, and 5 standard continuous distributions in the first case and 500, 300, 100, and 5 in the second case. This gives the variable proportions shown in bold in Table 1. In both cases, 11,110 combinations were obtained, after each of which the standard Pearson's coefficients (hereafter r), our transformed coefficients (see Section 3.2), and, as prescribed by the "based on symmetric balances" method, Pearson coefficients calculated from the variables transformed by that method were simultaneously recorded from the normalized data. To demonstrate the differences, it seems sufficient to show the magnitudes of the biases at the median points of the distributions.

Table 1. Comparison of the results of transformations by Equations (4) and (7) with the results obtained by the "based on symmetric balances" method (n = 30) (variable proportions are shown in bold).

The variance of a series of sample correlation coefficients is expected to be close to (1 − ρ²)²/(n − 1), where ρ is the population coefficient. Since the symmetric balances method is presented in completed form, we should expect that all variance values will differ insignificantly from the 1/(n − 1) value expected at ρ = 0 and that values close to zero will be found at the median points. However, in the right matrices, in the lower diagonals, we observe values extremely far from zero. If we use them in the above equation, we obtain values close to the empirical variance but far from the value 1/29. The same will happen if we use values from the left matrices, but here we can use both the empirical averages (upper) and the theoretical shift values (lower). After the transformation proposed by our method, all the initially noticeably different variances become little different from the value 1/29, and the values at the median points can be seen in the upper diagonals of the right matrices.
Compared to methods requiring logarithmic transformations, the method proposed in [19], based on permutations and renormalization, gives much more accurate estimates, mostly close to the estimates by our method. For example, when n = 10 and seven variables of different sizes are subjected to permutations (i.e., k = 7), in spite of rather strong mismatches between arithmetic means and expected values of null correlation, at significance levels of 0.1 and above the differences from the estimates by our method turn out to be rather small but on average slightly overestimated. This overestimation is due to the fact that the permutations partially lose information about the loss of degrees of freedom, the number of which here at the zero level is 6.32. The accuracy of the estimation under permutations also depends on the number k: the larger it is, the more accurate the estimates that should be expected. The shortcomings noted are not so great that one should not trust this method. In certain situations, for example, when there is a large number of zero values, it has, in our opinion, clear advantages over our method, provided, of course, that the losses of degrees of freedom are not very large.
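One plausible reading of the permutation-and-renormalization idea can be sketched as follows (our own sketch, not the exact algorithm of [19]): permute one raw variable across samples, renormalize each sample, and recompute the correlation to build a null distribution.

```python
import random
from statistics import mean

def pearson(a, b):
    # Plain Pearson correlation for two equal-length sequences.
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def permutation_null(raw, i, j, n_perm, rng):
    # raw: list of samples, each a list of k positive raw values.
    # Returns the null distribution of the correlation between
    # proportions i and j after permuting variable i and renormalizing.
    null = []
    col_i = [row[i] for row in raw]
    for _ in range(n_perm):
        shuffled = rng.sample(col_i, len(col_i))  # permute variable i
        rows = [row[:] for row in raw]
        for row, v in zip(rows, shuffled):
            row[i] = v
        props = [[v / sum(row) for v in row] for row in rows]  # renormalize
        null.append(pearson([p[i] for p in props], [p[j] for p in props]))
    return null

rng = random.Random(1)
raw = [[rng.randint(1, 50) for _ in range(4)] for _ in range(10)]
null = permutation_null(raw, 0, 1, 200, rng)
```

An empirical p-value then follows from the rank of the observed coefficient within `null`.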

Results
The only end result of our research is the method created in the course of it. The purpose of this article is to substantiate all the components of the method and to demonstrate the results of its use. The realization of this purpose can be considered a chain of intermediate results on the way to the final one.

Loss of Degrees of Freedom and Correcting for the Value of r 0
As the empirical averages obtained from data simulations show, Equation (1) gives a satisfactory approximation, the accuracy of which, however, directly depends on the sample size and on the structure of the summing variable ∑_{i=1}^k x_i, expressed in the size of the distances between its values. Both of these factors are related to the problem of establishing the number of degrees of freedom (hereafter df), which in compositional data, as follows from our numerous observations and experiments, is not necessarily a discrete quantity of the natural order.
Let us try to show why, when we are dealing with a composition, the value of df is not equal to n − 2, as in the usual estimation of Pearson's linear correlation, but less than or equal to n − 2. Let v_j be the values of the summing variable ∑_{i=1}^k x_i ranked in ascending order, with ∑_{j=1}^n v_j = N. Then, in the multivariate hypergeometric distribution, the expectation and variance for the absolute values of x_i can be expressed as

E(x_i)_j = v_j S_i/N,  var(x_i)_j = v_j (S_i/N)(1 − S_i/N)(N − v_j)/(N − 1),

where S_i = ∑_{l=1}^n x_{il} denotes the total of the i-th variable. It can be seen that for each non-matching value of v_j there will be different values of E(x_i)_j and var(x_i)_j. Turning to relative values, the expectation turns into a constant equal to S_i/N. The variance here, being in the interval from zero to one, reverses its direction of growth:

var(y_i)_j = (S_i/N)(1 − S_i/N)(N − v_j)/(v_j(N − 1)).

With the multinomial distribution, the variance is defined more simply: var(y_i)_j = v_j^{−1}(S_i/N)(1 − S_i/N). The difference in variance between the j-th and k-th members of the sequence is expressed by the ratio v_k(N − v_j)/(v_j(N − v_k)) in without-replacement samples and v_k/v_j in with-replacement samples. Obviously, as j increases, the range of variation in the fractions inevitably decreases, which implies that each successive cell of the summing variable makes a smaller average contribution to the calculation of the correlation coefficient than the previous one. However, a decreasing contribution entails an increase, not a decrease, in the value of the correlation coefficient, and this corresponds not to a gain but to a loss of degrees of freedom. As simulations show, any increase in the distance between neighboring values of the summing variable leads to an increase in the variance of the empirical distribution of the correlation coefficients calculated between the variables of the normalized data. We speak precisely about the distances between values and not about the variance of the summing variable as a whole, since the latter, because of outliers, is not always commensurate with the loss of df. Therefore, when estimating the number of df, it is methodologically more correct to take into account not the final variance but its stepwise changes in the ranked sequence. This principle is implemented in the construction for df_0 proposed as Equation (2), which is derived mainly empirically. In this construction, the current coefficient of variation is calculated stepwise over the parallel sequence v_1/v_j 1/√(2π). The zero in the index of df_0 is due to the fact that, as the correlation coefficient values move away from zero, the loss decreases somewhat, as discussed in more detail in Section 3.2. For this reason, a function for the change in df, Equation (3), is introduced, which is necessary for a correct p-value determination. This adjustment is necessary because Equation (1) does not take into account one long-known problem: the regular mismatch of sample correlation coefficients with coefficients established for general populations. To correct this discrepancy, R. Fisher [20] derived the formula ρ = r(1 + (1 − r²)/(2n)), which was later transformed in [21]. Obviously, in our case, the coefficient for the general population is r_0. Equation (4) uses a correction with the same organization but with a special modification of the denominator. Hereinafter, µ will denote what we call the actual shift rather than the null correlation. The modification of the Fisher correction proposed in Equation (4) minimizes the current deviations when the degrees of freedom change with noticeably higher accuracy than Equation (1). This modification, strangely enough, works effectively outside compositional data as well. For example, [22] investigates distributions in long series, each containing one million correlation coefficients computed on identically distributed variables with sample sizes equal to 3, 4, 5, 7, 10, and 100. Population correlation coefficients equal to 0.5 and 0.9 are considered. The best fits are obtained using our correction, except for the cases with n = 3, where the fit is closer with the correction from [20] while the correction from [21] does not work at all, and n = 100, where the correction from [21] wins by 0.00003.
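The variance behavior described above for the with- and without-replacement cases can be checked numerically; the ratio between the j-th and k-th members reduces to v_k(N − v_j)/(v_j(N − v_k)) without replacement and to v_k/v_j with replacement (a sketch under our own variable names):

```python
def var_y_hypergeometric(S, N, v):
    # Variance of the proportion y = x/v when x counts "successes" in a
    # draw of v from N without replacement, with S successes in total.
    p = S / N
    return p * (1 - p) * (N - v) / (v * (N - 1))

def var_y_multinomial(S, N, v):
    # Variance of the same proportion under sampling with replacement.
    p = S / N
    return p * (1 - p) / v

S, N, vj, vk = 200, 1000, 40, 90
ratio_without = var_y_hypergeometric(S, N, vj) / var_y_hypergeometric(S, N, vk)
ratio_with = var_y_multinomial(S, N, vj) / var_y_multinomial(S, N, vk)
# ratio_without equals vk*(N - vj)/(vj*(N - vk)); ratio_with equals vk/vj.
```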

Transformation of Correlation Coefficients
Let us agree that r without an index will further denote not the Pearson correlation coefficient in general but the one calculated from the normalized data of the composition. In [3], the estimation of the difference between r and r_0 is performed with Fisher's z transformation: the value (z[r] − z[r_0])√(n − 3) gives the value of the t criterion by which the p-value of the difference is estimated, and the inverse transformation can be used to calculate the correlation coefficient corresponding to this criterion. However, with expected values above |0.5|, such a procedure begins to noticeably deviate its results from the expected curve in the direction of decrease, the deviation reaching approximately −0.08 at values from |0.92| to |0.95|. Although this phenomenon is relevant only when losses of df are absent or insignificant (ignoring such losses can easily lead to overcoming these deficits), this overestimated "bend" of the distribution curve still seems to the authors much more difficult to correct than the problems generated by the simplest of possible constructions, Equation (5). In certain cases, as will be shown below, the results of transformations by means of this construction are distributed quite symmetrically. However, in most cases, without correcting for null correlation and taking into account the value of degrees of freedom, it yields significant deviations. Equation (6) eliminates these disadvantages.
In addition to the corrected value of the null correlation, two other similarly organized corrections are added here. However, this construction does not complete the transformation. It would stop at a level of degrees-of-freedom values unfamiliar to the average user, giving higher values of the strength of the relationship than they would be at df = n − 2. We therefore add another procedure that lowers the values of the correlation coefficients, making them consistent with the value df = n − 2.
First, the empirical correlation coefficient transformed by Equation (6) is estimated on a scale corresponding to the value of df calculated by Equation (3). Then, through the obtained p-value, the value of the t criterion for df = n − 2 is found, and the calculation by Equation (7) completes the conversion procedure. The p-values corresponding to fractional degrees of freedom are determined by the formula p = p_int − (df − int)(p_int − p_{int+1}), where int is the integer part of the df value. For example, for ρ_df = 0.5491 with df = 3.7890, we obtain p = p_3 − (3.7890 − 3)(p_3 − p_4) = 0.3378 − 0.7890(0.3378 − 0.2592) = 0.2758. As can be seen, this is a crude linear approximation. However, it has not yet been observed to give appreciable deviations even at small n.
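The linear interpolation for fractional df can be written out directly; `p_of_int_df` here stands for any routine returning the p-value at an integer df (a sketch reproducing the worked example above):

```python
def p_fractional(df, p_of_int_df):
    # p = p_int - (df - int)(p_int - p_{int+1}), int = integer part of df.
    lo = int(df)
    p_lo, p_hi = p_of_int_df(lo), p_of_int_df(lo + 1)
    return p_lo - (df - lo) * (p_lo - p_hi)

# Worked example from the text: p_3 = 0.3378, p_4 = 0.2592, df = 3.7890.
table = {3: 0.3378, 4: 0.2592}
print(round(p_fractional(3.7890, table.get), 4))  # 0.2758
```

In practice `p_of_int_df` would wrap a Student's t p-value routine for the observed t statistic.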
By means of a number of examples given in Table 2, we will try to demonstrate what exactly forces the use of the transformations characterized above. The values of correlation coefficients obtained from simulations are transformed and directly compared to the values expected for the corresponding sample size. For the sake of brevity, here and below, in cases where no differences significant for the purposes of demonstration are observed, we summarize and average all values. Averaging is performed across series, each consisting of 11,110 correlation coefficients. Table 2 shows the sizes of the deviations from the expected values at the points corresponding to the empirical p-values shown at the top of the table. They are given rounded, as they are actually the fractions 28/5555 = 0.00504, 139/5555 = 0.02502, etc. The zero point (p = 1) is the average of the values in positions 5555 and 5556.
The magnitudes of deviations could be determined by subtracting the absolute expected values from the absolute empirical values. However, in such a case, we would obtain a distorted view of the direction and magnitude of deviations in the vicinity of zero. Therefore, in the negative tail, the transformed empirical value is subtracted from the expected value, and starting from the point p = 1, vice versa. This keeps the correct sign of the deviations and the correct distance between the empirical and expected values throughout the series. HG+ in this table and beyond means that a hybrid model is used, where HG is the underlying hypergeometric distribution, and the plus sign is followed by the designation of the "quantum" of the distribution nested within it instead of unity. In all subsequent tables, unless there are special notes, the values averaged over 21 series from the matrix at k = 7 are given by default.
The first line of Paragraph 1 shows the deviations for the values obtained at df_0 = 3.96, and the second those for the values reduced using Equation (7) and corresponding to df = n − 2. Only at large losses of degrees of freedom, as in this case, do they reveal differences capable of manifesting themselves in thousandths. In the following, in all tables, when Equation (7) is implied, the deviations at df = n − 2 are shown, and the indicated value of df_0 simply represents one of the initial parameters involved in the calculation.
As can be seen from the deviations at the median point reflected in the p = 1 column, the values obtained from Equation (5) deviate from the median as n and df decrease, which is further markedly exacerbated by a large value of r_0. This is remedied to a good extent if we replace r_0 by µ(1 + (1 − r²)/df). However, if we evaluate the correlations on a standard scale with df = n − 2, this will give only some semblance of symmetry but, in cases of appreciable losses of df, will by no means eliminate the large deviations, which disappear only in the nearest neighborhood of zero. If, as in the cases given in Paragraph 1, the values corresponding not to 8 and 16 but to 3.96 and 4.81 degrees of freedom are used in the whole series, this leads to a different kind of distortion: the deviations from the expected values gradually begin to grow not in plus but in minus. This trend is clearly visible when the loss of df is very large, as in the present cases. Since nothing of the sort is observed in cases where the normalizing denominator is a constant and, hence, df = n − 2, we can conclude that the loss of df gradually decreases as the value of the correlation coefficient increases. For this reason, we had to introduce the df change function reflected in Equation (3), without which there would be no relatively correct determination of the p-value.
From the data in Paragraphs 8-11, we can conclude that when sample sizes are not very small and when there is no loss, or only a small loss, of degrees of freedom, Equation (5) gives quite satisfactory estimates. The third line of Paragraph 7 shows why we have refused to use the z transform.
In general, the errors of the method, manifested mainly in the estimation of the df losses, can be estimated from the data in Table 3. Everywhere, the data obtained from the basic hypergeometric model (the distributed object has a value of 1) are presented.
On average, more accurate estimates are obtained for growth forms close to power-law distribution curves. In such forms, there are usually no high df losses. At uniform growth, of course, the value of the first member of the sequence matters. In the given example, it is low, due to which deviations with a negative sign are observed everywhere. In general, we can certainly speak of noticeable errors of the method in estimating the distances between the values of the summing variable, especially in the initial part of its ranked row. One characteristic manifestation of these errors can be seen by comparing the left tail in Row 10 with any other example. This is not a statistical error but a regular skew for these growth forms. The cases in which this disadvantage is most pronounced are listed separately at the bottom of the table. A clear overestimation of losses, leading to an underestimation of correlation coefficients, occurs when the summing variable is headed by a value (or a very small number of them relative to the total length of the series) that is much inferior to the subsequent values. In the example given in Row 12, the summing variable starts with the values 20 and 35, followed by a series of 26 values derived from the formula v_j = 60 + 5(j − 1). In the example given in Row 13, the summing variable starts with the value 10, followed by a series starting with the value 32, with relatively small distances between the values. In Row 14, the values 10 and 30 precede 60 values equal to 60.
Thus, the method's error in estimating the df losses confidently fits within the range of ±0.015, but with a fundamental caveat: there should be no outliers at the head of the summing variable. Otherwise, we should expect an underestimation of correlations by more than −0.015. As for outliers at the end of the summing variable, no matter how large they are, they will not remove more than their own single degree of freedom.

Neutral Vectors in Compositional Data
In [23], concepts of independence for proportions were developed, and in [24], it was shown that the Dirichlet distribution possesses the property of complete neutrality. This gives reason to expect that a similar property will also be found among the proportion vectors generated by other distributions from the exponential family. In data simulation, this expectation is confirmed, but with a very serious caveat: if, as an indicator of independence, we take the Pearson correlation coefficient, whose distribution in the case of independence should by definition be centered on zero, symmetric, and obey the Student distribution, then many distributions, including the Dirichlet distribution, show independence of proportions only after passing a certain threshold in parameter values.
The full independence test requires testing all relations between the subtracted proportions and the proportions transformed due to the subtraction, regardless of their numbering order. For the purposes of this paper, the full test would be clearly redundant. Let us demonstrate this test in its simplest variant, when the initial data are set in such a way that all variables are independent and identically distributed, i.e., according to one law and with the same mathematical expectations and variances. This allows us to include distributions that do not belong to the exponential family in the analysis. Let us consider the case of sequential subtraction of vector elements when comparing them with the elements of the remaining subcomposition. That is, we first consider the correlations of y_1 with y_2/(1 − y_1), ..., y_7/(1 − y_1), then y_2/(1 − y_1) with y_3/(1 − y_1 − y_2), ..., y_7/(1 − y_1 − y_2), and at the end of the sequence, we arrive at the degenerate case with the impossibility of calculating the correlation between y_6/(1 − y_1 − y_2 − y_3 − y_4 − y_5) and y_7/(1 − y_1 − y_2 − y_3 − y_4 − y_5 − y_6), since y_7/(1 − y_1 − y_2 − y_3 − y_4 − y_5 − y_6) gives a value of 1 over the whole sample.
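The sequence of comparisons in this simplified test can be written out programmatically (our own sketch; `y` is one sample's vector of proportions, and the function merely enumerates the renormalized pairs to be correlated across samples):

```python
def neutrality_pairs(y):
    # At step m, the element y_m renormalized by what was removed so far
    # is paired with every later element renormalized after removing
    # y_1 ... y_m as well.  Returns a (left, rights) pair per step.
    steps = []
    removed = 0.0
    k = len(y)
    for m in range(k - 1):
        left = y[m] / (1.0 - removed)
        removed += y[m]
        rights = [y[j] / (1.0 - removed) for j in range(m + 1, k)]
        steps.append((left, rights))
    return steps

y = [1 / 7] * 7
steps = neutrality_pairs(y)
# The final step is degenerate: the last renormalized part always equals 1.
print(round(steps[-1][1][0], 6))  # 1.0
```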
The penultimate step in this sequence generates two fully dependent relationships, whose coefficients are always equal in strength and opposite in sign, summing to zero, so we do not take them into account either. As a result, the lines highlighted in color in Table 4 show the average deviations from the values expected under the Student distribution, calculated not over 21 but over 18 coefficients. Below them are the deviations obtained directly from the normalized compositional data, not from the raw Pearson coefficients but from their transformations by means of Equation (7); there, of course, the averaging is performed over the 21 transformed coefficients. A quick look at Table 4 is enough to see that the neutrality of proportion vectors does not hold for all parameter values. Small parameters consistently show asymmetry of the same orientation, where the negative tail on average deviates further into the minus and the positive tail further into the plus. This is essentially the same as what is observed with the ordinary Pearson correlation coefficient in the presence of outliers.
The coefficients of the independence test always remain shifted toward the plus relative to the transformed coefficients, including when the median reaches zero and the tails become symmetric. Formally, this could be interpreted as an additional decrease in the number of degrees of freedom beyond what is already indicated in the table. Extraction of proportions, as shown in [25], is equivalent to the partial correlation procedure, which could partly explain this additional loss if it became clear how it is shifted from the vector y_i to the sample size vector. In addition, this procedure is performed four times in our case. A distance of 4 degrees of freedom could be seen only in the case reflected in Paragraph 1, and only in the negative tail.
In general, however, the observed distances are quite different, and the reason seems to lie elsewhere. If one looks at the binomial distribution, one immediately notices the difference between the deviations at a success probability of 0.5, which gives a symmetric distribution of the initial data, and the deviations at probabilities 1/7 and 0.05, at which the data are distributed asymmetrically. For the same mathematical expectations, the probability of 0.5 gives much higher plus deviations. The results of the independence tests are shown in averaged form, but in cases where there is no common symmetry, it should be noted that with each truncation step of the original composition the symmetry of the coefficient distributions slightly increases. Why this is associated with an increase in the variance of the distribution is a question requiring separate consideration. For now, it is enough to state that as the number-of-trials parameter increases, all deviations vanish at any probability of success.
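The negative baseline r0 = 1/(1 − k) from the Introduction, and the way it dominates the mean of closed-data correlations, can be checked with a quick simulation (a minimal sketch under assumed parameters, not the paper's exact table setup):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def corr_summary(n_trials, p, k=7, n=20, reps=500):
    """Mean and skewness of Pearson r between the first two closed
    parts of k independent Binomial(n_trials, p) variables."""
    rs = np.empty(reps)
    for t in range(reps):
        x = rng.binomial(n_trials, p, size=(n, k)).astype(float) + 1e-9
        y = x / x.sum(axis=1, keepdims=True)   # closure to sum 1
        rs[t], _ = stats.pearsonr(y[:, 0], y[:, 1])
    return rs.mean(), stats.skew(rs)

mean_r, skew_r = corr_summary(30, 0.5)
# mean_r sits near the shifted null 1/(1 - k) = -1/6 rather than at zero
```

The tiny additive constant only guards against an all-zero row; varying `n_trials` and `p` lets one compare the symmetric case p = 0.5 against asymmetric small probabilities, as discussed above.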
It may look a bit strange, but the Dirichlet distribution, continuous in the raw data, instead of the expected advantage over the discrete binomial and hypergeometric distributions, clearly loses to them in the rate at which the null distribution is reached as the mathematical expectation grows. In Paragraphs 7 and 11, the expectations are 4.29 and 5, while the Dirichlet, even with an expectation of 20, does not achieve the symmetry reflected in those paragraphs. In this comparison, we are not trying to set any threshold values. Statistical error may well hide a real skewness if it is low. In the general case, however, as experiments show (see Section 3.5), a value of 5 for the lowest mathematical expectation is not quite sufficient, but perhaps acceptable if one considers it admissible to add to the method's errors an additional skewness of approximately 0.002, with a minus in the negative tail and a plus in the positive tail.
As can be seen in Paragraphs 16-18, the continuous uniform distribution produces relatively small deviations and quickly eliminates them under summation. Paragraphs 19-21, reflecting the discrete uniform distribution, show that even without summation the deviations can be reduced simply by widening the range of variation. Obviously, variables of different sizes in such distributions can only be created by summation; otherwise they will not pass the independence test. The same is true for the normal distribution, an example of which is given in Paragraph 23. If it is to be used in simulation, one must of course choose parameters at which the probability of negative values is negligible. Otherwise the choice of parameters plays no role here: the distributions are always symmetric, up to the usual statistical errors. It can therefore be argued that under summation any distribution satisfying the conditions of the central limit theorem will tend toward neutrality, though, of course, at different rates.
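The effect of summation can be probed in outline (our own hedged sketch; the discrete uniform on [0, 3], k = 5, and the replication counts are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def r_skew(m, reps=400, n=20, k=5, lo=0, hi=3):
    """Skewness of r between two closed parts when every raw value is a
    sum of m Discrete-Uniform[lo, hi] draws (m = 1 means no summation)."""
    rs = np.empty(reps)
    for t in range(reps):
        x = rng.integers(lo, hi + 1, size=(m, n, k)).sum(axis=0).astype(float)
        x += 1e-9                      # guard against an all-zero row
        y = x / x.sum(axis=1, keepdims=True)
        rs[t], _ = stats.pearsonr(y[:, 0], y[:, 1])
    return stats.skew(rs)

# skew of the r distribution without summation vs. after summing 10 draws
s1, s10 = r_skew(1), r_skew(10)
```

Comparing `s1` and `s10` over many seeds illustrates the tendency toward neutrality under summation; a single run is, of course, subject to the same statistical errors discussed in the text.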
To simplify the testing procedure, Table 4 presents the relationships between variables of equal size. That this test yields similar results for very different ratios between variable sizes, as well as in hybrid models and under any losses of df, can be seen from the data in Table 5. As in Table 4, the results of the independence test are highlighted in color. As the data in Paragraph 1 show, the coefficients obtained in this test under a large loss of df correspond not to df = n − 2, where together with Equation (5) they show approximately equally anomalous deviations, but to the df calculated through Equation (3) and accounted for by Equation (7). Here, the tests are performed to the same truncated extent as in Table 4. We have failed to detect any dependence on variable size in these cases.
Thus, according to the results of the independence test, it can be confidently concluded that everything that becomes symmetric and zero-median in this test also becomes symmetric and zero-median, with relatively small deviations from the expected values, when our method is used. In short, if the working hypothesis assumes that the data are distributed according to the normal, binomial, negative binomial, or other laws from the exponential family, there is every reason to use the method proposed here. If the working hypothesis also admits a hybrid distribution, the range of applicability of the method can be somewhat extended.

Hybrid Models
In practice, outside of simulation, the analyzed systems are already formed systems that have passed a certain historical or evolutionary path. If equal potential is attributed to all units under study, which is a natural condition of the null hypothesis, then the total space of the analyzed system of interactions will, at the first stage of its formation, be filled according to the discrete uniform distribution. Partitioning the total array of units into separate variables and dividing the total space into separate areas will lead to variation in accordance with the hypergeometric distribution. If the equality of possibilities is preserved, which, again, should be considered a natural condition of the null hypothesis, then the evolution of the system should proceed with equal parameters of reproduction and multiplication of the initial units for all variables. Thus, when modeling the evolution stage, the initial unit can be specified directly as a reproducing and multiplying unit. For this purpose, the parameters of an additional distribution are inserted into the hypergeometric distribution in place of the value 1.
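A minimal sketch of such a hybrid generator (our own illustration; the unit-reproduction law Binomial(10, 0.5), the pool of 5 × 50 units, and the ten samples of 25 are assumptions, not values from the paper's tables):

```python
import numpy as np

rng = np.random.default_rng(3)

def hybrid_sample(units_per_var, sample_sizes, insert=None):
    """Hypergeometric base: units of k kinds are dealt without replacement
    into samples; each drawn unit then contributes either 1 or a draw from
    `insert`, the 'reproduction' law replacing the value 1."""
    k = len(units_per_var)
    counts = np.zeros((len(sample_sizes), k))
    pool = np.array(units_per_var)
    for s, size in enumerate(sample_sizes):
        draw = rng.multivariate_hypergeometric(pool, size)
        pool = pool - draw              # sampling without replacement
        if insert is None:
            counts[s] = draw
        else:
            counts[s] = [insert(d).sum() if d > 0 else 0 for d in draw]
    return counts

# each unit 'reproduces' as Binomial(10, 0.5) before being counted:
binom = lambda d: rng.binomial(10, 0.5, size=d)
tbl = hybrid_sample([50] * 5, [25] * 10, insert=binom)
```

With `insert=None` the generator reduces to the plain hypergeometric base; closure of each row to sum 1 then yields the compositional data analyzed in the tables.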
In Table 6, the designation of the hypergeometric distribution is omitted because it is present everywhere. After the insertion parameters, the table gives the values of the lowest mathematical expectation calculated without taking the insertion values into account, i.e., solely from the underlying hypergeometric distribution. If all samples are equal, it coincides with the expectation of the entire smallest variable. If not, it is taken from the smallest sample, with a plus sign added. If these values are high enough, even the smallest variable will have a relatively close approximation of its data to the normal distribution.
The laws of the exponential family reflected in the first subsection of the table do not require particularly high values of the lowest expectation to maintain symmetry of the distribution. In the other cases considered here, the situation is somewhat more complicated. Exponential growth, whose various forms are presented in Section 2, is well known in biological systems. Symmetry and weak deviations are observed here only for small ranges of argument variation. In the examples with a relatively high lowest expectation of 25, when the argument is limited to 10, shifts of all sizes are distributed without much difference, but when the argument is limited to 20, a noticeable difference already appears, as reflected in Paragraphs 11 and 13. Naturally, when the ranges of argument variation are extended, symmetry will also be observed, but only with a corresponding increase in the amount of data. Expectedly, the cases where the argument results from summation are distributed more symmetrically, although this is not very pronounced here. Besides exponential growth forms, forms described by power functions are also revealed in the development of microbial communities [26]. Here, in contrast to exponential functions, the range of argument variation does not seem to need restrictions. It also appears theoretically possible to obtain symmetric distributions of correlation coefficients at any power values, but, of course, under the condition of appropriate compensation by growth of the number of units in the hypergeometric base.
It is unlikely that the distributional features peculiar to the exponential and power functions presented here will be encountered in the analysis of established systems, but it is quite realistic to expect them in the analysis of time series. The same can apparently be said about probabilistic power laws. Individuals of different species in microbial communities, i.e., the values of vector x_i, are usually distributed in accordance with them [27]. However, for a certain character of the samples (analysis by cities, regions, etc.), the values of the summing variable also follow them, which, of course, is also reflected in the variances of its components, though this is usually unrelated to the nature of the interactions between the components.
As can be seen from the data of Section 4, for the non-truncated forms of the power law (the values 2 and 100 in Rows 29 and 30 indicate the lower and upper thresholds; in the other rows there is no truncation), very large numbers of units in the hypergeometric base are required to achieve relative symmetry in the distribution of correlation coefficients. In the picture presented here, symmetry appears so far only at α = 3. Non-truncated distributions at α ≤ 1, which have no finite expectation and variance, do not fulfill the conditions of the central limit theorem; they do not tend toward the normal distribution but, on the contrary, move away from it as the amount of data increases.
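The failure of summation for α ≤ 1 is easy to see in a direct simulation (a hedged sketch; note that NumPy's `pareto` draws the Lomax form of the power law, whose mean is infinite for α ≤ 1, and the tail indices 5.0 and 1.0 are chosen here only for contrast):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def skew_of_sums(alpha, m, reps=500):
    """Sample skewness of sums of m Pareto(alpha) draws.  For alpha <= 1
    the mean is infinite, so summation cannot normalize the data."""
    sums = rng.pareto(alpha, size=(reps, m)).sum(axis=1)
    return stats.skew(sums)

light = skew_of_sums(5.0, 50)   # finite-variance tail: sums near normal
heavy = skew_of_sums(1.0, 50)   # infinite mean: skewness stays extreme
```

Increasing `m` drives `light` toward zero while leaving `heavy` large, which mirrors the statement that such distributions move away from, rather than toward, the normal as data accumulate.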
As can be seen, most of the distributions reflected in Table 6 are represented as having no loss of degrees of freedom. When the losses of df are fixed in the hypergeometric base, their arithmetic averages always clearly correspond to the given value when varied. When the samples in the hypergeometric base are of equal size, the situation is different. Clearly, values varying over a wide range with the same mathematical expectations can give the same values for all sums only with negligible probability. Losses occur all the time, and their magnitude depends directly on the variance of the material being distributed. The data shown in Table 4 also have the same mathematical expectations but are given with an estimate of the loss of df, without which there would be noticeable deviations in many cases. Here, in all cases, the losses are negligible, so we omitted the procedure of their estimation, which, unlike the cases with differences in individual sample sizes predetermined in the hypergeometric base, has to be performed separately. According to preliminary estimates, in a number of cases taking the df losses into account would not change even the thousandths, but on average a change of −0.001 at significant levels of p would still occur.

Excess of Zeros
At the values of the lowest mathematical expectation considered in the previous subsection, zeros occur infrequently or not at all. However, distributions asymmetric in nature are represented there which, if not sufficiently approximated to the normal distribution by summation, begin to generate a characteristic skew from minus to plus in the distribution of correlation coefficients. A similar phenomenon is observed under so-called sparsity [28], when an excess of zeros, regardless of which distribution law generates it, itself introduces asymmetry into the data distribution.
This problem is most often encountered by microbiologists, since large numbers of zero values constantly accompany studies of microbial communities. Using the data in Table 7, we will try to show how variables with a large number of zeros correlate with each other and with other variables.
All series considered in Table 7 are obtained at n = 20, df = 18. In each subsection, the variables have the same proportions; only the number of units falling into each sample changes. In Subsections 1 and 3, color highlights the deviations observed when the five small variables interact with each other, while the deviations when the two large variables interact with each other and separately with the small variables are shown without color. In Subsection 2, the deviations in interactions of the two small variables with all others are highlighted in color; the deviations in interactions between the large ones are given without color. The lowest expectation in Subsection 2 is given for the smallest of the large variables.
As can be seen from the first subsection of the table, the deviations in the relationships between small variables naturally decrease as we get rid of zeros, but they are still noticeable even when the lowest expectation equals five. At the same time, their correlations with large variables everywhere give low deviations without pronounced skews. It can be said that a large proportion size close to 0.5 successfully handles the asymmetry of distribution in the small variables. However, if the same quantities are redistributed so that only two small variables remain, as reflected in Subsection 2, the reduced large ones do not immediately acquire the ability to suppress the asymmetry of the small ones. The large shift at the most anomalous number of zeros (top line in Paragraph 1) shows a significant minus in the center and noticeable plus deviations in both tails. In a single series, this could be the result of statistical error. With strongly pronounced asymmetry, however, this phenomenon has a regular character. It can also be seen in Subsection 3 in the top lines of Paragraphs 11 and 13. In this subsection, other types of distributions are given for comparison. The ordinary hypergeometric shows no noticeable difference from the hybrid with the binomial, while in the hybrids with the power and exponential functions the small variables expectedly show a stronger skewness, affected by the cumulative effect of two asymmetries: one caused by the excess of zeros and the other by an insufficient number of summations of inserts with distributions that are asymmetric in nature.
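The skew induced by zero-rich small parts can be reproduced in outline (our own hedged sketch; the mixture of five rare binomial parts and two abundant ones at probability 0.45 is an assumed stand-in for the table's setup, not its exact parameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def small_pair_skew(p_small, reps=400, n=20, trials=30):
    """Skewness of r between two zero-rich closed parts: five parts with
    success probability p_small plus two abundant parts at 0.45."""
    rs = np.empty(reps)
    for t in range(reps):
        cols = [rng.binomial(trials, p_small, n) for _ in range(5)]
        cols += [rng.binomial(trials, 0.45, n) for _ in range(2)]
        x = np.column_stack(cols).astype(float) + 1e-9
        y = x / x.sum(axis=1, keepdims=True)
        rs[t], _ = stats.pearsonr(y[:, 0], y[:, 1])
    return stats.skew(rs)

# raising p_small removes zeros from the small parts
sparse, denser = small_pair_skew(0.02), small_pair_skew(0.15)
```

Comparing the two values over repeated runs illustrates how getting rid of zeros damps the asymmetry in the correlations between small variables.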

Rank Correlation Coefficients
The normalization of compositional data itself reduces the impact of outliers to a considerable extent. Nevertheless, from a theoretical point of view, it is useful to compare

Discussion
The method proposed in this article for transforming correlations between proportions yields results that occur with approximately the same probability as the corresponding Pearson correlation coefficients computed from compositionally unrelated, normally distributed variables. This can be confidently asserted not only when the experimental units in the hypergeometric model are represented by the value 1 itself. It is also true for a certain range of hybrid models in which a researcher-selected law of reproduction of the original unit is built into the hypergeometric distribution.
Certainly, the question of the limits of applicability of our method requires careful further investigation. First of all, its application should be based on an estimation of the raw data, which should show that all variables, regardless of their sizes, have sufficiently similar ratios between the variance and the mathematical expectation, where some constant changes only due to differences in the values of E(y_i)(1 − E(y_i)). However, experiments show that even when the variance in hybrid modeling is not very high, replication often produces very different "individual portraits" of the composition. These ratios can freely vary by a factor of several, especially when sample sizes are small, so estimating the applicability of the method by this criterion can be productive only when the number of real experiments is sufficiently large.
Of course, there are cases in which the above ratio is violated beyond much doubt. Let us model such a violation with three discrete uniform distributions (0, 2000), (10, 40), and (10, 40) at n = 16. This combination admittedly looks not quite realistic, but let that be a sacrifice in favor of expressiveness. In the center of the distribution of the obtained 11110 correlation coefficients, instead of zero we find values of −0.971 for the two relationships of the large variable with the small variables and 0.875 for the relationship between the small variables. With such an anomalous ratio of variances, the values of the summing variable vary within a very large range, so the average number of degrees of freedom here is not 14 but 7.14, for which Equation (6) gives values of −0.838 and 0.880 with p-values of 0.0016 and 0.0004, and Equation (7), returning the usual value of df = 14, reduces the coefficients to −0.721 and 0.778.
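This setup is easy to re-run (our own sketch of the stated simulation; with simulation noise the medians only approximate the quoted −0.971 and 0.875):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

n, reps = 16, 600
r_big_small, r_small_small = [], []
for _ in range(reps):
    x = np.column_stack([
        rng.integers(0, 2001, n),     # discrete uniform (0, 2000)
        rng.integers(10, 41, n),      # discrete uniform (10, 40)
        rng.integers(10, 41, n),
    ]).astype(float)
    y = x / x.sum(axis=1, keepdims=True)   # closure to proportions
    r_big_small.append(stats.pearsonr(y[:, 0], y[:, 1])[0])
    r_small_small.append(stats.pearsonr(y[:, 1], y[:, 2])[0])

big_small = float(np.median(r_big_small))      # strongly negative
small_small = float(np.median(r_small_small))  # strongly positive
```

The large-variance variable dominates the normalizing sum, dragging its correlations with the small parts far into the minus and pushing the small-small correlation far into the plus, exactly the anomaly the text describes.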
Let us assume that we obtained a similar picture when studying the territorial distribution of three closely related species exploiting the same resources. We would explain the high variance in the dispersal of individuals of the evolutionarily more successful species by the fact that they tend to go to the most resource-rich places, leaving the "outcasts", who have lost in intraspecific competition, in scarce territories. In the positive relationship between the two small species, we would see the usual consequence of natural selection, which ensures better preservation of those species (formerly subspecies that evolved from "outcasts") that have learned to cooperate, or at least to resolve conflicts peacefully.
Such an interpretation could quite possibly be given by some ethologist. In that case, his null hypothesis would by default have to assume that individuals of all three species, since they do not influence each other, are potentially equal in their ability to reproduce and in their choice of habitat. Consequently, the overall spatial distribution of individuals would be represented in the null hypothesis as uniform, and partitioning the territory into separate areas would inevitably lead to the hypergeometric distribution. Knowing that, on average, individuals of all three species produce, for example, five offspring each, the ethologist could replace the value 1 in the hypergeometric distribution with the varying result of a binomial distribution with parameters 10 and 0.5. This could well correspond to the situation in the early stages of colonizing the territory.
When modeling with the parameters of such a null hypothesis, we naturally obtain a zero correlation and a symmetric distribution of coefficients according to the Student t-distribution (see Paragraph 5 of Table 2 for one of the distributions with such parameters). As a result, the ethologist would have judged the results to be significant at a high level. At the same time, he would naturally consider that somewhere in the early stages of the formation of this system of interactions, all individuals were indeed potentially equal.

Since ∑_{i=1}^{k} y_i = 1, we have var(∑_{i=1}^{k} y_i) = ∑_{i=1}^{k} var(y_i) + 2 ∑_{i=1}^{k−1} ∑_{j>i}^{k} cov(y_i, y_j) = 0. It follows from this equation that not only ∑_{i=1}^{k} var(y_i) = −2 ∑_{i=1}^{k−1} ∑_{j>i}^{k} cov(y_i, y_j), but also, in every single case, var(y_i) = −∑_{j≠i} cov(y_i, y_j).

Table 1 .
Comparison of results obtained by using Equations (

Table 2 .
Dependence of the pattern of distribution on the shift magnitude and degrees of freedom.

Table 4 .
Independence test for identically distributed variables.

Table 5 .
Independence test for hybrid distributions, large losses of df, and shifts with different sizes.

Table 6 .
Deviations from expected values in hybrid models.

Table 7 .
Distribution asymmetry at excess of zeros.