Compositional Closure—Its Origin Lies Not in Mathematics but Rather in Nature Itself
Round 1
Reviewer 1 Report
This is a very insightful piece of work on the reasons and implications of closure in compositional data analysis (CoDa), which settles the minds of those who have been working with the CoDa log-ratio methodology for a while. Although some of the ideas are not entirely new, the presentation and combination of them is and deserves publication. The article is extremely well written, which adds to its intrinsic value.
My comments regard mostly the presentation of the materials, and not the substance, with which I agree, and should be taken only as suggestions, rather than required changes.
1) Section 2. Is this focus on sampling really needed? Most of the conclusions rely on sections 3 and 4, which focus on the natural system typology. References to sampling are few and far between (line 249 and some others I may have missed). Conversely, in the discussion a very important topic emerges after line 250, which impacts all that follows in the manuscript.
The measurement instrument may impose compositional closure: "Nonetheless, the geochemical process under study becomes displacive when and because the variables are presented relative to a fixed mass. This is not, of course, mathematics altering nature", possibly related to a different research objective: "but rather the researcher studying a different system" (maybe the researcher chooses closure after all because of a focus on relative rather than absolute information). Line 272 puts it very nicely: "our choice of measurement alters the statistical result!"
My suggestion would be to move these measurement and research objective issues further up and expand on them, and possibly replace or reduce section 2.
My suggestion would also be to distinguish reality and measurement, which are considered one and the same in line 279 "Such closure is not simply an artifact of the mathematics of ratios, but actually reflects a physical reality" and line 326 "compositional closure, when it does occur, is a consequence of, and inherent in, the natural system under investigation".
2) Indeed, when data are not compositionally closed, the analysis can be made with raw parts, commonly after a log-transform, say log(X). When data are trace or otherwise small amounts of elements ratioed on a very large reference part, say, water, it follows that a log-ratio:
log(X/Water) = log(X) - log(Water) is nearly linearly related to log(X), because the variance of log(Water) is close to zero, in such a way that log-ratio analysis and the analysis of logarithms of parts become equivalent. This may be a further argument for treating them as not compositionally closed.
So, trace element concentrations are indeed not compositionally closed, except if the measurement instrument or the research objective dictates so. When water gets (or is purposely gotten) out of the equation there is no way to avoid closure (this would be subcomposition-then-reclose practice in CoDa parlance, although the authors have avoided these terms and will probably want to continue doing so). This is another way of regarding paradox 6.5, not necessarily better than the authors' but possibly complementary with theirs.
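The near-equivalence described in point 2 can be illustrated numerically. The following is a minimal sketch, not taken from the manuscript: the data are simulated, and the names `x`, `water`, and the chosen variances are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical trace component with substantial relative variation.
x = rng.lognormal(mean=0.0, sigma=0.5, size=n)

# Hypothetical dominant reference part (water): very large and nearly
# constant, so the variance of log(water) is close to zero.
water = 1e6 * rng.lognormal(mean=0.0, sigma=0.001, size=n)

log_x = np.log(x)
log_ratio = np.log(x / water)  # equals log(x) - log(water)

var_log_water = np.var(np.log(water))
corr = np.corrcoef(log_x, log_ratio)[0, 1]

print(f"variance of log(water): {var_log_water:.2e}")
print(f"correlation between log(x) and log(x/water): {corr:.6f}")
```

Under these assumed variances, the log-ratio differs from log(X) by an almost constant offset, so analyses based on either quantity should lead to nearly the same conclusions. The sketch says nothing about cases where the reference part varies appreciably.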
3) As regards the conclusion, CoDa log-ratios indeed imply an information loss with data that are not compositionally closed (the researcher may not care if he or she is interested only in relative abundance). However, there is a way out which is to include log-ratios together with the total in the analysis. More precisely, taking the total as a weighted geometric mean of the D parts in a composition and a set of D-1 isometric log-ratios produces the same results as the D logarithms of the raw parts (Pawlowsky-Glahn, Egozcue, & Lovell, 2015). The advantage is that the hypotheses about size are contained in the model parameters related to the total, and the hypotheses about relative abundances are contained in the parameters related to the log-ratios. Conversely, the parameters related to the D log parts tend to confound size and relative importance. Since the authors prove they like to cite classical works, they may find Lewi (1976) to their taste, on the same topic.
Lewi, P. J. (1976). Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneimittel-Forschung, 26(7), 1295–1300.
Pawlowsky-Glahn, V., Egozcue, J.J., Lovell, D. (2015). Tools for compositional data with a total. Statistical Modelling, 15(2):175-90.
The authors may want to write a couple of sentences on the possibility of combining CoDa with the total when data are not compositionally closed.
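The equivalence mentioned in point 3 can be sketched as follows. This is an illustrative construction under stated assumptions, not the exact formulation of Pawlowsky-Glahn et al. (2015): the geometric mean is taken with equal weights, the orthonormal basis is a Helmert-type basis chosen for convenience, and `helmert_basis` and the simulated data are hypothetical.

```python
import numpy as np

def helmert_basis(D):
    """Return a D x (D-1) matrix whose columns are orthonormal and sum to zero."""
    V = np.zeros((D, D - 1))
    for j in range(1, D):
        V[:j, j - 1] = 1.0 / j        # first j parts share equal weight
        V[j, j - 1] = -1.0            # contrasted against part j
        V[:, j - 1] *= np.sqrt(j / (j + 1.0))  # normalize to unit length
    return V

rng = np.random.default_rng(1)
D, n = 4, 200
X = rng.lognormal(mean=2.0, sigma=0.7, size=(n, D))  # simulated raw parts, not closed
logX = np.log(X)

V = helmert_basis(D)
ilr = logX @ V                      # D-1 isometric log-ratio coordinates
t = logX.sum(axis=1) / np.sqrt(D)   # sqrt(D) * log(geometric mean), a log-scale total

# Stacking the basis with the normalized vector of ones gives an orthogonal matrix,
# so (ilr, t) is a rotation of the D log parts and carries the same information.
Q = np.column_stack([V, np.ones(D) / np.sqrt(D)])
Z = np.column_stack([ilr, t])

is_orthogonal = np.allclose(Q.T @ Q, np.eye(D))
recovered = np.allclose(Z @ Q.T, logX)

# The ilr coordinates are unchanged by closure; only t carries size information.
closed = X / X.sum(axis=1, keepdims=True)
ilr_scale_invariant = np.allclose(np.log(closed) @ V, ilr)

print(is_orthogonal, recovered, ilr_scale_invariant)
```

In this construction, hypotheses about size would attach to `t` and hypotheses about relative abundance to the columns of `ilr`, which is the separation the comment describes. A weighted geometric mean, as mentioned above, would require a correspondingly weighted basis and is not covered by this sketch.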
Some minor issues:
A) Line 44 "Unfortunately, the log-ratio transformations render results into a form that can be difficult to interpret." Log-ratio interpretation is fine for relative abundance. I would change it into something like "Unfortunately, the log-ratio transformations render results into a form that cannot be interpreted in terms of original abundance".
B) Line 163 "(large component as denominator, unrelated to numerator components)" should "or else changing volume" be added?
C) Reference 1. Should the journal name be in italics?
I very much enjoyed reading the paper and I wish the authors the best of luck with its publication.
Author Response
Reviewer 1 – Many thanks for the obvious time and effort that went into this thoughtful – and thought-provoking – review.
1) Section 2 focus on sampling.
Response: Section deleted. I had considered not including this material – it was a left-over from the thought process leading up to the main thrust of our work. And it does lead the reader astray from the main points.
Reality and measurement
Response: Wording changed in both places to emphasize this distinction.
2) Indeed, when data are not compositionally closed, the analysis can be made with raw parts, commonly after a log-transform, say log(X). When data are trace or otherwise small amounts of elements ratioed on a very large reference part, say, water, it follows that a log-ratio:
log(X/Water) = log(X) - log(Water) is nearly linearly related to log(X), because the variance of log(Water) is close to zero, in such a way that log-ratio analysis and the analysis of logarithms of parts become equivalent. This may be a further argument for treating them as not compositionally closed.
So, trace element concentrations are indeed not compositionally closed, except if the measurement instrument or the research objective dictates so. When water gets (or is purposely gotten) out of the equation there is no way to avoid closure (this would be subcomposition-then-reclose practice in CoDa parlance, although the authors have avoided these terms and will probably want to continue doing so). This is another way of regarding paradox 6.5, not necessarily better than the authors' but possibly complementary with theirs.
Response: Great feedback and discussion here. I (ME) have actually spent quite a bit of time working through whether or not water chemistry data and trace element data are compositional data or not (Engle and Rowan, 2013; Blondes et al., 2016). They are compositional, but as noted by the reviewer, it may not have a significant effect if they are in an accommodative system.
To address the reviewer's comments and emphasize their valuable points, in Section 6.5 we note that using CoDa on trace elements in seawater, as an example, produces little effect compared to conventional analysis using the raw concentration data, and that the difference is primarily in the log scaling (and cite a paper by Otero et al., which examines the phenomenon in more detail).
3) As regards the conclusion, CoDa log-ratios indeed imply an information loss with data that are not compositionally closed (the researcher may not care if he or she is interested only in relative abundance). However, there is a way out which is to include log-ratios together with the total in the analysis. More precisely, taking the total as a weighted geometric mean of the D parts in a composition and a set of D-1 isometric log-ratios produces the same results as the D logarithms of the raw parts (Pawlowsky-Glahn, Egozcue, & Lovell, 2015). The advantage is that the hypotheses about size are contained in the model parameters related to the total, and the hypotheses about relative abundances are contained in the parameters related to the log-ratios. Conversely, the parameters related to the D log parts tend to confound size and relative importance. Since the authors prove they like to cite classical works, they may find Lewi (1976) to their taste, on the same topic.
Lewi, P. J. (1976). Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneimittel-Forschung, 26(7), 1295–1300.
Pawlowsky-Glahn, V., Egozcue, J.J., Lovell, D. (2015). Tools for compositional data with a total. Statistical Modelling, 15(2):175-90.
The authors may want to write a couple of sentences on the possibility of combining CoDa with the total when data are not compositionally closed.
Response: This is a fantastic point that is really helpful. One of the examples that always bothered me is whether or not point count data (or election totals) are compositional data. As noted here, you can examine these types of data either way, and we have added this point to the manuscript as noted. Unfortunately, one of the key questions of the work by Pawlowsky-Glahn et al. (2015) is whether including a measure of the sum (either the geometric mean or the total) causes issues with subcompositional incoherence. Regardless, we discuss this paradox and this potential approach in more detail in Section 6.4.
Some minor issues:
- A) Line 44 "Unfortunately, the log-ratio transformations render results into a form that can be difficult to interpret." Log-ratio interpretation is fine for relative abundance. I would change it into something like "Unfortunately, the log-ratio transformations render results into a form that cannot be interpreted in terms of original abundance".
Response: Excellent point. Modified as suggested.
- B) Line 163 "(large component as denominator, unrelated to numerator components)" should "or else changing volume" be added?
Response: Good point. Modified as suggested.
- C) Reference 1. Should the journal name be in italics?
Response: Yes. Done.
Reviewer 2 Report
The manuscript “Compositional Closure – Its Origin Lies not in Mathematics but Rather in Nature Itself” presents a conceptual study where the authors examined different geologic systems and samples in order to determine in what situations compositional closure occurs physically and mathematically. They concluded that the displacive systems tend to exhibit a compositional closure, whereas the accommodative systems do not have a compositional closure.
The Review of the Manuscript:
- The authors clearly presented the criteria (types of components under study, whether the system is open or closed, etc.) used to decide whether the processes, events and likely behavior behind the sample data lead them to exhibit compositional closure. However, what is not clear is how the authors construct the link between the statistical analysis of the compositional data (and the assumptions) and the underlying processes that yield the data. It seems that, from a statistical perspective, researchers may find themselves at a dead end when they consider the problem both physically and mathematically. I think that the authors may consider clarifying this further.
- It seems that the authors only consider PCA as a factorization technique. However, there are other techniques, including minimum/maximum autocorrelation factors (MAF) and the projection pursuit multivariate transform (PPMT). The latter, for example, accounts for several complexities including nonlinearity, heteroscedasticity and constraints.
- I wonder if the types of systems and components will always be adequate to justify the decision of closure. A clear statement in relation to the comment could be added to the manuscript.
The manuscript can be accepted for publication in Minerals after the authors have responded to the comments given above.
Author Response
Reviewer 2 – We appreciate your review, and especially your concerns about the link between the physical and the mathematical domains, and have attempted to clarify that relationship.
The authors clearly presented the criteria (types of components under study, whether the system is open or closed, etc.) used to decide whether the processes, events and likely behavior behind the sample data lead them to exhibit compositional closure. However, what is not clear is how the authors construct the link between the statistical analysis of the compositional data (and the assumptions) and the underlying processes that yield the data. It seems that, from a statistical perspective, researchers may find themselves at a dead end when they consider the problem both physically and mathematically. I think that the authors may consider clarifying this further.
It seems that the authors only consider PCA as a factorization technique. However, there are other techniques, including minimum/maximum autocorrelation factors (MAF) and the projection pursuit multivariate transform (PPMT). The latter, for example, accounts for several complexities including nonlinearity, heteroscedasticity and constraints.
I wonder if the types of systems and components will always be adequate to justify the decision of closure. A clear statement in relation to the comment could be added to the manuscript.
The manuscript can be accepted for publication in Minerals after the authors have responded to the comments given above.
Response:
Compositional data, closure, and related topics affect all data analysis approaches, not just PCA/factorization. Issues related to compositional data and closure have been around for more than 100 years, and no single technique can overcome them, because they stem from the nature of the systems and the algebraic-geometric structure of compositional data. This is largely why approaches to “correct” related problems hinge almost entirely on developing mathematical approaches that address the fact that compositional data do not follow the conventional rules and measurements of standard Euclidean geometry in real space. Thus, techniques that can handle common data issues, such as nonlinearity and heteroscedasticity, but still rely fully on a traditional algebraic-geometric framework are not part of the discussion. To address this reviewer’s comment, we have updated the Introduction to provide some context on why robust or other analytical techniques are not considered a viable alternative for addressing the issue.