The Classical Model of Type-Token Systems Compared with Items from the Standardized Project Gutenberg Corpus
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors

The paper is overall rather interesting and contributes to the field of language statistics with relevant empirical results that advance our understanding of Zipf's and Herdan's laws and of the statistical structure of language in general. The paper also offers a few innovative theoretical insights, especially by highlighting the log-log convexity in a clear way and by clearly identifying the regimes in which the different laws are valid. I recommend publication after a few minor points have received further elucidation. Below is the list of remarks I would like to see the authors address:
- 41: From the introduction, it remained entirely unclear to me what the set goal of the paper was.
- 80-85: I am not sure I understand the reasoning going on here. Why would the law apply best in the noisy discretized part, while in the figure it seems to apply best for low-frequency values? A slight elaboration would be welcome.
- 96: I have never encountered this use of decade before and found it confusing. Could the simpler “order of magnitude” be used instead?
- 108-110: I did not understand this sentence nor how it relates to the previous point. A clarification would be welcome.
- 126: I believe the following reference could be added as well: Evert, S. (2004). A simple LNRE model for random character sequences. In Proceedings of JADT (Vol. 2004).
- 166-167: this precision is welcome, but at this point I still had no idea what the paper's intended contribution was. This should be made much clearer earlier.
- 169: are not truly constant (deletion of the a)?
- 185: the SPGC acronym is defined in the abstract but not in the body of the paper and I believe it should be.
- 257: that that?
- 269: the authors should also report the titles of the books, not only the ID: it is quite illuminating that the items that deviate the most from a power law are botanical treatises, where we expect a lot of redundant content when it comes to pure linguistic expression, while the top two books featured in Figure 9 are, respectively, a literature essay and a novel. This is certainly meaningful and should be highlighted, if not commented upon.
- 278: I am skeptical about the relevance of Figure 10; as pointed out by the authors, the log-log fit in the low significance range does not yield a truly meaningful value.
- 356: I wonder whether this cluster of figures could not be condensed in some way, by discarding the cumulative distribution curves for the beta values (again I am not certain of their relevance), and having a single figure with multiple subplots showing figure (a) for all 7 ranges and an eighth subplot showing the correlation coefficient.
- 382: Which are the texts found in that filament?
- 406: this paragraph seems to me quite muddy and confusing. I think the authors should stick to the 7 ranges of frequency for the fit of law (2) and to the overall methodology, especially since these new analyses are hard to apprehend. In the end, it seems to me that this ‘filament’ is mostly due to texts of a very specific nature (the CIA factbooks, which do not look like ordinary language outputs at all), and unless we can explain why the peculiarities of these documents produce the observed behavior (which I do not think the authors do), I do not believe it is worthy of further commentary or analysis. I would drop this and Figure 20 altogether and only mention above (l. 382) that the filament is due to the CIA factbooks, which cannot properly be considered as linguistic ‘texts’ anyway.
- 509-510: This part sets misleading expectations since it suggests that beta at high frequencies would play a role in regulating the vocabulary size, while it appears from the simulation that the correlation is rather marginal (R^2 = 0.0449). If this very weak correlation does indeed correspond to what the authors mean by saying we should expect “some correlation”, then it should be explicitly stated to show that their model confirms their hypothesis. One or two sentences to summarize the findings and to outline how they highlight/inform the reasoning on how the beta power law drives vocabulary growth are needed here.
- 521: a verb is likely missing
- 517-528: it was rather unclear to me which simulations have been performed exactly. The explication also seems to be a bit disordered. In particular, a first mention is made of randomly drawn beta values on line 518 and then again on line 523; equation (11) is introduced in between but holds ‘as is’ without any reference to simulated data. I would delete the first sentence and add the reference to Table 2 in the sentence of line 523. Then how do you simulate data based on the two randomly drawn beta parameters? Do you sample a distribution based on (10)? It seems to me that this part requires additional detail to be fully understandable and reproducible.
- 547: “seven logarithmatically frequency ranges”: there seems to be a missing word here.
- 574: “The for” > something should be added or removed here.
- 583: The “Finnish cluster” is certainly intriguing, but this is not the main lesson I would retain from this paper. What strikes me is that the deviations from the classical picture, and especially the low-KS points, are mostly due to texts falling within very specific genres and rather estranged from natural language in the way they unfold. This suggests that producing words is not enough to guarantee a well-established Zipf's law, and that something specific to natural language seems to be needed to get a well-behaved distribution. This is certainly intriguing and points to the processes that underlie the production of specialized texts vs. texts composed of more colloquial and standard language. (This is only a comment; no action is expected from the authors based on this remark.)
Author Response
Thank you very much for your feedback. I have attached a copy of the paper with alterations made in response to your comments highlighted in yellow.
My responses are as follows:
1. 41: From the introduction, it remained entirely unclear to me what the set goal of the paper was.
I have added a statement at line 36: “The goal of this paper is to…”
2. 80-85: I am not sure I understand the reasoning going on here. Why would the law apply best in the noisy discretized part, while in the figure it seems to apply best for low-frequency values? A slight elaboration would be welcome.
This statement was based on the “low frequency cut off” identified by other authors, rather than Figure 2. I have rewritten this section hopefully to make this clearer. Although Figure 2 does appear to show low-frequency log-log linearity, Figure 7(b) shows a gradual increase in the KS significance of beta as the frequency range increases. This is consistent with frequency moving increasingly beyond a “low frequency cut-off” for the log-log Zipf law.
3. 96: I have never encountered this use of decade before and found it confusing. Could the simpler “order of magnitude” be used instead?
I have changed “decades” to “orders of magnitude”
4. 108-110: I did not understand this sentence nor how it relates to the previous point. A clarification would be welcome.
The earlier part of this paragraph was about how, in English, vocabulary always increases as the document expands and never saturates. The final sentence was merely a comment that this is not true in ideographic languages, where a hard upper limit exists. I have changed “despite this” to “interestingly”.
5. 126: I believe the following reference could be added as well: Evert, S. (2004). A simple LNRE model for random character sequences. In Proceedings of JADT (Vol. 2004).
The new reference has been added.
6. 166-167: this precision is welcome, but at this point I still had no idea what the paper's intended contribution was. This should be made much clearer earlier.
I have tried to make the goal of the paper clearer in Section 1 (line 36).
7. 169: are not truly constant (deletion of the a)?
“a” removed.
8. 185: the SPGC acronym is defined in the abstract but not in the body of the paper and I believe it should be.
It was in fact defined in the Introduction, but the acronym was mistyped there. I have corrected this.
9. 257: that that?
“that” removed
10. 269: the authors should also report the titles of the books, not only the ID: it is quite illuminating that the items that deviate the most from a power law are botanical treatises, where we expect a lot of redundant content when it comes to pure linguistic expression, while the top two books featured in Figure 9 are, respectively, a literature essay and a novel. This is certainly meaningful and should be highlighted, if not commented upon.
I have added Table 3, showing the titles of the items highlighted in Figures 8 and 9.
11. 278: I am skeptical about the relevance of Figure 10; as pointed out by the authors, the log-log fit in the low significance range does not yield a truly meaningful value.
I have removed Figure 10.
12. 356: I wonder whether this cluster of figures could not be condensed in some way, by discarding the cumulative distribution curves for the beta values (again I am not certain of their relevance), and having a single figure with multiple subplots showing figure (a) for all 7 ranges and an eighth subplot showing the correlation coefficient.
I have made this change. Since point 3 in Section 6 relies on the cumulative distributions, I have removed that point as well.
13. 382: Which are the texts found in that filament?
This is mentioned at 353-5: “It includes many editions of the CIA World Factbook and other works of reference.”
14. 406: this paragraph seems to me quite muddy and confusing. I think the authors should stick to the 7 ranges of frequency for the fit of law (2) and to the overall methodology, especially since these new analyses are hard to apprehend. In the end, it seems to me that this ‘filament’ is mostly due to texts of a very specific nature (the CIA factbooks, which do not look like ordinary language outputs at all), and unless we can explain why the peculiarities of these documents produce the observed behavior (which I do not think the authors do), I do not believe it is worthy of further commentary or analysis. I would drop this and Figure 20 altogether and only mention above (l. 382) that the filament is due to the CIA factbooks, which cannot properly be considered as linguistic ‘texts’ anyway.
I agree that the “filament” data are of limited interest, and I have emphasized more clearly that they are identified largely for the purpose of eliminating them. However, I consider Figure 20 to be the gem of the entire paper, since it shows the very close agreement between the high-KS data and the classical equation (4), and I am surprised the reviewer thinks it should be dropped. That said, parts (a) and (b) add little, and I have removed them.
15. 509-510: This part sets misleading expectations since it suggests that beta at high frequencies would play a role in regulating the vocabulary size, while it appears from the simulation that the correlation is rather marginal (R^2 = 0.0449). If this very weak correlation does indeed correspond to what the authors mean by saying we should expect “some correlation”, then it should be explicitly stated to show that their model confirms their hypothesis. One or two sentences to summarize the findings and to outline how they highlight/inform the reasoning on how the beta power law drives vocabulary growth are needed here.
I’m not sure how this sets a misleading expectation: a fast-growing vocabulary would naturally correspond to a rapid rate at which new types appear, which must vary inversely with the rate at which existing types reappear. The point I was trying to make here is that we should not expect vocabulary growth to depend wholly on the low-frequency statistics, and this is later borne out by the simulation. For R^2 = 0.0449, R ≈ 0.21, and for a sample size N = 500 the two-tailed p-value is below 0.00001. We can safely say that although the correlation is weak, it is far from being only marginally significant.
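Concretely, the significance of a correlation of this size can be checked as follows (a minimal sketch using the figures quoted above; the scipy code is illustrative, not the paper's own):

```python
# Illustrative significance check for a weak Pearson correlation
# (a sketch based on the R and N quoted above, not the paper's code).
from math import sqrt
from scipy import stats

r, n = 0.212, 500                        # R = sqrt(0.0449), sample size N
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # t-statistic with n-2 degrees of freedom
p = 2 * stats.t.sf(abs(t), df=n - 2)     # two-tailed p-value

print(f"t = {t:.2f}, p = {p:.1e}")       # roughly t = 4.84, p ~ 1e-06
```

A correlation explaining under 5% of the variance can thus still be highly significant at this sample size, which is the point being made.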
16. 521: a verb is likely missing
Corrected – there should have been an “are”.
17. 517-528: it was rather unclear to me which simulations have been performed exactly. The explication also seems to be a bit disordered. In particular, a first mention is made of randomly drawn beta values on line 518 and then again on line 523; equation (11) is introduced in between but holds ‘as is’ without any reference to simulated data. I would delete the first sentence and add the reference to Table 2 in the sentence of line 523. Then how do you simulate data based on the two randomly drawn beta parameters? Do you sample a distribution based on (10)? It seems to me that this part requires additional detail to be fully understandable and reproducible.
I think there is some misunderstanding here: (10) is a distribution for rank, and sampling it directly would not yield the required relationship between beta and lambda. The expression (11) was obtained by substituting (10) into (9) and provides an implicit relationship between t (tokens) and v (types), to which (3) is optimised to obtain an approximate lambda for given values of alpha_lf, alpha_hf and rho. While rho (the transition rank between the two alpha values) was held constant at 1000 (the upper boundary of the “middle range”), alpha_lf and alpha_hf were computed using (4) from the corresponding beta_lf and beta_hf, which were in turn generated by independent Gaussian processes (subject to the constraint beta_hf < beta_lf, which enforces log-log convexity). Reference to Table 2 is necessary to justify this independence. This was never intended to be especially precise: its purpose is to show the “reasonableness” of lambda being strongly correlated with the low-frequency beta but only weakly correlated with the high-frequency beta, as was seen in Figure 14. I have rewritten the methodology as a series of numbered steps, together with the following paragraph, which hopefully improves the clarity.
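In outline, the pipeline can be sketched as follows. This is only an illustration, with stand-ins where this response does not reproduce the paper's equations: (3) is taken as Heaps' law v = K·t^lambda, (4) as the classical relation beta = 1 + 1/alpha, and a Poisson urn approximation replaces the implicit relation (11); the Gaussian parameters are placeholders for the Table 2 fits.

```python
# Minimal sketch of the simulation pipeline described above, with stand-ins
# where the paper's equations are not reproduced here: (3) is taken as
# Heaps' law v = K * t**lam, (4) as the classical beta = 1 + 1/alpha, and a
# Poisson urn approximation replaces the implicit t-v relation (11).
# The Gaussian parameters are illustrative placeholders for the Table 2 fits.
import numpy as np

rng = np.random.default_rng(0)
RHO = 1000          # transition rank, held constant (upper bound of the middle range)
N_RANKS = 500_000   # truncation of the rank distribution (illustrative)
T_GRID = np.logspace(3, 5, 10)  # token counts up to the 100,000-token SPGC length

def alpha_from_beta(beta):
    """Assumed form of equation (4): beta = 1 + 1/alpha."""
    return 1.0 / (beta - 1.0)

def expected_types(t_values, a_head, a_tail, rho=RHO):
    """Stand-in for the implicit t-v relation (11): expected number of distinct
    types among t tokens from a two-regime power-law rank distribution."""
    r = np.arange(1, N_RANKS + 1, dtype=float)
    p = np.where(r <= rho, r ** -a_head,
                 rho ** (a_tail - a_head) * r ** -a_tail)   # continuous at r = rho
    p /= p.sum()
    return np.array([np.sum(1.0 - np.exp(-t * p)) for t in t_values])  # Poisson approx.

lams, b_lfs, b_hfs = [], [], []
while len(lams) < 500:
    b_lf, b_hf = rng.normal(1.9, 0.1), rng.normal(1.6, 0.1)  # placeholder Gaussians
    if not b_hf < b_lf:              # constraint enforcing log-log convexity
        continue
    a_head = alpha_from_beta(b_hf)   # Zipf exponent for ranks <= RHO (frequent words)
    a_tail = alpha_from_beta(b_lf)   # Zipf exponent for ranks > RHO (rare words)
    v = expected_types(T_GRID, a_head, a_tail)
    lam = np.polyfit(np.log(T_GRID), np.log(v), 1)[0]        # fit Heaps' law (3)
    lams.append(lam); b_lfs.append(b_lf); b_hfs.append(b_hf)

print("corr(lambda, beta_lf) =", np.corrcoef(lams, b_lfs)[0, 1])
print("corr(lambda, beta_hf) =", np.corrcoef(lams, b_hfs)[0, 1])
```

Under these stand-ins the fitted lambda tracks beta_lf closely while beta_hf contributes only weakly, which is the qualitative behaviour the simulation is intended to demonstrate.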
18. 547: “seven logarithmatically frequency ranges”: there seems to be a missing word here.
Changed to “logarithmically spaced”
19. 574: “The for” > something should be added or removed here
“the” removed.
20. 583: The “Finish cluster” is certainly intriguing, but this is not the main teaching I would retain from this paper. What strikes me is that the deviations to the classical picture, and especially the low KS points, are mostly due to texts falling within very specific genres and rather estranged from natural language in the way they unfold. This outlines that producing words is not enough to guarantee a well-established Zipf’s law, and there is something specific to natural language that seems to be needed to get a well-behaved distribution. And this is certainly intriguing and points out the processes that underlie the production of specialized texts vs. texts comprised of a more colloquial and standard language. (This is only a comment; no action is expected from the authors based on this remark.)
Thank you very much for this comment – I will bear this in mind in future work.
Reviewer 2 Report
Comments and Suggestions for Authors

This paper investigates how well classical linguistic laws (primarily Zipf's laws and Heaps' law) hold up when applied to a large and standardized text dataset, namely the Standardized Project Gutenberg Corpus (SPGC). The authors show that these linguistic laws often work but have limitations depending on the language, text type, and frequency range being analyzed.
The main strengths of the paper can be summarized as follows. The authors perform their experiments using a large and standardized dataset. The numerous figures and charts effectively illustrate key points of the study. The topic of the study is interesting, and the paper doesn't just measure indices; it also examines their correlation, distribution, and deviations under various conditions.
Minor comments:
- The introduction section is too brief. Please summarize the main findings of the paper in the introduction.
- Maybe it would be better to organize a Related Work section containing a description of the laws used and their applications.
- At times, the paper is equation-heavy, which may overwhelm readers not deeply versed in statistical linguistics. Summarizing sections with intuitive interpretations or mini-conclusions would improve accessibility.
- Although Finnish and English are compared, other languages are underrepresented. Some more comments would be useful.
- The list of references mainly contains old papers. Can you pay more attention to current research?
Author Response
Thank you very much for your comments. I have attached a revised MS. Alterations made in response to your feedback are highlighted in green.
My responses are as follows:
1. The introduction section is too brief. Please summarize the main findings of the paper in the introduction.
I don’t believe that an introduction is supposed to summarize the findings of a paper; rather, it should set out the paper's intentions (a summary of findings belongs in the abstract). I have nevertheless expanded the introduction slightly to include a brief foreshadowing of the eventual findings.
2. Maybe it should be better to organize a Related work section containing the description of the laws used and their applications.
This would require such extensive reorganization of the paper that it could not be done in the time period set by the publishers. However, I have renamed Section 2 “Related Work” and divided it into two subsections focusing on the classical model itself and its problems.
3. At times, the paper is equation-heavy, which may overwhelm readers not deeply versed in statistical linguistics. Summarizing sections with intuitive interpretations or mini-conclusions would improve accessibility.
I am assuming the reviewer is referring to equations (4) to (8), which do appear in rapid succession without much intervening text. I have tried to remedy this with a few more words of explanation between equations.
4. Although Finnish and English are compared, other languages are underrepresented. Some more comments would be useful.
This is true of the work in its current state, although it is progressing. The vast bulk of the SPGC is English, and Finnish is highlighted because of its stark contrast with the rest of the dataset (including the languages other than English). Future work will focus on other languages, and I have added a statement to this effect.
5. The list of references mainly contain old paper. Can you pay more attention to current research?
Several cited papers are from the 2020s. However, I have added a few more recent references.
Reviewer 3 Report
Comments and Suggestions for Authors

This paper examines statistical patterns in language by analysing a large corpus of texts from the Standardized Project Gutenberg Corpus. The researchers focus on Zipf's first law, Zipf's second law, and Heaps' law. The study tests whether these classical equations accurately describe real language data by analysing over 8,400 texts of exactly 100,000 words each. The researchers find that these laws generally hold, although with variations across different languages and text types. Finnish texts, in particular, show distinctive statistical patterns compared to English texts.
The paper is a solid contribution to quantitative linguistics that confirms some theoretical relationships and identifies variations across languages and text types. The design of the research is very good. The approach of standardising text length, identifying and removing anomalies, and analysing multiple frequency ranges is well-conceived. The methodology is sound and explicit. The statistical methods are described in detail, including the rationale for the choices in data processing and analysis. The results are presented thoroughly with illustrative visualisations, although the number of analyses makes the paper a bit dense. The conclusions are generally well-supported by the data, although the implications of some findings (especially regarding Finnish texts) could be further explored. The paper is written in excellent academic English with accurate use of terminology.
Among the strong points of the paper, the following can be stressed: the comprehensive dataset (the analysis uses a large corpus of texts with a standardised length), the rigorous methodology (sophisticated statistical techniques are used, including Maximum Likelihood estimation and Kolmogorov-Smirnov tests), and original findings (the identification of language-specific patterns, especially for Finnish texts, and the detailed analysis of how the Zipf indices vary with frequency). Despite the overall contribution of this work, some aspects remain weak points, which could be improved. First of all, the corpus is heavily skewed toward English texts (7,932 English vs. 165 Finnish items). Secondly, some classification decisions rely on arbitrary thresholds (like the KS-significance of 0.03 to separate high and low groups). This calls for an explanation. In the third place, some parts of the presentation that contain very technical descriptions could be made more accessible for readers without a strong background in statistical linguistics. Fourthly, there could be more analysis of why certain texts (like dictionaries) behave differently. While the paper could be published as is, it would significantly improve if these questions were addressed by the authors.
Author Response
Thank you very much for your comments. I have attached a revised MS, with alterations made in response to your feedback highlighted in blue.
Response to comments:
First of all, the corpus is heavily skewed toward English texts (7,932 English vs. 165 Finnish items).
This unfortunately is an issue with the SPGC itself: it is dominated by English, with only a smattering of other languages. Other corpora will need to be explored to find more examples, particularly of minority languages like Finnish, but the work has not yet reached this stage.
Secondly, some classification decisions rely on arbitrary thresholds (like the KS-significance of 0.03 to separate high and low groups). This calls for an explanation.
I agree there are arbitrary thresholds, for which there was little to guide the choice. For the lower anomaly cut-off, 0.00001 does in Figure 5 appear to separate the main distribution from the widely-spaced anomalies – but I daresay many other choices could have been made. I have added a note to that effect. As for the 0.03 KS-significance, this skims off the lowest decile (Figure 8), and though this is somewhat arbitrary it was guided by subjective observation of Figure 9: the distributions with KS-significance below 0.03 do appear to have a definite “bow” to them while the top two look approximately straight. This is of course subjective (a kind of “chi-by-eye”), though no more so than the standard “95% confidence” used to define “statistical significance”.
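For readers unfamiliar with how such a KS-significance is obtained, the following sketches the standard recipe (maximum-likelihood fit, then comparison of the observed KS distance with distances from synthetic samples drawn from the fitted model, in the spirit of Clauset et al.); the exact procedure used in the paper may differ in detail.

```python
# Sketch of a bootstrap KS significance for a fitted Zipfian model
# (illustrative recipe only; not the paper's code).
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
N_RANKS = 10_000   # assumed finite support for the rank distribution

def fit_exponent(sample):
    """Maximum-likelihood estimate of the Zipf exponent over a fixed support."""
    nll = lambda a: -stats.zipfian.logpmf(sample, a, N_RANKS).sum()
    return minimize_scalar(nll, bounds=(1.01, 3.0), method="bounded").x

def ks_distance(sample, a):
    """KS distance between the sample and the fitted Zipfian CDF."""
    return stats.kstest(sample, lambda k: stats.zipfian.cdf(k, a, N_RANKS)).statistic

sample = stats.zipfian.rvs(1.1, N_RANKS, size=10_000, random_state=rng)  # toy "text"
a_hat = fit_exponent(sample)
d_obs = ks_distance(sample, a_hat)

# KS significance: fraction of synthetic samples from the fitted model whose
# refitted KS distance is at least as large as the observed one.
exceed = 0
for _ in range(100):
    synth = stats.zipfian.rvs(a_hat, N_RANKS, size=sample.size, random_state=rng)
    exceed += ks_distance(synth, fit_exponent(synth)) >= d_obs
print(f"alpha_hat = {a_hat:.3f}, KS significance ~ {exceed / 100:.2f}")
```

On a recipe of this kind, a threshold such as 0.03 simply marks the quantile below which a fit is deemed too poor to treat as a power law.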
In the third place, some parts of the presentation that contain very technical descriptions could be made more accessible for readers without a strong background in statistical linguistics.
The second reviewer already commented on this. In view of this additional comment, I have attempted to improve the explanations further.
Fourthly, there could be more analysis of why certain texts (like dictionaries) behave differently.
Dictionaries, since they naturally cover all (or most) word-types of a language, are likely to have an unusually large vocabulary relative to the number of tokens they contain, and can thus be expected to show atypical characteristics. This question doubtless deserves deeper study, but I have directed this particular study to the typical as opposed to the exceptional.