Article
Peer-Review Record

Measuring Language Distance of Isolated European Languages

Information 2020, 11(4), 181; https://doi.org/10.3390/info11040181
by Pablo Gamallo 1,*, José Ramom Pichel 2 and Iñaki Alegria 3
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 24 February 2020 / Revised: 23 March 2020 / Accepted: 25 March 2020 / Published: 27 March 2020
(This article belongs to the Special Issue Digital Humanities)

Round 1

Reviewer 1 Report

The objective and methodology of this paper are clearly and simply defined. The research rationale is simple and strong at the same time.

Finally, although the results are not completely positive and conclusive, I think that the methodological approach presented in the paper deserves to be published.

Just a few suggestions follow:

The introduction and background sections are clear and rich. However, at least one author deserves to be cited in the first part of your paper: George Starostin (Russian State Univ.), who has devoted much attention to lexicostatistics, linguistic phylogenetics, and lexical proximity over the last 20 years.

line 103: itens --> items

equation (6): is it correct?

One additional (curiosity-driven) comment, which, maybe, could be useful for your further research:

It would be important to understand/analyse why you obtained the 4 errors in exp. 1. Is this ascribable to the n-gram structure of e.g. French and English? or ..?

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper proposes to apply known methods for measuring language distance to isolated languages. The idea is original and the experimental setup mostly sound, but I have a few major issues with the paper in its current state:

  1. There is no explicit hypothesis about the expected results, and whether the hypothesis is confirmed or rejected by the experiments. The authors say that "the results are rather disappointing", but it is entirely unclear to me what would be a successful result.
  2. Hierarchical clustering presupposes that all leaves are related and have one common ancestor, which is obviously not true for your data. Furthermore, Ward's method is known to favour clusters of similar size, which is also obviously not the case for your dataset (you probably want the isolates to be presented as distinct branches, which Ward's method precisely tries to avoid). I would suggest that you try (1) non-hierarchical clustering techniques, and (2) if you decide to keep hierarchical ones, to use e.g. complete linkage or UPGMA (average linkage) instead of Ward.
  3. I'm a bit puzzled by the presentation of your methods. Any method basically relies on three factors: (1) the features (in your case the text normalization, ngram order and weighting), (2) the distance measure, (3) the clustering algorithm applied to the distance matrix. On this basis:
    • Why is Burrows' delta considered separately and not just like any other distance measure (besides perplexity, KLD etc.)?
    • Why is the n-gram order (3-grams vs 7-grams) not the same for all experiments? This would make comparisons easier.
    • What is the evidence for combining some measures into the ALD metric rather than keeping them separate and showing distinct results? What is the reason for not including Burrows' delta in ALD?
    • Why is there no discussion about (3) (cf. also my previous remark)?
    • Is it correct that you don't apply hierarchical clustering to the Burrows' delta distance matrix although section 2.1 is called "Language clustering"? If so, why not?
  4. I could imagine that the isolated languages behave quite differently depending on the linguistic level you look at: Basque might have a lot of lexical borrowings from Spanish and a similar phonology, but a completely different morphology and syntax. Albanian might have a lot of lexical borrowings from Greek (scientific terminology) but not vice-versa. Hence, I would find it very interesting if you could somehow tease apart the different levels of analysis and apply your method to each of them separately. This can be done to a certain extent by just changing the ngram order, but some more sophisticated techniques will probably be needed.
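
To make the three factors in point 3 and the linkage suggestion in point 2 concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the toy texts, the labels A/B/C, the function names, and the L1-based profile distance are all assumptions chosen for illustration, and the distance stands in for whichever measure (perplexity, KLD, Burrows' delta, ...) is actually used. The clustering step supports average linkage (UPGMA) and complete linkage, the two alternatives to Ward mentioned above.

```python
from collections import Counter
from itertools import combinations

def ngram_profile(text, n=3):
    """Factor (1), features: relative frequencies of character n-grams."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)  # assumes the text is at least n characters long
    return {g: c / total for g, c in Counter(grams).items()}

def profile_distance(p, q):
    """Factor (2), distance: half the L1 distance between profiles (0 to 1).
    Illustrative stand-in only, not one of the measures from the paper."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def agglomerate(names, dist, linkage="average"):
    """Factor (3), clustering: repeatedly merge the two closest clusters.
    'average' is UPGMA (mean of all pairwise member distances);
    'complete' uses the largest pairwise member distance."""
    clusters = [(name,) for name in names]
    merges = []

    def between(a, b):
        pairwise = [dist[frozenset((x, y))] for x in a for y in b]
        if linkage == "complete":
            return max(pairwise)
        return sum(pairwise) / len(pairwise)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: between(*ab))
        merges.append((a, b, between(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical toy data: A and B are near-identical, C shares no trigram with them.
texts = {
    "A": "the cat sat on the mat",
    "B": "the cat sat on the hat",
    "C": "zzz qqq xxx www vvv",
}
profiles = {name: ngram_profile(t, n=3) for name, t in texts.items()}
names = sorted(texts)
dist = {
    frozenset((a, b)): profile_distance(profiles[a], profiles[b])
    for a, b in combinations(names, 2)
}
merges = agglomerate(names, dist, linkage="average")
for left, right, d in merges:
    print(left, right, round(d, 3))
```

In this toy case A and B merge first, and C joins only at the maximal distance, so it appears as a distinct branch. Keeping the n-gram order and the clustering step fixed while swapping only `profile_distance` is one way to obtain the like-for-like comparison across measures that point 3 asks for.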

Some further minor remarks:

  • The term "pre-Indo-European" is problematic for the languages in your sample (Greek and Albanian are IE, and Hungarian could just as well be called "post-Indo-European"...). It is only used in the abstract though, not in the main text.
  • L48: remove one occurrence of "final(ly)"
  • L59: "the objective is not to distinguish..." So what is the objective then?
  • Section 1: I would find it relevant to cite some work from dialectometry, where similar methods are used. You might find some relevant papers e.g. about clustering from John Nerbonne's group.
  • in various places: dend*r*ogram
  • L150: "discover which language families are closest..." - why is this an interesting question at all?
  • L171: the common ancestry of Baltic and Slavic languages has long been disputed - if your model fails to find this link I would not count it as an error
  • L181: do you have any empirical findings about the impact of spelling transliteration? it would be nice to do a pilot study on just 2 languages, with (1) no normalization/transliteration, (2) your proposed normalization, (3) "extreme" normalization to IPA
  • Tables 1 and 2: ALD gives you a score, right? it would be instructive to add the scores as well. My hypothesis would be that the distances are much lower in Table 1 than in Table 2, and if you applied a common cutoff threshold the isolated languages would indeed end up being isolated.
  • L218: the so-called "Balkan Sprachbund" finding states that several Balkan languages share grammatical features independently of their origin. Since you have Greek, Albanian, Romanian, Serbian, Macedonian, Bulgarian and Croatian in your sample, you might be in a good position to test this finding. It doesn't seem that your data shows much evidence for it, but could it be that your procedure focuses mostly on lexical aspects rather than grammatical ones? I think a discussion of this phenomenon would be very interesting...
  • L221: historically, Hungarian is probably about as far apart from Finnish/Estonian as Spanish is from Russian, so your finding is not too surprising.
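
The cutoff idea in the remark on Tables 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: the labels X/Y/Z, the scores, the cutoff value, and the function names are hypothetical and are not taken from the paper, and a lower score is assumed to mean a closer pair.

```python
def nearest_neighbours(dist):
    """Map each label to (closest other label, distance).
    `dist` is a symmetric dict of dicts of pairwise distances."""
    return {
        a: min(((b, d) for b, d in row.items() if b != a), key=lambda bd: bd[1])
        for a, row in dist.items()
    }

def isolates(dist, cutoff):
    """Labels whose nearest neighbour is still farther away than `cutoff`."""
    nearest = nearest_neighbours(dist)
    return sorted(a for a, (_, d) in nearest.items() if d > cutoff)

# Hypothetical scores: X and Y are close to each other, Z is far from both.
dist = {
    "X": {"Y": 2.0, "Z": 9.0},
    "Y": {"X": 2.0, "Z": 8.0},
    "Z": {"X": 9.0, "Y": 8.0},
}
print(nearest_neighbours(dist))
print(isolates(dist, cutoff=5.0))
```

With one common cutoff applied to every language, a language is reported as isolated when even its closest neighbour lies beyond the threshold, which is the behaviour the remark hypothesises for the isolated languages in Table 2.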

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Thanks for the detailed author response and for the updated version. I realized that I had misunderstood some methodological aspects; this is now much clearer. Below is just a list of typos and minor remarks that you might address in the final version.

  • L3: do not belong to the Indo-European family and are somehow isolated => *or* are otherwise isolated (?)
  • L12: Kullback
  • L20: their => they
  • L24: list => lists
  • L31: are you qualifying Greek, Armenian and Albanian as "ancient languages with limited attestation"? I'm not sure I agree...
  • L67: by just to compute => by just computing
  • L76: rely not => do not rely
  • L104: aglomerative => agglomerative
  • L106: meaure => measure
  • L108: *d*ocument
  • L113: You present Burrows' delta in detail, but use Eder's delta in your experiments, right? Would it be informative to add the precise formula of Eder's computation here?
  • L138: results => result
  • L141: consist => consists
  • L149: university versus non-university people => academics versus non-academics
  • L149: *K*ullback
  • L152+L156: Ranked-Based => Rank-Based
  • (2)/(5): In formula (2) you use |...| for absolute value, but in formula (5) you use abs(...). If these operators are the same, please also use the same notation.
  • L159: I would move the note regarding asymmetric distances to the first paragraph of section 2.2, as it concerns several measures.
  • L162, (6), (9): Camberra => Canberra
  • (6): I assume that 1/5 has scope over the entire sum, so a pair of parentheses is required
  • L166: represents => represent
  • (7): I'm still not sure if the formula is right. Inside the square roots there should be sums, I think.
  • L198: add space after "the"
  • L205: *h*ierarchical
  • L209: have => has
  • L219: the text suggests that the transliteration was only done for the distance experiments, not for the Stylo/clustering ones. Is this true? I would imagine that results may be quite different depending on transliteration...
  • Table 1: ser*b*ian
  • L230: *B*ible
  • L228: Asgari Asgari (duplicate)
  • L245: meausure => measure
  • L249: tends => tend
  • L252: *B*altic
  • L259: does not have => does not show
  • L269: two full stops
  • L280: it might be useful to add a reference explaining the Balkan Sprachbund concept - you might find relevant links e.g. on Wikipedia
  • L292: a last occurrence of "pre-Indo-European" to be replaced...
  • L300: influence ... in => influence ... on
  • L303: Ackno*w*ledgments

Author Response

Please, see the attachment.

Author Response File: Author Response.pdf
