Peer-Review Record

A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript

Math. Comput. Appl. 2019, 24(1), 14; https://doi.org/10.3390/mca24010014
Reviewer 1: René Zandbergen
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Received: 23 November 2018 / Revised: 14 January 2019 / Accepted: 19 January 2019 / Published: 23 January 2019
(This article belongs to the Special Issue Mathematical Modelling in Engineering & Human Behaviour 2018)

Round 1

Reviewer 1 Report

The paper is interesting and can be published with some modifications. None of these modifications are major, but some are important.

Most importantly, the analysis done is not entirely new. Several people have already applied Hidden Markov Models to the Voynich MS text in the past. However, I am not aware of any clear publication of such past results, so this paper is a very useful addition to the Voynich MS bibliography. Examples of past work are:

1.       René Zandbergen, on a web page ( http://www.voynich.nu/extra/sol_ent.html ) towards the end of the page. This is not a formal publication, but it essentially confirms the result of the present paper.

2.       Reddy and Knight (2011) ( http://aclweb.org/anthology/W11-1511 ). They only summarise their result in a few sentences, but conclude that there is no clear vowel/consonant separation. This result has not yet been reproduced by anyone.

3.       Mary D’Imperio (undated, most probably between late 70’s and late 80’s) ( https://www.nsa.gov/Portals/70/documents/news-features/declassified-documents/tech-journals/application-of-ptah.pdf ). Her algorithm is classified and not described, but clearly some form of HMM.  She uses a higher-state model, and identifies some of the particularities of the Voynich MS text.

In summary, a two-state HMM largely suggests the presence of vowels and consonants, but going to a higher-state HMM implementation brings out some issues that are still not explained. (This is clearly an avenue for further work that the author might consider).

 

The most important comments:

By line nrs (or figure nr):

100 and ff.: from the wording, the reader might understand that the maths presented are original work by the author, but my understanding is that the text closely follows the publication of Stamp (ref. [1]). This should be made a bit more clear, e.g. by saying “following [1]” (or in some other way).

126 and ff.: this is a start of a summary or abstract of the results of the following sections, which is a bit unexpected. In particular the last sentence (“This may be explained….”) is confusing for the reader at this point, and should be confined to the abstract at the top of the paper. (But see also the comments about the validity of some parts of the conclusion). Removing the last two sentences will improve the understanding of the reader.

151-153: not being very familiar with HMM implementations, the convergence shown in Figure 3 is quite peculiar. I understand from ref. [1] that 100 iterations is quite normal, so I will accept that convergence only begins after about 50, but I would hesitate to call it “very fast”. It would be good to explain this in some way, e.g. saying that this is normal for HMM’s.

157: “This is also found”. Please clarify which aspect is also found.

183: the Eva transcription alphabet is not ideally suited for numerical analyses. This is generally a difficult topic, because it is not known with any certainty which symbols in the MS text represent one character. However, almost certainly the Eva combinations “ch” and “sh” should be considered one single character each. The same is probably true for “in” and “iin”. However, there may be many cases of “i” that are single characters. This is a problem that cannot yet be solved, and in fact the HMM analyses may help in answering some of these questions. I refer to this web page: ( http://www.voynich.nu/extra/sol_ent.html ) towards the end (80% down). An alternative alphabet like Cuva is likely to provide more representative results. Converting the transcription file can be done with a simple ‘sed’ script. To conclude on this point: I don’t mean to argue that the result of the present paper is invalidated by this. However, it is a limitation that needs to be understood, and mentioned, and it could be another avenue for further work.
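As a rough illustration of the conversion mentioned above, the following minimal Python sketch does what a simple 'sed' script would do: it collapses selected Eva letter groups into single symbols before any numerical analysis. The mapping table and the placeholder symbols are hypothetical and chosen only for illustration; they are not the Cuva alphabet, and the sketch assumes a lower-case Eva transcription so that the upper-case placeholders cannot collide with existing characters.

```python
import re

# Hypothetical mapping, for illustration only: each multi-letter Eva group
# is replaced by one placeholder symbol. This is NOT the Cuva alphabet.
MERGES = {"iin": "M", "in": "N", "ch": "C", "sh": "S"}

# Longest groups first, so that "iin" is tried before "in".
_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, MERGES), key=len, reverse=True))
)


def merge_groups(eva_text: str) -> str:
    """Replace each listed Eva group with its single-symbol placeholder."""
    return _PATTERN.sub(lambda m: MERGES[m.group(0)], eva_text)


if __name__ == "__main__":
    # Words written in lower-case Eva; all other characters pass through.
    print(merge_groups("chedy daiin shol"))
```

Whether an isolated "i" should also be merged is exactly the open question raised in the comment, so the sketch leaves single characters untouched.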

199:  “the same pattern”. This A matrix is rather different, so I would rather call it “similar”. The main point, I understand, is that the transition vowel -> consonant and vice versa are much more likely than vowel -> vowel and consonant -> consonant. (The off-diagonal numbers are greater).

Fig.8: some other intermediate “peaks” are not indicated in the Figure, namely for Eva “e”,  “m”, and “p”. The last two are relatively rare, and clearly belong to the state 2. Eva “e” was already identified as a vowel. In line 210 (and elsewhere) the “e” could be added to the group “I, s and y”.

212-213: (This is one of the most important comments I have on the paper). In English, the letter “y” can appear both as a vowel and as a consonant, so this would be the most natural explanation for the phenomenon that is observed. That there are three or four, instead of one, is not so easy to explain, so this is where the Voynich MS text is a bit different. To conclude that this may be an abjad is not sufficiently justified in my opinion. In abjads, some vowels are written and some are left out. This means that each character that is written is still either a vowel or a consonant, but the transition probabilities are different. To know whether a text written in an abjad would cause the observed effect, one should do the analysis of such a text. On the other hand, some abjads may have more semivowels than, say, English. In summary, this conclusion is very tentative and has to be expressed very carefully, since it is not based on observation.

This also affects line 238 and ff. where, in addition, abugidas are mentioned. These are quite different, because vowel symbols are still written, but sometimes attached as an integral part of the consonant. This leads to a much larger character set. In some cases the vowel symbols are written above or below the consonants, in which case no general statement can be made, and the text could be considered a normal alphabetic text.

In summary, the comparison with abjads and abugidas should be stated much more carefully, also in the abstract.

219-222: [dominated … erred] the first two sentences need to be removed or fundamentally changed. There have been a few cases of “bad work” but very many that have helped our understanding of the MS, by people like Friedman, Tiltman, Currier and Bennett, all prior to, or contemporary with D’Imperio, and much more useful work after D’Imperio.

224-226: “Anyway … encipherd text”. I would remove “the situation is still so confusing that”. The sentence seems to suggest that a fake is clearly impossible, but I would not say that a clever fake is excluded. Human-generated meaningless text is most likely to follow some vowel-consonant pattern.

241: “y” should be added as one of the clear members of state 1. Note that the Eva characters that seem to represent vowels are also mostly vowels. This was a deliberate choice in the definition of the Eva alphabet.

247-250: the identifications by Stephen Bax are not generally accepted, so cannot be used as a confirmation, but it is true that the mapping of vowels and consonants coming out of the HMM analysis allows for many pronounceable, even meaningful words to appear. The resulting conclusion has to be phrased more carefully.


Minor or editorial comments:

(Note that I cannot claim to have spotted all minor grammatical or typographical errors)

General: please change “Voynich’s manuscript” to “the Voynich manuscript” everywhere. This is how the MS is known in all publications, including those of the holding library.

By line nr (or figure nr):

20: please clarify in a few words what is the “Brown Corpus”

37: Samotigian should probably be Samogitian, but this is the name of a language. Best just write Polish or Polish-Lithuanian.

37: (editorial) “it have” should be “it has”

39: (editorial) “found” should be “find”. “M. d’Imperio“ should be “M. d’Imperio’s “

40: (editorial) “common lore about history researchers” does not sound like good English. Best change into the more accurate statement “generally believed by history researchers”.

43-44: “reconstruction … elucidated” does not read well. Either write: “The history … has been elucidated” or “A reconstruction … has been made”.

54: the illustration of the Voynich text will be improved by using a few upper case letters. Instead of “sh” better write “Sh” (with a capital S) and instead of “cth” write “cTh” with a capital T.

56-57: Eva does not have 36 symbols, but 25 (when ignoring the rare ones). It is the Currier alphabet that has 36.

68: (editorial) “his” should be “its”

91: (editorial) “make” should be “makes”

96: (editorial) insert “are” between “we interested”

139: reference number is missing: [17]. It would be good to mention (even if it is a repetition) at this point that the text used is an English translation of Quixote, because I naturally assumed, while reading, that it would have been the original Spanish.

167: the letter “y” in “my” is not part of a diphthong. It is a complete diphthong by itself.

Fig.7: the symbol “s” is missing in the caption.

261-262: the Eva alphabet was designed by (dr.) René Zandbergen and (dr.) Gabriel Landini.


Author Response

REPLY TO REFEREE 1:

First, I would like to thank the referee for the time invested in reading the manuscript and for the many useful comments.

All these comments have been taken into account in the revised version, and I hope that the paper is now improved and acceptable for publication.

Detailed replies are provided below.

The paper is interesting and can be published with some modifications. None of these modifications are major, but some are important.

Most importantly, the analysis done is not entirely new. Several people have already applied Hidden Markov Models to the Voynich MS text in the past. However, I am not aware of any clear publication of such past results, so this paper is a very useful addition to the Voynich MS bibliography. Examples of past work are:

1.       René Zandbergen, on a web page ( http://www.voynich.nu/extra/sol_ent.html ) towards the end of the page. This is not a formal publication, but it essentially confirms the result of the present paper.

2.       Reddy and Knight (2011) ( http://aclweb.org/anthology/W11-1511 ). They only summarise their result in a few sentences, but conclude that there is no clear vowel/consonant separation. This result has not yet been reproduced by anyone.

3.       Mary D’Imperio (undated, most probably between late 70’s and late 80’s) ( https://www.nsa.gov/Portals/70/documents/news-features/declassified-documents/tech-journals/application-of-ptah.pdf ). Her algorithm is classified and not described, but clearly some form of HMM.  She uses a higher-state model, and identifies some of the particularities of the Voynich MS text.

In summary, a two-state HMM largely suggests the presence of vowels and consonants, but going to a higher-state HMM implementation brings out some issues that are still not explained. (This is clearly an avenue for further work that the author might consider).

REPLY: I am also surprised by the scarcity of publications about the application of HMM to the Voynich manuscript. These online references are now discussed in Sec. 4: “The idea of using HMM (…) some words”.

The most important comments:

By line nrs (or figure nr):

100 and ff.: from the wording, the reader might understand that the maths presented are original work by the author, but my understanding is that the text closely follows the publication of Stamp (ref. [1]). This should be made a bit more clear, e.g. by saying “following [1]” (or in some other way).

REPLY:  The maths corresponding to the HMM algorithm  are, indeed, very standard. Before Sec. 2.1 it is now clearly stated that we will closely follow the reference by M. Stamp.

126 and ff.: this is a start of a summary or abstract of the results of the following sections, which is a bit unexpected. In particular the last sentence (“This may be explained….”) is confusing for the reader at this point, and should be confined to the abstract at the top of the paper. (But see also the comments about the validity of some parts of the conclusion). Removing the last two sentences will improve the understanding of the reader.

REPLY:  Following this recommendation I have removed those phrases.

151-153: not being very familiar with HMM implementations, the convergence shown in Figure 3 is quite peculiar. I understand from ref. [1] that 100 iterations is quite normal, so I will accept that convergence only begins after about 50, but I would hesitate to call it “very fast”. It would be good to explain this in some way, e.g. saying that this is normal for HMM’s.

REPLY: HMM algorithms are heuristic, and the rate of convergence depends on the problem. The rather peculiar form of log P in this case may have something to do with the size of the basin of attraction of the fixed point we are looking for. But this would require further research into the whole landscape of the space of possible models. This is now briefly commented on before Figure 3.

157: “This is also found”. Please clarify which aspect is also found.

REPLY: In particular I refer to transition matrices with large off-diagonal elements.  This is now explicitly stated.

183: the Eva transcription alphabet is not ideally suited for numerical analyses. This is generally a difficult topic, because it is not known with any certainty which symbols in the MS text represent one character. However, almost certainly the Eva combinations “ch” and “sh” should be considered one single character each. The same is probably true for “in” and “iin”. However, there may be many cases of “i” that are single characters. This is a problem that cannot yet be solved, and in fact the HMM analyses may help in answering some of these questions. I refer to this web page: ( http://www.voynich.nu/extra/sol_ent.html ) towards the end (80% down). An alternative alphabet like Cuva is likely to provide more representative results. Converting the transcription file can be done with a simple ‘sed’ script. To conclude on this point: I don’t mean to argue that the result of the present paper is invalidated by this. However, it is a limitation that needs to be understood, and mentioned, and it could be another avenue for further work.

REPLY: In the second paragraph of Sec. 3.2 we now discuss the alternative alphabets that could be chosen for the transcription of the Voynich manuscript.

199:  “the same pattern”. This A matrix is rather different, so I would rather call it “similar”. The main point, I understand, is that the transition vowel -> consonant and vice versa are much more likely than vowel -> vowel and consonant -> consonant. (The off-diagonal numbers are greater).

REPLY:     Before Eq. (19) we now say that the transition matrix is similar for the Voynich  manuscript in comparison with natural languages.  The reason is, indeed, the large values of the off-diagonal elements  and the smaller  value for the diagonal elements.

Fig.8: some other intermediate “peaks” are not indicated in the Figure, namely for Eva “e”,  “m”, and “p”. The last two are relatively rare, and clearly belong to the state 2. Eva “e” was already identified as a vowel. In line 210 (and elsewhere) the “e” could be added to the group “I, s and y”.

REPLY:  Done.

212-213: (This is one of the most important comments I have on the paper). In English, the letter “y” can appear both as a vowel and as a consonant, so this would be the most natural explanation for the phenomenon that is observed. That there are three or four, instead of one, is not so easy to explain, so this is where the Voynich MS text is a bit different. To conclude that this may be an abjad is not sufficiently justified in my opinion. In abjads, some vowels are written and some are left out. This means that each character that is written is still either a vowel or a consonant, but the transition probabilities are different. To know whether a text written in an abjad would cause the observed effect, one should do the analysis of such a text. On the other hand, some abjads may have more semivowels than, say, English. In summary, this conclusion is very tentative and has to be expressed very carefully, since it is not based on observation.

This also affects line 238 and ff. where, in addition, abugidas are mentioned. These are quite different, because vowel symbols are still written, but sometimes attached as an integral part of the consonant. This leads to a much larger character set. In some cases the vowel symbols are written above or below the consonants, in which case no general statement can be made, and the text could be considered a normal alphabetic text.

In summary, the comparison with abjads and abugidas should be stated much more carefully, also in the abstract.

REPLY: I have changed the discussion at the end of Sec. 3 and in Sec. 4 to incorporate these comments. I agree that a double vowel/consonant nature is a reasonable hypothesis for the appearance of the EVA symbols “e”, “i”, “s” and “y” in both states 1 and 2. On the other hand, “i” can also appear as “in” or “iin”, and we do not know whether, in these cases, it constitutes a single letter, a different symbol, or is attached to the consonant.
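One way to make "belonging to states 1 and 2" quantitative is to compute, for each symbol, the probability of each hidden state given that the symbol was emitted. The Python sketch below does this from a transition matrix A and an emission matrix B, weighting the states by the stationary distribution of A. The function names and all numbers are made up for illustration; they are not the matrices reported in the paper.

```python
import numpy as np


def stationary(A):
    """Stationary distribution of a row-stochastic transition matrix A."""
    vals, vecs = np.linalg.eig(A.T)
    # Left eigenvector of A for the eigenvalue 1 (the largest one).
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()


def state_posterior(A, B):
    """P(state | symbol) as an array of shape (n_states, n_symbols).

    Assumes every symbol has non-zero emission probability in some state.
    """
    joint = stationary(A)[:, None] * B  # P(state, symbol)
    return joint / joint.sum(axis=0, keepdims=True)


if __name__ == "__main__":
    # Toy values only: two states, three symbols.
    A = np.array([[0.2, 0.8], [0.6, 0.4]])
    B = np.array([[0.7, 0.3, 0.0], [0.0, 0.3, 0.7]])
    print(state_posterior(A, B))
```

Symbols whose column is close to (1, 0) or (0, 1) sit clearly in one state; columns with two comparable entries correspond to the intermediate cases discussed above.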

219-222: [dominated … erred] the first two sentences need to be removed or fundamentally changed. There have been a few cases of “bad work” but very many that have helped our understanding of the MS, by people like Friedman, Tiltman, Currier and Bennett, all prior to, or contemporary with D’Imperio, and much more useful work after D’Imperio.

REPLY: When writing this rather strong statement I was thinking, especially, of the work of Newbold in the early 20th century and of many recent pseudoscientific claims from outside academia. Of course, there has been much useful work, from Friedman and collaborators up to the present day. I have changed this paragraph accordingly.

224-226: “Anyway … encipherd text”. I would remove “the situation is still so confusing that”. The sentence seems to suggest that a fake is clearly impossible, but I would not say that a clever fake is excluded. Human-generated meaningless text is most likely to follow some vowel-consonant pattern.

REPLY: We now say that a recent falsification is excluded, but we also comment that a clever fake by a medieval author is still possible.

241: “y” should be added as one of the clear members of state 1. Note that the Eva characters that seem to represent vowels are also mostly vowels. This was a deliberate choice in the definition of the Eva alphabet.

REPLY: Done

247-250: the identifications by Stephen Bax are not generally accepted, so cannot be used as a confirmation, but it is true that the mapping of vowels and consonants coming out of the HMM analysis allows for many pronounceable, even meaningful words to appear. The resulting conclusion has to be phrased more carefully.

REPLY:   We now say: “In any case, we must stress that the identifications performed  by S. Bax are not generally accepted but, at least, they show that our mapping of vowels and consonants allows for some meaningful words to appear.” to emphasize  that Bax's translations are only a hypothesis. 

 

 

Minor or editorial comments:

(Note that I cannot claim to have spotted all minor grammatical or typographical errors)

General: please change “Voynich’s manuscript” to “the Voynich manuscript” everywhere. This is how the MS is known in all publications, including those of the holding library.

By line nr (or figure nr):

20: please clarify in a few words what is the “Brown Corpus”

REPLY:  Done

37: Samotigian should probably be Samogitian, but this is the name of a language. Best just write Polish or Polish-Lithuanian.

REPLY: Samogitia is also a region in the northwestern part of Lithuania. Voynich was born at Kaunas which, at the time, was the capital of Samogitia. But I agree that it is simpler to state Voynich's nationality as Polish-Lithuanian.

37: (editorial) “it have” should be “it has”

39: (editorial) “found” should be “find”. “M. d’Imperio“ should be “M. d’Imperio’s “

40: (editorial) “common lore about history researchers” does not sound like good English. Best change into the more accurate statement “generally believed by history researchers”.

43-44: “reconstruction … elucidated” does not read well. Either write: “The history … has been elucidated” or “A reconstruction … has been made”.

54: the illustration of the Voynich text will be improved by using a few upper case letters. Instead of “sh” better write “Sh” (with a capital S) and instead of “cth” write “cTh” with a capital T.

56-57: Eva does not have 36 symbols, but 25 (when ignoring the rare ones). It is the Currier alphabet that has 36.

68: (editorial) “his” should be “its”

91: (editorial) “make” should be “makes”

96: (editorial) insert “are” between “we interested”

139: reference number is missing: [17]. It would be good to mention (even if it is a repetition) at this point that the text used is an English translation of Quixote, because I naturally assumed, while reading, that it would have been the original Spanish.

167: the letter “y” in “my” is not part of a diphthong. It is a complete diphthong by itself.

Fig.7: the symbol “s” is missing in the caption.

261-262: the Eva alphabet was designed by (dr.) René Zandbergen and (dr.) Gabriel Landini.

REPLY:   All the previous comments have been taken into account. Thanks for spotting these typos.


Reviewer 2 Report

This is an interesting paper on the analysis of the Voynich manuscript based on the HMM. Overall the paper is clear and well written. However, the authors should address the following issues before it can be considered acceptable for publication.


1) The authors should include a section mentioning applications of HMM in other linguistic applications. Other similar models should also be mentioned as they also attempt to identify the authenticity of documents. See e.g.:

doi.org/10.1371/journal.pone.0118394 

doi.org/10.1007/s11192-015-1637-z

doi.org/10.1038/srep04547

doi.org/10.1371/journal.pone.0170527


2) The authors performed an analysis on only one book (The Quixote). Is it possible to provide some performance measure for a higher number of books? This could be a more robust test to show that your method works before


3) Some references are missing (see [? ] in line 139).


4) Could you provide a comment on whether your technique is capturing syntactical or semantical features of the Voynich manuscript? Note that a discussion on this is provided in: https://arxiv.org/abs/1806.08467


5) Finally, please state in the conclusion how your work adds to the literature and whether it can be used for real-world applications.


Author Response

REPLY TO REFEREE 2:

The author wishes to acknowledge this referee for the time and effort invested in reading this manuscript and providing useful comments. The replies to the specific comments are given below.

1) The authors should include a section mentioning applications of HMM in other linguistic applications. Other similar models should also be mentioned as they also attempt to identify the authenticity of documents. See e.g.:

doi.org/10.1371/journal.pone.0118394 

doi.org/10.1007/s11192-015-1637-z

doi.org/10.1038/srep04547

doi.org/10.1371/journal.pone.0170527

REPLY: Section 2.4 has been included. In this section these references (and other) are briefly discussed.

2) The authors performed an analysis on only one book (The Quixote). Is it possible to provide some performance measure for a higher number of books? This could be a more robust test to show that your method works before

REPLY: My idea was only to provide an example so that the reader can grasp the technique, and to make the paper more self-contained. The HMM technique has indeed been applied to other texts in the past, in particular to the “Brown Corpus”, which is really a compilation of texts ranging from science to religion and literature. A comment about this previous application is now included in the paragraph after Eq. (15).

3) Some references are missing (see [? ] in line 139).

REPLY: Checked.

4) Could you provide a comment on whether your technique is capturing syntactical or semantical features of the Voynich manuscript? Note that a discussion on this is provided in: https://arxiv.org/abs/1806.08467

REPLY: HMM provide information about the transition among individual symbols/letters. The result of this inference on the language is oriented towards the basic syntactical structure. Nothing can be said about the semantics from this approach (although it can certainly help in the decoding of the manuscript). This is now explained at the end of Sec. 2.4.

5) Finally, please state in the conclusion how your work adds to the literature and whether it can be used for real-world applications.

REPLY: HMM are indeed not new and they have already been applied to many problems (including speech recognition). Some of these applications are discussed in Sec. 2.4. The main contribution of the present paper is the evidence for the presence of vowel and consonant sounds in the Voynich manuscript gained from the HMM approach. This can provide additional confidence to the linguists trying to understand the manuscript, and it supports other mathematical studies such as those of Amancio et al. and Montemurro et al. I comment about this in the first paragraph


Reviewer 3 Report

Hidden Markov models are used in the present submission in an attempt to extract some meaningful information in the study of the Voynich manuscript enigma. The authors claim that a model with two states is able to identify a first regime, essentially related to vowels, and a second one, essentially related to consonants. There is no doubt that the topic is very challenging, but in my opinion, the paper lacks good background and knowledge on hidden Markov models, both in terms of the state of the art and of practical applications, and I am not convinced it could be published as such.


My first major comment is concerned with the state of the art and the motivation of the paper. Although I'm not an expert in natural language processing, this is by no means an “emergent” field, but rather a quite well established one, for more than sixty years. The description of the hidden Markov models, both in the Introduction and Section 2, makes no reference to the seminal paper of Baum and Petrie in 1966, nor to the well-known Baum-Welch forward-backward algorithm. Subsection 2.3 called “Reestimating the model” is by no means a “new algorithm” as claimed by the authors, but the very standard one, available in most mathematical software packages. The first two sections of the paper are thus very puzzling for the reader used to hidden Markov models …


My second major comment is related to the Results section. There are several issues that should be explained better or clarified by the authors. First, they do not discuss the question of model selection. Why two states? What is justifying them? How can one check that one --- or three, or four --- are not better suited? Second, the EM algorithm that is actually described in Section 2 is known to be sensitive to initial values and local minima. Were several initializations performed? My guess is no, since the initial values are given, but the choice of performing only one initialization could potentially be problematic. Third, the resulting transition matrices are not stable at all, which means that in both examples, the hidden Markov chain will almost always switch from one state to the other. This is not exactly what one wishes for when using this kind of model, but rather the contrary: high values on the diagonal of the transition matrix, which ensure a good segmentation of the data. Hence, I am not completely convinced by the interest of using HMM in this case. The authors should explain their results much better; two frequency tables with an approximate interpretation are not sufficient in my view.


I also have two more minor comments.


- On page 7, line 163 (and also in other places), the authors draw conclusions about the English language having as a basis one literary manuscript. This kind of extrapolation is very dangerous! One can draw conclusions about the studied data, the studied manuscript in this case, but not about a language in general.


- There are plenty of typos that should be carefully reviewed and corrected.  

Author Response

REPLY TO REFEREE 3:

I acknowledge the referee for the many useful comments and the time and effort invested in reading this paper.

I have rewritten the paper according to these suggestions and the replies to these comments and the places where the paper has been revised are given below.

My first major comment is concerned with the state of the art and the motivation of the paper. Although I'm not an expert in natural language processing, this is by no means an “emergent” field, but rather a quite well established one, for more than sixty years. The description of the hidden Markov models, both in the Introduction and Section 2, makes no reference to the seminal paper of Baum and Petrie in 1966, nor to the well-known Baum-Welch forward-backward algorithm. Subsection 2.3 called “Reestimating the model” is by no means a “new algorithm” as claimed by the authors, but the very standard one, available in most mathematical software packages. The first two sections of the paper are thus very puzzling for the reader used to hidden Markov models …

REPLY: I have added Section 3.1 to discuss and cite the seminal papers on HMM and other applications. I have removed the adjective “emergent” from the reference to NLP. In Subsection 2.3 I was adopting a pedagogical standpoint but I agree that the reference to a “new algorithm” can be misleading to many readers. I now refer to the algorithm in 2.3 as “standard”.

My second major comment is related to the Results section. There are several issues that should be explained better or clarified by the authors. First, they do not discuss the question of model selection. Why two states? What is justifying them? How can one check that one --- or three, or four --- are not better suited?

REPLY: The following paragraph is now included at the beginning of Sec. 3 to clarify this issue:

“In both applications (to a text in English and to the Voynich manuscript) we will use only two hidden states. This raises the question about the advisability of this particular choice, instead of a larger number of hidden states. As the main objective is to show that these texts can be partitioned in different sets of symbols that are different in their statistical properties, to select $N=2$ seems the simpler choice. Moreover, in earlier applications to other books (such as the ``Brown Corpus'') this choice was proven to be successful in the identification of the vowels and consonants \cite{Cave}. More hidden states were considered by Cave and Neuwirth in their seminal application of HMM to language and they even obtained some conclusions for $N=3$ to $N=12$. Nevertheless, this was done with the advantage of dealing with a known language. Proving the existence of, at least, two different sets of letters in the Voynich manuscript is already a useful conclusion. If we take into account that the alphabet and the language are completely unknown in this case, this could help linguistics in their research to unveil some meaning in the words of the manuscript. Only after some globally accepted success is achieved in this endeavour the study of the convergence of the HMM model for $N>2$ could be done with some possibility of interpretation.”
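For readers who want a numerical handle on the "why two states?" question, one common option (not used in the paper, and named here only as an illustration) is to compare models with different numbers of hidden states through the Bayesian information criterion. The Python sketch below counts the free parameters of a discrete HMM with N states and M symbols and evaluates BIC from a log-likelihood; the function names are hypothetical, and the log-likelihood would have to come from a separately trained model.

```python
import math


def hmm_free_parameters(n_states: int, n_symbols: int) -> int:
    """Free parameters of a discrete HMM: each row of A and of B, and the
    initial distribution, lose one degree of freedom to normalisation."""
    transitions = n_states * (n_states - 1)
    emissions = n_states * (n_symbols - 1)
    initial = n_states - 1
    return transitions + emissions + initial


def bic(log_likelihood: float, n_states: int, n_symbols: int, n_obs: int) -> float:
    """Bayesian information criterion; lower values are preferred."""
    k = hmm_free_parameters(n_states, n_symbols)
    return -2.0 * log_likelihood + k * math.log(n_obs)
```

Under this criterion, a model with more states is preferred only if its gain in log-likelihood outweighs the penalty for the additional parameters; it says nothing about whether the extra states are linguistically interpretable, which is the concern raised in the quoted paragraph.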

Second, the EM algorithm that is actually described in Section 2 is known to be sensitive to initial values and local minima. Were several initializations performed? My guess is no, since the initial values are given, but the choice of performing only one initialization could potentially be problematic.

REPLY: I am aware of the problem with initializations and possible local minima. I checked this but chose to present only a representative case. To check the reliability of the results I have included the average and statistical errors for the elements of the transition matrix after $10$ simulations starting with different conditions. The results are discussed at the end of Subsection 3.1.
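The kind of check described in this reply can be sketched as follows. This is a minimal Python illustration, not the author's code: it implements the standard scaled Baum-Welch re-estimation for a discrete HMM, runs it from several random starting points, and reports the mean and standard error of the transition matrix. The random Dirichlet initialisation, the use of seeds 0 to 9, and the crude state alignment (ordering states by how strongly they emit symbol 0, to undo label switching between runs) are all assumptions made for this sketch.

```python
import numpy as np


def baum_welch(obs, n_states, n_symbols, n_iter=100, seed=0):
    """Scaled Baum-Welch for a discrete HMM; returns (A, B, pi, log_prob)."""
    rng = np.random.default_rng(seed)
    # Random row-stochastic starting point (assumed initialisation scheme).
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)
    pi = rng.dirichlet(np.ones(n_states))
    obs = np.asarray(obs)
    T = len(obs)
    for _ in range(n_iter):
        # Forward pass, normalising each step by the scaling factor c[t].
        alpha = np.zeros((T, n_states))
        c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum()
            alpha[t] /= c[t]
        # Backward pass, reusing the same scaling factors.
        beta = np.ones((T, n_states))
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta  # P(state at t | obs); each row sums to 1
        # Expected transition counts, summed over t = 0 .. T-2.
        w = B[:, obs[1:]].T * beta[1:] / c[1:, None]
        xi_sum = A * (alpha[:-1].T @ w)
        # Re-estimation step.
        pi = gamma[0]
        A = xi_sum / gamma[:-1].sum(axis=0)[:, None]
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    # log P(obs | model) for the parameters used in the last forward pass.
    return A, B, pi, np.log(c).sum()


def restart_statistics(obs, n_states, n_symbols, n_runs=10, n_iter=100):
    """Mean and standard error of A over runs with different random seeds."""
    runs = []
    for seed in range(n_runs):
        A, B, _, _ = baum_welch(obs, n_states, n_symbols, n_iter, seed)
        order = np.argsort(-B[:, 0])  # crude alignment of state labels
        runs.append(A[np.ix_(order, order)])
    runs = np.array(runs)
    return runs.mean(axis=0), runs.std(axis=0, ddof=1) / np.sqrt(n_runs)
```

Here `obs` is the text encoded as integers in the range 0 to n_symbols - 1. Large standard errors, or runs that end at clearly different log-likelihoods, would indicate that the algorithm is reaching different local optima.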

Third, the resulting transition matrices are not stable at all, which means that in both examples, the hidden Markov chain will almost always switch from one state to the other. This is not exactly what one wishes for when using this kind of model, but rather the contrary: high values on the diagonal of the transition matrix, which ensure a good segmentation of the data. Hence, I am not completely convinced by the interest of using HMM in this case. The authors should explain their results much better; two frequency tables with an approximate interpretation are not sufficient in my view.

REPLY: That is indeed the case for this linguistic application. Cave and Neuwirth in 1980 (Ref. 7) already found this kind of unstable matrix when working with the “Brown Corpus”. The idea is that most transitions in a natural language take place from consonant to vowel sounds and vice versa. Of course, the 2-by-2 identity matrix is a fixed point of the Baum-Welch algorithm. However, this is not the fixed point we are interested in for this particular application; we are interested in the one that leads to the partition of the symbols into two independent sets.
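The remark about the identity matrix can be verified directly from the standard re-estimation formulas. The notation below is the usual textbook one and is assumed here; it may differ from the symbols used in the paper.

```latex
% Joint posterior of being in state i at time t and in state j at time t+1:
\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)},
\qquad
\gamma_t(i) = \sum_{j} \xi_t(i,j).

% Baum-Welch update of the transition matrix:
\hat{a}_{ij} = \frac{\sum_{t=0}^{T-2} \xi_t(i,j)}{\sum_{t=0}^{T-2} \gamma_t(i)}.

% If A is the identity, a_{ij} = 0 for i \neq j, so \xi_t(i,j) = 0 for all t:
\hat{a}_{ij} = 0 \quad (i \neq j),
\qquad
\hat{a}_{ii} = \frac{\sum_{t} \xi_t(i,i)}{\sum_{t} \gamma_t(i)} = 1 .
```

The last equality holds for any state with non-zero occupancy. More generally, any transition probability that is exactly zero stays zero under re-estimation, which is why the starting point must have non-zero off-diagonal elements for the vowel/consonant solution to be reachable.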

I also have two more minor comments.

- On page 7, line 163 (and also in other places), the authors draw conclusions about the English language having as a basis one literary manuscript. This kind of extrapolation is very dangerous! One can draw conclusions about the studied data, the studied manuscript in this case, but not about a language in general.

REPLY: I have added some words of caution after the discussion of Fig. 4.

- There are plenty of typos that should be carefully reviewed and corrected.  

REPLY: The text has been revised in detail and I hope that most typos have been spotted and corrected.

 


Round 2

Reviewer 1 Report

I have reviewed the updated paper, and I can conclude that the author has made a serious effort to take into account all comments.

I now consider the paper suitable for publication.

Reviewer 2 Report

The authors addressed my comments. Therefore, I recommend acceptance.
