# A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript

Camino de Vera, Universitat Politècnica de València, 46022 Valencia, Spain

## Abstract


## 1. Introduction

- It is a manuscript written in an extinct natural language with an exotic alphabet [15].
- It is the encipherment of a known language (possibly Latin, German, or another Indo-European language, although there is no consensus [11]).
- It is a hoax consisting of asemic writing with the objective of making the book strange and valuable to collectors of antiquities [16].
- It is a modern fabrication (perhaps by its discoverer, W. Voynich) [10].

## 2. Hidden Markov Models

#### 2.1. The Forward Algorithm

#### 2.2. The Backward Algorithm

- In the first place, we define $\beta_{T-1}(i)=1$ for $i=0, 1, \dots, N-1$.
- Then, for $t=T-2, T-3, \dots, 0$, we define the recursive relation: $$\beta_{t}(i)=\sum_{j=0}^{N-1} a_{ij}\, b_{j}(\mathcal{O}_{t+1})\, \beta_{t+1}(j).$$
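The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the results: the toy matrices `A`, `B`, `pi` and the sequence `obs` are made-up values, and the scaling that a practical implementation would use to avoid numerical underflow on long sequences is omitted.

```python
def backward(A, B, obs):
    """Unscaled backward pass: returns beta[t][i] for t = 0..T-1 and i = 0..N-1."""
    N, T = len(A), len(obs)
    # Initialization: beta_{T-1}(i) = 1 for every hidden state i.
    beta = [[1.0] * N for _ in range(T)]
    # Recursion, moving backwards from t = T-2 down to t = 0.
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N)
            )
    return beta


# Toy two-state, two-symbol model (illustrative values only).
A = [[0.7, 0.3], [0.4, 0.6]]   # transition probabilities a_ij
B = [[0.9, 0.1], [0.2, 0.8]]   # observation probabilities b_j(k)
pi = [0.5, 0.5]                # initial state distribution
obs = [0, 1, 0]                # observation sequence O_0, O_1, O_2

beta = backward(A, B, obs)
# The sequence probability can be recovered from the backward variables at t = 0.
prob = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
```

Recovering $P(\mathcal{O}\mid\lambda)$ from $\beta_0$ gives a convenient consistency check against the value obtained from the forward pass.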

#### 2.3. Reestimating The Model

- We initialize the model $\lambda = (A, B, \pi)$. It is common practice to start from approximately uniform values, $\pi_{i}\approx 1/N$, $a_{ij}\approx 1/N$, and $b_{j}(k)\approx 1/M$, but these values must be randomized, because exactly uniform values leave the algorithm stuck at a local maximum.
- For $i=0, 1, \dots, N-1$ and $j=0, 1, \dots, N-1$, we reestimate the elements of the transition matrix, $A$, as follows: $$a_{ij}=\frac{\displaystyle\sum_{t=0}^{T-2}\gamma_{t}(i,j)}{\displaystyle\sum_{t=0}^{T-2}\gamma_{t}(i)}.$$
- For $j=0, 1, \dots, N-1$ and $k=0, 1, \dots, M-1$, we compute the new values for the elements of the observation probability matrix as follows: $$b_{j}(k)=\frac{\displaystyle\sum_{\substack{t\in\{0,1,\dots,T-1\}\\ \mathcal{O}_{t}=k}}\gamma_{t}(j)}{\displaystyle\sum_{t=0}^{T-1}\gamma_{t}(j)}.$$
- Finally, we compute the probability of the given observation sequence, $P(\mathcal{O}\mid\lambda)$, obtained as the sum of $\alpha_{T-1}(i)$ over all hidden states $i$. If this probability increases with respect to its previous value, the model update is performed again. In practice, the algorithm is run for a fixed number of iterations or until the probability no longer increases by more than a selected tolerance.
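The re-estimation loop described above can be sketched as follows. This is a minimal, unscaled illustration and not the implementation used for the results. It assumes the standard definitions $\gamma_{t}(i,j)=\alpha_{t}(i)\,a_{ij}\,b_{j}(\mathcal{O}_{t+1})\,\beta_{t+1}(j)/P(\mathcal{O}\mid\lambda)$, $\gamma_{t}(i)=\sum_{j}\gamma_{t}(i,j)$ and the update $\pi_{i}=\gamma_{0}(i)$, none of which are spelled out in the list above. The helper names, the toy observation sequence, the perturbation size and the tolerance are all hypothetical choices.

```python
import random


def forward(A, B, pi, obs):
    """Unscaled forward pass: alpha[t][i]."""
    N, T = len(A), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for i in range(N):
            alpha[t][i] = B[i][obs[t]] * sum(alpha[t - 1][j] * A[j][i] for j in range(N))
    return alpha


def backward(A, B, obs):
    """Unscaled backward pass: beta[t][i], with beta[T-1][i] = 1."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    return beta


def reestimate(A, B, pi, obs):
    """One re-estimation step. Returns the updated (A, B, pi) together with
    P(O | lambda) evaluated for the model that was passed in."""
    N, M, T = len(A), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    prob = sum(alpha[T - 1])
    # digamma[t][i][j] is gamma_t(i, j); gamma[t][i] is gamma_t(i).
    digamma = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob
                 for j in range(N)] for i in range(N)] for t in range(T - 1)]
    gamma = [[sum(digamma[t][i]) for i in range(N)] for t in range(T - 1)]
    gamma.append([alpha[T - 1][i] / prob for i in range(N)])
    new_pi = gamma[0][:]
    new_A = [[sum(digamma[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k)
              / sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi, prob


def random_stochastic_row(n, rng):
    """Near-uniform probabilities with a small random perturbation, summing to 1."""
    row = [(1.0 + rng.uniform(-0.1, 0.1)) / n for _ in range(n)]
    total = sum(row)
    return [x / total for x in row]


rng = random.Random(0)                        # fixed seed, for reproducibility only
N, M = 2, 3                                   # hidden states, observation symbols
obs = [0, 1, 2, 0, 1, 2, 2, 1, 0, 0, 1, 2]    # toy sequence (illustrative)
A = [random_stochastic_row(N, rng) for _ in range(N)]
B = [random_stochastic_row(M, rng) for _ in range(N)]
pi = random_stochastic_row(N, rng)

history, old_prob, tol = [], 0.0, 1e-6
for _ in range(50):                           # fixed cap on the number of steps
    new_A, new_B, new_pi, prob = reestimate(A, B, pi, obs)
    history.append(prob)
    if prob <= old_prob * (1.0 + tol):        # relative gain below tolerance: stop
        break
    A, B, pi, old_prob = new_A, new_B, new_pi, prob
```

The list `history` records $P(\mathcal{O}\mid\lambda)$ at each step and should never decrease. For sequences as long as a book, the unscaled products underflow, so a scaled or log-domain variant would be needed.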

#### 2.4. Applications to Linguistics of HMM and Other Network Models

## 3. Results

#### 3.1. Application to The Quixote

- The most frequent vowel in the English language is “e”.
- The space between words has the structural function of a vowel, although it has no associated sound.
- The letter “y” is mostly a vowel in the English language. Indeed, the Oxford Dictionary classifies it as a vowel in some cases (“myth”), a semivowel in others (“yes”) or forming a diphthong (as in “my”) [35].
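Observations of this kind rest on treating a text as a sequence of letters plus the space. The sketch below shows one way such a sequence and its symbol frequencies might be prepared. The 27-symbol alphabet, the cleaning rules (lowercasing, replacing every other character by a space, collapsing runs of spaces) and the sample sentence are assumptions made for illustration, not the preprocessing used for the results.

```python
from collections import Counter
import string

# Assumed alphabet: 26 lowercase letters plus the space, i.e. 27 symbols.
ALPHABET = string.ascii_lowercase + " "


def encode(text):
    """Map a text to a list of symbol indices in 0..26."""
    # Lowercase, and turn anything that is not a-z into a space.
    cleaned = "".join(c if c in string.ascii_lowercase else " " for c in text.lower())
    # Collapse runs of spaces and trim the ends.
    cleaned = " ".join(cleaned.split())
    return [ALPHABET.index(c) for c in cleaned]


sample = "In a village of La Mancha, the name of which I have no desire to call to mind"
obs = encode(sample)

# Symbol frequencies, keyed by character rather than by index.
freq = Counter(ALPHABET[k] for k in obs)
```

A list such as `obs` is the kind of observation sequence an HMM can be trained on, and `freq` is the raw material for a letter histogram.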

#### 3.2. Application to the Voynich Manuscript

## 4. Discussion

## Supplementary Materials

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| EVA | European Voynich Alphabet |
| HMM | Hidden Markov Models |
| NLP | Natural Language Processing |

## References

1. Stamp, M. A Revealing Introduction to Hidden Markov Models. Available online: http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf (accessed on 19 January 2019).
2. Ghahramani, Z. An Introduction to Hidden Markov Models and Bayesian Networks. Int. J. Pattern Recognit. Artif. Intell. **2001**, 15, 9–42.
3. Yoon, B.J. Hidden Markov Models and their Applications in Biological Sequence Analysis. Curr. Genom. **2009**, 10, 402–415.
4. Juang, B.H.; Rabiner, L.R. Hidden Markov Models for Speech Recognition. Technometrics **1991**, 33, 251–272.
5. Bicego, M.; Castellani, U.; Murino, V. Using Hidden Markov models and wavelets for face recognition. In Proceedings of the 12th International Conference on Image Analysis and Processing, Mantova, Italy, 17–19 September 2003.
6. Lefèvre, S.; Bouton, E.; Brouard, T.; Vincent, N. A new way to use Hidden Markov Models for object tracking in video sequences. In Proceedings of the 2003 International Conference on Image Processing, Barcelona, Spain, 14–18 September 2003.
7. Cave, R.L.; Neuwirth, L.P. Hidden Markov Models for English. In Hidden Markov Models for Speech; IDA-CRD: Princeton, NJ, USA, 1980. Available online: https://www.cs.sjsu.edu/~stamp/RUA/CaveNeuwirth/index.html (accessed on 19 January 2019).
8. Suleiman, D.; Awajan, A.; Al Etaiwi, W. The Use of Hidden Markov Model in Natural ARABIC Language Processing: A Survey. Proc. Comput. Sci. **2017**, 113, 240–247.
9. Okhovvat, M.; Bidgoli, B.M. A Hidden Markov Model for Persian Part-of-Speech Tagging. Proc. Comput. Sci. **2011**, 3, 977–981.
10. Zandbergen, R. The Voynich Manuscript. Available online: http://www.voynich.nu (accessed on 19 January 2019).
11. D’Imperio, M.E. The Voynich Manuscript: An Elegant Enigma; National Security Agency, Central Security Service: Maryland, MD, USA, 1978.
12. Repp, K. Materials Analysis of the Voynich Manuscript. Available online: https://beinecke.library.yale.edu/sites/default/files/voynich_analysis.pdf (accessed on 19 January 2019).
13. Zandbergen, R. The Radio-Carbon Dating of the Voynich MS. Available online: http://www.voynich.nu/extra/carbon.html (accessed on 19 January 2019).
14. Cappelli, A. The Elements of Abbreviation in Medieval Latin Paleography. Translated by Heimann, D. and Kay, R. University of Kansas Libraries, 1982. (Translation of the original, Lexicon abbreviaturarum, published in 1899). Available online: https://kuscholarworks.ku.edu/bitstream/handle/1808/1821/47cappelli.pdf (accessed on 19 January 2019).
15. Bax, S. A Proposed Partial Decoding of the Voynich Script. Available online: https://stephenbax.net/wp-content/uploads/2014/01/Voynich-a-provisional-partial-decoding-BAX.pdf (accessed on 19 January 2019).
16. Rugg, G.; Taylor, G. Hoaxing statistical features of the Voynich Manuscript. Cryptologia **2017**, 41, 247–268.
17. Koehn, P. Statistical Machine Translation; Cambridge University Press: Cambridge, UK, 2009; p. 27.
18. Goldberg, Y. A Primer on Neural Network Models for Natural Language Processing. J. Artif. Intell. Res. **2016**, 57.
19. Deng, L.; Liu, Y. Deep Learning in Natural Language Processing; Springer: Singapore, 2018.
20. Baum, L.E.; Petrie, T. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. Ann. Math. Stat. **1966**, 37, 1554–1563.
21. Baum, L.E.; Eagon, J.A. An Inequality with Applications to Statistical Estimation for Probabilistic Functions of a Markov Process and to a Model for Ecology. Bull. Am. Math. Soc. **1967**, 73, 360–363.
22. Baum, L.E. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. Ann. Math. Stat. **1970**, 41, 164–171.
23. Vogel, S.; Ney, H.; Tillmann, C. HMM-Based Word Alignment in Statistical Translation. Available online: http://aclweb.org/anthology/C96-2141 (accessed on 19 January 2019).
24. Wright, C.; Ballard, L.; Coull, S.; Monrose, F.; Masson, G. Spot Me If You Can: Uncovering Spoken Phrases in Encrypted VoIP Conversations. Available online: https://ieeexplore.ieee.org/document/4531143/authors#authors (accessed on 19 January 2019).
25. Baker, J.K. The DRAGON system—An overview. IEEE Trans. Acoust. Speech Signal Process. **1975**, 23, 24–29.
26. Graves, A.; Mohamed, A.-R.; Hinton, G. Speech Recognition with Deep Recurrent Neural Networks. Available online: https://arxiv.org/abs/1303.5778 (accessed on 19 January 2019).
27. Amancio, D.R. Probing the Topological Properties of Complex Networks Modeling Short Written Texts. PLoS ONE **2015**, 10, e0118394.
28. Nebil, O.; Malod-Dognin, N.; Davis, D.; Levnajic, Z.; Janjic, V.; Karapandza, R.; Stojmirovic, A.; Pržulj, N. Revealing the Hidden Language of Complex Networks. Sci. Rep. **2014**, 4, 4547.
29. Amancio, D.R. Comparing the topological properties of real and artificially generated scientific manuscripts. Scientometrics **2015**, 105, 1763–1779.
30. Akimushkin, C.; Amancio, D.R.; Oliveira, O.N., Jr. Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks. PLoS ONE **2017**, 12, e0170527.
31. De Arruda, H.F.; Marinho, V.Q.; Costa, L.d.F.; Amancio, D.R. Paragraph-Based Complex Networks: Application to Document Classification and Authenticity Verification. Available online: https://arxiv.org/abs/1806.08467 (accessed on 19 January 2019).
32. Amancio, D.R.; Altmann, E.G.; Rybski, D.; Oliveira, O.N., Jr.; Costa, L.D.F. Probing the Statistical Properties of Unknown Texts: Application to the Voynich Manuscript. PLoS ONE **2013**, 8, e67310.
33. Gutenberg Project. The Quixote by Miguel de Cervantes Saavedra. Available online: http://www.gutenberg.org/ebooks/996 (accessed on 19 January 2019).
34. An implementation in C++ of the HMM algorithm developed by M. Stamp. Available online: http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM_ref_fast.zip (accessed on 19 January 2019).
35. Is the Letter “Y” a Vowel or a Consonant? Available online: https://en.oxforddictionaries.com/explore/is-the-letter-y-a-vowel-or-a-consonant/ (accessed on 19 January 2019).
36. Zandbergen, R. What We May Learn from the MS Text Entropy. Available online: http://www.voynich.nu/extra/sol_ent.html (accessed on 19 January 2019).
37. Montemurro, M.A.; Zanette, D.H. Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis. PLoS ONE **2013**, 8, e66344.
38. Hauer, B.; Kondrak, G. Decoding Anagrammed Texts Written in an Unknown Language and Script. Trans. Assoc. Comput. Linguist. **2016**, 4, 75–86.
39. Reddy, S.; Knight, K. What we know about the Voynich Manuscript. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Portland, OR, USA, 24 June 2011; pp. 78–86.
40. D’Imperio, M. An Application of PTAH to the Voynich Manuscript (U). Natl. Secur. Agency Tech. J. **1979**, 24, 65–91. Available online: https://www.nsa.gov/Portals/70/documents/news-features/declassified-documents/tech-journals/application-of-ptah.pdf (accessed on 19 January 2019).

**Figure 3.** The evolution of the logarithm of the observation sequence’s probability as a function of the iteration. Notice the fast convergence to an asymptotic “plateau”.

**Figure 4.** The probability of observation of a given letter for the hidden state 2. Notice the peaks for the vowels as well as “y” and the space.

**Figure 5.** The histogram of the frequency of all letters (and the space) in the English version of the Quixote.

**Figure 6.** The correspondence between the Voynich symbols and the associated letters in the EVA transcription system. Notice that this is merely an arbitrary encoding, unrelated to the actual phonemes that the Voynich characters might represent.

**Figure 7.** The probability of observation of a given character in the Voynich manuscript for the hidden state 1. The peaks correspond to the symbols “a”, “c”, “e”, “i”, “o”, “s” and “y” of the EVA alphabet. There is also a peak for the space between words.

**Figure 8.** The same as Figure 7 but for the hidden state 2.

**Figure 9.** Some words in the Voynich manuscript and the phonetic transcriptions proposed by Bax [15].

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Acedo, L.
A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript. *Math. Comput. Appl.* **2019**, *24*, 14.
https://doi.org/10.3390/mca24010014
