# A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript

## Abstract

## 1. Introduction

- It is a manuscript written in an extinct natural language with an exotic alphabet [15].
- It is an encipherment of a known language (possibly Latin, German, or another Indo-European language, although there is no consensus [11]).
- It is a hoax consisting of asemic writing with the objective of making the book strange and valuable to collectors of antiquities [16].
- It is a modern fabrication (perhaps by its discoverer, W. Voynich) [10].

## 2. Hidden Markov Models

#### 2.1. The Forward Algorithm

#### 2.2. The Backward Algorithm

- In the first place, we define ${\beta}_{T-1}\left(i\right)=1$ for $i=0$, 1, …, $N-1$.
- Then, for $t=T-2$, $T-3$, …, 0 we define the recursive relation:$${\beta}_{t}\left(i\right)=\sum _{j=0}^{N-1}\phantom{\rule{0.166667em}{0ex}}{a}_{ij}\phantom{\rule{0.166667em}{0ex}}{b}_{j}\left({\mathcal{O}}_{t+1}\right)\phantom{\rule{0.166667em}{0ex}}{\beta}_{t+1}\left(j\right)\phantom{\rule{0.277778em}{0ex}}.$$
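
The two steps above can be sketched in a few lines of Python. This is a minimal, unscaled illustration rather than the implementation used in this work: the function name `backward`, the list-of-lists layout of `A` and `B`, and the encoding of observations as integer indices in `obs` are assumptions made for the example.

```python
def backward(A, B, obs):
    """Return beta[t][i] for transition matrix A (N x N), observation
    matrix B (N x M) and a list of observation indices obs (length T)."""
    N, T = len(A), len(obs)
    # Initialisation: beta_{T-1}(i) = 1 for every state i.
    beta = [[1.0] * N for _ in range(T)]
    # Recursion, moving backwards from t = T-2 down to t = 0.
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N)
            )
    return beta


# Hypothetical two-state, three-symbol model used only for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
beta = backward(A, B, [0, 1])
```

For long observation sequences the unscaled values of ${\beta}_{t}\left(i\right)$ shrink towards zero and eventually underflow, so a practical implementation would typically rescale them at each step or work with logarithms.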

#### 2.3. Reestimating The Model

- We initialize the model $\lambda =(\mathit{A},\mathit{B},\mathit{\pi})$. A common practice is to choose the elements close to the uniform distribution: ${\mathit{\pi}}_{i}\approx 1/N$, ${a}_{ij}\approx 1/N$, and ${b}_{j}\left(k\right)\approx 1/M$. These values must be randomly perturbed, while keeping every row normalized, to prevent the algorithm from becoming stuck at a local maximum.
- For $i=0$, 1, …, $N-1$ and $j=0$, 1, …, $N-1$ we reestimate the elements of the transition matrix, $\mathit{A}$, as follows:$${a}_{ij}=\frac{{\displaystyle \sum _{t=0}^{T-2}\phantom{\rule{0.166667em}{0ex}}{\gamma}_{t}(i,j)}}{{\displaystyle \sum _{t=0}^{T-2}\phantom{\rule{0.166667em}{0ex}}{\gamma}_{t}\left(i\right)}}\phantom{\rule{0.277778em}{0ex}}.$$
- For $j=0$, 1, …, $N-1$ and $k=0$, 1, …, $M-1$ we compute the new values for the elements of the observation probability matrix, $\mathit{B}$, as follows:$${b}_{j}\left(k\right)=\frac{{\displaystyle \sum _{t\in \{0,1,\dots ,T-1\},\ {\mathcal{O}}_{t}=k}\phantom{\rule{0.166667em}{0ex}}{\gamma}_{t}\left(j\right)}}{{\displaystyle \sum _{t=0}^{T-1}\phantom{\rule{0.166667em}{0ex}}{\gamma}_{t}\left(j\right)}}\phantom{\rule{0.277778em}{0ex}},$$where the sum in the numerator runs only over those values of $t$ for which the observation ${\mathcal{O}}_{t}$ equals the symbol $k$.
- Finally, we compute the probability of the given observation sequence, $P(\mathcal{O}\mid \lambda )$, obtained as the sum of ${\alpha}_{T-1}\left(i\right)$ over all the inner states $i$. If this probability has increased with respect to its previous value, the model update is performed again. In practice, the algorithm is run for a fixed number of iterations or until the probability no longer increases by more than a selected tolerance.
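
The reestimation loop described above can be sketched as follows. This is a minimal, unscaled illustration under stated assumptions, not the code used in this work. The names `forward`, `backward`, `reestimate`, `noisy_row` and `fit` are hypothetical. The forward pass and the quantities ${\gamma}_{t}(i,j)$ and ${\gamma}_{t}\left(i\right)$ follow their standard textbook definitions, as does the update ${\mathit{\pi}}_{i}={\gamma}_{0}\left(i\right)$. The stopping rule compares log-probabilities, and the size of the random perturbation is arbitrary.

```python
import math
import random


def forward(A, B, pi, obs):
    # alpha_0(i) = pi_i b_i(O_0); alpha_t(i) = [sum_j alpha_{t-1}(j) a_ji] b_i(O_t)
    N = len(A)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] for j in range(N)) * B[i][obs[t]]
                      for i in range(N)])
    return alpha


def backward(A, B, obs):
    # beta_{T-1}(i) = 1, then the backward recursion down to t = 0.
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta


def reestimate(A, B, pi, obs):
    """One update of (A, B, pi); also returns P(O | lambda) for the input model."""
    N, M, T = len(A), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    prob = sum(alpha[T - 1])
    # digamma[t][i][j] is gamma_t(i, j); gamma[t][i] is gamma_t(i).
    digamma = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob
                 for j in range(N)] for i in range(N)] for t in range(T - 1)]
    gamma = [[sum(row) for row in digamma[t]] for t in range(T - 1)]
    gamma.append([alpha[T - 1][i] / prob for i in range(N)])
    new_pi = list(gamma[0])
    new_A = [[sum(digamma[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k)
              / sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi, prob


def noisy_row(n, rng):
    # Roughly uniform probabilities with a small random perturbation (assumed size).
    row = [(1.0 + rng.uniform(-0.1, 0.1)) / n for _ in range(n)]
    total = sum(row)
    return [x / total for x in row]


def fit(obs, N, M, max_iters=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    A = [noisy_row(N, rng) for _ in range(N)]
    B = [noisy_row(M, rng) for _ in range(N)]
    pi = noisy_row(N, rng)
    prev_log = -math.inf
    for _ in range(max_iters):
        new_A, new_B, new_pi, prob = reestimate(A, B, pi, obs)
        log_prob = math.log(prob)
        if log_prob - prev_log <= tol:  # no meaningful improvement: stop
            break
        A, B, pi, prev_log = new_A, new_B, new_pi, log_prob
    return A, B, pi


# Toy observation sequence over M = 3 symbols, for illustration only.
A, B, pi = fit([0, 1, 0, 2, 1, 0, 0, 2], N=2, M=3, max_iters=10)
```

Because the probabilities are not rescaled, this sketch is only suitable for short sequences. On a text of realistic length, $P(\mathcal{O}\mid \lambda )$ underflows, and the forward and backward values would need to be scaled at every step.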

#### 2.4. Applications to Linguistics of HMM and Other Network Models

## 3. Results

#### 3.1. Application to The Quixote

- The most frequent vowel in the English language is “e”.
- The space between words has the structural function of a vowel, although it has no associated sound.
- The letter “y” is mostly a vowel in the English language. Indeed, the Oxford Dictionary classifies it as a vowel in some cases (“myth”), as a semivowel in others (“yes”), and as part of a diphthong in others (“my”) [35].

#### 3.2. Application to the Voynich Manuscript

## 4. Discussion

