# An Efficient Coding Technique for Stochastic Processes


## Abstract


## 1. Introduction

## 2. Theoretical Background and Optimal Codes

**Definition 1.**

**Definition 2.**

**Definition 3.**

**Definition 4.**

**Definition 5.**

**Definition 6.**

**Definition 7.**

**Algorithm 1: Huffman Code**

Input: $\mathbb{P}=\{p(1),p(2),\dots,p(m)\}$, $G=\{1,2,\dots,m\}$

1. If $|\mathbb{P}|=1$, $C(1):=0$.
2. Otherwise, while $|\mathbb{P}|>1$:
    - define
        - $I:=\arg\min\{p(i)\in\mathbb{P}: i\in G\}$
        - $J:=\arg\min\{p(i)\in\mathbb{P}\setminus\{p(k): k\in I\}: i\in G\}$
        - $H:=I\cup J$
    - if $|H|=2$, set $l(h):=0\ \forall h\in H$, $C(I):=0$ and $C(J):=1$;
    - otherwise, if $h\in I$, $C(h):=0C(h)$, and if $h\in J$, $C(h):=1C(h)$;
    - set $I':=H$, $\mathbb{P}':=\{p(k): k\in I'\}$, $p(I'):=\sum_{k\in I'}p(k)$, $G':=\{k: k\in I'\}$;
    - define $\mathbb{P}:=\{\mathbb{P}\setminus\mathbb{P}'\}\cup\{p(I')\}$, $G:=\{G\setminus G'\}\cup\{I'\}$ and return to 1.

Output: $\{C(1),C(2),\dots,C(m)\}$
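As a minimal sketch (not the authors' implementation), Algorithm 1 can be expressed with a priority queue; ties between equally probable groups may be broken differently from Example 1 below, producing a different but equally optimal code:

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Binary Huffman code for a dict {symbol: probability}.

    Illustrative sketch of Algorithm 1: repeatedly pop the two least
    probable groups I and J, prefix '0' to the codewords in I and '1'
    to the codewords in J, and push the merged group back.
    """
    if len(probs) == 1:
        return {next(iter(probs)): "0"}
    tie = count()  # tie-breaker so heap tuples never compare dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p_i, _, group_i = heapq.heappop(heap)  # group I (smallest prob.)
        p_j, _, group_j = heapq.heappop(heap)  # group J (next smallest)
        merged = {s: "0" + c for s, c in group_i.items()}
        merged.update({s: "1" + c for s, c in group_j.items()})
        heapq.heappush(heap, (p_i + p_j, next(tie), merged))
    return heap[0][2]
```

Since every Huffman code for a fixed distribution attains the same (minimum) expected codelength, any tie-breaking rule yields $24/11$ bits per symbol for the distribution of Example 1.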

- (a) For $i=1$, $C'(i)$ is the concatenation of $l_1$ zeroes.
- (b) For $j=i+1$, assign $C'(j)$ as the next (larger) binary number. Do this until the first $k\in A$ at which the codelength increases, and append zeroes to the right of $C'(k)$ until $l(C'(k))=l_k$.
- (c) Repeat (b) until $i=m$.
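Steps (a)–(c) translate directly into code: sort the symbols by nondecreasing codelength, then increment a binary counter, shifting in zeroes whenever the codelength grows. A sketch under these assumptions (the function name is ours):

```python
def canonical_code(lengths):
    """Canonical Huffman codewords from codelengths.

    lengths: dict symbol -> codelength. Following steps (a)-(c), the
    first codeword is l_1 zeros; each subsequent codeword is the
    previous one plus one as a binary number, with zeros appended
    (a left shift) whenever the codelength increases.
    """
    items = sorted(lengths.items(), key=lambda kv: kv[1])
    codes, value, prev_len = {}, 0, items[0][1]
    for i, (sym, l) in enumerate(items):
        if i > 0:
            value += 1
        value <<= (l - prev_len)  # append zeros when the length grows
        codes[sym] = format(value, "0{}b".format(l))
        prev_len = l
    return codes
```

With the codelengths of Table 1 ($1,2,3,4,4$) this reproduces the CHC column: `0`, `10`, `110`, `1110`, `1111`.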

**Example 1.**

- i. $\mathbb{P}=\{p(a)=\frac{1}{11}, p(b)=\frac{4}{11}, p(c)=\frac{3}{11}, p(d)=\frac{2}{11}, p(e)=\frac{1}{11}\}$, $G=\{a,b,c,d,e\}$
    - $I=a$, $J=e$, $H=\{a,e\}$, and $C(a)=0$, $C(e)=1$
    - $I'=\{a,e\}$, $p(I')=\frac{2}{11}$, $G'=\{a,e\}$ $\Rightarrow$ $\mathbb{P}=\{p(I')=\frac{2}{11}, p(b)=\frac{4}{11}, p(c)=\frac{3}{11}, p(d)=\frac{2}{11}\}$, $G=\{G',b,c,d\}$
- ii. $\mathbb{P}=\{p(\{a,e\})=\frac{2}{11}, p(b)=\frac{4}{11}, p(c)=\frac{3}{11}, p(d)=\frac{2}{11}\}$, $G=\{\{a,e\},b,c,d\}$
    - $I=\{a,e\}$, $J=d$, $H=\{\{a,e\},d\}$, and $C(a)=00$, $C(e)=01$, $C(d)=1$
    - $I'=\{a,e,d\}$, $p(I')=\frac{4}{11}$, $G'=\{a,e,d\}$ $\Rightarrow$ $\mathbb{P}=\{p(I')=\frac{4}{11}, p(b)=\frac{4}{11}, p(c)=\frac{3}{11}\}$, $G=\{G',b,c\}$
- iii. $\mathbb{P}=\{p(\{a,e,d\})=\frac{4}{11}, p(b)=\frac{4}{11}, p(c)=\frac{3}{11}\}$, $G=\{\{a,e,d\},b,c\}$
    - $I=c$, $J=\{a,e,d\}$, $H=\{c,\{a,e,d\}\}$, and $C(c)=0$, $C(a)=100$, $C(e)=101$, $C(d)=11$
    - $I'=\{a,e,d,c\}$, $p(I')=\frac{7}{11}$, $G'=\{a,e,d,c\}$ $\Rightarrow$ $\mathbb{P}=\{p(I')=\frac{7}{11}, p(b)=\frac{4}{11}\}$, $G=\{G',b\}$
- iv. $\mathbb{P}=\{p(\{a,e,d,c\})=\frac{7}{11}, p(b)=\frac{4}{11}\}$, $G=\{\{a,e,d,c\},b\}$
    - $I=b$, $J=\{a,e,d,c\}$, $H=\{b,\{a,e,d,c\}\}$, obtaining the following Huffman code for $X$: $C(b)=0$, $C(c)=10$, $C(a)=1100$, $C(e)=1101$, $C(d)=111$.
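The code obtained at step iv can be checked mechanically: it is prefix-free, satisfies Kraft's equality, and attains an expected codelength of $24/11 \approx 2.18$ bits per symbol. A small verification script:

```python
# Huffman code for X obtained at step iv of Example 1
code = {'b': '0', 'c': '10', 'a': '1100', 'e': '1101', 'd': '111'}
p = {'a': 1/11, 'b': 4/11, 'c': 3/11, 'd': 2/11, 'e': 1/11}

# Prefix-free: in sorted order, no codeword is a prefix of its successor
words = sorted(code.values())
assert all(not words[i + 1].startswith(words[i]) for i in range(len(words) - 1))

# Kraft equality: the sum of 2^{-l} equals 1 for a complete binary code
assert sum(2.0 ** -len(c) for c in code.values()) == 1.0

# Expected codelength: 24/11 bits per symbol
mean_len = sum(p[s] * len(c) for s, c in code.items())
assert abs(mean_len - 24 / 11) < 1e-12
```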

**Example 2.**

## 3. Stochastic Estimation Strategy and Optimal Coding

**Definition 8.**

- 1. $s,r\in \mathcal{S}$ are equivalent if $P(a\mid s)=P(a\mid r)\ \forall a\in A$.
- 2. $(X_t)$ is a Markov chain with partition $\mathcal{L}=\{L_1,L_2,\dots,L_{|\mathcal{L}|}\}$ if this partition is the one induced by the equivalence relation introduced in item 1.
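Item 1 defines an equivalence on states; as a toy sketch (the function name is ours), states whose transition rows coincide exactly can be grouped as below. In practice the rows are estimated from data and compared via a statistical criterion rather than exact equality:

```python
from collections import defaultdict

def partition_states(P):
    """Group states with identical transition rows (Definition 8).

    P: dict state -> dict {symbol: P(symbol | state)}. States s and r
    fall in the same part when P(a|s) == P(a|r) for every symbol a.
    """
    parts = defaultdict(list)
    for s, row in P.items():
        # A sorted tuple of (symbol, probability) pairs is a hashable
        # fingerprint of the whole transition row.
        parts[tuple(sorted(row.items()))].append(s)
    return [sorted(group) for group in parts.values()]
```

For instance, two states sharing the row $\{0\mapsto 0.6,\ 1\mapsto 0.4\}$ land in the same part, while a state with a different row forms its own part.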

**Remark 1.**

**Theorem 1.**

**Proof.**

**Corollary 1.**

**Remark 2.**

**Remark 3.**

## 4. Codification of the Model Information

#### 4.1. Codification of the Transition Frequencies

#### 4.2. Codification of the Model

**Example 3.**

**Example 4.**

## 5. Simulation Study and Application

#### 5.1. Simulation Study

**Simulation 1.**

**Simulation 2.**

**Simulation 3.**

**Simulation 4.**

#### 5.2. SARS-CoV-2 and Compression

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest


**Figure 1.** Scheme of the past needed to determine the partition of the state space of the SARS-CoV-2 process at time $t$; the dashed line marks the irrelevant period, whose limits $[t-8,t-4]$ are indicated on top of the scheme.

**Table 1.** Huffman and canonical Huffman codes for $X$. From left to right: $x$, $P(X=x)$, $y$, $P(Y=y)$, Huffman code for $Y$, codelength for $Y$, CHC for $Y$ $\left({C}_{Y}^{\prime}\right)$, and codelength increments.

| x | $P(X=x)$ | y | $P(Y=y)$ | $C_Y(y)$ (Huffman) | $l(y)$ | $C_Y^{\prime}(y)$ (CHC) | $l(y)$ increment |
|---|---|---|---|---|---|---|---|
| b | 4/11 | a | 4/11 | 0 | 1 | 0 | 1 |
| c | 3/11 | b | 3/11 | 10 | 2 | 10 | 1 |
| d | 2/11 | c | 2/11 | 111 | 3 | 110 | 1 |
| a | 1/11 | d | 1/11 | 1100 | 4 | 1110 | 1 |
| e | 1/11 | e | 1/11 | 1101 | 4 | 1111 | 0 |

**Table 2.** Left: transition probabilities $P(x\mid s)$. Right: conditional canonical Huffman code $C(x\mid s)$ and length $l(x\mid s)$, $x\in A=\{a,b,c\}$, $s\in \mathcal{S}=\{a,b,c\}$.

|  | $P(a\mid s)$ | $P(b\mid s)$ | $P(c\mid s)$ | $C(a\mid s)$ | $l(a\mid s)$ | $C(b\mid s)$ | $l(b\mid s)$ | $C(c\mid s)$ | $l(c\mid s)$ |
|---|---|---|---|---|---|---|---|---|---|
| $s=a$ | 1/6 | 2/6 | 3/6 | 11 | 2 | 10 | 2 | 0 | 1 |
| $s=b$ | 1/3 | 1/3 | 1/3 | 10 | 2 | 11 | 2 | 0 | 1 |
| $s=c$ | 2/7 | 1/7 | 4/7 | 10 | 2 | 11 | 2 | 0 | 1 |
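With the conditional code of Table 2, a sequence is encoded symbol by symbol using the code attached to the current state (here, the previous symbol); a decoder recovers the sequence by tracking the same state. A minimal sketch with the right-hand side of the table hard-coded:

```python
# Conditional canonical Huffman code C(x|s) from Table 2 (right side)
C = {'a': {'a': '11', 'b': '10', 'c': '0'},
     'b': {'a': '10', 'b': '11', 'c': '0'},
     'c': {'a': '10', 'b': '11', 'c': '0'}}

def encode(seq, state):
    """Encode seq with the code of the current state; each emitted
    symbol then becomes the new state (order-1 chain)."""
    bits = []
    for x in seq:
        bits.append(C[state][x])
        state = x
    return ''.join(bits)
```

For instance, `encode('abc', 'a')` concatenates $C(a\mid a)=11$, $C(b\mid a)=10$ and $C(c\mid b)=0$, giving `11100`.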

**Table 3.** List of parts of $(X_t)$. The second column reports the composition of the part (on the left) and the third column reports the cardinality of the part.

| Part ($L_i$) | Composition of $L_i$ | $|L_i|$ |
|---|---|---|
| $L_1$ | $\{000,011,100,111\}$ | 4 |
| $L_2$ | $\{001,010,101\}$ | 3 |
| $L_3$ | $\{110\}$ | 1 |

**Table 4.** Index of the part (Table 3) containing each state $s$.

| s | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|
| index | 1 | 2 | 2 | 1 | 1 | 2 | 3 | 1 |
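The index row above is a lookup built from the parts of Table 3; a minimal sketch:

```python
# Parts from Table 3, keyed by the index of each part
parts = {1: {'000', '011', '100', '111'},
         2: {'001', '010', '101'},
         3: {'110'}}

# Invert the partition: map each state to the index of its part
index_of = {s: i for i, part in parts.items() for s in part}
```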

**Table 5.** List of indices that identify the states with the parts reported in Table 3. The second column reports the frequency of each index. The third column reports the Huffman code of the index. The last three columns report the CHC, the codelengths, and the increments in codelengths, respectively.

| Index | Frequency | Huffman | CHC | Codelength | Increase in codelength |
|---|---|---|---|---|---|
| 1 | 4/8 | 1 | 0 | 1 | 1 |
| 2 | 3/8 | 01 | 10 | 2 | 1 |
| 3 | 1/8 | 00 | 11 | 2 | 0 |

**Table 6.** Counts (and binary form) of each state $s$ followed by 0 (and 1), from a message $x_1^n$; the states are grouped into the parts reported in Table 3.

| $s\to$ | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|
| $N_n(s,0)$ | 30 | 24 | 26 | 31 | 29 | 25 | 10 | 30 |
| binary | 11110 | 11000 | 11010 | 11111 | 11101 | 11001 | 1010 | 11110 |
| $N_n(s,1)$ | 20 | 25 | 25 | 20 | 20 | 25 | 30 | 20 |
| binary | 10100 | 11001 | 11001 | 10100 | 10100 | 11001 | 11110 | 10100 |

**Table 7.** Counts of each part $L$ followed by 0 (and 1), from the message represented by the partition in Table 3.

| $L\to$ | $L_1$ | $L_2$ | $L_3$ |
|---|---|---|---|
| $N_n(L,0)$ | 120 | 75 | 10 |
| binary | 1111000 | 1001011 | 1010 |
| $N_n(L,1)$ | 80 | 75 | 30 |
| binary | 1010000 | 1001011 | 11110 |
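The part counts of Table 7 follow from the state counts of Table 6 by summation over each part, $N_n(L,a)=\sum_{s\in L} N_n(s,a)$; a quick check:

```python
# Per-state counts from Table 6
N0 = {'000': 30, '001': 24, '010': 26, '011': 31,
      '100': 29, '101': 25, '110': 10, '111': 30}
N1 = {'000': 20, '001': 25, '010': 25, '011': 20,
      '100': 20, '101': 25, '110': 30, '111': 20}

# Parts from Table 3
parts = {'L1': ['000', '011', '100', '111'],
         'L2': ['001', '010', '101'],
         'L3': ['110']}

# N_n(L, a) = sum over s in L of N_n(s, a), as in Table 7
NL0 = {L: sum(N0[s] for s in states) for L, states in parts.items()}
NL1 = {L: sum(N1[s] for s in states) for L, states in parts.items()}
```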

**Table 8.** $|\{N_n(s,a): s\in \mathcal{S}, a\in A\}|$, $|\{N_n(L,a): L\in \mathcal{L}, a\in A\}|$ and the number of bits necessary for the transmission, in each case.

|  | $\{N_n(s,a): s\in \mathcal{S}, a\in A\}$ | $\{N_n(L,a): L\in \mathcal{L}, a\in A\}$ |
|---|---|---|
| Number of counts | 16 | 6 |
| Number of bits | 147 | 69 |

**Table 9.** Results for the simulation study. From left to right: (1) model description: label (situation), order $o$, number of parts $|\mathcal{L}|$, and size of the alphabet $|A|$; (2) sample size $n$ of $x_1^n$. The next columns report mean values over 100 samples, on the left for the strategy by States and on the right for the strategy by Parts: (3) bits for the model: cost in bits for the codification of the model (columns 6 and 7); (4) bits for the data: cost in bits for the codification of the data $x_1^n$ (columns 8 and 9); (5) bits (total): total number of bits (columns 10 and 11); (6) compression ratio $R_c$ (columns 12 and 13). In bold type, the cases with the smallest total number of bits and the higher $R_c$.

| Label | $o$ | $|\mathcal{L}|$ | $|A|$ | n | Model (States) | Model (Parts) | Data (States) | Data (Parts) | Total (States) | Total (Parts) | $R_c$ (States) | $R_c$ (Parts) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 64 | 4 | 1000 | 2326.2 | 366.6 | 831.9 | 1121.5 | 3158.1 | **1488.1** | 0.64 | **1.35** |
| 1 | 3 | 64 | 4 | 5000 | 3143.5 | 829.2 | 4673.2 | 5131.6 | 7816.6 | **5960.8** | 1.28 | **1.68** |
| 1 | 3 | 64 | 4 | 10,000 | 3401.5 | 1299.9 | 9380.3 | 9864.9 | 12,781.8 | **11,164.8** | 1.57 | **1.80** |
| 2 | 4 | 256 | 4 | 1000 | 3415.9 | 567.3 | 653.7 | 1558.2 | 4069.6 | **2125.5** | 0.49 | **0.94** |
| 2 | 4 | 256 | 4 | 5000 | 11,508.1 | 1069.4 | 4379.6 | 5842.1 | 15,887.7 | **6911.5** | 0.63 | **1.45** |
| 2 | 4 | 256 | 4 | 10,000 | 13,449.8 | 1391.1 | 9206.2 | 11,103.7 | 22,656.0 | **12,494.7** | 0.88 | **1.60** |
| 3 | 3 | 2 | 3 | 1000 | 713.0 | 116.7 | 1125.6 | 1166.1 | 1838.6 | **1282.8** | 0.82 | **1.17** |
| 3 | 3 | 2 | 3 | 5000 | 991.5 | 131.6 | 5777.3 | 5816.8 | 6768.8 | **5948.4** | 1.11 | **1.26** |
| 3 | 3 | 2 | 3 | 10,000 | 1074.7 | 138.7 | 11,572.6 | 11,620.8 | 12,647.3 | **11,759.5** | 1.19 | **1.28** |
| 4 | 3 | 2 | 3 | 1000 | 739.9 | 125.0 | 1127.9 | 1188.1 | 1867.8 | **1313.1** | 0.80 | **1.14** |
| 4 | 3 | 2 | 3 | 5000 | 989.3 | 131.5 | 5768.4 | 5807.9 | 6757.8 | **5939.4** | 1.11 | **1.26** |
| 4 | 3 | 2 | 3 | 10,000 | 1076.3 | 137.7 | 11,598.4 | 11,645.5 | 12,674.6 | **11,783.2** | 1.18 | **1.27** |
| 5 | 3 | 4 | 4 | 1000 | 2460.1 | 358.4 | 899.4 | 1060.7 | 3359.5 | **1419.1** | 0.60 | **1.41** |
| 5 | 3 | 4 | 4 | 5000 | 3144.4 | 395.6 | 5016.0 | 5168.5 | 8160.5 | **5564.1** | 1.23 | **1.80** |
| 5 | 3 | 4 | 4 | 10,000 | 3400.5 | 411.6 | 10,173.0 | 10,341.4 | 13,573.5 | **10,753.0** | 1.47 | **1.86** |
| 6 | 4 | 4 | 4 | 1000 | 3341.8 | 613.2 | 634.1 | 1842.6 | 3975.9 | **2455.8** | 0.50 | **0.81** |
| 6 | 4 | 4 | 4 | 5000 | 11,874.7 | 971.6 | 4531.5 | 5201.9 | 16,406.2 | **6173.5** | 0.61 | **1.62** |
| 6 | 4 | 4 | 4 | 10,000 | 13,484.8 | 987.6 | 9709.9 | 10,371.0 | 23,194.8 | **11,358.6** | 0.86 | **1.76** |
| 7 | 3 | 6 | 2 | 1000 | 158.6 | 99.5 | 549.9 | 555.7 | 708.5 | **655.2** | 1.41 | **1.53** |
| 7 | 3 | 6 | 2 | 5000 | 196.6 | 139.9 | 2737.0 | 2744.2 | 2933.6 | **2884.1** | 1.70 | **1.73** |
| 7 | 3 | 6 | 2 | 10,000 | 212.6 | 176.0 | 5465.6 | 5469.9 | 5678.2 | **5646.0** | 1.76 | **1.77** |
| 8 | 3 | 3 | 2 | 1000 | 157.8 | 74.3 | 564.8 | 570.2 | 722.6 | **644.5** | 1.39 | **1.56** |
| 8 | 3 | 3 | 2 | 5000 | 196.6 | 100.3 | 2804.1 | 2807.4 | 3000.7 | **2907.6** | 1.67 | **1.72** |
| 8 | 3 | 3 | 2 | 10,000 | 212.6 | 106.9 | 5618.7 | 5625.5 | 5831.3 | **5732.5** | 1.72 | **1.74** |

**Table 10.** From top to bottom: sequence ($x_1^n$) name, sample size $n$ of $x_1^n$, country of origin; then, on the left for the strategy by States and on the right for the strategy by Parts: (1) cost in bits for the codification of the model, (2) cost in bits for the codification of the data $x_1^n$, (3) bits (total). In bold type, the cases with the smallest total number of bits.

| Sequence ($x_1^n$) | MN908947.3 |  | MT126808.1 |  | MT350282.1 |  |
|---|---|---|---|---|---|---|
| n | 29,903 |  | 29,876 |  | 29,903 |  |
| Origin | China |  | Brazil |  | Brazil |  |
| Strategy | States | Parts | States | Parts | States | Parts |
| Bits of model | 9168 | 1811 | 9169 | 1711 | 9169 | 1799 |
| Bits of $x_1^n$ | 58,116 | 58,401 | 58,063 | 58,501 | 58,123 | 58,489 |
| Bits (total) | 67,284 | **60,214** | 67,232 | **60,212** | 67,292 | **60,288** |



## Share and Cite

**MDPI and ACS Style**

García, J.E.; González-López, V.A.; Tasca, G.H.; Yaginuma, K.Y. An Efficient Coding Technique for Stochastic Processes. *Entropy* **2022**, *24*, 65.
https://doi.org/10.3390/e24010065
