# Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora

## Abstract

## 1. Introduction

## 2. Entropy Rate

## 3. Direct Estimation Methods

- The first approach is to compress the text using a data compression algorithm. Let $R({X}_{1}^{n})$ denote the size in bits of the text ${X}_{1}^{n}$ after compression. Then the code length per unit, $r(n)=R({X}_{1}^{n})/n$, is never smaller than the entropy rate [13], $$r(n)\ge h.$$ We call $r(n)$ the encoding rate. In our application, we are interested in universal compression methods. A universal text compressor guarantees that the encoding rate converges to the entropy rate, provided that the stochastic process ${X}_{1}^{\infty}$ is stationary and ergodic, i.e., the equality $$\lim_{n\to \infty} r(n)=h$$ holds.
- The second approach is to estimate the probabilistic language model underlying formula (2). A representative classic work is [6], which reported $h\approx 1.75$ bpc (bits per character) by estimating the probability of trigrams in the Brown National Corpus.
- In addition, many other entropy estimation methods have been proposed in information theory. There are lower bounds of entropy such as the plug-in estimator [15], estimators that work under the assumption that the process is Markovian [16,17,18], and a few other methods such as Context Tree Weighting [15,19].
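The first approach above can be illustrated with a minimal sketch. It assumes LZMA from the Python standard library as a convenient stand-in compressor; it is not one of the compressors evaluated in this paper, and the names `encoding_rate` and `rate_curve` are hypothetical helpers introduced only for illustration. The sketch computes $r(n)=R({X}_{1}^{n})/n$ for growing prefixes of a simulated Bernoulli ($p=0.5$) sequence, for which $h=1$ bit per symbol.

```python
import lzma
import random


def encoding_rate(text: str) -> float:
    """Return r(n) = R(X_1^n) / n in bits per character.

    R is the compressed size in bits; n is the number of characters.
    LZMA is an assumed stand-in, not the paper's compressor.
    """
    compressed = lzma.compress(text.encode("utf-8"))
    return 8 * len(compressed) / len(text)


def rate_curve(text: str, lengths):
    """Encoding rate of each prefix X_1^n for the requested lengths n."""
    return [(n, encoding_rate(text[:n])) for n in lengths if 0 < n <= len(text)]


# Simulated Bernoulli(0.5) source written as the characters '0' and '1'.
rng = random.Random(0)
bernoulli = "".join(rng.choice("01") for _ in range(1 << 15))

# Prefix lengths n = 2^10, ..., 2^15.
curve = rate_curve(bernoulli, [1 << k for k in range(10, 16)])
for n, r in curve:
    print(f"n = {n:6d}   r(n) = {r:.3f} bits per character")
```

Each printed rate should stay above the entropy rate of the source, in line with $r(n)\ge h$. How quickly $r(n)$ approaches $h$ depends on the compressor, which is why the choice of compression method matters for the estimates.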

## 4. Extrapolation Functions

## 5. Experimental Procedure

#### 5.1. Data Preparation

**English**: English; **Chinese**: Chinese; and **Others**: French, Russian, Japanese, Korean, and Romanized Chinese and Japanese.

#### 5.2. Detailed Procedure

## 6. Experimental Results

#### 6.1. Effects of Randomization by Documents

#### 6.2. Comparison of the Error of Fit

#### 6.3. Universality of the Estimates of Exponent β

#### 6.4. A Linear Perspective onto the Decay of the Encoding Rate

#### 6.5. Discriminative Power of the Decay of the Encoding Rate

#### 6.6. Stability of the Entropy Rate Estimates

## 7. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

**Figure 1.** Compression results for (**a**) a Bernoulli process ($p=0.5$) and (**b**) the Wall Street Journal for Lempel-Ziv (LZ), PPM (Prediction by Partial Match), and Sequitur.

**Figure 2.** Encoding rates for the Wall Street Journal corpus (in English). Panel (**a**) is for the original data, whereas (**b**) is the average of the data 10-fold shuffled by documents. To these results we fit functions ${f}_{1}(n)$ and ${f}_{3}(n)$.

**Figure 3.** The values of error and h for all natural language data sets in Table 1 and the three ansatz functions ${f}_{1}(n)$, ${f}_{2}(n)$, and ${f}_{3}(n)$. Each data point corresponds to a distinct corpus or a distinct text, where black is English, red is Chinese, and blue is other languages. The squares are the fitting results for ${f}_{1}(n)$, the triangles for ${f}_{2}(n)$, and the circles for ${f}_{3}(n)$. The means and the standard deviations of h (left) and error (right) are indicated in the figure next to the ovals, which show the range of the standard deviation: dotted for ${f}_{1}(n)$, dashed for ${f}_{2}(n)$, and solid for ${f}_{3}(n)$.

**Figure 4.** The values of β and h for all natural language data sets in Table 1 and the ansatz functions ${f}_{1}(n)$, ${f}_{2}(n)$, and ${f}_{3}(n)$. Each data point corresponds to a distinct corpus or a distinct text, where black is English, red is Chinese, and blue is other languages. The squares are the fitting results for ${f}_{1}(n)$, the triangles for ${f}_{2}(n)$, and the circles for ${f}_{3}(n)$. The means and the standard deviations of h (left) and β (right) are indicated in the figure next to the ovals, which show the range of the standard deviation: dotted for ${f}_{1}(n)$, dashed for ${f}_{2}(n)$, and solid for ${f}_{3}(n)$.

**Figure 5.** All large scale natural language data (first block of Table 1) from a linear perspective for function ${f}_{3}(n)$. The axes are $Y=\ln r(n)$ and $X={n}^{\beta -1}$, where $\beta =0.884$. The black points are English, the red ones are Chinese, and the blue ones are other languages. The two linear fit lines are for English (lower) and Chinese (upper).

**Figure 6.** Data from the third block of Table 1 from a linear perspective for function ${f}_{3}(n)$. The axes are $X={n}^{\beta -1}$ and $Y=\ln r(n)$, where $\beta =0.884$ as in Figure 5. The black points are the English text, the magenta ones are its randomized versions, whereas the blue ones are Bernoulli and Zipf processes.
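The linear perspective used in Figures 5 and 6 can be sketched in a few lines. The sketch assumes, as the axes $Y=\ln r(n)$ and $X={n}^{\beta -1}$ suggest, that the encoding rate follows $\ln r(n) = A\,{n}^{\beta -1} + \ln h$; this form is inferred from the captions and is an assumption here. The synthetic data, the constants `A_TRUE` and `H_TRUE`, and the helpers `linearize` and `fit_line` are hypothetical and serve only to check that a straight-line fit recovers the limit.

```python
import math

BETA = 0.884  # exponent quoted in the captions of Figures 5 and 6


def linearize(ns, rates, beta=BETA):
    """Map (n, r(n)) pairs to the plotted coordinates X = n^(beta-1), Y = ln r(n)."""
    xs = [n ** (beta - 1.0) for n in ns]
    ys = [math.log(r) for r in rates]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for Y = slope * X + intercept."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic, made-up data generated from the assumed form itself.
A_TRUE, H_TRUE = 5.0, 1.2
ns = [2 ** k for k in range(10, 31)]
rates = [math.exp(A_TRUE * n ** (BETA - 1.0) + math.log(H_TRUE)) for n in ns]

xs, ys = linearize(ns, rates)
slope, intercept = fit_line(xs, ys)

# X tends to 0 as n grows, so exp(intercept) is the extrapolated limit of r(n).
h_estimate = math.exp(intercept)
print(f"slope = {slope:.3f}, extrapolated h = {h_estimate:.3f} bits per character")
```

Because $X\to 0$ as $n\to \infty$, the intercept of the fitted line at $X=0$ gives the logarithm of the extrapolated rate, which is what makes this view convenient for comparing data sets.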

| Text | Language | Size (chars) | Encoding rate (bit) | ${f}_{1}(n)$: h (bit) | ${f}_{1}(n)$: error $\times 10^{-2}$ | ${f}_{3}(n)$: h (bit) | ${f}_{3}(n)$: error $\times 10^{-2}$ |
|---|---|---|---|---|---|---|---|
| **Large Scale Random Document Data** | | | | | | | |
| Agence France-Presse | English | 4096003895 | 1.402 | 1.249 | 1.078 | 1.033 | 0.757 |
| Associated Press Worldstream | English | 6524279444 | 1.439 | 1.311 | 1.485 | 1.128 | 1.070 |
| Los Angeles Times/Washington Post | English | 1545238421 | 1.572 | 1.481 | 1.108 | 1.301 | 0.622 |
| New York Times | English | 7827873832 | 1.599 | 1.500 | 0.961 | 1.342 | 0.616 |
| Washington Post/Bloomberg | English | 97411747 | 1.535 | 1.389 | 1.429 | 1.121 | 0.991 |
| Xinhua News Agency | English | 1929885224 | 1.317 | 1.158 | 0.906 | 0.919 | 0.619 |
| Wall Street Journal | English | 112868008 | 1.456 | 1.320 | 1.301 | 1.061 | 0.812 |
| Central News Agency of Taiwan | Chinese | 678182152 | 5.053 | 4.459 | 1.055 | 3.833 | 0.888 |
| Xinhua News Agency of Beijing | Chinese | 383836212 | 4.725 | 3.810 | 0.751 | 2.924 | 0.545 |
| People’s Daily (1991–95) | Chinese | 101507796 | 4.927 | 3.805 | 0.413 | 2.722 | 0.188 |
| Mainichi | Japanese | 847606070 | 3.947 | 3.339 | 0.571 | 2.634 | 0.451 |
| Le Monde | French | 727348826 | 1.489 | 1.323 | 1.103 | 1.075 | 0.711 |
| KAIST Raw Corpus | Korean | 130873485 | 3.670 | 3.661 | 0.827 | 3.327 | 1.158 |
| Mainichi (Romanized) | Japanese | 1916108161 | 1.766 | 1.620 | 2.372 | 1.476 | 2.067 |
| People’s Daily (pinyin) | Chinese | 247551301 | 1.850 | 1.857 | 1.651 | 1.667 | 1.136 |
| **Small Scale Data** | | | | | | | |
| Ulysses (by James Joyce) | English | 1510885 | 2.271 | 2.155 | 0.811 | 1.947 | 1.104 |
| À la recherche du temps perdu (by Marcel Proust) | French | 7255271 | 1.660 | 1.414 | 0.770 | 1.078 | 0.506 |
| The Brothers Karamazov (by Fyodor Dostoyevskiy) | Russian | 1824096 | 2.223 | 1.983 | 0.566 | 1.598 | 0.839 |
| Daibosatsu toge (by Nakazato Kaizan) | Japanese | 4548008 | 4.296 | 3.503 | 1.006 | 2.630 | 0.875 |
| Dang Kou Zhi (by Wan-Chun Yu) | Chinese | 665591 | 6.739 | 4.479 | 1.344 | 2.988 | 1.335 |
| **Other Data** | | | | | | | |
| Bernoulli (0.5) | Stochastic | 8000000000 | 1.019 | 1.016 | 0.391 | 1.012 | 0.721 |
| Zipf’s law Random Character | English | 63683795 | 4.406 | 4.417 | 0.286 | 4.402 | 0.258 |
| WSJ (Original) | English | 112868008 | 1.456 | 1.305 | 1.156 | 1.041 | 0.833 |
| WSJ (Random Characters) | English | 112868008 | 4.697 | 4.706 | 0.131 | 4.699 | 0.146 |
| WSJ (Random Word) | English | 112868008 | 2.028 | 1.796 | 0.663 | 1.554 | 0.956 |
| WSJ (Random Sentence) | English | 112868008 | 1.461 | 1.026 | 0.500 | 0.562 | 0.532 |

