# Entropy Rate Estimation for English via a Large Cognitive Experiment Using Mechanical Turk

## Abstract

## 1. Introduction

## 2. Entropy Rate Estimation

#### 2.1. Entropy Rate and n-Gram Entropy

#### 2.2. Shannon’s Method

#### 2.3. Cover and King’s Method

#### 2.4. Summary of the Scales Used in Previous Studies

## 3. Cognitive Experiment Using Mechanical Turk

#### 3.1. The Mechanical Turk Framework

#### 3.2. Experimental Design

- The number of characters still available for use.
- The preceding $n-1$ characters.
- The set of incorrect characters already used.
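As a rough illustration, the three items listed above can be modelled as the state of one guessing round. This is a minimal sketch, not the interface's actual implementation: the class `GuessState`, its method names, and the 27-symbol alphabet (26 lowercase letters plus space) are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
import string

# Assumption: a 27-symbol alphabet (26 lowercase letters plus space).
ALPHABET = frozenset(string.ascii_lowercase + " ")


@dataclass
class GuessState:
    """Hypothetical state shown to a subject while guessing the n-th character."""
    context: str                              # the preceding n-1 characters
    wrong: set = field(default_factory=set)   # incorrect characters already used

    @property
    def remaining(self) -> int:
        """Number of characters still available for use."""
        return len(ALPHABET - self.wrong)

    def guess(self, ch: str, target: str) -> bool:
        """Record one guess; return True if it matches the target character."""
        if ch not in ALPHABET or ch in self.wrong:
            raise ValueError(f"character {ch!r} is not available")
        if ch == target:
            return True
        self.wrong.add(ch)
        return False


# Example round: the subject first guesses "r", then the correct "t".
state = GuessState(context="the marke")
first = state.guess("r", target="t")
second = state.guess("t", target="t")
guesses_needed = len(state.wrong) + 1
```

In this sketch the number of guesses needed for a character is the count of rejected characters plus one, which is the quantity a guessing-game experiment records for each position.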

#### 3.3. Experimental Outcomes

#### 3.4. Human Prediction Accuracy with Respect to Context Length

#### 3.5. The Datapoints of the Bounds for n

## 4. Extrapolation of the Bounds with an Ansatz Function

#### 4.1. Ansatz Functions

#### 4.2. Comparison among Ansatz Functions Using All Estimates

## 5. Analysis via the Bootstrap Technique

#### 5.1. The Effect of the Sample Size

#### 5.2. The Effect of Variation on Subjects’ Estimation Performances

## 6. Discussion

#### 6.1. Computational versus Cognitive Methods

#### 6.2. Application to Other Languages and Words

#### 6.3. Nature of h Revealed by Cognitive Experimentation

## 7. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A

**Figure 1.** Our user interface for our cognitive experiment on Amazon Mechanical Turk. It provides: (**i**) the number of characters still available for use, (**ii**) the preceding $n-1$ characters, and (**iii**) the set of incorrect characters already used.

**Figure 2.** The number of observations collected for the predictions made for the $n$-th character. The vertical line indicates $n=70$, which provided the minimum direct estimate of ${h}_{\mathrm{expmin}}=1.407$ in our experiment.

**Figure 3.** The probability that the subject needed only one guess to correctly predict the $n$-th character.

**Figure 4.** The upper bound (**blue**) and the lower bound (**red**) obtained from all observations, together with their extrapolations via the ansatz function ${f}_{1}$ (dashed lines).

**Figure 5.** Histograms for the estimated values of the upper bound of the entropy rate $h$ for different sample sizes. (**a**) $S=100$; (**b**) $S=500$; (**c**) $S=1000$; (**d**) $S=1500$.

**Figure 6.** The estimated upper bounds with ansatz function ${f}_{1}$ using: (1) 1000 experimental sessions with the best prediction performances (**blue**), and (2) all experimental sessions (**red**), with the values reported in Table 3. The blue and red points indicate the mean values for the $B=1000$ sets, and the shaded areas indicate the 5% percentile bounds.

Study | Total Number of Samples | Number of Subjects | Number of Phrases | Max n for a Session | Number of Samples per n |
---|---|---|---|---|---|
Shannon [1] | 1600 | 1 | 100 | 100 | 100 |
Jamison and Jamison [9] | 360 | 2 | 50 and 40 | 100 | 50 and 40 |
Cover and King [10] No.1 | 440 | 2 | 1 | 220 | 2 |
Cover and King [10] No.2 | 900 | 12 | 1 | 75 | 12 |
Moradi et al. [11] No.1 | 6400 | 1 | 100 | 64 | 100 |
Moradi et al. [11] No.2 | 3200 | 8 | 400 | 32 | 100 |
Our Experiment | 172,954 | 683 | 225 | 87.51 | 1954.86 |

**Table 2.** The top ten most frequently used words along with two subsequent words appearing in the phrases used in our experiment.

Rank | Word | Frequency | Two Subsequent Words | Frequency |
---|---|---|---|---|
1 | market | 15 | interest rates | 4 |
2 | company | 13 | future contracts | 3 |
3 | investment | 11 | program trading | 3 |
4 | price | 11 | stock market | 3 |
5 | people | 11 | money managers | 3 |
6 | companies | 10 | same time | 2 |
7 | stock | 9 | wide variety | 2 |
8 | buy | 9 | time around | 2 |
9 | officials | 7 | higher dividends | 2 |
10 | growth | 7 | some firms | 2 |

**Table 3.** The means and the 5% percentile bound intervals for the upper bound of $h$ found by using the ansatz function ${f}_{1}$ for $S=100$, 500, 1000, and 1500. The number of sets is $B=1000$. The error is large for small sample sizes such as $S=100$, where the difference between the 5% percentile upper and lower bounds is larger than $0.3$ bpc. This difference decreases with increasing $S$ and eventually becomes smaller than $\pm 0.1$ bpc for $S\ge 1000$.

Sample Size S | Mean | 5% Upper | 5% Lower |
---|---|---|---|
100 | 1.340 | 1.467 | 1.124 |
200 | 1.383 | 1.468 | 1.263 |
300 | 1.391 | 1.459 | 1.302 |
400 | 1.398 | 1.456 | 1.327 |
500 | 1.405 | 1.455 | 1.349 |
1000 | 1.412 | 1.438 | 1.383 |
1500 | 1.411 | 1.444 | 1.374 |
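The interval widths discussed in the caption of Table 3 can be checked directly from the tabulated values. The short sketch below only transcribes the rows of the table and subtracts the 5% lower bound from the 5% upper bound; the variable names `table3` and `widths` are introduced here for illustration and do not come from the original analysis.

```python
# Rows transcribed from Table 3: S -> (mean, 5% upper, 5% lower), in bpc.
table3 = {
    100: (1.340, 1.467, 1.124),
    200: (1.383, 1.468, 1.263),
    300: (1.391, 1.459, 1.302),
    400: (1.398, 1.456, 1.327),
    500: (1.405, 1.455, 1.349),
    1000: (1.412, 1.438, 1.383),
    1500: (1.411, 1.444, 1.374),
}

# Width of the percentile interval for each sample size, rounded to 3 decimals.
widths = {s: round(upper - lower, 3) for s, (_, upper, lower) in table3.items()}

for s, w in widths.items():
    print(f"S={s:5d}  width={w:.3f} bpc")
```

For $S=100$ the width is $0.343$ bpc, above the $0.3$ bpc mentioned in the caption, while for $S=1000$ and $S=1500$ the widths are $0.055$ and $0.070$ bpc, respectively.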

