# Stylometry and Numerals Usage: Benford’s Law and Beyond

## Abstract

## 1. Introduction

## 2. Benford’s Law and Texts

- Arabic numbers (not spelled out) of consecutive front-page news items of a newspaper. “Dates were barred as not being variable, and the omission of spelled-out numbers restricted the counted digits to numbers 10 and over”;
- The first 342 street addresses given in an American Men of Science edition;
- Numeral usage (except for dates and page numbers) of an issue of the Reader’s Digest.

- Possible rounding of numerals starting with digits 8 and 9;
- In German, the indefinite article ein coincides with the numeral ein.

- we take into account not only numerals expressed in digits but also those spelled out in words, both cardinal and ordinal. This is technically a much more difficult task, especially for texts in languages in which numerals are declined (Russian, Czech, Lithuanian, etc.);
- the object of our study is coherent literary texts (as well as compilations of such texts), not a random set of texts.
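To make the first point concrete, the following is a minimal sketch of how numerals written in digits and numerals spelled out in words might be collected together. It is an illustration only, not the extraction procedure used in this study: the lexicon `NUMERAL_VALUES` and the helper names are hypothetical, the lexicon covers only a handful of English forms, and a language with declined numerals would need every case form (or a lemmatizer) added.

```python
import re

# Hypothetical, deliberately tiny lexicon (an assumption for illustration).
# A real lexicon would list every cardinal and ordinal form and, for
# inflected languages, every declined word form.
NUMERAL_VALUES = {
    "one": 1, "first": 1, "two": 2, "three": 3, "third": 3,
    "four": 4, "fourth": 4, "five": 5, "fifth": 5, "six": 6, "sixth": 6,
    "seven": 7, "seventh": 7, "eight": 8, "eighth": 8, "nine": 9, "ninth": 9,
    "ten": 10, "tenth": 10, "twelve": 12, "twenty": 20, "forty": 40,
    "hundred": 100, "thousand": 1000,
}

# A token is either a run of digits or a run of lowercase letters.
TOKEN = re.compile(r"\d+|[a-z]+")


def numeral_values(text):
    """Return the value of every numeral token found in text, in order.

    Simplification: compound numerals ("twenty one") are counted as
    separate tokens, and ambiguous words are not disambiguated.
    """
    values = []
    for token in TOKEN.findall(text.lower()):
        if token.isdigit():
            values.append(int(token))
        elif token in NUMERAL_VALUES:
            values.append(NUMERAL_VALUES[token])
    return values


def first_significant_digit(value):
    """Return the first non-zero digit of a non-negative integer, or None for 0."""
    digits = str(value).lstrip("0")
    return int(digits[0]) if digits else None
```

For example, `numeral_values("He waited three days and read 12 pages of the first chapter.")` collects the values of "three", "12", and "first", after which `first_significant_digit` can be applied to each value.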

- There are differences between the distributions (especially between the Gospels of Matthew, Mark, and Luke on the one hand and the Gospel of John on the other). They are not very large, but they are statistically significant, given the amount of analyzed data.
- In general, the distribution of the first significant digit of numerals here also resembles Benford’s, but the first significant digit 1 is noticeably predominant.
- It turned out that the share of the digit 1 is so much higher than prescribed by Benford’s Law (it varies from 38 to 45 percent instead of Benford’s 30 percent) that the distribution could be called ultra-Benfordian. As we later found out, this is typical for the coherent literary texts of most authors.
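For reference, Benford's Law assigns the first significant digit d the probability log10(1 + 1/d), which gives about 30.1 percent for the digit 1. The sketch below computes these reference shares and the observed shares for a list of numeral values; the function names are illustrative assumptions, not part of the original study.

```python
import math
from collections import Counter


def benford_probabilities():
    """Benford's expected share of each first significant digit 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_shares(values):
    """Observed share of each first significant digit among positive integers."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    if not digits:
        raise ValueError("no positive numerals to analyze")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}
```

Comparing `first_digit_shares(values)[1]` with `benford_probabilities()[1]` shows directly whether a text is "ultra-Benfordian" in the sense described above.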

- Benford’s Law approximately holds for coherent texts.
- Deviations from Benford’s Law are statistically significant authorial features that, under certain conditions, make it possible to distinguish between parts of a text written by different authors. The obvious requirements are that the text be sufficiently long and use numerals sufficiently often, which is usually the case in, for example, historical literature.
- The frequencies of the first significant digits at the end of the row {1, 2, ..., 7, 8, 9} are subject to strong fluctuations (even in texts by the same author) and are therefore not indicative.

- The frequencies of the first significant digits stabilize for texts larger than 200 KB (the size of the txt file in UTF-8 encoding).
- We confirm the visually observed similarities and differences between the frequency distributions of the first significant digits with the Pearson chi-squared test; to apply it, we had to develop a special technique (for details, see [21]), because the standard procedure offered by statistical packages is not suitable here.
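For orientation only, the textbook Pearson statistic for observed first-digit counts against Benford's expected shares can be sketched as follows. This is explicitly not the special technique of [21]; it is the standard goodness-of-fit form, shown under the assumption that the nine digit counts are treated as a single multinomial sample. The function name and the example counts are invented for illustration and are not data from the paper.

```python
import math


def pearson_chi_squared(observed_counts, expected_probs):
    """Textbook Pearson statistic: sum of (O - E)^2 / E over all categories."""
    total = sum(observed_counts)
    statistic = 0.0
    for observed, prob in zip(observed_counts, expected_probs):
        expected = total * prob
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Benford's expected shares for the digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Critical value of the chi-squared distribution with 8 degrees of freedom
# at the 0.05 significance level.
CRITICAL_VALUE_DF8_P05 = 15.507

# Invented counts for digits 1..9 (illustration only, not data from the paper).
toy_counts = [450, 150, 100, 80, 60, 50, 40, 40, 30]
toy_statistic = pearson_chi_squared(toy_counts, BENFORD)
toy_rejects_benford = toy_statistic > CRITICAL_VALUE_DF8_P05
```

A statistic above the critical value indicates a deviation from Benford's Law that is significant at the 5 percent level under the assumptions of the standard test.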

**Benford’s Law and Texts: Overview of Results**

#### 2.1. Distribution of the First Significant Digits of the Numerals in Compiled Texts

#### 2.2. Coherent Literary Texts: The Author’s Peculiarities

#### 2.3. First Significant Digits and Texts Authorship Attribution

**The problem of “And Quiet Flows the Don”**

#### 2.4. Statistical Characteristics of Translated Texts

## 3. Beyond Benford’s Law

- Analysis of first-significant-digit statistics is informative only for the digit 1 and (sometimes) 2 and 3, since the occurrence of the higher digits is subject to strong fluctuations even in the texts of the same author. Thus, only a small part of the statistical information on the numerals contained in the text is available for analysis.
- On the other hand, using the first significant digits is advantageous since the information here is presented in a generalized form: it can minimize the influence of numbers closely connected to the topic of the text (e.g., the year 1812 in L. Tolstoy’s War and Peace).
- Analysis of the use of the numerals themselves (rather than their first significant digits) gives richer information about an author’s peculiarities and, to a large extent, is not hampered by the indistinguishability of the numeral one and the indefinite article.
- However, the analysis of numeral statistics is more difficult.

#### 3.1. The Extension of the Numerals Analysis. Dobychin vs. Platonov

- Platonov uses numerals in his literary texts more often than Dobychin does.
- Platonov resorts less often to round numerals (10, 20, 30, ...), which, in conjunction with the first point, may indirectly indicate a greater tendency toward detail.
- The numeral one (in its different word forms) is the undisputed leader among the numerals found in Platonov’s texts. In Dobychin’s texts, by contrast, the numeral one is less frequent than the numeral two!
- Note the psychologically understandable thinning of the series of numerals and the decrease in their occurrence as they grow larger, as well as a noticeable local maximum at the numeral 100, which, of course, plays the role of an indefinitely large number.
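The quantities compared in this list (how often numerals occur, how often they are round, and which numeral leads) can be summarized per text as in the sketch below. The definition of a round numeral as a multiple of 10 and the normalization per 1000 words are assumptions made for illustration; the function names are hypothetical.

```python
from collections import Counter


def is_round(value):
    """Assumed definition: round numerals are 10, 20, ..., 100, 200, ..."""
    return value >= 10 and value % 10 == 0


def numeral_profile(values, word_count):
    """Summarize numeral usage for one text.

    values: numeric values of all numerals found in the text.
    word_count: total number of words in the text (for normalization).
    """
    if not values or word_count <= 0:
        raise ValueError("need at least one numeral and a positive word count")
    counts = Counter(values)
    round_total = sum(c for v, c in counts.items() if is_round(v))
    return {
        "numerals_per_1000_words": 1000 * len(values) / word_count,
        "round_share": round_total / len(values),
        "most_frequent": counts.most_common(1)[0][0],
        "distinct_numerals": len(counts),
    }
```

Two authors can then be compared field by field, for example by checking whether `most_frequent` is 1 for one author and 2 for the other.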

#### 3.2. Who Wrote “The Twelve Chairs”?

**y** are n-dimensional vectors whose components are the absolute frequencies of occurrence of the first n natural numbers found in both analyzed texts.
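The construction of such frequency vectors can be sketched as follows. The sketch assumes that the components correspond to the numbers 1, 2, ..., n, and it uses cosine distance purely as a stand-in dissimilarity, since the measure actually applied is not reproduced in this passage. All function names are hypothetical.

```python
import math
from collections import Counter


def frequency_vector(values, n):
    """Absolute frequencies of the numerals 1..n among the given values."""
    counts = Counter(values)
    return [counts[k] for k in range(1, n + 1)]


def cosine_distance(x, y):
    """Stand-in dissimilarity (an assumption): 1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("a frequency vector must not be all zeros")
    return 1.0 - dot / (norm_x * norm_y)
```

Because cosine distance ignores vector length, it partly compensates for texts of different sizes even though the components are absolute frequencies; another measure would behave differently in this respect.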

- For all the analyzed texts, there are peaks in the occurrence of round numbers 10, 20,..., 100, 200,…
- In the texts by Ilf and Petrov, as well as in Bulgakov’s The Master and Margarita, the numeral 1 has the highest frequency (which is consistent with Benford’s Law), but in Kataev’s texts, the numeral 2 leads.
- These two texts are characterized by the greatest variety of numerals.
- On the contrary, Kataev’s texts are distinguished by the least variety of numerals.
- In terms of the variety of numerals, The Master and Margarita occupies an average position, but the frequencies of the numerals (after the initial frequent ones and twos) are usually lower than in the other texts analyzed. In fact, many numbers occur only once.

- The Twelve Chairs; a joint work by Ilf and Petrov, 1927–1928; vol. 1 [44];
- Joint works 1932–1937 (stories, feuilletons, articles, speeches, vaudevilles, screenplays) by Ilf and Petrov, included in vol. 3;
- The Little Golden Calf; a joint work by Ilf and Petrov, 1929–1930; vol. 2;
- Works (stories, essays, feuilletons) written individually by Petrov in 1924–1932 and included in vol. 5;
- Works (essays, articles, memoirs) written individually by Petrov in 1937–1942 and included in vol. 5;
- One-storied America (travel essays; sometimes translated as Little Golden America), 1936, vol. 4;
- Works (stories, essays, feuilletons) written solely by Ilf in 1923–1929, as well as his notebooks from 1925–1937, included in vol. 5.

## 4. Discussion

**Figure 1.** The frequency distribution of the first significant digits of numerals in three collections of Russian-language literary texts. Results are compared with those prescribed by Benford’s Law.

**Figure 2.** The distribution of the first significant digits of numerals in eight collections of English-language literary texts.

**Figure 4.** The distribution of the first significant digits of numerals in Dostoevsky’s texts. In addition to voluminous works (Nos. 1–9), a shorter one (No. 10) was analyzed for comparison.

**Figure 9.** Distribution of the first significant digits of numerals in the novels And Quiet Flows the Don, Virgin Soil Upturned, They Fought for Their Country.

**Figure 10.** Distribution of the first significant digits of numerals in H. G. Wells’ novels and in their Russian translations.

**Figure 11.** Distribution of relative occurrence frequencies of the first significant digits of numerals in the texts by L. Dobychin and A. Platonov.

**Figure 14.** Results of hierarchical cluster analysis based on the occurrence of numerals in the texts by Ilf and Petrov. The horizontal scale indicates the “distance” between clusters in conventional units. Texts Nos. 1–7, combined into clusters, are indicated in the text of the paper.

**Figure 15.** The results of hierarchical cluster analysis based on the occurrence of numerals in texts by Ilf and Petrov (Nos. 1, 2), Kataev (Nos. 3, 4), and Bulgakov (No. 5). The horizontal scale indicates the “distance” between clusters in conventional units. Texts Nos. 1–5, combined into clusters, are indicated in the article.
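The hierarchical cluster analysis behind Figures 14 and 15 can be illustrated with a minimal agglomerative sketch. The linkage rule used for the figures is not stated in the captions, so average linkage is assumed here; the function name and the form of the `distance` argument are likewise hypothetical.

```python
def agglomerative_merges(labels, distance):
    """Average-linkage agglomerative clustering (assumed linkage rule).

    labels: names of the texts to cluster.
    distance: function taking two labels and returning their dissimilarity.
    Returns the merges in order as (cluster_a, cluster_b, height) tuples,
    where each cluster is a tuple of labels.
    """
    clusters = [(label,) for label in labels]
    merges = []

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(distance(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                height = linkage(clusters[i], clusters[j])
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        # Drop the two merged clusters and append their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

The merge heights play the role of the horizontal "distance" scale in a dendrogram: texts joined at a small height have similar numeral usage, while a merge at a large height separates dissimilar groups.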

