Article

The Amount of Data Required to Recognize a Writer’s Style Is Consistent Across Different Languages of the World

by Boris Ryabko 1,2,*,†, Nadezhda Savina 2,†, Yeshewas Getachew Lulu 2,† and Yunfei Han 2,†
1 Federal Research Center for Information and Computational Technologies, 6300090 Novosibirsk, Russia
2 Department of Information Technologies, Novosibirsk State University, 6300090 Novosibirsk, Russia
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Entropy 2025, 27(10), 1039; https://doi.org/10.3390/e27101039
Submission received: 25 August 2025 / Revised: 25 September 2025 / Accepted: 1 October 2025 / Published: 4 October 2025
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

In this paper, we apply an information-theoretic method proposed by Ryabko and Savina (hence called the RS-method), based on data compression, to recognize a writer’s individual style in four languages from different language groups and families. The method was used to study fiction texts in Russian (East Slavic group of languages of the Indo-European language family), Amharic (South Ethiosemitic group of the Semitic language family), Chinese (Sinitic group of the Sino-Tibetan language family) and English (West Germanic language group of the Indo-European language family). It was found that the amount of data necessary for recognizing an author’s style is almost the same for all four languages, i.e., the amount of data is invariant across different language groups. The results obtained are of interest to computer science, literary studies, linguistics and, in particular, computational linguistics.

1. Introduction

The Concept of the Individual Author’s Style of a Writer

An author’s style is a unique set of features characteristic of a particular writer’s work, which makes their novels recognizable and different from those of other writers [1,2,3]. An author’s style is formed during the creative process and reflects the individuality and worldview of the author [3,4,5,6]. In fiction, the styles of different writers are extremely diverse. For example, one can recall the businesslike and laconic style of Ernest Hemingway, the fussy James Joyce, the sardonically abrupt Kurt Vonnegut [2] or the heavy and cumbersome style of Tolstoy. An author’s style is formed gradually over the course of their life and reflects the evolution of the author [4].
In our work, we used the classification of sources of text variability proposed by the leading mathematician A.N. Kolmogorov, who is also known for his results in the field of information theory [7,8]. He identified the following sources of text variability: content, form and unconscious individual author’s style. Many researchers argue that individual author’s style is a reflection of the writer’s personality [1,2,3,4]. The elements of author’s style are well known. These include vocabulary (use of certain words and expressions), syntax (features of sentence construction), tropes and figures of speech (metaphors, epithets, comparisons, etc.) and composition (arrangement of parts of the work), as well as the general tone and mood of the work [3].
Studying the elements of author’s style requires multi-tasking and is a rather difficult problem. But recognizing the author’s style of a writer without pursuing analysis of style elements is a completely different task. The information-theoretic method proposed by Ryabko & Savina [9,10,11], which we call the RS-method, helps to solve this problem reliably, i.e., with the help of the apparatus of mathematical statistics. It is based on the use of so-called archivers or data compression methods, which, in turn, can be attributed to information theory. The fact is that modern archivers are aimed at finding a variety of patterns in compressed texts, including through using methods such as describing the text with the shortest formal grammar, building dictionaries of minimal volume describing the text and other methods related to artificial intelligence.
An important application of data compression for classification was proposed by P. Vitanyi and developed by him and his co-authors in several papers (see [12,13] and the references therein). They used the length of a compressed message as an estimate of its Kolmogorov complexity and, based on this, proposed the so-called normalized compression distance between two different texts. This approach made it possible to classify different human languages, animal species (based on their genomes), computer and biological viruses and some other objects. The main difference between the application of normalized compression distance and our approach is the integration of the latter with methods from mathematical statistics, which makes it possible to apply the developed apparatus of this science, including hypothesis testing and numerical measures such as Cramer’s coefficient.
In this study, we applied this method of recognizing author’s style to four languages: English, Russian, Amharic and Chinese. This choice was due to the fact that these languages belong to different language groups and families. An unexpected result was obtained: author’s style was reliably determined in texts across such different languages using the same amount of data, measured in kilobytes and not in the number of letters, symbols or similar units. Thus, we can assert that the amount of data necessary to determine the author’s style of a writer is, in a sense, invariant for all the languages we have considered.
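To illustrate why measuring data in kilobytes differs from counting letters, the sketch below reports how many bytes one character occupies in each of the four scripts. It assumes the texts are stored as UTF-8, which the paper does not state; the sample words and the helper names `bytes_per_char` and `chars_per_slice` are illustrative only.

```python
# Assumption: texts are stored as UTF-8 (the encoding is not stated in the paper).
SAMPLES = {
    "English": "style",
    "Russian": "стиль",
    "Chinese": "老舍",
    "Amharic": "አማርኛ",
}

SLICE_BYTES = 4 * 1024  # the 4 KB slice size used in the experiments


def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


def chars_per_slice(text: str, slice_bytes: int = SLICE_BYTES) -> int:
    """Roughly how many characters of this script fit into one slice."""
    return int(slice_bytes / bytes_per_char(text))


for language, word in SAMPLES.items():
    print(language, bytes_per_char(word), chars_per_slice(word))
```

Under this assumption, a 4 KB slice holds about 4096 Latin letters, about 2048 Cyrillic letters, and only about 1365 Chinese or Ethiopic characters, so equal sizes in kilobytes correspond to quite different character counts.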

2. RS-Method for Recognizing the Author’s Style of a Writer

2.1. The Idea of the Method

The method of recognizing an author’s style is based on algorithms for lossless compression, implemented in the form of so-called archivers. An archiver encodes a text so that the encoded message is shorter than the original (the text is compressed) and, if necessary, can be decoded back into the original. Compression is possible because archivers detect unevenness in the frequencies of occurrence of letters and words and exploit hidden patterns, drawing on the theory of formal grammars and the laws of information transmission.
Let us briefly describe the scheme of application of the developed method. Consider three texts, T1, T2 and T3, where it is known that T1 and T2 were generated by different sources of information, I1 and I2, and T3 was generated by either I1 or I2 (for example, T1 is a text in English, T2 is in German, and T3 is in English or German). Let d be some archiver; if it is applied to a file X, the length of the compressed file is denoted by d(X). First, the texts are concatenated into the pairs T1T3 and T2T3, and both pairs are compressed. Then we separately compress T1 and T2 and calculate the differences in the lengths of the compressed files: d(T1T3) − d(T1) and, similarly, d(T2T3) − d(T2). If d(T1T3) − d(T1) < d(T2T3) − d(T2), we conclude that T3 was generated by the information source I1. If d(T1T3) − d(T1) > d(T2T3) − d(T2), we conclude that T3 was generated by I2. The reason is that the archiver, when compressing the later text T3, uses the statistical features it found when compressing the earlier text, T1 or T2. Therefore, T3 is compressed more effectively when it follows a text from the same source of information.
The following simple example explains the essence of this method: Let T1 be a text in English, T2 in German, and (unknown) T3 also in English. Then d(T1T3) − d(T1) will be less than d(T2T3) − d(T2) because, in the first case, T3 in T1T3 was compressed after the archiver had been “tuned” to “its” statistics (for example, in the case of texts in English and German, the method works flawlessly with text lengths of several hundred letters for T1, T2 and T3).
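The comparison of d(T1T3) − d(T1) with d(T2T3) − d(T2) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Python's standard `lzma` module as the archiver d, and the names `compressed_len`, `extra_cost` and `attribute`, as well as the toy texts, are hypothetical.

```python
import lzma


def compressed_len(data: bytes) -> int:
    """d(X): length in bytes of the LZMA-compressed form of `data`."""
    return len(lzma.compress(data, preset=9))


def extra_cost(train: bytes, sample: bytes) -> int:
    """d(train + sample) - d(train): bytes needed to encode `sample` after `train`."""
    return compressed_len(train + sample) - compressed_len(train)


def attribute(sources: dict[str, bytes], sample: bytes) -> str:
    """Return the name of the source after which `sample` compresses best."""
    return min(sources, key=lambda name: extra_cost(sources[name], sample))


if __name__ == "__main__":
    # Toy stand-ins for T1 (English), T2 (German) and the unknown T3.
    t1 = b"the quick brown fox jumps over the lazy dog and runs through the forest " * 20
    t2 = b"der schnelle braune fuchs springt ueber den faulen hund und rennt durch den wald " * 20
    t3 = b"the lazy dog runs through the forest and the quick brown fox jumps over " * 3
    print(attribute({"I1": t1, "I2": t2}, t3))
```

Because only the difference of two compressed lengths is used, any constant container overhead added by the archiver cancels out.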
This idea was proposed by Teahan [14,15] and was further developed by Ryabko and Savina (RS-method) [9,10,11]. In particular, in [9], this idea was applied to construct a statistical method for classifying texts, allowing one to determine the reliability of the obtained conclusions using mathematical statistics methods. The described scheme was also successfully applied by the authors of this paper to solve problems of text attribution in [10], where it was experimentally shown that the individual style of an author can be determined quite accurately based on 4 KB of their text (approximately two pages of text in Russian or English). Based on this fact, we will apply the same scheme to solve the problem of recognizing the author’s style of writers of different language groups.

2.2. Description of the RS-Method for Recognizing the Author’s Style of a Writer

In order to make the description more understandable, we will illustrate it with an example of constructing a method for determining the author’s style of English-language writers. Let N writers and their works T1, T2, …, TN be given.
Each text Ti is represented as two samples: a training sample Xi, i = 1, …, N, and an experimental (test) sample, which in turn consists of M parts (slices) denoted by Yij, i = 1, …, N; j = 1, …, M.
For the experimental work, we compiled a sample of texts from Beresford, Defoe, Jerome and Locke, so N = 4 and M = 16. From the works of these authors, we made 4 training samples X1, …, X4, each 64 KB in size. Then we made the test samples: 16 files Y1j, j = 1, …, 16, each 4 KB in size, from the works of Beresford; Y2j, j = 1, …, 16, from the works of Defoe; and, likewise, Y3j and Y4j, j = 1, …, 16, from the works of Jerome and Locke, respectively. Then the file Y1,1 was successively compressed together with each of the training samples X1, …, X4, and it was determined with which of them it was compressed best (i.e., d(X1 Y1,1) − d(X1), …, d(X4 Y1,1) − d(X4) were calculated and the i for which d(Xi Y1,1) − d(Xi) is minimal was found). All Yij, i = 1, …, 4; j = 1, …, 16, were processed similarly.
Table 1 presents the obtained data for the LZMA archiver, with a training sample (Xi) of 64 KB and a slice (Yij) of 4 KB.
Let us explain the meaning of these numbers. The first row corresponds to the 16 files Y1j, j = 1, …, 16: 14 of them were compressed best with X1 (i.e., 14 slices from Beresford’s works were compressed best with his own training sample), whereas 1 slice was closer to Jerome’s works and 1 slice to Locke’s; here, the recognition of the writer’s style is 14 out of 16. The 16 in the second row means that all 16 files Y2j, j = 1, …, 16, were compressed best with X2; in other words, D. Defoe’s style was recognized without error from 4 KB slices with a 64 KB training sample.
We will call the entire process of transition from the source texts T1, T2, …, TN to the table (of size N × N) the construction of a contingency table, and we will denote the contingency table itself as W(T1, T2, …, TN) or W (depending on the context) and represent this table as follows:
$$
W(T_1, T_2, \ldots, T_N) =
\begin{pmatrix}
t_{1,1} & t_{1,2} & \cdots & t_{1,N} \\
t_{2,1} & t_{2,2} & \cdots & t_{2,N} \\
\vdots & \vdots & \ddots & \vdots \\
t_{N,1} & t_{N,2} & \cdots & t_{N,N}
\end{pmatrix}
$$
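The construction of the contingency table W can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: Python's `lzma` module stands in for the archiver d, ties are resolved in favour of the lowest index, and the function names are hypothetical.

```python
import lzma


def extra_cost(train: bytes, sample: bytes) -> int:
    """d(train + sample) - d(train), with LZMA as the archiver d."""
    return len(lzma.compress(train + sample)) - len(lzma.compress(train))


def contingency_table(training: list[bytes], slices: list[list[bytes]]) -> list[list[int]]:
    """Build W, where W[i][k] counts the slices of writer i attributed to writer k.

    training[i] plays the role of X_i and slices[i][j] the role of Y_ij.
    """
    n = len(training)
    table = [[0] * n for _ in range(n)]
    for i, writer_slices in enumerate(slices):
        for y in writer_slices:
            costs = [extra_cost(x, y) for x in training]
            best = costs.index(min(costs))  # assumption: ties go to the lowest index
            table[i][best] += 1
    return table
```

For clarity the sketch recompresses each training sample for every slice; caching d(X_i) would avoid the repeated work without changing the result.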
In addition, for each W table, we calculated the value of Cramer’s coefficient V [16]. Here it should be noted that V is used to assess the relationship, or interdependence; it takes values from zero to one, and a higher value indicates a greater dependence or interrelationship.
We will explain its meaning in more detail together with the contingency table W. As we saw in the example, the numbers in the cells of the contingency table indicate the number of slices whose authorship was attributed to a specific writer. If the method works “correctly”, i.e., it correctly determines the author’s style by the slices, then the values in the table will be concentrated mainly on the main diagonal. Otherwise, when the slices do not reveal the author’s style of the writer, the values in the table will be evenly distributed among different cells related to different writers.
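This effect can be quantified using the Cramer V coefficient [16], which is calculated as follows. First, calculate
$$
P = \sum_{i=1}^{N} \sum_{j=1}^{N} t_{ij}, \qquad p_{ij} = \frac{t_{ij}}{P}, \qquad p_{i\cdot} = \sum_{j=1}^{N} p_{ij}, \qquad p_{\cdot j} = \sum_{i=1}^{N} p_{ij},
$$
then calculate the chi-square statistic
$$
x^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{\left(t_{ij} - P\, p_{i\cdot}\, p_{\cdot j}\right)^2}{P\, p_{i\cdot}\, p_{\cdot j}},
$$
where $P\, p_{i\cdot}\, p_{\cdot j}$ is the count expected in cell (i, j) if rows and columns were independent, and finally Cramer’s coefficient
$$
V = \sqrt{\frac{x^2}{P\,(N-1)}}.
$$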
For Table 1, Cramer’s coefficient V = 0.9.
Note that the Cramer coefficient V = 1 if all nondiagonal elements are equal to 0, and V is equal to 0 if all ti,j are equal.
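As a check on the definitions, the following sketch computes Cramer's V for a square contingency table using the standard chi-square form, with the expected count of a cell taken as the product of its row and column totals divided by the grand total. The function name `cramers_v` is hypothetical, and skipping cells with zero expected count is an assumption of this sketch. The values in `TABLE_1` are transcribed from Table 1.

```python
from math import sqrt


def cramers_v(table: list[list[int]]) -> float:
    """Cramer's V for an N x N contingency table."""
    n = len(table)
    total = sum(sum(row) for row in table)  # P in the text
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(n)) for j in range(n)]
    chi2 = 0.0
    for i in range(n):
        for j in range(n):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:  # assumption: cells with zero expectation contribute nothing
                chi2 += (table[i][j] - expected) ** 2 / expected
    return sqrt(chi2 / (total * (n - 1)))


# Rows and columns in the order Beresford, Defoe, Jerome, Locke (Table 1).
TABLE_1 = [
    [14, 0, 1, 1],
    [0, 16, 0, 0],
    [1, 0, 14, 1],
    [0, 0, 1, 15],
]

print(round(cramers_v(TABLE_1), 3))
```

A purely diagonal table gives V = 1 and a table with all cells equal gives V = 0, in line with the two limiting cases noted above.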
Now let us pay attention to the choice of archiver. There are quite a lot of them at present. For this purpose, we examined the BZIP2, DEFLATE and LZMA archivers on the same sample. It turned out that the LZMA archiver has the highest Cramer coefficient; henceforth, we used this archiver. In our experiments, compression was performed using the 7-Zip archiver; the reference implementation of LZMA was developed in [17]. (We will not describe this in detail, since similar calculations were performed in [11], see 2.3. “Selection of method parameters”.)
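A comparison of archivers along these lines could be set up as below. This is only a sketch: it uses Python's standard-library `bz2`, `zlib` (DEFLATE) and `lzma` modules as stand-ins, whereas the experiments in the paper were run with 7-Zip, so the byte counts will not match the authors'. The names `ARCHIVERS`, `extra_cost` and `costs_by_archiver` are hypothetical.

```python
import bz2
import lzma
import zlib

# Assumption: standard-library stand-ins for the three archivers named in the
# text; the paper used 7-Zip, so sizes here are not byte-identical to theirs.
ARCHIVERS = {
    "BZIP2": lambda data: bz2.compress(data, 9),
    "DEFLATE": lambda data: zlib.compress(data, 9),
    "LZMA": lambda data: lzma.compress(data, preset=9),
}


def extra_cost(compress, train: bytes, sample: bytes) -> int:
    """d(train + sample) - d(train) for the given compression function."""
    return len(compress(train + sample)) - len(compress(train))


def costs_by_archiver(train: bytes, sample: bytes) -> dict[str, int]:
    """Extra cost of `sample` after `train` under each archiver."""
    return {name: extra_cost(fn, train, sample) for name, fn in ARCHIVERS.items()}
```

One design point worth keeping in mind: DEFLATE can only refer back at most 32 KB, so with a 64 KB training sample only its most recent half can help when the slice is encoded. This may be one reason for differences between archivers, although the paper does not discuss it.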

3. Recognizing the Author’s Style of Writers in Different Language Groups

For our study, we selected 4 languages from different language groups belonging to different language families:
English (West Germanic language group of the Indo-European language family);
Amharic (Southern Ethiosemitic group of the Semitic language family);
Russian (East Slavic group of the Indo-European language family);
Chinese (Sinitic group of the Sino-Tibetan language family).
We note that we had already worked with texts in Russian and English in previous studies on determining the quality of translations and attribution of literary texts [10,11]. Therefore, we started with English. For this study, we selected the texts of the following works in English (see Table 2).
From each literary work, text pieces of 64 KB were taken for the training sample and text pieces of 64 KB for the test sample. Each sample was divided into 16 fragments (4 KB slices). Each fragment was added to the training sample in turn; the number of recognized fragments was recorded in the table. The results are presented in Table 3. The writers are presented by numbers.
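The preparation step described above can be sketched as follows. The paper does not say how the pieces were chosen within a work, so this sketch assumes that the training sample is the first 64 KB of the text and that the 16 test slices follow it directly; `split_samples` and its parameters are hypothetical names.

```python
KB = 1024


def split_samples(text: bytes, train_kb: int = 64, slice_kb: int = 4, n_slices: int = 16):
    """Return (training sample, list of test slices) cut from one work.

    Assumption: the training sample is the first `train_kb` KB and the test
    slices follow it directly; the paper does not say how pieces were chosen.
    """
    train_size = train_kb * KB
    slice_size = slice_kb * KB
    needed = train_size + n_slices * slice_size
    if len(text) < needed:
        raise ValueError(f"need at least {needed} bytes, got {len(text)}")
    training = text[:train_size]
    slices = [
        text[train_size + k * slice_size : train_size + (k + 1) * slice_size]
        for k in range(n_slices)
    ]
    return training, slices
```

Slicing is done on bytes, so in a multi-byte encoding a slice boundary may fall inside a character; since the archiver also operates on bytes, this does not prevent compression.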
The table shows that only two writers, Humphry Ward and Olive Schreiner, each had one slice attributed to the style of another writer. In George Eliot’s texts, two fragments out of 16 were attributed to Kipling. All other writers had their styles recognized correctly in all 16 slices. The Cramer coefficient is close to 1: V = 0.992.
To study the author style of Russian-speaking writers, 16 literary works by Russian writers of the late 19th–early 20th centuries were selected (see Table 4).
The preprocessing of the Russian texts was exactly the same as for the English novels: a 64 KB test sample was divided into 16 fragments (slices) of 4 KB. These 16 fragments were added to the training sample one by one for compression. The number of recognized slices was recorded in a table. The results are in Table 5.
The table shows a well-built diagonal consisting of recognized fragments of the author’s styles. However, Valery Bryusov’s texts were recognized in 10 fragments out of 16. This phenomenon has its own explanation. Bryusov is an outstanding Russian poet and the founder of Russian symbolism. His historical novel The Altar of Victory was the first prose work of the outstanding poet. The novel was dedicated to the Roman Empire during the era of its collapse. Apparently, his author’s style had not yet been formed; it contained many imitations and quotations. Bryusov used citations from 34 ancient poetic sources of various lyrical genres, and also accompanied the novel with notes occupying more than 100 of the 400 pages of text.
At the next stage of our research, we turned to the Chinese language. Chinese is a unique language with a rich history. It has features that are not found in other languages. As Chinese language experts note, the Chinese language contains many idioms [18,19]. An idiom is a stable figure of speech used as a single whole, forming a phraseological fusion [19]. An idiom can consist of 1–4 hieroglyphs. Each of the hieroglyphs carries its own semantic load, and together they form a single figure of speech [18]. An idiom is one indivisible lexical unit. Literary texts contain a large number of idioms. Idioms are written in hieroglyphs. Chinese writing is a logographic writing system in which symbols (logogram-hieroglyphs) [18,19] represent whole words or morphemes, but not individual sounds and letters [19]. Unlike phonetic writing, each hieroglyph is assigned not only a phoneme, but also a meaning, so the number of signs in Chinese writing is very large [20]. For our study, we selected literary works written in the official language, Putonghua (Mandarin) (see Table 6).
The texts were processed using the method already tested on English and Russian. We prepared a training sample of 64 KB and a test sample of 64 KB, the latter divided into 16 fragments of 4 KB. Each fragment was then added to the training sample for compression. The results are presented in Table 7.
As can be seen from the table, the results are very similar to those for the texts in Russian and English. The clearly dominant diagonal shows that the author’s style was recognized for all writers; misattributed slices occur only for writers 4 (Lu Xun) and 11 (Jin Yucheng).
The next language chosen for our study was Amharic. Amharic (አማርኛ) is the language of the Amhara people; it belongs to the Semitic family of languages [21]. For many years, Amharic was the official language of Ethiopia; now it has the status of being the working language of the government. About 25 million people in Ethiopia speak Amharic. The language is also widespread among some of the peoples of neighboring states: in Eritrea, Somalia, and Sudan [21]. It should be noted that more than 3 million emigrants speak Amharic outside of Ethiopia in the USA, Canada, Sweden and Israel. Amharic is used in business communications, in government agencies and in education. Newspapers, magazines and books are published in it. The list of literary works selected for the study is presented in Table 8.
Preliminary work with Amharic texts was the same as with other languages presented in the study. Two samples of 64 KB texts were formed: a training sample and a test experimental sample. Both samples were divided into 16 fragments of 4 KB. Then, 4 KB slices from the test sample were added one-by-one to the training sample for compression. After text compression, the results were entered into a table. The results are presented in Table 9.
The results of the study show that the RS-method of recognizing author’s style also works in the Amharic language. The Amharic language is a unique language that has a number of specific features. For example, the Amharic alphabet consists of 28 consonants and 7 vowels, but the writing system has special signs and combinations that bring the number of sounds to 200 [21]. The Amharic alphabet, also known as the Ethiopian script, is a syllabic script in which each sign represents a combination of a consonant and a vowel [22]. Despite the uniqueness and complexity of the Amharic language, Cramer’s coefficient is almost the same as that of the other languages we have considered.

4. Conclusions

The conducted study on the corpora of texts in different languages from four language groups showed that it is quite possible to determine author’s style using the RS-method. The main finding of the study (which was not known before) is the discovery of a new scientific fact: the same amount of data is required to recognize the author’s style of a writer in different languages that are culturally, historically and grammatically distant from each other. A completely natural question arises about the stability of these conclusions given different volumes of the training sample and sizes of the “slice”. It is natural to assume that with an increase in each of these parameters, and with their joint increase, the Cramer coefficient should increase. We deliberately conducted experiments on different sample sizes, similar to the one described above, and the results confirmed this assumption (see Table 10).
As can be seen from the table, the degree of change in the values of the Cramer coefficient remains approximately the same for all the languages considered, which confirms the conclusion that the amount of data required to recognize the author’s style in different languages from different language groups is almost the same or invariant.
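One rough way to compare how much the coefficient varies per language is to take the difference between the largest and smallest value of V across the nine parameter settings. The sketch below does this with the values transcribed from Table 10; the choice of max minus min as the measure of variation is an assumption of this sketch, not a statistic used in the paper.

```python
# Cramer's V values transcribed from Table 10, in the order
# (training KB, slice KB) = (96, 8), (96, 4), (96, 2),
#                           (64, 8), (64, 4), (64, 2),
#                           (48, 8), (48, 4), (48, 2).
TABLE_10 = {
    "Amharic": [0.928, 0.914, 0.916, 0.919, 0.892, 0.873, 0.913, 0.912, 0.887],
    "English": [1.0, 0.994, 0.983, 1.0, 0.992, 0.98, 1.0, 0.990, 0.942],
    "Russian": [1.0, 0.991, 0.976, 1.0, 0.993, 0.97, 1.0, 0.981, 0.982],
    "Chinese": [1.0, 0.997, 0.978, 0.992, 0.984, 0.979, 0.971, 0.960, 0.952],
}


def spread(values: list[float]) -> float:
    """Difference between the largest and smallest V, rounded to 3 decimals."""
    return round(max(values) - min(values), 3)


SPREADS = {language: spread(values) for language, values in TABLE_10.items()}

for language, value in SPREADS.items():
    print(language, value)
```

By this crude measure the variation lies between 0.03 (Russian) and 0.058 (English), with Amharic at 0.055 and Chinese at 0.048.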
Let us now discuss the possible applications of the developed method. Some of them are practical, while others are theoretical and even philosophical in nature.
Among the practical tasks, we will mention the detection of plagiarism and the determination of authorship. Among the theoretical tasks are issues related to artificial intelligence systems being capable of maintaining dialogue with people and/or creating texts on specific topics. An interesting question is whether different artificial intelligence systems have their own authorial style. And if so, is it possible to build an artificial intelligence system without an authorial style (or with a hidden authorial style)? Another related question is whether there is a certain level of complexity needed for a system to be capable of maintaining a dialogue with a human being, above which the system must have its own unique style. Perhaps the approach proposed here could become a tool for investigating such problems.

Author Contributions

Conceptualization, B.R. and N.S.; methodology, B.R.; software, Y.G.L.; validation, B.R. and Y.G.L.; formal analysis, N.S.; investigation, B.R.; resources, Y.G.L. and Y.H.; data curation, B.R.; writing—original draft preparation, Y.G.L. and Y.H.; writing—review and editing, N.S.; visualization, N.S.; supervision, B.R.; project administration, B.R.; funding acquisition, B.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This work was supported by the State Assignment of Ministry of Science and Higher Education of Russian Federation for Federal Research Center for Information and Computational Technologies.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Holmes, D.I. The Analysis of Literary Style - A Review. J. R. Stat. Soc. Ser. A (Gen.) 1985, 148, 328–341.
  2. Ray, B. Style: An Introduction to History, Theory, Research, and Pedagogy; University Press of Colorado: Fort Collins, CO, USA; WAC Clearinghouse: Fort Collins, CO, USA, 2015; ISBN 978-1-60235-614-6.
  3. Aquilina, M. The Event of Style in Literature; Palgrave Macmillan: London, UK, 2014.
  4. Can, F.; Patton, J.M. Change of writing style with time. Comput. Humanit. 2004, 38, 61–82.
  5. Zheng, R.; Li, J.; Chen, H.; Huang, Z. A framework for authorship identification of online messages: Writing-style features and classification techniques. J. Am. Soc. Inf. Sci. Technol. 2006, 57, 378–393.
  6. Nguyen, T.; Dinh, D. An Empirical Investigation of Authorial Writing Styles Based on a Vietnamese Corpus. Open J. Mod. Linguist. 2021, 11, 967–982.
  7. Kolmogorov, A.N. Three approaches to quantitative definition of information. Probl. Inf. Transm. 1965, 1, 3–11.
  8. Ryabko, B.; Astola, J.; Malyutov, M. Compression-Based Methods of Statistical Analysis and Prediction of Time Series; Springer: New York, NY, USA, 2016; pp. 122–130.
  9. Ryabko, B. Using data-compressors for statistical analysis of problems on homogeneity testing and classification. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017.
  10. Ryabko, B.; Savina, N. Using Data Compression to Build a Method for Statistically Verified Attribution of Literary Texts. Entropy 2021, 23, 1302.
  11. Ryabko, B.; Savina, N. Information-Theoretical Method for Assessing the Quality of Translations. Entropy 2022, 24, 1739.
  12. Cilibrasi, R.; Vitanyi, P. Clustering by compression. IEEE Trans. Inf. Theory 2005, 51, 1523–1545.
  13. Cilibrasi, R.; Vitanyi, P.; De Wolf, R. Algorithmic clustering of music based on string compression. Comput. Music 2004, 28, 49–67.
  14. Teahan, W.J.; Harper, D.J. Using compression-based language models for text categorization. In Language Modeling for Information Retrieval; The Springer International Series on Information Retrieval; Springer: Dordrecht, The Netherlands, 2003; Volume 13, pp. 83–88.
  15. Teahan, W.J.; Wen, Y.Y.; McNabb, R.; Witten, I.H. Using compression models to segment Chinese text. Comput. Linguist. 2000, 26, 375–393.
  16. Kendall, M.; Stuart, A. Inference and relationship. In The Advanced Theory of Statistics; Hafner Publisher: London, UK, 1961; Volume 2.
  17. Pavlov, I. 7-Zip Compression Utility. 1999. Available online: https://www.7-zip.org/ (accessed on 24 August 2025).
  18. Yong, H.; Peng, J. Chinese Lexicography: A History from 1046 BC to AD 1911; OUP Oxford: Oxford, UK, 2008; ISBN 978-0-19-953982-6.
  19. Norman, J. Chinese; Cambridge University Press: Cambridge, UK, 1988; ISBN 978-0-7007-1129-1.
  20. Hilary, M. Diversity in Sinitic Languages; Oxford University Press: Oxford, UK, 2015; ISBN 978-0-19-872379-0.
  21. Hartmann, J. Amharische Grammatik; Äthiopische Forschungen; Steiner: Wiesbaden, Germany, 1980.
  22. Meyer, R. Amharic as lingua franca in Ethiopia. Lissan J. Afr. Lang. Linguist. 2006, 20, 117–131.
Table 1. Recognition of the author’s style of the writers.

| Writers | Beresford | Defoe | Jerome | Locke |
|---|---|---|---|---|
| John Davys Beresford | 14 | 0 | 1 | 1 |
| Daniel Defoe | 0 | 16 | 0 | 0 |
| Jerome Klapka Jerome | 1 | 0 | 14 | 1 |
| John Locke | 0 | 0 | 1 | 15 |
Table 2. List of literary works in English selected for the study.

| No. | Author Names | Book Titles | Published Year |
|---|---|---|---|
| 1. | Florence L. Barclay | The White Ladies of Worcester | 1917 |
| 2. | Arnold Bennet | Imperial Palace | 1930 |
| 3. | R. D. Blackmore | A Tale of the Great War | 1887 |
| 4. | Frances Hodgson Burnett | The Secret Garden | 1911 |
| 5. | Gilbert Keith Chesterton | The Innocence of Father Brown | 1911 |
| 6. | Arthur Conan Doyle | The Lost World | 1912 |
| 7. | George Eliot | Felix Holt, the Radical | 1866 |
| 8. | Ford Madox Ford | The Good Soldier | 1915 |
| 9. | John Galsworthy | Over the River | 1933 |
| 10. | George Gissing | Will Warburton | 1903 |
| 11. | Rudyard Kipling | Kim | 1901 |
| 12. | D. H. Lawrence | Women in Love | 1920 |
| 13. | Humphry Ward | Harvest | 1920 |
| 14. | Virginia Woolf | To the Lighthouse | 1927 |
| 15. | Schreiner Olive | Undine | 1929 |
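Table 3. Recognition of authorial styles of English writers.
V = 0.992

| Writer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 |
| 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 |
| 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 |
| 13 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0 |
| 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 |
| 15 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 |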
Table 4. List of literary works by Russian writers selected for research.

| No. | Author Names | Book Titles | Published Year |
|---|---|---|---|
| 1. | Mikhail Bulgakov | The White Guard | 1925 |
| 2. | Anton Chekhov | Lady with a Dog | 1898 |
| 3. | Fyodor Dostoevsky | Demons | 1872 |
| 4. | Fyodor Sologub | Drops of Blood | 1905 |
| 5. | Valery Bryusov | Altar of Victory | 1912 |
| 6. | Zinaida Gippius | Devil’s Doll | 1911 |
| 7. | Nikolai Gogol | Dead Souls | 1842 |
| 8. | Maxim Gorky | The Artamonov Business | 1925 |
| 9. | Alexander Herzen | Who is to Blame? | 1846 |
| 10. | Vladimir Nabokov | The Gift | 1938 |
| 11. | Avdotya Panaeva | The Talnikov Family | 1928 |
| 12. | Alexander Pushkin | The Captain’s Daughter | 1836 |
| 13. | Lev Tolstoy | Resurrection | 1899 |
| 14. | Ivan Turgenev | On the Eve | 1860 |
| 15. | Mikhail Lermontov | Hero of our time | 1840 |
| 16. | Maria Zhukova | Dacha on Peterhof Road | 1845 |
Table 5. Recognition of authorial styles of Russian writers.
Table 5. Recognition of authorial styles of Russian writers.
V = 0.993
12345678910111213141516
116000000000000000
201400010000000100
300160000000000000
400016000000000000
500001000000010014
600000160000000000
700000016000000000
800000001600000000
900000000160000000
1000000000016000000
1100000000001600000
1200000000000160000
1300000000000015100
1400000000000001600
1500000000000000160
1600000000000000016
Table 6. List of authors and works in Chinese.

| No. | English Author Name | English Title | Chinese Author Name | Chinese Title | First Published | Revised Edition |
|---|---|---|---|---|---|---|
| 1 | Lao She | Rickshaw Boy | 老舍 | 骆驼祥子 | 1939 | 2010 |
| 2 | Liu Cixin | The Three-Body Problem | 刘慈欣 | 三体 | 2008 | 2020 |
| 3 | Zhang Ailing (Eileen Chang) | Half a Lifelong Romance | 张爱玲 | 半生缘 | 1951 | 2014 |
| 4 | Lu Xun | Call to Arms | 鲁迅 | 呐喊 | 1923 | 2000 |
| 5 | Qian Zhongshu | Fortress Besieged | 钱钟书 | 围城 | 1947 | 2003 |
| 6 | Lu Yao | Ordinary World | 路遥 | 平凡的世界 | 1986 | 2017 |
| 7 | Yu Hua | To Live | 余华 | 活着 | 1993 | 2014 |
| 8 | Wang Anyi | The Song of Everlasting Sorrow | 王安忆 | 长恨歌 | 1996 | 2008 |
| 9 | Mo Yan | Life and Death Are Wearing Me Out | 莫言 | 生死疲劳 | 2006 | 2012 |
| 10 | Jia Pingwa | The Qin Opera | 贾平凹 | 秦腔 | 2005 | 2016 |
| 11 | Jin Yucheng | Blossoms | 金宇澄 | 繁花 | 2013 | 2023 |
| 12 | Shen Congwen | Border Town | 沈从文 | 边城 | 1934 | 2009 |
| 13 | A Lai | Red Poppies: A Novel of Tibet | 阿来 | 尘埃落定 | 1998 | 2002 |
| 14 | Chen Zhongshi | White Deer Plain | 陈忠实 | 白鹿原 | 1993 | 2023 |
| 15 | Chi Zijian | The Last Quarter of the Moon | 迟子建 | 额尔古纳河右岸 | 2005 | 2013 |
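Table 7. Recognition of the author’s styles of Chinese writers.
Cramer’s V = 0.984

| Writer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 4 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 |
| 11 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 |
| 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 |
| 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
| 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 |
| 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 |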
Table 8. List of authors and literary works in Amharic.

| No. | Author Name in English | Book Titles in English | Author Name in Amharic | Book Titles in Amharic | Published Year |
|---|---|---|---|---|---|
| 1. | Haddis Alemayehu | Love to the Grave | ሀዲስ አለምየሁ | ፍቅር እስከ መቃብር | 1968 |
| 2. | Bealu Girma | Oromaye | በዓሉ ግርማ | ኦሮማይ | 1983 |
| 3. | Mammo Wudneh | Eye of the Needle | ማሞ ውድነህ | ሾተላይ | 1981 |
| 4. | Tsegaye Gabre-Medhin | Makbeth | ጸጋዬ ገብረመድህን | ማክቤዝ | 1972 |
| 5. | Kebede Michael | A prophetic Appointment | ከበደ ሚካኤል | የትንቢት ቀጠሮ | 1959 |
| 6. | Alemayehu Wassie | Emegoa | አለማየሁ ዋሴ | እመጎ | 2008 |
| 7. | Sahle Sellassie Berhane Mariam | Mr. Ketaw | ሣህለሥላሴ ብርሃነማርያም | ባሻ ቅጣው | 1976 |
| 8. | Adam Reta | Mehalet | አዳም ረታ | ማህሌት | 2002 |
| 9. | Tekletsadik Mekuria | Emperor Menilik and Ethiopian Unity | ተክለ ጻድቅ መኩሪያ | ዐፄ ምኒልክ እና የኢትዮጵያ አንድነት | 1983 |
| 10. | Muluken Tariku | Emperor Minilik and Adwa victory | ሙሉቀን ታሪኩ | አፄ ምኒልክ እና የአድዋ ድል | 2006 |
| 11. | Afework Gebereyesus | Tobia | አፈወርቅ ገ/ኢየሱስ | ጦቢያ | 1900 |
| 12. | Aleqa Taye | Ethiopian History | አለቃ ታየ | የኢትዮጵያ ህዝብ ታሪክ | 1914 |
| 13. | Bahru Zewde | Modern Ethiopia | ባህሩ ዘውዴ | ዘመናዊ የኢትዮጵያ ታሪክ | 1999 |
| 14. | Berhanu Zereyehun | The Tear of Tewodros | ብርሃኑ ዘርይሁን | የቴድሮስ ዕምባ | 1960 |
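Table 9. Results of recognition of the author’s styles of writers in Amharic.
Cramer’s V = 0.892

| Writer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 2 | 0 | 0 | 1 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 13 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 9 | 0 | 0 | 1 | 0 |
| 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 |
| 12 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 |
| 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 |
| 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 |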
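Table 10. Parameters’ efficiency comparisons.

| Parameters | Language | Training Sample Size (KB) | Test Sample Size (KB) | Cramer V |
|---|---|---|---|---|
| Parameter 1 | Amharic | 96 | 8 | 0.928 |
| | English | 96 | 8 | 1 |
| | Russian | 96 | 8 | 1 |
| | Chinese | 96 | 8 | 1 |
| | Amharic | 96 | 4 | 0.914 |
| | English | 96 | 4 | 0.994 |
| | Russian | 96 | 4 | 0.991 |
| | Chinese | 96 | 4 | 0.997 |
| | Amharic | 96 | 2 | 0.916 |
| | English | 96 | 2 | 0.983 |
| | Russian | 96 | 2 | 0.976 |
| | Chinese | 96 | 2 | 0.978 |
| Parameter 2 | Amharic | 64 | 8 | 0.919 |
| | English | 64 | 8 | 1 |
| | Russian | 64 | 8 | 1 |
| | Chinese | 64 | 8 | 0.992 |
| | Amharic | 64 | 4 | 0.892 |
| | English | 64 | 4 | 0.992 |
| | Russian | 64 | 4 | 0.993 |
| | Chinese | 64 | 4 | 0.984 |
| | Amharic | 64 | 2 | 0.873 |
| | English | 64 | 2 | 0.98 |
| | Russian | 64 | 2 | 0.97 |
| | Chinese | 64 | 2 | 0.979 |
| Parameter 3 | Amharic | 48 | 8 | 0.913 |
| | English | 48 | 8 | 1 |
| | Russian | 48 | 8 | 1 |
| | Chinese | 48 | 8 | 0.971 |
| | Amharic | 48 | 4 | 0.912 |
| | English | 48 | 4 | 0.990 |
| | Russian | 48 | 4 | 0.981 |
| | Chinese | 48 | 4 | 0.960 |
| | Amharic | 48 | 2 | 0.887 |
| | English | 48 | 2 | 0.942 |
| | Russian | 48 | 2 | 0.982 |
| | Chinese | 48 | 2 | 0.952 |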