2.1. The Idea of the Method
The method of recognizing an author’s style is based on lossless compression algorithms, implemented in the form of so-called archivers. Their purpose is to encode a text so that the encoded message is shorter than the original (the text is compressed) and, if necessary, the encoded text can be decoded back into the original. Text data is fed to the input of the archiver, which encodes it into a file of shorter length, i.e., compresses it. Compression is possible because archivers exploit unevenness in the frequencies of letters and words and hidden patterns, drawing on the theory of formal grammars and the laws of information transmission.
Let us briefly describe the scheme of the developed method. Consider three texts, T1, T2, T3, where it is known that T1 and T2 were generated by different sources of information, I1 and I2, and T3 was generated by either I1 or I2 (for example, T1 is a text in English, T2 is in German, and T3 is in English or German). Let d be some archiver; if it is applied to a file X, the length of the compressed file is denoted by d(X). First, the texts are combined into the pairs T1T3 and T2T3, and both pairs are compressed. Then files T1 and T2 are compressed separately, after which the differences in the lengths of the compressed files are calculated: d(T1T3) − d(T1) and d(T2T3) − d(T2). If d(T1T3) − d(T1) < d(T2T3) − d(T2), we conclude that the text T3 was generated by the information source I1; if d(T1T3) − d(T1) > d(T2T3) − d(T2), then T3 was generated by the information source I2. The reason is that, when compressing the later part of a file, i.e., T3, the archiver uses the statistical features it found while compressing the earlier part, namely T1 or T2. Therefore, T3 is compressed more effectively when it is preceded by text from the same information source.
The following simple example explains the essence of this method. Let T1 be a text in English, T2 a text in German, and T3 (whose source is treated as unknown) also a text in English. Then d(T1T3) − d(T1) will be less than d(T2T3) − d(T2) because, in the first case, T3 in T1T3 is compressed after the archiver has been “tuned” to “its own” statistics. (For texts in English and German, the method works flawlessly with lengths of several hundred letters for T1, T2 and T3.)
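The scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s exact setup: it uses the standard-library lzma module as the archiver d, and the function names are ours.

```python
import lzma

def d(data: bytes) -> int:
    """Length of the LZMA-compressed data: the archiver's output size."""
    return len(lzma.compress(data))

def attribute(t1: bytes, t2: bytes, t3: bytes) -> int:
    """Return 1 if T3 is attributed to the source of T1, else 2."""
    delta1 = d(t1 + t3) - d(t1)  # extra cost of compressing T3 after T1
    delta2 = d(t2 + t3) - d(t2)  # extra cost of compressing T3 after T2
    return 1 if delta1 < delta2 else 2
```

With T1 a few hundred bytes of English, T2 a few hundred bytes of German, and T3 a shorter English sample, `attribute(t1, t2, t3)` returns 1, since the archiver compresses T3 more cheaply after being “tuned” to statistics of the same source.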
This idea was proposed by Teahan [14,15] and was further developed by Ryabko and Savina (the RS-method) [9,10,11]. In particular, in [9] this idea was applied to construct a statistical method for classifying texts, which allows one to assess the reliability of the obtained conclusions using methods of mathematical statistics. The described scheme was also successfully applied by the authors of this paper to problems of text attribution in [10], where it was experimentally shown that the individual style of an author can be determined quite accurately from 4 KB of their text (approximately two pages of text in Russian or English). Based on this fact, we will apply the same scheme to the problem of recognizing the author’s style for writers of different language groups.
2.2. Description of the RS-Method for Recognizing the Author’s Style of a Writer
In order to make the description more understandable, we will illustrate it with an example of constructing a method for determining the author’s style of English-language writers. Let N writers and their works T1, T2, …, TN be given.
Each text Ti is split into two samples: a training sample (Xi, i = 1, …, N) and an experimental sample, which in turn consists of M parts (slices), denoted Yij, i = 1, …, N; j = 1, …, M.
For the experimental work, we compiled a sample of texts by Beresford, Jerome, Defoe and Locke (N = 4, M = 16). From the works of these authors, we made 4 training samples X1, …, X4, each 64 KB in size. Then we made the test samples: 16 files Y1j, j = 1, …, 16, each 4 KB in size, from the works of Defoe; 16 files Y2j, j = 1, …, 16, from the works of Beresford; and likewise Y3j and Y4j, j = 1, …, 16, from the works of Jerome and Locke, respectively. The file Y1,1 was then successively “compressed” with each of the training samples X1, …, X4, and it was determined with which of them it was compressed “best” (i.e., d(X1Y1,1) − d(X1), …, d(X4Y1,1) − d(X4) were calculated, and the i for which d(XiY1,1) − d(Xi) is minimal was found). All Yij, i = 1, …, 4; j = 1, …, 16, were processed similarly.
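The attribution loop over training samples and slices can be sketched as follows. Again this is our own illustrative code, with stdlib lzma standing in for the archiver; the function and variable names are assumptions, not the authors’ implementation.

```python
import lzma

def d(data: bytes) -> int:
    """Length of the LZMA-compressed data."""
    return len(lzma.compress(data))

def best_match(training: list[bytes], slice_: bytes) -> int:
    """Index i minimizing d(Xi + Y) - d(Xi): the author whose training
    sample lets the slice compress best."""
    deltas = [d(x + slice_) - d(x) for x in training]
    return deltas.index(min(deltas))

def contingency_table(training: list[bytes],
                      slices: list[list[bytes]]) -> list[list[int]]:
    """table[i][k] counts how many of author i's slices were attributed
    to author k (the N x N contingency table W)."""
    n = len(training)
    table = [[0] * n for _ in range(n)]
    for i, author_slices in enumerate(slices):
        for y in author_slices:
            table[i][best_match(training, y)] += 1
    return table
```

When the method works well, the counts concentrate on the main diagonal of the returned table, exactly as described for Table 1 below.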
Table 1 presents the obtained data for the LZMA archiver, with a training sample (Xi) of 64 KB and a slice (Yij) of 4 KB.
Let us explain the meaning of these numbers. The 16 in the upper left corner means that all 16 files Y1j, j = 1, …, 16, were compressed better with X1; in other words, all 16 slices from Defoe’s works were compressed better with the training set of his works. This result shows that D. Defoe’s authorial style is uniquely recognized from a 4 KB slice with a 64 KB training set. The numbers in the second row mean that, of the 16 files Y2j, j = 1, …, 16, 14 slices were compressed better with X2 (i.e., 14 slices from Beresford’s works were compressed better with his training set; however, 1 slice was more similar to Jerome’s works and 1 slice to Locke’s works; here, the writer’s style is recognized in 14 cases out of 16).
We will call the entire process of transition from the source texts T1, T2, …, TN to the table (of size N × N) the construction of a contingency table; we will denote the contingency table itself by W(T1, T2, …, TN) or W (depending on the context) and represent it as follows:

$$
W(T_1, T_2, \ldots, T_N) =
\begin{pmatrix}
t_{1,1} & t_{1,2} & \cdots & t_{1,N} \\
t_{2,1} & t_{2,2} & \cdots & t_{2,N} \\
\vdots  & \vdots  & \ddots & \vdots  \\
t_{N,1} & t_{N,2} & \cdots & t_{N,N}
\end{pmatrix}
$$
In addition, for each table W we calculated the value of Cramer’s coefficient V [16]. Here it should be noted that V is used to assess the relationship, or interdependence, between two variables; it takes values from zero to one, and a higher value indicates a stronger dependence or interrelationship.
We will explain its meaning in more detail together with the contingency table W. As we saw in the example, the numbers in the cells of the contingency table indicate the number of slices whose authorship was attributed to a specific writer. If the method works “correctly”, i.e., it correctly determines the author’s style by the slices, then the values in the table will be concentrated mainly on the main diagonal. Otherwise, when the slices do not reveal the author’s style of the writer, the values in the table will be evenly distributed among different cells related to different writers.
This effect can be quantified using Cramer’s V coefficient [16], which is calculated as follows: first, calculate the total number of observations

$$P = \sum_{i=1}^{N} \sum_{j=1}^{N} t_{i,j},$$

then calculate the chi-squared statistic

$$\chi^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{(t_{i,j} - e_{i,j})^2}{e_{i,j}}, \qquad e_{i,j} = \frac{\left(\sum_{k=1}^{N} t_{i,k}\right)\left(\sum_{k=1}^{N} t_{k,j}\right)}{P},$$

and, finally, Cramer’s coefficient

$$V = \sqrt{\frac{\chi^2}{P\,(N-1)}}.$$
For Table 1, Cramer’s coefficient V = 0.9.
Note that the Cramer coefficient V = 1 if all nondiagonal elements are equal to 0, and V is equal to 0 if all ti,j are equal.
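The computation of Cramer’s V for an N × N contingency table can be sketched as follows (a straightforward transcription of the formulas above, not the authors’ code):

```python
import math

def cramers_v(table: list[list[int]]) -> float:
    """Cramer's V for an N x N contingency table of counts t[i][j]."""
    n = len(table)
    total = sum(sum(row) for row in table)                     # P
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(n)) for j in range(n)]
    chi2 = 0.0
    for i in range(n):
        for j in range(n):
            e = row_sums[i] * col_sums[j] / total              # expected count
            if e > 0:
                chi2 += (table[i][j] - e) ** 2 / e
    return math.sqrt(chi2 / (total * (n - 1)))
```

As a sanity check, a purely diagonal table (perfect recognition) gives V = 1, and a table with all cells equal gives V = 0, matching the note above.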
Now let us turn to the choice of archiver, of which there are currently quite a lot. For this purpose, we examined the BZIP2, DEFLATE and LZMA archivers on the same sample. It turned out that the LZMA archiver yields the highest Cramer coefficient, so henceforth we used this archiver. In our experiments, compression was performed using the 7-Zip archiver; the reference implementation of LZMA was developed in [17]. (We will not describe this in detail, since similar calculations were performed in [11]; see Section 2.3, “Selection of Method Parameters”.)
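The three algorithm families examined here are all available in Python’s standard library (bz2 for BZIP2, zlib for DEFLATE, lzma for LZMA), so a comparison of compressed sizes on the same sample can be sketched as below. Note that this is only an illustration; the experiments in the paper used the 7-Zip archiver itself.

```python
import bz2
import lzma
import zlib

def compressed_sizes(data: bytes) -> dict[str, int]:
    """Compressed length of the same data under the three algorithms."""
    return {
        "BZIP2": len(bz2.compress(data)),
        "DEFLATE": len(zlib.compress(data, 9)),
        "LZMA": len(lzma.compress(data)),
    }

# A repetitive sample; any text file read as bytes works the same way.
sample = ("It was the best of times, it was the worst of times. " * 200).encode()
sizes = compressed_sizes(sample)
```

In the paper, of course, the archivers were compared not by raw compressed size but by the Cramer coefficient of the contingency tables they produce.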