1. Introduction
Authorship analysis is widely used to identify the author of an anonymous or disputed literary work (e.g., in copyright disputes), to verify the authorship of suicide notes, to determine whether an anonymous message or statement was written by a known terrorist, to identify the authors of malicious computer programs such as viruses and malware, and to identify the authors of Internet texts such as e-mails, blog posts, and forum messages [1,2].
One of the important classes of problems in the authorship analysis of texts is authorship recognition. Studies on text authorship recognition differ in the types of text features used (i.e., the stylistic characteristics of the authors reflected in the texts), the genre and size of the texts, the language, the authorship identification approach, etc. Among the scientific studies on identifying the authors of texts, several works address the identification of the authors of newspaper texts [3,4,5,6,7,8,9,10] and of literary works [11,12,13,14,15].
Machine learning methods and models are widely used in text author recognition, among them the support vector machine [4,5,6,11,16,17,18], naive Bayes [7,8,16], random forest [5,6,8,19], k-nearest neighbors [5,6,7], and artificial neural networks [15,20,21]. Author recognition has been studied for texts in many languages, e.g., Arabic, Chinese, Dutch, English, German, Greek, Russian, Spanish, Turkish, and Ukrainian [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,22,23,24,25,26]. Different types of text features can be used in author identification, for example, frequencies of selected words or frequencies of tags/signs that replace words (e.g., part-of-speech tags) and of their n-grams [3,4,6,8,14,16,22,23,27], frequencies of character n-grams [5,7,9,10,12,17,24,25,28], and frequencies of word lengths [13].
In the present study, a comparative analysis of the effectiveness of different machine learning methods and models, combined with different groups of text features, for recognizing the authorship of texts was carried out based on the results of computer experiments on large and small literary works by several Azerbaijani writers.
The rest of this article is structured as follows. Section 2 discusses the considered authorship recognition problem. Section 3 presents the text feature types used in the study. Section 4 describes the feature selection procedures. Section 5 details the feature groups. Section 6 describes the dataset and the characteristics of the recognition methods. Section 7 analyzes and discusses the results of the computer experiments. Finally, Section 8 draws our conclusions and provides several ideas for future work.
2. Purpose of the Study
There are certain categories of problems in the direction of authorship analysis of texts, namely:
Authorship verification of texts: determining whether a given text was written by a certain person.
Authorship attribution of texts (author recognition of texts, authorship identification): determining the author of a given text from among predetermined, suspected candidate authors.
Plagiarism detection: detecting similarities between two given texts.
Author profiling: detecting the author's characteristics or building an author profile (age, gender, education level, etc.) based on the given text.
Detection of stylistic inconsistencies within a text: detecting parts of a text, written by more than one author, that do not correspond to the general writing style of the text.
The purpose of the study is to solve the problem of determining the author of a given text from among a priori known candidate authors; it is possible that the true author of the text is not among these candidates. In the considered research study, a set of texts written by the candidate authors is available. The author of a text is determined using computer models that compare assessments of the quantitative stylistic characteristics of that text with the quantitative stylistic characteristics of the candidate authors' texts.
This evaluation can be performed with the help of machine learning methods. Each machine learning method can be used with different sets of text features. In the present study, frequencies of word lengths, frequencies of sentence lengths, frequencies of character n-grams, statistical characteristics of character n-gram frequencies within a text, and frequencies of selected words were used, where word length, sentence length, and character n-gram refer to the number of letters in a word, the number of words in a sentence, and a given combination of given characters, respectively. Also, in addition to the frequencies of the selected words in a given text whose author is to be determined, the frequencies of these words in the candidate authors' texts in the training set were used in some feature groups as text features. The generated feature sets were used with the artificial neural network, support vector machine, and random forest machine learning methods and models. In addition, a character n-gram frequency group was also used with a convolutional neural network in the form of a two-dimensional matrix.
A comparative analysis of different feature sets with different machine learning methods and models was performed based on the results of computer experiments conducted on the example of works of literary fiction in the Azerbaijani language by eleven Azerbaijani authors.
The results of the conducted research study on feature types, feature selection procedures, and machine learning techniques can be used for author recognition in many other languages.
3. Types of Text Features
Below, it is assumed that the author of a given text is one of a certain limited number of candidate authors, and some other texts written by these candidate authors are known in advance. Let us adopt the following notation.
Consider the set of given candidate authors $A = \{a_1, a_2, \dots, a_L\}$, where $L$ is the number of the candidate authors, and let us denote the set of texts of author $a_i$ by $T_i = \{t_{i,1}, \dots, t_{i,M_i}\}$, where $M_i$ is the number of texts of the author $a_i$, $i = 1, \dots, L$.
The set of all texts with known authors $T = T_1 \cup \dots \cup T_L$ is divided into two non-intersecting subsets $T = T^{\mathrm{train}} \cup T^{\mathrm{test}}$, where $T^{\mathrm{train}}$ is the training set and $T^{\mathrm{test}}$ is the test set.
Text feature groups consist of the following text characteristics, which many researchers consider to be important features with a distinctive character among authors.
3.1. Sentence Length Frequency
By sentence length, we mean the number of words in a sentence [29]. The frequency of sentences of a certain length in a given text is calculated by dividing the number of sentences of that length by the total number of sentences in the text [30].
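As a minimal sketch, the sentence length frequencies can be computed as follows. The sentence split on terminal punctuation is an assumption for illustration; a real implementation would use a proper sentence tokenizer that handles abbreviations and direct speech.

```python
import re

def sentence_length_frequencies(text, lengths):
    """Frequency of sentences of each given length (length = number of words)."""
    # Naive sentence split on ., !, ? -- a simplifying assumption.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = len(sentences)
    return {n: sum(1 for s in sentences if len(s.split()) == n) / total
            for n in lengths}
```

For example, `sentence_length_frequencies(text, range(5, 15))` yields frequencies for sentence lengths 5 through 14.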
3.2. Word Length Frequency
By word length, we mean the number of letters in a word [13]. The frequency of words of a given length in a given text is calculated by dividing the number of words of that length by the total number of words in the text.
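A corresponding sketch for word length frequencies; the tokenization (whitespace split with naive punctuation stripping) is an assumption:

```python
def word_length_frequencies(text, lengths):
    """Frequency of words of each given length (length = number of letters)."""
    # Strip common punctuation from tokens -- a simplification.
    words = [w.strip(".,;:!?\"'") for w in text.split()]
    words = [w for w in words if w]
    total = len(words)
    return {n: sum(1 for w in words if len(w) == n) / total for n in lengths}
```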
3.3. Character n-Gram Frequency
In the scientific literature, a character n-gram is considered to mean any given combination of the given characters [12,17,24,25].
For a given $n \ge 1$, let us denote the character alphabet set, i.e., the set of all character 1-grams from which the characters of any character n-gram are selected, as follows:

$$\Sigma = \{c_1, c_2, \dots, c_m\},$$

where $m$ is the number of characters in the alphabet. Hereafter in the article, the notation $k = (k_1, k_2, \dots, k_n)$ will be used for an $n$-dimensional multi-index. Here, each of the indices $k_1, \dots, k_n$ can take a value in the set $\{1, \dots, m\}$; hence, $k \in \{1, \dots, m\}^n$.

From here on, we will use "n-grams" and "character n-grams" interchangeably. An n-gram with an arbitrary multi-index $k$, whose characters are chosen from the character alphabet set $\Sigma$, will be written as $g_k = c_{k_1} c_{k_2} \cdots c_{k_n}$, where $k \in \{1, \dots, m\}^n$. Let us denote the set of all character n-grams as follows:

$$G_n = \left\{ g_k : k \in \{1, \dots, m\}^n \right\}.$$

The frequency of an arbitrary character n-gram $g_k$ in a given text is calculated by dividing the number of times $g_k$ occurs by the total number of character n-grams in the text.
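A minimal sketch of this computation. The convention of simply dropping characters outside the alphabet is an assumption; other conventions (e.g., keeping spaces as a character) are also used in the literature.

```python
from collections import Counter

def char_ngram_frequencies(text, n, alphabet):
    """Frequencies of character n-grams whose characters come from `alphabet`."""
    chars = [c for c in text.lower() if c in alphabet]
    grams = ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {g: c / total for g, c in counts.items()}
```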
3.4. Variance of Character n-Gram Frequencies
Statistical characteristics (e.g., variances) of character n-gram frequencies in the separate parts of an arbitrary given text can be used as text features.
If we calculate the frequencies of an arbitrary character n-gram in the separate parts of a given text and find the variance [30] of these frequencies, this variance shows how evenly the n-gram occurs in the separate parts, at the beginning, in the middle, and at the end of the text, i.e., the stability of the character n-gram in the text. In other words, if the variance of the frequencies of a character n-gram in a text, calculated as aforementioned, is small, this n-gram can be considered stable in that text. Likewise, if we calculate the frequency of an arbitrary character n-gram in each of the texts of a certain candidate author in the training set and find the variance of these frequencies, a small value of this variance indicates that the character n-gram characterizes the texts of that author well. Or, if we merge the known texts of each of the candidate authors into a single text per author (for $L$ authors, we obtain $L$ texts), and the variance of an arbitrary character n-gram is small in one of these texts and large in the others, this character n-gram characterizes one candidate author better than the others, so it can be used effectively in the considered author recognition problem. It turns out that the variance values, which indicate how character n-grams are distributed in a certain text, i.e., the stability of the character n-grams, are informative for author recognition. Hence, we considered that, along with the frequencies of character n-grams in a given text, their variances in the text can be used as text features as well. In other words, by dividing the given text into a certain number of separate, non-intersecting parts and calculating the frequencies of an arbitrary character n-gram in these parts, the variance of these frequencies can be used as a feature of that text.
To calculate the value of this statistical characteristic of a character n-gram in a given text for an arbitrary $n$, the text must first be divided into a certain number of almost equally sized non-intersecting parts, the frequencies of that character n-gram in each of these parts must be found, and the variance of these frequencies must be calculated.
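The steps above can be sketched as follows (population variance over a given number of nearly equal parts; the number of parts and the handling of the remainder are free choices):

```python
def ngram_frequency_variance(text, gram, parts, alphabet):
    """Variance of the frequency of one character n-gram across `parts`
    non-intersecting, almost equally sized parts of the text."""
    n = len(gram)
    chars = [c for c in text.lower() if c in alphabet]
    size = len(chars) // parts
    freqs = []
    for p in range(parts):
        # The last part absorbs the remainder so the parts cover the text.
        chunk = chars[p * size:(p + 1) * size] if p < parts - 1 else chars[p * size:]
        grams = ["".join(chunk[i:i + n]) for i in range(len(chunk) - n + 1)]
        freqs.append(grams.count(gram) / len(grams) if grams else 0.0)
    mean = sum(freqs) / parts
    return sum((f - mean) ** 2 for f in freqs) / parts
```

A stable n-gram (equally frequent everywhere) yields a variance of zero.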
3.5. Word Frequency
Word frequencies in texts are important features that can be related to authors [3,4,14,22]. The frequency of a particular word in a given text is calculated by dividing the number of times that word occurs by the total number of words in the text.
The results of our research (see Section 7) show that using the frequency of an arbitrary word in a given text together with its frequencies in the unified texts of the candidate authors among the text features positively affects the recognition efficiency, where, by the unified text of a candidate author, we mean the text resulting from merging the author's texts in the training set. In addition to the frequencies of a given word in the unified texts of the candidate authors, the frequency of this word in the unified text obtained by merging all the texts in the training set can be used among the text features. The rules for calculating the frequencies of an arbitrary word in the unified texts of the candidate authors and in the text resulting from merging all the texts of the training set are described below.
The total usage frequency of an arbitrary word by an author is obtained by dividing the number of occurrences of that word in the author's texts in the training set by the total number of words in those texts. The total frequency of an arbitrary word in the training set is calculated by dividing the number of occurrences of that word in all the training set texts by the total number of words in all the training set texts.
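Both total frequencies can be sketched with one helper; whitespace tokenization is an assumption. The per-author frequency is obtained by passing the author's training texts, and the training-set frequency by passing all the training texts.

```python
from collections import Counter

def total_word_frequency(word, texts):
    """Total usage frequency of `word` across a list of texts: occurrences
    of the word in all the texts divided by the total number of words."""
    counts = Counter()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        counts.update(tokens)
        total += len(tokens)
    return counts[word] / total if total else 0.0
```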
4. Feature Selection Procedures
Various feature selection procedures have been proposed [1,2,6,16,31]. Here, two feature selection procedures for selecting words and two for selecting character n-grams, whose frequencies will be used as text features, are proposed, as described below. Recall that these feature selection procedures are performed on the training set texts, whose authors are known.
4.1. Procedure for Selecting Frequently Used Words by Authors (Procedure 1)
During the selection of the words whose frequencies are to be used as text features in author recognition with the help of this procedure, the total frequencies of the words used by each of the authors are taken into account separately. This first feature selection procedure differs from the second one (see Section 4.2): in procedure 2, word selection is based on the total number of times the words are used in the training set texts without considering the authors, while in procedure 1, the most, middle, or least used words of each author $a_i$, $i = 1, \dots, L$, are selected, and then the words from the authors' lexicons are combined into a final set of words.
The determination of the most, middle, or least used words in the authors' lexicons is based on the total usage frequencies of the words in the texts $T_i^{\mathrm{train}}$ of the author $a_i$, $i = 1, \dots, L$, where $T^{\mathrm{train}}$ is the training set, $T_i^{\mathrm{train}}$ is the set of the author $a_i$'s texts in $T^{\mathrm{train}}$, and $M_i$ is the number of texts of $a_i$ in $T^{\mathrm{train}}$. Let us denote the total usage frequency of an arbitrary word $w_j$ in the author $a_i$'s texts by $f_i(w_j)$, where $W = \{w_1, \dots, w_N\}$ is the set of all words in the training set $T^{\mathrm{train}}$ and $j = 1, \dots, N$.
In order to select the most, middle, or least used words of an author $a_i$ from among the elements of the word set $W$, $W$ first has to be sorted in descending order of the usage frequency of the words by the author $a_i$, $i = 1, \dots, L$. Let us denote the set of words arranged in descending order of usage frequency by the author $a_i$ by

$$W_i = \left( w_{j_1^i}, w_{j_2^i}, \dots, w_{j_N^i} \right), \quad f_i\left(w_{j_1^i}\right) \ge f_i\left(w_{j_2^i}\right) \ge \dots \ge f_i\left(w_{j_N^i}\right). \tag{1}$$

For a given $K$, consider the set of the $K_i \approx K/L$ most frequently used words of the author $a_i$ as

$$W_i^{\mathrm{high}} = \left\{ w_{j_s^i} : s = 1, \dots, K_i \right\} \tag{2}$$

and the final set of most frequently used words of the authors as

$$W^{\mathrm{high}} = \bigcup_{i=1}^{L} W_i^{\mathrm{high}}. \tag{3}$$

The most frequently used words of the authors should be selected in such a way that the authors' lexicons do not intersect, i.e., $W_i^{\mathrm{high}} \cap W_{i'}^{\mathrm{high}} = \emptyset$ for $i \ne i'$.
It is possible that some of the most used words of a certain author $a_i$, although not included in $W_i^{\mathrm{high}}$, are also used a lot by another author $a_{i'}$, where $i' \ne i$. Such words may not have a discriminatory character between the authors $a_i$ and $a_{i'}$. Therefore, in the author recognition problem considered in the study, in addition to the most frequently used words in the authors' lexicons, words that are used on average were also selected. These words are neither at the beginning nor at the end of the sorted list $W_i$, $i = 1, \dots, L$. The selection is analogous to the selection of the most frequently used words described above, except that the following Formula (4) has to be used instead of the aforementioned (2):

$$W_i^{\mathrm{middle}} = \left\{ w_{j_s^i} : s = \lfloor N/2 \rfloor - \lfloor K_i/2 \rfloor + 1, \dots, \lfloor N/2 \rfloor - \lfloor K_i/2 \rfloor + K_i \right\}, \tag{4}$$

where $\lfloor x \rfloor$ denotes the integer obtained by rounding an arbitrary positive number $x$ down, e.g., $\lfloor 2.3 \rfloor = 2$, $\lfloor 5.8 \rfloor = 5$.
Formula (4) was obtained from Formula (2) by only changing the indices of the selected words $w_{j_s^i}$. The same operation applies not only to the most or middle used words but also to words in other percentiles; e.g., words with a frequency rank above the 75th percentile can be selected.
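Procedure 1 can be sketched as follows, under simplifying assumptions: whitespace tokenization, and a greedy pass over the authors that keeps the per-author selections disjoint (the paper's exact disjointness rule is not spelled out here, so this is one plausible realization):

```python
from collections import Counter

def select_words_procedure1(texts_by_author, k, mode="high"):
    """Pick about k words per author by that author's usage frequency
    ('high' = most used, 'middle' = middle of the sorted list), keeping
    the per-author selections disjoint."""
    selected = set()
    for texts in texts_by_author.values():
        words = " ".join(texts).lower().split()
        # Rank this author's words by frequency, skipping already-taken words.
        ranked = [w for w, _ in Counter(words).most_common() if w not in selected]
        if mode == "high":
            chosen = ranked[:k]
        else:  # middle of the sorted list
            start = max(0, len(ranked) // 2 - k // 2)
            chosen = ranked[start:start + k]
        selected.update(chosen)
    return selected
```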
4.2. Procedure for Selecting Frequently Used Words in the Training Set (Procedure 2)
During the selection of the words whose frequencies are to be used as text features in author recognition with the help of this procedure, the total frequencies of the words in the training set are taken as a basis without taking the authors into account.
Let us denote the total usage frequency of an arbitrary word $w_j$ in the training set $T^{\mathrm{train}}$ by $f(w_j)$, where $W = \{w_1, \dots, w_N\}$ is the set of all words in the training set $T^{\mathrm{train}}$ and $j = 1, \dots, N$. When using this procedure, the set $W$ of all the words in the texts of $T^{\mathrm{train}}$ is first sorted in decreasing order of the total usage frequency in the training set:

$$\widetilde{W} = \left( w_{j_1}, w_{j_2}, \dots, w_{j_N} \right), \quad f\left(w_{j_1}\right) \ge f\left(w_{j_2}\right) \ge \dots \ge f\left(w_{j_N}\right), \tag{5}$$

where $j_s \in \{1, \dots, N\}$, $s = 1, \dots, N$. The set of the $K$ most frequently used words in the training set is as follows:

$$\widetilde{W}^{\mathrm{high}} = \left\{ w_{j_s} : s = 1, \dots, K \right\}. \tag{6}$$

In the description of the first feature selection procedure (see Section 4.1), we noted that some of the most frequently used words of one author may also be used a lot by another author, in which case these words may not be discriminative between those authors. Therefore, in the first feature selection procedure, along with the most frequently used words of the authors, words used on average were also selected. A similar situation may arise when using feature selection procedure 2: some of the words that are the most used in the training set overall may be commonly used by more than one author and thus may not have a discriminatory character. Therefore, in addition to the most used words in the training set, the selection of words that are used on average, i.e., that are neither at the beginning nor at the end of $\widetilde{W}$, is also employed. The set of these words is as follows:

$$\widetilde{W}^{\mathrm{middle}} = \left\{ w_{j_s} : s = \lfloor N/2 \rfloor - \lfloor K/2 \rfloor + 1, \dots, \lfloor N/2 \rfloor - \lfloor K/2 \rfloor + K \right\}. \tag{7}$$
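Procedure 2 can be sketched compactly; whitespace tokenization is again an assumption:

```python
from collections import Counter

def select_words_procedure2(training_texts, k, mode="high"):
    """Rank all words by total frequency in the training set (authors
    ignored) and take the top k ('high') or the middle k ('middle')."""
    words = " ".join(training_texts).lower().split()
    ranked = [w for w, _ in Counter(words).most_common()]
    if mode == "high":
        return ranked[:k]
    start = max(0, len(ranked) // 2 - k // 2)
    return ranked[start:start + k]
```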
4.3. Procedure for Selecting Frequently Used Character n-Grams in the Training Set (Procedure 3)
During the selection of the character n-grams whose frequencies are to be used as text features in author recognition with the help of this procedure, the total frequencies of the character n-grams in the training set are taken as a basis without taking the authors into account. During the execution of this procedure, the set $G_n$ of all character n-grams has to be sorted in decreasing order of the total usage frequency in the training set. The characters in these character n-grams, each with $n$ characters, are selected from the character alphabet set $\Sigma$. Let us denote the obtained sorted set by $\widetilde{G}_n$ and the sets of the $K$ most, middle, or least frequently used character n-grams in the training set by $\widetilde{G}_n^{\mathrm{high}}$, $\widetilde{G}_n^{\mathrm{middle}}$, and $\widetilde{G}_n^{\mathrm{low}}$, respectively.
We will denote the total usage frequency of a character n-gram $g_k$ in the texts of the training set by $F(g_k)$. The set $\widetilde{G}_n$ obtained by sorting the character n-grams in decreasing order of total usage frequency in the training set is as follows:

$$\widetilde{G}_n = \left( g_{k^{(1)}}, g_{k^{(2)}}, \dots, g_{k^{(m^n)}} \right), \quad F\left(g_{k^{(1)}}\right) \ge F\left(g_{k^{(2)}}\right) \ge \dots \ge F\left(g_{k^{(m^n)}}\right), \tag{8}$$

where $k^{(s)}$ is the multi-index of the character n-gram numbered $s$ in $\widetilde{G}_n$, $s = 1, \dots, m^n$. The set of the $K$ most used character n-grams in the training set is as follows:

$$\widetilde{G}_n^{\mathrm{high}} = \left\{ g_{k^{(s)}} : s = 1, \dots, K \right\}. \tag{9}$$

Feature selection procedure 3 is used to select character n-grams rather than words, unlike feature selection procedure 2. In the description of the second feature selection procedure, it was noted that the most used words in the training set may not be discriminative among authors, so the words used on average in those texts must be selected as well. The same holds for the third feature selection procedure used to select character n-grams. The set of character n-grams that are used on average, i.e., that are neither at the beginning nor at the end of $\widetilde{G}_n$, is as follows:

$$\widetilde{G}_n^{\mathrm{middle}} = \left\{ g_{k^{(s)}} : s = \lfloor m^n/2 \rfloor - \lfloor K/2 \rfloor + 1, \dots, \lfloor m^n/2 \rfloor - \lfloor K/2 \rfloor + K \right\}. \tag{10}$$
4.4. Procedure for Selecting Character n-Grams with Different Characterizing Degrees of Authors (Procedure 4)
In solving author recognition problems, text features should be used whose values are similar in the texts of one author and different in the texts of different authors. In other words, the degrees to which a feature characterizes the individual authors should differ from each other. This feature selection procedure was proposed for selecting character n-grams for which the characterizing degrees of the authors differ.
During the execution of this procedure, first, for a given n-gram, the degree to which it does not characterize each author (in the sense defined below) is determined, and from these non-characterizing degrees their complements, the characterizing degrees, are obtained. Then, an indicator of how much the characterizing degrees of the authors differ from each other for that n-gram is calculated. In the study, several metrics were used to calculate this indicator.
We define the degree to which an arbitrary n-gram $g_k$ does not characterize a given author $a_i$ as the mean of the pairwise absolute differences of the frequencies of $g_k$ in the texts of the author $a_i$:

$$u_i(g_k) = \frac{2}{M_i\left(M_i - 1\right)} \sum_{p=1}^{M_i - 1} \sum_{q=p+1}^{M_i} \left| f_p(g_k) - f_q(g_k) \right|, \tag{11}$$

where $f_1(g_k), \dots, f_{M_i}(g_k)$ are the frequencies of the n-gram $g_k$ in the texts of the author $a_i$ numbered $1, \dots, M_i$, respectively. Considering that $0 \le f_p(g_k) \le 1$ holds for the frequency of an arbitrary n-gram in a given text, it is clear that $0 \le u_i(g_k) \le 1$. Then, the characterizing degree of the author $a_i$ per the n-gram $g_k$ is $d_i(g_k) = 1 - u_i(g_k)$, where $i = 1, \dots, L$.
For simplicity, the characterizing degrees of the authors per an n-gram are sorted in decreasing order: $d_{(1)}(g_k) \ge d_{(2)}(g_k) \ge \dots \ge d_{(L)}(g_k)$.
The difference indicator of the characterizing degrees of the authors for an arbitrary character n-gram $g_k$ can be defined as one of the following:

$$\Delta_2(g_k) = d_{(1)}(g_k) - d_{(2)}(g_k), \tag{12}$$

$$\Delta_3(g_k) = \frac{1}{2}\left( \left( d_{(1)}(g_k) - d_{(2)}(g_k) \right) + \left( d_{(2)}(g_k) - d_{(3)}(g_k) \right) \right), \tag{13}$$

$$\Delta_L(g_k) = \frac{1}{L-1} \sum_{i=1}^{L-1} \left( d_{(i)}(g_k) - d_{(i+1)}(g_k) \right), \tag{14}$$

which take into account the two best-characterized authors, the three best-characterized authors, and all the authors, respectively.
The sorted set of character n-grams $\widehat{G}_n$, ordered in decreasing order of the difference indicator $\Delta$ (one of (12)–(14)) of the characterizing degrees of the authors, is as follows:

$$\widehat{G}_n = \left( g_{k^{(1)}}, g_{k^{(2)}}, \dots, g_{k^{(m^n)}} \right), \quad \Delta\left(g_{k^{(1)}}\right) \ge \Delta\left(g_{k^{(2)}}\right) \ge \dots \ge \Delta\left(g_{k^{(m^n)}}\right), \tag{15}$$

where $k^{(s)}$ is the multi-index of the character n-gram numbered $s$ in $\widehat{G}_n$, $s = 1, \dots, m^n$. The set of the $K$ character n-grams with the highest discrimination indicator, i.e., characterizing degree difference among the authors in the training set, is as follows:

$$\widehat{G}_n^{\mathrm{high}} = \left\{ g_{k^{(s)}} : s = 1, \dots, K \right\}. \tag{16}$$
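The characterizing degree and one possible difference indicator (the gap between the two largest degrees, i.e., the variant that considers only the two best-characterized authors) can be sketched as follows; the exact metrics used in the study may differ:

```python
from itertools import combinations

def characterizing_degree(freqs):
    """1 minus the mean pairwise absolute difference of an n-gram's
    frequencies over one author's texts (all frequencies lie in [0, 1])."""
    pairs = list(combinations(freqs, 2))
    if not pairs:
        return 1.0
    return 1.0 - sum(abs(a - b) for a, b in pairs) / len(pairs)

def discrimination_indicator(freqs_by_author):
    """Gap between the two largest characterizing degrees among the authors."""
    degrees = sorted((characterizing_degree(f) for f in freqs_by_author),
                     reverse=True)
    return degrees[0] - degrees[1]
```

An n-gram that is stable in one author's texts but unstable in another's receives a large indicator and would therefore be selected.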
5. Characteristics of Feature Groups
Using the value of a single feature of a particular text, e.g., the frequency of 5-word sentences or the frequency of the character 2-gram "ab", is not sufficient for effective author identification. Therefore, the values of more than one feature, collected into a feature group, are used for an arbitrary text; see [1,25]. The characteristics of the feature groups used in this study are detailed below.
5.1. Sentence Length Frequency and Word Length Frequency Types of Feature Groups
Only two feature groups consisting of sentence length frequency features were used. These feature groups differ in the sentence lengths used and in the number of different sentence lengths, where sentence length means the number of words in a sentence. One such feature group contains the frequencies of sentences with lengths of 5, 6, …, 14. We denote this feature group by SL10, where SL stands for "sentence length" and 10 indicates the number of features in the feature group. Naming the feature groups with such short expressions lets the reader see the differences between the groups and allows the names to be used when reporting the results of the computer experiments. The other feature group, SL25, consists of the frequencies of sentences with lengths of 5, 6, …, 29.
The number of feature groups consisting of only word length type features is also two. These feature groups differ according to the word lengths to be used and the number of these different word lengths, where word length means the number of letters in a word. In one such feature group, there are frequencies of words with lengths of 3, 4, …, 7. We will denote this feature group by the name WL5, where WL stands for “word length”, and 5 indicates the number of features in the group. Another feature group, WL10, consists of frequencies of words with lengths of 3, 4, …, 12.
A mixed feature group consisting of features belonging to the types of sentence length frequency and word length frequency was also used. The values of the features in this feature group are the frequencies of sentences with lengths 5, 6, …, 14 and the frequencies of words with lengths 3, 4, …, 12. Let us denote this feature group by SL10&WL10, where the “&” symbol is used in the names of mixed feature groups.
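As an illustration, the SL10&WL10 group can be assembled into a 20-dimensional feature vector; the naive sentence and word splitting below is an assumption for the sketch:

```python
import re

def sl10_wl10_vector(text):
    """Frequencies of sentence lengths 5..14 followed by word lengths 3..12."""
    sentences = [s.split() for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w.strip(".,;:!?\"'") for w in text.split()]
    words = [w for w in words if w]
    sl = [sum(1 for s in sentences if len(s) == n) / len(sentences)
          for n in range(5, 15)]
    wl = [sum(1 for w in words if len(w) == n) / len(words)
          for n in range(3, 13)]
    return sl + wl  # 10 sentence-length + 10 word-length features
```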
In the study, the number of feature groups used in the considered author recognition problem, which contains features of sentence length frequency and word length frequency types, is 5:
sentence length frequency: {number of feature groups} = 2;
word length frequency: {number of feature groups} = 2;
mixed groups of sentence length frequencies and word length frequencies: {number of feature groups} = 1.
5.2. Character n-Gram Frequency and Variance of Character n-Gram Frequencies Types of Feature Groups
In the study, the character alphabet set, i.e., the set that contains the characters of the character n-grams used in the author recognition problem considered in the example of Azerbaijani writers, consists of the letters of the alphabet of the Azerbaijani language; in other words, $m = |\Sigma| = 32$. The character alphabet set is clearly the set of character 1-grams, each with one character. In the study, the frequencies of character 1-grams (unigrams, $n = 1$) and character 2-grams (bigrams, $n = 2$) were used in the feature groups consisting of character n-gram frequency features.
A feature group consisting of the frequencies of all unigrams in the character alphabet set was used. We will denote this feature group by Ch1g32 in the study, where “Ch1g” is for “character 1-grams”, and 32 represents the number of features in that group.
Different subsets of the set $G_2$ of all character bigrams were used to create feature groups. Recall that the characters in these bigrams are selected from the character alphabet set $\Sigma$. These feature groups differ from each other in the feature selection procedure used and in the variety and number of bigrams whose frequencies are used. In the study, feature selection procedures 3 and 4 are used to select character bigrams from $G_2$. For procedure 3, the procedure for selecting frequently used character n-grams in the training set, and procedure 4, the procedure for selecting character n-grams with different characterizing degrees of the authors, see Formulas (8)–(10) and (11)–(16), respectively. A total of 26 and 21 different feature groups were created with the help of feature selection procedures 3 and 4, respectively. Two of the twenty-six feature groups of bigrams selected with the help of the third feature selection procedure each consist of the frequencies of 100 bigrams selected directly from $G_2$: the 100 most used and the 100 middle used bigrams in the training set, selected from the 1024 character bigrams in $G_2$. We denote these feature groups by Ch2g_p3high100 and Ch2g_p3middle100, respectively. Here, "Ch2g" stands for "character 2-grams", "p3" indicates that the third feature selection procedure is used, and 100 indicates the number of features in each of the groups.
The other 24 feature groups were selected from the bigrams whose frequencies were used in these two groups, Ch2g_p3high100 and Ch2g_p3middle100. From the 100 high-frequency bigrams, the first and the last 5, 10, …, 25, 50 bigrams in the list ordered by descending training-set frequency are selected; we denote these groups by Ch2g_p3high100first5, …, Ch2g_p3high100first50 and Ch2g_p3high100last5, …, Ch2g_p3high100last50. Likewise, from the 100 middle-frequency (neither high nor low) bigrams, the first and the last 5, 10, …, 25, 50 bigrams of the ordered list are selected; we denote these groups by Ch2g_p3middle100first5, …, Ch2g_p3middle100first50 and Ch2g_p3middle100last5, …, Ch2g_p3middle100last50.
With the help of the fourth feature selection procedure, 21 feature groups consisting of the frequencies of character bigrams selected from $G_2$ were created. During the calculation of the difference indicator of the characterizing degrees of the authors (see Formulas (12)–(14)), the two best-characterized authors were taken into account in seven of these groups, the three best-characterized authors in another seven groups, and all the authors in the remaining seven groups. We denote these variants of the fourth feature selection procedure by 4.1, 4.2, and 4.3 in the feature group names. For each of the aforementioned metrics, feature groups consisting of the frequencies of the 5, 10, …, 25, 50, 100 character bigrams with the highest indicator values were created. We denote them by Ch2g_p4.1high5, …, Ch2g_p4.1high100; Ch2g_p4.2high5, …, Ch2g_p4.2high100; and Ch2g_p4.3high5, …, Ch2g_p4.3high100, respectively.
Character unigrams and character bigrams selected with the help of feature selection procedure 3 are also used to create feature groups consisting of variances of character n-gram frequencies. They are analogous to Ch1g32; Ch2g_p3high100 and Ch2g_p3middle100; Ch2g_p3high100first5, …, Ch2g_p3high100first50; Ch2g_p3high100last5, …, Ch2g_p3high100last50; Ch2g_p3middle100first5, …, Ch2g_p3middle100first50; and Ch2g_p3middle100last5, …, Ch2g_p3middle100last50, but instead of character n-gram frequencies, variances of character n-gram frequencies are employed. We use the "ChngVar" prefix instead of "Chng" in the names of these groups, where "Var" stands for "variance".
Mixed feature groups consisting of features of the character n-gram frequency and character n-gram frequency variance types were also used. In one of these mixed feature groups, the frequencies and frequency variances of all unigrams were used, i.e., a group Ch1g&Ch1gVar64 was created consisting of all features in the Ch1g32 and Ch1gVar32 groups. In the other four mixed feature groups, the frequencies and frequency variances of the 5, 10, 15, and 20 most frequently used character bigrams in the training set were used (each bigram contributes two features, its frequency and the variance of its frequency). We denote these groups by Ch2g&Ch2gVar_p3high100first10, …, Ch2g&Ch2gVar_p3high100first40.
In this study, the number of feature groups used in the considered author recognition problem, which contains the features of character n-gram frequency and variance of character n-gram frequency types, is 79:
1-grams: {number of feature types} × {number of n-gram subsets} = 2 × 1 = 2;
2-grams selected with the help of the third feature selection procedure: {number of feature types} × {number of 100-bigram sets} × (1 + {number of selection directions (first/last)} × {number of subset sizes}) = 2 × 2 × (1 + 2 × 6) = 52;
2-grams selected with the help of the fourth feature selection procedure: {number of feature types} × {number of metrics} × {number of n-gram subsets} = 1 × 3 × 7 = 21;
mixed groups of character n-gram frequencies and variances of character n-gram frequencies: {number of n-gram subsets} = 4.
5.3. Word Frequency Type of Feature Groups
In the considered author recognition problem, different feature groups consisting of features related to the type of word frequency were used.
These feature groups differ from each other in the variety and number of the words used, as well as in the frequency types used, i.e., in addition to the frequencies of the words in the given text, the features of the given text may also include the frequencies of these words in the candidate authors' unified texts and their total usage frequencies in the training set (see Section 3.5).
We divide these feature groups into several types; note that this refers to types of feature groups, not types of features. We separate the word frequency feature groups into such types only because the approaches used to select the words whose frequencies are used in these groups differ considerably from each other. Let us consider the characteristics of each of the feature group types consisting of word frequency features used in the study:
1. With the help of feature selection procedure 1—the procedure for selecting words frequently used by the authors (see Formulas (1)–(4))—the words most frequently used by each author in his/her own texts in the training set are determined separately for each author. Using procedure 1, all the words in the training set are sorted once per author—L times in total—in decreasing order of the total usage frequency by the corresponding author, which gives L sets of words. Then, an almost equal number of words is selected from each of these word sets. Finally, the selected words are collected together as the final set of words selected with feature selection procedure 1. In the study, the 15, 25, 45, and 50 most frequently and middle-frequently used words of the authors are used as feature groups. We denote these feature groups by W_p1high15, …, W_p1high50, W_p1middle15, …, W_p1middle50, where “W” stands for “word”, “p1” indicates that the first feature selection procedure is used, “high” and “middle” indicate that the most and middle frequently used words of the authors are selected, and 15, …, 50 indicate the numbers of features in these groups.
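The per-author selection step can be sketched as follows; plain usage counts stand in for the ranking defined by Formulas (1)–(4), and the function and data names are illustrative:

```python
from collections import Counter

def select_words_per_author(train, per_author):
    """Procedure-1-style sketch: for each author, take the `per_author`
    most frequent words of the author's unified training text, then merge
    the per-author word sets into one final word set."""
    selected = []
    for author, texts in train.items():
        counts = Counter(w for t in texts for w in t.split())
        for word, _ in counts.most_common(per_author):
            if word not in selected:       # shared words are kept once
                selected.append(word)
    return selected

# Illustrative two-author training set:
train = {"A": ["the old sea the sea"], "B": ["a cold wind a wind"]}
print(select_words_per_author(train, 2))   # ['the', 'sea', 'a', 'wind']
```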
2. Words that are rarely used in everyday life can be more effective for author recognition—in terms of discriminating the texts of different authors—than words that are used often. Taking this into account, from among the words most used by each author individually, whose frequencies were used in the groups W_p1high15, …, W_p1high50, we manually selected the words that we assume are used less often in everyday life and formed feature groups consisting of the frequencies of these words. In other words, from the words selected with the help of the first feature selection procedure, we removed those that are often used in everyday life. The numbers of words in such groups are 3, 15, 25, 40, and 45, and we denote these groups by W_p1highMAN3, …, W_p1highMAN45, respectively, where “MAN” stands for “manual”, meaning that after the most used words of the authors were selected with the help of feature selection procedure 1, some of these words were selected manually from this word set.
3. Using the second feature selection procedure—the procedure for selecting frequently used words in the training set (see Formulas (5)–(7))—the 5, 10, …, 25, and 50 most frequently and middle-frequently occurring words in the training set, with the authorship of the texts not taken into account, were selected and used in feature groups. We denote these feature groups by W_p2high5, …, W_p2high50, W_p2middle5, …, W_p2middle50.
4. In author recognition problems, text features should be used that represent the stylistic characteristics of the author in any given text of his/hers [1]. It is clear that, in a given author recognition problem, if the text features are to reflect the stylistic characteristics of an author effectively, the features of that author’s texts must remain invariant with respect to their topics.
In this study, the author recognition problem is considered in the example of Azerbaijani writers. There are two types of parts of speech in the Azerbaijani language: main and auxiliary. According to the morphology of the Azerbaijani language, words belonging to auxiliary parts of speech, e.g., conjunctions, have no lexical meaning, so they do not seriously affect the topic of a text. These are function words, which carry a grammatical function in sentences rather than a meaning of their own (on function words, see, e.g., [1]). Therefore, the frequencies of words belonging to auxiliary parts of speech can be more effective as text features than the frequencies of words belonging to the main parts of speech: being independent of the text’s topic, they can reflect the author’s stylistics more adequately.
For this type of word frequency feature group, feature selection procedure 2 is used, but the word set W from which the words are selected contains only the words of auxiliary parts of speech in the training set instead of all the words in the training set. In this study, the 5, 10, …, 25, and 50 highest-frequency and middle-frequency auxiliary-parts-of-speech words are used to create feature groups. We denote them by W_p2highAUX5, …, W_p2highAUX50, W_p2middleAUX5, …, W_p2middleAUX50, where “AUX” stands for “auxiliary”, meaning that the words whose frequencies are used in these feature groups belong only to the auxiliary parts of speech.
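Restricting procedure 2 to a chosen word set can be sketched as follows; raw counts stand in for Formulas (5)–(7), the `allowed` parameter is the restriction of W, and the sample texts are invented for illustration ("və" meaning "and" and "amma" meaning "but" are Azerbaijani conjunctions, i.e., auxiliary-parts-of-speech words):

```python
from collections import Counter

def select_frequent_words(train_texts, k, allowed=None):
    """Procedure-2-style sketch: the k most frequent words in the whole
    training set, authorship ignored.  If `allowed` is given, only words
    from that set (e.g. auxiliary parts of speech) are candidates."""
    counts = Counter(w for t in train_texts for w in t.split())
    if allowed is not None:
        counts = Counter({w: c for w, c in counts.items() if w in allowed})
    return [w for w, _ in counts.most_common(k)]

# Illustrative training texts with the auxiliary words restricted to a set:
texts = ["və amma və kitab", "kitab və dəniz amma"]
print(select_frequent_words(texts, 2, allowed={"və", "amma"}))  # ['və', 'amma']
```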
5. In the word frequency feature groups described above, only the frequencies of the selected words in the given text were used as features. In this type of word frequency feature group, not only the frequency of each selected word in the given text but also its total frequencies in the unified texts of the individual authors in the training set and its total frequency in the whole training set, without taking authors into account, are used. Thus, among the features of an arbitrary text, for each selected word, there are the frequency of that word in the text itself, its total frequencies in the authors’ texts in the training set, and its total frequency over all texts in the training set.
As is clear from this description, the main difference between this type of word frequency group and the previous types lies not in the words selected but in the text(s) from which the frequencies are calculated. In the study, in addition to the word frequencies in a given text, such training-set-based total frequencies were used with the word sets of the groups W_p2high50, W_p1highMAN3, W_p1highMAN15, W_p1highMAN25, W_p1highMAN40, and W_p1highMAN45. Thus, the groups W&WA&WT_p2high50, W&WA&WT_p1highMAN3, …, W&WA&WT_p1highMAN45 were created, where “W”, “WA”, and “WT” denote the frequencies of words in the given text, the total usage frequencies of words by the authors, and the total usage frequencies of words in the training set, respectively.
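A minimal sketch of assembling such a feature vector is given below; raw counts stand in for the frequency definitions of Section 3.5, and the function and data names are illustrative:

```python
from collections import Counter

def w_wa_wt_features(text, words, train):
    """W&WA&WT-style sketch: for each selected word, emit its count in the
    given text (W), its counts in each author's unified training text (WA),
    and its count over the whole training set (WT)."""
    text_counts = Counter(text.split())
    author_counts = {a: Counter(w for t in ts for w in t.split())
                     for a, ts in train.items()}
    total_counts = Counter()
    for c in author_counts.values():
        total_counts.update(c)
    features = []
    for word in words:
        features.append(text_counts[word])                        # W
        features.extend(author_counts[a][word] for a in train)    # WA
        features.append(total_counts[word])                       # WT
    return features

train = {"A": ["sea sea wind"], "B": ["wind wind"]}
print(w_wa_wt_features("sea wind wind", ["sea", "wind"], train))
# [1, 2, 0, 2, 2, 1, 2, 3]
```

Each selected word thus contributes 2 + L values to the vector, where L is the number of candidate authors.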
In the study, 43 feature groups containing features of the word frequency type were used in the considered author recognition problem:
W_p1 feature groups—feature groups that used procedure 1 for word selection: {number of word selection criteria} × {number of word sets} = 2 × 4 = 8;
W_p1MAN feature groups—feature groups obtained by manual intervention after word selection with procedure 1: {number of word selection criteria} × {number of word sets} = 1 × 5 = 5;
W_p2 feature groups—feature groups that used procedure 2 for word selection: {number of word selection criteria} × {number of word sets} = 2 × 6 = 12;
W_p2AUX feature groups—feature groups that used procedure 2 to select words from among words belonging to some chosen parts of speech, i.e., auxiliary parts of speech: {number of word selection criteria} × {number of word sets} = 2 × 6 = 12;
W&WA&WT feature groups—groups in which, along with the frequencies of the words in the given text, the total frequencies in the authors’ unified texts and in the training set are used: {number of feature groups of type 3} + {number of feature groups of type 2} = 1 + 5 = 6.
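As with the character n-gram groups, this breakdown can be checked directly:

```python
# Count of word frequency feature groups, per the breakdown above.
word_groups = {
    "W_p1":    2 * 4,
    "W_p1MAN": 1 * 5,
    "W_p2":    2 * 6,
    "W_p2AUX": 2 * 6,
    "W&WA&WT": 1 + 5,
}
print(sum(word_groups.values()))  # 43
```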
The calculation of the frequency of an arbitrary word in a given text, of its total frequencies in the authors’ texts in the training set, and of its total frequency over all texts in the training set is outlined in Section 3.5.