
7 June 2024

Genre Classification of Books in Russian with Stylometric Features: A Case Study

Department of Software Engineering, Shamoon College of Engineering, Beer Sheva 84500, Israel
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications

Abstract

Within the literary domain, genres function as fundamental organizing concepts that provide readers, publishers, and academics with a unified framework. Genres are discrete categories that are distinguished by common stylistic, thematic, and structural components. They facilitate the categorization process and improve our understanding of a wide range of literary expressions. In this paper, we introduce a new dataset for genre classification of Russian books, covering 11 literary genres. We also perform dataset evaluation for the tasks of binary and multi-class genre identification. Through extensive experimentation and analysis, we explore the effectiveness of different text representations, including stylometric features, in genre classification. Our findings clarify the challenges present in classifying Russian literature by genre, revealing insights into the performance of different models across various genres. Furthermore, we address several research questions regarding the difficulty of multi-class classification compared to binary classification, and the impact of stylometric features on classification accuracy.

1. Introduction

In the realm of literature, the concept of genre serves as a fundamental organizational principle, providing readers, publishers, and scholars with a cohesive framework. Within any collection of works, a genre stands as a distinct category, characterized by shared stylistic, thematic, and structural elements. It functions as a conceptual tool, simplifying the process of categorization and enhancing the comprehension of a diverse array of literary expressions. Genres encompass a broad spectrum, ranging from timeless and conventional categories like fiction and non-fiction to more nuanced classifications such as mystery, romance, fantasy, and beyond.
Genres extend beyond mere categorization; they act as guiding beacons, directing readers toward narratives that align with their preferences and expectations. Recognizing the genre of a literary work becomes akin to following arrows that point toward stories tailored to individual tastes.
A text’s stylistic features serve as markers of different genres and are frequently employed for automatic analysis in this field because they capture, among other things, a text’s structural idiosyncrasies [1,2].
Automatic genre classification makes it possible to solve several computational linguistics problems more quickly, including figuring out a word or phrase’s meaning or part of speech, locating documents that are pertinent to a semantic query [3], improving authorship attribution [4,5,6], and more [2,7].
Only a small number of the numerous studies that address automatic genre classification (discussed in more detail in Section 2) focus on Russian literature, and just three corpora have been created for the task of genre classification in Russian, even though there are over 258 million Russian speakers in the world [8]. The corpus introduced in [9] contains texts collected from the Internet that belong to six genre segments of Internet texts, namely, contemporary fiction, poetry, social media, news, subtitles for films, and a collection of thematic magazines annotated as “the rest”. These texts span 5 billion words; however, of the six genres, only two can be attributed to literature—fiction and poetry. The corpus of [2] contains 10,000 texts assigned to five different genres—novels, scientific articles, reviews, posts from the VKontakte social network [10], and news texts from OpenCorpora [11], the open corpus of Russian texts. Only one genre in this corpus is a literary genre—the novels. In [12], the authors developed a corpus of A.S. Pushkin’s Lyceum period (1813–1817) that contains ten different genres of his poems. However, this corpus contains no prosaic texts, and the texts are limited to a single author.
None of the above corpora covers a significant number of modern and historical genres in Russian literature. To close this gap, we present a new dataset of Russian books spanning eleven diverse literary genres, thereby providing a comprehensive resource for studying genres in Russian literature and facilitating research in text classification. We evaluate several traditional machine learning models alongside state-of-the-art deep learning models, including transformers [13] and dual contrastive learning [14], on both binary and multi-class genre classification tasks. Furthermore, we provide insights into the strengths and limitations of each model, shedding light on their applicability to real-world genre classification scenarios. This dataset can serve as a valuable resource for researchers interested in advancing the understanding and development of genre analysis and classification systems for Russian texts.
We perform an extensive evaluation of binary and multi-class genre classification on a subset of our dataset and analyze the results; we employ a wide range of text representations, including stylometric features [2]. The purpose of this evaluation is to show that genre classification of Russian books is a highly nontrivial task. We also analyze which text representations work better for which task, and how classification performance differs across genres.
We address the following research questions in our work.
  • RQ1: Do stylometric features improve genre classification accuracy?
  • RQ2: What genres are easier to classify?
  • RQ3: Does contrastive learning perform better for genre classification than fine-tuned transformer models and traditional models?
  • RQ4: Does removing punctuation decrease classification accuracy for genre classification?
  • RQ5: Does a transformer model pre-trained on Russian perform better than a multi-lingual transformer model?
This paper is organized as follows. Section 2 describes the related work. Section 3 describes our dataset, the process of its collection, and the data processing we performed. In Section 4, we describe text representations and classification models we used to perform genre classification on our data. Section 5 describes the hardware and software setup and full results of our experimental evaluation. Finally, Section 6 and Section 7 discuss the conclusions and limitations of our approach.

3. The SONATA Dataset

We have considered several online sources of Russian literary texts. Our requirements were as follows:
  • A wide, up-to-date, and legitimate selection of titles, and agreements with leading Russian and international publishers;
  • Clear genre labels and a wide selection of genres;
  • The option to freely and legally download a significant number of text samples in .txt format;
  • A convenient site structure that allows automated data collection.
The following options were examined. The LitRes site [44] contains a wide range of e-books across various genres, including fiction, non-fiction, educational materials, and more. For most of the books, only text fragments, not the whole texts, can be downloaded. Moreover, to use the LitRes API and automatically download multiple text samples, a user is required to pay with a credit card issued in Russia, which may not be suitable for some researchers. http://royallib.com/ (accessed on 1 January 2024) is an online library that offers a large collection of free electronic books [45]. The site provides access to a wide range of e-books, including classic literature, modern novels, non-fiction, educational materials, and more. This site offers books for free, making literature accessible to a broad audience. However, this feature also implies that the text collection is outdated, because most modern Russian books are subject to copyright. Finally, https://knigogo.net/ (accessed on 1 January 2024) is a Russian-language website that provides fresh news about literature, reviews, and feedback on popular books. It contains a large selection of audiobooks and online books in formats such as fb2, rtf, epub, and txt for iPad, iPhone, Android, and Kindle. It has clear genre labels and a convenient structure that allows efficient parsing. Moreover, it provides free access to text samples.
For these reasons, we have chosen to collect our data from https://knigogo.net/ (accessed on 1 January 2024). We name the resulting dataset SONATA for ruSsian bOoks geNre dATAset. Genre categories were translated into English for the reader’s benefit because the source website [46] supports the Russian language only. The dataset is available on GitHub at https://github.com/genakogan/Identification-of-the-genre-of-books.

3.1. The Genres

To build our dataset, we used the genres provided by the Knigogo website at https://knigogo.net/zhanryi/ (accessed on 1 January 2024). Because not all genres and sub-genres provided by the website have a sufficient amount of data for analysis, we filtered out the less-represented genres and combined sub-genres where appropriate, aiming to streamline the classification for more meaningful insights.
As a result, 11 genres were selected for the dataset: science fiction, detective, romance, fantasy, classic, action, non-fiction, contemporary literature, adventure, novel and short stories, and children’s books. Under non-fiction, we grouped all genres that do not belong to fiction literature; due to the relatively small number of books in each sub-genre within non-fiction, considering each sub-genre separately would not be productive for our experiment. The original list of genres on Knigogo and the list of selected genres are depicted in Figure 1. All genres covered by the SONATA dataset and their translations are shown in Table 1.
Figure 1. Genre taxonomy in Knigogo and a subset of genres used in the SONATA dataset.
Table 1. Genres in Russian and their translation to English.

3.2. Data Processing and Statistics

Book downloading was performed by a Python script. The script first extracts URLs leading to pages where individual books can be downloaded; it then extracts the URLs for direct download of the text files of these books. Finally, the script attempts to download each book, saving it as a text file in a directory. The execution of the script is initiated with a specific URL associated with fantasy genre books. A more detailed description of the script is available in Appendix A.
As a result, 8189 original books were downloaded. However, because some books belong to multiple genres, the total number of book instances with a single genre label is 10,444.
During data processing, a series of challenges arose, such as text formatting, removing author names and publisher information, and the sheer volume of the texts. We applied the following steps: (1) using a custom Python script, we re-encoded the files into UTF-8; (2) we parsed the second line of text containing the meta-data and removed the authors’ names; (3) finally, we took the books without author names and split them into small parts (called chunks) of 300 words each. Splitting text into chunks allows us to process long texts that exceed the length limit of many pre-trained language models (LMs).
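As an illustration, the chunking step can be sketched as follows. This is a minimal sketch assuming one UTF-8 plain-text book per file; the directory names are hypothetical, the 300-word chunk size follows the paper, and we assume the trailing partial chunk is dropped so that every chunk contains exactly 300 words:

```python
# Minimal chunking sketch; directory names are hypothetical, the 300-word
# chunk size follows the paper, and the trailing partial chunk is dropped.
from pathlib import Path

CHUNK_SIZE = 300  # words per chunk

def split_into_chunks(text: str, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split a book into consecutive chunks of exactly `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words) - chunk_size + 1, chunk_size)
    ]

out_dir = Path("chunks")
out_dir.mkdir(exist_ok=True)
for book_path in Path("books_utf8").glob("*.txt"):
    text = book_path.read_text(encoding="utf-8")
    for idx, chunk in enumerate(split_into_chunks(text)):
        (out_dir / f"{book_path.stem}_{idx}.txt").write_text(chunk, encoding="utf-8")
```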
The number of books and book chunks per genre in the SONATA dataset appears in Table 2. Because some books are attributed to several genres on the original site, the total number of unique books is smaller than the sum of books per genre. We report both values in the table.
Table 2. Size of the collected data.

4. Binary and Multi-Class Genre Classification

In binary genre classification, the task involves categorizing texts into two distinct genres; for example, a text can be classified as either fiction or non-fiction, or as romance or thriller. In multi-class genre classification, texts are classified into more than two genres or categories. This task is usually more complex than binary genre classification, as the classifier needs to differentiate between multiple classes and assign the most appropriate genre label to each text. Multi-class genre classification problems are often encountered in large-scale text categorization tasks, where texts can belong to diverse and overlapping genres.

4.1. The Pipeline

To evaluate the SONATA dataset for tasks of binary and multi-class genre classification, we first processed and sampled the data (see details in Section 4.2) and generated the appropriate text representation (see details in Section 4.3). Then, we split the text representations into training and test sets, trained the selected model (see Section 4.4) on the training set and evaluated the model on the test set. This pipeline is depicted in Figure 2.
Figure 2. Evaluation pipeline.

4.2. Preprocessing and Data Setup

We did not change the case of the texts and did not remove punctuation in the main setup. The effect of these optional operations on the evaluation results is addressed in Appendix A.
Because of hardware limitations and the data size, we could not apply classification models to the whole dataset. To construct a sample of our dataset, we first selected one random text chunk from every book, to avoid author recognition standing in for genre recognition. Then, we sampled N chunks at random from every genre, where N is a user-defined parameter. In our experiments, we used N = 100. The number of text chunks, average character counts, and the number of unique words for sample size N = 100 and every genre are shown in Table 3 (we do not report the average number of words per chunk because all the chunks in our data contain exactly 300 words). The effect of smaller and larger values of N is addressed in Appendix A.
Table 3. Data statistics.
For the binary genre classification task, we select text chunks as a balanced random sample of size N, where N is a user-defined parameter. If a genre contains fewer than N book chunks, we select all of them. In each sample, half of the chunks are positive samples belonging to the selected genre, and the other half are negative samples chosen uniformly at random from all the other genres. We ensure that chunks belonging to the same book do not fall into different sample categories. The positive samples are labeled 1, and the negative samples are labeled 0. For the multi-class genre classification task, we select a random sample of N text chunks from every genre, where N is a user-defined parameter. If a genre contains fewer than N book chunks, all of them are added to the data. The label of every sample is determined by its genre and is a number in the range [0, 10].
For the evaluation, the obtained balanced dataset is randomly divided into training and test sets with the ratio 80%/20%. This process is illustrated in Figure 3.
Figure 3. Data sampling and train/test split process.
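A compact sketch of the multi-class sampling and splitting procedure is given below; the `chunks_by_genre` structure and all names are illustrative placeholders, while N, the one-chunk-per-book rule, and the 80%/20% split follow the description above:

```python
# Illustrative sampling sketch: `chunks_by_genre` maps a genre name to the
# list of its text chunks (one chunk per book, selected upstream).
import random
from sklearn.model_selection import train_test_split

N = 100  # user-defined sample size per genre

chunks_by_genre = {
    "action": ["chunk text ...", "another chunk ..."],  # toy placeholder data
    "adventure": ["chunk text ...", "another chunk ..."],
    # ... one entry per genre
}

rng = random.Random(42)
texts, labels = [], []
for label, genre in enumerate(sorted(chunks_by_genre)):
    chunks = chunks_by_genre[genre]
    # take all chunks if the genre has fewer than N of them
    chosen = chunks if len(chunks) <= N else rng.sample(chunks, N)
    texts.extend(chosen)
    labels.extend([label] * len(chosen))  # labels in the range [0, 10]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)
```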

4.3. Text Representations

We represent texts as vectors and optionally enhance them with stylometric features. Details are provided in the subsections below. Figure 4 depicts the general pipeline of text representation construction.
Figure 4. Text representations.

4.3.1. Sentence Embeddings

BERT sentence embeddings [13] are vector representations of entire sentences generated using the BERT model. These embeddings can then be used as features for various downstream NLP tasks. The sentence embeddings (SEs) we use in this work were obtained using one of the pre-trained BERT models (a multi-lingual model of [13] or a Russian BERT model [43]). With both models, the SE vector size is 768.
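As an illustration, mean-pooled SE vectors can be computed along the following lines; we assume the publicly available DeepPavlov/rubert-base-cased checkpoint, and mean pooling is our illustrative choice rather than a detail reported in the paper:

```python
# SE computation sketch; the checkpoint and the pooling strategy are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModel.from_pretrained("DeepPavlov/rubert-base-cased").eval()

@torch.no_grad()
def sentence_embeddings(texts: list[str]) -> torch.Tensor:
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    hidden = model(**enc).last_hidden_state           # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)        # zero out padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)

vectors = sentence_embeddings(["Пример текста из книги.", "Ещё один фрагмент."])
print(vectors.shape)  # torch.Size([2, 768])
```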
Figure 5 shows the distribution of books from different genres, where every book is represented by its SE vector computed with ruBERT. For data visualization, we used t-Distributed Stochastic Neighbor Embedding (t-SNE), a common dimensionality reduction technique [47]. It is designed to preserve pairwise similarities, making it more effective at capturing non-linear structures and clusters in high-dimensional data. We can see that contemporary literature (top left) and science fiction (top right) are the only genres for which the data points are partially clustered together. This plot demonstrates that genre classification is a non-trivial task and relying solely on SE can be challenging.
Figure 5. Sentence embedding features of 11 genres represented by t-SNE for samples of size N = 100 .
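A plot of this kind can be reproduced with scikit-learn's t-SNE; the arrays below are placeholders standing in for the real SE vectors and genre labels:

```python
# t-SNE projection sketch; `embeddings` and `labels` are placeholder arrays.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.rand(1100, 768)   # stand-in for the real SE matrix
labels = np.repeat(np.arange(11), 100)   # 11 genres, N = 100 chunks each

points = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab20", s=8)
plt.title("SE vectors of 11 genres (t-SNE)")
plt.show()
```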

4.3.2. BOW Vectors with tf-idf and n-Gram Weights

Term Frequency-Inverse Document Frequency (tf-idf) is a quantitative measure of the importance of a term within a document set or corpus. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. In our methodology, book chunks were treated as discrete documents, and the entire dataset was regarded as the corpus. We filtered out Russian stopwords using the list provided by the NLTK package [48], to which we added the word ‘это’ (meaning ‘this’); details are provided in Appendix A.
N-grams are sequences of n consecutive words or characters in the text, where n is a parameter. In our evaluation, we used the values n = 1, 2, 3 for both character and word n-grams. For word n-grams, we filtered out Russian stopwords as well, using the list provided by NLTK [48]. Vector sizes for these representations for samples of size N = 100 are shown in Table 4. For the multi-class setup, we empirically limited the number of word n-gram features to 10,000 due to the very large vector sizes. This does not affect our analysis because this text representation does not provide the best performance.
Table 4. BOW vector lengths for samples of size N = 100 .
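For illustration, the three BOW representations can be built with scikit-learn as sketched below; the stopword handling follows the description above, while the toy `texts` list is a placeholder:

```python
# BOW representation sketch; `texts` is placeholder data.
from nltk.corpus import stopwords  # requires nltk.download("stopwords") first
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

russian_stopwords = stopwords.words("russian") + ["это"]
texts = ["Первый фрагмент книги ...", "Второй фрагмент книги ..."]

tfidf_vectors = TfidfVectorizer(stop_words=russian_stopwords).fit_transform(texts)
word_ngram_vectors = CountVectorizer(
    ngram_range=(1, 3), stop_words=russian_stopwords,
    max_features=10_000,  # the feature cap used in the multi-class setup
).fit_transform(texts)
char_ngram_vectors = CountVectorizer(
    analyzer="char", ngram_range=(1, 3)).fit_transform(texts)
```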

4.3.3. Stylometric Features

StyloMetrix [49], introduced in [39], is a multi-lingual tool for generating stylometric vectors for texts in Polish, English, German, Ukrainian, and Russian. These vectors encode linguistic features related to writing style and can be used for authorship attribution and genre classification. A total of 95 metrics are supported for the Russian language. The metrics describe lexical forms, parts of speech, syntactic forms, and verb forms. The lexical metrics provide information on plural and singular nouns, and additional morphological features such as animacy (animate/inanimate), gender (feminine, masculine, and neutral), distinguishing between first names and surnames, and diminutives. Direct and indirect objects as well as cardinal and ordinal numerals are also included in the metrics. Distinctive lexical forms of pronouns, such as demonstrative, personal, total, relative, and indexical, are reported, as well as qualitative, quantitative, relative, direct, and indirect adjectives. Other lexical features include punctuation, direct speech, and three types of adverb and adjective comparison. A partial list of stylometric features for Russian is provided by the authors of the StyloMetrix package in their GitHub repository [50]; we present a compact list of these features in Appendix A. We compute stylometric features with StyloMetrix for the text chunks in our dataset and use them alone or in conjunction with the text representations described in the previous sections.
Data visualization of stylometric features with t-SNE for data samples of size N = 100 and 11 genres is shown in Figure 6. We can see that, similarly to SEs, contemporary literature (bottom left) and science fiction (bottom right) are the only genres for which the data points are partially clustered together. Therefore, relying solely on stylometric features for genre classification is not expected to produce a good result.
Figure 6. Stylometric features of 11 genres represented by t-SNE for samples of size N = 100 .
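Combining a base representation with stylometric features then amounts to concatenating the two feature matrices; a minimal sketch with placeholder arrays:

```python
# Feature concatenation sketch; both matrices are placeholders here.
import numpy as np

se_vectors = np.random.rand(1100, 768)    # SE matrix from Section 4.3.1
stylo_vectors = np.random.rand(1100, 95)  # 95 StyloMetrix metrics for Russian

se_plus_stylometry = np.hstack([se_vectors, stylo_vectors])
print(se_plus_stylometry.shape)  # (1100, 863)
```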

4.4. Classification Models

4.4.1. Traditional Models

The Random Forest (RF) [28] classifier is an ensemble learning technique that builds several decision trees during training; each tree in the forest is built from a bootstrapped sample of the training data and a random subset of the features. Logistic regression (LR) [29] models the relationship between the predictor variables and the probability of the outcome using the logistic function, which converts input values into probabilities between 0 and 1. The Extreme Gradient Boosting (XGB) [30] classifier is an ensemble learning technique that creates a predictive model using gradient boosting: it repeatedly builds a sequence of decision trees, each new tree attempting to fix the mistakes of the preceding ones. We apply the RF, LR, and XGB models to all data representations described in Section 4.3.
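A minimal sketch of training the three classifiers follows, where `X_train`/`X_test` hold one of the vector representations from Section 4.3 and `y_train`/`y_test` the genre labels; the hyperparameters are illustrative defaults:

```python
# Traditional classifier sketch; hyperparameters are illustrative defaults.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

models = {
    "RF": RandomForestClassifier(random_state=42),
    "LR": LogisticRegression(max_iter=1000),
    "XGB": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```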

4.4.2. Voting Ensemble of Traditional Models

A voting-based ensemble classifier approach combines the predictions of multiple base classifiers to make a final decision. Each base classifier is trained independently on the same dataset or subsets of it using different algorithms or parameter settings [51]. During the prediction phase, each base classifier provides its prediction for a given instance. The final prediction is then determined by aggregating the individual predictions through a voting mechanism, where the most commonly predicted class is selected as the ensemble’s prediction.
In our voting setup, we use RF, LR, and XGB classifiers described in Section 4.4.1. For the binary genre classification setup, the decision is made based on the majority vote of the three classifiers (i.e., max voting). For the multi-class genre classification setup, we computed the sum of probabilities of every class based on the individual probability distributions produced by each classifier and assigned each sample to the class with the highest accumulated probability.
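The two voting schemes can be sketched as follows; `models` stands for the list of fitted classifiers from the previous sketch:

```python
# Voting sketch: max voting for the binary task, probability summation for
# the multi-class task.
import numpy as np

def majority_vote(models, X):
    votes = np.stack([m.predict(X) for m in models])   # (n_models, n_samples)
    return np.apply_along_axis(
        lambda v: np.bincount(v).argmax(), axis=0, arr=votes)

def probability_sum_vote(models, X):
    summed = sum(m.predict_proba(X) for m in models)   # (n_samples, n_classes)
    return summed.argmax(axis=1)
```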

4.4.3. Fine-Tuned Transformers

We fine-tune transformer models and apply them to the texts in our dataset for both tasks—binary genre classification and multi-class genre classification.
The main transformer model we employ is the ruBERT (Russian BERT) model, specifically the ruBERT-base-cased variant, which is pre-trained on large-scale Russian text derived from the Russian part of Wikipedia and news data [43]. The baseline transformer model is the BERT multi-lingual base model bert-base-multilingual-uncased developed by GoogleAI [13]. This model is pre-trained on the top 102 languages (including Russian) with the largest Wikipedias using a masked language modeling objective. We denote this model by mlBERT. Both models utilize the BERT transformer architecture, which employs a bidirectional approach that allows it to capture contextual information from both left and right contexts.
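A minimal fine-tuning sketch with the HuggingFace Trainer follows; the checkpoint names are the public variants matching the models described here, while the toy dataset, epoch count, and batch size are illustrative:

```python
# Fine-tuning sketch; toy data and hyperparameters are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "DeepPavlov/rubert-base-cased"  # or "bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=11)

ds = Dataset.from_dict({"text": ["Пример текста ...", "Ещё один фрагмент ..."],
                        "label": [0, 1]})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=512),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds,
)
trainer.train()
```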

4.4.4. Dual Contrastive Learning

To tackle the problem of genre classification, we also apply the advanced text classification method DualCL [27] that uses contrastive learning with label-aware data augmentation.
The objective function used in this method consists of two contrastive losses, one for labeled data and another for unlabeled data. The contrastive loss for each labeled instance $(x_i, y_i)$ is computed as
$$\mathcal{L}_L = -\log \frac{e^{f(x_i) \cdot g(y_i)}}{\sum_{j=1}^{N} e^{f(x_i) \cdot g(y_j)}}$$
where $f(x_i)$ is the feature representation of input $x_i$, $y_i$ is the corresponding label of $x_i$, $g(y_i)$ is the embedding of label $y_i$, and $N$ is the total number of classes. In our experiments, we use the pre-trained transformer model ruBERT [43] as the basis for the feature representation computation. The number of classes $N$ is set to 2 for the task of binary genre classification, and to 11 for multi-class genre classification. The contrastive loss for unlabeled data is computed as
$$\mathcal{L}_U = -\log \frac{e^{f(x_i) \cdot f(x_j)/\tau}}{\sum_{k=1}^{M} e^{f(x_i) \cdot f(x_k)/\tau}}$$
where $f(x_i)$ and $f(x_j)$ are the feature representations of inputs $x_i$ and $x_j$, $M$ is the total number of unlabeled instances, and $\tau$ is a temperature parameter that controls the concentration of the distribution. We use the default value of $\tau$ provided by [52].
Dual contrastive loss is the combination of the contrastive losses for labeled and unlabeled data, along with a regularization term:
$$\mathcal{L} = \mathcal{L}_L + \lambda \mathcal{L}_U + \beta \|\theta\|^2$$
where $\lambda$ and $\beta$ are hyperparameters that control the trade-off between the supervised and unsupervised losses, and $\theta$ in the regularization term represents the model parameters. We use the parameter values of the native implementation in [52].
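To make the labeled loss concrete, the PyTorch fragment below evaluates $\mathcal{L}_L$ for a batch; it illustrates the formula only and is not the full DualCL implementation of [52]:

```python
# L_L sketch: cross-entropy over f(x_i)·g(y_j) logits reproduces the formula.
import torch
import torch.nn.functional as F

def labeled_contrastive_loss(feats: torch.Tensor,       # (batch, d), rows f(x_i)
                             label_embs: torch.Tensor,  # (n_classes, d), rows g(y)
                             labels: torch.Tensor       # (batch,), class indices
                             ) -> torch.Tensor:
    logits = feats @ label_embs.T           # f(x_i) · g(y_j) for every class j
    return F.cross_entropy(logits, labels)  # mean of -log softmax at true labels

loss = labeled_contrastive_loss(torch.randn(4, 768), torch.randn(11, 768),
                                torch.tensor([0, 3, 7, 10]))
```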

5. Experimental Results

5.1. Hardware Setup

Experiments were performed on a cloud server with a 2-core Intel Xeon CPU, 16 GB of RAM, and 1 NVIDIA TU104GL GPU. The runtime for every experiment setting (binary or multi-class classification) was less than 10 min.

5.2. Software Setup

All non-neural models were implemented with the sklearn [53] Python package. Our neural models were implemented with PyTorch [54]. The NumPy and Pandas libraries were used for data manipulation. For contrastive learning, we utilized the publicly available Python implementation of DualCL [52]. The pre-trained transformer models mlBERT [55] and ruBERT [56] were applied.

5.3. Models and Representations

We applied the traditional models denoted by RF, LR, and XGB described in Section 4.4.1. We also used the voting model described in Section 4.4.2, denoted by ‘voting’. Additionally, we fine-tuned the two transformer models described in Section 4.4.3 and denote the Russian-language model by ruBERT and the multi-lingual BERT model by mlBERT. Finally, we also applied the dual contrastive model of [27], denoted by DualCL. All of the above models were used for both the binary classification task and the multi-class classification task.
For the traditional and voting models, we used the eight representations described in Section 4.3. Transformer models and DualCL were applied to the raw text.

5.4. Metrics

We report the metrics described below for all the models.
Precision measures the accuracy of positive predictions made by the model, and it is computed as
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
Recall or sensitivity measures the ability of the model to correctly identify all positive instances. It is computed as
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
The F1 measure combines precision and recall into a single metric and is computed as
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Accuracy is the ratio of correctly predicted instances to the total number of instances in the data:
$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Population}}$$
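All four metrics are available in scikit-learn; macro averaging, shown here on toy predictions, is one common choice for the multi-class setup:

```python
# Metric computation sketch on toy predictions.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 0]

print("P  :", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("R  :", recall_score(y_true, y_pred, average="macro"))
print("F1 :", f1_score(y_true, y_pred, average="macro"))
print("Acc:", accuracy_score(y_true, y_pred))
```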
When assessing genre classification results, it is essential to employ all of these metrics because each captures a distinct facet of model performance. Precision focuses on the accuracy of positive predictions, which is important in situations where false positives are costly. Recall evaluates the model’s capacity to find all relevant examples and ensures that true positives are not overlooked, which matters when missing a relevant instance is highly undesirable. Because it combines recall and precision, the F1 score offers a balanced metric that is especially helpful when class distributions are imbalanced.
Although accuracy provides a measure of overall correctness, it can be deceptive in situations where class distributions are not uniform since it may be excessively optimistic [57,58]. Moreover, for the task of genre classification, it is vital to see how a model performs on different classes. The most undesirable output would be assigning all text instances to a single genre, implying that the model does not learn anything except the majority rule. What we seek is a model that learns what the genres are and has moderate to high success in identifying all the genres. Thus, employing all four metrics ensures a comprehensive evaluation.

5.5. Binary Genre Classification Results

This section describes the evaluation of all 11 genres in the SONATA dataset with all the models applied to the task of binary genre identification.

5.5.1. Traditional Models

Table 5 shows the results of the traditional model evaluation. Because of the large number of setups (11 genres, 8 representations, and 3 models), we show the representation and the classifier that achieved the best result for every one of the 11 genres. Here, we use the default setting of N = 100 samples for every genre and address different values of N in Appendix A. We can see that all of the obtained accuracies are above the majority baseline, which is 0.5, as we have balanced data samples. We can also see a clear difference in the classification accuracy of different genres—classic literature is the easiest to detect (with an accuracy of over 0.93), and the short stories genre is the hardest (with an accuracy of 0.68). This may be because classic literature tends to employ a more formal and elaborate language style compared to short stories. The language in classic literature often includes archaic words, complex sentence structures, and sophisticated vocabulary, while short stories may use simpler language and have a more straightforward narrative style. In terms of text representation, sentence embeddings perform better for the majority of genres but not for all of them, and stylometric features are helpful in some but not all of the cases. Tf-idf vectors work best for children’s and contemporary literature—children’s literature typically uses simpler language with shorter sentences, basic vocabulary, and straightforward syntax that tf-idf vectors can capture successfully, while contemporary Russian literature may employ innovative narrative techniques, non-linear storytelling, metafiction, and experimental forms that are also expressed in the vocabulary used. The traditional classifier that performs best for the majority of genres (but not always) is RF.
Table 5. Results of binary genre classification with traditional models for sample size N = 100 .

5.5.2. The Voting Model

Table 6 shows the results of the evaluation for the ensemble voting models. Because of the large number of setups (11 genres and 8 representations), we show the results for the text representation that achieved the best result for every one of the 11 genres. The arrows indicate the increase (↑) or decrease (↓) of classification accuracy in comparison to the best traditional model for that genre. We can see that in all but one genre (non-fiction) the voting model does not outperform the best single traditional classifier. Non-fiction literature can include a wide range of sub-genres, including history, science, biography, memoirs, essays, and more. Each of these sub-genres may have distinct linguistic features that perhaps can be better captured by a voting ensemble. However, for other genres, single models seem to build different classification functions that do not separate between classes in the same way. The texts that fall to opposite “sides” of each separation function “confuse” the ensemble model.
Table 6. Results of binary genre classification with the voting model; grey color indicates the best result.

5.5.3. Fine-Tuned Transformers

Table 7 shows the results of fine-tuning and testing the BERT-based models—mlBERT and ruBERT—for every one of the 11 genres separately. We can see that both models produce results that are much worse than those of the traditional models, and in most cases, these results fall below the majority baseline. This outcome might be the result of several factors. First, our training data might be too small for efficient training of large LMs. Second, distinguishing one specific genre from a mix of multiple genres based on semantics alone, without any stylistic features, is a difficult task.
Table 7. Results of binary genre classification with fine-tuned BERT models; the grey color indicates the best accuracy.
To our surprise, ruBERT achieves higher classification accuracy for only some genres, not all of them. This may be an indication that the cross-lingual training of mlBERT allows the model to utilize insights from other languages when classifying Russian texts.

5.5.4. Dual Contrastive Learning

Table 8 shows the results of binary genre classification with the DualCL model that employs either ruBERT or mlBERT as its base transformer model. We also indicate with grey color the best accuracy among the two base transformer models.
Table 8. Results of binary genre classification for DualCL; the grey color indicates better results.
We can see that while the results are worse than those of traditional models for every genre, there is a significant improvement over the fine-tuned transformer models. It is also evident that for all but one genre (romance), the DualCL model with ruBERT outperforms the same model with mlBERT.

5.6. Multi-Class Genre Classification Results

5.6.1. Traditional Models

Table 9 contains the results produced by the traditional classifiers RF, LR, and XGB for all the text representations we employ. For every text representation, we report the results of the best model out of the three (full results are provided in Appendix A). Here, we can see a clear advantage of using sentence embeddings enhanced with stylometric features as the text representation. The second-best result is achieved by sentence embeddings without stylometric features. These results indicate that capturing semantic information, linguistic patterns, and stylistic characteristics of the text is much more important for Russian genre classification than the vocabulary.
Table 9. Results of multi-class genre classification for traditional models; grey color indicates the best result.
We can also see that the LR classifier achieves the best result for the majority of text representations. This may be because the logistic regression model is less prone to overfitting, especially when the dataset is small.
Table 10 shows the per-genre precision, recall, and F-measure produced by the best model (LR with the SE + stylometry text representation). We can see that the model does not attribute all instances to a single class but produces real predictions for all the genres. Some genres are identified better than others—adventure, contemporary literature, and science fiction. This may be because these genres typically adhere to clear conventions and tropes that provide consistent patterns and signals that a classification model can leverage. In contrast, genres with more fluid boundaries pose greater challenges for classification due to their variability and ambiguity.
Table 10. Best result details of multi-class genre classification for traditional models (SE + stylometry, LR).

5.6.2. The Voting Model

Table 11 shows the evaluation results of the voting ensemble model applied to different text representations. The arrows indicate an increase or decrease in accuracy relative to the results produced by the individual traditional models. In general, the voting model does not reach the high scores of the best single traditional models; however, there is an improvement in accuracy for the text representations that use word n-grams and character n-grams with stylometric features. N-grams capture local syntactic and semantic information, which might be beneficial for capturing genre-specific patterns, and using multiple classifiers may help to identify the most probable outcome, enhancing classification accuracy.
Table 11. Results of multi-class genre classification for the voting model.
The best text representation for the voting model is sentence embeddings. The detailed scores for all the genres produced by this model are shown in Table 12, with an arrow indicating a comparison with the F1 scores of the best single traditional model result shown in Table 10. We can see that while the combined score of this voting model is lower than that of its single-model counterpart, there are individual genres such as children’s books, classic literature, fantasy, and romance novels that have higher F1 scores. Again, we observe that some genres such as adventure and science fiction are ‘easier’ to identify in this task.
Table 12. Best result details of multi-class genre classification for the voting model (SE + stylometry); grey color indicates improvement.

5.6.3. Fine-Tuned Transformer Models

Table 13 shows the evaluation results of the two fine-tuned transformer models, ruBERT and mlBERT, for the task of multi-class genre classification. The scores are low for both models, and the per-genre detailed scores show that both models failed to learn, classifying all texts as belonging to a single class. There is a slight advantage of the mlBERT model over ruBERT, similar to the binary classification task. We believe that classifying books into one of multiple genres based purely on their vocabulary and semantics is a very difficult task, and something beyond the embeddings learned by transformers needs to be provided to a classification layer. One such addition is the stylistic characteristics of the text provided by stylometric features.
Table 13. Multi-class classification results of ruBERT and mlBERT transformer models; grey color indicates better results.

5.6.4. Dual Contrastive Learning

Table 14 contains the evaluation results for the DualCL model with ruBERT and mlBERT backends. The model was trained for 100 epochs, and the best score, achieved at epoch 97, is reported in the table together with the per-genre scores. The scores are much higher than those of the fine-tuned transformers, but they do not reach the results produced by the best single traditional model. Again, we see that some genres, such as non-fiction and classic literature, achieve higher scores than others. Similarly to the binary classification results and the fine-tuning results, the model that uses mlBERT as the backend, surprisingly, outperformed ruBERT in terms of the final accuracy. However, ruBERT-based DualCL performs better for 9 out of 11 individual genres (the grey color in the table indicates the better F1 score per genre between the two variants of the DualCL model).
Table 14. Multi-class classification results for the DualCL model; grey color indicates better results.

5.7. Punctuation Importance

To verify our assumption about the inherent importance of punctuation for genre classification, we conducted a series of additional experiments in which we removed punctuation marks from the texts. This slightly decreased the sizes of the BOW representations (details are provided in Appendix A). We applied the experimental setup of Section 4.2 to the simplified SONATA dataset and used the same sample size of N = 100 book chunks per genre.
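The punctuation-removal step itself can be as simple as the sketch below; extending `string.punctuation` with typographic marks common in Russian texts is our illustrative choice:

```python
# Punctuation stripping sketch; the extra characters are an assumption about
# typographic marks common in Russian books.
import string

PUNCT = string.punctuation + "«»—–…"

def strip_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", PUNCT))

print(strip_punctuation("«Ну, — сказал он, — поехали!»"))
```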
The results of binary genre classification in this setup for the traditional classifiers appear in Table 15. We show the best results for every genre and indicate the representation and classifier that achieve them. The arrows indicate an increase or decrease in accuracy in comparison to the best traditional model results obtained on texts with punctuation. We see that in most cases, the results are inferior to those of Section 5.5.1, and stylometric features play a less prominent role. However, for 3 genres out of 11, the results improved—fantasy, romance, and non-fiction. In these genres, the content words play a crucial role in conveying the theme and style of the text. It is possible that removing punctuation emphasizes these content words, making them more indicative of genre characteristics.
Table 15. Binary classification with traditional classifiers with sample size N = 100 .
The results of multi-class genre classification for this setup appear in Table 16. We applied single traditional models because they produced the best scores on the original SONATA dataset. In a few cases, an improvement was achieved; however, none of these setups outperformed the best model-representation combination in the original setting that uses the data with punctuation.
Table 16. Multi-class classification with traditional classifiers with sample size N = 100 , with and without punctuation; grey color indicates improvement.
To further verify our hypothesis about the importance of punctuation, we also tested the ensemble voting model for the multi-class genre classification; the results are shown in Table 17. No model–representation setup achieved an increase in accuracy over the non-modified dataset. Therefore, we can conclude that preserving punctuation is vital for genre classification.
Table 17. Multi-class classification with the voting model, sample size N = 100 , with and without punctuation (grey color indicates better result in each case).

6. Conclusions

In this study, we investigated the task of genre classification for Russian literature. By introducing a novel dataset comprising Russian books spanning 11 different genres, we facilitate research in genre classification for Russian texts and evaluate multiple classification models to discern their effectiveness in this context. Through extensive experimentation, we explored the impact of different text representations, such as stylometric features, on genre classification accuracy.
Our experimental evaluation confirms that stylometric features improve classification accuracy in most cases, especially in binary genre classification; therefore, RQ1 is answered positively. Our evaluation also shows that while some genres receive higher accuracy scores, the results depend more on the model being used than on the features. Thus, some genres are ‘easier’ for traditional models, while other genres are ‘easier’ for contrastive learning, and so on. This means that RQ2 cannot be answered definitively. We have also verified that contrastive learning performs much better than fine-tuned transformer models for both classification tasks, answering RQ3. Furthermore, we have shown that removing punctuation decreases classification accuracy, answering RQ4 positively. Finally, we found, surprisingly, that the ruBERT model pre-trained on a large Russian corpus performs worse than the multi-lingual BERT model on the multi-class classification task. For binary classification, ruBERT performs worse than multi-lingual BERT on 8 out of 11 genres, answering RQ5 negatively.
Our study highlights the multi-faceted nature of genre classification in Russian literature and underscores the importance of considering diverse factors, including linguistic characteristics, cultural nuances, and genre-specific features. An accurate model of genre classification is capable of performing literary analysis by automating the identification and categorization of texts. This can help researchers better understand how genres in Russian literature have evolved with social and political changes in Russian-speaking nations by, for example, enabling them to identify trends and changes in literary styles across time. Furthermore, by examining relatively small (300-word) text samples rather than the whole texts, our classification models can be utilized to quickly search and retrieve books by genre given the extensive holdings of Russian literary works in digital libraries and archives.
While our research provides valuable insights into genre classification for Russian texts, it also reveals limitations and areas for future exploration.

7. Limitations and Future Research Directions

While our study provides valuable insights into genre classification for Russian literature, several limitations are worth mentioning. Firstly, the dataset we compiled, though extensive, may not encompass the entirety of Russian literary genres, potentially limiting the generalizability of our findings to all facets of Russian literature.
Furthermore, our evaluation of classification models is based on a subset of the dataset, which may not fully capture the diversity and complexity of genre classification for Russian books. The selection of this subset could introduce sampling bias and impact the robustness of our conclusions.
Another limitation relates to the choice of text representations and classification models evaluated in our study. While we explored a range of traditional machine learning models and state-of-the-art deep learning architectures, there may exist alternative approaches or models that could yield superior performance in genre classification tasks. Moreover, the effectiveness of stylometric features and other text representations may vary across different genres, and our study may not fully capture this variability.
Finally, the research questions addressed in our study provide valuable insights into the genre classification of Russian books; however, they represent only a subset of the broader landscape of research questions in this domain. Future studies may explore additional research questions, such as the impact of personal authorial style on genre classification, the role of narrative structure in genre categorization, transfer learning from other languages, or the influence of reader preferences on genre classification outcomes.
Addressing these limitations through further research could enhance the robustness and applicability of genre classification systems for Russian texts.

Author Contributions

Conceptualization, N.V. and M.L.; methodology, N.V. and M.T.; software, N.V. and G.K.; validation, N.V., M.T. and M.L.; formal analysis, N.V. and M.L.; investigation, M.T.; resources, G.K.; data curation, M.T.; writing—original draft preparation, N.V. and M.T.; writing—review and editing, N.V., M.L. and M.T.; supervision, N.V. and M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The SONATA dataset resides in a GitHub repository at https://github.com/genakogan/Identification-of-the-genre-of-books. It is freely available to the NLP community.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NLP: Natural Language Processing
RQ: Research Question
RF: Random Forest
XGB: eXtreme Gradient Boosting
LR: Logistic Regression
CL: Contrastive Learning
RNN: Recurrent Neural Network
SVM: Support Vector Machine
BERT: Bidirectional Encoder Representations from Transformers
R: Recall
P: Precision
F1: F1 measure

Appendix A

Appendix A.1. The List of Stylometric Features

We use the features provided by the StyloMetrix package [49]. Table A1, Table A2 and Table A3 contain the lists of these features by category. Unless specified otherwise, the incidence (number of occurrences) of each feature is reported.
Table A1. Lexical features provided by the StyloMetrix tool.
Lexical Features
type-token ratio for word lemmas, content words, function words, content word types, function word types, nouns in plural, nouns in singular, proper names, personal names, animate nouns, inanimate nouns, neutral nouns, feminine nouns, masculine nouns, feminine proper nouns, masculine proper nouns, surnames, given names, flat multiword expressions, direct objects, indirect objects, nouns in Nominative case, nouns in Genitive case, nouns in Dative case, nouns in Accusative case, nouns in Instrumental case, nouns in Locative case, qualitative adj positive, relative adj, qualitative comparative adj, qualitative superlative adj, direct adjective, indirect adjective, punctuation, dots, comma, semicolon, colon, dashes, numerals, relative pronouns, indexical pronouns, reflexive pronoun, possessive pronoun, negative pronoun, positive adverbs, comparative adverbs, superlative adverbs
Table A2. Part-of-speech features provided by the StyloMetrix tool.
Part-of-Speech Features
verbs, nouns, adjectives, adverbs, determiners, interjections, conjunctions, particles, numerals, prepositions, pronouns, code-switching, number of words in narrative sentences, number of words in negative sentences, number of words in parataxis sentences, number of words in sentences that do not have any root verbs, words in sentences with quotation marks, number of words in exclamatory sentences, number of words in interrogative sentences, number of words in general questions, number of words in special questions, number of words in alternative questions, number of words in tag questions, number of words in elliptic sentences, number of positionings, number of words in conditional sentences, number of words in imperative sentences, number of words in amplified sentences
Table A3. Grammar features provided by the StyloMetrix tool.
Grammar Features
root verbs in imperfect aspect, all verbs in imperfect aspect, active voice, root verbs in perfect form, all verbs in perfect form, verbs in the present tense, indicative mood, imperfect aspect, verbs in the past tense, indicative mood, imperfect aspect, verbs in the past tense, indicative mood, perfect aspect, verbs in the future tense, indicative mood, perfect aspect, verbs in the future tense, indicative mood, imperfect aspect, simple verb forms, verbs in the future tense, indicative mood, complex verb forms, verbs in infinitive, verbs in the passive form, transitive verbs, intransitive verbs, impersonal verbs, passive participles, active participles, adverbial perfect participles, adverbial imperfect participles

Appendix A.2. The List of Russian Stopwords

Below, we show the list of Russian stopwords that we used and their translation.
Stopwords (Russian)
и, в, вo, не, чтo, oн, на, я, с, сo, как, а, тo, все, oна, так, егo, нo, да, ты, к, у, же, вы, за, бы, пo, тoлькo, ее, мне, былo, вoт, oт, меня, еще, нет, o, из, ему, теперь, кoгда, даже, ну, вдруг, ли, если, уже, или, ни, быть, был, негo, дo, вас, нибудь, oпять, уж, вам, ведь, там, пoтoм, себя, ничегo, ей, мoжет, oни, тут, где, есть, надo, ней, для, мы, тебя, их, чем, была, сам, чтoб, без, будтo, чегo, раз, тoже, себе, пoд, будет, ж, тoгда, ктo, этoт, тoгo, пoтoму, этoгo, какoй, сoвсем, ним, здесь, этoм, oдин, пoчти, мoй, тем, чтoбы, нее, сейчас, были, куда, зачем, всех, никoгда, мoжнo, при, накoнец, два, oб, другoй, хoть, пoсле, над, бoльше, тoт, через, эти, нас, прo, всегo, них, какая, мнoгo, разве, три, эту, мoя, впрoчем, хoрoшo, свoю, этoй, перед, инoгда, лучше, чуть, тoм, нельзя, такoй, им, бoлее, всегда, кoнечнo, всю, между
Translation
and, in, in the, not, that, he, on, I, with, with, like, and, then, all, she, so, his, but, yes, you, to, at, already, you (plural), behind, would, by, only, her, to me, was, here, from, me, yet, no, about, to him, now, when, even, well, suddenly, whether, if, already, or, neither, to be, was, him, before, to you, ever, again, already, you (plural), after all, there, then, oneself, nothing, to her, can, they, here, where, there is, need, her, for, we, you (singular), them, than, was, oneself, without, as if, of what, time, also, to oneself, under, will be, what, then, who, this, of that, therefore, of this, what kind, completely, him, here, in this, one, almost, my, by, her, now, were, where, why, all, never, can, at, finally, two, about, other, even if, after, above, more, that, through, these, us, about, all, what kind of, many, whether, three, this, my, however, well, her own, this, before, sometimes, better, a bit, that, cannot, such, to them, more, always, of course, whole, between

Appendix A.3. BOW Vector Statistics for the No-Punctuation SONATA Data Sample

The sizes of BOW vectors for the SONATA dataset sample with N = 100 are provided in Table A4.
Table A4. BOW vector lengths for samples of size N = 100 without punctuation.

Genre              Char n-Grams (n = 1, 2, 3)   Word n-Grams (n = 1, 2, 3)   tf-idf Vector Size
action             10,593                       35,521                       17,923
adventure          9085                         18,236                       11,506
children's         11,647                       27,262                       14,581
classic            11,907                       27,456                       15,905
contemporary       11,844                       35,396                       19,132
detective          11,750                       34,963                       18,532
fantasy            9334                         35,635                       18,044
non-fiction        13,900                       34,900                       17,867
romance            10,867                       33,398                       16,553
science-fiction    9590                         35,824                       18,799
short-stories      9298                         23,673                       13,442
all                23,902                       315,812                      87,706

Appendix A.4. Changing the Number of Samples

Because of hardware limitations, we were unable to run all the experiments on the full data, and we used the sampling procedure described in Section 4.2 with the default value of N = 100 randomly sampled book chunks per genre. However, we also examined setups with fewer and more sampled chunks (N = 50 and N = 150). Table A5 and Table A6 show the sizes of BOW vectors for both of these setups.
Table A5. BOW vector lengths for samples of size N = 50.

Genre              Char n-Grams (n = 1, 2, 3)   Word n-Grams (n = 1, 2, 3)   tf-idf Vector Size
action             12,104                       23,470                       13,037
adventure          11,845                       18,024                       11,108
children's         11,803                       23,115                       12,414
classic            13,332                       23,692                       14,159
contemporary       12,956                       23,747                       14,167
detective          11,680                       24,236                       13,564
fantasy            10,910                       24,235                       13,639
non-fiction        14,435                       24,656                       13,676
romance            10,852                       22,598                       12,048
science-fiction    11,866                       24,234                       14,016
short-stories      12,853                       20,803                       11,874
all                26,192                       222,487                      69,362
Table A6. BOW vector lengths for samples of size N = 150.

Genre              Char n-Grams (n = 1, 2, 3)   Word n-Grams (n = 1, 2, 3)   tf-idf Vector Size
action             16,721                       77,373                       31,126
adventure          13,319                       29,283                       16,241
children's         15,911                       58,789                       24,921
classic            18,091                       62,025                       29,327
contemporary       19,068                       84,345                       36,132
detective          16,964                       93,493                       35,765
fantasy            15,759                       95,349                       37,162
non-fiction        23,288                       90,602                       35,625
romance            15,637                       88,178                       32,476
science-fiction    17,351                       95,400                       39,043
short-stories      16,684                       46,214                       21,788
all                37,228                       687,100                      139,054
Table A7 shows the evaluation results of the traditional models for both of these sampling setups and their comparison to the default sampling setup of N = 100 for the task of multi-class genre classification. The arrows show a decrease or increase in classification accuracy of the sampling setup N = 100 as compared to N = 50, and of the sampling setup N = 150 as compared to N = 100. We observe that, with some exceptions, increasing the sample size does improve the classification accuracy of traditional models across representations. However, it is worth mentioning that increasing the sample size makes running advanced transformer models in our hardware setup much slower and, in some cases, impossible.
Table A7. Comparing smaller and larger sample sizes for multi-class genre classification with traditional models; grey color indicates the best result.

Representation               Classifier   Acc (N = 50)   Acc (N = 100)   Acc (N = 150)
SE                           RF           0.3645         0.3375 ↓        0.3462 ↑
SE                           LR           0.3785         0.4293 ↑        0.4095 ↓
SE                           XGB          0.2617         0.2978 ↑        0.3163 ↑
SE + stylometry              RF           0.3458         0.3400 ↓        0.3374 ↓
SE + stylometry              LR           0.3738         0.4367 ↑        0.4130 ↓
SE + stylometry              XGB          0.2710         0.3127 ↑        0.3603 ↑
char n-grams                 RF           0.2710         0.2333 ↓        0.2882 ↑
char n-grams                 LR           0.2477         0.2705 ↑        0.2953 ↑
char n-grams                 XGB          0.1729         0.2432 ↑        0.3005 ↑
char n-grams + stylometry    RF           0.2336         0.2531 ↑        0.2882 ↑
char n-grams + stylometry    LR           0.2477         0.2506 ↑        0.3146 ↑
char n-grams + stylometry    XGB          0.2290         0.2382 ↑        0.3040 ↑
n-grams                      RF           0.2570         0.2333 ↓        0.2794 ↑
n-grams                      LR           0.3084         0.2754 ↓        0.2988 ↑
n-grams                      XGB          0.1168         0.1712 ↑        0.2320 ↑
n-grams + stylometry         RF           0.2757         0.2878 ↑        0.3076 ↑
n-grams + stylometry         LR           0.3037         0.2779 ↓        0.2917 ↑
n-grams + stylometry         XGB          0.2523         0.2233 ↓        0.2812 ↑
tfidf                        RF           0.2383         0.2432 ↑        0.2865 ↑
tfidf                        LR           0.2991         0.2903 ↓        0.3409 ↑
tfidf                        XGB          0.0981         0.1439 ↑        0.2021 ↑
tfidf + stylometry           RF           0.2523         0.2804 ↑        0.2917 ↑
tfidf + stylometry           LR           0.2430         0.2531 ↑        0.3093 ↑
tfidf + stylometry           XGB          0.2664         0.2134 ↓        0.2654 ↑

Appendix A.5. Full Results of Traditional Models for the Multi-Class Genre Classification Task

Table A8 contains the full list of results for the multi-class classification task performed with traditional models.
Table A8. Full results of multi-class genre classification for traditional models; grey color indicates the best result.

Representation               Classifier   P        R        F1       Acc
SE                           RF           0.3221   0.3310   0.3113   0.3275
SE                           LR           0.4289   0.4332   0.4264   0.4293
SE                           XGB          0.3154   0.3019   0.2997   0.3027
SE + stylometry              RF           0.3361   0.3182   0.3003   0.3176
SE + stylometry              LR           0.4415   0.4471   0.4386   0.4367
SE + stylometry              XGB          0.3082   0.3075   0.2961   0.2978
char n-grams                 RF           0.2163   0.2315   0.2034   0.2333
char n-grams                 LR           0.2865   0.2650   0.2711   0.2705
char n-grams                 XGB          0.2180   0.2373   0.2188   0.2357
char n-grams + stylometry    RF           0.2150   0.2436   0.2080   0.2407
char n-grams + stylometry    LR           0.2694   0.2471   0.2550   0.2506
char n-grams + stylometry    XGB          0.2444   0.2480   0.2357   0.2457
n-grams                      RF           0.2055   0.2494   0.2062   0.2333
n-grams                      LR           0.3004   0.2866   0.2800   0.2754
n-grams                      XGB          0.2011   0.1771   0.1600   0.1712
n-grams + stylometry         RF           0.2962   0.2911   0.2617   0.2878
n-grams + stylometry         LR           0.2956   0.2868   0.2784   0.2779
n-grams + stylometry         XGB          0.2373   0.2349   0.2190   0.2233
tfidf                        RF           0.2234   0.2332   0.1939   0.2283
tfidf                        LR           0.2996   0.2990   0.2372   0.2903
tfidf                        XGB          0.2071   0.1964   0.1794   0.1911
tfidf + stylometry           RF           0.2503   0.2617   0.2308   0.2581
tfidf + stylometry           LR           0.2325   0.2549   0.2190   0.2531
tfidf + stylometry           XGB          0.2187   0.2289   0.1975   0.2233

Appendix A.6. The Python Script Used to Download Books from knigogo.net

The script is freely available at https://github.com/genakogan/Identification-of-the-genre-of-books. Due to size limitations, we do not provide it here in full. Its main parts are described below, followed by a minimal re-implementation sketch.
  • The url_download_for_each_book function takes a URL (in this case, a page with links to free books) and retrieves its HTML content. It then parses the HTML to extract URLs that link to book download pages, specifically those that match a pattern (starting with https://knigogo.net/knigi/ (accessed on 1 January 2024) and ending with /#lib_book_download).
  • The url_text_download_for_each_book function takes the list of book download URLs obtained in the previous step and retrieves the HTML content of each page. It then parses these pages to extract URLs of the actual text files.
  • The download_url function attempts to download the content of a given URL and returns the content if successful.
  • The download_book function receives a text file URL, a book ID, and a save path. It downloads the text file’s content and saves it locally as a .txt file in the specified directory.
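The following minimal sketch illustrates the crawling logic described above. It is not the repository code itself: the function names and the URL pattern follow the description, while the parsing details and the use of the requests and BeautifulSoup libraries are our assumptions.

import os
import re

import requests
from bs4 import BeautifulSoup

def download_url(url):
    # Attempt to download the content of a given URL; return None on failure.
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.content
    except requests.RequestException:
        return None

def url_download_for_each_book(catalog_url):
    # Retrieve a catalog page and collect links to book download pages
    # that match the pattern described above.
    html = download_url(catalog_url)
    if html is None:
        return []
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"^https://knigogo\.net/knigi/.+#lib_book_download$")
    return [a["href"] for a in soup.find_all("a", href=pattern)]

def url_text_download_for_each_book(book_page_urls):
    # Retrieve each book download page and extract the URLs of the actual
    # text files (assumed here to be links ending in ".txt").
    text_urls = []
    for page_url in book_page_urls:
        html = download_url(page_url)
        if html is None:
            continue
        soup = BeautifulSoup(html, "html.parser")
        text_urls += [a["href"] for a in soup.find_all("a", href=True)
                      if a["href"].endswith(".txt")]
    return text_urls

def download_book(text_url, book_id, save_path):
    # Download the text file's content and save it locally as a .txt file.
    content = download_url(text_url)
    if content is not None:
        with open(os.path.join(save_path, f"{book_id}.txt"), "wb") as f:
            f.write(content)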

Appendix A.7. Validation on Texts from a Different Source

To ensure that our best models for genre classification are suitable for texts from different sources, we conducted an additional experiment.
We manually downloaded an additional 200 texts from the https://www.rulit.me (accessed on 15 May 2024) online book repository and performed the pre-processing described in Section 3.2. We filtered for the Russian language, the .txt format, and the science fiction genre, which has the most texts in the repository. The 100 positive samples were texts belonging to the science fiction genre, and the 100 negative samples were texts belonging to other genres, such as detective stories, children’s books, romance, documentaries, adventure, thrillers, ancient literature, esoterics, home economics, science, and culture. Then, we split each text into 300-word chunks and preserved one chunk per book; a minimal sketch of this step follows.
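The chunking step can be implemented as in the following minimal sketch; it assumes whitespace tokenization, and the random choice of the preserved chunk is our assumption (the function name is ours as well).

import random

def sample_chunk(text, chunk_size=300, seed=42):
    # Split a book into consecutive chunks of chunk_size words and
    # preserve a single chunk per book (here chosen at random).
    words = text.split()
    chunks = [words[i:i + chunk_size]
              for i in range(0, len(words), chunk_size)]
    if not chunks:
        return ""
    random.seed(seed)
    return " ".join(random.choice(chunks))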
In the next step, we trained our best-performing models on the training part of the SONATA dataset for the science fiction genre and evaluated them on the rulit.me data. The results of this binary cross-source genre classification are shown in Table A9 and Table A10. Table A9 shows that the traditional models achieve good results on the new data and that the best classifier is RF, which is consistent with our findings in Section 5.5.1. Adding stylometric features is also beneficial in this case, and the best text representation is char n-grams combined with stylometry. Table A10 contains the results of the voting-model evaluation and shows that the best representation in this setup is the same as above. However, as in the tests on the SONATA dataset, the voting models, while producing decent results, fail to match the accuracy of the best traditional model in Table A9. A schematic version of this cross-source evaluation is sketched after this paragraph.
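In the sketch below, the feature matrices are replaced by synthetic stand-ins; the variable names, model settings, and soft-voting configuration are our assumptions rather than the exact experimental code. In the real experiment, X_sonata/X_rulit would hold the text representations described above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier

# Synthetic stand-ins: in the real experiment these are the feature matrices
# of the SONATA training split and the 200 rulit.me chunks.
rng = np.random.default_rng(42)
X_sonata, y_sonata = rng.random((400, 50)), rng.integers(0, 2, 400)
X_rulit, y_rulit = rng.random((200, 50)), rng.integers(0, 2, 200)

# Train a single traditional model on SONATA and evaluate it cross-source.
rf = RandomForestClassifier(random_state=42).fit(X_sonata, y_sonata)
pred = rf.predict(X_rulit)
print("RF   F1 =", f1_score(y_rulit, pred), "Acc =", accuracy_score(y_rulit, pred))

# A soft-voting ensemble over the three traditional classifiers.
voting = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),
                ("lr", LogisticRegression(max_iter=1000)),
                ("xgb", XGBClassifier())],
    voting="soft").fit(X_sonata, y_sonata)
pred = voting.predict(X_rulit)
print("Vote F1 =", f1_score(y_rulit, pred), "Acc =", accuracy_score(y_rulit, pred))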
Table A9. Evaluation results for traditional models on the rulit.me data (accessed on 15 May 2024); grey color indicates best results.
Genre | Representation | Classifier | F1 | Acc
science-fiction | SE | RF | 0.6900 | 0.6900
science-fiction | SE | LR | 0.7097 | 0.7100
science-fiction | SE | XGB | 0.6387 | 0.6400
science-fiction | SE + stylometry | RF | 0.7698 | 0.7700
science-fiction | SE + stylometry | LR | 0.6995 | 0.7000
science-fiction | SE + stylometry | XGB | 0.6297 | 0.6300
science-fiction | char n-grams | RF | 0.6394 | 0.6400
science-fiction | char n-grams | LR | 0.6808 | 0.6900
science-fiction | char n-grams | XGB | 0.6673 | 0.6700
science-fiction | char n-grams + stylometry | RF | 0.7796 | 0.7800
science-fiction | char n-grams + stylometry | LR | 0.6784 | 0.6900
science-fiction | char n-grams + stylometry | XGB | 0.5900 | 0.5900
science-fiction | n-grams | RF | 0.6970 | 0.7000
science-fiction | n-grams | LR | 0.7100 | 0.7100
science-fiction | n-grams | XGB | 0.6052 | 0.6100
science-fiction | n-grams + stylometry | RF | 0.7300 | 0.7300
science-fiction | n-grams + stylometry | LR | 0.7100 | 0.7100
science-fiction | n-grams + stylometry | XGB | 0.6096 | 0.6100
science-fiction | tfidf | RF | 0.6532 | 0.6600
science-fiction | tfidf | LR | 0.6800 | 0.6800
science-fiction | tfidf | XGB | 0.5512 | 0.5600
science-fiction | tfidf + stylometry | RF | 0.7093 | 0.7100
science-fiction | tfidf + stylometry | LR | 0.6255 | 0.6300
science-fiction | tfidf + stylometry | XGB | 0.6394 | 0.6400
Table A10. Evaluation results for the voting model on the rulit.me data (accessed on 15 May 2024); the grey color indicates improvement over individual models.
Genre | Representation | F1 | Acc
science-fiction | SE | 0.6800 | 0.6800
science-fiction | SE + stylometry | 0.6999 | 0.7000
science-fiction | char n-grams | 0.6862 | 0.6900
science-fiction | char n-grams + stylometry | 0.7383 | 0.7400
science-fiction | n-grams | 0.6768 | 0.6800
science-fiction | n-grams + stylometry | 0.6995 | 0.7000
science-fiction | tfidf | 0.6305 | 0.6400
science-fiction | tfidf + stylometry | 0.6753 | 0.6800

References

  1. Kochetova, L.; Popov, V. Research of Axiological Dominants in Press Release Genre based on Automatic Extraction of Key Words from Corpus. Nauchnyi Dialog. 2019, 6, 32–49. [Google Scholar] [CrossRef]
  2. Lagutina, K.V. Classification of Russian texts by genres based on modern embeddings and rhythm. Model. I Anal. Informatsionnykh Sist. 2022, 29, 334–347. [Google Scholar] [CrossRef]
  3. Houssein, E.H.; Ibrahem, N.; Zaki, A.M.; Sayed, A. Semantic protocol and resource description framework query language: A comprehensive review. Mathematics 2022, 10, 3203. [Google Scholar] [CrossRef]
  4. Romanov, A.; Kurtukova, A.; Shelupanov, A.; Fedotova, A.; Goncharov, V. Authorship identification of a Russian-language text using support vector machine and deep neural networks. Future Internet 2020, 13, 3. [Google Scholar] [CrossRef]
  5. Fedotova, A.; Romanov, A.; Kurtukova, A.; Shelupanov, A. Authorship attribution of social media and literary Russian-language texts using machine learning methods and feature selection. Future Internet 2021, 14, 4. [Google Scholar] [CrossRef]
  6. Embarcadero-Ruiz, D.; Gómez-Adorno, H.; Embarcadero-Ruiz, A.; Sierra, G. Graph-based siamese network for authorship verification. Mathematics 2022, 10, 277. [Google Scholar] [CrossRef]
  7. Kessler, B.; Nunberg, G.; Schütze, H. Automatic detection of text genre. arXiv 1997, arXiv:cmp-lg/9707002. [Google Scholar]
  8. Russian Language—Wikipedia, The Free Encyclopedia. 2024. Available online: https://en.wikipedia.org/wiki/Russian_language (accessed on 16 May 2024).
  9. Shavrina, T. Differential Approach to Webcorpus Construction. In Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference Dialogue 2018; National Research University Higher School of Economics: Moscow, Russia, 2018. [Google Scholar]
  10. VKontakte. 2024. Available online: https://vk.com (accessed on 1 January 2024).
  11. OpenCorpora. 2024. Available online: http://opencorpora.org (accessed on 1 January 2024).
  12. Barakhnin, V.; Kozhemyakina, O.; Pastushkov, I. Automated determination of the type of genre and stylistic coloring of Russian texts. In ITM Web of Conferences; EDP Sciences: Les Ulis, France, 2017; Volume 10, p. 02001. [Google Scholar]
  13. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  14. Sun, H.; Liu, J.; Zhang, J. A survey of contrastive learning in NLP. In Proceedings of the 7th International Symposium on Advances in Electrical, Electronics, and Computer Engineering, Xishuangbanna, China, 18–20 March 2022; Volume 12294, pp. 1073–1078. [Google Scholar]
  15. Bulygin, M.; Sharoff, S. Using machine translation for automatic genre classification in Arabic. In Proceedings of the Komp’juternaja Lingvistika i Intellektual’nye Tehnologii, Moscow, Russia, 30 May–2 June 2018; pp. 153–162. [Google Scholar]
  16. Nolazco-Flores, J.A.; Guerrero-Galván, A.V.; Del-Valle-Soto, C.; Garcia-Perera, L.P. Genre Classification of Books on Spanish. IEEE Access 2023, 11, 132878–132892. [Google Scholar] [CrossRef]
  17. Ozsarfati, E.; Sahin, E.; Saul, C.J.; Yilmaz, A. Book genre classification based on titles with comparative machine learning algorithms. In Proceedings of the 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS), Singapore, 23–25 February 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 14–20. [Google Scholar]
  18. Steinwart, I.; Christmann, A. Support Vector Machines; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  19. Saraswat, M.; Srishti. Leveraging genre classification with RNN for Book recommendation. Int. J. Inf. Technol. 2022, 14, 3751–3756. [Google Scholar] [CrossRef]
  20. Webster, R.; Fonteyne, M.; Tezcan, A.; Macken, L.; Daems, J. Gutenberg goes neural: Comparing features of dutch human translations with raw neural machine translation outputs in a corpus of english literary classics. Informatics 2020, 7, 32. [Google Scholar] [CrossRef]
  21. Alfraidi, T.; Abdeen, M.A.; Yatimi, A.; Alluhaibi, R.; Al-Thubaity, A. The Saudi novel corpus: Design and compilation. Appl. Sci. 2022, 12, 6648. [Google Scholar] [CrossRef]
  22. Mendhakar, A. Linguistic profiling of text genres: An exploration of fictional vs. non-fictional texts. Information 2022, 13, 357. [Google Scholar] [CrossRef]
  23. Williamson, G.; Cao, A.; Chen, Y.; Ji, Y.; Xu, L.; Choi, J.D. Exploring a Multi-Layered Cross-Genre Corpus of Document-Level Semantic Relations. Information 2023, 14, 431. [Google Scholar] [CrossRef]
  24. Shavrina, T. Genre Classification on Text-Internal Features: A Corpus Study. In Proceedings of the Web Corpora as a Language Training Tool Conference (ARANEA 2018), Univerzita Komenského v Bratislave, Bratislava, Slovakia, 23–24 November 2018; pp. 134–147. [Google Scholar]
  25. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  26. Le-Khac, P.H.; Healy, G.; Smeaton, A.F. Contrastive representation learning: A framework and review. IEEE Access 2020, 8, 193907–193934. [Google Scholar] [CrossRef]
  27. Chen, Q.; Zhang, R.; Zheng, Y.; Mao, Y. Dual Contrastive Learning: Text Classification via Label-Aware Data Augmentation. arXiv 2022, arXiv:2201.08702. [Google Scholar]
  28. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  29. Wright, R.E. Logistic Regression. In Reading and Understanding Multivariate Statistics; Grimm, L.G., Yarnold, P.R., Eds.; American Psychological Association: Worcester, MA, USA, 1995; pp. 217–244. [Google Scholar]
  30. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd aCm Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  31. Neal, T.; Sundararajan, K.; Fatima, A.; Yan, Y.; Xiang, Y.; Woodard, D. Surveying stylometry techniques and applications. ACM Comput. Surv. 2017, 50, 86. [Google Scholar] [CrossRef]
  32. Lagutina, K.; Lagutina, N.; Boychuk, E.; Vorontsova, I.; Shliakhtina, E.; Belyaeva, O.; Paramonov, I.; Demidov, P. A survey on stylometric text features. In Proceedings of the 2019 25th Conference of Open Innovations Association (FRUCT), Helsinki, Finland, 5–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 184–195. [Google Scholar]
  33. Stamatatos, E.; Fakotakis, N.; Kokkinakis, G. Automatic text categorization in terms of genre and author. Comput. Linguist. 2000, 26, 471–495. [Google Scholar] [CrossRef]
  34. Sarawgi, R.; Gajulapalli, K.; Choi, Y. Gender attribution: Tracing stylometric evidence beyond topic and genre. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, Portland, OR, USA, 23–24 June 2011; pp. 78–86. [Google Scholar]
  35. Eder, M. Rolling stylometry. Digit. Scholarsh. Humanit. 2016, 31, 457–469. [Google Scholar] [CrossRef]
  36. Eder, M.; Rybicki, J.; Kestemont, M. Stylometry with R: A package for computational text analysis. R J. 2016, 8, 107–121. [Google Scholar] [CrossRef]
  37. Maciej, P.; Tomasz, W.; Maciej, E. Open stylometric system WebSty: Integrated language processing, analysis and visualisation. CMST 2018, 24, 43–58. [Google Scholar]
  38. McNamara, D.S.; Graesser, A.C.; McCarthy, P.M.; Cai, Z. Cohesive Features in Expository Texts: A Large-scale Study of Expert and Novice Writing. Writ. Commun. 2014, 31, 151–183. [Google Scholar]
  39. Okulska, I.; Stetsenko, D.; Kołos, A.; Karlińska, A.; Głąbińska, K.; Nowakowski, A. StyloMetrix: An Open-Source Multilingual Tool for Representing Stylometric Vectors. arXiv 2023, arXiv:2309.12810. [Google Scholar]
  40. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep learning–based text classification: A comprehensive review. ACM Comput. Surv. 2021, 54, 62. [Google Scholar] [CrossRef]
  41. Cunha, W.; Viegas, F.; França, C.; Rosa, T.; Rocha, L.; Gonçalves, M.A. A Comparative Survey of Instance Selection Methods applied to Non-Neural and Transformer-Based Text Classification. ACM Comput. Surv. 2023, 55, 265. [Google Scholar] [CrossRef]
  42. Hugging Face. 2016. Available online: https://huggingface.co/ (accessed on 26 April 2024).
  43. Kuratov, Y.; Arkhipov, M. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv 2019, arXiv:1905.07213. [Google Scholar]
  44. LitRes. LitRes: Digital Library and E-Book Retailer. 2024. Available online: https://www.litres.ru (accessed on 1 January 2024).
  45. Royallib. Royallib: Free Online Library. 2024. Available online: https://royallib.com/ (accessed on 1 January 2024).
  46. Knigogo. 2013. Available online: https://knigogo.net/zhanryi/ (accessed on 1 January 2024).
  47. Belkina, A.C.; Ciccolella, C.O.; Anno, R.; Halpert, R.; Spidlen, J.; Snyder-Cappione, J.E. Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. Nat. Commun. 2019, 10, 5415. [Google Scholar] [CrossRef]
  48. Bird, S.; Loper, E.; Klein, E. NLTK: The Natural Language Toolkit. arXiv 2009, arXiv:cs/0205028. [Google Scholar] [CrossRef]
  49. ZILiAT-NASK. StyloMetrix: An Open-Source Multilingual Tool for Representing Stylometric Vectors (Code Repository). 2023. Available online: https://github.com/ZILiAT-NASK/StyloMetrix (accessed on 26 April 2024).
  50. Okulska, I.; Stetsenko, D.; Kołos, A.; Karlińska, A.; Głąbińska, K.; Nowakowski, A. StyloMetrix Metrics List (Russian). 2023. Available online: https://github.com/ZILiAT-NASK/StyloMetrix/blob/main/resources/metrics_list_ru.md (accessed on 26 April 2024).
  51. Drucker, H. Improving Regressors using Boosting Techniques. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML), Nashville, TN, USA, 8–12 July 1997; pp. 107–115. [Google Scholar]
  52. Hiyouga. Dual Contrastive Learning. 2022. Available online: https://github.com/hiyouga/Dual-Contrastive-Learning (accessed on 26 March 2024).
  53. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  54. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
  55. Google Research. BERT: Multilingual (Uncased). 2018. Available online: https://huggingface.co/google-bert/bert-base-multilingual-uncased (accessed on 26 April 2024).
  56. DeepPavlov. RuBERT: Russian (Cased). 2021. Available online: https://huggingface.co/DeepPavlov/rubert-base-cased (accessed on 26 April 2024).
  57. Makridakis, S. Accuracy measures: Theoretical and practical concerns. Int. J. Forecast. 1993, 9, 527–529. [Google Scholar] [CrossRef]
  58. Streiner, D.L.; Norman, G.R. “Precision” and “accuracy”: Two terms that are neither. J. Clin. Epidemiol. 2006, 59, 327–330. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
