Search Results (18)

Search Parameters:
Keywords = stylometric

31 pages, 5323 KiB  
Article
Learning the Style via Mixed SN-Grams: An Evaluation in Authorship Attribution
by Juan Pablo Francisco Posadas-Durán, Germán Ríos-Toledo, Erick Velázquez-Lozada, J. A. de Jesús Osuna-Coutiño, Madaín Pérez-Patricio and Fernando Pech May
AI 2025, 6(5), 104; https://doi.org/10.3390/ai6050104 - 20 May 2025
Viewed by 1003
Abstract
This study addresses the problem of authorship attribution with a novel method for modeling writing style based on subtrees of dependency parse trees. This method exploits the syntactic information of sentences using mixed syntactic n-grams (mixed sn-grams). The method comprises an algorithm to generate mixed sn-grams by integrating words, POS tags, and dependency relation tags. The mixed sn-grams are used as style markers to feed machine learning methods such as an SVM. A comparative analysis was performed to evaluate the performance of the proposed mixed sn-grams method against homogeneous sn-grams on the PAN-CLEF 2012 and CCAT50 datasets. Experiments with PAN 2012 showed the potential of mixed sn-grams to model writing style by outperforming homogeneous sn-grams. Experiments with CCAT50 likewise showed that training with mixed sn-grams improves accuracy over homogeneous sn-grams, with the POS-Word category giving the best result. The study’s results suggest that mixed sn-grams constitute effective stylistic markers for building a reliable writing style model, which machine learning algorithms can learn.
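
As a rough illustration of the idea (not the authors' exact algorithm), the sketch below derives two-element mixed sn-grams from dependency arcs, with spaCy standing in for whatever parser the paper used; the category labels echo the abstract's POS-Word terminology.

```python
# Illustrative sketch only: mixed syntactic bigrams over dependency arcs.
# spaCy is an assumed stand-in for the parser used in the paper.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def mixed_sn_grams(text: str) -> Counter:
    """Count head->child markers mixing words, POS tags, and dep relations."""
    counts = Counter()
    for token in nlp(text):
        if token.dep_ == "ROOT":
            continue  # the root has no head above it
        head = token.head
        counts[f"POS-Word:{head.pos_}->{token.lower_}"] += 1
        counts[f"Word-POS:{head.lower_}->{token.pos_}"] += 1
        counts[f"POS-Dep:{head.pos_}->{token.dep_}"] += 1
    return counts

# Such counts could then feed an SVM as style markers, as the paper does.
print(mixed_sn_grams("The clerk signed the letter slowly.").most_common(5))
```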

16 pages, 1051 KiB  
Article
Kafka’s Literary Style: A Mixed-Method Approach
by Carsten Strathausen, Wenyi Shang and Andrei Kazakov
Humanities 2025, 14(3), 61; https://doi.org/10.3390/h14030061 - 12 Mar 2025
Viewed by 922
Abstract
In this essay, we examine how the polyvalence of meaning in Kafka’s texts is engineered both semantically (on the narrative level) and syntactically (on the linguistic level), and we ask whether a computational approach can shed new light on the long-standing debate about the major characteristics of Kafka’s literary style. A mixed-method approach means that we seek out points of connection that interlink traditional humanist (i.e., interpretative) and computational (i.e., quantitative) methods of investigation. Following the introduction, the second section of our article provides a critical overview of the existing scholarship from both a humanist and a computational perspective. We argue that the main methodological difference between traditional humanist and AI-enhanced computational studies of Kafka’s literary style lies not in the use of statistics but in the new interpretative possibilities enabled by AI methods to explore stylistic features beyond the scope of human comprehension. In the third and fourth sections of our article, we introduce our own stylometric approach to Kafka, detail our methods, and interpret our findings. Rather than focusing on training an AI model capable of accurately attributing authorship to Kafka, we examine whether AI could help us detect significant stylistic differences between the writing Kafka himself published during his lifetime (Kafka Core) and his posthumous writings edited and published by Max Brod.
(This article belongs to the Special Issue Franz Kafka in the Age of Artificial Intelligence)
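
The abstract does not spell out the comparison pipeline, but a conventional stylometric contrast between two corpora (here, hypothetically, the Kafka Core texts vs. the Brod-edited ones) might look like the Burrows-Delta-style sketch below; it is a generic illustration of common practice, not the authors' method.

```python
# Generic Burrows-Delta-style contrast between two corpora; an illustration
# of standard stylometric practice, not the pipeline used in this article.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def delta_distance(docs_a: list[str], docs_b: list[str], n_words: int = 100) -> float:
    vec = CountVectorizer(max_features=n_words)         # most frequent words overall
    X = vec.fit_transform(docs_a + docs_b).toarray().astype(float)
    X /= X.sum(axis=1, keepdims=True)                   # relative frequencies
    z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # z-score each word
    a = z[: len(docs_a)].mean(axis=0)                   # corpus-A profile
    b = z[len(docs_a):].mean(axis=0)                    # corpus-B profile
    return float(np.abs(a - b).mean())                  # mean absolute z-difference
```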

37 pages, 2517 KiB  
Article
Multitask Learning for Authenticity and Authorship Detection
by Gurunameh Singh Chhatwal and Jiashu Zhao
Electronics 2025, 14(6), 1113; https://doi.org/10.3390/electronics14061113 - 12 Mar 2025
Cited by 1 | Viewed by 1105
Abstract
Traditionally, detecting misinformation (real vs. fake) and authorship (human vs. AI) have been addressed as separate classification tasks, leaving a critical gap in real-world scenarios where these challenges increasingly overlap. Motivated by this need, we introduce a unified framework—the Shared–Private Synergy Model (SPSM)—that tackles both authenticity and authorship classification under one umbrella. Our approach is tested on a novel multi-label dataset and evaluated through an exhaustive suite of methods, including traditional machine learning, stylometric feature analysis, and pretrained large language model-based classifiers. Notably, the proposed SPSM architecture incorporates multitask learning, shared–private layers, and hierarchical dependencies, achieving state-of-the-art results with over 96% accuracy for authenticity (real vs. fake) and 98% for authorship (human vs. AI). Beyond its superior performance, our approach is interpretable: stylometric analyses reveal how factors like sentence complexity and entity usage can differentiate between fake news and AI-generated text. Meanwhile, LLM-based classifiers show moderate success. Comprehensive ablation studies further highlight the impact of task-specific architectural enhancements such as shared layers and balanced task losses on boosting classification performance. Our findings underscore the effectiveness of synergistic PLM architectures for tackling complex classification tasks while offering insights into linguistic and structural markers of authenticity and attribution. This study provides a strong foundation for future research, including multimodal detection, cross-lingual expansion, and the development of lightweight, deployable models to combat misinformation in the evolving digital landscape and smart society.
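
A minimal sketch of the shared-private idea named in the abstract, assuming PyTorch and toy layer sizes; the actual SPSM encoder, hierarchical dependencies, and loss weighting are not specified here.

```python
# Toy shared-private multitask classifier: one shared encoder plus a
# task-private branch per task. Sizes and encoder choice are assumptions.
import torch
import torch.nn as nn

class SharedPrivateModel(nn.Module):
    def __init__(self, vocab_size=30522, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, emb_dim)   # shared text encoder
        self.shared = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.priv_auth = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.priv_src = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.head_auth = nn.Linear(2 * hidden, 2)  # real vs. fake
        self.head_src = nn.Linear(2 * hidden, 2)   # human vs. AI

    def forward(self, token_ids):
        e = self.embed(token_ids)
        s = self.shared(e)
        auth = self.head_auth(torch.cat([s, self.priv_auth(e)], dim=-1))
        src = self.head_src(torch.cat([s, self.priv_src(e)], dim=-1))
        return auth, src

model = SharedPrivateModel()
logits_auth, logits_src = model(torch.randint(0, 30522, (4, 32)))
# Balanced joint loss over both tasks, as the ablations above suggest matters.
loss = nn.functional.cross_entropy(logits_auth, torch.randint(0, 2, (4,))) \
     + nn.functional.cross_entropy(logits_src, torch.randint(0, 2, (4,)))
```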

33 pages, 3827 KiB  
Review
Distinguishing Reality from AI: Approaches for Detecting Synthetic Content
by David Ghiurău and Daniela Elena Popescu
Computers 2025, 14(1), 1; https://doi.org/10.3390/computers14010001 - 24 Dec 2024
Cited by 9 | Viewed by 8647
Abstract
The advancement of artificial intelligence (AI) technologies, including generative pre-trained transformers (GPTs) and generative models for text, image, audio, and video creation, has revolutionized content generation, creating unprecedented opportunities and critical challenges. This paper systematically examines the characteristics, methodologies, and challenges associated with detecting synthetic content across multiple modalities, with the aim of safeguarding digital authenticity and integrity. Key detection approaches reviewed include stylometric analysis, watermarking, pixel prediction techniques, dual-stream networks, machine learning models, blockchain, and hybrid approaches. The review highlights their strengths and limitations as well as their detection accuracy: an independent accuracy of 80% for stylometric analysis, and up to 92% when multiple modalities are combined in hybrid approaches. The effectiveness of these techniques is explored in diverse contexts, from identifying deepfakes and synthetic media to detecting AI-generated scientific texts. Ethical concerns, such as privacy violations, algorithmic bias, false positives, and overreliance on automated systems, are also critically discussed. Furthermore, the paper addresses legal and regulatory frameworks, including intellectual property challenges and emerging legislation, emphasizing the need for robust governance to mitigate misuse. Real-world examples of detection systems are analyzed to provide practical insights into implementation challenges. Future directions include developing generalizable and adaptive detection models and hybrid approaches, fostering collaboration between stakeholders, and integrating ethical safeguards. By presenting a comprehensive overview of AI-generated content (AIGC) detection, this paper aims to inform stakeholders, researchers, policymakers, and practitioners on addressing the dual-edged implications of AI-driven content creation.
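
Of the approaches reviewed, stylometric analysis is the easiest to illustrate concretely; the toy detector below pairs a few hand-crafted style features with a linear classifier. The features, labels, and variable names are hypothetical, not drawn from any reviewed system.

```python
# Toy stylometric human-vs-AI text detector; illustrative assumptions only.
import re
import numpy as np
from sklearn.linear_model import LogisticRegression

def stylometric_features(text: str) -> list[float]:
    words = re.findall(r"[A-Za-z']+", text)
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_word = float(np.mean([len(w) for w in words])) if words else 0.0
    avg_sent = len(words) / max(len(sents), 1)                      # words per sentence
    ttr = len(set(w.lower() for w in words)) / max(len(words), 1)   # type-token ratio
    commas = text.count(",") / max(len(sents), 1)                   # punctuation habit
    return [avg_word, avg_sent, ttr, commas]

# X: feature rows for labeled texts; y: 0 = human, 1 = AI-generated
# clf = LogisticRegression().fit(X, y)
```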

30 pages, 1001 KiB  
Article
Genre Classification of Books in Russian with Stylometric Features: A Case Study
by Natalia Vanetik, Margarita Tiamanova, Genady Kogan and Marina Litvak
Information 2024, 15(6), 340; https://doi.org/10.3390/info15060340 - 7 Jun 2024
Viewed by 2038
Abstract
Within the literary domain, genres function as fundamental organizing concepts that provide readers, publishers, and academics with a unified framework. Genres are discrete categories that are distinguished by common stylistic, thematic, and structural components. They facilitate the categorization process and improve our understanding of a wide range of literary expressions. In this paper, we introduce a new dataset for genre classification of Russian books, covering 11 literary genres. We also perform dataset evaluation for the tasks of binary and multi-class genre identification. Through extensive experimentation and analysis, we explore the effectiveness of different text representations, including stylometric features, in genre classification. Our findings clarify the challenges present in classifying Russian literature by genre, revealing insights into the performance of different models across various genres. Furthermore, we address several research questions regarding the difficulty of multi-class classification compared to binary classification, and the impact of stylometric features on classification accuracy.
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)
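
As a hypothetical baseline for this task (not one of the paper's models), character n-gram TF-IDF features, which cope reasonably well with Russian morphology, could be paired with a linear classifier; the dataset variables and the 11 genre labels are placeholders.

```python
# Hypothetical multi-class genre baseline; data loading is assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

genre_clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=20000),
    LogisticRegression(max_iter=1000),
)
# genre_clf.fit(train_texts, train_genres)      # 11 genre labels in the paper
# print(genre_clf.score(test_texts, test_genres))
```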

24 pages, 1274 KiB  
Article
Significance of Single-Interval Discrete Attributes: Case Study on Two-Level Discretisation
by Urszula Stańczyk, Beata Zielosko and Grzegorz Baron
Appl. Sci. 2024, 14(10), 4088; https://doi.org/10.3390/app14104088 - 11 May 2024
Cited by 1 | Viewed by 1027
Abstract
Supervised discretisation is widely considered far more advantageous than unsupervised transformation of attributes, because it helps to preserve the informative content of a variable, which is useful in classification. After discretisation, based on the employed criteria, some attributes can be found irrelevant, and all their values can be represented in a discrete domain by a single interval. In consequence, such attributes are removed from consideration, and no knowledge is mined from them. The paper presents research focused on extended transformations of attribute values, thus combining supervised with unsupervised discretisation strategies. For all variables with single intervals returned by supervised algorithms, the ranges of values were transformed by unsupervised methods with varying numbers of bins. The resulting variants of the data were subjected to selected data mining techniques, and the performance of a group of classifiers was evaluated and compared. The experiments were performed on a stylometric task of authorship attribution.
(This article belongs to the Section Computing and Artificial Intelligence)
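
A minimal sketch of the combined strategy as described: attributes that a supervised discretiser leaves with a single interval are re-binned by an unsupervised method. The supervised step itself, the column handling, and the bin counts are assumed inputs here.

```python
# Re-bin single-interval attributes with an unsupervised discretiser.
# X_disc: output of a supervised discretiser; X_raw: original numeric data.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def rebin_single_interval(X_disc, X_raw, n_bins=3, strategy="uniform"):
    X_out = X_disc.copy()
    for j in range(X_disc.shape[1]):
        if len(np.unique(X_disc[:, j])) == 1:   # single interval: no info retained
            kb = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy=strategy)
            X_out[:, j] = kb.fit_transform(X_raw[:, [j]]).ravel()
    return X_out
```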

32 pages, 2235 KiB  
Article
Importance of Characteristic Features and Their Form for Data Exploration
by Urszula Stańczyk, Beata Zielosko and Grzegorz Baron
Entropy 2024, 26(5), 404; https://doi.org/10.3390/e26050404 - 6 May 2024
Viewed by 1802
Abstract
The nature of the input features is one of the key factors indicating what kind of tools, methods, or approaches can be used in a knowledge discovery process. Depending on the characteristics of the available attributes, some techniques could lead to unsatisfactory performance or even may not proceed at all without additional preprocessing steps. The types of variables and their domains affect performance. Any changes to their form can influence it as well, or even enable some learners. On the other hand, the relevance of features for a task constitutes another element with a noticeable impact on data exploration. The importance of attributes can be estimated through mechanisms belonging to the feature selection and reduction area, such as rankings. In the described research framework, the form of the data was conditioned on relevance through the proposed procedure of gradual discretisation controlled by a ranking of attributes. Supervised and unsupervised discretisation methods were applied to datasets from the stylometric domain for the task of binary authorship attribution. Extensive tests were performed for the selected classifiers, and they indicated many cases of enhanced prediction for partially discretised datasets.
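
A small sketch of ranking-controlled gradual discretisation, assuming an attribute ranking is already available: only the k top-ranked columns are discretised, and sweeping k produces the partially discretised variants the abstract refers to. The ranking source, bin count, and strategy are placeholders.

```python
# Discretise only the k highest-ranked columns; leave the rest numeric.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def partially_discretise(X, ranking, k, n_bins=4):
    """ranking: column indices ordered from most to least relevant."""
    X_out = X.astype(float).copy()
    top = list(ranking[:k])
    if top:
        kb = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
        X_out[:, top] = kb.fit_transform(X[:, top])
    return X_out

# Sweeping k from 0 to X.shape[1] yields fully numeric through fully
# discretised variants for classifier comparison.
```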

18 pages, 506 KiB  
Article
Morphosyntactic Annotation in Literary Stylometry
by Robert Gorman
Information 2024, 15(4), 211; https://doi.org/10.3390/info15040211 - 9 Apr 2024
Cited by 2 | Viewed by 1913
Abstract
This article investigates the stylometric usefulness of morphosyntactic annotation. Focusing on the style of literary texts, it argues that including morphosyntactic annotation in analyses of style has at least two important advantages: (1) maintaining a topic-agnostic approach and (2) providing input variables that are interpretable in traditional grammatical terms. This study demonstrates how widely available Universal Dependency parsers can generate useful morphological and syntactic data for texts in a range of languages. These data can serve as the basis for input features that are strongly informative about the style of individual novels, as indicated by accuracy in classification tests. The interpretability of such features is demonstrated by a discussion of the weakness of an “authorial” signal as opposed to the clear distinction among individual works.
(This article belongs to the Special Issue Computational Linguistics and Natural Language Processing)
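
A minimal sketch of turning UD-style annotation into interpretable style features, using spaCy for convenience; the model choice and the normalization are assumptions rather than the article's setup.

```python
# Relative frequencies of UD POS tags, dependency relations, and
# morphological features as a stylometric profile.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def morphosyntactic_profile(text: str) -> Counter:
    feats = Counter()
    for tok in nlp(text):
        feats[f"pos:{tok.pos_}"] += 1       # coarse UD POS tag
        feats[f"dep:{tok.dep_}"] += 1       # dependency relation
        for m in tok.morph:                 # e.g. "Case=Nom", "Tense=Past"
            feats[f"morph:{m}"] += 1
    total = sum(feats.values())
    return Counter({k: v / total for k, v in feats.items()})
```

Because every feature is a named grammatical category, a classifier's most discriminative inputs can be read off in traditional terms, which is the interpretability advantage the abstract emphasizes.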

16 pages, 6689 KiB  
Article
The Question of Studying Information Entropy in Poetic Texts
by Olga Kozhemyakina, Vladimir Barakhnin, Natalia Shashok and Elina Kozhemyakina
Appl. Sci. 2023, 13(20), 11247; https://doi.org/10.3390/app132011247 - 13 Oct 2023
Viewed by 2036
Abstract
One of the approaches to quantitative text analysis is to represent a given text in the form of a time series, which can then be subjected to an information entropy study for different text representations, such as “symbolic entropy”, “phonetic entropy” and “emotional entropy” of various orders. Studying authors’ styles based on such entropic characteristics of their works seems to be a promising area in the field of information analysis. In this work, entropy values of the first, second and third order were calculated for a corpus of poems by A.S. Pushkin and other poets of the Golden Age of Russian Poetry. The values of “symbolic entropy”, “phonetic entropy” and “emotional entropy”, together with their mathematical expectations and variances, were calculated for the given corpora using a software application that automatically extracts statistical information; this is potentially applicable to tasks that identify features of the author’s style. The statistical data extracted could become the basis of a stylometric classification of authors by entropy characteristics.
(This article belongs to the Special Issue Natural Language Processing: Theory, Methods and Applications)
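
The order-n entropy of a symbol stream can be estimated as the conditional entropy of the next symbol given the previous n−1 symbols; below is a minimal character-level sketch (treating the text as a character sequence, rather than the paper's symbolic/phonetic/emotional codings, is an assumption).

```python
# Order-n Shannon entropy estimate, H(X_n | X_1..X_{n-1}), from n-gram counts.
from collections import Counter
from math import log2

def order_n_entropy(text: str, n: int = 1) -> float:
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    if n == 1:
        return -sum(c / total * log2(c / total) for c in grams.values())
    ctx = Counter()                          # context counts derived from n-grams
    for g, c in grams.items():
        ctx[g[:-1]] += c
    return -sum(c / total * log2(c / ctx[g[:-1]]) for g, c in grams.items())

for n in (1, 2, 3):
    print(n, round(order_n_entropy("the cat sat on the mat", n), 3))
```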

16 pages, 660 KiB  
Article
Stylometric Fake News Detection Based on Natural Language Processing Using Named Entity Recognition: In-Domain and Cross-Domain Analysis
by Chih-Ming Tsai
Electronics 2023, 12(17), 3676; https://doi.org/10.3390/electronics12173676 - 31 Aug 2023
Cited by 16 | Viewed by 4033
Abstract
Nowadays, the dissemination of news information has become more rapid, liberal, and open to the public. People can find what they want to know more and more easily from a variety of sources, including traditional news outlets and new social media platforms. However, at a time when our lives are glutted with all kinds of news, we cannot help but doubt the veracity and legitimacy of these news sources; meanwhile, we also need to guard against the possible impact of various forms of fake news. To combat the spread of misinformation, more and more researchers have turned to natural language processing (NLP) approaches for effective fake news detection. However, in the face of increasingly serious fake news events, existing detection methods still need to be continuously improved. This study proposes a modified proof-of-concept model named NER-SA, which integrates natural language processing (NLP) and named entity recognition (NER) to conduct in-domain and cross-domain analysis of fake news detection on three existing datasets simultaneously. The named entities associated with any particular news event exist in a finite and available evidence pool. Therefore, the entities mentioned in any authentic news article must be recognized within this entity bank, whereas a piece of fake news inevitably includes only some of the entities in the entity bank. False information is deliberately fabricated with fictitious, imaginary, and even unreasonable sentences and content. As a result, there must be differences in statements, writing logic, and style between legitimate news and fake news, meaning that it is possible to successfully detect fake news. We developed a mathematical model and used the simulated annealing algorithm to find the optimal legitimate area. Comparing the detection performance of the NER-SA model with current state-of-the-art models proposed in other studies, we found that the NER-SA model indeed has superior performance in detecting fake news. For in-domain analysis, the accuracy increased by an average of 8.94% on the LIAR dataset and 19.36% on the fake or real news dataset, while the F1-score increased by an average of 24.04% on the LIAR dataset and 19.36% on the fake or real news dataset. In cross-domain analysis, the accuracy and F1-score for the NER-SA model increased by an average of 28.51% and 24.54%, respectively, across six domains in the FakeNews AMT dataset. The findings and implications of this study are further discussed with regard to their significance for improving accuracy, understanding context, and addressing adversarial attacks. The development of stylometric detection based on NLP approaches using NER techniques can improve the effectiveness and applicability of fake news detection.
(This article belongs to the Special Issue Data Push and Data Mining in the Age of Artificial Intelligence)
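
A schematic reading of the NER-SA idea: score an article by the share of its named entities found in an event's entity bank, then use simulated annealing to place the boundary of the "legitimate area". The scoring function, one-dimensional boundary, and annealing schedule below are illustrative assumptions, not the paper's mathematical model.

```python
# Schematic entity-coverage score plus a simulated-annealing threshold search.
import math
import random

def entity_coverage(article_entities, entity_bank):
    ents = set(article_entities)
    return len(ents & entity_bank) / len(ents) if ents else 0.0

def anneal_threshold(scores, labels, steps=5000, temp=1.0, cooling=0.999):
    """labels: 1 = legitimate, 0 = fake. Finds a cutoff on coverage scores."""
    acc = lambda t: sum((s >= t) == bool(y) for s, y in zip(scores, labels)) / len(labels)
    t_cur = random.random()
    for _ in range(steps):
        t_new = min(1.0, max(0.0, t_cur + random.gauss(0, 0.05)))  # local move
        delta = acc(t_new) - acc(t_cur)
        if delta >= 0 or random.random() < math.exp(delta / temp):  # accept rule
            t_cur = t_new
        temp *= cooling
    return t_cur
```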

22 pages, 5373 KiB  
Article
Secret Key Distillation with Speech Input and Deep Neural Network-Controlled Privacy Amplification
by Jelica Radomirović, Milan Milosavljević, Zoran Banjac and Miloš Jovanović
Mathematics 2023, 11(6), 1524; https://doi.org/10.3390/math11061524 - 21 Mar 2023
Cited by 5 | Viewed by 1642
Abstract
We propose a new high-speed secret key distillation system via public discussion based on the common randomness contained in the speech signal of the protocol participants. The proposed system consists of subsystems for quantization, advantage distillation, information reconciliation, an estimator for predicting conditional Rényi entropy, and universal hashing. The parameters of the system are optimized to achieve the maximum key distillation rate. By introducing a deep neural block for the prediction of conditional Rényi entropy, the lengths of the distilled secret keys are adaptively determined. The optimized system gives a key rate of over 11% with negligible information leakage to the eavesdropper, while NIST tests show the high cryptographic quality of the produced secret keys. For a sampling rate of 16 kHz and quantization of input speech signals with 16 bits per sample, the system provides secret keys at a rate of 28 kb/s. This speed opens the possibility of wider application of this technology in the field of contemporary information security.
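
The universal-hashing (privacy amplification) step lends itself to a short sketch: a random binary Toeplitz matrix compresses the reconciled bit string to an output length that, in the paper, would be set by the Rényi-entropy estimator. This is the standard Toeplitz construction, assumed here rather than taken from the article.

```python
# Toeplitz universal hash for privacy amplification (standard construction).
import numpy as np

def toeplitz_hash(bits: np.ndarray, out_len: int, seed: int = 0) -> np.ndarray:
    """bits: 0/1 vector of length n; returns out_len extracted bits."""
    n = bits.size
    rng = np.random.default_rng(seed)  # seed plays the role of the shared hash choice
    diags = rng.integers(0, 2, size=n + out_len - 1)
    # Row i of the Toeplitz matrix is diags[i : i + n] reversed.
    rows = np.lib.stride_tricks.sliding_window_view(diags, n)[:out_len, ::-1]
    return rows.dot(bits) % 2
```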

19 pages, 5568 KiB  
Article
A Scientometric Study of the Stylometric Research Field
by Panagiotis D. Michailidis
Informatics 2022, 9(3), 60; https://doi.org/10.3390/informatics9030060 - 18 Aug 2022
Cited by 8 | Viewed by 4286
Abstract
Stylometry has gained great popularity in the digital humanities and social sciences, and many works on stylometry have been reported in recent years. However, there is a research gap regarding review studies of this field from a bibliometric and evolutionary perspective. Therefore, this paper presents a bibliometric analysis of publications in the stylometric research field drawn from the Scopus database. Research articles published between 1968 and 2021 were collected and analyzed using the Bibliometrix R package for bibliometric analysis via the Biblioshiny web interface. Empirical results are presented in terms of performance analysis and science mapping analysis. From these results, it is concluded that there has been strong growth in stylometry research in recent years, that the USA, Poland, and the UK are the most productive countries, and that this productivity owes much to strong research partnerships. Based on author keywords, the research topics of most articles focus on two broad thematic categories: (1) the main tasks in stylometry and (2) methodological approaches (statistics and machine learning methods).
(This article belongs to the Special Issue Digital Humanities and Visualization)

24 pages, 5111 KiB  
Article
Post-Authorship Attribution Using Regularized Deep Neural Network
by Abiodun Modupe, Turgay Celik, Vukosi Marivate and Oludayo O. Olugbara
Appl. Sci. 2022, 12(15), 7518; https://doi.org/10.3390/app12157518 - 26 Jul 2022
Cited by 10 | Viewed by 3812
Abstract
Post-authorship attribution is a scientific process of using stylometric features to identify the genuine writer of an online text snippet such as an email, blog, forum post, or chat log. It has useful applications in manifold domains, for instance, in a verification process to proactively detect misogynistic, misandrist, xenophobic, and abusive posts on the internet or social networks. The process assumes that texts can be characterized by sequences of words that agglutinate the functional and content lexis of a writer. However, defining an appropriate characterization of text to capture the unique writing style of an author is a complex endeavor in the discipline of computational linguistics. Moreover, posts are typically short texts with obfuscating vocabularies that might impact the accuracy of authorship attribution. Such vocabularies include idioms, onomatopoeias, homophones, phonemes, synonyms, acronyms, anaphora, and polysemy. The method of the regularized deep neural network (RDNN) is introduced in this paper to circumvent the intrinsic challenges of post-authorship attribution. It is based on a convolutional neural network, a bidirectional long short-term memory encoder, and a distributed highway network. The convolutional network was used to extract lexical stylometric features that were fed into the bidirectional encoder to extract a syntactic feature-vector representation. The feature vector was then supplied as input to the distributed highway networks for regularization to minimize the network-generalization error. The regularized feature vector was ultimately passed to the bidirectional decoder to learn the writing style of an author. The feature-classification layer consists of a fully connected network and a SoftMax function to make the prediction. The RDNN method was tested against thirteen state-of-the-art methods on four benchmark experimental datasets to validate its performance. Experimental results demonstrate the effectiveness of the method, which outperforms the existing state-of-the-art methods on three datasets while producing comparable results on the fourth.
(This article belongs to the Special Issue Application of Machine Learning in Text Mining)
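
A rough PyTorch skeleton of the ingredients named above (a convolutional extractor feeding a bidirectional LSTM encoder and a softmax classifier); the highway/regularization blocks and all sizes are simplified assumptions, not the RDNN specification.

```python
# Simplified CNN -> BiLSTM -> linear classifier for authorship labels.
import torch
import torch.nn as nn

class CnnBiLstmClassifier(nn.Module):
    def __init__(self, vocab=20000, emb=100, ch=64, hid=128, n_authors=10):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, ch, kernel_size=3, padding=1)  # lexical features
        self.lstm = nn.LSTM(ch, hid, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hid, n_authors)  # softmax applied in the loss

    def forward(self, ids):                            # ids: (batch, seq_len)
        x = self.emb(ids).transpose(1, 2)              # -> (batch, emb, seq)
        x = torch.relu(self.conv(x)).transpose(1, 2)   # -> (batch, seq, ch)
        _, (h, _) = self.lstm(x)                       # h: (2, batch, hid)
        return self.out(torch.cat([h[0], h[1]], dim=-1))

logits = CnnBiLstmClassifier()(torch.randint(0, 20000, (2, 50)))
```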

18 pages, 849 KiB  
Article
Privacy Issues in Stylometric Methods
by Antonios Patergianakis and Konstantinos Limniotis
Cryptography 2022, 6(2), 17; https://doi.org/10.3390/cryptography6020017 - 7 Apr 2022
Cited by 2 | Viewed by 4673
Abstract
Stylometry is a well-known field, aiming to identify the author of a text based only on the way he or she writes. Despite its obvious advantages in several areas, such as historical research or copyright disputes, it may also raise privacy and personal data protection issues if it is used in specific contexts without the users being aware of it. It is, therefore, important to assess the potential uses of stylometric methods, as well as the implications of their use for online privacy protection. This paper aims to demonstrate, through relevant experiments, the possibility of the automated identification of a person using stylometry. The ultimate goal is to analyse the privacy and personal data protection risks stemming from the use of stylometric techniques, to evaluate the effectiveness of a specific stylometric identification system, and to examine whether proper anonymisation techniques can be applied so as to ensure that the identity of the author of a text (e.g., a user in an anonymous social network) remains hidden, even if stylometric methods are applied for possible re-identification.
(This article belongs to the Special Issue Privacy-Preserving Techniques in Cloud/Fog and Internet of Things)

27 pages, 1761 KiB  
Article
Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution
by Mihailo Škorić, Ranka Stanković, Milica Ikonić Nešić, Joanna Byszuk and Maciej Eder
Mathematics 2022, 10(5), 838; https://doi.org/10.3390/math10050838 - 7 Mar 2022
Cited by 9 | Viewed by 5583
Abstract
This paper explores the effectiveness of parallel stylometric document embeddings in solving the authorship attribution task by testing a novel approach on literary texts in 7 different languages, totaling 7051 unique 10,000-token chunks from 700 PoS- and lemma-annotated documents. We used these documents to produce four document embedding models with the Stylo R package (word-based, lemma-based, PoS-trigrams-based, and PoS-mask-based) and one document embedding model using mBERT for each of the seven languages. We created further derivations of these embeddings in the form of the average, product, minimum, maximum, and l2 norm of these document embedding matrices and tested them both including and excluding the mBERT-based document embeddings for each language. Finally, we trained several perceptrons on portions of the dataset in order to procure adequate weights for a weighted combination approach. We tested standalone (two baselines) and composite embeddings for classification accuracy, precision, recall, weighted-average and macro-averaged F1-score, and compared them with one another. For each language, most of our composition methods outperform the baselines (with a couple of methods outperforming all baselines for all languages), with or without mBERT inputs, which are found to have no significant positive impact on the results of our methods.
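
The composition step is easy to make concrete: given per-method document-embedding matrices of identical shape, the derived combinations named in the abstract reduce to element-wise reductions, as in this NumPy sketch (matrix contents and the weighted perceptron step are omitted as placeholders).

```python
# Element-wise compositions of parallel document-embedding matrices.
import numpy as np

def compose(embeddings: list[np.ndarray]) -> dict[str, np.ndarray]:
    """embeddings: k matrices, each (n_docs, dim), one per base model."""
    stack = np.stack(embeddings)                 # -> (k, n_docs, dim)
    return {
        "average": stack.mean(axis=0),
        "product": stack.prod(axis=0),
        "minimum": stack.min(axis=0),
        "maximum": stack.max(axis=0),
        "l2norm":  np.sqrt((stack ** 2).sum(axis=0)),
    }
```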
