Search Results (13)

Search Parameters:
Keywords = stylometric features

25 pages, 1887 KB  
Article
Does All or Nothing Always Work Best? In Search of Advantageous Representation of Attributes
by Urszula Stańczyk and Grzegorz Baron
Appl. Sci. 2026, 16(6), 2679; https://doi.org/10.3390/app16062679 - 11 Mar 2026
Abstract
Discretisation is a processing step often included in preliminary data preparation. Typically, when the input features have continuous domains and discrete forms are needed, all of them are translated into a categorical type at the same time, before data mining takes place. However, proceeding this way is not always the most advantageous for performance. The paper presents results from research in which the discretisation transformations were carried out sequentially, forward over the variables, with the selection based both on their values and on the importance of the attributes estimated by the constructed rankings. The experiments were executed on datasets from the area of stylometric analysis of texts, an application domain focused on recognising authorship based on individual characteristics of writing style. For the selected data mining techniques, performance was studied in the context of the transformed features. The observed trends indicate that, along with an enhanced understanding of the nature of the data, partial discretisation of feature sets can bring higher accuracy than transformation of the entire input domain, showing the merits of the described research methodology.
(This article belongs to the Section Computing and Artificial Intelligence)
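The ranking-driven partial discretisation described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the equal-width binning scheme, the function name, and the idea of passing a precomputed attribute ranking are all assumptions.

```python
import numpy as np

def discretise_top_ranked(X, ranking, k, n_bins=3):
    """Discretise only the k top-ranked feature columns of X.

    X       -- 2D array, samples x features, continuous values
    ranking -- feature indices ordered from most to least important
    k       -- how many top-ranked features to discretise
    Columns selected by the ranking are replaced with equal-width bin
    indices (0 .. n_bins-1); all other columns keep their continuous
    values, yielding a partially discretised dataset.
    """
    Xd = X.astype(float).copy()
    for col in ranking[:k]:
        lo, hi = Xd[:, col].min(), Xd[:, col].max()
        # interior edges of n_bins equal-width intervals over [lo, hi]
        edges = np.linspace(lo, hi, n_bins + 1)[1:-1]
        Xd[:, col] = np.digitize(Xd[:, col], edges)
    return Xd
```

A classifier could then be retrained after each increment of k to locate the degree of partial discretisation that performs best.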

24 pages, 1783 KB  
Article
A Hybrid Human-Centric Framework for Discriminating Engine-like from Human-like Chess Play: A Proof-of-Concept Study
by Zura Kevanishvili and Maksim Iavich
Appl. Syst. Innov. 2026, 9(1), 11; https://doi.org/10.3390/asi9010011 - 26 Dec 2025
Abstract
The rapid growth of online chess has intensified the challenge of distinguishing engine-assisted from authentic human play, exposing the limitations of existing approaches that rely solely on deterministic evaluation metrics. This study introduces a proof-of-concept hybrid framework for discriminating between engine-like and human-like chess play patterns, integrating Stockfish’s deterministic evaluations with stylometric behavioral features derived from the Maia engine. Key metrics include Centipawn Loss (CPL), Mismatch Move Match Probability (MMMP), and a novel Curvature-Based Stability (ΔS) indicator. These features were incorporated into a convolutional neural network (CNN) classifier and evaluated on a controlled benchmark dataset of 1000 games, where ‘suspicious’ gameplay was algorithmically generated to simulate engine-optimal patterns, while ‘clean’ play was modeled using Maia’s human-like predictions. Results demonstrate the framework’s ability to discriminate between these behavioral archetypes, with the hybrid model achieving a macro F1-score of 0.93, significantly outperforming the Stockfish-only baseline (F1 = 0.87), as validated by McNemar’s test (p = 0.0153). Feature ablation confirmed that Maia-derived features reduced false negatives and improved recall, while ΔS enhanced robustness. This work establishes a methodological foundation for behavioral pattern discrimination in chess, demonstrating the value of combining deterministic and human-centric modeling. Beyond chess, the approach offers a template for behavioral anomaly analysis in cybersecurity, education, and other decision-based domains, with real-world validation on adjudicated misconduct cases identified as the essential next step.
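Of the metrics named in the abstract above, Centipawn Loss is simple enough to sketch directly. This follows the common CPL convention (evaluation drop per move, floored at zero), not the paper's exact Stockfish pipeline; the function name and the idea of supplying two precomputed evaluation lists are assumptions.

```python
def average_centipawn_loss(best_evals, played_evals):
    """Average Centipawn Loss (CPL) over a game.

    best_evals   -- engine evaluation (centipawns, from the side to
                    move) of the engine's best move in each position
    played_evals -- evaluation of the move actually played
    Each move's loss is the evaluation drop, floored at zero so that a
    move the engine rates above its own first line never counts as a
    negative loss.
    """
    losses = [max(best - played, 0)
              for best, played in zip(best_evals, played_evals)]
    return sum(losses) / len(losses)
```

Consistently near-zero CPL over many moves is the kind of engine-like signal the deterministic baseline in the paper relies on.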

15 pages, 1374 KB  
Article
Stylometric Analysis of Sustainable Central Bank Communications: Revealing Authorial Signatures in Monetary Policy Statements
by Hakan Emekci and İbrahim Özkan
Sustainability 2025, 17(20), 8979; https://doi.org/10.3390/su17208979 - 10 Oct 2025
Abstract
Sustainable economic development requires transparent and consistent institutional communication from monetary authorities to maintain long-term financial stability and public trust. This study investigates the latent authorial structure and stylistic heterogeneity of central bank communications by applying stylometric analysis and unsupervised machine learning to official announcements of the Central Bank of the Republic of Turkey (CBRT). Using a dataset of 557 press releases from 2006 to 2017, we extract a range of linguistic features at both sentence and document levels—including sentence length, punctuation density, word length, and type–token ratios. These features are reduced using Principal Component Analysis (PCA) and clustered via Hierarchical Clustering on Principal Components (HCPC), revealing three distinct authorial groups within the CBRT’s communications. The robustness of these clusters is validated using multidimensional scaling (MDS) on character-level and word-level n-gram distances. The analysis finds consistent stylistic differences between clusters, with implications for authorship attribution, tone variation, and communication strategy. Notably, sentiment analysis indicates that one authorial cluster tends to exhibit more negative tonal features, suggesting potential bias or divergence in internal communication style. These findings challenge the conventional assumption of institutional homogeneity and highlight the presence of distinct communicative voices within the central bank. Furthermore, the results suggest that stylistic variation—though often subtle—may convey unintended policy signals to markets, especially in contexts where linguistic shifts are closely scrutinized. This research contributes to the emerging intersection of natural language processing, monetary economics, and institutional transparency. It demonstrates the efficacy of stylometric techniques in revealing the hidden structure of policy discourse and suggests that linguistic analytics can offer valuable insights into the internal dynamics, credibility, and effectiveness of monetary authorities. These findings contribute to sustainable financial governance by demonstrating how AI-driven analysis can enhance institutional transparency, promote consistent policy communication, and support long-term economic stability—key pillars of sustainable development.
(This article belongs to the Special Issue Public Policy and Economic Analysis in Sustainability Transitions)
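The document-level features listed in the abstract above (sentence length, punctuation density, type–token ratio) can be sketched in a few lines. This is a deliberately simplified illustration: sentence splitting on ./!/?, the word regex, and the function name are assumptions, not the authors' preprocessing.

```python
import re

def stylometric_features(text):
    """Document-level stylometric features of the kind the abstract
    lists: mean sentence length (in words), punctuation density
    (punctuation marks per character), and type-token ratio."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    punct = re.findall(r"[.,;:!?]", text)
    return {
        "mean_sentence_len": len(words) / max(len(sentences), 1),
        "punct_density": len(punct) / max(len(text), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }
```

Vectors of such features, computed per press release, are the sort of input one would then reduce with PCA and cluster with HCPC as the study describes.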

16 pages, 1051 KB  
Article
Kafka’s Literary Style: A Mixed-Method Approach
by Carsten Strathausen, Wenyi Shang and Andrei Kazakov
Humanities 2025, 14(3), 61; https://doi.org/10.3390/h14030061 - 12 Mar 2025
Abstract
In this essay, we examine how the polyvalence of meaning in Kafka’s texts is engineered both semantically (on the narrative level) and syntactically (on the linguistic level), and we ask whether a computational approach can shed new light on the long-standing debate about the major characteristics of Kafka’s literary style. A mixed-method approach means that we seek out points of connection that interlink traditional humanist (i.e., interpretative) and computational (i.e., quantitative) methods of investigation. Following the introduction, the second section of our article provides a critical overview of the existing scholarship from both a humanist and a computational perspective. We argue that the main methodological difference between traditional humanist and AI-enhanced computational studies of Kafka’s literary style lies not in the use of statistics but in the new interpretative possibilities enabled by AI methods to explore stylistic features beyond the scope of human comprehension. In the third and fourth sections of our article, we will introduce our own stylometric approach to Kafka, detail our methods, and interpret our findings. Rather than focusing on training an AI model capable of accurately attributing authorship to Kafka, we examine whether AI could help us detect significant stylistic differences between the writing Kafka himself published during his lifetime (Kafka Core) and his posthumous writings edited and published by Max Brod.
(This article belongs to the Special Issue Franz Kafka in the Age of Artificial Intelligence)

37 pages, 2517 KB  
Article
Multitask Learning for Authenticity and Authorship Detection
by Gurunameh Singh Chhatwal and Jiashu Zhao
Electronics 2025, 14(6), 1113; https://doi.org/10.3390/electronics14061113 - 12 Mar 2025
Abstract
Traditionally, detecting misinformation (real vs. fake) and authorship (human vs. AI) have been addressed as separate classification tasks, leaving a critical gap in real-world scenarios where these challenges increasingly overlap. Motivated by this need, we introduce a unified framework—the Shared–Private Synergy Model (SPSM)—that tackles both authenticity and authorship classification under one umbrella. Our approach is tested on a novel multi-label dataset and evaluated through an exhaustive suite of methods, including traditional machine learning, stylometric feature analysis, and pretrained large language model-based classifiers. Notably, the proposed SPSM architecture incorporates multitask learning, shared–private layers, and hierarchical dependencies, achieving state-of-the-art results with over 96% accuracy for authenticity (real vs. fake) and 98% for authorship (human vs. AI). Beyond its superior performance, our approach is interpretable: stylometric analyses reveal how factors like sentence complexity and entity usage can differentiate between fake news and AI-generated text. Meanwhile, LLM-based classifiers show moderate success. Comprehensive ablation studies further highlight the impact of task-specific architectural enhancements such as shared layers and balanced task losses on boosting classification performance. Our findings underscore the effectiveness of synergistic PLM architectures for tackling complex classification tasks while offering insights into linguistic and structural markers of authenticity and attribution. This study provides a strong foundation for future research, including multimodal detection, cross-lingual expansion, and the development of lightweight, deployable models to combat misinformation in the evolving digital landscape and smart society.

30 pages, 1001 KB  
Article
Genre Classification of Books in Russian with Stylometric Features: A Case Study
by Natalia Vanetik, Margarita Tiamanova, Genady Kogan and Marina Litvak
Information 2024, 15(6), 340; https://doi.org/10.3390/info15060340 - 7 Jun 2024
Abstract
Within the literary domain, genres function as fundamental organizing concepts that provide readers, publishers, and academics with a unified framework. Genres are discrete categories that are distinguished by common stylistic, thematic, and structural components. They facilitate the categorization process and improve our understanding of a wide range of literary expressions. In this paper, we introduce a new dataset for genre classification of Russian books, covering 11 literary genres. We also perform dataset evaluation for the tasks of binary and multi-class genre identification. Through extensive experimentation and analysis, we explore the effectiveness of different text representations, including stylometric features, in genre classification. Our findings clarify the challenges present in classifying Russian literature by genre, revealing insights into the performance of different models across various genres. Furthermore, we address several research questions regarding the difficulty of multi-class classification compared to binary classification, and the impact of stylometric features on classification accuracy.
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

32 pages, 2235 KB  
Article
Importance of Characteristic Features and Their Form for Data Exploration
by Urszula Stańczyk, Beata Zielosko and Grzegorz Baron
Entropy 2024, 26(5), 404; https://doi.org/10.3390/e26050404 - 6 May 2024
Abstract
The nature of the input features is one of the key factors indicating what kind of tools, methods, or approaches can be used in a knowledge discovery process. Depending on the characteristics of the available attributes, some techniques could lead to unsatisfactory performance or even may not proceed at all without additional preprocessing steps. The types of variables and their domains affect performance. Any changes to their form can influence it as well, or even enable some learners. On the other hand, the relevance of features for a task constitutes another element with a noticeable impact on data exploration. The importance of attributes can be estimated through the application of mechanisms belonging to the feature selection and reduction area, such as rankings. In the described research framework, the data form was conditioned on relevance by the proposed procedure of gradual discretisation controlled by a ranking of attributes. Supervised and unsupervised discretisation methods were applied to datasets from the stylometric domain and the task of binary authorship attribution. For the selected classifiers, extensive tests were performed, and they indicated many cases of enhanced prediction for partially discretised datasets.

18 pages, 506 KB  
Article
Morphosyntactic Annotation in Literary Stylometry
by Robert Gorman
Information 2024, 15(4), 211; https://doi.org/10.3390/info15040211 - 9 Apr 2024
Abstract
This article investigates the stylometric usefulness of morphosyntactic annotation. Focusing on the style of literary texts, it argues that including morphosyntactic annotation in analyses of style has at least two important advantages: (1) maintaining a topic-agnostic approach and (2) providing input variables that are interpretable in traditional grammatical terms. This study demonstrates how widely available Universal Dependency parsers can generate useful morphological and syntactic data for texts in a range of languages. These data can serve as the basis for input features that are strongly informative about the style of individual novels, as indicated by accuracy in classification tests. The interpretability of such features is demonstrated by a discussion of the weakness of an “authorial” signal as opposed to the clear distinction among individual works.
(This article belongs to the Special Issue Computational Linguistics and Natural Language Processing)

16 pages, 6689 KB  
Article
The Question of Studying Information Entropy in Poetic Texts
by Olga Kozhemyakina, Vladimir Barakhnin, Natalia Shashok and Elina Kozhemyakina
Appl. Sci. 2023, 13(20), 11247; https://doi.org/10.3390/app132011247 - 13 Oct 2023
Abstract
One of the approaches to quantitative text analysis is to represent a given text in the form of a time series, which can be followed by an information entropy study for different text representations, such as “symbolic entropy”, “phonetic entropy” and “emotional entropy” of various orders. Studying authors’ styles based on such entropic characteristics of their works seems to be a promising area in the field of information analysis. In this work, the calculations of entropy values of the first, second and third order for the corpus of poems by A.S. Pushkin and other poets from the Golden Age of Russian Poetry were carried out. The values of “symbolic entropy”, “phonetic entropy” and “emotional entropy” and their mathematical expectations and variances were calculated for given corpora using the software application that automatically extracts statistical information, which is potentially applicable to tasks that identify features of the author’s style. The statistical data extracted could become the basis of the stylometric classification of authors by entropy characteristics.
(This article belongs to the Special Issue Natural Language Processing: Theory, Methods and Applications)
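First- and higher-order "symbolic entropy" of the kind the abstract above describes can be sketched as the Shannon entropy of the character n-gram distribution of a text. This mirrors the general idea only; the authors' alphabet, normalisation, and exact order conventions are not specified here, and the function name is an assumption.

```python
import math
from collections import Counter

def symbolic_entropy(text, order=1):
    """Shannon entropy (in bits) of the distribution of character
    n-grams of length `order` in the text. order=1 gives first-order
    symbolic entropy over single characters, order=2 over bigrams,
    and so on."""
    grams = [text[i:i + order] for i in range(len(text) - order + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```

Means and variances of such values over an author's poems are the kind of statistics the study proposes as a basis for stylometric classification.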

22 pages, 5373 KB  
Article
Secret Key Distillation with Speech Input and Deep Neural Network-Controlled Privacy Amplification
by Jelica Radomirović, Milan Milosavljević, Zoran Banjac and Miloš Jovanović
Mathematics 2023, 11(6), 1524; https://doi.org/10.3390/math11061524 - 21 Mar 2023
Abstract
We propose a new high-speed secret key distillation system via public discussion based on the common randomness contained in the speech signal of the protocol participants. The proposed system consists of subsystems for quantization, advantage distillation, information reconciliation, an estimator for predicting conditional Rényi entropy, and universal hashing. The parameters of the system are optimized in order to achieve the maximum key distillation rate. By introducing a deep neural block for the prediction of conditional Rényi entropy, the lengths of the distilled secret keys are adaptively determined. The optimized system gives a key rate of over 11% and negligible information leakage to the eavesdropper, while NIST tests show the high cryptographic quality of produced secret keys. For a sampling rate of 16 kHz and quantization of input speech signals with 16 bits per sample, the system provides secret keys at a rate of 28 kb/s. This speed opens the possibility of wider application of this technology in the field of contemporary information security.

24 pages, 5111 KB  
Article
Post-Authorship Attribution Using Regularized Deep Neural Network
by Abiodun Modupe, Turgay Celik, Vukosi Marivate and Oludayo O. Olugbara
Appl. Sci. 2022, 12(15), 7518; https://doi.org/10.3390/app12157518 - 26 Jul 2022
Abstract
Post-authorship attribution is a scientific process of using stylometric features to identify the genuine writer of an online text snippet such as an email, blog, forum post, or chat log. It has useful applications in manifold domains, for instance, in a verification process to proactively detect misogynistic, misandrist, xenophobic, and abusive posts on the internet or social networks. The process assumes that texts can be characterized by sequences of words that agglutinate the functional and content lyrics of a writer. However, defining an appropriate characterization of text to capture the unique writing style of an author is a complex endeavor in the discipline of computational linguistics. Moreover, posts are typically short texts with obfuscating vocabularies that might impact the accuracy of authorship attribution. The vocabularies include idioms, onomatopoeias, homophones, phonemes, synonyms, acronyms, anaphora, and polysemy. The method of the regularized deep neural network (RDNN) is introduced in this paper to circumvent the intrinsic challenges of post-authorship attribution. It is based on a convolutional neural network, bidirectional long short-term memory encoder, and distributed highway network. The neural network was used to extract lexical stylometric features that are fed into the bidirectional encoder to extract a syntactic feature-vector representation. The feature vector was then supplied as input to the distributed highway network for regularization to minimize the network-generalization error. The regularized feature vector was ultimately passed to the bidirectional decoder to learn the writing style of an author. The feature-classification layer consists of a fully connected network and a SoftMax function to make the prediction. The RDNN method was tested against thirteen state-of-the-art methods using four benchmark experimental datasets to validate its performance. Experimental results have demonstrated the effectiveness of the method when compared to the existing state-of-the-art methods on three datasets while producing comparable results on one dataset.
(This article belongs to the Special Issue Application of Machine Learning in Text Mining)

18 pages, 8411 KB  
Review
Stylometry and Numerals Usage: Benford’s Law and Beyond
by Andrei V. Zenkov
Stats 2021, 4(4), 1051-1068; https://doi.org/10.3390/stats4040060 - 14 Dec 2021
Abstract
We suggest two approaches to the statistical analysis of texts, both based on the study of numerals occurrence in literary texts. The first approach is related to Benford’s Law and the analysis of the frequency distribution of various leading digits of numerals contained in the text. In coherent literary texts, the share of the leading digit 1 is even larger than prescribed by Benford’s Law and can reach 50 percent. The frequencies of occurrence of the digit 1, as well as, to a lesser extent, the digits 2 and 3, are usually a characteristic feature of the author’s style, manifested in all (sufficiently long) literary texts of any author. This approach is convenient for testing whether a group of texts has common authorship: the latter is dubious if the frequency distributions are sufficiently different. The second approach is the extension of the first one and requires the study of the frequency distribution of numerals themselves (not their leading digits). The approach yields non-trivial information about the author, stylistic and genre peculiarities of the texts and is suited for advanced stylometric analysis. The proposed approaches are illustrated by examples of computer analysis of the literary texts in English and Russian.
(This article belongs to the Special Issue Benford's Law(s) and Applications)
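The first approach in the abstract above, comparing leading-digit frequencies of a text's numerals against the Benford expectation log10(1 + 1/d), can be sketched as follows. The numeral-extraction regex and the function name are illustrative assumptions; the authors' corpus preprocessing is not reproduced.

```python
import math
import re
from collections import Counter

def leading_digit_shares(text):
    """Observed share of each leading digit 1-9 among the numerals in
    a text, alongside the Benford expectation log10(1 + 1/d). The
    abstract notes that in literary texts the share of digit 1
    typically exceeds the Benford value."""
    digits = [m.group()[0] for m in re.finditer(r"[1-9]\d*", text)]
    counts = Counter(digits)
    total = sum(counts.values()) or 1
    observed = {str(d): counts.get(str(d), 0) / total for d in range(1, 10)}
    benford = {str(d): math.log10(1 + 1 / d) for d in range(1, 10)}
    return observed, benford
```

Large divergence between the observed distributions of two text groups is the signal the authors use to cast doubt on common authorship.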

18 pages, 847 KB  
Article
Language-Independent Fake News Detection: English, Portuguese, and Spanish Mutual Features
by Hugo Queiroz Abonizio, Janaina Ignacio de Morais, Gabriel Marques Tavares and Sylvio Barbon Junior
Future Internet 2020, 12(5), 87; https://doi.org/10.3390/fi12050087 - 11 May 2020
Abstract
Online Social Media (OSM) have been substantially transforming the process of spreading news, improving its speed, and reducing barriers toward reaching out to a broad audience. However, OSM are very limited in providing mechanisms to check the credibility of news propagated through their structure. The majority of studies on automatic fake news detection are restricted to English documents, with few works evaluating other languages, and none comparing language-independent characteristics. Moreover, the spreading of deceptive news tends to be a worldwide problem; therefore, this work evaluates textual features that are not tied to a specific language when describing textual data for detecting news. Corpora of news written in American English, Brazilian Portuguese, and Spanish were explored to study complexity, stylometric, and psychological text features. The extracted features support the detection of fake, legitimate, and satirical news. We compared four machine learning algorithms (k-Nearest Neighbors (k-NN), Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGB)) to induce the detection model. Results show our proposed language-independent features are successful in describing fake, satirical, and legitimate news across three different languages, with an average detection accuracy of 85.3% with RF.
(This article belongs to the Special Issue Social Web, New Media, Algorithms and Power)
