Search Results (4)

Search Parameters:
Keywords = authorship marker

31 pages, 5323 KiB  
Article
Learning the Style via Mixed SN-Grams: An Evaluation in Authorship Attribution
by Juan Pablo Francisco Posadas-Durán, Germán Ríos-Toledo, Erick Velázquez-Lozada, J. A. de Jesús Osuna-Coutiño, Madaín Pérez-Patricio and Fernando Pech May
AI 2025, 6(5), 104; https://doi.org/10.3390/ai6050104 - 20 May 2025
Viewed by 1003
Abstract
This study addresses the problem of authorship attribution with a novel method that models writing style using subtrees of dependency parse trees. The method exploits the syntactic information of sentences through mixed syntactic n-grams (mixed sn-grams) and comprises an algorithm that generates mixed sn-grams by integrating words, POS tags, and dependency relation tags. The mixed sn-grams serve as style markers that feed machine learning methods such as an SVM. A comparative analysis evaluated the proposed mixed sn-grams against homogeneous sn-grams on the PAN-CLEF 2012 and CCAT50 datasets. Experiments with PAN-CLEF 2012 showed the potential of mixed sn-grams to model writing style by outperforming homogeneous sn-grams. Experiments with CCAT50 likewise showed that training with mixed sn-grams improves accuracy over homogeneous sn-grams, with the POS-Word category yielding the best result. These results suggest that mixed sn-grams constitute effective stylistic markers for building a reliable writing style model that machine learning algorithms can learn.
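As a rough illustration of the idea (a minimal sketch, not the authors' implementation), the snippet below extracts mixed sn-grams from a toy, pre-parsed sentence by walking head-to-dependent paths of the dependency tree and realizing each position as either the word, its POS tag, or its dependency relation. Counts of such sn-grams could then be vectorized and passed to an SVM, e.g. scikit-learn's LinearSVC, matching the classifier named in the abstract.

```python
# Minimal sketch of mixed syntactic n-grams (illustrative, not the paper's code):
# n-grams taken along head-to-dependent paths of a dependency tree, where each
# position may be realized as the word, its POS tag, or its dependency relation.
from itertools import product

# Toy dependency parse: (word, pos, deprel, head_index); head -1 marks the root.
SENT = [
    ("She",    "PRON", "nsubj", 1),
    ("writes", "VERB", "root", -1),
    ("short",  "ADJ",  "amod",  3),
    ("essays", "NOUN", "obj",   1),
]

def paths(sent, n):
    """Yield all head-to-dependent paths of length n in the dependency tree."""
    children = {i: [] for i in range(len(sent))}
    for i, (_, _, _, head) in enumerate(sent):
        if head >= 0:
            children[head].append(i)
    def walk(path):
        if len(path) == n:
            yield tuple(path)
            return
        for child in children[path[-1]]:
            yield from walk(path + [child])
    for i in range(len(sent)):
        yield from walk([i])

def mixed_sngrams(sent, n, mix=("word", "pos")):
    """Realize each path under every assignment of the chosen element types."""
    field = {"word": 0, "pos": 1, "deprel": 2}
    for path in paths(sent, n):
        for combo in product(mix, repeat=n):
            yield tuple(sent[i][field[t]] for i, t in zip(path, combo))

# e.g. ('writes', 'PRON'), ('VERB', 'She'), ... for the head->dependent edges
print(sorted(set(mixed_sngrams(SENT, 2, mix=("word", "pos")))))
```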

37 pages, 2517 KiB  
Article
Multitask Learning for Authenticity and Authorship Detection
by Gurunameh Singh Chhatwal and Jiashu Zhao
Electronics 2025, 14(6), 1113; https://doi.org/10.3390/electronics14061113 - 12 Mar 2025
Cited by 1 | Viewed by 1105
Abstract
Traditionally, detecting misinformation (real vs. fake) and authorship (human vs. AI) have been addressed as separate classification tasks, leaving a critical gap in real-world scenarios where these challenges increasingly overlap. Motivated by this need, we introduce a unified framework—the Shared–Private Synergy Model (SPSM)—that tackles both authenticity and authorship classification under one umbrella. Our approach is tested on a novel multi-label dataset and evaluated through an exhaustive suite of methods, including traditional machine learning, stylometric feature analysis, and pretrained large language model-based classifiers. Notably, the proposed SPSM architecture incorporates multitask learning, shared–private layers, and hierarchical dependencies, achieving state-of-the-art results with over 96% accuracy for authenticity (real vs. fake) and 98% for authorship (human vs. AI). Beyond its superior performance, our approach is interpretable: stylometric analyses reveal how factors like sentence complexity and entity usage can differentiate between fake news and AI-generated text. Meanwhile, LLM-based classifiers show moderate success. Comprehensive ablation studies further highlight the impact of task-specific architectural enhancements such as shared layers and balanced task losses on boosting classification performance. Our findings underscore the effectiveness of synergistic PLM architectures for tackling complex classification tasks while offering insights into linguistic and structural markers of authenticity and attribution. This study provides a strong foundation for future research, including multimodal detection, cross-lingual expansion, and the development of lightweight, deployable models to combat misinformation in the evolving digital landscape and smart society.
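The shared–private pattern can be sketched in a few lines of PyTorch (an illustration of the general idea under assumed input shapes, not the paper's SPSM code): a shared encoder feeds both tasks, each task keeps a private encoder, and the two cross-entropy losses are combined with balancing weights.

```python
# Minimal shared-private multitask sketch (illustrative; shapes are assumptions).
import torch
import torch.nn as nn

class SharedPrivateModel(nn.Module):
    def __init__(self, in_dim=768, hidden=256):
        super().__init__()
        # One encoder shared across tasks, plus one private encoder per task.
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.private_auth = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.private_real = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_auth = nn.Linear(hidden * 2, 2)  # human vs. AI
        self.head_real = nn.Linear(hidden * 2, 2)  # real vs. fake

    def forward(self, x):
        s = self.shared(x)
        logits_auth = self.head_auth(torch.cat([s, self.private_auth(x)], -1))
        logits_real = self.head_real(torch.cat([s, self.private_real(x)], -1))
        return logits_auth, logits_real

model = SharedPrivateModel()
x = torch.randn(8, 768)                 # e.g. pooled PLM embeddings (assumed)
y_auth = torch.randint(0, 2, (8,))
y_real = torch.randint(0, 2, (8,))
la, lr = model(x)
loss = 0.5 * nn.functional.cross_entropy(la, y_auth) \
     + 0.5 * nn.functional.cross_entropy(lr, y_real)  # balanced task losses
loss.backward()
```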

28 pages, 1581 KiB  
Article
Authorship Attribution in Less-Resourced Languages: A Hybrid Transformer Approach for Romanian
by Melania Nitu and Mihai Dascalu
Appl. Sci. 2024, 14(7), 2700; https://doi.org/10.3390/app14072700 - 23 Mar 2024
Cited by 1 | Viewed by 2431
Abstract
Authorship attribution for less-resourced languages like Romanian, characterized by the scarcity of large annotated datasets and the limited number of available NLP tools, poses unique challenges. This study focuses on a hybrid Transformer that combines handcrafted linguistic features, ranging from surface indices like word frequencies to syntax, semantics, and discourse markers, with contextualized embeddings from a Romanian BERT encoder. The methodology involves extracting contextualized representations from a pre-trained Romanian BERT model and concatenating them with linguistic features, selected by their Kruskal–Wallis mean rank, to create a hybrid input vector for a classification layer. We compare this approach with a baseline ensemble of seven machine learning classifiers for authorship attribution employing majority soft voting. We conduct studies on both long texts (full texts) and short texts (paragraphs), with 19 authors and a subset of 10. Our hybrid Transformer outperforms existing methods, achieving an F1 score of 0.87 on the full dataset of the 19-author set (an 11% improvement) and an F1 score of 0.95 on the 10-author subset (a 10% increase over previous research). We conduct a linguistic analysis leveraging textual complexity indices and employ McNemar and Cochran’s Q statistical tests to evaluate the performance evolution across the three best models, while highlighting patterns in misclassifications. Our research contributes to diversifying methodologies for effective authorship attribution in resource-constrained linguistic environments. Furthermore, we publicly release the full dataset and the codebase associated with this study to encourage further exploration and development in this field.
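The hybrid input construction can be sketched as follows (synthetic data and assumed shapes for illustration, not the authors' released code): handcrafted features are filtered with a Kruskal–Wallis test across author classes and concatenated with pooled BERT embeddings before classification.

```python
# Minimal sketch of the hybrid feature vector (illustrative; data is synthetic).
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
n_docs, n_feats, emb_dim, n_authors = 60, 20, 768, 3
X_ling = rng.normal(size=(n_docs, n_feats))   # handcrafted linguistic indices
X_bert = rng.normal(size=(n_docs, emb_dim))   # e.g. pooled BERT output (assumed)
y = rng.integers(0, n_authors, size=n_docs)   # author labels

# Keep only features whose distribution differs across authors (p < 0.05).
keep = [j for j in range(n_feats)
        if kruskal(*(X_ling[y == a, j] for a in range(n_authors))).pvalue < 0.05]

# Hybrid input vector for the classification layer: embeddings + kept features.
X_hybrid = np.concatenate([X_bert, X_ling[:, keep]], axis=1)
print(X_hybrid.shape)
```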

18 pages, 886 KiB  
Article
Retraction Notices: Who Authored Them?
by Shaoxiong (Brian) Xu and Guangwei Hu
Publications 2018, 6(1), 2; https://doi.org/10.3390/publications6010002 - 3 Jan 2018
Cited by 17 | Viewed by 11061
Abstract
Unlike other academic publications whose authorship is eagerly claimed, the provenance of retraction notices (RNs) is often obscured, presumably because the retraction of published research is associated with undesirable behavior and thus carries negative consequences for the individuals involved. This ambiguity of authorship, however, has serious ethical ramifications and creates methodological problems for research on RNs that requires clear authorship attribution. This article reports a study conducted to identify RN textual features that can be used to disambiguate obscured authorship, to ascertain the extent of authorship evasion in RNs from two disciplinary clusters, and to determine whether the disciplines varied in the distributions of different types of RN authorship. Drawing on a corpus of 370 RNs archived in the Web of Science for the hard discipline of Cell Biology and the soft disciplines of Business, Finance, and Management, the study identified 25 types of textual markers that can be used to disambiguate authorship. It revealed that only 25.68% of the RNs could be unambiguously attributed to authors of the retracted articles, alone or jointly, and that authorship could not be determined for 28.92% of the RNs. Furthermore, the study found marked disciplinary differences across the categories of RN authorship. These results point to the need for more explicit editorial requirements on RN authorship and their strict enforcement.
(This article belongs to the Special Issue Scientific Ethics)
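A marker-based disambiguation step might look like the sketch below; note that the regex patterns here are invented for illustration, whereas the paper derives its 25 marker types empirically from the corpus.

```python
# Hypothetical sketch of marker-based RN authorship disambiguation.
# The patterns below are illustrative inventions, not the paper's markers.
import re

MARKERS = [
    (r"\bwe(, the authors,)? (wish to )?retract\b",              "authors"),
    (r"\bthe authors? (has|have) requested\b",                   "authors"),
    (r"\bthe editor(s)?(-in-chief)? (is|are|has|have) retract",  "editor"),
    (r"\bthe publisher (is retracting|retracts)\b",              "publisher"),
]

def classify_rn(text):
    """Attribute an RN to whichever parties its textual markers point at."""
    hits = {label for pattern, label in MARKERS
            if re.search(pattern, text, re.IGNORECASE)}
    if not hits:
        return "undetermined"
    return "joint" if len(hits) > 1 else hits.pop()

print(classify_rn("We, the authors, wish to retract this article."))  # authors
```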
