1. Introduction
An authorship attribution system predicts, on the basis of writing style, the most likely author of a text among a set of candidate authors [1]. Authorship can be identified from the texts people write, since each person uses the elements of language in a characteristic order and frequency; these sequences and frequencies define an author's writing patterns. Recently, this same principle has made it possible to identify text generated by Large Language Models [2,3]. Writing style patterns are also used in tasks such as plagiarism detection [4,5], author profiling [6,7], author identification [8,9], and neurological disease detection [10,11].
In computational linguistics, writing style is characterized by the frequency of use of linguistic elements called style markers. Writing style analysis reveals recurring patterns in how texts are written. Knowing these patterns allows an author to adjust vocabulary and sentence structure for more effective communication, helps identify underlying problems in writing, and makes it possible to define a unique and distinctive writing style. Style markers fall into four main categories: character-based, word-based, POS tag-based, and n-gram-based.
Character-based style markers include punctuation marks, uppercase characters, lowercase characters, alphabetic characters, numeric characters, and typos. These markers are easy to collect and are present in most languages. However, many spelling conventions are shared across languages and writers (for example, a period followed by a capital letter), and some of these markers occur too infrequently to characterize an author's writing style on their own. To address these problems, previous work combines them with additional markers during inference [6,12,13,14].
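As a rough illustration of how such markers can be collected (a minimal sketch, not the exact feature set used in the cited works), character-based frequencies reduce to simple counts over the raw text:

```python
import string

def character_markers(text: str) -> dict:
    """Relative frequencies of simple character-based style markers."""
    n = max(len(text), 1)  # guard against empty input
    return {
        "punctuation": sum(c in string.punctuation for c in text) / n,
        "uppercase":   sum(c.isupper() for c in text) / n,
        "lowercase":   sum(c.islower() for c in text) / n,
        "alphabetic":  sum(c.isalpha() for c in text) / n,
        "numeric":     sum(c.isdigit() for c in text) / n,
    }

print(character_markers("Dr. Smith arrived at 10 a.m., as expected."))
```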
A function word is a word-based style marker. Function words carry little semantic content; examples include prepositions, adverbs, pronouns, conjunctions, and verb conjugations. Function words are widely used because of their ability to relate to other words and the grammatical and semantic information they provide to a sentence [15]. Furthermore, extracting function words is simple and computationally efficient, and authors generally do not use these words consciously [16]. However, relying only on function words for authorship attribution discards valuable information about the author's sentence structure and writing style [17,18,19,20,21].
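A minimal sketch of a function-word profile is shown below; the word list here is a tiny illustrative subset, whereas real systems rely on curated lists of several hundred entries per language:

```python
from collections import Counter

# Tiny illustrative function-word list (real lists are much larger).
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "but",
                  "he", "she", "it", "that", "which", "to", "with"}

def function_word_profile(text: str) -> dict:
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = max(len(tokens), 1)
    return {w: counts[w] / total for w in sorted(FUNCTION_WORDS)}

print(function_word_profile("She said that the report was on the desk."))
```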
The POS tag is another word-based style marker. A POS tag represents the grammatical category of a word; examples include verbs (VERB), adjectives (ADJ), adverbs (ADV), pronouns (PRON), and nouns (NOUN). POS tags are also topic-independent and reveal the word categories an author uses most frequently, providing favorable results in authorship attribution [22,23,24,25]. Tagging a word requires contextual information from the surrounding words, yet once the POS tag is assigned, sentence structure information is discarded. In other words, word tagging is "shallow parsing": the interrelationship between words is used only to retrieve the POS tag, while the structure of the sentence is ignored. This is a major limitation in writing style analysis.
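Since the experiments later in this paper use the Stanza parser, a POS-tagging sketch with Stanza might look as follows (the example sentence and printed output are illustrative):

```python
import stanza

# One-time model download, then a tokenize + POS pipeline for English.
stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos")

doc = nlp("The quick brown fox jumps over the lazy dog.")
for sentence in doc.sentences:
    print([(word.text, word.upos) for word in sentence.words])
# e.g. [('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'), ...]
```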
An n-gram is an imaginary window that slides over n tokens at a time, from left to right, until the end of the sentence. Tokens within the n-gram can be characters, words, or POS tags [26,27,28,29,30]. An n-gram is a linear combination of tokens that allows new patterns in the author's writing style to be identified. However, an n-gram is processed as a single piece; for example, in a word 4-gram, four words are processed as a single token (a word collocation). That is, n-grams identify sequences of tokens but not the underlying sentence structure. By ignoring non-linear connections between words, valuable information about sentence structure is lost.
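The sliding-window definition translates directly into code; a minimal sketch over word and character tokens:

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token sequence, one step at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the cat sat on the mat".split()
print(ngrams(words, 2))         # word bigrams: ('the', 'cat'), ('cat', 'sat'), ...
print(ngrams(list("cats"), 3))  # character trigrams: ('c', 'a', 't'), ('a', 't', 's')
```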
Recently, deep learning has been used in authorship attribution. The text is converted into low-dimensional dense numerical vectors for processing in the neural network; these vectors are collectively called embeddings. An embedding assigns each token (word or character) a point in a continuous space, where its position captures semantic or contextual relationships [31]. Examples of pre-trained word embeddings are Word2Vec, GloVe, and FastText, and contextual embeddings include BERT and GPT [32,33,34]. Nowadays, Large Language Models (LLMs) are used for writing style analysis. An LLM is trained on large amounts of text to generate natural language and can learn lexical, syntactic, semantic, and contextual patterns. These models are based on deep neural network architectures such as transformers; examples include GPT, BERT, and LLaMA.
LLM research on authorship attribution focuses on two branches. First, each language model generates texts with a particular style that manifests itself in lexical, syntactic, and grammatical patterns; using simple style markers (character-based or word-based), it is possible to identify whether the author of a text is a machine or a human. Second, LLMs can perform authorship attribution on texts written by humans and, owing to their generative capacity, justify the 'stylistic reasons' for their attribution. The results suggest that the ability of LLMs to identify linguistic styles is useful not only for detecting generative models but also for forensic, literary, or academic applications related to authorship analysis [35,36,37,38,39].
However, the embedding architecture fixes how many tokens can be evaluated in a single pass; BERT, for example, has a context window of 512 tokens [40]. This is a significant drawback, since an embedding detects local or global writing style patterns depending on the context window length, and as the window grows, the computational demands also rise. Generally, this approach requires high-performance computers for model training. To exploit syntactic information while avoiding these limitations, two main strategies appear in the literature:
- (1) create a new style marker, or
- (2) propose embeddings with syntactic information for deep learning methods.
Several approaches to creating syntactic style markers have been described as state-of-the-art methods. Tschuggnall and Specht [41] present a method for authorship attribution based on the author's grammatical profile, using syntax trees and pq-grams. These pq-grams are subsets of tree nodes defined by two parameters, p and q, where p indicates how many vertical levels of the tree are included and q defines how many nodes are considered horizontally. Evaluation results on three datasets (CC04, FED, and PAN12) show a high rate of correct attribution, suggesting that grammatical style is a significant feature for improving authorship attribution methods.
Patchala et al. [42] defined author templates of Context-Free Grammar (CFG) production frequencies by training on e-mail messages from the Enron dataset. They extracted the same frequencies from a new e-mail message and compared them with the templates to identify the best match, claiming that CFG production frequencies perform very well in attributing the authorship of e-mail messages.
Martijn and Wietse [43] developed a Support Vector Machine (SVM) model using a combination of lexical and syntactic features (character n-grams, punctuation, tokens, POS tags, syntactic dependency relations, and syntactic n-grams) in four languages. The syntactic analysis was performed with UUParser [44]. They argued that incorporating syntactic n-grams of dependency labels can effectively capture the stylometric features of texts, and that these n-grams improved the model's performance when paired with other style markers.
Mehler et al. [45] proposed a multidimensional syntactic dependency tree model using the MATE Parser [46] to predict sentence authorship. The goal was to generate "fingerprints" that predict the author of the underlying sentences. Mehler et al. argued that syntactic dependency features are effective for authorship attribution but that, to better understand alignment in communication, other levels of language such as lexical and semantic aspects should also be considered.
Lučić and Blake [47] present a method for authorship attribution based on Local Syntactic Dependencies (LSDs) surrounding mentions of proper names. The aim is to determine whether the syntactic patterns used when referring to persons are useful for authorship attribution. The article concludes that local syntactic dependencies surrounding proper names can be a useful stylistic marker for authorship attribution, and observes that the consistency of an author's writing style affects predictive performance. The results suggest that the approach can achieve good predictive performance with 1000 to 1500 sentences and 29 features.
Recently, deep learning methods have been proposed to exploit syntactic information. Zhang et al. [48] proposed a strategy for encoding the syntactic tree of a sentence into a learnable distributed representation. To that end, they construct an embedding vector for each word that encodes the word's path in the syntactic tree. According to the authors, their model learns feature embeddings at both the content and syntactic levels. The model consists of five types of layers: syntactic-level feature embeddings, content-level feature embeddings, convolution, max-pooling, and SoftMax. The predictive performance of the Syntax-CNN approach was evaluated on the CCAT10, CCAT50, IMDB62, blogs10, and blogs50 datasets. Zhang et al. pointed out that such syntax style embeddings can bring a significant accuracy gain to CNN strategies.
Jafariakinabad and Hua [49] proposed a method based on a Siamese neural network with two subnetworks: a lexical one that processes word sequences and a syntactic one for structural tags (POS tags). The latter subnetwork encodes syntactic information from constituency parse trees obtained with the CoreNLP parser [50]. Both subnetworks share the same architecture: a bidirectional LSTM network and a self-attention mechanism. POS tag parse trees are linearized into a sequence of structural tags following a depth-first traversal order. Predictive performance was measured on the CCAT10, CCAT50, BLOGS10, and BLOGS50 authorship attribution datasets. According to the authors, the proposed method consistently outperformed all benchmark models, including those that used only lexical or syntactic information; they also stated that explicitly using syntactic information provides valuable cues for author identification beyond semantic content.
In 2021, Wu et al. [51] developed a neural network called the Multichannel Self-Attention Network (MCSAN) for authorship attribution. The method uses features along several dimensions: style, content, syntax, and semantics. MCSAN integrates n-grams of characters, words, and POS tags, sentence structures, and dependency relations, using an attention mechanism to weight their importance. Syntactic features were obtained with Stanford CoreNLP from the well-known CCAT10, CCAT50, and IMDB62 datasets. Experimental results show a significant improvement in authorship attribution accuracy compared to state-of-the-art models, demonstrating the effectiveness of multidimensional feature integration. The ablation study showed that removing syntactic information from the model considerably decreased accuracy.
In 2021, Murauer and Specht [52] presented a language-independent feature named Dependency Tree grams (DT-grams) for cross-language authorship attribution. They used Stanza (https://github.com/stanfordnlp/stanza, accessed on 1 May 2025) to obtain style markers. After obtaining the syntax tree, substructures of different sizes are selected to produce DT-grams of dependency relations, POS tags, and combinations of them. Baseline style markers included character, word, and POS tag n-grams (n = {1, 2, 3, 4, 5}). Murauer and Specht stated that DT-grams showed promising performance in cross-language authorship attribution; furthermore, they claimed that DT-grams outperformed POS tag n-grams based on the original word order and that dependency relations and grammatical style contribute to an author's stylometric fingerprint. However, they also noted that the performance of DT-grams remains below that of traditional methods applied to machine-translated texts.
Jafariakinabad et al. [53] developed a syntactic recurrent neural network to encode syntactic patterns. The model combines convolutional neural networks (CNNs) with long short-term memory networks (LSTMs): the CNN captures short-term dependencies between words, while the LSTM captures long-term relationships in sequences and represents a sentence by its overall syntactic pattern. The sentence encoder learns the syntactic representation of a document from the output of the POS encoder, and the learned vector representation is fed into a SoftMax classifier to compute the probability distribution over class labels. The method was tested on the PAN12 (https://pan.webis.de/clef12/pan12-web/authorship-attribution.html, accessed on 1 May 2025) authorship attribution dataset. The study concludes that the syntactic recurrent neural network performs better in authorship attribution than lexical models and traditional n-gram-based models. Jafariakinabad et al. reported that both CNN-LSTM and LSTM-LSTM correctly classified all 14 novels in the test set, suggesting that syntactic information is more effective than lexical information for authorship attribution, especially across texts of different topics and genres.
This paper proposes a methodology for authorship attribution formulated around the syntactic information contained in sentences’ dependency trees. This approach aims to model the writing style by analyzing the subtrees within the dependency tree. To capture the syntactic information of the subtrees, we propose to use a representation based on the generation of a particular type of syntactic n-grams (sn-grams) called mixed syntactic n-grams (mixed sn-grams).
The main contributions of this paper are as follows: (1) we design a method for authorship attribution that relies on a machine learning approach and utilizes mixed sn-grams as style markers, (2) we develop a strategy for generating mixed sn-grams from dependency trees, (3) we evaluate the use of mixed sn-grams as features for a machine learning approach to authorship attribution, and (4) we compare the performance of mixed sn-grams with homogeneous sn-grams.
3. Results
This section outlines the experiments conducted to evaluate the usefulness of mixed sn-grams in authorship attribution. Two datasets specifically created for closed-class authorship attribution were used (described in Section 2). These datasets offer diversity in text size, candidate authors, topics covered, and the number of texts available per author. The experimental setup and results are presented below.
3.1. Experimental Setting
The experiments were performed as follows: (1) features were extracted in terms of mixed sn-grams and a machine learning algorithm was trained to obtain a model, (2) the model was validated to tune the hyperparameters, and (3) the model was evaluated on the test data represented with mixed sn-grams. The experiments employed widely adopted machine learning methods: Support Vector Machines (SVMs), Multinomial Naive Bayes (MNB), and Logistic Regression (LR). These classifiers have proven effective in previous authorship attribution studies, and they perform acceptably in the high-dimensional vector spaces (more than 100 dimensions) that the proposed style markers generate. Deep learning techniques were not considered, primarily because the mixed sn-grams representation does not provide the sequential input structure these models expect and because the dataset was too small for reliable model training.
The experiments were carried out on a computer with an Intel i5 @ 1.6 GHz CPU, the Windows 11 operating system, and Python 3.9. The classifiers were implemented and the models evaluated with the Scikit-learn machine learning module [67], version 1.3.0. A total of 80% of the training data was used to train the algorithms and the remaining 20% to validate the models.
The hyperparameters of the classifiers were selected through a semi-exhaustive search, using the accuracy metric to verify model performance. For the SVM classifier, a linear kernel was used and the value of C was obtained through a semi-exhaustive search over a fixed range with constant increments. The parameters of the MNB classifier were selected in the same manner, and for the LR classifier the value of C was likewise obtained through a semi-exhaustive search over a fixed range with constant increments.
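A sketch of this training and validation protocol for the SVM is shown below, assuming a hypothetical load_corpus() helper and an illustrative grid for C (the exact search range and increments used in the experiments are not reproduced here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Hypothetical helper: raw documents (or sn-gram strings) and author labels.
docs, labels = load_corpus()

X = CountVectorizer().fit_transform(docs)
X_train, X_val, y_train, y_val = train_test_split(
    X, labels, train_size=0.8, random_state=0)  # 80/20 split as in the paper

# Illustrative grid for C; the paper's actual range and increments differ.
grid = GridSearchCV(SVC(kernel="linear"),
                    {"C": [0.01, 0.1, 1, 10, 100]},
                    scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_val, y_val))
```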
Three main types of elements were used to generate the syntactic n-grams: Word, POS, and DR tags. From these elements, the following six combinations of mixed sn-grams were generated from the dependency trees of the sentences obtained with the Stanza parser [60]: Word-POS, Word-DR, DR-POS, DR-Word, POS-DR, and POS-Word. Only mixed sn-grams of lengths 2, 3, and 4 were considered.
Higher values of n could cause the data to become rather sparse [68,69,70,71]: as n increases, the number of features grows, while high-order n-grams occur with very low frequency. These two factors together produce sparse datasets, regardless of the type of n-grams. The chosen sizes of sn-grams correspond to those reported in recent studies employing character and word n-grams [51,72].
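As a rough sketch of how such features can be derived with Stanza (the full generation strategy is defined in the methodology; here we assume, purely for illustration, that a POS-Word sn-gram of size n is a head-to-dependent path of n nodes in which the first node contributes its POS tag and the remaining nodes their word forms):

```python
import stanza

# Dependency parsing needs the tagger and lemmatizer upstream.
stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

def pos_word_sngrams(sentence, n):
    """Enumerate head-to-dependent paths of n nodes in the dependency tree;
    the first node contributes its POS tag, the rest their word forms."""
    words = sentence.words
    children = {w.id: [] for w in words}
    for w in words:
        if w.head > 0:                  # head == 0 marks the root
            children[w.head].append(w.id)

    def paths(node, length):
        if length == 1:
            return [[node]]
        return [[node] + tail
                for child in children[node]
                for tail in paths(child, length - 1)]

    grams = []
    for w in words:
        for path in paths(w.id, n):
            head, *rest = (words[i - 1] for i in path)
            grams.append((head.upos, *(r.text.lower() for r in rest)))
    return grams

doc = nlp("She quickly read the long report.")
print(pos_word_sngrams(doc.sentences[0], 2))
# e.g. [('VERB', 'she'), ('VERB', 'quickly'), ('VERB', 'report'), ('NOUN', 'the'), ...]
```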
The study utilized the Variance Threshold technique [67] to select the features used to feed the model. This approach eliminates any feature whose variance falls below a certain threshold; by default, it excludes features with zero variance, i.e., those with the same value in all samples. The threshold was determined by experimental analysis within the range [0.001, 0.01].
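A minimal example of this selection step with Scikit-learn's VarianceThreshold, using a synthetic matrix for illustration:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 1],
              [0, 1, 4],
              [0, 1, 1]])   # first column is constant (zero variance)

selector = VarianceThreshold(threshold=0.01)  # threshold tuned in [0.001, 0.01]
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # constant column removed -> (3, 2)
```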
3.2. Runtime and Memory Usage Statistics
Aspects of the computational efficiency of the proposed method based on mixed sn-grams generation were evaluated by recording statistics on execution time and memory consumption. The runtime was determined by comparing time stamps at the beginning and end of program execution, using the libraries provided by the programming language and the computer hardware described previously. Five program executions were timed, and the average time is reported. Memory consumption was measured with the Memory Profiler module, version 0.61.0 (available at https://pypi.org/project/memory-profiler/, accessed on 1 May 2025), recording the maximum amount of memory used during program execution.
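A sketch of this measurement setup is shown below; run_pipeline is a hypothetical stand-in for the stage being profiled:

```python
import time
from memory_profiler import memory_usage

def timed_runs(func, runs=5):
    """Average wall-clock time over several runs (the paper averages 5)."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

def run_pipeline():
    ...  # stand-in for preprocessing, feature extraction, or model training

avg_seconds = timed_runs(run_pipeline)
peak_mb = memory_usage((run_pipeline, (), {}), max_usage=True)  # peak, in MiB
print(f"avg time: {avg_seconds / 60:.2f} min, peak memory: {peak_mb:.1f} MB")
```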
Table 4 presents the collected efficiency statistics for the CCAT50 and PAN 12 Task C corpora. Execution time is given in minutes (min) and memory consumption in Megabytes (MB). The information is broken down into three stages of the proposed method: (1) the preprocessing stage, in which the syntactic analysis of the input text is performed, (2) the feature extraction stage, in which the mixed sn-grams are obtained, and (3) the model training stage. The efficiency evaluation considered sn-gram sizes from 2 to 4, corresponding to the most demanding scenario. The POS-Word type was chosen arbitrarily for the evaluation, since the same algorithm generates all the other types.
The efficiency of the two baseline methods was also evaluated on the same corpora: one based on word n-grams ranging in size from 2 to 4, and the other based on document embeddings using the doc2vec model.
The statistics indicate that the time needed to process a document with the mixed sn-grams method grows proportionally with its length: texts of around 6404 words take noticeably longer on average than texts of around 584 words. Both baselines, the word n-gram method and the document embedding method, run in considerably less average time. The proposed method has a higher memory footprint than sentence-level embeddings and memory usage comparable to word n-grams; for runtime, the difference is significantly greater. The proposed method is therefore recommended for scenarios where the amount of data is medium to small. Overall, the method's higher time and memory consumption stems mainly from the preprocessing stage, due to the use of a parser. However, some parsers are starting to offer alternative models that reduce processing time and memory consumption without significantly reducing accuracy.
3.3. Results with the PAN 2012 Corpus
The accuracy obtained by the SVM and MNB classifiers using homogeneous sn-grams and the six proposed types of mixed sn-grams on the corpora for Tasks A, C, and I of PAN 12 is shown in Table 5. For homogeneous sn-grams, the optimal result was achieved with a size of four (n = 4), whereas for mixed sn-grams, the best performance was obtained with a size of three (n = 3).
The accuracy of the proposed method, based on mixed sn-grams, was compared with the accuracies obtained by the most representative teams that participated in the PAN 12 evaluation. The results of this comparison are shown in Table 6. The line denoted "sn-grams" indicates the highest accuracy achieved using homogeneous sn-grams, and the line denoted "mixed sn-grams" the maximum accuracy achieved with this type of sn-gram. Most of the teams' proposals relied on character and word n-grams and lexical features (verb conjugation tense analysis and vocabulary size), using SVMs and neural networks as classification methods.
3.4. Results with the CCAT 50 Corpus
In the experiments conducted with the CCAT 50 corpus, homogeneous sn-grams and the proposed mixed sn-grams were evaluated. For each type of sn-gram, the accuracy was reported for n = {2, 3, 4}, both individually and jointly. The LR, SVM, and MNB classifiers were utilized.
Figure 5 shows the number of features obtained for the different sizes and types of homogeneous and mixed sn-grams in the CCAT50 training corpus. The number of features for Word sn-grams and for the mixed sn-grams that include lexical information (DR-Word, POS-Word, Word-POS, and Word-DR) tends to be significantly higher than for sn-grams that do not. The larger the size of the sn-grams, the larger the number of features tends to be; this behavior is expected because longer sn-grams repeat less often across sentences. For the CCAT50 corpus, it was observed that sn-grams that use words in their structure and have a size of three or more generate very large numbers of features.
Because of the large number of features that can be obtained with mixed sn-grams, some feature selection techniques were evaluated. We compared the accuracy of the proposed classifiers using Principal Component Analysis (PCA) and the Variance Threshold technique.
Table 7 shows the accuracy achieved by the SVM classifier after applying the feature selection techniques for some of the mixed sn-grams using the CCAT50 corpus. For the PCA technique, the standard methodology of using only the three principal components (3-PCA) was followed. As the table shows, the Variance Threshold technique improves accuracy by ≈4.38%, whereas PCA is not a suitable option for use with mixed sn-grams. Experiments performed with the remaining mixed sn-grams types confirm this tendency.
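A sketch of how the two selection techniques can be compared in front of the same classifier, with synthetic data standing in for the sn-gram feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the (dense) sn-gram feature matrix.
X, y = make_classification(n_samples=300, n_features=100, random_state=0)

pipelines = {
    "3-PCA":     make_pipeline(PCA(n_components=3), SVC(kernel="linear")),
    "VarThresh": make_pipeline(VarianceThreshold(0.01), SVC(kernel="linear")),
}
for name, clf in pipelines.items():
    score = cross_val_score(clf, X, y, scoring="accuracy").mean()
    print(f"{name}: {score:.3f}")
```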
Table 8 presents the accuracy for each of the homogeneous sn-gram types and sizes indicated earlier. Each line in the table corresponds to a specific experiment, with the sizes of the sn-grams considered indicated on the respective line.
Table 9 presents the accuracy obtained by the LR, SVM, and MNB classifiers when employing mixed sn-grams for n = {2, 3, 4}. The marks on each line indicate the sizes of the sn-grams considered in the corresponding experiment.
Figure 6 shows the confusion matrix obtained using the SVM method with sn-grams of the POS-Word type and size [2, 3]. This combination of feature type, feature size, and classifier is the one that obtained the best accuracy in the previous experiments (see Table 9). For most authors, a high recognition rate is achieved; only for authors Scott Hillis and Jan Lopatka does the model recognize a small fraction of the documents.
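Per-author recognition rates of this kind can be read off the confusion matrix as row-normalized diagonal entries; a small sketch with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true/predicted author labels for illustration.
y_true = ["Lopatka", "Lopatka", "Fernandes", "Hillis", "Fernandes"]
y_pred = ["Fernandes", "Lopatka", "Fernandes", "Hillis", "Fernandes"]

labels = ["Fernandes", "Hillis", "Lopatka"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
recall = cm.diagonal() / cm.sum(axis=1)   # per-author recognition rate
print(dict(zip(labels, np.round(recall, 2))))
# {'Fernandes': 1.0, 'Hillis': 1.0, 'Lopatka': 0.5}
```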
According to the confusion matrix shown in Figure 6, the trained model performs poorly in certain cases; for example, texts by author Jan Lopatka are often assigned to Edna Fernandes. To explain why the proposed method tends to confuse these authors, an exploratory analysis was performed on the instances of each author in the training corpus using some well-known NLP techniques.
Table 10 shows the findings of the analysis, including the most frequently occurring POS-Word sn-grams for each author. Figure 7 and Figure 8 show the word clouds representing the frequency of occurrence of words in the texts of authors Jan Lopatka and Edna Fernandes.
Based on the information retrieved in the analysis described above, some aspects can be highlighted that explain why the model tends to confuse the authors. The topics the authors wrote about are different, even though the genre of the texts is the same. Some of the keywords used overlap (e.g., said, year, would, percent), which could be attributed to the topic or genre of the text; however, the overlap between the keywords or vocabularies used does not exceed 50%.
The most frequent POS-Word sn-grams refer to sentence structures used in the construction of the text. In the case of sn-grams of size 3 and 4, it is observed that most begin with the root element ADJ, and they then include double quotation marks and a word. These types of structures refer to sentences that represent quotes about what a character has expressed, which is the reason why double quotation marks appear. Some examples of phrases encoded by mixed sn-grams include the following: “He added that…” and “…terms as well, he added.”. The use of this type of sentence can become common among authors who are colleagues in their profession.
An ablation analysis was performed to identify and evaluate the significance of the proposed features based on the mixed sn-grams. The features were divided into three groups for the analysis, considering the type of information used for their generation. The first group comprised sn-grams of the following types: POS, Word, Word-POS, and POS-Word. The second group included sn-grams of type DR, Word, Word-DR, and DR-Word. Finally, the third group included POS, DR, DR-POS, and POS-DR sn-grams. The efficiency obtained by homogeneous sn-grams and mixed sn-grams related to the same information type was analyzed for each group.
The ablation analysis was conducted to assess the significance of the type and size of the mixed sn-grams in the proposed method, using the accuracy achieved by a trained model. The SVM classifier was the only classifier employed in the ablation analysis because it obtained the best overall performance in the preceding experiments (see Table 9).
Figure 9 shows the graphs indicating the accuracy achieved for each group into which the proposed features were divided. Figure 9a depicts the results for the first group of features. It shows that the POS-Word sn-grams achieve the highest accuracy, while the Word-POS type obtains an accuracy that is 7.61% lower on average across the sizes evaluated. Among the homogeneous types, Word sn-grams contribute more to classification than POS sn-grams. The sn-grams of size n = 2 contribute the most relevant features for the task, followed by those of size n = 3 and, finally, those of size n = 4.
Figure 9b describes the results for the second group of features. The DR-Word sn-grams achieve the highest accuracy, while the Word-DR type obtains an accuracy that is 3.14% lower on average across the sizes evaluated. Among the homogeneous types, Word sn-grams contribute more to classification than DR sn-grams, even reaching values similar to those of the Word-DR type. The sn-grams of size n = 2 provide the most task-relevant features, followed by those of size n = 3 and, finally, those of size n = 4.
Figure 9c describes the results for the third group of features. The POS-DR sn-grams achieve the best accuracy, while the DR-POS type obtains an accuracy that is 0.91% lower on average across the sizes evaluated. Among the homogeneous types, POS sn-grams contribute more to classification than DR sn-grams. The sn-grams of sizes n = 3 and n = 4 provide the most relevant features for the task, while size n = 2 provides the least relevant ones.
The combination of two types of mixed sn-grams was also evaluated using the CCAT50 corpus. For the evaluation, the POS-Word sn-grams of size [2, 3] (the configuration with the highest accuracy) were selected, along with the best configurations of the other mixed sn-grams types (reported in Table 9). The evaluation was performed using only the SVM classifier and the same configuration previously used to evaluate each type of mixed sn-grams. Table 11 shows the accuracy obtained for each combination. Combining features from the mixed POS-Word and DR-Word sn-grams improves accuracy over using each type individually by 0.76% and 4.16%, respectively.
Table 12 presents the accuracy achieved by previous works that used the CCAT50 corpus for evaluation. Early work considers variants of character n-grams and the application of machine learning methods such as SVMs or NNs. More recent work explores incorporating word n-grams, POS tags, and syntactic information as features in combination with deep learning techniques. These features are fed in sequence to the first layer of the network, the embedding layer. This layer receives as input n-dimensional vectors obtained from pre-trained models, such as transformer-based embeddings like BERT [77,78,79]. Alternatively, the embeddings can be created directly in the embedding layer, via a CNN or an RNN.
4. Discussion
Mixed sn-grams are style markers generated from the dependency tree that combine different types of information (lexical, POS tags, and syntactic dependency relations). These markers can capture intrinsic information associated with the grammatical relationships between tokens. Traditional n-grams capture words in the sequential order of the text (superficial textual analysis); in contrast, sn-grams capture the syntactic relations between words that correspond to the grammatical rules of the language (deep textual analysis).
Experiments on the PAN 2012 Authorship Attribution Corpus demonstrated the potential of mixed sn-grams for modeling an author's style in the authorship attribution task. According to Table 5, mixed sn-grams outperformed homogeneous sn-grams in terms of accuracy, particularly in scenarios with larger input data. The experiments showed that the DR-POS combination achieved the best accuracy among the mixed sn-grams combinations, and the SVM classifier showed the best performance for both types of sn-grams (homogeneous and mixed). The findings suggest that combining the dependency relations and POS tags of a sentence's syntactic information improves the classifier's accuracy by 12.50% for Task C and 25% for Task I, compared to homogeneous sn-grams.
Comparing the proposed method with the papers presented at the PAN 2012 event (see Table 6), the method achieves the best accuracy on Task A, tying with other submissions. On Task C it obtains the second-best accuracy, 12.50% below the best result. Finally, the proposed method obtains the best reported accuracy for Task I, which is considered the most complicated scenario of the competition due to the number of authors.
The CCAT50 corpus is more demanding because it considers 50 authors and offers thematic diversity in the instances. Furthermore, the average text size (584 words) is considerably smaller than the PAN 12 corpus average. The corpus was used in experiments with homogeneous sn-grams of size n = {2, 3, 4} and the LR, SVM, and MNB machine learning algorithms. The results in Table 8 show similar accuracy for all machine learning algorithms; the analysis revealed that Word sn-grams with n = 2 yielded the optimal accuracy compared to other sn-gram types and n values. On the other hand, Table 9 shows that mixed sn-grams of size n = 2 consistently produced the best results, with the POS-Word combination achieving the highest accuracy. Comparing Table 8 and Table 9 shows that training the learning algorithm with mixed sn-grams as style markers outperforms homogeneous sn-grams by about 5%.
Table 12 analyzes previous works that used the CCAT50 corpus for evaluation. Currently, there are two approaches to the authorship attribution task: feature engineering and deep learning techniques. Our proposal is based on feature engineering, and in this category mixed sn-grams obtain results superior or equal to the state of the art (see Table 12). On the other hand, DL-based SOTA approaches outperform mixed sn-grams; however, to achieve this, most of them feed the network not only syntactic information but also morphological and lexical information. Deep learning approaches based only on syntactic information [48,49] do not outperform mixed sn-grams. It is worth noting that mixed sn-grams achieve 74.36% accuracy, close to the roughly 78% of DL approaches based on lexical information [32,83], which outperform them by about 4%. The best results, 83.42% [51] and 93.20% [82], are obtained by combining the three types of information available in a sentence.
The proposal described in this paper outperforms proposals based only on the use of syntactic tree embeddings, without the need for additional information or the use of deep learning strategies. Mixed sn-grams offer the advantage of focusing on the frequency of occurrence of patterns in the dependency trees of sentences, and they also allow the representation of nonlinear relationships between words. Unlike other features that focus on the syntactic level, mixed sn-grams allow for the quantification of the writing style by repeating patterns that can be explained through the grammatical relationships between words.
5. Conclusions
Mixed sn-grams show accuracy comparable or superior to other approaches based on stylometric features (lexical or character level), exceeding them by at least 6.85% on PAN 12 Task I and by at least 1% on the CCAT50 corpus. Comparing homogeneous sn-grams against mixed sn-grams, we found that the latter provide information additional to that obtained with the former. Mixed sn-grams identify the context of a syntactic tree element in three dimensions (lexical, POS tags, and dependency relations), which enables them to identify other style patterns, such as subcategorization phenomena. The results achieved by mixed sn-grams generally exceed those achieved by homogeneous sn-grams by at least 12.5% for the PAN 12 corpus and 5.12% for the CCAT50 corpus, despite both being generated from dependency trees.
One drawback of the proposed method is its reliance on parsers. Memory consumption and runtime analyses indicate that the proposed method best suits small- or medium-sized datasets (texts no longer than tens of thousands of words), and in data-scarce scenarios it demonstrated competitive results. Furthermore, parser developers are offering alternative models that reduce processing time and memory consumption without significantly reducing accuracy, which would eventually allow the method to be applied in more realistic scenarios.
Unlike deep learning methods, our approach is efficient in scenarios with small to moderate datasets, where deep techniques tend to overfit or fail to generalize. Our method achieves 74% accuracy, which makes it suitable for cases where large amounts of labeled text are not available, such as in real authorship attribution cases. The model obtained through mixed sn-grams not only dispenses with GPUs, but its low computational requirement allows its application in devices with limited hardware, such as mobile devices or low-cost servers. This broadens the possibilities of adoption and scalability of the solution.
Our method offers a significant advantage in terms of interpretability. In authorship attribution, tracking and justifying attribution decisions through concrete and understandable linguistic features is crucial, especially in legal expertise, plagiarism detection, or forensic investigations. In contrast, deep learning models are treated as black boxes that achieve high classification rates but fail to justify the attribution process. Our approach provides transparency, confidence, and auditability in the attribution process.
In this article, we propose a method for solving the closed monolingual authorship attribution problem using mixed sn-grams. Experiments demonstrated that using mixed sn-grams as style markers creates a reliable writing style model that machine learning algorithms can learn. The proposed method offers accuracy comparable to or better than other works based on stylometric features. Compared with deep learning strategies based on syntactic parse tree embeddings, including those using a bidirectional LSTM with a self-attention mechanism, the proposed method reports at least 28% higher accuracy. It also offers the advantage of being usable in resource-limited scenarios, as it requires neither specialized hardware nor large data volumes.
Future research directions include two potential courses of action: firstly, evaluating the impact of integrating mixed sn-grams with other feature types, including lexical or character-level features; and secondly, formulating a strategy to adapt mixed sn-grams for use by deep learning methods, followed by a performance evaluation. The proposed method can be applied to larger datasets, such as IMDB62, Blogs10, and Blogs50. To reduce runtime, a faster model or parser can be used. The proposed method can also be applied to multilingual authorship attribution, provided the parser includes models for the languages involved.