A comparison of several AI techniques for authorship attribution on Romanian texts

Determining the author of a text is a difficult task. Here we compare multiple AI techniques for classifying literary texts written by multiple authors by taking into account a limited number of speech parts (prepositions, adverbs, and conjunctions). We also introduce a new dataset composed of texts written in the Romanian language on which we have run the algorithms. The compared methods are Artificial Neural Networks, Support Vector Machines, Multi Expression Programming, Decision Trees with C5.0, and k-Nearest Neighbour. Numerical experiments show, first of all, that the problem is difficult, but some algorithms are able to generate decent errors on the test set.


Introduction
Automated authorship attribution (AA) is defined in [1] as the task of determining authorship of an unknown text based on the textual characteristics of the text itself. Today the AA is useful in a plethora of fields: from the educational and research domain to detect plagiarism [2] to the justice domain to analyze evidence on forensic cases [3] and cyberbullying [4], to the social media [5,6] to detect compromised accounts [7].
Most approaches in the area of artificial intelligence treat the AA problem by using simple classifiers (e.g., linear SVM or decision tree) that have bag-of-words (character n-grams) as features or other conventional feature sets [8,9]. Although deep neural learning was already used for natural language processing (NLP), the adoption of such strategies for authorship identification occurred later. In recent years, pre-trained language models (such as BERT and GPT-2) have been used for finetuning and accuracy improvements [8,10,11].
The challenges in solving the AA problem can be grouped into three main groups [8]: 1.
The lack of large-scale datasets; 2.
The lack of methodological diversity; 3.
The ad hoc nature of authorship.
The availability of large-scale datasets has improved in recent years as large datasets have become widespread [12,13]. Other issues that relate to the datasets are the language in which the texts are written, the domain, the topic, and the writing environment. Each of these aspects has its own particularities. From the language perspective, the issue is that most available datasets consist of texts written in English. There is PAN18 [9] for English, French, Italian, Polish, and Spanish; or PAN19 [14] for English, French, Italian, and Spanish. However, there are not very many datasets for other languages and this is crucial as there are particularities that pertain to the language [15].
The methodological diversity has also improved in recent years, as it is detailed in [8]. However, the ad hoc nature of authorship is a more difficult issue, as a set of features that differentiates one author from the rest may not work for another author due to the individuality aspect of different writing styles. Even for one author, the writing style can evolve or change over a period of time, or it can differ depending on the context (e.g., the domain, the topic, or the writing environment). Thus, modeling the authorial writing style has to be carefully considered and needs to be tailored to a specific set of authors [8]. Therefore, selecting a distinguishing set of features is a challenging task.
We propose a new dataset named ROST (ROmanian Stories and other Texts) as there are few available datasets that contain texts written in Romanian [16]. The existing datasets are small, on obscure domains, or translated from other languages. Our dataset consists of 400 texts written by 10 authors. We have elements that pertain to the intended heterogeneity of the dataset such as: • Different text types: stories, short stories, fairy tales, novels, articles, and sketches; • Different number of texts per author: ranging from 27 to 60; • Different sources: texts are collected from 4 different websites; • Different text lengths: ranging from 91 to 39,195 words per text; • Different periods: the time period in which the considered texts were written spans over 3 centuries, which introduces diachronic developments; • Different mediums: texts were written with the intention of being read from paper medium (most of the considered authors) to online (two contemporary authors). This aspect considerably changes the writing style, as shorter sentences and shorter words are used online, and they also contain more adjectives and pronouns [17].
As our set is heterogeneous (as described above) from multiple perspectives, the authorship attribution is even more difficult. We investigate this classification problem by using five different techniques from the artificial intelligence area:
For each of these methods, we investigate different scenarios by varying the number and the type of some features to determine the context in which they obtain the best results. The aim of our investigations is twofold. On one side, the result of this investigation is to determine which method performs best while working on the same data. On another side, we try to find out the proper number and type of features that best classify the authors on this specific dataset.
The paper is organized as follows: • Section 2 describes the AA state of the art by using methods from artificial intelligence; details the entire prerequisite process to be considered before applying the specific AI algorithms (highlighting possible "stylometric features" to be considered); provides a table with some available datasets; presents a number of AA methods already proposed; describes the steps of the attribution process; presents an overview and a comparison of AA state-of-the-art methods. • Section 3 details the specific particularities (e.g., in terms of size, sources, time frames, types of writing, and writing environments) of the database we are proposing and we are going to use, and the building and scaling/pruning process of the feature set.
• Section 4 introduces the five methods we are going to use in our investigation. • Section 5 presents the results and interprets them, making a comparison between the five methods and the different sets of features used; measures the results by using metrics that allow a comparison with the results of other state-of-the-art methods. • Section 6 concludes with final remarks on the work and provides future possible directions and investigations.

Related Work
The AA detection can be modeled as a classification problem. The starting premise is that each author has a stylistic and linguistic "fingerprint" in their work [18]. Therefore, in the realm of AI, this means extracting a set of characteristics, which can be identified in a large-enough writing sample [8].

Features
Stylometric features are the characteristics that define an author's style. They can be quantified, learned [19], and classified into five groups [20]:

1.
Lexical (the text is viewed as a sequence of tokens grouped into sentences, with each token corresponding to a word, number, or punctuation mark): • Token-based (e.g., word length, sentence length, etc.); • Vocabulary richness (i.e., attempts to quantify the vocabulary diversity of a text); • Word frequencies (e.g., the traditional "bag-of-words" representation [21] in which texts become vectors of word frequencies disregarding contextual information, i.e., the word order); • Word n-grams (i.e., sequences of n contiguous words also known as word collocations); • Errors (i.e., intended to capture the idiosyncrasies of an author's style) (requires orthographic spell checker).

2.
Character (the text is viewed as a sequence of characters): • Character types (e.g., letters, digits, etc.) (requires character dictionary); • Character n-grams (i.e., considers all sequences of n consecutive characters in the texts; n can have a variable or fixed length); • Compression methods (i.e., the use of a compression model acquired from one text to compress another text; compression models are usually based on repetitions of character sequences).

3.
Syntactic (text-representation which considers syntactic information): • Part-of-speech (POS) (requires POS tagger-a tool that assigns a tag of morphosyntactic information to each word-token based on contextual information); • Chunks (i.e., phrases); • Sentence and phrase structure (i.e., a parse tree of each sentence is produced); • Rewrite rules frequencies (these rules express part of the syntactic analysis, helping to determine the syntactic class of each word as the same word can have different syntactic values based on different contexts); • Errors (e.g., sentence fragments, run-on sentences, mismatched use of tenses) (requires syntactic spell checker).

5.
Application-specific (the text is viewed from an application-specific perspective to better represent the nuances of style in a given domain): • Functional (requires specialized dictionaries); • Structural (e.g., the use of greetings and farewells in messages, types of signatures, use of indentation, and paragraph length); • Content-specific (e.g., content-specific keywords); • Language-specific.
The lexical and character features are simpler because they view the text as a sequence of word-tokens or characters, not requiring any linguistic analysis, in contrast with the syntactic and semantic characteristics, which do. The application-specific characteristics are restricted to certain text domains or languages.
A simple and successful feature selection, based on lexical characteristics, is made by using the top of the N most frequent words from a corpus containing texts of the candidate author. Determining the best value of N was the focus of numerous studies, starting from 100 [22], and reaching 1000 [23], or even all words that appear at least twice in the corpus [24]. It was observed, that depending on the value of N, different types of words (in terms of content specificity) make up the majority. Therefore, when the size of N falls within dozens, the most frequent words of a corpus are closed-class words (i.e., articles, prepositions, etc.), while when N exceeds a few hundred words, open-class words (i.e., nouns, adjectives, verbs) are the majority [20].
Even though the word n-grams approach comes as a solution to keeping the contextual information, i.e., the word order, which is lost in the word frequencies (or "bag-of-words") approach, the classification accuracy is not always better [25,26].
The main advantage of character feature selection is that they pertain to any natural language and corpus. Furthermore, even the simplest in this category (i.e., character types) proved to be useful to quantify the writing style [27].
The character n-grams have the advantages of capturing the nuances of style and being tolerant to noise (e.g., grammatical errors or making strange use of punctuation), and the disadvantage is that they capture redundant information [20].
The syntactic feature selection requires the use of Natural Language Processing (NLP) tools to perform a syntactic analysis of texts, and they are language-dependent. Additionally, being a method that requires complex text processing, noisy datasets may be produced due to unavoidable errors made by the parser [20].
For semantic feature selection an even more detailed text analysis is required for extracting stylometric features. Thus, the measures produced may be less accurate as more noise may be introduced while processing the text. NLP tools are used here for sentence splitting, POS tagging, text chunking, and partial parsing. However, complex tasks, such as full syntactic parsing, semantic analysis, and pragmatic analysis, are hard to be achieved for an unrestricted text [20].
A comprehensive survey of the state of the art in stylometry is conducted in [28]. The most common approach used in AA is to extract features that have a high discriminatory potential [29]. There are multiple aspects that have to be considered in AA for selecting the appropriate set of features. Some of them are the language, the literary style (e.g., poetry, prose), the topic addressed by the text (e.g., sports, politics, storytelling), the length of the text (e.g., novels, tweets), the number of text samples, and the number of considered features. For instance, lexical and character features, although more simple, can considerably increase the dimensionality of the feature set [20]. Therefore, feature selection algorithms can be applied to reduce the dimensionality of such feature sets [30]. This also helps the classification algorithm to avoid overfitting on the training data.
Another prerequisite for the training phase is deciding whether the training texts are processed individually or cumulatively (per author). From this perspective, the following two approaches can be distinguished [20]:

1.
Instance-based approach (i.e., each training text is individually represented as a separate instance in the training process to create a reliable attribution model); 2.
Profile-based approach (i.e., a cumulative representation of an author's style, also known as the author's profile, is extracted by concatenating all available training texts of one author into one large file to flatten differences between texts).
Efstathios Stamatatos offers in [20] a comparison between the two aforementioned approaches.

Datasets
In Table 1, we present a list of datasets used in AA investigations. There is a large variation between the datasets. In terms of language, there are usually datasets with texts that are written in one language, and there are a few that have texts written in multiple languages. However, most of the available datasets contain texts written in English.
The Size column is generally the number of texts and authors that have been used in AA investigations. For example, PAN11 and PAN12 have thousands of texts and hundreds of authors. However, in the referenced paper, only a few were used. The datasets vary in the number of texts from hundreds to hundreds of thousands, and in terms of the number of authors, from tens to tens of thousands. Table 1. Datasets used for author attribution detection; Author(s) are names of individuals who created the dataset (for a group consisting of more than two, only the name of the first person is provided in the list, followed by "et al."); Paper is the first paper that introduced that dataset or that is recommended by its creator(s) to be used for citing the dataset; Language is the language in which the texts in the database were written; Size is the dimension of the dataset; Features stands for the types of features that can be or were used on that specific dataset (however, the information here is only indicative and should not be taken literally); No. of features, is also more an indicative value for possible feature set dimensions; Name or link provides the name by which that specific dataset is known and, when available, links are provided.

Strategies
According to [46], the entire process of text classification occurs in 6 stages: 1.
Data acquisition (from one or multiple sources); 2.
Data analysis and labeling; 3.
Feature construction and weighting; 4.
Feature selection and projection; 5.
Training of a classification model; 6.
Solution evaluation.
The classification process initiates with data acquisition, which is used to create the dataset. There are two strategies for the analysis and labeling of the dataset [46]: labeling groups of texts (also called multi-instance learning) [47], or assigning a label or labels to each text part (by using supervised methods) [48]. To yield the appropriate data representation required by the selected learning method, first, the features are selected and weighted [46] according to the obtained labeled dataset. Then, the number of features is reduced by selecting only the most important features and projected onto a lower dimensionality. There are two different representations of textual data: vector space representation [49] where the document is represented as a vector of feature weights, and graph representation [50] where the document is modeled as a graph (e.g., nodes represent words, whereas edges represent the relationships between the words). In the next stage, different learning approaches are used to train a classification model. Training algorithms can be grouped into different approaches [46]: supervised [48] (i.e., any machine learning process), semi-supervised [51] (also known as self-training, co-training, learning from the labeled and unlabeled data, or transductive learning), ensemble [52] (i.e., training multiple classifiers and considering them as a "committee" of decision-makers), active [53] (i.e., the training algorithm has some role in determining the data it will use for training), transfer [54] (i.e., the ability of a learning mechanism to improve the performance for a current task after having learned a different but related concept or skill from a previous task; also known as inductive transfer or transfer of knowledge across domains), or multi-view learning [55] (also known as data fusion or data integration from multiple feature sets, multiple feature spaces, or diversified feature spaces that may have different distributions of features).
By providing probabilities or weights, the trained classifier is then able to decide a class for each input vector. Finally, the classification process is evaluated. The performance of the classifier can be measured based on different indicators [46]: precision, recall, accuracy, F-score, specificity, area under the curve (AUC), and error rate. These all are related to the actual classification task. However, other performance-oriented indicators can also be considered, such as CPU time for training, CPU time for testing, and memory allocated to the classification model [56].
Aside from the aforementioned challenges, there are also other sets of issues that are currently being investigated. These are: • Issues related to cross-domain, cross topic and/or cross-genre datasets; • Issues related to the specificity of the used language; • Issues regarding the style change of authors when the writing environment changes from offline to online; • The balanced or imbalanced nature of datasets.
Some examples which focus on these types of issues, alongside their solutions and/or findings, are presented next.
Participants in the Identification Task at PAN-2018 [9], investigated two types of classifications. The corpus consists of fan-fiction texts written in English, French, Italian, Polish, and Spanish, and a set of questions and answers on several topics in English. First, they addressed the cross-domain AA, finding that heterogeneous ensembles of simple classifiers and compression models outperformed more sophisticated approaches based on deep learning. Also, the set size is inversely correlated with attribution accuracy, especially for cases when more than 10 authors are considered. Second, they investigated the detection of style changes, where single-author and multi-author texts were distinguished. Techniques ranging from machine learning ensembles to deep learning with a rich set of features have been used to detect style changes, achieving the accuracy of up to nearly 90% over the entire dataset and several reaching 100%.
The issue of cross-topic confusion is addressed in [57] for AA. This problem arises when the training topic differs from the test topic. In such a scenario, the types of errors caused by the topic can be distinguished from the errors caused by the detection of the writing style. The findings show that classification is least likely to be affected by topic variations when parts of speech are considered as features.
The analysis conducted in [58] aimed to determine which approach, such as topic or style, is better for AA. The findings showed that online news, which have a wide variety of topics, are better classified using content-based features, while texts with less topical variation (e.g., legal judgments and movie reviews) benefit from using style-based features.
In [59] it is shown that syntax (e.g., sentence structure) helps AA on cross-genre texts, while additional lexical information (e.g., parts of speech such as nouns, verbs, adjectives, and adverbs) helps to classify cross-topic and single-domain texts. It is also shown that syntax-only models may not be efficient.
Language-specific issues (e.g., the complexity and structure of sentences) are addressed in [15] in relation to the Arabic language. Ensemble methods were used to improve the effectiveness of the AA task.
The authors of [60] propose solutions to address the many issues in AA (e.g., cross-domain, language specificity, writing environment) by introducing the concept of stacked classifiers, which are built from words, characters, parts of speech n-grams, syntactic dependencies, word embeddings, and more. This solution proposes that these stacked classifiers are dynamically included in the AA model according to the input.
Two different AA approaches called "writer-dependent" and "writer-independent" were addressed in [37]. In the first approach, they used a Support Vector Machine (SVM) to build a model for each author. The second approach combined a feature-based description with the concept of dissimilarity to determine whether a text is written by a particular author or not, thereby reducing the problem to a two-class problem. The tests were performed on texts written in Portuguese. For the first approach, 77 conjunctions and 94 adverbs were used to determine the authorship and the best accuracy results on the test set composed of 200 documents from 20 different authors were 83.2%. For the second approach, the same set of documents and conjunctions was used, obtaining the best result of 75.1% accuracy. In [38], along with conjunctions and adverbs, 50 verbs and 91 pronouns were added to improve the results obtained previously, achieving a 4% improvement in both "writer-dependent" and "writer-independent" approaches.
The challenges of variations in authors' style when the writing environment changes from traditional to online are addressed in [17]. These investigations consider changes in sentence length, word usage, readability, and frequency use of some parts of speech. The findings show that shorter sentences and words, as well as more adjectives and pronouns, are used online.
The authors of [61] proposed a feature extraction solution for AA. They investigated trigrams, bags of words, and most frequent terms in both balanced and imbalanced samples and showed with 79.68% accuracy that an author's writing style can be identified by using a single document.

Comparison
AA is a very important and currently intensively researched topic. However, the multitude of approaches makes it very difficult to have a unified view of the state-of-the-art results. In [10], authors highlight this challenge by noting significant differences in: • Datasets -In terms of size: small (CCAT50, CMCC, Guardian10), medium (IMDb62, Blogs50), and large (PAN20, Gutenberg); - In terms of content: cross-topic (× t ), cross-genre (× g ), unique authors; - In terms of imbalance (imb): i.e., standard deviation of the number of documents per author; - In terms of topic confusion (as detailed in [57]).
• Performance metrics -In terms of type: accuracy, F1, c@1, recall, precision, macro-accuracy, AUC, R@8, and others; -In terms of computation: even for the same performance metrics there were different ways of computing them.

-
In terms of the feature extraction method, * Feature-based: n-gram, summary statistics, co-occurrence graphs; * Embedding-based: char embedding, word embedding, transformers The work presented in [10] tries to address and "resolve" these differences, bringing everything to a common denominator, even when that meant recreating some results. To differentiate between different methods, authors of [10] grouped the results in 4 classes: , also using BERT as described in [67].
The results of the state of the art as presented in [10] are shown in Table 2. As can be seen in Table 2, the methods in the Ngram class generally work best. However, BERT-class methods can perform better on large training sets that are not cross-topic and/or cross-genre. The authors of [10] state that from their investigations it can be inferred that Ngram-class methods are preferred for datasets that have less than 50,000 words per author, while BERT-class methods should be preferred for datasets with over 100,000 words per author.

Proposed Dataset
The texts considered are Romanian stories, short stories, fairy tales, novels, articles, and sketches.
There are 400 such texts of different lengths, ranging from 91 to 39,195 words. Table 3 presents the averages and standard deviations of the number of words, unique words, and the ratio of words to unique words for each author. There are differences up to almost 7000 words between the average word counts (e.g., between Slavici and Oltean). For unique words, the difference between averages goes up to more than 1300 unique words (e.g., between Eminescu and Oltean). Even the ratio of total words to unique words is a significant difference between the authors (e.g., between Slavici and Oltean). Table 3. Diversity of the considered dataset in terms of the length of the texts (i.e., number of words). Author is the author's name (the last name is in bold); Average is the mean number of words per text written by the corresponding author; StdDev is the standard deviation; Average-Unique is the mean number of unique words; StdDev-Unique is the standard deviation on unique words; Average-Ratio is the mean number of the ratio of total words to unique words; StdDev-Ratio is the standard deviation of the ratio of total words to unique words.

Author
Average StdDev Average-Unique

StdDev-Ratio
Ion Eminescu and Slavici, the two authors with the largest averages also have large standard deviations for the number of words and the number of unique words. This means that their texts range from very short to very long. Gârleanu and Oltean have the shortest texts, as their average number of words and unique words and the corresponding standard deviations are the smallest.
There is also a correlation between the three groups of values (pertaining to the words, unique words, and the ratio between the two) that is to be expected as a larger or smaller number of words would contain a similar proportion of unique words or the ratio of the two, while the standard deviations of the ratio of total words to unique words tend to be more similar. However, Slavici has a very high ratio, which means that there are texts in which he repeats the same words more often, and in other texts, he does not. There is also a difference between Slavici and Eminescu here because even if they have similar word count average and unique word count average, their ratio differs. Eminescu has a similar representation in terms of ratio and standard deviation with his lifelong friend Creangȃ, which can mean that both may have similar tendencies in reusing words. Table 4 shows the averages of the number of features that are contained in the texts corresponding to each author. The pattern depicted here is similar to that in Table 3, which is to be expected. However, standard deviations tend to be similar for all authors. These standard deviations are considerable in size, being on average as follows: This means that the frequency of feature occurrence differs even in the texts written by the same author. Table 4. Diversity of the considered dataset in terms of the number of occurrences of the considered features in the texts. Author is the author's name (the last name is in bold); Average-P is the average number of the occurrence of the considered prepositions in the texts corresponding to each author; StdDev-P is the standard deviation for the occurrence of the prepositions; Average-PA is the average number of the occurrence of the considered prepositions and adverbs; StdDev-PA is the standard deviation of the number of the occurrence of the considered prepositions and adverbs; Average-PAC is the average number of the occurrence of the considered prepositions, adverbs, and conjunctions; StdDev-PAC is the standard deviation of the number of the occurrence of the considered prepositions, adverbs, and conjunctions.

Author
Average The considered texts are collected from 4 websites and are written by 10 different authors, as shown in Table 5. The diversity of sources is relevant from a twofold perspective. First, especially for old texts, it is difficult to find or determine which is the original version. Second, there may be differences between versions of the same text either because some words are no longer used or have changed their meaning, or because fragments of the text may be added or subtracted. For some authors, texts are sourced from multiple websites. Table 5. List of authors (the author's last name is in bold), the number of texts considered for each author (total number is in bold), and their source (i.e., the website from which they were collected).

Author
No The diversity of the texts is intentional because we wanted to emulate a more likely scenario where all these characteristics might not be controlled. This is because, for future texts to be tested on the trained models, the text length, the source, and the type of writing cannot be controlled or imposed.
To highlight the differences between the time frames of the periods in which the authors lived and wrote the considered texts, as well as the environment from which the texts were intended to be read, we gathered the information presented in Table 6. It can be seen that the considered texts were written in the time span of three centuries. This also brings an increased diversity between texts, since within such a large time span there have been significant developments in terms of language (e.g., diachronic developments), writing style relating to the desired reading medium (e.g., paper or online), topics (e.g., general concerns and concerns that relate to a particular time), and viewpoints (e.g., a particular worldview). Table 6. List of authors, time spans of the periods in which the authors lived and wrote the considered texts and the medium from which the readers read their texts. Author is the author's name (the last name is in bold); Life is the lifetime of the author; Publication is the publication interval of the texts (note: the information presented here was not always easily accessible and some sources would contradict in terms of specific years, however, this information should be considered more as an indicative coordinate and should not be taken literally, the goal being that the literary texts be temporally framed in order to have a perspective on the period in which they were written/published); Century is a coarser temporal framing of the periods in which the texts were written; Medium is the environment from which most of the readers read the author's texts. The diversity of the texts also pertains to the type of writing, i.e., stories, short stories, fairy tales, novels, articles, and sketches. Table 7 shows the distribution of these types of writing among the texts belonging to the 10 authors. The difference in the type of writing has an impact on the length of the texts (for example, a novel is considerably longer than a short story), genre (for example, fairy tales have more allegorical worlds that can require a specific style of writing), the topic (for example, an article may describe more mundane topics, requiring a different type of discourse compared to the other types of writing). Table 7. List of authors and types of writing of the considered texts. Author is the author's name (the last name is in bold); Article * include, in addition to articles written for various newspapers and magazines, other types of writing that did not fit into the other categories, but relate to this category, such as prose, essays, and theatrical or musical chronicles. Regarding the list of possible features, we selected as elements to identify the author of a text inflexible parts of speech (IPoS) (i.e., those that do not change their form in the context of communication): conjunctions, prepositions, interjections, and adverbs. Of these, we only considered those that were single-word and we removed the words that may represent other parts of speech, as some of them may have different functions depending on the context, and we did not use any syntactic or semantic processing of the text to carry out such an investigation.
We collected a list of 24 conjunctions that we checked on dexonline.ro (i.e., site that contains explanatory dictionaries of the Romanian language) not to be any other part of speech (not even among the inflexible ones). We also considered 3 short forms, thus arriving at a list of 27 conjunctions. The process of selecting prepositions was similar to that of selecting conjunctions, resulting in a list of 85 (including some short forms).
The lists of interjections and adverbs were taken from: • ro.wiktionary.org/wiki/Categorie:Interject , ii_în_românȃ, accessed on 20 October 2022 • ro.wiktionary.org/wiki/Categorie:Adverbe_în_românȃ, accessed on 20 October 2022 To compile the lists of interjections and adverbs, we again considered only single-word ones and we eliminated words that may represent other parts of speech (e.g., proper nouns, nouns, adjectives, verbs), resulting in lists of 290 interjections and 670 adverbs.
The lists of the aforementioned IPoS also contain archaic forms in order to better identify the author. This is an important aspect that has to be taken into consideration (especially for our dataset which contains texts that were written over a time span of 3 centuries), as language is something that evolves and some words change as form and sometimes even as meaning or the way they are used.
From the lists corresponding to the considered IPoS features, we use only those that appear in the texts. Therefore, the actual lists of prepositions, adverbs, and conjunctions may be shorter. Details of the texts and the lists of inflexible parts of speech used can be found at reference [68].

Compared Methods
Below we present the methods we will use in our investigations.

Artificial Neural Networks
Artificial neural networks (ANN) is a machine learning method that applies the principle function approximation through learning by example (or based on provided training information) [69]. An ANN contains artificial neurons (or processing elements), organized in layers and connected by weighted arcs. The learning process takes place by adjusting the weights during the training process so that based on the input dataset the output outcome is obtained. Initially, these weights are chosen randomly.
The artificial neural structure is feedforward and has at least three layers: input, hidden (one or more), and output.
The experiments in this paper were performed using fast artificial neural network (FANN) [70] library. The error is RMSE. For the test set, the number of incorrectly classified items is also calculated.

Multi-Expression Programming
Multi-expression programming (MEP) is an evolutionary algorithm for generating computer programs. It can be applied to symbolic regression, time-series, and classification problems [71]. It is inspired by genetic programming [72] and uses three-address code [73] for the representation of programs.

K-Nearest Neighbors
K-nearest neighbors (k-NN) [75][76][77] is a simple classification method based on the concept of instance-based learning [78]. It finds the k items, in the training set, that are closest to the test item and assigns the latter to the class that is most prevalent among these k items found.
The source code of k-NN used in this paper is written by us and is available at github.com/sandaavram/ROST-source-code, (accessed on 29 November 2022) along other scripts and programs we wrote to perform the tests.

Support Vector Machine
A support vector machine (SVM) [79] is also a classification principle based on machine learning with the maximization (support) of separating distance/margin (vector). As in k-NN, SVM represents the items as points in a high-dimensional space and tries to separate them using a hyperplane. The particularity of SVM lies in the way in which such a hyperplane is selected, i.e., selecting the hyperplane that has the maximum distance to any item.
LIBSVM [80,81] is the support vector machine library that we used in our experiments. It supports classification, regression, and distribution estimation.

Decision Trees with C5.0
Classification can be completed by representing the acquired knowledge as decision trees [82]. A decision tree is a directed graph in which all nodes (except the root) have exactly one incoming edge. The root node has no incoming edge. All nodes that have outgoing edges are called internal (or test) nodes. All other nodes are called leaves (or decision) nodes. Such trees are built starting from the root by top-down inductive inference based on the values of the items in the training set. So, within each internal node, the instance space is divided into two or more sub-spaces based on the input attribute values. An internal node may consider a single attribute. Each leaf is assigned to a class. Instances are classified by running them through the tree starting from the root to the leaves.
See5 and C5.0 [83] are data mining tools that produce classifiers expressed as either decision trees or rulesets, which we have used in our experiments.

Numerical Experiments
To prepare the dataset for the actual building of the classification model, the texts in the dataset were shuffled and divided into training (50%), validation (25%), and test (25%) sets, as detailed in Table 8. In cases where we only needed training and test sets, we concatenated the validation set to the training set. We reiterated the process (i.e., shuffle and split 50%-25%-25%) three times and, thus, obtained three different training-validation-test shuffles from the considered dataset. Before building a numerical representation of the dataset as vectors of the frequency of occurrence of the considered features, we made a preliminary analysis to determine which of the inflexible parts of speech are more prevalent in our texts. Therefore, we counted the number of occurrences of each of them based on the lists described in Section 3. The findings are detailed in Table 9. Table 9. The occurrence of inflexible parts of speech considered. IPoS stands for Inflexible part of speech; No. of occurrence is the total number of occurrences of the considered IPoS in all texts; % from total words represents the percentage corresponding to the No. of occurrence in terms of the total number of words in all texts (i.e., 1,342,133); No. of files represents the number of texts in which at least one word from the corresponding IPoS list appears; Avg. per file represents the No. of occurrence divided by the total number of texts/files (i.e., 400); and No. of IPoS represents the list length (i.e., the number of words) for each corresponding IPoS.

IPoS
No Based on the data presented here, we decided not to consider interjections because they do not appear in all files (i.e., 44 files do not contain any interjections), and in the other files, their occurrence is much less compared to the rest of the IPoS considered. This investigation also allowed us to decide the order in which these IPoS will be considered in our tests. Thus, the order of investigation is prepositions, adverbs, and conjunctions. Therefore, we would first consider only prepositions, then add adverbs to this list, and finally add conjunctions as well. The process of shuffling and splitting the texts into trainingvalidation-test sets (described at the beginning of the current section, i.e., Section 5) was reiterated once more for each feature list considered. We, therefore, obtained different dataset representations, which we will refer further as described in Table 10. The last 3 entries (i.e., ROST-PC-1, ROST-PC-2, and ROST-PC-3) were used in a single experiment. Table 10. Names used in the rest of the paper refer to the different dataset representations and their shuffles. Only the first 9 entries (with the Designation written in bold) were used for the entire set of investigations. Correspondingly, we created different representations of the dataset as vectors of the frequency of occurrence of the considered feature lists. All these representations (i.e., trainingvalidation-test sets) can be found as text files at reference [68]. These files contain feature-based numerical value representations for a different text on each line. On the last column of these files, are numbers from 0 to 9 corresponding to the author, as specified in the first columns of Tables 6-8.

Results
The parameter setting for all 5 methods are presented in Appendix A, while Appendix B contains some prerequisite tests.
Most results are presented in a tabular format. The percentages contained in the cells under the columns named Best, Avg, or Error may be highlighted using bold text or gray background. In these cases, the percentages in bold represent the best individual results (i.e., obtained by the respective method on any ROST-*-* in the dataset, out of the 9 representations mentioned above), while the gray-colored cells contain the best overall results (i.e., compared to all methods on that specific ROST-X-n representation of the dataset).

ANN
Results that showed that ANN is a good candidate to solve this kind of problem and prerequisite tests that determined the best ANN configuration (i.e., number of neurons on the hidden layer) for each dataset representation are detailed in Appendix B.1. The best values obtained for test errors and the number of neurons on the hidden layer for which these "bests" occurred are given in Table 11. These results show that the best test error rates were mainly generated by ANNs that have a number of neurons between 27 and 49. The best test error rate obtained with this method was 23.46% for ROST-PAC-3, while the best average was 36.93% for ROST-PAC-2. Table 11. ANN results on the considered datasets. On each set, 30 runs are performed by ANNs with the hidden layer containing from 5 to 50 neurons. The number of incorrectly classified data is given as a percentage (the best results obtained by ANN on any ROST-*-* dataset representation are in bold). Best stands for the best solution (out of 30 runs on each of the 46 ANNs), Avg stands for Average (over 30 runs), StdDev stands for Standard Deviation, and No. of neurons stands for the number of neurons in the hidden layer of the ANN that produced the best solution. The best result obtained by ANN compared to all methods for a given ROST-X-n dataset representation is in a gray cell. We are interested in the generalization ability of the method. For this purpose, we performed full (30) runs on all datasets. The results, on the test sets, are given in Table 12. Table 12. MEP results on the considered datasets. A total of 30 runs are performed. The number of incorrectly classified data is given as a percentage (the best results obtained by MEP on any ROST-*-* dataset representation are in bold). Best stands for the best solution (out of 30 runs), Avg stands for Average (over 30 runs) and StdDev stands for Standard Deviation. The best result obtained by MEP compared to all methods for a given ROST-X-n dataset representation is in a gray cell.

Dataset
Best Avg StdDev With this method, we obtained an overall "best" on all ROST-*-*, which is 20.40%, and also an overall "average" best with a value of 27.95%, both for ROST-PA-2.
One big problem is overfitting. The error on the training set is low (they are not given here, but sometimes are below 10%). However, on the validation and test sets the errors are much higher (2 or 3 times higher). This means that the model suffers from overfitting and has poor generalization ability. This is a known problem in machine learning and is usually corrected by providing more data (for instance more texts for an author).

k-NN
Preliminary tests and their results for determining the best value of k for each dataset representation are presented in Appendix B.3.
The best k-NN results are given in Table 13 with the corresponding value of k for which these "bests" were obtained. It can be seen that for all ROST-P-*, the values of k were higher (i.e., k ≥ 8) than those for ROST-PA-* or ROST-PAC-* (i.e., k ≤ 4). The best value obtained by this method was 29.59% for ROST-PAC-2 and ROST-PAC-3. Table 13. k-NN results on the considered datasets. In total, 30 runs are performed with k varying with the run index. The number of incorrectly classified data is given as a percentage (the best results obtained by k-MM on any ROST-*-* dataset representation are in bold). Best stands for the best solution (out of the 30 runs), k stands for the value of k for which the best solution was obtained.

Dataset
Best k We ran tests for each kernel type and with nu varying from 0.1 to 1, as we saw in Figure A6 that for values less than 0.1, SVM is unlikely to produce the best results. The best results obtained are shown in Table 14. Table 14. SVM results on the considered datasets. The number of incorrectly classified data is given as a percentage (the best results obtained by SVM on any ROST-*-* dataset representation are in bold). Best stands for the best test error rate (out of 30 runs with nu ranging from 0.001 to 1), and nu stands for the parameter specific to the selected type of SVM (i.e., nu-SVC). Results are given for each type of kernel that was used by the SVM. The best result obtained by SVM compared to all methods for a given ROST-X-n dataset representation is in a gray cell. As can be seen, the best values were obtained for values of parameter nu between 0.2 and 0.6 (where sometimes 0.6 is the smallest value of the set {0.6, 0.7, . . . , 1} for which the best test error was obtained). The best value obtained by this method was 23.46% for ROST-PAC-1, using the linear kernel and nu parameter value 0.2.

Decision Trees with C5.0
Advanced pruning options for optimizing the decision trees with C5.0 model and their results are presented in Appendix B.5. The best results were obtained by using −m cases option, as detailed in Table 15. Table 15. Decision tree results on the considered datasets. The number of incorrectly classified data is given as a percentage (the best results obtained by DT with C5.0 on any ROST-*-* dataset representation are in bold). Error stands for the test error rate, Size stands for the size of the decision tree required for that specific solution and cases stands for the threshold for which is decided to have two more that two branches at a specific branching point (cases ∈ {1, 2, . . . , 30}). The best result obtained by DT with C5.0 compared to all methods for a given ROST-X-n dataset representation is in a gray cell. The best result obtained by this method was 24.5% on ROST-PAC-2, with −m 14 option, on a decision tree of size 12. When no options were used, the size of the decision trees was considerably larger for ROST-P-* (i.e., ≥57) than those for ROST-PA-* and ROST-PAC-* (i.e., ≤39).

Comparison and Discussion
The findings of our investigations allow for a twofold perspective. The first perspective refers to the evaluation of the performance of the five investigated methods, as well as to the observation of the ability of the considered feature sets to better represent the dataset for successful classification. The other perspective is to place our results in the context of other state-of-the-art investigations in the field of author attribution.

Comparing the Internally Investigated Methods
From all the results presented above, upon consulting the tables containing the best test error rates, and especially the gray-colored cells (which contain the best results while comparing the methods amongst themselves) we can highlight the following: • ANN: -Four best results for: ROST-PA-1, ROST-PA-3, ROST-PAC-2 and ROST-PAC-3 (see Table 11); -Best ANN 23.46% on ROST-PAC-3; best ANN average 36.93% on ROST-PAC-2; -Worst best overall 61.22% on ROST-P-1.
Other notes from the results are: • Best values for each method were obtained for ROST-PA-2 or ROST-PAC-*; • The worst of these best results were obtained for ROST-P-1 or ROST-P-2; • ANN and MEP suffer from overfitting. The training errors are significantly smaller than the test errors. This problem can only be solved by adding more data to the training set.
An overview of the average test results obtained by all five methods is given in Table 17. However, for ANN and MEP alone, we could generate different results with the same parameters, based on different starting seed values, with which we ran 30 different runs. For the other 3 methods, we used the best results obtained with a specific set of parameters (as in Table 16).
Comparing all 5 methods based on averages, SVM and DT take the lead as the two methods that share the 1st and 2nd places with two exceptions, i.e., for ROST-P-2 and ROST-P3 for which SVM and DT, respectively, rank 3rd. k-NN usually ranks 3rd, with four exceptions, when k-NN was ranked 2nd for ROST-P-2 and ROST-P-3, for ROST-PA-1 for which k-NN ranks 1st together with SVM and DT, and for ROST-PA-2 for which k-NN ranks 4th. MEP is generally ranked 4th with one exception, i.e., for ROST-PA-2 for which it ranks 3rd. ANN ranks last for all ROST-*-*.
For a better visual representation, we have plotted the results from Tables 16 and 17 in Figure 1.  We performed statistical tests to determine whether the results obtained by MEP and ANN are significantly different with a 95% confidence level. The tests were two-sample, equal variance, and two-tailed T-tests. The results are shown in Table 18. Table 18. p-values obtained when comparing MEP and ANN results over 30 runs. No. of neurons used by ANN on the hidden layer represents the best-performing ANN structure on the specific ROST-*-*. The p-values obtained show that the MEP and ANN test results are statistically significantly different for almost all ROST-*-* (i.e., p < 0.05) with one exception, i.e., for ROST-PAC-2 for which the differences are not statistically significant (i.e., p = 0.107).
Next, we wanted to see which feature set, out of the three we used, was the best for successful author attribution. Therefore, we plotted all best and best average results obtained with all methods (as presented in Tables 16 and 17) on all ROST-*-* and aggregated on the three datasets corresponding to the distinct feature lists, in Figure 2.  Table 17.
Based on the results represented in Figure 2a (i.e., which considered only the best results, as detailed in Table 16) we can conclude that we obtained the best results on ROST-PA-* (i.e., corresponding to the 415 feature set, which contains prepositions and adverbs). However, using the average results, as shown in Figure 2b and detailed in Table 17 we infer that the best performance is obtained on ROST-PAC-* (i.e., corresponding to the 432-feature set, containing prepositions, adverbs, and conjunctions).
Another aspect worth mentioning based on the graphs presented in Figure 2 is related to the standard deviation (represented as error bars) between the results obtained by all methods considered on all considered datasets. Standard deviations are the smallest in Figure 2a, especially for ROST-PA-* and even more so for ROST-PAC-*. This means that the methods perform similarly on those datasets. For ROST-P-* and in Figure 2b, the standard deviations are larger, which means that there are bigger differences between the methods.

Comparisons with Solutions Presented in Related Work
To better evaluate our results and to better understand the discriminating power of the best performing method (i.e., MEP on ROST-PA-2), we also calculate the macro-accuracy (or macro-average accuracy). This metric allows us to compare our results with the results obtained by other methods on other datasets, as detailed in Table 2. For this, we considered the test for which we obtained our best result with MEP, with a test error rate of 20.40%. This means that 20 out of 98 tests were misclassified.
To perform all the necessary calculations, two vectors are required, i.e., the vector of targets or true values (i.e., authors/classes that were the actual authors of the test texts) and the vector of outputs or predicted values (i.e., authors/classes identified by the algorithm as the authors of the test texts). The resulting Confusion Matrix is depicted in Table 19. Table 19. Confusion Matrix (on the right side). Column headers and row headers (i.e., numbers from 0 to 9 that are written in bold) are the codes 1 given to our authors, as specified on the left side. This matrix is a representation that highlights for each class/author the true positives (i.e., the number of cases in which an author was correctly identified as the author of the text), the true negatives (i.e., the number of cases where an author was correctly identified as not being the author of the text), the false positives (i.e., the number of cases in which an author was incorrectly identified as being the author of the text), the false negatives (i.e., the number of cases where an author was incorrectly identified as not being the author of the text). For binary classification, these four categories are easy to identify. However, in a multiclass classification, the true positives are contained in the main diagonal cells corresponding to each author, but the other three categories are distributed according to the actual authorship attribution made by the algorithm.
For each class/author, various metrics are calculated based on the confusion matrix. They are: • Precision-the number of correctly attributed authors divided by the number of instances when the algorithm identified the attribution as correct; • Recall (Sensitivity)-the number of correctly attributed authors divided by the number of test texts belonging to that author; • F-score-a combination of the Precision and Recall (Sensitivity).
Based on these individual values, classification performance metrics are calculated. The overall results are shown in Table 20. We have 400 documents, and 10 authors in our dataset. The content of our texts is crossgenre (i.e., stories, short stories, fairy tales, novels, articles, and sketches) and cross-topic (as in different texts, different topics are covered). We also calculated an average number of words per document, which is 3355, and the imbalance (considered in [10] to be the standard deviation of the number of documents per author), which in our case is 10.45. Our type of investigation can be considered to be part of the Ngram class (this class and other investigation-type classes are presented in Section 2.4). Next, we recreated Table 2 (depicted in Section 2.4) while reordering the datasets based on their macro-accuracy results obtained by Ngram class methods in reverse order, and we have appropriately placed details of our own dataset and the macro-accuracy we achieved with MEP as shown above. This top is depicted in Table 21. Table 21. State of the art macro-accuracy of authorship attribution models. Information collected from [10] ( Tables 1 and 3). Name is the name of the dataset; No.docs represents the number of documents in that dataset; No. auth represents the number of authors; Content indicates whether the documents are crosstopic (× t ) or cross-genre (× g ); W/D stands for words per documents, being the average length of documents; imb represents the imbalance of the dataset as measured by the standard deviation of the number of documents per author. We would like to underline the large imbalance of our dataset compared with the first three datasets, the fact that we had fewer documents, and the fact that the average number of words in our texts, although higher, has a large standard deviation, as already shown in Table 3. Furthermore, as already presented in Section 3, our dataset is by design very heterogeneous from multiple perspectives which are not only in terms of content and size, but also the differences that pertain to the time periods of authors, the medium they wrote for (paper or online media), and the sources of the texts. Although all these aspects do not restrict the new test texts to certain characteristics (to be easily classified by the trained model), they make the classification problem even harder.

Conclusions and Further Work
In this paper, we introduced a new dataset of Romanian texts by different authors. This dataset is heterogeneous from multiple perspectives, such as the length of the texts, the sources from which they were collected, the time period in which the authors lived and wrote these texts, the intended reading medium (i.e., paper or online), and the type of writing (i.e., stories, short stories, fairy tales, novels, literary articles, and sketches). By choosing these very diverse texts we wanted to make sure that the new texts do not have to be restricted by these constraints. As features, we wanted to use the inflexible parts of speech (i.e., those that do not change their form in the context of communication): conjunctions, prepositions, interjections, and adverbs. After a closer investigation of their relevance to our dataset, we decided to use only prepositions, adverbs, and conjunctions, in that specific order, thus having three different feature lists of (1) 56 prepositions; (2) 415 prepositions and adverbs; and (3) 432 prepositions, adverbs, and conjunctions. Using these features, we constructed a numerical representation of our texts as vectors containing the frequencies of occurrence of the features in the considered texts, thus obtaining 3 distinct representations of our initial dataset. We divided the texts into training-validation-test sets of 50%-25%-25% ratios, while randomly shuffling them three times in order to have three randomly selected arrangements of texts in each set of training, validation, and testing.
To build our classifiers, we used five artificial intelligence techniques, namely artificial neural networks (ANN), multi-expression programming (MEP), k-nearest neighbor (k-NN), support vector machine (SVM), and decision trees (DT) with C5.0. We used the trained classifiers for authorship attribution on the texts selected for the test set. The best result we obtained was with MEP. By using this method, we obtained an overall "best" on all shuffles and all methods, which is of a 20.40% error rate.
Based on the results, we tried to determine which of the three distinct feature lists lead to the best performance. This inquiry was twofold. First, we considered the best results obtained by all methods. From this perspective, we achieved the best performance when using ROST-PA-* (i.e., the dataset with 415 features, which contains prepositions and adverbs). Second, we considered the average results over 30 different runs for ANN and MEP. These results indicate that the best performance was achieved when using ROST-PAC-* (i.e., the dataset with 432 features, which contains prepositions, adverbs, and conjunctions).
We also calculated the macro-accuracy for the best MEP result to compare it with other state-of-the-art methods on other datasets.
Given all the trained models that we obtained, the first future work is using ensemble decision. Additionally, determining whether multiple classifiers made the same error (i.e., attributing one text to the same incorrect author instead of the correct one) may mean that two authors have a similar style. This investigation can also go in the direction of detecting style similarities or grouping authors into style classes based on such similarities.
Extending our area of research is also how we would like to continue our investigations. We will not only fine-tune the current methods but also expand to the use of recurrent neural networks (RNN) and convolutional neural networks (CNN).
Regarding fine-tuning, we have already started an investigation using the top N most frequently used words in our corpus. Even though we have some preliminary results, this investigation is still a work in progress.
Using deep learning to fine-tune ANN is another direction we would like to tackle. We would also like to address overfitting and find solutions to mitigate this problem.
Linguistic analysis could help us as a complementary tool for detecting peculiarities that pertain to a specific author. For that, we will consider using long short-term memory (LSTM) architectures and pre-trained BERT models that are already available for Romanian. However, considering that a large section of our texts was written one or two centuries ago, we might need to further train BERT to be able to use it in our texts. That was one reason that we used inflexible parts of speech, as the impact of the diachronic developments of the language was greatly reduced.
We would also investigate the profile-based approach, where texts are treated cumulatively (per author) to build a profile, which is a representation of the author's style. Up to this point we have treated the training texts individually, an approach called instance-based.
In terms of moving towards other types of neural networks, we would like to achieve the initial idea from which this entire area of research was born, namely finding a "fingerprint" of an author. We already have some incipient ideas on how these instruments may help us in our endeavor, but these new directions are still in the very early stages for us.
Improving upon the dataset is also high on our priority list. We are considering adding new texts and new authors.  Data Availability Statement: The proposed, used and analyzed dataset is available at https://www.ka ggle.com/datasets/sandamariaavram/rost-romanian-stories-and-other-texts. The source code that we wrote to perform the tests are available at https://github.com/sanda-avram/ROST-source-code, accessed on 28 November 2022. The data presented in Tables 2 and 21 Table A2. These parameters were obtained mostly through experimentation. We thought that small errors on the training set would also lead to small errors on the test set. However, we were wrong: the main problem we encountered was overfitting and poor generalization ability of the model. Thus, other sets of parameters can also generate similar results on the test set even if the training error will be higher. The k-NN considers only training and test data. Thus, we have training sets of 302 items, while the test contains 98 items. During the tests, we varied the value of k from 1 to 30. This is because we observed (as is depicted in Figure A1) that with higher values we would not obtain better results, as the results tend to deteriorate as the value of k increases. However, this depends on the number of features, as the results become bad faster for a consistent number (>100) of features, as for ROST-PC-*, compared to the evolution of the results for ROST-P-*, where the results do not deteriorate so fast by increasing the value of k in the case of a smaller number (<100) of features. To calculate the distance between the test value and the ones in the training set, we used Euclidean distance. Support vector machines also consider only training and test data. Therefore, the training sets consist of 302 items, while the test sets contain 98 items. We experimented with the type of kernel and nu parameters, selecting values that varied through all possible kernel types and values from 0.001 to 1 for nu. For the type of kernel, the best results were obtained for linear. For nu we had different values that gave better results depending on the dataset. However, even though we tried with values starting from 0.001, the best results were obtained for nu ≥ 0.2. We also changed the seed for the random function with no effect on the results. The SVM parameters are given in Table A3. As with k-NN and SVM, the decision trees with the C5.0 algorithm also use only training and test data. Thus, there are 302 items in the training sets and 98 items in the tests. All considered features/attributes, which in our case are: prepositions, adverbs, and conjunctions, are set for the Decision Trees with C5.0 as "continuous" because they are numerical float values between 0 and 1 representing the frequency of occurrence in terms of the total number of words in the file in which they occur. For authors, we set explicit-defined discrete values from 0 to 9 for the 10 authors (as specified in the first columns of Tables 6-8). To improve our results, we experimented with advanced pruning options. These parameters along with others we used for decision trees with C5.0 are shown in Table A4. Apart from these parameters we also used the option −I seed, to set the random seed, with seed ∈ {1, 2, . . . , 9, 10, 20, . . . , 100}, without any effect on the results. Table A4. Parameters for decision trees with C5.0.

Appendix B. Prerequisite Tests and Results
Appendix B.1. ANN As a prerequisite, we are interested in seeing how ANN evolves while training on the data. The ANN error evolution on a training set is shown in Figure A2. It can be seen here that within 20 epochs, the training error drops below 0.1, while within 60 epochs, it reaches 0. Thus, ANN can be used to solve this kind of problem.
Next, we want to find a good value for the number of neurons on the hidden layer. In total, 30 runs were performed with the number of neurons (on the hidden layer) varying from 5 to 50.
The results for the 9 ROST-*-* are presented in Figure A3. These graphics show that, using only 56 features (i.e., ROST-P-*) tests errors were very high, while with an increased number of features: i.e., 415 (i.e., ROST-PA-*) and 432 (i.e., ROST-PAC-*), respectively, test errors are significantly reduced. Moreover, it appears that the test error values tend to stabilize between 40 and 50 neurons on the hidden layer.

Appendix B.2. MEP
We are interested to see if MEP is able to discover a classifier and then to see how well it performs on new (test) data. The evolution of MEP error on a training set is shown in Figure A4. One can see that the error rapidly drops from over 65% to 15%. This means that MEP can handle this type of problem.  To optimize this method, we tried the advanced pruning options. For this we tried 3 options: • −g, which disables the global tree pruning mechanism that prunes parts (of an initially large tree) that are predicted to have high error rates. • −c CF, changes the estimation of error rates. This affects the "severity of pruning". CF stands for confidence level and is a percentage. We chose values from 10 to 100 for the CF parameter. • −m cases, which influences the construction of the decision tree by having at least 2 branches at each branch point for which there are more than cases training items. The default value for cases is 2. We have selected values from 1 to 30 for the cases parameter.
The results obtained using decision trees with C5.0 are detailed in Table A5. Table A5. Decision tree results on the considered datasets. The number of incorrectly classified data is given as a percentage. Result sets are grouped into columns of Error, Size, and sometimes a parameter. The first set of Error, Size columns represent the results obtained with no options. −g stands for global tree pruning is disabled, −c CF stands for setting the confidence level via the CF parameter, and −m cases stands for controlling how the decision tree is built by using the cases parameter. Error stands for the test error rate, Size stands for the size of the decision tree required for that specific solution, CF stands for "confidence level" (CF ∈ {10, 20, . . . , 100}), and cases stands for the threshold for which is decided to have two more that two branches at a specific branching point (cases ∈ {1, 2, . . . , 30}). Using the −g option, it can be seen that most of the trees have become larger (as expected since global tree pruning was disabled by this option). Changes in the results are marked in the table with the values in the boxes. However, most results remained the same, two worsened (i.e., for ROST-P-2 from 53.1% to 54.1% and for ROST-PAC-3 from 32.7% to 33.7%), while only one result improved (i.e., for ROST-PA-2 from 28.6% to 27.6%).
When using the −c CF option, almost all results were similar to those obtained without using any option. The exceptions (marked with boxes), i.e., for ROST-PA-2 better results (i.e., 27.7% vs. 28.6% and the same as using the −g option) were obtained by using a larger tree (i.e., 43 compared to 38; in this case larger than using the −g option, which had a tree size of 42). In the case of ROST-P-3, only the tree size was optimized from 64 to 56 for CF = 10. For CF > 20, both the error and the tree size remained the same as without using any option, while for CF = 20, the tree size was slightly reduced (i.e., 63 from 64), but the error was higher (i.e., 70.4% from 69.4%).
Using the −m cases option, we obtained improvements, as shown in Table A5. All error rates were improved, while the improvement in the tree size, although in some cases was significant (i.e., from 60 to 18, from 39 to 13, from 37 to 12, or from 38 to 14), in other cases the tree size increased or remained large (i.e., for ROST-P-3 from 64 to 99, for ROST-PA-2 from 38 to 58, and for ROST-PAC-1 it remained the same as when no option was used). For these three ROST-*-* mentioned above for which the tree size increased or remained large, the value of the cases parameter was very low (i.e., 1 or 2). For ROST-P-2 and ROST-PA-3, there is cases = 3 and the tree size did not change that much (i.e., from 38 to 31) and remained the same, respectively. For ROST-P-1, ROST-PA-1, ROST-PAC-2, and ROST-PA-3, cases ≥ 8, and the three decision trees have greatly reduced in size to values ≤18.
To show the evolution of the error rates for the three datasets considered, we plotted the results of the decision trees obtained using C5.0 with the −m option, with cases varying from 1 to 30. The results are shown in Figure A7.