Artificial text generation has been an active area of NLP research in recent years. This section investigates the challenges associated with machine-generated text and the imperative need for robust detection mechanisms. Text generation with generative language models has transformed NLP [14], facilitating content enrichment while also opening a debate on ethical considerations, as highlighted by Bender [15], Brown [16], Michel-Villarreal [17], Farrelly [18], and other researchers. Ethical considerations play a critical role in the responsible development and deployment of these language models, prompting discussions on issues such as bias, misinformation, and the societal impact of AI-generated content. As advancements in NLP have led to the proliferation of sophisticated language models, the potential misuse of these models for generating deceptive or malicious content has become a growing concern. Detecting machine-generated text poses a unique set of challenges, including the ability of these models to mimic the intricacies of human language and to bypass traditional detection methods. Additional challenges emerge for less-resourced languages such as Romanian, whose specificities necessitate dedicated detection systems.
The subsequent sections provide a brief overview of the main neural architectures employed in this study. Following this, we examine dedicated language models designed for the Romanian language, along with popular multilingual models that build on the same foundational architectures. This contextualization lays the foundation for an exploration of the existing landscape in machine-generated text detection relevant to the scope of this research.
2.1. Neural LLM Architectures
Generative Pre-Trained Transformer, widely known as GPT, is a class of NLP models (https://platform.openai.com/docs/models/gpt-4, accessed on 8 January 2024) designed for language understanding and generation tasks. GPT is built on the Transformer architecture [3], incorporating self-attention mechanisms to model long-range textual dependencies efficiently. GPT models are pre-trained on a massive corpus of text using unsupervised learning, allowing them to learn a rich language representation from the data without any task-specific labels. During pre-training, GPT models use a causal (autoregressive) language modeling objective, predicting the next word given the preceding context, which allows the model to capture contextual information and relationships between words in a text. GPT models are then fine-tuned on downstream tasks using supervised learning. Fine-tuning leverages the knowledge acquired during pre-training, which allows GPT to achieve remarkable performance via transfer learning, even with limited data. The GPT architecture has evolved from GPT1 to GPT4, each released version being characterized by an increase in model size. GPT1 (117 M parameters, max sequence length of 512 tokens) [13] was originally trained on a combination of two datasets: Common Crawl and Book Corpus. This first version was, however, prone to generating repetitive text and could not track long-term dependencies, producing coherent results only for short text sequences. GPT2 (1.5 B parameters, max sequence length of 1024 tokens) [2] was trained on a considerably larger corpus combining Common Crawl, Book Corpus, and Web Text, and was capable of generating more human-like answers. Similar to its predecessor, it performed well on shorter texts but lacked coherence and reasoning on longer ones. GPT3 (175 B parameters, max sequence length of 2048 tokens) [16] further increased the training corpus, incorporating Wikipedia, books, articles, and other sources, amounting to nearly a trillion words. GPT3's capabilities include generating coherent texts over longer sequences, understanding context, writing computer code, and even creating art. GPT4 [19] was pre-trained to predict the next token in a text and fine-tuned using Reinforcement Learning from Human Feedback [20]. Specifics on model training and size have not yet been publicly released. GPT4 provides significant improvements compared to its previous versions, being capable of processing images and audio and providing coherent answers. The GPT series introduced variations in its architecture, such as layer normalization, gradient accumulation, and positional encodings, to enhance model training and performance.
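As a concrete illustration of the autoregressive pre-training objective described above, the following minimal sketch (assuming the Hugging Face transformers library and the publicly released gpt2 checkpoint; the model choice and prompt are illustrative, not taken from the cited studies) prints the most probable continuations of a short prefix.

```python
# Minimal sketch: next-token prediction with a GPT-style causal language model.
# Assumes the Hugging Face transformers library and the public "gpt2" checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Machine-generated text detection is"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)

# Distribution over the next token, conditioned on the whole prefix
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(i))!r}: {p.item():.3f}")
```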
The Text-to-Text Transfer Transformer (T5) architecture [21] is a transformative approach in NLP, in which language tasks are formulated as unified text-to-text problems leveraging transfer learning. T5 is also built on the Transformer architecture, relying on self-attention mechanisms and capturing contextual information across input and output tokens. T5 uses an encoder–decoder architecture, where the encoder processes the input text and the decoder generates the output text, and employs positional encodings to preserve token order information. The authors used SentencePiece [22] to encode text as WordPiece [23] tokens, resulting in a vocabulary of 32,000 word pieces. The model was pre-trained on a set of languages, including Romanian. Several model versions of different sizes have been released, e.g., T5-Small (60 M parameters), T5-Base (220 M parameters), T5-Large (770 M parameters), T5-3B (3 B parameters), and T5-11B (11 B parameters). The models achieved state-of-the-art results on various tasks such as machine translation, abstractive summarization, question answering, and text classification.
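As an illustration of the text-to-text formulation, the following minimal sketch (assuming the transformers library and the public t5-small checkpoint, choices made for this example rather than details from [21]) performs English-to-Romanian translation, one of the tasks covered in pre-training, simply by prepending a task prefix to the input.

```python
# Minimal sketch of T5's text-to-text interface: the task is specified as a text prefix.
# Assumes the transformers library and the public "t5-small" checkpoint (illustrative).
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to Romanian: The weather is nice today.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```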
Finetuned Language Net (FLAN) [24] emphasized fine-tuning as a key component of its design for achieving high performance across various tasks and improving the zero-shot learning ability of language models. FLAN was fine-tuned on a large set of instructions covering more than 470 NLP datasets and 1800 tasks, which enables the model to follow instructions even for unseen tasks. The model was evaluated on language inference, reading comprehension, question answering, machine translation, common sense reasoning, coreference resolution, and additional tasks such as sentiment analysis, paraphrase detection, and struct-to-text. FLAN-T5 outperformed GPT3 on zero-shot prompting on 20 out of 25 tasks.
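The instruction-following behavior can be illustrated with a short zero-shot prompt; the sketch below assumes the transformers library and the publicly released google/flan-t5-base checkpoint, which stand in for the models evaluated in [24].

```python
# Minimal sketch of zero-shot instruction following with an instruction-tuned model.
# The checkpoint "google/flan-t5-base" is an illustrative choice, not the model from [24].
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = ("Premise: The committee approved the new regulation yesterday. "
          "Hypothesis: The regulation was rejected. "
          "Does the premise entail the hypothesis? Answer yes or no.")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```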
Bidirectional and Auto-Regressive Transformers (BART) [25] is a denoising auto-encoder that combines the BERT and GPT architectures, using a sequence-to-sequence model with a bidirectional encoder and a left-to-right decoder. During pre-training, the order of sentences is shuffled and spans of text are replaced with mask tokens using an infilling scheme. BART is effective in both text generation and comprehension, achieving good results on various NLP tasks such as machine translation, abstractive dialogue, question answering, and summarization. The model was released in two standard versions, BART-base (6 encoder and decoder layers, 140 M parameters) and BART-large (12 encoder and decoder layers, 400 M parameters), together with three fine-tuned versions of BART-large: on MNLI [26], a bitext classification task that predicts whether one sentence entails another (BART-large-mnli); on CNN/DM [27], a news summarization dataset (BART-large-cnn); and on XSum [28], a news summarization dataset with highly abstractive summaries (BART-large-xsum).
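As an example of how such a fine-tuned checkpoint is typically used, the sketch below (assuming the transformers library and the facebook/bart-large-mnli checkpoint) applies BART-large-mnli to zero-shot topic classification by casting each candidate label as an entailment hypothesis; the labels and input text are illustrative.

```python
# Minimal sketch: zero-shot classification with the BART-large-mnli checkpoint,
# using the transformers pipeline API (labels and input text are illustrative).
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The government announced a new package of economic measures today.",
    candidate_labels=["politics", "sports", "technology"],
)
print(result["labels"][0], round(result["scores"][0], 3))
```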
2.3. Detection Mechanisms
Due to the outstanding performance of generative language models in producing high-quality texts extremely similar to human writing, machine-generated text detection algorithms have become a necessity in the research field. Despite the release of various detection systems, none of them has proven to be foolproof. This subsection explores recent state-of-the-art detection methods and algorithms.
According to Chakraborty [11], recent research in MGT detection can be split into several categories, namely statistical approaches, classification-based detection models, zero-shot detection, LLM fine-tuning-based detection methods, and watermark-based identification.
Statistical approaches use metrics like entropy, perplexity, or n-gram frequency to distinguish machine-generated from human-written content [43,44]. More recent studies proposed DetectGPT [45], which posits that artificially generated text tends to lie in regions of negative curvature of the model's log-probability function. The model uses Gradient Boosting [46], a Machine Learning (ML) technique that trains multiple models sequentially and combines them into a more accurate one, and outperforms other zero-shot methods with high AUC [47] scores. The algorithm extracts a set of 35 features from the input text, such as sentence length, punctuation usage, and the frequency of particular words and phrases. These features are fed to the Gradient Boosting model, which predicts whether the text is GPT-generated. The model achieved an F1-score of 98.6%. Even though the DetectGPT algorithm is detailed in the Mitchell [45] paper, the implementation was neither open-sourced nor publicly available until recently. For this reason, the research community created a public implementation (https://github.com/BurhanUlTayyab/DetectGPT, accessed on 8 January 2024) of the algorithm. The approach preserves the idea of the original paper; however, there are a few differences in the feature extraction and the model training process. In the public implementation, 20 features are extracted from the input text, including the frequency of particular words and phrases, sentence length, punctuation usage, and the presence of certain characters. These features are then fed to the Gradient Boosting model, which is trained using cross-validation to ensure that the model generalizes to new data. This model achieved an F1-score of 96.3%, slightly lower than that of the original implementation.
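A simplified sketch of such a feature-based detector is given below; it uses a handful of hand-crafted stylometric features and scikit-learn's GradientBoostingClassifier, and is only meant to illustrate the pipeline, not to reproduce the exact 35- or 20-feature sets or the reported scores.

```python
# Simplified sketch of a feature-based Gradient Boosting detector: hand-crafted
# stylometric features (illustrative, not the original feature sets) fed to a
# scikit-learn GradientBoostingClassifier trained with cross-validation.
import re
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def extract_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    sent_lengths = [len(s.split()) for s in sentences] or [0]
    return [
        np.mean(sent_lengths),                    # average sentence length
        np.std(sent_lengths),                     # sentence-length variability
        text.count(",") / max(len(words), 1),     # punctuation usage
        len(set(words)) / max(len(words), 1),     # lexical diversity
    ]

def train_detector(texts, labels):
    """texts: list of documents; labels: 1 = machine-generated, 0 = human-written."""
    X = np.array([extract_features(t) for t in texts])
    y = np.array(labels)
    clf = GradientBoostingClassifier()
    print("cross-validated F1:", cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
    return clf.fit(X, y)
```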
Another statistical approach for MGT detection is GPTZero [48], which uses perplexity [49] and burstiness [50] to classify texts. Perplexity is a measure of text randomness widely used in NLP. Human-written text is considered less structured and more unpredictable; therefore, its perplexity value should be higher, whereas text generated by AI should have a lower perplexity score. Burstiness considers variables not accounted for by perplexity to improve text analysis; the term refers to the appearance of uncommon tokens in random clusters. Artificially generated text tends to have a more consistent structure than human-written text. GPTZero uses these two measures to determine whether a text is human- or AI-generated: through perplexity, it evaluates how well a language model predicts the next token, while through burstiness, it assesses the distribution of sentences. Detection relies on the idea that humans tend to mix long and short sentences, while AI-produced sentences are more uniform. Recent research [51] presents encouraging findings on the identification of AI-generated texts using GPTZero. The study reports an accuracy of 0.80 with a 95% confidence interval, a specificity of 0.90, and a sensitivity of 0.65. The conclusion drawn is that GPTZero exhibits a low false-positive rate (misclassifying human-written texts as machine-generated) and a high false-negative rate (misclassifying machine-generated texts as human-written).
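The two signals can be approximated as follows; the sketch assumes the transformers library and uses GPT-2 as the scoring model (GPTZero's actual implementation is proprietary), with burstiness taken here as the variability of per-sentence perplexities.

```python
# Minimal sketch of perplexity and burstiness scoring with a causal language model.
# GPT-2 is an illustrative stand-in for GPTZero's proprietary scoring model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss       # mean negative log-likelihood per token
    return torch.exp(loss).item()

def burstiness(text):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    ppls = torch.tensor([perplexity(s) for s in sentences])
    return ppls.std().item() if len(ppls) > 1 else 0.0

# Human-written text is expected to score higher on both measures than
# machine-generated text of comparable length.
```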
Despite the performance exhibited by DetectGPT and GPTZero on English corpora, both models had extremely poor results on our Romanian dataset, which is described in the following section. In our preliminary experiments, GPTZero misdetected all human texts as AI-generated, while DetectGPT misclassified most AI texts as being written by humans. As outlined earlier, pre-existing solutions are primarily designed for high-resourced languages like English and exhibit limited performance, or complete inoperability, when applied to Romanian. As such, specialized solutions are needed to deal with the specificity of the Romanian language.
Zero-shot detection is showcased by Gehrmann [44] in the Giant Language Model Test Room (GLTR) study. The underlying idea is that LLMs generate from a limited subset of the true distribution of natural language, namely the one for which they have high confidence. To test whether a text is machine-generated, the authors use three approaches: (1) compute the probability of the word, (2) compute the absolute rank of the word, and (3) compute the entropy of the predicted distribution. The first two tests evaluate whether a word was sampled from a model similar to the detection model; in contrast, the last one verifies whether the previously generated context is well known to the detection system, such that it is confident in its next prediction.
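The three per-token tests can be computed directly from a language model's output distribution; the sketch below assumes the transformers library and uses GPT-2 as the detection-side model, an illustrative setup rather than a fixed requirement of GLTR.

```python
# Minimal sketch of GLTR-style per-token statistics: probability of the observed
# token, its rank in the predicted distribution, and the entropy of that distribution.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def gltr_statistics(text):
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]            # (seq_len, vocab_size)
    stats = []
    for pos in range(1, len(ids)):                            # predict token `pos` from its prefix
        probs = torch.softmax(logits[pos - 1], dim=-1)
        token_id = int(ids[pos])
        prob = probs[token_id].item()                         # (1) probability of the word
        rank = int((probs > probs[token_id]).sum()) + 1       # (2) absolute rank of the word
        entropy = torch.distributions.Categorical(probs=probs).entropy().item()  # (3) entropy
        stats.append((tokenizer.decode(token_id), prob, rank, entropy))
    return stats
```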
Classifier-based methods are widespread in detection paradigms, while watermark-based methods represent an innovative alternative to the above-mentioned approaches [52]. In the early days, watermarking was used in computer vision and image processing to ensure copyright protection [53]. Recently, Kirchenbauer [54] proposed the use of watermarks with LLMs, embedding signals into the generated text that are undetectable to human observers but can be recognized by open-source algorithms without access to the language model's API or parameters. The method works by selecting a randomized set of “green” tokens before each word is generated and then softly promoting the use of green tokens during sampling. It does, however, require access to the language model while the text is being generated.
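A simplified version of the green-token mechanism is sketched below: at each generation step, a pseudo-random subset of the vocabulary (seeded here by the previous token) is marked green, and its logits receive a small positive bias. The parameter names gamma and delta are illustrative, following the spirit of [54] rather than its exact implementation.

```python
# Simplified sketch of soft "green list" watermarking: bias the next-token logits
# toward a pseudo-random subset of the vocabulary derived from the previous token.
import torch

def watermark_logits(logits, prev_token, vocab_size, gamma=0.5, delta=2.0):
    """logits: (vocab_size,) next-token logits; gamma: green-list fraction; delta: bias."""
    g = torch.Generator().manual_seed(int(prev_token))        # stand-in for a hash of the context
    green_ids = torch.randperm(vocab_size, generator=g)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green_ids] += delta                                 # softly promote green tokens
    return biased
```

During detection, the same seeding rule recovers the green list at each position, and a text containing a statistically improbable fraction of green tokens is flagged as watermarked, without requiring access to the language model itself.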
Another detection approach based on text classification was proposed by OpenAI (https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text, accessed on 8 January 2024) and consisted of fine-tuning a GPT model with data from Wikipedia, WebText [55], and human input data to create an interface for a discrimination task using outputs produced by 34 language models. Their approach combined the classifier-based method with human evaluation to determine whether a text was artificially generated. Nonetheless, this approach has some limitations: the text must contain at least 1000 characters, and the classifier was primarily trained on English corpora, making it inappropriate for multilingual use cases. Its authors recommend using the classifier only for English text, since it performs significantly worse in other languages. In preliminary evaluations on a set of English texts, the model correctly identified 26% of AI-written texts (true positives) and incorrectly labeled human-authored texts as AI-written in 9% of cases (false positives).
Simpler classifier methods involve ML models such as XGBoost [56]. In that approach, the input features are based on TF-IDF scores and hand-crafted linguistic features derived from characters and punctuation. The authors achieved an F1-score of 99% for detecting ChatGPT-generated text. However, such a model can easily overfit due to sample bias, requiring a large training dataset to overcome this drawback.
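A minimal sketch of such a classifier is given below, combining TF-IDF features with two illustrative character/punctuation features and an XGBoost model; the exact feature set of [56] is not reproduced.

```python
# Minimal sketch of a TF-IDF + hand-crafted-feature XGBoost detector (illustrative
# features, not the exact feature set of the cited study).
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

def train(texts, labels):
    """texts: list of documents; labels: 1 = machine-generated, 0 = human-written."""
    vectorizer = TfidfVectorizer(max_features=5000)
    tfidf = vectorizer.fit_transform(texts)
    extra = csr_matrix([[len(t), t.count("!") + t.count("?")] for t in texts],
                       dtype=float)                    # text length, punctuation counts
    X = hstack([tfidf, extra])
    clf = XGBClassifier(n_estimators=300, max_depth=6)
    clf.fit(X, np.array(labels))
    return clf, vectorizer
```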
Among the various detection methods, we can also include fine-tuning language models for binary classification [2]. Solaiman et al. [2] used a sequence classifier based on RoBERTa-base and RoBERTa-large, which achieved 95% accuracy in detecting GPT2-generated text. The advantage of this method is bidirectionality, which makes discriminative classification models more powerful for detection than generative classification models.
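Fine-tuning such a discriminative classifier typically follows the standard sequence-classification recipe; the sketch below assumes the transformers Trainer API, the roberta-base checkpoint, and a labeled dataset of human and machine texts, with illustrative hyperparameters rather than those of Solaiman et al. [2].

```python
# Minimal sketch of fine-tuning RoBERTa as a binary human-vs-machine text classifier.
# Checkpoint, hyperparameters, and dataset handling are illustrative assumptions.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

args = TrainingArguments(output_dir="mgt-detector", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)

# Assuming train_ds and eval_ds are datasets.Dataset objects with "text" and "label"
# columns, already mapped through `tokenize`:
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```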
However, despite the numerous studies targeting MGT detection, Krishna et al. [57] highlight that paraphrased text escapes existing detectors, including watermarking, DetectGPT, and GPTZero, with an important drop in their performance. Therefore, text perturbations may affect detector accuracy, hence the increasing need for more reliable and robust detection systems.