Future Internet
  • Article
  • Open Access

25 January 2024

Beyond Lexical Boundaries: LLM-Generated Text Detection for Romanian Digital Libraries

1 Faculty of Automatic Control and Computers, National University of Science and Technology Politehnica Bucharest, 313 Splaiul Independentei, 060042 Bucharest, Romania
2 Academy of Romanian Scientists, Str. Ilfov, Nr. 3, 050044 Bucharest, Romania
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Digital Analysis in Digital Humanities

Abstract

Machine-generated content reshapes the landscape of digital information; hence, ensuring the authenticity of texts within digital libraries has become a paramount concern. This work introduces a corpus of approximately 60 k Romanian documents, including human-written samples as well as generated texts using six distinct Large Language Models (LLMs) and three different generation methods. Our robust experimental dataset covers five domains, namely books, news, legal, medical, and scientific publications. The exploratory text analysis revealed differences between human-authored and artificially generated texts, exposing the intricacies of lexical diversity and textual complexity. Since Romanian is a less-resourced language requiring dedicated detectors on which out-of-the-box solutions do not work, this paper introduces two techniques for discerning machine-generated texts. The first method leverages a Transformer-based model to categorize texts as human or machine-generated, while the second method extracts and examines linguistic features, such as the top textual complexity indices identified via the Kruskal–Wallis mean rank test and burstiness, which are further fed into a machine-learning model leveraging an extreme gradient-boosting decision tree. The methods show competitive performance, with the first technique's results outperforming the second one in two out of five domains, reaching an F1 score of 0.96. Our study also includes a text similarity analysis between human-authored and artificially generated texts, coupled with a SHAP analysis to understand which linguistic features contribute more to the classifier's decision.

1. Introduction

The outstanding capability of the Large Language Models (LLMs) to generate human-like texts has raised content authenticity concerns for researchers and users around the world [1,2]. LLMs demonstrated great potential in NLP tasks such as machine translation [3], summarization [4], dialogue systems [5], question answering [6,7,8], or information retrieval [9,10]. In these scenarios, the objective is to produce qualitative texts that meet the user's requirements rather than deceive or mislead. Among their various applications, the LLMs can be misused, for instance, to produce academic essays and research papers or even generate fake news. They can produce content nearly identical to human-authored texts when the LLM distribution is human-like, thus making the detection harder and requiring the collection of additional samples [11,12].
A legitimate question is raised by Clark et al. [12] who stated that “Human evaluations are considered the gold standard in Natural Language Generation (NLG), but as models’ fluency improves, how well can evaluators detect and judge machine-generated text (MGT)?”. Hence, there is an emergent need for performant automated detection systems and tailored approaches to maximize their potential.
Additionally, despite the frequent attention given to high-resource languages such as English, it is crucial not to overlook the importance of Natural Language Processing (NLP) tools for languages with limited resources. Consequently, this research centers on using NLP for Romanian, a language facing distinct challenges due to insufficient resources. Nevertheless, the knowledge derived from this investigation offers flexible methodologies and approaches that overcome the constraints imposed by limited data and can serve as a starting point for studies on other low or less-resourced languages.

Current Study Objective

This paper examines existing techniques for identifying machine-generated text, explores their limitations, and proposes novel approaches that leverage advanced linguistic analysis, contextual understanding, and pattern recognition to enhance the accuracy and reliability of detection systems. By addressing this pressing issue, the research aims to contribute to the development of more resilient and effective tools for safeguarding against the proliferation of machine-generated text with malicious intent in digital contexts.
In this work, we propose and evaluate two detection methods to distinguish automatically generated texts in the context of digital libraries, leveraging two model architectures, namely Transformer-based and classic ML-based. As part of this study, starting from a human-authored set of texts (7 k), we developed an extensive corpus of artificially generated texts of around 52 k documents using six different LLMs and three different generation methods, comprising texts across five domains. The text generation techniques include text completion, backtranslation, and paraphrasing. Text completion is an NLP technique used to generate text by predicting the next word or sequence of words in a given context. This method uses statistical language models trained on a large amount of textual data to generate the most probable words that would complete a sentence or a paragraph. The language models used for text completion in this experiment are based on the GPT architecture [13] and are pre-trained on a Romanian corpus. Backtranslation is a text generation method that iteratively translates a text from one language to another and then back into the original language. This method is used to generate variations of the original text that may have different sentence structures, words, or meanings. Paraphrasing algorithms generate a semantically similar, grammatically correct output that preserves the input's original meaning: a new text is generated with the same sense while using different words and sentence structures. Paraphrasing can also be used as a simplification instrument for rephrasing complex texts, facilitating text comprehension by decreasing the reading difficulty.
The main contributions of our work are as follows:
  • Contribution to the development of resources for a less-resourced language, such as Romanian;
  • The development of an extensive Romanian dataset, containing both human-authored and machine-generated texts, of around 60 k documents across five domains, generated via seven distinct methods, using several LLMs and generation techniques like text completion, backtranslation, and paraphrasing;
  • Two detection models based on different architectures, namely Transformer-based and classic ML-based, designed to distinguish automatically generated texts for the Romanian language. The ML model slightly outperformed the Transformer model, reaching a macro F1 of 0.91, while the Transformer exhibited a macro F1 of 0.90;
  • An analysis of textual similarities between human-authored and machine-generated texts for Romanian via similarity measures (cosine, BLEU, ROUGE) and non-parametric statistical tests;
  • An exploration of linguistic differences between human and artificially generated texts via textual complexity indices.
We release as open-source the dataset (https://huggingface.co/datasets/readerbench/ro-human-machine-60k, accessed on 8 January 2024), and the codebase (https://github.com/readerbench/ro-mgt-detection, accessed on 8 January 2024).

3. Method

Our method comprises two modules, namely the artificial corpus generation and the detection models (see Figure 1). Both detection models are formulated as a multiclass text classification task: the first leverages a Transformer architecture with a Romanian pre-trained encoder model, while the second uses an XGBoost model fed with a selection of linguistic features.
Figure 1. Detection methods using RoBERT [58] and XGBoost [59] models.

3.1. Corpus

The corpus for this study consists of multiple datasets of comparable text lengths, both machine-generated and human-written. Experiments were performed against all datasets iteratively for a comprehensive overview. To obtain a detector that generalizes well across several domains and writing styles, the human dataset comprises texts acquired from different domains such as Romanian literature (i.e., books), news, medical, legal, and scientific articles, resulting in a human corpus of 7387 documents across five domains, distributed as follows:
  • A total of 1401 books: 841 manually written abstracts provided by the Central University Library of Bucharest, representing descriptions of old Romanian documents (literary magazines and books dated between the 19th century and the present), and 560 book descriptions (https://cartigratis.com/, accessed on 8 January 2024);
  • A total of 4320 news articles crawled from DigiNews (https://www.digi24.ro/, accessed on 8 January 2024);
  • A total of 557 medical texts acquired from several specialized publications: 71 texts from medical scientific journals (https://srumb.ro/index.php?page=revista&spage=numere&id=9, accessed on 8 January 2024), 372 texts from scientific magazines (https://www.medichub.ro, accessed on 8 January 2024), and 114 texts from a glossary of diseases (https://www.sfatulmedicului.ro/boli-si-afectiuni, accessed on 8 January 2024);
  • A total of 1000 juridical/legal texts representing Romanian law texts from Monitorul Oficial (https://monitoruloficial.ro/, accessed on 8 January 2024);
  • A total of 109 scientific articles from the Romanian Journal of Human-Computer Interaction (RoCHI) (http://rochi.utcluj.ro/, accessed on 8 January 2024).
Our primary focus when generating the artificial corpus was to create human–machine text pairs for analyzing the linguistic similarity and comparing the employed LLMs. Specifically, we were interested in capturing model-specific patterns, which resulted in a global dataset imbalance with a 1:7 ratio of human to machine texts.
To improve the readability of the reported experiments, we refer to the readerbench/RoGPT2-medium model from HuggingFace as RoGPT2, dumitrescustefan/gpt-neo-romanian-780m is denoted as GPT-Neo-Ro, flan-t5-small-paraphrase-ro is referred to as Flan-T5, while OpenAI's text-davinci-003, accessible via their API, is referred to as davinci-003. We maintain the original names for Opus-MT and mBART.

3.1.1. Text Generation Strategies

Starting from the human-authored texts, we expanded the dataset artificially via text generation, leveraging different generation techniques and state-of-the-art language models. An overview of the utilized methods and models is presented in Table 1.
Table 1. Methods overview for artificial text generation.
Several techniques were leveraged for text generation, namely text completion, paraphrasing, and backtranslation, to increase the diversity of the artificially generated texts fed to our detection system.
For each generation method, a different input was provided to the model. The text completion method takes as input the first 10 words from each human text and asks the model to continue the text with coherent paragraphs, while for paraphrasing and backtranslation methods, the input is represented by the entire human written text fragments. The motivation behind choosing 10-word prompts for the text completion task is to avoid artificially inflating similarity scores in human–machine text pairs due to n-gram overlap. Using longer prompts could lead to a higher number of shared words between the human reference and the generated texts, skewing the similarity metrics by introducing a higher degree of text overlap. By limiting the prompt length to 10 words, we aimed to strike a balance where the generated texts could still be contextually meaningful, yet not excessively overlap with the reference paragraphs, ensuring a more accurate evaluation of text similarity without introducing unintended biases into the analysis.
The language models used for the text completion tasks are RoGPT2 (based on the GPT2 architecture), GPT-Neo-Ro (based on the GPT3 architecture), and OpenAI's davinci-003 (based on the GPT3.5 architecture). For the paraphrasing task, Flan-T5 (based on the FLAN architecture) was leveraged. The backtranslation task applied three multilingual models, namely davinci-003 (based on the GPT3.5 architecture), mBART (based on the BART architecture), and Opus-MT (based on a standard Transformer architecture with six attention layers in both the encoder and decoder networks, having eight attention heads in each layer). Each backtranslation model performed three iterative translations.
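As an illustration of the text completion strategy, the following minimal sketch generates a continuation from a 10-word prompt using the HuggingFace transformers library and the readerbench/RoGPT2-medium checkpoint; the generation parameters are assumptions for illustration and do not reflect the exact configuration used in our experiments.

```python
# Minimal sketch of the text-completion strategy (assumed generation parameters).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoGPT2-medium")
model = AutoModelForCausalLM.from_pretrained("readerbench/RoGPT2-medium")

def complete_text(human_text: str, prompt_words: int = 10, max_new_tokens: int = 200) -> str:
    # Use only the first 10 words of the human document as the prompt,
    # limiting n-gram overlap with the reference text.
    prompt = " ".join(human_text.split()[:prompt_words])
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,          # sampling yields more varied continuations
        top_p=0.95,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```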
The choice of employing iterative translations, involving translations from the original language (Romanian) to multiple target languages and subsequently backtranslating to Romanian, is motivated by several key advantages in the context of generating an artificial corpus. First, this process induces a higher diversity of language patterns since iterative translations introduce linguistic variations, thus making the artificial corpus more representative of the linguistic variability encountered in real-case scenarios. Second, the translations may yield different word choices and contextual interpretations by involving multiple languages, contributing to richer representations in the artificial corpus. Third, iterative translations introduce controlled semantic drift and ambiguity into the text. As the text navigates multiple languages back and forth, subtle changes in meaning and context may occur. This controlled variation is valuable for simulating ambiguities present in real-world language. Fourth, the iterative translation strategy enhances generalization as the generated corpus becomes more adaptable to a range of language variations, ensuring better performance and robustness when applied to tasks involving varied linguistic inputs.
In the context of machine-generated text detection, the decision to use three iterative translations instead of just two aligns with the goal of creating a more diverse artificial corpus that further amplifies linguistic variations and introduces additional layers of text transformation. While two iterations (forward translation and backtranslation) already bring diversity, incorporating a third iteration enhances the previously introduced benefits. This additional step ultimately contributes to the model’s capability to generalize effectively across a wide range of linguistic variations.
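A minimal sketch of such a backtranslation chain is shown below, assuming MarianMT (Opus-MT) checkpoints from HuggingFace; the checkpoint names and pivot languages are assumptions for illustration and may differ from the exact pairs used in our pipeline.

```python
# Illustrative backtranslation chain with MarianMT (Opus-MT) checkpoints.
from transformers import MarianMTModel, MarianTokenizer

def load_pair(name: str):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

def translate(text: str, tokenizer, model) -> str:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    generated = model.generate(**batch, max_new_tokens=512)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

def backtranslate(ro_text: str) -> str:
    # Three translation steps, e.g. ro -> en -> fr -> ro
    # (checkpoint names and pivot languages are hypothetical).
    steps = ["Helsinki-NLP/opus-mt-ro-en",
             "Helsinki-NLP/opus-mt-en-fr",
             "Helsinki-NLP/opus-mt-fr-ro"]
    text = ro_text
    for name in steps:
        tok, mdl = load_pair(name)
        text = translate(text, tok, mdl)
    return text
```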

3.1.2. Overview of the Generated Corpus

The AI-generated dataset was built starting from the human set using the previously introduced strategies applied to the texts from each of the five domains. The machine-generated set contains 51,709 documents across the five domains. The entire corpus reached almost 60 k documents (59,096), both human and machine-generated—the overview is presented in Table 2.
Table 2. MGT dataset: human-written and machine-generated texts.
Lexical diversity was computed for each generated set to have a comparison baseline with human-authored sets. Lexical diversity represents the variety of words used in a given text and is commonly computed via the Type-Token Ratio (TTR) score, which is the ratio of the number of unique words (types) to the total number of words (tokens) in a text. A high TTR score indicates that a text has a wide range of different words (a higher diversity); in contrast, a low TTR score means the text has a limited vocabulary and may be repetitive (have a lower diversity). Lexical diversity and TTR score are important measures in linguistics in general and for our experiment, as they facilitate analysis and comparison of the language use and generation style across different language generation models. Based on the average TTR scores for our generated datasets, Flan-T5 and davinci-003 (on back-translation task) produce the most diversified text, superior in diversity compared to human writings.
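For reference, the TTR computation reduces to the following sketch, where tokenization is a simple whitespace split (our experiments may use a different tokenizer):

```python
# Type-Token Ratio (TTR): unique words divided by total words.
def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Average TTR over a set of documents, e.g. all texts produced by one model.
def average_ttr(documents) -> float:
    scores = [type_token_ratio(doc) for doc in documents]
    return sum(scores) / len(scores) if scores else 0.0
```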
For a comparison baseline, both models leverage the same data split for train-test partitions and use stratified labels.

3.2. Detection Models

We built a comprehensive pipeline dedicated to the training and assessment of a text classifier tasked with predicting the text generation model associated with a given text fragment. This involves two distinct classification techniques, each based on different architectures. The primary objective is to evaluate and compare their efficiency, particularly when incorporating additional features such as linguistic complexity into the analysis. The comprehensive approach taken in this study aims to provide insights into the effectiveness of diverse classification methodologies while considering nuanced linguistic attributes, thus contributing to a stronger understanding of AI-text detection.

3.2.1. Transformer-Based Model

We model the task as a multiclass classification problem that leverages a Romanian pre-trained encoder model, readerbench/RoBERT-base (https://huggingface.co/readerbench/RoBERT-base, accessed on 8 January 2024). The model is trained for four epochs using an 80-20 data split for the train-validation sets, using the model's maximum length of 512 tokens, a batch size of 32, and a learning rate of 1 × 10⁻⁵. We use specific parametrization during the model's training to overcome the challenges of overfitting and of the imbalanced dataset. As such, the model uses L2 regularization with a weight decay of 0.1, adding a penalty term to the loss function that discourages large weights. Moreover, we use early stopping to monitor the model's performance on the validation set during training, stopping when the validation performance starts to degrade, which indicates the model begins to overfit. This prevents the model from fitting noise and helps it generalize better on unseen data. We also leverage gradient accumulation: weight updates become more stable by accumulating gradients over several mini-batches (i.e., two in our case). This technique is helpful when dealing with large batch sizes that do not fit into the available memory.
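A condensed sketch of this training setup is given below, assuming the HuggingFace Trainer API; dataset preparation is omitted, the number of labels is an assumption (human plus the seven generation methods), and exact argument names may vary with the transformers version.

```python
# Sketch of fine-tuning RoBERT-base: 4 epochs, batch size 32, lr 1e-5,
# weight decay 0.1, gradient accumulation over 2 mini-batches, early stopping.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, EarlyStoppingCallback)

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "readerbench/RoBERT-base", num_labels=8)  # assumption: human + 7 generation methods

args = TrainingArguments(
    output_dir="ro-mgt-detector",
    num_train_epochs=4,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,    # accumulate gradients over 2 mini-batches
    learning_rate=1e-5,
    weight_decay=0.1,                 # L2 regularization
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,           # tokenized datasets (max_length=512), assumed prepared
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```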

3.2.2. Classic ML-Based Model

This method leverages the ReaderBench textual complexity indices (RBIs) (https://github.com/readerbench/ReaderBench/wiki/Textual-Complexity-Indices, accessed on 8 January 2024), available for the Romanian language. The first step in our classic ML-based approach is to compute and extract the most relevant textual complexity indices from the dataset containing both human and machine-generated texts. Since most RBIs are non-normally distributed on the train partition, we employ the non-parametric Kruskal–Wallis test with mean ranks to determine the most relevant indices in predicting significant differences between human-written and machine-generated texts. From the RBI collection, the top 100 indices with the most significant differences in mean ranks are selected. These indices are considered the most relevant for distinguishing between human and machine-generated texts and are further used in the detection algorithm to build the feature selection given as input to the classification model.
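The feature-ranking step can be sketched as follows, assuming the computed RBIs are stored in a pandas DataFrame with a label column; the column names are illustrative.

```python
# Rank RBIs by the Kruskal-Wallis H statistic and keep the top 100.
import pandas as pd
from scipy.stats import kruskal

def top_indices_by_kruskal(df: pd.DataFrame, label_col: str = "label", k: int = 100):
    feature_cols = [c for c in df.columns if c != label_col]
    scores = {}
    for col in feature_cols:
        groups = [group[col].values for _, group in df.groupby(label_col)]
        stat, p_value = kruskal(*groups)
        scores[col] = stat          # higher H -> larger differences in mean ranks
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]               # the 100 most discriminative indices
```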
Additionally, we compute the burstiness measure, which shows the variation in word lengths within a text and may be used to identify irregularities or changes in word lengths. As such, we compute the mean word length as the average length of all words in the text. Further, we compute the standard deviation of word lengths to measure the extent to which word lengths deviate from the mean word length. The standard deviation measures the dispersion of word lengths in the text. Finally, the burstiness is determined by dividing the standard deviation by the mean word length. This ratio represents how much the word lengths vary relative to their average length. Higher burstiness values indicate greater variability in word lengths within the text, while lower values indicate more uniform word lengths. Further, the feature selection is built by combining the burstiness score with the top 100 textual complexity indices, which serve as input to the XGBoost classification model. This enhances the model's capability to discriminate between human and machine-generated texts.
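The burstiness measure described above reduces to the coefficient of variation of word lengths, as in the following sketch:

```python
# Burstiness: standard deviation of word lengths divided by the mean word length.
import statistics

def burstiness(text: str) -> float:
    lengths = [len(word) for word in text.split()]
    if len(lengths) < 2:
        return 0.0
    mean_len = statistics.mean(lengths)
    std_len = statistics.stdev(lengths)
    return std_len / mean_len if mean_len else 0.0
```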
The selection of XGBoost as our classic ML model for the binary classification task of detecting machine-generated texts was based on its superior performance and versatility. XGBoost, an ensemble learning algorithm, has achieved noteworthy success in various domains [60,61], particularly excelling in binary classification tasks. Its ability to handle complex relationships within the data, manage large datasets efficiently, and mitigate overfitting makes it a robust choice for the given task. Furthermore, XGBoost is known for its flexibility in handling diverse feature types and managing imbalanced datasets, which is crucial when dealing with text data. The decision to opt for XGBoost is supported by its proven track record in achieving high accuracy, precision, and recall in comparable applications [60]. While other ML classifiers could be employed, the selection of XGBoost is justified by its empirical effectiveness, which aligns with the objectives of this study, ensuring a robust and reliable approach to text classification in the context of Romanian binary text classification.
Due to the high degree of imbalance between human-written and artificially generated texts, we leverage a weighted XGBoost classifier to mitigate this issue by incorporating weighting and regularization parameters. Moreover, we perform hyperparameter tuning by defining a grid search with five-fold cross-validation to find the best hyperparameters. We select the best classifier with the optimal parameters from grid search (see Table 3), and the classifier is further trained on the entire training dataset. Next, the classifier is evaluated on the test dataset, and the class probabilities are predicted.
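A sketch of the weighted XGBoost classifier with grid search is given below; the label encoding and the parameter grid are assumptions for illustration, not the exact values reported in Table 3.

```python
# Weighted XGBoost with five-fold grid search over illustrative hyperparameters.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: y_train uses 0 = machine-generated, 1 = human.
# scale_pos_weight compensates for the ~1:7 human-machine imbalance.
pos_weight = (y_train == 0).sum() / max((y_train == 1).sum(), 1)

param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 400],
    "reg_lambda": [1.0, 5.0],        # L2 regularization term
}

search = GridSearchCV(
    estimator=XGBClassifier(eval_metric="logloss", scale_pos_weight=pos_weight),
    param_grid=param_grid,
    scoring="f1_macro",
    cv=5,                            # five-fold cross-validation
)
search.fit(X_train, y_train)

best_clf = search.best_estimator_    # refit on the full training data by GridSearchCV
probabilities = best_clf.predict_proba(X_test)
```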
Table 3. Grid Search best parameters for ML-model.

4. Results

Given the results presented in Table 4, both detection methods are effective instruments for discerning human-authored from AI-generated texts across different domains. The F1 scores are generally good, showing minor differences between the two methods in terms of performance. However, the number of misclassified texts, specifically for the human category, varies across domains, a behavior explained by class imbalance. The random chance for predicting each class, given the 1:7 class imbalance between human-written and machine-generated texts, is 12.25%. Table 4 presents the cumulative results, based on which the Transformer-based method outperformed the classic ML-based method on two domains out of five, being better at distinguishing AI texts from human texts for the legal and news domains, while the linguistic features extracted in the classic ML-based approach contributed to better results for the books, medical, and scientific (RoCHI) domains.
Table 4. Detection results (bold marks the best scores per domain).
Transformer-based model. The multiclass classifier performs well across different categories with high precision (P), recall (R), and F1-scores as highlighted by confusion matrices (see Figure 2 for the books domain, while the matrices for other domains can be consulted in Appendix A, Figure A1). While evaluating the model’s performance, it is important to consider the dataset split, the sampling technique, and the class imbalance. The classification is less performant on the subset with fewer samples (i.e., RoCHI), causing more confusion and an F1-score of 0.85, while the classification performs best for the subset with the highest number of samples (i.e., news), achieving an F1-score of 0.96.
Figure 2. Transformer-based model: confusion matrix for the books domain.
Classic ML-based model. The human texts yield less accurate results because the number of human samples in the dataset is seven times lower than the total number of AI texts. Analyzing the results by domain, the classic ML-based algorithm achieves the highest F1 score for the medical domain (0.96), followed by the RoCHI domain, which exhibits an F1 of 0.92. The precision for AI texts is particularly high (0.93), with only a few misclassified AI texts (9 out of 144). The books domain follows with an F1 of 0.90, correctly identifying both human and machine-generated texts with high accuracy (0.92). The legal domain exhibits reasonable performance as well, with an F1 of 0.85. We observe the same pattern of misclassified human texts compared to AI texts, motivated by class imbalance. The news domain achieves an F1 score of 0.87.
Furthermore, we introduce SHAP (Shapley Additive exPlanations) [62] to examine the contribution of individual features (RBI indices and burstiness) to the model’s predictions. SHAP provides insights into why a particular prediction was made by attributing a portion of the prediction to each feature. SHAP values are based on cooperative game theory to allocate contributions among features, helping us understand the impact of each feature on the model’s decision. The SHAP summary plots (see Figure 3 for the books domain, while SHAP for other domains can be consulted in Appendix B, Figure A2) show the impact of each feature on the model’s predictions for the respective domain. The features are ranked by their importance, with the most important features at the top. The color of the dots indicates the direction of the impact, with red indicating a positive impact and blue indicating a negative impact. The size of the dots indicates the magnitude of the impact. To determine which class is influenced most by a particular feature, we can look at the color of the dots. For example, if the dots for a particular feature are mostly red, then that feature is likely to be associated with the positive class (human text). On the contrary, if the dots are mostly blue, then that feature is likely to be associated with the negative class (machine text).
Figure 3. Classic ML-based model: SHAP summary plot for the books domain.
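For reproducibility, the SHAP analysis can be sketched as follows, assuming the fitted XGBoost classifier and the test feature matrix from the previous sketches; the variable names are illustrative.

```python
# SHAP attribution for the XGBoost model over the selected linguistic features.
import shap

explainer = shap.TreeExplainer(best_clf)
shap_values = explainer.shap_values(X_test)

# Beeswarm-style summary plot: features ranked by overall importance,
# each dot showing the impact of that feature on one prediction.
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
```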
Based on the charts, in the news domain, the most important features are the number of nouns per paragraph, the number of unique words in the document, the word entropy per paragraph, and the number of words per paragraph. The prediction is less influenced by the number of punctuation marks per paragraph or by the number of words in the document. This means the model predictions for the news domain are the most influenced by morphology and surface indices. The books domain is most influenced by surface indices like word entropy per paragraph and syntax indices like dependencies, but also by morphology features like unique words for adverbs per sentence. The predictions are less influenced by word polysemy or number of syllables in words. For the legal domain, the model output is most influenced by the number of sentences in the document, the number of syllables in words, and the number of connectors in the sentences. This reveals that the syntax, word, and discourse elements are the differentiators, while the cohesion between the first and last text element has a lower impact on the prediction. The main feature of the medical domain is represented by a syntactic measure, namely case dependency per paragraph, followed by a surface index, namely word entropy per paragraph. This validates the assumption that more complex texts contain more information and more diverse concepts. The third most significant index is burstiness, which shows the variation in word lengths within a text and may be used to identify irregularities or changes in word lengths. The scientific domain is most influenced by a syntactic index, namely case dependency per paragraph, followed by word entropy per sentence, and a morphology index, namely unique nouns in the paragraph.
The SHAP analysis provides insights into the top linguistic features contributing the most to the model’s predictions. As a general observation, we may notice among all domains that the most significant indices were the surface indices, followed by morphology and syntax, and then discourse structure.

5. Discussion

Our results highlight the effectiveness of both the Transformer-based and classic ML-based methods in discriminating between human-authored and AI-generated texts across diverse domains. The performance differences observed in the individual domains provide valuable insights into the strengths and limitations of each approach, further pinpointing potential paths for further refinement and application.
One key observation is the variance in misclassified texts, particularly from the human category, across different domains. This behavior is attributed to class imbalance, where the number of human samples is significantly lower than that of AI texts. The classic ML-based model, in particular, shows sensitivity to this imbalance, as evidenced by the notable improvement in results when experiments were conducted with an equal number of documents for both classes. This underscores the importance of addressing the class distribution challenge in training datasets to enhance the model’s accuracy, especially in scenarios where human-authored texts are underrepresented.
Moreover, examining SHAP values for the classic ML-based model offers valuable insights into the features contributing significantly to classification decisions. The consistency of the average word entropy per paragraph as a strong indicator across multiple domains suggests its importance in machine-generated text detection.
In addition to evaluating the performance of the Transformer-based and classic ML-based models in discerning human-authored from AI-generated texts, we further pursue two methods to gain a comprehensive understanding of the classification outcomes. First, we conduct a human–machine textual similarity analysis to assess the degree of resemblance between human and machine-generated texts. This analysis involves leveraging similarity metrics to quantify the textual proximity between the human texts and their machine-generated counterparts. By exploring the textual distinctions, we aim to discern any patterns or divergences in similarity across different domains, providing insights into the models' interpretability and their ability to capture subtle variations in writing styles.
Furthermore, we extend our investigation to analyze the impact of textual complexity indices for both datasets. Textual complexity metrics, such as syntactic and semantic features, are key in characterizing the complexities of written content. By examining indices such as sentence length, vocabulary richness, and syntactic structures, we seek to clarify the distinct patterns exhibited by human and machine-generated texts in each domain. This dual analysis offers a multi-faceted perspective, allowing us to not only quantify the degree of similarity but also investigate the inherent complexities present in the textual content, thus enriching our understanding of the models’ performance across diverse domains.

5.1. Human–Machine Textual Similarity

For a better understanding of the similarity between human-written and artificially generated texts, a series of statistical similarity metrics were computed, such as cosine similarity [63], BLEU scores [33], and ROUGE scores [34], followed by non-parametric statistical tests to determine the differences or patterns in the similarity scores, while also extracting and comparing various textual complexity indices between human and machine-generated sets of documents.
Cosine similarity measures the relatedness of documents using a bag-of-words representation that considers term frequency. The main advantages of this approach are its efficiency for high-dimensional data, such as text, and its insensitivity to document length, while its main disadvantage is that it ignores semantic meaning and word order. BLEU is a standard machine translation metric that measures the precision of n-grams (word sequences) in the generated text. Its main benefit is the penalization of overusing common words, while its shortcomings are its sensitivity to the reference length and its disregard for recall and semantic meaning. The ROUGE scores are a set of metrics used for evaluating text quality by comparing it to a reference text, measuring n-gram overlap (ROUGE-N) and the longest common subsequence (ROUGE-L). As an advantage, ROUGE is sensitive to content overlap and partial matches, while as a disadvantage, it may not capture semantic meaning well due to its limitation to the recall and precision of n-grams. Therefore, analyzing the three scores together provides a more grounded view of text similarity, covering different aspects: incorporating structural and distributional aspects of the text (cosine), adding a layer of specificity via analyzing the presence of key phrases and content overlap (ROUGE), and offering a view on how well machine-generated texts align with human references in terms of specific word sequences (BLEU).
The previous three similarity scores (i.e., cosine similarity, BLEU score, and ROUGE score) were computed between human-written documents and artificially generated texts. Cosine similarity measures the cosine of the angle between two vectors and denotes the similarity between a pair of texts, while BLEU and ROUGE scores evaluate the quality of machine-generated texts by comparing them to reference (human) texts.
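The three similarity scores can be computed per human–machine pair as in the following sketch, which assumes the scikit-learn, NLTK, and rouge-score packages and uses deliberately simple tokenization.

```python
# Pairwise similarity between a human reference and its machine-generated counterpart.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def similarity_scores(human_text: str, machine_text: str) -> dict:
    # Cosine similarity over bag-of-words term frequencies.
    vectors = CountVectorizer().fit_transform([human_text, machine_text])
    cosine = cosine_similarity(vectors[0], vectors[1])[0, 0]

    # BLEU: n-gram precision of the generated text against the human reference.
    bleu = sentence_bleu([human_text.split()], machine_text.split(),
                         smoothing_function=SmoothingFunction().method1)

    # ROUGE-L: longest common subsequence against the human reference.
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
        human_text, machine_text)["rougeL"].fmeasure

    return {"cosine": cosine, "bleu": bleu, "rougeL": rouge_l}
```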
The scores distribution for each domain is illustrated in Appendix C, Figure A3, Figure A4, Figure A5, Figure A6 and Figure A7 for each dataset (e.g., books, news, medical, legal, and scientific/RoCHI), and are further correlated with mean and standard deviation presented in Table 5 for a comprehensive similarity overview. Models with higher mean scores are generally better at capturing textual similarity according to the respective metrics. Nevertheless, we should also analyze the standard deviations to understand the consistency of each model’s performance. Lower standard deviations suggest more consistent performance, while higher standard deviations indicate greater score variability.
Table 5. Mean (M) and standard deviation (SD) values for similarities per model.
Summarizing these results, we notice that Flan-T5 produces texts that are similar to human-produced ones while also having the highest variability (standard deviation), thus denoting potential inconsistencies.
Since our data do not follow a normal distribution across the five domains, non-parametric statistical methods are required to compare and analyze the results of human and machine similarity. These tests are valuable when working with similarity scores, allowing us to make statistical inferences about the performance of different text generation models without making strong assumptions about the underlying data distribution.
The Friedman test is a non-parametric alternative to repeated-measures analysis of variance (ANOVA) and measures the variability across multiple models. Table 6 shows the Friedman test results for the five domains applied to the cosine similarity scores.
Table 6. Friedman statistics across domains for cosine similarity scores.
In hypothesis testing, a low p-value (typically less than 0.05) means the null hypothesis is rejected, and significant differences exist among the tested models. According to the results across all five domains, we have high Friedman scores and low p-values, which suggests the compared language models perform differently in each domain, and further analysis or pairwise comparisons can be conducted to understand the nature of these differences.
We further perform the Wilcoxon test, targeting pairwise comparisons. The Wilcoxon signed-rank test is used to compare two paired groups and check whether there are significant differences between them. Similarly to the Friedman test, the Wilcoxon test is used in cases where the data do not follow a normal distribution, and it denotes the degree of differences between the two pairs analyzed. The Wilcoxon test results for our dataset across the five domains for cosine similarity scores are presented in Appendix D, Figure A8. The high Wilcoxon scores in the matrix (see Figure A8) indicate substantial differences in cosine similarity scores between pairs of models. There is strong evidence to suggest that the variations in performance are statistically significant.
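The two tests can be applied to the per-document cosine similarity scores as sketched below, assuming the scores are aligned across the same human references for every model.

```python
# Friedman test across all models, then pairwise Wilcoxon signed-rank tests.
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

def compare_models(scores_by_model: dict):
    # scores_by_model maps a model name to its list of per-document similarity scores.
    stat, p_value = friedmanchisquare(*scores_by_model.values())
    print(f"Friedman chi-square = {stat:.2f}, p = {p_value:.4f}")

    # Pairwise Wilcoxon signed-rank tests between every two models.
    for a, b in combinations(scores_by_model, 2):
        w_stat, w_p = wilcoxon(scores_by_model[a], scores_by_model[b])
        print(f"{a} vs {b}: W = {w_stat:.2f}, p = {w_p:.4f}")
```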

5.2. Textual Complexity Indices

To explore the linguistic differences between human-written and machine-generated texts, we additionally leverage the ReaderBench textual complexity indices (RBI) for Romanian (https://github.com/readerbench/ReaderBench/wiki/Textual-Complexity-Indices, accessed on 8 January 2024). The selection of the top three textual complexity indices (see Table 7) was determined through the Kruskal–Wallis statistical test. The indices that exhibited the most notable variations among the domains were selected, providing valuable insights into the distinctive linguistic features within each domain.
Table 7. Top indices per domain, where I, II, and III represent the rank.
The analysis of the top three ranking textual complexity indices across different domains yields valuable insights into the key features influencing the prediction process of the classifier. While the specific indices vary between domains, a consistent pattern emerges regarding the importance of certain features across diverse datasets. For the books domain, the model deems word entropy per paragraph and per sentence as the most significant factors. In legal texts, the focus shifts to dependencies per sentence, with a notable emphasis on the number of conjunctions per sentence. The medical domain distinguishes itself through lexical diversity, particularly in terms of specific dependencies and the number of unique words at the paragraph level. Similarly, the news domain emphasizes lexical diversity, as evidenced by the number of words and unique words per paragraph. Scientific publications, on the other hand, prioritize specific dependencies and the number of conjunctions per paragraph as the most influential indices. It is noteworthy that despite the divergence in the top indices, the underlying consistency in the importance of certain features across domains underscores the robustness of these linguistic patterns in shaping the classifier’s predictions. The provided table summarizes the top indices for each domain, offering a clear reference for understanding the observed linguistic distinctions.
The distribution of values for these textual complexity indices within each dataset and domain is depicted in Figure A9, Figure A10, Figure A11, Figure A12 and Figure A13 in Appendix E. This provides a comprehensive overview of the variations in the most critical statistical features present in both human-written and machine-generated texts.

6. Conclusions

To conclude, our study introduced two distinct detection models designed for identifying machine-generated text in the Romanian language, employing different architectures—the first based on Transformers and the second based on decision trees with linguistic feature extraction. Leveraging a substantial corpus of almost 60,000 Romanian texts, both human-written and generated through diverse methods such as text completion, paraphrasing, and backtranslation utilizing state-of-the-art large language models, we conducted extensive experimental studies. Additional investigations aimed to evaluate text similarity between human-authored and machine-generated texts, considering both statistical and linguistic perspectives. The findings underscored that artificially generated texts tend to exhibit higher complexity, while human-written texts showcase greater diversity from a lexical standpoint.
The comparative analysis revealed the superiority of the classic ML-based detection model over the Transformer model in three out of the five domains, as indicated by its highest F1 score (0.96 for detecting texts in the medical domain). The XGBoost model that considered linguistic features exhibited a macro F1 of 0.91, while the Transformer model reached an F1 of 0.90. Despite these advancements, it is crucial to acknowledge that no existing detection system is foolproof, emphasizing the ongoing need for continuous investment in developing reliable detection algorithms. The dynamic nature of text generation methods and evolving language models necessitates a proactive approach in adapting detection mechanisms to remain ahead of emerging challenges.
Moving forward, future research endeavors could explore innovative techniques for enhancing the robustness of detection models. One path is the integration of advanced NLP techniques to discern more subtle nuances in language use. Additionally, investigating the potential impact of domain-specific linguistic characteristics on detection accuracy could provide valuable insights. Moreover, the exploration of ensemble models combining the strengths of different architectures may contribute to even more effective and versatile detection systems. Also, investigating the generalization capabilities of the detection models across diverse linguistic styles and writing conventions can provide insights into their adaptability. Exploring the robustness of the models against adversarial attacks, where subtle manipulations are made to deceive the system, is another option for future research. Such explorations could lead to developing more resilient detection mechanisms capable of withstanding sophisticated adversarial attempts in the evolving landscape of machine-generated text. Sustained efforts in research and development are essential to meet the growing demand for reliable and adaptive detection algorithms.

Author Contributions

Conceptualization, M.N. and M.D.; methodology, M.N. and M.D.; software, M.N.; validation, M.N.; formal analysis, M.N.; investigation, M.N.; resources, M.N.; data curation, M.N.; writing—original draft preparation, M.N.; writing—review and editing, M.D.; visualization, M.N.; supervision, M.D.; project administration, M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

We release as open-source the dataset (https://huggingface.co/datasets/readerbench/ro-human-machine-60k, accessed on 8 January 2024), and the codebase (https://github.com/readerbench/ro-mgt-detection, accessed on 8 January 2024). Additional information is available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
LLM: Large Language Models
MGT: Machine-Generated Text
ML: Machine Learning
MLM: Masked Language Model
NLG: Natural Language Generation
NLP: Natural Language Processing
P: Precision
R: Recall
RBI: ReaderBench Indices
TF-IDF: Term-Frequency Inverse Document Frequency
TTR: Type-Token Ratio

Appendix A. Transformer-Based Model: Confusion Matrices per Domain

Figure A1. Transformer-based model: confusion matrices per domains.

Appendix B. Classic ML-Based Model: SHAP Summary Plots per Domain

Figure A2. Classic ML-based model: SHAP summary plot per domain.

Appendix C. Statistical Similarity for Artificially-Generated Texts

Figure A3. Text similarity metrics for books domain.
Discussion for books dataset textual similarity results:
  • Cosine similarity—as revealed by the cosine similarity results, Flan-T5 has the highest mean cosine similarity score (0.5458), indicating it produces text that is more similar to human-written texts. Opus-MT also has a relatively high mean score (0.5578), followed closely by davinci-003 (on the backtranslation task, 0.4106). mBART and RoGPT2 have mean scores above 0.35, suggesting moderate performance, while GPT-Neo-Ro and davinci-003 (on text completion task) have lower mean scores. From a variability point of view, Flan-T5 has the highest standard deviation (0.2399), which denotes a wide range of scores and potential inconsistency. Opus-MT also has a relatively high standard deviation (0.1470), suggesting variability in its performance, while the most consistent models for the books dataset are considered GPT-Neo-Ro, davinci-003 (on text completion task) and RoGPT2, which expose lower standard deviations and variability in terms of performance.
  • BLEU—BLEU scores reveal Flan-T5 with the highest mean BLEU score (0.2960), demonstrating better performance on average. Opus-MT has the second highest score (0.2374), followed by models with intermediate performance like davinci-003 (on text completion task). GPT-Neo-Ro and RoGPT2 show lower performance, while davinci-003 (on the backtranslation task) and mBART have the lowest scores. Concerning the variability of the models, Flan-T5 has the highest standard deviation (0.2372), illustrating variability in its BLEU scores. Lower variability and better consistency are shown by GPT-Neo-Ro, Opus-MT, RoGPT2, and davinci-003 (on text completion task).
  • ROUGE—concerning ROUGE score results, Flan-T5 is also on top of the list with the highest mean score, followed by Opus-MT, while the lowest mean scores are registered for the mBART model. From a variability perspective, Flan-T5 shows the highest standard deviation (0.2437), hence the biggest variability in its ROUGE scores, while among the consistent models are GPT-Neo-Ro, Opus-MT, and RoGPT2.
Figure A4. Text similarity metrics for news domain.
Discussion for news dataset textual similarity results:
  • Cosine similarity—among the cosine scores, mBART has the highest mean score (0.6888), which indicates that, on average, it produces the most human-like texts. Opus-MT follows closely with a mean score of 0.6717. GPT-Neo-Ro, RoGPT2, and davinci-003 have mean scores above 0.4 but below the top-performing models. In terms of variability, Flan-T5 has the highest standard deviation (0.2389), which indicates a wide range of scores and, hence, higher variability. A more consistent performance is shown by GPT-Neo-Ro and RoGPT2.
  • BLEU—similar to cosine scores result, mBART is at the top of the list with the highest mean BLEU score (0.3108), followed by Opus-MT (0.2854). The lowest BLEU mean scores are reached for RoGPT2 and davinci-003 (on the backtranslation task). Similar to the cosine analysis, variability scores (standard deviations) show Flan-T5 in the top list for inconsistent results, while the more consistent models were GPT-Neo-Ro, RoGPT2, and davinci-003.
  • ROUGE—mBART shows the highest mean ROUGE score (0.3843) and a relatively high variability illustrated by its standard deviation score. On the contrary, GPT-Neo-Ro has the lowest mean score for ROUGE and is among the more stable models, having a lower standard deviation and a more consistent performance.
Figure A5. Text similarity metrics for medical domain.
Discussion for medical dataset textual similarity results:
  • Cosine similarity—the model with the highest mean cosine similarity score is Flan-T5 (0.4716), followed by davinci-003 (on backtranslation task), which produces the texts most similar to the human-written ones. The least similar texts are produced by the mBART and Opus-MT models, which exhibit mean scores above 0.27. In terms of variability, Flan-T5 is also the most variable model, as shown by its standard deviation (0.2347), while GPT-Neo-Ro and RoGPT2 show more consistent results.
  • BLEU—Flan-T5 has the highest mean BLEU score (0.2193), illustrating the best performance, while Opus-MT exhibits the lowest mean BLEU score (0.0197). Moreover, Flan-T5 exhibits the greatest variability, as shown by its standard deviation score (0.1995), while RoGPT2 has lower standard deviations, thus causing more stability for the results.
  • ROUGE—Flan-T5 also shows the highest mean ROUGE score (0.2983), which shows better performance on the one hand and the highest variability on the other hand, with the highest standard deviation as well (0.2263). On the contrary, GPT-Neo-Ro shows a lower mean ROUGE score (0.1374) but proves higher stability in terms of standard deviation.
Figure A6. Text similarity metrics for Legal domain.
Discussion for legal dataset textual similarity results:
  • Cosine similarity—among the models, Flan-T5 exhibits the highest mean for cosine similarity (0.6805), generating the texts most similar to the human set; nonetheless, it produces the most variable scores, as shown by the standard deviation (0.1607). On the contrary, the lowest mean is observed for davinci-003 (on backtranslation task) (0.3569), which suggests the lowest similarity with the human texts. A high standard deviation is noticed for mBART.
  • BLEU—similarly, Flan-T5 exhibits the highest BLEU score (0.4057) and a moderate to high standard deviation (0.1948), indicating variability in its BLEU scores. The lowest mean BLEU score is observed for davinci-003 (on the backtranslation task) (0.0540), showing a moderate standard deviation.
  • ROUGE—ROUGE scores also highlight Flan-T5 with the highest mean score (0.5191), indicating the highest similarity, and it has a moderate to high standard deviation (0.1705), which shows a high variability in its scores. GPT3 has a mean score of 0.1873 and a low standard deviation, which makes it more consistent, while mBART has the lowest mean score but a high variability in scores, shown by its high standard deviation.
Figure A7. Text similarity metrics for RoCHI domain.
Discussion for scientific dataset (RoCHI) textual similarity results:
  • Comparing the average (mean) cosine similarity scores among the models for the RoCHI dataset, the Opus-MT model has the highest mean score (see Table 5), indicating that on average, it produces text that is more similar to human-written text according to this metric. Flan-T5 also has a relatively high mean score (0.4719), while GPT-Neo-Ro has a slightly lower mean (0.4024), followed closely by RoGPT2 and davinci-003 (on text completion task). Intermediate mean scores are observed for davinci-003 (on backtranslation task) and mBART. Looking at the standard deviation scores (presented in Table 5), Flan-T5 has the highest standard deviation (0.2089), indicating a wide range of scores and potential inconsistency. Opus-MT has a relatively low standard deviation (0.1019), suggesting that its scores are more consistent across different text pairs, while other models have standard deviations that fall somewhere in between.
  • BLEU scores demonstrate consistent results with the cosine similarity scores, showing Opus-MT with the highest mean BLEU score (0.2812), indicating better performance according to BLEU. Flan-T5 follows closely with a mean score of 0.1652; GPT-Neo-Ro and davinci-003 (on text completion task) also have relatively high scores, while RoGPT2 and davinci-003 (on backtranslation task) have lower mean scores, with mBART showing the lowest mean score (0.1136). The standard deviations for BLEU scores vary, with Flan-T5 having the highest variability (0.1658), which suggests inconsistency in its performance, while Opus-MT and RoGPT2 have relatively low standard deviations, indicating more stable performance.
  • ROUGE scores expose Opus-MT with the highest score (0.3533), indicating better performance, followed by Flan-T5 and RoGPT2 with relatively high similarity. Davinci-003 and GPT-Neo-Ro have intermediate mean scores, meaning a moderate similarity, while mBART shows the lowest mean score (0.1624), signifying the least similarity with the human texts. In terms of variability, Flan-T5 has the highest standard deviation (0.2038), suggesting its ROUGE scores vary widely, while Opus-MT has a low standard deviation, indicating more consistent performance.

Appendix D. Pairwise Wilcoxon Test

Figure A8. Pairwise Wilcoxon test by model per domain.

Appendix E. Top ReaderBench Textual Complexity Indices

Figure A9. Top three RBIs for books domain.
Figure A10. Top three RBIs for legal domain.
Figure A11. Top three RBIs for medical domain.
Figure A12. Top three RBIs for news domain.
Figure A13. Top three RBIs for RoCHI domain.

References

  1. Weidinger, L.; Mellor, J.F.; Rauh, M.; Griffin, C.; Uesato, J.; Huang, P.; Cheng, M.; Glaese, M.; Balle, B.; Kasirzadeh, A.; et al. Ethical and social risks of harm from Language Models. arXiv 2021, arXiv:2112.04359. [Google Scholar]
  2. Solaiman, I.; Brundage, M.; Clark, J.; Askell, A.; Herbert-Voss, A.; Wu, J.; Radford, A.; Krueger, G.; Kim, J.W.; Kreps, S.; et al. Release Strategies and the Social Impacts of Language Models. arXiv 2019, arXiv:1908.09203. [Google Scholar]
  3. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates: San Jose, CA, USA, 2017; pp. 5998–6008. [Google Scholar]
  4. Rush, A.M.; Chopra, S.; Weston, J. A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015. [Google Scholar]
  5. Serban, I.; Sordoni, A.; Bengio, Y.; Courville, A.C.; Pineau, J. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015. [Google Scholar]
  6. Karpukhin, V.; Oğuz, B.; Min, S.; Lewis, P.; Wu, L.Y.; Edunov, S.; Chen, D.; Yih, W. Dense Passage Retrieval for Open-Domain Question Answering. arXiv 2020, arXiv:2004.04906. [Google Scholar]
  7. Pandya, H.A.; Bhatt, B.S. Question Answering Survey: Directions, Challenges, Datasets, Evaluation Matrices. arXiv 2021, arXiv:2112.03572. [Google Scholar]
  8. Wang, Z. Modern Question Answering Datasets and Benchmarks: A Survey. arXiv 2022, arXiv:2206.15030. [Google Scholar]
  9. Chowdhury, T.; Rahimi, R.; Allan, J. Rank-LIME: Local Model-Agnostic Feature Attribution for Learning to Rank. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, Taipei, Taiwan, 23–27 July 2023. [Google Scholar]
  10. Zheng, H.; Zhang, X.; Chi, Z.; Huang, H.; Yan, T.; Lan, T.; Wei, W.; Mao, X. Cross-Lingual Phrase Retrieval. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022. [Google Scholar]
  11. Chakraborty, S.; Bedi, A.S.; Zhu, S.; An, B.; Manocha, D.; Huang, F. On the Possibilities of AI-Generated Text Detection. arXiv 2023, arXiv:2304.04736. [Google Scholar]
  12. Clark, E.; August, T.; Serrano, S.; Haduong, N.; Gururangan, S.; Smith, N.A. All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 1–6 August 2021. [Google Scholar]
  13. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. Preprint, 2018. Available online: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 6 November 2023).
  14. Lamb, A. A Brief Introduction to Generative Models. arXiv 2021, arXiv:2103.00265. [Google Scholar]
  15. Bender, E.M.; Koller, A. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5185–5198. [Google Scholar]
  16. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  17. Michel-Villarreal, R.; Vilalta-Perdomo, E.; Salinas-Navarro, D.E.; Thierry-Aguilera, R.; Gerardou, F.S. Challenges and Opportunities of Generative AI for Higher Education as Explained by ChatGPT. Educ. Sci. 2023, 13, 856. [Google Scholar] [CrossRef]
  18. Farrelly, T.; Baker, N. Generative Artificial Intelligence: Implications and Considerations for Higher Education Practice. Educ. Sci. 2023, 13, 1109. [Google Scholar] [CrossRef]
  19. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  20. Christiano, P.; Leike, J.; Brown, T.B.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. arXiv 2023, arXiv:1706.03741. [Google Scholar]
  21. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1532–4435. [Google Scholar]
  22. Kudo, T.; Richardson, J. SentencePiece: A simple and language-independent subword tokenizer and detokenizer for Neural Text Processing. arXiv 2018, arXiv:1808.06226. [Google Scholar]
  23. Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. arXiv 2016, arXiv:1508.07909. [Google Scholar]
  24. Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models Are Zero-Shot Learners. arXiv 2022, arXiv:2109.01652. [Google Scholar]
  25. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
  26. Williams, A.; Nangia, N.; Bowman, S.R. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. arXiv 2018, arXiv:1704.05426. [Google Scholar]
  27. Hermann, K.M.; Kočiský, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching Machines to Read and Comprehend. arXiv 2015, arXiv:1506.03340. [Google Scholar]
  28. Narayan, S.; Cohen, S.B.; Lapata, M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. arXiv 2018, arXiv:1808.08745. [Google Scholar]
  29. Niculescu, M.A.; Ruseti, S.; Dascalu, M. RoGPT2: Romanian GPT2 for text generation. In Proceedings of the 33rd International Conference on Tools with Artificial Intelligence (ICTAI), Washington, DC, USA, 1–3 November 2021; pp. 1154–1161. [Google Scholar]
  30. Dumitrescu, S.; Rebeja, P.; Lorincz, B.; Gaman, M.; Avram, A.; Ilie, M.; Pruteanu, A.; Stan, A.; Rosia, L.; Iacobescu, C.; et al. LiRo: Benchmark and leaderboard for Romanian language tasks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Online, 6–14 December 2021. [Google Scholar]
  31. Buzea, M.; Trausan-Matu, S.; Rebedea, T. Automatic Romanian Text Generation using GPT-2. U.P.B. Sci. Bull. Ser. C Electr. Eng. Comput. Sci. 2022, 84, 2286–3540. [Google Scholar]
  32. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the 8th International Conference on Learning Representations, ICLR, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  33. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  34. Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
  35. Niculescu, M.A.; Ruseti, S.; Dascalu, M. RoSummary: Control Tokens for Romanian News Summarization. Algorithms 2022, 15, 472. [Google Scholar] [CrossRef]
  36. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. arXiv 2022, arXiv:2210.11416. [Google Scholar]
  37. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. [Google Scholar]
  38. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual Denoising Pre-training for Neural Machine Translation. arXiv 2020, arXiv:2001.08210. [Google Scholar] [CrossRef]
39. Bojar, O.; Chatterjee, R.; Federmann, C.; Graham, Y.; Haddow, B.; Huck, M.; Jimeno, Y.A.; Koehn, P.; Logacheva, V.; Monz, C.; et al. Findings of the 2016 Conference on Machine Translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers; Association for Computational Linguistics: Berlin, Germany, 2016; pp. 131–198. [Google Scholar]
  40. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 483–498. [Google Scholar]
  41. Tiedemann, J.; Thottingal, S. OPUS-MT—Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal, 4–6 May 2020. [Google Scholar]
  42. Junczys-Dowmunt, M.; Grundkiewicz, R.; Dwojak, T.; Hoang, H.; Heafield, K.; Neckermann, T.; Seide, F.; Germann, U.; Aji, A.; Bogoychev, N.; et al. Marian: Fast Neural Machine Translation in C++. In Proceedings of the ACL 2018, System Demonstrations, Melbourne, Australia, 15–20 July 2018; pp. 116–121. [Google Scholar]
  43. Lavergne, T.; Urvoy, T.; Yvon, F. Detecting Fake Content with Relative Entropy Scoring. In Proceedings of the 2008 International Conference on Uncovering Plagiarism, Authorship and Social Software Misuse, Patras, Greece, 22 July 2008; Volume 377, pp. 27–31. [Google Scholar]
  44. Gehrmann, S.; Strobelt, H.; Rush, A. GLTR: Statistical Detection and Visualization of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Florence, Italy, 28 July–2 August 2019; pp. 111–116. [Google Scholar]
  45. Mitchell, E.; Lee, Y.; Khazatsky, A.; Manning, C.D.; Finn, C. DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. arXiv 2023, arXiv:2301.11305. [Google Scholar]
  46. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
47. Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997, 30, 1145–1159. [Google Scholar] [CrossRef]
  48. Tian, E.; Cui, A. GPTZero: Towards Detection of AI-Generated Text Using Zero-Shot and Supervised Methods. Available online: https://gptzero.me (accessed on 8 January 2024).
  49. Lee, N.; Bang, Y.; Madotto, A.; Khabsa, M.; Fung, P. Towards Few-Shot Fact-Checking via Perplexity. arXiv 2021, arXiv:2103.09535. [Google Scholar]
  50. Kleinberg, J. Bursty and Hierarchical Structure in Streams. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; pp. 91–101. [Google Scholar]
  51. Habibzadeh, F. GPTZero Performance in Identifying Artificial Intelligence-Generated Medical Texts: A Preliminary Study. J. Korean Med. Sci. 2023, 38, e319. [Google Scholar] [CrossRef] [PubMed]
  52. Verma, H.K.; Singh, A.N.; Kumar, R. Robustness of the Digital Image Watermarking Techniques against Brightness and Rotation Attack. arXiv 2009, arXiv:0909.3554. [Google Scholar]
  53. Langelaar, G.C.; Setyawan, I.; Lagendijk, R.L. Watermarking digital image and video data. A state-of-the-art overview. IEEE Signal Process. Mag. 2000, 17, 20–46. [Google Scholar] [CrossRef]
  54. Kirchenbauer, J.; Geiping, J.; Wen, Y.; Katz, J.; Miers, I.; Goldstein, T. A Watermark for Large Language Models. arXiv 2023, arXiv:2301.10226. [Google Scholar]
  55. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. 2019. Available online: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (accessed on 10 November 2023).
56. Shijaku, R.; Canhasi, E. ChatGPT Generated Text Detection; Unpublished Technical Report; 2023. Available online: https://www.researchgate.net/profile/Ercan-Canhasi/publication/366898047_ChatGPT_Generated_Text_Detection/links/63b76718097c7832ca932473/ChatGPT-Generated-Text-Detection.pdf (accessed on 8 January 2024).
  57. Krishna, K.; Song, Y.; Karpinska, M.; Wieting, J.; Iyyer, M. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. arXiv 2023, arXiv:2303.13408. [Google Scholar]
58. Masala, M.; Ruseti, S.; Dascalu, M. RoBERT – A Romanian BERT Model. In Proceedings of the 28th International Conference on Computational Linguistics, Online, 8–13 December 2020; International Committee on Computational Linguistics: Barcelona, Spain, 2020; pp. 6626–6637. [Google Scholar]
59. Liu, J.-J.; Liu, J.-C. Permeability Predictions for Tight Sandstone Reservoir Using Explainable Machine Learning and Particle Swarm Optimization. Geofluids 2022, 2022, 2263329. [Google Scholar] [CrossRef]
  60. Chen, J.; Zhang, Y.; Hu, J. Synergistic effects of instruction and affect factors on high- and low-ability disparities in elementary students’ reading literacy. Read. Writ. J. 2021, 34, 199–230. [Google Scholar] [CrossRef]
  61. Vangala, S.R.; Bung, N.; Krishnan, S.R.; Roy, A. An interpretable machine learning model for selectivity of small-molecules against homologous protein family. Future Med. Chem. 2022, 14, 1441–1453. [Google Scholar] [CrossRef]
  62. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4765–4774. [Google Scholar]
  63. Singhal, A. Modern Information Retrieval: A Brief Overview. IEEE Data Eng. Bull. 2001, 24, 35–43. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
