1. Introduction
The rapid development and widespread adoption of Large Language Models (LLMs) have significantly enhanced the quality of Artificial Intelligence (AI)-generated text [1, 2]. The exceptional performance of these models in complex tasks such as text summarization, machine translation, and natural language generation has led to their extensive use across various domains [3, 4]. Despite these advancements in generation quality, language models still exhibit tendencies to produce misinformation and hallucinated content. This raises significant concerns about the accuracy and reliability of information in AI-generated text. In domains where intellectual originality is critical, such as scientific publishing, academic peer review, and education, the unregulated use of synthetic text introduces substantial risks, including plagiarism and violations of academic integrity. Consequently, the reliable and accurate detection of AI-generated text has become a critical research challenge in the modern digital landscape.
Although numerous transformer-based models have been introduced in the literature for detecting AI-generated text, these systems exhibit notable limitations in dynamic, complex and structured scenarios. In many existing approaches, statistical features such as perplexity and sentence length variability play a key role in text classification. The tendency of AI models to generate more structured and lower-variance text is often considered a distinguishing signal between human-written and synthetic content.
However, this approach introduces a significant challenge, particularly in formal and structured text domains such as academic writing. Human-authored texts, including scientific articles that follow predefined structures and inherently exhibit low-variance linguistic characteristics, may be misclassified as AI-generated by existing detection systems [
5,
6]. Another major limitation lies in the limited out-of-distribution (OOD) generalization ability of current detectors. As the architectures of modern language models continue to evolve and diversify, many detection models tend to overfit to patterns specific to particular generative models during training. As a result, their performance often degrades when evaluated on previously unseen model architectures [
7,
8]. This indicates that detection systems may develop structural biases against formal human writing and may rely on dataset-specific superficial cues rather than capturing underlying linguistic features.
To address these stylistic and structural limitations in the literature, this study proposes a methodological framework utilizing the XLM-RoBERTa [9] architecture as a classification backbone, with a focus on improving generalization capability within formal and academic discourse. The primary objective is to reduce the false positive bias exhibited by existing detection models toward formal human-written texts. To this end, a “Combined Dataset” comprising a total of 63,000 samples was constructed, consisting of human-written texts and outputs from six different artificial intelligence models (Llama-3.1 [2], Gemma-2 [10], Qwen-2.5 [11], Mistral [12], Phi-3 [13], Falcon [14]) in a fully balanced distribution (31,500 human, 31,500 AI). This multi-model data strategy aims to limit overfitting to stylistic features specific to individual generative models, enable the framework to learn structural distinctions between human and AI-generated text, and improve generalization across diverse model architectures.
In the second stage, a controlled ablation framework was designed specifically for a test set of 1200 samples to evaluate the model’s true detection capability and to highlight the issue of source data contamination, which is often overlooked in the literature. Prior to synthetic generation, the source texts were preprocessed to remove structural inconsistencies, thereby limiting the likelihood of language models reproducing superficial artifacts and enabling a more controlled evaluation setting based on cleaner data. The model was evaluated in this artifact-free environment, applied only to the test set, and achieved an accuracy of 93.42% and a recall of 94.67%. These results suggest that the model has learned consistent linguistic patterns characteristic of AI-generated text in structured contexts, rather than relying on superficial data artifacts for classification. In this context, the primary contributions of this study can be summarized as follows:
Addressing the False Positive Paradox in Academic Writing: To mitigate the tendency to misclassify formal and structured human-written texts—such as academic articles—as AI-generated, we propose a balanced and diverse training data strategy consisting of 63,000 samples. This approach aims to enable detection systems to make more reliable and balanced decisions specifically when evaluating human-authored academic content.
Enhancing Out-of-Distribution (OOD) Robustness and Cross-Model Generalization: By incorporating outputs from six different large language models (Llama-3.1, Gemma-2, Qwen-2.5, Mistral, Phi-3, and Falcon) into the training process, the model is exposed to diverse generative distributions. This data-centric approach aims to limit overfitting to model-specific patterns and to improve generalization across unseen architectures within the formal text domain.
Evaluating Memorization Tendencies via Data Purification: The model’s reliance on superficial data artifacts is examined through a controlled data purification (ablation) strategy applied exclusively to the test set. The results suggest that the proposed approach leverages consistent linguistic patterns characteristic of AI-generated text within academic discourse, providing a foundation for more resilient detection mechanisms.
2. Related Works
2.1. Core Approaches in AI-Generated Text Detection
Modern approaches to detecting machine-generated text largely rely on adapting pre-trained language models (PLMs), such as BERT and RoBERTa, to classification tasks, or on statistical probability-based methods. Guo et al. [15] introduced the Human ChatGPT Comparison Corpus (HC3), one of the early large-scale datasets, to compare the language capabilities of ChatGPT with those of human experts. Classifiers built on the RoBERTa architecture were fine-tuned on this dataset, and F1 scores exceeding 97% were reported, particularly in in-domain evaluations focusing on a single generative model.
As an alternative to conventional fine-tuning approaches, Mitchell et al. [6] proposed a probability-based, zero-shot method called DetectGPT. Their work shows that text generated by Large Language Models (LLMs) tends to lie in regions of negative curvature within the underlying model’s log-probability function. The DetectGPT method leverages the model’s own probability outputs to perform detection without requiring a separately trained classifier, and has been shown to achieve strong performance, particularly in white-box settings.
However, the high performance reported in these foundational studies on static and controlled datasets has been increasingly questioned in more realistic and adversarial settings. Sadasivan et al. [8] examined the reliability of existing automated text identification methods through both empirical and theoretical analyses. Their findings show that simple paraphrasing attacks can effectively circumvent a broad range of detection techniques, including watermarking-based approaches, neural network-based classifiers, and zero-shot methods. Furthermore, their theoretical analysis suggests that as language models increasingly approximate human language, even the most advanced detection methods may achieve performance only marginally better than random guessing.
While the aforementioned foundational studies have laid the groundwork for AI text detection and achieved high performance on limited and controlled datasets, their success is largely attributed to overfitting to a single language model (predominantly the GPT family) or specific data distributions.
To address this limitation, this study differs from traditional approaches by employing an expanded “Large Mixture” training strategy that incorporates outputs from six contemporary language models. This approach aims to reduce the tendency of the fine-tuned XLM-RoBERTa classifier to overfit and to enhance its cross-architecture generalization capability across different generative model architectures.
2.2. Out-of-Distribution (OOD) Generalization and Architectural Adaptation Challenges
One of the most significant technical challenges faced by current detection systems in dynamic real-world scenarios is their limited generalization ability to next-generation language models beyond the training distribution (Out-of-Distribution, OOD). It is widely reported in the literature that detectors trained on data from a specific model experience a significant drop in performance when evaluated on models with different architectures. In this context, a recent study by Borile and Abrate [
16] (2025) demonstrates that detection models tend to learn statistical artifacts (spurious correlations) specific to the training data, rather than capturing model-independent linguistic patterns. Their findings further show that this memorization tendency leads to substantial performance degradation when models are exposed to architectures not seen during training.
With the aim of mitigating overfitting and one-dimensional memorization, Guo et al. [17] (2024) proposed formulating the detection task within a contrastive learning-based framework, arguing that standard binary classification approaches lead models to overfit to the distribution of a single language model. This approach aims to enable models to learn more discriminative representations capable of distinguishing between diverse generation styles. Furthermore, the authors emphasize that exposing detection systems to outputs from multiple language models, rather than relying on limited, single-model datasets, is a key factor in improving generalization capability.
To address the limitations in OOD generalization and the challenges associated with memorization-driven learning highlighted in the literature, this study adopts a methodological framework that reduces dependency on a single architecture or data distribution. Accordingly, a large-scale and strategically constructed “Large Mixture” dataset was developed for the training phase, incorporating outputs from contemporary open-source language models, including Llama-3.1, Gemma-2, Qwen-2.5, Mistral, Phi-3, and Falcon. This multi-model approach aims to limit the tendency of the XLM-RoBERTa backbone to overfit to model-specific patterns and to enhance its generalization capability by enabling it to learn shared stylistic characteristics across diverse AI model architectures.
2.3. The ‘False Positive’ Paradox in Formal and Structured Texts
Another well-documented drawback of AI text detection systems is the systematic bias they exhibit toward formal and structured human writing. In a comprehensive study, Liang et al. [
7] (2023) report that widely used commercial AI detectors misclassified 61.3% of TOEFL essays written by non-native English speakers as AI-generated. The authors show that these detectors primarily rely on low perplexity and low variability metrics when evaluating text. Formal, structured, and academic writing styles exhibit predictable patterns and structural regularities, similar to those produced by Large Language Models (LLMs). As a result, detection systems struggle to distinguish between structured human-written text and synthetic content, leading to an increased rate of false positive classifications.
While the vast majority of studies in the literature focus on improving the detection rates of synthetic text, the problem of reducing false positive rates for human-generated content has received comparatively limited attention. However, the ability of detection systems to identify synthetic text alone is insufficient for real-world applications; the accurate classification of human-written text must also be considered a critical evaluation criterion.
To address this limitation, highly formal, structured, and rule-based human-written texts were intentionally integrated into the large-scale training dataset of 63,000 samples constructed in this study. Through this balanced data curation strategy, the XLM-RoBERTa classification head is trained to more effectively distinguish low-complexity and regular structural patterns in human writing from synthetic generations, learn decision boundaries more precisely, and mitigate the systematic bias toward formal texts reported in the literature.
2.4. Ablation Gaps in Synthetic Artifacts and Data Purification
In the construction of benchmark datasets for deepfake text detection, data quality and structural cleanliness are often insufficiently addressed in the literature. In a study by Pu et al. [
18] (2023), it is shown that when source human texts used as inputs to language models contain formatting errors, HTML code, or raw metadata tags, generative models tend to reproduce these structural artifacts in their outputs. This leads detection models to rely on superficial data artifacts and statistical leakage rather than learning the underlying linguistic and semantic characteristics of AI-generated text. Consequently, models may achieve misleadingly high detection performance—referred to as spurious performance—by exploiting such artifacts instead of capturing genuine discriminative features.
The significance of this issue is further highlighted in the results report of the recent GenAI Detection Task 3 [19] (2025). The authors show that existing detectors often rely on superficial artifacts originating from source texts rather than capturing the fundamental linguistic characteristics of AI-generated content. In this context, they emphasize that applying consistent preprocessing across datasets is essential for properly assessing model robustness. Nevertheless, there is a notable lack of systematic and transparent ablation studies in the literature that examine the impact of removing such artifacts exclusively during the testing phase. A considerable portion of existing work reports performance on raw datasets, which may obscure the models’ tendency to rely on memorization.
To address this methodological gap in the literature and to transparently evaluate the memorization tendencies of the proposed approach, a comprehensive data purification strategy is applied exclusively to the final (OOD) test set in this study. This zero-artifact ablation setting, designed to prevent the model from exploiting superficial data artifacts or formatting remnants, aims to assess the framework’s capacity to learn consistent stylistic markers characteristic of AI-generated text rather than relying on memorization-based detection.
3. Materials and Methods
In this section, the two-stage deep learning training and evaluation framework is presented. While leveraging a conventional XLM-RoBERTa architecture as its classification backbone, the primary focus of this methodology lies in its strategic data curation and rigorous evaluation design, developed to address the challenges of overfitting and the “False Positive” paradox in formal and structured human texts. The experimental design is structured to limit the model’s tendency to overfit to specific data distributions or superficial structural noise. Accordingly, the process consists of a large-scale initial training phase involving 60,000 samples, followed by an analysis of the observed overfitting behavior, and a final adaptation stage in which a strategic dataset of 3000 samples is incorporated to improve cross-architecture generalization capability. The effectiveness of this data-centric strategy in capturing consistent stylistic markers, rather than relying on superficial cues, is evaluated in a specialized test setting largely purified of synthetic artifacts. The overall architecture of this two-stage training and evaluation pipeline is illustrated in
Figure 1.
The model training architecture and data flow diagram presented in
Figure 1 consist of three main components:
Initial Training and Overfitting Analysis: In the first stage, “Dataset A,” consisting of 60,000 samples and designed to capture domain-specific linguistic features, is used. The evaluation results from the initial training phase reveal that the model exhibits a tendency to overfit to the underlying data distribution.
Strategic Data Integration and Adaptation Training: To mitigate the effects of overfitting, “Dataset B,” comprising 3000 out-of-distribution (OOD) samples, is incorporated into the training process. The XLM-RoBERTa architecture is then retrained on the combined dataset—consisting of 63,000 samples in total—to improve its ability to generalize across diverse model architectures within the academic domain.
Zero-Artifact Test Set and Evaluation: The performance of the final model is evaluated on an isolated test set that undergoes extensive preprocessing to remove data artifacts (e.g., HTML tags, formatting inconsistencies). This evaluation setting is designed to assess the model’s structural resilience based on consistent stylistic patterns rather than relying on superficial, dataset-specific cues.
3.1. Dataset Preparation
3.1.1. Construction of the Initial Dataset (Dataset A)
Dataset A, designed to enable the model to learn inherent linguistic features of formal text, was constructed as a fully balanced dataset (30,000 human-written and 30,000 AI-generated samples). To reduce the tendency of deep learning models to misclassify formal and structured human-written texts—such as academic articles—as synthetic content (false positives), scientific sources were primarily used for the human-written component. In this context, the “arxiv-summarization” [
20] dataset, available on the Hugging Face platform and containing approximately 210,000 academic articles, was selected as the primary source. From this dataset, 30,000 human-written texts were sampled. The 30,000 synthetic texts forming the AI component of Dataset A were generated using outputs from three modern large language models (Llama-3.1, Gemma-2, and Qwen-2.5) to mitigate the commonly reported issue of single-model overfitting. A total of 10,000 samples were obtained from each model, ensuring a balanced dataset aligned with the human-written texts. This data collection and balancing strategy establishes a training environment that enables the XLM-RoBERTa model to learn generalizable linguistic patterns characteristic of AI-generated text, rather than relying on content-specific features. Dataset A is illustrated in
Figure 2a.
3.1.2. Strategic Adaptation Set (Dataset B) and Out-of-Distribution (OOD) Integration
Following the initial training on Dataset A, the model was observed to exhibit limitations when evaluated on next-generation architectures not included in the training set, indicating an out-of-distribution (OOD) generalization issue. To address this limitation and enhance the model’s robustness and adaptability, Dataset B, consisting of 3000 samples, was constructed.
This strategic adaptation set is composed of 1500 human-written texts [
21] and 1500 AI-generated samples. The synthetic component includes outputs from three OOD models (Mistral, Phi-3, and Falcon), with 500 samples collected from each model. Dataset B is illustrated in
Figure 2b.
This “Large Mixture,” comprising a total of 63,000 samples, was randomly shuffled to ensure a balanced distribution, and the model was retrained on this combined dataset. This approach aims to limit the model’s tendency to overfit to patterns specific to individual generative models and to improve its generalization capability across diverse architectures. The “Large Mixture” dataset is illustrated in
Figure 2c.
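The composition of the “Large Mixture” can be checked with a short sketch. The per-source counts below are the ones reported for Dataset A and Dataset B; the record layout, the helper names, and the shuffle seed are assumptions made only for illustration, not the authors' implementation.

```python
import random
from collections import Counter

# Per-source sample counts as reported for Dataset A and Dataset B.
DATASET_A = {"human": 30_000, "Llama-3.1": 10_000, "Gemma-2": 10_000, "Qwen-2.5": 10_000}
DATASET_B = {"human": 1_500, "Mistral": 500, "Phi-3": 500, "Falcon": 500}


def build_index(spec):
    """Expand a {source: count} spec into (source, label) records (0 = human, 1 = AI)."""
    return [(src, 0 if src == "human" else 1) for src, n in spec.items() for _ in range(n)]


def build_large_mixture(seed=42):
    """Concatenate both datasets and shuffle them; the seed is an assumed value."""
    records = build_index(DATASET_A) + build_index(DATASET_B)
    random.Random(seed).shuffle(records)
    return records


mixture = build_large_mixture()
label_counts = Counter(label for _, label in mixture)
source_counts = Counter(src for src, _ in mixture)
print(len(mixture), label_counts[0], label_counts[1])
```

Under these counts the mixture holds 63,000 records split evenly between the two labels, while the three OOD generators contribute only 500 samples each.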
3.1.3. Zero-Artifact Test Set and Evaluation Setting
To provide a transparent evaluation of the cross-model detection performance and generalization capability of the proposed XLM-RoBERTa model, a dedicated “Zero-Artifact Test Set” consisting of 1200 samples—fully independent of the training process—was constructed. To minimize algorithmic bias during evaluation, the test set was designed as a balanced dataset comprising 600 human-written and 600 AI-generated texts (50–50%).
The AI component of the test set includes synthetic texts sampled equally (100 per model) from six modern language models examined in this study (Llama-3.1, Gemma-2, Qwen-2.5, Mistral, Phi-3, and Falcon). This homogeneous distribution enables a more balanced assessment of the model’s performance across both architectures encountered during training and those considered out-of-distribution (OOD).
Prior to the final evaluation, all texts in the test set were subjected to a comprehensive preprocessing pipeline. Data artifacts that are commonly exploited by detection models in the literature and that can bias evaluation results (such as HTML tags, irregular spacing, markdown formatting, and prompt leakage) were largely removed. This zero-artifact approach provides a more controlled evaluation setting for assessing the model’s resilience against structural variations based on inherent linguistic markers rather than superficial structural noise. The “Zero-Artifact Test Set” used for evaluation is illustrated in
Figure 2d.
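A minimal sketch of such a cleaning step is given below. The exact rules used in the study are not specified, so every regular expression here, including the list of introductory phrases, is a hypothetical stand-in; prompt-leakage removal is not covered.

```python
import re

HTML_TAG_RE = re.compile(r"<[^>]+>")
# Hypothetical introductory phrases; the text gives only one example phrase.
INTRO_RE = re.compile(
    r"^\s*(?:(?:certainly|sure|of course)[,!.]?\s+)?"
    r"here is (?:the|an|your) abstract[^\n:.]*[:.]\s*",
    re.IGNORECASE,
)
MD_HEADING_RE = re.compile(r"^\s{0,3}#{1,6}\s+", re.MULTILINE)
LIST_MARKER_RE = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+", re.MULTILINE)
MD_EMPHASIS_RE = re.compile(r"(\*\*|\*|`)(.+?)\1")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Strip HTML tags, an introductory phrase, markdown markers, and irregular spacing."""
    text = HTML_TAG_RE.sub(" ", text)
    text = INTRO_RE.sub("", text, count=1)    # only at the very start of the text
    text = MD_HEADING_RE.sub("", text)
    text = LIST_MARKER_RE.sub("", text)       # list markers before emphasis markers
    text = MD_EMPHASIS_RE.sub(r"\2", text)    # keep the emphasized words themselves
    return WHITESPACE_RE.sub(" ", text).strip()


sample = "Certainly, here is the abstract:\n\n<p>**Deep** learning   models</p>\n- are *widely* used."
cleaned = clean_text(sample)
print(cleaned)
```

List markers are removed before emphasis markers so that a leading `*` bullet is not mistaken for the start of an italic span.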
3.1.4. Technical Parameters and Implementation Strategy for Data Generation
To ensure methodological transparency and high reproducibility, the data generation and curation pipeline was executed under a rigorous set of technical parameters and validation protocols.
The synthesis of academic content was performed using the Ollama framework and the Hugging Face Transformers library. To optimize VRAM utilization while preserving generative quality, Large Language Models (LLMs) were deployed using 4-bit quantization (BitsAndBytes) with bfloat16 compute precision.
The generation process was governed by distinct hyperparameter configurations tailored to the training and evaluation objectives:
Context and Prediction Limits: For the primary training distribution, a large context window (num_ctx: 16,384) was maintained to accommodate the structural complexity of full academic papers. The maximum output was constrained to 800 tokens (num_predict) for training samples and 250 tokens (max_new_tokens) for the adaptation and test sets.
Stochastic vs. Deterministic Decoding: A temperature-variant strategy was applied to evaluate model resilience. The training datasets (A and B) utilized a stochastic decoding approach with Temperature (T) values between 0.7 and 0.8 and Top-p (Nucleus Sampling) of 0.9. Conversely, the “Zero-Artifact” test set was generated using a high-stability configuration with T = 0.3, prioritizing linguistic consistency over variability to represent a more challenging detection scenario.
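The decoding settings above can be collected into per-split option dictionaries, as in the following sketch. Several points are assumptions rather than reported facts: that Dataset A used Ollama-style option names while Dataset B and the test set used Transformers-style names (inferred only from the parameter names quoted above), that the temperature was drawn uniformly from 0.7 to 0.8, and the split identifiers themselves. Top-p is omitted for the test set because no value is stated.

```python
import random


def decoding_options(split, rng):
    """Return generation options for one sample of the given split (illustrative only)."""
    if split == "dataset_a":
        # Ollama-style option names, as quoted in the text.
        return {
            "num_ctx": 16_384,
            "num_predict": 800,
            "temperature": round(rng.uniform(0.7, 0.8), 2),  # assumed: uniform draw
            "top_p": 0.9,
        }
    if split == "dataset_b":
        # Transformers-style option names; max_new_tokens is quoted in the text.
        return {
            "max_new_tokens": 250,
            "do_sample": True,
            "temperature": round(rng.uniform(0.7, 0.8), 2),  # assumed: uniform draw
            "top_p": 0.9,
        }
    if split == "zero_artifact_test":
        # Low-temperature configuration reported for the test set.
        return {"max_new_tokens": 250, "do_sample": True, "temperature": 0.3}
    raise ValueError(f"unknown split: {split}")


rng = random.Random(0)
options_a = decoding_options("dataset_a", rng)
options_b = decoding_options("dataset_b", rng)
options_test = decoding_options("zero_artifact_test", rng)
print(options_a, options_b, options_test)
```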
A role-based “Expert Academic Researcher” system prompt was implemented to enforce stylistic uniformity. In the strategic adaptation (Dataset B) and evaluation phases, a Title-to-Abstract generation strategy was used, where models were provided only with the scientific title of the paper. Technical constraints were embedded directly into the prompt to prohibit:
Introductory Artifacts: Phrases such as “Certainly, here is the abstract” were suppressed at the source.
Structural Noise: The use of bullet points, numbered lists, and multi-paragraph formatting was strictly avoided to ensure a continuous, formal academic prose structure.
Length Bias: A dynamic word-count matching strategy (capped at 400 words for training and 150–200 words for testing) was applied to prevent the detector from learning simple length-based discriminators.
3.2. XLM-RoBERTa Architecture
In this study, the XLM-RoBERTa architecture is employed as the primary classification backbone for detecting synthetic text. XLM-RoBERTa is an extension of the transformer-based RoBERTa architecture, pre-trained on large-scale multilingual data. The model is trained using the masked language modeling (MLM) objective and is capable of learning diverse linguistic structures and contextual patterns.
XLM-RoBERTa is a large-scale language model trained on the 2.5 TB multilingual CommonCrawl dataset, covering approximately 100 languages. This extensive training corpus enables the model to learn rich representations not only at the word level but also across sentence-level semantic relationships and syntactic structures. The underlying RoBERTa architecture incorporates several key improvements over BERT, including the removal of the Next Sentence Prediction (NSP) objective, training with larger batch sizes, and the use of dynamic masking.
The XLM-RoBERTa-base configuration is employed in this study. This architecture consists of 12 transformer encoder layers, each comprising a 768-dimensional hidden representation and 12 multi-head attention heads. The attention score, which forms the basis of this mechanism, is computed using the Query ($Q$), Key ($K$), and Value ($V$) matrices, as defined in Equation (1):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1}$$

Here, $d_k$ denotes the dimensionality of the key vectors. With approximately 270 million trainable parameters, this architecture enables the effective modeling of complex contextual relationships within text. Furthermore, the model can process input sequences of up to 512 tokens.
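The scaled dot-product attention of Equation (1) can be illustrated with a small pure-Python sketch for a single attention head. The toy 2×2 matrices and helper names are chosen only for illustration; the actual model operates on 64-dimensional heads (768 / 12) with learned projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]
    return matmul(weights, V), weights


# Toy inputs: two tokens, key dimensionality d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(Q, K, V)
print(weights)
print(output)
```

Each row of `weights` sums to one, so every output row is a convex combination of the rows of `V`.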
To adapt the contextual representations learned by the pre-trained XLM-RoBERTa model to a binary classification task, a task-specific classification head is integrated into the architecture. The special <s> (start-of-sentence) token, added to the beginning of each input sequence, captures the contextual representation of the entire text. The classification process is carried out as follows:
Representation: The 768-dimensional contextual vector corresponding to the <s> token from the final transformer layer is used as the feature representation of the entire input sequence.
Regularization: To mitigate overfitting, a dropout layer with a rate of 0.1 is applied to this vector.
Linear Transformation: The resulting representation is fed into a linear layer that projects it from a 768-dimensional space to a 2-dimensional output space (Human/AI classes).
Activation: In the output layer, the softmax function is applied to convert the raw class scores (logits) into a probability distribution:

$$P(y = i \mid x) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} \tag{2}$$

In this equation, $z_i$ denotes the predicted class score for class $i$, while $C$ represents the total number of classes, with $C = 2$.
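The four steps of the classification head (representation, dropout, linear projection, softmax) can be sketched in pure Python as follows. The input vector and the weights are random placeholders, not values from the trained model, and the inverted-dropout scaling is an assumed convention.

```python
import math
import random

HIDDEN_SIZE, NUM_CLASSES, DROPOUT_RATE = 768, 2, 0.1


def dropout(vec, rate, rng, training):
    """Inverted dropout: active only during training, identity at inference."""
    if not training:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def linear(vec, weight, bias):
    """Project a 768-dimensional vector onto NUM_CLASSES logits."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weight, bias, rng, training=False):
    """Dropout -> linear -> softmax, following the steps listed above."""
    hidden = dropout(cls_vector, DROPOUT_RATE, rng, training)
    return softmax(linear(hidden, weight, bias))


rng = random.Random(0)
cls_vector = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]  # placeholder <s> embedding
weight = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES
probs = classify(cls_vector, weight, bias, rng)
print(probs)  # [P(Human), P(AI)]
```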
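To ensure compatibility with the softmax activation function used in the classification head, the cross-entropy loss function (which minimizes the discrepancy between the predicted probability distribution and the ground truth labels) is employed during training. The loss value $L$, defined over $N$ samples and $C$ classes (with $C = 2$), is computed as follows:

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} y_{n,c} \log \hat{y}_{n,c} \tag{3}$$

where $y_{n,c}$ equals 1 if sample $n$ belongs to class $c$ and 0 otherwise, and $\hat{y}_{n,c}$ is the softmax probability predicted for class $c$ on sample $n$.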
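With one-hot labels, the cross-entropy loss reduces to the mean negative log-probability assigned to the true class of each sample. The sketch below computes it for three made-up predictions; the probabilities and labels are illustrative and do not come from the experiments.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-likelihood; probs[n][c] is the predicted probability, labels[n] the true class."""
    total = -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return total / len(labels)


# Illustrative softmax outputs for three samples (class 0 = Human, class 1 = AI).
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
loss = cross_entropy(probs, labels)
print(round(loss, 4))
```

The third sample is misclassified (probability 0.4 on the true class) and contributes most of the loss, which is the behavior that drives the gradient during fine-tuning.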
During training, the model was fine-tuned on the “Large Mixture” dataset (comprising 63,000 samples) while preserving its pre-trained weights. This approach aims to adapt the model to the task of deepfake text detection while retaining its previously learned linguistic representations.
4. Results
In this section, the training performance of the XLM-RoBERTa model on the expanded dataset of 63,000 samples and the final results obtained through the proposed framework on the “Zero-Artifact” test set (consisting of 1200 samples) are presented.
The fine-tuning process was limited to 3 epochs to mitigate the risk of overfitting. During training, a learning rate of and a mini-batch size of 16 were used. Examination of the training process indicates that the model exhibits a stable learning behavior. The training loss, which was approximately 0.42 at the end of the first epoch, decreased to 0.12 by the end of the third epoch.
In addition, the model achieved an accuracy of 95.8% on the validation set, which corresponds to 10% of the training data. This result suggests that the model effectively captured the underlying patterns in the dataset.
To address potential concerns regarding the model’s dependency on superficial data artifacts, we conducted a rigorous ablation study on the test set. We evaluated the performance under three conditions: (1) Raw Text with original formatting, (2) Without Intro Artifacts where common introductory phrases were removed, and (3) Lowercase Only where all casing information was eliminated. The obtained metrics are presented in
Table 1.
An examination of
Table 1 indicates that the model maintains a consistent performance across the evaluated testing scenarios. Notably, the removal of introductory artifacts did not lead to a decrease in accuracy, which remained stable at 93.41%. Furthermore, converting the text to lowercase resulted in a marginal change of only 0.17%. These results provide evidence that the proposed approach is largely independent of superficial formatting markers and effectively captures inherent stylistic and structural markers characteristic of academic discourse. The consistency of the false positive rate (7.83% or 47/600) across scenarios further suggests a degree of robustness in detecting AI-generated text beyond simple stylistic artifacts.
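The three ablation conditions can be expressed as simple text transforms applied before classification, as in the sketch below. It assumes each transform is applied independently to the raw text, uses a hypothetical pattern for introductory phrases, and substitutes a deliberately artifact-dependent toy detector for the real model in order to show what the ablation is designed to expose.

```python
import re

# Hypothetical pattern for introductory phrases; the study's exact rules are not given.
INTRO_RE = re.compile(
    r"^\s*(?:(?:certainly|sure|of course)[,!.]?\s+)?"
    r"here is (?:the|an|your) abstract[^\n:.]*[:.]\s*",
    re.IGNORECASE,
)

CONDITIONS = {
    "Raw Text": lambda text: text,
    "Without Intro Artifacts": lambda text: INTRO_RE.sub("", text, count=1),
    "Lowercase Only": lambda text: text.lower(),
}


def evaluate(predict, texts, labels):
    """Accuracy of `predict` under each ablation condition."""
    results = {}
    for name, transform in CONDITIONS.items():
        preds = [predict(transform(t)) for t in texts]
        results[name] = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return results


def toy_detector(text):
    """Stand-in detector that relies purely on an introductory artifact."""
    return 1 if text.lower().startswith("certainly") else 0


texts = ["Certainly, here is the abstract: We study X.", "We study Y in detail."]
labels = [1, 0]
results = evaluate(toy_detector, texts, labels)
print(results)
```

A detector that depends on the introductory phrase loses accuracy once the phrase is removed, whereas stable accuracy across the three conditions is the pattern reported for the proposed model.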
Table 2 presents a comparative performance analysis of the fine-tuned XLM-RoBERTa approach alongside current statistical and supervised detection methods. To ensure the objectivity of the evaluations, the testing process was conducted on a balanced Out-of-Distribution (OOD) test set consisting of 1200 samples. This dataset comprises 600 human-written academic texts and 600 AI-generated samples produced by several recent large language models, specifically Falcon, Phi-3, Mistral, Qwen-2.5, Gemma-2, and Llama-3.1 (100 samples each).
Based on the findings, the proposed framework appears to exhibit a notably stable performance across the different tested scenarios. The model achieved a baseline accuracy of 93.41% on raw texts while maintaining a relatively low false positive rate of 7.83% (47/600). Furthermore, the model’s performance remained consistent (93.24–93.41%) during the stages involving structural modifications. This outcome suggests that the methodology captures consistent linguistic patterns specific to academic writing, rather than relying on structural or superficial formatting cues.
In contrast, a significant degree of variability was observed in the performance of zero-shot statistical models such as DetectGPT, Fast-DetectGPT, and Binoculars. While their baseline accuracies ranged between 64.38% and 68.39%, they exhibited considerably higher ethical risks, with false positive rates recorded at 26.67%, 33.00%, and 64.33%, respectively. These models also seemed to be more sensitive to minor structural changes, such as lowercase conversion or the removal of introductory segments. Specifically, the decline in Binoculars’ accuracy during structural modifications might point toward a potential dependency of these methods on specific linguistic patterns found at the beginning of academic texts, which often results in a higher rate of false accusations. Furthermore, general-purpose supervised detectors, namely Hello-ChatGPT and OpenAI-RoBERTa, yielded results near the level of random chance (approximately 50–51%) on this specialized academic test set. Although these models showed lower FPRs, this was primarily attributed to an extreme class bias toward human-text predictions, effectively rendering them incapable of reliably identifying AI-generated academic content. These results suggest that models trained on general web content may encounter difficulties in distinguishing the unique terminological and structural complexities of academic discourse.
The findings indicate that domain-specific fine-tuning may offer a more consistent alternative to statistical or general-purpose approaches, particularly by lowering the false positive rate by a factor of roughly three to eight (7.83% versus 26.67–64.33%) relative to the zero-shot statistical baselines evaluated.
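The size of this reduction follows directly from the false positive rates quoted above, as the short calculation below shows. Only the reported percentages are used; the variable names are illustrative.

```python
# False positive rates (%) as reported in the text.
PROPOSED_FPR = 7.83
BASELINE_FPRS = {"DetectGPT": 26.67, "Fast-DetectGPT": 33.00, "Binoculars": 64.33}

# Ratio of each baseline's false positive rate to that of the proposed model.
ratios = {name: fpr / PROPOSED_FPR for name, fpr in BASELINE_FPRS.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

The ratios come out at about 3.4, 4.2, and 8.2 for DetectGPT, Fast-DetectGPT, and Binoculars, respectively.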
The confusion matrix, constructed to visualize the framework’s prediction performance on the 1200-sample test set and to analyze class-wise error patterns, is presented in Figure 3.
An examination of Figure 3 indicates that the model demonstrates a balanced performance in distinguishing between the two classes (Human and AI). The analysis based on the confusion matrix is summarized as follows:
Out of 600 synthetic texts, 568 (94.67%) were correctly classified as “AI (1)”. The remaining 32 instances misclassified as “Human” indicate that, while the model achieves a high detection rate for synthetic content, a limited number of errors still occur.
Similarly, 553 out of 600 human-written texts (92.17%) were correctly classified as “Human (0)”. However, 47 instances were misclassified as “AI (1)”, i.e., false positives. This may be attributed to the high level of formality and structural regularity of academic texts in the dataset, which can partially resemble the generative patterns of large language models.
Overall, the confusion matrix results demonstrate that the model achieves high performance in detecting synthetic texts, while maintaining a balanced level of accuracy in classifying human-written content.
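For reference, the class-wise figures above can be reproduced from the four confusion matrix counts. The sketch below treats "AI (1)" as the positive class; precision and F1 are not reported in the text and are derived here only as additional illustrations from the same counts.

```python
# Confusion matrix counts from the 1200-sample test set (positive class = AI).
tp = 568  # AI texts correctly classified as AI
fn = 32   # AI texts misclassified as Human
tn = 553  # Human texts correctly classified as Human
fp = 47   # Human texts misclassified as AI

total = tp + fn + tn + fp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)      # true positive rate on synthetic texts
specificity = tn / (tn + fp)  # correct rate on human-written texts
fpr = fp / (fp + tn)          # false positive rate
precision = tp / (tp + fp)    # derived, not reported in the text
f1 = 2 * tp / (2 * tp + fp + fn)  # derived, not reported in the text

for name, value in [("accuracy", accuracy), ("recall", recall),
                    ("specificity", specificity), ("FPR", fpr),
                    ("precision", precision), ("F1", f1)]:
    print(f"{name}: {value:.4f}")
```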
To further analyze the model’s learning dynamics and generalization behavior during training, the training and validation loss curves are presented in Figure 4.
An examination of Figure 4 indicates that the training loss exhibited a rapid decline from an initial value of approximately 0.8 and gradually decreased as training progressed, stabilizing at a low level by the end of the third epoch.
The validation loss maintained a consistently low and stable trend throughout training, suggesting that the model effectively learned without overfitting to the training data. Moreover, the diminishing fluctuations in the training loss over time indicate that the model progressively adapted to the underlying linguistic patterns in the dataset.
To evaluate the discriminative power of the proposed approach between classes, a Receiver Operating Characteristic (ROC) curve analysis was conducted. The ROC curve is illustrated in Figure 5.
The ROC curve presented in Figure 5 shows that the framework clearly deviates from the random guessing line and follows a trajectory close to the top-left corner. The Area Under the Curve (AUC) was calculated as 0.9810.
This high AUC value indicates that the model achieves strong discriminative performance in distinguishing between human-written and AI-generated texts. Because the AUC summarizes performance across all decision thresholds, this result also suggests that a high true positive rate can be attained while the false positive rate is kept low.
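As a reminder of what the AUC measures, it equals the probability that a randomly chosen AI-generated text receives a higher score than a randomly chosen human-written text, with ties counted as one half. The sketch below implements this pairwise definition in plain Python; the function name and the toy scores are hypothetical and unrelated to the model outputs in this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 (1 = AI, 0 = Human); scores: predicted P(AI).
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute the AUC.")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy scores: one AI text (0.4) is ranked below one human text (0.5).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a test set of this size; a rank-based implementation would be preferable for much larger sets.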
5. Discussion
The findings of this study demonstrate that the proposed methodological framework leveraging a conventional XLM-RoBERTa backbone achieves high discriminative performance, particularly for content written in formal and structured academic language. The “Large Mixture” training strategy employed in this study effectively limits the model’s tendency to overfit to specific generative architectures and contributes to more balanced performance across diverse generation styles. In particular, the results obtained in the “Zero-Artifact” test setting indicate that the model relies on structural patterns and stylistic markers inherent to academic discourse rather than superficial data artifacts.
The consistent performance exhibited across different architectures, such as Llama, Mistral, and Falcon, suggests that the proposed approach possesses a strong cross-architecture generalization capability within the context of direct LLM outputs. Notably, the approach’s ability to accurately distinguish human-written academic texts, which often share structural similarities with AI-generated content, emerges as a key finding of this study. This indicates that the model bases its decisions on contextual features specific to academic writing rather than relying solely on surface-level metrics.
However, the sustained stability of such detection performance may be subject to challenges related to distributional shift. As suggested by Pozzi et al. [22], exposure bias during the training or distillation of Large Language Models (LLMs) could potentially lead to constrained generative distributions, which detection systems might risk overfitting to. To address this potential risk, the current methodology seeks to integrate a degree of stochastic diversity into the data generation phase through the inclusion of a spectrum of decoding configurations, ranging from high-stability outputs (T = 0.3) to more varied distributions (T = 0.8). Furthermore, the utilization of models deployed with 4-bit quantization (BitsAndBytes) may introduce a layer of structural noise into the training distribution, potentially approximating conditions encountered in resource-constrained real-world deployments. These measures are intended to facilitate the identification of relatively invariant stylistic markers of AI authorship, which may contribute to enhanced resilience against the evolving nature of generative distributions.
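The effect of the two decoding temperatures can be illustrated with a small sketch of temperature-scaled softmax sampling. The logits below are hypothetical and the helper names are illustrative; the sketch only shows that T = 0.3 concentrates probability mass on the most likely token, while T = 0.8 yields a flatter, higher-entropy distribution. It is not the generation pipeline used in the study.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("Temperature must be positive.")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher values mean more varied sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token logits for four candidate tokens.
logits = [2.0, 1.0, 0.5, 0.0]

low_t = softmax_with_temperature(logits, 0.3)   # high-stability setting
high_t = softmax_with_temperature(logits, 0.8)  # more varied setting

print("T=0.3:", [round(p, 3) for p in low_t], "entropy:", round(entropy(low_t), 3))
print("T=0.8:", [round(p, 3) for p in high_t], "entropy:", round(entropy(high_t), 3))
```

Mixing samples generated at both ends of this range exposes the detector to both near-deterministic and more diverse outputs from the same generator.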
Furthermore, the high recall (94.67%) indicates the framework’s strong capacity to detect directly generated synthetic texts. The low false negative rate (32 of 600 synthetic texts) suggests that the likelihood of synthetic content going undetected remains limited. However, the rapid advancement in the generative capabilities of AI models necessitates the continuous updating of such detection systems. The emergence of more advanced and human-like generative models makes the ongoing re-evaluation of detection mechanisms essential.
It is important to note that the effectiveness of current detection mechanisms can be significantly challenged by adversarial techniques. As discussed by Sadasivan et al. (2023) [8], many AI text detectors face substantial performance degradation when subjected to paraphrasing or human-in-the-loop editing. While our strategy demonstrates high discriminative power on the presented dataset, its robustness against high-entropy paraphrasing or collaborative human-AI writing remains a limitation of the current study.
For future research, leveraging the multilingual capabilities of XLM-RoBERTa to evaluate the proposed method across different languages is a promising direction. In particular, the morphological richness of agglutinative languages such as Turkish may provide new opportunities for synthetic text detection. Furthermore, incorporating evaluation against proprietary closed-source systems and advanced adversarial testing, such as evaluating the model against various paraphrasing attacks, is a critical next step to enhance the system’s reliability in more complex, real-world adversarial scenarios.
6. Conclusions
In this study, we presented a data-centric framework for AI-generated academic text detection using XLM-RoBERTa as the classification backbone. Rather than introducing a novel detection architecture, the primary objective of this work was to investigate how training diversity and artifact-controlled evaluation influence cross-model robustness in academic AI-text detection. The findings demonstrate that strategically curated multi-model training distributions, combined with a rigorous “Zero-Artifact” evaluation setting, substantially improve the detector’s ability to generalize across diverse large language model architectures while maintaining balanced performance on formal human-written academic texts. The proposed “Large Mixture” strategy exposed the model to outputs from multiple contemporary LLM families, enabling the framework to learn shared linguistic and structural characteristics of AI-generated academic discourse instead of memorizing model-specific stylistic signatures. Experimental results obtained on the independent Zero-Artifact test set demonstrated that the proposed framework achieves strong and stable discriminative capability, reaching 93.41% accuracy and an AUC score of 0.9810 even after the removal of superficial formatting cues and structural artifacts. These findings suggest that the model primarily relies on relatively stable stylistic and contextual markers inherent to synthetic academic writing rather than dataset-specific leakage patterns. A particularly important outcome of this study is the reduction in false positive behavior on highly formal and structured human-authored texts, which remains a major limitation of many existing detection systems. The results indicate that incorporating formal academic writing into a balanced and diverse training distribution can substantially improve robustness against the “false positive paradox” reported in prior literature. 
Overall, the study highlights that robust academic AI-text detection may depend more on principled dataset design, diversity-aware training strategies, and controlled evaluation methodologies than on major architectural innovation alone. These findings provide evidence that data-centric generalization strategies constitute a practical and scalable direction for improving reliability in academic integrity applications. Nevertheless, the current framework was evaluated primarily on direct LLM generations. As emphasized in prior research, adversarial paraphrasing, human-AI collaborative editing, and increasingly human-like generation capabilities continue to pose significant challenges for future detection systems. Accordingly, future work should focus on multilingual evaluation, paraphrase-resilient detection mechanisms, and robustness against advanced rewriting attacks in more realistic deployment scenarios.