1. Introduction
Before the dominance of large language models (LLMs), NLP progress relied heavily on task-specific architectures. Meta-learning [1,2] improved few-shot adaptation, while graph neural networks [3] enhanced structured predictions. However, these methods require predefined tasks and extensive labeled data. In contrast, LLMs are pretrained on massive text corpora to develop general language understanding, enabling zero- or few-shot generalization across diverse tasks without task-specific retraining—a fundamental shift in the NLP paradigm.
LLMs, such as ChatGPT-4o and Qwen3, have revolutionized natural language processing, achieving remarkable fluency in tasks like question-answering, text summarization, translation, and content creation [4]. These models, encompassing both closed-source and open-source paradigms, have demonstrated text generation capabilities that rival or even surpass human performance in certain tasks, enabling their widespread deployment in education, scientific writing, and creative industries [5].
However, the realism of LLM-generated text poses significant challenges, particularly in ensuring academic integrity and content authenticity. AI-driven falsification, including fake news, academic plagiarism, and manipulated online content, undermines trust in digital communication [6]. Humans struggle to distinguish AI-generated from human-written text, with expert evaluators achieving only around 57% accuracy, barely surpassing random guessing [7]. Consequently, robust automated detection methods are urgently needed to address these socio-technological risks.
Existing detection approaches, including watermarking, statistical analysis, and supervised classification, face critical limitations. Watermarking methods [8,9] embed detectable signals but are easily disrupted by paraphrasing attacks. Statistical methods, such as GLTR [10] and DetectGPT [11], rely on statistical features such as token distribution or log-probability curvature. However, simple strategies such as substituting synonyms for the original wording can alter the statistical characteristics of the generated text, making detection more difficult [12]. Supervised classifiers like RoBERTa [13] can detect LLM-generated text to some extent, but deliberately introduced spelling or grammatical errors, which are common in human writing, increase the risk of misjudgment [14]. Therefore, the common challenge of these three types of detection methods is that their effectiveness decreases significantly when the original text is rewritten, which makes reliable text authenticity determination more difficult.
The core limitation of existing detectors lies in their reliance on surface statistical or syntactic features, which are easily manipulated by paraphrasing. To achieve true robustness, a model must learn paraphrase-invariant semantic features by focusing on the underlying semantic consistency and structural patterns inherent to LLM-generated texts. Contrastive learning is particularly suited for this purpose, as it aims to pull similar samples (e.g., original and paraphrased LLM texts) closer in the embedding space while pushing dissimilar samples (e.g., human text) further away. This mechanism naturally enhances robustness against minor adversarial perturbations like paraphrasing. We introduce the gravitational factor to further refine the separation by dynamically addressing hard negative examples with higher repulsion. This design directly addresses the limitations of watermarking, statistical, and supervised methods, whose reliance on surface-level cues makes them fragile under paraphrasing, whereas GravText explicitly learns semantic invariances and enforces stronger separation through the gravitational factor.
To address these challenges, we propose GravText, a unified framework for detecting LLM-generated text that leverages triplet contrastive learning and a novel gravitational factor to capture paraphrase-invariant semantic features. Depending on the task, GravText dynamically switches anchors between original and paraphrased texts, such as distinguishing human-written text from original LLM outputs (Human or Original ChatGPT) or from paraphrased LLM outputs (Human or Paraphrased LLM). Drawing from the law of gravitation, the gravitational factor refines embedding space separation, enhancing robustness against paraphrasing. Recognizing that human and LLM-generated texts differ in both content and structural alignment—including attention saliency—we incorporate cross-attention as a proxy for semantic mass. Experiments on the HC3 Chinese dataset [15], augmented with paraphrased texts from open-source (e.g., Qwen) and closed-source (e.g., ChatGPT) models, explore GravText's robustness across token lengths (128, 256, and 512 tokens). In addition, to verify the robustness and effectiveness of the model across languages, we conduct supplementary experiments on an English essay dataset [16]. We summarize our main contributions as follows:
We propose a more flexible triplet contrastive learning approach in the GravText framework, utilizing dynamic anchor switching between original and paraphrased LLM-generated texts to capture paraphrase-invariant semantic features, enabling robust detection for tasks (e.g., Human or Original ChatGPT, Human or Paraphrased LLM) across open-source (e.g., Qwen) and closed-source (e.g., ChatGPT) large language models.
We introduce the gravitational factor, inspired by Newton’s law and implemented via cross-attention, to enhance embedding space separation by clustering LLM-generated texts (original and paraphrased) while separating them from human-written texts, complementing the triplet contrastive learning approach.
We propose a hybrid loss function composed of triplet loss, cross-entropy loss, and a gravitational factor, which is used to fine-tune the RoBERTa model. These components are combined using different weights, chosen based on empirical validation.
To guide our investigation, we formulate the following research questions:
RQ1: To improve robustness against rewriting attacks in LLM-generated text detection, can we employ a contrastive learning framework with dynamic anchor switching to learn paraphrase-invariant representations?
RQ2: To improve detection accuracy against paraphrasing attacks, can a physics-inspired gravitational factor in the embedding space enhance cluster separation between human and AI-generated texts?
RQ3: To evaluate the generalizability of the proposed GravText framework, can its performance consistency be assessed across different text lengths and between open-source and closed-source LLMs?
These questions directly motivate our design choices and empirical evaluation.
The remainder of this paper is structured as follows:
Section 2 reviews related work in LLM text detection and contrastive learning.
Section 3 details the architecture of the proposed GravText framework, including the dynamic anchor switching strategy and the gravitational factor implementation.
Section 4 describes the experimental setup, datasets, and baseline models.
Section 5 presents the comprehensive results and comparative analysis. Finally,
Section 6 concludes the paper and outlines directions for future work.
3. Methodology
To address the challenge of detecting paraphrased text generated by large language models (LLMs), as highlighted in prior work [6], we propose the GravText framework, designed to robustly discern both original LLM-generated text (Human or Original ChatGPT) and paraphrased LLM-generated text (Human or Paraphrased LLM). Motivated by the limitations of existing methods in maintaining performance against paraphrased outputs, GravText integrates three key components: a flexible triplet contrastive data construction to capture paraphrase-invariant semantics, a gravitational factor inspired by physical principles to enhance embedding separation, and a fine-tuning strategy with a hybrid loss to balance multiple objectives. This section details these components, their implementation, and their synergy in tackling paraphrasing challenges. The GravText architecture is depicted in Figure 1, illustrating the anchor selection process and loss integration.
3.1. Paraphrase Strategy and Data Augmentation
To enhance robustness, GravText leverages the semantic proximity between original ($x_o$) and paraphrased ($x_p$) LLM-generated texts. We design a two-phase paraphrasing strategy: (1) varying the paraphrase lengths, and (2) comparing different LLMs as paraphrasers. In both phases, we use the prompt “Please rewrite the above text to mimic human tone” to generate $x_p$, which intentionally blends LLM-generated features with human-like writing styles, thereby increasing detection difficulty and challenging the model’s embedding optimization.
Phase 1: Paraphrase Length. During this phase, we reformulate texts generated by LLMs in the HC3 dataset into three lengths: short (128 tokens), medium (256 tokens), and long (512 tokens). These adaptations reflect the dataset’s token distribution, spanning from concise to more elaborate responses (mean of 271, median 216, mode 116; refer to Figure 2). Each reformulated text ($x_p$) aims to convey the key message of the original $x_o$ while adopting a more relatable and human tone.
For the Human or Original ChatGPT task, the reformulated text $x_p$ is treated as the reference, with the original $x_o$ used for comparison. Conversely, in the Human versus Reformulated LLM setup, the original $x_o$ serves as the basis, while $x_p$ is examined as the comparative text. This approach provides insight into how detection systems respond to variations in linguistic style and semantic alignment.
Finally, we adopt 256 tokens as the default paraphrase length for the subsequent experiments and use different LLMs to paraphrase at this length; the specific reasons are discussed in Section 5.1.
Phase 2: LLM Comparison. In the second phase, we fix the paraphrase length at 256 tokens (based on preliminary findings from Phase 1) and compare two LLMs—ChatGPT and Qwen—as paraphrasers. ChatGPT is consistent with the original generation source in the HC3 and English essay datasets, while Qwen is selected for its strong performance in open-source LLM evaluations. Both models use the same prompt to generate $x_p$. We hypothesize that ChatGPT-based paraphrases may reinforce stylistic patterns already present in the dataset, enhancing detection. In contrast, Qwen may introduce stylistic variance, potentially increasing the detection challenge.
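For illustration, a minimal sketch of this paraphrasing pipeline is shown below, assuming an OpenAI-compatible chat client; the decoding parameters and client setup are our assumptions rather than details specified here.

```python
# Illustrative sketch of the two-phase paraphrasing pipeline (not the exact
# generation code). Assumes an OpenAI-compatible chat endpoint; the paper's
# decoding parameters are not specified, so library defaults are used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Please rewrite the above text to mimic human tone"

def paraphrase(text: str, model: str = "gpt-4o", max_tokens: int = 256) -> str:
    """Rewrite an LLM-generated answer at a target length budget."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{text}\n\n{PROMPT}"}],
        max_tokens=max_tokens,   # Phase 1: vary 128 / 256 / 512
    )
    return response.choices[0].message.content

# Phase 2: swap the paraphraser (e.g., a Qwen chat model served behind an
# OpenAI-compatible API) while keeping the prompt and 256-token budget fixed.
```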
The resulting paraphrased texts ($x_p$) are used to augment the HC3 and English essay datasets, contributing diversity in both writing style and length. This augmentation allows GravText to robustly detect both original and paraphrased LLM-generated texts across varying conditions, and supports the triplet contrastive learning approach introduced in Section 3.2.
3.2. Anchor Data Selection
Contrastive learning excels at optimizing embedding spaces for semantic discrimination, with triplet loss being a cornerstone formulation. The triplet loss minimizes the distance between an anchor and a positive sample while maximizing the distance to a negative sample, defined as:

$\mathcal{L}_{\text{triplet}} = \max\big(D(A, P) - D(A, N) + \mathrm{margin},\; 0\big),$

where $A$, $P$, and $N$ denote the anchor, positive, and negative samples, respectively, $D(X, Y)$ is the Euclidean distance between the embeddings of samples $X$ and $Y$, and margin enforces a minimum separation threshold.
In GravText, we leverage triplet loss to distinguish LLM-generated text from human-written text, explicitly addressing paraphrasing. We hypothesize that paraphrased LLM-generated text retains semantic proximity to its original form, distinct from human text. We augment the HC3 Chinese dataset [15] and the English essay dataset [16], which provide human-written texts ($x_h$) and original LLM-generated texts ($x_o$), with paraphrased texts ($x_p$) generated by applying open-source models like Qwen and closed-source models like ChatGPT to $x_o$. This design ensures diverse paraphrasing styles and enables GravText to robustly detect both original and paraphrased LLM-generated texts across diverse model architectures. To support the dual detection tasks, we propose two anchor selection strategies: (1) for the Human or Original ChatGPT task, the paraphrased text $x_p$ serves as the anchor, the original LLM text $x_o$ as the positive, and the human text $x_h$ as the negative; (2) for the Human or Paraphrased LLM task, the original LLM text $x_o$ serves as the anchor, $x_p$ as the positive, and $x_h$ as the negative.
The function of this dynamic anchor switching strategy (DASS) extends beyond task definition; it is engineered for effective hard negative mining. Paraphrasing is inherently an adversarial process designed to make the LLM-generated text ($P$) semantically closer to human text ($N$), resulting in hard negative samples. The DASS ensures the triplet structure is optimized to target these ambiguous boundaries: by varying the anchor ($A$) between $x_o$ and $x_p$, we maximize the chance of consistently capturing informative triplets that violate the margin constraint ($D(A, P) - D(A, N) + \mathrm{margin} > 0$). This focused optimization process, which forces the model to learn the subtle difference between the two classes at their closest points, is the true measure of DASS performance, as demonstrated by the overall robustness gains in our experimental results (Section 5).
By alternating between these configurations during training, GravText optimizes an embedding space that clusters LLM-generated text (original or paraphrased) tightly while separating it from human text, effectively addressing the challenges posed by paraphrasing.
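A minimal sketch of the anchor-switching rule, using the notation above (the function and task names are ours, for illustration only):

```python
def build_triplet(x_h, x_o, x_p, task: str):
    """Dynamic anchor switching (illustrative sketch).

    x_h: human-written text, x_o: original LLM text, x_p: paraphrased LLM text.
    Both configurations keep LLM texts as anchor/positive and the human
    text as the (hard) negative.
    """
    if task == "human_or_original":        # Human or Original ChatGPT
        anchor, positive = x_p, x_o        # paraphrase anchors the original
    elif task == "human_or_paraphrased":   # Human or Paraphrased LLM
        anchor, positive = x_o, x_p        # original anchors the paraphrase
    else:
        raise ValueError(f"unknown task: {task}")
    negative = x_h
    return anchor, positive, negative
```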
3.3. Gravitational Factor
To enhance the detection effect of GravText, we introduce a gravitational factor inspired by Newton’s law of universal gravitation. In Newton’s theory, every mass attracts every other mass with a force proportional to the product of their masses and inversely proportional to the square of the distance between them. The formula for the gravitational force $F$ between two masses $m_1$ and $m_2$ separated by distance $R$ is:

$F = G \dfrac{m_1 m_2}{R^2},$

where $G$ is the gravitational constant. This physical model describes the attraction between objects based on their mass and distance.
In GravText, we adapt this concept to represent semantic relationships between text samples. Here, the “masses” correspond to the semantic significance of the text samples, and the “distance” is related to their embedding space distances. Specifically, paraphrased and original LLM-generated texts are “attracted” to form tighter clusters, while human-written texts are “repelled” to maintain separation. This gravitational factor supplements the triplet loss by providing a dynamic structural constraint on the embeddings.
To implement this idea, we replace the concept of mass with Cross-Attention. In traditional gravitational force, the mass of an object determines its gravitational pull; in our case, the Cross-Attention mechanism simulates the “mass” based on the semantic relationship between text samples. The specific computation process for the Human or Paraphrased LLM task is shown in Figure 3.
A crucial aspect of our framework is the intuitive motivation for using cross-attention as a proxy for semantic mass. Standard triplet loss relies on a single global distance metric (e.g., the Euclidean distance between [CLS] embeddings), which is a low-resolution measure that can be insensitive to subtle yet critical differences in paraphrased text. To overcome this, we introduce cross-attention as a high-resolution alignment mechanism. Instead of comparing two compressed global vectors, cross-attention computes a token-by-token alignment matrix, quantifying the semantic relevance between every token in the first sequence and every token in the second. In this context, we posit that the aggregated cross-attention score serves as a direct measure of semantic density and shared information content.
Specifically, a high aggregate score (high “mass”) signifies that the two texts possess a high concentration of mutually aligned semantic units. This is characteristic of an anchor and its positive (paraphrased) sample. Conversely, a hard negative sample (e.g., human text on the same topic) might have a similar global embedding (small distance) but will exhibit low internal alignment (low “mass”) under cross-attention, as its underlying information structure and expression differ. Therefore, cross-attention is uniquely suited as a proxy for “mass” because it quantifies the density of shared meaning, allowing our gravitational factor to more accurately model the true semantic attraction and repulsion between text pairs.
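As an illustration, one possible way to turn cross-attention into a scalar “mass” is sketched below; the multi-head module and the cosine-based aggregation are assumptions of this sketch, since the exact pooling is not spelled out here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticMass(nn.Module):
    """Aggregate cross-attention between two token sequences into a scalar 'mass'.

    Illustrative sketch: the anchor queries the other sample, and the mean
    cosine similarity between each anchor token and its cross-attended
    reconstruction measures how densely the two texts share aligned content.
    """
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, anchor_tokens: torch.Tensor, other_tokens: torch.Tensor) -> torch.Tensor:
        # Query from the anchor, keys/values from the other sample.
        aligned, _ = self.attn(anchor_tokens, other_tokens, other_tokens)
        # High similarity => anchor tokens are well explained by the other text.
        return F.cosine_similarity(anchor_tokens, aligned, dim=-1).mean(dim=-1)  # (batch,)
```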
With this robust definition of semantic interaction, we can formally define the gravitational factor. Specifically, for each triplet, the gravitational force between the anchor text $A$ and another sample $X$ is computed as:

$F(A, X) = s_X \cdot \dfrac{M_{\mathrm{CA}}(A, X)}{D(A, X)^2 + \epsilon},$

where $M_{\mathrm{CA}}(A, X)$ is the cross-attention output using the query from the anchor text $A$ and the key-value pairs from the sample text $X$, $D(A, X)$ denotes their embedding distance, and $\epsilon$ is a smoothing factor.
The sign factor $s_X$ is defined as:

$s_X = \begin{cases} +1, & \text{if } X \text{ is a positive example},\\ -1, & \text{if } X \text{ is a negative example}. \end{cases}$

Here, a positive example represents a paraphrased LLM-generated text (attraction), while a negative example represents a human-written text (repulsion). The positive force pulls similar texts closer, and the negative force pushes dissimilar texts apart.
The overall gravitational loss is defined as:

$\mathcal{L}_{\mathrm{grav}} = \dfrac{1}{|\mathcal{D}|} \sum_{(A,P,N) \in \mathcal{D}} \max\big(\mathrm{margin} - \lambda\,(F_{AP} + F_{AN}),\; 0\big),$

where $F_{AP}$ and $F_{AN}$ are the gravitational factors for the anchor-positive and anchor-negative pairs, respectively, $\mathcal{D}$ denotes the dataset, $\lambda$ is a scaling factor (default 1), and margin enforces a minimum separation.
Integrating the cross-attention mechanism into the gravitational factor introduces additional computational overhead during the training phase, compared to standard triplet loss based solely on embedding distances. Specifically, the cross-attention operation adds $O(L^2)$ complexity per triplet, where $L$ is the sequence length, contributing to a longer optimization cycle.
However, it is important to emphasize that this computational cost is strictly limited to training. During inference, GravText relies solely on the trained RoBERTa encoder to generate text embeddings, followed by a lightweight distance-based classification. The gravitational factor module, including cross-attention, is not required at inference time. As a result, the inference speed of GravText remains comparable to that of a standard fine-tuned RoBERTa baseline. The additional training cost is thus a necessary and efficient trade-off for achieving significant robustness gains against adversarial paraphrasing.
By incorporating this gravitational loss, GravText strengthens the semantic cohesion of LLM-generated texts while enhancing their separation from human texts. This dynamic structural constraint further improves the model’s ability to distinguish between generated and human-authored content.
3.4. Fine-Tuning with Hybrid Loss
Relevant studies have demonstrated that fine-tuning RoBERTa yields strong performance in various natural language understanding tasks, such as text classification and recognition [33,34]. Building on the RoBERTa-base model [13], a 12-layer Transformer architecture, GravText leverages robust text representations by encoding input text into contextualized embeddings. Specifically, the final hidden state of the [CLS] token is used as the global representation of the input sequence:

$h_{\mathrm{[CLS]}} = \mathrm{RoBERTa}(x)_{\mathrm{[CLS]}},$
where $h_{\mathrm{[CLS]}} \in \mathbb{R}^{768}$ for the RoBERTa-base hidden size. To ensure that the model performs best on the validation set, we adopt an early stopping strategy: if the validation F1 score and accuracy do not improve and the validation loss does not further decrease within five consecutive epochs, training is terminated and the model parameters with the highest F1 score are saved.
The fine-tuning optimizes a hybrid loss:

$\mathcal{L} = \alpha\,\mathcal{L}_{\mathrm{CE}} + \beta\,\mathcal{L}_{\mathrm{triplet}} + \gamma\,\mathcal{L}_{\mathrm{grav}},$

where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss for classification, aligning with the binary detection tasks. The triplet loss $\mathcal{L}_{\mathrm{triplet}}$ is selected from the set $\{\mathcal{L}_{\mathrm{PA}}, \mathcal{L}_{\mathrm{OA}}\}$, where $\mathcal{L}_{\mathrm{PA}}$ (Paraphrase-Aware) and $\mathcal{L}_{\mathrm{OA}}$ (Original Anchor) are designed for different tasks by selecting different anchor types.
Specifically, $\mathcal{L}_{\mathrm{PA}}$ uses paraphrased texts as anchors to learn the similarity between paraphrased and original LLM-generated texts while separating them from human texts. In contrast, $\mathcal{L}_{\mathrm{OA}}$ uses original LLM texts as anchors to distinguish paraphrased LLM texts from human-written texts. The specific loss function is selected during training depending on the task objective.
Additionally, $\mathcal{L}_{\mathrm{grav}}$ improves semantic separation via the attention-based gravitational factor. The weights $\alpha$, $\beta$, and $\gamma$ are selected based on validation F1 scores, balancing classification accuracy with embedding robustness. This hybrid loss formulation ensures GravText’s effectiveness across both detection scenarios.
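For concreteness, a minimal sketch of the encoder and hybrid loss is given below; the checkpoint name and the weight values are placeholders, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class GravTextClassifier(nn.Module):
    """RoBERTa encoder + linear head; the hybrid loss combines CE, triplet and
    gravitational terms (weights alpha/beta/gamma tuned on validation F1)."""
    def __init__(self, name: str = "hfl/chinese-roberta-wwm-ext", num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)   # placeholder checkpoint
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def embed(self, **batch) -> torch.Tensor:
        # Final hidden state of the [CLS] token as the global representation.
        return self.encoder(**batch).last_hidden_state[:, 0]

    def forward(self, **batch):
        h = self.embed(**batch)
        return h, self.head(h)

def hybrid_loss(logits, labels, l_triplet, l_grav,
                alpha: float = 1.0, beta: float = 0.5, gamma: float = 0.5):
    """L = alpha * CE + beta * triplet + gamma * gravitational (weights illustrative)."""
    l_ce = nn.functional.cross_entropy(logits, labels)
    return alpha * l_ce + beta * l_triplet + gamma * l_grav
```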
4. Experimental Settings
This section presents the experimental settings for evaluating the GravText framework, designed to distinguish human-written text from large language model (LLM)-generated text across two tasks: identifying original LLM-generated text (Human or Original ChatGPT) and paraphrased LLM-generated text (Human or Paraphrased LLM). Building on the methodology in Section 3 and the triplet contrastive learning approach with a gravitational factor introduced in Section 3.2, we describe the experimental configuration, including the dataset, evaluation metrics, paraphrase strategies, and baseline comparisons. We validate the performance of the model on the HC3 dataset and the English essay dataset. Paraphrase strategies leverage DeepSeek, Qwen, and ChatGPT to generate varied paraphrased texts, while baselines include DetectGPT, GLTR, and RoBERTa for comprehensive comparison. These settings enable a thorough assessment of GravText’s detection capabilities.
4.1. Dataset
We first adopt the HC3 Chinese dataset [15], a benchmark comprising approximately 25,706 samples (12,853 human-written and 12,853 LLM-generated responses) across domains including medicine, finance, law, psychology, computer science, and open-ended Q&A. The dataset’s diversity in style, vocabulary, and length (50 to 600 tokens) supports robust testing of GravText’s generalization. Each sample is reliably labeled for human or LLM authorship, aligning with GravText’s objectives.
Subsequently, to verify the robustness and effectiveness of GravText on cross-language tasks, we use the English essay dataset [16]. The dataset consists of 335 L2 English argumentative essays written by humans and 335 AI-generated texts. The human texts are drawn from the Uppsala Student English Corpus (USE) and were written by Swedish university students at CEFR A2 level. The LLM-generated texts were produced by ChatGPT on the same topics with the prompt “Write an 800-word essay on [topic] as a second language speaker.” The dataset ensures comparability through topic pairing, which aims to simulate the AI ghostwriting detection task in real academic scenarios.
4.2. Evaluation Metrics
We evaluate GravText using accuracy and F1-score to assess its performance in detecting LLM-generated text ($x_o$ or $x_p$) versus human-written text ($x_h$). These metrics are computed based on classification outcomes: true positives (TP), correctly identified LLM-generated texts; true negatives (TN), correctly classified human texts; false positives (FP), human texts misclassified as LLM-generated; and false negatives (FN), undetected LLM-generated texts. They are defined as:

$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \dfrac{TP}{TP + FP}, \quad \mathrm{Recall} = \dfrac{TP}{TP + FN}, \quad \mathrm{F1} = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$
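These metrics can be computed directly, e.g., with scikit-learn (a convenience sketch; label 1 denotes the LLM-generated class):

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate_predictions(y_true, y_pred):
    """y_true / y_pred: 1 = LLM-generated (positive class), 0 = human-written."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, pos_label=1),
    }

# Example: evaluate_predictions([1, 0, 1, 1], [1, 0, 0, 1])
# -> {'accuracy': 0.75, 'f1': 0.8}
```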
Accuracy reflects overall classification correctness, while F1-score balances precision and recall for the positive class, critical for detecting paraphrased LLM texts that mimic human tone. We report metrics separately for Human or Original ChatGPT and Human or Paraphrased LLM to highlight GravText’s dual-task effectiveness, especially under human-like paraphrasing challenges.
4.3. Baseline Comparisons
We compare GravText against three representative baselines: GLTR [10], DetectGPT [11], and a fine-tuned RoBERTa [13], spanning statistical, curvature-based, and supervised paradigms.
GLTR: It evaluates the statistical likelihood of each token in a sequence by ranking it against predictions from a pretrained language model. Tokens that fall outside the model’s top-ranked predictions are flagged as likely signs of human authorship, under the assumption that machine-generated text draws disproportionately from the model’s high-probability tokens [10].
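A simplified sketch of the token-rank computation behind GLTR is shown below; the checkpoint is a placeholder (our experiments substitute DeepSeek for Chinese text, as described later).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder causal LM; DeepSeek is swapped in for Chinese
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def token_ranks(text: str) -> list[int]:
    """Rank of each observed token under the LM's next-token distribution.

    GLTR inspects how many tokens land in the top-ranked buckets: machine text
    tends to concentrate on low ranks, human text spreads to higher ranks.
    """
    ids = tok(text, return_tensors="pt").input_ids
    logits = lm(ids).logits[0, :-1]          # predict token t+1 from prefix
    targets = ids[0, 1:]
    ranks = []
    for step_logits, target in zip(logits, targets):
        order = torch.argsort(step_logits, descending=True)
        ranks.append(int((order == target).nonzero().item()) + 1)  # 1-based rank
    return ranks
```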
DetectGPT: This method estimates the local curvature of the log-probability surface around a given input by applying perturbations and measuring the resulting change in model confidence.
RoBERTa: We fine-tune the standard RoBERTa-base model on a portion of the HC3 dataset to distinguish human-written from model-generated content through supervised learning. All models use the same tokenization and training configuration for fair comparison. Specifically, we adopt a batch size of 32, a learning rate of , and a maximum sequence length of 256 tokens. Training is governed by early stopping based on validation F1-score, with a patience of 5 epochs. The best-performing checkpoint is retained for evaluation.
To ensure compatibility with Chinese texts, we substitute the language models used in GLTR and DetectGPT with DeepSeek, a Chinese-oriented large language model. DeepSeek replaces GPT for token ranking in GLTR and functions as both perturbation and scoring model in DetectGPT, ensuring linguistic alignment and implementation simplicity.
5. Results
In this section, we evaluate the performance of the GravText framework on two binary classification tasks: Human or Original ChatGPT and Human or Paraphrased LLM. In the first task, we train models using paraphrased texts ($x_p$) as anchors to distinguish between human-written texts ($x_h$) and original ChatGPT responses ($x_o$). The paraphrased texts are generated using ChatGPT-4o and Qwen, each truncated to 256 tokens. We denote the variant using ChatGPT-generated anchors as GravTextGPT256, and the variant using Qwen-paraphrased anchors as GravTextQwen.
For the second task, $x_o$ serves as the anchor to differentiate $x_p$ from $x_h$, and we introduce GravTextChatGPT, which adopts 256-token ChatGPT-4o paraphrases as anchors. Ablation studies examine the role of the gravitational factor across different tasks and anchor configurations. All results on the HC3 Chinese test set use 256-token paraphrases, and are benchmarked against baselines including DetectGPT, GLTR, and RoBERTa. Each experiment is averaged over five runs, and results are reported as F1 scores and accuracies with standard deviations. For sampling-based baselines (GLTR, DetectGPT), we use five random seeds to account for variance.
5.1. Effect of Paraphrase Token Length
Since the original LLM-generated texts in the HC3 Chinese dataset were produced by ChatGPT, we first generate paraphrases of three different token lengths with ChatGPT-4o and evaluate them on the two detection tasks. The results are shown in Table 1.
The visualization of the data in Table 1 is shown in Figure 4. It reveals a clear upward trend in detection performance as the length of paraphrased responses increases. This pattern is consistent across both classification tasks. One likely reason for this improvement is that longer responses tend to expose more of the distinctive stylistic and structural features typical of LLM-generated text. These patterns, when more fully developed, offer stronger signals for classifiers to distinguish machine-generated outputs from those written by humans. In addition, this conjecture is supported by the variation of perplexity with respect to token length, as shown in Figure 5.
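The per-text perplexity used in this analysis can be computed with a causal language model as sketched below; the checkpoint name is a placeholder (a Chinese causal LM is required for HC3 text).

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; substitute a Chinese causal LM for HC3 text
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """exp of the mean token-level negative log-likelihood under the LM."""
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss          # mean cross-entropy over tokens
    return math.exp(loss.item())
```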
Nevertheless, the 512-token paraphrasing length, while yielding the highest metrics, raises two practical concerns. First, it considerably overshoots the average length of LLM responses in the original dataset, which is approximately 271 tokens. Inflating the text to this extent may introduce information that, although helpful for detection, is atypical in natural usage and thus may reduce the generalizability of the results. Second, the marginal performance gain between 256 and 512 tokens does not appear to justify the additional computational cost, particularly when compared to the more substantial improvement seen from 128 to 256 tokens.
At the other end of the spectrum, the 128-token setting—despite aligning closely with the dataset’s modal length of 116 tokens—appears to fall short in capturing sufficient semantic content. Its limited scope restricts the model’s ability to detect nuanced linguistic markers that often differentiate human and machine text. In contrast, the 256-token setting strikes a more appropriate balance. It delivers nearly optimal accuracy and F1 scores, while maintaining closer alignment with the dataset’s central length tendencies—situated between the median (216 tokens) and the mean (271 tokens).
For these reasons, we adopt 256 tokens as the default paraphrasing length in the remainder of our experiments. This choice reflects a pragmatic compromise between performance, efficiency, and consistency with the original text distribution. Unless otherwise noted, all future paraphrasing procedures will employ this token length to ensure comparability and minimize extraneous variance across settings.
5.2. Main Results
Table 2 summarizes the performance of the GravText models and baselines on the Human or Original ChatGPT task, showcasing GravText’s robustness in detecting original LLM-generated texts using ChatGPT- and Qwen-paraphrased anchors.
As shown in Table 2, both GravTextGPT256 and GravTextQwen outperform all baseline models by a notable margin on the Human or Original ChatGPT classification task. Specifically, GravTextQwen achieves the highest performance, with an accuracy of 0.9757 and an F1-score of 0.9754, while GravTextGPT256 follows closely with scores of 0.9676 and 0.9678, respectively. In contrast, the strongest baseline, RoBERTa, records an F1-score of 0.9459, whereas DetectGPT and GLTR fall further behind at 0.8891 and 0.8688, respectively.
To further assess the generalization capability of GravText models, we extend our evaluation to the Human or Paraphrased LLM task. In this setting, original ChatGPT generations ($x_o$) serve as anchors to differentiate paraphrased LLM responses ($x_p$) from genuine human-authored text. The corresponding results are presented in Table 3.
Across the two tasks, the models based on Qwen-paraphrased texts perform better than those based on ChatGPT-paraphrased texts. This may be because the text rewritten by Qwen stays closer to the original AI text than the text rewritten by ChatGPT, making the discrimination more pronounced. To test this, we plot the perplexity of the human answers and ChatGPT answers in the original dataset alongside the texts rewritten by the different models, as shown in Figure 6.
To further verify the robustness and effectiveness of GravText in cross-lingual tasks, we conducted experiments on the English essay dataset, comparing against RoBERTa-base, the best-performing baseline model. The specific experimental results are shown in Table 4.
Table 4 presents the results of our cross-lingual generalization experiment on the English essay dataset, where the GravText framework is compared against the fine-tuned English RoBERTa-base baseline for the
Human or Original ChatGPT task. The results strongly validate the language-agnostic effectiveness of GravText’s core mechanisms. Specifically, the RoBERTa baseline achieves an F1 score of
, which is significantly surpassed by both GravText variants. The best-performing model,
GravTextQwen, achieves an F1 score of
and an accuracy of
, demonstrating an absolute improvement of approximately 2.16 percentage points over the baseline. Furthermore, the GravText models exhibit substantially lower standard deviations (e.g.,
for
GravTextGPT256 compared to
for RoBERTa), underscoring the superior detection accuracy and stability of our framework when applied to a new language and domain.
As with the HC3 dataset, we also extend our evaluation to the Human or Paraphrased LLM task. The results are shown in Table 5.
Table 5 extends the cross-lingual analysis by evaluating robustness under paraphrasing attacks, comparing GravText and the RoBERTa baseline on the Human or Paraphrased LLM task using the English essay dataset. Here, the performance pattern differs slightly from that observed on HC3. The RoBERTa baseline exhibits a notably lower F1 score and accuracy when confronted with Qwen-paraphrased texts compared to those paraphrased by ChatGPT, suggesting reduced robustness to paraphrasing strategies employed by different LLMs. In contrast, GravText achieves consistently high performance across both paraphrasing sources, with only a marginal advantage in favor of ChatGPT-paraphrased inputs—contrary to the more balanced results seen on HC3. As shown in Figure 7, the perplexity (PPL) values indicate that Qwen-generated paraphrases are closer to human writing in this dataset, which may explain the greater challenge they pose to the baseline model. Additionally, the relatively small scale of the English essay dataset may contribute to the overall higher F1 and accuracy scores, potentially amplifying performance differences between models under specific attack conditions.
5.3. Statistical Significance Analysis
To evaluate the performance differences between the GravText framework and the best baseline model, RoBERTa, we conducted corrected paired t-tests based on F1 scores and accuracy data from five independent runs. Table 6 and Table 7 summarize the statistical significance results for the different tasks on HC3.
In addition, based on GravText’s performance on the English essay dataset, t-tests were also conducted for the different tasks, and the results are shown in Table 8 and Table 9.
Systematic statistical significance evaluation confirms that the GravText framework significantly outperforms the RoBERTa-base model in detection performance. This conclusion is based on extensive testing on both the HC3 and English essay datasets: in all pairwise t-tests based on five independent experiments, the two variants of GravText consistently show significant advantages in F1 score and accuracy (all p-values < 0.05). The results confirm the effectiveness and robustness of the GravText method across languages and detection task scenarios.
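For reference, a standard paired t-test over the five matched runs can be computed as sketched below; the additional variance correction used in the corrected paired t-test is omitted in this sketch, and the numbers in the usage example are illustrative only.

```python
from scipy import stats

def paired_significance(scores_gravtext, scores_baseline, alpha: float = 0.05):
    """Paired t-test over matched runs (e.g., five F1 scores per model).

    Sketch only: the corrected variant additionally adjusts the variance term,
    which is not reproduced here.
    """
    t_stat, p_value = stats.ttest_rel(scores_gravtext, scores_baseline)
    return {"t": t_stat, "p": p_value, "significant": p_value < alpha}

# Illustrative usage (made-up numbers, not results from the paper):
# paired_significance([0.975, 0.976, 0.974, 0.977, 0.975],
#                     [0.945, 0.947, 0.944, 0.946, 0.946])
```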
5.4. Ablation Study
To evaluate the contribution of the gravitational factor G, we conduct an ablation study by removing it from the loss function while keeping the overall architecture unchanged. In this variant, the force between samples is computed without the scalar modulation introduced by G, effectively eliminating the gravitational influence from the optimization process. We also compare the effect of different tasks when the anchor data is not changed.
As shown in Table 10 and Table 11, removing G reduces performance across both tasks. For Human or Original ChatGPT, GravTextQwen’s F1 drops from 0.9769 to 0.9560 (2.1%) and GravTextGPT256’s from 0.9678 to 0.9548 (1.3%). For Human or Paraphrased LLM, GravTextChatGPT’s F1 decreases from 0.9617 to 0.9545 (0.7%) under Qwen and from 0.9523 to 0.9348 (1.8%) under ChatGPT-256. These results confirm the role of G in enhancing embedding space separation, particularly for challenging paraphrased texts.
Similarly, for the other task, Human or Paraphrased LLM, ablation was also performed by eliminating the gravitational factor.
5.5. Discussion
Taking DetectGPT as an example, we visualize the log-probability curvature of different text sources to better understand how model perturbations can distinguish between human-written and machine-generated texts. The comparison includes three types of data: (i) original responses generated by ChatGPT, (ii) 256-token paraphrases of these responses produced by ChatGPT, and (iii) texts paraphrased by Qwen. The curvature values are calculated based on perturbations introduced via the DeepSeek model.
As illustrated in Figure 8, while there is a distinction between human-written and LLM-generated content in the curvature distribution, the separation is not clear-cut. The presence of overlapping regions indicates that relying solely on log-probability curvature may not be sufficient for reliable detection. This suggests the need for incorporating additional features or combining multiple signals to build a more robust identification framework.
The experimental results validate the effectiveness of our proposed method in detecting LLM-generated texts. In both tasks—Human or Original ChatGPT and Human or Paraphrased LLM—our approach outperformed existing mainstream methods such as DetectGPT, GLTR, and RoBERTa. The advantage became more evident when rewritten anchors were generated by stronger models like Qwen or original ChatGPT-4o.
We also found that increasing the length of rewritten texts improved classification accuracy, particularly from 128 to 256 tokens. However, the marginal gain diminished at 512 tokens, likely due to the distribution of text lengths in the dataset. Thus, 256 tokens strike a practical balance between performance and efficiency.
Regarding anchor quality, using rewritten texts from a different LLM (e.g., Qwen and ChatGPT) led to better performance than using the same model as the source, highlighting the benefit of stylistic diversity in contrastive learning. Ablation studies further confirmed the significance of the gravitational factor G, which leverages attention-based representations to quantify “semantic mass” and enhances the contrastive signal during training.
GravText demonstrates consistent and statistically significant performance gains over RoBERTa-base on both the Chinese HC3 and English essay datasets. These results establish its effectiveness and robustness as a language-agnostic solution for detecting AI-generated text. This cross-lingual validation is a critical finding of our study. While the absolute performance metrics naturally vary between the distinct datasets, the key observation is the persistence of the performance gap in favor of GravText. Notably, the framework demonstrates remarkable adaptability: it excels not only in detecting content from primary sources like the original ChatGPT but also maintains its advantage against paraphrased text generated by different LLMs (e.g., Qwen). The fact that all 16 paired t-tests across the two datasets yielded statistically significant results (p < 0.05) underscores that GravText’s superiority is not an artifact of a specific data distribution or language but a reliable attribute of the proposed method.
In summary, our approach demonstrates strong adaptability and robustness across different detection scenarios, although further validation is needed in multilingual and cross-domain contexts.
6. Conclusions
This study proposes the GravText framework for detecting text generated by large language models (LLMs), with a focus on paraphrased outputs posing challenges to authenticity in digital communication. By combining triplet contrastive learning and a gravitational factor, GravText achieves robust detection of both original and paraphrased LLM-generated texts, excelling particularly on texts mimicking human writing styles. Experiments on the HC3 Chinese dataset and English essay dataset show that GravText significantly outperforms existing methods, demonstrating strong generalization across various tasks and paraphrase models.
The framework introduces innovative techniques, including dynamic anchor switching to capture paraphrase-invariant semantic features and a physics-inspired gravitational factor that enhances embedding space separation through cross-attention. These advancements improve detection accuracy and provide a novel approach to countering paraphrase-based attacks. The significance of this work lies in its technical contributions to academic integrity, misinformation prevention, and content authenticity in domains like education and journalism.
Building upon the demonstrated cross-lingual effectiveness of GravText, future work will focus on deepening its theoretical underpinnings. A promising direction is to explore alternative semantic metrics, such as the dynamics of representation collapse or information-theoretic measures, to quantify and potentially replace the cross-attention-based gravitational factor. Furthermore, integrating theoretical insights from geometric deep learning could elucidate how textual embeddings evolve in gravitational fields, enhancing both the framework’s adaptability and its interpretability.
In conclusion, the GravText framework offers an innovative and effective solution for detecting paraphrased LLM-generated text, establishing a foundation for more reliable AI-driven communication systems.