GravText: A Robust Framework for Detecting LLM-Generated Text Using Triplet Contrastive Learning with Gravitational Factor

Feng, Youling; Wang, Haoyu; Li, Jun; Cao, Zhongwei; Yan, Linghao

doi:10.3390/systems13110990

Open AccessArticle

GravText: A Robust Framework for Detecting LLM-Generated Text Using Triplet Contrastive Learning with Gravitational Factor

by

Youling Feng

^†

,

Haoyu Wang

^†

,

Jun Li

^*

,

Zhongwei Cao

and

Linghao Yan

School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun 130117, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Systems 2025, 13(11), 990; https://doi.org/10.3390/systems13110990

Submission received: 18 September 2025 / Revised: 23 October 2025 / Accepted: 3 November 2025 / Published: 5 November 2025

Download

Browse Figures

Review Reports Versions Notes

Abstract

The growing abilities of large language models (LLMs) have introduced new challenges in reliably distinguishing LLM-generated texts from human-written content, particularly when paraphrasing techniques are used to evade detection. This paper proposes GravText, a detection framework designed to address this robustness gap by targeting paraphrase-invariant semantic features. GravText integrates triplet contrastive learning with a dynamic anchor switching strategy to better model inter-class separability under paraphrasing. Additionally, it introduces a physics-inspired gravitational factor based on cross-attention mechanisms, which enhances the discriminative power of learned embeddings by simulating semantic attraction and repulsion. Experimental results on the HC3 Chinese dataset demonstrate GravText’s superior robustness against paraphrasing. Crucially, further cross-lingual evaluation on an English essay dataset confirms the framework’s strong generalization ability and language-agnostic properties. These findings point to a promising direction for building more reliable AI-text detectors resilient to paraphrasing-based evasion.

Keywords:

LLM-generated text detection; contrastive learning; gravitational factor; cross-attention mechanism

1. Introduction

Before the dominance of large language models (LLMs), NLP progress relied heavily on task-specific architectures. Meta-learning [1,2] improved few-shot adaptation, while graph neural networks [3] enhanced structured predictions. However, these methods require predefined tasks and extensive labeled data. In contrast, LLMs are pretrained on massive text corpora to develop general language understanding, enabling zero- or few-shot generalization across diverse tasks without task-specific retraining—a fundamental shift in the NLP paradigm.

LLMs, such as ChatGPT-4o and Qwen3, have revolutionized natural language processing, achieving remarkable fluency in tasks like question-answering, text summarization, translation, and content creation [4]. These models, encompassing both closed-source and open-source paradigms, have demonstrated text generation capabilities that rival or even surpass human performance in certain tasks, enabling their widespread deployment in education, scientific writing, and creative industries [5].

However, the realism of LLM-generated text poses significant challenges, particularly in ensuring academic integrity and content authenticity. AI-driven falsification, including fake news, academic plagiarism, and manipulated online content, undermines trust in digital communication [6]. Humans struggle to distinguish AI-generated from human-written text, with expert evaluators achieving only around 57% accuracy, barely surpassing random guessing [7]. Consequently, robust automated detection methods are urgently needed to address these socio-technological risks.

Existing detection approaches, including watermarking, statistical analysis, and supervised classification, face critical limitations. Watermarking methods [8,9] embed detectable signals but are easily disrupted by paraphrasing attacks. Statistical methods, such as GLTR [10] and DetectGPT [11], rely on statistical features such as token distribution or log probability curvature. However, it has emerged that some strategies, such as substituting synonyms for original characters, can alter the statistical characteristics of the generated text, making the detection task more difficult [12]. Supervised classifiers like RoBERTa [13] can detect text generated by LLM to some extent, but when spelling or grammatical errors are deliberately introduced, as these are common in human writing, they increase the risk of misjudgment [14]. Therefore, the common challenge of these three types of detection methods is that their detection effectiveness decreases significantly when the original text is rewritten, which makes reliable text authenticity determination more difficult.

The core limitation of existing detectors lies in their reliance on surface statistical or syntactic features, which are easily manipulated by paraphrasing. To achieve true robustness, a model must learn paraphrase-invariant semantic features by focusing on the underlying semantic consistency and structural patterns inherent to LLM-generated texts. Contrastive learning is particularly suited for this purpose, as it aims to pull similar samples (e.g., original and paraphrased LLM texts) closer in the embedding space while pushing dissimilar samples (e.g., human text) further away. This mechanism naturally enhances robustness against minor adversarial perturbations like paraphrasing. We introduce the gravitational factor to further refine the separation by dynamically addressing hard negative examples with higher repulsion. This design directly addresses the limitations of watermarking, statistical, and supervised methods, whose reliance on surface-level cues makes them fragile under paraphrasing, whereas GravText explicitly learns semantic invariances and enforces stronger separation through the gravitational factor.

To address these challenges, we propose GravText, a unified framework for detecting LLM-generated text, leveraging triplet contrastive learning and a novel gravitational factor to capture paraphrase-invariant semantic features. For the different tasks, GravText dynamically switches anchors between original and paraphrased texts, such as distinguishing human-written text from original LLM outputs (Human or Original ChatGPT) or paraphrased LLM outputs (Human or Paraphrased LLM). Drawing from the law of gravitation, the gravitational factor refines embedding space separation, enhancing robustness against paraphrasing. Recognizing that human and LLM-generated texts differ in both content and structural alignment—including attention saliency—we propose incorporating cross-attention as a proxy for semantic mass. Experiments on the HC3 Chinese dataset [15], augmented with paraphrased texts from open-source (e.g., Qwen) and closed-source (e.g., ChatGPT) models, explore GravText’s robustness across token lengths (128, 256, and 512 tokens). In addition, in order to verify the robustness and effectiveness of the model across languages, we also supplement the experiments in the English essay dataset [16]. We summarize our main contributions as follows:

We propose a more flexible triplet contrastive learning approach in the GravText framework, utilizing dynamic anchor switching between original and paraphrased LLM-generated texts to capture paraphrase-invariant semantic features, enabling robust detection for tasks (e.g., Human or Original ChatGPT, Human or Paraphrased LLM) across open-source (e.g., Qwen) and closed-source (e.g., ChatGPT) large language models.
We introduce the gravitational factor, inspired by Newton’s law and implemented via cross-attention, to enhance embedding space separation by clustering LLM-generated texts (original and paraphrased) while separating them from human-written texts, complementing the triplet contrastive learning approach.
We propose a hybrid loss function composed of triplet loss, cross-entropy loss, and a gravitational factor, which is used to fine-tune the RoBERTa model. These components are combined using different weights, chosen based on empirical validation.

To guide our investigation, we formulate the following research questions:

RQ1: To improve robustness against rewriting attacks in LLM-generated text detection, can we employ a contrastive learning framework with dynamic anchor switching to learn paraphrase-invariant representations?
RQ2: To improve detection accuracy against paraphrasing attacks, can a physics-inspired gravitational factor in the embedding space enhance cluster separation between human and AI-generated texts?
RQ3: To evaluate the generalizability of the proposed GravText framework, can its performance consistency be assessed across different text lengths and between open-source and closed-source LLMs?

These questions directly motivate our design choices and empirical evaluation.

The remainder of this paper is structured as follows: Section 2 reviews related work in LLM text detection and contrastive learning. Section 3 details the architecture of the proposed GravText framework, including the dynamic anchor switching strategy and the gravitational factor implementation. Section 4 describes the experimental setup, datasets, and baseline models. Section 5 presents the comprehensive results and comparative analysis. Finally, Section 6 concludes the paper and outlines directions for future work.

2. Related Work

2.1. LLM Generated Text Detection

The rapid evolution of large language models has made detecting LLM-generated text increasingly challenging, especially under paraphrase-based obfuscation attacks [6,7]. Existing detection methods can be grouped into three categories: watermarking, statistical analysis, and supervised classification.

Watermarking Methods embed hidden signals in LLM-generated text, making them detectable by computers while imperceptible to humans. Kirchenbauer et al. [8] introduced a watermarking framework that modifies logits during token sampling to embed a detectable signal. However, detection rates typically drop below 50% when the text undergoes heavy paraphrasing such as sentence reconstruction, and the method often requires access to the model’s sampling process. Zhao et al. [17] extended this approach with the Unigram-Watermark, which employs fixed grouping and error-correcting codes to achieve provable robustness against local edits. Nevertheless, it performs weakly under global semantic paraphrasing. Hou et al. [18] proposed SemStamp, a semantic watermarking scheme that uses locality-sensitive hashing to embed sentence-level markers, maintaining strong detection performance after paraphrasing. However, its effectiveness diminishes for texts shorter than 100 words, and the hashing computation is resource-intensive. Golowich [19] developed a watermark resilient to proportional editing through pseudorandom codes, with theoretical guarantees on persistence in models such as Llama. However, the assumption of proportional editing does not cover random or adversarial paraphrasing strategies. Despite these advances, a major limitation of current watermarking methods is their reliance on the surface form or specific token distribution of the generated text. As a result, they often fail when the original text is paraphrased or semantically preserved but syntactically altered.

Statistical Methods detect AI-generated content by analyzing intrinsic linguistic features or response stability under perturbation, typically without requiring labeled training data. GLTR [10] measures the predictability of the text using the conditional probability ranking of each word by the language model to identify the content likely to be generated by the model, improving human accuracy by 72% in visualization studies; however, it is vulnerable to sophisticated perturbations and scales poorly for large-scale analysis. DetectGPT [11] makes judgments by calculating the log-probability curvature before and after the perturbation using a BERT-like mask-filling mechanism, achieving 95% AUROC on AI generated text detection; however, it is computationally intensive due to multiple perturbations and generalizes weakly to complex texts. Fast-DetectGPT [20] optimizes the detection efficiency of DetectGPT by calculating the conditional probability curvature, accelerating by 340× and improving AUROC by 75% on XSum datasets; nonetheless, it performs poorly on short texts. DNA-GPT [21] truncates the original text and uses LLM to continue it, identifying the source according to the difference between the generated text and the original text via n-gram divergence and z-tests, reaching up to 98% accuracy on multilingual datasets with explainability; however, it is sensitive to the sourcing model. Fagni et al. [12] evaluated detection on deepfake tweets by creating the TweepFake dataset and testing 13 methods, with BERT-based classifiers achieving 95% accuracy; but short-text perturbations easily evade it. However, statistical methods have weak generalization ability in the face of some complex texts or tasks, and are difficult to capture complex text structure and semantic meaning, and are easy to avoid using simple perturbations [12].

Supervised Learning Methods train binary classifiers on labeled datasets of human and AI-generated texts. Jawahar et al. [14] surveyed early detection approaches, highlighting the need for robustness against evolving language models. Uchendu et al. [22] showed that TuringBench, a RoBERTa-based detector, achieves up to 95% accuracy on in-distribution data, but its performance drops significantly on out-of-distribution samples. Chen et al. [23] reformulated detection as a next-token prediction task using fine-tuned T5 models, leveraging token-level probabilities for discrimination. Peng et al. [24] studied adversarial essay detection and found that existing detectors degrade to an AUROC of 0.6 under attack, underscoring the fragility of supervised models even under minor perturbations. Hu et al. [25] improved robustness by integrating PPO-based reward signals into adversarial training, enhancing detection under paraphrasing. However, such methods depend on specific reward designs and may still fail on entirely unseen paraphrase patterns. Despite their high empirical accuracy, supervised approaches typically require large labeled datasets and are prone to overfitting to stylistic patterns in the training distribution. They often lack the semantic invariance necessary to withstand targeted paraphrasing attacks, motivating the development of model-agnostic and paraphrase-invariant frameworks such as ours.

2.2. Contrastive Learning

Contrastive Learning is a powerful self-supervised method that trains models to bring similar samples closer in representation space while pushing dissimilar samples apart. Its applications have evolved from computer vision to NLP, and most recently, to AI-generated text detection.

The seminal work of Schroff et al. [26] first introduced triplet loss to learn compact, discriminative embeddings in face recognition, although its cubic computational scaling limits its large-scale use. This foundation was expanded in vision by Chen et al.’s SimCLR [27], which uses InfoNCE loss to learn visual representations via augmented image pairs, proving the effectiveness of large batch sizes and data augmentation.

Building on the success in vision, contrastive learning has been adapted to NLP. For example, Gao et al. [28] proposed SimCSE, which constructs positive pairs by applying dropout twice to the same sentence and optimizes sentence embeddings via InfoNCE loss, effectively regularizing anisotropic representation spaces. Chuang et al. [29] improved negative sampling by reducing the impact of false negatives—instances where negative samples are semantically similar—thereby mitigating sampling bias in unsupervised contrastive learning. These works demonstrate that contrastive learning can capture fine-grained semantic distinctions, making it well-suited for tasks requiring deep textual understanding.

Recently, contrastive learning has been successfully applied to detect AI-generated text. DeTeCtive [30] employs multi-task and multi-level contrastive objectives to distinguish human and machine writing styles across different encoders, achieving strong out-of-distribution performance. Liu et al. [31] proposed PECOLA, which enhances perturbation-based detection with a multi-contrastive objective and improved preservation of key semantic words, yielding greater robustness than standard classifiers. La Cava et al. [32] introduced WhosAI, a triplet-based contrastive framework that jointly performs detection and authorship attribution by learning semantic embeddings, enabling scalable and model-agnostic identification.

These advances motivate our work, where we further enrich contrastive learning by integrating triplet-based anchors, cross-anchor switching, and a gravitational factor to further improve the robustness of the model, especially when faced with paraphrased texts.

3. Methodology

To address the challenge of detecting paraphrased text generated by large language models (LLMs), as highlighted in prior work [6], we propose the GravText framework, designed to robustly discern both original LLM-generated text (Human or Original ChatGPT) and paraphrased LLM-generated text (Human or Paraphrased LLM). Motivated by the limitations of existing methods in maintaining performance against paraphrased outputs, GravText integrates three key components: a flexible triplet contrastive data construction to capture paraphrase-invariant semantics, a gravitational factor inspired by physical principles to enhance embedding separation, and a fine-tuning strategy with a hybrid loss to balance multiple objectives. This section details these components, their implementation, and their synergy in tackling paraphrasing challenges. The GravText architecture is depicted in Figure 1, illustrating the anchor selection process and loss integration.

3.1. Paraphrase Strategy and Data Augmentation

To enhance robustness, GravText leverages the semantic proximity between original (

T_{L}

) and paraphrased (

T_{P}

) LLM-generated texts. We design a two-phase paraphrasing strategy: (1) varying the paraphrase lengths, and (2) comparing different LLMs as paraphrasers. In both phases, we use the prompt “Please rewrite the above text to mimic human tone” to generate

T_{P}

, which intentionally blends LLM-generated features with human-like writing styles, thereby increasing detection difficulty and challenging the model’s embedding optimization.

Phase 1: Paraphrase Length. During this phase, we reformulate texts generated by LLMs in the HC3 dataset into three lengths: short (128 tokens), medium (256 tokens), and long (512 tokens). These adaptations reflect the dataset’s token distribution, spanning from concise to more elaborate responses (mean of 271, median 216, mode 116; refer to Figure 2). Each reformulated text (

T_{P}

) aims to convey the key message of the original

T_{L}

while adopting a more relatable and human tone.

For the Human or Original ChatGPT task, the reformulated text

T_{P}

is treated as the reference, with the original

T_{L}

used for comparison. Conversely, in the Human versus Reformulated LLM setup, the original

T_{L}

serves as the basis, while

T_{P}

is examined as the comparative text. This approach provides insight into how detection systems respond to variations in linguistic style and semantic alignment.

Finally, we decided to use 256-length tokens as the main tokens for the subsequent experiments, and use different LLMS to paraphrase according to this length. The specific reasons and discussion are in Section 5.1.

Phase 2: LLM Comparison. In the second phase, we fix the paraphrase length at 256 tokens (based on preliminary findings from Phase 1) and compare two LLMs—ChatGPT and Qwen—as paraphrasers. ChatGPT is consistent with the original generation source in HC3 and English essay dataset, while Qwen is selected for its strong performance in open-source LLM evaluations. Both models utilize the same prompt to generate

T_{P}

. We hypothesize that ChatGPT-based paraphrases may reinforce stylistic patterns already present in the dataset, enhancing detection. In contrast, Qwen may introduce stylistic variance, potentially increasing the detection challenge.

The resulting paraphrased texts (

T_{P}

) are used to augment the HC3 and English essay dataset, contributing diversity in both writing style and length. This augmentation allows GravText to robustly detect both original and paraphrased LLM-generated texts across varying conditions, and supports the triplet contrastive learning approach introduced in the Section 3.2.

3.2. Anchor Data Selection

Contrastive learning excels at optimizing embedding spaces for semantic discrimination, with triplet loss being a cornerstone formulation. The triplet loss minimizes the distance between an anchor and a positive sample while maximizing the distance to a negative sample, defined as:

L_{triplet} (A, P, N) = max {0, D (A, P) - D (A, N) + margin},

(1)

where A, P, and N denote the anchor, positive, and negative samples, respectively,

D (X, Y)

is the Euclidean distance between embeddings of samples X and Y, and margin enforces a minimum separation threshold.

In GravText, we leverage triplet loss to distinguish LLM-generated text from human-written text, explicitly addressing paraphrasing. We hypothesize that paraphrased LLM-generated text retains semantic proximity to its original form, distinct from human text. We augment the HC3 Chinese dataset [15] and English essay dataset [16], which provides human-written texts (

T_{H}

) and original LLM-generated texts (

T_{L}

), with paraphrased texts (

T_{P}

) generated by applying open-source models like Qwen and closed-source models like ChatGPT to

T_{L}

. This design ensures diverse paraphrasing styles and enables GravText to robustly detect both original and paraphrased LLM-generated texts across diverse model architectures. To support dual detection tasks, we propose two anchor selection strategies:

Paraphrased Anchor (Human or Original ChatGPT): Set $T_{P}$ as the anchor, $T_{L}$ as the positive sample (shared LLM origin), and $T_{H}$ as the negative sample (human authorship). The loss is:

$L_{triplet}^{P A} = max {0, D (T_{P}, T_{L}) - D (T_{P}, T_{H}) + margin} .$

(2)

This configuration enhances detection of original LLM text by aligning paraphrased and original embeddings.
Original Anchor (Human or Paraphrased LLM): Set $T_{L}$ as the anchor, $T_{P}$ as the positive sample, and $T_{H}$ as the negative sample. The loss is:

$L_{triplet}^{O A} = max {0, D (T_{L}, T_{P}) - D (T_{L}, T_{H}) + margin} .$

(3)

This configuration targets paraphrased LLM text by anchoring on the original generation.

The function of this dynamic anchor switching strategy (DASS) extends beyond task definition; it is engineered for effective hard negative mining. Paraphrasing is inherently an adversarial process designed to make the LLM-generated text (P) semantically closer to human text (N), resulting in hard negative samples. The DASS ensures the triplet structure is optimized to target these ambiguous boundaries: by varying the anchor (A) between

T_{P}

and

T_{L}

, we maximize the chance of consistently capturing informative triplets that violate the margin constraint (

L_{triplet} > 0

). This focused optimization process, which forces the model to learn the subtle difference between the two classes at their closest points, is the true measure of DASS performance, as demonstrated by the overall robustness gains in our experimental results (Section 5).

By alternating between these configurations during training, GravText optimizes an embedding space that clusters LLM-generated text (original or paraphrased) tightly while separating it from human text, effectively solving the challenges faced by paraphrase.

3.3. Gravitational Factor

To enhance the detection effect of GravText, we introduce a gravitational factor inspired by Newton’s law of universal gravitation. In Newton’s theory, every mass attracts every other mass with a force proportional to the product of their masses and inversely proportional to the square of the distance between them. The formula for the gravitational force F between two masses

M_{1}

and

M_{2}

separated by distance R is:

F = G \cdot \frac{M_{1} \cdot M_{2}}{R^{2}},

(4)

where G is the gravitational constant. This physical model describes the attraction between objects based on their mass and distance.

In GravText, we adapt this concept to represent semantic relationships between text samples. Here, the “masses” correspond to the semantic significance of the text samples, and the “distance” is related to their embedding space distances. Specifically, paraphrased and original LLM-generated texts are “attracted” to form tighter clusters, while human-written texts are “repelled” to maintain separation. This gravitational factor supplements the triplet loss by providing a dynamic structural constraint on the embeddings.

To implement this idea, we replace the concept of mass with Cross-Attention. In traditional gravitational force, the mass of an object determines its gravitational pull; in our case, the Cross-Attention mechanism simulates the “mass” based on the semantic relationship between text samples. The specific computation process for the Human or Paraphrased LLM task is shown in Figure 3.

A crucial aspect of our framework is the intuitive motivation for using cross-attention as a proxy for semantic mass. Standard triplet loss relies on a single global distance metric (e.g., the Euclidean distance between [CLS] embeddings), which is a low-resolution measure that can be insensitive to subtle yet critical differences in paraphrased text. To overcome this, we introduce cross-attention as a high-resolution alignment mechanism. Instead of comparing two compressed global vectors, cross-attention computes a token-by-token alignment matrix, quantifying the semantic relevance between every token in the first sequence and every token in the second. In this context, we posit that the aggregated cross-attention score serves as a direct measure of semantic density and shared information content.

Specifically, a high aggregate score (high “mass”) signifies that the two texts possess a high concentration of mutually aligned semantic units. This is characteristic of an anchor and its positive (paraphrased) sample. Conversely, a hard negative sample (e.g., human text on the same topic) might have a similar global embedding (small distance) but will exhibit low internal alignment (low “mass”) under cross-attention, as its underlying information structure and expression differ. Therefore, cross-attention is uniquely suited as a proxy for “mass” because it quantifies the density of shared meaning, allowing our gravitational factor to more accurately model the true semantic attraction and repulsion between text pairs.

With this robust definition of semantic interaction, we can formally define the gravitational factor. Specifically, for each triplet, the gravitational force between the anchor text and another sample is computed as:

G (T_{i}, T_{j}) = {(- 1)}^{s_{i j}} \times \frac{C A (T_{i}, T_{j})}{D {(T_{i}, T_{j})}^{2} + ϵ}

(5)

where

C A (T_{i}, T_{j})

is the cross-attention output using the query from the anchor text

T_{i}

and the key-value pairs from the sample text

T_{j}

,

D (T_{i}, T_{j})

denotes their embedding distance, and

ϵ = 10^{- 6}

is a smoothing factor.

The sign factor

s_{i j}

is defined as:

s_{i j} = \{\begin{matrix} 0, & if T_{j} is a positive example (attraction) \\ 1, & if T_{j} is a negative example (repulsion) \end{matrix}

(6)

Here, a positive example represents a paraphrased LLM-generated text (attraction), while a negative example represents a human-written text (repulsion). The positive force pulls similar texts closer, and the negative force pushes dissimilar texts apart.

The overall gravitational loss is defined as:

G = α \cdot E_{j \sim p (j)} [ln (1 + exp (G_{neg}^{(j)} - G_{pos}^{(j)} + margin))]

(7)

where

G_{pos}

and

G_{neg}

are the gravitational factors for anchor-positive and anchor-negative pairs respectively,

p (j)

denotes the dataset,

α

is a scaling factor (default 1), and margin enforces minimum separation.

Integrating the cross-attention mechanism into the gravitational factor introduces additional computational overhead during the training phase, compared to standard triplet loss based solely on embedding distances. Specifically, the cross-attention operation adds an

O (L^{2})

complexity per triplet, where L is the sequence length, contributing to a longer optimization cycle.

However, it is important to emphasize that this computational cost is strictly limited to training. During inference, GravText relies solely on the trained RoBERTa encoder to generate text embeddings, followed by a lightweight distance-based classification. The gravitational factor module, including cross-attention, is not required at inference time. As a result, the inference speed of GravText remains comparable to that of a standard fine-tuned RoBERTa baseline. The additional training cost is thus a necessary and efficient trade-off for achieving significant robustness gains against adversarial paraphrasing.

By incorporating this gravitational loss, GravText strengthens the semantic cohesion of LLM-generated texts while enhancing their separation from human texts. This dynamic structural constraint further improves the model’s ability to distinguish between generated and human-authored content.

3.4. Fine-Tuning with Hybrid Loss

Relevant studies have demonstrated that fine-tuning RoBERTa yields strong performance in various natural language understanding tasks, such as text classification and recognition [33,34]. Building on the RoBERTa-base model [13], a 12-layer Transformer architecture, GravText leverages robust text representations by encoding input text into contextualized embeddings. Specifically, the final hidden state of the [CLS] token is used as the global representation of the input sequence:

H_{[CLS]} = RoBERTa (T) [0],

(8)

where

T \in {T_{H}, T_{L}, T_{P}}

. To ensure that the model performs best on the validation set, we adopted an early stop strategy. Specifically, if the F1 score and accuracy of the validation set of the model do not improve and the loss does not further decrease within five consecutive epochs, the training will be terminated and the model parameters with the highest F1 score will be saved.

The fine-tuning optimizes a hybrid loss:

L_{total} = λ_{1} L_{cls} + λ_{2} L_{triplet} + λ_{3} G,

(9)

where

L_{cls}

is the cross-entropy loss for classification, aligning with the binary detection tasks. The triplet loss

L_{triplet}

is selected from the set

{L_{triplet}^{PA}, L_{triplet}^{OA}}

, where

L_{triplet}^{PA}

(Paraphrase-Aware) and

L_{triplet}^{OA}

(Original Anchor) are designed for different tasks by selecting different anchor types.

Specifically,

L_{triplet}^{PA}

uses paraphrased texts as anchors to learn similarity between paraphrased and original LLM-generated texts while separating them from human texts. In contrast,

L_{triplet}^{OA}

uses original LLM texts as anchors to distinguish paraphrased LLM texts from human-written texts. The specific loss function is selected during training depending on the task objective.

Additionally,

G

improves semantic separation via an attention-based gravitational factor. The weights

λ_{1} = 1

,

λ_{2} = 1

, and

λ_{3} = 0.5

are selected based on validation F1 scores, balancing classification accuracy with embedding robustness. This hybrid loss formulation ensures GravText’s effectiveness across both detection scenarios.

4. Experimental Settings

This section presents the experimental settings for evaluating the GravText framework, designed to distinguish human-written text from large language model (LLM)-generated text across two tasks: identifying original LLM-generated text (Human or Original ChatGPT) and paraphrased LLM-generated text (Human or Paraphrased LLM). Building on the methodology in Section 3 and the triplet contrastive learning approach with a gravitational factor introduced in Section 3.2, we describe the experimental configuration, including the dataset, evaluation metrics, paraphrase strategies, and baseline comparisons. We validate the performance of the model on the HC3 dataset and English essay dataset. Paraphrase strategies leverage DeepSeek, Qwen, and ChatGPT to generate varied paraphrased texts, while baselines include DetectGPT, GLTR, and RoBERTa for comprehensive comparison. These settings enable a thorough assessment of GravText’s detection capabilities.

4.1. Dataset

We first adopt the HC3 Chinese dataset [15], a benchmark comprising approximately 25,706 samples (12,853 human-written and 12,853 LLM-generated responses) across domains including medicine, finance, law, psychology, computer science, and open-ended Q&A. The dataset’s diversity in style, vocabulary, and length (50 to 600 tokens) supports robust testing of GravText’s generalization. Each sample is reliably labeled for human or LLM authorship, aligning with GravText’s objectives.

Subsequently, in order to verify the robustness and effectiveness of Gravtext in the face of cross-language tasks, we choose the English essay dataset [16]. The dataset consists of 335 L2 English argumentative essays written by humans and 335 AI-generated texts. The human text is selected from Uppsala Student English Corpus (USE), and the author is a Swedish university student at CEFR A2 level. The LLM-generated texts were generated by ChatGPT on the same topic with the prompt “Write an 800-word essay on [topic] as a second language speaker.” The dataset ensures comparability through topic pairing, which aims to simulate the AI ghostwriting detection task in real academic scenarios.

4.2. Evaluation Metrics

We evaluate GravText using accuracy and F1-score to assess its performance in detecting LLM-generated text (

T_{L}

or

T_{P}

) versus human-written text (

T_{H}

). These metrics are computed based on classification outcomes: true positives (

T P

), correctly identified LLM-generated texts; true negatives (

T N

), correctly classified human texts; false positives (

F P

), human texts misclassified as LLM-generated; and false negatives (

F N

), undetected LLM-generated texts. They are defined as:

Accuracy = \frac{T P + T N}{T P + T N + F P + F N},

(10)

F 1 - Score = \frac{2 \times T P}{2 \times T P + F P + F N} .

(11)

Accuracy reflects overall classification correctness, while F1-score balances precision and recall for the positive class, critical for detecting paraphrased LLM texts that mimic human tone. We report metrics separately for Human or Original ChatGPT and Human or Paraphrased LLM to highlight GravText’s dual-task effectiveness, especially under human-like paraphrasing challenges.

4.3. Baseline Comparisons

We compare GravText against three representative baselines: GLTR [10], DetectGPT [11], and a fine-tuned RoBERTa [13], spanning statistical, curvature-based, and supervised paradigms.

GLTR: It evaluates the statistical likelihood of each token in a sequence by ranking it against predictions from a pretrained language model. Tokens with unusually low likelihood ranks are flagged as potential anomalies, under the assumption that human-written text exhibits more natural statistical patterns than machine-generated content [10].

DetectGPT: This is a model that estimates the local curvature of a log-probability surface around a given input by applying perturbations and measuring the change in model confidence.

RoBERTa: We fine-tune the standard RoBERTa-base model on a portion of the HC3 dataset to distinguish human-written from model-generated content through supervised learning. All models use the same tokenization and training configuration for fair comparison. Specifically, we adopt a batch size of 32, a learning rate of

2 \times 10^{- 5}

, and a maximum sequence length of 256 tokens. Training is governed by early stopping based on validation F1-score, with a patience of 5 epochs. The best-performing checkpoint is retained for evaluation.

To ensure compatibility with Chinese texts, we substitute the language models used in GLTR and DetectGPT with DeepSeek, a Chinese-oriented large language model. DeepSeek replaces GPT for token ranking in GLTR and functions as both perturbation and scoring model in DetectGPT, ensuring linguistic alignment and implementation simplicity.

5. Results

In this section, we evaluate the performance of the GravText framework on two binary classification tasks: Human or Original ChatGPT and Human or Paraphrased LLM. In the first task, we train models using paraphrased texts (

T_{P}

) as anchors to distinguish between human-written texts (

T_{H}

) and original ChatGPT responses (

T_{L}

). The paraphrased texts are generated using ChatGPT-4o and Qwen, each truncated to 256 tokens. We denote the variant using ChatGPT-generated anchors as GravText_GPT256, and the variant using Qwen-paraphrased anchors as GravText_Qwen.

For the second task,

T_{L}

serves as anchors to differentiate

T_{P}

from

T_{H}

, and we introduce GravText_ChatGPT, which adopts 256-token ChatGPT-4o paraphrases as anchors. Ablation studies examine the role of the gravitational factor across different tasks and anchor configurations. All results on the HC3 Chinese test set use 256-token paraphrases, and are benchmarked against baselines including DetectGPT, GLTR, and RoBERTa. Each experiment is averaged over five runs, and results are reported as F1 scores and accuracies with standard deviations. For sampling-based baselines (GLTR, DetectGPT), we use five random seeds to account for variance.

5.1. Effect of Paraphrase Token Length

Since the original LLM-generated text of the HC3 Chinese dataset is generated by ChatGPT, the rewritten text with three different lengths of tokens is first generated by ChatGPT-4o and detected for two experiments respectively. The results are shown as Table 1.

The visualization of the data in Table 1 is shown in the Figure 4. It reveals a clear upward trend in detection performance as the length of paraphrased responses increases. This pattern is consistent across both classification tasks. One likely reason for this improvement is that longer responses tend to expose more of the distinctive stylistic and structural features typical of LLM-generated text. These patterns, when more fully developed, offer stronger signals for classifiers to distinguish machine-generated outputs from those written by humans. In addition, our conjecture can also be supported by analyzing the variation of perplexity with respect to token length, as shown in Figure 5.

Nevertheless, the 512-token paraphrasing length, while yielding the highest metrics, raises two practical concerns. First, it considerably overshoots the average length of LLM responses in the original dataset, which is approximately 271 tokens. Inflating the text to this extent may introduce information that, although helpful for detection, is atypical in natural usage and thus may reduce the generalizability of the results. Second, the marginal performance gain between 256 and 512 tokens does not appear to justify the additional computational cost, particularly when compared to the more substantial improvement seen from 128 to 256 tokens.

At the other end of the spectrum, the 128-token setting—despite aligning closely with the dataset’s modal length of 116 tokens—appears to fall short in capturing sufficient semantic content. Its limited scope restricts the model’s ability to detect nuanced linguistic markers that often differentiate human and machine text. In contrast, the 256-token setting strikes a more appropriate balance. It delivers nearly optimal accuracy and F1 scores, while maintaining closer alignment with the dataset’s central length tendencies—situated between the median (216 tokens) and the mean (271 tokens).

For these reasons, we adopt 256 tokens as the default paraphrasing length in the remainder of our experiments. This choice reflects a pragmatic compromise between performance, efficiency, and consistency with the original text distribution. Unless otherwise noted, all future paraphrasing procedures will employ this token length to ensure comparability and minimize extraneous variance across settings.

5.2. Main Results

Summarized in Table 2 is the performance of GravText models and baselines on the Human or Original ChatGPT task, showcasing GravText’s robustness in detecting original LLM-generated texts using ChatGPT and Qwen-paraphrased anchors.

As shown in Table 2, both GravText₂₅₆ and GravText_Qwen outperform all baseline models by a notable margin on the Human or Original ChatGPT classification task. Specifically, GravText_Qwen achieves the highest performance, with an accuracy of 0.9757 and an F1-score of 0.9754, while GravText₂₅₆ follows closely with scores of 0.9676 and 0.9678, respectively. In contrast, the strongest baseline, RoBERTa, records an F1-score of 0.9459, whereas DetectGPT and GLTR fall further behind at 0.8891 and 0.8688, respectively.

To further assess the generalization capability of GravText models, we extend our evaluation to the Human or Paraphrased LLM task. In this setting, original ChatGPT generations (

T_{L}

) serve as anchors to differentiate paraphrased LLM responses (

T_{P}

) from genuine human-authored text. The corresponding results are presented in Table 3.

In the experiments of two different tasks, it can be found that the model dominated by Qwen performs better than the model dominated by ChatGPT. This may be due to the fact that the rewritten text by Qwen is closer to the original AI text than the rewritten text by ChatGPT, so the discrimination is more obvious. To test this, we plot the perplexity of human answers and chatgpt answers in the original dataset and the rewritten text using different models. The details are shown in the Figure 6.

To further verify the robustness and effectiveness of GravText in cross-lingual tasks, we conducted experiments on the English essay dataset using RoBERTa-base, the best-performing baseline model. The specific experimental results are shown in the Table 4.

Table 4 presents the results of our cross-lingual generalization experiment on the English essay dataset, where the GravText framework is compared against the fine-tuned English RoBERTa-base baseline for the Human or Original ChatGPT task. The results strongly validate the language-agnostic effectiveness of GravText’s core mechanisms. Specifically, the RoBERTa baseline achieves an F1 score of

0.9694

, which is significantly surpassed by both GravText variants. The best-performing model, GravText_Qwen, achieves an F1 score of

0.9910

and an accuracy of

0.9910

, demonstrating an absolute improvement of approximately 2.16 percentage points over the baseline. Furthermore, the GravText models exhibit substantially lower standard deviations (e.g.,

0.0039

for GravText_GPT256 compared to

0.0118

for RoBERTa), underscoring the superior detection accuracy and stability of our framework when applied to a new language and domain.

As with the HC3 dataset, we also extend our evaluation to the Human or Paraphrased LLM task. The results are shown in the Table 5.

Table 5 extends the cross-lingual analysis by evaluating robustness under paraphrasing attacks, comparing GravText and the RoBERTa baseline on the Human or Paraphrased LLM task using the English essay dataset. In the English essay dataset, the performance pattern differs slightly from that observed on HC3. The RoBERTa baseline exhibits a notably lower F1 score and accuracy when confronted with Qwen-paraphrased texts compared to those paraphrased by ChatGPT, suggesting a reduced robustness to paraphrasing strategies employed by different LLMs. In contrast, GravText achieves consistently high performance across both paraphrasing sources, with only a marginal advantage in favor of ChatGPT-paraphrased inputs—contrary to the more balanced results seen on HC3. As shown in Figure 7, the perplexity (PPL) values indicate that Qwen-generated paraphrases are closer to human writing in this dataset, which may explain the greater challenge they pose to the baseline model. Additionally, the relatively small scale of the English essay dataset may contribute to the overall higher F1 and accuracy scores, potentially amplifying performance differences between models under specific attack conditions.

5.3. Statistical Significance Analysis

To evaluate the performance differences between the GravText framework and the best baseline model, RoBERTa, we conducted corrected paired t-tests based on F1 scores and accuracy data from five independent runs. The Table 6 and Table 7 summarize the statistical significance results for the different tasks on HC3.

In addition, based on GravText’s performance on the English essay dataset, t-tests were also conducted for different tasks, and the results are shown in the Table 8 and Table 9.

Systematic statistical significance evaluation confirms that the GravText framework significantly outperforms the RoBERTa-base model in detection performance. This conclusion is based on extensive testing on both HC3 and English essay datasets: in all pairwise t-tests based on five independent experiments, the two variants of GravText consistently show significant advantages in F1 score and accuracy metrics (all p-values < 0.05). The results fully confirm the effectiveness and robustness of GravText method in cross-language and different detection task scenarios.

5.4. Ablation Study

To evaluate the contribution of the gravitational factor G, we conduct an ablation study by removing it from the loss function while keeping the overall architecture unchanged. In this variant, the force between samples is computed without the scalar modulation introduced by G, effectively eliminating the gravitational influence from the optimization process. We also compare the effect of different tasks when the anchor data is not changed.

As shown in Table 10 and Table 11, removing G reduces performance across both tasks. For Human or Original ChatGPT, GravText_Qwen’s F1 drops from 0.9769 to 0.9560 (2.1%) and

G r a v T e x t_{256}

’s from 0.9678 to 0.9548 (1.3%). For Human or Paraphrased LLM, GravText_ChatGPT’s F1 decreases from 0.9617 to 0.9545 (0.7%) under Qwen and from 0.9523 to 0.9348 (1.8%) under ChatGPT-256. These results confirm the role of G in enhancing embedding space separation, particularly for challenging paraphrased texts.

Similarly, for the other task, Human or Paraphrased LLM, ablation was also performed by eliminating the gravitational factor.

5.5. Discussion

Exemplified by DetectGPT, to better understand how model perturbations can distinguish between human-written and machine-generated texts, we visualize the log-probability curvature of different text sources. The comparison includes three types of datasets: (i) original responses generated by ChatGPT, (ii) paraphrased versions of these responses using 256 tokens from ChatGPT, and (iii) texts paraphrased by Qwen. The curvature values are calculated based on perturbations introduced via the DeepSeek model.

As illustrated in Figure 8, while there is a distinction between human-written and LLM-generated content in the curvature distribution, the separation is not distinct. The presence of overlapping regions indicates that relying solely on log-probability curvature may not be sufficient for reliable detection. This suggests the need for incorporating additional features or combining multiple signals to build a more robust identification framework.

The experimental results validate the effectiveness of our proposed method in detecting LLM-generated texts. In both tasks—Human or Original ChatGPT and Human or Paraphrased LLM—our approach outperformed existing mainstream methods such as DetectGPT, GLTR, and RoBERTa. The advantage became more evident when rewritten anchors were generated by stronger models like Qwen or original ChatGPT-4o.

We also found that increasing the length of rewritten texts improved classification accuracy, particularly from 128 to 256 tokens. However, the marginal gain diminished at 512 tokens, likely due to the distribution of text lengths in the dataset. Thus, 256 tokens strike a practical balance between performance and efficiency.

Regarding anchor quality, using rewritten texts from a different LLM (e.g., Qwen and ChatGPT) led to better performance than using the same model as the source, highlighting the benefit of stylistic diversity in contrastive learning. Ablation studies further confirmed the significance of the gravitational factor G, which leverages attention-based representations to quantify “semantic mass” and enhances the contrastive signal during training.

GravText demonstrates consistent and statistically significant performance gains over RoBERTa-base on both the Chinese HC3 and English essay datasets. These results establish its effectiveness and robustness as a language-agnostic solution for detecting AI-generated text. This cross-lingual validation is a critical finding of our study. While the absolute performance metrics naturally vary between the distinct datasets, the key observation is the persistence of the performance gap in favor of GravText. Notably, the framework demonstrates remarkable adaptability: it excels not only in detecting content from primary sources like the original ChatGPT but also maintains its advantage against paraphrased text generated by different LLMs (e.g., Qwen). The fact that all 16 paired t-tests across the two datasets yielded statistically significant results (

p < 0.05

) underscores that GravText’s superiority is not an artifact of a specific data distribution or language but a reliable attribute of the proposed method.

In summary, our approach demonstrates strong adaptability and robustness across different detection scenarios, although further validation is needed in multilingual and cross-domain contexts.

6. Conclusions

This study proposes the GravText framework for detecting text generated by large language models (LLMs), with a focus on paraphrased outputs posing challenges to authenticity in digital communication. By combining triplet contrastive learning and a gravitational factor, GravText achieves robust detection of both original and paraphrased LLM-generated texts, excelling particularly on texts mimicking human writing styles. Experiments on the HC3 Chinese dataset and English essay dataset show that GravText significantly outperforms existing methods, demonstrating strong generalization across various tasks and paraphrase models.

The framework introduces innovative techniques, including dynamic anchor switching to capture paraphrase-invariant semantic features and a physics-inspired gravitational factor that enhances embedding space separation through cross-attention. These advancements improve detection accuracy and provide a novel approach to countering paraphrase-based attacks. The significance of this work lies in its technical contributions to academic integrity, misinformation prevention, and content authenticity in domains like education and journalism.

Building upon the demonstrated cross-lingual effectiveness of GravText, future work will focus on deepening its theoretical underpinnings. A promising direction is to explore alternative semantic metrics, such as the dynamics of representation collapse or information-theoretic measures, to quantify and potentially replace the cross-attention-based gravitational factor. Furthermore, integrating theoretical insights from geometric deep learning could elucidate how textual embeddings evolve in gravitational fields, enhancing both the framework’s adaptability and its interpretability.

In conclusion, the GravText framework offers an innovative and effective solution for detecting paraphrased LLM-generated text, establishing a foundation for more reliable AI-driven communication systems.

Author Contributions

Conceptualization, Y.F. and H.W.; methodology, Y.F.; software, H.W.; validation, H.W., J.L., Z.C. and L.Y.; formal analysis, Y.F. and H.W.; investigation, J.L.; resources, Y.F.; data curation, H.W.; writing—original draft preparation, Y.F. and H.W.; writing—review and editing, J.L., Z.C. and L.Y.; visualization, H.W.; supervision, Y.F. and J.L.; project administration, Y.F. and H.W.; funding acquisition, Y.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (12271201, 11601181), the Scientific and Technological Research Project of the Education Department of Jilin Province: “Research and Application of Multi-Objective Path Planning Based on Environmentally Dynamic Perception Reinforcement Learning”, and the Research Project of Jilin University of Finance and Economics (2024LH011).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article, and further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, J.; Han, P.; Ren, X.; Hu, J.; Chen, L.; Shang, S. Sequence labeling with meta-learning. IEEE Trans. Knowl. Data Eng. 2021, 35, 3072–3086. [Google Scholar] [CrossRef]
Li, J.; Shang, S.; Shao, L. MetaNER: Named Entity Recognition with Meta-Learning. In Proceedings of the Web Conference 2020, New York, NY, USA, 20–24 April 2020; WWW ’20. pp. 429–440. [Google Scholar] [CrossRef]
Li, J.; Feng, S.; Chiu, B. Few-shot relation extraction with dual graph neural network interaction. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 14396–14408. [Google Scholar] [CrossRef] [PubMed]
Laskar, M.T.R.; Bari, M.S.; Rahman, M.; Bhuiyan, M.A.H.; Joty, S.; Huang, J. A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; ACL Anthology: Saarbrücken, Germany, 2023; pp. 431–469. [Google Scholar] [CrossRef]
Oketch, K.; Lalor, J.P.; Yang, Y.; Abbasi, A. Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring. arXiv 2025, arXiv:2503.11827. [Google Scholar] [CrossRef]
Krishna, K.; Song, Y.; Karpinska, M.; Wieting, J.; Iyyer, M. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Sydney, Australia, 2023; Volume 36, pp. 27469–27500. Available online: https://proceedings.neurips.cc/paper_files/paper/2023/file/575c450013d0e99e4b0ecf82bd1afaa4-Paper-Conference.pdf (accessed on 16 September 2025).
Chein, J.; Martinez, S.; Barone, A. Human intelligence can safeguard against artificial intelligence: Individual differences in the discernment of human from AI texts. Sci. Rep. 2024, 14, 25989. [Google Scholar] [CrossRef]
Kirchenbauer, J.; Geiping, J.; Wen, Y.; Katz, J.; Miers, I.; Goldstein, T. A Watermark for Large Language Models. In Proceedings of Machine Learning Research, Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; PMLR: Norfolk, MA, USA, 2023; Volume 202, pp. 17061–17084. Available online: https://proceedings.mlr.press/v202/kirchenbauer23a.html (accessed on 16 September 2025).
Kirchenbauer, J.; Geiping, J.; Wen, Y.; Shu, M.; Saifullah, K.; Kong, K.; Fernando, K.; Saha, A.; Goldblum, M.; Goldstein, T. On the Reliability of Watermarks for Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; Available online: https://openreview.net/forum?id=DEJIDCmWOz (accessed on 16 September 2025).
Gehrmann, S.; Strobelt, H.; Rush, A. GLTR: Statistical Detection and Visualization of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Florence, Italy, 28 July–2 August 2019; Costa-jussà, M.R., Alfonseca, E., Eds.; ACL Anthology: Saarbrücken, Germany, 2019; pp. 111–116. [Google Scholar] [CrossRef]
Mitchell, E.; Lee, Y.; Khazatsky, A.; Manning, C.D.; Finn, C. DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. In Proceedings of Machine Learning Research, Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; PMLR: Norfolk, MA, USA, 2023; Volume 202, pp. 24950–24962. Available online: https://proceedings.mlr.press/v202/mitchell23a.html (accessed on 16 September 2025).
Fagni, T.; Falchi, F.; Gambini, M.; Martella, A.; Tesconi, M. TweepFake: About detecting deepfake tweets. PLoS ONE 2021, 16, e0251415. [Google Scholar] [CrossRef] [PubMed]
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
Jawahar, G.; Abdul-Mageed, M.; Lakshmanan, V.S.L. Automatic Detection of Machine Generated Text: A Critical Survey. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; Scott, D., Bel, N., Zong, C., Eds.; ACL Anthology: Saarbrücken, Germany, 2020; pp. 2296–2309. [Google Scholar] [CrossRef]
Guo, B.; Zhang, X.; Wang, Z.; Jiang, M.; Nie, J.; Ding, Y.; Yue, J.; Wu, Y. How close is chatgpt to human experts? comparison corpus, evaluation, and detection. arXiv 2023, arXiv:2301.07597. [Google Scholar] [CrossRef]
Corizzo, R.; Leal-Arenas, S. One-class learning for ai-generated essay detection. Appl. Sci. 2023, 13, 7901. [Google Scholar] [CrossRef]
Zhao, X.; Ananth, P.V.; Li, L.; Wang, Y.X. Provable Robust Watermarking for AI-Generated Text. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna Austria, 7–11 May 2024; Available online: https://openreview.net/forum?id=SsmT8aO45L (accessed on 16 September 2025).
Hou, A.; Zhang, J.; He, T.; Wang, Y.; Chuang, Y.S.; Wang, H.; Shen, L.; Van Durme, B.; Khashabi, D.; Tsvetkov, Y. SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; ACL Anthology: Saarbrücken, Germany, 2024; pp. 4067–4082. [Google Scholar] [CrossRef]
Golowich, N.; Moitra, A. Edit distance robust watermarks via indexing pseudorandom codes. Adv. Neural Inf. Process. Syst. 2024, 37, 20645–20693. [Google Scholar] [CrossRef]
Bao, G.; Zhao, Y.; Teng, Z.; Yang, L.; Zhang, Y. Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature. In Proceedings of the ICLR, Vienna, Austria, 7–11 May 2024; Available online: https://openreview.net/forum?id=Bpcgcr8E8Z (accessed on 16 September 2025).
Yang, X.; Cheng, W.; Wu, Y.; Petzold, L.R.; Wang, W.Y.; Chen, H. DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text. In Proceedings of the ICLR, Vienna, Austria, 7–11 May 2024; Available online: https://openreview.net/forum?id=Xlayxj2fWp (accessed on 16 September 2025).
Uchendu, A.; Ma, Z.; Le, T.; Zhang, R.; Lee, D. TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.t., Eds.; ACL Anthology: Saarbrücken, Germany, 2021; pp. 2001–2016. [Google Scholar] [CrossRef]
Chen, Y.; Kang, H.; Zhai, V.; Li, L.; Singh, R.; Raj, B. Token Prediction as Implicit Classification to Identify LLM-Generated Text. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; ACL Anthology: Saarbrücken, Germany, 2023; pp. 13112–13120. [Google Scholar] [CrossRef]
Peng, X.; Zhou, Y.; He, B.; Sun, L.; Sun, Y. Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; ACL Anthology: Saarbrücken, Germany, 2023; pp. 10406–10419. [Google Scholar] [CrossRef]
Hu, X.; Chen, P.Y.; Ho, T.Y. RADAR: Robust AI-Text Detection via Adversarial Learning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Sydney, Australia, 2023; Volume 36, pp. 15077–15095. Available online: https://proceedings.neurips.cc/paper_files/paper/2023/file/30e15e5941ae0cdab7ef58cc8d59a4ca-Paper-Conference.pdf (accessed on 16 September 2025).
Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; Available online: https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Schroff_FaceNet_A_Unified_2015_CVPR_paper.html (accessed on 16 September 2025).
Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PmLR. pp. 1597–1607. Available online: https://proceedings.mlr.press/v119/chen20j/chen20j.pdf (accessed on 16 September 2025).
Gao, T.; Yao, X.; Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6894–6910. [Google Scholar] [CrossRef]
Chuang, C.Y.; Robinson, J.; Lin, Y.C.; Torralba, A.; Jegelka, S. Debiased contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 8765–8775. Available online: https://proceedings.neurips.cc/paper/2020/hash/63c3ddcc7b23daa1e42dc41f9a44a873-Abstract.html (accessed on 16 September 2025).
Guo, X.; He, Y.; Zhang, S.; Zhang, T.; Feng, W.; Huang, H.; Ma, C. Detective: Detecting ai-generated text via multi-level contrastive learning. Adv. Neural Inf. Process. Syst. 2024, 37, 88320–88347. [Google Scholar] [CrossRef]
Liu, S.; Liu, X.; Wang, Y.; Cheng, Z.; Li, C.; Zhang, Z.; Lan, Y.; Shen, C. Does DetectGPT Fully Utilize Perturbation? Bridging Selective Perturbation to Fine-tuned Contrastive Learning Detector would be Better. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 1874–1889. [Google Scholar] [CrossRef]
La Cava, L.; Costa, D.; Tagarelli, A. Is contrasting all you need? Contrastive learning for the detection and attribution of ai-generated text. arXiv 2024, arXiv:2407.09364. [Google Scholar] [CrossRef]
Bafna, J.; Mittal, H.; Sethia, S.; Shrivastava, M.; Mamidi, R. Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Mexico City, Mexico, 20–21 June 2024; Ojha, A.K., Doğruöz, A.S., Tayyar Madabushi, H., Da San Martino, G., Rosenthal, S., Rosá, A., Eds.; ACL Anthology: Saarbrücken, Germany, 2024; pp. 1627–1633. [Google Scholar] [CrossRef]
Zhang, Y.; Rong, L.; Li, X.; Chen, R. Multi-modal sentiment and emotion joint analysis with a deep attentive multi-task learning model. In Advances in Information Retrieval, Proceedings of the European Conference on Information Retrieval, Stavanger, Norway, 10–14 April 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 518–532. [Google Scholar] [CrossRef]

Figure 1. The framework of GravText architecture, illustrating the flexible anchor selection for triplet contrastive learning, gravitational factor computation, and hybrid loss integration. Here, green blocks represents the tensor of human text, orange blocks represents the tensor of LLM-generated text, and blue represents the tensor of text after paraphrase of LLM-generated text.

Figure 2. Token length distribution of LLM-generated answers in the HC3 dataset, with mean (271), median (216), and mode (116) tokens.

Figure 3. In the task of Human or Paraphrased LLM, we use the original ChatGPT answers as the anchor data. Firstly, we compute the key vectors for both positive and negative samples based on the query vector of the anchor sample, and normalize the key values using the softmax function to obtain the corresponding attention weights. Subsequently, these weights are used to perform weighted aggregation with the value vectors of the positive and negative samples, resulting in

G_{pos}

and

G_{neg}

, respectively. Finally,

G_{pos}

and

G_{neg}

are combined to generate the final representation vector G.

Figure 3. In the task of Human or Paraphrased LLM, we use the original ChatGPT answers as the anchor data. Firstly, we compute the key vectors for both positive and negative samples based on the query vector of the anchor sample, and normalize the key values using the softmax function to obtain the corresponding attention weights. Subsequently, these weights are used to perform weighted aggregation with the value vectors of the positive and negative samples, resulting in

G_{pos}

and

G_{neg}

, respectively. Finally,

G_{pos}

and

G_{neg}

are combined to generate the final representation vector G.

Figure 4. Results for different token lengths in different tasks. Task 1 is Human or Original ChatGPT and task 2 is Human or Paraphrased LLM.

Figure 5. Perplexity distribution of ChatGPT paraphrasing texts with different token lengths.

Figure 6. Perplexity distribution of texts from different sources.

Figure 7. Perplexity in English essay dataset.

Figure 8. The logarithmic probability curvature of the original ChatGPT answers, ChatGPT256token rewrite, and Qwen rewrite text after being perturbed by DeepSeek in 300 random samples.

Table 1. Performance of GravText_ChatGPT on HC3 with 128, 256 and 512 tokens paraphrased for the Human or Original ChatGPT task and Human or Paraphrased LLM task, across texts paraphrased. Metrics include accuracy and F1-score (variances in parentheses).

Paraphrase Length	Human or Original ChatGPT		Human or Paraphrased LLM
Paraphrase Length	F1	Acc	F1	Acc
128 tokens	0.9450 (0.0105)	0.9433 (0.0111)	0.9315 (0.0072)	0.9290 (0.0082)
256 tokens	0.9678 (0.0013)	0.9676 (0.0012)	0.9523 (0.0025)	0.9520 (0.0030)
512 tokens	0.9711 (0.0041)	0.9707 (0.0043)	0.9639 (0.0065)	0.9630 (0.0070)

Table 2. Performance of GravText models and baselines on HC3 with 256-token paraphrases for the Human or Original ChatGPT task; GravText_GPT256 denotes the model using ChatGPT-4o 256-token paraphrased texts as anchors. Metrics include accuracy and F1-score (variances in parentheses). The model with the best results is in bold.

Method	F1	Acc
DetectGPT	0.8891 (0.0246)	0.8740 (0.0391)
GLTR	0.8688 (0.0170)	0.8703 (0.0150)
RoBERTa	0.9459 (0.0099)	0.9489 (0.0104)
GravText_GPT256	0.9678 (0.0013)	0.9676 (0.0012)
GravText_Qwen	0.9754 (0.0033)	0.9757 (0.0033)

Table 3. Performance of GravText_ChatGPT and baselines on HC3 with 256-token paraphrases for the Human or Paraphrased LLM task, across texts paraphrased by Qwen and ChatGPT. Metrics include accuracy and F1-score (standard deviations in parentheses).

Method	Qwen		ChatGPT256
Method	F1	Acc	F1	Acc
DetectGPT	0.8331 (0.0109)	0.8024 (0.0160)	0.8080 (0.0814)	0.7624 (0.0135)
GLTR	0.7208 (0.0860)	0.7420 (0.0526)	0.7232 (0.0374)	0.7030 (0.0543)
RoBERTa	0.9283 (0.0133)	0.9219 (0.0286)	0.9099 (0.0186)	0.9167 (0.0256)
GravText_ChatGPT	0.9617 (0.0155)	0.9602 (0.0167)	0.9523 (0.0025)	0.9520 (0.0030)

Table 4. Performance of GravText models and RoBERTa-base on English essay dataset with 256-token paraphrases for the Human or Original ChatGPT task. The model with the best results is in bold.

Method	F1	Acc
RoBERTa	0.9675 (0.0118)	0.9676 (0.0121)
GravText_GPT256	0.9902 (0.0039)	0.9901 (0.0039)
GravText_Qwen	0.9910 (0.0083)	0.9910 (0.0082)

Table 5. Performance of GravText_ChatGPT and RoBERTa-base on English essay dataset with 256-token paraphrases for the Human or Paraphrased LLM task, across texts paraphrased by Qwen and ChatGPT.

Method	Qwen		ChatGPT256
Method	F1	Acc	F1	Acc
RoBERTa	0.9074 (0.0252)	0.9170 (0.0270)	0.9710 (0.0159)	0.9719 (0.0147)
GravText_ChatGPT	0.9924 (0.0062)	0.9924 (0.0062)	0.9931 (0.0047)	0.9931 (0.0047)

Table 6. Results of Paired t-tests Comparing GravText and RoBERTa Model Performance on task Human or Original ChatGPT on HC3.

Comparison	Metric	t Statistic	p Value
GravText_Qwen and RoBERTa	F1	6.5786	0.0028
GravText_Qwen and RoBERTa	Accuracy	5.7357	0.0046
GravText_GPT256 and RoBERTa	F1	5.2928	0.0061
GravText_GPT256 and RoBERTa	Accuracy	4.4217	0.0115

Table 7. Results of Paired t-tests Comparing GravText and RoBERTa Model Performance on task Human or Paraphrased LLM on HC3.

Paraphrased LLM	Metric	t Statistic	p Value
Qwen Paraphrase	F1 Score	5.1413	0.0068
Qwen Paraphrase	Accuracy	3.4738	0.0255
ChatGPT Paraphrase	F1 Score	3.7221	0.0204
ChatGPT Paraphrase	Accuracy	3.0177	0.0393

Table 8. Results of Paired t-tests Comparing GravText and RoBERTa Model Performance on task Human or Original ChatGPT on English essay dataset.

Comparison	Metric	t Statistic	p Value
GravText_Qwen and RoBERTa	F1	3.0270	0.0389
GravText_Qwen and RoBERTa	Accuracy	3.0174	0.0393
GravText_GPT256 and RoBERTa	F1	3.879	0.0179
GravText_GPT256 and RoBERTa	Accuracy	3.6643	0.0215

Table 9. Results of Paired t-tests Comparing GravText and RoBERTa Model Performance on task Human or Paraphrased LLM on English essay dataset.

Paraphrased LLM	Metric	t Statistic	p Value
Qwen Paraphrase	F1 Score	6.8632	0.0024
Qwen Paraphrase	Accuracy	5.5303	0.0052
ChatGPT Paraphrase	F1 Score	2.9901	0.0403
ChatGPT Paraphrase	Accuracy	3.6435	0.0219

Table 10. Ablation study results of Human or Original ChatGPT.

Method	F1	Acc
GravText_Qwen w/o. G	0.9560 (0.0136)	0.9540 (0.0131)
GravText_Qwen	0.9769 (0.0028)	0.9768 (0.0028)
GravText₂₅₆ w/o. G	0.9548 (0.0069)	0.9509 (0.0078)
GravText₂₅₆	0.9678 (0.0013)	0.9676 (0.0012)

Table 11. Ablation study results of Human or Paraphrased LLM.

Method	Qwen		ChatGPT256
Method	F1	Acc	F1	Acc
GravText_ChatGPT w/o. G	0.9545 (0.0113)	0.9567 (0.0102)	0.9348 (0.0069)	0.9309 (0.0078)
GravText_ChatGPT	0.9617 (0.0155)	0.9602 (0.0167)	0.9523 (0.0025)	0.9520 (0.0030)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Feng, Y.; Wang, H.; Li, J.; Cao, Z.; Yan, L. GravText: A Robust Framework for Detecting LLM-Generated Text Using Triplet Contrastive Learning with Gravitational Factor. Systems 2025, 13, 990. https://doi.org/10.3390/systems13110990

AMA Style

Feng Y, Wang H, Li J, Cao Z, Yan L. GravText: A Robust Framework for Detecting LLM-Generated Text Using Triplet Contrastive Learning with Gravitational Factor. Systems. 2025; 13(11):990. https://doi.org/10.3390/systems13110990

Chicago/Turabian Style

Feng, Youling, Haoyu Wang, Jun Li, Zhongwei Cao, and Linghao Yan. 2025. "GravText: A Robust Framework for Detecting LLM-Generated Text Using Triplet Contrastive Learning with Gravitational Factor" Systems 13, no. 11: 990. https://doi.org/10.3390/systems13110990

APA Style

Feng, Y., Wang, H., Li, J., Cao, Z., & Yan, L. (2025). GravText: A Robust Framework for Detecting LLM-Generated Text Using Triplet Contrastive Learning with Gravitational Factor. Systems, 13(11), 990. https://doi.org/10.3390/systems13110990

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

GravText: A Robust Framework for Detecting LLM-Generated Text Using Triplet Contrastive Learning with Gravitational Factor

Abstract

1. Introduction

2. Related Work

2.1. LLM Generated Text Detection

2.2. Contrastive Learning

3. Methodology

3.1. Paraphrase Strategy and Data Augmentation

3.2. Anchor Data Selection

3.3. Gravitational Factor

3.4. Fine-Tuning with Hybrid Loss

4. Experimental Settings

4.1. Dataset

4.2. Evaluation Metrics

4.3. Baseline Comparisons

5. Results

5.1. Effect of Paraphrase Token Length

5.2. Main Results

5.3. Statistical Significance Analysis

5.4. Ablation Study

5.5. Discussion

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI