1. Introduction
Large language models (LLMs), such as GPT-4, PaLM, and DeepSeek, have revolutionized natural language processing (NLP) with their ability to generate high-quality, human-like text across a wide range of tasks, including open-domain dialogue, machine translation, and document summarization. However, this technological advancement has also raised serious concerns about content authenticity, copyright attribution, and model accountability. The indistinguishability of AI-generated content has facilitated malicious use cases such as fake news propagation, academic plagiarism, and impersonation attacks [1, 2, 3, 4, 5]. These challenges necessitate technical mechanisms that ensure the traceability and verifiability of generated text.
Beyond technical misuse, recent studies have highlighted broader risks posed by LLMs, including AI deception [6], misinformation in healthcare [7], and regulatory gaps in generative model governance [8]. The societal consequences of untraceable AI-generated content have prompted calls for legal, ethical, and technical solutions to safeguard information integrity and public trust.
Text watermarking has emerged as a promising approach that embeds imperceptible identifiers into generated outputs, enabling post-hoc content verification. Existing watermarking methods generally fall into three categories [9, 10, 11]: (1) rule-based schemes such as synonym substitution, (2) neural steganography using encoding modules within Transformer architectures, and (3) statistical watermarking applied during inference via token partitioning strategies (e.g., green/red vocabulary sets). However, these methods exhibit significant limitations. Rule-based approaches are highly vulnerable to minor lexical perturbations, with detection accuracy dropping to as low as 68.4% [12, 13, 14]. Neural steganography is more robust but introduces substantial inference overhead (up to a 37% latency increase), rendering it impractical in many real-time or resource-constrained scenarios [15, 16, 17, 18]. Statistical methods, while lightweight, often fail under top-k sampling or semantic rewriting due to their reliance on frequency-based heuristics.
To address these gaps, we propose a novel watermarking framework—Contrastive Watermarking with Semantic Modeling (CWS)—that introduces several key innovations beyond existing token partitioning or neural signature methods. Unlike prior approaches that rely on static vocabularies or encoder-decoder architectures, CWS dynamically selects watermark tokens through semantic contrastive learning, ensuring contextual fluency and stealth. A shared embedding layer is introduced to align the semantic feature spaces between embedding and detection stages, mitigating feature drift. Detection is achieved through a dual-branch mechanism: a z-score-based statistical test for efficient public verification and a GRU-based semantic decoder capable of robust watermark identification even without access to logits or model internals.
We explicitly consider a black-box adversarial setting, where model internals, sampling logits, or private watermark keys are inaccessible. The adversary may apply paraphrasing, synonym substitution, or compression to obfuscate embedded signals. CWS is designed to be resilient against such expression-preserving perturbations but does not aim to resist white-box attacks with full model access or targeted removal. Importantly, we position watermarking as a verifiable attribution aid—not a standalone DRM solution—and advocate for its deployment in conjunction with ethical, legal, and policy-based safeguards.
Empirical results on GPT-2, OPT-1.3B, and LLaMA-7B using the C4 and DBpedia datasets show that CWS achieves up to 99.9% F1 and remains highly robust under various attacks, maintaining F1 ≥ 93% even in low-strength adversarial settings (ε ≤ 0.25, δ ≤ 0.2). Compared with LSTM and Transformer-based detectors, the GRU-based decoder offers the best speed–accuracy balance, requiring only 0.42 s per sample.
The main contributions of this work are as follows:
- (1)
Contrastive semantic token selection: A context-aware contrastive learning mechanism that selects semantically aligned watermark tokens, enhancing stealth and semantic fidelity.
- (2)
Contrastive embedding alignment: A shared embedding layer that synchronizes semantic representations across embedding and detection, supporting model-agnostic deployment.
- (3)
Keyless dual-branch detection framework: An efficient, inference-independent pipeline combining statistical testing and GRU-based decoding for robust watermark verification.
2. Related Work
The growing demand for trustworthy attribution and copyright protection of content generated by large language models (LLMs) has catalyzed the rapid development of detection technologies. Existing approaches can be broadly categorized into two paradigms: passive classification-based detection and active watermark embedding, each differing significantly in technical route and application scenario.
Firstly, passive classification-based detection identifies AI-generated text by distinguishing it from human-written content using classifiers. Representative studies include fine-tuning pretrained models (Zhan et al., 2023) [19], leveraging combined language model features (Gehrmann et al., 2023) [20], and adversarial training strategies (Liu et al., 2023) [21]. Although such approaches can achieve over 90% accuracy on benchmark datasets, they suffer from limited interpretability due to reliance on the implicit decision-making processes of black-box models (Sadasivan et al., 2023) [22]. More critically, as the quality of LLM-generated text improves, detection performance declines. For instance, when the perplexity (PPL) of GPT-4-generated text drops below 6.0, classifier accuracy becomes indistinguishable from random guessing. Moreover, passive methods lack the ability to embed identifiable marks proactively, making them insufficient for copyright certification.
Secondly, active watermark embedding techniques inject imperceptible identifiers during the generation process, enabling both content traceability and copyright verification. These methods can be grouped into three categories:
- (1)
Rule-based watermarking modifies surface-level features of text using predefined rules, such as lexical substitution or syntactic perturbation. For instance, He et al. (2022) [23] encoded watermarks by manipulating lexical diversity, but the method is highly sensitive to synonym substitution (detection accuracy drops to 68.4%). Yang et al. (2023) [24] used dependency-based paraphrasing to maintain local coherence, yet introduced significant global fluency degradation (PPL↑ 21.7%, BLEU↓ ≥ 12.5%). These methods rely on rigid handcrafted heuristics, which limits robustness and scalability for long or semantically complex texts.
- (2)
Neural network-based watermarking leverages deep learning to embed and extract watermarks through latent representations. Abdelnabi and Fritz (2021) [25] proposed a Transformer-based steganographic encoder that hides signals in the hidden state space, achieving high stealth but increasing generation latency by 37%. REMARK-LLM (Zhang et al., 2023) [26] encoded both model signatures and content jointly to improve traceability but suffered a 28.6% F1 drop in cross-model scenarios. These approaches typically require adversarial training and model-specific adaptation, making them computationally costly and less portable across LLM families.
- (3)
Statistical watermarking at inference time modifies token sampling distributions to encode signals without retraining. Kirchenbauer et al. (2023) [27] introduced a green-red vocabulary partitioning strategy to bias generation towards the green list, enabling lightweight detection. However, fixed partitions are vulnerable to reverse engineering and leak risks. Chen et al. (2023) [28] applied semantic clustering to improve stealth, reducing BLEU degradation to 8.3% but introducing 5× runtime overhead. SWEET (Lee et al., 2023) [29] watermarked high-entropy tokens for code generation but showed low generalization to natural language (F1 = 76.2%). These methods assume access to shared keys (e.g., seed lists) and often struggle to balance stealth with semantic fidelity.
In addition, recent advances in cryptographic and zero-shot watermarking have opened new research avenues. Liu et al. (2023) proposed UPV, an unforgeable and publicly verifiable watermark framework utilizing verifiable tokens and cryptographic commitments [10]. While UPV offers formal attribution guarantees, it requires specialized decoding infrastructure and incurs non-negligible embedding overhead.
Complementing cryptographic methods, zero-shot watermarking and detection approaches aim to eliminate training or architecture modification by leveraging pretrained model behavior. DetectGPT (Mitchell et al., 2023) [30] introduced a probability curvature-based criterion to identify LLM-generated text, achieving strong detection performance without retraining. Building on this idea, Yang et al. (2023) [31] extended zero-shot detection to code domains, showing that structural token probability deviations can flag synthetic code even in black-box scenarios. More recently, Lin et al. (2024) [32] proposed a zero-shot generative linguistic steganography (ZSGS) framework using in-context learning to encode secrets in natural language. Their method achieved higher statistical and perceptual imperceptibility than prior linguistic steganographic approaches.
However, zero-shot approaches often rely on surrogate scoring models or implicit behaviors, making their robustness and interpretability sensitive to model drift and prompt variations. Additionally, while promising for detection or covert communication, most zero-shot frameworks lack verifiable control over embedding strength, position, or extraction fidelity, limiting their applicability in authenticated watermarking or legal attribution settings.
While these methods show diverse technical designs, they face common trade-offs. Rule-based methods assume static mappings and fail under minor paraphrasing. Neural models require tight encoder-decoder coupling, hindering transferability. Statistical methods assume independent token choices and fixed key sharing, which may be statistically invalid or leak-prone. Cryptographic schemes offer provable guarantees but lack efficiency. Zero-shot methods ease deployment but reduce precision and resilience. Most steganographic strategies show an inherent tension between robustness, imperceptibility, and computational cost: strong embedding often reduces fluency (BLEU↓), low-strength watermarks are harder to detect (F1↓), and robust detection frequently incurs higher latency (≥0.85 s).
To address these limitations, we propose the CWS algorithm, which dynamically selects context-adaptive watermark tokens via semantic contrastive learning and combines a shared embedding layer with a lightweight GRU-based semantic detector. This design jointly optimizes stealth (PPL = 7.3), robustness (F1 = 99.9%), and efficiency (detection time = 0.42 s, 35–52% faster than baselines). CWS remains highly robust under three attack types (semantic rewriting, synonym substitution, and text compression) and demonstrates strong resistance even under high compression (F1 drop ≤ 1.6% at δ = 0.2), offering a promising solution for trustworthy LLM content attribution.
3. Proposed Method
We propose a novel watermarking algorithm for large language models (LLMs), termed Contrastive Watermarking with Semantic Modeling (CWS). This method integrates semantic contrastive learning, a shared embedding layer, and a GRU-based detection network to enable efficient watermark embedding and detection in generated text.
To ensure imperceptibility, detectability, and robustness, CWS introduces a semantic contrastive learning module during the watermark embedding process. This module selects watermark tokens by modeling and contrasting semantic contexts, allowing for the insertion of invisible watermarks. In parallel, a shared embedding layer is employed to unify the feature representation across embedding and detection phases, thereby mitigating feature shift and enhancing detectability. This design ensures semantic and structural symmetry between watermark generation and detection, as both processes operate over a unified embedding space and context-aligned representations. In the detection phase, we construct a lightweight dual-branch detection framework, consisting of a z-score-based statistical path and a Bi-GRU-based semantic path. These branches jointly support keyless public verification and robust recognition under semantic perturbations. Compared to traditional methods, CWS enhances watermark robustness while preserving text quality, ensuring reliable detectability even after post-processing or compression.
The overall architecture is illustrated in Figure 1.
3.1. Semantic Contrastive Learning
The semantic contrastive learning (SCL) module serves as the core component of the proposed CWS watermarking algorithm. Traditional text watermarking methods typically rely on a predefined vocabulary or static adjustment of token probabilities, overlooking the semantic coherence between candidate watermark tokens and the surrounding textual context. This often leads to degraded fluency or incoherent semantics in the generated text. To address this issue, we introduce a context-aware contrastive optimization strategy that leverages pretrained language models to extract deep semantic representations and applies an InfoNCE-based contrastive loss to identify contextually appropriate watermark tokens. This significantly improves the naturalness and imperceptibility of watermark embedding.
Specifically, the SCL module first employs a pretrained language model (e.g., GPT-2 or BERT) to generate deep semantic encodings of the input text sequence. Given an input sequence $x = (x_1, \dots, x_n)$, a self-attention mechanism captures the contextual hidden state matrix as:
$$\mathbf{H} = \mathrm{SelfAttention}(x_1, \dots, x_n)$$
where $\mathbf{H} \in \mathbb{R}^{n \times d}$ denotes the context-sensitive representation matrix of the sequence, with $n$ being the sequence length and $d$ the embedding dimension. This encoding effectively captures the underlying semantic context and guides the selection of watermark tokens with high contextual relevance.
Building on this representation, we introduce a contrastive learning mechanism to further optimize semantic alignment of candidate watermark tokens. A semantic similarity function is first defined to quantify the alignment between the current target token $w_t$ and each candidate watermark token $w_c$:
$$\mathrm{sim}(w_t, w_c) = \frac{\mathbf{e}_t \cdot \mathbf{e}_c}{\lVert \mathbf{e}_t \rVert \, \lVert \mathbf{e}_c \rVert}$$
where $w_t$ is the target token to be generated at position $t$, $w_c$ is a candidate token selected from the watermark vocabulary $\mathcal{W}$, and $\mathbf{e}_t$ and $\mathbf{e}_c$ represent their respective semantic embeddings. This similarity is used to rank candidate tokens based on contextual appropriateness.
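To enforce the selection of watermark tokens that best match the surrounding context, we adopt a contrastive optimization objective using the InfoNCE loss:
$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\left(\mathrm{sim}(w_t, w^{+}) / \tau\right)}{\sum_{w_c \in \mathcal{W}} \exp\left(\mathrm{sim}(w_t, w_c) / \tau\right)}$$
Here, $\mathrm{sim}(w_t, w_c)$ is the semantic similarity between the target token $w_t$ and a candidate $w_c$ from the watermark vocabulary $\mathcal{W}$, $\tau$ is a temperature parameter controlling the sharpness of the similarity distribution, and $w^{+}$ denotes the positive sample, that is, the candidate token with the highest semantic similarity to the target context:
$$w^{+} = \arg\max_{w_c \in \mathcal{W}} \mathrm{sim}(w_t, w_c)$$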
All other tokens in the candidate set serve as negative samples, which help distinguish the true context-aligned token from semantically mismatched ones. By minimizing the InfoNCE loss, the model adaptively optimizes the selection of watermark tokens that best align with the context, enabling more natural, stealthy, and robust watermark embedding.
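As a minimal sketch of the selection step described above, the following Python snippet scores a few candidate embeddings against a target embedding with cosine similarity, takes the most similar candidate as the positive sample, and evaluates the InfoNCE loss with the remaining candidates as negatives. The function names, the toy two-dimensional embeddings, and the temperature values are illustrative assumptions and are not taken from the CWS implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_positive_and_loss(target_emb, candidate_embs, tau=0.1):
    """Return (index of the positive candidate, InfoNCE loss).

    The positive is the candidate most similar to the target;
    every other candidate acts as a negative sample.
    """
    sims = [cosine(target_emb, c) for c in candidate_embs]
    pos = max(range(len(sims)), key=sims.__getitem__)
    scaled = [s / tau for s in sims]
    # Subtract the maximum for a numerically stable log-sum-exp.
    m = max(scaled)
    log_denom = m + math.log(sum(math.exp(s - m) for s in scaled))
    loss = -(scaled[pos] - log_denom)
    return pos, loss


# Toy example (hypothetical embeddings): candidate 0 is closest to the target.
target = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
pos, loss = select_positive_and_loss(target, candidates, tau=0.1)
_, loss_soft = select_positive_and_loss(target, candidates, tau=1.0)
print(pos, loss, loss_soft)
```

In this toy setting a smaller temperature sharpens the similarity distribution, so the loss for a clearly separated positive is smaller at tau = 0.1 than at tau = 1.0.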
In our proposed framework, the selected positive sample $w^{+}$ (i.e., the watermark token with the highest semantic similarity to the target token) is not only used for optimizing the InfoNCE loss but also directly injected into the text generation process. Specifically, after the model identifies $w^{+}$ for a given context window, we adjust the output token probability distribution by increasing the logit corresponding to $w^{+}$ with a tunable watermark strength parameter $\delta$. This adjustment biases the language model to sample $w^{+}$ with higher probability, thereby embedding the watermark in a way that is semantically consistent with the original context.
This integration of contrastive learning with generation-time logit adjustment forms a closed loop: semantic selection guides token biasing, and token biasing ensures that the semantically aligned watermark token is embedded into the generated sequence. Compared to traditional methods that rely on frequency-based or syntactic heuristics, our approach leverages deep contextual alignment, significantly improving stealth and robustness under real-world perturbations.
3.2. Training Procedure of CWS
Having introduced the core semantic modeling module, we now describe the overall training procedure for our watermarking system. This process involves training both the contrastive watermark embedding module and the semantic detection model. The complete training workflow is summarized in Algorithm 1.
Algorithm 1: Training Procedure of the CWS Framework
Input: pretrained language model; watermark vocabulary; labeled training corpus; hyperparameters K (top-K), τ (temperature), η (learning rate), E (epochs).
Phase 1: Train Contrastive Watermark Embedding
1: Freeze the parameters of the pretrained language model.
2: Initialize the shared embedding layer and the contrastive scoring head.
3: For each input sequence in the training corpus:
    (a) Use the language model to generate top-K token candidates per position.
    (b) Compute cosine similarities between ground-truth and candidate tokens.
    (c) Apply the InfoNCE loss to train the embedding and scoring modules.
4: Optimize with the Adam optimizer for E epochs.
Phase 2: Train GRU-based Semantic Detector
5: Generate watermarked texts using the trained watermarking module.
6: Construct a binary classification dataset: human vs. watermarked texts.
7: Extract token embeddings via the shared encoder.
8: Train the GRU-based detector using binary cross-entropy loss.
Deploy: Statistical z-score Detector
9: Fix the shared embedding and the watermark vocabulary.
10: Compute semantic centroids and the similarity threshold.
11: At inference, count semantically marked tokens and compute the z-score.
12: Use hypothesis testing to determine watermark presence.
Output: trained contrastive watermarking module; trained GRU-based semantic detector; deployable statistical z-score detector.
3.3. Watermark Embedding Algorithm
The objective of the watermark embedding algorithm is to imperceptibly inject a watermark signal into the text generation process while preserving the naturalness of the text and ensuring the watermark’s detectability and robustness. To achieve this, we propose a semantic contrastive learning-based watermark embedding method (CWS), which utilizes a shared embedding layer to align the feature space and applies a fully connected network (FCN) to adjust the generation probabilities of watermark tokens. Additionally, a multi-head attention mechanism is introduced to strengthen both local and global semantic associations.
- (1)
Watermark Embedding Procedure
Given an input sequence, the pretrained language model (e.g., GPT-2 or BERT) outputs logits that define the probability distribution over the next token. This distribution serves as the basis for evaluating token generation probabilities and is dynamically adjusted during watermark embedding.
Layer Mapping: During embedding, the pretrained LLM generates logits and selects watermark tokens based on semantic contrastive learning. However, in the detection phase, the logits of the LLM are not accessible; the watermark must be analyzed based solely on the generated text. If independent embedding representations are used in the embedding and detection stages (i.e., separate encoders), a mismatch in vector space distributions may occur, making it difficult for the detector to reconstruct the features used during embedding, thus reducing detection accuracy.
To solve this, we introduce a shared embedding layer (a separately trained fully connected network) that maps each token ID $x_i$ to a continuous semantic vector:
$$\mathbf{e}_i = f_{\mathrm{emb}}(x_i) \in \mathbb{R}^{d}$$
In implementation, the shared embedding layer is jointly trained with the contrastive learning module during watermark embedding. It replaces the default pretrained embeddings of the language model and is used consistently across both embedding and detection pipelines. Once trained, the shared embedding parameters are frozen and deployed in the detection phase to maintain vector space consistency.
Compared to approaches that use separate encoders for watermark insertion and detection, our shared embedding strategy ensures that both phases operate over a unified semantic space. This eliminates distributional mismatch and enhances the reliability and robustness of feature-based watermark detection, especially under black-box scenarios where LLM logits are inaccessible.
Watermark Token Selection: Based on the representations from the shared embedding layer, CWS selects watermark tokens through semantic contrastive learning, which evaluates the semantic similarity between the target token and each candidate watermark token. This process follows the semantic modeling and contrastive optimization introduced in Section 3.1. The generation probability of the selected watermark token is then adjusted to ensure successful embedding without disrupting textual coherence.
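Model for Adjusted Generation: The adjusted probabilities form an enhanced language model, which increases the likelihood of generating the selected watermark token while minimizing its impact on textual fluency. Writing $\ell_t(w)$ for the logit that the language model assigns to token $w$ at position $t$, the adjusted probability distribution is defined as:
$$P'(w \mid x_{<t}) = \frac{\exp\left(\ell_t(w) + \delta \cdot \mathbb{1}[w \in \mathcal{W}]\right)}{\sum_{w' \in \mathcal{V}} \exp\left(\ell_t(w') + \delta \cdot \mathbb{1}[w' \in \mathcal{W}]\right)}$$
where $\delta$ is the watermark strength parameter, $\mathcal{W}$ is the watermark vocabulary, $\mathcal{V}$ is the full vocabulary, and $\mathbb{1}[\cdot]$ is an indicator function equal to 1 if $w \in \mathcal{W}$, and 0 otherwise.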
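The adjusted generation model can be illustrated with a short Python sketch that adds the watermark strength to the logits of watermark tokens and renormalizes with a softmax. This assumes an additive logit bias, which matches the description of increasing the logit of the selected token by δ; the four-token vocabulary, the logit values, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def watermark_adjust(logits, watermark_ids, delta=2.0):
    """Add delta to the logit of every token id in watermark_ids, then renormalize."""
    biased = [x + delta if i in watermark_ids else x for i, x in enumerate(logits)]
    return softmax(biased)


# Toy next-token logits over a 4-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.5, 0.1]
watermark_ids = {2}  # assumed: token 2 is the selected watermark token

before = softmax(logits)
after = watermark_adjust(logits, watermark_ids, delta=2.0)
unchanged = watermark_adjust(logits, watermark_ids, delta=0.0)
print(before[2], after[2])
```

Setting delta to 0 recovers the original distribution, while larger values shift more probability mass onto watermark tokens at the cost of a larger deviation from the unmodified model.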
- (2)
Network Architecture
The CWS embedding framework follows a four-component architecture, comprising a shared embedding layer, a semantic contrastive learning layer, a pretrained language model, and a fully connected classification network (FCN). This end-to-end structure ensures semantic consistency, imperceptibility, and robust detectability of watermarks. After semantic optimization via the shared embedding and contrastive learning layers, the FCN adjusts the selection probabilities of candidate watermark tokens, finalizing the embedding decision.
The FCN computes the final adjustment as follows:
where
denotes the contextual embedding of the
token;
represents vector concatenation;
is the semantic similarity matrix generated from contrastive learning;
refers to an eight-head self-attention mechanism used to enhance global contextual information extraction;
are weight matrices;
are bias terms; and ReLU is the rectified linear unit activation function.
The semantic contrastive module computes the similarity between target tokens and watermark candidates, and the InfoNCE loss pulls semantically similar tokens into proximity within the embedding space. This allows for context-aware selection of watermark tokens that are semantically congruent with the surrounding text. The overall network thus supports natural and stealthy embedding while significantly improving the stability and robustness of watermark detection.
3.4. Watermark Detection Algorithm
The core task of watermark detection is to identify embedded watermark signals from generated text and evaluate their credibility, significance, and stability. Considering practical requirements such as interpretability, black-box adaptability, and robustness, we propose a dual-branch watermark detection architecture consisting of a z-score-based statistical path and a Bi-GRU-based neural semantic path. These two branches are complementary by design and can be flexibly applied to different levels of complexity and openness in real-world scenarios.
Specifically, the z-score branch provides a lightweight, publicly verifiable watermark detection mechanism based on statistical hypothesis testing, well-suited to black-box settings where access to the language model is restricted. In contrast, the Bi-GRU branch learns the semantic distribution of watermarks using deep neural networks, enhancing resilience against text perturbations and paraphrasing attacks. Both branches reuse the shared embedding layer from the embedding phase, forming a closed-loop system that mitigates feature shift. This design ensures detection accuracy and interpretability while supporting flexible deployment and improved scalability.
3.4.1. Branch I: Z-Score Statistical Detection Path (Statistical Path)
This branch introduces a statistical significance detection method based on classical hypothesis testing that does not require access to the internal structure or parameters of the language model. The core idea is to assess whether the observed distribution of watermark tokens in a text significantly deviates from a theoretically random distribution, thereby inferring the presence of a watermark. This method offers strong computational efficiency and theoretical interpretability.
To enhance its practical applicability, we further integrate the shared embedding layer to perform semantic representation of the input text. This alignment between detection and embedding stages reduces the risk of representation mismatch and increases the z-score method’s tolerance to lexical perturbation.
- (1)
Probability Modeling and Hypothesis Definition
Let $\mathcal{W}$ denote the set of watermark tokens, where a watermark token $w \in \mathcal{W}$ is assumed to occur with a theoretical sampling probability $p$ in non-watermarked text. Given a total text length of $n$, let the random variable $X$ denote the total number of watermark tokens observed in the text. Assuming independent token sampling, $X$ follows a binomial distribution:
$$X \sim \mathrm{Binomial}(n, p)$$
When $n$ is sufficiently large, by the Central Limit Theorem, the binomial distribution $\mathrm{Binomial}(n, p)$ can be approximated by a normal distribution:
$$X \approx \mathcal{N}(\mu, \sigma^{2})$$
where $\mu = np$ denotes the expected frequency of watermark tokens, and $\sigma^{2} = np(1 - p)$ quantifies their statistical dispersion in natural text.
We define the following hypotheses:
- (i)
Null hypothesis (H0): The text contains no watermark; the frequency of watermark tokens follows a random distribution.
- (ii)
Alternative hypothesis (H1): The text is watermarked; the observed frequency X is significantly higher than expected μ.
- (2)
Test Statistic and Decision Rule
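The standardized z-score is defined as:
$$z = \frac{X - \mu}{\sigma}$$
with $\mu$ and $\sigma$ the mean and standard deviation of $X$ under the null hypothesis. Given a significance level $\alpha$ (e.g., 0.01), the critical value $z_{\alpha}$ is obtained from the standard normal distribution. If $z > z_{\alpha}$, the null hypothesis $H_0$ is rejected, indicating the presence of a watermark.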
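The decision rule can be checked numerically with a few lines of Python. The sketch below assumes a one-sided test under the normal approximation to the binomial count; the text length of 200 tokens, the null probability of 0.25, and the observed counts are hypothetical values chosen only for illustration.

```python
import math
from statistics import NormalDist


def z_score_test(count, n, p, alpha=0.01):
    """One-sided z-test for an excess of watermark tokens.

    count: observed number of watermark tokens
    n:     text length in tokens
    p:     assumed probability of a watermark token in non-watermarked text
    """
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    z = (count - mu) / sigma
    # Upper-tail critical value of the standard normal distribution.
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return z, z_crit, z > z_crit


# 80 watermark tokens out of 200 when only 50 are expected by chance.
z, z_crit, detected = z_score_test(count=80, n=200, p=0.25)
# 52 watermark tokens is well within chance variation.
z_low, _, detected_low = z_score_test(count=52, n=200, p=0.25)
print(round(z, 3), round(z_crit, 3), detected, detected_low)
```

With these numbers the expected count is 50 and the standard deviation is about 6.12, so a count of 80 lies almost five standard deviations above the mean and the null hypothesis is rejected, whereas a count of 52 is not.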
To reduce over-sensitivity in short texts, we apply interval correction to adjust the confidence range using a length-aware weighting scheme, constraining the fluctuation within a reliable confidence band.
Moreover, traditional z-score methods based on surface token counting are vulnerable to morphological variants and paraphrasing attacks. To improve robustness, we adopt a soft semantic matching strategy based on embedding similarity:
$$X_{\mathrm{sem}} = \sum_{i=1}^{n} \mathbb{1}\left[\cos(\mathbf{e}_i, \mathbf{c}) \geq \theta\right]$$
where $\mathbf{e}_i$ is the shared embedding of the $i$-th token, $\mathbf{c}$ is the centroid embedding of all watermark tokens, $\theta$ is the cosine similarity threshold, and $\mathbb{1}[\cdot]$ is an indicator function. This semantic-aware frequency counting approach effectively enhances the system’s tolerance to expression-level perturbations.
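A minimal sketch of this soft counting step is given below: it averages the watermark token embeddings into a centroid and counts the tokens whose cosine similarity to that centroid reaches the threshold. The two-dimensional embeddings, the threshold of 0.8, and the function names are assumptions made for illustration; in CWS the embeddings would come from the shared embedding layer.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def soft_watermark_count(token_embs, watermark_embs, theta=0.8):
    """Count tokens whose embedding is within cosine threshold theta of the centroid."""
    c = centroid(watermark_embs)
    return sum(1 for e in token_embs if cosine(e, c) >= theta)


# Hypothetical embeddings: two watermark tokens and a four-token text.
watermark_embs = [[1.0, 0.0], [0.8, 0.2]]
token_embs = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.05], [-1.0, 0.0]]

count = soft_watermark_count(token_embs, watermark_embs, theta=0.8)
strict_count = soft_watermark_count(token_embs, watermark_embs, theta=0.999)
print(count, strict_count)
```

The resulting count can replace the exact-match count in the z-score; a stricter threshold behaves more like exact matching, while a looser one also credits near-synonyms of watermark tokens.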
It is worth noting that the statistical decision process of the z-score detector relies on an approximation that assumes independence among token selections, effectively modeling the occurrence of watermark tokens as a binomial process. While this assumption does not fully capture the correlated nature of natural language, it has been widely adopted in prior watermarking research (e.g., [27]) due to its analytical simplicity and interpretability. In our framework, we mitigate the potential dependence effects by applying semantic constraints on the watermark vocabulary and smoothing across token sequences. As shown in our experiments (Section 4.2.1), this approximation yields stable detection performance, with empirically bounded false positive rates under diverse text conditions.
- (3)
Applicability and Extensibility
This detection path does not require access to logits, model architecture, or secret keys. It relies solely on the predefined watermark vocabulary and semantic embeddings, offering true black-box adaptability and public verifiability. Its foundation in statistical hypothesis testing ensures strong interpretability and clear decision boundaries, making it well-suited for deployment in scenarios such as third-party auditing and content compliance inspection.
However, due to its dependence on explicit watermark tokens, its robustness may degrade under paraphrasing or synonym substitution attacks. To address this limitation, we further introduce a neural detection path, which complements the statistical method by providing enhanced recognition under complex disturbances, completing a dual-path detection framework.
3.4.2. Branch II: End-to-End Bi-GRU-Based Watermark Detection Path (Neural Path)
To overcome the limitations of statistical methods—namely their strong dependence on token frequencies and vocabulary and their limited robustness—we design an end-to-end watermark detection network based on a Bidirectional Gated Recurrent Unit (Bi-GRU). This method models contextual dependencies in the input embedding sequence to extract deep semantic features, thereby achieving robust identification of watermark signals under text perturbation and rewriting.
- (1)
Input Representation and Embedding Encoding
Given a candidate text sequence $x = (x_1, \dots, x_n)$ of length $n$, we first map discrete tokens into continuous vectors using the shared embedding layer trained during watermark embedding:
$$\mathbf{E} = (\mathbf{e}_1, \dots, \mathbf{e}_n) \in \mathbb{R}^{n \times d}$$
where $d$ is the embedding dimension, and $\mathbf{e}_i$ denotes the context-sensitive embedding of the $i$-th token. The parameters of the shared embedding layer are frozen during detection to ensure feature consistency with the embedding phase, thus improving the generalizability of the detection model.
- (2)
Sequential Feature Extraction and Watermark Signal Identification
Since the text may be subject to paraphrasing, editing, or synonym replacement, frequency-based methods lack robustness. The Bi-GRU module can extract semantic watermark features, going beyond surface-level token statistics and significantly improving resilience.
- (i)
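Bi-GRU Feature Extraction:
$$\overrightarrow{\mathbf{h}}_i = \overrightarrow{\mathrm{GRU}}\left(\mathbf{e}_i, \overrightarrow{\mathbf{h}}_{i-1}\right), \quad \overleftarrow{\mathbf{h}}_i = \overleftarrow{\mathrm{GRU}}\left(\mathbf{e}_i, \overleftarrow{\mathbf{h}}_{i+1}\right), \quad \mathbf{h}_i = \overrightarrow{\mathbf{h}}_i \oplus \overleftarrow{\mathbf{h}}_i$$
where $\oplus$ denotes vector concatenation. The Bi-GRU captures bidirectional dependencies and contextual cues, yielding a high-level feature representation $\mathbf{H} = (\mathbf{h}_1, \dots, \mathbf{h}_n)$ that preserves watermark-relevant semantics.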
- (ii)
Watermark Signal Strength Estimation:
The final hidden state $\mathbf{h}_n$ is passed through a parameterized transformation to compute the watermark signal strength:
$$s = \mathbf{w}^{\top} \mathbf{h}_n + b$$
where $s$ is the scalar watermark score, and $\mathbf{w}$ and $b$ are the weight vector and bias term, respectively.
- (iii)
Modified z-score Significance Testing:
We further apply a thresholded z-score decision on the Bi-GRU-extracted watermark score. If the standardized score exceeds a detection threshold, the text is classified as watermarked; otherwise, it is considered non-watermarked. This hybrid approach combines deep semantic feature extraction with statistical decision theory to improve detection accuracy and robustness.
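To make the neural path more concrete, the sketch below runs a bidirectional GRU over a short sequence of token embeddings in pure Python and applies a linear read-out to obtain a scalar watermark score. It uses the standard GRU gate equations with randomly initialized, untrained weights, and it takes the feature vector at the last position as the final hidden state; the dimensions, the initialization range, and all function names are assumptions for illustration rather than the trained CWS detector.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_gru(d_in, d_h, rng):
    """Random (untrained) parameters for the reset, update, and candidate gates."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {g: (mat(d_h, d_in), mat(d_h, d_h), [0.0] * d_h) for g in ("r", "z", "n")}


def gru_step(params, x, h):
    """One GRU update: h' = (1 - z) * n + z * h."""
    Wr, Ur, br = params["r"]
    Wz, Uz, bz = params["z"]
    Wn, Un, bn = params["n"]
    r = [sigmoid(a + b + c) for a, b, c in zip(matvec(Wr, x), matvec(Ur, h), br)]
    z = [sigmoid(a + b + c) for a, b, c in zip(matvec(Wz, x), matvec(Uz, h), bz)]
    n = [math.tanh(a + ri * b + c)
         for a, ri, b, c in zip(matvec(Wn, x), r, matvec(Un, h), bn)]
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def bigru_features(embs, fwd, bwd, d_h):
    """Per-token features: forward state concatenated with backward state."""
    h, forward = [0.0] * d_h, []
    for e in embs:
        h = gru_step(fwd, e, h)
        forward.append(h)
    h, backward = [0.0] * d_h, []
    for e in reversed(embs):
        h = gru_step(bwd, e, h)
        backward.append(h)
    backward.reverse()
    return [f + b for f, b in zip(forward, backward)]


def watermark_score(features, w, b):
    """Linear read-out on the feature vector at the last position."""
    return sum(wi * hi for wi, hi in zip(w, features[-1])) + b


rng = random.Random(0)
d_in, d_h = 4, 3  # hypothetical embedding and hidden sizes
fwd, bwd = init_gru(d_in, d_h, rng), init_gru(d_in, d_h, rng)
embs = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(5)]
H = bigru_features(embs, fwd, bwd, d_h)
w = [rng.uniform(-0.5, 0.5) for _ in range(2 * d_h)]
s = watermark_score(H, w, b=0.0)
print(len(H), len(H[0]), s)
```

In practice the score would be produced by trained weights and then standardized and compared against a detection threshold, as described above.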
- (3)
Detection Network Architecture
The CWS detection network adopts a modular design composed of a shared embedding layer, a Bi-GRU encoder, a z-score calculation module, and a binary classification head. The overall detection pipeline is formalized as:
where
denotes the contextual embedding of the input text,
is the modified z-score value, $W_o$ and $W_s$ are weight matrices,
is the bias term, and
is the sigmoid activation function that maps the output to the interval [0, 1] for binary classification. Finally, when the output exceeds the detection threshold, the text is judged to contain a watermark; otherwise, it is considered non-watermarked.
3.4.3. Method Analysis and Strength Summary
From a theoretical perspective, the detection mechanism of CWS is supported by the following guarantees:
- (1)
Statistical Theoretical Basis: The z-score-based detection path is grounded in significance hypothesis testing, providing a reproducible and interpretable watermark decision mechanism. It requires no access to the language model’s internal logits or architecture and detects watermark presence by evaluating statistical deviation in token usage or embedding distributions. The false positive rate (Type-I Error) can be explicitly controlled via the significance level α, ensuring statistical reliability.
- (2)
Neural Network Generalization: The Bi-GRU-based detection path models semantic context distributions of watermark tokens in latent space. Even when watermark tokens are rewritten, paraphrased, or reordered, the system can still identify watermark signals. Unlike explicit vocabulary matching, this path generalizes to semantically equivalent but lexically diverse expressions, providing robust generalization.
- (3)
Detection Efficiency Analysis: From a computational complexity standpoint, the statistical path performs vector comparison and normalization, with complexity $O(n)$. The neural path uses a single-layer Bi-GRU network, also with inference complexity $O(n)$, where $n$ is the text length. The overall detection process completes in well under a second per sample (0.42 s in the experiments of Section 4.2.2), satisfying the performance requirements of online detection and large-scale auditing scenarios.
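The relationship between the significance level α in item (1) and a z-score threshold can be made concrete with a short sketch. It assumes that the detection score of non-watermarked text is approximately standard normal, which is the usual premise of a one-sided z-test; the function names are illustrative.

```python
import math
from statistics import NormalDist

def z_threshold(alpha: float) -> float:
    """One-sided critical value z such that P(Z > z) = alpha under the null."""
    return NormalDist().inv_cdf(1.0 - alpha)

def one_sided_p_value(z: float) -> float:
    """Upper-tail probability of a standard normal beyond z.

    erfc is used instead of 1 - cdf(z) to stay accurate for large z.
    """
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# alpha = 0.05 gives a threshold of about 1.645.
print(round(z_threshold(0.05), 3))
# Under the normality assumption, a threshold of 8 corresponds to a
# nominal false positive rate below 1e-15.
print(one_sided_p_value(8.0) < 1e-15)
```

In practice the empirical false positive rate can be far higher than this nominal value when the normality assumption does not hold, which is why the threshold is also tuned empirically.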
4. Experiments and Analysis
4.1. Experimental Setup
4.1.1. Language Models and Datasets
To evaluate the effectiveness of our proposed watermarking algorithm under standard English generation settings, we utilize three pretrained language models of varying sizes: GPT-2, OPT-1.3B, and LLaMA-7B. For text generation, we adopt two common decoding strategies: beam search and top-k sampling.
In line with prior work, we construct prompt-based generation tasks using two benchmark datasets: C4 and DBpedia Class. These corpora are widely adopted in LLM evaluation due to their high-quality annotations and well-structured content, which facilitate reproducibility and controlled comparisons. We extract the first 30 words from each text sample as the prompt and generate the subsequent 200 ± 5 tokens using the specified decoding strategy. These texts are used to produce both clean and watermarked outputs for evaluation. For binary classification tasks, we randomly sample 500 human-written and 500 model-generated texts per setting. This sample size is selected to strike a balance between statistical reliability and computational efficiency and is consistent with prior studies in LLM watermarking research [
27].
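The prompt construction described above can be sketched as follows. Whitespace tokenization of the prompt and the helper names are assumptions made for illustration; the length check mirrors the 200 ± 5 token target.

```python
def split_prompt(text: str, prompt_words: int = 30):
    """Use the first `prompt_words` whitespace-separated words as the prompt.

    Returns (prompt, remainder); the remainder is the human-written
    continuation that can serve as the reference text.
    """
    words = text.split()
    prompt = " ".join(words[:prompt_words])
    remainder = " ".join(words[prompt_words:])
    return prompt, remainder

def within_length(n_tokens: int, target: int = 200, tol: int = 5) -> bool:
    """Check that a generated continuation has target +/- tol tokens."""
    return abs(n_tokens - target) <= tol

sample = " ".join(f"w{i}" for i in range(50))
prompt, remainder = split_prompt(sample)
print(len(prompt.split()), len(remainder.split()))
```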
While these benchmark datasets offer consistency, we acknowledge their limitations in representing noisy or domain-shifted real-world inputs (e.g., OCR outputs or social media). To mitigate this gap, our robustness experiments in Section 4.2.4 include semantically perturbed, lexically substituted, and compressed adversarial texts. Moreover, paraphrased and structurally rewritten examples are analyzed in Section 4.2.5 and Appendix B, simulating more diverse and challenging usage scenarios.
4.1.2. Baseline Method
As a baseline, we adopt the soft green-red list watermarking approach proposed by Kirchenbauer et al. [27] (hereafter KGW), which introduces a constant bias δ to green-list token logits to increase their sampling probability. Watermark detection is based on the proportion of green-list tokens in the generated output, where a one-tailed z-test is used to determine statistical significance. This method serves as a representative benchmark for comparing watermark detectability and stealth across different models and decoding strategies.
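The baseline mechanism can be sketched in a few lines: a green list is derived pseudo-randomly from the previous token, green-list logits receive the bias δ, and detection applies a one-tailed z-test to the green-token count, z = (g − γT) / sqrt(Tγ(1 − γ)). The SHA-256 seeding, the `key` value, and the toy vocabulary size are illustrative assumptions, not the reference implementation.

```python
import hashlib
import math
import random

def green_list(prev_token: int, vocab_size: int, gamma: float, key: int = 42) -> set:
    """Pseudo-random green list seeded by the previous token and a secret key."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def bias_logits(logits, green, delta):
    """Add the constant bias delta to every green-list logit."""
    return [l + delta if i in green else l for i, l in enumerate(logits)]

def green_z_score(tokens, vocab_size, gamma, key: int = 42) -> float:
    """One-tailed z statistic for the number of green tokens in a sequence."""
    scored = len(tokens) - 1  # the first token has no predecessor
    hits = sum(
        tokens[t] in green_list(tokens[t - 1], vocab_size, gamma, key)
        for t in range(1, len(tokens))
    )
    return (hits - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))

# Toy check: a sequence built only from green tokens yields a large z-score.
VOCAB, GAMMA = 50, 0.5
seq = [0]
for _ in range(200):
    seq.append(min(green_list(seq[-1], VOCAB, GAMMA)))
print(round(green_z_score(seq, VOCAB, GAMMA), 2))
```

Because detection only needs the token sequence and the key, the z-test itself does not require access to the model logits; the bias step is the only part that runs inside the generation loop.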
4.1.3. Implementation and Hyper-Parameter Settings
All experiments were conducted on a server with a 16-core Xeon® Platinum 8352 V CPU and 90 GB of RAM running Ubuntu 20.04. We used Python 3.8 and PyTorch 1.10 and loaded pretrained models via the Hugging Face Transformers library. Our watermark generation and detection pipeline was implemented in two stages: text generation and post-hoc detection.
To ensure statistical independence and eliminate the risk of dataset leakage or decoding alignment, we enforced strict data partitioning at the document and prompt levels. Specifically, training, validation, and test sets were disjoint in both document origin and prompt source. No prompt templates, input prefixes, or generated continuations were reused between phases. Watermarked texts used for training were generated using GPT-2, OPT-1.3B, and LLaMA-7B under fixed decoding parameters (e.g., δ = 3.0, top-k = 20). During evaluation, samples were generated under different configurations (e.g., δ = 2.0, top-k = 50, temperature = 0.9) and with different random seeds, ensuring diversity and avoiding configuration alignment between training and test phases.
In addition to intra-model evaluation, where detectors were trained and tested on samples from the same model under different decoding settings, we also conducted cross-model generalization tests. Specifically, detectors trained on the outputs of one model (e.g., GPT-2) were evaluated on watermarked texts generated by a different model (e.g., OPT-1.3B or LLaMA-7B). This setup tested the detector’s robustness to changes in token distribution and model behavior.
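One way to enforce and verify the document- and prompt-level partitioning described above is sketched below. The hash-based assignment and the 10%/10% validation and test fractions are assumptions for illustration; the split proportions are not stated in the text.

```python
import hashlib
from itertools import combinations

def assign_split(doc_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically map a document to a split from a hash of its ID."""
    bucket = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"

def splits_are_disjoint(splits: dict) -> bool:
    """True when no document ID and no prompt appears in more than one split.

    `splits` maps a split name to a list of (doc_id, prompt) pairs.
    """
    docs = {name: {d for d, _ in items} for name, items in splits.items()}
    prompts = {name: {p for _, p in items} for name, items in splits.items()}
    return all(
        docs[a].isdisjoint(docs[b]) and prompts[a].isdisjoint(prompts[b])
        for a, b in combinations(splits, 2)
    )

samples = [(f"doc{i}", f"prompt {i}") for i in range(100)]
splits = {"train": [], "val": [], "test": []}
for doc_id, prompt in samples:
    splits[assign_split(doc_id)].append((doc_id, prompt))
print(splits_are_disjoint(splits))
```

Assigning by document ID rather than by sample keeps every continuation of a given document in the same split, which is what prevents leakage through shared prefixes.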
For watermark embedding, the token bias strength δ was set to 2.5 and the embedding ratio γ to 0.5 by default. The detection module was based on a Bi-GRU network with a shared embedding layer frozen from the generator stage to ensure consistent token representation. The network was trained using the Adam optimizer with a learning rate of 0.01. The decision threshold was set to Z = 8, offering a balance between detection power and false positive control. All hyperparameters were empirically validated and are further analyzed in Section 4.2.3.
4.2. Core Performance Validation
4.2.1. Watermark Detection Under Different Generation Strategies
- (1)
Detection Stability under Varying Generation Strategies
To evaluate the performance of watermark detection under varied decoding strategies, we conducted experiments on three representative language models: GPT-2, OPT-1.3B, and LLaMA-7B. Both deterministic (beam search) and stochastic (top-k sampling) generation settings were considered. The evaluation compared the proposed CWS algorithm with the baseline KGW across four detection metrics: accuracy, true positive rate (TPR), true negative rate (TNR), and F1 score. The results are presented in
Table 1.
The results indicate that both KGW and CWS maintain consistently high detection performance across all models and decoding strategies. Under beam search, which generates more deterministic sequences, watermark signals are better preserved, leading to near-perfect detection scores. For instance, LLaMA-7B combined with KGW achieves 100% on all metrics. In contrast, top-k sampling introduces greater randomness, slightly reducing performance stability. Nevertheless, CWS maintains a strong F1 score (e.g., 99.8% on LLaMA-7B) and demonstrates robust adaptability across diverse generation patterns.
To ensure comprehensive decoding coverage, we also explored other generation strategies during preliminary experiments, including nucleus sampling (top-p = 0.9) and temperature sampling (T = 0.8–1.2). However, we observed that these methods often produced highly entropic outputs with inconsistent watermark behavior, particularly under smaller language models like GPT-2. Moreover, the detection metrics under these settings were not significantly distinguishable from those of top-k sampling with similar effective vocabulary sizes (e.g., top-k = 40 vs. top-p = 0.9). As such, to maintain clarity and avoid redundant comparisons, we focused our main analysis on beam search and top-k sampling, which offer stronger contrast and more interpretable differences in watermark retention.
Model-wise, LLaMA-7B shows the most stable behavior in precision-critical detection scenarios, while GPT-2 and OPT-1.3B offer more balanced performance across controlled and diverse generation settings. Algorithm-wise, KGW performs optimally under deterministic decoding, whereas CWS exhibits stronger generalizability and resilience under stochastic conditions, making it more suitable for practical deployment in dynamic environments.
- (2)
Extended Comparison with Prior Watermarking Methods
To further contextualize the performance of our proposed method, we compared it with two representative baseline watermarking algorithms, Unigram-Watermarking [21] and UPV [14], in addition to KGW [27], under consistent generation settings. All models were evaluated using GPT-2 with beam search (beam = 4), a watermark strength of δ = 2.0, embedding ratio γ = 0.5, and generation length T = 200. The same watermark vocabulary and detection interfaces were applied across all methods to ensure a fair comparison. The results are summarized in Table 2. All results were reproduced under a unified watermarking and detection pipeline to ensure comparability.
Compared to Unigram-Watermarking, which relies on token-level frequency bias without contextual alignment, CWS exhibits significant improvements in both precision and recall. The UPV method, while offering public verifiability and strong robustness guarantees, lacks semantic adaptivity and performs less effectively under beam-based decoding. KGW remains competitive across metrics but relies on access to internal sampling mechanisms, limiting its deployment in black-box settings. In contrast, CWS offers the best overall detection performance while maintaining a generalizable and publicly verifiable design, confirming its effectiveness across a diverse range of watermarking strategies.
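For reference, the four detection metrics reported in Tables 1 and 2 follow the standard definitions from a binary confusion matrix, with watermarked text as the positive class. The sketch below uses toy labels, not experimental data.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, TPR, TNR and F1 for binary labels (1 = watermarked)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "tpr": tp / (tp + fn) if tp + fn else 0.0,  # recall on watermarked text
        "tnr": tn / (tn + fp) if tn + fp else 0.0,  # specificity on clean text
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }

# Toy example: 4 watermarked and 4 clean texts, one error in each class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = detection_metrics(y_true, y_pred)
print(metrics)
```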
4.2.2. Ablation Analysis
To evaluate the functional contributions of key components in the proposed CWS framework—namely semantic contrastive learning (SCL), the shared embedding layer (SE), multi-head attention (MHA), and the detection backbone—we conducted a series of ablation experiments. The evaluation was performed using the LLaMA-7B generator under the beam search decoding strategy. Metrics include accuracy, true positive rate (TPR), true negative rate (TNR), F1 score, and inference latency. Each result was averaged over three independent runs (standard deviation < 0.3%), as presented in
Table 3.
The results reveal that all three components—SCL, SE, and MHA—are essential to achieving optimal detection performance. The baseline configuration with all components enabled attains the highest accuracy (99.8%) and F1 score (99.8%), while removal of any individual module results in substantial degradation.
Notably, eliminating the SCL module leads to the most severe performance collapse, with accuracy dropping to 15.3% and F1 score to 11.2%. This confirms that semantic contrastive learning is pivotal in distinguishing watermarked text from human-written content. The absence of SCL disrupts semantic token alignment, resulting in nearly random detection outcomes (TPR = 7.5%, TNR = 20.0%). Similarly, removing the shared embedding layer reduces accuracy to 76.7% and sharply lowers TNR to 53.4%, indicating feature misalignment between the embedding and detection stages. This misalignment causes the detector to overpredict watermarked content (TPR = 100%), severely compromising specificity. Disabling MHA moderately impacts performance (accuracy = 82.3%, TNR = 64.6%), as the model loses its ability to capture long-range dependencies and nuanced contextual signals. The MHA module enhances robustness by amplifying subtle watermark patterns dispersed across the input sequence.
In the backbone comparison, replacing GRU with LSTM maintains relatively high accuracy (97.6%) but increases inference time by 54.8% (from 0.42 s to 0.65 s). The Transformer backbone achieves slightly better accuracy than LSTM (98.5%), but its latency rises to 0.89 s, about 2.1× that of the GRU, making it less efficient for real-time applications.
In summary, the GRU-based backbone demonstrates the most favorable trade-off between detection accuracy and computational efficiency, outperforming the LSTM and Transformer configurations in both speed and accuracy. These findings underscore the critical role of the proposed modules and support the design choice of a lightweight, high-performing GRU architecture for watermark detection tasks.
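Several of the ablation figures can be cross-checked with simple arithmetic. On a balanced test set (500 human-written and 500 generated texts, as in Section 4.1.1), accuracy equals the mean of TPR and TNR, and the latency comparisons are plain ratios. The helper names below are illustrative, and the balanced-set premise is an assumption carried over from the setup.

```python
def balanced_accuracy(tpr: float, tnr: float) -> float:
    """Accuracy on a test set with equal numbers of positives and negatives."""
    return (tpr + tnr) / 2.0

def implied_tpr(accuracy: float, tnr: float) -> float:
    """TPR implied by accuracy and TNR on a balanced test set."""
    return 2.0 * accuracy - tnr

def relative_increase(base: float, new: float) -> float:
    """Fractional increase of `new` over `base`."""
    return (new - base) / base

# Without the shared embedding: TPR = 100%, TNR = 53.4% -> accuracy 76.7%.
print(round(balanced_accuracy(100.0, 53.4), 1))
# Without MHA: accuracy 82.3%, TNR 64.6% -> implied TPR of about 100%.
print(round(implied_tpr(82.3, 64.6), 1))
# LSTM vs. GRU latency: 0.42 s -> 0.65 s is an increase of about 54.8%.
print(round(100 * relative_increase(0.42, 0.65), 1))
# Transformer vs. GRU latency: 0.89 s / 0.42 s is about 2.1x.
print(round(0.89 / 0.42, 1))
```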
4.2.3. Hyper-Parameter Analysis
This section investigates the effects of three key hyper-parameters on watermark detection performance and text quality: the watermark bias strength (δ), embedding ratio (γ), and detection threshold (Z).
- (1)
Effect of Watermark Bias Strength (δ): The parameter δ controls the logit adjustment applied to watermark candidate tokens during generation. Smaller values result in weak watermark signals with minimal impact on fluency, while larger values strengthen detectability at the potential cost of naturalness. We evaluated δ ∈ {0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0} using GPT2-Chinese, measuring the z-score (detection significance), perplexity (PPL), and BLEU score (semantic consistency). As shown in
Figure 2, increasing δ raises the z-score from 5.3 to 10.3, indicating stronger detectability. However, this comes with an increase in PPL (6.5 → 9.2) and a decrease in BLEU (81.0 → 72.7), suggesting a trade-off between robustness and generation quality. Based on this analysis, we selected δ = 2.5 as the default value, as it achieves a z-score > 8.0 while maintaining acceptable PPL and BLEU levels.
- (2)
Effect of Watermark Embedding Ratio (γ): The parameter γ defines the proportion of tokens eligible for watermark embedding. We evaluated γ ∈ {0.1, 0.25, 0.5, 0.78, 0.9}, assessing the F1 score alongside diversity metrics (Distinct-1 and Distinct-2). As illustrated in
Figure 3, higher γ improves F1 from 95% to 99%, enhancing detection capability. However, diversity declines: Distinct-1 drops from 0.86 to 0.48 and Distinct-2 from 0.79 to 0.32. This reflects a loss in lexical and syntactic variability due to more aggressive watermarking. We thus set γ = 0.5 as the default, achieving near-optimal detectability (F1 = 99%) while preserving a high degree of text diversity (Distinct-1 ≈ 0.75, Distinct-2 ≈ 0.68).
- (3)
Effect of Detection Threshold (Z): The detection threshold Z determines whether a text is classified as watermarked based on its detection score. We varied Z from 4 to 10 and evaluated detection accuracy, false positive rate (FPR), true negative rate (TNR), and F1 score. As shown in
Figure 4, increasing Z drastically reduces FPR (55% → <1%) and boosts TNR (≈50% → ≈99%). Meanwhile, F1 and accuracy peak at ~99% when Z ∈ [7, 8]. For Z > 9, detection becomes over-conservative, leading to a slight drop in overall performance. We therefore adopted Z = 8 as the default threshold, ensuring high precision, robustness, and minimal false positives.
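Two of the quantities used in this analysis are easy to state precisely in code: Distinct-n, the ratio of unique to total n-grams, and a sweep of the threshold Z over detection scores. The scores in the example are made-up toy values, not measurements from the experiments.

```python
def distinct_n(tokens, n: int) -> float:
    """Distinct-n: number of unique n-grams divided by total n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def sweep_threshold(scores_wm, scores_clean, thresholds):
    """Return (Z, TPR, FPR, TNR) for each candidate threshold Z.

    A text is flagged as watermarked when its score exceeds Z.
    """
    rows = []
    for z in thresholds:
        tpr = sum(s > z for s in scores_wm) / len(scores_wm)
        fpr = sum(s > z for s in scores_clean) / len(scores_clean)
        rows.append((z, tpr, fpr, 1.0 - fpr))
    return rows

tokens = "a b a b".split()
print(distinct_n(tokens, 1), distinct_n(tokens, 2))

# Toy scores: raising Z removes false positives at some cost in TPR.
rows = sweep_threshold([9.0, 10.0, 7.5], [1.0, 5.0, 3.0], thresholds=[4, 8])
for row in rows:
    print(row)
```

The sweep makes the trade-off explicit: a higher Z lowers FPR monotonically, while TPR can only stay flat or fall, so F1 peaks at an intermediate threshold.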
4.2.4. Robustness Evaluation
To comprehensively evaluate the robustness of the proposed watermarking framework under adversarial and natural perturbations, we simulated four representative types of attacks: (1) semantic reconstruction (e.g., paraphrasing), (2) vocabulary substitution (e.g., synonym replacement), (3) text compression (e.g., token deletion), and (4) summarization-based semantic compression (e.g., content abstraction). These attacks represent common post-processing scenarios in real-world downstream applications of LLM outputs. We evaluated detection performance under each attack type using three popular open-source English LLMs: GPT-2, OPT-1.3B, and LLaMA-7B. All watermark-embedded texts were generated with beam search decoding under the default watermark strength δ = 2.5 and embedding ratio γ = 0.5. The watermark detector was implemented using a shared embedding layer and Bi-GRU architecture, consistent across all evaluations. Each experiment was repeated three times, and the reported metrics include F1 score, z-score, token-level TPR, BLEU-4, ROUGE-L, and other task-specific indicators.
- (1)
Semantic Reconstruction Attack (ε = 0.2): We applied DeepSeek-R1-7B to paraphrase watermarked samples using high-beam decoding (beam size = 50) with cosine similarity constraints (θ ≥ 0.85), simulating fluent semantic rewriting. As shown in
Table 4, the F1 scores drop moderately by 5.7–7.6 points across models, indicating effective but not complete evasion of watermark traces. LLaMA-7B shows the best resilience, retaining a post-attack F1 of 94.1. BLEU scores drop slightly, while ROUGE-L remains stable above 0.79, confirming semantic fidelity. These results demonstrate that the GRU detector can generalize across paraphrased expressions by capturing deep semantic deviations rather than surface forms.
- (2)
Vocabulary Substitution Attack (ε = 0.2): To simulate local lexical perturbations, we replaced 20% of watermarked tokens with synonyms generated by DeepSeek-R1-7B while enforcing semantic similarity (θ ≥ 0.8).
Table 5 shows a more significant F1 degradation, especially for GPT-2 and OPT-1.3B, with z-score drops of over 40%. Despite this, the semantic similarity remains above 0.80, and the GRU-based detector maintains an F1 above 90%, indicating partial robustness under distributed substitutions.
- (3)
Text Compression Attack (δ = 0.2): This attack removes 20% of tokens uniformly from each sentence, preserving the semantic skeleton while reducing watermark density. Here δ denotes the deletion ratio, not the watermark bias strength δ of Section 4.2.3. As shown in
Table 6, F1 scores remain high (above 97.8%), and z-scores are only slightly affected. Information Entropy Ratio (IER) and Keyword Retention Rate (KRR) confirm that important tokens are mostly preserved. These findings indicate that moderate deletion does not significantly hinder watermark detectability, particularly when biased tokens are semantically reinforced.
- (4)
Summarization-Based Compression Attack (δ = 0.6): To simulate extreme abstraction scenarios (e.g., media summarization), we applied the T5-base summarizer to reduce watermarked texts to 40% of their original length.
Table 7 shows a substantial degradation in both F1 (drops of 23.7–25.7) and TPR (drops of 28.5–31.4), despite maintaining meaningful content. LLaMA-7B performs slightly better under this severe compression, highlighting its more stable embedding patterns. This attack reveals a limitation of current watermark strategies under aggressive semantic shortening and motivates future development of entropy-resilient or syntax-agnostic detection modules.
Figure 5 presents a comprehensive robustness boundary analysis of the proposed watermarking system under increasing perturbation intensities across four attack types: semantic reconstruction (ε), vocabulary substitution (ε), text compression (δ), and summarization-based compression (δ = 0.6). Each curve illustrates the degradation trend of F1 scores as the attack strength increases, and background shading highlights four performance zones: high (F1 ≥ 95%), acceptable (90% ≤ F1 < 95%), moderate risk (80% ≤ F1 < 90%), and low (F1 < 80%).
The results reveal distinct robustness profiles across attack types. Under moderate perturbation levels (ε, δ ≤ 0.2), all models maintain F1 scores above 90%, demonstrating strong resilience against typical semantic rewriting and substitution. In contrast, summarization-based compression at δ = 0.6 causes a sharp performance collapse, reducing the F1 score to 75.1%—the only scenario crossing into the low-performance zone. Statistical tests (e.g., p < 0.01) confirm the significance of this degradation.
These observations confirm that the proposed framework achieves graceful degradation under realistic attacks, with only extreme summarization exposing structural vulnerabilities. This motivates future work toward entropy-resilient or syntax-invariant watermarking designs that can survive more destructive transformations without sacrificing detection accuracy.
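The token-deletion attack and the four performance zones used in Figure 5 can be sketched as below. Deleting positions uniformly at random over the whole token list (rather than per sentence) and the fixed seed are simplifying assumptions; the zone boundaries follow the F1 ranges given above.

```python
import random

def delete_tokens(tokens, ratio: float, seed: int = 0):
    """Remove a fraction `ratio` of tokens at random, keeping the original order."""
    rng = random.Random(seed)
    k = round(len(tokens) * ratio)
    drop = set(rng.sample(range(len(tokens)), k))
    return [tok for i, tok in enumerate(tokens) if i not in drop]

def performance_zone(f1: float) -> str:
    """Map an F1 score (in percent) to the zones shaded in Figure 5."""
    if f1 >= 95:
        return "high"
    if f1 >= 90:
        return "acceptable"
    if f1 >= 80:
        return "moderate risk"
    return "low"

original = list(range(200))
compressed = delete_tokens(original, ratio=0.2)
print(len(compressed))

# F1 values quoted in the text: 94.1 after paraphrasing (LLaMA-7B),
# 75.1 after summarization at a compression ratio of 0.6.
print(performance_zone(94.1), performance_zone(75.1))
```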
4.2.5. Case Study
To validate the impact of watermark embedding on the core attributes of generated texts—especially the dual effect on detection capability and generation quality—we conducted case-specific generation experiments using extended prompts. The evaluation focused on three key metrics: z-score (detectability), PPL (language fluency), and BLEU (semantic consistency). We compared three types of outputs: real news articles, non-watermarked generated texts, and watermarked generated texts. The results are presented in
Table 8. For clarity and reproducibility, representative example contents corresponding to each row are included in
Appendix A.
From a detection standpoint, non-watermarked texts exhibit near-zero z-scores (avg. 1.18), whereas watermarked texts consistently achieve high z-scores (avg. 9.13), confirming strong detectability. In terms of fluency, watermarked outputs maintain PPL levels similar to non-watermarked ones (7.5 vs. 7.4), and BLEU scores show only minor decreases (avg. drop < 4 points), indicating that semantic integrity and readability are preserved.
These results suggest that the proposed watermarking method achieves a robust trade-off between traceability and imperceptibility, making it suitable for real-world deployment. For adversarial robustness analysis, see
Appendix B.
4.2.6. Generalization and Sanity Checks
To address concerns regarding the generalizability and robustness of the proposed watermarking system, we conducted a series of sanity checks to verify that the reported detection performance is not the result of dataset leakage or structural overfitting.
First, we ensured strict separation between training, validation, and test sets by partitioning at the document and prompt levels. Watermarked samples in the test set were generated using different random seeds, sampling parameters (e.g., temperature, top-k), and watermark embedding hyperparameters (e.g., δ, γ) than those used during training. In all experiments, we used English-language generative models (GPT-2, OPT-1.3B, and LLaMA-7B) and avoided reusing any prompt or content across phases. Furthermore, we conducted cross-configuration evaluations, where detectors trained under one set of watermarking parameters (e.g., δ = 3.0, top-k = 40) were tested on samples generated under alternative settings (e.g., δ = 2.0, top-k = 100). These tests yielded stable results with minor performance degradation (F1 remains above 94%), confirming that the model is not merely memorizing surface-level distributional features.
We also evaluated the generalization ability of the GRU-based semantic detection module across models. In cross-model evaluations, detectors trained on GPT-2-generated watermarked samples were tested on samples generated by OPT-1.3B or LLaMA-7B. In these settings, the F1 score drops by 6–9 points, indicating some performance loss, but the model still maintains over 90% accuracy. This suggests that the GRU module captures generalizable semantic features rather than overfitting to model-specific token distributions.
Finally, we analyzed the system’s reliance on the semantic contrastive learning (SCL) module. As reported in the ablation study (Section 4.2.2), removing SCL causes the F1 score to drop from 99.3% to 11.7%. While this drop may appear drastic, it reflects the intended design of the system. The SCL module plays a pivotal role in embedding semantically aligned watermark tokens that are difficult to detect without context-aware representation learning. The sharp performance decline in its absence confirms that the system is not trivially classifying based on lexical features but relies on deeper semantic cues. Nevertheless, we acknowledge the risk of overdependence on a single module and plan to explore modular redundancy in future work. This includes integrating auxiliary weak detectors, ensemble voting schemes, and hybrid lexical–semantic embedding strategies to ensure graceful degradation under adversarial or out-of-distribution scenarios.
5. Conclusions
This paper proposes CWS, a lightweight and robust semantic watermarking framework for LLMs. By integrating contrastive learning with a shared embedding and dual-branch detection, CWS achieves both high imperceptibility and detectability. Experiments on GPT-2, OPT-1.3B, and LLaMA-7B across C4 and DBpedia Class datasets show that CWS reaches up to 99.9% F1 and maintains F1 ≥ 93% under typical perturbations (ε ≤ 0.25, δ ≤ 0.2). The GRU-based detector ensures efficient inference (0.42 s/sample) with better accuracy than LSTM and Transformer. Ablation and hyperparameter analyses confirm the effectiveness of semantic contrastive learning and the robustness of default settings (δ = 2.5, γ = 0.5, Z = 8). Overall, CWS provides a practical and deployable solution for watermarking LLM-generated text in real-world black-box settings. The framework leverages a structurally symmetric embedding–detection pipeline, in which shared feature alignment ensures coherence between watermark insertion and recognition phases.
Deployment Considerations. While our framework demonstrates strong detection performance across multiple decoding strategies and adversarial attacks, deploying it in open environments introduces additional challenges. Specifically, adversaries may engage in black-box evasion by selectively altering token distributions, or they may exploit transferability by applying paraphrasing or compression strategies learned from one model to bypass detection on another. To address such threats, future work should investigate model-agnostic detection methods, ensemble-based voting mechanisms, and online adaptation frameworks. Cross-model training and diversified watermarking patterns may further improve resilience in dynamic, heterogeneous deployments. Furthermore, introducing symmetry-aware embedding objectives or alignment-based detection strategies may offer graceful degradation paths under asymmetric adversarial perturbations.
Ethical, Multilingual, and Cryptographic Considerations. We recognize the broader ethical and practical implications of watermarking technologies. False positives or unconsented use could lead to surveillance or censorship risks if not properly governed. We advocate for transparent deployment protocols, public audit mechanisms, and open-source verification tools to ensure accountability. Although our current experiments focus on English-language models, the proposed framework is language-agnostic by design, and future work will explore multilingual detection pipelines and cross-lingual generalization. Finally, integrating cryptographic watermarking (e.g., public-key authentication, zero-knowledge proofs, chain-of-custody logging) offers promising directions for provable, secure, and policy-compliant attribution, especially in high-stakes scenarios such as misinformation tracking or AI authorship verification.
Author Contributions
J.W.: Methodology, Formal analysis, Visualization, Writing—review and editing, Project administration; X.C.: Conceptualization, Methodology, Investigation, Software, Data curation, Validation, Writing—original draft, Supervision; C.X.: Investigation, Data curation, Experiment design, Resource preparation; L.Z.: Investigation, Data annotation, Supportive experiments. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the College-level Characteristic Teaching Material Project, grant number 20220119Z0221; the College Teaching Incubation Project, grant number 20220120Z0220; the Ministry of Education Industry-University Cooperation Collaborative Education Project, grant number 20220163H0211; the Central Universities Basic Scientific Research Fund, grant numbers 3282024009, 20230051Z0114, 20230050Z0114; the Beijing Higher Education “Undergraduate Teaching Reform and Innovation Project”, grant numbers 20220121Z0208, 202110018002; and the College Discipline Construction Project, grant numbers 20230007Z0452, 20230010Z0452.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Acknowledgments
We acknowledge the administrative and technical support provided by our institution.
Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix A
Sample Prompts and Output Examples in Table 8
- (1)
Prompt: Bitcoin mining operations in Southeast Asia face scrutiny after massive power outages in border provinces…
- (i)
Real Text (C4 Dataset): “According to reports, Thai authorities raided an illegal Bitcoin mining site west of Bangkok on Friday. The raid followed complaints from local residents about frequent power outages over the past month…”
- (ii)
Generated (non-watermarked): “Records indicate that the raided house consumed an abnormally high amount of electricity. However, residents had not paid for most of the consumed power. During the raid, police and provincial electricity officers discovered…”
- (iii)
Watermarked: “The site is located southwest of northeastern Bangkok, about 10 km from the city center. Due to difficult living conditions and local government support…”
- (2)
Prompt: Federal Reserve signals rate cut amid concerns of slowing U.S. inflation and bond yield inversion…
- (i)
Real Text (C4 Dataset): “Fed Chairman Jerome Powell said it was time to cut interest rates. Hang Seng Bank advisor Leung Siu-kei mentioned on a radio program that a rate cut is very likely in September…”
- (ii)
Generated (non-watermarked): “Leung predicted two rate cuts this year, each by 0.25%, totaling 0.5% by year-end. He believed this would be good news for Hong Kong…”
- (iii)
Watermarked: “The rate may reach around 9%. Today’s market movement is not due to an anticipated hike on October 14th, but because of the December meeting…”
- (3)
Prompt: Global central banks coordinate monetary easing to cushion anticipated economic downturn…
- (i)
Real Text (C4 Dataset): “On Friday, officials from three major central banks signaled their intention to continue or initiate rate cuts in the coming months. This indicates that the era of high borrowing costs is coming to an end…”
- (ii)
Generated (non-watermarked): “ECB and BoJ both emphasized flexible tools in the face of slower global trade and commodity shocks…”
- (iii)
Watermarked: “Analysts predict these moves could anchor long-term inflation expectations amid currency fluctuations…”
Appendix B
Robustness Illustration Under Summarization Attack
To further assess the robustness of the watermarking system under aggressive post-processing, we conduct a qualitative case study involving semantic summarization, which simulates downstream editing scenarios such as article abstraction or social media rewriting.
We select a representative watermarked output from the case study in
Section 4.2.5 and apply a T5-based summarizer, reducing the output to ~40% of its original length.
Table A1 shows the z-score before and after compression.
Table A1.
z-score degradation under summarization.
| Text Type | z-Score |
|---|---|
| Original watermarked | 9.8 |
| Summarized (δ ≈ 0.6) | 4.1 |
Example:
- (1)
Original: Bitcoin mining operations in Southeast Asia are drawing scrutiny after repeated power shortages were reported near cross-border power grids. This surge in crypto-related energy consumption has raised policy concerns in several provinces, especially in Laos and northern Vietnam.
- (2)
Summarized: Crypto-mining in Southeast Asia causes repeated blackouts, prompting concern among regional authorities.
Despite substantial token reduction and syntactic simplification, the summarized text retains a z-score of 4.1, which is above the lowest threshold examined in Section 4.2.3 (Z = 4) but below the default threshold (Z = 8). The watermark signal, anchored in semantic representation, therefore remains only partially intact, consistent with the degradation under summarization reported in Section 4.2.4.
References
1. OpenAI. GPT-4 Technical Report; OpenAI: San Francisco, CA, USA, 2023.
2. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Models with Pathways. J. Mach. Learn. Res. 2023, 24, 1–113.
3. Mittal, U.; Sai, S.; Chamola, V. A comprehensive review on generative AI for education. IEEE Access 2024, 12, 142733–142759.
4. Chen, C.; Shu, K. Combating misinformation in the age of LLMs: Opportunities and challenges. AI Mag. 2024, 45, 354–368.
5. Wu, J.; Yang, S.; Zhan, R.; Yuan, Y.; Chao, L.S.; Wong, D.F. A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions. Comput. Linguist. 2025, 51, 275–338.
6. Park, P.S.; Goldstein, S.; O’Gara, A.; Chen, M.; Hendrycks, D. AI deception: A survey of examples, risks, and potential solutions. Patterns 2024, 5, 100988.
7. Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940.
8. Hacker, P.; Engel, A.; Mauer, M. Regulating ChatGPT and other large generative AI models. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, Chicago, IL, USA, 12–15 June 2023; pp. 1112–1123.
9. Liu, A.; Pan, L.; Lu, Y.; Li, J.; Hu, X.; Zhang, X.; Wen, L.; King, I.; Xiong, H.; Yu, P. A survey of text watermarking in the era of large language models. ACM Comput. Surv. 2024, 57, 1–36.
10. Liu, A.; Pan, L.; Hu, X.; Li, S.; Wen, L.; King, I.; Yu, P.S. An unforgeable publicly verifiable watermark for large language models. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
11. Huo, M.; Somayajula, S.A.; Liang, Y.; Zhang, R.; Koushanfar, F.; Xie, P. Token-specific watermarking with enhanced detectability and semantic coherence for large language models. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024.
12. Rodriguez, J.D.; Hay, T.; Gros, D.; Shamsi, Z.; Srinivasan, R. Cross-domain detection of GPT-2-generated technical text. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics, Seattle, WA, USA, 10–15 July 2022.
13. Christ, M.; Gunn, S.; Zamir, O. Undetectable watermarks for language models. In Proceedings of the Thirty Seventh Annual Conference on Learning Theory, Edmonton, AB, Canada, 30 June–3 July 2024; pp. 1125–1139.
14. Bhattacharjee, A.; Liu, H. Fighting fire with fire: Can ChatGPT detect AI-generated text? ACM SIGKDD Explor. Newsl. 2024, 25, 14–21.
15. Zhao, X.; Ananth, P.; Li, L.; Wang, Y.X. Provable robust watermarking for AI-generated text. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024.
16. Pan, L.; Liu, A.; Hu, X.; Meng, S.; Wen, L. Combating AI-generated fake content with robust watermarking. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
17. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Toronto, ON, Canada, 3–10 March 2021; pp. 610–623.
18. Nguyen, T.A.; Muller, B.; Yu, B.; Costa-jussa, M.R.; Elbayad, M.; Popuri, S.; Ropers, C.; Duquenne, P.-A.; Algayres, R.; Mavlyutov, R.; et al. Spirit-LM: Interleaved spoken and written language model. Trans. Assoc. Comput. Linguist. 2025, 13, 30–52.
19. Zhan, H.; He, X.; Xu, Q.; Wu, Y.; Stenetorp, P. G3Detector: General GPT-generated text detector. arXiv 2023, arXiv:2305.12680.
20. Gehrmann, S.; Strobelt, H.; Rush, A.M. GLTR: Statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Florence, Italy, 28 July–2 August 2019; pp. 111–116.
21. Liu, A.; Pan, L.; Hu, X.; Meng, S.; Wen, L. A semantic invariant robust watermark for large language models. arXiv 2023, arXiv:2310.06356.
22. Sadasivan, V.S.; Kumar, A.; Balasubramanian, S.; Wang, W.; Feizi, S. Can AI-generated text be reliably detected? arXiv 2023, arXiv:2303.11156.
23. He, X.; Xu, Q.; Lyu, L.; Wu, F.; Wang, C. Protecting IP of language generation APIs with lexical watermark. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 22 February–1 March 2022.
24. Yang, L.; Ma, X.; Fu, Y.; Xiong, D. Syntax-aware watermarking for text generation. arXiv 2023, arXiv:2306.07930.
25. Abdelnabi, S.; Fritz, M. Adversarial watermarking transformer. In Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 24–27 May 2021; pp. 121–140.
26. Zhang, R.; Hussain, S.S.; Koushanfar, F. REMARK-LLM: Robust and efficient watermarking for generative large language models. arXiv 2023, arXiv:2310.12362.
27. Kirchenbauer, J.; Geiping, J.; Wen, Y.; Katz, J.; Miers, I.; Goldstein, T. A watermark for large language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 17061–17084.
28. Chen, M.; Wu, X.; Li, L.; Wang, Y.; Tan, S.; Shi, S. Improving semantic coherence in watermarked texts with LLaMA2-based clustering. arXiv 2023, arXiv:2311.03118.
29. Lee, T.; Hong, S.; Ahn, J.; Hong, I.; Lee, H.; Yun, S.; Shin, J.; Kim, G. Who wrote this code? Watermarking for code generation. arXiv 2023, arXiv:2305.15060.
30. Mitchell, E.; Lee, Y.; Khazatsky, A.; Manning, C.D.; Finn, C. DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, USA, 23–29 July 2023; pp. 24950–24962.
31. Yang, X.; Zhang, K.; Chen, H.; Petzold, L.; Wang, W.Y.; Cheng, W. Zero-shot detection of machine-generated codes. arXiv 2023, arXiv:2310.05103.
32. Lin, K.; Luo, Y.; Zhang, Z.; Luo, P. Zero-shot generative linguistic steganography. arXiv 2024, arXiv:2403.10856.
Figure 1.
Overview of the CWS framework. The upper part shows the watermark embedding pipeline. A generative LLM processes the input prompt and context to produce raw token logits, which are mapped into a continuous space via a shared embedding layer. The semantic contrastive learning module then selects contextually appropriate watermark tokens from a candidate set. These tokens are embedded through logit adjustment (δ), producing watermarked logits, which are then decoded into watermarked text via Softmax. The lower part depicts the watermark detection pipeline. The watermarked text is re-embedded using a frozen shared embedding layer to ensure consistency. Two detection branches operate in parallel: (1) a statistical branch, which applies a z-score significance test to identify token-level deviations for public, model-free verification, and (2) a neural branch, which inputs semantic embeddings into a Bi-GRU network to extract deep contextual watermark signals. The results of both branches are integrated to determine watermark presence. The framework supports robust and efficient detection under unknown models and unknown key conditions.
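The logit-adjustment step of the embedding pipeline can be sketched as follows. This is a minimal illustration that assumes the selected watermark tokens are already available as vocabulary indices; the contrastive selection module is not modeled, and the function names, toy logits, and indices are hypothetical.

```python
import math
from typing import Iterable, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_watermark_bias(logits: List[float], watermark_ids: Iterable[int],
                         delta: float = 2.0) -> List[float]:
    """Return a new logit list with delta added at the watermark token indices."""
    chosen = set(watermark_ids)
    return [x + delta if i in chosen else x for i, x in enumerate(logits)]


# Toy vocabulary of five tokens; indices 1 and 3 play the role of watermark tokens.
raw_logits = [2.0, 1.0, 0.5, 0.0, -1.0]
watermark_ids = [1, 3]

p_raw = softmax(raw_logits)
p_wm = softmax(apply_watermark_bias(raw_logits, watermark_ids, delta=2.0))
for i, (before, after) in enumerate(zip(p_raw, p_wm)):
    print(f"token {i}: {before:.3f} -> {after:.3f}")
```

The bias raises the sampling probability of the watermark tokens and lowers that of all others, without excluding any token outright; a larger δ makes the shift, and hence the detectable signal, stronger.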
Figure 2.
Effect of watermark bias strength (δ) on watermark detection significance (z-score), language fluency (PPL), and semantic coherence (BLEU). Larger δ improves detectability but compromises text quality beyond δ = 3.0.
Figure 3.
Effect of watermark embedding ratio (γ) on F1 score and text diversity metrics. Higher γ values enhance detectability but reduce Distinct-1 and Distinct-2, indicating increased redundancy and pattern repetition.
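For reference, the diversity metrics in this figure can be computed as in the sketch below, which assumes the common definition of Distinct-n as the number of unique n-grams divided by the total number of n-grams. Whitespace tokenization and the example sentences are simplifications chosen for illustration, not the evaluation setup of the paper.

```python
from typing import Sequence


def distinct_n(tokens: Sequence[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams; 0.0 if the text is shorter than n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical examples: one varied sentence and one repetitive sentence.
varied = "the central bank signaled a rate cut next month".split()
repetitive = "the bank the bank the bank the bank cut".split()

for name, toks in (("varied", varied), ("repetitive", repetitive)):
    print(name, round(distinct_n(toks, 1), 3), round(distinct_n(toks, 2), 3))
```

Lower Distinct-1 and Distinct-2 values indicate that the same words and word pairs recur more often, which is the redundancy effect the caption attributes to higher γ.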
Figure 4.
Effect of detection threshold (Z) on detection accuracy, F1 score, false positive rate (FPR), and true negative rate (TNR). The optimal performance is observed when Z is set between 7 and 8.
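The trade-off controlled by the threshold can be made concrete with a small sketch. It assumes a text is flagged as watermarked when its z-score is at least Z and computes the metrics named in the caption from the resulting confusion counts; the z-score lists are hypothetical and the function name is illustrative.

```python
from typing import Dict, Sequence


def detection_metrics(wm_scores: Sequence[float], clean_scores: Sequence[float],
                      threshold: float) -> Dict[str, float]:
    """Accuracy, TPR, TNR, FPR, and F1 when flagging texts with z >= threshold."""
    if not wm_scores or not clean_scores:
        raise ValueError("both score lists must be non-empty")
    tp = sum(1 for z in wm_scores if z >= threshold)      # watermarked, flagged
    fn = len(wm_scores) - tp                              # watermarked, missed
    fp = sum(1 for z in clean_scores if z >= threshold)   # clean, flagged
    tn = len(clean_scores) - fp                           # clean, passed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "tpr": recall,
        "tnr": tn / (tn + fp),
        "fpr": fp / (tn + fp),
        "f1": f1,
    }


# Hypothetical z-scores for watermarked and non-watermarked texts.
wm_scores = [9.8, 8.9, 8.3, 4.1]
clean_scores = [0.2, 1.2, 2.2, 4.5]

low = detection_metrics(wm_scores, clean_scores, threshold=4.0)
high = detection_metrics(wm_scores, clean_scores, threshold=7.0)
print("Z = 4:", low)
print("Z = 7:", high)
```

In this toy data, the lower threshold catches every watermarked text but admits a false positive, while the higher threshold removes the false positive at the cost of missing the weakest watermarked sample.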
Figure 5.
Robustness boundary analysis under varying perturbation intensities. This figure illustrates the F1 score degradation trends of the proposed watermarking system under four types of adversarial attacks: semantic reconstruction (ε), vocabulary substitution (ε), text compression (δ), and summarization compression (δ = 0.6). Performance zones are color-coded as high (F1 ≥ 95%), acceptable (90% ≤ F1 < 95%), moderate risk (80% ≤ F1 < 90%), and low (F1 < 80%). Critical points and statistical significance levels are annotated (** p < 0.01; *** p < 0.001). The system maintains F1 ≥ 90% under moderate perturbations, with summarization attack being the only scenario to cross the low-performance threshold.
Table 1.
Watermark detection performance under different generation strategies.
Model | Generation Strategy | Method | Accuracy (%) | TPR (%) | TNR (%) | F1 (%) |
---|---|---|---|---|---|---|
GPT-2 | Beam-search (beam = 4) | KGW | 99.1 | 99.7 | 99.4 | 99.8 |
GPT-2 | Beam-search (beam = 4) | CWS | 99.4 | 99.0 | 99.8 | 99.9 |
GPT-2 | Top-k (k = 20) | KGW | 99.0 | 99.2 | 98.8 | 99.0 |
GPT-2 | Top-k (k = 20) | CWS | 99.1 | 99.4 | 98.6 | 99.1 |
OPT-1.3B | Beam-search (beam = 2) | KGW | 99.9 | 100 | 98.8 | 99.9 |
OPT-1.3B | Beam-search (beam = 2) | CWS | 99.4 | 100 | 99.8 | 99.4 |
OPT-1.3B | Top-k (k = 20) | KGW | 99.8 | 100 | 99.6 | 99.9 |
OPT-1.3B | Top-k (k = 20) | CWS | 99.7 | 99.9 | 99.8 | 99.8 |
LLaMA-7B | Beam-search (beam = 2) | KGW | 99.9 | 99.9 | 99.8 | 99.9 |
LLaMA-7B | Beam-search (beam = 2) | CWS | 99.8 | 100 | 99.6 | 99.8 |
LLaMA-7B | Top-k (k = 50) | KGW | 99.7 | 99.5 | 100 | 99.7 |
LLaMA-7B | Top-k (k = 50) | CWS | 99.0 | 99.2 | 99.8 | 99.8 |
Table 2.
Extended comparison with prior methods (GPT-2, δ = 2.0, γ = 0.5, Beam = 4, T = 200).
Method | Model | Accuracy (%) | TPR (%) | TNR (%) | F1 (%) |
---|---|---|---|---|---|
Unigram-Watermark [15] | GPT-2 | 96.9 | 95.8 | 98.0 | 96.7 |
UPV [10] | GPT-2 | 98.4 | 97.1 | 99.0 | 98.2 |
KGW [27] | GPT-2 | 99.1 | 99.7 | 98.4 | 99.8 |
CWS (ours) | GPT-2 | 99.4 | 99.0 | 98.8 | 99.9 |
Table 3.
Modular ablation study results.
Method | SCL | SE | MHA | Accuracy (%) | TPR (%) | TNR (%) | F1 (%) | Time (s) |
---|---|---|---|---|---|---|---|---|
Baseline (Bi-GRU) | √ | √ | √ | 99.8 | 100 | 99.6 | 99.8 | 0.42 |
w/o SCL | | √ | √ | 15.3 | 7.5 | 20 | 11.2 | 0.40 |
w/o SE | √ | | √ | 76.7 | 100 | 53.4 | 84.2 | 0.41 |
w/o MHA | √ | √ | | 82.3 | 100 | 64.6 | 85.1 | 0.33 |
Replace GRU with LSTM | √ | √ | √ | 97.6 | 98.0 | 97.2 | 97.6 | 0.65 |
Replace GRU with Transformer | √ | √ | √ | 98.5 | 98.8 | 98.2 | 98.5 | 0.89 |
Table 4.
Performance under semantic reconstruction attack (ε = 0.2).
Model | ΔF1 (%) | BLEU-4 | ROUGE-L |
---|---|---|---|
GPT-2 | 99.9 → 93.7 (−6.2) | 78.5 | 0.79 |
OPT-1.3B | 99.4 → 91.8 (−7.6) | 77.9 | 0.81 |
LLaMA-7B | 99.8 → 94.1 (−5.7) | 79.6 | 0.84 |
Table 5.
Performance under vocabulary substitution attack (ε = 0.2).
Model | Substitution Rate (%) | Δz-Score | ΔF1 (%) | Semantic Similarity |
---|---|---|---|---|
GPT-2 | 85.6 ± 2.1 | 8.9 → 5.1 | 99.9 → 91.3 (−8.6) | 0.81 ± 0.03 |
OPT-1.3B | 87.3 ± 1.8 | 8.3 → 4.9 | 99.4 → 90.9 (−8.5) | 0.83 ± 0.02 |
LLaMA-7B | 89.2 ± 1.5 | 8.8 → 6.2 | 99.8 → 93.7 (−6.1) | 0.84 ± 0.04 |
Table 6.
Performance under text compression attack (δ = 0.2).
Model | IER | KRR (%) | Δz-Score | ΔF1 (%) |
---|---|---|---|---|
GPT-2 | 0.89 | 85.4 | 8.9 → 8.5 | 99.9 → 98.5 (−1.4) |
OPT-1.3B | 0.91 | 87.2 | 8.3 → 7.9 | 99.4 → 97.8 (−1.6) |
LLaMA-7B | 0.93 | 89.7 | 8.8 → 8.4 | 99.8 → 98.5 (−1.3) |
Table 7.
Performance under summarization compression attack (δ = 0.6).
Prompt Topic | Model | Δz-Score | ΔF1 (%) | ΔTPR (%) |
---|---|---|---|---|
Bitcoin mining | GPT-2 | 8.9 → 4.2 | 99.9 → 75.4 (−24.5) | 99.0 → 70.1 (−28.9) |
Fed rate cut | OPT-1.3B | 8.3 → 3.9 | 99.4 → 73.7 (−25.7) | 100 → 68.6 (−31.4) |
Global easing | LLaMA-7B | 8.8 → 4.4 | 99.8 → 76.1 (−23.7) | 100 → 71.5 (−28.5) |
Table 8.
Metric comparison between real, non-watermarked, and watermarked texts under different watermarking conditions.
Prompt | Text Type | z-Score | PPL | BLEU |
---|---|---|---|---|
Bitcoin mining operations in Southeast Asia face scrutiny after massive power outages in border provinces… | Non-WM | 1.15 | 7.3 | 78.3 |
Bitcoin mining operations in Southeast Asia face scrutiny after massive power outages in border provinces… | Watermarked | 10.2 | 8.1 | 76.4 |
Federal Reserve signals rate cut amid concerns of slowing U.S. inflation and bond yield inversion… | Non-WM | 0.2 | 7.5 | 80.2 |
Federal Reserve signals rate cut amid concerns of slowing U.S. inflation and bond yield inversion… | Watermarked | 8.9 | 7.6 | 74.9 |
Global central banks coordinate monetary easing to cushion anticipated economic downturn… | Non-WM | 2.2 | 7.5 | 82.1 |
Global central banks coordinate monetary easing to cushion anticipated economic downturn… | Watermarked | 8.3 | 6.8 | 78.7 |
- | Real | - | 4.7 | - |
- | Average (non-WM) | 1.18 | 7.4 | 80.2 |
- | Average (watermarked) | 9.13 | 7.5 | 76.7 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).