PatchSeal: A Robust and Intangible Image Watermarking Framework for AIGC

You, Ting; Zheng, Haixia; Wang, Zhaohan; Chen, Yi

doi:10.3390/math14040679

¹ School of Electrical and Information Engineering, Quzhou University, Quzhou 324000, China

² School of Engineering, Newcastle University, Newcastle upon Tyne NE1 7RU, UK

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(4), 679; https://doi.org/10.3390/math14040679

Submission received: 14 January 2026 / Revised: 6 February 2026 / Accepted: 12 February 2026 / Published: 14 February 2026

Abstract

The rapid growth of artificial intelligence-generated content (AIGC) has created serious challenges for image copyright protection, since semantic edits and deepfake manipulations can easily erase or distort embedded watermarks. Traditional robust watermarking methods, which are mainly designed to resist pixel-level distortions such as noise, compression, or filtering, often fail when faced with content-level transformations generated by AIGC models. This paper presents PatchSeal, a robust and imperceptible image watermarking framework that combines multi-target dispersed embedding with attention-guided masking. The proposed framework introduces a segmentation-assisted embedding strategy that distributes watermark bits across several prominent regions to improve resilience to semantic changes. An attention-based module, composed of a subject extraction branch and a channel weighting branch, steers the encoder toward texture-rich and semantically stable regions, improving both invisibility and robustness. Experiments conducted on three public datasets show that PatchSeal achieves an average PSNR of 43.13 dB and a bit accuracy of 92.98% under various AIGC editing conditions, surpassing representative methods such as MBRS and FIN. These results demonstrate the effectiveness of the proposed method in resisting AIGC-driven manipulations and offer practical and methodological guidance for the design of robust watermarks in the AIGC era.

Keywords: artificial intelligence-generated content; robust image watermarking; semantic editing; attention-oriented masking; dispersed embedding

MSC: 68T01; 68T05

1. Introduction

The exponential growth of online image creation and sharing has increased the need for reliable copyright protection and traceability of origin. Invisible digital watermarking remains a practical and widely adopted solution because it preserves perceptual quality while enabling post-hoc verification under common image processing operations. However, the rapid emergence of AI-generated content (AIGC), especially diffusion-based image generation and instruction-based editing, has brought new challenges. These models perform semantic transformations that fundamentally change content structures, textures, and spatial layouts, thus invalidating many of the robustness assumptions underlying classical watermarking systems.

Recent learning-based watermarking research has significantly advanced robustness against pixel-level distortions through end-to-end encoder–noise–decoder optimization and improved distortion modeling. These approaches have evolved from training with real and simulated JPEG compression [1], to incorporating screen-shooting simulation layers that approximate perspective and illumination variations [2], and to adopting decoder-driven architectures that tightly couple embedding and extraction [3]. Meanwhile, the integration of watermarking with generative models has opened new possibilities, including watermarking within diffusion pipelines [4,5], latent-space watermarking [6], and tamper localization via embedded marks [7]. Formal robustness has also been strengthened through certified guarantees against removal and forgery attacks [8]. Despite these advances, most existing methods remain optimized for pixel-level or mildly geometric distortions and still degrade under semantic edits. Moreover, techniques that rely on white-box access to generators or editors are often impractical in real-world deployment scenarios, where editing platforms are typically black-box systems.

This paper targets robustness under AIGC-driven image editing as the primary threat model. We present a practical and imperceptible watermarking framework, PatchSeal, designed to withstand semantic transformations while maintaining high visual quality. The proposed framework distributes redundant watermark bits across multiple semantically meaningful regions using a multi-target dispersed embedding strategy and dynamically adjusts embedding strength through attention-guided masking. This design ensures that watermark information remains recoverable even when portions of an image are locally or globally modified. The framework is fully compatible with standard end-to-end training and does not require access to the internal parameters of proprietary editors.

The main contributions of this work are summarized as follows. First, a multi-target dispersed embedding strategy is introduced to allocate watermark redundancy across content-aware regions, reducing the risk of information loss under semantic edits. Second, an attention-guided masking module is developed to emphasize salient structures and texture-rich areas, improving both invisibility and robustness during recovery. Third, a comprehensive evaluation under instruction-driven editing and composite distortions demonstrates higher bit accuracy and comparable or better perceptual quality than representative baselines such as MBRS [1], PIMoG [2], and DeEND [3]. Additional studies also discuss complementarity with diffusion-integrated watermarking [5,6] and certified defenses [8], and include ablation experiments that clarify how semantic dispersion and attention mechanisms jointly enhance robustness.

The remainder of this paper is organized as follows. Section 2 reviews related work on learning-based watermarking and AIGC image editing. Section 3 presents the proposed framework, including the overall architecture, object-level embedding strategy, and attention-guided masking module. Section 4 reports experimental results in terms of perceptual quality, robustness, and ablation analysis, followed by discussions on limitations and future directions. Finally, Section 5 concludes the paper.

2. Related Works

Deep image watermarking has progressed from codec-specific heuristics to end-to-end learning frameworks that explicitly incorporate distortion modeling. The encoder–noise–decoder paradigm has become the mainstream design, enhancing robustness against non-differentiable distortions such as JPEG compression through a combination of real and simulated corruptions [1]. Further improvements have been achieved by disentangling forward and backward propagation [9] and by embedding or recovering watermarks in latent representation spaces such as self-supervised features [10] and implicit neural representations [11]. These studies suggest that curriculum design and distortion modeling, rather than handcrafted features, play a decisive role in robustness. However, existing pipelines mainly learn invariance to parameterized pixel-level perturbations and still perform poorly under semantic edits or large distribution shifts.

In real-world scenarios, geometric transformations and black-box channels remain major challenges. Screen-to-camera pipelines can be modeled through trainable layers that capture perspective, illumination, and moiré patterns, improving recovery after physical capture [2]. Recent research has introduced synchronization templates and relocking mechanisms to ensure consistency in video and camera-recorded content [12,13,14]. Geometry-aware models further improve immunity by combining wide-baseline alignment with local non-affine correction through deformable operators and attention mechanisms [15,16]. These approaches enhance practical robustness but depend on heavy simulation and complex training curricula, which increase computational cost and instability. Moreover, they do not effectively address semantic re-synthesis, such as object replacement or layout rewriting, which has become a common form of manipulation in AI-generated content (AIGC).

The rapid emergence of diffusion-based image generation and editing has shifted research from pixel-level perturbations to semantic integrity and proactive forensics [17]. EditGuard [7] unifies tamper localization and copyright tracing within a single diffusion pipeline. Latent-space watermarking [6] and diffusion-integrated watermarking [5] improve the balance between fidelity and robustness, while certified watermarking [8] provides formal guarantees against removal or forgery. Diffusion-aware strategies have also been extended to neural radiance fields and Gaussian-based representations [18,19,20,21], achieving better generalization across modalities. Despite these advances, many frameworks still rely on white-box access to the generator or restrictive norm-bounded assumptions, which limit their generalization to open-ended instruction-driven editing. Existing methods also offer limited coverage across diverse editors and editing intents, leaving a gap between generative robustness and practical deployment.

In parallel, deepfake protection research has evolved from post-hoc detection to proactive and traceable watermark-based defense [22]. Identity-aware watermarking and separable decoding enable reliable provenance tracing and semi-fragile localization [23]. Decoder-driven training, multi-source tracing, and dual-defense strategies further expand robustness against adaptive removal attacks [3,24,25]. Video-oriented pipelines report improved synchronization and strong resistance to screen capture [12,13]. Despite such progress, most methods still focus on pixel or geometric distortions and lack resilience to semantic transformations introduced by AIGC. Motivated by these limitations, this work introduces a generator-free watermarking framework that improves robustness under semantic transformations while preserving high perceptual quality. By embedding redundant payloads into semantically stable and texture-rich regions with lightweight synchronization, the method achieves a balanced trade-off among robustness, localizability, and deployability for practical watermarking in AIGC-driven environments.

3. Proposed Approach

This section presents PatchSeal, a robust image watermarking framework that maintains perceptual fidelity and remains reliable under semantic transformations and AI-generated content (AIGC) edits. The framework integrates three coordinated components: (i) object-level redundant embedding that disperses the payload across multiple salient regions, (ii) an attention mask generation module that modulates embedding strength in a content-aware manner, and (iii) localization with geometric rectification prior to decoding. The architecture follows an encoder, noise, decoder, and critic design together with a differentiable distortion curriculum that supports cross-editor generalization and stable training. The design provides concrete mechanisms and measurable guarantees, including an exponential reduction in failure probability as the number of object-level embeddings increases and an explicit perceptual budget controlled by the attention mask, with formal analysis provided later.

3.1. System Overview and Design Principles

Let $I_{co} \in \mathbb{R}^{3 \times H \times W}$ denote a cover image and $M \in \{0, 1\}^{L}$ denote the watermark bits. The encoder E produces an embedded image $I_{en} = E(I_{co}, M)$. A composite channel $C$ acts on $I_{en}$ and yields an attacked image $I_{no} = C(I_{en})$. The channel comprises three categories of disturbances: non-geometric perturbations such as JPEG compression, Gaussian blur, additive noise, pixel drop, and color jitter; geometric transformations such as rotation, cropping, and scaling; and AIGC edits that may partially regenerate or replace regions. The decoder D reconstructs an estimate $\hat{M} = D(I_{no})$.

PatchSeal is low-overhead and editor-agnostic. It leverages semantic structural invariance rather than editor-specific gradients. In the embedding phase, the Segment Anything Model (SAM) generates multiple object masks. Their centroids are expanded to form embedding masks that guide E to distribute watermark bits across several regions, which improves resistance to localized AIGC edits. During retrieval, a lightweight U²-Net detector predicts watermark regions and compensates geometric distortions prior to decoding.

Figure 1 summarizes the pipeline. The system includes an encoder E, a decoder D, a noise layer $T(\cdot)$ used during training, and a critic C. The attention mask $A_{M}$ regulates embedding to meet a perceptual budget, while $T(\cdot)$ samples randomized perturbations to improve robustness against diverse degradations. In contrast to methods that backpropagate through specific editors [5,9], PatchSeal relies on the structural stability of salient regions together with the distortion curriculum to limit editor-specific overfitting and reduce training cost, while maintaining cross-editor generalization [1,2,4].

3.2. Object-Level Embedding and Redundancy Design

Conventional schemes often embed a single global watermark, which is fragile when large semantic areas are regenerated or replaced [1]. PatchSeal adopts an object-level redundant embedding strategy that distributes an identical payload across K object-centric regions, which increases the survival probability under semantic edits [4].

As shown in Algorithm 1 [4], given a cover image $I_{co}$, a segmentation backbone such as the Segment Anything Model (SAM) produces disjoint binary masks $\{x_{o}^{(k)}\}_{k=1}^{K}$. For each mask $x_{o}$, the centroid $(x_{c}, y_{c})$ is computed from image moments:

m_{00} = \sum_{(x, y)} x_{o}(x, y), \quad m_{10} = \sum_{(x, y)} x \, x_{o}(x, y), \quad m_{01} = \sum_{(x, y)} y \, x_{o}(x, y)

(1)

(x_{c}, y_{c}) = \left( \frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}} \right)

(2)

A fixed-size patch centered at $(x_{c}, y_{c})$ is cropped and used for embedding the message M.

Algorithm 1 Object-Level Dispersed Embedding [4] (PatchSeal)

Require: Cover image $I_{co}$, message M, SAM, encoder $E_{θ_{E}}$
Ensure: Watermarked image $I_{en}$
$\{x_{o}^{(k)}\}_{k=1}^{K} \leftarrow SAM(I_{co})$
for $k = 1$ to K do
    Compute $(x_{c}, y_{c})$ from $x_{o}^{(k)}$ via image moments
    Crop a fixed-size cover patch $x_{co}^{(k)}$ centered at $(x_{c}, y_{c})$
    $x_{en}^{(k)} \leftarrow E_{θ_{E}}(x_{co}^{(k)}, M)$
    Replace the corresponding region in $I_{co}$ with $x_{en}^{(k)}$
end for
return $I_{en}$
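The loop in Algorithm 1 can be illustrated with a minimal sketch, under stated assumptions: images and masks are plain nested lists with one channel, the object masks are supplied directly instead of coming from SAM, and `encode_patch` is a hypothetical stand-in for the trained encoder. Clamping the patch to the image border is also an assumption, since the text does not say how patches near the boundary are handled.

```python
from typing import Callable, List, Sequence, Tuple

Mask = List[List[int]]      # binary object mask, indexed mask[y][x]
Image = List[List[float]]   # single-channel image, indexed img[y][x]


def centroid(mask: Mask) -> Tuple[float, float]:
    """Centroid (x_c, y_c) from the moments m00, m10, m01 of Eqs. (1)-(2)."""
    m00 = m10 = m01 = 0
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            m00 += v
            m10 += x * v
            m01 += y * v
    if m00 == 0:
        raise ValueError("empty mask has no centroid")
    return m10 / m00, m01 / m00


def patch_origin(xc: float, yc: float, size: int, width: int, height: int) -> Tuple[int, int]:
    """Top-left corner of a size x size patch centred at the centroid.

    Clamping to the image border is an assumption of this sketch.
    """
    x0 = min(max(int(xc + 0.5) - size // 2, 0), width - size)
    y0 = min(max(int(yc + 0.5) - size // 2, 0), height - size)
    return x0, y0


def embed_dispersed(cover: Image, masks: Sequence[Mask], message: Sequence[int],
                    encode_patch: Callable[[Image, Sequence[int]], Image],
                    size: int) -> Image:
    """Embed the same message into one patch per object mask."""
    height, width = len(cover), len(cover[0])
    out = [row[:] for row in cover]
    for mask in masks:
        xc, yc = centroid(mask)
        x0, y0 = patch_origin(xc, yc, size, width, height)
        patch = [row[x0:x0 + size] for row in cover[y0:y0 + size]]
        marked = encode_patch(patch, message)   # hypothetical encoder call
        for dy in range(size):
            out[y0 + dy][x0:x0 + size] = marked[dy]
    return out


# Toy usage: an 8 x 8 blank cover, one 2 x 2 object, and a placeholder
# "encoder" that simply adds 1 to every pixel of the patch.
cover = [[0.0] * 8 for _ in range(8)]
mask = [[1 if 2 <= y <= 3 and 4 <= x <= 5 else 0 for x in range(8)] for y in range(8)]
toy_encoder = lambda patch, msg: [[v + 1.0 for v in row] for row in patch]
marked = embed_dispersed(cover, [mask], [1, 0, 1], toy_encoder, size=2)
```

In the toy run only the 2 x 2 patch around the object centroid is modified, which mirrors the limited spatial footprint that the dispersed strategy aims for.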

Under independent block failures with probability $p_{e}^{⋆}$, the probability that all K embedded instances fail satisfies

P_{fail} \leq (p_{e}^{⋆})^{K},

(3)

which yields an exponential improvement as K increases. Placing patches near object interiors increases the chance that at least one instance remains intact when backgrounds are replaced or partially regenerated. Majority voting across recovered instances is performed at detection time (see Section 3.5), which further reduces the effective failure probability.
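As a quick numerical illustration of Eq. (3), the sketch below evaluates the bound for a few values of K. The per-region failure rate of 0.3 is an arbitrary value chosen for illustration and is not a figure reported in the paper.

```python
def all_fail_probability(p_e: float, k: int) -> float:
    """Bound of Eq. (3): probability that all k independent instances fail."""
    if not 0.0 <= p_e <= 1.0:
        raise ValueError("p_e must be a probability")
    if k < 1:
        raise ValueError("k must be at least 1")
    return p_e ** k


# Illustrative per-region failure rate (an assumption, not a measured value).
for k in range(1, 5):
    print(k, all_fail_probability(0.3, k))
```

The bound shrinks geometrically with K, but only under the independence assumption; edits that remove several regions at once would correlate the failures and weaken it.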

3.3. Attention Mask Generation Module

Embedding strength should adapt to local semantics and textural complexity. Different regions exhibit uneven tradeoffs between imperceptibility and robustness: signals injected into smooth backgrounds are more visible and are easily removed by background replacement, whereas signals placed inside semantically stable object interiors persist longer. To address this heterogeneity, an Attention Mask Generation Module (AM) adaptively regulates both the location and the strength of embedding. As shown in Figure 2, AM comprises two cooperative subnetworks: a Body Extraction Network (BEN) that estimates spatial subjectness and a Channel-Wise Weighting Network (CWN) that modulates feature channels. Their outputs are fused into a unified attention tensor that guides content-aware embedding.

3.3.1. Spatial Subjectness Extraction (BEN)

BEN assigns higher importance to object interiors and lower importance to backgrounds to concentrate watermark energy where semantic stability is high and visual sensitivity is low. The network adopts a ResNet-50 encoder-decoder without fully connected layers. The input cover patch is encoded into multi-scale features $\{F_{i}\}_{i=1}^{5}$, $F_{i} \in \mathbb{R}^{h_{i} \times w_{i} \times c_{i}}$. For efficiency, higher-level features $\{F_{3}, F_{4}, F_{5}\}$ are retained and compressed to 64 channels by two $3 \times 3$ convolutions to form the decoder input. The decoder performs three upsampling and fusion stages to produce a soft spatial map $BodyMask \in [0, 1]^{h \times w}$ with larger values in object interiors.

Supervision uses a distance transform derived from a binary object mask $M_{obj}$. For each pixel p inside the object,

D(p) = \min_{q \in Ω_{bg}} ∥p - q∥_{2},

(4)

where $Ω_{bg}$ denotes the set of background pixels of $M_{obj}$. The resulting map is then normalized to $[0, 1]$. BEN is trained with

L_{BEN} = BCE(BodyMask, BodyMask_{gt}),

(5)

which establishes a reliable spatial prior for embedding.
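A minimal sketch of the distance-transform target in Eq. (4) follows. It uses a brute-force search that is only practical for small masks, and it normalizes by the maximum distance; the text states only that the map is normalized to [0, 1], so the choice of max-normalization is an assumption.

```python
import math
from typing import List


def body_mask_target(obj_mask: List[List[int]]) -> List[List[float]]:
    """Normalized distance-to-background map for a binary object mask."""
    h, w = len(obj_mask), len(obj_mask[0])
    background = [(x, y) for y in range(h) for x in range(w) if obj_mask[y][x] == 0]
    if not background:
        raise ValueError("mask has no background pixels")
    dist = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if obj_mask[y][x]:
                # Eq. (4): Euclidean distance to the nearest background pixel.
                dist[y][x] = min(math.hypot(x - bx, y - by) for bx, by in background)
    peak = max(max(row) for row in dist)
    if peak > 0:
        dist = [[d / peak for d in row] for row in dist]   # assumed max-normalization
    return dist


# Toy usage: a 3 x 3 object centred in a 5 x 5 mask.
obj = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)] for y in range(5)]
target = body_mask_target(obj)
```

The target peaks at the object centre and falls to zero at the background, which is the "larger values in object interiors" behaviour that BodyMask is trained to reproduce.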

3.3.2. Channel-Wise Weighting (CWN)

CWN refines embedding strength along the channel dimension. A $3 \times 3$ Conv-BN-ReLU stem maps the cover patch to a 64-channel representation that is processed by four squeeze-and-excitation blocks. These blocks capture inter-channel dependencies and generate adaptive gates according to perceptual tolerance and texture richness. High-frequency or complex responses receive larger weights, while responses associated with smooth or salient areas are suppressed. After an additional Conv-BN-ReLU and a sigmoid function $σ(\cdot)$, CWN outputs a gating tensor

WeightMask = σ(f_{CWN}(x_{co})) \in [0, 1]^{h \times w \times c} .

(6)

3.3.3. Mask Fusion and Feature Modulation

Spatial and channel attentions are fused element-wise:

A = WeightMask ⊙ expand(BodyMask),

(7)

where $⊙$ denotes the Hadamard product and $expand(\cdot)$ broadcasts $BodyMask$ to the channel dimension. Let $Φ(\cdot)$ be a learnable message carrier aligned with the cover patch. With an embedding budget $λ > 0$, the attention-modulated embedding is

x_{en} = x_{co} + λ A ⊙ Φ(x_{co}, M) .

(8)

This concentrates energy in semantically stable and perceptually tolerant regions and provides an explicit upper bound on distortion,

∥x_{en} - x_{co}∥_{2} = λ ∥A ⊙ Φ(\cdot)∥_{2} \leq λ ∥Φ(\cdot)∥_{2},

(9)

which links the attention mechanism to a controllable perceptual budget.
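The fusion in Eq. (7), the embedding in Eq. (8), and the bound in Eq. (9) can be checked numerically on a tiny example. The sketch below uses nested lists indexed `[y][x][channel]`; all tensor values and the budget `lam` are made-up illustration values, and `carrier` stands in for the learned message carrier Φ.

```python
import math
from typing import List

Tensor3 = List[List[List[float]]]   # indexed [y][x][channel]


def fuse_masks(weight_mask: Tensor3, body_mask: List[List[float]]) -> Tensor3:
    """Eq. (7): broadcast BodyMask over channels and multiply element-wise."""
    return [[[w * body_mask[y][x] for w in pixel] for x, pixel in enumerate(row)]
            for y, row in enumerate(weight_mask)]


def embed(x_co: Tensor3, attention: Tensor3, carrier: Tensor3, lam: float) -> Tensor3:
    """Eq. (8): x_en = x_co + lam * A * Phi, element-wise."""
    return [[[p + lam * a * c for p, a, c in zip(px, ax, cx)]
             for px, ax, cx in zip(prow, arow, crow)]
            for prow, arow, crow in zip(x_co, attention, carrier)]


def subtract(a: Tensor3, b: Tensor3) -> Tensor3:
    return [[[u - v for u, v in zip(pa, pb)] for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def l2_norm(t: Tensor3) -> float:
    return math.sqrt(sum(v * v for row in t for pixel in row for v in pixel))


# Toy 1 x 2 patch with two channels (illustrative values only).
weight_mask = [[[0.5, 1.0], [0.2, 0.8]]]
body_mask = [[1.0, 0.5]]
carrier = [[[1.0, -1.0], [2.0, -2.0]]]
x_co = [[[0.0, 0.0], [0.0, 0.0]]]
lam = 0.1

attention = fuse_masks(weight_mask, body_mask)
x_en = embed(x_co, attention, carrier, lam)
residual = l2_norm(subtract(x_en, x_co))
bound = lam * l2_norm(carrier)
print(residual, "<=", bound)
```

Because every entry of A lies in [0, 1], the residual can never exceed `lam` times the carrier norm, which is the sense in which λ acts as a perceptual budget.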

3.4. Watermarking Architecture and Distortion Modeling

PatchSeal adopts an encoder-decoder-critic design with a stochastic distortion layer used during training, following recent robust watermarking systems [1,2]. The encoder $E_{θ_{E}}$ maps a cover image $x_{co}$ and a message M to an embedded image $x_{en} = E_{θ_{E}}(x_{co}, M)$. A critic $C_{θ_{C}}$ encourages natural appearance of $x_{en}$. The decoder $D_{θ_{D}}$ reconstructs the message from a distorted input $x_{no}$.

3.4.1. Fidelity and Adversarial Objectives

The encoder is trained to preserve pixel fidelity and structural similarity:

L_{E1} = MSE(x_{co}, E_{θ_{E}}(x_{co}, M)) + α (1 - SSIM(x_{co}, E_{θ_{E}}(x_{co}, M))),

(10)

with $α > 0$. Adversarial learning uses a binary cross-entropy form. The critic minimizes

L_{C} = - \mathbb{E}_{x_{co}} [\log C_{θ_{C}}(x_{co})] - \mathbb{E}_{x_{co}, M} [\log (1 - C_{θ_{C}}(E_{θ_{E}}(x_{co}, M)))],

(11)

while the encoder maximizes the critic response on embedded images through

L_{E2} = - \mathbb{E}_{x_{co}, M} [\log C_{θ_{C}}(E_{θ_{E}}(x_{co}, M))] .

(12)

The decoder is trained with a reconstruction objective

L_{D} = MSE(M, D_{θ_{D}}(x_{no})) .

(13)
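The adversarial and reconstruction terms in Eqs. (11) to (13) reduce to a few lines once the critic outputs are treated as probabilities. The sketch below replaces the expectations with batch means and clamps probabilities away from 0 and 1 for numerical safety; the clamp value `eps` and the function names are assumptions of this illustration.

```python
import math
from typing import Sequence


def _safe_log(p: float, eps: float = 1e-12) -> float:
    """Log of a probability clamped to [eps, 1 - eps] (assumed safeguard)."""
    return math.log(min(max(p, eps), 1.0 - eps))


def critic_loss(c_cover: Sequence[float], c_embedded: Sequence[float]) -> float:
    """Eq. (11): the critic should score covers near 1 and embedded images near 0."""
    real = -sum(_safe_log(p) for p in c_cover) / len(c_cover)
    fake = -sum(_safe_log(1.0 - p) for p in c_embedded) / len(c_embedded)
    return real + fake


def encoder_adv_loss(c_embedded: Sequence[float]) -> float:
    """Eq. (12): the encoder wants embedded images scored near 1."""
    return -sum(_safe_log(p) for p in c_embedded) / len(c_embedded)


def decoder_loss(message: Sequence[float], decoded: Sequence[float]) -> float:
    """Eq. (13): mean squared error between true and decoded bits."""
    return sum((m - d) ** 2 for m, d in zip(message, decoded)) / len(message)


# An undecided critic (all outputs 0.5) gives L_C = 2 ln 2.
print(critic_loss([0.5, 0.5], [0.5, 0.5]))
print(decoder_loss([1, 0], [0.5, 0.5]))
```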

3.4.2. Differentiable Distortion Curriculum

A differentiable noise layer $T$ is inserted between $E_{θ_{E}}$ and $D_{θ_{D}}$ during training to expose the encoder-decoder pair to diverse degradations without relying on editor-specific gradients. Given $x_{en}$, the distorted input is

x_{no} = T(x_{en}) = T(E_{θ_{E}}(x_{co}, M)) .

(14)

At each iteration, $T$ samples a randomized composition of non-geometric distortions (JPEG compression, Gaussian blur, additive noise, color jitter, pixel drop) and geometric distortions (rotation, cropping, resizing, padding, picture-in-picture). Distortion intensities are drawn from predefined ranges to form a lightweight curriculum that improves generalization without explicit backpropagation through any editor pipeline [9]. This design encourages invariance to a broad spectrum of real-world degradations while maintaining stable optimization.
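The sampling step of the noise layer can be sketched as follows. Only the list of distortion types comes from the text; the intensity ranges, the cap of three operations per chain, and the function name `sample_chain` are hypothetical choices for illustration, and the sketch stops at choosing operations without applying them to an image.

```python
import random
from typing import List, Tuple

# Distortion names follow Section 3.4.2; every range below is an assumed
# placeholder, not a setting reported in the paper.
RANGES = {
    "jpeg": (50.0, 95.0),              # quality factor
    "gaussian_blur": (0.5, 2.0),       # sigma in pixels
    "gaussian_noise": (0.0, 0.05),     # standard deviation
    "color_jitter": (0.0, 0.2),        # jitter strength
    "pixel_drop": (0.0, 0.3),          # fraction of dropped pixels
    "rotation": (-30.0, 30.0),         # degrees
    "crop": (0.6, 1.0),                # kept area ratio
    "resize": (0.5, 1.5),              # scale factor
    "padding": (0.0, 0.2),             # border ratio
    "picture_in_picture": (0.5, 0.9),  # inner scale
}


def sample_chain(rng: random.Random, max_ops: int = 3) -> List[Tuple[str, float]]:
    """Draw a random ordered chain of distinct distortions with intensities."""
    n_ops = rng.randint(1, max_ops)
    names = rng.sample(sorted(RANGES), n_ops)
    return [(name, rng.uniform(*RANGES[name])) for name in names]


rng = random.Random(0)
for _ in range(3):
    print(sample_chain(rng))
```

Sampling a fresh chain at every iteration is what exposes the encoder-decoder pair to compositions such as compression followed by rotation, without tying training to any particular editor.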

3.5. Watermark Localization and Optimization Strategy

Dispersed embedding creates uncertainty in both the location and the number of watermark instances. A detector $D_{t}$ based on the U²-Net architecture is employed to predict a binary mask $M_{p}$ that indicates candidate watermark regions in an attacked image $I_{no}$. To support arbitrary resolutions, $I_{no}$ is padded and partitioned into tiles of size $(H_{t}, W_{t})$. Each tile is processed independently, and the tile-level masks are merged to obtain a full-resolution prediction.

The detector is trained to achieve accurate coverage and sharp boundaries. The objective combines pixel-wise binary cross-entropy with an overlap term:

L_{D_{t}} = BCE(M_{p}, M_{g}) + β (1 - IoU(M_{p}, M_{g})),

(15)

where $M_{g}$ is the ground-truth mask, $IoU$ denotes intersection-over-union, and $β > 0$ balances region fidelity and boundary accuracy. The use of $1 - IoU$ ensures that minimizing $L_{D_{t}}$ increases the overlap.
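A minimal sketch of the detector objective in Eq. (15) on flattened masks is shown below. It uses a soft IoU computed from predicted probabilities, which is a common differentiable surrogate but an assumption here, since the text does not specify how IoU is relaxed. The default `beta=0.1` matches the value given in Section 4.1.

```python
import math
from typing import Sequence


def detector_loss(pred: Sequence[float], target: Sequence[float],
                  beta: float = 0.1, eps: float = 1e-7) -> float:
    """BCE(M_p, M_g) + beta * (1 - IoU(M_p, M_g)) for flattened masks."""
    n = len(pred)
    bce = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)   # keep the logarithms finite
        bce -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    bce /= n
    inter = sum(p * g for p, g in zip(pred, target))
    union = sum(p + g - p * g for p, g in zip(pred, target))
    soft_iou = inter / (union + eps)      # soft IoU is an assumed relaxation
    return bce + beta * (1.0 - soft_iou)


ground_truth = [1.0, 0.0, 1.0, 0.0]
print(detector_loss(ground_truth, ground_truth))   # close to 0 for a perfect mask
print(detector_loss([0.5] * 4, ground_truth))      # undecided prediction
```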

For each connected component of $M_{p}$, an affine transform $\hat{A}_{j}$ is estimated to align the predicted region to the canonical embedding geometry. The rectified patch is obtained by inverse warping,

\tilde{x}_{no}^{(j)} = \hat{A}_{j}^{-1}(I_{no}),

and is fed to the decoder to recover a local message estimate,

\hat{M}_{j} = D_{θ_{D}}(\tilde{x}_{no}^{(j)}) .

A majority vote over $\{\hat{M}_{j}\}$ yields the final reconstruction $\hat{M}$, which compensates for partial detection errors and local corruptions. The geometric rectification step improves decoder stability by reducing spatial misalignment before bit extraction.
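The voting step can be illustrated with a short sketch over hard bit decisions. The tie-breaking rule (ties resolve to 0) is an assumption of this example, as the text does not specify how an even split is handled.

```python
from typing import List, Sequence


def majority_vote(estimates: Sequence[Sequence[int]]) -> List[int]:
    """Bitwise majority over equally long bit lists; ties resolve to 0 (assumed)."""
    if not estimates:
        raise ValueError("need at least one estimate")
    length = len(estimates[0])
    if any(len(e) != length for e in estimates):
        raise ValueError("all estimates must have the same length")
    n = len(estimates)
    return [1 if 2 * sum(e[i] for e in estimates) > n else 0 for i in range(length)]


# Three recovered instances, each with one different flipped bit.
recovered = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
print(majority_vote(recovered))
```

Each of the three instances carries one bit error in a different position, yet the vote recovers the full message, which is how redundancy compensates for local corruptions.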

The overall training objective balances imperceptible embedding, adversarial realism, message fidelity, detection accuracy, and attention supervision:

L = λ_{1} L_{E1} + λ_{2} L_{E2} + λ_{3} L_{D} + λ_{4} L_{D_{t}} + λ_{5} L_{BEN},

(16)

with $λ_{i} > 0$. Segmentation and centroid estimation scale as $O(HW)$. Embedding or extracting K patches scales linearly with K. Tiled detection introduces an additional factor of $⌈H / H_{t}⌉ ⌈W / W_{t}⌉$.
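The tiling factor is easy to evaluate for a concrete resolution. In the sketch below, the 1920 x 1080 image is an arbitrary example, and taking the tile size equal to the 512 x 512 detector input from Section 4.1 is an assumption.

```python
def tile_count(height: int, width: int, tile_h: int, tile_w: int) -> int:
    """Number of detector passes: ceil(H / H_t) * ceil(W / W_t)."""
    if min(height, width, tile_h, tile_w) <= 0:
        raise ValueError("all sizes must be positive")
    rows = -(-height // tile_h)   # integer ceiling division
    cols = -(-width // tile_w)
    return rows * cols


# Example: a 1080 x 1920 image with assumed 512 x 512 tiles.
print(tile_count(1080, 1920, 512, 512))
```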

From a probabilistic standpoint, redundant embedding over K object-centric regions yields $P_{fail} \leq (p_{e}^{⋆})^{K}$, under independent per-region failures with rate $p_{e}^{⋆} \in (0, 1)$, which explains the exponential reduction in decoding failure as K increases. The attention mechanism enforces a perceptual budget through the bound established in Section 3.3,

∥x_{en} - x_{co}∥_{2} \leq λ ∥Φ(x_{co}, M)∥_{2},

(17)

and geometric rectification ensures that small affine estimation errors translate into bounded perturbations at the decoder input.

4. Performance Evaluation

We evaluate PatchSeal with two goals. The first is visual fidelity. The second is robustness under classic corruptions, AIGC edits, and composed pipelines that reflect practice. We use strong baselines and standard saliency datasets. We report complete settings for reproducibility and follow a simple protocol so that results are easy to verify and compare.

4.1. Experimental Setup and Metrics

(1) Datasets: Experiments use DUTS [26], DUT-OMRON [27], HKU-IS [28], and PASCAL-S [29]. Training draws 10,553 images from DUTS-TR and 5168 from DUT-OMRON. A validation split holds out 1000 images from DUTS-TR and 500 from DUT-OMRON. The remaining 9553 and 4668 images form a 14,221-image training set. Generalization is evaluated on the DUTS test split, HKU-IS, and PASCAL-S, which are disjoint from training. During training, saliency masks are used to prompt SAM and obtain object masks. Centroids are computed and expanded to define embedding regions. For local edits, 100 ImageNet images are used to paste patches with random rotation and scale. LaMa [30] performs inpainting under irregular, rectangular, and saliency masks. For instruction-driven global edits, the editor in [31] is employed.

(2) Evaluation setup: All models are implemented in PyTorch 2.2.2 and trained on Ubuntu 20.04 with an Intel Xeon Platinum 8352V CPU, an NVIDIA RTX 4090 GPU with 24 GB memory, and 90 GB system memory. Optimization uses AdamW with learning rate $1 \times 10^{-4}$. The encoder fidelity loss $L_{E1}$ balances MSE and SSIM with coefficient $α = 0.005$. Global loss weights follow the formulation in Section 3.5: $λ_{1} = 0.2$, $λ_{2} = 0.001$, and $λ_{3} = 1$. The attention mask loss weight is set to $λ_{5} = 32$. The detector loss $L_{D_{t}}$ uses $β = 0.1$. For the embedding network, the input patch size is $128 \times 128$. The number of embedded instances equals the number of non-overlapping salient masks. Training runs for 125 epochs with a batch size of 64. For the detection module, the input is $512 \times 512$ with 100 training epochs and batch size 12.

(3) Metrics: Fidelity is measured by PSNR and SSIM. Recovery reliability is measured by bit accuracy (BA).

PSNR and MSE:

PSNR = 10 \log_{10} \left( \frac{MAX_{I}^{2}}{MSE} \right), \quad MSE = \frac{1}{N} \sum_{i=1}^{N} (I_{co}(i) - I_{en}(i))^{2} .

SSIM:

SSIM(x, y) = \frac{(2 μ_{x} μ_{y} + c_{1}) (2 σ_{xy} + c_{2})}{(μ_{x}^{2} + μ_{y}^{2} + c_{1}) (σ_{x}^{2} + σ_{y}^{2} + c_{2})} .

Bit accuracy:

BA = \frac{1}{L} \sum_{i=1}^{L} \mathbb{I}(\hat{w}_{i} = w_{i}),

where L is the message length, $w_{i}$ is the i-th bit, and $\hat{w}_{i}$ is its estimate.
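For reference, PSNR and bit accuracy as defined above can be computed with the short sketch below. It works on flattened pixel lists and assumes 8-bit images, so the peak value defaults to 255; the function names are illustrative.

```python
import math
from typing import Sequence


def psnr(cover: Sequence[float], embedded: Sequence[float], max_value: float = 255.0) -> float:
    """PSNR in dB between two flattened images of equal length."""
    if len(cover) != len(embedded) or not cover:
        raise ValueError("inputs must be non-empty and equally long")
    mse = sum((a - b) ** 2 for a, b in zip(cover, embedded)) / len(cover)
    if mse == 0:
        return math.inf          # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def bit_accuracy(true_bits: Sequence[int], decoded_bits: Sequence[int]) -> float:
    """Fraction of message bits recovered correctly."""
    if len(true_bits) != len(decoded_bits) or not true_bits:
        raise ValueError("inputs must be non-empty and equally long")
    return sum(int(w == w_hat) for w, w_hat in zip(true_bits, decoded_bits)) / len(true_bits)


# A uniform error of one grey level gives MSE = 1 and PSNR = 20 * log10(255).
print(psnr([0, 0, 0, 0], [1, 1, 1, 1]))
print(bit_accuracy([1, 0, 1, 1], [1, 1, 1, 1]))
```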

4.2. Visual Fidelity

As shown in Figure 3, the visual results and residual maps on DUTS, HKU-IS, and PASCAL-S indicate that PatchSeal concentrates small perturbations in texture-rich or low-saliency areas and avoids structural artifacts. In contrast, MBRS and ARWGAN exhibit boundary chroma noise and CIN produces mild blur. Quantitatively, PatchSeal achieves the highest PSNR and SSIM on all test sets. Averaged over the three datasets, the PSNR exceeds MBRS, ARWGAN, and CIN by 10.71 dB, 6.01 dB, and 1.54 dB, and the SSIM is higher by 0.1902, 0.0396, and 0.0085. On PASCAL-S, the scores reach 44.38 dB and 0.9951, which implies negligible perceptual deviation. The behavior aligns with the design: object-level dispersed embedding limits the spatial footprint and reduces global distortion, and attention-guided masks place bits within semantically stable and visually tolerant regions. A minor limitation appears in low-texture scenes with strong edges, where residual energy along edge bands can increase slightly while remaining sub-perceptual. This can be mitigated by down-weighting edge bands in the attention mask or adding edge-preserving regularization.

4.3. Robustness Evaluation

All robustness tests are read jointly to provide a unified view of performance. Table 1 reports three groups: Part A for single distortions, Part B for AIGC edits, and Part C for composed pipelines.

(1) Single distortions. PatchSeal is strongest overall, with pronounced gains under rotation, padding, and picture-in-picture. Detector-guided relocking, together with dispersed redundancy, supplies alignment cues when geometry changes. Small gaps remain on Gaussian noise and resize, where CIN leads, and on crop, where both CIN and ARWGAN score higher in Table 1. These cases suggest that stronger scale-equivariant augmentation and anti-aliasing filters during training would help.

(2) AIGC editing. Part B considers local paste and inpainting at two strengths and a global instruction-driven edit. PatchSeal keeps BA above 90% for all local edits and is clearly better on the global edit. Object-aware dispersion preserves redundancy across salient instances and attention masks emphasize semantically stable regions. These choices matter most when edits alter content rather than only pixels.

(3) Composed pipelines. Part C evaluates compositions that are closer to practice. PatchSeal achieves the highest average BA across all compositions, with a larger margin when rotation is present. JPEG followed by inpainting is harder: PatchSeal is slightly below MBRS and ARWGAN by 1.16% and 0.66%. Compression weakens low-amplitude cues before mask-based removal. Pre-compression regularization and feature-level anchors may help. Averaged over all compositions, PatchSeal improves upon MBRS, ARWGAN, and CIN by +14.92%, +3.67%, and +16.38%, respectively.

Table 1. BA (%) ↑ under single distortions, AIGC edits, and composed pipelines.

A. Single distortions	MBRS	ARWGAN	CIN	PatchSeal
None	97.49	100.00	99.99	99.90
JPEG (Non-Geom.)	96.13	97.68	95.88	98.49
Gaussian Blur (Non-Geom.)	97.04	99.28	99.76	99.78
Gaussian Noise (Non-Geom.)	97.38	99.32	99.88	99.83
Drop (Non-Geom.)	96.64	98.17	96.57	98.84
Color Adjust (Non-Geom.)	97.30	95.71	99.54	99.86
Rotation (Geom.)	69.19	86.89	68.23	93.93
Crop (Geom.)	67.79	99.26	99.70	94.60
Padding (Geom.)	69.40	91.95	92.32	97.79
Picture in Picture (Geom.)	58.35	94.41	93.52	96.42
Resize (Geom.)	97.41	95.50	98.10	96.13
B. AIGC editing	MBRS	ARWGAN	CIN	PatchSeal
Paste (20%)	93.85	93.82	95.22	97.11
Paste (35%)	88.88	89.55	93.83	93.31
Inpaint (20%)	93.65	93.10	83.14	95.56
Inpaint (35%)	91.37	90.63	77.36	90.29
Global Edit	77.62	82.58	73.26	88.64
C. Composed pipelines	MBRS	ARWGAN	CIN	PatchSeal
JPEG + Gaussian Blur	95.39	96.13	95.24	97.64
JPEG + Rotation	68.69	86.26	67.31	86.59
Rotation + Resize	69.03	84.97	67.87	90.34
JPEG + Inpaint	92.19	91.69	78.76	91.03
Rotation + Inpaint	66.78	82.77	65.66	85.16
Inpaint + Global Edit	76.06	78.96	70.52	87.25
JPEG + Rotation + Inpaint	66.27	80.60	66.84	84.74

Note. Non-Geom. denotes intensity-only distortions with fixed coordinates (JPEG, blur, noise, pixel drop, color adjust). Geom. denotes geometric transforms that change alignment (rotation, crop, padding, resize, picture-in-picture). AIGC editing includes local Paste/Inpaint (20%/35% edited area) and a Global Edit. Composed pipelines apply operations in the listed order. Entries are BA (%) as in Section 4.1; higher is better.

4.4. Ablations and Multi-Target Dispersion

Three design factors are assessed: the dispersion strategy, the attention masks, and the number of embedded watermarks. Results indicate that PatchSeal benefits from object-aware placement and moderate redundancy, with consistent gains in imperceptibility and recovery.

(1) Dispersion strategy. Replacing object-aware dispersion with random placement (PatchSeal-R) reduces BA by 1.11% under non-geometric corruptions, with a 3.87% drop on Drop. Under geometric changes, BA falls by 2.73% on Crop. For AIGC edits the gap widens: BA decreases by 2.48% on Paste, by 4.51% on Inpaint, and by 6.86% on the Global Edit. These trends are consistent with how edits interact with image content. Drop and crop often affect borders, while salient instances tend to occupy central regions. Random dispersion places bits near vulnerable edges. Object-aware dispersion concentrates payload inside salient instances, which preserves synchronization and reduces exposure to content removal.

(2) Attention masks. Removing attention masks weakens both fidelity and robustness on DUTS. PatchSeal-N attains PSNR 42.05, SSIM 0.9887, and BA 95.97, compared with 42.78, 0.9915, and 96.66 for PatchSeal. Attention guides embedding toward texture-rich and semantically stable regions. This lowers visible residue and improves survival under resampling and edits. The effect size in PSNR and SSIM is modest but consistent, and translates into a measurable BA gain of 0.69%.

(3) Number of watermarks. The number of embedded instances is varied from one to four on the DUTS test set (Figure 4). With one watermark, PSNR and SSIM are 43.68 dB and 0.9931. As the number increases, PSNR decreases by 6.87% from one to two, by 3.24% from two to three, and by 2.74% from three to four, each relative to the previous setting. With four instances, the scores remain at 38.28 dB and 0.9805. Redundancy improves recovery: two watermarks raise BA by about 3% over one, and BA is at least 98% with three. A slight BA dip at four suggests overlap and mask boundary effects. Considering fidelity, robustness, and deployment cost, embedding two or three dispersed instances is a sound operating point.
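The reported PSNR figures can be cross-checked with a short calculation. The sketch assumes each percentage is a relative decrease from the previous watermark count. The start value and the three percentages are taken from the text, while the two intermediate values are derived here and are not reported in the paper.

```python
psnr_one = 43.68                     # dB with one watermark (reported)
relative_drops = [6.87, 3.24, 2.74]  # percent for 1->2, 2->3, 3->4 (reported)

psnr = psnr_one
trajectory = [psnr]
for drop in relative_drops:
    # Each drop is applied relative to the PSNR at the previous count.
    psnr *= 1.0 - drop / 100.0
    trajectory.append(psnr)

for count, value in enumerate(trajectory, start=1):
    print(f"{count} watermark(s): {value:.2f} dB")
```

Compounding the three decreases gives roughly 40.68 dB, 39.36 dB, and 38.28 dB, so the chain ends at the 38.28 dB reported for four instances.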

Across all ablations, object-aware dispersion and attention-guided placement are the primary contributors to the observed improvements. They confine energy to perceptually tolerant regions, maintain alignment after edits, and deliver higher BA without visible artifacts.

4.5. Discussion and Limitations

PatchSeal improves fidelity and robustness under classic corruptions, AIGC edits, and composed pipelines. Gains are most visible when geometry changes, which indicates that object-aware dispersion and detector relocking provide stable alignment cues. Two residual challenges are discussed.

Small gaps appear on Gaussian noise, resize, and crop where CIN or ARWGAN are slightly higher. A likely cause is a mismatch between the training distortions and the high-frequency statistics of these tests. Resampling can also introduce anti-aliasing that suppresses low-amplitude signals. JPEG followed by inpainting is harder. Compression weakens residuals and mask-guided inpainting removes local structure, which limits relocking. A broader curriculum with scale-aware augmentation and resampling-aware filters can help. Pre-compression regularization, invariant anchors in the detector, and inpainting-aware training are also promising. Pairing compression with semantic removal during training is expected to raise the worst case.
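The idea of pairing compression with semantic removal during training can be sketched as a sampler that always chains one compression-like operation with one removal-like operation, applied in that order. Everything below is a hypothetical stand-in: `quantize` is a crude proxy for lossy compression, `erase_region` is a crude proxy for mask-guided inpainting, and the 1-D list stands in for a watermark residual. None of it reflects the authors' noise layer.

```python
import random


def quantize(signal, step=0.5):
    """Coarse quantization, a crude stand-in for lossy compression."""
    return [round(x / step) * step for x in signal]


def erase_region(signal, start=2, length=3):
    """Zero a contiguous span, a crude stand-in for mask-guided removal."""
    return [0.0 if start <= i < start + length else x
            for i, x in enumerate(signal)]


# Hypothetical operation pools; a real curriculum would hold several of each.
COMPRESSION_OPS = {"quantize": quantize}
REMOVAL_OPS = {"erase_region": erase_region}


def sample_paired_pipeline(rng):
    """Pick one compression-like op followed by one removal-like op."""
    comp = rng.choice(sorted(COMPRESSION_OPS))
    rem = rng.choice(sorted(REMOVAL_OPS))
    return [COMPRESSION_OPS[comp], REMOVAL_OPS[rem]]


def apply_in_order(signal, ops):
    """Apply the operations in the listed order."""
    for op in ops:
        signal = op(signal)
    return signal


rng = random.Random(0)
residual = [0.2, -0.3, 0.7, -0.6, 0.9, 0.1, -0.8, 0.4]
distorted = apply_in_order(residual, sample_paired_pipeline(rng))
print(distorted)
```

Even in this toy, the low-amplitude entries are flattened by quantization before the erased span removes the strongest ones. This mirrors the reasoning above that compression weakens residuals and inpainting then removes local structure.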

Dispersed embedding works well on medium and large targets, but performance on small objects can drop. Capacity is limited and edits affect a larger fraction of the object, so cues become fragile. Mild edge-band residuals are observed in low-texture scenes with strong contrast. These remain sub-perceptual but indicate sensitivity near mask boundaries. In practice many images exceed 1080p and contain richer texture, which changes embedding strength, local aliasing, and detector confidence. Future work includes scale-aware dispersion, adaptive per-instance payloads, and masks that avoid thin structures. Additional steps include edge-preserving regularization and boundary smoothing, multi-resolution pyramids during training, evaluation at native resolution, and a study of memory and latency trade-offs to balance capacity, imperceptibility, and synchronization in realistic settings.

5. Conclusions and Future Work

This paper studies the impact of AIGC image editing on watermark robustness and presents PatchSeal, which combines multi-target dispersed embedding with attention-guided placement. SAM provides object masks for embedding and U2-Net supports accurate localization during retrieval, while inverse transforms relock geometry. Experiments on DUTS, DUT-OMRON, and HKU-IS show higher PSNR and SSIM and higher bit accuracy than strong baselines under single and composed attacks. Visual evidence and ablations support the mechanism and show that perturbations remain imperceptible. Future work will expand the distortion curriculum with stronger instruction-driven editors, screen-camera capture, and longer pipelines. We will study scale-aware dispersion for small objects, edge-preserving regularization, and adaptive weights that downplay fragile regions. We also plan certified robustness analysis, tighter capacity control, latency reduction, and broader benchmarks to support fair and reproducible comparison.

Author Contributions

Software, T.Y., H.Z. and Z.W.; Validation, H.Z.; Formal analysis, T.Y.; Resources, T.Y.; Data curation, Z.W.; Writing—review & editing, Y.C.; Visualization, Z.W. and Y.C.; Supervision, T.Y. and Y.C.; Project administration, T.Y. and Y.C.; Funding acquisition, T.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Zhejiang Province, grant number LTGG24F030001.

Data Availability Statement

The datasets generated and analysed during the current study are not publicly available. Part of the data involves ongoing research activities and internal experimental configurations, and therefore cannot be openly shared at this stage due to research continuity and technical constraints. Reasonable requests for access to the data may be considered by the corresponding author, subject to compliance with relevant data-use policies and the journal’s data sharing guidelines.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Jia, Z.; Fang, H.; Zhang, W. MBRS: Enhancing Robustness of DNN-Based Watermarking by Mini-Batch of Real and Simulated JPEG Compression. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 41–49.
Fang, H.; Jia, Z.; Ma, Z.; Zhang, W.; Chang, E.-C. PIMoG: An Effective Screen-Shooting Noise-Layer Simulation for Deep-Learning-Based Watermarking Network. In Proceedings of the 30th ACM International Conference on Multimedia, New York, NY, USA, 10–14 October 2022; pp. 2267–2275.
Wen, S.; Fang, H.; Zhu, H.; Zhang, W.; Yu, N. De-END: Decoder-Driven Watermarking Network. IEEE Trans. Multimed. 2023, 25, 7571–7581.
Fernandez, P.; Couairon, G.; Jégou, H.; Douze, M.; Furon, T. The Stable Signature: Rooting Watermarks in Latent Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 22466–22477.
Min, R.; Li, S.; Chen, H.; Cheng, M. A Watermark-Conditioned Diffusion Model for IP Protection. In Computer Vision—ECCV 2024; Lecture Notes in Computer Science; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; Volume 15127, pp. 104–120.
Rezaei, A.; Akbari, M.; Alvar, S.R.; Fatemi, A.; Zhang, Y. LaWa: Using Latent Space for In-Generation Image Watermarking in Latent Diffusion Models. In Computer Vision—ECCV 2024; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15147, pp. 118–136.
Zhang, X.; Li, R.; Yu, J.; Xu, Y.; Li, W.; Zhang, J. EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 19–21 June 2024; pp. 11964–11974.
Jiang, Z.; Guo, M.; Hu, Y.; Jia, J.; Gong, N.Z. Certifiably Robust Image Watermark. In Computer Vision—ECCV 2024; Springer: Cham, Switzerland, 2025.
Zhang, C.; Karjauv, A.; Benz, P.; Kweon, I.S. Towards Robust Deep Hiding Under Non-Differentiable Distortions for Practical Blind Watermarking. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 5158–5166.
Fernandez, P.; Furon, T.; Jégou, H.; Douze, M. Watermarking Images in Self-Supervised Latent Spaces. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Singapore, 22–27 October 2022; pp. 2964–2968.
Wang, Y.; Zhu, X.; Ye, G.; Zhang, S.; Wei, X. Achieving Resolution-Agnostic DNN-Based Image Watermarking via Implicit Neural Representation. In Proceedings of the 32nd ACM International Conference on Multimedia (MM), Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 7442–7450.
Lin, Z.; Zhang, J.; Zhou, H.; Zhang, W. Automatic Robust Blind Video Watermarking Resisting Camera Recording. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 5093–5106.
Wang, K.; Wu, S.; Yin, X.; Lu, W.; Luo, X.; Yang, R. Robust Image Watermarking With Synchronization Using Template Enhanced-Extracted Network. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 1602–1614.
Guo, X.; Zhang, W.; Zhou, W.; Yu, N.; Zheng, Y. DWSF: Practical Deep Dispersed Watermarking with Synchronization and Fusion. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 7922–7932.
Hu, R.; Zhang, J.; Xu, T.; Li, J.; Zhang, T. Robust-Wide: Robust Watermarking Against Instruction-Driven Image Editing. In Computer Vision—ECCV 2024; Springer: Cham, Switzerland, 2025; pp. 20–37.
Ma, X.; Li, Y.; Liu, S.; Lu, J. Geometric Distortion Immunized Deep Watermarking under Local Non-Affine Transformations. In Proceedings of the 18th European Conference on Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; pp. 265–281.
Bui, T.; Agarwal, S.; Collomosse, J. TrustMark: Universal Watermarking for Arbitrary Resolution Images. In Proceedings of the 32nd ACM International Conference on Multimedia (MM), Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 5802–5805.
Lou, X.; Zhao, H.; Zhao, P.; Ma, K. Gaussian Shading: Certifiable, Performance-Lossless Image Watermarking for Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 19–21 June 2024; pp. 12055–12065.
Zhu, P.; Feng, Y.; Han, S.; Han, X. Watermark-Embedded Adversarial Examples for Copyright Protection Against Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 19–21 June 2024; pp. 14220–14229.
Pan, Z.; Jun, J.; Zhang, A.; Xu, Z. WateRF: Structure-Preserving Watermarking for Neural Radiance Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 19–21 June 2024; pp. 9885–9895.
Huang, Z.; Wu, X.; Cai, Z.; Zhang, W.; Yu, N. Protecting NeRFs’ Copyright via Plug-and-Play Watermark. In Computer Vision—ECCV 2024; Springer: Cham, Switzerland, 2025; pp. 382–398.
Gan, G.; Chen, Y.; Liu, Z. Towards Robust Model Watermark via Reducing Parametric Vulnerability. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Paris, France, 2–6 October 2023; pp. 20463–20472.
Zhao, Y.; Sun, X.; Wang, K.; Lu, S.-P. Proactive Deepfake Defence via Identity Watermarking. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 3684–3693.
Wang, G.; Ma, Z.; Liu, C.; Yang, X.; Fang, H.; Zhang, W.; Yu, N. MuST: Robust Image Watermarking for Multi-Source Tracing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; pp. 5364–5371.
Gu, Y.; Zhou, X.; Wang, S.; Zhang, J. Dual Defense: Adversarial, Traceable, and Invisible Robust Watermarking Against Face Swapping. IEEE Trans. Inf. Forensics Secur. 2024, 19, 4628–4641.
Wang, L.; Lu, H.; Wang, Y.; Feng, M.; Wang, D.; Yin, B.; Ruan, X. Learning to Detect Salient Objects with Image-Level Supervision. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 136–145.
Yang, C.; Zhang, L.; Lu, H.; Ruan, X.; Yang, M.-H. Saliency Detection via Graph-Based Manifold Ranking. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 3166–3173.
Li, G.; Yu, Y. Visual Saliency Based on Multiscale Deep Features. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5455–5463.
Li, Y.; Hou, X.; Koch, C.; Rehg, J.M.; Yuille, A.L. The Secrets of Salient Object Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 280–287.
Suvorov, R.; Logacheva, E.; Mashikhin, A.; Remizova, A.; Kharlamov, A.; Aliev, V.; Lempitsky, A.; Vatolin, D. Resolution-Robust Large Mask Inpainting with Fourier Convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 2149–2159.
Zhao, H.; Ma, X.S.; Chen, L.; Yang, Y.; Zhang, J.; Jia, X.; Sun, J. UltraEdit: Instruction-based Fine-Grained Image Editing at Scale. Adv. Neural Inf. Process. Syst. (NeurIPS) 2025, 37, 3058–3093.

Figure 1. System overview of PatchSeal. (a) Object-level dispersive watermark embedding; (b) watermark-region detection and extraction. The pipeline includes an encoder E, a decoder D, a noise layer $T(\cdot)$ used during training, and a critic C. An attention mask $A_{M}$ modulates embedding to preserve perceptual fidelity, while $T(\cdot)$ samples randomized perturbations to improve robustness and support cross-editor generalization.

Figure 2. Attention mask generation. BEN predicts a spatial subjectness map $\mathrm{BodyMask}$. CWN produces a channel-gating tensor $\mathrm{WeightMask}$. Their fusion yields the attention tensor $A$, which modulates the location and strength of embedding to balance imperceptibility and robustness.

Figure 3. Visual comparison on DUTS, HKU-IS, and PASCAL-S. The left column shows originals. Other columns show each method’s watermarked images and residual maps. PatchSeal preserves structure and confines low-energy perturbations to texture-rich or low-saliency regions. MBRS and ARWGAN introduce chroma noise near object boundaries, while CIN yields mild blur. Quantitatively, PatchSeal attains the best PSNR and SSIM on all test sets; on PASCAL-S it reaches 44.38 dB and 0.9951.

Figure 4. Effect of the number of embedded watermarks on visual quality and robustness (DUTS). With one watermark, PatchSeal achieves 43.68 dB PSNR and 0.9931 SSIM, which exceed the dataset averages by 0.9 dB and 0.0088. As the count increases to four, PSNR decreases by 6.87%, 3.24%, and 2.74% while still reaching 38.28 dB with 0.9805 SSIM. Redundancy improves recovery: two watermarks raise BA by about 3% over one, and BA is at least 98% with three. A slight dip at four suggests overlap and boundary effects. Overall, two to three dispersed instances provide a balanced operating point and support multi-target dispersion.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

You, T.; Zheng, H.; Wang, Z.; Chen, Y. PatchSeal: A Robust and Intangible Image Watermarking Framework for AIGC. Mathematics 2026, 14, 679. https://doi.org/10.3390/math14040679

