1. Introduction
The exponential growth of online image creation and sharing has increased the need for reliable copyright protection and provenance tracing. Invisible digital watermarking remains a practical and widely adopted solution because it preserves perceptual quality while enabling post-hoc verification under common image processing operations. However, the rapid emergence of AI-generated content (AIGC), especially diffusion-based image generation and instruction-based editing, has brought new challenges. These models perform semantic transformations that fundamentally change content structures, textures, and spatial layouts, thus invalidating many of the robustness assumptions underlying classical watermarking systems.
Recent learning-based watermarking research has significantly advanced robustness against pixel-level distortions through end-to-end encoder–noise–decoder optimization and improved distortion modeling. These approaches have evolved from training with real and simulated JPEG compression [1], to incorporating screen-shooting simulation layers that approximate perspective and illumination variations [2], and to adopting decoder-driven architectures that tightly couple embedding and extraction [3]. Meanwhile, the integration of watermarking with generative models has opened new possibilities, including watermarking within diffusion pipelines [4,5], latent-space watermarking [6], and tamper localization via embedded marks [7]. Formal robustness has also been strengthened through certified guarantees against removal and forgery attacks [8]. Despite these advances, most existing methods remain optimized for pixel-level or mildly geometric distortions and still degrade under semantic edits. Moreover, techniques that rely on white-box access to generators or editors are often impractical in real-world deployment scenarios, where editing platforms are typically black-box systems.
This paper targets robustness under AIGC-driven image editing as the primary threat model. We present a practical and imperceptible watermarking framework, PatchSeal, designed to withstand semantic transformations while maintaining high visual quality. The proposed framework distributes redundant watermark bits across multiple semantically meaningful regions using a multi-target dispersed embedding strategy and dynamically adjusts embedding strength through attention-guided masking. This design ensures that watermark information remains recoverable even when portions of an image are locally or globally modified. The framework is fully compatible with standard end-to-end training and does not require access to the internal parameters of proprietary editors.
The main contributions of this work are summarized as follows. First, a multi-target dispersed embedding strategy is introduced to allocate watermark redundancy across content-aware regions, reducing the risk of information loss under semantic edits. Second, an attention-guided masking module is developed to emphasize salient structures and texture-rich areas, improving both invisibility and robustness during recovery. Third, a comprehensive evaluation under instruction-driven editing and composite distortions demonstrates higher bit accuracy and comparable or better perceptual quality than representative baselines such as MBRS [1], PIMoG [2], and De-END [3]. Additional studies also discuss complementarity with diffusion-integrated watermarking [5,6] and certified defenses [8], and include ablation experiments that clarify how semantic dispersion and attention mechanisms jointly enhance robustness.
The remainder of this paper is organized as follows. Section 2 reviews related work on learning-based watermarking and AIGC image editing. Section 3 presents the proposed framework, including the overall architecture, object-level embedding strategy, and attention-guided masking module. Section 4 reports experimental results in terms of perceptual quality, robustness, and ablation analysis, followed by discussions on limitations and future directions. Finally, Section 5 concludes the paper.
2. Related Works
Deep image watermarking has progressed from codec-specific heuristics to end-to-end learning frameworks that explicitly incorporate distortion modeling. The encoder–noise–decoder paradigm has become the mainstream design, enhancing robustness against non-differentiable distortions such as JPEG compression through a combination of real and simulated corruptions [1]. Further improvements have been achieved by disentangling forward and backward propagation [9] and by embedding or recovering watermarks in latent representation spaces such as self-supervised features [10] and implicit neural representations [11]. These studies suggest that curriculum design and distortion modeling, rather than handcrafted features, play a decisive role in robustness. However, existing pipelines mainly learn invariance to parameterized pixel-level perturbations and still perform poorly under semantic edits or large distribution shifts.
In real-world scenarios, geometric transformations and black-box channels remain major challenges. Screen-to-camera pipelines can be modeled through trainable layers that capture perspective, illumination, and moiré patterns, improving recovery after physical capture [2]. Recent research has introduced synchronization templates and relocking mechanisms to ensure consistency in video and camera-recorded content [12,13,14]. Geometry-aware models further improve immunity by combining wide-baseline alignment with local non-affine correction through deformable operators and attention mechanisms [15,16]. These approaches enhance practical robustness but depend on heavy simulation and complex training curricula, which increase computational cost and instability. Moreover, they do not effectively address semantic re-synthesis, such as object replacement or layout rewriting, which has become a common form of manipulation in AI-generated content (AIGC).
The rapid emergence of diffusion-based image generation and editing has shifted research from pixel-level perturbations to semantic integrity and proactive forensics [17]. EditGuard [7] unifies tamper localization and copyright tracing within a single diffusion pipeline. Latent-space watermarking [6] and diffusion-integrated watermarking [5] improve the balance between fidelity and robustness, while certified watermarking [8] provides formal guarantees against removal or forgery. Diffusion-aware strategies have also been extended to neural radiance fields and Gaussian-based representations [18,19,20,21], achieving better generalization across modalities. Despite these advances, many frameworks still rely on white-box access to the generator or restrictive norm-bounded assumptions, which limit their generalization to open-ended instruction-driven editing. Existing methods also offer limited coverage across diverse editors and editing intents, leaving a gap between generative robustness and practical deployment.
In parallel, deepfake protection research has evolved from post-hoc detection to proactive and traceable watermark-based defense [22]. Identity-aware watermarking and separable decoding enable reliable provenance tracing and semi-fragile localization [23]. Decoder-driven training, multi-source tracing, and dual-defense strategies further expand robustness against adaptive removal attacks [3,24,25]. Video-oriented pipelines report improved synchronization and strong resistance to screen capture [12,13]. Despite such progress, most methods still focus on pixel or geometric distortions and lack resilience to semantic transformations introduced by AIGC. Motivated by these limitations, this work introduces a generator-free watermarking framework that improves robustness under semantic transformations while preserving high perceptual quality. By embedding redundant payloads into semantically stable and texture-rich regions with lightweight synchronization, the method achieves a balanced trade-off among robustness, localizability, and deployability for practical watermarking in AIGC-driven environments.
3. Proposed Approach
This section presents PatchSeal, a robust image watermarking framework that maintains perceptual fidelity and remains reliable under semantic transformations and AI-generated content (AIGC) edits. The framework integrates three coordinated components: (i) object-level redundant embedding that disperses the payload across multiple salient regions, (ii) an attention mask generation module that modulates embedding strength in a content-aware manner, and (iii) localization with geometric rectification prior to decoding. The architecture follows an encoder, noise, decoder, and critic design together with a differentiable distortion curriculum that supports cross-editor generalization and stable training. The design provides concrete mechanisms and measurable guarantees, including an exponential reduction in failure probability as the number of object-level embeddings increases and an explicit perceptual budget controlled by the attention mask, with formal analysis provided later.
3.1. System Overview and Design Principles
Let $I_c$ denote a cover image and $M$ denote the watermark bits. The encoder $E$ produces an embedded image $I_w = E(I_c, M)$. A composite channel acts on $I_w$ and yields an attacked image $I_a$. The channel comprises three categories of disturbances: non-geometric perturbations such as JPEG compression, Gaussian blur, additive noise, pixel drop, and color jitter; geometric transformations such as rotation, cropping, and scaling; and AIGC edits that may partially regenerate or replace regions. The decoder $D$ reconstructs an estimate $\hat{M} = D(I_a)$.
PatchSeal is low-overhead and editor-agnostic. It leverages semantic structural invariance rather than editor-specific gradients. In the embedding phase, the Segment Anything Model (SAM) generates multiple object masks. Their centroids are expanded to form embedding masks that guide E to distribute watermark bits across several regions, which improves resistance to localized AIGC edits. During retrieval, a lightweight U2-Net detector predicts watermark regions and compensates geometric distortions prior to decoding.
Figure 1 summarizes the pipeline. The system includes an encoder $E$, a decoder $D$, a noise layer $\mathcal{N}$ used during training, and a critic $C$. The attention mask $A$ regulates embedding to meet a perceptual budget, while $\mathcal{N}$ samples randomized perturbations to improve robustness against diverse degradations. In contrast to methods that backpropagate through specific editors [5,9], PatchSeal relies on the structural stability of salient regions together with the distortion curriculum to limit editor-specific overfitting and reduce training cost, while maintaining cross-editor generalization [1,2,4].
3.2. Object-Level Embedding and Redundancy Design
Conventional schemes often embed a single global watermark, which is fragile when large semantic areas are regenerated or replaced [1]. PatchSeal adopts an object-level redundant embedding strategy that distributes an identical payload across $K$ object-centric regions, which increases the survival probability under semantic edits [4].
As shown in Algorithm 1 [4], given a cover image $I_c$, a segmentation backbone such as the Segment Anything Model (SAM) produces disjoint binary masks $\{S_k\}_{k=1}^{K}$. For each mask $S_k$, the centroid $c_k$ is computed from image moments. A fixed-size patch centered at $c_k$ is cropped and used for embedding the message $M$.
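| Algorithm 1 Object-Level Dispersed Embedding [4] (PatchSeal) |
| Require: Cover image $I_c$, message $M$, SAM, encoder $E$ |
| Ensure: Watermarked image $I_w$ |
| $I_w \leftarrow I_c$; $\{S_k\}_{k=1}^{K} \leftarrow \mathrm{SAM}(I_c)$ |
| for $k = 1$ to $K$ do |
| Compute $c_k$ from $S_k$ via image moments |
| Crop a fixed-size cover patch $P_k$ centered at $c_k$ |
| Replace the corresponding region in $I_w$ with $E(P_k, M)$ |
| end for |
| return $I_w$ |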
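The loop in Algorithm 1 can be illustrated with a minimal pure-Python sketch. It assumes a grayscale image stored as nested lists, masks that are already available (no SAM call is made), and a hypothetical `encode_patch` callable standing in for the learned encoder; the clamping of patches at the image border is also an assumption and is not specified in the text.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]  # grayscale image as rows of floats
Mask = List[List[int]]     # binary mask, 1 inside the object


def centroid_from_moments(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) centroid from the raw moments m00, m10, m01."""
    m00 = m10 = m01 = 0
    for r, row in enumerate(mask):
        for c, v in enumerate(row):
            m00 += v
            m10 += r * v
            m01 += c * v
    if m00 == 0:
        raise ValueError("empty mask has no centroid")
    return round(m10 / m00), round(m01 / m00)


def embed_dispersed(
    cover: Image,
    masks: Sequence[Mask],
    message: Sequence[int],
    encode_patch: Callable[[Image, Sequence[int]], Image],
    size: int,
) -> Image:
    """Embed the same message into one fixed-size patch per object mask."""
    h, w = len(cover), len(cover[0])
    if size > h or size > w:
        raise ValueError("patch does not fit inside the image")
    watermarked = [row[:] for row in cover]
    for mask in masks:
        cr, cc = centroid_from_moments(mask)
        # Assumption: shift the patch inward when the centroid is near a border.
        top = min(max(cr - size // 2, 0), h - size)
        left = min(max(cc - size // 2, 0), w - size)
        patch = [row[left:left + size] for row in cover[top:top + size]]
        marked = encode_patch(patch, message)  # stand-in for the encoder
        for i in range(size):
            watermarked[top + i][left:left + size] = marked[i]
    return watermarked


if __name__ == "__main__":
    cover = [[0.0] * 6 for _ in range(6)]
    mask = [[1 if 1 <= r <= 3 and 2 <= c <= 4 else 0 for c in range(6)]
            for r in range(6)]
    toy_encoder = lambda patch, msg: [[v + 1.0 for v in row] for row in patch]
    out = embed_dispersed(cover, [mask], [1, 0, 1], toy_encoder, size=2)
    print(centroid_from_moments(mask), out[1][2])
```

The toy encoder only offsets pixel values so that the pasted region is easy to spot; it is not meant to resemble the trained network.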
Under independent block failures with probability $p$, the probability that all $K$ embedded instances fail satisfies
$$P_{\mathrm{fail}} = p^{K},$$
which yields an exponential improvement as $K$ increases. Placing patches near object interiors increases the chance that at least one instance remains intact when backgrounds are replaced or partially regenerated. Majority voting across recovered instances is performed at detection time (see Section 3.5), which further reduces the effective failure probability.
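As a small worked example of the independence argument, the sketch below evaluates the probability that every one of $K$ instances is lost. The per-instance failure rate of 0.3 is a hypothetical value chosen for illustration and is not reported in the paper.

```python
def all_fail_probability(p: float, k: int) -> float:
    """Probability that k independent instances all fail, each with rate p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return p ** k


if __name__ == "__main__":
    # Hypothetical per-instance failure rate, for illustration only.
    for k in range(1, 5):
        print(k, all_fail_probability(0.3, k))
```

The independence assumption is optimistic when a single edit covers several neighbouring objects, so the computed value is better read as a lower bound on the real failure rate than as a prediction.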
3.3. Attention Mask Generation Module
Embedding strength should adapt to local semantics and textural complexity. Different regions exhibit uneven tradeoffs between imperceptibility and robustness: signals injected into smooth backgrounds are more visible and are easily removed by background replacement, whereas signals placed inside semantically stable object interiors persist longer. To address this heterogeneity, an Attention Mask Generation Module (AM) adaptively regulates both the location and the strength of embedding. As shown in Figure 2, AM comprises two cooperative subnetworks: a Body Extraction Network (BEN) that estimates spatial subjectness and a Channel-Wise Weighting Network (CWN) that modulates feature channels. Their outputs are fused into a unified attention tensor that guides content-aware embedding.
3.3.1. Spatial Subjectness Extraction (BEN)
BEN assigns higher importance to object interiors and lower importance to backgrounds to concentrate watermark energy where semantic stability is high and visual sensitivity is low. The network adopts a ResNet-50 encoder-decoder without fully connected layers. The input cover patch is encoded into multi-scale features. For efficiency, higher-level features are retained and compressed to 64 channels by two convolutions to form the decoder input. The decoder performs three upsampling and fusion stages to produce a soft spatial map $A_s$ with larger values in object interiors.
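Supervision uses a distance transform derived from a binary object mask. For each pixel $p$ inside the object, the target value is the distance from $p$ to the nearest pixel outside the object, followed by normalization to $[0, 1]$. BEN is trained with a supervised loss $\mathcal{L}_{\mathrm{att}}$ between the predicted map and this target, which establishes a reliable spatial prior for embedding.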
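The distance-transform supervision can be sketched as follows. This is a brute-force illustration under two assumptions that the text does not confirm: the distance is Euclidean, and normalization divides by the maximum distance within the mask.

```python
import math
from typing import List

Mask = List[List[int]]


def distance_target(mask: Mask) -> List[List[float]]:
    """Normalized distance from each object pixel to the nearest background pixel."""
    h, w = len(mask), len(mask[0])
    background = [(r, c) for r in range(h) for c in range(w) if mask[r][c] == 0]
    if not background:
        raise ValueError("mask needs at least one background pixel")
    dist = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                # Brute force is quadratic in pixel count; fine for a small sketch.
                dist[r][c] = min(math.hypot(r - br, c - bc) for br, bc in background)
    peak = max(max(row) for row in dist)
    if peak == 0.0:
        return dist  # empty object: the target is all zeros
    return [[d / peak for d in row] for row in dist]


if __name__ == "__main__":
    mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
            for r in range(5)]
    for row in distance_target(mask):
        print(row)
```

The resulting map peaks at the object centre and falls to zero at the background, which matches the stated goal of larger values in object interiors.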
3.3.2. Channel-Wise Weighting (CWN)
CWN refines embedding strength along the channel dimension. A Conv-BN-ReLU stem maps the cover patch to a 64-channel representation that is processed by four squeeze-and-excitation blocks. These blocks capture inter-channel dependencies and generate adaptive gates according to perceptual tolerance and texture richness. High-frequency or complex responses receive larger weights, while responses associated with smooth or salient areas are suppressed. After an additional Conv-BN-ReLU and a sigmoid function $\sigma$, CWN outputs a gating tensor $A_c$ with entries in $(0, 1)$.
3.3.3. Mask Fusion and Feature Modulation
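Spatial and channel attentions are fused element-wise:
$$A = \mathcal{B}(A_s) \odot A_c,$$
where $A_s$ is the spatial map from BEN, $A_c$ is the gating tensor from CWN, $\odot$ denotes the Hadamard product, and $\mathcal{B}(\cdot)$ broadcasts $A_s$ to the channel dimension. Let $P$ denote the cover patch and let $T$ be a learnable message carrier aligned with the cover patch. With an embedding budget $\epsilon$, the attention-modulated embedding is
$$P^{w} = P + \epsilon \, (A \odot T).$$
This concentrates energy in semantically stable and perceptually tolerant regions and provides an explicit upper bound on distortion,
$$\lVert P^{w} - P \rVert_{\infty} \le \epsilon \, \lVert T \rVert_{\infty},$$
which holds because every entry of $A$ lies in $[0, 1]$, and which links the attention mechanism to a controllable perceptual budget.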
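The fusion and the distortion bound can be checked numerically with a minimal sketch. It assumes a per-channel scalar gate (shape C×1×1), an additive embedding of the form cover + budget × attention × carrier, and attention values in [0, 1]; the tensor shapes and the variable names are illustrative assumptions rather than the authors' implementation.

```python
from typing import List

Plane = List[List[float]]   # H x W
Tensor = List[Plane]        # C x H x W


def fuse_attention(spatial: Plane, channel: List[float]) -> Tensor:
    """Broadcast the spatial map over channels and multiply by each channel gate."""
    return [[[gate * s for s in row] for row in spatial] for gate in channel]


def embed(cover: Tensor, carrier: Tensor, attention: Tensor, budget: float) -> Tensor:
    """Add the attention-weighted carrier to the cover, scaled by the budget."""
    return [
        [[x + budget * a * t for x, a, t in zip(x_row, a_row, t_row)]
         for x_row, a_row, t_row in zip(x_pl, a_pl, t_pl)]
        for x_pl, a_pl, t_pl in zip(cover, attention, carrier)
    ]


def max_abs(t: Tensor) -> float:
    return max(abs(v) for plane in t for row in plane for v in row)


def max_abs_diff(a: Tensor, b: Tensor) -> float:
    return max(
        abs(x - y)
        for pa, pb in zip(a, b)
        for ra, rb in zip(pa, pb)
        for x, y in zip(ra, rb)
    )


if __name__ == "__main__":
    spatial = [[1.0, 0.5], [0.0, 0.25]]
    channel = [1.0, 0.5]
    attention = fuse_attention(spatial, channel)
    cover = [[[0.5, 0.5], [0.5, 0.5]] for _ in channel]
    carrier = [[[1.0, -1.0], [1.0, -1.0]] for _ in channel]
    marked = embed(cover, carrier, attention, budget=0.1)
    # The change never exceeds budget * max|carrier| when attention is in [0, 1].
    print(max_abs_diff(marked, cover), 0.1 * max_abs(carrier))
```

Positions where the spatial map is zero are left untouched, which is how the mask keeps watermark energy out of regions it marks as unsuitable.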
3.4. Watermarking Architecture and Distortion Modeling
PatchSeal adopts an encoder-decoder-critic design with a stochastic distortion layer used during training, following recent robust watermarking systems [1,2]. The encoder $E$ maps a cover image $I_c$ and a message $M$ to an embedded image $I_w = E(I_c, M)$. A critic $C$ encourages natural appearance of $I_w$. The decoder $D$ reconstructs the message from a distorted input $I_a$.
3.4.1. Fidelity and Adversarial Objectives
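The encoder is trained to preserve pixel fidelity and structural similarity through a fidelity loss $\mathcal{L}_{E}$ that balances an MSE term and an SSIM term with a single coefficient. Adversarial learning uses a binary cross-entropy form: the critic minimizes a loss $\mathcal{L}_{C}$ that separates cover images from embedded images, while the encoder maximizes the critic response on embedded images through an adversarial loss $\mathcal{L}_{\mathrm{adv}}$. The decoder is trained with a message reconstruction objective $\mathcal{L}_{D}$.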
3.4.2. Differentiable Distortion Curriculum
A differentiable noise layer $\mathcal{N}$ is inserted between $E$ and $D$ during training to expose the encoder-decoder pair to diverse degradations without relying on editor-specific gradients. Given $I_w$, the distorted input is
$$I_a = \mathcal{N}(I_w).$$
At each iteration, $\mathcal{N}$ samples a randomized composition of non-geometric distortions (JPEG compression, Gaussian blur, additive noise, color jitter, pixel drop) and geometric distortions (rotation, cropping, resizing, padding, picture-in-picture). Distortion intensities are drawn from predefined ranges to form a lightweight curriculum that improves generalization without explicit backpropagation through any editor pipeline [9]. This design encourages invariance to a broad spectrum of real-world degradations while maintaining stable optimization.
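The sampling logic of such a noise layer can be sketched in a few lines. The version below is a toy, non-differentiable stand-in that only illustrates drawing one non-geometric and one geometric operation per call with an intensity from a predefined range; the four operations, their ranges, and the one-of-each composition rule are assumptions for illustration, not the paper's configuration.

```python
import random
from typing import List

Image = List[List[float]]


def add_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Additive Gaussian noise, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in img]


def drop_pixels(img: Image, rate: float, rng: random.Random) -> Image:
    """Zero out each pixel independently with probability `rate`."""
    return [[0.0 if rng.random() < rate else v for v in row] for row in img]


def pad(img: Image, border: int, rng: random.Random) -> Image:
    """Surround the image with a zero border of the given width."""
    width = len(img[0]) + 2 * border
    side = [0.0] * border
    body = [side + row + side for row in img]
    top = [[0.0] * width for _ in range(border)]
    bottom = [[0.0] * width for _ in range(border)]
    return top + body + bottom


def rotate_quarter(img: Image, turns: int, rng: random.Random) -> Image:
    """Rotate clockwise by 90 degrees, `turns` times."""
    for _ in range(turns):
        img = [list(col) for col in zip(*img[::-1])]
    return img


# Each entry pairs an operation with a sampler for its intensity (assumed ranges).
NON_GEOMETRIC = [
    (add_noise, lambda rng: rng.uniform(0.0, 0.05)),
    (drop_pixels, lambda rng: rng.uniform(0.0, 0.3)),
]
GEOMETRIC = [
    (pad, lambda rng: rng.randint(0, 4)),
    (rotate_quarter, lambda rng: rng.randint(0, 3)),
]


def noise_layer(img: Image, rng: random.Random) -> Image:
    """Apply one random non-geometric and one random geometric distortion."""
    out = img
    for pool in (NON_GEOMETRIC, GEOMETRIC):
        op, sample_strength = rng.choice(pool)
        out = op(out, sample_strength(rng), rng)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[0.5] * 4 for _ in range(4)]
    distorted = noise_layer(image, rng)
    print(len(distorted), len(distorted[0]))
```

In a real training loop each operation would be replaced by a differentiable tensor operation (or a differentiable approximation, as is common for JPEG) so that gradients reach the encoder.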
3.5. Watermark Localization and Optimization Strategy
Dispersed embedding creates uncertainty in both the location and the number of watermark instances. A detector based on the U2-Net architecture is employed to predict a binary mask $\hat{S}$ that indicates candidate watermark regions in an attacked image $I_a$. To support arbitrary resolutions, $I_a$ is padded and partitioned into fixed-size tiles. Each tile is processed independently, and the tile-level masks are merged to obtain a full-resolution prediction.
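The pad, partition, and merge procedure can be sketched as below. The zero padding, the non-overlapping tiles, and the `predict_tile` callable (a stand-in for the detector) are assumptions made for illustration; the paper's tile size is not reproduced here.

```python
from typing import Callable, List

Image = List[List[float]]
Mask = List[List[int]]


def tiled_predict(image: Image, tile: int,
                  predict_tile: Callable[[Image], Mask]) -> Mask:
    """Pad to a multiple of `tile`, predict per tile, merge, and crop back."""
    h, w = len(image), len(image[0])
    ph = -(-h // tile) * tile  # ceil(h / tile) * tile
    pw = -(-w // tile) * tile
    padded = [row + [0.0] * (pw - w) for row in image]
    padded += [[0.0] * pw for _ in range(ph - h)]
    merged = [[0] * pw for _ in range(ph)]
    for top in range(0, ph, tile):
        for left in range(0, pw, tile):
            block = [row[left:left + tile] for row in padded[top:top + tile]]
            pred = predict_tile(block)  # each tile is processed independently
            for i in range(tile):
                merged[top + i][left:left + tile] = pred[i]
    # Remove the padding so the mask matches the input resolution.
    return [row[:w] for row in merged[:h]]


if __name__ == "__main__":
    threshold = lambda block: [[1 if v > 0.5 else 0 for v in row] for row in block]
    image = [[0.1, 0.9, 0.8, 0.2, 0.7],
             [0.6, 0.4, 0.3, 0.9, 0.1],
             [0.2, 0.2, 0.95, 0.5, 0.55]]
    for row in tiled_predict(image, 2, threshold):
        print(row)
```

With a purely per-pixel stand-in such as the threshold above, tiling gives the same mask as processing the whole image; a real detector may behave differently near tile borders, which is one reason overlapping tiles are sometimes preferred.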
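The detector is trained to achieve accurate coverage and sharp boundaries. The objective combines pixel-wise binary cross-entropy with an overlap term:
$$\mathcal{L}_{\mathrm{det}} = \mathrm{BCE}(\hat{S}, S^{*}) + \lambda_{\mathrm{IoU}} \bigl(1 - \mathrm{IoU}(\hat{S}, S^{*})\bigr),$$
where $\hat{S}$ is the predicted mask, $S^{*}$ is the ground-truth mask, $\mathrm{IoU}$ denotes intersection-over-union, and $\lambda_{\mathrm{IoU}}$ balances region fidelity and boundary accuracy. The use of $1 - \mathrm{IoU}$ ensures that minimizing $\mathcal{L}_{\mathrm{det}}$ increases the overlap.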
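A minimal sketch of a detector objective of this kind is shown below. It assumes a soft (probability-valued) IoU relaxation and places the balancing weight on the overlap term; both choices, and the default weight of 1.0, are assumptions for illustration.

```python
import math
from typing import List

Plane = List[List[float]]


def bce(pred: Plane, target: Plane, eps: float = 1e-7) -> float:
    """Mean pixel-wise binary cross-entropy with clipped probabilities."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def soft_iou(pred: Plane, target: Plane, eps: float = 1e-7) -> float:
    """IoU computed on probabilities: sum(p*t) / sum(p + t - p*t)."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    union = sum(p + t - p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return (inter + eps) / (union + eps)


def detector_loss(pred: Plane, target: Plane, weight: float = 1.0) -> float:
    """BCE plus a weighted (1 - IoU) overlap penalty."""
    return bce(pred, target) + weight * (1.0 - soft_iou(pred, target))


if __name__ == "__main__":
    target = [[1.0, 0.0], [0.0, 1.0]]
    uncertain = [[0.5, 0.5], [0.5, 0.5]]
    print(detector_loss(target, target), detector_loss(uncertain, target))
```

The overlap term is zero only when prediction and ground truth coincide, so lowering the loss pushes the predicted region toward the true one even when foreground pixels are rare.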
For each connected component $R_j$ of the predicted mask $\hat{S}$, an affine transform $H_j$ is estimated to align the predicted region to the canonical embedding geometry. The rectified patch $\tilde{P}_j$ is obtained by inverse warping and is fed to the decoder to recover a local message estimate, $\hat{M}_j = D(\tilde{P}_j)$. A majority vote over $\{\hat{M}_j\}$ yields the final reconstruction $\hat{M}$, which compensates for partial detection errors and local corruptions. The geometric rectification step improves decoder stability by reducing spatial misalignment before bit extraction.
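The final voting step can be illustrated with a short sketch. It assumes hard 0/1 bit estimates from each recovered instance and resolves ties to 0; the tie rule is an assumption, since the text does not state one.

```python
from typing import List, Sequence


def majority_vote(estimates: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-instance bit estimates into one message by bitwise majority."""
    if not estimates:
        raise ValueError("at least one estimate is required")
    length = len(estimates[0])
    if any(len(e) != length for e in estimates):
        raise ValueError("all estimates must have the same length")
    votes = [sum(bits) for bits in zip(*estimates)]
    # Assumption: a tie (possible with an even number of instances) maps to 0.
    return [1 if 2 * v > len(estimates) else 0 for v in votes]


if __name__ == "__main__":
    recovered = [
        [1, 0, 1, 1],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
    ]
    print(majority_vote(recovered))
```

Each instance in the example carries one wrong bit, yet the vote still returns a single consistent message because the errors fall on different positions.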
The overall training objective balances imperceptible embedding, adversarial realism, message fidelity, detection accuracy, and attention supervision as a weighted sum of the corresponding losses with non-negative weights. Segmentation and centroid estimation scale with image size. Embedding or extracting $K$ patches scales linearly with $K$. Tiled detection introduces an additional factor proportional to the number of tiles.
From a probabilistic standpoint, redundant embedding over $K$ object-centric regions yields $P_{\mathrm{fail}} = p^{K}$ under independent per-region failures with rate $p$, which explains the exponential reduction in decoding failure as $K$ increases. The attention mechanism enforces a perceptual budget through the bound established in Section 3.3, and geometric rectification ensures that small affine estimation errors translate into bounded perturbations at the decoder input.
4. Performance Evaluation
We evaluate PatchSeal with two goals. The first is visual fidelity. The second is robustness under classic corruptions, AIGC edits, and composed pipelines that reflect practice. We use strong baselines and standard saliency datasets. We report complete settings for reproducibility and follow a simple protocol so that results are easy to verify and compare.
4.1. Experimental Setup and Metrics
(1) Datasets: Experiments use DUTS [26], DUT-OMRON [27], HKU-IS [28], and PASCAL-S [29]. Training draws 10,553 images from DUTS-TR and 5168 from DUT-OMRON. A validation split holds out 1000 images from DUTS-TR and 500 from DUT-OMRON. The remaining 9553 and 4668 images form a 14,221-image training set. Generalization is evaluated on the DUTS test split, HKU-IS, and PASCAL-S, which are disjoint from training. During training, saliency masks are used to prompt SAM and obtain object masks. Centroids are computed and expanded to define embedding regions. For local edits, 100 ImageNet images are used to paste patches with random rotation and scale. LaMa [30] performs inpainting under irregular, rectangular, and saliency masks. For instruction-driven global edits, the editor in [31] is employed.
(2) Evaluation setup: All models are implemented in PyTorch 2.2.2 and trained on Ubuntu 20.04 with an Intel Xeon Platinum 8352V CPU, an NVIDIA RTX 4090 GPU with 24 GB memory, and 90 GB system memory. Optimization uses AdamW with learning rate
. The encoder fidelity loss
balances MSE and SSIM with coefficient
. Global loss weights follow the formulation in
Section 3.5:
,
, and
. The attention mask loss weight is set to
. The detector loss
uses
. For the embedding network, the input patch size is
. The number of embedded instances equals the number of non-overlapping salient masks. Training runs for 125 epochs with a batch size of 64. For the detection module, the input is
with 100 training epochs and batch size 12.
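(3) Metrics: Fidelity is measured by PSNR and SSIM. Recovery reliability is measured by bit accuracy (BA):
$$\mathrm{BA} = \frac{1}{L} \sum_{i=1}^{L} \mathbb{1}\left[m_i = \hat{m}_i\right] \times 100\%,$$
where $L$ is the message length, $m_i$ is the $i$-th bit, and $\hat{m}_i$ is its estimate.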
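Bit accuracy is simple to compute; the sketch below treats it as the percentage of positions at which the decoded bit matches the embedded bit, which is the usual reading of the metric. The function name is illustrative.

```python
from typing import Sequence


def bit_accuracy(message: Sequence[int], estimate: Sequence[int]) -> float:
    """Percentage of bit positions where the estimate matches the message."""
    if len(message) != len(estimate):
        raise ValueError("message and estimate must have the same length")
    if not message:
        raise ValueError("message must not be empty")
    matches = sum(m == e for m, e in zip(message, estimate))
    return 100.0 * matches / len(message)


if __name__ == "__main__":
    print(bit_accuracy([1, 0, 1, 1], [1, 1, 1, 1]))
```

Because random guessing already scores about 50% on balanced messages, values in the 60 to 70% range in a robustness table indicate a decoder that has largely lost the payload.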
4.2. Visual Fidelity
As shown in Figure 3, the visual results and residual maps on DUTS, HKU-IS, and PASCAL-S indicate that PatchSeal concentrates small perturbations in texture-rich or low-saliency areas and avoids structural artifacts. In contrast, MBRS and ARWGAN exhibit boundary chroma noise and CIN produces mild blur. Quantitatively, PatchSeal achieves the highest PSNR and SSIM on all test sets. Averaged over the three datasets, the PSNR exceeds MBRS, ARWGAN, and CIN by 10.71 dB, 6.01 dB, and 1.54 dB, and the SSIM is higher by 0.1902, 0.0396, and 0.0085. On PASCAL-S, the scores reach 44.38 dB and 0.9951, which implies negligible perceptual deviation. The behavior aligns with the design: object-level dispersed embedding limits the spatial footprint and reduces global distortion, and attention-guided masks place bits within semantically stable and visually tolerant regions. A minor limitation appears in low-texture scenes with strong edges, where residual energy along edge bands can increase slightly while remaining sub-perceptual. This can be mitigated by down-weighting edge bands in the attention mask or adding edge-preserving regularization.
4.3. Robustness Evaluation
All robustness tests are read jointly to provide a unified view of performance. Table 1 reports three groups: Part A for single distortions, Part B for AIGC edits, and Part C for composed pipelines.
(1) Single distortions. PatchSeal is strongest overall, with pronounced gains under rotation, padding, and picture-in-picture. Detector-guided relocking, together with dispersed redundancy, supplies alignment cues when geometry changes. Gaps appear where CIN leads on Gaussian noise (99.88% versus 99.83%), resize (98.10% versus 96.13%), and crop (99.70% versus 94.60%); on crop, ARWGAN (99.26%) is also higher than PatchSeal. These cases suggest that stronger scale-equivariant augmentation and anti-aliasing filters during training would help.
(2) AIGC editing. Part B considers local paste and inpainting at two strengths and a global instruction-driven edit. PatchSeal keeps BA above 90% for all local edits and is clearly better on the global edit. Object-aware dispersion preserves redundancy across salient instances and attention masks emphasize semantically stable regions. These choices matter most when edits alter content rather than only pixels.
(3) Composed pipelines. Part C evaluates compositions that are closer to practice. PatchSeal achieves the highest average BA across all compositions, with a larger margin when rotation is present. JPEG followed by inpainting is harder: PatchSeal is slightly below MBRS and ARWGAN, by 1.16 and 0.66 percentage points. Compression weakens low-amplitude cues before mask-based removal. Pre-compression regularization and feature-level anchors may help. Averaged over the seven compositions in Table 1, PatchSeal improves upon MBRS, ARWGAN, and CIN by 12.62, 3.05, and 15.79 percentage points, respectively.
Table 1. BA (%) ↑ under single distortions, AIGC edits, and composed pipelines.
| A. Single distortions | MBRS | ARWGAN | CIN | PatchSeal |
| None | 97.49 | 100.00 | 99.99 | 99.90 |
| JPEG (Non-Geom.) | 96.13 | 97.68 | 95.88 | 98.49 |
| Gaussian Blur (Non-Geom.) | 97.04 | 99.28 | 99.76 | 99.78 |
| Gaussian Noise (Non-Geom.) | 97.38 | 99.32 | 99.88 | 99.83 |
| Drop (Non-Geom.) | 96.64 | 98.17 | 96.57 | 98.84 |
| Color Adjust (Non-Geom.) | 97.30 | 95.71 | 99.54 | 99.86 |
| Rotation (Geom.) | 69.19 | 86.89 | 68.23 | 93.93 |
| Crop (Geom.) | 67.79 | 99.26 | 99.70 | 94.60 |
| Padding (Geom.) | 69.40 | 91.95 | 92.32 | 97.79 |
| Picture in Picture (Geom.) | 58.35 | 94.41 | 93.52 | 96.42 |
| Resize (Geom.) | 97.41 | 95.50 | 98.10 | 96.13 |
| B. AIGC editing | MBRS | ARWGAN | CIN | PatchSeal |
| Paste (20%) | 93.85 | 93.82 | 95.22 | 97.11 |
| Paste (35%) | 88.88 | 89.55 | 93.83 | 93.31 |
| Inpaint (20%) | 93.65 | 93.10 | 83.14 | 95.56 |
| Inpaint (35%) | 91.37 | 90.63 | 77.36 | 90.29 |
| Global Edit | 77.62 | 82.58 | 73.26 | 88.64 |
| C. Composed pipelines | MBRS | ARWGAN | CIN | PatchSeal |
| JPEG + Gaussian Blur | 95.39 | 96.13 | 95.24 | 97.64 |
| JPEG + Rotation | 68.69 | 86.26 | 67.31 | 86.59 |
| Rotation + Resize | 69.03 | 84.97 | 67.87 | 90.34 |
| JPEG + Inpaint | 92.19 | 91.69 | 78.76 | 91.03 |
| Rotation + Inpaint | 66.78 | 82.77 | 65.66 | 85.16 |
| Inpaint + Global Edit | 76.06 | 78.96 | 70.52 | 87.25 |
| JPEG + Rotation + Inpaint | 66.27 | 80.60 | 66.84 | 84.74 |
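The per-method averages over Part C can be recomputed directly from the table. The sketch below copies the seven Part C rows and reports PatchSeal's margin over each baseline in percentage points; the dictionary layout and function names are illustrative.

```python
from typing import Dict, List

# BA (%) values copied from Table 1, Part C, in row order.
PART_C: Dict[str, List[float]] = {
    "MBRS": [95.39, 68.69, 69.03, 92.19, 66.78, 76.06, 66.27],
    "ARWGAN": [96.13, 86.26, 84.97, 91.69, 82.77, 78.96, 80.60],
    "CIN": [95.24, 67.31, 67.87, 78.76, 65.66, 70.52, 66.84],
    "PatchSeal": [97.64, 86.59, 90.34, 91.03, 85.16, 87.25, 84.74],
}


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def margins(table: Dict[str, List[float]], ours: str = "PatchSeal") -> Dict[str, float]:
    """Average-BA margin of `ours` over every other method, in percentage points."""
    base = mean(table[ours])
    return {name: round(base - mean(vals), 2)
            for name, vals in table.items() if name != ours}


if __name__ == "__main__":
    for name, vals in PART_C.items():
        print(name, round(mean(vals), 2))
    print(margins(PART_C))
```

The averages come out at roughly 76.34 (MBRS), 85.91 (ARWGAN), 73.17 (CIN), and 88.96 (PatchSeal), so the margin over the strongest baseline on composed pipelines is about three points.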
4.4. Ablations and Multi-Target Dispersion
Three design factors are assessed: the dispersion strategy, the attention masks, and the number of embedded watermarks. Results indicate that PatchSeal benefits from object-aware placement and moderate redundancy, with consistent gains in imperceptibility and recovery.
(1) Dispersion strategy. Replacing object-aware dispersion with random placement (PatchSeal-R) reduces BA by 1.11% under non-geometric corruptions, with a 3.87% drop on Drop. Under geometric changes, BA falls by 2.73% on Crop. For AIGC edits the gap widens: BA decreases by 2.48% on Paste, by 4.51% on Inpaint, and by 6.86% for the Global edit. These trends are consistent with how edits interact with image content. Drop and crop often affect borders, while salient instances tend to occupy central regions. Random dispersion places bits near vulnerable edges. Object-aware dispersion concentrates payload inside salient instances, which preserves synchronization and reduces exposure to content removal.
(2) Attention masks. Removing attention masks weakens both fidelity and robustness on DUTS. PatchSeal-N attains a PSNR of 42.05 dB, an SSIM of 0.9887, and a BA of 95.97%, compared with 42.78 dB, 0.9915, and 96.66% for PatchSeal. Attention guides embedding toward texture-rich and semantically stable regions. This lowers visible residue and improves survival under resampling and edits. The effect size in PSNR and SSIM is modest but consistent, and translates into a measurable BA gain of 0.69 percentage points.
(3) Number of watermarks. The number of embedded instances varies from one to four on the DUTS test set (Figure 4). With one watermark, PSNR and SSIM are 43.68 dB and 0.9931. As the number increases, PSNR decreases by 6.87% from one to two, by 3.24% from two to three, and by 2.74% from three to four. With four instances, the scores remain at 38.28 dB and 0.9805. Redundancy improves recovery: two watermarks raise BA by about 3% over one, and BA is at least 98% with three. A slight BA dip at four suggests overlap and mask boundary effects. Considering fidelity, robustness, and deployment cost, embedding two or three dispersed instances is a sound operating point.
Across all ablations, object-aware dispersion and attention-guided placement are the primary contributors to the observed improvements. They confine energy to perceptually tolerant regions, maintain alignment after edits, and deliver higher BA without visible artifacts.
4.5. Discussion and Limitations
PatchSeal improves fidelity and robustness under classic corruptions, AIGC edits, and composed pipelines. Gains are most visible when geometry changes, which indicates that object-aware dispersion and detector relocking provide stable alignment cues. Two residual challenges are discussed.
Small gaps appear on Gaussian noise, resize, and crop where CIN or ARWGAN are slightly higher. A likely cause is a mismatch between the training distortions and the high-frequency statistics of these tests. Resampling can also introduce anti-aliasing that suppresses low-amplitude signals. JPEG followed by inpainting is harder. Compression weakens residuals and mask-guided inpainting removes local structure, which limits relocking. A broader curriculum with scale-aware augmentation and resampling-aware filters can help. Pre-compression regularization, invariant anchors in the detector, and inpainting-aware training are also promising. Pairing compression with semantic removal during training is expected to raise the worst case.
Dispersed embedding works well on medium and large targets, but performance on small objects can drop. Capacity is limited and edits affect a larger fraction of the object, so cues become fragile. Mild edge-band residuals are observed in low-texture scenes with strong contrast. These remain sub-perceptual but indicate sensitivity near mask boundaries. In practice many images exceed 1080p and contain richer texture, which changes embedding strength, local aliasing, and detector confidence. Future work includes scale-aware dispersion, adaptive per-instance payloads, and masks that avoid thin structures. Additional steps include edge-preserving regularization and boundary smoothing, multi-resolution pyramids during training, evaluation at native resolution, and a study of memory and latency trade-offs to balance capacity, imperceptibility, and synchronization in realistic settings.
5. Conclusions and Future Work
This paper studies the impact of AIGC image editing on watermark robustness and presents PatchSeal, which combines multi-target dispersed embedding with attention-guided placement. SAM provides object masks for embedding and U2-Net supports accurate localization during retrieval, while inverse transforms relock geometry. Experiments on the DUTS, HKU-IS, and PASCAL-S test sets show higher PSNR and SSIM than strong baselines and, in most settings, higher bit accuracy under single and composed attacks. Visual evidence and ablations support the mechanism and show that perturbations remain imperceptible. Future work will expand the distortion curriculum with stronger instruction-driven editors, screen-camera capture, and longer pipelines. We will study scale-aware dispersion for small objects, edge-preserving regularization, and adaptive weights that downplay fragile regions. We also plan certified robustness analysis, tighter capacity control, latency reduction, and broader benchmarks to support fair and reproducible comparison.
Author Contributions
Software, T.Y., H.Z. and Z.W.; Validation, H.Z.; Formal analysis, T.Y.; Resources, T.Y.; Data curation, Z.W.; Writing—review & editing, Y.C.; Visualization, Z.W. and Y.C.; Supervision, T.Y. and Y.C.; Project administration, T.Y. and Y.C.; Funding acquisition, T.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Natural Science Foundation of Zhejiang Province, grant number LTGG24F030001.
Data Availability Statement
The datasets generated and analysed during the current study are not publicly available. Part of the data involves ongoing research activities and internal experimental configurations, and therefore cannot be openly shared at this stage due to research continuity and technical constraints. Reasonable requests for access to the data may be considered by the corresponding author, subject to compliance with relevant data-use policies and the journal’s data sharing guidelines.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Jia, Z.; Fang, H.; Zhang, W. MBRS: Enhancing Robustness of DNN-Based Watermarking by Mini-Batch of Real and Simulated JPEG Compression. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 41–49.
- Fang, H.; Jia, Z.; Ma, Z.; Zhang, W.; Chang, E.-C. PIMoG: An Effective Screen-Shooting Noise-Layer Simulation for Deep-Learning-Based Watermarking Network. In Proceedings of the 30th ACM International Conference on Multimedia, New York, NY, USA, 10–14 October 2022; pp. 2267–2275.
- Wen, S.; Fang, H.; Zhu, H.; Zhang, W.; Yu, N. De-END: Decoder-Driven Watermarking Network. IEEE Trans. Multimed. 2023, 25, 7571–7581.
- Fernandez, P.; Couairon, G.; Jégou, H.; Douze, M.; Furon, T. The Stable Signature: Rooting Watermarks in Latent Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 22466–22477.
- Min, R.; Li, S.; Chen, H.; Cheng, M. A Watermark-Conditioned Diffusion Model for IP Protection. In Computer Vision—ECCV 2024; Lecture Notes in Computer Science; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; Volume 15127, pp. 104–120.
- Rezaei, A.; Akbari, M.; Alvar, S.R.; Fatemi, A.; Zhang, Y. LaWa: Using Latent Space for In-Generation Image Watermarking in Latent Diffusion Models. In Computer Vision—ECCV 2024; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15147, pp. 118–136.
- Zhang, X.; Li, R.; Yu, J.; Xu, Y.; Li, W.; Zhang, J. EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 19–21 June 2024; pp. 11964–11974.
- Jiang, Z.; Guo, M.; Hu, Y.; Jia, J.; Gong, N.Z. Certifiably Robust Image Watermark. In Computer Vision—ECCV 2024; Springer: Cham, Switzerland, 2025.
- Zhang, C.; Karjauv, A.; Benz, P.; Kweon, I.S. Towards Robust Deep Hiding Under Non-Differentiable Distortions for Practical Blind Watermarking. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 5158–5166.
- Fernandez, P.; Furon, T.; Jégou, H.; Douze, M. Watermarking Images in Self-Supervised Latent Spaces. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 2964–2968.
- Wang, Y.; Zhu, X.; Ye, G.; Zhang, S.; Wei, X. Achieving Resolution-Agnostic DNN-Based Image Watermarking via Implicit Neural Representation. In Proceedings of the 32nd ACM International Conference on Multimedia (MM), Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 7442–7450.
- Lin, Z.; Zhang, J.; Zhou, H.; Zhang, W. Automatic Robust Blind Video Watermarking Resisting Camera Recording. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 5093–5106.
- Wang, K.; Wu, S.; Yin, X.; Lu, W.; Luo, X.; Yang, R. Robust Image Watermarking With Synchronization Using Template Enhanced-Extracted Network. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 1602–1614.
- Guo, X.; Zhang, W.; Zhou, W.; Yu, N.; Zheng, Y. DWSF: Practical Deep Dispersed Watermarking with Synchronization and Fusion. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 7922–7932.
- Hu, R.; Zhang, J.; Xu, T.; Li, J.; Zhang, T. Robust-Wide: Robust Watermarking Against Instruction-Driven Image Editing. In Computer Vision—ECCV 2024; Springer: Cham, Switzerland, 2025; pp. 20–37.
- Ma, X.; Li, Y.; Liu, S.; Lu, J. Geometric Distortion Immunized Deep Watermarking under Local Non-Affine Transformations. In Proceedings of the 18th European Conference on Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; pp. 265–281.
- Bui, T.; Agarwal, S.; Collomosse, J. TrustMark: Universal Watermarking for Arbitrary Resolution Images. In Proceedings of the 32nd ACM International Conference on Multimedia (MM), Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 5802–5805.
- Lou, X.; Zhao, H.; Zhao, P.; Ma, K. Gaussian Shading: Certifiable, Performance-Lossless Image Watermarking for Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 19–21 June 2024; pp. 12055–12065.
- Zhu, P.; Feng, Y.; Han, S.; Han, X. Watermark-Embedded Adversarial Examples for Copyright Protection Against Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 19–21 June 2024; pp. 14220–14229.
- Pan, Z.; Jun, J.; Zhang, A.; Xu, Z. WateRF: Structure-Preserving Watermarking for Neural Radiance Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 19–21 June 2024; pp. 9885–9895.
- Huang, Z.; Wu, X.; Cai, Z.; Zhang, W.; Yu, N. Protecting NeRFs’ Copyright via Plug-and-Play Watermark. In Computer Vision—ECCV 2024; Springer: Cham, Switzerland, 2025; pp. 382–398.
- Gan, G.; Chen, Y.; Liu, Z. Towards Robust Model Watermark via Reducing Parametric Vulnerability. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 20463–20472.
- Zhao, Y.; Sun, X.; Wang, K.; Lu, S.-P. Proactive Deepfake Defence via Identity Watermarking. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 3684–3693.
- Wang, G.; Ma, Z.; Liu, C.; Yang, X.; Fang, H.; Zhang, W.; Yu, N. MuST: Robust Image Watermarking for Multi-Source Tracing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; pp. 5364–5371.
- Gu, Y.; Zhou, X.; Wang, S.; Zhang, J. Dual Defense: Adversarial, Traceable, and Invisible Robust Watermarking Against Face Swapping. IEEE Trans. Inf. Forensics Secur. 2024, 19, 4628–4641.
- Wang, L.; Lu, H.; Wang, Y.; Feng, M.; Wang, D.; Yin, B.; Ruan, X. Learning to Detect Salient Objects with Image-Level Supervision. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 136–145.
- Yang, C.; Zhang, L.; Lu, H.; Ruan, X.; Yang, M.-H. Saliency Detection via Graph-Based Manifold Ranking. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 3166–3173.
- Li, G.; Yu, Y. Visual Saliency Based on Multiscale Deep Features. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5455–5463.
- Li, Y.; Hou, X.; Koch, C.; Rehg, J.M.; Yuille, A.L. The Secrets of Salient Object Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 280–287.
- Suvorov, R.; Logacheva, E.; Mashikhin, A.; Remizova, A.; Kharlamov, A.; Aliev, V.; Lempitsky, A.; Vatolin, D. Resolution-Robust Large Mask Inpainting with Fourier Convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 2149–2159.
- Zhao, H.; Ma, X.S.; Chen, L.; Yang, Y.; Zhang, J.; Jia, X.; Sun, J. UltraEdit: Instruction-based Fine-Grained Image Editing at Scale. Adv. Neural Inf. Process. Syst. (NeurIPS) 2025, 37, 3058–3093.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.