Article

FE-WRNet: Frequency-Enhanced Network for Visible Watermark Removal in Document Images

by Zhengli Chen 1, Yuwei Zhang 1, Jielu Yan 1,*, Xuekai Wei 1,*, Weizhi Xian 1,*, Qin Mao 2,3, Yi Qin 4 and Tong Gao 5

1 College of Computer Science, Chongqing University, Chongqing 400044, China
2 School of Computer and Information Technology, Qiannan Normal University for Nationalities, Duyun 558000, China
3 Key Laboratory of Complex Systems and Intelligent Optimization of Guizhou Province, Duyun 558000, China
4 College of Mechanical Engineering, Chongqing University, Chongqing 400044, China
5 Beijing Institute of Computer Technology and Application, Beijing 100039, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 12216; https://doi.org/10.3390/app152212216
Submission received: 8 September 2025 / Revised: 22 October 2025 / Accepted: 3 November 2025 / Published: 18 November 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In video pipelines, document content in recorded lectures, surveillance footage, and broadcasted materials is often overlaid with persistent visible watermarks. Such overlays greatly reduce the readability of document images and interfere with downstream tasks such as optical character recognition (OCR). Despite extensive studies, no prior work has concurrently addressed the diverse text layouts and watermark styles commonly encountered in real-world scenarios. To address this gap, we introduce TextLogo, the first benchmark dataset specifically designed for this comprehensive setting. TextLogo encompasses 2000 training pairs and 200 test pairs, spanning a wide array of text layouts and 30 distinct watermark styles. Building on this foundation, we propose the frequency-enhanced watermark-removal network (FE-WRNet), a generative network that fuses information from the spatial domain and the wavelet domain. Our Fused Wavelet Convolution Mixer (FWCM) effectively captures both the body and the edge components of watermarks, thereby enhancing removal performance. Training is guided by a hybrid loss function—including pixel, perceptual, and wavelet-domain objectives—to preserve fine details and edge structures. Moreover, while this work focuses on single-image document watermark removal, the proposed spatial–wavelet fusion and high-frequency-aware loss are directly relevant to video processing tasks—e.g., frame-wise watermark removal and temporal restoration—because watermarks in video often persist across frames and require fidelity-preserving, temporally consistent restoration. Extensive experiments on TextLogo demonstrate that FE-WRNet outperforms the strongest baseline and reduces the perceptual error by 10.6%. The proposed model also generalizes effectively to natural-image watermark datasets.

1. Introduction

Video scenarios (e.g., captured lectures, surveillance footage, and broadcast documents) frequently exhibit visible watermarks that persist from frame to frame, degrading readability and impeding video analytics. Restoring watermark-free frames therefore requires accurate spatial localization and temporal consistency to avoid flicker while preserving fine textual details [1]. Document images themselves are frequently overlaid with visible watermarks, such as logos, seals, or semitransparent text, which are intended to indicate ownership or confidentiality [2]. These watermarks significantly degrade document readability and impede downstream tasks such as optical character recognition (OCR) [3]; when such documents enter video streams, the artifacts persist across consecutive frames and hamper downstream video analytics and OCR pipelines. Effective removal of such watermarks is crucial for document image cleanup and enhancement, aiming to restore the original content without introducing artifacts [4]. However, watermark removal in document images poses unique challenges because of the diverse characteristics of watermarks and the necessity of preserving the underlying text and layout integrity [5]. Watermarks vary widely in size, color, transparency, and complexity; they may cover substantial portions of text or graphics and often consist of intricate patterns or repeated text [6]. Their semitransparent nature requires reconstructing the obscured content rather than simple inpainting, distinguishing watermark removal from general image denoising [7,8,9].
Early watermark removal methods exhibited significant limitations. Traditional nonlearning approaches typically require manual user input or multiple images containing the same watermark [10], which is impractical for real-world applications. Recent advances in deep learning have enabled substantial progress by framing watermark removal as an image-to-image translation task [11,12], often using generative adversarial networks (GANs) [13] to directly generate watermark-free images from watermarked inputs. However, these one-stage models generally fail to leverage the positional information of watermarks, resulting in residual artifacts in the reconstructed images. To address this, two-stage networks have been proposed to extract the watermark and restore the image [14,15], effectively mitigating background interference during watermark elimination. While effective for natural-image backgrounds such as landscapes or portraits, these methods are not specifically designed for document images; considering the significant difference in distribution between document images and natural images, their performance on document watermark removal remains suboptimal (as shown in Figure 1). Owing to the differences in the distributions of the background images, the model mistakenly removes the text in the background image as a watermark [16,17,18]. In real-world dissemination, digital documents and videos are frequently re-encoded under adaptive streaming and HEVC pipelines [19,20], which amplify perceptual artifacts around fine structures; thus robustness under compression is crucial [21,22,23].
Moreover, certain document restoration techniques have been adapted for watermark removal in document images [24,25], typically employing a one-stage strategy to generate clean images directly from watermarked inputs. Nevertheless, the datasets used in these approaches predominantly contain simple text-based watermarks or red seals, failing to capture the full diversity and complexity of real-world watermarks [26]. This limitation results in inadequate performance in challenging cases involving complex watermark structures or intricate document layouts [27]. Consequently, there is a pressing need for more robust methods and comprehensive datasets tailored specifically to document watermark removal [28], ensuring faithful content restoration without compromising document fidelity [29,30,31].
To address these challenges, we first construct TextLogo, a novel dataset for document-image watermark removal. TextLogo encompasses a broad spectrum of document types and watermark patterns, including dense textual backgrounds (e.g., paragraphs, tables, and document elements) overlaid with diverse watermarks exhibiting variations in color, texture, and edge characteristics. This dataset fills a critical gap in existing benchmarks by reflecting the complexity of real-world document watermarks. Second, building on TextLogo, we develop FE-WRNet, a frequency-enhanced watermark removal network designed to localize watermarks while preserving document details precisely. Central to FE-WRNet is the Fused Wavelet Convolution Mixer (FWCM) submodule, which performs multiscale feature extraction jointly in the spatial and discrete wavelet domains. This design enables the network to capture color cues from spatial and low-frequency subbands while isolating edge and texture information in high-frequency subbands, facilitating more effective learning of watermark features. A hybrid loss based on spatial and wavelet domains is also designed to enhance the model’s perception of details and edge information. Together, TextLogo and FE-WRNet advance the state of the art in document watermark removal by addressing dataset scarcity and methodological limitations [9,32,33].
The contributions of the proposed method can be summarized as follows:
  • We construct a new dataset, TextLogo, which comprises background images containing dense textual content overlaid with diverse watermarks that exhibit variations in color, texture, and edge characteristics. By encompassing a broad range of watermark types, TextLogo fills a critical gap in document-image dewatermarking benchmarks.
  • Building on TextLogo, we propose a frequency-enhanced watermark removal network, FE-WRNet. Central to this network is the FWCM submodule, which operates jointly in the spatial domain and the discrete wavelet domain to capture watermark edges and structural information more effectively. Additionally, a hybrid loss based on spatial and wavelet domains is designed to enhance the model’s perception of details and edge information.
Experiments on the proposed TextLogo dataset demonstrate that FE-WRNet achieves state-of-the-art performance. Further evaluations on the publicly available CLWD dataset confirm that the method also performs competitively on general watermark-removal tasks.
In this article, the scientific question (SQ) is how to accurately localize and remove visible watermarks (with varying layouts and watermark styles) from document images while preserving mid- and high-frequency text microstructure and ensuring perceptual fidelity.
The research questions (RQs) include:
  • Dataset sufficiency. Does a document-centric benchmark covering multiple layouts and 30 heterogeneous watermark styles (TextLogo) reveal the challenges of document watermark removal better than natural image benchmarks?
  • Representation and Architecture. Does the proposed FWCM, which fuses spatial cues with wavelet subbands, achieve more accurate watermark localization and removal than using a spatial feature extractor alone?
  • Loss Design. Does emphasizing high-frequency subbands in the pixel loss (by using a coefficient >1) sharpen mask boundaries and improve perceptual quality without compromising low-frequency color harmony?
  • Accuracy–efficiency trade-off. Against UNet, SplitNet, and WDNet, can FE-WRNet achieve superior TextLogo scores with lower inference FLOPs, and remain competitive on CLWD?
To address RQ1, Section 3.1 introduces TextLogo, a document-centric benchmark with diverse layouts and 30 watermark styles. RQ2 and RQ3 are tackled in Section 3.2, Section 3.3 and Section 3.4, where we present FE-WRNet and its Fused Wavelet Convolution Mixer (FWCM) that jointly harvest spatial and wavelet cues, together with a hybrid objective. Then, RQ3 is studied in Section 4.2.1 by varying the high-frequency penalty $\lambda_H$. RQ2 is validated in Section 4.2.2 via a spatial-only ablation against FWCM. Finally, RQ4 is answered in Section 4.3 by comparing FE-WRNet with UNet, SplitNet, and WDNet on TextLogo (and CLWD for generalization), reporting PSNR/SSIM/LPIPS/RMSE as well as FLOPs to establish the accuracy–efficiency trade-off. The manuscript thus proceeds from dataset design, to model and loss, to ablations, and concludes with cross-method comparisons that substantiate our claims.

2. Related Work

Section 2 is organized into four parts: it first reviews prior work on visible watermark removal, then surveys two closely related areas—document image restoration and single-image deraining/defogging—and finally summarizes mainstream datasets for watermark removal.

2.1. Visible Watermark Removal

Early work, such as the ICA-based recovery algorithm, explicitly separates watermark and host components in the DWT domain, enabling partial reconstruction of textured regions [4]. Dekel et al. [10] demonstrated that simple multi-image optimization can jointly estimate a watermark’s matte and remove it at scale, revealing the fragility of standard stock-photo marks. However, these methods depend on manually set parameters and complex optimization procedures. To overcome these drawbacks, methods based on deep network architectures have been proposed. Li et al. [11] introduced a conditional GAN that leverages paired supervision for photorealistic restoration, whereas Cao et al. [12] crafted a lightweight GAN that focuses on edge consistency. Alternatively, a two-stage network is designed to first perform watermark detection and then removal. Cun et al. [14] stacked attention-guided ResUNets to localize marks and iteratively refine boundaries, markedly reducing halo artifacts. WDNet decomposes the task into coarse mask prediction followed by region-specific refinement, and its CLWD dataset remains the de facto colored-watermark benchmark [15]. Although these methods perform well in image watermark removal, they are tuned on natural images whose backgrounds are continuous-tone scenes. When they are directly applied to document images, high-frequency textual strokes are misidentified as marks, causing over-erasure and illegible content. This domain gap motivates task-specific solutions [34,35,36].

2.2. Document Image Restoration

Document image restoration tasks remove the interference part of the document image, including stains, shadows, and watermarks. Souibgui and Kessentini framed enhancement as paired translation and introduced DE-GAN, which fuses adversarial and perceptual losses to restore degraded scans without explicit priors [24]. Building on larger data, Li et al. collected StainDoc (over 5 k real pages) and proposed a memory-augmented transformer that stores feature prototypes for long-range dependencies, achieving superior OCR accuracy after stain removal [25]. Nevertheless, these pipelines are deliberately generic: they ingest any artifact and output a cleaned page in one shot. They (i) ignore the strong spatial prior that a visible watermark usually occupies a single connected layer and (ii) assume semitransparent texts or red seals, overlooking complex logos and dense repetitive patterns common in legal or financial documents. Consequently, residual ghosts or text loss persist when diverse watermarks are faced [37,38,39].

2.3. Related Vision Tasks: Deraining and Defogging

Deraining, defogging, and watermark removal are all single-image restorations; each seeks to disentangle a structured corruption layer from a latent clean image [40]. Therefore, many deraining and defogging methods have been directly used for watermark removal. Building on the physical model, Chen et al. [41] jointly learn the bidirectional tasks of generating and removing both real and synthetic rain, forming a closed loop that mitigates the domain gap. Li et al. [42] employed a discrete wavelet transform to decompose a rainy image into high-frequency rain streaks and a low-frequency background and then used an attention-guided GAN to precisely localize rainy regions. For defogging, Chen et al. [43] proposed a single-image defogging algorithm that combines the prior of mixed dark and light channels, adaptive defogging intensity [44], and brightness/color compensation models after dividing the image into near and far scenes through improved two-dimensional Otsu segmentation. However, there are still differences between these similar tasks and ours. Rain streaks and fog obey physical imaging models, whereas watermarks are human-designed overlays with arbitrary shapes, colors, and opacities. In typical rain/fog scenes, corruption pervades the image; visible watermarks are usually sparse, which requires the network to pay more attention to the local structure [45,46,47,48].

2.4. Datasets for Watermark Removal

Current watermark benchmarks [15,49] overlay color or gray logos on MS-COCO or PASCAL-VOC photos and supply masks for supervision. They expose models to diverse watermark patterns but lack the structured backgrounds of documents. Document-centric sets are smaller: the DE-GAN corpus (1 k pairs) focuses on textual watermarks and red seals, whereas StainDoc and its derivative StainDoc-Mark [25] add 5 k real stained pages plus synthetic multilanguage text watermarks. No existing dataset simultaneously offers real-document layouts and diverse graphical watermarks, which underscores the need for more comprehensive document watermark removal datasets that cover multilanguage text watermarks, various logo graphics, and even hand-drawn or stamped marks [50,51,52].

3. Method

Section 3 comprises five parts. We begin with our TextLogo dataset and then introduce the corresponding model, FE-WRNet. We describe its generator, discriminator, and loss function in sequence, and conclude with a brief summary.

3.1. TextLogo Dataset

Compared with natural images, text images concentrate their semantic information mainly in the middle- and high-frequency bands. We deem it necessary to build a watermark-removal benchmark tailored to document imagery. To the best of our knowledge, existing datasets that include text images usually assume the watermark to be grayscale text or a red circular seal, overlooking diversity in shape, color, and structure. As a result, models trained on those datasets lack the capacity to recognize a broader spectrum of watermark styles. We therefore design the TextLogo dataset.
For the background images, we collect publicly available documents from the internet and internal documents provided by an oil company. The latter differ from ordinary documents in two respects: they contain various degradations and more complex table layouts. These characteristics allow us to evaluate watermark removal under challenging conditions. From these sources, we manually crop representative regions covering Chinese and English texts, paragraphs, and tables, ensuring background diversity. For watermarks, we assemble a representative set of templates from online resources, including corporate logos, stylized lettering, and distinctive seals, to guarantee the diversity of watermark samples. In total, we gather 30 different watermark types, with 15 used for training and 15 reserved for testing, enabling assessment of generalization to unseen watermarks. For each sample, we randomly determine the watermark’s position, scale, rotation angle, and opacity α (in the range of 0.2–0.8). The watermarked image is then synthesized via linear alpha compositing. Each record also stores the corresponding watermark, mask, and opacity map. The dataset contains 2000 training images and 200 test images. Considering that document images usually exhibit less color and structural diversity than natural images, this scale is sufficient for preliminary validation. Examples of watermarked images are shown in Figure 2.
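To make the synthesis step concrete, the following minimal sketch (Python with NumPy and Pillow) illustrates the linear alpha-compositing procedure described above. The function name, parameter choices, and the way the mask and opacity map are stored are illustrative assumptions, not the exact pipeline used to build TextLogo.

```python
import numpy as np
from PIL import Image

def composite_watermark(doc, wm, alpha=0.5, top_left=(50, 80), scale=1.0, angle=0.0):
    """Synthesize a watermarked document by linear alpha compositing.
    doc: RGB document image; wm: RGBA watermark template.
    Returns the watermarked image, a binary mask, and the per-pixel opacity map."""
    doc = doc.convert("RGB")
    wm = wm.convert("RGBA")
    if scale != 1.0:
        wm = wm.resize((int(wm.width * scale), int(wm.height * scale)))
    if angle:
        wm = wm.rotate(angle, expand=True)

    doc_np = np.asarray(doc, dtype=np.float32) / 255.0
    wm_np = np.asarray(wm, dtype=np.float32) / 255.0

    H, W = doc_np.shape[:2]
    h, w = wm_np.shape[:2]
    y, x = top_left
    h, w = min(h, H - y), min(w, W - x)          # crop watermark to the page

    # Per-pixel opacity: template alpha channel scaled by the global opacity.
    a = wm_np[:h, :w, 3:4] * alpha
    region = doc_np[y:y+h, x:x+w]
    doc_np[y:y+h, x:x+w] = a * wm_np[:h, :w, :3] + (1.0 - a) * region

    mask = np.zeros((H, W), dtype=np.uint8)
    mask[y:y+h, x:x+w] = (a[..., 0] > 0).astype(np.uint8)
    opacity = np.zeros((H, W), dtype=np.float32)
    opacity[y:y+h, x:x+w] = a[..., 0]
    return Image.fromarray((doc_np * 255).astype(np.uint8)), mask, opacity
```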

3.2. FE-WRNet

Our model is built upon a GAN framework whose generator adopts a multitask learning strategy, including watermark localization, watermark removal, and image restoration. Details are given point-by-point below.

3.2.1. Watermark Localization

The watermark-localization network follows the classical U-Net architecture [53], as illustrated in Figure 3, which contains four hierarchical levels. The input image is first lifted into a higher-dimensional feature space via depthwise-separable convolution [54]. After each encoder stage [55], the spatial resolution is halved while the channel dimension is doubled, thereby retaining as much discriminative information as possible. We let the output of the last encoder perceive global features (e.g., long table ruling lines) through a multihead self-attention mechanism. In the decoder, skip connections concatenate the feature map from the i-th encoder with the upsampled activation from the previous layer as input to the i-th decoder, compensating for detail loss incurred during downsampling and increasing segmentation performance. The network ultimately outputs a mask M, a transparency map α , an extracted watermark W, and a 64-channel feature tensor that serves as a “log” of the localization stage and acts as an auxiliary feature for the subsequent image restoration step.
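As a concrete illustration, the sketch below (PyTorch) shows how the four outputs of the localization stage could be produced from the decoder's final feature map. The head structure, channel counts, and activation choices are our assumptions; the paper specifies only the outputs themselves.

```python
import torch
import torch.nn as nn

class LocalizationHeads(nn.Module):
    """Hypothetical output heads for the U-Net localization stage:
    mask M, opacity map alpha, extracted watermark W, and a 64-channel
    auxiliary feature tensor ("log") passed on to the restoration stage."""
    def __init__(self, in_ch=64):
        super().__init__()
        self.mask_head = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.alpha_head = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.watermark_head = nn.Conv2d(in_ch, 3, kernel_size=1)
        self.feature_head = nn.Conv2d(in_ch, 64, kernel_size=1)

    def forward(self, decoder_features):
        m = torch.sigmoid(self.mask_head(decoder_features))       # mask in [0, 1]
        a = torch.sigmoid(self.alpha_head(decoder_features))      # opacity in [0, 1]
        w = torch.sigmoid(self.watermark_head(decoder_features))  # watermark RGB
        feats = self.feature_head(decoder_features)               # 64-channel "log"
        return m, a, w, feats
```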
As illustrated in Figure 4, the FWCM is designed to harvest both spatial-domain and wavelet-domain cues from a feature map [56,57,58]. For an input tensor $X \in \mathbb{R}^{H \times W \times C}$, a dual-branch subnetwork applies [59,60] depthwise-separable convolutions with kernels of $5 \times 5$ and $3 \times 3$ to capture multiscale spatial information. The two outputs are concatenated to form a rich spatial feature representation. The concatenated feature map is then projected into the wavelet domain via the two-dimensional stationary wavelet transform (2-D SWT), which preserves resolution (no downsampling) and therefore retains finer details than the 2-D discrete wavelet transform [61]. For an input $I \in \mathbb{R}^{H \times W \times C}$,
$\{ I_{ij} \}_{i,j \in \{L, H\}} = \mathrm{SWT}(I),$
where the four subbands $I_{LL}, I_{LH}, I_{HL}, I_{HH} \in \mathbb{R}^{H \times W \times C}$. After the transform into the wavelet domain, convolutions are applied separately to the low-frequency ($LL$) and the aggregated high-frequency ($LH$, $HL$, $HH$) components—the low-frequency branch conveys coarse location cues of the watermark, whereas the high-frequency branch sharpens edge details. Finally, the processed subbands are brought back to the spatial domain via the inverse SWT and added to the original spatial features, producing a joint representation [62] that blends spatial and frequency information. The whole module can be summarized as
$X_{\mathrm{spatial}} = \mathrm{DSConv}_{\mathrm{multi}}(\mathrm{Split}(X))$
$X_{LL}, X_{LH}, X_{HL}, X_{HH} = \mathrm{SWT}(X_{\mathrm{spatial}})$
$X'_{LL} = \mathrm{DSConv}(X_{LL})$
$X'_{LH}, X'_{HL}, X'_{HH} = \mathrm{DSConv}(\mathrm{Concat}(X_{LH}, X_{HL}, X_{HH}))$
$X_{\mathrm{wavelet}} = \mathrm{SWT}^{-1}(X'_{LL}, X'_{LH}, X'_{HL}, X'_{HH})$
$X' = X_{\mathrm{spatial}} + X_{\mathrm{wavelet}},$
where $\mathrm{DSConv}_{\mathrm{multi}}$ denotes the parallel application of $3 \times 3$ and $5 \times 5$ depthwise-separable convolutions followed by channelwise concatenation.
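For readers who want a concrete reference, the following PyTorch sketch implements the data flow summarized above under two simplifying assumptions: a level-1 Haar stationary transform (for which, with this normalization, the inverse reduces to a plain sum of the subbands) and an even channel split. Kernel sizes, channel widths, and the exact wavelet used in FE-WRNet may differ; this is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSConv(nn.Module):
    """Depthwise-separable convolution (depthwise followed by pointwise)."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def haar_swt(x):
    """Level-1 stationary (undecimated) Haar transform; every subband keeps H x W."""
    xp = F.pad(x, (0, 1, 0, 1), mode="replicate")   # pad right/bottom for 2x2 filters
    a, b = xp[:, :, :-1, :-1], xp[:, :, :-1, 1:]
    c, d = xp[:, :, 1:, :-1], xp[:, :, 1:, 1:]
    ll = (a + b + c + d) / 4
    lh = (a + b - c - d) / 4
    hl = (a - b + c - d) / 4
    hh = (a - b - c + d) / 4
    return ll, lh, hl, hh

def haar_iswt(ll, lh, hl, hh):
    """Inverse of haar_swt: with this normalization the four subbands sum back
    to the input exactly (ll + lh + hl + hh == x)."""
    return ll + lh + hl + hh

class FWCM(nn.Module):
    """Sketch of the Fused Wavelet Convolution Mixer: split channels, apply 3x3
    and 5x5 depthwise-separable branches, refine LL and the stacked high-frequency
    subbands in the wavelet domain, and fuse back via the inverse transform."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch3 = DSConv(half, half, 3)
        self.branch5 = DSConv(channels - half, channels - half, 5)
        self.low_conv = DSConv(channels, channels, 3)
        self.high_conv = DSConv(3 * channels, 3 * channels, 3)

    def forward(self, x):
        half = x.shape[1] // 2
        x1, x2 = torch.split(x, [half, x.shape[1] - half], dim=1)
        x_spatial = torch.cat([self.branch3(x1), self.branch5(x2)], dim=1)
        ll, lh, hl, hh = haar_swt(x_spatial)
        ll = self.low_conv(ll)
        high = self.high_conv(torch.cat([lh, hl, hh], dim=1))
        lh, hl, hh = torch.chunk(high, 3, dim=1)
        x_wavelet = haar_iswt(ll, lh, hl, hh)
        return x_spatial + x_wavelet                 # residual spatial-wavelet fusion
```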

3.2.2. Watermark Removal

In typical watermarking practice [10], ownership is asserted by blending the watermark into the image according to
$X_w(p) = \alpha(p)\, W(p) + \big(1 - \alpha(p)\big)\, X(p),$
where $p = (i, j)$ represents a pixel, $\alpha(p) \in [0, 1]$ represents the spatially varying opacity, $X(p)$ represents the pristine image, $W(p)$ represents the watermark, and $X_w(p)$ represents the resulting watermarked image.
To refine the network’s predictions of $\alpha$ and $W$, we introduce a binary watermark mask $M(p) \in \{0, 1\}$ and define
$\alpha_M(p) = \alpha(p)\, M(p), \qquad W_M(p) = W(p)\, M(p).$
After the watermark-removal network yields estimates $\hat{M}$, $\hat{W}$, and $\hat{\alpha}$, the clean image can be recovered via the inverse of the blending operation:
$\hat{X}_1(p) = \dfrac{X_w(p) - \hat{\alpha}_M(p)\, \hat{W}_M(p)}{1 - \hat{\alpha}_M(p) + \epsilon},$
where $\hat{X}_1(p)$ is the reconstructed watermark-free result before image restoration, and $\epsilon = 10^{-6}$ is used to prevent division by zero.
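A minimal PyTorch sketch of this inversion step is given below; the tensor shapes and the final clamping are our assumptions.

```python
import torch

def invert_alpha_compositing(x_w, alpha_hat, w_hat, m_hat, eps=1e-6):
    """Recover the preliminary clean image from the predicted mask, opacity,
    and watermark by inverting the linear blending model.
    All inputs are tensors of shape (B, C, H, W) or broadcastable to it."""
    alpha_m = alpha_hat * m_hat                        # opacity restricted to the mask
    w_m = w_hat * m_hat                                # watermark restricted to the mask
    x1_hat = (x_w - alpha_m * w_m) / (1.0 - alpha_m + eps)
    return x1_hat.clamp(0.0, 1.0)                      # keep values in a valid image range
```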

3.2.3. Image Restoration

In theory, the watermark should be fully removed by the watermark removal stage. Nevertheless, prediction errors in either the mask or the watermark often leave residual artifacts that manifest as stains. We therefore introduce the third stage—image restoration—to eliminate these residual blemishes. This stage receives three inputs: (i) the auxiliary features from the watermark localization stage, (ii) the preliminary dewatermarked image $X_1$ from the watermark removal stage, and (iii) the watermarked image $X_w$. Its objective is to produce a refined output $X_1^*$. Because the transition from $X_1$ to $X_1^*$ concerns only a small set of pixels, we employ a lightweight architecture consisting of a few stacked convolutional layers; residual connections between stacks stabilize training.
Finally, since the watermarked and clean images are identical outside the watermark region, we fuse $X_w$ and $X_1^*$ under the guidance of the predicted mask to avoid unintended changes in watermark-free areas:
$\hat{X}(p) = X_w(p)\,\big(1 - \hat{M}(p)\big) + X_1^*(p)\, \hat{M}(p),$
where $\hat{X}(p)$ is the final dewatermarked result produced by FE-WRNet.
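The mask-guided fusion itself is a one-line operation; a sketch is shown below for completeness (the function name is ours).

```python
def mask_guided_fusion(x_w, x1_refined, m_hat):
    """Copy untouched pixels from the watermarked input and take refined pixels
    only inside the predicted watermark region."""
    return x_w * (1.0 - m_hat) + x1_refined * m_hat
```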

3.3. Discriminator

Considering that watermarks usually contaminate only localized regions of an image, we adopt a patch-based discriminator [63] to encourage our model to remove residual artifacts in microtextures and fine details rather than performing merely coarse watermark elimination [64]. The discriminator receives a concatenation of the watermarked image and its watermark-free counterpart as input and outputs a feature map whose value at each location represents the probability that the watermark-free image is real within the corresponding patch. Inspired by [65], we employ 2-D SWT so that the discriminator learns image representations in the wavelet domain, thereby better distinguishing true high-frequency details from artifacts. The detailed architecture of the discriminator is shown in Figure 5. When X is the ground truth image, the label maps should all be 1 (representing real); when X is the predicted image, the label maps should all be 0 (representing fake).
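The sketch below outlines a wavelet-domain patch discriminator consistent with this description, again assuming the level-1 Haar transform (haar_swt) from the FWCM sketch above; the layer widths and depth are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class WaveletPatchDiscriminator(nn.Module):
    """Patch discriminator that scores real/fake per local region after a
    level-1 stationary Haar transform of the (watermarked, candidate) pair.
    Reuses haar_swt defined in the FWCM sketch above."""
    def __init__(self, image_channels=3, base=64):
        super().__init__()
        in_ch = 2 * image_channels * 4          # two images, four subbands each
        layers, ch = [], in_ch
        for out in (base, base * 2, base * 4):
            layers += [nn.Conv2d(ch, out, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out
        layers.append(nn.Conv2d(ch, 1, kernel_size=4, padding=1))   # per-patch logits
        self.net = nn.Sequential(*layers)

    def forward(self, x_w, x_candidate):
        pair = torch.cat([x_w, x_candidate], dim=1)
        ll, lh, hl, hh = haar_swt(pair)          # judge in the wavelet domain
        return self.net(torch.cat([ll, lh, hl, hh], dim=1))
```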

3.4. Loss Function

For this task, we decompose the objective into an adversarial loss $\mathcal{L}_{adv}$, a pixel-level loss $\mathcal{L}_{pix}$, and a perceptual loss $\mathcal{L}_{per}$ [66]. The adversarial term enhances photorealism: during training, the generator G minimizes this objective, whereas the discriminator D maximizes it. The pixel-level term adopts the $\ell_1$ loss to supervise $W$, $M$, $\alpha$, and the two-stage nonwatermark images $X_1$ and $\hat{X}$. For $W$, $M$, and $\alpha$, whose colors are limited to a few discrete values, the loss is computed only in the spatial domain:
$\mathcal{L}_I = \lVert I - \hat{I} \rVert_1, \quad I \in \{W, M, \alpha\}.$
For $X_1$ and $\hat{X}$, which contain richer details, we transform them to the wavelet domain with a 2-D SWT and compute the loss there. Inspired by [67], we place a larger weight on the high-frequency subbands to drive the network to learn edges and textures [68,69,70]:
$\mathcal{L}_X = \lVert LL - \widehat{LL} \rVert_1 + \lambda_H \sum_{HF \in \{HL, LH, HH\}} \lVert HF - \widehat{HF} \rVert_1, \quad X \in \{X_1, \hat{X}\}.$
Here, $LL$, $HL$, $LH$, $HH$ are the four subbands obtained by a 2-D SWT on $X$, and $\lambda_H > 1$ penalizes errors in the high-frequency bands.
The complete pixel-level loss is
$\mathcal{L}_{pix} = \lambda_I \sum_{I \in \{W, M, \alpha\}} \mathcal{L}_I + \lambda_{X_1} \mathcal{L}_{X_1} + \lambda_{\hat{X}} \mathcal{L}_{\hat{X}},$
where $\lambda_I$, $\lambda_{X_1}$, $\lambda_{\hat{X}}$ balance the individual terms.
To further improve semantic coherence and texture continuity, we add a perceptual loss $\mathcal{L}_{per}$ computed with a pretrained VGG-16 [71]:
$\mathcal{L}_{per} = \lambda_{per} \sum_{k \in \{1, 2, 3\}} \lVert \Phi_k(\hat{X}) - \Phi_k(X) \rVert_2,$
where $\Phi_1$, $\Phi_2$, $\Phi_3$ denote shallow, intermediate, and deep feature maps, respectively, and $\lambda_{per}$ scales this term. The overall training objective is as follows.
$\mathcal{L} = \min_G \max_D \; \mathcal{L}_{adv} + \mathcal{L}_{pix} + \mathcal{L}_{per}.$
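The following sketch assembles the pixel-level part of this objective in PyTorch, reusing haar_swt from the FWCM sketch. The dictionary keys and default weights mirror the values reported in Section 4.1, but the concrete structure is an assumption for illustration only.

```python
import torch.nn.functional as F

def wavelet_weighted_l1(pred, target, lambda_h=2.0):
    """L1 loss over level-1 stationary Haar subbands; high-frequency bands are
    weighted by lambda_h > 1 to emphasize edges and fine text strokes."""
    ll_p, lh_p, hl_p, hh_p = haar_swt(pred)
    ll_t, lh_t, hl_t, hh_t = haar_swt(target)
    low = F.l1_loss(ll_p, ll_t)
    high = F.l1_loss(lh_p, lh_t) + F.l1_loss(hl_p, hl_t) + F.l1_loss(hh_p, hh_t)
    return low + lambda_h * high

def pixel_loss(preds, targets, lambda_i=10.0, lambda_x1=15.0, lambda_xhat=35.0):
    """Hybrid pixel-level loss: plain L1 on W, M, alpha; wavelet-weighted L1 on
    the preliminary (X1) and final (Xhat) dewatermarked images."""
    l_i = sum(F.l1_loss(preds[k], targets[k]) for k in ("W", "M", "alpha"))
    l_x1 = wavelet_weighted_l1(preds["X1"], targets["X"])
    l_xhat = wavelet_weighted_l1(preds["Xhat"], targets["X"])
    return lambda_i * l_i + lambda_x1 * l_x1 + lambda_xhat * l_xhat
```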

3.5. Summary

In our design, FE-WRNet integrates spatial–wavelet cues through the FWCM: multiscale depthwise-separable convolutions harvest spatial evidence, while the 2-D SWT separates low-frequency location cues from high-frequency edges and reunifies them via the inverse SWT to form a joint representation. The generator proceeds in three stages: (i) a U-Net–based localization head predicts the mask, opacity, watermark, and a 64-channel feature log; (ii) watermark removal inverts alpha compositing to obtain a preliminary clean image; and (iii) a lightweight residual restoration refines artifacts, followed by mask-guided fusion to preserve watermark-free regions. Robustness is further promoted by a wavelet-domain patch discriminator and a hybrid objective that emphasizes high-frequency subbands while maintaining perceptual fidelity. In practice, this design sharpens mask boundaries and helps separate watermark edges from document strokes. A caveat is that the model may still struggle with dense watermarks (see Figure 6), where the watermark is diffused throughout the image and overlaps heavily with the background text, making the two difficult to distinguish [72]. The restoration stage—intended to correct only a small number of pixels—may then be underpowered, making such cases challenging.

4. Experiment

Section 4 contains three parts. We first detail the basic experimental parameters and settings, then conduct an ablation study to validate the effectiveness of our design, and finally compare against related watermark-removal models to demonstrate superiority in both performance and efficiency.

4.1. Experimental Settings

Datasets: Two datasets, CLWD and TextLogo, are used to verify the effectiveness of the proposed method. CLWD provides 60,000 training samples and 10,000 test samples, whose background images are drawn from the corresponding training and test splits of PASCAL VOC 2012. CLWD is used to evaluate the watermark-removal ability of our method on generic natural images, whereas the TextLogo dataset that we construct assesses performance on structured document-like images. Baseline: As baselines, we employ several well-established watermark-removal networks: UNet [53], SplitNet [14], and WDNet [15]. All of them capture watermark features solely in the spatial domain; comparisons against these baselines highlight the advantage of our approach in learning watermark characteristics in the wavelet domain. Evaluation Metrics: Model performance is judged with widely accepted image-quality indicators: PSNR, RMSE, SSIM [73], and LPIPS [74], the latter two aligning more closely with human visual perception [68,75]. We additionally report FLOPs to quantify computational complexity. Training Details: All the experiments are conducted with PyTorch 2.1.2 on a single RTX 4090 (24 GB). Unless noted otherwise, we set $\lambda_I = 10$, $\lambda_{X_1} = 15$, $\lambda_{\hat{X}} = 35$, $\lambda_{per} = 1 \times 10^{-2}$, and train for 100 epochs. On CLWD, we set the batch size to 4 and the learning rates to $lr_g = lr_d = 2 \times 10^{-4}$. On TextLogo, because image sizes vary, we resize all samples to $512 \times 512$ and set the batch size to 2; the learning rates are $lr_g = lr_d = 1 \times 10^{-4}$.

4.2. Ablation Study

4.2.1. Analysis of the High-Frequency Penalty Coefficient

To verify the effectiveness of the high-frequency penalty coefficient $\lambda_H$ and determine its optimal value, we set $\lambda_H = 1, 2, 4$ and carried out three comparative experiments on the TextLogo dataset. The quantitative results, listed in Table 1, indicate that when $\lambda_H > 1$, the overall watermark-removal performance is noticeably better than that at $\lambda_H = 1$, thus confirming the utility of $\lambda_H$. The qualitative results shown in Figure 7 further demonstrate that increasing $\lambda_H$ makes the model focus more on edge information, leading to sharper watermark masks—exactly as intended. Nevertheless, we also observe that at $\lambda_H = 4$, the performance is slightly inferior to that at $\lambda_H = 2$; we conjecture that excessive emphasis on high-frequency details may cause the model to overlook low-frequency components, which are likewise part of the watermark. Consequently, we choose $\lambda_H = 2$ for all subsequent experiments.

4.2.2. Analysis of the FWCM

To verify the effectiveness of the proposed feature-fusion-based FWCM module, we conducted the following comparative experiments. In the spatial-only configuration, the wavelet transform operation within the FWCM was removed, so that feature extraction was performed solely in the spatial domain while the depth of the convolutional layers was kept unchanged. The quantitative results are summarized in Table 2, where the observed values of all four evaluation metrics consistently support the effectiveness of the FWCM’s spatial–wavelet feature fusion strategy. The qualitative results are presented in Figure 8. We observed that, in the version utilizing only spatial-domain features, the model tends to misidentify background text as watermark regions in certain test samples. In contrast, our model largely avoids such misclassification. We infer that when background text and watermarks exhibit similar characteristics in color or scale, spatial-domain features alone may lack sufficient discriminative power. By incorporating wavelet-domain features, the model can leverage texture and structural cues as complementary evidence, thereby reducing the ambiguity inherent in spatial-only representations.

4.3. Comparison with Other Watermark Removal Models

4.3.1. Comparisons on TextLogo

On our TextLogo dataset, the results in Table 3 show that FE-WRNet outperforms all baselines on all four metrics. The FWCM feature extractor, operating in parallel over the spatial and wavelet domains, simultaneously captures the watermark body color and edges. The hybrid loss weights high-frequency subbands with $\lambda_H = 2$, allowing the network to remove watermarks while precisely preserving the fine document microstructure. Meanwhile, FE-WRNet requires fewer inference FLOPs than UNet, SplitNet, and WDNet, reflecting a stronger accuracy–efficiency trade-off. Figure 9 visualizes the results. WDNet and SplitNet leave “hazy” artifacts under highly transparent watermarks; UNet mistakenly removes table lines or text during mask localization. In contrast, FE-WRNet predicts masks whose edges are sharp and tightly follow the true contours, leveraging the high-frequency constraint verified in Figure 7 to avoid erasing characters. The lightweight residual stacking in the restoration stage further suppresses residual color blotches, yielding coherent strokes and clean backgrounds. Visually, FE-WRNet realizes the ideal “remove the watermark only, keep the content intact,” which is consistent with the objective scores in Table 3.

4.3.2. Comparisons on CLWD

Furthermore, the results on the continuous-tone CLWD dataset are reported in Table 4. This outcome supports our domain-gap hypothesis: CLWD mainly contains natural scenes whose background energy is concentrated at low frequencies, whereas FE-WRNet is purposely designed to highlight mid- and high-frequency textures that carry document-character information, making it difficult to surpass SplitNet—originally tailored for natural images—for every metric. To bridge this, two lightweight routes are readily applicable: (i) reduce the high-frequency penalty and few-shot fine-tune the late-stage layers and FWCM heads on a small CLWD subset; (ii) adopt domain-adaptation techniques such as adaptive BatchNorm or Fourier-spectrum style mixing to better match CLWD statistics without altering the backbone [76].
Notably, although FE-WRNet lags behind SplitNet in PSNR and SSIM, its LPIPS is only 0.0129 higher, indicating that the perceptual quality remains acceptable after wavelet-domain alignment. Meanwhile, FE-WRNet attains the lowest inference FLOPs among the compared methods, indicating a favorable quality–efficiency trade-off on natural scenes. As shown in the visual results in Figure 10, FE-WRNet can match or surpass SplitNet in certain text-watermark scenarios. Overall, the scores are on par with or better than those of several mainstream watermark-removal networks, demonstrating that the proposed multidomain fusion strategy is robust to different data distributions.
In summary, FE-WRNet delivers the best PSNR/SSIM/LPIPS/RMSE with the lowest FLOPs on TextLogo; visual comparisons show sharp, well-aligned masks and clean backgrounds. Ablations confirm both components: (i) raising the wavelet high-frequency penalty ($\lambda_H > 1$) sharpens edges, with $\lambda_H = 2$ giving the best trade-off; (ii) the FWCM’s spatial–wavelet fusion consistently surpasses a spatial-only variant and reduces confusion between background glyphs and watermarks. On CLWD (natural scenes), FE-WRNet trails SplitNet in PSNR/SSIM but remains competitive perceptually (LPIPS +0.0129) while retaining the lowest FLOPs, and qualitatively matches or surpasses baselines when marks contain text-like strokes.

4.3.3. Application Test

To further validate the effectiveness of our model, we applied the Tesseract OCR toolkit to recognize the watermark-removed outputs produced by all evaluated models. Because we have not yet constructed a dedicated OCR dataset for our document images, we present a single illustrative case, as shown in Figure 11. The example indicates that our model removes watermarks more cleanly and thoroughly while preserving the background, which in turn yields higher OCR recognition accuracy. This indirectly substantiates, to some extent, the superiority of our approach for document-image watermark removal.

5. Conclusions and Future Work

This work introduced TextLogo, the first document-image watermark-removal dataset that jointly accounts for the diversity of watermark styles and document layouts. On this dataset, we proposed FE-WRNet, a novel model that fuses spatial- and wavelet-domain features to accurately delineate watermark edges while capturing the watermark in its entirety. Our experiments further revealed that applying a slightly increased penalty on high-frequency components leads to sharper predicted masks without compromising low-frequency color harmony. FE-WRNet consistently outperforms UNet, SplitNet, and WDNet on TextLogo, delivering clearer glyph contours and fewer residual artifacts. Moreover, it attains performance levels comparable to mainstream watermark-removal methods on CLWD, highlighting its adaptability to various real-world scenarios and underscoring the efficacy of spatial–wavelet collaborative feature modelling.
Despite these achievements, several improvements are still needed. We aim to address the following topics in future work:
  • Evaluation on real-document datasets: FE-WRNet has thus far been trained and evaluated only on synthetic watermark pairs. Its generalization to real-world document images—which often contain physical seals or translucent printed marks captured via scanning or photography—has not yet been validated. Such data introduce additional challenges such as uneven illumination, perspective distortion, and paper texture variations. FE-WRNet will therefore be tested on collections containing genuine stamps and signatures, such as StaVer and the Chinese Seal Dataset (CSD), to assess its robustness under realistic scanning conditions.
  • Video extension and temporal consistency: although the introduction emphasizes potential applicability to video-based watermark removal, FE-WRNet has not yet been experimentally verified in temporal scenarios. We will test FE-WRNet on consecutive frames from short clips, integrating lightweight temporal attention or optical-flow guidance and enforcing frame-to-frame consistency losses to suppress flicker [77].
  • Downstream OCR evaluation: to assess the practical benefits of watermark removal, we plan to compare OCR recognition accuracy before and after applying FE-WRNet, providing a quantitative measure of its effectiveness in real-document-processing pipelines.

Author Contributions

Conceptualization, X.W. and T.G.; methodology, Z.C., Y.Z., X.W. and Y.Q.; software, J.Y. and Y.Q.; validation, Y.Q.; formal analysis, X.W.; investigation, T.G., J.Y. and Y.Q.; resources, T.G., Q.M. and W.X.; data curation, Z.C. and T.G.; writing—original draft preparation, Z.C. and Y.Z.; writing—review and editing, J.Y., X.W., W.X. and Y.Q.; visualization, J.Y., X.W. and W.X.; supervision, X.W.; project administration, X.W.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Chongqing New YC Project under Grant CSTB2024YCJH-KYXM0126, the General Program of the Natural Science Foundation of Chongqing under Grant CSTB2024NSCQ-MSX0479, the Chongqing Postdoctoral Foundation Special Support Program under Grant 2023CQBSHTB3119, the China Postdoctoral Science Foundation under Grant 2024MD754244, and the Postdoctoral Fellowship Program of CPSF under Grant GZC20233322. J.Y. was supported by Grant GZC20233322; X.W. was supported by Grant CSTB2024YCJH-KYXM0126 and W.X. was supported by Grants CSTB2024NSCQMSX0479, 2023CQBSHTB3119, and 2024MD754244. The funders had no role in the study design, data collection, interpretation, or decision to submit the work for publication.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Y.; Zhu, Z.; Hou, J.; Wu, D. Spatial-temporal graph enhanced detr towards multi-frame 3D object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10614–10628. [Google Scholar] [CrossRef]
  2. Shen, Y.; Feng, Y.; Fang, B.; Zhou, M.; Kwong, S.; Qiang, B.-h. DSRPH: Deep semantic-aware ranking preserving hashing for efficient multilabel image retrieval. Inf. Sci. 2020, 539, 145–156. [Google Scholar] [CrossRef]
  3. Zhou, M.; Wei, X.; Wang, S.; Kwong, S.; Fong, C.K.; Wong, P.H.W.; Yuen, W.Y.F. Global Rate-Distortion Optimization-Based Rate Control for HEVC HDR Coding. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4648–4662. [Google Scholar] [CrossRef]
  4. Pei, S.-C.; Zeng, Y.C. A novel image recovery algorithm for visible watermarked images. IEEE Trans. Inf. Forensics Secur. 2006, 1, 543–550. [Google Scholar] [CrossRef]
  5. Zhang, W.; Zhou, M.; Ji, C.; Sui, X.; Bai, J. Cross-Frame Transformer-Based Spatio-Temporal Video Superresolution. IEEE Trans. Broadcast. 2022, 68, 359–369. [Google Scholar] [CrossRef]
  6. Zhou, M.; Zhang, Y.; Li, B.; Lin, X. Complexity Correlation-Based CTU-Level Rate Control with Direction Selection for HEVC. ACM Trans. Multimedia Comput. Commun. Appl. 2017, 13, 1–23. [Google Scholar] [CrossRef]
  7. Shen, W.; Zhou, M.; Luo, J.; Li, Z.; Kwong, S. Graph-Represented Distribution Similarity Index for Full Reference Image Quality Assessment. IEEE Trans. Image Process. 2024, 33, 3075–3089. [Google Scholar] [CrossRef] [PubMed]
  8. Shen, W.; Zhou, M.; Wei, X.; Wang, H.; Fang, B.; Ji, C.; Zhuang, X.; Wang, J.; Luo, J.; Pu, H.; et al. A blind video quality assessment method via spatiotemporal pyramid attention. IEEE Trans. Broadcast. 2024, 70, 251–264. [Google Scholar] [CrossRef]
  9. Lan, X.; Xian, W.; Zhou, M.; Yan, J.; Wei, X.; Luo, J. No-Reference Image Quality Assessment: Exploring Intrinsic Distortion Characteristics via Generative Noise Estimation with Mamba. IEEE Trans. Circuits Syst. Video Technol. 2025; early access. [Google Scholar] [CrossRef]
  10. Dekel, T.; Rubinstein, M.; Liu, C.; Freeman, W.T. On the Effectiveness of visible watermarks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 6864–6872. [Google Scholar]
  11. Li, X.; Lu, C.; Cheng, D.; Li, W.-H.; Cao, M.; Liu, B.; Ma, J.; Zheng, W.-S. Towards photorealistic visible watermark removal with conditional generative adversarial networks. In Proceedings of the Image and Graphics: 10th International Conference, ICIG 2019, Beijing, China, 23–25 August 2019; Proceedings, Part I 10. Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 345–356. [Google Scholar]
  12. Cao, Z.; Niu, S.; Zhang, J.; Wang, X. Generative adversarial network model for visible watermark removal. IET Image Process. 2019, 13, 1783–1789. [Google Scholar] [CrossRef]
  13. Goodfellow, I.J.; Pouget-Abadie, J.; Bengio, Y. Generative adversarial nets. In Proceedings of the Neural Information Processing Systems, NIPS, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
  14. Cun, X.; Pun, C.M. Split then refine: Stacked attention-guided resunets for blind single image visible watermark removal. AAAI Conf. Artif. Intell. 2021, 35, 1184–1192. [Google Scholar] [CrossRef]
  15. Liu, Y.; Zhu, Z.; Bai, X. Wdnet: Watermark-decomposition network for visible watermark removal. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3685–3693. [Google Scholar]
  16. Wei, X.; Li, J.; Zhou, M.; Wang, X. Contrastive distortion-level learning-based no-reference Image quality assessment. Int. J. Intell. Syst. 2022, 37, 8730–8746. [Google Scholar] [CrossRef]
  17. Zhou, M.; Wei, X.; Ji, C.; Xiang, T.; Fang, B. Optimum Quality Control Algorithm for Versatile Video Coding. IEEE Trans. Broadcast. 2022, 68, 582–593. [Google Scholar] [CrossRef]
  18. Liao, X.; Wei, X.; Zhou, M.; Kwong, S. Full-reference image quality assessment: Addressing content misalignment issue by comparing order statistics of deep features. IEEE Trans. Broadcast. 2023, 70, 305–315. [Google Scholar] [CrossRef]
  19. Wei, X.; Zhou, M.; Kwong, S.; Yuan, H.; Jia, W. A hybrid control scheme for 360-degree dynamic adaptive video streaming over mobile devices. IEEE Trans. Mob. Comput. 2021, 21, 3428–3442. [Google Scholar] [CrossRef]
  20. Xian, W.; Zhou, M.; Fang, B.; Kwong, S. A content-oriented no-reference perceptual video quality assessment method for computer graphics animation videos. Inf. Sci. 2022, 608, 1731–1746. [Google Scholar] [CrossRef]
  21. Wang, G.; Zhang, Y.; Li, B.; Fan, R.; Zhou, M. A fast and HEVC-compatible perceptual video coding scheme using a transform-domain Multi-Channel JND model. Multimed. Tools Appl. 2018, 77, 12777–12803. [Google Scholar] [CrossRef]
  22. Zhou, M.; Zhang, Y.; Li, B.; Hu, H.-M. Complexity-based Intra Frame Rate Control by Jointing Inter-Frame Correlation for High Efficiency Video Coding. J. Vis. Commun. Image Represent. 2016, 42, 46–64. [Google Scholar] [CrossRef]
  23. Wei, X.; Zhou, M.; Jia, W. Toward Low-Latency and High-Quality Adaptive 360 Streaming. IEEE Trans. Ind. Inform. 2022, 19, 6326–6336. [Google Scholar] [CrossRef]
  24. Souibgui, M.A.; Kessentini, Y. Degan: A conditional generative adversarial network for document enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1180–1191. [Google Scholar] [CrossRef]
  25. Li, M.; Sun, H.; Lei, Y.; Zhang, X.; Dong, Y.; Zhou, Y. High-fidelity document stain removal via a large-scale real-world dataset and a memory-augmented transformer. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 7614–7624. [Google Scholar]
  26. Gao, T.; Sheng, W.; Zhou, M.; Fang, B.; Luo, F.; Li, J. Method for Fault Diagnosis of a Temperature-Related MEMS Inertial Sensors that combine the Hilbert–Huang transform and deep learning. Sensors 2020, 20, 5633. [Google Scholar] [CrossRef] [PubMed]
  27. Yan, J.; Zhang, B.; Zhou, M.; Kwok, H.F.; Siu, S.W. Multi-Branch-CNN: Classification of ion channel interacting peptides using a multibranch convolutional neural network. Comput. Biol. Med. 2022, 147, 105717. [Google Scholar] [CrossRef]
  28. Yan, J.; Zhang, B.; Zhou, M.; Campbell-Valois, F.X.; Siu, S.W.I. A deep learning method for predicting the minimum inhibitory concentration of antimicrobial peptides against Escherichia coli using Multi-Branch-CNN and Attention. mSystems 2023, 8, E00345-23. [Google Scholar] [CrossRef]
  29. Wei, X.; Zhou, M.; Wang, H.; Yang, H.; Chen, L.; Kwong, S. Recent Advances in Rate Control: From Optimization to Implementation and Beyond. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 17–33. [Google Scholar] [CrossRef]
  30. Xian, W.; Zhou, M.; Fang, B.; Xiang, T.; Jia, W.; Chen, B. Perceptual Quality Analysis in Deep Domains Using Structure Separation and High-Order Moments. IEEE Trans. Multimed. 2024, 26, 2219–2234. [Google Scholar] [CrossRef]
  31. Zhang, K.; Cong, R.; Chen, J.; Zhou, M.; Jia, W. Low-light image enhancement via a frequency-based model with structure and texture decomposition. Acm Trans. Multimed. Comput. Commun. Appl. 2023, 19, 187. [Google Scholar] [CrossRef]
  32. Shen, W.; Zhou, M.; Chen, Y.; Wei, X.; Feng, Y.; Pu, H. Image Quality Assessment: Investigating Causal Perceptual Effects with Abductive Counterfactual Inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025. [Google Scholar]
  33. Zhou, M.; Xian, W.; Chen, B.; Fang, B.; Xiang, T.; Jia, W. HDIQA: A Hyper Debiasing Framework for Full-Reference Image Quality Assessment. IEEE Trans. Broadcast. 2024, 70, 545–554. [Google Scholar] [CrossRef]
  34. Zhou, M.; Li, J.; Wei, X.; Luo, J.; Pu, H.; Wang, W.; He, J.; Shang, Z. AFES: Attention-Based Feature Excitation and Sorting for Action Recognition. IEEE Trans. Consum. Electron. 2015, 71, 5752–5760. [Google Scholar] [CrossRef]
  35. Gan, Y.; Xiang, T.; Liu, H.; Ye, M.; Zhou, M. Generative adversarial networks with adaptive learning strategy for noise-to-image synthesis. Neural Comput. Appl. 2023, 35, 6197–6206. [Google Scholar] [CrossRef]
  36. Wei, X.; Song, J.; Pu, H.; Luo, J.; Zhou, M.; Jia, W. COFNet: Contrastive Object-aware Fusion using Box-level Masks for Multispectral Object Detection. IEEE Trans. Multimed. 2025; early access. [Google Scholar]
  37. Lang, S.; Liu, X.; Zhou, M.; Luo, J.; Pu, H.; Zhuang, X. A full-reference image quality assessment method via deep meta-learning and conformer. IEEE Trans. Broadcast. 2023, 70, 316–324. [Google Scholar] [CrossRef]
  38. Huang, Z.; Wang, X.; Li, X.; Li, Y.; Ding, S.; Li, Y.; Xiao, J. Regularized attentive capsule network for overlapped relation extraction. Expert Syst. Appl. 2024, 245, 122437. [Google Scholar]
  39. Zhang, Y.; Liu, Z.; Wu, Y.; Wang, X.; Wang, Y. Cross-modal identity correlation mining for visible-thermal person re-identification. IEEE Trans. Image Process. 2020, 29, 1761–1775. [Google Scholar] [CrossRef]
  40. Zhu, Z.; Hou, J.; Liu, H.; Zeng, H.; Hou, J. Learning efficient and effective trajectories for differential equation-based image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 9150–9168. [Google Scholar] [CrossRef] [PubMed]
  41. Chen, Y.; Yan, Z.; Ma, L. New insights on the generation of rain streaks: Generating-removing united unpaired image deraining network. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Xiamen, China, 13–15 October 2023; Springer Nature: Singapore, 2023; pp. 390–402. [Google Scholar]
  42. Li, J.; Feng, H.; Deng, Z.; Cui, X.; Deng, H.; Li, H. Image derain method for generative adversarial network based on wavelet high frequency feature fusion. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Shenzhen, China, 14–17 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 165–178. [Google Scholar]
  43. Chen, T.; Liu, M.; Gao, T.; Cheng, P.; Mei, S.; Li, Y. A fusion-based defogging algorithm. Remote Sens. 2022, 14, 425. [Google Scholar] [CrossRef]
  44. Wei, X.; Zhou, M.; Kwong, S.; Yuan, H.; Wang, S.; Zhu, G.; Cao, J. Reinforcement learning-based QoE-oriented dynamic adaptive streaming framework. Inf. Sci. 2021, 569, 786–803. [Google Scholar] [CrossRef]
  45. Zhou, M.; Zhao, X.; Luo, F.; Luo, J.; Pu, H.; Xiang, T. Robust rgb-t tracking via adaptive modality weight correlation filters and cross-modality learning. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 20, 95. [Google Scholar] [CrossRef]
  46. Luo, F.; Zhou, M.; Fang, B. Correlation Filters Based on Strong Spatio-Temporal for Robust RGB-T Tracking. J. Circuits Syst. Comput. 2022, 31, 2250041. [Google Scholar] [CrossRef]
  47. Li, J.; Fang, B.; Zhou, M. Multi-Modal Sparse Tracking by Jointing Timing and Modal Consistency. Int. J. Pattern Recognit. Artif. Intell. 2022, 36, 2251008. [Google Scholar] [CrossRef]
  48. Guo, Q.; Zhou, M. Progressive domain translation defogging network for real-world fog images. IEEE Trans. Broadcast. 2022, 68, 876–885. [Google Scholar] [CrossRef]
  49. Cheng, D.; Li, X.; Li, W.H.; Lu, C.; Li, F.; Zhao, H.; Zheng, W.S. Large-scale visible watermark detection and removal with deep convolutional networks. In Proceedings of the Pattern Recognition and Computer Vision: First Chinese Conference, PRCV 2018, Guangzhou, China, 23–26 November 2018; Proceedings, Part III 1. Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 27–40. [Google Scholar]
  50. Guo, T.; Peng, S.; Li, Y.; Zhou, M.; Truong, T.-K. Community-based social recommendation under local differential privacy protection. Inf. Sci. 2023, 639, 119002. [Google Scholar] [CrossRef]
  51. Li, L.; Zhang, X.; Wang, S.; Ma, S.; Gao, W. Content-adaptive parameters estimation for multi-dimensional rate control. IEEE Trans. Circuits Syst. Video Technol. 2016, 26, 117–129. [Google Scholar] [CrossRef]
  52. Li, L.; Wang, S.; Ma, S.; Gao, W. Region-based intra-frame rate-control scheme for high efficiency video coding. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 1304–1317. [Google Scholar]
  53. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Inter Vention-MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  54. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  55. Cheng, S.; Song, J.; Zhou, M.; Wei, X.; Pu, H.; Luo, J. Ef-detr: A lightweight transformer-based object detector with an encoder-free neck. IEEE Trans. Ind. Inform. 2024, 20, 12994–13002. [Google Scholar] [CrossRef]
  56. Li, Y.-l.; Feng, Y.; Zhou, M.-l.; Xiong, X.-c.; Wang, Y.-h.; Qiang, B.-h. DMA-YOLO: Multi-scale object detection method with attention mechanism for aerial images. Vis. Comput. 2023, 40, 4505–4518. [Google Scholar] [CrossRef]
  57. Song, J.; Zhou, M.; Luo, J.; Pu, H.; Feng, Y.; Wei, X. Boundary-aware feature fusion with dual-stream attention for remote sensing small object detection. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5600213. [Google Scholar] [CrossRef]
  58. Wei, X.; Song, J.; Pu, H.; Luo, J.; Zhou, M.; Jia, W. GAANet: Graph Aggregation Alignment Feature Fusion for Multispectral Object Detection. IEEE Trans. Ind. Inform. 2025; early access. [Google Scholar]
  59. Zhou, M.; Han, S.; Luo, J.; Zhuang, X.; Mao, Q.; Li, Z. Transformer-Based and Structure-Aware Dual-Stream Network for Low-Light Image Enhancement. ACM Trans. Multimed. Comput. Commun. Appl. 2025, 21, 293. [Google Scholar] [CrossRef]
  60. Zhou, Z.; Zhou, M.; Luo, J.; Pu, H.; Wei, L.H.U.X.; Jia, W. VideoGNN: Video Representation Learning via Dynamic Graph Modelling. ACM Trans. Multimed. Comput. Commun. Appl. 2025. [Google Scholar] [CrossRef]
  61. Starck, J.L.; Fadili, J.; Murtagh, F. The undecimated wavelet decomposition and its reconstruction. IEEE Trans. Image Process. 2007, 16, 297–309. [Google Scholar] [CrossRef]
  62. Zhang, Q.; Hou, J.; Qian, Y.; Zeng, Y.; Zhang, J.; He, Y. Flattening-net: Deep regular 2D representation for 3D point cloud analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9726–9742. [Google Scholar] [CrossRef]
  63. Tahmid, M.; Alam, M.S.; Rao, N.; Ashrafi, K.M.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Thiruvananthapuram, India, 25–26 November 2017; pp. 1125–1134. [Google Scholar]
  64. Zhao, L.; Shang, Z.; Tan, J.; Zhou, M.; Zhang, M.; Gu, D.; Zhang, T.; Tang, Y.Y. Siamese networks with an online reweighted example for imbalanced data learning. Pattern Recognit. 2022, 132, 108947. [Google Scholar] [CrossRef]
  65. Korkmaz, C.; Tekalp, A.M.; Dogan, Z. Training generative image superresolution models by wavelet-domain losses enables better control of artifacts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5926–5936. [Google Scholar]
  66. Johnson, J.; Alahi, A.; Li, F.-F. Perceptual losses for real-time style transfer and superresolution. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 694–711. [Google Scholar]
  67. Kim, M.W.; Cho, N.I. WHFL: Wavelet-domain high frequency loss for sketch-to-image translation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 744–754. [Google Scholar]
  68. Liao, X.; Wei, X.; Zhou, M.; Li, Z.; Kwong, S. Image Quality Assessment: Measuring Perceptual Degradation via Distribution Measures in Deep Feature Spaces. IEEE Trans. Image Process. 2024, 33, 4044–4059. [Google Scholar] [CrossRef]
  69. Hamedani, E.Y.; Aybat, N.S. Accelerated primal-dual mirror dynamics for centralized and distributed constrained convex optimization problems. J. Mach. Learn. Res. 2023, 24, 1–76. [Google Scholar]
  70. Duan, C.; Feng, Y.; Zhou, M.; Xiong, X.; Wang, Y.; Qiang, B.; Jia, W. Multilevel Similarity-Aware Deep Metric Learning for Fine-Grained Image Retrieval. IEEE Trans. Ind. Inform. 2023, 19, 9173–9182. [Google Scholar] [CrossRef]
  71. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  72. Zhang, Y.; Zhang, Q.; Zhu, Z.; Hou, J.; Yuan, Y. Glenet: Boosting 3D object detectors with generative label uncertainty estimation. Int. J. Comput. Vis. 2023, 131, 3332–3352. [Google Scholar] [CrossRef]
  73. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  74. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  75. Ren, S.; Hou, J.; Chen, X.; Xiong, H.; Wang, W. DDM: A Metric for Comparing 3D Shapes Using Directional Distance Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 6631–6646. [Google Scholar] [CrossRef]
  76. Xian, W.; Zhou, M.; Fang, B.; Liao, X.; Ji, C.; Xiang, T. Spatiotemporal feature hierarchy-based blind prediction of natural video quality via transfer learning. IEEE Trans. Broadcast. 2022, 69, 130–143. [Google Scholar] [CrossRef]
  77. Zhang, Y.; Hou, J.; Ren, S.; Wu, J.; Yuan, Y.; Shi, G. Self-supervised learning of lidar 3D point clouds via 2D-3D neural calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 9201–9216. [Google Scholar] [CrossRef]
Figure 1. The model trained on natural images fails to remove watermarks correctly on document images.
Figure 2. Some examples from the TextLogo dataset.
Figure 3. Illustration of our watermark-localization network. In the diagram, “Eq. 11” and “Eq. 10” refer to Equations (11) and (10) in the paper, respectively.
Figure 4. Illustration of the proposed Fused Wavelet Convolution Mixer (FWCM).
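As context for Figure 4: the FWCM fuses a spatial-domain convolution branch with a wavelet-domain branch so that both the watermark body and its high-frequency edges are captured. The PyTorch sketch below illustrates this general spatial–wavelet fusion pattern with a single-level Haar transform; it is not the authors' implementation, and the module names (SpatialWaveletMixer, HaarDWT, HaarIDWT), layer choices, and channel widths are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def haar_filters(channels):
    # Orthogonal 2x2 Haar analysis filters: LL, LH, HL, HH (one set per channel).
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    bank = torch.stack([ll, lh, hl, hh])              # (4, 2, 2)
    return bank.repeat(channels, 1, 1).unsqueeze(1)   # (4*C, 1, 2, 2)


class HaarDWT(nn.Module):
    """Single-level per-channel Haar decomposition via a stride-2 grouped conv."""
    def __init__(self, channels):
        super().__init__()
        self.channels = channels
        self.register_buffer("weight", haar_filters(channels))

    def forward(self, x):                              # (B, C, H, W) -> (B, 4C, H/2, W/2)
        return F.conv2d(x, self.weight, stride=2, groups=self.channels)


class HaarIDWT(nn.Module):
    """Exact inverse of HaarDWT (transposed grouped conv with the same filters)."""
    def __init__(self, channels):
        super().__init__()
        self.channels = channels
        self.register_buffer("weight", haar_filters(channels))

    def forward(self, x):                              # (B, 4C, H/2, W/2) -> (B, C, H, W)
        return F.conv_transpose2d(x, self.weight, stride=2, groups=self.channels)


class SpatialWaveletMixer(nn.Module):
    """Toy two-branch block: a spatial conv branch and a wavelet-subband branch,
    fused by a 1x1 conv with a residual connection. Widths are placeholders."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.dwt, self.idwt = HaarDWT(channels), HaarIDWT(channels)
        self.subband = nn.Sequential(                  # mixes the LL/LH/HL/HH subbands
            nn.Conv2d(4 * channels, 4 * channels, 3, padding=1), nn.GELU(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        s = self.spatial(x)
        w = self.idwt(self.subband(self.dwt(x)))       # edge/high-frequency-aware path
        return x + self.fuse(torch.cat([s, w], dim=1))


# quick shape check (spatial size should be even for the Haar branch)
out = SpatialWaveletMixer(32)(torch.randn(1, 32, 64, 64))
print(out.shape)  # torch.Size([1, 32, 64, 64])
```

Because the Haar high-frequency subbands respond strongly to edges, giving them a dedicated processing path is a simple way to emphasize watermark boundaries, which is the intuition a wavelet branch exploits.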
Figure 5. Illustration of the discriminator framework.
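Figure 5 depicts the discriminator used for adversarial training. As a generic point of reference only, the sketch below implements a PatchGAN-style discriminator in the spirit of conditional adversarial image-to-image models [63]; the layer count, normalization, and channel widths are assumptions and do not reproduce the paper's architecture.

```python
import torch
import torch.nn as nn


class PatchDiscriminator(nn.Module):
    """Generic PatchGAN-style discriminator that outputs a map of per-patch logits."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()

        def block(cin, cout, stride, norm=True):
            layers = [nn.Conv2d(cin, cout, 4, stride=stride, padding=1)]
            if norm:
                layers.append(nn.InstanceNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.net = nn.Sequential(
            *block(in_channels, base, 2, norm=False),
            *block(base, base * 2, 2),
            *block(base * 2, base * 4, 2),
            *block(base * 4, base * 8, 1),
            nn.Conv2d(base * 8, 1, 4, padding=1),   # real/fake logit per patch
        )

    def forward(self, x):
        return self.net(x)


# a 256x256 input yields a 30x30 map of patch logits
print(PatchDiscriminator()(torch.randn(1, 3, 256, 256)).shape)
```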
Figure 6. FE-WRNet fails to fully remove watermarks on document images with dense watermarks. From left to right: the watermarked input, the predicted mask, and the de-watermarked result produced by FE-WRNet.
Figure 7. Mask predictions of FE-WRNet on TextLogo with λ_H = 1, 2, 4. As λ_H increases, our method generates progressively sharper watermark masks.
Figure 8. Mask predictions on TextLogo for the FWCM and only-space configurations; FWCM generates more accurate watermark masks.
Figure 9. Qualitative results on the TextLogo dataset.
Figure 10. Qualitative results on the CLWD dataset.
Figure 11. A simple OCR-based verification across de-watermarked outputs.
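Figure 11 reports a simple OCR-based check that de-watermarked outputs remain machine-readable. The snippet below shows one possible way to run such a check, assuming Tesseract via pytesseract and a plain string-similarity score; the file names are hypothetical, and the OCR engine and scoring protocol used in the paper may differ.

```python
import difflib

import pytesseract              # pip install pytesseract (the Tesseract binary must be installed)
from PIL import Image


def ocr_similarity(image_path: str, reference_text: str) -> float:
    """Return a ratio in [0, 1]: how closely the OCR output of an image matches the reference."""
    text = pytesseract.image_to_string(Image.open(image_path))
    return difflib.SequenceMatcher(None, text.strip(), reference_text.strip()).ratio()


# Hypothetical file names: compare readability before and after watermark removal.
reference = pytesseract.image_to_string(Image.open("clean_ground_truth.png"))
for name in ["watermarked_input.png", "fewrnet_output.png"]:
    print(name, round(ocr_similarity(name, reference), 3))
```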
Table 1. Performance of our method on TextLogo with λ_H = 1, 2, 4. The arrows (↑, ↓) indicate whether higher or lower values are better. Bold indicates the best-performing result for each metric.

λ_H | PSNR ↑ | SSIM ↑ | LPIPS ↓
1 | 37.1172 | 0.9876 | 0.0118
2 | 37.5532 | 0.9897 | 0.0086
4 | 37.2040 | 0.9887 | 0.0097
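Table 1 varies λ_H, the weight of the wavelet-domain high-frequency term in the hybrid training objective (pixel, perceptual, and wavelet losses). The sketch below shows one plausible way to assemble such an objective: an L1 pixel term, a VGG-16 perceptual term, and an L1 penalty on the Haar high-frequency subbands weighted by λ_H. The backbone layer (relu3_3), the auxiliary weight lambda_perc, and the exact combination are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights


class HybridLoss(nn.Module):
    """Illustrative pixel + perceptual + wavelet high-frequency loss (not the paper's exact loss).
    lambda_h weights the high-frequency (LH/HL/HH) wavelet term, mirroring the
    lambda_H ablation in Table 1; lambda_perc is an arbitrary placeholder weight."""

    def __init__(self, lambda_h=2.0, lambda_perc=0.01):
        super().__init__()
        self.lambda_h, self.lambda_perc = lambda_h, lambda_perc
        # VGG-16 features up to relu3_3 as a generic perceptual extractor.
        # Inputs are assumed to lie in [0, 1]; ImageNet normalization is omitted for brevity.
        self.vgg = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    @staticmethod
    def _haar_highpass(x):
        # Stride-2 Haar analysis; keep only the three high-frequency subbands per channel.
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]], device=x.device)
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]], device=x.device)
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]], device=x.device)
        c = x.shape[1]
        bank = torch.stack([lh, hl, hh]).unsqueeze(1).repeat(c, 1, 1, 1)  # (3C, 1, 2, 2)
        return F.conv2d(x, bank, stride=2, groups=c)

    def forward(self, pred, target):
        l_pix = F.l1_loss(pred, target)
        l_perc = F.l1_loss(self.vgg(pred), self.vgg(target))
        l_hf = F.l1_loss(self._haar_highpass(pred), self._haar_highpass(target))
        return l_pix + self.lambda_perc * l_perc + self.lambda_h * l_hf
```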
Table 2. Performance on TextLogo for the FWCM and the only-space configuration. The arrows (↑, ↓) indicate whether higher or lower values are better. Bold indicates the best-performing result for each metric.

Configuration | PSNR ↑ | SSIM ↑ | LPIPS ↓
only space | 36.1617 | 0.9850 | 0.0140
FWCM | 37.5532 | 0.9897 | 0.0086
Table 3. Performance comparison on the TextLogo dataset. The arrows (↑, ↓) indicate whether higher or lower values are better. Bold indicates the best-performing result for each metric.

Models | PSNR ↑ | SSIM ↑ | LPIPS ↓ | RMSE ↓ | FLOPs ↓
UNet | 36.3627 | 0.9882 | 0.0096 | 4.4597 | 524.92G
SplitNet | 36.5808 | 0.9874 | 0.0106 | 4.3012 | 681.44G
WDNet | 36.7202 | 0.9872 | 0.0114 | 4.2350 | 560.28G
Ours | 37.5532 | 0.9897 | 0.0086 | 3.8511 | 422.98G
Table 4. Performance comparison on the CLWD dataset. The arrows (↑, ↓) indicate whether higher or lower values are better. Bold indicates the best-performing result for each metric.

Models | PSNR ↑ | SSIM ↑ | LPIPS ↓ | RMSE ↓ | FLOPs ↓
UNet | 31.2305 | 0.9534 | 0.0612 | 7.9124 | 131.24G
SplitNet | 34.3812 | 0.9688 | 0.0437 | 5.8872 | 170.36G
WDNet | 31.0243 | 0.9520 | 0.0638 | 8.5043 | 140.06G
Ours | 31.9011 | 0.9578 | 0.0566 | 7.8075 | 105.74G
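Tables 3 and 4 report PSNR, SSIM, LPIPS, and RMSE. The minimal evaluation sketch below computes these metrics for a single image pair, assuming scikit-image for PSNR/SSIM, the lpips package with an AlexNet backbone for LPIPS, and RMSE on the 0–255 scale; the paper's exact evaluation settings (LPIPS backbone, color handling, resizing) may differ.

```python
import numpy as np
import torch
import lpips                                    # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# LPIPS with an AlexNet backbone (a common choice; the paper's backbone is an assumption here).
lpips_fn = lpips.LPIPS(net="alex")


def evaluate_pair(restored: np.ndarray, clean: np.ndarray) -> dict:
    """restored/clean: HxWx3 uint8 arrays (de-watermarked prediction vs. ground truth)."""
    psnr = peak_signal_noise_ratio(clean, restored, data_range=255)
    ssim = structural_similarity(clean, restored, channel_axis=-1, data_range=255)
    diff = clean.astype(np.float64) - restored.astype(np.float64)
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0).float() / 127.5 - 1.0
    with torch.no_grad():
        lp = float(lpips_fn(to_t(restored), to_t(clean)))
    return {"PSNR": psnr, "SSIM": ssim, "RMSE": rmse, "LPIPS": lp}
```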
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
