Article

Removal of Color-Document Image Show-Through Based on Self-Supervised Learning

School of Computer and Control Engineering, Yantai University, Yantai 264005, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(11), 4568; https://doi.org/10.3390/app14114568
Submission received: 4 April 2024 / Revised: 19 May 2024 / Accepted: 23 May 2024 / Published: 26 May 2024
(This article belongs to the Special Issue AI-Based Image Processing: 2nd Edition)

Abstract

Show-through has long been a challenging issue in color-document image processing, which is widely used in fields such as finance, education, and administration. Existing methods for processing color-document images face several difficulties, including handling double-sided documents with show-through effects, accurately distinguishing between the foreground and the show-through components, and coping with the scarcity of real image data for supervised training. To overcome these challenges, this paper proposes a self-supervised-learning-based method for removing show-through effects in color-document images. The proposed method utilizes a two-stage show-through-removal network that incorporates a double-cycle consistency loss and a pseudo-similarity loss to effectively constrain the show-through-removal process. Moreover, we constructed two datasets with different show-through mixing ratios and conducted extensive experiments to verify the effectiveness of the proposed method. Experimental results demonstrate that the proposed method achieves competitive performance compared to state-of-the-art methods and can effectively remove show-through without the need for paired datasets. Specifically, the proposed method achieves an average PSNR of 33.85 dB on our datasets, outperforming comparable methods by a margin of 0.89 dB.

1. Introduction

In the current era of digitization, with the widespread use of electronic devices, the processing of textual data has predominantly shifted towards digital formats. To convert paper documents into a digital form, methods such as scanning or capturing photos using smartphones can be employed. Consequently, document image processing has become increasingly significant, encompassing key steps such as image acquisition, image pre-processing, layout analysis, and optical character recognition (OCR) [1]. However, during the process of acquiring document images, transparent or uneven materials may result in a show-through effect, where the content from the reverse side of the document becomes visible on the front side. This “show-through” phenomenon, also referred to as “back-to-front interference” or “bleeding”, significantly impacts subsequent text recognition and analysis. Therefore, document image pre-processing and optimization have emerged as important challenges that necessitate the application of image de-bleeding methods to restore and enhance the quality and legibility of document images.
Image restoration (IR) has long been a prominent and challenging research area in computer vision. In a multitude of application scenarios, IR is essential for delivering high-quality visual experiences and for enabling high-level visual applications. The primary objective of IR is to recover the ideal, pristine image x from the observed image y, a process typically described by the following model:
y = H(x) + n   (1)
Here, H(·) represents the transfer function, describing the transformations undergone by the image during acquisition and transmission, while n denotes additive noise, representing random interference.
Due to the diverse causes of image degradation, different IR tasks are closely associated with their corresponding transfer functions H(·) and noise terms n. For instance, in document image restoration, the transfer function H(·) can represent optical properties, color distortions, and other phenomena introduced during scanning, photographing, or other image-acquisition methods, while the noise term n represents random interference introduced during image acquisition and transmission. These transformations may give rise to a "show-through" phenomenon in document images, making text, graphics, colors, and other information from the reverse side visible through the page. Based on the appearance and underlying mechanisms of show-through images, the document image show-through model can be expressed as follows:
I_S = (1 − α) · I_GT + α · φ(S)   (2)
In this context, I_S represents the mixed image with a show-through effect; I_GT represents the foreground image (the front side of the document); S represents the background image (the back side of the document); and α represents the blending ratio of the show-through layer. Additionally, φ(·) denotes the bleeding-attenuation function. Resolving Equation (2) without any prior knowledge leads to an infinite number of solutions because S, α, and φ(·) are unknown, making the restoration of I_GT from I_S an ill-posed problem. Furthermore, the structure and attributes of the show-through layer often resemble the content of the foreground image, which makes it challenging to remove the unwanted show-through information while preserving the content and details of the foreground image.
So far, many studies have adopted methods based on generative adversarial networks (GANs) [2] to restore document images or natural images. Souibgui et al. [3] proposed a robust end-to-end framework called document-enhancement GAN, which effectively restores severely degraded document images using a conditional GAN [4]. However, these methods still require large numbers of paired images to train the model well, and it is difficult to obtain real, clean images that match degraded document images. In addition, Liu et al. [5] proposed a single-image dehazing method based on cycle GAN (CycleGAN) [6], which uses unpaired hazy images for training and constrains the haze-removal process through a double-cycle consistency loss. This method works well for hazy natural-scene images, but the double-cycle consistency approach has rarely been applied to document images.
Therefore, the removal of show-through effects in document images faces several challenges:
  • The foreground and show-through parts in document images are highly similar because they both comprise document content. Without additional prior knowledge, it is difficult to accurately recover the desired foreground images.
  • In real-world scenarios, there is a lack of authentic images that match the show-through images, making supervised training infeasible. As a result, most document image processing algorithms rely on synthetic training data. However, there exists a gap between synthetic and real-world data, which limits the performance of these algorithms.
To address these challenges, this paper proposes a self-supervised-learning-based method for removing show-through effects in document images. The method, called the CycleGAN-based color-document show-through-removal network (CDSR-CycleGAN), eliminates show-through effects without requiring paired datasets. The network adopts a two-stage structure and employs double-cycle consistency loss and pseudo-similarity loss constraints to guide the show-through-removal process. Two datasets, each consisting of different show-through mixing ratios, are constructed for training the document show-through-removal task.
The remaining parts of this article are organized as follows: Section 2 presents the related work; Section 3 introduces the proposed method; Section 4 presents the experimental results; Section 5 discusses the findings; and Section 6 concludes the paper.

2. Related Work

2.1. Show-through-Removal Method

Show-through refers to the presence of transparency or uneven materials in document images during the scanning or capturing process, resulting in a visible transparency effect. In 1994, Lins et al. [7] first described this “front-to-back interference” noise as “seen-through.” Later, researchers referred to it as “show-through” or “bleed-through.” The initial solution to this problem was proposed by Lins et al. [7], who suggested a “mirror-filtering” strategy that scans both sides of the document, aligns the two images, and compares the intensities of corresponding ink pixels. Sharma [8] initially analyzed the show-through phenomenon based on physical principles and derived a simplified mathematical model. Subsequently, they linearized the model and designed an adaptive linear filtering scheme to eliminate the show-through effect. Rowley-Brooke et al. [9] introduced a database of degraded document images affected by bleed-through and demonstrated its utility by evaluating three non-blind bleed-through removal techniques against manually created ground-truth foreground masks. Moghaddam et al. [10] segmented images from double-sided ancient documents using a level-set framework and eliminated the influence of bleed-through effects. The method comprises three key steps: binarization using adaptive global thresholds, separation of interference patterns and genuine text using reverse diffusion forces, and smoothing boundaries using small regularization forces. However, due to variations in document image quality, the performance of traditional methods varies, requiring a large number of empirical parameters to construct suitable solutions. He et al. [11] proposed a novel DL-based document-enhancement-and-binarization model (DeepOtsu), utilizing iterative refinement and superposition to achieve visually pleasing results. Hanif et al. [12] proposed a document-restoration method aimed at eliminating unwanted interfering degradation patterns from ancient manuscripts containing color.

2.2. GAN

GAN, proposed by Ian Goodfellow et al. [2] in 2014, is a deep learning (DL) [13] model consisting of a generator and a discriminator. Its basic principle is to train the generator and discriminator in an adversarial manner, allowing the generator to gradually generate realistic samples. GAN has been widely applied in various fields such as images, audio, and text due to its unique generation capabilities and capacity for unsupervised learning.
Poddar et al. [14] proposed TBM-GAN, a generative-adversarial-network framework for synthesizing realistic handwritten documents with degraded backgrounds. De et al. [15] developed a DL-based document image-binarization model employing a dual-discriminator GAN and focal loss as the generator loss. Suh et al. [16] introduced a two-stage method for color-document image enhancement and binarization using independent GANs. Lin et al. [17] suggested a training method using GANs to address limited data availability. Ju et al. [18] proposed CCDWT-GAN, a color-channel-based discrete wavelet transform GAN, for efficient extraction of text information from color degraded document images. Zou et al. [19] introduced a method called "deep adversarial decomposition" for separating the individual layers of a single superimposed image; it combines adversarial training, a separation-critic network, and a crossroad L1 loss function to effectively handle both linear and non-linear mixtures. However, these GAN-based methods require large amounts of paired data. Therefore, in this paper, we choose to use CycleGAN, which does not require paired images.

2.3. Cycle Consistency Loss

CycleGAN is a special variant of traditional GAN proposed by Zhu et al. [6] in 2017. There are some problems in previous GANs, such as difficult training, unstable generation results, and the need for paired training data. To solve these problems, CycleGAN introduces the cycle consistency loss function.
Gangeh et al. [20] proposed an end-to-end unsupervised multi-document image blind denoising method, which provides a unified model that can remove various artifacts from various document types without the need for paired clean images. Torbunov et al. [21] proposed a new image-transformation model that achieves better performance while maintaining cycle consistency constraints, solving problems in the field of unpaired image transformation. The model uses a vision transformer and adopts necessary training and regularization techniques to achieve better performance. Wu et al. [22] proposed an unsupervised method based on cycle consistency for blind image restoration. By constraining low-frequency content, optimizing the content of training data, and using model-averaging techniques, the quality of image restoration is improved. Wang et al. [23] proposed the UDoc-GAN framework to solve the problem of uncontrolled lighting affecting document images captured on mobile devices. UDoc-GAN performs document light correction in non-paired settings, learning the relationship between normal and abnormal lighting domains by predicting environmental light features and redefining cycle consistency constraints. Xu et al. [24] proposed a single-image bleed-through image-restoration method based on self-supervised learning. By leveraging the cyclic consistency of generative adversarial networks and the transfer-learning capability of constraints, this method achieves model training without paired image sets. The approach incorporates a self-supervised-learning module designed to extract supervised information from large-scale unsupervised data, effectively enhancing the quality of texture, edges, and other detailed information in the bleed-through image content, thus enabling the removal of bleed-through from a single image.
The above methods have verified the effectiveness of CycleGAN and cycle consistency loss. Building on these findings, and noting that CycleGAN lacks accurately matched label supervision during training, we designed a similarity network to provide a pseudo-similarity loss and used a double-cycle consistency loss to constrain the training process.

3. Proposed Methods

3.1. Overall Network Framework

The overall network framework of the self-supervised-learning-based color document show-through-removal network, CDSR-CycleGAN, is illustrated in Figure 1 and Figure 2. CDSR-CycleGAN transforms the unpaired image show-through-removal problem into an image-to-image generation problem, generating non-show-through images through a self-supervised-learning cycle-consistency network. This section provides a detailed description of the network composition, which consists of two parts: a show-through-removal framework and a show-through-generation framework.
In the diagram, X represents the input show-through image. De represents the show-through-removal generator, and De(X) represents the show-through-removal image. Y represents the clear image. Re represents the show-through generator, and Re(Y) represents the synthetic show-through image. S represents the similarity network; S_De and S_Re are, respectively, used to generate similarity non-show-through images and similarity show-through images. D_showthrough is the adversarial discriminator for classifying between the real show-through image and the generated show-through image. D_nonshowthrough is the adversarial discriminator for distinguishing between the real non-show-through image and the show-through-removal image. The light blue lines in the figure represent the pseudo-similarity loss and the cycle-consistency loss, respectively.
Since Figure 1 and Figure 2 share a similar framework structure, we use Figure 1 as an example to illustrate the procedural steps of our method. As shown in Figure 1, the input image X with show-through is fed into the show-through-removal generator De, resulting in the bleeding-free image De(X). Then, De(X) is passed through both the show-through generator Re and the similarity network S_De, producing the bleeding-added image Re(De(X)) and the similarity image S(De(X)), respectively. Subsequently, the cycle consistency loss_1 and the pseudo-similarity loss are computed. Both De(X) and S(De(X)) are then fed into the discriminator D_nonshowthrough to determine whether they are bleeding-removed images. Finally, S(De(X)) is input into the show-through generator Re to generate the bleeding-added image Re(S(De(X))), and the cycle consistency loss_2 is computed.
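For readers who prefer code, the following PyTorch-style sketch traces the same data flow for one show-through-removal step. The generator and discriminator modules (De, Re, S_De, D_nonshowthrough) are treated as black boxes here, and the binary-cross-entropy form of the adversarial term is an assumption; the exact loss definitions are given in Section 3.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

l1 = nn.L1Loss()

def generator_target(pred):
    # the generator wants the discriminator to label its outputs as "real"
    return F.binary_cross_entropy_with_logits(pred, torch.ones_like(pred))

def removal_branch_step(X, De, Re, S_De, D_non):
    """One forward pass of the show-through-removal branch in Figure 1 (sketch)."""
    De_X = De(X)                        # bleeding-free estimate De(X)
    Re_De_X = Re(De_X)                  # re-synthesized show-through image Re(De(X))
    S_De_X = S_De(De_X)                 # similarity image S(De(X))

    cycle_loss_1 = l1(Re_De_X, X)       # cycle consistency loss_1
    pseudo_sim_loss = l1(S_De_X, De_X)  # pseudo-similarity loss

    # both De(X) and S(De(X)) are judged by the non-show-through discriminator
    adv_loss = generator_target(D_non(De_X)) + generator_target(D_non(S_De_X))

    Re_S_De_X = Re(S_De_X)              # Re(S(De(X)))
    cycle_loss_2 = l1(Re_S_De_X, X)     # cycle consistency loss_2

    return cycle_loss_1, cycle_loss_2, pseudo_sim_loss, adv_loss
```

The show-through-generation branch in Figure 2 mirrors this structure with the roles of De and Re exchanged.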

3.2. Generative Network

(1)
Generator De
Generator De is an integral component of the CDSR-CycleGAN model, aimed at effectively removing the show-through effect from input images and restoring the original image. As illustrated in Figure 3, it consists of three sub-networks: De_remove, De_predict, and De_refine. The first stage includes De_remove and De_predict. The primary role of the De_remove network is to initially eliminate the show-through effect of the image, making it easier for subsequent processes to restore the original image. The De_predict network is used to predict and locate foreground information, providing precise guidance for further processing. Lastly, in the second stage, the De_refine network adjusts and restores the image with removed show-through, further enhancing its visual quality and realism. Specifically, the channel numbers of De_remove and De_predict are set to 64 for consistency, while the channel numbers of De_refine are set sequentially to 64, 128, 256, 128, 64, and 3, capturing image features and details more effectively. Through the coordinated efforts of these three sub-networks, generator De efficiently removes the show-through effect from input images and generates visually superior and detail-rich images.
Both De_remove and De_predict take images with show-through effects as input. De_remove first performs shallow feature extraction to obtain preliminary features. Next, a triple feature attention module (TFAM) with multiple skip connections is applied to achieve deeper feature attention. Subsequently, further processing is conducted using two convolutional networks and a long-skip global residual learning module. Ultimately, an image with the show-through effect initially removed is obtained.
The TFAM network structure, illustrated in Figure 4, consists of three feature attention modules (FAMs) and a convolutional layer. The FAM network structure, shown in Figure 5, comprises local residual learning and channel-pixel attention (CPA) modules. Local residual learning allows less-important bleeding information to bypass the main path through multiple local residual connections, while the main network focuses on effective foreground information. CPA retains shallow information and transfers it to deep layers, allowing De_remove to focus more on recovering effective information such as show-through overlapping regions, detail textures, and color accuracy. This not only enhances the restoration accuracy of De_remove but also reduces redundancy and improves efficiency.
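The FAM described above can be sketched as follows. The 64-channel width matches the text, but the reduction ratio inside the attention branches and the exact convolution layout are assumptions; the block is only meant to illustrate how local residual learning combines with channel-pixel attention.

```python
import torch
import torch.nn as nn

class ChannelPixelAttention(nn.Module):
    """Channel attention followed by pixel attention (CPA) -- a sketch."""
    def __init__(self, channels=64, reduction=8):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global statistics per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.pixel_att = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1),        # one spatial attention map
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_att(x)   # re-weight channels
        return x * self.pixel_att(x)  # re-weight spatial positions

class FAM(nn.Module):
    """Feature attention module: local residual learning + CPA (sketch)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.cpa = ChannelPixelAttention(channels)

    def forward(self, x):
        res = torch.relu(self.conv1(x))
        res = self.cpa(self.conv2(res))
        return x + res   # local residual connection lets less-important information bypass
```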
The De_predict network is used to predict and locate foreground information. First, the receptive field is expanded through four sets of dilated convolutions and standard convolution operations, enabling the network to capture more contextual information and improving the accuracy and precision of foreground prediction. Next, the global and local features of the De_predict network are integrated through long paths to extract more robust features. Attention mechanisms are then employed to quickly capture foreground information hidden in complex backgrounds: a 1 × 1 convolutional layer compresses the features of layer 13 into a vector, which serves as weights to adjust the feature representation of the previous step; these adjusted weights are element-wise multiplied with the output of layer 12 to extract more significant foreground features. Finally, the extracted foreground features are reconstructed together with the original input image to obtain a clear image.
De_refine consists of an encoder and a decoder, using an encoder-decoder structure similar to that of CycleGAN. Nine context-aware blocks (CABs) are stacked between the encoder and decoder to increase the receptive field and contextual dependency, as shown in Figure 6. A CAB achieves this through feature transformation, a context-attention mechanism, and residual connections. The encoder and decoder are connected through residual connections, allowing the De_refine network to learn an end-to-end mapping, which speeds up convergence and improves the quality of the generated images. In the De_refine network, the encoder extracts features and downsamples the input image through components such as convolution, instance normalization (IN), a ReLU activation function, dropout, and max pooling. The decoder then upsamples and reconstructs the features extracted by the encoder through components such as transposed convolution, IN, the ReLU activation function, and dropout.
(2)
Generator Re
The generator Re (show-through-generation network) is another component of the CDSR-CycleGAN model, and its network structure is shown in Figure 7. Its role is to take input images without the show-through effect and generate the corresponding show-through images. Unlike the other generator De, which focuses on generating cycle mappings, the generator Re specifically focuses on generating show-through mappings.
Re is the network used for show-through generation and consists of two sub-networks: the foreground prediction network Re_predict, which predicts and locates the foreground information, and the background blur network Re_blur, which blurs the background. Re_predict has a network structure similar to that of De_predict but differs in the way it reconstructs the image, and Re_blur has a network structure similar to that of De_refine but serves a different purpose. Re_predict combines the extracted foreground features with the original input image to obtain a clear foreground image with known foreground positions. Re_blur blurs the input image and provides a foundation for adding the back-to-front interference effect. Subsequently, the show-through effect is simulated around the predicted foreground information, further enhancing the realism of the image.
(3)
Similarity Network  S
The similarity network S is a crucial component of the CDSR-CycleGAN model. In the absence of labeled values, this paper utilizes S to provide unsupervised similarity labels and compute a pseudo-similarity loss. In practical applications, S is employed to generate similarity images for both the De generator and the Re generator; it is therefore referred to as the similarity-De network S_De in the De architecture and the similarity-Re network S_Re in the Re architecture.
The architecture of S is illustrated in Figure 8 and is the same for the Re branch as for the De branch. First, a 7 × 7 initial convolution is applied, followed by instance normalization and ReLU activation. Subsequently, downsampling is performed twice using 3 × 3 convolution kernels, IN, and ReLU activation, gradually increasing the number of channels. Feature extraction is then conducted through nine CABs, as shown in Figure 6. Finally, two upsampling operations are performed using deconvolution blocks, each consisting of a 3 × 3 deconvolution, IN, and ReLU activation, with a gradual reduction in channel numbers. Ultimately, the output of network S is generated using a 7 × 7 convolution kernel and mapped to the range (−1, 1) through the tanh activation function.
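A minimal sketch of this layer sequence is given below. The context-aware block is replaced by a plain residual block placeholder because the internals of Figure 6 are not reproduced here, and the 64 → 128 → 256 channel widths follow the common CycleGAN convention; both are assumptions.

```python
import torch.nn as nn

class CABPlaceholder(nn.Module):
    """Stand-in for the context-aware block (CAB): a residual conv block.
    The actual CAB additionally contains a context-attention mechanism."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)

def build_similarity_network(in_ch=3, base=64, n_blocks=9):
    """Similarity network S, following the layer sequence described above (sketch)."""
    layers = [nn.Conv2d(in_ch, base, 7, padding=3),
              nn.InstanceNorm2d(base), nn.ReLU(inplace=True)]
    ch = base
    for _ in range(2):                                      # two downsampling steps
        layers += [nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),
                   nn.InstanceNorm2d(ch * 2), nn.ReLU(inplace=True)]
        ch *= 2
    layers += [CABPlaceholder(ch) for _ in range(n_blocks)]  # nine CABs
    for _ in range(2):                                      # two upsampling steps
        layers += [nn.ConvTranspose2d(ch, ch // 2, 3, stride=2,
                                      padding=1, output_padding=1),
                   nn.InstanceNorm2d(ch // 2), nn.ReLU(inplace=True)]
        ch //= 2
    layers += [nn.Conv2d(ch, in_ch, 7, padding=3), nn.Tanh()]  # map output to (-1, 1)
    return nn.Sequential(*layers)
```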
Through the combined action of the De generator, the Re generator, and the similarity network S, the CDSR-CycleGAN model achieves high-quality show-through removal and show-through generation.

3.3. Discriminative Network

For the show-through-removal task, the discriminator identifies the differences between the generated images and the corresponding real images. In the show-through-removal network, two discriminators with identical network structures are employed, as shown in Figure 9. The discriminator D_showthrough differentiates between input show-through images and generated show-through images, while the discriminator D_nonshowthrough distinguishes between the non-show-through images generated by the generator De and the input non-show-through images. We utilize five convolution blocks for classifying real and fake images, where each convolution layer has a kernel size of 4 × 4, a stride of 2, and channel numbers of 64, 128, 256, 512, and 1, from low to high. Within each convolution block, a convolution layer, LReLU activation function, and IN normalization layer are applied sequentially.
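The discriminator can be sketched as follows. Treating the final 1-channel block as a plain convolution (without IN and LReLU) is an assumption borrowed from common PatchGAN implementations; the paper only specifies the kernel size, stride, and channel progression.

```python
import torch.nn as nn

def build_discriminator(in_ch=3):
    """PatchGAN-style discriminator with the 4x4/stride-2 blocks described above (sketch)."""
    channels = [64, 128, 256, 512]
    layers, prev = [], in_ch
    for ch in channels:
        layers += [nn.Conv2d(prev, ch, 4, stride=2, padding=1),
                   nn.LeakyReLU(0.2, inplace=True),
                   nn.InstanceNorm2d(ch)]
        prev = ch
    # final block outputs a 1-channel real/fake score map
    layers.append(nn.Conv2d(prev, 1, 4, stride=2, padding=1))
    return nn.Sequential(*layers)
```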

3.4. Loss Functions

To train the network, we employ multiple loss functions that capture the different objectives of the generators and discriminators.
(1)
Adversarial loss L_adv
The adversarial loss is used for adversarial learning between the generator and discriminator. The goal of the generator is to generate realistic fake images that the discriminator cannot distinguish from real images, while the goal of the discriminator is to precisely differentiate between real and fake images.
For the generators Re and S and the discriminator D_showthrough, the adversarial loss for generating show-through images is defined as shown in Equation (3). This loss encourages the generator to produce images that resemble the real show-through images, while the discriminator aims to distinguish the generated show-through images from the real show-through images.
L_re_adv(Re, S, D_showthrough, Y, X) = E_{X∼p_showthrough(X)}[log D_showthrough(X)] + E_{Y∼p_nonshowthrough(Y)}[log(1 − D_showthrough(Re(Y)))] + E_{Y∼p_nonshowthrough(Y)}[log(1 − D_showthrough(S(Re(Y))))]   (3)
In the equation, Re represents the show-through-generation network, S denotes the similarity network (S_Re in this branch), which provides the similarity-guided show-through generation, and D_showthrough aims to distinguish between the generated show-through images Re(Y) and the real show-through images X. Similarly, the adversarial loss for non-show-through images is defined as shown in Equation (4). This loss encourages the generator to produce images that resemble the real non-show-through images, while the discriminator aims to differentiate between the generated show-through-removal images and the input non-show-through images.
L_de_adv(De, S, D_nonshowthrough, X, Y) = E_{Y∼p_nonshowthrough(Y)}[log D_nonshowthrough(Y)] + E_{X∼p_showthrough(X)}[log(1 − D_nonshowthrough(De(X)))] + E_{X∼p_showthrough(X)}[log(1 − D_nonshowthrough(S(De(X))))]   (4)
In the aforementioned equations, De, S, and Re minimize the target values, while the discriminators D_showthrough and D_nonshowthrough attempt to maximize them. However, when only the adversarial loss is used, artifacts may appear in the generated results.
(2)
Cycle consistency loss L_cycle
To better correct the show-through images and preserve more details, we introduce a cycle consistency loss [6] for show-through images. The goal of this loss is to minimize the difference between the show-through image X and its reconstructed show-through images Re(De(X)) and Re(S(De(X))), as well as the difference between the non-show-through image Y and its reconstructed non-show-through images De(Re(Y)) and De(S(Re(Y))) (as illustrated in Figure 1 and Figure 2). Through this loss, consistency is maintained after an image has been transformed in both directions; that is, an image transformed into the show-through domain should be able to be restored back to its original domain. This loss helps to reduce information loss and preserve the content of the images.
The cycle consistency loss from domain  X  to domain  Y  is defined as follows:
L_cycle_de(De, S, Re) = E_{X∼p_showthrough(X)}[||Re(De(X)) − X||_1] + E_{X∼p_showthrough(X)}[||Re(S(De(X))) − X||_1]   (5)
Simultaneously, the cycle consistency loss from domain  Y  to domain  X  is defined as follows:
L_cycle_re(Re, S, De) = E_{Y∼p_nonshowthrough(Y)}[||De(Re(Y)) − Y||_1] + E_{Y∼p_nonshowthrough(Y)}[||De(S(Re(Y))) − Y||_1]   (6)
The overall definition of the cycle consistency loss is the sum of the losses from both directions and is shown in Equation (7), as follows:
L_cycle(Re, S, De) = L_cycle_de(De, S, Re) + L_cycle_re(Re, S, De)   (7)
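A compact PyTorch sketch of Equations (5)-(7) is given below; S_De and S_Re denote the similarity networks attached to the two branches, and the expectations are approximated by the batch means of the L1 terms.

```python
import torch.nn.functional as F

def double_cycle_loss(X, Y, De, Re, S_De, S_Re):
    """Double-cycle consistency loss of Equations (5)-(7) (sketch)."""
    # domain X (show-through) -> Y (non-show-through) -> X
    loss_de = F.l1_loss(Re(De(X)), X) + F.l1_loss(Re(S_De(De(X))), X)
    # domain Y (non-show-through) -> X (show-through) -> Y
    loss_re = F.l1_loss(De(Re(Y)), Y) + F.l1_loss(De(S_Re(Re(Y))), Y)
    return loss_de + loss_re
```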
(3)
Identity loss L_id
The identity loss function for show-through images plays a pivotal role in CycleGAN, as it ensures color consistency between the input and output bleeding images. This prevents the generator from arbitrarily altering the color tone of the images and ensures that the generated results align with our expectations.
The identity loss is defined as shown in Equation (8):
L_id(Re, De) = E_{X∼p_showthrough(X)}[||Re(X) − X||_1] + E_{Y∼p_nonshowthrough(Y)}[||De(Y) − Y||_1]   (8)
(4)
Pseudo-similarity loss L_ps
Because CycleGAN lacks label supervision during training, we introduce a similarity network to generate pseudo-labels and calculate a similarity loss between the generated images. This helps the generators produce more realistic non-show-through and show-through images, thereby improving the quality and realism of the generated results: as the similarity loss is minimized, the generated images become more similar to the real images, leading to better transformation outcomes.
The pseudo-similarity loss is defined as shown in Equation (9):
L_ps(De, S, Re) = E_{X∼p_showthrough(X)}[||S(De(X)) − De(X)||_1] + E_{Y∼p_nonshowthrough(Y)}[||S(Re(Y)) − Re(Y)||_1]   (9)
(5)
Perceptual loss L_perceptual
To generate images that are more semantically and visually similar to the targets, we introduce a perceptual loss based on a pre-trained VGG19 [25] to constrain the generators. The perceptual loss preserves the structure of the original images by combining features extracted from the second and fifth pooling layers of VGG19. Under the constraints of the generators De: X → Y and Re: Y → X, the perceptual loss is expressed in Equation (10) as follows:
L_perceptual = ||δ(X) − δ(Re(De(X)))||_2^2 + ||δ(Y) − δ(De(Re(Y)))||_2^2   (10)
Here,  ( X , Y )  represents the non-paired image sets of show-through and non-show-through images, respectively, and  δ  is the feature extractor of VGG19.
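A possible PyTorch implementation of this perceptual loss is sketched below. The torchvision layer indices 9 and 36 for the second and fifth pooling layers, the ImageNet weights, and the mean-squared aggregation of the feature differences are assumptions, not details taken from the released code.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGPerceptualLoss(nn.Module):
    """Perceptual loss on the pool2 and pool5 features of a pre-trained VGG19 (sketch)."""
    def __init__(self):
        super().__init__()
        # older torchvision versions use vgg19(pretrained=True) instead of the weights enum
        features = vgg19(weights="IMAGENET1K_V1").features.eval()
        self.pool2 = nn.Sequential(*features[:10])   # up to and including the 2nd pooling layer
        self.pool5 = nn.Sequential(*features[:37])   # up to and including the 5th pooling layer
        for p in self.parameters():
            p.requires_grad_(False)                  # the feature extractor delta stays frozen

    def forward(self, pred, target):
        loss = 0.0
        for extractor in (self.pool2, self.pool5):
            loss = loss + torch.mean((extractor(pred) - extractor(target)) ** 2)
        return loss
```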
(6)
Total loss L_total
Combining the above loss terms, the final total loss function can be represented as follows, in Equation (11):
L_total = λ1·L_re_adv + λ2·L_de_adv + λ3·L_cycle + λ4·L_id + λ5·L_ps + λ6·L_perceptual   (11)
Here, λ1 = λ2 = 1, λ3 = 10, λ4 = 5, λ5 = 0.5, and λ6 = 0.7 [24] are the weights corresponding to each loss term. The overall optimization objective is given in Equation (12):
(De*, Re*) = arg min_{De, Re, S} max_{D_showthrough, D_nonshowthrough} L_total   (12)
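The weighting of Equation (11) can be expressed directly in code; the dictionary keys below are illustrative names rather than identifiers from the released implementation.

```python
# Loss weights from Equation (11)
WEIGHTS = dict(re_adv=1.0, de_adv=1.0, cycle=10.0, identity=5.0,
               pseudo_sim=0.5, perceptual=0.7)

def total_loss(losses):
    """Weighted sum of the individual loss terms, Equation (11) (sketch).
    `losses` is a dict with the same keys as WEIGHTS, holding scalar tensors."""
    return sum(WEIGHTS[name] * value for name, value in losses.items())
```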

4. Experimental Results and Analysis

4.1. Implementation Details and Parameter Settings

The proposed CDSR-CycleGAN was implemented in PyTorch on a PC equipped with an NVIDIA GeForce RTX 3090 GPU. It should be noted that all training images were learned in a non-paired manner and that their resolution was 256 × 256. During the training process, the Adam optimizer [26] was used with β1 = 0.5 and β2 = 0.999. The model was trained from scratch for a total of 100 epochs with a batch size of 1: the first 50 epochs used a learning rate of 0.0001, after which the learning rate was linearly decayed to 0 over the remaining 50 epochs.
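The optimizer and learning-rate schedule described above can be sketched as follows. Grouping all generator parameters into a single Adam optimizer is an assumption borrowed from common CycleGAN implementations; the paper does not state how the parameters are partitioned across optimizers.

```python
import itertools
import torch

def build_optimizer_and_scheduler(generators, lr=1e-4, total_epochs=100, decay_start=50):
    """Adam (beta1=0.5, beta2=0.999) with a constant-then-linear-decay schedule (sketch)."""
    params = itertools.chain(*(g.parameters() for g in generators))
    optimizer = torch.optim.Adam(params, lr=lr, betas=(0.5, 0.999))

    def lr_lambda(epoch):
        # constant for the first 50 epochs, then linear decay to 0 over the last 50
        if epoch < decay_start:
            return 1.0
        return 1.0 - (epoch - decay_start) / float(total_epochs - decay_start)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```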

4.2. Dataset and Evaluation Metrics

(1)
Synthetic dataset
Since the degree of the show-through phenomenon varies across different real-life scenes, we simulated two different degrees of bleeding phenomena based on different Gaussian blur levels to demonstrate the effectiveness of our proposed method for different levels of show-through phenomena. We simulated the real bleeding phenomenon by adjusting the Gaussian template. Ultimately, a kernel size (ksize) of 5 × 5 with a standard deviation (sigma) of either 1 or 0.5 was chosen, as depicted in Figure 10.
S-color adopts a hybrid linear show-through model, where the show-through component is obtained by flipping the image horizontally during the image-preprocessing stage. The specific operation in Equation (2) is as follows: the flipped image is subjected to Gaussian blur via the function φ(·), with a kernel size of 5 × 5 and a standard deviation of either 1 or 0.5, and a parameter α randomly selected between 0.1 and 0.2 is used to simulate a real bleeding phenomenon.
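A sketch of this synthesis step with OpenCV is given below; the function name and the uint8 image convention are illustrative assumptions, while the flip, blur kernel, standard deviation, and α range follow the description above.

```python
import cv2
import numpy as np

def synthesize_show_through(front, back, sigma=0.5, alpha_range=(0.1, 0.2), rng=None):
    """Synthesize a show-through image following Equation (2) (sketch).
    `front` and `back` are uint8 images of the same size."""
    rng = rng or np.random.default_rng()
    flipped = cv2.flip(back, 1)                       # horizontal flip of the verso side
    bleed = cv2.GaussianBlur(flipped, (5, 5), sigma)  # bleeding-attenuation function phi(.)
    alpha = rng.uniform(*alpha_range)                 # blending ratio alpha in [0.1, 0.2]
    mixed = (1.0 - alpha) * front.astype(np.float32) + alpha * bleed.astype(np.float32)
    return np.clip(mixed, 0, 255).astype(np.uint8)
```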
The S-color1.0 and S-color0.5 datasets each consist of 1000 pairs of training images and 100 pairs of testing images. Each dataset consists of show-through images and non-show-through images, with no exact correspondence between the two sets. This means that the model needs to learn how to transform images from one domain to another, rather than simply performing a straightforward mapping.
(2)
Real dataset
We evaluated the proposed method using the Media Team Document Database (MTDB) [27], which contains various types of document images such as address lists, advertisements, articles, business cards, music sheets, and maps. In our experiments, we focused on 10 of the 19 predefined categories in MTDB that, more than the others, contain text documents exhibiting show-through phenomena. The scans in the MTDB were performed using a Hewlett Packard ScanJet scanner and DeskScan II 2.3.1a software, with color-adjustment parameters set to distance 0, direction 0, saturation 50, normal emphasis curve, a resolution of 300 dpi, true-color mode, no image sharpening, and photo style with highlight and shadow values set to 125. We first randomly cropped images of size 256 × 256 from the higher-resolution originals (2000+ × 3000+) and then selected 200 images with obvious show-through phenomena as our test set. The access link for MTDB is as follows: https://github.com/liangzongbao/CDSR-CycleGAN/tree/main/data/MDTB (25 May 2024).
The ICDAR-SROIE dataset [28] was created to support the ICDAR 2019 competition and aims to evaluate the performance of scene text-recognition and information-extraction algorithms on real business documents, specifically receipts and invoices. This dataset consists of 1000 complete scanned receipt images, with each image containing approximately four key text fields such as item names, unit prices, total costs, etc. The text primarily consists of numbers and English characters, and some images may have issues like stains, noise, and blurriness. In an approach similar to that used for MTDB, we first randomly cropped images of size 256 × 256 from a subset of higher-resolution images, and then selected 30 images with noticeable stains as our test set to validate the generalizability of our proposed method in this paper.
The MS Táin Bó Cuailnge (MS) [9] is an ancient manuscript. It was created by Seosamh Ó Longáin (1817–1880), one of the last professional Gaelic scribes. He produced numerous Gaelic manuscripts, especially during the 19th century. We selected the first and thirteenth pages from the book, then cropped them to a size of 256 × 256 for testing. The URL is https://www.isos.dias.ie/AUS1/AUS_MS_Tain_Bo_Cuailgne.html (22 April 2024).
The Bleed-Through Database (BTD) [9] is designed to be a resource for people working in the field of digital document restoration and more specifically on the problem of bleed-through degradation. It consists of a set of 25 registered recto–verso sample grayscale image pairs taken from larger manuscript images with varied degrees of bleed-through. The URL is https://www.isos.dias.ie/Sigmedia/Bleed_Through_Database.html (8 May 2024). Both the MS and BTD datasets can be accessed at https://www.isos.dias.ie (8 May 2024) [29].
(3)
Evaluation metrics
To quantitatively evaluate the network’s performance in removing show-through effects, we used PSNR and SSIM [30] as performance metrics. For both PSNR and SSIM, larger values indicate a closer resemblance to the ground truth and better removal of show-through effects.
Quantitative analysis was conducted on benchmark datasets with ground truth. We calculated FgError, the probability that a pixel in the foreground text is classified as background or translucent; BgError, the probability that a background or translucent pixel is classified as foreground; and TotError, the average of FgError and BgError weighted by the numbers of foreground and background pixels, which are derived from the corresponding ground-truth images. TotError indicates the probability of misclassification of any pixel in the image. Following [9], these quality metrics are defined as follows:
FgError = (1 / N_GT(Fg)) · Σ_{GT(Fg)} |GT − BY|,   BgError = (1 / N_GT(Bg)) · Σ_{GT(Bg)} |GT − BY|,   TotError = (1 / N_GT) · Σ_{GT} |GT − BY|
where GT is the ground truth, BY is the binarized restoration result, GT(Fg) and GT(Bg) denote the foreground-only and background-only regions of the ground-truth image, and N_GT(Fg), N_GT(Bg), and N_GT are the numbers of pixels in the foreground region, the background region, and the whole image, respectively.
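These three error rates can be computed as follows; the polarity convention (ink pixels encoded as 0, background as 1) is an assumption and should be adapted to the actual ground-truth encoding.

```python
import numpy as np

def bleed_through_errors(gt, by):
    """FgError, BgError, and TotError for a binarized restoration result (sketch).
    `gt` and `by` are binary arrays where 0 = ink/foreground and 1 = background."""
    gt = gt.astype(bool)
    by = by.astype(bool)
    fg = ~gt                              # foreground (text) region of the ground truth
    bg = gt                               # background / translucent region

    fg_error = np.mean(gt[fg] != by[fg])  # foreground pixels classified as background
    bg_error = np.mean(gt[bg] != by[bg])  # background pixels classified as foreground
    tot_error = np.mean(gt != by)         # misclassification rate over all pixels
    return fg_error, bg_error, tot_error
```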

4.3. Analysis of Experimental Results on Synthesized Dataset

We compared the proposed method with several other image-restoration methods, including DAD [19], DeepOtsu [11], Uformer [31], MPR-Net [32], VDIR [33], S-CycleGAN [24], and YTMT [34]. For these methods, we retrained the network using the implementations provided by the respective authors and then evaluated them. Table 1 presents the quantitative analysis of two synthetic datasets, while Figure 11 and Figure 12 show examples of sample restoration.
As shown in Table 1, on the synthetic datasets, the proposed method achieved higher average SSIM and PSNR (dB) values than the compared methods. Figure 11 and Figure 12 respectively present the restoration results of a sample image from the S-color0.5 and S-color1.0 synthetic datasets. It is observed that both the DAD and DeepOtsu methods removed the show-through effects, but the result of the DeepOtsu method suffered from color distortion, while the DAD method resulted in color deviation. The results of the Uformer and YTMT methods still contained a significant number of show-through components. The MPR-Net, VDIR, and S-CycleGAN methods all removed the show-through effects to some extent, but some residual show-through components remained. In contrast, the proposed method completely removed the show-through effects and achieved better restoration results.

4.4. Analysis of Experimental Results on Real Show-through Dataset

In this section, we aim to evaluate the generalizability of the proposed method using the MTDB, MS, and ICDAR-SROIE datasets. These three real datasets were directly tested using the model trained on the S-color0.5 dataset.
Figure 13 presents example restorations of the advertisement-category samples in the MTDB dataset. It can be observed that only our proposed method achieved good visual restoration results, while the results of the other methods suffered from significant residual show-through effects. Although the DeepOtsu method completely removed the show-through effects, it resulted in a grayscale appearance, and the color of deep-red patterns is not accurately represented.
Figure 14 displays an example restoration of a sample with stains in the ICDAR-SROIE dataset. It can be seen that the DAD and S-CycleGAN methods and our proposed method yielded good restoration results, while the DeepOtsu, Uformer, MPR-Net, VDIR, and YTMT methods all yielded images with some residual stains, failing to produce clear images.
Figure 15 shows the results of show-through removal on MS books. Although there are still some remnants of show-through distortion, the use of our method has made the image clearer and more readable. This also confirms the applicability of our method to the digitization of ancient books.

4.5. Quantitative and Qualitative Analysis on the Bleed-through Database

This section uses the BTD dataset, which has ground truth, for quantitative and qualitative comparison of our proposed method with existing methods, including Hua [35], Mog [10], and Ro2 [36]. Figure 16 displays the restoration results for the sample named NLI.MSG311.265/6. It can be observed that, compared to the other methods, our approach effectively removes most of the show-through while preserving the foreground. Hua performs well in handling isolated dark, translucent areas but often deletes foreground text in overlapping image regions, thereby reducing readability. Mog retains foreground information but struggles with dark, translucent regions, leaving visible show-through. Ro2 generally preserves foreground information well but still leaves some residual show-through in most cases. Table 2 presents the quantitative analysis. Our method performs best across all three metrics, indicating superior show-through removal. Because the Hua method removes a significant amount of foreground text while eliminating show-through, it has the highest average FgError. The results of the Mog and Ro2 methods exhibit some residual show-through.

4.6. Runtime Analysis

To compare time consumption, we measured the average runtime of different methods on the S-color0.5 dataset. To illustrate the relationship between runtime and document size, we conducted tests on images of three dimensions: 256 × 256, 512 × 512, and 1024 × 1024 pixels; the average runtimes were 0.1303 s, 0.2155 s, and 0.4105 s, respectively. Evidently, the runtime grows with the document size. We then compared our method with the others using a 256 × 256-pixel document image as an example. Table 3 lists the implementation frameworks used by the various methods and their average runtimes. Among them, DeepOtsu and VDIR use the TensorFlow framework, while the other compared methods, including our proposed method, are based on the PyTorch framework. In terms of average runtime, our proposed method slightly outperforms the other methods.

4.7. Analysis of Ablation-Experiment Results

To evaluate the effectiveness of the proposed method, we conducted ablation studies on the S-color0.5 dataset. Specifically, we removed one component at a time from the overall architecture. To ensure a fair comparison, all tested models were trained under the same settings, except for the modifications shown in Table 4. Additionally, the effectiveness of cycle consistency loss 2 (L_cycle2) and the pseudo-similarity loss (L_ps) was further investigated.
From the last three columns of Table 4, it can be seen that performance declines when L_cycle2 and L_ps are replaced, confirming their advantage in improving the quality of the restoration results. By combining all the aforementioned components, our proposed method achieved the best performance, with a PSNR of 33.79 dB. Therefore, each key component considered in the ablation study makes its own contribution.

4.8. OCR Recognition Analysis

This section aims to validate the effectiveness of the proposed algorithm in removing document show-through and to explore its potential applications in practical engineering. A series of experiments was conducted: the experimental result images based on the S-color0.5 dataset were tested with the open-source PaddleOCR v4 models, available at https://github.com/PaddlePaddle/PaddleOCR (8 May 2024).
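For reference, a minimal PaddleOCR call of the kind used for this test might look as follows; the English-language setting, the image file name, and the exact result layout (which can differ between PaddleOCR releases) are assumptions rather than details of the evaluation script used in the paper.

```python
from paddleocr import PaddleOCR

# Build an OCR pipeline with text-angle classification enabled.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

# Run OCR on one show-through-removal result image (hypothetical file name).
result = ocr.ocr("restored_sample.png", cls=True)
for line in result[0]:
    box, (text, confidence) = line        # bounding box, recognized text, and score
    print(f"{confidence:.2f}\t{text}")
```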
In this section, the show-through-removal results of the proposed algorithm are presented and compared with the results of several comparison algorithms for OCR recognition. The OCR-recognition results are shown in Figure 17, indicating that the proposed algorithm demonstrates significant effectiveness in removing document show-through. By applying this algorithm to process the experimental result images, the document show-through effect can be successfully eliminated, greatly improving the readability and accuracy of the images.
The first row of the image in Figure 17 is used as an example for evaluating the proposed method and comparing it with other methods. The OCR-recognition comparison results, shown in Figure 18, consist of three rows: the first row presents the experimental result image, the second row displays the recognition result visualization, and the third row shows the output text. The results of the DAD and DeepOtsu methods are missing some text content, while YTMT incorrectly recognized the show-through content as valid text. This comparison again confirms the effectiveness of the proposed algorithm in eliminating document show-through. Taking the sample in Figure 16 as an example, incomplete data from the top row and right side were not included in the evaluation. Table 5 presents a quantitative analysis of OCR recognition based on the numbers of words and characters and the numbers of correctly recognized words and characters. The ground truth contains 65 words and 303 characters, so the closer the measured quantities are to these values, the better the recognition performance. In terms of word count, some adjacent words are recognized as a single word because they are too close together; for example, "a bag to" is recognized as "abagto", which reduces the word count. In terms of characters, we observed that "n", "r", and "c" are prone to recognition errors. Overall, the proposed method achieves better recognition accuracy than the other compared methods.
This finding demonstrates the practical significance of the proposed algorithm in solving the document show-through problem. Furthermore, the potential advantages and possibilities of applying this algorithm in practical engineering will be explored. For example, in fields such as digital archive management, print quality control, and image post-processing, this algorithm can provide a reliable solution for relevant applications. Eliminating the document show-through effect can allow clearer and more accurate images to be obtained, thereby enhancing the efficiency and accuracy of subsequent processing tasks.

5. Discussion

One limitation of our proposed method is that CDSR-CycleGAN cannot fully restore complex backgrounds in real images. For instance, in the MTDB dataset, although our method successfully removes the show-through effect, it does not perfectly preserve the original appearance of the images. This may impact the visual experience, despite being advantageous for document image recognition. The reason lies in the inability to accurately describe real show-through images using existing synthesized data in such scenarios. In future research, we plan to explore new methods to address this issue and create diverse show-through datasets to train our network and improve upon this limitation.
Another limitation is that this paper employs synthetic data as the training set; therefore, the maximum show-through level that can be handled may be limited. Our method simulates different degrees of show-through in document images by adjusting the standard deviation of the Gaussian blur; specifically, we used Gaussian blurs with standard deviations of 0.5 and 1. Therefore, the strongest degradation supported by our method corresponds to show-through document images generated with a Gaussian blur standard deviation of 0.5. In the future, we will focus on addressing more severe show-through.

6. Conclusions

In this paper, we propose a self-supervised-learning-based method named CDSR-CycleGAN for removing the show-through effect in color document images. The method utilizes unpaired image data for training, eliminating the need to prepare large numbers of paired show-through images. A cyclic generative adversarial network is introduced to supervise the mapping from the show-through domain to the non-show-through domain. The network employs a two-stage structure and incorporates double-cycle consistency loss and pseudo-similarity loss as constraints on the show-through-removal process. Extensive experiments on both synthesized and real show-through images demonstrate the effectiveness of the method in recovering show-through images, outperforming several other approaches.

Author Contributions

Conceptualization, Z.L. and J.X.; methodology, Z.L.; software, M.N.; validation, Z.L., M.N. and J.X.; formal analysis, Z.L.; investigation, Z.L.; resources, M.N. and J.X.; data curation, Z.L.; writing—original draft preparation, Z.L.; writing—review and editing, Z.L., J.X. and M.N.; visualization, M.N.; supervision, M.N. and J.X.; project administration, M.N.; funding acquisition, J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62072391 and Grant 62066013.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author, Zongbao Liang, at [email protected].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chandio, A.A.; Asikuzzaman, M.; Pickering, M.R.; Leghari, M. Cursive text recognition in natural scene images using deep convolutional recurrent neural network. IEEE Access 2022, 10, 10062–10078. [Google Scholar] [CrossRef]
  2. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 2, 2672–2680. [Google Scholar]
  3. Souibgui, M.A.; Kessentini, Y. DE-GAN: A conditional generative adversarial network for document enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1180–1191. [Google Scholar] [CrossRef] [PubMed]
  4. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  5. Liu, W.; Hou, X.; Duan, J.; Qiu, G. End-to-end single image fog removal using enhanced cycle consistent adversarial networks. IEEE Trans. Image Process. 2020, 29, 7819–7833. [Google Scholar] [CrossRef]
  6. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  7. Lins, R.D.; Neto, M.G.; Neto, L.F.; Rosa, L.G. An environment for processing images of historical documents. Microprocess. Microprogramming 1994, 40, 939–942. [Google Scholar] [CrossRef]
  8. Sharma, G. Show-through cancellation in scans of duplex printed documents. IEEE Trans. Image Process. 2001, 10, 736–754. [Google Scholar] [CrossRef] [PubMed]
  9. Rowley-Brooke, R.; Pitié, F.; Kokaram, A. A ground truth bleed-through document image database. In Theory and Practice of Digital Libraries: Second International Conference, TPDL 2012, Paphos, Cyprus, September 23–27, 2012. Proceedings 2; Springer: Berlin/Heidelberg, Germany, 2012; pp. 185–196. [Google Scholar]
  10. Moghaddam, R.F.; Cheriet, M. A variational approach to degraded document enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1347–1361. [Google Scholar] [CrossRef] [PubMed]
  11. He, S.; Schomaker, L. DeepOtsu: Document enhancement and binarization using iterative deep learning. Pattern Recognit. 2019, 91, 379–390. [Google Scholar] [CrossRef]
  12. Hanif, M.; Tonazzini, A.; Hussain, S.F.; Khalil, A.; Habib, U. Restoration and content analysis of ancient manuscripts via color space based segmentation. PLoS ONE 2023, 18, e0282142. [Google Scholar] [CrossRef] [PubMed]
  13. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  14. Poddar, A.; Dey, S.; Jawanpuria, P.; Mukhopadhyay, J.; Kumar Biswas, P. TBM-GAN: Synthetic document generation with degraded background. In Proceedings of the International Conference on Document Analysis and Recognition, San José, CA, USA, 21–26 August 2023; pp. 366–383. [Google Scholar]
  15. De, R.; Chakraborty, A.; Sarkar, R. Document image binarization using dual discriminator generative adversarial networks. IEEE Signal Process. Lett. 2020, 27, 1090–1094. [Google Scholar] [CrossRef]
  16. Suh, S.; Kim, J.; Lukowicz, P.; Lee, Y.O. Two-stage generative adversarial networks for binarization of color document images. Pattern Recognit. 2022, 130, 108810. [Google Scholar] [CrossRef]
  17. Lin, Y.-S.; Lin, T.-Y.; Chiang, J.-S.; Chen, C.-C. Binarization of color document image based on adversarial generative network and discrete wavelet transform. In Proceedings of the 2022 IET International Conference on Engineering Technologies and Applications (IET-ICETA), Changhua, Taiwan, 14–16 October 2022; pp. 1–2. [Google Scholar]
  18. Ju, R.-Y.; Lin, Y.-S.; Chiang, J.-S.; Chen, C.-C.; Chen, W.-H.; Chien, C.-T. CCDWT-GAN: Generative adversarial networks based on color channel using discrete wavelet transform for document image binarization. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence (PRICAI), Jakarta, Indonesia, 15–19 November 2023; pp. 186–198. [Google Scholar]
  19. Zou, Z.; Lei, S.; Shi, T.; Shi, Z.; Ye, J. Deep adversarial decomposition: A unified framework for separating superimposed images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12806–12816. [Google Scholar]
  20. Gangeh, M.J.; Plata, M.; Nezhad, H.R.M.; Duffy, N.P. End-to-end unsupervised document image blind denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7888–7897. [Google Scholar]
  21. Torbunov, D.; Huang, Y.; Yu, H.; Huang, J.; Yoo, S.; Lin, M.; Viren, B.; Ren, Y. Uvcgan: Unet vision transformer cycle-consistent gan for unpaired image-to-image translation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 702–712. [Google Scholar]
  22. Wu, S.; Dong, C.; Qiao, Y. Blind image restoration based on cycle-consistent network. IEEE Trans. Multimed. 2022, 25, 1111–1124. [Google Scholar] [CrossRef]
  23. Wang, Y.; Zhou, W.; Lu, Z.; Li, H. Udoc-gan: Unpaired document illumination correction with background light prior. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 5074–5082. [Google Scholar]
  24. Xu, J.D.; Ma, Y.L.; Liang, Z.B.; Ni, M.Y. Single bleed-through image restoration with self-supervised learning. Acta Autom. Sin. 2023, 49, 219–228. [Google Scholar]
  25. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–14. [Google Scholar]
  26. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  27. Sauvola, J.; Kauniskangas, H. MediaTeam Document Database II, a CD-ROM Collection of Document Images; University of Oulu: Oulu, Finland, 1999. [Google Scholar]
  28. Huang, Z.; Chen, K.; He, J.; Bai, X.; Karatzas, D.; Lu, S.; Jawahar, C. ICDAR2019 competition on scanned receipt OCR and information extraction. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1516–1520. [Google Scholar]
  29. Irish Script On Screen Project (2012). Available online: www.isos.dias.ie (accessed on 8 May 2024).
  30. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  31. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17683–17693. [Google Scholar]
  32. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14821–14831. [Google Scholar]
  33. Soh, J.W.; Cho, N.I. Variational deep image restoration. IEEE Trans. Image Process. 2022, 31, 4363–4376. [Google Scholar] [CrossRef] [PubMed]
  34. Hu, Q.; Guo, X. Trash or treasure? An interactive dual-stream strategy for single image reflection separation. Adv. Neural Inf. Process. Syst. 2021, 34, 24683–24694. [Google Scholar]
  35. Huang, Y.; Brown, M.S.; Xu, D. User assisted ink-bleed reduction. IEEE Trans. Image Process. 2010, 19, 2646–2658. [Google Scholar] [CrossRef] [PubMed]
  36. Rowley-Brooke, R.; Pitié, F.; Kokaram, A. A non-parametric framework for document bleed-through removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2954–2960. [Google Scholar]
Figure 1. The framework for show-through removal.
Figure 2. The framework for show-through generation.
Figure 3. The network architecture of generator De.
Figure 4. The network architecture of TFAM.
Figure 5. The network architecture of FAM.
Figure 6. The network architecture of CAB.
Figure 7. The network architecture of generator Re.
Figure 8. The network architecture of similarity network S.
Figure 9. The network architecture of discriminator D.
Figure 10. Example of synthesized show-through image.
Figure 11. Qualitative analysis of the S-color0.5 dataset.
Figure 12. Qualitative analysis of the S-color1.0 dataset.
Figure 13. Qualitative analysis of the MTDB dataset.
Figure 14. Qualitative analysis of the ICDAR-SROIE dataset.
Figure 15. Qualitative analysis of the MS dataset.
Figure 16. Qualitative analysis of NLI.MSG311.265/6 from the BTD dataset.
Figure 17. Example of OCR recognition comparison.
Figure 18. Visualization of OCR recognition for different methods.
Table 1. Quantitative evaluation on the synthesized datasets.

Dataset      Index       DAD     DeepOtsu   Uformer   MPRNet   VDIR    S-CycleGAN   YTMT    Ours
S-color0.5   PSNR (dB)   28.65   21.49      24.69     32.86    27.81   32.92        25.17   33.79
             SSIM        0.977   0.861      0.902     0.972    0.956   0.979        0.912   0.984
S-color1.0   PSNR (dB)   28.87   21.48      24.68     32.98    27.80   32.99        25.40   33.91
             SSIM        0.980   0.861      0.914     0.981    0.958   0.983        0.948   0.989
Average      PSNR (dB)   28.76   21.49      24.69     32.92    27.81   32.96        25.29   33.85
             SSIM        0.979   0.861      0.908     0.977    0.957   0.981        0.930   0.987
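The PSNR and SSIM values in Table 1 follow the standard definitions discussed in [30]. For readers reproducing this evaluation, a minimal sketch using the scikit-image implementations of both metrics is given below; the 8-bit data range and the placeholder file names are assumptions for illustration rather than details of the experimental pipeline.

```python
# Minimal PSNR/SSIM evaluation sketch (scikit-image), assuming 8-bit RGB
# images of identical size; the file names below are placeholders.
import numpy as np
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

gt = io.imread("ground_truth.png").astype(np.float64)    # clean front side
restored = io.imread("restored.png").astype(np.float64)  # show-through removed

psnr = peak_signal_noise_ratio(gt, restored, data_range=255)
ssim = structural_similarity(gt, restored, data_range=255, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.3f}")
```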
Table 2. Analysis of mean error probabilities for the entire BTD dataset.

Index      Hua      Mog      Ro2      Ours
FgError    0.2308   0.0746   0.0696   0.0633
BgError    0.0012   0.0148   0.0085   0.0008
TotError   0.0413   0.0244   0.0196   0.0160
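The error probabilities in Table 2 can be read as the fraction of ground-truth foreground pixels, background pixels, and all pixels that are misclassified after restoration, using the masks provided with the BTD dataset [9]. The sketch below is written under that assumption only; the exact formulation used by the compared methods [10,35,36] may differ.

```python
# Illustrative computation of foreground/background/total error probabilities,
# assuming boolean masks where True marks foreground (text) pixels; the exact
# definitions used in the BTD evaluation may differ.
import numpy as np

def error_probabilities(pred_fg: np.ndarray, gt_fg: np.ndarray):
    fg, bg = gt_fg, ~gt_fg
    fg_error = np.mean(pred_fg[fg] != gt_fg[fg])  # ground-truth foreground misclassified
    bg_error = np.mean(pred_fg[bg] != gt_fg[bg])  # ground-truth background misclassified
    tot_error = np.mean(pred_fg != gt_fg)         # all pixels misclassified
    return fg_error, bg_error, tot_error
```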
Table 3. Analysis of average runtime (seconds) using images sized 256 × 256.

Methods         Framework    Time (s)
DAD             PyTorch      0.4649
DeepOtsu        TensorFlow   1.3982
Uformer         PyTorch      0.5203
MPRNet          PyTorch      0.1363
VDIR            TensorFlow   0.5729
S-CycleGAN      PyTorch      0.2438
YTMT            PyTorch      0.2134
CDSR-CycleGAN   PyTorch      0.1303
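The runtimes in Table 3 are averages per 256 × 256 image. A typical way to measure such per-image inference time in PyTorch is sketched below; the placeholder model, the warm-up count, and the number of repetitions are assumptions, not the exact protocol used to produce the table.

```python
# Sketch of average per-image runtime measurement for a 256 x 256 input;
# `model`, the warm-up count, and the repeat count are placeholders.
import time
import torch

def average_runtime(model, repeats: int = 100, device: str = "cuda") -> float:
    model = model.to(device).eval()
    x = torch.randn(1, 3, 256, 256, device=device)
    with torch.no_grad():
        for _ in range(10):           # warm-up so start-up cost is not timed
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for queued CUDA kernels
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats
```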
Table 4. Ablation analysis of different components on the S-color0.5 dataset.

Index       w/o De_predict   w/o De_remove   w/o Re_predict   w/o S   w/o CAB   w/o L_ps   w/o L_cycle2   CDSR-CycleGAN
PSNR (dB)   32.52            33.73           29.87            33.75   32.94     30.45      33.71          33.79
SSIM        0.978            0.981           0.969            0.983   0.982     0.974      0.983          0.984
Table 5. Quantitative analysis of OCR recognition.

Index                          Ground Truth   Show-Through Image   DeepOtsu   DAD   YTMT   Ours
Numbers           Words        65             76                   46         59    71     65
                  Characters   303            353                  266        295   375    303
Correct Numbers   Words        65             48                   36         54    32     64
                  Characters   303            290                  244        293   295    303
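Table 5 counts the words and characters returned by an OCR engine on each result and how many of them are correct. The sketch below illustrates one way to obtain such counts with Tesseract via pytesseract; the OCR engine and the simple multiset-based matching rule are assumptions and not necessarily the tools used to produce the table.

```python
# Illustrative OCR counting with Tesseract (pytesseract); the OCR engine and
# the multiset-based matching rule are assumptions, not the paper's protocol.
from collections import Counter
from PIL import Image
import pytesseract

def ocr_counts(image_path: str, gt_text: str):
    text = pytesseract.image_to_string(Image.open(image_path))
    words, gt_words = text.split(), gt_text.split()
    word_counter, gt_word_counter = Counter(words), Counter(gt_words)
    char_counter, gt_char_counter = Counter("".join(words)), Counter("".join(gt_words))
    correct_words = sum((word_counter & gt_word_counter).values())
    correct_chars = sum((char_counter & gt_char_counter).values())
    # recognized words, recognized characters, correct words, correct characters
    return len(words), sum(char_counter.values()), correct_words, correct_chars
```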