Article

Text Removal for Trademark Images Based on Self-Prompting Mechanisms and Multi-Scale Texture Aggregation

Computer Science and Technology Department, China Jiliang University, Hangzhou 310018, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1553; https://doi.org/10.3390/app15031553
Submission received: 26 November 2024 / Revised: 28 January 2025 / Accepted: 30 January 2025 / Published: 4 February 2025

Abstract

With the rapid development of electronic commerce, incidents of trademark infringement have surged, making it imperative to improve the accuracy of trademark retrieval systems as a key measure against such illegal behavior. The textual information contained in trademarks substantially influences the precision of retrieval results, yet the diversity of trademark text and the complexity of its design elements make accurately locating and analyzing this text a considerable challenge. Against this background, this research develops a self-prompting text removal model, Self-prompting Trademark Text Removal Based on Multi-scale Texture Aggregation (MTF-STTR). The model applies a text detection network to automatically generate the input prompts required by the Segment Anything Model (SAM) and incorporates a diffusion model to achieve finer trademark text removal. To further improve performance, we introduce two architectures into the text detection network: the Integrated Differentiating Feature Pyramid (IDFP) and the Texture Fusion Module (TFM), which efficiently extract multilevel features and multiscale textual information, enhancing the model's stability and adaptability in complex scenarios. Experimental validation demonstrates that the proposed trademark text erasure model achieves a peak signal-to-noise ratio of 40.1 dB on the SCUT-Syn dataset, an average improvement of 11.3 dB over other text erasure models. Furthermore, the text detection component of the model attains a precision of up to 89.9% on the CTW1500 dataset, an average improvement of 10 percentage points over other text detection networks.

1. Introduction

Compared to natural images, trademarks typically contain less information due to their strongly stylized characteristics and lack of the rich textures commonly found in natural imagery. Moreover, trademarks may share common design elements such as characters and icons. The definition of trademark similarity is ambiguous and broad, encompassing aspects such as shape, sound, semantics, layout, texture, and local details. Recent studies [1,2] have employed keypoint features such as SIFT, deep DCNN features, and neural networks to address trademark retrieval. Although these advances have significantly improved retrieval performance over previous methods, even the best-performing systems still fall short of acceptable levels of accuracy. Analysis of failure cases for these systems indicates that poor performance is often attributed to similarities caused by text, background, minor symbols, contrast, or noise. Many failure cases relate to the textual elements within trademarks: textual elements can appear similar without being considered infringing; graphic and textual elements can also resemble each other; and textual elements with sharp angles, holes, and strokes can produce features similar to those of graphical elements. Given the importance of visual salience in similarity detection [3], reducing the contribution of textual element features when calculating similarity values should improve the effectiveness of existing trademark retrieval methods.
The integration of text erasure technology with image restoration techniques involves first precisely detecting the text regions within an image, followed by the efficient erasing of these areas, and then using image restoration to recover the filled background information, maintaining the overall naturalness and integrity of the image. However, text erasure tasks are more challenging than image restoration because they require the meticulous treatment of text regions to ensure effective erasure and background recovery while simultaneously being vigilant about non-textual regions to prevent erroneous modification or destruction during the process. Within the field of text erasure technologies, scene text erasure techniques dominate, primarily due to the ubiquity and variety of scene text images in everyday life. Scene text images, as omnipresent visual elements, enrich our visual experiences but also pose unprecedented challenges for text erasure technology. First, the backgrounds of scene text images are extremely complex, potentially including natural landscapes, urban architecture, indoor environments, etc., with diverse colors, textures, and brightness levels, posing difficulties for text region detection. Second, the variability of lighting conditions presents a major challenge, as different intensities, directions, and colors of light affect the visibility and recognition of text. Third, the layouts of text in scene text images are often free-form and variable, ranging from horizontal, vertical, and oblique to curved, and they require high flexibility and adaptability in text erasure algorithms. Lastly, the multitude of font types adds complexity to the task, from common print fonts to handwriting and artistic scripts, each with unique characteristics and identification hurdles.
Trademark images, although characterized by specific design elements and styles, face similar challenges in text erasure tasks. The elements within trademarks are diverse, extending beyond simple backgrounds and possibly including patterns, colors, and other elements, and the orientation and typography of trademark text can vary arbitrarily. Therefore, the challenges faced by scene text erasure also exist in trademark text erasure, and they may even be more complex. As depicted in Figure 1, a comparison between scene text and trademark text elements is presented. The scene text is sourced from the CTW1500 dataset [4], whereas the trademark text elements originate from the METU dataset [5]. Currently, in the domain of text segmentation, the Segment Anything Model (SAM) [6] demonstrates exceptional performance, though it requires manually provided prompts. This paper therefore proposes an auto-prompting model that places a text detection network before the SAM module to autonomously generate input prompts, improving the accuracy and efficiency of trademark text segmentation, and adds a diffusion model after the SAM module to restore the background once the text is erased, improving the naturalness of the resulting image. The main contributions of this research can be summarized as follows:
(1)
Proposing an auto-prompting model that combines a text detection network, SAM, and a diffusion model to achieve better trademark text erasure.
(2)
Introducing an integrated differentiating feature pyramid that improves the model’s expressive power by integrating multi-layer features and capturing potential regularities among intra-layer features.
(3)
Presenting a texture fusion module that fuses features across different scales to accurately represent multi-scale text instances, thereby enhancing the model’s robustness and generalization ability in complex scenes.
Figure 1. Comparison of scene text and trademark text.

2. Related Work

The primary task of scene text erasing is to take an image containing text and produce an output image devoid of any text. During this process, the network repairs the textual regions within the image using non-textual textures, aiming for visually plausible results. Image retrieval follows as a subsequent operation to text erasing: the more cleanly the text areas are erased and the more accurately the textures are repaired, the better the downstream image retrieval performance. Therefore, how to precisely detect and repair text regions becomes the key issue in scene text erasing. Current research strategies can be categorized into two main types: first, single-stage methods based on image reconstruction; second, two-stage methods involving both text detection and image restoration.
Single-stage methods were prevalent before the development of deep learning techniques. Carreira-Perpinán [7] employed Gaussian filters, a simple image processing technique, for blurring purposes, but this method was effective only for texts with specific shapes. Shi et al. [8] utilized a straightforward repairing algorithm that used sample precedents for restoration, reconstructing target text after filling in missing pixels. With the advancement of deep learning, convolutional neural networks have demonstrated strong feature representation capabilities. Nakamura et al. [9] were among the first to introduce SceneTextEraser for natural scene text erasing, utilizing a deep convolutional neural network to automatically detect, erase, and fill in natural scene text images. This method uses sliding windows to divide the image into multiple blocks, employs a U-Net to perform text erasing on each block, and reassembles the blocks to obtain the erased image. However, this method did not consider the relationships between image blocks, severely limiting the model's ability to capture contextual information, affecting the detection of text regions, and resulting in a lack of continuity and consistency in the erasure outcomes. Tursun et al. [2] introduced soft attention mechanisms to reduce the adverse effects caused by cropped text regions. They adopted hard attention methods to achieve more precise identification of image text information and incorporated a text-erasing network to enhance the feature extraction capability of convolution modules in text regions. Liu et al. [10] presented EraseNet, dividing the text erasing process into coarse erasing and refinement stages. In the coarse erasing stage, the backbone network extracts features and provides them to a text perception branch and an erasing branch. The text perception branch generates a text mask from the features, which is compared with annotated masks to obtain the mask loss. The erasing branch performs feature reconstruction as input for the refinement stage. In the refinement stage, the outputs from the erasing branch undergo a second erasure, and the result is compared with manually labeled erasures to compute the reconstruction loss; at the same time, the result is provided to an adversarial network to compute the adversarial loss. Although the model ensures more complete text erasure by performing a second erasure on the image, the text mask generated by the text perception branch is not used in the next stage along with the other branches, leading to issues such as erroneous erasure in EraseNet.
Two-stage methods first employ a text detection model to identify the location of text within the image or directly use scene text position annotations; they then repair the detected text regions, completing the background where the text was present. MTRNet [11] utilizes text masks to provide positional information about the text. After the original image and the text mask are concatenated channel-wise, these inputs are fed into a generative network to produce the erased image. The implicit erasing guidance avoids excessive erasure of the original background, thereby improving the overall quality of the final erased image. MTRNet++ [12], building upon MTRNet, adopts a coarse-to-fine strategy similar to that of EraseNet, with the network divided into coarse erasing, mask optimization, and fine erasing branches. MTRNet++ refines the mask from the candidate-box level to the stroke level using the mask optimization branch and then sends the coarsely erased image and the optimized mask to the fine erasing branch to obtain the final erased image. Zdenek et al. [13] proposed a weakly supervised method that pre-trains text detection and background repair networks using text detection datasets and texture repair datasets and then fine-tunes the model with limited labeled data. Tang et al. [14] introduced a model that performs stroke-level text erasing by leveraging text masks. The process involves a mask prediction module, which employs bounding boxes to delineate the textual regions within images and then uses a U-Net architecture to predict the corresponding text masks, while a background completion module conducts partial convolutions that align with the ongoing refinement of the mask. Conrad et al. [15] integrated a text mask generator into the text detection network to generate pixel-level binary segmentation masks; in the background repair network, they combined loss functions to create more realistic images and concatenated the two networks during inference. Beyond enhancing the logo text erasure effect, the security of AI models in practical applications, particularly the issue of adversarial examples, is another pressing subject that needs to be addressed. Despite the significant advances that deep neural networks have made in fields such as image recognition and text recognition, their potential security risks should not be overlooked. Research [16] indicates that, by adding specific triggers to input samples, backdoor samples can be generated to mislead the model into making erroneous predictions. In the task of logo text erasure, the model must precisely remove text and recover the background, and its ability to defend against malicious inputs is equally crucial. Therefore, future research should focus on strengthening the model's resistance to adversarial attacks to ensure its safety and reliability in real-world applications.

3. Methodology

Due to the diversity of trademark text and the complexity of design elements, traditional text erasure models frequently encounter issues such as the incorrect erasure of non-textual regions within trademarks and the incomplete removal of trademark text. In this paper, we propose a Self-prompting Trademark Text Removal Based on Multi-scale Texture Aggregation (MTF-STTR) model, which aims to effectively detect and erase text regions within images.

3.1. Overall Framework

The framework is mainly composed of a text detection network and a background inpainting network. Specifically, the text detection network consists of an Integrated Differentiating Feature Pyramid and a Texture Fusion Module, while the background repair network comprises a Segment Anything Model (SAM) and a Latent Diffusion Model (LDM) [17]. Although the SAM possesses powerful capabilities for image segmentation, it relies on manually provided input prompts, which limits the efficiency and scalability of the segmentation process. To address this issue, this paper introduces an Integrated Differentiating Feature Pyramid and a Texture Fusion Module prior to the SAM module. Both modules independently generate input prompts, aiming to combine their segmentation abilities with those of SAM to enhance the accuracy and efficiency of segmentation while reducing reliance on manual labor costs. The mask generated via the SAM module serves as the conditional input for the latent diffusion model within the background repair network, guiding its reverse diffusion process.
The overall workflow of MTF-STTR is illustrated in Figure 2. Initially, the input image is fed into the backbone network of the Integrated Differentiating Feature Pyramid to extract multi-scale features, which are then used to generate feature representations rich in contextual information. With inspiration from the DBNet++ [18] method, probability maps and threshold maps are predicted based on these feature representations. By combining the probability map with the feature representation, an approximate binary map is calculated to locate the text area. The boundary box generation mechanism allows for the easy extraction of the boundary boxes directly from the approximate binary map. These boundary boxes are then used as prompt inputs to the segmentation model to generate masks, thus minimizing the need for manual annotation. Subsequently, the latent diffusion model is applied for text erasure and image repair, culminating in the generation of an image with erased text.
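To make this workflow concrete, the sketch below outlines the inference pipeline in Python under stated assumptions: the three stage callables (detect_text_boxes, segment_with_sam, inpaint_with_ldm) are hypothetical placeholders for the IDFP/TFM-based detector, the SAM prompt-driven segmenter, and the latent diffusion inpainter described above, not the authors' released code.

```python
from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[int, int, int, int]  # axis-aligned box (x1, y1, x2, y2)

def remove_trademark_text(
    image: np.ndarray,
    detect_text_boxes: Callable[[np.ndarray], List[Box]],
    segment_with_sam: Callable[[np.ndarray, Box], np.ndarray],
    inpaint_with_ldm: Callable[[np.ndarray, np.ndarray], np.ndarray],
) -> np.ndarray:
    # 1. Text detection: IDFP + TFM yield probability/threshold maps from which
    #    minimal rectangular boxes around the text lines are derived.
    boxes = detect_text_boxes(image)

    # 2. Self-prompting: each box is passed to SAM as a prompt, producing a
    #    pixel-level binary mask of the text region (no manual clicks needed).
    text_mask = np.zeros(image.shape[:2], dtype=bool)
    for box in boxes:
        text_mask |= segment_with_sam(image, box).astype(bool)

    # 3. Background restoration: the latent diffusion model inpaints the masked
    #    pixels, conditioned on the mask and the surrounding image content.
    return inpaint_with_ldm(image, text_mask)
```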

3.2. Integrated Differentiating Feature Pyramid

Mainstream text detection and object detection methods [19,20] often concentrate on the deep interplay and fusion of inter-layer features. Nevertheless, these methods tend to overlook potentially important regularities among intra-layer features, which earlier research has widely verified and which contribute significantly to the accuracy of visual recognition tasks. To compensate for this deficiency, and drawing inspiration from recent advances in dense prediction tasks [21,22], this paper proposes an Integrated Differentiating Feature Pyramid (IDFP), whose structure is shown in Figure 3.
In the text detection pipeline, the input image is passed through the Integrated Differentiating Feature Pyramid (IDFP) backbone network to extract multi-scale features. The pyramidal features are then upscaled to a uniform scale and fed into the Texture Fusion Module to generate a feature, F, rich in contextual information. The IDFP backbone uses Swin Transformer V2 [23] as its base. After the Patch Partition layer splits the raw image into patches, a four-level feature pyramid, P1, P2, P3, and P4, is extracted, with spatial sizes of 1/4, 1/8, 1/16, and 1/32 of the input image, respectively. The P3 feature map is then added pixel-wise to the correspondingly upscaled P4 feature map, producing P7, as described in Equation (1).
$P_7 = P_3 + \mathrm{Up}_{\times 2}(P_4)$   (1)
The pixel values corresponding to the P2 feature map, as well as those of the P7 feature map after upsampling two times and the P4 feature map after upsampling four times, are summed up to obtain P6, defined as follows:
$P_6 = P_2 + \mathrm{Up}_{\times 2}(P_7) + \mathrm{Up}_{\times 4}(P_4)$   (2)
The pixel values corresponding to the P1 feature map, along with those from the P6 feature map that has been subjected to 2× upsampling and the P4 feature map that has undergone 8× upsampling, are combined to produce the P5 feature map, described as follows:
$P_5 = P_1 + \mathrm{Up}_{\times 2}(P_6) + \mathrm{Up}_{\times 8}(P_4)$   (3)
The P5 feature map is processed through a 3 × 3 convolutional layer. The P6 feature map undergoes a 3 × 3 convolutional layer followed by 2× upsampling. The P7 feature map is passed through a 3 × 3 convolutional layer and then subjected to 4× upsampling. Similarly, the P4 feature map is processed via a 3 × 3 convolutional layer before being upscaled by a factor of 8. Finally, these four feature maps, namely P5, P6, P7, and P4, are input into the Texture Fusion Module. This process is detailed in Equations (4)–(7).
$P_5 = \mathrm{Conv}_{3 \times 3}(P_5)$   (4)
$P_6 = \mathrm{Up}_{\times 2}\big(\mathrm{Conv}_{3 \times 3}(P_6)\big)$   (5)
$P_7 = \mathrm{Up}_{\times 4}\big(\mathrm{Conv}_{3 \times 3}(P_7)\big)$   (6)
$P_4 = \mathrm{Up}_{\times 8}\big(\mathrm{Conv}_{3 \times 3}(P_4)\big)$   (7)
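The following PyTorch sketch shows one way Equations (1)–(7) could be implemented. It is a minimal illustration rather than the authors' code: the assumption that all pyramid levels have already been projected to a common channel count, the nearest-neighbour upsampling mode, and the omission of the Swin Transformer V2 backbone are choices made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDFPFusion(nn.Module):
    """Upsample-and-add fusion of P1..P4 as in Equations (1)-(7)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # one 3x3 convolution per output branch, Equations (4)-(7)
        self.conv5 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv6 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv7 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv4 = nn.Conv2d(channels, channels, 3, padding=1)

    @staticmethod
    def up(x: torch.Tensor, scale: int) -> torch.Tensor:
        return F.interpolate(x, scale_factor=scale, mode="nearest")

    def forward(self, p1, p2, p3, p4):
        # p1..p4: backbone features at 1/4, 1/8, 1/16 and 1/32 of the input size,
        # assumed to share the same channel count.
        p7 = p3 + self.up(p4, 2)                        # Equation (1)
        p6 = p2 + self.up(p7, 2) + self.up(p4, 4)       # Equation (2)
        p5 = p1 + self.up(p6, 2) + self.up(p4, 8)       # Equation (3)

        # bring every branch to the 1/4 resolution before the TFM
        o5 = self.conv5(p5)                             # Equation (4)
        o6 = self.up(self.conv6(p6), 2)                 # Equation (5)
        o7 = self.up(self.conv7(p7), 4)                 # Equation (6)
        o4 = self.up(self.conv4(p4), 8)                 # Equation (7)
        return o5, o6, o7, o4
```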
The design of the IDFP further optimizes the traditional feature pyramid structure, ensuring the efficient fusion of shallow features while benefiting from the rich, concentrated visual information contained within the deepest layers. Considering that the deepest features typically contain the most abstract representations, which the shallowest features lack, this paper leverages explicit visual center cues extracted from the deepest levels of the feature pyramid to synchronize and finely adjust the related shallow features. This approach not only promotes communication between features at different hierarchical levels but also significantly enhances the ability of shallow features to represent key visual elements in images. Thus, while maintaining computational efficiency, it improves both the representational efficiency and accuracy of the entire feature pyramid.
Compared to conventional feature pyramid structures, IDFP demonstrates unique advantages. It achieves richness and diversity in feature representations through intra-level feature adjustment mechanisms, allowing the model to perceive complex scenes and diverse targets within images with greater subtlety. This design philosophy not only strengthens the model’s capability of capturing target features but also elevates its generalization and robustness in complex scenarios.

3.3. Texture Fusion Module

Features at different scales exhibit distinct perceptual capacities and receptive fields; consequently, they emphasize different aspects when describing textual instances. Shallow features are adept at detecting the local details of small-scale text instances, yet they lack the capacity to acquire global information from large-scale instances. Conversely, deep features effectively handle global information but fail to adequately represent the fine-grained details of small-scale instances. To tackle this challenge, this study introduces the Texture Fusion Module (TFM), which integrates features across varying scales to enable the accurate characterization of multi-scale text instances. This integration enhances the model's robustness and generalization performance in complex scenarios. The architecture of the TFM is depicted in Figure 4.
The Texture Fusion Module (TFM) takes concatenated feature maps as input and employs a parallel-path strategy to refine and enhance the input features layer by layer. Initially, the outputs from the IDFP are concatenated and serve as the input to the TFM. Global information is extracted using max pooling and generalized mean pooling [24], and multilayer perceptrons (MLPs) perform nonlinear mappings on these features, generating weight coefficients for each channel. The weight coefficient γ is defined as follows:
$\gamma = \mathrm{Sigmoid}\big(\mathrm{MLP}(\mathrm{max}(x)) + \mathrm{MLP}(\mathrm{gem}(x))\big)$,   (8)
where "max" denotes maximum pooling over the feature map, "gem" denotes Generalized Mean (GeM) pooling, and "MLP" stands for multi-layer perceptron.
Following this, the feature maps that have undergone pooling operations are added together, and then a Sigmoid activation function is applied to generate channel weight coefficients. Subsequently, these weight coefficients are utilized to weigh the initial features, thereby more effectively emphasizing the features of critical channels while suppressing irrelevant or redundant information.
In the other branch, the concatenated feature maps are first subjected to a 3 × 3 convolutional operation to further extract local spatial information. Subsequently, a spatial attention mechanism [25] is employed to generate a weighting distribution across different locations within the image. Finally, these spatial weights are multiplied element-wise with the concatenated feature maps, thereby reinforcing the model's ability to focus on text regions in the spatial dimension. The channel attention mechanism filters and optimizes feature channels during the preliminary stage, mitigating noise interference and enhancing the model's expressiveness towards text instances of various sizes. The spatial attention mechanism, on the other hand, specializes in optimizing features within the spatial dimension, guaranteeing the model's precision in locating text areas and efficiently suppressing background distractions. By employing channel attention and spatial attention in parallel, the module refines features along multiple dimensions, significantly boosting the model's text detection performance in complex scenarios.
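A hedged PyTorch sketch of the TFM as described above follows: a channel-attention branch (max pooling plus GeM pooling, a shared MLP, and a sigmoid, as in Equation (8)) runs in parallel with a spatial-attention branch (3 × 3 convolution followed by spatial weighting). The reduction ratio, the GeM exponent, the CBAM-style form of the spatial attention, and the summation used to merge the two branches are assumptions, since the paper does not fix these details here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextureFusionModule(nn.Module):
    def __init__(self, channels: int, reduction: int = 4, gem_p: float = 3.0):
        super().__init__()
        self.gem_p = gem_p
        self.mlp = nn.Sequential(                       # shared MLP for both poolings
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.local_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.spatial_attn = nn.Conv2d(2, 1, 7, padding=3)   # CBAM-style spatial attention

    def gem_pool(self, x: torch.Tensor) -> torch.Tensor:
        # Generalized Mean (GeM) pooling over the spatial dimensions
        return F.avg_pool2d(x.clamp(min=1e-6).pow(self.gem_p),
                            x.shape[-2:]).pow(1.0 / self.gem_p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: concatenation of the four IDFP outputs along the channel axis
        b, c, _, _ = x.shape

        # channel-attention branch, Equation (8)
        w_max = self.mlp(F.adaptive_max_pool2d(x, 1).flatten(1))
        w_gem = self.mlp(self.gem_pool(x).flatten(1))
        gamma = torch.sigmoid(w_max + w_gem).view(b, c, 1, 1)
        x_channel = x * gamma

        # spatial-attention branch
        local = self.local_conv(x)
        pooled = torch.cat([local.mean(dim=1, keepdim=True),
                            local.max(dim=1, keepdim=True).values], dim=1)
        spatial = torch.sigmoid(self.spatial_attn(pooled))
        x_spatial = x * spatial

        # merge the two refined streams (a simple sum; an assumption)
        return x_channel + x_spatial
```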
Finally, the TFM produces feature F. From feature F, the probability map P and the threshold map T are predicted. Combining the probability map P with feature F allows for the calculation of an approximate binary image, B. During the training phase, the model applies supervision to the probability map, the threshold map, and the approximate binary image, with the probability map and the approximate binary image sharing the same supervisory signal. In the inference phase, polygonal bounding boxes are extracted from the approximate binary image through a boundary box generation mechanism. The vertex coordinates of the polygons are converted into minimal rectangular bounding boxes, which serve as prompts for the SAM module. The SAM module of the background repair network receives the standardized bounding boxes and the original image and generates precise binary masks within these boundaries. These masks are then fed into the latent diffusion model for text erasure and background restoration.
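The sketch below illustrates this inference-time handoff from the approximate binary map to SAM. OpenCV's contour utilities stand in for the boundary box generation mechanism, and the SamPredictor calls follow the public segment-anything package; the binarization threshold, the checkpoint path, and the absence of box rescaling or filtering are simplifying assumptions.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def binary_map_to_boxes(binary_map: np.ndarray, thresh: float = 0.3) -> np.ndarray:
    """Turn the approximate binary map into minimal axis-aligned XYXY boxes."""
    bitmap = (binary_map > thresh).astype(np.uint8)
    contours, _ = cv2.findContours(bitmap, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]           # (x, y, w, h) per polygon
    return np.array([[x, y, x + w, y + h] for x, y, w, h in boxes])

def masks_from_boxes(image_bgr: np.ndarray, boxes: np.ndarray,
                     checkpoint: str = "sam_vit_h.pth") -> np.ndarray:
    """Prompt SAM with each box and union the resulting text masks."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)  # checkpoint path is a placeholder
    predictor = SamPredictor(sam)
    predictor.set_image(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))

    full_mask = np.zeros(image_bgr.shape[:2], dtype=bool)
    for box in boxes:
        masks, _, _ = predictor.predict(box=box[None, :], multimask_output=False)
        full_mask |= masks[0].astype(bool)                    # masks: (1, H, W)
    return full_mask                                          # passed on to the LDM inpainter
```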

3.4. Diffusion Module

Recent years have seen widespread interest in diffusion models [26] as a category of generative models. Their core objective is to generate desired data samples from a specified probability distribution. These models consist of a forward diffusion process and an inverse diffusion process: during the forward diffusion process, the model continuously adds noise to the original image, disrupting its information; in the inverse diffusion process, the model analyzes the noisy image and iteratively performs denoising operations to progressively restore the noise-free original image, approaching the target distribution. In terms of data modeling, diffusion models offer significant advantages, particularly in handling highly complex data distributions, and they can effectively capture long-range dependencies within the data during the diffusion process. Moreover, unlike Generative Adversarial Networks (GANs), which usually require extra training to generate high-resolution samples, diffusion models can generate samples in a continuous and incremental fashion without additional training.
This section utilizes the LDM to achieve the goals of text erasing and background repairing. LDM is a generative model based on a denoising probability model that is aimed at generating data samples by executing a diffusion process in the latent space. As opposed to operating directly in the high-dimensional data space, the latent diffusion model first maps the original data into a lower-dimensional latent space. Within this latent space, the model gradually disrupts the latent variables through an additive noise forward process, and then it learns to recover the original latent variables from the noise through the inverse process. A structural diagram of the LDM is presented in Figure 5.
Specifically, the forward process transforms the latent variables stepwise into a standard normal distribution, with the added noise causing the data distribution to become increasingly blurred. The inverse process is the core component of the model, which learns denoising steps to progressively revert the noisy latent variables back to clear latent representations. Finally, these latent representations are mapped back into the data space by a decoder to generate new samples consistent with the original data distribution. The entire LDM can be divided into three main parts: perceptual compression, latent diffusion, and the conditional mechanism.
First, there is perceptual compression. The core goal of this stage is to map high-dimensional, complex raw data into a low-dimensional latent space to reduce the dimensionality and computational complexity of the data. To achieve this, models commonly employ autoencoders, such as Variational Autoencoders (VAEs) and other deep neural network structures. The encoder extracts the key features and semantic information from the data, compressing them into compact latent representations. By operating in the low-dimensional space, the model can more effectively learn the distribution of the data and reduce resource consumption. Additionally, to ensure the quality of the latent representations, the model uses techniques such as perceptual loss functions during training so that the compressed representations still retain the sensory features and important information of the original data.
The latent diffusion process is the core part of the model, where it executes the diffusion process in the previously obtained low-dimensional latent space, including both forward and inverse diffusion steps. The forward diffusion process starts from the latent representation and gradually adds Gaussian noise to it, bringing it closer to a standard normal distribution. This defines a simple noise distribution that facilitates the modeling of the inverse process. The inverse diffusion process is the key to generating high-quality samples through the model. The model trains a denoising network to learn how to progressively remove noise from the noised latent representations to recover the original latent representations. Over several iterations, the model progressively improves the quality of generated samples. The model can effectively capture the complex structures and long-range dependencies of the data, thus generating highly realistic samples. The success of this process depends on carefully designed denoising networks, effective training strategies, and the accurate modeling of the data distribution in the latent space.
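A minimal sketch of one latent-diffusion training step, as described in this subsection, is given below: the image is encoded into the latent space, noised at a random timestep (forward process), and the denoiser is trained to predict that noise (the learned reverse process). The encoder, denoiser, conditioning tensor, and linear DDPM-style noise schedule are placeholders and assumptions rather than the released LDM implementation.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(encoder, denoiser, x0, cond, num_steps: int = 1000):
    """One epsilon-prediction training step in the latent space (sketch)."""
    # linear beta schedule -> cumulative alpha_bar, as in DDPM (an assumption)
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0).to(x0.device)

    z0 = encoder(x0)                                          # perceptual compression
    t = torch.randint(0, num_steps, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                                # Gaussian noise

    a = alpha_bar[t].view(-1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1.0 - a).sqrt() * eps               # forward diffusion q(z_t | z_0)

    eps_pred = denoiser(zt, t, cond)                          # conditioned denoising network
    return F.mse_loss(eps_pred, eps)                          # noise-prediction objective
```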
Lastly, there is the conditional mechanism, which enables the model to introduce external information into the generation process, allowing it to produce samples that meet specific requirements. Through the conditional mechanism, the model can utilize additional input information, such as textual descriptions, categorical labels, or data from other modalities, to control the attributes and content of the generated results. Methods to implement the conditional mechanism typically involve encoding the condition information into vector representations and then integrating them with the latent representations at each step of the diffusion process. This introduces conditional information into the denoising network so that the model can refer to conditions while denoising, thereby generating samples that match the conditions. By having to learn the associations between conditional information and data distributions during training, the model can accurately capture and reflect the conditional information during generation, improving the quality and variety of the generated results.
These three components work together to allow the latent diffusion model to efficiently generate data in the latent space while achieving precise control over the generated content. Perceptual compression reduces the dimensionality and computational complexity of the data, latent diffusion carries out an efficient diffusion process in the compressed space, and the conditional mechanism provides the flexibility for the model to generate specific samples as needed.

3.5. Loss Function

The loss function is composed of three parts: the probability map loss $L_p$, the binary image loss $L_b$, and the threshold map loss $L_t$. The total loss is the weighted sum of these three losses, where α and β are set to 1 and 10, respectively.
$L = L_p + \alpha L_b + \beta L_t$   (9)
The losses $L_p$ and $L_b$ utilize binary cross-entropy loss. To address the imbalance between positive and negative samples, a hard negative mining strategy is adopted within the binary cross-entropy loss: difficult-to-classify negative samples are preferentially selected for sampling, which improves the model's performance on imbalanced data.
$L_p = L_b = \sum_{i \in S_l} y_i \log x_i + (1 - y_i) \log (1 - x_i)$   (10)
In the sampled collection $S_l$, the ratio of positive to negative samples is 1:3. This collection includes all positive samples, as well as the top k negative samples with the highest probability values, where k is three times the number of positive samples. $L_t$ is the sum of the L1 distances between the predicted results and the labels within the dilated text polygon area, defined as follows:
$L_t = \sum_{i \in R_d} \left| y_i^* - x_i^* \right|$,   (11)
where $R_d$ represents the set of indices of pixels within the dilated polygonal area $G_d$, and $y^*$ denotes the label of the threshold map.
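A PyTorch sketch of the loss in Equations (9)–(11) is given below, assuming the maps are tensors of probabilities in [0, 1]. The 1:3 hard-negative-mining ratio and the weights α = 1, β = 10 follow the text; the small epsilon terms added for numerical stability are assumptions.

```python
import torch
import torch.nn.functional as F

def balanced_bce(pred: torch.Tensor, gt: torch.Tensor, neg_ratio: int = 3) -> torch.Tensor:
    """Binary cross-entropy with hard negative mining (used for L_p and L_b)."""
    pos = gt > 0.5
    neg = ~pos
    loss = F.binary_cross_entropy(pred, gt.float(), reduction="none")
    pos_loss = loss[pos]
    # keep only the hardest negatives: three times the number of positives
    k = min(int(pos.sum()) * neg_ratio, int(neg.sum()))
    neg_loss, _ = loss[neg].topk(k)
    return (pos_loss.sum() + neg_loss.sum()) / (pos_loss.numel() + k + 1e-6)

def threshold_l1(pred_t: torch.Tensor, gt_t: torch.Tensor, dilated_mask: torch.Tensor) -> torch.Tensor:
    """L_t: L1 distance inside the dilated text polygons, Equation (11)."""
    return (torch.abs(pred_t - gt_t) * dilated_mask).sum() / (dilated_mask.sum() + 1e-6)

def total_loss(prob_map, binary_map, thresh_map, gt_text, gt_thresh, dilated_mask,
               alpha: float = 1.0, beta: float = 10.0) -> torch.Tensor:
    l_p = balanced_bce(prob_map, gt_text)
    l_b = balanced_bce(binary_map, gt_text)   # same supervision as the probability map
    l_t = threshold_l1(thresh_map, gt_thresh, dilated_mask)
    return l_p + alpha * l_b + beta * l_t     # Equation (9)
```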

4. Experiments

4.1. Datasets and Evaluation Criteria

In this study, we opted to utilize scene text datasets instead of logo text datasets for several key reasons. Firstly, logo text datasets are relatively scarce, and the volume of annotated data is limited, which poses certain limitations in large-scale training and the assessment of generalization capabilities. Conversely, scene text datasets offer a greater variety of samples, encompassing complex backgrounds and diverse text layouts and thus better simulating the intricate situations in which logo texts might appear in reality. Secondly, scene text shares a substantial similarity with logo text in numerous aspects, particularly in the high degree of integration between text and backgrounds. Logo texts often appear against complex backgrounds as well, and the challenge in the removal task is akin to that encountered with scene text removal, suggesting that employing scene text datasets can effectively evaluate the performance of the model in logo text removal tasks. Finally, the scene text datasets contain a wide range of different types of text instances, such as curved text and overlapping text, which can serve as rich training data for the model, enhancing its adaptability and robustness when tackling the logo text removal task. To validate the effectiveness of MTF-STTR, this study employed two publicly available datasets. For the text detection network comparisons, the CTW1500 dataset [4] was used, while for the overall network framework comparisons, the SCUT-Syn dataset [27] was utilized.
The CTW1500 dataset primarily focuses on curved text and comprises 1000 training images and 500 test images. The text instances within the dataset are annotated at the level of complete text lines, meaning that the annotation is not limited to single characters or words but extends to entire lines of text. An example from the CTW1500 dataset is shown in Figure 6, illustrating the challenges text detection algorithms face in recognizing and segmenting curved text. The emphasis on line-level annotations makes the CTW1500 dataset suitable for evaluating algorithms that handle complex text layouts, including curved and overlapping text, and the provision of separate training and test subsets allows researchers to assess the generalizability and performance of their models on real-world data.
The SCUT-Syn dataset is a synthetic dataset that includes a training set of 8000 images and a test set of 800 images. It incorporates real-world data collected from ICDAR-2013 [28] and ICDAR MLT-2017 [29]. An example of the SCUT-Syn dataset is shown in Figure 7.
The mean squared error (MSE) is a widely used image quality metric that quantifies the difference between the reconstructed or processed image and the reference image. MSE evaluates the degree of error between two images by calculating the square of the differences between corresponding pixels and averaging them. A lower numerical value indicates that the reconstructed image is closer to the reference image. The formula is as shown in Equation (12), where N represents the total number of pixels in the image, and y(i) and x(i) denote the pixel values of the original and processed images, respectively.
$MSE = \frac{1}{N} \sum_{i=1}^{N} \big( y(i) - x(i) \big)^2$   (12)
The peak signal-to-noise ratio (PSNR) is an extensively used objective criterion for assessing image quality, mainly employed to quantify the difference between a reconstructed or compressed image and the original image. PSNR measures reconstruction accuracy by calculating the MSE between the original and processed images, with higher values indicating that the reconstructed image quality is closer to the original and the distortion is smaller. Specifically, PSNR characterizes the ratio of the peak signal strength to the noise level in the image. The formula is as shown in Equation (13), where MAX is the maximum possible pixel value in the image (typically 255 for 8-bit images), and MSE is the mean squared error.
$PSNR = 10 \log_{10} \left( \frac{MAX^2}{MSE} \right)$   (13)
The Percentage of Error Pixels (pEPs) is an indicator of the quality of image reconstruction or processing, which calculates the proportion of pixels in the image that are deemed erroneous. Specifically, pEPs determines the quality of the image by counting the percentage of pixels in the processed image that differ substantially from the reference image among the total number of pixels. A lower pEPs value indicates higher image quality, meaning fewer erroneous pixels. The formulas are as shown in Equations (14) and (15).
$pEPs = \frac{1}{H \times W} \sum_{h,w} E_r\big(D(h,w)\big)$   (14)
$E_r(x) = \begin{cases} 0, & x < 20 \\ 1, & x \geq 20 \end{cases}$   (15)
The percentage of clustered error pixels (pCEPs) measures the extent of locally clustered errors in an image, specifically expressing the proportion of erroneous pixels whose neighboring pixels also contain errors. A lower pCEPs value indicates a lesser degree of clustering for misclassified pixels in the image, suggesting that the synthesized image's structure and detail more closely resemble reality. Therefore, pCEPs can effectively evaluate the authenticity and quality of generated images. The formulas are as shown in Equations (16) and (17), where Equation (18) gives the absolute difference between the pixel values of two grayscale images, and g(·) denotes the conversion of an image to grayscale.
$pCEPs = \frac{1}{H \times W} \sum_{h,w} E_c\big(D(h+1,w),\, D(h,w-1),\, D(h-1,w),\, D(h,w+1)\big)$   (16)
$E_c(x_1, x_2, x_3, x_4) = \begin{cases} 1, & x_i \geq 20 \ \text{for } i \in [1,4] \\ 0, & \text{otherwise} \end{cases}$   (17)
$D(h,w) = \left| g(X_{h,w}) - g(Y_{h,w}) \right|$   (18)
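The NumPy sketch below computes the four quality metrics in Equations (12)–(18). The 20-grey-level error threshold and the four-neighbour test follow the text; the BT.601 grayscale weights and the wrap-around handling of image borders (via np.roll) are simplifying assumptions.

```python
import numpy as np

def to_gray(img: np.ndarray) -> np.ndarray:
    # BT.601 luma weights; pass-through for images that are already single-channel
    return img @ np.array([0.299, 0.587, 0.114]) if img.ndim == 3 else img.astype(np.float64)

def mse(y: np.ndarray, x: np.ndarray) -> float:
    return float(np.mean((y.astype(np.float64) - x.astype(np.float64)) ** 2))  # Equation (12)

def psnr(y: np.ndarray, x: np.ndarray, max_val: float = 255.0) -> float:
    m = mse(y, x)
    return float("inf") if m == 0 else 10.0 * np.log10(max_val ** 2 / m)        # Equation (13)

def peps_pceps(y: np.ndarray, x: np.ndarray, thresh: float = 20.0):
    d = np.abs(to_gray(y) - to_gray(x))                  # D(h, w), Equation (18)
    err = d >= thresh                                    # E_r, Equation (15)
    peps = float(err.mean())                             # Equation (14)
    # a pixel contributes to pCEPs when all four of its neighbours are erroneous
    up, down = np.roll(err, 1, axis=0), np.roll(err, -1, axis=0)
    left, right = np.roll(err, 1, axis=1), np.roll(err, -1, axis=1)
    pceps = float((up & down & left & right).mean())     # Equations (16)-(17)
    return peps, pceps
```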
Precision measures the proportion of detected text regions that are actually text regions. High precision means there are relatively few false detections among the detected text regions. For text detection tasks, higher precision indicates that the model has a lower rate of misjudgment within the detected text regions. Precision is defined in Equation (19).
$\mathrm{Precision} = \frac{TP}{TP + FP}$   (19), where TP denotes true positives and FP denotes false positives.
Recall measures the proportion of all actual text regions that are correctly detected by the model. A high recall rate indicates that the model has a strong ability to detect text regions with few omissions. The higher the recall, the greater the proportion of actual text regions successfully detected by the model. Recall is defined in Equation (20).
$\mathrm{Recall} = \frac{TP}{TP + FN}$   (20), where FN denotes false negatives.
The F-score, also known as the F1 score, is the harmonic mean of precision and recall. It balances these two metrics, which is particularly useful when false positives must be weighed against false negatives, and it combines them into a single value. The formula for the F1 score is given in Equation (21).
$F\text{-}score = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$   (21)
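As a quick worked example of Equations (19)–(21), with made-up counts of 90 true positives, 10 false positives, and 15 false negatives:

```python
tp, fp, fn = 90, 10, 15                                    # hypothetical counts
precision = tp / (tp + fp)                                 # 0.900, Equation (19)
recall = tp / (tp + fn)                                    # ~0.857, Equation (20)
f_score = 2 * precision * recall / (precision + recall)    # ~0.878, Equation (21)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```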

4.2. Experimental Configuration

Because the MTF-STTR framework derives its input prompts autonomously from the IDFP and TFM, which retrieve masks more effectively by supplying bounding-box prompts to SAM, and because the LDM then performs the text erasure, this article uses pre-trained SAM and LDM models. Consequently, in the ablation experiments conducted in this article, only combinations of the detection-side modules were used to verify module effectiveness. In the comparative experiments, the effectiveness of the proposed text detection network was verified using the IDFP and TFM, while the effectiveness of the proposed framework was validated by comparing MTF-STTR with other text erasure networks.
This article employs a model pretrained on the SynthText dataset. After data augmentation was used to expand the CTW1500 dataset, the model was fine-tuned for 100 epochs to achieve better training efficiency. Data augmentation included random rotation (−20° to 20°), random cropping, and random flipping (horizontal and vertical), and all processed images were resized to 512 × 512. The learning rate was adjusted according to the iteration, with the initial learning rate set to 0.01 and the power set to 0.9; the learning rate lr at the current iteration is given by Equation (22), where m_iter is the maximum number of iterations and iter is the current iteration count. As the number of iterations increases, the learning rate decreases gradually to facilitate the model's stable convergence in later stages and to reduce the amplitude of parameter updates.
$lr = 0.01 \times \left( 1 - \frac{iter}{m\_iter} \right)^{0.9}$   (22)
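The schedule in Equation (22) is the standard polynomial ("poly") decay; a small sketch follows, where the concrete value of m_iter (epochs times batches per epoch) is an assumption for illustration.

```python
def poly_lr(iteration: int, max_iter: int, base_lr: float = 0.01, power: float = 0.9) -> float:
    """Polynomial learning-rate decay, Equation (22)."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# e.g. with an assumed max_iter of 10,000 the rate decays smoothly from 0.01 towards 0
for it in (0, 5_000, 9_999):
    print(it, round(poly_lr(it, 10_000), 6))
```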

4.3. Comparative Experiments

This article selects methods from the text detection field as comparison networks to verify the effectiveness of the text detection network: TextSnake [30], TLOC [4], PAN [31], and DBNet++ [18]. To verify the effectiveness of the proposed overall framework, MTF-STTR, methods from the text erasure field were chosen as comparison networks: EnsNet [27], EraseNet [10], Pix2Pix [32], and Bain [33].
As shown in Table 1, the performance of MTF-STTR’s text detection network section compared to other text detection networks on the CTW1500 dataset is presented, with bold numbers indicating the best values in the table. From the table, it can be observed that the method proposed in this article achieves improvements of 22.0%, 12.5%, 3.5%, and 2.0% in precision, respectively, compared to TextSnake, TLOC, PAN, and DBNet++. Correspondingly, recall improves by 0.3%, 15.8%, 4.4%, and 2.8%, respectively. The F-score is elevated by 12.1%, 14.3%, 4.0%, and 2.4%, respectively. This indicates that the proposed method outperforms the compared approaches in terms of text detection accuracy, providing evidence for the effectiveness of the text detection network incorporated into the MTF-STTR framework. The substantial improvements in precision, recall, and F-score suggest that the MTF-STTR framework offers a competitive solution in the field of text detection.
The performance of the various networks on the SCUT-Syn dataset is shown in Table 2, with bold numbers indicating the best values in the table. From the table, it can be observed that the method proposed in this article compares favorably with EnsNet, EraseNet, Pix2Pix, and Bain. Regarding the evaluation metric MSE, our method differs from EraseNet by only 0.0001, while it surpasses the compared networks in all other evaluation metrics.

4.4. Ablation Experiments

In this ablation study, a detailed performance assessment of different combinations of model modules was carried out, analyzing the contributions of each module to the text recognition task. The experimental results, as shown in Table 3, use precision, recall, and F-score as evaluation metrics, observing the improvement effect on model performance as different modules were sequentially added. Without any additional modules, MTF-STTR achieved precision, recall, and an F-score of 84.3%, 75.2%, and 80.6%, respectively. On this basis, individual tests were performed by separately incorporating the DB, IDFP, and TFM modules.
Subsequently, the efficacy of dual-module combinations was tested. The combination of DB and IDFP increased the F-score to 85.9%, demonstrating strong synergistic effects. The combination of DB and TFM yielded an F-score of 85.4%, while the combination of IDFP and TFM showed the best performance among the pairs, reaching an F-score of 87.3% and indicating a significant boost to text recognition performance when IDFP and TFM act in concert.
Ultimately, when all three modules—DB, IDFP, and TFM—were fully integrated, the model’s precision reached 89.9%, its recall rose to 85.6%, and the F-score achieved 87.8%, delivering optimal performance. This indicates that the collaborative action of the DB, IDFP, and TFM modules has a significant impact on enhancing the model’s text recognition capabilities. In summary, the ablation experiment conducted in this study effectively validated the contributions of individual modules to the enhancement of text recognition performance. By utilizing different modules singly and in combination, the experiment has illustrated the impact of each module on the model’s performance, as well as the synergistic effects among them. This provides crucial empirical support for the optimization of the model architecture, further demonstrating the effectiveness and optimization potential of the proposed framework. Additionally, it offers invaluable experience and reference for future applications in similar tasks.
To better showcase the effectiveness of the proposed method in erasing logo text in practical applications, Figure 8 provides a detailed depiction of the method’s performance in real-life scenarios. This visualization allows viewers to observe the outcomes of applying the method to remove logo text from actual-environment contexts.

5. Conclusions

The methodology proposed in this study addresses the limitation of manual input prompts by employing a text detection network, ensures accurate text segmentation using the SAM, and ultimately achieves precise text removal and background reconstruction via a latent diffusion model. Extensive testing across various datasets validated the superiority of the text detection network designed with specific modules over comparable alternatives, and our text erasure technique clearly outstrips existing solutions.
In the subsequent stages of our research endeavors, we shall concentrate on several pivotal domains to further augment the performance and practicality of our method. Initially, our objective is to refine the model architecture, boosting its robustness and adaptability to accommodate a broader spectrum of text types and varied backgrounds. This may entail enhancements to the text detection and segmentation procedures in order to address more intricate and variable circumstances. Furthermore, we intend to enhance the model’s inferential efficiency, striving to expedite processing velocities while maintaining accuracy levels. Ultimately, we will investigate the integration of our proposed technique with other systems and applications to guarantee its compliance with the requirements for large-scale practical deployment across diverse sectors.

Author Contributions

Conceptualization, X.W. and W.Z.; methodology, W.Z., X.W. and B.Z.; software, W.Z. and B.Z.; validation, W.Z. and L.L.; formal analysis, X.W. and W.Z.; investigation, X.W.; resources, X.W. and W.Z.; data curation, W.Z.; writing—draft, W.Z.; writing—formal, X.W. and W.Z.; visualization, W.Z. and L.L.; supervision, X.W.; project administration, X.W.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by funding from the National Key Research and Development Program of China under grant number 2021YFC3340402.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article.

Acknowledgments

We thank Zhiduoduo Technology Co., Ltd. for providing the experimental data used in this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Feng, Y.; Shi, C.; Qi, C.; Xu, J.; Xiao, B.; Wang, C. Aggregation of reversal invariant features from edge images for large-scale trademark retrieval. In Proceedings of the 2018 4th International Conference on Control, Automation and Robotics (ICCAR), Auckland, New Zealand, 14 June 2018; pp. 384–388. [Google Scholar]
  2. Tursun, O.; Denman, S.; Sivapalan, S.; Sridharan, S.; Fookes, C.; Mau, S. Component-based attention for large-scale trademark retrieval. IEEE Trans. Inf. Forensics Secur. 2019, 17, 2350–2363. [Google Scholar] [CrossRef]
  3. Zheng, L.; Lei, Y.; Qiu, G.; Huang, J. Near-duplicate image detection in a visually salient riemannian space. IEEE Trans. Inf. Forensics Secur. 2012, 7, 1578–1593. [Google Scholar] [CrossRef]
  4. Liu, Y.; Jin, L.; Zhang, S.; Luo, C.; Zhang, S. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognit. 2019, 90, 337–345. [Google Scholar] [CrossRef]
  5. Tursun, O.; Aker, C.; Kalkan, S. A large-scale dataset and benchmark for similar trademark retrieval. arXiv 2017, arXiv:1701.05766. [Google Scholar]
  6. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 15 January 2023; pp. 4015–4026. [Google Scholar]
  7. Carreira-Perpinán, M.A. Generalised blurring mean-shift algorithms for nonparametric clustering. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  8. Shi, J.; Qi, C. Sparse modeling based image inpainting with local similarity constraint. In Proceedings of the 2013 IEEE International Conference on Image Processing, Melbourne, Australia, 15–18 September 2013; pp. 1371–1375. [Google Scholar]
  9. Nakamura, T.; Zhu, A.; Yanai, K.; Uchida, S. Scene text eraser. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 832–837. [Google Scholar]
  10. Liu, C.; Liu, Y.; Jin, L.; Zhang, S.; Luo, C.; Wang, Y. Erasenet: End-to-end text removal in the wild. IEEE Trans. Image Process. 2020, 29, 8760–8775. [Google Scholar] [CrossRef] [PubMed]
  11. Tursun, O.; Zeng, R.; Denman, S.; Sivapalan, S.; Sridharan, S.; Fookes, C. Mtrnet: A generic scene text eraser. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 39–44. [Google Scholar]
  12. Tursun, O.; Denman, S.; Zeng, R.; Sivapalan, S.; Sridharan, S.; Fookes, C. MTRNet++: One-stage mask-based scene text eraser. Comput. Vis. Image Underst. 2020, 201, 103066. [Google Scholar] [CrossRef]
  13. Zdenek, J.; Nakayama, H. Erasing scene text with weak supervision. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Seattle, WA, USA, 13–19 June 2020; pp. 2238–2246. [Google Scholar]
  14. Tang, Z.; Miyazaki, T.; Sugaya, Y.; Omachi, S. Stroke-based scene text erasing using synthetic data for training. IEEE Trans. Image Process. 2021, 30, 9306–9320. [Google Scholar] [CrossRef] [PubMed]
  15. Conrad, B.; Chen, P.I. Two-stage seamless text erasing on real-world scene images. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 1309–1313. [Google Scholar]
  16. Kwon, H.; Lee, S. Toward backdoor attacks for image captioning model in deep neural networks. Secur. Commun. Netw. 2022, 2022, 1525052. [Google Scholar] [CrossRef]
  17. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  18. Liao, M.; Zou, Z.; Wan, Z.; Yao, C.; Bai, X. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 919–931. [Google Scholar] [CrossRef] [PubMed]
  19. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. Mlp-mixer: An all-mlp architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272. [Google Scholar]
  20. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar]
  21. Quan, Y.; Zhang, D.; Zhang, L.; Tang, J. Centralized feature pyramid for object detection. IEEE Trans. Image Process. 2023, 32, 4341–4354. [Google Scholar] [CrossRef] [PubMed]
  22. Peng, Z.; Huang, W.; Gu, S.; Xie, L.; Wang, Y.; Jiao, J.; Ye, Q. Conformer: Local Features Coupling Global Representations for Visual Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 367–376. [Google Scholar]
  23. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
  24. Wang, Z.; Li, Z.; Sun, J.; Xu, Y. Selective convolutional features based generalized-mean pooling for fine-grained image retrieval. In Proceedings of the 2018 IEEE Visual Communications and Image Processing (VCIP), Taichung, Taiwan, 9–12 December 2018; pp. 1–4. [Google Scholar]
  25. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. arXiv 2015, arXiv:1506.02025. [Google Scholar]
  26. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 2256–2265. [Google Scholar]
  27. Zhang, S.; Liu, Y.; Jin, L.; Huang, Y.; Lai, S. Ensnet: Ensconce text in the wild. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 801–808. [Google Scholar]
  28. Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; i Bigorda, L.G.; Mestre, S.R.; Mas, J.; Mota, D.F.; Almazan, J.A.; De Las Heras, L.P. ICDAR 2013 robust reading competition. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Athens, Greece, 30 August–4 September 2013; pp. 1484–1493. [Google Scholar]
  29. Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Feng, Y.; Karatzas, D.; Luo, Z.; Pal, U.; Rigaud, C.; Chazalon, J.; et al. Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1454–1459. [Google Scholar]
  30. Long, S.; Ruan, J.; Zhang, W.; He, X.; Wu, W.; Yao, C. Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 20–36. [Google Scholar]
  31. Wang, W.; Xie, E.; Song, X.; Zang, Y.; Wang, W.; Lu, T.; Yu, G.; Shen, C. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8440–8449. [Google Scholar]
  32. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  33. Bian, X.; Wang, C.; Quan, W.; Ye, J.; Zhang, X.; Yan, D.M. Scene text removal via cascaded text stroke detection and erasing. Comput. Vis. Media 2022, 8, 273–287. [Google Scholar] [CrossRef]
Figure 2. Architecture of the proposed MTF-STTR network.
Figure 3. Architecture of the IDFP module.
Figure 4. Architecture of the TFM module.
Figure 5. Architecture of the LDM module.
Figure 6. Examples from the CTW1500 dataset.
Figure 7. Examples from the SCUT-Syn dataset.
Figure 8. Actual-scene application effect.
Table 1. Comparison of various networks on CTW1500 dataset.

Method           Precision  Recall  F-Score
TextSnake [30]   67.9       85.3    75.6
TLOC [4]         77.4       69.8    73.4
PAN [31]         86.4       81.2    83.7
DBNet++ [18]     87.9       82.8    85.3
MTF-STTR         89.9       85.6    87.7
Table 2. Comparison of various networks on SCUT-Syn dataset.

Method         MSE     PSNR  pEPs    pCEPs
EnsNet [27]    0.0021  29.5  0.0069  0.002
EraseNet [10]  0.0002  38.3  0.0048  0.0004
Pix2Pix [32]   0.0027  26.8  0.0473  0.0244
Bain [33]      0.0159  20.8  0.1021  0.5996
MTF-STTR       0.0003  40.1  0.004   0.0003
Table 3. Ablation experiment results of MTF-STTR on CTW1500 dataset (module combinations inferred from the rows' order and the F-scores cited in the text).

DB  IDFP  TFM  Precision  Recall  F-Score
–   –     –    84.3       75.2    80.6
✓   –     –    86.1       81.4    84.5
–   ✓     –    87.6       80.3    84.9
–   –     ✓    85.8       81.3    84.1
✓   ✓     –    88.2       83.0    85.9
✓   –     ✓    86.8       82.9    85.4
–   ✓     ✓    88.7       84.2    87.3
✓   ✓     ✓    89.9       85.6    87.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhou, W.; Wang, X.; Zhou, B.; Li, L. Text Removal for Trademark Images Based on Self-Prompting Mechanisms and Multi-Scale Texture Aggregation. Appl. Sci. 2025, 15, 1553. https://doi.org/10.3390/app15031553
