HFI-Former: High-Frequency Interaction Transformer for Robust Scene Text Detection

Gao, Yubing; Gao, Quanli; Shao, Lianhe; Wang, Xihan; Liu, Lufang

doi:10.3390/info17040365

by Yubing Gao ¹, Quanli Gao ¹, Lianhe Shao ¹, Xihan Wang ¹,* and Lufang Liu ²,*

¹ School of Computer Science, Xi’an Polytechnic University, Xi’an 710600, China

² China Academy of Aerospace Science and Innovation, Beijing 100048, China

* Authors to whom correspondence should be addressed.

Information 2026, 17(4), 365; https://doi.org/10.3390/info17040365

Submission received: 8 February 2026 / Revised: 25 March 2026 / Accepted: 8 April 2026 / Published: 13 April 2026

(This article belongs to the Section Information Applications)

Abstract

Scene text detection aims to accurately localize text instances in images captured under complex environments. Its performance depends heavily on precise text boundary delineation and reliable semantic discrimination from cluttered backgrounds. However, existing methods still struggle in such complex scenes. Repeated downsampling gradually biases features toward low-frequency components, thereby weakening edge details and local structures that are critical to text morphology. Additionally, semantic information and local details are often modeled independently. This lack of coordination makes high-frequency responses vulnerable to background noise. To address these issues, we propose HFI-Former, a Transformer-based model designed for high-frequency enhancement and feature interaction. The framework consists of multi-scale feature extraction, frequency-enhanced representation, semantic-guided feature interaction, and deformable Transformer encoding. Frequency-domain enhancement is introduced to preserve high-frequency structural features degraded by repeated downsampling. Semantic-aware feature interaction further injects global context to regulate multi-scale feature fusion. Experiments on CTW1500, Total-Text and ICDAR2015 demonstrate competitive boundary localization accuracy and strong overall detection performance in complex scenes.

Keywords: scene text detection; frequency-domain enhancement; high-frequency structural features; feature interaction; Transformer

1. Introduction

Scene text detection is a core task in computer vision that aims to precisely localize text instances and their contours in natural images. By bridging visual perception and language understanding, it plays a critical role in applications such as autonomous driving, real-time translation, intelligent transportation, and human–computer interaction. Despite recent progress in deep learning, detection robustness and accuracy remain limited in real-world scenes due to complex text shapes and diverse environments.

Existing text detection methods are commonly categorized into segmentation-based and regression-based methods. Segmentation-based methods detect arbitrary-shaped text using pixel-level masks. Representative methods include PSENet [1] and DBNet [2]. These approaches perform well in boundary delineation, but their performance depends heavily on backbone feature resolution. Detail loss caused by downsampling often degrades accuracy, and background noise may lead to missed or false detections. Regression-based methods, such as EAST [3], directly regress text boxes or control points. They provide efficient inference in regular text scenarios but struggle with curved text and complex backgrounds.

Recently, Transformer-based architectures, inspired by DETR [4], have been adapted to text detection. Methods such as TESTR [5] and LayoutFormer [6] exploit global dependency modeling. Self-attention and cross-attention enhance contextual reasoning. These mechanisms are effective for curved and densely arranged text, often outperforming traditional CNN-based approaches. However, performance still strongly depends on input feature quality. If low-level features suffer structural degradation, the global modeling advantages of Transformers are limited. Existing multi-scale fusion strategies often rely on simple addition or concatenation. Such designs ignore semantic reliability differences across local features, and may introduce noise in complex backgrounds, weakening semantic consistency.

Recent studies attempt to introduce dedicated enhancement mechanisms during the feature representation stage. HFENet [7] strengthens edge representation by explicitly compensating for the loss of high-frequency information in multi-scale feature learning. TextFuseNet [8] improves the interaction between global semantics and local fine-grained features through multi-level feature fusion. However, although these methods achieve improvements in specific aspects, they still face significant limitations in complex scenarios. Repeated downsampling operations progressively bias feature representations toward low-frequency components, leading to the loss of text boundaries and fine structural details. Moreover, semantic information and local details are often modeled independently, without effective collaboration. As a result, local high-frequency responses are easily corrupted by background noise, producing false activations and weakening reliable text localization.

To address these challenges, we propose a general and efficient scene text detection framework. It aims to enhance local structure representation and global semantic consistency during the feature representation stage. The main contributions are summarized as follows:

We propose HFI-Former, a high-frequency enhancement and feature interaction Transformer. It jointly models high-frequency structural information and cross-scale semantic relationships. Multi-scale features preserve structural integrity under stable semantic constraints, improving detection performance in complex scenes.
We design two key modules: WFE-Net and FIRM. WFE-Net compensates for high-frequency detail degradation caused by backbone downsampling, enhancing structural sensitivity. FIRM promotes effective interaction between high-frequency structures and semantics, suppressing background interference and improving robustness.
We conduct systematic experiments and ablation studies on CTW1500, Total-Text and ICDAR2015. The results validate the effectiveness of the framework and its key modules, demonstrating strong generalization ability in complex scenes.

2. Related Works

Text in natural scenes exhibits diverse shapes, large variations in scale, and complex backgrounds. Detection methods must therefore preserve high-frequency details such as text edges and local geometric structures, while relying on global semantics to stably discriminate and constrain these local responses. To this end, existing research has mainly developed three representative approaches: regression-based methods, segmentation-based methods, and Transformer-based methods. However, regardless of the detection framework used, performance largely depends on the representation capability of the underlying features. Therefore, after reviewing these three approaches, this section further analyzes research on feature enhancement and adaptation to clarify the positioning of our method.

2.1. Regression-Based Methods

Regression-based scene text detection methods achieve end-to-end localization by directly regressing the geometric parameters of text instances from feature maps, such as axis-aligned rectangles, quadrilateral vertices, offsets, Bezier control points, or sequences of polygon points. These methods typically offer advantages such as simple network structures, fast inference speeds, and minimal reliance on additional post-processing, which has allowed them to maintain an important role in industrial-level real-time systems.

EAST [3] was the first to propose a fully convolutional single-stage pipeline that directly regresses multi-oriented rectangles or quadrilaterals at the pixel level, eliminating the lengthy multi-stage proposal generation process. TextBoxes++ [9] introduced irregular quadrilateral regression and aspect-ratio-adaptive convolutions within the SSD framework, further improving detection accuracy for long and inclined text. Subsequently, CRAFT [10] regressed from character center points to boundary offsets in four directions and combined character affinity maps, effectively enhancing localization performance for curved text. After 2020, researchers continued to make advances in regression accuracy and robustness. DeepRel [11] introduced deep relational modeling to enhance global consistency for long text instances. ContourNet [12] combined contour point regression with Bezier curve control points to achieve a precise parametric representation of highly curved text. PCR [13] proposed a progressive contour regression strategy that iteratively refines polygon vertex sequences. I3CL [14] enhanced features of character and background regions through intra-instance collaborative learning, and captured dependencies and global context among different text instances via inter-instance collaborative learning, leveraging pseudo-labeling to exploit unlabeled data.

Regression-based methods exhibit clear advantages in structural simplicity and inference efficiency. However, their prediction process typically relies on fixed or weakly constrained geometric parameterizations, which limit their capacity to represent high-frequency structural features such as text boundaries and slender strokes. During multi-scale downsampling and feature aggregation, high-frequency information closely related to text morphology is prone to further degradation, leading to unstable localization in complex backgrounds or densely populated text scenes. Although some studies alleviate geometric representation issues for long or curved text by introducing relational modeling or contour parameterization, these methods still fundamentally rely on spatial-domain regression. They have limited capability for explicitly modeling and semantically constraining high-frequency structural information.

2.2. Segmentation-Based Methods

Segmentation-based detection paradigms reformulate scene text detection as a pixel-level semantic or instance segmentation task. They generate high-resolution text probability maps, kernel maps, or instance-aware masks, and then recover final text instances through post-processing steps such as Vatti clipping, connected component clustering, or learnable clustering. These methods naturally adapt to text of arbitrary shapes and are capable of precisely delineating complex boundaries, making them a dominant direction in both academia and industry over the past five years.

PSENet [1] first proposed a progressive scale expansion algorithm to effectively separate adjacent text instances, alleviating adhesion problems. DBNet [2] and DBNet++ [15] introduced a differentiable binarization module that transforms the originally non-differentiable thresholding operation into a learnable process, enabling truly end-to-end training. Subsequent studies further focused on robustness and efficiency. FCE [16] replaced traditional binary masks with Fourier contour regression, avoiding complex post-processing. RSCA [17] proposed a context-aware upsampling strategy to enhance robustness for small and densely distributed text. TransText [18] introduced a feature reallocation module and an improved Transformer pyramid decoder for precise binary maps. DText [19] leveraged text shape-sensitive positional embeddings to generate instance-adaptive dynamic convolution parameters. RMIPN [20] designed a multi-information-aware segmentation head predicting boundaries, distance fields, and direction fields, improving boundary localization accuracy.

Segmentation-based methods exhibit clear advantages in boundary refinement and shape representation. However, their performance heavily depends on the backbone’s ability to preserve high-frequency structural information. When features suffer from edge blurring or high-frequency degradation during repeated downsampling, spatial precision can deteriorate and may introduce noisy responses in complex backgrounds. Therefore, relying solely on spatial-domain segmentation supervision is insufficient to fundamentally alleviate high-frequency information loss. Enhancing structural information at the feature level and multi-scale modeling remain key to improving robustness.

2.3. Transformer-Based Methods

In recent years, influenced by the success of the DETR family and Transformers in vision, many studies have introduced Transformer architectures into scene text detection. DETR-based methods focus on effectively leveraging prior information to improve detection performance.

Such methods typically adopt a DETR-style encoder–decoder architecture to directly predict text queries or polygon points. TESTR [5] adopts a single-encoder dual-decoder architecture to simultaneously perform text boundary control point regression and character recognition. DPText-DETR [21] employs dynamic points as queries combined with enhanced decomposed self-attention, enabling efficient modeling of arbitrarily shaped text. SwinTextSpotter [22] integrates a Swin Transformer backbone with a dynamic head for collaborative optimization of detection and recognition. CDText [23] introduces a context-aware deformable Transformer to enhance multi-scale feature interaction and improve the representation of text instances with complex shapes. ESTextSpotter [24] unifies detection and recognition through explicit point queries or task-aware query designs. LayoutFormer [6] introduces hierarchical decoding and efficient foreground feature sampling, achieving word-, line-, and paragraph-level text detection.

Transformer-based methods exhibit advantages in global relationship modeling, flexible representation, and end-to-end pipelines. However, their performance still depends on structural integrity and multi-scale representation of input features. When backbone features suffer high-frequency detail degradation, global modeling advantages are hard to fully exploit, sometimes amplifying non-text responses. Thus, achieving deep synergy between high-frequency structural preservation and semantic constraint at the feature level remains an open problem.

2.4. Feature Enhancement and Adaptation-Based Methods

In addition to the three major detection paradigms mentioned above, many studies focus on the underlying feature representation stage. They aim to improve the robustness of models for complex text through feature enhancement and adaptive mechanisms. In conventional convolutional networks, standard downsampling operations easily introduce aliasing effects. This leads to irreversible degradation of high-frequency details. To alleviate this fundamental issue, the general vision community has proposed methods such as Anti-Aliasing CNNs [25]. These methods introduce low-pass filtering before downsampling to suppress aliasing, so that fine structures are attenuated in a controlled way rather than folded into spurious low-frequency responses. Recent studies further confirm the importance of this mechanism in fine-grained tasks. For example, Ning et al. [26] investigated the role of anti-aliasing in small object detection. They showed that suppressing aliasing effects is crucial for preserving small and densely distributed structural features.
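The effect of filtering before subsampling can be seen in a one-dimensional toy example. The sketch below is only an illustration: the [1, 2, 1] / 4 kernel, the circular padding, and the function names are assumptions made here and are not taken from the cited works.

```python
import numpy as np


def subsample(x, stride=2):
    """Plain strided subsampling, as in a strided convolution or pooling step."""
    return x[::stride]


def blur_then_subsample(x, stride=2):
    """Low-pass filter with a [1, 2, 1] / 4 kernel (circular padding), then subsample."""
    blurred = 0.25 * np.roll(x, 1) + 0.5 * x + 0.25 * np.roll(x, -1)
    return blurred[::stride]


# A signal that alternates at the highest frequency the input grid can hold.
signal = np.array([1.0, -1.0] * 4)

# Without filtering, every kept sample is +1, so the oscillation
# masquerades as a constant (an aliased low-frequency response).
aliased = subsample(signal)

# With filtering, the oscillation that the coarser grid cannot
# represent is removed before sampling instead of being aliased.
filtered = blur_then_subsample(signal)

if __name__ == "__main__":
    print("aliased: ", aliased)
    print("filtered:", filtered)
```

In this toy case the unfiltered path invents a constant response that was never in the signal, which is the kind of corruption that filtering before downsampling is meant to avoid.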

Meanwhile, to address the diverse shapes of text in natural scenes, researchers have developed various sample-sensitive architectures for adaptive feature extraction. Kernel Adaptive Convolution (KAC) [27] uses distance map prediction to dynamically guide convolution kernels. Li et al. [28] proposed an attention-based feature extraction and cascade feature fusion framework to enhance multi-scale feature representation and cross-level interactions. Similarly, Wang et al. [29] introduced internal feature enhancement and adaptive cross fusion to strengthen cross-level feature interactions, thereby improving the representation of irregular and multi-scale text instances. In addition, several studies explore feature enhancement and frequency-domain modeling. DBNet [2] and TextFuseNet [8] perform boundary-aware modeling and multi-level feature fusion in the spatial domain. FCE [16] explicitly models complex contours using Fourier transforms. TPWGAN [30] introduces a wavelet-aware framework guided by text priors to enhance high-frequency details and improve fine-grained feature recovery. HFENet [7] focuses on compensating for high-frequency information loss during feature learning. Although these methods achieve notable improvements in preserving local structures or adaptive feature extraction, they usually treat high-frequency preservation in the frequency domain and semantic interaction in the spatial domain as separate steps. As a result, it remains difficult to dynamically align and deeply fuse explicit high-frequency structural details with global semantic constraints under complex background interference.

A comprehensive review reveals that, although scene text detection has progressed under different paradigms, model performance is constrained by structural integrity and semantic consistency at the feature representation stage. High-frequency structural information is prone to degradation [31], while high-level semantic features often fail to provide stable discriminative constraints. The insufficient synergy between these aspects has become a key factor limiting performance in complex scene text detection. Most existing methods focus on detection head design, decoding strategies, or post-processing. Explicit modeling of high-frequency structural information and its deep interaction with cross-scale semantics remains underexplored. In scenes with curved, dense, or complex backgrounds, relying solely on spatial-domain features or single-scale semantic modeling cannot ensure structural completeness and semantic stability.

3. Method

3.1. Overview

We propose HFI-Former, a high-frequency enhancement and feature interaction Transformer framework for scene text detection. The overall architecture is illustrated in Figure 1. By organically integrating frequency-domain enhancement with cross-scale semantic interaction, the proposed framework constructs multi-scale feature representations that are both structurally explicit and semantically consistent. Specifically, we introduce a Wavelet Frequency-Enhanced Network (WFE-Net) that operates on multi-scale features extracted from the backbone. WFE-Net performs multi-level discrete wavelet decomposition to separate features into low-frequency structural components and multi-directional high-frequency texture components. Lightweight learnable convolutional enhancements are applied to each sub-band, followed by progressive reconstruction to restore spatial-domain features. This process effectively compensates for the degradation of high-frequency structural information caused by downsampling, while preserving the original scale hierarchy, thereby enabling the collaborative preservation of structural details and semantic representations. Subsequently, the Feature Interaction Refinement Module (FIRM) adopts a dual-stream pyramid architecture to inject global semantic information from the original multi-scale features into the WFE-Net-enhanced features. FIRM employs a Dual-Path Interaction Transformer (DRIT) that combines Softmax-based attention and Sigmoid-based gating mechanisms to achieve selective local semantic modulation and global noise suppression, producing high-quality enhanced multi-scale features. The enhanced features are then fed into a six-layer multi-scale deformable Transformer encoder to capture cross-scale global contextual dependencies. The contour generator consists of three segmentation layers and three regression layers. 
The segmentation layers generate text instance masks and cooperate with anchor priors to refine control points through joint regression, while the regression layers further leverage prior information to accurately predict the final control points, enabling precise contour fitting of text instances. In the following sections, we describe the design of each component in detail.

3.2. WFE-Net

During multi-scale feature extraction, convolutional networks tend to attenuate high-frequency components through successive convolution and downsampling operations, making it difficult for edges, textures, and local geometric structures contained in low-level features to be fully preserved in deeper layers. The loss of such high-frequency information limits the representational capacity of features for shapes, structures, and fine-grained patterns, thereby affecting the accurate modeling of target regions in subsequent modules.

To explicitly compensate for this deficiency, we design a high-frequency enhancement network, termed WFE-Net (Wavelet Frequency-Enhanced Network), based on a standard ResNet-50 backbone. The core idea of WFE-Net is to introduce a learnable frequency-domain modeling mechanism to selectively compensate for high-frequency information at each scale, while preserving the original multi-scale structure. Specifically, WFE-Net takes multi-scale features $\{F_{1}, F_{2}, F_{3}\}$ from the backbone network and generates an additional lower-resolution feature $F_{4}$ from the deepest feature $F_{3}$ via an extra strided convolution, forming a feature hierarchy with progressively decreasing resolutions, denoted as $\{F_{1}, F_{2}, F_{3}, F_{4}\}$. At each scale, we introduce a learnable Wavelet-Fusion Convolution (WFC) to enhance the high-frequency structural representations of CNN features across different stages. The WFC architecture is shown in Figure 2.

Considering the differences among multi-scale features in spatial resolution and semantic abstraction, WFE-Net adopts resolution-adaptive wavelet decomposition depths at different scales, rather than using a unified number of wavelet levels. Specifically, the highest-resolution feature $F_{1}$ employs a four-level decomposition, the intermediate-resolution feature $F_{2}$ adopts a three-level decomposition, while the lower-resolution features $F_{3}$ and $F_{4}$ use only two-level decompositions. This design is motivated by the following considerations. High-resolution features contain abundant local texture information, and deeper wavelet decomposition enables the capture of multi-scale high-frequency patterns. Intermediate-scale features strike a balance between detailed structures and semantic abstraction, for which moderate decomposition is sufficient. In contrast, low-resolution features primarily encode high-level semantic information, and excessively deep decomposition may lead to sparse high-frequency components, potentially disrupting semantic consistency.
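To make the depth schedule concrete, the sketch below lists the side length of the sub-bands produced at each wavelet level for the four scales. Only the depths (4, 3, 2, 2) come from the text; the 640 × 640 input, the strides 8/16/32/64, and all names are hypothetical values chosen for illustration.

```python
from __future__ import annotations

# Decomposition depths stated in the text for F1..F4.
DEPTHS = {"F1": 4, "F2": 3, "F3": 2, "F4": 2}

# Hypothetical backbone strides and input size (not given in the text).
STRIDES = {"F1": 8, "F2": 16, "F3": 32, "F4": 64}
INPUT_SIZE = 640


def subband_sizes(side: int, levels: int) -> list[int]:
    """Side length of the sub-bands after levels 1..levels of a dyadic DWT."""
    sizes = []
    for _ in range(levels):
        side = (side + 1) // 2  # each level halves the resolution (rounded up)
        sizes.append(side)
    return sizes


def deepest_sizes() -> dict[str, int]:
    """Side length of the coarsest sub-band at every scale."""
    return {
        name: subband_sizes(INPUT_SIZE // STRIDES[name], depth)[-1]
        for name, depth in DEPTHS.items()
    }


if __name__ == "__main__":
    for name, depth in DEPTHS.items():
        side = INPUT_SIZE // STRIDES[name]
        print(name, side, "->", subband_sizes(side, depth))
```

Under these assumed sizes, the coarsest sub-bands of all four scales end up only a few pixels wide, which fits the stated concern that decomposing low-resolution features any deeper would leave very sparse high-frequency components.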

Given a feature map $F_{i}$ at an arbitrary scale, WFC first performs an $L_{i}$-level discrete wavelet transform (DWT) to decompose the feature into one low-frequency sub-band $X_{LL}^{l}$ and three directional high-frequency sub-bands $\{X_{LH}^{l}, X_{HL}^{l}, X_{HH}^{l}\}$, corresponding to horizontal, vertical, and diagonal edge and texture responses, respectively:

$$\mathrm{WT}^{l}(F_{i}) = \{X_{LL}^{l}, X_{LH}^{l}, X_{HL}^{l}, X_{HH}^{l}\}, \quad l = 1, \dots, L_{i} \tag{1}$$

The low-frequency sub-band $X_{LL}^{l}$ preserves the main structural and semantic information and is recursively fed into the next wavelet level, where it is further decomposed into new low- and high-frequency sub-bands:

$$\mathrm{WT}^{l+1}(X_{LL}^{l}) = \{X_{LL}^{l+1}, X_{LH}^{l+1}, X_{HL}^{l+1}, X_{HH}^{l+1}\} \tag{2}$$

The three high-frequency sub-bands capture directional edge and fine-texture information along the horizontal, vertical, and diagonal directions. To make these components learnable and enhanceable within the network, we concatenate the four sub-bands along the channel dimension and apply a lightweight depthwise convolution followed by a learnable scaling factor $s_{l}$ to achieve direction-sensitive enhancement:

$$\tilde{X}_{l} = s_{l} \cdot \mathrm{Conv}_{dw}\left(\mathrm{Concat}\left[X_{LL}^{l}, X_{LH}^{l}, X_{HL}^{l}, X_{HH}^{l}\right]\right) \tag{3}$$

The enhanced sub-bands are then progressively reconstructed via the inverse discrete wavelet transform (IDWT), following a deep-to-shallow order, to obtain the enhanced spatial-domain feature at the corresponding scale:

$$F_{i}^{enh} = \mathrm{IDWT}_{l = L_{i} \to 1}\left(\tilde{X}_{l}\right) \tag{4}$$

Finally, the reconstructed feature is residually fused with the spatial-domain convolutional feature at the same scale to produce the output feature:

$$F_{i}^{W} = F_{i}^{enh} + \alpha_{i} \cdot \mathrm{Conv}(F_{i}) \tag{5}$$

where $\alpha_{i}$ denotes a learnable channel-wise fusion coefficient that is adaptively optimized via backpropagation to balance the contributions of frequency-domain enhancement and the original spatial semantic representation.
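The pipeline of Eqs. (1) to (5) can be sketched in NumPy for a single-channel feature map. This is a minimal illustration under several assumptions that the text does not confirm: a Haar wavelet basis, a 3 × 3 depthwise kernel per sub-band, the deeper reconstruction being added to the enhanced low-frequency band at each level, and the spatial-domain `Conv` in Eq. (5) replaced by the identity. All function names are hypothetical.

```python
import numpy as np


def haar_dwt2(x):
    """One-level 2D Haar DWT of an (H, W) array with even H and W."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a + b - c - d) / 2.0
    hl = (a - b + c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


def depthwise_conv3x3(bands, kernels):
    """One 3x3 kernel per sub-band with zero padding.

    bands: (4, h, w) stacked sub-bands; kernels: (4, 3, 3).
    """
    _, h, w = bands.shape
    padded = np.pad(bands, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(bands)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def wfc(feature, kernels, scales, alpha, levels):
    """Sketch of Eqs. (1)-(5); H and W must be divisible by 2**levels.

    kernels: (levels, 4, 3, 3); scales: one s_l per level; alpha: fusion weight.
    """
    enhanced = []
    ll = feature
    for level in range(levels):
        # Eqs. (1)-(2): decompose, then recurse on the raw low-frequency band.
        ll, lh, hl, hh = haar_dwt2(ll)
        bands = np.stack([ll, lh, hl, hh])
        # Eq. (3): depthwise convolution followed by the scale s_l.
        enhanced.append(scales[level] * depthwise_conv3x3(bands, kernels[level]))
    # Eq. (4): deep-to-shallow reconstruction. Adding the deeper
    # reconstruction to the enhanced LL band is an assumption.
    recon = np.zeros_like(enhanced[-1][0])
    for bands in reversed(enhanced):
        recon = haar_idwt2(bands[0] + recon, bands[1], bands[2], bands[3])
    # Eq. (5): residual fusion, with Conv(F_i) taken as the identity here.
    return recon + alpha * feature
```

With a single level, identity kernels, `scales=[1.0]`, and `alpha=0`, the function returns its input unchanged, which is a convenient check that the transform pair is lossless before any learned enhancement is applied.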

To further verify that the proposed WFE-Net can effectively compensate for high-frequency information during feature enhancement, we visualize and compare the responses of the feature $F_{1}$ before enhancement and the enhanced feature $F_{1}^{W}$ in both the spatial domain and the frequency domain. The results are shown in Figure 3.

In the spatial-domain heatmaps, the responses of the original features to text regions usually appear as blob-like patterns with blurred boundaries. This indicates that standard convolution struggles to maintain precise structural localization for small targets during continuous downsampling. After being processed by WFE-Net, the color of the text regions becomes darker and the responses are significantly enhanced. The features also exhibit clearer line-like and skeleton-like structures.

In the frequency domain, we compute the two-dimensional amplitude spectrum to explicitly observe the frequency distribution of features. Before enhancement, the high-frequency regions (i.e., the outer areas of the spectrum) contain relatively weak energy. In contrast, WFE-Net significantly increases the brightness of these outer regions. From a quantitative perspective, the high-frequency energy ratio (HFER) is substantially improved. For example, in the second case, it increases from 2.40% to 3.12%.
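The text does not define how HFER is computed. One plausible reading is the share of spectral energy that lies beyond a radial frequency cutoff in the centered two-dimensional spectrum. The sketch below follows that reading; the cutoff of 0.25 cycles per pixel and the function name are assumptions, so it is not expected to reproduce the reported percentages.

```python
import numpy as np


def high_frequency_energy_ratio(feature, cutoff=0.25):
    """Fraction of spectral energy at radial frequency above `cutoff`.

    feature: 2D array (a single channel of a feature map).
    cutoff:  radial frequency threshold in cycles per pixel (assumed value).
    """
    h, w = feature.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feature))
    energy = np.abs(spectrum) ** 2  # squared amplitude spectrum

    # Frequency coordinates of every bin, shifted to match the spectrum layout.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)

    total = energy.sum()
    if total == 0:
        return 0.0
    return float(energy[radius > cutoff].sum() / total)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    smooth = np.ones((16, 16))
    noisy = smooth + 0.1 * rng.standard_normal((16, 16))
    print(high_frequency_energy_ratio(smooth))
    print(high_frequency_energy_ratio(noisy))
```

A constant map has all of its energy in the zero-frequency bin and scores 0, while a checkerboard that alternates every pixel has all of its energy at the highest representable frequency and scores 1.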

Through this design, WFE-Net effectively compensates for high-frequency information related to text shapes in each feature stream without altering the original scale hierarchy. As a result, the enhanced multi-scale features exhibit stronger structural discriminability, improved texture robustness, and better semantic preservation before being fed into subsequent modules, thereby significantly improving text detection quality in complex scenes.

3.3. FIRM

In scene text detection, feature fusion is often limited by weak interaction between low-level structural details and high-level semantics. Although WFE-Net enhances high-frequency textures, frequency-domain enhancement alone cannot fully exploit semantic cues in complex scenes. In dense text or cluttered backgrounds, local high-frequency responses lack contextual support and are easily corrupted by noise. To this end, we design the Feature Interaction Refinement Module (FIRM), which employs a structured cross-stream feature interaction mechanism to effectively inject global semantic information from the original backbone features into the high-frequency-enhanced features, while simultaneously suppressing noisy responses. This design enables the construction of multi-scale representations that are both semantically consistent and structurally explicit.

The original multi-scale features $F_{i}$ extracted by ResNet-50 and the WFE-Net-enhanced multi-scale features $F_{i}^{W}$ are separately processed by a Feature Pyramid Network (FPN), producing the corresponding pyramid features $P_{i}$ and $P_{i}^{W}$. These two feature streams serve as dual inputs to FIRM. The core component of FIRM is the Dual-Path Interaction Transformer (DRIT), whose primary objective is to introduce stable and controllable semantic information while preserving high-frequency structural localization capability. The DRIT architecture is shown in Figure 4.

In DRIT, the high-frequency-enhanced features are used as the query stream, while the original semantic features serve as the key/value stream. This design is motivated by the following considerations. Text edges and stroke structures exhibit stronger spatial localization certainty and thus provide reliable anchors for semantic alignment. In contrast, if semantic features are used as queries, the attention responses tend to diffuse in complex backgrounds. This diffusion weakens structural discriminability. Specifically, the highest-level pyramid feature $P_{H}^{W}$ is selected as the query stream, while $P_{H}$ is used as the key/value stream. After adding learnable positional encodings, both streams are linearly projected to construct the embedding representations required for attention computation.

To simultaneously achieve fine-grained semantic selection and robust noise suppression, DRIT introduces a dual-path attention mechanism, which collaboratively models Softmax attention and Sigmoid-gated attention. The final attention weights are defined as:

$$A = \mathrm{Softmax}\left(\frac{q k^{\top}}{\sqrt{d_{k}}}\right) \odot \mathrm{Sigmoid}\left(\sum_{j} \left(q k^{\top}\right)\right) \tag{6}$$

The Softmax branch captures local semantic matching relationships between query and key, emphasizing semantic regions that are most relevant to high-frequency structures. Meanwhile, the Sigmoid gating branch performs global response statistics for each query position, suppressing spurious activations caused by complex backgrounds, texture clutter, or artifacts introduced by frequency-domain enhancement. The element-wise fusion of the two branches endows the attention mechanism with both selectivity and robustness.
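A single-head NumPy sketch of Eq. (6) may help to see how the two branches combine. It assumes that the sum over $j$ runs over the key positions, so that the gate is one scalar per query, and it omits the positional encodings, linear projections, and any multi-head split. The function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    """Numerically stable sigmoid: 1 / (1 + exp(-x))."""
    return np.exp(-np.logaddexp(0.0, -x))


def dual_path_attention(q, k, v):
    """Softmax attention modulated by a per-query sigmoid gate.

    q: (Nq, d) queries from the high-frequency-enhanced stream.
    k: (Nk, d) keys and v: (Nk, dv) values from the semantic stream.
    Returns the fused weights A of shape (Nq, Nk) and the product A @ v.
    """
    d_k = q.shape[-1]
    logits = q @ k.T                                     # (Nq, Nk)
    soft = softmax(logits / np.sqrt(d_k), axis=-1)       # selective branch
    gate = sigmoid(logits.sum(axis=-1, keepdims=True))   # (Nq, 1) global branch
    attn = soft * gate                                   # element-wise fusion
    return attn, attn @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = 0.1 * rng.standard_normal((5, 8))
    k = 0.1 * rng.standard_normal((7, 8))
    v = rng.standard_normal((7, 8))
    attn, out = dual_path_attention(q, k, v)
    print(attn.sum(axis=-1))  # each row sums to its gate value, not to 1
```

Unlike plain Softmax attention, the rows of the fused weights sum to the gate value rather than to one, so a query whose overall response to the keys is weak or negative contributes little to the injected feature.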

Based on the fused attention weights, semantic features are adaptively injected into the high-frequency features:

$$\tilde{F}_{H} = A \cdot v \tag{7}$$

Subsequently, the semantically injected feature $\tilde{F}_{H}$ is added to the original highest-level semantic feature $P_{H}$, and further transformed through a residual feed-forward network and normalization layers to obtain a stable, semantically enhanced high-level representation:

$$F_{H}^{attn} = P_{H} + \tilde{F}_{H} + \mathrm{MLP}\left(\mathrm{LayerNorm}\left(\tilde{F}_{H}\right)\right) \tag{8}$$

This design preserves the global semantic consistency of the original features while introducing structure-aligned semantic information via residual injection, effectively avoiding structural degradation caused by excessive semantic dominance. However, relying solely on the linear projections and feed-forward mappings of Transformers remains insufficient for precise spatial structure modeling. To further strengthen local spatial consistency and fully integrate the original semantic features with the injected representations, DRIT incorporates a lightweight convolutional combination module at the output stage to structurally remap $F_{H}^{attn}$:

$$F_{H}^{DRIT} = \mathrm{Conv}_{1}\left(F_{H}^{attn}\right) + \mathrm{Conv}_{2}\left(F_{H}^{attn}\right) \tag{9}$$

where $\mathrm{Conv}_{1}$ consists of two convolution layers followed by ReLU activation, enhancing local context modeling and nonlinear representation capacity, and $\mathrm{Conv}_{2}$ is a 1 × 1 convolution used for channel recalibration and information compression. This convolutional branch complements the Transformer output, enabling high-level features to jointly capture global dependency modeling and local structural perception.

To fully exploit the multi-scale pyramid representation, FIRM further adopts a top–down progressive propagation strategy. The highest-level DRIT output $F_{H}^{DRIT}$ is propagated downward along the feature pyramid and fused with the corresponding scale-wise high-frequency-enhanced features $P_{i}^{W}$, yielding the complete set of multi-scale-enhanced features $\{F_{i}^{FIRM}\}_{i=1}^{H}$:

$$F_{H}^{FIRM} = F_{H}^{DRIT} \tag{10}$$

$$F_{i}^{FIRM} = P_{i}^{W} + \mathrm{Upsample}\left(F_{i+1}^{FIRM}\right) \tag{11}$$

Through this hierarchical refinement process, high-level semantic information is effectively transmitted to lower-level features while maintaining structural consistency. As a result, each scale inherits the high-frequency texture information provided by WFE-Net and is simultaneously constrained by global semantic cues from higher layers, delivering high-quality, multi-scale, and semantically consistent feature representations for subsequent encoder and contour generation modules.
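The top-down recursion in Equations (10) and (11) can be sketched in a few lines. This is an illustrative sketch only: it assumes single-channel feature maps stored as nested lists, a scale factor of 2 between adjacent pyramid levels, and nearest-neighbour upsampling (the interpolation mode is not specified in the text). The function names `upsample_nearest` and `firm_top_down` are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def upsample_nearest(x: Grid, factor: int = 2) -> Grid:
    """Nearest-neighbour upsampling of a 2D grid by an integer factor (assumed mode)."""
    out: Grid = []
    for row in x:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two grids of identical shape."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def firm_top_down(p_w: List[Grid], f_drit: Grid) -> List[Grid]:
    """Return [F_1, ..., F_H].

    p_w holds the high-frequency-enhanced maps P_1^W ... P_{H-1}^W (finest first);
    f_drit is the DRIT output at the coarsest level H.
    """
    outputs = [f_drit]                      # Eq. (10): F_H = F_H^DRIT
    for p in reversed(p_w):                 # Eq. (11): levels H-1 down to 1
        outputs.append(add(p, upsample_nearest(outputs[-1])))
    outputs.reverse()
    return outputs


# Toy pyramid with H = 3: a 4x4 map, a 2x2 map, and a 1x1 DRIT output.
p1 = [[1.0] * 4 for _ in range(4)]
p2 = [[10.0] * 2 for _ in range(2)]
pyramid = firm_top_down([p1, p2], [[100.0]])
```

In this toy pyramid every cell of the finest map ends up as 1 + 10 + 100 = 111, which shows how the coarsest-level signal reaches every lower level through repeated addition.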

3.4. Transformer Encoder

In our model, we adopt a standard multi-scale deformable Transformer encoder to establish global contextual dependencies before the enhanced features are fed into the prediction stage. The encoder consists of six stacked layers. Each layer comprises a multi-scale deformable self-attention module and a feed-forward network, with residual connections to ensure stable training. Within each layer, the self-attention mechanism performs sparse sampling over the multi-scale feature maps for each query position, adaptively focusing on regions relevant to text structures. The feed-forward network further enhances the feature representation capability. After six iterations, the encoder outputs features that not only integrate cross-scale information but also capture global semantic dependencies, providing unified and robust feature representations for the decoder and subsequent prediction modules.

3.5. Contour-Former

Contour-Former decodes the enhanced multi-scale features using a multi-scale deformable Transformer to produce text instance masks and contour control point coordinates. During decoding, initial queries are selected from the feature maps by a Top-K module and are associated with reference points, providing initial spatial guidance for the decoder. The decoder processes multi-scale features layer by layer: the first three layers predict text instance masks via segmentation and generate spatial anchor priors based on the masks, providing stable region-level localization references for subsequent control point regression. The last three layers progressively refine the control point coordinates under the guidance of these anchor priors, achieving high-precision text contour fitting.
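As a rough illustration of the Top-K query initialization, the sketch below picks the K highest-scoring locations of a score map and attaches a normalized reference point to each. It assumes a single-scale map of per-location text scores and cell-center reference points in [0, 1]; the function name `select_topk_queries` and this scoring setup are hypothetical and are not taken from the paper's implementation.

```python
import heapq
from typing import List, Tuple

Query = Tuple[float, Tuple[float, float]]


def select_topk_queries(scores: List[List[float]], k: int) -> List[Query]:
    """Return the k best (score, (x, y)) pairs, with (x, y) normalized to [0, 1]."""
    h, w = len(scores), len(scores[0])
    candidates = [
        (scores[r][c], ((c + 0.5) / w, (r + 0.5) / h))  # cell center as reference point
        for r in range(h)
        for c in range(w)
    ]
    return heapq.nlargest(k, candidates, key=lambda item: item[0])


toy_scores = [[0.1, 0.9],
              [0.8, 0.2]]
queries = select_topk_queries(toy_scores, k=2)
```

Each selected entry pairs a confidence with a spatial location, which is the kind of initial guidance the decoder layers refine afterwards.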

For the input to Contour-Former, the multi-scale-enhanced features from FIRM are fused with the multi-scale features produced by the encoder via an FPN, forming feature representations that integrate both high-frequency texture information and global semantic context. To ensure efficient decoding, the query features adopt the decoupled self-attention mechanism used in prior works such as TESTR [5]. For N text instances (each containing K control points), the computational complexity of standard global attention is

O (N^{2} K^{2} C)

. By decomposing the attention process, intra-group self-attention is first applied to capture local dependencies, and inter-group self-attention is then used to integrate global relationships. As a result, the overall computational complexity is significantly reduced to

O (N K^{2} C + K N^{2} C)

. This design successfully endows the model with high scalability in dense text scenarios. Next, multi-scale deformable cross-attention allows queries to focus on key regions relevant to the text instances, followed by a feed-forward network that further enhances feature representation capability.
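The two complexity expressions can be compared numerically. The sketch below counts only the dominant terms given above, using N = 100 queries and K = 16 control points from the implementation details; the channel width C = 256 is an assumption made for illustration, and it cancels in the ratio.

```python
def full_attention_cost(n: int, k: int, c: int) -> int:
    """Dominant cost of global attention over all N*K point queries: N^2 K^2 C."""
    return (n * k) ** 2 * c


def factorized_attention_cost(n: int, k: int, c: int) -> int:
    """Intra-group (N K^2 C) plus inter-group (K N^2 C) attention cost."""
    return n * k * k * c + k * n * n * c


N, K, C = 100, 16, 256  # C is an assumed channel width
full = full_attention_cost(N, K, C)
factorized = factorized_attention_cost(N, K, C)
ratio = full / factorized  # equals N*K / (N + K)
```

Under these settings the ratio simplifies to NK / (N + K) = 1600 / 116, so the factorized form is roughly 13.8 times cheaper in this count.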

The mask prediction layers generate text instance masks using the highest-resolution features and compute anchor priors via a mask-weighted strategy, reflecting the spatial center positions of the text instances. The regression layers take the enhanced query features and the previous layer’s reference points as input, and employ intra-/inter-group self-attention, multi-scale deformable cross-attention, and feed-forward networks to progressively refine the control point coordinates, ultimately producing the final text contour representations.
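One plausible reading of the mask-weighted strategy is a probability-weighted centroid of the predicted mask. The sketch below follows that reading as an assumption; the exact weighting used by the authors is not given in the text, and the function name `mask_weighted_center` is hypothetical.

```python
from typing import List, Optional, Tuple


def mask_weighted_center(mask: List[List[float]]) -> Optional[Tuple[float, float]]:
    """Return the mask-weighted centroid (x, y) in normalized coordinates.

    mask holds per-pixel probabilities in [0, 1]; None is returned for an empty mask.
    """
    h, w = len(mask), len(mask[0])
    total = sum(sum(row) for row in mask)
    if total == 0:
        return None
    cx = sum(m * (c + 0.5) / w for row in mask for c, m in enumerate(row)) / total
    cy = sum(m * (r + 0.5) / h for r, row in enumerate(mask) for m in row) / total
    return cx, cy


toy_mask = [[0.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 1.0, 0.0],
            [0.0, 1.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 0.0]]
center = mask_weighted_center(toy_mask)
```

For the symmetric toy mask the centroid falls at the image center, (0.5, 0.5), which would then act as the spatial anchor for the regression layers.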

Thanks to the front-end high-frequency structural enhancement and cross-layer feature interaction mechanisms, the features fed into Contour-Former exhibit significantly improved structural integrity and semantic consistency. This provides a stable and discriminative feature foundation for instance-level decoding, substantially enhancing Contour-Former’s ability to localize and fit text contours in complex scenes.

3.6. Loss Function

To effectively train the proposed multi-scale text detection framework, we adopt a joint loss function consisting of classification, mask prediction, control point regression, bounding box regression, and GIoU terms, enabling end-to-end optimization. This loss function aims to ensure accurate text instance classification while enhancing the spatial integrity of instance masks and the geometric precision of contour control points, thereby achieving stable localization and precise fitting for arbitrary-shaped text.

Let the model outputs be outputs, ground truth labels be targets, the matching set be indices, and the total number of samples be

N_{inst}

. The overall loss can be formulated as:

L = λ_{cls} L_{cls} + λ_{mask} L_{mask} + λ_{ctrl} L_{ctrl} + λ_{bbox} L_{bbox} + λ_{giou} L_{giou}

(12)

To balance the gradient scales of different terms, we set the loss weights as

λ_{cls} = 2

,

λ_{mask} = 5, λ_{ctrl} = 5, λ_{bbox} = 1, λ_{giou} = 1

, ensuring balanced and stable optimization of classification, mask, and control point regression during training.
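The weighted sum in Equation (12) with the weights listed above can be written as a small helper. This is a minimal sketch: the dictionary keys and the function name `total_loss` are hypothetical, and the individual loss values in the example are made-up numbers used only to show the arithmetic.

```python
from typing import Dict

# Loss weights as stated in the text.
LOSS_WEIGHTS: Dict[str, float] = {
    "cls": 2.0, "mask": 5.0, "ctrl": 5.0, "bbox": 1.0, "giou": 1.0,
}


def total_loss(terms: Dict[str, float], weights: Dict[str, float] = LOSS_WEIGHTS) -> float:
    """Weighted sum of the individual loss terms, as in Eq. (12)."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Illustrative values only, not measurements from the paper.
example_terms = {"cls": 0.5, "mask": 0.2, "ctrl": 0.1, "bbox": 0.3, "giou": 0.4}
example_total = total_loss(example_terms)  # 2*0.5 + 5*0.2 + 5*0.1 + 0.3 + 0.4
```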

The classification loss

L_{cls}

employs a weighted Sigmoid Focal Loss for each predicted category:

L_{cls} = FocalLoss (pred\_logits, target\_onehot, α, γ)

(13)

This suppresses easy-to-classify samples and emphasizes hard examples, encouraging the model to focus on challenging text instances in complex backgrounds. The mask loss

L_{mask}

combines Dice Loss and binary cross-entropy (BCE) to optimize the shape and edge accuracy of text instance masks:

L_{mask} = DiceLoss (\hat{M}, M) + BCE (\hat{M}, M)

(14)

where

\hat{M}

and M denote the predicted and ground-truth masks, respectively. Additionally, auxiliary supervision is applied on lower-resolution masks to enhance local texture representation. The control point regression loss

L_{ctrl}

applies L1 regression on key points, incorporating the anchor priors A and reference points

R_{1}

generated from the segmentation layers:

L_{ctrl} = \frac{1}{N_{inst}} \sum_{i = 1}^{N_{inst}} {‖ {\hat{R}}_{i} - R_{i} ‖}_{1}

(15)

where

{\hat{R}}_{i}

and

R_{i}

denote the predicted and target control points, respectively. The bounding box regression loss

L_{bbox}

and GIoU loss

L_{giou}

optimize the spatial localization and coverage accuracy of text instances:

L_{bbox} = \frac{1}{N_{inst}} \sum_{i} {‖ {\hat{B}}_{i} - B_{i} ‖}_{1}

(16)

L_{giou} = \frac{1}{N_{inst}} \sum_{i} (1 - GIoU ({\hat{B}}_{i}, B_{i}))

(17)

During training, a matching algorithm determines correspondences between predictions and targets, based on which all the above loss terms are computed. For the multi-layer decoder structure, auxiliary losses are applied on intermediate outputs to improve gradient propagation and training stability. Additionally, a weighted sampling strategy is adopted for regions with high uncertainty in the text masks, encouraging the model to focus on challenging regions under complex backgrounds, thereby improving overall detection accuracy and robustness. This joint loss does not introduce new loss forms but rather provides a rational combination tailored to the proposed structure–semantic collaborative framework, ensuring that the high-frequency-enhanced features and semantic injection mechanisms are fully constrained and jointly optimized during training.
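For reference, the individual terms in Equations (13), (14), and (17) can be sketched with scalar arithmetic. The sketch assumes the common focal-loss defaults α = 0.25 and γ = 2 (the paper does not state its values), flattened masks given as probability lists, and axis-aligned boxes in (x1, y1, x2, y2) form. All function names are hypothetical.

```python
import math
from typing import Sequence


def sigmoid_focal_loss(logit: float, target: float,
                       alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Sigmoid focal loss for one logit; alpha and gamma are assumed defaults."""
    p = 1.0 / (1.0 + math.exp(-logit))
    ce = -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    p_t = p * target + (1.0 - p) * (1.0 - target)
    alpha_t = alpha * target + (1.0 - alpha) * (1.0 - target)
    return alpha_t * (1.0 - p_t) ** gamma * ce  # down-weights easy examples


def mask_loss(pred: Sequence[float], target: Sequence[float], eps: float = 1e-6) -> float:
    """Dice loss plus mean binary cross-entropy over a flattened mask, as in Eq. (14)."""
    inter = sum(p * t for p, t in zip(pred, target))
    dice = 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)
    bce = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithm finite
        bce -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return dice + bce / len(pred)


def giou_loss(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - GIoU for two axis-aligned boxes (x1, y1, x2, y2), one summand of Eq. (17)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    giou = inter / union - (hull - union) / hull
    return 1.0 - giou
```

A confidently correct prediction (for example, logit 4 for a positive target) yields a much smaller focal loss than an uncertain one (logit 0), which is the hard-example emphasis described above.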

4. Experiments

4.1. Datasets

To comprehensively evaluate the text detection performance of the proposed model under diverse scenarios, we conduct training and evaluation on multiple public benchmarks. These datasets cover a wide range of challenging text scenes, including synthetic and real-world images, horizontal and curved text instances, as well as multilingual settings.

Total-Text [32]: It contains 1555 images, of which 1255 are used for training and 300 for testing. The annotations are primarily provided at the English word level and cover three text orientations: horizontal, multi-oriented, and curved text. More than half of the images include a combination of multiple orientations, emphasizing the evaluation of detection performance on complex text shapes.

CTW1500 [33]: It is a dataset focused on curved text detection, consisting of 1500 images, with 1000 images for training and 500 for testing. Each text instance is finely annotated using a 14-point polygon, enabling accurate representation of elongated, curved, or distorted text shapes.

ICDAR2015 [34]: A dataset for evaluating text detection in multi-directional scenes, comprising 1000 training images and 500 test images. The images are mostly candid shots and exhibit complex conditions such as distortion and blurring. All text instances are annotated using word-level bounding boxes, which are widely used to evaluate a model’s ability to localize tilted text.

SynthText 150k [35]: It is a large-scale synthetic dataset used for pretraining text detection and recognition models. The dataset is generated by rendering synthetic text onto real background images and exhibits rich variations in fonts, colors, scales, orientations, and illumination conditions. Although composed of synthetic images, it provides high visual realism, which effectively enhances the model’s initial feature learning ability and reduces the risk of overfitting on real-world datasets, making it widely adopted for the pretraining stage.

MLT2017 [36]: It is a dataset designed for multilingual scene text detection and recognition, containing 10,000 images that cover nine languages (e.g., Chinese, English, Arabic, etc.). The dataset is split into 7200 training images, 1800 validation images, and 1000 test images. Text instances appear in various forms, including horizontal, inclined, and curved shapes, with complex backgrounds, making it a widely used benchmark for evaluating cross-lingual generalization and robustness in challenging scenes.

4.2. Implementation Details

All experiments are conducted under an Ubuntu 20.04 environment on a single NVIDIA RTX 3090 GPU. The software stack includes Python 3.8, PyTorch 1.11.0, and CUDA 11.3, ensuring the reproducibility of the experimental results.

During the pretraining stage, we train on a hybrid dataset composed of SynthText 150k, Total-Text, and MLT2017 for 40,000 iterations. The backbone network is optimized with a learning rate of

1 \times 10^{- 5}

, while the remaining modules use an initial learning rate of

1 \times 10^{- 4}

, which is decayed by a factor of 10 at 240,000 iterations. AdamW is employed for end-to-end training with a weight decay of

1 \times 10^{- 4}

, and gradient clipping with a maximum norm of 0.1 is applied to enhance the stability of the deep network. Subsequently, the best pretrained weights are loaded for finetuning on the Total-Text, CTW1500 and ICDAR2015 datasets for 30,000 iterations each, with initial learning rates set to

1 \times 10^{- 4}

,

5 \times 10^{- 5}

and

1 \times 10^{- 5}

, respectively, and a 10× decay applied uniformly at 24,000 iterations.
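The finetuning schedule described above is a single step decay. The sketch below encodes it under the assumption that the reduced rate takes effect from iteration 24,000 onward; the helper name `step_lr` and the dictionary `FINETUNE_BASE_LR` are hypothetical.

```python
# Initial finetuning learning rates as stated in the text.
FINETUNE_BASE_LR = {"Total-Text": 1e-4, "CTW1500": 5e-5, "ICDAR2015": 1e-5}


def step_lr(iteration: int, base_lr: float,
            decay_at: int = 24_000, gamma: float = 0.1) -> float:
    """Learning rate at a given iteration with one 10x step decay."""
    return base_lr * (gamma if iteration >= decay_at else 1.0)


# Rates before and after the decay point for each finetuning dataset.
schedule = {
    name: (step_lr(0, lr), step_lr(24_000, lr))
    for name, lr in FINETUNE_BASE_LR.items()
}
```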

The Transformer module employs 8 attention heads and 4 deformable sampling points to achieve efficient sparse feature aggregation. The model’s Contour-Former consists of 3 segmentation layers followed by 3 regression layers, responsible for predicting both binary masks and precise polygonal contours composed of 16 control points. During detection, 100 learnable queries are used to cover text instances of varying scales, shapes, and quantities.

For data augmentation, a multi-scale strategy is adopted during training: the short side of images is randomly sampled between 480 and 896 pixels (with the long side capped at 1600 pixels), combined with random cropping, horizontal flipping, and photometric distortions to improve robustness and generalization. During testing, images are resized such that the short side is 1000 pixels (long side not exceeding 1800 pixels), and a confidence threshold of 0.4 is applied to filter the final detection results.
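The test-time resizing rule can be made concrete as follows. This sketch assumes the usual shortest-edge convention, in which the short side is scaled to 1000 pixels and the whole image is scaled down further only if the long side would exceed 1800 pixels; the paper does not spell out the rounding, and the function name `resize_shape` is hypothetical.

```python
from typing import Tuple


def resize_shape(height: int, width: int,
                 short_side: int = 1000, max_long_side: int = 1800) -> Tuple[int, int]:
    """Return the (height, width) after aspect-preserving test-time resizing."""
    scale = short_side / min(height, width)
    if max(height, width) * scale > max_long_side:
        # The long side would overflow the cap, so the cap decides the scale.
        scale = max_long_side / max(height, width)
    return round(height * scale), round(width * scale)


landscape = resize_shape(720, 1280)   # short side reaches 1000, long side stays under the cap
panorama = resize_shape(500, 2000)    # long side is capped at 1800
```

Because the input resolution depends on each dataset's image shapes, this rule also accounts for FLOPs differing across benchmarks for the same model.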

4.3. Results

4.3.1. Comparison of Overall Results

To comprehensively evaluate the effectiveness of the proposed high-frequency detail enhancement and cross-scale semantic–detail feature interaction mechanisms in complex natural scenes, experiments were conducted on three mainstream text detection benchmarks: CTW1500, Total-Text and ICDAR2015.

As shown in Table 1, on CTW1500, our method achieves state-of-the-art performance with 91.7% Precision, 85.7% Recall, and an F-score of 88.6%. Compared with the previous best method, LRANet (87.4% F-score), our approach improves by 1.2 percentage points; relative to recent methods such as KAC (86.8% F-score) and TextBPN++ (86.5% F-score), the improvement reaches 1.8–2.1 percentage points. Considering that many text instances in CTW1500 exhibit high curvature, large span, and complex contour structures, the significant performance advantage demonstrates that the proposed frequency-domain enhancement and feature interaction strategy effectively strengthens structural continuity and boundary consistency modeling in text regions, thereby substantially improving detection performance for long curved and arbitrary-shaped text instances.

As shown in Table 2 and Table 3, our method demonstrates outstanding performance on both the Total-Text and ICDAR2015 benchmark datasets. On Total-Text, our method achieves 89.3% Precision, 86.0% Recall, and an F-score of 87.6%, surpassing all existing methods based on the ResNet-50 backbone. Notably, the improvement in Recall is the most significant: compared with the best baseline I3CL (84.2%), it increases by 1.8 percentage points, and relative to mainstream methods (ranging from 82–83%), it improves by 3–4 percentage points. Furthermore, on the highly challenging ICDAR2015 dataset, our method achieves 90.9% Precision, 85.5% Recall, and 88.1% F-score. Compared with the classic DBNet++ (87.3%) and the recent STD (87.0%), our approach yields an F-score improvement of 0.8–1.1 percentage points. These results demonstrate that the proposed method can more comprehensively localize text regions and effectively reduce missed detections across various complex scenarios. The advantage primarily stems from WFE-Net’s effective enhancement of high-frequency structural information and FIRM’s suppression of irrelevant background responses and semantic consistency modeling during multi-scale feature fusion, thereby improving the model’s overall recall ability under complex backgrounds and varied text shapes.
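The reported F-scores follow from the reported Precision and Recall values through the harmonic mean, which gives a quick consistency check. The sketch below assumes the standard definition F = 2PR / (P + R); the names `f_score` and `REPORTED` are hypothetical, and the numbers are those quoted in the text.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (Precision, Recall, reported F-score) as quoted in the text.
REPORTED = {
    "CTW1500": (91.7, 85.7, 88.6),
    "Total-Text": (89.3, 86.0, 87.6),
    "ICDAR2015": (90.9, 85.5, 88.1),
}

recomputed = {name: round(f_score(p, r), 1) for name, (p, r, _) in REPORTED.items()}
```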

The computational efficiency and model complexity of HFI-Former are summarized in Table 1, Table 2 and Table 3. As shown in the results, HFI-Former maintains a constant parameter count of 51.1 M across all benchmarks. However, the computational cost (FLOPs) varies according to the input resolutions of different datasets, requiring 228.9 G, 246.9 G, and 261.3 G FLOPs for CTW1500, Total-Text, and ICDAR2015, respectively. This fluctuation in computational demand leads to corresponding inference speeds of 9.4, 8.7, and 8.1 FPS. Although slower than real-time-oriented architectures (e.g., LRANet, KAC) due to our focus on deep frequency enhancement for complex shapes, HFI-Former outperforms representative high-performance models such as TESTR and TextFuseNet in both F-score (SOTA) and inference efficiency. This demonstrates a highly competitive speed-accuracy trade-off for challenging scene text detection tasks.

Figure 5 and Figure 6 show the visualization results of our model on the CTW1500, Total-Text and ICDAR2015 datasets. On all three datasets, our method achieves consistently optimal or near-optimal performance. This clearly demonstrates that the Wavelet Frequency-Enhanced Network (WFE-Net) and the Feature Interaction Refinement Module (FIRM) form a complementary and synergistic relationship at the feature level: WFE-Net strengthens the structural response of text regions, while FIRM injects cross-scale semantic information to enhance overall consistency and suppress irrelevant activations. As a result, the model exhibits stable and well-generalized detection performance across different datasets and complex scenarios.

4.3.2. Comparison with Other Feature Enhancement Methods

To further verify the effectiveness of our enhancement strategy, we compare HFI-Former with several representative methods. DBNet utilizes edge-aware adaptive threshold maps for better boundary localization. TextFuseNet employs high-resolution multi-level feature fusion to enhance the fine-grained representation of arbitrary-shaped text. In the frequency domain, FCE models shapes via Fourier coefficients but filters out high-frequency details. While HFENet compensates for high-frequency loss, HFI-Former uniquely integrates WFE-Net (structural enhancement) with FIRM (semantic interaction). This collaborative design effectively mitigates the background noise and erroneous activations prone to occur with pure high-frequency enhancement.

As shown in Table 4, HFI-Former achieves state-of-the-art F-scores on CTW1500 (88.6%) and Total-Text (87.6%), significantly outperforming DBNet, TextFuseNet, FCE, and HFENet. Specifically, we reach 91.7% Precision on CTW1500 and 86.0% Recall on Total-Text. These results demonstrate that combining high-frequency enhancement with semantic-guided interaction provides a more robust representation for complex scene text.

4.4. Ablation Studies

4.4.1. Ablation on Key Modules

To verify the effectiveness of the proposed modules, we conducted systematic ablation experiments on the CTW1500 and Total-Text datasets. The baseline model, with all enhancement modules removed, was used as a reference. WFE-Net and FIRM were then added individually and jointly to analyze their independent contributions and collaborative gains. The results are presented in Table 5 and Table 6.

With the incorporation of WFE-Net, Precision on the CTW1500 dataset increased from 88.62% to 90.32%, and F-score from 86.56% to 87.60%. On the Total-Text dataset, Precision improved from 85.70% to 88.88%, and F-score increased from 84.94% to 86.76%. This indicates that the frequency-based high-frequency enhancement mechanism effectively strengthens the structural response of text regions, enabling the model to achieve more accurate localization in complex backgrounds, thereby reducing false positives and improving detection accuracy.

When only FIRM is added, on CTW1500, Recall rises from 84.60% to 86.16%, and F-score increases to 87.54%. On Total-Text, Precision improved from 85.70% to 87.33%, Recall rises from 84.19% to 85.91%, and F-score from 84.94% to 86.61%. This demonstrates that FIRM’s cross-scale semantic injection and feature interaction mechanism enhance semantic consistency among features, allowing the model to maintain stable region coverage even in areas with complex backgrounds or significant structural variations.

More importantly, when WFE-Net and FIRM are enabled simultaneously, the model achieves the largest performance improvement: on CTW1500, the F-score reaches 88.58%, which is 2.02 percentage points higher than the baseline, and on Total-Text, the F-score reaches 87.62%, an increase of 2.68 percentage points. Both results exceed those obtained with either module alone. The visualization comparison is shown in Figure 7. WFE-Net provides a structurally enhanced feature foundation, while FIRM further injects global semantic constraints and suppresses irrelevant activations on this basis. Together, they form a complementary and synergistic interaction at the feature level, resulting in more stable and accurate text detection.

4.4.2. Ablation on WFE-Net Decomposition Levels

To quantitatively analyze the impact of high-frequency energy preservation on detection performance, we conduct an ablation study on the wavelet decomposition levels of the four multi-scale feature maps

\{F_{1}, F_{2}, F_{3}, F_{4}\}

. As shown in Table 7, we compare the proposed resolution-adaptive setting (4, 3, 2, 2) with configurations using a uniform decomposition depth. During the experiments, the FIRM module and all other hyperparameters are kept unchanged.

As shown in Table 7, compared with the baseline model using only FIRM, the proposed setting (4, 3, 2, 2) improves Precision by 2.68 percentage points. This indicates that WFE-Net compensates for the structural energy lost during downsampling through deep wavelet decomposition, thereby enhancing boundary localization accuracy.

In addition, we observe that the uniform deep decomposition setting (4, 4, 4, 4) achieves the highest Precision (92.00%), but the Recall drops sharply to 82.59%. This leads to an F-score even lower than the baseline. This result indicates that excessive decomposition on low-resolution features introduces background noise and disrupts semantic consistency.

In contrast, the proposed (4, 3, 2, 2) strategy achieves the best balance between Precision and Recall, reaching an F-score of 88.58%. This confirms the necessity of adaptively balancing structural enhancement and semantic abstraction.
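The effect of the Recall drop under the uniform (4, 4, 4, 4) setting can be checked by hand. The worked calculation below assumes the F-score is the harmonic mean of Precision and Recall and that the FIRM-only baseline is the CTW1500 configuration from Section 4.4.1, with an F-score of 87.54%.

```latex
F_{(4,4,4,4)} = \frac{2 \cdot P \cdot R}{P + R}
             = \frac{2 \times 92.00 \times 82.59}{92.00 + 82.59}
             = \frac{15196.56}{174.59}
             \approx 87.04\% \;<\; 87.54\% \;(\text{FIRM only}) \;<\; 88.58\% \;(4, 3, 2, 2)
```

Although Precision is the highest of all settings, the harmonic mean is pulled toward the lower Recall, so the overall score ends up about 0.5 percentage points below the FIRM-only baseline.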

4.4.3. Analysis of Backbone Independence

To further validate the backbone independence and scalability of our proposed method, we evaluate its performance on the CTW1500 dataset by replacing the default ResNet-50 with a higher-capacity architecture, ResNet-101.

As shown in Table 8, our method achieves further performance improvements with the stronger backbone. Specifically, the F-score increases from 88.58% to 89.82%, and the Recall improves by 1.80 percentage points, demonstrating that the proposed method benefits from richer feature representations. As expected, the use of ResNet-101 increases the model complexity to 70.7 M parameters and 301.4 G FLOPs, reducing the inference speed from 9.4 FPS to 5.8 FPS. Despite the increased computational cost, the steady performance gains indicate that our approach is robust across different backbone architectures and can leverage deeper features to handle challenging curved text instances.

5. Failure Analysis

Although HFI-Former performs well overall, it shows limitations in certain cases (Figure 8). For large-scale text or complex textures, WFE-Net’s high sensitivity to local high-frequency structures can cause multiple queries to activate simultaneously, leading to duplicate detections. Conversely, low-contrast misses often result from physical signal loss under poor lighting. When text and background are highly similar, degraded high-frequency components make it difficult for WFE-Net to extract effective enhancement anchors, and the FIRM module alone cannot fully recover boundaries from such weak semantic cues. Overall, these cases highlight a trade-off between high-detail sensitivity and robustness in extreme imaging conditions.

6. Discussion and Conclusions

This paper addresses two issues in natural scene text detection: the loss of high-frequency details caused by downsampling and feature aggregation, and the semantic–detail decoupling caused by insufficient synergy between semantic and structural information. We propose HFI-Former, a high-frequency enhancement–feature interaction Transformer framework for natural scene text detection. The core innovation lies in integrating frequency-domain high-frequency enhancement and semantic-aware feature interaction at the feature representation stage, thereby improving the structural sensitivity and semantic consistency of features as a whole. In terms of framework design, WFE-Net compensates for detail degradation caused by backbone downsampling, while FIRM enables effective interaction between high-frequency structures and cross-scale semantic information, enhancing detection robustness in complex backgrounds. A multi-scale deformable Transformer encoder is employed to strengthen long-range dependency modeling, combined with a contour generation strategy based on segmentation and prior-guided regression to achieve more precise text instance boundaries. Experimental results show that our method outperforms existing approaches on the CTW1500, Total-Text and ICDAR2015 datasets, supporting the effectiveness and generalization ability of the proposed framework and its key modules.

Although the model may still miss detections in extreme scenarios with severely degraded signals, such as very low contrast, failure analysis shows that our method provides an effective solution for high-frequency feature compensation. Future work will explore data augmentation strategies based on diffusion models to handle extreme degradation. We will also focus on lightweight designs for the high-frequency enhancement and feature interaction modules. These efforts aim to further improve the model’s robustness and suitability for practical deployment.

Author Contributions

Conceptualization, Y.G. and X.W.; methodology, Y.G. and Q.G.; software, Y.G.; validation, Y.G., Q.G. and L.S.; formal analysis, Y.G.; investigation, Y.G.; resources, X.W. and L.L.; data curation, Y.G. and L.S.; writing—original draft preparation, Y.G.; writing—review and editing, L.S., Q.G. and X.W.; visualization, Y.G.; supervision, X.W. and L.L.; project administration, X.W.; funding acquisition, X.W. and Q.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of China (Nos. 62072362 and 12101479), the Natural Science Basis Research Plan in Shaanxi Province of China (Nos. 2021JQ-660 and 2024JC-YBMS-531), the Shaanxi Provincial Innovation Capacity Support Programme Project (No. 2024ZC-KJXX-034), the Xi’an Major Scientific and Technological Achievements Transformation Industrialization Project (No. 23CGZH CYH0008), and the Youth Innovation Team Project, Scientific Research Program of Shaanxi Provincial Department of Education (No. 25JP070).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. The Total-Text dataset was downloaded from https://github.com/cs-chan/Total-Text-Dataset (accessed on 23 April 2025), and the CTW1500 dataset was downloaded from https://github.com/Yuliang-Liu/Curve-Text-Detector (accessed on 23 April 2025).

Conflicts of Interest

We declare no conflicts of interest.

References

Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9336–9345.
Liao, M.; Wan, Z.; Yao, C.; Chen, K.; Bai, X. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11474–11481.
Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560.
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV); Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 213–229.
Zhang, X.; Su, Y.; Tripathi, S.; Tu, Z. Text spotting transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9519–9528.
Liang, M.; Ma, J.W.; Zhu, X.; Qin, J.; Yin, X.C. Layoutformer: Hierarchical text detection towards scene text understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 15665–15674.
Liang, M.; Zhu, X.; Zhou, H.; Qin, J.; Yin, X.C. HFENet: Hybrid feature enhancement network for detecting texts in scenes and traffic panels. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14200–14212.
Ye, J.; Chen, Z.; Liu, J.; Du, B. TextFuseNet: Scene text detection with richer fused features. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), Yokohama, Japan, 11–17 July 2020; pp. 516–522.
Liao, M.; Shi, B.; Bai, X. TextBoxes++: A single-shot oriented scene text detector. IEEE Trans. Image Process. 2018, 27, 3676–3690.
Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9365–9374.
Zhang, S.X.; Zhu, X.; Hou, J.B.; Liu, C.; Yang, C.; Wang, H.; Yin, X.C. Deep relational reasoning graph network for arbitrary shape text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9699–9708.
Wang, Y.; Xie, H.; Zha, Z.J.; Xing, M.; Fu, Z.; Zhang, Y. ContourNet: Taking a further step toward accurate arbitrary-shaped scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11753–11762.
Dai, P.; Zhang, S.; Zhang, H.; Cao, X. Progressive contour regression for arbitrary-shape scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7393–7402.
Du, B.; Ye, J.; Zhang, J.; Liu, J.; Tao, D. I3CL: Intra-and inter-instance collaborative learning for arbitrary-shaped scene text detection. Int. J. Comput. Vis. 2022, 130, 1961–1977.
Liao, M.; Zou, Z.; Wan, Z.; Yao, C.; Bai, X. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 919–931.
Zhu, Y.; Chen, J.; Liang, L.; Kuang, Z.; Jin, L.; Zhang, W. Fourier contour embedding for arbitrary-shaped text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3123–3131.
Li, J.; Lin, Y.; Liu, R.; Yin, C.; Wang, W.; Lai, J. RSCA: Real-time segmentation-based context-aware scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2349–2358.
Zhu, J.; Wang, G. TransText: Improving scene text detection via transformer. Digit. Signal Process. 2022, 130, 103698.
Cai, Y.; Liu, Y.; Shen, C.; Jin, L.; Li, Y.; Ergu, D. Arbitrarily shaped scene text detection with dynamic convolution. Pattern Recognit. 2022, 127, 108608.
Zheng, J.; Zhang, L.; Wu, Y.; Zhao, C. Text region multiple information perception network for scene text detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 7820–7824.
Ye, M.; Zhang, J.; Zhao, S.; Liu, J.; Du, B.; Tao, D. DPText-DETR: Towards better scene text detection with dynamic points in transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 3241–3249.
Huang, M.; Liu, Y.; Peng, Z.; Liu, C.; Lin, D.; Zhu, S.; Yuan, L.; Ding, E.; Wang, J. SwinTextSpotter: Scene text spotting via better synergy between text detection and text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4593–4603.
Wu, Y.; Kong, Q.; Lai, Y.; Narducci, F.; Wan, S. CDText: Scene text detector based on context-aware deformable transformer. Pattern Recognit. Lett. 2023, 172, 8–14.
Huang, M.; Zhang, J.; Peng, D.; Lu, H.; Huang, C.; Liu, Y.; Bai, X.; Jin, L. ESTextSpotter: Towards better scene text spotting with explicit synergy in transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 19495–19505.
Zhang, R. Making convolutional networks shift-invariant again. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 7324–7334.
Ning, J.; Spratling, M. The importance of anti-aliasing in tiny object detection. In Proceedings of the 15th Asian Conference on Machine Learning (ACML), Istanbul, Turkey, 11–14 November 2023; pp. 975–990.
Zheng, J.; Fan, H.; Zhang, L. Kernel adaptive convolution for scene text detection via distance map prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 5957–5966.
Li, N.; Wang, Z.; Huang, Y.; Tian, J.; Li, X.; Xiao, Z. A multi-scale natural scene text detection method based on attention feature extraction and cascade feature fusion. Sensors 2024, 24, 3758.
Wang, G.; Wei, S.; Yang, D.; Guo, A. Natural scene text detection algorithm via internal feature enhancement and adaptive cross fusion. IEEE Access 2025, 13, 153159–153170.
Liu, S.; Miao, J.; Qiao, Y.; Wang, H. TPWGAN: Wavelet-aware text prior guided super-resolution for scene text images. Image Vis. Comput. 2025, 162, 105707.
Xia, Z.; Huang, H.; Chen, H.; Feng, X.; Zhao, G. Hybrid-supervised hypergraph-enhanced transformer for micro-gesture based emotion recognition. IEEE Trans. Affective Comput. 2026, 17, 379–393. [Google Scholar] [CrossRef]
Ch’ng, C.K.; Chan, C.S. Total-text: A comprehensive dataset for scene text detection and recognition. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 935–942. [Google Scholar]
Liu, Y.; Jin, L.; Zhang, S.; Luo, C.; Peng, S. Detecting curve text in the wild: New dataset and new solution. arXiv 2017, arXiv:1712.02170. [Google Scholar] [CrossRef]
Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on robust reading. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 1156–1160. [Google Scholar]
Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324. [Google Scholar]
Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Feng, Y.; Karatzas, D.; Luo, Z.; Pal, U.; Rigaud, C.; Chazalon, J.; et al. ICDAR 2017 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Script Identification (RRC-MLT). In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 1454–1459. [Google Scholar]
Zhang, S.X.; Zhu, X.; Hou, J.B.; Liu, C.; Yang, C.; Wang, H.; Yin, X.C. Kernel proposal network for arbitrary shape text detection. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 8731–8742. [Google Scholar] [CrossRef] [PubMed]
Qu, Y.; Xie, H.; Fang, S.; Wang, Y.; Zhang, Y. ADNet: Rethinking the shrunk polygon-based approach in scene text detection. IEEE Trans. Multimed. 2022, 25, 6983–6996. [Google Scholar]
Zhao, X.; Feng, W.; Zhang, Z.; Lv, J.; Zhu, X.; Lin, Z.; Hu, J.; Shao, J. CBNet: A plug-and-play network for segmentation-based scene text detection. Int. J. Comput. Vis. 2024, 132, 3119–3138. [Google Scholar] [CrossRef]
Zhang, S.X.; Yang, C.; Zhu, X.; Yin, X.C. Arbitrary shape text detection via boundary transformer. IEEE Trans. Multimed. 2024, 26, 1747–1760. [Google Scholar] [CrossRef]
Su, Y.; Chen, Z.; Shao, Z.; Du, Y.; Ji, Z.; Bai, J.; Zhou, Y.; Jiang, Y.-G. LRANet: Towards accurate and efficient scene text detection with low-rank approximation network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4979–4987. [Google Scholar]
Han, X.; Gao, J.; Yang, C.; Yuan, Y.; Wang, Q. Focus entirety and perceive environment for arbitrary-shaped text detection. IEEE Trans. Multimed. 2024, 27, 287–299. [Google Scholar] [CrossRef]
Han, X.; Gao, J.; Yang, C.; Yuan, Y.; Wang, Q. Spotlight text detector: Spotlight on candidate regions like a camera. IEEE Trans. Multimed. 2024, 27, 1937–1949. [Google Scholar] [CrossRef]

Figure 1. Overall architecture of the proposed HFI-Former for scene text detection.

Figure 2. Structural diagram of the Wavelet-Fusion Convolution (WFC).

Figure 3. Visual comparison of spatial-domain heatmaps and frequency-domain spectra before and after enhancement by WFE-Net.

Figure 4. Structural diagram of the Dual-Path Interaction Transformer (DRIT).

Figure 5. Visualization results on the CTW1500 and Total-Text datasets.

Figure 6. Visualization results on the ICDAR2015 dataset.

Figure 7. Visual comparison of detection results: (a) baseline results; (b) results of the proposed model.

Figure 8. Typical failure cases of HFI-Former.

Table 1. CTW1500 scene text detection results. P, R, and F denote Precision, Recall, and F-score, respectively.

Methods	Paper	Backbone	P (%)	R (%)	F (%)	Params (M)	FLOPs (G)	FPS
DBNet [2]	AAAI’20	Res50	86.9	80.2	83.4	-	-	22.0
TextFuseNet [8]	IJCAI’20	Res50	85.8	85.0	85.4	-	-	7.3
FCE [16]	CVPR’21	Res50	87.6	83.4	85.5	-	-	-
DText [19]	PR’22	Res50	86.9	82.7	84.7	-	-	-
I3CL [14]	IJCV’22	Res50	88.4	84.6	86.5	52.2	247.3	7.6
TESTR [5]	CVPR’22	Res50	92.0	82.6	87.1	-	-	5.6
HFENet [7]	TITS’23	Res50	88.1	83.4	85.7	-	-	18.1
DBNet++ [15]	TPAMI’23	Res50	87.9	82.8	85.3	-	-	26.0
KPN [37]	TNNLS’23	Res50	84.4	84.2	84.3	-	-	6.3
ADNet [38]	TMM’23	Res50	88.2	83.1	85.6	-	-	-
CBNet [39]	IJCV’24	Res18/50	89.0	81.9	86.0	-	-	28.2
KAC [27]	CVPR’24	Res50	88.6	85.4	86.8	-	-	19.2
TextBPN++ [40]	TMM’24	Res50	88.3	84.7	86.5	-	-	16.5
LayoutFormer [6]	CVPR’24	Res50	88.2	84.3	86.2	-	-	-
LRANet [41]	AAAI’24	Res50	89.4	85.5	87.4	-	-	37.2
FEPE [42]	TMM’25	Res50	88.8	83.5	86.0	-	-	22.0
STD [43]	TMM’25	Res50	88.5	84.9	86.7	-	-	12.1
Ours	Proposed	Res50	91.7	85.7	88.6	51.1	228.9	9.4

Table 2. Total-Text scene text detection results. P, R, and F denote Precision, Recall, and F-score, respectively.

Methods	Paper	Backbone	P (%)	R (%)	F (%)	Params (M)	FLOPs (G)	FPS
DBNet [2]	AAAI’20	Res50	87.1	82.5	84.7	-	-	32.0
TextFuseNet [8]	IJCAI’20	Res50	87.5	83.2	85.3	-	-	7.1
FCE [16]	CVPR’21	Res50	89.3	82.5	85.8	-	-	-
DText [19]	PR’22	Res50	90.5	82.7	86.4	-	-	-
I3CL [14]	IJCV’22	Res50	89.8	84.2	86.9	52.2	247.3	-
TESTR [5]	CVPR’22	Res50	93.4	81.4	86.9	-	-	5.3
HFENet [7]	TITS’23	Res50	89.0	84.0	86.4	-	-	12.2
DBNet++ [15]	TPAMI’23	Res50	88.9	83.2	86.0	-	-	28.0
CBNet [39]	IJCV’24	Res18/50	90.1	82.5	86.1	-	-	28.2
KAC [27]	CVPR’24	Res50	90.2	83.4	86.7	-	-	56.6
FEPE [42]	TMM’25	Res50	91.3	81.9	86.4	-	-	32
Ours	Proposed	Res50	89.3	86.0	87.6	51.1	246.9	8.7

Table 3. ICDAR2015 scene text detection results. P, R, and F denote Precision, Recall, and F-score, respectively.

Methods	Paper	Backbone	P (%)	R (%)	F (%)	Params (M)	FLOPs (G)	FPS
DBNet [2]	AAAI’20	Res50	91.8	83.2	87.3	-	-	12.0
FCE [16]	CVPR’21	Res50	90.1	82.6	86.2	-	-	-
DText [19]	PR’22	Res50	88.5	85.6	87.0	-	-	-
DBNet++ [15]	TPAMI’23	Res50	90.9	83.9	87.3	-	-	10
KPN [37]	TNNLS’23	Res50	88.3	84.8	86.5	-	-	6.3
CBNet [39]	IJCV’24	Res18/50	89.0	85.5	87.2	-	-	28.2
FEPE [42]	TMM’25	Res50	88.5	80.4	84.2	-	-	12
STD [43]	TMM’25	Res50	88.9	85.2	87.0	-	-	4.4
Ours	Proposed	Res50	90.9	85.5	88.1	51.1	261.3	8.1

Table 4. Comparison of different feature enhancement strategies. P, R, and F denote Precision, Recall, and F-score, respectively.

Methods	Strategy Type	Backbone	CTW1500 P (%)	CTW1500 R (%)	CTW1500 F (%)	Total-Text P (%)	Total-Text R (%)	Total-Text F (%)
DBNet [2]	Edge-aware module	Res50	86.9	80.2	83.4	87.1	82.5	84.7
TextFuseNet [8]	High-resolution & Multi-level	Res50	85.8	85.0	85.4	87.5	83.2	85.3
FCE [16]	Frequency & Filtering	Res50	87.6	83.4	85.5	89.3	82.5	85.8
HFENet [7]	High-frequency & Edge-aware	Res50	88.1	83.4	85.7	89.0	84.0	86.4
Ours	Our Enhancement	Res50	91.7	85.7	88.6	89.3	86.0	87.6

Table 5. Ablation study of WFE-Net and FIRM on CTW1500. P, R, and F denote Precision, Recall, and F-score, respectively (✓ indicates that the module is used).

WFE-Net	FIRM	P (%)	R (%)	F (%)
		88.62	84.60	86.56
✓		90.32	85.04	87.60
	✓	88.97	86.16	87.54
✓	✓	91.65	85.71	88.58
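
The F column in the tables can be cross-checked from the P and R columns. The following is a minimal sketch that assumes F is the standard F1 score, i.e., the harmonic mean of precision and recall; the function name `f_score` and the constant `TABLE5_ROWS` are introduced here for illustration only, and the numbers are copied from Table 5.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent).

    Assumption: the reported F-score is the usual F1 measure.
    """
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (P, R, reported F) rows copied from Table 5 (CTW1500 ablation).
TABLE5_ROWS = [
    (88.62, 84.60, 86.56),  # baseline, no WFE-Net, no FIRM
    (90.32, 85.04, 87.60),  # WFE-Net only
    (88.97, 86.16, 87.54),  # FIRM only
    (91.65, 85.71, 88.58),  # WFE-Net + FIRM
]

for p, r, reported in TABLE5_ROWS:
    print(f"P={p:.2f} R={r:.2f} F={f_score(p, r):.2f} (reported {reported:.2f})")
```

Under this assumption, all four recomputed values agree with the reported F column to within 0.01. Small last-digit differences can appear in other tables because P and R are themselves rounded before publication.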

Table 6. Ablation study of WFE-Net and FIRM on Total-Text. P, R, and F denote Precision, Recall, and F-score, respectively (✓ indicates that the module is used).

WFE-Net	FIRM	P (%)	R (%)	F (%)
		85.70	84.19	84.94
✓		88.88	84.75	86.76
	✓	87.33	85.91	86.61
✓	✓	89.31	86.00	87.62

Table 7. Ablation on WFE-Net decomposition levels over multi-scale features on CTW1500. P, R, and F denote Precision, Recall, and F-score, respectively.

Setting	Levels ($F_{1}, F_{2}, F_{3}, F_{4}$)	P (%)	R (%)	F (%)
Baseline (+FIRM)	-	88.97	86.16	87.54
Unified Shallow	(2, 2, 2, 2)	90.61	85.12	87.78
Unified Deep	(4, 4, 4, 4)	92.00	82.59	87.04
Ours	(4, 3, 2, 2)	91.65	85.71	88.58
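
As a rough illustration of what a decomposition level means in Table 7, the sketch below applies a plain orthonormal 2D Haar transform with NumPy. It is not the WFE-Net implementation: the Haar basis, the function names `haar_dwt2` and `haar_decompose`, and the toy feature-map sizes are all assumptions made for this example. Each level halves the spatial size of the low-frequency band and emits three high-frequency detail bands.

```python
import numpy as np


def haar_dwt2(x: np.ndarray):
    """One level of an orthonormal 2D Haar transform on an (H, W) array.

    Returns the low-frequency band and a tuple of three detail bands.
    """
    h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    a, b = x[0::2, 0::2], x[0::2, 1::2]  # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    low = (a + b + c + d) / 2.0
    horiz_diff = (a - b + c - d) / 2.0   # difference between left and right columns
    vert_diff = (a + b - c - d) / 2.0    # difference between top and bottom rows
    diag_diff = (a - b - c + d) / 2.0    # diagonal difference
    return low, (horiz_diff, vert_diff, diag_diff)


def haar_decompose(x: np.ndarray, levels: int):
    """Apply haar_dwt2 repeatedly to the low-frequency band."""
    details = []
    low = x
    for _ in range(levels):
        low, bands = haar_dwt2(low)
        details.append(bands)
    return low, details


# Toy stand-ins for four feature maps F1..F4 (sizes are hypothetical),
# decomposed with the per-scale levels of the "Ours" row in Table 7.
rng = np.random.default_rng(0)
toy_sizes = (64, 32, 16, 8)
levels_per_scale = (4, 3, 2, 2)
for size, levels in zip(toy_sizes, levels_per_scale):
    feat = rng.standard_normal((size, size))
    low, details = haar_decompose(feat, levels)
    print(f"input {feat.shape}, levels={levels}, low band {low.shape}, "
          f"detail groups={len(details)}")
```

Because this transform is orthonormal, the total energy of the low band plus all detail bands equals the energy of the input, so deeper decomposition only redistributes information across more frequency bands at coarser resolution.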

Table 8. Performance comparison on the CTW1500 dataset using different backbone networks. P, R, and F denote Precision, Recall, and F-score, respectively.

Backbone	P (%)	R (%)	F (%)	Params (M)	FLOPs (G)	FPS
ResNet-50 (Ours)	91.65	85.71	88.58	51.1	228.9	9.4
ResNet-101 (Ours)	92.25	87.51	89.82	70.7	301.4	5.8

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gao, Y.; Gao, Q.; Shao, L.; Wang, X.; Liu, L. HFI-Former: High-Frequency Interaction Transformer for Robust Scene Text Detection. Information 2026, 17, 365. https://doi.org/10.3390/info17040365

