3.2. WFE-Net
During multi-scale feature extraction, convolutional networks tend to attenuate high-frequency components through successive convolution and downsampling operations, making it difficult for edges, textures, and local geometric structures contained in low-level features to be fully preserved in deeper layers. The loss of such high-frequency information limits the representational capacity of features for shapes, structures, and fine-grained patterns, thereby affecting the accurate modeling of target regions in subsequent modules.
To explicitly compensate for this deficiency, we design a high-frequency enhancement network, termed WFE-Net (Wavelet Frequency-Enhanced Network), built on a standard ResNet-50 backbone. The core idea of WFE-Net is to introduce a learnable frequency-domain modeling mechanism that selectively compensates for high-frequency information at each scale while preserving the original multi-scale structure. Specifically, WFE-Net takes the multi-scale features from the backbone network and generates an additional lower-resolution feature from the deepest backbone feature via an extra strided convolution, forming a feature hierarchy with progressively decreasing resolutions. At each scale, we introduce a learnable Wavelet-Fusion Convolution (WFC) to enhance the high-frequency structural representations of CNN features across different stages. The WFC architecture is shown in Figure 2.
Considering the differences among multi-scale features in spatial resolution and semantic abstraction, WFE-Net adopts resolution-adaptive wavelet decomposition depths at different scales, rather than a single number of wavelet levels for all scales. Specifically, the highest-resolution feature employs a four-level decomposition, the intermediate-resolution feature adopts a three-level decomposition, and the two lower-resolution features use only two-level decompositions. This design is motivated by the following considerations. High-resolution features contain abundant local texture information, and deeper wavelet decomposition enables the capture of multi-scale high-frequency patterns. Intermediate-scale features strike a balance between detailed structures and semantic abstraction, for which moderate decomposition is sufficient. In contrast, low-resolution features primarily encode high-level semantic information, and excessively deep decomposition may lead to sparse high-frequency components, potentially disrupting semantic consistency.
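As a quick check on these depths, the sketch below computes the spatial size of the deepest low-frequency sub-band at each scale. Only the depths 4, 3, 2, and 2 come from the text. The scale names, the strides (8, 16, 32, 64), the 640-pixel input, and the helper names are assumptions made for illustration, and each wavelet level is assumed to halve the resolution, rounding up.

```python
# Decomposition depth per scale (from the text); the scale names are hypothetical.
LEVELS = {"s8": 4, "s16": 3, "s32": 2, "s64": 2}
# Assumed feature strides relative to the input image (not stated in the text).
STRIDES = {"s8": 8, "s16": 16, "s32": 32, "s64": 64}


def deepest_subband_size(size: int, levels: int) -> int:
    """Side length of the low-frequency sub-band after `levels` halvings."""
    for _ in range(levels):
        size = (size + 1) // 2  # ceil division, as if odd sizes were padded
    return size


def subband_sizes(image_size: int) -> dict:
    """Deepest low-frequency side length for every scale of the hierarchy."""
    return {
        name: deepest_subband_size(image_size // STRIDES[name], LEVELS[name])
        for name in LEVELS
    }


if __name__ == "__main__":
    print(subband_sizes(640))
```

Under these assumed sizes, the first three scales all bottom out at a 5 × 5 low-frequency sub-band, and only the extra coarsest scale ends smaller, at 3 × 3.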
Given a feature map at an arbitrary scale, WFC first applies a multi-level discrete wavelet transform (DWT), with the number of levels chosen per scale as described above. The first level decomposes the feature into one low-frequency sub-band and three directional high-frequency sub-bands, which correspond to horizontal, vertical, and diagonal edge and texture responses, respectively. The low-frequency sub-band preserves the main structural and semantic information and is recursively fed into the next wavelet level, where it is further decomposed into new low- and high-frequency sub-bands.
The three high-frequency sub-bands capture directional edge and fine-texture information along the horizontal, vertical, and diagonal directions. To make these components learnable and enhanceable within the network, we concatenate the four sub-bands along the channel dimension and apply a lightweight depthwise convolution followed by a learnable scaling factor to achieve direction-sensitive enhancement.
The enhanced sub-bands are then progressively reconstructed via the inverse discrete wavelet transform (IDWT), following a deep-to-shallow order, to obtain the enhanced spatial-domain feature at the corresponding scale.
Finally, the reconstructed feature is residually fused with the spatial-domain convolutional feature at the same scale to produce the output feature. The fusion is weighted by a learnable channel-wise coefficient that is adaptively optimized via backpropagation to balance the contributions of frequency-domain enhancement and the original spatial semantic representation.
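The decompose, enhance, reconstruct, and fuse steps described above can be sketched on a single-channel map in plain Python. This is a simplified illustration, not the WFC layer itself: it assumes a Haar wavelet, replaces the depthwise convolution and learnable scaling with one fixed multiplier `hf_scale` on the high-frequency sub-bands, and uses a scalar `gamma` in place of the channel-wise fusion coefficient. All function and parameter names are hypothetical.

```python
def haar_dwt2(x):
    """One-level 2D Haar DWT of an even-sized 2D list -> (ll, lh, hl, hh)."""
    h, w = len(x) // 2, len(x[0]) // 2
    ll, lh, hl, hh = ([[0.0] * w for _ in range(h)] for _ in range(4))
    for i in range(h):
        for j in range(w):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2  # local average (low frequency)
            lh[i][j] = (a + b - c - d) / 2  # difference between rows
            hl[i][j] = (a - b + c - d) / 2  # difference between columns
            hh[i][j] = (a - b - c + d) / 2  # diagonal difference
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    h, w = len(ll), len(ll[0])
    x = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, t, u, v = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + t + u + v) / 2
            x[2 * i][2 * j + 1] = (s + t - u - v) / 2
            x[2 * i + 1][2 * j] = (s - t + u - v) / 2
            x[2 * i + 1][2 * j + 1] = (s - t - u + v) / 2
    return x


def wfc_sketch(x, levels, hf_scale, gamma):
    """Multi-level DWT -> scale high-frequency bands -> IDWT -> residual fusion.

    The side lengths of x must be divisible by 2 ** levels.
    """
    low, pyramid = x, []
    for _ in range(levels):  # shallow-to-deep decomposition of the low band
        low, lh, hl, hh = haar_dwt2(low)
        pyramid.append((lh, hl, hh))
    for bands in reversed(pyramid):  # deep-to-shallow reconstruction
        lh, hl, hh = ([[hf_scale * v for v in row] for row in band] for band in bands)
        low = haar_idwt2(low, lh, hl, hh)
    # Residual fusion: input feature plus gamma times the reconstructed feature.
    return [[xv + gamma * rv for xv, rv in zip(xr, rr)] for xr, rr in zip(x, low)]


if __name__ == "__main__":
    edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]  # vertical step edge
    print(wfc_sketch(edge, levels=1, hf_scale=2.0, gamma=1.0))
```

With `hf_scale = 1.0` the reconstruction equals the input exactly, so any change in the output comes only from the scaled high-frequency sub-bands. A flat input has no high-frequency content and is therefore unaffected by `hf_scale`.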
To further verify that the proposed WFE-Net effectively compensates for high-frequency information during feature enhancement, we visualize and compare the responses of the feature before enhancement and the enhanced feature in both the spatial domain and the frequency domain. The results are shown in Figure 3.
In the spatial-domain heatmaps, the responses of the original features to text regions usually appear as blob-like patterns with blurred boundaries. This indicates that standard convolution struggles to maintain precise structural localization for small targets during continuous downsampling. After being processed by WFE-Net, the color of the text regions becomes darker and the responses are significantly enhanced. The features also exhibit clearer line-like and skeleton-like structures.
In the frequency domain, we compute the two-dimensional amplitude spectrum to explicitly observe the frequency distribution of features. Before enhancement, the high-frequency regions (i.e., the outer areas of the spectrum) contain relatively weak energy. In contrast, WFE-Net visibly increases the brightness of these outer regions. Quantitatively, the high-frequency energy ratio (HFER) also improves. For example, in the second case, it increases from 2.40% to 3.12%, a relative gain of 30%.
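The text does not spell out how HFER is computed, so the following is only one plausible reading: the share of spectral energy whose radial frequency exceeds a cutoff. The cutoff of 0.25 cycles per sample, the use of squared amplitude as energy, and the function names are all assumptions. A naive DFT keeps the sketch dependency-free and is suitable only for small maps.

```python
import cmath
import math


def dft2(x):
    """Naive 2D discrete Fourier transform of a small 2D list."""
    h, w = len(x), len(x[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for i in range(h):
                for j in range(w):
                    angle = -2.0 * math.pi * (u * i / h + v * j / w)
                    acc += x[i][j] * cmath.exp(1j * angle)
            out[u][v] = acc
    return out


def hfer(x, cutoff=0.25):
    """Assumed definition: energy above `cutoff` divided by total energy."""
    h, w = len(x), len(x[0])
    spectrum = dft2(x)
    total = high = 0.0
    for u in range(h):
        for v in range(w):
            fu = min(u, h - u) / h  # folded frequency in [0, 0.5]
            fv = min(v, w - v) / w
            energy = abs(spectrum[u][v]) ** 2
            total += energy
            if math.hypot(fu, fv) > cutoff:
                high += energy
    return high / total if total > 0 else 0.0


if __name__ == "__main__":
    flat = [[1.0] * 4 for _ in range(4)]
    checker = [[1.0 if (i + j) % 2 == 0 else -1.0 for j in range(4)] for i in range(4)]
    print(hfer(flat), hfer(checker))
```

A constant map puts all of its energy at the zero frequency and scores close to 0, while a checkerboard concentrates its energy at the highest frequency and scores close to 1.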
Through this design, WFE-Net effectively compensates for high-frequency information related to text shapes in each feature stream without altering the original scale hierarchy. As a result, the enhanced multi-scale features exhibit stronger structural discriminability, improved texture robustness, and better semantic preservation before being fed into subsequent modules, thereby significantly improving text detection quality in complex scenes.
3.3. FIRM
In scene text detection, feature fusion is often limited by weak interaction between low-level structural details and high-level semantics. Although WFE-Net enhances high-frequency textures, frequency-domain enhancement alone cannot fully exploit semantic cues in complex scenes. In dense text or cluttered backgrounds, local high-frequency responses lack contextual support and are easily corrupted by noise. To this end, we design the Feature Interaction Refinement Module (FIRM), which employs a structured cross-stream feature interaction mechanism to effectively inject global semantic information from the original backbone features into the high-frequency-enhanced features, while simultaneously suppressing noisy responses. This design enables the construction of multi-scale representations that are both semantically consistent and structurally explicit.
The original multi-scale features extracted by ResNet-50 and the WFE-Net-enhanced multi-scale features are separately processed by a Feature Pyramid Network (FPN), producing two corresponding sets of pyramid features. These two feature streams serve as dual inputs to FIRM. The core component of FIRM is the Dual-Path Interaction Transformer (DRIT), whose primary objective is to introduce stable and controllable semantic information while preserving high-frequency structural localization capability. The DRIT architecture is shown in Figure 4.
In DRIT, the high-frequency-enhanced features are used as the query stream, while the original semantic features serve as the key/value stream. This design is motivated by the following considerations. Text edges and stroke structures exhibit stronger spatial localization certainty and thus provide reliable anchors for semantic alignment. In contrast, if semantic features are used as queries, the attention responses tend to diffuse in complex backgrounds, which weakens structural discriminability. Specifically, the highest-level pyramid feature of the enhanced stream is selected as the query stream, while its counterpart from the original stream is used as the key/value stream. After adding learnable positional encodings, both streams are linearly projected to construct the embedding representations required for attention computation.
To simultaneously achieve fine-grained semantic selection and robust noise suppression, DRIT introduces a dual-path attention mechanism that jointly models Softmax attention and Sigmoid-gated attention. The final attention weights are obtained by fusing the two branches element-wise.
The Softmax branch captures local semantic matching relationships between query and key, emphasizing semantic regions that are most relevant to high-frequency structures. Meanwhile, the Sigmoid gating branch performs global response statistics for each query position, suppressing spurious activations caused by complex backgrounds, texture clutter, or artifacts introduced by frequency-domain enhancement. The element-wise fusion of the two branches endows the attention mechanism with both selectivity and robustness.
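A minimal sketch of such a dual-path weighting, and of how the fused weights could then aggregate value vectors, is given below. The exact form of the gate is not recoverable from the text. Here it is assumed to be the sigmoid of the mean query-key score for each query position, multiplied into that query's Softmax weights. The function names and the scaled dot-product scoring are likewise assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dual_path_attention(q, k, v):
    """q: enhanced-stream queries, k/v: semantic-stream keys and values.

    Returns (outputs, weights), one row per query position.
    """
    dim = len(q[0])
    outputs, weights = [], []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
        soft = softmax(scores)  # selective branch: which keys match best
        gate = sigmoid(sum(scores) / len(scores))  # assumed per-query gate
        fused = [gate * s for s in soft]  # element-wise fusion of both branches
        weights.append(fused)
        outputs.append(
            [sum(wj * vj[c] for wj, vj in zip(fused, v)) for c in range(len(v[0]))]
        )
    return outputs, weights


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(dual_path_attention(q, k, v))
```

Unlike plain Softmax attention, each row of fused weights sums to the gate value rather than to 1, so a query with weak overall support from the semantic stream receives a proportionally smaller injection.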
Based on the fused attention weights, semantic features are adaptively injected into the high-frequency features.
Subsequently, the semantically injected feature is added to the original highest-level semantic feature and further transformed through a residual feed-forward network and normalization layers to obtain a stable, semantically enhanced high-level representation.
This design preserves the global semantic consistency of the original features while introducing structure-aligned semantic information via residual injection, effectively avoiding structural degradation caused by excessive semantic dominance. However, relying solely on the linear projections and feed-forward mappings of Transformers remains insufficient for precise spatial structure modeling. To further strengthen local spatial consistency and fully integrate the original semantic features with the injected representations, DRIT incorporates a lightweight convolutional combination module at the output stage to structurally remap this representation. The module comprises a block of two convolution layers followed by ReLU activation, which enhances local context modeling and nonlinear representation capacity, and a 1 × 1 convolution used for channel recalibration and information compression. This convolutional branch complements the Transformer output, enabling high-level features to jointly capture global dependency modeling and local structural perception.
To fully exploit the multi-scale pyramid representation, FIRM further adopts a top-down progressive propagation strategy. The highest-level DRIT output is propagated downward along the feature pyramid and fused with the corresponding scale-wise high-frequency-enhanced features, yielding the complete set of multi-scale-enhanced features.
Through this hierarchical refinement process, high-level semantic information is effectively transmitted to lower-level features while maintaining structural consistency. As a result, each scale inherits the high-frequency texture information provided by WFE-Net and is simultaneously constrained by global semantic cues from higher layers, delivering high-quality, multi-scale, and semantically consistent feature representations for subsequent encoder and contour generation modules.
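The top-down propagation can be illustrated with single-channel maps stored as nested lists. The fusion operator is not specified above, so this sketch assumes nearest-neighbour 2× upsampling followed by element-wise addition, in the style of a standard FPN top-down path. The function names and the 2× ratio between adjacent levels are assumptions.

```python
def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))  # repeat each row twice
    return out


def top_down(enhanced, top):
    """Propagate the coarsest map `top` down through `enhanced`.

    enhanced: enhanced-stream maps below the top level, ordered fine -> coarse.
    top:      the highest-level DRIT output (coarsest resolution).
    Returns all refined maps, ordered fine -> coarse.
    """
    refined = [top]
    for feat in reversed(enhanced):  # start with the level just below the top
        up = upsample2x(refined[0])
        fused = [[a + b for a, b in zip(fr, ur)] for fr, ur in zip(feat, up)]
        refined.insert(0, fused)  # assumed fusion: element-wise addition
    return refined


if __name__ == "__main__":
    top = [[1.0]]
    enhanced = [[[0.0] * 4 for _ in range(4)], [[1.0] * 2 for _ in range(2)]]
    for level in top_down(enhanced, top):
        print(level)
```

Each refined level receives the already refined coarser level rather than the raw one, which is what makes the propagation progressive.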
3.6. Loss Function
To effectively train the proposed multi-scale text detection framework, we adopt a joint loss function consisting of classification, mask prediction, control point regression, and bounding box regression terms, enabling end-to-end optimization. This loss function aims to ensure accurate text instance classification while enhancing the spatial integrity of instance masks and the geometric precision of contour control points, thereby achieving stable localization and precise fitting for arbitrary-shaped text.
Given the model outputs (outputs), the ground-truth labels (targets), the matching set (indices), and the total number of samples, the overall loss is formulated as a weighted sum of the classification, mask, control point regression, and bounding box regression terms. The loss weights are chosen to balance the gradient scales of the different terms, ensuring balanced and stable optimization of classification, mask, and control point regression during training.
The classification loss employs a weighted Sigmoid Focal Loss for each predicted category. This suppresses easy-to-classify samples and emphasizes hard examples, encouraging the model to focus on challenging text instances in complex backgrounds.
The mask loss combines Dice Loss and binary cross-entropy (BCE), computed between the predicted mask and the ground-truth mask M, to optimize the shape and edge accuracy of text instance masks. Additionally, auxiliary supervision is applied on lower-resolution masks to enhance local texture representation.
The control point regression loss applies L1 regression between the predicted and target control points, incorporating the anchor priors A and the reference points generated from the segmentation layers.
The bounding box regression loss and the GIoU loss optimize the spatial localization and coverage accuracy of text instances.
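For reference, the individual terms named above can be written in their commonly used forms, as in the sketch below. These are textbook definitions rather than the exact variants used in this work. The focal-loss defaults `alpha = 0.25` and `gamma = 2.0`, the Dice smoothing constant, the mean reduction, and the corner-format boxes are assumptions, and the anchor priors and reference points are omitted.

```python
import math


def sigmoid_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Focal loss for one binary prediction; target is 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p  # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def dice_loss(pred, gt, eps=1.0):
    """Dice loss over flattened masks with values in [0, 1]."""
    inter = sum(p * g for p, g in zip(pred, gt))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(gt) + eps)


def bce_loss(pred, gt, eps=1e-7):
    """Mean binary cross-entropy over flattened masks."""
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    return total / len(pred)


def l1_loss(pred, target):
    """Mean absolute error over flattened coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def giou_loss(a, b):
    """1 - GIoU for two boxes given as (x1, y1, x2, y2) with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    giou = inter / union - (hull - union) / hull
    return 1.0 - giou


if __name__ == "__main__":
    print(sigmoid_focal_loss(4.0, 1), sigmoid_focal_loss(-4.0, 1))
    print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))
```

The focal term illustrates the down-weighting of easy samples: a confident correct prediction (logit 4.0 for a positive) contributes far less than a confident wrong one (logit -4.0).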
During training, a matching algorithm determines correspondences between predictions and targets, based on which all the above loss terms are computed. For the multi-layer decoder structure, auxiliary losses are applied on intermediate outputs to improve gradient propagation and training stability. Additionally, a weighted sampling strategy is adopted for regions with high uncertainty in the text masks, encouraging the model to focus on challenging regions under complex backgrounds, thereby improving overall detection accuracy and robustness. This joint loss does not introduce new loss forms but rather provides a rational combination tailored to the proposed structure–semantic collaborative framework, ensuring that the high-frequency-enhanced features and semantic injection mechanisms are fully constrained and jointly optimized during training.
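The matching algorithm is not named above. A common choice in set-prediction detectors is a one-to-one minimum-cost bipartite assignment, and the sketch below assumes that choice. It uses brute-force enumeration so that it needs no external solver, which is practical only for a handful of instances. The cost matrix values and the function name are hypothetical.

```python
from itertools import permutations


def match(cost):
    """Minimum-cost one-to-one assignment of predictions to targets.

    cost[i][j] is the cost of assigning prediction i to target j.
    Requires at least as many predictions as targets.
    Returns a list of (prediction_index, target_index) pairs.
    """
    n_pred, n_tgt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    # Try every way of picking one distinct prediction per target.
    for preds in permutations(range(n_pred), n_tgt):
        total = sum(cost[p][t] for t, p in enumerate(preds))
        if total < best_total:
            best, best_total = preds, total
    return [(p, t) for t, p in enumerate(best)]


if __name__ == "__main__":
    # Three predictions, two targets; one prediction stays unmatched.
    cost = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
    print(match(cost))  # [(1, 0), (0, 1)]
```

The resulting index pairs play the role of the matching set on which the loss terms are computed. For realistic numbers of queries, a polynomial-time solver such as the Hungarian algorithm would replace the enumeration.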