In this section, we first review the components of the bottleneck in ResNet and multi-head self-attention (MSA) in ViT. We then present a detailed analysis of the technical design and advantages of the WBS and WMSA–WBS modules within the proposed WMSA–WBS–ViT network.
3.1. Preliminaries
Bottleneck in ResNet. As shown in Figure 1a, the traditional bottleneck architecture consists of three convolutional layers (conv 1 × 1, conv 3 × 3, and conv 1 × 1) designed to reduce dimensionality, extract spatial features, and restore the original channel size efficiently, enabling deeper networks with fewer parameters. Specifically, let $X \in \mathbb{R}^{H \times W \times D}$ be the input image feature, where $H$, $W$, and $D$ represent the height, width, and number of channels, respectively. For each residual block, a 3-layer stack is used. The input $X$ first goes through a conv 1 × 1 for dimensionality reduction, followed by a conv 3 × 3 for feature extraction, and finally a conv 1 × 1 for dimensionality restoration. The residual connection adds the output to the input.
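For concreteness, a minimal PyTorch-style sketch of such a bottleneck block is given below; the channel reduction factor of 4 follows the common ResNet convention and is an illustrative assumption rather than a detail taken from this paper.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Standard ResNet-style bottleneck: 1x1 reduce -> 3x3 extract -> 1x1 restore + residual."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction  # reduced width of the middle 3x3 convolution
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),              # dimensionality reduction
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),        # spatial feature extraction
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),              # dimensionality restoration
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.block(x))  # residual connection adds the output to the input
```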
Multi-Head Self-Attention in ViT. As shown in Figure 1b, MSA uses multiple independent attention heads to process input data in parallel, with each head responsible for different subsets of features, calculating attention weights separately, and then concatenating their results to comprehensively capture diverse characteristics of the input. Specifically, given an input feature $X \in \mathbb{R}^{H \times W \times D}$ representing the input patch sequence, where $H$, $W$, and $D$ denote the height, width, and number of channels, respectively, the input is first reshaped into a sequence of $N = H \times W$ tokens: $X \in \mathbb{R}^{N \times D}$. Three different linear layers are then used to generate the query $Q$, key $K$, and value $V$ matrices. The multi-head self-attention ($\mathrm{MSA}$) module splits each query/key/value into $h$ heads along the channel dimension, producing $Q_j$, $K_j$, and $V_j$ for the $j$-th head, where $Q_j, K_j, V_j \in \mathbb{R}^{N \times D_h}$ and $D_h = D / h$. The self-attention ($\mathrm{Attention}$) mechanism computes the dot product between the query and key, scales it by $1/\sqrt{D_h}$, and applies a softmax function to obtain the attention weights. These weights are then used to compute the weighted sum over the value vectors, yielding the attention output for each head. All head outputs are concatenated and projected through a final linear layer to obtain the final attention output. Here, we show the general formula for classical MSA as follows:
$$\mathrm{Attention}(Q_j, K_j, V_j) = \mathrm{softmax}\!\left(\frac{Q_j K_j^{\top}}{\sqrt{D_h}}\right) V_j,$$
$$\mathrm{MSA}(X) = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_h\big)\, W^{O}, \qquad \mathrm{head}_j = \mathrm{Attention}(Q_j, K_j, V_j),$$
where $W^{O}$ is the final linear projection.
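The formula above corresponds to the following minimal sketch of classical multi-head self-attention; the fused QKV projection and the `num_heads` default are implementation conveniences chosen for illustration, not details prescribed by the paper.

```python
import torch
import torch.nn as nn

class MSA(nn.Module):
    """Classical multi-head self-attention over a token sequence of shape (B, N, D)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dh = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # the three linear maps for Q, K, V (fused here)
        self.proj = nn.Linear(dim, dim)      # final output projection W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        # Split Q, K, V into h heads along the channel dimension: each (B, h, N, D_h).
        q, k, v = self.qkv(x).reshape(B, N, 3, self.h, self.dh).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / self.dh ** 0.5    # scaled dot-product scores
        attn = attn.softmax(dim=-1)                          # attention weights
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)    # weighted sum, heads concatenated
        return self.proj(out)
```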
3.2. WMSA–WBS
We propose WMSA–WBS, a hybrid architecture that integrates wavelet-based multi-resolution analysis into a multi-head self-attention framework, aiming to enhance global–local feature representation under constrained computational budgets. As shown in Figure 2a, WMSA–WBS consists of two core components: the wavelet bottleneck structure (WBS) for compact local-context encoding and the wavelet-enhanced multi-head self-attention (WMSA) module for frequency-aware attention.
This design is inspired by the complementary strengths of wavelet transforms and self-attention. While MSA captures long-range dependencies, it lacks strong locality bias. Wavelet transforms, in contrast, offer low-cost multi-scale decomposition with localized spatial support. WMSA–WBS combines both for efficient and expressive representation learning.
WBS injects frequency-aware inductive bias into the backbone while preserving the spatial and channel resolutions. It consists of three consecutive wavelet-based processing stages that progressively extract, compress, and reconstruct informative representations in both spatial and frequency domains.
Given the input feature map $X \in \mathbb{R}^{H \times W \times D}$, we first apply a 1D discrete wavelet transform (DWT) along a spatial axis to decompose the signal into directional low-frequency and high-frequency components. The transformed representation is then compressed by a group convolution block $\phi_1$ and reconstructed using the 1D inverse DWT (IDWT),
$$X_1 = \mathrm{IDWT}_{\mathrm{1D}}\big(\phi_1\big(\mathrm{DWT}_{\mathrm{1D}}(X)\big)\big),$$
where $X_1 \in \mathbb{R}^{H \times W \times D'}$ with $D' < D$, and $\phi_1(\cdot)$ denotes a conv 1 × 1-BN-ReLU block that reduces the channel dimension.
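A minimal sketch of this first stage is shown below, assuming a Haar wavelet applied along the width axis, a group count of 2, and a grouped conv 1 × 1-BN-ReLU compression block; the wavelet basis, the axis choice, and the grouping are illustrative assumptions rather than details given in the text.

```python
import torch
import torch.nn as nn

def haar_dwt_1d(x: torch.Tensor):
    """Haar DWT along the width axis of (B, C, H, W); W is assumed even."""
    even, odd = x[..., 0::2], x[..., 1::2]
    return (even + odd) / 2 ** 0.5, (even - odd) / 2 ** 0.5   # low- and high-frequency parts

def haar_idwt_1d(lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
    """Inverse of haar_dwt_1d: reconstruct and interleave the even/odd columns."""
    even, odd = (lo + hi) / 2 ** 0.5, (lo - hi) / 2 ** 0.5
    return torch.stack((even, odd), dim=-1).flatten(-2)       # back to (B, C, H, W)

class WBSStage1(nn.Module):
    """1D DWT -> channel-changing grouped conv 1x1-BN-ReLU -> 1D IDWT."""

    def __init__(self, in_ch: int, out_ch: int, groups: int = 2):
        super().__init__()
        # Low- and high-frequency parts are stacked along channels before compression.
        self.compress = nn.Sequential(
            nn.Conv2d(2 * in_ch, 2 * out_ch, kernel_size=1, groups=groups, bias=False),
            nn.BatchNorm2d(2 * out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lo, hi = haar_dwt_1d(x)
        lo, hi = self.compress(torch.cat((lo, hi), dim=1)).chunk(2, dim=1)
        return haar_idwt_1d(lo, hi)     # spatial size restored, channel dimension changed
```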
To further capture joint spatial–frequency patterns, we apply a 2D DWT to $X_1$, decomposing it into four frequency subbands. These are then fused using a group convolution block $\phi_2$ and reconstructed by the 2D IDWT,
$$X_2 = \mathrm{IDWT}_{\mathrm{2D}}\big(\phi_2\big(\mathrm{DWT}_{\mathrm{2D}}(X_1)\big)\big),$$
where $\mathrm{DWT}_{\mathrm{2D}}(X_1)$ denotes the four subbands stacked together, $X_2 \in \mathbb{R}^{H \times W \times D'}$, and $\phi_2(\cdot)$ is implemented as a conv 3 × 3-BN-ReLU block. Note that this wavelet-domain representation, i.e., the four subbands, is also shared with the WMSA branch for attention computation.
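The 2D stage can be sketched analogously, again assuming a Haar basis; here the four subbands are stacked along the channel axis, fused by a grouped conv 3 × 3-BN-ReLU block, and returned both as a reconstructed map (for the WBS path) and as subbands (for the WMSA branch). The exact fusion layout is an assumption made for illustration.

```python
import torch
import torch.nn as nn

def haar_dwt_2d(x: torch.Tensor):
    """Separable Haar DWT over H and W of (B, C, H, W) -> four subbands, each (B, C, H/2, W/2)."""
    lo_w, hi_w = (x[..., 0::2] + x[..., 1::2]) / 2 ** 0.5, (x[..., 0::2] - x[..., 1::2]) / 2 ** 0.5
    ll, lh = (lo_w[..., 0::2, :] + lo_w[..., 1::2, :]) / 2 ** 0.5, (lo_w[..., 0::2, :] - lo_w[..., 1::2, :]) / 2 ** 0.5
    hl, hh = (hi_w[..., 0::2, :] + hi_w[..., 1::2, :]) / 2 ** 0.5, (hi_w[..., 0::2, :] - hi_w[..., 1::2, :]) / 2 ** 0.5
    return ll, lh, hl, hh

def haar_idwt_2d(ll, lh, hl, hh) -> torch.Tensor:
    """Inverse of haar_dwt_2d: undo the height transform, then the width transform."""
    def up_rows(lo, hi):
        even, odd = (lo + hi) / 2 ** 0.5, (lo - hi) / 2 ** 0.5
        return torch.stack((even, odd), dim=-2).flatten(-3, -2)   # interleave rows
    lo_w, hi_w = up_rows(ll, lh), up_rows(hl, hh)
    even, odd = (lo_w + hi_w) / 2 ** 0.5, (lo_w - hi_w) / 2 ** 0.5
    return torch.stack((even, odd), dim=-1).flatten(-2)           # interleave columns

class WBSStage2(nn.Module):
    """2D DWT -> grouped conv 3x3-BN-ReLU fusion over the four subbands -> 2D IDWT."""

    def __init__(self, ch: int, groups: int = 2):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * ch, 4 * ch, kernel_size=3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(4 * ch), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor):
        ll, lh, hl, hh = haar_dwt_2d(x)
        ll, lh, hl, hh = self.fuse(torch.cat((ll, lh, hl, hh), dim=1)).chunk(4, dim=1)
        # The subbands are shared with the WMSA branch; the IDWT output continues in WBS.
        return haar_idwt_2d(ll, lh, hl, hh), (ll, lh, hl, hh)
```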
To reinforce directional structure modeling and enhance discriminative capability, we finally apply a 1D DWT to again decompose the signal into directional low-frequency and high-frequency components. The transformed representation is then processed by a group convolution block $\phi_3$ and reconstructed using the 1D IDWT,
$$X_3 = \mathrm{IDWT}_{\mathrm{1D}}\big(\phi_3\big(\mathrm{DWT}_{\mathrm{1D}}(X_2)\big)\big),$$
where $X_3 \in \mathbb{R}^{H \times W \times D}$, and $\phi_3(\cdot)$ denotes a conv 1 × 1-BN-ReLU block that restores the channel dimension.
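Chaining the stages gives a sketch of the full WBS pipeline; `WBSStage1` and `WBSStage2` refer to the illustrative modules above, and the restoring third stage simply mirrors the first with swapped channel counts. This is a sketch of the described pipeline under those assumptions, not the authors' reference implementation.

```python
import torch.nn as nn

class WBS(nn.Module):
    """Sketch of the wavelet bottleneck structure: 1D reduce -> 2D fuse -> 1D restore."""

    def __init__(self, dim: int, reduced_dim: int):
        super().__init__()
        self.stage1 = WBSStage1(dim, reduced_dim)      # 1D DWT stage, channel reduction
        self.stage2 = WBSStage2(reduced_dim)           # 2D DWT stage, subband fusion
        self.stage3 = WBSStage1(reduced_dim, dim)      # 1D DWT stage, channel restoration

    def forward(self, x):
        x = self.stage1(x)
        x, subbands = self.stage2(x)                    # subbands are shared with WMSA
        return self.stage3(x), subbands
```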
Discussion. Compared with traditional bottlenecks, convolutions in the wavelet domain benefit from inherently larger receptive fields due to the spatial downsampling property of the DWT. Specifically, since the DWT reduces the spatial resolution by a factor of two, a standard $k \times k$ convolution applied in the wavelet-transformed space corresponds to an effective receptive field of roughly $2k \times 2k$ in the original image space; for example, a 3 × 3 kernel in the wavelet domain aggregates context from approximately a 6 × 6 region of the input feature map. This enables the model to aggregate broader contextual information at significantly lower computational cost, enhancing its ability to model long-range dependencies without increasing the parameter count.
WMSA. To exploit the complementary nature of low-frequency and high-frequency components in the wavelet domain, we propose a wave fusion module (WFM) that selectively enhances structural representations using directional detail signals. This module operates on the four subbands obtained from the WBS: the approximation coefficients and the detail coefficients corresponding to the horizontal, vertical, and diagonal orientations.
Instead of directly concatenating all four subbands, which may lead to feature redundancy or misalignment in importance, we adopt a residual-style enhancement strategy centered on the low-frequency base, i.e., the approximation subband. Specifically, we treat the high-frequency responses as residual corrections to the coarse low-frequency map. The absolute values of the high-frequency subbands are used to emphasize edge and texture information while preserving the semantic context carried by the approximation subband.
This fusion method is motivated by the observation that low-frequency wavelet coefficients preserve global structure and semantic content, while high-frequency components capture local discontinuities, such as edges and textures. However, directly using high-frequency maps as standalone inputs may amplify noise and background clutter. To mitigate this, we treat their absolute activations as refinement terms, aligning them with the low-frequency representation. Mathematically, this fusion can be interpreted as introducing anisotropic feature enhancement: directional derivatives in the wavelet domain serve as informative perturbations to a coarse base map. The resulting fused map is thus a structurally enhanced, frequency-aware feature map that balances locality and semantics.
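One plausible realization of this residual-style fusion, consistent with the description above, is sketched below; the equal, unlearned weighting of the three detail subbands is an assumption (a learned or scaled weighting would be equally compatible with the text).

```python
import torch

def wave_fusion(ll: torch.Tensor, lh: torch.Tensor, hl: torch.Tensor, hh: torch.Tensor) -> torch.Tensor:
    """Residual-style fusion: low-frequency base refined by absolute high-frequency details.

    `ll` carries global structure and semantics; |lh|, |hl|, |hh| act as edge/texture corrections.
    """
    return ll + lh.abs() + hl.abs() + hh.abs()
```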
We linearly project the fused wavelet feature to produce the key $\widetilde{K}$ and value $\widetilde{V}$ embeddings for the wavelet-enhanced multi-head self-attention (WMSA), allowing the attention mechanism to attend over both coarse and fine-grained frequency cues. The attention output per head is
$$\mathrm{head}_j = \mathrm{softmax}\!\left(\frac{Q_j \widetilde{K}_j^{\top}}{\sqrt{D_h}}\right) \widetilde{V}_j,$$
where $Q_j$ is the query of the $j$-th head computed from the full-resolution input feature, and $\widetilde{K}_j$ and $\widetilde{V}_j$ are the corresponding key and value computed from the fused wavelet feature. Before the dot product, $Q_j$ and $\widetilde{K}_j$ are flattened along the spatial dimensions, resulting in $Q_j \in \mathbb{R}^{HW \times D_h}$ and $\widetilde{K}_j \in \mathbb{R}^{\frac{HW}{4} \times D_h}$. This produces an attention map of size $HW \times \frac{HW}{4}$, enabling cross-resolution attention where each high-resolution query attends to all coarse-scale key positions without explicitly downsampling the query features.
WMSA–WBS. All attention heads are concatenated with the residual local feature and projected,
$$Z = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_h, X_{\mathrm{local}}\big)\, W,$$
where $Z$ is the output of the WMSA–WBS module, $X_{\mathrm{local}}$ denotes the residual local feature, and $W$ is a learnable linear projection.
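Putting the pieces together, the attention path can be sketched as follows: queries come from the full-resolution tokens, keys and values from the quarter-length sequence of wavelet-fused tokens, and the concatenated head outputs are combined with the local feature before the final projection. Tensor layouts and the way the residual local feature enters the concatenation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class WMSAWBSAttention(nn.Module):
    """Sketch of WMSA-WBS: cross-resolution attention fused with a residual local feature."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dh = num_heads, dim // num_heads
        self.q = nn.Linear(dim, dim)          # queries from the full-resolution tokens
        self.kv = nn.Linear(dim, 2 * dim)     # keys/values from the wavelet-fused tokens
        self.proj = nn.Linear(2 * dim, dim)   # learnable projection W over [heads, local feature]

    def forward(self, x: torch.Tensor, x_fused: torch.Tensor, x_local: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) full-resolution tokens; x_fused: (B, N//4, D) fused wavelet tokens;
        # x_local: (B, N, D) residual local feature.
        B, N, D = x.shape
        M = x_fused.shape[1]
        q = self.q(x).reshape(B, N, self.h, self.dh).transpose(1, 2)           # (B, h, N, D_h)
        k, v = self.kv(x_fused).reshape(B, M, 2, self.h, self.dh).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / self.dh ** 0.5                      # (B, h, N, N/4) scores
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)      # heads concatenated
        return self.proj(torch.cat((out, x_local), dim=-1))                    # fuse with local feature
```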
Complexity Analysis. Traditional MSA has $\mathcal{O}(N^{2} D)$ complexity, where $N = HW$ is the number of tokens. In WMSA–WBS, since the key and value are computed from the downsampled wavelet-fused feature, which contains only $N/4$ tokens, the attention cost is reduced to $\mathcal{O}\!\left(\frac{N^{2}}{4} D\right)$. The DWT/IDWT cost is linear in the number of tokens, i.e., $\mathcal{O}(N D)$, yielding an efficient design suitable for high-resolution vision tasks.
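For concreteness, counting multiply-accumulate operations in the score and aggregation steps (per layer, up to constant factors, and under the assumption that the wavelet-fused key/value sequence has $N/4$ tokens) gives
$$\mathrm{MSA}: \; 2\,N^{2}D = \mathcal{O}\!\left(N^{2}D\right), \qquad \mathrm{WMSA\text{-}WBS}: \; 2\,N \cdot \tfrac{N}{4} \cdot D = \mathcal{O}\!\left(\tfrac{N^{2}}{4}D\right), \qquad \mathrm{DWT/IDWT}: \; \mathcal{O}(ND),$$
where the factor 2 accounts for the $Q\widetilde{K}^{\top}$ score computation and the subsequent aggregation $\mathrm{softmax}(\cdot)\widetilde{V}$, and the DWT/IDWT is a fixed linear transform applied once per token.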
Comparison with Prior Wavelet Methods. While prior works have explored the integration of wavelet transforms into neural architectures, such as DWT-UNet [39] for segmentation and Wave-ViT [14] for vision transformers, they typically utilize wavelet decomposition as a preprocessing step or a pooling replacement. In contrast, our proposed WBS introduces a residual wavelet bottleneck that leverages both 1D and 2D DWT-IDWT pipelines, enabling frequency-aware feature compression, fusion, and restoration within the network body. Moreover, unlike FFCNet [40] or WaveNet [41], which focus on global frequency aggregation or 1D dilation, WBS emphasizes spatial–frequency disentanglement and preserves the structural integrity of features via invertible transforms. These design choices allow WBS to be seamlessly integrated into transformer blocks (as in WMSA–WBS), achieving both contextual efficiency and spatial fidelity.
3.3. WMSA–WBS–ViT
The overall architecture of the proposed WMSA–WBS vision transformer (WMSA–WBS–ViT) is illustrated in Figure 3a, while the internal structure of a single WMSA–WBS transformer block is shown in Figure 3b. Following the multi-scale vision transformer paradigm, we develop three model variants (WMSA–WBS–ViT-S, WMSA–WBS–ViT-B, and WMSA–WBS–ViT-L), differing in depth, width, and number of attention heads.
WMSA–WBS–ViT begins with a patch embedding layer, which partitions the input image into non-overlapping patches and projects them into an embedding space using a convolutional projection. This operation reduces the spatial resolution by a factor of 4 and projects each patch into a $C_1$-dimensional embedding, producing the Stage 1 feature map of size $\frac{H}{4} \times \frac{W}{4} \times C_1$, where $H$ and $W$ denote the height and width of the input image. Compared with standard ViTs that apply linear patch flattening, our convolution-based embedding retains local spatial correlations and seamlessly integrates with hierarchical architectures. Subsequent stages further reduce the spatial resolution by a factor of 2 and increase the channel dimension, generating feature maps of sizes $\frac{H}{8} \times \frac{W}{8} \times C_2$, $\frac{H}{16} \times \frac{W}{16} \times C_3$, and $\frac{H}{32} \times \frac{W}{32} \times C_4$ at Stages 2 to 4, respectively. Each stage contains a stack of WMSA–WBS transformer blocks to progressively enrich the hierarchical representations. Each WMSA–WBS transformer block, as illustrated in Figure 3b, consists of two main components: the proposed WMSA–WBS module and a two-layer feed-forward MLP. Each component is preceded by a LayerNorm layer and followed by a residual connection. This design preserves the standard transformer structure while incorporating frequency-aware attention through WMSA–WBS. After Stage 4, the final feature map is globally pooled and fed into a classification head. Due to its modular and hierarchical design, WMSA–WBS–ViT can serve as a versatile backbone for various computer vision tasks. The detailed configurations of the three model variants are summarized in Table 1, which lists, for each Stage $i$, the feed-forward expansion ratio, the number of attention heads, and the channel dimension $C_i$.
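The overall four-stage layout can be sketched as below. Stage depths, embedding widths, and the use of strided convolutions for patch embedding and downsampling are placeholders chosen for illustration (the actual values for the S/B/L variants are those listed in Table 1), and a standard multi-head attention layer is used as a stand-in for the WMSA–WBS module inside each block.

```python
import torch
import torch.nn as nn

class WMSAWBSBlock(nn.Module):
    """Pre-norm transformer block: token mixer + 2-layer MLP, each preceded by LayerNorm and followed by a residual."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mixer = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # stand-in for WMSA-WBS
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                      # x: (B, N, D)
        y = self.norm1(x)
        x = x + self.mixer(y, y, y)[0]         # LayerNorm -> token mixer -> residual
        return x + self.mlp(self.norm2(x))     # LayerNorm -> MLP -> residual

class WMSAWBSViT(nn.Module):
    """Sketch of the hierarchical backbone: patch embed (stride 4) + 4 stages (stride 2 each)."""

    def __init__(self, num_classes=1000, dims=(64, 128, 256, 512),
                 depths=(2, 2, 6, 2), heads=(2, 4, 8, 16)):
        super().__init__()
        self.embeds, self.stages = nn.ModuleList(), nn.ModuleList()
        in_ch = 3
        for i, (d, n, h) in enumerate(zip(dims, depths, heads)):
            stride = 4 if i == 0 else 2        # Stage 1: H/4 x W/4; then H/8, H/16, H/32
            self.embeds.append(nn.Conv2d(in_ch, d, kernel_size=stride, stride=stride))
            self.stages.append(nn.Sequential(*[WMSAWBSBlock(d, h) for _ in range(n)]))
            in_ch = d
        self.head = nn.Linear(dims[-1], num_classes)

    def forward(self, img):                    # img: (B, 3, H, W)
        x = img
        for embed, stage in zip(self.embeds, self.stages):
            x = embed(x)                                     # convolutional downsampling + projection
            B, C, H, W = x.shape
            x = stage(x.flatten(2).transpose(1, 2))          # run blocks on (B, H*W, C) tokens
            x = x.transpose(1, 2).reshape(B, C, H, W)        # back to a spatial map
        return self.head(x.mean(dim=(2, 3)))   # global average pooling + classification head
```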