Article

WLAM Attention: Plug-and-Play Wavelet Transform Linear Attention

1 School of Integrated Circuits, Anhui University, Hefei 230601, China
2 Anhui Engineering Laboratory of Agro-Ecological Big Data, Hefei 230601, China
3 Anhui Zhongke Jingle Technology Co., Ltd., Hefei 230601, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(7), 1246; https://doi.org/10.3390/electronics14071246
Submission received: 9 February 2025 / Revised: 10 March 2025 / Accepted: 20 March 2025 / Published: 21 March 2025

Abstract

Linear attention has gained popularity in recent years due to its lower computational complexity compared to Softmax attention. However, its relatively lower performance has limited its widespread application. To address this issue, we propose a plug-and-play module called the Wavelet-Enhanced Linear Attention Mechanism (WLAM), which integrates a discrete wavelet transform (DWT) with linear attention. This approach enhances the model’s ability to express global contextual information while improving the capture of local features. Firstly, we introduce the DWT into the attention mechanism to decompose the input features. The original input features are utilized to generate the query vector Q, while the low-frequency coefficients are used to generate the key K. The high-frequency coefficients undergo convolution to produce the value V. This method effectively embeds global and local information into different components of the attention mechanism, thereby enhancing the model’s perception of details and overall structure. Secondly, we perform multi-scale convolution on the high-frequency wavelet coefficients and incorporate a Squeeze-and-Excitation (SE) module to enhance feature selectivity. Subsequently, we utilize the inverse discrete wavelet transform (IDWT) to reintegrate the multi-scale processed information back into the spatial domain, addressing the limitations of linear attention in handling multi-scale and local information. Finally, inspired by certain structures of the Mamba network, we introduce a forget gate and an improved block design into the linear attention framework, inheriting the core advantages of the Mamba architecture. Following a similar rationale, we leverage the lossless downsampling property of wavelet transforms to combine the downsampling module with the attention module, resulting in the Wavelet Downsampling Attention (WDSA) module. This integration reduces the network size and computational load while mitigating information loss associated with downsampling. We apply the Wavelet-Enhanced Linear Attention Mechanism (WLAM) to classical networks such as PVT, Swin, and CSwin, achieving significant improvements in performance on image classification tasks. Furthermore, we combine wavelet linear attention with the Wavelet Downsampling Attention (WDSA) module to construct WLAMFormer, which achieves an accuracy of 84.2% on the ImageNet-1K dataset.

1. Introduction

In recent years, Transformer models have shown remarkable performance in the field of computer vision, achieving significant success in image classification, object detection, semantic segmentation, and multimodal tasks. However, the use of Transformers and self-attention mechanisms in computer vision still faces considerable challenges. Modern Transformer models typically employ the Softmax attention mechanism, which calculates the similarity between each query–key pair. The computational complexity of this mechanism grows quadratically with the number of tokens. As a result, the Softmax attention mechanism can lead to uncontrollable computational demands. The self-attention mechanism lacks the inductive biases found in CNNs, such as translation invariance and locality [1]. These inductive biases are crucial for the model’s generalization ability on smaller datasets, and the absence of such features in Transformers may affect their performance on certain tasks.
To address the uncontrollable computational demands posed by the Softmax attention mechanism, various remedial measures have been proposed in prior work. PVT [2] introduced sparse global attention by reducing the resolution of keys (K) and values (V) to manage computational costs. The Swin-Transformer [3] alleviated the computational burden by limiting self-attention calculations to local windows, thereby reducing the receptive field. Subsequently, Swin-Transformer_V2 [4] improved accuracy under large sample conditions. DAT [5] utilized a deformable attention mechanism that adaptively focuses on different regions of the input features. NAT [6] simulated convolutional operations and presented an automated network design approach based on the Transformer architecture. BiFormer [7] employed dual-level routing attention to dynamically identify areas of interest for each query. However, these methods inherently restrict the overall receptive field of self-attention or are heavily influenced by specially designed attention patterns, hindering their plug-and-play adaptability. Linformer [8] discarded the Softmax function and decoupled it into two independent functions ϕ, allowing the attention computation order to shift from (query·key)·value to query·(key·value), thereby reducing the overall computational complexity to O(N). Nevertheless, this approximation resulted in a significant performance drop [9,10]. To mitigate this issue, Efficient Attention [11] employed an effective attention mechanism that applies the Softmax function to both Q and K. SOFT [12] and Nyströmformer [13] further approximated the Softmax operation using matrix decomposition. Castling-ViT [14] utilized Softmax attention as a training auxiliary tool while exclusively employing linear attention during inference. FLatten-Transformer [15] introduced a focus function and leveraged deep convolutions to preserve feature diversity. Despite the effectiveness of these approaches, they still face limitations in expressive capacity due to the constraints of linear attention. Agent attention [16] defined a novel four-component attention mechanism (Q, A, K, V), where the agent vector A serves as a proxy for the query vector Q, aggregating information from K and V before broadcasting it back to Q. This agent-based attention mechanism enables the modeling of global information with significantly reduced computational costs.
To address the limitations of self-attention mechanisms in processing local information, various hybrid models combining convolutional neural networks (CNNs) and Transformers have been proposed. Models such as SCTNet [17], AdaMCT [18], TransXNet [19], and Enriched CNN-Transformer [20] represent parallel fusion networks, where the architecture is divided into CNN and Transformer branches, and the information from both branches is integrated through a fusion network. In contrast, EdgeNeXt [21], CvT [22], MobileViT2 [23], MobileViT3 [24], and MLLA [25] are examples of serial fusion networks that first utilize CNNs to extract local features, which are then fed into a Transformer for global context modeling, thereby enhancing feature representation capabilities. However, these networks are specifically designed architectures, making them largely incompatible with plug-and-play applications.
To achieve a plug-and-play attention mechanism for the collaborative enhancement of local features and global context features, this paper proposes the Wavelet Linear Attention Mechanism module (WLAM_Attention) based on the wavelet domain. This method leverages the multi-resolution analysis characteristics of wavelet transforms to losslessly decompose the input feature map into low-frequency sub-bands that represent global structures and high-frequency sub-bands that encode detailed information. The low-frequency components model long-range dependencies through linear attention to capture global context, while the high-frequency components undergo local feature enhancement using a lightweight CNN structure. Through the reconstruction of the optimized multi-frequency features using inverse wavelet transforms, the synchronization of high-frequency spatial details and low-frequency semantic information is achieved. This module establishes a cross-scale feature interaction mechanism through frequency domain decomposition, providing a plug-and-play feature optimization solution for various existing network architectures.
Key Innovations:
  • Integration of Discrete Wavelet Transform (DWT) with Linear Attention: The proposed method incorporates the DWT into the attention mechanism by decomposing the input features for different attention constructs. Specifically, the input features are utilized to generate the attention queries (Q), low-frequency information is employed to generate the attention keys (K), and high-frequency information, processed through convolution, is used to generate the values (V). This approach effectively enhances the model’s ability to capture both local and global features, improving the perception of details and overall structure.
  • Multi-Scale Processing of Wavelet Coefficients: The high-frequency wavelet coefficients are processed through convolutional layers with varying kernel sizes to extract features at different scales. This is complemented by the Squeeze-and-Excitation (SE) module, which enhances the selectivity of the features. An inverse discrete wavelet transform (IDWT) is utilized to reintegrate the multi-scale decomposed information back into the spatial domain, compensating for the limitations of linear attention in handling multi-scale and local information.
  • Structural Mimicry of Mamba Network: The proposed wavelet linear attention incorporates elements from the Mamba network, including a forget gate and a modified block design. This adaptation retains the core advantages of Mamba, making it more suitable for visual tasks compared to the original Mamba model.
  • Wavelet Downsampling Attention (WDSA) Module: By exploiting the lossless downsampling property of wavelet transforms, we introduce the WDSA module, which combines downsampling and attention mechanisms. This integration reduces the network size and computational load while minimizing information loss caused by downsampling.

2. Related Work

A wavelet transform is an effective method for time–frequency analysis. It is reversible and capable of preserving a significant amount of information, making it widely applicable in various neural network architectures. For instance, Bae et al. [26] were among the first to incorporate a wavelet transform into CNNs for image restoration tasks. In [27], Haar wavelets were integrated into CNNs for multi-resolution analysis, achieving texture classification and image labeling. Additionally, ref. [28] introduced a wavelet transform into Transformer models, demonstrating promising performance in image classification and object detection. This model reduced the number of input feature channels to one-fourth of the original, employing the wavelet transform and convolution to generate the keys (K) and values (V) for Softmax attention, followed by a wavelet inverse transform to fuse the output features. However, this approach did not leverage the lossless downsampling properties of wavelet transforms to reduce computational complexity in the attention module, nor did it fully exploit the multi-resolution analysis capabilities of wavelet transforms.
In [29], a wavelet transform was utilized for downsampling at the front end of the model, with an inverse wavelet transform used for upsampling at the end. This method effectively lowered the image resolution while preserving significant image features, leading to reduced resource consumption in Transformer models. Yet it inadequately utilized the multi-resolution analysis potential of wavelet transforms. The work in [30] made minor modifications to the approach in [28] and applied them within a U-Net architecture, but it suffered from the same limitations. In [31], a multi-scale enhancement module was developed using a wavelet transform, convolution, nonlinear transformations, and inverse wavelet transform to enhance the multi-scale recognition capabilities of neural networks. Furthermore, ref. [32] employed gradient wavelet transform and Transformer networks to improve edge information recognition. Lastly, ref. [33] proposed a novel wavelet-based Mamba model with Fourier adjustment, termed WalMaFa, which consists of a wavelet-based Mamba block (WMB) and a fast Fourier adjustment block (FFAB), achieving outstanding performance in low-light brightness enhancement.

3. Our Work

3.1. Plug-and-Play WLAM Attention Module

As indicated in [29], a wavelet transform can achieve nearly lossless downsampling, thereby reducing the computational complexity of neural networks. Additionally, insights from [31,32] demonstrate that utilizing the multi-resolution analysis capabilities of wavelet transforms can significantly enhance a neural network’s ability to recognize local details and edge features. The WLAM (Wavelet-Enhanced Linear Attention Mechanism) designed in this paper fully exploits the lossless downsampling and multi-resolution analysis capabilities of wavelet transforms, leading to a substantial increase in linear attention recognition capabilities while effectively reducing computational workload. This is particularly evident in the improvement in the module’s ability to express local information.
In [28,30], the integration of Softmax attention with a wavelet transform is employed to lower computational complexity and achieve multi-resolution analysis. They utilize the input feature X as the query (Q), compressing the number of channels through a linear transformation to one-fourth of the original and subsequently obtaining the key (K) and value (V) through wavelet transform and convolution, as illustrated in Figure 1. However, Softmax attention employs exponentially weighted normalization for the calculation of attention weights, which is computed as follows:
$$\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)_{ij} = \frac{\exp\left((QK^{T})_{ij}/\sqrt{d_k}\right)}{\sum_{k=1}^{m}\exp\left((QK^{T})_{ik}/\sqrt{d_k}\right)} \tag{1}$$
In this context, $\exp(\cdot)$ denotes the exponential function, and $\sum_{k=1}^{m}\exp(\cdot)$ represents the sum of the exponentials over a row. This normalization process tends to significantly amplify features with higher weights while rendering those with lower weights nearly negligible. Consequently, Softmax attention is relatively sensitive to the feature distributions of Q and K, necessitating that they reside in similar spaces. As illustrated in Figure 1a,b from [28,30], the feature distributions of Q and K can exhibit significant disparities, which prevents them from residing in a comparable space. This discrepancy can lead to computational instability.
From Equation (2), we observe that the wavelet transform decomposes the input tensor X into four sub-bands, resulting in both the height (H) and width (W) being reduced to half of their original dimensions, specifically H/2 and W/2.
$$\{X_{LL}, X_{LH}, X_{HL}, X_{HH}\} = \mathrm{DWT}(X) \tag{2}$$
  • $X_{LL}$: This sub-band preserves most of the image’s energy and structural information, making it the richest in content. Typically, the primary structures and general shapes of the image are contained within this sub-band.
  • $X_{LH}$: This sub-band represents the horizontal details of the image, capturing high-frequency components such as horizontal edges or textures. However, it contains relatively less information, primarily focusing on changes in the horizontal direction.
  • $X_{HL}$: This sub-band captures the vertical details of the image, including vertical edges and textures. However, it contains relatively less information, primarily focusing on variations in the vertical direction.
  • $X_{HH}$: This sub-band represents the finest details of the image, encompassing diagonal features such as diagonal edges. It contains high-frequency noise and very subtle details, resulting in the least amount of information.
The information carried by the four sub-bands resulting from wavelet decomposition is unevenly distributed, with the majority of the information concentrated in the $X_{LL}$ sub-band. In papers [28,30], the authors compress the channels of the input X to D/4 and then expand the channel count back to D through wavelet transformation, ultimately obtaining K and V through a convolution. Due to the uneven distribution of information across the sub-bands, a significant amount of information is lost during the process of first compressing the channel count and then expanding it through wavelet transformation. As a result, the information content in K and V is considerably lower than that in Q. Therefore, while the method in papers [28,30] appears to reduce computational complexity and enhance sensitivity to local information by leveraging wavelet transforms, it does not fully utilize the advantages of the wavelet transform.
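To make the sub-band decomposition concrete, the following minimal PyTorch sketch implements a single-level Haar DWT and its inverse (the Haar basis and the orthonormal normalization are our assumptions; the paper does not state which wavelet is used, and sub-band labeling conventions differ across libraries). Running it on a smooth test signal illustrates how the energy concentrates in $X_{LL}$, while the inverse transform reconstructs the input exactly.

```python
import torch

def haar_dwt2d(x: torch.Tensor):
    """Single-level 2D Haar DWT. x: (B, C, H, W) with even H and W.
    Returns (LL, LH, HL, HH), each of shape (B, C, H/2, W/2)."""
    a = x[..., 0::2, 0::2]  # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2   # low-frequency approximation
    lh = (a - b + c - d) / 2   # detail along the horizontal direction
    hl = (a + b - c - d) / 2   # detail along the vertical direction
    hh = (a - b - c + d) / 2   # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2d(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2d (the transform is orthonormal)."""
    x = torch.zeros(*ll.shape[:2], ll.shape[-2] * 2, ll.shape[-1] * 2,
                    dtype=ll.dtype, device=ll.device)
    x[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[..., 0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[..., 1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

if __name__ == "__main__":
    # A smooth, image-like test signal: energy should concentrate in the LL sub-band.
    x = torch.randn(1, 3, 56, 56).cumsum(-1).cumsum(-2) / 56
    subbands = haar_dwt2d(x)
    total = sum(s.pow(2).sum() for s in subbands)
    for name, s in zip(["LL", "LH", "HL", "HH"], subbands):
        print(f"{name}: {(100 * s.pow(2).sum() / total).item():.1f}% of energy")
    print("max reconstruction error:",
          (haar_idwt2d(*subbands) - x).abs().max().item())
```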
To maximize the potential of the multi-resolution analysis afforded by wavelet transforms, we have opted to forgo Softmax attention, which is relatively sensitive to the feature distribution of Q and K, and instead employ a linear attention mechanism combined with wavelet transformation. The key characteristic of linear attention is that it typically utilizes the dot product of Q and K directly, along with kernelization, without applying Softmax normalization.
$$Q = \phi(xW_q),\quad K = \phi(xW_k),\quad V = \phi(xW_v) \tag{3}$$
$$\mathrm{linear\_Attention}_i = \frac{\sum_{j=1}^{N} Q_i K_j^{T} V_j}{\sum_{j=1}^{N} Q_i K_j^{T}} = \frac{Q_i\left(\sum_{j=1}^{N} K_j^{T} V_j\right)}{Q_i\left(\sum_{j=1}^{N} K_j^{T}\right)} \tag{4}$$
From Equation (4), it can be observed that the linear attention weights are accumulated linearly, rather than amplifying or diminishing the weights of specific features. In other words, the weighting mechanism of linear attention is more balanced, making it suitable for handling more diverse inputs. By applying the associative property of matrix multiplication, we can rearrange the computation order from the Softmax attention format $(QK^{T})V$ to $Q(K^{T}V)$, thereby reducing the computational complexity to O(N). This represents a significant decrease in computational complexity compared to Softmax attention.
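As a quick illustration of this reordering, the sketch below compares the two computation orders using a simple kernel function φ = ELU + 1 (the choice of φ is an assumption; the paper leaves it abstract). Both orders produce identical outputs, but the second never materializes the N × N attention matrix.

```python
import torch
import torch.nn.functional as F

def phi(x):
    # A common kernel map for linear attention (assumption; the paper only
    # denotes the feature map abstractly as phi).
    return F.elu(x) + 1.0

N, d = 196, 64
q, k, v = (torch.randn(N, d) for _ in range(3))
Q, K = phi(q), phi(k)

# O(N^2) order: (Q K^T) V, with row-wise normalization.
attn = Q @ K.T                                   # (N, N) attention matrix
out_quadratic = (attn @ v) / attn.sum(dim=-1, keepdim=True)

# O(N) order: Q (K^T V), using the associativity of matrix products.
kv = K.T @ v                                     # (d, d)
z = K.sum(dim=0)                                 # (d,)
out_linear = (Q @ kv) / (Q @ z).unsqueeze(-1)

print(torch.allclose(out_quadratic, out_linear, atol=1e-5))  # True
```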
In traditional self-attention mechanisms, the input X is typically subjected to linear transformations to generate queries (Q), keys (K), and values (V), as shown in Equation (3). In contrast, we utilize wavelet transformation to achieve a more diverse input representation.
We downsample X to serve as the query (Q), defined as $Q_{dwt} = \phi\left(\mathrm{Conv}_{3\times3}(X)\!\downarrow_2 W_q\right)$, where $\downarrow_2$ denotes downsampling by a factor of 2. The original feature map X contains all the raw information and is typically well suited as a query, as queries are designed to capture the global information of the input data for calculating attention weights.
In contrast to the approaches presented in papers [28,30], we first apply wavelet transformation to the input feature X as $\{X_{LL}, X_{LH}, X_{HL}, X_{HH}\} = \mathrm{DWT}(X)$, and then we proportionally compress the information contained in each sub-band to obtain the keys (K).
$$K = \mathrm{Conv}_{1\times1}^{1.5C \to C}\left(\mathrm{Concat}\left(X_{LL},\ \mathrm{Conv}_{1\times1}^{C \to C/4}(X_{LH}),\ \mathrm{Conv}_{1\times1}^{C \to C/4}(X_{HL})\right)\right) \tag{5}$$
This approach allows K to provide a rich representation of features while minimizing information loss, which aids the model in capturing more useful information within the attention mechanism. Additionally, we can perform a secondary wavelet transformation here to further enhance the low-frequency sub-band $X_{LL}$, thereby mimicking the scale enhancement module presented in paper [31].
$$\{DX_{LL}, DX_{LH}, DX_{HL}, DX_{HH}\} = \mathrm{DWT}(X_{LL}) \tag{6}$$
$$DX_{LL}^{1} = \mathrm{Conv}_{7\times7}(DX_{LL}) \tag{7}$$
$$DX_{LH}^{1} = \mathrm{Conv}_{3\times1}\left(\mathrm{Conv}_{1\times3}(DX_{LH})\right) \tag{8}$$
$$DX_{HL}^{1} = \mathrm{Conv}_{3\times1}\left(\mathrm{Conv}_{1\times3}(DX_{HL})\right) \tag{9}$$
$$K_{dwt} = \phi\left(\left(\mathrm{IDWT}\{DX_{LL}^{1}, DX_{LH}^{1}, DX_{HL}^{1}, DX_{HH}\} + K\right)W_k\right) \tag{10}$$
We use the high-frequency sub-bands $X_{LH}$, $X_{HL}$, and $X_{HH}$ to create V (value) through convolution, defined as $V_{dwt} = \phi\left(\mathrm{Conv}_{3\times3}^{3C \to C}\left(\mathrm{Concat}(X_{LH}, X_{HL}, X_{HH})\right)\right)W_V$. Here, $X_{LH}$, $X_{HL}$, and $X_{HH}$ represent the high-frequency coefficients extracted from the wavelet decomposition of the feature map X, with $X_{LH}$ corresponding to horizontal high-frequency coefficients, $X_{HL}$ to vertical high-frequency coefficients, and $X_{HH}$ to diagonal high-frequency coefficients. The improved V consists entirely of high-frequency information, excluding low-frequency components, which enhances the attention module’s ability to capture local features more effectively.
$$\mathrm{linear\_Attention}_{dwt} = Q_{dwt}\left(K_{dwt}^{T} V_{dwt}\right) \tag{11}$$
By employing the aforementioned approach, we provide multi-resolution representations for Q, K, and V in the linear attention mechanism, thereby enhancing the model’s performance when dealing with diverse and complex inputs. Let the input tensor X have dimensions H and W; consequently, the dimensions of Q, K, and V will be H/2 and W/2. This further reduces the computational burden of linear attention.
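The sketch below assembles these wavelet-domain Q, K, and V and applies the unnormalized linear attention of Equation (11). It reuses haar_dwt2d from the earlier sketch, folds the projection matrices $W_q$, $W_k$, $W_V$ into the convolutions, and omits the secondary wavelet enhancement of Equations (6)–(10), the high-frequency branch fused by IDWT, and multi-head handling; the stride-2 query convolution, φ = ELU + 1, and any kernel sizes beyond those in the equations are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumes haar_dwt2d from the earlier sketch is in scope.

class WLAMQKV(nn.Module):
    """Sketch of the wavelet-domain Q/K/V construction and Eq. (11)."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_conv = nn.Conv2d(dim, dim, 3, stride=2, padding=1)   # X -> Q at H/2 x W/2
        self.k_lh = nn.Conv2d(dim, dim // 4, 1)                     # compress LH to C/4
        self.k_hl = nn.Conv2d(dim, dim // 4, 1)                     # compress HL to C/4
        self.k_proj = nn.Conv2d(dim + dim // 2, dim, 1)             # 1.5C -> C (Eq. 5)
        self.v_proj = nn.Conv2d(3 * dim, dim, 3, padding=1)         # 3C -> C from (LH, HL, HH)

    @staticmethod
    def phi(x):
        return F.elu(x) + 1.0

    def forward(self, x):                                           # x: (B, C, H, W)
        ll, lh, hl, hh = haar_dwt2d(x)                              # each (B, C, H/2, W/2)
        q = self.phi(self.q_conv(x))                                # query from the full-band input
        k = self.phi(self.k_proj(torch.cat([ll, self.k_lh(lh), self.k_hl(hl)], dim=1)))
        v = self.phi(self.v_proj(torch.cat([lh, hl, hh], dim=1)))   # value from high-frequency bands

        B, C, h, w = q.shape
        q, k, v = (t.flatten(2).transpose(1, 2) for t in (q, k, v)) # (B, N, C), N = h*w
        out = q @ (k.transpose(1, 2) @ v)                           # Q (K^T V): linear in N
        return out.transpose(1, 2).reshape(B, C, h, w)

# Usage: y = WLAMQKV(64)(torch.randn(2, 64, 56, 56))  # -> (2, 64, 28, 28)
```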
The improvements outlined above can enhance the expressive capability of linear attention to some extent; however, linear attention still exhibits suboptimal performance in terms of feature diversity.
Consider the rank of the attention matrix: in traditional Transformer models, the attention matrix is typically full rank, indicating a high degree of feature diversity.
$$\mathrm{rank}\left(\mathrm{Softmax}(QK^{T})\right) = N \tag{12}$$
The rank in linear attention is constrained by the number of tokens N in each head and the channel dimension d, as illustrated in Figure 2a.
$$\mathrm{rank}\left(QK^{T}\right) \le \min\{\mathrm{rank}(Q), \mathrm{rank}(K)\} \le \min\{N, d\} \tag{13}$$
Since d is typically less than N, the rank of the attention matrix in the linear attention mechanism is bounded by d, whereas the rank of Softmax attention can reach N (and in practice typically does). In this setting, the upper bound on the attention matrix’s rank is much lower, meaning that many rows of the attention map become severely homogenized. Since the output of self-attention is a weighted sum of the same set of V, this uniformity of attention weights inevitably leads to similarities among the aggregated features.
To address this issue, papers [15,16] propose the incorporation of a depthwise convolution (DWC) module in the attention matrix, with the output represented as follows:
$$\mathrm{Out} = QK^{T}V + \mathrm{DWC}(V) = \left(QK^{T} + M_{DWC}\right)V \tag{14}$$
$$\mathrm{DWC}(V) = \mathrm{Conv}_{3\times3}(V) \tag{15}$$
Here, $M_{DWC}$ is a sparse matrix corresponding to the depthwise convolution operation. Since $M_{DWC}$ has the potential to become a full-rank matrix, it effectively raises the upper limit of the rank of the equivalent attention. As shown in Figure 2b, although this approach results in a significant increase in the rank value, the actual improvement in model accuracy is quite limited.
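The toy experiment below illustrates this rank argument numerically: the linear-attention map $QK^{T}$ has rank at most d, while adding the token-mixing matrix of a 3 × 3 depthwise convolution raises the rank of the equivalent attention substantially. This is only an illustration under assumed shapes, not the paper’s implementation, and the printed values depend on the random initialization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, d, H, W = 196, 64, 14, 14                 # N = H*W tokens, head dimension d

Q = torch.rand(N, d)                         # non-negative, as after a kernel map phi
K = torch.rand(N, d)
attn = Q @ K.T                               # linear-attention map: rank <= min(N, d) = 64
print("rank(Q K^T)        :", torch.linalg.matrix_rank(attn).item())

# Token-mixing matrix M_DWC of a 3x3 depthwise convolution (probing one channel
# is enough for the rank argument); column i is the response to a one-hot token.
dwc = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    basis = torch.eye(N).reshape(N, 1, H, W)
    M_dwc = dwc(basis).reshape(N, N).T
print("rank(Q K^T + M_DWC):", torch.linalg.matrix_rank(attn + M_dwc).item())
```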
This paper proposes enhancing the high-frequency components $X_{LH}$, $X_{HL}$, and $X_{HH}$ using a depthwise convolution module, followed by the integration of these enhancements with linear attention through inverse wavelet transformation. This method significantly improves the feature diversity of linear attention. In contrast to the direct addition of a depthwise convolution module as suggested in papers [15,16], inverse wavelet transformation enables a more effective fusion of the features from linear attention and the depthwise convolution module. The components $X_{LH}$, $X_{HL}$, and $X_{HH}$ capture most of the local features present in the input tensor X. By enhancing these components with a depthwise convolution module, we can substantially improve the module’s capability to extract local features. Paper [34] highlights that performing convolution in the wavelet domain results in a larger receptive field. By combining inverse wavelet transformation with linear attention, we can significantly address the deficiencies in linear attention’s ability to extract local information, thereby achieving robust extraction capabilities for both global and local features.
$$X_{LH}^{1} = \mathrm{ReLU}\left(\mathrm{DWConv}_{1\times1}^{2C \to C}\left(\mathrm{SE}\left(\mathrm{DWConv}_{1\times3}^{C \to 2C}\left(\mathrm{DWConv}_{3\times1}(X_{LH})\right)\right)\right) + \mathrm{BN}(X_{LH})\right) \tag{16}$$
$$X_{HL}^{1} = \mathrm{ReLU}\left(\mathrm{DWConv}_{1\times1}^{2C \to C}\left(\mathrm{SE}\left(\mathrm{DWConv}_{1\times3}^{C \to 2C}\left(\mathrm{DWConv}_{3\times1}(X_{HL})\right)\right)\right) + \mathrm{BN}(X_{HL})\right) \tag{17}$$
$$X_{HH}^{1} = \mathrm{Conv}_{3\times3}(X_{HH}) \tag{18}$$
$$O_{idwt} = \mathrm{IDWT}\left(\mathrm{linear\_Attention}_{dwt},\ \left(X_{LH}^{1}, X_{HL}^{1}, X_{HH}^{1}\right)\right) \tag{19}$$
Paper [35] shares a similar approach to ours; however, its method employs only a single 3 × 3 convolution across all high-frequency sub-bands, which limits the effective extraction of local features within these sub-bands. The high-frequency sub-bands (LH/HL/HH) inherently carry detailed information such as the edges and textures of an image. These features are characterized by strong locality, spatial sparsity, and low semantic correlation. Directly applying standard 3 × 3 convolutions leads to two issues: (A) Over-parameterization—dense convolution kernels in high-frequency sparse regions tend to introduce redundant computations. (B) Cross-channel Coupling Interference—the channel mixing operations of standard convolutions may compromise the independence of high-frequency features. To address these issues, we draw inspiration from the architecture of MobileNetV3 [36]. Since high-frequency features consist solely of horizontal or vertical local features, we employ separate 3 × 1 and 1 × 3 convolutions to extract these features. Depthwise convolution (DWConv) is utilized, where each channel is processed independently, thereby reducing cross-channel information interference (dynamically adjusted through subsequent Squeeze-and-Excitation (SE) modules). The cascade of 3 × 1 and 1 × 3 convolutions is equivalent to a 5 × 5 asymmetric receptive field but with only 24% of the parameters of a standard 5 × 5 convolution, making it more suitable for the anisotropy of high-frequency features.
Furthermore, MobileNetV3 [36] alleviates the information loss problem inherent in deep convolutions through an inverted residual structure that first expands and then reduces the channel dimensions. Specifically, the channel number is expanded to 2C, passed through an SE attention module, and then reduced back to C. By emulating this residual structure, we enhance the nonlinear expressive capability of the high-frequency sub-bands.
The $X_{HH}$ sub-band contains high-frequency noise and very fine details, representing the least amount of information, which is why we apply only a 3 × 3 convolution to it. Ultimately, we treat the linear attention output as the low-frequency sub-band while combining $X_{LH}^{1}$, $X_{HL}^{1}$, and $X_{HH}^{1}$ as high-frequency sub-bands to perform inverse wavelet transformation, resulting in the output $O_{idwt}$. As illustrated in Figure 2a, the rank of the attention module significantly increases after inverse wavelet transformation, with the dimensions of the output feature tensor restored to H and W.
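A sketch of the high-frequency enhancement branch of Equations (16) and (17) is given below: 3 × 1 and 1 × 3 depthwise convolutions, channel expansion to 2C, an SE module, reduction back to C, and a BatchNorm shortcut. The SE reduction ratio, the grouped form of the expansion and reduction convolutions, and the exact activation placement are assumptions where the equations leave them open.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-Excitation over channels (reduction ratio is an assumption)."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class HighFreqEnhance(nn.Module):
    """Enhancement of X_LH / X_HL (sketch of Eqs. 16-17): asymmetric depthwise
    convs, C -> 2C expansion, SE, reduction back to C, plus a BN shortcut."""
    def __init__(self, dim: int):
        super().__init__()
        self.dw_31 = nn.Conv2d(dim, dim, (3, 1), padding=(1, 0), groups=dim)
        self.expand = nn.Conv2d(dim, 2 * dim, (1, 3), padding=(0, 1), groups=dim)
        self.se = SE(2 * dim)
        self.reduce = nn.Conv2d(2 * dim, dim, 1, groups=dim)
        self.bn = nn.BatchNorm2d(dim)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: a high-frequency sub-band (B, C, H/2, W/2)
        y = self.reduce(self.se(self.expand(self.dw_31(x))))
        return self.act(y + self.bn(x))

# Usage: x_lh1 = HighFreqEnhance(64)(x_lh)
```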
Paper [25] shows that Mamba can be regarded as a variant of the linear attention Transformer, featuring a specialized linear attention and an improved block design. There are many potential similarities between the equations of the two operations, linear attention and the selective SSM, and paper [25] expresses both using a unified formulation:
$$h_i = \tilde{A}_i \odot h_{i-1} + B_i(\Delta_i x_i); \qquad y_i = C_i h_i + D x_i \tag{20}$$
$$S_i = 1 \odot S_{i-1} + K_i^{T}(1 \cdot V_i); \qquad y_i = \frac{Q_i S_i}{Q_i Z_i} + 0 \cdot x_i \tag{21}$$
Equation (20) represents the selective SSM, and Equation (21) corresponds to the linear attention formula. The relationship between Equations (20) and (21) is evident. Therefore, the selective SSM can be viewed as a special variant of linear attention. In summary, the similarities and differences between the selective SSM and linear attention can be outlined as follows: the selective state space model is similar to linear attention with additional input gates ($\Delta_i$), forget gates ($\tilde{A}_i$), and shortcuts ($Dx_i$), while omitting normalization and multi-head designs. The study in [25] demonstrates that the input gate ($\Delta_i$) can provide a 0.2% accuracy gain but results in a 7% reduction in throughput. The forget gate ($\tilde{A}_i$) offers approximately a 0.8% accuracy gain, but the model throughput decreases significantly from 1152 to 743. The learnable shortcut ($Dx_i$) provides a 0.2% accuracy gain while reducing throughput from 1152 to 1066.
From paper [25], it can be observed that incorporating parts of Mamba [37] into linear attention can enhance the performance of linear attention, but it causes a severe decrease in throughput. To further improve the WLAM attention module in this paper, we made the following modifications:
  • We emulate Mamba’s forget gate. The forget gate provides the model with two key attributes: local bias and positional information. In this paper, we replace the forget gate with RoPE, integrating positional information directly into the vector representations of Q and K so that the attention scores inherently carry positional relationships (a minimal RoPE sketch is given after this list). This modification offers a 0.8% accuracy gain while reducing throughput by only 3%.
    $$Q_{dwt} = \mathrm{RoPE}(Q_{dwt}) \tag{22}$$
    $$K_{dwt} = \mathrm{RoPE}(K_{dwt}) \tag{23}$$
  • Inspired by Mamba, we incorporate a learnable shortcut mechanism into the linear attention framework. This enhancement results in an accuracy improvement of 0.2%.
    $$\mathrm{OUT}_{WLAM} = \mathrm{Linear}\left(O_{idwt} \cdot \delta(\mathrm{Linear}(X))\right) \tag{24}$$
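The sketch below shows a minimal rotary position embedding applied over the flattened token sequence, as used in Equations (22) and (23). The 1D formulation and the base frequency of 10,000 are assumptions; an image model would more likely use a 2D variant over the height and width axes.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding over a token sequence.
    x: (B, N, d) with even d. Rotates channel pairs by position-dependent
    angles so that Q K^T scores depend on relative positions."""
    B, N, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(N, dtype=x.dtype, device=x.device)[:, None] * freqs  # (N, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Applied to the wavelet-domain queries and keys before the linear attention product:
# q_dwt, k_dwt = rope(q_dwt), rope(k_dwt)
```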
The proposed WLAM attention module significantly reduces the computational burden of the linear attention mechanism while enhancing input diversity through wavelet transformation. By employing depthwise convolution on the high-frequency sub-bands and utilizing inverse wavelet transformation, we effectively address the issue of limited feature diversity in linear attention. The WLAM attention module, as shown in Figure 1d, can be integrated as a plug-and-play component and is easily adaptable to various modern Vision Transformer (ViT) architectures. To demonstrate its effectiveness, we empirically applied the WLAM attention module to four advanced and representative architectures: PVT [2], Swin [3], CSWin [38], and UNet [39]. Detailed structural information can be found in Appendix A and Appendix C.

3.2. Lossless Downsampling Attention Module

The primary advantage of wavelet transformation lies in its ability to perform nearly lossless downsampling. Leveraging this property, we propose merging the downsampling module with the first attention module that follows the downsampling process into a single attention module, termed the Wavelet Downsampling Attention module. This integration reduces computational complexity while minimizing information loss associated with downsampling. Let X denote the input tensor, with C representing the number of channels and H and W denoting the height and width, respectively.
$$X = \delta(\mathrm{Linear}(X)) \tag{25}$$
$$Q = \phi\left(\mathrm{Conv}_{3\times3}^{C \to 2C}(X)\,W_q\right) \tag{26}$$
$$\{X_{LL}, X_{LH}, X_{HL}, X_{HH}\} = \mathrm{DWT}(X) \tag{27}$$
$$K = \phi\left(\mathrm{Concat}\left(X_{LL},\ \mathrm{Conv}_{1\times1}^{C \to C/2}(X_{LH}),\ \mathrm{Conv}_{1\times1}^{C \to C/2}(X_{HL})\right)W_K\right) \tag{28}$$
$$V = \phi\left(\mathrm{Conv}_{3\times3}^{3C \to 2C}\left(\mathrm{Concat}(X_{LH}, X_{HL}, X_{HH})\right)\right)W_V \tag{29}$$
$$\mathrm{Wavelet\_Downsampling\_Attention} = Q\left(K^{T}V\right) + \mathrm{DWC}(V) \tag{30}$$
The Wavelet Downsampling Attention module has a channel count of 2C and reduces the dimensions to H/2 and W/2, as shown in Figure 3a.
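A sketch of the WDSA module following Equations (25)–(30) is given below; it reuses haar_dwt2d from the earlier sketch, folds the projection matrices into convolutions and linear layers, treats δ as SiLU, and uses φ = ELU + 1, all of which are assumptions beyond what the equations specify.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumes haar_dwt2d from the earlier sketch is in scope.

class WaveletDownsamplingAttention(nn.Module):
    """Sketch of the WDSA module (Eqs. 25-30): the C-channel, HxW input is
    reduced to 2C channels at H/2 x W/2 inside a single attention block."""
    def __init__(self, dim: int):
        super().__init__()
        self.pre = nn.Linear(dim, dim)                          # X = delta(Linear(X))
        self.q_conv = nn.Conv2d(dim, 2 * dim, 3, stride=2, padding=1)
        self.k_lh = nn.Conv2d(dim, dim // 2, 1)
        self.k_hl = nn.Conv2d(dim, dim // 2, 1)
        self.k_proj = nn.Linear(2 * dim, 2 * dim)               # W_K
        self.v_conv = nn.Conv2d(3 * dim, 2 * dim, 3, padding=1)
        self.dwc = nn.Conv2d(2 * dim, 2 * dim, 3, padding=1, groups=2 * dim)

    @staticmethod
    def phi(x):
        return F.elu(x) + 1.0

    def forward(self, x):                                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        x = F.silu(self.pre(x.flatten(2).transpose(1, 2)))      # delta assumed to be SiLU
        x = x.transpose(1, 2).reshape(B, C, H, W)

        ll, lh, hl, hh = haar_dwt2d(x)                          # each (B, C, H/2, W/2)
        q = self.phi(self.q_conv(x)).flatten(2).transpose(1, 2) # (B, N, 2C)
        k = self.phi(self.k_proj(torch.cat([ll, self.k_lh(lh), self.k_hl(hl)], dim=1)
                                 .flatten(2).transpose(1, 2)))  # (B, N, 2C)
        v_img = self.v_conv(torch.cat([lh, hl, hh], dim=1))     # (B, 2C, H/2, W/2)
        v = self.phi(v_img).flatten(2).transpose(1, 2)

        out = q @ (k.transpose(1, 2) @ v)                       # Q (K^T V)
        out = out.transpose(1, 2).reshape(B, 2 * C, H // 2, W // 2)
        return out + self.dwc(v_img)                            # + DWC(V)

# Usage: y = WaveletDownsamplingAttention(64)(torch.randn(2, 64, 56, 56))  # (2, 128, 28, 28)
```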

3.3. Macro Architecture Design

The Wavelet-Enhanced Linear Attention Mechanism (WLAM) can be integrated as a plug-in component within various modern Vision Transformer (ViT) architectures, or it can be combined with the Wavelet Downsampling Attention (WDSA) module to form the WLAMFormer network, as illustrated in Figure 3. The input is a natural image with dimensions $H \times W \times 3$. The image undergoes downsampling through a convolutional layer with a stride of 2, followed by another convolutional layer with a stride of 1, resulting in a downsampled output of size $\frac{H}{2} \times \frac{W}{2} \times C_0$, where $C_0$ represents the number of channels. Subsequently, the image is processed through four stages of encoding layers, with each stage utilizing downsampling to produce feature maps of sizes $\frac{H}{4} \times \frac{W}{4} \times C_1$, $\frac{H}{8} \times \frac{W}{8} \times C_2$, $\frac{H}{16} \times \frac{W}{16} \times C_3$, and $\frac{H}{32} \times \frac{W}{32} \times C_4$, where $C_i$ denotes the channel count for each feature map. Each stage consists of $N_i$ stacked blocks, as depicted in Figure 3. The design is inspired by the EfficientViT [40] and EdgeViT [4] networks, incorporating both the wavelet linear attention module and the MLP module. For specific parameter settings, please refer to Appendix B and Figure 3.
When the input image size is 224 × 224, we have $\frac{H}{32} = \frac{W}{32} = 7$. Due to the constraints of wavelet transformation, there is a minimum size requirement for the input image, which prevents the use of the WLAM attention module in stage 4. Consequently, we substitute it with a linear attention module.

4. Experiments

4.1. Image Classification

The ImageNet-1K dataset [41] comprises over 1.3 million images spanning 1000 natural categories. Due to its diversity, this dataset covers a wide range of objects and scenes, making it one of the most widely used datasets in the field. We trained our network from scratch without utilizing any additional data, employing the CSwin-B model [38], which is pre-trained on ImageNet and achieves a top-1 accuracy of 84.2%, as the teacher model for distillation.
The training strategy follows the setup outlined for EdgeNeXt [21]. All models were trained with an input size of 224 × 224 using the AdamW [35] optimizer for 300 epochs, with a batch size of 1024. The learning rate was set to 1 × 10−4 with a cosine annealing schedule [42], and a warm-up period of 20 epochs was implemented. We enabled label smoothing (with a coefficient of 0.1), random size cropping, horizontal flipping, RandAugment [43], and multi-scale sampling. During training, the exponential moving average (EMA) momentum was set to 0.9995. To fully leverage the network’s effectiveness, we fine-tuned the model for an additional 30 epochs at a resolution of 384 × 384, using a learning rate of 1 × 10−5 and a batch size of 64.
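The optimizer and schedule described above can be assembled as in the sketch below; the linear warm-up shape, the weight-decay value, and the use of timm's ModelEmaV2 for the EMA are assumptions, and the augmentation pipeline is omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

def build_optimizer_and_schedule(model: torch.nn.Module,
                                 epochs: int = 300, warmup_epochs: int = 20):
    """Optimizer/schedule matching the setup in the text; the warm-up shape
    and weight decay are assumptions."""
    optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
    warmup = LinearLR(optimizer, start_factor=1e-2, total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs)
    scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_epochs])
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
    return optimizer, scheduler, criterion

# Per-epoch usage: train one epoch, then call scheduler.step(); an EMA of the
# weights (momentum 0.9995) would be maintained alongside, e.g. via timm's ModelEmaV2.
```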
We implemented the classification model based on PyTorch (https://pytorch.org/), running on six V100 GPUs. The experimental results on the ImageNet-1K dataset [41], presented in Table 1, clearly demonstrate the advancements our model brings to the field of image classification. It is important to note that for throughput, we report per-frame metrics on mobile devices and results with a batch size of 64 on GPUs. The results for all variants of our model are highlighted in bold.
Through Table 1 and Figure 4, we observe that the WLAMFormer model consistently achieves higher top-1 accuracy compared to other models with similar computational budgets and parameter counts.
WLAMFormer_L1 (13.5 M parameters) reaches a top-1 accuracy of 83.0%, outperforming models such as CAS_ViT_M [45] (12.42 M, 82.8%), SwiftFormer-L1 [44] (12.05 M, 80.9%), and EffiFormer-L1 [11] (12.28 M, 79.2%).
WLAMFormer_L2 (25.07 M parameters) achieves a top-1 accuracy of 84.1%, surpassing CAS-ViT-T [45] (21.76 M, 83.9%), ConvNeXt-T [47] (29.1 M, 82.1%), and Swin-T [3] (28.27 M, 81.3%).
WLAMFormer_L3 (46.6 M parameters) attains a top-1 accuracy of 84.6%, exceeding MLLA-S [25] (47.6 M, 84.4%), CSwin-S [38] (35.4 M, 83.6%), and ConvNeXt-S [47] (50.2 M, 83.1%).
These results demonstrate that the WLAMFormer model delivers state-of-the-art accuracy across various model scales, highlighting the effectiveness of integrating the discrete wavelet transform (DWT) into the Transformer architecture.
While achieving higher accuracy, the WLAMFormer model also maintains a competitive computational cost.
WLAMFormer_L1 exhibits a computational cost of 2.847 GFLOPs. While this is higher than that of some efficient models, such as EffiFormer-L1 [11] (1.310 GFLOPs) and SwiftFormer-L1 [44] (1.604 GFLOPs), WLAMFormer_L1 achieves a significant accuracy improvement of up to 3.8%.
WLAMFormer_L2 provides an excellent balance with a computational cost of 3.803 GFLOPs. It outperforms ConvNeXt-T [47] by 2% in accuracy while reducing the FLOPs by 0.7 G. Additionally, it surpasses VMamba-T [48] by 1.6% in accuracy with a decrease of 1.1 G in FLOPs, exceeds MLLA-T [25] by 0.6% in accuracy while reducing FLOPs by 0.4 G, and outperforms WTConvNeXt-T [34] by 1.6% in accuracy with a reduction of 0.7 G in FLOPs.
WLAMFormer_L3 achieves a high accuracy of 84.6% with a computational cost of 7.75 GFLOPs, exceeding ConvNeXt-S [47] by 1.5% in accuracy while reducing FLOPs by nearly 1 G. It also surpasses VMamba-S [48] by 1% in accuracy with an approximate 1 G reduction in FLOPs, outperforms MLLA-S [25] by 0.2% in accuracy while decreasing FLOPs by 0.4 G, and exceeds WTConvNeXt-T [34] by 1% in accuracy with a reduction of 1.1 G in FLOPs.
The WLAMFormer model exhibits moderate performance in terms of throughput. WLAMFormer_L1 achieves a throughput of 2296 images per second (imgs/s), surpassing CAS_ViT_M [45] (2254 imgs/s) but lagging behind other efficient models. The discrete wavelet transform and its inverse have an impact on image throughput, which is particularly noticeable in smaller models.
WLAMFormer_L2 delivers a throughput of 1580 imgs/s, exceeding that of Swin-T [3] (1246 imgs/s), CAS-ViT-T [45] (1084 imgs/s), and MLLA-T [25] (1009 imgs/s). However, it falls short compared to SwiftFormer-L1 [44] (5051 imgs/s) and EffiFormer-L1 [11] (5046 imgs/s), placing it at an intermediate level among models of comparable size.
WLAMFormer_L3 achieves a throughput of 881 imgs/s, which surpasses that of PVTv2-B3 [40] (403 imgs/s) and CSwin-S [38] (625 imgs/s), demonstrating a relatively strong performance among models of similar scale.
The WLAMFormer model achieves an optimal balance between accuracy and computational efficiency. Influential networks such as VMamba [49], the MLLA network [25], which integrates Mamba with linear attention, and the WTConvNeXt network [50], renowned for its successful fusion of wavelet transforms with convolutional neural networks (CNNs), have partially inspired the design of the WLAMFormer network. Experimental results based on the ImageNet1K dataset demonstrate that the proposed network surpasses these three models in terms of performance. The integration of the discrete wavelet transform (DWT) enables the model to efficiently capture multi-scale representations, thereby excelling in image classification tasks. Although the model exhibits a notable improvement in accuracy, it incurs a slight increase in computational cost and a reduction in throughput. This trade-off is justifiable for applications that demand high accuracy. However, in scenarios with limited computational resources or stringent throughput requirements, the elevated computational demands and reduced processing speed may pose significant limitations. Future research can address these shortcomings by optimizing DWT operations and exploring more efficient implementation methods, thereby facilitating the broader application of the WLAMFormer model across various practical scenarios.
In this paper, we introduce a plug-and-play attention module called WLAM (Wavelet-Enhanced Linear Attention Mechanism) and integrate it into mainstream neural network architectures, including PVT, Swin, and CSwin, to evaluate its performance enhancement in the ImageNet image classification task. Table 2 and Figure 5 provide a comparison of different models in terms of the number of parameters (Par. ↓), FLOPs ↓, and top-1 ↑. In addition to the baseline models, we compare our approach with the plug-and-play Agent attention module proposed in [16], which demonstrated exceptional performance in 2024. The results for all model variants are highlighted in bold.
  • Performance Improvement on the PVT Architecture
Compared to the baseline model, WLAM-PVT-T exhibits a slight increase in parameters and FLOPs while achieving a 3.7 percentage point improvement in accuracy, surpassing Agent-PVT-T by 0.3 percentage points. This indicates that the WLAM module provides a more substantial performance enhancement in smaller models. WLAM-PVT-S, with parameters and FLOPs comparable to those of Agent-PVT-S, achieves an accuracy improvement of 0.4 percentage points over Agent-PVT-S and 2.8 percentage points over the baseline model, demonstrating the superiority of the WLAM module in mid-sized models. WLAM-PVT-M shows optimized parameters and FLOPs while achieving an accuracy that exceeds Agent-PVT-M by 0.1 percentage points and improves upon the baseline model by 2.3 percentage points, thereby validating the effectiveness of the WLAM module in large models.
  • Performance Improvement on the Swin Architecture
WLAM-Swin-T achieves a 1.7 percentage point increase in accuracy while reducing both parameters and computational load, outperforming the Agent version by 0.4 percentage points. This highlights the efficient performance of the WLAM module within the Swin-T model. WLAM-Swin-S demonstrates an accuracy increase of 0.8 percentage points over the baseline model and a 0.1 percentage point improvement compared to the Agent version, all while reducing parameters and FLOPs, further confirming the effectiveness of the WLAM module.
  • Performance Improvement on the CSwin Architecture
WLAM-CSwin-T achieves a 0.9 percentage point accuracy increase over the baseline model while reducing parameters and computational load, exceeding the Agent version by 0.3 percentage points, which reflects the efficiency of the WLAM module. Similarly, WLAM-CSwin-S shows a 0.6 percentage point improvement in accuracy over the baseline model and a 0.2 percentage point increase compared to the Agent version, further showcasing the advantages of the WLAM module.
Across the PVT, Swin, and CSwin architectures, models integrated with the WLAM module achieved a significant improvement in top-1 accuracy, with the maximum enhancement reaching 3.7 percentage points. Regarding parameter and computational efficiency, the WLAM models not only enhanced performance but also reduced the number of parameters and FLOPs in many cases, demonstrating their efficacy. Compared to models incorporating the Agent attention module, the WLAM models consistently achieved notable accuracy improvements, indicating that the WLAM module is superior in capturing feature representations.
In addition to validating the performance of our network on ImageNet1K, we also tested our model on CIFAR-10 [49,51] and CIFAR-100 [49,51], both of which consist of low-resolution images, as illustrated in Table 3. We present a comparison of several publicly available models that report transfer accuracy on the CIFAR-10 and CIFAR-100 datasets. The parameters used for training our model on CIFAR-10 and CIFAR-100 are similar to those employed during training on ImageNet1K, specifically with 400 epochs and a batch size of 512, while keeping other settings constant.
The WLAM-PVT-T model adds only a small number of parameters compared to the baseline PVT-T model (11.8 M vs. 11.2 M), yet it achieves a 4.5% improvement in accuracy on CIFAR-100 (from 77.6% to 82.1%) and a 1.9% increase on CIFAR-10 (from 95.8% to 97.7%). Similarly, the WLAM-PVT-S model incurs only a slight increase in FLOPs compared to the baseline PVT-S model (3.9 G vs. 3.8 G), while demonstrating a 5.0% enhancement in accuracy on CIFAR-100 (from 79.8% to 84.8%) and a 1.9% improvement on CIFAR-10 (from 96.5% to 98.4%). These results clearly indicate that the WLAM attention module significantly enhances the recognition capability of neural networks on low-resolution images.
WLAMFormer_L1 achieves an accuracy of 84.5% on CIFAR-100, outperforming other models of similar scale, such as EfficientFormer-L1 (83.2%) and EdgeViT-M (82.7%). Due to the influence of wavelet transformations, the FLOPs value of WLAMFormer_L1 is relatively high among models of similar size (2.8 G).
WLAMFormer_L2 reaches an accuracy of 98.2% on CIFAR-10 and 87.1% on CIFAR-100. Although its performance does not surpass that of ConvNet architectures such as ConvNeXt and EfficientNet, it demonstrates substantial improvements over non-CNN architectures, exceeding the accuracy of the PoolFormer-S24 model by 5.3% and the EfficientFormer-L3 model by 1.4%.
Traditional Transformer models (e.g., PVT-Tiny, PVT-Small) and hybrid models (e.g., EfficientFormer, EdgeViT) often underperform convolutional neural networks when processing small-sized images (such as those in the CIFAR dataset). This limitation arises because Transformer models require large amounts of data and higher resolutions to learn global features effectively. However, the WLAM series models introduce attention modules based on wavelet transformations, which effectively enhance the ability of Transformer models to capture multi-scale and multi-resolution features in small-sized images, facilitating the learning of critical detail information. The WLAM module applies linear attention to the low-frequency components while employing convolutional enhancement on the high-frequency components, which is particularly beneficial for such low-resolution inputs.

4.2. Image Segmentation

The WLAM attention module proposed in this study facilitates the synergistic enhancement of local features and global contextual information. We posit that this module exhibits significant potential in medical image segmentation applications. Medical image segmentation datasets are particularly effective in demonstrating the WLAM attention module’s technological innovations in handling intricate details and meeting high-precision requirements. Compared to general-purpose datasets, medical datasets present greater challenges, thereby more effectively highlighting the unique advantages and improvements of our proposed method.
ISIC17 and ISIC18 datasets: The International Skin Imaging Collaboration Challenge datasets for 2017 and 2018 (ISIC17 and ISIC18) [53,54] are two publicly available skin lesion segmentation datasets, comprising 2150 and 2694 dermoscopic images with corresponding segmentation mask labels, respectively. We partitioned each dataset into training, testing, and validation sets in a ratio of 7:1.5:1.5. Specifically, the ISIC17 training set includes 1500 images, the test set contains 325 images, and the validation set encompasses 325 images. Similarly, the ISIC18 training set consists of 1886 images, the test set includes 404 images, and the validation set comprises 404 images.
All images from the ISIC17 and ISIC18 datasets were resized to 256 × 256 pixels. To mitigate overfitting, we applied data augmentation techniques such as random flipping, random rotation, and the addition of random noise. For both the ISIC17 and ISIC18 datasets, we employed the BCE–Dice loss function, set the batch size to 32, and utilized the AdamW optimizer [35] with an initial learning rate of 1 × 10−4. CosineAnnealingLR was adopted, with a maximum of 50 iterations and a minimum learning rate of 1 × 10−5. The number of training epochs was set to 300. For the WLAM-UNet model, the weights of the encoder and decoder were initialized using the pre-trained weights of WLAMFormer-L2 on the ImageNet-1K dataset.
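A sketch of the BCE–Dice loss and the optimizer settings used here is shown below; the equal weighting of the two terms and the Dice smoothing constant are assumptions.

```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """BCE + Dice loss for binary segmentation masks of shape (B, 1, H, W);
    equal weighting and the smoothing constant are assumptions."""
    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))
        union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = (2 * inter + self.smooth) / (union + self.smooth)
        return self.bce(logits, target) + (1 - dice).mean()

# Optimizer/scheduler settings from the text:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-5)
```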
Table 4 presents the performance comparison of the WLAM-UNet model against other advanced models on the ISIC17 and ISIC18 datasets. Through an analysis of the key evaluation metrics (mIoU, DSC, Acc, Spe, Sen), it is evident that WLAM-UNet excels in multiple aspects, demonstrating its superiority in skin lesion segmentation tasks.
On the ISIC17 dataset, WLAM-UNet achieved the best results across all evaluation metrics.
  • mIoU (80.41%): WLAM-UNet slightly outperformed VM-UNet (80.23%) and significantly surpassed other models such as UNet (79.98%) and TransFuse (79.21%).
  • DSC (89.23%): Compared to other models, WLAM-UNet achieved the highest Dice coefficient, indicating a substantial advantage in the overlap between predicted results and ground-truth labels.
  • Acc (96.45%): In terms of accuracy, WLAM-UNet also led all comparison models, showcasing higher overall classification accuracy.
  • Spe (97.55%) and Sen (90.10%): WLAM-UNet performed excellently in both specificity and sensitivity, scoring above 97% and 90%, respectively, indicating its effectiveness in reducing false positives and false negatives.
On the more complex ISIC18 dataset, WLAM-UNet maintained its leading position.
  • mIoU (80.43%): WLAM-UNet slightly outperformed TransFuse (80.33%) and other models, further validating its generalization capability across different datasets.
  • DSC (89.84%): As on the ISIC17 dataset, WLAM-UNet continued to lead in the DSC metric, demonstrating excellent segmentation performance on a larger-scale dataset.
  • Acc (95.00%): Despite the increased complexity of the ISIC18 dataset, WLAM-UNet maintained high accuracy, surpassing VM-UNet (94.91%) and other comparison models.
  • Spe (96.20%) and Sen (91.22%): In terms of specificity and sensitivity, WLAM-UNet achieved high scores of 96.20% and 91.22%, respectively, showcasing its robustness and efficiency in complex environments.
Overall, the WLAM-UNet model consistently outperforms other state-of-the-art models on both the ISIC17 and ISIC18 datasets, highlighting its unique advantages and the effectiveness of the proposed WLAM attention module in medical image segmentation.
The segmentation results of WLAM-UNet and other state-of-the-art models on the ISIC18 dataset are presented in Figure 6. WLAM-UNet performs comparably to VM-UNet, an improved version derived from VMamba [49], and demonstrates a significant advantage over the other UNet-based networks. Specifically, in the visualization of the second row, it can be observed that other models (such as TransFuse [56] and UTNetV2 [55]) often misclassify non-target areas, leading to poorer segmentation performance. In contrast, WLAM-UNet exhibits higher stability in this aspect. As shown in the visualization of the third row, WLAM-UNet achieves high accuracy in segmenting small targets. Additionally, as illustrated in the visualizations of the fourth and fifth rows, despite the more complex boundaries, WLAM-UNet is still able to accurately delineate the edges of the targets.

4.3. Ablation Study

In this section, we investigate the effectiveness of key components within the WLAM attention module by systematically removing them. We report ImageNet-1K classification results based on the WLAMFormer_L2 model, as shown in Table 5.
  • We removed the structure that mimics Mamba, while keeping all other components unchanged.
  • We discontinued the use of the structure that imitates MobileNetV3 for processing high-frequency sub-bands; instead, we employed a single 3 × 3 convolution for the high-frequency sub-bands, similar to the approach outlined in [35].
  • We eliminated the multi-resolution input from the attention module, following the methodology of [36], and solely utilized the low-frequency components as inputs for linear attention.
    $$Q = \phi(X_{LL}W_q),\quad K = \phi(X_{LL}W_k),\quad V = \phi(X_{LL}W_v) \tag{31}$$
  • We removed the Wavelet Downsampling Attention module and instead adopted a downsampling approach similar to that of MLLA-T [25] and CAS-ViT-T [34] (stem + Patch Merging).
Impact of the Mamba Biomimetic Structure (Model 1)
The removal of the Mamba-inspired structure led to a decrease in top-1 accuracy from 84.1% to 82.9%, reflecting a reduction of 1.2%. This decline underscores the importance of the Mamba-inspired forget gate, which provides local bias and positional information to the attention module. The incorporation of learnable shortcuts in the Mamba design enhances the stability of the model. Its removal results in a significant performance drop, indicating that this component is critical for improving model accuracy.
Impact of the High-Frequency Processing Module (Model 2)
When the high-frequency processing module was simplified to a single 3 × 3 convolution, the top-1 accuracy further declined to 83.3%, a reduction of 0.8%. The MobileNetV3-inspired high-frequency processing structure is designed to more effectively extract high-frequency detail features. Simplifying this module reduces the model’s ability to capture fine-grained information, leading to a decrease in performance. However, this impact is comparatively less significant than that observed with the removal of the Mamba biomimetic structure.
Impact of Multi-Resolution Input in the Attention Module (Model 3)
The omission of the multi-resolution input, which limited the attention module to only low-frequency components, resulted in a top-1 accuracy drop to 82.1%, a reduction of 2.0%. The multi-resolution input enables the attention module to integrate features across different scales, facilitating the fusion of global and local information. The removal of this component restricts the model’s feature representation capabilities, leading to a substantial decline in performance. This effect represents the most significant impact observed across all ablation experiments.
Impact of the Wavelet Downsampling Attention Module (Model 4)
After the Wavelet Downsampling Attention module was replaced with a downsampling approach similar to that used in MLLA-T [25] and CAS-ViT-T [34] (stem + Patch Merging), the model’s throughput decreased from 1280 to 1050, and the top-1 accuracy dropped by 0.3%. This indicates that the original Wavelet Downsampling Attention module maintains high classification performance while offering greater computational efficiency and throughput. Although the replaced downsampling method is largely able to preserve good performance, it falls short of the original module in terms of efficiency and resource utilization.

4.4. Network Visualization

We employed the Grad-CAM method to generate heatmaps that highlight the regions of focus within the network. To validate the accuracy of the model’s identification, we compared the heatmaps of WLAMFormer-L2, WLAM-Swin-T, and WLAM-PVT-Small with those of MLLA-T, Swin-T, Agent-Swin-T, PVT-Small, and Agent-PVT-Small. The results demonstrate that WLAMFormer-L2 exhibits a clear advantage in performance. Additionally, WLAM-Swin-T shows improved performance compared to both Swin-T and Agent-Swin-T. Similarly, WLAM-PVT-Small outperforms PVT-Small and Agent-PVT-Small, indicating its effectiveness in feature identification.
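For reference, a minimal hook-based Grad-CAM sketch is given below. It assumes the hooked layer outputs a (B, C, H, W) feature map and that the model returns class logits; target_layer is a placeholder for the last feature stage of the network under inspection, and this is not the exact visualization code used in the paper.

```python
import torch
import torch.nn.functional as F

def grad_cam(model: torch.nn.Module, target_layer: torch.nn.Module,
             image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Grad-CAM heatmap for one image of shape (1, 3, H, W), normalized to [0, 1]."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    try:
        model.zero_grad()
        model(image)[0, class_idx].backward()               # gradient of the target logit
        weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # GAP over spatial gradients
        cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
        return cam / (cam.max() + 1e-8)
    finally:
        h1.remove(); h2.remove()
```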
In Figure 7, the Grad-CAM heatmaps of WLAMFormer, WLAM-Swin, WLAM-PVT, and other state-of-the-art models on the ImageNet1K dataset are presented. The results indicate that the proposed WLAMFormer network has significant advantages in multiple aspects. Second-column visualization: the WLAMFormer network is able to accurately focus on the target tree frog, whereas other methods fail to concentrate feature weights on the tree frog. Fourth-column visualization: the WLAMFormer network precisely highlights the drumsticks held in the hand and those around the waist. In contrast, the MLLA network does not focus on the waist drumsticks, and the Swin-Transformer network simultaneously attends to both the drumsticks and the drum set, with the drum set not being the intended target. In the comparison of heatmaps among the three networks WLAM-Swin, Swin, and Agent-Swin, WLAM-Swin is clearly superior to Swin and slightly outperforms Agent-Swin. For the comparison of heatmaps among the three networks WLAM-PVT, PVT, and Agent-PVT, WLAM-PVT significantly outperforms PVT in the first column (dog), fourth column (drumsticks), and fifth column (dough). WLAM-PVT also noticeably surpasses Agent-PVT in the first column (dog), third column (barbell), and fifth column (dough). These comparisons demonstrate the enhanced focus and accuracy of the WLAM-based models in identifying and concentrating on the relevant target features compared to their counterparts.

5. Conclusions

This paper addresses the limitations of linear attention in terms of performance by proposing a plug-and-play Wavelet-Enhanced Linear Attention Mechanism (WLAM) module. This module integrates a discrete wavelet transform (DWT) with linear attention to enhance the model’s ability to express global context and local features. By introducing DWT into the attention mechanism, we perform wavelet decomposition on the input features, generating query vectors Q from the original input features, keys K from the low-frequency coefficients, and values V from the high-frequency coefficients processed through multi-scale convolutions and SE (Squeeze-and-Excitation) modules. This method effectively embeds global information and local features into different components of the attention mechanism, enhancing the model’s perception of details and overall structure.
Furthermore, we reintegrate the multi-scale processed information back into the spatial domain using an inverse discrete wavelet transform (IDWT), addressing the shortcomings of linear attention in handling multi-scale and local information. We also draw inspiration from the Mamba network's forget gate and improved block design, inheriting its core advantages to further enhance performance and robustness. Exploiting the lossless downsampling property of wavelet transforms, we propose the Wavelet Downsampling Attention (WDSA) module, which combines the downsampling and attention modules, reducing network size and computational load while minimizing the information loss caused by downsampling. Combining the WLAM and WDSA modules, we construct the WDLMFormer model. We apply the proposed WLAM module to classical networks such as PVT, Swin, and CSwin, significantly improving their performance on ImageNet-1K image classification, and WDLMFormer achieves an accuracy of 84.6% on the ImageNet-1K dataset, validating the effectiveness and superiority of our approach.
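The lossless-downsampling property exploited by WDSA can be illustrated by continuing the Haar sketch above: a single DWT level halves the spatial resolution while retaining all information across four sub-bands, which can then be concatenated along the channel dimension and projected. The sketch below reuses the haar_dwt helper and imports from the previous block; the 1 × 1 projection and normalization are illustrative assumptions, not the exact WDSA design, which additionally couples this step with attention.

```python
class WaveletDownsampleSketch(nn.Module):
    """Downsample 2x via the Haar DWT: each 2x2 neighborhood is mapped to four sub-bands,
    so no information is discarded before the channel projection."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Conv2d(4 * in_dim, out_dim, kernel_size=1)
        self.norm = nn.BatchNorm2d(out_dim)

    def forward(self, x):                        # x: (B, C, H, W)
        ll, lh, hl, hh = haar_dwt(x)             # each (B, C, H/2, W/2)
        x = torch.cat([ll, lh, hl, hh], dim=1)   # (B, 4C, H/2, W/2); invertible rearrangement
        return self.norm(self.proj(x))           # (B, out_dim, H/2, W/2)
```

In this form the frequency-separated, information-preserving representation is what a subsequent attention block can operate on, which is consistent with the motivation for combining downsampling and attention described above.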
In summary, the WLAM and WDSA modules proposed in this paper offer new insights into the design of attention mechanisms. By integrating wavelet transforms with linear attention, we enhance the model's capability to capture both global and local information and achieve strong performance in practice. Several questions nevertheless remain open. In future work, we will apply this method to more visual tasks, such as object detection and semantic segmentation, to validate its generality and effectiveness in different scenarios, and we will explore more efficient ways to integrate wavelet transforms with deep learning models to further improve performance and computational efficiency.
Although the WLAM and WDSA modules proposed in this paper achieve significant improvements in model performance and robustness, several research directions remain worth exploring in depth: (1) Expansion to multi-task learning: the WLAM module can be applied to more complex visual tasks such as instance segmentation, pose estimation, and image generation; multi-task learning would allow its generalization across tasks to be validated and its potential for collaborative learning to be explored. (2) Adaptive wavelet transforms: adaptive or learnable wavelet basis functions could allow the transform to adjust dynamically to data characteristics, further improving feature extraction and helping the model capture multi-scale information from diverse image types. (3) Theoretical analysis and interpretation: a deeper study of the synergy between wavelet transforms and linear attention within the WLAM module could explain its advantages in information capture and representation, guide further optimization of the module design, and improve the model's interpretability and transparency.

Author Contributions

Conceptualization, B.F. and S.L.; methodology, B.F.; software, B.F.; validation, C.X. and Z.L.; formal analysis, B.F.; investigation, S.L.; resources, S.L.; data curation, S.L.; writing—original draft preparation, B.F.; writing—review and editing, B.F.; visualization, Z.L.; supervision, C.X.; project administration, C.X.; funding acquisition, C.X. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the National Key Research and Development Program under project number 2019YFC0117800.

Data Availability Statement

The ImageNet1K dataset can be downloaded from the website https://www.image-net.org/ (accessed on 15 January 2025). The CIFAR-10 and CIFAR-100 datasets can be downloaded from the website https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 15 January 2025).

Conflicts of Interest

Author Shaohua Liu was employed by the company Anhui Zhongke Jingle Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Table A1. Structure of WLAM_Swin and WLAM_CSwin. Each stage lists the downsampling layer and the block configurations as [win, dim, head] × depth.

WLAM_Swin:
Stage | Output | Variant | Downsampling | WLAM_Block | Swin_Block
1 | 56 × 56 | WLAM_Swin_T | Concat 4 × 4, 96, LN | [win 56 × 56, dim 96, head 3] | [win 56 × 56, dim 96, head 3]
1 | 56 × 56 | WLAM_Swin_S | Concat 4 × 4, 96, LN | [win 56 × 56, dim 96, head 3] | [win 56 × 56, dim 96, head 3]
1 | 56 × 56 | WLAM_Swin_B | Concat 4 × 4, 128, LN | [win 56 × 56, dim 128, head 3] | [win 56 × 56, dim 128, head 3]
2 | 28 × 28 | WLAM_Swin_T | Concat 4 × 4, 192, LN | [win 28 × 28, dim 192, head 6] | [win 28 × 28, dim 192, head 6]
2 | 28 × 28 | WLAM_Swin_S | Concat 4 × 4, 192, LN | [win 28 × 28, dim 192, head 6] | [win 28 × 28, dim 192, head 6]
2 | 28 × 28 | WLAM_Swin_B | Concat 4 × 4, 256, LN | [win 28 × 28, dim 256, head 6] | [win 28 × 28, dim 256, head 6]
3 | 14 × 14 | WLAM_Swin_T | Concat 4 × 4, 384, LN | [win 14 × 14, dim 384, head 12] × 3 | [win 14 × 14, dim 384, head 12] × 3
3 | 14 × 14 | WLAM_Swin_S | Concat 4 × 4, 384, LN | [win 14 × 14, dim 384, head 12] × 9 | [win 14 × 14, dim 384, head 12] × 9
3 | 14 × 14 | WLAM_Swin_B | Concat 4 × 4, 512, LN | [win 14 × 14, dim 512, head 12] × 9 | [win 14 × 14, dim 512, head 12] × 9
4 | 7 × 7 | WLAM_Swin_T | Concat 4 × 4, 768, LN | None | [win 7 × 7, dim 768, head 24] × 2
4 | 7 × 7 | WLAM_Swin_S | Concat 4 × 4, 768, LN | None | [win 7 × 7, dim 768, head 24] × 2
4 | 7 × 7 | WLAM_Swin_B | Concat 4 × 4, 1024, LN | None | [win 7 × 7, dim 1024, head 24] × 2

WLAM_CSwin:
Stage | Output | Variant | Downsampling | WLAM_Block | CSwin_Block
1 | 56 × 56 | WLAM_CSwin_T | Concat 7 × 7, stride = 4, 64, LN | [win 56 × 56, dim 64, head 2] | [win 56 × 56, dim 64, head 2]
1 | 56 × 56 | WLAM_CSwin_S | Concat 7 × 7, stride = 4, 64, LN | [win 56 × 56, dim 64, head 2] | [win 56 × 56, dim 64, head 2] × 2
1 | 56 × 56 | WLAM_CSwin_B | Concat 7 × 7, stride = 4, 96, LN | [win 56 × 56, dim 96, head 2] | [win 56 × 56, dim 96, head 2] × 2
2 | 28 × 28 | WLAM_CSwin_T | Concat 7 × 7, stride = 4, 128, LN | [win 28 × 28, dim 128, head 4] × 2 | [win 28 × 28, dim 128, head 4] × 2
2 | 28 × 28 | WLAM_CSwin_S | Concat 7 × 7, stride = 4, 128, LN | [win 28 × 28, dim 128, head 4] × 3 | [win 28 × 28, dim 128, head 4] × 3
2 | 28 × 28 | WLAM_CSwin_B | Concat 7 × 7, stride = 4, 192, LN | [win 28 × 28, dim 192, head 4] × 3 | [win 28 × 28, dim 192, head 4] × 3
3 | 14 × 14 | WLAM_CSwin_T | Concat 7 × 7, stride = 4, 256, LN | [win 14 × 14, dim 256, head 8] × 9 | [win 14 × 14, dim 256, head 8] × 9
3 | 14 × 14 | WLAM_CSwin_S | Concat 7 × 7, stride = 4, 256, LN | [win 14 × 14, dim 256, head 8] × 15 | [win 14 × 14, dim 256, head 8] × 14
3 | 14 × 14 | WLAM_CSwin_B | Concat 7 × 7, stride = 4, 384, LN | [win 14 × 14, dim 384, head 8] × 15 | [win 14 × 14, dim 384, head 8] × 14
4 | 7 × 7 | WLAM_CSwin_T | Concat 7 × 7, stride = 4, 512, LN | None | [win 7 × 7, dim 512, head 16] × 1
4 | 7 × 7 | WLAM_CSwin_S | Concat 7 × 7, stride = 4, 512, LN | None | [win 7 × 7, dim 512, head 16] × 2
4 | 7 × 7 | WLAM_CSwin_B | Concat 7 × 7, stride = 4, 768, LN | None | [win 7 × 7, dim 768, head 16] × 2

Appendix B

Table A2. Structure of the WLAMFormer model. Each stage lists the downsampling layers and the block configurations as [win, dim, head] × depth.

Stage | Output | Variant | Downsampling | WLAM_Block | Linear_Block
1 | 56 × 56 | WLAMFormer_L1 | stem, 32; Attention_DownSampling, 64 | [win 56 × 56, dim 64, head 2] | None
1 | 56 × 56 | WLAMFormer_L2 | stem, 32; Attention_DownSampling, 64 | [win 56 × 56, dim 64, head 2] | None
1 | 56 × 56 | WLAMFormer_L3 | stem, 42; Attention_DownSampling, 84 | [win 56 × 56, dim 84, head 3] | None
2 | 28 × 28 | WLAMFormer_L1 | Attention_DownSampling, 128 | [win 28 × 28, dim 128, head 4] | None
2 | 28 × 28 | WLAMFormer_L2 | Attention_DownSampling, 128 | [win 28 × 28, dim 128, head 4] × 3 | None
2 | 28 × 28 | WLAMFormer_L3 | Attention_DownSampling, 168 | [win 28 × 28, dim 168, head 6] × 3 | None
3 | 14 × 14 | WLAMFormer_L1 | Attention_DownSampling, 256 | [win 14 × 14, dim 256, head 8] × 3 | [win 14 × 14, dim 256, head 12] × 2
3 | 14 × 14 | WLAMFormer_L2 | Attention_DownSampling, 256 | [win 14 × 14, dim 256, head 8] × 4 | [win 14 × 14, dim 256, head 8] × 3
3 | 14 × 14 | WLAMFormer_L3 | Attention_DownSampling, 336 | [win 14 × 14, dim 336, head 12] × 6 | [win 14 × 14, dim 336, head 12] × 5
4 | 7 × 7 | WLAMFormer_L1 | Attention_DownSampling, 512 | None | [win 7 × 7, dim 512, head 16] × 2
4 | 7 × 7 | WLAMFormer_L2 | Attention_DownSampling, 512 | None | [win 7 × 7, dim 512, head 16] × 3
4 | 7 × 7 | WLAMFormer_L3 | Attention_DownSampling, 672 | None | [win 7 × 7, dim 672, head 24] × 3

Appendix C

Figure A1. WLAM-UNet.

References

  1. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  2. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  3. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  4. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Guo, B. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
  5. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4803. [Google Scholar]
  6. Zhou, H.; Zhang, Y.; Guo, H.; Liu, C.; Zhang, X.; Xu, J.; Gu, J. Neural architecture transformer. arXiv 2021, arXiv:2106.04247. [Google Scholar]
  7. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  8. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar]
  9. Qin, Z.; Sun, W.; Deng, H.; Li, D.; Wei, Y.; Lv, B.; Zhong, Y. cosformer: Rethinking softmax in attention. arXiv 2022, arXiv:2202.08791. [Google Scholar] [CrossRef]
  10. Ma, X.; Kong, X.; Wang, S.; Zhou, C.; May, J.; Ma, H.; Zettlemoyer, L. Luna: Linear unified nested attention. Adv. Neural Inf. Process. Syst. 2021, 34, 2441–2453. [Google Scholar]
  11. Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3531–3539. [Google Scholar]
  12. Gao, Y.; Chen, Y.; Wang, K. SOFT: A simple and efficient attention mechanism. arXiv 2021, arXiv:2104.02544. [Google Scholar]
  13. Xiong, Y.; Zeng, Z.; Chakraborty, R.; Tan, M.; Fung, G.; Li, Y.; Singh, V. Nyströmformer: A nyström-based algorithm for approximating self-attention. Proc. AAAI Conf. Artif. Intell. 2021, 35, 14138–14148. [Google Scholar] [CrossRef]
  14. You, H.; Xiong, Y.; Dai, X.; Wu, B.; Zhang, P.; Fan, H.; Vajda, P.; Lin, Y. Castling-vit: Compressing self-attention via switching towards linear-angular attention at vision transformer inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14431–14442. [Google Scholar]
  15. Han, D.; Pan, X.; Han, Y.; Song, S.; Huang, G. Flatten transformer: Vision transformer using focused linear attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 5961–5971. [Google Scholar]
  16. Han, D.; Ye, T.; Han, Y.; Xia, Z.; Pan, S.; Wan, P.; Huang, G. Agent attention. In European Conference on Computer Vision 2024; Springer Nature: Cham, Switzerland, 2024; pp. 124–140. [Google Scholar]
  17. Xu, Z.; Wu, D.; Yu, C.; Chu, X.; Sang, N.; Gao, C. SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation. Proc. AAAI Conf. Artif. Intell. 2024, 38, 6378–6386. [Google Scholar] [CrossRef]
  18. Jiang, J.; Zhang, P.; Luo, Y.; Li, C.; Kim, J.B.; Zhang, K.; Kim, S. AdaMCT: Adaptive mixture of CNN-transformer for sequential recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 976–986. [Google Scholar] [CrossRef]
  19. Lou, M.; Zhou, H.Y.; Yang, S.; Yu, Y. TransXNet: Learning both global and local dynamics with a dual dynamic token mixer for visual recognition. arXiv 2023, arXiv:2310.19380. [Google Scholar] [CrossRef]
  20. Yoo, J.; Kim, T.; Lee, S.; Kim, S.H.; Lee, H.; Kim, T.H. Enriched cnn-transformer feature aggregation networks for super-resolution. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 4956–4965. [Google Scholar]
  21. Maaz, M.; Shaker, A.; Cholakkal, H.; Khan, S.; Zamir, S.W.; Anwer, R.M.; Shahbaz Khan, F. Edgenext: Efficiently amalgamated cnn-transformer architecture for mobile vision applications. In European Conference on Computer Vision 2022; Springer Nature: Cham, Switzerland, 2022; pp. 3–20. [Google Scholar]
  22. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 22–31. [Google Scholar]
  23. Mehta, S.; Rastegari, M. Separable self-attention for mobile vision transformers. arXiv 2022, arXiv:2206.02680. [Google Scholar] [CrossRef]
  24. Wadekar, S.N.; Chaurasia, A. MobileViTv3: Mobile-friendly vision transformer with simple and effective fusion of local, global, and input features. arXiv 2022, arXiv:2209.15159. [Google Scholar] [CrossRef]
  25. Han, D.; Wang, Z.; Xia, Z.; Han, Y.; Pu, Y.; Ge, C.; Huang, G. Demystifying Mamba in Vision: A Linear Attention Perspective. arXiv 2024, arXiv:2405.16605. [Google Scholar] [CrossRef]
  26. Bae, W.; Yoo, J.; Chul, Y.J. Beyond deep residual learning for image restoration: Persistent homology-guided manifold simplification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 145–153. [Google Scholar]
  27. Fujieda, S.; Takayama, K.; Hachisuka, T. Wavelet convolutional neural networks. arXiv 2018, arXiv:1805.08620. [Google Scholar] [CrossRef]
  28. Yao, T.; Pan, Y.; Li, Y.; Ngo, C.W.; Mei, T. Wave-vit: Unifying wavelet and transformers for visual representation learning. In European Conference on Computer Vision 2022; Springer Nature: Cham, Switzerland, 2022; pp. 328–345. [Google Scholar]
  29. Li, J.; Cheng, B.; Chen, Y.; Gao, G.; Shi, J.; Zeng, T. EWT: Efficient Wavelet-Transformer for single image denoising. Neural Netw. 2024, 177, 106378. [Google Scholar] [CrossRef] [PubMed]
  30. Azad, R.; Kazerouni, A.; Sulaiman, A.; Bozorgpour, A.; Aghdam, E.K.; Jose, A.; Merhof, D. Unlocking fine-grained details with wavelet-based high-frequency enhancement in transformers. In International Workshop on Machine Learning in Medical Imaging; Springer Nature: Cham, Switzerland, 2023; pp. 207–216. [Google Scholar]
  31. Gao, X.; Qiu, T.; Zhang, X.; Bai, H.; Liu, K.; Huang, X.; Liu, H. Efficient multi-scale network with learnable discrete wavelet transform for blind motion deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2733–2742. [Google Scholar]
  32. Roy, A.; Sarkar, S.; Ghosal, S.; Kaplun, D.; Lyanova, A.; Sarkar, R. A wavelet guided attention module for skin cancer classification with gradient-based feature fusion. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; pp. 1–4. [Google Scholar]
  33. Tan, J.; Pei, S.; Qin, W.; Fu, B.; Li, X.; Huang, L. Wavelet-based Mamba with Fourier Adjustment for Low-light Image Enhancement. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 3449–3464. [Google Scholar]
  34. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet convolutions for large receptive fields. In European Conference on Computer Vision 2024; Springer Nature: Cham, Switzerland, 2024; pp. 363–380. [Google Scholar]
  35. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  36. Koonce, B. Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Apress: New York, NY, USA, 2021; pp. 109–123. [Google Scholar]
  37. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  38. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar]
  39. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  40. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  41. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  42. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  43. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 702–703. [Google Scholar]
  44. Shaker, A.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.H.; Khan, F.S. Swiftformer: Efficient additive attention for transformer-based real-time mobile vision applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17425–17436. [Google Scholar]
  45. Zhang, T.; Li, L.; Zhou, Y.; Liu, W.; Qian, C.; Ji, X. Cas-vit: Convolutional additive self-attention vision transformers for efficient mobile applications. arXiv 2024, arXiv:2408.03703. [Google Scholar]
  46. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar]
  47. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142. [Google Scholar]
  48. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. Localmamba: Visual state space model with windowed selective scan. arXiv 2024, arXiv:2403.09338. [Google Scholar]
  49. Liu, Y.; Tian, J. Probabilistic Attention Map: A Probabilistic Attention Mechanism for Convolutional Neural Networks. Sensors 2024, 24, 8187. [Google Scholar] [CrossRef]
  50. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. EfficientViT: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14420–14430. [Google Scholar]
  51. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  52. Pan, J.; Bulat, A.; Tan, F.; Zhu, X.; Dudziak, L.; Li, H.; Tzimiropoulos, G.; Martinez, B. Edgevits: Competing light-weight cnns on mobile devices with vision transformers. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 294–311. [Google Scholar]
  53. Available online: https://challenge.isic-archive.com/data/#2017 (accessed on 15 January 2025).
  54. Available online: https://challenge.isic-archive.com/data/#2018 (accessed on 15 January 2025).
  55. Gao, Y.; Zhou, M.; Liu, D.; Metaxas, D. A multi-scale transformer for medical image segmentation: Architectures, model efficiency, and benchmarks. arXiv 2022, arXiv:2203.00131. [Google Scholar]
  56. Zhang, Y.; Liu, H.; Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2021; pp. 14–24. [Google Scholar]
  57. Ruan, J.; Xiang, S.; Xie, M.; Liu, T.; Fu, Y. Malunet: A multi-attention and light-weight unet for skin lesion segmentation. In Proceedings of the 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA, 6–8 December 2022; pp. 1150–1156. [Google Scholar]
  58. Ruan, J.; Li, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar]
  59. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
  60. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  61. Wei, J.; Hu, Y.; Zhang, R.; Li, Z.; Zhou, S.K.; Cui, S. Shallow attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2021; pp. 699–708. [Google Scholar]
Figure 1. Wavelet transform-enhanced Transformer structure diagram: (a) attention diagram from [28]; (b) attention diagram from [30]; (c) attention diagram from [33]; (d) attention diagram from this paper.
Figure 2. The rank of the attention matrices: (a) the rank of the linear attention matrix; (b) the rank of the attention matrix after the addition of a depthwise convolution (DWC) module; and (c) the rank of the attention matrix following inverse wavelet transformation.
Figure 3. The macro architecture design of the WLAMFormer network: (a) a schematic diagram of the TWMA BLOCK structure; (b) an overview of the overall structure of the WLAMFormer.
Figure 4. Comparison of image classification performance on the ImageNet-1K dataset.
Figure 5. Performance comparison of plug-and-play attention modules. (a) Comparison of plug-and-play attention performance based on the PVT model. (b) Comparison of plug-and-play attention performance based on the Swin model. (c) Comparison of plug-and-play attention performance based on the CSwin model.
Figure 6. Visualization of segmentation results on the ISIC18 dataset.
Figure 7. Heatmaps generated using the Grad-CAM method.
Table 1. Comparison of image classification performance on the ImageNet-1K dataset.

Model | Par. ↓ (M) | FLOPs ↓ (G) | Throughput (A100) | Type | Top-1 ↑
PVTv2-B1 [40] | 14.02 | 2.034 | 1945 | Transformer | 78.7
SwiftFormer-L1 [44] | 12.05 | 1.604 | 5051 | Hybrid | 80.9
CAS-ViT-M [45] | 12.42 | 1.887 | 2254 | Hybrid | 82.8
PoolFormer-S12 [46] | 11.9 | 1.813 | 3327 | Pool | 77.2
MobileViT-v2 × 1.5 [23] | 10.0 | 3.151 | 2356 | Hybrid | 80.4
EffiFormer-L1 [11] | 12.28 | 1.310 | 5046 | Hybrid | 79.2
WLAMFormer_L1 | 13.5 | 2.847 | 2296 | DWT-Transformer | 83.0
ResNet-50 | 25.5 | 4.123 | 4835 | ConvNet | 78.5
PoolFormer-S24 [46] | 21.35 | 3.394 | 2156 | Pool | 80.3
PoolFormer-S36 [46] | 32.80 | 4.620 | 1114 | Pool | 81.4
SwiftFormer-L3 [44] | 28.48 | 4.021 | 2896 | Hybrid | 83.0
Swin-T [3] | 28.27 | 4.372 | 1246 | Transformer | 81.3
PVT-S [2] | 24.10 | 3.687 | 1156 | Transformer | 79.8
ConvNeXt-T [47] | 29.1 | 4.532 | 3235 | ConvNet | 82.1
CAS-ViT-T [45] | 21.76 | 3.597 | 1084 | Hybrid | 83.9
EffiFormer-L3 [11] | 31.3 | 3.940 | 2691 | Hybrid | 82.4
VMamba-T [48] | 30.2 | 4.902 | 1686 | Mamba | 82.5
MLLA-T [25] | 25.12 | 4.250 | 1009 | MLLA | 83.5
WTConvNeXt-T [34] | 30 | 4.5 | 2514 | DWT-ConvNet | 82.5
WLAMFormer_L2 | 25.07 | 3.803 | 1280 | DWT-Transformer | 84.1
ConvNeXt-S [47] | 50.2 | 8.74 | 1255 | ConvNet | 83.1
PVTv2-B3 [40] | 45.2 | 6.97 | 403 | Transformer | 83.2
CSwin-S [38] | 35.4 | 6.93 | 625 | Transformer | 83.6
VMamba-S [48] | 50.4 | 8.72 | 877 | Mamba | 83.6
MLLA-S [25] | 47.6 | 8.13 | 851 | MLLA | 84.4
WTConvNeXt-S [34] | 54.2 | 8.8 | 1045 | DWT-ConvNet | 83.6
WLAMFormer_L3 | 46.6 | 7.75 | 861 | DWT-Transformer | 84.6
Table 2. Performance comparison of plug-and-play attention modules.

Model | Par. ↓ (M) | FLOPs ↓ (G) | Res. | Top-1 ↑
PVT-T | 11.2 | 1.9 | 224 × 224 | 75.1
Agent-PVT-T | 11.6 | 2.0 | 224 × 224 | 78.5
WLAM-PVT-T | 11.8 | 2.0 | 224 × 224 | 78.8
PVT-S | 24.5 | 3.6 | 224 × 224 | 79.8
Agent-PVT-S | 20.6 | 4.0 | 224 × 224 | 82.2
WLAM-PVT-S | 20.8 | 3.9 | 224 × 224 | 82.6
PVT-M | 44.2 | 6.7 | 224 × 224 | 81.2
Agent-PVT-M | 35.9 | 7.0 | 224 × 224 | 83.4
WLAM-PVT-M | 35.6 | 6.8 | 224 × 224 | 83.5
Swin-T | 29 | 4.5 | 224 × 224 | 81.3
Agent-Swin-T | 29 | 4.5 | 224 × 224 | 82.6
WLAM-Swin-T | 27 | 4.3 | 224 × 224 | 83.0
Swin-S | 50 | 8.7 | 224 × 224 | 83.0
Agent-Swin-S | 50 | 8.7 | 224 × 224 | 83.7
WLAM-Swin-S | 49 | 8.4 | 224 × 224 | 83.8
CSwin-T | 23 | 4.3 | 224 × 224 | 82.7
Agent-CSwin-T | 23 | 4.3 | 224 × 224 | 83.3
WLAM-CSwin-T | 21 | 4.1 | 224 × 224 | 83.6
CSwin-S | 35 | 6.9 | 224 × 224 | 83.6
Agent-CSwin-S | 35 | 6.9 | 224 × 224 | 84.0
WLAM-CSwin-S | 34 | 6.7 | 224 × 224 | 84.2
Table 3. Comparison of accuracy on the Cifar10 and Cifar100 datasets.

Model | Par. ↓ (M) | FLOPs ↓ (G) | Type | Top-1 ↑ (Cifar10) | Top-1 ↑ (Cifar100)
MobileViT-v2 × 1.5 | 10.0 | 3.151 | Hybrid | 96.2 | 79.5
EfficientFormer-L1 [50] | 12.3 | 2.4 | Hybrid | 97.5 | 83.2
EdgeViT-S [52] | 11.1 | 1.1 | Transformer | 97.8 | 81.2
EdgeViT-M [52] | 13.6 | 2.3 | Transformer | 98.2 | 82.7
PVT-Tiny | 11.2 | 1.9 | Transformer | 95.8 | 77.6
WLAM-PVT-T | 11.8 | 2.0 | DWT-Transformer | 96.9 | 82.1
WLAMFormer_L1 | 13.5 | 2.8 | DWT-Transformer | 97.7 | 84.5
PVT-Small | 24.5 | 3.8 | Transformer | 96.5 | 79.8
WLAM-PVT-S | 20.8 | 3.9 | DWT-Transformer | 98.4 | 84.8
PoolFormer-S24 | 21 | 3.5 | Pool | 96.8 | 81.8
EfficientFormer-L3 [50] | 31.9 | 5.3 | Hybrid | 98.2 | 85.7
ConvNeXt | 28 | 4.5 | ConvNet | 98.7 | 87.5
ConvNeXt V2-Tiny | 28 | 4.5 | ConvNet | 99.0 | 90.0
EfficientNetV2-S | 24 | 8.8 | ConvNet | 98.1 | 90.3
WLAMFormer_L2 | 23 | 3.8 | DWT-Transformer | 98.2 | 87.1
Table 4. The comparative experimental results on the ISIC17 and ISIC18 datasets, where the best performances are highlighted in bold.

Dataset | Model | mIoU (%) ↑ | DSC (%) ↑ | Acc (%) ↑ | Spe (%) ↑ | Sen (%) ↑
ISIC17 | UNet [39] | 79.98 | 86.99 | 95.65 | 97.43 | 86.82
ISIC17 | UTNetV2 [55] | 77.35 | 87.23 | 95.84 | 98.05 | 84.85
ISIC17 | TransFuse [56] | 79.21 | 88.40 | 96.17 | 97.98 | 87.14
ISIC17 | MALUNet [57] | 78.78 | 88.13 | 96.18 | 98.47 | 84.78
ISIC17 | VM-UNet [58] | 80.23 | 89.03 | 96.29 | 97.58 | 89.90
ISIC17 | WLAM-UNet | 80.41 | 89.23 | 96.45 | 97.55 | 90.10
ISIC18 | UNet [39] | 77.86 | 87.55 | 94.05 | 96.69 | 85.86
ISIC18 | UNet++ [59] | 78.31 | 87.83 | 94.02 | 95.75 | 88.65
ISIC18 | Att-UNet [60] | 78.43 | 87.91 | 94.13 | 96.23 | 87.60
ISIC18 | UTNetV2 [55] | 78.91 | 88.25 | 94.32 | 96.48 | 87.60
ISIC18 | SANet [61] | 79.52 | 88.59 | 94.39 | 95.97 | 89.46
ISIC18 | TransFuse [56] | 80.33 | 89.27 | 94.66 | 95.74 | 91.28
ISIC18 | MALUNet [57] | 80.25 | 89.04 | 94.62 | 96.19 | 89.74
ISIC18 | VM-UNet [58] | 80.35 | 89.71 | 94.91 | 96.13 | 91.12
ISIC18 | WLAM-UNet | 80.43 | 89.84 | 95.00 | 96.20 | 91.22
Table 5. Ablation study comparison.

Model | Par. ↓ (M) | FLOPs ↓ (G) | Throughput (A100) | Top-1 ↑ | Difference
1 | 25.0 | 3.8 | 1389 | 82.9 | −1.2
2 | 24.9 | 4.2 | 1266 | 83.3 | −0.8
3 | 25.6 | 4.0 | 1401 | 82.1 | −2.0
4 | 25.2 | 4.2 | 1050 | 83.8 | −0.3
WLAMFormer_L2 | 25.0 | 3.8 | 1280 | 84.1 | –
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
