Next Article in Journal
Optimization of Cryogenic Gas Separation Systems Based on Exergetic Analysis—The Claude–Heylandt Cycle for Oxygen Separation
Next Article in Special Issue
High-Quality Representation Learning Approach to Spatio-Temporal Traffic Speed Data with Lp,ϵ-Norm
Previous Article in Journal
What Is a Pattern in Statistical Mechanics? Formalizing Structure and Patterns in One-Dimensional Spin Lattice Models with Computational Mechanics
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

HASwinNet: A Swin Transformer-Based Denoising Framework with Hybrid Attention for mmWave MIMO Systems

1
School of Artificial Intelligence and Computer, North China University of Technology, Beijing 100144, China
2
Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong 999077, China
*
Author to whom correspondence should be addressed.
Entropy 2026, 28(1), 124; https://doi.org/10.3390/e28010124
Submission received: 30 September 2025 / Revised: 14 January 2026 / Accepted: 17 January 2026 / Published: 20 January 2026

Abstract

Millimeter-wave (mmWave) massive multiple-input, multiple-output (MIMO) systems are a cornerstone technology for integrated sensing and communication (ISAC) in sixth-generation (6G) mobile networks. These systems provide high-capacity backhaul while simultaneously enabling high-resolution environmental sensing. However, accurate channel estimation remains highly challenging due to intrinsic noise sensitivity and clustered sparse multipath structures. These challenges are particularly severe under limited pilot resources and low signal-to-noise ratio (SNR) conditions. To address these difficulties, this paper proposes HASwinNet, a deep learning (DL) framework designed for mmWave channel denoising. The framework integrates a hierarchical Swin Transformer encoder for structured representation learning. It further incorporates two complementary branches. The first branch performs sparse token extraction guided by angular-domain significance. The second branch focuses on angular-domain refinement by applying discrete Fourier transform (DFT), squeeze-and-excitation (SE), and inverse DFT (IDFT) operations. This generates a mask that highlights angularly coherent features. A decoder combines the outputs of both branches with a residual projection from the input to yield refined channel estimates. Additionally, we introduce an angular-domain perceptual loss during training. This enforces spectral consistency and preserves clustered multipath structures. Simulation results based on the Saleh–Valenzuela (S–V) channel model demonstrate that HASwinNet achieves significant improvements in normalized mean squared error (NMSE) and bit error rate (BER). It consistently outperforms convolutional neural network (CNN), long short-term memory (LSTM), and U-Net baselines. Furthermore, experiments with reduced pilot symbols confirm that HASwinNet effectively exploits angular sparsity. The model retains a consistent advantage over baselines even under pilot-limited conditions. These findings validate the scalability of HASwinNet for practical 6G mmWave backhaul applications. They also highlight its potential in ISAC scenarios where accurate channel recovery supports both communication and sensing.

1. Introduction

With the explosive growth of mobile data traffic and the increasing demand for high-capacity backhaul, millimeter-wave (mmWave) massive multiple-input, multiple-output (MIMO) has emerged as a key technology for next-generation networks. By exploiting the abundant bandwidth and spatial multiplexing gains, mmWave MIMO enables ultra-high data rates [1,2]. Nevertheless, its inherent path loss, sparse multipath structure, and low-rank channel characteristics pose significant challenges for accurate channel estimation and denoising. Consequently, reliable acquisition of channel state information (CSI) under constrained training overhead remains a major problem. While conventional channel estimation focuses on extracting CSI from pilot symbols using statistical methods, these methods often suffer severe performance degradation under low signal-to-noise ratio (SNR) and limited pilot overhead. Channel denoising, on the other hand, treats the initial noisy estimate as a corrupted image and aims to recover the underlying clean structure. This paper specifically addresses the practical problem of channel denoising in mmWave massive MIMO systems, where the goal is to recover accurate CSI from severely corrupted observations resulting from stringent pilot constraints and harsh propagation environments.
The introduction of reconfigurable intelligent surfaces (RIS) further enhances beamforming capabilities and coverage. Nevertheless, the passive nature of RIS makes direct estimation of individual channel components infeasible, thereby increasing the complexity of CSI acquisition [3,4]. Beyond communications, mmWave MIMO is also regarded as a promising enabler for integrated sensing and communication (ISAC), since its large antenna arrays and high-frequency operation naturally support high-resolution radar-like sensing while delivering high-capacity links. For instance, Wang et al. [5] proposed an assistant vehicle localization scheme based on three collaborative base stations via SBL-based robust DOA estimation, demonstrating the potential of high-precision sensing. Gao et al. [6] demonstrated that compressed sampling-based frameworks can jointly recover CSI and radar perception information, thereby reducing pilot overhead while enabling ISAC functionalities. This underscores that efficient channel estimation and denoising not only benefit communications but also play a key role in enabling joint sensing capabilities for sixth-generation (6G) systems.
In response to these challenges, deep learning (DL) techniques have gained considerable attention in wireless communication research. Early DL-based methods approached the physical layer holistically, employing end-to-end training to jointly estimate the channel, detect signals, and decode information [7]. Subsequent developments combined traditional signal processing with neural network-based refinement, while recurrent models like long short-term memory (LSTM) were leveraged to capture temporal channel dynamics in time-varying environments [8]. A promising strategy in this field is to treat channel estimation as a signal restoration problem, where the channel matrix is analogous to a noisy image. DL methods are then used to reconstruct a cleaner, more accurate version of the channel. Early approaches in this direction included convolutional neural network (CNN)-based denoisers and super-resolution networks [9,10], while more recent models have explored diffusion-based generative methods for structure-preserving channel recovery, offering improvements over traditional generative adversarial network (GAN)-based approaches [11,12].
Recent DL models for mmWave and RIS-assisted systems have demonstrated strong capabilities in capturing angular-domain sparsity and spatial correlations. Both compressive sensing-inspired and data-driven architectures have been explored [13]. Notable advancements include domain-specific attention mechanisms such as Channelformer [14] and Comm-Transformer [15], which integrate hierarchical attention and positional encoding to better represent channel characteristics. In RIS-assisted systems, methods like RIS-TX [16] enhance estimation by actively exciting pilot signals to overcome passive reflection challenges. Furthermore, the combination of DL and RIS has been comprehensively reviewed, and the results confirm that data-driven solutions are particularly effective for RIS-assisted wireless networks [17].
From an architectural perspective, our work builds upon the Swin Transformer [18], which introduces hierarchical shifted-window attention to balance local feature extraction and global context modeling. This design is particularly advantageous for mmWave channels, where both clustering and angular sparsity are simultaneously present. A recent tutorial article [19] has systematically reviewed Transformer-based architectures in wireless communications and emphasized their capability to model long-range dependencies. In addition, sparse representation plays an important role in mmWave channel estimation. Wang et al. [20] investigated polarization channel estimation for circular and non-circular signals in massive MIMO systems, highlighting the value of exploiting signal structures. Similarly, Zhang et al. [21] demonstrated that combining compressive sensing priors with deep neural networks significantly improves sparse channel recovery in massive MIMO systems. Motivated by this, we design a sparse token extraction module guided by angular-domain significance.
Another critical dimension is frequency-domain consistency. In practice, residual distortions often manifest as spectral mismatch, degrading performance across subcarriers. Mashhadi and Gündüz [22] developed a frequency-aware deep learning approach for joint pilot design and channel estimation in MIMO-OFDM, leveraging inter-frequency correlations to enhance performance. Inspired by this idea, we incorporate an angular-domain perceptual loss to preserve the clustered multipath structure and enhance generalization.
Despite these advances, many existing models fall short of capturing both long-range angular dependencies and local spatial structures, especially in low-SNR environments or when pilot symbols are limited. Specifically, while CNN-based methods excel at modeling localized features, they struggle with representing global context due to their limited receptive fields. Similarly, recurrent models like LSTM, although effective for temporal sequences, typically treat the 2D channel matrix as 1D sequences, thereby disrupting the intrinsic 2D spatial-angular correlations required for accurate reconstruction. Although U-Net architectures improve multi-scale feature extraction through encoder-decoder structures, they generally treat channel denoising as a generic image restoration task, often failing to exploit the specific, sparse, clustered structure of mmWave propagation. Furthermore, standard Transformer architectures, while capable of modeling global relationships, often incur quadratic computational complexity, making them prohibitive for the large channel dimensions typical of 6G backhaul systems. Moreover, few prior models explicitly decouple coarse angular patterns from dense spatial textures in the channel matrix.
To address these limitations, we propose HASwinNet, a DL framework specifically designed for mmWave MIMO channel denoising. Intuitively, the proposed system operates in three stages: First, it utilizes a hierarchical Swin Transformer to extract multi-scale spatial features from the noisy input. Second, it splits the processing into two parallel branches: a Sparse Token branch that identifies and preserves the dominant multipath components, and an Angular Refinement branch that enforces spectral consistency to filter out incoherent noise. Finally, these complementary features are fused to reconstruct a high-fidelity channel matrix. This design ensures that both the strong directional paths and the subtle angular structures are preserved. The framework is intended for deployment at the Small Cell Base Station (SCBS) receiver side to enable real-time denoising of uplink or backhaul signals.
The main contributions of this work are summarized as follows:
  • We explicitly formulate the channel denoising problem distinct from traditional estimation, and propose HASwinNet, a hierarchical framework based on Swin Transformer. Unlike conventional CNNs with fixed receptive fields, our architecture efficiently models both local spatial textures and long-range angular dependencies, making it effective for capturing the clustered multipath structure of mmWave channels.
  • We design a novel dual-branch mechanism that decouples signal recovery into two complementary tasks: a sparsity-aware token extraction module that identifies dominant angular components, and a spectral consistency branch that utilizes DFT/IDFT operations to filter incoherent noise. This hybrid design ensures that faithful channel reconstruction is achieved by simultaneously exploiting angular sparsity and frequency-domain coherence.
  • We demonstrate through extensive simulations on Saleh–Valenzuela (S–V) channels that HASwinNet significantly outperforms representative baselines in terms of NMSE and BER. Specifically, the proposed framework exhibits superior robustness in harsh deployment scenarios characterized by low SNR and severe pilot constraints, confirming its engineering value for future 6G backhaul and ISAC systems.

2. System Model

As illustrated in Figure 1, this paper considers a typical mmWave backhaul communication system comprising a macro base station (BS), a RIS, and multiple small cell base stations (SCBSs). Unlike conventional backhaul approaches that rely on optical fiber or microwave links, this architecture establishes high-speed and flexible wireless connectivity purely through mmWave communication. Due to the high sensitivity of mmWave signals to blockages, a passive RIS is introduced to enhance non-line-of-sight (NLoS) coverage, enabling indirect communication between the BS and SCBSs via intelligent reflection.
In this setup, the BS is equipped with N t transmit antennas, the RIS comprises M reconfigurable reflecting elements, and each SCBS is equipped with N r receive antennas. The transmission can be divided into three segments: the direct link from the BS to the RIS, the reflection and phase modulation process at the RIS, and the downstream link from the RIS to the SCBS. All propagation paths are modeled according to the S–V geometric channel model to effectively capture the sparse and directional characteristics of mmWave propagation.
Let H BR C M × N t denote the channel matrix between the BS and RIS, which can be expressed as follows:
H BR = 1 P 1 p = 1 P 1 α p BR a R ( θ p AoA ) a T H ( ϕ p AoD ) ,
where P 1 denotes the number of propagation paths in the BS–RIS links, α p BR represents the complex gain of the p-th path, and a T ( · ) and a R ( · ) are the array response vectors of the BS and RIS, respectively. The angles θ p AoA and ϕ p AoD correspond to the angle of arrival (AoA) and angle of departure (AoD) in the angular domain.
The RIS behavior is modeled via a diagonal phase-shift matrix Φ = diag ( e j ϕ 1 , , e j ϕ M ) , where ϕ m denotes the programmable phase shift at the m-th reflecting element. The channel between the RIS and the SCBS is denoted as H RS C N r × M , given by the following:
H RS = 1 P 2 p = 1 P 2 α p RS a S ( ϑ p AoA ) a R H ( φ p AoD ) ,
where P 2 denotes the number of propagation paths in the RIS–SCBS links, and a S ( · ) is the receive array response at the SCBS.
Combining the above, the end-to-end cascaded channel between the BS and SCBS, denoted as H eq C N r × N t , can be modeled as follows:
H eq = H RS Φ H BR .
During the pilot training phase, the BS transmits a known pilot matrix X C N t × L to probe the equivalent channel, where L denotes the number of pilot symbols. The received signal at the SCBS is given by the following:
Y = H eq X + W ,
where Y C N r × L denotes the received signal, and W is additive white Gaussian noise (AWGN) following the distribution CN ( 0 , σ 2 I ) . The cascaded channel H eq explicitly depends on the RIS phase-shift matrix Φ ; HASwinNet is designed to denoise the effective cascaded channel from noisy pilot observations. In our setting, Φ is not provided as an explicit side input to the network. Instead, the RIS effect is implicitly embedded in the training and evaluation data through H eq and the received pilots Y .
To quantify the training cost, we define the dimensionless pilot overhead as
η L N t ,
which measures the proportion of pilot resources relative to the channel dimension. A larger η improves estimation accuracy but consumes more spectrum, while a smaller η preserves throughput but risks higher error. Assuming an orthogonal pilot matrix with X X H = L P s I N t , the least-squares (LS) estimator under (4) is as follows:
H ^ eq = Y X H ( X X H ) 1 = H eq + W X H ( X X H ) 1 .
In this case, the additive error term has per-entry variance σ 2 / ( L P s ) . For the LS estimator, the NMSE is proportional to the relative estimation error power in H ^ eq , and thus scales with the same factor, yielding the following:
NMSE LS 1 L P s = 1 η N t P s .
This LS-based scaling is used only for motivation and does not necessarily describe the NMSE trend of HASwinNet. This reveals a fundamental trade-off in which increasing the pilot overhead η or the pilot power P s reduces NMSE, while a larger overhead simultaneously decreases spectral efficiency.
Notably, mmWave MIMO and RIS-assisted channels exhibit strong angular sparsity and low-rank structures due to clustered propagation. These priors enable advanced estimators—such as compressed sensing and low-rank tensor decomposition—to recover channels reliably even when η 1 , thereby alleviating pilot overhead while maintaining low NMSE. This insight directly motivates the design of learning-based denoising frameworks that can exploit such low-rank/sparse features, as developed next in the proposed HASwinNet.
To interface with real-valued DL models, the received complex signal is separated into real and imaginary components and represented as a three-dimensional real-valued tensor:
Y r = Re ( Y ) ; Im ( Y ) R 2 × N r × L .
To ensure a fixed-size 2D input for the Swin-based encoder, we adopt a simple padding strategy along the pilot dimension. Specifically, when L < N t , we zero-pad Y r to obtain Y ˜ r R 2 × N r × N t ; when L = N t , Y ˜ r = Y r . Unless otherwise stated, the network takes Y ˜ r as its input. On this basis, the proposed HASwinNet is deployed at the SCBS to reconstruct the equivalent channel from noisy observations. Formally, the network implements a data-driven denoising map:
H ^ r = f θ ( Y ˜ r ) , H ^ r R 2 × N r × N t ,
where f θ ( · ) denotes a learnable function parameterized by θ and H ^ r is the real-valued representation of the estimated channel. The complex-domain estimate is obtained by recombining the two real channels as follows:
H ^ eq = H ^ r ( 1 ) + j H ^ r ( 2 ) C N r × N t .

3. Proposed Method

The HASwinNet framework is developed to accurately recover mmWave MIMO channels under low SNR and high angular sparsity conditions. Unlike conventional CNN-based or generic Transformer-based networks, the proposed architecture incorporates Swin Transformer encoding, angular-domain refinement, and sparsity-aware attention, jointly modeling the structural properties of mmWave propagation. This design enables the network to balance global angular sparsity with fine-grained angular spectral consistency, which is essential for robust denoising in practical backhaul scenarios. The detailed forward propagation procedure of the proposed HASwinNet is summarized in Algorithm 1.
Algorithm 1 Forward Propagation of HASwinNet.
  • Require: Noisy pilot observation Y r R 2 × N r × L , temperature τ , sparsity ratio ρ
  • Ensure: Denoised channel estimate H ^ r R 2 × N r × N t
    • Operator definitions:  RowNorm 2 ( V ) i = V i , : 2 ; TopKMask ( g , K ) returns a binary mask with K = ρ N ones. During training, we adopt a straight-through estimator (STE) for the hard mask; during inference, hard Top-K is used.
    • Stage 0: Input formatting
 1:
Y ˜ r Pad ( Y r )      ▹ Pad to size 2 × N r × N t when L < N t
Stage 1: Encoding
 2:
χ 0 PatchEmbed ( Y ˜ r )
 3:
X e n c SwinEncoder ( χ 0 )
Stage 2: Dual-branch processing
 4:
F ten DFT ( X e n c )
 5:
F Reshape ( F ten )                   ▹ F C N × C , N = N r P N t P
Branch A: Sparse token branch
 6:
V F W s
 7:
z RowNorm 2 ( V )
 8:
g Softmax ( z / τ )
 9:
m TopKMask ( g , K )
10:
V s e l ( m g ) V
11:
T ^ ten Fold ( V s e l )     ▹zero-fill + fold to full grid before IDFT
12:
H t o k e n IDFT ( Refine ( T ^ ten ) )
Branch B: Angular refinement branch
13:
s σ ( W 2 δ ( W 1 · Pool ( F ten ) ) )
14:
F ^ ten F ten s
15:
H s p e c IDFT ( F ^ ten )
Stage 3: Fusion and reconstruction
16:
H f u s i o n H t o k e n H s p e c
17:
H u p ConvTranspose ( H f u s i o n , k = P , s = P )
18:
H s k i p ResidualProj ( Y ˜ r )
19:
H ^ r OutProj ( H u p ) + H s k i p
20:
return  H ^ r

3.1. Overall Architecture

As illustrated in Figure 2, HASwinNet adopts an encoder–decoder paradigm with auxiliary refinement paths. The input is the real-valued tensor Y r R 2 × N r × L formed by stacking the real and imaginary parts of the received pilot signal. A patch embedding layer first projects the raw observation into a latent feature space, followed by a Swin Transformer encoder that hierarchically extracts spatial features while preserving angular correlations across the antenna array. The encoded latent features then serve as the input for the subsequent dual-branch processing, maintaining the compressed feature representation to reduce computational redundancy.
The transformed representation is processed by two complementary branches. The first branch consists of a DFT block, a sparse selector, a linear projection, and multi-head self-attention. This pathway emphasizes dominant multipath components by projecting the angular-domain representation into a compact set of sparse tokens. The selected tokens are then refined through attention layers and upsampling, yielding a high-resolution reconstruction that captures the principal angular information. The second branch introduces spectral refinement by applying a DFT block, a squeeze-and-excitation module, and an IDFT operation. This pathway adaptively emphasizes critical frequency components and suppresses redundant noise, thereby enhancing angular-domain spectral consistency.
The outputs of both branches are fused in the decoder, while a shallow residual projection maps the original observation Y r directly into the channel domain, forming a stabilizing skip connection. The final channel estimate is then obtained by integrating the hybrid reconstruction with this residual pathway, as detailed in the Decoder subsection. To explicitly track the signal transformation, the tensor dimensions at each processing stage are summarized in Table 1.

3.2. Swin Transformer Encoder

The encoder is based on hierarchical Swin Transformer blocks. Specifically, to process the input tensor, we utilize a patch size of P = 4 , resulting in a grid of 16 × 16 tokens with an embedding dimension of C = 96 . The backbone comprises four stages with depths set to { 2 , 2 , 6 , 2 } and attention heads set to { 3 , 6 , 12 , 24 } , respectively. Within these blocks, we apply window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA) using a window size of M = 8 and a shift size of 4. This mechanism enables both local feature extraction and long-range angular dependency modeling at a reduced computational cost compared with global attention. The encoder generates domain-adapted latent features that are structurally consistent with the MIMO channel matrix, providing an effective basis for subsequent sparse and spectral processing.

3.3. Sparse Token Branch

The sparse token branch explicitly exploits the angular sparsity of mmWave propagation. Given the encoded features comprising N = 256 tokens on a 16 × 16 grid, a DFT block along the antenna dimension transforms them into the angular domain, producing F C N × C , where N = N r P N t P = 256 and C = 96 . Here, the DFT is applied along the 16 × 16 antenna-grid dimensions N r P , N t P . To filter out noise and redundancy, we enforce a sparse token budget of K = 64 , retaining only the top-25% most significant features. A sparsity-aware gating operator is then applied to emphasize these informative components while attenuating noise-dominated ones. Formally, this operation can be expressed as
T ^ = G s F · W s ,
where W s R C × d is a learnable projection, d denotes the projected token embedding dimension after the linear projection W s . Accordingly, the sparse token set satisfies T ^ R K × d , and G s ( · ) denotes a temperature-controlled soft gating function:
G s ( z ) = exp ( z / τ ) j exp ( z j / τ ) , τ > 0 .
Here, z represents the token importance scores, and τ is a temperature parameter. This design provides a differentiable approximation of sparse selection: with large τ , the gating weights are smooth and stable for training, while decreasing τ gradually sharpens the distribution, leading to selective emphasis on dominant angular paths. The resulting sparse tokens are subsequently refined by linear transformation and multi-head self-attention (MSA), followed by upsampling to restore spatial resolution.

3.4. Angular Refinement Branch

The spectral refinement branch complements the sparse pathway by enforcing angular-domain consistency. After applying a DFT block, the transformed features F ten C C × 16 × 16 are passed through a squeeze-and-excitation (SE) module that adaptively reweighs spectral components. Here, F ten denotes the tensor-form angular-domain features used for spectral refinement, to avoid ambiguity with the matrix-form tokens in the sparse branch. The reweighting operation is defined as
s = σ W 2 δ W 1 · Pool F ten ,
where Pool ( · ) denotes global average pooling over the 16 × 16 angular-grid dimensions, δ ( · ) is the ReLU activation, σ ( · ) is the sigmoid activation, and W 1 , W 2 are learnable weights. The resulting s R C is broadcast along the 16 × 16 grid for element-wise reweighting. The refined spectral representation is then obtained as
F ^ ten = F ten s ,
which emphasizes informative angular bins and suppresses irrelevant ones. An IDFT block finally maps F ^ ten back to the spatial domain, providing an angularly consistent refinement that complements the sparse token pathway.

3.5. Decoder and Reconstruction

To efficiently recover the channel structure, the decoder first performs feature fusion in the latent domain. Specifically, the token-based reconstruction H ^ token and the spectrally refined output H ^ spec are combined through element-wise multiplication while still at the encoded resolution of 16 × 16 :
H fusion = H ^ token H ^ spec ,
where ⊙ denotes the Hadamard (element-wise) product. This latent-space fusion ensures that structural consistency is enforced before expanding to the full spatial dimension.
Subsequently, to restore the full channel resolution, the fused features H fusion are processed by a transposed convolution layer, denoted as U ( · ) , with a kernel size and stride of 4. This operation symmetrically matches the encoder’s patch partition and directly upsamples the features back to the 64 × 64 spatial dimension. To stabilize training and preserve low-level information, a residual projection of the original observation Y r is introduced, denoted as H skip . The final channel estimate is then obtained by adding the upsampled reconstruction to the residual pathway:
H ^ r = U ( H fusion ) + H skip .
This formulation integrates the recovered angular features with the raw observation, balancing the reconstruction of sparse multipath clusters with the preservation of element-wise signal details.

3.6. Loss Function

The proposed HASwinNet is optimized in an end-to-end manner using a composite loss that reflects the physical characteristics of mmWave propagation, namely spatial-domain fidelity, angular sparsity, and angular-domain spectral consistency. The overall loss is defined as
L = α · L spatial ( H ^ eq , H eq ) + β · T ^ 1 + γ · L ang ( H ^ eq , H eq ) ,
where H ^ eq C N r × N t denotes the reconstructed channel, H eq C N r × N t is the ground-truth channel, and T ^ R K × d represents the sparse token set extracted from the encoded features after the projection W s , where d is the projected token embedding dimension. The weights α , β , γ R + are nonnegative coefficients that balance the three loss terms.
In our implementation, the weighting factors are empirically set to α = 1.0 , β = 0.01 , and γ = 0.1 . This specific configuration is motivated by the distinct roles of each term: α = 1.0 maintains the dominance of element-wise reconstruction fidelity; a small β = 0.01 provides a sufficient sparsity constraint to filter noise without eliminating weak signal components; and γ = 0.1 ensures angular spectral consistency serves as a regularizer rather than a conflicting objective. We analyze the impact of varying β and γ on the denoising performance. The experimental results indicate a clear trade-off. For the sparsity weight β , setting it to 0 leads to noise residue, while increasing it beyond 0.01 degrades NMSE by over-pruning valid paths. Similarly, for the spectral weight γ , a moderate value of 0.1 yields the best structural consistency. Excessive weighting was found to destabilize training, as spectral loss began to dominate the spatial reconstruction loss.
The first term L spatial is chosen to be the NMSE between the estimated and true channel matrices, ensuring element-wise reconstruction fidelity. The second term T ^ 1 is an 1 regularization penalty that enforces sparsity in the token representation, thus selecting only the dominant angular components that correspond to strong propagation paths. The third term is an angular-domain consistency loss, expressed as
L ang = | F ( H ^ eq ) | | F ( H eq ) | F 2 ,
where F ( · ) denotes the two-dimensional discrete Fourier transform (DFT); | · | is the element-wise magnitude operator; and · F is the Frobenius norm. By enforcing similarity in the spectral magnitudes, this term preserves the clustered multipath structure that governs mmWave channels.
Each term in the loss plays a complementary role. The NMSE ensures accurate element-wise recovery, but may overlook structural coherence. The sparsity regularizer encourages the network to identify only the most salient paths, consistent with the limited number of dominant multipath components in physical propagation. The angular-domain term prevents angular spectral distortion, aligning the reconstructed channel with the clustered multipath structure of the ground truth. Together, these three terms strike a balance between local accuracy and global consistency, yielding superior robustness under low SNR and limited-pilot conditions.

3.7. Inference and Complexity Analysis

During inference, HASwinNet leverages the synergy of its sparse token branch and angular refinement branch to achieve both accuracy and efficiency. The decoder output is obtained through the element-wise fusion with residual addition, as already defined in (15) and (16). This design ensures that the fused high-level structures are preserved while the residual pathway stabilizes training and guarantees low-level fidelity.
In terms of complexity, HASwinNet benefits from both windowed attention and sparse token refinement. The window-based self-attention reduces the quadratic complexity of global attention from O ( N 2 ) to O ( N w 2 ) , where N is the input sequence length, and w is the window size. Further reducing the effective sequence length to K N yields a final complexity of O ( K w 2 ) . To quantitatively validate this efficiency, we evaluate the model under the main experiment settings with an input size of 64 × 64 , patch size P = 4 , embedding dimension C = 96 , window size M = 8 , and depths of { 2 , 2 , 6 , 2 } . In this configuration, HASwinNet contains approximately 1.52 M learnable parameters and requires 0.84 GFLOPs. The corresponding model weight size is compact, occupying about 6.1 MB (FP32) or 3.0 MB (FP16). This favorable scaling and low resource consumption highlight the practicality of HASwinNet for real-time deployment in 6G backhaul scenarios.

4. Performance Evaluation

Performance evaluation of the proposed HASwinNet framework is carried out in a mmWave MIMO backhaul scenario based on the S–V clustered channel model.

4.1. Dataset and Simulation Setup

In the simulation setup, a narrowband backhaul link is considered, with both the transmitter and receiver employing uniform linear arrays of N t = N r = 64 antennas. The channel is composed of 3 spatial clusters, each generating 10 multipath components with an angular spread of 10°. The carrier frequency is set to 28 GHz, and the path loss exponent is 2.5. Link distances are uniformly distributed between 10 and 100 m.
Each sample contains a noisy observation Y C N r × L and the ground-truth channel H eq C N r × N t . For the full-pilot setting, we use L = N t = 64 ; for reduced-pilot experiments (e.g., half pilots), we set L = 32 and pad Y r to Y ˜ r R 2 × 64 × 64 before feeding it into HASwinNet. The dataset includes 50,000 training samples, 5000 for validation, and 5000 for testing. AWGN is introduced across SNR values ranging from 5 dB to 20 dB.
In data generation, each sample is synthesized by first drawing H B R and H R S according to the adopted channel model, and then constructing the effective cascaded channel H eq . The RIS phase shifts in Φ are generated following the considered hardware setting; therefore, the impact of RIS is naturally embedded in the statistics of the paired data ( Y r , H eq ) used for training and evaluation.

4.2. Training Configuration

The HASwinNet model is trained using the Adam with decoupled weight decay (AdamW) optimizer with an initial learning rate of 3 × 10 4 and weight decay of 1 × 10 5 . Training proceeds for 100 epochs with a batch size of 256. A warm-up phase is applied for the first five epochs, followed by a cosine annealing scheduler. The total loss function includes three components: NMSE loss, a sparsity-inducing regularization term, and an angular-domain perceptual loss based on 2D DFT magnitude. Unless otherwise stated, inference follows the same HASwinNet configuration as training.

4.3. Baseline Models

To evaluate the effectiveness of the proposed denoising framework, comparisons are made against three representative DL baselines widely adopted in prior literature. The first is a CNN consisting of three residual convolutional blocks tailored for local feature extraction. The second is an LSTM network that interprets the antenna columns of the channel matrix as temporal sequences, enabling sequence modeling. The third baseline is a U-Net-based model with an encoder–decoder architecture and skip connections, designed to recover spatial details at multiple scales. For fair comparison, all baseline models are constructed with comparable parameter counts and trained using the same dataset, batch size, and optimization schedule. Moreover, their hyperparameters, including learning rate, layer depth, and dropout probability, are independently optimized within a unified search space.

4.4. Evaluation Metrics

The denoising and channel recovery performance is assessed based on two commonly used evaluation metrics.
The normalized mean squared error (NMSE) measures reconstruction fidelity and is defined as
NMSE = E H ^ H F 2 H F 2 ,
where H ^ and H denote the estimated and ground-truth channel matrices, respectively, E [ · ] denotes expectation over channel realizations, and · F denotes the Frobenius norm. This metric quantifies the average squared difference between the estimated and true channel matrices, normalized by the energy of the ground-truth channel.
The bit error rate (BER) is computed by transmitting randomly generated quadrature phase-shift keying (QPSK) symbols through the estimated channel and comparing the received symbols with the original transmitted ones. It is defined as
BER = N e N sym ,
where N e denotes the number of erroneously detected symbols, and N sym is the total number of transmitted symbols.
These two metrics together reflect the recovery accuracy and communication reliability of the proposed HASwinNet model.

4.5. Performance Analysis

Figure 3 presents the denoising performance of all models across input SNRs in terms of NMSE and BER. Each data point is averaged over five independent runs with different random seeds.
In the NMSE plot (left), the proposed HASwinNet consistently achieves the lowest error across the entire SNR range. Its curve decreases rapidly with increasing SNR, reaching values on the order of 10 3 at 20 dB, which is nearly an order of magnitude lower than CNN and LSTM. U-Net performs better than CNN and LSTM, but HASwinNet still maintains a clear margin, confirming its superior capability in suppressing residual noise and preserving angular-domain structures.
In the BER plot (right), HASwinNet also exhibits the best performance, with the steepest decline as SNR improves. At low SNR ( 5 to 0 dB), CNN and LSTM curves are close, while U-Net achieves slightly better results. In the mid-to-high SNR regime (10–20 dB), U-Net clearly surpasses CNN and LSTM, but HASwinNet consistently achieves the lowest BER, reaching the order of 10 5 at 20 dB and thus ensuring the most reliable symbol detection.
To further examine robustness under limited pilot resources, Figure 4 reports performance with only half of the pilot symbols. As expected, all models degrade under reduced overhead, and the relative ranking remains unchanged. HASwinNet maintains a clear margin, particularly at medium to high SNRs, with markedly smaller degradation than the baselines. This behavior aligns with the network’s hybrid design in which the sparse token pathway concentrates representation on dominant angular components, the DFT–SE–IDFT refinement preserves clustered angular spectra, the hierarchical Swin encoder captures both local detail and long-range angular dependencies, and the residual projection safeguards low-level fidelity. Together with a composite objective that enforces spatial NMSE fidelity, token-level sparsity, and angular domain spectral consistency, these elements provide strong inductive priors that allow limited pilot observations to be translated into robust reconstructions and reflect higher pilot efficiency and resilience to reduced training overhead.
We further evaluate the robustness of the proposed HASwinNet against non-ideal RIS hardware constraints, specifically discrete phase shifts. Practical RIS elements often support only a limited number of phase levels due to hardware complexity and cost. Figure 5 compares the NMSE and BER performance of HASwinNet under the ideal continuous phase assumption against practical 1-bit and 2-bit quantization schemes.
As observed in Figure 5, while 1-bit quantization leads to a noticeable performance gap due to large phase errors, the 2-bit quantization scheme achieves performance highly comparable to the ideal continuous case across the entire SNR range. Notably, the performance curves of the 2-bit scheme closely follow the ideal baseline, maintaining the same order of magnitude in both NMSE and BER, even in the high SNR regime. This marginal degradation demonstrates that HASwinNet is highly robust to quantization errors and can be effectively deployed with low-cost, low-resolution RIS hardware without significantly compromising system performance.

5. Conclusions

This paper proposes HASwinNet, a DL-based denoising framework for mmWave MIMO channels. The architecture integrates a hierarchical Swin Transformer encoder with two complementary branches: A sparse token extraction path that captures dominant angular-domain features, and an angular-domain refinement path that enforces angular-domain consistency through adaptive reweighting. A composite loss function, consisting of spatial NMSE, angular-domain consistency, and sparsity regularization, ensures that both global and local channel characteristics are faithfully reconstructed. This design explicitly exploits the sparse multipath nature of mmWave propagation and strikes a balance between reconstruction fidelity and computational efficiency.
Extensive simulations on S–V channels have demonstrated that HASwinNet consistently outperforms representative deep learning baselines, including CNN, LSTM, and U-Net, across multiple evaluation metrics. The proposed model achieves superior NMSE and BER performance, exhibits strong robustness under low SNR conditions, and delivers pronounced gains in the high SNR regime. Moreover, experiments with reduced pilot symbols verify that HASwinNet effectively exploits angular sparsity and preserves channel structure, thereby achieving higher pilot efficiency and resilience to limited training overhead. And the proposed framework demonstrates strong robustness to hardware imperfections, maintaining high reconstruction accuracy even under low-resolution (2-bit) RIS phase quantization. These results confirm that the hybrid integration of sparse-token extraction and angular-domain refinement attains a favorable balance between reconstruction fidelity and computational complexity, underscoring the suitability of HASwinNet for practical deployment in 6G mmWave backhaul systems.
Given that mmWave massive MIMO is widely recognized as a cornerstone technology for ISAC, the proposed framework not only enhances the reliability of mmWave backhaul links but also enables accurate channel recovery that can support high-resolution sensing. Future research will extend HASwinNet to broadband and time-varying orthogonal frequency division multiplexing (OFDM) systems and conduct rigorous over-the-air (OTA) validation under realistic propagation conditions to further substantiate its applicability in unified ISAC scenarios.

Author Contributions

Conceptualization, X.H. and H.T.; methodology, H.T. and X.H.; software, H.T.; validation, J.C. and H.T.; formal analysis, X.H. and Z.X.; investigation, H.T. and J.C.; writing—original draft preparation, H.T.; writing—review and editing, X.H., J.Y. and H.T.; supervision, X.H. and Z.X.; project administration, X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Research and Development Program of China under Grant 2024YFE0200103, and the RGC Postdoctoral Fellowship Scheme of Project No. PDFS2425-6S05.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
6GSixth-Generation Mobile Networks
mmWaveMillimeter Wave
MIMOMultiple-Input, Multiple-Output
RISReconfigurable Intelligent Surface
SCBSSmall Cell Base Station
BSBase Station
UEUser Equipment
ISACIntegrated Sensing and Communication
S–VSaleh–Valenzuela
AoAAngle of Arrival
AoDAngle of Departure
AWGNAdditive White Gaussian Noise
LSLeast Squares
OFDMOrthogonal Frequency Division Multiplexing
QPSKQuadrature Phase Shift Keying
SNRSignal-to-Noise Ratio
NMSENormalized Mean Squared Error
BERBit Error Rate
CSIChannel State Information
DLDeep Learning
CNNConvolutional Neural Network
LSTMLong Short-Term Memory
U-NetU-shaped Convolutional Neural Network
GANGenerative Adversarial Network
NLoSNon-line-of-sight
MSAMulti-Head Self-Attention
W-MSAWindow-Based Multi-Head Self-Attention
SW-MSAShifted Window Multi-Head Self-Attention
MLPMultilayer Perceptron
DFTDiscrete Fourier Transform
IDFTInverse Discrete Fourier Transform
SESqueeze-and-Excitation
OTAOver-the-Air
AdamWAdam with Decoupled Weight Decay

References

  1. Wu, Q.; Zhang, R. Towards smart and reconfigurable environment: Intelligent reflecting surface aided wireless communications. IEEE Commun. Mag. 2020, 58, 106–112. [Google Scholar] [CrossRef]
  2. Basar, E.; Renzo, M.D.; de Rosny, J.; Debbah, M.; Alouini, M.; Zhang, R. Wireless communications through reconfigurable intelligent surfaces. IEEE Access 2019, 7, 116753–116773. [Google Scholar] [CrossRef]
  3. Zheng, B.; Zhang, R. Intelligent reflecting surface-enhanced OFDM: Channel estimation and reflection optimization. IEEE Wirel. Commun. Lett. 2020, 9, 518–522. [Google Scholar] [CrossRef]
  4. An, J.; Wu, Q.; Yuen, C. Scalable channel estimation and reflection optimization for reconfigurable intelligent surface-enhanced OFDM systems. IEEE Wirel. Commun. Lett. 2022, 11, 796–800. [Google Scholar] [CrossRef]
  5. Wang, H.; Wan, L.; Dong, M.; Ota, K.; Wang, X. Assistant Vehicle Localization Based on Three Collaborative Base Stations via SBL-Based Robust DOA Estimation. IEEE Internet Things J. 2019, 6, 5766–5777. [Google Scholar] [CrossRef]
  6. Gao, Z.; Wan, Z.; Masouros, C.; Ng, D.W.K.; Chen, S. Integrated sensing and communication with mmWave massive MIMO: A compressed sampling perspective. IEEE Trans. Wirel. Commun. 2022, 21, 8301–8315. [Google Scholar] [CrossRef]
  7. Ye, H.; Li, G.Y.; Juang, B. Power of deep learning for channel estimation and signal detection in OFDM systems. IEEE Wirel. Commun. Lett. 2018, 7, 114–117. [Google Scholar] [CrossRef]
  8. Pan, J.; Shan, H.; Li, R.; Wu, Y.; Wu, W.; Quek, T.Q.S. Channel estimation based on deep learning in vehicle-to-everything environments. IEEE Commun. Lett. 2021, 25, 1891–1895. [Google Scholar] [CrossRef]
  9. He, H.; Wen, C.; Jin, S.; Li, G.Y. Deep learning-based channel estimation for beamspace mmWave massive MIMO systems. IEEE Wirel. Commun. Lett. 2018, 7, 852–855. [Google Scholar] [CrossRef]
  10. Liu, C.; Liu, X.; Ng, D.W.K.; Yuan, J. Deep residual learning for channel estimation in intelligent reflecting surface-assisted multi-user communications. IEEE Trans. Wirel. Commun. 2022, 21, 898–912. [Google Scholar] [CrossRef]
  11. Guo, L.; Chen, W.; Sun, Y.; Ai, B.; Pappas, N.; Quek, T.Q.S. Diffusion-Driven Semantic Communication for Generative Models With Bandwidth Constraints. IEEE Trans. Wirel. Commun. 2025, 24, 6490–6503. [Google Scholar] [CrossRef]
  12. Pei, J.; Feng, C.; Wang, P.; Tabassum, H.; Shi, D. Latent Diffusion Model-Enabled Low-Latency Semantic Communication in the Presence of Semantic Ambiguities and Wireless Channel Noises. IEEE Trans. Wirel. Commun. 2025, 24, 4055–4072. [Google Scholar] [CrossRef]
  13. Liu, S.; Gao, Z.; Zhang, J.; Renzo, M.D.; Alouini, M. Deep denoising neural network assisted compressive channel estimation for mmWave intelligent reflecting surfaces. IEEE Trans. Veh. Technol. 2020, 69, 9223–9228. [Google Scholar] [CrossRef]
  14. Luan, D.; Thompson, J.S. Channelformer: Attention-based neural solution for wireless channel estimation and effective online training. IEEE Trans. Wirel. Commun. 2023, 22, 6562–6577. [Google Scholar] [CrossRef]
  15. Xie, Y.; Teh, K.C.; Kot, A.C. Comm-Transformer: A robust deep learning-based receiver for OFDM system under TDL channel. IEEE Trans. Commun. 2024, 72, 2014–2026. [Google Scholar] [CrossRef]
  16. Zhu, Y.; Liu, Y.; Wu, Q.; You, C.; Shi, Q. Channel estimation by transmitting pilots from reconfigurable intelligent surface. IEEE Trans. Wirel. Commun. 2024, 23, 3328–3343. [Google Scholar] [CrossRef]
  17. Zhou, H.; Erol-Kantarci, M.; Liu, Y.; Poor, H.V. A Survey on Model-Based, Heuristic, and Machine Learning Optimization Approaches in RIS-Aided Wireless Networks. IEEE Commun. Surv. Tutor. 2024, 26, 781–823. [Google Scholar] [CrossRef]
  18. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  19. Zayat, A.; Hasabelnaby, M.A.; Obeed, M.; Chaaban, A. Transformer Masked Autoencoders for Next-Generation Wireless Communications: Architecture and Opportunities. IEEE Commun. Mag. 2024, 88–94. [Google Scholar] [CrossRef]
  20. Wang, X.; Wan, L.; Huang, M.; Shen, C.; Zhang, K. Polarization Channel Estimation for Circular and Non-Circular Signals in Massive MIMO Systems. IEEE J. Sel. Top. Signal Process. 2019, 13, 1001–1016. [Google Scholar] [CrossRef]
  21. Zhang, R.; Tan, W.; Nie, W.; Wu, X.; Liu, T. Deep Learning-Based Channel Estimation for mmWave Massive MIMO Systems in Mixed-ADC Architecture. Sensors 2022, 22, 3938. [Google Scholar] [CrossRef] [PubMed]
  22. Mashhadi, M.B.; Gündüz, D. Pruning the Pilots: Deep Learning-Based Pilot Design and Channel Estimation for MIMO-OFDM Systems. IEEE Trans. Wirel. Commun. 2021, 20, 6315–6328. [Google Scholar] [CrossRef]
Figure 1. RIS-assisted mmWave MIMO backhaul system, where the BS communicates with multiple SCBSs via a passive RIS. Each SCBS serves a set of user equipments (UEs) within its local cell.
Figure 1. RIS-assisted mmWave MIMO backhaul system, where the BS communicates with multiple SCBSs via a passive RIS. Each SCBS serves a set of user equipments (UEs) within its local cell.
Entropy 28 00124 g001
Figure 2. Architecture of the proposed HASwinNet denoising network.
Figure 2. Architecture of the proposed HASwinNet denoising network.
Entropy 28 00124 g002
Figure 3. Denoising performance of different models under varying input SNRs with full pilot overhead. (Left): NMSE. (Right): BER. Each curve shows the mean of five runs.
Figure 3. Denoising performance of different models under varying input SNRs with full pilot overhead. (Left): NMSE. (Right): BER. Each curve shows the mean of five runs.
Entropy 28 00124 g003
Figure 4. Denoising performance of different models when only half of the pilot symbols are available. (Left): NMSE. (Right): BER. Each curve shows the mean of five runs.
Figure 4. Denoising performance of different models when only half of the pilot symbols are available. (Left): NMSE. (Right): BER. Each curve shows the mean of five runs.
Entropy 28 00124 g004
Figure 5. Performance comparison of HASwinNet under ideal continuous phase shifts versus practical discrete phase shifts (1-bit and 2-bit). (Left): NMSE. (Right): BER.
Figure 5. Performance comparison of HASwinNet under ideal continuous phase shifts versus practical discrete phase shifts (1-bit and 2-bit). (Left): NMSE. (Right): BER.
Entropy 28 00124 g005
Table 1. Tensor dimension tracking across HASwinNet processing stages. (Settings: N r = N t = 64 , patch size P = 4 , embedding dim C = 96 ).
Table 1. Tensor dimension tracking across HASwinNet processing stages. (Settings: N r = N t = 64 , patch size P = 4 , embedding dim C = 96 ).
Processing BlockTensor NotationShape
1. Input Stage
Pilot Observation Y r R 2 × N r × L 2 × 64 × L
Padding (if L < N t ) Y ˜ r R 2 × N r × N t 2 × 64 × 64
Patch Embedding X 0 R C × N r P × N t P 96 × 16 × 16
2. Encoder Backbone
Swin Encoder Output X enc R C × 16 × 16 96 × 16 × 16
3. Dual-Branch Processing
Token Flattening T R N × C 256 × 96
Sparse Selection T ^ R K × d 64 × 96
Spectral Branch F ten C C × 16 × 16 96 × 16 × 16
4. Decoder & Output
Branch Fusion H fusion R C × 16 × 16 96 × 16 × 16
Upsampling H up R C × N r × N t 96 × 64 × 64
Final Reconstruction H ^ r R 2 × N r × N t 2 × 64 × 64
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Han, X.; Tu, H.; Ying, J.; Chen, J.; Xing, Z. HASwinNet: A Swin Transformer-Based Denoising Framework with Hybrid Attention for mmWave MIMO Systems. Entropy 2026, 28, 124. https://doi.org/10.3390/e28010124

AMA Style

Han X, Tu H, Ying J, Chen J, Xing Z. HASwinNet: A Swin Transformer-Based Denoising Framework with Hybrid Attention for mmWave MIMO Systems. Entropy. 2026; 28(1):124. https://doi.org/10.3390/e28010124

Chicago/Turabian Style

Han, Xi, Houya Tu, Jiaxi Ying, Junqiao Chen, and Zhiqiang Xing. 2026. "HASwinNet: A Swin Transformer-Based Denoising Framework with Hybrid Attention for mmWave MIMO Systems" Entropy 28, no. 1: 124. https://doi.org/10.3390/e28010124

APA Style

Han, X., Tu, H., Ying, J., Chen, J., & Xing, Z. (2026). HASwinNet: A Swin Transformer-Based Denoising Framework with Hybrid Attention for mmWave MIMO Systems. Entropy, 28(1), 124. https://doi.org/10.3390/e28010124

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop