Article

A Convolutional-Transformer Residual Network for Channel Estimation in Intelligent Reflective Surface Aided MIMO Systems

by
Qingying Wu
,
Junqi Bao
,
Hui Xu
,
Benjamin K. Ng
*,
Chan-Tong Lam
and
Sio-Kei Im
Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, China
*
Author to whom correspondence should be addressed.
Sensors 2025, 25(19), 5959; https://doi.org/10.3390/s25195959
Submission received: 21 August 2025 / Revised: 15 September 2025 / Accepted: 23 September 2025 / Published: 25 September 2025
(This article belongs to the Section Communications)

Abstract

Intelligent Reflective Surface (IRS)-aided Multiple-Input Multiple-Output (MIMO) systems have emerged as a promising solution to enhance spectral and energy efficiency in future wireless communications. However, accurate channel estimation remains a key challenge due to the passive nature and high dimensionality of IRS channels. This paper proposes a lightweight hybrid framework for cascaded channel estimation by combining a physics-based Bilinear Alternating Least Squares (BALS) algorithm with a deep neural network named ConvTrans-ResNet. The network integrates convolutional embeddings and Transformer modules within a residual learning architecture to exploit both local and global spatial features effectively while ensuring training stability. A series of ablation studies is conducted to optimize architectural components, resulting in a compact configuration with low parameter count and computational complexity. Extensive simulations demonstrate that the proposed method significantly outperforms state-of-the-art neural models such as HA02, ReEsNet, and InterpResNet across a wide range of SNR levels and IRS element sizes in terms of the Normalized Mean Squared Error (NMSE). Compared to existing solutions, our method achieves better estimation accuracy with improved efficiency, making it suitable for practical deployment in IRS-aided systems.

1. Introduction

The Intelligent Reflective Surface (IRS) has emerged as a promising technology for future wireless networks due to its ability to reconfigure the wireless propagation environment in a programmable manner. An IRS consists of a large number of low-cost passive reflective elements, each of which can independently adjust the phase of incident electromagnetic waves [1]. This unique capability enables the IRS to enhance signal coverage, suppress interference, and improve both spectral and energy efficiency without requiring additional active transmitters or power sources [2,3,4]. When deployed in Multiple-Input Multiple-Output (MIMO) systems, an IRS can relieve Non-Line-of-Sight (NLoS) communication losses by establishing virtual links between the User Terminal (UT) and Base Station (BS) even in the presence of tall structures or other obstacles [5,6].
Despite its promising potential in wireless communication systems, the IRS presents several technical challenges. A major challenge is that the performance of IRS-aided systems critically depends on the availability of accurate Channel State Information (CSI) [7], which is essential for subsequent tasks such as beamforming, user scheduling, and rate adaptation. Unlike conventional MIMO systems that rely on active radio-frequency chains for signal transmission and reception, IRS elements are inherently passive and incapable of performing independent communication [8]. As a result, the cascaded BS-IRS-UT channel is high-dimensional and the pilot overhead scales with the number of IRS elements, creating a trade-off between estimation accuracy and signaling cost [9,10,11].
Beyond RIS-specific designs, the broader wireless communication community has explored various advanced signal processing and resource optimization methods to improve channel utilization, spectral efficiency, and energy harvesting capabilities. For example, faster-than-Nyquist-assisted simultaneous wireless information and power transfer-nonorthogonal multiple access schemes have been proposed to simultaneously enhance wireless efficiency and energy harvesting performance in IoT networks [12], while adaptive damping message-passing algorithms have been introduced for faster-than-Nyquist orthogonal time frequency space systems to improve robustness in high-mobility vehicle-to-everything communications [13]. Although these approaches are not directly applied to RIS scenarios, they represent important progress in non-deep learning-based channel processing and provide valuable insights for designing efficient algorithms in emerging wireless architectures. In the context of IRS-aided systems, the Bilinear Alternating Least Squares (BALS) method models the received signal via a parallel factor (PARAFAC) tensor decomposition to estimate the cascaded channel efficiently [14].
More recently, deep learning (DL)-based approaches have been applied to leverage channel reciprocity. By using uplink CSI to estimate downlink CSI, these approaches achieve notable gains in estimation accuracy [15,16,17,18]. Despite the progress made by prior works, existing methods still face several limitations. Many of these methods rely on deep structures with large parameter counts, resulting in increased computational burden and reduced suitability for real-time applications. Additionally, most existing designs emphasize either convolutional or attention-based mechanisms, without fully leveraging the complementary strengths of both. DL models also remain vulnerable to training instabilities such as gradient vanishing and model degradation, particularly when stacking multiple layers. These challenges motivate the development of a lightweight yet expressive channel estimation framework.
In this paper, we propose a Convolutional-Transformer Residual Network (ConvTrans-ResNet) for efficient channel estimation in RIS-aided MIMO systems. The model operates in two stages. First, a coarse but structured estimate of the cascaded channel is obtained using the BALS algorithm. This estimate is then denoised using the proposed method to capture both local and global spatial dependencies. Compared with existing methods, the proposed design ensures low complexity, stable training, and strong generalization across a wide range of scenarios. The main contributions of this work can be summarized as follows.
  • We propose a lightweight neural network for cascaded channel estimation in RIS-aided MIMO systems. The model denoises a coarse channel estimate obtained with the BALS algorithm so that both model-driven interpretability and data-driven representation are combined to enhance estimation accuracy.
  • To fully exploit the structured nature of RIS channels, the proposed network incorporates convolutional embedding blocks for extracting fine-grained local spatial features and Transformer modules for capturing long-range dependencies. This complementary design addresses the limitations of existing methods that rely solely on either convolutional or attention-based architectures.
  • To ensure training stability and alleviate gradient degradation, a residual learning framework is adopted. This design facilitates gradient flow, accelerates convergence, and improves robustness against noise, thereby enabling effective deep stacking without sacrificing accuracy.
  • To systematically investigate the impact of key architectural parameters and further identify an optimal configuration, a detailed ablation study is conducted. Simulation results also demonstrate that the proposed ConvTrans-ResNet consistently outperforms state-of-the-art approaches such as ReEsNet, InterpResNet, HA02, and BALS across a wide range of SNRs and IRS sizes, while significantly reducing parameter count and FLOPs, making it highly suitable for real-time and resource-constrained deployments.
The remainder of this paper is structured as follows. Recent works on channel estimations in RIS-aided systems are summarized in Section 2. In Section 3, we introduce the considered IRS-aided MIMO system. The existing BALS method is explained in Section 4. In Section 5, the proposed ConvTrans-ResNet is detailed. To evaluate the effectiveness of our proposed method, the simulation results and analysis are presented in Section 6. Finally, the paper is concluded in Section 7.
Notation: In this paper, the transpose and pseudo-inverse of a matrix $\mathbf{A}$ are denoted as $\mathbf{A}^T$ and $\mathbf{A}^{\dagger}$, respectively. In addition, $\diamond$ denotes the Khatri-Rao product and $\mathbf{I}_N$ denotes the $N \times N$ identity matrix.

2. Related Work

Numerous methods have been proposed in recent years to address the channel estimation challenge. To reduce pilot overhead, [19] has leveraged the observation that each user shares a uniform IRS-to-BS link, thereby significantly reducing the required pilot length. In [14], a tensor decomposition framework was introduced, linking IRS-aided MIMO communications with the PARAFAC model. By designing structured time-domain pilot and phase shift patterns, the received signal conforms to a PARAFAC tensor model. This formulation enables the use of the BALS algorithm for efficient cascaded channel estimation. As a physics-informed iterative method, BALS exploits the bilinear channel structure and is particularly well-suited to block-fading scenarios, where the channel remains constant over multiple training intervals.
Advances in DL have also contributed to channel estimation research. Tabassum et al. [20] designed recurrent neural network-based models that decode transmitted symbols directly from received OFDM signals, bypassing explicit CSI estimation in favor of end-to-end learning. Ye et al. [21] introduced a conditional generative adversarial network to directly estimate the channel from pilot sequences and observation signals in IRS-aided systems. In [18], to denoise channel estimation, a convolutional deep residual network was used in IRS multi-user communication systems. Similarly, a twin Convolutional Neural Network (CNN) architecture was proposed to estimate both direct (BS-UE) and cascaded (BS-IRS-UE) channels in millimeter-wave massive MIMO systems, as discussed in [22]. For the frequency division duplex-based Massive MIMO systems, Abdelmaksoud et al. [23] proposed a denoising gated recurrent unit with dropout-based CSI extraction, but this method does not leverage the inherent low-rank structure of IRS channels, with generalization capabilities limited. In contrast, Janawade et al. [24] employed reinforcement learning to directly optimize IRS phase configurations without estimating the channel. While efficient, this approach lacks flexibility for tasks requiring full CSI.
Despite their advantages, neural network-based solutions often face issues such as gradient vanishing and model degradation. To address these issues, Li et al. [25] presented the Residual Channel Estimation Network (ReEsNet), a residual CNN that offers improved efficiency and reduced complexity. An improved deep residual shrinkage network was also introduced in [15] to improve the pilot design by effectively reducing noise, making it advantageous in stable channel conditions. The authors in [16] used a residual U-shaped network and a deep compressed sensing (CS)-based channel estimation model to identify the cascaded channel matrix with minimal pilot overhead.
Some studies have shown that combining model-based signal processing with deep learning can significantly improve channel estimation accuracy and ensure robustness. For instance, researchers presented a hybrid IRS structure and DL-based CNN for sparse channel amplitude determination in [17]. However, the added complexity of the model may decrease channel estimation accuracy, making it less suitable for real-time applications. Luan and Thompson [26] proposed a hybrid encoder–decoder network (HA02) that uses a transformer encoder to extract salient features from least-square estimates, followed by a residual convolutional decoder for channel denoising in the orthogonal frequency-division multiplexing system. This design leverages the attention mechanism to selectively emphasize important components of the input, thereby enhancing performance and robustness. Similarly, Gu et al. [27] introduced ReEsNet and InterpResNet channel estimation networks built on preliminary BALS estimates of the cascaded channel. Their approach demonstrates that a data-driven network can effectively denoise coarse physics-based estimates and generalize across different numbers of IRS elements. These studies confirm the advantages of hybrid frameworks: The traditional algorithm provides domain-aware initialization, while the deep network performs non-linear denoising and structural interpolation, leading to improved robustness, better generalization, and reduced training complexity.
Several recent works have further advanced IRS-aided channel estimation. Chu et al. proposed an adaptive and robust framework for mmWave systems, where a parallel estimation strategy mitigates error propagation and improves robustness [28]. An automatic neural network construction method was introduced that uses neural architecture search to generate high-performance channel estimators tailored to specific propagation conditions in [29]. In [30], active IRS-aided IoT systems were investigated, where joint power optimization and deep learning techniques improve channel estimation under stringent power budgets. The study on joint location sensing and channel estimation for IRS-aided mmWave ISAC systems demonstrates that structured sparsity and Bayesian inference can support integrated communication and sensing [31]. The power measurement-based estimation scheme was also developed to reduce pilot overhead by leveraging received signal power and lightweight neural networks [32]. These recent contributions highlight the rapid progress in IRS-aided channel estimation, covering robust optimization, neural architecture design, and integration with IoT and ISAC applications.

3. System Model

As depicted in Figure 1, we consider a downlink communication scenario in an IRS-aided MIMO system. In this setup, the BS is equipped with M transmit antennas and communicates with a UT with L receive antennas. The communication link between the BS and the UT is established via an IRS, which consists of N passive elements arranged in a Uniform Planar Array (UPA). Each IRS element can independently adjust the phase of the incident signal, thereby reconfiguring the wireless propagation environment. These phase shifts are controlled through a centralized controller. Notably, we assume that the direct Line-of-Sight (LoS) path between the BS and the UT is completely blocked due to obstacles in the environment. Therefore, the reflected link via the IRS becomes the only feasible channel for signal transmission.
Assuming a block-fading environment with coherence time $T_c$, the received signal at time instant $t \in \{1, \ldots, T_c\}$ is modeled as
$$\mathbf{y}_t = \mathbf{H}_2 \left( \mathbf{s}_t \odot \mathbf{H}_1 \mathbf{x}_t \right) + \mathbf{n}_t,$$
where $\mathbf{H}_1 \in \mathbb{C}^{N \times M}$ and $\mathbf{H}_2 \in \mathbb{C}^{L \times N}$ denote the BS-IRS and the IRS-UT channel matrices, respectively. Both $\mathbf{H}_1$ and $\mathbf{H}_2$ are assumed to have independent and identically distributed zero-mean circularly symmetric complex Gaussian entries. The term $\mathbf{n}_t \in \mathbb{C}^{L \times 1}$ represents Additive White Gaussian Noise (AWGN), $\mathbf{x}_t \in \mathbb{C}^{M \times 1}$ denotes the transmitted pilot signal, and $\mathbf{s}_t = [s_{1,t} e^{j\phi_1}, \ldots, s_{N,t} e^{j\phi_N}]^T \in \mathbb{C}^{N \times 1}$ is the IRS configuration vector, where $s_{n,t} \in \{0, 1\}$ controls the on/off status of element $n$ at time $t$ and $\phi_n \in (0, 2\pi]$ denotes the phase shift applied by the $n$-th IRS element. For analytical convenience, the received signal can be alternatively expressed as [33]
$$\mathbf{y}_t = \mathbf{H}_2 \operatorname{diag}(\mathbf{s}_t) \mathbf{H}_1 \mathbf{x}_t + \mathbf{n}_t.$$
Let the total coherence interval be $T_c = KT$, where $K$ denotes the number of blocks and each block comprises $T$ time slots. As illustrated in Figure 2, a structured time-domain protocol is adopted such that the phase shift vector $\mathbf{s}_k$ remains fixed within each block and varies only across blocks. In contrast, the pilot signal sequence $\{\mathbf{x}_1, \ldots, \mathbf{x}_T\}$ is reused across all $K$ blocks. Accordingly, the received signal is re-indexed as $\mathbf{y}_{k,t} \triangleq \mathbf{y}_{(k-1)T+t}$, for $t = 1, \ldots, T$, $k = 1, \ldots, K$. Similarly, the pilot signal and phase shift vectors are defined as
$$\mathbf{x}_{k,t} = \mathbf{x}_t, \ \text{for } k = 1, \ldots, K, \qquad \mathbf{s}_{k,t} = \mathbf{s}_k, \ \text{for } t = 1, \ldots, T.$$
Thus, the received signal can be simplified to
$$\mathbf{y}_{k,t} = \mathbf{H}_2 \operatorname{diag}(\mathbf{s}_k) \mathbf{H}_1 \mathbf{x}_t + \mathbf{n}_{k,t}.$$
By aggregating the received signals over $T$ time slots into a matrix $\mathbf{Y}_k = [\mathbf{y}_{k,1}, \ldots, \mathbf{y}_{k,T}]$, the block-wise received signal can be written as
$$\mathbf{Y}_k = \mathbf{H}_2 D_k(\mathbf{S}) \mathbf{H}_1 \mathbf{X}^T + \mathbf{N}_k,$$
where $\mathbf{X} \triangleq [\mathbf{x}_1, \ldots, \mathbf{x}_T]^T \in \mathbb{C}^{T \times M}$ is the stacked pilot matrix and $\mathbf{N}_k \triangleq [\mathbf{n}_{k,1}, \ldots, \mathbf{n}_{k,T}]$ denotes the corresponding Additive White Gaussian Noise (AWGN) matrix. $D_k(\mathbf{S}) = \operatorname{diag}(\mathbf{s}_k)$ represents a diagonal matrix holding the $k$-th row of the IRS phase shift matrix $\mathbf{S}$ on its main diagonal, where $\mathbf{S} = [\mathbf{s}_1, \ldots, \mathbf{s}_K]^T \in \mathbb{C}^{K \times N}$ stacks all IRS configurations over blocks.
To further facilitate structured modeling, the noiseless received signal is defined as
$$\bar{\mathbf{Y}}_k = \mathbf{H}_2 D_k(\mathbf{S}) \mathbf{Z}^T,$$
where $\mathbf{Z} = \mathbf{X}\mathbf{H}_1^T \in \mathbb{C}^{T \times N}$ and $\bar{\mathbf{Y}}_k \in \mathbb{C}^{L \times T}$ is the $k$-th frontal matrix slice of a three-way tensor $\bar{\mathcal{Y}} \in \mathbb{C}^{L \times T \times K}$ that follows a PARAFAC decomposition [34]. Its $(l, t, k)$-th entry is expressed as
$$[\bar{\mathcal{Y}}]_{l,t,k} = \sum_{n=1}^{N} g_{l,n} z_{t,n} s_{k,n},$$
where $g_{l,n} \triangleq [\mathbf{H}_2]_{l,n}$, $z_{t,n} \triangleq [\mathbf{Z}]_{t,n}$, and $s_{k,n} \triangleq [\mathbf{S}]_{k,n}$. This naturally leads to a Canonical Polyadic (CP) decomposition [34,35,36,37,38]: $\bar{\mathcal{Y}} = [\![\mathbf{H}_2, \mathbf{Z}, \mathbf{S}]\!] \in \mathbb{C}^{L \times T \times K}$. The tensor unfoldings along the three modes are given by [34,35]
$$\bar{\mathbf{Y}}_{(1)} = \mathbf{H}_2 (\mathbf{S} \diamond \mathbf{Z})^T \in \mathbb{C}^{L \times TK},$$
$$\bar{\mathbf{Y}}_{(2)} = \mathbf{Z} (\mathbf{S} \diamond \mathbf{H}_2)^T \in \mathbb{C}^{T \times LK},$$
$$\bar{\mathbf{Y}}_{(3)} = \mathbf{S} (\mathbf{Z} \diamond \mathbf{H}_2)^T \in \mathbb{C}^{K \times LT},$$
where $\diamond$ denotes the Khatri-Rao product and $\bar{\mathbf{Y}}_{(n)}$ is the mode-$n$ unfolding of the tensor $\bar{\mathcal{Y}}$. Specifically, $\bar{\mathbf{Y}}_{(1)} \triangleq [\bar{\mathbf{Y}}_1, \ldots, \bar{\mathbf{Y}}_K]$, $\bar{\mathbf{Y}}_{(2)} \triangleq [\bar{\mathbf{Y}}_1^T, \ldots, \bar{\mathbf{Y}}_K^T]$, and $\bar{\mathbf{Y}}_{(3)} \triangleq [\operatorname{vec}(\bar{\mathbf{Y}}_1), \ldots, \operatorname{vec}(\bar{\mathbf{Y}}_K)]^T$.
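For readers who want to check the tensor bookkeeping, the following NumPy sketch (with small, arbitrary dimensions rather than those of Table 1) generates random channels, builds the noiseless frontal slices $\bar{\mathbf{Y}}_k = \mathbf{H}_2 D_k(\mathbf{S}) \mathbf{Z}^T$, and verifies that stacking them reproduces the mode-1 unfolding $\mathbf{H}_2(\mathbf{S} \diamond \mathbf{Z})^T$; scipy.linalg.khatri_rao supplies the Khatri-Rao product.

```python
# Minimal NumPy sketch (not the authors' code): checks that the block-wise model
# Y_k = H2 diag(s_k) Z^T matches the mode-1 unfolding Y(1) = H2 (S ⋄ Z)^T.
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(0)
M, L, N, T, K = 8, 4, 16, 16, 16       # BS antennas, UT antennas, IRS elements, slots, blocks

crandn = lambda *s: (rng.standard_normal(s) + 1j * rng.standard_normal(s)) / np.sqrt(2)
H1, H2 = crandn(N, M), crandn(L, N)    # BS-IRS and IRS-UT channels
X = np.fft.fft(np.eye(T))[:, :M] / np.sqrt(T)   # truncated DFT pilots, X^H X = I_M
S = np.fft.fft(np.eye(K))[:, :N] / np.sqrt(K)   # truncated DFT phase shifts, S^H S = I_N

Z = X @ H1.T                           # Z = X H1^T, shape (T, N)
# Build the noiseless tensor slice by slice: Y_k = H2 diag(s_k) Z^T
slices = [H2 @ np.diag(S[k]) @ Z.T for k in range(K)]
Y1_from_slices = np.concatenate(slices, axis=1)         # [Y_1, ..., Y_K], shape (L, T*K)

Y1_from_unfolding = H2 @ khatri_rao(S, Z).T             # H2 (S ⋄ Z)^T
print(np.allclose(Y1_from_slices, Y1_from_unfolding))   # True
```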

4. Traditional BALS-Based Channel Estimation

To obtain an initial estimate of the cascaded channel, the BALS method has been proposed by exploiting the PARAFAC structure inherent in the received signal tensor. Let $\mathcal{Y} \triangleq \bar{\mathcal{Y}} + \mathcal{N}$ denote the observed signal tensor, where $\bar{\mathcal{Y}}$ is the noiseless signal component and $\mathcal{N} \in \mathbb{C}^{L \times T \times K}$ is the additive noise tensor with independent and identically distributed complex Gaussian entries. Likewise, we can define $\mathbf{Y}_{(n)} \triangleq \bar{\mathbf{Y}}_{(n)} + \mathbf{N}_{(n)}$, $n = 1, 2, 3$, as the noisy versions of the mode-1, mode-2, and mode-3 matrix unfoldings in Equations (8)-(10), where $\mathbf{N}_{(n)}$ are the corresponding noise matrix unfoldings.
The tensor decomposition begins with the three mode-$n$ unfoldings $\mathbf{Y}_{(n)}$ of $\mathcal{Y}$, which correspond to different dimensions of the tensor. Specifically, we have $\mathbf{Y}_{(1)} \in \mathbb{C}^{L \times TK}$, $\mathbf{Y}_{(2)} \in \mathbb{C}^{T \times LK}$, and $\mathbf{Y}_{(3)} \in \mathbb{C}^{K \times LT}$. Among these, the first two unfoldings are primarily used to recover the unknown BS-IRS channel matrix $\mathbf{H}_1$ and IRS-UT channel matrix $\mathbf{H}_2$. The pilot matrix $\mathbf{X}$ and IRS phase configuration matrix $\mathbf{S}$ are assumed to be truncated discrete Fourier transform (DFT) matrices, satisfying the orthonormality conditions $\mathbf{X}^H \mathbf{X} = \mathbf{I}_M$ and $\mathbf{S}^H \mathbf{S} = \mathbf{I}_N$.
The BALS algorithm alternates between updating $\mathbf{H}_2$ and $\mathbf{H}_1$ by minimizing the following Frobenius-norm-based cost functions:
$$\hat{\mathbf{H}}_1 = \arg\min_{\mathbf{H}_1} \left\| \mathbf{Y}_{(2)} - \mathbf{X}\mathbf{H}_1^T (\mathbf{S} \diamond \mathbf{H}_2)^T \right\|_F^2, \qquad \hat{\mathbf{H}}_2 = \arg\min_{\mathbf{H}_2} \left\| \mathbf{Y}_{(1)} - \mathbf{H}_2 (\mathbf{S} \diamond \mathbf{Z})^T \right\|_F^2,$$
where $\mathbf{Z} = \mathbf{X}\mathbf{H}_1^T$. These subproblems admit closed-form solutions based on Moore-Penrose pseudoinverses:
$$\hat{\mathbf{H}}_1^T = \mathbf{X}^{\dagger} \mathbf{Y}_{(2)} \left[ (\mathbf{S} \diamond \mathbf{H}_2)^T \right]^{\dagger}, \qquad \hat{\mathbf{H}}_2 = \mathbf{Y}_{(1)} \left[ (\mathbf{S} \diamond \mathbf{Z})^T \right]^{\dagger}.$$
Since both X and S are orthonormal by design, their pseudoinverses can be efficiently computed and, in some cases, replaced by their Hermitian transposes. This significantly reduces the computational complexity of the BALS updates. The detailed procedure of the BALS algorithm is outlined in Algorithm 1.
Algorithm 1 Bilinear Alternating Least Squares (BALS)
1: Initialization: Set the iteration index $i = 0$ and randomly initialize $\hat{\mathbf{H}}_1^{(0)}$.
2: repeat
3:    $i \leftarrow i + 1$;
4:    Update the least-squares estimate of $\mathbf{H}_2$ as $\hat{\mathbf{H}}_2^{(i)} = \mathbf{Y}_{(1)} \left[ \left( \mathbf{S} \diamond \mathbf{X}\hat{\mathbf{H}}_1^{T(i-1)} \right)^T \right]^{\dagger}$
5:    Update the least-squares estimate of $\mathbf{H}_1$ as $\hat{\mathbf{H}}_1^{T(i)} = \mathbf{X}^{\dagger} \mathbf{Y}_{(2)} \left[ \left( \mathbf{S} \diamond \hat{\mathbf{H}}_2^{(i)} \right)^T \right]^{\dagger}$
6: until convergence.
At each iteration $i$, the reconstruction error is evaluated as $e^{(i)} = \| \mathcal{Y} - \hat{\mathcal{Y}}^{(i)} \|_F^2$, and the procedure terminates when the error falls below a predefined threshold of $10^{-6}$. Once the iterative process converges, the preliminary estimate of the cascaded channel $\hat{\mathbf{H}}_c = \hat{\mathbf{H}}_2 \hat{\mathbf{H}}_1^T$ can be obtained. In this study, this estimate serves as the initial channel representation for further denoising in the IRS-aided MIMO system.
It is important to note that although BALS alternately updates $\mathbf{H}_1$ and $\mathbf{H}_2$, both sub-channels are refined within the same iterative optimization loop. Therefore, BALS does not incur irreversible stage-to-stage error propagation. Any bias introduced in one update can be corrected in subsequent iterations, and the final cascaded channel estimate $\hat{\mathbf{H}}_c$ can serve as a stable initialization for the proposed denoising network.
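A minimal NumPy sketch of Algorithm 1 is shown below. It is illustrative only (not the authors' implementation), continues the variables (H1, H2, X, S, slices, crandn) from the sketch in Section 3, and operates on noiseless observations, so the loop typically converges within a few iterations.

```python
# Illustrative BALS loop (Algorithm 1); continues the NumPy sketch from Section 3.
import numpy as np
from numpy.linalg import pinv
from scipy.linalg import khatri_rao

Y1 = np.concatenate(slices, axis=1)                    # mode-1 unfolding, (L, T*K)
Y2 = np.concatenate([Yk.T for Yk in slices], axis=1)   # mode-2 unfolding, (T, L*K)

H1_hat = crandn(N, M)                                  # random initialization of H1
for i in range(100):
    Z_hat = X @ H1_hat.T                               # Z = X H1^T with the current estimate
    H2_hat = Y1 @ pinv(khatri_rao(S, Z_hat).T)         # step 4: LS update of H2
    H1_hat = (pinv(X) @ Y2 @ pinv(khatri_rao(S, H2_hat).T)).T   # step 5: LS update of H1
    # reconstruction error (Frobenius norms are preserved under unfolding)
    err = np.linalg.norm(Y1 - H2_hat @ khatri_rao(S, X @ H1_hat.T).T) ** 2
    if err < 1e-6:                                     # convergence threshold from the paper
        break

print(f"stopped after {i + 1} iterations, residual = {err:.2e}")
# The factors are only identified up to a per-column complex scaling, but the
# effective reflected channel H2 diag(s_k) H1 is invariant to that ambiguity:
print(np.linalg.norm(H2_hat @ np.diag(S[0]) @ H1_hat - H2 @ np.diag(S[0]) @ H1))
```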

5. Proposed Method

5.1. Overview

Unlike conventional methods that primarily utilize stacked convolutional blocks to model local features, our approach enhances estimation accuracy by incorporating global spatial correlations through attention mechanisms. Specifically, we propose a lightweight Transformer-based network for IRS-aided MIMO systems, which is capable of preserving structural characteristics in complex propagation scenarios. The proposed method directly targets the cascaded channel estimate $\hat{\mathbf{H}}_c$, which is inherently free from error propagation between sub-channel estimates.
As illustrated in Figure 3, the proposed ConvTrans-ResNet consists of three main components: a convolutional embedding (ConvEmbed) module, a residual Transformer encoder composed of multiple stacked attention blocks, and a convolutional output (ConvOut) module. The model denoises the coarse channel estimate $\hat{\mathbf{H}}_c$ by transforming it through convolutional encoding, self-attention-based denoising, and reconstruction. This architecture is designed to capture both local and global spatial dependencies while maintaining low complexity and stable training dynamics.

5.2. ConvEmbed Module

In our design, the ConvEmbed module serves as the feature extractor to capture local spatial patterns from the input channel tensor $\mathbf{X} \in \mathbb{R}^{M \times L \times 2}$. It consists of two sequential $3 \times 3$ convolutional layers, each followed by a ReLU activation. Apart from projecting the input into a higher-dimensional feature space of size $D_E$, the module enhances the expressiveness of local features while preserving fine-grained spatial resolution. The output feature map $\mathbf{F} \in \mathbb{R}^{M \times L \times D_E}$ is then reshaped into a sequence representation so that it is compatible with the Transformer module.
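A PyTorch sketch of one plausible ConvEmbed implementation follows. The kernel sizes and ReLU activations match the description above, whereas the embedding width D_E = 16 and the token layout (treating the M x L spatial positions as tokens of dimension D_E, consistent with the projection matrices defined in Section 5.3) are illustrative assumptions rather than the authors' exact configuration.

```python
# A plausible ConvEmbed sketch (PyTorch); D_E = 16 and the token layout are assumptions.
import torch
import torch.nn as nn

class ConvEmbed(nn.Module):
    def __init__(self, d_embed: int = 16):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(2, d_embed, kernel_size=3, padding=1),   # real/imag -> D_E feature maps
            nn.ReLU(),
            nn.Conv2d(d_embed, d_embed, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2, M, L) coarse BALS estimate with real/imag parts stacked
        f = self.layers(x)                      # (batch, D_E, M, L)
        return f.flatten(2).transpose(1, 2)     # (batch, M*L, D_E) token sequence

tokens = ConvEmbed()(torch.randn(1, 2, 64, 4))  # e.g. M = 64, L = 4
print(tokens.shape)                             # torch.Size([1, 256, 16])
```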

5.3. Transformer Module with Residual Connections

To capture both fine-grained and long-range dependencies, the sequence output from the ConvEmbed module is processed by $n_L$ stacked Transformer encoder blocks. Each block consists of a Multi-Head Self-Attention (MHSA) mechanism and a feed-forward multilayer perceptron (MLP), both equipped with residual connections.
Given an input sequence $\mathbf{S} \in \mathbb{R}^{M \times L \times D_E}$, the scaled dot-product attention for each head is defined as
$$\operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \operatorname{softmax}\left( \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} \right) \mathbf{V},$$
where the query ($\mathbf{Q}$), key ($\mathbf{K}$), and value ($\mathbf{V}$) matrices are computed via
$$\mathbf{Q} = \mathbf{S}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{S}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{S}\mathbf{W}_V,$$
with $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{D_E \times d_k}$ denoting learnable projection matrices and $d_k = D_E / n_H$. The outputs from all heads are concatenated and linearly projected as
$$\operatorname{MHSA}(\mathbf{S}) = \operatorname{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_{n_H}) \mathbf{W}_O,$$
where $\mathbf{W}_O \in \mathbb{R}^{D_E \times D_E}$ is a trainable projection matrix. With multiple attention heads, complementary perspectives on the channel representation can be provided. Each head learns a different projection of the input features, which enables the model to capture diverse types of spatial dependencies. For instance, some heads may emphasize broad global correlations across the entire channel matrix, while others focus on localized structures associated with dominant paths or clusters. By jointly aggregating these heterogeneous patterns, the multi-head mechanism enriches the representation capacity of the Transformer module to help denoise coarse channel matrices.
Each Transformer block also includes an MLP sublayer composed of two linear layers and a GELU activation, which can be expressed as
$$\operatorname{MLP}(\mathbf{x}) = \mathbf{W}_2 \left( \operatorname{GELU}(\mathbf{W}_1 \mathbf{x}) \right) + \mathbf{b}_2,$$
where $\mathbf{W}_1 \in \mathbb{R}^{D_E \times r D_E}$ and $\mathbf{W}_2 \in \mathbb{R}^{r D_E \times D_E}$, with $r$ denoting the MLP expansion ratio.
Residual connections are applied after both the self-attention and MLP sublayers, followed by layer normalization. These residual paths play a crucial role in maintaining information flow and training stability, particularly in deeper Transformer architectures. Specifically, the residual connection after the self-attention module allows the network to retain the original input features, which is beneficial when the attention weights are sparse or uncertain. This ensures that the model does not lose critical positional or contextual information. The residual link after the MLP block helps mitigate the problem of vanishing gradients by providing a direct path for gradient backpropagation. It also prevents the degradation of learned representations during deep stacking, thereby enabling more effective convergence and better generalization. Together, these mechanisms ensure that the Transformer module can robustly model both fine-grained and global channel structures in RIS-aided MIMO systems.
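The PyTorch sketch below shows one Transformer encoder block consistent with this description (MHSA and MLP sublayers, each wrapped by a residual connection followed by layer normalization). The post-norm ordering and the use of nn.MultiheadAttention are assumptions; the head count n_H = 2 and MLP ratio r = 1 follow the configuration selected in Section 6.

```python
# Sketch of one Transformer encoder block with residual connections and
# post-sublayer normalization (an assumed ordering); D_E matches the ConvEmbed sketch.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_embed: int = 16, n_heads: int = 2, mlp_ratio: int = 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_embed, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_embed)
        self.mlp = nn.Sequential(
            nn.Linear(d_embed, mlp_ratio * d_embed),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_embed, d_embed),
        )
        self.norm2 = nn.LayerNorm(d_embed)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (batch, M*L, D_E) token sequence from ConvEmbed
        attn_out, _ = self.attn(s, s, s)   # multi-head self-attention
        s = self.norm1(s + attn_out)       # residual connection + layer norm
        s = self.norm2(s + self.mlp(s))    # residual connection + layer norm
        return s

encoder = nn.Sequential(*[TransformerBlock() for _ in range(2)])  # n_L = 2 stacked blocks
print(encoder(torch.randn(1, 256, 16)).shape)                     # torch.Size([1, 256, 16])
```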

5.4. ConvOut Module

After attention processing, the feature sequence is reshaped back to the spatial format $\mathbb{R}^{M \times L \times D_E}$. The ConvOut module then denoises this representation through two successive $3 \times 3$ convolutions with ReLU activations. Beyond reducing the feature dimension to match the channel output format, this module also fuses the contextual information aggregated by the preceding Transformer module. Specifically, by applying convolutions over the spatial dimensions, the ConvOut module enhances local consistency and ensures that the final output retains high-resolution structural details. The last convolutional layer projects the feature map to a two-channel output, corresponding to the real and imaginary parts of the estimated cascaded channel. This makes the final output compatible with the ground-truth channel format used for supervision during training.
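Putting the pieces together, the sketch below assembles ConvOut and the full ConvTrans-ResNet pipeline, reusing the ConvEmbed and TransformerBlock classes sketched earlier. Reading the text as two convolutions with the second producing the two-channel output is an assumption, as is the absence of any additional skip connections that may appear in Figure 3.

```python
# Illustrative ConvOut module and assembled ConvTrans-ResNet (not the authors' code);
# reuses the ConvEmbed and TransformerBlock classes defined above.
import torch
import torch.nn as nn

class ConvOut(nn.Module):
    def __init__(self, d_embed: int = 16):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(d_embed, d_embed, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_embed, 2, kernel_size=3, padding=1),   # project back to real/imag channels
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, D_E, M, L) spatial feature map recovered from the token sequence
        return self.layers(f)

class ConvTransResNet(nn.Module):
    def __init__(self, d_embed: int = 16, n_blocks: int = 2):
        super().__init__()
        self.embed = ConvEmbed(d_embed)
        self.encoder = nn.Sequential(*[TransformerBlock(d_embed) for _ in range(n_blocks)])
        self.out = ConvOut(d_embed)

    def forward(self, h_coarse: torch.Tensor) -> torch.Tensor:
        b, _, m, l = h_coarse.shape
        tokens = self.encoder(self.embed(h_coarse))           # (b, M*L, D_E)
        fmap = tokens.transpose(1, 2).reshape(b, -1, m, l)    # back to (b, D_E, M, L)
        return self.out(fmap)                                 # denoised estimate, (b, 2, M, L)

h_denoised = ConvTransResNet()(torch.randn(1, 2, 64, 4))      # coarse BALS estimate as input
print(h_denoised.shape)                                       # torch.Size([1, 2, 64, 4])
```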

6. Experimental Results

To comprehensively assess the performance of the proposed ConvTrans-ResNet model, we conduct comparative experiments against several state-of-the-art channel estimation methods, including ReEsNet [27], InterpResNet [27], HA02 [26], and the baseline BALS algorithm [14]. After detailing the simulation setup, experimental results in terms of estimation accuracy are presented. Furthermore, we analyze the contributions of different architectural components through ablation studies and assess the computational efficiency of the proposed method to ensure its practicality in real-world deployment scenarios.

6.1. Implementation Details

Experiments are conducted on a 13th Gen Intel(R) Core(TM) i7-13700KF CPU (24 logical CPUs, 32 GB RAM; Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3080 GPU with 10 GB VRAM (NVIDIA Corporation, Santa Clara, CA, USA).

6.1.1. Dataset

The dataset is generated based on a block-fading channel model, where the signal-to-noise ratio (SNR) varies from 0 dB to 30 dB in increments of 5 dB. For each SNR level, 5000 independent channel realizations are simulated, i.e., a total of 35,000 samples. Among them, 95% are used for training and the remaining 5% for testing. The simulation parameters are summarized in Table 1. All models are trained and evaluated on the same dataset to ensure a fair and consistent comparison.

6.1.2. Parameters

The mean squared error is used as the loss function for training ReEsNet, InterpResNet, and the proposed method. HA02 employs the Huber loss, which is defined as
$$L_{\delta}(a) = \begin{cases} \frac{1}{2} a^2, & \text{if } |a| \leq \delta, \\ \delta \left( |a| - \frac{1}{2}\delta \right), & \text{otherwise}. \end{cases}$$
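For reference, PyTorch's built-in losses match these definitions; the delta value below is an illustrative assumption, since the text does not state the threshold used for HA02.

```python
import torch.nn as nn

mse_loss = nn.MSELoss()                # used for ReEsNet, InterpResNet, and the proposed method
huber_loss = nn.HuberLoss(delta=1.0)   # used for HA02; delta = 1.0 is assumed here
```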
The hyperparameter settings for each method are summarized in Table 2.

6.1.3. Performance Metric

To quantify estimation performance, we adopt the Normalized Mean Squared Error (NMSE) as the performance metric, which is defined as
$$\mathrm{NMSE} = \mathbb{E}\left\{ \frac{\left\| \hat{\mathbf{H}}_c^{\mathrm{DNN}} - \mathbf{H}_c \right\|_2^2}{\left\| \mathbf{H}_c \right\|_2^2} \right\},$$
where $\mathbf{H}_c$ and $\hat{\mathbf{H}}_c^{\mathrm{DNN}}$ denote the ground truth and the estimated cascaded channel matrices, respectively. A lower NMSE indicates higher channel estimation accuracy.
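A small helper in the spirit of this metric might look as follows (a sketch; averaging the per-sample normalized error over a test batch is an assumption about how the expectation is approximated).

```python
# Sketch of an NMSE computation: per-sample normalized squared error, averaged over a batch.
import numpy as np

def nmse(h_est: np.ndarray, h_true: np.ndarray) -> float:
    """h_est, h_true: complex arrays of shape (batch, M, L)."""
    err = np.sum(np.abs(h_est - h_true) ** 2, axis=(1, 2))
    ref = np.sum(np.abs(h_true) ** 2, axis=(1, 2))
    return float(np.mean(err / ref))
```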

6.2. Effectiveness Validation

To investigate the convergence of different models, Figure 4 presents the NMSE curves during training with 25 IRS elements at 0 dB. The proposed ConvTrans-ResNet converges smoothly and maintains the lowest NMSE throughout the entire training process. InterpResNet and ReEsNet also converge within 40 epochs, but their final NMSE values remain higher than that of the proposed method. In contrast, HA02 converges quickly in the early stage but yields the worst estimation accuracy.
Figure 5 illustrates the NMSE performance of different channel estimation methods with various IRS elements under SNRs from 0 dB to 30 dB. It can be observed that all DL-based models significantly outperform the traditional BALS algorithm at lower SNR ranges, which validates the effectiveness of data-driven approaches in modeling complex channel matrices. HA02, which adopts a pure Transformer encoder-decoder architecture without convolutional enhancement, exhibits the highest NMSE among DL-based models. This suggests that the lack of local context modeling limits the performance of attention-based designs. ReEsNet and InterpResNet show competitive performance but suffer from performance degradation at low SNRs. This is likely due to their limited receptive fields, which hinder the modeling of long-range spatial dependencies. Among these methods, the proposed ConvTrans-ResNet consistently achieves the lowest NMSE across all SNR levels, particularly under high-SNR conditions (above 20 dB). Its advantage over ReEsNet and InterpResNet further confirms that the superior denoising capability and estimation accuracy stem from combining convolutional feature extraction with attention-based global modeling.
Figure 6 presents the NMSE results under different numbers of IRS elements $N \in \{25, 49, 81, 100, 144\}$. The proposed method still performs best across all configurations, indicating its scalability and robustness with an increasing number of IRS elements. As $N$ increases, ReEsNet and InterpResNet exhibit slight performance degradation, which reveals their limited generalization to large-scale IRS deployments. The worst performance of HA02 shows its inadequate adaptability. It is worth noting that while BALS maintains a consistent NMSE trend, it remains significantly less accurate than learning-based methods. As a lightweight and interpretable solution, BALS is constrained by its reliance on linear algebraic structure, which highlights the necessity of learning-based denoising.
To further assess generalizability under realistic propagation, we adopt the 3GPP TR 38.901 channel model (urban microcell, NLOS) to generate the BS–IRS and IRS–UE links with pathloss, shadowing, angle spreads, and spatial correlation. As illustrated in Figure 7, the NMSE performance of the proposed ConvTrans-ResNet consistently outperforms BALS across different numbers of RIS elements and various SNR levels under the practical 3GPP channel models. Therefore, the robustness and generalization of the proposed method for real-world IRS-aided MIMO systems can be confirmed.
Generally, the proposed ConvTrans-ResNet consistently outperforms existing methods across a wide range of SNR conditions and IRS configurations. To achieve deeper insight into the contribution of each architectural component, we conduct a comprehensive ablation study by varying key structural parameters. The results are shown in Figure 8.
As shown in Figure 8a, increasing the number of convolutional embedding (ConvEmbed) blocks from 8 to 32 leads to gradual performance improvement, especially under high-SNR scenarios (above 20 dB). Deeper ConvEmbed configurations enhance the model's ability to extract local spatial features. However, the gain from using 32 blocks over 16 is marginal and comes at a higher computational cost. Hence, selecting 16 ConvEmbed blocks is a balanced choice for local feature learning with reasonable efficiency.
In Figure 8b, we investigate the effect of stacking one, two, and four Transformer blocks. Compared to the single Transformer block, using two Transformer blocks yields a more substantial gain by better modeling global dependencies. However, further increasing the number to four leads to diminishing returns and slight degradation at low SNRs, which may be due to overfitting or gradient instability. Therefore, two Transformer blocks are adopted to balance expressiveness and stability.
Figure 8c compares models with varying numbers of attention heads $n_H \in \{1, 2, 4, 8\}$. A single head results in higher NMSE, suggesting that insufficient attention diversity hinders the model's ability to learn from heterogeneous spatial patterns. Multiple heads ($n_H = 2, 4, 8$) provide better estimation accuracy. However, the performance gain between two and eight heads is negligible, while the computational cost increases linearly with $n_H$. Thus, $n_H = 2$ offers the best trade-off between performance and efficiency.
Lastly, Figure 8d analyzes the effect of MLP expansion ratios (1 and 2) in the Transformer feed-forward module. A ratio of 2 marginally improves NMSE at medium-to-high SNRs by increasing hidden dimensionality. Nonetheless, the improvement is minor, and a larger MLP ratio introduces additional parameters. As a result, we choose a ratio of 1, which offers sufficient capacity while maintaining the lightweight nature of the model.
The advantage of multiple attention heads is that the diversity among heads usually provides complementary perspectives so that a richer representation of the cascaded channel can be learned. The representative attention maps from the Transformer blocks are visualized in Figure 9. Each map shows the learned attention weights between query and key positions for a given sample. It can be observed that the attention mechanism assigns higher weights to a subset of elements while suppressing irrelevant components, which indicates that it selectively captures dominant channel features. It can be confirmed that the attention module effectively models global spatial dependencies and complements the local feature extraction provided by convolutional layers.
In summary, the final architecture is configured with 16 ConvEmbed blocks, two Transformer blocks, two attention heads, and an MLP expansion ratio of 1. This design achieves a favorable balance between performance, computational complexity, and generalization ability, making it well suited for practical deployment in IRS-aided MIMO systems.

6.3. Computational Complexity

The computational complexity of the various methods is assessed in three aspects: parameter size, floating-point operations (FLOPs), and per-inference computation time. The comparison is illustrated in Figure 10.
As shown in Figure 10a, our proposed model contains significantly fewer parameters than all other methods. This compact architecture reduces memory consumption and makes the model more suitable for resource-constrained devices. Figure 10b demonstrates that the proposed model achieves the lowest FLOPs, requiring roughly one-fifth the FLOPs of ReEsNet and InterpResNet and one-tenth those of HA02. Such a low computational burden highlights the efficiency of our design and confirms its potential for deployment in real-time systems. Although Figure 10c indicates that the inference time of the proposed model is slightly higher than that of ReEsNet and InterpResNet, the model remains substantially faster than HA02. This is reasonable since FLOPs and latency are not perfectly correlated: operations such as multi-head attention and normalization are lightweight in FLOPs but may incur additional memory access and kernel launch overheads. Nevertheless, the proposed model achieves a balanced trade-off by maintaining low overall complexity while incorporating Transformer blocks to capture global dependencies. This modest overhead translates into significant improvements in estimation accuracy, as demonstrated in the previous subsections.
Overall, the proposed ConvTrans-ResNet achieves a balance between model size, computational cost, and estimation accuracy. The lightweight design enables fast execution and easy deployment while guaranteeing the robustness and generalizability of accurate channel estimation through attention-based denoising.

7. Conclusions and Future Works

In this paper, we propose ConvTrans-ResNet, a lightweight and effective neural network architecture for channel estimation in IRS-aided MIMO systems. By integrating convolutional embeddings and Transformer modules, the model effectively captures both local and global spatial structures within the channel matrix. Furthermore, a residual learning framework is incorporated to enhance training stability and convergence. The proposed network operates on a coarse channel estimate provided by the BALS algorithm. Such a hybrid design benefits from the incorporation of domain knowledge while reducing the overall learning complexity. Extensive experiments were conducted to evaluate the proposed method against state-of-the-art approaches, including HA02, ReEsNet, and InterpResNet. The results demonstrate that ConvTrans-ResNet consistently achieves lower NMSE across a wide range of SNR conditions and IRS configurations, while also maintaining the fewest parameters, the least computational overhead, and a fast inference time.
While the present work assumes block-fading channels, extending the framework to time-varying or mobile scenarios is an important direction. Moreover, applying the framework to multi-user IRS scenarios and jointly optimizing the IRS phase shifts may further improve system-level performance. Since the proposed method is essentially a supervised learning approach, it can be naturally adapted to such cases once reliable ground-truth channel data become available. Possible extensions include modeling channel dynamics caused by user mobility, Doppler shifts, and temporal fading, as well as incorporating temporal modeling modules to exploit correlations across time. In addition, although computational efficiency has been analyzed in terms of FLOPs and inference latency, future work will also include energy profiling and embedded-device evaluation, which are particularly relevant for IoT and mobile applications. Standard energy-related measures, such as joules per inference and energy per FLOP, will be adopted to provide a fair and comparable assessment of efficiency across different models and hardware platforms.

Author Contributions

Conceptualization, Q.W. and B.K.N.; methodology, Q.W. and J.B.; software, Q.W.; validation, J.B. and H.X.; formal analysis, Q.W. and B.K.N.; writing—original draft preparation, Q.W.; writing—review and editing, J.B., H.X. and B.K.N.; supervision, C.-T.L.; project administration, B.K.N., C.-T.L. and S.-K.I.; funding acquisition, B.K.N., C.-T.L. and S.-K.I. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by The Science and Technology Development Fund, Macau SAR (File no. 0044/2022/A1).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
IRS: Intelligent Reflective Surface
MIMO: Multiple-Input Multiple-Output
CSI: Channel State Information
DL: Deep Learning
BALS: Bilinear Alternating Least Squares
BS: Base Station
UT: User Terminal
MHSA: Multi-Head Self-Attention
MLP: Multilayer Perceptron
NMSE: Normalized Mean Squared Error

References

  1. Kang, Z.; You, C.; Zhang, R. Active-passive IRS aided wireless communication: New hybrid architecture and elements allocation optimization. IEEE Trans. Wirel. Commun. 2023, 23, 3450–3464. [Google Scholar] [CrossRef]
  2. Basar, E.; Di Renzo, M.; De Rosny, J.; Debbah, M.; Alouini, M.S.; Zhang, R. Wireless communications through reconfigurable intelligent surfaces. IEEE Access 2019, 7, 116753–116773. [Google Scholar] [CrossRef]
  3. Gong, S.; Lu, X.; Hoang, D.T.; Niyato, D.; Shu, L.; Kim, D.I.; Liang, Y.C. Toward smart wireless communications via intelligent reflecting surfaces: A contemporary survey. IEEE Commun. Surv. Tutor. 2020, 22, 2283–2314. [Google Scholar] [CrossRef]
  4. Papazafeiropoulos, A.; Kourtessis, P.; Ntontin, K.; Chatzinotas, S. Joint spatial division and multiplexing for FDD in intelligent reflecting surface-assisted massive MIMO systems. IEEE Trans. Veh. Technol. 2022, 71, 10754–10769. [Google Scholar] [CrossRef]
  5. Zheng, X.; Cao, R.; Ma, L. Uplink channel estimation and signal extraction against malicious IRS in massive MIMO system. In Proceedings of the 2021 IEEE International Conference on Communications Workshops (ICC Workshops), Montreal, QC, Canada, 14–23 June 2021; pp. 1–6. [Google Scholar]
  6. Zhang, C.; Xu, H.; Ng, B.K.; Lam, C.T.; Wang, K. RIS-Assisted Received Adaptive Spatial Modulation for Wireless Communications. In Proceedings of the 2025 IEEE Wireless Communications and Networking Conference (WCNC), Milan, Italy, 24–27 March 2025; pp. 1–6. [Google Scholar]
  7. Gkonis, P.K. A survey on machine learning techniques for massive MIMO configurations: Application areas, performance limitations and future challenges. IEEE Access 2022, 11, 67–88. [Google Scholar] [CrossRef]
  8. Okogbaa, F.C.; Ahmed, Q.Z.; Khan, F.A.; Abbas, W.B.; Che, F.; Zaidi, S.A.R.; Alade, T. Design and application of intelligent reflecting surface (IRS) for beyond 5G wireless networks: A review. Sensors 2022, 22, 2436. [Google Scholar] [CrossRef]
  9. Kim, I.S.; Bennis, M.; Oh, J.; Chung, J.; Choi, J. Bayesian channel estimation for intelligent reflecting surface-aided mmWave massive MIMO systems with semi-passive elements. IEEE Trans. Wirel. Commun. 2023, 22, 9732–9745. [Google Scholar] [CrossRef]
  10. Sur, S.N.; Singh, A.K.; Kandar, D.; Silva, A.; Nguyen, N.D. Intelligent reflecting surface assisted localization: Opportunities and challenges. Electronics 2022, 11, 1411. [Google Scholar] [CrossRef]
  11. Lam, C.T.; Wang, K.; Ng, B.K. Channel estimation using in-band pilots for cell-free massive MIMO. In Proceedings of the 2023 IEEE Virtual Conference on Communications (VCC), Virtual, 28–30 November 2023; pp. 200–205. [Google Scholar]
  12. Xu, H.; Zhang, C.; Wu, Q.; Ng, B.K.; Lam, C.T.; Yanikomeroglu, H. FTN-assisted SWIPT-NOMA design for IoT wireless networks: A paradigm in wireless efficiency and energy utilization. IEEE Sens. J. 2025, 25, 7431–7444. [Google Scholar] [CrossRef]
  13. Xu, H.; Zhang, C.; Wu, Q.; Ng, B.K.; Lam, C.T. Adaptive Damping Log-Domain Message-Passing Algorithm for FTN-OTFS in V2X Communications. Sensors 2025, 25, 3692. [Google Scholar] [CrossRef]
  14. de Araújo, G.T.; de Almeida, A.L. PARAFAC-based channel estimation for intelligent reflective surface assisted MIMO system. In Proceedings of the 2020 IEEE 11th Sensor Array and Multichannel Signal Processing Workshop (SAM), Hangzhou, China, 8–11 June 2020; pp. 1–5. [Google Scholar]
  15. Chen, Y.; Jiang, F. Compressive Channel Estimation Based on the Deep Denoising Network in an IRS-Enhanced Massive MIMO System. Comput. Intell. Neurosci. 2022, 2022, 8234709. [Google Scholar] [CrossRef] [PubMed]
  16. Xie, W.; Xiao, J.; Zhu, P.; Yu, C.; Yang, L. Deep compressed sensing-based cascaded channel estimation for RIS-aided communication systems. IEEE Wirel. Commun. Lett. 2022, 11, 846–850. [Google Scholar] [CrossRef]
  17. Gao, T.; He, M. Two-stage channel estimation using convolutional neural networks for IRS-assisted mmwave systems. IEEE Syst. J. 2023, 17, 3183–3191. [Google Scholar] [CrossRef]
  18. Liu, C.; Liu, X.; Ng, D.W.K.; Yuan, J. Deep residual learning for channel estimation in intelligent reflecting surface-assisted multi-user communications. IEEE Trans. Wirel. Commun. 2021, 21, 898–912. [Google Scholar] [CrossRef]
  19. Wang, Z.; Liu, L.; Cui, S. Channel estimation for intelligent reflecting surface assisted multiuser communications: Framework, algorithms, and analysis. IEEE Trans. Wirel. Commun. 2020, 19, 6607–6620. [Google Scholar] [CrossRef]
  20. Tabassum, R.; Sejan, M.A.S.; Rahman, M.H.; Aziz, M.A.; Song, H.K. Intelligent Reflecting Surface-Assisted Wireless Communication Using RNNs: Comprehensive Insights. Mathematics 2024, 12, 2973. [Google Scholar] [CrossRef]
  21. Ye, M.; Zhang, H.; Wang, J.B. Channel estimation for intelligent reflecting surface aided wireless communications using conditional GAN. IEEE Commun. Lett. 2022, 26, 2340–2344. [Google Scholar] [CrossRef]
  22. Elbir, A.M.; Papazafeiropoulos, A.; Kourtessis, P.; Chatzinotas, S. Deep channel learning for large intelligent surfaces aided mm-wave massive MIMO systems. IEEE Wirel. Commun. Lett. 2020, 9, 1447–1451. [Google Scholar] [CrossRef]
  23. Abdelmaksoud, A.; Abdelhamid, B.; Elbadawy, H.; El Hennawy, H.; Eldyasti, S. DGD-CNet: Denoising Gated Recurrent Unit with a Dropout-Based CSI Network for IRS-Aided Massive MIMO Systems. Sensors 2024, 24, 5977. [Google Scholar] [CrossRef]
  24. Janawade, S.A.; Krishnan, P.; Kandasamy, K.; Holla, S.S.; Rao, K.; Chandrasekar, A. A Low-Complexity Solution for Optimizing Binary Intelligent Reflecting Surfaces towards Wireless Communication. Future Internet 2024, 16, 272. [Google Scholar] [CrossRef]
  25. Li, L.; Chen, H.; Chang, H.H.; Liu, L. Deep residual learning meets OFDM channel estimation. IEEE Wirel. Commun. Lett. 2019, 9, 615–618. [Google Scholar] [CrossRef]
  26. Luan, D.; Thompson, J. Attention based neural networks for wireless channel estimation. In Proceedings of the 2022 IEEE 95th Vehicular Technology Conference: (VTC2022-Spring), Helsinki, Finland, 19–22 June 2022; pp. 1–5. [Google Scholar]
  27. Gu, Z.; He, C.; Huang, Z.; Xiao, M. Channel Estimation for IRS Aided MIMO System with Neural Network Solution. In Proceedings of the 2023 IEEE 98th Vehicular Technology Conference (VTC2023-Fall), Hong Kong, China, 10–13 October 2023; pp. 1–5. [Google Scholar]
  28. Chu, H.; Pan, X.; Jiang, J.; Li, X.; Zheng, L. Adaptive and robust channel estimation for IRS-aided millimeter-wave communications. IEEE Trans. Veh. Technol. 2024, 73, 9411–9423. [Google Scholar] [CrossRef]
  29. Shi, H.; Huang, Y.; Jin, S.; Wang, Z.; Yang, L. Automatic high-performance neural network construction for channel estimation in IRS-aided communications. IEEE Trans. Wirel. Commun. 2024, 23, 10667–10682. [Google Scholar] [CrossRef]
  30. Wang, Y.; Dong, R.; Shu, F.; Gao, W.; Zhang, Q.; Liu, J. Power Optimization and Deep Learning for Channel Estimation of Active IRS-Aided IoT. IEEE Internet Things J. 2024, 11, 41194–41206. [Google Scholar] [CrossRef]
  31. Chen, Z.; Zhao, M.M.; Li, M.; Xu, F.; Wu, Q.; Zhao, M.J. Joint location sensing and channel estimation for IRS-aided mmWave ISAC systems. IEEE Trans. Wirel. Commun. 2024, 23, 11985–12002. [Google Scholar] [CrossRef]
  32. Sun, H.; Zhu, L.; Mei, W.; Zhang, R. Power measurement based channel estimation for IRS-enhanced wireless coverage. IEEE Trans. Wirel. Commun. 2024, 23, 19183–19198. [Google Scholar] [CrossRef]
  33. He, Z.Q.; Yuan, X. Cascaded channel estimation for large intelligent metasurface assisted massive MIMO. IEEE Wirel. Commun. Lett. 2019, 9, 210–214. [Google Scholar] [CrossRef]
  34. Harshman, R.A. Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis. UCLA Work. Pap. Phon. 1970, 16, 84. [Google Scholar]
  35. Kolda, T.G.; Bader, B.W. Tensor decompositions and applications. SIAM Rev. 2009, 51, 455–500. [Google Scholar] [CrossRef]
  36. Comon, P.; Luciani, X.; De Almeida, A.L. Tensor decompositions, alternating least squares and other tales. J. Chemom. J. Chemom. Soc. 2009, 23, 393–405. [Google Scholar] [CrossRef]
  37. de Almeida, A.L.F.; Favier, G.; da Costa, J.; Mota, J.C.M. Overview of tensor decompositions with applications to communications. In Signals and Images: Advances and Results in Speech, Estimation, Compression, Recognition, Filtering, and Processing; CRC-Press: Boca Raton, FL, USA, 2016; Volume 12, pp. 325–356. [Google Scholar]
  38. Sidiropoulos, N.D.; De Lathauwer, L.; Fu, X.; Huang, K.; Papalexakis, E.E.; Faloutsos, C. Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process. 2017, 65, 3551–3582. [Google Scholar] [CrossRef]
Figure 1. Demonstration of an IRS-aided MIMO system. There are M and L antennas equipped at the BS and UT, respectively. The IRS comprises N passive elements and reflects signals from the BS to the UT under the control of a centralized controller. The direct LoS path is obstructed.
Figure 2. Illustration of the structured pilot transmission strategy in the time domain. Each block S k spans T time slots, within which the pilot signals { x 1 , , x T } are repeated while the IRS phase shift vector remains fixed.
Figure 3. Illustration of the proposed ConvTrans-ResNet architecture. The model denoises the coarse channel estimate $\hat{\mathbf{H}}_c \in \mathbb{R}^{M \times L \times 2}$ using three main components: a ConvEmbed module for local feature extraction and dimensional lifting, stacked Transformer blocks with residual connections for modeling spatial dependencies, and a ConvOut module for channel reconstruction.
Figure 4. Convergence of various methods in terms of NMSE with N = 25 at 0 dB.
Figure 5. Comparison of different methods in terms of NMSE under SNRs ranging from 0 to 30 dB with various numbers of IRS elements N: (a) N = 25 ; (b) N = 49 ; (c) N = 81 ; (d) N = 100 ; (e) N = 144 .
Figure 6. Comparison of various methods in terms of NMSE with various numbers of IRS elements N at 30 dB.
Figure 7. Comparison of BALS and ConvTrans-ResNet in terms of NMSE with varying numbers of IRS elements N under the 3GPP channel model.
Figure 8. Ablation study in terms of NMSE using various structures with N = 25: (a) various ConvEmbed blocks; (b) various Transformer blocks; (c) various $n_H$; (d) various MLP expansion ratios.
Figure 9. Illustration of attention maps from two attention heads: (a) Head 1; (b) Head 2.
Figure 10. Comparison between various models in terms of complexities: (a) parameter size; (b) FLOPs; (c) calculation time.
Table 1. Parameters for the simulated MIMO network aided by an IRS.

Symbol | Description | Value
M | Number of BS antennas | 64
L | Number of UE antennas | 4
N | Number of passive components at IRS | 25, 49, 81, 100, 144
T_c | Channel coherence time | 200
K | Number of blocks | 50
T | Number of time slots per block | 4
SNR | Signal-to-noise ratios | 0:5:30 dB
Table 2. Hyperparameters used for training different channel estimation methods.

Hyperparameter | HA02 | ReEsNet | InterpResNet | Proposed Method
Optimizer | Adam | Adam | Adam | Adam
Maximum epoch | 100 | 100 | 100 | 100
Initial learning rate (lr) | 0.002 | 0.001 | 0.001 | 0.001
Drop period for lr | every 20 | None | every 20 | None
Drop factor for lr | 0.5 | None | 0.5 | None
Batch size | 128 | 128 | 128 | 128
L2 regularization | 1 × 10⁻⁷ | 1 × 10⁻⁷ | 1 × 10⁻⁷ | 1 × 10⁻⁷
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
