Article

A Low-Complexity Algorithm for a Reinforcement Learning-Based Channel Estimator for MIMO Systems

Tae-Kyoung Kim 1 and Moonsik Min 2,3,*

1 Department of Electronic Engineering, Gachon University, Seongnam 13120, Korea
2 School of Electronics Engineering, Kyungpook National University, Daegu 41566, Korea
3 School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Korea
* Author to whom correspondence should be addressed.
Sensors 2022, 22(12), 4379; https://doi.org/10.3390/s22124379
Submission received: 20 May 2022 / Revised: 2 June 2022 / Accepted: 7 June 2022 / Published: 9 June 2022
(This article belongs to the Section Communications)

Abstract

This paper proposes a low-complexity algorithm for a reinforcement learning-based channel estimator for multiple-input multiple-output systems. The proposed channel estimator utilizes detected symbols to reduce the channel estimation error. However, the detected data symbols may include errors at the receiver owing to the characteristics of the wireless channels. Thus, the detected data symbols are selectively used as additional pilot symbols. To this end, a Markov decision process (MDP) problem is defined to optimize the selection of the detected data symbols. Subsequently, a reinforcement learning algorithm is developed to solve the MDP problem with computational efficiency. The developed algorithm derives the optimal policy in a closed form by introducing backup samples and data subblocks, to reduce latency and complexity. Simulations are conducted, and the results show that the proposed channel estimator significantly reduces the minimum-mean square error of the channel estimates, thus improving the block error rate compared to the conventional channel estimation.

1. Introduction

Currently, multiple-input multiple-output (MIMO) is an essential technology in wireless communications [1,2,3,4,5,6]. Multiple antennas are easy to implement in wireless systems, and their use significantly increases system reliability and capacity. However, exploiting these advantages requires perfect channel information at both the transmitter and receiver, which is generally impossible to obtain because of the characteristics of wireless channels.
Although perfect channel information is unavailable, many studies have been conducted to improve the accuracy of channel estimation [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]. These investigations are mostly based on pilots, whose information is shared by the transmitter and receiver, and employ least-squares or linear minimum-mean-square-error (LMMSE) estimation [10,11,12], because these two methods perform reasonably well at a complexity affordable for wireless systems. However, their performance strongly depends on the number of pilots, which is generally limited in wireless systems because devoting many resources to pilots degrades the spectral efficiency.
This limitation can be overcome by using data in the channel estimation, i.e., by conducting data-aided channel estimation [13,14,15,16,17,18,19,20,21]. The concept is to exploit detected data symbols as additional pilots. However, because a detected data symbol may be erroneous, its use may degrade the accuracy of the channel estimation. An iterative turbo approach is a good method to address this degradation because the improved detection performance achieved using an iterative turbo equalizer also increases the channel estimation accuracy [19,20,21,22,23,24,25]. However, the use of this approach in wireless systems is limited by its inherently high complexity and latency.
Recently, a reinforcement learning (RL) approach was introduced in [26] for data-aided channel estimation. In this approach, a Markov decision process (MDP) problem is formulated to minimize the estimation error, and an RL algorithm is used to solve it. Without any iteration, the RL solution achieves a significant improvement over conventional channel estimation. However, this solution is difficult to implement in practical systems because of the considerable complexity and latency of computing the optimal policy: the approach in [26] requires all a posteriori probabilities (APPs) in a data block to calculate the optimal policy. A further limitation is that the optimal policy is characterized only for a specific discounting factor.
In this paper, a low-complexity channel estimator using an RL approach is proposed for MIMO systems. The key concept of this estimator is the selection of the detected data symbols obtained during data detection as additional pilot symbols. To achieve this, an MDP problem is first defined to minimize the channel estimation error, where the Q-value function is generalized by a discounting factor. Subsequently, an RL solution that can be easily implemented in wireless systems is proposed. To this end, the concepts of backup samples and data subblocks are introduced, which significantly reduce the complexity and latency. The main contributions of this study are summarized as follows:
  • A data-aided channel estimator is developed to optimize the selection of detected symbols for MIMO systems. An MDP problem is defined for this selection to minimize the mean-square-error (MSE) of the channel estimates. Compared with [26], a discounting factor is introduced in the Q-value function. The discounting factor adjusts the effects of rewards after the current state.
  • A low-complexity RL algorithm is proposed. To achieve this efficiently, a data block is separated into multiple data subblocks and the optimal policy for the data subblocks is characterized. In the characterization, only partial soft information obtained from data detection is utilized to reduce the calculation latency. Unlike in [26], the optimal policy is calculated using only this partially obtained information; the remaining rewards are approximated under the assumption of perfect detection. Finally, the optimal policy is obtained using a closed-form expression. Note that the conventional RL algorithm in [26] can be employed after obtaining all soft information in a data block.
  • The performance enhancement achieved for MIMO systems using the developed RL algorithm is evaluated. Simulations are conducted, and the results demonstrate that the proposed algorithm significantly reduces the performance degradation of conventional channel estimation. Based on the simulations, the proposed channel estimator using an approximate MDP presents a similar performance to that of the original MDP. In addition, the proposed channel estimator provides robustness in time-varying channels.
The remainder of this paper is organized as follows. Section 2 introduces the signal model, including the channel estimation and data detection considered in this study. Section 3 defines an MDP problem for optimally selecting detected data symbols to minimize the channel estimation error. A low-complexity RL algorithm is proposed in Section 4. Section 5 discusses simulation results that demonstrate the effectiveness of the developed algorithm. Finally, conclusions are presented in Section 6.

Notation

Matrices $\mathbf{0}_m$ and $\mathbf{I}_m$ represent the $m \times m$ all-zero and identity matrices, respectively. The superscripts $(\cdot)^T$ and $(\cdot)^H$ denote the transpose and the conjugate transpose, respectively. Operators $\mathbb{E}(\cdot)$ and $\mathbb{P}(\cdot)$ denote the expectation of a random variable and the probability of an event, respectively. Operators $|\cdot|$ and $\|\cdot\|^2$ denote the cardinality of a set and the squared norm, respectively. Operators $(\cdot)^{-1}$, $\mathrm{Tr}(\cdot)$, and $\mathcal{CN}$ denote the inverse, the trace, and the complex normal distribution, respectively. $\mathbb{C}$ represents the set of complex numbers.

2. Signal Model

This section describes the signal model for a MIMO system. Based on the signal model, the channel estimator and data detector considered in this study are introduced.

2.1. Signal Model

A MIMO system is considered in which a transmitter with $N_t$ antennas communicates with a receiver with $N_r$ antennas through a wireless channel. The wireless channel is denoted as $\mathbf{H} \in \mathbb{C}^{N_t \times N_r}$, where each channel element $h_{t,r} \in \mathbb{C}$ between the $t$-th transmit and $r$-th receive antenna is modeled by Rayleigh fading, $h_{t,r} \sim \mathcal{CN}(0, 1)$. The transmitter sends a frame consisting of one pilot block and $N_d$ data blocks, as shown in Figure 1. During the pilot transmission, the transmitter sends a pilot symbol $\mathbf{x}_p[n] \in \mathbb{C}^{N_t \times 1}$ for $n \in \mathcal{N}_p = \{1, \ldots, T_p\}$, where $T_p$ is the pilot length. When the pilot symbol $\mathbf{x}_p[n]$ is transmitted to the receiver, the received symbol $\mathbf{y}_p[n] \in \mathbb{C}^{N_r \times 1}$ at time slot $n$ is given as
$$\mathbf{y}_p[n] = \mathbf{H}^H \mathbf{x}_p[n] + \mathbf{z}_p[n], \quad (1)$$
where $\mathbf{z}_p[n]$ is an additive white Gaussian noise (AWGN) at time slot $n$ whose distribution follows $\mathcal{CN}(\mathbf{0}_{N_r}, N_0\mathbf{I}_{N_r})$. After the pilot transmission is completed, the transmitter sends a data symbol $\mathbf{x}_d[n] \in \mathbb{C}^{N_t \times 1}$ for $n \in \mathcal{N}_d = \{(d-1)T_d + 1, \ldots, dT_d\}$, where $T_d$ is the data length. With $\mathcal{X}$ denoting the constellation set, the data symbol satisfies $\mathbf{x}_d[n] \in \mathcal{X}^{N_t}$. During the data transmission, the received symbol $\mathbf{y}_d[n] \in \mathbb{C}^{N_r \times 1}$ is expressed as
$$\mathbf{y}_d[n] = \mathbf{H}^H \mathbf{x}_d[n] + \mathbf{z}_d[n], \quad (2)$$
where $\mathbf{z}_d[n]$ is also an AWGN at time slot $n$.

2.2. Channel Estimator and Data Detector

The LMMSE channel estimator is considered in this study because of its satisfactory performance with low complexity. Using the received symbols in (1), the LMMSE channel estimator $\mathbf{W} \in \mathbb{C}^{N_t \times T_p}$ is expressed as follows:
$$\hat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \mathbb{E}\left\|\mathbf{W}(\mathbf{y}_r^p)^H - \mathbf{h}_r\right\|^2 = \left(\mathbf{X}_p\mathbf{X}_p^H + N_0\mathbf{I}_{N_t}\right)^{-1}\mathbf{X}_p, \quad (3)$$
where $\mathbf{y}_r^p$ and $\mathbf{X}_p$ are the sets of received and pilot symbols, defined as $\mathbf{y}_r^p = [y_r^p[1], \ldots, y_r^p[T_p]]$ and $\mathbf{X}_p = [\mathbf{x}_p[1], \ldots, \mathbf{x}_p[T_p]]$, respectively. Using the channel estimator in (3), a channel estimate is expressed as
$$\hat{\mathbf{h}}_r = \hat{\mathbf{W}}(\mathbf{y}_r^p)^H = \left(\mathbf{X}_p\mathbf{X}_p^H + N_0\mathbf{I}_{N_t}\right)^{-1}\mathbf{X}_p(\mathbf{y}_r^p)^H, \quad (4)$$
where $\hat{\mathbf{h}}_r$ is the $r$-th row of the channel estimate matrix $\hat{\mathbf{H}}$.
A maximum a posteriori probability (MAP) data detector is considered in this study to ensure the optimal detection performance. The APP from the MAP data detector is computed as
$$\theta_k[n] = \mathbb{P}\big(\mathbf{x}_d[n] = \mathbf{x}_k \,\big|\, \mathbf{y}_d[n]\big) = \frac{\mathbb{P}\big(\mathbf{y}_d[n] \,\big|\, \mathbf{x}_d[n] = \mathbf{x}_k\big)\,\mathbb{P}\big(\mathbf{x}_d[n] = \mathbf{x}_k\big)}{\sum_{j \in \mathcal{K}} \mathbb{P}\big(\mathbf{y}_d[n] \,\big|\, \mathbf{x}_d[n] = \mathbf{x}_j\big)\,\mathbb{P}\big(\mathbf{x}_d[n] = \mathbf{x}_j\big)}, \quad (5)$$
where $\mathbf{x}_k \in \mathcal{X}^{N_t}$ is the $k$-th possible symbol for $k \in \mathcal{K} = \{1, \ldots, |\mathcal{X}|^{N_t}\}$. In (5), the a priori probability $\mathbb{P}(\mathbf{x}_d[n] = \mathbf{x}_k)$ is assumed to be equal for all possible symbols $\mathbf{x}_k$ for $k \in \mathcal{K}$, i.e., $\mathbb{P}(\mathbf{x}_d[n] = \mathbf{x}_k) = 1/|\mathcal{X}|^{N_t}$. Meanwhile, under the AWGN assumption, the likelihood probability $\mathbb{P}(\mathbf{y}_d[n] \mid \mathbf{x}_d[n] = \mathbf{x}_k)$ in (5) can be expressed as
$$\mathbb{P}\big(\mathbf{y}_d[n] \,\big|\, \mathbf{x}_d[n] = \mathbf{x}_k\big) = \frac{1}{(\pi N_0)^{N_r}} \exp\left(-\frac{\big\|\mathbf{y}_d[n] - \hat{\mathbf{H}}^H\mathbf{x}_k\big\|^2}{N_0}\right). \quad (6)$$
The MAP data detector selects the data symbol $\hat{\mathbf{x}}[n]$ with the largest APP value at time slot $n$, given by
$$\hat{\mathbf{x}}[n] = \operatorname*{argmax}_{\mathbf{x}_k \in \mathcal{X}^{N_t}} \theta_k[n] = \operatorname*{argmax}_{\mathbf{x}_k \in \mathcal{X}^{N_t}} \mathbb{P}\big(\mathbf{y}_d[n] \,\big|\, \mathbf{x}_d[n] = \mathbf{x}_k\big). \quad (7)$$
Note that the accuracy of the detected symbol $\hat{\mathbf{x}}[n]$ depends on the accuracy of the channel estimate $\hat{\mathbf{H}}$. However, the accuracy of the channel estimate cannot be ensured in practical systems where the pilot length $T_p$ is limited. To address this limitation, this study focuses on improving the accuracy of the channel estimator.
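To make the signal model concrete, the following is a minimal NumPy sketch of the pilot-based LMMSE estimate in (3) and (4) and the MAP detection in (5)-(7). This is an illustrative reconstruction rather than the authors' code: the unit-energy 4-QAM constellation, the SNR value, and the brute-force enumeration over $\mathcal{X}^{N_t}$ are assumptions chosen to mirror the simulation setup in Section 5.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
Nt, Nr, Tp, N0 = 4, 4, 8, 0.1

# Unit-energy 4-QAM constellation and a Rayleigh-fading channel H (Nt x Nr).
X = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
H = (rng.normal(size=(Nt, Nr)) + 1j * rng.normal(size=(Nt, Nr))) / np.sqrt(2)

# Pilot phase, Eq. (1): Yp = H^H Xp + Zp.
Xp = rng.choice(X, size=(Nt, Tp))
Zp = np.sqrt(N0 / 2) * (rng.normal(size=(Nr, Tp)) + 1j * rng.normal(size=(Nr, Tp)))
Yp = H.conj().T @ Xp + Zp

# LMMSE channel estimate, Eq. (4): column r of H_hat is the estimate h_hat_r.
H_hat = np.linalg.inv(Xp @ Xp.conj().T + N0 * np.eye(Nt)) @ Xp @ Yp.conj().T

# One data symbol, Eq. (2), then MAP detection over all K = |X|^Nt candidates.
xd = rng.choice(X, size=(Nt, 1))
yd = H.conj().T @ xd + np.sqrt(N0 / 2) * (rng.normal(size=(Nr, 1)) + 1j * rng.normal(size=(Nr, 1)))
cands = np.array(list(itertools.product(X, repeat=Nt)))                 # (K x Nt)
logp = -np.linalg.norm(yd.T - cands @ H_hat.conj(), axis=1) ** 2 / N0   # Eq. (6)
theta = np.exp(logp - logp.max()); theta /= theta.sum()                 # APPs, Eq. (5)
x_hat = cands[np.argmax(theta)]                                         # Eq. (7)
```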

3. Optimization Problem

This section defines the optimization problem for the channel estimator proposed subsequently, which uses detected symbols to improve the MSE of the channel estimates. Subsequently, to solve the optimization problem, the MDP problem and the optimal policy are presented.

3.1. Optimization Problem

This study considers a channel estimator that uses the detected symbols in (7) as additional pilot symbols. However, the data detector may generate detection errors at the receiver. Consequently, the use of detected symbols with errors degrades the accuracy of the channel estimator. To overcome this problem, the detected symbols should be selectively exploited by the channel estimator.
Let $\mathbf{a} \in \{0,1\}^{T_d}$ be the action vector whose $n$-th component $a_n$ indicates whether the detected symbol at time slot $n$ of the $d$-th data block is selected, for $n \in \mathcal{N}_d$. Specifically, when $a_n = 1$, the detected symbol is used as an additional pilot symbol; otherwise, it is not used. By exploiting $\mathbf{a}$, the LMMSE channel estimate in (4) can be updated as
$$\hat{\mathbf{h}}_r(\mathbf{a}) = \big(\mathbf{X}(\mathbf{a})\mathbf{X}(\mathbf{a})^H + N_0\mathbf{I}_{N_t}\big)^{-1}\mathbf{X}(\mathbf{a})\,\bar{\mathbf{y}}_r(\mathbf{a})^H, \quad (8)$$
where $\bar{\mathbf{y}}_r(\mathbf{a}) = \big[\mathbf{y}_r^p, y_r^d[u_1(\mathbf{a})], \ldots, y_r^d[u_{\|\mathbf{a}\|_0}(\mathbf{a})]\big]$ and $\mathbf{X}(\mathbf{a}) = \big[\mathbf{X}_p, \hat{\mathbf{x}}[u_1(\mathbf{a})], \ldots, \hat{\mathbf{x}}[u_{\|\mathbf{a}\|_0}(\mathbf{a})]\big]$.
Here, $u_i(\mathbf{a})$ is the time slot index of the $i$-th nonzero element in $\mathbf{a}$. Thus, the optimization problem that maximizes the accuracy of the proposed channel estimator can be expressed as
$$\mathbf{a}^\star = \operatorname*{argmin}_{\mathbf{a} \in \{0,1\}^{T_d}} \mathbb{E}\big\{\|\hat{\mathbf{H}}(\mathbf{a}) - \mathbf{H}\|^2\big\}. \quad (9)$$
Solving the optimization problem in (9) is difficult. First, the distribution of $\hat{\mathbf{H}}(\mathbf{a})$ requires information regarding the transmitted symbols, which is generally unknown to the receiver. In addition, the number of candidate actions $\mathbf{a}$ increases exponentially with the data length $T_d$, so an exhaustive search over the actions is impractical because of the prohibitive complexity and latency at the receiver.
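As an illustration of (8), the sketch below appends the selected detected symbols to the pilots before re-running the LMMSE estimate. The helper name and the binary `mask` standing in for the action vector $\mathbf{a}$ are hypothetical, not from the paper.

```python
import numpy as np

def update_estimate(Xp, Yp, Xd_hat, Yd, mask, N0):
    """Data-aided LMMSE re-estimate, Eq. (8): detected symbols whose mask
    entry is 1 are treated as extra pilots. Xp (Nt x Tp) pilots, Yp (Nr x Tp)
    received pilots, Xd_hat (Nt x Td) detected symbols, Yd (Nr x Td) received
    data symbols, mask a length-Td {0,1} vector playing the role of a."""
    sel = np.flatnonzero(mask)                            # indices u_i(a)
    X_aug = np.concatenate([Xp, Xd_hat[:, sel]], axis=1)
    Y_aug = np.concatenate([Yp, Yd[:, sel]], axis=1)
    Nt = X_aug.shape[0]
    return np.linalg.inv(X_aug @ X_aug.conj().T + N0 * np.eye(Nt)) @ X_aug @ Y_aug.conj().T
```

The difficulty noted above is visible here: with $T_d = 64$, the mask ranges over $2^{64}$ possibilities, which is why the sequential MDP formulation of Section 3.2 is needed.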

3.2. Markov Decision Process

To efficiently solve the problem in (9), an MDP was formulated in [26] that sequentially selected detected symbols. In this formulation, a detected symbol is selected if the updated channel estimator reduces the estimation error.
Similar to [26], for this study, the state set of the MDP at time slot $n$ is expressed as
$$\mathcal{S}_n = \Big\{ \big(\mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{M}_n\big) \;\Big|\; \mathbf{X}_n = \big[\mathbf{X}_p, \mathbf{x}_{k_{\mathcal{M}_n(1)}}, \ldots, \mathbf{x}_{k_{\mathcal{M}_n(|\mathcal{M}_n|)}}\big],\; k_i \in \mathcal{K},\; \hat{\mathbf{X}}_n = \big[\mathbf{X}_p, \hat{\mathbf{x}}[\mathcal{M}_n(1)], \ldots, \hat{\mathbf{x}}[\mathcal{M}_n(|\mathcal{M}_n|)]\big],\; \mathcal{M}_n \subseteq \{T_p+1, \ldots, n-1\} \Big\}, \quad (10)$$
where $k_n$ denotes the transmitted symbol index at time slot $n$, the set $\mathcal{M}_n$ represents the time slot indices of the data symbols utilized as additional pilot symbols, and $\mathcal{M}_n(i)$ is the $i$-th smallest element of $\mathcal{M}_n$. Based on the above notations, the proposed channel estimate at state $S_n = (\mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{M}_n) \in \mathcal{S}_n$ is expressed as
$$\hat{\mathbf{h}}_r(S_n) = \big(\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + N_0\mathbf{I}_{N_t}\big)^{-1}\hat{\mathbf{X}}_n\,\bar{\mathbf{y}}_r^H(S_n), \quad (11)$$
where $\bar{\mathbf{y}}_r(S_n) = \big[\mathbf{y}_r^p, y_r^d[\mathcal{M}_n(1)], \ldots, y_r^d[\mathcal{M}_n(|\mathcal{M}_n|)]\big]$.
The action set of the MDP is $\mathcal{A} = \{0, 1\}$. An action is defined as whether to utilize the current detected symbol as an additional pilot symbol; specifically, when $a = 1 \in \mathcal{A}$, the current detected symbol is used.
Based on the state and action sets, the state transition function of the MDP for $a \in \mathcal{A}$ and $S_n \in \mathcal{S}_n$ is expressed as follows:
$$T_{n+1}^{(a,j)}(S_n) = \mathbb{P}\big(U_{n+1}^{(a,j)}(S_n) \,\big|\, S_n, a\big) = \begin{cases} \mathbb{I}\{\mathbf{x}_d[n] = \mathbf{x}_j\}, & j \in \mathcal{J}_a,\; a = 1, \\ 1, & j \in \mathcal{J}_a,\; a = 0, \end{cases} \quad (12)$$
where $\mathcal{J}_0 = \{0\}$ and $\mathcal{J}_1 = \{1, \ldots, K\}$. State $U_{n+1}^{(a,j)}(S_n) \in \mathcal{S}_{n+1}$ is the valid next state from the current state $S_n = (\mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{M}_n) \in \mathcal{S}_n$ and is expressed as
$$U_{n+1}^{(a,j)}(S_n) = \begin{cases} \big([\mathbf{X}_n, \mathbf{x}_j], [\hat{\mathbf{X}}_n, \hat{\mathbf{x}}[n]], \mathcal{M}_n \cup \{n\}\big), & j \in \mathcal{J}_a,\; a = 1, \\ \big(\mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{M}_n\big), & j \in \mathcal{J}_a,\; a = 0. \end{cases} \quad (13)$$
The reward function of the MDP is the MSE improvement between the channel estimates at the current state $S_n$ and the next state $S_{n+1}$. Thus, the reward function from $S_n \in \mathcal{S}_n$ to $S_{n+1} \in \mathcal{S}_{n+1}$ is defined as
$$R(S_n, S_{n+1}) = E_r(S_n) - E_r(S_{n+1}), \quad (14)$$
where $E_r(S_n)$ is the MSE of the channel estimate for the $r$-th receive antenna at state $S_n \in \mathcal{S}_n$, which can be computed as
$$E_r(S_n) = \mathbb{E}\big\|\hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r\big\|^2 = \mathrm{Tr}\big(\mathbf{C}_e(S_n)\big), \quad (15)$$
where the error covariance matrix is $\mathbf{C}_e(S_n) = \mathbb{E}\big\{\big(\hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r\big)\big(\hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r\big)^H\big\}$.
Here, $\mathbf{C}_e(S_n)$ is independent of the receive antenna index $r$ because the channel and noise distributions are identical across receive antennas. Thus, the reward function in (14) can be simplified as
$$R(S_n, S_{n+1}) = \mathrm{Tr}\big(\mathbf{C}_e(S_n) - \mathbf{C}_e(S_{n+1})\big). \quad (16)$$
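To build intuition for the reward in (14)-(16), the following sketch evaluates the one-step MSE reduction under the simplifying assumption that all appended training symbols are correct, in which case the error covariance reduces to $\mathbf{C}_e = N_0(\mathbf{X}\mathbf{X}^H + N_0\mathbf{I}_{N_t})^{-1}$ (this is (A5) with $\mathbf{D} = N_0\mathbf{I}_{N_t}$). The function names are illustrative.

```python
import numpy as np

def mse_trace(X_train, N0):
    """Tr(C_e) of an LMMSE estimate built from training matrix X_train (Nt x T),
    assuming the training symbols are error-free: C_e = N0 (X X^H + N0 I)^{-1}."""
    Nt = X_train.shape[0]
    C_e = N0 * np.linalg.inv(X_train @ X_train.conj().T + N0 * np.eye(Nt))
    return np.real(np.trace(C_e))

def one_step_reward(X_n, x_new, N0):
    """Reward of Eq. (16): MSE improvement from appending one symbol x_new."""
    X_next = np.concatenate([X_n, x_new.reshape(-1, 1)], axis=1)
    return mse_trace(X_n, N0) - mse_trace(X_next, N0)
```

Under this idealized assumption, the reward is always nonnegative and every symbol would be selected; the detection errors modeled through (12) and the APPs are what make the selection nontrivial.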
The optimal policy of the MDP at time slot $n$ is defined as
$$\pi(S_n) = \operatorname*{argmax}_{a \in \mathcal{A}} Q(S_n, a), \quad (17)$$
where the Q-value function $Q(S_n, a)$ is the optimal sum of the rewards. Based on the state transition function in (12), the Q-value function can be expressed as
$$Q(S_n, a) = \sum_{j \in \mathcal{J}_a} T_{n+1}^{(a,j)}(S_n)\Big[ R\big(S_n, U_{n+1}^{(a,j)}(S_n)\big) + \gamma\, V\big(U_{n+1}^{(a,j)}(S_n)\big)\Big], \quad (18)$$
where $0 \le \gamma \le 1$ is a discounting factor whose value depends on the target of the optimization problem: a small value is desirable when the accuracy of the channel estimate at the current state matters most, whereas a larger value is preferred when the accuracy at the final state matters most.
$V\big(U_{n+1}^{(a,j)}(S_n)\big)$ is the optimal sum of the future rewards. The future value function $V(S_m)$ at state $S_m \in \mathcal{S}_m$ for $m \ge n+1$ can be computed recursively as follows:
$$V(S_m) = \sum_{a \in \mathcal{A}} \pi(S_m, a) \sum_{j \in \mathcal{J}_a} T_{m+1}^{(a,j)}(S_m)\Big[ R\big(S_m, U_{m+1}^{(a,j)}(S_m)\big) + \gamma\, V\big(U_{m+1}^{(a,j)}(S_m)\big)\Big], \quad (19)$$
where $\pi(S_m, a)$ is a state–action transition function, expressed as
$$\pi(S_m, a) = \mathbb{I}\Big\{a = \operatorname*{argmax}_{a' \in \mathcal{A}} Q(S_m, a')\Big\}, \quad (20)$$
where $Q(S_m, a)$ is the Q-value function calculated as the sum of the rewards obtained after taking action $a \in \mathcal{A}$ at state $S_m \in \mathcal{S}_m$.
Using the MDP in (10), (12), and (13), the state–action diagram of the original MDP is depicted in Figure 2a. In this figure, state $S_n$ transitions to the next valid state $U_{n+1}^{(a,j)}(S_n)$ according to action $a$. In particular, when $a = 1$, state $S_n$ transitions to state $U_{n+1}^{(1,k_n)}(S_n)$ determined by the transmitted symbol index $k_n$. Based on the state and state–action transition functions in (12) and (20), the state transitions to the next valid state until the end of a data block. As previously mentioned, the original MDP shown in Figure 2a cannot be solved by dynamic programming.
First, the state and state–action transition functions are unavailable at the receiver because the transmitted symbols $\mathbf{x}_{k_n}$ and the true channel $\mathbf{H}$ are unknown. In addition, the computational complexity and latency required to solve the original MDP are extremely high because the number of states increases exponentially with the data length $T_d$.

4. Proposed RL-Based Channel Estimator

In this section, an RL-based channel estimator is proposed. To address the unknown state and state–action functions, an RL algorithm is adopted because it provides a solution for the partially observable MDP [27,28]. Based on this algorithm, a computationally efficient RL solution is also proposed. The key concept of the proposed solution is to approximate the state–action transition functions to determine the optimal policy by separating the cases using the APPs.
The overall procedure of the proposed RL-based channel estimator is illustrated in Figure 3. The proposed channel estimator exploits the information $(\hat{\mathbf{x}}[m], \theta_j[m])$ obtained from the MIMO detector. To keep the algorithm computationally efficient, the optimal policy is calculated using only the APPs $(\theta_j[n], \ldots, \theta_j[n+N])$. The channel estimate is then updated according to the optimal policy. Details of the proposed channel estimator, i.e., how the MDP is approximated and how the optimal policy is derived in closed form, are explained in this section.

4.1. Statistical State Transition

In this subsection, the state transition function in (12) at time slot $n$ is approximated using the APP $\theta_j[n]$. The basic concept, introduced in [26], is to treat the APP $\theta_j[n]$ as the probability of the event $\{\mathbf{x}[n] = \mathbf{x}_j\}$. Thus, the state transition function in (12) at time slot $n$ is approximated as follows:
$$\hat{T}_{n+1}^{(a,j)}(S_n) = \begin{cases} \theta_j[n], & j \in \mathcal{J}_a,\; a = 1, \\ 1, & j \in \mathcal{J}_a,\; a = 0, \end{cases} \quad (21)$$
where the detected symbol index at time slot $n$ is denoted as $\hat{k}_n$. Because the APP $\theta_j[n]$ can be interpreted as the probability of the event $\{\mathbf{x}[n] = \mathbf{x}_j\}$, this is called a statistical transition. In addition, as the data detection performance improves, i.e., $\theta_{k_n}[n] \to 1$, the approximate state transition function in (21) approaches the true state transition function in (12).

4.2. State–Action Transition Using Backup Samples

After time slot $m \ge n+1$, the state in (20) is assumed to transition to a virtual state that mimics the possible next states by exploiting the expected transmitted symbol $\tilde{\mathbf{x}}[m]$, defined as
$$\tilde{\mathbf{x}}[m] = \sum_{j=1}^{K} \theta_j[m]\,\mathbf{x}_j. \quad (22)$$
In this study, the expected transmitted symbol is used in the same way as in [26], except that its use is limited to $N$ backup samples to reduce the complexity. A backup sample is defined as the APP $\theta_j[m]$ for $n+1 \le m \le n+N$, because the expected transmitted symbol can be computed from $\theta_j[m]$. Thus, the Q-value function can be calculated once all $\theta_j[m]$ for $n+1 \le m \le n+N$ are obtained. Using a backup sample of an APP, the state–action transition is expressed as
$$\hat{\pi}(S_m, a) = 1. \quad (23)$$
Thus, the virtual state $\tilde{U}_m^{(a,j)}(S_n) \in \mathcal{S}_m$ that can be reached from $S_n \in \mathcal{S}_n$ is expressed as
$$\tilde{U}_m^{(a,j)}(S_n) = \big(\mathbf{X}_m^{(a,j)}, \hat{\mathbf{X}}_m^{(a)}, \mathcal{M}_m^{(a)}\big), \quad (24)$$
where its components are
$$\mathbf{X}_m^{(a,j)} = \begin{cases} \big[\mathbf{X}_n, \mathbf{x}_j, \tilde{\mathbf{x}}[n+1], \ldots, \tilde{\mathbf{x}}[n+N]\big], & a = 1, \\ \big[\mathbf{X}_n, \tilde{\mathbf{x}}[n+1], \ldots, \tilde{\mathbf{x}}[n+N]\big], & a = 0, \end{cases} \qquad \hat{\mathbf{X}}_m^{(a)} = \begin{cases} \big[\hat{\mathbf{X}}_n, \hat{\mathbf{x}}[n], \tilde{\mathbf{x}}[n+1], \ldots, \tilde{\mathbf{x}}[n+N]\big], & a = 1, \\ \big[\hat{\mathbf{X}}_n, \tilde{\mathbf{x}}[n+1], \ldots, \tilde{\mathbf{x}}[n+N]\big], & a = 0, \end{cases} \qquad \mathcal{M}_m^{(a)} = \begin{cases} \mathcal{M}_n \cup \{n, \ldots, n+N\}, & a = 1, \\ \mathcal{M}_n \cup \{n+1, \ldots, n+N\}, & a = 0. \end{cases}$$
Because a virtual state mimics the transitions to the candidate symbols, state $\tilde{U}_m^{(a,j)}(S_n) \in \mathcal{S}_m$ always transitions to the virtual state $\tilde{U}_{m+1}^{(a,j)}(S_n) \in \mathcal{S}_{m+1}$. Therefore, the corresponding state transition function is written as
$$\hat{T}_{m+1}^{(a,j)}\big(\tilde{U}_m^{(a,j)}(S_n)\big) = 1, \quad (25)$$
where $n+1 \le m \le n+N$.
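A short sketch of the backup-sample quantity: the expected transmitted symbol in (22) is simply the APP-weighted average of the candidate vectors. The variable names are illustrative.

```python
import numpy as np

def expected_symbol(theta, cands):
    """Expected transmitted symbol, Eq. (22): x_tilde = sum_j theta_j * x_j.
    theta has length K and sums to 1; cands is the (K x Nt) candidate list."""
    return cands.T @ theta

# As detection becomes reliable (one APP approaches 1), the soft symbol
# x_tilde collapses onto the corresponding hard candidate, so the virtual
# states built from backup samples approach the true future states.
```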

4.3. State–Action Transition after Backup Samples

In this subsection, the virtual states after $n+N$ that can be reached without the information of the backup samples $\theta_j[m]$ are described for $m \ge n+N+1$. To this end, the states $\hat{U}_{m+1}^{(a,j)}(S_n)$ for $m \ge n+N+1$ are assumed to act optimally when all symbols are correctly detected. By using the property $\mathbf{x}[m] = \hat{\mathbf{x}}[m]$ after time slot $n+N+1$, an approximate virtual state is expressed as
$$\hat{U}_m^{(a,j)}(S_n) = \big(\mathbf{X}_m^{(a,j)}, \hat{\mathbf{X}}_m^{(a)}, \mathcal{M}_m^{(a)}\big), \quad (26)$$
where its components are defined as
$$\mathbf{X}_m^{(a,j)} = \big[\mathbf{X}_{n+N+1}^{(a,j)}, \hat{\mathbf{x}}[n+N+1], \ldots, \hat{\mathbf{x}}[m-1]\big], \qquad \hat{\mathbf{X}}_m^{(a)} = \big[\hat{\mathbf{X}}_{n+N+1}^{(a)}, \hat{\mathbf{x}}[n+N+1], \ldots, \hat{\mathbf{x}}[m-1]\big], \qquad \mathcal{M}_m^{(a)} = \mathcal{M}_{n+N+1}^{(a)} \cup \{n+N+1, \ldots, m-1\},$$
where $\big(\mathbf{X}_{n+N+1}^{(a,j)}, \hat{\mathbf{X}}_{n+N+1}^{(a)}, \mathcal{M}_{n+N+1}^{(a)}\big)$ are the components of $\tilde{U}_{n+N+1}^{(a,j)}(S_n)$.
In Figure 2b, a state–action diagram of the approximate MDP is depicted. The original MDP requires information regarding the transmitted symbols for the state transition, as shown in Figure 2a. In contrast, the approximate MDP utilizes the virtual states $\tilde{U}_m^{(a,j)}(S_n)$ and $\hat{U}_m^{(a,j)}(S_n)$, which mimic the transitions to the candidate symbols for an unknown transmitted symbol and action. Specifically, the virtual states $\tilde{U}_m^{(a,j)}(S_n)$ and $\hat{U}_m^{(a,j)}(S_n)$ are used at time slots $n+1 \le m \le n+N$ and after time slot $n+N$, respectively. These two approximations decrease the number of transitions to the next state, so the computation required to solve the MDP is considerably reduced.

4.4. Proposed Optimal Policy

Using the approximations in (21), (23), and (24), the optimal policy can be determined. However, the calculation latency is still considerable because the optimal policy can only be computed at the end of a data block. To prevent this computational burden, the proposed solution separates a data block into $N_b$ data subblocks and subsequently characterizes the optimal policy for each data subblock, as shown in Figure 4. Based on this characterization, the state in (10) and the corresponding channel estimate in (11) are updated once per data subblock. To realize this separation, the data subblock length is defined as $T_b$, which satisfies $N_b = T_d / T_b$. Thus, the set of time slot indices of the $b$-th data subblock in the $d$-th data block is defined as $\mathcal{N}_{b,d} = \{T_p + (b-1)T_b + (d-1)T_d + 1, \ldots, T_p + bT_b + (d-1)T_d\}$ for $b \in \{1, \ldots, N_b\}$ and $d \in \{1, \ldots, N_d\}$ (see Figure 4).
Using the virtual states in (24) and (26), the Q-value function is written as
$$Q(S_n, a) = \sum_{j \in \mathcal{J}_a} T_{n+1}^{(a,j)}(S_n)\Big[ R\big(S_n, \tilde{U}_{n+1}^{(a,j)}(S_n)\big) + \sum_{m=n+1}^{n+N} \gamma^{m-n}\, R\big(\tilde{U}_m^{(a,j)}(S_n), \tilde{U}_{m+1}^{(a,j)}(S_n)\big) + \gamma^{N+1}\, V\big(\hat{U}_{n+N+1}^{(a,j)}(S_n)\big)\Big], \quad (27)$$
where the future value function $V\big(\hat{U}_{n+N+1}^{(a,j)}(S_n)\big)$ is obtained based on the approximation of $\hat{U}_m^{(a,j)}(S_n)$ as follows:
$$V\big(\hat{U}_{n+N+1}^{(a,j)}(S_n)\big) \approx R\big(\tilde{U}_{n+N+1}^{(a,j)}(S_n), \hat{U}_{n+N+2}^{(a,j)}(S_n)\big) + \sum_{m=n+N+2}^{\mathcal{N}_{b,d}(T_b)} R\big(\hat{U}_m^{(a,j)}(S_n), \hat{U}_{m+1}^{(a,j)}(S_n)\big). \quad (28)$$
In the future reward in (28), the discounting factor is set to 1 to keep the calculation simple and thereby reduce the complexity.
Based on (27) and (28), the optimal policy for each state is obtained as a closed-form expression, as described in the following theorem:
Theorem 1.
Under the virtual states and the use of backup samples, the optimal policy for the state $S_n = (\mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{M}_n) \in \mathcal{S}_n$ is
$$\pi(S_n) = \mathbb{I}\left\{ \frac{\displaystyle\sum_{m=n}^{n+N} \gamma^{m-n}(1-\gamma)\, U_m(S_n) + \gamma^{N+1}\, U_{\mathcal{N}_{b,d}(T_b)+1}(S_n)}{\displaystyle\sum_{m=n}^{n+N} \gamma^{m-n}(1-\gamma)\, L_m(S_n) + \gamma^{N+1}\, L_{\mathcal{N}_{b,d}(T_b)+1}(S_n)} \ge 1 \right\}, \quad (29)$$
where the functions $U_m(S_n)$ and $L_m(S_n)$ are respectively defined as
$$U_m(S_n) = \|\mathbf{t}_m\|^2 N_0 + N_0^2\|\mathbf{t}_m\|^2 + \|\mathbf{v}_m\|^2, \qquad L_m(S_n) = \|\mathbf{t}_m\|^2\big(2N_0^2\beta_m + \delta_m\big) + \big\|e_m\mathbf{u}_m + \mathbf{v}_m\big\|^2.$$
All components are defined as
$$\begin{aligned}
\mathbf{Q}_m &= \Big(\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + \sum_{l=n+1}^{m}\tilde{\mathbf{x}}[l]\tilde{\mathbf{x}}^H[l] + N_0\mathbf{I}_{N_t}\Big)^{-1}, & \mathbf{D}_m &= \hat{\mathbf{X}}_n\big(\hat{\mathbf{X}}_n - \mathbf{X}_n\big)^H + \sum_{l=n+1}^{m}\hat{\mathbf{x}}[l]\big(\hat{\mathbf{x}}[l] - \tilde{\mathbf{x}}[l]\big)^H + N_0\mathbf{I}_{N_t},\\
\mathbf{t}_m &= \frac{1}{1+\alpha_m}\,\mathbf{Q}_m\hat{\mathbf{x}}[n], & e_m &= \frac{1}{1+\alpha_m}\,\big\|\hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n]\big\|,\\
\mathbf{u}_m &= \mathbf{D}_m^H\mathbf{t}_m, & \mathbf{v}_m &= \mathbf{D}_m^H\mathbf{Q}_m\,\frac{\mathbf{t}_m}{\|\mathbf{t}_m\|^2},\\
\alpha_m &= \hat{\mathbf{x}}^H[n]\,\mathbf{Q}_m\,\hat{\mathbf{x}}[n], & \beta_m &= \frac{\mathbf{t}_m^H\mathbf{Q}_m\mathbf{t}_m}{\|\mathbf{t}_m\|^2},\\
\delta_m &= \frac{1}{1+\alpha_m}\Big(\sum_{j=1}^{K}\theta_j[n]\,\big\|\hat{\mathbf{x}}[n] - \mathbf{x}_j\big\|^2 - \big\|\hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n]\big\|^2\Big), & &\\
\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1} &= \Big(\mathbf{Q}_{n+N}^{-1} + \big(\mathcal{N}_{b,d}(T_b) - (n+N-1)\big)\mathbf{I}_{N_t}\Big)^{-1}, & \mathbf{D}_{\mathcal{N}_{b,d}(T_b)+1} &= \mathbf{D}_{n+N}.
\end{aligned}$$
Proof. 
See Appendix A. □

4.5. Summary: The Proposed Algorithm

The proposed channel estimator is summarized in Algorithm 1. First, the receiver initializes the state during the pilot transmission. The current state is then updated and transited to the next state according to the optimal action obtained using (29). In particular, when $a = 1$, the most probable state transition is used for the unknown transmitted symbol index; this transition approaches the true state transition as $\theta_j[n]$ approaches 1 in reliable communication. At the end of each data subblock, the proposed channel estimator updates the channel estimate using the current state $S_n$.
Algorithm 1: The proposed channel estimator.
  1  Set $\hat{\mathbf{H}} = [\hat{\mathbf{h}}_1, \ldots, \hat{\mathbf{h}}_{N_r}]$ from (4)
  2  Initialize $S_1 = (\mathbf{X}_p, \mathbf{X}_p, \emptyset)$
  3  for $d = 1$ to $N_d$ do
  4    for $b = 1$ to $N_b$ do
  5      for $n \in \mathcal{N}_{b,d}$ do
  6        Obtain $\hat{\mathbf{x}}[n]$ from (7) and $\{\theta_j[n], \ldots, \theta_j[n+N]\}$ from (5) for $j \in \mathcal{K}$
  7        Compute $a = \pi(S_n)$ from (29)
  8        Set $j = 0$ for $a = 0$ and $\mathbf{x}_j = \hat{\mathbf{x}}[n]$ for $a = 1$
  9        Update $S_{n+1} \leftarrow U_{n+1}^{(a,j)}(S_n)$ from (13)
10      end
11      Set $\hat{\mathbf{H}} = [\hat{\mathbf{h}}_1(S_n), \ldots, \hat{\mathbf{h}}_{N_r}(S_n)]$ from (11)
12    end
13  end
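The structure of Algorithm 1 can be summarized in Python as follows. This is a structural sketch only: `map_detect`, `optimal_policy`, and `update_from_state` are hypothetical helpers standing in for (5)-(7), the closed-form policy (29), and the state-based estimate (11), respectively.

```python
import numpy as np

def rl_channel_estimator(H_hat, Xp, Yp, Yd, Td, Tb, Nd, N0):
    """Structural sketch of Algorithm 1; the helper functions are placeholders."""
    Nb = Td // Tb
    state = (Xp.copy(), Xp.copy(), [])                  # line 2: S_1 = (Xp, Xp, {})
    for d in range(Nd):
        for b in range(Nb):
            for n in range(d * Td + b * Tb, d * Td + (b + 1) * Tb):
                x_hat, theta = map_detect(Yd[:, n], H_hat, N0)   # line 6
                a = optimal_policy(state, x_hat, theta, N0)      # line 7, Eq. (29)
                if a == 1:                                       # lines 8-9, Eq. (13)
                    X_n, X_hat_n, M_n = state
                    # The unknown transmitted symbol is replaced by the most
                    # probable candidate, i.e., the detected symbol x_hat.
                    state = (np.column_stack([X_n, x_hat]),
                             np.column_stack([X_hat_n, x_hat]),
                             M_n + [n])
            H_hat = update_from_state(state, Yp, Yd, N0)         # line 11, Eq. (11)
    return H_hat
```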

4.6. Complexity Analysis

In this subsection, the complexities of the proposed channel estimator and of the estimator in [26] are compared in terms of the number of states visited in the calculation of the optimal policy; the rewards in the optimal policy are computed over these states, and the per-state calculation in (29) is similar to that in [26]. When the current state is $S_n \in \mathcal{S}_n$ in the $d$-th data block, the number of visited states in [26] is exactly $dT_d - n$. By contrast, the number of visited states using the proposed channel estimator in the $b$-th data subblock is exactly $(b-1)T_b + 1 + (d-1)T_d - n$. Thus, $T_d - (b-1)T_b - 1$ states are excluded from the policy calculation by introducing the data subblocks. In addition, the proposed optimal policy can be calculated as soon as $N$ backup samples are obtained, whereas the approach in [26] must wait until the end of a data block. Thus, the latency of the optimal policy in [26] is much longer than that of the proposed optimal policy.
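For the simulation parameters of Section 5, $(T_d, T_b) = (64, 16)$, the per-slot saving derived above, $T_d - (b-1)T_b - 1$ states, works out as follows.

```python
# States excluded from each policy evaluation by the subblock scheme,
# per Section 4.6, for (Td, Tb) = (64, 16).
Td, Tb = 64, 16
for b in range(1, Td // Tb + 1):
    print(f"subblock {b}: {Td - (b - 1) * Tb - 1} fewer states visited")
# subblock 1: 63, subblock 2: 47, subblock 3: 31, subblock 4: 15
```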

5. Simulation Results

This section discusses the performance of the proposed channel estimator. The number of antennas in the MIMO system is $(N_t, N_r) = (4, 4)$. A rate-$1/2$ turbo code is adopted for channel coding, and 4-quadrature amplitude modulation (QAM) is adopted for symbol mapping. The frame is configured as $(T_p, T_d, N_d) = (8, 64, 20)$, and the proposed channel estimator uses data subblocks with $(T_b, N_b) = (16, 4)$. In addition, the parameters of the proposed channel estimator are $(N, \gamma) = (1, 0.5)$, unless specified otherwise. The per-bit signal-to-noise ratio (SNR) is defined as $E_b/N_0 = \frac{1}{\log_2|\mathcal{X}|}\frac{1}{N_0}$.
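Under the per-bit SNR definition above (unit-energy symbols assumed), the noise variance used in a simulation can be obtained as follows; the function name is illustrative.

```python
import numpy as np

def noise_variance(ebno_db, mod_order=4):
    """N0 from the definition Eb/N0 = 1 / (log2(|X|) * N0),
    i.e., N0 = 1 / (log2(|X|) * Eb/N0) with Eb/N0 in linear scale."""
    ebno_lin = 10.0 ** (ebno_db / 10.0)
    return 1.0 / (np.log2(mod_order) * ebno_lin)
```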
In all figures, the performances with perfect and imperfect channel estimates obtained using the LMMSE method are denoted as PCSI and CE, respectively. For benchmarking, two references are compared: the optimal case of the proposed channel estimator, which uses perfect knowledge of the transmitted symbols as additional pilot symbols, and the expected-symbol-based channel estimator, which uses the expected symbol in (22) as an additional pilot symbol. The performance is measured in terms of the block error rate (BLER) and the normalized MSE (NMSE). In Figure 5, the proposed channel estimator is compared with these channel estimators, and the conventional RL method of [26] is also depicted. The BLER of the proposed estimator is better than those of the conventional and expected-symbol-based estimators regardless of the per-bit SNR. Moreover, the proposed channel estimator outperforms the conventional estimator of [26] because it updates the channel estimate $N_b$ times per data block, whereas the method in [26] updates it once at the end of a data block.
Figure 6 compares the BLERs of the conventional and proposed channel estimators for different modulations. For 16-QAM, a MIMO system with $(N_t, N_r) = (2, 4)$ is considered because of the SNR range. The proposed channel estimator achieves an improved BLER compared to the conventional LMMSE channel estimator; this result demonstrates the effectiveness of optimizing the selection of detected symbols. The improvements at a BLER of $10^{-1}$ are approximately 1.2 dB and 0.7 dB for 4- and 16-QAM, respectively. Moreover, relative to the corresponding PCSI bound, the BLER of 16-QAM improves more than that of 4-QAM, because 16-QAM provides a larger number of reliably detected symbols that can be used as additional pilot symbols.
The NMSEs of the proposed channel estimator for different data subblock lengths are shown in Figure 7. The NMSE improves as $N_b$ decreases because the approximate MDP using data subblocks approaches the original MDP as $N_b$ decreases. However, as shown in Figure 7, the NMSE improvement is insignificant, whereas the complexity increases exponentially with $T_b$. Thus, $(T_b, N_b) = (16, 4)$ is used for the simulations in this study.
The NMSE of the proposed channel estimator as a function of the number of backup samples is shown in Figure 8. The NMSE improves as the number of backup samples increases because the accuracy of the state–action model improves. Even with a small value of $N$, the proposed channel estimator achieves sufficient NMSE performance. It should be noted, however, that the complexity and latency required to determine the optimal policy increase with the number of backup samples.
Figure 9 and Figure 10 show the results obtained using the proposed channel estimator in time-varying channels. Specifically, a first-order Gauss–Markov process, as used in [29,30], was adopted. In this process, the channel matrix at time slot $n$ is defined as
$$\mathbf{H}(n) = \sqrt{1-\epsilon^2}\,\mathbf{H}(n-1) + \epsilon\,\mathbf{e}(n),$$
where $n \in \mathcal{N}_{b,d}$ for $b \in \{1, 2, \ldots, N_b\}$ and $d \in \{1, 2, \ldots, N_d\}$, $\epsilon \in [0, 1]$ is a temporal correlation coefficient that depends on the velocity, and $\mathbf{H}(0)$ is the initial channel. Each element of $\mathbf{e}(n) \in \mathbb{C}^{N_t \times N_r}$ is assumed to follow $\mathcal{CN}(0, 1)$. Temporal correlation coefficients $\epsilon = 5 \times 10^{-3}$ and $\epsilon = 10^{-2}$ are used for the simulations.
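A sketch of the first-order Gauss–Markov process used here; the dimensions follow the signal model of Section 2, and the function name is illustrative.

```python
import numpy as np

def gauss_markov_channel(Nt, Nr, n_slots, eps, rng=np.random.default_rng()):
    """H(n) = sqrt(1 - eps^2) * H(n-1) + eps * e(n), with i.i.d. CN(0,1)
    entries in e(n); eps in [0, 1] controls how fast the channel decorrelates."""
    cn = lambda: (rng.normal(size=(Nt, Nr)) + 1j * rng.normal(size=(Nt, Nr))) / np.sqrt(2)
    H = cn()                      # initial channel H(0)
    channels = [H]
    for _ in range(n_slots - 1):
        H = np.sqrt(1.0 - eps ** 2) * H + eps * cn()
        channels.append(H)
    return channels
```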
Figure 9 shows the variation in the NMSE of the proposed channel estimator with the discounting factor. When the channel varies over time with $\epsilon = 5 \times 10^{-3}$, the NMSE with $\gamma = 0.1$ is better than that with $\gamma = 0.9$. This is because the rewards at future states are insignificant in time-varying channels; therefore, a small discounting factor is preferable. By contrast, when the channels are time-invariant, the rewards at future states are as important as those at the current state, so the large value $\gamma = 0.9$ improves the NMSE compared to $\gamma = 0.1$. Figure 10 compares the BLERs of the proposed and conventional channel estimators. When $\epsilon = 10^{-2}$, the BLER of the CE is severely degraded because the CE method cannot capture the channel variation. However, the proposed channel estimator is robust in time-varying channels because the channel variation can be tracked efficiently by selecting the detected symbols.

6. Conclusions

In this paper, a low-complexity algorithm for an RL-based channel estimator for MIMO systems was proposed. The proposed channel estimator adaptively selects detected symbols as additional pilot symbols to minimize the channel estimation error. In this study, an MDP problem was introduced, and a practical algorithm to solve it was developed using backup samples and data subblocks. Simulation results showed that the proposed channel estimator significantly improves the BLER and the NMSE compared to the conventional channel estimator.
A future direction of this study is to extend the RL approach to more realistic channels. The proposed method was derived for the Rayleigh fading channel, but realistic channels may include a line-of-sight component; thus, the MDP under Rician fading should be investigated. Another important direction is to develop the RL approach for frequency-selective channels, where the use of multiple subcarriers can increase the computational complexity considerably; a low-complexity algorithm is therefore necessary. Lastly, the RL approach can be extended to other advanced channel estimators, such as iterative methods, for which the MDP should be reformulated according to the channel estimator.

Author Contributions

Conceptualization, M.M. and T.-K.K.; methodology, T.-K.K.; software, M.M. and T.-K.K.; validation, M.M. and T.-K.K.; formal analysis, M.M. and T.-K.K.; investigation, M.M. and T.-K.K.; resources, T.-K.K.; data curation, T.-K.K.; writing–original draft preparation, T.-K.K.; writing–review and editing, M.M. and T.-K.K.; visualization, T.-K.K.; supervision, M.M.; project administration, M.M. and T.-K.K.; funding acquisition, M.M. and T.-K.K. All authors have read and agreed to the published version of the manuscript.

Funding

The work of M.M. was supported in part by a National Research Foundation of Korea (NRF) grant, funded by the Korea Government (MSIT) (No. 2020R1F1A1071649), and in part by the BK21 FOUR Project, funded by the Ministry of Education, Korea (4199990113966). The work of T.-K.K. was supported by a National Research Foundation of Korea (NRF) grant, funded by the Korea Government (MSIT) (No. 2021R1F1A1063273).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare that there is no conflict of interest.

Appendix A. Proof of Theorem 1

Although the basic derivation of the optimal policy is based on [26], two additional factors are considered, as presented in this appendix. First, the proposed derivation includes a discounting factor in the Q-value; thus, the intermediate rewards do not vanish, unlike in [26]. Second, a finite number of backup samples is used in the derivation; thus, the rewards that do not exploit the APPs are approximated differently from [26].
Under the assumption that the discounting factor is 1, the future value function at state $\tilde{U}_{n+N+1}^{(a,j)}(S_n) \in \mathcal{S}_{n+N+1}$ is expressed by substituting (14) into (28), as follows:
$$V\big(\tilde{U}_{n+N+1}^{(a,j)}(S_n)\big) = \mathrm{Tr}\Big[\mathbf{C}_e\big(\tilde{U}_{n+N+1}^{(a,j)}(S_n)\big) - \mathbf{C}_e\big(\hat{U}_{n+N+2}^{(a,j)}(S_n)\big) + \sum_{m=n+N+2}^{\mathcal{N}_{b,d}(T_b)}\Big(\mathbf{C}_e\big(\hat{U}_m^{(a,j)}(S_n)\big) - \mathbf{C}_e\big(\hat{U}_{m+1}^{(a,j)}(S_n)\big)\Big)\Big] = \mathrm{Tr}\Big[\mathbf{C}_e\big(\tilde{U}_{n+N+1}^{(a,j)}(S_n)\big) - \mathbf{C}_e\big(\hat{U}_{\mathcal{N}_{b,d}(T_b)+1}^{(a,j)}(S_n)\big)\Big]. \quad (A1)$$
By substituting (14) and (A1) into (27), the Q-value function can be obtained as follows:
$$Q(S_n, a) = \sum_{j \in \mathcal{J}_a} T_{n+1}^{(a,j)}(S_n)\,\mathrm{Tr}\Big[\mathbf{C}_e(S_n) + \sum_{m=n}^{n+N}\gamma^{m-n}(\gamma-1)\,\mathbf{C}_e\big(\tilde{U}_{m+1}^{(a,j)}(S_n)\big) - \gamma^{N+1}\,\mathbf{C}_e\big(\hat{U}_{\mathcal{N}_{b,d}(T_b)+1}^{(a,j)}(S_n)\big)\Big]. \quad (A2)$$
Thus, the optimal policy in (17) is expressed as
$$\pi(S_n) = \operatorname*{argmax}_{a \in \{0,1\}} Q(S_n, a) = \mathbb{I}\big\{Q(S_n, 1) - Q(S_n, 0) \ge 0\big\} = \mathbb{I}\Big\{\mathrm{Tr}\Big[\sum_{m=n}^{n+N}\gamma^{m-n}(\gamma-1)\Big(\sum_{j=1}^{K}\theta_j[n]\,\mathbf{C}_e\big(\tilde{U}_{m+1}^{(1,j)}(S_n)\big) - \mathbf{C}_e\big(\tilde{U}_{m+1}^{(0,0)}(S_n)\big)\Big) - \gamma^{N+1}\Big(\sum_{j=1}^{K}\theta_j[n]\,\mathbf{C}_e\big(\hat{U}_{\mathcal{N}_{b,d}(T_b)+1}^{(1,j)}(S_n)\big) - \mathbf{C}_e\big(\hat{U}_{\mathcal{N}_{b,d}(T_b)+1}^{(0,0)}(S_n)\big)\Big)\Big] \ge 0\Big\}. \quad (A3)$$
In (17), the optimal policy is determined by the difference between the error covariance matrices with $a = 0$ and $a = 1$. The error covariance matrices for the virtual states $\tilde{U}_m^{(a,j)}(S_n)$ and $\hat{U}_m^{(a,j)}(S_n)$ are derived as described below.

Appendix A.1. Error Covariance Calculation for $\tilde{U}_m^{(a,j)}(S_n)$

To obtain the error covariance matrix, the distribution of the received symbols $\bar{\mathbf{y}}_r^H\big(\tilde{U}_m^{(a,j)}(S_n)\big)$ in (2) is required, which is given by
$$\bar{\mathbf{y}}_r^H\big(\tilde{U}_m^{(a,j)}(S_n)\big) \sim \mathcal{CN}\Big(\mathbf{0}_{|\mathcal{M}_m^{(a)}|},\; \mathbf{X}_m^{(a,j)H}\mathbf{X}_m^{(a,j)} + N_0\mathbf{I}_{|\mathcal{M}_m^{(a)}|}\Big), \quad (A4)$$
for $j \in \mathcal{J}_a$ and $a \in \mathcal{A}$. Thus, the error covariance matrix in (A3) is computed using the result in [26], as follows:
$$\mathbf{C}_e\big(\tilde{U}_m^{(a,j)}(S_n)\big) = N_0\mathbf{Q}_m^{(a)} - N_0^2\mathbf{Q}_m^{(a)2} + \mathbf{Q}_m^{(a)}\mathbf{D}_m^{(a,j)}\mathbf{D}_m^{(a,j)H}\mathbf{Q}_m^{(a)}, \quad (A5)$$
where
$$\mathbf{Q}_m^{(a)} = \big(\hat{\mathbf{X}}_m^{(a)}\hat{\mathbf{X}}_m^{(a)H} + N_0\mathbf{I}_{N_t}\big)^{-1} \overset{(a)}{=} \begin{cases} \Big(\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + \sum_{l=n+1}^{m-1}\tilde{\mathbf{x}}[l]\tilde{\mathbf{x}}^H[l] + N_0\mathbf{I}_{N_t}\Big)^{-1}, & a = 0, \\ \Big(\mathbf{Q}_m^{(0)-1} + \hat{\mathbf{x}}[n]\hat{\mathbf{x}}^H[n]\Big)^{-1}, & a = 1, \end{cases} \qquad \mathbf{D}_m^{(a,j)} = \hat{\mathbf{X}}_m^{(a)}\big(\hat{\mathbf{X}}_m^{(a)} - \mathbf{X}_m^{(a,j)}\big)^H + N_0\mathbf{I}_{N_t} \overset{(b)}{=} \begin{cases} \hat{\mathbf{X}}_n\big(\hat{\mathbf{X}}_n - \mathbf{X}_n\big)^H + \sum_{l=n+1}^{m-1}\hat{\mathbf{x}}[l]\big(\hat{\mathbf{x}}[l] - \tilde{\mathbf{x}}[l]\big)^H + N_0\mathbf{I}_{N_t}, & j \in \mathcal{J}_a,\; a = 0, \\ \mathbf{D}_m^{(0,0)} + \hat{\mathbf{x}}[n]\big(\hat{\mathbf{x}}[n] - \mathbf{x}_j\big)^H, & j \in \mathcal{J}_a,\; a = 1. \end{cases} \quad (A6)$$
Thus, by the matrix inversion lemma, the matrix $\mathbf{Q}_m^{(1)}$ is re-expressed as
$$\mathbf{Q}_m^{(1)} = \mathbf{Q}_m^{(0)} - \frac{\mathbf{Q}_m^{(0)}\hat{\mathbf{x}}[n]\hat{\mathbf{x}}^H[n]\mathbf{Q}_m^{(0)}}{1 + \hat{\mathbf{x}}^H[n]\mathbf{Q}_m^{(0)}\hat{\mathbf{x}}[n]}. \quad (A7)$$
In addition, $\mathbf{D}_m^{(1,j)}\mathbf{D}_m^{(1,j)H}$ averaged over the APPs can be computed as
$$\sum_{j=1}^{K}\theta_j[n]\,\mathbf{D}_m^{(1,j)}\mathbf{D}_m^{(1,j)H} = \big(\mathbf{D}_m^{(0,0)} + \hat{\mathbf{d}}_n\big)\big(\mathbf{D}_m^{(0,0)} + \hat{\mathbf{d}}_n\big)^H + \hat{\delta}_n\,\hat{\mathbf{x}}[n]\hat{\mathbf{x}}^H[n], \quad (A8)$$
where $\hat{\mathbf{d}}_n = \hat{\mathbf{x}}[n]\big(\hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n]\big)^H$ and
$$\hat{\delta}_n = \sum_{j=1}^{K}\theta_j[n]\,\big\|\hat{\mathbf{x}}[n] - \mathbf{x}_j\big\|^2 - \big\|\hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n]\big\|^2.$$
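The rank-one update in (A7) is the Sherman–Morrison identity; the following self-contained check verifies it numerically on random data.

```python
import numpy as np

# Numerical check of Eq. (A7): (Q^{-1} + x x^H)^{-1} = Q - Q x x^H Q / (1 + x^H Q x).
rng = np.random.default_rng(1)
Nt = 4
B = rng.normal(size=(Nt, Nt)) + 1j * rng.normal(size=(Nt, Nt))
Q = np.linalg.inv(B @ B.conj().T + np.eye(Nt))    # a positive-definite stand-in for Q_m^{(0)}
x = rng.normal(size=(Nt, 1)) + 1j * rng.normal(size=(Nt, 1))

direct = np.linalg.inv(np.linalg.inv(Q) + x @ x.conj().T)
update = Q - (Q @ x @ x.conj().T @ Q) / (1 + (x.conj().T @ Q @ x).item().real)
assert np.allclose(direct, update)
```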

Appendix A.2. Error Covariance Calculation for $\hat{U}_m^{(a,j)}(S_n)$

Similar to the description in Appendix A.1, the error covariance matrix for $\hat{U}_m^{(a,j)}(S_n)$ can be obtained as
$$\mathbf{C}_e\big(\hat{U}_{\mathcal{N}_{b,d}(T_b)+1}^{(a,j)}(S_n)\big) = N_0\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(a)} - N_0^2\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(a)2} + \mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(a)}\mathbf{D}_{\mathcal{N}_{b,d}(T_b)+1}^{(a,j)}\mathbf{D}_{\mathcal{N}_{b,d}(T_b)+1}^{(a,j)H}\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(a)}, \quad (A9)$$
where $\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(0)}$ and $\mathbf{D}_{\mathcal{N}_{b,d}(T_b)+1}^{(0,0)}$ can be obtained from (26) as
$$\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(0)} = \Big(\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + \sum_{l=n+1}^{n+N}\tilde{\mathbf{x}}[l]\tilde{\mathbf{x}}^H[l] + \sum_{l=n+N+1}^{\mathcal{N}_{b,d}(T_b)}\hat{\mathbf{x}}[l]\hat{\mathbf{x}}^H[l] + N_0\mathbf{I}_{N_t}\Big)^{-1}, \qquad \mathbf{D}_{\mathcal{N}_{b,d}(T_b)+1}^{(0,0)} = \mathbf{D}_{n+N+1}^{(0,0)}. \quad (A10)$$
To resolve the detected symbols after $n+N+1$ in (A9), $\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(0)}$ is further approximated. To this end, its expectation is taken and Jensen's inequality is applied, yielding
$$\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(0)} \approx \Big(\mathbb{E}\Big\{\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + \sum_{l=n+1}^{n+N}\tilde{\mathbf{x}}[l]\tilde{\mathbf{x}}^H[l] + \sum_{l=n+N+1}^{\mathcal{N}_{b,d}(T_b)}\hat{\mathbf{x}}[l]\hat{\mathbf{x}}^H[l] + N_0\mathbf{I}_{N_t}\Big\}\Big)^{-1} \approx \Big(\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + \sum_{l=n+1}^{n+N}\tilde{\mathbf{x}}[l]\tilde{\mathbf{x}}^H[l] + \big(\mathcal{N}_{b,d}(T_b) - (n+N-1) + N_0\big)\mathbf{I}_{N_t}\Big)^{-1}, \quad (A11)$$
where $\mathbb{E}\{\hat{\mathbf{x}}[l]\hat{\mathbf{x}}^H[l]\} \approx \mathbb{E}\{\mathbf{x}[l]\mathbf{x}^H[l]\} = \mathbf{I}_{N_t}$. Thus, by substituting (A5) and (A9) into (A3), the result in (29) is obtained, where $\mathbf{Q}_m = \mathbf{Q}_{m+1}^{(0)}$ and $\mathbf{D}_m = \mathbf{D}_{m+1}^{(0,0)}$.

References

  1. Foschini, G.J. Layered Space-Time Architecture for Wireless Communication in a Fading Environment When Using Multi-Element Antennas. Bell Labs Tech. J. 1996, 1, 41–59.
  2. Telatar, I.E. Capacity of Multi-Antenna Gaussian Channels. Eur. Trans. Telecommun. 1999, 10, 585–595.
  3. Zheng, L.; Tse, D.N.C. Diversity and Multiplexing: A Fundamental Tradeoff in Multiple-Antenna Channels. IEEE Trans. Inf. Theory 2003, 49, 1073–1096.
  4. Björnson, E.; Hoydis, J.; Sanguinetti, L. Massive MIMO Has Unlimited Capacity. IEEE Trans. Wirel. Commun. 2018, 17, 574–590.
  5. Larsson, E.G.; Edfors, O.; Tufvesson, F.; Marzetta, T.L. Massive MIMO for Next Generation Wireless Systems. IEEE Commun. Mag. 2014, 52, 186–195.
  6. Lu, L.; Li, G.; Swindlehurst, A.; Ashikhmin, A.; Zhang, R. An Overview of Massive MIMO: Benefits and Challenges. IEEE J. Sel. Top. Signal Process. 2014, 8, 742–758.
  7. Simeone, O.; Bar-Ness, Y.; Spagnolini, U. Pilot-based Channel Estimation for OFDM Systems by Tracking the Delay-Subspace. IEEE Trans. Wirel. Commun. 2004, 3, 315–325.
  8. Morelli, M.; Mengali, U. A Comparison of Pilot-Aided Channel Estimation Methods for OFDM System. IEEE Trans. Signal Process. 2001, 49, 3065–3073.
  9. Kim, H.M.; Kim, D.; Kim, T.K.; Im, G.H. Frequency Domain Channel Estimation for MIMO SC-FDMA Systems with CDM Pilots. J. Commun. Netw. 2014, 16, 447–457.
  10. Biguesh, M.; Gershman, A.B. Training-based MIMO Channel Estimation: A Study of Estimator Tradeoffs and Optimal Training Signals. IEEE Trans. Signal Process. 2006, 54, 884–893.
  11. Ozdemir, M.K.; Arslan, H. Channel Estimation for Wireless OFDM Systems. IEEE Commun. Surv. Tutor. 2007, 9, 18–48.
  12. Neumann, D.; Wiese, T.; Utschick, W. Learning the MMSE Channel Estimator. IEEE Trans. Signal Process. 2018, 66, 2905–2917.
  13. Dowler, A.; Nix, A.; McGeehan, J. Data-derived Iterative Channel Estimation with Channel Tracking for a Mobile Fourth Generation Wide Area OFDM System. In Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM), San Francisco, CA, USA, 1–5 December 2003.
  14. Le, H.A.; Van Chien, T.; Nguyen, T.H.; Choo, H.; Nguyen, V.D. Machine Learning-Based 5G-and-Beyond Channel Estimation for MIMO-OFDM Communication Systems. Sensors 2021, 21, 4861.
  15. Naeem, M.; De Pietro, G.; Coronato, A. Application of Reinforcement Learning and Deep Learning in Multiple-Input and Multiple-Output (MIMO) Systems. Sensors 2022, 22, 309.
  16. Li, X.; Wang, Q.; Yang, H.; Ma, X. Data-Aided MIMO Channel Estimation by Clustering and Reinforcement-Learning. In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC), Austin, TX, USA, 10–13 April 2022.
  17. Üçüncü, A.B.; Güvensen, G.M.; Yılmaz, A.Ö. A Reduced Complexity Ungerboeck Receiver for Quantized Wideband Massive SC-MIMO. IEEE Trans. Commun. 2021, 69, 4921–4936.
  18. Yuan, J.; Ngo, H.Q.; Matthaiou, M. Machine Learning-Based Channel Prediction in Massive MIMO with Channel Aging. IEEE Trans. Wirel. Commun. 2020, 19, 2960–2973.
  19. Zhao, M.; Shi, Z.; Reed, M.C. Iterative Turbo Channel Estimation for OFDM System over Rapid Dispersive Fading Channel. IEEE Trans. Wirel. Commun. 2008, 7, 3174–3184.
  20. Ma, J.; Ping, L. Data-Aided Channel Estimation in Large Antenna Systems. IEEE Trans. Signal Process. 2014, 62, 3111–3124.
  21. Park, S.; Shim, B.; Choi, J.W. Iterative Channel Estimation Using Virtual Pilot Signals for MIMO-OFDM Systems. IEEE Trans. Signal Process. 2015, 63, 3032–3045.
  22. Huang, C.; Liu, L.; Yuen, C.; Sun, S. Iterative Channel Estimation Using LSE and Sparse Message Passing for mmWave MIMO Systems. IEEE Trans. Signal Process. 2018, 67, 245–259.
  23. Park, S.; Choi, J.W.; Seol, J.Y.; Shim, B. Expectation-Maximization-based Channel Estimation for Multiuser MIMO Systems. IEEE Trans. Commun. 2017, 65, 2397–2410.
  24. Valenti, M.C.; Woerner, B.D. Iterative Channel Estimation and Decoding of Pilot Symbol Assisted Turbo Codes Over Flat-Fading Channels. IEEE J. Sel. Areas Commun. 2001, 19, 1697–1705.
  25. Song, S.; Singer, A.C.; Sung, K.M. Soft Input Channel Estimation for Turbo Equalization. IEEE Trans. Signal Process. 2004, 52, 2885–2894.
  26. Jeon, Y.S.; Li, J.; Tavangaran, N.; Poor, H.V. Data-Aided Channel Estimator for MIMO Systems via Reinforcement Learning. In Proceedings of the IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020.
  27. Jeon, Y.S.; Lee, N.; Poor, H.V. Robust Data Detection for MIMO Systems with One-Bit ADCs: A Reinforcement Learning Approach. IEEE Trans. Wirel. Commun. 2020, 19, 1663–1676.
  28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2018.
  29. Dong, M.; Tong, L.; Sadler, B.M. Optimal Insertion of Pilot Symbols for Transmissions over Time-Varying Flat Fading Channels. IEEE Trans. Signal Process. 2004, 52, 1403–1418.
  30. Kim, T.K.; Jeon, Y.S.; Min, M. Training Length Adaptation for Reinforcement Learning-Based Detection in Time-Varying Massive MIMO Systems With One-Bit ADCs. IEEE Trans. Veh. Technol. 2021, 70, 6999–7011.
Figure 1. Frame consisting of one pilot block with $T_p$ symbols and $N_d$ data blocks with $T_d$ symbols.
Figure 2. State–action diagrams of (a) the original MDP, where $k_n$ is the transmitted symbol index, and (b) the approximate MDP, where $\hat{k}_n$ is the detected symbol index, for $a \in \mathcal{A}$ and $S_n \in \mathcal{S}_n$.
Figure 3. System structure of the proposed data-aided channel estimator.
Figure 4. The $d$-th data block consists of $N_b$ data subblocks with $T_b$ symbols.
Figure 5. BLERs of conventional and proposed channel estimators for the different estimations.
Figure 6. BLERs of conventional and proposed channel estimators for different modulations.
Figure 7. NMSEs of the proposed channel estimator for different $T_b$ and $N_b$.
Figure 8. NMSE of the proposed channel estimator based on the number of backup samples $N$.
Figure 9. NMSEs of the proposed channel estimator for different discounting factors.
Figure 10. BLERs of the proposed channel estimators in time-varying channels.