Article

A Low-Complexity Algorithm for a Reinforcement Learning-Based Channel Estimator for MIMO Systems

Tae-Kyoung Kim 1 and Moonsik Min 2,3,*

1 Department of Electronic Engineering, Gachon University, Seongnam 13120, Korea
2 School of Electronics Engineering, Kyungpook National University, Daegu 41566, Korea
3 School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Korea
* Author to whom correspondence should be addressed.
Sensors 2022, 22(12), 4379; https://doi.org/10.3390/s22124379
Submission received: 20 May 2022 / Revised: 2 June 2022 / Accepted: 7 June 2022 / Published: 9 June 2022
(This article belongs to the Section Communications)

Abstract

This paper proposes a low-complexity algorithm for a reinforcement learning-based channel estimator for multiple-input multiple-output systems. The proposed channel estimator utilizes detected symbols to reduce the channel estimation error. However, the detected data symbols may include errors at the receiver owing to the characteristics of the wireless channels. Thus, the detected data symbols are selectively used as additional pilot symbols. To this end, a Markov decision process (MDP) problem is defined to optimize the selection of the detected data symbols. Subsequently, a reinforcement learning algorithm is developed to solve the MDP problem with computational efficiency. The developed algorithm derives the optimal policy in a closed form by introducing backup samples and data subblocks, to reduce latency and complexity. Simulations are conducted, and the results show that the proposed channel estimator significantly reduces the minimum-mean square error of the channel estimates, thus improving the block error rate compared to the conventional channel estimation.

1. Introduction

Currently, multiple-input multiple-output (MIMO) is an essential technology in wireless communications [1,2,3,4,5,6]. Multiple antennas are easy to implement in wireless systems, and their use significantly increases system reliability and capacity. However, exploiting these advantages requires perfect channel information at both the transmitter and receiver, which is generally impossible to obtain because of the characteristics of wireless channels.
Although perfect channel information is unavailable, many studies have been conducted to improve the accuracy of channel estimation [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]. These investigations are mostly based on pilots, whose information is shared by the transmitter and receiver, and employ least-squares or linear minimum-mean-square-error (LMMSE) estimation [10,11,12], because these two methods perform reasonably well at a complexity affordable for wireless systems. However, their performance strongly depends on the number of pilots, which is generally limited in wireless systems because devoting many resources to pilots degrades the spectral efficiency.
This limitation can be overcome by using data in the channel estimation, i.e., by conducting data-aided channel estimation [13,14,15,16,17,18,19,20,21]. The concept is to exploit detected data symbols as additional pilots. However, because a detected data symbol may be erroneous, its use may degrade the accuracy of the channel estimation. An iterative turbo approach is a good method to address this degradation because the improved detection performance achieved using an iterative turbo equalizer also increases the channel estimation accuracy [19,20,21,22,23,24,25]. However, the use of this approach in wireless systems is limited by its inherently high complexity and latency.
Recently, a reinforcement learning (RL) approach was introduced in [26] for data-aided channel estimation. In this approach, a Markov decision process (MDP) problem is formulated to minimize the estimation error, and an RL algorithm is used to solve it. Without any iteration, the RL solution achieves a significant improvement over conventional channel estimation. However, this solution is difficult to implement in practical systems because of the considerable complexity and latency of computing the optimal policy: the approach in [26] requires all a posteriori probabilities (APPs) in a data block to calculate the optimal policy. A further limitation is that the optimal policy is characterized only for a specific discounting factor.
In this paper, a low-complexity channel estimator using an RL approach is proposed for MIMO systems. The key concept of this estimator is the selection of the detected data symbols obtained during data detection as additional pilot symbols. To achieve this, an MDP problem is first defined to minimize the channel estimation error, where the Q-value function is generalized by a discounting factor. Subsequently, an RL solution that can be easily implemented in wireless systems is proposed. To this end, the concepts of backup samples and data subblocks are introduced, which significantly reduce the complexity and latency. The main contributions of this study are summarized as follows:
  • A data-aided channel estimator is developed to optimize the selection of detected symbols for MIMO systems. An MDP problem is defined for this selection to minimize the mean-square-error (MSE) of the channel estimates. Compared with [26], a discounting factor is introduced in the Q-value function. The discounting factor adjusts the effects of rewards after the current state.
  • A low-complexity RL algorithm is proposed. To achieve this efficiently, a data block is separated into multiple data subblocks and the optimal policy for the data subblocks is characterized. In the characterization, only partial soft information obtained from data detection is utilized to reduce the calculation latency. Unlike in [26], the optimal policy is calculated using only this partially obtained information; the remaining rewards are approximated under the assumption of perfect detection. Finally, the optimal policy is obtained using a closed-form expression. Note that the conventional RL algorithm in [26] can be employed after obtaining all soft information in a data block.
  • The performance enhancement achieved for MIMO systems using the developed RL algorithm is evaluated. Simulations are conducted, and the results demonstrate that the proposed algorithm significantly reduces the performance degradation of conventional channel estimation. Based on the simulations, the proposed channel estimator using an approximate MDP presents a similar performance to that of the original MDP. In addition, the proposed channel estimator provides robustness in time-varying channels.
The remainder of this paper is organized as follows. Section 2 introduces the signal model, including the channel estimation and data detection considered in this study. Section 3 defines an MDP problem for optimally selecting detected data symbols to minimize the channel estimation error. A low-complexity RL algorithm is proposed in Section 4. Section 5 discusses simulation results that demonstrate the effectiveness of the developed algorithm. Finally, conclusions are presented in Section 6.

Notation

Matrices $\mathbf{0}_m$ and $\mathbf{I}_m$ represent the $m \times m$ all-zero and identity matrices, respectively. The superscripts $(\cdot)^T$ and $(\cdot)^H$ denote the transpose and the conjugate transpose, respectively. Operators $\mathbb{E}(\cdot)$ and $\mathbb{P}(\cdot)$ denote the expectation of a random variable and the probability of an event, respectively. Operators $|\cdot|$ and $\|\cdot\|^2$ denote the cardinality of a set and the squared norm, respectively. Operators $(\cdot)^{-1}$, $\mathrm{Tr}(\cdot)$, and $\mathcal{CN}$ denote the inverse, the trace, and the complex normal distribution, respectively. $\mathbb{C}$ represents the set of complex numbers.

2. Signal Model

This section describes the signal model for a MIMO system. Based on the signal model, the channel estimator and data detector considered in this study are introduced.

2.1. Signal Model

A MIMO system is considered in which a transmitter with $N_t$ antennas communicates with a receiver with $N_r$ antennas through a wireless channel. The wireless channel is denoted as $\mathbf{H} \in \mathbb{C}^{N_t \times N_r}$, where each channel element $h_{t,r} \in \mathbb{C}$ between the $t$-th transmit and $r$-th receive antenna is modeled by Rayleigh fading, $h_{t,r} \sim \mathcal{CN}(0, 1)$. The transmitter sends a frame consisting of one pilot block and $N_d$ data blocks, as shown in Figure 1. During the pilot transmission, the transmitter sends a pilot symbol $\mathbf{x}_p[n] \in \mathbb{C}^{N_t \times 1}$ for $n \in \mathcal{N}_p = \{1, \ldots, T_p\}$, where $T_p$ is the pilot length. When the pilot symbol $\mathbf{x}_p[n]$ is transmitted to the receiver, the received symbol $\mathbf{y}_p[n] \in \mathbb{C}^{N_r \times 1}$ at time slot $n$ is given as
$$\mathbf{y}_p[n] = \mathbf{H}^H \mathbf{x}_p[n] + \mathbf{z}_p[n], \quad (1)$$
where $\mathbf{z}_p[n]$ is an additive white Gaussian noise (AWGN) at time slot $n$ whose distribution follows $\mathcal{CN}(\mathbf{0}_{N_r}, N_0\mathbf{I}_{N_r})$. After the pilot transmission is completed, the transmitter sends a data symbol $\mathbf{x}_d[n] \in \mathbb{C}^{N_t \times 1}$ for $n \in \mathcal{N}_d = \{(d-1)T_d + 1, \ldots, dT_d\}$, where $T_d$ is the data length. With $\mathcal{X}$ denoting the constellation set, the data symbol satisfies $\mathbf{x}_d[n] \in \mathcal{X}^{N_t}$. During the data transmission, the received symbol $\mathbf{y}_d[n] \in \mathbb{C}^{N_r \times 1}$ is expressed as
$$\mathbf{y}_d[n] = \mathbf{H}^H \mathbf{x}_d[n] + \mathbf{z}_d[n], \quad (2)$$
where $\mathbf{z}_d[n]$ is also an AWGN at time slot $n$.

2.2. Channel Estimator and Data Detector

The LMMSE channel estimator is considered in this study because of its satisfactory performance with low complexity. Using the received symbols in (1), the LMMSE channel estimator $\mathbf{W} \in \mathbb{C}^{N_t \times T_p}$ is expressed as follows:
$$\hat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \mathbb{E}\left\|\mathbf{W}(\mathbf{y}_r^p)^H - \mathbf{h}_r\right\|^2 = \left(\mathbf{X}_p\mathbf{X}_p^H + N_0\mathbf{I}_{N_t}\right)^{-1}\mathbf{X}_p, \quad (3)$$
where $\mathbf{y}_r^p$ and $\mathbf{X}_p$ are the sets of received and pilot symbols, defined as $\mathbf{y}_r^p = [y_r^p[1], \ldots, y_r^p[T_p]]$ and $\mathbf{X}_p = [\mathbf{x}_p[1], \ldots, \mathbf{x}_p[T_p]]$, respectively. Using the channel estimator in (3), a channel estimate is expressed as
$$\hat{\mathbf{h}}_r = \hat{\mathbf{W}}(\mathbf{y}_r^p)^H = \left(\mathbf{X}_p\mathbf{X}_p^H + N_0\mathbf{I}_{N_t}\right)^{-1}\mathbf{X}_p(\mathbf{y}_r^p)^H, \quad (4)$$
where $\hat{\mathbf{h}}_r$ is the $r$-th row of the channel estimate matrix $\hat{\mathbf{H}}$.
A maximum a posteriori probability (MAP) data detector is considered in this study to ensure the optimal detection performance. The APP from the MAP data detector is computed as
$$\theta_k[n] = \mathbb{P}\big(\mathbf{x}_d[n] = \mathbf{x}_k \,\big|\, \mathbf{y}_d[n]\big) = \frac{\mathbb{P}\big(\mathbf{y}_d[n] \,\big|\, \mathbf{x}_d[n] = \mathbf{x}_k\big)\,\mathbb{P}\big(\mathbf{x}_d[n] = \mathbf{x}_k\big)}{\sum_{j \in \mathcal{K}} \mathbb{P}\big(\mathbf{y}_d[n] \,\big|\, \mathbf{x}_d[n] = \mathbf{x}_j\big)\,\mathbb{P}\big(\mathbf{x}_d[n] = \mathbf{x}_j\big)}, \quad (5)$$
where $\mathbf{x}_k \in \mathcal{X}^{N_t}$ is the $k$-th possible symbol for $k \in \mathcal{K} = \{1, \ldots, |\mathcal{X}|^{N_t}\}$. In (5), the a priori probability $\mathbb{P}(\mathbf{x}_d[n] = \mathbf{x}_k)$ is assumed to be equal for all possible symbols $\mathbf{x}_k$ for $k \in \mathcal{K}$, i.e., $\mathbb{P}(\mathbf{x}_d[n] = \mathbf{x}_k) = 1/|\mathcal{X}|^{N_t}$. Meanwhile, under the AWGN assumption, the likelihood probability $\mathbb{P}(\mathbf{y}_d[n] \mid \mathbf{x}_d[n] = \mathbf{x}_k)$ in (5) can be expressed as
$$\mathbb{P}\big(\mathbf{y}_d[n] \,\big|\, \mathbf{x}_d[n] = \mathbf{x}_k\big) = \frac{1}{(\pi N_0)^{N_r}} \exp\left(-\frac{\big\|\mathbf{y}_d[n] - \hat{\mathbf{H}}^H\mathbf{x}_k\big\|^2}{N_0}\right). \quad (6)$$
The MAP data detector selects the data symbol $\hat{\mathbf{x}}[n]$ with the largest APP value at time slot $n$, given by
$$\hat{\mathbf{x}}[n] = \operatorname*{argmax}_{\mathbf{x}_k \in \mathcal{X}^{N_t}} \theta_k[n] = \operatorname*{argmax}_{\mathbf{x}_k \in \mathcal{X}^{N_t}} \mathbb{P}\big(\mathbf{y}_d[n] \,\big|\, \mathbf{x}_d[n] = \mathbf{x}_k\big). \quad (7)$$
Note that the accuracy of the detected symbol $\hat{\mathbf{x}}[n]$ depends on the accuracy of the channel estimate $\hat{\mathbf{H}}$. However, the accuracy of the channel estimate cannot be ensured in practical systems where the pilot length $T_p$ is limited. To address this limitation, this study focuses on improving the accuracy of the channel estimator.
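To make the signal model concrete, the following is a minimal NumPy sketch of the pilot-based LMMSE estimate in (3) and (4) and the MAP detection in (5)-(7). This is an illustrative reconstruction rather than the authors' code: the unit-energy 4-QAM constellation, the SNR value, and the brute-force enumeration over $\mathcal{X}^{N_t}$ are assumptions chosen to mirror the simulation setup in Section 5.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
Nt, Nr, Tp, N0 = 4, 4, 8, 0.1

# Unit-energy 4-QAM constellation and a Rayleigh-fading channel H (Nt x Nr).
X = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
H = (rng.normal(size=(Nt, Nr)) + 1j * rng.normal(size=(Nt, Nr))) / np.sqrt(2)

# Pilot phase, Eq. (1): Yp = H^H Xp + Zp.
Xp = rng.choice(X, size=(Nt, Tp))
Zp = np.sqrt(N0 / 2) * (rng.normal(size=(Nr, Tp)) + 1j * rng.normal(size=(Nr, Tp)))
Yp = H.conj().T @ Xp + Zp

# LMMSE channel estimate, Eq. (4): column r of H_hat is the estimate h_hat_r.
H_hat = np.linalg.inv(Xp @ Xp.conj().T + N0 * np.eye(Nt)) @ Xp @ Yp.conj().T

# One data symbol, Eq. (2), then MAP detection over all K = |X|^Nt candidates.
xd = rng.choice(X, size=(Nt, 1))
yd = H.conj().T @ xd + np.sqrt(N0 / 2) * (rng.normal(size=(Nr, 1)) + 1j * rng.normal(size=(Nr, 1)))
cands = np.array(list(itertools.product(X, repeat=Nt)))                 # (K x Nt)
logp = -np.linalg.norm(yd.T - cands @ H_hat.conj(), axis=1) ** 2 / N0   # Eq. (6)
theta = np.exp(logp - logp.max()); theta /= theta.sum()                 # APPs, Eq. (5)
x_hat = cands[np.argmax(theta)]                                         # Eq. (7)
```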

3. Optimization Problem

This section defines the optimization problem for the channel estimator proposed subsequently, which uses detected symbols to improve the MSE of the channel estimates. Subsequently, to solve the optimization problem, the MDP problem and the optimal policy are presented.

3.1. Optimization Problem

This study considers a channel estimator that uses the detected symbols in (7) as additional pilot symbols. However, the data detector may generate detection errors at the receiver. Consequently, the use of detected symbols with errors degrades the accuracy of the channel estimator. To overcome this problem, the detected symbols should be selectively exploited by the channel estimator.
Let $\mathbf{a} \in \{0,1\}^{T_d}$ be the action vector whose $n$-th component $a_n$ indicates whether the detected symbol at time slot $n$ of the $d$-th data block is selected, for $n \in \mathcal{N}_d$. Specifically, when $a_n = 1$, the detected symbol is used as an additional pilot symbol; otherwise, it is not used. By exploiting $\mathbf{a}$, the LMMSE channel estimate in (4) can be updated as
$$\hat{\mathbf{h}}_r(\mathbf{a}) = \big(\mathbf{X}(\mathbf{a})\mathbf{X}(\mathbf{a})^H + N_0\mathbf{I}_{N_t}\big)^{-1}\mathbf{X}(\mathbf{a})\,\bar{\mathbf{y}}_r(\mathbf{a})^H, \quad (8)$$
where $\bar{\mathbf{y}}_r(\mathbf{a}) = \big[\mathbf{y}_r^p, y_r^d[u_1(\mathbf{a})], \ldots, y_r^d[u_{\|\mathbf{a}\|_0}(\mathbf{a})]\big]$ and $\mathbf{X}(\mathbf{a}) = \big[\mathbf{X}_p, \hat{\mathbf{x}}[u_1(\mathbf{a})], \ldots, \hat{\mathbf{x}}[u_{\|\mathbf{a}\|_0}(\mathbf{a})]\big]$.
Here, $u_i(\mathbf{a})$ is the time slot index of the $i$-th nonzero element in $\mathbf{a}$. Thus, the optimization problem that maximizes the accuracy of the proposed channel estimator can be expressed as
$$\mathbf{a}^\star = \operatorname*{argmin}_{\mathbf{a} \in \{0,1\}^{T_d}} \mathbb{E}\big\{\|\hat{\mathbf{H}}(\mathbf{a}) - \mathbf{H}\|^2\big\}. \quad (9)$$
Solving the optimization problem in (9) is difficult. First, the distribution of $\hat{\mathbf{H}}(\mathbf{a})$ requires information regarding the transmitted symbols, which is generally unknown to the receiver. In addition, the number of candidate actions $\mathbf{a}$ increases exponentially with the data length $T_d$, so an exhaustive search over the actions is impractical because of the prohibitive complexity and latency at the receiver.
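As an illustration of (8), the sketch below appends the selected detected symbols to the pilots before re-running the LMMSE estimate. The helper name and the binary `mask` standing in for the action vector $\mathbf{a}$ are hypothetical, not from the paper.

```python
import numpy as np

def update_estimate(Xp, Yp, Xd_hat, Yd, mask, N0):
    """Data-aided LMMSE re-estimate, Eq. (8): detected symbols whose mask
    entry is 1 are treated as extra pilots. Xp (Nt x Tp) pilots, Yp (Nr x Tp)
    received pilots, Xd_hat (Nt x Td) detected symbols, Yd (Nr x Td) received
    data symbols, mask a length-Td {0,1} vector playing the role of a."""
    sel = np.flatnonzero(mask)                            # indices u_i(a)
    X_aug = np.concatenate([Xp, Xd_hat[:, sel]], axis=1)
    Y_aug = np.concatenate([Yp, Yd[:, sel]], axis=1)
    Nt = X_aug.shape[0]
    return np.linalg.inv(X_aug @ X_aug.conj().T + N0 * np.eye(Nt)) @ X_aug @ Y_aug.conj().T
```

The difficulty noted above is visible here: with $T_d = 64$, the mask ranges over $2^{64}$ possibilities, which is why the sequential MDP formulation of Section 3.2 is needed.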

3.2. Markov Decision Process

To efficiently solve the problem in (9), an MDP was formulated in [26] that sequentially selected detected symbols. In this formulation, a detected symbol is selected if the updated channel estimator reduces the estimation error.
Similar to [26], for this study, the state set of the MDP at time slot $n$ is expressed as
$$\mathcal{S}_n = \Big\{ \big(\mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{M}_n\big) \;\Big|\; \mathbf{X}_n = \big[\mathbf{X}_p, \mathbf{x}_{k_{\mathcal{M}_n(1)}}, \ldots, \mathbf{x}_{k_{\mathcal{M}_n(|\mathcal{M}_n|)}}\big],\; k_i \in \mathcal{K},\; \hat{\mathbf{X}}_n = \big[\mathbf{X}_p, \hat{\mathbf{x}}[\mathcal{M}_n(1)], \ldots, \hat{\mathbf{x}}[\mathcal{M}_n(|\mathcal{M}_n|)]\big],\; \mathcal{M}_n \subseteq \{T_p+1, \ldots, n-1\} \Big\}, \quad (10)$$
where $k_n$ denotes the transmitted symbol index at time slot $n$, the set $\mathcal{M}_n$ represents the time slot indices of the data symbols utilized as additional pilot symbols, and $\mathcal{M}_n(i)$ is the $i$-th smallest element of $\mathcal{M}_n$. Based on the above notations, the proposed channel estimate at state $S_n = (\mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{M}_n) \in \mathcal{S}_n$ is expressed as
$$\hat{\mathbf{h}}_r(S_n) = \big(\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + N_0\mathbf{I}_{N_t}\big)^{-1}\hat{\mathbf{X}}_n\,\bar{\mathbf{y}}_r^H(S_n), \quad (11)$$
where $\bar{\mathbf{y}}_r(S_n) = \big[\mathbf{y}_r^p, y_r^d[\mathcal{M}_n(1)], \ldots, y_r^d[\mathcal{M}_n(|\mathcal{M}_n|)]\big]$.
The action set of the MDP is $\mathcal{A} = \{0, 1\}$. An action is defined as whether to utilize the current detected symbol as an additional pilot symbol; specifically, when $a = 1 \in \mathcal{A}$, the current detected symbol is used.
Based on the state and action sets, the state transition function of the MDP for $a \in \mathcal{A}$ and $S_n \in \mathcal{S}_n$ is expressed as follows:
$$T_{n+1}^{(a,j)}(S_n) = \mathbb{P}\big(U_{n+1}^{(a,j)}(S_n) \,\big|\, S_n, a\big) = \begin{cases} \mathbb{I}\{\mathbf{x}_d[n] = \mathbf{x}_j\}, & j \in \mathcal{J}_a,\; a = 1, \\ 1, & j \in \mathcal{J}_a,\; a = 0, \end{cases} \quad (12)$$
where $\mathcal{J}_0 = \{0\}$ and $\mathcal{J}_1 = \{1, \ldots, K\}$. State $U_{n+1}^{(a,j)}(S_n) \in \mathcal{S}_{n+1}$ is the valid next state from the current state $S_n = (\mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{M}_n) \in \mathcal{S}_n$ and is expressed as
$$U_{n+1}^{(a,j)}(S_n) = \begin{cases} \big([\mathbf{X}_n, \mathbf{x}_j], [\hat{\mathbf{X}}_n, \hat{\mathbf{x}}[n]], \mathcal{M}_n \cup \{n\}\big), & j \in \mathcal{J}_a,\; a = 1, \\ \big(\mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{M}_n\big), & j \in \mathcal{J}_a,\; a = 0. \end{cases} \quad (13)$$
The reward function of the MDP is the MSE improvement between the channel estimates at the current state $S_n$ and the next state $S_{n+1}$. Thus, the reward function from $S_n \in \mathcal{S}_n$ to $S_{n+1} \in \mathcal{S}_{n+1}$ is defined as
$$R(S_n, S_{n+1}) = E_r(S_n) - E_r(S_{n+1}), \quad (14)$$
where $E_r(S_n)$ is the MSE of the channel estimate for the $r$-th receive antenna at state $S_n \in \mathcal{S}_n$, which can be computed as
$$E_r(S_n) = \mathbb{E}\big\|\hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r\big\|^2 = \mathrm{Tr}\big(\mathbf{C}_e(S_n)\big), \quad (15)$$
where the error covariance matrix is $\mathbf{C}_e(S_n) = \mathbb{E}\big\{\big(\hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r\big)\big(\hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r\big)^H\big\}$.
Here, $\mathbf{C}_e(S_n)$ is independent of the receive antenna index $r$ because the channel and noise distributions are identical across receive antennas. Thus, the reward function in (14) can be simplified as
$$R(S_n, S_{n+1}) = \mathrm{Tr}\big(\mathbf{C}_e(S_n) - \mathbf{C}_e(S_{n+1})\big). \quad (16)$$
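To build intuition for the reward in (14)-(16), the following sketch evaluates the one-step MSE reduction under the simplifying assumption that all appended training symbols are correct, in which case the error covariance reduces to $\mathbf{C}_e = N_0(\mathbf{X}\mathbf{X}^H + N_0\mathbf{I}_{N_t})^{-1}$ (this is (A5) with $\mathbf{D} = N_0\mathbf{I}_{N_t}$). The function names are illustrative.

```python
import numpy as np

def mse_trace(X_train, N0):
    """Tr(C_e) of an LMMSE estimate built from training matrix X_train (Nt x T),
    assuming the training symbols are error-free: C_e = N0 (X X^H + N0 I)^{-1}."""
    Nt = X_train.shape[0]
    C_e = N0 * np.linalg.inv(X_train @ X_train.conj().T + N0 * np.eye(Nt))
    return np.real(np.trace(C_e))

def one_step_reward(X_n, x_new, N0):
    """Reward of Eq. (16): MSE improvement from appending one symbol x_new."""
    X_next = np.concatenate([X_n, x_new.reshape(-1, 1)], axis=1)
    return mse_trace(X_n, N0) - mse_trace(X_next, N0)
```

Under this idealized assumption, the reward is always nonnegative and every symbol would be selected; the detection errors modeled through (12) and the APPs are what make the selection nontrivial.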
The optimal policy of the MDP at time slot $n$ is defined as
$$\pi(S_n) = \operatorname*{argmax}_{a \in \mathcal{A}} Q(S_n, a), \quad (17)$$
where the Q-value function $Q(S_n, a)$ is the optimal sum of the rewards. Based on the state transition function in (12), the Q-value function can be expressed as
$$Q(S_n, a) = \sum_{j \in \mathcal{J}_a} T_{n+1}^{(a,j)}(S_n)\Big[ R\big(S_n, U_{n+1}^{(a,j)}(S_n)\big) + \gamma\, V\big(U_{n+1}^{(a,j)}(S_n)\big)\Big], \quad (18)$$
where $0 \le \gamma \le 1$ is a discounting factor whose value depends on the target of the optimization problem: a small value is desirable when the accuracy of the channel estimate at the current state matters most, whereas a larger value is preferred when the accuracy at the final state matters most.
$V\big(U_{n+1}^{(a,j)}(S_n)\big)$ is the optimal sum of the future rewards. The future value function $V(S_m)$ at state $S_m \in \mathcal{S}_m$ for $m \ge n+1$ can be computed recursively as follows:
$$V(S_m) = \sum_{a \in \mathcal{A}} \pi(S_m, a) \sum_{j \in \mathcal{J}_a} T_{m+1}^{(a,j)}(S_m)\Big[ R\big(S_m, U_{m+1}^{(a,j)}(S_m)\big) + \gamma\, V\big(U_{m+1}^{(a,j)}(S_m)\big)\Big], \quad (19)$$
where $\pi(S_m, a)$ is a state–action transition function, expressed as
$$\pi(S_m, a) = \mathbb{I}\Big\{a = \operatorname*{argmax}_{a' \in \mathcal{A}} Q(S_m, a')\Big\}, \quad (20)$$
where $Q(S_m, a)$ is the Q-value function calculated as the sum of the rewards obtained after taking action $a \in \mathcal{A}$ at state $S_m \in \mathcal{S}_m$.
Using the MDP in (10), (12), and (13), the state–action diagram of the original MDP is depicted in Figure 2a. In this figure, state $S_n$ transitions to the next valid state $U_{n+1}^{(a,j)}(S_n)$ according to action $a$. In particular, when $a = 1$, state $S_n$ transitions to state $U_{n+1}^{(1,k_n)}(S_n)$ determined by the transmitted symbol index $k_n$. Based on the state and state–action transition functions in (12) and (20), the state transitions to the next valid state until the end of a data block. As previously mentioned, the original MDP shown in Figure 2a cannot be solved by dynamic programming.
First, the state and state–action transition functions are unavailable at the receiver because the transmitted symbols $\mathbf{x}_{k_n}$ and the true channel $\mathbf{H}$ are unknown. In addition, the computational complexity and latency required to solve the original MDP are extremely high because the number of states increases exponentially with the data length $T_d$.

4. Proposed RL-Based Channel Estimator

In this section, an RL-based channel estimator is proposed. To address the unknown state and state–action functions, an RL algorithm is adopted because it provides a solution for the partially observable MDP [27,28]. Based on this algorithm, a computationally efficient RL solution is also proposed. The key concept of the proposed solution is to approximate the state–action transition functions to determine the optimal policy by separating the cases using the APPs.
The overall procedure of the proposed RL-based channel estimator is illustrated in Figure 3. The proposed channel estimator exploits the information $(\hat{\mathbf{x}}[m], \theta_j[m])$ obtained from the MIMO detector. To keep the algorithm computationally efficient, the optimal policy is calculated using only the APPs $(\theta_j[n], \ldots, \theta_j[n+N])$. The channel estimate is then updated according to the optimal policy. Details of the proposed channel estimator, i.e., how the MDP is approximated and how the optimal policy is derived in closed form, are explained in this section.

4.1. Statistical State Transition

In this subsection, the state transition function in (12) at time slot $n$ is approximated using the APP $\theta_j[n]$. The basic concept, introduced in [26], is to treat the APP $\theta_j[n]$ as the probability of the event $\{\mathbf{x}[n] = \mathbf{x}_j\}$. Thus, the state transition function in (12) at time slot $n$ is approximated as follows:
$$\hat{T}_{n+1}^{(a,j)}(S_n) = \begin{cases} \theta_j[n], & j \in \mathcal{J}_a,\; a = 1, \\ 1, & j \in \mathcal{J}_a,\; a = 0, \end{cases} \quad (21)$$
where the detected symbol index at time slot $n$ is denoted as $\hat{k}_n$. Because the APP $\theta_j[n]$ can be interpreted as the probability of the event $\{\mathbf{x}[n] = \mathbf{x}_j\}$, this is called a statistical transition. In addition, as the data detection performance improves, i.e., $\theta_{k_n}[n] \to 1$, the approximate state transition function in (21) approaches the true state transition function in (12).

4.2. State–Action Transition Using Backup Samples

After time slot $m \ge n+1$, the state in (20) is assumed to transition to a virtual state that mimics the possible next states by exploiting the expected transmitted symbol $\tilde{\mathbf{x}}[m]$, defined as
$$\tilde{\mathbf{x}}[m] = \sum_{j=1}^{K} \theta_j[m]\,\mathbf{x}_j. \quad (22)$$
In this study, the expected transmitted symbol is used in the same way as in [26], except that its use is limited to $N$ backup samples to reduce the complexity. A backup sample is defined as the APP $\theta_j[m]$ for $n+1 \le m \le n+N$, because the expected transmitted symbol can be computed from $\theta_j[m]$. Thus, the Q-value function can be calculated once all $\theta_j[m]$ for $n+1 \le m \le n+N$ are obtained. Using a backup sample of an APP, the state–action transition is expressed as
$$\hat{\pi}(S_m, a) = 1. \quad (23)$$
Thus, the virtual state $\tilde{U}_m^{(a,j)}(S_n) \in \mathcal{S}_m$ that can be reached from $S_n \in \mathcal{S}_n$ is expressed as
$$\tilde{U}_m^{(a,j)}(S_n) = \big(\mathbf{X}_m^{(a,j)}, \hat{\mathbf{X}}_m^{(a)}, \mathcal{M}_m^{(a)}\big), \quad (24)$$
where its components are
$$\mathbf{X}_m^{(a,j)} = \begin{cases} \big[\mathbf{X}_n, \mathbf{x}_j, \tilde{\mathbf{x}}[n+1], \ldots, \tilde{\mathbf{x}}[n+N]\big], & a = 1, \\ \big[\mathbf{X}_n, \tilde{\mathbf{x}}[n+1], \ldots, \tilde{\mathbf{x}}[n+N]\big], & a = 0, \end{cases} \qquad \hat{\mathbf{X}}_m^{(a)} = \begin{cases} \big[\hat{\mathbf{X}}_n, \hat{\mathbf{x}}[n], \tilde{\mathbf{x}}[n+1], \ldots, \tilde{\mathbf{x}}[n+N]\big], & a = 1, \\ \big[\hat{\mathbf{X}}_n, \tilde{\mathbf{x}}[n+1], \ldots, \tilde{\mathbf{x}}[n+N]\big], & a = 0, \end{cases} \qquad \mathcal{M}_m^{(a)} = \begin{cases} \mathcal{M}_n \cup \{n, \ldots, n+N\}, & a = 1, \\ \mathcal{M}_n \cup \{n+1, \ldots, n+N\}, & a = 0. \end{cases}$$
Because a virtual state mimics the transitions to the candidate symbols, state $\tilde{U}_m^{(a,j)}(S_n) \in \mathcal{S}_m$ always transitions to the virtual state $\tilde{U}_{m+1}^{(a,j)}(S_n) \in \mathcal{S}_{m+1}$. Therefore, the corresponding state transition function is written as
$$\hat{T}_{m+1}^{(a,j)}\big(\tilde{U}_m^{(a,j)}(S_n)\big) = 1, \quad (25)$$
where $n+1 \le m \le n+N$.
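A short sketch of the backup-sample quantity: the expected transmitted symbol in (22) is simply the APP-weighted average of the candidate vectors. The variable names are illustrative.

```python
import numpy as np

def expected_symbol(theta, cands):
    """Expected transmitted symbol, Eq. (22): x_tilde = sum_j theta_j * x_j.
    theta has length K and sums to 1; cands is the (K x Nt) candidate list."""
    return cands.T @ theta

# As detection becomes reliable (one APP approaches 1), the soft symbol
# x_tilde collapses onto the corresponding hard candidate, so the virtual
# states built from backup samples approach the true future states.
```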

4.3. State–Action Transition after Backup Samples

In this subsection, the virtual states after $n+N$ that can be reached without the information of the backup samples $\theta_j[m]$ are described for $m \ge n+N+1$. To this end, the states $\hat{U}_{m+1}^{(a,j)}(S_n)$ for $m \ge n+N+1$ are assumed to act optimally when all symbols are correctly detected. By using the property $\mathbf{x}[m] = \hat{\mathbf{x}}[m]$ after time slot $n+N+1$, an approximate virtual state is expressed as
$$\hat{U}_m^{(a,j)}(S_n) = \big(\mathbf{X}_m^{(a,j)}, \hat{\mathbf{X}}_m^{(a)}, \mathcal{M}_m^{(a)}\big), \quad (26)$$
where its components are defined as
$$\mathbf{X}_m^{(a,j)} = \big[\mathbf{X}_{n+N+1}^{(a,j)}, \hat{\mathbf{x}}[n+N+1], \ldots, \hat{\mathbf{x}}[m-1]\big], \qquad \hat{\mathbf{X}}_m^{(a)} = \big[\hat{\mathbf{X}}_{n+N+1}^{(a)}, \hat{\mathbf{x}}[n+N+1], \ldots, \hat{\mathbf{x}}[m-1]\big], \qquad \mathcal{M}_m^{(a)} = \mathcal{M}_{n+N+1}^{(a)} \cup \{n+N+1, \ldots, m-1\},$$
where $\big(\mathbf{X}_{n+N+1}^{(a,j)}, \hat{\mathbf{X}}_{n+N+1}^{(a)}, \mathcal{M}_{n+N+1}^{(a)}\big)$ are the components of $\tilde{U}_{n+N+1}^{(a,j)}(S_n)$.
In Figure 2b, a state–action diagram of the approximate MDP is depicted. The original MDP requires information regarding the transmitted symbols for the state transition, as shown in Figure 2a. In contrast, the approximate MDP utilizes the virtual states $\tilde{U}_m^{(a,j)}(S_n)$ and $\hat{U}_m^{(a,j)}(S_n)$, which mimic the transitions to the candidate symbols for an unknown transmitted symbol and action. Specifically, the virtual states $\tilde{U}_m^{(a,j)}(S_n)$ and $\hat{U}_m^{(a,j)}(S_n)$ are used at time slots $n+1 \le m \le n+N$ and after time slot $n+N$, respectively. These two approximations decrease the number of transitions to the next state, so the computation required to solve the MDP is considerably reduced.

4.4. Proposed Optimal Policy

Using the approximations in (21), (23), and (24), the optimal policy can be determined. However, the calculation latency is still considerable because the optimal policy can only be computed at the end of a data block. To prevent this computational burden, the proposed solution separates a data block into $N_b$ data subblocks and subsequently characterizes the optimal policy for each data subblock, as shown in Figure 4. Based on this characterization, the state in (10) and the corresponding channel estimate in (11) are updated once per data subblock. To realize this separation, the data subblock length is defined as $T_b$, which satisfies $N_b = T_d / T_b$. Thus, the set of time slot indices of the $b$-th data subblock in the $d$-th data block is defined as $\mathcal{N}_{b,d} = \{T_p + (b-1)T_b + (d-1)T_d + 1, \ldots, T_p + bT_b + (d-1)T_d\}$ for $b \in \{1, \ldots, N_b\}$ and $d \in \{1, \ldots, N_d\}$ (see Figure 4).
Using the virtual states in (24) and (26), the Q-value function is written as
$$Q(S_n, a) = \sum_{j \in \mathcal{J}_a} T_{n+1}^{(a,j)}(S_n)\Big[ R\big(S_n, \tilde{U}_{n+1}^{(a,j)}(S_n)\big) + \sum_{m=n+1}^{n+N} \gamma^{m-n}\, R\big(\tilde{U}_m^{(a,j)}(S_n), \tilde{U}_{m+1}^{(a,j)}(S_n)\big) + \gamma^{N+1}\, V\big(\hat{U}_{n+N+1}^{(a,j)}(S_n)\big)\Big], \quad (27)$$
where the future value function $V\big(\hat{U}_{n+N+1}^{(a,j)}(S_n)\big)$ is obtained based on the approximation of $\hat{U}_m^{(a,j)}(S_n)$ as follows:
$$V\big(\hat{U}_{n+N+1}^{(a,j)}(S_n)\big) \approx R\big(\tilde{U}_{n+N+1}^{(a,j)}(S_n), \hat{U}_{n+N+2}^{(a,j)}(S_n)\big) + \sum_{m=n+N+2}^{\mathcal{N}_{b,d}(T_b)} R\big(\hat{U}_m^{(a,j)}(S_n), \hat{U}_{m+1}^{(a,j)}(S_n)\big). \quad (28)$$
In the future reward in (28), the discounting factor is set to 1 to keep the calculation simple and thereby reduce the complexity.
Based on (27) and (28), the optimal policy for each state is obtained as a closed-form expression, as described in the following theorem:
Theorem 1.
Under the virtual states and the use of backup samples, the optimal policy for the state $S_n = (\mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{M}_n) \in \mathcal{S}_n$ is
$$\pi(S_n) = \mathbb{I}\left\{ \frac{\displaystyle\sum_{m=n}^{n+N} \gamma^{m-n}(1-\gamma)\, U_m(S_n) + \gamma^{N+1}\, U_{\mathcal{N}_{b,d}(T_b)+1}(S_n)}{\displaystyle\sum_{m=n}^{n+N} \gamma^{m-n}(1-\gamma)\, L_m(S_n) + \gamma^{N+1}\, L_{\mathcal{N}_{b,d}(T_b)+1}(S_n)} \ge 1 \right\}, \quad (29)$$
where the functions $U_m(S_n)$ and $L_m(S_n)$ are respectively defined as
$$U_m(S_n) = \|\mathbf{t}_m\|^2 N_0 + N_0^2\|\mathbf{t}_m\|^2 + \|\mathbf{v}_m\|^2, \qquad L_m(S_n) = \|\mathbf{t}_m\|^2\big(2N_0^2\beta_m + \delta_m\big) + \big\|e_m\mathbf{u}_m + \mathbf{v}_m\big\|^2.$$
All components are defined as
$$\begin{aligned}
\mathbf{Q}_m &= \Big(\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + \sum_{l=n+1}^{m}\tilde{\mathbf{x}}[l]\tilde{\mathbf{x}}^H[l] + N_0\mathbf{I}_{N_t}\Big)^{-1}, & \mathbf{D}_m &= \hat{\mathbf{X}}_n\big(\hat{\mathbf{X}}_n - \mathbf{X}_n\big)^H + \sum_{l=n+1}^{m}\hat{\mathbf{x}}[l]\big(\hat{\mathbf{x}}[l] - \tilde{\mathbf{x}}[l]\big)^H + N_0\mathbf{I}_{N_t},\\
\mathbf{t}_m &= \frac{1}{1+\alpha_m}\,\mathbf{Q}_m\hat{\mathbf{x}}[n], & e_m &= \frac{1}{1+\alpha_m}\,\big\|\hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n]\big\|,\\
\mathbf{u}_m &= \mathbf{D}_m^H\mathbf{t}_m, & \mathbf{v}_m &= \mathbf{D}_m^H\mathbf{Q}_m\,\frac{\mathbf{t}_m}{\|\mathbf{t}_m\|^2},\\
\alpha_m &= \hat{\mathbf{x}}^H[n]\,\mathbf{Q}_m\,\hat{\mathbf{x}}[n], & \beta_m &= \frac{\mathbf{t}_m^H\mathbf{Q}_m\mathbf{t}_m}{\|\mathbf{t}_m\|^2},\\
\delta_m &= \frac{1}{1+\alpha_m}\Big(\sum_{j=1}^{K}\theta_j[n]\,\big\|\hat{\mathbf{x}}[n] - \mathbf{x}_j\big\|^2 - \big\|\hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n]\big\|^2\Big), & &\\
\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1} &= \Big(\mathbf{Q}_{n+N}^{-1} + \big(\mathcal{N}_{b,d}(T_b) - (n+N-1)\big)\mathbf{I}_{N_t}\Big)^{-1}, & \mathbf{D}_{\mathcal{N}_{b,d}(T_b)+1} &= \mathbf{D}_{n+N}.
\end{aligned}$$
Proof. 
See Appendix A. □

4.5. Summary: The Proposed Algorithm

The proposed channel estimator is summarized in Algorithm 1. First, the receiver initializes the state during the pilot transmission. The current state is then updated and transited to the next state according to the optimal action obtained using (29). In particular, when $a = 1$, the most probable state transition is used for the unknown transmitted symbol index; this transition approaches the true state transition as $\theta_j[n]$ approaches 1 in reliable communication. At the end of each data subblock, the proposed channel estimator updates the channel estimate using the current state $S_n$.
Algorithm 1: The proposed channel estimator.
  1  Set $\hat{\mathbf{H}} = [\hat{\mathbf{h}}_1, \ldots, \hat{\mathbf{h}}_{N_r}]$ from (4)
  2  Initialize $S_1 = (\mathbf{X}_p, \mathbf{X}_p, \emptyset)$
  3  for $d = 1$ to $N_d$ do
  4    for $b = 1$ to $N_b$ do
  5      for $n \in \mathcal{N}_{b,d}$ do
  6        Obtain $\hat{\mathbf{x}}[n]$ from (7) and $\{\theta_j[n], \ldots, \theta_j[n+N]\}$ from (5) for $j \in \mathcal{K}$
  7        Compute $a = \pi(S_n)$ from (29)
  8        Set $j = 0$ for $a = 0$ and $\mathbf{x}_j = \hat{\mathbf{x}}[n]$ for $a = 1$
  9        Update $S_{n+1} \leftarrow U_{n+1}^{(a,j)}(S_n)$ from (13)
10      end
11      Set $\hat{\mathbf{H}} = [\hat{\mathbf{h}}_1(S_n), \ldots, \hat{\mathbf{h}}_{N_r}(S_n)]$ from (11)
12    end
13  end
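The structure of Algorithm 1 can be summarized in Python as follows. This is a structural sketch only: `map_detect`, `optimal_policy`, and `update_from_state` are hypothetical helpers standing in for (5)-(7), the closed-form policy (29), and the state-based estimate (11), respectively.

```python
import numpy as np

def rl_channel_estimator(H_hat, Xp, Yp, Yd, Td, Tb, Nd, N0):
    """Structural sketch of Algorithm 1; the helper functions are placeholders."""
    Nb = Td // Tb
    state = (Xp.copy(), Xp.copy(), [])                  # line 2: S_1 = (Xp, Xp, {})
    for d in range(Nd):
        for b in range(Nb):
            for n in range(d * Td + b * Tb, d * Td + (b + 1) * Tb):
                x_hat, theta = map_detect(Yd[:, n], H_hat, N0)   # line 6
                a = optimal_policy(state, x_hat, theta, N0)      # line 7, Eq. (29)
                if a == 1:                                       # lines 8-9, Eq. (13)
                    X_n, X_hat_n, M_n = state
                    # The unknown transmitted symbol is replaced by the most
                    # probable candidate, i.e., the detected symbol x_hat.
                    state = (np.column_stack([X_n, x_hat]),
                             np.column_stack([X_hat_n, x_hat]),
                             M_n + [n])
            H_hat = update_from_state(state, Yp, Yd, N0)         # line 11, Eq. (11)
    return H_hat
```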

4.6. Complexity Analysis

In this subsection, the complexities of the proposed channel estimator and of the estimator in [26] are compared in terms of the number of states visited in the calculation of the optimal policy; the rewards in the optimal policy are computed over these states, and the per-state calculation in (29) is similar to that in [26]. When the current state is $S_n \in \mathcal{S}_n$ in the $d$-th data block, the number of visited states in [26] is exactly $dT_d - n$. By contrast, the number of visited states using the proposed channel estimator in the $b$-th data subblock is exactly $(b-1)T_b + 1 + (d-1)T_d - n$. Thus, $T_d - (b-1)T_b - 1$ states are excluded from the policy calculation by introducing the data subblocks. In addition, the proposed optimal policy can be calculated as soon as $N$ backup samples are obtained, whereas the approach in [26] must wait until the end of a data block. Thus, the latency of the optimal policy in [26] is much longer than that of the proposed optimal policy.
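For the simulation parameters of Section 5, $(T_d, T_b) = (64, 16)$, the per-slot saving derived above, $T_d - (b-1)T_b - 1$ states, works out as follows.

```python
# States excluded from each policy evaluation by the subblock scheme,
# per Section 4.6, for (Td, Tb) = (64, 16).
Td, Tb = 64, 16
for b in range(1, Td // Tb + 1):
    print(f"subblock {b}: {Td - (b - 1) * Tb - 1} fewer states visited")
# subblock 1: 63, subblock 2: 47, subblock 3: 31, subblock 4: 15
```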

5. Simulation Results

This section discusses the performance of the proposed channel estimator. The number of antennas in the MIMO system is $(N_t, N_r) = (4, 4)$. A rate-$1/2$ turbo code is adopted for channel coding, and 4-quadrature amplitude modulation (QAM) is adopted for symbol mapping. The frame is configured as $(T_p, T_d, N_d) = (8, 64, 20)$, and the proposed channel estimator uses data subblocks with $(T_b, N_b) = (16, 4)$. In addition, the parameters of the proposed channel estimator are $(N, \gamma) = (1, 0.5)$, unless specified otherwise. The per-bit signal-to-noise ratio (SNR) is defined as $E_b/N_0 = \frac{1}{\log_2|\mathcal{X}|}\frac{1}{N_0}$.
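Under the per-bit SNR definition above (unit-energy symbols assumed), the noise variance used in a simulation can be obtained as follows; the function name is illustrative.

```python
import numpy as np

def noise_variance(ebno_db, mod_order=4):
    """N0 from the definition Eb/N0 = 1 / (log2(|X|) * N0),
    i.e., N0 = 1 / (log2(|X|) * Eb/N0) with Eb/N0 in linear scale."""
    ebno_lin = 10.0 ** (ebno_db / 10.0)
    return 1.0 / (np.log2(mod_order) * ebno_lin)
```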
In all figures, the performances with perfect and imperfect channel estimates obtained using the LMMSE method are denoted as PCSI and CE, respectively. For benchmarking, two references are compared: the optimal case of the proposed channel estimator, which uses perfect knowledge of the transmitted symbols as additional pilot symbols, and the expected-symbol-based channel estimator, which uses the expected symbol in (22) as an additional pilot symbol. The performance is measured in terms of the block error rate (BLER) and the normalized MSE (NMSE). In Figure 5, the proposed channel estimator is compared with these channel estimators, and the conventional RL method of [26] is also depicted. The BLER of the proposed estimator is better than those of the conventional and expected-symbol-based estimators regardless of the per-bit SNR. Moreover, the proposed channel estimator outperforms the conventional estimator of [26] because it updates the channel estimate $N_b$ times per data block, whereas the method in [26] updates it once at the end of a data block.
Figure 6 compares the BLERs of the conventional and proposed channel estimators for different modulations. For 16-QAM, a MIMO system with $(N_t, N_r) = (2, 4)$ is considered because of the SNR range. The proposed channel estimator achieves an improved BLER compared to the conventional LMMSE channel estimator; this result demonstrates the effectiveness of optimizing the selection of detected symbols. The improvements at a BLER of $10^{-1}$ are approximately 1.2 dB and 0.7 dB for 4- and 16-QAM, respectively. Moreover, relative to the corresponding PCSI bound, the BLER of 16-QAM improves more than that of 4-QAM, because 16-QAM provides a larger number of reliably detected symbols that can be used as additional pilot symbols.
The NMSEs of the proposed channel estimator for different data subblock lengths are shown in Figure 7. The NMSE improves as $N_b$ decreases because the approximate MDP using data subblocks approaches the original MDP as $N_b$ decreases. However, as shown in Figure 7, the NMSE improvement is insignificant, whereas the complexity increases exponentially with $T_b$. Thus, $(T_b, N_b) = (16, 4)$ is used for the simulations in this study.
The NMSE of the proposed channel estimator as a function of the number of backup samples is shown in Figure 8. The NMSE improves as the number of backup samples increases because the accuracy of the state–action model improves. Even with a small value of $N$, the proposed channel estimator achieves sufficient NMSE performance. It should be noted, however, that the complexity and latency required to determine the optimal policy increase with the number of backup samples.
Figure 9 and Figure 10 show the results obtained using the proposed channel estimator in time-varying channels. Specifically, a first-order Gauss–Markov process, as used in [29,30], was adopted. In this process, the channel matrix at time slot $n$ is defined as
$$\mathbf{H}(n) = \sqrt{1-\epsilon^2}\,\mathbf{H}(n-1) + \epsilon\,\mathbf{e}(n),$$
where $n \in \mathcal{N}_{b,d}$ for $b \in \{1, 2, \ldots, N_b\}$ and $d \in \{1, 2, \ldots, N_d\}$, $\epsilon \in [0, 1]$ is a temporal correlation coefficient that depends on the velocity, and $\mathbf{H}(0)$ is the initial channel. Each element of $\mathbf{e}(n) \in \mathbb{C}^{N_t \times N_r}$ is assumed to follow $\mathcal{CN}(0, 1)$. Temporal correlation coefficients $\epsilon = 5 \times 10^{-3}$ and $\epsilon = 10^{-2}$ are used for the simulations.
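A sketch of the first-order Gauss–Markov process used here; the dimensions follow the signal model of Section 2, and the function name is illustrative.

```python
import numpy as np

def gauss_markov_channel(Nt, Nr, n_slots, eps, rng=np.random.default_rng()):
    """H(n) = sqrt(1 - eps^2) * H(n-1) + eps * e(n), with i.i.d. CN(0,1)
    entries in e(n); eps in [0, 1] controls how fast the channel decorrelates."""
    cn = lambda: (rng.normal(size=(Nt, Nr)) + 1j * rng.normal(size=(Nt, Nr))) / np.sqrt(2)
    H = cn()                      # initial channel H(0)
    channels = [H]
    for _ in range(n_slots - 1):
        H = np.sqrt(1.0 - eps ** 2) * H + eps * cn()
        channels.append(H)
    return channels
```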
Figure 9 shows the variation in the NMSE of the proposed channel estimator with the discounting factor. When the channel varies over time with $\epsilon = 5 \times 10^{-3}$, the NMSE with $\gamma = 0.1$ is better than that with $\gamma = 0.9$. This is because the rewards at future states are insignificant in time-varying channels; therefore, a small discounting factor is preferable. By contrast, when the channels are time-invariant, the rewards at future states are as important as those at the current state, so the large value $\gamma = 0.9$ improves the NMSE compared to $\gamma = 0.1$. Figure 10 compares the BLERs of the proposed and conventional channel estimators. When $\epsilon = 10^{-2}$, the BLER of the CE is severely degraded because the CE method cannot capture the channel variation. However, the proposed channel estimator is robust in time-varying channels because the channel variation can be tracked efficiently by selecting the detected symbols.

6. Conclusions

In this paper, a low-complexity algorithm for an RL-based channel estimator for MIMO systems was proposed. The proposed channel estimator adaptively selects detected symbols as additional pilot symbols to minimize the channel estimation error. In this study, an MDP problem was introduced, and a practical algorithm to solve it was developed using backup samples and data subblocks. Simulation results showed that the proposed channel estimator significantly improves the BLER and the NMSE compared to the conventional channel estimator.
A future direction of this study is to extend the RL approach to more realistic channels. The proposed method was derived for the Rayleigh fading channel, but realistic channels may include a line-of-sight component; thus, the MDP under Rician fading should be investigated. Another important direction is to develop the RL approach for frequency-selective channels, where the use of multiple subcarriers can increase the computational complexity considerably; a low-complexity algorithm is therefore necessary. Lastly, the RL approach can be extended to other advanced channel estimators, such as iterative methods, for which the MDP should be reformulated according to the channel estimator.

Author Contributions

Conceptualization, M.M. and T.-K.K.; methodology, T.-K.K.; software, M.M. and T.-K.K.; validation, M.M. and T.-K.K.; formal analysis, M.M. and T.-K.K.; investigation, M.M. and T.-K.K.; resources, T.-K.K.; data curation, T.-K.K.; writing–original draft preparation, T.-K.K.; writing–review and editing, M.M. and T.-K.K.; visualization, T.-K.K.; supervision, M.M.; project administration, M.M. and T.-K.K.; funding acquisition, M.M. and T.-K.K. All authors have read and agreed to the published version of the manuscript.

Funding

The work of M.M. was supported in part by a National Research Foundation of Korea (NRF) grant, funded by the Korea Government (MSIT) (No. 2020R1F1A1071649), and in part by the BK21 FOUR Project, funded by the Ministry of Education, Korea (4199990113966). The work of T.-K.K. was supported by a National Research Foundation of Korea (NRF) grant, funded by the Korea Government (MSIT) (No. 2021R1F1A1063273).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare that there is no conflict of interest.

Appendix A. Proof of Theorem 1

Although the basic derivation of the optimal policy is based on [26], two additional factors are considered, as presented in this appendix. First, the proposed derivation includes a discounting factor in the Q-value; thus, the intermediate rewards do not vanish, unlike in [26]. Second, a finite number of backup samples is used in the derivation; thus, the rewards that do not exploit the APPs are approximated differently from [26].
Under the assumption that the discounting factor is 1, the future value function at state $\tilde{U}_{n+N+1}^{(a,j)}(S_n) \in \mathcal{S}_{n+N+1}$ is expressed by substituting (14) into (28), as follows:
$$V\big(\tilde{U}_{n+N+1}^{(a,j)}(S_n)\big) = \mathrm{Tr}\Big[\mathbf{C}_e\big(\tilde{U}_{n+N+1}^{(a,j)}(S_n)\big) - \mathbf{C}_e\big(\hat{U}_{n+N+2}^{(a,j)}(S_n)\big) + \sum_{m=n+N+2}^{\mathcal{N}_{b,d}(T_b)}\Big(\mathbf{C}_e\big(\hat{U}_m^{(a,j)}(S_n)\big) - \mathbf{C}_e\big(\hat{U}_{m+1}^{(a,j)}(S_n)\big)\Big)\Big] = \mathrm{Tr}\Big[\mathbf{C}_e\big(\tilde{U}_{n+N+1}^{(a,j)}(S_n)\big) - \mathbf{C}_e\big(\hat{U}_{\mathcal{N}_{b,d}(T_b)+1}^{(a,j)}(S_n)\big)\Big]. \quad (A1)$$
By substituting (14) and (A1) into (27), the Q-value function can be obtained as follows:
$$Q(S_n, a) = \sum_{j \in \mathcal{J}_a} T_{n+1}^{(a,j)}(S_n)\,\mathrm{Tr}\Big[\mathbf{C}_e(S_n) + \sum_{m=n}^{n+N}\gamma^{m-n}(\gamma-1)\,\mathbf{C}_e\big(\tilde{U}_{m+1}^{(a,j)}(S_n)\big) - \gamma^{N+1}\,\mathbf{C}_e\big(\hat{U}_{\mathcal{N}_{b,d}(T_b)+1}^{(a,j)}(S_n)\big)\Big]. \quad (A2)$$
Thus, the optimal policy in (17) is expressed as
$$\pi(S_n) = \operatorname*{argmax}_{a \in \{0,1\}} Q(S_n, a) = \mathbb{I}\big\{Q(S_n, 1) - Q(S_n, 0) \ge 0\big\} = \mathbb{I}\Big\{\mathrm{Tr}\Big[\sum_{m=n}^{n+N}\gamma^{m-n}(\gamma-1)\Big(\sum_{j=1}^{K}\theta_j[n]\,\mathbf{C}_e\big(\tilde{U}_{m+1}^{(1,j)}(S_n)\big) - \mathbf{C}_e\big(\tilde{U}_{m+1}^{(0,0)}(S_n)\big)\Big) - \gamma^{N+1}\Big(\sum_{j=1}^{K}\theta_j[n]\,\mathbf{C}_e\big(\hat{U}_{\mathcal{N}_{b,d}(T_b)+1}^{(1,j)}(S_n)\big) - \mathbf{C}_e\big(\hat{U}_{\mathcal{N}_{b,d}(T_b)+1}^{(0,0)}(S_n)\big)\Big)\Big] \ge 0\Big\}. \quad (A3)$$
In (17), the optimal policy is determined by the difference between the error covariance matrices with $a = 0$ and $a = 1$. The error covariance matrices for the virtual states $\tilde{U}_m^{(a,j)}(S_n)$ and $\hat{U}_m^{(a,j)}(S_n)$ are derived as described below.

Appendix A.1. Error Covariance Calculation for $\tilde{U}_m^{(a,j)}(S_n)$

To obtain the error covariance matrix, the distribution of the received symbols $\bar{\mathbf{y}}_r^H\big(\tilde{U}_m^{(a,j)}(S_n)\big)$ in (2) is required, which is given by
$$\bar{\mathbf{y}}_r^H\big(\tilde{U}_m^{(a,j)}(S_n)\big) \sim \mathcal{CN}\Big(\mathbf{0}_{|\mathcal{M}_m^{(a)}|},\; \mathbf{X}_m^{(a,j)H}\mathbf{X}_m^{(a,j)} + N_0\mathbf{I}_{|\mathcal{M}_m^{(a)}|}\Big), \quad (A4)$$
for $j \in \mathcal{J}_a$ and $a \in \mathcal{A}$. Thus, the error covariance matrix in (A3) is computed using the result in [26], as follows:
$$\mathbf{C}_e\big(\tilde{U}_m^{(a,j)}(S_n)\big) = N_0\mathbf{Q}_m^{(a)} - N_0^2\mathbf{Q}_m^{(a)2} + \mathbf{Q}_m^{(a)}\mathbf{D}_m^{(a,j)}\mathbf{D}_m^{(a,j)H}\mathbf{Q}_m^{(a)}, \quad (A5)$$
where
$$\mathbf{Q}_m^{(a)} = \big(\hat{\mathbf{X}}_m^{(a)}\hat{\mathbf{X}}_m^{(a)H} + N_0\mathbf{I}_{N_t}\big)^{-1} \overset{(a)}{=} \begin{cases} \Big(\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + \sum_{l=n+1}^{m-1}\tilde{\mathbf{x}}[l]\tilde{\mathbf{x}}^H[l] + N_0\mathbf{I}_{N_t}\Big)^{-1}, & a = 0, \\ \Big(\mathbf{Q}_m^{(0)-1} + \hat{\mathbf{x}}[n]\hat{\mathbf{x}}^H[n]\Big)^{-1}, & a = 1, \end{cases} \qquad \mathbf{D}_m^{(a,j)} = \hat{\mathbf{X}}_m^{(a)}\big(\hat{\mathbf{X}}_m^{(a)} - \mathbf{X}_m^{(a,j)}\big)^H + N_0\mathbf{I}_{N_t} \overset{(b)}{=} \begin{cases} \hat{\mathbf{X}}_n\big(\hat{\mathbf{X}}_n - \mathbf{X}_n\big)^H + \sum_{l=n+1}^{m-1}\hat{\mathbf{x}}[l]\big(\hat{\mathbf{x}}[l] - \tilde{\mathbf{x}}[l]\big)^H + N_0\mathbf{I}_{N_t}, & j \in \mathcal{J}_a,\; a = 0, \\ \mathbf{D}_m^{(0,0)} + \hat{\mathbf{x}}[n]\big(\hat{\mathbf{x}}[n] - \mathbf{x}_j\big)^H, & j \in \mathcal{J}_a,\; a = 1. \end{cases} \quad (A6)$$
Thus, by the matrix inversion lemma, the matrix $\mathbf{Q}_m^{(1)}$ is re-expressed as
$$\mathbf{Q}_m^{(1)} = \mathbf{Q}_m^{(0)} - \frac{\mathbf{Q}_m^{(0)}\hat{\mathbf{x}}[n]\hat{\mathbf{x}}^H[n]\mathbf{Q}_m^{(0)}}{1 + \hat{\mathbf{x}}^H[n]\mathbf{Q}_m^{(0)}\hat{\mathbf{x}}[n]}. \quad (A7)$$
In addition, $\mathbf{D}_m^{(1,j)}\mathbf{D}_m^{(1,j)H}$ averaged over the APPs can be computed as
$$\sum_{j=1}^{K}\theta_j[n]\,\mathbf{D}_m^{(1,j)}\mathbf{D}_m^{(1,j)H} = \big(\mathbf{D}_m^{(0,0)} + \hat{\mathbf{d}}_n\big)\big(\mathbf{D}_m^{(0,0)} + \hat{\mathbf{d}}_n\big)^H + \hat{\delta}_n\,\hat{\mathbf{x}}[n]\hat{\mathbf{x}}^H[n], \quad (A8)$$
where $\hat{\mathbf{d}}_n = \hat{\mathbf{x}}[n]\big(\hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n]\big)^H$ and
$$\hat{\delta}_n = \sum_{j=1}^{K}\theta_j[n]\,\big\|\hat{\mathbf{x}}[n] - \mathbf{x}_j\big\|^2 - \big\|\hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n]\big\|^2.$$
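The rank-one update in (A7) is the Sherman–Morrison identity; the following self-contained check verifies it numerically on random data.

```python
import numpy as np

# Numerical check of Eq. (A7): (Q^{-1} + x x^H)^{-1} = Q - Q x x^H Q / (1 + x^H Q x).
rng = np.random.default_rng(1)
Nt = 4
B = rng.normal(size=(Nt, Nt)) + 1j * rng.normal(size=(Nt, Nt))
Q = np.linalg.inv(B @ B.conj().T + np.eye(Nt))    # a positive-definite stand-in for Q_m^{(0)}
x = rng.normal(size=(Nt, 1)) + 1j * rng.normal(size=(Nt, 1))

direct = np.linalg.inv(np.linalg.inv(Q) + x @ x.conj().T)
update = Q - (Q @ x @ x.conj().T @ Q) / (1 + (x.conj().T @ Q @ x).item().real)
assert np.allclose(direct, update)
```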

Appendix A.2. Error Covariance Calculation for $\hat{U}_m^{(a,j)}(S_n)$

Similar to the description in Appendix A.1, the error covariance matrix for $\hat{U}_m^{(a,j)}(S_n)$ can be obtained as
$$\mathbf{C}_e\big(\hat{U}_{\mathcal{N}_{b,d}(T_b)+1}^{(a,j)}(S_n)\big) = N_0\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(a)} - N_0^2\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(a)2} + \mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(a)}\mathbf{D}_{\mathcal{N}_{b,d}(T_b)+1}^{(a,j)}\mathbf{D}_{\mathcal{N}_{b,d}(T_b)+1}^{(a,j)H}\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(a)}, \quad (A9)$$
where $\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(0)}$ and $\mathbf{D}_{\mathcal{N}_{b,d}(T_b)+1}^{(0,0)}$ can be obtained from (26) as
$$\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(0)} = \Big(\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + \sum_{l=n+1}^{n+N}\tilde{\mathbf{x}}[l]\tilde{\mathbf{x}}^H[l] + \sum_{l=n+N+1}^{\mathcal{N}_{b,d}(T_b)}\hat{\mathbf{x}}[l]\hat{\mathbf{x}}^H[l] + N_0\mathbf{I}_{N_t}\Big)^{-1}, \qquad \mathbf{D}_{\mathcal{N}_{b,d}(T_b)+1}^{(0,0)} = \mathbf{D}_{n+N+1}^{(0,0)}. \quad (A10)$$
To resolve the detected symbols after $n+N+1$ in (A9), $\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(0)}$ is further approximated. To this end, its expectation is taken and Jensen's inequality is applied, yielding
$$\mathbf{Q}_{\mathcal{N}_{b,d}(T_b)+1}^{(0)} \approx \Big(\mathbb{E}\Big\{\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + \sum_{l=n+1}^{n+N}\tilde{\mathbf{x}}[l]\tilde{\mathbf{x}}^H[l] + \sum_{l=n+N+1}^{\mathcal{N}_{b,d}(T_b)}\hat{\mathbf{x}}[l]\hat{\mathbf{x}}^H[l] + N_0\mathbf{I}_{N_t}\Big\}\Big)^{-1} \approx \Big(\hat{\mathbf{X}}_n\hat{\mathbf{X}}_n^H + \sum_{l=n+1}^{n+N}\tilde{\mathbf{x}}[l]\tilde{\mathbf{x}}^H[l] + \big(\mathcal{N}_{b,d}(T_b) - (n+N-1) + N_0\big)\mathbf{I}_{N_t}\Big)^{-1}, \quad (A11)$$
where $\mathbb{E}\{\hat{\mathbf{x}}[l]\hat{\mathbf{x}}^H[l]\} \approx \mathbb{E}\{\mathbf{x}[l]\mathbf{x}^H[l]\} = \mathbf{I}_{N_t}$. Thus, by substituting (A5) and (A9) into (A3), the result in (29) is obtained, where $\mathbf{Q}_m = \mathbf{Q}_{m+1}^{(0)}$ and $\mathbf{D}_m = \mathbf{D}_{m+1}^{(0,0)}$.

References

  1. Foschini, G.J. Layered Space-Time Architecture for Wireless Communication in a Fading Environment When Using Multi-Element Antennas. Bell Labs Tech. J. 1996, 1, 41–59.
  2. Telatar, I.E. Capacity of Multi-Antenna Gaussian Channels. Eur. Trans. Telecommun. 1999, 10, 585–595.
  3. Zheng, L.; Tse, D.N.C. Diversity and Multiplexing: A Fundamental Tradeoff in Multiple-Antenna Channels. IEEE Trans. Inf. Theory 2003, 49, 1073–1096.
  4. Björnson, E.; Hoydis, J.; Sanguinetti, L. Massive MIMO Has Unlimited Capacity. IEEE Trans. Wirel. Commun. 2018, 17, 574–590.
  5. Larsson, E.G.; Edfors, O.; Tufvesson, F.; Marzetta, T.L. Massive MIMO for Next Generation Wireless Systems. IEEE Commun. Mag. 2014, 52, 186–195.
  6. Lu, L.; Li, G.; Swindlehurst, A.; Ashikhmin, A.; Zhang, R. An Overview of Massive MIMO: Benefits and Challenges. IEEE J. Sel. Top. Signal Process. 2014, 8, 742–758.
  7. Simeone, O.; Bar-Ness, Y.; Spagnolini, U. Pilot-based Channel Estimation for OFDM Systems by Tracking the Delay-Subspace. IEEE Trans. Wirel. Commun. 2004, 3, 315–325.
  8. Morelli, M.; Mengali, U. A Comparison of Pilot-Aided Channel Estimation Methods for OFDM System. IEEE Trans. Signal Process. 2001, 49, 3065–3073.
  9. Kim, H.M.; Kim, D.; Kim, T.K.; Im, G.H. Frequency Domain Channel Estimation for MIMO SC-FDMA Systems with CDM Pilots. J. Commun. Netw. 2014, 16, 447–457.
  10. Biguesh, M.; Gershman, A.B. Training-based MIMO Channel Estimation: A Study of Estimator Tradeoffs and Optimal Training Signals. IEEE Trans. Signal Process. 2006, 54, 884–893.
  11. Ozdemir, M.K.; Arslan, H. Channel Estimation for Wireless OFDM Systems. IEEE Commun. Surv. Tutor. 2007, 9, 18–48.
  12. Neumann, D.; Wiese, T.; Utschick, W. Learning the MMSE Channel Estimator. IEEE Trans. Signal Process. 2018, 66, 2905–2917.
  13. Dowler, A.; Nix, A.; McGeehan, J. Data-derived Iterative Channel Estimation with Channel Tracking for a Mobile Fourth Generation Wide Area OFDM System. In Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM), San Francisco, CA, USA, 1–5 December 2003.
  14. Le, H.A.; Van Chien, T.; Nguyen, T.H.; Choo, H.; Nguyen, V.D. Machine Learning-Based 5G-and-Beyond Channel Estimation for MIMO-OFDM Communication Systems. Sensors 2021, 21, 4861.
  15. Naeem, M.; De Pietro, G.; Coronato, A. Application of Reinforcement Learning and Deep Learning in Multiple-Input and Multiple-Output (MIMO) Systems. Sensors 2022, 22, 309.
  16. Li, X.; Wang, Q.; Yang, H.; Ma, X. Data-Aided MIMO Channel Estimation by Clustering and Reinforcement-Learning. In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC), Austin, TX, USA, 10–13 April 2022.
  17. Üçüncü, A.B.; Güvensen, G.M.; Yılmaz, A.Ö. A Reduced Complexity Ungerboeck Receiver for Quantized Wideband Massive SC-MIMO. IEEE Trans. Commun. 2021, 69, 4921–4936.
  18. Yuan, J.; Ngo, H.Q.; Matthaiou, M. Machine Learning-Based Channel Prediction in Massive MIMO with Channel Aging. IEEE Trans. Wirel. Commun. 2020, 19, 2960–2973.
  19. Zhao, M.; Shi, Z.; Reed, M.C. Iterative Turbo Channel Estimation for OFDM System over Rapid Dispersive Fading Channel. IEEE Trans. Wirel. Commun. 2008, 7, 3174–3184.
  20. Ma, J.; Ping, L. Data-Aided Channel Estimation in Large Antenna Systems. IEEE Trans. Signal Process. 2014, 62, 3111–3124.
  21. Park, S.; Shim, B.; Choi, J.W. Iterative Channel Estimation Using Virtual Pilot Signals for MIMO-OFDM Systems. IEEE Trans. Signal Process. 2015, 63, 3032–3045.
  22. Huang, C.; Liu, L.; Yuen, C.; Sun, S. Iterative Channel Estimation Using LSE and Sparse Message Passing for mmWave MIMO Systems. IEEE Trans. Signal Process. 2018, 67, 245–259.
  23. Park, S.; Choi, J.W.; Seol, J.Y.; Shim, B. Expectation-Maximization-based Channel Estimation for Multiuser MIMO Systems. IEEE Trans. Commun. 2017, 65, 2397–2410.
  24. Valenti, M.C.; Woerner, B.D. Iterative Channel Estimation and Decoding of Pilot Symbol Assisted Turbo Codes Over Flat-Fading Channels. IEEE J. Sel. Areas Commun. 2001, 19, 1697–1705.
  25. Song, S.; Singer, A.C.; Sung, K.M. Soft Input Channel Estimation for Turbo Equalization. IEEE Trans. Signal Process. 2004, 52, 2885–2894.
  26. Jeon, Y.S.; Li, J.; Tavangaran, N.; Poor, H.V. Data-Aided Channel Estimator for MIMO Systems via Reinforcement Learning. In Proceedings of the IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020.
  27. Jeon, Y.S.; Lee, N.; Poor, H.V. Robust Data Detection for MIMO Systems with One-Bit ADCs: A Reinforcement Learning Approach. IEEE Trans. Wirel. Commun. 2020, 19, 1663–1676.
  28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2018.
  29. Dong, M.; Tong, L.; Sadler, B.M. Optimal Insertion of Pilot Symbols for Transmissions over Time-Varying Flat Fading Channels. IEEE Trans. Signal Process. 2004, 52, 1403–1418.
  30. Kim, T.K.; Jeon, Y.S.; Min, M. Training Length Adaptation for Reinforcement Learning-Based Detection in Time-Varying Massive MIMO Systems With One-Bit ADCs. IEEE Trans. Veh. Technol. 2021, 70, 6999–7011.
Figure 1. Frame consisting of one pilot block with $T_p$ symbols and $N_d$ data blocks with $T_d$ symbols.
Figure 2. State–action diagrams of (a) the original MDP, where $k_n$ is the transmitted symbol index, and (b) the approximate MDP, where $\hat{k}_n$ is the detected symbol index, for $a \in \mathcal{A}$ and $S_n \in \mathcal{S}_n$.
Figure 3. System structure of the proposed data-aided channel estimator.
Figure 4. The $d$-th data block consists of $N_b$ data subblocks with $T_b$ symbols.
Figure 5. BLERs of conventional and proposed channel estimators for the different estimations.
Figure 6. BLERs of conventional and proposed channel estimators for different modulations.
Figure 7. NMSEs of the proposed channel estimator for different $T_b$ and $N_b$.
Figure 8. NMSE of the proposed channel estimator based on the number of backup samples $N$.
Figure 9. NMSEs of the proposed channel estimator for different discounting factors.
Figure 10. BLERs of the proposed channel estimators in time-varying channels.