A Low-Complexity Algorithm for a Reinforcement Learning-Based Channel Estimator for MIMO Systems

This paper proposes a low-complexity algorithm for a reinforcement learning-based channel estimator for multiple-input multiple-output systems. The proposed channel estimator utilizes detected symbols to reduce the channel estimation error. However, the detected data symbols may contain errors at the receiver owing to the characteristics of wireless channels. Thus, the detected data symbols are selectively used as additional pilot symbols. To this end, a Markov decision process (MDP) problem is defined to optimize the selection of the detected data symbols. Subsequently, a reinforcement learning algorithm is developed to solve the MDP problem with computational efficiency. The developed algorithm derives the optimal policy in a closed form by introducing backup samples and data subblocks to reduce latency and complexity. Simulations are conducted, and the results show that the proposed channel estimator significantly reduces the mean square error of the channel estimates, thus improving the block error rate compared with conventional channel estimation.


Introduction
Currently, multiple-input multiple-output (MIMO) is an essential technology in wireless communications [1][2][3][4][5][6]. Multiple antennas are easy to implement in wireless systems, and their use significantly increases system reliability and capacity. However, to fully realize these advantages, perfect channel information is required at both the transmitter and receiver, which is generally unattainable because of the characteristics of wireless channels.
Although perfect channel information is unavailable, many studies have been conducted to improve the accuracy of channel estimation [7][8][9][10][11][12][13][14][15][16][17][18][19][20][21]. These investigations were mostly based on pilots, whose information is shared by the transmitter and receiver, and employed least-squares and linear minimum-mean square-error (LMMSE) estimation [10][11][12], because these two methods perform reasonably well with affordable complexity for wireless systems. However, their performance strongly depends on the number of pilots, which is generally limited in wireless systems because dedicating many resources to pilots degrades the spectral efficiency.
This limitation can be overcome by using data in channel estimation, i.e., by conducting data-aided channel estimation [13][14][15][16][17][18][19][20][21]. The concept is to exploit a detected data symbol as an additional pilot. Because a detected data symbol may contain an error, it may degrade the accuracy of the channel estimation. An iterative turbo approach is a good method to address this degradation because the improved detection performance achieved using an iterative turbo equalizer also increases the channel estimation accuracy [19][20][21][22][23][24][25]. However, the use of this iterative turbo approach is limited in wireless systems because of its inherent high complexity and latency. The main contributions of this study are as follows:
• A data-aided channel estimator is developed to optimize the selection of detected symbols for MIMO systems. An MDP problem is defined for this selection to minimize the mean-square-error (MSE) of the channel estimates. Compared with [26], a discounting factor is introduced in the Q-value function; it adjusts the effects of rewards after the current state.
• A low-complexity reinforcement learning (RL) algorithm is proposed. To achieve this efficiently, a data block is separated into multiple data subblocks, and the optimal policy for the data subblocks is characterized. In this characterization, only partial soft information obtained from data detection is utilized to reduce the calculation latency. Unlike in [26], the optimal policy is calculated using only this partially obtained information; the remaining rewards are approximated under the assumption of perfect detection. Finally, the optimal policy is obtained in a closed-form expression. Note that the conventional RL algorithm in [26] can be employed only after obtaining all soft information in a data block.
• The performance enhancement achieved for MIMO systems using the developed RL algorithm is evaluated.
Simulations are conducted, and the results demonstrate that the proposed algorithm significantly reduces the performance degradation of conventional channel estimation. Based on the simulations, the proposed channel estimator using an approximate MDP presents a similar performance to that of the original MDP. In addition, the proposed channel estimator provides robustness in time-varying channels.
The remainder of this paper is organized as follows. Section 2 introduces a signal model including the channel estimation and data detection considered in this study. In Section 3, an MDP problem to select detected data symbols optimally to minimize the channel estimation error is defined. A low-complexity RL algorithm is proposed in Section 4. In Section 5, simulation results are discussed, to demonstrate the effectiveness of the developed algorithm. Finally, conclusions are presented in Section 6.

Notation
Matrices 0_m and I_m represent the m × m all-zero and identity matrices, respectively. The superscripts (·)^T and (·)^H denote the transpose and the conjugate transpose, respectively. Operators E(·) and P(·) denote the expectation of a random variable and the probability of an event, respectively. Operators |·| and ‖·‖ denote the cardinality of a set and the norm, respectively. Operators (·)^{-1}, Tr(·), and CN denote the inverse, the trace, and the complex normal distribution, respectively. Set C represents the set of complex numbers.

Signal Model
This section describes the signal model for a MIMO system. Based on the signal model, the channel estimator and data detector considered in this study are introduced.

Signal Model
A MIMO system is considered in which a transmitter with N_t antennas communicates with a receiver with N_r antennas through a wireless channel. The wireless channel is denoted as H ∈ C^{N_t×N_r}, where each channel element h_{t,r} ∈ C between the t-th transmit and r-th receive antenna is modeled by Rayleigh fading, h_{t,r} ∼ CN(0, 1). The transmitter sends a frame consisting of one pilot block and N_d data blocks, as shown in Figure 1. During the pilot transmission, the transmitter sends a pilot symbol x_p[n] ∈ C^{N_t×1} for n ∈ N_p = {1, . . . , T_p}, where T_p is the pilot length. When the pilot symbol x_p[n] is transmitted, the received symbol y_p[n] ∈ C^{N_r×1} at time slot n is given as

y_p[n] = H^H x_p[n] + z_p[n], (1)

where z_p[n] is additive white Gaussian noise (AWGN) at time slot n whose distribution follows CN(0_{N_r}, N_0 I_{N_r}). After the pilot transmission is completed, the transmitter sends a data symbol x_d[n] ∈ C^{N_t×1} for n ∈ N_d = {1, . . . , T_d}, where T_d is the data length. After the data transmission, the received symbol y_d[n] ∈ C^{N_r×1} is expressed as

y_d[n] = H^H x_d[n] + z_d[n], (2)

where z_d[n] is also AWGN at time slot n.
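The signal model above can be simulated directly. The following sketch draws a Rayleigh-fading channel and generates the received pilot block of (1); the pilot alphabet (QPSK) and the numerical values of N_t, N_r, T_p, and N_0 are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_t, N_r, T_p, N0 = 4, 4, 8, 0.1  # assumed values for illustration

# Rayleigh-fading channel H in C^{N_t x N_r}: each h_{t,r} ~ CN(0, 1).
H = (rng.standard_normal((N_t, N_r)) + 1j * rng.standard_normal((N_t, N_r))) / np.sqrt(2)

# Hypothetical unit-power QPSK pilot symbols x_p[n] stacked into an N_t x T_p matrix.
X_p = (rng.choice([-1.0, 1.0], (N_t, T_p)) + 1j * rng.choice([-1.0, 1.0], (N_t, T_p))) / np.sqrt(2)

# Received pilots per (1): y_p[n] = H^H x_p[n] + z_p[n], with z_p[n] ~ CN(0, N0 I).
Z = np.sqrt(N0 / 2) * (rng.standard_normal((N_r, T_p)) + 1j * rng.standard_normal((N_r, T_p)))
Y_p = H.conj().T @ X_p + Z
print(Y_p.shape)
```

Each column of `Y_p` is one received pilot symbol y_p[n] ∈ C^{N_r×1}.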

Channel Estimator and Data Detector
The LMMSE channel estimator is considered in this study because of its satisfactory performance with low complexity. Using the received symbols in (1), the LMMSE channel estimator W ∈ C^{N_t×T_p} is expressed as in (3), where y_r^p and X_p are the collections of the received and pilot symbols, defined as y_r^p = [y_r^p[1], · · · , y_r^p[T_p]]^T and X_p = [x_p[1], · · · , x_p[T_p]], respectively. Using the channel estimator in (3), a channel estimate is expressed as in (4), where ĥ_r is the r-th row of the channel estimate matrix Ĥ.
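As a numerical illustration of pilot-based LMMSE estimation, the sketch below stacks the pilot observations into a linear model and applies the standard LMMSE solution under a unit-variance channel prior. This is a generic LMMSE sketch consistent with the setup above, not a transcription of the paper's exact equations (3)-(4); all numerical values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N_t, N_r, T_p, N0 = 4, 4, 8, 0.01  # assumed values

H = (rng.standard_normal((N_t, N_r)) + 1j * rng.standard_normal((N_t, N_r))) / np.sqrt(2)
X_p = (rng.choice([-1.0, 1.0], (N_t, T_p)) + 1j * rng.choice([-1.0, 1.0], (N_t, T_p))) / np.sqrt(2)
Z = np.sqrt(N0 / 2) * (rng.standard_normal((N_r, T_p)) + 1j * rng.standard_normal((N_r, T_p)))
Y_p = H.conj().T @ X_p + Z  # pilot observations, as in (1)

# Rewrite the observations as G = S H + N with S = X_p^H, G = Y_p^H, then apply
# the LMMSE estimate under an i.i.d. CN(0,1) channel prior:
#   H_hat = (S^H S + N0 I)^{-1} S^H G.
S = X_p.conj().T
G = Y_p.conj().T
H_hat = np.linalg.solve(S.conj().T @ S + N0 * np.eye(N_t), S.conj().T @ G)

nmse = np.linalg.norm(H_hat - H) ** 2 / np.linalg.norm(H) ** 2
print(f"NMSE = {nmse:.4f}")
```

With T_p = 8 pilots and low noise, the normalized MSE is small; shortening the pilot block degrades it, which is the limitation the data-aided estimator targets.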
A maximum a posteriori probability (MAP) data detector is considered in this study to ensure optimal detection performance. The a posteriori probability (APP) from the MAP data detector is computed as

P(x_d[n] = x_k | y_d[n]) = P(y_d[n] | x_d[n] = x_k) P(x_d[n] = x_k) / Σ_{k'∈K} P(y_d[n] | x_d[n] = x_{k'}) P(x_d[n] = x_{k'}), (5)

where x_k ∈ X^{N_t} is the k-th possible symbol vector for k ∈ K = {1, . . . , |X|^{N_t}}. In (5), the a priori probability P(x_d[n] = x_k) is assumed to be equal for all possible symbols x_k, k ∈ K. Concurrently, under the AWGN assumption, the likelihood probability P(y_d[n] | x_d[n] = x_k) in (5) can be expressed as

P(y_d[n] | x_d[n] = x_k) ∝ exp(−‖y_d[n] − Ĥ^H x_k‖² / N_0). (6)

The MAP data detector detects the data symbol x̂[n] that has the largest APP value at time slot n, given by

x̂[n] = argmax_{x_k, k∈K} P(x_d[n] = x_k | y_d[n]). (7)

Note that the accuracy of the detected symbol x̂[n] depends on the accuracy of the channel estimate Ĥ. However, this accuracy cannot be ensured in practical systems where the pilot length T_p is limited. To address this limitation, this study focuses on improving the accuracy of the channel estimator.
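The MAP detection step in (5)-(7) can be sketched by enumerating all |X|^{N_t} candidate vectors and normalizing the Gaussian likelihoods into APPs. The small dimensions, QPSK alphabet, and noise level below are assumptions chosen so the enumeration stays cheap; for clarity the true channel stands in for the estimate Ĥ.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
N_t, N_r, N0 = 2, 2, 0.05  # assumed; |X|^{N_t} = 16 candidates

# QPSK alphabet X; candidates enumerate X^{N_t}.
alphabet = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
candidates = np.array(list(product(alphabet, repeat=N_t)))  # shape (16, N_t)

H = (rng.standard_normal((N_t, N_r)) + 1j * rng.standard_normal((N_t, N_r))) / np.sqrt(2)
x_true = candidates[5]
y = H.conj().T @ x_true + np.sqrt(N0 / 2) * (rng.standard_normal(N_r) + 1j * rng.standard_normal(N_r))

# With a uniform prior, the APP (5) is the normalized likelihood (6):
# theta_k ~ exp(-||y - H^H x_k||^2 / N0).
metrics = np.array([np.exp(-np.linalg.norm(y - H.conj().T @ x) ** 2 / N0) for x in candidates])
app = metrics / metrics.sum()  # APPs theta_k[n]
k_hat = int(np.argmax(app))    # detected index, as in (7)
```

The APP vector `app` is exactly the soft information θ_j[n] that the proposed estimator later reuses to decide whether a detected symbol is reliable enough to serve as an extra pilot.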

Optimization Problem
This section defines the optimization problem for the channel estimator proposed later, which uses detected symbols to improve the MSE of the channel estimates. Then, to solve the optimization problem, the MDP problem and the optimal policy are presented.

Optimization Problem
This study considers a channel estimator that uses the detected symbols in (7) as additional pilot symbols. However, the data detector may generate detection errors at the receiver. Consequently, the use of detected symbols with errors degrades the accuracy of the channel estimator. To overcome this problem, the detected symbols should be selectively exploited by the channel estimator.
Let a ∈ {0, 1}^{T_d} be the vector of actions whose n-th component a_n indicates the selection of the detected symbol at time slot n of the d-th data block, for n ∈ N_d. Specifically, when a_n = 1, the detected symbol is used as an additional pilot symbol; otherwise, it is not used. By exploiting a, the LMMSE channel estimate in (4) can be updated as in (8), where u_i(a) is the time slot index of the i-th nonzero element in a. Thus, the optimization problem that maximizes the accuracy of the proposed channel estimator can be expressed as in (9). Solving the optimization problem in (9) is difficult. First, the distribution of Ĥ(a) requires information regarding the transmitted symbols, which is generally unknown to the receiver. In addition, the number of candidate actions a increases exponentially with the data length T_d. Accordingly, an exhaustive search over these actions is impractical because of its prohibitive complexity and latency at the receiver.
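The exponential blow-up that rules out exhaustive search is easy to make concrete: the action space {0, 1}^{T_d} doubles with every data symbol, so for the T_d = 64 used later there are 2^64 candidate selection vectors. A toy count:

```python
from itertools import product

# Number of candidate action vectors a in {0,1}^{T_d} for a few block lengths.
for T_d in (4, 8, 16, 64):
    print(T_d, 2 ** T_d)

# Even a tiny block of T_d = 4 already yields 2^4 = 16 candidates to evaluate.
actions = list(product((0, 1), repeat=4))
```

At T_d = 64 the count (about 1.8 × 10^19) makes per-block enumeration hopeless, which motivates the sequential MDP formulation that follows.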

Markov Decision Process
To efficiently solve the problem in (9), an MDP was formulated in [26] that sequentially selected detected symbols. In this formulation, a detected symbol is selected if the updated channel estimator reduces the estimation error.
Similar to [26], for this study, the state set of the MDP at time slot n is expressed as in (10), where k_n denotes the transmitted symbol index at time slot n. Set M_n represents the set of time slot indices of the data symbols to be utilized as additional pilot symbols, and M_n(i) is the i-th smallest element of M_n. Based on the above notations, the proposed channel estimate at state S_n = (X_n, X̂_n, M_n) ∈ S_n is expressed as in (11). The action set of the MDP is expressed as A = {0, 1}. An action is defined as whether to utilize the current detected symbol as an additional pilot symbol. Specifically, when a = 1 ∈ A, the current detected symbol is used as an additional pilot symbol.
Based on the state and action sets, the state transition function of the MDP for a ∈ A and S_n ∈ S_n is expressed as in (12), where U^{(a,j)}_{n+1}(S_n) ∈ S_{n+1} is the valid state reached from the current state S_n = (X_n, X̂_n, M_n) ∈ S_n and is expressed as in (13). The reward function of the MDP is the MSE improvement between the channel estimates at the current state S_n and the next state S_{n+1}. Thus, the reward function from S_n ∈ S_n to S_{n+1} ∈ S_{n+1} is defined as in (14), where E_r(S_n) is the MSE of the channel estimate for the r-th receive antenna at state S_n ∈ S_n, which can be computed as in (15) with the error covariance matrix C_e(S_n) defined in (16). Here, C_e(S_n) is independent of the receive antenna index r because the channel and noise distributions are identical across receive antennas. Thus, the reward function in (14) can be simplified as in (17). The optimal policy of the MDP at time slot n is defined as in (18), where the Q-value function Q(S_n, a) is the optimal sum of the rewards. Based on the state transition function in (12), the Q-value function can be expressed as in (19), where 0 ≤ γ ≤ 1 is a discounting factor whose value depends on the target of the optimization problem. For example, a small value is desirable when the accuracy of the channel estimator at the current state matters most. In contrast, a larger value is preferred when the accuracy of the channel estimator at the ending state matters most.
In (19), V(S_{n+1}) is the optimal sum of the future rewards. The future value function V(S_m) at state S_m ∈ S_m for n + 1 ≤ m can be recursively computed as in (20), where π(S_m, a) is a state-action transition function and Q(S_m, a) is the Q-value function calculated as the sum of the rewards obtained after taking action a ∈ A at state S_m ∈ S_m. Using the MDP in (10), (12), and (13), the state-action diagram of the original MDP is depicted in Figure 2a. In this figure, state S_n transits to the next valid state, U^{(a,j)}_{n+1}(S_n), based on action a. In particular, when a = 1, state S_n transits to state U^{(1,k_n)}_{n+1}(S_n) by utilizing the transmitted symbol index k_n. Based on the state and state-action transition functions in (12) and (20), the state transits to the next valid state until the end of a data block. As previously mentioned, however, the original MDP shown in Figure 2a cannot be solved directly at the receiver, where k̂_n denotes the detected symbol index for a ∈ A and S_n ∈ S_n. First, the state and state-action functions are unavailable to the receiver because the information of the transmitted symbols, x_{k_n}, and the true channel information, H, are unknown. In addition, the computational complexity and latency required to solve the original MDP are extremely high because the number of states increases exponentially with the data length T_d.
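The finite-horizon Bellman recursion behind (18)-(20) can be illustrated with a deliberately simplified stand-in: here the state is just the number of symbols selected so far, and the reward for selecting (a = 1) is a hypothetical diminishing value playing the role of the MSE improvement in (17). This is a toy model of the recursion's structure, not the paper's actual reward or transition functions.

```python
from functools import lru_cache

T, gamma = 6, 0.5  # toy horizon and discounting factor

def reward(count, a):
    # Hypothetical stand-in for the MSE-improvement reward (17):
    # each additional pilot helps, but with diminishing returns.
    return 1.0 / (count + 1) if a == 1 else 0.0

@lru_cache(maxsize=None)
def V(n, count):
    """Backward recursion V(n) = max_a [ r + gamma * V(n+1) ], as in (19)-(20)."""
    if n == T:
        return 0.0
    return max(reward(count, a) + gamma * V(n + 1, count + a) for a in (0, 1))

print(round(V(0, 0), 4))
```

In this toy instance selecting at every step is optimal, so V(0, 0) equals the discounted sum of the per-step rewards; the paper's contribution is making an analogous recursion tractable when the true transitions depend on unknown transmitted symbols.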

Proposed RL-Based Channel Estimator
In this section, an RL-based channel estimator is proposed. To address the unknown state and state-action functions, an RL algorithm is adopted because it provides a solution for the partially observable MDP [27,28]. Based on this algorithm, a computationally efficient RL solution is also proposed. The key concept of the proposed solution is to approximate the state-action transition functions to determine the optimal policy by separating the cases using the APPs.
The overall procedure of the proposed RL-based channel estimator is illustrated in Figure 3. The proposed channel estimator exploits the information (x̂[m], θ_j[m]) obtained from the MIMO detector. In the proposed channel estimator, the optimal policy is calculated using only N APPs (θ_j[n], . . . , θ_j[n + N]) for a computationally efficient algorithm. The channel estimate is then updated according to the optimal policy. Details of the proposed channel estimator, i.e., how to approximate the MDP and how to derive the optimal policy in a closed form, are explained in this section.

Statistical State Transition
In this subsection, the state transition function in (12) at time slot n is approximated using the APP θ_j[n]. The basic concept was introduced in [26] by treating the APP θ_j[n] as the probability of the event {x[n] = x_j}. Thus, the state transition function in (12) at time slot n is approximated as in (21), where the detected symbol index at time slot n is denoted as k̂_n. Because the APP θ_j[n] can be interpreted as the probability of the event {x[n] = x_j}, this is called a statistical transition. In addition, when the data detection performance improves, i.e., θ_{k_n}[n] → 1, the approximate state transition function in (21) approaches the true state transition function in (12).

State-Action Transition Using Backup Samples
After time slot n + 1 ≤ m, the state in (20) is assumed to transit to a virtual state that mimics the possible next states by exploiting the expected transmitted symbol x̄[m], which is defined as in (22). In this study, the use of the expected transmitted symbol is the same as in [26], except that it is limited to N backup samples to reduce the complexity. A backup sample is defined as the APP θ_j[m] for n + 1 ≤ m ≤ n + N because the expected transmitted symbol can be computed from θ_j[m]. Thus, the Q-value function can be calculated after all θ_j[m] for n + 1 ≤ m ≤ n + N are obtained. Using a backup sample of an APP, the state-action transition is expressed as in (23). Thus, the virtual state Ũ^{(a,j)}_m(S_n) ∈ S_m that can be reached from S_n ∈ S_n is expressed as in (24), together with its components. Because a virtual state mimics the transitions to the candidate symbols, state Ũ^{(a,j)}_m(S_n) ∈ S_m always transits to the virtual state Ũ^{(a,j)}_{m+1}(S_n) ∈ S_{m+1}. Therefore, the corresponding state transition function is written accordingly for n + 1 ≤ m ≤ n + N.
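The expected transmitted symbol of (22) is simply the APP-weighted mixture of the candidate symbols, computable from a single backup sample θ_j[m]. The sketch below uses a per-antenna QPSK alphabet and hypothetical APP values to show the computation.

```python
import numpy as np

# Per-antenna QPSK alphabet (assumed) and hypothetical APPs theta_j[m]
# from one backup sample; the APPs must sum to one.
alphabet = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
theta = np.array([0.7, 0.1, 0.1, 0.1])

# Expected transmitted symbol (22): x_bar = sum_j theta_j * x_j.
x_bar = theta @ alphabet
print(x_bar)
```

When the dominant APP approaches 1 the mixture collapses onto the detected symbol; when the APPs are spread out, x̄[m] shrinks toward the origin, naturally down-weighting unreliable slots.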

State-Action Transition after Backup Samples
In this subsection, the virtual states after time slot n + N, which can be reached without the information of the backup samples θ_j[m], are described for n + N + 1 ≤ m. To achieve this, the states Û^{(a,j)}_m(S_n) are introduced, whose components are defined as in (26). In Figure 2b, the state-action diagram of the approximate MDP is depicted. The original MDP requires information regarding the transmitted symbols for the state transition, as shown in Figure 2a. In contrast, the approximate MDP utilizes the virtual states Ũ^{(a,j)}_m(S_n) and Û^{(a,j)}_m(S_n), which mimic the transitions to the candidate symbols for an unknown transmitted symbol and action. Specifically, virtual state Ũ^{(a,j)}_m(S_n) is used at time slots n + 1 ≤ m ≤ n + N, and Û^{(a,j)}_m(S_n) is used after time slot n + N. These two approximations decrease the number of state transitions, so the calculation required to solve the MDP is considerably reduced.

Proposed Optimal Policy
Using the approximations in (21), (23), and (24), the optimal policy can be determined. However, the calculation latency is still considerable because the optimal policy can only be computed at the end of a data block. To prevent this computational burden, the proposed solution separates a data block into N_b data subblocks and subsequently characterizes the optimal policy for each data subblock, as shown in Figure 4. Based on this characterization, the state in (10) and the corresponding channel estimate in (11) are updated for each data subblock. To realize this separation, the data subblock length is defined as T_b, which satisfies N_b = T_d/T_b; a set of time slot indices of the b-th data subblock in the d-th data block is defined accordingly (see Figure 4). Using the virtual states in (24) and (26), the Q-value function is written as in (27), where the future value function V(Û^{(a,j)}_{n+N+1}(S_n)) is obtained based on the approximation of Û^{(a,j)}_m(S_n) as in (28). In the future reward in (28), the discounting factor is assumed to be 1 to reduce the complexity through a simple calculation.
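The subblock separation itself is a simple index partition. Using the simulation values quoted later (T_d = 64, T_b = 16, hence N_b = 4), the time slots of a data block split as:

```python
# Partition a data block of length T_d into N_b contiguous subblocks of
# length T_b = T_d / N_b, as in Figure 4 (T_d = 64, T_b = 16 from the
# simulation setup; 0-based slot indices are used here for convenience).
T_d, T_b = 64, 16
N_b = T_d // T_b
subblocks = [list(range(b * T_b, (b + 1) * T_b)) for b in range(N_b)]
```

The policy, state, and channel estimate are then refreshed once per subblock rather than once per block, which is the source of both the latency reduction and the BLER gain over [26] reported in the simulations.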
Based on (27) and (28), the optimal policy for each state is obtained as a closed-form expression, as described in the following theorem.

Theorem 1. Under the virtual states and the use of backup samples, the optimal policy for the state S_n = (X_n, X̂_n, M_n) ∈ S_n is given by (29), where functions U_m(S_n) and L_m(S_n), together with all their components, are defined as follows.

Proof. See Appendix A.

Summary: The Proposed Algorithm
The proposed channel estimator is summarized in Algorithm 1. First, the receiver initializes the state during pilot transmission. In this algorithm, the current state is updated and transited to the next state according to the optimal action obtained using (29). For example, the most probable state transition is used when a = 1 for the unknown transmitted symbol index. This transition approaches the true state transition as θ_j[n] approaches 1 in reliable communication. At the end of a data subblock, the proposed channel estimator updates the channel estimate using the current state S_n.
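The control flow of Algorithm 1 can be sketched as a per-subblock loop. The helpers `detect`, `optimal_action`, and `update_channel_estimate` are placeholders standing in for the MAP detector (7), the closed-form policy (29), and the LMMSE update (11); their toy bodies below are illustrative only, not the paper's actual computations.

```python
def run_subblock(H_hat, received, detect, optimal_action, update_channel_estimate):
    """One data subblock of Algorithm 1, with stand-in helpers (see lead-in)."""
    selected = []                          # plays the role of M_n: slots kept as extra pilots
    for n, y in enumerate(received):
        x_hat, app = detect(y, H_hat)      # detected symbol and its APP theta[n]
        if optimal_action(app, selected):  # stand-in for the closed-form policy (29)
            selected.append((n, x_hat))    # state update: append slot to M_n
    # channel estimate refreshed once per subblock, not per slot
    return update_channel_estimate(H_hat, selected), selected

# Toy stand-ins: keep a detected symbol only when its APP is high.
detect = lambda y, H: (y, 0.9 if y > 0 else 0.4)
policy = lambda app, sel: app > 0.5
update = lambda H, sel: H + len(sel)       # placeholder for the LMMSE update (11)
H_new, sel = run_subblock(0, [1.0, -1.0, 2.0], detect, policy, update)
```

With these stand-ins, the slots with high-APP detections (here slots 0 and 2) are selected and the "estimate" is refreshed once at subblock end, mirroring the algorithm's structure.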

Complexity Analysis
In this subsection, the complexity of the proposed channel estimator and of that in [26] is discussed based on the number of states visited in the calculation of the optimal policy. This is because the rewards in the optimal policy are computed over the states, and the per-state calculation in (29) is similar to that in [26]. First, when the current state is S_n ∈ S_n in the d-th data block, the number of visited states in [26] is exactly dT_d − n. By contrast, the number of states visited by the proposed channel estimator in the b-th data subblock is much smaller, because the remainder of the data block is not used in the policy calculation once the data subblocks are introduced. In addition to the complexity, the proposed optimal policy can be calculated after obtaining only N backup samples, whereas the approach in [26] must wait until the end of a data block. Thus, the latency of the optimal policy in [26] is much longer than that of the proposed optimal policy.

Simulation Results
This section discusses the performance of the proposed channel estimator. The number of antennas in the MIMO systems is (N_t, N_r) = (4, 4). A rate-1/2 turbo code is adopted for channel coding, and 4-quadrature amplitude modulation (QAM) is adopted for symbol mapping. The frame is configured as (T_p, T_d, N_d) = (8, 64, 20), and the proposed channel estimator uses data subblocks with (T_b, N_b) = (16, 4). In addition, the parameters of the proposed channel estimator are (N, γ) = (1, 0.5), unless specified otherwise. The per-bit signal-to-noise ratio (SNR) is defined as E_b/N_0. In all figures, the performance with perfect and imperfect channel estimates using the LMMSE method is denoted as PCSI and CE, respectively. For performance benchmarking, two references are compared: the optimal case of the proposed channel estimator, which utilizes perfect knowledge of the transmitted symbols, and the expected-symbol-based channel estimator, which utilizes the expected symbol in (22) as an additional pilot symbol. The performance is measured in terms of the block error rate (BLER) and the normalized MSE (NMSE).

In Figure 5, the proposed channel estimator is compared with the other channel estimators, including the conventional RL method of [26]. The BLER of the proposed estimator is better than those of the conventional and expected-symbol-based estimators regardless of the per-bit SNR. Moreover, the proposed channel estimator outperforms the conventional estimator of [26]. This is because the proposed channel estimator updates the channel estimate N_b times in a data block, whereas the method in [26] updates it once at the end of a data block.

Figure 6 compares the BLERs of the conventional and proposed channel estimators for different modulations. For 16-QAM, a MIMO system with (N_t, N_r) = (2, 4) is considered because of the SNR range. The proposed channel estimator achieves an improved BLER compared with the conventional LMMSE channel estimator.
This result demonstrates the effectiveness of the proposed channel estimator, which optimizes the selection of detected symbols. The improvements in reaching a BLER of 10^{-1} are approximately 1.2 dB and 0.7 dB for 4- and 16-QAM, respectively. The BLER gap to the PCSI closes more for 16-QAM than for 4-QAM. This is because in 16-QAM, the number of reliable detected symbols that can be used as additional pilot symbols is larger than in 4-QAM.

The NMSEs of the proposed channel estimator for different data subblock lengths are shown in Figure 7. The NMSE improves as N_b decreases because the approximate MDP using data subblocks approaches the original MDP as N_b decreases. However, as shown in Figure 7, the NMSE improvement is insignificant, whereas the complexity increases exponentially with T_b. Thus, (T_b, N_b) = (16, 4) is considered in this study for the simulations.

The NMSE of the proposed channel estimator versus the number of backup samples is shown in Figure 8. Noticeably, the NMSE improves as the number of backup samples increases, because the accuracy of the state-action diagram model improves. Even with a small value of N, the proposed channel estimator achieves a sufficient NMSE performance. It should be noted that the complexity and latency required to determine the optimal policy increase with the number of backup samples.

To evaluate the performance in time-varying channels, the temporally correlated channel model of [29,30] was adopted. In this process, the channel matrix at time slot n is defined for n ∈ N_{b,d} with b ∈ {1, 2, . . . , N_b} and d ∈ {1, 2, . . . , N_d}, where the temporal correlation coefficient ε ∈ [0, 1] depends on the velocity and H^{(0)} is the initial channel matrix. Each element in e^{(n)} ∈ C^{N_r×N_t} is assumed to follow CN(0, 1). Temporal correlation coefficients ε = 5 × 10^{-3} and ε = 10^{-2} are used for the simulations.
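A temporally correlated channel of this kind can be generated with a first-order Gauss-Markov recursion. The exact recursion used in [29,30] is not reproduced in the text above, so the variance-preserving form H^{(n)} = sqrt(1 − ε²) H^{(n−1)} + ε e^{(n)} below is an assumption; it keeps the marginal statistics at CN(0, 1) while the channel drifts at a rate set by ε.

```python
import numpy as np

rng = np.random.default_rng(3)
N_t, N_r, eps, steps = 4, 4, 1e-2, 100  # eps matches the larger simulated coefficient

def cgauss(shape):
    """i.i.d. CN(0, 1) samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# Assumed variance-preserving first-order Gauss-Markov recursion:
# H(n) = sqrt(1 - eps^2) * H(n-1) + eps * e(n), e(n) ~ CN(0, 1) elementwise.
H = cgauss((N_r, N_t))
H0 = H.copy()
for _ in range(steps):
    H = np.sqrt(1 - eps ** 2) * H + eps * cgauss((N_r, N_t))

drift = np.linalg.norm(H - H0) ** 2 / np.linalg.norm(H0) ** 2
print(f"relative drift after {steps} slots: {drift:.3f}")
```

Even the larger coefficient ε = 10^{-2} produces only slow per-slot drift, which is why selectively refreshed data-aided estimates can track it while a pilot-only estimate computed once per frame cannot.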
Figure 9 shows the variation in the NMSE of the proposed channel estimator with the discounting factor. When the channel varies over time with ε = 5 × 10^{-3}, the NMSE with γ = 0.1 is better than that with γ = 0.9. This is because the rewards at future states are insignificant in time-varying channels; therefore, a small discounting factor is preferable. By contrast, when the channels are time-invariant, the rewards at future states as well as those at the current state are important. Thus, the large value γ = 0.9 improves the NMSE compared with γ = 0.1. Figure 10 compares the BLERs of the proposed and conventional channel estimators. When ε = 10^{-2}, the BLER of the CE is severely degraded because the CE method cannot capture the channel variation. However, the proposed channel estimator shows robustness in time-varying channels because the channel variation can be tracked efficiently by selecting the detected symbols.

Conclusions
In this paper, a low-complexity algorithm for an RL-based channel estimator for MIMO systems was proposed. The proposed channel estimator adaptively selects detected symbols as additional pilot symbols to minimize the channel estimation error. In this study, an MDP problem was introduced, and a practical algorithm to solve it was developed using backup samples and data subblocks. Simulation results showed that the proposed channel estimator significantly improves the BLER and the NMSE compared to the conventional channel estimator.
A future direction of this study is to develop the RL approach for realistic channels. The proposed method was derived based on the Rayleigh fading channel, but a realistic channel may have a line-of-sight component. Thus, the MDP under the Rician fading channel should be investigated. Another important direction is to develop the RL approach for frequency-selective channels, in which the use of multiple subcarriers can increase the computational complexity considerably; thus, a low-complexity algorithm for frequency-selective channels is necessary. Lastly, the RL approach can also be extended to other advanced channel estimators, such as iterative methods, for which the MDP should be reformulated according to the channel estimator.

Conflicts of Interest:
The authors declare no conflicts of interest.

Appendix A. Proof of Theorem 1
Although the basic derivation of the optimal policy is based on [26], two additional factors are considered, which are presented in this appendix. The first is that the proposed derivation considers a discounting factor in the Q-value; thus, the intermediate rewards do not disappear, unlike in [26]. Second, a finite number of backup samples are used in the derivation; thus, the rewards that do not exploit the APPs are approximated differently compared to [26].
Under the assumption that the discounting factor is 1, the future value function at state Ũ^{(a,j)}_{n+N+1}(S_n) ∈ S_{n+N+1} is expressed by substituting (14) into (28), as follows:

V(Ũ^{(a,j)}_{n+N+1}(S_n)) = Tr( C_e(Ũ^{(a,j)}_{n+N+1}(S_n)) − C_e(Û^{(a,j)}_{n+N+2}(S_n)) ).

In (17), the optimal policy is determined by the difference between the error covariance matrices with a = 0 and a = 1. The error covariance matrices for the virtual states Ũ^{(a,j)}_m(S_n) and Û^{(a,j)}_m(S_n) are derived as described below.
Appendix A.1. Error Covariance Calculation for Ũ^{(a,j)}_m(S_n)

To obtain the error covariance matrix, the distribution of the received symbols y_r^H(Ũ^{(a,j)}_m(S_n)) in (2) is required, which is given by y_r^H(Ũ^{(a,j)}_m(S_n)) ∼ CN(0, ·). To resolve the detected symbols after n + N + 1 in (A9), Q_{N_{b,d}(T_b)+1} is further approximated. To this end, the expected value of Q^{(0)}_{N_{b,d}(T_b)+1} is used with Jensen's inequality in (A9), yielding the desired expression.