Reinforcement Learning-Aided Channel Estimator in Time-Varying MIMO Systems

This paper proposes a reinforcement learning-aided channel estimator for time-varying multi-input multi-output systems. The basic concept of the proposed channel estimator is the selective use of detected data symbols in data-aided channel estimation. To perform this selection successfully, we first formulate an optimization problem that minimizes the data-aided channel estimation error. However, in time-varying channels, the optimal solution is difficult to derive because of its computational complexity and the time-varying nature of the channel. To address these difficulties, we consider a sequential selection of the detected symbols and a refinement of the selected symbols. A Markov decision process is formulated for the sequential selection, and a reinforcement learning algorithm that efficiently computes the optimal policy is proposed together with a state element refinement. Simulation results demonstrate that the proposed channel estimator outperforms conventional channel estimators by efficiently capturing the variation of the channel.


Introduction
The multi-input multi-output (MIMO) system is a key technology in modern communication and can significantly improve channel capacity and communication reliability by using multiple antennas [1][2][3][4][5][6][7]. Spatial multiplexing and diversity gain are representative schemes for this improvement [1,2]. Notably, the channel capacity increases linearly with the minimum of the numbers of transmit and receive antennas. However, this increase is based on the unrealistic assumption of perfect channel state information (PCSI) at both the transmitter and receiver.
Many studies have proposed improving the channel estimation accuracy with limited time and frequency resources [8][9][10][11][12][13][14][15]. A representative method is pilot-aided channel estimation, which exploits information shared between the transmitter and receiver. Linear minimum mean-square error (LMMSE) channel estimation is a well-known pilot-aided method owing to its simple structure [8]. However, LMMSE channel estimation exhibits unsatisfactory performance with a limited number of pilots. Thus, many pilots are required to satisfy the performance requirement, which decreases the spectral efficiency.
As a non-iterative approach, the reinforcement learning (RL)-aided channel estimator was introduced in [27][28][29][30][31][32][33]. The basic concept of this approach is the sequential selection of detected data symbols to minimize the channel estimation error. To this end, a Markov decision process (MDP) was defined for the sequential selection, and the corresponding optimal policy was derived in closed form in [31]. In [32], a low-complexity algorithm was investigated by introducing sub-blocks and finite backup samples, which significantly reduced the computational complexity and latency without performance loss. Recently, a general framework for RL-aided channel estimation based on Monte Carlo tree search was studied in [33]. However, the RL-aided channel estimators in [31][32][33] were originally designed for time-invariant channels and thus perform poorly in time-varying channels.
In this paper, we propose an RL-aided channel estimator for time-varying MIMO channels. To achieve this, we first introduce an optimization problem for an RL-aided channel estimator in time-varying channels. We then formulate an MDP to solve the optimization problem and propose an RL algorithm for the MDP that accounts for the time-varying nature of the channel. The main contributions of this paper are as follows:
• We propose an RL-aided channel estimator for time-varying channels modeled using a first-order Gaussian-Markov process. First, we define the optimization problem in time-varying channels to select the detected data symbols and minimize the estimation error between the estimated and current channels. This optimization problem differs from those in [31][32][33], where the selection of the detected data symbols is unchanged because the channel does not change with the time slot index.
• We propose an RL algorithm for the optimization problem that captures the time-varying nature of the channel. Because the optimization problem minimizes the estimation error between the estimated and current channels, we adjust the weights of the data symbols to improve the accuracy of the current channel estimate. Using this adjustment, we derive the optimal policy as a closed-form solution. Note that the proposed optimal policy differs from those in [31][32][33] because the influence of soft-decision symbols in the virtual state for future rewards gradually diminishes as the time slot index increases.
• We propose a further performance improvement scheme that refines the state elements, because a previously selected data symbol degrades the estimation accuracy of the current channel. To improve the estimation accuracy, we refine the previously selected data symbols by reflecting the channel variation. In addition, we remove selected data symbols that are too old by introducing a sliding window, because they have a large effective noise variance for estimating the current channel.
Through simulations, we demonstrate the effectiveness of the proposed channel estimator compared with conventional channel estimators in time-varying channels.
The remainder of this paper is organized as follows. In Section 1, we introduce the system model, optimization problem, and the MDP. The proposed channel estimator, which determines the optimal policy for time-varying channels, is described in Section 2. We propose a further performance improvement scheme in Section 3. In Section 4, we present simulation results to demonstrate the effectiveness of the proposed channel estimator. Finally, we provide our conclusions in Section 5.

Preliminaries
This section describes the system model of a data-aided channel estimator for time-varying MIMO channels. We present the considered channel estimation and data detection schemes based on the model and introduce an optimization problem for data-aided channel estimation.

System Model
We consider MIMO systems in which a transmitter with N_t transmit antennas communicates with a receiver with N_r receive antennas (Figure 1).
The information bits are first encoded and mapped to the symbol constellation, where X is the symbol constellation set. The transmitted symbol at time n, denoted by x[n] ∈ X^{N_t}, is then sent over a wireless channel. We model the wireless channel using a first-order Gaussian-Markov process as a time-varying channel model [34][35][36][37][38], where the (t, r)-th component of the channel matrix H[n] ∈ C^{N_t × N_r}, between the t-th transmit and r-th receive antennas, follows a Rayleigh fading CN(0, 1) distribution. The temporal variation of the wireless channel is characterized by a parameter η ∈ [0, 1], which increases with velocity. Based on this model, the channel matrix H[n] at time slot n is given by

H[n] = √(1 − η²) H[n − 1] + η ∆[n],   (1)

where each entry of ∆[n] follows a CN(0, 1) distribution.
The received symbol at time slot n is then given by

z[n] = H^H[n] x[n] + n[n],   (2)

where (·)^H denotes the conjugate transpose and n[n] is the additive white Gaussian noise (AWGN) at time slot n, with distribution CN(0_{N_r}, σ_n² I_{N_r}), where 0_m and I_m denote the length-m zero vector and the m × m identity matrix, respectively.
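As an illustration, the channel model and received signal above can be sketched numerically. The recursion H[n] = √(1 − η²) H[n − 1] + η ∆[n] is a reconstruction of the first-order Gaussian-Markov model described in the text, and the function names (`evolve_channel`, `received_symbol`, η as `eta`) are illustrative assumptions, not the paper's code:

```python
import numpy as np

def evolve_channel(H_prev, eta, rng):
    """One step of the first-order Gaussian-Markov model
    H[n] = sqrt(1 - eta^2) * H[n-1] + eta * Delta[n]
    (reconstructed form; eta = 0 gives a time-invariant channel)."""
    shape = H_prev.shape
    delta = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)
    return np.sqrt(1.0 - eta**2) * H_prev + eta * delta

def received_symbol(H, x, sigma_n, rng):
    """z[n] = H^H x[n] + n[n] with i.i.d. CN(0, sigma_n^2) noise entries."""
    Nr = H.shape[1]
    noise = sigma_n * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr)) / np.sqrt(2)
    return H.conj().T @ x + noise
```

Setting `eta = 0` reproduces the time-invariant case considered in [31][32][33].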
The frame consists of one pilot block and M_d data blocks (Figure 1). The pilot block contains N_p symbols, whereas each data block contains N_d symbols. M_p = {1, . . . , N_p} is defined as the pilot index set, and M_d = {(d − 1)N_d + 1, . . . , dN_d} is defined as the index set of the d-th data block. We consider data-aided channel estimation, where the receiver obtains the initial channel estimate using the pilot symbols and the accuracy of the initial estimate is improved by exploiting data symbols.
We adopt the LMMSE method as the basic channel estimation method because it has a simple structure and provides reasonable performance. Based on the LMMSE method, the r-th row ĥ_r of the initial channel estimate Ĥ can be obtained as in (3), where (·)^{-1} is the matrix inverse. The conventional channel estimator performs data detection at the receiver using the initial channel estimate Ĥ. Because the maximum a posteriori (MAP) rule guarantees optimal detection performance, we adopt it for data detection, as given in (4), where |·| is the cardinality of a set and x_k ∈ X^{N_t} with k belonging to the index set of symbol vector candidates K = {1, . . . , |X^{N_t}|}. θ_k[n] denotes the a posteriori probability (APP) given in (5), where the likelihood probability in (5) is calculated by assuming an AWGN channel, as in (6), where ‖·‖ denotes the norm operation and P(·) is the probability of an event.
The a priori probability in (5) is assumed to be equal for all possible symbol vector candidates.
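A minimal sketch of the pilot-aided LMMSE estimate and the MAP-style APP computation described above. It assumes the standard regularized least-squares form of the LMMSE estimator under a CN(0, 1) channel prior and equal priors over candidates; the function names and the pilot model Z_p = X_p H + N are illustrative assumptions:

```python
import numpy as np
from itertools import product

def lmmse_estimate(X_p, Z_p, sigma_n2):
    """LMMSE channel estimate from pilots under a CN(0,1) channel prior,
    assuming the stacked pilot model Z_p = X_p H + N. This reduces to
    regularized least squares: (X^H X + sigma^2 I)^{-1} X^H Z."""
    Nt = X_p.shape[1]
    return np.linalg.solve(X_p.conj().T @ X_p + sigma_n2 * np.eye(Nt),
                           X_p.conj().T @ Z_p)

def map_app(z, H_hat, constellation, sigma_n2):
    """APPs over all |X|^Nt candidate symbol vectors under an AWGN
    likelihood and equal priors, mirroring (5)-(6)."""
    Nt = H_hat.shape[0]
    cands = [np.array(c) for c in product(constellation, repeat=Nt)]
    metrics = np.array([-np.linalg.norm(z - H_hat.conj().T @ x) ** 2 / sigma_n2
                        for x in cands])
    metrics -= metrics.max()          # subtract max for numerical stability
    probs = np.exp(metrics)
    return cands, probs / probs.sum()
```

The MAP decision is simply the candidate with the largest APP; the APP vector itself is what the proposed RL-aided estimator consumes later.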

Problem
In a time-varying channel, the estimation accuracy of Ĥ decreases gradually as the time slot index n increases. This degradation results in poor detection performance at the receiver. Because the detected data symbols may contain errors owing to the degraded channel estimate, incorrect use of the detected data symbols severely degrades performance. To overcome this degradation, we consider a data-aided channel estimator that selects which detected data symbols to use in data-aided channel estimation.
For the selection, we define an action a ∈ A = {0, 1}: the detected data symbol is used in channel estimation when a = 1 and is not used otherwise. When we define a ∈ {0, 1}^{N_d} as the set of actions over a data block, the considered data-aided channel estimate can be obtained using this set as in (7), where z_r(a) collects the received symbols selected by a. The resulting optimization problem is given in (8), where E(·) is the expectation of a random variable. Compared with previous studies [31][32][33], the optimization problem in (8) performs the selection so as to minimize the MSE between the estimated channel and the current channel H[n]. Because the channel varies with the time slot index n, the best action set a* may also change with n; that is, the best actions at the previous time slot may be invalid at the next one. In addition, the optimization problem is difficult to solve because the number of candidate action sets grows exponentially with the data length (there are 2^{N_d} candidates), so an exhaustive search is not feasible in practical applications. To resolve these difficulties, we introduce a sequential selection of the detected data symbols and a refinement of the selected symbols.
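The gap between exhaustive and sequential selection can be illustrated with a toy greedy selector. The error proxy below scores a selection against the true channel, which is only possible in simulation; the scoring model Z = X H, the regularized re-estimate, and all names here are illustrative assumptions, not the paper's objective in (8):

```python
import numpy as np

def estimation_mse(selected, X_all, Z_all, H_true, sigma_n2):
    """Toy proxy for the objective in (8): MSE of a regularized LS
    re-estimate built from the selected symbol indices (simulation-only,
    since it needs H_true for scoring)."""
    if not selected:
        return np.inf
    X, Z = X_all[list(selected)], Z_all[list(selected)]
    H_hat = np.linalg.solve(X.conj().T @ X + sigma_n2 * np.eye(X.shape[1]),
                            X.conj().T @ Z)
    return np.linalg.norm(H_hat - H_true) ** 2

def greedy_sequential_selection(X_all, Z_all, H_true, sigma_n2):
    """Consider the detected symbols one at a time and keep each only if it
    reduces the current error proxy: O(N_d) trials instead of the O(2^{N_d})
    candidates of an exhaustive search."""
    selected, best = [], np.inf
    for n in range(X_all.shape[0]):
        trial = estimation_mse(selected + [n], X_all, Z_all, H_true, sigma_n2)
        if trial < best:
            selected.append(n)
            best = trial
    return selected
```

Already at N_d = 128 symbols per block, 2^{128} candidate action sets rule out exhaustive search, which is what motivates the sequential (MDP) formulation that follows.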

Markov Decision Process
We formulate an MDP that solves the optimization problem in (8). To achieve this, we define the state S_n, transition function T^{(a,j)}_{n+1}(S_n), action set A, and reward R(S_n, S_{n+1}) [39]. Subsequently, the Q-value function Q(S_n, a) and the optimal policy π*(S_n) are presented. The basic definitions for the MDP are adopted from [31][32][33]; however, the RL solution for the MDP differs from those in previous studies, as explained in the next section.
The state S_n is defined as in (9), where C is the set of time slot indices whose symbols are used in channel estimation, and C(i) is its i-th smallest element. k_n ∈ K = {1, . . . , |X^{N_t}|} is the transmitted symbol-vector index at time slot n. Based on this expression, we can obtain the proposed channel estimate ĥ_r(S_n) using the state S_n ∈ S_n as in (10). Note that S_n denotes the set of all states, and S_n is a particular state.
The action set is defined as A = {0, 1}. As explained in the previous subsection, the detected data symbol is used in the proposed channel estimation when a = 1; otherwise, it is not used. The transition function T^{(a,j)}_{n+1}(S_n) from state S_n ∈ S_n is defined as in (11), where I(·) equals one when its event is true and zero otherwise. U^{(a,j)}_{n+1}(S_n) is a possible candidate for the next state from state S_n and is defined as in (12). The reward R(S_n, S_{n+1}) is defined as the difference between the MSEs at the current state S_n ∈ S_n and the next state S_{n+1} ∈ S_{n+1}, as given in (13), where the error covariance is defined between the estimated channel ĥ_r(S_n) and h_r[n] at time slot index n. The Q-value function Q(S_n, a) is the expected sum of rewards, as given in (14), where Tr(·) is the trace operation and V(U^{(a,j)}_{n+1}(S_n)) is the optimal sum of future rewards after U^{(a,j)}_{n+1}(S_n). γ is a discount factor whose value is set to one because the proposed channel estimator also considers the effect of future rewards at the terminal state [31].
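The reward in (13), an MSE reduction measured through error-covariance traces, can be sketched as follows. The sketch assumes the standard LMMSE error covariance B = σ²(X^H X + σ² I)^{-1} under a CN(0, 1) channel prior; the function names are illustrative:

```python
import numpy as np

def error_cov(X_sel, sigma_n2):
    """LMMSE error covariance B = sigma^2 (X^H X + sigma^2 I)^{-1} for the
    symbols currently held in the state (assumed CN(0,1) channel prior)."""
    Nt = X_sel.shape[1]
    return sigma_n2 * np.linalg.inv(X_sel.conj().T @ X_sel + sigma_n2 * np.eye(Nt))

def reward(X_state, x_new, sigma_n2):
    """R(S_n, S_{n+1}) = Tr(B(S_n)) - Tr(B(S_{n+1})): the MSE reduction
    obtained by adding one detected symbol vector to the state."""
    B_now = error_cov(X_state, sigma_n2)
    B_next = error_cov(np.vstack([X_state, x_new[None, :]]), sigma_n2)
    return float(np.real(np.trace(B_now) - np.trace(B_next)))
```

Under this model, adding any nonzero symbol vector shrinks the error-covariance trace, so the one-step reward of a = 1 is nonnegative; the Q-value in (14) additionally weighs the future rewards of that choice.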
The optimal policy π*(S_n) maximizes the Q-value function, as expressed in (15). Solving the optimization problem in (15) is highly difficult because the transition function T^{(a,j)}_{n+1}(S_n) is unknown and the number of candidate states increases exponentially with the data length. An effective method to solve this problem is a reinforcement learning algorithm. Therefore, the proposed channel estimator also adopts a reinforcement learning algorithm, but, in contrast to [31][32][33], it additionally accounts for the effect of the time-varying channel.
A deep reinforcement learning (DRL) approach is a promising way to deal with the dimension explosion of the states by leveraging deep neural networks. To apply the DRL approach to our MDP, an agent needs to interact with an environment to obtain an action-value function for a given action and state. However, neither the states nor the rewards of our MDP are observable at the receiver. This means that the agent cannot acquire training samples, each of which consists of a state (or a state transition) and the corresponding reward. Consequently, the DRL approach and other data-driven approaches are not directly applicable to our MDP.

Proposed Optimal Policy
This section describes the proposed optimal policy. The basic concept of the derivation is similar to that in [31][32][33]. However, its direct extension is difficult for time-varying channels. This is because capturing time-variant channels using previously selected data symbols is difficult. To address this, we approximate the first-order Gaussian-Markov process and propose a computationally efficient algorithm.
The main difficulty in analyzing the time-varying channel model is handling the innovation term ∆[n]. To resolve this, we approximate the first-order Gaussian-Markov process in (1) as in (16), expressing H[n + m] as a scaled version of H[n] plus an aggregated noise term. This approximation is often adopted because it provides analytical tractability [36][37][38]. Using this approximation, the received symbol z[n + m] for m ≥ 1 can be expressed in terms of H[n] as in (17). From the approximation in (18), the virtual state in [31] that mimics the optimal behavior from state U^{(a,j)}_{n+1}(S_n) can be obtained as in (19), where the soft-decision symbol x̃[m] for m ≥ n + 1 is defined as in (20). In (19), because 0 ≤ η ≤ 1, the effect of the soft-decision symbol x̃[n + m] on estimating H[n] diminishes as m increases. Based on the virtual state, the state-action diagram for the proposed channel estimator is shown in Figure 2. In this figure, the number of state transitions at state S_n is one for a = 0 and K for a = 1. After time slot n + 2, however, the number of state transitions simplifies to one because the virtual state mimics the behavior of state U^{(a,j)}_{n+1}(S_n). Using the definition of the virtual state in (19), we can compute the future reward V(U^{(a,j)}_{n+1}(S_n)) as in (21). By applying (13), the future reward simplifies as in (22). Using the approximations in (16) and (22), the Q-value function in (14) is obtained as in (23). The error covariance matrix B_{Ũ^{(a,j)}_m} can be computed as in (24). By applying (24) to the Q-value function, the optimal policy at S_n is computed as in (25), where Q^{(a)} and D^{(a,j)} are defined similarly to [31]. Finally, similar to [32], by applying the results in (23) and (24) to (25), we obtain the proposed optimal policy in closed form as

π*(S_n) = I( (σ_n²(1 + α_n) + σ_n⁴(‖a_n‖² + ‖d_n‖²)) / (2σ_n⁴(β_n + γ_n + c_n − b_n + ‖d_n‖²)) ≥ 1 ),   (26)

where β_n = a_n^H Q a_n / ‖a_n‖² and γ_n = δ_n / (1 + α_n). Note that the expression of the optimal policy in (26) is similar to that in [32]. However, the vectors and constants in the optimal policy differ from those in [32] because the temporal correlation is incorporated in Q and D. When η = 0, the optimal policy in (26) reduces to that in [32].
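The diminishing influence of far-future soft-decision symbols can be made concrete with the decay factor implied by the channel model; the exact weight (1 − η²)^{m/2} below is an assumption consistent with the first-order Gaussian-Markov recursion, not a quantity the paper states explicitly:

```python
def soft_symbol_weight(eta, m):
    """Effective weight of a soft-decision symbol m slots ahead of time n
    when it is used to estimate H[n]. Under the first-order Gaussian-Markov
    recursion, the contribution of H[n] to z[n+m] scales as
    (1 - eta^2)^(m/2), so farther symbols matter less as eta grows."""
    return (1.0 - eta**2) ** (m / 2.0)
```

With eta = 0 every weight equals one, matching the observation that the policy then reduces to the time-invariant policy of [32].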

Further Performance Improvement
In this section, we propose a practical method to improve the estimation accuracy of the proposed channel estimator. The proposed method refines state elements to capture the time-varying nature of the channel.

State Element Refinement
The elements X_n and X̃_n in state S_n are updated when a detected data symbol is selected by the optimal policy. However, these elements gradually lose their effectiveness in estimating H[n] as the time slot index n increases. To address this, we first represent the received symbol for m ≥ 1 in terms of H[n], as in (27). Using (27), we refine the elements X_n and X̃_n in the state as the time slot index increases, as given in (28). Even with this refinement, the previously selected data symbols lose their effectiveness as the time slot index increases, particularly for large data lengths. This is because the term ∆[n] in (1) becomes dominant, increasing the uncertainty in estimating the channel. To overcome this, we remove overly old selected data symbols from the state by introducing a window of size N_w; that is, we maintain the size of the set of time slot indices as |C| = N_w. Thus, when the optimal action is one at time slot index n, the index n is included, whereas the oldest index C(1) is removed from the set C, which can be expressed as C ← C \ C(1).
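A sketch of the sliding-window bookkeeping described above. The per-slot de-weighting of stored observations by √(1 − η²) is one plausible reading of the refinement in (28) under the first-order Gaussian-Markov model, and the class and attribute names are illustrative:

```python
import numpy as np
from collections import deque

class SelectedSymbolWindow:
    """Sliding window over the selected data symbols (the set C, kept at
    |C| <= N_w). refine() multiplies each stored received symbol by
    sqrt(1 - eta^2) once per slot, so that old observations are fitted to
    the current channel (an assumed reading of (28))."""

    def __init__(self, N_w, eta):
        self.N_w = N_w
        self.decay = np.sqrt(1.0 - eta**2)
        self.C = deque()   # selected time slot indices, oldest first
        self.X = deque()   # detected symbol vectors
        self.Z = deque()   # (refined) received symbols

    def refine(self):
        # de-weight every stored observation by one slot of channel variation
        for i in range(len(self.Z)):
            self.Z[i] = self.decay * self.Z[i]

    def add(self, n, x, z):
        # drop the oldest index C(1) when the window is full: C <- C \ C(1)
        if len(self.C) == self.N_w:
            self.C.popleft(); self.X.popleft(); self.Z.popleft()
        self.C.append(n); self.X.append(x); self.Z.append(z)
```

The deque keeps C sorted by insertion time, so the removal rule C ← C \ C(1) is a constant-time `popleft`.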

Algorithm
Using the proposed optimal policy and performance improvement strategy, the proposed channel estimator is summarized in Algorithm 1. The receiver obtains the initial channel estimation during pilot transmission. Subsequently, during the data transmission, the receiver sequentially selects a data symbol based on the optimal policy. When the optimal action a = 1, the state S n is updated using the most-probable state transition [31]. In addition, the state element refinement is performed based on this condition. After each data block ends, the channel estimate is updated using the state S n .
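The procedure summarized above can be sketched as a skeleton, with the estimator, policy, and window interfaces left abstract. This is a sketch of the control flow only, not the paper's Algorithm 1 verbatim; every name here is illustrative:

```python
def rl_aided_estimation(pilot_block, data_blocks, estimator, policy, window):
    """Skeleton of the proposed procedure:
      1) initial LMMSE estimate from the pilot block,
      2) for each data symbol: detect, evaluate the optimal policy, and if
         a* = 1 update the state via the most-probable state transition,
      3) refine the state elements and re-estimate after each data block."""
    H_hat = estimator.initial_estimate(pilot_block)
    for block in data_blocks:
        for n, z in block:
            x_hat, app = estimator.detect(z, H_hat)
            if policy.optimal_action(H_hat, x_hat, app) == 1:
                window.add(n, x_hat)      # most-probable state transition
            window.refine()               # state element refinement
        H_hat = estimator.update(window)  # data-aided re-estimate per block
    return H_hat
```

Concrete `estimator`, `policy`, and `window` objects would implement the LMMSE estimator of (3)/(7), the closed-form policy of (26), and the sliding-window refinement of Section "State Element Refinement", respectively.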

In Figure 3, we show a block diagram of the proposed channel estimator, which consists of the LMMSE channel estimator, optimal policy calculator, and state element refinement. The LMMSE channel estimator obtains the initial estimate at pilot transmission and updates the estimate at data transmission using state S n . The optimal policy calculator obtains the optimal action of (26) from the channel estimates and APP from the data detector. The state elements are then refined based on the obtained optimal action, and the refined state is used to estimate the channel and optimal policy for the next step.

Application of other data detection:
The proposed RL-aided channel estimator can be applied with any other soft-output data detection method, because it relies only on the availability of APPs, which are directly provided by the MAP data detection method. When other soft-output data detectors are used, the proposed RL-aided channel estimator can utilize APPs computed from the log-likelihood ratios (LLRs).
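The LLR-to-APP conversion mentioned above can be sketched as follows, assuming the common convention L = log(P(b = 0)/P(b = 1)) and independent coded bits; the function names and the bit-labeling argument are illustrative:

```python
import numpy as np

def llr_to_bit_probs(llrs):
    """Convert LLRs L = log(P(b=0)/P(b=1)) to P(b=1) = 1 / (1 + exp(L))."""
    llrs = np.asarray(llrs, dtype=float)
    return 1.0 / (1.0 + np.exp(llrs))

def bit_probs_to_symbol_app(p1, bit_labels):
    """APP of each constellation symbol from per-bit probabilities,
    assuming independent bits; bit_labels[k] is the bit tuple of symbol k."""
    apps = []
    for bits in bit_labels:
        p = 1.0
        for b, pb1 in zip(bits, p1):
            p *= pb1 if b == 1 else (1.0 - pb1)
        apps.append(p)
    apps = np.array(apps)
    return apps / apps.sum()
```

The normalized APP vector can then be fed to the optimal policy calculator exactly as the MAP-derived APPs of (5) would be.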
Complexity analysis: Complexity is analyzed in terms of real multiplications to provide an implementation perspective. Figure 3 shows the hardware structure of the proposed RL-aided channel estimator, which consists of the LMMSE channel estimator, state element refinement, and optimal policy calculator. Because the exact complexity can vary depending on the implementation details, the complexity order (O(·)) of each component is analyzed.
The complexity order of the LMMSE channel estimator in (7) is O((N_p + |C(a*)|)(N_t² + N_t N_r)), where C(a*) is the set of selected data symbol indices. The complexity order of the state element refinement in Section 4.1 is O(4(N_p + |C(a*)|)). The complexity of the optimal policy in (25) is primarily determined by the computation of Q^{(a)}, which therefore dominates the complexity order of the optimal policy in (25). It is important to note that, among the components of the proposed channel estimator, the optimal policy calculator has the highest complexity because it is executed at every data symbol index n, whereas the other components are executed once per data block index d.

Simulation Results
This section presents the effectiveness of the proposed channel estimator through simulations. The numbers of transmit and receive antennas were N_t = 2 and N_r = 4. The transmission frame consisted of one pilot block with N_p = 8 symbols and M_d = 20 data blocks with N_d = 128 symbols each. The symbols were mapped using 4-ary quadrature amplitude modulation (QAM). We adopted a turbo channel code with rate 1/2 and 16 cyclic redundancy check bits. For the proposed channel estimator, the window size was set to N_w = 2 × N_d. The signal-to-noise ratio (SNR) was defined as E_b/N_0 = 1/(log_2|X| σ_n²) under the power constraint E{‖x[n]‖²} = 1. The proposed channel estimator was compared with the following methods.
• PCSI: This method is ideal for time-invariant channels in which a perfect initial channel estimate is available at the receiver. Because the initial channel changes during data transmission, it is not optimal for time-varying channels.
• Pilot: This method uses a conventional pilot-aided channel estimator using (3).
• Soft: This method is a data-aided channel estimator in which all soft-decision symbols in (20) are used as additional pilot symbols.
• Conv-RL [31]: This method is a data-aided channel estimator in which the detected data symbols are selected using the RL approach developed for time-invariant channels.
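The SNR definition used in this setup, E_b/N_0 = 1/(log_2|X| σ_n²), can be inverted to obtain the noise variance for a target operating point. A small helper, assuming the paper's unit power constraint (the function name is illustrative):

```python
import numpy as np

def noise_variance(ebn0_db, constellation_size):
    """sigma_n^2 from the paper's SNR definition
    E_b/N_0 = 1 / (log2|X| * sigma_n^2) under E{||x||^2} = 1."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return 1.0 / (np.log2(constellation_size) * ebn0)
```

For example, with 4-QAM (log_2|X| = 2) at E_b/N_0 = 0 dB, the per-entry noise variance is σ_n² = 0.5.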
The performance of the methods was compared with that of the proposed channel estimator in terms of the block-error rate (BLER) and normalized MSE (NMSE). In addition, we considered the time-invariant channel (η = 0) and time-varying channels with η = 0.005 and η = 0.01. Note that the channel varied more severely when η = 0.01 than when η = 0.005.

Figure 4 shows the BLERs of the proposed and other channel estimators in the time-invariant channel, i.e., η = 0. The conventional pilot-aided channel estimator exhibited poor performance because the number of pilots was small. Data-aided channel estimators can overcome this performance degradation. In particular, the RL-based channel estimator [31] showed outstanding performance compared with the other channel estimators. The BLER of the proposed channel estimator was slightly worse than that of [31] because of the reduced window size N_w = 2 × N_d.

In Figure 5, the proposed channel estimator is compared with the other channel estimators in time-varying channels. The proposed channel estimator achieved a larger BLER improvement than the conventional pilot-aided channel estimator, and the improvement was more prominent at η = 0.01 than at η = 0.005. This is because the proposed channel estimator can efficiently capture channel variations by selecting and refining detected data symbols. Whereas the reduced window size slightly increased the BLER in the time-invariant channel, in time-varying channels it improved the BLER by effectively leveraging the most recent data symbols. Consequently, the proposed channel estimator achieved a lower BLER than the RL-based channel estimator (see Figure 5).
In Figure 6, we show the BLER of the proposed channel estimator for different window sizes N_w in time-varying channels with η = 0.01. The BLER gradually degraded as N_w increased. This is because old selected data symbols are undesirable as additional pilot symbols in fast-fading channels; therefore, using only the most recently selected data symbols improves performance. To further investigate the effect of the window size, we examined the NMSE of the proposed channel estimator for different window sizes N_w at η = 0.005 and E_b/N_0 = −2 dB (Figure 7). We observed that the NMSE improved until M_d = 2 but degraded as the data block length increased further, because old selected data symbols are ineffective for estimating the current channel. Thus, discarding old data symbols further improves the estimation accuracy (Figure 7).

Conclusions
A data-aided channel estimator was proposed for time-varying channels based on selecting the detected data symbols. To facilitate efficient selection, an optimization problem was first formulated to minimize the channel estimation error. Subsequently, an MDP for this optimization problem was formulated, and its optimal policy was derived using an RL algorithm. In the derivation, approximations of the transition probability and of the first-order Gaussian-Markov process were utilized. To improve the estimation accuracy, a state element refinement with a sliding window was introduced to capture the time-varying nature of the channel. Simulation results demonstrated that the proposed channel estimator provides performance similar to that of the conventional RL-based channel estimator in time-invariant channels (η = 0), while showing improved performance in time-varying channels (η = 0.005 and η = 0.01).
An interesting direction for further research is optimizing the frame structure in terms of spectral efficiency. In this study, the frame structure comprises one pilot block and D data blocks. The proposed RL-aided channel estimator is applied to the data blocks to capture the time-varying nature of the channel. However, in fast-fading channels, it can be challenging for the proposed channel estimator to accurately track channel variations. In such cases, reducing D can potentially improve the performance, but this reduction also degrades the spectral efficiency. A suitable criterion for choosing D in time-varying channels is therefore an optimization problem that maximizes spectral efficiency while maintaining an acceptable performance level. One approach is to first derive the performance of the RL-aided channel estimator analytically; the solution to the optimization problem can then be obtained from the derived performance.