Sparse Sliding-Window Kernel Recursive Least-Squares Channel Prediction for Fast Time-Varying MIMO Systems

Accurate channel state information (CSI) is important for MIMO systems; especially in high-speed scenarios, fast time-varying CSI tends to become outdated, and changes in CSI exhibit complex nonlinearities. The kernel recursive least-squares (KRLS) algorithm, which offers an attractive framework for dealing with nonlinear problems, can be used to predict nonlinear time-varying CSI. However, the network structure of the traditional KRLS algorithm grows as the training sample size increases, resulting in insufficient storage space and increasing computation when dealing with incoming data, which limits the online use of the KRLS algorithm. This paper proposes a new sparse sliding-window KRLS (SSW-KRLS) algorithm in which a candidate discard set is selected through correlation analysis between the mapping vectors, in the kernel Hilbert space, of the new input sample and the existing samples in the kernel dictionary; the discarded sample is then determined in combination with its corresponding output to achieve dynamic sample updates. Specifically, the proposed SSW-KRLS algorithm maintains the size of the kernel dictionary within a sample budget, requires a fixed amount of memory and computation per time step, incorporates regularization, and achieves online prediction. Moreover, in order to sufficiently track strongly changing dynamic characteristics, a forgetting factor is incorporated into the proposed algorithm. Numerical simulations demonstrate that, under a realistic 3GPP channel model in a rich scattering environment, our proposed algorithm achieves performance superior, in terms of both predictive accuracy and kernel dictionary size, to that of the ALD-KRLS algorithm. Our proposed SSW-KRLS algorithm with M = 90 achieved an NMSE 2 dB lower than that of the ALD-KRLS algorithm with v = 0.001, while the kernel dictionary was about 17% smaller, when the speed of the mobile user was 120 km/h.


Introduction
Multiantenna technology can fully use spatial dimension resources and dramatically improve the capacity of wireless communication systems without increasing transmission power or bandwidth [1]. Meanwhile, beamforming technology [2,3] is widely used to reduce the interference between cochannel users with cooperative transmission and reception, since it can compensate for the channel fading and distortion caused by the multipath effect. Base stations optimize the allocation of radio resources through reasonable precoding, rendering the desired signal and interference more orthogonal, provided that CSI is known at the base station. Thus, the acquisition of CSI is very important for the cooperation of transmission and reception [4]. However, due to the dynamics of the channel, especially when the terminal is moving at a high speed, the acquisition of CSI is a formidable problem in MIMO systems.
In TDD systems, the user sends a sounding reference signal (SRS), and the base station performs channel estimation algorithms such as LS [5] and MMSE [6]. Then, the obtained CSI is used for downlink beamforming to realize cooperative signal processing. The coherence time of a wireless channel is the time duration after which CSI is considered to be outdated. When the terminal is moving at a high speed, the Doppler frequency shift grows and the time variability of the channel is severe, which shortens the channel coherence time. The measured uplink CSI then cannot represent the real channel state of the downlink slots, resulting in a mismatch between the downlink beamforming designed according to the measured CSI and the actual channel. In [1], with a typical CSI delay of 4 milliseconds, a user terminal speed of 30 km/h led to as much as a 50% performance reduction compared to the low-mobility scenario at 3 km/h.
In order to overcome the performance degradation caused by severe time-varying channels, [7] proposed a tractable user-centric CSI model and a robust beamforming design by taking deterministic equivalents [8] into account. However, when the user terminal is at high speed, the channel shows nonstationary characteristics, and the statistical characteristics of the channel also change with time, which cannot be used for the beamforming of high-speed time-varying channels.
Another approach is to use a channel prediction algorithm [9][10][11][12][13][14][15][16] to obtain more accurate CSI for beamforming. The kernel method is widely used in channel prediction algorithms due to its ability to track nonlinear channels and its adaptation to time-varying channels [17][18][19][20][21][22]. However, algorithms based on kernel methods face a problem in online prediction: the network structure grows over time with the size of the training set. Though researchers have proposed sparseness methods that limit the number of samples by setting prerequisites for new samples added to the dictionary, the size of the kernel dictionary cannot be precisely controlled. Therefore, this paper proposes a new channel prediction algorithm based on the kernel method that maintains the size of the kernel dictionary within a fixed budget while precisely tracking fast time-varying dynamic characteristics.

Related Work
State-of-the-art channel fading prediction algorithms were reviewed in [9]. Parametric radio channel-based methods [10] assume that the channel changes faster than its multipath parameters, so estimating these parameters helps in extrapolating the channel into the future. However, the effective time of static multipath parameters is inversely proportional to the terminal moving speed, rendering channel prediction based on radio parameters inappropriate for high-speed scenarios. Autoregressive (AR) model-based methods [11,13] do not explicitly model physical scattering; they treat the time-varying channel as a stochastic wide-sense-stationary (WSS) process whose temporal autocorrelation function is used for prediction. Nevertheless, they are not capable of predicting ephemeral variations in non-wide-sense-stationary (NWSS) channels due to their linear correlation assumption. When treating NWSS channels, many studies have attempted to use machine learning for channel prediction. The authors in [14] proposed a backpropagation (BP) framework for channel prediction in backscatter communication networks that considers both spatial and frequency diversity. In [15], the authors developed a machine-learning method for predicting a mobile communication channel on the basis of a specific type of convolutional neural network. Although these algorithms can achieve good performance, they need to build neural networks, require a large number of samples for training, and have high complexity. The channel prediction method based on a support vector machine [16,23] uses Mercer's theorem [24] to map the channel sample space to a high-dimensional space and performs linear regression in that space, which solves the problem of tracking nonlinear channels well. The kernel recursive least-squares (KRLS) [17][18][19][20][21][22] algorithm is a nonlinear version of the recursive least-squares (RLS) algorithm.
It can not only solve nonlinear problems but also adaptively update the model parameters.
In order to solve this growth problem and render the online kernel method feasible, researchers have proposed various sparseness methods or criteria, such as approximate linear dependency (ALD) [17], the novelty criterion (NC) [20], the surprise criterion (SC) [21], and the coherence criterion (CC) [22]. On the basis of these sparseness methods or criteria, only new input samples that meet the prerequisites are added to the dictionary. However, these sparse methods cannot precisely control the size of the kernel dictionary, which motivated us to introduce a sliding window to control it. Recently, some KRLS-based algorithms with an inserted forgetting factor have achieved better performance than that of QKRLS algorithms [25,26] and ALD-KRLS, which motivated us to insert a forgetting factor into the proposed SSW-KRLS algorithm. This paper proposes a new sparse sliding-window KRLS algorithm in which a candidate discard set is selected through correlation analysis between the mapping vectors, in the kernel Hilbert space, of the new input sample and the existing samples in the kernel dictionary; the discarded sample is then determined in combination with its corresponding output to achieve dynamic sample updates.

Contributions
The main contributions of this paper are summarized as follows:
• We propose a novel sparse sliding-window KRLS algorithm. To precisely control the number of samples, we introduce a sample budget as a size restriction. When the dictionary is smaller than the sample budget, we add the new sample directly to the dictionary; otherwise, we choose the best sample to discard according to our proposed new criterion.
• To differentiate the value of samples collected at different times, we introduce a forgetting matrix. By setting different forgetting values for samples collected at different times, we quantify the time value of the samples: an older sample has a smaller forgetting value, meaning that its time value is smaller. In this way, we consider both the correlation of samples and their time value when discarding old samples.
• Regarding our new method for discarding old samples, we construct a candidate set from which we decide which sample to discard. The candidate set is obtained by adding the samples whose kernel functions with the new sample are larger than a threshold, i.e., those highly correlated with the new sample. We then compute a weighted estimate of the output values of these samples and decide which sample to discard on the basis of the deviation between each sample's output and the estimated value.

System Model
We considered two typical user movement scenarios: an urban road, where the moving speed of users is 60 km/h, and a highway, where the moving speed of users is 120 km/h. As shown in Figure 1, in TDD systems, the user sends an SRS, and the base station runs channel estimation algorithms such as LS and MMSE. The channel matrix is first estimated in the frequency domain; then, the channel frequency response is transformed into the time domain by the IDFT. The noise is evenly distributed over the whole time domain and can easily be suppressed using a raised-cosine window; therefore, the effect of noise is very small compared to that of the user's high mobility. Due to channel reciprocity in TDD systems, uplink CSI can be directly used in the design of downlink beamforming to realize cooperative signal processing. However, the nonstationary fast-fading characteristics of mobile environments bring challenges to multi-input multi-output (MIMO) communications. When the terminal is moving at a high speed, the Doppler frequency shift grows and the time variability of the channel is severe. The measured uplink CSI cannot represent the real channel state of the downlink slots, resulting in a mismatch between the measured CSI and the actual channel, and hence in performance degradation.

As shown in Figure 2, during the SRS reporting cycle, the user can be regarded as moving in a fixed direction at a fixed speed. In two adjacent time slots, due to the short moving distance of the user, the amplitudes and phases of the direct and scattered components have a certain correlation. When the user speed is low, an AR-based channel prediction method can achieve good performance by exploiting the linear correlation between adjacent time slots. However, when the user is moving at a high speed, the channel correlation between adjacent time slots presents nonlinear characteristics, and kernel methods are needed to exploit them.
The user sends the SRS in special time slots, and the BS performs channel measurement and estimation to obtain the channel matrix. As shown in Figure 3, the channel matrix measured by the BS in the i-th SRS period is denoted by H_i ∈ C^(N_r × N_t), where N_t and N_r represent the numbers of antennas of the BS and the mobile UE, respectively. The real and imaginary parts are processed separately in the subsequent algorithm, and H represents a real matrix in the later representation. The prediction order p means that each channel matrix is related to the channel matrices measured in the previous p SRS periods, so the input vector of the prediction system is

u_i = [vec(H_{i−p})^T, vec(H_{i−p+1})^T, …, vec(H_{i−1})^T]^T.

There are some complex and unknown dependencies between u_i and H_i, which the proposed kernel method aims to learn.
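As an illustration, input–output pairs of this form can be assembled from a sequence of measured channel matrices as follows (a minimal sketch in Python; the stacking layout and function names are our assumptions, not the paper's exact construction):

```python
import numpy as np

def build_input_vectors(H_seq, p):
    """Stack the p most recent vectorized channel matrices into one input
    vector per time step (illustrative sketch; the exact input layout is an
    assumption, not taken verbatim from the paper)."""
    # H_seq: list of real Nr x Nt channel matrices, one per SRS period.
    samples = []
    for i in range(p, len(H_seq)):
        # Input: concatenation of the p previous vectorized channels.
        u_i = np.concatenate([H_seq[i - k].ravel() for k in range(1, p + 1)])
        # Target: the entries of the current channel matrix.
        y_i = H_seq[i].ravel()
        samples.append((u_i, y_i))
    return samples

# Toy example: five 2x2 "channel matrices".
H_seq = [np.ones((2, 2)) * t for t in range(5)]
pairs = build_input_vectors(H_seq, p=2)
```

In practice, each scalar output entry would be predicted by its own kernel regressor over the shared input vector.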

Traditional KRLS Algorithm
In this section, we introduce the traditional KRLS algorithm and several extensions to the KRLS algorithm.

Traditional KRLS Algorithm
Assume a set of ordered input–output pairs {(u_j, y_j)}, j = 1, …, m, where m is the total number of samples, u_j ∈ R^m are m-dimensional input vectors, and y_j ∈ R are the outputs. We call D a dictionary that records the input–output pairs collected before the i-th time slot. First, according to Mercer's theorem, we can adopt a nonlinear function ϕ(·) : R^m → R^c to transform the data u_j into a high-dimensional feature space. The corresponding kernel function is κ(u, v) = ⟨ϕ(u), ϕ(v)⟩. We need to minimize the cost function

J(w) = Σ_{j=1}^{i} |y_j − w^T ϕ(u_j)|^2. (2)

By minimizing the cost function (2), we can obtain the weight vector w_i. However, ϕ(·) is a high-dimensional mapping, and the exact mapping function is unknown. To avoid overfitting when the number of samples is small, we use L2 regularization, so the cost function is reformulated as

J(w) = Σ_{j=1}^{i} |y_j − w^T ϕ(u_j)|^2 + λ‖w‖^2. (3)

By letting ∂J/∂w_i = 0, we can obtain that the optimal weight lies in the span of the mapped samples:

w_i = Σ_{j=1}^{i} α_j ϕ(u_j) = Φ_i α(i), Φ_i = [ϕ(u_1), …, ϕ(u_i)]. (4)

By substituting (4) into (3), the cost function can be reformulated as

J(α) = ‖y_i − K_i α‖^2 + λ α^T K_i α, (5)

where K_i is the kernel matrix and the element located at the j-th row and k-th column of K_i is κ(u_j, u_k). The minimizer is α(i) = Q(i) y_i with Q(i) = [λI + K_i]^{−1}. Then, the inverse Q(i) can be obtained recursively as

Q(i) = (1/r(i)) [ Q(i−1)r(i) + z(i)z(i)^T   −z(i) ; −z(i)^T   1 ], (6)

which can be solved using the inversion of the partitioned matrix, where

z(i) = Q(i−1)h(i), r(i) = λ + κ(u_i, u_i) − z(i)^T h(i), h(i) = [κ(u_i, u_1), …, κ(u_i, u_{i−1})]^T. (7)

By relying on the kernel trick, the traditional KRLS algorithm can deal with nonlinear problems by nonlinearly transforming data into a high-dimensional reproducing kernel Hilbert space, similarly to other techniques such as support vector machines (SVMs). Compared to SVMs, it avoids large-scale high-dimensional computing through iterative computations. However, the model of the traditional KRLS algorithm grows linearly with the number of processed data, leading to growing complexity for each consecutive update if no additional measures are taken. In order to render online kernel algorithms feasible, growth is typically slowed down by approximately representing the solution using only a subset of bases that are considered to be relevant according to a chosen criterion.
In our proposed SSW-KRLS algorithm, we take similar measures to avoid an increase in computation, as discussed in detail in Section 4.
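The recursions above can be sketched as a minimal online KRLS implementation (illustrative code with a Gaussian kernel; it follows the standard regularized KRLS update, not the paper's exact implementation):

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

class KRLS:
    """Minimal regularized KRLS: alpha(i) = [lam*I + K_i]^-1 y_i, updated
    recursively by partitioned-matrix inversion (a sketch, not the paper's
    exact code)."""
    def __init__(self, lam=1e-2, sigma=1.0):
        self.lam, self.sigma = lam, sigma
        self.dict_u, self.Q, self.alpha, self.y = [], None, None, []

    def predict(self, u):
        if not self.dict_u:
            return 0.0
        k = np.array([gaussian_kernel(u, v, self.sigma) for v in self.dict_u])
        return float(k @ self.alpha)

    def update(self, u, y):
        if not self.dict_u:
            k_uu = gaussian_kernel(u, u, self.sigma)
            self.Q = np.array([[1.0 / (self.lam + k_uu)]])
            self.dict_u, self.y = [u], [y]
            self.alpha = self.Q @ np.array(self.y)
            return
        h = np.array([gaussian_kernel(u, v, self.sigma) for v in self.dict_u])
        z = self.Q @ h                          # z(i) = Q(i-1) h(i)
        r = self.lam + gaussian_kernel(u, u, self.sigma) - z @ h
        e = y - h @ self.alpha                  # prediction error e(i)
        n = len(self.dict_u)
        Q = np.empty((n + 1, n + 1))            # partitioned-matrix inverse
        Q[:n, :n] = self.Q + np.outer(z, z) / r
        Q[:n, n] = -z / r
        Q[n, :n] = -z / r
        Q[n, n] = 1.0 / r
        self.Q = Q
        self.alpha = np.concatenate([self.alpha - z * e / r, [e / r]])
        self.dict_u.append(u)
        self.y.append(y)
```

After each update, Q remains the exact inverse of lam*I + K over the current dictionary, which is what makes the per-step cost quadratic rather than cubic in the dictionary size.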

Extensions to KRLS Algorithm
The kernel recursive least-squares method is one of the most efficient online kernel methods, and it achieves good performance in nonlinear fitting and prediction. However, the bottleneck problem is that the network structure grows with the training samples, which leads to insufficient memory and computational complexities when processing continuously incoming signals. In order to solve this problem and render the online kernel method feasible, researchers have proposed various sparseness methods or criteria, such as approximate linear dependency (ALD), the novelty criterion (NC), the surprise criterion (SC) and the coherence criterion (CC). On the basis of these sparseness methods or criteria, only new input samples that meet the prerequisites are used as training samples. Thus, the growth of network structure is effectively slowed down. Table 1 shows several common sparseness criteria.
ALD
Calculate the approximate linear dependency of the new sample on the existing samples in the kernel dictionary: δ(n) = min_c ‖Σ_j c_j ϕ(x_j) − ϕ(x_n)‖^2. If δ(n) ≥ τ, add the new sample to the dictionary.

NC
Calculate the minimal distance between the new and existing samples in the kernel dictionary: dis = min_j ‖x_n − x_j‖. If dis ≥ τ, add the new sample to the dictionary.

SC
According to information theory, based on a prior joint Gaussian distribution, the amount of information brought by the new sample is S(n) = − ln p(x_n, d_n | D_{n−1}), where p(x_n, d_n | D_{n−1}) is the posterior probability distribution of (x_n, d_n) given D_{n−1}. If τ_1 < S(n) < τ_2, add the new sample to the dictionary; if S(n) > τ_2, discard the new sample.

CC
Calculate the maximal kernel function between the new and existing samples: µ = max_j |κ(x_n, x_j)|. If µ < τ, add the new sample to the dictionary.
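As a toy illustration, the NC and CC admission tests from Table 1 might be implemented as follows (the Gaussian kernel choice and function names are ours):

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def admit_nc(u_new, dictionary, tau):
    """Novelty criterion: admit the new sample if its minimal Euclidean
    distance to the dictionary is at least tau."""
    dis = min(np.linalg.norm(u_new - u) for u in dictionary)
    return dis >= tau

def admit_cc(u_new, dictionary, tau, sigma=1.0):
    """Coherence criterion: admit the new sample if its maximal kernel
    value (coherence) with the dictionary stays below tau."""
    mu = max(gaussian_kernel(u_new, u, sigma) for u in dictionary)
    return mu < tau
```

Both tests admit a sample only when it is sufficiently dissimilar from the dictionary, which is why neither bounds the dictionary size directly.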
The performance of the above methods in filtering samples depends largely on the selected threshold. As time goes by, the number of samples increases slowly and eventually stabilizes, while the update of the model parameters slows down. This is not conducive to tracking time-varying channels. Furthermore, when the user moves into a new environment, outdated samples are retained, which is detrimental to channel prediction.
An effective way to keep the dictionary updated while precisely controlling its size is to use a sliding window. A simple implementation discards the oldest sample each time a new sample is collected, but this lacks any judgement of the correlation between samples. Sparseness based purely on time information is unreliable and may result in unstable prediction performance when the window is small. In order to solve this problem, we propose a new algorithm, SSW-KRLS, in which we take both the time value of samples and the correlation between them into consideration.

Proposed SSW-KRLS Algorithm
Assume that there is a set of input–output pairs D = {u_i, y_i}, i = 1, …, L, where L is the total number of samples, u_i ∈ R^m are m-dimensional input vectors, and y_i ∈ R is the output. In the traditional KRLS algorithm, each time a new sample arrives, the size of the dictionary increases, leading to growing computation and memory requirements. To solve this problem, we use a sliding-window approach to keep the size of the kernel dictionary within a fixed budget. Our criterion for discarding old samples is based on the correlation between the existing samples in the kernel dictionary and the new sample. Moreover, we introduce a forgetting factor to exponentially weigh older data by scaling them, so as to track the dynamic characteristics of the channel.
In our proposed SSW-KRLS algorithm, we solve the following least-squares cost function:

J(w) = Σ_{j=1}^{L} β^{L−j} |y_j − w^T ϕ_j|^2 + λ‖w‖^2, (9)

where β is the forgetting factor, i is the iteration number, y_i = (y_1, …, y_i)^T is the output vector, w is the weight vector, B(i) = diag(β^{L−1}, β^{L−2}, …, 1) is the forgetting matrix, ϕ_j = ϕ(u_j) is the transformation of u_j, and λ is the regularization parameter. The optimal w* again lies in the span of the mapped samples, w* = Φ_i α(i). We reformulate the solution in terms of the coefficient vector:

α(i) = Q(i) ȳ_i, Q(i) = [λB(i) + B(i)K_i]^{−1}, (11)

where ȳ_i = B(i)y_i is the exponentially weighted output signal. The solutions for Q(i) are different under two cases in the i-th iteration: one case is that the size of the dictionary increases, and the other is that the size of the dictionary remains unchanged. Cases I and II are discussed in Sections 4.2 and 4.3, respectively. In Section 4.1, we introduce how to update the dictionary when a new sample arrives. The method differs depending on whether the size of the dictionary has reached the fixed sample budget, which is denoted by M. In particular, when the dictionary is not full, we may still discard an old sample on the basis of correlation analysis between the new and existing samples in the dictionary in order to slow down the growth of the dictionary size. When the dictionary is full, we construct a candidate discard set containing the samples highly correlated with the new sample, and determine which sample to discard on the basis of their corresponding outputs. The whole process of our proposed algorithm is shown in Figure 4.
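For reference, the exponentially weighted solution α(i) = Q(i)ȳ_i over a window can be evaluated directly in batch form, which is useful for checking recursive updates (an illustrative sketch; names are ours):

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def batch_ssw_weights(U, y, beta=0.98, lam=1e-2, sigma=1.0):
    """Direct (non-recursive) evaluation of alpha = [lam*B + B*K]^-1 (B y)
    for a window of samples, matching the cost function above; useful only
    as a reference for checking recursive updates, since its cost is cubic
    in the window length."""
    L = len(U)
    K = np.array([[gaussian_kernel(U[i], U[j], sigma) for j in range(L)]
                  for i in range(L)])
    # Forgetting matrix B = diag(beta^(L-1), ..., beta, 1):
    B = np.diag([beta ** (L - 1 - j) for j in range(L)])
    Q = np.linalg.inv(lam * B + B @ K)
    return Q @ (B @ y)
```

With beta = 1 the forgetting matrix reduces to the identity and the expression collapses to the ordinary regularized KRLS solution.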

How to Optimally Discard an Old Sample
The cosine value is usually adopted to judge the correlation of two vectors. In the KRLS algorithm, however, the cosine value is calculated in the kernel Hilbert space. Suppose u_i is a new sample and u_j is an existing sample in the dictionary. Let k(u_j) denote the kernel vector of u_j ∈ D, k(u_j) = {κ(u_j, u_n)}_{n≠i,j}. To measure the correlation between an existing sample and the new sample, we calculate the cosine value of k(u_i) and k(u_j) for every u_j ∈ D as

cos(k(u_i), k(u_j)) = ⟨k(u_i), k(u_j)⟩ / (‖k(u_i)‖ ‖k(u_j)‖).

When the size of the dictionary has not reached the budget, we look for the sample with the highest correlation with the new sample. The existing sample u_j* with the highest cosine value is the most probable to be discarded, where j* = arg max_j cos(k(u_i), k(u_j)). We set a threshold τ and discard sample u_j* if cos(k(u_i), k(u_j*)) ≥ τ. The updated dictionary is

D(i) = (D(i−1) \ {u_j*}) ∪ {u_i}.

In the other case, when the size of the dictionary has reached the fixed budget, one sample must be discarded from the kernel dictionary in each time slot. In our strategy, we first construct a candidate discard set on the basis of correlation with the new sample, and then determine the optimal discarded sample according to its output value.
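A sketch of this cosine test in code (illustrative; for simplicity we compute kernel vectors over the full dictionary, whereas the paper excludes the entries n = i, j, and the Gaussian kernel is our choice):

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def most_correlated_index(u_new, dictionary, sigma=1.0):
    """Return the index of the dictionary sample whose kernel vector has
    the largest cosine similarity with the kernel vector of the new
    sample, together with that cosine value (variable names are ours)."""
    def kvec(u):
        return np.array([gaussian_kernel(u, v, sigma) for v in dictionary])
    k_new = kvec(u_new)
    best_j, best_cos = -1, -np.inf
    for j, u_j in enumerate(dictionary):
        k_j = kvec(u_j)
        c = (k_new @ k_j) / (np.linalg.norm(k_new) * np.linalg.norm(k_j))
        if c > best_cos:
            best_j, best_cos = j, c
    return best_j, best_cos
```

The returned cosine would then be compared against the threshold τ before discarding.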
Candidate discard set S is composed of the samples that have high correlation with the new sample. Specifically, an existing sample u_j ∈ D is added to S if κ(u_i, u_j) > ε, where u_i is the new sample and ε is a threshold. Among the samples in S, the one with the smallest value for prediction may be either the one carrying the most similar information to the new sample, or an atypical one with little probability of occurring. We determine the discarded sample in combination with its corresponding output as follows.
Suppose that the candidate discard set is S = {u_1, u_2, …, u_n}. The weighted average of the output values can be obtained as

ȳ = Σ_{j=1}^{n} w_j y_j,

where the weights w_j are normalized so that Σ_j w_j = 1 (e.g., in proportion to the kernel similarities κ(u_i, u_j)), and n is the number of samples in S.
The deviation between the output value of u_j ∈ S and the weighted average output value is e_j = y_j − ȳ. The sample with the maximal deviation e_max is atypical and may have occurred with small probability, while the one with the minimal deviation e_min carries information similar to that of the new sample. Suppose that the sample with the maximal deviation is u_{j_max} and the sample with the minimal deviation is u_{j_min}. We choose one of these two samples to discard according to the product of e_max and e_min. If the product of e_max and e_min is larger than threshold τ, this indicates that e_max is large enough, so we discard u_{j_max}; otherwise, e_min is small enough, and we discard u_{j_min}. The updated dictionary is

D(i) = (D(i−1) \ {u_{j_max}}) ∪ {u_i} or D(i) = (D(i−1) \ {u_{j_min}}) ∪ {u_i}, respectively.

Case I: The Size of D(i) Is Changed
As mentioned before, the prediction process depends on whether the size of the dictionary has increased or not. In Case I, when the size of the dictionary increases, the kernel matrix grows as

K_i = [ K_{i−1}  h(i) ; h(i)^T  κ(u_i, u_i) ], h(i) = [κ(u_i, u_1), …, κ(u_i, u_{i−1})]^T,

and the forgetting matrix becomes B(i) = diag(βB(i−1), 1). Combining B(i) and y_i, we obtain the exponentially weighted output signal ȳ_i = [βȳ_{i−1}^T, y_i]^T. With the inversion of the partitioned matrix λB(i) + B(i)K_i, we can obtain Q(i) in recursive form:

Q(i) = (1/r(i)) [ (Q(i−1)r(i) + z_B(i)z(i)^T)/β   −z_B(i) ; −z(i)^T/β   1 ], (21)

where

z(i) = Q(i−1)^T h(i), z_B(i) = Q(i−1)B(i−1)h(i), r(i) = λ + κ(u_i, u_i) − h(i)^T z_B(i), (22)

and the coefficient vector is updated as

α(i) = [ α(i−1) − z_B(i)r(i)^{−1}e(i) ; r(i)^{−1}e(i) ], (23)

where e(i) = y_i − h(i)^T α(i−1) is the prediction error in the i-th time slot.
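The Case I update can be checked numerically against a direct inverse (our reconstruction of the partitioned-matrix update; treat the exact β scalings as assumptions rather than the paper's verbatim equations):

```python
import numpy as np

def grow_Q(Q_prev, B_prev, h, k_uu, beta=0.98, lam=1e-2):
    """One Case-I update of Q(i) = [lam*B(i) + B(i)*K_i]^-1 by partitioned
    inversion, where B(i) = diag(beta*B(i-1), 1) and K_i is K_{i-1} grown
    by the new kernel column h and diagonal entry k_uu."""
    z = Q_prev.T @ h                      # z(i)
    z_B = Q_prev @ (B_prev @ h)           # z_B(i)
    r = lam + k_uu - h @ z_B              # r(i)
    n = Q_prev.shape[0]
    Q = np.empty((n + 1, n + 1))
    Q[:n, :n] = (Q_prev * r + np.outer(z_B, z)) / (beta * r)
    Q[:n, n] = -z_B / r
    Q[n, :n] = -z / (beta * r)
    Q[n, n] = 1.0 / r
    return Q
```

Because the update is an exact blockwise inverse, repeatedly growing Q this way reproduces the direct inverse of lam*B + B*K at every step, at quadratic rather than cubic cost.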

Case II: The Size of D(i) Is Unchanged
In the case when the size of the dictionary does not change, the information of the discarded sample u_j* in Q(i−1) needs to be deleted. The information of u_j* lies in the j*-th column and j*-th row of Q(i−1). In order not to disturb the update of the matrix, we move the j*-th column and j*-th row of Q(i−1) into the first column and first row, obtaining Q̃(i−1). Correspondingly, we apply the same permutation to K_{i−1} and obtain K̃(i−1). Since Q(i−1) = [λB(i−1) + B(i−1)K_{i−1}]^{−1}, after the movement, each j-th column with j < j* should additionally be multiplied by β to keep the forgetting weights consistent.
Suppose that the matrices after removing the first column and first row of Q̃(i−1) and K̃(i−1) are associated with the reduced dictionary. Partitioning

Q̃(i−1) = [ d  e^T ; e  F ],

according to Appendix B, the inverse corresponding to the reduced dictionary (with u_j* removed) is

Q̄(i−1) = F − ee^T/d. (24), (25)

The new matrix Q(i) can then be formulated as a partitioned matrix by appending the new sample, and Q(i) is obtained using the same partitioned-matrix inversion as in Case I, with Q̄(i−1) in place of Q(i−1). Then, the weight coefficient is updated as

α(i) = Q(i)ȳ_i,

where y_i = [y_1, y_2, …, y_i]^T is composed of the output values of the samples and ȳ_i = B(i)y_i. Lastly, we can obtain the prediction value for the next time slot as ŷ_{i+1} = k(i)^T α(i).
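The Appendix B downdate used in Case II can be sketched and verified numerically as follows (illustrative code; the function name is ours):

```python
import numpy as np

def remove_first(Q_tilde):
    """Appendix-B downdate: given Q_tilde = E^-1 with the sample to delete
    permuted into the first row/column, return the inverse of E with that
    row and column removed, i.e. C^-1 = F - e e^T / d."""
    d = Q_tilde[0, 0]
    e = Q_tilde[1:, 0]
    F = Q_tilde[1:, 1:]
    return F - np.outer(e, e) / d
```

This rank-one correction avoids re-inverting the reduced matrix from scratch, keeping the per-step cost quadratic in the dictionary size.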

Performance Evaluation
Based on the analysis above, we summarize the algorithmic steps as Algorithm 2. In this section, we show the simulation results of our proposed SSW-KRLS algorithm and compare its performance to that of the ALD-KRLS algorithm as a baseline. The basic simulation parameters are listed in Table 2. We adopted a 3D urban macro scenario and considered a typical setting with 30 kHz of subcarrier spacing. We considered 20 MHz of bandwidth containing 51 resource blocks. Our adopted channel model was the CDL-A channel model, and the DL precoder was RZF.

Algorithm 2: SSW-KRLS
Step 1: When a new sample (u_i, y_i) arrives, if dictionary D(i−1) has not reached budget M, go to Step 2; otherwise, go to Step 3.
Step 2: For each sample u_j in D(i−1), compute cos(k(u_i), k(u_j)). Find j* = arg max_j cos(k(u_i), k(u_j)). If max_j cos(k(u_i), k(u_j)) > τ, discard u_j*. Then, add new sample u_i into the dictionary. Turn to Step 4.
Step 3: Construct candidate discard set S. Suppose S = {u_1, u_2, …, u_n}. Then, calculate the weighted average output ȳ of the samples. For each sample u_j, calculate e_j = y_j − ȳ. Find j_max = arg max_j e_j and j_min = arg min_j e_j. If e_min · e_max > τ, D(i) = (D(i−1) \ {u_{j_max}}) ∪ {u_i}; if e_min · e_max ≤ τ, D(i) = (D(i−1) \ {u_{j_min}}) ∪ {u_i}. Turn to Step 4.
Step 4: If D(i) is larger than D(i−1), perform Step 5; otherwise, perform Step 6.
Step 5: Calculate Q(i) according to (21), then calculate the intermediate quantities z_B(i), r(i), and z(i) according to (22). Calculate α(i) according to (23). The prediction value for the next time slot is ŷ_{i+1} = k(i)^T α(i).
Step 6: For Q(i−1), move the j*-th column and j*-th row into the first column and first row to obtain Q̃(i−1); calculate the reduced inverse by (24) and (25). Update Q(i) according to (26) and (27). Then, the prediction value is obtained on the basis of (28).

Figure 5 shows the normalized mean squared error (NMSE) performance of the different algorithms. We show the performance of the ALD-KRLS and SSW-KRLS algorithms at the 60 and 120 km/h velocity levels for all UEs. When the UE speed was 60 km/h, the prediction algorithms were more accurate than at a UE speed of 120 km/h. Regarding the performance of the SSW-KRLS algorithm, we set different sample budgets: M = 30, 90, 150. The algorithm performed better with a higher sample budget. The SSW-KRLS algorithm with sample budget M = 30 performed better than the ALD-KRLS algorithm with v = 0.001, and the SSW-KRLS algorithm with sample budget M = 90 greatly outperformed the ALD-KRLS algorithm. However, the performance improved only insignificantly when the sample budget was changed from 90 to 150.

Figures 7 and 8 show the mean rate of mobile users under the different algorithms at speeds of 60 and 120 km/h, respectively. Comparing the figures at the two speed levels shows that the user with the lower speed had a higher mean rate. In addition, performance could be enhanced greatly through the use of the channel prediction algorithm.
In particular, our proposed SSW-KRLS algorithm with sample budgets M = 150 and M = 90 achieved better performance than that of the ALD-KRLS algorithm with v = 0.0001, and, for our proposed SSW-KRLS algorithm, a higher sample budget brought about better performance. Moreover, users with more antennas showed better performance in all cases.

Conclusions
This paper proposed a new sparse sliding-window KRLS algorithm in which a candidate discard set is selected through correlation analysis between the mapping vectors, in the kernel Hilbert space, of the new input sample and the existing samples in the kernel dictionary. The discarded sample is then determined in combination with its corresponding output to achieve dynamic sample updates. Specifically, the proposed SSW-KRLS algorithm, which maintains the size of the kernel dictionary within the sample budget, requires a fixed amount of memory and computation per time step, incorporates regularization, and achieves online prediction. Moreover, in order to sufficiently track strongly changing dynamic characteristics, a forgetting factor was incorporated into the proposed algorithm. Numerical simulations demonstrated that, under a realistic 3GPP channel model in a rich scattering environment, our proposed algorithm achieved performance superior, in terms of both predictive accuracy and kernel dictionary size, to that of the ALD-KRLS algorithm. The NMSE of the channel prediction of the SSW-KRLS algorithm with M = 90 was about 2 dB lower than that of the ALD-KRLS algorithm with v = 0.001, while the kernel dictionary was 17% smaller.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
For a non-singular matrix A, if a new column and a new row are added to the matrix to obtain E as follows:

E = [ A  b ; b^T  d ],

suppose the inverse of E is formulated as

E^{−1} = [ F  e ; e^T  f ].

Then, E^{−1} can be calculated by solving E E^{−1} = I, from which E^{−1} can be obtained as

f = (d − b^T A^{−1} b)^{−1}, e = −A^{−1} b f, F = A^{−1} + A^{−1} b f b^T A^{−1}.

Appendix B
For a non-singular matrix E, if the first column and the first row are removed from the matrix to obtain C:

E = [ a  b^T ; b  C ],

suppose the inverse of E is formulated as

E^{−1} = [ d  e^T ; e  F ].

Then, C^{−1} can be calculated by solving E E^{−1} = I, from which C^{−1} can be obtained as C^{−1} = F − ee^T/d.