Downlink Channel Estimation in Massive Multiple-Input Multiple-Output with Correlated Sparsity by Overcomplete Dictionary and Bayesian Inference

We exploited the temporal correlation of channels in the angular domain for the downlink channel estimation in a massive multiple-input multiple-output (MIMO) system. Based on the slow time-varying channel supports in the angular domain, we combined the channel support information of the downlink angular channel in the previous timeslot into the channel estimation in the current timeslot. A downlink channel estimation method based on variational Bayesian inference (VBI) and overcomplete dictionary was proposed, in which the support prior information of the previous timeslot was merged into the VBI for the channel estimation in the current timeslot. Meanwhile the VBI was discussed for a complex value in our system model, and the structural sparsity was utilized in the Bayesian inference. The Bayesian Cramér–Rao bound for the channel estimation mean square error (MSE) was also given out. Compared with other algorithms, the proposed algorithm with overcomplete dictionary achieved a better performance in terms of channel estimation MSE in simulations.


Introduction
Massive multiple-input multiple-output (MIMO) is the key technology for next generation wireless communication.The large number of antennas enable high spectrum efficiency and lower power consumption [1].To get these benefits, the base station (BS) needs to acquire the channel stated information (CSI) for uplink and downlink.Pilot-based channel estimation is widely used in wireless communication systems.In the time division duplex (TDD) system, the channel reciprocity is used to get the CSI by only estimating the uplink channel at BS.In the frequency division duplex (FDD) system, the channel reciprocity cannot be used directly.In FDD massive MIMO system it is challenging to get the downlink CSI with the conventional feedback scheme.In the conventional feedback scheme each user estimates its channel and then feeds back the estimated CSI to the BS.The pilot and feedback overheads are high for massive MIMO, since they are scaling linearly with the number of antennas.Hence, it is important to design an efficient downlink channel estimation and feedback scheme for a FDD massive MIMO system.By exploiting the sparsity in massive MIMO channel, compressed sensing (CS) was applied in the channel estimation and feedback.The users could feed the compressed training measurements back to the BS, and an orthogonal matching pursuit (OMP) was used for downlink CSI recovery in [2].In [3] the modified basis pursuit (MBP) was proposed by utilizing the partial priori signal support information to improve the recovery performance.In [4] the support information of a signal in the discrete fourier transform domain was incorporated into the weighted l 1 minimization approach for CS recovery, which could reduce the number of measurements by the size of the known part of support.In [5] a three-level weighting scheme based on the support information was used for the weighted l 1 minimization and the simulation results showed superiority.In [6] we exploited the reciprocity between uplink and downlink channels in the angular domain, and diagnosed the supports of the downlink channel from the estimated uplink channel, and proposed a weighted subspace pursuit (SP) channel estimation algorithm for FDD massive MIMO.It can be seen that CS was effective in the channel estimation for massive MIMO.
However, most of these algorithms need the sparsity level in the estimation algorithm, which is not practical in engineering scenarios.The Bayesian framework can be applied to the compressive channel estimation.In [7], Bayesian estimation of sparse massive MIMO channel was developed in which neighboring antennas shared among each other their information about the channel support.In [8] a variational expectation maximization strategy was used for massive MIMO channel estimation, and a Gaussian mixture prior model was designed to capture the individual sparsity for each channel and the joint sparsity among users.In [9] a sparse Bayesian learning algorithm was proposed for FDD massive MIMO channel estimation with arbitrary 2D-array.By the Bayesian framework in compressive channel estimation the sparsity level is unnecessary, and it has relatively better recovery performance.Additionally, there exists angular reciprocity in massive MIMO.For example, the channel covariance matrices for uplink and downlink are reconstructed by making use of the angle reciprocity between uplink and downlink channels in [10].Hence it is promising to apply the angular reciprocity and Bayesian framework in the compressive massive MIMO channel estimation.
Additionally, there exists angular reciprocity in the FDD massive MIMO.There is also time correlation of channels.In [11] a differential compressive feedback in FDD massive MIMO was proposed based on the channel impulses response (CIR) between timeslots, which were slow time-varying and sparse, and the differential CIR between two CIRs in adjacent timeslots was sparse.Inspired by the sparsity in the angular domain and time correlation of channels, the correlated angular sparsity can also be exploited for massive MIMO channel estimation.
In this paper we proposed a downlink channel estimation in a TDD/FDD massive MIMO system.The timeslots were divided into groups.In each group the estimated channel support information of the previous timeslot was utilized by the following timeslot.The correlated angular sparsity between timeslots in the downlink channel was utilized in the Bayesian inference for channel recovery.We transformed the complex sparse vector to the real sparse vector recovery by Bayesian inference, and the structural sparsity of the transformed real sparse vector was utilized.Meanwhile, the prior support information from the estimated channel in the previous timeslot was made use of in modeling the hidden hyperparameters in the Bayesian model.A Bayesian Cramér-Rao bound analysis is presented, and simulations are given out to verify the performance of the proposed algorithm.The main contributions were as follows: (1) a group-based channel estimation scheme was proposed, in which previous estimated channel support information was used as the priori information in the following timeslot due to the sparsity correlation; (2) priori information was merged into the Bayesian inference algorithm for channel recovery; (3) the Bayesian Cramér-Rao bound for the channel estimation mean square error (MSE) was analyzed.
The system model is illustrated in Section 2, while the proposed channel estimation algorithm based on Bayesian inference is presented in Section 3. The Bayesian Cramér-Rao bound (BCRB) for the channel estimation of mean square error (MSE) is given out in Section 4. Simulations and conclusions are presented in Sections 5 and 6.
In the paper, we used the following notations.Scalars, vectors and matrices were denoted by lower-case, boldface lower-case and boldface upper-case symbols.The probability density function of a given random variable was denoted by p(•).Gamma(x|a, b) was the Gamma probability density function (PDF) with shape parameters a and b for x, while Normal(x|c, d) was the Gaussian PDF with parameters mean c and variance d for x.Γ(•) was the Gamma function, and ln(•) was the logarithm function.Tr(•) stood for the trace operator.function.Tr(•) stood for the trace operator.a(•) denoted the expectation operation with the PDF of variable a.

System Model
We considered a massive MIMO TDD/FDD system with a single user, and assumed that the BS was equipped with N antennas and the user terminal (UT) had a single antenna.For the downlink channel estimation in the massive MIMO system, the BS transmitted the pilots to UT.The UT received the pilots and fed back the received signal to the BS directly.The received signal ( ) d t y at the UT in the t-th timeslot was written as where 1 ( ) is the downlink pilots, d T is the pilot length, is the downlink received power, is the received noise with each element to be i.i.d Gaussian with mean 0 and variance , was the channel dictionary for downlink channel which could be unitary dictionary or overcomplete dictionary ( > , their column vector had the form of steering vector with a different sampling angle), ( ) . In this paper we applied the overcomplete dictionary to present the sparse angular channel to get a better recovery performance.In the downlink channel estimation, we needed to obtain ˆ( ) d a t h the estimated downlink channel in the angular domain in the t-th timeslot.
By utilizing the sparse channel representation we then had For simplicity, the timeslot mark is omitted in the following equations.Since ( ) where Re(•) and Im(•) denote the real and imaginary parts respectively.For simplicity, we rewrote Equation (3) as where Re( ) Im( ) was sparse.However, there was leakage effect induced by dictionary mismatch which will have deteriorated the sparsity of the angular channel representation [12].When the movement velocity of UT was not very high, e.g., v = 12 km/h, and the typical timeslot duration τ = 0.5 ms, the movement distance of UT in one timeslot was 0.017 m.When the distance of UT and BS was 200 m, the angle change for the line of sight (LoS) transmission in one timeslot was 0.0049° which was much smaller than the sampling interval in the dictionary.For the non-LoS (NLoS) transmission, the angle change was also small which is discussed in Section 4.1.Hence the a (•) denoted the expectation operation with the PDF of variable a.

System Model
We considered a massive MIMO TDD/FDD system with a single user, and assumed that the BS was equipped with N antennas and the user terminal (UT) had a single antenna.For the downlink channel estimation in the massive MIMO system, the BS transmitted the pilots to UT.The UT received the pilots and fed back the received signal to the BS directly.The received signal y d (t) at the UT in the t-th timeslot was written as where is the received noise with each element to be i.i.d Gaussian with mean 0 and variance σ 2 , y d (t) ∈ C T d ×1 is the received signal at UT. Since in the massive MIMO there existed sparsity, when D d ∈ C N×M was the channel dictionary for downlink channel which could be unitary dictionary or overcomplete dictionary (M > N, their column vector had the form of steering vector with a different sampling angle), h d a (t) was the sparse representation with h d (t) = D d h d a (t).In this paper we applied the overcomplete dictionary to present the sparse angular channel to get a better recovery performance.In the downlink channel estimation, we needed to obtain ĥd a (t) the estimated downlink channel in the angular domain in the t-th timeslot.
By utilizing the sparse channel representation we then had For simplicity, the timeslot mark is omitted in the following equations.Since y d (t), h d a (t), and n d (t) are complex number vectors, we could rewrite Equation (2) into real number vectors as where Re(•) and Im(•) denote the real and imaginary parts respectively.For simplicity, we rewrote Equation (3) as where y = Re(y d ) On the other hand, we considered the meaning of sparse angular channel representation h d a (t).If the transmission angles were allocated exactly at the sampling points in the channel dictionary D d , then the corresponding coefficient in the h d a (t) was nonzero.If the path number was smaller than the antenna number, then h d a (t) was sparse.However, there was leakage effect induced by dictionary mismatch which will have deteriorated the sparsity of the angular channel representation [12].When the movement velocity of UT was not very high, e.g., v = 12 km/h, and the typical timeslot duration τ = 0.5 ms, the movement distance of UT in one timeslot was 0.017 m.When the distance of UT and BS was 200 m, the angle change for the line of sight (LoS) transmission in one timeslot was 0.0049 • which was much smaller than the sampling interval in the dictionary.For the non-LoS (NLoS) transmission, the angle change was also small which is discussed in Section 4.1.Hence the transmission angle change between two timeslots is very small if the transmission environment doesn't change dramatically, and there is correlation in the angular channel sparsity between adjacent timeslots.In other words, the information regarding the estimated angular channel in the previous timeslot could be utilized in the current channel estimation.
It was proven that the prior support information could improve the channel recovery performance [3][4][5][6].Hence in this paper we made use of the prior support information from the previous timeslot to improve the Bayesian channel estimation.In the following section we have discussed how to merge the prior information into the Bayesian inference algorithm for channel estimation.

Proposed Algorithm
We designed a three-layer hierarchical graphical model as shown in Figure 1.In the first layer, h was assigned a Gaussian prior distribution where h i and α i are the i-th entry in h and α respectively, p(h i α i ) = Normal(h i 0, α i ) and α i is the inverse variance of the Gaussian distribution .When h i is close to 0, then α i is very large, and vice versa.
In the second layer, we assumed a Gamma distribution as hyperpriors over the hyperparameters α i , and it can be presented as where Gamma(•) is the Gamma PDF, and the parameters a i and b i characterize the shape of Gamma PDF.For fixed a i , the larger b i is, the smaller α i is; then h i tends to be nonzero.In the sparse Bayesian learning a i and b i were set to be very small for non-informative hyperprior over α i [13].
In our model, we set a i to be constant with a predefined value, and we modeled b i as random parameters.In Figure 1 it could be found that the entries of y were divided into two sets by their indices, i.e., S, S +N and S, S +N c , where S was the set with channel support indices from the previous timeslot, and S +N was the set with each index in S added by N, since we converted the complex system model to the real system model as Equation (3).S, S +N c was the complementary set of S, S +N .
For example, in the (t − 1)-th timeslot, the positions of nonzero entries or called supports in h d a (t − 1) were S = {4, 5, 6}, then S +N = {4 + N, 5 + N, 6 + N}.The probable supports for h d a (t) in the current t-th timeslot can be assumed to be the same as those for previous (t − 1)-th timeslot for simplicity.On the other hand, we could have also diagnosed the probable channel supports further by taking the angle deviation and leakage effects into consideration.In this paper we adopted the support diagnosis algorithm, and the details can be found in [6].
For y j , j ∈ S, S +N , we employed a Gamma distribution over the hyperparameters b i in the third layer as Gamma where c and d characterize the shape of Gamma PDF.By the system model and assumptions for massive MIMO, we could use a Bayesian inference to perform the sparse channel recovery.According to the standard Bayesian inference [14], let ≜ , , , we have z , and ( ) p z is the product of PDF of , , and .In order to make use of the prior support information from the previous timeslot and the structure sparsity in Equation ( 4), we needed to make some modifications to the standard Bayesian inference.The main considerations for the modifications were as follows: (I) Since we rewrote Equation (2) as Equation ( 4), if ℎ , was nonzero, then i h and i N h + were nonzero simultaneously.Hence it was wise to assume that bi and bi+N were the same; (II) In the standard Bayesian learning ai and bi were set to be very small for non-informative hyperprior over .This assumption was valid if no prior information was provided.If the prior support information was available, such as that the support information of the previous timeslot could be used for channel estimation in the coming timeslot by sparsity correlation, it was wise to assume that the supports between adjacent timeslots were partially common.If the i-th element in the angular channel vector was nonzero, then the hyperparameter bi and bi+N tended to be variables rather than to be fixed small numbers, which meant only for the indices from the prior support set S the third layer prior model was adopted.
It can be seen that the consideration (II) was similar to [15].However, our proposed algorithm was extended for a complex number system and the structure sparsity was considered.However, on the other hand, the overcomplete dictionary was adopted in our algorithm.
The proposed uplink-aided downlink channel estimation based on Bayesian inference was as follows: (i) Update of p(ℎ ) According to Equation (8), by ignoring the terms which are independent of , we have According to the standard Bayesian inference [14], let z h, α, b , we have where constant is a constant used for p(z i ) normalization, p(y, z) is the joint pdf for h and z, and z i can be h, α, andb.We have p(y, z) = p(z y)p(y) .We assume p(z y) posterior independence among the hidden variables z, then p(z y) ≈ p(z) , and p(z) is the product of PDF of h, α, and b.
In order to make use of the prior support information from the previous timeslot and the structure sparsity in Equation ( 4), we needed to make some modifications to the standard Bayesian inference.The main considerations for the modifications were as follows: (I) Since we rewrote Equation (2) as Equation ( 4), if h d a,i was nonzero, then h i and h i+N were nonzero simultaneously.Hence it was wise to assume that b i and b i+N were the same; (II) In the standard Bayesian learning a i and b i were set to be very small for non-informative hyperprior over α i .This assumption was valid if no prior information was provided.If the prior support information was available, such as that the support information of the previous timeslot could be used for channel estimation in the coming timeslot by sparsity correlation, it was wise to assume that the supports between adjacent timeslots were partially common.If the i-th element in the angular channel vector was nonzero, then the hyperparameter b i and b i+N tended to be variables rather than to be fixed small numbers, which meant only for the indices from the prior support set S the third layer prior model was adopted.
It can be seen that the consideration (II) was similar to [15].However, our proposed algorithm was extended for a complex number system and the structure sparsity was considered.However, on the other hand, the overcomplete dictionary was adopted in our algorithm.
The proposed uplink-aided downlink channel estimation based on Bayesian inference was as follows: (i) Update of p(h) According to Equation (8), by ignoring the terms which are independent of h, we have where Λ = diag E α [α i ] , σ 2 is the noise variance in the system model, the vectors b and α are comprised by b i and α i respectively.Since p(y h) and p h α are a Gaussian distribution, then p h follows a Gaussian distribution with the mean µ and covariance φ given by (ii) Update of p(α) According to Equation ( 8), by ignoring the terms which are independent of α, we have where S is the estimated support set from the previous timeslot.Since the complex system model was converted in Equation (4).By (II), S +N {s i + N} was also the support set in the converted system model in Equation (4).For i ∈ S, S +N , b i is variable number, b i and b i+N were assumed to be the same, The same assumption was applied to h i and h i+N with ).In this way the structural sparsity was utilized.Since p(α|a, b ) is the Gamma distribution and p h α is the Gaussian distribution, p(α) is the Gamma distribution.Then p(α i ) is also the Gamma distribution with the updated parameters a i and b i given by a i = a i + 0.5 (13) Then the Bayesian inference for the channel estimation was executed iteratively among (i), (ii), and (iii).The details of the algorithm are summarized in step 3 of Algorithm 1.When the estimated channel vector h was recovered, we needed to convert it to the complex vector h d a according to Equation (3).
Algorithm 1 Downlink channel estimation with variational inference algorithm and overcomplete dictionary.Input: A, y, σ 2 Output: h

1.
Divide the timeslots into groups, and with each group comprised by t g timeslots.

2.
For the first timeslot in the group, use variational Bayesian inference (VBI) for channel estimation, and obtain the angular channel supports.

3.
For the rest of the timeslots in the group, utilize the support information from the previous timeslot for channel estimation one by one.The recovery algorithm in each timeslot is as follows: 3.1.Initialize α, a, b, c, d.
, where Λ = diag E α [α i ] , µ i is the i-th entry in µ, and φ i,i is the i-th diagonal entry in φ.

3.3.
Update a i and b i according to Equations ( 13) and ( 14) in (ii) ( a i and b i are the updated a i and b i , and a i and b i are the results from last iteration); then according to the property of the Gamma distribution variable, E α (α i ) = a i / b i .

3.4.
Update c and d according to Equations ( 16) and ( 17) in (iii) ( c and d are the updated c and d, and c and d are the results from last iteration); then according to the property of the Gamma distribution variable, E α ( b i ) = c/ d.

3.5.
Go to step 3.2 until stop criteria meets.

4.
Go back to step 2 for a new group of timeslots.
In a practical massive MIMO system, the transmission environment may change suddenly, in this way the correlation of sparsity between adjacent timeslots will deteriorate, and the previous channel support information cannot be utilized.On the other hand, the error will accumulate if the previous channel support information is utilized timeslot by timeslot.Hence, the initialization is important for the robustness and efficiency of the algorithm.As shown in Figure 2 divided the timeslots into groups, and each group was comprised of several timeslots.During the channel estimation for each group, the VBI was used for the channel estimation in the first timeslot, and then the proposed algorithm was executed for the remaining timeslots in which the channel support information of the previous timeslot was made use of by the current timeslot.This procedure is detailed in steps 1, 2 and 4 in Algorithm 1.
In a practical massive MIMO system, the transmission environment may change suddenly, in this way the correlation of sparsity between adjacent timeslots will deteriorate, and the previous channel support information cannot be utilized.On the other hand, the error will accumulate if the previous channel support information is utilized timeslot by timeslot.Hence, the initialization is important for the robustness and efficiency of the algorithm.As shown in Figure 2 divided the timeslots into groups, and each group was comprised of several timeslots.During the channel estimation for each group, the VBI was used for the channel estimation in the first timeslot, and then the proposed algorithm was executed for the remaining timeslots in which the channel support information of the previous timeslot was made use of by the current timeslot.This procedure is detailed in steps 1, 2 and 4 in Algorithm 1.

Sparsity Correlation Analysis
The UT movement distance was very small when the velocity of UT was small and the timeslot was 0.5 ms.The reflector for the transmission was static during the UT moving between timeslots.The ellipse geometry channel model is shown in Figure 3 Electronics 2019, 8, x FOR PEER REVIEW 8 of 17

Sparsity Correlation Analysis
The UT movement distance was very small when the velocity of UT was small and the timeslot was 0.5 ms.The reflector for the transmission was static during the UT moving between timeslots.The ellipse geometry channel model is shown in Figure 3.The line of sight (LoS) distance between BS and UT was dLos, the non-LoS (NLoS) distance by reflector between BS and UT was dNLoS, and the UT movement distance in one timeslot was dΔ.If the transmission path was still reflected by the same reflector as shown in Figure 3, the maximum and minimum NLoS distances from BS to UT between timeslots were dNLoS + dΔ and dNLoS − dΔ.The transmission angle change was Δθ.The distance between the reflector and BS was d1.By some mathematical manipulations shown in Appendix A, we got In order to illustrate the angle change Δθ during one timeslot, we assumed that dNLoS was 800 m, the velocity of UT was 14.4 km/h, and the typical timeslot duration τ = 0.5 ms, then the movement distance of UT in one timeslot was 0.02 m.By changing the distance between BS and reflector, as shown in Figure 4, the angle change was not more than 0.025°.It should be noted that when the LoS distance and the dNLoS were fixed, BS and reflector distance could not be arbitrary vales due to triangle inequality.Hence, the angle of arrival or departure changed slowly and then there was sparsity correlation among the angular channels for adjacent timeslots.In order to illustrate the angle change ∆ θ during one timeslot, we assumed that d NLoS was 800 m, the velocity of UT was 14.4 km/h, and the typical timeslot duration τ = 0.5 ms, then the movement distance of UT in one timeslot was 0.02 m.By changing the distance between BS and reflector, as shown in Figure 4, the angle change was not more than 0.025 • .It should be noted that when the LoS distance Electronics 2019, 8, 473 9 of 16 and the d NLoS were fixed, BS and reflector distance could not be arbitrary vales due to triangle inequality.Hence, the angle of arrival or departure changed slowly and then there was sparsity correlation among the angular channels for adjacent timeslots.

Bayesian Cramér-Rao bound analysis
In this section we have discussed the Bayesian Cramér-Rao bound (BCRB) for the channel estimation with the proposed algorithm.Let ≜ , , then the BCRB for the channel vector ̅ is given by the inverse of the Fisher information matrix J as: According to the system model in Section 2, , are independent, the Fisher information matrix J is block diagonal.We can rewrite ( , ) as Then the BCRB on the MSE of the estimated channel vector ′ is given by { } ( ) where 2 log ( , ) is the fisher information sub-matrix.Thus, we can obtain the Bayesian Cramér-Rao bound of the minimum mean square error for the estimated channel ′ as shown in Proposition 1. Proposition 1.The BCRB of MSE for the channel estimation ′ is represented as where S is the diagnosed support set, is the eigenvalues of ,and ∈ ℝ × , and a, , c, and are the parameters in the Bayesian model in Figure 1.When , → ∞ and = , according to the random matrix theory, we have

Bayesian Cramér-Rao Bound Analysis
In this section we have discussed the Bayesian Cramér-Rao bound (BCRB) for the channel estimation with the proposed algorithm.Let z h, σ , then the BCRB for the channel vector h is given by the inverse of the Fisher information matrix J as: According to the system model in Section 2, h, σ are independent, the Fisher information matrix J is block diagonal.We can rewrite p(y, z) as Then the BCRB on the MSE of the estimated channel vector h is given by where is the fisher information sub-matrix.Thus, we can obtain the Bayesian Cramér-Rao bound of the minimum mean square error for the estimated channel h as shown in Proposition 1. Proposition 1.The BCRB of MSE for the channel estimation h is represented as where S is the diagnosed support set, λ i is the eigenvalues of A where snr 1 = amin(d) The proof of proposition 1 is presented in Appendix B. From proposition 1, we can see that the MSE lower bound is related to the priori support size |S|, (1 + c)/min(d) and max(b) for the massive MIMO channel estimation.

Simulations
In the simulation, the support diagnosis algorithm in [6] was adopted, and we assumed that the transmission angle change between timeslots was within 1 degree.The pilot length was 50, and antenna number at the BS was 100.The channel was generated according to the spatial model as defined in 3GPP TR25.996.We compared our proposed algorithm with a unitary dictionary with a size of 100 and the overcomplete dictionary with a size of 150, 200, and 250, and compared this with a Bayesian sparse learning (SL) [16], weighted subspace pursuit (WSP) [6], weighted l 1 minimization (W-l 1 min) [5], weighted iteratively reweighted least square(W-IRLS), IRLS [17], compressive sampling matched pursuit (COSAMP) in [11], and l 1 minimization (l 1 min) [18].
In order to evaluate the channel estimation performance, we used a normalized mean-square error (MSE) between true and estimated channel vectors as follows: where T is the number of trials, ĥd and h d are the estimated and original channel vector, respectively for each trial.In the simulations the trial number T was 250.
In Figure 5 the overcomplete dictionary size was 150 in the proposed algorithm.It could be seen that when the unitary dictionary was used, our proposed algorithm outperformed WSP, COSAMP and IRLS, but was a little worse than W-l 1 with a small gap.However, when the overcomplete dictionary was used, our proposed algorithm outperformed other algorithms, but almost had the same performance as SL with a little performance improvement which could be seen in the zoomed-in subfigure.The overcomplete dictionary in the proposed algorithm can dramatically improve the MSE performance due to the fact that there are more atoms in the overcomplete dictionary than in the unitary dictionary which can improve the sparsity in the angular channel; however, it doesn't mean that the larger the overcomplete dictionary size is, the better performance it has, which is shown in Figure 6.In Figure 5 the overcomplete dictionary size was 150 in the proposed algorithm.It could be seen that when the unitary dictionary was used, our proposed algorithm outperformed WSP, COSAMP and IRLS, but was a little worse than W-l1 with a small gap.However, when the overcomplete dictionary was used, our proposed algorithm outperformed other algorithms, but almost had the same performance as SL with a little performance improvement which could be seen in the zoomedin subfigure.The overcomplete dictionary in the proposed algorithm can dramatically improve the MSE performance due to the fact that there are more atoms in the overcomplete dictionary than in the unitary dictionary which can improve the sparsity in the angular channel; however, it doesn't mean that the larger the overcomplete dictionary size is, the better performance it has, which is shown in Figure 6.We compared the performance of the proposed algorithm with different dictionary sizes in Figure 6.It can be seen that in the high SNR region the performance improved when an overcomplete dictionary was used, but the MSE performance gain did not improve when increasing the dictionary size.For example, the algorithm with a dictionary size of 150 had a relatively better performance than with a dictionary size of 100.However, the performances with a dictionary size of 200 and 250 almost gave the same trends as that with a dictionary size of 150.This was because the larger dictionary would induce angel ambiguity because the correlation of atoms increased.Hence, in the practical engineering, the dictionary size is not recommended to be very large.A large dictionary size is We compared the performance of the proposed algorithm with different dictionary sizes in Figure 6.It can be seen that in the high SNR region the performance improved when an overcomplete dictionary was used, but the MSE performance gain did not improve when increasing the dictionary size.For example, the algorithm with a dictionary size of 150 had a relatively better performance than with a dictionary size of 100.However, the performances with a dictionary size of 200 and 250 almost gave the same trends as that with a dictionary size of 150.This was because the larger dictionary would induce angel ambiguity because the correlation of atoms increased.Hence, in the practical engineering, the dictionary size is not recommended to be very large.A large dictionary size is computationally expensive and the benefit is limited.It also should be noted that in the low SNR region the MSE performance with a larger dictionary size did not always do better than those with a smaller dictionary size.For example, when the SNR was 0 dB, they hadsimilar performance.The reason was that in the low SNR region the estimated channel support of the previous timeslot was not accurate enough, and on the other hand larger dictionary size would have deteriorated the dictionary incoherence.
We compared the runtime and convergence performance of the proposed algorithm with a different dictionary size in Figure 7.The relative error was defined as the ratio of the difference of adjacent iteration results to the previous iteration result.It can be seen that the proposed algorithm with dictionary size 150 converged fast than with a dictionary size of 100.However, the improvement had its price, and the runtime for the proposed algorithm with dictionary size 150 was longer which meant that the computational complexity was higher with a larger dictionary size.Based on the simulation results shown in Figures 6 and 7, when the antenna at BS is 100, the dictionary size is recommended to be set at 150 or so to balance the performance improvement and computation complexity.We compared the runtime and convergence performance of the proposed algorithm with a different dictionary size in Figure 7.The relative error was defined as the ratio of the difference of adjacent iteration results to the previous iteration result.It can be seen that the proposed algorithm with dictionary size 150 converged fast than with a dictionary size of 100.However, the improvement had its price, and the runtime for the proposed algorithm with dictionary size 150 was longer which meant that the computational complexity was higher with a larger dictionary size.Based on the simulation results shown in Figures 6 and 7, when the antenna at BS is 100, the dictionary size is recommended to be set at 150 or so to balance the performance improvement and computation complexity.

Conclusions
In this paper we proposed a downlink channel estimation algorithm based on overcomplete dictionary and variational Bayesian inference.We converted the complex system model to a real model and exploited the correlation of angular channel sparsity in adjacent timeslots.In the algorithm we divided the timeslots into groups and made use of the channel support information of the previous timeslot to the channel estimation in the current timeslot within each group.The sparsity correlation and Bayesian Cramér-Rao bound for the MSE of channel estimation was analyzed.Compared with other recovery algorithms, such as WSP, IRLS, WIRLS, l1 min, W-l1 min and COSAMP, our proposed algorithm with overcomplete dictionary had a relatively better performance.Comparisons of runtime and convergence performances of the proposed algorithm with orthogonal dictionary (size is 100) and overcomplete dictionary (size is 150).

Conclusions
In this paper we proposed a downlink channel estimation algorithm based on overcomplete dictionary and variational Bayesian inference.We converted the complex system model to a real model and exploited the correlation of angular channel sparsity in adjacent timeslots.In the algorithm we divided the timeslots into groups and made use of the channel support information of the previous timeslot to the channel estimation in the current timeslot within each group.The sparsity correlation and Bayesian Cramér-Rao bound for the MSE of channel estimation was analyzed.Compared with other recovery algorithms, such as WSP, IRLS, WIRLS, l 1 min, W-l 1 min and COSAMP, our proposed algorithm with overcomplete dictionary had a relatively better performance.Moderate overcomplete dictionary can improve the MSE performance of channel estimation to balance the computational complexity and performance gain.
Because p(y, z) = p y z p(h α)p(α b)p(b)p(σ) , we have Since we mainly focus on the MSE of h, we only need to analyze J h,h .We discuss the above formula part by part as follows: Since the priori support set information is used in our proposed algorithm, a three-layer model is constructed for the elements belonging to the priori support set, and a two-layer model is used for the elements not belonging to the priori support set, so E Z 1 α i has different expressions for the two cases.2) When i does not belong to the priori support set, according to the high-order moment properties for the general gamma distribution, we have Then in summary, we have where S is the diagnosed support set, λ i is the eigenvalues of A Then the proofs are complete.
signal at UT. Since in the massive MIMO there existed sparsity, when d N M   D

Figure 1 .
Figure 1.Graphical model for the channel estimation with Bayesian inference.The nodes with double circle, single circle and square correspond to the observed data, hidden variables and parameters, respectively.
constant is a constant used for ( ) i p z normalization, ( , ) is the joint pdf for and , and zi can be , , and .We have ( , ) ( | ) ( ) assume ( | ) p z y posterior independence among the hidden variables , then ( | ) ( ) p p ≈ z y

Figure 1 .
Figure 1.Graphical model for the channel estimation with Bayesian inference.The nodes with double circle, single circle and square correspond to the observed data, hidden variables and parameters, respectively.

c ( 14 )
(iii) Update of p(b {S,S +N } )According to Equation(8), by ignoring the terms which are independent of b, we haveln p(b {S,S +N } ) ∝ E α,h ln p(y h) + ln p(h α) + ln p(α a, b) + ln p(b c, d) ∝ E α,h [ln p(α a, b) + ln p(b c, d)] = i∈{S,S +N } −b i E α (α i ) + (c i − 1) ln b i − d i b i (15)where b {S,S +N } is comprised by the entries indicated by S, S +N in b.In(15) the α, a, b, c, d are also comprised by their indicated S, S +N , the subscript S, S +N is omitted for simplicity.As shown in Figure1, b {S,S +N } was modelled as a Gamma distribution.Since p(α i |a i , b i ) and p(b i |c i , d i ) were a Gamma distribution, p( b{S,S +N } ) was Gamma( b i∈{S,S +M } c i , d i ), and the updated c i and d i were given by

Figure 2 .
Figure 2. Channel estimations by group.Each block represents one timeslot, and the block filled with grey is the timeslot with variational Bayesian inference (VBI) for the channel estimation, while the blank blocks are the timeslots with the proposed algorithm for channel estimation.
. The line of sight (LoS) distance between BS and UT was d Los , the non-LoS (NLoS) distance by reflector between BS and UT was d NLoS , and the UT movement distance in one timeslot was d ∆ .If the transmission path was still reflected by the same reflector as shown in Figure 3, the maximum and minimum NLoS distances from BS to UT between timeslots were d NLoS + d ∆ and d NLoS − d ∆ .The transmission angle change was ∆ θ .The distance between the reflector and BS was d 1 .By some mathematical manipulations shown in Appendix A, we got

Figure 2 .
Figure 2. Channel estimations by group.Each block represents one timeslot, and the block filled with grey is the timeslot with variational Bayesian inference (VBI) for the channel estimation, while the blank blocks are the timeslots with the proposed algorithm for channel estimation.

Figure 3 .
Figure 3. Ellipse geometry channel model for line of sight (LoS) and non-LoS (NLoS) transmission.

Figure 3 .
Figure 3. Ellipse geometry channel model for line of sight (LoS) and non-LoS (NLoS) transmission.

Figure 4 .
Figure 4. Transmission angle change during one timeslot with a different LoS distance and different distances between the base station (BS) and reflector.

Figure 4 .
Figure 4. Transmission angle change during one timeslot with a different LoS distance and different distances between the base station (BS) and reflector.

TA, and A T A
∈ R 2N×2N , and a, b i , c, and d i are the parameters in the Bayesian model in Figure 1.When T d , M → ∞ and T d M = β, according to the random matrix theory, we have and max(b) are the minimum and maximum entries in d and b.

Figure 5 .
Figure 5. Comparisons of channel estimation mean square error (MSE) for different algorithms.

Figure 5 .
Figure 5. Comparisons of channel estimation mean square error (MSE) for different algorithms.Electronics 2019, 8, x FOR PEER REVIEW 12 of 17

Figure 6 .
Figure 6.Comparisons of channel estimation of MSE for the proposed algorithm with different dictionary sizes.

Figure 6 .
Figure 6.Comparisons of channel estimation of MSE for the proposed algorithm with different dictionary sizes.

17 Figure 7 .
Figure 7.Comparisons of runtime and convergence performances of the proposed algorithm with orthogonal dictionary (size is 100) and overcomplete dictionary (size is 150).

Figure 7 .
Figure 7.Comparisons of runtime and convergence performances of the proposed algorithm with orthogonal dictionary (size is 100) and overcomplete dictionary (size is 150).

1 a−1 α i d 1 + 1 − 1 −
cases are discussed as follows:1)When i belongs to the priori support set, according to the three-layer graph model we havep(α) = 2N i=1 Gamma(α i a, b i ) , (A10) p(b i ) = Gamma(b i c, d i ) = Γ(c) −1 d c i b c−1 i e −d i b i .(A11) Then we get p(α i ) = ∞ 0 p(α i b i )p(b i )db i = ∞ 0 Γ(a) −1 b a i α a−1 i e −b i α i Γ(c) −1 d c i b c−1 i e −d i b i db i = Γ(a) −1 Γ(c) −1 α a−a−csatisfies the probability density function of Beta prime distribution.According to the properties of the Beta prime distribution, when −a < −1 < c, we have