PCQNet: A Trainable Feedback Scheme of Precoder for the Uplink Multi-User MIMO Systems

Multi-user multiple-input multiple-output (MU-MIMO) technology can significantly improve the spectral and energy efficiencies of wireless networks. In the uplink MU-MIMO systems, the optimal precoder design at the base station utilizes the Lagrange multipliers method and the centralized iterative algorithm to minimize the mean squared error (MSE) of all users under the power constraint. The precoding matrices need to be fed back to the user equipment to explore the potential benefits of the joint transceiver design. We propose a CNN-based compression network named PCQNet to minimize the feedback overhead. We first illustrate the effect of the trainable compression ratios and feedback bits on the MSE between the original precoding matrices and the recovered ones. We then evaluate the block error rates as the performance measure of the centralized implementation with an optimal minimum mean-squared error (MMSE) transceiver. Numerical results show that the proposed PCQNet achieves near-optimal performance compared with other quantized feedback schemes and significantly reduces the feedback overhead with negligible performance degradation.


Introduction
In recent years, the multi-user multiple-input multiple-output (MU-MIMO) technology has offered great advantages over conventional point-to-point MIMO systems due to its improvement on the spectral and the energy efficiencies [1,2]. Specifically, the base station (BS) of a MU-MIMO system communicates with a large number of user terminals in the same time-frequency resource by configuring numerous antennas. Furthermore, multiple antennas bring large improvements in throughput and radiated energy efficiency through focusing energy into ever smaller regions of space [3]. As a result, the MU-MIMO systems have become a fundamental and integral part of present and future generations of wireless networks. Digital beamforming and hybrid analog/digital beamforming are widely applied for inter-user interference reduction with the evolution and growth of 5G technical standards [4].
Nowadays, the joint optimization of the transceiver has attracted increasing research interest as an effective interference management technique for the uplink MU-MIMO systems [5]. Since the bottlenecks of hardware cost and power consumption in millimeter-wave (mmWave) massive MIMO systems do not arise in the uplink MU-MIMO scenarios, we adopt digital precoding for its excellent sum-rate performance. Some early works adopt non-iterative [6,7] and iterative methods [8,9] to solve the highly non-convex problem of the joint transceiver optimization. The non-iterative precoding schemes are based on matrix decompositions, such as the singular value decomposition [6] and the QR decomposition [7], which cannot cope with a mismatch between the number of transmitted streams and the number of antennas. We note that the centralized iterative precoding

1. We propose a CNN-based architecture named PCQNet to produce the data-bearing bitstreams from which each UE recovers its precoding matrix. It can achieve near-optimal performance and further reduce the feedback overhead compared with the existing 3GPP codebook scheme in certain scenarios.
2. We develop a general trainable compression and quantization framework for the precoding matrices in the uplink MU-MIMO systems. The proposed PCQNet architecture as well as the Lloyd-Max quantization scheme can flexibly adjust the feedback overhead by training an auto-encoder.
3. The precoding matrices with different compression ratios (CRs) are evaluated on the performance of the centralized implementation with an optimal MMSE transceiver.
As far as we know, the effect of the feedback accuracy on the performance has not been investigated before. Specifically, we explore the trade-off between the block error rates (BLER) and the CRs of the precoding matrices.
The remainder of this article is organized as follows: Section 2 introduces the system model and the joint transceiver optimization. Section 3 describes the network architecture and the training strategy of the PCQNet. Three baseline methods are also presented to provide a benchmark for our proposed PCQNet. In Section 4, experimental evaluations and performance analysis are provided to demonstrate the efficiency of our trainable CNN-based PCQNet. Finally, the concluding statements are given in Section 5.
Notations: Matrices (vectors) are denoted by boldface upper (lower) case letters. $\mathbb{R}$, $\mathbb{C}$, and $\mathbb{N}$ denote the set of real numbers, the set of complex numbers, and the set of positive integers, respectively. $(\cdot)^H$, $\|\cdot\|_F$, $\|\cdot\|_2$, $\mathrm{tr}\{\cdot\}$, and $\mathbb{E}\{\cdot\}$ denote the conjugate transpose, the Frobenius norm, the Euclidean norm, the trace operation, and the expectation, respectively. $\mathbf{I}_N$ is the $N \times N$ identity matrix. $\mathcal{CN}(\mu, \sigma^2)$ denotes a complex Gaussian vector with mean $\mu$ and variance $\sigma^2$.

System Model
In this section, we introduce a simple signal model of an uplink MU-MIMO system, the joint transceiver design and the channel models.

Uplink MU-MIMO System
Without loss of generality, we consider an uplink MU-MIMO system consisting of one BS equipped with $N_r$ antennas and $K$ UEs, as depicted in Figure 1. For convenience, $\mathcal{K} = \{1, \dots, K\}$ denotes the set of UEs. The $k$-th UE is equipped with $N_{k,t}$ antennas and transmits $N_{k,s}$ modulated data streams. We denote by $N_t = \sum_{k=1}^{K} N_{k,t}$ and $N_s = \sum_{k=1}^{K} N_{k,s}$ the total numbers of transmit antennas and independent data streams of all UEs, respectively. For simplicity, we consider the case where $N_{k,t}$ and $N_{k,s}$ are constants. The channel matrix is $\mathbf{H} \triangleq [\mathbf{H}_1, \dots, \mathbf{H}_K] \in \mathbb{C}^{N_r \times N_t}$, where $\mathbf{H}_k \in \mathbb{C}^{N_r \times N_{k,t}}$ denotes the channel matrix from the $k$-th UE to the BS. The BS first obtains the CSI of all UEs and calculates the optimal precoding matrix $\mathbf{F}_k \in \mathbb{C}^{N_{k,t} \times N_{k,s}}$ of the $k$-th UE ($k \in \mathcal{K}$), which is fed back to the UE for uplink data transmission. The received signal vector $\mathbf{y} \in \mathbb{C}^{N_r \times 1}$ at the BS can be represented as
$$\mathbf{y} = \sum_{k=1}^{K} \mathbf{H}_k \mathbf{F}_k \mathbf{s}_k + \mathbf{n},$$
where $\mathbf{s}_k \in \mathbb{C}^{N_{k,s} \times 1}$ represents the data symbol vector with covariance matrix $\mathbf{\Phi}_{s_k} = \mathbb{E}\{\mathbf{s}_k \mathbf{s}_k^H\} = \mathbf{I}_{N_{k,s}}$, and $\mathbf{n} \in \mathbb{C}^{N_r \times 1}$ is the received complex white Gaussian noise vector consisting of independent and identically distributed (i.i.d.) elements with distribution $\mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I}_{N_r})$. The noise covariance matrix is $\mathbf{\Phi}_n = \sigma^2 \mathbf{I}_{N_r}$.
The precoding and the linear detection are jointly optimized to obtain the best system performance. The recovered data symbol vector of the $k$-th UE can be represented by
$$\hat{\mathbf{s}}_k = \mathbf{G}_k \mathbf{y},$$
where $\mathbf{G}_k \in \mathbb{C}^{N_{k,s} \times N_r}$ represents the detection matrix of the BS for the $k$-th UE. The centralized implementation in [9] aims to jointly optimize the transceiver to eliminate the MU interference in the uplink system by minimizing the mean squared error (MSE) between the estimated symbols and the transmitted symbols. The MSE between the estimated symbols and the actual symbols for the $k$-th UE is given by
$$\mathrm{MSE}_k = \mathbb{E}\left\{ \|\hat{\mathbf{s}}_k - \mathbf{s}_k\|_2^2 \right\}.$$
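As an illustration of the per-user MSE, the expectation can be evaluated in closed form once the covariance of the received vector is known. The sketch below assumes unit-covariance, mutually independent symbol vectors and white noise, as stated above; the function name `user_mse` and its argument layout are hypothetical and not taken from the paper.

```python
import numpy as np

def user_mse(k, H_list, F_list, G_k, sigma2):
    """Closed-form E||G_k y - s_k||^2, assuming E{s_j s_j^H} = I for every UE.

    H_list[j] has shape (N_r, N_jt), F_list[j] has shape (N_jt, N_js),
    and G_k has shape (N_ks, N_r). All names here are illustrative.
    """
    n_r = H_list[0].shape[0]
    # Covariance of y: sum_j H_j F_j F_j^H H_j^H + sigma^2 I
    R_y = sigma2 * np.eye(n_r, dtype=complex)
    for H_j, F_j in zip(H_list, F_list):
        A = H_j @ F_j
        R_y += A @ A.conj().T
    n_s = F_list[k].shape[1]
    cross = G_k @ H_list[k] @ F_list[k]
    # tr(G R_y G^H) - 2 Re tr(G H_k F_k) + N_ks
    mse = np.trace(G_k @ R_y @ G_k.conj().T) - 2.0 * np.real(np.trace(cross)) + n_s
    return float(np.real(mse))
```

For a single UE with identity channel and precoder, the MMSE detector is a scaled identity and the value reduces to $N_{k,s}\sigma^2/(1+\sigma^2)$, which gives a quick sanity check of the expression.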

The Joint Precoding and Detection Design
In this paper, we consider the scheme in [9], which jointly designs the precoding and detection matrices to minimize the MSE between the estimated and the transmitted symbols. Specifically, we optimize the overall system performance with the joint transceiver design subject to a per-user power constraint. The sum-MSE minimization can be formulated as
$$\min_{\{\mathbf{F}_k, \mathbf{G}_k\}_{k \in \mathcal{K}}} \; \sum_{k=1}^{K} \mathbb{E}\left\{ \|\hat{\mathbf{s}}_k - \mathbf{s}_k\|_2^2 \right\},$$
where the precoding matrix $\mathbf{F}_k$ is subject to the per-user power constraint $P_k = \mathbb{E}\{\|\mathbf{F}_k \mathbf{s}_k\|_2^2\} = \mathrm{tr}\{\mathbf{F}_k \mathbf{\Phi}_{s_k} \mathbf{F}_k^H\}$, $\forall k \in \mathcal{K}$. The joint minimization of the MSE is carried out by iteratively updating the precoding and detection matrices, where $\lambda_k$ is the Lagrange multiplier associated with the power constraint of the $k$-th UE. The detailed minimization process to obtain the optimal transceiver is described in Algorithm 1. The precoding matrix $\mathbf{F}$ is initialized with the codebooks in the 3GPP protocol and normalized to satisfy the power constraints. The Lagrangian formulation is utilized to solve the optimization problem, which is convex in the precoding matrices when the detection matrices are fixed, and vice versa. Compared with the ZF detector, the MMSE detector optimally balances the multi-user interference and the Gaussian noise. Thus, we apply the MMSE detector in the proposed uplink scenario because of its practical implementation and better performance. The output precoding matrix $\mathbf{F}_k$ and detection matrix $\mathbf{G}_k$ are utilized for precoding at the $k$-th UE and MMSE detection at the BS, respectively.
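For reference, the alternating updates that follow from the Lagrangian of the sum-MSE problem can be sketched as below. This is the standard form obtained by setting the gradients with respect to $\mathbf{G}_k$ and $\mathbf{F}_k$ to zero under the unit-covariance symbol assumption; the exact expressions used in [9] and in Algorithm 1 may be arranged differently.

```latex
\begin{align}
\mathbf{G}_k &= \mathbf{F}_k^H \mathbf{H}_k^H
  \Big( \sum_{j=1}^{K} \mathbf{H}_j \mathbf{F}_j \mathbf{F}_j^H \mathbf{H}_j^H
        + \sigma^2 \mathbf{I}_{N_r} \Big)^{-1}, \\
\mathbf{F}_k &= \Big( \mathbf{H}_k^H \Big( \sum_{j=1}^{K} \mathbf{G}_j^H \mathbf{G}_j \Big) \mathbf{H}_k
        + \lambda_k \mathbf{I}_{N_{k,t}} \Big)^{-1} \mathbf{H}_k^H \mathbf{G}_k^H,
\end{align}
```

Here $\lambda_k \geq 0$ is chosen so that $\mathrm{tr}\{\mathbf{F}_k \mathbf{F}_k^H\}$ meets the power budget $P_k$; since the transmit power decreases monotonically in $\lambda_k$, a one-dimensional search such as bisection suffices.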

Algorithm 1:
The centralized implementation with optimal MMSE precoding.
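A minimal sketch of such a centralized alternating procedure is given below, assuming the standard MMSE detector and a Lagrangian precoder update with the multiplier found by bisection. It is not the authors' Algorithm 1: the random initialization replaces the 3GPP codebook initialization described in the text, the power constraint is enforced as an upper bound, and all function names are hypothetical.

```python
import numpy as np

def mmse_detectors(H_list, F_list, sigma2):
    """G_k = F_k^H H_k^H (sum_j H_j F_j F_j^H H_j^H + sigma^2 I)^-1 for all k."""
    n_r = H_list[0].shape[0]
    R_y = sigma2 * np.eye(n_r, dtype=complex)
    for H, F in zip(H_list, F_list):
        A = H @ F
        R_y += A @ A.conj().T
    R_inv = np.linalg.inv(R_y)
    return [(H @ F).conj().T @ R_inv for H, F in zip(H_list, F_list)]

def update_precoder(H_k, G_k, T, P_k, iters=60):
    """F_k = (H_k^H T H_k + lam I)^-1 H_k^H G_k^H, with lam >= 0 found by
    bisection so that ||F_k||_F^2 <= P_k (power decreases as lam grows)."""
    M = H_k.conj().T @ T @ H_k
    B = H_k.conj().T @ G_k.conj().T
    eye = np.eye(M.shape[0])

    def precoder(lam):
        return np.linalg.solve(M + lam * eye, B)

    def power(lam):
        return np.linalg.norm(precoder(lam)) ** 2

    lo, hi = 0.0, 1.0
    while power(hi) > P_k:          # grow the bracket until it is feasible
        hi *= 2.0
    for _ in range(iters):          # shrink towards the smallest feasible lam
        mid = 0.5 * (lo + hi)
        if power(mid) > P_k:
            lo = mid
        else:
            hi = mid
    return precoder(hi)

def joint_transceiver(H_list, P_list, n_s, sigma2, n_iter=50, seed=0):
    """Alternate between MMSE detection and precoder updates."""
    rng = np.random.default_rng(seed)
    F_list = []
    for H, P in zip(H_list, P_list):
        n_t = H.shape[1]
        # Assumption: random start scaled to power P (the paper uses 3GPP codebooks).
        F = rng.standard_normal((n_t, n_s)) + 1j * rng.standard_normal((n_t, n_s))
        F_list.append(F * np.sqrt(P) / np.linalg.norm(F))
    for _ in range(n_iter):
        G_list = mmse_detectors(H_list, F_list, sigma2)
        T = sum(G.conj().T @ G for G in G_list)
        F_list = [update_precoder(H, G, T, P)
                  for H, G, P in zip(H_list, G_list, P_list)]
    return F_list, mmse_detectors(H_list, F_list, sigma2)
```

Calling `joint_transceiver` with a list of per-UE channel matrices returns one precoder and one detector per UE; the number of iterations and the bisection depth are illustrative defaults rather than values from the paper.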

The Feedback Process of Precoding Matrices
The full feedback of the precoding matrix imposes particularly high feedback overhead and storage requirements. We propose a CNN-based PCQNet scheme, illustrated in Figure 2, to substantially decrease the feedback overhead. Specifically, the BS compresses the precoding matrices and then quantizes the compressed matrices into bitstreams for each UE. The UEs recover the precoding matrices from the feedback bitstreams, which are assumed to be transmitted error-free.

Channel Model
The channels are assumed to be frequency flat and known at the receiver side. We consider multiple channel models, namely the i.i.d. Rayleigh fading channel and the more realistic NAIE channel provided by Huawei Corporation [25]. The element $h^{k}_{i,j} = [\mathbf{H}_k]_{i,j}$ represents the channel fading coefficient between the $j$-th transmit antenna of the $k$-th UE and the $i$-th receive antenna of the BS. The i.i.d. Rayleigh channel matrix $\mathbf{H}$ consists of i.i.d. $\mathcal{CN}(0, 1)$ elements. The NAIE MIMO channels are taken from the CDLB300_20UE_4T32R dataset provided by the iMaster NAIE platform, which is a channel environment measured in practical scenarios.
The parameters of the NAIE dataset are listed in Table 1. The dimension of the dataset is $[L, K, N_{k,t}, N_r, N_f] = [500, 20, 4, 32, 96]$, which represents the number of data frames, the number of UEs, the number of antennas per UE, the number of antennas at the BS, and the number of carriers, respectively. When the training and testing datasets are generated for the NAIE channel, the frames are randomly picked from the 500 available data frames and then normalized to satisfy the power constraints. If the required number of frames is less than 500, the data frames are randomly selected without repetition. Otherwise, the frames of the dataset ($L = 500$) are reused.
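The two data sources described above can be sketched as follows. The first function draws i.i.d. $\mathcal{CN}(0,1)$ channel entries; the second picks frame indices from a finite dataset and falls back to sampling with replacement when more frames are requested than are available. The function names are hypothetical, and sampling with replacement is only one possible reading of how the frames are reused.

```python
import numpy as np

def rayleigh_channels(num_frames, n_r, n_t, seed=0):
    """Return num_frames channel matrices with i.i.d. CN(0, 1) entries."""
    rng = np.random.default_rng(seed)
    shape = (num_frames, n_r, n_t)
    # Real and imaginary parts each carry variance 1/2.
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2.0)

def sample_frame_indices(num_needed, num_available=500, seed=0):
    """Pick frame indices; frames are reused only if num_needed > num_available."""
    rng = np.random.default_rng(seed)
    reuse = num_needed > num_available
    return rng.choice(num_available, size=num_needed, replace=reuse)
```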

Network Architecture
The PCQNet consists of the encoder network and the recovery network, as illustrated in Figure 3. The encoder network is made up of a trainable compression module and a non-trainable quantization module. The quantization is accomplished during the offline training, and the codeword is directly fed back to the UEs through the data-bearing bitstreams. The recovery network mainly consists of a dequantizer and the ResNet in [26]. Considering that CNNs can efficiently manage the on-device memory requirements and achieve better memory usage than DNNs, we adopt the increasingly popular CNNs for the compression network. We first concatenate the real and the imaginary parts of the precoding matrices $\mathbf{F}_k$, $k \in \mathcal{K}$, where the dimension transformation can be represented by $\mathbb{C}^{N_{k,t} \times N_{k,s}} \to \mathbb{R}^{2 \times N_{k,t} \times N_{k,s}}$. The input of the first convolutional layer is thus the real and imaginary parts of the precoding matrices generated at the BS. The compression module consists of five CNN layers, each of which convolves a set of filter kernels with its input to produce an output tensor. The CNN layers are parameterized by $F \times F \times K | S$, where $F$ and $K$ denote the filter size and the number of filters, respectively. $S$ represents the downsampling stride in the convolution layers at the encoder and the upsampling stride in the transposed convolution layers at the decoder. The hyperparameters of five CNN layers in the compression module are: The rectified linear unit (ReLU) activation function is inserted after each CNN layer. The output is the compressed vector $\mathbf{z}_k \in \mathbb{R}^{l}$. The compression function $f_\theta: \mathbb{C}^{N_{k,t} \times N_{k,s}} \to \mathbb{R}^{l}$ can be represented as
$$\mathbf{z}_k = f_\theta(\mathbf{F}_k, \theta_k),$$
where the parameter set $\theta_k$ is shared by all UEs and $l$ is the dimension of the compressed output. We use a uniform quantization module with the quantization factor $\beta$. The vector $\mathbf{z}_k$ is quantized into an $m$-dimensional binary vector $\mathbf{b}_k$ for the feedback transmission,
$$\mathbf{b}_k = Q_\phi(\beta, \mathbf{z}_k),$$
where $m$ represents the feedback overhead.
For each UE, the CR of the PCQNet can be defined as
$$\mathrm{CR} = \frac{l}{2 N_{k,t} N_{k,s}}.$$
The CR and the number of quantization bits $\beta$ jointly determine the feedback overhead $m$, which influences the normalized MSE (NMSE) between the recovered precoding matrices and the original ones. A smaller value of $m$ reflects a lower feedback overhead of the precoding matrices. The binary vector $\mathbf{b}_k \in \{0, 1\}^{m}$ is fed back to the UE for the recovery of the precoding matrices. An error-free channel is assumed when transmitting the encoded vector $\mathbf{b}_k$ from the BS to the $k$-th UE.
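The relation between the compressed dimension, the quantization bits, and the feedback overhead can be illustrated with a small uniform quantizer. The sketch below assumes that the compressed values lie in $[0, z_{\max})$ (plausible after a ReLU, but the range is an assumption), that each of the $l$ values is mapped to $\beta$ bits so that $m = \beta l$, and that the CR is the ratio of $l$ to the number of real-valued entries of $\mathbf{F}_k$. None of the function names come from the paper.

```python
import numpy as np

def compression_ratio(l, n_t, n_s):
    """Assumed CR: compressed length over the 2 * N_kt * N_ks real entries."""
    return l / (2 * n_t * n_s)

def quantize_to_bits(z, beta, z_max=1.0):
    """Uniformly quantize each entry of z in [0, z_max) to beta bits (MSB first)."""
    levels = 2 ** beta
    idx = np.clip(np.floor(np.asarray(z) / z_max * levels), 0, levels - 1).astype(int)
    shifts = np.arange(beta - 1, -1, -1)
    bits = (idx[:, None] >> shifts) & 1
    return bits.reshape(-1).astype(np.uint8)   # length m = beta * l

def dequantize_from_bits(bits, beta, z_max=1.0):
    """Map the m-bit stream back to l mid-point reconstruction values."""
    levels = 2 ** beta
    weights = 2 ** np.arange(beta - 1, -1, -1)
    idx = bits.reshape(-1, beta).astype(int) @ weights
    return (idx + 0.5) / levels * z_max
```

With $N_{k,t} = 4$, $N_{k,s} = 2$, $l = 4$, and $\beta = 4$, this gives a CR of 1/4 and $m = 16$ feedback bits, and the reconstruction error per entry is bounded by half a quantization step.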
The decoder network at the $k$-th UE outputs the restored complex-valued precoding matrix $\hat{\mathbf{F}}_k$ from the feedback bitstream $\mathbf{b}_k$. The reconstruction of the precoding matrix can be expressed as
$$\hat{\mathbf{F}}_k = g_\varphi(\mathbf{b}_k, \varphi_k),$$
where $\varphi_k$ represents the parameter set of the decompression module. The feedback bitstream $\mathbf{b}_k$ is reshaped to the dimension of $m/\beta$. The decompression module retrieves the real and the imaginary parts from three fully connected (FC) layers and five ResNet layers. The ResNet applies shortcut connections that directly pass the data flow to later layers to avoid the vanishing of the gradient caused by multiple stacked non-linear transformations. Each FC layer is followed by a ReLU activation, and the hyperparameters of three ResNets in the decoder are $3 \times 3 \times 64 | 1$, $3 \times 3 \times 32 | 1$, and $3 \times 3 \times 16 | 1$, respectively.

The Training Strategy of PCQNet
In the offline stage, we compute the precoding matrices in advance by the aforementioned Algorithm 1 and generate the training, testing, and evaluation datasets. In the online stage, we can directly obtain the low-dimensional feedback bitstreams with the well-trained neural network. The gradient of the quantization module is treated as a constant to make the network differentiable, so that gradients can pass through the quantization layer during back-propagation. For this reason, the encoder and the recovery network can be jointly optimized end-to-end.
We formulate the feedback of the precoding matrices as a reconstruction problem, $\hat{\mathbf{F}}_k = g_\varphi(Q_\phi(\beta, f_\theta(\mathbf{F}_k, \theta_k)), \varphi_k)$. The auto-encoder is optimized by updating the network parameters $\theta_k$ and $\varphi_k$, which can be applied for all UEs. The loss function is the NMSE, which quantifies the difference between the recovered precoding matrices and the original ones with
$$\mathrm{NMSE} = \mathbb{E}\left\{ \frac{\|\mathbf{F}_k - \hat{\mathbf{F}}_k\|_F^2}{\|\mathbf{F}_k\|_F^2} \right\}.$$
The PCQNet is trained and evaluated on an Nvidia GeForce 3090 platform. We use the Adam [27] optimizer with a batch size of 32. The maximum number of training epochs is 1000, and the training process stops early with a patience of eight epochs. We apply an adaptive learning rate schedule: if the loss does not improve for four epochs in a row, the learning rate is reduced by a factor of 0.8.

Baseline1: The 3GPP Codebook Scheme
The 3GPP protocol in [15] contains a set of PMIs $\mathbf{F}^{(i)}_{\mathrm{codebook}}$, $i \in \mathbb{N}$, with corresponding configurations of streams and antennas. The BS first calculates the precoding matrix with Algorithm 1 and then feeds back the binary index. The optimal codebook index $i^{(\mathrm{opt})}$ is acquired with the minimum Euclidean distance by searching the predefined codebooks,
$$i^{(\mathrm{opt})} = \arg\min_{i} \left\| \mathbf{F}_k - \mathbf{F}^{(i)}_{\mathrm{codebook}} \right\|_F,$$
where $i$ represents the index of the standard precoding matrix defined in the 3GPP protocol, which supports limited scenarios (e.g., $N_{k,t} = 4$, $N_{k,s} = 2$, $i^{(\mathrm{opt})} \in [0, 21]$). The number of feedback bits for the protocol codebook scheme under the scenarios with 22 indexes is 5 ($m = \lceil \log_2 22 \rceil = 5$).
The protocol codebook-based precoding scheme is labeled as 3GPP codebook in our simulation. Since this feedback method of the vector quantization only needs to search the optimal index, it greatly reduces the feedback overhead. However, the scope of the codebook scheme is limited and the precoding matrices retrieved from the codebook indexes inevitably have certain quantization error.
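The index search and the NMSE metric used throughout can be sketched in a few lines. The codebook in this example is a randomly generated placeholder with 22 entries, not the 3GPP precoding tables, and the function names are hypothetical.

```python
import math
import numpy as np

def nearest_codeword(F, codebook):
    """Index of the codeword with minimum Frobenius distance to F."""
    dists = [np.linalg.norm(F - C) for C in codebook]
    return int(np.argmin(dists))

def feedback_bits(num_codewords):
    """Bits needed to signal one index out of num_codewords."""
    return math.ceil(math.log2(num_codewords))

def nmse_db(F, F_hat):
    """Normalized squared Frobenius error between F and F_hat, in dB."""
    ratio = np.linalg.norm(F - F_hat) ** 2 / np.linalg.norm(F) ** 2
    return 10.0 * np.log10(ratio)
```

For a codebook of 22 entries, `feedback_bits(22)` evaluates to 5, in line with the overhead stated for the protocol codebook scheme.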

Baseline2: The Lloyd-Max Quantization Scheme
We apply the Lloyd-Max quantizer, labeled as LloydMax in our simulation, to reduce the number of bits needed to feed back the precoding matrices. An optimized Lloyd-Max quantizer minimizes the mean square quantization error (distortion) for a given number of quantization levels. The Lloyd-Max quantization scheme stores the designed codebooks and partitions at the UEs and the BS. Thus, the BS only needs to transmit the indexes to specify the precoding matrices.
In the offline stage, we first acquire the datasets generated by Algorithm 1. The empirical probability distribution function of the real and the imaginary parts of the precoding matrices is obtained for the design of the Lloyd-Max quantizer in [13,28,29]. Then, we develop a Lloyd-Max-based quantizer under different SNRs and channel models. The codebooks and partitions are specifically optimized by the Lloyd-Max algorithm in [30,31].
In the online stage, the UEs can readily recover the precoding matrices from the received indexes with the prestored Lloyd-Max quantization partitions and codebooks. Each element in the precoding matrices has to be quantized individually. Thus, the minimum number of feedback bits is $N_{k,t} \times N_{k,s} \times 2$ for each precoding matrix $\mathbf{F}_k \in \mathbb{C}^{N_{k,t} \times N_{k,s}}$ (e.g., $N_{k,t} = 4$, $N_{k,s} = 2$, $\beta = 1$, $m = 16$). It is necessary to further reduce the feedback overhead and design a more SNR-adaptive feedback scheme to combat channel variations.
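A minimal sketch of a scalar Lloyd-Max design on empirical samples is shown below, alternating between mid-point partitions and centroid codebooks. It is a generic textbook iteration and not necessarily the procedure of [30,31]; the quantile-based initialization and all function names are assumptions made for illustration.

```python
import numpy as np

def lloyd_max(samples, beta, n_iter=100):
    """Design a 2**beta-level scalar quantizer from empirical samples."""
    samples = np.sort(np.asarray(samples, dtype=float))
    levels = 2 ** beta
    # Assumption: start from evenly spaced empirical quantiles.
    codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(n_iter):
        partitions = 0.5 * (codebook[:-1] + codebook[1:])   # nearest-neighbour rule
        idx = np.searchsorted(partitions, samples)
        new_codebook = codebook.copy()
        for j in range(levels):
            cell = samples[idx == j]
            if cell.size:
                new_codebook[j] = cell.mean()               # centroid rule
        if np.allclose(new_codebook, codebook):
            break
        codebook = new_codebook
    partitions = 0.5 * (codebook[:-1] + codebook[1:])
    return codebook, partitions

def lm_encode(x, partitions):
    """Map values to cell indexes (what the BS would feed back)."""
    return np.searchsorted(partitions, x)

def lm_decode(idx, codebook):
    """Map received indexes back to reconstruction values (at the UE)."""
    return codebook[idx]

def quantize_precoder(F, codebook, partitions):
    """Quantize real and imaginary parts of F element by element."""
    parts = np.concatenate([F.real.ravel(), F.imag.ravel()])
    idx = lm_encode(parts, partitions)
    rec = lm_decode(idx, codebook)
    half = F.size
    F_hat = (rec[:half] + 1j * rec[half:]).reshape(F.shape)
    return idx, F_hat
```

For a $4 \times 2$ precoding matrix, `quantize_precoder` produces 16 indexes of $\beta$ bits each, which matches the $m = 16\beta$ overhead of this baseline.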

Baseline3: The Ideal Feedback Scheme
We consider the optimal scheme labeled as w/o compression which directly feeds back the precoding matrices without compression.

Experimental Evaluations
In this section, we evaluate the NMSE of the precoding matrices with different CRs and the influence of the feedback accuracy on the BLER performance. Comprehensive performance comparisons of the uplink MU-MIMO system with different numbers of UEs, different modulation orders, and different channel models are provided. The uplink MU-MIMO system parameters and the coefficients of the channel coding are listed in Table 2. The SNR is generally defined as the ratio of the signal power to the noise power at the receiver.

Data Generation
We first generate $L$ channel realizations $[\mathbf{H}_1, \dots, \mathbf{H}_L] \in \mathbb{C}^{L \times N_r \times N_t}$ for the i.i.d. Rayleigh channel or randomly sample $L$ data frames from the iMaster NAIE platform [25] for the NAIE channel. Then, we calculate the noise variance and normalize the channel matrix. The precoding matrices $\mathbf{F}_l \triangleq [\mathbf{F}_1, \dots, \mathbf{F}_K]$, $l = 1, \dots, L$, are then calculated for each channel realization by Algorithm 1. At the training and evaluation stages of the networks, the precoding matrix $\mathbf{F}_k \in \mathbb{C}^{N_{k,t} \times N_{k,s}}$ of the $k$-th UE is randomly picked from the set $[\mathbf{F}_1, \dots, \mathbf{F}_L]$. The numbers of frames of the training, validation, and testing data for our proposed PCQNet are $L_1 = 100{,}000$, $L_2 = 20{,}000$, and $L_3 = 10{,}000$, respectively.

The NMSE Performance
The NMSE performance between the recovered and the original precoding matrices utilizing the LloydMax scheme is listed in Table 3. The codebooks are specially optimized over statistical datasets with various SNRs and different quantization bits. Since these codebook-based schemes are inevitably limited by the quantization error, better NMSE performance of the LloydMax scheme comes at the expense of increased feedback overhead. Although remarkable NMSE performance can be obtained when the evaluation SNR ($\mathrm{SNR}_{\mathrm{eval}}$) matches the design SNR ($\mathrm{SNR}_{\mathrm{design}}$) under the same quantization bits, the performance of the codebook-based scheme is not satisfactory when $\mathrm{SNR}_{\mathrm{eval}}$ mismatches $\mathrm{SNR}_{\mathrm{design}}$ over the i.i.d. Rayleigh channel. Thus, multiple codebooks need to be stored to combat channel variations under different channel SNRs. Note that the NMSE of the LloydMax scheme and the PCQNet does not depend on the SNR directly: the SNR only affects the generation of the test datasets, that is, the precoding matrices to be fed back. We provide a guideline for the subsequent study of the BLER performance via the visualization of the NMSE performance in Figure 4.
The codebook-based LloydMax scheme is sensitive to the number of bits, the SNRs, and the realistic channel distribution. The NMSE performance degrades significantly when $\mathrm{SNR}_{\mathrm{eval}}$ mismatches $\mathrm{SNR}_{\mathrm{design}}$. In contrast, the CNN-based compression scheme is more SNR-adaptive, and we set $\mathrm{SNR}_{\mathrm{design}} = 0$ dB. The CRs of the PCQNet are set to 1/16, 1/8, 1/4, and 1/2 for $m = 4, 8, 16, 32$, respectively. The PCQNet can achieve better NMSE performance than the LloydMax scheme under the same feedback bits (e.g., $m = 16, 32, 48$), as shown in Figure 4. The green dotted line with the low quantization factor $\beta = 1$ is almost a straight line over the i.i.d. channel. Despite the different input precoding matrices under different SNRs, the NMSE of the recovered matrices is equally poor because the quantization error is so large that the precoding matrices cannot be correctly recovered.

The BLER Performance
We compare the PCQNet with the aforementioned baselines, as depicted in Figures 5 and 6. The comparisons of feedback overhead between the PCQNet and the three baselines (i.e., the 3GPP codebook scheme, the w/o compression scheme, and the LloydMax scheme) are provided in Table 4. The performance upper bound is the ideal centralized iterative scheme w/o compression, which is not appropriate for practical transmission. The protocol codebook-based precoding scheme is tailored for a specific number of users or transmit antennas. To provide a benchmark for our proposed method, we consider the precoding matrices $\mathbf{F}_k \in \mathbb{C}^{4 \times 2}$, $k \in \{1, 2, 3, 4\}$, each of which is to be fed back to the $k$-th UE. The number of feedback bits for the 3GPP codebook scheme is $m = 5$. The LloydMax scheme separately quantizes the real and the imaginary parts of the precoding matrices, which takes at least 16 bits and 32 bits to quantize an $\mathbb{R}^{2 \times 4 \times 2}$ matrix with $\beta = 1$ and $\beta = 2$, respectively ($m = 16 \times \beta = 16, 32$).
The proposed PCQNet can dramatically decrease the feedback overhead and exhibits only a slight BLER degradation with the further reduction of feedback bits (i.e., $m = 4, 8$). Note that the change of the CRs can be achieved by adjusting the number of feedback bits. Near-optimal performance can be achieved when the number of feedback bits is at least 16, where the feedback overhead can be further reduced. With $m = 4$, the PCQNet significantly outperforms the 3GPP codebook scheme with $m = 5$ in terms of BLER. Naturally, as we increase the number of feedback bits to 32, the performance of the PCQNet scheme as well as the LloydMax scheme approaches that of the ideal w/o compression scheme.

Simulation Results and Analysis with NAIE Channel
We provide more simulation tests to evaluate the performance of the proposed scheme with various numbers of UEs and higher modulation orders (e.g., 16-QAM) as well as the NAIE channel in practical scenarios provided by the iMaster NAIE platform [25].

The NMSE Performance
The LloydMax scheme has to design multiple codebooks which are optimized for specific channel conditions. As shown in Table 5, the best NMSE performance is achieved when $\mathrm{SNR}_{\mathrm{eval}}$ is equal to $\mathrm{SNR}_{\mathrm{design}}$. The PCQNet achieves better NMSE performance than the LloydMax scheme under the same CRs, as depicted in Figure 7. The PCQNet is tested with the fixed training SNR value $\mathrm{SNR}_{\mathrm{design}} = 0$ dB, while the LloydMax scheme is evaluated with $\mathrm{SNR}_{\mathrm{eval}} = \mathrm{SNR}_{\mathrm{design}}$.

The BLER Performance of the NAIE Channel
Relative to the w/o compression scheme, which feeds back the full precoding matrix, the CNN-based PCQNet recovers the precoding matrix with adaptive feedback overhead and obtains near-optimal reconstruction performance. Similar performance can be seen over the NAIE channel in Figures 8 and 9. The CNN-based PCQNet scheme performs close to the ideal w/o compression scheme when CR = 1/2. Moreover, there is a slight performance penalty when CR = 1/16, which is still superior to the 3GPP codebook. With the deployment of the pre-trained PCQNet, the feedback overhead is substantially reduced while the performance degradation is acceptable. The PCQNet scheme has better BLER performance than the LloydMax scheme under the same CRs.
The PCQNet achieves near-optimal BLER performance, quite close to that of the ideal w/o compression scheme, when the NMSE of the recovered precoding matrix is lower than −20 dB. The transmission tends to break down if the NMSE of the recovered precoding matrices is worse than the threshold of −5 dB. From the results, we can see that it is a reasonable compromise to set the CR to 1/4 for the compression of precoding matrices.
The provided numerical results show that our proposed PCQNet achieves a better trade-off between the feedback overhead and the BLER performance over the i.i.d. Rayleigh channel and the NAIE channel. This CNN-based compression scheme significantly enhances the BLER performance compared with the 3GPP codebook scheme and the LloydMax scheme under the same CRs.

Conclusions
The proposed PCQNet achieves considerable gains in BLER performance compared with the protocol codebook-based precoding scheme and the Lloyd-Max quantization scheme under the same CRs. In bandwidth-limited scenarios, the trainable PCQNet architecture adapts to different channel bandwidths more competitively than the Lloyd-Max quantization scheme. Under channel variations, the PCQNet is also more resilient than the Lloyd-Max quantization scheme to the mismatch between the trained SNRs and the tested SNRs. Our experiments demonstrate that the application of the CNN-based PCQNet greatly improves the adaptability and the generality of the precoding matrix feedback in the uplink MU-MIMO systems. Importantly, it incurs only a slight degradation of BLER performance at a high compression rate of the precoding matrix, making the compression architecture more attractive for deployment in practical systems.