Design and VLSI Implementation of a Reduced-Complexity Sorted QR Decomposition for High-Speed MIMO Systems

Abstract: In this article, a low-complexity and high-throughput sorted QR decomposition (SQRD) for multiple-input multiple-output (MIMO) detectors is presented. To reduce the heavy hardware overhead of SQRD, we propose an efficient SQRD algorithm based on a novel modified real-value decomposition (RVD). Compared to the latest study, the proposed SQRD algorithm reduces the computational complexity by more than 44.7% with similar bit error rate (BER) performance. Furthermore, a corresponding deeply pipelined hardware architecture implemented with coordinate rotation digital computer (CORDIC)-based Givens rotation (GR) is designed. In the design, we propose a time-sharing Givens rotation structure in which idle CORDIC modules take over concurrent GR operations of other CORDIC modules, which further reduces hardware complexity and improves hardware efficiency. The proposed SQRD processor is implemented in SMIC 55-nm CMOS technology and processes 62.5 M SQRD per second at a 250-MHz operating frequency with only 176.5 kilo-gates. Compared to related studies, the proposed design has the best normalized hardware efficiency and achieves a 6-Gbps MIMO data rate, which can support current high-speed wireless communication systems such as IEEE 802.11ax.


Introduction
Multiple-input multiple-output (MIMO) is widely employed in current wireless communication systems, such as IEEE 802.11ax [1], to achieve high data throughput. A MIMO detector recovers the original signal from the mixed multi-dimensional data streams and therefore has a great impact on system performance. In the MIMO detector, QR decomposition (QRD) serves as a preprocessor that operates on the channel matrix to facilitate subsequent MIMO detection. However, QRD requires massive arithmetic operations and accounts for a considerable part of the hardware complexity of the MIMO detector [2,3]. The sorted QRD (SQRD) improves the bit error rate (BER) performance of detection by inserting sorting procedures into the original QRD [4]. Due to the added sorting modules, however, the hardware overhead increases further. In view of this, it is necessary to design an SQRD processor with high throughput and reduced complexity.
There are three well-known algorithms used for QR decomposition: Householder transformation (HT) [5], modified Gram-Schmidt (MGS) [6], and Givens rotation (GR) [7]. HT is rarely used in QR decomposition because of its huge computational complexity. Compared with MGS, GR can be

• We propose an efficient reduced-complexity SQRD algorithm based on a novel modified RVD. Compared to the latest related study, the computational complexity of the proposed SQRD algorithm is reduced by more than 44.7%. In addition, the proposed SQRD algorithm has a competitive BER performance and is implementation-friendly;
• We design a deeply pipelined SQRD hardware architecture with a time-sharing GR structure for 4 × 4 MIMO systems. The proposed time-sharing GR structure uses the CORDIC modules in an idle state to process certain rotation operations that would otherwise require an additional CORDIC module. Therefore, additional hardware is saved and the hardware efficiency of the proposed SQRD design is improved.
The purpose of MIMO detection is to recover the transmit signal $s$ from the receive signal $y$. According to references [10,11], this can be realized by the maximum likelihood (ML) principle. Therefore, the optimum solution of MIMO detection can be obtained by
$$\hat{s} = \arg\min_{s \in \Omega} \left\| y - Hs \right\|^2, \quad (2)$$
where $\Omega$ is the solution space. The derivation of Equation (2) is given in Appendix A.
To solve a least-squares problem like Equation (2), QR decomposition [12] or Cholesky factorization [13] can be performed on the matrix $H$ to simplify the solution procedure. However, the detection procedure based on Cholesky factorization consists of successive forward substitution and backward substitution [14], which is incompatible with column iterative sorting. With QR decomposition, in contrast, the detection procedure can be executed with backward substitution only, which is suitable for column iterative sorting. Moreover, performing QR decomposition on $H$ yields a unitary matrix $Q$ and an upper triangular matrix $R$, i.e., $H = QR$, and pre-multiplying $y - Hs$ by the unitary matrix $Q^H$ ($Q^H Q = I$; for a real-valued $Q$, $Q^H = Q^T$) does not change its norm and thereby has no impact on the MIMO detection. Thus, Equation (2) can be reformulated as
$$\hat{s} = \arg\min_{s \in \Omega} \left\| Q^H y - Rs \right\|^2. \quad (3)$$
Therefore, QR decomposition is preferred in MIMO detection. With SQRD performed on the channel matrix, the columns of $R$ are rearranged by iterative sorting so that, as far as possible, the signals in the different layers of $s$ are detected in descending order of signal-to-noise ratio (SNR). In this way, the error propagation in the detection process is alleviated, which can improve the detection performance.
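The norm-preservation argument can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly generated 4 × 4 channel; the function names `ml_metric` and `qr_metric` and the test data are illustrative and not taken from the article.

```python
import numpy as np

def ml_metric(y, H, s):
    """Squared Euclidean distance ||y - H s||^2 used by the ML criterion."""
    return np.linalg.norm(y - H @ s) ** 2

def qr_metric(y, H, s):
    """Same distance evaluated after QR decomposition: ||Q^H y - R s||^2."""
    Q, R = np.linalg.qr(H)            # H = Q R, with R upper triangular
    return np.linalg.norm(Q.conj().T @ y - R @ s) ** 2

# Illustrative data: random complex channel, QPSK-like symbols, small noise.
rng = np.random.default_rng(0)
N = 4
H = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
s = rng.choice([-1.0, 1.0], N) + 1j * rng.choice([-1.0, 1.0], N)
y = H @ s + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

print(ml_metric(y, H, s), qr_metric(y, H, s))
```

Both values agree up to rounding for any candidate `s`, which is why the search can be carried out on the triangular system.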

Related Studies
In recent years, many studies have focused on the implementation of SQRD. In references [15-21], the SQRD is performed directly on the complex-valued matrix $H$ for the complex-valued MIMO detection model. However, the subsequent MIMO detection in a complex-valued model is more complicated than that in a real-valued model [22]. Thus, many studies [3,8] develop SQRD based on conventional RVD [23], with which the complex-valued model in Equation (1) is converted into a real-valued model as follows:
$$\begin{bmatrix} y^R \\ y^I \end{bmatrix} = \begin{bmatrix} H^R & -H^I \\ H^I & H^R \end{bmatrix} \begin{bmatrix} s^R \\ s^I \end{bmatrix} + \begin{bmatrix} n^R \\ n^I \end{bmatrix}, \quad (4)$$
where the superscripts $R$ and $I$ denote the real and imaginary parts, respectively, and $n$ is the noise term of Equation (1).
Before performing iterative sorting and GR operations, the N × N complex channel matrix is first converted into its real counterpart of size 2N × 2N with conventional RVD [23]. As the matrix then has four times as many entries, the number of sorting procedures and GR operations both increase dramatically, which leads to huge hardware overhead and lengthy processing latency. The design in reference [9] adopts an SQRD algorithm based on a modified RVD and utilizes the symmetry of adjacent columns of the RVD matrix to reduce the number of CORDIC operations as well as sorting procedures. However, the computational complexity of this scheme remains relatively high and thereby incurs considerable hardware cost. To alleviate this problem, we propose an efficient SQRD algorithm, which can significantly reduce the computational complexity.
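As an illustration of conventional RVD, the sketch below stacks real and imaginary parts in the usual way and checks that the real-valued product reproduces the complex one. It assumes NumPy; the helper name `conventional_rvd` and the random test data are hypothetical.

```python
import numpy as np

def conventional_rvd(H, v):
    """Map an N x N complex matrix and a length-N complex vector
    to their 2N x 2N and length-2N real counterparts."""
    H_real = np.block([[H.real, -H.imag],
                       [H.imag,  H.real]])
    v_real = np.concatenate([v.real, v.imag])
    return H_real, v_real

rng = np.random.default_rng(0)
N = 4
H = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
s = rng.standard_normal(N) + 1j * rng.standard_normal(N)

H_r, s_r = conventional_rvd(H, s)
_, hs_r = conventional_rvd(H, H @ s)   # real form of the complex product H s

print(H_r.shape)                       # four times as many entries as H
print(np.allclose(H_r @ s_r, hs_r))
```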

Proposed Modified RVD
With the proposed modified RVD, the complex-valued system model in Equation (1) can be reformulated as . . .
It can be seen that the proposed modified RVD is a permuted version of the conventional RVD: it takes the 2 × 2 sub-matrices and 2 × 1 sub-vectors of the original complex matrix and vectors, respectively, as the basic units on which RVD is performed. Thus, N is restricted to an even number.
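The block-wise construction can be sketched as follows. This is an interpretation rather than the article's exact formulation: it assumes that each 2 × 2 sub-matrix and 2 × 1 sub-vector is expanded by conventional RVD, a layout that agrees with the element correspondences quoted in the hardware description (for example, $S_{3,2} = r^I_{1,2}$). The names `rvd`, `modified_rvd`, and `block_order` are hypothetical.

```python
import numpy as np

def rvd(A):
    """Conventional RVD of a complex matrix (2-D) or vector (1-D)."""
    A = np.asarray(A)
    if A.ndim == 1:
        return np.concatenate([A.real, A.imag])
    return np.block([[A.real, -A.imag], [A.imag, A.real]])

def modified_rvd(H, v):
    """Apply RVD to every 2 x 2 sub-matrix of H and 2 x 1 sub-vector of v."""
    N = H.shape[0]
    if N % 2 != 0:
        raise ValueError("N must be even")
    n = N // 2
    S = np.block([[rvd(H[2*i:2*i+2, 2*j:2*j+2]) for j in range(n)]
                  for i in range(n)])
    t = np.concatenate([rvd(v[2*i:2*i+2]) for i in range(n)])
    return S, t

def block_order(N):
    """Permutation taking the conventional ordering [Re(1..N), Im(1..N)]
    to the block-wise ordering [Re(1,2), Im(1,2), Re(3,4), Im(3,4), ...]."""
    idx = []
    for i in range(0, N, 2):
        idx += [i, i + 1, N + i, N + i + 1]
    return np.array(idx)

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
y = rng.standard_normal(4) + 1j * rng.standard_normal(4)

S_mod, y_mod = modified_rvd(H, y)
perm = block_order(4)
# Under this layout the modified RVD equals a row/column permutation
# of the conventional RVD.
print(np.array_equal(S_mod, rvd(H)[np.ix_(perm, perm)]))
print(np.array_equal(y_mod, rvd(y)[perm]))
```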

The Proposed SQRD Algorithm
Based on the modified RVD introduced above, we propose a reduced-complexity SQRD algorithm with CORDIC-based GR, which is shown in Algorithm 1. The proposed SQRD algorithm basically comprises three steps.
In step 1, sorted complex Givens rotation (SCGR), which combines iterative sorting with complex Givens rotation, is applied to the complex channel matrix $H$ and the receive signal $y$. Before the elimination of every column, the sorting procedure first finds the column with the smallest norm and swaps it with the first column. Then, the reordered matrix is processed with complex Givens rotation to zero the elements below the diagonal in the first column. These two procedures are repeated until all the elements below the diagonal of $H$ are eliminated and the complex upper triangular matrix $R_c$ is obtained. In the SCGR, the permutation matrix $P$ records the column order of the iterative sorting for subsequent MIMO detection.
In step 2, $R_c$ is converted into its real counterpart $S$ with the proposed modified RVD. Figure 1 shows the diagram of the proposed modified RVD performed on $R_c$ for N × N (N is even) MIMO systems, where $r^R$ and $r^I$ denote the real and imaginary parts of $r$, respectively. First, $R_c$ is partitioned into 2 × 2 sub-matrices $q^{2\times2}_{i,j}$, and the accompanying vector into 2 × 1 sub-vectors $q^{2\times1}_{i}$, where $i, j = 1, 2, \cdots, n$ and $n = N/2$. Then, RVD is performed on all the sub-matrices $q^{2\times2}_{i,j}$ and sub-vectors $q^{2\times1}_{i}$ to obtain their extended versions $p^{4\times4}_{i,j}$ of size 4 × 4 and $p^{4\times1}_{i}$ of size 4 × 1 by Equations (6) and (7), respectively. Due to the feature of complex Givens rotation [24], the diagonal elements of the upper triangular matrix $R_c$ are all real numbers, whereas the elements above its diagonal are complex numbers. After the modified RVD is performed, there is only one non-zero element below the diagonal of every diagonal sub-matrix $p^{4\times4}_{i,i}$ (in the red dotted box) of $S$. Consequently, the number of non-zero elements that need to be eliminated in $S$ is N/2 in total, which is small and requires few elimination operations in the following real Givens rotation (RGR). All the non-zero elements to be nullified in $S$ are located close to the diagonal but in different rows far from each other, which facilitates the elimination operations of the following RGR in hardware design.
Finally, in step 3, these non-zero elements below the diagonal of $S$ are eliminated with RGR to get the desired upper triangular matrix $R$ and $Q^H y$.
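The three steps can be sketched in floating point as shown below. This is not Algorithm 1 verbatim: it assumes NumPy, uses exact Givens rotations in place of CORDIC, recomputes the column norms at every iteration instead of updating them, and assumes that the modified RVD expands each 2 × 2 block by conventional RVD. The function names `scgr`, `modified_rvd`, and `rgr` are hypothetical.

```python
import numpy as np

def rvd(A):
    A = np.asarray(A)
    if A.ndim == 1:
        return np.concatenate([A.real, A.imag])
    return np.block([[A.real, -A.imag], [A.imag, A.real]])

def modified_rvd(M, v):
    n = M.shape[0] // 2
    S = np.block([[rvd(M[2*i:2*i+2, 2*j:2*j+2]) for j in range(n)]
                  for i in range(n)])
    return S, np.concatenate([rvd(v[2*i:2*i+2]) for i in range(n)])

def scgr(H, y):
    """Step 1: sorted complex Givens rotation -> (R_c, rotated y, column order)."""
    A, z = np.array(H, dtype=complex), np.array(y, dtype=complex)
    N = A.shape[1]
    order = np.arange(N)
    for k in range(N):
        # Sorting: bring the remaining column with the smallest norm to position k.
        m = k + int(np.argmin(np.sum(np.abs(A[k:, k:]) ** 2, axis=0)))
        A[:, [k, m]] = A[:, [m, k]]
        order[[k, m]] = order[[m, k]]
        for i in range(k + 1, N):            # zero A[i, k] against A[k, k]
            a, b = A[k, k], A[i, k]
            r = np.hypot(abs(a), abs(b))
            if r == 0.0:
                continue
            G = np.array([[np.conj(a), np.conj(b)], [-b, a]]) / r
            A[[k, i], :] = G @ A[[k, i], :]
            z[[k, i]] = G @ z[[k, i]]
        d = abs(A[k, k])                     # make the diagonal element real
        if d > 0.0:
            phase = np.conj(A[k, k]) / d
            A[k, :] *= phase
            z[k] *= phase
            A[k, k] = d
    return np.triu(A), z, order              # triu drops rounding residue

def rgr(S, t, tol=1e-12):
    """Step 3: real Givens rotations on the remaining sub-diagonal entries."""
    R, w, n_elim = S.copy(), t.copy(), 0
    for j in range(R.shape[1]):
        for i in range(j + 1, R.shape[0]):
            if abs(R[i, j]) > tol:
                r = np.hypot(R[j, j], R[i, j])
                c, s = R[j, j] / r, R[i, j] / r
                G = np.array([[c, s], [-s, c]])
                R[[j, i], :] = G @ R[[j, i], :]
                w[[j, i]] = G @ w[[j, i]]
                n_elim += 1
    return R, w, n_elim

rng = np.random.default_rng(1)
H = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
y = rng.standard_normal(4) + 1j * rng.standard_normal(4)
Rc, z, order = scgr(H, y)
S, t = modified_rvd(Rc, z)                   # step 2
R, w, n_elim = rgr(S, t)
print(order, n_elim)
```

For a generic 4 × 4 channel, step 3 of this sketch performs N/2 = 2 eliminations, in line with the count stated in the text.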

Performance Evaluation of the Proposed SQRD Algorithm
For an SQRD implemented with CORDIC, the number of required CORDIC operations can be regarded as a measure of the computational complexity of the SQRD algorithm. Within the proposed SQRD algorithm for N × N (N is even) MIMO systems, the number of CORDIC operations needed in the SCGR stage is given by Equation (8), whose detailed derivation is shown in Appendix B, and the number consumed in the following RGR stage is given by Equation (9). Table 1 lists the number of CORDIC operations required by the proposed SQRD algorithm and by the latest related study. For a clear comparison, the number of CORDIC operations needed by the proposed algorithm and its complexity reduction relative to the algorithm in reference [9] are presented for different matrix sizes in Figure 2a. The proposed SQRD algorithm has a clear advantage over the one in reference [9] in terms of computational complexity. For 4 × 4 MIMO systems, the number of CORDIC operations of the proposed algorithm is reduced by 44.7% compared to the one in reference [9]. In addition, as the size of the channel matrix increases, the reduction grows further and gradually approaches 50%.
To evaluate the BER performance of the proposed SQRD algorithm in MIMO detection, it is simulated with a K-best detector and a maximum likelihood (ML) detector in an uncoded 4 × 4 64-QAM MIMO system along with the SQRD algorithm in reference [9]. Figure 2b shows that the proposed SQRD algorithm achieves BER performance similar to that of the algorithm in reference [9].

Table 1. Complexity of different Givens rotation (GR)-based QR decomposition schemes. Columns: Algorithm; Number of CORDIC Operations; If N = 4.

Overview of the Proposed SQRD Hardware Architecture
According to Algorithm 1, we designed the corresponding high-speed and low-complexity SQRD hardware architecture for a 4 × 4 MIMO system. To match the high throughput of current wireless communication systems, the proposed SQRD architecture is deeply pipelined and highly parallel: it decomposes one 4 × 4 channel matrix in four clock cycles and processes one $Q^H y$ per clock cycle. In other words, the processing cycles of SQRD (clock cycles needed for decomposing one $H$) and of $Q^H y$ (clock cycles needed for processing one $Q^H y$) are four clock cycles and one clock cycle, respectively, which are both decreased by 20% compared to the design of reference [9]. As throughput = clock frequency / processing cycles, both the SQRD and $Q^H y$ throughput of the proposed design are improved by 25% compared to the design of reference [9]. In addition, the time-sharing GR structure is designed to take advantage of the idle state of CORDIC modules caused by iterative sorting, which further reduces the hardware overhead and improves hardware efficiency.
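The throughput relation can be worked through numerically. The sketch below uses the 250-MHz operating frequency reported for the implementation; the function and variable names are illustrative only.

```python
def throughput(clock_hz: float, processing_cycles: float) -> float:
    """Throughput = clock frequency / processing cycles."""
    return clock_hz / processing_cycles

F_CLK = 250e6  # operating frequency reported for the implementation

sqrd_rate = throughput(F_CLK, 4)  # one channel matrix every four clock cycles
qhy_rate = throughput(F_CLK, 1)   # one Q^H y vector every clock cycle

# Cutting the processing cycles by 20% scales throughput by 1 / (1 - 0.2),
# which is where the 25% improvement comes from.
relative_gain = 1.0 / (1.0 - 0.20) - 1.0

print(sqrd_rate, qhy_rate, relative_gain)
```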
The proposed SQRD hardware architecture basically comprises one norm calculator, three sorting modules, and four processing engines. The norm calculator and sorting modules are used for iterative sorting; the processing engines with CORDIC-based Givens rotation structure are used for decomposing the channel matrix and processing $Q^H y$. Figure 3 presents the dataflow of the proposed SQRD design for the 4 × 4 MIMO system. Sorted complex Givens rotation (SCGR) is performed on the 4 × 4 complex channel matrix $H$ and the 4 × 1 complex receive signal vector $y$ with all sorting modules and processing engines. After the iterative sorting procedure, every processing engine zeros the elements below the diagonal in the first column of the current input matrix in vectoring mode (introduced in Section 4.2) and rotates the elements of subsequent columns in rotation mode (introduced in Section 4.2). Thus, the SQRD processing cycles of SCGR are four clock cycles. After all four columns are processed, the complex upper triangular matrix $R_c$ is obtained. According to Algorithm 1, the 4 × 4 intermediate $R_c$ is next converted into its 8 × 8 real version $S$ for the following RGR, which is given by Equation (10).
Electronics 2020, 9, x FOR PEER REVIEW
Then, RGR performed with processing engine#3-4 eliminates the two non-zero elements below the diagonal of $S$, as shown in Figure 4. It can be seen that the elements of the upper two rows of $R_c$ are obtained after the processing in sorting#3, because the elimination of the first two columns of $[H\ y]$ has been completed. As the upper four rows of $S$ are derived from the upper two rows of $R_c$ according to Equation (10), they can be obtained immediately after the process of sorting#3 is done. Therefore, the elimination of the non-zero element $S_{3,2}$ (i.e., $r^I_{1,2}$) in RGR and of the third column of $R_c$ in SCGR are performed with processing engine#3 at the same time, when the relevant elements are output from sorting#3. This improves the parallelism of the SQRD hardware architecture and shortens the processing latency. As for the non-zero element $S_{7,6}$ (i.e., $r^I_{3,4}$), it is zeroed with processing engine#4 when $r_{4,4}$ of $R_c$ is obtained.
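For orientation, the layout of $S$ can be written out explicitly. The matrix below is a reconstruction, not a copy of Equation (10): it is inferred from the element correspondences quoted in the text (such as $S_{3,2} = r^I_{1,2}$, $S_{7,6} = r^I_{3,4}$, and $(S_{2,6}, S_{3,6}) = (r^R_{2,4}, r^I_{1,4})$) under the assumption that each 2 × 2 block of $R_c$ is expanded by conventional RVD, and it should be checked against the published equation.

```latex
S =
\begin{bmatrix}
r_{1,1} & r^R_{1,2} & 0 & -r^I_{1,2} & r^R_{1,3} & r^R_{1,4} & -r^I_{1,3} & -r^I_{1,4} \\
0 & r_{2,2} & 0 & 0 & r^R_{2,3} & r^R_{2,4} & -r^I_{2,3} & -r^I_{2,4} \\
0 & r^I_{1,2} & r_{1,1} & r^R_{1,2} & r^I_{1,3} & r^I_{1,4} & r^R_{1,3} & r^R_{1,4} \\
0 & 0 & 0 & r_{2,2} & r^I_{2,3} & r^I_{2,4} & r^R_{2,3} & r^R_{2,4} \\
0 & 0 & 0 & 0 & r_{3,3} & r^R_{3,4} & 0 & -r^I_{3,4} \\
0 & 0 & 0 & 0 & 0 & r_{4,4} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & r^I_{3,4} & r_{3,3} & r^R_{3,4} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & r_{4,4}
\end{bmatrix}
```

In this layout the only entries below the diagonal are $S_{3,2}$ and $S_{7,6}$, and rows 2 and 3 contain seven column pairs to the right of the first column, which matches the seven paired elements of $S_{2:3,:}$ mentioned for RGR.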

Processing Engines
Figures 5-7 present the block diagrams of the proposed processing engines (PEs). All these PEs comprise three kinds of CORDIC-based Givens rotation structures: processing unit a (PUa), processing unit b (PUb), and processing unit c (PUc). Figure 8 shows their architectures based on the CORDIC module, all of which can work in vectoring mode (VM) and rotation mode (RM). In VM, the elimination operation is applied to the leading column elements of the current input matrix; in RM, the rotation directions generated in VM are retrieved for the rotation operations of subsequent elements. PUa processes paired complex rows, zeroing the leading element of the lower one and turning that of the upper one into a real number [25]. In VM, the upper-left three CORDIC modules are used, whereas in RM, all four CORDIC modules are occupied. PUb is used for paired rows whose leading elements are real numbers and whose remaining elements are complex numbers. In VM, the upper CORDIC module processes the leading paired real elements and zeros the lower one; in RM, the two CORDIC modules are used for the rotation operations of the following complex elements. PUc, with only one CORDIC module, processes paired real inputs. As these processing units have different processing delays (PUa has two CORDIC stages, whereas PUb and PUc have one), delay elements (DEs) are used to align the data sequences for the subsequent processing unit or sorting module. Among all these Givens rotation structures, all the processing units in PE#1 and PE#2, PUa#4 in PE#3, and PUc#2 in PE#4 are used for SCGR, whereas PUc#3 in PE#3 and PUc#4 in PE#4 are dedicated to RGR. The processing units with multiplexers (MUXs) at the input and output ports, including PUb#2 and PUc#2-4, form the time-sharing Givens rotation structure, which is introduced in the next subsection.

Time-Sharing Givens Rotation Structure
During the SCGR procedure for the 4 × 4 MIMO system described above, the current input matrix of PE#2 is of size 3 × 3 after the first column of $H$ has been eliminated by PE#1. With the sorting procedures inserted in the column elimination processes, the Givens rotation structures of PE#2 process only three columns and stay idle for one clock cycle within every processing cycle (four clock cycles). Similarly, PE#3 and PE#4 are idle for two and three clock cycles of every processing cycle, respectively. As a result, quite a few processing units operate in an unsaturated state, which makes the SQRD hardware architecture inefficient. On the other hand, eliminating the non-zero elements below the diagonal of $S$ in RGR needs many Givens rotation operations that are independent of the SCGR process. Given this, we design a time-sharing (TS) Givens rotation structure that uses these idle states to take over part of the Givens rotation operations in RGR, which saves hardware cost and improves hardware efficiency.
Figure 9 shows the deeply pipelined processing flow of the proposed SQRD processor with time-sharing Givens rotation structure. The index $(i, j)$ in the box denotes the element being processed in the $i$th row and $j$th column of the first $H$ during the SCGR procedure or the first $S$ during the RGR procedure. In the SCGR procedure, after the process of sorting#2 is done, PUc#1 and PUa#3 in PE#2 stay idle in the first clock cycle of the processing cycle and process the three columns of the current 3 × 3 input matrix in the other three clock cycles; PUa#4 in PE#3 is idle for the first half of every processing cycle and processes the two columns of the current 2 × 2 input matrix in the second half; then, PUc#2 in PE#4 processes the remaining $H_{4,4}$ in the last clock cycle of the processing cycle and stays idle for the other three clock cycles.
In the RGR procedure, the elimination of $S_{3,2}$ in the first $S$, performed by PUc#3 in PE#3 in vectoring mode, starts at the 41st clock cycle, when $S_{3,2}$, i.e., $r^I_{1,2}$, and $S_{2,2}$, i.e., $r_{2,2}$, are obtained after the processing in sorting#3; PUc#4 in PE#4 eliminates $S_{7,6}$ in vectoring mode at the 55th clock cycle, after $S_{6,6}$, i.e., $r_{4,4}$, is obtained from PUc#2. With the time-sharing Givens rotation structure, part of the rotation operations of PUc#3 and PUc#4 on the subsequent elements shown in Figure 4 are assigned to suitable processing units in an idle state. As illustrated in Figure 9, the rotation operations of $(r^R_{2,4}, r^I_{1,4})$, i.e., $(S_{2,6}, S_{3,6})$, and $(-r^I_{2,4}, r^R_{1,4})$, i.e., $(S_{2,8}, S_{3,8})$, are separately assigned to the two CORDIC modules in PUb#2, because PUb#2 is idle exactly when $r_{1,4}$ and $r_{2,4}$ are generated. For a similar reason, the rotation operations of $(-r^I_{2,3}, r^R_{1,3})$, i.e., $(S_{2,7}, S_{3,7})$, $(0, r_{1,1})$, i.e., $(S_{2,3}, S_{3,3})$, and $(0, r^R_{3,4})$, i.e., $(S_{6,8}, S_{7,8})$, are assigned to PUc#2, PUc#4, and PUc#3, respectively.
If no time-sharing Givens rotation is adopted, as shown in Figure 10, an additional PUc#5 must help PUc#3 with the rotation operations to match the processing rate of SCGR, because PUc#3 alone would take seven clock cycles to finish the Givens rotation of the seven paired elements of $S_{2:3,:}$, which exceeds the processing cycles of SCGR. Furthermore, more delay buffers (DBs) would be needed to align the relevant elements for the modified RVD, because the process is spread over only three processing units (PUc#3, PUc#4, and PUc#5). Therefore, with the time-sharing Givens rotation structure, one CORDIC module and eight DBs (a DB contains $N_{bw}$ bit registers, where $N_{bw}$ is the adopted bit-width) are saved and the hardware efficiency of the entire SQRD is improved.
Figure 9. Deeply pipelined processing flow of the proposed SQRD processor with time-sharing Givens rotation structure.
Figures 11 and 12 illustrate the detailed CORDIC architecture adopted in our design. The CORDIC module has eight micro-rotation stages and four pipelined stages. In the normal CORDIC module shown in Figure 11a, the rotation direction, computed as $y_i > 0\ ?\ {-1} : 1$, is used and stored in VM; in RM, the rotation direction is retrieved. In the CORDIC module used for the time-sharing Givens rotation structure shown in Figure 11b, a time-sharing mode (TSM) is added to process the shared rotation operation using the rotation direction from the corresponding processing unit. The scale factor of CORDIC with eight micro-rotations is $\prod_{i=0}^{7} 1/\sqrt{1 + 2^{-2i}} = 0.607259$. In our design, we approximate this number by $2^{-1} + 2^{-3} - 2^{-6} - 2^{-9} = 0.607421875$, and the corresponding hardware architecture is presented in Figure 12.
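The vectoring and rotation modes can be illustrated with a small floating-point model. This is a behavioural sketch only, assuming eight micro-rotations, the stated direction rule, and the shift-add scale-factor approximation; it does not model the 16-bit fixed-point data path or the pipeline registers, and the names `vectoring` and `rotation` are hypothetical. Passing directions produced by one call into `rotation` for another element pair mirrors how a unit in rotation or time-sharing mode reuses the directions of the unit that performed the vectoring.

```python
import math

N_ITER = 8
# Exact scale factor of eight micro-rotations and its shift-add approximation.
K_EXACT = math.prod(1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(N_ITER))
K_APPROX = 2.0 ** -1 + 2.0 ** -3 - 2.0 ** -6 - 2.0 ** -9

def vectoring(x, y, n_iter=N_ITER):
    """Rotate (x, y) towards the x-axis and record the directions.
    Assumes x >= 0 so that the angle lies in the CORDIC convergence range."""
    directions = []
    for i in range(n_iter):
        d = -1 if y > 0 else 1            # direction rule: y_i > 0 ? -1 : 1
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        directions.append(d)
    return x * K_APPROX, y * K_APPROX, directions

def rotation(x, y, directions):
    """Apply previously recorded directions to another element pair."""
    for i, d in enumerate(directions):
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return x * K_APPROX, y * K_APPROX

# Vectoring on a leading pair, then reuse of the directions on a second pair.
mag, residual, dirs = vectoring(3.0, 4.0)
rx, ry = rotation(1.0, 0.0, dirs)
print(mag, residual)
print(math.hypot(rx, ry))
```

With eight iterations the residual in `y` is bounded by the last micro-rotation angle, and the approximate scale factor leaves a gain error of `K_APPROX / K_EXACT`, roughly 0.03%.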

Sorting
The iterative sorting procedure of the proposed SQRD processor is performed with the norm calculator and sorting#1-3. Figure 13 shows the architectures of these modules. The complex elements of the 4 × 4 channel matrix $H$ are delivered column by column as the input of the norm calculator. First, the squares of the real and imaginary parts of the four elements in one column are calculated with SQ modules, and these squares are added up in pairs to obtain the norm value of every complex element; ultimately, all the norm values are added together with the SUM module to get the norm value of the current input column. In sorting#1, the norm update module is bypassed and the CS module compares the successive norm values until it finds the minimum one. Finally, the column with the smallest norm is swapped with the first of the four columns stored in the column buffer. In a similar way, sorting#2 and sorting#3 perform the rest of the sorting procedures. Unlike sorting#1, however, they first update the norm values from the former sorting module according to line 13 in Algorithm 1 before the CS module.
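The sorting datapath can be sketched in software as follows. The norm update is an assumption: line 13 of Algorithm 1 is not reproduced here, so the sketch uses the update commonly applied in sorted QR decomposition, which subtracts the squared magnitude of the newly produced row of R from each remaining column norm. The function names are hypothetical and NumPy is assumed.

```python
import numpy as np

def column_norms(H):
    """Norm calculator: square real and imaginary parts (SQ modules),
    add them per element, then sum over each column (SUM module)."""
    squared = H.real ** 2 + H.imag ** 2
    return squared.sum(axis=0)

def compare_select(norms, first):
    """CS module: index of the smallest norm among columns first..end."""
    return first + int(np.argmin(norms[first:]))

def update_norms(norms, r_row, k):
    """Assumed norm update: remove the energy captured by row k of R."""
    updated = norms.copy()
    updated[k + 1:] -= np.abs(r_row[k + 1:]) ** 2
    return updated

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

norms = column_norms(H)
print(compare_select(norms, 0))

# Check the update against an unsorted reference QR decomposition.
_, R_ref = np.linalg.qr(H)
updated = update_norms(norms, R_ref[0], 0)
print(np.allclose(updated[1:], np.sum(np.abs(R_ref[1:, 1:]) ** 2, axis=0)))
```

Updating the norms avoids recomputing them from the partially rotated matrix, which is why only sorting#1 needs the full norm calculator output.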

Figure 13. Architecture of the modules of the iterative sorting procedure: (a) norm calculator; (b) sorting#1-3.

Implementation Results and Comparisons
We designed the RTL model of the proposed SQRD processor in Verilog HDL and synthesized it with Synopsys Design Compiler in SMIC 55-nm CMOS technology. The bit-width of the data path of our design was set to 16 bits, comprising 5 bits for the integer part and 11 bits for the fractional part. The fixed-point design of the proposed SQRD was simulated on the same platform introduced in Section 2.1. The result shows that the BER performance degrades negligibly compared to the floating-point one. Table 2 presents a summary of the implementation results and performance comparisons with related studies. The gate count of the proposed SQRD processor is 176.5 K, which is greatly reduced compared to the other designs. This is due to the adopted low-complexity SQRD algorithm and the time-sharing Givens rotation structure. Our design achieves a high throughput of 62.5 M SQRD/s and 250 M $Q^{H}y$/s at an operating frequency of 250 MHz, which is better than most of the other studies. The design in reference [3] has the highest SQRD throughput with the best $f_{\max}$, albeit at the expense of exorbitant hardware cost and long processing latency. Therefore, we take hardware complexity and implementation technology into consideration and introduce normalized hardware efficiency (NHE) for a fair comparison of throughput. Table 2 clearly shows that our design is superior to all the other SQRD processors in terms of normalized hardware efficiency. In addition, we evaluate these designs in a 4 × 4 64QAM MIMO system, where our design achieves the highest data throughput of 6 Gbps. For IEEE 802.11ax [1], the maximum uncoded data throughput in the same scenario with a bandwidth of 160 MHz is 2.88 Gbps. Therefore, the proposed SQRD design is able to support IEEE 802.11ax.
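The reported throughput figures can be related to each other with a short calculation. The sketch below is one plausible reading of the numbers, not a statement of how the authors derived them: it assumes that each $Q^{H}y$ operation corresponds to one detected symbol vector carrying four 64QAM symbols. The variable names are illustrative.

```python
F_CLK_HZ = 250e6        # operating frequency reported in the text
SQRD_PER_S = 62.5e6     # SQRD throughput reported in the text
QHY_PER_S = 250e6       # Q^H y throughput reported in the text
N_STREAMS = 4           # 4 x 4 MIMO: four spatial streams
BITS_PER_SYMBOL = 6     # 64QAM carries log2(64) = 6 bits per symbol
AX_MAX_BPS = 2.88e9     # IEEE 802.11ax uncoded maximum quoted in the text

# Clock cycles available per SQRD at the reported rates.
cycles_per_sqrd = F_CLK_HZ / SQRD_PER_S

# Assumption: one symbol vector is detected per Q^H y operation.
data_rate_bps = QHY_PER_S * N_STREAMS * BITS_PER_SYMBOL

print(f"cycles per SQRD: {cycles_per_sqrd:g}")
print(f"MIMO data rate:  {data_rate_bps / 1e9:g} Gbps")
print(f"exceeds 802.11ax figure: {data_rate_bps > AX_MAX_BPS}")
```

Under this assumption the processor completes one SQRD every four clock cycles, and the resulting data rate matches the 6 Gbps quoted in the text.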

Conclusions
In this article, we designed an SQRD processor with reduced complexity and high throughput for MIMO detectors. An efficient SQRD algorithm based on a novel modified RVD was proposed, which significantly reduces the computational complexity compared to the latest studies. Based on the proposed algorithm, we designed the corresponding SQRD hardware architecture with CORDIC-based Givens rotation. In the hardware design, a time-sharing Givens rotation structure was adopted to make use of idle CORDIC processors as far as possible. In this way, hardware complexity was further decreased and hardware efficiency was improved. We also implemented the SQRD processor in SMIC 55-nm CMOS technology. The implementation results show that our design surpasses other related studies in normalized hardware efficiency and achieves a MIMO data throughput of 6 Gbps, which can support current high-speed wireless MIMO systems.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Since y is normally distributed, $y \sim \mathcal{N}(Hs, \sigma^{2} I_{N \times N})$ [14], the likelihood of s satisfies

$$p(y; s) \propto \exp\left(-\frac{\|y - Hs\|^{2}}{2\sigma^{2}}\right) \tag{A1}$$

Note that maximizing the likelihood $p(y; s)$ is equivalent to minimizing $\|y - Hs\|^{2}$. Therefore, the ML solution of s is given by $\hat{s}_{ML} = \arg\min_{s \in \Omega} \|y - Hs\|^{2}$.
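The ML criterion above can be illustrated with an exhaustive search over all candidate vectors. This is a minimal sketch for intuition only: the function name `ml_detect`, the 2 × 2 real-valued channel, and the two-point constellation standing in for Ω are all assumptions, and a practical detector would not enumerate every candidate.

```python
from itertools import product


def ml_detect(H, y, constellation):
    """Exhaustive search for argmin ||y - H s||^2 over s in constellation^n."""
    n_tx = len(H[0])
    best_s, best_metric = None, float("inf")
    for s in product(constellation, repeat=n_tx):
        metric = 0.0
        for row, y_i in zip(H, y):
            # Residual of one receive dimension: y_i - (H s)_i.
            residual = y_i - sum(h * s_j for h, s_j in zip(row, s))
            metric += abs(residual) ** 2
        if metric < best_metric:
            best_s, best_metric = s, metric
    return best_s, best_metric


# Hypothetical 2 x 2 real-valued example: s = (1, -1) sent, small noise added.
H = [[1.0, 0.5],
     [0.2, 1.0]]
y = [0.6, -0.7]
s_hat, metric = ml_detect(H, y, constellation=(-1.0, 1.0))
```

The search cost grows as the constellation size raised to the number of streams, which is why preprocessing such as QRD is used to enable cheaper detection.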

Appendix B
The number of CORDIC operations needed in SCGR is the summation of the CORDIC operations needed in every column elimination process.
The elimination process of the kth column zeros the complex-valued elements below the kth row and turns the element in the kth row into a real number. It consists of two steps.
Step 1: zero the imaginary parts of all the N − k + 1 complex-valued elements to be processed in the kth column. Eliminating the imaginary part of each such element requires one CORDIC operation in vectoring mode and N − k CORDIC operations in rotation mode (rotations of the subsequent N − k elements in the corresponding row). Thus, the number of CORDIC operations required in step 1 is (N − k + 1)(1 + N − k).
Step 2: the first row of the kth column is selected as the pivot row to nullify all the N − k real-valued elements in the subsequent rows of the kth column. Eliminating each real-valued element in the subsequent rows requires one CORDIC operation in vectoring mode and 2(N − k) CORDIC operations in rotation mode (rotations of the subsequent N − k paired complex-valued elements in the corresponding rows). Thus, the number of CORDIC operations required in step 2 is (N − k)(1 + 2(N − k)).
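The two per-column counts above can be checked by enumerating the operations element by element. The sketch below is illustrative: the function names are hypothetical, and the total assumes that the column index k runs from 1 to N, which the text above does not state explicitly.

```python
def cordic_ops_for_column(n, k):
    """Count CORDIC operations for column k (1-indexed) of an n x n matrix."""
    # Step 1: each of the n - k + 1 complex elements needs one vectoring
    # operation plus n - k rotation operations along its row.
    step1 = sum(1 + (n - k) for _ in range(n - k + 1))
    # Step 2: each of the n - k real elements below the pivot needs one
    # vectoring operation plus 2 * (n - k) rotation operations.
    step2 = sum(1 + 2 * (n - k) for _ in range(n - k))
    return step1, step2


def total_cordic_ops(n):
    """Sum both steps over all columns (assumption: k = 1, ..., n)."""
    return sum(sum(cordic_ops_for_column(n, k)) for k in range(1, n + 1))


# Per-column breakdown for a 4 x 4 matrix.
for k in range(1, 5):
    print(k, cordic_ops_for_column(4, k))
```

For N = 4 and k = 1 the enumeration gives 16 operations for step 1 and 21 for step 2, in agreement with the closed-form expressions (N − k + 1)(1 + N − k) and (N − k)(1 + 2(N − k)).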