FPGA-Based Implementation of Multidimensional Reconciliation Encoding in Quantum Key Distribution

We propose a multidimensional reconciliation encoding algorithm based on a field-programmable gate array (FPGA) with variable data throughput that enables quantum key distribution (QKD) systems to be adapted to different throughput requirements. Using the circulatory structure, data flow in the most complex pipeline operation in the same time interval, which enables the structural multiplexing of the algorithm. We handle the calculation and storage of eight-dimensional matrices cleverly to conserve resources and increase data processing speed. In order to obtain the syndrome more efficiently, we designed a simplified algorithm according to the characteristics of the FPGA and parity-check matrix, which omits the unnecessary operation of matrix multiplication. The simplified algorithm could adapt to different rates. We validated the feasibility and high speed of the algorithm by implementing the multidimensional reconciliation encoding algorithm on a Xilinx Virtex-7 FPGA. Our simulation results show that the maximum throughput could reach 4.88 M symbols/s.


Introduction
Quantum cryptography can provide information-theoretic security by combining onetime pad with quantum key distribution (QKD) [1,2] and has become an important branch and hotspot in the field of modern cryptography. QKD allows legitimate parties, Alice and Bob, to share secure keys even if the quantum channel is controlled by eavesdropper Eve. The fundamental theorems of quantum physics (no-cloning theorem, etc.) guarantee that quantum states transmitted through a quantum channel cannot be replicated accurately. Any eavesdropping behavior inevitably disturbs the quantum states on which the key information is encoded, which results in the increase in channel noises. Such noises can be monitored by the legitimate parties; therefore, any eavesdropping behavior can be discovered.
According to different carriers of the key, QKD can be divided into discrete-variable QKD (DVQKD) and continuous-variable QKD (CVQKD) [3][4][5][6]. DVQKD uses the polarization or phase degree of single photons to encode key information, which can realize long-distance key distribution using single-photon detection technology and post-selection. CVQKD employs the quadrature components of quantum states to encode key information; it is compatible with existing optical communication technology, and the key rate is high for short and medium distances. Important progress has been made recently [7][8][9][10][11][12][13][14]. However,

The Principle of Multidimensional Reconciliation
For Gaussian modulation CVQKD protocols, the shared raw keys of Alice and Bob are correlated Gaussian variables. For long transmission distances, the SNR is very low. In this case, the Gaussian variables have a small absolute value and are distributed around 0; thus, it is difficult to discriminate the sign and realize the encoding. The basic ideal of multidi-mensional reconciliation [17] is that one can perform a proper rotation before encoding the key in the sign of the coordinates; in this way, the small-absolute-value coordinates can be eliminated. The schematic diagram of multidimensional reconciliation is shown in Figure 1. The encoding ideal is as follows: Given y ∈ S n−1 (S n−1 represents a sphere), define a cube Q y centered on 0 and containing y such that the prior distribution of y in Q y is uniform. The description of Q y can be achieved by describing the orthogonal transformation of the transformed canonical cube. Bob randomly selects vertices U (U ∈ S n−1 ) of the canonical cube and then provides Alice with an orthogonal transformation M(y, U) that maps y to U, satisfying M(y, U)y = U; this transformation defines cube Q y . Finally, Alice can recover U according to x and M(y, U). The specific steps of multidimensional reconciliation are reported below.
Step 1. For security, Alice and Bob use the method of sequential combination to form a d-dimensional vector of d continuous variables in multidimensional reconciliation and obtain X = (x 1 , x 2 , . . . , x d ) and Y = (y 1 , y 2 , . . . , y d ), where d denotes the dimension of multidimensional reconciliation. Here, we choose d = 8 because its performance with a low SNR is better than that of other dimensions (d = 1, 2, 4) [29]. Assuming that the quantum channel is linear, the relationship between Alice and Bob can be expressed as and where x and y represent the d-dimensional vectors of Alice and Bob, respectively; and z represents the noise of the quantum channel.
Step 2. In order to change the variable space from a non-uniform Gaussian distribution to a uniform Gaussian distribution, the d-dimensional state points need to be mapped from Euclidean space D d to unit spherical space S d−1 . To this end, Alice and Bob normalize their Gaussian variables x and y. and where and y = y, y .
Step 3. Bob generates a random binary sequence (b 1 , b 2 , . . . , b d ). Then, a d-dimensional random vector u = (u 1 , u 2 , . . . , u d ) can be generated as Step 4. Bob calculates rotation matrix M(y, u). When d = 8, there is a (non-unique) family of 8 eight-dimensional orthogonal matrices (A 1 , A 2 , .., A d ) [17], where A 1 is an 8 × 8 identity matrix. Moreover, for i, j > 1, we have where {A, B} = AB + BA, δ ij is the unit impulse function. Rotation matrix M(y, u) can be calculated from a family of d-dimensional orthogonal matrices (A 1 , A 2 , .., A d ) as where, α i (y, u) = (A i y | u) are the coordinates of u in basis (A 1 y, A 2 y, . . . , A d y) and (A 1 y, A 2 y, . . . , A d y) is a set of standard orthonormal bases for multidimensional space. The above proves that M(y, u)y = u.
where I syn is the syndrome and (b 1 , b 2 , . . . , b 16 ) is a 16-bit binary string. The width of the binary string is determined by the structure of LDPC codes.

FPGA Logic Design and Implementation of Multidimensional Reconciliation Encoding
In the FPGA design and implementation, we used the pipeline operation to achieve high throughput with minimal FPGA resources and theoretically analyzed the throughput of our scheme. Then, we designed the storage mode of an eight-dimensional matrix according to the addressing mode of the FPGA and simplified the calculations for the orthogonal basis coordinates and matrix inversion of the eight-dimensional matrix. Finally, the syndrome was obtained through addressing and shifting.

Hardware Implementation Scheme
In order to implement the multidimensional reconciliation algorithm on an FPGA chip, we used VC709 Evaluation Kit manufactured by Xilinx, which included a Virtex-7 VX690T FPGA [30] and a user-programmable differential oscillator (range: 10 MHz-810 MHz). The Virtex-7 FPGA had a total of 433,200 LUTs, 866,400 Flip-Flops, and 3600 DSPs, as well as 52,920 Kb BRAMs.
To efficiently implement the multidimensional reconciliation algorithm, the relationship between hardware resources and throughput needs to be carefully investigated. In view of the available storage resources and editable logical resources, we adopted two approaches in different data processing parts, including the combination of data parallel and time division multiplexing, and the equal-interval circulatory structure-based pipeline operation. These methods can realize the exchange between area and speed. Because the encoding algorithm contains a large number of eight-dimensional matrix operations and the data are 32-bit single-precision floating-point numbers, this consumes a lot of editable logical resources. To overcome these issues, some modules were designed to be implemented with time division multiplexing and the circulatory structure with the same time interval. Due to the storage mode of the FPGA being a one-dimensional array, we converted the matrix form to fit the storage of the FPGA.
The logic design structure diagram of the multidimensional reconciliation encoding algorithm is shown in Figure 2. Encoding-Top is the top-level module of the encoding algorithm. BRAM-Y are eight block memory units that are configured as a true dualport RAM with a width of 32 and depth of 8. BRAM-A i is a block memory unit that is configured as a true dual-port RAM with a width of 64 and depth of 8 and is used to store eight orthogonal matrices (A 1 , A 2 , . . . , A 8 ). BRAM-H is a block memory unit that is configured as a single-port RAM with a width of 18 and depth of 500,000 and is employed to store the check matrix. DCM are four different data computing modules. Each cplt signal marks the end of the corresponding module operation. Inv are five matrix inversion modules with the circulatory structure operating in the same time interval.

Analysis of Data Throughput
The total throughput, T h , is a key parameter to evaluate multidimensional reconciliation encoding and can be expressed as where f s is the system clock frequency, T n is the number of clock cycles required by the longest pipeline operation step, and d is the dimension of multidimensional reconciliation. It can be seen that the total throughput is closely related to the above three parameters. At a given clock frequency, the encoding performance is optimal when d = 8, so the throughput is inversely proportional to T n . Therefore, with limited hardware resources, the pipeline task block loop should be optimized to achieve high throughput T h . The pipeline operation divides the combinatorial logic into a series of task modules and requires the addition of a first-level register before and after each task module. If the task module is too small, it consumes a large number of registers; each level of registers requires one clock cycle, so it also causes the operation time of the task module to increase. Therefore, we should reasonably divide the pipeline structure according to the specific throughput requirements.

Flexible Division of Pipeline Operation
The goal of the algorithm optimization is to achieve high throughput with minimal FPGA resources. As mentioned above, given the clock frequency and d = 8, the number of clock cycles, T n , that the longest pipeline step requires is the key factor that affects throughput. We mainly used pipeline technology and a series-parallel structure to balance throughput and resource utilization of the FPGA. Pipeline technology cannot shorten the processing time of a single datum, but it can effectively shorten the processing time of the overall data. In contrast, the parallel structure can shorten the processing time of a single datum.
After carefully analyzing the calculation process of the encoding algorithm, we divided the pipeline operation of the encoding operation into four operation steps. Pipelining was used in the normalization of encoding and random number mapping operations, as well as in the syndrome generation module. The module with the longest running time is the calculation of (A 1 y 1 , A 2 y 2 , . . . , A d y d ) −1 . Figure 3 shows the designed pipeline structure, in which Data2 starts to run the first step when Data1 has finished the first step. The circulatory structure with the same time interval is employed for the module that computes (A 1 y 1 , A 2 y 2 , . . . , A d y d ) −1 to suppress T n according to the required throughput.
Time Figure 3. The pipeline structure of the encoding algorithm. The first step is to calculate (y 1 , y 2 , . . . , y d ) and (u 1 , u 2 , . . . , u d ) using Y and (b 1 , b 2 , . . . , b d ). The second step is to build standard orthonormal basis (A 1 y 1 , A 2 y 2 , . . . , A d y d ) for the multidimensional space. The third step is matrix inversion. The fourth step is to map u to standard orthonormal basis (A 1 y 1 , A 2 y 2 , . . . , A d y d ) and obtain M(y, u).

Eight-Dimensional Matrix Operation
Because the performance of the multidimensional reconciliation algorithm is optimal when d = 8, we only studied the eight-dimensional matrix operation. During the operation process of the eight-dimensional matrix, such as addition, subtraction, multiplication, division, root, square, and inversion [31], the bit growth of data may occur. Although not every level of operation encounters bit growth, each bit growth instance leads to the doubling of the maximum value of the data. This may cause data overflow if fixed-point numbers are employed, making the results of the operation incorrect because the data are not accurate enough. In contrast, floating-point arithmetic has a large dynamic range, and because of its automatic scaling, the possibility of data overflow and mantissa loss can be eliminated. We used 32-bit single-precision floating-point numbers for the calculation of all basic units.

Storage of 8-Dimensional Matrices
According to the addressing mode of the FPGA, the 8 × 8 matrix is extended to a onedimensional array of length 64. In order to read and store quickly, we need to use 8 memory units, each with a depth of 8 and a width of 32. The first address of each memory unit stores the first row of the 8-dimensional matrix, and so on. Therefore, each write-and-read operation only takes 8 clock cycles.

Construction of Orthogonal Basis Coordinates
The orthogonal basis coordinates of u is (A 1 y, A 2 y, . . . , A d y), where (A 1 , A 2 , . . . , A d ) is a set of orthogonal matrices that can be computed by four 2 × 2 matrices [17]. The calculation process is reported below.
The four basis matrices are given by where K i 1 ,...,i l = K i 1 ⊗ · · · ⊗ K i l represents the tensor product. According to the characteristic of orthogonal matrix A i , each row of it has only one non-zero element, and the absolute value of this element is 1. Therefore, the multiplication of a one-dimensional vector y with a matrix A i can be converted to finding a non-zero element of an orthogonal matrix A i and rearranging the y vector. In this way, we can eliminate many multiplication and addition operations and save a lot of LUT resources.
For example, For the calculation of (A 1 y 1 , A 2 y 2 , . . . , A d y d ) −1 , we used the Gaussian elimination method [32] to obtain the inverse matrix of the eight-dimensional matrix, which can be divided into three steps: (1) finding the maximum value of each row; (2) obtaining an upper triangular matrix using elementary row operations; (3) data normalization. Due to the calculation of the inverse matrix consuming a lot of LUTs and DSPs, we used time division multiplexing to balance the speed and area of the FPGA, as shown in Figure 4.
Step 1 and Step 2 perform loop execution seven times to obtain an upper triangular matrix. Then, Step 3 performs loop execution eight times to obtain an inverse matrix.
Step 1 sends the position of the maximum value for each row to Step 2, and Step 2 sends the matrix to Step 1 after elementary row operations.

Upper triangular matrix
Inverse matrix Step 1 Step 2 Step 3

Figure 4. Time division multiplexing of inverse matrix.
Step 1 is finding the maximum value of each row.
Step 2 is elementary row operations.
Step 3 is data normalization.
In order to search for the maximum value of each row, eight comparators and an AND door are exploited. During the calculation process, the eight data in each row are simultaneously read from the same address. Because the comparator consumes only 87 LUTs, seven comparators are run concurrently and then the AND gate is used to decide the maximum value in each row. In this way, the logic delay of this combination circuit is minimized, which consists of the delay of the comparator of single-precision floating-point numbers and the delay of an AND gate.

Multiplexing Structure of Matrix Inversion Module
The matrix inversion module cannot be subdivided into multi-level pipelines with the pipeline operation due to module multiplexing and nested loops. To improve the computation speed, we designed a circulatory structure with the same time interval, as shown in Figure 5.
Firstly, T n is calculated based on the target throughput. The input data control module then assigns the input data stream to different matrix inversion modules simultaneously. Then, Data 2 performs the matrix inversion operation after T n clock cycles relative to Data 1, and so on. Therefore, Data 1 and Data 6 are both calculated by Matrix inversion module 1.
Each data stream has a corresponding matrix inversion module with a time interval of T n . The number of reused matrix inversion modules is T inv T n , where T inv is the time required for the inversion of an eight-dimensional matrix in serial operation.

Construction of the Syndrome
Due to the existence of channel loss and excess noise in quantum key distribution, there is inevitably some discrepancy when the two groups of Gaussian numbers of Alice and Bob are transformed into binary bits in the spherical space. To correct the bit errors, Bob can send a syndrome of his bit string to Alice. Here, the syndrome can be obtained by multiplying the check matrix with Bob's random bit string. The check matrix of an LDPC code [22] has very strong sparsity, that is, the "0" element accounts for most of the check matrix and the "1" element is very sparsely distributed. Considering the sparsity characteristic of an LDPC code, the shifting method is used to calculate the syndrome.
The storage of the entire check matrix requires a large amount of resources. In order to save memory resources, we store the check matrix by storing the positions of nonzero elements in each row of the matrix. The check matrix is extended in a quasi-cyclic manner [33]. Notice that the shifting operation is implemented after quasi-cyclic expansion, in which the non-zero element is extended to a 16 × 16 identity matrix. So, we use an 18-bit-wide memory unit to save the basis matrix and the extended matrix. The first 14 bits (A 0 -A 13 ) describe the positions of non-zero elements in the basis matrix, and the last four bits (S 0 -S 3 ) describe the shifting of the extended matrix.
For the multiplication of the basis matrix and the input random bit string, the position of the non-zero elements of the basis matrix is used as the address to read the bit string, and the drifting value of the non-zero element is used to shift the bit string. The multiple shifted bit strings are accumulated to obtain the product of a row of the check matrix and the random bit string, that is, the syndrome, as shown in Figure 6. If the address of the Nth non-zero element is smaller than the address of the (N−1)th non-zero element, that means that we start to compute the next syndrome, which is the product of the next row of the check matrix and the random bit string. This saves a lot of DSPs and LUTs and increases the running rate, because a lot of multiplication and addition operations are omitted.
The address of H base [1,3] Figure 6. Calculation procedure of the syndrome for a 2 × 4 base matrix. H base and H expa denote the base matrix and the matrix extended in a quasi-cyclic manner. str1, str2, str3, and str4 denote independent random bit strings.

Results and Discussions
In this section, we show the implementation results of multidimensional reconciliation encoding on an FPGA chip (Xilinx Virtex-7 FPGA). The feasibility of the encoding algorithm was validated. When the pipeline step takes longer than T n , we can use a modified pingpong operation to multiplex the step. In this case, throughput T h (as shown in (11)) becomes where n is the number for multiplexing. According to all the relevant parameters [15], the practical secret key rate of a CV-QKD system is given by where α = PP out /PP in ; PP out and PP in represent the post-processing output and input rates, respectively; FER is the frame error rate; n is the number of data used to distill the secret key; N is the number of sifted data after quantum transmission and measurement; I AB is the mutual information between Alice and Bob; χ BE is the Holevo bound; and ∆(n) is the finite-size offset factor. Notice that PP out depends on T h and satisfies PP out <= T h ; therefore, coding throughput T h affects the secret key rate of the CV-QKD system through α. The consumed resources and throughput for different multiplexing structures are shown in Figure 7.
We can see that the data throughput and hard resources consumed increased with the number of multiplexing. A throughput of 0.98 Msymbols/s was achieved without multiplexing. The throughput increased to 4.88 M symbols/s when the number of multiplexing was five. The resource consumption of LUTs was the highest, whereas the resource consumption of DSPs was the lowest.
The throughput results of the multidimensional reconciliation encoding algorithm on an FPGA and a CPU are shown in Table 1. The model of the CPU was MXC-6301D(EA), running with the C programming language. It is evident that the FPGA-based encoding algorithm dramatically improved the data processing speed in comparison to the CPUbased encoding algorithm, which had a throughput of only 0.063 M per second.  Our proposed algorithm for obtaining the syndrome can be adapted to systems with different parity-check matrices. The simulated results on an FPGA are shown in Table 2. Three check matrices with two different rates (0.02 and 0.1) and two different code lengths (160,000 and 500,000) were used to calculate the syndrome. For a fixed code rate, the time taken was directly proportional to the code length. For a fixed code length, the time taken was slightly longer when the code rate was higher.

Conclusions
In this paper, we propose and demonstrate an FPGA-based multidimensional reconciliation encoding algorithm with variable data throughput that is suitable for applications in the real-time data post-processing of CVQKD systems. The syndrome construction algorithm consumes very little BRAMs and LUTs of the FPGA and can adapt to paritycheck matrices with different code rates and code lengths. It is noted that the achieved data throughput is still limited. In our current algorithm, we find that matrix inversion consumes lots of hardware resources, which hinders the further improvement of throughput. In the future, we will improve the matrix inversion algorithm to make it compatible with pipeline operations; combined with higher system clock speed of the FPGA, a multidimensional reconciliation encoding algorithm with much higher throughput is possible.