High Efficiency Ring-LWE Cryptoprocessor Using Shared Arithmetic Components

Abstract: A high-efficiency architecture for a ring learning with errors (ring-LWE) cryptoprocessor using shared arithmetic components is presented in this paper. By sharing the number theoretic transform (NTT) polynomial multiplier and the polynomial adder between the encryption and decryption operations, the total number of polynomial multipliers and polynomial adders used in the proposed ring-LWE cryptoprocessor is reduced. In addition, the NTT polynomial multiplier is sped up by employing a multiple-path delay feedback (MDF) architecture and pipelining all stages of the NTT processes. As a result, the proposed architecture offers a substantial reduction in hardware complexity and computation latency compared with existing works. Implementation of the proposed ring-LWE cryptoprocessor on a Virtex-7 FPGA using Xilinx VIVADO shows a significant decrease in the number of slices and LUTs compared with previous works. Moreover, the proposed ring-LWE cryptoprocessor offers higher throughput and efficiency than its predecessors.


Introduction
Cryptographic algorithms are grouped into two categories: symmetric algorithms and public key (or asymmetric) algorithms. The former use a single key shared between two parties to enable secure communication, and the key is kept private from all other parties. Symmetric algorithms are widely used because of their simplicity. However, a symmetric key algorithm requires the sender and receiver to agree on the secret key in advance. Asymmetric, or public key, cryptosystems use two different keys, called the private key and the public key. The private key is kept secret and used for decryption, whereas the public key is used for encryption and can be revealed to all other parties. A message encrypted with a public key can only be decrypted using the corresponding private key.
The security of Rivest, Shamir, and Adleman (RSA) cryptosystems [1] and elliptic curve cryptosystems (ECC) [2,3] is based on the difficulty of solving certain number theoretic problems and the elliptic curve discrete logarithm problem, respectively. However, these problems can be solved in polynomial time on a quantum computer by the algorithm proposed by Shor [4]. Therefore, post-quantum cryptosystems have been proposed, and the National Institute of Standards and Technology (NIST) is standardizing them. Among lattice-based cryptosystems for the post-quantum era, ring learning with errors (ring-LWE) is a promising candidate because its security proofs are based on the worst-case hardness of lattice problems for which no efficient quantum algorithm is known [5].
A block diagram of a typical ring-LWE cryptoprocessor is shown in Figure 1. The input message $m$ is encrypted into the ciphertext $(c_1, c_2)$ using arithmetic computation on the public key $(a, p)$ and the error polynomials $e_1$, $e_2$, and $e_3$. The original message $m$ can be recovered from the ciphertext $(c_1, c_2)$ and the private key $r_2$ using the decryption operation. The ring-LWE problem has been studied in recent work, both in software and in hardware [5,7–9]. In Reference [5], high-throughput ring-LWE cryptoprocessors are designed to perform ring-LWE encryption and decryption operations. In Reference [7], an approach for integrating ring-LWE cryptography into existing fingerprint authentication systems to fully protect the fingerprint data is introduced. The authors of [9] present an implementation of ring-LWE encryption on IoT processors.
In the ring-LWE cryptosystem, arithmetic operations such as polynomial multiplication and addition are performed over a polynomial ring, typically $R_q = \mathbb{Z}_q[x]/f(x)$. Among these operations, polynomial multiplication is the most complex, and it can be performed efficiently using number theoretic transform (NTT)-based polynomial multiplication [5]. The NTT is a variant of the fast Fourier transform (FFT) that works in a finite field, so polynomial multiplication can be computed efficiently without inexact floating-point or complex arithmetic [10]. Several NTT multiplier architectures in the literature deploy single-path delay feedback (SDF) or multiple-path delay commutator (MDC) structures. For example, a high-throughput multiplier using NTT cores with a radix-2 SDF architecture is presented in [11]. In Reference [5], the authors introduce NTT cores based on radix-2 and radix-8 MDC architectures for ring-LWE cryptoprocessors, obtaining an encryption throughput of gigabits per second and a decryption throughput of megabits per second. However, these architectures require large hardware resources and long computation times because the NTT polynomial multipliers work separately and the NTT operations are not fully optimized.
In this paper, a novel approach that uses the arithmetic components of a ring-LWE cryptoprocessor efficiently to achieve high efficiency is presented. Specifically, the polynomial multiplier and polynomial adder that participate in the ring-LWE encryption operation are reused in the decryption phase to reduce hardware complexity. Additionally, the NTT polynomial multiplier is designed using an MDF architecture, with pipelining applied among all stages of the NTT and INTT transforms, to reduce hardware complexity and speed up multiplication. The contributions of this article are summarized as follows:
• We propose a ring-LWE cryptoprocessor architecture in which the same arithmetic components, namely one polynomial multiplier and one polynomial adder, are used in both the encryption and decryption operations. As a result, the proposed ring-LWE cryptoprocessor requires fewer hardware resources than existing architectures to perform encryption and decryption.
• We implement the polynomial multiplier as an NTT multiplier with a parallel MDF architecture to accelerate polynomial multiplication. Furthermore, pipelining is applied in the proposed design to reduce the system latency.
• We implement the proposed ring-LWE cryptoprocessor architecture on a Xilinx Virtex-7 FPGA and compare the obtained results with those of its predecessors. The performance evaluation shows that the proposed architecture offers higher throughput and better efficiency than the others.
The remainder of this paper is structured as follows. Section 2 briefly discusses ring-LWE cryptosystems. Section 3 presents the proposed algorithm and architecture for the ring-LWE cryptoprocessor. Section 4 provides the implementation results and comparison, and Section 5 concludes the paper.

Ring-LWE Cryptosystem
The ring-LWE problem builds on the learning with errors (LWE) problem introduced by Regev [12] in 2005, a machine learning problem that is as hard as worst-case lattice problems. The ring-LWE cryptosystem operates over a ring $R_q = \mathbb{Z}_q[x]/f(x)$, where $f(x)$ is an irreducible polynomial of degree $n$, $n$ is a power of two, and $q$ is a prime number such that $q \equiv 1 \bmod 2n$. A common choice of irreducible polynomial is $f(x) = x^n + 1$, as presented in [11].
Polynomial multiplication and polynomial addition are employed to carry out the cryptographic primitives of the ring-LWE cryptosystem. The procedures for key generation, encryption, and decryption of the ring-LWE cryptosystem are described as follows:
Key generation: This process generates the public key $(a, p)$ and the private key $r_2$. Polynomial $a$ is chosen uniformly, and two polynomials $r_1$ and $r_2$ are sampled from the discrete Gaussian distribution $\chi_\sigma$ to compute the public key polynomial:
$$p = r_1 - a \times r_2$$
Encryption: The input message $m$ is encoded to obtain the polynomial $m_e$. If the $i$th coefficient of $m$ is 1, it is mapped to $(q - 1)/2$; otherwise, it is mapped to 0. The ciphertext polynomials $c_1$ and $c_2$ are computed from the given polynomials and three error polynomials $e_1$, $e_2$, and $e_3$ that are sampled from the Gaussian distribution:
$$c_1 = a \times e_1 + e_2, \qquad c_2 = p \times e_1 + e_3 + m_e$$
Decryption: The original message $m$ is recovered from the pre-decoded polynomial $m_d$:
$$m_d = c_1 \times r_2 + c_2$$
Depending on the value of each coefficient of $m_d$, the decoder maps it to the corresponding binary bit to recover the original message $m$.
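As a check on the three procedures above, the short derivation below shows why decryption returns the encoded message plus a small error term. It is a sketch that assumes the public key has the form $p = r_1 - a \times r_2$, and it uses the ciphertext and pre-decoded polynomial as they are computed in Section 3 ($c_1 = a \times e_1 + e_2$, $c_2 = p \times e_1 + e_3 + m_e$, $m_d = c_1 \times r_2 + c_2$). All products are taken in $R_q$.

$$
\begin{aligned}
m_d &= c_1 r_2 + c_2 \\
    &= (a e_1 + e_2)\, r_2 + (r_1 - a r_2)\, e_1 + e_3 + m_e \\
    &= m_e + (e_1 r_1 + e_2 r_2 + e_3)
\end{aligned}
$$

The terms $a e_1 r_2$ cancel, so only products of Gaussian-distributed polynomials remain as noise. As long as each coefficient of this noise stays well below $q/4$ in magnitude, every coefficient of $m_d$ remains close to either 0 or $(q - 1)/2$, which is what allows a threshold decoder to recover $m$.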

Arithmetic Operations over Ring
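The arithmetic operations over the ring $R_q = \mathbb{Z}_q[x]/f(x)$ include modulo reduction, polynomial addition, and polynomial multiplication. Given coefficients $a_i$ and $b_i$ in $\mathbb{Z}_q$, two polynomials $a(x)$ and $b(x)$ over the ring can be defined as follows:
$$a(x) = \sum_{i=0}^{n-1} a_i x^i, \qquad b(x) = \sum_{i=0}^{n-1} b_i x^i$$
As mentioned previously, polynomial multiplication requires the most processing time, and the NTT-based algorithm is an efficient way to perform it. Assuming that $\omega$ is a primitive $n$th root of unity, the NTT of each coefficient of $a(x)$ is defined as:
$$A_i = \sum_{j=0}^{n-1} a_j \omega^{ij} \bmod q, \qquad i = 0, 1, \ldots, n-1$$
The inverse NTT (INTT) is calculated as:
$$a_i = n^{-1} \sum_{j=0}^{n-1} A_j \omega^{-ij} \bmod q, \qquad i = 0, 1, \ldots, n-1$$
Let $\alpha$ and $\beta$ be the vectors of $a(x)$ and $b(x)$ extended by appending $n$ zero elements. The multiplication of the two polynomials $a(x)$ and $b(x)$ can be computed based on the NTT and INTT:
$$a(x) \times b(x) = \mathrm{INTT}\big(\mathrm{NTT}(\alpha) \odot \mathrm{NTT}(\beta)\big)$$
where $\odot$ is the point-wise multiplication.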
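The NTT-based multiplication described above can be illustrated with a minimal Python sketch. It evaluates the transforms directly in O(N^2) time rather than with butterfly stages, so it only demonstrates the arithmetic, not the hardware architecture of this paper. The function names are hypothetical. As an assumption of the sketch, the transforms of the zero-padded vectors are taken over the padded length 2n, which requires a primitive 2n-th root of unity; the root is found by search rather than taken from the paper. The modular inverse via `pow(x, -1, q)` needs Python 3.8 or later.

```python
Q = 12289  # prime modulus used in this paper


def find_root_of_unity(size, q=Q):
    """Return a primitive size-th root of unity mod q (size is a power of two dividing q - 1)."""
    assert (q - 1) % size == 0
    for g in range(2, q):
        w = pow(g, (q - 1) // size, q)
        # For a power-of-two size, w has order exactly `size` iff w^(size/2) = -1 mod q.
        if pow(w, size // 2, q) == q - 1:
            return w
    raise ValueError("no primitive root of unity found")


def ntt(vec, w, q=Q):
    """Direct evaluation of A_i = sum_j a_j * w^(i*j) mod q."""
    size = len(vec)
    return [sum(vec[j] * pow(w, i * j, q) for j in range(size)) % q
            for i in range(size)]


def intt(vec, w, q=Q):
    """Inverse transform: a_i = N^(-1) * sum_j A_j * w^(-i*j) mod q."""
    size = len(vec)
    w_inv = pow(w, -1, q)
    size_inv = pow(size, -1, q)
    return [(size_inv * x) % q for x in ntt(vec, w_inv, q)]


def ntt_multiply(a, b, q=Q):
    """Product of a(x) and b(x) in Z_q[x]: zero padding, NTT, point-wise product, INTT."""
    n = len(a)
    alpha = a + [0] * n  # extended vectors with n appended zeros
    beta = b + [0] * n
    w = find_root_of_unity(2 * n, q)
    big_a, big_b = ntt(alpha, w, q), ntt(beta, w, q)
    pointwise = [(x * y) % q for x, y in zip(big_a, big_b)]
    return intt(pointwise, w, q)


def reduce_mod_xn_plus_1(c, q=Q):
    """Fold a length-2n product back into Z_q[x]/(x^n + 1) using x^n = -1."""
    n = len(c) // 2
    return [(c[i] - c[i + n]) % q for i in range(n)]
```

For instance, `reduce_mod_xn_plus_1(ntt_multiply([1, 2, 3, 4], [5, 6, 7, 8]))` gives the product of the two degree-3 polynomials reduced modulo $x^4 + 1$ and $q$.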
The polynomial addition of two polynomials $a(x)$ and $b(x)$ simply adds the corresponding coefficients of the two polynomials and then applies modulo reduction (MR). In the MR operation, the coefficients of the resulting polynomial are reduced modulo $q$. Several MR algorithms for this operation are presented in [5,13]. Since the parameters used in this paper are $n = 512$ and $q = 12{,}289$, the SAMS2 algorithm for $q = 12{,}289$ presented in [5] is selected.
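A minimal Python sketch of coefficient-wise polynomial addition followed by modulo reduction is given below. It is not the SAMS2 algorithm of [5]; as an assumption for illustration, it uses a single conditional subtraction, which suffices because the sum of two coefficients in $[0, q)$ is always less than $2q$. The function names are hypothetical.

```python
Q = 12289  # modulus used in this paper


def mod_reduce_sum(s, q=Q):
    """Reduce s = x + y, with 0 <= x, y < q, using one conditional subtraction."""
    return s - q if s >= q else s


def poly_add(a, b, q=Q):
    """Add corresponding coefficients of a(x) and b(x), then reduce each modulo q."""
    return [mod_reduce_sum(x + y, q) for x, y in zip(a, b)]
```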

Discrete Gaussian Sampler
In ring-LWE cryptography, error polynomials sampled from a discrete Gaussian distribution $\chi_\sigma$ with standard deviation $\sigma$ are required. This distribution uses the probability function described as follows:
$$\Pr(E = i) = \frac{1}{S}\, e^{-i^2/(2\sigma^2)}$$
where $E$ is the random variable on $\mathbb{Z}$, $S$ is the normalization factor, and $i$ is an integer. By approximating $S$ as $\sigma\sqrt{2\pi}$, the probability function can also be described using
$$\Pr(E = i) \approx \frac{1}{\sigma\sqrt{2\pi}}\, e^{-i^2/(2\sigma^2)}$$
In References [5,14,15], the authors present several methods for performing discrete Gaussian sampling. Among these methods, rejection sampling and inversion sampling are the most popular [14].
In practice, rejection sampling for a discrete Gaussian distribution is slow because of the high rejection rate for sampled values that are far from the center of the distribution. The inversion method first generates a random probability and then selects a sample value such that the cumulative distribution up to that sample point is just larger than the randomly generated probability. Since the random probability must be of high precision, this method also requires a large number of random bits. The Knuth-Yao algorithm [14,15] uses a random walk model for sampling from any non-uniform distribution. However, the running time of the Knuth-Yao algorithm is unpredictable, so it is not a reliable sampler [5]. In this work, we deploy the linear feedback shift register (LFSR) approach proposed in [16]. This approach offers low complexity with an approximately uniform pseudo-random distribution; hence, it can be exploited to generate an accurate approximation of a Gaussian distribution with low maximum auto-correlation.
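The inversion method described above can be sketched in Python as follows. This illustrates inversion sampling only; it is not the LFSR-based sampler of [16] that is deployed in this work. As assumptions of the sketch, the distribution is truncated at `tail * sigma`, probabilities are held in double-precision floats (a real design would need the high precision noted above), and the names `build_cdf` and `sample` are hypothetical.

```python
import bisect
import math
import random


def build_cdf(sigma, tail=10):
    """Cumulative table of Pr(E = i) proportional to exp(-i^2 / (2 sigma^2)), |i| <= tail * sigma."""
    bound = math.ceil(tail * sigma)
    support = list(range(-bound, bound + 1))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in support]
    total = sum(weights)  # truncated normalization factor S
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(min(acc, 1.0))  # clamp floating-point overshoot
    cdf[-1] = 1.0
    return support, cdf


def sample(support, cdf, rng=random):
    """Draw u in [0, 1) and return the first value whose cumulative probability exceeds u."""
    u = rng.random()
    return support[bisect.bisect_right(cdf, u)]
```

A typical use is `support, cdf = build_cdf(4.5)` followed by repeated calls to `sample(support, cdf)`; the value 4.5 is an arbitrary example, not a parameter taken from this paper.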

Proposed Algorithm for the Ring-LWE Cryptoprocessor
The proposed ring-LWE cryptography algorithm based on shared arithmetic components is described in Algorithm 1. Two additional parameters, enc and dec, are used to control the encryption and decryption operations. The encryption operation is enabled when enc = 1, and multiplier Mult2 and adder Add3 participate in the encryption phase. The input message $m$ is encoded to obtain the polynomial $m_e$. This encoded message is then added to the error polynomial $e_3$ and stored in $c_{21}$. The polynomial multiplications $c_{10} = a \times e_1$ and $c_{20} = p \times e_1$ are calculated by Mult1 and Mult2, respectively. Ciphertext $c_1$ is computed by adding the two polynomials $c_{10}$ and $e_2$ using the polynomial adder Add2, while $c_2 = c_{20} + c_{21}$ is computed by adder Add3. The ciphertext $(c_1, c_2)$ is then output.

Algorithm 1: Proposed ring-LWE cryptography algorithm using shared arithmetic components
Input: $a, p \in \mathbb{Z}_q^n$, $m \in \{0, 1\}^n$, $\omega \in \mathbb{Z}_q$
Output: Ciphertext $(c_1, c_2) \in \mathbb{Z}_q^n$, or original message $m$
$r_1, r_2 \leftarrow$ Gaussian sampler(s, n);
$e_1, e_2, e_3 \leftarrow$ Gaussian sampler(s, n);

The decryption operation is enabled by the signal d_en. In this phase, multiplier Mult2 and adder Add3, which also participate in the encryption phase, are reconfigured to perform the decryption operations over the ring. Specifically, multiplier Mult2 multiplies the ciphertext $c_1$ by the polynomial $r_2$. The result of this multiplication is then transferred to adder Add3, where it is added to the ciphertext polynomial $c_2$ to return the pre-decoded polynomial $m_d = c_1 \times r_2 + c_2$. Finally, the original message $m$ is recovered from the pre-decoded polynomial $m_d$ using a decoder. If the $i$th coefficient of $m_d$ satisfies the condition $q/4 \le m_d[i] \le 3q/4$, the corresponding coefficient $m[i]$ of the message is decoded to 1; otherwise, it is decoded to 0.
The detailed timing diagram of the proposed ring-LWE cryptoprocessor is shown in Figure 2. When dec = 1, the polynomial multiplication $m_{d1} = c_1 \times r_2$ is computed using the same multiplier Mult2 that is used in the encryption phase. Similarly, adder Add3 is reused to compute $m_d$ as the sum of the polynomial $m_{d1}$ and the ciphertext $c_2$. Finally, the original message $m$ is retrieved by decoding $m_d$.
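The data flow described in this section, with Mult2 and Add3 serving both phases, can be modeled by the following Python sketch. It is a behavioral illustration under stated assumptions, not the authors' implementation: the public key is assumed to be $p = r_1 - a \times r_2$, a schoolbook multiplication modulo $x^n + 1$ stands in for the NTT multiplier, a rounded and clipped Gaussian stands in for the paper's sampler, and the class `SharedCore` and all function names are hypothetical. The enc and dec controls are modeled simply as two separate functions.

```python
import random

Q = 12289  # modulus used in this paper


def poly_mul(a, b, q=Q):
    """Schoolbook product in Z_q[x]/(x^n + 1); stand-in for the NTT multiplier."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:  # wrap around using x^n = -1
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c


def poly_add(a, b, q=Q):
    return [(x + y) % q for x, y in zip(a, b)]


def sample_error(n, rng, sigma=2.0, clip=6):
    """Rounded, clipped Gaussian coefficients mod Q; stand-in for the paper's sampler."""
    return [max(-clip, min(clip, round(rng.gauss(0.0, sigma)))) % Q for _ in range(n)]


class SharedCore:
    """Models Mult2 and Add3, which are used by both encryption and decryption."""

    def __init__(self):
        self.mult2_calls = 0
        self.add3_calls = 0

    def mult2(self, x, y):
        self.mult2_calls += 1
        return poly_mul(x, y)

    def add3(self, x, y):
        self.add3_calls += 1
        return poly_add(x, y)


def keygen(n, rng):
    a = [rng.randrange(Q) for _ in range(n)]
    r1, r2 = sample_error(n, rng), sample_error(n, rng)
    a_r2 = poly_mul(a, r2)
    p = [(x - y) % Q for x, y in zip(r1, a_r2)]  # assumed relation p = r1 - a * r2
    return (a, p), r2


def encrypt(core, a, p, m, rng):
    n = len(m)
    e1, e2, e3 = (sample_error(n, rng) for _ in range(3))
    m_e = [(Q - 1) // 2 if bit else 0 for bit in m]  # encoder
    c21 = poly_add(m_e, e3)       # Add1
    c10 = poly_mul(a, e1)         # Mult1
    c20 = core.mult2(p, e1)       # Mult2 (shared)
    c1 = poly_add(c10, e2)        # Add2
    c2 = core.add3(c20, c21)      # Add3 (shared)
    return c1, c2


def decrypt(core, c1, c2, r2):
    m_d1 = core.mult2(c1, r2)     # Mult2 reused
    m_d = core.add3(m_d1, c2)     # Add3 reused
    return [1 if Q / 4 <= x <= 3 * Q / 4 else 0 for x in m_d]  # decoder
```

With a small toy size such as n = 16, a key pair from `keygen`, and one `SharedCore` instance passed to both `encrypt` and `decrypt`, the decrypted bits match the input message, and the call counters show that the same multiplier and adder objects served both phases.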

Ring-LWE Cryptoprocessor Architecture Using Shared Arithmetic Components
The proposed ring-LWE cryptoprocessor architecture using a shared NTT polynomial multiplier and polynomial adder is illustrated in Figure 3. It consists of an encoder, a Gaussian sampler, polynomial adders, polynomial multipliers, a decoder, and a controller. As can be seen, Multiplier 2 (Mult2) and Adder 3 (Add3) participate in both the encryption and decryption operations. The detailed architecture is described in Figure 4.
The encryption operation computes the ciphertext $(c_1, c_2)$ and is enabled by the control signal enc. Initially, the input message $m$ is encoded using an encoder. Each bit of the message $m$ serves as the control signal of a corresponding MUX whose inputs are 0 and $(q - 1)/2$, and the encoded message $m_e$ is formed from the outputs of these MUXs. The encoded message is then added to the error polynomial $e_3$ using adder Add1, controlled by signals a1_e and a1_d, to obtain $(m_e + e_3)$. Multiplier Mult1 computes the product of the polynomial $a$ and the error polynomial $e_1$, while multiplier Mult2 computes the product of the public key $p$ and $e_1$. These multipliers are triggered by the control signals m1_e and m2_e, respectively. The NTT multiplier architectures used in the proposed design are discussed in the following part. The two control signals m1_d and m2_d are set to 1 to indicate that the multiplications at Mult1 and Mult2 have completed.
The output of multiplier Mult1 becomes the input of adder Add2, where it is added to the error polynomial $e_2$ to form the ciphertext $c_1$. Concurrently, adder Add3 adds the output of multiplier Mult2 to the polynomial $(e_3 + m_e)$ to generate the ciphertext $c_2$. When the control signals a2_d and a3_d are both equal to 1, the encryption operation is complete and the ciphertext $(c_1, c_2)$ is output.

Proposed NTT Polynomial Multiplier Using MDF Architecture
To reduce the computation time and the complexity of ring-LWE cryptoprocessors, a novel NTT polynomial multiplier using an MDF architecture is proposed. The MDF architecture can provide a higher throughput rate with minimal hardware cost by combining the features of MDC and SDF. In the MDF architecture, the SDF architecture is extended using a multi-path approach. To achieve a higher throughput rate, the number of data paths can be increased to eight or even sixteen.
Theoretically, an NTT-based polynomial multiplier consists of three bit-reverse processes, two NTT processes, one point-wise multiplication, and one INTT process. By using the reverse Cooley-Tukey algorithm [17] in the NTT-based polynomial multiplication, the three bit-reverse operations are eliminated, as described in Figure 5a. Therefore, the computation time and hardware complexity are greatly reduced. In addition, the two NTT calculations for the input polynomials are executed concurrently to reduce the multiplication latency. Pipelining is also applied between all stages of the NTT multiplier to shorten the critical path delay.
In this work, an NTT multiplier based on an 8-parallel MDF architecture is deployed. This multiplier is employed in the ring-LWE cryptoprocessor to perform the encryption and decryption operations. The detailed architecture of the NTT core using the 8-parallel MDF architecture is illustrated in Figure 5b. In Figure 5b, the 512 coefficients of an input polynomial are divided into eight parallel paths, each consisting of 64 coefficients. For example, path $a_1$ of polynomial $a(x)$ in Figure 5b consists of coefficients $a_1$, $a_9$, $a_{17}$, and so on. The two input polynomials $a(x)$ and $b(x)$ are processed using the NTT architecture. After the NTTs of $a(x)$ and $b(x)$ are obtained, the point-wise multiplication is calculated, followed by the INTT to return the result of the multiplication. The architecture of the INTT core is similar to that of the NTT core, which consists of processing elements PE1 and PE2, as presented in Figure 5b. The proposed PE1 and PE2 architectures for the NTT core are detailed in Figure 6.
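The division of the 512 input coefficients into eight interleaved paths can be illustrated with a small Python sketch. It only models the data ordering stated above (path 1 holds $a_1$, $a_9$, $a_{17}$, and so on, with 1-based labels); the function name is hypothetical, and the processing elements and delay feedback of Figure 5b are not modeled.

```python
def split_into_paths(coeffs, num_paths=8):
    """Path k (0-based) receives coefficients k, k + num_paths, k + 2 * num_paths, ..."""
    if len(coeffs) % num_paths != 0:
        raise ValueError("length must be a multiple of num_paths")
    return [coeffs[k::num_paths] for k in range(num_paths)]


# Label the coefficients 1..512, as in the a_1, a_9, a_17 example of the text.
labels = list(range(1, 513))
paths = split_into_paths(labels)
first_path_head = paths[0][:3]  # labels of the first three coefficients on path 1
```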

Implementation Results and Comparison
The proposed architecture for the ring-LWE cryptoprocessor is modeled in Verilog HDL, synthesized, and implemented using Xilinx VIVADO 2017.4 on a Virtex-7 FPGA platform. The implementation results of the ring-LWE cryptoprocessor are summarized in Table 1. As can be seen from Table 1, the proposed architecture requires fewer hardware resources, measured in the number of slices and LUTs, to perform a complete ring-LWE encryption-decryption operation than the similar parallel multiplier-based ring-LWE architectures presented in [5,6]. Specifically, to perform ring-LWE encryption and decryption, the proposed architecture uses only 23,707 slices and 61,258 LUTs, which are about 42% and 67.83% of those used in [5], respectively. Additionally, the encryption and decryption throughputs of the proposed architecture are higher than those of the architecture in [6] and the R8M architecture in [5]. The architectures in [18,19] require a small number of slices and LUTs to perform encryption and decryption; however, they have high latency for completing ring-LWE encryption and decryption operations. Therefore, the throughput offered by these architectures is smaller than 150 Mbps, as described in Table 1.
The efficiency metric presented in [20] can be used to evaluate the performance of different designs on various FPGA platforms. In detail, efficiency represents the throughput that one hardware unit (LUT) of an architecture can offer. As can be seen from Table 1, the proposed ring-LWE cryptoprocessor architecture offers better efficiency than the other architectures. Specifically, the efficiency of the proposed architecture is about two and seven times larger than that of the architectures in [5] and [18], respectively. A comparison of encryption time, decryption time, and efficiency is given in Figure 7.
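Read this way, the efficiency metric can be written as a simple ratio. The form below is an assumption based on the description above (throughput offered per LUT); the exact units and whether encryption or decryption throughput is used are as defined in [20] and Table 1.

$$
\text{Efficiency} = \frac{\text{Throughput}}{\text{Number of LUTs}}
$$

With throughput expressed in Mbps, the efficiency is in Mbps per LUT, so a design improves on this metric either by raising throughput at a fixed LUT count or by delivering the same throughput with fewer LUTs.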

Conclusions
A novel approach to improving the encryption and decryption operations of ring-LWE cryptoprocessors has been presented in this paper. By sharing hardware resources, namely the polynomial multiplier and the polynomial adder, between the encryption and decryption phases, the hardware complexity of the proposed architecture is significantly reduced. Furthermore, polynomial multiplication is greatly accelerated by deploying an efficient NTT multiplier using an MDF architecture and pipelining all stages of the NTT multiplier. As a result, the proposed ring-LWE cryptoprocessor offers higher throughput and efficiency than existing works. Therefore, it can be applied in hardware resource-limited systems that require high throughput and low latency.

Conflicts of Interest:
The authors declare no conflict of interest.