3.2. Number Theoretic Transform
In this paper, we optimize the modular reduction for a high-speed implementation of the NTT computation. We chose the 13-bit prime 7681 and the 14-bit prime 12,289 (i.e., 0x1E01 and 0x3001 in hexadecimal representation) as the target parameters, which are widely used in previous works [28,29,30].
The modular reduction can be implemented using the bit-shift and add technique (i.e., SAMS2) or the Montgomery reduction covered in previous works [28,29]. These methods can be accelerated further by the optimized Look-Up Table (LUT) access-based fast reduction technique for mod 7681 and mod 12,289 operations presented at ICISC'17 [30]. The main idea of the LUT-based approach is to first reduce the result 8 bits at a time by using pre-computed reduced values. Afterward, short fast reduction steps are performed on the shortened coefficients. The results are kept in an incomplete representation in order to reduce the number of subtraction operations in the reduction step. This is the well-known lazy reduction technique, in which the complete reduction is performed only in the last step. In this paper, we further optimize the LUT-based approach with novel combined (i.e., well-aligned) LUT techniques.
When the target prime modulus is 7681, each operand lies in the range 0~$2^{14}-1$. The intermediate result of the partial products (i.e., the four bytes held in R18, R19, R20, and R21 in Algorithm 2; see Figure 1) lies in the range 0~$2^{28}-1$. Two LUTs pre-computed modulo 7681 are constructed. Each LUT receives an 8-bit input. The first input consists of the 17th~24th bits of the product (i.e., R20; see Figure 1). The LUT maps this 8-bit input to a 13-bit result (below 7681). The output is added to the lower 16 bits of the intermediate result (i.e., R18 and R19), which may generate a 17-bit intermediate result (i.e., the result of Step 2 in Figure 1; the addition of a 13-bit value and a 16-bit value). Afterward, the two parts passed to the second LUT come from different variables: the 14th~17th bits are taken from the result of Step 2 in Figure 1, while the 25th~28th bits are taken from the input variable R21. (The two LUTs require only 1 KB ($2 \times 256 \times 2$ bytes). Both LUTs are stored in the FLASH memory of the target 8-bit AVR microcontrollers, which serves as read-only storage at run time. The FLASH memory size ranges from 128 to 384 KB, depending on the microcontroller, so the storage for the LUTs (1~1.5 KB) is negligible on the target processors.) The output of the second LUT is again a 13-bit result (below 7681), which is added to the remaining 13-bit intermediate result (i.e., the masked R18 and R19). The addition outputs a 14-bit result. Previous work requires a final reduction after the two LUT accesses, while the proposed method terminates the computation directly after the two LUT accesses. Removing the final reduction step further improves the performance.
The detailed modular reduction is given in Figure 1. The intermediate result of the product is kept in four registers (R18, R19, R20, and R21 in Algorithm 2). Different colors represent different registers, where each register is 8 bits long. Colored and white blocks represent bits that are used and not used in the computation, respectively. The proposed reduction on 7681 proceeds as follows:
Using the register names of Algorithm 2, the steps are as follows. First, the LUT access with the variable R20 is executed. This operation outputs a 13-bit result (i.e., R22 and R23). Afterward, the outputs (i.e., R22 and R23) are added to the intermediate results (i.e., R18 and R19). The addition of the 16-bit and 13-bit operands outputs a 17-bit result. Then, the lower 13 bits are extracted from the intermediate result (i.e., R18 and the lower 5 bits of R19) and kept in R18 and R19. The highest limb (R21) and the remaining 4-bit intermediate result (i.e., the 14th~17th bits) are combined to generate an 8-bit value. Second, the LUT access with the generated 8-bit input is performed. This operation generates a 13-bit result (i.e., R24 and R25). Finally, the intermediate results and the LUT outputs (i.e., R24 and R25) are added together. This may generate a final 14-bit result.
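The two-LUT reduction described above can be modelled at the integer level. The following is a minimal Python sketch, not the authors' implementation: the names `Q`, `LUT1`, `LUT2`, and `lut_reduce_7681` are hypothetical, the table contents are an assumption derived from the description (each entry is the value of the indexed bits reduced modulo 7681), and bit positions in the comments are 0-indexed.

```python
Q = 7681

# Assumed LUT1 contents: index = bits 16..23 of the product, entry = (index * 2^16) mod Q.
LUT1 = [(x << 16) % Q for x in range(256)]

# Assumed LUT2 contents: high nibble of the index = bits 13..16 of the Step 2 sum,
# low nibble = bits 24..27 of the product.
LUT2 = [(((i >> 4) << 13) + ((i & 0x0F) << 24)) % Q for i in range(256)]


def lut_reduce_7681(p):
    """Lazily reduce a 28-bit product to a 14-bit value congruent to p mod 7681."""
    assert 0 <= p < (1 << 28)
    low16 = p & 0xFFFF            # two lowest bytes of the product
    byte2 = (p >> 16) & 0xFF      # input of the first LUT
    top4 = p >> 24                # highest limb (4 bits)
    s = low16 + LUT1[byte2]       # 16-bit + 13-bit, at most 17 bits
    low13 = s & 0x1FFF            # bits that need no further reduction
    mid4 = s >> 13                # bits 13..16 of the sum
    return low13 + LUT2[(mid4 << 4) | top4]


if __name__ == "__main__":
    p = 12345 * 9876
    r = lut_reduce_7681(p)
    print(r, r % Q == p % Q)
```

The returned value is only partially reduced (below $2^{14}$), which matches the incomplete representation used for lazy reduction; a complete reduction would still be applied once at the very end.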
Algorithm 2 describes the LUT-based modular reduction at the source-code level. First, in Steps 1~13, the MOV-and-ADD multiplication technique is used to perform the 16-bit multiplication. The 28-bit intermediate result is stored in four 8-bit registers (R18, R19, R20, R21). Afterward, the LUT-based reduction operation is performed. The LUT input and output are 8 bits and 16 bits long, respectively.
Algorithm 2 LUT-based modular reduction in source code (mod 7681)
Input: operands R22, R23, R24, R25
Output: results {R24, R25}
1: CLR R26 | {MOV-and-ADD; R26 is the zero register for carry propagation}
2: MUL R24, R22 | {low byte × low byte}
3: MOVW R18, R0
4: MUL R25, R23 | {high byte × high byte}
5: MOVW R20, R0
6: MUL R24, R23 | {first cross product}
7: ADD R19, R0
8: ADC R20, R1
9: ADC R21, R26
10: MUL R25, R22 | {second cross product}
11: ADD R19, R0
12: ADC R20, R1
13: ADC R21, R26
14: MOV R30, R20 | {17th~24th bits become the LUT1 offset}
15: LDI R31, hi8(LUT1_L) | {LUT access}
16: LPM R22, Z
17: LDI R31, hi8(LUT1_H)
18: LPM R23, Z
19: ADD R18, R22
20: ADC R19, R23
21: ADC R20, R20 | {Register re-use; carry lands in bit 0 of R20}
22: MOV R30, R19
23: ANDI R19, 0x1F | {keep the lower 13 bits in R19:R18}
24: LSR R20 | {carry bit back into the carry flag}
25: ROR R30 | {14th~17th bits move to the upper nibble}
26: ANDI R30, 0xF0
27: ADD R30, R21 | {25th~28th bits fill the lower nibble}
28: LDI R31, hi8(LUT2_L) | {LUT access}
29: LPM R24, Z
30: LDI R31, hi8(LUT2_H)
31: LPM R25, Z
32: ADD R24, R18
33: ADC R25, R19
34: CLR R1 | {restore the zero register clobbered by MUL}
Algorithm 1 is fully implemented in assembly language for a high-speed implementation of the NTT. It would be possible to implement each operation as an independent function call, which is convenient for program maintenance. However, this approach has performance disadvantages. First, each operation requires a function call routine, which involves stack management and a program jump, followed by a return after the function completes. Second, all variables must be kept in memory, which adds load and store overhead to every variable access. By implementing the whole routine in assembly, we significantly reduce the number of function calls and memory accesses.
When the target prime modulus is 12,289, each operand lies in the range 0~$2^{15}-1$. The intermediate result of the partial products lies in the range 0~$2^{30}-1$. Two LUTs pre-computed modulo 12,289 are constructed. The first LUT receives an 8-bit input and the second LUT receives a 9-bit input. The first input consists of the 17th~24th bits. The LUT maps this 8-bit input to a 14-bit result (below 12,289). The output is added to the intermediate result (i.e., R18 and R19 in Algorithm 3; see Figure 2), which may generate a 17-bit intermediate result (i.e., Step 2 of Figure 2; the addition of a 14-bit value and a 16-bit value). Afterward, the two parts passed to the second LUT come from different variables: the 15th~17th bits are taken from the result of Step 2 in Figure 2, while the 25th~30th bits are taken from the input variable R21. (Both LUTs require only 1.5 KB ($256 \times 2 + 512 \times 2$ bytes) of memory space and are stored in the FLASH memory.) The output of the second LUT is a 14-bit result (below 12,289), which is added to the remaining intermediate result (i.e., the masked R18 and R19; see Figure 2). The addition outputs a 15-bit result.
The detailed modular reduction on 12,289 is given in Figure 2, and the proposed reduction on 12,289 proceeds as follows:
Using the register names of Algorithm 3, the steps are as follows. First, the LUT access with the variable R20 is executed. This operation outputs a 14-bit result (i.e., R22 and R23). Afterward, the outputs (i.e., R22 and R23) are added to the intermediate results (i.e., R18 and R19). The addition of the 16-bit and 14-bit operands outputs a 17-bit result. Then, the lower 14 bits are extracted from the intermediate result (i.e., R18 and the lower 6 bits of R19) and kept in R18 and R19. The highest limb (R21) and the remaining 3-bit intermediate result (i.e., the 15th~17th bits) are combined to generate a 9-bit value. Second, the LUT access with the generated 9-bit input is performed. This operation generates a 14-bit result (i.e., R24 and R25). Finally, the intermediate results and the LUT outputs (i.e., R24 and R25) are added together. This may generate a final 15-bit result.
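The difference from the 7681 case is the 9-bit index of the second LUT. The following is a minimal Python sketch of the steps above under stated assumptions: the names `Q`, `LUT1`, `LUT2`, and `lut_reduce_12289` are hypothetical, the table contents are inferred from the description rather than taken from the original source code, and bit positions in the comments are 0-indexed.

```python
Q = 12289

# Assumed LUT1 contents: index = bits 16..23 of the product.
LUT1 = [(x << 16) % Q for x in range(256)]

# Assumed LUT2 contents with a 9-bit index: the top 3 bits are bits 14..16 of the
# Step 2 sum, the low 6 bits are bits 24..29 of the product.
LUT2 = [(((i >> 6) << 14) + ((i & 0x3F) << 24)) % Q for i in range(512)]


def lut_reduce_12289(p):
    """Lazily reduce a 30-bit product to a 15-bit value congruent to p mod 12289."""
    assert 0 <= p < (1 << 30)
    s = (p & 0xFFFF) + LUT1[(p >> 16) & 0xFF]   # 16-bit + 14-bit, at most 17 bits
    index = ((s >> 14) << 6) | (p >> 24)        # 3 bits from the sum, 6 bits from the top limb
    return (s & 0x3FFF) + LUT2[index]           # 14-bit + 14-bit, at most 15 bits


if __name__ == "__main__":
    p = 30000 * 31000
    r = lut_reduce_12289(p)
    print(r, r % Q == p % Q)
```

With 512 two-byte entries in the second table and 256 in the first, this layout accounts for the 1.5 KB of LUT storage.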
The detailed source code for LUT-based fast modular reduction is given in Algorithm 3.
Algorithm 3 LUT-based modular reduction in source code (mod 12,289)
Input: operands R22, R23, R24, R25
Output: results {R24, R25}
1: CLR R26 | {MOV-and-ADD}
2: MUL R24, R22
3: MOVW R18, R0
4: MUL R25, R23
5: MOVW R20, R0
6: MUL R24, R23
7: ADD R19, R0
8: ADC R20, R1
9: ADC R21, R26
10: MUL R25, R22
11: ADD R19, R0
12: ADC R20, R1
13: ADC R21, R26
14: MOV R30, R20
15: LDI R31, hi8(LUT1_L) | {LUT access}
16: LPM R22, Z
17: LDI R31, hi8(LUT1_H)
18: LPM R23, Z
19: ADD R18, R22
20: ADC R19, R23
21: ADC R20, R20 | {Register re-use}
22: MOV R30, R19
23: ANDI R19, 0x3F
24: ANDI R20, 0x01
25: ANDI R30, 0xC0
26: ADD R30, R21
27: LDI R31, hi8(LUT2_L) | {LUT access}
28: ADD R31, R20
29: LPM R24, Z
30: LDI R31, hi8(LUT2_H)
31: ADD R31, R20
32: LPM R25, Z
33: ADD R24, R18
34: ADC R25, R19
35: CLR R1
In Steps 1~13, two 15-bit coefficients are multiplied, producing a 30-bit intermediate result, which is stored in four 8-bit registers (R18, R19, R20, R21). After the multiplication, the modular reduction is performed. The first LUT receives an 8-bit input and generates a 16-bit output.
In Steps 14~15, the 17th~24th bits (R20) are assigned to the lower 8-bit address (R30), and the higher 8-bit address of LUT1_L is assigned to the register R31. In Step 16, the FLASH memory access to the first LUT is performed with the LPM instruction, which consumes 3 clock cycles per byte. In Steps 17~18, the higher part of LUT1 (i.e., LUT1_H) is loaded. This is a separate access to an aligned memory address. In Steps 19~21, the output of LUT1 and the intermediate result are added; the carry bit generated in Step 20 is stored in the register R20. Thereafter, in Steps 22~25, the two intermediate results are concatenated. In Steps 26~32, the LUT2 access is performed with the aligned memory access method. Finally, the reduced results and the intermediate results are added together. This approach ensures 15-bit intermediate results.
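To make the register flow easier to follow, the walk-through above can be emulated byte by byte. The following Python sketch mirrors Steps 14~34 of Algorithm 3 on 8-bit values; it is an illustration under stated assumptions, not the original code. The table contents (`LUT1_L`, `LUT1_H`, `LUT2_L`, `LUT2_H`) are inferred from the description, each half of LUT2 is assumed to span two consecutive 256-byte pages so that adding the carry in R20 to R31 selects the second page, and the multiplication of Steps 1~13 is replaced by Python's integer product.

```python
Q = 12289


def _lut2_entry(i):
    # 9-bit index: top 3 bits weigh 2^14, low 6 bits weigh 2^24 (assumed layout).
    return (((i >> 6) << 14) + ((i & 0x3F) << 24)) % Q


LUT1_L = [((x << 16) % Q) & 0xFF for x in range(256)]
LUT1_H = [((x << 16) % Q) >> 8 for x in range(256)]
LUT2_L = [_lut2_entry(i) & 0xFF for i in range(512)]
LUT2_H = [_lut2_entry(i) >> 8 for i in range(512)]


def mul_reduce_12289(a, b):
    """Emulate Algorithm 3 for two 15-bit inputs; returns a 15-bit lazy result."""
    assert 0 <= a < (1 << 15) and 0 <= b < (1 << 15)
    p = a * b                                   # Steps 1-13 (product in R21:R20:R19:R18)
    r18, r19 = p & 0xFF, (p >> 8) & 0xFF
    r20, r21 = (p >> 16) & 0xFF, p >> 24
    r22, r23 = LUT1_L[r20], LUT1_H[r20]         # Steps 14-18: split LUT1 access
    t = r18 + r22                               # Step 19: ADD
    r18, carry = t & 0xFF, t >> 8
    t = r19 + r23 + carry                       # Step 20: ADC
    r19, carry = t & 0xFF, t >> 8
    r20 = (r20 + r20 + carry) & 0xFF            # Step 21: carry ends up in bit 0
    r30 = r19                                   # Step 22
    r19 &= 0x3F                                 # Step 23: keep the lower 14 bits
    r20 &= 0x01                                 # Step 24: isolate the carry bit
    r30 &= 0xC0                                 # Step 25: 15th and 16th bits
    r30 = (r30 + r21) & 0xFF                    # Step 26: append the top 6 bits
    index = (r20 << 8) | r30                    # Steps 27-32: R20 selects the page
    r24, r25 = LUT2_L[index], LUT2_H[index]
    t = r18 + r24                               # Step 33: ADD
    r24, carry = t & 0xFF, t >> 8
    r25 = (r19 + r25 + carry) & 0xFF            # Step 34: ADC
    return (r25 << 8) | r24


if __name__ == "__main__":
    r = mul_reduce_12289(12288, 12288)
    print(r, r % Q == (12288 * 12288) % Q)
```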
In order to accelerate the memory access, we exploit two optimized memory access techniques. The first technique is memory access in an aligned format. Because the offset is 8 bits long, the higher 8-bit address remains constant, and only the lower byte is updated with the offset to access the memory space. The detailed descriptions are as follows (in these descriptions, R1, R24, R25, R30, R31, R26, and Z denote the zero value, first input value, second input value, lower part of the memory address, higher part of the memory address, result, and Z pointer, respectively).
Initialization of 8-bit aligned access is
The second step of 8-bit aligned access is
The second approach is a separated memory access for 16-bit LUT outputs. The LUTs for the target moduli output 13-bit and 14-bit results for 7681 and 12,289, respectively, so each entry occupies 2 bytes. Accessing such entries directly requires 2-byte aligned offsets, which means 9-bit offsets. In this access pattern, the aligned memory access is not feasible, since the offset size increases from 8 bits to 9 bits. In order to resolve this issue, we separate each 16-bit LUT output into two 8-bit parts. The first output is the lower 8-bit result and the second output is the higher 8-bit result. The detailed method is described in Figure 3. Unlike the previous LUT construction, two separate LUTs are constructed. Under this LUT setting, the aligned memory access can be performed efficiently.
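The effect of splitting the table can be illustrated with a small model of byte-addressed memory. In the following Python sketch, the base addresses `LUT_L_BASE` and `LUT_H_BASE`, the `FLASH` array, and the `lpm` helper are hypothetical stand-ins chosen for illustration; the only property that matters is that each 256-byte half starts on a 256-byte boundary, so the high address byte is a constant and the 8-bit index is used directly as the low address byte.

```python
Q = 7681
table = [(x << 16) % Q for x in range(256)]     # 16-bit entries (assumed contents)

# Interleaved layout: entry i lives at byte offsets 2*i and 2*i + 1 (9-bit offsets).
interleaved = bytearray()
for value in table:
    interleaved += value.to_bytes(2, "little")

# Split layout: one table of low bytes and one table of high bytes,
# each placed on its own 256-byte boundary (hypothetical addresses).
FLASH = bytearray(0x2000)
LUT_L_BASE, LUT_H_BASE = 0x1000, 0x1100
FLASH[LUT_L_BASE:LUT_L_BASE + 256] = bytes(v & 0xFF for v in table)
FLASH[LUT_H_BASE:LUT_H_BASE + 256] = bytes(v >> 8 for v in table)


def lpm(r31, r30):
    """Model of a byte load through the Z pointer (R31:R30)."""
    return FLASH[(r31 << 8) | r30]


def read_interleaved(i):
    offset = 2 * i                               # needs a shift and a 9-bit offset
    return interleaved[offset] | (interleaved[offset + 1] << 8)


def read_split(i):
    lo = lpm(LUT_L_BASE >> 8, i)                 # index is the low address byte as-is
    hi = lpm(LUT_H_BASE >> 8, i)                 # only the high address byte changes
    return lo | (hi << 8)


if __name__ == "__main__":
    print(read_split(200) == read_interleaved(200) == table[200])
```

Both layouts occupy the same number of bytes; the split layout only changes where the two halves of each entry are stored.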
The proposed modular reduction method is generic and applies to the primes used in lattice-based schemes, so it can be extended to other primes without difficulty. In particular, the proposed method works for lattice-based NIST PQC candidates, such as NewHope and CRYSTALS-KYBER [33,34].
3.2.1. Discrete Gaussian Sampling
Discrete Gaussian sampling is an important part of Ring-LWE schemes. For fast sampling, we adopted the Knuth–Yao sampler with byte-scanning [28,35]. The byte-scanning method processes the value byte-wise rather than bit-wise. However, the original sampler is not secure against timing attacks and simple power analysis. The sampler performs a large part of the sampling with LUT accesses and terminates as soon as a proper result is obtained. For this reason, the timing is highly correlated with the input value (i.e., the secret value). In order to resolve this issue, random shuffling after random sampling is used [35]. The approach first generates all samples. Afterward, the samples are randomly shuffled with random numbers. The random shuffling technique efficiently removes the relation between the random samples and the timing information. The attack success ratio is reduced from 1 to
. However, this countermeasure is also vulnerable to sophisticated side channel attack [
36]. For this reason, the target applications of our approach are limited to simple IoT nodes that require only a basic security level.
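The sample-then-shuffle idea can be sketched as follows. This Python example is an assumption-laden illustration: it uses a generic Fisher–Yates shuffle, which may differ from the exact procedure in [35], the function name `shuffle_samples` is hypothetical, and the list of integers merely stands in for the output of the Knuth–Yao sampler, which is not reproduced here.

```python
import secrets


def shuffle_samples(samples, randbelow=secrets.randbelow):
    """Fisher-Yates shuffle applied after all samples have been generated.

    The permutation decouples the position of a sample from the time at
    which it was produced; it does not change the sampled values.
    """
    for i in range(len(samples) - 1, 0, -1):
        j = randbelow(i + 1)                 # uniform index in [0, i]
        samples[i], samples[j] = samples[j], samples[i]
    return samples


if __name__ == "__main__":
    batch = list(range(16))                  # stand-in for sampler output
    print(shuffle_samples(batch))
```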
3.2.2. AES-Based Pseudo Random Number Generator
Random number generation is closely tied to the security of Ring-LWE schemes. Previous Ring-LWE implementations adopted an AES-based Pseudo Random Number Generator (PRNG) (available at http://www.atmel.com/Images/article_random_number.pdf). The PRNG runs the AES block cipher in counter mode and uses the output as random numbers. The recent 8-bit AVR ATxmega128A1 microcontroller features an AES cryptography accelerator that performs AES-128 encryption with reasonably fast computation (about 375 clock cycles per 128-bit plaintext) and a small memory footprint for the AES data and control-flow management program. The hardware-assisted AES-based counter mode outperforms software-based implementations of the AES block cipher (about 3521 clock cycles per 128-bit plaintext).
Furthermore, the AES cryptography accelerator and the Arithmetic Logic Unit (ALU) of the microcontroller can operate independently on the target machine, which hides the latency of the AES encryption behind the arithmetic execution [28]. The hardware AES encryption proceeds as follows. First, the key is written to the 0x00C3 address. The plaintext is loaded to the 0x00C2 memory address. Afterward, the 0x00C0 memory address is set to the value 0x80 to start the AES-128 encryption. This operation takes only 375 clock cycles for a 128-bit plaintext, during which other operations can be performed simultaneously. The termination of the AES encryption is detected by checking the 0x00C1 memory address: when the value at this address is below 0x80, the encryption is completed.
However, the AES accelerator of the ATxmega128A1 supports only a 128-bit key, which is not sufficient for long-term security levels, such as 192-bit and 256-bit security. Nevertheless, previous works utilized the 128-bit AES hardware accelerator for a 256-bit scheme in order to achieve high performance at the cost of security [27,29,30]. In [28], the authors used only the software AES from the AVR Crypto Lib for the long-term security level (i.e., 256-bit security level). This AES-256 encryption requires 3521 clock cycles to encrypt a block under a 256-bit key (available at http://avrcryptolib.das-labor.org/trac). Unlike previous works, we adopted the most recent optimized implementation by [32]. The implementation exploits a feature of the counter (CTR) mode of the AES block cipher: during counter mode operation, only a small fraction of the input data is updated, while the remaining part is kept unchanged. The method first generates a cache table, which is then used to skip Round 0, Round 1, and part of Round 2. For the AES-256 case, the required number of clock cycles is 3184. This implementation is 9.5% faster than the previous AVR Crypto Lib implementation.
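The property that the cached implementation relies on, namely that consecutive counter-mode inputs differ in only a few bytes, can be checked with a short script. The following Python sketch does not reproduce the caching of [32] or perform any AES computation; the counter-block layout (a 12-byte nonce followed by a 32-bit big-endian counter) and the function names are assumptions made for illustration.

```python
def counter_blocks(nonce, start, count):
    """Yield 16-byte CTR input blocks: 12-byte nonce + 32-bit big-endian counter."""
    assert len(nonce) == 12
    for ctr in range(start, start + count):
        yield nonce + ctr.to_bytes(4, "big")


def changed_positions(prev, cur):
    """Byte positions in which two consecutive input blocks differ."""
    return [i for i in range(16) if prev[i] != cur[i]]


blocks = list(counter_blocks(bytes(12), 0, 512))
diffs = [changed_positions(a, b) for a, b in zip(blocks, blocks[1:])]
only_last_byte = sum(1 for d in diffs if d == [15])

if __name__ == "__main__":
    print(only_last_byte, "of", len(diffs), "transitions change only the last byte")
```

Because almost every transition touches a single input byte, the parts of the early rounds that depend only on the unchanged bytes can be computed once and reused, which is the intuition behind the cache table.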