Memory Efficient Implementation of Modular Multiplication for 32-bit ARM Cortex-M4

Abstract: In this paper, we present scalable multi-precision multiplication and squaring implementations for 32-bit ARM Cortex-M4 microcontrollers. For efficient computation and scalable functionality, we propose optimized Multiplication and ACcumulation (MAC) techniques for the target microcontrollers. First, we present a 64-bit wise MAC operation built on the Unsigned Long Multiply with Accumulate Accumulate (UMAAL) instruction. The MAC is used to perform column-wise multiplication/squaring (i.e., product-scanning) with general-purpose registers in an optimal way. Second, the squaring algorithm is further optimized through an efficient doubling routine together with an optimized product-scanning method. Finally, the proposed implementations achieve a very small memory footprint and high scalability, covering algorithms ranging from well-known public key cryptography (i.e., Rivest–Shamir–Adleman (RSA) and Elliptic Curve Cryptography (ECC)) to post-quantum cryptography (i.e., Supersingular Isogeny Key Encapsulation (SIKE)). All SIKE round 2 protocols were evaluated with the proposed modular reduction implementations. The results demonstrate that the scalable implementation can achieve the smallest code size together with a reasonable performance.


Introduction
The implementation of cryptographic algorithms on an embedded device is more challenging than on personal computers due to the limited resources (e.g., low frequency, basic instruction set, and small RAM and ROM). Thus, cryptographic implementors have to carefully redesign or make specific optimizations of existing algorithms to fit such scenarios. In general, a lightweight implementation of a cryptographic algorithm on embedded devices should satisfy the following implementation requirements [1][2][3]: achieving high performance, having small code size, and supplying scalability to an arbitrary length. In recent decades, a number of works have improved the implementations of public key cryptography on microcontrollers. One of the milestone works was carried out by Gura et al. [4], who proposed the hybrid-scanning based multiplication. Compact implementations of Rivest-Shamir-Adleman (RSA) and Elliptic Curve Cryptography (ECC) on 8-bit Alf and Vegard's RISC processor (AVR) embedded processors, such as TinyECC [5], Relic [6], Networking and Cryptography library (NaCl) [7], and Montgomery and Twisted Edward (MoTE)-ECC [8] were also investigated.
In very recent years, the Advanced RISC Machine (ARM) company released a low-cost 32-bit ARM processor, called ARM Cortex-M4, in response to customer requests. The key characteristics of ARM's Cortex-M4 are a lower cost with a higher productivity than others. The embedded processor has a significantly small chip area, low energy consumption, and an optimal code footprint. These advanced capabilities are able to achieve a high performance at a low price point for various IoT applications, such as smart metering, motor control, and domestic household appliances. Compared to implementations on typical 8-bit processors, the Cortex-M4 achieved much higher performance thanks to its advanced hardware architecture.
Groot provided a constant-time implementation of X25519 for the ARM Cortex-M4 architecture [9]. In particular, a reduced-radix representation (25-bit or 26-bit) and refined Karatsuba multiplication are used for the optimized implementation. Santis and Sigl implemented Curve25519 on ARM Cortex-M4 microcontrollers [10]. A two-level subtractive Karatsuba algorithm is utilized for large integer multiplication, which replaces one 256-bit multiplication with nine 64-bit multiplications. Fujii and Aranha implemented the integer multiplication with an operand-caching method and the Unsigned Long Multiply with Accumulate Accumulate (UMAAL) instruction [11]. This work fully utilized the UMAAL instruction to achieve high performance.
At CHES'19, the fastest implementation of Curve25519 was presented by Haase and Labrique [12]. The hand-optimized operand-scanning method is efficiently implemented on the ARM Cortex-M4 microcontroller, and register utilization is highly optimized. The first Supersingular Isogeny Key Encapsulation (SIKE) implementation was presented by Koppermann et al. [13]. The product-scanning and Karatsuba methods are utilized to improve the Supersingular Isogeny Diffie-Hellman key exchange (SIDH). However, the implementation of SIDHp751 requires 18 s to exchange the keys. In [14], integer multiplication is implemented with operand caching (using UMAAL) and pipeline-friendly programming. The results show that SIKEp434 requires only 1.5 s.
While previous software implementations have focused on improving performance, they have paid relatively less attention to code size. In practice, most of the flash memory is used for application programs, and only a small footprint of the flash memory is used for cryptographic implementation. Many low-end ARM Cortex-M4 boards (e.g., Teensy 3.0, XMC4100, MAX32660, M481, etc.) have only 128-256 KB Flash memory. Furthermore, Internet of Things (IoT) devices must communicate with other devices with different protocols. For the Public Key Cryptography (PKC) implementation, there is pre-quantum cryptography and post-quantum cryptography.
During the transition and migration from pre-quantum to post-quantum, IoT devices should support both PKC algorithms according to the National Institute of Standards and Technology (NIST) Post-Quantum Cryptography (PQC) committee (i.e., hybrid PKC protocol) [15]. There are also several security levels (e.g., 128-bit, 192-bit, and 256-bit security against brute-force attacks); thus, a number of PKC implementations should be considered. In particular, a lightweight PKC implementation can be achieved by optimized multi-precision multiplication and squaring implementations. These methods are important for the deployment of cryptography in practical applications.
In Table 1, multi-precision multiplication and squaring implementations are compared. The fastest performance is achieved by [14]. For the 256-bit ECC implementation (e.g., NIST P-256 and Curve25519), 772 bytes are required for the multiplication and squaring implementations. For the finite field operation, modular reduction is also needed, and with Montgomery reduction it is usually of a similar size to multiplication (i.e., 452 bytes). To support SIKEp751, an additional 5,768 bytes are used for multiplication and squaring. Compared with other approaches, the proposed method is highly memory-efficient, requiring only 584 bytes. Moreover, the implementation is parameterized and supports all ECC, RSA, and SIKE protocols. By supporting all PKC protocols, the hybrid PKC protocol can be utilized. For this reason, a memory-efficient implementation can be a practical solution, and only the proposed method satisfies this requirement.

Table 1. Comparison of multi-precision multiplication and squaring implementations on 32-bit ARM Cortex-M4 microcontrollers in terms of code size (in bytes), scalability, and highest speed. The Elliptic Curve Cryptography (ECC), Rivest-Shamir-Adleman (RSA), and Supersingular Isogeny Key Encapsulation (SIKE) symbols represent the targeted implementation with the modular arithmetic implementation. The letters U, L, and P indicate whether the works are implemented in unrolled, looped, or parameterized form.

Comparison with CANS'19 [14]
Previous work in CANS'19 [14] successfully evaluated the SIKE round 2 schemes on 32-bit ARM Cortex-M4 microcontrollers. In order to accelerate the execution timing, the modular multiplication and squaring operations are optimized for the specific parameters of SIKEp434, SIKEp503, and SIKEp751. Unlike previous works, we propose scalable modular multiplication and squaring. The proposed implementation reduces the code size significantly and supports all parameters, including RSA, ECC, and SIKE, in a single code base. Among the compared works, only the proposed implementation supports the RSA parameters, as described in Table 1. As a case study, we implemented the SIKE round 2 schemes on 32-bit ARM Cortex-M4 microcontrollers in Section 3. The proposed implementation obtained the smallest code size and reasonably fast execution timing. Previous work in CANS'19 [14] requires 44,688 bytes while the proposed implementation requires 28,816 bytes for SIKEp751, which is a code size reduction of 35.5%. Furthermore, we are the first to implement the SIKEp610 protocol on 32-bit ARM Cortex-M4 microcontrollers, so this work is the first to cover all SIKE protocols.

Research Contributions
In contrast to most of the previous implementations, which have focused primarily on the execution time, this study focused on optimizing the memory consumption of multi-precision multiplication implementation and multi-precision squaring implementation, without reducing the high performance. In particular, we present a MAC operation. The operation is used for the inner loop of target multiplication and squaring operations.
The proposed MAC routine fully employs general purpose registers and a 32 × 32 + 32 + 32 → 64-bit multiplier, also known as Unsigned Long Multiply with Accumulate Accumulate (UMAAL). Second, the squaring algorithm is further optimized through a dedicated doubling routine. Finally, the proposed implementations achieved a very small memory footprint and good scalability; thus, they can be used for RSA, ECC, and SIKE. We implemented all of the second round of SIKE protocols on 32-bit ARM Cortex-M4 microcontrollers. A reasonable execution timing can be achieved with the smallest code size.
The paper is organized as follows. In Section 2, previous implementations of multiplication/squaring, the target ARM Cortex-M4 microcontrollers, and the SIKE round 2 candidates are introduced. In Sections 2.4 and 2.5, implementations of the multi-precision multiplication method and the multi-precision squaring method are presented. In Section 3, a case study of the SIKE round 2 protocols is given. The proposed implementations are evaluated in Section 4. The conclusion is given in Section 5.

Multi-Precision Multiplication and Squaring
A straightforward method to execute large integer multiplication is row-wise multiplication (i.e., operand-scanning), which consists of a nested loop. In the inner loop, one digit of operand A (i.e., a[i]) is multiplied with all digits of the second operand B, while in the outer loop the pointer of operand A advances to the next digit of A. An alternative method is column-wise multiplication (i.e., product-scanning) [16]. The product-scanning method consists of two loops. The first loop computes the lower part of the result. Afterward, the second loop computes the upper part of the result. An important operation in each inner loop is the Multiply-ACcumulate (MAC) operation, which performs a computation of the form $c \leftarrow c + a[j] \times b[i-j]$. The intermediate results are not stored or loaded to/from the memory. Carry propagation is simply achieved by a register-copy operation. In recent years, many works have improved the product-scanning method by re-ordering the inner loop and using new instructions [4,17,18].
A squaring operation is a special case of multi-precision multiplication, where the multiplier and multiplicand are the identical operand (e.g., A × A). For multi-precision squaring, partial products of the form $a[i] \times a[j]$ appear twice for $i \neq j$ and just once for $i = j$. Thus, squaring requires fewer mul instructions or MAC operations than multiplication. Some previous implementations of squaring on embedded processors have been optimized based on this symmetry. The lazy-doubling method optimizes the doubling operation by performing the doubling column-wise [19]. Similarly, many works have focused on developing an optimized doubling method for target microcontrollers [20,21].

ARM Cortex-M4 Microcontroller
32-bit ARM Cortex-M4 embedded microprocessors are widely used low-end microcontrollers. The processor was selected as the benchmark target processor for NIST post-quantum cryptography, and a benchmark framework (pqm4) has been implemented on Cortex-M4 [22]. The embedded microprocessor supports an efficient multiplication instruction (i.e., UMAAL). The instruction multiplies two 32-bit values and adds two further 32-bit values to the product, producing a 64-bit result (32 × 32 + 32 + 32 → 64 bits).
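To illustrate why UMAAL suits multi-precision arithmetic, the following Python sketch models the arithmetic of the instruction on plain integers. The function name `umaal` and its argument order are illustrative choices, not an interface taken from this paper. The point of the sketch is that the 64-bit result can never overflow, so no carry flag is involved.

```python
MASK32 = (1 << 32) - 1

def umaal(rd_lo, rd_hi, rn, rm):
    """Model of UMAAL: RdHi:RdLo = Rn * Rm + RdLo + RdHi (all inputs 32-bit)."""
    total = rn * rm + rd_lo + rd_hi
    return total & MASK32, total >> 32   # (new RdLo, new RdHi)

# Worst case: every input is 2^32 - 1, and the sum is exactly 2^64 - 1.
lo, hi = umaal(MASK32, MASK32, MASK32, MASK32)
worst = (hi << 32) | lo
```

Since (2^32 − 1)^2 + 2·(2^32 − 1) = 2^64 − 1, the high word returned by one UMAAL can always be fed into the next one as an addend.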

SIKE Round 2
We introduce the SIKE standard and the key exchange protocol. For a more detailed treatment, we refer the reader to [23,24]. The SIKE standard applies the transformation of [25] to the supersingular isogeny Public Key Encryption (PKE) protocol [23]. The result is a Key Encapsulation Mechanism (KEM), which is secure against the static key vulnerability of the key exchange protocol [26]. The SIKE protocol is a NIST PQC round 2 candidate [27]. SIKE has relatively small public keys and ciphertexts.

Public Parameters
SIKE is defined over a prime of the form $p = \ell_A^{e_A} \cdot \ell_B^{e_B} \cdot f \pm 1$. For better performance, $\ell_A = 2$, $\ell_B = 3$, and $f = 1$ are fixed, so the prime of SIKE is $p = 2^{e_A} \cdot 3^{e_B} - 1$. The starting supersingular elliptic curve $E_0/\mathbb{F}_{p^2}$ and its associated base points are defined as public parameters.
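As a small worked example of this prime shape, the sketch below builds the four round 2 primes and checks their bit lengths. The exponent pairs are an assumption taken from the public SIKE round 2 specification rather than from the text above, and the helper names are illustrative.

```python
# Exponents (e_A, e_B) of the SIKE round 2 primes p = 2^e_A * 3^e_B - 1.
# Assumed from the public SIKE round 2 specification.
SIKE_EXPONENTS = {
    "SIKEp434": (216, 137),
    "SIKEp503": (250, 159),
    "SIKEp610": (305, 192),
    "SIKEp751": (372, 239),
}

def sike_prime(e_a, e_b):
    """Return p = 2^e_a * 3^e_b - 1."""
    return (1 << e_a) * 3**e_b - 1

def zero_words_of_p_plus_1(e_a, word_bits=32):
    """p + 1 = 2^e_a * 3^e_b, so its lowest e_a bits are zero."""
    return e_a // word_bits

bit_lengths = {name: sike_prime(*e).bit_length()
               for name, e in SIKE_EXPONENTS.items()}
```

For SIKEp434 the helper reports six all-zero 32-bit words at the bottom of p + 1, which is the structure a reduction routine can exploit to skip partial products.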

Key Encapsulation Mechanism
KEM consists of three parts: Alice's key generation, Bob's key encapsulation, and Alice's key decapsulation. Figure 1 describes the KEM in detail.


Key Generation
Alice chooses a random integer $sk_A \in \mathbb{Z}/2^{e_A}\mathbb{Z}$ and computes her public key $pk_A$ by applying a secret isogeny $\phi_A : E_0 \to E_A$. Moreover, Alice generates a t-bit (the implementation parameter defines the t value) random sequence $s \in_R \{0, 1\}^t$.

Encapsulation
Bob generates a t-bit random message $m \in_R \{0, 1\}^t$. Afterward, Bob concatenates the message with Alice's public key $pk_A$. Bob generates an $e_B$-bit hash result $r$ with the cSHAKE256 hash function $H_1$ and input $(m \,\|\, pk_A)$. With $r$, Bob applies a secret isogeny $\phi_B : E_0 \to E_B$ to the base points $\{P_A, Q_A\}$ and forms his public key $pk_B(r)$. Bob also computes the common j-invariant of the curve $E_{BA}$ by using another isogeny $\phi'_B : E_A \to E_{BA}$ with Alice's public key. Finally, Bob generates a ciphertext $c = (c_0, c_1)$ such that

$$c_0 = pk_B(r), \qquad c_1 = H_2(j(E_{BA})) \oplus m,$$

where $H_2$ is a cSHAKE256 hash function, which generates an arbitrary-length output with a defined initialization parameter. Lastly, Bob generates the shared secret $K = H_3(m \,\|\, c)$. Afterward, Bob sends the ciphertext $c$ to Alice.

Decapsulation
With the ciphertext $c$, Alice computes the common j-invariant of $E_{AB}$ by applying her secret isogeny to $E_B$. Alice computes $m' = c_1 \oplus H_2(j(E_{AB}))$ and $r' = H_1(m' \,\|\, pk_A)$. Lastly, Alice validates Bob's public key by computing $pk_B(r')$ and comparing the value with $c_0$. Alice generates the shared secret value $K = H_3(m' \,\|\, c)$ when the public key is correct. Otherwise, Alice generates a random value $K = H_3(c \,\|\, s)$ to be secure against active attacks.
In 2019, the candidates for the second round of the post-quantum cryptography competition were selected. The round 2 SIKE protocol [28] differs from the first version. The changes are outlined as follows:
• Two new parameter sets for NIST security levels 1 and 3 have been added (i.e., SIKEp434 and SIKEp610).
• One parameter set (i.e., SIKEp964) has been removed.
• The security categories for parameter sets have been adjusted upward (i.e., the NIST security levels of SIKEp503 and SIKEp751 are changed to 2 and 5, respectively).
• The starting curve has been changed.
• A public key compression has been implemented.
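The message flow of key generation, encapsulation, and decapsulation can be sketched in a few lines of Python. This is only a structural illustration under loud assumptions: the isogeny computations are swapped for toy Diffie-Hellman exponentiations in a small insecure group, SHAKE256 with a domain tag stands in for cSHAKE256, and all function names are hypothetical. It is not a SIKE implementation.

```python
import hashlib
import secrets

P, G = (1 << 127) - 1, 3   # toy group, NOT secure; stands in for isogeny walks
T = 16                     # message length in bytes (plays the role of t bits)

def H(tag, data, n):
    """SHAKE256 with a domain tag, standing in for cSHAKE256 H_1, H_2, H_3."""
    return hashlib.shake_256(tag + data).digest(n)

def enc(x):
    return x.to_bytes(16, "big")

def xor(x, y):
    return bytes(u ^ v for u, v in zip(x, y))

def derive_r(m, pk_a):
    return int.from_bytes(H(b"H1", m + enc(pk_a), 16), "big") % (P - 1)

def keygen():
    sk_a = secrets.randbelow(P - 2) + 1
    return sk_a, secrets.token_bytes(T), pow(G, sk_a, P)   # sk_A, s, pk_A

def encaps(pk_a):
    m = secrets.token_bytes(T)
    r = derive_r(m, pk_a)
    c0 = pow(G, r, P)                    # plays the role of pk_B(r)
    j = pow(pk_a, r, P)                  # plays the role of j(E_BA)
    c1 = xor(H(b"H2", enc(j), T), m)
    return (c0, c1), H(b"H3", m + enc(c0) + c1, 16)

def decaps(sk_a, s, pk_a, c):
    c0, c1 = c
    j = pow(c0, sk_a, P)                 # plays the role of j(E_AB)
    m = xor(c1, H(b"H2", enc(j), T))
    if pow(G, derive_r(m, pk_a), P) == c0:           # re-derive and compare with c0
        return H(b"H3", m + enc(c0) + c1, 16)        # K = H_3(m || c)
    return H(b"H3", enc(c0) + c1 + s, 16)            # K = H_3(c || s) on failure

sk_a, s, pk_a = keygen()
ct, key_bob = encaps(pk_a)
key_alice = decaps(sk_a, s, pk_a, ct)
```

A tampered ciphertext fails the re-encryption check, so Alice falls back to the key derived from her secret sequence s and the two parties do not share a key. A real implementation would also make the comparison constant-time.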

Multi-Precision Multiplication
For both outer and inner loops, the product-scanning method is utilized. In Figure 2, detailed descriptions are given. The left part is for the outer loop, and the right part is for the inner loop. One block in the outer loop consists of two-word multiplication, indicated by the dashed boxes and colors in the right part of Figure 2. At the beginning, two words of operand A (labeled $a_0$ and $a_1$ in Figure 2) along with two words of B (namely $b_0$ and $b_1$) are loaded from the RAM.
The 64-bit product $a_0 \cdot b_0$ is computed and accumulated into two registers by using the UMAAL instruction. Afterward, the product $a_1 \cdot b_0$ is computed and added into the accumulator registers. Because UMAAL absorbs both addends into its 64-bit result, the carry from this operation is handled without other side effects. Thereafter, we multiply $a_0$ by $b_1$ and add the resulting 64-bit product to the registers. The higher word of the result is stored in a temporary register before accumulation to avoid overflows.
After the last product of the first block (i.e., $a_1 \cdot b_1$), we add the products held in the temporary registers to the three accumulator registers. The carry bit is finally propagated. The execution of the first block in Figure 2 requires four UMAAL, three SUB, and three ADD instructions. These instructions require 10 clock cycles. In Algorithm 1, the source code of the proposed implementation is given. In lines 1 to 3, the operands are loaded to the registers and the address pointer is corrected. In lines 4 to 9, the multiplication and accumulation routine is performed.
In Algorithm 2, full rounds of integer multiplication are given. In line 1, the accumulation buffer is initialized. The multiplication is divided into two steps, which compute the lower result (i.e., lines 2-9) and the higher result (i.e., lines 10-18), respectively. The core MAC operation is implemented by following Algorithm 1.

Require: operands (a and b) and len (operand length in words, where the word is 64-bit)
Ensure: results (r)
1: accumulation = 0
2: for i = 0 to len − 1 by 1 do
3:   for j = 0 to i by 1 do
...
11:  for j = i − len + 1 to len − 1 by 1 do
12:  ...
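The block-level MAC and the two-loop product scanning described above can be sketched in Python as follows. This is a model under stated assumptions, not the register-level routine of Algorithms 1 and 2: `umaal` models the instruction on integers, `mac64` uses a hypothetical five-word accumulator, and the final three additions play the role of an ADDS/ADCS/ADC chain.

```python
MASK32 = (1 << 32) - 1

def umaal(lo, hi, n, m):
    """Model of UMAAL: returns (low, high) words of n * m + lo + hi."""
    t = n * m + lo + hi
    return t & MASK32, t >> 32

def mac64(acc, a, b):
    """acc (five 32-bit words, little-endian) += a * b for 64-bit a and b."""
    a0, a1, b0, b1 = a & MASK32, a >> 32, b & MASK32, b >> 32
    r0, r1, r2, r3, r4 = acc
    r0, t = umaal(r0, 0, a0, b0)     # a0*b0 into word 0, high half goes to t
    r1, t = umaal(r1, t, a1, b0)     # a1*b0 into word 1, absorbing t
    r1, u = umaal(r1, 0, a0, b1)     # a0*b1 into word 1, high half kept in u
    lo, hi = umaal(t, u, a1, b1)     # a1*b1 absorbs both pending high halves
    s = r2 + lo                      # add with carry out
    r2, carry = s & MASK32, s >> 32
    s = r3 + hi + carry              # add with carry in and out
    r3, carry = s & MASK32, s >> 32
    r4 = (r4 + carry) & MASK32       # final carry propagation
    return [r0, r1, r2, r3, r4]

def multiply(a, b):
    """Product scanning over equal-length lists of 64-bit words (little-endian)."""
    n = len(a)
    r = [0] * (2 * n)
    acc = [0] * 5
    for i in range(n):                        # lower half of the result
        for j in range(i + 1):
            acc = mac64(acc, a[j], b[i - j])
        r[i], acc = acc[0] | (acc[1] << 32), acc[2:] + [0, 0]
    for i in range(n, 2 * n - 1):             # upper half of the result
        for j in range(i - n + 1, n):
            acc = mac64(acc, a[j], b[i - j])
        r[i], acc = acc[0] | (acc[1] << 32), acc[2:] + [0, 0]
    r[2 * n - 1] = acc[0] | (acc[1] << 32)    # top word left in the accumulator
    return r
```

Each column is finished before its result word is written, so partial sums stay in the accumulator and only final words are stored.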

Multi-Precision Squaring
A novel method for implementing the multi-precision squaring method, called the "doubling and MAC" method, is proposed. The squaring is implemented by following the structure of the column-wise method for the "outer algorithm". For the "inner algorithm", a combination of the proposed product-scanning and doubling and MAC method is performed.
The proposed technique reduces the number of ADD (resp. ADC) instructions by rearranging the sequence of the UMAAL operations. An example of 256-bit squaring is shown in Figure 3. The left part describes the outer loop of squaring. It consists of three parts (inside the dashed box), which are exactly in line with the three loops of squaring. The middle part and the right part show the inner loop of squaring. The middle part is used to calculate $a_i \cdot a_j$ for $i \neq j$, similar to the procedure that we presented in Section 2.4 for multiplication. The proposed doubling and MAC for the computation of $A_i \cdot A_i$ is described next. In Figure 3, the procedure of doubling and MAC can be split into two blocks, as represented by the dashed boxes. Taking the computation of $A_0 \cdot A_0$ (marked in red) as an example, the 64-bit operand $A_0$ can be represented as ($a_0$ and $a_1$), where $a_i$ is one word long.
First, we load the intermediate results computed by the middle part of Figure 3 from the memory to the registers. Afterward, $A_0$ is loaded into two registers, which are labeled $a_0$ and $a_1$. We first perform the multiplication $a_0 \cdot a_1$ and accumulate the 64-bit product into three intermediate result registers. The accumulated intermediate results are doubled at once. Next, we perform the two word multiplications ($a_0 \cdot a_0$ and $a_1 \cdot a_1$), and the results are accumulated into the doubled results. During accumulation, we catch the carry value and store it in a temporary register.
The carry value is used in the next doubling and MAC routine. In total, the proposed doubling and MAC method costs 21 clock cycles, including 3 UMAAL, 12 ADD (resp. ADC), 4 MOV, and 1 SUB instructions. The source code can be found in Algorithm 3. In line 2, the operands are loaded to registers (R8, R9). In line 3, intermediate results are loaded from memory to registers (R2, R3, R4, and R5). In lines 4 and 5, two registers are cleared. In lines 6 to 9, the partial product ($a_i \cdot a_j$ for $i \neq j$) is computed and accumulated into the intermediate results.

...
13: for j = i − len + 1 to len − 1 by 1 do
14:   if i = j then
15:   ...
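The idea behind doubling and MAC can be illustrated with a short Python model. This is a sketch under stated assumptions rather than the register-level routine of Algorithm 3: Python integers replace the register handling, and the function names are illustrative. Each off-diagonal product is computed once, the column sum is doubled once, and the diagonal term is added afterward.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def square64(x):
    """Square one 64-bit word with three 32-bit multiplications instead of four."""
    a0, a1 = x & MASK32, x >> 32
    cross = a0 * a1                              # a0*a1 is computed only once
    return a0 * a0 + ((2 * cross) << 32) + ((a1 * a1) << 64)

def square(a):
    """Column-wise squaring of a list of 64-bit words (little-endian)."""
    n = len(a)
    r = [0] * (2 * n)
    carry = 0                                    # carry caught between columns
    for i in range(2 * n - 1):
        first = max(0, i - n + 1)
        # Off-diagonal products a[j] * a[i-j] with j < i-j, each computed once.
        cross = sum(a[j] * a[i - j] for j in range(first, (i + 1) // 2))
        col = carry + 2 * cross                  # doubling done once per column
        if i % 2 == 0:
            col += square64(a[i // 2])           # diagonal term appears only once
        r[i], carry = col & MASK64, col >> 64
    r[2 * n - 1] = carry
    return r
```

For an n-word operand this computes n(n+1)/2 word products instead of the n^2 products of a general multiplication.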

Scalable Montgomery Reduction for All SIKE Round 2 on Cortex-M4
The parameter sets (SIKEp434, SIKEp503, SIKEp610, and SIKEp751) are specified for the SIKE round 2 protocol [28,29]. For high performance, the Montgomery reduction method is used [30]. Montgomery reduction replaces the expensive division operation with a relatively cheap operation (i.e., multiplication). The detailed descriptions for the Montgomery reduction are given in Algorithm 5.
We evaluated all SIKE protocols with the proposed multi-precision multiplication and squaring operations. For the Montgomery reduction, we used a scalable operand-scanning method. To support all SIKE protocols, the internal word size is set to 32 bits. We divided the n-word operand scanning into three steps: initialization, a middle round, and finalization. The initialization calculates one partial product first. The intermediate result, which is a quotient value, is stored in the memory. The middle round computes the (n − 1)-word partial products. The result is accumulated into the intermediate result.
For SIKEp434, the modulus (M) is multiplied by the quotient (Q; q0-q13) in a 32-bit wise manner. Since the lower part of the modulus (i.e., m0-m5) is zero, only the remaining parts (m6-m13) are multiplied. During the inner loop, only two registers are used to maintain the intermediate results. The inner loop of operand scanning is given in Algorithm 6. In line 1, one operand Q is loaded to the register (R8). In line 2, the intermediate result is loaded to the register (R9). In line 3, the 32-bit wise MAC is performed. In line 4, the result is stored in the memory. In Algorithm 7, the 32-bit wise scalable operand-scanning for Montgomery reduction is described. All operations are performed 32-bit wise. In particular, the inner loop (i.e., line 4) is performed with Algorithm 6.
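A word-serial Montgomery reduction with an operand-scanning inner loop can be sketched in Python as below. This is the generic textbook form with 32-bit words and is an assumption-laden illustration, not a transcription of Algorithms 5 to 7: it does not skip the zero words, and the helper names are hypothetical. The inner statement `q * m[j] + acc[i + j] + carry` is the 32-bit MAC, which always fits in 64 bits.

```python
W = 32
MASK = (1 << W) - 1

def to_words(x, n):
    return [(x >> (W * i)) & MASK for i in range(n)]

def neg_inverse_word(p):
    """Return -p^(-1) mod 2^W for odd p (requires Python 3.8+)."""
    return (-pow(p, -1, 1 << W)) & MASK

def montgomery_reduce(t, p, n):
    """Return t * 2^(-W*n) mod p for odd p < 2^(W*n) and 0 <= t < p * 2^(W*n)."""
    m = to_words(p, n)
    m_prime = neg_inverse_word(p)
    acc = to_words(t, 2 * n + 1)
    for i in range(n):                         # outer loop: one quotient word
        q = (acc[i] * m_prime) & MASK
        carry = 0
        for j in range(n):                     # inner loop: 32-bit MAC
            s = q * m[j] + acc[i + j] + carry
            acc[i + j], carry = s & MASK, s >> W
        k = i + n
        while carry:                           # propagate the final carry
            s = acc[k] + carry
            acc[k], carry = s & MASK, s >> W
            k += 1
    r = sum(w << (W * i) for i, w in enumerate(acc[n:]))
    return r - p if r >= p else r              # single conditional subtraction

p434 = (1 << 216) * 3**137 - 1                 # SIKEp434 prime shape, n = 14 words
```

For primes of the form 2^e_A · 3^e_B − 1 with e_A ≥ 32, the constant returned by `neg_inverse_word` is 1, so each quotient word is simply the current low word of the accumulator.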

Evaluation of Modular Arithmetic
The 32-bit ARM Cortex-M4 processor is evaluated with the STM32F4 Discovery board. The board provides 1 MB of Flash and 192 KB of RAM. The results are summarized in Table 2. The performance is reasonable, as the 32-bit ARM Cortex-M4 microcontrollers run at up to 168 MHz, which means that Curve25519 and SIKEp434 can still be performed within seconds. In terms of code size, our implementation only requires 584 bytes (260 bytes for multiplication and 324 bytes for squaring). This is 75% and 10% of the code size used by Seo et al. (772 bytes and 5,768 bytes, respectively), which directly saves 0.2-5 KB of code size for PKC implementations.
In terms of long integer multiplication, previous works did not explore such implementations. This means that previous implementations cannot cover RSA implementations. For this reason, we evaluated the proposed implementations and compared the results to the estimated results for Seo et al. For the RSA3072 encryption, the 3072-bit multiplication and squaring implementations based on Seo et al. require about 54 KB and 38 KB, respectively. Furthermore, the fast RSA3072 decryption based on the Chinese Remainder Theorem (CRT) needs 1536-bit multiplication and squaring implementations (13 KB for multiplication and 9 KB for squaring). For this reason, the unrolled RSA implementation cannot be used due to the code size. On the other hand, the proposed method still maintains a small code size (260 bytes for multiplication and 324 bytes for squaring). The results show that only our method provides all PKC options (i.e., ECC, RSA, and SIKE) on ARM Cortex-M4 microcontrollers.
It is also interesting to compare parameterized implementations on other low-cost platforms (e.g., 8-bit AVR processors). Compared with previous parameterized implementations on the AVR processor [1], our implementation has several distinguishing features. In terms of the MAC algorithm, we introduced an optimized MAC method, which is specialized for the UMAAL multiplier of the target processor. The one-way carry-catcher method for integrated doubling and MAC operations introduces further optimizations of memory access.
The pqm4 library [22] is uploaded to the board and the actual system clock cycles are obtained. This is a well-known benchmark framework for post-quantum cryptography.

Evaluation of SIKE
Previous work by Seo et al. mainly focused on speed optimization [14], while the proposed implementation targets the minimum code size. The previous SIKEp434 implementation requires 252 × 10^6 clock cycles, while the proposed SIKEp434 implementation requires 469 × 10^6 clock cycles. The execution timing is translated into throughput. As we evaluated the performance at 168 MHz, the full key exchange of SIKEp434 is performed within 2.79 s on low-end microcontrollers, which is reasonably fast. Considering the speed and size trade-off, the throughput of the SIKEp434 protocol is 46% lower (252 × 10^6 cc vs. 469 × 10^6 cc) than that achieved in the previous work, but we reduced the code size by 13% in this case (33,528 bytes vs. 29,176 bytes). The change in execution timing is similar for SIKEp751, but the difference in code size between the speed version and the size version is greater. The previous SIKEp751 implementation requires 44,688 bytes, while the proposed SIKEp751 implementation requires 28,816 bytes. We are also the first to implement the SIKEp610 protocol on ARM Cortex-M4 processors, and the execution timing and code size achieved reasonable results. The other strength of the scalable implementation is its scalability. The SIKE implementations share the majority of operations except special parameters and finite field operations. All SIKE implementations together, including SIKEp434, SIKEp503, SIKEp610, and SIKEp751, with scalable multiplication and squaring may require around 40 KB. However, all SIKE implementations with speed-optimized multiplication and squaring may require around 100 KB.
On the other hand, the proposed modular arithmetic implementation supports all of RSA, ECC, and SIKE with the smallest code size (e.g., 584 bytes) and reasonable execution timing. In terms of energy consumption, the proposed method consumes more energy than the speed-optimized implementation due to its higher latency (i.e., longer execution time) [31][32][33]. However, the proposed method is a suitable solution when code size is the limiting factor in real-world implementations. Energy consumption matters less for PKC than for block ciphers, as PKC is needed only once, before the secure communication begins, for the key exchange (i.e., Diffie-Hellman). For this reason, the energy consumption metric of a PKC implementation is of relatively lower priority than the code size in a real-world setting.
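The percentages quoted above follow directly from the reported cycle counts and code sizes. The short calculation below reproduces them; the variable names are illustrative, and the inputs are the figures stated in this section.

```python
CLOCK_HZ = 168_000_000

cycles_prev, cycles_ours = 252_000_000, 469_000_000    # SIKEp434, clock cycles
size_prev_p434, size_ours_p434 = 33_528, 29_176        # bytes
size_prev_p751, size_ours_p751 = 44_688, 28_816        # bytes

seconds_ours = cycles_ours / CLOCK_HZ                  # wall-clock time at 168 MHz
throughput_loss = 1 - cycles_prev / cycles_ours        # relative throughput drop
saving_p434 = 1 - size_ours_p434 / size_prev_p434      # code size reduction
saving_p751 = 1 - size_ours_p751 / size_prev_p751
```

Note that the 46% figure is a throughput drop; expressed as extra clock cycles, the same measurements correspond to roughly 86% more cycles (469 vs. 252).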

Conclusions
In this paper, we presented scalable implementations of multi-precision multiplication, squaring, and the Montgomery algorithm on the 32-bit ARM Cortex-M4. The implementation emphasizes reducing code size and providing scalability with practically fast speed. We proposed several novel techniques to further improve the implementations on low-cost 32-bit ARM processors, including the MAC and the doubling and MAC techniques. Our implementation on the 32-bit ARM platform highlights the practical benefits of the proposed methods. The multiplication and squaring implementations on the ARM Cortex-M4 require 608 and 493 clock cycles for 256-bit operands, and 50,857 and 26,806 clock cycles for 3072-bit operands, respectively. Even though our implementation is slower than the unrolled implementation, it only requires 10% of the memory footprint for the SIKEp751 case. Furthermore, the proposed multiplication and squaring require only 584 bytes of code, which is well suited to PKC (e.g., RSA, ECC, and SIKE) implementations on low-cost processors. The proposed implementations are fully parameterized and are based on a constant-time solution. An operand of any length can be supported with a single implementation. We also evaluated all SIKE round 2 protocols on the target microcontrollers. The results show that a reasonable execution timing can be achieved with a very small code size.