A Secure Algorithm for Inversion Modulo $2^k$

Modular inversions are widely employed in public key cryptosystems, and they are known to be a bottleneck because of their expensive computation. Recently, a new algorithm for inversions modulo $p^k$ was proposed, which may speed up the calculation of a modulus-dependent quantity used in the Montgomery multiplication. The original algorithm lacks security countermeasures; thus, a straightforward implementation may expose the input. This is an issue if that input is a secret. In the RSA-CRT signature using Montgomery multiplication, the moduli are secrets (the primes $p$ and $q$). Therefore, the modulus-dependent quantities related to $p$ and $q$ must be computed securely. This paper presents a security analysis of the novel method, considering that it might be used to compute secrets. We demonstrate that a Side Channel Analysis can disclose the data being manipulated. Consequently, a secure variant for inversions modulo $2^k$ is proposed, through the application of two known countermeasures. In terms of performance, the secure variant is still comparable with the original one.


Introduction
Public key cryptographic schemes often require performing modular inversions, which are known to be expensive operations. In RSA, for example, the secret key is obtained through the inversion of the public key. In ECDSA (Elliptic Curve Digital Signature Algorithm), to generate a digital signature, the per-message random secret is inverted after the scalar multiplication. Fermat's, Euler's and Euclid's methods are the best-known solutions to compute a multiplicative inverse. Derived from Euclid's method, the Binary Extended Euclidean Algorithm (BEEA) is very efficient, as it replaces multi-precision divisions with right shifts. This is a suitable approach for software and hardware realizations [1]. However, a straightforward BEEA implementation is susceptible to Side Channel Analysis (SCA) [2,3].
Recently, a very efficient algorithm to compute the multiplicative inverse modulo $p^k$ was introduced by Koç in [4]. The new inversion method has a low computational complexity. This is a clear advantage, since cryptographic implementations usually manipulate large numbers. A special case of the algorithm is the computation modulo $2^k$. This is especially useful to compute the modular inverse required in a Montgomery multiplication [5]. The method, however, is clearly not intended to manipulate sensitive data, as will be analyzed in the next sections. Therefore, a straightforward implementation should be avoided, for example, in the Chinese Remainder Theorem variant of RSA (RSA-CRT). Some other approaches were previously introduced to compute the inverse modulo $2^k$ [6,7]. Those algorithms do not seem intended to manipulate secrets either, and their performance is lower than that of the new one (see the comparison of algorithms given in [4]). Thus, we focus this work on the analysis of the new inversion method, because a secure version of it may be a suitable candidate for low power devices with cryptographic capabilities.

RSA-CRT with Montgomery Multiplications
RSA is an asymmetric cryptosystem that allows both encryption and signing [8]. The cryptosystem is still a widely used standard, even in financial sector products like smart cards [9]. Despite its age, recent works still address the security of RSA implementations [10,11,12]. The RSA signature of a message $m$ using the private key $d$ can be written as
$$S = m^d \bmod N \qquad (1)$$
where $N$ is a public modulus formed as the product of two secret primes ($N = p \cdot q$). As $N$ is a large number, the modular exponentiation is a costly operation. The RSA-CRT variant is preferred for efficiency reasons. As $p$ and $q$ are both smaller than $N$, residue-based arithmetic (modulo $p$ and $q$) allows working with shorter registers, and the exponentiation complexity is reduced by signing the message following
$$S_p = m^{d_p} \bmod p \qquad (2)$$
$$S_q = m^{d_q} \bmod q \qquad (3)$$
where $d_p$ and $d_q$ are the residues of the private key modulo $p - 1$ and $q - 1$, respectively. In Equations (2) and (3), two partial signatures are obtained. To give a unified result, these values need to be joined. The recombination methods of Gauss and Garner are well known for that purpose. Garner's recombination is often preferred for being more efficient than Gauss's method.
The main advantage of the Montgomery multiplication (see Algorithm 1) is the replacement of divisions with right shifts and of modular reductions with truncations. Because of that simplification, this method is commonly used to perform the modular multiplications involved in the exponentiation.
Although the output is multiplied by $r^{-1}$, Algorithm 1 works with a quantity $n'$ such that
$$r \cdot r^{-1} - n \cdot n' = 1$$
with $r = 2^k$. Therefore, $n'$ can be calculated as $n' = -n^{-1} \bmod 2^k$. In the case of the standard RSA, the modulus $N$ is public; however, in RSA-CRT the partial exponentiations are calculated using the moduli $p$ and $q$, which are both secrets. The analogous calculations of $n'$ in RSA-CRT would be
$$p' = -p^{-1} \bmod 2^k \qquad (5)$$
$$q' = -q^{-1} \bmod 2^k \qquad (6)$$
where $k$ is such that $2^{k-1} < p, q < 2^k$. Using a non-protected algorithm to calculate $n'$ in a standard RSA does not imply any risk. On the contrary, for RSA-CRT, if $p'$ and $q'$ need to be computed, it should be done more carefully, because the secret primes are directly involved. Actually, if $p$ and $q$ are dynamically masked ($p_m$ and $q_m$) at each RSA-CRT computation, as commonly occurs in banking products, the Montgomery constants $p'_m$ and $q'_m$ will be different every time and must be recomputed at each execution.
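The partial signatures and their recombination can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the primes, the exponent and the function name `rsa_crt_sign` are hypothetical, and Garner's recombination is written in one common form, $S = S_q + q \cdot ((S_p - S_q) \cdot q^{-1} \bmod p)$, which may differ from the exact expression the authors use.

```python
def rsa_crt_sign(m, p, q, d):
    """Toy RSA-CRT signature with Garner's recombination (illustration only)."""
    d_p = d % (p - 1)            # private exponent reduced for the modulus p
    d_q = d % (q - 1)            # private exponent reduced for the modulus q
    s_p = pow(m, d_p, p)         # partial signature modulo p
    s_q = pow(m, d_q, q)         # partial signature modulo q
    q_inv = pow(q, -1, p)        # q^{-1} mod p (needs Python 3.8+)
    h = (q_inv * (s_p - s_q)) % p
    return s_q + q * h           # Garner: unique value below p*q


# Hypothetical toy parameters, far too small for real use.
p, q, e = 61, 53, 17
d = pow(e, -1, (p - 1) * (q - 1))
signature = rsa_crt_sign(65, p, q, d)
print(signature == pow(65, d, p * q))
```

Both exponentiations run on operands about half the size of $N$, which is where the efficiency gain comes from.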
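To make the role of the Montgomery constant concrete, the following Python sketch shows a textbook word-free form of the Montgomery product with $r = 2^k$. It is not a reproduction of Algorithm 1; the function names and the toy modulus are assumptions, and the constant is obtained here with Python's built-in modular inverse rather than with the method analyzed in this paper.

```python
def montgomery_constant(n, k):
    """Return n' = -n^{-1} mod 2^k for an odd modulus n."""
    r = 1 << k
    return (-pow(n, -1, r)) % r          # pow(..., -1, r) needs Python 3.8+


def montgomery_multiply(a, b, n, n_prime, k):
    """Return a*b*r^{-1} mod n with r = 2^k, assuming 0 <= a, b < n < r."""
    mask = (1 << k) - 1
    t = a * b
    u = ((t & mask) * n_prime) & mask    # reduction modulo r is a truncation
    s = (t + u * n) >> k                 # division by r is a right shift
    return s - n if s >= n else s        # final correction keeps s below n


n, k = 197, 8                            # hypothetical toy modulus, 2^7 < n < 2^8
n_prime = montgomery_constant(n, k)
print(n_prime, montgomery_multiply(45, 76, n, n_prime, k))
```

In RSA-CRT the same computation would run with $n$ replaced by the secret primes, so the routine producing the constant handles secret data.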

Our Contributions
In this work, we conduct a security evaluation of the new inversion method proposed by Koç. We demonstrate that the algorithm lacks security countermeasures, so a straightforward implementation of it may compromise a secret that is being manipulated. A secure and still efficient variant for the computation of the inverse modulo $2^k$ is proposed herein. It includes countermeasures that allow sensitive data to be handled safely, as needed in the case of RSA-CRT with Montgomery multiplication.

Paper Organization
This paper is organized as follows: in Section 1 an introduction to the topic is given. Section 2 describes the inversion method under study. In Section 3 a security analysis is conducted for the special case of inversions modulo $2^k$, where two vulnerabilities are discussed. In Section 4 the countermeasures to patch the vulnerabilities are described and an SCA-protected variant of the algorithm is presented. Finally, Conclusions and References are listed.

On a New Algorithm for Inversion Modulo $p^k$
The algorithm introduced in [4] addresses the need of public key cryptographic schemes to perform modular inversions. The new method to compute $x = a^{-1} \bmod p^k$ seems to be quite efficient and it works for any $p$ and any $k$. The assumptions required by the computation are stated as the input conditions of Algorithm 2. Since $p$ is usually a small number (commonly 2 or 3), the computation at step 1 is expected to be easy. In fact, for the case of $p = 2$, the computation of $c$ is trivial. Further details on Algorithm 2 and its proofs can be found in [4].

Algorithm 2: Modular Inverse [mod $p^k$]
Input: $a$, $p$ and $k$, such that $\gcd(a, p) = 1$ and $a < p^k$
Output: $x = a^{-1} \bmod p^k$
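A minimal Python sketch of a digit-by-digit inversion modulo $p^k$ is given below, following the description in [4] as summarized in this section: a single inverse $c = a^{-1} \bmod p$ is computed first, and then one base-$p$ digit of $x$ is produced per iteration. The function name is an assumption, and apart from the computation of $c$ at step 1, the step layout is inferred rather than copied from the original listing.

```python
def inverse_mod_pk(a, p, k):
    """Sketch: x = a^{-1} mod p^k, built one base-p digit at a time."""
    c = pow(a % p, -1, p)         # step 1: the only "real" inversion, modulo p
    b = 1
    x = 0
    for i in range(k):
        digit = (c * b) % p       # next digit X_i of the inverse
        b = (b - a * digit) // p  # b - a*X_i is a multiple of p, so this is exact
        x += digit * p**i
    return x


print(inverse_mod_pk(5, 3, 4))    # inverse of 5 modulo 81
```

The loop maintains $a \cdot x_i + p^i \cdot b_i = 1$, where $x_i$ is the partial result, which is why the final $x$ satisfies $a \cdot x \equiv 1 \pmod{p^k}$.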

Special case $p = 2$
In the previous section, we discussed the Montgomery constant computation, and the need for a modular inverse in that process was highlighted. According to Equations (5) and (6) for the RSA-CRT cryptosystem, the said inverse is computed modulo $2^k$, which is a particular case of modulo $p^k$ where $p = 2$. Algorithm 2 can be simplified if $p = 2$ to compute the inverse modulo $2^k$. In this case, the computation of $x = a^{-1} \bmod 2^k$ requires that $\gcd(a, 2^k) = 1$; thus $a$ must be an odd number, and then $c = 1$.
From Algorithm 3 one can see that the operation at step 3 is trivial, as it only requires checking the LSB of $b_i$. Moreover, the value returned in $x$ is binary. The simplified algorithm for $p = 2$ follows.
Algorithm 3: Modular Inverse [mod $2^k$]
Input: $a$ and $2^k$, such that $a < 2^k$ and $a$ is odd
Output: $x = a^{-1} \bmod 2^k$
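The special case can be sketched in a few lines of Python, consistent with the steps referred to in the text: step 3 reads the LSB of $b_i$ and step 4 computes $(b_i - a \cdot X_i)/2$. The function name is an assumption, and the sketch is deliberately unprotected. Passing $a = -p$ yields the Montgomery constant $p' = -p^{-1} \bmod 2^k$ directly.

```python
def inverse_mod_2k(a, k):
    """Unprotected sketch: x = a^{-1} mod 2^k for odd a (a may be negative)."""
    b = 1
    x = 0
    for i in range(k):
        bit = b & 1               # step 3: X_i is the LSB of b_i
        b = (b - a * bit) >> 1    # step 4: the difference is even, the shift is exact
        x |= bit << i             # X_i is bit i of the result
    return x


p = 197                           # hypothetical toy prime, 2^7 < p < 2^8
p_prime = inverse_mod_2k(-p, 8)   # Montgomery constant for p
print(p_prime, (p * p_prime) % 256)
```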

Security Analysis for $p = 2$
Side Channel Analysis techniques were introduced in the late 1990s by Kocher [13]. An effective attack discloses sensitive data being manipulated by a cryptographic device through the leakage associated with its power consumption. One such technique is Simple Power Analysis (SPA). Through an SPA, one can observe the sequence of operations, or even distinguish one operation from another by the differences in their power consumption patterns. The observation of a power consumption trace may also give a clue about the operation latency, which is closely related to the bit length of the operands. This applies mainly to sequential operations. In the following subsections we describe two vulnerabilities found in the inversion algorithm under analysis that prevent the safe manipulation of secrets.

Asymmetric Iterations
It is well known that the Square-and-Multiply method to compute $y = g^k$ allows $k$ to be recovered through an SPA, because the operations performed differ depending on whether the current exponent bit is 0 or 1. The Montgomery ladder exponentiation solves that issue by performing the same operations at every iteration of the algorithm [14].
A similar issue has been detected in the inversion method analyzed in this work. It allows a straightforward SPA, which leads to an easy recovery of the operation result and, in consequence, to the disclosure of the input data. As seen in the previous section, the modular inverse of the input $a$ obtained through Algorithm 3 is formed by $x = (X_{k-1} \ldots X_1 X_0)_2$, where $X_i \in \{0, 1\}$. Furthermore, the intermediate result $b_i - a \cdot X_i$ at step 4 is always divisible by 2. At step 4, besides the multiplication $a \cdot X_i$, two other operations can be distinguished: a subtraction and a division by 2. The division can be performed as a right shift because the result of the subtraction is always divisible by 2.
The subtraction may or may not be computed. If $X_i = 0$, then $a \cdot X_i = 0$, and the subtraction $b_i - a \cdot X_i$ becomes $b_i - 0$. In a straightforward implementation of the original algorithm, the developer may choose to skip the subtraction if $X_i = 0$. We recall that the original work does not refer to any SCA protection to keep the input data safe; thus we believe the author did not consider a scenario with a secret input. If the subtraction is not performed, a significant difference in the execution flow exists depending on the value of $X_i$. In summary:
$$b_{i+1} = \begin{cases} b_i / 2 & \text{if } X_i = 0 \text{ (right shift only)} \\ (b_i - a)/2 & \text{if } X_i = 1 \text{ (subtraction and right shift)} \end{cases}$$
Such a data-dependent characteristic could be distinguished in a power consumption trace of the algorithm execution. It then leads to a straightforward SPA where the modular inverse of the secret could be directly recovered. Once the modular inverse is recovered, it is trivial to obtain the input by computing $a = x^{-1} \bmod 2^k$. If $a$ is a secret, as is the case in the Montgomery constants computation for RSA-CRT, this implies a critical security issue.
However, the developer may choose a more regular implementation by always computing the subtraction. In this case, there are two possibilities: if $X_i = 0$, the subtraction $b_i - 0$ is computed, while if $X_i = 1$, $b_i - a$ is computed. In the context of the Montgomery constants computation for RSA-CRT, the operands $b_i$ and $a$ are large integers from the second iteration onward. This makes the subtraction $b_i - a$ likely to involve long carry propagation, which has a negative impact on the latency of the addition/subtraction. While in $b_i - 0$ there is no carry propagation, in $b_i - a$ the carry propagation varies and makes that operation take longer. This should be enough to apply a Timing Attack to distinguish one operation from the other, which directly leads to inferring the values of the related $X_i$.
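The attack path just described can be simulated with a short Python sketch. It is a model under stated assumptions: the power trace is idealized as a list of operation labels (`"SUB"`, `"SHR"`), the function names are hypothetical, and a real SPA would have to recover those labels from measured power patterns.

```python
def leaky_inverse_trace(a, k):
    """Unprotected loop that skips the subtraction when X_i = 0, logging operations."""
    b, x, trace = 1, 0, []
    for i in range(k):
        bit = b & 1
        if bit:                   # data-dependent branch: only taken for X_i = 1
            b -= a
            trace.append("SUB")
        b >>= 1
        trace.append("SHR")
        x |= bit << i
    return x, trace


def recover_input_from_trace(trace, k):
    """Attacker's side: rebuild x from the operation sequence, then invert it."""
    bits, seen_sub = [], 0
    for op in trace:
        if op == "SUB":
            seen_sub = 1          # a subtraction before the shift means X_i = 1
        else:
            bits.append(seen_sub)
            seen_sub = 0
    x = sum(bit << i for i, bit in enumerate(bits))
    return pow(x, -1, 1 << k)     # a = x^{-1} mod 2^k (Python 3.8+)


secret = 181                      # hypothetical odd secret below 2^8
_, observed = leaky_inverse_trace(secret, 8)
print(recover_input_from_trace(observed, 8) == secret)
```

A single trace is enough in this model, since every $X_i$ is read directly from the presence or absence of the subtraction.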

Operations Latency
The latency of arithmetic operations is closely related to the data length of the operands, especially in software implementations. In RSA, for example, the exponentiation latency is expected to be proportional to the key length. Additions and subtractions both commonly require managing a carry bit that can be generated at each bit position. Therefore, the carry chain is as long as the operands, and it determines the latency of the whole operation.
Let us say, for example, that the parity of an operand determines the next operation in which it will be involved, and that this operation affects the operand's bit length. If that quantity is then added to or subtracted from a constant value, and this sequence is performed in a loop, the addition/subtraction latency might vary at each iteration as a consequence of the change in the carry chain. If an adversary is able to identify the additions/subtractions through an SPA and measure those variations, then the operand's parity (its Least Significant Bit, LSB) might be traced back.
From Algorithm 3, one can see that the subtraction at step 4 depends on $b_i$ and $a$. The value of $a$ is invariant throughout the whole operation, while $b_i$ varies. In fact, the value of $b_i$ is strongly dependent on $X_i$. If $X_i = 0$, then $b_{i+1}$, computed at iteration $i$, yields $b_i / 2$. For consecutive $X_i = 0$, the respective $b_{i+1}$ are always smaller by a factor of 2. On the other hand, considering $a < 0$ (as required in the Montgomery constant computation), it can be demonstrated that $b_{i+1}$ tends to $|a|$ for consecutive $X_i = 1$.
Let us have, for $X_i = X_{i+1} = 1$,
$$b_{i+1} = \frac{b_i - a}{2} = \frac{b_i}{2} - \frac{a}{2} \qquad (7)$$
$$b_{i+2} = \frac{b_{i+1} - a}{2} = \frac{b_i}{4} - \frac{3a}{4} \qquad (8)$$
Comparing the right side of Equation (8) with the right side of Equation (7), the second fraction (which depends on $a$) is greater in (8) and gets closer to $|a|$. Meanwhile, the first fraction is halved and tends to zero. Thus, the result moves closer to $|a|$ than to $b_i$. Something similar occurs when $X_i = 0$ and $X_{i+1} = 1$.
In summary, we might expect to observe in a power trace a continuously decreasing latency in the addition/subtraction for consecutive iterations where $X_i = 0$, while the latency would tend to increase for consecutive $X_i = 1$, or even for transitions from $X_i = 0$ to $X_{i+1} = 1$.
The differences in the execution flow for the cases $X_i = 0$ and $X_i = 1$, presented in the previous section, are enough to perform an SPA on Algorithm 3. Thus, a timing analysis is not necessary for this purpose; however, in order to design a countermeasure against such a data-dependent vulnerability, the issue of operation timing has to be taken into account.
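The expected behaviour can be checked numerically with a small Python sketch that records the bit length of $b_i$ at each iteration, used here as a rough stand-in for the carry-chain length and hence for the subtraction latency. The function name and the toy prime are assumptions; the input is taken as $a = -p$, as in the Montgomery constant computation.

```python
def bit_lengths_per_iteration(a, k):
    """Return (i, X_i, bit length of b_i) for the unprotected loop."""
    b = 1
    rows = []
    for i in range(k):
        bit = b & 1
        rows.append((i, bit, b.bit_length()))
        b = (b - a * bit) >> 1    # with a < 0 this adds |a| before halving
    return rows


p = 197                           # hypothetical toy prime, 2^7 < p < 2^8
for i, bit, length in bit_lengths_per_iteration(-p, 8):
    print(f"i={i}  X_i={bit}  bits(b_i)={length}")
```

In this model, every $X_i = 0$ shortens the next $b_{i+1}$ by exactly one bit, while every $X_i = 1$ pulls it back up to roughly the size of $|a|$.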

A Secure Method for Inversion Modulo $2^k$
Considering the issues found in the previous section, a secure variant of Algorithm 3 is proposed herein (see Algorithm 4).
Input: $a$ and $2^k$, such that $a < 2^k$ and $a$ is odd
Output: $x = a^{-1} \bmod 2^k$
Our new variant resists both SPA and Timing Attacks with minimal overhead; therefore, it can be used to obtain the multiplicative inverse of odd secrets modulo $2^k$. Moreover, the low complexity of the applied countermeasures makes this algorithm suitable for implementation in low power devices.
In our algorithm, two conditional branches have been included for the cases $X_i = 0$ and $X_i = 1$, respectively. They both compute the same operations, in the same order, with the current values of $b_i$. Dummy operations have been introduced in both branches to balance them. The variable $b_f$ holds the useless result derived from the dummy operation. This simple countermeasure follows the same strategy as the Montgomery ladder and prevents the recognition of $X_i$ through an SPA.
Moreover, the subtraction $b_i - a$ is executed for all $X_i$, and even in the fake case, the operation manipulates the actual value of $b_i$. Following the analysis in Section 3.2, the subtraction latency depends on $b_i$, which varies in every iteration depending on $X_i$. Thus, for every iteration, the latency of $b_i - a$ depends on $X_i$. Therefore, a constant bit length of $b_i$ (which implies a constant carry chain length in $b_i - a$) is required so that the subtraction has the same latency throughout the whole execution.
Previously, it was seen that $b_{i+1}$ decreases if $X_i = 0$, and that $b_{i+1}$ tends to $|a|$ for consecutive $X_i = 1$. As $a < 2^k$, it has at most $k$ bits, and so has $b_i$. When $a < 0$ (as in the Montgomery constant computation), the operation $b_i - a$ becomes an addition. In such cases $b_{i+1}$ might have at most $k + 1$ bits, considering the carry.
To get $X_i$ at step 4, the parity of $b_i$ is evaluated. If an even value is added to $b_i$, it will not affect the parity of $b_i$. On the other hand, if a number $v = 2^{k+1}$ is added to $b_i$, it will not affect the first $k + 1$ bits of the operation result at steps 7 and 10. Following this reasoning, $X_i$ at step 4 could be evaluated with $b_i + 2^{k+1}$, and then $b_{i+1} = (b_i + 2^{k+1} - a)/2$. Please note that, as the algorithm evaluates only $k$ bits of the resulting $b_i$, the Most Significant Bit (MSB), at position $k + 1$, will have no effect on the result. Table 1 gives a detailed view of this.
In the algorithm, the loop first gets a $b_i$ of $k + 2$ bits, whose MSB is '1'. After the subtractions at steps 7 and 10, a division by 2 is carried out. It shifts the MSB one bit to the right, so that it occupies the $2^k$ position. This is compensated in the next iteration, at step 3, by adding $2^k$ so that $b_i$ has $k + 2$ bits again, that is, with the MSB in the $2^{k+1}$ position.
The starting point was that different bit lengths of $b_i$ may cause a latency variation in the subtractions, which may allow the previous value of $X_i$ to be inferred. Adding $2^{k+1}$ to $b_i$ guarantees that the subtractions at steps 7 and 10 are always performed with operands of constant bit length. At step 3, $b_i$ will always have $k + 1$ bits, so the addition $b_i = b_i + 2^k$ will always be performed with constant bit length operands too. Furthermore, the right shifts at steps 6 and 9 will be carried out with $b_i$ having a fixed length of $k + 2$ bits.
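The two countermeasures can be sketched together in Python. This is not a transcription of Algorithm 4: the function name, the initial value $b_0 = 1 + 2^k$ and the exact ordering of operations inside the branches are assumptions chosen to match the description above (balanced branches with a dummy result $b_f$, and a guard bit kept at the $2^{k+1}$ position). Python integers are not constant time, so the sketch only models the logic and the operand lengths.

```python
def inverse_mod_2k_balanced(a, k):
    """Sketch of a balanced, fixed-length variant; returns (x, lengths of b_i)."""
    b = 1 + (1 << k)              # assumed start: b_0 = 1 with a guard bit at 2^k
    x = 0
    lengths = []
    for i in range(k):
        b += 1 << k               # move the guard bit up to the 2^(k+1) position
        bit = b & 1               # the guard is even, so the parity is unchanged
        lengths.append(b.bit_length())
        if bit == 0:
            b_f = (b - a) >> 1    # dummy subtraction on the real b_i, discarded
            b_next = b >> 1       # real update for X_i = 0
        else:
            b_next = (b - a) >> 1 # real update for X_i = 1
            b_f = b >> 1          # dummy shift on the real b_i, discarded
        b = b_next
        x |= bit << i
    return x, lengths


p = 197                           # hypothetical toy prime, 2^7 < p < 2^8
x, lengths = inverse_mod_2k_balanced(-p, 8)
print((p * x) % 256, set(lengths))
```

With $a = -p$, every recorded $b_i$ has $k + 2$ bits in this model, so the subtraction always sees operands of the same length regardless of $X_i$.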

Secure Variant Overhead
The variant proposed herein implies no significant overhead in comparison with the original algorithm. The count of operations of the original inversion method yields one addition, provided that we disregard the parity check of $b_i$ and the division by 2. Our algorithm does not introduce any further addition. In fact, the operation at step 3 may be coded in several ways, but in any case it only requires manipulating the most significant byte of $b_i$ to set the bit in the $2^{k+1}$ position (see Table 1).

Conclusions
The analysis performed on the original inversion algorithm modulo $2^k$ led us to establish a direct relationship between the output data bits and the execution flow. Consequently, we demonstrated that the modular inverse of a secret could be revealed by an SPA conducted on a single power consumption trace.
A timing analysis was also carried out, specifically on the subtraction operation. It was concluded that the bit length of $b_i$ may affect the latency of the subtractions, and because the bit length is related to the factor $X_i$, a direct relationship could also be established between the sensitive output data and the latency of the subtractions.
Taking this into account, a secure variant of the original inversion algorithm was proposed. By solving the security issues described, the protected algorithm allows secret values to be manipulated, as is the case in the Montgomery constants computation for RSA-CRT. The secure method is resistant to SPA and Timing Attacks. Furthermore, the applied countermeasures imply no significant performance loss with respect to the original algorithm.

Table 1. Addition $a + b_i$ not affected by the summand $2^{k+1}$.