Speeding-Up Elliptic Curve Cryptography Algorithms

Abstract: In recent decades, there has been an increasing interest in elliptic curve cryptography (ECC) and, especially, in the practical use of the Elliptic Curve Digital Signature Algorithm (ECDSA). The rather recent development of emergent technologies, such as blockchain and the Internet of Things (IoT), has motivated researchers and developers to construct new cryptographic hardware accelerators for ECDSA. Different types of optimizations (either platform dependent or algorithmic) have been presented in the literature. In this context, we turn our attention to ECC and propose a new method for generating ECDSA moduli with a predetermined portion that allows one to double the speed of Barrett's algorithm. Moreover, we take advantage of the advancements in the artificial intelligence (AI) field and bring forward an AI-based approach that enhances Schoof's algorithm for finding the number of points on an elliptic curve in terms of implementation efficiency. Our results represent algorithmic speed-ups that exceed the current paradigm, as we also address particular security environments meeting the needs of governmental organizations.


Introduction
Elliptic curve cryptography (ECC) was initially proposed in [1,2] as an alternative to the already established public key cryptographic schemes. As a side note, the credit for the first use of elliptic curves in a cryptology-related context is given to Lenstra for his factorization algorithm [3]. ECC has received an increasing amount of attention over time, not only for the high level of provable security offered, but especially due to a desirable property concerning implementation efficiency: the cryptographic keys are significantly shorter compared to, e.g., those of RSA [4].
For more than a decade now, ECC has been a central piece of blockchain technology. To be more specific, the Elliptic Curve Digital Signature Algorithm (ECDSA) [5] is widely adopted in the construction of cryptocurrencies and, implicitly, blockchains. Thus, there has been justified interest in efficient implementations of ECDSA and other ECC schemes in recent years, especially using Field Programmable Gate Arrays (FPGAs) [6][7][8][9][10][11][12][13][14]. Hence, FPGA-based hardware accelerators represent the main applied research topic when dealing with (permissioned) blockchains. To underline the importance of the subject, we have to mention that the main FPGA technology producer, Xilinx [15], organized a competition in 2021 [16], which encouraged R&D representatives to propose, among other topics, blockchain-related projects [17].
Motivated by all the above and following the work of Géraud et al. [32], this paper aims at building the foundation for new techniques for optimizing the implementation of ECDSA. As in the case of most public-key cryptosystems, the basic arithmetic operation used in ECC is modular reduction. The authors of [32] describe a method that doubles the speed of Barrett's algorithm [33] by using specific RSA moduli with a predetermined portion. The result is then applied in order to generate DSA [34] parameters. As an extension, our article presents a technique for generating ECDSA moduli with a predetermined portion that allows one to double the speed of Barrett's algorithm, a widely adopted technique for performing modular reduction efficiently. We also provide the reader with a mathematical proof of our algorithm.
Moreover, we target a more general type of optimization suitable not only for ECDSA, but for various ECC algorithms. Thus, we propose an artificial intelligence (AI)-based approach that enhances the speed of Schoof's algorithm for finding the number of points on an elliptic curve [35]. Schoof's method was the first deterministic polynomial-time algorithm for counting points on elliptic curves defined over finite fields. The result represented, undoubtedly, a breakthrough in terms of designing ECC algorithms.
While the first result we propose is rather particular and applies to a specific digital signature scheme (ECDSA), the latter can be of general interest for ECC algorithms. We underline that the AI-optimized variant of Schoof's algorithm is rather a proof of concept for a future series of results in this research direction.
Nonetheless, our methods can also be combined with already established algorithmic improvements to obtain even better implementation timings.

Specific Supplementary Motivation
The common practice when using ECC is to rely on specific curves [5,36] rather than choosing them every time the algorithm is run, in order to ease computations (by applying dedicated formulae for point addition and scalar multiplication). Nonetheless, speeding up ECDSA in the general case can be advantageous, e.g., either for cryptographic implementations requiring a higher level of security than the standard one, or simply for proprietary cryptographic algorithms (of course, given that a step consisting of checking the security of the generated curve is performed). Such needs are customary, especially for governmental organizations. Moreover, in the aforementioned example, implementations for resource-constrained devices and cryptographic hardware may be of great interest. Thus, when proposing our results, we do not seek to compare them with existing targeted ECC implementations in terms of speed, as we consider performing an initially costly step: the parameter generation.
In addition to the above, we believe that our AI-based strategy will benefit from the rapid advances in the field of AI in the near future.

Structure of the Paper
In Section 2, we introduce the notations and briefly describe Barrett's algorithm for modular reduction, Schoof's algorithm for point counting on elliptic curves, and ECDSA. The main results are discussed in Section 3, namely the algorithm for generating the ECDSA parameters and the method for finding the number of points of an elliptic curve. Details regarding the straightforward, unoptimized implementations of the previously mentioned algorithms are presented in Section 4. We conclude and provide the reader with future work ideas in Section 5. Moreover, we recall ECDSA in Appendix A and Schoof's algorithm in Appendix B.

Notations
Throughout this paper, we denote by NextPrime(r) the smallest prime p such that p ≥ r. #S represents the cardinality of the set S. We let P be the bit-length of p, i.e., P = ||p||. The value of P is fixed from now onwards. The binary shift-to-the-right of x by y positions is denoted by x ≫ y.
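For illustration, these notations translate directly into Python. The helpers below are our own (the trial-division primality test is a toy stand-in for a real primality test and is only suitable for small numbers):

```python
def is_prime(n: int) -> bool:
    """Naive trial-division primality test (fine for small illustrative n)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(r: int) -> int:
    """NextPrime(r): the smallest prime p such that p >= r."""
    p = max(r, 2)
    while not is_prime(p):
        p += 1
    return p


p = next_prime(90)       # the smallest prime >= 90, i.e., 97
P = p.bit_length()       # P = ||p||, the bit-length of p
shifted = p >> 3         # x >> y: binary shift-to-the-right of x by y positions
```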

Barrett's Algorithm
Let d and m be integers. Barrett's algorithm (Algorithm 1) uses only two bit-shifts and one multiplication to produce an approximate value c₃ of the quotient obtained when d is divided by m. This approximation satisfies ⌊d/m⌋ − 2 ≤ c₃ ≤ ⌊d/m⌋, which means that the final correction loop is not repeated more than two times. The bit-lengths of d and m are denoted by D and M. Algorithm 1 also requires a quantity L, which represents the maximal bit-length of the numbers that can be reduced. Barrett's algorithm works as long as the condition D ≤ L is satisfied. In most cases, these constants can be chosen such that D = L = 2M, provided that the reduction is performed after every operation. The constant k = ⌊2^L/m⌋ is computed only once, since it does not depend on the value of d. Further details regarding Algorithm 1 can be found in [33].
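A minimal Python sketch of this procedure, under the convention L = 2M (variable names follow the text; this is an illustration rather than a verbatim transcription of Algorithm 1):

```python
def barrett_setup(m: int):
    """Precompute M = ||m|| and the constant k = floor(2^L / m), with L = 2M."""
    M = m.bit_length()
    k = (1 << (2 * M)) // m
    return M, k


def barrett_reduce(d: int, m: int, M: int, k: int) -> int:
    """Compute d mod m for d < 2^(2M) using two bit-shifts and one
    multiplication by the precomputed constant k."""
    c3 = ((d >> (M - 1)) * k) >> (M + 1)  # approximates floor(d / m)
    r = d - c3 * m
    while r >= m:  # c3 undershoots by at most 2, so this runs at most twice
        r -= m
    return r


m = 12345
M, k = barrett_setup(m)
r = barrett_reduce(99999999, m, M, k)  # equals 99999999 % 12345
```

Since k depends only on m, the division is paid once at setup time; every later reduction costs shifts and multiplications only.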

Schoof's Algorithm
The elliptic curves considered are of the form y² = x³ + ax + b and are defined over a finite field F_p, where p is prime. An important result, which will be used throughout this paper, is the following theorem.
Theorem 1 (Hasse). The number of points n of an elliptic curve defined over a finite field of size p satisfies the inequality |n − (p + 1)| ≤ 2√p.
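Theorem 1 can be verified directly for tiny fields by brute-force counting. The sketch below is only feasible for very small p, and the curve parameters used are arbitrary examples:

```python
import math


def count_points(p: int, a: int, b: int) -> int:
    """Brute-force order of E: y^2 = x^3 + ax + b over F_p,
    including the point at infinity (only feasible for tiny p)."""
    sqrt_count = {}  # number of square roots of each residue mod p
    for y in range(p):
        s = (y * y) % p
        sqrt_count[s] = sqrt_count.get(s, 0) + 1
    n = 1  # the point at infinity
    for x in range(p):
        rhs = (x * x * x + a * x + b) % p
        n += sqrt_count.get(rhs, 0)
    return n


n = count_points(5, 1, 1)                    # y^2 = x^3 + x + 1 over F_5
assert abs(n - (5 + 1)) <= 2 * math.sqrt(5)  # Hasse's inequality
```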
In [35], Schoof published the first deterministic polynomial-time algorithm that computes the order of an elliptic curve defined over a finite field. This algorithm starts off by using Theorem 1, which provides an interval of possible values for the order of the elliptic curve. That specific interval has width 4√p.
Since the order can be written as #E(F_p) = p + 1 − t, where t is the trace of the Frobenius endomorphism [37], the problem of finding the order reduces to that of finding the value of t. The next step involves computing the value of t modulo several small primes, such that their product is greater than 4√p. Finally, the Chinese Remainder Theorem [37] produces the value of t, which is needed for finding the order. The details of Schoof's algorithm are included in Appendix B as Algorithm A4.
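The recombination step can be sketched as follows; the residues and moduli in the usage line are illustrative and not derived from an actual curve:

```python
def crt(residues, moduli):
    """Chinese Remainder Theorem: combine x ≡ r_i (mod m_i) for
    pairwise-coprime moduli m_i; returns (x, product of moduli)."""
    x, m = 0, 1
    for r, mi in zip(residues, moduli):
        # lift x so that it also satisfies x ≡ r (mod mi)
        k = ((r - x) * pow(m, -1, mi)) % mi
        x += m * k
        m *= mi
    return x, m


# e.g., t ≡ 1 (mod 2), t ≡ 2 (mod 3), t ≡ 3 (mod 5) gives t ≡ 23 (mod 30);
# t is then the unique value in Hasse's interval with that residue.
t_mod, prod = crt([1, 2, 3], [2, 3, 5])
```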

ECDSA
ECDSA [5] is a digital signature scheme based on cyclic groups of elliptic curves defined over finite fields. Its security relies on the Elliptic Curve Discrete Logarithm Problem [38]. Details about setting up the parameters of ECDSA, generating a signature, and verifying it are included in Appendix A.

Double-Speed Barrett for ECDSA
For the setup of ECDSA, two prime numbers p and n are required: p represents the size of the finite field and n is the order of the group E(F_p), since we only consider the case when the order of this group is prime. Note that both multiplications performed in Algorithm 1 are multiplications by constants, namely k and n.
Our aim is to generate the primes p and n such that their leading bits do not have to be computed. Moreover, we want this to also happen for the associated constants k_p = ⌊2^L/p⌋ and k_n = ⌊2^L/n⌋. The idea of Algorithm 2 is that if we choose the prime p in a convenient way, then we can control the most significant bits of n.
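The effect of fixing the leading bits can be sketched as follows. This is our own hedged illustration, not Algorithm 2 itself: prime_with_fixed_top_bits simply takes the next (probable) prime above a pattern α, relying on the fact that prime gaps near 2^P are vastly smaller than 2^(P−U):

```python
import random


def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for sp in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % sp == 0:
            return n == sp
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True


def prime_with_fixed_top_bits(P: int, U: int, pattern: int) -> int:
    """Smallest (probable) prime above alpha, where alpha is the P-bit
    number whose top U bits are `pattern` and whose low bits are zero.
    With overwhelming probability the top U bits of the result survive."""
    alpha = pattern << (P - U)
    p = alpha | 1  # alpha itself is even, so start at the next odd number
    while not is_probable_prime(p):
        p += 2
    return p


p = prime_with_fixed_top_bits(64, 16, 0xC0DE)
assert p >> (64 - 16) == 0xC0DE  # the predetermined portion is preserved
```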
Input: P, the bit-length of the prime p, which has to be even and large.
Output: (p, a, b, n), the parameters needed for ECDSA.

Then p, n, k_p, and k_n satisfy the following inequalities.

Proof.
1. Using the inequality r < 2^U and Line 6 of Algorithm 2, we obtain that p < Q.
2. From Theorem 1, we obtain the bound on n. For the left side of the inequality, i.e., 2^(P−1) < n, the claim follows from the bound on p. For the right side of the inequality, the claim follows analogously.
3. Using Line 2 of Algorithm 2, we can deduce the stated bound for k_p.
4. Similarly, using Line 2 of Algorithm 2, we obtain the stated bound for k_n.

Remark 1. In Line 6 of Algorithm 2, we do not allow the distance between α and the next prime p to be too large. Additionally, we choose α in a specific way so that we can control the size of p. This implies that the probability of the first U bits of p being different from the first U bits of α is negligible. We performed 10^5 experiments with the value P = 256, and the success rate was 100%.

Example 1. This example illustrates Algorithm 2. For the values P = 256, L = 512, and U = 128, we get:

Enhancing Schoof's Algorithm Using AI
Our aim is to modify Schoof's algorithm by replacing Hasse's interval with another one containing the order, such that the width of the new interval is smaller.
In order to obtain such a result, we can use a neural network that takes as input triplets of the form (p_i/2^P, a_i/2^P, b_i/2^P) and returns as output elements ŷ_i, from which we derive n̂_i, the estimate of the actual order n_i. Here, we use the sigmoid activation function for the output layer to ensure that the output lies in the appropriate range. The labels y_i used for training the neural network are the correspondingly scaled orders n_i. The elements of the training, validation, and test sets are written in the form (p*_i/2^P, a*_i/2^P, b*_i/2^P, y*_i), where instead of *, we have the superscripts tr, v, and t, respectively. These three sets have cardinalities N^tr, N^v, and N^t, respectively.
At training time, we choose the mean squared logarithmic error as loss function, since we want the approach to work well for large primes. Let us denote by ε the average distance between the actual order and the estimate of the order, computed on the validation set. Our approach is a probabilistic one, since we need to assume that the order n satisfies the inequality n̂ − 2ε < n < n̂ + 2ε. (1) This leads to a corresponding constraint on t. Notice that in the above inequality, we have doubled ε in order to increase the probability that our assumption is true. Hence, if we manage to determine the value of t_0 ≡ t (mod 4ε), then we can find t by replacing t_0 in the formula t = (p + 1 − n̂) − 2ε + t_0, and thus we know the order of the group. The benefit of using the estimate given by the neural network is that Schoof's algorithm can be applied to an interval of width 4ε instead of one of width 4√p. This means that if the neural network is good at estimating the order, i.e., ε < √p, then this approach is faster than the standard one.
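Both quantities can be written out explicitly; the following plain-Python sketch (the function names are ours) mirrors the definitions of the training loss and of the validation-set distance ε:

```python
import math


def msle(y_true, y_pred):
    """Mean squared logarithmic error, the loss used at training time."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


def avg_distance(orders, estimates):
    """ε: average distance between the actual orders n_i and the
    estimates produced by the network, over the validation set."""
    return sum(abs(n - nh) for n, nh in zip(orders, estimates)) / len(orders)
```

If ε computed this way is much smaller than √p, the search interval (n̂ − 2ε, n̂ + 2ε) replaces Hasse's interval.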
Remark 2. Since our algorithm is probabilistic, after obtaining a value of ε that is significantly lower than √p, instead of assuming that n lies in the interval (n̂ − 2ε, n̂ + 2ε), we can assume that it lies in an interval of greater width, for example (n̂ − 4ε, n̂ + 4ε). By doing this, we are able to increase the success probability of the algorithm.
Remark 3. The difference between Schoof's algorithm and our proposed technique is that we choose the set of primes from Line 1 of Appendix B such that the product of the elements is greater than 4ε. All the steps that follow remain unchanged.
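Remark 3 amounts to shrinking the set of auxiliary primes used in Line 1. A sketch of the effect (the 32-bit prime and the assumed ε ≈ √p/100 below are purely illustrative, not outputs of our network):

```python
import math


def primes_with_product_above(bound: int):
    """Successive primes 2, 3, 5, ... until their product exceeds `bound`
    (the set chosen in Line 1 of Schoof's algorithm)."""
    primes, prod, c = [], 1, 2
    while prod <= bound:
        if all(c % q for q in primes):  # c is prime (trial division)
            primes.append(c)
            prod *= c
        c += 1
    return primes


p = (1 << 32) - 5                                        # a 32-bit prime
standard = primes_with_product_above(4 * math.isqrt(p))  # product > 4*sqrt(p)
eps = math.isqrt(p) // 100                               # hypothetical accuracy
reduced = primes_with_product_above(4 * eps)             # product > 4*eps
assert len(reduced) < len(standard)                      # fewer primes, less work
```

Fewer auxiliary primes means fewer (and smaller) division-polynomial computations, which is where the speed-up comes from.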

GitHub Implementation
We refer the reader to [39] for the source code representing the implementation of our proposed results.
Note that, for simplicity, we omitted the initial part of Algorithm 2 (i.e., Lines 1 to 6) in our implementation.

Implementation Results
We ran the code for our algorithm on a standard laptop running Ubuntu 20.04.5 LTS, with the following specifications: Intel Core i3-1005G1 with 2 cores and 8 GB of RAM. The programming language we used for implementing our algorithms was Python, and the AI library we chose was TensorFlow.

AI-Based Speed-Up
Our AI-based technique can speed up the search for an elliptic curve of prime order. This speed-up depends initially on the model architecture and then on the accuracy of the AI model. In the current section, we report our (proof of concept) results.
To achieve our proof-of-concept goal, in our implementation we initially considered primes p of length 32 bits. Thus, we generated 60,000 elliptic curves, represented as tuples (p, a, b, n), by means of Schoof's algorithm. Based on these examples, we trained, validated, and tested the neural network model we chose. This network was composed of 7 dense hidden layers, with the number of units decreasing from 512 to 8. Note that decreasing the number of units in this manner is a common technique in AI algorithms. The reason we decided to have 7 hidden layers was to obtain the best compromise in terms of error rate and code optimization (especially with respect to time complexity). We provide the reader with a graphical representation of the relationship between the number of neural network layers and the error rate of our proposed algorithm in Figure 1. Note that the error rate stabilizes at 7%, starting with the use of 7 layers. Hence, using the previously described neural network, we managed to replace Hasse's interval with another interval whose width is approximately 15% smaller than the original one. In this case, the probability that the order n satisfies Equation (1) was 93%, which was also the success rate of our probabilistic algorithm. The obtained probability was computed by counting the testing examples that satisfied Equation (1).
We provide the reader with a graphical representation of the relationship between the number of neural network layers and the reduced Hasse interval of our proposed algorithm in Figure 2. Note that the percentage by which the width of our reduced Hasse interval is smaller than the original one stabilizes right after 7 layers, at 16%. Note that the next value considered after 7 is 9, given that the AI models work better in the case of an odd number of layers. Thus, given the results presented in Figures 1 and 2, we chose to use 7 layers in our implementation as a trade-off between accuracy and time complexity. Table 1 shows that the difference between the timing of the 7-layer implementation and the 9-layer version is significant (the latter is 55% slower) and clearly supports our choice of parameters. Due to the fact that our proposed result is a particular algorithmic improvement, we compare it in terms of efficiency with the original Schoof algorithm (see Remark 3) and provide the reader with precise timings in Table 2. The average timings we report are given for 32-, 48-, and 64-bit prime numbers. Any other publicly available implementation optimization can also be applied in our case, thus further improving the timings.
Based on the results in [32], we showed that using Barrett-compatible ECDSA parameters doubles the speed of Barrett's algorithm when performing the modular reductions required for generating (Algorithm A2) and verifying (Algorithm A3) ECDSA signatures. Thus, in such an optimized ECDSA implementation, the steps involving modular reduction are performed two times faster than in a standard ECDSA implementation.

ECDSA Related Works Comparison
The authors of [13] report the fastest FPGA implementation of the ECDSA verification algorithm compared to the results already established in the literature. The majority of papers presenting hardware optimizations for blockchain applications discuss only the verification algorithm of ECDSA. Nonetheless, we are interested in optimizing the complete ECDSA scheme, as our proposed speed-ups can also be applied to the signature generation algorithm (as already stated in Section 4.2.2), not only to the verification algorithm.
Given that our FPGA implementation is work in progress, at this point our target is to make a software implementation comparison of both the signing and the verification ECDSA algorithms. Thus, we considered the fastest (lightweight) ECDSA implementation available online [40] and modified it to include our proposed optimization from Section 3.1. The average time differences we obtained after 100 runs are presented in Table 3. Note that in the implementation of [40], the modular reduction steps are performed in a straightforward manner, as opposed to our proposed double-speed Barrett optimization. Hence, the speed-up is obvious both in theory (see Section 4.2.2) and in practice (see Table 3).

Conclusions and Future Work
We briefly described Barrett's algorithm for modular reduction, Schoof's algorithm for point counting on elliptic curves, and ECDSA as an example for applying our proposed speed-ups. We presented as main results an algorithm for generating implementation-friendly ECDSA parameters and a method for finding the number of points of an elliptic curve, representing an enhancement of Schoof's algorithm. We also gave details regarding the unoptimized implementations of the previously mentioned algorithms.

Future Work
We consider that timing comparisons between our enhanced Schoof algorithm and already established implementations of SEA [41] represent an interesting idea to be tackled in the near future.
Next, a valuable idea is looking into more sophisticated AI optimizations for the mathematical computations inside Schoof's algorithm.
Another interesting research direction is the implementation of ECDSA in cryptographic hardware using our proposed optimizations, together with a complexity analysis relative to other implementations in the literature. We are currently working on such an approach using FPGA-based equipment.

Figure 1 .
Figure 1.The relationship between the number of neural network layers and the error rate of our proposed algorithm.

Figure 2 .
Figure 2. The relationship between the number of neural network layers and the reduced Hasse interval.

Table 1 .
Timing comparison between the implementation of our proposed algorithm using 7 and 9 layers, respectively.

Table 2 .
Timing comparison between the implementation of our proposed algorithm and the original Schoof algorithm.

Table 3 .
Timing comparison between a lightweight ECDSA implementation and our enhanced version of it.