On Multi-Scalar Multiplication Algorithms for Register-Constrained Environments

A basic but expensive operation in the implementations of several famous public-key cryptosystems is the computation of the multi-scalar multiplication in a certain finite additive group defined by an elliptic curve. We propose an adaptive window method for the multi-scalar multiplication, which aims to balance the computation cost and the memory cost under register-constrained environments. That is, our method can maximize the computation efficiency of multi-scalar multiplication for any small, fixed number of registers provided by electronic devices. We further demonstrate that our method is efficient when five registers are available. Our method is studied in detail in the case where it is combined with the non-adjacent form (NAF) representation and the joint sparse form (JSF) representation. One efficiency result is that our method with the proposed improved NAF n-bit representation requires 209n/432 point additions on average. To the best of our knowledge, this efficiency result is optimal compared with those of similar methods using five registers. Unlike the previous window methods, which store all possible values in the window, our method stores only those with comparatively high probabilities to reduce the number of required registers.


Introduction
The notations in Table 1 are used throughout this paper, often without further definition. Others are defined where they are first used and in Appendix A.
A basic but expensive operation in the implementations of several famous public-key schemes, for instance, the Digital Signature Algorithm (DSA) [1], the Elliptic Curve Digital Signature Algorithm (ECDSA) [2], and the Schnorr signature scheme [3], is the computation of the multi-scalar multiplication in a certain finite additive group defined by an elliptic curve or the multi-exponentiation in a certain finite multiplicative group. Moreover, many public-key protocols, such as [4][5][6][7], also require one or more of the multi-scalar multiplication/multi-exponentiation operations.
For clarity, we adopt the notation of the multi-scalar multiplication herein. Without loss of generality, all techniques discussed in this paper can also be directly applied to the computation of the multi-exponentiation. The multi-scalar multiplication can be written as follows: given two integers $x$ and $y$, and points $A \in E(F_q)$ and $B \in E(F_q)$, compute $xA + yB$. Due to the large operands, the computation of the multi-scalar multiplication requires a large number of processing steps and is thus time consuming. Since cryptographic implementations are often desired on embedded devices with little computation and memory power, a challenging problem is how to reduce the costs for the computation of the multi-scalar multiplication.

Table 1. Notations.
Symbol : Description
… : The point at infinity on $E(F_q)$
$A, B$ : Any two points on $E(F_q)$
$A + B$ : The point addition applied to $A \in E(F_q)$ and $B \in E(F_q)$
$2A$ : The point doubling applied to $A \in E(F_q)$, i.e., $A + A$
$xA$ : The scalar multiplication by an integer $x$ applied to $A \in E(F_q)$, i.e., $A + A + \cdots + A$ ($x$ times)
$(x_{n-1} x_{n-2} \cdots x_0)_2$, $(y_{n-1} y_{n-2} \cdots y_0)_2$ : The binary representations of the integers $x$ and $y$
$(x_{n-1} x_{n-2} \cdots x_0)_{SD}$, $(y_{n-1} y_{n-2} \cdots y_0)_{SD}$ : Any signed binary representations of the integers $x$ and $y$, i.e., $x_i, y_i \in \{-1, 1, 0\}$
$(x_{n-1} x_{n-2} \cdots x_0)_{NAF}$, $(y_{n-1} y_{n-2} \cdots y_0)_{NAF}$ : The non-adjacent form (NAF) representations of the integers $x$ and $y$
$\binom{x_{n-1} x_{n-2} \cdots x_0}{y_{n-1} y_{n-2} \cdots y_0}$ : The signed sequence pair of the integers $x$ and $y$, where $x_{n-1} x_{n-2} \cdots x_0$ and $y_{n-1} y_{n-2} \cdots y_0$ are, respectively, certain signed binary representations of the integers $x$ and $y$
$n$ : The bit length of the integers $x$ and $y$ using the binary representations or the signed binary representations
$\hat{1}$ : Any bit $-1$ or $1$
$w(n)$ : The performance factor of a certain multi-scalar multiplication algorithm, i.e., the number of point additions required by the algorithm
$w_{ST}(n)$ : The performance factor of Shamir's trick using the binary representation
$w_{NAF-4}(n)$ : The performance factor of Shamir's trick using the NAF representation
$w_{JSF-4}(n)$ : The performance factor of Shamir's trick using the joint sparse form (JSF) representation
$w_{BNAF-5}(n)$ : The performance factor of the 5-register adaptive window method using the NAF representation
$w_{INAF-5}(n)$ : The performance factor of the 5-register adaptive window method using the improved NAF representation
$w_{JSF-5}(n)$ : The performance factor of the 5-register adaptive window method using the JSF representation
$\Pr(EV)$ : The probability that the event $EV$ occurs
$\Pr(EV_1 \mid EV_2)$ : The conditional probability of the event $EV_1$ given the event $EV_2$

Previous Work
Obviously, we can separately compute the scalar multiplication values $xA$ and $yB$, and then add them together. Gordon [8] surveyed the key techniques for the computation of the scalar multiplication. However, since public-key cryptosystems do not require the intermediate values $xA$ and $yB$, Shamir [9] suggested a simple but efficient trick for speeding up the multi-scalar multiplication by performing the two scalar multiplications simultaneously. Figure 1 describes Algorithm Shamir's trick. More sophisticated techniques can potentially improve Step 3.2 to reduce the number of point additions. The value $w(n)$ is equal to the number of bit pairs $\binom{x_i}{y_i}$ in which at least one bit, i.e., $x_i$ or $y_i$, is nonzero. For uniformly random binary inputs, a bit pair is nonzero with probability 3/4, which implies that the performance factor is $w_{ST}(n) = 3n/4$ on average.
Based on the framework of Shamir's trick, the improved multi-scalar multiplication algorithms are divided into two categories. The first category codes the integers $x$ and $y$ so that the number of zero bit pairs, i.e., $\binom{x_i}{y_i} = \binom{0}{0}$, increases. Whereas the binary representation of an integer is unique, the signed binary representation using $-1$, $1$, and $0$ is not. Since the cost of computing the inverse of a point is negligible compared to the point addition over the elliptic curve group, the improved multi-scalar multiplication algorithms, detailed in [10][11][12][13][14][15][16][17], require only one extra register to store the value $A - B$ in Step 1 of Algorithm Shamir's trick in Figure 1. The NAF representation [18,19] is optimal for one integer. The performance factor $w_{NAF-4}(n)$ improves to $5n/9$ on average when the algorithm uses the NAF representation instead. The JSF [11] is the optimal signed binary representation for two integers. The performance factor $w_{JSF-4}(n)$ further improves to $n/2$ on average when the algorithm uses the JSF representation instead. Because the JSF is already optimal, a disadvantage of the coding approach is that its performance factor cannot be improved beyond $w_{JSF-4}(n)$.
The second category, in contrast, scans and processes a $w$-bit pair in Step 3.2 of Algorithm Shamir's trick in Figure 1, where $w$ is an integer and $w > 1$. To reduce the number of point additions, all possible values for $w$-bit pairs should be pre-computed and stored in Step 1 of Algorithm Shamir's trick in Figure 1. In practice, the wide scanning approach [20][21][22][23][24][25], including the m-ary method and the sliding window method, is always combined with the coding approach, such as the NAF representation and the JSF representation. However, the approach requires a large number of extra registers to store all possible values for $w$-bit pairs, even with a moderate $w$.
Finally, some works [26,27] are dedicated to presenting the parallel algorithms for the multi-scalar multiplication, because the chip manufacturers are increasing the number of cores inside the processors. Other works [28,29] focus on the algorithms to speed up a group of the multi-scalar multiplications under cryptosystem configurations.
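The simultaneous scan behind Shamir's trick can be sketched as follows. This is a minimal illustration and not a reproduction of Figure 1: the function name `shamir_trick`, the callbacks `add` and `dbl`, and the use of ordinary integers as a stand-in for elliptic curve points are all assumptions made for the example. It uses the plain binary representation, so only A + B is precomputed, and it counts one point addition per nonzero bit pair.

```python
def shamir_trick(x, y, A, B, add, dbl, identity):
    """Compute xA + yB by scanning the bits of x and y together.

    `add`, `dbl` and `identity` describe the group (hypothetical interface).
    Returns the result and the number of point additions used in the scan.
    """
    table = {(1, 0): A, (0, 1): B, (1, 1): add(A, B)}  # A + B is precomputed
    n = max(x.bit_length(), y.bit_length())
    acc, additions = identity, 0
    for i in range(n - 1, -1, -1):       # most significant column first
        acc = dbl(acc)                   # the first doubling acts on the identity
        column = ((x >> i) & 1, (y >> i) & 1)
        if column != (0, 0):             # zero columns cost no point addition
            acc = add(acc, table[column])
            additions += 1
    return acc, additions


# Toy group: integers under addition, so xA + yB can be checked directly.
result, additions = shamir_trick(9, 4, 5, 7,
                                 add=lambda p, q: p + q,
                                 dbl=lambda p: 2 * p,
                                 identity=0)
print(result, additions)
```

In this toy run x = 9 = (1001)_2 and y = 4 = (0100)_2, so one of the four columns is the zero pair and is skipped.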

Motivation and Contribution
As low-cost computing devices, such as smartcards and RFID tags, become ever more pervasive, new security threats are growing quickly. However, these devices cannot always provide enough computation, memory, and electric power resources to implement the standard public-key schemes. We give two examples of potential crypto-oriented devices under register-constrained environments. One example is the ATmega128, which is part of the megaAVR family from Atmel [30] and has been widely used in embedded systems, automotive environments, and sensor-node applications. The ATmega128 features 128 KB of flash memory and 4 KB of internal SRAM. Additionally, it has only 32 8-bit general-purpose registers (R0 to R31), and the 16-bit result of a multiplication is stored in the registers R0 (lower byte) and R1 (higher byte). Another example is the ARM7TDMI (ARM7 Thumb Debug Multiplier ICE) [31], which was introduced by ARM in 1994 and has been used in a wide range of applications, e.g., mobile devices produced by Nokia and Motorola, Apple's iPod, video game consoles from companies such as SEGA and Sony, routers, and automobile systems. In the standard ARM operating mode, 16 general-purpose registers (R0 to R15) are available to users. In the Thumb mode, only eight registers are available, i.e., R0 to R7, which in general limits the applicability of many cryptographic algorithms. Moreover, even if these devices become more powerful as a result of Moore's Law, manufacturers may still prefer those that are less powerful but more cost competitive. As a result, cryptographic engineers often face a situation where the number of available registers is not sufficient for the ideal cryptographic implementation of multi-scalar multiplication.
Therefore, under register-constrained environments, this paper focuses on the design and analysis of multi-scalar multiplication algorithms that can flexibly improve the computation efficiency based on the available registers. We present an adaptive window method, which codes the integers $x$ and $y$ in forms such as the NAF. Our adaptive window method can practically improve the computation efficiency of multi-scalar multiplication according to the small, fixed number of registers provided by register-constrained computing devices. To illustrate this, we give an example with five registers. To be more precise, we consider the 5-register adaptive window method using the NAF and the JSF, respectively. Additionally, the computational complexity is analyzed by modeling the scan process as a Markov chain. Furthermore, the performance factor of the adaptive window method using our improved NAF representation is $209n/432$ on average, which is slightly smaller than that obtained using the JSF representation. To the best of our knowledge, when only five registers are allowed, our method with the improved NAF representation is the most efficient one for the computation of the multi-scalar multiplication.

Adaptive Window Method
Assume that the register-constrained computing devices can provide $t$ registers for the computation of multi-scalar multiplication. Figure 2 describes Algorithm Adaptive window method. The integers $x$ and $y$ are coded by a certain signed binary representation, i.e., $x = (x_{n-1} x_{n-2} \cdots x_0)_{SD}$ and $y = (y_{n-1} y_{n-2} \cdots y_0)_{SD}$, where $x_i, y_i \in \{-1, 1, 0\}$. Let $w$ denote the window size. In Step 3 of Algorithm Adaptive window method in Figure 2, scan $x$ and $y$ in the ordinary signed binary representations from left to right for the largest bit(s) pair within the window such that the pair has a value already precomputed in Step 1 and its first 1-bit sub-pair is nonzero.
Electronics 2021, 10, x FOR PEER REVIEW 5 of 16
To compute the multi-scalar multiplication, previous window methods must store all possible values for $w$-bit pairs. However, our adaptive window method pre-computes and stores only part of the values for these pairs (see Step 1 of Algorithm Adaptive window method in Figure 2) when the available registers (whose number is denoted by $t$) are not enough. Thus, our method may spend more than one point addition on a $w$-bit pair using Steps 3.1 and 3.2 of Algorithm Adaptive window method in Figure 2. To reduce the number of point additions as much as possible, Step 1 of Algorithm Adaptive window method in Figure 2 should select the pairs with comparatively high probabilities in the signed sequence pair $\binom{x_{n-1} x_{n-2} \cdots x_0}{y_{n-1} y_{n-2} \cdots y_0}$ and store their corresponding values.
As a result, the achievement of our adaptive window method is that the computation efficiency of the multi-scalar multiplication can be flexibly improved according to the registers provided by the register-constrained computing devices. By comparison, previous window methods require a fixed number of registers determined by the window size $w$. Compared with Shamir's trick using the NAF representation or the JSF representation, Algorithm Adaptive window method in Figure 2 requires at least five registers to improve the computation efficiency of the multi-scalar multiplication. Therefore, in the next section, we provide an example of the adaptive window method. That is, a detailed design and analysis for $t = 5$ is presented with the NAF representation and the JSF representation, respectively.
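The idea of storing only the pairs with comparatively high probabilities can be explored empirically. The sketch below is an illustration under stated assumptions rather than the selection procedure of Figure 2: it encodes random integers in NAF, scans the sequence pair from left to right, takes a 2-column window at every nonzero column, and tallies which combinations uA + vB those windows would need (up to sign, since point negation is cheap). The helper names `naf` and `window_value_counts` and the scanning rule itself are hypothetical.

```python
import random
from collections import Counter


def naf(k):
    """Return the NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)      # 1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def window_value_counts(pairs):
    """Tally the values (u, v), meaning uA + vB, of 2-column windows.

    Assumed scan rule: skip zero columns; at a nonzero column take that
    column and the next one as a window. A trailing single column is ignored.
    """
    counts = Counter()
    for x, y in pairs:
        xs, ys = naf(x), naf(y)
        n = max(len(xs), len(ys))
        xs += [0] * (n - len(xs))
        ys += [0] * (n - len(ys))
        i = n - 1
        while i >= 1:
            if xs[i] == 0 and ys[i] == 0:
                i -= 1
                continue
            u = 2 * xs[i] + xs[i - 1]
            v = 2 * ys[i] + ys[i - 1]
            if u < 0 or (u == 0 and v < 0):   # normalise the sign
                u, v = -u, -v
            counts[(u, v)] += 1
            i -= 2
    return counts


rng = random.Random(1)
sample = [(rng.getrandbits(160), rng.getrandbits(160)) for _ in range(2000)]
for value, count in window_value_counts(sample).most_common(5):
    print(value, count)
```

With t registers available, a ranking of this kind indicates which few combinations are worth precomputing; the remaining windows are then handled with more than one point addition.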

Using NAF Representation
Algorithm The 5-register algorithm using the non-adjacent form (NAF) representation in Figure 3 illustrates the basic version of the adaptive window method combined with the NAF representation, when five registers are available. Due to the framework of Shamir's trick, the above 5-register algorithm still requires $n - 1$ point doublings. Thus, we only need to consider its performance factor, which directly determines the number of point additions. We have the following result.
Theorem 1. The performance factor of Algorithm The 5-register algorithm using the non-adjacent form (NAF) representation in Figure 3 is $w_{BNAF-5}(n) = n/2$ on average, when $n \to \infty$.
The proof of Theorem 1 appears in Appendices B and C. According to Theorem 1, the basic version of the above 5-register algorithm has the same performance factor as Shamir's trick coupled with the JSF representation, which needs only four registers. However, the basic version can be further improved to reduce its performance factor. We propose recoding rules A1 to A8 for the input NAF sequence pair $\binom{x_{n-1} x_{n-2} \cdots x_0}{y_{n-1} y_{n-2} \cdots y_0}$: Rules A1 to A4 rewrite 3-bit pairs, and Rules A5 to A8 rewrite 4-bit pairs. After Step 1 of Algorithm The 5-register algorithm using the non-adjacent form (NAF) representation in Figure 3, the improved 5-register algorithm converts the input sequence pair into a recoded sequence pair by applying the recoding rules from left to right. If the replacement is due to Rule A1, A2, A3, or A4, then discard the left two columns that have been replaced and consider the next three or four columns for future replacement. If the replacement is due to Rule A5, A6, A7, or A8, then discard all columns that have been replaced and consider the next three or four columns for future replacement. If no replacement is possible, then discard one column and consider the next three or four columns for future replacement. The improved version of the 5-register algorithm is identical to its basic version except for the replacement operation defined by the above recoding rules.
Theorem 2. The performance factor of the improved 5-register algorithm using the NAF representation is $w_{INAF-5}(n) = 209n/432$ on average, when $n \to \infty$.
The proof of Theorem 2 appears in Appendix D.
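To put the performance factors quoted in this paper side by side, the following worked arithmetic compares the per-bit constants $w(n)/n$. It uses only the values stated in the text ($5n/9$ for Shamir's trick with the NAF, $n/2$ for Shamir's trick with the JSF, which the basic 5-register algorithm matches, and $209n/432$ for the improved NAF); the choice $n = 160$ is merely the bit length used in the experiments.

```latex
\begin{align*}
\frac{w_{NAF-4}(n)}{n} &= \frac{5}{9} \approx 0.5556, \\
\frac{w_{JSF-4}(n)}{n} = \frac{w_{BNAF-5}(n)}{n} &= \frac{1}{2} = 0.5, \\
\frac{w_{INAF-5}(n)}{n} &= \frac{209}{432} \approx 0.4838, \\
w_{BNAF-5}(n) - w_{INAF-5}(n) &= \frac{216n}{432} - \frac{209n}{432} = \frac{7n}{432}, \\
n = 160: \quad \frac{n}{2} = 80, \qquad \frac{209n}{432} &\approx 77.4, \qquad \frac{7n}{432} \approx 2.6 .
\end{align*}
```

In other words, the recoding rules save roughly 3.2% of the point additions of the basic version (7/216 of them), or between two and three point additions for 160-bit scalars.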

Using JSF Representation
Assume that the JSF representation [11] is used for the integers $x$ and $y$ as the inputs of Algorithm Adaptive window method in Figure 2. Additionally, assume that the window size is $w = 2$ in Algorithm Adaptive window method in Figure 2. According to the properties of the JSF representation, all possible 1-bit and 2-bit pairs are $\binom{0}{0}$, $\binom{1}{0}$, $\binom{0}{1}$, … in the sequence pair $\binom{x_{n-1} x_{n-2} \cdots x_0}{y_{n-1} y_{n-2} \cdots y_0}$. Now, we can consider designing the 5-register algorithm using the JSF representation. In fact, if the NAF is replaced with the JSF, then Algorithm The 5-register algorithm using the non-adjacent form (NAF) representation in Figure 3 is a 5-register algorithm using the JSF representation. We can obtain the following result.
Theorem 3. The performance factor of the 5-register algorithm using the JSF representation is on average, when n → ∞ .
The proof of Theorem 3 appears in Appendices E and F. Unlike the NAF, no recoding rule has been found for the JSF, and thus no further improvement can be provided.

Experiments and Comparison
For performance evaluation, we have simulated the adaptive window method and other similar methods on the Visual C++ platform. Those methods include Shamir's trick using, respectively, the NAF representation and the JSF representation, and the interleaving method using the w-NAF [21,32]. Here, we only consider the multi-scalar multiplication algorithms that require at most five registers in the pre-computation process. For the multi-scalar multiplication $xA + yB$, we assume that the bit lengths of the integers $x$ and $y$ using any representation are all $n$. To compare the number of point additions in terms of bits, a performance factor constant is defined as the ratio of the performance factor $w(n)$ to the bit length $n$, i.e., $w(n)/n$. During the experiments, we randomly generate 1,000,000 pairs of 160-bit integers and calculate the performance factor constant for each method. The results are summarized in Table 2. In the 5-register case, the experiments on the adaptive window method using Ruan-Katti's representation [14] are also conducted for comparison. However, the improvement from Ruan-Katti's representation is smaller than that from the NAF and JSF representations.
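An experiment of this kind can be sketched in a few lines. The example below is a simplified stand-in for the authors' Visual C++ simulation and covers only Shamir's trick with the binary and NAF representations: it draws random pairs of 160-bit integers, counts the nonzero columns of each sequence pair, and reports the ratio of total point additions to total columns. The function names, the seed, and the reduced number of trials are assumptions made for illustration.

```python
import random


def bits(k):
    """Binary digits of k, least significant first."""
    return [(k >> i) & 1 for i in range(k.bit_length())]


def naf(k):
    """NAF digits of k >= 0, least significant first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)      # 1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def nonzero_columns(xs, ys):
    """Return (number of nonzero columns, number of columns) of a sequence pair."""
    n = max(len(xs), len(ys))
    xs = xs + [0] * (n - len(xs))
    ys = ys + [0] * (n - len(ys))
    return sum(1 for a, b in zip(xs, ys) if a or b), n


def performance_constant(encode, trials=2000, n_bits=160, seed=0):
    """Estimate w(n)/n as total nonzero columns divided by total columns."""
    rng = random.Random(seed)
    total_w = total_n = 0
    for _ in range(trials):
        x, y = rng.getrandbits(n_bits), rng.getrandbits(n_bits)
        w, n = nonzero_columns(encode(x), encode(y))
        total_w += w
        total_n += n
    return total_w / total_n


print("binary:", performance_constant(bits))
print("NAF:   ", performance_constant(naf))
```

The two estimates should land close to the asymptotic constants for uniformly random inputs, 3/4 for the binary representation and the 5/9 quoted for the NAF, with small deviations caused by the finite bit length.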
The asynchronous method [23] necessarily requires six or eight registers. When six registers are available, the asynchronous method is actually the same as the adaptive window method using the NAF representation. Additionally, the asynchronous method using eight registers is the corresponding sliding window method [21] with the window size $w = 2$. Hence, the asynchronous method can be treated as a special case of the adaptive window method. The interleaving method using the w-NAF is not directly suitable for optimizing computation efficiency based on the available registers. If all five available registers are required to be used, Shamir's trick with the 4-NAF and 1-NAF interleaving is the only choice. However, this choice makes no sense, since its performance factor constant is 8/15, which is even larger than that of Shamir's trick with the 3-NAF interleaving (see Table 2). When five registers are available, the adaptive window method using our improved NAF representation requires the fewest point additions among the known methods.
We also verify the efficiency results on a real mobile phone. We use Eclipse to edit the Java code and the C code of the five algorithms in Table 2. Additionally, JNI (Java Native Interface) is employed to realize the interaction between the C code and the Java code. The interaction process of the five algorithms' codes is shown in Figure 4. Here, the C code is responsible for operating the CPU registers of the five algorithms. We then use Android Studio to implement them on the Android 10 system. For those five algorithms, the elliptic curve group is based on NIST's P-192 curve [1]. For each algorithm, 1,000,000 multi-scalar multiplications are computed, and the average running time is taken as the final result (see Figure 5). It can be seen that the efficiency results achieved on the physical device are basically consistent with our theoretical expectation.

Future Work
Three possible directions for future improvement are as follows.
(1) The optimal signed binary representation for the adaptive window method. In practice, the improved NAF representation in Section 3.1 can achieve the minimal performance factor among all well-known representations. However, we still do not know how to find the one with the best performance factor among all signed binary representations. Hence, it remains an open problem to find the optimal one for the adaptive window method with a fixed number of registers.
(2) The on-line strategy for the adaptive window method. To compute the multi-scalar multiplication $xA + yB$, the m-ary method and the sliding window method need to pre-compute and store all possible values for the $w$-bit pairs, where the integer $w$ is the window size, and $w > 1$. However, the adaptive window method only computes and stores part of them based on the number of available registers. Thus, it would be useful to check each $w$-bit pair in the sequence pair according to the on-line input integers $x$ and $y$, and then determine the high-frequency values among all possible values in real time. Clearly, if those high-frequency values were pre-computed and stored, the adaptive window method could be further improved in practical implementations. It might be interesting to investigate this on-line strategy further.
(3) The register-constrained implementation for the adaptive window method. We use Java code linked with C code to implement several multi-scalar multiplication algorithms on a mobile phone and obtain their corresponding efficiency results. However, neither the device nor the development tool is ideal for studying a register-constrained environment. Embedded microprocessors, such as the ATmega and ARM families, programmed in assembly code, are more suitable for simulating our proposed multi-scalar multiplication algorithms and verifying their performance results. Additionally, novel implementation optimizations for our proposed algorithms may be designed for a particular embedded microprocessor. Hence, it is valuable work to further implement the adaptive window method on embedded microprocessors.

Conclusions
We have studied the cryptographic implementations of multi-scalar multiplication under register-constrained environments. In order to make the best of the available registers, our idea is not to store all possible values in the window, but only to store those with comparatively high probabilities. The computational complexity analysis and the experimental results show that the proposed adaptive window method achieves a notable gain in computation efficiency when one more register is provided. For embedded cryptographic applications, our method makes it especially convenient to balance the performance and the costs according to the computation and memory abilities of the embedded devices. We also expect that our research will inspire others to work on the fascinating algorithms of multi-scalar multiplication under resource-constrained environments.

Acknowledgments:
The authors would like to thank the editor and the reviewers for their valuable suggestions and comments.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix B. Some Facts of NAF Representation
To analyze the computational complexity of our proposed algorithms, we first need to review two well-known properties of the NAF representation [18,19] as follows.
Lemma A1. The NAF representation of an integer is unique. No two nonzero digits are adjacent in the representation.
where $\Pr(P_j \mid P_k)$, $0 \le j, k \le 7$, denotes the conditional probability of the next state $P_j$ given the current state $P_k$. Since the matrix $T^2_{BNAF-5}$ has all positive elements, this Markov chain is a regular chain. Let $\Pr(P_i)$, $0 \le i \le 7$, be the probability of the state $P_i$ as $n \to \infty$. According to the theorems of the regular Markov chain, we obtain these limiting probabilities (A16). In Figure 3, we can see that the state $P_0$ needs no point addition, and each state $P_i$, $1 \le i \le 7$, needs one point addition. Thus, on average, the asymptotic performance factor is

$$w_{BNAF-5}(n) = \frac{\sum_{i=1}^{7} \Pr(P_i)}{\Pr(P_0) + 2\Pr(P_1) + 2\Pr(P_2) + \sum_{i=3}^{7} \Pr(P_i)} \, n = \frac{n}{2}. \quad (A17)$$
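The limiting probabilities of a regular Markov chain can be obtained numerically by repeatedly applying the transition matrix. The sketch below illustrates this on a deliberately small example, not on the eight-state chain of this appendix: a two-state model of a single NAF sequence scanned from the least significant end, assuming that a nonzero digit is always followed by a zero and that a zero is followed by a nonzero digit with probability 1/2. The function name `stationary` and the toy matrix are assumptions made for illustration.

```python
def stationary(T, iters=200):
    """Approximate the limiting distribution of a regular chain by power iteration.

    T is row-stochastic: T[i][j] is the probability of moving from state i to j.
    """
    k = len(T)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[i] * T[i][j] for i in range(k)) for j in range(k)]
    return pi


# State 0: zero digit, state 1: nonzero digit (toy NAF digit model).
T = [[0.5, 0.5],
     [1.0, 0.0]]

# T has a zero entry, but T squared is strictly positive, so the chain is regular.
T2 = [[sum(T[i][m] * T[m][j] for m in range(2)) for j in range(2)] for i in range(2)]

pi = stationary(T)
print(T2)
print(pi)
```

The limiting probability of the nonzero state in this toy chain is 1/3, which matches the well-known average density of nonzero digits in the NAF; the analysis in this appendix applies the same principle to the states $P_0$ to $P_7$.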

Appendix D. Proof of Theorem 2
Proof. Consider the 3-bit or 4-bit pairs in the recoding rules. Based on the scanning process of the input NAF sequence pair $\binom{x_{n-1} x_{n-2} \cdots x_0}{y_{n-1} y_{n-2} \cdots y_0}$ in Figure 3, each 3-bit pair matched by Rule A1, A2, A3, or A4 requires two point additions before the recoding process, but only one point addition after the recoding process. For example, the 3-bit pair $\binom{1 \; 0 \; -1}{0 \; -1 \; 0}$ passes consecutively through Steps 3.3 and 3.4 of Figure 3 for the first two columns, and Steps 3.3 and 3.4 require two point additions in total. By comparison, the recoded 3-bit pair $\binom{0 \; 1 \; 1}{0 \; -1 \; 0}$ executes Steps 3.1 and 3.5 of Figure 3 for the first two columns, which requires one point addition instead. Similarly, each 4-bit pair matched by Rule A5, A6, A7, or A8 requires three point additions before the recoding process, but two point additions after the recoding process. Next, according to Rules A1 to A8, we calculate the probabilities with which those 3-bit and 4-bit pairs appear in the NAF representation; these are obtained by Lemmas A4 and A5. Consequently, it follows from Theorem 1 that the performance factor of the improved algorithm in Figure 3 can be estimated as

$$w_{INAF-5}(n) = w_{BNAF-5}(n) - w_{RNAF-5}(n),$$

where $w_{RNAF-5}(n)$ denotes the number of point additions saved by Rules A1 to A8.

Appendix E. Some Facts of JSF Representation
To analyze the 5-register algorithm using the JSF representation, we need the following important fact of the JSF representation.
Lemma A6. Assume that the scan process of the sliding window method [22] is used for the JSF sequence pair $\binom{x_{n-1} x_{n-2} \cdots x_0}{y_{n-1} y_{n-2} \cdots y_0}$, and the window size is $w = 2$. Let $\eta$ be any 2-bit pair appearing in the JSF sequence pair, that is, …
Proof. We assume that the reader is already acquainted with the results in Solinas' technical report [11], from which we recall a few important facts. For the JSF coding algorithm, the JSF coding output $\binom{x_i}{y_i}$ is a function of the internal current state $S_j$, where $0 \le i \le n - 1$ and $0 \le j \le 7$. We can further extend the detailed relations between the states and the corresponding outputs as follows:
(1) $S_0$ maps to the current output $\binom{0}{0}$;
(2) $S_1$ maps to the case where the current output is $\binom{0}{1}$ and the next output will be $\binom{0}{0}$;
(3) $S_2$ maps to the case where the current output is $\binom{0}{1}$ and the next output will be $\binom{1}{0}$;
(4) $S_3$ maps to the case where the current output is $\binom{0}{1}$ and the next output will be $\binom{1}{1}$;
(5) $S_4$ maps to the case where the current output is $\binom{1}{0}$ and the next output will be $\binom{0}{0}$;
(6) $S_5$ maps to the case where the current output is $\binom{1}{0}$ and the next output will be $\binom{0}{1}$;
(7) $S_6$ maps to the case where the current output is $\binom{1}{0}$ and the next output will be $\binom{1}{1}$;
(8) $S_7$ maps to the current output $\binom{1}{1}$.
The transition matrix of the states has the entries $\Pr(S_j \mid S_k)$, its last row being $\Pr(S_0 \mid S_7), \Pr(S_1 \mid S_7), \ldots, \Pr(S_7 \mid S_7)$ (A26). Because of the previous relations between the states and the corresponding outputs, we know …
Proof of Theorem 3. Our 5-register algorithm using the JSF representation is almost the same as Figure 3 but with the inputs $x$ and $y$ represented in the JSF. By Lemma A6, our algorithm is optimal. Because the value $2A + B$ is stored in Step 1 of Figure 3, the 2-bit pairs $\binom{1 \; 0}{0 \; 1}$ and $\binom{-1 \; 0}{0 \; -1}$ in the JSF sequence pair $\binom{x_{n-1} x_{n-2} \cdots x_0}{y_{n-1} y_{n-2} \cdots y_0}$ only require one point addition. Therefore, the corresponding performance factor is