A Compact Coprocessor for the Elliptic Curve Point Multiplication over Gaussian Integers

This work presents a new concept to implement the elliptic curve point multiplication (PM). This computation is based on a new modular arithmetic over Gaussian integer fields. Gaussian integers are a subset of the complex numbers such that the real and imaginary parts are integers. Since Gaussian integer fields are isomorphic to prime fields, this arithmetic is suitable for many elliptic curves. Representing the key by a Gaussian integer expansion is beneficial to reduce the computational complexity and the memory requirements of secure hardware implementations, which are robust against attacks. Furthermore, an area-efficient coprocessor design is proposed with an arithmetic unit that enables Montgomery modular arithmetic over Gaussian integers. The proposed architecture and the new arithmetic provide high flexibility, i.e., binary and non-binary key expansions as well as protected and unprotected PM calculations are supported. The proposed coprocessor is a competitive solution for a compact ECC processor suitable for applications in small embedded systems.


Introduction
Many resource-constrained systems, such as embedded systems, still rely on symmetric cryptography for message authentication. For digital signatures and key exchange protocols, asymmetric cryptography is required. However, software implementations of public-key algorithms such as the Rivest-Shamir-Adleman (RSA) algorithm or elliptic curve cryptography (ECC) may be too slow on small embedded systems. Such systems benefit from hardware assistance, i.e., coprocessors optimized for public-key operations. For applications in embedded systems, ECC algorithms dominate, because shorter key lengths are required [1][2][3][4][5][6].
In this work, we investigate elliptic curve cryptography over Gaussian integers. Gaussian integers are a subset of complex numbers such that the real and imaginary parts are integers. Due to the isomorphism between Gaussian integer rings and rings over ordinary integers [7,8], this principle is suitable for many cryptographic systems. It was shown in [9][10][11][12] that a significant complexity reduction can be achieved for the RSA cryptographic system using Gaussian integers. Similarly, the Rabin cryptosystem was previously considered over Gaussian integers in [13,14]. In the context of ECC, arithmetic over complex numbers was proposed to speed up the point multiplication [15][16][17][18][19]. However, to the best of our knowledge, no ECC hardware implementation over Gaussian integers exists. We propose a compact ECC processor design and demonstrate that the Gaussian integer arithmetic provides high flexibility.
The operation that dominates the execution time of ECC algorithms is the point multiplication, i.e., kP, where P is a point on the elliptic curve and k is an integer. The point multiplication is typically implemented based on the binary expansion k = ∑_{i=0}^{r−1} k_i 2^i of the integer k with binary digits k_i ∈ {0, 1}, where r is the length in bits and k_{r−1}, k_{r−2}, ..., k_0 is the binary representation of the key (the integer k). Using the Horner scheme, the point multiplication can be calculated as

kP = ∑_{i=0}^{r−1} k_i 2^i P = 2(... 2(2k_{r−1}P + k_{r−2}P) + ...) + k_0 P (1)

with the double-and-add method [17]. This method requires two different point operations, the point addition and the point doubling. Alternatively, τ-adic expansions of the integer k with a non-binary base τ were proposed to speed up the point multiplication (PM) and to improve the resistance against attacks [20][21][22][23][24][25]. This τ-adic expansion results in the point multiplication

kP = ∑_{i=0}^{l−1} κ_i τ^i P, (2)

where the digits κ_i can be elements of complex number fields such as Gaussian, Eisenstein, or Kleinian integers, depending on the base τ. Publications on hardware designs for non-binary-base expansions mainly consider implementations for extension fields [26][27][28]. In [27], a multiple-base expansion and arithmetic over Kleinian integers are used. The approach in [26,28] is based on a ternary instead of a binary expansion. The corresponding point multiplication has a run-time of r/3 point additions and r − 1 Frobenius mappings τP. This point multiplication with the ternary base is much faster than with a binary base, because the point doubling is replaced by a simple Frobenius mapping (Frobenius endomorphism) τP. For any point P(x, y), this mapping results in the point (x^2, y^2), which requires only the squaring of the field elements x and y. In [26,28], arithmetic over binary extension fields is considered, where the squaring operation is very efficient [29,30].
It was demonstrated in [22,24,25,[31][32][33][34] that non-binary expansions are beneficial to harden ECC implementations against side-channel attacks (SCA). Hardware implementations of the elliptic curve point multiplication are prone to SCA. Different attacks exist in the literature, such as timing attacks (TA), simple power analysis (SPA), differential power analysis (DPA), refined power analysis (RPA), zero-value analysis, and methods based on machine learning [17,[34][35][36][37][38][39][40].
In this work, we consider an endomorphism for prime fields [17], where the prime field is represented by Gaussian integers. Hardware implementations can benefit from arithmetic over Gaussian integers and non-binary expansion. In particular, we consider the PM for Gaussian integers, where both the key as well as τ are Gaussian integers. The material in this paper was presented in part at the Zooming Innovation in Consumer Electronics International Conference 2020 [41]. In [41] a new algorithm for the τ-adic expansion of the key k was presented. The resulting expansion increases the robustness against TA and SPA and reduces the required memory size and the computational complexity for a protected PM. In this study, we propose an ECC coprocessor design for resource-constrained systems which employs a new concept for the Montgomery modular arithmetic [42]. The proposed processor provides high flexibility, because it supports binary and τ-adic key expansions and different PM algorithms. An unprotected PM with binary key expansions can reduce the latency for applications where only a public key is employed, e.g., the verification of digital signatures according to the NIST standard [43]. On the other hand, τ-adic expansions can reduce the computational complexity and memory requirements for calculations protected against SCA. This is beneficial for applications where the secret key is used, e.g., for the elliptic curve Diffie-Hellman key exchange [17]. We demonstrate that the arithmetic over Gaussian integers can be implemented as efficiently as ordinary Montgomery modular arithmetic over prime fields. The main aim of this work is to demonstrate that Gaussian integer arithmetic is a competitive solution for compact ECC processor designs. This publication is organized as follows. In Section 2, we review the Gaussian integers as well as the ECC point multiplication for binary and τ-adic expansions. 
The resistance of the PM implementation against attacks is discussed in Section 3. The Montgomery arithmetic over Gaussian integers is considered in Section 4. It is demonstrated that the key expansion employing Gaussian integers reduces the memory requirements and computational complexity compared with other τ-adic expansions. In Section 5, we introduce the architecture of the hardware implementation that supports Gaussian integer arithmetic. We discuss area requirements and throughput results for field-programmable gate array (FPGA) implementations in Section 6. Finally, we conclude our work in Section 7.

Point Multiplication over Gaussian Integers
In this section, we briefly review the Gaussian integers and the elliptic curve point multiplication for the ordinary binary representation of the key. We demonstrate that the point multiplication over Gaussian integers with a τ-adic key expansion can reduce the computational complexity. Furthermore, we discuss suitable projective coordinates.

Gaussian Integers
Many ECC applications [5,6,32,[44][45][46][47][48] consider arithmetic operations over ordinary integer fields. On the other hand, it was previously shown that Gaussian integers are suitable for ECC applications [42]. In this section, we review some important properties of the Gaussian integers.
Gaussian integers are a subset of the complex numbers with integers as real and imaginary parts. The set of Gaussian integers is typically denoted by Z[i]. The modulo function of a Gaussian integer x is defined as

x mod π = x − [x π*/(π π*)] π, (3)

where π is the complex divisor, π* its conjugate, and [·] denotes rounding the real and the imaginary part to the nearest integer.
The set

G_p = {x mod π : x = 0, ..., p − 1, x ∈ Z} (4)

is a finite field isomorphic to a prime field GF(p) over ordinary integers [7]. Hence, these sets are suitable for ECC applications. In this case, p is the sum of two perfect squares, i.e., π = a + bi and p = ππ* = |a|^2 + |b|^2 with integers a, b. An algorithm for finding a, b for a given prime p is provided in [7]. The inverse mapping of a Gaussian integer z′ ∈ G_p to an element z of the prime field GF(p) is defined as

z = (z′ · vπ* + z′* · uπ) mod p, (5)

where z′* is the conjugate of z′, and the parameters u and v are calculated using the extended Euclidean algorithm [7]. Next, we consider an example to illustrate the modular arithmetic over Gaussian integer fields.

Example 1.
Consider the set G_17 with π = 4 + i. The elements of the Gaussian integer field according to (4) are shown in Figure 1. Please note that the black markings are relevant for Example 3 and will be discussed in Section 3.
For the ordinary integers x = 7 and y = 9, the corresponding Gaussian integers x′, y′ ∈ G_17 are x′ = x mod π = −1 − 2i and y′ = y mod π = 2i. The sum z′ = (x′ + y′) mod π = −1 can also be calculated as z′ = (x + y) mod π = 16 mod π = −1. Similarly, the product z′ = (x′ · y′) mod π = (4 − 2i) mod π = −1 + i can also be expressed as z′ = (x · y) mod π = 63 mod π = −1 + i. Any Gaussian integer z′ ∈ G_17 can be mapped to an element of the prime field z ∈ GF(17) using the parameters v = 2 + i and u = −2. For instance, for z′ = −1 + i we have z = ((−1 + i) · vπ* + (−1 − i) · uπ) mod p = 63 mod 17 = 12.

The complex multiplication of two Gaussian integers x = c + di and y = e + fi can be efficiently calculated as

x · y = v_2 − v_3 + (v_1 − v_2 − v_3) i, (6)

where v_1 = (c + d)(e + f), v_2 = ce, and v_3 = df. Hence, the complex multiplication requires three integer multiplications. We call such an integer multiplication an atomic multiplication. While the complex multiplication over Gaussian integers does not provide a significant complexity reduction in comparison with the multiplication over ordinary integers, the squaring of a Gaussian integer can be simplified to

x^2 = (c + d)(c − d) + 2cd i, (7)

where only two atomic multiplications and three additions are required [11,12].
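The modulo reduction (3) and the reduced-complexity multiplication (6) can be sketched in a few lines of Python. The helper names gauss_mod and gauss_mul are ours, and Python's complex type holds the small integer components of this example exactly.

```python
# Gaussian integer arithmetic for Example 1 (pi = 4 + i, p = 17).

def gauss_mod(x: complex, pi: complex) -> complex:
    """x mod pi: subtract the componentwise-rounded quotient times pi, cf. (3)."""
    p = (pi * pi.conjugate()).real             # norm p = a^2 + b^2
    q = x * pi.conjugate() / p                 # quotient in the complex plane
    q = complex(round(q.real), round(q.imag))  # round real and imaginary part
    return x - q * pi

def gauss_mul(x: complex, y: complex) -> complex:
    """Product with three atomic multiplications, cf. (6)."""
    c, d, e, f = x.real, x.imag, y.real, y.imag
    v1, v2, v3 = (c + d) * (e + f), c * e, d * f
    return complex(v2 - v3, v1 - v2 - v3)

pi = 4 + 1j
x1, y1 = gauss_mod(7, pi), gauss_mod(9, pi)    # 7 -> -1 - 2i, 9 -> 2i
z1 = gauss_mod(gauss_mul(x1, y1), pi)          # product in G_17: -1 + i
```

Running this reproduces the values of Example 1, e.g., z1 = −1 + i for the product of 7 and 9.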

Elliptic Curve Point Multiplication for Binary Keys
The public-key cryptosystems using elliptic curves were introduced in [1,2]. In this work, we consider the curve

y^2 = x^3 + αx + β, (8)

which is recommended for prime fields GF(p) [3,17]. The parameters α and β are constant coefficients with 4α^3 + 27β^2 ≠ 0. The pair x and y defines the coordinates of a point P(x, y) on the curve. The main operation of ECC is the PM k · P(x, y), i.e., multiplying a point P(x, y) on the elliptic curve with a scalar k. There exist different algorithms for the ECC point multiplication [17,49]. These multiplication algorithms are based on two point operations, the point addition (ADD) and the point doubling (DBL). Both operations depend on the form of the curve [17]. Let P(x_1, y_1) and Q(x_2, y_2) be two distinct points on an elliptic curve (8). Then the ADD operation R(x_3, y_3) = P(x_1, y_1) + Q(x_2, y_2) is calculated as

λ = (y_2 − y_1)/(x_2 − x_1), x_3 = λ^2 − x_1 − x_2, y_3 = λ(x_1 − x_3) − y_1. (9)

Similarly, the DBL operation R(x_3, y_3) = 2P(x_1, y_1) is defined as

λ = (3x_1^2 + α)/(2y_1), x_3 = λ^2 − 2x_1, y_3 = λ(x_1 − x_3) − y_1. (10)

The calculations in (9) and (10) are typically performed using arithmetic over prime fields. However, we consider arithmetic over Gaussian integer fields, where x_i, y_i, α ∈ G_p.
The number of required ADD and DBL operations per point multiplication depends on the field size, but also on the PM algorithm. We first consider a binary representation of the key (k_{r−1}, ..., k_1, k_0)_2, where r is the number of binary digits of the key. The subscript indicates the binary basis, i.e., k_i ∈ {0, 1}. One example for the PM is presented in Algorithm 1, which requires r − 1 DBL and at most r − 1 ADD operations per PM, where the number of ADD operations varies with the key. We focus on the worst-case complexity, because most implementations balance the number of ADD operations in order to harden the calculation against timing attacks [32,33].
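As an illustration, the double-and-add scheme of Algorithm 1 can be sketched over the prime field GF(17), which is isomorphic to the field G_17 of Example 1; the paper performs the same point operations over G_p. The function names are ours, and None represents the neutral element O.

```python
# Double-and-add (Algorithm 1 style) on y^2 = x^3 + 3x over GF(17), using the
# affine ADD (9) and DBL (10) formulas; None represents the neutral element O.
p, alpha = 17, 3

def add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                              # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + alpha) * pow(2 * y1, -1, p) % p     # DBL (10)
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p                # ADD (9)
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def point_mul(k, P):
    """Horner scheme: scan the key bits from MSB to LSB."""
    R = None
    for bit in bin(k)[2:]:
        R = add(R, R)              # DBL in every iteration
        if bit == '1':
            R = add(R, P)          # conditional ADD
    return R
```

For the curve y^2 = x^3 + 3x with generator P(1, 2), this reproduces the subgroup of order 13 used in Example 2, e.g., point_mul(13, (1, 2)) returns the neutral element.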

Elliptic Curve Point Multiplication for τ-Adic Expansion
Next, we consider Algorithm 2, which calculates the τ-adic expansion of an integer k [19]. Algorithm 2 is the ordinary division algorithm for base conversion when k and τ are both integers. In this case, the algorithm results in an expansion with l ≈ log_2(k)/log_2(τ) digits. However, we consider the expansion where k and τ are Gaussian integers. In this case, the modulo function is calculated according to (3) and the number of digits can be estimated as

l ≈ log_2(|k|)/log_2(|τ|). (11)

To see this, note that the division (z − κ_l)/τ results in a quotient which is a Gaussian integer. Hence, the absolute value of the numerator z − κ_l is diminished by a factor |τ| in each iteration.
From the estimate (11), it follows that it is advantageous to represent the key as a Gaussian integer. For any element x of a Gaussian integer ring G_n, the absolute value |x| is bounded by |x| ≤ √(n/2) [41]. Hence, we can estimate the maximum number of digits l for the τ-adic expansion of a key k ∈ G_n as

l ≈ log_2(√(n/2))/log_2(|τ|) ≈ log_2(n)/(2 log_2(|τ|)). (12)

Note that the size of the keyspace is determined by the order of the elliptic curve. For a curve of order n, we choose a Gaussian integer key from G_n or a smaller ring. The size of this ring should equal the order to retain the size of the keyspace. Consequently, using k ∈ G_n approximately halves the number of digits compared with an expansion of an ordinary integer k ∈ Z_n, which has a maximum key length l ≈ log_2(n)/log_2(|τ|).

Algorithm 2 τ-adic expansion of integers or Gaussian integers.
input: Gaussian integer k and base τ
output: digits κ_{l−1}, ..., κ_1, κ_0
1: z = k, l = 0
2: while z ≠ 0 do
3: κ_l = z mod τ
4: z = (z − κ_l)/τ
5: l = l + 1
6: end while
7: return κ_{l−1}, ..., κ_1, κ_0

In [17,50], several endomorphisms for different curves over prime fields were presented. To demonstrate that a τ-adic expansion can reduce the computational complexity, we consider an example based on the curve (8) with β = 0 and τ = 2 + i. The τ-adic expansion according to (2) results in digits κ_i ∈ {0, ±1, ±i}. For the products κ_i P(x, y), we use the endomorphism iP(x, y) = P(−x, iy) for prime fields [17,50]. Note that P(−x, iy) is a point on the curve (8) if P(x, y) is a valid point. The negation of any point −P(x, y) is P(x, −y) according to [17]. Hence, for the products κ_i P(x, y) with non-zero κ_i we have

1 · P(x, y) = P(x, y), (13)
−1 · P(x, y) = P(x, −y), (14)
i · P(x, y) = P(−x, iy), (15)
−i · P(x, y) = P(−x, −iy). (16)

Similarly, we can calculate the mapping τP as τP = (2 + i)P = 2P + iP. This product requires a DBL and an ADD operation. The operation of the PM with this expansion is summarized in Algorithm 3, where (κ_{l−1}, ..., κ_1, κ_0)_{2+i} denotes the pentavalent representation of the integer k.

Algorithm 3 Point multiplication with τ-adic expansion.
input: point P, digits (κ_{l−1}, ..., κ_1, κ_0)_τ
output: R = kP
1: R = κ_{l−1}P
2: for i = l − 2 downto 0 do
3: R = 2R
4: R = R + iR
5: if κ_i ≠ 0 then
6: R = R + κ_i P
7: end if
8: end for
9: return R

Example 2. We consider the curve y^2 = x^3 + 3x over the field G_17 from Example 1. Using the generator point P(1, 2), we obtain a subgroup of order 13, which comprises the neutral element O and twelve further points. Due to the order of the subgroup, we choose a key from G_13, i.e., k = 1 + i. Using Algorithm 2 with τ = 2 + i, we obtain the expansion (κ_1, κ_0)_τ = (1, −1)_τ. With the starting point P(1, 2), the PM from Algorithm 3 results in the point R(−2i, 1 + i).
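Algorithm 2 can be sketched directly from its description: repeatedly take the remainder modulo τ and divide by τ. The function names are ours; the result reproduces the expansion of Example 2.

```python
# Sketch of Algorithm 2: tau-adic expansion of a Gaussian integer key.

def gauss_mod(x: complex, m: complex) -> complex:
    q = x * m.conjugate() / (m * m.conjugate()).real
    return x - complex(round(q.real), round(q.imag)) * m

def tau_adic(k: complex, tau: complex) -> list:
    """Digits kappa_0, kappa_1, ... (least significant first)."""
    digits, z = [], k
    while z != 0:
        kappa = gauss_mod(z, tau)    # digit kappa_l = z mod tau
        digits.append(kappa)
        z = (z - kappa) / tau        # exact Gaussian division
    return digits

tau = 2 + 1j
digits = tau_adic(1 + 1j, tau)       # Example 2: (kappa_1, kappa_0) = (1, -1)
```

Evaluating the expansion via Horner's scheme recovers the key, i.e., ∑ κ_i τ^i = 1 + i.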
Each iteration of Algorithm 3 is more complex than a single iteration of Algorithm 1. However, the τ-adic expansion reduces the number of iterations that are required to calculate the PM. Again, let r be the number of bits of the binary key representation and l the number of digits κ_i. We have l ≈ r/log_2(5) ≈ 0.43r because of (12). Algorithm 1 requires a maximum of r − 1 iterations with one DBL and one ADD operation in each iteration. Similarly, the point multiplication based on the τ-adic expansion requires a maximum of l − 1 iterations, with one DBL and two ADD operations in each iteration. This results in l − 1 ≈ 0.43r DBL operations and at most 2(l − 1) ≈ 0.86r ADD operations. Hence, the number of both operations is reduced compared with the binary case. Similar considerations hold for the Montgomery PM [49]. A comparison of the latencies of the PM using the binary and the τ-adic expansion of the key is given in Section 6.
As proposed in [25], the non-binary expansions are beneficial to harden ECC implementations against attacks based on power analysis. The products κ i P(x, y) can be precomputed and stored in memory. This minimizes the dependencies between arithmetic operations and the key.

Projective Coordinates
ADD and DBL are calculated using modular arithmetic operations, i.e., addition, subtraction, multiplication, and inversion over prime or Gaussian integer fields. The most expensive operation is the multiplicative inverse. To speed up the PM calculation, several point representation methods were considered, such as the projective homogeneous coordinates [51] and the Jacobian coordinates [17,52]. These methods transform each point P(x, y) from affine to projective coordinates P(X, Y, Z), where the computation of ADD and DBL is applied without inversion. After the computation of a PM, the result is transformed back into affine coordinates by applying a single inversion. The number of required modular arithmetic operations depends on the representation of the projective coordinates. In this work, we consider the Jacobian coordinates [17,52], because fewer modular multiplications are required.
A drawback of the representation in projective coordinates is that many interim results have to be stored. We can reduce the number of interim values by exploiting some properties of the Jacobian coordinates. Consider the calculation of the ADD operation +P or +κ_iP in Algorithms 1 and 3. In both cases, the second argument does not change throughout the PM algorithm. Hence, the point P can be represented in affine coordinates with Z = 1 as P = P(X, Y, 1). Similarly, for κ_i = i, we have iP = P(−X, iY, 1), and so on. This enables a special ADD operation with reduced complexity [17,51,52]. In the following, this operation is denoted by SADD.
Transforming the result from projective to affine coordinates requires a single inversion. For the Jacobian coordinates, this transformation is defined as x = X/Z^2 and y = Y/Z^3 [17,52].
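A minimal sketch of this back-transformation over a prime field follows (the coprocessor applies the same formulas over G_p); the field size p = 17 and the function names are illustrative.

```python
# Jacobian -> affine back-transformation x = X/Z^2, y = Y/Z^3 with a single
# field inversion; p = 17 is an illustrative field size.
p = 17

def to_affine(X, Y, Z):
    zi = pow(Z, -1, p)               # the only inversion of the whole PM
    zi2 = zi * zi % p
    return (X * zi2 % p, Y * zi2 * zi % p)

def to_jacobian(x, y):
    return (x, y, 1)                 # affine points correspond to Z = 1
```

Any scaled representative (xλ^2, yλ^3, λ) maps back to the same affine point, which is why intermediate results can stay in projective form.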

Resistance Against Side-Channel Attacks
In Section 5, we will discuss a flexible architecture for the PM calculation that can be used with binary and non-binary key expansions and with different PM algorithms. A PM according to Algorithm 1 with a binary key expansion can be useful for applications where only a public key is employed, e.g., the verification of digital signatures according to the NIST standard [43]. On the other hand, applications where the secret key is used require protection against side-channel attacks. In this section, we discuss the resistance of the PM implementation against SCA. In particular, we consider the key expansion algorithm from [41], which improves the resistance against TA and SPA. We demonstrate that this key expansion employing Gaussian integers reduces the memory requirements and the computational complexity compared with other τ-adic expansions. Algorithm 3 is an example of such a point multiplication.
The SADD operation in Algorithm 1 is only calculated if the current bit of the binary key is k_i ≠ 0. Moreover, the DBL and SADD operations employ different numbers of arithmetic operations. Hence, the power consumption and the execution time of an iteration depend on the current bit of the secret key. Consequently, an attacker can estimate the current bit of the secret key by observing the power consumption over time. To avoid TA and SPA attacks, many publications balance the number of applied point operations using additional dummy ADD operations [22,24,[31][32][33][34].
Similar to the binary PM, the conditional SADD operation is only computed if κ_i ≠ 0 for the PM with τ-adic expansion in Algorithm 3. Note that the product κ_iP is itself a point multiplication, where the associated number of DBL and SADD operations depends on the actual value of κ_i. Hence, an attacker can infer κ_i by estimating the number of calculated DBL and SADD operations from the power consumption of the device.
Improved robustness against SPA can be achieved by precomputing all possible products κ_iP and storing these values in memory [21,22,24,31]. Hence, when processing the PM according to Algorithm 3, the corresponding product κ_iP is read from the memory. This results in fixed calculation times for the SADD operation with any product κ_iP. Nonetheless, this concept is not fully protected against TA or SPA, because an attacker can still detect all values κ_i = 0 in an expansion, for which the conditional SADD operation is skipped.
To protect the PM against such attacks, Algorithm 4 was recommended in [41] for the key expansions. This algorithm replaces all values κ i = 0 by κ i = τ. Hence, it determines a new base conversion of the key excluding all zero elements. Consequently, the SADD operation in step 6 of the PM in Algorithm 3 is calculated in each iteration independent of the value of κ i and without the need for dummy additions. This results in a constant calculation time for a PM and increases the robustness against SPA and timing attacks. It was shown in [41] that the key expansions according to Algorithm 2 and Algorithm 4 result in the same point multiplication.
Algorithm 4 SPA-resistant τ-adic expansion of a Gaussian integer (or an integer) k according to [41].
input: Gaussian integer k and base τ
output: digits κ_{l−1}, ..., κ_1, κ_0
1: z = k
2: l = 0
3: while z ≠ 0 do
4: κ_l = z mod τ
5: if κ_l = 0 then
6: κ_l = τ
7: end if
8: z = (z − κ_l)/τ
9: l = l + 1
10: end while
11: return κ_{l−1}, ..., κ_1, κ_0

Algorithm 4 is suitable for ordinary integers and Gaussian integers. However, using Gaussian integers reduces the memory requirements and the computational complexity. This complexity reduction results from the symmetry of the Gaussian integer field. We illustrate this with the following example.
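Our reading of the zero-digit replacement can be sketched as follows: a zero remainder is replaced by the digit τ, and the shared update z = (z − κ_l)/τ then decrements the quotient, since (z − τ)/τ = z/τ − 1, so the expanded value is unchanged. The function names are ours.

```python
# Sketch of the zero-digit replacement in Algorithm 4: a zero remainder is
# replaced by the digit tau, and the shared update z = (z - kappa)/tau then
# decrements the quotient, since (z - tau)/tau = z/tau - 1 (assumed reading).

def gauss_mod(x: complex, m: complex) -> complex:
    q = x * m.conjugate() / (m * m.conjugate()).real
    return x - complex(round(q.real), round(q.imag)) * m

def tau_adic_zero_free(k: complex, tau: complex) -> list:
    digits, z = [], k
    while z != 0:
        kappa = gauss_mod(z, tau)
        if kappa == 0:
            kappa = tau              # replace the zero digit by tau
        z = (z - kappa) / tau
        digits.append(kappa)
    return digits

tau = 2 + 1j
digits = tau_adic_zero_free(tau * tau, tau)   # k = 3 + 4i has a zero digit
```

For k = τ^2 = 3 + 4i, the plain expansion of Algorithm 2 is (1, 0, 0)_τ, whereas the zero-free expansion is (1, −1, τ)_τ; both represent the same key.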

Example 3.
We consider expansions with τ = 4 + i. The set of all possible digits κ_i is depicted in Figure 1. The corresponding precomputations can be obtained as follows. First, the product 2P is calculated using a DBL operation. Next, we calculate the products κ_iP for the remaining digits κ_i in the first quadrant, i.e., (1 + i)P and (1 + 2i)P, where we use two SADD operations and the endomorphism (15). The corresponding digits are depicted with filled points in Figure 1. For the digit κ_i = τ, we have to calculate τP = (4 + i)P, which requires an additional DBL and SADD operation.
Due to the symmetry of Gaussian integers [42], we can use (13) to (16) and the elements from the first quadrant to determine all products κ i P in the remaining quadrants without additional point operations. Hence, the precomputations require a total of two DBL and three SADD operations.
Similarly, the number of stored precomputed points can be reduced using Gaussian integers. For the hardware implementation, we use the sign-magnitude binary format for the real and imaginary parts of a Gaussian integer. With this representation, it is sufficient to store the value τ and all values κ i P corresponding to the κ i values from the first quadrant. These κ i values are represented with filled points in Figure 1 for τ = 4 + i. The different signs for real and imaginary parts resulting from (13) to (16) are implicitly stored as parts of the κ i values in the expansion, i.e., each κ i value contains bits addressing a value from the first quadrant and two bits for the signs of the real and imaginary parts. This representation does not increase the number of bits for the expansion, but reduces the number of stored precomputed points to (|τ| 2 − 1)/4 for the products κ i P and one point for the value τP.
We provide results for different τ values in Table 1. This table contains the number of stored points, the number of iterations l per PM, and the number of multiplications M per PM. Table 1 illustrates the number of required multiplications M for the DBL and SADD point operations with and without precomputations. Note that the parameter M denotes the number of multiplication equivalent operations. For Gaussian integers, this number is approximated based on the complex squaring (7) and multiplication (6), where the complexity of a complex squaring is about 2/3 of the complexity of a complex multiplication.
Furthermore, results from [21] are included in the lower part of Table 1 for comparison. The concept from [21] aims at reducing the required memory for the precomputed points. For instance, compare τ = 4 + i with |τ|^2 = 17 digits to the base w = 4 with 2^w = 16 digits from [21]. The proposed approach reduces the number of stored points from 8 to 5. Moreover, the number of multiplications is decreased by 30%. Finally, we note that the robustness against DPA attacks can be improved for the τ-adic expansion with τ ∈ G_p using the randomized initial point approach proposed in [21]. This approach increases the number of stored precomputed points, but still achieves a complexity reduction compared with other non-binary expansions.

Montgomery Arithmetic over Gaussian Integers
In this section, we consider the Montgomery arithmetic over Gaussian integer fields. Furthermore, we discuss some important properties of the proposed concept for hardware implementations.

Montgomery Reduction over Gaussian Integers
After each complex multiplication or squaring operation, a reduction step is required to map the result into a valid field representative. However, the modulo operation according to (3) is computationally expensive. Alternatively, the Montgomery method can be applied to determine this modulo operation efficiently [53].
Typically, the Montgomery reduction is considered in ECC applications to design arithmetic that supports arbitrary prime fields [32,33,48,54]. In this case, the Montgomery multiplication is based on arithmetic over a ring R n that is isomorphic to the prime field GF(p) = {x mod p : x = 0, . . . , p − 1, x ∈ Z}. Due to the isomorphism between prime fields GF(p) and Gaussian integer fields G p , the concept of Montgomery multiplication is applicable for Gaussian integers. This was previously considered in [11,12,42].
Similar to [42] and without loss of generality, we consider π = a + bi with a > b ≥ 1. Then we can reduce the complexity of the modulo reduction by replacing the expensive calculation x mod π with x mod R, where the integer R is a power of two that satisfies R > N = a − 1. Note that x mod R is a simple bitwise AND operation of the real and imaginary parts of x = c + di with R − 1, because R is a power of two. Each Gaussian integer x ∈ G_p is mapped to the Montgomery domain as X = xR mod π. The complex addition and subtraction in the Montgomery domain are equal to the modular arithmetic of Gaussian integers, because (X + Y) mod π = (xR + yR) mod π = (x + y)R mod π.
However, the multiplication in the Montgomery domain requires a special reduction function µ(·), such that the result µ(X · Y) = xyR mod π is again in the Montgomery domain. The Montgomery reduction function µ(X) also determines the inverse mapping, since µ(X) = xRR^{−1} mod π = x mod π. The generalization of the Montgomery reduction algorithm from ordinary integers to Gaussian integers is not trivial. The final reduction step of the algorithm uses the total order of the integers. However, such an order relation does not exist for complex numbers. Two Montgomery reduction algorithms were presented in [42]. These algorithms differ in the norm used for the final reduction step: one uses the absolute value |x| = √(|c|^2 + |d|^2) of x = c + di, and the other is based on the Manhattan weight ∥x∥ = |c| + |d|. Calculating the Manhattan weight is less complex than calculating the absolute value, because only additions are required, whereas squaring is necessary to determine |x|. To achieve a low-complexity implementation of the proposed coprocessor, we apply the algorithm based on the Manhattan weight.
Note that |x| ≤ ∥x∥ holds for any x, which can be seen by squaring both sides of the inequality. Furthermore, it was shown in [42] that

∥x∥ ≤ N (21)

holds for each Gaussian integer x ∈ G_p.
The algorithm based on the Manhattan weight calculates a result satisfying this bound, where the offset for the reduction step is determined as

α̂ = argmin_α ∥q − απ∥. (22)

This precision reduction is described in Algorithm 5, where π̂ = −π^{−1} mod R, and π^{−1} mod R can be calculated using the extended Euclidean algorithm [7].

Algorithm 5 Precision reduction for Gaussian integers using the Manhattan weight according to [42].
input: Z = XY, π̂ = −π^{−1} mod R, R = 2^l > N
output: M ≡ µ(Z) = ZR^{−1} mod π
1: t = Zπ̂ mod R // bitwise AND of real and imaginary part with R − 1
2: q = (Z + tπ) div R // shift real and imaginary part right by l
3: if ∥q∥ ≤ N then
4: M = q
5: else
6: determine α̂ according to (22)
7: M = q − α̂π
8: end if
9: return M

This algorithm is useful when many successive multiplications have to be calculated, which is the case in ECC applications. However, not every Gaussian integer satisfying the bound in (21) is necessarily an element of the set G_p. Hence, in the final calculation step, the correct representative can be determined by minimizing the absolute value as

α′ = argmin_α |q − απ|, (23)

i.e., M = q − α′π. This step is more complex, but has to be performed only once per PM.
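For illustration, Algorithm 5 can be sketched for the toy field G_17 from Example 1; the single-pass offset search and all helper names are our assumptions (a sketch, not the microcoded implementation).

```python
# Sketch of Algorithm 5 for the toy field G_17 (pi = 4 + i): N = a - 1 = 3
# and R = 4 > N. pi_hat = -pi^(-1) mod R = i, since pi * i = -1 + 4i = -1 (mod 4).
pi, R, N = 4 + 1j, 4, 3
pi_hat = 1j

def mod_R(z: complex) -> complex:
    """z mod R, i.e., a bitwise AND of both parts with R - 1."""
    return complex(int(z.real) % R, int(z.imag) % R)

def weight(z: complex) -> float:
    """Manhattan weight |c| + |d|."""
    return abs(z.real) + abs(z.imag)

def reduce_mont(Z: complex) -> complex:
    t = mod_R(Z * pi_hat)
    q = (Z + t * pi) / R              # exact division: Z + t*pi = 0 (mod R)
    if weight(q) > N:                 # offset reduction, cf. step 6
        for a in (1, -1, 1j, -1j, 1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j):
            if weight(q - a * pi) <= N:
                q -= a * pi
                break
    return q

# multiply x' = -1 - 2i and y' = 2i from Example 1 in the Montgomery domain
X, Y = -2 + 1j, 2 + 0j               # X = x'R mod pi, Y = y'R mod pi
M = reduce_mont(X * Y)               # congruent to x'y'R mod pi
res = reduce_mont(M)                 # reducing once more leaves the domain
```

Here res = −1 + i reproduces the product x′ · y′ from Example 1, and the second reduction exercises the offset branch.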

Simplifying the Reduction
The number of comparisons to calculate the offsets in (22) and (23) can be reduced based on the signs of the real and imaginary parts of the Gaussian integer q, i.e., we can first determine the corresponding quadrant and reduce the number of possible offsets [42].
For all interim results in the PM calculation, we can use the reduction based on the Manhattan weight in Algorithm 5. Note that the reduction of interim results does not necessarily require the offset that minimizes the Manhattan weight in (22). The aim of this reduction is finding a Gaussian integer x̂ that is congruent to the actual representative x and has a Manhattan weight satisfying (21). Hence, the reduction can be stopped once a value x̂ = q − απ with ∥x̂∥ ≤ N is found. We implement an iterative reduction in microcode. Hence, the latency depends on the number of reduction steps required until such a congruent x̂ is found.
To estimate the latency, a Monte Carlo simulation was conducted. Table 2 shows the results of this simulation. This table provides the percentage of the number of required reduction steps for Gaussian integers of different sizes. Each simulation considered 100,000 random values. The four columns on the right in Table 2 present the percentage for the number of required offset reductions after an arithmetic operation. Note that no reduction is required if the result q already satisfies ∥q∥ ≤ N. These results illustrate that sequential processing of the offset reduction is reasonable, because 91% of all operations satisfy ∥q∥ ≤ N and require no reduction. Approximately 4-5% of all operations require a single reduction step. Only in about 4% of all cases were two or three calculation steps required.

Bit Width for the Gaussian Integer Representation
Next, we determine the number of bits that are required to represent a Gaussian integer x ∈ G_p with p = a^2 + b^2 and π = a + ib. In particular, we show that

⌊log_2((|a| + |b|)/2)⌋ + 2 (24)

bits for the real as well as for the imaginary part are sufficient, where ⌊·⌋ denotes the floor function. In the following, Re{x} and Im{x} denote the real and the imaginary parts of the complex number x, respectively. For x ∈ G_p, we have x = x mod π. Let c and d be the real and the imaginary parts of x/π. From (3), it follows that |c| ≤ 1/2 and |d| ≤ 1/2, because the rounding in (3) leaves a remainder of at most one half in each component. With x = (c + di)(a + bi), we obtain |Re{x}| = |ca − db| ≤ (|a| + |b|)/2 and |Im{x}| = |cb + da| ≤ (|a| + |b|)/2. These bounds can be tightened to ⌊(|a| + |b|)/2⌋, because Re{x} and Im{x} are integers. With the binary logarithm of these bounds, we can determine the required number of bits, considering an extra bit in (24) for the signs.
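The bound can be checked exhaustively for the toy field G_17 from Example 1 (π = 4 + i), where (24) yields 3 sign-magnitude bits per component; the enumeration is ours.

```python
# Exhaustive check of the component bound for G_17 (pi = 4 + i): every field
# element has |Re|, |Im| <= floor((a + b)/2) = 2, i.e., it fits into
# 1 + 2 = 3 sign-magnitude bits per component, cf. (24).
import math

a, b = 4, 1
pi, p = complex(a, b), a * a + b * b

def gauss_mod(x: complex, m: complex) -> complex:
    q = x * m.conjugate() / (m * m.conjugate()).real
    return x - complex(round(q.real), round(q.imag)) * m

elems = [gauss_mod(x, pi) for x in range(p)]
bound = (a + b) // 2
bits = math.floor(math.log2((a + b) / 2)) + 2   # magnitude bits + sign bit
```

All p = 17 residues are distinct, and both components of every element stay within the bound.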

Hardware Architecture
In this section, we present an area-efficient processor architecture supporting Gaussian integer arithmetic. We propose an application-specific instruction set processor based on a Harvard architecture. This architecture aims at high flexibility, i.e., it supports different algorithms for point multiplications. With this architecture, we can demonstrate that a protected PM using Gaussian integers can be as fast as an unprotected binary PM.
The main components of the proposed Harvard architecture are depicted in Figure 2. The coprocessor is divided into three main components, which are interconnected and controlled by a control unit. The uppermost component in this block diagram is a program memory. The processor is controlled by microcode instead of a finite state machine to enable different PM algorithms. The lowermost components are a data memory, which stores the operands as well as interim and final results of the PM, and an arithmetic unit, which performs all required arithmetic operations. Since the carry chains in the arithmetic unit limit the operating frequency, this unit determines the performance of the whole coprocessor. Figure 2 includes four types of buses. The operation path, shown in black, defines the operation to be performed by the arithmetic unit. The address paths define either the register addresses for the arithmetic unit or the addresses for the data memory. For the data memory, two address paths are concatenated to obtain 6-bit addresses. Note that the path connected to address 4 of the program memory can either be used as part of a data memory address or for further specification of arithmetic instructions, depending on the operation.
The 64-bit data paths are shown in light gray. These directly connect the arithmetic unit and the data memory to omit multiplexing in the control unit. The dark gray paths are for the initialization of the memories before the processing of a PM. The data memory has to be initialized with the values required for computing the PM, such as the domain parameters α, β, the point P, and the key k, while the program memory has to be initialized with the microcode for the PM processing. As mentioned above, the calculation in projective coordinates significantly reduces the computational complexity. However, the computation with projective coordinates requires storage for many interim results. Typically, these results are stored in registers to enable fast data access. The proposed architecture stores these interim results in random-access memory (RAM) because RAM is more compact than registers. This also results in a compact control logic with very simple data paths.
The data paths connect the RAM and the arithmetic unit. All data passes through the RAM, hence no multiplexers are required to control the data flow. However, additional clock cycles are required for each operation to read and store data. On the other hand, some arithmetic operations for the PM processing are multicycle operations, where the additional cycles for data transfer are only a small overhead. Moreover, reducing the logic results in a higher operating frequency that shortens the latency for a PM.
The address of the program memory is defined by a program counter in the control unit, which is not shown in Figure 2. Hence, each instruction is available for exactly one clock cycle. For multicycle operations, such as the multiplication, a dummy operation no op is available. For branches, the program counter can be set to a value specified in the corresponding instruction. For function calls within the microcode, a register is available that stores the current value of the counter as the return address. After the function, the counter can be set to the value of this register using a return operation. Next, we specify the three components of the proposed ECC coprocessor in more detail. Table 3 presents all instructions for the proposed application-specific instruction set processor using arithmetic over Gaussian integers. In total, 11 instructions are sufficient to control all necessary operations for calculating a PM. Hence, only four bits are required to indicate the requested operation. Table 3. Definition of the instruction set.

Instruction Set Architecture
The operations read, store, branch, return, and no op are control instructions that are processed in the control unit, while add, multiply, complement, shift, rotate, and mod R are arithmetic instructions that are executed in the arithmetic unit.
The instruction read is used to read data from the RAM to the two input registers 1 and 2 in the arithmetic unit. Similarly, the store instruction saves the two output registers 4 and 5 to the RAM. Both operations require two addresses for the data memory. The data memory stores up to 64 variables, hence six address bits are required. In Table 3, these addresses are represented by the combination of two fields of three bits each.
The branch instruction is used for branches, loops, and function calls. By calling the branch instruction, we set the program counter to the address defined in the memory addr. field if a specific condition is fulfilled. Depending on the control bits, this condition can be the carry bit or the sign of the first register. For function calls, no condition is required. Instead, the current program counter value is written to a register as the return address. Using the return operation, this address can be written to the program counter. The instruction no op is used when the arithmetic unit is busy. This is required for multicycle operations.
The add instruction adds two m-bit numbers, which allows for a complex addition in two clock cycles. Depending on the control bits, this instruction can also add the absolute values to determine the Manhattan weight in one clock cycle. The complement instruction calculates the two's complement of an m-bit number. It is required to transform to or from the sign-magnitude representation used for the multiplication. Moreover, it can change the sign of a number in sign-magnitude representation by flipping the sign bit of a register. For a subtraction, the complement instruction is called before the add instruction. Subtraction could also be implemented as a combined adder-subtractor logic, requiring an additional subtraction instruction.
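The complement-then-add subtraction scheme can be modeled in a few lines of Python. The function names and the explicit bit width m are illustrative only; in the coprocessor this is realized by the complement and add instructions on the m-bit registers.

```python
def twos_complement(x, m):
    """Two's complement of an m-bit value (as the complement instruction computes)."""
    return (~x + 1) & ((1 << m) - 1)

def subtract(a, b, m):
    """a - b modulo 2^m, realized as a complement followed by an add."""
    return (a + twos_complement(b, m)) & ((1 << m) - 1)
```

This mirrors the design choice in the text: reusing the adder with a complement step avoids a dedicated adder-subtractor and an extra instruction.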
The multiply instruction performs an m/2-bit multiplication, which requires four clock cycles. The rotate instruction is used to shift the last two registers by m/2 bits, which allows for a faster m-bit multiplication using accumulation on these two registers. Each m-bit multiplication requires four m/2-bit multiplications and some additions, i.e., for two m-bit numbers a, b we have a · b = a1 b1 2^m + (a1 b0 + a0 b1) 2^(m/2) + a0 b0, (26) where a0 and b0 denote the least significant half-words, and a1 and b1 the most significant half-words of a and b, respectively. The term (a1 b0 + a0 b1) could be written as (a1 + a0)(b1 + b0) − a1 b1 − a0 b0, which would reduce the number of multiplications to three. On the other hand, this would require subtraction and a representation of negative numbers in the half-words. To avoid this, we implement the multiplication according to (26). To allow for a complex multiplication, three m-bit multiplications and some additions and subtractions are required according to (6). The shift instruction is required for the division by R and can right-shift the last two registers by either one or eight bits, depending on the control bits. For the modulo reduction, the mod R instruction is used to perform a bit-wise AND of the first and the last register. While the modulo R operation could be implemented by truncating the result, we require the ability to use an arbitrary R. Hence, we can load R − 1 into the first register and call the mod R operation to calculate the last register modulo R.
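A minimal Python model of the half-word decomposition in (26) is shown below; the hardware accumulates the four partial products with the rotate instruction rather than explicit shifts, and the bit width used here is for illustration only.

```python
def mul_halfwords(a, b, m):
    """m-bit product from four m/2-bit multiplications (cf. (26))."""
    h = m // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h      # least / most significant half-words of a
    b0, b1 = b & mask, b >> h      # least / most significant half-words of b
    # four partial products, combined with the shifts from (26)
    return (a1 * b1 << m) + ((a1 * b0 + a0 * b1) << h) + a0 * b0
```

Note that all four partial products are non-negative, which is exactly why the design prefers this decomposition over the three-multiplication Karatsuba variant.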

Data Memory
We employ dual-port RAM (DP-RAM) to store variables and intermediate results, which enables accessing the real and imaginary parts of each Gaussian integer simultaneously. Alternatively, two smaller single-port memories could be used in parallel. However, this would result in more complicated microcode. We store the results of complex additions and subtractions before the reduction in RAM. Hence, an additional carry bit is necessary, which increases the required number of bits for the real and imaginary parts of Gaussian integers and consequently the word width of the RAM compared with (24) for the whole design. For embedded systems, where the proposed architecture serves as a coprocessor to a primary processor, this DP-RAM may be shared with the primary processor. Note that a smaller word width increases the likelihood that this DP-RAM can be shared, because small embedded systems typically do not provide memories with large word widths. Consequently, using Gaussian integers increases the likelihood that the RAM can be shared, due to the reduced word width of m ≈ r/2.

Arithmetic Unit for Gaussian Integer Fields
The arithmetic unit consists of five registers (reg.) of length m, an adder, and a multiplier (MUL), as shown in Figure 2. The adder is used for the instructions add, multiply, and complement to minimize the area requirements.
The multiplication of two unsigned m/2-bit integers is implemented using the half-word decomposition (cf. (26)), as depicted in Figure 3. The operands are stored in the first two registers, which are connected to the multipliers. We use four multipliers with a width of m/8, which achieves a good trade-off between area requirements and latency. Register 2 is shifted by m/8 bits in each iteration to perform an m/2-bit multiplication in four clock cycles. The partial results of the four multipliers are stored in register 3. These results are then added to registers 4 and 5, which are used as accumulator registers. As shown in Figure 3, registers 4 and 5 are applied to the adder with an m/8-bit shift, corresponding to the m/8-bit shift of the input register. After an m/2-bit multiplication, the accumulator registers can be shifted by m/2 bits using the rotate instruction. This allows for accumulation of the next m/2-bit multiplication.
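The shift-and-accumulate scheme of Figure 3 can be sketched as follows. This Python model makes assumptions about the exact chunking; it shows how four m/8-bit multipliers can complete an m/2-bit product in four cycles, with one operand held in place and the other consumed one m/8-bit chunk per cycle.

```python
def mul_shift_accumulate(a, b, m):
    """m/2-bit product using four m/8-bit multipliers over four clock cycles."""
    w = m // 8                                   # width of each hardware multiplier
    mask = (1 << w) - 1
    # operand a (register 1) is split once into four m/8-bit chunks
    a_chunks = [(a >> (i * w)) & mask for i in range(4)]
    acc = 0
    for cycle in range(4):                       # register 2 shifted by m/8 bits per cycle
        b_chunk = (b >> (cycle * w)) & mask
        # the four multipliers process all chunks of a in parallel
        partial = sum(a_chunks[i] * b_chunk << (i * w) for i in range(4))
        acc += partial << (cycle * w)            # accumulate into registers 4 and 5
    return acc
```

Summing the partial products with the correct shifts reproduces the full product, since every chunk pair (i, j) contributes a_i · b_j · 2^((i+j)·m/8).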

Results and Discussion
In this section, we present implementation results for the proposed processor architecture for the Xilinx Virtex 7 FPGA. To the best of our knowledge, no ECC hardware implementation over Gaussian integers exists. Hence, we compare these results with implementations over prime fields from the literature. This comparison illustrates that the proposed architecture is a competitive solution for a compact ECC processor. More importantly, we compare the latency of a PM with binary keys, according to Algorithm 1, with a PM using τ-adic key expansions, according to Algorithm 3, to demonstrate that a protected PM using Gaussian integers can be as fast as an unprotected binary PM or even faster.
All results are presented in Table 4, which is divided into two parts. In the upper part, the results for the proposed architecture with different bit lengths are summarized. The lower part contains results from the literature for comparison. The parameter r denotes the maximum key length in bits. The hardware requirements are represented by the number of look-up tables (LUT), flip-flops (FF), slices, and DSP units, as well as the RAM size. The maximum clock frequency is denoted by f clk . Note that the number of inputs per LUT as well as the number of LUTs and registers per slice vary for different FPGA devices. However, we can compare the total number of registers and memory size.
The proposed architectures were synthesized with and without the FPGA's DSP units. However, the architectures were not optimized for the provided DSP units. The number of flip-flops is dominated by the five registers of the arithmetic unit which have a size of m bits. The RAM size is the sum of the data and program memory sizes. A data memory with 38 words of m bits and a program memory with 950 words of length 16 bits are sufficient. Hence, the RAM requirements are dominated by the program memory, since the arithmetic unit performs only very simple operations. Multiple instructions are required to perform a complete arithmetic operation over complex numbers.
The proposed architecture is suitable for binary key representations and τ-adic expansions. Hence, the latency values are provided for both representations, where we use τ = 2 + i and τ = 4 + i for the τ-adic expansion. Similarly, we consider two different PM algorithms. One algorithm is protected against timing and SPA attacks by balancing the number of ADD and DBL operations. The latency for this algorithm is denoted by protected in Table 4. The results denoted by unprotected correspond to the PM with the minimum number of ADD and DBL operations. To evaluate the proposed architecture, we compare the implementation results with other designs that employ the Montgomery modular arithmetic over ordinary integer fields GF(p) [33,55]. The design in [33] supports primes up to 521 bits. It requires a smaller number of LUT, but more flip-flops and a larger memory due to the larger key. For longer keys, the design in [33] results in higher latency. Consider, for example, the key length r = 384. The protected PM from [33] requires 55.87 ms. The proposed design without DSP units requires only 17.87 ms for the protected PM with τ = 2 + i and 9.9 ms for τ = 4 + i.
The design in [55] considers key lengths up to r = 256 and the unprotected PM. This design is optimized for the use of DSP units, which reduces the number of LUT. The number of LUT is similar to the proposed design with key length r = 253 using DSP units. However, it employs many more flip-flops and much more memory. The proposed architecture for r = 253 with 4 DSP units is faster for all considered PM calculations.
The implementation results demonstrate that the latency of a protected PM with (2 + i)-adic expansion is similar to the latency of an unprotected binary PM. The latency of the protected τ-adic expansion is only 10% higher than the latency of the unprotected binary PM. On the other hand, the protected binary PM has latency values that are 25% larger than the results for the protected τ-adic expansion. Note that the latency with τ-adic expansions can be further reduced with other bases τ, i.e., the choice of τ enables a trade-off between speed and memory size. According to Table 1, the base τ = 4 + i with |τ|² = 17 reduces the computational complexity by 29% compared with τ = 2 + i, at the cost of memory for three additional precomputed points. For instance, with τ = 4 + i and r = 253, the implementation with DSP units achieves a latency of 5.58 ms for the protected PM and 5.5 ms for the unprotected PM, respectively. The memory size increases from 18.5 kbits to 20.4 kbits. This PM is significantly faster than the results from [33,55]. In [32], an implementation for the Virtex 7 FPGA is reported that achieves a latency of only 1.96 ms for a protected PM with r = 256. However, this implementation employs a much higher parallelization degree using 20 DSP units, i.e., five times more DSP units than the proposed design.

Conclusions
In this work, we have presented a new concept to implement the elliptic curve point multiplication. This computation is based on a new modular arithmetic over Gaussian integer fields [42]. The proposed coprocessor is flexible, i.e., it supports different point multiplication algorithms over prime fields. The implementation results illustrate that the proposed concept is a competitive solution for a compact ECC processor for applications in small embedded systems.
Moreover, the proposed architecture supports different ECC key representations with binary and τ-adic expansions. We have demonstrated that it is beneficial to represent the key and the base τ as Gaussian integers. This representation speeds up the PM computation. Furthermore, such τ-adic expansions improve the robustness against TA and SPA attacks. The randomized initial point method from [56] can be applied to further strengthen the robustness against DPA and RPA attacks. We have demonstrated that the Gaussian integer representation reduces the number of stored points and the computational complexity of such a protected point multiplication compared with other non-binary key representations. Gaussian integer fields can only be constructed for primes of the form p mod 4 = 1, hence a generalization of this work to Eisenstein integers could be beneficial. Eisenstein integers are complex numbers of the form x = a + bω, where ω = (1 + √−3)/2, and Eisenstein integer fields can be constructed for primes of the form p mod 6 = 1. An elliptic curve point multiplication using Eisenstein integers was considered in [57], showing similar properties as the PM over Gaussian integers. However, up to now, no efficient modulo operation for Eisenstein integers is known. We believe that a generalization of the Montgomery modular multiplication to Eisenstein integers would be a promising direction for further research.