On the Performance and Security of Multiplication in GF(2^N)

Multiplications in GF(2^N) can be securely optimized for cryptographic applications when the integer N is small and does not match machine words (i.e., N < 32). In this paper, we present a set of optimizations applied to DAGS, a code-based post-quantum cryptographic algorithm and one of the submissions to the National Institute of Standards and Technology's (NIST) Post-Quantum Cryptography (PQC) standardization call.


Introduction
Arithmetic in GF(2^N) is very attractive since addition is carry-less. This is why it is adopted in many cryptographic algorithms, which are thus efficient both in hardware (no carry means no long delays) and in software implementations.
In this article, we focus on software multiplication in GF(2^N), and more specifically for small N. When N is smaller than a machine word size (that is, N < 32 or 64, on typical smartphones or desktops), all known window-based computational optimizations become irrelevant.
Our goal is to compute fast multiplications (since sums are trivially executed with the native XOR operation of computer instruction sets) that are secure with respect to cache-timing attacks. Therefore, we look for regular multiplication algorithms, that is, algorithms whose control flow does not depend on the data. Our method is not to come up with novel algorithms for multiplication, but to organize the computations in such a way that the resources of the computer are utilized optimally. Our contribution is thus to explore how to load the machine in the most efficient way while remaining regular. We leverage the fact that regular algorithms can be executed in SIMD (Single Instruction Multiple Data) fashion; hence, they are natural candidates for bitslicing or similar types of parallel processing of packed operations. We compare these operations with those which are insecure and those which resort to special instructions (such as the Intel Carry-Less MULtiplication instruction, PCLMULQDQ). We conclude that the [...]

[...] (two in the log table, one in the antilog table, or vice versa). However, this creates a data-dependent table access, and the operands could potentially be recovered using standard side-channel attacks such as FLUSH+RELOAD [11]. For big extensions, computing these tables is too costly and the implementation of operations in GF(2^N) is handled differently. However, the multiplications are then executed by taking data-dependent branches; which branch is taken could be recovered via similar attack techniques, thereby revealing once again the operands of the multiplication.

We checked for these kinds of potential leaks using a static analysis tool [12] specifically developed for tracking microarchitectural side-channels, including data-dependent table accesses and data-dependent branches. This tool requires the user to specify which variables are sensitive, such as the secret key or the randomness used during signature or encryption. It then performs a dependency analysis and determines whether any variable depending on sensitive values is used as an index for a table access or as the condition variable of a branching operation (If, While, Switch or the stop condition in a For loop).

Tower Fields Representation
Depending on the choice of the basis B, the elements of GF(2^N) can be defined differently. If N is the product of two integers ℓ and m, then GF(2^N) can be defined over GF(2^ℓ). In the rest of the paper, we call GF((2^ℓ)^m) a composite field and GF(2^ℓ) the ground field. Note that GF((2^ℓ)^m) and GF(2^N) refer to the same field although their representation methods are different.
Given two representations of the finite field GF(2^N), it is possible to map one to the other, thanks to a conversion matrix. The first representation is GF(2^N) as an extension of GF(2) and the second representation is GF(2^N) as an extension of GF(2^ℓ), where N = ℓm for ℓ, m ∈ ℕ. Here, the elements of GF(2^N) are polynomials of degree at most N − 1 whose coefficients are in GF(2) = {0, 1}, and the elements of GF((2^ℓ)^m) are polynomials of degree at most m − 1 whose coefficients are in GF(2^ℓ). Hence, we write in Kronecker style

$$\mathrm{GF}(2^\ell) = \mathrm{GF}(2)[x]/P(x), \qquad \mathrm{GF}((2^\ell)^m) = \mathrm{GF}(2^\ell)[x]/Q(x), \qquad \mathrm{GF}(2^N) = \mathrm{GF}(2)[x]/D(x),$$

where P, Q and D are the respective irreducible generating polynomials and where γ, β and α are their respective roots. Thus, the elements of GF(2^ℓ), GF((2^ℓ)^m) and GF(2^N) are the residue classes modulo their respective irreducible polynomials. Such polynomials always exist [13]. In general, the number of irreducible polynomials of degree N with coefficients in GF(q) is given by

$$\frac{1}{N} \sum_{k \mid N} \mu(k)\, q^{N/k},$$

where µ(k) is the Möbius function defined by

$$\mu(k) = \begin{cases} 1 & \text{if } k = 1, \\ (-1)^j & \text{if } k \text{ is the product of } j \text{ distinct primes}, \\ 0 & \text{if } k \text{ is divisible by the square of a prime}. \end{cases}$$

For instance, the number of irreducible polynomials of degree 12 over GF(2), each of which generates GF(2^12) (the field used in DAGS), is (2^12 − 2^6 − 2^4 + 2^2)/12 = 335, and thus multiple representations can be derived for the same element in the field.
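The count of irreducible polynomials can be checked numerically with the classical formula (1/N) Σ_{k|N} µ(k) q^{N/k}. The following C sketch is only an illustration; the function names `mobius` and `irreducible_count` are ours and do not come from the DAGS reference code.

```c
#include <stdint.h>

/* Moebius function: 0 if k has a squared prime factor,
   otherwise (-1)^(number of distinct prime factors). */
static int mobius(unsigned k)
{
    int mu = 1;
    for (unsigned p = 2; p * p <= k; p++) {
        if (k % p == 0) {
            k /= p;
            if (k % p == 0)
                return 0;          /* p^2 divides the original k */
            mu = -mu;
        }
    }
    if (k > 1)                     /* one remaining prime factor */
        mu = -mu;
    return mu;
}

/* Number of monic irreducible polynomials of degree n over GF(q). */
static int64_t irreducible_count(unsigned q, unsigned n)
{
    int64_t sum = 0;
    for (unsigned d = 1; d <= n; d++) {
        if (n % d != 0)
            continue;
        int64_t pw = 1;            /* pw = q^(n/d) */
        for (unsigned e = 0; e < n / d; e++)
            pw *= q;
        sum += mobius(d) * pw;
    }
    return sum / n;
}
```

With q = 2, the call `irreducible_count(2, 12)` counts the candidate generating polynomials for GF(2^12), and `irreducible_count(2, 6)` those for the ground field GF(2^6).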

Composite Fields and Fields Mapping
Since GF((2^ℓ)^m) and GF(2^N) refer to the same field, they are isomorphic [13]. However, although the two representations are isomorphic, the algorithmic complexity of their field operations may differ, depending on the polynomials Q and D. A binary N × N matrix T can be derived to map elements of GF(2^N) to elements of GF((2^ℓ)^m). The inverse of T, denoted T^{−1}, performs the mapping the other way around. The conversion problem was addressed by Paar in [1]. In this work, conversion matrices are derived from GF(2^N) and GF((2^ℓ)^m) that are already fixed by their generating polynomials. The construction is based on finding a relation between the primitive elements γ and α such that
• α^r = γ is known for some integer r, and
• D(α^r) ≡ 0 (mod P, Q).
Since there is no established mathematical connection between α and γ, an exhaustive search is needed. In a related work [2], Sunar et al. redefined the problem. Instead of finding a conversion matrix, the paper proposes to construct the composite field given the field of characteristic two. Here, we recall the results and examine the problem in a slightly different way: our aim is to find a suitable isomorphic representation to construct the field of characteristic two given the ground and extension fields.
Theorem 1. For β ∈ GF(2^{ℓm}) and γ = β^r where r = (2^{ℓm} − 1)/(2^ℓ − 1), if β is a primitive element, then γ is primitive in GF(2^ℓ).
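Let GF(2^ℓ) = GF(2)[γ]/P be the ground field. The extension field GF((2^ℓ)^m) can be constructed using the polynomial Q, whose roots are β and its conjugates over GF(2^ℓ):

$$Q(x) = \prod_{i=0}^{m-1} \left(x - \beta^{2^{\ell i}}\right) = \sum_{k=0}^{m} s_k x^k, \qquad s_k \in \mathrm{GF}(2^\ell).$$

Noting that s_m = 1 and using Vieta's formulas, we obtain

$$s_{m-1} = -\sum_{i=0}^{m-1} \beta^{2^{\ell i}}, \qquad \ldots, \qquad s_0 = (-1)^m \prod_{i=0}^{m-1} \beta^{2^{\ell i}},$$

or, equivalently, the (m − k)-th coefficient s_{m−k} is related to a signed sum of all possible subproducts of roots, taken k at a time:

$$s_{m-k} = (-1)^k \sum_{0 \le i_1 < \cdots < i_k \le m-1} \beta^{2^{\ell i_1}} \cdots \beta^{2^{\ell i_k}}.$$

Thus, given a ground field GF(2^ℓ) and its extension field GF((2^ℓ)^m), we are looking for a field GF(2^N) with primitive element α such that:
• P(α^r) = 0 where r = (2^{ℓm} − 1)/(2^ℓ − 1), hence α^r = γ.
Once we find the suitable representation of GF(2^N), we derive the conversion matrix as follows. Any element A in GF(2^N) has two representations:

$$A = \sum_{i=0}^{N-1} a_i \alpha^i, \quad a_i \in \mathrm{GF}(2), \qquad \text{and} \qquad A = \sum_{j=0}^{m-1} a'_j \beta^j, \quad a'_j \in \mathrm{GF}(2^\ell).$$

We showed that our construction allows β^r to be primitive in GF(2^ℓ). Thus, {1, β^r, β^{2r}, . . . , β^{(ℓ−1)r}} is a basis of GF(2^ℓ) over GF(2) and we can write

$$a'_j = \sum_{i=0}^{\ell-1} a'_{ji}\, \beta^{ri}, \qquad a'_{ji} \in \mathrm{GF}(2).$$

Then, the terms β^{ri+j} are reduced using the generating polynomial, which expresses each of them as a binary combination of 1, β, . . . , β^{N−1}. These are the coefficients of the conversion matrix. Collecting them yields the N × N conversion matrix T with coefficients in GF(2). The conversion matrix from the field to the composite field is then T^{−1}.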
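Once a conversion matrix is known, applying it is a matrix-vector product over GF(2). The sketch below shows one possible way to do this for N = 12 without branching on the element being converted. The entries of T are not reproduced here, so the array `cols` is a placeholder to be filled with the derived matrix, and the name `gf_convert` is hypothetical.

```c
#include <stdint.h>

#define GF_N 12   /* N = 12, as in GF(2^12) */

/* Apply a binary N x N matrix to the coordinate vector v.
   cols[j] holds column j of the matrix packed as a bit vector, so the
   product is the XOR of the columns selected by the set bits of v.
   Each column is masked instead of being selected by a branch, so the
   instruction and memory access pattern does not depend on v. */
static uint16_t gf_convert(const uint16_t cols[GF_N], uint16_t v)
{
    uint16_t r = 0;
    for (int j = 0; j < GF_N; j++) {
        uint16_t mask = (uint16_t)(0u - ((v >> j) & 1u)); /* 0x0000 or 0xFFFF */
        r ^= cols[j] & mask;
    }
    return r;
}
```

The same routine serves for both directions of the mapping: it is called with the columns of T for one direction and with those of T^{−1} for the other.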

Results and Discussion
We start by presenting state-of-the-art multiplication algorithms in GF(2^N) for small values of N, i.e., when N is smaller than the machine word. Those algorithms are insecure. Then, we present variants that are secure with respect to cache-timing attacks.

Multiplication in GF(2^N)
Fast implementation techniques for GF(2^N) arithmetic have been studied intensively by researchers. Among the various arithmetic operations, GF(2^N) multiplication has gained the most attention.
For small N, the multiplication is usually carried out using look-up tables, called log-antilog tables. The table initialization algorithms and the derived tabulated multiplication are given in Algorithm 1 and Algorithm 2, and as C code in Algorithms A1 and A2 in Appendix A. This method (Algorithm A3 in Appendix A) has a vulnerability related to testing for zero: 0 has no logarithm, so it is always mapped to −1, a value that is never used, and zero operands must be handled by a data-dependent test. Furthermore, this method is in any case vulnerable to cache-timing attacks such as PRIME+PROBE [14] and FLUSH+RELOAD [11], which targets the last-level cache. In a cryptographic scheme where critical operations such as key generation or encryption use log-antilog multiplication over GF(2^N), the difference in memory access time to the lookup tables caused by the cache may leak information about the secret key. In a PRIME+PROBE attack, the attacker first completely fills the cache with his own data. The critical operation is run, and, as it is running, the parts of the tables that it uses are loaded from main memory into the cache. Since the cache is full of the attacker's data, some of it has to be evicted to make room. Once the operation is done, the attacker analyses which parts of his data have been evicted; this tells him which table indexes were used, leaking information about the secret key.
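To make the table-based method concrete, here is a minimal C sketch of log-antilog multiplication for N = 6 with the generator polynomial x^6 + x + 1 (the DAGS ground field). It is not the code of Appendix A: the identifiers are hypothetical, and the routine is deliberately the insecure variant, with a test for zero and three table lookups indexed by the operands.

```c
#include <stdint.h>

#define GF_BITS  6                      /* N = 6                         */
#define GF_ORDER (1u << GF_BITS)        /* 64 field elements             */
#define GF_POLY  0x43u                  /* x^6 + x + 1                   */

static uint8_t gf_antilog[GF_ORDER - 1];  /* gf_antilog[i] = x^i         */
static int8_t  gf_log[GF_ORDER];          /* gf_log[x^i] = i, log[0] = -1 */

/* Fill both tables by repeatedly multiplying by the generator x. */
static void gf_tables_init(void)
{
    unsigned a = 1;
    gf_log[0] = -1;                     /* 0 has no discrete logarithm   */
    for (unsigned i = 0; i < GF_ORDER - 1; i++) {
        gf_antilog[i] = (uint8_t)a;
        gf_log[a] = (int8_t)i;
        a <<= 1;                        /* multiply by x                 */
        if (a & GF_ORDER)               /* degree reached 6: reduce      */
            a ^= GF_POLY;
    }
}

/* Tabulated product. NOT constant time: the zero test is a
   data-dependent branch and the three lookups are data-dependent. */
static unsigned gf_mul_table(unsigned a, unsigned b)
{
    if (a == 0 || b == 0)
        return 0;
    return gf_antilog[(gf_log[a] + gf_log[b]) % (int)(GF_ORDER - 1)];
}
```

`gf_tables_init()` has to be called once before the first call to `gf_mul_table()`. The two reads of `gf_log` and the read of `gf_antilog` are the accesses an attacker observing the cache would try to locate.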

Algorithm 1 Initialization of the antilog table
Require: The finite field GF(2^n) and its generator polynomial P.
Ensure: The antilog table.

The implementation of the tabulated method may sometimes be too costly for large N, and the multiplication is then handled differently. One may use tower field arithmetic and store lookup tables for GF(2^ℓ), where ℓ is taken small and such that ℓ | N (Algorithm A4 in Appendix A). Tower field arithmetic is slow, but we can perform conversions both ways with respect to the theory presented in Section 2 (Algorithm 3).
A straightforward alternative method is iterative multiplication. This is done by performing polynomial multiplication and conditional reduction modulo the generator polynomial (Algorithm 4, and Algorithm A5 in Appendix A as C code). Iterative methods written this way take data-dependent branches and are thus vulnerable to branch prediction analysis [15,16]. In fact, the information leakage is based on the timing differences produced by the branch prediction unit (BPU).

Algorithm 4 Iterative multiplication with conditional reduction
Require: Two polynomials X = {x_i}, Y = {y_i} of order at most n and a reduction polynomial P of order n
Ensure: Polynomial R = X·Y = {r_i} of order n
1: R ← 0
2: for i in range(n) do
3:   if y_i = 1 then
4:     add X shifted by i positions to R
5:   if order(R) > n then
6:     reduce R by P (polynomial division)
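A possible C rendering of the iterative method with conditional reduction is sketched below. It is an illustration rather than the code of Algorithm A5: the name `gf_mul_branchy` is hypothetical, and the sketch reduces the running multiple of X instead of the accumulator R, an equivalent arrangement that keeps every intermediate value on n bits. The parameter `poly` is assumed to include the leading term, e.g., 0x43 for x^6 + x + 1.

```c
#include <stdint.h>

/* Shift-and-add product in GF(2^n).
   NOT constant time: both `if` statements branch on operand bits. */
static uint32_t gf_mul_branchy(uint32_t x, uint32_t y, unsigned n, uint32_t poly)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < n; i++) {
        if ((y >> i) & 1u)      /* branch on bit i of the operand y       */
            r ^= x;             /* add the current multiple of x          */
        x <<= 1;                /* next multiple of x                     */
        if ((x >> n) & 1u)      /* degree reached n: conditional reduce   */
            x ^= poly;
    }
    return r;
}
```

For example, `gf_mul_branchy(a, b, 6, 0x43)` multiplies two elements of GF(2^6). The two `if` statements are the data-dependent branches discussed in the text.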

Secure Computation in GF(2^N)
There are many countermeasures to prevent cache-timing attacks, but they may affect the computation performance. Here, we propose a trade-off between secure and fast GF(2^N) computation. One countermeasure is the constant-time implementation, where the execution time does not depend on the secret key or input data. In the case of tabulated multiplication, this cannot be achieved, but, in the case of iterative methods, we replace conditional reduction by unconditional reduction. This means that the reduction modulo the generator polynomial is performed at each iteration and therefore no timing information is leaked (Algorithm 5, and Algorithm A6 in Appendix A as C code).

Basically, this code adds the i-th shifted copy of X to the result if y_i is one (where X and Y are the polynomials to multiply). Note that, because of the two's complement representation, −0 is 0000000000000000 over 16 bits and −1 is 1111111111111111. We can thus use −y_i as a mask in the first branch. We follow the same idea in the second branch, shifting the condition by ℓ.

Another constant-time countermeasure is the bitsliced implementation [17,18], which uses SIMD-style processing to perform the same operation on multiple data points simultaneously. In fact, we convert 64 N-bit words into N 64-bit words and multiply the 64-bit words in a single pass (Algorithm 6, and Algorithm A7 in Appendix A as C code).
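The masking idea can be sketched in C as follows. This is a minimal illustration under our own naming (`gf_mul_ct` is hypothetical and is not Algorithm A6); `poly` is assumed to contain the leading term of the generator polynomial, e.g., 0x43 for x^6 + x + 1. Both former branches are replaced by an AND with a mask that is all zeros or all ones.

```c
#include <stdint.h>

/* Shift-and-add product in GF(2^n) without data-dependent branches
   or table accesses: the loop always runs n iterations and performs
   the same operations whatever the operand values are. */
static uint32_t gf_mul_ct(uint32_t x, uint32_t y, unsigned n, uint32_t poly)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < n; i++) {
        uint32_t add_mask = 0u - ((y >> i) & 1u);   /* all ones iff y_i = 1       */
        r ^= x & add_mask;                          /* masked addition            */
        x <<= 1;                                    /* next multiple of x         */
        uint32_t red_mask = 0u - ((x >> n) & 1u);   /* all ones iff degree is n   */
        x ^= poly & red_mask;                       /* reduction, always executed */
    }
    return r;
}
```

Whether the compiled code is really branch-free still depends on the compiler and its flags, so the generated assembly should be inspected for the target platform.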

Algorithm 6 Bitsliced multiplication
Require: 2 × 64 n-bit words X_i and Y_i where i ∈ [1, 64]
Ensure: 64 n-bit words R_i = X_i·Y_i
1: transpose the X_i into X^j, where X^j are 64-bit words for j ∈ [1, n]
2: transpose the Y_i into Y^j, where Y^j are 64-bit words for j ∈ [1, n]
3: multiply the slices X^j and Y^j into slices R^j
4: transpose the R^j into R_i, where R_i are n-bit words for i ∈ [1, 64]

This countermeasure prevents information leakage and speeds up the implementation. To compensate for the loss of performance incurred by constant-time code, one can also use Intel's CLMUL (Carry-Less Multiplication) instruction set to improve the speed of multiplication over GF(2^N). PCLMULQDQ, available since the 2010 Intel Core processor family based on the 32 nm Intel microarchitecture codenamed Westmere, performs carry-less multiplication of two 64-bit operands (without reduction).

The optimizations take advantage of the processor datapath (64 bit) and of available instructions. It is thus artificial and probably not very informative to write multiplications as algorithms. Instead, we provide the extensive C code for the case N = 12 (ℓ = 6 and m = 2), written in a portable way (ANSI POSIX). The code is abundantly commented; hence, the operations carried out should leave no room for ambiguity. Regarding the performance evaluation of the different functionally equivalent codes, again, a pure algorithmic description would be misleading. For instance, 64 XOR operations can be conducted in one clock cycle provided the operands are laid out as the 64 bits of a quad word, or otherwise in 64 clock cycles if the operands are located at different addresses. Therefore, performance reads better from the C code. We estimated the performance by averaging the execution time of each multiplication placed in a loop. This method makes it possible to filter out abnormal durations caused by improper pipeline or cache initializations.
In the next section, we show a case study where we compare different implementations of GF(2^N) multiplication on the basis of security and performance. The comparison further points out the computation performance over the tower field GF((2^ℓ)^m) and the field GF(2^N), where N = ℓm, using the conversion matrices derived in Section 2.
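The bitsliced organization can be illustrated with the following C sketch for GF(2^6) with generator polynomial x^6 + x + 1. It is not Algorithm A7: the function names are hypothetical, and the transposition is written naively for readability rather than speed. Slice j is a 64-bit word holding bit j of each of the 64 field elements, so each AND or XOR below acts on 64 multiplications at once.

```c
#include <stdint.h>

#define GF_BITS 6   /* N = 6, generator polynomial x^6 + x + 1 */

/* Transpose 64 field elements into GF_BITS 64-bit slices. */
static void bs_pack(const uint8_t in[64], uint64_t out[GF_BITS])
{
    for (int j = 0; j < GF_BITS; j++) {
        out[j] = 0;
        for (int i = 0; i < 64; i++)
            out[j] |= (uint64_t)((in[i] >> j) & 1u) << i;
    }
}

/* Inverse transposition: slices back to 64 field elements. */
static void bs_unpack(const uint64_t in[GF_BITS], uint8_t out[64])
{
    for (int i = 0; i < 64; i++) {
        out[i] = 0;
        for (int j = 0; j < GF_BITS; j++)
            out[i] |= (uint8_t)(((in[j] >> i) & 1u) << j);
    }
}

/* 64 products in parallel, using only AND and XOR on whole slices. */
static void bs_mul(const uint64_t x[GF_BITS], const uint64_t y[GF_BITS],
                   uint64_t r[GF_BITS])
{
    uint64_t t[2 * GF_BITS - 1] = {0};

    /* Schoolbook polynomial product: t[i+j] += x[i] * y[j]. */
    for (int i = 0; i < GF_BITS; i++)
        for (int j = 0; j < GF_BITS; j++)
            t[i + j] ^= x[i] & y[j];

    /* Unconditional reduction with x^k = x^(k-5) + x^(k-6) for k >= 6. */
    for (int k = 2 * GF_BITS - 2; k >= GF_BITS; k--) {
        t[k - GF_BITS]     ^= t[k];
        t[k - GF_BITS + 1] ^= t[k];
    }

    for (int j = 0; j < GF_BITS; j++)
        r[j] = t[j];
}
```

A typical use packs the two operand arrays with `bs_pack`, calls `bs_mul` once, and recovers the 64 products with `bs_unpack`. When many multiplications are chained, the data can stay in sliced form and the transpositions are paid only at the boundaries.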

Case Study: Optimization of DAGS
DAGS is a code-based Key Encapsulation Mechanism (KEM). Like all code-based primitives, it relies on the Syndrome Decoding Problem [19] and has no known vulnerabilities to quantum attacks. In particular, DAGS is based on the McEliece cryptosystem [6] and uses Quasi-Dyadic Generalized Srivastava (GS) codes to address the issue of the large public key size, which is inherent to code-based cryptography. We start by recalling some important definitions.

Definition 1.
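For m, n, s, t ∈ ℕ and a prime power q, let α_1, . . . , α_n, w_1, . . . , w_s be n + s distinct elements of GF(q^m) and z_1, . . . , z_n be nonzero elements of GF(q^m). The Generalized Srivastava (GS) code of order st and length n is defined by a parity-check matrix of the form

$$H = \begin{pmatrix} H_1 \\ H_2 \\ \vdots \\ H_s \end{pmatrix},$$

where each block is defined as

$$H_i = \begin{pmatrix} \dfrac{z_1}{\alpha_1 - w_i} & \cdots & \dfrac{z_n}{\alpha_n - w_i} \\ \dfrac{z_1}{(\alpha_1 - w_i)^2} & \cdots & \dfrac{z_n}{(\alpha_n - w_i)^2} \\ \vdots & & \vdots \\ \dfrac{z_1}{(\alpha_1 - w_i)^t} & \cdots & \dfrac{z_n}{(\alpha_n - w_i)^t} \end{pmatrix}.$$

The parameters are the length n ≤ q^m − s, the dimension k ≥ n − mst and the minimum distance d ≥ st + 1.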

Definition 2.
Given a ring R (in our case GF(q^m)) and a vector h = (h_0, . . . , h_{n−1}) ∈ R^n, the dyadic matrix ∆(h) ∈ R^{n×n} is the symmetric matrix with components ∆_{ij} = h_{i⊕j}, where ⊕ stands for bitwise exclusive-or on the binary representations of the indices. The sequence h is called its signature. If n is a power of 2, then every 2^l × 2^l dyadic matrix can be described recursively as

$$\begin{pmatrix} A & B \\ B & A \end{pmatrix},$$

where each block is a 2^{l−1} × 2^{l−1} dyadic matrix (and where any 1 × 1 matrix is dyadic).
A linear code is quasi-dyadic if it admits a parity-check matrix in quasi-dyadic form, i.e., a block matrix whose component blocks are dyadic matrices.
It has been shown by Misoczki and Barreto [20] that it is possible to build Goppa codes in quasi-dyadic form if the code is defined over a field of characteristic 2 and the dyadic signature satisfies the fundamental equation

$$\frac{1}{h_{i \oplus j}} = \frac{1}{h_i} + \frac{1}{h_j} + \frac{1}{h_0}.$$

Persichetti [21] then showed how to adapt the Misoczki-Barreto algorithm to generate quasi-dyadic GS codes. Intuitively, using this larger class of codes (of which Goppa codes are a subclass) provides more flexibility in the design of cryptographic schemes. More importantly, thanks to their "layered" structure, GS codes make it easier to resist structural attacks aimed at recovering the private key. In particular, the parameter t plays a crucial role in defining the complexity of one such attack (and successive variants), due to Faugère, Otmani, Perret and Tillich, and simply known as FOPT [22].
The attack, succinctly, consists of generating a system of equations from the fundamental relationship between the generator and parity-check matrices, G · H^T = 0. The system of equations is heavily simplified thanks to the particular relations stemming from the quasi-dyadic form (and the limitations inherent to alternant codes) and then solved using Gröbner bases. This allows for recovering an equivalent matrix for decoding, i.e., a private key. Despite the lack of a definitive complexity analysis, it is possible to observe that the attack scales somewhat proportionally to the value that defines the solution space (i.e., the number of free variables) of the system of equations. In the case of quasi-dyadic Goppa codes, this is given simply by m − 1; thus, the key factor is that the value m be large enough to make the attack infeasible. In the original proposal by Misoczki and Barreto, this value varies from 2 to 16, where the large extension field GF(2^N) is kept constant (i.e., N = 16) and the base field varies (i.e., ℓ = 1, 2, 4, 8). As a consequence, the dimension of the solution space is very small in most cases, and the only parameters that were not broken in practice were those corresponding to m = N = 16, although the attack authors recommend m − 1 to be at least 20.
In the case of GS codes, it is possible to apply the attack, but the dimension of the solution space is given by mt − 1 instead. It is thus a lot easier to achieve larger values for this, while at the same time keeping the extension field small (but still large enough to define long codes) and consequently having more efficient arithmetic. It follows that schemes based on GS codes (and DAGS is no exception) are usually defined over a relatively large base field, with the goal of minimizing the value m; the "burden" of thwarting FOPT falls then on t, which is chosen as relatively large. This has the additional advantage of a better error-correction capacity, since the number of correctable errors depends on t (it is in fact st/2). Better error-correction means that generic decoding attacks like Information-Set Decoding (ISD) [23,24] are harder, and thus implies the possibility of better parameters.

Initial Choice of Parameters
We report the DAGS parameters in Table 2. Note that in all cases the condition mt ≥ 21 is satisfied to prevent the FOPT attack, as suggested by the authors themselves.

Table 2. DAGS parameters.
DAGS 5: q = 2^6, m = 2, n = 2112, k = 704, s = 2^6, t = 11, public key size 11,616 bytes.

For DAGS 3 and DAGS 5, the finite field GF(2^6) is built using the polynomial x^6 + x + 1 and then extended to GF(2^12) using the quadratic irreducible polynomial x^2 + α^34 x + α, where α is a primitive element of GF(2^6).
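As an illustration of arithmetic in this tower field, the sketch below multiplies two elements a_1 β + a_0 and b_1 β + b_0 of GF((2^6)^2), using β^2 = α^34 β + α, which follows from the extension polynomial x^2 + α^34 x + α. It rests on assumptions that are ours and not part of the DAGS reference code: α is taken to be the class of x (the value 0x02), an element is packed as (a_1 << 6) | a_0, and all function names are hypothetical.

```c
#include <stdint.h>

/* Branch-free product in GF(2^6) modulo x^6 + x + 1 (0x43). */
static uint32_t gf64_mul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 6; i++) {
        r ^= a & (0u - ((b >> i) & 1u));        /* add a if bit i of b is set */
        a <<= 1;
        a ^= 0x43u & (0u - ((a >> 6) & 1u));    /* reduce when degree is 6    */
    }
    return r;
}

/* alpha^e, with alpha ASSUMED to be the class of x, i.e., 0x02.
   The exponent is a public constant here. */
static uint32_t gf64_alpha_pow(unsigned e)
{
    uint32_t p = 1;
    while (e--)
        p = gf64_mul(p, 2);
    return p;
}

/* Product in GF((2^6)^2) = GF(2^6)[beta]/(beta^2 + alpha^34 beta + alpha).
   Elements are packed as (a1 << 6) | a0 (hypothetical layout). */
static uint32_t tower_mul(uint32_t a, uint32_t b)
{
    uint32_t a0 = a & 63u, a1 = a >> 6;
    uint32_t b0 = b & 63u, b1 = b >> 6;
    uint32_t hh = gf64_mul(a1, b1);             /* coefficient of beta^2      */

    /* Substitute beta^2 = alpha^34 * beta + alpha. A real implementation
       would precompute the two constants instead of recomputing them. */
    uint32_t c1 = gf64_mul(a1, b0) ^ gf64_mul(a0, b1)
                ^ gf64_mul(hh, gf64_alpha_pow(34));
    uint32_t c0 = gf64_mul(a0, b0)
                ^ gf64_mul(hh, gf64_alpha_pow(1));
    return (c1 << 6) | c0;
}
```

One tower multiplication costs several ground-field multiplications, which gives an idea of why working directly in the isomorphic field GF(2^12) can be attractive.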

Improved Field Selection
The protocol specification for DAGS 3 and DAGS 5 is the same; hence, we choose to focus on the optimization of DAGS 5 in this section. The overall process consists of key generation, encapsulation and decapsulation.
Multiplications over the finite fields GF(2^6) and GF((2^6)^2) are carried out all along the process and must be protected, especially in critical operations such as key generation and encapsulation. The key generation is performed over the tower field GF((2^6)^2) using the log-antilog tables of GF(2^6). Thus, it must be protected against cache-timing attacks on the one hand and optimized for fast implementation on the other. Encapsulation is a critical operation performed over GF(2^6) using the tabulated method and hence is vulnerable to cache leakage. In the following, we compare the performance of seven implementations of multiplication algorithms over GF(2^6), GF((2^6)^2) and GF(2^12). In fact, according to Section 2, we can convert elements from GF((2^6)^2) to GF(2^12) to perform multiplication in the key generation process faster. In the example of Section 2, we used the DAGS polynomials x^6 + x + 1 for GF(2^6) and x^2 + α^34 x + α for GF((2^6)^2), with α a primitive element of GF(2^6), and hence we can use the derived matrix T for the isomorphic mapping. Note that we can change the tower field polynomial, yet still be consistent with the DAGS design, in order to construct a field GF(2^12) with a sparse generator polynomial. That is to say, using the polynomial x^2 + α^27 x + α for the extension yields a mapping to GF(2^12) with the generator trinomial x^12 + x^7 + 1. However, the overall gain when compared to the performance obtained with the pentanomial from the example is negligible, not to mention the cost of changing the initial polynomial in the reference code.
Thus, we chose to keep the initial parameters and compare the following seven implementations on the basis of security and performance:
• Tabulated log/antilog (Algorithms A1-A3),
• Iterative, conditional reduction (Algorithm A5),
• Iterative, ASM with PCLMUL, conditional reduction (Algorithm A5),
• Iterative, unconditional reduction (Algorithm A6),
• Iterative, ASM with PCLMUL, unconditional reduction (Algorithm A6),
• Iterative, unconditional reduction, 1-bit-sliced, 64 computations in parallel (Algorithm A7),
• Iterative, ASM with PCLMUL, unconditional reduction, bit-sliced, 2 computations in parallel (Algorithm A8).

Implementation Performances
We will present the results of our experiments below.
Table 3 presents the performance of different implementations of multiplication over the finite field GF(2^6). We further compared these implementations over the tower field GF((2^6)^2) and the isomorphic field GF(2^12). Note that the costs of the isomorphic mapping, the bitslice transposition and the initialization of the log-antilog tables are excluded from Table 3, since they are incurred only a few times during the process and not at each multiplication. Accordingly, we conclude the following:
• The tabulated log-antilog version is the fastest amongst non-parallel algorithms.
• It is faster to implement tower field computation directly in an isomorphic field of characteristic two.
• The modular multiplication with the dedicated Carry-less MULtiplication (PCLMUL) assembly (ASM) instruction does not improve the speed, since the overhead of the function call dominates the computation for those small values of N. However, in the case of only one serial operation, PCLMUL should be used because it has the lowest latency.
• Constant-time, cache-secure implementations take more time than those that are not secure. Moreover, we noticed that "conditional reduction" in C code is actually constant-time once compiled into assembly code (when the optimization flag is set), owing to the use by the compiler of the CMOV (conditional move) assembly instruction, which executes in one single clock cycle.
• The bitsliced single multiplication takes only 55/64 = 0.86 cycle over GF(2^6) and 335/64 = 5.23 cycles over GF(2^12), and is invulnerable to cache-timing attacks. Thus, it is our champion implementation, to be chosen for fast and secure arithmetic over GF(2^N).
• For the second version of the bitsliced implementation, we pack two words X, X′ ∈ GF(2^N) as X′·2^{2N} + X. Then, the products XY and X′Y′ can be computed in one go by noticing that PCLMULQDQ(X′·2^{2N} + X, Y′·2^{2N} + Y) = PCLMULQDQ(X′, Y′)·2^{4N} + (PCLMULQDQ(X, Y′) ⊕ PCLMULQDQ(X′, Y))·2^{2N} + PCLMULQDQ(X, Y); hence, the results are obtained at bit indices [6N, 4N] and [2N, 0].

For DAGS key generation, we chose to map elements from GF((2^6)^2) to GF(2^12) and perform a bitsliced multiplication for secure and fast computation. The encapsulation process is carried out over GF(2^6) and we chose the bitsliced multiplication as well. Concerning the decapsulation, we mapped elements to GF(2^12) and kept the tabulated method because it is unimportant to secure this public-key process against cache leakage.
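The packing identity for two simultaneous carry-less products can be checked with a portable stand-in for PCLMULQDQ. The sketch below assumes N = 6, so that the packed product fits in 64 bits (for N = 12 the full 128-bit result of the instruction is needed). The function names `clmul64` and `packed_clmul` are hypothetical, and, as with the instruction itself, the returned products are not reduced modulo the generator polynomial.

```c
#include <stdint.h>

#define GF_N 6   /* N = 6 keeps the packed product within 64 bits */

/* Software carry-less multiplication, used here in place of PCLMULQDQ.
   Only the low 64 bits of the product are kept. */
static uint64_t clmul64(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        r ^= (a << i) & (0ull - ((b >> i) & 1ull));
    return r;
}

/* Compute the unreduced products x*y and xp*yp with a single carry-less
   multiplication of the packed operands xp*2^(2N) + x and yp*2^(2N) + y. */
static void packed_clmul(uint64_t x, uint64_t xp, uint64_t y, uint64_t yp,
                         uint64_t *xy, uint64_t *xpyp)
{
    uint64_t prod = clmul64((xp << (2 * GF_N)) | x, (yp << (2 * GF_N)) | y);
    uint64_t mask = (1ull << (2 * GF_N)) - 1;

    *xy   = prod & mask;                    /* bits [2N, 0]: x * y            */
    *xpyp = (prod >> (4 * GF_N)) & mask;    /* bits [6N, 4N]: xp * yp         */
    /* bits [4N, 2N] hold the cross terms x*yp ^ xp*y, which are discarded.   */
}
```

Each product of two N-bit operands occupies at most 2N − 1 bits, which is why a spacing of 2N bits between the packed operands is enough to keep the three terms from overlapping.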

Conclusions
In this paper, we compared several implementations of multiplication over the finite field GF(2^N), for N < 32, on the basis of security and performance. Our analysis showed that the log-antilog tabulated method and the conditional iterative methods are vulnerable to cache-timing attacks. Moreover, for big values of N, the tabulated method becomes costly and tower fields are used to perform the arithmetic. We showed that this towering technique is slow and we proposed to map the elements to the isomorphic field for better performance.
To counter cache attacks, we presented two constant-time implementations: the iterative method with unconditional reduction, which removes branches and is thus slower, and the bitsliced implementation, which is executed in SIMD fashion. We used DAGS, a code-based KEM submission to NIST's PQC call, as a case study to examine the different multiplications over GF((2^6)^2). The conclusions are that the bitsliced implementation is faster than the tower multiplication and secure with respect to cache attacks. It should be pointed out that our results also apply to secure and accelerated implementations of other PQC algorithms, as well as AES and other symmetric ciphers that run over finite fields GF(2^N). Finally, note that all the algorithms tested in the paper are provided in C language in Appendix A.