Novel Hybrid-Size Digit-Serial Systolic Multiplier over $GF(2^m)$

Abstract: Because of its efficient tradeoff in area-time complexities, the digit-serial systolic multiplier over $GF(2^m)$ has gained substantial attention in the research community for possible application in current and emerging cryptosystems. In general, this type of multiplier is designed for one specific field-size, which determines the actual security level of the cryptosystem and thus limits the flexibility of cryptographic applications. Based on this consideration, in this paper we propose a novel hybrid-size digit-serial systolic multiplier which not only offers the flexibility to operate in either pentanomial- or trinomial-based multiplication, but also has low implementation complexity. Overall, we have made two interdependent efforts to carry out the proposed work. First, a novel algorithm is derived to formulate the mathematical idea of the hybrid-size realization. Then, a novel digit-serial structure is obtained through efficient mapping of the proposed algorithm. Finally, complexity analysis and comparison are given to demonstrate the efficiency of the proposed multiplier, e.g., the proposed one has a lower area-delay product (ADP) than the best existing trinomial-based design. The proposed multiplier can be used as a standard intellectual property (IP) core in many cryptographic applications for flexible operation.


Introduction
Finite field multipliers have gained substantial attention recently because of their critical role in many cryptosystems such as elliptic curve cryptography (ECC), especially on hardware platforms [1]. Typically, finite field multipliers are realized in one of three structures, namely bit-serial, bit-parallel, and digit-serial. Because of their efficient tradeoff in area-time complexities, digit-serial structures are usually preferred over the other two in many applications [2].
With recent advances in artificial intelligence technology, systolic structures have become increasingly attractive for high-performance hardware platforms [3]. Accordingly, digit-serial systolic finite field multipliers have the potential to be applied in high-performance cryptosystems because of their high throughput, regularity, and modularity. Thus far, several efforts have been made toward efficient implementation of digit-serial systolic finite field multipliers: (i) an efficient systolic finite field multiplier is presented in [3], with significantly lower complexity than the previously reported one; (ii) a systolic-like digit-serial multiplier is reported in [4], whose structure is found to be specifically suitable for Reed-Solomon codecs; (iii) an efficient digit-serial systolic multiplier is presented in [5]; (iv) the same authors reported a unified digit-serial systolic multiplier based on trinomials and all-one-polynomials [6]; (v) a low-complexity systolic multiplier is given in [7], where the complexity is optimized to be minimal; (vi) an efficient resource-sharing technique is employed in another digit-serial systolic multiplier to achieve a short critical-path and high-performance operation [8]; and (vii) an efficient systolic digit-serial multiplier is reported in [9], whose complexity is so far the lowest in the literature. These designs represent the major advances in the field of systolic digit-serial multipliers.
However, the existing digit-serial systolic finite field multipliers still have drawbacks to be overcome: (i) although they have relatively few processing elements (PEs), their register-complexity is still large; and (ii) they are designed for a fixed field-size and thus cannot provide enough flexibility to meet the current technology trend, in which one cryptosystem may need to support different security levels (field-sizes). Designers therefore have to finalize a different field-size multiplier for each application requirement, which is inefficient in integrated chip (IC) design. To address these two challenges, in this paper we propose a novel hybrid-size digit-serial systolic multiplier with low-complexity implementation. The proposed work is carried out through two interdependent stages: (i) a novel hybrid-size digit-serial systolic multiplication algorithm is proposed, which provides enough flexibility for both pentanomial- and trinomial-based multipliers; and (ii) the proposed algorithm is then mapped into a novel systolic structure through a series of optimization techniques. Thorough complexity analysis and detailed comparison are also made to confirm the efficiency of the proposed design, i.e., it not only offers the flexibility to switch from one field-size to another, but also has smaller area-time complexities than the existing single field-size digit-serial systolic multipliers. The proposed design can be used not only as a standard intellectual property (IP) core for cryptosystems of various field-sizes, but also as a core computation unit in reconfigurable cryptographic processors (which demand flexible field-size choice).
The rest of the paper is organized as follows: Section 2 presents the mathematical formulation of the proposed digit-serial multiplication algorithm. Section 3 shows the detailed steps of the proposed systolic structure mapped from the algorithm. The analysis and comparison are provided in Section 4. The conclusion is given in Section 5.

Mathematical Formulation of the Proposed Multiplication Algorithm
Let $A$, $B$, and $C$ be three elements of $GF(2^m)$, and let the polynomial basis be $\{1, x, x^2, \ldots, x^{m-1}\}$, where $x$ is a root of $f(x)$ ($f(x)$ determines the field) [1]. For two field-sizes $m_1$ and $m_2$ with $m_1 < m_2$, we first define

$$A_1 = \sum_{i=0}^{m_1-1} a_i x^i, \quad B_1 = \sum_{i=0}^{m_1-1} b_i x^i, \quad C_1 = \sum_{i=0}^{m_1-1} c_i x^i,$$

and

$$A_2 = \sum_{i=0}^{m_2-1} a_i x^i, \quad B_2 = \sum_{i=0}^{m_2-1} b_i x^i, \quad C_2 = \sum_{i=0}^{m_2-1} c_i x^i,$$

where $a_i$, $b_i$, and $c_i \in GF(2)$, and it is clear that the $a_i$ in both $A_1$ and $A_2$ are the same (for $0 \le i \le m_2 - 1$); the same applies to $b_i$ and $c_i$. For field-size $m_1$, let $C_1$ be the product of $A_1$ and $B_1$ (the corresponding field polynomial is $f_1(x)$); we then have

$$C_1 = A_1 \cdot B_1 \bmod f_1(x) = \sum_{i=0}^{m_1-1} b_i A_1^{(i)}, \qquad (3)$$

where $A_1^{(0)} = A_1$ and $A_1^{(i+1)} = x \cdot A_1^{(i)} \bmod f_1(x)$. Similarly, for field-size $m_2$, we have $C_2$ as the product of $A_2$ and $B_2$ (the field polynomial is $f_2(x)$):

$$C_2 = A_2 \cdot B_2 \bmod f_2(x) = \sum_{i=0}^{m_2-1} b_i A_2^{(i)}, \qquad (5)$$

where, similarly, $A_2^{(0)} = A_2$ and $A_2^{(i+1)} = x \cdot A_2^{(i)} \bmod f_2(x)$. Comparing Equation (3) with Equation (5), we have

$$C_j = \sum_{i=0}^{m_j-1} b_i A_j^{(i)}, \qquad (6)$$

where $j = 1$ or $2$ according to Equations (3) and (5), respectively, and $A_j^{(i+1)} = x \cdot A_j^{(i)} \bmod f_j(x)$. Then, we have the following definitions: for any integer $m_2$, we have $m_2 = w \cdot d$ (meanwhile, $m_1 = w \cdot d_1$); then, we can define ... Similarly, we can have ... where we assume $m_2 - m_1 > w$.
It is clear that we can now transfer Equation (6) into ... where $\xi(k_j)$ works for terms only when $m_1 \le i \le m_2 - 1$ and ... We can see that Equation (10) can be used to perform finite field multiplication for both field-sizes if the control signal is selected properly. The above equations can thus be summarized as Algorithm 1.
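The bit-level recursion behind Equations (3), (5), and (6) can be illustrated with a short software model. The following is only a minimal sketch, not the proposed hardware: polynomials are stored as Python integers (bit $i$ holds the coefficient of $x^i$), the function names are hypothetical, and the toy field $GF(2^4)$ with $f(x) = x^4 + x + 1$ is an illustrative assumption chosen so the result can be checked by hand.

```python
def mul_x_mod(a: int, f: int, m: int) -> int:
    """Return x * a mod f(x); bit i of an int is the coefficient of x^i."""
    a <<= 1
    if (a >> m) & 1:          # degree reached m, so subtract (XOR) f(x) once
        a ^= f
    return a


def gf2m_mul(a: int, b: int, f: int, m: int) -> int:
    """C = sum_i b_i * A^(i), with A^(0) = A and A^(i+1) = x * A^(i) mod f(x)."""
    c = 0
    a_i = a                   # holds A^(i) for the current i
    for i in range(m):
        if (b >> i) & 1:      # b_i selects whether A^(i) is accumulated
            c ^= a_i
        a_i = mul_x_mod(a_i, f, m)
    return c


# Toy field GF(2^4), f(x) = x^4 + x + 1 (illustrative, not taken from the paper).
F4 = 0b10011
# (x^2 + x) * (x^2 + x + 1) = x^4 + x, which reduces to 1 modulo f(x).
print(gf2m_mul(0b0110, 0b0111, F4, 4))  # prints 1
```

The same routine serves either field-size by passing $(f_1, m_1)$ or $(f_2, m_2)$, which is the software analogue of choosing $j$ in Equation (6).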

Algorithm 1. Proposed multiplication algorithm for hybrid field-size-based implementation
Inputs: $A_1$ and $B_1$ (also $A_2$ and $B_2$) are the pairs of elements (polynomial basis representation) in $GF(2^m)$ for field-sizes $m_1$ and $m_2$, respectively.
The detailed processes of Steps 2.2 and 2.4 are the key multiplication processes.
Note that, because the field polynomials $f_j(x)$ differ, the process of deriving $A_j$ differs slightly between the two field-sizes. For instance, assume ... where we can have ... Besides that, one has to note that the National Institute of Standards and Technology (NIST) has recommended five irreducible polynomials for ECC implementation [10,11] (three pentanomials and two trinomials). Without loss of generality, we assume that $m_1$ is the field-size of a pentanomial and $m_2$ is the field-size of a trinomial. The corresponding structure presented below is also based on this assumption.
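The difference between the two reduction processes can be made concrete with a small model of one multiplication by $x$. This is a hedged sketch: the two polynomials below are the NIST-recommended ones for $m = 163$ and $m = 233$, used here as an illustrative assumption (the text above does not state which polynomials its example uses), and the function names are hypothetical.

```python
# Assumed example polynomials (NIST binary fields), given as (m, lower exponents):
#   pentanomial: x^163 + x^7 + x^6 + x^3 + 1
#   trinomial:   x^233 + x^74 + 1
PENTA_163 = (163, (7, 6, 3, 0))
TRI_233 = (233, (74, 0))


def mul_x_mod(a: int, poly) -> int:
    """Return x * a mod f(x), where f(x) = x^m + sum of x^k over the listed k."""
    m, low_exponents = poly
    msb = (a >> (m - 1)) & 1            # coefficient that would move to x^m
    a = (a << 1) & ((1 << m) - 1)       # shift left and drop the x^m term
    if msb:
        for k in low_exponents:         # replace x^m by the lower terms of f(x)
            a ^= 1 << k
    return a


def xor_gates_per_step(poly) -> int:
    """XOR gates for one multiplication by x.

    The constant term needs no gate: bit 0 is empty after the shift, so the
    fed-back coefficient is simply wired into it.
    """
    _, low_exponents = poly
    return sum(1 for k in low_exponents if k != 0)


print(xor_gates_per_step(TRI_233))    # prints 1
print(xor_gates_per_step(PENTA_163))  # prints 3
```

Under these assumptions, a single multiplication by $x$ costs one XOR gate for the trinomial and three for the pentanomial, which is why the two field-sizes need slightly different logic for deriving $A_j$.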

Proposed Hybrid-Size Digit-Serial Systolic Multiplier
In this section, we propose several optimization techniques to map the corresponding algorithm into the desired systolic structure. Specifically:

Novel Input Data Broadcasting Scheme
One major component of the register-complexity of a systolic finite field multiplier comes from input data broadcasting. In this subsection, we propose a novel input data broadcasting scheme in which the main inputs to each PE are fed independently of each other, so that the data dependence between the PEs is reduced to a minimum, which significantly reduces the related register-complexity of the systolic array. Figure 1 shows the proposed input data broadcasting technique.
As shown in Figure 1, according to Step 2.3 of Algorithm 1, each PE in the systolic array is fed with two inputs, namely $A_j^{(i)}$ and the corresponding $b_i$. The output of each PE is then transferred to the next PE on its right. The complete output is delivered after $(d + w)$ cycles, with the help of an extra accumulation cell. Since the $A_j^{(i)}$ differ from one another, we use the selective connection to connect each PE correctly according to Algorithm 1. Because only one signal is pipelined to the next PE, the register-complexity of the systolic array is significantly reduced. The details of the internal structures of these PEs are shown below. Note that, because the internal structure of the PEs is simple, i.e., the critical-path of the PE is quite small, the proposed broadcasting technique has very limited influence on the overall time complexity.
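The data flow described above (each PE receives one $A_j^{(i)}$ and one $b_i$, passes a single partial sum to its right neighbour, and an accumulation cell collects the result) can be mimicked by a cycle-level software model. This is a sketch under stated assumptions, not the proposed circuit: the index mapping $i = v \cdot w + u$ for PE $v$ at step $u$ is one assumed reading of the grouped-sequential delivery, all $A^{(i)}$ are precomputed in software instead of being produced by PE-0, and the function names are hypothetical.

```python
def mul_x_mod(a: int, f: int, m: int) -> int:
    """Return x * a mod f(x); bit i of an int is the coefficient of x^i."""
    a <<= 1
    if (a >> m) & 1:
        a ^= f
    return a


def systolic_multiply(a: int, b: int, f: int, m: int, d: int):
    """Model d pipelined PEs plus an accumulation cell; return (product, cycles)."""
    w = -(-m // d)                       # steps per operand: ceil(m / d)
    a_pow, cur = [], a                   # a_pow[i] = A^(i); PE-0's job in hardware
    for _ in range(w * d):
        a_pow.append(cur)
        cur = mul_x_mod(cur, f, m)

    pipe = [0] * d                       # pipe[v]: output register of PE v
    acc = 0                              # accumulation cell (XOR + register)
    for t in range(w + d):
        acc ^= pipe[d - 1]               # consume the last PE's registered output
        nxt = [0] * d
        for v in range(d):
            u = t - v                    # step index reaching PE v at clock t
            incoming = pipe[v - 1] if v > 0 else 0
            term = 0
            if 0 <= u < w:
                i = v * w + u            # assumed bit index handled by PE v
                if (b >> i) & 1:         # AND cell: b_i gates A^(i)
                    term = a_pow[i]
            nxt[v] = incoming ^ term     # XOR cell, then register
        pipe = nxt
    return acc, w + d


# Toy field GF(2^4), f(x) = x^4 + x + 1, with d = 2 PEs (illustrative values).
print(systolic_multiply(0b0110, 0b0111, 0b10011, 4, 2))  # prints (1, 4)
```

In this model only the partial sum travels between neighbouring PEs, and the product is complete after $d + w$ clock ticks, in line with the latency stated above.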

Proper Arrangement on the Input Data Delivery
The two inputs, i.e., $A_j$ and $B_j$, must be properly arranged to meet the data dependence requirement for hybrid-size operation. For $B_j$, according to Algorithm 1, all bits are delivered in a grouped-sequential way, which can be realized by the structure shown in Figure 2. The shift-register produces the required output bits for each PE of Figure 1 based on Algorithm 1, while the hybrid-size selection is done by inserting an extra multiplexer (MUX) in the shifting path, so that the shift-register can work under the field-size of either $m_1$ or $m_2$ through proper control of the MUX (control signal).
Figure 2. The proposed shift-register to deliver input data $B_j$.
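A behavioural model can make the grouped-sequential delivery of $B_j$ easier to picture. The sketch below is an assumption-laden illustration rather than the circuit of Figure 2: it assumes that PE $v$ taps the register at position $v \cdot w$, that the register shifts by one bit per cycle, and that the MUX control simply selects the active field-size and the number of PEs in use; the function and parameter names are hypothetical.

```python
def b_tap_stream(b: int, m: int, w: int, num_pes: int):
    """Yield, per cycle, the bits seen at taps 0, w, 2w, ... of a shifting register."""
    reg = b & ((1 << m) - 1)             # only the active m bits are loaded
    for _ in range(w):
        yield [(reg >> (v * w)) & 1 for v in range(num_pes)]
        reg >>= 1                        # one shift per cycle


def deliver_b(b: int, m1: int, m2: int, d1: int, d: int, w: int, use_m2: bool):
    """Return the per-cycle bit vectors fed to the PEs for the selected field-size."""
    # Assumed role of the MUX control: pick the field-size and the PE count.
    m, num_pes = (m2, d) if use_m2 else (m1, d1)
    return list(b_tap_stream(b, m, w, num_pes))


# Toy sizes: m1 = 4 with d1 = 2 PEs, m2 = 6 with d = 3 PEs, w = 2 cycles.
print(deliver_b(0b101101, 4, 6, 2, 3, 2, True))   # prints [[1, 1, 0], [0, 1, 1]]
print(deliver_b(0b101101, 4, 6, 2, 3, 2, False))  # prints [[1, 1], [0, 1]]
```

In the larger field-size every PE receives a bit in each cycle, while in the smaller one only the first $d_1$ PEs are served and the upper bits of the register are ignored.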
The operand $A_j$, with the help of PE-0, delivers the correct output bits to each PE according to Algorithm 1, which requires a more sophisticated structure, as shown in Figure 3a. From Equations (13) and (14), one can observe that there is one XOR gate involved when obtaining ... where we can see that the identical bits, e.g., $a_0, a_1, \ldots$, can be shared among these $A_j^{(i)}$ ($0 \le i \le 12$), as shown by the example in Figure 3b (which shows how the MUXes are placed to obtain the hybrid-size implementation). Since the other bits cannot be shared, we use a MUX to connect the two bits at the same position (according to Equations (8) and (9)) so that, through proper operation of these MUXes, the correct signals are delivered to the corresponding PE.
One can also notice that, according to Equations (9), (13), and (14), with the help of a modular operation (done by the modular cell in Figure 3), PE-0 delivers the corresponding output to each PE, i.e., it obtains $A_u$ from $A_{u-1}$ (for $1 \le u \le d/d_1$), which needs a delay time of $2T_X$ ($T_X$ is the delay time of an XOR gate; the operation takes $T_X$ for a trinomial-based multiplier and $2T_X$ for a pentanomial-based one [9]).
Besides that, one has to note that all the $A_j^{(i)}$ in one specific $A_u$ can be obtained through the sharing of identical bits, as represented by the selective connection in Figures 1 and 3. Following this arrangement, the proposed hybrid-size structure operates in an ordered form according to Algorithm 1.

Hybrid Accumulation
The accumulation of the digit-serial operation also needs adjustment compared with conventional designs. As shown in Figure 4, we use an $m_1$-bit MUX cell to obtain the hybrid-size accumulation (the accumulation cell is realized by an XOR cell connected with a register cell in a feedback loop). Note that these $m_1$ bit-level MUXes connect with the $m_1$-bit output of PE-$d_1$, while the remaining $(m_2 - m_1)$ bits of PE-$d$ are directly connected with the accumulation cell. According to Equations (8) and (9) and Algorithm 1, the MUX determines whether the multiplier works with a field-size of $m_1$ bits or $m_2$ bits. Besides that, the number of output bits is also selected according to the chosen field-size, as shown in Figure 4, i.e., after the designated number of cycle periods, the output is produced based on the value of the control signal.
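The selection performed by the $m_1$ bit-level MUXes can be sketched in software as follows. This is only an illustrative model of the description above, with hypothetical function and parameter names: the lower $m_1$ bits come from either PE-$d_1$ or PE-$d$ depending on the control signal, the upper $(m_2 - m_1)$ bits always come from PE-$d$, and the output width follows the selected field-size.

```python
def accumulate_step(acc: int, pe_d1_out: int, pe_d_out: int,
                    m1: int, m2: int, use_m2: bool) -> int:
    """One update of the accumulation cell (XOR cell feeding a register)."""
    low_mask = (1 << m1) - 1
    high_mask = ((1 << m2) - 1) ^ low_mask
    # m1 bit-level MUXes: choose which PE drives the lower m1 bits.
    low = (pe_d_out if use_m2 else pe_d1_out) & low_mask
    # The remaining (m2 - m1) bits of PE-d are wired in directly.
    high = pe_d_out & high_mask
    return acc ^ (low | high)


def read_output(acc: int, m1: int, m2: int, use_m2: bool) -> int:
    """Keep only as many output bits as the selected field-size has."""
    m = m2 if use_m2 else m1
    return acc & ((1 << m) - 1)


# Toy widths m1 = 4 and m2 = 6 (illustrative values only).
acc_small = accumulate_step(0, 0b0101, 0b111010, 4, 6, False)
acc_large = accumulate_step(0, 0b0101, 0b111010, 4, 6, True)
print(read_output(acc_small, 4, 6, False))  # prints 5
print(read_output(acc_large, 4, 6, True))   # prints 58
```

In the smaller field-size the upper bits are still accumulated in this model but are masked off when the output is read, which mirrors the selection of the number of output bits by the control signal.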

Final Structure
The internal structure of each PE is shown in Figure 5b; it mainly consists of an AND cell, an XOR cell, and a register cell. Combining all the optimization techniques introduced above, we obtain the final hybrid-size digit-serial structure, as shown in Figure 5a. All the control signals connected with the inserted MUXes work together to switch the finite field multiplier from operating in one field-size to another. After the designated cycle periods of accumulation, the multiplier delivers the desired output.

Complexity and Comparison
For simplicity of discussion, we follow the assumption in Section 3 that $m_1$ is the field-size of a pentanomial while $m_2$ is the field-size of a trinomial. The detailed complexity of the proposed multiplier is: (i) Systolic array: The systolic array has $d$ PEs, where each PE has $m_2$ AND gates, $m_2$ XOR gates, and $m_2$ registers. (ii) Shift-register: The shift-register for $B_j$ requires $m_2$ registers and one MUX. (iii) Accumulation cell: The accumulation cell requires $m_1$ MUXes, $m_2$ XORs, and $m_2$ registers. (iv) PE-0: There are in total $(3d_1 + d - 4)$ XOR gates, $(4d_1 - 4)$ MUXes, and $(m_2 + 3d_1 + d - 4)$ registers involved. Moreover, the proposed structure has a critical-path of $(2T_X + T_M)$ ($T_M$ is the delay time of a MUX), and it takes $(d + w)$ cycles to produce the desired output for hybrid-size operation.
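The four itemized counts can be tallied with a few lines of code. The sketch below simply adds up items (i) to (iv) as stated above; the function name is hypothetical, the one-bit MUX of the shift-register is counted alongside the bit-level MUXes, and the totals are a plain sum that may be grouped differently from the entries of Table 1.

```python
def hybrid_multiplier_counts(m1: int, m2: int, d1: int, d: int) -> dict:
    """Sum the gate and register counts of items (i)-(iv)."""
    pe0 = 3 * d1 + d - 4                      # recurring term in the PE-0 counts
    return {
        "AND": d * m2,                        # (i) systolic array
        "XOR": d * m2 + m2 + pe0,             # (i) + (iii) + (iv)
        "MUX": 1 + m1 + (4 * d1 - 4),         # (ii) + (iii) + (iv)
        "REG": d * m2 + m2 + m2 + (m2 + pe0), # (i) + (ii) + (iii) + (iv)
    }


# Parameters used for the synthesis comparison in this section.
counts = hybrid_multiplier_counts(m1=163, m2=233, d1=13, d=16)
print(counts)  # prints {'AND': 3728, 'XOR': 4012, 'MUX': 212, 'REG': 4478}
```

The tally shows that the registers and the AND/XOR gates of the systolic array dominate, while the MUXes added for hybrid-size operation are a comparatively small part of the total.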
Overall, the complexity of the proposed design is listed along with that of the existing digit-serial multipliers (trinomial- or pentanomial-based designs) in Table 1 in terms of the number of logic gates, the number of registers, latency (number of cycle periods), and critical-path. Note that the designs of [5,6] are based on all-one-polynomials (or use all-one-polynomials as a computation core); for a fair comparison, we thus do not list them in Table 1. As shown in Table 1, the proposed hybrid-size digit-serial multiplier has relatively better area-time complexities than the existing ones, especially considering that it offers hybrid field-size operation (the existing ones are all based on a single field-size). For a detailed comparison, we have also used NanGate's Library Creator and the 45-nm FreePDK Base Kit from North Carolina State University (NCSU) [12] to estimate the area and time complexities of all the designs for $m_2 = 233$, $m_1 = 163$, $d = 16$, and $d_1 = 13$.
The obtained area, delay (latency time), power, area-delay product (ADP), and power-delay product (PDP) are listed in Table 2 for a comparison. Again, we can observe that the proposed one has better performance than the existing ones, e.g., it has at least 7.3% less ADP than the best trinomial one of [8], while it offers the flexibility to execute the pentanomial-based multiplier. Compared with the existing pentanomial ones, the proposed one still has better ADP when considering the scaling of the field-size. The proposed one also has 41.5% less ADP and 34.6% less PDP than the conventional hybrid field-size implementation (we have combined the best existing ones of [8,9] together to realize it).
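The figures of merit used in this comparison can be written compactly as follows. This is only a restatement of the delay definition used for the tables (latency cycle number times critical-path) combined with the latency and critical-path given for the proposed structure; the symbols $T_{\mathrm{delay}}$, $\mathrm{Area}$, and $\mathrm{Power}$ are notation introduced here for illustration, and reading a relative saving as one minus the ratio of the two products is an assumption about how the percentages are computed:

$$T_{\mathrm{delay}} = (d + w)\,(2T_X + T_M), \qquad \mathrm{ADP} = \mathrm{Area} \times T_{\mathrm{delay}}, \qquad \mathrm{PDP} = \mathrm{Power} \times T_{\mathrm{delay}},$$

$$\text{ADP saving} = 1 - \frac{\mathrm{ADP}_{\text{proposed}}}{\mathrm{ADP}_{\text{reference}}}.$$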
Design groups in the tables: digit-serial systolic structures (pentanomial of size $m_1$); hybrid-size digit-serial systolic structures (pentanomial of size $m_1$ and trinomial of size $m_2$).
Table notes: (1) delay = latency cycle number × critical-path. (2) Refers to the conventional implementation of two field-size finite field multipliers; we have combined the best existing ones of [8,9].
The proposed hybrid-size digit-serial systolic multiplier can be used as a standard IP core in various cryptosystems that demand different security levels. Moreover, because of its low complexity, the proposed design can also be used in a cryptosystem for flexible operation, in case the user of that cryptosystem needs to change or upgrade the system. It is also worth mentioning that the proposed hybrid field-size strategy can be extended to multiple field-size implementation.

Conclusions
This paper presents a novel implementation of a hybrid field-size digit-serial systolic multiplier over $GF(2^m)$. A novel digit-serial multiplication algorithm suitable for hybrid field-size realization is proposed first. Then, through a series of optimization techniques, the proposed algorithm is successfully mapped into a high-performance digit-serial systolic multiplier. The complexity analysis and detailed comparison have been given to confirm the efficiency of the proposed design. Future work may focus on the application of the proposed design in various cryptosystems.

Abbreviations
The following abbreviations are used in this manuscript:

IP: Intellectual property
ECC: Elliptic curve cryptography
PE: Processing elements
IC: Integrated chip
NCSU: North Carolina State University