Novel Hybrid-Size Digit-Serial Systolic Multiplier over GF(2m)

Hu, Zhenji; Xie, Jiafeng

doi:10.3390/sym10110540

Open AccessArticle

Novel Hybrid-Size Digit-Serial Systolic Multiplier over GF(2^m)

by

Zhenji Hu

¹ and

Jiafeng Xie

^2,*

¹

School of Law, ShangHai University of Finance & Economics, Shanghai 200433, China

²

Department of Electrical & Computer Engineering, Villanova University, Villanova, PA 19085, USA

^*

Author to whom correspondence should be addressed.

Symmetry 2018, 10(11), 540; https://doi.org/10.3390/sym10110540

Submission received: 26 September 2018 / Revised: 19 October 2018 / Accepted: 20 October 2018 / Published: 24 October 2018

Download

Browse Figures

Versions Notes

Abstract

:

Because of the efficient tradeoff in area–time complexities, digit-serial systolic multiplier over

G F (2^{m})

has gained substantial attention in the research community for possible application in current/emerging cryptosystems. In general, this type of multiplier is designed to be applicable to one certain field-size, which in fact determines the actual security level of the cryptosystem and thus limits the flexibility of the operation of cryptographic applications. Based on this consideration, in this paper, we propose a novel hybrid-size digit-serial systolic multiplier which not only offers flexibility to operate in either pentanomial- or trinomial-based multiplications, but also has low-complexity implementation performance. Overall, we have made two interdependent efforts to carry out the proposed work. First, a novel algorithm is derived to formulate the mathematical idea of the hybrid-size realization. Then, a novel digit-serial structure is obtained after efficient mapping from the proposed algorithm. Finally, the complexity analysis and comparison are given to demonstrate the efficiency of the proposed multiplier, e.g., the proposed one has less area-delay product (ADP) than the best existing trinomial-based design. The proposed multiplier can be used as a standard intellectual property (IP) core in many cryptographic applications for flexible operation.

Keywords:

digit-serial; hybrid-size; low-complexity; pentanomial; systolic structure; trinomial

1. Introduction

Finite field multipliers have gained substantial attentions recently due to their critical roles in many cryptosystems such as elliptic curve cryptography (ECC), especially on hardware platforms [1]. Typically, there are three types of structuring related to the finite field multipliers, namely the bit-serial, bit-parallel, and digit-serial. Because of the efficient tradeoff in area–time complexities, digit-serial structures usually are more widely preferred than the other two in many applications [2].

Along with the recent advance in artificial intelligence technology, systolic structure has becoming more and more attracting in high-performance hardware platforms [3]. Accordingly, digit-serial systolization of finite field multipliers have the potential to be applied in high-performance cryptosystems due to their superior features such as high-throughput rate and regularity and modularity. Thus far, several efforts have been made on efficient implementation of digit-serial systolic finite field multipliers: (i) an efficient systolic finite field multiplier is presented in [3], where its complexity is significantly reduced compared with the previous reported one; (ii) a systolic-like digit-serial multiplier is reported in [4] and it is found that the systolic structure proposed is specifically suitable for Reed–Solomon Codec; (iii) an efficient digit-serial systolic multiplier is presented in [5]; (iv) the same authors reported a unified digit-serial systolic multiplier based on trinomials and all-one-polynomials [6]; (v) a low-complexity systolic multiplier is given in [7], where its complexity is optimized to be minimal; (vi) an efficient resource-sharing technique is employed in another digit-serial systolic multiplier to achieve low critical-path and high-performance operation [8]; and (vii) an efficient systolic digit-serial multipliers is reported in [9], where the complexity is so far the least in the literature. These designs, undoubtedly, represent the major advance in the field of systolic digit-serial multipliers.

On the other side, however, the existing digit-serial systolic finite field multipliers, more or less, still have some drawbacks to be overcome: (i) although the digit-serial systolic multipliers have relatively few processing elements (PEs), the register-complexity of the multipliers is still large; and (ii) the current digit-serial multipliers are designed to be fixed field-size, and thus cannot provide enough flexibility to meet the current technology trend, i.e., one cryptosystem can meet different security level (field-size) need and the designers have to finalize different field-size multipliers with respect to different application requirement, which is some sort of inefficient in integrated chip (IC) design. Facing with these two challenges, in this paper, we have proposed a novel hybrid-size digit-serial systolic multiplier with low-complexity implementation. The proposed work is carried out through a combination of two coherent interdependent stages’ efforts: (i) a novel hybrid-size digit-serial systolic multiplication algorithm is proposed which provides enough flexibility to both pentanomial- and trinomial-based multipliers; and (ii) the proposed algorithm is then mapped into a novel systolic structure through a series of optimization techniques. Thorough complexity analysis and detailed comparison have also been made to confirm the efficiency of the proposed design, i.e., it not only offers flexibility to be switched from one field-size to another one, but also has smaller area–time complexities compared with the existing single field-size digit-serial systolic multipliers. The proposed design can not only be used as a standard intellectual property (IP) core for various field-size cryptosystem, but also can be employed as a core computation unit in reconfigurable cryptographic processor (where demands flexible field-size choice).

The rest of the paper is organized as follows: Section 2 presents the mathematical formulation of the proposed digit-serial multiplication algorithm. Section 3 shows the detailed steps of the proposed systolic structure mapped from the algorithm. The analysis and comparison are provided in Section 4. The conclusion is given in Section 5.

2. Mathematical Formulation of the Proposed Multiplication Algorithm

Let the three elements A, B, and C

\in G F (2^{m})

and let the polynomial basis be the

{1, x, x^{2}, \dots}

, where x is the root of

f (x)

(

f (x)

determines the field) [1]. Suppose for two field-sizes with

m_{1}

and

m_{2}

, and

m_{1} < m_{2}

, we can first define that

A_{1} = \sum_{i = 0}^{m_{1} - 1} a_{i} x^{i}, B_{1} = \sum_{i = 0}^{m_{1} - 1} b_{i} x^{i}, C_{1} = \sum_{i = 0}^{m_{1} - 1} c_{i} x^{i},

(1)

and

A_{2} = \sum_{i = 0}^{m_{2} - 1} a_{i} x^{i}, B_{2} = \sum_{i = 0}^{m_{2} - 1} b_{i} x^{i}, C_{2} = \sum_{i = 0}^{m_{2} - 1} c_{i} x^{i},

(2)

where

a_{i}

,

b_{i}

, and

c_{i}

\in G F (2)

and it is clear that

a_{i}

in both

A_{1}

and

A_{2}

are the same (for

0 \leq i \leq m_{2} - 1

), and the same applies to

b_{i}

and

c_{i}

.

Suppose in the field-size of

m_{1}

, let

C_{1}

be the product of

A_{1}

and

B_{1}

(corresponding field polynomial is

f_{1} (x)

), we can have

\begin{matrix} C_{1} = & A_{1} B_{1} \mod f_{1} (x) \\ = & \sum_{i = 0}^{m_{1} - 1} b_{i} (x^{i} \cdot A_{1}) \mod f_{1} (x) \\ = & \sum_{i = 0}^{m_{1} - 1} b_{i} A_{1}^{(i)}, \end{matrix}

(3)

where

A_{1}^{(0)} = A_{1}

and

A_{1}^{(i)} = x^{i} \cdot A_{1} \mod f_{1} (x)

. Similarly, for the field-size of

m_{2}

, we can have

C_{2}

as the product of

A_{2}

and

B_{2}

(the field polynomial is

f_{2} (x)

):

\begin{matrix} C_{2} = & A_{2} B_{2} \mod f_{2} (x) \\ = & \sum_{i = 0}^{m_{2} - 1} b_{i} (x^{i} \cdot A_{2}) \mod f_{2} (x) \\ = & \sum_{i = 0}^{m_{2} - 1} b_{i} A_{2}^{(i)}, \end{matrix}

(4)

where, similarly, we have

A_{2}^{(0)} = A_{2}

and

A_{2}^{(i)} = x^{i} \cdot A_{2} \mod f_{2} (x)

. One can then have

\begin{matrix} C_{2} = & \sum_{i = 0}^{m_{1} - 1} b^{i} A_{2}^{(i)} + \sum_{i = m_{1}}^{m_{2} - 1} b^{i} A_{2}^{(i)} \end{matrix}

(5)

Then, after comparing Equation (3) with Equation (5), we can have

C_{j} = \sum_{i = 0}^{m_{1} - 1} b^{i} A_{j}^{(i)} + k_{j} \cdot \sum_{i = m_{1}}^{m_{2} - 1} b^{i} A_{2}^{(i)},

(6)

where

j = 1

or 2 according to Equations (3) and (5), respectively, and

\begin{matrix} k_{j} = 0, if j = 1 \\ k_{j} = 1, if j = 2 . \end{matrix}

(7)

Then, we can have the following definitions:

For any integer of

m_{2}

, we have

m_{2} = w \cdot d

(meanwhile, one can have

m_{1} = w \cdot d_{1}

); then, we can define

\begin{matrix} B_{1} = & [b_{0}, b_{1}, \dots, b_{w - 1}] \\ B_{2} = & [b_{w}, b_{w + 1}, \dots, b_{2 w - 1}] \\ \dots \dots \dots \\ B_{d} = & [b_{m_{2} - w}, b_{m_{2} - w + 1}, \dots, b_{m_{2} - 1}] . \end{matrix}

(8)

Similarly, we can have

\begin{matrix} A_{1} = & [A_{j}^{(0)}, A_{j}^{(1)}, \dots, A_{j}^{(w - 1)}] \\ A_{2} = & [A_{j}^{(w)}, A_{j}^{(w + 1)}, \dots, A_{j}^{(2 w - 1)}] \\ \dots \dots \dots \\ A_{d} = & [A_{2}^{(m_{2} - w)}, A_{2}^{(m_{2} - w + 1)}, \dots, A_{2}^{(m_{2} - 1)}], \end{matrix}

(9)

where we assume

m_{2} - m_{1} > w

.

It is clear that we can now transfer Equation (6) into

\begin{matrix} C_{j} = & A_{1} B_{1}^{T} + A_{2} B_{2}^{T} + \dots + k_{j} A_{d} B_{d}^{T} \\ = & \sum_{u = 1}^{d} ξ (k_{j}) A_{u} B_{u}^{T}, \end{matrix}

(10)

where

ξ (k_{j})

works for terms only when

m_{1} \leq i \leq m_{2} - 1

and

\begin{matrix} ξ (k_{j}) = 1, if j = 2 \\ ξ (k_{j}) = 0, if j = 1, \end{matrix}

(11)

where we can see that Equation (10) can be used to perform two field-size finite field multiplications if we select the control signal properly.

The above equations can thus be summarized as Algorithm 1.

Algorithm 1. Proposed multiplication algorithm for hybrid field-size-based implementation

Inputs:

A_{1}

and

B_{1}

(also

A_{2}

and

B_{2}

) are the pair of elements (polynomial basis representation) in

G F (2^{m})

for

field-size of

m_{1}

and

m_{2}

, respectively

Output:

C_{j} = A_{j} \cdot B_{j} \mod f_{j} (x)

for

j = 1 or 2

and

f_{j} (x)

is the field polynomial

1. Initialization step

1.1 Define

B_{1}

,

B_{2}

, …,

B_{d}

for

m_{2} = w \cdot d

according to (8)
1.2 Define

A_{1}

,

A_{2}

, …,

A_{d}

for

m_{2} = w \cdot d

according to (9)
1.3 Define

D = 0

2. Multiplication step

2.1 According to the field-size selection signal, determine the value of j
2.2 For

1 \leq u \leq d

2.3

D = D + ξ (k_{j}) A_{u} B_{u}^{T}

2.4 End for

3. Final step

3.1 Get

C_{j} = D

The detailed processes of Steps 2.2 and 2.4 are the key multiplication processes.

Note that, due to the difference of the field polynomial

f_{j} (x)

, the process of deriving

A_{j}^{(i + 1)}

from

A_{j}^{(i)}

is slightly different from each other. For instance, assume

A_{j}^{(i)} = \sum_{v = 0}^{m_{j} - 1} a_{j \to v}^{(i)} x^{v},

(12)

where we can have

\begin{matrix} a_{j \to 0}^{(i + 1)} & = a_{j \to m_{j} - 1}^{(i)} \\ a_{j \to s}^{(i + 1)} & = a_{j \to s - 1}^{(i)} + a_{j \to m - 1}^{(i)} \\ a_{j \to v}^{(i + 1)} & = a_{j \to v - 1}^{(i)}, for 1 \leq v \leq m_{j} - 1, \end{matrix}

(13)

when

f_{j} (x)

is a trinomial of

f_{j} (x) = x^{m_{j}} + x^{s} + 1

. While, for pentanomial

f_{j} (x) = x^{m_{j}} + x^{s_{1}} + x^{s_{2}} + x^{s_{3}} + 1

, we can have

\begin{matrix} a_{j \to 0}^{(i + 1)} & = a_{j \to m_{j} - 1}^{(i)}, a_{j \to s_{1}}^{(i + 1)} = a_{j \to s_{1} - 1}^{(i)} + a_{j \to m_{j} - 1}^{(i)} \\ a_{j \to s_{2}}^{(i + 1)} & = a_{j \to s_{2} - 1}^{(i)} + a_{j \to m_{j} - 1}^{(i)}, a_{j \to s_{3}}^{(i + 1)} = a_{j \to s_{3} - 1}^{(i)} + a_{j \to m_{j} - 1}^{(i)} \\ a_{j \to v}^{(i + 1)} & = a_{j \to v - 1}^{(i)}, for 1 \leq v \leq m_{j} - 1 and v \neq s_{1}, s_{2}, s_{3} . \end{matrix}

(14)

Besides that, one has to note that the National Institute of Standards and Technology (NIST) has recommended five irreducible polynomials for ECC implementation [10,11] (three pentanomials and two trinomials). Without loss of generality, we can assume

m_{1}

is a pentanomial and

m_{2}

is a trinomial. The corresponding structure presented below is also based on this assumption.

3. Proposed Hybrid-Size Digit-Serial Systolic Multiplier

In this section, we propose several optimization technique to successfully map the corresponding algorithm into desired systolic structure. Specifically:

3.1. Novel Input Data Broadcasting Scheme

One major component of the register-complexity of a systolic finite field multiplier comes from the input data broadcasting. In this subsection, we propose a novel input data broadcasting that the main inputs to each PE are fed independent from each other and thus the relation of these data between the PEs is reduced to minimum, which can significantly reduce the related register-complexity among systolic array. In Figure 1, the proposed input data broadcasting technique is employed.

As shown in Figure 1, according to Step 2.3 of Algorithm 1, each PE in the systolic array is fed with two inputs, namely the

A_{j}^{(i)}

and the corresponding

b_{i}

. The output of each PE is then transferred to the next PE on its right. The complete output can be delivered after

(d + w)

cycles, with the help of an extra accumulation cell. Since differences exist among all the

A_{j}^{(i)}

, we have used the selective connection to rightly connect each PE according to Algorithm 1. Because only one signal pipelining to the next PE is used, the register-complexity of the systolic array is significantly reduced. The details of the internal structures of these PEs are shown below. Note that, due to the simple internal structure of the PEs, i.e., critical-path of the PE is quite small, the proposed broadcasting technique has very limited influence on the overall time complexity.

3.2. Proper Arrangement on the Input Data Delivery

The two inputs, i.e.,

A_{j}

and

B_{j}

, must be properly arranged to meet the data dependence requirement for hybrid-size operation. For

B_{j}

, according to Algorithm 1, all bits are delivered in a grouped-sequential way, which can be realized by the structure, as shown in Figure 2. One can see that the shift-register is producing the required output bits to each PE of Figure 1 based on Algorithm 1, while the hybrid-size selection is done by the inserting of an extra MUX (MUX is short for multiplexer) in the shifting path such that the shift-register can be working under the field-size of either

m_{1}

or

m_{2}

through the proper control of the MUX (control signal).

The operand

A_{j}

, through the help of PE-0, delivers the correct output bits to each PE according to Algorithm 1, which requires a more sophisticated structure, as shown in Figure 3a. From Equations (13) and (14), one can observe that there are one XOR gate involved when obtaining

A_{2}^{(i + 1)}

from

A_{2}^{(i)}

(trinomial-based multiplication) while it requires three XOR gates when deriving

A_{1}^{(i + 1)}

from

A_{1}^{(i)}

. Thus, according to the data dependence requirement of Algorithm 1, one can find that there are d number of operands being delivered from PE-0 at the same cycle period, i.e.,

A_{j}^{(w u)}, A_{j}^{(w u + 1)}, \dots, A_{j}^{(w u + d - 1)}

. These operands, in fact, involve multiple identical bits between each other and thus can be shared. For simplicity of discussion, let assume

f_{2} (x) = x^{233} + x^{74} + 1

and

f_{1} (x) = x^{163} + x^{7} + x^{6} + x^{3} + 1

, we can have (

d_{1} = 13

,

d = 16

,

a_{0}, a_{1}, a_{2}, \dots, a_{162}

for

A_{1}

, and

a_{0}, a_{1}, a_{2}, \dots, a_{232}

for

A_{2}

)

\begin{matrix} [A_{1}^{(0)} A_{1}^{(1)} A_{1}^{(2)} \dots A_{1}^{(12)}] \\ = & [\begin{matrix} a_{0} & a_{162} & a_{161} & \dots & a_{151} \\ a_{1} & a_{0} & a_{162} & \dots & a_{152} \\ a_{2} & a_{1} & a_{0} & \dots & a_{153} \\ a_{3} & a_{2} + a_{162} & a_{1} + a_{161} & \dots & a_{154} + a_{150} \\ a_{4} & a_{3} & a_{2} & \dots & a_{155} \\ a_{5} & a_{4} & a_{3} & \dots & a_{156} \\ a_{6} & a_{5} + a_{162} & a_{4} + a_{161} & \dots & a_{157} + a_{150} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ a_{162} & a_{161} & a_{160} & \dots & a_{150} \end{matrix}], \end{matrix}

(15)

and

\begin{matrix} [A_{2}^{(0)} A_{2}^{(1)} A_{2}^{(2)} \dots A_{2}^{(15)}] \\ = & [\begin{matrix} a_{0} & a_{232} & a_{231} & \dots & a_{218} \\ a_{1} & a_{0} & a_{232} & \dots & a_{219} \\ a_{2} & a_{1} & a_{0} & \dots & a_{220} \\ a_{3} & a_{2} & a_{1} & \dots & a_{221} \\ a_{4} & a_{3} & a_{2} & \dots & a_{222} \\ a_{5} & a_{4} & a_{3} & \dots & a_{223} \\ a_{6} & a_{5} & a_{4} & \dots & a_{224} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ a_{232} & a_{231} & a_{230} & \dots & a_{217} \end{matrix}], \end{matrix}

(16)

where we can see that the identical bits, e.g.,

a_{0}, a_{1}, \dots

, can be shared among these

A_{j}^{(i)}

(

0 \leq i \leq 12

), as shown by the example in Figure 3b (where we have shown how the MUXes are located to obtain hybrid-size implementation). Since other bits cannot be shared, we just use the MUX to connect with the two bits at the same position (according to Equations (8) and (9)) such that, through the proper working of these MUXes, the correct signals can be produced to the corresponding PE.

One can also notice that, according to Equations (9), (13), and (14), with the help of a modular operation (done by the modular cell in Figure 3), the PE-0 delivers the corresponding output to each PE, i.e., obtaining

A_{u}

from

A_{u - 1}

(for

1 \leq u \leq d / d_{1}

), which needs a delay time of

2 T_{X}

(

T_{X}

is the delay time of an XOR gate, and it takes

T_{X}

for trinomial-based multiplier and

2 T_{X}

for pentanomial-based one [9]). Besides that, one has to note that all the

A_{j}^{(i)}

in one specific

A_{u}

can be obtained through the sharing of identical bits, as represented by the selective connection in Figure 1 and Figure 3. Following this arrangement, the proposed hybrid-size structure operates in an ordered form according to Algorithm 1.

3.3. Hybrid Accumulation

The accumulation of the digit-serial operation also needs adjustment when compared with the conventional ones. As shown in Figure 4, where we have used a

m_{1}

-bit MUX cell to obtain the hybrid-size accumulation (where the accumulation cell is realized through the XOR cell connected with the register cell in a back-loop style). Note that these

m_{1}

bit-level MUXes connect with the

m_{1}

-bit output of PE-

d_{1}

, while the remaining

(m_{2} - m_{1})

bits of PE-d are directly connected with the accumulation cell. According to Equations (8) and (9), and Algorithm 1, we can let the MUX determine the multiplier is working under the condition of field-size of either

m_{1}

bits or

m_{2}

bits. Besides that, the number of output bits is also selected according to the specific chosen field-size, as shown in Figure 4, i.e., after designated number of cycle periods, the output is produced based on the value of the control signal.

3.4. Final Structure

The internal structure of each PE is shown in Figure 5b, where it mainly consists of an AND cell, an XOR cell, and a register cell. With the combination of all the optimization techniques introduced above, we have presented the finalized proposed hybrid-size digit-serial structure, as shown in Figure 5a. All the control signals connected with the inserted MUXes collaborate together to switch the finite field multiplier from operating in one field-size to another. After designated cycle periods of accumulation, the multiplier delivers the desired output.

4. Complexity and Comparison

For simplicity of discussion, we just follow the assumption in Section 3 that

m_{1}

comes from a pentanomial while

m_{2}

is the field-size of a trinomial. The detailed complexity of the proposed multiplier is: (i) Systolic array: The systolic array has d number of PEs, where each PE has

m_{2}

AND gates,

m_{2}

XOR gates, and

m_{2}

registers. (ii) Shift-register: The shift-register for

B_{j}

requires

m_{2}

registers and one MUX. (iii) Accumulation cell: The accumulation cell requires

m_{1}

MUXes,

m_{2}

XORs, and

m_{2}

registers. (iv) PE-0: There are in total

(3 d_{1} + d - 4)

XOR gates,

(4 d_{1} - 4)

MUXes, and

(m_{2} + 3 d_{1} + d - 4)

registers involved. Moreover, the proposed structure has a critical-path of

(2 T_{X} + T_{M})

(

T_{M}

is the delay time of an MUX), and it takes

(d + w)

cycles to produce the desired output for hybrid-size operation. Overall, the complexity of the proposed design is listed along with the existing digit-serial multipliers (trinomial- or pentanomial-based designs) in Table 1 in terms of logic gates number, register number, latency (number of cycle periods), and critical-path. Note that the designs of [5,6] are based on all-one-polynomials (or used all-one-polynomials as a computation core), we thus do not list them in Table 1, just for a fair comparison. As shown in Table 1, one can see that the proposed hybrid-size digit-serial multiplier has relatively better area–time complexities than the existing ones, especially when considering that the proposed one can offer hybrid field-size operation (the existing ones are all single field-size based). To have a detailed comparison, we have also used the NanGate’s Library Creator and the 45-nm FreePDK Base Kit from North Carolina State University (NCSU) [12] to estimate the area and time complexities of all the designs for

m_{2} = 233, m_{1} = 163

,

d = 16

, and

d_{1} = 13

. The obtained area, delay (latency time), power, area-delay product (ADP), and power-delay product (PDP) are listed in Table 2 for a comparison. Again, we can observe that the proposed one has better performance than the existing ones, e.g., it has at least 7.3% less ADP than the best trinomial one of [8], while it offers the flexibility to execute the pentanomial-based multiplier. Compared with the existing pentanomial ones, the proposed one still has better ADP when considering the scaling of the field-size. The proposed one also has 41.5% less ADP and 34.6% less PDP than the conventional hybrid field-size implementation (we have combined the best existing ones of [8,9] together to realize it).

The proposed hybrid-size digit-serial systolic multiplier, undoubtedly, can be extended as a standard IP core in various cryptosystems that demand different security levels. On the other hand, due to the low-complexity of the proposed design, it can also be used in cryptosystem for flexible operation, in the case the user of that cryptosystem needs to change/upgrade the system. Moreover, it is worth mentioning that the proposed hybrid field-size strategy can also be extended to multiple filed-size implementation.

5. Conclusions

This paper presents a novel implementation of a hybrid field-size digit-serial systolic multiplier over

G F (2^{m})

. A novel digit-serial multiplication algorithm suitable for hybrid field-size realization is proposed first. Then, through a series of optimization techniques, the proposed algorithm is successfully mapped into a high-performance digit-serial systolic multiplier. The complexity analysis and detailed comparison have been given to confirm the efficiency of the proposed design. Future work may focus on the application of the proposed design in various cryptosystems.

Author Contributions

Resources, Z.H.; Writing—Review & Editing, J.X.

Funding

This research received no external funding

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

IP	Intellectual property
ECC	Elliptic curve cryptography
PE	Processing elements
IC	Integrated chip
NCSU	North Carolina State University

References

Blake, I.; Seroussi, G.; Smart, N.P. Elliptic Curves in Cryptography; London Mathematical Society Lecture Note Series; Cambridge University Press: Cambridge, UK, 1999. [Google Scholar]
Xie, J.; Meher, P.K.; He, J. Low-latency area-delay-efficient systolic multiplier over GF(2^m) for a wider class of trinomials using parallel register sharing. In Proceedings of the 2012 IEEE International Symposium on Circuits and Systems, Seoul, Korea, 20–23 May 2012; pp. 89–92. [Google Scholar]
Systolic Array. Available online: https://en.wikipedia.org/wiki/Systolic-array (accessed on 25 September 2018).
Kim, C.H.; Hong, C.P.; Kwon, S. A digit-serial multiplier for finite field GF(2^m). IEEE Trans. Very Large Scale Integr. Syst. 2005, 13, 476–483. [Google Scholar]
Meher, P.K. Systolic and non-systolic scalable modular designs of finite field multipliers for Reed-Solomon Codec. IEEE Trans. Very Large Scale Integr. Syst. 2009, 17, 747–757. [Google Scholar] [CrossRef]
Talapatra, S.; Rahaman, H.; Mathew, J. Low complexity digit serial systolic montgomery multipliers for special class of GF(2^m). IEEE Trans. Very Large Scale Integr. Syst. 2010, 18, 847–852. [Google Scholar] [CrossRef]
Talapatra, S.; Rahaman, H.; Saha, S.K. Unified digit serial systolic montgomery multiplication architecture for special classes of polynomials over. In Proceedings of the 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, Lille, France, 1–3 September 2010; pp. 427–432. [Google Scholar]
Pan, J.-S.; Lee, C.-Y.; Meher, P.K. Low-latency digit-serial and digit-parallel systolic multipliers for large binary extension fields. IEEE Trans. Circuits Syst. I Regul. Pap. 2013, 60, 3195–3204. [Google Scholar] [CrossRef]
Xie, J.; Meher, P.K.; Mao, Z. High-throughput digit-level systolic multiplier over GF(2^m) based on irreducible trinomials. IEEE Trans. Circuits Syst. II 2015, 62, 481–485. [Google Scholar] [CrossRef]
Xie, J.; Meher, P.K.; Mao, Z.-H. Low-latency high-throughput systolic multipliers over GF(2^m) for NIST recommended pentanomials. IEEE Trans. Circuits Syst. I Regul. Pap. 2015, 62, 881–890. [Google Scholar] [CrossRef]
NIST. Boulder, CO, USA. Available online: http://www.csrc.nist.gov/publications (accessed on 25 September 2018).
Nangate Standard Cell Library. Available online: http://www.si2.org/openeda.si2.org/projects/nangatelib (accessed on 25 September 2018).

Figure 1. The proposed input data broadcasting technique.

Figure 2. The proposed shift-register to deliver input data

B_{j}

.

Figure 2. The proposed shift-register to deliver input data

B_{j}

.

Figure 3. The internal structure of PE-0 to deliver input data

A_{j}

: (a) internal structure; and (b) an example of how the MUXes are inserted to obtain hybrid-size implementation for the modular cell in PE-0.

Figure 3. The internal structure of PE-0 to deliver input data

A_{j}

: (a) internal structure; and (b) an example of how the MUXes are inserted to obtain hybrid-size implementation for the modular cell in PE-0.

Figure 4. The detailed arrangement of how the accumulation cell operate to realize the hybrid-size implementation (only

m_{1}

bit-level MUXes are employed and the remaining

(m_{2} - m_{1})

bits of the output of PE-d are directly connected with the accumulation cell), where the black box in the accumulation cell denotes the register cell.

Figure 4. The detailed arrangement of how the accumulation cell operate to realize the hybrid-size implementation (only

m_{1}

bit-level MUXes are employed and the remaining

(m_{2} - m_{1})

bits of the output of PE-d are directly connected with the accumulation cell), where the black box in the accumulation cell denotes the register cell.

Figure 5. The final structure: (a) the proposed structure; and (b) the internal structure of a regular PE.

Table 1. Comparison of Area–Time Complexities of Various Digit-Serial Systolic Multipliers.

Design	AND	XOR	Register	Latency	Critical-Path
Digit-serial systolic structures (trinomial of size $m_{2}$ )
[7] $^{1}$	$m_{2} d$	$d (2 + m_{2}) + 1$	$2 m_{2} d + 3 m_{2}$	$d + w$	$T_{A} + T_{X}$
[8] $^{1}$	$m_{2} d$	$m_{2} d + m_{2} - d + 1$	$2 m_{2} d + 2 m_{2}$	$d + w$	$T_{A} + T_{X}$
Digit-serial systolic structures (pentanomial of size $m_{1}$ )
[4] $^{2}$	$2 m_{1} d_{1} + m_{1}$	$2 m_{1} d_{1}$	$m_{1} / d_{1} (10 d_{1} +$	$3 m_{1} / d_{1}$	$d_{1} (T_{A} + T_{X} +$
[4] $^{2}$	$2 m_{1} d_{1} + m_{1}$	$2 m_{1} d_{1}$	$1 + 9 s d_{1} / 2 + s)$	$3 m_{1} / d_{1}$	$T_{M U X}) / (s + 1)$
DS-I [9]	$m_{1} d_{1}$	$m_{1} d_{1} + 3 m_{1}$	$2 d_{1} m_{1} + 2 m_{1}$	$d_{1} + w_{1}$	$2 T_{X}$
DS-I [9]	$m_{1} d_{1}$	$- 3 m_{1} / d_{1} + 3$	$2 d_{1} m_{1} + 2 m_{1}$	$d_{1} + w_{1}$	$2 T_{X}$
Hybrid-size digit-serial systolic structures (pentanomial of size $m_{1}$ and trinomial of size $m_{2}$ )
Proposed $^{3}$	$m_{2} d$	$m_{2} d + m_{2}$	$m_{2} d + 2 m_{2}$	$d + w$	$2 T_{X} + T_{M}$
Proposed $^{3}$	$m_{2} d$	$3 d_{1} + d - 4$	$+ 3 d_{1} + d - 4$	$d + w$	$2 T_{X} + T_{M}$

T_{A}

, delay time of an AND gate;

T_{M}

, delay time of an MUX;

^{1}

: For a fair comparison, the input bits of operand

B_{j}

fed to each PE is finalized as 1 for [7,8] (a shift-register of

m_{2}

bits is also added).

^{2}

:

(s + 1)

refers to the pipelined stage in the PE (here we choose s = 1), and

2 m_{1}

of MUX gates are not listed here.

^{3}

: There are also

(m_{1} + 4 d_{1} - 3)

MUXes involved.

Table 2. Comparison of Area–Time Complexities of Various Digit-Serial Systolic Multipliers.

Design	Area ( $μ$ m $^{2}$ )	Delay (ns) $^{1}$	Power ( $μ$ W/GHz)	ADP ( $μ$ m $^{2} \times$ ns)	PDP ( $μ$ W/GHz× ns)
Digit-serial trinomial-based ( $m_{2} = 233, d = 16$ )
[7]	46,958	4.48	56,068	$210, 372$	251,185
[8]	46,174	4.48	55,726	$206, 860$	249,652
Digit-serial pentanomial-based ( $m_{1} = 163, d_{1} = 13$ )
[4]	22,675	45.63	35,412	1,034,660	1,615,850
[9]	26,998	4.16	33,139	112,312	137,858
Hybrid-size ( $m_{1} = 163$ and $m_{2} = 233$ )
Traditional $^{2}$	73,172	4.48	88,865	327,811	398,115
Proposed	29,961	6.4	40,672	191,750	260,301

^{1}

: delay = latency cycle number × critical-path.

^{2}

: Refers to the conventional implementation of two field-size finite field multipliers; we have used the best existing ones of [8,9] to be combined together.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hu, Z.; Xie, J. Novel Hybrid-Size Digit-Serial Systolic Multiplier over GF(2^m). Symmetry 2018, 10, 540. https://doi.org/10.3390/sym10110540

AMA Style

Hu Z, Xie J. Novel Hybrid-Size Digit-Serial Systolic Multiplier over GF(2^m). Symmetry. 2018; 10(11):540. https://doi.org/10.3390/sym10110540

Chicago/Turabian Style

Hu, Zhenji, and Jiafeng Xie. 2018. "Novel Hybrid-Size Digit-Serial Systolic Multiplier over GF(2^m)" Symmetry 10, no. 11: 540. https://doi.org/10.3390/sym10110540

APA Style

Hu, Z., & Xie, J. (2018). Novel Hybrid-Size Digit-Serial Systolic Multiplier over GF(2^m). Symmetry, 10(11), 540. https://doi.org/10.3390/sym10110540

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Novel Hybrid-Size Digit-Serial Systolic Multiplier over GF(2^m)

Abstract

1. Introduction

2. Mathematical Formulation of the Proposed Multiplication Algorithm

3. Proposed Hybrid-Size Digit-Serial Systolic Multiplier

3.1. Novel Input Data Broadcasting Scheme

3.2. Proper Arrangement on the Input Data Delivery

3.3. Hybrid Accumulation

3.4. Final Structure

4. Complexity and Comparison

5. Conclusions

Author Contributions

Funding

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI