Low-Space Bit-Parallel Systolic Structure for AOP-Based Multiplier Suitable for Resource-Constrained IoT Edge Devices

Security and privacy issues with IoT edge devices hinder the adoption of IoT technology in many applications. Applying cryptographic protocols to edge devices is an effective way to address these security issues, but implementing such protocols on edge devices is a significant challenge because of their limited resources. Finite-field multiplication is the core operation of most cryptographic protocols, and its efficient implementation has a marked impact on their performance. This article offers an efficient low-area and low-power one-dimensional bit-parallel systolic implementation for field multiplication in $GF(2^n)$ based on an irreducible all-one polynomial (AOP). We represent the adopted multiplication algorithm in bit-level form in order to extract its dependency graph (DG). We then apply specific scheduling and projection vectors to the DG to extract the bit-parallel systolic multiplier structure. In contrast with most previously published parallel multiplier structures, which have an area complexity of the order of $O(n^2)$, the proposed structure has an area complexity of the order of $O(n)$. The complexity analysis of the proposed multiplier structure shows that it exhibits a meaningful reduction in area compared to most of the compared parallel multipliers. To confirm the results of the complexity analysis, we performed an ASIC implementation of the proposed and the existing efficient multiplier structures using an ASIC CMOS library. The obtained ASIC synthesis report shows that the proposed multiplier structure displays significant savings in terms of its area, power consumption, area-delay product (ADP), and power-delay product (PDP). It offers average savings in space of nearly 33.7%, average savings in power consumption of 39.3%, average savings in ADP of 24.8%, and savings in PDP of 31.2% compared to the competitive existing multiplier structures. 
The achieved results make the proposed multiplier structure more suitable for utilization in resource-constrained devices such as IoT edge devices, smart cards, and other compact embedded devices.


Introduction
The Internet of Things (IoT) currently plays a crucial role in our daily life. IoT devices are used in many fields, such as healthcare, automobiles, entertainment, industrial appliances, agriculture, and homes. The main goal of an IoT network is to collect data and transfer them to the cloud for further analysis and decision-making. Since most applications that use IoT technology are sensitive, the data collected by IoT devices should be protected throughout all the layers of the IoT network. Due to the limited resources of most IoT edge devices, implementing security mechanisms on these devices is a great challenge, and many efforts have been made to address it. Many security protocols have been proposed that are suitable for implementation on resource-constrained IoT edge devices. Furthermore, cryptographic algorithms such as elliptic curve cryptography (ECC) have been optimized for implementation on these devices. Most of the optimized algorithms depend mainly on finite-field arithmetic operations, in particular finite-field multiplication. Field multiplication is the core operation of the other field operations, such as inversion, division, and exponentiation [1]. Therefore, it has attracted great interest as a means of implementing compact and highly efficient cryptographic algorithms [2][3][4][5][6][7][8][9][10][11][12].

Literature Review
We can perform finite-field multiplication on different bases, such as a polynomial basis (PB), a normal basis (NB), and a dual basis (DB) [13]. PB multiplication is comparatively simple and does not require basis conversion, as the others do. The need for basis conversion increases the hardware complexity of the multipliers. As a result, PB multipliers are widely employed in a variety of cryptographic techniques.
Due to the simplicity and efficiency of their implementation in hardware, several hardware algorithms and structures for PB multiplication for fields formed by trinomials and pentanomials have been proposed in the literature [14][15][16][17][18]. Compared to trinomial- and pentanomial-based multipliers, all-one polynomials (AOPs) represent a special class that can be employed for simpler and more efficient implementation. As a result, the AOP-based element representation is expected to find use in efficient hardware implementations of elliptic curve cryptosystems and error control coding [19]. Although irreducible AOPs are not as common as irreducible trinomials or pentanomials, finding the AOP bases for creating finite fields is not difficult [19]. Efficient structures for finite-field multiplication for AOP-formed fields have been offered in the literature [19][20][21].
To obtain an efficient VLSI implementation for finite-field multipliers, we should offer hardware structures that have regular modules with local interconnections, such as systolic arrays. The main building blocks of a systolic array are processing elements (PEs). The PEs perform a specific task and can be organized in one-dimensional or two-dimensional space. Systolic multiplier structures can be arranged as bit-parallel [4,5,8,22-27] or bit-serial systolic structures [2,28,29]. Bit-parallel systolic architectures are well known for their significant area cost and high power consumption; in return, they execute their assigned tasks at high speed. Bit-serial systolic architectures show significant savings in area and power consumption, but with a high degradation in execution speed.
In the literature, many authors have tried to offer efficient implementations of bit-parallel systolic multipliers over the binary extension field $GF(2^n)$. Most of them used a specific irreducible polynomial to build their structures. In 2001, Lee et al. [22,23] offered a bit-parallel systolic multiplier structure based on equally spaced polynomials and AOPs. In 2005, Lee et al. [24] suggested a mapping approach that transforms the bit-parallel systolic multiplier based on AOP into one based on trinomials to decrease its complexity. In 2008, Lee et al. [25] used a Toeplitz matrix-vector representation to decrease the complexity of the recommended Montgomery-based bit-parallel multiplier. In 2015, Sarmadi [26] offered a bit-parallel systolic multiplier with low hardware complexity and high throughput. The recommended multiplier is based on the Montgomery algorithm and uses trinomials as the reduction polynomial. In 2018, Mathe [27] adopted an interleaving multiplication algorithm over the binary extension field to implement a bit-parallel systolic multiplier with low hardware complexity.

Paper Contribution
In this paper, we offer a low-space bit-parallel systolic implementation for finite-field multiplication. The multiplication operation is performed over $GF(2^n)$ and is based on the irreducible all-one polynomial (AOP). We present the adopted multiplication algorithm, offered by [19], in bit-level form in order to extract its dependency graph (DG). The DG helps us to extract the proposed low-complexity bit-parallel systolic multiplier structure by choosing proper time-scheduling and node-projection functions. The multiplier structure offered here has an area complexity of the order of $O(n)$, which differentiates it from most of the previously reported ones, which have an area complexity of the order of $O(n^2)$. Therefore, the proposed structure achieves a remarkable reduction in area complexity and power consumption. The reduction in area does not lead to a deterioration in the performance of the recommended multiplier structure, which shows almost the same timing delays as the previously reported ones. Furthermore, it has a regular systolic structure with local communication between its constituent PEs, making it suitable for VLSI implementation. Local interconnection between the PEs reduces the wire delays and hence improves the overall performance of the multiplier structure. Due to its significant savings in area and power consumption, the recommended multiplier structure is well suited for use in resource-constrained IoT edge devices or compact embedded devices.

Paper Organization
The layout of this paper can be summarized as follows: Section 2 presents the mathematical formulation of the adopted multiplication algorithm and its representation in bit-level form. Section 3 explains the developed DG of the adopted algorithm. Section 4 describes the methodology used to extract the recommended bit-parallel systolic multiplier structure. Section 5 presents the complexity analysis of the proposed multiplier and of the existing efficient structures. Furthermore, it provides a performance evaluation of the suggested multiplier design and other competitive designs based on ASIC synthesis. Conclusions are provided in Section 6.

Formulation of the Finite Field Multiplication Algorithm
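Suppose the irreducible polynomial $R(z)$ of degree $n$ defines the binary extension field $GF(2^n)$. We can represent $R(z)$ in polynomial form as follows:

$$R(z) = 1 + r_1 z + \cdots + r_i z^i + \cdots + r_{n-1} z^{n-1} + z^n \tag{1}$$

where $r_i \in GF(2)$. Let $\beta$ represent a root of the irreducible polynomial $R(z)$. Therefore, the polynomial basis $\{1, \beta, \beta^2, \beta^3, \cdots, \beta^{n-1}\}$ can be used to represent the field elements. Suppose $E$ and $H$ are any two field elements in $GF(2^n)$. They can be represented as polynomials of degree $n - 1$:

$$E = \sum_{i=0}^{n-1} e_i \beta^i \tag{2}$$

$$H = \sum_{i=0}^{n-1} h_i \beta^i \tag{3}$$

The multiplication of $E$ and $H$ over $GF(2^n)$ can be performed as follows:

$$D = E \cdot H \bmod R(\beta) \tag{4}$$

We can expand Equation (4) to have a recurrence relation of multiplication as follows:

$$D = \left[ h_0 E + h_1 K + h_2 \beta K + \cdots + h_{n-1} \beta^{n-2} K \right] \bmod R(\beta) \tag{5}$$

where $K = \beta E$ is a polynomial of degree $n$ and can be represented as:

$$K = \sum_{i=0}^{n} k_i \beta^i \tag{6}$$

where $k_0 = 0$ and $k_i = e_{i-1}$ for $i = 1, 2, \cdots, n$. By expanding the polynomial of (6) and multiplying by $\beta$, we can obtain

$$\beta K = k_0 \beta + k_1 \beta^2 + \cdots + k_{n-1} \beta^n + k_n \beta^{n+1} \tag{7}$$

Since $\beta$ is a root of $R(z)$, we have $R(\beta) = 0$. Therefore, from Equation (1) we can obtain

$$\beta^n = 1 + r_1 \beta + \cdots + r_{n-1} \beta^{n-1} \tag{8}$$

When the polynomial $R(z)$ is an AOP, that is, $r_i = 1$ for all $i$, we can write Equation (8) as:

$$\beta^n = 1 + \beta + \cdots + \beta^{n-1} \tag{9}$$

If we multiply both sides of Equation (9) by $\beta$, we can obtain:

$$\beta^{n+1} = \beta + \beta^2 + \cdots + \beta^{n-1} + \beta^n = 1 \tag{10}$$

The last equality follows by substituting Equation (9) for $\beta^n$: the terms $\beta, \cdots, \beta^{n-1}$ then appear twice and cancel in $GF(2)$. By substituting from (10) in (7), we can reduce $\beta K$ to a polynomial $K^1$ of degree $n$ as follows:

$$K^1 = k_n + k_0 \beta + k_1 \beta^2 + \cdots + k_{n-1} \beta^n \tag{11}$$

We note based on Equation (11) that the cyclic-shift-left operation of polynomial $K$ produces the partially reduced polynomial $K^1$ of polynomial $\beta K$. Similarly, the cyclic-shift-left operation of polynomial $K^1$ produces the partially reduced polynomial $K^2$ of polynomial $\beta^2 K$. In general, the cyclic-shift-left of polynomial $K^{i-1}$ produces the partially reduced polynomial $K^i$ of polynomial $\beta^i K$. We can mathematically express this cyclic-shift-left operation as:

$$K^i = \mathrm{CSL}(K^{i-1}), \quad 0 \le i \le n - 1 \tag{12}$$

where $K^{-1} = (0 \,\&\, E)$, that is, the bits of $E$ extended with a zero in bit position $n$, so that $K^0 = K$, and CSL represents the cyclic-shift-left operation.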
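The cyclic-shift-left property can be checked numerically. The following is an illustrative sketch, not code from the paper: it assumes a polynomial over GF(2) is stored as a Python integer whose bit j holds the coefficient of β^j, and the helper names `aop`, `poly_mod`, and `csl` are hypothetical. It checks that β^(n+1) reduces to 1 modulo the AOP and that rotating an (n+1)-bit word left by one position gives a polynomial congruent to β times the original.

```python
def aop(n):
    """All-one polynomial of degree n as a bit mask: 1 + z + ... + z^n."""
    return (1 << (n + 1)) - 1


def poly_mod(a, r):
    """Remainder of polynomial a modulo r over GF(2) (bit j = coefficient of z^j)."""
    deg_r = r.bit_length() - 1
    while a.bit_length() - 1 >= deg_r:
        # Cancel the leading term of a with a shifted copy of r.
        a ^= r << (a.bit_length() - 1 - deg_r)
    return a


def csl(k, n):
    """Cyclic shift left by one position of an (n+1)-bit word."""
    msb = (k >> n) & 1
    return ((k << 1) & ((1 << (n + 1)) - 1)) | msb


n = 4                      # z^4 + z^3 + z^2 + z + 1 is an irreducible AOP
r = aop(n)
beta_n_plus_1 = poly_mod(1 << (n + 1), r)   # beta^(n+1) reduced modulo R

# For every (n+1)-bit word K, CSL(K) and beta*K agree modulo R.
csl_matches = all(
    poly_mod(csl(k, n), r) == poly_mod(k << 1, r)
    for k in range(1 << (n + 1))
)
print(beta_n_plus_1, csl_matches)
```

The rotated word is only partially reduced: it can still contain a β^n term, which is why a final reduction step is needed at the end of the multiplication.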
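Equation (12) can help us to rewrite Equation (5) as:

$$D = \left[ \sum_{i=0}^{n-1} h_i K^{i-1} \right] \bmod R(\beta) \tag{13}$$

where $K^{i-1}$ is the partially reduced form of $\beta^i E$. Alternately, we can express Equation (13) as:

$$D = V \bmod R(\beta) \tag{14}$$

where $V$ is the sum of polynomials of degree $n$ that can be accumulated as:

$$V^i = V^{i-1} + h_i K^{i-1}, \quad 0 \le i \le n - 1 \tag{15}$$

where $V^{-1} = 0$ and $V = V^{n-1}$. The polynomial $V$ in Equation (15) can be represented as:

$$V = v_0 + v_1 \beta + \cdots + v_{n-1} \beta^{n-1} + v_n \beta^n \tag{16}$$

By replacing $\beta^n$ in Equation (16) with the expansion given in Equation (9), we obtain the reduced form of polynomial $V \bmod R(z)$ (a polynomial of degree $n - 1$) as:

$$D = \sum_{j=0}^{n-1} (v_j + v_n) \beta^j \tag{17}$$

Assuming that $j$ represents the bit position in the binary string that represents a polynomial, we can express Equations (12) and (15) in bit-level form as shown in Equations (18) and (19), respectively:

$$k^i_{j+1} = k^{i-1}_j, \quad 0 \le j \le n, \quad \text{with } k^i_0 = k^i_{n+1} \tag{18}$$

$$v^i_j = v^{i-1}_j + h_i k^{i-1}_j, \quad 0 \le j \le n \tag{19}$$

Furthermore, the reduced form of the product polynomial $D$, given in Equation (17), can be represented in bit-level form as:

$$d_j = v^{n-1}_j + v^{n-1}_n, \quad 0 \le j \le n - 1 \tag{20}$$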
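The bit-level algorithm can be sketched in software as follows. This is a minimal model under stated assumptions, not the authors' implementation: it reads the recurrences as a cyclic shift left of the (n+1)-bit word K on every iteration, an accumulation v_j ← v_j XOR (h_i AND k_j) using the value of K before the shift, and a final reduction d_j = v_j XOR v_n. Field elements are stored as integers with bit j holding coefficient j, and the function names `aop_multiply` and `reference_multiply` are hypothetical.

```python
def aop_multiply(e, h, n):
    """Bit-level AOP multiplication: accumulate, cyclic-shift, then reduce.

    Assumes e and h are field elements, i.e. 0 <= e, h < 2**n.
    """
    mask = (1 << (n + 1)) - 1
    k = e & mask                      # K^{-1} = (0 & E): bit n is zero
    v = 0                             # V^{-1} = 0
    for i in range(n):
        if (h >> i) & 1:              # v_j <- v_j XOR (h_i AND k_j)
            v ^= k
        k = ((k << 1) & mask) | (k >> n)   # K <- CSL(K)
    v_n = (v >> n) & 1                # most significant accumulated bit
    low = v & ((1 << n) - 1)
    # d_j = v_j XOR v_n for 0 <= j <= n-1
    return low ^ (((1 << n) - 1) if v_n else 0)


def reference_multiply(e, h, n):
    """Schoolbook carry-less product reduced modulo the AOP, for comparison."""
    r = (1 << (n + 1)) - 1            # 1 + z + ... + z^n
    prod = 0
    for i in range(n):
        if (h >> i) & 1:
            prod ^= e << i
    while prod.bit_length() > n:      # degree >= n: cancel the leading term
        prod ^= r << (prod.bit_length() - 1 - n)
    return prod


n = 4
agree = all(
    aop_multiply(e, h, n) == reference_multiply(e, h, n)
    for e in range(1 << n)
    for h in range(1 << n)
)
print(agree)
```

The comparison against a plain polynomial multiplication followed by reduction is only a sanity check for a small field size; it says nothing about hardware cost.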

Dependency Graph
The two iterative equations, Equations (18) and (19), describe the iterative part of the finite-field multiplication algorithm, and the two indices $i$ and $j$ define the iterations. Following the technique of [30], it is possible to develop a dependency graph (DG) in the two-dimensional integer domain $D$. Figure 1 shows the DG for the case in which $n = 5$. The nodes of the DG symbolize the operations represented by Equations (18) and (19). Based on the construction rules of [30], the signals $v^i_j$ are represented by vertical lines, the signals $h_i$ by horizontal lines, and the signals $k^i_j$ by diagonal lines. The signal $k^i_{n+1}$ is generated by the nodes of the last column and assigned to the nodes of the first column. The algorithm inputs $v^{-1}_j$ and $k^{-1}_j = e_j$ are represented in the DG as the vertical and diagonal inputs to the nodes of the top row. As described in the reduction step of the algorithm, Equation (20), the signals $v^{n-1}_j$, $0 \le j \le n - 1$, leaving the bottom row are added to the most significant signal $v^{n-1}_n$, produced by the bottom-right node, to generate the final product bits $d_j$, $0 \le j \le n - 1$. Since the addition is conducted in the binary field $GF(2)$, it can be performed using two-input XOR gates (blue nodes), as displayed in Figure 1.

Figure 1. DG of the adopted algorithm for n = 5.

Extraction of the Bit-Parallel Systolic Multiplier Architecture
In this section we discuss the scheduling and node projection methodologies proposed in [30][31][32] and used to extract the bit-parallel systolic array structure from the adopted finite-field multiplication algorithm.

Scheduling Function
Each node in the DG of Figure 1 is expressed as a point $p(i, j) = [i\ \ j]^T$. Data scheduling is determined through the timing function $t(p)$ using a scheduling vector $s = [s_0\ \ s_1]$, which assigns a time value to each node based on a scheduling function, defined as follows:

$$t(p) = s\,p + u = i s_0 + j s_1 + u \tag{21}$$

where $u$ is a scalar value added to the function to avoid allocating negative time values to any node of the DG. In our case, choosing $u = 0$ assigns only non-negative values to the DG nodes shown in Figure 1. There are restrictions on the possible values of the scheduling vector. For example, the iterations in Equation (19) dictate that point $p = [i, j]$ must be executed after point $p = [i - 1, j]$, i.e.,

$$t(p[i, j]) > t(p[i - 1, j]) \tag{22}$$

Using the value of $s$, we can write the above equation as:

$$s_0 > 0 \tag{23}$$

Another timing limitation is obtained using the iterations in Equation (18), which dictate that $p = [i, j + 1]$ must be executed after point $p = [i - 1, j]$, i.e.,

$$t(p[i, j + 1]) > t(p[i - 1, j]) \tag{24}$$

Using the value of $s$, we can write the above equation as:

$$s_0 + s_1 > 0 \tag{25}$$

Inequalities (23) and (25) allow us to choose valid scheduling vectors. One valid choice is the scheduling vector $s$ given as:

$$s = [1\ \ 0] \tag{26}$$

This choice of scheduling vector results in the associated DG shown in Figure 2. The input signals $v^{-1}_j$ and $k^{-1}_j$ are fed in parallel (i.e., at the same time), and the output signals $v^{n-1}_j$ are obtained in parallel after $n$ clock cycles.
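The scheduling restrictions can be checked mechanically for a small DG. The sketch below is illustrative only: it assumes a timing function of the form t(p) = s·p + u, models the v dependencies as vertical edges and the k dependencies as diagonal edges that wrap from the last column to the first, and uses hypothetical helper names (`dg_edges`, `t`, `is_valid_schedule`).

```python
def dg_edges(n):
    """Dependency edges (src, dst) between DG points p = (i, j).

    Rows i = 0..n-1 and columns j = 0..n, as assumed from the DG description.
    """
    edges = []
    for i in range(1, n):
        for j in range(n + 1):
            edges.append(((i - 1, j), (i, j)))                  # v: vertical
            edges.append(((i - 1, j), (i, (j + 1) % (n + 1))))  # k: diagonal, wraps
    return edges


def t(p, s, u=0):
    """Timing function t(p) = s . p + u (assumed form)."""
    return s[0] * p[0] + s[1] * p[1] + u


def is_valid_schedule(s, n):
    """True when every edge's destination is scheduled after its source."""
    return all(t(dst, s) > t(src, s) for src, dst in dg_edges(n))


n = 5
row_schedule_ok = is_valid_schedule((1, 0), n)     # the vector chosen in the text
column_schedule_ok = is_valid_schedule((0, 1), n)  # violates s0 > 0
print(row_schedule_ok, column_schedule_ok)
```

With s = [1 0], every node of row i receives time value i, so a whole row of the DG executes in the same clock period.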

Projection Function
According to [30], the projection function maps many DG nodes or points $p(i, j)$ to one processing element $\bar{p}$. The resulting processing elements are connected to constitute the systolic array. The projection function can be expressed as follows:

$$\bar{p} = F\,p \tag{27}$$

where $F$ symbolizes a projection matrix. To extract the projection matrix, we should first find a projection vector $L$, which spans the null space of $F$. According to the discussion provided in [30], the following restriction should be applied to the projection vector:

$$s\,L \neq 0 \tag{28}$$

This constraint ensures that each PE executes its allocated tasks at different clock periods. This multiplexing results in better PE utilization. Using the limitation placed on $L$, Equation (28), and the scheduling vector $s = [1\ \ 0]$, the suitable projection vector resulting in the bit-parallel systolic array is given by:

$$L = [1\ \ 0]^T \tag{29}$$

Since $L$ spans the null space of the projection matrix $F$, $F$ can be presented as:

$$F = [0\ \ 1] \tag{30}$$
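The effect of the projection can be illustrated with a short script. This is a sketch under assumptions rather than the paper's procedure: it takes the scheduling vector s = [1 0] and projection matrix F = [0 1] used in the text, assumes the projection vector L = [1 0]^T as a basis of the null space of F, and uses the hypothetical helper name `project`. It groups the DG points by the PE they map to and checks that no PE is asked to execute two points in the same clock period.

```python
def project(p, F):
    """Projection p_bar = F p for a 1 x 2 projection matrix F."""
    return F[0] * p[0] + F[1] * p[1]


n = 5
s = (1, 0)     # scheduling vector
L = (1, 0)     # projection vector (assumed null-space basis of F)
F = (0, 1)     # projection matrix

f_dot_l = F[0] * L[0] + F[1] * L[1]   # zero: L lies in the null space of F
s_dot_l = s[0] * L[0] + s[1] * L[1]   # non-zero: required by the restriction on L

# Group DG points (rows 0..n-1, columns 0..n) by PE and record their time steps.
schedule = {}
for i in range(n):
    for j in range(n + 1):
        pe = project((i, j), F)
        schedule.setdefault(pe, []).append(s[0] * i + s[1] * j)

num_pes = len(schedule)
no_conflicts = all(len(set(times)) == len(times) for times in schedule.values())
print(num_pes, no_conflicts)
```

Each column of the DG collapses onto one PE, which gives n + 1 PEs that are each busy in n consecutive clock periods.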

Extraction of the Bit-Parallel Systolic Multiplier Structure
By inserting the scheduling vector $s = [1\ \ 0]$ and the projection matrix $F = [0\ \ 1]$ into Equations (21) and (27), we can obtain the time-scheduling and node-projection functions for every DG node or point $p[i, j]$. The resultant functions are expressed as:

$$t(p) = i \tag{31}$$

$$\bar{p} = j \tag{32}$$

The bit-parallel systolic multiplier structure, resulting from applying the previously derived scheduling and projection functions to the points (nodes) of the DG, is shown in Figure 3. The systolic structure is composed of $n + 1$ regular PEs. Figure 4 shows the logical details of the PEs. In contrast with most of the previously published parallel systolic designs, which have area complexities of the order of $O(n^2)$, the proposed systolic multiplier has an area complexity of the order of $O(n)$. In addition, as with most previously published parallel systolic structures, the final product output from the systolic array is available after a latency of $n$ clock cycles. As a result, the offered systolic multiplier structure surpasses them in terms of area complexity while having a comparable latency.

Figure 3. Bit-parallel systolic multiplier structure.

By investigating Figures 3 and 4, we can describe the structure of the bit-parallel systolic multiplier as follows. The input bits $k^{-1}_j$ of $K$ are assigned to the corresponding PEs. Since the initial bits $v^{-1}_j$ of input $V$ are zero, they can be generated by clearing the $D_v$ latches before the PEs start the execution process. The resulting internal bits $k^i_j$ of $K$ are pipelined through the $D_k$ latches between the PEs. In addition, the resulting internal bits $v^i_j$ of $V$ are produced locally inside each PE. The last PE ($PE_n$) generates the bits $k^i_{n+1}$, which are assigned to $k^i_0$ at the first PE ($PE_0$). The input bits $h_i$ are fed in series, one bit at each time step, and pass through all the PEs. After $n$ clock cycles, the resultant bits $v^{n-1}_j$ of $V$ are available in parallel at the outputs of all the PEs. These bits are XORed with the most significant bit $v^{n-1}_n$ to generate the final product bits $d_j$, $0 \le j \le n - 1$. The operation of the explored bit-parallel systolic multiplier structure is explained as follows.

1. Throughout the initial clock period, the MUXes are deactivated ($M_k = 0$) to pass the input bits $k^{-1}_j$ of $K$ so that they are localized in each PE. Furthermore, the $D_v$ latches in each PE are cleared to initialize the $v^{-1}_j$ signals with zero values. The input bit $h_0$ is fed to all PEs in this clock period.

2. Throughout the subsequent $n - 1$ clock periods, the PEs generate the internal bit values $k^i_{j+1}$ and $v^i_j$, $0 \le i \le n - 1$ and $0 \le j \le n$. Furthermore, the input bits $h_i$, $1 \le i \le n - 1$, are fed in a bit sequence to all PEs.

3. In the last clock period (clock period $n$), the XOR gates shown in Figure 3 generate the resultant output bits $d_j$ of the product $D$. They are produced in parallel, as displayed in the figure.
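The three operating steps above can be mimicked with a small cycle-level model. This is a sketch under assumptions, since the PE logic of Figure 4 is not reproduced here: each PE is assumed to hold one `d_v` latch and one `d_k` latch, the MUX is assumed to select the external input bit when M_k = 0 and the latched neighbour bit otherwise, and the function name `systolic_multiply` is hypothetical. Bit lists are ordered from coefficient 0 upward.

```python
def systolic_multiply(e_bits, h_bits):
    """Cycle-level model of the one-dimensional array of n + 1 PEs.

    e_bits, h_bits: lists of n bits, index j holding coefficient j.
    Returns the n product bits d_0..d_{n-1}.
    """
    n = len(e_bits)
    k_in = list(e_bits) + [0]       # k^{-1}_j presented to PE_0..PE_n
    d_k = [0] * (n + 1)             # D_k latches (one per PE, assumed)
    d_v = [0] * (n + 1)             # D_v latches, cleared before execution

    for clock in range(n):
        m_k = 0 if clock == 0 else 1           # MUX select: load, then circulate
        k_now = [k_in[j] if m_k == 0 else d_k[j] for j in range(n + 1)]
        h_i = h_bits[clock]                    # one bit of H broadcast per clock
        # Each PE accumulates locally: v_j <- v_j XOR (h_i AND k_j).
        d_v = [d_v[j] ^ (h_i & k_now[j]) for j in range(n + 1)]
        # Each PE passes its k bit to the next PE; the last PE feeds the first.
        d_k = [k_now[(j - 1) % (n + 1)] for j in range(n + 1)]

    # Output XOR gates: d_j = v_j XOR v_n.
    return [d_v[j] ^ d_v[n] for j in range(n)]


# beta * beta^3 = beta^4 = 1 + beta + beta^2 + beta^3 for the degree-4 AOP.
product = systolic_multiply([0, 1, 0, 0], [0, 0, 0, 1])
print(product)
```

All PEs update in the same clock period, which matches the row-by-row schedule: the model needs n iterations of the loop before the output XOR stage is applied.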

Results and Discussion
This section presents an estimate of the area and time complexities, as well as the power consumption, of the offered one-dimensional bit-parallel systolic multiplier and the existing efficient systolic multiplier structures presented in [4,26,27,33-36], as well as the competitive sequential multiplier structure presented in [37]. The systolic multiplier structures proposed in [4,26,27,33,34,37] are based on trinomials, whereas those proposed in [35,36] are based on AOPs. As displayed in Figure 3, the proposed systolic structure consists of $n + 1$ regular PEs, which together contain $n + 1$ AND gates, $n + 1$ XOR gates, $n + 1$ MUXes, and $2n + 2$ latches (one AND gate, one XOR gate, one MUX, and two latches per PE). For the reduction process that produces the final product, the outputs of the first $n$ PEs are added to the most significant bit resulting from the last PE ($PE_n$) using $n$ XOR gates. Therefore, the total number of utilized XOR gates is $2n + 1$. As discussed above, the proposed multiplier produces its output after a latency of $n$ clock cycles. By investigating the PE logic details, we can estimate the critical path delay (CPD) of the proposed multiplier as the sum of the delays of the AND gate ($T_A$), the XOR gate ($T_X$), and the 2-to-1 MUX ($T_{MUX}$). Table 1 lists the gate counts, latch counts, latency in number of cycles, and cycle period (CPD) of the suggested bit-parallel systolic multiplier design, in addition to the existing systolic designs presented in [4,26,27,33-36] and the competitive sequential multiplier structure presented in [37]. As can be noted from Table 1, the proposed multiplier design uses significantly fewer AND gates and MUXes than the recently published systolic multiplier structure of Ibrahim [34]. Furthermore, it has almost the same number of XOR gates and latches as that structure. 
We can also note that the proposed systolic design uses fewer AND gates and latches than the competitive design of Chen [36] and has the same number of XOR gates and MUXes. Moreover, the proposed multiplier consumes less area than all the remaining multiplier structures, as they require a significantly larger number of gates. Table 1 also shows that the systolic multiplier structure of Kim [4] has the lowest latency of all the designs, including the proposed one, and that its cycle period (CPD) is comparable to those of the multiplier designs of Sarmadi [26], Know [35], and Chen [36] and shorter than those of the other multiplier structures. The proposed systolic design has a latency and cycle period (CPD) comparable to those of the other remaining designs.

Table 1.
Multiplier | AND | XOR | MUX | Latches | Latency | CPD
Chiou [33] | $n^2$ | $n^2 + n$ | $n$ | $2n^2 + 3n$ | $n + 1$ | $T_A + T_X + T_M$
Kim [4] | $2n^2 + 2n$ | $2n^2 + 3n$ | 0 | $3n^2 + 4n$ | $n/2 + 1$ | $T_A + T_X$
Sarmadi [26] | $(n^2)^*$ | $1.5n^2 + 0.5n$ | $1.5n^2 - 2.5n + 3$ | $1.5n^2 + 2n - 1$ | $n + 2$ | $T_N + T_X$
Mathe [27] | $n$ | $n^2 - 1$ | $n^2 - n$ | $n^2$ | $n$ | $T_M + 2T_X$
Mathe [37] | $2n$ | $2n$ | $2n$ | $3n$ | $n$ | $T_A + T_X + T_M$
Ibrahim [34] | $2n$ | $2n$ | $3n$ | $2n$ | $n$ | $T_A + T_X + T_M$
Know [35] | $3n + 3$ | $2n + 2$ | $n$ | $5n + 5$ | $n/2 + 1$ | $T_A + T_X$
Chen [36] | $2n + 1$ | $2n + 1$ | $n + 1$ | $3n$ | |

Based on the above qualitative analysis, we note that the multiplier structure of Chen [36] is the most competitive one in relation to the proposed multiplier structure. Therefore, we chose this structure for quantitative comparisons with our proposed structure. To validate and evaluate the performance of the offered systolic multiplier structure and the competitive multiplier structure of Chen [36], we described both multiplier structures in VHDL and synthesized the resulting code using the Synopsys Design Compiler with the NanGate (15 nm, 0.8 V) Open Cell Library. Before synthesis, the multiplier designs were functionally verified using ModelSim. 
The power consumption was evaluated at a frequency of 10 MHz. Table 2 shows the estimated area, delay, and power consumption, together with the computed area-delay product (ADP) and power-delay product (PDP), for AOP field sizes of n = 226 and n = 388. It also shows the savings in area, power consumption, ADP, and PDP of the developed bit-parallel systolic multiplier structure relative to the existing competitive design [36]. Considering the results in Table 2, we can observe the following:

• The proposed systolic multiplier structure shows significant savings in area and power consumption over the competitive design presented by Chen [36]. The average savings in area for n = 226 and n = 388 are 30.9% and 33.7%, respectively. Furthermore, the average reductions in power consumption of the developed multiplier structure over the multiplier structure of Chen [36], for n = 226 and n = 388, are 37.1% and 39.3%, respectively.
• The developed systolic structure shows a significant reduction in area-delay product (ADP) and power-delay product (PDP) over the competitive systolic design presented by Chen [36]. The average reductions in ADP at n = 226 and n = 388 are 18.3% and 24.8%, respectively. Furthermore, the average reductions in PDP offered by our proposed multiplier structure over the competitive design for n = 226 and n = 388 are 25.6% and 31.2%, respectively.

As can be seen from the obtained results, the proposed bit-parallel systolic multiplier showed the lowest area and power consumption. These results make the proposed multiplier structure more suitable for utilization in resource-constrained devices such as IoT edge devices, smart cards, and other compact embedded devices. 
The savings in terms of area and power consumption of the proposed design are attributed to the significant reductions in AND gate counts and latches of the proposed systolic structure compared to competitive multiplier designs. Due to the regularity of the proposed multiplier structure and the local communication between the PEs, the wiring complexity is minimized, making a limited contribution to the overall space consumed by the proposed systolic multiplier compared to the other multiplier structures.
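Because ADP is area multiplied by delay and PDP is power multiplied by delay, the reported savings can be cross-checked against each other. The sketch below is an illustrative consistency check, not part of the paper's evaluation: it assumes each saving is defined as 1 minus the ratio of the proposed value to the reference value, uses only the percentages quoted in the text, and the function name `implied_delay_ratio` is hypothetical.

```python
def implied_delay_ratio(resource_saving_pct, product_saving_pct):
    """Delay(proposed) / Delay(reference) implied by two reported savings.

    With product = resource * delay and saving = 1 - proposed / reference:
    (1 - product_saving) = (1 - resource_saving) * delay_ratio.
    """
    return (1 - product_saving_pct / 100) / (1 - resource_saving_pct / 100)


# Percentages as quoted in the text for the two field sizes.
reported = {
    226: {"area": 30.9, "power": 37.1, "adp": 18.3, "pdp": 25.6},
    388: {"area": 33.7, "power": 39.3, "adp": 24.8, "pdp": 31.2},
}

ratios = {}
for n, r in reported.items():
    from_adp = implied_delay_ratio(r["area"], r["adp"])    # via area and ADP
    from_pdp = implied_delay_ratio(r["power"], r["pdp"])   # via power and PDP
    ratios[n] = (from_adp, from_pdp)
    print(n, round(from_adp, 2), round(from_pdp, 2))
```

Under this definition of savings, the ratio derived from area and ADP and the ratio derived from power and PDP agree for each field size, which suggests that the quoted percentages are internally consistent.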

Summary and Conclusions
In this paper we have presented a low-area and low-power one-dimensional bit-parallel systolic implementation for field multiplication in $GF(2^n)$ based on the irreducible all-one polynomial. The adopted algorithm is a regular iterative algorithm and can be represented by means of a dependency graph. By assigning proper scheduling and node-projection functions to each node of the DG, we obtained an efficient bit-parallel systolic multiplier structure. The extracted parallel structure has an area complexity of the order of $O(n)$, which distinguishes it from most of the previously published parallel structures, which have an area complexity of the order of $O(n^2)$. Furthermore, it has a regular systolic array structure with local interconnection between its PEs, making it more suitable for ASIC implementation. The complexity analysis of the proposed multiplier structure shows that it exhibits a meaningful reduction in gate and latch counts compared to most of the compared parallel multipliers. To verify the results of the complexity analysis, we synthesized the proposed structure and one of the existing efficient multiplier structures, based on an ASIC CMOS library, to evaluate their performance. The estimated ASIC results show that the proposed bit-parallel systolic multiplier achieves significant savings in area, power consumption, area-delay product, and power-delay product. These results make the proposed multiplier structure more suitable for utilization in resource-constrained devices such as IoT edge devices, smart cards, and other compact embedded devices. In our future work, we will incorporate the proposed multiplier structure into an ECC cryptographic processing unit to calculate the overall savings in area and the energy consumed by the entire system.

Conflicts of Interest:
The authors declare no conflict of interest.