Compact Finite Field Multiplication Processor Structure for Cryptographic Algorithms in IoT Devices with Limited Resources

The rapid evolution of Internet of Things (IoT) applications, such as e-health and the smart ecosystem, has resulted in the emergence of numerous security flaws. Therefore, security protocols must be implemented among IoT network nodes to resist the majority of the emerging threats. As a result, IoT devices must adopt cryptographic algorithms such as public-key encryption and decryption. Cryptographic algorithms are computationally demanding and are difficult to implement efficiently on IoT devices because of their limited computing resources. The core operation of most cryptographic algorithms is finite field multiplication, and a compact implementation of this operation has a significant impact on the implementation of the entire cryptographic algorithm. As a result, this paper concentrates on developing a compact and efficient word-based serial-in/serial-out finite field multiplier suitable for use in IoT devices with limited resources. The proposed multiplier structure is simple to implement in VLSI technology due to its modularity and regularity. The suggested structure is derived from a formal and systematic technique for mapping regular iterative algorithms onto processor arrays. The proposed methodology allows for control of the processor array workload and the workload of each processing element, and managing the processor word size allows for control of the system latency, area, and consumed energy. The ASIC experimental results indicate that the proposed processor structure reduces area and energy consumption by factors reaching up to 97.7% and 99.2%, respectively.


Introduction
The Internet of Things (IoT) is a contemporary technology that links a large number of items to the internet, including wearable devices, sensors, smartphones, smart meters, and automobiles [1,2]. It offers services and cost-effective solutions in a variety of fields, including healthcare, smart grid, industrial manufacturing, smart cities, business, and smart railway infrastructure [3][4][5].
For most IoT-based systems, privacy and security are the top priorities, and concerns about them restrict IoT from being used in the majority of applications. As a result, to defend IoT-based systems, we should use effective and realistic security solutions. To address all of the security flaws, cryptographic protocols should be used at various levels of the IoT paradigm, particularly at edge devices. Conventional cryptographic algorithms such as Rivest, Shamir, and Adleman (RSA) and the Digital Signature Algorithm (DSA) [6] are expensive to execute on most IoT edge devices due to their restricted processing capability. The Elliptic Curve Cryptography (ECC) algorithm [6,7] is the preferred cryptography for resource-constrained integrated devices due to its small key sizes and increased computing effectiveness. The critical part of implementing ECC is the efficient implementation of the finite field multiplication operation, which is the core operation in all field arithmetic operations used in ECC, such as finite field inversion and division [8][9][10][11].

Related Work
Depending on the application, finite field multipliers can be built in serial or parallel form. When the multiplier is constructed in parallel, it generates all output bits in a single clock cycle, resulting in a high throughput at the cost of a large amount of hardware resources [12,13]. Serial architectures, on the other hand, are optimized for low-area applications at the cost of increasing the processing latency to n clock cycles, where n is the field size [14,15]. We will focus on serial implementations of the finite field multiplication algorithm because we are targeting resource-constrained IoT applications [15]. The multiplier can be implemented in either a bit-serial or a word-serial fashion. The word-serial version is more economical for resource-constrained IoT devices because it achieves better area and time complexity than the bit-serial version [16].
The basic four constructions of word-serial finite field multipliers are: serial-in/serial-out (SISO), serial-in/parallel-out (SIPO), parallel-in/serial-out (PISO), and scalable constructions. References [17][18][19][20][21] discussed the polynomial SISO multipliers. The multipliers presented in [17][18][19] have systolic structures with area complexity of order O(nl) and latency of order O(⌈n/l⌉), where n represents the field size and l is the bus word size. The multiplier design proposed in [20] is also systolic, but it has area complexity of approximately O(n√(n/l)) and a lower latency of order O(2⌈n/l⌉). The multiplier design explained in [21] is a three-operand non-systolic multiplier with area complexity of order O(nl) and latency of order O(⌈n/l⌉ + 2). References [22,23] provide the details of the polynomial SIPO multipliers. The multiplier offered in [22] has a systolic structure with area complexity of order O(ln⌈n/l⌉) and latency of order O(2⌈n/l⌉). The multiplier discussed in [23] has a systolic structure with area complexity of order O(2ln) and latency of order O(⌈n/l⌉ + l). In [24], a PISO multiplier using a Type-T Gaussian normal basis was explained. The proposed architecture consumes area of order O(2ln) and has latency of order O(l), but it has a very long critical path delay that is a function of the word size l, O(log₂ l), making the total computation time very high, especially for long word sizes.
Later, in [25][26][27][28], the scalable multiplier constructions were discussed in detail. The scalable multipliers of [25,26] are based on a fixed bit-parallel Hankel matrix-vector multiplier whose latency is (l + ⌈n/l⌉(⌈n/l⌉ − 1)) clock cycles. The multiplier architecture of [25] has area complexity of order O(n^2), while the multiplier architecture of [26] has a lower area complexity of order O(l^2). The multiplier of [27] is based on dual-basis multiplication and targets lightweight cryptographic architectures. It has an estimated area complexity of order O(n) and latency of order O(n⌈n/l⌉). The design proposed in [28] is a unified structure that performs both multiplication and inversion operations. It has an estimated area complexity of order O(l⌈n/l⌉) and a latency of the same order. From the previous discussion, we notice that most SISO multiplier constructions provide better area and time complexity than the other forms of word-serial multiplier constructions. As a result, we will concentrate on obtaining the SISO construction of the adopted algorithm.
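The Hankel-based latency quoted above can be made concrete with a short helper. The sketch below is our own; it reads the latency expression as l + ⌈n/l⌉(⌈n/l⌉ − 1), our interpretation of the garbled formula in the source text.

```python
import math

def hankel_latency(n, l):
    """Latency (clock cycles) of the Hankel matrix-vector scalable
    multipliers of [25,26], assuming the expression
    l + ceil(n/l) * (ceil(n/l) - 1)."""
    r = math.ceil(n / l)
    return l + r * (r - 1)
```

For the field size used later in the experiments (n = 508), this evaluates to 4040, 1008, and 272 cycles at l = 8, 16, and 32, illustrating how larger word sizes trade area for latency.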

Paper Contribution
In this paper, we present a two-dimensional (2-D), word-based SISO finite field multiplier processor. The regularity, modularity, concurrency, and local interconnectivity of the explored processor's systolic structure make it convenient for VLSI implementation. The system developer can manage the area and power consumption of the investigated multiplier construction to suit IoT devices by using the formal mapping technique provided in [29][30][31]. The system developer can adjust the workload of the processor array, as well as the workload of each processing element, by using a non-linear scheduling function. Furthermore, non-linear task scheduling is used to manage the algorithm's latency. The experimental results reveal that the improved multiplier construction saves a large amount of area and energy, making it more suitable for IoT devices with restricted resources.

Paper Organization
The manuscript is organized as follows: Section 2 reformulates the adopted finite field multiplication algorithm, offered by [32], at the bit level. The algorithm performs the multiplication operation over GF(2^n) and is based on the irreducible All-One Polynomial (AOP). The dependency graph (DG) of the explained algorithm is investigated in Section 3. The systematic technique utilized to extract the 2-D word-based SISO processor is explained in Section 4. The experimental findings and analysis of the produced word-based multiplier construction and the competing ones are presented in Section 5. Finally, Section 6 concludes this work.

Formulation of the Multiplication Algorithm
Suppose that a degree-n irreducible polynomial U(w) characterizes the finite field GF(2^n). It can be described in polynomial form as
U(w) = u_0 + u_1 w + u_2 w^2 + · · · + u_n w^n, (1)
with u_i ∈ GF(2). Consider also that the above polynomial has a root denoted as ζ. As a result, the field elements can be defined over the polynomial basis {1, ζ, ζ^2, ζ^3, · · ·, ζ^{n−1}}. Assume that polynomials E and H denote any two field elements in the GF(2^n) space. They can be described in degree-(n − 1) polynomial form as follows:
E = e_0 + e_1 ζ + · · · + e_i ζ^i + · · · + e_{n−1} ζ^{n−1}, (2)
H = h_0 + h_1 ζ + · · · + h_i ζ^i + · · · + h_{n−1} ζ^{n−1}, (3)
where e_i, h_i ∈ GF(2).
To multiply E and H over GF(2^n), we can use the following formula:
D = E · H mod U(ζ) = (h_0 E + h_1 ζE + · · · + h_{n−1} ζ^{n−1} E) mod U(ζ). (4)
Equation (4) can be extended into a multiplication recurrence, where K = ζE is a polynomial of degree n that can be written as
K = k_0 + k_1 ζ + k_2 ζ^2 + · · · + k_n ζ^n, (6)
with k_0 = 0 and k_i = e_{i−1} for i = 1, 2, · · ·, n. We can derive the following expression by expanding the polynomial of (6) and multiplying by ζ:
ζK = k_0 ζ + k_1 ζ^2 + · · · + k_{n−1} ζ^n + k_n ζ^{n+1}. (7)
As we mentioned before, ζ is a root of U(w), which leads to U(ζ) = 0. As a result, we can find the following expression by substituting ζ into Equation (1):
u_0 + u_1 ζ + u_2 ζ^2 + · · · + u_n ζ^n = 0. (8)
As U(w) is an AOP, all of its coefficients u_i equal one, and Equation (8) can be expressed as:
ζ^n = 1 + ζ + ζ^2 + · · · + ζ^{n−1}. (9)
By multiplying both sides of Equation (9) by ζ and then substituting Equation (9) again for ζ^n, the remaining terms cancel pairwise over GF(2), and we obtain the following result:
ζ^{n+1} = 1. (10)
By substituting from (10) in (7), we may reduce ζK to a polynomial (K_1) of degree n as:
K_1 = k_n + k_0 ζ + k_1 ζ^2 + · · · + k_{n−1} ζ^n. (11)
As indicated in Equation (11), the cyclic shift left of polynomial K creates the partially reduced polynomial K_1 of polynomial ζK. Additionally, the cyclic shift left of polynomial K_1 produces the partially reduced polynomial K_2 of polynomial ζ^2 K. In general, the cyclic shift left of polynomial K_{i−1} forms the partially reduced polynomial K_i of polynomial ζ^i K.
The following is a mathematical representation of the cyclic-shift-left procedure:
K_i = L(K_{i−1}), (12)
where K_{−1} = (0&E) and L denotes the cyclic-shift-left operation. Equation (12) can be used to construct Equation (13) as
K_i = L^i(K_0), (13)
with K_0 = K = ζE. Alternatively, using Equation (13), the product of Equation (4) might be written as
D = V mod U(ζ), (14)
where V is the sum of polynomials of degree n that can be expressed as
V = Σ_{i=0}^{n−1} h_i K_{i−1}, (15)
with K_{−1} = (0&E). Equation (15) can be described in the subsequent form:
V = v_0 + v_1 ζ + v_2 ζ^2 + · · · + v_n ζ^n. (16)
By substituting ζ^n in Equation (16) with the expansion given in Equation (9), we can derive the reduced form of polynomial V mod U(w) (a polynomial of degree n − 1) as
D = (v_0 ⊕ v_n) + (v_1 ⊕ v_n)ζ + · · · + (v_{n−1} ⊕ v_n)ζ^{n−1}. (17)
We can describe Equations (12) and (15) in bit-level format as shown in Equations (18) and (19), respectively, where the subscript j denotes the bit position in the binary coding:
k^i_j = k^{i−1}_{(j−1) mod (n+1)}, 0 ≤ j ≤ n, (18)
v^i_j = v^{i−1}_j ⊕ (h_i · k^{i−1}_j), 0 ≤ j ≤ n, (19)
with v^{−1}_j = 0. Equation (17) provides the reduced form of the product polynomial D, which can be interpreted in bit-level format as
d_j = v^{n−1}_j ⊕ v^{n−1}_n, 0 ≤ j ≤ n − 1. (20)
Algorithms 1 and 2 give the algorithmic structure of the previously stated formulas. Algorithm 2 represents the bit-level version of Algorithm 1.

Algorithm 1 Finite Field Multiplication Algorithm based on AOP polynomial.
Input: E, H, and U
Output: D
Initialization:
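The derivation above can be sanity-checked with a minimal software model of the bit-level algorithm (not the paper's hardware; the function name and list-based bit encoding are our own). It keeps the (n+1)-coefficient redundant representation, realizes multiplication by ζ as a cyclic shift left, accumulates V, and applies the final reduction. It is valid only when the degree-n AOP is irreducible over GF(2) (e.g., n = 2, 4, 10, ...).

```python
def gf_aop_mult(e, h, n):
    # e, h: lists of n coefficient bits (e[i] is the coefficient of zeta^i).
    k = e + [0]               # K_{-1} = (0 & E): redundant (n+1)-bit form
    v = [0] * (n + 1)         # V starts at zero, as in Algorithm 1
    for i in range(n):
        if h[i]:              # accumulate V <- V + h_i * K_{i-1}
            v = [a ^ b for a, b in zip(v, k)]
        k = [k[-1]] + k[:-1]  # cyclic shift left = multiply by zeta
    return [v[j] ^ v[n] for j in range(n)]  # final reduction d_j = v_j ^ v_n
```

For n = 4, `gf_aop_mult([0, 1, 0, 0], [0, 1, 0, 0], 4)` returns `[0, 0, 1, 0]`, i.e., ζ · ζ = ζ^2, and multiplying any element by 1 returns it unchanged.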

Construction of Algorithm Dependence Graph
Algorithm 2 has two indices, i and j, that define the iterative phases of the multiplication algorithm. The approach described in reference [29] can be used to generate a dependence graph (DG) in the two-dimensional integer domain D. Figure 1 shows the DG for the case n = 5. The nodes of the DG indicate the operations specified by steps 3 to 5 of the algorithm. According to the design criteria of reference [29], the v^i_j signals are indicated by vertical lines, the h_i signals are denoted by horizontal lines, and the k^i_{j+1} signals are depicted by the diagonal lines.
The k^i_{n+1} signals are generated by the nodes in the last column and transmitted to the nodes in the first column. As indicated in the reduction step of Algorithm 2 (step 9), the resultant signals v^{n−1}_j, 0 ≤ j ≤ n − 1, are combined with the most significant signal v^{n−1}_n, using XOR gates, to generate the final product output signals d_j, 0 ≤ j ≤ n − 1.
The algorithm inputs v^{−1}_j and k^{−1}_j = e_j are displayed in the DG as vertical and diagonal inputs to the top-row nodes, respectively. On the other hand, the reduced product output d_j, 0 ≤ j ≤ n − 1, is created by merging the vertical outputs of the bottom nodes with the output of the bottom-right node, as depicted in Figure 1. Using the technique outlined in [29], the DG of Figure 1 can be used for design space exploration by selecting proper node scheduling and projection functions.
We will not employ the linear scheduling and projection functions presented in [29], as they give us few alternatives for determining the resulting processor array area, latency, processing element workload, and total system workload. We will apply the non-linear node scheduling and projection techniques described in [29] to the DG. This option provides a wide range of design alternatives for optimizing the resulting processor array area, latency, workload of processing elements, and overall system workload.

Two-Dimensional SISO Multiplier
Our objective is to create a SISO multiplier that accepts inputs K and H in a word-serial format. In addition, the resultant output D is generated from the SISO multiplier in the word-serial format. Assume the system designer's aim is to process l bits of each input at the same time in order to find l bits of the output. The following subsections describe the steps that the system developer should follow to construct the SISO multiplier.

Non-Linear Task Scheduling
As explained in [29], the nonlinear scheduling technique is employed to divide the domain D into l × l equitemporal zones or clusters. The l value allows the system designer to set the number of bits of inputs and outputs that are processed at the same time. This has an indirect impact on the system's size, speed, and latency.
To assign timing to each node p of the DG, we use a non-linear scheduling function k(p), where k(p) is the time allocated to the DG's node p, and θ is the padding that rounds the DG dimensions up to a multiple of l, i.e., θ = l⌈n/l⌉ − n. To make the number of the DG's rows an integer multiple of l, we should add θ rows to it. In addition, θ − 1 columns must be added to the DG so that the number of columns is an integer multiple of l. We have θ equal to 1 in the scenario depicted in Figure 2, where n = 5 and l = 2, implying that one row should be placed at the bottom (the row with green nodes) and no columns at the left. The equitemporal zones (clusters of nodes having the same time value) are outlined by the light red boxes and marked with the blue numbers displayed in Figure 2. The scheduling time for the DG nodes when n = 5 and l = 4 is shown in Figure 3. We have θ equal to 3 in this scenario, which means we need to add two columns on the left and three rows at the bottom (the rows and columns with green nodes). By inspecting Figures 2 and 3, we notice that any equitemporal zone (call it block k) takes inputs from the north and west sides and generates outputs from the south and east sides. Table 1 summarizes the timings associated with these inputs and outputs (I/Os).
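Because θ rows round the DG height up to a multiple of l, θ follows directly from n and l. The helper below (our own, not from the paper) reproduces the padding for both figures:

```python
import math

def dg_padding(n, l):
    # theta rows are appended so the n DG rows become a multiple of l;
    # theta - 1 columns are appended for the n + 1 DG columns.
    theta = l * math.ceil(n / l) - n
    return theta, theta, theta - 1   # (theta, extra rows, extra columns)
```

For n = 5, l = 2 this gives θ = 1 (one extra row, no extra columns, as in Figure 2); for n = 5, l = 4 it gives θ = 3 (three extra rows and two extra columns, as in Figure 3).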

I/O Time Instance
It is worth noting that the top row's inputs result in the right column's outputs. Similarly, the left column's inputs result in the bottom row's outputs. As a result, the total number of iterations (I) for finite field multiplication should be calculated using the following expression.

Non-Linear Task Projection
As we observe from Figures 2 and 3, the nodes of each l × l equitemporal zone execute at the same time. This observation, together with the projection technique described in [29], yields the non-linear task projection function. The extracted projection function maps the l × l node clusters to a single processor array. The processor array is made up of l × l processing elements (PEs) arranged in a two-dimensional array. The processor structure of Figure 4 depicts the entire system.
By reading Figure 4, we notice that registers K and H are of size l and are used to feed the word inputs of K and H, in sequence, to the processor array block. Furthermore, register D is used to synchronize the output product D before delivering it to the processor data bus. As the input words of variable V have zero initial values, there is no need to feed them to the processor array through an input register. They are initialized by clearing the shift register SR-V shown in the figure. This shift register has a width of l bits and a depth of r registers, where r = ⌈n/l⌉. The depth of SR-V is sufficient to guarantee that all the initial input words of variable V are fed to the processor array block.

With a closer look at Figure 4, we can notice that the words of variable K (K_o) produced by the processor array block contain three different types of signals, due to the delay differences between the signals K_e, K_fe, and the remaining signals of word K, as shown in Figures 2 and 3. The K_e signal should be delayed by r − 1 time steps before feeding it back to the input of the processor array block. Additionally, before returning the K_fe signal to the input of the processor array block, it should be delayed by 2r time steps. The remaining signals of the word K (K_o) should be delayed by r time steps before being fed back to the processor array's input. These delays are implemented using the shift registers (SR) related to variable K, as shown in Figure 4; the width and depth of each SR are indicated in the figure. As we also notice from Figure 4, the intermediate words of V are looped back through the shift register SR-V to be delayed by r time steps before reaching the inputs of the processor array block.
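The feedback delays just described can be summarized programmatically. The small helper below is a hypothetical mirror of Figure 4's shift registers (names follow the figure), returning each depth as a function of n and l:

```python
import math

def shift_register_depths(n, l):
    r = math.ceil(n / l)  # r = ceil(n/l), as defined in the text
    return {
        "SR-V": r,        # V words delayed r steps before re-entering
        "SR-K": r,        # remaining K_o signals delayed r steps
        "SR-Ke": r - 1,   # K_e delayed r - 1 steps
        "SR-Kfe": 2 * r,  # K_fe delayed 2r steps
    }
```

For the n = 5, l = 4 example (r = 2), the depths are 2, 2, 1, and 4, respectively.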
The processor array is shown in Figure 5 for the case n = 5 and l = 4. Two types of tri-state buffers are used to select between the signals k_d and k_f, and another two types are used to select between the signals k_e and k_fe. All of these buffers are controlled by the control signal g. At time instances k = q⌈n/l⌉ + 1, 0 ≤ q < ⌈n/l⌉, the control signal g is enabled (g = 1), allowing the tri-state buffers Tr1 to pass the k_f and k_fe signals shown in Figure 5. The control signal g is deactivated (g = 0) for the remaining time instances, allowing the k_d and k_e signals to pass through the tri-state buffers Tr2.
To compute the intermediate bits of word V, the input bits of word H (h_i) should be transferred to the processing elements of the processor array as displayed in Figure 5. The logic diagram of the PE is depicted in Figure 6. It includes one AND gate and one XOR gate. The controller sequences the computation as follows:
1. At the first time instance, k = 1, the controller activates the MUX with select signal S_in to allow the l most significant bits (MSBs) of variable K to reach the input of the processor array block, as shown in Figure 4. To ensure variable V has a zero initial value, as described in Algorithm 1, the controller resets the shift register SR-V at the first time instance. At the same time instance, the least significant l bits of variable H are transmitted horizontally to the PE nodes of the processor array block. Notice that the H word transferred at this time instance should be held for the following ⌈n/l⌉ − 1 time instances.
2. At time instances 1 < k ≤ ⌈n/l⌉, the controller keeps the MUX with select signal S_in activated to enable the remaining words of input K to reach the processor array input. These words, together with the H word held at the first time instance, are used to calculate, in sequence, the partial words of V and K. The V words produced at the output of the processor array block (V_o) are looped back to its input through the shift register SR-V. The K words produced at the output of the processor array block are looped back to its input through the shift registers SR-K, SR-Ke, and SR-Kfe and the MUX controlled by the select signal S, as displayed in Figure 4. It is worth noticing that the depth of the shift register SR-V ensures the initial values of V remain zero during these time instances.
3. During times k = q + (⌈n/l⌉ − 1), 2 ≤ q ≤ 2⌈n/l⌉ and q ≠ ⌈n/l⌉ + 1, the controller deactivates the MUX controlled by the select signal S (S = 0), see Figure 4, to pass the K_e signal to be concatenated with the K_o word. At the same time instances, the controller deactivates the MUX controlled by the select signal S_in (S_in = 0) to transfer the whole partial word of K to the input of the processor array block, as displayed in Figure 4.
4. During times k = (q + 1)⌈n/l⌉, 1 ≤ q < ⌈n/l⌉, the controller activates the MUX controlled by the select signal S (S = 1), see Figure 4, to pass the K_fe signal to be concatenated with the K_o word. At the same time instances, the controller deactivates the MUX controlled by the select signal S_in (S_in = 0) to transfer the whole partial word of K to the input of the processor array block.
5. At times k = q⌈n/l⌉ + 1, 0 ≤ q < ⌈n/l⌉, the remaining H words are transferred to the input of the processor array block to be used alongside the word inputs V_in and K_in in updating the partial words of variable V (V_o).
6. At time k = (⌈n/l⌉ − 1)⌈n/l⌉ + 1, the control signal f of the tri-state buffer Tr3, shown in Figure 4, is set (f = 1) to pass the signal v^{n−1}_n to be XORed with the words of V to find the output product words D, in sequence, as displayed in Figure 4.
7. Starting at time k = ⌈n/l⌉^2 + 1, the output words of product D become available, in sequence, at the output bus.
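The schedule above can be condensed into a single decode function. The sketch below is our interpretation only: the signal names follow Figures 4 and 5, the select signal S and the overlap of the middle steps are omitted, and the exact boundary conditions are assumptions.

```python
import math

def decode(k, n, l):
    """Hypothetical cycle decode for the controller of steps 1-7."""
    r = math.ceil(n / l)
    return {
        "S_in": int(k <= r),                          # steps 1-2: stream K words in
        "g": int((k - 1) % r == 0 and k <= r * r),    # k = q*r + 1: pass k_f, k_fe
        "f": int(k == (r - 1) * r + 1),               # step 6: pass v_n for reduction
        "D_out": int(k > r * r),                      # step 7: product words emerge
    }
```

For n = 5, l = 2 (r = 3), the input phase ends at k = 3, the reduction signal f fires at k = 7, and product words appear from k = 10 onward.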
To ensure that there is always a one-time-instance difference between the words of V, we inserted delay elements (D flip-flop blocks) into the processor array, as illustrated in Figure 5. These elements synchronize the processor array's work by delaying the V words by one time instance so that they arrive at the same time as the resultant bits of k_d. The k_d bits are created starting at the second time instance, as seen in Figure 3, and this increases the total number of clock cycles by one, as indicated in Equation (23). Furthermore, shift registers SR-Kf of depth r are added to the processor array (see Figure 5) to ensure that the k_f signals arrive at the left processing elements at the appropriate time.

Experimental Results and Discussion
We compared the suggested 2-D word-based multiplier structure to the best word-based ones in the literature [20,23,33,34]. The area estimation is determined by the number of basic logic gates and components in the examined multiplier architectures (AND gates, tri-state buffers, XOR gates, flip-flops (FFs), and MUXes). The number of clock cycles needed to accomplish the multiplication operation is defined as the latency. The delay of the basic gates/components along the longest path of the multiplier logic circuit is referred to as the critical path delay (CPD). The estimated area and time results of the multiplier structures are shown in Table 2, using the following symbols: l denotes the word size of the multiplier constructions; δ_A denotes the delay of a 2-input AND gate; δ_X denotes the delay of a 2-input XOR gate; δ_MUX denotes the delay of a 2-input MUX. α_1 = 7n + n⌈log n⌉ + l + 3 expresses the overall number of FFs employed in the multiplier construction of Pan [20]. α_2 = 2l^2 + 2l⌈n/l⌉ + 4l + 1 expresses the overall number of FFs employed in the multiplier construction of Hua [33]. α_3 = 2l^2 + 3l⌈n/l⌉ + 2l expresses the overall number of FFs employed in the multiplier construction of Chen [34]. η_1 = l + ⌈n/l⌉^2 + ⌈n/l⌉ designates the latency of the multiplier construction of Chen [34]. β_1 = δ_A + (⌈log_2 l⌉ + 1)δ_X is the approximated CPD of Pan's multiplier construction [20]. β_2 = δ_A + 2δ_X is the approximated CPD of Hua's multiplier construction [33]. β_3 = δ_A + δ_X is the approximated CPD of Chen's multiplier construction [34]. β_4 = lδ_A + lδ_X + 2δ_MUX is the approximated CPD of the suggested multiplier construction.
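To make the symbol definitions concrete, the snippet below evaluates them for sample parameters. This is our transcription of the expressions; in particular, taking the log in α_1 as base-2 is an assumption.

```python
import math

def alpha1(n, l):  # FFs in Pan's design [20]; base-2 log assumed
    return 7 * n + n * math.ceil(math.log2(n)) + l + 3

def alpha2(n, l):  # FFs in Hua's design [33]
    r = math.ceil(n / l)
    return 2 * l * l + 2 * l * r + 4 * l + 1

def alpha3(n, l):  # FFs in Chen's design [34]
    r = math.ceil(n / l)
    return 2 * l * l + 3 * l * r + 2 * l

def eta1(n, l):    # latency of Chen's design [34]
    r = math.ceil(n / l)
    return l + r * r + r
```

For n = 508 and l = 8, these give α_1 = 8139, α_2 = 1185, α_3 = 1680, and η_1 = 4168, showing how strongly the FF counts of the competing designs grow with n or l.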
(1) r = ⌈n/l⌉; (2) the area of a 3-input XOR gate is 1.5× that of a 2-input XOR gate; (3) in [34], the multiplier employs switches with the same level of complexity as a MUX.
It is worth mentioning that the input/output registers are included in the approximated number of FFs. This guarantees that the multiplier architectures are fairly compared.
Examining the area expressions in Table 2, we can draw the following conclusions:
1. The area complexities of the Pan [20] and Xie [23] multipliers are roughly of order O(n√(n/l)) and O(nl), respectively.
2. Except for the MUXes and FFs of the recommended multiplier structure, which have area complexities of order O(l) and O(lr), all other components have area complexity of order O(l^2).
3. Pan's [20] and Xie's [23] multiplier constructions have a larger area complexity than the other multiplier constructions. This is due to the fact that the field size n is significantly bigger than the embedded word size l.
4. In comparison to the other multipliers, the suggested multiplier has the smallest number of FFs. This is because the suggested multiplier's FFs have an area complexity of order O(lr), as opposed to O(l^2) and O(n⌈log n⌉) for the other multiplier structures.
5. The number of FFs in the proposed multiplier structure does not rise significantly as the word size l is increased, again because the proposed structure's FFs have an area complexity of order O(lr).
According to the data books of most typical CMOS libraries, the FFs require more chip space than the other logic components. As a result, lowering the number of FFs reduces the overall size of the multiplier structures dramatically. Increasing the word size does not considerably increase the overall number of FFs in the proposed multiplier structures, as we previously stated. As a result, the overall area of the suggested multiplier structure will not rise considerably as l grows.
Examining the latency expressions in Table 2, we can notice the following:
1. When compared to the other multiplier constructions, the multiplier of Hua [33] has the lowest latency.
2. The latency figures in Table 3, for the field size n = 508 and word sizes l = 8, 16, 32, indicate that the suggested multiplier structure's latency expression results in a larger latency than the multiplier constructions in [20,23], and a competitive latency compared to the Hua [33] and Chen [34] multiplier constructions.
3. When the word size l increases, the latency decreases. This is because the latency expressions are inversely related to l.
Examining the CPD expressions, we can remark the following:
1. The word size l has no effect on the CPD expressions of the Xie [23], Hua [33], and Chen [34] multiplier constructions. As a result, they have constant CPD values for all l values.
2. The CPD expressions of Pan [20] and the proposed multiplier structure are both directly dependent on l. As a result, the CPD values of these multipliers rise as l rises.
We cannot accurately predict which multiplier architecture has the best computation time, because it is challenging to qualitatively weigh the latency reduction against the CPD increase as l rises. Nevertheless, the quantitative results provided in Table 3 demonstrate which multiplier layout outperforms the others in computation time.
All of the multiplier constructions have been described in the VHDL hardware description language. For the field size n = 508 and embedded word sizes l = 8, 16, 32, the multipliers were synthesized using Synopsys tools version 2005.09-SP2 and the NanGate (15 nm, 0.8 V) Open Cell Library.
The design performance indicators Latency, Area (A), CPD, Total Computation Time (T), Consumed Power (P), and Consumed Energy (E) are used to compare the chosen word-based multiplier constructions. The obtained results are listed in Table 3. The area and CPD are provided by the synthesis tools; the area is normalized to that of a 2-input NAND gate. The total computation time is the time needed to accomplish one product operation and is calculated by multiplying the latency by the CPD. The consumed power is measured at a frequency of 1 kHz, and the product of P and T yields the consumed energy.
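The derived columns of Table 3 follow mechanically from the measured ones. The small helper below (our own, with illustrative units) shows the arithmetic:

```python
def derived_metrics(latency_cycles, cpd_ns, power_w, area_gates):
    """Compute the derived Table 3 columns: total computation time
    T = latency * CPD, area-time product AT = A * T, and energy E = P * T."""
    t_s = latency_cycles * cpd_ns * 1e-9  # total computation time in seconds
    return {"T": t_s, "AT": area_gates * t_s, "E": power_w * t_s}
```

For example, a design with a latency of 100 cycles and a 10 ns CPD has T = 1 µs; at 2 W of consumed power it would use 2 µJ per multiplication.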
The performance results in Table 3 can be interpreted as follows:
1. In terms of area (A), the proposed multiplier structure is superior to all existing multiplier structures. It greatly decreases the area for all embedded word sizes l, with reduction rates ranging from 67.3% to 97.7%. The reduction is primarily because the proposed structure's area is mainly determined by the word size l, drastically reducing the number of counted logic gates when compared to most other existing multiplier structures. Furthermore, due to the systolic nature of the suggested multiplier, the majority of its connections are local, which reduces the area to a large extent.
2. In terms of the area-time product (AT), Pan's multiplier structure [20] surpasses all other multiplier structures, including the suggested one, at l = 8. This is mainly attributed to the significant reduction in its latency compared to the other multiplier constructions at this word size; at this embedded size, it outperforms the offered design by 37.9%. The proposed architecture, on the other hand, surpasses Pan's multiplier structure for l = 16 and l = 32: at l = 16, it reduces AT by 26.3%, while at l = 32, it reduces AT by 49.2%. Furthermore, the suggested multiplier structure outperforms all the alternative multiplier structures by percentages ranging from 21.1% to 99.4%, based on the embedded word size. The reduction in AT over the other multiplier structures is mainly due to the significant savings in the area complexity of the suggested multiplier structure.
3. In terms of consumed power (P), the proposed multiplier outperforms the other multiplier structures at all embedded word sizes. It reduces power consumption at all l values by percentages ranging from 64.4% to 99.5%. The power reduction is attributed to the substantial reduction in the consumed area of the proposed design when compared to that of the other multiplier designs: the reduced area minimizes parasitic capacitance and, as a result, significantly reduces the circuit's dynamic power. In addition, the systolic nature of the proposed design reduces its switching activity compared to the other conventional designs, and switching activity is one of the major parameters that significantly affects dynamic power consumption.
4. In terms of consumed energy (E), the offered multiplier construction surpasses the other multiplier constructions at all embedded sizes. It saves energy at rates ranging from 70.6% to 99.2%. The energy savings are due to the massive reduction in consumed power and the reasonable computation time of the offered multiplier construction compared to the other multiplier structures.

From the obtained results, we can conclude that the offered multiplier outperforms its competitors in terms of area, consumed power, and consumed energy for all popular embedded word sizes. As a result, the proposed design can be used to efficiently implement crypto-processors in resource-constrained IoT devices, such as wearable and implantable devices. It can also be used in other resource-constrained applications that place restrictions on area and consumed energy.

Summary and Conclusions
In this paper, we offered a compact and practical 2-D word-based serial-in/serial-out processor for finite field multiplication in GF(2^n). A rigorous and systematic technique for mapping regular iterative algorithms onto processor arrays is used to create the proposed processor structure. The methodology enables the system developer to manage the overall workload of the processor array system, as well as the workload of each processing element. Controlling the processor word size allows us to adjust the system speed, latency, and area. The processor size can be adjusted to meet the intended chip area, allowing for better implementation of the suggested multiplier processor in resource-constrained IoT devices. The obtained experimental results confirm that the suggested multiplier processor reduces the size, power consumption, and utilized energy when compared to the conventional multiplier processors.

Future Work
As future work, we will incorporate the proposed multiplier into ECC to evaluate the amount of savings in its area and consumed energy. The process will start by replacing the inversion operation with several multiplication operations by representing the elliptic curve points as projective-coordinate points.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: