Energy-Efficient Word-Serial Processor for Field Multiplication and Squaring Suitable for Lightweight Authentication Schemes in RFID-Based IoT Applications

Abstract: Radio-Frequency Identification (RFID) technology is crucial to many IoT applications such as healthcare, asset tracking, logistics, supply chain management, assembly, manufacturing, and payment systems. Nonetheless, RFID-based IoT applications have many security and privacy issues restricting their use on a large scale. Many authors have proposed lightweight RFID authentication schemes based on Elliptic Curve Cryptography (ECC) with low-cost implementations to solve these issues. Finite-field multiplications are at the heart of these schemes, and their implementation significantly affects the system's overall performance. This article presents a formal methodology for developing a word-based serial-in/serial-out semisystolic processor that shares hardware resources for multiplication and squaring operations in GF(2^n). The processor executes both operations concurrently and hence reduces the execution time. Furthermore, sharing the hardware resources provides savings in area and consumed energy. The implementation results obtained for the field size n = 409 indicate that the proposed structure reduces the area-time product and the consumed energy over previously published designs by at least 32.3% and 70%, respectively. These results make the proposed design well suited to realizing cryptographic primitives in resource-constrained RFID devices.


Introduction
The Internet of Things (IoT) is a new paradigm that connects a significant number of objects, such as wearable devices, sensors, smartphones, smart meters, and vehicles, to the Internet [1]. The IoT is a promising technology that provides services and practical solutions in several fields, including smart cities, healthcare, smart grids, smart homes, logistics, industrial manufacturing, intelligent transportation, and business management [2,3].
Radio-Frequency Identification (RFID) is an essential technology in most IoT applications, especially healthcare applications [4]. RFID systems have a small chip called an RFID tag implanted in an object. The tag contains data that a reader can read via short-range radio waves. The data retrieved by the reader are transmitted to the server for further processing. RFID systems increase productivity in applications by providing contactless and automatic object identification [5].
Based on the cost, memory size, and battery requirements, RFID tags are categorized into three classes: passive tags, semipassive tags, and active tags [6]. Passive tags acquire their energy from the reader and hence do not require a battery to operate; the reader's signal triggers a response from the tag. They have a limited storage memory and a lower cost.

Motivation
A direct way to provide adequate security solutions to secure the RFID system is to encrypt all the communications among the reader, the tag, and the server. However, the tags are resource-constrained devices with limited computation power, storage capacity, and energy, which makes implementing the standard cryptographic algorithms on these tags a challenging task. Therefore, over the last several years, researchers have developed numerous lightweight authentication schemes that satisfy the security requirements of the RFID systems. These schemes are based on elliptic curve cryptography (ECC) [12][13][14][15][16][17][18][19][20][21][22], hash functions [23][24][25], and rotation functions [26][27][28][29] to secure communications in IoT environments.
Most of the cryptographic primitives (such as ECC and hash functions) used in the proposed lightweight authentication protocols rely mainly on the finite-field arithmetic operations of addition, multiplication, squaring, exponentiation, and division/inversion. These operations are at the heart of the cryptographic primitives; hence, their performance has a significant effect on the overall performance of the whole authentication protocol. Since finite-field multiplication is the basic building block of the other field operations of division/inversion and exponentiation, it is considered the fundamental building block of most cryptographic processors. Therefore, even a slight improvement in its implementation leads to a significant increase in the cryptographic processor's overall performance.
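As a concrete illustration of the field arithmetic these primitives rely on, the following sketch (illustrative Python, not the hardware algorithm of this article) multiplies two elements of GF(2^n) in polynomial basis, with elements packed into integers (bit i holds the coefficient of β^i). The small field GF(2^3) and its reduction polynomial are chosen here purely as an example.

```python
def gf2n_mul(e: int, f: int, g: int, n: int) -> int:
    """Compute E(beta) * F(beta) mod G(beta) over GF(2)."""
    # Carry-less (XOR-based) schoolbook multiplication.
    prod = 0
    for i in range(n):
        if (f >> i) & 1:
            prod ^= e << i
    # Reduce modulo G(beta), most significant bit first.
    for i in range(2 * n - 2, n - 1, -1):
        if (prod >> i) & 1:
            prod ^= g << (i - n)
    return prod

if __name__ == "__main__":
    n, g = 3, 0b1011               # example field: G(beta) = beta^3 + beta + 1
    e, f = 0b110, 0b101            # E = beta^2 + beta, F = beta^2 + 1
    print(bin(gf2n_mul(e, f, g, n)))   # → 0b11, i.e., beta + 1
```

The same routine also yields a squaring (`gf2n_mul(e, e, g, n)`), which is why a unified multiplier-squarer datapath is attractive.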

Contribution
The main contribution of this work is using a systematic methodology to obtain the Dependence Graph (DG) of the modular multiply/square algorithm. A nonlinear scheduling function was applied to the DG, and a nonlinear projection function was used to implement a unified processor core that performs both the modular multiplication and squaring operations. We followed a systematic approach previously reported by the authors of this paper in [30][31][32][33][34][35][36] to extract the processor core of one of the most efficient multiplication and squaring algorithms reported in the literature: the bipartite multiplication and squaring algorithm suggested by Kim in [37]. We can briefly summarize the approach as follows:

1. We developed the bit-level version of the bipartite multiplication and squaring algorithm to obtain its regular iterative form;
2. We obtained the DG of the developed algorithm to help in extracting the unified hardware module;
3. We obtained a nonlinear scheduling function to allocate a time value to each node of the DG;
4. We developed a nonlinear projection function to map the DG nodes to the corresponding Processing Element (PE) in the extracted processor core.
The derived nonlinear scheduling and projection functions provide more flexibility to manage the utilization of the processor besides its execution time.
After applying the adopted approach to the suggested algorithm, we obtained a compact and efficient word-based Serial-In/Serial-Out (SISO) semisystolic processor core. The processor architecture simultaneously computes both the multiplication and squaring operations and accordingly reduces the total execution time. The shared processor hardware resources lead to savings in the hardware resource area and the consumed energy, making it more suitable for implementing cryptographic processors in resource-constrained RFID devices.

Paper Organization
The rest of the article is structured as follows. Section 3 briefly reviews the bipartite multiplication and squaring algorithm over GF(2^n) and its developed bit-level form. Section 4 explains the procedure followed to explore the expected word-based SISO semisystolic processor. Section 5 provides an analysis of the estimated area and time complexities besides the ASIC implementation results. Section 6 draws the summary and conclusion of the work.

Related Work
The binary extension field GF(2^n) is exceptionally efficient in hardware implementation due to the removal of carry arithmetic. One more advantage of GF(2^n) over other fields is the availability of various representations of the field elements, such as the normal, polynomial, and dual bases. There are three main categories of polynomial-basis multipliers in GF(2^n): bit-serial, word-serial, and bit-parallel. Most of the bit-serial and bit-parallel multiplier structures have large hardware resource areas and large time complexities, making them unsuitable for implementation in resource-constrained embedded devices such as RFID devices [38][39][40][41]. Thus, there is a need for multiplier architectures with limited hardware resources and a reasonable execution time suitable for these embedded devices. Word-serial multipliers are potential candidates for solving this problem because they offer a trade-off between the hardware resource area and the time complexity and therefore provide more adaptability to manage the design space and speed. The word-serial multiplier structures can be divided into four categories: Serial-In/Serial-Out (SISO), Serial-In/Parallel-Out (SIPO), Parallel-In/Serial-Out (PISO), and scalable structures. Several polynomial-basis word-serial SISO systolic multiplier structures were published in [42][43][44][45][46]. Other polynomial-basis word-serial SIPO systolic multipliers were exhibited in [47][48][49][50]. Namin et al. [51] developed a type-T Gaussian normal-basis word-serial PISO multiplier structure. Different scalable systolic multiplier structures were published in [52][53][54][55][56][57][58].
Several authors in the literature [40,41,59] have tried to increase the performance and hardware utilization of multiplier structures by merging the multiplication and squaring operations into a combined construction. Even though the presented structures perform both operations concurrently and perform well in high-speed applications, they are mostly unsuitable for embedded devices due to the considerable amount of hardware resources they consume and their significant power dissipation. This is mainly attributed to their parallel systolic/semisystolic structures, which require a large amount of hardware resources that resource-constrained embedded devices lack. Increasing the hardware resources needed to implement these parallel structures increases their extracted parasitic capacitances and hence raises the circuit switching activity, resulting in a considerable amount of consumed power that embedded devices cannot afford.
In this research article, we present an efficient word-based SISO semisystolic processor that performs the bipartite multiplication and squaring algorithm suggested by Kim in [37]. We developed the bit-level version of the suggested algorithm to be able to extract the hardware structure of the semisystolic multiplier and squarer processor. As mentioned in Section 1.2, the processor architecture computes both operations simultaneously and shares the hardware resources, leading to a reduction in the total processing time, hardware resources, and consumed energy. Moreover, the semisystolic array core of the proposed processor has a regular architecture and local interconnections between its constituent processing elements, making it highly practical for VLSI implementation. The implementation results obtained for the recommended field size n = 409 indicate that the proposed multiplier-squarer structure achieves a reduction in the area-time product and consumed energy over the compared competing multiplier structures at specific embedded word sizes, as discussed in the Results Section. The results also confirm that the proposed multiplier-squarer structure is suitable for realizing cryptographic primitives in RFID tags, especially semipassive and active tags, which have modest area and delay limitations but significant restrictions on wasted energy.

Polynomial-Based Bipartite Multiplication-Squaring Algorithm over GF(2^n)
Let G(β) denote the irreducible polynomial of degree n over GF(2) that defines GF(2^n). Furthermore, let E(β) and F(β) denote any two polynomial elements of degree at most n − 1 over GF(2^n). The polynomials G(β), E(β), and F(β) can be respectively expressed as:

G(β) = β^n + Σ_{i=0}^{n−1} g_i β^i, E(β) = Σ_{i=0}^{n−1} e_i β^i, F(β) = Σ_{i=0}^{n−1} f_i β^i,

where g_i, e_i, f_i ∈ GF(2). Since β is a root of G(β), β^n mod G(β) and β^{n+1} mod G(β) can be represented as follows:

β^n mod G(β) = Σ_{i=0}^{n−1} g_i β^i, β^{n+1} mod G(β) = Σ_{i=0}^{n−1} g'_i β^i = G'(β).

Suppose c = ⌊n/2⌋ and d = ⌈n/2⌉, and assume that G'(β) is available in advance. We can express polynomial multiplication and squaring over GF(2^n) as:

Q(β) = E(β)F(β) mod G(β), R(β) = E(β)^2 mod G(β).

Each of Q(β) and R(β) is divided into two parts, accumulated in the variable pairs (X, Y) and (P, S), respectively. Employing the above equations, Kim [37,41] proposed the unified bipartite algorithm, Algorithm 1, to simultaneously compute the products Q(β) and R(β). In Algorithm 1, we denote the partial values of the polynomials E(β), X(β), Y(β), P(β), and S(β) as E_i, X_i, Y_i, P_i, and S_i, where i represents the iteration index. The symbols f_{2i−2}, f_{2i−1}, e_{2i−2}, and e_{2i−1} indicate the (2i−2)th and (2i−1)th coefficients of the input polynomials F(β) and E(β), respectively. In the initialization step of the algorithm, the coefficients of the input polynomial E(β) are assigned to the initial variable E_0, and zero values are assigned to the initial variables X_0, Y_0, P_0, and S_0. The for-loop terminates after d = ⌈n/2⌉ iterations, and the products Q and R are concurrently computed as shown in Steps 8 and 9 of Algorithm 1.
For the hardware implementation of Algorithm 1, we developed the corresponding bit-level form, shown in Algorithm 2. The algorithm has two nested loops to compute Steps 2 to 6 of Algorithm 1: the outer loop with the i index and the inner loop with the j index. Furthermore, there is a for-loop at the end of the algorithm to compute the postprocessing Steps 8 and 9 of Algorithm 1. Here, e_j^i designates the jth bit of E at the ith iteration. Similarly, x_j^i, y_j^i, p_j^i, and s_j^i designate the jth bits of X, Y, P, and S at the ith iteration, respectively. As we notice, the total number of iterations required to execute the algorithm is (d + 1)n. The explored word-based SISO semisystolic processor will significantly reduce the total number of iterations, as will be explained later, to (⌈d/k⌉ + 1)⌈n/k⌉, where k is the processor bus width.

Algorithm 2
Bit-level form of the bipartite multiplication and squaring algorithm.
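As background to why a multiplier and a squarer can share one datapath, the following illustrative Python sketch shows the well-known property that, over GF(2), E(β)^2 = Σ e_i β^{2i}; that is, squaring merely spreads the coefficient bits apart before reduction. This is a standalone software model with names of our choosing; Algorithm 2 interleaves the squaring with the multiplication rather than computing it separately.

```python
def gf2_square_unreduced(e: int, n: int) -> int:
    """Spread the bits of E apart: bit i of E moves to bit 2*i."""
    s = 0
    for i in range(n):
        if (e >> i) & 1:
            s |= 1 << (2 * i)
    return s

def gf2n_square(e: int, g: int, n: int) -> int:
    """Square E(beta) and reduce modulo G(beta) over GF(2)."""
    s = gf2_square_unreduced(e, n)
    for i in range(2 * n - 2, n - 1, -1):   # reduce modulo G(beta), MSB first
        if (s >> i) & 1:
            s ^= g << (i - n)
    return s

if __name__ == "__main__":
    # Example in GF(2^3) with G(beta) = beta^3 + beta + 1:
    print(bin(gf2n_square(0b110, 0b1011, 3)))   # → 0b10, i.e., beta
```

Because the unreduced square needs no partial-product additions at all, merging it with a multiplication costs little extra logic, which is the premise of the unified bipartite algorithm.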

Extraction of the Word-Based SISO Semisystolic Processor
The approach used to explore the intended word-based SISO semisystolic processor starts by extracting the DG of the adopted algorithm. As we notice, Algorithm 2 performs regular iterations with the two indices i and j. Therefore, it can be expressed in the two-dimensional (2D) domain, as shown in Figure 1. The DG consists of an array of node operations. Each node can be located using the corresponding row index i and column index j. The DG has two types of nodes: the upper light orange nodes and the lower light blue nodes. The light orange nodes, arranged in the upper d rows, perform the operations of Steps 4 to 8 of Algorithm 2, while the light blue nodes perform the operations of Steps 13 and 14 of the same algorithm.
The nodes in the first row of the DG receive the initial bits e_j^0, x_j^0, y_j^0, p_j^0, s_j^0, g_j, and g'_j of the variables E, X, Y, P, S, G, and G', respectively. In the upper d rows of the DG, the updated partial bits e_j^i, x_j^i, y_j^i, p_j^i, and s_j^i, besides the bits g_j and g'_j, are designated by the vertical lines. The updated partial bits e_j^i are also passed diagonally, as designated by the slanted red lines. The input bits e_{2i−2}, e_{2i−1}, f_{2i−2}, and f_{2i−1}, as well as the updated partial bits e_{n−2}^{i−1}, e_{n−1}^{i−1}, and e_{j−2}^{i−1}, are broadcast horizontally to all nodes in the same row of the upper d rows of the DG. The resulting bits x_j^d and p_j^d from the dth row, as well as the broadcast bits g_j, are passed vertically to the nodes in the last row of the DG. Furthermore, the resulting bits y_j^c and s_j^c are passed diagonally to the same nodes, as shown in Figure 1. These bits are used to compute the final bits of the field multiplication q_j and the field squaring r_j, as indicated in Figure 1.

The second step in the utilized approach, as discussed in [30], is to find a valid scheduling function to allocate a time value to each node of the DG. The last step of the procedure is to find a projection function to map the DG nodes to a corresponding Processing Element (PE) in the extracted systolic/semisystolic array. There are two types of scheduling and projection functions, as explained in [30]: affine (linear) functions and nonlinear functions. The affine scheduling and projection functions have the following limitations:

1. The inputs and outputs can only be accessed bit-serially (one bit at a time) or bit-parallel (all bits at the same time instance), based on our choice of the scheduling function. This means that the linear functions cannot satisfy any restrictions on the processor bus size;
2. The designer cannot manage the number of accessed input or output samples at a particular time step;
3. The number of active PEs cannot be managed at a specific time step;
4. The designer cannot manage the PE's workload.
The nonlinear scheduling and projection functions avoid all of these limitations and provide the designer with more flexibility to control the number of accessed inputs and outputs and the number of utilized PEs, as well as the PE's workload. In our case, we require that the unified SISO multiplier-squarer feed in the variables E, F, G, and G' in a word-serial fashion at the start of the iteration and produce the output products Q and R in a word-serial fashion at the end of the iteration. Assuming the processor word size is k, we need to feed in k bits of the input variables at the first clock cycle and produce k bits of the output products at the last clock cycle. The following subsections display the selected nonlinear scheduling and projection functions that resulted in the exploration of the intended word-based SISO semisystolic processor.
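The word-serial I/O convention assumed here can be sketched as follows (illustrative Python; the helper name and the zero-padding choice are ours): an n-bit operand is zero-padded to W = ⌈n/k⌉ words of k bits and fed one word per clock cycle, most significant word first, as the processor description later requires.

```python
from math import ceil

def to_words_msw_first(x: int, n: int, k: int) -> list[int]:
    """Split an n-bit operand into ceil(n/k) k-bit words, MSW first."""
    w = ceil(n / k)
    padded_bits = w * k          # operand zero-padded at the MSB side
    return [(x >> (padded_bits - (i + 1) * k)) & ((1 << k) - 1)
            for i in range(w)]

if __name__ == "__main__":
    # The paper's running example: n = 7, k = 3, so W = 3 words.
    words = to_words_msw_first(0b1011001, n=7, k=3)
    print([bin(w) for w in words])   # → ['0b1', '0b11', '0b1']
```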

Word-Based SISO Scheduling Function
Based on the scheduling methodology discussed in [30], we can partition the DG into k × k equitemporal zones using a nonlinear scheduling function t(p), where t(p) represents the time value allocated to each node p of the DG. By applying the chosen scheduling function to the DG, we obtain the node timing (scheduling time) allocated to each zone, as indicated in Figure 2. This figure is developed for the case when n = 7 and k = 3. The green zones have k × k nodes that are executed at the time index (blue numerals) allocated to each zone. The yellow zones have 1 × k postprocessing nodes that are also executed at the time index assigned to each of the yellow zones.

For the upper d rows of the DG, it is important to make the numbers of rows and columns integer multiples of k. Therefore, ω more rows and ζ more columns should be added to the DG. In our case, when n = 7 and k = 3, both ζ and ω equal two, and hence, we should add two more rows and two more columns to the DG, as indicated in Figure 2. Consequently, the least significant two bits of the input variables E, G, and G', as well as the initial values of the intermediate variables X, Y, P, and S, should be assigned zero values, as shown at the rightmost edge of the DG in Figure 2. Furthermore, the most significant two bits of the input variables E and F should be assigned zero values at the two rows before the last row of the DG, as shown in Figure 2. It is worth noticing that the e_{n−1}^{i−1} and e_{n−2}^{i−1} signals should be updated at the nodes of the leftmost column, as displayed in Figure 2, and then broadcast horizontally, together with the signals e_{2i−2}, e_{2i−1}, f_{2i−2}, and f_{2i−1}, to the nodes of row i.
The chosen scheduling function reduces the total number of iterations required to execute the adopted bipartite multiplication/squaring algorithm to (⌈d/k⌉ + 1)⌈n/k⌉ instead of (d + 1)n and hence significantly reduces its time complexity.
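The claimed reduction can be checked numerically (illustrative Python, assuming d = ⌈n/2⌉ as defined in Section 3) for the recommended field size n = 409 and the word sizes used later in the Results Section:

```python
from math import ceil

def bit_level_iters(n: int) -> int:
    """Iteration count (d + 1) * n of the bit-level algorithm, d = ceil(n/2)."""
    return (ceil(n / 2) + 1) * n

def word_level_iters(n: int, k: int) -> int:
    """Iteration count (ceil(d/k) + 1) * ceil(n/k) of the word-based schedule."""
    d = ceil(n / 2)
    return (ceil(d / k) + 1) * ceil(n / k)

if __name__ == "__main__":
    n = 409
    print(bit_level_iters(n))                # → 84254
    for k in (8, 16, 32):
        print(k, word_level_iters(n, k))     # → 1404, 364, 104 clock cycles
```

For k = 32, the word-based schedule thus needs only 104 clock cycles against 84,254 bit-level iterations, which is the source of the latency advantage discussed in Section 5.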

Word-Based SISO Projection Function
By applying the nonlinear projection approach discussed in [30] to the DG nodes shown in Figure 2, we can develop a nonlinear projection function that maps any node p(i, j) of Figure 2 to a processing element PE(o, u) in the resulting systolic/semisystolic array space. The selected nonlinear projection function maps all the nodes of each green zone to a two-dimensional k × k semisystolic array. The semisystolic array is composed of k rows and k columns, as shown in Figure 3. Furthermore, the selected nonlinear projection function maps all the nodes of the yellow zones (the postprocessing nodes) to a one-dimensional 1 × k postprocessing array, shown in Figure 4.

Figure 5 displays the block architecture of the word-based SISO semisystolic processor. It is composed of the semisystolic array block, the postprocessing array block, I/O registers, FIFO buffers, and two k-bit two-input MUXes, as well as a (k + 2)-bit two-input MUX. The k-bit two-input MUXes select between the input words of G and G' and their word values stored in FIFO-G and FIFO-G'. On the other hand, the (k + 2)-bit two-input MUX selects between the input words of E and its updated intermediate partial words stored in FIFO-E. The FIFO buffers FIFO-X, FIFO-Y, FIFO-P, FIFO-S, FIFO-G, FIFO-G', and FIFO-E sequentially feed back the resulting output words X, Y, P, S, G, G', and E of the semisystolic array block to the corresponding input words of the same array block. FIFO-X, FIFO-Y, FIFO-P, FIFO-S, FIFO-G, and FIFO-G' have a width of k bits and a depth of W − 1, where W = ⌈n/k⌉. FIFO-E has a width of k + 2 bits and the same depth as the previous FIFOs. It is worth noticing that the upper registers of G, G', and E feed their word values to the corresponding inputs of the semisystolic array block starting from the most significant words.
On the other hand, the registers of E_{2i−2}, E_{2i−1}, F_{2i−2}, and F_{2i−1} at the left of the semisystolic array block supply their word values starting from the least significant words. Observing the semisystolic array of Figure 3, we notice that it consists of two types of PEs: dark orange and light orange PEs. The logic details of the dark and light orange PEs are displayed in Figures 6 and 7, respectively. The two PEs have almost the same logic structure, except that the dark orange PE has an extra tristate buffer that is enabled (u = 0) to pass the updated words of E_{n−1}^{i−1} and E_{n−2}^{i−1} at the proper time instances t = (i − 1)⌈n/k⌉ + 1, 1 ≤ i < ⌈n/k⌉. Besides the input words E_{2i−2}, E_{2i−1}, F_{2i−2}, and F_{2i−1}, these words are horizontally transferred to the remaining PEs in the semisystolic array to update the intermediate partial words of X, Y, P, S, and E.
Similarly, observing the postprocessing array of Figure 4, we notice that it consists of two types of PEs: dark blue and light blue PEs. The logic details of the dark and light blue PEs are displayed in Figures 8 and 9. The two PEs have almost the same logic structure, except that the dark blue PE has an extra tristate buffer that is enabled (u = 0) to pass the updated signals s_{n−1}^c and y_{n−1}^c at the proper time t = ⌈d/k⌉⌈n/k⌉ + 1 to the remaining light blue PEs of the postprocessing array. These signals are used to calculate the words of the final products Q and R.
The partial signals x_j^d, y_{j−1}^c, p_j^d, and s_{j−1}^c resulting from the semisystolic array block are provided at the proper time to the corresponding inputs of the postprocessing array using the tristate buffers T_x, T_y, T_p, and T_s, as shown in Figures 8 and 9. For an odd field size n, the value of c is smaller than the value of d (c < d), and in this case, the partial signals y_{j−1}^c and s_{j−1}^c should be provided one clock cycle earlier than the signals x_j^d and p_j^d. The following summarizes the operation of the developed word-based SISO semisystolic array processor for given parameters n and k:

1. Through clock periods 1 < t ≤ ⌈n/k⌉, the control signals of the MUXes M_e, M_g, and M_g', presented in Figure 5, are set to one to serially transfer the words of the input operands E, G, and G' to the corresponding inputs of the semisystolic array block (one word at each clock period), starting from the most significant words. Through the first clock period, the FIFO buffers of X, Y, P, and S are cleared to maintain the zero initial values denoted in Algorithm 1. As we notice from Figure 5, the FIFOs of X, Y, P, and S have a width of k bits and a depth of W − 1, where W = ⌈n/k⌉. The depth of these buffers guarantees that the variables X, Y, P, and S hold zero values through the intended clock periods 1 < t ≤ ⌈n/k⌉;
2. At clock periods t = z⌈n/k⌉ + 1, 0 ≤ z < ⌈n/k⌉, the control signal u in all dark orange PEs of the semisystolic array shown in Figure 3 should be assigned a zero value (u = 0) to enable the tristate buffers designated in Figure 6 to serially transfer the updated words of E_{n−1}^{i−1} and E_{n−2}^{i−1} to the remaining light orange PEs in the array (one word at each clock period). Furthermore, through these clock periods, the words of E_{2i−2}, E_{2i−1}, F_{2i−2}, and F_{2i−1} should be serially transferred to all the PEs in the semisystolic array through the registers allocated at the left side of Figure 5;
3. At clock periods t = z⌈n/k⌉, 1 ≤ z < ⌈n/k⌉, the control signal v, shown in Figure 3, should be forced to a zero value (v = 0) to force the least significant ζ bits of E, through the AND gates, to zero values, as indicated at the rightmost edge of the DG shown in Figure 2. Notice that in our case example of n = 7 and k = 3, ζ equals two;
4. At clock periods t > ⌈n/k⌉, the control signals of the MUXes M_e, M_g, and M_g' are assigned zero values to transfer the following words to the corresponding inputs of the semisystolic array block: the updated E words saved in FIFO-E, the G' words saved in FIFO-G', and the G words saved in FIFO-G. All of these words are transferred to the semisystolic array in a word-serial fashion (i.e., one word at each clock period). Through the same clock periods, the updated words of X, Y, P, and S, saved in FIFO-X, FIFO-Y, FIFO-P, and FIFO-S, as well as the words of E_{n−1}^{i−1}, E_{n−2}^{i−1}, E_{2i−2}, E_{2i−1}, F_{2i−2}, and F_{2i−1}, are serially passed (one word at each clock period) to the corresponding inputs of the semisystolic array block to serially compute (one word at each clock period) the intermediate partial results of X, Y, P, S, and E;
5. At clock period t = ⌈d/k⌉⌈n/k⌉ + 1, the control signal u in the dark blue PE of the postprocessing array, Figure 4, should be assigned a zero value (u = 0) to enable the tristate buffers designated in Figure 8 to serially transfer the updated bits of s_{n−1}^c and y_{n−1}^c to the remaining light blue nodes of the postprocessing array. The bits of s_{n−1}^c and y_{n−1}^c, alongside the bits g_j, x_j^d, y_{j−1}^c, p_j^d, and s_{j−1}^c, are used to determine the final words of the products Q and R;
6. At clock period t = (⌈d/k⌉ + 1)⌈n/k⌉, the control signal v, indicated at the right side of the postprocessing array designated in Figure 4, should be forced to a zero value (v = 0) to force the least significant bits of y^d and s^d, through the AND gates, to zero values, as shown at the rightmost edge of the DG in Figure 2;
7. Through clock periods t ≥ ⌈d/k⌉⌈n/k⌉ + 1, the final output words of Q and R produced by the postprocessing array are loaded serially (one word at each clock period) into the registers Q and R, as presented in Figure 5.
It is worth noticing that the updated words of E are produced by the dark orange PEs of the semisystolic array, shown in Figure 3, starting from the second clock period. Therefore, the words of X, Y, P, S, E_{n−1}^{i−1}, E_{n−2}^{i−1}, E_{2i−2}, E_{2i−1}, F_{2i−2}, and F_{2i−1} should be delayed by one clock period to synchronize the operation. This is implemented by adding delay elements (represented by the D squares) to the semisystolic array, as shown in Figure 3.
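For the running example of n = 7 and k = 3, the control-signal timing described in the steps above can be enumerated as follows (illustrative Python; the function and variable names are ours, the clock-period formulas are taken from the text, and d = ⌈n/2⌉ is assumed as in Section 3):

```python
from math import ceil

def control_schedule(n: int, k: int):
    """Enumerate the clock periods at which u and v are pulled low."""
    W = ceil(n / k)                         # words per operand
    d = ceil(n / 2)                         # loop bound of Algorithm 1
    u_low = [z * W + 1 for z in range(W)]   # step 2: u = 0 in dark orange PEs
    v_low = [z * W for z in range(1, W)]    # step 3: v = 0 zeroes the low bits of E
    post_u = ceil(d / k) * W + 1            # step 5: u = 0 in the dark blue PE
    last = (ceil(d / k) + 1) * W            # step 6: final clock period
    return u_low, v_low, post_u, last

if __name__ == "__main__":
    print(control_schedule(7, 3))   # → ([1, 4, 7], [3, 6], 7, 9)
```

The final clock period (⌈d/k⌉ + 1)⌈n/k⌉ = 9 agrees with the latency formula derived from the scheduling function.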

Experimental Results and Discussion
In this section, we provide a qualitative and quantitative analysis of the proposed word-based SISO semisystolic multiplier-squarer structure and the efficient word-based serial multiplier structures available in the literature [45,50,60,61]. The qualitative analysis concentrates on developing analytical formulas for both the area and time complexities of the compared multiplier structures. The quantitative analysis, on the other hand, focuses on providing the ASIC implementation results for all the compared designs to confirm the qualitative findings. Table 1 provides the analytical formulas of both the hardware resource area and the execution time complexity for the proposed structure, as well as the compared ones [45,50,60,61]. The hardware resource area was evaluated in terms of the counts of the logic gates/components (tristate buffers, 2-input AND gates, 2-input XOR gates, 1-bit 2-input multiplexers, and flip-flops). The execution time complexity was evaluated in terms of the latency and the Critical Path Delay (CPD). We notice from Table 1 that the gate/component count formulas of the multiplier structures of Pan [45] and Xie [50] are almost of order O(nk), while those of the other multiplier structures, except for some gates/components of the proposed multiplier, are of order O(k^2). Therefore, the multiplier structures of Pan [45] and Xie [50] should have a higher area complexity than the other compared designs, including the proposed one. By further investigating the gate count formulas of the proposed design and the designs of Hua [60] and Chen [61], we notice that the proposed design has a higher number of tristate buffers, AND gates, XOR gates, and MUXes. On the other hand, it has a lower number of flip-flops. The flip-flop formula of the proposed design is of order O(k⌈n/k⌉), while it is of order O(k^2) in the case of the designs of Hua [60] and Chen [61].
Therefore, for large values of k, the number of flip-flops of the proposed design decreases significantly compared to the other designs. Based on standard CMOS library data, the area of the flip-flops, for most layout styles, is significantly higher than the area of the other basic gates. Thus, the significant reduction in the number of flip-flops leads to a considerable decrease in the design's overall area. The value of k at which the proposed design's area outperforms the other designs' areas mainly depends on the field size n. For the recommended field size of n = 409, the word size k = 32 leads to a significant decrease in the number of flip-flops of the proposed design, which compensates for the increase in the other gate counts and, beyond that, reduces the total area of the proposed design below that of the compared designs, as confirmed by the implementation results in Table 2. Moreover, the proposed design performs both the multiplication and squaring operations concurrently, while the other designs perform both operations in sequence; hence, reducing the area of the proposed design, for large word sizes, below that of the other designs is a notable achievement.

Notes to Table 1: (1) the area of the 3-input XOR gate is counted as 1.5× that of a 2-input XOR gate; (2) the multiplier of [61] uses switches that have the same transistor count as the 2-input MUX.

By observing the latency and CPD formulas in Table 1, we notice that the design of Pan [45] has the lowest latency and the design of Hua [60] has the largest latency compared to the remaining designs, including the proposed one.
The quantitative analysis presented in the following subsection shows that the developed multiplier-squarer structure has a lower latency than the multiplier structures of Hua [60] and Chen [61] and a higher latency than the multiplier structures of Xie [50] and Pan [45] for the field size n = 409. We also notice from Table 1 that, as the word size k increases, the latency of all designs significantly decreases, since the latency is inversely proportional to k.

Analysis of the Estimated Area and Time Complexities
By examining the formulas of the CPD, we notice that the multiplier structures of Xie [50], Hua [60], and Chen [61] have a constant and lower CPD for all values of word size k. In contrast, the CPD values of Pan [45] and the proposed design mainly increase as k increases. Since the latency of all designs significantly reduces as k increases and the CPD has constant values or increases as k increases, we cannot precisely conclude from the qualitative analysis which design should have the best execution time. The quantitative analysis provided in the following subsection will show which multiplier structure provides better execution time complexity.

ASIC Implementation Results
ASIC designs of the published works of Xie [50], Pan [45], Hua [60], and Chen [61], as well as our proposed design, were implemented using VHDL for field size n = 409 and embedded-application word sizes k of 8, 16, and 32. The developed code was synthesized using Synopsys Design Compiler 2005.09-SP2 with the NanGate (15 nm, 0.8 V) Open Cell Library. The simulations used Mentor Graphics ModelSim SE 6.0a tools and produced a Switching Activity Interchange Format (SAIF) file to obtain the power report. The simulations went through 300 possible 32-bit input combinations. Post-layout simulations were used to accurately model the delays, power, and area. Table 2 summarizes the ASIC implementation results for the proposed design and the previously published designs. The design performance metrics were: (1) latency: the total number of time steps required to produce the final results; (2) the CPD; (3) the area (A); (4) the total execution time (T); (5) the consumed power (P); and (6) the consumed energy (E). It is important to note that the published word-serial multipliers of [45,50,60,61] only perform the multiplication operation, while the proposed word-serial multiplier-squarer concurrently performs both multiplication and squaring operations. Therefore, to have a fair comparison, the word-serial multipliers of [45,50,60,61] should sequentially compute both operations. Executing the multiplication and squaring operations in sequence doubles their latency, T, P, and E.
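Simulation outputs of such a design are typically checked against a bit-level software golden model of the two field operations. The following is a minimal sketch; the reduction polynomial f(x) = x^409 + x^87 + 1 (the NIST B-409/K-409 trinomial) is an assumption of the sketch, as it is not specified here, and the model covers the field arithmetic only, not the word-serial schedule of the processor.

```python
# Bit-level golden model for GF(2^409) multiplication and squaring.
# Assumption (not from the article): reduction by f(x) = x^409 + x^87 + 1.
# Operands are Python ints whose bit i is the coefficient of x^i.

N = 409
F = (1 << 409) | (1 << 87) | 1  # f(x) = x^409 + x^87 + 1

def gf_mul(a: int, b: int) -> int:
    """Carry-less (polynomial) multiply of a and b, reduced modulo f(x)."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a          # add (XOR) the current shifted copy of a
        b >>= 1
        a <<= 1
        if a >> N:            # degree of a reached n: subtract (XOR) f(x)
            a ^= F
    return acc

def gf_sqr(a: int) -> int:
    """Squaring as the special case b = a, mirroring the shared hardware core."""
    return gf_mul(a, a)

# Frobenius sanity check: squaring is GF(2)-linear, (a + b)^2 = a^2 + b^2
a, b = 0b11, 0b101
assert gf_sqr(a) ^ gf_sqr(b) == gf_sqr(a ^ b)
```

Feeding the same 32-bit input combinations used in the ModelSim runs through such a model gives the expected outputs against which the post-layout waveforms can be diffed.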
Examining Table 2, we note the following: (1) The multiplier structure of Pan [45] achieved the lowest latency for all word sizes compared to the other structures, including the proposed one. We also observed that the latency significantly decreased, for all multiplier structures, as the word size k increased; (2) The multiplier structure of Chen [61] had the lowest CPD for all values of k.
For a better interpretation of the achieved results, we visualized the area, time, Area-Time product (AT), power, and energy results using the charts displayed in Figures 10-14, respectively. From Figure 10, we notice that the multiplier structure of Hua [60] achieved a lower area (A) at word sizes k = 8 and k = 16, by at least 21.3% and 23.0%, respectively. In contrast, at the word size of k = 32, the proposed multiplier-squarer structure achieved the smallest area, by at least 22.95%. The reduction of the proposed structure's area was mainly attributed to the significant decrease in the number of flip-flops as the word size increased, as discussed in the previous subsection.
From Figure 11, we notice that the multiplier structure of Pan [45] achieved a significant reduction in time (T) at word sizes k = 8 and k = 16 due to its significantly lower latency. On the other hand, the multiplier structure of Xie [50] realized a slight reduction in T, at word size k = 32, compared to that of Pan [45]. Due to the fluctuations in the obtained latency, A, and T results among the different multiplier structures at different word sizes, we added the Area-Time product (AT) metric to Table 2. The AT metric enabled us to identify the multiplier structures that are optimal in terms of both area and time.
From Figure 12, we notice that the proposed multiplier-squarer structure achieved the lowest AT for word sizes k = 16 and k = 32. This was attributed to the modest values of its latency and area at these word sizes. The proposed multiplier-squarer achieved a reduction in AT at word sizes k = 16 and k = 32 by at least 32.3% and 70.4%, respectively.
From Figure 13, we notice that the multiplier structure of Hua [60] had lower power consumption (P) at word sizes k = 8 and k = 16, due mainly to its significant reduction in area. The decrease in area reduced the extracted parasitic capacitance and, hence, the circuit's switching activity. The proposed multiplier-squarer structure had slightly higher power consumption than the multiplier structure of Hua [60] at the same word sizes of k = 8 and k = 16. In contrast, it had slightly lower power consumption than the multiplier of Hua [60] at word size k = 32 due to its smaller area at this word size.
Despite having lower power consumption at word sizes k = 8 and k = 16, the multiplier of Hua [60] had large values of consumed energy, as shown in Figure 14, due to its very long execution time (T). The proposed multiplier-squarer had the lowest consumed energy, as shown in Figure 14, at all word sizes, due to the reasonable values of its execution time (T) as well as the lower values of its consumed power. The proposed multiplier-squarer reduced the consumed energy at word sizes k = 8, k = 16, and k = 32 by at least 70%, 72.4%, and 86.2%, respectively.
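This power-versus-energy trade-off can be made concrete with a small numerical sketch. The figures below are hypothetical, chosen only to illustrate why E = P × T, rather than P alone, is the decisive metric: a low-power but slow design can still consume more energy than a higher-power but fast one.

```python
# Hypothetical design points (not Table 2 data): a low-power but slow design
# versus a higher-power but fast one. Units: power in mW, time in ns, so the
# product P * T is directly in pJ (1 mW * 1 ns = 1 pJ).
designs = {
    "low-power, slow":    {"P_mW": 0.20, "T_ns": 800.0},
    "higher-power, fast": {"P_mW": 0.45, "T_ns": 250.0},
}

for name, d in designs.items():
    e_pJ = d["P_mW"] * d["T_ns"]  # consumed energy E = P * T
    print(f"{name}: E = {e_pJ:.1f} pJ")
# The slow design consumes 160.0 pJ versus 112.5 pJ, despite its lower power.
```

The same effect explains why the multiplier of Hua [60], though frugal in power at k = 8 and k = 16, still loses on energy once its long execution time is factored in.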
From the previous discussion, we can conclude that the presented word-serial multiplier-squarer had a smaller AT for word sizes k = 16 and k = 32 and consumed less energy for all word sizes. Thus, it is more suitable for realizing cryptographic primitives in RFID tags, especially semipassive and active tags, which have moderate restrictions on area and delay but tight restrictions on consumed energy. This is because battery-powered RFID tags (semipassive and active tags) have a limited energy budget, in contrast to energy-harvesting RFID tags (passive tags), which can harvest energy continuously but under a restricted power budget.

Conclusions and Future Work
This paper introduced an efficient word-based SISO semisystolic processor that simultaneously computes the multiplication and squaring operations over GF(2 n ). Both operations share the same hardware core, resulting in significant savings in hardware area and consumed energy, especially for large word sizes. We used a systematic approach to find valid nonlinear scheduling and projection functions that can be applied to the DG to produce the desired word-based SISO semisystolic processor. The chosen functions provide more flexibility to adjust the utilization of the processor as well as its execution time. The paper also presented a qualitative and quantitative analysis of the proposed multiplier-squarer structure and the efficient multiplier structures in the literature. The acquired results for the recommended field size n = 409 showed that the proposed multiplier-squarer structure had the lowest Area-Time product (AT) for word sizes k = 16 and k = 32 and consumed less energy for all embedded word sizes. Therefore, it is more suitable for realizing cryptographic primitives in RFID-based applications with moderate restrictions on area and delay and significant restrictions on consumed energy. In future work, we will improve the hardware design to achieve lower area and delay complexities, making it suitable for applications with tight restrictions on these design parameters.